Opened 2 months ago

Closed 2 months ago

#32288 closed Cleanup/optimization (duplicate)

ManyToManyField creates a (unneeded) duplicate index

Reported by: Álvaro Justen Owned by: nobody
Component: Database layer (models, ORM) Version: 3.1
Severity: Normal Keywords: orm, many-to-many, database, indexes, optimization
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I have many projects running with Django + PostgreSQL and use pghero in most of then, but there are always alerts showing some django.contrib.auth models have duplicate indexes, like these:

  • On auth_group_permissions: auth_group_permissions_group_id_b120cbf9 (group_id) is covered by auth_group_permissions_group_id_permission_id_0cd325b0_uniq (group_id, permission_id)
  • On auth_permission: auth_permission_content_type_id_2f476e4b (content_type_id) is covered by auth_permission_content_type_id_codename_01ab375a_uniq (content_type_id, codename)
  • On auth_user_groups: auth_user_groups_user_id_6a12ed8b (user_id) is covered by auth_user_groups_user_id_group_id_94350c0c_uniq (user_id, group_id)
  • On auth_user_user_permissions: auth_user_user_permissions_user_id_a95ead1b (user_id) is covered by auth_user_user_permissions_user_id_permission_id_14a6b632_uniq (user_id, permission_id)

These indexes summed together won't waste more than 64KB, but the problem is actually not on auth models: any app which uses ManyToManyField will have at least one index duplicated. If you have a huge m2m table, than this unnecessary index will use a lot of space and will never be used and could end up slowing down your writes to the table (at least for PostgreSQL)!

I did a little test setup and for a M2M intermediary table containing 2M rows, the unneeded index will have ~42MB of wasted storage space. Little CRUD projects won't be affected by this, but projects having big databases will suffer.

Note: the problem of duplicate auth models was reported by a user in [this form.djangoproject.com post](https://forum.djangoproject.com/t/pghero-duplicate-indexes/5214/2) and then I complemented it with all the information below. I didn't know if the forum was the right place to suggest an improvement, so I'm creating this ticket.

Understanding the Problem

The problem source is on the way ManyToManyField creates the new model to store from/to IDs:

  • It adds two ForeignKeys (one to “from” field and another to “to” field), each one having its own index (it’s automatic for ForeignKey)
  • It adds Meta.unique_together = (from_, to), which creates a compound index using these two fields

If you’re querying for a specific "from" id inside the M2M intermediary table, the compound index could be used instead of the FK index on the "from" field, so the index created by "from"'s ForeignKey could be removed (the compound index can be used in this case because the "from" field is the first one in the index).

Reproducing the Test Setup

Let’s write a simple app to get some metrics:

$ django-admin startapp testapp

Create simple models with M2M:

# testapp/models.py
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)


class Book(models.Model):
    title = models.CharField(max_length=256)
    author = models.ManyToManyField("Author")

After running python manage.py makemigrations and python manage.py migrate, we can inspect the created M2M table in PostgreSQL:

mydb=# \d testapp_book_author
                              Table "public.testapp_book_author"
  Column   |  Type   | Collation | Nullable |                     Default                     
-----------+---------+-----------+----------+-------------------------------------------------
 id        | integer |           | not null | nextval('testapp_book_author_id_seq'::regclass)
 book_id   | integer |           | not null | 
 author_id | integer |           | not null | 
Indexes:
    "testapp_book_author_pkey" PRIMARY KEY, btree (id)
    "testapp_book_author_author_id_657c9727" btree (author_id)
    "testapp_book_author_book_id_author_id_ae92da66_uniq" UNIQUE CONSTRAINT, btree (book_id, author_id)
    "testapp_book_author_book_id_e2a2a45a" btree (book_id)
Foreign-key constraints:
    "testapp_book_author_author_id_657c9727_fk_testapp_author_id" FOREIGN KEY (author_id) REFERENCES testapp_author(id) DEFERRABLE INITIALLY DEFERRED
    "testapp_book_author_book_id_e2a2a45a_fk_testapp_book_id" FOREIGN KEY (book_id) REFERENCES testapp_book(id) DEFERRABLE INITIALLY DEFERRED

Since testapp_book_author_book_id_author_id_ae92da66_uniq is a compound index and have book_id as its first field, then the index testapp_book_author_book_id_e2a2a45a is completely useless. Let’s check this affirmation by using EXPLAIN:

If we query for both author_id and book_id, the compound index is used:

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1 AND author_id = 1;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using testapp_book_author_book_id_author_id_ae92da66_uniq on testapp_book_author  (cost=0.15..8.17 rows=1 width=12)
   Index Cond: ((book_id = 1) AND (author_id = 1))

If we query for author_id, the FK index on author_id is used:

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE author_id = 1;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (author_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_author_id_657c9727  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (author_id = 1)
(4 rows)

If we query for book_id, the FK index on book_id is used:

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1;
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (book_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_book_id_e2a2a45a  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (book_id = 1)

But if we disable the FK index on book_id and query for book_id, the compound index will be used with the same cost, which proves this index is unnecessary:

mydb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'testapp_book_author_book_id_e2a2a45a'::regclass;
UPDATE 1

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (book_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_book_id_author_id_ae92da66_uniq  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (book_id = 1)

Wasted Database Space

Database space is gold. Let’s check how this behavior affects the whole database size. First, let’s re-enable the index:

mydb=# UPDATE pg_index SET indisvalid = TRUE WHERE indexrelid = 'testapp_book_author_book_id_e2a2a45a'::regclass;
UPDATE 1

And populate the database with some thousands of rows:

import math


total_authors = 100_000
books_per_pair = 10
bulk_size = 100

total_iters = math.ceil(total_authors / bulk_size)
for number in range(1, total_iters + 1):
    authors = Author.objects.bulk_create([
        Author(first_name=f"John #{number * (bulk_size - 1) + n}", last_name="Doe")
        for n in range(bulk_size)
    ])
    for author_1, author_2 in zip(authors, authors[1:]):
        # Create new books
        books = Book.objects.bulk_create([
            Book(title=f"Book #{n} by #{author_1.id} + #{author_2.id}")
            for n in range(books_per_pair)
        ])
        # Assign these books to this pair of authors
        Book.author.through.objects.bulk_create(
            [
                Book.author.through(book_id=book.id, author_id=author_1.id)
                for book in books
            ]
            +
            [
                Book.author.through(book_id=book.id, author_id=author_2.id)
                for book in books
            ]
        )
    print(f"\r{number}/{total_iters}", flush=True, end="")
print()

(this will take some minutes to run)

I needed to stop the inserting here when it was finishing, but it inserted more than 100k authors, almost 1M books and almost 2M M2M rows:

In [1]: Author.objects.count()
Out[1]: 100100

In [2]: Book.objects.count()
Out[2]: 990200

In [3]: Book.author.through.objects.count()
Out[3]: 1980400

Now, we can finally check the total sizes of testapp’s indexes:

mydb=# SELECT indexrelid::regclass AS index, pg_size_pretty(pg_relation_size(indexrelid)) AS size FROM pg_index WHERE indexrelid::regclass::text LIKE 'testapp_%';
                        index                        |  size   
-----------------------------------------------------+---------
 testapp_author_pkey                                 | 2208 kB
 testapp_book_pkey                                   | 21 MB
 testapp_book_author_pkey                            | 42 MB
 testapp_book_author_book_id_author_id_ae92da66_uniq | 42 MB
 testapp_book_author_author_id_657c9727              | 42 MB
 testapp_book_author_book_id_e2a2a45a                | 42 MB

For 2M relationships, we have 42MB of wasted space in testapp_book_author_book_id_e2a2a45a.

I’ve been working on projects including tables with 100x this size in rows. The cost per database-GB in Heroku PostgreSQL, for instance, goes from 1.28 USD to 2.34 USD, depending on the plan. With an average cost of 1.81 USD per GB, for M2M with 100M relationships we have 3.80 USD wasted per month.

With an optimization (just removing the FK index on "from" field), not only the cost will be lower, but also the insert time will be way faster (I have some tasks bulk inserting millions of rows per day).

Fixing the problem

I think the optimization for this case is not hard to implement and I’m willing to send a patch/create a pull request if you think I should proceed.

Change History (1)

comment:1 Changed 2 months ago by Mariusz Felisiak

Easy pickings: unset
Resolution: duplicate
Status: newclosed

Duplicate of #22125.

Note: See TracTickets for help on using tickets.
Back to Top