﻿id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
32288	ManyToManyField creates a (unneeded) duplicate index	Álvaro Justen	nobody	"I have many projects running with Django + PostgreSQL and use [https://github.com/ankane/pghero pghero] in most of then, but there are always alerts showing some `django.contrib.auth` models have duplicate indexes, like these:

- On `auth_group_permissions`: `auth_group_permissions_group_id_b120cbf9` (`group_id`) is covered by `auth_group_permissions_group_id_permission_id_0cd325b0_uniq` (`group_id`, `permission_id`)
- On `auth_permission`: `auth_permission_content_type_id_2f476e4b` (`content_type_id`) is covered by `auth_permission_content_type_id_codename_01ab375a_uniq` (`content_type_id`, `codename`)
- On auth_user_groups: `auth_user_groups_user_id_6a12ed8b` (`user_id`) is covered by `auth_user_groups_user_id_group_id_94350c0c_uniq` (`user_id`, `group_id`)
- On auth_user_user_permissions: `auth_user_user_permissions_user_id_a95ead1b` (`user_id`) is covered by `auth_user_user_permissions_user_id_permission_id_14a6b632_uniq` (`user_id`, `permission_id`)

These indexes summed together won't waste more than 64KB, but the problem is actually not on auth models: any app which uses `ManyToManyField` will have at least one index duplicated. If you have a huge m2m table, than this unnecessary index will use a lot of space and will never be used and could end up slowing down your writes to the table (at least for PostgreSQL)!

I did a little test setup and for a M2M intermediary table containing 2M rows, the unneeded index will have ~42MB of wasted storage space. Little CRUD projects won't be affected by this, but projects having big databases will suffer.

**Note**: the problem of duplicate auth models was reported by a user in [this form.djangoproject.com post](https://forum.djangoproject.com/t/pghero-duplicate-indexes/5214/2) and then I complemented it with all the information below. I didn't know if the forum was the right place to suggest an improvement, so I'm creating this ticket.

**Understanding the Problem**

The problem source is on the way `ManyToManyField` creates the new model to store from/to IDs:

- It adds two `ForeignKey`s (one to “from” field and another to “to” field), each one having its own index (it’s automatic for `ForeignKey`)
- It adds `Meta.unique_together = (from_, to)`, which creates a compound index using these two fields

If you’re querying for a specific ""from"" id inside the M2M intermediary table, the compound index could be used instead of the FK index on the ""from"" field, so the index created by ""from""'s `ForeignKey` could be removed (the compound index can be used in this case because the ""from"" field is the first one in the index).


**Reproducing the Test Setup**

Let’s write a simple app to get some metrics:

{{{
$ django-admin startapp testapp
}}}

Create simple models with M2M:

{{{
# testapp/models.py
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)


class Book(models.Model):
    title = models.CharField(max_length=256)
    author = models.ManyToManyField(""Author"")
}}}

After running `python manage.py makemigrations` and `python manage.py migrate`, we can inspect the created M2M table in PostgreSQL:

{{{
mydb=# \d testapp_book_author
                              Table ""public.testapp_book_author""
  Column   |  Type   | Collation | Nullable |                     Default                     
-----------+---------+-----------+----------+-------------------------------------------------
 id        | integer |           | not null | nextval('testapp_book_author_id_seq'::regclass)
 book_id   | integer |           | not null | 
 author_id | integer |           | not null | 
Indexes:
    ""testapp_book_author_pkey"" PRIMARY KEY, btree (id)
    ""testapp_book_author_author_id_657c9727"" btree (author_id)
    ""testapp_book_author_book_id_author_id_ae92da66_uniq"" UNIQUE CONSTRAINT, btree (book_id, author_id)
    ""testapp_book_author_book_id_e2a2a45a"" btree (book_id)
Foreign-key constraints:
    ""testapp_book_author_author_id_657c9727_fk_testapp_author_id"" FOREIGN KEY (author_id) REFERENCES testapp_author(id) DEFERRABLE INITIALLY DEFERRED
    ""testapp_book_author_book_id_e2a2a45a_fk_testapp_book_id"" FOREIGN KEY (book_id) REFERENCES testapp_book(id) DEFERRABLE INITIALLY DEFERRED
}}}

Since `testapp_book_author_book_id_author_id_ae92da66_uniq` is a compound index and have `book_id` as its first field, then the index `testapp_book_author_book_id_e2a2a45a` is completely useless. Let’s check this affirmation by using `EXPLAIN`:

If we query for both `author_id` and `book_id`, the compound index is used:

{{{
mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1 AND author_id = 1;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using testapp_book_author_book_id_author_id_ae92da66_uniq on testapp_book_author  (cost=0.15..8.17 rows=1 width=12)
   Index Cond: ((book_id = 1) AND (author_id = 1))
}}}

If we query for `author_id`, the FK index on `author_id` is used:

{{{
mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE author_id = 1;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (author_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_author_id_657c9727  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (author_id = 1)
(4 rows)
}}}

If we query for `book_id`, the FK index on `book_id` is used:

{{{
mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1;
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (book_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_book_id_e2a2a45a  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (book_id = 1)
}}}

But if we disable the FK index on `book_id` and query for `book_id`, the compound index will be used with the same cost, which proves this index is unnecessary:

{{{
mydb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'testapp_book_author_book_id_e2a2a45a'::regclass;
UPDATE 1

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (book_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_book_id_author_id_ae92da66_uniq  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (book_id = 1)
}}}

**Wasted Database Space**

Database space is gold. Let’s check how this behavior affects the whole database size. First, let’s re-enable the index:

{{{
mydb=# UPDATE pg_index SET indisvalid = TRUE WHERE indexrelid = 'testapp_book_author_book_id_e2a2a45a'::regclass;
UPDATE 1
}}}

And populate the database with some thousands of rows:

{{{
import math


total_authors = 100_000
books_per_pair = 10
bulk_size = 100

total_iters = math.ceil(total_authors / bulk_size)
for number in range(1, total_iters + 1):
    authors = Author.objects.bulk_create([
        Author(first_name=f""John #{number * (bulk_size - 1) + n}"", last_name=""Doe"")
        for n in range(bulk_size)
    ])
    for author_1, author_2 in zip(authors, authors[1:]):
        # Create new books
        books = Book.objects.bulk_create([
            Book(title=f""Book #{n} by #{author_1.id} + #{author_2.id}"")
            for n in range(books_per_pair)
        ])
        # Assign these books to this pair of authors
        Book.author.through.objects.bulk_create(
            [
                Book.author.through(book_id=book.id, author_id=author_1.id)
                for book in books
            ]
            +
            [
                Book.author.through(book_id=book.id, author_id=author_2.id)
                for book in books
            ]
        )
    print(f""\r{number}/{total_iters}"", flush=True, end="""")
print()
}}}
(this will take some minutes to run)

I needed to stop the inserting here when it was finishing, but it inserted more than 100k authors, almost 1M books and almost 2M M2M rows:

{{{
In [1]: Author.objects.count()
Out[1]: 100100

In [2]: Book.objects.count()
Out[2]: 990200

In [3]: Book.author.through.objects.count()
Out[3]: 1980400
}}}

Now, we can finally check the total sizes of testapp’s indexes:

{{{
mydb=# SELECT indexrelid::regclass AS index, pg_size_pretty(pg_relation_size(indexrelid)) AS size FROM pg_index WHERE indexrelid::regclass::text LIKE 'testapp_%';
                        index                        |  size   
-----------------------------------------------------+---------
 testapp_author_pkey                                 | 2208 kB
 testapp_book_pkey                                   | 21 MB
 testapp_book_author_pkey                            | 42 MB
 testapp_book_author_book_id_author_id_ae92da66_uniq | 42 MB
 testapp_book_author_author_id_657c9727              | 42 MB
 testapp_book_author_book_id_e2a2a45a                | 42 MB
}}}

For 2M relationships, we have **42MB of wasted space** in `testapp_book_author_book_id_e2a2a45a`.

I’ve been working on projects including tables with 100x this size in rows. The cost per database-GB in Heroku PostgreSQL, for instance, goes from 1.28 USD to 2.34 USD, depending on the plan. With an average cost of 1.81 USD per GB, for M2M with 100M relationships we have 3.80 USD wasted per month.

With an optimization (just removing the FK index on ""from"" field), not only the cost will be lower, but also the insert time will be way faster (I have some tasks bulk inserting millions of rows per day).

**Fixing the problem**

I think the optimization for this case is not hard to implement and I’m willing to send a patch/create a pull request if you think I should proceed."	Cleanup/optimization	closed	Database layer (models, ORM)	3.1	Normal	duplicate	orm, many-to-many, database, indexes, optimization		Unreviewed	0	0	0	0	0	0
