Opened 11 years ago
Last modified 9 months ago
#22125 new Cleanup/optimization
Unnecessary creation of index for ManyToManyField — at Version 24
Reported by: | Owned by: | bwreilly | |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | db-indexes |
Cc: | emorley@…, Phil Krylov, Dan LaManna | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Suppose I have the following model:
class Food(models.Model): restaurants = models.ManyToManyField(Restaurant)
The following table is created:
CREATE TABLE "main_food_restaurants" ( "id" integer NOT NULL PRIMARY KEY, "food_id" integer NOT NULL, "restaurant_id" integer NOT NULL, UNIQUE ("food_id", "restaurant_id") )
and the indexes:
CREATE INDEX "main_food_restaurants_0899c464" ON "main_food_restaurants" ("food_id"); CREATE INDEX "main_food_restaurants_be4c8f84" ON "main_food_restaurants" ("restaurant_id");
Notice that the single index on food_id is not needed due to the unique index (food_id, restaurant_id)
.
Change History (24)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
Let me clarify.
Suppose you have the following code:
class B(models.Model): pass class A(models.Model): b = models.ManyToManyField(B)
The following SQL is produced:
CREATE TABLE "main_b" ( "id" integer NOT NULL PRIMARY KEY ) ; CREATE TABLE "main_a_b" ( "id" integer NOT NULL PRIMARY KEY, "a_id" integer NOT NULL, "b_id" integer NOT NULL REFERENCES "main_b" ("id"), UNIQUE ("a_id", "b_id") ) ; CREATE TABLE "main_a" ( "id" integer NOT NULL PRIMARY KEY ) CREATE INDEX "main_a_b_4d6cc2fb" ON "main_a_b" ("a_id"); CREATE INDEX "main_a_b_2c14050b" ON "main_a_b" ("b_id");
Note that the index on a_id is redundant due to the unique constraint on a_id, b_id
comment:3 by , 11 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:4 by , 11 years ago
I think this fixes it: https://github.com/OneOcean/django/compare/django:master...master
But I am not sure I am entirely happy with it, and perhaps it needs a test. Either way, I'll need to get a CLA and company CLA signed before I can make a pull request.
comment:5 by , 11 years ago
looks good, but I can't comment much because I don't know Django ORM internals.
Thanks for the work!
comment:6 by , 11 years ago
Resolution: | → needsinfo |
---|---|
Status: | assigned → closed |
The question of whether an index for the first column of a unique constraint is redundant, depends on how the database implements the unique constraint. I am quite certain that the SQL standard does not specify, that unique constraints must be implemented by indexes which can be used partially.
So, for accepting this change, I'd need some indications that the index is actually redundant -- at least in all core backends; hence, needsinfo. Even if we accept it, I'd require the code to allow a (3rd-party) backend to opt-out of it easily -- that is, with no need to override the entire create_model
method.
Your code is not a PR, so I can't comment there; I think it will break on unique_together=[('a','b','c')]
. It certainly does need tests (does it pass the Django test-suite?) and also documentation changes.
comment:7 by , 11 years ago
The SQL standard *doesn't talk about indexes*, so claiming that it doesn't happen to talk about how unique constraints are implemented is a bit like trying to rearrange deck chairs on the Titanic.
The index is redundant in every database that I've ever come across- and any that are sensible will do the same. How can a unique constraint be at all performant if there isn't a fast way to find an existing value? Allowing some backend to "opt-out" is fine, but it certainly shouldn't be the default.
Note that only the index on the *first* column would be redundant; an independent index on the second or subsequent columns which are included in a unique constraint could be quite useful.
comment:8 by , 11 years ago
I've confirmed PostgreSQL and SQLite both exhibit this behavior (Gist), I'll check MySQL when I get a chance. I don't have access to an Oracle instance, but I'll check their documentation.
Additionally, I've fixed the issue with unique_together=[('a','b','c')]
, added tests, and moved the functionality to an overridable method (compare). Sorry it isn't a pull request at the moment, I'm going to fix that Monday.
Besides Model Meta options unique_together and Fields db_index, should I update the documentation anywhere else?
comment:9 by , 11 years ago
SQL Fiddle may help (I can't link it since it would be marked as spam, but the address is pretty easy to determine). Oracle and MySQL both do this as well.
comment:10 by , 11 years ago
For the core DBs, they usually use B-Tree indexes so I think if [a,b] is indexed, then [a] would be a redundant index. To generalize this, any leftmost prefix of an indexed tuple would be redundant. This is definitely true for MySQL.
However, for hash indexes (and possibly other types), both [a] and [a,b] would be perfectly valid and non-redundant.
Source: https://dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html
To quote: "[Hash index] Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)"
I think we need to have an opt-out on a per table basis (based on Meta options), or even better, per ManyToManyField basis (based on a setting to init)
comment:11 by , 11 years ago
By the way, a very quick way to verify is to actually create a compound index, and then use "EXPLAIN" to check whether a SELECT statement based on the leftmost prefix can make use of the compound index. Of course this is assuming you have access to those core DBs, of which Oracle could be a bit of a problem.
comment:12 by , 11 years ago
Again, please look at SQL Fiddle. It allows you to write queries against a database of every type mentioned here, including Oracle (and also Microsoft SQL Server). And every single one of them uses the composite index for a query that uses only the first indexed column.
comment:13 by , 11 years ago
Resolution: | needsinfo |
---|---|
Status: | closed → new |
comment:14 by , 11 years ago
Triage Stage: | Unreviewed → Accepted |
---|
This seems like a reasonable optimization to me -- at a minimum we can do it by adding a new flag to backends and then only performing it on databases we've validated have this behavior (you'd have to have a REALLY terrible database for index looks not to support a subset of fields with a left-side prefix though).
comment:15 by , 11 years ago
Has patch: | set |
---|
follow-up: 21 comment:17 by , 10 years ago
Patch needs improvement: | set |
---|
I don't think the solution in PR2578 is the right solution at all.
If I read the patch correctly it removes all indexes for fields that are already first element in some unique_together index. There are valid cases for asking a separate index for a field even if it is part of an unique_together index. The most important reason for this is that a single-field index is faster than multicolumn index. In addition we shouldn't override what users have asked explicitly.
We should just skip index creation for m2m tables unless the user asks for single column index. I am not sure if we should add flags to m2m field for creation or skipping of these indexes.
comment:19 by , 9 years ago
Keywords: | db-indexes added |
---|
comment:20 by , 8 years ago
Cc: | added |
---|
comment:21 by , 6 years ago
Cc: | added |
---|
Replying to Anssi Kääriäinen:
We should just skip index creation for m2m tables unless the user asks for single column index. I am not sure if we should add flags to m2m field for creation or skipping of these indexes.
I would say that if someone needs a single column index in their through table, they can use an explicit through table without db_index=False
on that (ForeignKey
) column.
comment:23 by , 6 years ago
Patch needs improvement: | unset |
---|
comment:24 by , 6 years ago
Description: | modified (diff) |
---|---|
Patch needs improvement: | set |
Is there some of the code missing? I'm not seeing a unique index created in the SQL code you have here, just a unique constraint on food_id and restaurant_id.