Code

Opened 5 months ago

Last modified 3 months ago

#22125 new Cleanup/optimization

Unnecessary creation of index for ManyToManyField

Reported by: tbhtan3@… Owned by: bwreilly
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Suppose I have the following model:
class Food(models.Model):

restaurants = models.ManyToManyField(Restaurant)

The following table is created:

CREATE TABLE "main_food_restaurants" (

"id" integer NOT NULL PRIMARY KEY,
"food_id" integer NOT NULL,
"restaurant_id" integer NOT NULL,
UNIQUE ("food_id", "restaurant_id")

)

and the indexes:

CREATE INDEX "main_food_restaurants_0899c464" ON "main_food_restaurants" ("food_id");
CREATE INDEX "main_food_restaurants_be4c8f84" ON "main_food_restaurants" ("restaurant_id");

Notice that the single index on food_id is not needed due to the unique index (food_id, restaurant_id)

Attachments (0)

Change History (16)

comment:1 Changed 4 months ago by ceaess

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Is there some of the code missing? I'm not seeing a unique index created in the SQL code you have here, just a unique constraint on food_id and restaurant_id.

comment:2 Changed 4 months ago by anonymous

Let me clarify.
Suppose you have the following code:

class B(models.Model):

pass

class A(models.Model):

b = models.ManyToManyField(B)

===========================

The following SQL is produced:

CREATE TABLE "main_b" (

"id" integer NOT NULL PRIMARY KEY

)
;
CREATE TABLE "main_a_b" (

"id" integer NOT NULL PRIMARY KEY,
"a_id" integer NOT NULL,
"b_id" integer NOT NULL REFERENCES "main_b" ("id"),
UNIQUE ("a_id", "b_id")

)
;
CREATE TABLE "main_a" (

"id" integer NOT NULL PRIMARY KEY

)

CREATE INDEX "main_a_b_4d6cc2fb" ON "main_a_b" ("a_id");
CREATE INDEX "main_a_b_2c14050b" ON "main_a_b" ("b_id");

==============

Note that the index on a_id is redundant due to the unique constraint on a_id, b_id

comment:3 Changed 4 months ago by bwreilly

  • Owner changed from nobody to bwreilly
  • Status changed from new to assigned

comment:4 Changed 4 months ago by bwreilly

I think this fixes it: https://github.com/OneOcean/django/compare/django:master...master

But I am not sure I am entirely happy with it, and perhaps it needs a test. Either way, I'll need to get a CLA and company CLA signed before I can make a pull request.

comment:5 Changed 4 months ago by anonymous

looks good, but I can't comment much because I don't know Django ORM internals.
Thanks for the work!

comment:6 Changed 4 months ago by shai

  • Resolution set to needsinfo
  • Status changed from assigned to closed

The question of whether an index for the first column of a unique constraint is redundant, depends on how the database implements the unique constraint. I am quite certain that the SQL standard does not specify, that unique constraints must be implemented by indexes which can be used partially.

So, for accepting this change, I'd need some indications that the index is actually redundant -- at least in all core backends; hence, needsinfo. Even if we accept it, I'd require the code to allow a (3rd-party) backend to opt-out of it easily -- that is, with no need to override the entire create_model method.

Your code is not a PR, so I can't comment there; I think it will break on unique_together=[('a','b','c')]. It certainly does need tests (does it pass the Django test-suite?) and also documentation changes.

comment:7 Changed 4 months ago by sfrost@…

The SQL standard *doesn't talk about indexes*, so claiming that it doesn't happen to talk about how unique constraints are implemented is a bit like trying to rearrange deck chairs on the Titanic.

The index is redundant in every database that I've ever come across- and any that are sensible will do the same. How can a unique constraint be at all performant if there isn't a fast way to find an existing value? Allowing some backend to "opt-out" is fine, but it certainly shouldn't be the default.

Note that only the index on the *first* column would be redundant; an independent index on the second or subsequent columns which are included in a unique constraint could be quite useful.

comment:8 Changed 4 months ago by bwreilly

I've confirmed PostgreSQL and SQLite both exhibit this behavior (Gist), I'll check MySQL when I get a chance. I don't have access to an Oracle instance, but I'll check their documentation.

Additionally, I've fixed the issue with unique_together=[('a','b','c')], added tests, and moved the functionality to an overridable method (compare). Sorry it isn't a pull request at the moment, I'm going to fix that Monday.

Besides Model Meta options unique_together and Fields db_index, should I update the documentation anywhere else?

comment:9 Changed 4 months ago by Joshua Yanovski

SQL Fiddle may help (I can't link it since it would be marked as spam, but the address is pretty easy to determine). Oracle and MySQL both do this as well.

comment:10 Changed 4 months ago by alucard

For the core DBs, they usually use B-Tree indexes so I think if [a,b] is indexed, then [a] would be a redundant index. To generalize this, any leftmost prefix of an indexed tuple would be redundant. This is definitely true for MySQL.


However, for hash indexes (and possibly other types), both [a] and [a,b] would be perfectly valid and non-redundant.

Source: https://dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html
To quote: "[Hash index] Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)"

I think we need to have an opt-out on a per table basis (based on Meta options), or even better, per ManyToManyField basis (based on a setting to init)

comment:11 Changed 4 months ago by anonymous

By the way, a very quick way to verify is to actually create a compound index, and then use "EXPLAIN" to check whether a SELECT statement based on the leftmost prefix can make use of the compound index. Of course this is assuming you have access to those core DBs, of which Oracle could be a bit of a problem.

comment:12 Changed 4 months ago by Joshua Yanovski

Again, please look at SQL Fiddle. It allows you to write queries against a database of every type mentioned here, including Oracle (and also Microsoft SQL Server). And every single one of them uses the composite index for a query that uses only the first indexed column.

comment:13 Changed 4 months ago by anonymous

  • Resolution needsinfo deleted
  • Status changed from closed to new

comment:14 Changed 3 months ago by anonymous

  • Triage Stage changed from Unreviewed to Accepted

This seems like a reasonable optimization to me -- at a minimum we can do it by adding a new flag to backends and then only performing it on databases we've validated have this behavior (you'd have to have a REALLY terrible database for index looks not to support a subset of fields with a left-side prefix though).

comment:15 Changed 3 months ago by bwreilly

  • Has patch set

comment:16 Changed 3 months ago by timo

  • Easy pickings unset
  • Version changed from 1.6 to master

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as new
The owner will be changed from bwreilly to anonymous. Next status will be 'assigned'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.