Opened 3 years ago

Closed 8 months ago

Last modified 8 months ago

#22268 closed Cleanup/optimization (fixed)

Document values_list() behavior for ManyToManyField

Reported by: k_sze Owned by: Mikkel Bach Mortensen
Component: Documentation Version: master
Severity: Normal Keywords: orm, values_list, ManyToManyField
Cc: anubhav9042@…, pirosb3, saivnm5@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: yes UI/UX: no

Description

Suppose you have a schools app, with these model definitions:

from django.db import models

class School(models.Model):
    name = models.CharField(unique=True, max_length=255)

class Class(models.Model):
    school = models.ForeignKey(School)
    name = models.CharField(unique=True, max_length=255)
    students = models.ManyToManyField('Student')

class Student(models.Model):
    surname = models.CharField(max_length=255)
    given_name = models.CharField(max_length=255)

Now, try this in the Django shell:

$ python manage.py shell

>>> from schools.models import School, Class, Student

# Create the school
>>> concordia = School(name='Concordia University')
>>> concordia.save()

# Create the Software Engineering class
>>> soen = Class(school=concordia, name='Software Engineering')
>>> soen.save()

# Create the Computer Engineering class
>>> coen = Class(school=concordia, name='Computer Engineering')
>>> coen.save()

# Create a student
>>> john_smith = Student(surname='Smith', given_name='John')
>>> john_smith.save()

# Add this student into one of the classes
>>> soen.students.add(john_smith)

# Now make a query using values_list
>>> students = Class.objects.values_list('students', flat=True)

# How many students are there supposed to be in this `students` QuerySet?
>>> print students.count()
1

# What if we iterate over it?
>>> for s in students:
>>>     print s
1
None

# Wait, what!?
>>> print len(list(students))
>>> 2

Attachments (1)

22268.diff (1.1 KB) - added by ANUBHAV JOSHI 3 years ago.

Download all attachments as: .zip

Change History (49)

comment:1 Changed 3 years ago by k_sze

Type: UncategorizedBug

comment:2 Changed 3 years ago by Baptiste Mispelon

Triage Stage: UnreviewedAccepted
Version: 1.6master

Hi,

Indeed, there seem to be something going wrong here.
I tried it out and the bug seems present on older versions too (I tried all the way to 1.4).

Thanks.

comment:3 Changed 3 years ago by ANUBHAV JOSHI

One thing I want to clear out is that I think that values and values_list return dictionary/tuple type data for all objects present, hence this might not be a bug.
eg.

Class.objects.values_list()

will give you something like

[(1L, 1L, u'Software Engineering'), (2L, 1L, u'Computer Engineering')]

thus for

Class.objects.values_list('students')

you are bound to get

[(1L,), (None,)]

Thoughts??

Version 0, edited 3 years ago by ANUBHAV JOSHI (next)

comment:4 Changed 3 years ago by Tim Graham

I agree it's unclear what change in behavior is desired here.

comment:5 Changed 3 years ago by ANUBHAV JOSHI

Cc: anubhav9042@… added

comment:6 Changed 3 years ago by pirosb3

I have start looking at this ticket, that is currently replicated on my machine.

The query generated by the values_list statement above is:

(u'SELECT "m2m_and_m2o_class_students"."student_id" FROM "m2m_and_m2o_class" LEFT OUTER JOIN "m2m_and_m2o_class_students" ON ( "m2m_and_m2o_class"."id" = "m2m_and_m2o_class_students"."class_id" )', ())

I think the issue here is related to the LEFT OUTER JOIN. We could modify ValuesListQuerySet.iterator and make changes to avoid this, but I am not sure it should be done here.
Can anyone give me some advice? am I looking in the correct place?

Thanks,
Dan

comment:7 Changed 3 years ago by k_sze

And I believe the problem is even more serious when the values_list is querying for a nullable field of a ManyToManyField. For example, if I modify the Student class and add a new field:

class Student(models.Model):
    surname = models.CharField(max_length=255)
    given_name = models.CharField(max_length=255)
    year_of_birth = models.SmallIntegerField(null=True)

And then I make this query:

>>> Class.objects.values_list('students__year_of_birth', flat=True)
[None, None]

It becomes ambiguous where the Nones come from: one of the Nones is from john_smith, who has a null year_of_birth, the other None is because the other class has no student at all.

Last edited 3 years ago by k_sze (previous) (diff)

comment:8 Changed 3 years ago by ANUBHAV JOSHI

First thing we should make clear is that whether we want to change the current behaviour or not.
Please see my previous comment: values and values_list give result for all objects present.

Last edited 3 years ago by ANUBHAV JOSHI (previous) (diff)

comment:9 Changed 3 years ago by ANUBHAV JOSHI

Since values and values_list return a value for all objects present(here Class).
Therefore if a class doesn't have that field present, it will have to return some value, if None is not demanded for that field then what should we be wanting in place.

comment:10 Changed 3 years ago by pirosb3

Cc: pirosb3 added

I understand that it is unclear if this is an error or expected behavour, and that changing this will lead to backwards compatibility issues.
I still believe this should be modified in the future. With None being a possible value in values_list, it makes developer's lives more difficult, as they need to check when iterating over the list.
Another possible solution would be to make a documentation patch, the developer needs to be aware of this.

comment:11 Changed 3 years ago by Erik Romijn

Changing this to not include the None value is definitely backwards compatibility breaking. So far, in this ticket, we have not been able to reach a clear consensus on what the correct behaviour is. To me, the current behaviour does not seem unreasonable. The issue raised in comment:7 is valid, but I don't see how removing extra None's would help - this is simply not the right query for that question.

Given that there is no strong case for removing the None's, I think we should not break backwards compatibility, and therefore keep the existing behaviour. If you disagree, please make sure to clarify exactly how you would like the behaviour to change. Otherwise, I do think a documentation patch would be good.

comment:12 Changed 3 years ago by ANUBHAV JOSHI

Owner: changed from nobody to ANUBHAV JOSHI
Status: newassigned

I am also in favour of doc patch...as current behaviour is correct.

Last edited 3 years ago by ANUBHAV JOSHI (previous) (diff)

comment:13 in reply to:  12 ; Changed 3 years ago by k_sze

On the one hand, I do understand that a change in behaviour would break backward compatibility. On the other hand, I tend to agree with comment:10 here, in that the behaviour should be modified in the future. To me, the correct behaviour would be to not include the extra Nones. In my example, the ones from the null year_of_birth are fine and should be included. Another point of consideration that I would raise is whether there is any reason to believe that people are relying on the current behaviour (I think it would be quite strange for people to be relying on ambiguity).

Would it be OK to document it as a quirk that is subject to change in the future, and not as the correct bahaviour?

Regarding comment:11, I think there are possible legitimate uses for this kind of query. For instance, if you want a Cartesian product count of some M2M relation, and then run it through the Python collections.Counter or something.

comment:14 in reply to:  13 Changed 3 years ago by ANUBHAV JOSHI

Replying to k_sze:

On the one hand, I do understand that a change in behaviour would break backward compatibility. On the other hand, I tend to agree with comment:10 here, in that the behaviour should be modified in the future. To me, the correct behaviour would be to not include the extra Nones.

This is what I want to ask, if we do not want those extra Nones, then what do we return in its place.
If we just don't return anything at all, it is going to be more problematic.
Consider three Class objects, out of which the second object doesn't have student, then if we return something like ['some_id','some_id'] for this, would it not be confusing for similar case when third object doesn't have student.

comment:15 Changed 3 years ago by k_sze

Actually the whole behaviour of values() and values_list() seems to be wrong once you mix M2M fields into the picture. I just tried creating more Class and Student objects:

# New student, Joe Blo
>>> joe_blo = Student(surname='Blo', given_name='Joe')
>>> joe_blo.save()

# New class, Discrete Mathematics
>>> discrete_math = Class(name='Discrete Mathematics', school=concordia)
>>> discrete_math.save()

# Enroll both john_smith and joe_blo in discrete_math
discrete_math.students.add(joe_blo)
discrete_math.students.add(john_smith)

First, let's look at values() queries:

>>> Class.objects.values()
[{u'id': 1, 'name': u'Software Engineering', 'school_id': 1}, {u'id': 2, 'name': u'Computer Engineering', 'school_id': 1}, {u'id': 4, 'name': u'Discrete Mathematics', 'school_id': 1}]

That's 3 dictionaries because we now have 3 Class objects. That seems fine. On the other hand, why don't students appear here?

So, let's specifically ask values() to include students:

>>> Class.objects.values('name', 'students')
[{'students': 1, 'name': u'Software Engineering'}, {'students': None, 'name': u'Computer Engineering'}, {'students': 1, 'name': u'Discrete Mathematics'}, {'students': 2, 'name': u'Discrete Mathematics'}]

That's 4 dictionaries even though we only have 3 Class objects (soen, coen, and discrete_math). Can you imagine what would happen if the Class object had more M2M fields? The number of dictionaries returned would probably explode exponentially!

Why aren't the students combined into a tuple or a list? I would expect the results to be a list of 3 dictionaries like this:

[
    {'students': (1,), 'name': u'Software Engineering'},
    {'students': (,), 'name': u'Computer Engineering'}, # Note the empty tuple for students, instead of None
    {'students': (1, 2), 'name': u'Discrete Mathematics'},
]

Now, let's look at values_list() queries:

>>> Class.objects.values_list()
[(1, 1, u'Software Engineering'), (2, 1, u'Computer Engineering'), (4, 1, u'Discrete Mathematics')]

3 tuples. Again, this seems fine except for the fact that students aren't included.

So, let's specifically ask values_list() to include students:

>>> Class.objects.values_list('name', 'students')
[(u'Software Engineering', 1), (u'Computer Engineering', None), (u'Discrete Mathematics', 1), (u'Discrete Mathematics', 2)]

4 tuples even though we have 3 Class objects only. I would expect the results to be a list of 3 tuples like this:

[
    (u'Software Engineering', (1,)),
    (u'Computer Engineering', (,)), # Note the empty tuple instead of None
    (u'Discrete Mathematics', (1, 2)),
]
Last edited 3 years ago by k_sze (previous) (diff)

comment:16 Changed 3 years ago by k_sze

To demonstrate how bad things become once you have multiple M2M fields, I created a new Language model and added it as a M2M field in the Class model:

from django.db import models

# other Model definitions skipped ...

class Class(models.Model):
    school = models.ForeignKey(School)
    name = models.CharField(unique=True, max_length=255)
    students = models.ManyToManyField('Student')
    available_in_languages = models.ManyToManyField('Language')
    
   
class Language(models.Model):
    name = models.CharField(max_length=255)

Now I create new languages and add them to the Discrete Mathematics class (which now has 2 students, if you recall):

# French
>>> french = Language(name='French')
>>> french.save()

# English
>>> english = Language(name='English')
>>> english.save()

# Add them to Discrete Mathematics class
discrete_math = Class.objects.get(name='Discrete Mathematics')
discrete_math.available_in_languages.add(french)
discrete_math.available_in_languages.add(english)

Now I make a values() query:

>>> Class.objects.values('available_in_languages', 'students')
[{'students': 1, 'available_in_languages': None}, {'students': None, 'available_in_languages': None}, {'students': 1, 'available_in_languages': 2}, {'students': 2, 'available_in_languages': 2}, {'students': 1, 'available_in_languages': 1}, {'students': 2, 'available_in_languages': 1}]

That's 6 dictionaries now!

Similarly, with values_list():

>>> Class.objects.values_list('available_in_languages', 'students')
[(None, 1), (None, None), (2, 1), (2, 2), (1, 1), (1, 2)]

6 tuples even though we only have 3 Class objects.

Last edited 3 years ago by k_sze (previous) (diff)

comment:17 Changed 3 years ago by ANUBHAV JOSHI

According to me the query generated for Class.objects.values('available_in_languages', 'students'):

SELECT `tic_22268_class_available_in_languages`.`language_id`, `tic_22268_class_students`.`student_id` FROM `tic_22268_class` LEFT OUTER JOIN `tic_22268_class_available_in_languages` ON (`tic_22268_class`.`id` = `tic_22268_class_available_in_languages`.`class_id`) LEFT OUTER JOIN `tic_22268_class_students` ON (`tic_22268_class`.`id` = `tic_22268_class_students`.`class_id`)

is correct as it yield the following result

language_id 	student_id
NULL 	           1
NULL 	           NULL
1 	           1
1 	           2
2 	           1
2 	           2

Although the result is correct as we want combination of available_in_languages and students for Class objects. However it is confusing.
I propose the following:
Whenever there is a case of M2M in values() or values_list(), we could introduce additional id into the query, which will help in making the result meaningful and understandable.

We make the query into:

SELECT `tic_22268_class_available_in_languages`.`language_id`, `tic_22268_class_students`.`student_id`, `tic_22268_class_students`.`id` FROM `tic_22268_class` LEFT OUTER JOIN `tic_22268_class_available_in_languages` ON (`tic_22268_class`.`id` = `tic_22268_class_available_in_languages`.`class_id`) LEFT OUTER JOIN `tic_22268_class_students` ON (`tic_22268_class`.`id` = `tic_22268_class_students`.`class_id`)

resulting in

language_id 	student_id 	id
NULL 	              1	        1
NULL  	           NULL 	2
1 	              1 	3
1 	              2 	3
2 	              1 	3
2 	              2 	3

The output also becomes:

[{'students': 1L, 'available_in_languages': None, 'id': 1L},
{'students': None, 'available_in_languages': None, 'id': 2L},
{'students': 1L, 'available_in_languages': 1L, 'id': 3L}, {'students': 2L, 'available_in_languages': 1L, 'id': 3L}, {
'students': 1L, 'available_in_languages': 2L, 'id': 3L}, {'students': 2L, 'available_in_languages': 2L, 'id': 3L}]

}}}

Now the result is as required, since we wanted the combination of values: 'available_in_languages' and 'students' along with 'id' of the Class object to which it belongs.

I am adding a diff.

Last edited 3 years ago by ANUBHAV JOSHI (previous) (diff)

comment:18 Changed 3 years ago by ANUBHAV JOSHI

Added the diff.
Although I am not sure if we should add it at the place where we have added it.
Apart from the place in the diff, we could also add it in _setup_query of ValuesQuerySet.

Last edited 3 years ago by ANUBHAV JOSHI (previous) (diff)

Changed 3 years ago by ANUBHAV JOSHI

Attachment: 22268.diff added

comment:19 Changed 3 years ago by k_sze

I would argue against solving the problem this way.

  1. Letting the number of results grow combinatorially seems questionable. According to the current documentation for values() and values_list(), each result is supposed to correspond to one model object. Granted, the documentation doesn't say that the query cannot return multiple results for each model object, it still feels wrong because, in the current behaviour, none of the results represents a coherent big picture of one model object; for each model object, the values are either getting duplicated or fragmented across multiple dictionaries or tuples.
  2. I suspect that including the id (or pk) still doesn't solve the ambiguity that arises when you query for a nullable field of an M2M field (e.g. students__year_of_birth). You still get None for both a) null year_of_birth or b) no student at all. I can't confirm this yet because the diff doesn't seem to work against Django 1.6.2 (what version does the diff target?).

By the way, the problem is not limited to M2M fields, but also to ForeignKeys; e.g. if you have a Province model and a City model, and the City has a ForeignKey on Province, and you do Province.objects.values('cities'), you get the same problem.

This is looking like a rabbit hole. :(

comment:20 Changed 3 years ago by Erik Romijn

The examples in comment:15 and comment:16 do seem odd. However, have a close look at #5768, where this behaviour seems to have been added. I haven't read it in detail, but there may be some explanation in there of the choices made.

comment:21 in reply to:  19 Changed 3 years ago by ANUBHAV JOSHI

Replying to k_sze:

I would argue against solving the problem this way.

  1. Letting the number of results grow combinatorially seems questionable. According to the current documentation for values() and values_list(), each result is supposed to correspond to one model object. Granted, the documentation doesn't say that the query cannot return multiple results for each model object, it still feels wrong because, in the current behaviour, none of the results represents a coherent big picture of one model object; for each model object, the values are either getting duplicated or fragmented across multiple dictionaries or tuples.
  2. I suspect that including the id (or pk) still doesn't solve the ambiguity that arises when you query for a nullable field of an M2M field (e.g. students__year_of_birth). You still get None for both a) null year_of_birth or b) no student at all. I can't confirm this yet because the diff doesn't seem to work against Django 1.6.2 (what version does the diff target?).

By the way, the problem is not limited to M2M fields, but also to ForeignKeys; e.g. if you have a Province model and a City model, and the City has a ForeignKey on Province, and you do Province.objects.values('cities'), you get the same problem.

This is looking like a rabbit hole. :(

Diff targets master.
Even if two Nones are returned from pk value we know that which corresponds to what. What should we give instead of None

comment:22 Changed 3 years ago by ANUBHAV JOSHI

Although I have another idea in mind:
We just add pk to the query and then when final result is obtained we can club the dicts/tuples having same pk and remove pk.

Thoughts??

I am suggesting changes this way becoz I think that the query generated is correct.

comment:23 in reply to:  22 Changed 3 years ago by k_sze

Replying to anubhav9042:

Although I have another idea in mind:
We just add pk to the query and then when final result is obtained we can club the dicts/tuples having same pk and remove pk.

Thoughts??

I am suggesting changes this way becoz I think that the query generated is correct.

And also remove the ones where the pk is None (due to the LEFT OUTER JOIN)?

Last edited 3 years ago by k_sze (previous) (diff)

comment:24 Changed 3 years ago by k_sze

Another idea that I have, which is probably crazy:

Instead of retrieving the M2M or reverse foreign key values in the same SQL query, use a clever combination of:

  • prefetch_related()
  • subclassing dict and tuple

So values() and values_list return these subclassed dicts and tuples, which lazily construct and return the relevant M2M or reverse foreign key elements when they are accessed.

In the end, the subclassed dict and tuple would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.

Last edited 3 years ago by k_sze (previous) (diff)

comment:25 in reply to:  22 Changed 3 years ago by ANUBHAV JOSHI

Although I have another idea in mind:
We just add pk to the query and then when final result is obtained we can club the dicts/tuples having same pk and remove pk.

Thoughts??

I am suggesting changes this way becoz I think that the query generated is correct.

I tried to do this way, but clubbing those dics/tuples is not easy as they are not exactly dics/tuple rather QuerySets.
Will keep trying and post back soon

comment:26 in reply to:  24 ; Changed 3 years ago by ANUBHAV JOSHI

In the end, the subclassed dict and tuple would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.

Will think
Although the same difficulty might come here as well.

comment:27 in reply to:  26 Changed 3 years ago by k_sze

Replying to anubhav9042:

In the end, the subclassed dict and tuple would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.

Will think
Although the same difficulty might come here as well.

My idea is for these subclassed dict and tuple to be used recursively. And then because they are constructed lazily using QuerySets, you won't get combinatorial explosions, so you won't have to do any clubbing.

comment:28 Changed 3 years ago by ANUBHAV JOSHI

For better discussion, I have opened a thread:
https://groups.google.com/forum/#!topic/django-developers/DAslY6GI1O8

comment:29 Changed 3 years ago by ANUBHAV JOSHI

As discussed on IRC and mailing list, the solution has come out to be a doc update for this behaviour rather than tweaking the code any further.

comment:30 Changed 3 years ago by Tim Graham

Component: Database layer (models, ORM)Documentation
Type: BugCleanup/optimization

comment:31 Changed 13 months ago by Tim Graham

Easy pickings: set
Owner: ANUBHAV JOSHI deleted
Status: assignednew
Summary: values_list() on a ManyToManyField returns extra None's when iterated over.Document values_list() behavior for ManyToManyField

comment:32 Changed 12 months ago by Umesh Singla

Owner: set to Umesh Singla
Status: newassigned

comment:33 Changed 11 months ago by Sai Krishna

From the django documentation https://docs.djangoproject.com/en/dev/ref/models/querysets/#values :

Because ManyToManyField attributes and reverse relations can have multiple related rows, including these can have a multiplier effect on the size of your result set. This will be especially pronounced if you include multiple such fields in your values() query, in which case all possible combinations will be returned.

I believe that gives a fair amount of warning to the user as to the behaviour of .values() with ManyToManyField attributes. What we can do is:

  1. Add a similar warning under the .values_list() reference.
  2. Or modify the above warning.

The modification I am suggesting would be in the last line:

This will be especially pronounced if you include multiple such fields in your values() or values_list() query, in which case all possible combinations (including None) will be returned.

comment:34 Changed 11 months ago by Sai Krishna

Cc: saivnm5@… added

comment:35 Changed 11 months ago by Tim Graham

Modifying the existing warning might make sense. An example may also help.

comment:36 Changed 11 months ago by Sai Krishna

Added a pull request on Github https://github.com/django/django/pull/5976
Following timgraham's suggestion of adding an example to illustrate behavior.

comment:37 Changed 11 months ago by Sai Krishna

Owner: changed from Umesh Singla to Sai Krishna
Triage Stage: AcceptedReady for checkin

Assigning to myself to expedite the process of closing this ticket, it has been open for far too long.

Last edited 11 months ago by Sai Krishna (previous) (diff)

comment:38 Changed 11 months ago by Simon Charette

Has patch: set
Triage Stage: Ready for checkinAccepted

comment:39 Changed 11 months ago by Tim Graham

Patch needs improvement: set

Left comments for improvement on the pull request in comment 36.

comment:40 Changed 10 months ago by Akshesh Doshi

How about adding the same warning for values_list() as there is for values() (as suggested in comment:33) and add an example in this warning itself.

Would that meet our requirements ?

comment:41 Changed 10 months ago by Tim Graham

I'd rather not duplicate a large block of test if possible. Did you read my comments on the the pull request?

comment:42 Changed 10 months ago by Sai Krishna

Owner: Sai Krishna deleted
Status: assignednew

comment:43 Changed 8 months ago by Mikkel Bach Mortensen

Owner: set to Mikkel Bach Mortensen
Status: newassigned

comment:44 Changed 8 months ago by Tim Graham

Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

comment:45 Changed 8 months ago by Tim Graham <timograham@…>

Resolution: fixed
Status: assignedclosed

In 7d485d5d:

Fixed #22268 -- Documented values_list() behavior for multivalued relations.a

Thanks Sai Krishna for the initial patch.

comment:46 Changed 8 months ago by Tim Graham <timograham@…>

In 9956d89f:

[1.9.x] Fixed #22268 -- Documented values_list() behavior for multivalued relations.a

Thanks Sai Krishna for the initial patch.

Backport of 7d485d5d75bd9faab0b949fd34d4f098f8079452 from master

comment:47 Changed 8 months ago by Tim Graham <timograham@…>

In 5ac7c8f7:

Refs #22268 -- Fixed typo in docs/ref/models/querysets.txt

comment:48 Changed 8 months ago by Tim Graham <timograham@…>

In 1ef5a328:

[1.9.x] Refs #22268 -- Fixed typo in docs/ref/models/querysets.txt

Backport of 5ac7c8f7ab2b2e1fec50abb14539a2eb520d1995 from master

Note: See TracTickets for help on using tickets.
Back to Top