Opened 17 months ago

Last modified 16 months ago

#22268 assigned Cleanup/optimization

values_list() on a ManyToManyField returns extra None's when iterated over.

Reported by: k_sze Owned by: anubhav9042
Component: Documentation Version: master
Severity: Normal Keywords: orm, values_list, ManyToManyField
Cc: anubhav9042@…, pirosb3 Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Suppose you have a schools app, with these model definitions:

from django.db import models

class School(models.Model):
    name = models.CharField(unique=True, max_length=255)

class Class(models.Model):
    school = models.ForeignKey(School)
    name = models.CharField(unique=True, max_length=255)
    students = models.ManyToManyField('Student')

class Student(models.Model):
    surname = models.CharField(max_length=255)
    given_name = models.CharField(max_length=255)

Now, try this in the Django shell:

$ python manage.py shell

>>> from schools.models import School, Class, Student

# Create the school
>>> concordia = School(name='Concordia University')
>>> concordia.save()

# Create the Software Engineering class
>>> soen = Class(school=concordia, name='Software Engineering')
>>> soen.save()

# Create the Computer Engineering class
>>> coen = Class(school=concordia, name='Computer Engineering')
>>> coen.save()

# Create a student
>>> john_smith = Student(surname='Smith', given_name='John')
>>> john_smith.save()

# Add this student into one of the classes
>>> soen.students.add(john_smith)

# Now make a query using values_list
>>> students = Class.objects.values_list('students', flat=True)

# How many students are there supposed to be in this `students` QuerySet?
>>> print students.count()
1

# What if we iterate over it?
>>> for s in students:
>>>     print s
1
None

# Wait, what!?
>>> print len(list(students))
>>> 2

Attachments (1)

22268.diff (1.1 KB) - added by anubhav9042 17 months ago.

Download all attachments as: .zip

Change History (31)

comment:1 Changed 17 months ago by k_sze

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Type changed from Uncategorized to Bug

comment:2 Changed 17 months ago by bmispelon

  • Triage Stage changed from Unreviewed to Accepted
  • Version changed from 1.6 to master

Hi,

Indeed, there seem to be something going wrong here.
I tried it out and the bug seems present on older versions too (I tried all the way to 1.4).

Thanks.

comment:3 Changed 17 months ago by anubhav9042

One thing I want to clear out is that I think that values and values_list return dictionary/tuple type data for all objects present, hence this might not be a bug.
eg.
Class.objects.values_list()
will give you something like
[(1L, 1L, u'Software Engineering'), (2L, 1L, u'Computer Engineering')]
thus for
Class.objects.values_list('students')
you are bound to get
[(1L,), (None,)]

Thoughts??

Last edited 17 months ago by anubhav9042 (previous) (diff)

comment:4 Changed 17 months ago by timo

I agree it's unclear what change in behavior is desired here.

comment:5 Changed 17 months ago by anubhav9042

  • Cc anubhav9042@… added

comment:6 Changed 17 months ago by pirosb3

I have start looking at this ticket, that is currently replicated on my machine.

The query generated by the values_list statement above is:

(u'SELECT "m2m_and_m2o_class_students"."student_id" FROM "m2m_and_m2o_class" LEFT OUTER JOIN "m2m_and_m2o_class_students" ON ( "m2m_and_m2o_class"."id" = "m2m_and_m2o_class_students"."class_id" )', ())

I think the issue here is related to the LEFT OUTER JOIN. We could modify ValuesListQuerySet.iterator and make changes to avoid this, but I am not sure it should be done here.
Can anyone give me some advice? am I looking in the correct place?

Thanks,
Dan

comment:7 Changed 17 months ago by k_sze

And I believe the problem is even more serious when the values_list is querying for a nullable field of a ManyToManyField. For example, if I modify the Student class and add a new field:

class Student(models.Model):
    surname = models.CharField(max_length=255)
    given_name = models.CharField(max_length=255)
    year_of_birth = models.SmallIntegerField(null=True)

And then I make this query:

>>> Class.objects.values_list('students__year_of_birth', flat=True)
[None, None]

It becomes ambiguous where the Nones come from: one of the Nones is from john_smith, who has a null year_of_birth, the other None is because the other class has no student at all.

Last edited 17 months ago by k_sze (previous) (diff)

comment:8 Changed 17 months ago by anubhav9042

First thing we should make clear is that whether we want to change the current behaviour or not.
Please see my previous comment: values and values_list give result for all objects present.

Version 0, edited 17 months ago by anubhav9042 (next)

comment:9 Changed 17 months ago by anubhav9042

Since values and values_list return a value for all objects present(here Class).
Therefore if a class doesn't have that field present, it will have to return some value, if None is not demanded for that field then what should we be wanting in place.

comment:10 Changed 17 months ago by pirosb3

  • Cc pirosb3 added

I understand that it is unclear if this is an error or expected behavour, and that changing this will lead to backwards compatibility issues.
I still believe this should be modified in the future. With None being a possible value in values_list, it makes developer's lives more difficult, as they need to check when iterating over the list.
Another possible solution would be to make a documentation patch, the developer needs to be aware of this.

comment:11 Changed 17 months ago by erikr

Changing this to not include the None value is definitely backwards compatibility breaking. So far, in this ticket, we have not been able to reach a clear consensus on what the correct behaviour is. To me, the current behaviour does not seem unreasonable. The issue raised in comment:7 is valid, but I don't see how removing extra None's would help - this is simply not the right query for that question.

Given that there is no strong case for removing the None's, I think we should not break backwards compatibility, and therefore keep the existing behaviour. If you disagree, please make sure to clarify exactly how you would like the behaviour to change. Otherwise, I do think a documentation patch would be good.

comment:12 follow-up: Changed 17 months ago by anubhav9042

  • Owner changed from nobody to anubhav9042
  • Status changed from new to assigned

I am also in favour of doc patch...as current behaviour is correct.

Last edited 17 months ago by anubhav9042 (previous) (diff)

comment:13 in reply to: ↑ 12 ; follow-up: Changed 17 months ago by k_sze

On the one hand, I do understand that a change in behaviour would break backward compatibility. On the other hand, I tend to agree with comment:10 here, in that the behaviour should be modified in the future. To me, the correct behaviour would be to not include the extra Nones. In my example, the ones from the null year_of_birth are fine and should be included. Another point of consideration that I would raise is whether there is any reason to believe that people are relying on the current behaviour (I think it would be quite strange for people to be relying on ambiguity).

Would it be OK to document it as a quirk that is subject to change in the future, and not as the correct bahaviour?

Regarding comment:11, I think there are possible legitimate uses for this kind of query. For instance, if you want a Cartesian product count of some M2M relation, and then run it through the Python collections.Counter or something.

comment:14 in reply to: ↑ 13 Changed 17 months ago by anubhav9042

Replying to k_sze:

On the one hand, I do understand that a change in behaviour would break backward compatibility. On the other hand, I tend to agree with comment:10 here, in that the behaviour should be modified in the future. To me, the correct behaviour would be to not include the extra Nones.

This is what I want to ask, if we do not want those extra Nones, then what do we return in its place.
If we just don't return anything at all, it is going to be more problematic.
Consider three Class objects, out of which the second object doesn't have student, then if we return something like ['some_id','some_id'] for this, would it not be confusing for similar case when third object doesn't have student.

comment:15 Changed 17 months ago by k_sze

Actually the whole behaviour of values() and values_list() seems to be wrong once you mix M2M fields into the picture. I just tried creating more Class and Student objects:

# New student, Joe Blo
>>> joe_blo = Student(surname='Blo', given_name='Joe')
>>> joe_blo.save()

# New class, Discrete Mathematics
>>> discrete_math = Class(name='Discrete Mathematics', school=concordia)
>>> discrete_math.save()

# Enroll both john_smith and joe_blo in discrete_math
discrete_math.students.add(joe_blo)
discrete_math.students.add(john_smith)

First, let's look at values() queries:

>>> Class.objects.values()
[{u'id': 1, 'name': u'Software Engineering', 'school_id': 1}, {u'id': 2, 'name': u'Computer Engineering', 'school_id': 1}, {u'id': 4, 'name': u'Discrete Mathematics', 'school_id': 1}]

That's 3 dictionaries because we now have 3 Class objects. That seems fine. On the other hand, why don't students appear here?

So, let's specifically ask values() to include students:

>>> Class.objects.values('name', 'students')
[{'students': 1, 'name': u'Software Engineering'}, {'students': None, 'name': u'Computer Engineering'}, {'students': 1, 'name': u'Discrete Mathematics'}, {'students': 2, 'name': u'Discrete Mathematics'}]

That's 4 dictionaries even though we only have 3 Class objects (soen, coen, and discrete_math). Can you imagine what would happen if the Class object had more M2M fields? The number of dictionaries returned would probably explode exponentially!

Why aren't the students combined into a tuple or a list? I would expect the results to be a list of 3 dictionaries like this:

[
    {'students': (1,), 'name': u'Software Engineering'},
    {'students': (,), 'name': u'Computer Engineering'}, # Note the empty tuple for students, instead of None
    {'students': (1, 2), 'name': u'Discrete Mathematics'},
]

Now, let's look at values_list() queries:

>>> Class.objects.values_list()
[(1, 1, u'Software Engineering'), (2, 1, u'Computer Engineering'), (4, 1, u'Discrete Mathematics')]

3 tuples. Again, this seems fine except for the fact that students aren't included.

So, let's specifically ask values_list() to include students:

>>> Class.objects.values_list('name', 'students')
[(u'Software Engineering', 1), (u'Computer Engineering', None), (u'Discrete Mathematics', 1), (u'Discrete Mathematics', 2)]

4 tuples even though we have 3 Class objects only. I would expect the results to be a list of 3 tuples like this:

[
    (u'Software Engineering', (1,)),
    (u'Computer Engineering', (,)), # Note the empty tuple instead of None
    (u'Discrete Mathematics', (1, 2)),
]
Last edited 17 months ago by k_sze (previous) (diff)

comment:16 Changed 17 months ago by k_sze

To demonstrate how bad things become once you have multiple M2M fields, I created a new Language model and added it as a M2M field in the Class model:

from django.db import models

# other Model definitions skipped ...

class Class(models.Model):
    school = models.ForeignKey(School)
    name = models.CharField(unique=True, max_length=255)
    students = models.ManyToManyField('Student')
    available_in_languages = models.ManyToManyField('Language')
    
   
class Language(models.Model):
    name = models.CharField(max_length=255)

Now I create new languages and add them to the Discrete Mathematics class (which now has 2 students, if you recall):

# French
>>> french = Language(name='French')
>>> french.save()

# English
>>> english = Language(name='English')
>>> english.save()

# Add them to Discrete Mathematics class
discrete_math = Class.objects.get(name='Discrete Mathematics')
discrete_math.available_in_languages.add(french)
discrete_math.available_in_languages.add(english)

Now I make a values() query:

>>> Class.objects.values('available_in_languages', 'students')
[{'students': 1, 'available_in_languages': None}, {'students': None, 'available_in_languages': None}, {'students': 1, 'available_in_languages': 2}, {'students': 2, 'available_in_languages': 2}, {'students': 1, 'available_in_languages': 1}, {'students': 2, 'available_in_languages': 1}]

That's 6 dictionaries now!

Similarly, with values_list():

>>> Class.objects.values_list('available_in_languages', 'students')
[(None, 1), (None, None), (2, 1), (2, 2), (1, 1), (1, 2)]

6 tuples even though we only have 3 Class objects.

Last edited 17 months ago by k_sze (previous) (diff)

comment:17 Changed 17 months ago by anubhav9042

According to me the query generated for Class.objects.values('available_in_languages', 'students'):

SELECT `tic_22268_class_available_in_languages`.`language_id`, `tic_22268_class_students`.`student_id` FROM `tic_22268_class` LEFT OUTER JOIN `tic_22268_class_available_in_languages` ON (`tic_22268_class`.`id` = `tic_22268_class_available_in_languages`.`class_id`) LEFT OUTER JOIN `tic_22268_class_students` ON (`tic_22268_class`.`id` = `tic_22268_class_students`.`class_id`)

is correct as it yield the following result

language_id 	student_id
NULL 	           1
NULL 	           NULL
1 	           1
1 	           2
2 	           1
2 	           2

Although the result is correct as we want combination of available_in_languages and students for Class objects. However it is confusing.
I propose the following:
Whenever there is a case of M2M in values() or values_list(), we could introduce additional id into the query, which will help in making the result meaningful and understandable.

We make the query into:

SELECT `tic_22268_class_available_in_languages`.`language_id`, `tic_22268_class_students`.`student_id`, `tic_22268_class_students`.`id` FROM `tic_22268_class` LEFT OUTER JOIN `tic_22268_class_available_in_languages` ON (`tic_22268_class`.`id` = `tic_22268_class_available_in_languages`.`class_id`) LEFT OUTER JOIN `tic_22268_class_students` ON (`tic_22268_class`.`id` = `tic_22268_class_students`.`class_id`)

resulting in

language_id 	student_id 	id
NULL 	              1	        1
NULL  	           NULL 	2
1 	              1 	3
1 	              2 	3
2 	              1 	3
2 	              2 	3

The output also becomes:

[{'students': 1L, 'available_in_languages': None, 'id': 1L},
{'students': None, 'available_in_languages': None, 'id': 2L},
{'students': 1L, 'available_in_languages': 1L, 'id': 3L}, {'students': 2L, 'available_in_languages': 1L, 'id': 3L}, {
'students': 1L, 'available_in_languages': 2L, 'id': 3L}, {'students': 2L, 'available_in_languages': 2L, 'id': 3L}]

}}}

Now the result is as required, since we wanted the combination of values: 'available_in_languages' and 'students' along with 'id' of the Class object to which it belongs.

I am adding a diff.

Last edited 17 months ago by anubhav9042 (previous) (diff)

comment:18 Changed 17 months ago by anubhav9042

Added the diff.
Although I am not sure if we should add it at the place where we have added it.
Apart from the place in the diff, we could also add it in _setup_query of ValuesQuerySet.

Last edited 17 months ago by anubhav9042 (previous) (diff)

Changed 17 months ago by anubhav9042

comment:19 follow-up: Changed 17 months ago by k_sze

I would argue against solving the problem this way.

  1. Letting the number of results grow combinatorially seems questionable. According to the current documentation for values() and values_list(), each result is supposed to correspond to one model object. Granted, the documentation doesn't say that the query cannot return multiple results for each model object, it still feels wrong because, in the current behaviour, none of the results represents a coherent big picture of one model object; for each model object, the values are either getting duplicated or fragmented across multiple dictionaries or tuples.
  2. I suspect that including the id (or pk) still doesn't solve the ambiguity that arises when you query for a nullable field of an M2M field (e.g. students__year_of_birth). You still get None for both a) null year_of_birth or b) no student at all. I can't confirm this yet because the diff doesn't seem to work against Django 1.6.2 (what version does the diff target?).

By the way, the problem is not limited to M2M fields, but also to ForeignKeys; e.g. if you have a Province model and a City model, and the City has a ForeignKey on Province, and you do Province.objects.values('cities'), you get the same problem.

This is looking like a rabbit hole. :(

comment:20 Changed 17 months ago by erikr

The examples in comment:15 and comment:16 do seem odd. However, have a close look at #5768, where this behaviour seems to have been added. I haven't read it in detail, but there may be some explanation in there of the choices made.

comment:21 in reply to: ↑ 19 Changed 17 months ago by anubhav9042

Replying to k_sze:

I would argue against solving the problem this way.

  1. Letting the number of results grow combinatorially seems questionable. According to the current documentation for values() and values_list(), each result is supposed to correspond to one model object. Granted, the documentation doesn't say that the query cannot return multiple results for each model object, it still feels wrong because, in the current behaviour, none of the results represents a coherent big picture of one model object; for each model object, the values are either getting duplicated or fragmented across multiple dictionaries or tuples.
  2. I suspect that including the id (or pk) still doesn't solve the ambiguity that arises when you query for a nullable field of an M2M field (e.g. students__year_of_birth). You still get None for both a) null year_of_birth or b) no student at all. I can't confirm this yet because the diff doesn't seem to work against Django 1.6.2 (what version does the diff target?).

By the way, the problem is not limited to M2M fields, but also to ForeignKeys; e.g. if you have a Province model and a City model, and the City has a ForeignKey on Province, and you do Province.objects.values('cities'), you get the same problem.

This is looking like a rabbit hole. :(

Diff targets master.
Even if two Nones are returned from pk value we know that which corresponds to what. What should we give instead of None

comment:22 follow-ups: Changed 17 months ago by anubhav9042

Although I have another idea in mind:
We just add pk to the query and then when final result is obtained we can club the dicts/tuples having same pk and remove pk.

Thoughts??

I am suggesting changes this way becoz I think that the query generated is correct.

comment:23 in reply to: ↑ 22 Changed 17 months ago by k_sze

Replying to anubhav9042:

Although I have another idea in mind:
We just add pk to the query and then when final result is obtained we can club the dicts/tuples having same pk and remove pk.

Thoughts??

I am suggesting changes this way becoz I think that the query generated is correct.

And also remove the ones where the pk is None (due to the LEFT OUTER JOIN)?

Last edited 17 months ago by k_sze (previous) (diff)

comment:24 follow-up: Changed 17 months ago by k_sze

Another idea that I have, which is probably crazy:

Instead of retrieving the M2M or reverse foreign key values in the same SQL query, use a clever combination of:

  • prefetch_related()
  • subclassing dict and tuple

So values() and values_list return these subclassed dicts and tuples, which lazily construct and return the relevant M2M or reverse foreign key elements when they are accessed.

In the end, the subclassed dict and tuple would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.

Last edited 17 months ago by k_sze (previous) (diff)

comment:25 in reply to: ↑ 22 Changed 17 months ago by anubhav9042

Although I have another idea in mind:
We just add pk to the query and then when final result is obtained we can club the dicts/tuples having same pk and remove pk.

Thoughts??

I am suggesting changes this way becoz I think that the query generated is correct.

I tried to do this way, but clubbing those dics/tuples is not easy as they are not exactly dics/tuple rather QuerySets.
Will keep trying and post back soon

comment:26 in reply to: ↑ 24 ; follow-up: Changed 17 months ago by anubhav9042

In the end, the subclassed dict and tuple would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.

Will think
Although the same difficulty might come here as well.

comment:27 in reply to: ↑ 26 Changed 17 months ago by k_sze

Replying to anubhav9042:

In the end, the subclassed dict and tuple would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.

Will think
Although the same difficulty might come here as well.

My idea is for these subclassed dict and tuple to be used recursively. And then because they are constructed lazily using QuerySets, you won't get combinatorial explosions, so you won't have to do any clubbing.

comment:28 Changed 17 months ago by anubhav9042

For better discussion, I have opened a thread:
https://groups.google.com/forum/#!topic/django-developers/DAslY6GI1O8

comment:29 Changed 16 months ago by anubhav9042

As discussed on IRC and mailing list, the solution has come out to be a doc update for this behaviour rather than tweaking the code any further.

comment:30 Changed 16 months ago by timo

  • Component changed from Database layer (models, ORM) to Documentation
  • Type changed from Bug to Cleanup/optimization
Note: See TracTickets for help on using tickets.
Back to Top