Opened 7 months ago

Closed 7 months ago

#35434 closed Cleanup/optimization (wontfix)

prefetch_related_objects fails to cache UUID FKs when the string representation of a UUID is used

Reported by: Selcuk Ayguney Owned by:
Component: Documentation Version: dev
Severity: Normal Keywords:
Cc: Selcuk Ayguney, Simon Charette Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Selcuk Ayguney)

Consider the following models:

class Pet(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    name = models.CharField(max_length=20)


class Toy(models.Model):
    pet = models.ForeignKey(Pet, models.CASCADE, related_name="toys")
    name = models.CharField(max_length=20)

The following works and caches the deferred pet instance as expected:

pet = Pet.objects.create(name="Fifi")
toy_bone = Toy(pet_id=str(pet.id), name="Bone")
print(toy_bone.pet)

It fails if prefetch_related_objects is used:

pet = Pet.objects.create(name="Fifi")
toy_bone = Toy(pet_id=str(pet.id), name="Bone")
prefetch_related_objects([toy_bone], "pet")
print(toy_bone.pet)
    raise self.RelatedObjectDoesNotExist(
    ^^^^^^^^^^^^^^^^^
prefetch_related.models.Toy.pet.RelatedObjectDoesNotExist: Toy has no pet.

The behaviour is inconsistent as str to UUID conversion is automatically performed in the first example.

One may argue that the use of prefetch_related_objects is redundant in this case. However, it is useful in async code where deferred lookups are not automatic.

Unit test to reproduce the issue and the suggested fix:
https://github.com/django/django/pull/18132

Change History (15)

comment:1 by Selcuk Ayguney, 7 months ago

Description: modified (diff)

comment:2 by Selcuk Ayguney, 7 months ago

Description: modified (diff)

comment:3 by Selcuk Ayguney, 7 months ago

Has patch: set

comment:4 by Jacob Walls, 7 months ago

Owner: changed from nobody to Selcuk Ayguney
Status: newassigned
Triage Stage: UnreviewedAccepted

I ran into a symptom of this problem recently and didn't get far enough to diagnose the root cause.

I was prefetching at two levels, so my hack was to just iterate the objects and reselect the object at the intermediate level (reintroducing N+1 queries at that level), but at least allowing me to avoid N+1 queries at the second level.

Thanks for the report.

comment:5 by Simon Charette, 7 months ago

Cc: Simon Charette added

Have we performed any form of benchmarks to assess the impact of calling to_python for each sides of the prefetching bridge? Field.to_python can perform non-trivial validation and instance checks that quickly add up when dealing with large collection of objects.

Given prefetch_related_object is normally called from a prefetch_related queryset where both sides are already guaranteed to be of the right type and serializers/forms are meant to be turn representations of objects to the expected field types at input boundaries it seems like this change could do more harm than good both from a performance and areas of concerns violation perspective.

My question to you would be why you are assigning the unexpected type to field in the first place (the pet_id=str(pet.id))? Plenty of things subtly break if you assign the string representation of a DateField, DecimalField, FloatField, etc to a model instance field instead of the proper type so I'm not sure how this particular case is special. If you have to assign such properties I believe you should call .full_clean() / clean_fields() on the model instance and be ready to deal with any associated ValidationError.

I'm more of the opinion that we should wont-fix this issue than proceed here.

comment:6 by Selcuk Ayguney, 7 months ago

True, I already implemented a similar workaround in the codebase I was working on when I spotted this issue, and honestly I'm not sure if this is a wontfix or a legitimate bug. I haven't performed any benchmark either. My rationale was that the usual Django deferred lookup works when a str is used, but it only fails when prefetch_related_objects is used, so this looked like the more consistent behaviour. I don't think other types such as DecimalField etc are commonly used as primary keys.

In my real world example the model instance is instantiated from a JSON object in a generic way, and foreign keys are set using the attnames where it is not possible to know the actual data type.

comment:7 by Simon Charette, 7 months ago

My rationale was that the usual Django deferred lookup works when a str is used, but it only fails when prefetch_related_objects is used, so this looked like the more consistent behaviour.

I understand where that could be perceived as inconsistent but I would argue that it's undefined behaviour in both cases. It only happens to work in the lazy loading case because it incurs a query that defers the lookup to the backend.

I don't think other types such as DecimalField etc are commonly used as primary keys.

Fair point, it remains unnecessary while arguably faster validation for AutoField, UUIDField`, and others though.

In my real world example the model instance is instantiated from a JSON object in a generic way, and foreign keys are set using the attnames where it is not possible to know the actual data type.

If you are turning a raw JSON object to a model instance I think that you should be expected to call clean_fields at the very least. That's what serialization layers are usually for; turn data into their proper model equivalent.

Last edited 7 months ago by Simon Charette (previous) (diff)

comment:8 by Selcuk Ayguney, 7 months ago

It only happens to work in the lazy loading case because it incurs a query that defers the lookup to the backend.

Agreed. Granted, it is still because of the database backend, but strings can be used interchangeably with UUIDs in almost everywhere, for example Toy.objects.filter(pk="00000000-...").exists(). This makes the behaviour in this ticket even more peculiar.

That's what serialization layers are usually for; turn data into their proper model equivalent.

Sure thing. As I mentioned I have already changed my code to validate the instance, but this behaviour took me by surprise and it looked like it violates the least astonishment principle.

Edit: Apparently the same issue affects Jacob Walls in a different way.

Last edited 7 months ago by Selcuk Ayguney (previous) (diff)

comment:9 by Jacob Walls, 7 months ago

I understand where that could be perceived as inconsistent but I would argue that it's undefined behaviour in both cases. It only happens to work in the lazy loading case because it incurs a query that defers the lookup to the backend.

Interesting. If pet_id=str(pet.id) is undefined, then I'm looking at a huge yak shave to audit my projects for UUIDs represented as strings, which I'm pretty sure we're doing in every view, test, model, migration...

comment:10 by Simon Charette, 7 months ago

There's a difference between lookup against the database (e.g. filter(pet_id=str(pet.id))) which performs implicit conversions and direct model attribute assignments. I would argue that if you explicitly assign the string representation of objects meant to be stored in a field to a model instance you are effectively doing things wrong.

Version 0, edited 7 months ago by Simon Charette (next)

comment:11 by Jacob Walls, 7 months ago

Component: Database layer (models, ORM)Documentation
Has patch: unset
Owner: Selcuk Ayguney removed
Status: assignednew
Triage Stage: AcceptedUnreviewed
Type: BugCleanup/optimization

Thanks, that helps me understand the severity of the anti-pattern.

Now I'm aware that despite the implicit type-conversion you get on the database-side, python-side operations like == or obj in queryset become unsafe when a stringified UUID is assigned to a model attribute.

Or even queryset.contains(), but only if called on an evaluated queryset.

In [1]: from models import GraphModel

In [2]: original = GraphModel.objects.first()

In [3]: unsafe = GraphModel(str(original.pk))

In [4]: all_graphs = GraphModel.objects.all()

In [5]: all_graphs.contains(unsafe)
Out[5]: True

In [6]: len(all_graphs)
Out[6]: 1

In [7]: all_graphs.contains(unsafe)
Out[7]: False

I think a version of that REPL would be useful in the docs somewhere. Calling your query tested by testing as far as line 5 but not line 7 is something I can see happening.

Selcuk, is it okay if I reframe your ticket as a documentation request and ask for another opinion?

in reply to:  11 comment:12 by Selcuk Ayguney, 7 months ago

Replying to Jacob Walls:

Selcuk, is it okay if I reframe your ticket as a documentation request and ask for another opinion?

It's perfectly fine. Simon has a good point, and I agree that it is easy to handle this on the application side.

comment:13 by Sarah Boyce, 7 months ago

Resolution: needsinfo
Status: newclosed

Hi all, currently closing as "needsinfo". It's not clear to me where in the docs you're hoping to add something or what would have been useful in the docs for this case. Can you share an idea of what you have in mind?

comment:14 by Jacob Walls, 7 months ago

Has patch: set
Resolution: needsinfo
Status: closednew

in reply to:  10 comment:15 by Natalia Bidart, 7 months ago

Resolution: wontfix
Status: newclosed

I have reviewed the PR and as mentioned there I'm not comfortable with the adding, I think it's misleading and could be interpreted it as a use case that Django supports when it does not. In my view, the example line doing DailyEntry(day="2020-01-01", headline="Happy New Year!") goes in direct opposition of what Simon said here:

There's a difference between lookup against the database (e.g. filter(pet_id=str(pet.id))) which performs implicit conversions and direct model attribute assignments. I would argue that if you explicitly assign the string representation of objects meant to be stored in a field to a model instance you are effectively doing something wrong.

Note: See TracTickets for help on using tickets.
Back to Top