Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#31051 closed Bug (fixed)

Serialization dependency sorting disallows circular references unneccesarily.

Reported by: Matthijs Kooijman Owned by: Matthijs Kooijman
Component: Core (Serialization) Version: dev
Severity: Normal Keywords:
Cc: Matthijs Kooijman Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The core.serialization.sort_dependencies() function takes a list of apps and/or models, and resolves this into a sorted flat list of models, ready to be serialized in that order. This function is intended to make natural foreign keys work, by serializing models referenced by a natural key before the referencing models. When deserializing, this guarantees that natural key references can be resolved, because there are no "forward references". Furthermore, when a circular reference using natural keys is present, this function raises an exception (e.g. "Can't resolve dependencies for some_app.SomeModel in serialized app list") and prevents serialization from completing, since there is no way to guarantee a model ordering that will have no forward references.

Note that this ordering is *only* needed when natural keys are involved, since data is intended to be loaded in a transaction without constraint checks, so numerical foreign keys can be added in the wrong order, as long as all referenced data is present at the end of the transaction. This does not work with natural keys, since those are resolved by Python code that needs the referenced objects present in the database to resolve them.

However, this sorting is not actually strictly necessary in all cases where it is applied. When circular references are involved, this then actually prevents serialization for no good reason. In particular, this is the case:

  • When running dumpdata without natural keys enabled (which is the default). Even though natural keys might be defined in the models (which causes the sorting and exception), no natural keys will be present in the dumped data, so no ordering is needed.
  • When dumping data intended for loading with loaddata (which I think is the primary usecase for dumpdata?). loaddata will (since 17 months ago in v2.2, see #26291) automatically handle forward references by deferring setting fields that reference natural keys that are not added yet. In this case, sorting is still useful, to prevent forward references where possible, but when there are circular references, it is acceptable to ignore some dependencies rather than preventing serialization from happening alltogether.
  • When serializing data for tests for serialized_rollback=True (in django.db.backends.base.creation.create_test_db). This is a serialization that does not use natural keys, so no ordering is needed at all. Note that this serialization happens always (unlike deserialization only happens with serialized_rollback=True), so AFAIU this effectively prevents *any* tests from working on a database with circular references with natural keys defined.

The fix for these issues seems to be rather simple:

  • For dumpdata without use_natural_foreign_keys, skip the ordering and just serialize all models in arbitrary order. AFAICS use_natural_primary_keys is not relevant here, since that only controls omitting the numerical primary key.
  • For dumpdata *with* use_natural_foreign_keys, do the ordering but do not bail out when there are circular references (instead just ignore some dependencies and produce a best-effort ordering).
  • For test database serialization, also skip the ordering and serialize in arbitrary order.

Note that this would remove two of the three calls to sort_dependencies() and allow loops in the last remaining instance. This means that sort_dependencies could be modified to allow loops unconditionally, or we could add an argument and default to disallowing loops in case any code outside of django is using this function?

Note that #26552 is a related, but different issue, concerning the *deserialization* of data in testcases.

I've been working on fixing this and that related issue today and have a basic version working, with testcases (which proved to be quite a challenge, since testing the test subsystem is a bit tricky...). I'll do some additional testing and cleanup and submit a PR soon.

Also note that the circular-reference exception was already disabled for self-referencing models in #16317. The fix for that issue simply ignores self-referencing models for sorting, without taking any additional measures to sort instances to prevent problems in deserialization (this code was added when the deferred deserialization did not exist yet), so I wonder how much value this exception still has.

Change History (12)

comment:1 by Matthijs Kooijman, 5 years ago

Cc: Matthijs Kooijman added
Owner: changed from nobody to Matthijs Kooijman
Status: newassigned

comment:2 by Carlton Gibson, 5 years ago

Triage Stage: UnreviewedAccepted

Hi Matthijs. Yes, OK, let's accept this on the basis of your description — Happy to see a PR tackling it!

...with testcases (which proved to be quite a challenge, since testing the test subsystem is a bit tricky...).

Yes. That's the rub. :)

comment:3 by Matthijs Kooijman, 5 years ago

Has patch: set

A PR is available at https://github.com/django/django/pull/12166. It is still marked as draft, since there are still some issues with running the tests, but review of the code, additions to the discussion in the PR and maybe ideas on how to fix the remaining testsuite issues are welcome already.

comment:4 by Mariusz Felisiak, 5 years ago

Patch needs improvement: set

Marking as Patch needs improvement, because it's a draft PR.

comment:5 by Matthijs Kooijman, 5 years ago

Patch needs improvement: unset

The issues with the testcases have been resolved, so the patch is ready for review.

comment:6 by Mariusz Felisiak, 5 years ago

comment:7 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

In 289d0ec6:

Refs #31051 -- Fixed reloading the database with circular related objects and natural keys for tests.

Made deserialize_db_from_string() do not sort dependencies.

deserialize_db_from_string() doesn't use natural keys, so there is no
need to sort dependencies in serialize_db_to_string(). Moreover,
sorting models cause issues for circular dependencies.

comment:8 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

In 75520e17:

Refs #31051 -- Optimized serialize_db_to_string() by avoiding creation of models list.

comment:9 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

In 481d8fc3:

Refs #31051 -- Added test for loaddata/dumpdata with circular references without natural keys.

comment:10 by Mariusz Felisiak, 5 years ago

Summary: Serialization dependency sorting disallows circular references unneccesarilySerialization dependency sorting disallows circular references unneccesarily.
Triage Stage: AcceptedReady for checkin

comment:11 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

Resolution: fixed
Status: assignedclosed

In 4f216e4f:

Fixed #31051 -- Allowed dumpdata to handle circular references in natural keys.

Since #26291 forward references in natural keys are properly handled by
loaddata, so sorting depenencies in dumpdata doesn't need to break on
cycles. This patch allows circular references in natural keys by
breaking sort_depenencies() on loops.

comment:12 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

In 2e67e80:

Refs #31051 -- Made dumpdata do not sort dependencies if natural foreign keys are not used.

There is no need to sort dependencies when natural foreign keys are not
used.

Note: See TracTickets for help on using tickets.
Back to Top