== The new Options API proposal

As part of my 2014 Summer of Code project, my second deliverable is a refactored, working implementation of the Options API. The Options API is at the core of Django: it lets the rest of the system introspect Django models. This is how lookups, queries, forms and the admin understand the capabilities of every model. The Options API is exposed through the _meta attribute of each model class.

Options has always been a private API, but Django developers have long used it in their projects unofficially. This is dangerous: because there is no official API, Options could change at any time and break other people's code. Options also had no unit tests, even though the entire system relies on it to work correctly. My Summer of Code project is about understanding and refactoring Options to make it a tested, official API that Django itself and other developers can use.

=== Current state of the API

I now have a working and tested implementation of Options, which I have reduced to 2 main endpoints. Because Options needs to be very fast, I also added accessors for the most common calls (both endpoints are cached, but avoiding function calls makes access faster still). Each accessor is a cached property and is computed, using the new API, on first access.
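The accessor pattern can be illustrated with a minimal sketch. It is not the code from the PR: ToyOptions and get_fields_calls are hypothetical names, the fields are plain strings, and functools.cached_property from the standard library stands in for Django's own cached_property helper.

```python
from functools import cached_property


class ToyOptions:
    """Toy stand-in for a model's _meta object (not Django's real class)."""

    def __init__(self, field_names):
        self._field_names = tuple(field_names)
        self.get_fields_calls = 0  # counts how often the general endpoint runs

    def get_fields(self):
        # Stand-in for the general endpoint; the real one takes many flags.
        self.get_fields_calls += 1
        return self._field_names

    @cached_property
    def fields(self):
        # Computed through get_fields() on first access, then stored on the
        # instance so later reads are plain attribute lookups.
        return self.get_fields()


opts = ToyOptions(["id", "username", "email"])
first = opts.fields   # triggers get_fields()
second = opts.fields  # served from the instance, no function call
```

After the first read, the value lives in the instance dictionary, so the second read never reaches get_fields.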

I am planning to release the following in the attached PR:
 - Unit tests for the new Meta API
 - The new Meta API
 - The implementation of the new API throughout django and django.contrib
 - Documentation

=== Concepts

==== Field types

There are 5 main types of fields:

===== Data fields

A data field is any field that has a column in the database, for example a CharField, a BooleanField or a ForeignKey.

{{{
class Person(models.Model):
    data_abstract = models.CharField(max_length=10)
}}}

===== M2M fields

An M2M field is a ManyToManyField that is defined on the current model.

{{{
class Person(models.Model):
    friends = models.ManyToManyField('self', related_name='friends', symmetrical=True)
}}}

===== Related Object

A Related Object is a one-to-many relation from another model (such as a ForeignKey) that points to the current model.

{{{
class City(models.Model):
    name = models.CharField(max_length=100)

class Person(models.Model):
    # ForeignKey pointing at City
    city = models.ForeignKey(City)
}}}

In this case, City has a related object from Person.

===== Related M2M

A Related M2M is an M2M relation from another model that points to the current model.

{{{
class City(models.Model):
    name = models.CharField(max_length=100)

class Person(models.Model):
    # M2M fields
    cities_lived_in = models.ManyToManyField(City)
}}}

In this case, City has a related m2m from Person.

===== Virtual

Virtual fields do not necessarily have a column in the database. They are "Django fields" such as a GenericForeignKey.

{{{
from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType

class Person(models.Model):
    content_type = models.ForeignKey(ContentType, related_name='+')
    object_id = models.PositiveIntegerField()
    item = GenericForeignKey('content_type', 'object_id')
}}}

GenericForeignKey uses 'content_type' and 'object_id' to keep track of which model type and id item is set to, but item itself has no concrete presence in the database. In this case, item is a virtual field.

==== Field options

Each field can have the following properties:

===== Local

A local field is a field that is defined on the model itself rather than inherited from a parent model.
Fields from models that directly inherit from abstract models or proxy classes are still local.

{{{
class Person(models.Model):
    name = models.CharField(max_length=50)

class Londoner(Person):
    overdraft = models.DecimalField(max_digits=10, decimal_places=2)
}}}

Londoner has two fields (name and overdraft) but only one local field (overdraft).

===== Hidden

The hidden property applies only to related objects and related m2m. When a relational field (such as a ManyToManyField or a ForeignKey) specifies a related_name that is "+" or ends with "+", it tells Django not to create a reverse relation.

{{{
class City(models.Model):
    name = models.CharField(max_length=100)

class Person(models.Model):
    city = models.ForeignKey(City, related_name='+')
}}}

City has a hidden related object from Person (you can't access person_set).

===== Concrete

Concrete fields are fields that have a database column.

===== Proxied relations

Proxied relations are relations that point to a proxy of a model.

{{{
class Person(models.Model):
    pass

class ProxyPerson(Person):
    class Meta:
        proxy = True

class RelationToProxy(models.Model):
    proxy_person = models.ForeignKey(ProxyPerson)
}}}

In this case, Person has no related objects, but it has 1 proxied related object from RelationToProxy.

=== The new API

The new API is composed of 2 main functions: get_fields and get_field.

===== get_fields

{{{
def get_fields(self, m2m=False, data=True, related_m2m=False, related_objects=False,
               virtual=False, include_parents=True, include_non_concrete=True,
               include_hidden=False, include_proxy=False, export_map=False):
}}}

get_fields takes a set of flags as parameters and returns a tuple of field instances. Any combination of flags is allowed, although some have no effect (such as include_proxy combined with only data or m2m). get_fields is cached internally for speed, and it is a recursive function that collects fields from each parent of the model.
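The combination of recursion and caching can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's implementation: ToyOptions is a hypothetical class, fields are plain strings, and only three of the flags (m2m, data, include_parents) are modelled.

```python
class ToyOptions:
    """Hypothetical stand-in for Options; only three flags are modelled."""

    def __init__(self, local_data, local_m2m=(), parents=()):
        self.local_data = tuple(local_data)
        self.local_m2m = tuple(local_m2m)
        self.parents = tuple(parents)  # ToyOptions of each parent model
        self._cache = {}               # one entry per flag combination

    def get_fields(self, m2m=False, data=True, include_parents=True):
        key = (m2m, data, include_parents)
        try:
            return self._cache[key]
        except KeyError:
            pass

        fields = []
        if include_parents:
            # Recurse into every parent first so inherited fields come first.
            for parent in self.parents:
                for field in parent.get_fields(m2m=m2m, data=data):
                    if field not in fields:
                        fields.append(field)
        if data:
            fields.extend(self.local_data)
        if m2m:
            fields.extend(self.local_m2m)

        result = self._cache[key] = tuple(fields)
        return result


person = ToyOptions(["id", "name"], local_m2m=["friends"])
londoner = ToyOptions(["overdraft"], parents=[person])
```

Because the flags are booleans, the cache key space is small and finite, so a plain dictionary is enough here.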

An example of every (sane) combination of flags will be available in the model_meta test suite that I will ship with the new API.

The 'export_map' flag is only used internally (by get_field) and is not part of the public API. 'export_map=True' returns an OrderedDict with fields as keys and a tuple of strings as values. The keys are exactly the fields that 'export_map=False' would return, and each tuple of values contains every possible lookup name for that field. This is used to build a fast lookup table for get_field and to avoid re-iterating over every field to pull out every possible name.

{{{
>>> User._meta.get_fields() # Only data by default
(, , , , , , , , , , )

>>> User._meta.get_fields(data=False, related_objects=True) # only related_objects
(,)

>>> User._meta.get_fields(data=False, related_objects=True, include_hidden=True) # only related_objects including hidden
(, , )
}}}

===== get_field

{{{
def get_field(self, field_name, m2m=True, data=True, related_m2m=False, related_objects=False, virtual=False)
}}}

'get_field' returns a field instance for a given field name. field_name can be any of name, attname and related_query_name. get_field is recursive by default and does not include any hidden or proxied relations. If the given name is not found, it raises a FieldDoesNotExist error. 'get_field' is cached internally and gets all its field information from 'get_fields'.

NOTE: There is an inconsistency between the defaults of get_field and get_fields. 'get_fields' enables only data fields by default, while 'get_field' enables data and m2m. This is because of backwards-compatibility issues (read more below).

{{{
>>> User._meta.get_field('username') # A data field

>>> User._meta.get_field('logentry', related_objects=True) # A related object

>>> LogEntry._meta.get_field('user') # ForeignKey can be queried by field name

>>> LogEntry._meta.get_field('user_id') # .. and also by database column name

>>> User._meta.get_field('does_not_exist') # A non existent field
*** FieldDoesNotExist: User has no field named 'does_not_exist'
}}}

=== The Decision Process

Since I started my Summer of Code project, this API has gone through several designs and has now settled on the one shown above. Each decision has been discussed with my mentor, Russell, with whom I have weekly meetings.

==== Using bitfields as flags

get_field and get_fields were originally designed to work with bitfields. The main reason was that there are many options, and bitfields avoided a long list of keyword flags. The original bitfield API was:

{{{
# Field types
DATA = 0b00001
M2M = 0b00010
RELATED_OBJECTS = 0b00100
RELATED_M2M = 0b01000
VIRTUAL = 0b10000

# Aggregates
NON_RELATED_FIELDS = DATA | M2M | VIRTUAL
ALL = RELATED_M2M | RELATED_OBJECTS | M2M | DATA | VIRTUAL

# Options
NONE = 0b0000
LOCAL_ONLY = 0b0001
CONCRETE = 0b0010
INCLUDE_HIDDEN = 0b0100
INCLUDE_PROXY = 0b1000

def get_fields(types, opts)
}}}

There are several reasons why we backed away from this design:
 1. The flags always have to be imported from models/options, which can lead to circular dependencies
 1. Importing flags all the time is a nuisance
 1. Importing flags is not Pythonic at all

The decision taken was to port 'get_field' and 'get_fields' to boolean keyword flags. A port of the old implementation is still available here if you are interested: https://github.com/PirosB3/django/blob/soc2014_meta_refactor_upgrade/django/db/models/options.py

==== Removed direct, m2m, model

In the previous API, it was a common pattern to return model, direct (bool), m2m (bool) alongside the field. I soon realized that not only can these three values be easily derived from a field_instance, but very few places actually used them (there is only 1 place where m2m is used). The decision taken was to drop direct, m2m, model from the return type and only keep field_instance.
All the rest will be derived if needed.

==== Removed all calls "with_model"

As said previously, it is redundant to return the model, as it can be derived from the field.

==== Removed the need of multiple maps

The previous implementation relied on many different cache maps internally. Caching is necessary, but many maps tend to increase the risk of bugs when the cache has to be expired. For this reason, my implementation relies on only 2 cache tables, and I have added a specific function that makes cache expiry easy (_expire_cache). The downside is that we cache a bit more naively (there are fewer layers of caching), but benchmarks show no real decrease in performance.

==== Used internal caching instead of lru_cache

Our first approach to caching was to use 'functools.lru_cache'. 'lru_cache' is a simple decorator that provides a cache with an expiry function built in. It worked correctly with the new API, but cProfile quickly showed that a lot of computing time was spent inside lru_cache itself. The decision taken was to drop 'lru_cache' in favour of a simpler caching strategy. This is also because we don't really need the LRU part of 'lru_cache': there is only a finite number of flag combinations that can be called.

==== Use cached_properties when possible

Function calls are expensive in Python, so all sensible attributes that take no arguments have been turned into cached properties. A cached property is a read-only property that is calculated on demand and automatically cached. If the value has already been calculated, the cached value is returned.
Cached properties avoid a new stack frame and are used for fast access to fields, concrete_fields, local_concrete_fields, many_to_many and field_names.

==== Enabled m2m fields by default on get_field

The old get_field API was defined as follows:

{{{
def get_field(self, field_name, many_to_many=True)
}}}

Our first iteration of the API refactored this as:

{{{
def get_field(self, field_name, include_related=True)
}}}

This was done for 2 reasons:
 1. We managed to squash 2 functions (get_field and get_field_by_name) into 1 single call.
 1. I could not find any reason for the many_to_many flag to exist: there can never be a data field and an m2m field with the same name. It looked like a legacy parameter that was never removed (turning it off did not break any tests).

It turned out that the many_to_many flag existed for a special validation case that was not documented anywhere. Russell helped me look for edge cases, and I finally came up with a failing test case: https://github.com/django/django/pull/2893. The test case fails on the new API but succeeds on master.

Our final iteration was to add all the field types as flags to get_field. By making m2m the first flag, we avoid breaking existing code and keep the signature similar to the 'get_fields' API.

=== Performance

Throughout my project I have kept an eye on performance, looking for bottlenecks with cProfile and other benchmarking tools. I am happy to say that there has been no major decrease in speed; in fact, the new implementation includes a couple of optimizations that were not present in the old system. That said, I prefer not to comment on performance and just show the benchmarks. It will be up to the core team to decide whether this is feasible.

=== Main optimization points

==== Compute inverse relation map on first access

In order to find related objects, the current implementation does the following:

{{{
for each model in apps
    for each field in model
        if field is a related object:
            if field is related to self:
                add to related_objects
}}}

REF: https://github.com/django/django/blob/master/django/db/models/options.py#L488

This tends to be expensive: it results in O(models * fields) complexity. We can increase performance by computing an inverse relation map on first access. This is done only **once**, not once per model (https://github.com/PirosB3/django/blob/soc2014_meta_refactor_upgrade_flags_get_field/django/apps/registry.py#L176). This gives us a map of { model : [related_object, related_object, ..] }, and a hash lookup in it is O(1) (https://github.com/PirosB3/django/blob/soc2014_meta_refactor_upgrade_flags_get_field/django/db/models/options.py#L423).

==== Benchmarks

Here is a table of benchmark results. It compares soc2014_meta_refactor_upgrade_flags_get_field (68dc11708eb2170540729b71db6bcaf4c46d6504) against django/master. Djangobench: each number is the median of 2000 trials.

https://gist.github.com/PirosB3/35a9231ee0214427321d

==== Backwards compatibility

All previous _meta functions will remain backwards-compatible and will raise a DeprecationWarning.

==== Next Steps

 - Feedback
 - Code cleanup and code documentation
 - Django manual documentation
 - Decide deprecation strategy
 - More benchmarking
 - Merge
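
Returning to the inverse relation map described under "Compute inverse relation map on first access", the idea can be sketched in plain Python. Everything here is hypothetical and only for illustration: MODELS is a made-up registry of model names and string fields, and the function names are not those used in Django or in the PR.

```python
from collections import defaultdict

# Hypothetical registry: model name -> {field name: target model or None}.
MODELS = {
    "City": {"name": None},
    "Person": {"name": None, "city": "City"},
    "Company": {"title": None, "hq": "City", "owner": "Person"},
}


def related_objects_scan(target):
    """Per-model scan: walks every field of every model on each call."""
    return [
        (model, field)
        for model, fields in MODELS.items()
        for field, to in fields.items()
        if to == target
    ]


def build_relation_map():
    """Single pass over all models, grouping relations by target model."""
    relation_map = defaultdict(list)
    for model, fields in MODELS.items():
        for field, to in fields.items():
            if to is not None:
                relation_map[to].append((model, field))
    return dict(relation_map)


RELATION_MAP = build_relation_map()  # built once, shared by every model


def related_objects(target):
    # A dictionary lookup replaces the nested loops above.
    return RELATION_MAP.get(target, [])
```

Both functions return the same relations for every model; the difference is that the scan repeats the nested loops on each call, while the map pays that cost a single time.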