#19385 closed New feature (fixed)
Add support for multiple-column join
Reported by: | cseibert | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.4 |
Severity: | Normal | Keywords: | join |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
For enterprise multi-tenant databases, a typical table schema might have both company_id and record_id, where record_id is only unique within that company_id. There is already an issue open for multi-column primary keys; #373 (ticket).
Even without support for multi-column primary key, it would be useful to be able to do multi-column joins, so we can group records in the same company on the same page in the index.
SELECT * FROM app_record a JOIN app_user b ON b.id = a.user_id AND b.company_id = a.company_id;
My team is in the process of writing such a patch for Django 1.4.
https://github.com/jtillman/django/tree/MultiColumnJoin
In this patch, the ForeignKey field accepts an additional parameter named 'include_related'. This allows you to dictate what fields are related to each other on a model. It takes a list of tuples which contains the left and right hand property names.
class User(models.Model): name = models.CharField(max_length=128) company = models.ForeignKey(Company) class Record(models.Model): name = models.CharField(max_length=128) company = models.ForeignKey(Company) user = models.ForeignKey(User, include_related=[('company', 'company')])
There are some situations that are not implemented yet, such as pre-fetch.
We're looking for feedback on the idea, feedback on the implementation and feedback on how to structure the patch to maximize palpability to the core devs. We realize that this is unlikely to be merged.
Change History (36)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
I've found that the major challenge here is the global assumption that model relations will always have single column comparisons. The majority of the patch tries to change this assumption by directing all relations to depend on ForeignKey.get_related_columns in some fashion. If the 'include_related' argument was removed and the premise of get_related_columns was more wide spread, would this seem more of a reasonable patch to enable the subclassing scenarios?
comment:3 by , 12 years ago
Triage Stage: | Unreviewed → Accepted |
---|
IMO it is a good plan to lift the assumption about single column joins from the ORM internals.
There are some cases the patch doesn't seem to deal with (split_exclude() -> subquery support, __in
lookups, there are probably more). But, we can remove the single column limitations one at a time. We don't have to guarantee that multicolumn joins work in all cases.
We can't do these changes to 1.4. 1.5 is also out, doing big changes to sql/query.py internals this close to release is IMO a bad idea - even if the sql/query.py is 100% private API we should try to keep it stable from RC stage on. In master we are free to change sql/* if need be.
But, you don't want to write a patch for master just yet, #10790 is going to change setup_joins() a lot, so there will be a lot of conflicts.
I am marking this as accepted in the meaning that we should do incremental work and remove assumptions about single-column joins from the ORM internals. We should not aim for public API for multicolumn primary keys & joins in this ticket, #373 deals with that.
We should also check what koniiiik's summer of code work has to offer (https://github.com/koniiiik/django/tree/soc2011/composite-fields). There is also another branch for foreign keys: https://github.com/koniiiik/django/tree/auxiliary_fk_fields.
comment:4 by , 12 years ago
I think Anssii, is correct that the right solution is to support "virtual fields" (and with the multi column primary keys), and have that work with foreign keys. However, there could be an argument made that ForeignKey needs some support of its own, specifically with the to_field
otpion, which lets you point a foreign key at something besides a primary key, it's not unreasonable to say that to_field
should accept a list of fields. I suspect however that doing this correctly would still require most of the work of #373, so it should happen afterwards.
comment:5 by , 12 years ago
I am somewhat interested in the idea of doing the "ON ..." clause generation in the field. This would allow for a lot of freedom in what the join clause would be and would be fairly straightforward to implement. As of commit of #10790 the get_extra_join_sql() already allows adding custom stuff to the join clause, but not altering it totally.
It seems that for generic foreign keys we will need the ability to generate fully custom ON clauses for multicolumn foreign key support (see #373, comment 111). And, this SQL will need to be backend specific.
For now I see two basic approaches:
- Just change the lhscol, rhscol in query datastructures to lists, that is lhscols, rhscols. Deal with the generic foreign key in some other way (basically, add a hack for it).
- Generate the SQL in the field (get_join_clause() or somesuch). This might need to just postpone the clause creation to the used connection. This way generic foreign keys wouldn't be a special case for the compiler, and it would be possible to implement exotic join clauses in custom fields (time range join restrictions for example).
I guess we can't implement the ForeignKey(to_field=a list of fields) just yet, there are some complications along that path - mainly how do we represent the ForeignKey in the model._meta as it doesn't have just a single backing field. I guess #373 will need to answer that part.
Good ideas welcome. Testing different things will likely be a good solution here.
comment:6 by , 12 years ago
Yes, this is clearly going to be non-trivial under any circumstances, for example field__in
can no longer generate a SQL IN
clause, it needs to be a chain of OR
s.
comment:7 by , 12 years ago
Yes, I bet there are a bunch of places which will not work, not least subquerying support.
Anyways, I have created a patch for generating the ON clause in the field itself. It seems to work pretty nicely, and is IMO a minor cleanup to current code. While there I spotted some other cleanups, too, which are included in the patch. See: https://github.com/akaariai/django/compare/ticket_19385
comment:8 by , 12 years ago
Some more work... I moved names_to_path path generation to Field.get_path_info(). This seems to be a really nice cleanup and also makes the abstraction between fields and the ORM a lot better.
I also added multicolumn join tests. The test models are _really_ ugly because the model._meta doesn't have any support for custom relation fields. Even with this limitation the basic joining seems to work. There isn't any support for select_related, defer or stuff like that, and there isn't support for customrelfield=val filters. Still, to me the tests seem like a strong indication that the work is on the right path. The interface needed for basic join support is really simple: get_path_info() and get_join_sql().
In my opinion the next step would be to add proper ._meta support for custom relation fields. This would likely mean cherry-picking stuff from #373. Then the next steps would be to find problematic queries in ORM, fix em, repeat. Finally make sure other parts of Django support multicolumn fields.
I am wondering if it might be a good idea to do some polish to what I have now and commit it. To me it seems the code in https://github.com/akaariai/django/compare/ticket_19385 is a step forward even if multicolumn joins aren't considered at all.
comment:9 by , 12 years ago
Patch needs improvement: | set |
---|
Now with basic select_related and order_by support. The field <-> ORM interface contains a bit more stuff if you want to support select_related(), but it isn't huge by any standard:
- four attributes: model, null, unique, name
- get_path_info() and get_join_sql() as described in above comments
- select_related_descent() - should this field be descended into? Parameters contain info about what is requested.
- get_cache_name() and get_related_cache_name() - when doing select_related fetches to what attribute should the fetched value be saved.
However, I had to do a lot of hackery to fake model._meta support in tests. So, model._meta refactor would be needed next.
Note that the goal is not to make the Field <-> ORM interface public anytime soon. I am afraid making it stable would make it hard to do further work on the ORM.
As for IN
support etc - it seems there could be some hard problems awaiting in that path regarding partially NULL composite values and so on.
I think polishing the current work and then pushing the work to master could be a good plan... But, enough of this ticket for now, time to concentrate on other issues for a while.
comment:10 by , 12 years ago
I've updated my patch to work with the new changes in master, found at https://github.com/jtillman/django/compare/ticket_19385. I also I have branch that I run which is based off of 1.4.3 [https://github.com/jtillman/django/tree/MultiColumnJoin].
To your point about the IN statement, this doesn't currently work in the case of multi-column because of the use of _pk_trace. I'm thinking of doing the same method as I do for create_prefetch_query and using IN for single columns and complex queries for multiple columns while change _pk_trace to return a tuple of values.
comment:11 by , 12 years ago
jeremyt: Your version adds information about multiple columns into the compiler.py. This seems like a good straightforward approach which allows for multicolumn joins with minimal hassle.
The patch from https://github.com/akaariai/django/compare/ticket_19385 moves the whole ON clause generation to the field. I like the idea of completely abstracting the related join handling to the field. The worry is that once all cases (like join trimming, IN filtering, prefetch_related, ...) are handled by the field, then the API between ORM and the relation fields will be too complicated to actually use.
Will have to think about this.
Anyways, I cherry-picked the most obvious changes from the patch for this ticket into separate branch. See https://github.com/akaariai/django/compare/names_to_path_to_fields. I plan to commit this branch separately, it is a major cleanup to parent model joining, and to names_to_path().
comment:12 by , 12 years ago
Nice..... I like the field being able to explain the path no matter how many table jumps it has. Is there a reason we do 'if hasattr(field, 'get_path_info'):' instead of 'if isinstance(field, RelatedField)' though?
comment:18 by , 12 years ago
The names_to_path_to_fields stuff is in master now.
I think the next step is the possibility to have multiple columns per join, then try to fix all the possible places which break under multicolumn joins in the ORM. Here is a partial list:
- IN lookups
- What other lookups to support and how to do this
- Subquery support (we should use EXISTS instead of IN subqueries)
- Join trimming
- Representation of the foreign key in the model's _meta.
I am sure I am missing half of the complications.
I think we will want to abandon the get_join_clause() approach. This means there will be too little information available in the ORM, and we would need to move stuff like "can we trim this join" into fields which doesn't seem right. However, we will likely want to support the case of _no_ join columns, so that one can generate the whole clause in the field.get_extra_join_sql(). (Also, we might want to reconsider the name AND implementation of the get_extra_join_sql() method before 1.6...)
It seems the work done in https://github.com/jtillman/django/compare/ticket_19385 is a good basis for doing further work on the ORM changes. We should also see what koniiiik has done so that we won't be duplicating or invalidating his work here.
comment:20 by , 12 years ago
[https://github.com/jtillman/django/compare/ticket_19385] rebased with recent updated. Good additions btw.
Here are my thoughts/plans for the current list:
- IN lookups:
- Multicolumn could be done in mysql with multicolumn IN with format "SELECT * FROM T1 WHERE (T1.col1, T1.col2) IN ((11, 12), (21, 22))" but this isn't widely support by all databases so I'm staying away.
- This could be done by converting the IN filter into a Q object which explains the IN using ANDs and ORs of single fields as done in prefetching. Either we could create a field lookup that return a list of constraints or create a method on the field that takes in the query and adds the constraints itself..(I favor the latter).
- Subquery support
- Could use multicolumn IN here too but no for same reason
- I agree with the use of EXISTS over IN but I'm a little hesitant on this because mysql struggles with dependent subqueries (supposed to be fixed in 5.6). An option would be to special case single column relations and use IN and use exists for multicase scenario.
- Join trimming
- Shouldn't change much but we'll have to return a list of columns in trim_joins I'm thinking???
- Representation of the foreign key in the models's _meta
- I'm currently not sure why we need this one.
Thoughts?
All feedback/discussion welcome.
comment:21 by , 12 years ago
[https://github.com/jtillman/django/compare/ticket_19385] rebased with recent updated.
All above plans have been implemented.
What Changed:
*To and From field on PathInfo was removed and replaced with 'target_fields'
*Trimming can happen on multicolumns
*Added 'get_lookup_constraint' to ForeignKey Field. It can build constraints based on its target
*Sending a query object to get_lookup_constraint will add a SubqueryConstraint that may use EXISTS if multiple targets
Unit tests are all passing
I did find some interesting inconsistencies though that currently exist in the framework, especially in subqueries.
This seems to be a bug when using to_field (Using regressiontests.queries:ToFieldTests.test_recursive_fk):
*Node.objects.filter(parent=node1) => SELECT "queries_node"."id", "queries_node"."num", "queries_node"."parent_id" FROM "queries_node" WHERE "queries_node"."parent_id" = 42
-but-
*Node.objects.filter(parent=Node.objects.filter(id=node1.id)) => SELECT "queries_node"."id", "queries_node"."num", "queries_node"."parent_id" FROM "queries_node" WHERE "queries_node"."parent_id" = (SELECT U0."id" FROM "queries_node" U0 WHERE U0."id" = 1 )
-I believe the subquery will alway be incorrect when using a to_field that isn't 'pk' of the target model. If this isn't a known issue then I'll open a ticket for it, but this has been fixed with my change.
comment:22 by , 12 years ago
On a quick glance, looks good!
However, I don't think we want to add a new field to contrib. We either want to make ForeignKey work with multicolumn joins (even if there is no possibility for multicolumn PKs yet) or we want to leave the multicolumn foreign key implementation to external apps.
We are basically in plumbing stage here - we are improving the internals. After done, the patch that adds the support in public APIs should be somewhat more manageable. Still, I think it is worth it to test the internals. So, maybe we could add the contrib field in a test module instead?
Making ForeignKeys work properly will likely need the virtual fields support koniiiik has been working on. We really should wait and see if he can continue his work.
In any case, the branch contains multiple improvements that we should get committed. Likely the whole stuff targeting models/sql and fields is good to go. I will try to find the time to review the patch as soon as possible.
comment:23 by , 12 years ago
Could you make a pull request? There are various pieces of code I'd like to comment on, and the best tool to do that is the github PR on-line commenting. (While at it, could you remove the whitespace errors, git diff --check is your friend here).
I did spend already some time reviewing the code. I think the changes are mostly good, though there are a couple of issues I'd like to tackle, best reviewed in a PR.
Also, I still think the ForeignKeyEx doesn't belong into contrib, it should be defined in a test module. It is a good test case to see we really have working multicolumn joins, but it is unlikely that such a field would be accepted to django contrib. A discussion at django-developers is needed at least.
follow-up: 25 comment:24 by , 12 years ago
Pull Request created https://github.com/django/django/pull/656
I haven't made changes for previous comment yet but just pulled what I have committed. I will move the ForeignKeyEx out and placing the functionality as 'JoinField' which will always be the type of field for PathInfo.join_field. Please comment away and I will continuously integrate them.
comment:25 by , 12 years ago
Pull Request updated concerning comments.
To the point of the contrib portion, I would like to put this functionality in related.py. I have extra commits that moves this to a ForeignObject field. ForeignObject is basically the join_field of a JoinInfo tuple as ForeignKey is today. It is represented as a virtual field so if preferred, we can integrate koniiiik's change. ForeignKey will inherit ForeignObject and add properties to make the field non-virtual to retain it being used as a column.
If we want to got this route (my vote goes here), I'm ready to pull the trigger so let me know.
comment:26 by , 12 years ago
I've made a seperate branch and pull request with the second option as discussed above. It may be commented on and I will polish this version up with comments and so forth.
Branch: https://github.com/jtillman/django/compare/ticket_19385_2
Pull Request: https://github.com/django/django/pull/680
comment:27 by , 12 years ago
Sorry for not replying sooner, I've been busy lately...
I just took a quick look at the second approach and I think I like it. It seems the ForeignObject could have more uses than just ForeignKey, so it is a good addition. I will try to take a closer look soon. In any case, this looks to be on good track for inclusion in 1.6.
If you have interest in reviewing / working on other ORM tickets that would be a good way to make sure I will have time to review this ticket...
comment:28 by , 12 years ago
I have done some testing, and this is looking excellent. You have convinced me that the ForeignObject is a good idea. I tried the ForeignObject with ActiveTranslationField (a "reverse foreign key" fetching current active language's translations). There is need for couple of simple additions to ForeignObject, but otherwise things seems to just work.
The tests & changes for translation fields are here: https://github.com/akaariai/django/commit/790fc1693744fe3a8f407080527a0c70b1f4a759. This is not intended for inclusion in your patch as is, but I would like to add the ability to do the reverse joining stuff at some point. Only minor changes.
The get_extra_join_sql() method could likely be replaced with get_extra_filter(). The get_extra_filter() could work for both the descriptor and extra_join_restriction cases. In addition this would be much better than generating raw SQL for custom backend support. This is not needed in this ticket, but definitely something to think about for future work.
Unfortunately it seems I will be very busy still for the couple next weeks, so I can't promise a full review & commit soon. What I have managed to review so far suggests that there isn't much work to do.
comment:29 by , 12 years ago
I did a rebase of the patch. While working on the patch I saw three things worth mention:
- Generally looking very good
- Unused imports, other PEP-8 code cleanup required
- Some naming issues in ForeignObject.*_related_fields (I find them a bit hard to remember). The naming issues might be OK to postpone after commit...
The rebase commit tells also what else apart of just making this run was done. Most important change is using ForeignObject as base of GenericRelation, too.
I was too lazy to do the rebase per commit, so the commits before HEAD aren't passing tests.
I will be on holiday next week, but after that I think I will just do final cleanup & commit. Polishing this outside master will just lead to more rebase conflicts. Meanwhile, using SubqueryConstraint in split_exclude() isn't forbidden... :)
Rebase available from: https://github.com/akaariai/django/compare/ticket_19385_2_rebased
comment:30 by , 12 years ago
Reading through the patch I found a couple of things in need of fixing or at least closer inspection before this can be committed:
- column_fields, local_column_fields -> concrete_fields, local_concrete_fields, do these need better caching for
__init__
speed? - requires_unique_target validation must check unique_together indexes, too
- Model.
__init__
, "# This field will be populated on request" if condition cleanup - possible problems from related field lookup value resolving (the _pk_trace() change). The new code seems to be doing better job than the old code (validating that the related model is of right class, not blindly using PK instead of the relation's real target field). If regressions happen here they should be somewhat straightforward to fix.
- native_related_fields -> better name?
I was thinking of renaming all [*_]related_fields to [*_]relation_fields. Also, rename native_related_fields to local_relation_fields, foreign_related_fields to remote_relation_fields. Similar for the get_..._value() methods.
Otherwise I think this patch is ready. split_exclude() at least doesn't yet work correctly for multicol joins, but that can be fixed later on. There are likely other places needing some work, too. All in all there is somewhat big risk of regressions but IMO the risk is worth it.
If anybody wants to do a review of this now is the time. Doing a quick API level check would be good. Commit is going to happen real soon now.
Abstract of the patch for those not wanting to read through the whole patch: the patch adds models.fields.related.ForeignObject which is a virtual field marking a connection from_fields -> to_fields. In addition the patch adds support for traversing these multicolumn connections into ORM. When plain ForeignObject is used it is added to ._meta.fields. Those fields which have a concrete database column will be present in ._meta.concrete_fields, too (still named column_fields in the patch). ForeignKey and GenericRelation are special cases of ForeignObject, ForeignKey has just single from_fields and to_fields entries, and also has a concrete column. GenericRelation is still using ._meta.virtual_fields only. After the patch it will be possible to make the ORM handle multicolumn relations even if multicolumn ForeignKey isn't available using public APIs. Examples of what can be done is in tests/foreign_objects/models.py and tests.py.
comment:31 by , 12 years ago
Further problems found: what to do in Model initialization? It should be possible to initialize fields using ForeignObject based fields, but at the same time this should not interfere with loading from database (*args based initialization with only concrete fields). The resolution for now is to do *args based initialization using concrete fields, but **kwargs based initialization using all fields.
ModelForms now skip any non-concrete fields for ModelForms. This isn't a nice solution, but works for now. If one wishes to create ModelForms for models having ForeignObjects, then manual removal of the concrete fields and insertion of a form field for the ForeignObject field will need to be done. That is, if you want a select widget of persons instead of first_name and last_name text fields you will need to do that yourself.
Currenly the ForeignObject does not integrate with other parts of Django. It is possible to do some ORM tricks with it but that's it. The rest of virtual field support needs to be fleshed out later on. As such documenting ForeignObjects at all seems like a bad idea. At least the representation in model._meta needs to change later on.
Interestingly using @cached_attribute gave a small speed increase to model initialization, so performance problems in that part of the patch aren't a concern.
So, the ForeignObject isn't public and it will be only barely usable in 1.6. It is very likely that many parts of Django will crash if trying to deal with a model having a ForeignObject field. Still, I think commit to 1.6 is worth it, ForeignObject as is allows for some nice things already, and this way there is possibility to find out the problem spots without committing to public API yet. The related fields implementation is a tiny bit cleaner, and public API use should not suffer (though some regressions are likely). As long as a project doesn't play with ForeignObjects directly there should not be much problems.
A squashed version of the patch is available from https://github.com/akaariai/django/commit/4810cefd0a99f6963ad57415dd8771e8b2b4a279. I will still need to do a final review of the patch, and confirm that commit of only ORM parts of multicolumn joins makes sense.
comment:32 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
This got fixed in [266de5f9ae9e9f2fbfaec3b7e4b5fb9941967801].
I did still some improvements to how GenericRelation works, it is now pure virtual field and it isn't m2m at all. Things seem to work nicely.
I guarantee that the commit will introduce regressions, and that there are things that do not work as wanted with multicolumn joins and/or when using plain ForeignObject. Still, I (obviously) think that commit was the right thing to do, as writing a perfect patch would have meant refactoring model._meta field representations, and even bigger refactor to fields/related.py. 700 lines of changed ORM code in one patch is enough...
A big thank you to Jeremy Tillman for the patch!
comment:36 by , 12 years ago
Thanks akaarial for censoring this ticket throughout it's lifetime. Your constant feedback and contributions was very much appreciated and a huge part of getting this done.. Thank you!
My feeling is that Django should not offer the functionality out of the box, but should allow somewhat easy subclassing of ForeignKeys & other relation fields to achieve this functionality.
You might be interested to take a look at what is going on in master currently, specifically in ticket #10790, https://github.com/akaariai/django/tree/ticket_10790_final and commit https://github.com/akaariai/django/commit/f6041f47328fc26b7cdbe9d6f604e1b1f1f4b3a9.
For this ticket, the interesting part is the addition of join_field to the alias_map. If the join_field (that is, the FK / reverse relation used for the join) defines get_extra_join_sql(connection, qn, lhs_alias, rhs_alias) then the method is called in compiler.get_from_clause() and the SQL returned is added to the join condition. This seems to be a bit more generic approach than what you have done.
The work in #10790 isn't enough to allow the functionality you are after. It seems some more hooks will be needed. And there are some more places to change, at least the select_related stuff in compiler.py needs to take the join_field in account.
In addition I think we want to abstract the interaction between fields and the query generation code somewhat. Currently the query generation code knows too much of the implementation of the fields which leads to somewhat hard to understand code in setup_joins(). Abstraction would also allow for easier creation of custom relation fields.
I would be very interested to hear what you have to say about the approach taken in #10790.
I am not sure what to do with this ticket. I definitely support the idea of allowing ForeignKey subclasses with the functionality you are looking after. But I don't think we want to add those subclasses directly into django-core, at least not before knowing a lot more how they will function in practice, and which ones are the commonly used ones.