Opened 8 years ago

Closed 8 years ago

#25811 closed Cleanup/optimization (fixed)

Error querying models in different databases in one queryset

Reported by: Edwar Baron Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: database
Cc: edwar.baron@…, Anssi Kääriäinen Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Model1: Deportes from database1
Model2: Hecho1_VentasCadenasJuegos from database2

Code:

Deportes.objects.filter(
    pk__in=Hecho1_VentasCadenasJuegos.objects.all().values_list(
        'juegos__deporte_id', flat=True
    ).distinct('juegos__deporte_id')
)

Error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
psycopg2.ProgrammingError: relation "admin_datamart_hecho1_ventascadenasjuegos" does not exist
LINE 1: ...ISTINCT ON (U1."deporte_id") U1."deporte_id" FROM "admin_dat...
                                                             ^


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<console>", line 4, in <module>
  File "/usr/local/lib/python3.5/site-packages/django/db/models/query.py", line 138, in __repr__
    data = list(self[:REPR_OUTPUT_SIZE + 1])
  File "/usr/local/lib/python3.5/site-packages/django/db/models/query.py", line 162, in __iter__
    self._fetch_all()
  File "/usr/local/lib/python3.5/site-packages/django/db/models/query.py", line 965, in _fetch_all
    self._result_cache = list(self.iterator())
  File "/usr/local/lib/python3.5/site-packages/django/db/models/query.py", line 238, in iterator
    results = compiler.execute_sql()
  File "/usr/local/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.5/site-packages/django/db/backends/utils.py", line 79, in execute
    return super(CursorDebugWrapper, self).execute(sql, params)
  File "/usr/local/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.5/site-packages/django/db/utils.py", line 97, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/local/lib/python3.5/site-packages/django/utils/six.py", line 658, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: relation "admin_datamart_hecho1_ventascadenasjuegos" does not exist
LINE 1: ...ISTINCT ON (U1."deporte_id") U1."deporte_id" FROM "admin_dat...

Viewing the SQL generated

SELECT "admin_juego_deportes"."id", "admin_juego_deportes"."nombre", "admin_juego_deportes"."logo", "admin_juego_deportes"."orden", "admin_juego_deportes"."orden_equipos", "admin_juego_deportes"."count_apuesta", "admin_juego_deportes"."fondoweb", "admin_juego_deportes"."cantidad", "admin_juego_deportes"."cantidad_tiempos", "admin_juego_deportes"."runline_positivo", "admin_juego_deportes"."ganador_empate_not_null", "admin_juego_deportes"."resultado", "admin_juego_deportes"."created_at", "admin_juego_deportes"."updated_at" FROM "admin_juego_deportes" WHERE "admin_juego_deportes"."id" IN (SELECT DISTINCT ON (U1."deporte_id") U1."deporte_id" FROM "admin_datamart_hecho1_ventascadenasjuegos" U0 INNER JOIN "admin_datamart_dimensionjuegos" U1 ON ( U0."juegos_id" = U1."id" )) ORDER BY "admin_juego_deportes"."nombre" ASC

Solution:

Deportes.objects.filter(
    pk__in=list(Hecho1_VentasCadenasJuegos.objects.all().values_list(
        'juegos__deporte_id', flat=True
    ).distinct('juegos__deporte_id'))
)

Viewing the SQL generated

SELECT "admin_juego_deportes"."id", "admin_juego_deportes"."nombre", "admin_juego_deportes"."logo", "admin_juego_deportes"."orden", "admin_juego_deportes"."orden_equipos", "admin_juego_deportes"."count_apuesta", "admin_juego_deportes"."fondoweb", "admin_juego_deportes"."cantidad", "admin_juego_deportes"."cantidad_tiempos", "admin_juego_deportes"."runline_positivo", "admin_juego_deportes"."ganador_empate_not_null", "admin_juego_deportes"."resultado", "admin_juego_deportes"."created_at", "admin_juego_deportes"."updated_at" FROM "admin_juego_deportes" WHERE "admin_juego_deportes"."id" IN (1, 2) ORDER BY "admin_juego_deportes"."nombre" ASC

Nevertheless, the problem is given in the subquery that incrupta in the first query.

In the ORM should be verified that both models belong to different data base and generate the correct SQL, evaluating the second and not generate a subquery.

Change History (18)

comment:1 by Tim Graham, 8 years ago

Resolution: invalid
Status: newclosed
Summary: Error in consultation with models of different databaseError querying across foreign keys with models in different databases

Per the multiple databases documentation:

Django doesn’t currently provide any support for foreign key or many-to-many relationships spanning multiple databases. If you have used a router to partition models to different databases, any foreign key and many-to-many relationships defined by those models must be internal to a single database.
... if you’re using SQLite or MySQL with MyISAM tables, there is no enforced referential integrity; as a result, you may be able to ‘fake’ cross database foreign keys. However, this configuration is not officially supported by Django.

comment:2 by Edwar Baron, 8 years ago

You're right actually Django does not currently provide this support, but both models are not directly related, plus I have a properly configured router according to specifications.

I show you some models:

class Deportes(models.Model):
    ...
    
class DimensionJuegos(models.Model):
    
    ...

    deporte_id = models.IntegerField()
    
    ...

Django so no support for relationships between different database, the ORM should be able to identify subqueries that are different databases and evaluate them before processing the first query, which is a optimization generate internal inquiry but as subquery in this case it is not the right way, that only works when the models are in the same database.

Perhaps lazy functionality Django queries that are evaluated only when needed, but to build it must be verified in the generated subqueries.

By the way sorry for my bad English,

Regards.

comment:3 by Edwar Baron, 8 years ago

Cc: edwar.baron@… added
Resolution: invalid
Status: closednew

comment:4 by Shai Berger, 8 years ago

Summary: Error querying across foreign keys with models in different databasesError querying models in different databases in one queryset
Triage Stage: UnreviewedAccepted

OP is right to reopen -- the query had nothing to do with relations, only an inner query on a separate database.

A documentation fix is a much more likely fix than behavior change, though.

comment:5 by Edwar Baron, 8 years ago

Well, given the way the queryset are evaluated the only solution is to force the evaluation of internal consultation, exactly as specified above with "list ()", but what happens when large projects where time is it separate the database ?, possibly like me exist any code with nested queries, Django should be able to generate differences that and not only evaluate the subquery separately.

I like it try to force this behavior, I have some experience using Django but still not at such a high level, but with some tutoring in things that maybe do not understand well make it, you think? I try to fix it or someone else will?

Excuse my bad English.

Regards.

Version 2, edited 8 years ago by Edwar Baron (previous) (next) (diff)

comment:6 by Tim Graham, 8 years ago

Cc: Anssi Kääriäinen added

Maybe Anssi (akaariai) could say if this is feasible and guide you if so.

comment:7 by Edwar Baron, 8 years ago

Thanks, I'm already writing an email to akaariai@…, will await their response.

I've been seeing some code, exactly:

https://github.com/django/django/blob/stable/1.8.x/django/db/models/sql/compiler.py
Line number: 882
return '%s.%s IN (%s)' % (qn(alias), qn2(columns[0]), sql), params

more or less there, shoulds verify that the added subquery belongs to the same database, otherwise not allowed and should evaluate the subquery but values.

I do not have much experience but I would like to make my first contribution, maybe not the right place in the code to solve the problem, but if we could you help me add the right solution.

comment:8 by Josh Smeaton, 8 years ago

Type: BugCleanup/optimization
Version: 1.8master

That would be roughly the right place @ebar0n, but you might want to try fixing it at a higher level (callers of "as_subquery_condition") if you go down that route.

My concern here is one of surprise. Users are mostly aware that passing in a queryset as an __IN lookup results in a single query that has a subquery. It is what we document to happen. If we silently change this behaviour based on that subquery having to execute on a different database, then we're going from a potential efficient subquery to a potential extremely bad query that pulls out thousands/millions rows back to Django. Erroring in this condition seems like a good idea, so the user can explicitly wrap the subquery in a list().

Providing a nicer error message with a hint to list() the subquery seems like the right solution here.

comment:9 by Edwar Baron, 8 years ago

It is effectively efficiently perform the subquery, but I think Django must be able to identify the model referenced in the subquery belongs to a different base data to the initial consultation; Only then the subquery should be evaluated before generating the entire query, and generate the subquery and avoid error.

Rest when they belong to the same database operation should be such as this, because it is more optimal.

That would be the need to use the same functional code with multiple database support.

Then you advise me? I'm anxious to try to fix it.

Excuse my bad English is not my native language.

comment:10 by Edwar Baron, 8 years ago

It would also be necessary to specify that functionality in the documentation __IN
, that would be the only exception to not generate the subquery.

comment:11 by Edwar Baron, 8 years ago

I've got the solution, you can see it in https://github.com/django/django/pull/5732

Tests conducted using debugsqlshell:

Two models that belong to the same database

Deportes.objects.filter(
    pk__in=Equipos.objects.all().values_list(
        'deporte_id', flat=True
    ).distinct('deporte_id')
)
...
SELECT "admin_juego_deportes"."id",
       "admin_juego_deportes"."nombre",
       "admin_juego_deportes"."logo",
       "admin_juego_deportes"."orden",
       "admin_juego_deportes"."orden_equipos",
       "admin_juego_deportes"."count_apuesta",
       "admin_juego_deportes"."fondoweb",
       "admin_juego_deportes"."cantidad",
       "admin_juego_deportes"."cantidad_tiempos",
       "admin_juego_deportes"."runline_positivo",
       "admin_juego_deportes"."ganador_empate_not_null",
       "admin_juego_deportes"."resultado",
       "admin_juego_deportes"."created_at",
       "admin_juego_deportes"."updated_at"
FROM "admin_juego_deportes"
WHERE "admin_juego_deportes"."id"
IN (SELECT DISTINCT ON (U0."deporte_id") U0."deporte_id" FROM "admin_juego_equipos" U0)
ORDER BY "admin_juego_deportes"."nombre" ASC LIMIT 21 [291.41ms]

1 models, using as an argument a list

Deportes.objects.filter(
    pk__in=[1, 2]
)
...
SELECT "admin_juego_deportes"."id",
       "admin_juego_deportes"."nombre",
       "admin_juego_deportes"."logo",
       "admin_juego_deportes"."orden",
       "admin_juego_deportes"."orden_equipos",
       "admin_juego_deportes"."count_apuesta",
       "admin_juego_deportes"."fondoweb",
       "admin_juego_deportes"."cantidad",
       "admin_juego_deportes"."cantidad_tiempos",
       "admin_juego_deportes"."runline_positivo",
       "admin_juego_deportes"."ganador_empate_not_null",
       "admin_juego_deportes"."resultado",
       "admin_juego_deportes"."created_at",
       "admin_juego_deportes"."updated_at"
FROM "admin_juego_deportes"
WHERE "admin_juego_deportes"."id" IN (1, 2)
ORDER BY "admin_juego_deportes"."nombre" ASC LIMIT 21 [0.88ms]

Models belonging to two different databases

Deportes.objects.filter(
	pk__in=DimensionJuegos.objects.all().values_list(
		'deporte_id', flat=True
	)
)
...
SELECT U0."deporte_id"
FROM "admin_datamart_dimensionjuegos" U0 [0.68ms]
SELECT "admin_juego_deportes"."id",
       "admin_juego_deportes"."nombre",
       "admin_juego_deportes"."logo",
       "admin_juego_deportes"."orden",
       "admin_juego_deportes"."orden_equipos",
       "admin_juego_deportes"."count_apuesta",
       "admin_juego_deportes"."fondoweb",
       "admin_juego_deportes"."cantidad",
       "admin_juego_deportes"."cantidad_tiempos",
       "admin_juego_deportes"."runline_positivo",
       "admin_juego_deportes"."ganador_empate_not_null",
       "admin_juego_deportes"."resultado",
       "admin_juego_deportes"."created_at",
       "admin_juego_deportes"."updated_at"
FROM "admin_juego_deportes"
WHERE "admin_juego_deportes"."id" IN (1)
ORDER BY "admin_juego_deportes"."nombre" ASC LIMIT 21 [1.74ms]
Last edited 8 years ago by Edwar Baron (previous) (diff)

in reply to:  10 comment:12 by Shai Berger, 8 years ago

Replying to ebar0n:

It would also be necessary to specify that functionality in the documentation __IN
, that would be the only exception to not generate the subquery.

You seem to be missing @jarshwah 's main concern: Your fix is all nice and well when the subquery returns a small number of records. But assume it can return thousands of records -- even if the external query returns only a few in the end. Say "find customers with last name Baron who made a purchase", written as

Customer.objects.filter(last_name='Baron',
                        customer_id__in=
                            Purchase.objects.all().values_list('customer_id'))

(I know this doesn't look realistic, it is simplified for the sake of example, but believe me, there are plenty of such queries).

Now, as long as we're all in one database, this may well be efficient enough, because only the small number of records will finally be returned, and the database can usually make sure to execute such a query efficiently internally.

But with your fix, and if Purchase is on a different database, this query suddenly becomes a performance disaster. Further, there is no way for Django to tell if the inner query is "fast enough".

This is why Josh prefers that the fix will be, instead of "just go and get it", to make the error informative and tell the developer how to fix it. For what it's worth, I agree.

Josh, is it possible for a user to override the built-in lookups? @ebar0n's concern seems to be that there are a lot of queries in his code-base which would need to be changed if we don't fix it as he suggests, and where performance is probably not an issue. If he can just override __in with a lookup that does what he wants, then we can have our cake and he can eat it too.

comment:13 by Josh Smeaton, 8 years ago

Thanks Shai for clearing that up. Your suggestion about implementing a custom IN lookup is a good one. That is definitely a way users such as ebar0n can opt in to new behaviour. The only issue is that they'd need to carry around and share the code outside of django. It would also be an all-or-nothing proposition. One __in lookup can be registered per field type, so it's not really something you'd want provided by third party libs.

I still think a nicer error message would be good though, so I won't "wontfix" the ticket. But I don't think we should accept the change that ebar0n is putting forward.

ebar0n, you can use the code you've written in your own application by doing the following:

from django.db.models.lookups import In

class MultiDBIn(In):
    def process_rhs(self, compiler, connection):
        db_rhs = getattr(self.rhs, 'db', None)
        # if the argument for the {{__in}} is a subquery, check if the db is different from the main query,
        # and if they are true evaluate the subquery because Django does not provide support for multiple
        # base relations of data
        if (db_rhs and db_rhs != connection.alias) or self.rhs_is_direct_value():
            # rhs should be an iterable, we use batch_process_rhs
            # to prepare/transform those values
            rhs = list(self.rhs)
            if not rhs:
                from django.db.models.sql.datastructures import EmptyResultSet
                raise EmptyResultSet
            sqls, sqls_params = self.batch_process_rhs(compiler, connection, rhs)
            placeholder = '(' + ', '.join(sqls) + ')'
            return (placeholder, sqls_params)
        else:
            return super(In, self).process_rhs(compiler, connection)

IntegerField.register_lookup(MultiDBIn)
CharField.register_lookup(MultiDBIn)
# .. and any other fields you'd want this behaviour to apply to

Thank you ebar0n for working on this and finding a solution, but I'm afraid it's not one that we should accept.

comment:14 by Edwar Baron, 8 years ago

Thanks for the clarification, I fully understand the performance problems that this may cause, but because I'm pretty sure that when a developer decides to use different database and do this kind of consultation should take into account this potential problem (if you have a subquery with thousands of records), if it continues in the same database which is running as the subquery is generating optimally. Also in the official documentation shall specify the variant behavior.

But then I respect their opinions; I will correct the code to add the exception as suggested, so that other users do not fall into the same error time and tarden known to occur. and only I customize my code.

Thank you for your help.

comment:15 by Edwar Baron, 8 years ago

Ready was added and the change https://github.com/django/django/pull/5732

Then pull that can be approved?

comment:16 by Tim Graham, 8 years ago

Has patch: set

comment:17 by Tim Graham, 8 years ago

Triage Stage: AcceptedReady for checkin

Pending some cosmetic edits.

comment:18 by Tim Graham <timograham@…>, 8 years ago

Resolution: fixed
Status: newclosed

In eb441727:

Fixed #25811 -- Added a helpful error when making _in queries across different databases.

Note: See TracTickets for help on using tickets.
Back to Top