Opened 4 years ago

Closed 3 years ago

Last modified 3 years ago

#17258 closed New feature (fixed)

Move threading.local from DatabaseWrapper to connections dictionary

Reported by: akaariai Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: anssi.kaariainen@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

It might be a good idea to move the threadlocality of connections from the DatabaseWrapper to the django.db.connections dictionary. This would be useful if you want to do routing of connections on per-view or per-method basis. The use case is when you have a writing view in your application, you might want to route all queries to the master database, not just the writing queries. Now your view will work on a consistent snapshot all the time. This would also allow for much easier generation of separate connections.

Currently, if you go on and change the DEFAULT_DB_ALIAS to point to "master" in the connections dictionary, this will result in global routing of the DEFAULT_DB_ALIAS, that is, all threads will see the master connection in the DEFAULT_DB_ALIAS, not just your thread. In other words, this is not doable currently.

I would like to work on an external project "db_helpers", which would provide a function wrapper for this. After this change, you could do this:

@route_connections(default="master", close_connections=True):
def my_writing_view(...):
    ...

or
@route_connections(default=db_helpers.separate_connection("master")):
def log_stuff_in_separate_connection(...):
    ...

And the implementation of route_connections would be:

def route_connections(f):
    # first handle special kwargs, like _close_connections
    old_connections = {}
    try:
        for k, v in kwargs:
            old_connections[k] = connections[k]
            connections[k] = connections[v] # or new connection...
        # call the wrapped method
    finally:
        for k in old_connections.keys():
            connections[k] = old_connections[k]

Is the change of threading.local from DatabaseWrapper to the connections dictionary something that has a chance to get included in Django? I think this could be a pretty simple thing to do (if you can call anything involving threads simple). Of course, I am willing to do the investigation & the patch if there is a chance for this.

Attachments (7)

17258.thread-local-connections.diff (4.8 KB) - added by julien 3 years ago.
17258.thread-local-connections.2.diff (6.0 KB) - added by julien 3 years ago.
17258.thread-local-connections.3.diff (10.4 KB) - added by akaariai 3 years ago.
POC: check sharing of connections between threads
17258.thread-local-connections.4.diff (12.7 KB) - added by julien 3 years ago.
Similar approach to pysqlite's check_same_thread
17258.thread-local-connections.5.diff (12.7 KB) - added by julien 3 years ago.
17258.thread-local-connections.6.diff (14.7 KB) - added by julien 3 years ago.
benchmark.txt (7.2 KB) - added by akaariai 3 years ago.
djangobench results (see comment 17)

Download all attachments as: .zip

Change History (31)

comment:1 Changed 4 years ago by aaugustin

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Design decision needed
  • Type changed from Uncategorized to New feature
  • Version set to SVN

If I understand correctly, your goal is to avoid adding a .using(...) clause to every QuerySet in the view.

The drawback is that it introduces some state — basically, a global (thread-local) variable that defines the "current default" database connection.

Two other remarks:

  • Isn't close_connection a separate topic?
  • Changing the value of DEFAULT_DB_ALIAS certainly isn't recommended: this is a constant, not a variable.
Last edited 4 years ago by aaugustin (previous) (diff)

comment:2 Changed 4 years ago by akaariai

First, everything user-visible would be done in a project external to Django. For routing connections (instead of queries) some changes in django.db.connections is needed. The idea behind this is that you could make an one-db application multi-db app by just changing where 'default' points - for writing views it points to master, else it points to one of the slaves.

I do not mean to change the value of DEFAULT_DB_ALIAS, I just mean to change where it points at (that is, connectionsdefault? is changed for one thread, not the word 'default', not the DatabaseWrapper it points to).

Yes, there are some drawbacks... Maybe the global state is a stupid thing to do, but on the other hand it is pretty hard to pass the correct connection you want to use from middleware, so that it is used in other middleware, and in all the queries (user.get_profile, get_permissions, model validation etc). Separate connections to the same database are also somewhat hard to do correctly.

Anyways, this would not be django-core stuff, but a prerequisite is moving the thread-locality of connections from DatabaseWrapper to django.db.connections dictionary. This involves some problems, because threading.local objects have some special GC behavior when dealing with cyclic references, and DatabaseWrapper happens to have many cyclic references. This would mean that separate threads' connections are GCed (and closed) with different timing than they are now. But this timing is Python implementation dependent, for pypy there would not be any change for example... And closing connections manually for separate threads should be done anyways.

comment:3 Changed 3 years ago by akaariai

Hmmh, my descriptions above aren't as clear as they could be. So, another try. I suggest that Django would:

  • Make BaseDatabaseWrapper a subclass of object instead of threading.local
  • Make django.db.connections threadlocal.

Benefits:

  • BaseDatabaseWrapper is a complex object, and it is subclassed a lot. django.db.connections is a simple dictionary. Making the simple object threading.local instead of the complex object would be a good thing IMHO.
  • You can share connections between threads (needed for SQLite in-memory databases): Pass the connections you want to share to new thread, and in the new thread assign the passed in connections to django.db.connections.
  • You can change per-thread where each connections[alias] points to. So, you could do this in a middleware:
    old_conn = django.db.connections['default']
    try:
        django.db.connections['default'] = django.db.connections['master'] # for writing transactions. (Or just for POST connections)
        django.db.connections['default'] = django.db.connections['slaveN'] # for read-only transactions.
    finally:
        django.db.connections['default'].close()
        django.db.connections['default'] = old_conn
    
    • Implementing a connection pool: Make a ConnectionPoolWrapper (a fake backend), assign it to django.db.connections[alias]. The ConnectionPoolWrapper would then pass most attribute accesses down to one of the pooled connections (which would be regular backend objects).
    • It would be fairly easy to have multiple connections to the same database in one thread: just assign a new instance of a wanted connection to django.db.connections['another_connection`], and then do queries with .using('another_connection'). (Or use the above mentioned snippet to assign it to one of the existing connections). Useful for logging and testing for example. Currently Django creates another thread when it wants a separate connection for testing (see select_for_update tests for example).

The important thing to note is that the examples above would not be Django-core. Moving thread-locality from BaseDatabaseWrapper to django.db.connections would allow 3rd party projects to do the above things. As far as I see, they are not doable at the moment, or at least not in any trivial way.

Changed 3 years ago by julien

comment:4 Changed 3 years ago by julien

  • Triage Stage changed from Design decision needed to Accepted

Accepting as the benefits suggested in comment:3 all seem valid and as we already have an immediate need for this, i.e. to allow in-memory sqlite database connections being shared between multiple threads for Selenium tests (ref #2879).

The attached patch turns the DatabaseWrapper instances into global objects and makes the django.db.connections and django.db.connection references thread-local.

As discussed with akaariai on IRC, there is a potential risk that garbage collection may not be handled properly since DatabaseWrapper has cyclic references. So we'd need to do further testing to be sure this change doesn't cause memory leaks or unexpected behaviours.

comment:5 Changed 3 years ago by akaariai

  • Cc anssi.kaariainen@… added
  • Has patch set

Do you need to make django.db.connection threading.local? As is, it is fetching / assigning the connection using connections, but connections is already threading.local. So, would this work exactly as is, but with making the DefaultConnectionProxy subclass of object instead?

About the garbage collection risk, there are a couple of things to note:

  1. The garbage collection closes the underlying connection, so you would a connection leak if the GC fails to collect the connection object.
  2. Circular references + __del__ make objects in cyclic references non-collectable. See: http://docs.python.org/library/gc.html#gc.garbage.
  3. However, if the cycle is caused by a threading.local object, 2. isn't in effect. (Yes, this is weird).
  4. On standard Python the timing of the garbage collection is altered. Non-cyclic (or cycles containing threading.local objects) are collected immediately, but objects with referential cycles only when the GC happens to run next time. On PyPy, the objects are collected only when the garbage collector happens to run, so no change for PyPy users. In general, if your code relies on the exact timing of GC, you are doing it wrong.

It so happens that the BaseDatabaseWrapper is currently threading.local, and has referential cycles (wrapper -> introspection -> wrapper etc). So, by removing the threading.local, we remove rule 3, and rule 2 is in effect. Luckily there is no __del__ defined for the wrapper, so we should be relatively safe, except for the altered timing. External backends which define __del__ would have a really hard bug to debug.

This should not affect web-serving at all, as after each request connections are closed explicitly. But testing & scripts and things like that could potentially have a problem.

The long-term solution would be to break the cycles in BaseDatabaseWrapper. However that is going to be an invasive patch, and would break most external backends.

All in all, I think the patch should be safe from GC point of view. The patch looks good (on a quick glance), except potentially for the duplicate threading.local.

Last edited 3 years ago by akaariai (previous) (diff)

Changed 3 years ago by julien

comment:6 Changed 3 years ago by julien

Thanks akaariai. DefaultConnectionProxy now inherits from object. I've also tweaked the tests to pass on Postgres.

I now have just one concern about the patch. Some third party code may expect django.db.connection to be a DatabseWrapper instance and therefore would break. This is not a huge deal, but it'd be nice if this could be achieved without having to use a proxy class.

comment:7 Changed 3 years ago by akaariai

I am afraid there is no easy way for connection to be DBWrapper instance instead of the proxy. If you do in your code:

from django.db import connection

on module level, and the DBWrapper isn't threading.local, then that connection must be a proxy. BTW this also means that if somebody does this for example:

def somefunc(connection=connections['some_alias']):
    # use connection in some way

that would break if used in multithreaded fashion. You are now sharing a single connection instead of a threading.local object.

Would there be any point in making an assert in some proper place of BaseDatabaseWrapper that a single connection is not shared between threads (unless explicitly allowed). That would be really helpful for those who would hit multithread-bugs due to this change. I haven't looked where that check could be made so that it is checked often enough to actually catch something, but seldom enough to not cause performance regressions. Maybe ._cursor() for each backend could do this check?

In short, this change can cause connections to be shared when not actually wanted. Before you would be sharing a threading.local, and thus get different connections, now you would be sharing the actual connection. Allowing this is actually one of the aims of this patch, but this might break user code in hard-to-debug ways. So, a guard against this and a way to explicitly tell that "I am sharing this connection" would be a good thing IMHO.

Changed 3 years ago by akaariai

POC: check sharing of connections between threads

comment:8 Changed 3 years ago by julien

I like this idea. Pysqlite by default prevents connections to be passed around between multiple threads, unless you explicitly provide the 'check_same_thread=False' parameter.

See how this is implemented: http://www.google.com/codesearch#aEvhAxCkZ8U/src/connection.c&q=check_same_thread%20package:http://pysqlite%5C.googlecode%5C.com&l=838
And where this function is called: http://www.google.com/codesearch#search/&q=pysqlite_check_thread%20package:http://pysqlite%5C.googlecode%5C.com&type=cs

I've implemented something similar in the latest patch. Thoughts?

Changed 3 years ago by julien

Similar approach to pysqlite's check_same_thread

comment:9 Changed 3 years ago by julien

akaariai: I hadn't realised you had posted a patch as well. It seems like we've both used a similar approach. Let's merge the good bits from both patches ;)

comment:10 Changed 3 years ago by akaariai

It seems your approach is better. I actually don't know what would be the good bits to merge from my patch :)

Changed 3 years ago by julien

comment:11 Changed 3 years ago by julien

After a long discussion with akaariai on IRC, I've updated the patch. The most notable changes are:

  • All underlying SQLite connections (i.e. sqlite3.DatabaseWrapper.connection) systematically receive check_same_thread=False on opening.
  • Modified the API namings, in particular check_same_thread (defaulting to True) becomes allow_thread_sharing (defaulting to False).

Some documentation is now necessary:

  • Release notes announcing that the database connections are now global objects (only their references are thread-local), that django.db.connection now is a proxy, and that sqlite db connections can be shared between threads.
  • Notes in the SQLite doc about the change in behaviour: https://docs.djangoproject.com/en/dev/ref/databases/#sqlite-notes

In the code, maybe a warning or exception could be generated if DatabaseWrapper.connection is directly accessed and DatabaseWrapper.allow_thread_sharing is False with SQLite. I'm not sure if/how this is achievable.

There is also a question as whether or not we want to officially document DatabaseWrapper.allow_thread_sharing property yet.

comment:12 Changed 3 years ago by julien

One more thing. It'd be useful to see if this change wouldn't affect performance too much due to all the added conditional tests.

comment:13 Changed 3 years ago by akaariai

Quick review: The attribute check_same_thread is used in the exception risen from validate_thread_sharing (instead of the allow_thread_sharing). If it is not public API, should it be mentioned at all? I kinda like the idea of making it semi-public...

I haven't done any performance testing, but I would be surprised if the checks would cause any noticeable performance regressions. The checks are in places where the actual operation is likely going to be much more expensive than the check. Maybe the overhead in .cursor() could be noticed if you repeatedly call it and don't actually do anything with the cursor. But that is hardly an use case to optimize for...

Re the needed release notes. Here is the things I thought would be good to mention (lots of polish needed):

  • If you are passing connections (that is, connections[some_alias] instead of just some_alias) between threads, you are now sharing the actual connection instead of having different connections to the same database. Using the same connection in different threads is an error, so this should not accidentally hit you. Pass the alias instead and use connections[alias] in the other thread to get a separate connection.
  • If you just use the ORM or connection.cursor() for raw SQL, then nothing to see here.
  • django.db.connection is now a proxy for the DatabaseWrapper. If you need to have the real DBWrapper for some reason, use connections[DEFAULT_DB_ALIAS] instead.
  • (if allow_thread_sharing is made public): While it is possible to pass connections between threads now, Django does not make any effort to synchronize access to the underlying connection implementation. Concurrency behavior is defined by the underlying implementation. Check their docs for details.

Did I miss something? Is this overly cautious/verbose? I wonder what to say about garbage collection. It is very technical, and the risk sounds a lot worse than it actually is. Maybe it would be good to keep the release notes section very short, and have a longer explanation somewhere else (check django-developers thread X or Trac post Y for more details?).

Changed 3 years ago by julien

comment:14 Changed 3 years ago by julien

I believe it's ready to go. You may do a quick scan of the patch before I commit.

comment:15 Changed 3 years ago by akaariai

I wonder if a normal user who just uses the ORM and does nothing special will understand the release notes. However, I believe it is fine to adjust the release notes later on, if need be. Also: Finally, while it is now possible to pass connections between threads... section hints that it is possible to pass connections between threads. However, using public API, you are not allowed to do that.

I will try to run this through djangobench to see if there is any performance problems lurking in there. Otherwise looks good to me.

comment:16 Changed 3 years ago by julien

Yes, perhaps the release notes are overly technical. The fact we're not publicising allow_thread_sharing perhaps also makes it a bit confusing ("it is now possible to share connections between threads but we won't tell you how").

The two main things that have changed and could surprise some users are that django.db.connection now is a proxy, and that underlying sqlite connections are thread-shareable. The former is straight forward, the latter is a bit awkward without telling more details about the new behaviour.

comment:17 Changed 3 years ago by akaariai

OK, benchmarks ran, nothing repeatable found. There is something wrong with my computer or djangobench, because in the attached benchmarks query_update and query_get show bad results, but when I retested them, the results were not any slower. I wonder if it is the garbage collector which affect the results. The tests were run as:

djangobench --trials 400 --control master --experiment connections_threadlocal --vcs git

Re the release notes. Maybe the best thing is to let them just be, and fine tune them later on. Get opinions of more people, and then tune them. Waiting for the opinions before commit will make this a 1.5 feature...

Changed 3 years ago by akaariai

djangobench results (see comment 17)

comment:18 Changed 3 years ago by julien

  • Triage Stage changed from Accepted to Ready for checkin

In my experience djangobench's results are to be taken with a grain of salt as many external factors seem to influence them. As long as nothing is repeatable then we should be safe. I'll go ahead with it.

comment:19 Changed 3 years ago by julien

  • Resolution set to fixed
  • Status changed from new to closed

In [17205]:

(The changeset message doesn't reference this ticket)

comment:20 Changed 3 years ago by julien

In [17206]:

Ensured that thread-shareability gets validated when closing a PostgreSQL or SQLite connection. Refs #17258.

comment:21 Changed 3 years ago by vsajip

  • Resolution fixed deleted
  • Status changed from closed to reopened

This change leads to a failure under Python 3.x because the SQLite DatabaseWrapper is no longer hashable. I noticed that the BaseDatabaseWrapper defines __eq__ but not __hash__ - does this not prevent its instances from being keys in dictionaries / members of sets?

If you add hash = object.hash, then the problem seemingly goes away.

Last edited 3 years ago by vsajip (previous) (diff)

comment:22 follow-up: Changed 3 years ago by Alex

  • Resolution set to fixed
  • Status changed from reopened to closed

Please don't reopen this ticket, if there's a new issue as a result open a new ticket.

comment:23 in reply to: ↑ 22 Changed 3 years ago by vsajip

Replying to Alex:

Please don't reopen this ticket, if there's a new issue as a result open a new ticket.

Okay, opened #17427.

comment:24 Changed 3 years ago by ramiro

In [17499]:

Removed code duplicated in SQLite3 and SpatiaLite GeoDjango DB backends.

Moved it to an auxiliary method in the SQLite3 backend cursor class.
Did this to reduce the chances of us forgetting to port changes in DB
connection setup process from the former to the latter one like it
happened e.g. in r17205.

Refs #17258.

Note: See TracTickets for help on using tickets.
Back to Top