Opened 4 weeks ago
Closed 4 weeks ago
#37033 closed Uncategorized (invalid)
Psycopg2 has different semantics than psycopg3 and should not be deprecated
| Reported by: | Cameron Gorrie | Owned by: | |
|---|---|---|---|
| Component: | Uncategorized | Version: | 6.0 |
| Severity: | Normal | Keywords: | |
| Cc: | | Triage Stage: | Unreviewed |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
The context: we have been trialling psycopg3 in our (async-free!) Django 5.2 codebase. I noted that https://docs.djangoproject.com/en/6.0/releases/4.2/#psycopg-3-support says "Support for psycopg2 is likely to be deprecated and removed at some point in the future." This ticket records the observations that lead me to believe this is not a good idea.
We have many Celery (5.6.2) tasks that run on idempotent queue workers with Django fully loaded. When we re-deploy, we kill the worker containers and re-queue the tasks that were in-flight / ready to process on that worker. Celery does this by raising exceptions from signal handlers: https://github.com/celery/billiard/blob/main/billiard/common.py#L106 ultimately raises a SystemExit exception from inside the signal handler when it receives the SIGQUIT signal.
As outlined in [Safe Asynchronous Exceptions for Python](https://www.cs.williams.edu/~freund/papers/python.pdf), raising exceptions in this way can arbitrarily interrupt execution at any bytecode boundary.
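For concreteness, here is a minimal sketch of that pattern -- a stand-in for what billiard does, not its actual code:

```python
# Minimal sketch of the pattern (a stand-in, not billiard's actual code).
import signal

def raise_system_exit(signum, frame):
    # CPython runs this handler at the next bytecode boundary in the
    # main thread, so the exception can surface in the middle of any
    # pure-Python code path -- including a driver's commit loop.
    raise SystemExit(f"terminated by signal {signum}")

signal.signal(signal.SIGQUIT, raise_system_exit)
```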
Example case
Here's a simplified task handler so you can visualize the problem:
```python
@celery_app.task(queue=MY_IDEMPOTENT_QUEUE)
def my_task():
    with transaction.atomic():
        [...]
        my_model.save()
        transaction.on_commit(lambda: my_other_task.delay())
```
Sometimes, when a signal is delivered while commit() is running, the following happens:
- PostgreSQL has applied the COMMIT (we can later verify DB state).
- psycopg3 is interrupted during its Python‑level wait loop; an exception bubbles out of commit() (or _set_autocommit) rather than a normal return.
- Django sees an exception in atomic.__exit__ and therefore does not run on_commit callbacks, even though the transaction actually committed.
Root cause
- psycopg2's COMMIT path uses PQexec("COMMIT") inside a Py_BEGIN_ALLOW_THREADS region, so the interpreter will not run the Python signal handler until that call returns; this effectively makes COMMIT atomic w.r.t. Python async exceptions.
- psycopg3's design explicitly avoids blocking libpq calls, using a Python generator + wait functions to drive the COMMIT protocol step by step. That means Python bytecode is executing throughout, so signal handlers can raise exceptions mid-protocol.
I do not believe this can be solved within psycopg3's approach, so developers using Django's on_commit inside Celery worker handlers need to be aware of this behaviour.
Change History (3)
comment:1 by , 4 weeks ago
I think the real problem at play is that your worker deployment workflow is not truly idempotent; I don't think it's fair to assume that SIGKILL'ing a process without gracefully handling it on the receiving end will do the right thing by default. The way Celery deals with SoftTimeLimitExceeded and worker termination signals, by surfacing an exception and allowing the flow of the program to be interrupted at any time, is naive and error prone (it also causes cursor corruption with the mysqlclient MySQL library). The correct way of gracefully handling a non-fatal signal is usually to set a flag on the execution context that the application can check at a safe checkpoint (self.terminate = True). Any reason not to perform a warm shutdown first to limit the cases when this can happen?
I'm afraid there is little Django can do here; if psycopg2 development halts and it is no longer supported upstream, we will have to pull the plug on it. I'm not sure this is the right place to advocate for a problem at the intersection of Celery and psycopg to be solved. Have you raised this problem with any of these third-party library maintainers? Surely Celery is aware that this signalling approach is problematic (e.g. what if a task is in the middle of performing a critical HTTP request when the signal surfaces?) and has documented patterns on how to deal with this problem.
comment:2 by , 4 weeks ago
Thanks for taking a look, Simon. Agreed that we're asking a lot of the libraries / frameworks here w.r.t. idempotency.
Warm shutdown seems like a good plan: check e.g. self.terminate = True periodically. In the context we're supporting this system, it would require us to change an uncomfortable amount of code, but it is doable. I thought about doing it in a thread, but I think SystemExit would have the exact same problem. I have a feeling we'd have options if we had async handlers, but that would be a big change... can you think of anything I've missed?
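For illustration, a minimal sketch of that flag-at-a-checkpoint pattern; the names (terminate_requested, process_batch, handle_one) are assumptions, not Celery or billiard API:

```python
# Illustrative sketch of cooperative shutdown; the names here are
# assumptions, not Celery or billiard API.
import signal
import threading

terminate_requested = threading.Event()

def request_shutdown(signum, frame):
    # Record the request instead of raising, so in-flight work stops at
    # a point where no COMMIT can be in progress.
    terminate_requested.set()

signal.signal(signal.SIGQUIT, request_shutdown)

def process_batch(items, handle_one):
    for item in items:
        if terminate_requested.is_set():
            break  # safe checkpoint: outside any transaction
        handle_one(item)
```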
I've opened a ticket with psycopg and am about to open one with Celery. At first look, I didn't see anything in the documentation that explicitly calls this out. I don't think this issue will be resolved by writing code -- the aim is to collaborate on getting the message out to help those who, perhaps naively, believe this combination will work. I opened it here too because the root cause is two-fold:
- celery soft / cold shutdown (i.e. exception propagation)
- I/O "commits" that require transactional properties across Python bytecode
Django ORM's on_commit semantics hit the latter: when we can't be sure whether our COMMIT was successful, we cannot run code that depends on the result. It results in situations that seem impossible until you look at the problem through the right lens.
In my case I inherited parts of this system and am still in some ways finding my footing; opening a ticket with this problem in itself makes it more discoverable, but I think you're right -- it's Celery's soft shutdown that intrinsically breaks some of the lower-level stuff in the python stack.
comment:3 by , 4 weeks ago
| Resolution: | → invalid |
|---|---|
| Status: | new → closed |
Thank you both for the discussion. As the current conclusion is that Django is not at fault here, I'm closing the ticket.