Opened 4 weeks ago

Closed 4 weeks ago

#37033 closed Uncategorized (invalid)

Psycopg2 has different semantics than psycopg3 and should not be deprecated

Reported by: Cameron Gorrie Owned by:
Component: Uncategorized Version: 6.0
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The context: we have been trialling psycopg3 in our (async-free!) Django 5.2 codebase. I noted that https://docs.djangoproject.com/en/6.0/releases/4.2/#psycopg-3-support says "Support for psycopg2 is likely to be deprecated and removed at some point in the future." -- this ticket has my observations that make me believe this is not a good idea.

We have many Celery (5.6.2) tasks that run on idempotent queue workers with Django fully loaded. When we re-deploy, we kill the worker containers and re-queue the tasks that were in-flight / ready to process on that worker. It does this by raising exceptions from signal handlers: https://github.com/celery/billiard/blob/main/billiard/common.py#L106 ultimately raises a SystemExit exception from inside the signal handler when it receives the SIGQUIT signal.

As outlined in [Safe Asynchronous Exceptions for Python](https://www.cs.williams.edu/~freund/papers/python.pdf), raising exceptions in this way can arbitrarily interrupt execute at any bytecode boundary.

Example case

Here's a simplified task handler so you can visualize the problem:

@celery_app.task(MY_IDEMPOTENT_QUEUE)
def my_task():
    with transaction.atomic():
        [...]
        my_model.save()
        transaction.on_commit(lambda: my_other_task.delay())

Sometimes, when a signal is delivered while commit() is running, the following happens:

  1. PostgreSQL has applied the COMMIT (we can later verify DB state).
  2. psycopg3 is interrupted during its Python‑level wait loop; an exception bubbles out of commit() (or _set_autocommit) rather than a normal return.
  3. Django sees an exception in atomic.__exit__ and therefore does not run on_commit callbacks, even though the transaction actually committed.

Root cause

  • psycopg2's COMMIT path uses PQexec("COMMIT") inside a Py_BEGIN_ALLOW_THREADS region, so the interpreter will not run the Python signal handler until that call returns; this effectively makes COMMIT atomic w.r.t. Python async exceptions.
  • psycopg3's design explicitly avoids blocking libpq calls, using a Python generator + wait functions to drive the COMMIT protocol step-by-step. That means Python bytecode is executing throughout, so signal handlers can raise exceptions mid‑protocol.

I do not believe this will be possible to solve with psycopg3's approach, so developers using Django's on_commit inside of celery worker handlers need to be aware of this behaviour.

Change History (3)

comment:1 by Simon Charette, 4 weeks ago

I think the real problem at play is that your worker deployment workflow is not truly idempotent; I don't think it's fair to assume that SIGKILL'ing a process without gracefully handling it on the receiving end will do the right thing by default.

The way Celery deals with SoftTimeLimitExceeded and worker termination signals by surfacing exception and allowing the flow of the program to be interrupted at any time is naive and error prone (it also causes cursor corruption on the mysqlclient MySQL library). The correct way of gracefully handling non-fatal error signal is usually to set a flag on the execution context that the application can check at a safe checkpoint (self.terminate = True).

When we re-deploy, we kill the worker containers and re-queue the tasks that were in-flight / ready to process on that worker.

Any reason not to perform a warm shutdown first to limit the cases when this can happen?

I'm afraid there's is little Django can do here, if psycopg2 development halts and it is no longer supported upstream we will have to pull the plug on it. I'm not sure this is the right place to advocate for a problem at the intersection of Celery and psycopg to be solved, have you raised this problem with any of these third party library maintainers? Surely Celery is aware that this signalling approach is problematic (e.g. what if a task is the middle of performing a critical HTTP request when the signal surfaces?) and has documented patterns on how to deal with this problem.

Last edited 4 weeks ago by Simon Charette (previous) (diff)

comment:2 by Cameron Gorrie, 4 weeks ago

Thanks for taking a look, Simon. Agreed that we're asking a lot of the libraries / frameworks here w.r.t. idempotency.

Warm shutdown seems like a good plan: check e.g. self.terminate = True periodically. In the context we're supporting this system, it would require us to change an uncomfortable amount of code, but it is doable. I thought about doing it in a thread, but I think SystemExit would have the exact same problem. I have a feeling we'd have options if we had async handlers, but that would be a big change... can you think of anything I've missed?

I've opened a ticket with psycopg and am about to with celery. At first look, I didn't see anything in the documentation that explicitly calls this out. I don't think this issue is resolved by writing code -- but to collaborate in ensuring the message is out to help those who--perhaps naively--believe this combination will work. I opened it here too because the root cause is two-fold:

  1. celery soft / cold shutdown (i.e. exception propagation)
  2. I/O "commits" that require transactional properties across Python bytecode

Django ORM's on_commit semantics hit the latter: when we can't be sure if our COMMIT was successful or not, we cannot run code that depends on the result. It results in situations that seem impossible until you look at the problem from the right lens.

In my case I inherited parts of this system and am still in some ways finding my footing; opening a ticket with this problem in itself makes it more discoverable, but I think you're right -- it's Celery's soft shutdown that intrinsically breaks some of the lower-level stuff in the python stack.

Last edited 4 weeks ago by Cameron Gorrie (previous) (diff)

comment:3 by Sarah Boyce, 4 weeks ago

Resolution: invalid
Status: newclosed

Thank you both for the discussion. As the current conclusion is that Django is not at fault here, closing the ticket

Note: See TracTickets for help on using tickets.
Back to Top