Opened 8 hours ago
#36700 new Bug
ASGIHandler creates reference cycles that require a gc pass to free
| Reported by: | Patryk Zawadzki | Owned by: | |
|---|---|---|---|
| Component: | HTTP handling | Version: | 5.2 | 
| Severity: | Normal | Keywords: | memory asgihandler gc | 
| Cc: | Patryk Zawadzki | Triage Stage: | Unreviewed | 
| Has patch: | no | Needs documentation: | no | 
| Needs tests: | no | Patch needs improvement: | no | 
| Easy pickings: | no | UI/UX: | no | 
Description
Disclaimer: it's impossible for pure Python code to truly leak memory (in the sense that valgrind would detect), however it's quite easy to create structures that effectively occupy memory for a long time because they require the deepest (generation 2) garbage collection cycle to collect and that happens very rarely. In addition to that, the more such structures aggregate, the more expensive the garbage collection cycle becomes, because it effectively stops the entire interpreter to do its job and it can take seconds. On top of that, it's entirely possible for a container to run out of memory before the garbage collection happens and we (Saleor Commerce) see containers being terminated by the kernel OOM killer due to high memory pressure where most of that memory is locked by garbage.
One such case is found in the ASGIHandler. When handling a request, the ASGIHandler.handle spawns two async tasks. One for the actual app code (process_request) and one for the disconnection handler (ASGIHandler.listen_for_disconnect). The latter will raise RequestAborted every time it receives the http.disconnect ASGI message.
In our setup (uvicorn), the http.disconnect message is received for every request, even after successfully processing the view code and delivering the response, but that's not critical for this issue, it just makes it easy to reproduce this on our end.
Here's where the problem is:
- When RequestAbortedis raised, its stack trace includes the call toASGIHandler.handle, which is whereASGIHandler.listen_for_disconnectwas called.
- In turn, the ASGIHandler.handlestack frame includes references to all local variables.
- Among those variables is taskswhich holds the references to both async tasks.
- Now, one of those tasks is the task created from ASGIHandler.listen_for_disconnect.
- The task future is already resolved and now holds a reference back to the RequestAbortedexception from step 1. And thus the cycle completes, creating an unfreeable reference cycle.
All of those objects hold references to other objects and stack frames that also become unfreeable, ending up holding a sizeable list of objects hostage until the next time gc.collect(2) happens (which can be minutes, depending on how much code your app executes).
Making ASGIHandler.handle explicitly call tasks.clear() or just del tasks after the tasks are no longer needed breaks the cycle by removing the link between the exception stack frame locals and the future referencing the exception.
PS: I've classified this as a bug as high memory use can lead to OOM kills and crashes but feel free to reclassify as "cleanup/optimization" if that's more fitting.