Opened 7 years ago
Closed 7 years ago
#29563 closed Cleanup/optimization (fixed)
Add support for QuerySet.iterator() result streaming on SQLite
Reported by: | Andrew Brown | Owned by: | Andrew Brown |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Ryan P Kilby | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I’m writing a non-web app that uses Django as the ORM and SQLite as the backend, and I have a need to iterate over large tables efficiently. Django’s documentation says Queryset.iterator() does not work on SQLite (1) (2) but I tried it anyways, and discovered that it works: results are not read into memory in entirety, but streamed from the database in chunks. I traced this to an apparent logic error in the SQLCompiler.execute_sql() method (3) and the interpretation of the can_use_chunked_reads flag. More on this below.
But I was curious why it didn’t crash or get some error from the SQLite library despite the Django documentation saying SQLite doesn’t support streaming queries. Careful reading of the SQLite documentation seems to indicate there’s no problem reading a query in one cursor and writing to the database (even the same table) in another. There’s a caveat about isolation (4) to watch out for, but otherwise seems like a perfectly supported mode of operation.
So I dug into Django’s history and I came upon Django bug #7411. This bug was written in June 2008. At the time, the latest version of SQLite was 3.5.9, which didn’t support database commits interleaved with partially read cursors. So the workaround implemented was to read the entire query result into memory. The database feature flag “can_use_chunked_reads” was added, and this was set to False in the SQLite backend. Code was added to the SQLCompiler.execute_sql() method to wrap the result iterator in list(result) if that flag was false [3b37c8151a]. I call this a “workaround” because it’s working around a limitation SQLite had at the time.
However, later that year SQLite version 3.6.5 was released, which added the ability to run COMMIT simultaneously with other read operations (5). I was not able to reproduce bug #7411 using SQLite versions >= 3.6.5. This version was released in November 2008.
So about that bug in SQLCompiler.execute_sql(). From what I can tell, Django ticket #16614 committed a change [f3b7c05936] which introduced a logic bug to the relevant code in June 2016, reproduced below (3):
if not chunked_fetch and not self.connection.features.can_use_chunked_reads: try: # If we are using non-chunked reads, we return the same data # structure as normally, but ensure it is all read into memory # before going any further. Use chunked_fetch if requested. return list(result) finally: # done with the cursor cursor.close() return result
Notice the condition will skip the if statement body if either chunked_fetch is True or if can_use_chunked_reads is True. Since calling queryset.iterator() sets chunked_fetch to True, the can_use_chunked_reads flag is ignored and the if statement skipped. I believe the “and” should be an “or”, which would return list(result) if the database doesn’t support chunked reads regardless of the chunked_fetch value.
Regardless of that bug, I suggest the workaround be removed entirely since it hasn’t been necessary since 2008 and hasn’t been functional since 2016. If we need to support the can_use_chunked_reads flag for compatibility with custom and third-party database backends, then we can fix the logic error and set SQLite’s flag to True. Documentation should be updated accordingly.
If there is a reason for keeping chunked reads disabled for SQLite (such as the SQLite caveats on isolation (4), or needing to support older versions of SQLite <3.6.5), then the logic error should be fixed.
I’m willing to put in a pull request, but since the situation is quite complicated (the workaround also inadvertently helped avoid a related bug in Python’s sqlite3 driver, #10513, fixed as of Python 2.7.13 and 3.5.3) I wanted to keep this report as short as I could to get some feedback first.
(1) https://docs.djangoproject.com/en/2.0/ref/models/querysets/#without-server-side-cursors
(2) https://github.com/django/django/blob/2.0.7/django/db/backends/sqlite3/features.py#L9
(3) https://github.com/django/django/blob/2.0.7/django/db/models/sql/compiler.py#L1096
(4) https://sqlite.org/isolation.html (see final paragraph “No Isolation Between Operations On The Same Database Connection”)
(5) https://sqlite.org/releaselog/3_6_5.html
Attachments (2)
Change History (16)
comment:1 by , 7 years ago
Cc: | added |
---|
comment:2 by , 7 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:4 by , 7 years ago
I’ve attached my test case test1.py that I used to reproduce bug #7411. It doesn’t use Django and hits the sqlite interface directly in what I believe is the same pattern as #7411 to hit the same bug.
To show the workaround is no longer needed, I reproduced the bug under Python 2.5.2 and SQLite 3.5.9, then tried to reproduce the bug under any version of Python using SQLite >=3.6.5
Under Python 2.5.2 compiled against SQLite 3.5.9 (the latest of each as of mid 2008) it gets the same error as bug #7411:
test1.py Python Version: 2.5.2 (trunk, Jul 19 2018, 14:16:29) [GCC 5.4.0 20160609] SQLite Version: 3.5.9 Traceback (most recent call last): File "../test1.py", line 25, in <module> conn.commit() sqlite3.OperationalError: SQL logic error or missing database
The exact error changes depending on the combination of Python and SQLite versions, and some versions of Python the test isn’t reproducible but a similar bug is (Python bug #10513 test case attached as test2.py). However, I could not reproduce #7411 with any version of Python when using SQLite versions 3.6.5 and up.
test1.py Python Version: 2.5.2 (trunk, Jul 19 2018, 14:10:44) [GCC 5.4.0 20160609] SQLite Version: 3.6.5 Didn't crash
Python bug #10513 was present in some versions regardless of SQLite version, but was fixed in 2.7.13 and 3.5.3. As I understand Django only supports the latest revision of the supported major Python versions so this bug is no longer relevant either.
Feel free to ask any questions. I’ve tried to summarize what I’ve found, but I have a lot more info including bisections and specific commits where bugs were introduced and fixed. I know this is a relatively complicated situation so let me know what I can do to help verify my claims.
by , 7 years ago
by , 7 years ago
comment:7 by , 7 years ago
I've created a patch that can be seen on this branch: https://github.com/brownan/django/tree/ticket_29563
What would you like to see in the way of new tests? No tests broke for me, so it seems the behavior I changed in SQLCompiler.execute_sql() wasn't covered. How do I approach this?
comment:8 by , 7 years ago
Hey Andrew, first thanks for the extremely well detailed report, investigation, and the initial patch.
The only change I'd make to your branch is to break it in two commits.
A first one that addresses the execute_sql()
and
->or
switch and adds a test to make sure the result is fully iterated when (chunked_fetch := True) and (connection.features.can_use_chunked_reads := False)
. I'm not sure exactly where the test should live and how the testing should be performed but you could rely on @skipIfDatabaseFeature('can_use_chunked_reads')
to skip the test on backends that don't support it. I'd start with this commit while having the feature turned off on SQLite to make the iteration easier. I'd also refer the commit that added support for this feature in the commit message.
Then I'd include the rest of the changes in a second commit referring to this ticket.
By the way, you can open a PR with your branch to give it more exposure and have it run against CI.
comment:9 by , 7 years ago
Thanks for the feedback. I'll see what I can come up with for tests in the next few days when I get some free time.
One hitch in testing this is that even if I write a test to make sure the can_use_chunked_reads flag is respected, after the proposed second commit's changes there will be no database backends that set that flag to false; SQLite is the only backend that sets it to False right now. I could handle this by temporarily setting that flag for the test perhaps. Otherwise, that test will always get skipped and there doesn't seem to be much point to me.
comment:10 by , 7 years ago
Has patch: | set |
---|---|
Patch needs improvement: | set |
Type: | Uncategorized → Cleanup/optimization |
One hitch in testing this is that even if I write a test to make sure the can_use_chunked_reads flag is respected, after the proposed second commit's changes there will be no database backends that set that flag to false; SQLite is the only backend that sets it to False right now. I could handle this by temporarily setting that flag for the test perhaps. Otherwise, that test will always get skipped and there doesn't seem to be much point to me.
Temporarily setting the flag in the test is an option but even if no built-in backends have this flag set to False
it might be useful for third-party backends (MSSQL, Firebird) which might not have implemented this feature yet and run the full test suite to determine coverage.
Another option might be to get rid of the feature flag entirely but that would require a post to the django-developers mailing list.
comment:11 by , 7 years ago
I'll let others decide if removing the flag is something worth considering. I've updated the pull request to split it into two commits, and added a unit test that ensures the can_use_chunked_reads flag is properly interpreted. It tests that the SQLCompiler.execute_sql() method returns a list if the flag is False.
comment:12 by , 7 years ago
Patch needs improvement: | unset |
---|---|
Summary: | SQLite and Queryset.iterator() support → Add support for QuerySet.iterator() result streaming on SQLite |
Triage Stage: | Accepted → Ready for checkin |
Accepting tentatively based on the analysis Andrew did.