Opened 5 years ago
Closed 4 years ago
#31276 closed Cleanup/optimization (wontfix)
Add 'flush' option to only clear non-empty tables
Reported by: | Adam Johnson | Owned by: | jonatron |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
This was suggested as an extra feature for Django-MySQL by @jonatron in https://github.com/adamchainz/django-mysql/pull/611/files, but it would make more sense to implement it in Django core.
The sql_flush database operation, used by the 'flush' management command and, through it, by TransactionTestCase, creates one TRUNCATE TABLE statement for every table, regardless of whether the table is already empty. This means unnecessary database operations are run, which will be most noticeable in test runs.
Instead of flushing *every* table, we can flush only the non-empty ones.
A minimal, hopefully cross-database way of quickly finding out which tables in a list are non-empty is to run existence queries on them, combined with UNION:
chainz@localhost [14]> (select 't' as has_rows from t limit 1) union (select 't2' as has_rows from t2 limit 1);
+----------+
| has_rows |
+----------+
| t2       |
+----------+
1 row in set (0.001 sec)
We could use logic like this to batch the tables into such union queries in django.core.management.sql.sql_flush. We could then reduce the list of tables we wish to flush down to the non-empty ones for the sql_flush operation. Perhaps it would be best to put this behind a flag that is set only in tests, for backwards compatibility.
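A rough sketch of that flow (the helper name and wiring here are illustrative assumptions, not an actual patch) might look like this:

from django.db import connections


def non_empty_tables(using, tables):
    """Illustrative sketch: return the subset of `tables` that contain rows."""
    if not tables:
        return []
    connection = connections[using]
    qn = connection.ops.quote_name
    # One existence probe per table, combined into a single UNION query.
    # Table names come from model metadata, so plain string formatting is used
    # here only for brevity. LIMIT works on SQLite/MySQL/PostgreSQL; Oracle
    # would need a different form.
    sql = " UNION ".join(
        "(SELECT '{}' AS has_rows FROM {} LIMIT 1)".format(table, qn(table))
        for table in tables
    )
    with connection.cursor() as cursor:
        cursor.execute(sql)
        non_empty = sorted(row[0] for row in cursor.fetchall())
    # The reduced list would then be handed to connection.ops.sql_flush()
    # (whose exact signature depends on the Django version) instead of the
    # full table list.
    return non_empty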
Change History (12)
comment:1 by , 5 years ago
Triage Stage: | Unreviewed → Accepted |
---|
+1. What sort of backwards compatibility issues would you imagine here?
comment:2 by , 5 years ago
Not sure actually, just being cautious by suggesting the flag in case someone else can come up with a case. As far as I can tell, as long as we still reset sequences, there’s no need to truncate/delete from empty tables.
comment:3 by , 5 years ago
I'm not sure, TBH. Checking no. of rows for each table will have a performance impact, and creating a huge union for all tables (as proposed) may not be better:
(select 't' as has_rows from t limit 1) union (select 't2' as has_rows from t2 limit 1);
We need to take into account a few restrictions, e.g. the database's limit on the number of query parameters, and the fact that third-party database backends may not support UNION.
comment:4 by , 5 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:5 by , 5 years ago
I've been benchmarking with 1000 tables and 5 tests. You'll notice sqlite has a problem with too many tables. Also the query for sqlite needs to be:
select * from ( SELECT '{}' AS has_rows FROM {} LIMIT 1 )
instead of
( SELECT '{}' AS has_rows FROM {} LIMIT 1 )
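For illustration, a hypothetical helper (build_existence_term is not part of the patch) could handle that per-backend difference when building each union term:

def build_existence_term(connection, table):
    # One UNION term per table: it yields the table name if at least one row exists.
    qn = connection.ops.quote_name
    inner = "SELECT '{}' AS has_rows FROM {} LIMIT 1".format(table, qn(table))
    if connection.vendor == "sqlite":
        # As noted above, SQLite won't accept a parenthesized SELECT with
        # LIMIT as a compound term, so wrap it in an outer SELECT instead.
        return "SELECT * FROM ( {} )".format(inner)
    return "( {} )".format(inner)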
postgres
current behaviour
Ran 5 tests in 1.538s
Ran 5 tests in 1.545s
Ran 5 tests in 1.573s
with patch
Ran 5 tests in 0.807s
Ran 5 tests in 0.880s
Ran 5 tests in 0.812s
mysql
current behaviour
Ran 5 tests in 17.101s
Ran 5 tests in 15.396s
Ran 5 tests in 15.487s
with patch
Ran 5 tests in 3.891s
Ran 5 tests in 3.545s
Ran 5 tests in 3.522s
sqlite - 400 tables
current behaviour
Ran 5 tests in 2.006s
Ran 5 tests in 2.039s
Ran 5 tests in 2.055s
with patch
Ran 5 tests in 0.106s
Ran 5 tests in 0.104s
Ran 5 tests in 0.107s
sqlite - 1000 tables
django.db.utils.OperationalError: too many terms in compound SELECT
comment:6 by , 5 years ago
About SQLite, I mentioned this in my comment "... the database's limit on the number of query parameters" (999 on SQLite).
comment:7 by , 5 years ago
The sqlite problem can be fixed using the same chunking as in _quote_params_for_last_executed_query.
I can't explain sqlite's performance.
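A sketch of what that chunking might look like (the batch size and function name are assumptions for illustration, not taken from the patch):

from django.db import connection

# SQLite's default SQLITE_MAX_COMPOUND_SELECT is 500, so batches of at most
# 500 terms keep each UNION query within the compound-SELECT limit.
BATCH_SIZE = 500


def non_empty_tables_chunked(tables):
    """Illustrative sketch: probe tables for rows in batches of BATCH_SIZE."""
    qn = connection.ops.quote_name
    found = set()
    for start in range(0, len(tables), BATCH_SIZE):
        batch = tables[start:start + BATCH_SIZE]
        # Uses the SQLite-friendly wrapped form from comment:5; other backends
        # could keep the plain parenthesized form.
        sql = " UNION ".join(
            "SELECT * FROM (SELECT '{}' AS has_rows FROM {} LIMIT 1)".format(t, qn(t))
            for t in batch
        )
        with connection.cursor() as cursor:
            cursor.execute(sql)
            found.update(row[0] for row in cursor.fetchall())
    return found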
sqlite
1000 tables with patch
Ran 5 tests in 0.237s
Ran 5 tests in 0.345s
Ran 5 tests in 0.288s
Ran 5 tests in 0.257s
Ran 5 tests in 0.240s
1000 tables without patch
Ran 5 tests in 55.617s
Ran 5 tests in 57.577s
Ran 5 tests in 55.242s
900 tables without patch
Ran 5 tests in 45.796s
800 tables without patch
Ran 5 tests in 35.795s
600 tables without patch
Ran 5 tests in 20.521s
550 tables without patch
Ran 5 tests in 19.025s
500 tables without patch
Ran 5 tests in 3.049s
400 tables without patch
Ran 5 tests in 2.068s
200 tables without patch
Ran 5 tests in 0.612s
comment:9 by , 5 years ago
Needs documentation: | set |
---|
comment:10 by , 4 years ago
Needs documentation: | unset |
---|
comment:11 by , 4 years ago
Thanks for your efforts; however, I have strong doubts about adding this feature.
First of all, it's not atomic, so there is a potential race condition.
Secondly, it's hard to choose a reasonable number of tables for batching, because we can hit various restrictions here, e.g. the maximum number of terms in a compound SELECT (SQLITE_MAX_COMPOUND_SELECT on SQLite), the maximum length of an SQL query, the maximum number of tables in a query, etc. We should also take into account the time needed to parse such a long query. 500 seems reasonable for all backends, but it's not well-grounded.
Thirdly, "flush" is optimized in most databases, e.g. TRUNCATE TABLE on Oracle is really fast, and it will definitely be slower when combined with an enormous query for checking empty tables, the number of which may be marginal. IMO this is something that should be optimized in database engines, not in Django.
In the end, this addition isn't worth the complexity.
comment:12 by , 4 years ago
Resolution: | → wontfix |
---|---|
Status: | assigned → closed |