Opened 5 years ago

Closed 4 years ago

#31276 closed Cleanup/optimization (wontfix)

Add 'flush' option to only clear non-empty tables

Reported by: Adam Johnson Owned by: jonatron
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

This was suggested as an extra feature for Django-MySQL by @jonatron in ​https://github.com/adamchainz/django-mysql/pull/611/files , but it would make more sense to make it in Django core.

The sql_flush database operation, used by the 'flush' management command and, thus through TransactionTestCase, creates one TRUNCATE TABLE statement for every table - regardless of whether they are empty already. This means unnecessary database operations are run, which will be most noticeable in test runs.

Instead of flushing *every* table, we can flush only the non-empty ones.

A minimal, hopefully cross-database way of quickly finding out which tables in a list are non-empty is to make existence queries on them combined with union:

chainz@localhost [14]> (select 't' as has_rows from t limit 1) union (select 't2' as has_rows from t2 limit 1);
+----------+
| has_rows |
+----------+
| t2       |
+----------+
1 row in set (0.001 sec)

We could use logic like this to batch the tables into such union queries in django.core.management.sql.sql_flush. We could then reduce the list of tables we wish to flush down to the non-empty ones for the sql_flush operation. Perhaps it would be best to put this behind a flag that is set only in tests, for backwards compatibility.

Change History (12)

comment:1 by Claude Paroz, 5 years ago

Triage Stage: UnreviewedAccepted

+1. What sort of backwards compatibility issues would you imagine here?

comment:2 by Adam Johnson, 5 years ago

Not sure actually, just being cautious by suggesting the flag in case someone else can come up with a case. As far as I can tell, as long as we still reset sequences, there’s no need to truncate/delete from empty tables.

comment:3 by Mariusz Felisiak, 5 years ago

I'm not sure, TBH. Checking no. of rows for each table will have a performance impact, and creating a huge union for all tables (as proposed) may not be better:

(select 't' as has_rows from t limit 1) union (select 't2' as has_rows from t2 limit 1);

We need to take into account a few restrictions e.g. the database's limit on the number of query parameters and 3rd-party database backends may not support UNION.

comment:4 by jonatron, 5 years ago

Owner: changed from nobody to jonatron
Status: newassigned

comment:5 by jonatron, 5 years ago

I've been benchmarking with 1000 tables and 5 tests. You'll notice sqlite has a problem with too many tables. Also the query for sqlite needs to be:
select * from ( SELECT '{}' AS has_rows FROM {} LIMIT 1 )
instead of
( SELECT '{}' AS has_rows FROM {} LIMIT 1 )

postgres

current behaviour
Ran 5 tests in 1.538s
Ran 5 tests in 1.545s
Ran 5 tests in 1.573s

with patch
Ran 5 tests in 0.807s
Ran 5 tests in 0.880s
Ran 5 tests in 0.812s

mysql
current behaviour
Ran 5 tests in 17.101s
Ran 5 tests in 15.396s
Ran 5 tests in 15.487s

with patch
Ran 5 tests in 3.891s
Ran 5 tests in 3.545s
Ran 5 tests in 3.522s

sqlite - 400 tables
current behaviour
Ran 5 tests in 2.006s
Ran 5 tests in 2.039s
Ran 5 tests in 2.055s

with patch
Ran 5 tests in 0.106s
Ran 5 tests in 0.104s
Ran 5 tests in 0.107s

sqlite - 1000 tables
django.db.utils.OperationalError: too many terms in compound SELECT

comment:6 by Mariusz Felisiak, 5 years ago

About SQLite, I mentioned this in my comment "... the database's limit on the number of query parameters" (999 on SQLite).

comment:7 by jonatron, 5 years ago

The sqlite problem can be fixed using the same chunking as is in _quote_params_for_last_executed_query.
I can't explain sqlite's performance.

sqlite
1000 tables with patch
Ran 5 tests in 0.237s
Ran 5 tests in 0.345s
Ran 5 tests in 0.288s
Ran 5 tests in 0.257s
Ran 5 tests in 0.240s

1000 tables without patch

Ran 5 tests in 55.617s
Ran 5 tests in 57.577s
Ran 5 tests in 55.242s

900 tables without patch
Ran 5 tests in 45.796s

800 tables without patch
Ran 5 tests in 35.795s

600 tables without patch
Ran 5 tests in 20.521s

550 tables without patch
Ran 5 tests in 19.025s

500 tables without patch
Ran 5 tests in 3.049s

400 tables without patch
Ran 5 tests in 2.068s

200 tables without patch
Ran 5 tests in 0.612s

comment:8 by jonatron, 5 years ago

Has patch: set
Last edited 5 years ago by Mariusz Felisiak (previous) (diff)

comment:9 by Mariusz Felisiak, 5 years ago

Needs documentation: set

comment:10 by jonatron, 4 years ago

Needs documentation: unset

comment:11 by Mariusz Felisiak, 4 years ago

Thanks for your efforts, however I have strong doubts about adding this feature.

First of all it's not atomic so there is a potential race condition.

Secondly it's hard to choose a reasonable number of tables for batching, because we can hit here various restrictions e.g. the maximum number of terms (SQLITE_MAX_COMPOUND_SELECT on SQLite), the maximum length of SQL query, maximum no. of tables in a query etc. We also should take into account time needed for parsing such a long query. 500 seems reasonable for all backends, but it's not well-grounded.

Thirdly "flush" is optimized in most databases, e.g. TRUNCATE TABLE on Oracle is really fast and will be definitely slower with performing enormous query for checking empty tables, the number of which may be marginal. IMO it is sth that should be optimized in database engines not in Django.

In the end, this addition isn't worth the complexity.

comment:12 by Mariusz Felisiak, 4 years ago

Resolution: wontfix
Status: assignedclosed
Note: See TracTickets for help on using tickets.
Back to Top