#20485 closed Cleanup/optimization (fixed)
Find faster ways for test fixture loading
Reported by: | Anssi Kääriäinen | Owned by: | Aymeric Augustin |
---|---|---|---|
Component: | Testing framework | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Currently Django's test suite uses about 70s for fixture loading when running full test suite on in-memory SQLite. Considering that after #20483 Django's test suite takes around 210s in total, this means 1/3 of time is used for fixture loading. Most of the time in fixture loading is used for finding the fixture file. Django needs to do effectively this for each file, for each test:
for each app in INSTALLED_APPS: for each suffix in combo(('json', 'yaml', 'xml') X ('bz2', 'gz', 'zip')): try to open file() - if succeeds add to fixture files, else skip
Unsurprisingly this is expensive. In addition fixture loading after the file has been found isn't that fast, it is actually somewhat faster to create the models by SomeModel.objects.create() than by fixtures.
I have done some testing and it shows that of the 70s, around 50-60s should be avoidable.
Some options:
- Somehow use absolute paths for the fixtures. Maybe by using
path(__file__) + '/fixtures/' + fixture_name
. A subset of this is to always use, say gz compressed files which removes the need to check for combos. - Limit fixture searching to current app (if requested): maybe by something like './somefixture' -> search only from current app. 'somefixture' -> search everywhere.
- Cache the existing fixtures. That is, go only once through the directories, check what files exists and cache that information, then check against the cached information instead of trying to open each possible combo in each possible directory.
- Do not use fixtures for Django's test data.
Of the above I like the last option as I find reading models created in Python code much easier than definitions in fixtures (especially, what relations does given instance have). Unfortunately converting the fixtures takes time. In addition, I am sure some developers favour fixtures instead of models created in Python code.
Marking as accepted as in "lets see what can be done here".
Attachments (1)
Change History (7)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
Would it be faster to call os.listdir
on the fixtures
directory of each application, and then do everything in Python instead of hitting the filesystem repeatedly?
comment:3 by , 11 years ago
I'm attaching a patch that caches the contents of fixture directories in memory and checks this cache before attempting to hit the filesystem.
This patch makes the test suite 10% faster (540s vs. 600s) on my laptop with a SSD.
I didn't include the current directory in the list of directories whose content is cached, because it could contain a lot of files, and the cache would then use lots of memory for no good reason. A regression ensues: fixtures cannot be found the current directory. (Two tests fail because of this.)
This could be fixed by adding an explicit search inside the current directory if no fixture is found... but then the patch starts looking bad, and I didn't go further.
I think I have a better idea to make fixture search decently fast.
Here's how I would load a given fixture label:
- Take the basename of the label and build all the combinations of format, compression, etc. These are the targets we're looking for.
- Build a list of prefixes:
- If label is a relative path, [<app>/fixtures/<label> for each app] + [<fixture_dir>/<label> for each fixture dir] + [<label>]
- If label is an absolute path, [<label>]
- For each prefix, run glob.glob(<prefix>*) to obtain a list of candidates, and split them in dirname + basename. The basenames are the candidates.
- Check the intersection of the set of targets and the set of candidates; these are the fixtures that must be loaded.
- Once this is done for all prefixes, cache the list of all fixtures found for this label.
This minimizes filesystems hits by making the search for a single label efficient, and by performing it only once.
by , 11 years ago
Attachment: | 20485-prewalk-fixtures-dirs.diff added |
---|
comment:4 by , 11 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
Status: | new → assigned |
Triage Stage: | Accepted → Ready for checkin |
I created a pull request with the technique described in my previous comment. The speedup is still around 10% as expected.
https://github.com/django/django/pull/1237
Feedback welcome! Yes, it's a major refactoring of loaddata, I know it's quite hard to review...
comment:5 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I created some proof-of-concept code for this, see: https://github.com/akaariai/django/commit/448b5d76d30b5af4682702ec21c8fe85cc1bb7d3 (applies over isolated_apps patch).
Runtime of Django's test suite is reduced from ~210 to ~160. So, around 50 seconds gone.
The code isn't commit quality. It uses a syntax of
fixtures = [(__file__, 'fixturename.json')]
to indicate that fixtures from a directory relative to current test class should be loaded. This causes a couple of problems, but is enough to demonstrate the speed gain. Better ideas & implementations welcome.I am not sure if my initial analysis of all the suffix combinations causing the problem is correct. While going through the tests it became evident that there are multiple fixtures with the same name in different test apps. If I am mistaken any time 'testdata.json' is defined as a fixture for a test class, then all different testdata.json files will be loaded. Another similar issue is that there are a couple of 'initial_data.json' files in the test suite, and so every flush will load these files.