|Version 15 (modified by oggie rob, 10 years ago) (diff)|
Django Schema evolution
Schema migration is one of those "hard" problems that I think will be impossible to get right for all the cases. What follows are my thoughts on how to get as close as possible in Django.
(please leave comments at the bottom and not inline to the proposal!)
A few bits of prior arts I examined.
Alice has a blog application written in Django. A blog entry looks like:class Entry(models.Model): title = models.CharField(maxlength=30) body = models.TextField()
After using this to write some blog entries, she realizes that she really should have had a publish date field on her model, and she wants to add one.
Alice doesn't particularly care about old entries; she just wants them to have an empty publish date. She knows just enough SQL to understand how relational databases work, but that's about it.
Ben has a simple ticket tracker with a ticket class that contains:class Ticket(models.Model): reporter = models.EmailField() owner = models.EmailField() status = models.IntegerField(choices=STATUS_CHOICES) description = models.TextField()
After two years of use, Ben wants to product information to his ticket tracker and has decided he doesn't care about who reported a ticket, so he writes these models:class Product(models.Model): ... class Ticket(models.Model): owner = models.EmailField() status = models.IntegerField(choices=STATUS_CHOICES) description = models.TextField() product = models.ForeignKey(Product)
Ben knows quite a bit of SQL; he's also the company DBA.
Carol supports a newsroom with a set of models that look like:class Reporter(models.Model): ... class Article(models.Model): reporter = models.ForeignKey(Reporter) ... class Section(models.Model): stories = models.ManyToManyField(Article) ...
As it turns out, her conception of Section was naive, and on top of that she wants to separate certain types of articles into multiple classes. She rewrites her models to look like:class Reporter(models.Model): ... class Category(models.Model): ... class Article(models.Model): reporter = models.ForeignKey(Reporter) categories = models.ManyToManyField(Category) ... class Obituary(models.Model): ... class WeddingAnnouncement(models.Model): ... class Section(models.Model): included_categories = models.ManyToManyField(Category) ...
Carol obviously needs to keep all the data in her system. Some articles will need to be moved into Obituary or WeddingAnnouncement, objects. Additionally, stories will be assigned categories from a mapping given to her by the web editor since the direct mapping has been removed.
To add to the fun, Carol has 7 different news sites with separate databases using this system. Three of those systems run MySQL, the other four run PostgreSQL.
Carol is a very experienced developer; she's been doing this web thing for almost ten years now.
This is the current situation: you write a bunch of SQL and bung it into your interactive interpreter.
Alice is a simple enough case that she shouldn't have to know SQL to make this work.
This situation sounds good to Ben: it's just two lines of SQL.
Carol would have to apply this SQL against 7 different databases (which is a waste of time), and she'd still have to write a bunch of Python code to deal with the data migration.
In this scenario, Django inspects your models and your database, and outputs the SQL for adding/removing fields. You're supposed to inspect the SQL and then stick it into your database using something like:
$ ./manage.py sqlupdate | psql mydb
This works for Alice; she does django-admin sqlupdate | mysql mydb and is happy. However, if the SQL that Django put out happened to have a mistake in it, she might not have noticed, and could have lost some data.
Ben loves this situation; it does his two lines of SQL automatically, and he can poke at what's put out to make sure it's right for him.
This doesn't really help Carol. Although the SQL that is put out is a start, it doesn't really get her very far. She still has to write a bunch of data migration by hand, and she also has no way of applying the SQL she writes quickly.
In this scenario, you give your models a version. When it comes time to upgrade, you write a "upgrade-to-version-3" script and increment your version number. When you run a django-admin command (probably the newly-added syncdb), Django notices that you've added an upgrade script and runs it automatically.
This type of system comes in two (not necessarily exclusive) flavors:
- SQL only, where your upgrade scripts are in SQL.
- ActiveRecord Migration style, where you write your migrations in Python (which gets converted to SQL for you).
Each situation also could support symmetric migration (where you provide both an upgrade and a downgrade script).
Alice doesn't like the SQL flavor at all (she wants to use Python!). The Pythonic one works very well for her since the syntax mirrors other parts of Django she's familiar with. However, she'd prefer something like the automatic solution since it would save her from having to type anything at all.
Ben really likes this one; the SQL is still easy to write, and it gets into SVN so he can track his upgrades.
Carol loves this one. She can write custom SQL/Python that takes care of all the situation she needs. The Python option is a better fit for her since she has lots of custom code she needs to run.
This approach is a combination of Automatic db introspection and Automatically applied migration code -- one command produces a migration script, you tickle your version number, and the syncdb runs the migrations.
This works very well for Alice, and she gets the added bonus of having nice migrations stored for future inspection.
This is Ben's favorite option.
Carol's indifferent about this as opposed to the previous one: it helps slightly, but not all that much, really.
- "Just write SQL" sucks and is obviously wrong and needs to be fixed.
- DB introspection only goes so far, and doesn't give any way to extend what comes out beyond "just write SQL".
- Migration code works pretty well -- which is obvious if you look at how successful Rails has been with the feature [http://api.rubyonrails.com/classes/ActiveRecord/Migration.html]. The Python flavor is obviously more powerful, but for some uses SQL works well.
- Introspection + migration seems to be best-of-breed; it even has the added bonus of getting Alice to save her migrations in source control.
Given all the above, here's how I think schema evolution should work in Django:
An optional module, django.contrib.evolution provides a way to store versions in the database:class ModelVersion(models.Model): content_type = models.ForeignKey(ContentType) version = models.PositiveIntegerField()
This makes evolution optional and thus doesn't clutter up your tablespace if you don't need it.
Models grow a version attribute:class MyModel(models.Model): ... class Meta: version = 3
django-admin syncdb applies evolutions using the following steps:
Makes sure evolution is installed before proceeding.
For each content type, checks that the version in the database matches the version in defined in the model. For those that don't match...
Django looks for evolutions in <app_package>/upgrade/. Evolutions can either be SQL scripts named to_version_X.sql or Python scripts named to_version_X.py.
django.contrib.evolution will contain utilities to abstract evolution for different db backends. This exact API is TBD.
For SQL evolutions, if a file named to_version_X.out exists, syncdb will expect the output of the evolution to match the text file EXACTLY. If any differences happen, the transaction will be rolled back (for that reason SQL evolution scripts must not contain BEGIN/COMMIT).
For Python evolutions, there will be some way of doing similar tests.
django-admin find-migrations (or somesuch) will inspect the database and pump out SQL evolution scripts into the right place in the app package(s). Users will then upgrade the version number and run syncdb.
I'd like to suggest a different approach altogether. This is, I think, orthoganol to the version management part of the existing proposals. Call this the DatabaseRefactoring view of schema evolution. For background, let me suggest a few starting points:
I think there are several important advantages to this approach as compared to the one already suggested. For one thing, the smaller steps it takes are likely to be much more automatable. Sure, in some cases you'll have to write conversion code, but a lot of changes shouldn't require this. If you're evolving your application by (in part) refactorings, then there's an obviously good match (object refactorings map to database refactorings).
I agree with "Something Completely Different" above. I worked on a Java-based product that used a similar approach to numbered schema versioning (although we worked with serialized objects in the backend instead of a relational DB). We had trouble merging changes with conflicting version numbers. This was not initially a huge problem for us, because we had very little overlap within and between our projects. However, the times that it happened it was a big deal, and as soon as we started real parallel development, it became a critical issue. I think for Django teams this is a killer, because deployments are so quite common and testing tends to be a part of the developers roles.
An example is svnA user creates schema version 5, while concurrently svnB makes version 5. Everything is fine until merge, then either user has an unusable database. This is bad enough, but consider branches! It gets ugly, and can easily break the production database at unexpected times.