wiki:SchemaEvolution

Context Navigation

Version 26 (modified by anonymous, 19 years ago) ( diff )
--

Django Schema evolution

Schema migration is one of those "hard" problems that I think will be impossible to get right for all the cases. What follows are my thoughts on how to get as close as possible in Django.

(please leave comments at the bottom and not inline to the proposal! Also, please add your name or some handle after the comment; everybody using "I" and then not identifying themselves is terribly unclear.)

Contents

Prior art
Use cases
Possible solutions
Conclusions
Proposal
Comments/questions
Something Completely Different
Why something different
Handling concurrent development -- multiple users and branches
Another way to concurrent development -- migrations as distributed version control systems
Automatic interactive db introspection that generates migration scripts

Prior art

A few bits of prior arts I examined.

http://api.rubyonrails.org/classes/ActiveRecord/Migration.html

http://sqlobject.org/sqlobject-admin.html#versioning-upgrading

http://www.varlena.com/varlena/GeneralBits/127.php

http://divmod.org/trac/wiki/DivmodAxiom/Reference#Upgraders

Use cases

Alice has a blog application written in Django. A blog entry looks like:
class Entry(models.Model):
    title = models.CharField(maxlength=30)
    body = models.TextField()
After using this to write some blog entries, she realizes that she really should have had a publish date field on her model, and she wants to add one.

Alice doesn't particularly care about old entries; she just wants them to have an empty publish date. She knows just enough SQL to understand how relational databases work, but that's about it.
Ben has a simple ticket tracker with a ticket class that contains:
class Ticket(models.Model):
    reporter    = models.EmailField()
    owner       = models.EmailField()
    status      = models.IntegerField(choices=STATUS_CHOICES)
    description = models.TextField()
After two years of use, Ben wants to add product information to his ticket tracker and has decided he doesn't care about who reported a ticket, so he writes these models:
class Product(models.Model):
    ...

class Ticket(models.Model):
    owner       = models.EmailField()
    status      = models.IntegerField(choices=STATUS_CHOICES)
    description = models.TextField()
    product     = models.ForeignKey(Product)
Ben knows quite a bit of SQL; he's also the company DBA.
Carol supports a newsroom with a set of models that look like:
class Reporter(models.Model):
    ...

class Article(models.Model):
    reporter = models.ForeignKey(Reporter)
    ...

class Section(models.Model):
    stories = models.ManyToManyField(Article)
    ...
As it turns out, her conception of Section was naive, and on top of that she wants to separate certain types of articles into multiple classes. She rewrites her models to look like:
class Reporter(models.Model):
    ...

class Category(models.Model):
    ...

class Article(models.Model):
    reporter = models.ForeignKey(Reporter)
    categories = models.ManyToManyField(Category)
    ...

class Obituary(models.Model):
    ...

class WeddingAnnouncement(models.Model):
    ...

class Section(models.Model):
    included_categories = models.ManyToManyField(Category)
    ...
Carol obviously needs to keep all the data in her system. Some articles will need to be moved into Obituary or WeddingAnnouncement, objects. Additionally, stories will be assigned categories from a mapping given to her by the web editor since the direct mapping has been removed.

To add to the fun, Carol has 7 different news sites with separate databases using this system. Three of those systems run MySQL, the other four run PostgreSQL.

Carol is a very experienced developer; she's been doing this web thing for almost ten years now.

Possible solutions

Write SQL

This is the current situation: you write a bunch of SQL and bung it into your interactive interpreter.

Ramifications:

Alice is a simple enough case that she shouldn't have to know SQL to make this work.

This situation sounds good to Ben: it's just two lines of SQL.

Carol would have to apply this SQL against 7 different databases (which is a waste of time), and she'd still have to write a bunch of Python code to deal with the data migration.

Automatic db introspection

In this scenario, Django inspects your models and your database, and outputs the SQL for adding/removing fields. You're supposed to inspect the SQL and then stick it into your database using something like:

$ ./manage.py sqlupdate | psql mydb

Ramifications:

This works for Alice; she does django-admin sqlupdate | mysql mydb and is happy. However, if the SQL that Django put out happened to have a mistake in it, she might not have noticed, and could have lost some data.

Ben loves this situation; it does his two lines of SQL automatically, and he can poke at what's put out to make sure it's right for him. Except if he's using sqlite, which doesn't have a real ALTER statement, it's going to be a ton of work.

This doesn't really help Carol. Although the SQL that is put out is a start, it doesn't really get her very far. She still has to write a bunch of data migration by hand, and she also has no way of applying the SQL she writes quickly.

Automatically applied migration code

In this scenario, you give your models a version. When it comes time to upgrade, you write a "upgrade-to-version-3" script and increment your version number. When you run a django-admin command (probably the newly-added syncdb), Django notices that you've added an upgrade script and runs it automatically.

This type of system comes in two (not necessarily exclusive) flavors:

SQL only, where your upgrade scripts are in SQL.

ActiveRecord Migration style, where you write your migrations in Python (which gets converted to SQL for you).

Each situation also could support symmetric migration (where you provide both an upgrade and a downgrade script).

Ramifications:

Alice doesn't like the SQL flavor at all (she wants to use Python!). The Pythonic one works very well for her since the syntax mirrors other parts of Django she's familiar with. However, she'd prefer something like the automatic solution since it would save her from having to type anything at all.

Ben really likes this one; the SQL is still easy to write, and it gets into SVN so he can track his upgrades.

Carol loves this one. She can write custom SQL/Python that takes care of all the situation she needs. The Python option is a better fit for her since she has lots of custom code she needs to run.

Introspection + migration

This approach is a combination of Automatic db introspection and Automatically applied migration code -- one command produces a migration script, you tickle your version number, and the syncdb runs the migrations.

Ramifications:

This works very well for Alice, and she gets the added bonus of having nice migrations stored for future inspection.

This is Ben's favorite option.

Carol's indifferent about this as opposed to the previous one: it helps slightly, but not all that much, really.

Conclusions

"Just write SQL" sucks and is obviously wrong and needs to be fixed.

DB introspection only goes so far, and doesn't give any way to extend what comes out beyond "just write SQL".

Migration code works pretty well -- which is obvious if you look at how successful Rails has been with the feature [http://api.rubyonrails.org/classes/ActiveRecord/Migration.html]. The Python flavor is obviously more powerful, but for some uses SQL works well.

Introspection + migration seems to be best-of-breed; it even has the added bonus of getting Alice to save her migrations in source control.

Proposal

So.

Given all the above, here's how I think schema evolution should work in Django:

An optional module, django.contrib.evolution provides a way to store versions in the database:
class ModelVersion(models.Model):
    content_type = models.ForeignKey(ContentType)
    version      = models.PositiveIntegerField()
This makes evolution optional and thus doesn't clutter up your tablespace if you don't need it.
Models grow a version attribute:
class MyModel(models.Model):
    ...

    class Meta:
        version = 3
django-admin syncdb applies evolutions using the following steps:

Makes sure evolution is installed before proceeding.

For each content type, checks that the version in the database matches the version in defined in the model. For those that don't match...

Django looks for evolutions in <app_package>/upgrade/. Evolutions can either be SQL scripts named to_version_X.sql or Python scripts named to_version_X.py.

django.contrib.evolution will contain utilities to abstract evolution for different db backends. This exact API is TBD.

For SQL evolutions, if a file named to_version_X.out exists, syncdb will expect the output of the evolution to match the text file EXACTLY. If any differences happen, the transaction will be rolled back (for that reason SQL evolution scripts must not contain BEGIN/COMMIT).

For Python evolutions, there will be some way of doing similar tests.

django-admin find-migrations (or somesuch) will inspect the database and pump out SQL evolution scripts into the right place in the app package(s). Users will then upgrade the version number and run syncdb.

Comments/questions

Asked by nwp in the chat room: "what are the rules for checking whether you are in a position to apply to_version_X?"

Given a model that defines version = 6 and a version of 3 stored in the database, to_version_4, to_version_5, and to_version_6 will be applied -- but only if they exist (missing files are OK).

If versions are stored per-model, what happens if evolutions affect multiple models, or create or delete them? For example, in Carol's situation above, how would the individual migration scripts cooperate to move the data around in the required way? I think versions make more sense on a per-app basis; if that distinction exists in magic-removal.

More discussion in a django-developers thread (should main points be copied here?).

[brantley: I added a proposal here at SchemaEvolutionProposal.]

Something Completely Different

I'd like to suggest a different approach altogether. This is, I think, orthoganol to the version management part of the existing proposals. Call this the DatabaseRefactoring view of schema evolution. For background, let me suggest a few starting points:

Fowler and Sadalage's paper EvolutionaryDatabaseDesign

TheProcessOfDatabaseRefactoring from Scott Ambler's agiledata site, which leads to the online catalog of refactorings

and of course there's Ambler and Sandalage's book, RefactoringDatabases

I think there are several important advantages to this approach as compared to the one already suggested. For one thing, the smaller steps it takes are likely to be much more automatable. Sure, in some cases you'll have to write conversion code, but a lot of changes shouldn't require this. If you're evolving your application by (in part) refactorings, then there's an obviously good match (object refactorings map to database refactorings).

Why something different

I agree with "Something Completely Different" above. I worked on a Java-based product that used a similar approach to numbered schema versioning (although we worked with serialized objects in the backend instead of a relational DB). We had trouble merging changes with conflicting version numbers. This was not initially a huge problem for us, because we had very little overlap within and between our projects. However, the times that it happened it was a big deal, and as soon as we started real parallel development, it became a critical issue. I think for Django teams this is a killer, because deployments are so quite common and testing tends to be a part of the developers roles.

An example is svnA user creates schema version 5, while concurrently svnB makes version 5. Everything is fine until merge, then either user has an unusable database. This is bad enough, but consider branches! It gets ugly, and can easily break the production database at unexpected times.

Handling concurrent development -- multiple users and branches

Concurrent development requires communication, especially with a shared data store like the database. The above example should never break the production database -- if two developers commit conflicting code, it's up to them to work out the conflict. Preventing the conflicting code from reaching production is an organizational policy issue, not a web framework issue.

You do raise a good point regarding concurrent development with branches. It's still not possible to get rid of communication -- when it comes time to do a branch merge, discussing database schema will be mandatory. However, the tool should allow for branch merges; why not store a branch name AND a version number for each schema revision? When planning for a branch merge, one can implement a "from branch 1 version 6 to branch 2 version 3" migration. Obviously, migration points will need to be well defined, but supporting branches at least allows for the possibility.

Another way to concurrent development -- migrations as distributed version control systems

One of the issues with data base migrations, one raised by the previous section and one which has often bitten users of Rails migrations, is that they don't scale with the number of users: while rails-like migrations are wonderful for single users or very small teams (up to 3 or so members), the point is soon reached when different programmers will create different migrations with the same version number, and hell ensues (manual merges/renames of some of the migrations etc...).

The main problem of Rails' migration scheme versioning is, in fact, that they took a Centralized VCS scheme and applied it to a Decentralized VCS situation. Version numbers / revision numbers come from the worlds of CVS, SVN and the like where a central authority (the repository/VCS server) has the role of handling these version / revision numbers out and is the only entity allowed to issue these numbers. This central entity thus forces synchronization and updates (forbids the creation of version n+2 if you were using a version <= n, prevents the creation of version n+1 from version n when version n+1 already exists, ...).

As any user of the migration tool can create a new migration without having to ask a central authority, two users can create the same migration, or different migrations with the same version number, and the tool breaks down where it's sorely needed.

Since the problem has already been solved, this should be handled by learning from the various decentralized VCS (darcs, Git, Mercurial, Bazaar) and introducing one of their patch-merging and diff-tracking strategies into the migration tool. Of course this wouldn't necessarily resolve all the issues of incompatible migrations, but conflicting ones could at least be tracked.

Automatic interactive db introspection that generates migration scripts

I think that the nicest way for the user that would work for 99% of the cases is an automatic and (if needed) interactive command, that will generate the migration steps it took and store them in some script. Also, it could automatically update the version number in the suggestion on migration scripts above. So for a model that changes from:

class Ticket(models.Model):
        reporter    = models.EmailField()
        owner       = models.EmailField()
        stat      = models.IntegerField(choices=STATUS_CHOICES)

to:

class Ticket(models.Model):
        reporter    = models.EmailField()
        status      = models.IntegerField(choices=STATUS_CHOICES)
        description = models.TextField()

This is one removed field (owner), one added field (description), and one renamed field (stat->status). The renamed is the trickiest one obviously since it looks like one has been deleted and another added. So in this case the user should be given the menu:

The following fields are new:
        1 status
        2 description
The following fields are not present any longer:
        3 stat
        4 owner

Are any of these fields a rename? [y/N] y
Which pair is coupled? 1, 3

The following fields are new:
        1 description
The following fields are not present any longer:
        2 owner

Are any of these fields a rename? [y/N] n
Warning: removing the field owner will permanently delete data. Write "delete"  and press enter to continue: delete
Database update complete. Modifications written to Ticket-migration-1.sql

If there are only additions, then the entire menu can be totally skipped and just the additions added. If there are no renames the entire thing also becomes much shorter. I think this covers most cases, and any other cases can be handled manually by writing custom SQL migration scripts as above.

Anders Hovmöller (boxed@killingar.net)

Agreed, this would fit very nicely with the DbMigration project.

Mike H (mike@mugwuffin.com)

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text

Issues

Context Navigation

Django Schema evolution

Prior art

Use cases

Possible solutions

Write SQL

Automatic db introspection

Automatically applied migration code

Introspection + migration

Conclusions

Proposal

Comments/questions

Something Completely Different

Why something different

Handling concurrent development -- multiple users and branches

Another way to concurrent development -- migrations as distributed version control systems

Automatic interactive db introspection that generates migration scripts

Download in other formats:

Django Links

Learn More

Get Involved

Get Help

Follow Us

Support Us