Code


Version 4 (modified by jacob, 8 years ago) (diff)

Added question from nwp (and answer)

Django Schema evolution

Schema migration is one of those "hard" problems that I think will be impossible to get right for all the cases. What follows are my thoughts on how to get as close as possible in Django.

(please leave comments at the bottom and not inline to the proposal!)

Use cases

  1. Alice has a blog application written in Django. A blog entry looks like:

    class Entry(models.Model):
        title = models.CharField(maxlength=30)
        body = models.TextField()
    

    After using this to write some blog entries, she realizes that she really should have had a publish date field on her model, and she wants to add one.

    Alice doesn't particularly care about old entries; she just wants them to have an empty publish date. She knows just enough SQL to understand how relational databases work, but that's about it.

  2. Ben has a simple ticket tracker with a ticket class that contains:

    class Ticket(models.Model):
        reporter    = models.EmailField()
        owner       = models.EmailField()
        status      = models.IntegerField(choices=STATUS_CHOICES)
        description = models.TextField()
    

    After two years of use, Ben wants to product information to his ticket tracker and has decided he doesn't care about who reported a ticket, so he writes these models:

    class Product(models.Model):
        ...
    
    class Ticket(models.Model):
        owner       = models.EmailField()
        status      = models.IntegerField(choices=STATUS_CHOICES)
        description = models.TextField()
        product     = models.ForeignKey(Product)
    

    Ben knows quite a bit of SQL; he's also the company DBA.

  3. Carol supports a newsroom with a set of models that look like:

    class Reporter(models.Model):
        ...
    
    class Article(models.Model):
        reporter = models.ForeignKey(Reporter)
        ...
    
    class Section(models.Model):
        stories = models.ManyToManyField(Article)
        ...
    

    As it turns out, her conception of Section was naive, and on top of that she wants to separate certain types of articles into multiple classes. She rewrites her models to look like:

    class Reporter(models.Model):
        ...
    
    class Category(models.Model):
        ...
    
    class Article(models.Model):
        reporter = models.ForeignKey(Reporter)
        categories = models.ManyToManyField(Category)
        ...
    
    class Obituary(models.Model):
        ...
    
    class WeddingAnnouncement(models.Model):
        ...
    
    class Section(models.Model):
        included_categories = models.ManyToManyField(Category)
        ...
    

    Carol obviously needs to keep all the data in her system. Some articles will need to be moved into Obituary or WeddingAnnouncement, objects. Additionally, stories will be assigned categories from a mapping given to her by the web editor since the direct mapping has been removed.

    To add to the fun, Carol has 7 different news sites with separate databases using this system. Three of those systems run MySQL, the other four run PostgreSQL.

    Carol is a very experienced developer; she's been doing this web thing for almost ten years now.

Possible solutions

Write SQL

This is the current situation: you write a bunch of SQL and bung it into your interactive interpreter.

Ramifications:

Alice is a simple enough case that she shouldn't have to know SQL to make this work.

This situation sounds good to Ben: it's just two lines of SQL.

Carol would have to apply this SQL against 7 different databases (which is a waste of time), and she'd still have to write a bunch of Python code to deal with the data migration.

Automatic db introspection

In this scenario, Django inspects your models and your database, and outputs the SQL for adding/removing fields. You're supposed to inspect the SQL and then stick it into your database using something like:

$ ./manage.py sqlupdate | psql mydb
Ramifications:

This works for Alice; she does django-admin sqlupgrade | mysql and is happy. However, if the SQL that Django put out happened to have a mistake in it, she might not have noticed, and could have lost some data.

Ben loves this situation; it does his two lines of SQL automatically, and he can poke at what's put out to make sure it's right for him.

This doesn't really help Carol. Although the SQL that is put out is a start, it doesn't really get her very far. She still has to write a bunch of data migration by hand, and she also has no way of applying the SQL she writes quickly.

Automatically applied migration code

In this scenario, you give your models a version. When it comes time to upgrade, you write a "upgrade-to-version-3" script and increment your version number. When you run a django-admin command (probably the newly-added syncdb), Django notices that you've added an upgrade script and runs it automatically.

This type of system comes in two (not necessarily exclusive) flavors:

  • SQL only, where your upgrade scripts are in SQL.
  • ActiveRecord Migration style, where you write your migrations in Python (which gets converted to SQL for you).

Each situation also could support symmetric migration (where you provide both an upgrade and a downgrade script).

Ramifications:

Alice doesn't like the SQL flavor at all (she wants to use Python!). The Pythonic one works very well for her since the syntax mirrors other parts of Django she's familiar with. However, she'd prefer something like the automatic solution since it would save her from having to type anything at all.

Ben really likes this one; the SQL is still easy to write, and it gets into SVN so he can track his upgrades.

Carol loves this one. She can write custom SQL/Python that takes care of all the situation she needs. The Python option is a better fit for her since she has lots of custom code she needs to run.

Introspection + migration

This approach is a combination of Automatic db introspection and Automatically applied migration code -- one command produces a migration script, you tickle your version number, and the syncdb runs the migrations.

Ramifications:

This works very well for Alice, and she gets the added bonus of having nice migrations stored for future inspection.

This is Ben's favorite option.

Carol's indifferent about this as opposed to the previous one: it helps slightly, but not all that much, really.

Conclusions

  • "Just write SQL" sucks and is obviously wrong and needs to be fixed.
  • DB introspection only goes so far, and doesn't give any way to extend what comes out beyond "just write SQL".
  • Migration code works pretty well -- which is obvious if you look at how successful Rails has been with the feature. The Python flavor is obviously more powerful, but for some uses SQL works well.
  • Introspection + migration seems to be best-of-breed; it even has the added bonus of getting Alice to save her migrations in source control.

Proposal

So.

Given all the above, here's how I think schema evolution should work in Django:

  • An optional module, django.contrib.evolution provides a way to store versions in the database:

    class ModelVersion(models.Model):
        content_type = models.ForeignKey(ContentType)
        version      = models.PositiveIntegerField()
    

    This makes evolution optional and thus doesn't clutter up your tablespace if you don't need it.

  • Models grow a version attribute:

    class MyModel(models.Model):
        ...
    
        class Meta:
            version = 3
    
  • django-admin syncdb applies evolutions using the following steps:

    1. Makes sure evolution is installed before proceeding.

    2. For each content type, checks that the version in the database matches the version in defined in the model. For those that don't match...

    3. Django looks for evolutions in <app_package>/upgrade/. Evolutions can either be SQL scripts named to_version_X.sql or Python scripts named to_version_X.py.

    4. django.contrib.evolution will contain utilities to abstract evolution for different db backends. This exact API is TBD.

    5. For SQL evolutions, if a file named to_version_X.out exists, syncdb will expect the output of the evolution to match the text file EXACTLY. If any differences happen, the transaction will be rolled back (for that reason SQL evolution scripts must not contain BEGIN/COMMIT).

      For Python evolutions, there will be some way of doing similar tests.

  • django-admin find-migrations (or somesuch) will inspect the database and pump out SQL evolution scripts into the right place in the app package(s). Users will then upgrade the version number and run syncdb.

Comments/questions

Asked by nwp in the chat room: "what are the rules for checking whether you are in a position to apply to_version_X?"

Given a model that defines version = 6 and a version of 3 stored in the database, to_version_4, to_version_5, and to_version_6 will be applied -- but only if they exist (missing files are OK).