Opened 7 years ago

Last modified 14 months ago

#6344 new Bug

Refactored manage.py inspectdb

Reported by: Daniel Pope <dan@…> Owned by: nobody
Component: Core (Management commands) Version: master
Severity: Normal Keywords: inspectdb
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

I have inherited a database developed in Microsoft Access that contains 1985 columns in 85 tables.

To import this into Django, I tried the existing inspectdb tool, but I found numerous bugs that would be too time-consuming to fix manually.

I have refactored the inspectdb tool to do a better job of importing. Aside from better ensuring that output is syntactically valid, the tool now constructs object-oriented structure which is then post-processed and serialized, rather than serializing on-the-fly. This approach makes it easier to procedurally rename models, and to include heuristics to fix up semantic problems with the models.

  • model class names are sanitized and converted to CamelCase
  • model field names are sanitized and converted to lowercase_with_underscores
  • db_column is always specified: this allows refactoring of field names in the generated code
  • no comment is issued where field names differ from database column names. This should now be assumed for all fields but the use of db_column makes this explicit
  • AutoFields are always explicit: again, this is useful when refactoring
  • blank lines are added in the output, so that fields definitions are visually grouped into keys, relations and other fields
  • write a warning comment if no primary_key is detected, as such models cannot be used with Django

I've added the following heuristics. These heuristics are not guaranteed to produce the right results, but they should produce the right results more often than if they were not included. For very large database schemas this is very important.

  • heuristically detect field names that are potentially keys but not defined as such, grouping these too (this is safe)
  • if no primary key is set, heuristically choose a candidate field if one matches the name of the model (this is unsafe but a warning comment is issued; besides, the alternative cannot work)
  • heuristically, upgrade IntegerFields with primary_key=True to AutoField if there is none (this is unsafe. A warning comment is issued.)

This tool still does not guarantee to resolve conflicts between field or model names (a very insidious issue should it occur), although the refactor was designed partly this this goal in mind.

Attachments (1)

inspectdb_refactor.diff (13.1 KB) - added by Daniel Pope <dan@…> 7 years ago.
Patch for inspectdb command

Download all attachments as: .zip

Change History (10)

Changed 7 years ago by Daniel Pope <dan@…>

Patch for inspectdb command

comment:1 follow-up: Changed 7 years ago by mtredinnick

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement set
  • Triage Stage changed from Unreviewed to Accepted

The intention here looks reasonable. This is probably worth doing.

For a start, could you please reformat the patch so it roughly conforms to PEP 8 (and, more importantly, other Django code)? The things I noticed are:

  1. Commented out code: Just remove it if it's no longer needed. We have version control for code history.
  2. Indentation should be four spaces everywhere. It looks like you might be using tabs in some places, or 6 or 8 spaces or something.
  3. Comments on a line of their own should be full sentences. Docstrings should have a newline after the initial triple quotes if they're more than one line long. There are some docstrings with bonus blank lines at the end and the like, too.
  4. There's quite a few places where spaces around operators and after comment markers are needed.

Probably worth running pep8.py (search on Google if you don't have it) over the whole thing as a general guide.

Some further notes on the code itself:

  1. I'd do away with the attempts to guess a primary key. If there isn't one, drop in comment saying the user must specify one. Don't guess. Keep the code simple.
  2. Similarly, don't rename fields to "id" just because they're the primary key. The names are useful guides for the user, so keep it that way (plus it reduces the lines of code, without removing functionality).
  3. Might be clearer to convert things to camel case using the "title" method on strings. Something like ''.join(foo.title().split()) should work, but there may be other ways.
  4. Upgrading fields to AutoField is a little dangerous, since it assumes the underlying field has an auto-increment on it. I'd prefer this wasn't done and a comment was just inserted where it might be possible. We should be conservative in what we produce here.
  5. Whilst you're sanitising the field names, note that Django field attributes cannot have a trailing underscore, for technical reasons (it clashes with the double-underscore separating table joins when parsing query filters).

Other than that looks good. Nicely commented, fairly easy to follow. Fix up those few things and we should be pretty much good to go.

comment:2 in reply to: ↑ 1 ; follow-up: Changed 7 years ago by Daniel Pope <dan@…>

Replying to mtredinnick:

The intention here looks reasonable. This is probably worth doing.

For a start, could you please reformat the patch so it roughly conforms to PEP 8 (and, more importantly, other Django code)? The things I noticed are:

Yes, I knew you were going to ask that. I actually paid some attention to PEP8, because my normal coding style is very different. The generated code in particular should be mostly PEP8 compliant already.

  1. Commented out code: Just remove it if it's no longer needed. We have version control for code history.

Sorry, that was for my reference while coding rather than to preserve it indefinitely. I just forgot to clean up.

Probably worth running pep8.py (search on Google if you don't have it) over the whole thing as a general guide.

Excellent, I will try that.

  1. I'd do away with the attempts to guess a primary key. If there isn't one, drop in comment saying the user must specify one. Don't guess. Keep the code simple.

For the database I was importing this was 100% reliable and saved a huge amount of time. But I agree this shouldn't happen automatically.

What I will do is change all the heuristics to individual command line switches to the manage.py command, defaulting to off. They might be useful if users understand the caveats.

  1. Similarly, don't rename fields to "id" just because they're the primary key. The names are useful guides for the user, so keep it that way (plus it reduces the lines of code, without removing functionality).

Actually, users should use the 'pk' alias for looking up the field and then the name of the public key field would be irrelevant.

But the documentation and examples tend to use 'id' so I think don't think that's what happens. Having dropped the guarantee that field names will be faithfully preserved, this produces models more faithful to Django convention. The fact that the db_column is explicit is an adequate guide for the user.

  1. Might be clearer to convert things to camel case using the "title" method on strings. Something like ''.join(foo.title().split()) should work, but there may be other ways.

No, my code gracefully handles more cases by preserving existing capitalisation where possible - so, for example, my_table => MyTable, HTML Pages => HTMLPages and so on. I haven't commented these lines so perhaps that would be worthwhile.

  1. Upgrading fields to AutoField is a little dangerous, since it assumes the underlying field has an auto-increment on it. I'd prefer this wasn't done and a comment was just inserted where it might be possible. We should be conservative in what we produce here.

Again, for my purposes it would have been painful to omit it. Actually the fault here lay with the MySQL database introspection, which doesn't identify auto_increment fields. I don't think the cursor.description data exposes this information - it needs DESCRIBE `table` or SHOW TABLE CREATE `table` to read this correctly.

I can imagine a workflow where users use 'inspectdb' to build the Django models, make tweaks, then 'reset' to recreate the tables and re-import the data - in this case, heuristic fixups would be valuable and safe.

  1. Whilst you're sanitising the field names, note that Django field attributes cannot have a trailing underscore, for technical reasons (it clashes with the double-underscore separating table joins when parsing query filters).

My patch already removes trailing underscores (Line 121) but the comment is out of date :(

comment:3 in reply to: ↑ 2 Changed 7 years ago by mtredinnick

Replying to Daniel Pope <dan@mauveinternet.co.uk>:

  1. I'd do away with the attempts to guess a primary key. If there isn't one, drop in comment saying the user must specify one. Don't guess. Keep the code simple.

For the database I was importing this was 100% reliable and saved a huge amount of time. But I agree this shouldn't happen automatically.

What I will do is change all the heuristics to individual command line switches to the manage.py command, defaulting to off. They might be useful if users understand the caveats.

Please don't. That will add extra switches, which just complicates things like the help output when they're really not required. This is very much an edge-case -- we're talking about a table that doesn't have a primary key and yet, for some reason, the user thinks they'll be able to make a Django model out of it. It will happen almost never and, when it does, the comment in the model file will be a good guide. The other heuristics you've introduced are similar: they're all edge cases. Inspectdb just isn't used that often to be worth having it really complicated. It is used for highly varied set ups, so trying to cover all bases, no matter how well intentioned is pretty much a doomed exercise.

  1. Similarly, don't rename fields to "id" just because they're the primary key. The names are useful guides for the user, so keep it that way (plus it reduces the lines of code, without removing functionality).

Actually, users should use the 'pk' alias for looking up the field and then the name of the public key field would be irrelevant.

That's precisely why it's not worth renaming.

But the documentation and examples tend to use 'id' so I think don't think that's what happens. Having dropped the guarantee that field names will be faithfully preserved, this produces models more faithful to Django convention. The fact that the db_column is explicit is an adequate guide for the user.

No. We use "id" in the examples because we are always using models with auto-generated primary keys, so it's quite safe to do so. Note that Django core always uses the "pk" reference when required.

When somebody is reading their code, they are going to see my_model.some_field, not the db_column name, so having some_field bear some resemblence to the initial column is helping the user. Otherwise, you might as well go all the way, sort the attribute names and call them "a", "b", "c", etc, which hopefully you'll agree is not very informative. Let's be nice to the users here, please.

  1. Might be clearer to convert things to camel case using the "title" method on strings. Something like ''.join(foo.title().split()) should work, but there may be other ways.

No, my code gracefully handles more cases by preserving existing capitalisation where possible - so, for example, my_table => MyTable, HTML Pages => HTMLPages and so on. I haven't commented these lines so perhaps that would be worthwhile.

Fair enough. I'd forgotten those cases.

  1. Upgrading fields to AutoField is a little dangerous, since it assumes the underlying field has an auto-increment on it. I'd prefer this wasn't done and a comment was just inserted where it might be possible. We should be conservative in what we produce here.

Again, for my purposes it would have been painful to omit it. Actually the fault here lay with the MySQL database introspection, which doesn't identify auto_increment fields. I don't think the cursor.description data exposes this information - it needs DESCRIBE `table` or SHOW TABLE CREATE `table` to read this correctly.

You're generalising from one example, again. It's dangerous to do this upgrade automatically, so unfortunately that means cases where it's really required will have to be done manually.

I can imagine a workflow where users use 'inspectdb' to build the Django models, make tweaks, then 'reset' to recreate the tables and re-import the data - in this case, heuristic fixups would be valuable and safe.

Not particularly common, though. And it's not always safe to do this, so please don't.

  1. Whilst you're sanitising the field names, note that Django field attributes cannot have a trailing underscore, for technical reasons (it clashes with the double-underscore separating table joins when parsing query filters).

My patch already removes trailing underscores (Line 121) but the comment is out of date :(

Ah, yes, you're right. I read the comment, not the reg-exp. My bad.

comment:4 Changed 7 years ago by adrian

  • Component changed from Tools to django-admin.py inspectdb

comment:5 Changed 6 years ago by maubp

There's a bug in Daniel's patch, inspectdb_refactor.diff (13.1 kB) dated 01/08/08.

 	146	    def __str__(self): 
 	147	        s='class %s(models.Model):\n' % self.model_name 
 	148	 
 	149	        keys, rels, ids, other=self.group_fields() 
 	150	        if not keys: 
 	151	            s+='    # Warning: this model needs a field with primary_key=True\n\n' 
 	152	 
 	153	        for f in keys: 
 	154	            s+='    %s\n'%self._field_as_str(f) 
 	155	        if keys: 
 	156	            s+='\n' 
 	157	 
 	158	        for f in ids: 
 	159	            s+='    %s\n'%self._field_as_str(f) 
 	160	        if ids: 
 	161	            s+='\n' 
 	162	 
 	163	        for r in rels: 
 	164	            s+='    %s\n'%self._rel_as_str(f) 
 	165	        if rels: 
 	166	            s+='\n' 
 	167	 
 	168	        for f in other: 
 	169	            s+='    %s\n'%self._field_as_str(f) 
 	170	        if other: 
 	171	            s+='\n'

Line 163, should be "for f in rels" not "for r in rels"

On a style point, I would have written this section more like:

for group in [keys, ids, rels, other] :
    if group :
        s+= "\n".join(['    %s'%self._field_as_str(f) for f in ids)]) + "\n"

Similarly for the Database object's str method,

class Database(object):
    ...
    def __str__(self):
        return "\n\n".join([str(m) for m in self.models.values()])

But this is subjective and I don't know which follows the Django coding style best.


I've being trying the inspectdb tool on a MySQL database, and particularly liked three bits of this patch:

  • automatic sorting of the models/tables (a very tedious manual task)
  • db_column is always specified (I'm already familiar with our schema, so I want to see the original column names)
  • warning comment if no primary_key is detected (very helpful for diagnostics, see also ticket 373)

comment:6 Changed 4 years ago by gabrielhurley

  • Component changed from django-admin.py inspectdb to Core (Management commands)

comment:7 Changed 4 years ago by julien

  • Needs tests set
  • Severity set to Normal
  • Type set to Bug

2 years later, no doubt this patch needs an update. It'd also be great if someone could check whether the issues reported here still exist.

comment:8 Changed 4 years ago by ramiro

  • Easy pickings unset
  • Keywords inspectdb added
  • UI/UX unset

comment:9 Changed 14 months ago by WoLpH

So... it's been 6 and a half years now, any progress on this?

After trying the inspectdb command I still notice a few issues that would be great to have fixed. For example the field printableStyleTemplateId should be named printable_style_template_id instead of printablestyletemplateid. Or maybe even printable_style_template, but something smarter at least :)

Note: See TracTickets for help on using tickets.
Back to Top