Opened 8 years ago

Closed 3 months ago

#5253 closed New feature (wontfix)

serializer for csv format for loaddata and dumpdata

Reported by: Adam Schmideg <adam@…> Owned by: erob
Component: Core (Serialization) Version: master
Severity: Normal Keywords: csv, foreign keys, feature
Cc: erob@…, tim.babych@…, cmawebsite@… Triage Stage: Accepted
Has patch: yes Needs documentation: yes
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Currently supported serialization formats are to verbose to be easily edited by hand. Csv format would support a more fluent development cycle of click together data, dump data, edit data, load data.

Attachments (6)

csvserializer.diff (13.4 KB) - added by Adam Schmideg 8 years ago.
A working implementation with minor limitations
csv_serializer.diff (11.5 KB) - added by Adam Schmideg <adam@…> 8 years ago.
An implementation that works with serializers_regress test ready for inclusion
csv_serializer_5253.diff (11.3 KB) - added by Mnewman 7 years ago.
version of old (incomplete) patch that works with r9232
csv_serializer.patch (12.1 KB) - added by erob 6 years ago.
Patch based on the previous ones and adding support for 0.96.3
csv_serializer-0.96.3.diff (12.1 KB) - added by erob 6 years ago.
csv_serializer.2.patch (11.9 KB) - added by erob 6 years ago.
Removed a typo in is_string_field

Download all attachments as: .zip

Change History (31)

Changed 8 years ago by Adam Schmideg

A working implementation with minor limitations

comment:1 Changed 8 years ago by russellm

  • Needs documentation set
  • Needs tests set
  • Patch needs improvement set
  • Triage Stage changed from Unreviewed to Accepted

A CSV serializer is an interesting proposal, and certainly one I'm amenable to. If you're serious about this, the patch needs to be integrated with Django itself - i.e., a patch that is complete and requires no modifications to apply against the current trunk source. This will also integrate the serializer with the serialization test suite, providing a much more comprehensive set of tests than your current test suite (all registered serializers will be tested by the regressiontest/serializers_regress suite, for a spanning set of data types and foreign key data). Documentation updates are also required.

On top of this - one question, and one comment.

A question: Is the 'model identification' syntax (--- article: id,owner,text) in any way standardized? I can't say i've ever seen it before

A comment: extra configuration options (CSV_SERIALIZER_OPTIONS) are probably a non-starter, unless you can make a _very_ strong case for them. Passing options to individual serialization instances would be vastly preferable.

comment:2 follow-up: Changed 8 years ago by Adam Schmideg

Thanks for your feedback. I'm serious enough to integrate it with the current source/tests.

And an answer and a comment,

I haven't seen the syntax elsewhere, I haven't even seen putting different table data into a single csv file. I just needed a header that stands out.

I can't make a strong case for the configuration options, I'm probably missing the extra command line arguments that get passed to loaddata/dumpdata.

comment:3 in reply to: ↑ 2 ; follow-up: Changed 8 years ago by russellm

Replying to Adam Schmideg:

I haven't seen the syntax elsewhere, I haven't even seen putting different table data into a single csv file. I just needed a header that stands out.

Ok. In this case, you will need to make a case for why your header is the right one. Other serializers (YAML, XML, JSON) are unambiguous, and use the syntax as-published. If you are going to invent a syntax, you will need to convince us that your chosen syntax is the right one (i.e., it is a common 'comment' format, or it matches the headers produced by Excel, or some such thing).

I can't make a strong case for the configuration options, I'm probably missing the extra command line arguments that get passed to loaddata/dumpdata.

I think you missed my point. Ideally, you want to avoid configuration options at all. Some are unavoidable (such as the unicode handling options for the JSON backend), and some are common (such as indent). Ideally, you only want options that are 'common'; a small number of other options may be acceptable, but you would need to make a good case for them.

To me, having an option like format=Excel suggests that at a user level, you actually have two serializers. They may share a lot of implementation, but the public interface should be separate.

I haven't dug in detail to work out why 'nice_foreign_keys' exists, but by the looks of it, it isn't required functionality at all. Foreign keys should be integers (ignoring non-numerical keys for the moment) - not strings containing some magic syntax.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 8 years ago by Adam Schmideg <adam@…>

Replying to russellm:

I haven't seen the syntax elsewhere, I haven't even seen putting different table data into a single csv file. I just needed a header that stands out.

Ok. In this case, you will need to make a case for why your header is the right one. Other serializers (YAML, XML, JSON) are unambiguous, and use the syntax as-published. If you are going to invent a syntax, you will need to convince us that your chosen syntax is the right one (i.e., it is a common 'comment' format, or it matches the headers produced by Excel, or some such thing).

After some googling I still couldn't find a reference to putting multiple table data into a single csv file. So I'm going to invent a syntax for it keeping in mind that it should

  • be handled by spreadsheet applications
  • be friendly to processing by grep, sed, etc.
  • look human.

I see two locations for table metadata,

  1. right before the corresponding table data
  2. at the beginning of the csv file

I would go with the 1st option to be able to split the file easily into separate csv files for each table. My suggestion is

news_reporter:id,name
1,John Doe
2,Mary Poppins

news_article:id,reporter_id,text
1,1,Being And Time
2,1,Life After Life

Note the empty line between the tables. It has a number of advantages,

  • headers behave almost like normal csv headers with the exception that the first column will be called <tablename>:id
  • table boundaries can be spotted visually
  • the table name is separated from the first column by a colon
    • which is an usual unix separator
    • it doesn't conflict with other separators in this context, it can't be part of the table name
    • looks human to me

To me, having an option like format=Excel suggests that at a user level, you actually have two serializers. They may share a lot of implementation, but the public interface should be separate.

I'm going to drop that option

I haven't dug in detail to work out why 'nice_foreign_keys' exists, but by the looks of it, it isn't required functionality at all. Foreign keys should be integers (ignoring non-numerical keys for the moment) - not strings containing some magic syntax.

I found this feature very useful in the following situation. The first version of my initial_data.csv looked something like this

auth.group:id,name,permissions
1,Reporters,"(1,5,8)"

where the last column is a list of foreign keys to contenttypes. When I did a flush to my database, the records of the contenttypes table were created with a different id, changing the permissions of my Reporters group. It might be due to the fact that I added other installed apps, or that the content types are created in a random order. Anyway, this problem could be solved using the nice_foreign_keys feature, resulting in this file

auth.group:id,name,permissions
1,Reporters,"({'codename': u'add_article', 'content_type': {'model': 'article', 'app_label': u'news'},
{'codename': u'add_article', 'content_type': {'model': 'article', 'app_label': u'news'})"

It may look a longish example but I find it a mixture of the good parts of the json and the csv world. But I agree it's a different serializer rather than a plain csv one with the nice-and-tricky feature set.

comment:5 in reply to: ↑ 4 ; follow-ups: Changed 8 years ago by russellm

Replying to Adam Schmideg <adam@schmideg.net>:

After some googling I still couldn't find a reference to putting multiple table data into a single csv file.

Before this gets committed, I'll need to see some consensus amongst others that this is indeed the case. "I couldn't find it" isn't a particularly compelling argument by itself :-)

Once the patch is ready for inclusion, we will need to raise the design for public comment. This should shake out any objections or counterexamples for the design you are proposing.

  • headers behave almost like normal csv headers with the exception that the first column will be called <tablename>:id

<tablename>:pk would be better here; both for consistency with other backends, and because the primary key isn't always called id. pk is the reserved name for the primary key in searches, etc, which is why it is used in the JSON and YAML serializers.

I haven't dug in detail to work out why 'nice_foreign_keys' exists, but by the looks of it, it isn't required functionality at all. Foreign keys should be integers (ignoring non-numerical keys for the moment) - not strings containing some magic syntax.

I found this feature very useful in the following situation. The first version of my initial_data.csv looked something like this

1) CSV is a terse syntax by design. This a very pythonic extension.
2) Your original complaint was that JSON syntax was too verbose - and then your introduce the most verbose and repetitive component of JSON syntax into your CSV syntax? Am I the only one that thinks this is a little strange?
3) You are attempting to solve a problem that isn't part of what a serializer should be doing. If you require contenttypes as part of your fixtures, serialize (and deserialize) the contenttypes table. That way you can be guaranteed the IDs for the contenttypes you use.

comment:6 in reply to: ↑ 5 Changed 8 years ago by Adam Schmideg <adam@…>

Before this gets committed, I'll need to see some consensus amongst others that this is indeed the case. "I couldn't find it" isn't a particularly compelling argument by itself :-)

I agree. But it presents some philosophical problems how to prove that something doesn't exist. Anyway, I included a reference in the module doc to the only similar thing I found.

Once the patch is ready for inclusion, we will need to raise the design for public comment. This should shake out any objections or counterexamples for the design you are proposing.

Sounds scary but the patch is ready now.

<tablename>:pk would be better here; both for consistency with other backends, and because the primary key isn't always called id. pk is the reserved name for the primary key in searches, etc, which is why it is used in the JSON and YAML serializers.

Well, I was thinking about this proposal then I went with the pk column name which is id in many cases but necesserily. I included my arguments in the module doc.

I haven't dug in detail to work out why 'nice_foreign_keys' exists, but by the looks of it, it isn't required functionality at all. Foreign keys should be integers (ignoring non-numerical keys for the moment) - not strings containing some magic syntax.

I found this feature very useful in the following situation. The first version of my initial_data.csv looked something like this

1) CSV is a terse syntax by design. This a very pythonic extension.

Agree.

2) Your original complaint was that JSON syntax was too verbose - and then your introduce the most verbose and repetitive component of JSON syntax into your CSV syntax? Am I the only one that thinks this is a little strange?

It would be terser than JSON in most cases and as verbose as JSON is some cases. But your reasoning made me think and I came up with a simpler format. I also realized it wouldn't fit here so if I have enough time, I'll make a new ticket for the CSO (comma-separated objects) format.

3) You are attempting to solve a problem that isn't part of what a serializer should be doing. If you require contenttypes as part of your fixtures, serialize (and deserialize) the contenttypes table. That way you can be guaranteed the IDs for the contenttypes you use.

In my experience, contenttypes creation is hooked on the post_syncdb signal that runs before loading initial_data so when I want to load my data, the contenttypes are already created with messed up ids.

Changed 8 years ago by Adam Schmideg <adam@…>

An implementation that works with serializers_regress test ready for inclusion

comment:7 in reply to: ↑ 5 Changed 8 years ago by Adam Schmideg <adam@…>

Replying to russellm:

The triage process is not quite clear to me. Is there anything else you expect me to do to complete this patch? I think it's documented enough, it runs the tests. This is my first code contribution ever to open source so I'm excited it.

comment:8 Changed 8 years ago by russellm

The triage process calls for exactly what I said in my last message -

Once the patch is ready for inclusion, we will need to raise the design for public comment. This should shake out any objections or counterexamples for the design you are proposing.

That is the next step - getting feedback on django-dev. Your patch contains several issues (header format, model separation, pk naming, handling of empty strings) that are value judgements rather than literal interpretations of a clear specification. If these issues didn't exist, I would commit this straight away. However, the fact that they do exist means that the patch need to be sanity checked by the community.

You already know my position on the pk naming - IMHO consistency with other serializers is more important that consistency with the SQLite database dumper. I'm not wild on the empty string proposal, but I acknowledge the issue. You might want to look at how the Oracle backend handles the null problem - Oracle doesn't differentiate between NULL and "", and I seem to recall there was some special handling for that case.

Now, that's my opinion, and I'm willing to be overruled. However, nothing will happen unless you push it. Start a new thread on django-developers, and deal with the responses. The sprint that is currently underway is another means to get your idea approved. Essentially, if you want it, you have to push it.

comment:9 Changed 8 years ago by PhiR

  • Keywords feature added

comment:10 Changed 7 years ago by Mnewman

  • Owner changed from nobody to Mnewman
  • Status changed from new to assigned

I am adding a version of the current patch that works with current trunk. I need csv serialization for a vendor and would like to see this get into django at some point, so I will accept it, create some tests, address russellm's concerns and bring it up when all those things are ready to see if anyone else likes the idea.

Changed 7 years ago by Mnewman

version of old (incomplete) patch that works with r9232

comment:11 Changed 7 years ago by leith.john@…

I just hacked a mini csv serializer together and would like to propose an idea: serialize different models to different csv files, and then zip them up... but maybe that would be more of a zip serializer? The one i made creates a csv file for every model and then also stores any file/images in the zip file too.

Works for me.

comment:12 Changed 6 years ago by Mnewman

  • Owner changed from Mnewman to nobody
  • Status changed from assigned to new

Backing out of this. Sorry.

comment:13 Changed 6 years ago by erob

  • Cc e.robillard@… added
  • Owner changed from nobody to erob

I propose to take charge of this ticket until someone else want to resume work on it..

comment:14 Changed 6 years ago by erob

Made a updated patch that seems to work with Django 0.96.3. I also changed the is_string_field
function so that it doesn't require importing model definitions by using slightly improved
type introspection.


Changed 6 years ago by erob

Patch based on the previous ones and adding support for 0.96.3

comment:15 follow-up: Changed 6 years ago by Alex

Django .96 is not eligible for new features, this patch should be written against the latest trunk.

Changed 6 years ago by erob

comment:16 Changed 6 years ago by erob

giving up trying to upload the patch... something seems broken with the file uploader thing.

comment:17 Changed 6 years ago by Alex

The patch has been uploaded fine, the preview doesn't always show correctly.

comment:18 in reply to: ↑ 15 Changed 6 years ago by erob

Replying to Alex:

Django .96 is not eligible for new features, this patch should be written against the latest trunk.

I agree, except that my client requires that it works with Django 0.96.3.

Changed 6 years ago by erob

Removed a typo in is_string_field

comment:19 Changed 6 years ago by tymofiy

  • Cc tim.babych@… added

comment:20 Changed 5 years ago by erob

  • Cc erob@… added; e.robillard@… removed

comment:21 Changed 4 years ago by gabrielhurley

  • Severity set to Normal
  • Type set to New feature

comment:22 Changed 4 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:23 Changed 4 years ago by aaugustin

  • Easy pickings unset

Change Easy pickings from NULL to False.

comment:24 Changed 7 months ago by collinanderson

  • Cc cmawebsite@… added

comment:25 Changed 3 months ago by freakboy3742

  • Resolution set to wontfix
  • Status changed from new to closed

I'm going to make an executive decision and close this as a wontfix.

This specific ticket is asking for a piecemeal addition to Django's serializers; the real underlying problem is that Django's serializers aren't as flexible as they need to be for real world problems.

Once upon a time, I harboured dreams of writing a fully flexible serialisation framework for Django; luckily, that is no longer required, as Tom Christie has written a very robust and flexible set of serializers for Django REST Framework. If we want this in Django, we should maybe approach Tom to pull that piece of REST Framework into Django; but frankly, I'm not convinced this is worth the code churn (and it's one more thing that the core team doesn't have to worry about, which is a good thing).

Note: See TracTickets for help on using tickets.
Back to Top