Django

Code

Ticket #2417 (assigned)

Opened 2 years ago

Last modified 2 weeks ago

Support for binary type fields (aka: bytea in postgres and VARBINARY in mysql)

Reported by: scanner@nominum.com Assigned to: nobody (accepted)
Component: Database wrapper Version:
Keywords: Cc: matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk, bsdlogical@bsdlogical.com, david@chocorisu.com, vic@stream.net.ua, eric@ericwalstad.com
Triage Stage: Accepted Has patch: 1
Needs documentation: 0 Needs tests: 1
Patch needs improvement: 0

Description

I have a need of having fields that store small blips of binary information. In general this works fine with the CharField?, however it will not work for all such fields. For example a datum that is just the NUL character, or a series of NUL characters. With postgres at least: "Character strings disallow zero octets,"

These are not things like images or other such large amounts of data. They are typically fairly short (10 to 200 bytes).

For my purposes I created a small patch that adds a new field type that basically is a CharField? except it uses the binary type on postgres and mysql.

Does the django project plan on supporting a binary type field, if so should I post my patch for this?

Attachments

binaryfield.patch (4.1 kB) - added by scanner@nominum.com on 07/27/06 13:49:04.
A small patch to add a BinaryField? type directly to django. Supports small'ish binary fields.
model-api.patch (0.7 kB) - added by Marc Fargas <telenieko@telenieko.com> on 01/18/07 15:48:07.
Documentation for BinaryField?

Change History

07/26/06 15:52:09 changed by chukharev@mail.ru

Please post your patch, I think I'm not the only one who is interested to see it.

07/27/06 13:47:49 changed by scanner@nominum.com

Here is the patch against the HEAD of django's subversion tree. What I actually use is a teeny bit different in that I have put the BinaryField? class in my project instead of modifying django's source. I had to modify django/db/models/manipulator.py::manipulator_validator_unique_together() because binary types do not support case-insensitive comparison under postgresql at least (see http://code.djangoproject.com/ticket/2422). Otherwise this is an extremely trivial patch.

07/27/06 13:49:04 changed by scanner@nominum.com

  • attachment binaryfield.patch added.

A small patch to add a BinaryField? type directly to django. Supports small'ish binary fields.

07/28/06 15:29:29 changed by jacob

  • owner changed from adrian to jacob.
  • status changed from new to assigned.

This looks very, very good to me.

However, it needs (a) tests and (b) documentation before it can be checked in. Can you add 'em to the patch?

08/01/06 12:57:13 changed by todd.kennedy@gmail.com

There could be potential issues here. There should be some sort of support for setting the maximum size that a file can be that gets inserted into this field to prevent a DOS issue by uploading multiple large items. The BYTEA column can contain any size data (see http://www.postgresql.org/docs/8.1/static/datatype-binary.html).

Would be great if this could be used for ImageField? as well. Maybe instead of adding a new field type, allow the user to set the "backend" that Django uses to store the file. When set to Binary it would create a new table to hold the binary objects, then a ForeignKey? relationship between the table that contains the metadata for the binary object, and the the table which just contains the binary object itself. Deciding whether to update the binary information in the binary table could be done as the result of an md5sum hash or some other form.

08/01/06 15:33:27 changed by mtredinnick

Let's try to keep one issue per ticket, please. This ticket is not about storing ImageFields in the database.

08/06/06 08:10:11 changed by toddobryan@mac.com

Could we consider using LARGE_BLOB as the default type for MySQL (and ado_mssql, I guess), so that there's not a 64k limit on the size, by default. I realize your application is specifically for small bits of binary data, but if there's only going to be one binary type, it should be as flexible as possible.

Also, I'm not sure about extending CharField?. As I mentioned in an email to django-developers, that means you show up in the admin, and I have to agree with Malcolm that it's probably not a good thing to just display random binary data. I'm not sure what the best way to allow binary input is, but maybe using hex, as you mentioned in the patch, is an option. At any rate, I'm not sure BinaryField? should inherit from any currently defined field type.

12/08/06 05:56:35 changed by anonymous

Is this getting any attention?

I also need binary fields to store mid-states of large computing tasks i do in python - storing and restoring them using pickle and having them all well sorted in a database...

01/16/07 18:32:05 changed by Robert Myers <myer0052@gmail.com>

  • cc set to myer0052@gmail.com.

01/17/07 20:09:07 changed by Gary Wilson <gary.wilson@gmail.com>

  • keywords deleted.
  • needs_docs set to 1.
  • has_patch set to 1.
  • needs_tests set to 1.
  • stage changed from Unreviewed to Accepted.

As per Jacob's comments above.

(follow-up: ↓ 18 ) 01/18/07 15:44:20 changed by Marc Fargas <telenieko@telenieko.com>

Just a side note, BinaryField? will require maxlength (as it inherits CharField?) but PostgreSQL backend will ignore it. Related to this, you can use this BinaryField? to store any amount of data on PostgreSQL but only maxlength bytes on MySQL.

The ticket is marked "needs documentation" so I'm writting it, I'll write a note stating that ;)

01/18/07 15:45:35 changed by Marc Fargas <telenieko@telenieko.com>

Oh, forgot to mention, CharFields? show as <input type="text"> on the Admin, how is BinaryField? shown? It could be maybe a checkbox stating "there is data on this BinaryField?, uncheck to delete it" or something like that?

01/18/07 15:48:07 changed by Marc Fargas <telenieko@telenieko.com>

  • attachment model-api.patch added.

Documentation for BinaryField?

01/18/07 15:53:27 changed by Marc Fargas <telenieko@telenieko.com>

Where are tests for Field-types written? could not find them :)

01/18/07 16:24:51 changed by Marc Fargas <telenieko@telenieko.com>

  • needs_docs deleted.

03/26/07 05:42:18 changed by Simon G. <dev@simon.net.nz>

03/26/07 06:26:52 changed by Ivan Sagalaev <Maniac@SoftwareManiacs.Org>

BTW, Postgres (at least) requires escaping for some characters put into BYTEA field. In my code I used to do it like this (this yields octal representation for all characters):

class ByteAField(models.CharField):
  def get_db_prep_save(self, value):
    return value and ''.join(['\\%03o' % ord(c) for c in value])

However it's put in field's code and I don't know how to do it in a backend-specific way...

03/26/07 06:27:30 changed by Ivan Sagalaev <Maniac@SoftwareManiacs.Org>

  • cc changed from myer0052@gmail.com to myer0052@gmail.com, Maniac@SoftwareManiacs.Org.

03/28/07 06:01:34 changed by Simon G. <dev@simon.net.nz>

This looks good, and we'd like to get this in. However, the patch needs a few improvements (and tests!) if possible (see the discussion in the list thread I linked above). First, it should not subclass CharField?, and it will need to maintain database portability (do we need to transparently store everything in base64 here?). Finally, we should increase the size of this (not to store huge blobs of data, but larger than 255 bytes - 64kb is enough?)

(in reply to: ↑ 10 ) 03/28/07 06:17:46 changed by Marc Fargas <telenieko@telenieko.com>

Replying to Marc Fargas <telenieko@telenieko.com>:

Just a side note, BinaryField? will require maxlength (as it inherits CharField?) but PostgreSQL backend will ignore it. Related to this, you can use this BinaryField? to store any amount of data on PostgreSQL but only maxlength bytes on MySQL. The ticket is marked "needs documentation" so I'm writting it, I'll write a note stating that ;)

If a binary field is provided it should allow the developer to decide which amount of data does he/she want to store so in the case of PostgreSQL maxlength could be enforced checking the length before saving **and** it should not be limited to 64Kb, it should allow **any amount** from 1 byte to the maximum field size (1GB on postgresql).

Also maxlength should be "maxbytes" ?

04/03/07 18:10:10 changed by Simon G. <dev@simon.net.nz>

Ok - so does someone want to rewrite the attached patch to take into account my and Marc's comments above?

05/18/07 07:37:40 changed by Michael P. Jung

  • cc changed from myer0052@gmail.com, Maniac@SoftwareManiacs.Org to myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de.

How about calling the argument which limits the file size "maxsize"?

This would be analogous to "maxlength" (not "maxchars") where you limit the length of a CharField?.

05/18/07 17:50:54 changed by SmileyChris

But make sure you call it max_size (or _bytes), as the CharField? one is changing to max_length eventually.

06/10/07 06:27:12 changed by anonymous

  • cc changed from myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de to myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org.

07/06/07 04:49:01 changed by smurf@smurf.noris.de

  • cc changed from myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org to matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org.

07/06/07 05:07:39 changed by smurf@smurf.noris.de

The Admin interface to a binary field should probably be an upload/download interface.

Like a FileField? (and a downlaod link next to it), except that it stores its data in the database instead.

10/17/07 10:08:05 changed by anonymous

  • cc changed from matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org to matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk.

01/16/08 11:28:39 changed by Nick Fishman <bsdlogical@bsdlogical.com>

Simon,

I'm interested in bringing this patch so some sort of close. I read through the thread you posted earlier, and I think the best thing to do is exactly what you said in your last message: focus on this patch first, then think about storing large data later (or at least leave it up to individual developers).

As far as I understand, we need the following changes to this BinaryField?:

  1. The Field should not inherit from anything.
  2. The Field should transparently store everything in base64
  3. The Field needs a max_size attribute to let developers specify the maximum amount of data it can store. This will have to be enforced by Django before saving it to the database, since PostgreSQL will ignore the attribute.
  4. Many more good tests.

Am I missing anything?

Nick

01/16/08 11:29:30 changed by Nick Fishman <bsdlogical@bsdlogical.com>

  • cc changed from matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk to matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk, bsdlogical@bsdlogical.com.

Adding myself to cc list.

01/30/08 02:03:33 changed by David Swift <david@chocorisu.com>

  • cc changed from matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk, bsdlogical@bsdlogical.com to matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk, bsdlogical@bsdlogical.com, david@chocorisu.com.

I'm very keen to see this feature. I'd like to use Django for a project that will store a lot of binary data and I'd rather store it in the database for scalability and ease of backups.

01/30/08 02:04:47 changed by David Swift <david@chocorisu.com>

Nick - re #2, base64 surely isn't necessary if the database supports binary BLOBs, or am I missing something?

02/19/08 09:22:44 changed by mrts

Please do not add transparent base64 encoding as

  • none of the current database backends need it for storing binary data - except postgres, that needs only escaping a fixed set of symbols (and that could indeed be done transparently, see http://www.postgresql.org/docs/8.3/interactive/datatype-binary.html)
  • it degrades performance with no gain (needless pessimisation)
  • it makes interoperability with other frameworks (e.g. openid.store.sqlstore) that may need to store their data in a Django-managed database impossible.

So the roadmap should be as follows (according to the discussion in mailing list):

  1. There should be two fields, BinaryField? and SmallBinaryField? that use different backend features as appropriate (NEEDS BACKEND EXPERTISE AND DECISION, use BLOB/VARBINARY/bytea properly)
  2. The fields should inherit directly from db.models.fields.Field (TRIVIAL)
  3. The fields should NOT mangle the provided binary data on save (no base64-encoding), except for postgres (TRIVIAL, postgres escaping should be in the postgres backend)
    • the fields could base64-encode or hexencode (a subset of?) the raw data for display on retrieve (TRIVIAL, interface needs decision though, should __unicode__() do it?)
    • also, for large fields, a download link from admin could be provided (NON-TRIVIAL, less essential)
  4. The fields may need a max_size attribute to let developers specify the maximum amount of data they can store. This will have to be enforced by Django before saving it to the database, since PostgreSQL will ignore the attribute (NEEDS DECISION, a naive len(1_gb_of_data) is not a good idea, the restriction is less essential IMHO)
  5. An appropriate test suite is required.

Image/FileField backend in DB with these fields is non-essential and should be marked someday/maybe.

02/19/08 09:24:14 changed by mrts

I'm willing to update the patch once design decision is reached.

02/19/08 17:49:50 changed by Nick Fishman <bsdlogical@bsdlogical.com>

mrts,

You're absolutely right about transparent base64 encoding. I didn't think through the implications, and your reasoning explains why it's not necessary.

For #1, I'm thinking that the BinaryField? should provide lots of data (i.e. unlimited) while SmallBinaryField? should have a max_size attribute (#4) enforced by Django before saving it to the database, as you mentioned. Jacob talked about this in http://groups.google.com/group/django-developers/browse_thread/thread/266565a23b765ea2

-- PostgreSQL --

I think the BLOB in Postgres is best suited for BinaryField?, as it has no size limit. Although bytea in Postgres also has no size limit, it's a "binary string," so it seems like we should save that for the SmallBinaryField? with a defined max_size.

As far as #3, we should definitely escape data passed into postgres, but I couldn't find a method in the backend to do it.

-- MySQL --

I'm not sure what the best choice is for MySQL. Quoting from the MySQL documentation:

  • TINYBLOB - A BLOB column with a maximum length of 255 (28 - 1) characters.
  • BLOB - A BLOB column with a maximum length of 65,535 (216 - 1) characters.
  • MEDIUMBLOB - A BLOB column with a maximum length of 16,777,215 (224 - 1) characters.
  • LONGBLOB - A BLOB column with a maximum length of 4,294,967,295 (232 - 1) characters.

Using MEDIUMBLOB will give us about 15mb of data, but if we want BinaryField? to truly be unlimited in size, that won't work for all cases. I'm not sure what to do with that.

For SmallBinaryField?, I thought of using VARBINARY, which doesn't pad values up to the limit specified, but both BINARY and VARBINARY have a maximum size of 65535 bytes, which isn't nearly enough. It seems like we'll need something like MEDIUMBLOB instead? Or maybe different BLOB types depending on the max_size? I'm not sure how to resolve this.

I'm not sure about providing a download link from the admin interface. It's a possibility, but I'm -0 on it since actually getting the fields ready is a higher priority. Also, I don't think we should think about a DB backend for Image/FileField. If someone really wants to do that, they can code it on their own.

03/02/08 09:36:01 changed by mrts

  • stage changed from Accepted to Design decision needed.

Resetting triage state to design decision needed. Once concensus is reached, I'm ready to start hacking.

03/02/08 11:44:11 changed by Nick Fishman <bsdlogical@bsdlogical.com>

mrts, could you offer your comments on my last comment? I've attempted to lay out the specifications for how this should be done.

03/03/08 04:51:03 changed by mrts

I think the Django BDFLs should comment on this now.

03/03/08 06:30:21 changed by mtredinnick

Whoever works on this should realise it's a little trickier than it looks. You'll need to work carefully with the database backends so that this binary data isn't automatically converted to unicode (which happens for all bytestrings at the moment). Make sure you test with binary data that cannot possibly be treated as valid UTF-8, so that you don't accidentally get fooled by it being converted to Unicode. I'd suggest creating a new class, similar to SmartUnicode?, etc, for autoescaping that is used to tell the database backend *not* to convert this string. Might require some magical incantations for things like psycopg2 and mysql that do the UTF-8 -> Unicode conversion in the Python database wrapper (beyond the Django borders); that will need research.

I don't know why this ticket is in design decision needed, since there aren't any specific questions that seem to be unresolved beyond "what should we choose?". Just pick something and go. The field doesn't need infinitely many options. One case cannot possibly meet all requirements in any case, so pick a practical middle ground that works completely portably on all our database backends and if people want something else they can always subclass the field.

If you do need a design discussion, this ticket isn't the place for it. We have a mailing list for that purpose.

03/03/08 11:43:50 changed by jacob

  • stage changed from Design decision needed to Accepted.

-> Accepted.

04/07/08 12:52:07 changed by anonymous

  • cc changed from matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk, bsdlogical@bsdlogical.com, david@chocorisu.com to matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk, bsdlogical@bsdlogical.com, david@chocorisu.com, vic@stream.net.ua.

04/28/08 13:59:55 changed by Eric Walstad <eric@ericwalstad.com>

  • cc changed from matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk, bsdlogical@bsdlogical.com, david@chocorisu.com, vic@stream.net.ua to matthias@urlichs.de, myer0052@gmail.com, Maniac@SoftwareManiacs.Org, mpjung@terreon.de, nelchael@gentoo.org, sam@robots.org.uk, bsdlogical@bsdlogical.com, david@chocorisu.com, vic@stream.net.ua, eric@ericwalstad.com.

Add/Change #2417 (Support for binary type fields (aka: bytea in postgres and VARBINARY in mysql))




Change Properties
Action