Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#30190 closed New feature (fixed)

json lines (jsonl) serializer

Reported by: aliva Owned by: nobody
Component: Core (Serialization) Version: dev
Severity: Normal Keywords:
Cc: Francisco Javier Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I have create a patch to add jsonl serializer/deserializer to json,

Issue with json serializer is when using dumpdata/loaddata json deserializer tries to load the whole file which causes MemoryError, but with jsonl loaddata will read file line by line which so you won't see memory overflow

note that I will add docs when I get the green light for this patch

Attachments (2)

patch.diff (2.1 KB ) - added by aliva 5 years ago.
patch 1
patch-2-wip.diff (13.8 KB ) - added by aliva 5 years ago.
patch 2 WIP

Download all attachments as: .zip

Change History (16)

by aliva, 5 years ago

Attachment: patch.diff added

patch 1

comment:1 by Carlton Gibson, 5 years ago

Resolution: needsinfo
Status: newclosed

Hi. Thanks for the report.

This seems like a reasonable idea for a serializer. It also though seems perfectly suited to be wrapped up as a third-party application. In general we tend to prefer those, rather than increasing the surface area of the code in the core framework. As such, we'd need to see if there was a consensus on mailing list to add the code. I wouldn't be surprised if a third-party app didn't already exist, so we should look into that too.

You'd need to add test cases as well as docs. I'd suggest putting that together as a third-party app on GitHub say, and then seeing if there was a willingness to move that into core, if that's what you want to do at that point.

comment:2 by Claude Paroz, 5 years ago

See #22259 where you'll find a link to an existing third-party library.

Personally I would be open to integrate this into Django, as several tickets were open in the past wrt memory issues with big files. However, this should indeed be discussed on the django-developers mailing list first.

comment:3 by Ian Foote, 5 years ago

I've been using the library linked in the other ticket with good success. I'd be in favour of adding something to core.

comment:4 by Adam Johnson, 5 years ago

+1 to add to core from me too.

comment:5 by aliva, 5 years ago

Hi

I have started writing tests for jsonl, (I simply copied test_json to test_jsonl) and trying to fix issues. there is a problem I'm facing, when using jsonl output lines can be very long, so what is the preferred way to pass flake for test data?

  • should I add NOQA: E501 for these lines?
  • use this style
a = """{"key": value",""" \
""""key2": value"}"""


  • use this style
a = """{
  "key": "value"
  "ley2: "value2",
}.replace("\n", "")

also I have added patch 2 which is a work in progress and some tests fail

as a sample here is a sample test data from test_json.py

mapping_ordering_str = """[
{
  "model": "serializers.article",
  "pk": %(article_pk)s,
  "fields": {
    "author": %(author_pk)s,
    "headline": "Poker has no place on ESPN",
    "pub_date": "2006-06-16T11:00:00",
    "categories": [
      %(first_category_pk)s,
      %(second_category_pk)s
    ],
    "meta_data": []
  }
}
]

It should be like this for jsonl

mapping_ordering_str = """{"model": "serializers.article", "pk": %(article_pk)s, fields": {"author": %(author_pk)s, "headline": "Poker has no place on ESPN", pub_date": "2006-06-16T11:00:00", categories": [%(first_category_pk)s, %(second_category_pk)s], "meta_data": []}}\n"""

by aliva, 5 years ago

Attachment: patch-2-wip.diff added

patch 2 WIP

comment:6 by Francisco Javier, 4 years ago

The Django Core MUST solve his own things by himself, and not relaying to a third party solution.

If dumpdata/loaddata is broken for very big datasets, the issue must be addresed and solved as soon as posible.

In my particular case, I have a Sqlite3 DataBase of 800MB, and want to migrate to PostgreSQL. The Dumpdata/Loaddata combo is the only straight way to do it.

If the django-mljson (or the patch submited in the ticket) solves the MemoryError of Loaddata in big datasets, Django-Core must integrate the mljson (or the patch submited) as the default serializer/deserializer for dumpdata/loaddata process.

Actually, my issue is solved by using django-mljson, but lost 2 days figuring 'what the heck' was going wrong.

comment:7 by Francisco Javier, 4 years ago

Cc: Francisco Javier added
Needs tests: set
Resolution: needsinfo
Status: closednew
Type: New featureBug

comment:8 by Mariusz Felisiak, 4 years ago

Resolution: needsinfo
Status: newclosed
Type: BugNew feature

Please don't reopen closed ticket. This ticket should be discussed in the DevelopersMailingList before reopening. We can reconsider it when we reach a strong consensus on the mailing list.

Also please try to be more polite, your comment sounds like a demand. This is not a good way to be heard.

comment:9 by aliva, 4 years ago

Hi!

I have talked about it on mailing list and here is my PR

Last edited 4 years ago by aliva (previous) (diff)

comment:10 by aliva, 4 years ago

Needs tests: unset
Resolution: needsinfo
Status: closednew

comment:11 by Carlton Gibson, 4 years ago

Triage Stage: UnreviewedAccepted

OK, thanks for updating the ticket here.

comment:12 by Carlton Gibson, 4 years ago

Triage Stage: AcceptedReady for checkin

comment:13 by GitHub <noreply@…>, 4 years ago

Resolution: fixed
Status: newclosed

In e2963768:

Fixed #30190 -- Added JSONL serializer.

comment:14 by GitHub <noreply@…>, 4 years ago

In 78c81133:

Refs #30190 -- Minor edits to JSONL serializer.

Follow up to e29637681be07606674cdccb47d1e53acb930f5b.

Note: See TracTickets for help on using tickets.
Back to Top