Opened 2 years ago

Closed 7 months ago

Last modified 7 months ago

#30190 closed New feature (fixed)

json lines (jsonl) serializer

Reported by: aliva Owned by: nobody
Component: Core (Serialization) Version: master
Severity: Normal Keywords:
Cc: Francisco Javier Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I have created a patch that adds a JSONL (JSON Lines) serializer/deserializer alongside the existing json serializer.

The issue with the json serializer is that, when using dumpdata/loaddata, the deserializer tries to load the whole file at once, which can cause a MemoryError on large fixtures. With jsonl, loaddata reads the file line by line, so you won't see memory blow up.

Note that I will add docs once this patch gets the green light.
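The memory behaviour described above can be illustrated with a minimal line-by-line deserializer. This is only a sketch of the idea, not the actual patch or Django's internal API; the function name `iter_jsonl` is invented for illustration:

```python
import io
import json

def iter_jsonl(stream):
    # Yield one deserialized object per line. Memory use is
    # proportional to the longest line, not to the whole file,
    # unlike json.load() which must parse the entire document.
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Example: two objects on two lines, consumed lazily.
data = io.StringIO('{"pk": 1}\n{"pk": 2}\n')
pks = [obj["pk"] for obj in iter_jsonl(data)]
# pks == [1, 2]
```

With a plain JSON fixture, the deserializer has no such option: the top-level `[...]` array must be parsed in full before the first object is available.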

Attachments (2)

patch.diff (2.1 KB) - added by aliva 2 years ago.
patch 1
patch-2-wip.diff (13.8 KB) - added by aliva 23 months ago.
patch 2 WIP


Change History (16)

Changed 2 years ago by aliva

Attachment: patch.diff added

patch 1

comment:1 Changed 2 years ago by Carlton Gibson

Resolution: set to needsinfo
Status: new → closed

Hi. Thanks for the report.

This seems like a reasonable idea for a serializer, though it also seems perfectly suited to being wrapped up as a third-party application. In general we prefer those, rather than increasing the surface area of the code in the core framework. As such, we'd need to see whether there is consensus on the mailing list to add the code. I wouldn't be surprised if a third-party app already exists, so we should look into that too.

You'd need to add test cases as well as docs. I'd suggest putting this together as a third-party app on GitHub, say, and then seeing whether there is willingness to move it into core, if that's what you want to do at that point.

comment:2 Changed 2 years ago by Claude Paroz

See #22259 where you'll find a link to an existing third-party library.

Personally, I would be open to integrating this into Django, as several tickets have been opened in the past regarding memory issues with big files. However, this should indeed be discussed on the django-developers mailing list first.

comment:3 Changed 23 months ago by Ian Foote

I've been using the library linked in the other ticket with good success. I'd be in favour of adding something to core.

comment:4 Changed 23 months ago by Adam Johnson

+1 to add to core from me too.

comment:5 Changed 23 months ago by aliva

Hi

I have started writing tests for jsonl (I simply copied test_json to test_jsonl) and I'm fixing the remaining issues. One problem I'm facing: with jsonl, output lines can be very long, so what is the preferred way to pass flake8 for test data?

  • should I add # noqa: E501 to these lines?
  • use implicit string concatenation:
a = """{"key": "value",""" \
    """"key2": "value2"}"""


  • or build the string and strip newlines:
a = """{
  "key": "value",
  "key2": "value2"
}""".replace("\n", "")

I have also added patch 2, which is a work in progress; some tests still fail.

As a sample, here is test data from test_json.py:

mapping_ordering_str = """[
{
  "model": "serializers.article",
  "pk": %(article_pk)s,
  "fields": {
    "author": %(author_pk)s,
    "headline": "Poker has no place on ESPN",
    "pub_date": "2006-06-16T11:00:00",
    "categories": [
      %(first_category_pk)s,
      %(second_category_pk)s
    ],
    "meta_data": []
  }
}
]"""

For jsonl, it should look like this:

mapping_ordering_str = """{"model": "serializers.article", "pk": %(article_pk)s, "fields": {"author": %(author_pk)s, "headline": "Poker has no place on ESPN", "pub_date": "2006-06-16T11:00:00", "categories": [%(first_category_pk)s, %(second_category_pk)s], "meta_data": []}}\n"""
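The single-line form above is essentially what `json.dumps` produces without an `indent` argument, plus a trailing newline. A minimal sketch (the helper name `to_jsonl_line` is illustrative, not part of the patch):

```python
import json

def to_jsonl_line(obj):
    # Without indent=, json.dumps keeps the whole object on a
    # single line, which is exactly what the JSONL format requires.
    return json.dumps(obj) + "\n"

record = {
    "model": "serializers.article",
    "pk": 1,
    "fields": {"headline": "Poker has no place on ESPN"},
}
line = to_jsonl_line(record)
# line contains no interior newlines and ends with "\n"
```

This is also why the test data gets so long: the whole object must stay on one physical line, so pretty-printing is not an option.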

Changed 23 months ago by aliva

Attachment: patch-2-wip.diff added

patch 2 WIP

comment:6 Changed 14 months ago by Francisco Javier

Django core MUST solve its own things itself, not rely on a third-party solution.

If dumpdata/loaddata is broken for very big datasets, the issue must be addressed and solved as soon as possible.

In my particular case, I have an SQLite3 database of 800 MB and want to migrate to PostgreSQL. The dumpdata/loaddata combo is the only straightforward way to do it.

If django-mljson (or the patch submitted in this ticket) solves the MemoryError in loaddata on big datasets, Django core must integrate mljson (or the patch) as the default serializer/deserializer for the dumpdata/loaddata process.

My issue is actually solved by using django-mljson, but I lost two days figuring out what the heck was going wrong.

comment:7 Changed 14 months ago by Francisco Javier

Cc: Francisco Javier added
Needs tests: set
Resolution: needsinfo deleted
Status: closed → new
Type: New feature → Bug

comment:8 Changed 14 months ago by Mariusz Felisiak

Resolution: set to needsinfo
Status: new → closed
Type: Bug → New feature

Please don't reopen closed tickets. This should be discussed on the DevelopersMailingList before reopening; we can reconsider it when we reach a strong consensus there.

Also, please try to be more polite; your comment reads like a demand, which is not a good way to be heard.

comment:9 Changed 13 months ago by aliva

Hi!

I have discussed this on the mailing list, and here is my PR.

Last edited 13 months ago by aliva

comment:10 Changed 8 months ago by aliva

Needs tests: unset
Resolution: needsinfo deleted
Status: closed → new

comment:11 Changed 8 months ago by Carlton Gibson

Triage Stage: Unreviewed → Accepted

OK, thanks for updating the ticket here.

comment:12 Changed 7 months ago by Carlton Gibson

Triage Stage: Accepted → Ready for checkin

comment:13 Changed 7 months ago by GitHub <noreply@…>

Resolution: set to fixed
Status: new → closed

In e2963768:

Fixed #30190 -- Added JSONL serializer.
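With this merged (the feature shipped in Django 3.1), the serializer is available to the dumpdata/loaddata commands. Assuming a configured Django project, usage looks roughly like this (`data.jsonl` is an example filename):

```shell
# Dump all data as JSON Lines, one object per line, so loaddata
# can later stream it instead of parsing one giant JSON array.
python manage.py dumpdata --format=jsonl --output=data.jsonl

# Reload into another database (e.g. after switching engines);
# loaddata picks the serializer from the .jsonl extension.
python manage.py loaddata data.jsonl
```

This addresses the MemoryError scenario from the original report: the 800 MB SQLite-to-PostgreSQL migration described in comment:6 can be done without holding the whole fixture in memory.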

comment:14 Changed 7 months ago by GitHub <noreply@…>

In 78c81133:

Refs #30190 -- Minor edits to JSONL serializer.

Follow up to e29637681be07606674cdccb47d1e53acb930f5b.
