Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#32411 closed Cleanup/optimization (fixed)

Case-insensitive lookups on JSONField doesn't work on MySQL.

Reported by: elonzh Owned by: Hasan Ramezani
Component: Database layer (models, ORM) Version: 3.1
Severity: Normal Keywords: JSONField mysql
Cc: Mariusz Felisiak, Sage Abdullah Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by elonzh)

given model

class Paper(models.Model):
    ...
    authors = models.JSONField(default=list)
    ...
In [53]: p = Paper.objects.get(id=1)

In [54]: p.__dict__
Out[54]: 
{'_state': <django.db.models.base.ModelState at 0x7f2aa74b3790>,
 'id': 1,
...
 'authors': [{'name': 'Rehmann Kim-Thomas',
   'workplace': 'SAP SE, Walldorf, Germany'},
  {'name': 'Folkerts Enno', 'workplace': 'SAP SE, Walldorf, Germany'}],
...
}

In [55]: print(Paper.objects.filter(authors__icontains="rehmann").count())
0

In [56]: print(Paper.objects.filter(authors__icontains="Rehmann").count())
1

In [57]: print(Paper.objects.filter(authors__icontains="Rehmann").query)
SELECT `scholardata_paper`.`id`, `scholardata_paper`.`created`, `scholardata_paper`.`modified`, `scholardata_paper`.`doi`, `scholardata_paper`.`s2_id`, `scholardata_paper`.`wanfang_id`, `scholardata_paper`.`corpus_id`, `scholardata_paper`.`pm_id`, `scholardata_paper`.`arxiv_id`, `scholardata_paper`.`acl_id`, `scholardata_paper`.`mag_id`, `scholardata_paper`.`language`, `scholardata_paper`.`url`, `scholardata_paper`.`abstract`, `scholardata_paper`.`title`, `scholardata_paper`.`authors`, `scholardata_paper`.`keywords`, `scholardata_paper`.`issued`, `scholardata_paper`.`venue`, `scholardata_paper`.`journal_id`, `scholardata_paper`.`volume`, `scholardata_paper`.`issue`, `scholardata_paper`.`page`, `scholardata_paper`.`references_count`, `scholardata_paper`.`citations_count`, `scholardata_paper`.`refreshed`, `scholardata_paper`.`references_refreshed`, `scholardata_paper`.`citations_refreshed` FROM `scholardata_paper` WHERE JSON_UNQUOTE(`scholardata_paper`.`authors`) LIKE %Rehmann% ORDER BY `scholardata_paper`.`id` ASC

In [58]: print(Paper.objects.filter(authors__iregex="rehmann").count())
1

In [59]: print(Paper.objects.filter(authors__iregex="Rehmann").count())
1

In [60]: print(Paper.objects.filter(authors__iregex="Rehmann").query)
SELECT `scholardata_paper`.`id`, `scholardata_paper`.`created`, `scholardata_paper`.`modified`, `scholardata_paper`.`doi`, `scholardata_paper`.`s2_id`, `scholardata_paper`.`wanfang_id`, `scholardata_paper`.`corpus_id`, `scholardata_paper`.`pm_id`, `scholardata_paper`.`arxiv_id`, `scholardata_paper`.`acl_id`, `scholardata_paper`.`mag_id`, `scholardata_paper`.`language`, `scholardata_paper`.`url`, `scholardata_paper`.`abstract`, `scholardata_paper`.`title`, `scholardata_paper`.`authors`, `scholardata_paper`.`keywords`, `scholardata_paper`.`issued`, `scholardata_paper`.`venue`, `scholardata_paper`.`journal_id`, `scholardata_paper`.`volume`, `scholardata_paper`.`issue`, `scholardata_paper`.`page`, `scholardata_paper`.`references_count`, `scholardata_paper`.`citations_count`, `scholardata_paper`.`refreshed`, `scholardata_paper`.`references_refreshed`, `scholardata_paper`.`citations_refreshed` FROM `scholardata_paper` WHERE REGEXP_LIKE(JSON_UNQUOTE(`scholardata_paper`.`authors`), Rehmann, 'i') ORDER BY `scholardata_paper`.`id` ASC

In [63]: print(Paper.objects.filter(authors__0__name__icontains="rehmann").count())
1

In [64]: print(Paper.objects.filter(authors__0__name__icontains="rehmann").query)
SELECT `scholardata_paper`.`id`, `scholardata_paper`.`created`, `scholardata_paper`.`modified`, `scholardata_paper`.`doi`, `scholardata_paper`.`s2_id`, `scholardata_paper`.`wanfang_id`, `scholardata_paper`.`corpus_id`, `scholardata_paper`.`pm_id`, `scholardata_paper`.`arxiv_id`, `scholardata_paper`.`acl_id`, `scholardata_paper`.`mag_id`, `scholardata_paper`.`language`, `scholardata_paper`.`url`, `scholardata_paper`.`abstract`, `scholardata_paper`.`title`, `scholardata_paper`.`authors`, `scholardata_paper`.`keywords`, `scholardata_paper`.`issued`, `scholardata_paper`.`venue`, `scholardata_paper`.`journal_id`, `scholardata_paper`.`volume`, `scholardata_paper`.`issue`, `scholardata_paper`.`page`, `scholardata_paper`.`references_count`, `scholardata_paper`.`citations_count`, `scholardata_paper`.`refreshed`, `scholardata_paper`.`references_refreshed`, `scholardata_paper`.`citations_refreshed` FROM `scholardata_paper` WHERE LOWER(JSON_UNQUOTE(JSON_EXTRACT(`scholardata_paper`.`authors`, $[0]."name"))) LIKE LOWER(%rehmann%) ORDER BY `scholardata_paper`.`id` ASC

Change History (9)

comment:1 by elonzh, 3 years ago

Description: modified (diff)

comment:2 by Carlton Gibson, 3 years ago

Cc: Mariusz Felisiak added
Resolution: invalid
Status: newclosed

Hi. Thanks for the report.

Your usage is wrong for what you're trying to do. See the Querying JSONField topic docs.

For JSONField contains takes key pairs, rather than a string:

>>> Dog.objects.filter(data__contains={'owner': 'Bob'})

icontains is not explicitly implemented on the field itself, so it's falling back to the base implementation (wrongly — the JSON_UNQUOTE gives us a string but we didn't generate the LOWER) but it's not clear that makes much sense. (Mariusz: I'm closing this as invalid, but do we want to accept in order to raise an error here? 🤔)

As you've discovered the correct usage is to query by field (as per here):

>>> Dog.objects.filter(data__owner__name__icontains='bob') 

For your example Paper.objects.filter(authors__name__icontains="rehmann") should give you what you want.
I hope that helps.

This seems related to the documentation issue #26511.

comment:3 by Mariusz Felisiak, 3 years ago

Cc: Sage Abdullah added
Keywords: mysql added
Resolution: invalid
Status: closednew
Summary: JSONField filter icontains is case sensitive in mysqlCase-insensitive lookups on JSONField doesn't work on MySQL.
Triage Stage: UnreviewedAccepted
Type: BugCleanup/optimization

This is not a documented usage, however it doesn't work only on MySQL, so IMO we can fix this for consistency. MySQL handles strings used in JSON context using the utf8mb4_bin binary collation, that's why comparison of JSON values is case-sensitive and we need to use CaseInsensitiveMixin. We can fix this be registering a custom case-insensitive lookup for JSONField, e.g.

class JSONFieldIContains(CaseInsensitiveMixin, lookups.IContains):
    pass


JSONField.register_lookup(JSONFieldIContains)

I don't think it makes sense to do the same for __istartswith, iendswith, and __iexact.

comment:4 by Hasan Ramezani, 3 years ago

Has patch: set
Owner: changed from nobody to Hasan Ramezani
Status: newassigned

comment:5 by Mariusz Felisiak, 3 years ago

Triage Stage: AcceptedReady for checkin

comment:6 by Mariusz Felisiak <felisiak.mariusz@…>, 3 years ago

Resolution: fixed
Status: assignedclosed

In 63d239db:

Fixed #32411 -- Fixed icontains lookup for JSONField on MySQL.

in reply to:  2 comment:7 by elonzh, 3 years ago

Replying to Carlton Gibson:

Hi. Thanks for the report.

Your usage is wrong for what you're trying to do. See the Querying JSONField topic docs.

For JSONField contains takes key pairs, rather than a string:

>>> Dog.objects.filter(data__contains={'owner': 'Bob'})

icontains is not explicitly implemented on the field itself, so it's falling back to the base implementation (wrongly — the JSON_UNQUOTE gives us a string but we didn't generate the LOWER) but it's not clear that makes much sense. (Mariusz: I'm closing this as invalid, but do we want to accept in order to raise an error here? 🤔)

As you've discovered the correct usage is to query by field (as per here):

>>> Dog.objects.filter(data__owner__name__icontains='bob') 

For your example Paper.objects.filter(authors__name__icontains="rehmann") should give you what you want.
I hope that helps.

This seems related to the documentation issue #26511.

Thanks for your information, I know how to query field in a json object, but the problem is the authors fields is a list and Django does not support filter it.

Paper.objects.filter(authors__name__icontains="rehmann") won't give me what I want.

>>> print(Paper.objects.filter(authors__name__icontains="rehmann").query)
SELECT `scholardata_paper`.`id`, `scholardata_paper`.`created`, `scholardata_paper`.`modified`, `scholardata_paper`.`doi`, `scholardata_paper`.`s2_id`, `scholardata_paper`.`wanfang_id`, `scholardata_paper`.`corpus_id`, `scholardata_paper`.`pm_id`, `scholardata_paper`.`arxiv_id`, `scholardata_paper`.`acl_id`, `scholardata_paper`.`mag_id`, `scholardata_paper`.`language`, `scholardata_paper`.`url`, `scholardata_paper`.`abstract`, `scholardata_paper`.`title`, `scholardata_paper`.`authors`, `scholardata_paper`.`keywords`, `scholardata_paper`.`issued`, `scholardata_paper`.`venue`, `scholardata_paper`.`journal_id`, `scholardata_paper`.`volume`, `scholardata_paper`.`issue`, `scholardata_paper`.`page`, `scholardata_paper`.`references_count`, `scholardata_paper`.`citations_count`, `scholardata_paper`.`refreshed`, `scholardata_paper`.`references_refreshed`, `scholardata_paper`.`citations_refreshed` FROM `scholardata_paper` WHERE LOWER(JSON_UNQUOTE(JSON_EXTRACT(`scholardata_paper`.`authors`, $."name"))) LIKE LOWER(%rehmann%) ORDER BY `scholardata_paper`.`id` ASC

see https://dev.mysql.com/doc/refman/8.0/en/json.html#docs-body:~:text=Searching%20and%20Modifying%20JSON%20Values,-A

Last edited 3 years ago by elonzh (previous) (diff)

comment:8 by Simon Charette, 3 years ago

the problem is the authors fields is a list and Django does not support filter it

That has little to do with this ticket though. The ORM has no way of knowing authors is a JSON Array because JSONField are schema less. If you want to filter out papers that have at least one author with a matching name you'll have to use the MySQL equivalent of PostgreSQL json_array_elements and a subquery through a Func expression but that's something you should ask about on support channels.

Last edited 3 years ago by Simon Charette (previous) (diff)

in reply to:  8 comment:9 by elonzh, 3 years ago

Replying to Simon Charette:

the problem is the authors fields is a list and Django does not support filter it

That little to do with this ticket though. The ORM has no way of knowing authors is a JSON Array because JSONField are schema less. If you want to filter out papers that have at least one author with a matching name you'll have to use the MySQL equivalent of PostgreSQL json_array_elements and a subquery through a Func expression but that's something you should ask about on support channels.

Yes, that little to do with this ticket. I just want to tell Carlton Gibson that he has given the wrong solution.

Note: See TracTickets for help on using tickets.
Back to Top