Opened 13 years ago

Closed 10 years ago

#15529 closed Bug (fixed)

GeoJSON regexes doesn't accept some characters within a quoted string

Reported by: Wouter Klein Heerenbrink Owned by: jbronn
Component: GIS Version: 1.2
Severity: Normal Keywords: gis geojson gdal json_regex
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The json_regex that is used by GeoJSON to determine if some string is in JSON-format, does not accept some of the allowed characters within a quotes string, for instance brackets: '(', ')', '<', '>'. The result is that the following perfect JSON file is not being accepted:

{
  "abc": "123 (bug)",
  "abc": "123 <bug>"
}

The problem is in the json_regex definition in contrib.gis.geometry.regex where json_regex is defined as

json_regex = re.compile(r'^(\s+)?\{[\s\w,\[\]\{\}\-\."\':]+\}(\s+)?$')

The first solution to the problem is to change the regex in such a way that it accepts any character within a quotes string.

json_regex = re.compile(r'^(\s+)?\{([\s\w,\[\]\{\}\-\.:]|(\"[^\"]*\"))+\}(\s+)?$)

I think though, that the regex is too complicated for its purpose. As far as I can see, this regex is only to determine 'roughly' if something is in fact JSON or not (eg. it is done in django.contrib.gis.gdal.geometries). Real validation is done by other methods, it really only checks if it somewhat looks like json.

Maybe even more important; the current json_regex does not provide us with any reliable information about wether the JSON is valid or not.

 {
   } 'reversed': 'i am' {,
   [[[ 'i am invalid json }}}
 }

This is not a problem because we only want to know if it looks like JSON before we start parsing the file into detail. But I think the following, less complicated, piece of code would do:

json_regex = re.compile(r'^(\s+)?\{.*\}(\s+)?$)

It just checks if the file-content starts with a { and ends with a } (excluding whitespace). It seems like not much of a check, but it is probably the most you can check using regex. Regex is not really designed to handle (infinite) recursion.

Last but not least; the the regex won't match any of the other formats that are described within GIS (HEX and WKT).

Attachments (2)

regex.diff (627 bytes ) - added by Wouter Klein Heerenbrink 13 years ago.
Added the DOTALL operator; Python excludes newlines for the DOT operator
15529-2.diff (1.4 KB ) - added by Claude Paroz 12 years ago.
Same patch with test

Download all attachments as: .zip

Change History (10)

by Wouter Klein Heerenbrink, 13 years ago

Attachment: regex.diff added

Added the DOTALL operator; Python excludes newlines for the DOT operator

comment:1 by Russell Keith-Magee, 13 years ago

Triage Stage: UnreviewedAccepted

comment:2 by jbronn, 13 years ago

milestone: 1.4
Owner: changed from Wouter Klein Heerenbrink to jbronn
Status: newassigned

Blah. Probably have to 1.4, when we use something obviously more robust than regexes. I'm trying to reduce the attack surface area of a C library that does not have robust checking of string input.

Bottom line, GDAL shouldn't be the sole source of GeoJSON output and input.

Version 0, edited 13 years ago by jbronn (next)

comment:3 by Luke Plant, 13 years ago

Type: Bug

comment:4 by Julien Phalip, 13 years ago

Needs tests: set
Severity: Normal

comment:5 by John Paulett, 13 years ago

Easy pickings: unset
UI/UX: unset

What about actually attempting to decode the JSON (json.loads) and catch the ValueError when it is not valid?

def is_json(geom_input):
    try:
        json.loads(geom_input)
        return True
    except ValueError:
        return Fals

The major downside is that we are loading a (potentially large) GeoJSON object that will never be used. The regex is fast, but has corner cases.

comment:6 by Jacob, 13 years ago

milestone: 1.4

Milestone 1.4 deleted

by Claude Paroz, 12 years ago

Attachment: 15529-2.diff added

Same patch with test

comment:7 by Claude Paroz, 12 years ago

Needs tests: unset

Just uploaded a patch with a test, if we want to keep going the regex route.

comment:8 by Claude Paroz <claude@…>, 10 years ago

Resolution: fixed
Status: assignedclosed

In c64efe37345cc632e24beb8c7e2a81ef5d3713ba:

Fixed #15529 -- More permissive geojson syntax in constructor

Thanks Wouter Klein Heerenbrink for the report.

Note: See TracTickets for help on using tickets.
Back to Top