Opened 5 years ago

Closed 3 years ago

#15529 closed Bug (fixed)

GeoJSON regexes doesn't accept some characters within a quoted string

Reported by: woutkh Owned by: jbronn
Component: GIS Version: 1.2
Severity: Normal Keywords: gis geojson gdal json_regex
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no


The json_regex that is used by GeoJSON to determine if some string is in JSON-format, does not accept some of the allowed characters within a quotes string, for instance brackets: '(', ')', '<', '>'. The result is that the following perfect JSON file is not being accepted:

  "abc": "123 (bug)",
  "abc": "123 <bug>"

The problem is in the json_regex definition in contrib.gis.geometry.regex where json_regex is defined as

json_regex = re.compile(r'^(\s+)?\{[\s\w,\[\]\{\}\-\."\':]+\}(\s+)?$')

The first solution to the problem is to change the regex in such a way that it accepts any character within a quotes string.

json_regex = re.compile(r'^(\s+)?\{([\s\w,\[\]\{\}\-\.:]|(\"[^\"]*\"))+\}(\s+)?$)

I think though, that the regex is too complicated for its purpose. As far as I can see, this regex is only to determine 'roughly' if something is in fact JSON or not (eg. it is done in django.contrib.gis.gdal.geometries). Real validation is done by other methods, it really only checks if it somewhat looks like json.

Maybe even more important; the current json_regex does not provide us with any reliable information about wether the JSON is valid or not.

   } 'reversed': 'i am' {,
   [[[ 'i am invalid json }}}

This is not a problem because we only want to know if it looks like JSON before we start parsing the file into detail. But I think the following, less complicated, piece of code would do:

json_regex = re.compile(r'^(\s+)?\{.*\}(\s+)?$)

It just checks if the file-content starts with a { and ends with a } (excluding whitespace). It seems like not much of a check, but it is probably the most you can check using regex. Regex is not really designed to handle (infinite) recursion.

Last but not least; the the regex won't match any of the other formats that are described within GIS (HEX and WKT).

Attachments (2)

regex.diff (627 bytes) - added by woutkh 5 years ago.
Added the DOTALL operator; Python excludes newlines for the DOT operator
15529-2.diff (1.4 KB) - added by claudep 4 years ago.
Same patch with test

Download all attachments as: .zip

Change History (10)

Changed 5 years ago by woutkh

Added the DOTALL operator; Python excludes newlines for the DOT operator

comment:1 Changed 5 years ago by russellm

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

comment:2 Changed 5 years ago by jbronn

  • milestone set to 1.4
  • Owner changed from Wouter Klein Heerenbrink to jbronn
  • Status changed from new to assigned

Blah. Probably have to wait until 1.4, when we use something obviously more robust than regexes. I'm trying to reduce the attack surface area of a C library that does not have robust checking of string input.

Bottom line, GDAL shouldn't be the sole source of GeoJSON output and input.

Last edited 5 years ago by jbronn (previous) (diff)

comment:3 Changed 5 years ago by lukeplant

  • Type set to Bug

comment:4 Changed 5 years ago by julien

  • Needs tests set
  • Severity set to Normal

comment:5 Changed 5 years ago by jpaulett

  • Easy pickings unset
  • UI/UX unset

What about actually attempting to decode the JSON (json.loads) and catch the ValueError when it is not valid?

def is_json(geom_input):
        return True
    except ValueError:
        return Fals

The major downside is that we are loading a (potentially large) GeoJSON object that will never be used. The regex is fast, but has corner cases.

comment:6 Changed 5 years ago by jacob

  • milestone 1.4 deleted

Milestone 1.4 deleted

Changed 4 years ago by claudep

Same patch with test

comment:7 Changed 4 years ago by claudep

  • Needs tests unset

Just uploaded a patch with a test, if we want to keep going the regex route.

comment:8 Changed 3 years ago by Claude Paroz <claude@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In c64efe37345cc632e24beb8c7e2a81ef5d3713ba:

Fixed #15529 -- More permissive geojson syntax in constructor

Thanks Wouter Klein Heerenbrink for the report.

Note: See TracTickets for help on using tickets.
Back to Top