Opened 14 years ago
Closed 11 years ago
#15529 closed Bug (fixed)
GeoJSON regexes doesn't accept some characters within a quoted string
Reported by: | Wouter Klein Heerenbrink | Owned by: | jbronn |
---|---|---|---|
Component: | GIS | Version: | 1.2 |
Severity: | Normal | Keywords: | gis geojson gdal json_regex |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The json_regex
that is used by GeoJSON to determine if some string is in JSON-format, does not accept some of the allowed characters within a quotes string, for instance brackets: '(', ')', '<', '>'. The result is that the following perfect JSON file is not being accepted:
{ "abc": "123 (bug)", "abc": "123 <bug>" }
The problem is in the json_regex
definition in contrib.gis.geometry.regex where json_regex
is defined as
json_regex = re.compile(r'^(\s+)?\{[\s\w,\[\]\{\}\-\."\':]+\}(\s+)?$')
The first solution to the problem is to change the regex in such a way that it accepts any character within a quotes string.
json_regex = re.compile(r'^(\s+)?\{([\s\w,\[\]\{\}\-\.:]|(\"[^\"]*\"))+\}(\s+)?$)
I think though, that the regex is too complicated for its purpose. As far as I can see, this regex is only to determine 'roughly' if something is in fact JSON or not (eg. it is done in django.contrib.gis.gdal.geometries). Real validation is done by other methods, it really only checks if it somewhat looks like json.
Maybe even more important; the current json_regex does not provide us with any reliable information about wether the JSON is valid or not.
{ } 'reversed': 'i am' {, [[[ 'i am invalid json }}} }
This is not a problem because we only want to know if it looks like JSON before we start parsing the file into detail. But I think the following, less complicated, piece of code would do:
json_regex = re.compile(r'^(\s+)?\{.*\}(\s+)?$)
It just checks if the file-content starts with a { and ends with a } (excluding whitespace). It seems like not much of a check, but it is probably the most you can check using regex. Regex is not really designed to handle (infinite) recursion.
Last but not least; the the regex won't match any of the other formats that are described within GIS (HEX and WKT).
Attachments (2)
Change History (10)
by , 14 years ago
Attachment: | regex.diff added |
---|
comment:1 by , 14 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 14 years ago
milestone: | → 1.4 |
---|---|
Owner: | changed from | to
Status: | new → assigned |
Blah. Probably have to 1.4, when we use something obviously more robust than regexes. I'm trying to reduce the attack surface area of a C library that does not have robust checking of string input.
Bottom line, GDAL shouldn't be the sole source of GeoJSON output and input.
comment:3 by , 14 years ago
Type: | → Bug |
---|
comment:4 by , 14 years ago
Needs tests: | set |
---|---|
Severity: | → Normal |
comment:5 by , 13 years ago
Easy pickings: | unset |
---|---|
UI/UX: | unset |
What about actually attempting to decode the JSON (json.loads) and catch the ValueError when it is not valid?
def is_json(geom_input): try: json.loads(geom_input) return True except ValueError: return Fals
The major downside is that we are loading a (potentially large) GeoJSON object that will never be used. The regex is fast, but has corner cases.
comment:7 by , 12 years ago
Needs tests: | unset |
---|
Just uploaded a patch with a test, if we want to keep going the regex route.
comment:8 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Added the DOTALL operator; Python excludes newlines for the DOT operator