Opened 21 months ago
Last modified 9 days ago
#35440 assigned Cleanup/optimization
Update parse_header_parameters to leverage the parsing logic from (stdlib) email Message.
| Reported by: | Natalia Bidart | Owned by: | Pravin |
|---|---|---|---|
| Component: | HTTP handling | Version: | dev |
| Severity: | Normal | Keywords: | |
| Cc: | Carlton Gibson, Adam Johnson, Shai Berger | Triage Stage: | Accepted |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Following a security report, which was concluded not to be such, the Security Team agreed to apply a few improvements to the parse_header_parameters function from django/utils/http.py.
This function was historically using Python's (now deprecated) cgi module, but in 34e2148fc725e7200050f74130d7523e3cd8507a the code was changed by porting the cgi implementations into the file to avoid the deprecation (this was in the context of #33173). Later on the header parsing logic was cleaned up in d4d5427571b4bf3a21c902276c2a00215c2a37cc to unify the logic with the one used in the multipartparser.py module (see #33697).
At the time of the cgi deprecation, the PEP including the deprecation mentions:
Replacements for the various parts of cgi which are not directly related to executing code are: [ ... ] parse_header with email.message.Message (see example below)
Providing an explicit example of how close parse_header and email.message.Message are:
>>> from cgi import parse_header >>> from email.message import Message >>> parse_header(h) ('application/json', {'charset': 'utf8'}) >>> m = Message() >>> m['content-type'] = h >>> m.get_params() [('application/json', ''), ('charset', 'utf8')] >>> m.get_param('charset')
The goal of this ticket is to track the improvement of the current parse_header_parameters implementation by leveraging the logic from email.message.Message.
The Security Team also agreed that it's worth adding some early checks in the parse_header_parameters function to limit the amount of provided semicolons. This would require some investigation as to what would be a good threshold, considering that it's likely that more than one semicolon may not be necessary in valid HTTP headers.
Attachments (2)
Change History (25)
comment:1 by , 21 months ago
| Description: | modified (diff) |
|---|
comment:2 by , 21 months ago
| Triage Stage: | Unreviewed → Accepted |
|---|
comment:3 by , 18 months ago
| Owner: | changed from to |
|---|---|
| Status: | new → assigned |
comment:6 by , 18 months ago
| Has patch: | set |
|---|
comment:7 by , 18 months ago
| Patch needs improvement: | set |
|---|
Setting as patch needs improvements following my comments in the PR (the goal of this ticket is, ideally, to remove/replace our custom parsing logic with Python stdlib's Message methods).
comment:8 by , 18 months ago
| Patch needs improvement: | unset |
|---|
I've proposed a fix in the PR, but have clarifying questions. In case this task is not on the review radar, I change the status, but am ready to make further modifications.
comment:9 by , 14 months ago
| Patch needs improvement: | set |
|---|
comment:10 by , 12 months ago
| Patch needs improvement: | unset |
|---|
comment:11 by , 11 months ago
| Cc: | added |
|---|---|
| Patch needs improvement: | set |
comment:12 by , 10 months ago
| Patch needs improvement: | unset |
|---|---|
| Triage Stage: | Accepted → Ready for checkin |
comment:15 by , 4 months ago
| Has patch: | unset |
|---|---|
| Resolution: | fixed |
| Status: | closed → new |
| Triage Stage: | Ready for checkin → Accepted |
comment:16 by , 4 months ago
| Owner: | removed |
|---|---|
| Status: | new → assigned |
comment:17 by , 4 months ago
| Status: | assigned → new |
|---|
comment:18 by , 4 months ago
Just to say, I think this is the right decision here. 👏 Even if we remerge the same thing later, having time to measure twice seems right to me. Thanks for all the consideration that's gone into it.
comment:19 by , 3 weeks ago
I ran into this ticket while working on Accept header parsing. I agree that moving toward a stdlib-based solution is the right long-term direction, particularly given the previous performance concerns.
While complex quoted parameters in Accept headers appear to be very rare in real-world traffic, incorrect handling can still lead to broken parsing for otherwise valid headers, so it seems important that the eventual approach handles these cases correctly.
comment:20 by , 2 weeks ago
I tried using a simple string split for normal headers, and only uses email.message when it sees complex symbols like " or *.
def _parse_header_params(line):
"""The 'Slower' logic."""
m = Message()
m["content-type"] = line
params = m.get_params()
if not params:
return "", {}
pdict = {}
key = params.pop(0)[0].lower()
for name, value in params:
if not name:
continue
if isinstance(value, tuple):
value = collapse_rfc2231_value(value)
pdict[name] = value
return key, pdict
def parse_header_parameters(line, max_length=MAX_HEADER_LENGTH):
"""
Parse a Content-type like header.
Return the main content-type and a dictionary of options.
If `line` is longer than `max_length`, `ValueError` is raised.
"""
if not line:
return "", {}
if max_length is not None and len(line) > max_length:
raise ValueError("Unable to parse header parameters (value too long).")
if ";" not in line:
return line.lower().strip(), {}
if '"' not in line and "*" not in line:
parts = line.split(";")
key = parts[0].lower().strip()
pdict = {}
for p in parts[1:]:
if "=" in p:
name, value = p.split("=", 1)
pdict[name.lower().strip()] = value.strip()
return key, pdict
return _parse_header_params(line)
I ran a benchmark with 50,000 iterations. Here are my results:
Test Case Existing New implementation Simple (no params) 0.050s 0.016s Standard (charset) 0.109s 0.044s Complex (quotes) 0.214s 0.681s
for me this make sense , It will rarely happened that there would be a complex case for parsing.
comment:21 by , 11 days ago
| Owner: | set to |
|---|---|
| Status: | new → assigned |
comment:22 by , 9 days ago
| Has patch: | set |
|---|
comment:23 by , 9 days ago
| Patch needs improvement: | set |
|---|
by , 9 days ago
comment:24 by , 9 days ago
my previous suggestion was failing for one test case.
So I wrote new simple patch as below modifying the existing one as below in the diff
attachment:diff.txt
and i ran 9 to 10 times attachment:bench_mark.py
50,000 iterations for each. and following result i got on average. I used linux i3 8gb 5 year old laptop may be that's why i'm getting fluctuations in the results.
Complex test case (which triggers the fallback), results varied between -73% and +42% across runs.
Test Case Original Avg Time Optimized Avg Time Improvement (Avg) Simple (No Params) ~0.047s ~0.012s ~73% Standard (Charset) ~0.110s ~0.045s ~59% Multi-Param ~0.190s ~0.065s ~65% Complex (RFC 2231) ~0.650s ~0.530s Variable*
Below are the sampled results when i ran attachment:bench_mark.py for 15 consecutive times:
Run | Simple | Standard | Multi-P | Complex | ---------------------------------------------- 1 | 70.26% | 63.36% | 65.50% | -2.99% | 2 | 66.09% | 57.83% | 63.27% | -34.87% | 3 | 73.31% | 57.90% | 59.09% | -17.48% | 4 | 78.64% | 58.98% | 65.31% | 36.69% | 5 | 74.02% | 59.89% | 60.66% | 39.80% | 6 | 74.90% | 58.15% | 53.38% | 23.74% | 7 | 74.85% | 60.53% | 65.36% | 33.77% | 8 | 71.63% | 53.78% | 54.52% | 28.88% | 9 | 72.75% | 53.04% | 65.91% | 33.18% | 10 | 71.69% | 60.46% | 64.59% | -14.10% | 11 | 75.16% | 61.84% | 67.22% | 36.87% | 12 | 74.15% | 58.61% | 64.92% | -73.24% | 13 | 72.17% | 56.63% | 63.03% | 33.01% | 14 | 73.14% | 47.98% | 64.39% | 24.96% | 15 | 73.54% | 57.26% | 64.47% | 25.37% |
comment:25 by , 9 days ago
| Patch needs improvement: | unset |
|---|
PR