#36777 closed Bug (invalid)
Exception raised when accessing files with UTF-8 characters in filename on debian/Apache
| Reported by: | Caram | Owned by: | |
|---|---|---|---|
| Component: | Uncategorized | Version: | 6.0 |
| Severity: | Normal | Keywords: | |
| Cc: | | Triage Stage: | Unreviewed |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
Unicode Filename Handling Issues in Django under Apache/WSGI
Environment
- Django Version: 5.2/6.0
- Python Version: 3.12
- Web Server: Apache 2.4.65 with mod_wsgi 5.0.0
- OS: Debian Linux
- Database: MySQL with utf8mb3_general_ci collation
Problem Description
Files with Unicode characters in their filenames (e.g., Note d'information Gestion des récupérations.pdf) fail under Apache/WSGI in two ways:
- File size displays as "0 bytes" when using {{ attachment.file.size|filesizeformat }}
- File downloads return HTTP 404 errors
Both operations work correctly under Django's runserver but fail in production under Apache/WSGI.
Root Cause Analysis
1. ASCII Encoding Default
Apache/WSGI defaults to ASCII encoding for standard streams, unlike runserver which uses UTF-8.
2. FileField.size Property Failure
The FileField.size property attempts to access file metadata using the default ASCII codec, which fails for non-ASCII characters in paths:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 88: ordinal not in range(128)
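The same class of error can be reproduced outside Apache by encoding a path explicitly with the ASCII codec. This is only a sketch: the path below is illustrative and is not the one from the reported traceback.

```python
path = "/srv/media/récupérations.pdf"  # illustrative path, not from the report

try:
    path.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character the codec could not encode
    failed_char = path[exc.start]
    message = str(exc)
```

Here failed_char is "é" and the message ends with "ordinal not in range(128)", matching the shape of the error above.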
3. UTF-8 Mojibake
File paths from the database (stored as UTF-8) get incorrectly interpreted as Latin-1 by Apache/WSGI. For example:
- Actual filename: récupérations.pdf
- In database: UTF-8 bytes \xc3\xa9 (correct encoding of "é")
- Received by Django: String r\xc3\xa9cup\xc3\xa9rations (UTF-8 bytes misinterpreted as Latin-1 characters)
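The round trip described above can be sketched in plain Python. The helper name fix_mojibake is hypothetical and does not appear in the report. It leaves a string untouched when the Latin-1/UTF-8 reversal does not apply.

```python
original = "récupérations.pdf"

# UTF-8 bytes wrongly decoded as Latin-1 produce mojibake.
mangled = original.encode("utf-8").decode("latin-1")

# Reversing the wrong step recovers the original string.
repaired = mangled.encode("latin-1").decode("utf-8")


def fix_mojibake(value: str) -> str:
    """Undo a Latin-1 misdecoding of UTF-8 bytes; return value unchanged otherwise."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either not representable in Latin-1 or not valid UTF-8: not mojibake.
        return value
```

One caveat of this approach: a correctly decoded string whose Latin-1 bytes happen to form valid UTF-8 would also be rewritten, so the reversal is a heuristic rather than a safe general fix.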
4. Filesystem Operations
os.path.exists(), os.path.getsize(), and open() fail when Python tries to encode strings using the default ASCII codec.
Workaround Overview
The workaround requires three components:
1. Custom filesize Template Filter
Replace {{ attachment.file.size|filesizeformat }} with a custom filter that:
- Fixes UTF-8 mojibake by re-encoding: path.encode('latin-1').decode('utf-8')
- Uses explicit UTF-8 byte paths: path.encode('utf-8')
- Performs filesystem operations with byte strings to bypass the ASCII codec
Usage:
{{ attachment.file.path|filesize|filesizeformat }}
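A minimal sketch of the logic such a filter might contain, written as a plain function so it runs without Django. The report does not include the actual filter, so the body and the fallback of returning 0 are assumptions. In a project it would presumably be registered through a django.template.Library instance.

```python
import os


def filesize(path: str) -> int:
    """Return the size in bytes of the file at path, or 0 if it cannot be read."""
    try:
        # Undo UTF-8 bytes that were misdecoded as Latin-1, if that is what happened.
        path = path.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # not mojibake; use the path as given
    try:
        # A bytes path is passed to the OS unchanged, bypassing the default codec.
        return os.path.getsize(path.encode("utf-8"))
    except OSError:
        return 0
```

Returning 0 on failure keeps the template rendering, at the cost of hiding missing files. Raising or logging may be preferable in practice.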
2. Custom File Serving View
Replace django.views.static.serve with a Unicode-aware version (serve_unicode) that:
- Fixes UTF-8 mojibake in incoming URL paths
- Converts paths to UTF-8 bytes before filesystem operations
- Opens files using byte paths: open(fullpath_bytes, 'rb')
- Maintains security checks for path traversal
- Handles HTTP caching headers properly
URL Configuration:
re_path(r'^%s(?P<path>.*)$' % re.escape(settings.MEDIA_URL.lstrip('/')),
        serve_unicode,
        {'document_root': settings.MEDIA_ROOT})
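The report does not show serve_unicode itself. As a hedged sketch of only the path-resolution step it describes (UTF-8 byte paths plus a traversal check), the hypothetical helper below maps a relative URL path to a byte path under the document root. It is not Django's own implementation, which uses its own safe-join logic in django.views.static.serve.

```python
import os


def resolve_media_path(document_root: str, rel_path: str) -> bytes:
    """Map rel_path to a UTF-8 byte path under document_root, or raise ValueError."""
    root = os.path.realpath(document_root.encode("utf-8"))
    candidate = os.path.realpath(
        os.path.join(root, rel_path.lstrip("/").encode("utf-8"))
    )
    # Reject anything that resolves outside the document root (path traversal).
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError("path escapes document root")
    return candidate
```

A view built on this would then open the returned byte path with open(fullpath_bytes, 'rb'), as listed above.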
3. URL Encoding Filter (Optional)
Add urlencode_path filter to properly encode URLs for href attributes:
- Decodes existing encoding to avoid double-encoding
- Re-encodes with proper UTF-8 percent-encoding
- Handles special characters (apostrophes, spaces, accented characters)
Usage:
<a href="{{ attachment.file.url|urlencode_path }}?filename={{ attachment.friendly_name|urlencode }}">
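One way the described decode-then-encode behaviour could look, using only urllib.parse from the standard library. The function body is an assumption, since the report names the filter but does not show it.

```python
from urllib.parse import quote, unquote


def urlencode_path(url: str) -> str:
    """Percent-encode a URL path as UTF-8 without double-encoding it."""
    # Decode any existing percent-escapes first, then encode exactly once.
    # "/" stays unescaped so the path segments are preserved.
    return quote(unquote(url), safe="/")
```

For example, urlencode_path("/media/Note d'information.pdf") gives "/media/Note%20d%27information.pdf", and an already encoded "/media/caf%C3%A9.pdf" is returned unchanged. A filename containing a literal "%" followed by two hex digits would be misread as an escape, which is a limitation of this approach.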
Key Techniques
1. Mojibake Fix
Convert UTF-8 bytes incorrectly decoded as Latin-1 back to proper UTF-8:
path = path.encode('latin-1').decode('utf-8')
2. Byte Paths for Filesystem Operations
Always use byte strings for filesystem access:
path_bytes = path.encode('utf-8')
if os.path.exists(path_bytes):
    size = os.path.getsize(path_bytes)
    with open(path_bytes, 'rb') as f:
        # ...
3. Explicit UTF-8 Encoding
Never rely on default encoding (os.fsencode() uses ASCII in Apache/WSGI). Always specify UTF-8 explicitly: path.encode('utf-8')
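Which codec os.fsencode() uses depends on the locale the Apache process runs under, so it can help to check what Python actually sees. The sketch below is a diagnostic that could be logged from inside the WSGI process. The function name encoding_report is hypothetical.

```python
import locale
import os
import sys


def encoding_report() -> dict:
    """Collect the encoding-related settings visible to this Python process."""
    return {
        # Codec used by os.fsencode() and by str paths passed to the OS
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Locale-derived default for text I/O
        "preferred_encoding": locale.getpreferredencoding(False),
        # Environment variables that commonly determine the locale
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }
```

If filesystem_encoding reports "ascii" under Apache but "utf-8" under runserver, the difference lies in the process environment rather than in Django.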
Testing Checklist
Test with filenames containing:
- Accented characters: café.pdf
- Apostrophes: Note d'information.pdf
- Multiple Unicode characters: récupérations.pdf
- Spaces and apostrophes: Note d'information Gestion des récupérations.pdf
- Non-Latin scripts: 文档.pdf
- Mixed characters: rapport_année_2024.pdf
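The checklist can be exercised with a small script. This is a sketch assuming a filesystem that accepts UTF-8 file names: it writes each name into a temporary directory using byte paths and reads the size back. The function name check_filenames is hypothetical.

```python
import os
import tempfile

FILENAMES = [
    "café.pdf",
    "Note d'information.pdf",
    "récupérations.pdf",
    "Note d'information Gestion des récupérations.pdf",
    "文档.pdf",
    "rapport_année_2024.pdf",
]


def check_filenames(names):
    """Write and stat each name via UTF-8 byte paths; return the names that succeed."""
    ok = []
    with tempfile.TemporaryDirectory() as tmp:
        root = tmp.encode("utf-8")
        for name in names:
            path = os.path.join(root, name.encode("utf-8"))
            with open(path, "wb") as f:
                f.write(b"x")
            if os.path.getsize(path) == 1:
                ok.append(name)
    return ok
```

Running check_filenames(FILENAMES) inside the WSGI process and under runserver gives a quick comparison of the two environments.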
Related Issues
This addresses the common Apache/WSGI Unicode problem where:
- A UnicodeEncodeError is raised: 'ascii' codec can't encode character
- File operations work in development (runserver) but fail in production (Apache/WSGI)
- Database stores UTF-8 correctly but Apache/WSGI mangles the encoding
Attachments (2)
Change History (7)
by , 4 weeks ago
by , 4 weeks ago
comment:1 by , 4 weeks ago
| Resolution: | → invalid |
|---|---|
| Status: | new → closed |
comment:2 by , 4 weeks ago
| Resolution: | invalid |
|---|---|
| Status: | closed → new |
Thanks Simon. Could you please have a closer look before you close the ticket? It's a real issue: it crashed my production server after moving to Django 6.0, and I spent 3 hours yesterday debugging and fixing it, so I think it deserves a little more consideration.
I realise the ticket description may be suboptimal, and I'm sorry if it caused any irritation. But closing tickets maybe a little hastily doesn't quite send the kind of positive message that we would like to send the community when they are reporting or fixing bugs, and I feel that the board would agree with me.
Again, please have a closer look and let me know if you need any additional information. I have a quite detailed trace of the debugging work I performed yesterday.
comment:3 by , 4 weeks ago
| Resolution: | → invalid |
|---|---|
| Status: | new → closed |
I appreciate you're feeling frustrated, but please don't reopen tickets without bringing new information to light. Same-day triage isn't hasty: firsthand experience in an area can lead to more efficient triage. We're not stubborn; we change our minds when new information compels it.
comment:4 by , 4 weeks ago
Some extra notes you can find if you git-blame the origin of the documentation linked above:
- #17686 (25b912abbe31fa440e702b5273c18cf74e2d6e0b), which I started watching 14 years ago when I ran into a similar problem
- Some extra documentation on diagnosing misconfiguration that leads to mishandling of Unicode file names in file uploads
- Similar report on Apache2, Ubuntu, mod_wsgi on the forum, resolved by following the documentation
comment:5 by , 3 weeks ago
https://forum.djangoproject.com/t/unicodeencodeerror-ubuntu-apache2-admin-page/29351/2
Excellent, thanks Simon, this was a lifesaver. It all works perfectly now from Apache.
Please refer to the "How to use Django with Apache and mod_wsgi" documentation on the subject and avoid using an LLM to generate overly verbose reports.
Your mention of
Clearly point out you didn't refer to the documentation on the subject.