Opened 4 weeks ago

Closed 4 weeks ago

Last modified 3 weeks ago

#36777 closed Bug (invalid)

Exception raised when accessing files with UTF-8 characters in filename on debian/Apache

Reported by: Caram
Owned by:
Component: Uncategorized
Version: 6.0
Severity: Normal
Keywords:
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Unicode Filename Handling Issues in Django under Apache/WSGI

Environment

  • Django Version: 5.2/6.0
  • Python Version: 3.12
  • Web Server: Apache 2.4.65 with mod_wsgi 5.0.0
  • OS: Debian Linux
  • Database: MySQL with utf8mb3_general_ci collation

Problem Description

Files with Unicode characters in their filenames (e.g., Note d'information Gestion des récupérations.pdf) fail under Apache/WSGI in two ways:

  1. File size displays as "0 bytes" when using {{ attachment.file.size|filesizeformat }}
  2. File downloads return HTTP 404 errors

Both operations work correctly under Django's runserver but fail in production under Apache/WSGI.

Root Cause Analysis

1. ASCII Encoding Default

Apache/WSGI defaults to ASCII encoding for standard streams, unlike runserver which uses UTF-8.

2. FileField.size Property Failure

The FileField.size property attempts to access file metadata using the default ASCII codec, which fails for non-ASCII characters in paths:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 88: ordinal not in range(128)

3. UTF-8 Mojibake

File paths from the database (stored as UTF-8) get incorrectly interpreted as Latin-1 by Apache/WSGI. For example:

  • Actual filename: récupérations.pdf
  • In database: UTF-8 bytes \xc3\xa9 (correct encoding of "é")
  • Received by Django: String r\xc3\xa9cup\xc3\xa9rations (UTF-8 bytes misinterpreted as Latin-1 characters)
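
The round trip described in these bullets can be reproduced in isolation. The following is only an illustrative sketch: the filename is the one from this report, and nothing here touches Apache or Django.

```python
# Illustrative only: reproduce and then reverse the mojibake described above.
original = "récupérations.pdf"

# UTF-8 bytes wrongly decoded as Latin-1 yield the mangled string.
mangled = original.encode("utf-8").decode("latin-1")

# Applying the inverse steps recovers the original text.
repaired = mangled.encode("latin-1").decode("utf-8")

# ascii() keeps the output printable even on an ASCII-only stdout.
print(ascii(mangled))
print(repaired == original)
```

The mangled value is the same string shown in the "Received by Django" bullet, with each "é" replaced by the two characters \xc3 and \xa9.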

4. Filesystem Operations

os.path.exists(), os.path.getsize(), and open() fail when Python tries to encode strings using the default ASCII codec.
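
As a minimal sketch of the failure mode, the same error can be triggered without any filesystem access by encoding a path with the ASCII codec by hand. The path below is hypothetical and chosen only for illustration.

```python
# Hypothetical path, used only to trigger the codec error.
path = "/srv/media/récupérations.pdf"

try:
    # This mirrors what happens when a str path is encoded with ASCII.
    path.encode("ascii")
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)
else:
    failed = False
    message = ""

# Encoding explicitly as UTF-8 succeeds for the same path.
utf8_path = path.encode("utf-8")

print(failed)
print(message)
```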

Workaround Overview

The workaround requires three components:

1. Custom filesize Template Filter

Replace {{ attachment.file.size|filesizeformat }} with a custom filter that:

  • Fixes UTF-8 mojibake by re-encoding: path.encode('latin-1').decode('utf-8')
  • Uses explicit UTF-8 byte paths: path.encode('utf-8')
  • Performs filesystem operations with byte strings to bypass ASCII codec

Usage:

    {{ attachment.file.path|filesize|filesizeformat }}
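
A minimal sketch of what such a filter might look like is given below. It is not the attached tags.py; the function body is an assumption built from the three bullets above, written as a plain function so it runs without Django. In a template tag module it could presumably be registered with `register.filter("filesize", filesize)` on a `django.template.Library()` instance.

```python
import os


def filesize(path):
    """Return the size in bytes of the file at path, or 0 if it cannot be read."""
    if not path:
        return 0
    try:
        # Undo UTF-8 bytes that were decoded as Latin-1 (mojibake).
        path = path.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not mojibake (or not representable in Latin-1): keep the path as is.
        pass
    try:
        # Use an explicit UTF-8 byte path so no default codec is involved.
        return os.path.getsize(path.encode("utf-8"))
    except (OSError, ValueError):
        # Missing file, permission problem, or a path that cannot be encoded.
        return 0
```

One caveat of this approach: a genuine filename whose Latin-1 bytes happen to form valid UTF-8 would be rewritten by the repair step, so the heuristic is not safe in general.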

2. Custom File Serving View

Replace django.views.static.serve with a Unicode-aware version (serve_unicode) that:

  • Fixes UTF-8 mojibake in incoming URL paths
  • Converts paths to UTF-8 bytes before filesystem operations
  • Opens files using byte paths: open(fullpath_bytes, 'rb')
  • Maintains security checks for path traversal
  • Handles HTTP caching headers properly

URL Configuration:

    re_path(r'^%s(?P<path>.*)$' % re.escape(settings.MEDIA_URL.lstrip('/')), 
            serve_unicode, 
            {'document_root': settings.MEDIA_ROOT})
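
The attached views.py is not reproduced here. As a rough sketch of the path-handling part only, the helper below (a hypothetical name, `resolve_media_path`) maps a request path to a UTF-8 byte path and rejects anything that escapes the document root. It assumes a POSIX filesystem and a path that has already been percent-decoded, and it leaves out the mojibake repair and the HTTP caching logic.

```python
import os
import posixpath


def resolve_media_path(document_root, url_path):
    """Return a byte path under document_root, or None if url_path escapes it."""
    # Collapse "." and ".." segments, then make the path relative.
    relative = posixpath.normpath(url_path).lstrip("/")

    # Do every filesystem call on UTF-8 byte paths.
    root = os.path.realpath(document_root.encode("utf-8"))
    full = os.path.realpath(os.path.join(root, relative.encode("utf-8")))

    # Path traversal check: the resolved path must stay inside the root.
    if os.path.commonpath([root, full]) != root:
        return None
    return full
```

A view built on this would still need to check `os.path.isfile()` on the returned byte path before opening it with `open(fullpath_bytes, 'rb')`.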

3. URL Encoding Filter (Optional)

Add a urlencode_path filter to properly encode URLs for href attributes:

  • Decodes existing encoding to avoid double-encoding
  • Re-encodes with proper UTF-8 percent-encoding
  • Handles special characters (apostrophes, spaces, accented characters)

Usage:

    <a href="{{ attachment.file.url|urlencode_path }}?filename={{ attachment.friendly_name|urlencode }}">
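
One way the decode-then-re-encode behaviour could be written is sketched below, using only `urllib.parse` from the standard library. The body is an assumption based on the bullets above, not the code from the attached tags.py.

```python
from urllib.parse import quote, unquote


def urlencode_path(path):
    """Percent-encode a URL path as UTF-8 without double-encoding it."""
    # Decode any existing percent-escapes first, then encode everything
    # except "/" so that path separators are preserved.
    return quote(unquote(path), safe="/")


print(urlencode_path("/media/Note d'information.pdf"))
```

Because the input is decoded first, applying the function twice gives the same result as applying it once. The trade-off is that a filename containing a literal "%" followed by two hex digits would be misread as an existing escape.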

Key Techniques

1. Mojibake Fix

Convert UTF-8 bytes incorrectly decoded as Latin-1 back to proper UTF-8:

    path = path.encode('latin-1').decode('utf-8')

2. Byte Paths for Filesystem Operations

Always use byte strings for filesystem access:

    import os

    # Encode once, then pass the byte path to every filesystem call.
    path_bytes = path.encode('utf-8')
    if os.path.exists(path_bytes):
        size = os.path.getsize(path_bytes)
        with open(path_bytes, 'rb') as f:
            ...  # read from f here

3. Explicit UTF-8 Encoding

Never rely on the default encoding: os.fsencode() uses the filesystem encoding, which resolves to ASCII under Apache/WSGI when no UTF-8 locale is configured. Always specify UTF-8 explicitly: path.encode('utf-8')
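
To see which encodings a given process actually ends up with, a small diagnostic along these lines may help. The function name `encoding_report` is made up for this sketch; the calls themselves are standard library. Running it both under runserver and from inside the WSGI application (for example by logging the result) would show whether the two environments differ.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python is using in the current process."""
    return {
        # Codec used to convert str paths to bytes for OS calls.
        "filesystem": sys.getfilesystemencoding(),
        # Locale-derived default for text I/O.
        "preferred": locale.getpreferredencoding(False),
        # Encoding of standard output, if it has one.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```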

Testing Checklist

Test with filenames containing:

  • Accented characters: café.pdf
  • Apostrophes: Note d'information.pdf
  • Multiple Unicode characters: récupérations.pdf
  • Spaces and apostrophes: Note d'information Gestion des récupérations.pdf
  • Non-Latin scripts: 文档.pdf
  • Mixed characters: rapport_année_2024.pdf
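
The checklist can be exercised at the filesystem level with a short script such as the one below. It is a sketch only: it creates each file in a temporary directory using UTF-8 byte paths and confirms it can be found and measured, and it does not test Apache, mod_wsgi, or Django. The helper name `check_names` is hypothetical.

```python
import os
import tempfile

TEST_NAMES = [
    "café.pdf",
    "Note d'information.pdf",
    "récupérations.pdf",
    "Note d'information Gestion des récupérations.pdf",
    "文档.pdf",
    "rapport_année_2024.pdf",
]


def check_names(names):
    """Create each file via a UTF-8 byte path and verify its size."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = tmp.encode("utf-8")
        for name in names:
            path = os.path.join(root, name.encode("utf-8"))
            with open(path, "wb") as f:
                f.write(b"test")
            results[name] = os.path.exists(path) and os.path.getsize(path) == 4
    return results


for name, ok in check_names(TEST_NAMES).items():
    # ascii() keeps the report printable on an ASCII-only stdout.
    print(ascii(name), "ok" if ok else "FAILED")
```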

Related Issues

This addresses the common Apache/WSGI Unicode problem where:

  • UnicodeEncodeError: 'ascii' codec can't encode character
  • File operations work in development (runserver) but fail in production (Apache/WSGI)
  • Database stores UTF-8 correctly but Apache/WSGI mangles the encoding

Attachments (2)

views.py (3.5 KB) - added by Caram 4 weeks ago.
tags.py (3.1 KB) - added by Caram 4 weeks ago.


Change History (7)

by Caram, 4 weeks ago

Attachment: views.py added

by Caram, 4 weeks ago

Attachment: tags.py added

comment:1 by Simon Charette, 4 weeks ago

Resolution: invalid
Status: new → closed

Please refer to the "How to use Django with Apache and mod_wsgi" documentation on the subject and avoid using an LLM to generate overly verbose reports.

Fixing UnicodeEncodeError for file uploads

If you get a UnicodeEncodeError when uploading or writing files with file names or content that contains non-ASCII characters, make sure Apache is configured to support UTF-8 encoding

Your mention of

Apache/WSGI defaults to ASCII encoding for standard streams, unlike runserver which uses UTF-8.

clearly shows that you didn't refer to the documentation on the subject.

comment:2 by Caram, 4 weeks ago

Resolution: invalid
Status: closed → new

Thanks Simon. Could you please have a closer look before you close the ticket? It's a real issue: it crashed my production server after moving to Django 6.0, and I spent 3 hours yesterday debugging and fixing it, so I think it deserves a little more consideration.

I realise the ticket description may be suboptimal, and I'm sorry if it caused any irritation. But closing tickets, maybe a little hastily, does not quite send the kind of positive message that we would like to send the community when they are reporting or fixing bugs, and I feel that the board would agree with me.

Again, please have a closer look and let me know if you need any additional information. I have quite a detailed trace of the debugging work I performed yesterday.

comment:3 by Jacob Walls, 4 weeks ago

Resolution: invalid
Status: new → closed

I appreciate you're feeling frustrated, but please don't reopen tickets without bringing new information to light. Same-day triage isn't hasty: firsthand experience in an area can lead to more efficient triage. We're not stubborn; we change our minds when new information compels it.

comment:4 by Simon Charette, 4 weeks ago

Some extra notes, which you can find if you git-blame the origin of the documentation linked above

comment:5 by Caram, 3 weeks ago

https://forum.djangoproject.com/t/unicodeencodeerror-ubuntu-apache2-admin-page/29351/2

Excellent, thanks Simon, this was a lifesaver. It all works perfectly now from Apache.
