Code

Opened 5 years ago

Last modified 7 months ago

#11487 new Bug

Oracle encoding bug when saving more than 4000 characters

Reported by: mdpetry Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: oracle database nclob clob
Cc: ikelly, Goldan, shai@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I am working on a project where I have to store a large amount of
content, (html) in a record from a table, I am using Oracle as the
database

What is very strange ... is put a text reasonably small, (around 3000
characters) and it works correctly, save without problems, but
doubling this size, this content is saved completely changed.

This patch works well on: Django 1.1 (svn), debian, and Oracle 10g
with cx_oracle 4.4.1

Attachments (10)

base.diff (510 bytes) - added by mdpetry 5 years ago.
base.py (21.6 KB) - added by mdpetry 5 years ago.
base.2.diff (647 bytes) - added by mdpetry 5 years ago.
Change variable type to Database.CLOB
base.2.py (21.6 KB) - added by mdpetry 5 years ago.
changed variable to Database.CLOB
base.3.diff (725 bytes) - added by mdpetry 5 years ago.
same diff, but now from django root folder
base.4.diff (724 bytes) - added by anball@… 5 years ago.
patch with grammar fixed "an -> a"
base_with_tests.diff (1.9 KB) - added by rafax 5 years ago.
Patch with regression test added.
11487_fix.diff (2.3 KB) - added by jbronn 5 years ago.
Cleaned up base_with_tests patch.
clob_test.py (491 bytes) - added by Goldan 3 years ago.
Test that fails with CLOB
long_string.diff (527 bytes) - added by Goldan 3 years ago.
A string that is being patched (variants: LONG_STRING and NCLOB)

Download all attachments as: .zip

Change History (47)

comment:1 Changed 5 years ago by mdpetry

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

I simply changed the line 364 by placing the following contents:
self.input_size = Database.LONG_STRING

comment:2 Changed 5 years ago by jacob

  • milestone 1.1 deleted

Please upload a proper patch; finding the change by eye can be tricky. If you need help, check out the contributor's guide or ask someone in IRC. Thanks!

Changed 5 years ago by mdpetry

Changed 5 years ago by mdpetry

comment:3 Changed 5 years ago by mdpetry

  • milestone set to 1.1
  • Needs tests set

comment:4 Changed 5 years ago by ikelly

  • Summary changed from Oracle does not register in the batabase items with more than 4000 characters to Oracle encoding bug when saving more than 4000 characters
  • Triage Stage changed from Unreviewed to Accepted
  • Version changed from 1.0 to SVN

Glad you were able to get this working. Would you mind testing out for me the same change but using Database.CLOB rather than Database.LONG_STRING? The reason is I'm concerned that LONG_STRING might cause problems in other configurations. Also, the patch in #10566 already changes it from NCLOB to CLOB, so I'd prefer a solution that both tickets can agree on.

Also, since I can't reproduce this myself, would you please write up a test case for this? It doesn't need to be complicated. Just inserting a large string, reading it back, and checking that it's the same would be fine. The main thing is we need to be sure the string used will actually trigger the bug.

Thanks!

Changed 5 years ago by mdpetry

Change variable type to Database.CLOB

Changed 5 years ago by mdpetry

changed variable to Database.CLOB

comment:5 Changed 5 years ago by mdpetry

Using CLOB also works perfectly! I've added the file with the diff (base.2.diff) and source code (base.2.py)

comment:6 follow-up: Changed 5 years ago by Alex

Please upload the diffs from the root of the source tree, not just for the single file. Also there's no need to keep uploading the source file, just the diff is enough.

Changed 5 years ago by mdpetry

same diff, but now from django root folder

comment:7 in reply to: ↑ 6 Changed 5 years ago by mdpetry

Replying to Alex:

Please upload the diffs from the root of the source tree, not just for the single file. Also there's no need to keep uploading the source file, just the diff is enough.

Sorry Alex, I'm still adjusting to the process of submission of patches

I believe that now the file is ok: base.3.diff

comment:8 follow-up: Changed 5 years ago by dg

I belive it should say "as _a_ CLOB" (just a grammar nitpick ;))

Changed 5 years ago by anball@…

patch with grammar fixed "an -> a"

comment:9 in reply to: ↑ 8 Changed 5 years ago by anonymous

Replying to dg:

I belive it should say "as _a_ CLOB" (just a grammar nitpick ;))

I just attached a patch with the correct grammar.

comment:10 Changed 5 years ago by anonymous

so is the bug fixed? why is the ticket still open? :\

comment:11 Changed 5 years ago by Alex

This ticket is still open because no code has been committed to the Django repository.

comment:12 Changed 5 years ago by ikelly

And it hasn't been committed yet because it still needs a regression test. Tests aren't optional.

comment:13 Changed 5 years ago by mboersma

  • Cc mboersma, ikelly added

comment:14 Changed 5 years ago by CollinAnderson

  • Cc CollinAnderson added

comment:15 Changed 5 years ago by anonymous

Can you give an example of string that triggers the bug?
I tried to replicate it but with no success, and without string triggering it I cannot verify if the test I tried to write is proper.
Using Oracle 11, Python 2.6, Django SVN (and 1.0.2).

Changed 5 years ago by rafax

Patch with regression test added.

comment:16 Changed 5 years ago by rafax

Sorry for the anonymous comment from yesterday.
I managed to replicate the bug using Oracle 10, Python 2.6.2, Django SVN, cx_Oracle 4.4.1, which might indicate the improper behavior is connected with the version of Oracle/ cx_Oracle.
I wrote a simple test case as suggested by ikelly, if there is something that should be corrected in it please let me know.

comment:17 follow-ups: Changed 5 years ago by ikelly

  • Needs tests unset

Thanks, rafax. I've actually tested with the same set of versions and not reproduced the bug, so there must be something more to it. I suspect it has something to do with the database encodings.

About the test case, it appears to be saving a string of exactly 4000 characters, not more than 4000 characters. Is that intentional? Also, does it still replicate if you use an NCLOB column instead of a CLOB column? Since that is the default for Django, that is what we should be testing.

Changed 5 years ago by jbronn

Cleaned up base_with_tests patch.

comment:18 in reply to: ↑ 17 Changed 5 years ago by jbronn

Replying to ikelly:

Thanks, rafax. I've actually tested with the same set of versions and not reproduced the bug, so there must be something more to it. I suspect it has something to do with the database encodings.

I was able to reproduce the bug using Oracle 11, Python 2.5.4, and cx_Oracle 4.4.1. However, when using cx_Oracle 5.0.1 it's not a problem. I'm doing this on Windows, so this may be why I can reproduce.

About the test case, it appears to be saving a string of exactly 4000 characters, not more than 4000 characters. Is that intentional? Also, does it still replicate if you use an NCLOB column instead of a CLOB column? Since that is the default for Django, that is what we should be testing.

I thought that at first, but if you evaluate len(''.join([unicode(x) for x in xrange(4000)])) you see it is really 14,890 characters long. I also updated the test to use a NCLOB column, and the test case still fails without the patch.

Ian, look over my patch and make sure it doesn't break anything else -- I've run the full test suite but I want a second run on a different system to make sure there are no unintended side-effects.

comment:19 in reply to: ↑ 17 Changed 5 years ago by rafax

Replying to ikelly:

Thanks, rafax. I've actually tested with the same set of versions and not reproduced the bug, so there must be something more to it. I suspect it has something to do with the database encodings.

Or with the OS you run - I was able to reproduce on my work machine running Windows XP (with default CP-1250 encoding), while on my personal notebook running Ubuntu (with UTF based locale) it works OK.

About the test case, it appears to be saving a string of exactly 4000 characters, not more than 4000 characters. Is that intentional?

I thought it will certainly be longer than 4000 this way, in fact something like

>>> len(''.join(unicode(x) for x in xrange(2000)))
6890

should be enough.

comment:20 Changed 5 years ago by kmtracey

11487_fix.diff also fixes this test failure:

======================================================================
FAIL: Doctest: regressiontests.model_regress.models.__test__.API_TESTS
----------------------------------------------------------------------
Traceback (most recent call last):
  File "d:\u\kmt\django\trunk\django\test\_doctest.py", line 2180, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for regressiontests.model_regress.models.__test__.API_TESTS
  File "D:\u\kmt\django\trunk\tests\regressiontests\model_regress\models.py", line unknown line number, in API_TESTS

----------------------------------------------------------------------
File "D:\u\kmt\django\trunk\tests\regressiontests\model_regress\models.py", line ?, in regressiontests.model_regress.models.__test__.API_TESTS
Failed example:
    len(a4.article_text)
Expected:
    5000
Got:
    2500

for my test install of Oracle (also on a Windows box). I've seen this failure ever since I set up a test Oracle install, but never found time to dig into what might be causing it.

comment:21 Changed 5 years ago by ikelly

  • Resolution set to fixed
  • Status changed from new to closed

(In [11285]) Fixed #11487: pass long strings to Oracle as CLOB rather than NCLOB to prevent an encoding bug that occurs in some installations.

comment:22 Changed 5 years ago by ikelly

(In [11286]) Fixed #11487: pass long strings to Oracle as CLOB rather than NCLOB to prevent an encoding bug that occurs in some installations. Backport of [11285] from trunk.

comment:23 Changed 3 years ago by Goldan

  • Cc Goldan added
  • Keywords nclob clob added
  • Resolution fixed deleted
  • Status changed from closed to reopened

This issue seems to be a bit more complicated.

I'm using Oracle9i Enterprise Edition Release 9.2.0.1.0 - Production and Django development version.

I'm experiencing the following error when inserting more than 4000 characters (I'm trying to add a new flatpage):

DatabaseError at /admin/flatpages/flatpage/1/
ORA-12704: character set mismatch

I've tried to change the string in oracle/base.py (self.input_size = Database.CLOB, see attachment long_string.diff) to LONG_STRING and NCLOB. All three variants produce errors. I'm struggling with the issue for several days, so I've tested on various cx_Oracle versions.

I've made tests on three cx_Oracle versions: 4.4.1, 5.0.4 and 5.0.4 with WITH_UNICODE compilation flag.

I've made tests on two strings: cyrillic string of length 10350 symbols (the string can be found on http://dumpz.org/37744/text/) and a simple latin string of 16004 symbols "a".

During the tests, I was getting three types of errors:

  1. (hieroglyphs) The text is being saved without errors, but it is displayed as hieroglyphs (e.g. 慡)
  2. (mismatch) The text is not saved, and the following error is returned:
    DatabaseError at /admin/flatpages/flatpage/1/
    ORA-12704: character set mismatch
    
  3. (UnicodeDecodeError) The text is saved, but when trying to display it, an exception is raised:
    UnicodeDecodeError at /admin/flatpages/flatpage/1/
    'utf16' codec can't decode bytes in position 44-45: illegal encoding
    

The distribution of types of errors is the following:

  1. Cyrillic string:

1.1. cx_Oracle 4.4.1.
CLOB: mismatch,
LONG_STRING: hieroglyphs,
NCLOB: hieroglyphs.

1.2. cx_Oracle 5.0.4.
CLOB: mismatch,
LONG_STRING: UnicodeDecodeError,
NCLOB: UnicodeDecodeError.

1.3. cx_Oracle 5.0.4 + WITH_UNICODE.
CLOB: mismatch,
LONG_STRING: UnicodeDecodeError,
NCLOB: mismatch.

  1. Latin string:

2.1. cx_Oracle 4.4.1.
CLOB: mismatch,
LONG_STRING: hieroglyphs,
NCLOB: hieroglyphs.

2.2. cx_Oracle 5.0.4.
CLOB: mismatch,
LONG_STRING: hieroglyphs,
NCLOB: hieroglyphs.

2.3. cx_Oracle 5.0.4 + WITH_UNICODE.
CLOB: mismatch,
LONG_STRING: hieroglyphs,
NCLOB: mismatch.

Version 5, edited 3 years ago by Goldan (previous) (next) (diff)

Changed 3 years ago by Goldan

Test that fails with CLOB

Changed 3 years ago by Goldan

A string that is being patched (variants: LONG_STRING and NCLOB)

comment:24 Changed 3 years ago by julien

  • Severity set to Normal
  • Type set to Bug

comment:25 Changed 3 years ago by kimus.linuxus@…

  • Easy pickings unset

Same problem here and changing column to CLOB resolved my issue with ORA-12704: character set mismatch

comment:26 Changed 3 years ago by aaugustin

  • UI/UX unset

#9152 was a duplicate.

comment:27 Changed 3 years ago by jacob

  • milestone 1.1 deleted

Milestone 1.1 deleted

comment:28 Changed 13 months ago by aaugustin

  • Status changed from reopened to new

comment:29 Changed 11 months ago by aaugustin

#20201 was probably a duplicate.

comment:30 Changed 11 months ago by CollinAnderson

  • Cc CollinAnderson removed

comment:31 Changed 11 months ago by shai

  • Cc shai@… added

comment:32 Changed 11 months ago by mboersma

  • Cc mboersma removed

comment:33 Changed 10 months ago by graham.boyle@…

We're seeing this when putting more than 4 message.success(..) on a page. It seems the success messages go onto the session, and the session_data gets too big.

The len() of the string going into the SESSION_DATA field is 2120, but that's a unicode string, so I _strongly suspect_ the byte length of the data going into the django_session.session_data column is twice that. Looking at the HEX values of django_session.session_data already in the database, we're seeing 0x6C00 for "1", for example.
We are using TextFields in our models with no problem, even with large chunks of text, and even when creating a new row. We see the sql being run to save the session is
u'INSERT INTO "DJANGO_SESSION" ("SESSION_KEY", "SESSION_DATA", "EXPIRE_DATE") SELECT %s, %s, %s FROM DUAL'
from a call to cursor.execute(sql, params)
at line 937 of django\db\models\sql\compiler.py in execute_sql()

We note that there's a bulk_insert_sql() in django.db.backends.oracle.base.py (line 411) that's doing inserts from selects off dual.

We're using Oracle 11g, Django 1.5, Python 2.7.3, the python running under Windows 7 Professional (64 bit), Service Pack 1.

comment:34 Changed 10 months ago by graham.boyle@…

and the NLS database parameters of interest are:
NLS_CHARACTERSET AL32UTF8
NLS_NCHAR_CHARACTERSET AL16UTF16

and django_session.session_data is nclob

comment:35 Changed 7 months ago by timo

  • Needs tests set

I'm not an Oracle expert to comment on the fix, but I don't see a test that's integrated as part of Django's test suite so marking as "needs tests".

comment:36 Changed 7 months ago by ikelly

The test is at backends.OracleChecks.test_long_string and was added in [11285]. I note that it claims to test strings longer than 4000 characters but actually tests a string of exactly 4000 characters; this is probably not important as the Oracle limitation is actually 4000 bytes, and the string in question is more than 4000 bytes in any encoding.

The test is probably not reliable though, as the details of reproducing this issue seem to be dependent on the configuration of the database.

comment:37 Changed 7 months ago by timo

Sorry, I was referring to a test for the patch that was added as an attachment since the ticket was reopened (long_string.diff​).

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as new
The owner will be changed from nobody to anonymous. Next status will be 'assigned'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.