Opened 15 years ago

Closed 6 years ago

#11487 closed Bug (needsinfo)

Oracle encoding bug when saving more than 4000 characters

Reported by: Marcos Daniel Petry Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: oracle database nclob clob
Cc: Erin Kelly, Goldan, shai@…, felixx Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I am working on a project where I have to store a large amount of
content, (html) in a record from a table, I am using Oracle as the
database

What is very strange ... is put a text reasonably small, (around 3000
characters) and it works correctly, save without problems, but
doubling this size, this content is saved completely changed.

This patch works well on: Django 1.1 (svn), debian, and Oracle 10g
with cx_oracle 4.4.1

Attachments (10)

base.diff (510 bytes ) - added by Marcos Daniel Petry 15 years ago.
base.py (21.6 KB ) - added by Marcos Daniel Petry 15 years ago.
base.2.diff (647 bytes ) - added by Marcos Daniel Petry 15 years ago.
Change variable type to Database.CLOB
base.2.py (21.6 KB ) - added by Marcos Daniel Petry 15 years ago.
changed variable to Database.CLOB
base.3.diff (725 bytes ) - added by Marcos Daniel Petry 15 years ago.
same diff, but now from django root folder
base.4.diff (724 bytes ) - added by anball@… 15 years ago.
patch with grammar fixed "an -> a"
base_with_tests.diff (1.9 KB ) - added by rafax 15 years ago.
Patch with regression test added.
11487_fix.diff (2.3 KB ) - added by jbronn 15 years ago.
Cleaned up base_with_tests patch.
clob_test.py (491 bytes ) - added by Goldan 13 years ago.
Test that fails with CLOB
long_string.diff (527 bytes ) - added by Goldan 13 years ago.
A string that is being patched (variants: LONG_STRING and NCLOB)

Download all attachments as: .zip

Change History (51)

comment:1 by Marcos Daniel Petry, 15 years ago

I simply changed the line 364 by placing the following contents:
self.input_size = Database.LONG_STRING

comment:2 by Jacob, 15 years ago

milestone: 1.1

Please upload a proper patch; finding the change by eye can be tricky. If you need help, check out the contributor's guide or ask someone in IRC. Thanks!

by Marcos Daniel Petry, 15 years ago

Attachment: base.diff added

by Marcos Daniel Petry, 15 years ago

Attachment: base.py added

comment:3 by Marcos Daniel Petry, 15 years ago

milestone: 1.1
Needs tests: set

comment:4 by Erin Kelly, 15 years ago

Summary: Oracle does not register in the batabase items with more than 4000 charactersOracle encoding bug when saving more than 4000 characters
Triage Stage: UnreviewedAccepted
Version: 1.0SVN

Glad you were able to get this working. Would you mind testing out for me the same change but using Database.CLOB rather than Database.LONG_STRING? The reason is I'm concerned that LONG_STRING might cause problems in other configurations. Also, the patch in #10566 already changes it from NCLOB to CLOB, so I'd prefer a solution that both tickets can agree on.

Also, since I can't reproduce this myself, would you please write up a test case for this? It doesn't need to be complicated. Just inserting a large string, reading it back, and checking that it's the same would be fine. The main thing is we need to be sure the string used will actually trigger the bug.

Thanks!

by Marcos Daniel Petry, 15 years ago

Attachment: base.2.diff added

Change variable type to Database.CLOB

by Marcos Daniel Petry, 15 years ago

Attachment: base.2.py added

changed variable to Database.CLOB

comment:5 by Marcos Daniel Petry, 15 years ago

Using CLOB also works perfectly! I've added the file with the diff (base.2.diff) and source code (base.2.py)

comment:6 by Alex Gaynor, 15 years ago

Please upload the diffs from the root of the source tree, not just for the single file. Also there's no need to keep uploading the source file, just the diff is enough.

by Marcos Daniel Petry, 15 years ago

Attachment: base.3.diff added

same diff, but now from django root folder

in reply to:  6 comment:7 by Marcos Daniel Petry, 15 years ago

Replying to Alex:

Please upload the diffs from the root of the source tree, not just for the single file. Also there's no need to keep uploading the source file, just the diff is enough.

Sorry Alex, I'm still adjusting to the process of submission of patches

I believe that now the file is ok: base.3.diff

comment:8 by dg, 15 years ago

I belive it should say "as _a_ CLOB" (just a grammar nitpick ;))

by anball@…, 15 years ago

Attachment: base.4.diff added

patch with grammar fixed "an -> a"

in reply to:  8 comment:9 by anonymous, 15 years ago

Replying to dg:

I belive it should say "as _a_ CLOB" (just a grammar nitpick ;))

I just attached a patch with the correct grammar.

comment:10 by anonymous, 15 years ago

so is the bug fixed? why is the ticket still open? :\

comment:11 by Alex Gaynor, 15 years ago

This ticket is still open because no code has been committed to the Django repository.

comment:12 by Erin Kelly, 15 years ago

And it hasn't been committed yet because it still needs a regression test. Tests aren't optional.

comment:13 by Matt Boersma, 15 years ago

Cc: Matt Boersma Erin Kelly added

comment:14 by Collin Anderson, 15 years ago

Cc: Collin Anderson added

comment:15 by anonymous, 15 years ago

Can you give an example of string that triggers the bug?
I tried to replicate it but with no success, and without string triggering it I cannot verify if the test I tried to write is proper.
Using Oracle 11, Python 2.6, Django SVN (and 1.0.2).

by rafax, 15 years ago

Attachment: base_with_tests.diff added

Patch with regression test added.

comment:16 by rafax, 15 years ago

Sorry for the anonymous comment from yesterday.
I managed to replicate the bug using Oracle 10, Python 2.6.2, Django SVN, cx_Oracle 4.4.1, which might indicate the improper behavior is connected with the version of Oracle/ cx_Oracle.
I wrote a simple test case as suggested by ikelly, if there is something that should be corrected in it please let me know.

comment:17 by Erin Kelly, 15 years ago

Needs tests: unset

Thanks, rafax. I've actually tested with the same set of versions and not reproduced the bug, so there must be something more to it. I suspect it has something to do with the database encodings.

About the test case, it appears to be saving a string of exactly 4000 characters, not more than 4000 characters. Is that intentional? Also, does it still replicate if you use an NCLOB column instead of a CLOB column? Since that is the default for Django, that is what we should be testing.

by jbronn, 15 years ago

Attachment: 11487_fix.diff added

Cleaned up base_with_tests patch.

in reply to:  17 comment:18 by jbronn, 15 years ago

Replying to ikelly:

Thanks, rafax. I've actually tested with the same set of versions and not reproduced the bug, so there must be something more to it. I suspect it has something to do with the database encodings.

I was able to reproduce the bug using Oracle 11, Python 2.5.4, and cx_Oracle 4.4.1. However, when using cx_Oracle 5.0.1 it's not a problem. I'm doing this on Windows, so this may be why I can reproduce.

About the test case, it appears to be saving a string of exactly 4000 characters, not more than 4000 characters. Is that intentional? Also, does it still replicate if you use an NCLOB column instead of a CLOB column? Since that is the default for Django, that is what we should be testing.

I thought that at first, but if you evaluate len(''.join([unicode(x) for x in xrange(4000)])) you see it is really 14,890 characters long. I also updated the test to use a NCLOB column, and the test case still fails without the patch.

Ian, look over my patch and make sure it doesn't break anything else -- I've run the full test suite but I want a second run on a different system to make sure there are no unintended side-effects.

in reply to:  17 comment:19 by rafax, 15 years ago

Replying to ikelly:

Thanks, rafax. I've actually tested with the same set of versions and not reproduced the bug, so there must be something more to it. I suspect it has something to do with the database encodings.

Or with the OS you run - I was able to reproduce on my work machine running Windows XP (with default CP-1250 encoding), while on my personal notebook running Ubuntu (with UTF based locale) it works OK.

About the test case, it appears to be saving a string of exactly 4000 characters, not more than 4000 characters. Is that intentional?

I thought it will certainly be longer than 4000 this way, in fact something like

>>> len(''.join(unicode(x) for x in xrange(2000)))
6890

should be enough.

comment:20 by Karen Tracey, 15 years ago

11487_fix.diff also fixes this test failure:

======================================================================
FAIL: Doctest: regressiontests.model_regress.models.__test__.API_TESTS
----------------------------------------------------------------------
Traceback (most recent call last):
  File "d:\u\kmt\django\trunk\django\test\_doctest.py", line 2180, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for regressiontests.model_regress.models.__test__.API_TESTS
  File "D:\u\kmt\django\trunk\tests\regressiontests\model_regress\models.py", line unknown line number, in API_TESTS

----------------------------------------------------------------------
File "D:\u\kmt\django\trunk\tests\regressiontests\model_regress\models.py", line ?, in regressiontests.model_regress.models.__test__.API_TESTS
Failed example:
    len(a4.article_text)
Expected:
    5000
Got:
    2500

for my test install of Oracle (also on a Windows box). I've seen this failure ever since I set up a test Oracle install, but never found time to dig into what might be causing it.

comment:21 by Erin Kelly, 15 years ago

Resolution: fixed
Status: newclosed

(In [11285]) Fixed #11487: pass long strings to Oracle as CLOB rather than NCLOB to prevent an encoding bug that occurs in some installations.

comment:22 by Erin Kelly, 15 years ago

(In [11286]) Fixed #11487: pass long strings to Oracle as CLOB rather than NCLOB to prevent an encoding bug that occurs in some installations. Backport of [11285] from trunk.

comment:23 by Goldan, 13 years ago

Cc: Goldan added
Keywords: nclob clob added
Resolution: fixed
Status: closedreopened

This issue seems to be a bit more complicated.

I'm using Oracle9i Enterprise Edition Release 9.2.0.1.0 - Production and Django development version.

I'm experiencing the following error when inserting more than 4000 characters (I'm trying to add a new flatpage):

DatabaseError at /admin/flatpages/flatpage/1/
ORA-12704: character set mismatch

I've tried to change the string in oracle/base.py (self.input_size = Database.CLOB, see attachment long_string.diff) to LONG_STRING and NCLOB. All three variants produce errors. I'm struggling with the issue for several days, so I've tested on various cx_Oracle versions.

I've made tests on three cx_Oracle versions: 4.4.1, 5.0.4 and 5.0.4 with WITH_UNICODE compilation flag.

I've made tests on two strings: cyrillic string of length 10350 symbols (the string can be found on http://dumpz.org/37744/text/) and a simple latin string of 16004 symbols "a".

During the tests, I was getting three types of errors:

  1. (hieroglyphs) The text is being saved without errors, but it is displayed as hieroglyphs (e.g. 慡)
  2. (mismatch) The text is not saved, and the following error is returned:
    DatabaseError at /admin/flatpages/flatpage/1/
    ORA-12704: character set mismatch
    
  3. (UnicodeDecodeError) The text is saved, but when trying to display it, an exception is raised:
    UnicodeDecodeError at /admin/flatpages/flatpage/1/
    'utf16' codec can't decode bytes in position 44-45: illegal encoding
    

The distribution of types of errors is the following:

  1. Cyrillic string:

1.1. cx_Oracle 4.4.1.
CLOB: mismatch,
LONG_STRING: hieroglyphs,
NCLOB: hieroglyphs.

1.2. cx_Oracle 5.0.4.
CLOB: mismatch,
LONG_STRING: UnicodeDecodeError,
NCLOB: UnicodeDecodeError.

1.3. cx_Oracle 5.0.4 + WITH_UNICODE.
CLOB: mismatch,
LONG_STRING: UnicodeDecodeError,
NCLOB: mismatch.

  1. Latin string:

2.1. cx_Oracle 4.4.1.
CLOB: mismatch,
LONG_STRING: hieroglyphs,
NCLOB: hieroglyphs.

2.2. cx_Oracle 5.0.4.
CLOB: mismatch,
LONG_STRING: hieroglyphs,
NCLOB: hieroglyphs.

2.3. cx_Oracle 5.0.4 + WITH_UNICODE.
CLOB: mismatch,
LONG_STRING: hieroglyphs,
NCLOB: mismatch.

No problems occur if I save a short cyrillic string (10 symbols) on any versions.

A possible workaround is to change all NCLOB fields in the database to CLOBs. It seems to resolve the issue (I have not performed full testing, just a simple one), but I'm not aware of side effects.

Last edited 13 years ago by Goldan (previous) (diff)

by Goldan, 13 years ago

Attachment: clob_test.py added

Test that fails with CLOB

by Goldan, 13 years ago

Attachment: long_string.diff added

A string that is being patched (variants: LONG_STRING and NCLOB)

comment:24 by Julien Phalip, 13 years ago

Severity: Normal
Type: Bug

comment:25 by kimus.linuxus@…, 13 years ago

Easy pickings: unset

Same problem here and changing column to CLOB resolved my issue with ORA-12704: character set mismatch

comment:26 by Aymeric Augustin, 13 years ago

UI/UX: unset

#9152 was a duplicate.

comment:27 by Jacob, 12 years ago

milestone: 1.1

Milestone 1.1 deleted

comment:28 by Aymeric Augustin, 11 years ago

Status: reopenednew

comment:29 by Aymeric Augustin, 11 years ago

#20201 was probably a duplicate.

comment:30 by Collin Anderson, 11 years ago

Cc: Collin Anderson removed

comment:31 by Shai Berger, 11 years ago

Cc: shai@… added

comment:32 by Matt Boersma, 11 years ago

Cc: Matt Boersma removed

comment:33 by graham.boyle@…, 11 years ago

We're seeing this when putting more than 4 message.success(..) on a page. It seems the success messages go onto the session, and the session_data gets too big.

The len() of the string going into the SESSION_DATA field is 2120, but that's a unicode string, so I _strongly suspect_ the byte length of the data going into the django_session.session_data column is twice that. Looking at the HEX values of django_session.session_data already in the database, we're seeing 0x6C00 for "1", for example.
We are using TextFields in our models with no problem, even with large chunks of text, and even when creating a new row. We see the sql being run to save the session is
u'INSERT INTO "DJANGO_SESSION" ("SESSION_KEY", "SESSION_DATA", "EXPIRE_DATE") SELECT %s, %s, %s FROM DUAL'
from a call to cursor.execute(sql, params)
at line 937 of django\db\models\sql\compiler.py in execute_sql()

We note that there's a bulk_insert_sql() in django.db.backends.oracle.base.py (line 411) that's doing inserts from selects off dual.

We're using Oracle 11g, Django 1.5, Python 2.7.3, the python running under Windows 7 Professional (64 bit), Service Pack 1.

comment:34 by graham.boyle@…, 11 years ago

and the NLS database parameters of interest are:
NLS_CHARACTERSET AL32UTF8
NLS_NCHAR_CHARACTERSET AL16UTF16

and django_session.session_data is nclob

comment:35 by Tim Graham, 10 years ago

Needs tests: set

I'm not an Oracle expert to comment on the fix, but I don't see a test that's integrated as part of Django's test suite so marking as "needs tests".

comment:36 by Erin Kelly, 10 years ago

The test is at backends.OracleChecks.test_long_string and was added in [11285]. I note that it claims to test strings longer than 4000 characters but actually tests a string of exactly 4000 characters; this is probably not important as the Oracle limitation is actually 4000 bytes, and the string in question is more than 4000 bytes in any encoding.

The test is probably not reliable though, as the details of reproducing this issue seem to be dependent on the configuration of the database.

comment:37 by Tim Graham, 10 years ago

Sorry, I was referring to a test for the patch that was added as an attachment since the ticket was reopened (long_string.diff​).

comment:38 by Arpit, 9 years ago

Is there any update on this issue? we are still experiencing this issue with 2000 characters. It happens usually when it has unicode characters in it. I read about the work around to solve this issue but its not an ideal solution, Does anyone have a fix for this.

comment:39 by RohitV24, 9 years ago

We are currently experiencing the same issue while inserting large strings (between 2000-4000 characters) into the Oracle database and it looks like a few tickets were opened which have been closed as duplicate of this.The NLS_NCHAR_CHARACTERSET on our database is AL16UTF16 which assigns 2 bytes for a char .From digging in a little deeper, it looks like a string is mapped to cx_Oracle.STRING which is then mapped to either a VARCHAR, NVARCHAR or LONG in Oracle and the conversion to long in case of long values is causing the error. It looks like the issue with 4000 characters was fixed by setting the input size to cx_Oracle.CLOB when it reached the character limit. Using 2000( for utf-16) seems to work fine and solve the problem. Would setting the comparison value to 1000 (taking into consideration other encoding formats) before setting it to CLOB be the fix for this issue?

Environment: Oracle 10g, Django 1.7.7, cx_Oracle 5.2

Last edited 9 years ago by RohitV24 (previous) (diff)

comment:40 by Tim Graham <timograham@…>, 6 years ago

In 950171d:

Refs #11487 -- Removed redundant test_long_string() test.

Redundant with model_regress.tests.ModelTests.test_long_textfield
since 3ede430b9a94e3c2aed64d2cf898920635bdf4ae.

comment:41 by Shai Berger, 6 years ago

Cc: felixx added
Resolution: needsinfo
Status: newclosed

Apparently this hasn't been a problem lately, and the original description and suggested solution have been made largely irrelevant.

If you have a similar problem you can reproduce, please re-open, or, more usefully, just open a new ticket.

Note: See TracTickets for help on using tickets.
Back to Top