Opened 18 months ago

Last modified 18 months ago

#26683 new New feature

Oracle DBs: Add option to use VARCHAR2 data type

Reported by: Emma Arandjelovic
Owned by: nobody
Component: Migrations
Version: 1.9
Severity: Normal
Keywords: Oracle
Cc: Jani Tiainen, Josh Smeaton, Shai Berger
Triage Stage: Someday/Maybe
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

CharFields are currently created in Oracle databases using the NVARCHAR2 data type. If possible, we would like to have an option to use the VARCHAR2 data type instead. This is preferable when using a Unicode database, to avoid any performance degradation caused by data conversions when the database contains a mix of the two types.

Change History (15)

comment:1 Changed 18 months ago by Tim Graham

Did you consider writing a custom field? I don't think Django would ship an Oracle-specific field or add an Oracle-specific option to CharField. Do you have a proposal about how to offer this feature?

Actually what you might want is a custom Oracle database backend that overrides the default DatabaseWrapper.data_types.
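A minimal sketch of that idea follows. It uses a stand-in for the real django.db.backends.oracle.base.DatabaseWrapper so the snippet runs without Django or cx_Oracle installed; in an actual custom backend you would subclass the real class instead.

```python
# Stand-in for django.db.backends.oracle.base.DatabaseWrapper; only the
# two relevant data_types entries are modeled here.
class OracleDatabaseWrapper:
    data_types = {
        'CharField': 'NVARCHAR2(%(max_length)s)',
        'TextField': 'NCLOB',
    }


class DatabaseWrapper(OracleDatabaseWrapper):
    # Copy the inherited mapping and swap the national character types
    # for their plain equivalents.
    data_types = dict(OracleDatabaseWrapper.data_types)
    data_types.update({
        'CharField': 'VARCHAR2(%(max_length)s)',
        'TextField': 'CLOB',
    })
```

Pointing the ENGINE setting at a package containing such a module would make subsequent migrations emit VARCHAR2 columns for CharFields.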

comment:2 Changed 18 months ago by Emma Arandjelovic

Thanks for the reply. We hadn't looked at custom fields, but were planning on using a custom backend as you suggested. We'd just like to minimise the amount of custom code wherever possible and thought other people might also want this feature.

I was thinking an additional setting in settings.py might be appropriate? There are already some Oracle-specific settings in there.
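For illustration, such an opt-in might look like this in settings.py. The 'use_varchar2' key is hypothetical; it is not an existing Django or cx_Oracle option:

```python
# Hypothetical example only: 'use_varchar2' is an invented option name,
# not something Django's Oracle backend actually reads.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.oracle',
        'NAME': 'xe',
        'USER': 'scott',
        'PASSWORD': 'tiger',
        'OPTIONS': {
            'use_varchar2': True,  # opt in to plain VARCHAR2 columns
        },
    },
}
```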

comment:3 Changed 18 months ago by Tim Graham

Cc: Jani Tiainen, Josh Smeaton, Shai Berger added
Resolution: wontfix
Status: new → closed

I'll add some known Oracle users to the CC to perhaps get some insight about whether trying to make the use case easier is worth it. I have some concerns that if this works at the database backend level, then we couldn't support migrating an existing database from one to the other, but maybe that's not important.

Please reopen the ticket if you guys see value as well as a reasonable implementation for this.

Last edited 18 months ago by Tim Graham (previous) (diff)

comment:4 Changed 18 months ago by Shai Berger

Personally I do not understand this:

> This is preferable when using a Unicode database, to avoid any performance degradation caused by data conversions when the database contains a mix of the two types

What two types? VARCHAR2 and NVARCHAR2?

Either way, the request is for a feature at the database (rather than field) level; I'm not sure that is the right granularity. I'm also not sure what should happen if the proposed setting is changed on a database which already has some fields (of the "wrong" type).

Could you please elaborate on these points?

comment:5 Changed 18 months ago by Vackar Afzal

The industry standard seems to be VARCHAR2, with support for NVARCHAR2 really there for legacy purposes.
I've encountered issues where existing Oracle code was all VARCHAR2, as per our (and probably most) DBAs' specs, but the Django fields were NVARCHAR2. When joining data across these schemas, a query would take 20 minutes to complete; changing the field to VARCHAR2 resulted in the same query executing in 200 ms.

My suggestion would be to default to VARCHAR2 for everything, falling back only when the database has not been set up with Unicode support. Checking for this is pretty straightforward:

SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET' ;

If you get this as the result, you're OK:

AL32UTF8

If not, then fall back to NVARCHAR2 if Unicode support is required.
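That fallback could be sketched as a small helper. The charset value would come from the query above; treating only AL32UTF8 and UTF8 as Unicode-capable is an assumption here, not a complete list:

```python
def char_type_for(nls_characterset):
    """Return the column type to use for CharField, given the value of
    NLS_CHARACTERSET. Falls back to NVARCHAR2 when the database character
    set cannot store Unicode (assumed policy from the comment above)."""
    unicode_charsets = {'AL32UTF8', 'UTF8'}  # assumption: minimal list
    if nls_characterset.upper() in unicode_charsets:
        return 'VARCHAR2'
    return 'NVARCHAR2'
```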

comment:6 Changed 18 months ago by Vackar Afzal

Resolution: wontfix
Status: closed → new

comment:7 Changed 18 months ago by Josh Smeaton

Just about every database I've worked with uses VARCHAR rather than NVARCHAR. I wasn't aware of performance issues between the two though. Vackar, were the joins you're talking about joins between NVAR and VAR fields, or just between two tables that contained those fields? I'm fairly sure you mean joining NCHAR to CHAR, but just want to be clear.

I guess the original decision to support NVARCHAR was due to the possibility/frequency of Oracle databases at the time not using UTF-8 (or similar) encoding by default, so NVARCHAR was a way to make Unicode work in all circumstances.

Recommendations mostly follow this from Ask Tom (https://asktom.oracle.com/pls/apex/f?p=100:11:::NO:RP:P11_QUESTION_ID:9462837200346048883):

> NVARCHAR2 is a unicode-only data type. It's useful if you want to have unicode for some columns (those that are NVARCHAR2) but have the rest of the database use a different characterset. In all other cases stick with using VARCHAR2.

I think this is probably worth fixing provided a palatable solution can be found. Simply querying for the character set is probably not enough. You need to know this information every time you run a migration (or cache it somewhere). If the character set is changed between migrations, now you have different character types within the same application, which is arguably worse. I'm unsure if Django needs to know what the type of a varchar column is at runtime or not, but if it does, that'd incur an extra query at connection creation time.
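Caching that lookup per connection might look like this. A sketch only: the cursor API assumed here matches cx_Oracle/Django cursors (execute/fetchone), and the demo uses a fake connection object rather than a real Oracle one:

```python
class CharsetCache:
    """Memoize the NLS_CHARACTERSET lookup so it runs once per connection."""

    _SQL = "SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET'"

    def __init__(self, connection):
        self._connection = connection
        self._charset = None
        self.queries_run = 0  # instrumentation for the demo below

    @property
    def charset(self):
        if self._charset is None:
            cursor = self._connection.cursor()
            cursor.execute(self._SQL)
            self._charset = cursor.fetchone()[0]
            self.queries_run += 1
        return self._charset


# Demo with a stand-in connection; no real Oracle instance is involved.
class FakeCursor:
    def execute(self, sql):
        pass

    def fetchone(self):
        return ('AL32UTF8',)


class FakeConnection:
    def cursor(self):
        return FakeCursor()


cache = CharsetCache(FakeConnection())
print(cache.charset)  # AL32UTF8
print(cache.charset)  # AL32UTF8 (served from cache; no second query)
```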

Also, looking into the code a bit it looks like the startswith/istartswith/contains based operators all expect NCHAR, and do the appropriate conversions:

_standard_operators = {
    'contains': "LIKE TRANSLATE(%s USING NCHAR_CS) ESCAPE TRANSLATE('\\' USING NCHAR_CS)",
    'icontains': "LIKE UPPER(TRANSLATE(%s USING NCHAR_CS)) ESCAPE TRANSLATE('\\' USING NCHAR_CS)",
    'startswith': "LIKE TRANSLATE(%s USING NCHAR_CS) ESCAPE TRANSLATE('\\' USING NCHAR_CS)",
    'endswith': "LIKE TRANSLATE(%s USING NCHAR_CS) ESCAPE TRANSLATE('\\' USING NCHAR_CS)",
    'istartswith': "LIKE UPPER(TRANSLATE(%s USING NCHAR_CS)) ESCAPE TRANSLATE('\\' USING NCHAR_CS)",
    'iendswith': "LIKE UPPER(TRANSLATE(%s USING NCHAR_CS)) ESCAPE TRANSLATE('\\' USING NCHAR_CS)",
}

I would think most of that TRANSLATE/USING would be unnecessary if we were simply using varchar to begin with. There's probably other optimisations we can make elsewhere if varchar was used. But these operators probably prevent you from creating a useful model field yourself - I haven't tried to see what happens if you do LIKE comparisons by translating char using nchar.

Migrations between the two are possible, but they require creating a new field, migrating the data across, and then dropping the existing field. For big tables that's going to be very expensive.
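That create/copy/drop sequence could be expressed as raw SQL. The table and column names here are purely illustrative, and the final RENAME (an extra step beyond the three described) keeps the original column name; in a Django migration these statements would typically go into migrations.RunSQL operations:

```python
# Hypothetical names: "myapp_book" / "title" are illustrative only.
NVARCHAR2_TO_VARCHAR2_SQL = [
    "ALTER TABLE myapp_book ADD title_tmp VARCHAR2(255)",
    "UPDATE myapp_book SET title_tmp = title",
    "ALTER TABLE myapp_book DROP COLUMN title",
    "ALTER TABLE myapp_book RENAME COLUMN title_tmp TO title",
]
```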

It seems to me that this is an all or nothing proposition. Since the character encoding is database-wide (not per user/schema), and there are many very old installations of Oracle around, we can't guarantee that an existing Django deployment will be able to continue on a version using VARCHAR if they aren't able to change the global character set.

I think this is going to require a new backend since there is no easy way to dynamically switch behaviours. A new backend would be able to sweep out a tonne of old workarounds that are probably not useful anymore. It'd also be able to break backwards compatibility, because I don't see the Django organisation maintaining two Oracle backends when maintaining the current one is hard enough due to lack of Oracle maintainers.

Sorry for the wall of text, but I kept jumping down new rabbit holes. As I said earlier though, if a palatable solution can be found, I think it's definitely worth looking into.

comment:8 Changed 18 months ago by Vackar Afzal

With regards to the slow queries, it was the VARCHAR2-to-NVARCHAR2 translation that was the killer: one table had around 100 million rows and was executing as a full table scan, so it understandably took a while.

I agree with and understand all of your points. It's a complex situation and there doesn't appear to be a simple solution.
My recommendation would be to create a new backend with the desired behaviour, and mark the current one as deprecated in version x.x.x

Then in version x.x.x+1, switch to having the new backend as the default and drop support for the older one. Migration can be done at the Python level as you suggested, or alternatively there is this tool from Oracle that any DBA (or even dev) can use to convert the DB to Unicode. You literally just need to click a few buttons, then wait a while for it to convert.

http://www.oracle.com/technetwork/database/database-technologies/globalization/dmu/overview/index.html

There will be some initial pain for sites using the legacy approach, but I think in the long run this will benefit the Django project, and in turn the community.

comment:9 Changed 18 months ago by Tim Graham

If we move forward with this, I prefer the approach of a temporary setting to opt-in to the new behavior (which eventually becomes mandatory) rather than creating a backend with a new name (unless there are compelling arguments against this approach).

comment:10 Changed 18 months ago by Jani Tiainen

Oracle database encoding is a slightly problematic case: once you have set it, it's practically impossible to change without reinstalling the whole server.

We and our customers have servers that run with the WE8ISO8859P1 encoding, and will very probably have them for at least the next few years. From my perspective, I would like the data type switch to be permanent, so I could use it on a per-customer basis.

It's not that we couldn't use tools like DMU. It's that some of our customers still have non-Unicode-aware applications that require VARCHAR2 to hold something other than Unicode.

I could do a proof of concept with a switch to see whether it would be too complex to achieve.

Side note: NCLOB is also in use; should that be CLOB as well?

comment:11 in reply to:  9 Changed 18 months ago by Vackar Afzal

Replying to timgraham:

> If we move forward with this, I prefer the approach of a temporary setting to opt-in to the new behavior (which eventually becomes mandatory) rather than creating a backend with a new name (unless there are compelling arguments against this approach).

Thinking about it, this actually would be more sensible.

comment:12 in reply to:  10 ; Changed 18 months ago by Vackar Afzal

Replying to jtiai:

> Oracle database encoding is a slightly problematic case: once you have set it, it's practically impossible to change without reinstalling the whole server.

I can only speak for Oracle 12c, but you can change the character set without having to re-install the server; I'm not sure what the situation is with older versions.

> We and our customers have servers that run with the WE8ISO8859P1 encoding, and will very probably have them for at least the next few years. From my perspective, I would like the data type switch to be permanent, so I could use it on a per-customer basis.

> It's not that we couldn't use tools like DMU. It's that some of our customers still have non-Unicode-aware applications that require VARCHAR2 to hold something other than Unicode.

> I could do a proof of concept with a switch to see whether it would be too complex to achieve.

> Side note: NCLOB is also in use; should that be CLOB as well?

I've never had a use case that required NCLOB, but I'm sure there will be others who do. I suppose it makes sense for the same reason NVARCHAR2 was used.

comment:13 in reply to:  12 Changed 18 months ago by Jani Tiainen

Replying to VackarAfzal:

> Replying to jtiai:
>
>> Oracle database encoding is a slightly problematic case: once you have set it, it's practically impossible to change without reinstalling the whole server.
>
> I can only speak for Oracle 12c, but you can change the character set without having to re-install the server; I'm not sure what the situation is with older versions.

Well, we mostly have Oracle 11g, as do our customers.

>> We and our customers have servers that run with the WE8ISO8859P1 encoding, and will very probably have them for at least the next few years. From my perspective, I would like the data type switch to be permanent, so I could use it on a per-customer basis.

This is our main problem for the foreseeable future: there are applications, mainly on Windows, that don't handle Unicode, and customers still use them.

>> I could do a proof of concept with a switch to see whether it would be too complex to achieve.
>
>> Side note: NCLOB is also in use; should that be CLOB as well?
>
> I've never had a use case that required NCLOB, but I'm sure there will be others who do. I suppose it makes sense for the same reason NVARCHAR2 was used.

Django uses NCLOB as the database data type for models.TextField.

comment:14 Changed 18 months ago by Shai Berger

A few points:

1) Oracle 11.2 "extended support" is until 2020. We may decide to drop our support of 11.2 earlier, but that wouldn't be nice to our users.

2) Varchar vs. NVarchar performance issues are well documented; such an issue was raised only a few hours ago on the cx-oracle users group.

3) Actually, if anyone has the time to put into it, a compatibility-breaking new Oracle backend could have a lot of benefits; two important problems that are hard to fix in a backward-compatible manner are the performance issues around numbers (we translate many numbers to strings in order to figure out the right Python type) and the case issues around database object names (tables etc.).

comment:15 Changed 18 months ago by Tim Graham

Triage Stage: Unreviewed → Someday/Maybe

If there are advantages to dropping Oracle 11.2 support, we might consider making Django 1.11 LTS (supported until at least April 2020) the last version to support it.

I'll bump this ticket to "Someday/Maybe" to get it off the unreviewed queue. If we have a specific plan for how to move it forward, we can move it to "Accepted".
