#34944 closed Bug (fixed)
Missing or misinferred attributes in output fields of generated fields
Reported by: | Paolo Melchiorre | Owned by: | Mariusz Felisiak |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 5.0 |
Severity: | Release blocker | Keywords: | field, database, generated, output_field |
Cc: | Lily Foote, Mariusz Felisiak | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Continuing with my experiments with the generated fields I found these two error situations which I reported together because they are very connected and could lead to a single solution.
I do not rule out that there are other examples of use of the generated fields that could lead to the same errors and if they come to mind, please add them in the comments.
Misinferred attributes
In the common example of a field generated as a concatenation of two CharField in a Model:
class Person(models.Model): first_name = models.CharField(max_length=255) last_name = models.CharField(max_length=255) full_name = models.GeneratedField( expression=Concat( "first_name", models.Value(" "), "last_name" ), db_persist=True, )
The SQL code for the generated column has an automatically inferred max length of only 255 characters (varchar(255)
), while the field should be able to contain strings of 511 characters (varchar(511)
):
CREATE TABLE "community_person" ( "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "first_name" varchar(255) NOT NULL, "last_name" varchar(255) NOT NULL, "full_name" varchar(255) GENERATED ALWAYS AS ( COALESCE("first_name", '') || COALESCE(COALESCE(' ', '') || COALESCE("last_name", ''), '') ) STORED );
Proposal
To solve the problem you could alternatively:
- make the maximum length extraction process smarter for the automatically created output fields
- mandatorily require specifying the output field when some of the fields involved could lead to situations like the ones above
Missing attributes
If in the previous example, I explicitly specify the output field without attributes, the migration is generated without generating errors:
class Person(models.Model): first_name = models.CharField(max_length=255) last_name = models.CharField(max_length=255) full_name = models.GeneratedField( expression=Concat( "first_name", models.Value(" "), "last_name" ), db_persist=True, output_field=models.CharField(), )
The SQL code contains an incorrect varchar(None)
type:
CREATE TABLE "community_person" ( "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "first_name" varchar(255) NOT NULL, "last_name" varchar(255) NOT NULL, "full_name" varchar(None) GENERATED ALWAYS AS ( COALESCE("first_name", '') || COALESCE(COALESCE(' ', '') || COALESCE("last_name", ''), '') ) STORED );
which will generate an error when trying to apply the migration to the database:
sqlite3.OperationalError: near "None": syntax error
Proposal
To solve the problem you could alternatively:
- make the SQL code generation process smarter so as to generate working SQL code
- request to specify mandatory attributes for the output fields raising an error in the migration creation phase
Cumulative proposal
To solve both problems shown above, given the little time remaining before the release of the stable version of Django 5, the two most drastic solutions from both of the above proposals could be adopted.
Required output field
Always require specifying the output field (except when you are sure that the extracted type cannot generate error situations?)
Example of error message:
django.core.exceptions.FieldError: Expression doesn't contain an explicit type. You must set output_field.
Required attributes
Request to specify mandatory attributes for the output fields raising an error in the migration creation phase.
$ python3 -m manage makemigrations SystemCheckError: System check identified some issues: ERRORS: community.Person.full_name: (fields.E120) CharFields must define a 'max_length' attribute.
Change History (38)
comment:1 by , 10 months ago
Description: | modified (diff) |
---|
comment:2 by , 10 months ago
Cc: | added |
---|---|
Triage Stage: | Unreviewed → Accepted |
follow-up: 4 comment:3 by , 10 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Hello, I would like to work on this issue.
As pointed out by Paolo, I can -
- make the output_field required when it can generate errors
- and ask to specify the required attributes like max_length on migration creation phase.
follow-up: 6 comment:4 by , 10 months ago
Replying to Om Dahale:
Hello, I would like to work on this issue.
Since this issue is a release blocker I hope you have time to propose an MR ASAP :)
As pointed out by Paolo, I can -
- make the output_field required when it can generate errors
- and ask to specify the required attributes like max_length on migration creation phase.
Precisely, the two points I proposed as a solution are the ones that can be implemented faster than others.
For the point above:
- "when it can generate errors" it can be difficult in all the cases, maybe you can start asking every time for
output_field
- "attributes like max_length" we need to add tests for all model field types
follow-ups: 7 8 comment:5 by , 10 months ago
Hm I think there are 2 tickets here… or at least 2 PRs. Perhaps forum discussion is required but I don't think we ought to go down the route of trying to make resolve_expression()
smart enough to determine that concat'ing 2 varchars means we need to add the max-lengths together – there may be a can of worms here since expressions could be ever so complex? 🤔
We definitely need to fix the broken "None" in the DDL 👍
Always require specifying the output field (except when you are sure that the extracted type cannot generate error situations?)
Perhaps 👍 "Explicit is better than implicit" and all that, though I think that documenting "For complex expressions consider always declaring an output_field" is a nice option.
I'm keen to hear Mariusz & Lily's thoughts.
comment:6 by , 10 months ago
Replying to Paolo Melchiorre:
Since this issue is a release blocker I hope you have time to propose an MR ASAP :)
Okay
Precisely, the two points I proposed as a solution are the ones that can be implemented faster than others.
For the point above:
- "when it can generate errors" it can be difficult in all the cases, maybe you can start asking every time for
output_field
- "attributes like max_length" we need to add tests for all model field types
Noted
follow-up: 9 comment:7 by , 10 months ago
Replying to David Sanders:
Hm I think there are 2 tickets here… or at least 2 PRs. Perhaps forum discussion is required but I don't think we ought to go down the route of trying to make
resolve_expression()
smart enough to determine that concat'ing 2 varchars means we need to add the max-lengths together – there may be a can of worms here since expressions could be ever so complex? 🤔
We definitely need to fix the broken "None" in the DDL 👍
Always require specifying the output field (except when you are sure that the extracted type cannot generate error situations?)
Perhaps 👍 "Explicit is better than implicit" and all that, though I think that documenting "For complex expressions consider always declaring an output_field" is a nice option.
I'm keen to hear Mariusz & Lily's thoughts.
So shall I continue my work on the 2 points given above?
comment:8 by , 10 months ago
Replying to David Sanders:
Perhaps forum discussion is required ...
I started a discussion in the forum:
https://forum.djangoproject.com/t/release-blocker-missing-or-misinferred-attributes-in-output-fields-of-generated-fields/25103
... I don't think we ought to go down the route of trying to make
resolve_expression()
smart enough to determine that concat'ing 2 varchars means we need to add the max-lengths together ... 🤔
In fact, in my proposal for the "Cumulative proposal", I excluded this hypothesis.
We definitely need to fix the broken "None" in the DDL 👍
Yes, I think it absolutely needs to be resolved.
Always require specifying the output field (except when you are sure that the extracted type cannot generate error situations?)
Perhaps 👍 "Explicit is better than implicit" and all that, though I think that documenting "For complex expressions consider always declaring an output_field" is a nice option.
It could be a good solution, but it should be highlighted that deriving the output_field
from the expression could hide pitfalls even in apparently simple cases such as concatenating two CharField
on SQLite
. Not all Django developers look under the hood of the ORM
or check the SQL
code generated by migrations.
Perhaps it might be safer to always require the user to specify an output_field
.
I'm keen to hear Mariusz & Lily's thoughts.
Me too, and also the people who collaborated on this feature like Simon, Lily, and Adam.
follow-up: 10 comment:9 by , 10 months ago
comment:10 by , 10 months ago
Replying to Paolo Melchiorre:
Replying to Om Dahale:
So shall I continue my work on the 2 points given above?
I think you should.
Okay
follow-ups: 12 13 comment:11 by , 10 months ago
I'm curious if there are other expressions that could lead to an increased column size or if it's just Concat
. Maybe we could add something to Concat
(and other relevant expressions) that can express that column size change?
If that's not feasible I would lean towards detecting the case where the column size could be invalid and require output_field
explicitly in those cases. I'd be sad if GeneratedField
always required an explicit output_field
.
comment:12 by , 10 months ago
I published an introduction to database generated columns, using SQLite and the new GeneratedField added in Django 5.0
In this article, I shared the work I have done in testing GenerateField using SQLite as a database backend, and I hope it can be useful to people working in this issue.
https://www.paulox.net/2023/11/07/database-generated-columns-part-1-django-and-sqlite/
comment:13 by , 10 months ago
Replying to Lily Foote:
I'm curious if there are other expressions that could lead to an increased column size or if it's just
Concat
.
Unfortunately, many other cases come to mind, for example a field generated as a product of two IntegerField which should be a BigIntegerField.
If that's not feasible I would lean towards detecting the case where the column size could be invalid and require
output_field
explicitly in those cases. I'd be sad ifGeneratedField
always required an explicitoutput_field
.
I agree with you but from the various tests I have done there are more and more cases in which the automatically derived output field has problems.
follow-ups: 18 23 comment:14 by , 10 months ago
Resolving the output_field
of combined expressions is a complex task that the ORM only has limited support for #26355, #31506, #33397, this is not a problem specific to GenerateField
.
It is particularly tricky in the case of varchar(x) + varchar(y)
as some backends restrict the allowed length of varchar
(e.g. MySQL at 255) and using a TextField
might come with backend specific limitations.
Our stance so far has been to make a best effort guess and refuse the temptation to guess.
The SQL code for the generated column has an automatically inferred max length of only 255 characters (varchar(255))
I wish we were more rigid in this particular case when dealing with parameterized fields such as CharFied
, ArrayField
, DecimalField
but it's non-trivial and often backward incompatible. In this particular case it's tempting to adjust the logic of Concat._resolve_output_field
to be smarter and generate a CharField(max_length=511)
but I suspect it might have unintended effect and not something we'd like to include in 5.0 at this point.
I suggest that we take the same approach here and expect the user to specify an explicit output_field
.
For the second part of the problem, CharField(None)
, I think that the addition of a check is in order. Adding a GeneratedField._check_output_field
that simply delegates to self.output_field.check(**kwargs)
seems like the way to go to catch all misuse of output_field
.
comment:15 by , 10 months ago
Just fyi I am getting the hang of it now and will be raising a PR soon 🙂
follow-ups: 19 24 comment:16 by , 10 months ago
Thanks Simon. I think we all seem to be in consensus about explicit output_field… Lily & I were chatting on Discord (Paolo I believe you're on there too) about how it seems unavoidable. 👍
@Om Dahale Thanks, I was going to suggest actually starting with an easier ticket as I assumed you were a new contributor? But if you feel like you're up to the task then good luck, just be aware that someone else may need to "take over" if things stall seeing as this is a release blocker ;)
follow-up: 20 comment:17 by , 10 months ago
FWIW I create a draft MR to demonstrate how Concat
resolving could be enhanced but I still think that we should not included it in 5.0. Might be worth a follow up cleanup/optimization ticket though.
comment:18 by , 10 months ago
Replying to Simon Charette:
Resolving the
output_field
of combined expressions is a complex task that the ORM only has limited support for #26355, #31506, #33397, this is not a problem specific toGenerateField
.
Thanks for pointing this out, it helped me put this issue in the right perspective.
It is particularly tricky in the case of
varchar(x) + varchar(y)
as some backends restrict the allowed length ofvarchar
(e.g. MySQL at 255) ...
I would add that, to complicate things, the restrictions on the various types of fields also vary between different versions of the same database.
The SQL code for the generated column has an automatically inferred max length of only 255 characters (varchar(255))
... In this particular case it's tempting to adjust the logic of
Concat._resolve_output_field
to be smarter and generate aCharField(max_length=511)
but I suspect it might have unintended effect and not something we'd like to include in 5.0 at this point.
I suggest that we take the same approach here and expect the user to specify an explicitoutput_field
.
I totally agree, and thanks for rephrasing my initial proposal more clearly.
For the second part of the problem,
CharField(None)
, I think that the addition of a check is in order. Adding aGeneratedField._check_output_field
that simply delegates toself.output_field.check(**kwargs)
seems like the way to go to catch all misuse ofoutput_field
.
Thanks for pointing out this approach.
comment:19 by , 10 months ago
Replying to David Sanders:
Thanks Simon. I think we all seem to be in consensus about explicit output_field… Lily & I were chatting on Discord (Paolo I believe you're on there too) about how it seems unavoidable. 👍
Given the different time zone, I have only now read Discord, but I repeat that, as I suspected in the description of the issue, it is inevitable at this point to request an explicit output_field
.
... just be aware that someone else may need to "take over" if things stall seeing as this is a release blocker ;)
Thanks for making that clear.
comment:20 by , 10 months ago
Replying to Simon Charette:
FWIW I create a draft MR to demonstrate how
Concat
resolving could be enhanced but I still think that we should not included it in 5.0. Might be worth a follow up cleanup/optimization ticket though.
To avoid mixing too many things here, I have already opened Issue #34954
follow-up: 28 comment:23 by , 10 months ago
Replying to Simon Charette:
For the second part of the problem,
CharField(None)
, I think that the addition of a check is in order. Adding aGeneratedField._check_output_field
that simply delegates toself.output_field.check(**kwargs)
seems like the way to go to catch all misuse ofoutput_field
.
Hello, Simon!
I raised a PR can you please have a look and let me know if I am going in the right direction?
comment:24 by , 10 months ago
Replying to David Sanders:
@Om Dahale Thanks, I was going to suggest actually starting with an easier ticket as I assumed you were a new contributor? But if you feel like you're up to the task then good luck, just be aware that someone else may need to "take over" if things stall seeing as this is a release blocker ;)
Yes, understood.
comment:26 by , 10 months ago
Reposting from PR..
I did this and now receiving the message to specify max_length
def check(self, **kwargs): databases = kwargs.get("databases") or [] return [ *super().check(**kwargs), *self._check_output_field(**kwargs), *self._check_supported(databases), *self._check_persistence(databases), ] def _check_output_field(self, **kwargs): # Ignore field name checks as the output_field is not a field. self.output_field.name = self.name + "_output_field" self.output_field.model = self.model return self.output_field.check(**kwargs)
is this correct?
comment:28 by , 10 months ago
Hello, Paolo Simon David
Can you help me out, just let me know if it is correct and if not then please point out the mistakes?
comment:30 by , 10 months ago
Patch needs improvement: | set |
---|
There is a minor typo but otherwise PR looks good.
comment:33 by , 10 months ago
Patch needs improvement: | unset |
---|
comment:34 by , 10 months ago
Triage Stage: | Accepted → Ready for checkin |
---|
Thank you Paolo! I reproduced both scenarios, accepting.