Opened 19 months ago

Closed 18 months ago

Last modified 18 months ago

#34564 closed Bug (invalid)

returning None instead of zero in Count annotation

Reported by: Amin Aminian Owned by: Amin Aminian
Component: Database layer (models, ORM) Version: 4.2
Severity: Normal Keywords: count, orm, annotate
Cc: Simon Charette Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

I am trying to upgrade the Django's version from 3.2 to 4.2 in my project. after installation, some of my tests were broken.
The problem:
I am annotating a queryset with Count function, and the count result in some objects should be zero.
This is my code:

Order.objects.annotate(count=Count("customer", distinct=True))

In Django==3.2, if some orders don't have any customers, the count field would be zero (as I want).
But in Django==4.2, if some orders don't have any customers, the count field would be None, which breaks my code.
I read the source code for both versions. In version 3.2, we have a method called convert_value in Count class in aggregates module. The method is:

class Count(Aggregate):
    ...
    def convert_value(self, value, expression, connection):
            return 0 if value is None else value

As we can see, this function is responsible to return 0 instead of None.
But this method is deleted in version 4.2, and the conversion is handling in convert_value property in BaseExpression class in expressions module:

class BaseExpression:
    ...
    @cached_property
    def convert_value(self):
        field = self.output_field
        internal_type = field.get_internal_type()
        if internal_type == "FloatField":
            return (
                lambda value, expression, connection: None
                if value is None
                else float(value)
            )
        elif internal_type.endswith("IntegerField"):
            return (
                lambda value, expression, connection: None
                if value is None
                else int(value)
            )
        elif internal_type == "DecimalField":
            return (
                lambda value, expression, connection: None
                if value is None
                else Decimal(value)
            )
        return self._convert_value_noop

In this property we are return a lambda function to be called later, and when we call this lambda, what ever we pass to it, we would get None as the result.
Should it be something like returning 0, if the value is None ?

Change History (22)

comment:1 by Mariusz Felisiak, 19 months ago

Resolution: invalid
Status: newclosed

This was an intended change in 9f3cce172f6913c5ac74272fa5fc07f847b4e112. You can use Coalesce() to return 0 instead, e.g.

Coalesce(Count("customer", distinct=True)), 0)

comment:2 by Simon Charette, 19 months ago

Cc: Simon Charette added

I think this should be considered a release blocker as this is an unintended regression that will affect quite a few users given how common count annotations are over multiple valued relationships.

Unlike ArrayAgg and friends in fee87345967b3d917b618533585076cbfa43451b there was not deprecation period to force the usage of default=0 for Count in 9f3cce172f6913c5ac74272fa5fc07f847b4e112.

A simple solution for a backport would be to adjust Count.__init__ to do extra.setdefault("default", 0) (or define in it's signature for enhanced introspection) so Coalesce is used.

Happy to submit a patch if you agree.

Amin, you can use Count(..., default=0) in the mean time.

in reply to:  2 comment:3 by Mariusz Felisiak, 19 months ago

Resolution: invalid
Status: closednew
Triage Stage: UnreviewedAccepted

OK, let's accept this.

Replying to Simon Charette:

I think this should be considered a release blocker as this is an unintended regression that will affect quite a few users given how common count annotations are over multiple valued relationships.

This is a regression in Django 4.0, so it doesn't qualify for a backport anymore.

comment:4 by Mariusz Felisiak, 19 months ago

Owner: changed from nobody to Simon Charette
Status: newassigned

in reply to:  2 comment:5 by Amin Aminian, 19 months ago

Replying to Simon Charette:

I think this should be considered a release blocker as this is an unintended regression that will affect quite a few users given how common count annotations are over multiple valued relationships.

Unlike ArrayAgg and friends in fee87345967b3d917b618533585076cbfa43451b there was not deprecation period to force the usage of default=0 for Count in 9f3cce172f6913c5ac74272fa5fc07f847b4e112.

A simple solution for a backport would be to adjust Count.__init__ to do extra.setdefault("default", 0) (or define in it's signature for enhanced introspection) so Coalesce is used.

Happy to submit a patch if you agree.

Amin, you can use Count(..., default=0) in the mean time.

I have read 9f3cce172f6913c5ac74272fa5fc07f847b4e112. The thing that i'm not understanding is why we want to return None instead of zero. What is the reason ? How this is benefiting us ?

As far as I have read the changes, in Django==4.2, we have an attribute called empty_result_set_value which is equal to NotImplemented in BaseExpression:

class BaseExpression:
    empty_result_set_value = NotImplemented
    ...

And aggregate classes have implemented this. All of classes have empty_result_set_value = None instead of Count class and RegrCount class, which it is empty_result_set_value = 0 in those classes. And because of this attr, we can't use default=0 in Count:

class Aggregate(Func):
    ...
    empty_result_set_value = None

    def __init__(
        self, *expressions, distinct=False, filter=None, default=None, **extra
    ):
       ...
        if default is not None and self.empty_result_set_value is not None:
            raise TypeError(f"{self.__class__.__name__} does not allow default.")
      ...

And in Count class, empty_result_set_value=0 and we use default because of that. So as far as I understand, we are considering empty_result_set_value as kind of a default value. So why don't we just return empty_result_set_value in case of being None in convert_value property ?

Last edited 19 months ago by Amin Aminian (previous) (diff)

in reply to:  4 comment:6 by Amin Aminian, 19 months ago

Replying to Mariusz Felisiak:

Can't it be assigned to me ? I really would love to solve it, if we could deal with the solution.

comment:7 by Amin Aminian, 19 months ago

Owner: changed from Simon Charette to Amin Aminian

comment:8 by Simon Charette, 19 months ago

Thanks for your consideration Mariusz, I missed that it wasn't a recent regression, time fly by!

If that can help you Amin, the ORM compiler has some logic in place to prevent doing database queries when it knows that it cannot match any rows (e.g. id__in=[]). When this happens (aggregation or when annotating subqueries that don't match any rows) we still need to provide some values to the user that are coherent with what would have happened if the query was still performed, that's when empty_result_set_value kicks in. You might notice in this patch that QuerySet.none() is used, that's a public way of making sure that evaluating the queryset results in no query.

Through a series of patches that tried to make the logic more coherent in this area, and the introduction of Aggregate(..., default=default) which is basically a shorthand for Coalesce(Aggregate(...), default), the long standing empty().aggregate(cnt=Count()) issue where {"cnt": None} was returned was fixed (due to Count.empty_result_set_value = 0) but surprisingly (this is mind blowing to me) we didn't have any tests for the case where .annotate(cnt=Cnt("possibly_empty_multi_value")) so convert_value was wrongly removed assuming that it was covered by the adjusted logic.

As for the patch I could see us going to ways

  1. Drop Count.empty_result_set_value to use extra["default"] = 0 so Coalesce(..., 0) is systematically used and we inherit the empty_result_set_value from the Coaesce and Value chain. This might require adjusting some tests that check the exact SQL generated or assert that Count(default) cannot be used.
  2. Simply add back the convert_value that were removed in 9f3cce172f6913c5ac74272fa5fc07f847b4e112
  1. seems more desirable to me as it pushes the logic to the database and avoids Python time conversions (convert_value comes at a cost as its run for each row) while 2. is less invasive as it requires no test changes and keeps the generated SQL the same (no Coalesce wrapping for each usage of Count).

Whatever solution we choose we should make sure to add address the same issue we have with RegrCount.

Last edited 19 months ago by Simon Charette (previous) (diff)

comment:9 by Amin Aminian, 19 months ago

Thanks Simon for explaining empty_result_set_value usage.

So, as far as I understood, if the query is going to return None, we return empty_result_set_value instead, without running the query. Right ? If it's right, there is still something that I'm missing.

My point is, assume that we have a query that the compiler does not know that it is going to return None (normal queries like objects.annotate(count=Count("..."))). In this case, the query is going to be run, and the result would be fetched. Now why we can't use empty_result_set_value in case of result is None ?

What I am saying is a code like this:

class BaseExpression:
    ...
    @cached_property
    def convert_value(self):
        field = self.output_field
        internal_type = field.get_internal_type()
        ...
        elif internal_type.endswith("IntegerField"):
            return (
                lambda value, expression, connection: self.empty_result_set_value # instead of None
                if value is None
                else int(value)
            )
        ...
        return self._convert_value_noop

And about Python time conversions, we are already running this convert_property for each row.

comment:10 by Simon Charette, 19 months ago

So, as far as I understood, if the query is going to return None, we return empty_result_set_value instead, without running the query. Right ? If it's right, there is still something that I'm missing.

That's not exactly how it works, there's a difference between returning None and dealing with an empty result. Some aggregate functions might return None even when not dealing with empty results (e.g. when aggregation over a set of value that contain a NULL). It is not the case for Count but that's the rationale for empty_result_set_value being a distinct feature from default and how convert_value deals with None.

My point is, assume that we have a query that the compiler does not know that it is going to return None (normal queries like objects.annotate(count=Count("..."))). In this case, the query is going to be run, and the result would be fetched. Now why we can't use empty_result_set_value in case of result is None ?

Because empty_result_set_value is used a single very peculiar case which is when an EmptyResultSet is exception is raised at query compilation time. The change you are proposing will work but they blur the line in terms of empty_result_set_value usage. Why should expressions with an IntegerField as an output_field default to empty_result_set_value when other types don't? It also impacts the locality of change since one can no longer see that Count is treated differently that the normal COUNT SQL function solely by looking at its definition.

And about Python time conversions, we are already running this convert_property for each row.

Correct but adding even more logic to it makes it harder to remove eventually which is something option 1. would eventually benefit from. In reality these converters would only need to be run if an explicit output_field is provided otherwise the ORM has the possibility to generate the proper SQL and backend pecularities can be dealt with get_db_converters.

Since the regression is solely about Count, I believe that either 1. or 2. still remain the two best options.

comment:11 by Amin Aminian, 18 months ago

Allright then.
I will start working on first solution, as it sounds better for me, if anything sounded like being complicated, I just simply use second solution for now.
Thanks

Last edited 18 months ago by Amin Aminian (previous) (diff)

comment:12 by Amin Aminian, 18 months ago

Hi again.

I implemented the first solution, I want to just double check the output ...

I removed empty_result_set_value class attr from both Count and RegrCount, so basically they are inheriting this attr from Aggregate class.

In __init__ method of Count and RegrCount class, I called super()... with default=0, so the default value is 0 for Count. But still, I don't let user pass default value for Count and the default value would be 0 all the time.

So basically Coalesce is appended to Count query with this default value. I fixed all tests that had SELECT COUNT(...) and changed it to SELECT COALESCE(COUNT(...), 0).

Is everything right and as we wanted ?

Thanks for your time Simon!

comment:13 by Simon Charette, 18 months ago

Yep that sounds about right Amin!

Submitting a PR with these changes should validate that everything work as expected.

Thanks for the report, rubber ducking, and working on a patch!

comment:14 by Amin Aminian, 18 months ago

Has patch: set

comment:15 by Amin Aminian, 18 months ago

Resolution: fixed
Status: assignedclosed
Triage Stage: AcceptedReady for checkin

comment:16 by Mariusz Felisiak, 18 months ago

Resolution: fixed
Status: closednew
Triage Stage: Ready for checkinAccepted

Ticket is not fixed until PR is merged.

comment:17 by Mariusz Felisiak, 18 months ago

Needs tests: set
Patch needs improvement: set
Status: newassigned

comment:18 by Amin Aminian, 18 months ago

Hi again.
I was trying to write a regression test that I faced an issue.
Apparently, Count function is returning 0 usually. But I was unlucky enough to face some situations that Count return None.

One situation is during my tests, when using Timescaledb (Postgres extension). In this scenario, I'm getting None instead of 0 when using Count.
And the another situation is on production server, when some complicated queries are executed and I'm getting None there too. I cannot find out why that leads to None, as I get 0 when using similar queries.

So I'm basically asking, do you know some simple scenarios that DB would return None ? (I suppose the old convert_value function was there for those scenarios)

And if your not, is it valid to use Timescaledb in test ? (Actually I have written such a test already for TDD development during this PR, and my changed fixes my issue with that)

Last edited 18 months ago by Amin Aminian (previous) (diff)

comment:19 by Simon Charette, 18 months ago

Django doesn't support TimescaleDB so any change against it is not considered a regression from our perspective. You might have run into bugs with TimescaleDB for all we know.

I suppose the old convert_value function was there for those scenarios

It could have been there solely to support a subset of the cases that empty_result_set_value covers for aggregation and selects now. This code was introduced in f59fd15c4928caf3dfcbd50f6ab47be409a43b01 without a clear reason. We'd have to checkout the code from that time and run the suite to be sure.

The fact you cannot reproduce against SQLite, Postgres, MySQL, or Oracle might explain why this was not caught by the test suite in the first place which would make more sense given how common such annotations are.

Here are my attempts at reproducing

Last edited 18 months ago by Simon Charette (previous) (diff)

in reply to:  19 comment:20 by Amin Aminian, 18 months ago

Replying to Simon Charette:

The fact you cannot reproduce against SQLite, Postgres, MySQL, or Oracle might explain why this was not caught by the test suite in the first place which would make more sense given how common such annotations are.

I'm sure about this problem on Timescale, and as it is an extension for Postgres, I think it is a corner case for Postgres (or even other DBs).

I should have tried to do this regression test earlier. It is my bad, sorry about that.

But still, I suggest to simply have old convert_value to prevent corner cases for now and for the future.

comment:21 by Simon Charette, 18 months ago

Resolution: invalid
Status: assignedclosed

Unfortunately this was not even demonstrated to be a corner case for Postgres, if it was the case we'd certainly accept a patch for it.

We can't keep code around that we can't test for regression against I'm afraid.

If someone is able to write a test demonstrating that Count can return NULL values from modern SQL standard or against any supported database backend that might diverge from the standard (Postgres, SQLite, MySQL, Oracle) we can re-open this ticket but in the mean time this ticket doesn't seem actionable.

comment:22 by David Sanders, 18 months ago

PG docs confirms that count(<expression>) must always return a result as it counts the number of rows for which <expression> is not null:

count ( "any" ) → bigint

Computes the number of input rows in which the input value is not null.

My advice would be to confirm this isn't documented behaviour with Timescale, if it isn't then report the issue ¯\_(ツ)_/¯.

Note: See TracTickets for help on using tickets.
Back to Top