#37126 closed New feature (needsnewfeatureprocess)

Make Task and TaskResult comparable

Reported by:	Johannes Maron	Owned by:	zky
Component:	Tasks	Version:	dev
Severity:	Normal	Keywords:
Cc:	Johannes Maron, Jake Howard	Triage Stage:	Unreviewed
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	yes	UI/UX:	no

Description

Jake and I have been discussing changes to the Task framework for 6.2 with a focus on performance and versatility.

Python's queue.PriorityQueue implementation requires objects to be comparable.

Since the base implementation of a Task implements a priority, it only makes sense to provide basic comparability based on the priority and date.

Dataclasses have a neat order=True option that makes this stupidly simple.
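As a rough illustration of the idea, here is a minimal sketch using a hypothetical `SimpleTask` dataclass, not Django's actual `Task`. The field names and the "smallest value is served first" behaviour are assumptions made for the example only.

```python
import queue
from dataclasses import dataclass, field
from datetime import datetime


# Hypothetical stand-in for a task; this is not Django's Task class.
@dataclass(order=True, frozen=True)
class SimpleTask:
    priority: int
    run_after: datetime
    name: str = field(compare=False)  # excluded from ordering and equality


pq = queue.PriorityQueue()
pq.put(SimpleTask(3, datetime(2025, 1, 1), "meh"))
pq.put(SimpleTask(1, datetime(2025, 1, 2), "important"))
pq.put(SimpleTask(1, datetime(2025, 1, 1), "important-and-earlier"))

# PriorityQueue hands out the smallest item first; order=True compares
# the (priority, run_after) fields in declaration order.
drained = [pq.get().name for _ in range(3)]
print(drained)
```

Because `order=True` compares fields in declaration order, the ordering here depends on `priority` being declared before `run_after`.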

Change History (25)

comment:1 by Johannes Maron, 2 months ago

I would also suggest adding a custom __eq__ method. We only need to compare id, and a simple string comparison is much quicker. Benchmarks with N = 100_000_000:

Original == (equal)             18.614s  (186.1 ns/call)
Original == (not equal)         7.452s  (74.5 ns/call)
Id-only == (equal)              4.434s  (44.3 ns/call)
Id-only == (not equal)          4.423s  (44.2 ns/call)
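
For illustration, an id-only comparison could also be written by hand. This is a sketch on a hypothetical `ResultSketch` class, not the actual `TaskResult`. It assumes `eq=False` so that the dataclass does not generate its own field-by-field `__eq__` and `__hash__`.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical result type; eq=False keeps the dataclass machinery from
# generating field-by-field __eq__ and __hash__ methods.
@dataclass(frozen=True, eq=False)
class ResultSketch:
    id: str
    args: list[Any]

    def __eq__(self, other):
        if not isinstance(other, ResultSketch):
            return NotImplemented
        return self.id == other.id

    def __hash__(self):
        # Keep hashing consistent with the id-only equality above.
        return hash(self.id)


a = ResultSketch(id="abc123", args=[1])
b = ResultSketch(id="abc123", args=[2])
c = ResultSketch(id="other", args=[1])
print(a == b, a == c)
```

Note the trade-off in this sketch: two objects with the same id compare equal even when their other fields differ.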

comment:2 by Johannes Maron, 2 months ago

Missing benchmark sources for my previous comment:

import timeit
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

from django.utils.json import normalize_json
from django.tasks.base import Task, TaskResult, TaskResultStatus

# --- Real Task instance ---

def my_func():
    pass

real_task = Task.__new__(Task)
object.__setattr__(real_task, "func", my_func)
object.__setattr__(real_task, "priority", 0)
object.__setattr__(real_task, "backend", "default")
object.__setattr__(real_task, "queue_name", "default")
object.__setattr__(real_task, "run_after", None)
object.__setattr__(real_task, "takes_context", False)

# --- Shared kwargs ---

now = datetime.now()

common_kwargs = dict(
    task=real_task,
    id="abc123",
    status=TaskResultStatus.SUCCESSFUL,
    enqueued_at=now,
    started_at=now,
    finished_at=now,
    last_attempted_at=now,
    args=[],
    kwargs={},
    backend="default",
    errors=[],
    worker_ids=["worker-1"],
)

# --- id-only equality variant ---

@dataclass(frozen=True, slots=True, kw_only=True)
class TaskResultIdEq:
    task: Any = field(compare=False)
    id: str
    status: Any = field(compare=False)
    enqueued_at: datetime | None = field(compare=False)
    started_at: datetime | None = field(compare=False)
    finished_at: datetime | None = field(compare=False)
    last_attempted_at: datetime | None = field(compare=False)
    args: list[Any] = field(compare=False)
    kwargs: dict[str, Any] = field(compare=False)
    backend: str = field(compare=False)
    errors: list = field(compare=False)
    worker_ids: list[str] = field(compare=False)
    _return_value: Any | None = field(init=False, default=None, compare=False)

    def __post_init__(self):
        object.__setattr__(self, "args", normalize_json(self.args))
        object.__setattr__(self, "kwargs", normalize_json(self.kwargs))

# --- Instances to compare ---

a_full = TaskResult(**common_kwargs)
b_full = TaskResult(**common_kwargs)                                      # equal
c_full = TaskResult(**{**common_kwargs, "id": "other"})                  # not equal

a_id = TaskResultIdEq(**common_kwargs)
b_id = TaskResultIdEq(**common_kwargs)                                    # equal
c_id = TaskResultIdEq(**{**common_kwargs, "id": "other"})                # not equal

N = 100_000_000

results = {
    "Original == (equal)":     timeit.timeit(lambda: a_full == b_full, number=N),
    "Original == (not equal)": timeit.timeit(lambda: a_full == c_full, number=N),
    "Id-only == (equal)":      timeit.timeit(lambda: a_id == b_id, number=N),
    "Id-only == (not equal)":  timeit.timeit(lambda: a_id == c_id, number=N),
}

for label, t in results.items():
    print(f"{label:<30}  {t:.3f}s  ({t/N*1e9:.1f} ns/call)")

comment:3 by zky, 2 months ago

Hi Johannes, thanks for providing such clear benchmarks! The performance gains for the task queue are very significant. I'm quite interested in this and would love to help take the implementation work off your hands.

comment:4 by zky, 2 months ago

Owner:	set to zky
Status:	new → assigned
Triage Stage:	Unreviewed → Accepted

comment:5 by zky, 2 months ago

Hi Johannes,

Regarding the sorting implementation, while @dataclass(order=True) is indeed stupidly simple, I noticed two potential issues.

First, it makes the sorting logic strictly dependent on the physical order of the class attributes, which could cause regressions if someone accidentally reorders them in the future.

More importantly, since run_after is typed as datetime | None, using order=True will crash the queue with a TypeError (comparing NoneType and datetime) whenever a task scheduled to run immediately (run_after=None) is compared against a scheduled task.

To prevent the queue from crashing and to make the codebase more defensive, would it be safer to explicitly implement __lt__ and __eq__? We can safely handle the None fallback inside __lt__ by comparing tuples like (self.run_after is not None, self.run_after).

Let me know your thoughts!

follow-up: 7 comment:6 by Johannes Maron, 2 months ago

Hi there,

Good call. The dataclasses factory utilities are mainly supposed to reduce boilerplate. They just create classes with the very same methods for you. If it gets in the way, as it seems to do here, it is an excellent choice to do things “manually.”

Idioms and tools should never get in your way. That sentiment speaks much to the core of Django and its success. In other words, I believe you are on the right path :)

Cheers!
Joe

in reply to: 6 comment:7 by zky, 2 months ago

Replying to Johannes Maron:

Thank you for the encouraging words! I completely agree: pragmatism over strict adherence to tools is exactly what makes Django's design philosophy so great to work with. I've already submitted the PR: https://github.com/django/django/pull/21383

comment:8 by zky, 2 months ago

Has patch:	set

https://github.com/django/django/pull/21383

comment:9 by Johannes Maron, 2 months ago

Needs documentation:	set
Patch needs improvement:	set

comment:10 by zky, 2 months ago

Patch needs improvement:	unset

comment:11 by Johannes Maron, 2 months ago

Needs tests:	set
Patch needs improvement:	set

comment:12 by zky, 8 weeks ago

Patch needs improvement:	unset

comment:13 by Johannes Maron, 8 weeks ago

Needs tests:	unset
Patch needs improvement:	set

comment:14 by Johannes Maron, 7 weeks ago

Needs documentation:	unset

comment:15 by Johannes Maron, 7 weeks ago

Patch needs improvement:	unset
Triage Stage:	Accepted → Ready for checkin

comment:16 by Johannes Maron, 7 weeks ago

Patch needs improvement:	set
Triage Stage:	Ready for checkin → Accepted

comment:17 by zky, 7 weeks ago

Patch needs improvement:	unset

comment:18 by Johannes Maron, 7 weeks ago

Triage Stage:	Accepted → Ready for checkin

comment:19 by Sarah Boyce, 3 weeks ago

Cc:	Jake Howard added
Resolution:	→ needsnewfeatureprocess
Status:	assigned → closed
Triage Stage:	Ready for checkin → Unreviewed

Apologies if I have missed further discussion here, but I believe this new feature got accepted without clear consensus.

There is a forum discussion but from what I can see:

Make task results comparable: This one is interesting to me - what value do you see in them being comparable? It’s more than just priority - in ORM speak it should be [F("priority").desc(), F("run_after").asc(), F("enqueued_at").asc()]. I’m not opposed to implementing that, but I’m not sure I see the value (at least wide enough to be implemented in core).

I see PriorityQueue has been mentioned in this ticket but not in the forum discussion; it may be the main motivation and was perhaps missed there.
I don't see a discussion on https://github.com/django/new-features/issues

I am going to mark this as needsnewfeatureprocess. We can reopen once it is clearer that this is something we want to support.

comment:20 by Johannes Maron, 3 weeks ago

I think Jake just used the ORM example to communicate how it should be sorted; I think there was no misunderstanding about implementing __lt__.

Anyhow, is there a question about the why? Well, because a collection of priority tasks should be sortable. The API's priority argument directly implies this functionality. Therefore, I still don't think it's a new feature; we already have consensus on having a task order; it simply wasn't implemented properly. Hence the ticket here.

comment:21 by Johannes Maron, 3 weeks ago

To better illustrate my point, forget the priority queue: I'd currently expect this to work:

>>> meh = Task(priority=3)
>>> super_important = Task(priority=1)
>>> sorted([meh, super_important])
[super_important, meh]

It doesn't.
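
The failure mode can be reproduced without Django. The sketch below uses a hypothetical `PlainTask` dataclass that, like any dataclass without `order=True` or `__lt__`, cannot be passed to `sorted()` directly.

```python
from dataclasses import dataclass


# Hypothetical stand-in: a dataclass with neither order=True nor __lt__.
@dataclass(frozen=True)
class PlainTask:
    priority: int


try:
    sorted([PlainTask(3), PlainTask(1)])
    error = None
except TypeError as exc:
    error = str(exc)  # "'<' not supported between instances of ..."

# Sorting still works when the caller supplies an explicit key.
by_key = sorted([PlainTask(3), PlainTask(1)], key=lambda t: t.priority)
print(error)
```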


follow-up: 23 comment:22 by Sarah Boyce, 3 weeks ago

I’m not opposed to implementing that, but I’m not sure I see the value (at least wide enough to be implemented in core).

I do think he was questioning whether we need this in core. I see that django-tasks-db has implemented its own sorting without that being defined in core.

Note that as we currently only have DummyBackend and ImmediateBackend in core, for which a queue execution order does not make sense, attributes like priority and run_after still feel quite abstract to me.

This feels like it should be part of a discussion around standardizing how task queues should be executed. For example, we haven't defined an API for getting all tasks in a queue in order, which might then need work like this. I don't want us to add small bits here and there without having a clear idea of the boundary of what Django core should be doing in the tasks space.

in reply to: 22 comment:23 by Jake Howard, 3 weeks ago

I do think he was questioning whether we need this in core.

Exactly right.

I see that django-tasks-db has implemented its own sorting

Not quite. The sorting there is for the queue itself (and thus execution order). There should definitely be a canonical, backend-agnostic sorting semantic, which is what I defined in ORM syntax (for convenience) in the forum.

Given there is (or should be) a canonical order, having that implemented when sorting TaskResult sounds sensible. However, it's the value which is lost on me. If you're implementing your own queue, it's not a huge ask to implement ordering yourself. And if you are, it's more likely there's an underlying storage object which needs sorting (e.g. DBTaskResult) rather than TaskResult itself. Therefore, ordering TaskResult likely adds little value.
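
As one illustration of implementing the ordering in the queue rather than on the task, entries can be wrapped in tuples so the task object itself is never compared. This is a sketch with a hypothetical `OpaqueTask` type. Serving larger priority numbers first is an assumption.

```python
import itertools
import queue
from dataclasses import dataclass


# Hypothetical task type with no ordering methods at all.
@dataclass(frozen=True)
class OpaqueTask:
    name: str
    priority: int


counter = itertools.count()  # tie-breaker, so tasks are never compared
pq = queue.PriorityQueue()

for task in [OpaqueTask("a", 1), OpaqueTask("b", 5), OpaqueTask("c", 5)]:
    # Negate priority so that a larger number is served first (assumption).
    pq.put((-task.priority, next(counter), task))

order = [pq.get()[2].name for _ in range(3)]
print(order)
```

The counter preserves insertion order among entries with equal priority.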

I agree with Sarah's decision though to push this over to new-features, to see if there's enough community drive or other caveats we've not considered.

In general, I think pushing suggestions for Tasks through new-features rather than direct to ticket is the right route forward, especially for changes which aren't more objective bugs.

comment:24 by Johannes Maron, 3 weeks ago

Honestly, since most queues will not be implemented in pure Python but depend on some kind of broker, I absolutely agree with the practicality concern.

For me, this was more a question of API completeness. If Django has tasks with a priority, and Python ships with PriorityQueue, I just figured those two should work together. Especially since adding a __lt__ function wasn't far-fetched.

Now, with a bit of distance, I see another important function, which makes me rethink the new-feature route. How do we actually sort tasks? By priority… probably, and then LIFO? We should probably have a reference implementation with documentation to serve as a reliable standard for all backends.
Switching from one backend to another should still give you consistent sorting.

comment:25 by Jake Howard, 3 weeks ago

I've raised this discussion as an issue on new-features, to confirm everyone is happy with the semantics django-tasks-db uses, but also to discuss how we standardise on them.

https://github.com/django/new-features/issues/188
