Opened 3 months ago

Closed 5 weeks ago

#36439 closed Cleanup/optimization (fixed)

Auth hashing blocks event loop if using asyncio

Reported by: Robert Aistleitner Owned by: Roelzkie
Component: contrib.auth Version: 5.2
Severity: Normal Keywords: async, auth, asyncio, performance
Cc: Robert Aistleitner, Roelzkie Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

To a large extent, issues with auth in async auth has been fixed in https://github.com/django/django/commit/50f89ae850f6b4e35819fe725a08c7e579bfd099, I guess this is the last part of proper async implementation of authentication.

There exist custom implementations for asyncio regarding checking a password using the configured hashers which can be found in https://github.com/django/django/blob/68c9f7e0b79168007e6ba0139fd315d7c44ca8c9/django/contrib/auth/hashers.py#L86-L91.

This implementation has a flaw because the CPU heavy calculation of the hash is blocking the event loop of asyncio, causing the whole server to stall and queueing up all the following authentications that may rush in in case of heavy load.

My proposal is to use a ThreadPoolExecutor here to be able to unload the work from the event loop:

CHECK_PASSWORD_THREAD_POOL_EXECUTOR = ThreadPoolExecutor(16)

async def acheck_password(password, encoded, setter=None, preferred="default"):
    """See check_password()."""

    # verify_password is cpu heavy and needs to be executed in a separate thread to not block a running asyncio event loop
    is_correct, must_update = await sync_to_async(
        verify_password, thread_sensitive=False, executor=CHECK_PASSWORD_THREAD_POOL_EXECUTOR
    )(password, encoded, preferred=preferred)

    if setter and is_correct and must_update:
        await setter(password)
    return is_correct

The number of available thread could be exposed via setting, or skipped altogether by using a new thread on each verify_password call. What are your thoughts on this?

Change History (18)

comment:1 by Robert Aistleitner, 3 months ago

Version: 5.15.2

comment:2 by Carlton Gibson, 3 months ago

Triage Stage: UnreviewedAccepted

Yes, OK... — I'll accept this as certainly in theory. I'd like to see some basic profiling of the change as part of the development, but it's likely correct using any decent hasher.

comment:3 by Robert Aistleitner, 3 months ago

Should I provide a patch for this improvement? Not really sure if there's anything I need to provide besides the patch itself, because tests basically already test the function extensively.

comment:4 by Jake Howard, 3 months ago

A patch (PR), particularly with benchmarks, sounds great. The sync_to_async call is probably doing most of the work - it's possible there's no need for a dedicated threadpool just for this hashing. But again, a patch and some benchmark figures will help make that decision.

in reply to:  4 comment:5 by Roelzkie, 2 months ago

Replying to Jake Howard:

A patch (PR), particularly with benchmarks, sounds great. The sync_to_async call is probably doing most of the work - it's possible there's no need for a dedicated threadpool just for this hashing. But again, a patch and some benchmark figures will help make that decision.

Hello, I see the status of this ticket is Accepted without an owner. I wonder if you can allow me to spend time doing a basic profiling for the above-suggested solution, and make a comparison to help with your decision. Thank you.


EDIT: I made some basic profiling using cProfile for 1000 users using an SQLite database.

  1. Current: 94770 function calls (94752 primitive calls) in 251.144 seconds. See all logs.
  2. With sync_to_async: 407770 function calls (405752 primitive calls) in 251.129 seconds. See all logs.
  3. With ThreadPoolExecutor(4): 390842 function calls (389818 primitive calls) in 254.972 seconds for 1000 users. See all logs.
  4. With ThreadPoolExecutor(8): 390842 function calls (389818 primitive calls) in 251.255 seconds for 1000 users. See all logs.
  5. With ThreadPoolExecutor(12): 390842 function calls (389818 primitive calls) in 254.198 seconds for 1000 users. See all logs.
  6. ThreadPoolExecutor(16): 390842 function calls (389818 primitive calls) in 254.155 seconds for 1000 users. See all logs.

They're almost has no difference, but I'm not 100% sure if I'm doing it correctly. Below is the python script for profiling:

import os

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
django.setup()

# --
import asyncio
import cProfile

from django.contrib.auth.models import User


async def main():

    profiler = cProfile.Profile()
    profiler.enable()

    async for user in User.objects.all()[:1000]:
        await user.acheck_password(user.username), user.username

    profiler.disable()
    profiler.print_stats(sort='cumulative')


if __name__ == "__main__":
    asyncio.run(main())

System information:

$ sysctl -a | grep machdep.cpu
machdep.cpu.cores_per_package: 8
machdep.cpu.core_count: 8
machdep.cpu.logical_per_package: 8
machdep.cpu.thread_count: 8
machdep.cpu.brand_string: Apple M1

$ python --version
Python 3.13.2

$ python -c "import django; print(django.get_version());"
5.2.3
Last edited 2 months ago by Roelzkie (previous) (diff)

comment:6 by Simon Charette, 2 months ago

There appears to be a few things wrong with your current benchmark.

First you don't provide which hasher you used and and your setup steps most importantly whether or not you first set a password for the users you are testing against. If you didn't (e.g. you only set a username with an invalid password) then it's highly likely all underlying verify_password calls never get to the point of hashing user.username and then performing a constant time compare which are the expensive operations CPU wise that would benefit from executing in a thread pool.

Secondly the way you're iterating over each awaitable serially in a loop prevents any concurrent scheduling execution from taking place thus you most likely wouldn't notice if the event loop was blocked as you're only processing one task at a time. You'd want to buffer up futures and await them all so they can step on each others toes a bit if you hope to catch any interference between them.

in reply to:  6 comment:7 by Roelzkie, 2 months ago

Replying to Simon Charette:

There appears to be a few things wrong with your current benchmark.

First you don't provide which hasher you used and and your setup steps most importantly whether or not you first set a password for the users you are testing against. If you didn't (e.g. you only set a username with an invalid password) then it's highly likely all underlying verify_password calls never get to the point of hashing user.username and then performing a constant time compare which are the expensive operations CPU wise that would benefit from executing in a thread pool.

Secondly the way you're iterating over each awaitable serially in a loop prevents any concurrent scheduling execution from taking place thus you most likely wouldn't notice if the event loop was blocked as you're only processing one task at a time. You'd want to buffer up futures and await them all so they can step on each others toes a bit if you hope to catch any interference between them.

Hello, my bad, you are right. I'm executing the code serially with awaitables. I put them all into a task group, and the gap in performance is huge with a ThreadPoolExecutor.

This is for 100 users only having passwords as their usernames, and I'm using the default PBKDF2PasswordHasher hasher. I will also check other hashers later on:

  1. Current: 14138 function calls (14120 primitive calls) in 24.844 seconds
  2. With sync_to_async: 46823 function calls (45895 primitive calls) in 24.615 seconds
  3. With ThreadPoolExecutor(8): 45522 function calls (43874 primitive calls) in 5.565 seconds
import os
from datetime import datetime

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
django.setup()

# --
import asyncio
import cProfile

from django.contrib.auth.models import User


async def main():

    profiler = cProfile.Profile()
    profiler.enable()

    async with asyncio.TaskGroup() as tg:
        # These users were created in a different script, with their username as their password.
        async for user in User.objects.all()[:100]:
            tg.create_task(user.acheck_password(user.username))

    profiler.disable()
    profiler.print_stats(sort='cumulative')


if __name__ == "__main__":
    asyncio.run(main())

Last edited 2 months ago by Roelzkie (previous) (diff)

comment:8 by Roelzkie, 2 months ago

More benchmark tests for different hashers with 100 users:

PBKDF2PasswordHasher 100 users

  • Current: 25.555 seconds
  • With sync_to_async: 25.340 seconds
  • With ThreadPoolExecutor(8): 7.141 seconds

PBKDF2SHA1PasswordHasher 100 users

  • Current: 26.066 seconds
  • With sync_to_async: 25.890 seconds
  • With ThreadPoolExecutor(8): 7.436 seconds

Argon2PasswordHasher 100 users

  • Current: 3.568 seconds
  • With sync_to_async: 3.469 seconds
  • With ThreadPoolExecutor(8): 2.683 seconds

BCryptSHA256PasswordHasher 100 users

  • Current: 23.532 seconds
  • With sync_to_async: 23.347 seconds
  • With ThreadPoolExecutor(8): 6.435 seconds

ScryptPasswordHasher 100 users

  • Current: 16.721 seconds
  • With sync_to_async: 16.282 seconds
  • With ThreadPoolExecutor(8): 4.782 seconds

Certainly, having a ThreadPoolExecutor has a wide performance gap compared to others without it, as theorized by Robert. I'm happy to provide a patch for this if you'll allow me. Not sure if we need to create a test for this but let me know.

comment:9 by Roelzkie, 2 months ago

Cc: Roelzkie added
Owner: set to Roelzkie
Status: newassigned

comment:10 by Roelzkie, 2 months ago

Created a PR. Not 100% sure about the PR, but if you have any thoughts to improve it, I'll be happy to modify the PR. Thank you.

comment:11 by Roelzkie, 2 months ago

Has patch: set

comment:12 by Sarah Boyce, 7 weeks ago

Patch needs improvement: set

comment:13 by Roelzkie, 7 weeks ago

I made a new benchmarking with the latest main. The result is very different from my previous testing, and I'm not so sure if it's due to the rebase from recent main or due to some changes in my local setup.

These tests were conducted on 100 users with a SQLite database, and the results of using sync_to_async with or without ThreadPoolExecutor showed no significant difference. However, it's 5x faster than the current version. For Argon2, the performance shows almost no difference, at least for the SQLite database.

PBKDF2PasswordHasher PBKDF2SHA1PasswordHasher Argon2PasswordHasher BCryptSHA256PasswordHasher ScryptPasswordHasher
Current 14.103s 15.578s 0.005s 23.487s 16.11s
With sync_to_async 3.124s 2.957s 0.004s 3.867s 3.335s
With sync_to_async and ThreadPoolExecutor 2.775s 2.855s 0.004s 3.804s 3.661s

comment:14 by Roelzkie, 7 weeks ago

Patch needs improvement: unset

comment:15 by Sarah Boyce, 6 weeks ago

Needs tests: set

Marking as "needs tests" but I mean that it needs a benchmark script which is then verified by others

comment:16 by Roelzkie, 6 weeks ago

Needs tests: unset

Benchmark is added here: https://github.com/django/django-asv/pull/94. Instructions are included. Hope it's clear.

comment:17 by Sarah Boyce, 5 weeks ago

Triage Stage: AcceptedReady for checkin

comment:18 by Sarah Boyce <42296566+sarahboyce@…>, 5 weeks ago

Resolution: fixed
Status: assignedclosed

In 748ca0a:

Fixed #36439 -- Optimized acheck_password by using sync_to_async on verify_password.

Note: See TracTickets for help on using tickets.
Back to Top