Opened 7 days ago

Closed 4 days ago

Last modified 4 days ago

#36190 closed Cleanup/optimization (wontfix)

High memory usage of CommonPasswordValidator

Reported by: Michel Le Bihan
Owned by: Priyanshu Singh Panda
Component: contrib.auth
Version: dev
Severity: Normal
Keywords: CommonPasswordValidator
Cc: Priyanshu Singh Panda
Triage Stage: Accepted
Has patch: no
Needs documentation: yes
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Hello,

I noticed that CommonPasswordValidator has very high memory usage. Loading the piotrcki-wordlist-top10m.txt list, which has 9872702 entries, results in about 755 MB of memory usage even though the list is only 87 MB on disk. Since CommonPasswordValidator basically only needs to check for membership, I think that a variant of a bloom filter would be much better than a hash table (set()).
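A rough reproduction along these lines (the deep-size figure is only an approximation of what the attached test scripts measure):

import sys
import time

# Path to the wordlist mentioned above (not bundled with Django).
PASSWORD_LIST = "piotrcki-wordlist-top10m.txt"

start = time.perf_counter()
with open(PASSWORD_LIST, encoding="utf-8", errors="ignore") as f:
    passwords = {line.strip() for line in f}
print("Reading password list took:", time.perf_counter() - start)

# Approximate deep size: the set's own footprint plus each string object.
size = sys.getsizeof(passwords) + sum(sys.getsizeof(p) for p in passwords)
print("Size of passwords in MB:", size / 2**20)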

Attachments (2)

test2.py (661 bytes ) - added by Michel Le Bihan 4 days ago.
test.py (496 bytes ) - added by Michel Le Bihan 4 days ago.


Change History (16)

comment:1 by Michel Le Bihan, 7 days ago

Component: Uncategorized → contrib.auth
Keywords: CommonPasswordValidator added

comment:2 by Priyanshu Singh Panda, 6 days ago

Owner: set to Priyanshu Singh Panda
Status: new → assigned
Triage Stage: Unreviewed → Accepted
Type: Uncategorized → Cleanup/optimization

comment:3 by Sarah Boyce, 6 days ago

Resolution: set to needsinfo
Status: assigned → closed
Triage Stage: Accepted → Unreviewed

Michel, can you please share steps to reproduce? This will help in investigating and resolving any issues.

in reply to:  3 comment:4 by Priyanshu Singh Panda, 5 days ago

Replying to Sarah Boyce:

Michel, can you please share steps to reproduce? This will help in investigating and resolving any issues.

I was able to reproduce this using the "piotrcki-wordlist-top10m.txt" file, which contains around 10 million commonly used passwords. CommonPasswordValidator loads the file and stores the passwords in a set. By default, Django ships a list of only 20,000 passwords, which is far smaller than this file, so loading the larger list requires correspondingly more memory.

# django/contrib/auth/password_validation.py (CommonPasswordValidator.__init__)
try:
    with gzip.open(password_list_path, "rt", encoding="utf-8") as f:
        self.passwords = {x.strip() for x in f}
except OSError:
    with open(password_list_path) as f:
        self.passwords = {x.strip() for x in f}

I'm currently working on addressing the memory issue caused by loading such a large file. Please assign this task to me so I can continue working on optimizing it.

comment:5 by Michel Le Bihan, 5 days ago

Hello,

Thanks for looking at this. How are you planning to reduce memory usage?

in reply to:  5 comment:6 by Priyanshu Singh Panda, 5 days ago

Replying to Michel Le Bihan:

Hello,

Thanks for looking at this. How are you planning to reduce memory usage?

I plan to optimize memory and processing time by implementing a Bloom filter. A Bloom filter stores membership in a fixed-size bit array rather than keeping every password string in memory, trading a small, configurable false positive rate for a large memory reduction. I have validated the improvement using the @profile decorator (memory_profiler), which reports memory usage per line of code. With the Bloom filter in place, both time and memory usage dropped significantly.
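A minimal sketch of the idea (the class name and parameters are illustrative, not the actual patch):

import hashlib
import math

class BloomFilter:
    # Probabilistic membership test: no false negatives, tunable false positive rate.

    def __init__(self, n_items, fp_rate):
        # Optimal bit count m and hash count k for the target false positive rate.
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of a SHA-256 digest
        # (Kirsch-Mitzenmacher double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))

At a 1-in-10,000 false positive rate, 10 million entries fit in roughly 23 MB of bits, versus ~755 MB for the set.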

comment:7 by Michel Le Bihan, 5 days ago

How will you deal with false positives? Maybe using a sorted array of 64-bit hashes and doing a binary search on it would give a much lower FP ratio.
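A sketch of that suggestion (helper names are illustrative; BLAKE2 is just one choice of 64-bit hash):

import bisect
import hashlib
from array import array

def hash64(password):
    # Stable 64-bit digest of the password.
    return int.from_bytes(
        hashlib.blake2b(password.encode("utf-8"), digest_size=8).digest(), "big"
    )

def build_index(passwords):
    # One unsigned 64-bit slot per password: ~8 bytes/entry, ~80 MB for 10M entries.
    return array("Q", sorted(hash64(pw) for pw in passwords))

def probably_contains(index, password):
    # Binary search; a false positive requires a 64-bit hash collision.
    h = hash64(password)
    i = bisect.bisect_left(index, h)
    return i < len(index) and index[i] == h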

in reply to:  7 comment:8 by Priyanshu Singh Panda, 4 days ago

Replying to Michel Le Bihan:

How will you deal with false positives? Maybe using a sorted array of 64-bit hashes and doing a binary search on it would give a much lower FP ratio.

Both approaches can keep false positives under control. I’ve already set a custom false positive rate for the Bloom filter, which gives a good balance between memory usage and performance. However, as you mentioned, a sorted array of 64-bit hashes with binary search would lower the false positive rate further.

Would you recommend switching to that method, or keeping the Bloom filter with the custom FP rate? I can also implement the sorted-hash approach if needed for further optimization.

Please let me know how you'd like to proceed, and kindly assign this task to me.

comment:9 by Michel Le Bihan, 4 days ago

That's a question that should be answered by the Django team...

comment:10 by Michel Le Bihan, 4 days ago

Resolution: needsinfo deleted
Status: closed → new

comment:11 by Priyanshu Singh Panda, 4 days ago

Cc: Priyanshu Singh Panda added
Needs documentation: set
Status: new → assigned
Triage Stage: Unreviewed → Ready for checkin
Version: 5.1 → dev

comment:12 by Priyanshu Singh Panda, 4 days ago

Triage Stage: Ready for checkin → Accepted

comment:13 by Sarah Boyce, 4 days ago

Resolution: set to wontfix
Status: assigned → closed

There is often a trade-off between memory usage and performance.

For CommonPasswordValidator to compromise on accuracy via a bloom filter would be backwards incompatible, and therefore not acceptable without a strong consensus from the community.

In Django, you can write your own password validators. In this case, a custom very large file is being used, and so it might make sense to use a custom validator that uses a bloom filter (to deal with the very large file). This custom validator could be provided by a third-party package.

Anyone is welcome to discuss this further on the Django forum.
Note that any PR to improve the current state should not compromise on accuracy and needs to include performance and memory benchmarking.
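For anyone exploring the third-party route: a custom validator only needs validate() and get_help_text(). A sketch (the class name is hypothetical and the bloom filter's construction is left abstract):

from django.core.exceptions import ValidationError
from django.utils.translation import gettext as _

class BloomCommonPasswordValidator:
    def __init__(self, bloom_filter):
        # Any object supporting `in`, e.g. a pre-built bloom filter.
        self.bloom = bloom_filter

    def validate(self, password, user=None):
        # Mirrors the normalization CommonPasswordValidator applies.
        if password.lower().strip() in self.bloom:
            raise ValidationError(
                _("This password is too common."),
                code="password_too_common",
            )

    def get_help_text(self):
        return _("Your password can't be a commonly used password.")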

by Michel Le Bihan, 4 days ago

Attachment: test2.py added

by Michel Le Bihan, 4 days ago

Attachment: test.py added

comment:14 by Michel Le Bihan, 4 days ago

Here is a simple benchmark:

(venv) michel@debian:/dev/shm$ python3 test.py 
Reading password list took: 3.177845547
Checking if password is in list took: 2.2280000000485245e-06
Password is in list: False
Size of passwords in MB: 755.3031387329102
(venv) michel@debian:/dev/shm$ python3 test2.py 
Reading password list took: 17.459581553
Checking if password is in list took: 1.1341000000442136e-05
Password is in list: False
Size of passwords in MB: 75.32260131835938

As you can see, reading the password list takes much longer and is not really optimized. However, checking whether a password is in the list is essentially instant, and the memory usage reduction is significant. As for false positives, I think they are extremely unlikely; 64 bits is really a lot.
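For scale: with roughly 10 million stored 64-bit hashes, the chance that a random non-listed password collides with any entry is about n / 2**64:

n = 9_872_702
print(n / 2**64)  # ~5.4e-13, i.e. about 1 in 1.9 trillion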
