Opened 7 days ago

Closed 4 days ago

Last modified 4 days ago

#36190 closed Cleanup/optimization (wontfix)

High memory usage of CommonPasswordValidator

Reported by: Michel Le Bihan
Owned by: Priyanshu Singh Panda
Component: contrib.auth
Version: dev
Severity: Normal
Keywords: CommonPasswordValidator
Cc: Priyanshu Singh Panda
Triage Stage: Accepted
Has patch: no
Needs documentation: yes
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Hello,

I noticed that CommonPasswordValidator has very high memory usage. Loading the piotrcki-wordlist-top10m.txt list, which has 9872702 entries, results in about 755 MB of memory usage even though the list is only 87 MB on disk. Since CommonPasswordValidator basically only needs to check for membership, I think that a variant of a bloom filter would be much better than a hash table (set()).
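A rough reproduction along these lines (the deep-size figure is only an approximation of what the attached test scripts measure):

import sys
import time

# Path to the wordlist mentioned above (not bundled with Django).
PASSWORD_LIST = "piotrcki-wordlist-top10m.txt"

start = time.perf_counter()
with open(PASSWORD_LIST, encoding="utf-8", errors="ignore") as f:
    passwords = {line.strip() for line in f}
print("Reading password list took:", time.perf_counter() - start)

# Approximate deep size: the set's own footprint plus each string object.
size = sys.getsizeof(passwords) + sum(sys.getsizeof(p) for p in passwords)
print("Size of passwords in MB:", size / 2**20)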

Attachments (2)

test2.py (661 bytes ) - added by Michel Le Bihan 4 days ago.
test.py (496 bytes ) - added by Michel Le Bihan 4 days ago.


Change History (16)

comment:1 by Michel Le Bihan, 7 days ago

Component: Uncategorized → contrib.auth
Keywords: CommonPasswordValidator added

comment:2 by Priyanshu Singh Panda, 6 days ago

Owner: set to Priyanshu Singh Panda
Status: new → assigned
Triage Stage: Unreviewed → Accepted
Type: Uncategorized → Cleanup/optimization

comment:3 by Sarah Boyce, 6 days ago

Resolution: set to needsinfo
Status: assigned → closed
Triage Stage: Accepted → Unreviewed

Michel, can you please share steps to reproduce? This will help in investigating and resolving any issues.

in reply to:  3 comment:4 by Priyanshu Singh Panda, 5 days ago

Replying to Sarah Boyce:

Michel, can you please share steps to reproduce? This will help in investigating and resolving any issues.

I was able to reproduce this using the "piotrcki-wordlist-top10m.txt" file, which contains around 10 million commonly used passwords. CommonPasswordValidator loads the file and stores the passwords in a set. By default, Django ships a list of only 20,000 passwords, which is far smaller than this file, so loading the larger list requires correspondingly more memory.

# django/contrib/auth/password_validation.py (CommonPasswordValidator.__init__)
try:
    with gzip.open(password_list_path, "rt", encoding="utf-8") as f:
        self.passwords = {x.strip() for x in f}
except OSError:
    with open(password_list_path) as f:
        self.passwords = {x.strip() for x in f}

I'm currently working on addressing the memory issue caused by loading such a large file. Please assign this task to me so I can continue working on optimizing it.

comment:5 by Michel Le Bihan, 5 days ago

Hello,

Thanks for looking at this. How are you planning to reduce memory usage?

in reply to:  5 comment:6 by Priyanshu Singh Panda, 5 days ago

Replying to Michel Le Bihan:

Hello,

Thanks for looking at this. How are you planning to reduce memory usage?

I plan to optimize memory and processing time by implementing a Bloom filter. A Bloom filter stores membership in a fixed-size bit array rather than keeping every password string in memory, trading a small, configurable false positive rate for a large memory reduction. I have validated the improvement using the @profile decorator (memory_profiler), which reports memory usage per line of code. With the Bloom filter in place, both time and memory usage dropped significantly.
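A minimal sketch of the idea (the class name and parameters are illustrative, not the actual patch):

import hashlib
import math

class BloomFilter:
    # Probabilistic membership test: no false negatives, tunable false positive rate.

    def __init__(self, n_items, fp_rate):
        # Optimal bit count m and hash count k for the target false positive rate.
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of a SHA-256 digest
        # (Kirsch-Mitzenmacher double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))

At a 1-in-10,000 false positive rate, 10 million entries fit in roughly 23 MB of bits, versus ~755 MB for the set.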

comment:7 by Michel Le Bihan, 5 days ago

How will you deal with false positives? Maybe using a sorted array of 64-bit hashes and doing a binary search on it would give a much lower FP ratio.
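A sketch of that suggestion (helper names are illustrative; BLAKE2 is just one choice of 64-bit hash):

import bisect
import hashlib
from array import array

def hash64(password):
    # Stable 64-bit digest of the password.
    return int.from_bytes(
        hashlib.blake2b(password.encode("utf-8"), digest_size=8).digest(), "big"
    )

def build_index(passwords):
    # One unsigned 64-bit slot per password: ~8 bytes/entry, ~80 MB for 10M entries.
    return array("Q", sorted(hash64(pw) for pw in passwords))

def probably_contains(index, password):
    # Binary search; a false positive requires a 64-bit hash collision.
    h = hash64(password)
    i = bisect.bisect_left(index, h)
    return i < len(index) and index[i] == h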

in reply to:  7 comment:8 by Priyanshu Singh Panda, 4 days ago

Replying to Michel Le Bihan:

How will you deal with false positives? Maybe using a sorted array of 64-bit hashes and doing a binary search on it would give a much lower FP ratio.

Both approaches can keep false positives under control. I’ve already set a custom false positive rate for the Bloom filter, which gives a good balance between memory usage and performance. However, as you mentioned, a sorted array of 64-bit hashes with binary search would lower the false positive rate further.

Would you recommend switching to that method, or keeping the Bloom filter with the custom FP rate? I can also implement the sorted-hash approach if needed for further optimization.

Please let me know how you'd like to proceed, and kindly assign this task to me.

comment:9 by Michel Le Bihan, 4 days ago

That's a question that should be answered by the Django team...

comment:10 by Michel Le Bihan, 4 days ago

Resolution: needsinfo deleted
Status: closed → new

comment:11 by Priyanshu Singh Panda, 4 days ago

Cc: Priyanshu Singh Panda added
Needs documentation: set
Status: new → assigned
Triage Stage: Unreviewed → Ready for checkin
Version: 5.1 → dev

comment:12 by Priyanshu Singh Panda, 4 days ago

Triage Stage: Ready for checkin → Accepted

comment:13 by Sarah Boyce, 4 days ago

Resolution: set to wontfix
Status: assigned → closed

There is often a trade-off between memory usage and performance.

For CommonPasswordValidator to compromise on accuracy via a bloom filter would be backwards incompatible, and therefore not acceptable without a strong consensus from the community.

In Django, you can write your own password validators. In this case, a custom very large file is being used, and so it might make sense to use a custom validator that uses a bloom filter (to deal with the very large file). This custom validator could be provided by a third-party package.

Anyone is welcome to discuss this further on the Django forum.
Note that any PR to improve the current state should not compromise on accuracy and needs to include performance and memory benchmarking.
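For anyone exploring the third-party route: a custom validator only needs validate() and get_help_text(). A sketch (the class name is hypothetical and the bloom filter's construction is left abstract):

from django.core.exceptions import ValidationError
from django.utils.translation import gettext as _

class BloomCommonPasswordValidator:
    def __init__(self, bloom_filter):
        # Any object supporting `in`, e.g. a pre-built bloom filter.
        self.bloom = bloom_filter

    def validate(self, password, user=None):
        # Mirrors the normalization CommonPasswordValidator applies.
        if password.lower().strip() in self.bloom:
            raise ValidationError(
                _("This password is too common."),
                code="password_too_common",
            )

    def get_help_text(self):
        return _("Your password can't be a commonly used password.")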

by Michel Le Bihan, 4 days ago

Attachment: test2.py added

by Michel Le Bihan, 4 days ago

Attachment: test.py added

comment:14 by Michel Le Bihan, 4 days ago

Here is a simple benchmark:

(venv) michel@debian:/dev/shm$ python3 test.py 
Reading password list took: 3.177845547
Checking if password is in list took: 2.2280000000485245e-06
Password is in list: False
Size of passwords in MB: 755.3031387329102
(venv) michel@debian:/dev/shm$ python3 test2.py 
Reading password list took: 17.459581553
Checking if password is in list took: 1.1341000000442136e-05
Password is in list: False
Size of passwords in MB: 75.32260131835938

As you can see, reading the password list takes much longer and is not really optimized. However, checking whether a password is in the list is essentially instant, and the memory usage reduction is significant. As for false positives, I think they are extremely unlikely; 64 bits is really a lot.
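For scale: with roughly 10 million stored 64-bit hashes, the chance that a random non-listed password collides with any entry is about n / 2**64:

n = 9_872_702
print(n / 2**64)  # ~5.4e-13, i.e. about 1 in 1.9 trillion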
