Why are sets faster for checking elements?

What are the reasons for set being highly optimized for checking element apart from the reason as no duplicate is allowed in set?

First thing is when we talk optimisability of sets, we need to talk about it in reference to something and in which parameter is it optimal
So the reference is that sets are faster than lists
and the parameter is that sets are faster than list when we want to check if a given element is in the data or not i.e.

s = {1,2,3,4,5,6,7,8,9}

print(9 in s)

l = [1,2,3,4,5,6,7,8,9]

print(9 in l)

the first operation with set is more faster than the second operation with list. (the difference would not be significant if the number of element is low like above but if we have million element then the difference would be significant.)

The answer to this question is definitely not that since sets do not store duplicate items so it is faster.

The answer is dependent upon how sets and lists are stored inside memory in python. Based on this the short answer is
Since python sets are stored as hash tables in memory so retrieval of elements is quite fast while lists are stored as arrays so it is less faster than hash tables.

But in order to understand the above logic you need to learn about hash tables and array storage. These topics are quite advanced CS topics but i will try to give a basic answer in order to make you understand

A Python list is implemented as a flexible array — basically a sequence of objects placed in a contiguous area of memory. To iterate through a list, Python just accesses each object in turn, in the order in which they are stored. But if we ask “x in m” for a list m, Python searches the whole list until it finds an item equal to x. The cost of “x in m” is proportional to the length of the list.

A Python set (as well as a Python dict) is implemented as a hash table. Elements are scattered randomly around in a memory area based on a function of the key value. The scattering function (called a hash function) is deterministic, so some particular key is always stored in the same, predictable place. When I write s.add(“foo”), the hash value of “foo” is used to determine where in s to place a record of “foo”’s presence. (This place is called a “bucket.”) . When I ask “foo” in s, the same hash value is used to determine which bucket to check for the presence of “foo”. Whether the set has five elements of five million elements, Python just checks in that one place, so the time required is independent of the size of the set.

So let’s say we are asking “foo” in m, where m could be either a list or a set, and in both cases m has a million entries. Let’s suppose moreover that “foo” is not in m. If m is a list, Python will have to check each of the million entries to see if it is equal to “foo”. If m is a set, it will check just one place, the place indicated by the hash value of “foo”. Much faster!

1 Like

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.