Programming

Thoughts on MVP

We all know that getting to a minimum viable product (MVP) is a race: against the competition, against market need, against industry direction.  I came to a realization recently that reaching MVP can also be a race against yourself and against how long your technology choices hold out.

For example, let us say that I have chosen a specific UI framework.  I chose it because it satisfied a good portion of my initial requirements out of the box, was well established, and was generally well maintained.  This UI framework saves me a lot of time and money and lets me get something up and running quickly.  Now that I’ve selected a UI framework and done some work, the requirements grow and evolve, and the framework begins showing its age.  Assuming that this happens before I reach MVP, I am left with a bit of a problem: reevaluate (and possibly retool) the system, or keep moving forward.  Retooling the system means a step backward and a slower timeline to market, but continuing means incurring more technical debt that will have to be repaid later.  What is the right choice in this situation?  I believe you have to play it by ear, but favor sticking with your choices for as long as possible.

Posted by Chad Dotson in Programming, Software Engineering, Technology, 0 comments

The Python “in” Operator – Theoretical vs Actual Time Complexity

Background

Sometimes we generate or retrieve a list, set, or even dict when creating a collection of things that we will be testing against.  Theoretically, a set, frozenset, or dictionary should be the fastest (and equivalent) forms of storage for this operation.  However, what I’ve found is that on some systems the set is faster and on others the dict is faster.  This difference may be very important if you’re writing real-time or near-real-time software.  So where am I going with this?

Big-O Notation – Advertised Complexity

Python publishes the expected time complexities of its collection types.  I’ve copied the ones for the in operator below.  These Big-O numbers are exactly what you would expect, since everything but a list is implemented using a hashing algorithm.  It should be noted, however, that the speed of set, frozenset, and dict can be compromised if the stored objects do not implement a good hash function.

Type       Average   Worst
list       O(n)      O(n)
set        O(1)      O(n)
frozenset  O(1)      O(n)
dict       O(1)      O(n)


More: Python Time Complexity
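To illustrate the hashing caveat above, here is a minimal sketch (the class name is hypothetical) of how a bad hash function degrades set membership from O(1) toward O(n): every instance lands in the same bucket, so lookups fall back to linear comparison.

```python
# A class whose __hash__ collides for every instance degrades
# set/dict membership toward the O(n) worst case.
class BadHash:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 1  # every instance lands in the same hash bucket

    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value


bucket = {BadHash(i) for i in range(1000)}
print(BadHash(999) in bucket)  # True, but found by linear comparison
```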

What I Found

Going back to my statement above, I found that on certain machines Python sets were faster and on others Python dicts were faster.  I could not replicate sets being faster in all cases directly, so I tried to reproduce it on a RHEL 7.1 machine on AWS.  Given that I was at an optimal case for the collection (no hash collisions), I would have thought that set, frozenset, and dict would at least perform on par with each other.  I was surprised to find that with the default Python interpreter my tests showed dicts are actually faster.  So, I reran the tests with the corresponding version of PyPy and found that the expected results hold true: set and frozenset operate at virtually the same speed as dict.  I suspect the primary reason for the difference is the compiler used to build the Python binaries.  It was interesting, however, that PyPy performed as expected on all systems.

The Data

I ran the benchmarks on OSX, Ubuntu 14.04, and RHEL 7.1 (courtesy of the AWS Free Tier), though I opted not to record the RHEL results since they were similar to the Ubuntu results.

OSX – Python
  Type       Time    % Difference
  list       5.47    150.641
  set        0.85    9.877
  frozenset  0.85    9.877
  dict       0.77    0.000 (fastest)

OSX – PyPy
  Type       Time    % Difference
  list       0.34    89.362
  set        0.13    0.000 (fastest)
  frozenset  0.13    0.000 (fastest)
  dict       0.13    0.000 (fastest)

Ubuntu – Python
  Type       Time    % Difference
  list       6.07    123.733
  set        1.44    0.697
  frozenset  1.49    4.110
  dict       1.43    0.000 (fastest)

Ubuntu – PyPy
  Type       Time    % Difference
  list       0.78    102.913
  set        0.25    0.000 (fastest)
  frozenset  0.25    0.000 (fastest)
  dict       0.26    3.922

Recommendations

If you need to create a collection to test for existence, as in this example, favor set, frozenset, or dict, whichever makes sense for your situation.  If you are working with a list you were given and you want to speed up the lookups, consider converting the list to a set.
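As a quick sketch of that recommendation (the sizes and numbers here are arbitrary, not the benchmark from the repository), converting a list to a set before doing repeated membership tests pays off quickly:

```python
import timeit

# A hypothetical lookup-heavy workload: many membership tests
# against the same collection of 10,000 items.
haystack_list = list(range(10_000))
haystack_set = set(haystack_list)  # one-time conversion cost

# Probe for the worst-case element (the last one in the list).
list_time = timeit.timeit(lambda: 9_999 in haystack_list, number=1_000)
set_time = timeit.timeit(lambda: 9_999 in haystack_set, number=1_000)

print(set_time < list_time)  # hash lookup beats the linear scan
```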

The Code

I’ve uploaded all the code to GitHub.  It is available here: https://github.com/chaddotson/container-membership-benchmark/.

Posted by Chad Dotson in Doing Things Better, Programming, Software Engineering, Tips, 0 comments

Python Logger

The Scenario

This scenario illustrates two common mistakes people make when using the Python logging module.  Analyze the following code and look for issues.
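The snippet from the original post isn’t reproduced in this extract; a minimal sketch containing both mistakes (the function and names are hypothetical) might look like this:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def risky_division(a, b):
    try:
        return a / b
    except ZeroDivisionError as e:
        # Mistake 1: logger.error drops the traceback that
        # logger.exception would capture automatically.
        # Mistake 2: str.format builds the message eagerly, even
        # when the record would be filtered out by the log level.
        logger.error("division failed: {0}".format(e))
        return None


risky_division(1, 0)
```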

So what is wrong with that?

First and foremost, the code fails to use the existing logging.exception function, which could, and in most cases should, be used when logging exceptions.  That function automatically adds the exception info to the log, meaning that you get the stack trace!  Secondly, this sample used str.format to build the log message when the logging library can handle string formatting itself via old-style format specifiers.

Fixing it

Ignoring the first problem, the following code is what I should have written.  The benefit here is that the formatting is only executed if the log message is actually going to be emitted, unlike the first version.
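The corrected listing isn’t shown in this extract; a sketch of the lazy, old-style formatting (using a hypothetical division helper) might be:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def risky_division(a, b):
    try:
        return a / b
    except ZeroDivisionError as e:
        # Old-style format specifier plus arguments: the string is
        # only built if this record passes the logger's level check.
        logger.error("division failed: %s", e)
        return None
```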

Taking both errors into account, we should have used the logger’s exception function instead of its error function, as well as the built-in formatting.  With both fixes, the code becomes the following.
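Again the original listing is missing here; a sketch with both fixes applied (same hypothetical helper) might look like:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def risky_division(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        # logger.exception logs at ERROR level and appends the full
        # traceback automatically, and the formatting stays lazy.
        logger.exception("division of %r by %r failed", a, b)
        return None
```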

More Data

This scenario led me to quantify the difference in execution time.  The first set of data covers logging alone; the second extends to timing the different string-formatting options.  As the data shows, using str.format in the log call is a good bit slower than the built-in “old-style” formatting in the logging package.  While it adds up, it isn’t a world-ending difference at small scale.  Again, the time difference is largely due to the fact that no formatting takes place unless the message’s level is high enough.  This data led me to extend the study to timing the two formatting options on their own, where the “old style” is only marginally slower than the format style.

Comparing old style to new style string formatting
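A rough sketch of that comparison (the message and iteration count are arbitrary; absolute numbers will vary by machine and interpreter):

```python
import timeit

# Time %-formatting against str.format on an identical message.
old_style = timeit.timeit(lambda: "value: %d, name: %s" % (42, "x"),
                          number=100_000)
new_style = timeit.timeit(lambda: "value: {}, name: {}".format(42, "x"),
                          number=100_000)

print("old style: %.3fs  format: %.3fs" % (old_style, new_style))
```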


In the end

You should use the functionality the API gives you.  In most cases, and certainly in Python’s, it has been engineered to work, to be fast, and to be maintainable.  For more information on the logging module, check the Python docs for 2.7 or 3.5.

Posted by Chad Dotson in Doing Things Better, Programming, Software Engineering, Technology, Tips, 0 comments

Not Invented Here, Not Written By Me, and Reinventing The Wheel

Not invented here and not written by me are both driving factors in reinventing the wheel when developing software.

We limit ourselves if we do not build upon the achievements of others. – Chad Dotson

Not Invented Here

I’m sure everyone has encountered developers that would prefer to implement everything themselves instead of using a library.  An example would be not using jQuery or underscore (or comparable libraries) on a web project.

This is a serious problem for several reasons.

  • It needlessly increases development time.
  • It potentially leads to less robust code and/or increased testing time.
  • It potentially leads to less maintainable code.

I’m not saying that libraries should always be preferred over your own code, but they should be strongly considered.  If you choose to re-implement what a library gives you, you should prepare some defensible reasons for not going with the library.

More: Wikipedia

Not Written by Me

This is a more refined, narrower case of Not Invented Here.  Developers who don’t want to spend the time, or who have difficulty understanding code written by others, often reimplement it because they view that as the simpler solution.  It rarely is, and they hurt their overall code quality and momentum for it.

Some common things you will hear are:

  • I don’t know what that code does.
  • I would spend less time rewriting it.  (Which is most likely false.)

The Core of the Issue

As I’ve said, I think the core of the issue is that we find it harder to understand what someone else writes than what we write ourselves.  We must apply programming best practices and resist the urge to reimplement the past.  To grow, we must push past our tendencies and continue moving forward to bigger and better things.

Posted by Chad Dotson in Doing Things Better, Key Concepts, Programming, Software Engineering, 0 comments

C-Style Unions And Python

So, you’re creating a C union (used to build a variant type), writing it to a file, socket, etc., and you want to read it in Python.  There are two ways to deal with this.

Assume the following union definition in C.

In C, reading the value it represents is easy: since it’s 4 bytes, you simply read 4 bytes and then reference the appropriate member.  In Python, if you’re looking for functionality that closely matches C, it seems less straightforward.

struct.pack and struct.unpack

The first thing to try is the struct module, to see if pack and unpack come close to doing what you want.  The problem with pack and unpack is that they require a data type up front.
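The original listing isn’t shown here; a sketch of the struct-based approach (the type tags and helper are hypothetical) might be:

```python
import struct

# With struct.unpack you must know the type up front, so the code
# ends up branching on some out-of-band type tag.
def read_value(raw, type_tag):
    if type_tag == "int":
        return struct.unpack("<i", raw)[0]
    elif type_tag == "float":
        return struct.unpack("<f", raw)[0]
    raise ValueError("unknown type tag: %r" % (type_tag,))


# A hypothetical 4-byte payload read from a file or socket.
payload = struct.pack("<f", 3.14)  # pretend the sender wrote a float
print(read_value(payload, "float"))
```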

This works as well as anything and is completely straightforward; the big problem here is speed.  First, we have to wrap each call to unpack in an if to pick the appropriate format.  Second, it’s faster in Python to unpack arrays than single values.

A ctypes addition to struct.pack and struct.unpack

Using ctypes, you can approach functionality similar to C’s.  Take the following code, for example.
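Again the original listing is missing; a sketch of the ctypes approach, mirroring the assumed C union (field names are assumptions), might be:

```python
import ctypes
import struct

# Python mirror of the assumed C union:
#   union variant { int32_t i; float f; };
class Variant(ctypes.Union):
    _fields_ = [("i", ctypes.c_int32),
                ("f", ctypes.c_float)]


# Always unpack the raw bytes as an integer into the integer member,
# then read whichever member is appropriate -- just as C would.
raw = struct.pack("<f", 2.5)       # 4 bytes that really hold a float
v = Variant()
v.i = struct.unpack("<i", raw)[0]  # store the bits, not the value
print(v.f)                         # reinterpreted as a float -> 2.5
```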

Notice that it is always unpacking the data as an integer into the integer part of the union.  This approach has a few advantages.  One, it functions the same as the C version of the code would.  Two, you can unpack entire arrays at once which can be faster.

Conclusions

The first code sample is the simplest and most straightforward, though potentially slower.  Depending on the situation, you may want an implementation similar to the second.

Posted by Chad Dotson in Programming, 0 comments

Decisions in 1 Pomodoro

All programmers suffer from analysis paralysis at some point in their careers.  The trick is bringing it to a swift conclusion.  Next time, try this: give yourself one pomodoro (25 minutes) to analyze the problem.  After that, make a decision and go with it.  Whatever falls out, falls out.  At least you won’t be over-analyzing the problem and getting nowhere.

Posted by Chad Dotson in Doing Things Better, Misc, Programming, 0 comments

On Software Engineering

Spring

Writing from the patio on a much-deserved day off. What a day: sunny and 68 on this third day of spring. It sometimes makes it really hard to work inside in a windowless box. It’s been a busy day of a different sort, but dang, I could get used to it.

Software Engineering Is Consuming

Software engineering can be, and is, absolutely consuming work.  The short of it is, we like what we do, so we tend to focus a lot of attention on it.  We are problem solvers, designers, learners, and the list goes on.  I am of the philosophy that you should find and do something you like, because anything else is a waste.  I guess that philosophy has its pros and its cons.  Is it so bad when your day consists of writing code, solving problems, researching, and learning new things?

Achieving Balance

Everyone talks about work/life balance, and it’s true: you must always take time for yourself.  Get outside, do some walking.  It will help clear your head, and it is a good stress reliever.  Maybe you’ll come back to the task with fresh ideas and renewed vigor.  The schedule will always be there.

Personally, as of April I will have been on a diet and exercise plan for two years.  I’ve lost a lot and still need to lose more.  One of my biggest problems has been shorting myself on the time to maintain my walking and weight lifting.  I always seem to get wrapped up in something.

Having a rewarding hobby is probably a good idea.  I’m not talking about coding a side project (don’t we all seem to have an overabundance of those); I’m talking about something else entirely.  My non-programming hobby is photography.  I don’t seem to get to do it much anymore, and I find winter a dreary time to take photos.  I did have a major accomplishment in this area last fall: I officially photographed a wedding.  While I was originally scared to take on such a task, the photos turned out wonderfully.  I took the photo below today.


Bradford Pear Blooming Spring 2015


Posted by Chad Dotson in Hobbies, Key Concepts, Photography, Programming, Ramblings, Software Engineering, 0 comments

Node.js vs Python vs PyPy – A Simple Performance Comparison – Updated

[Figures: N-Queens runtime graph and results table]

Some History

This is a follow-up to my original post: Node.js vs Python vs PyPy – A Simple Performance Comparison.  This article corrects a discrepancy caused by a slight difference in the JavaScript implementation, which skewed the Node.js results.

The Algorithm

As stated in the previous article, I’ve attempted to implement the same single-threaded, brute-force, recursive algorithm in many different languages.  There is nothing overly special about this algorithm, and I’ve made no attempt to optimize it.
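For context, a minimal sketch of this kind of single-threaded, brute-force, recursive N-Queens solver (not the exact code from the repository) looks like:

```python
# Count N-Queens solutions by trying every column in each row and
# recursing; queens is a tuple of column positions, one per placed row.
def count_solutions(n, queens=()):
    row = len(queens)
    if row == n:
        return 1
    total = 0
    for col in range(n):
        # A placement is safe if no earlier queen shares the
        # column or either diagonal.
        if all(col != c and abs(col - c) != row - r
               for r, c in enumerate(queens)):
            total += count_solutions(n, queens + (col,))
    return total


print(count_solutions(8))  # the classic 8x8 board has 92 solutions
```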

The Findings

Node is fast, very fast.  It easily outperforms every other implementation in the puzzle repository.  As the included charts show, the performance difference between Node.js and out-of-the-box Python is very significant, and the difference between Node.js and PyPy, while less pronounced, is still significant.

Special Notes

  • I’ve placed my source on GitHub at the following URL: https://github.com/chaddotson/puzzles.  It now contains working N-Queens puzzle implementations in JavaScript, Python, Lua, Scala, and Ruby.  There is also a version in Rust, but it needs to be updated to the latest syntax before it can run again.
  • This is just one type of algorithm; the best-performing solution might, and probably does, change depending on what type of application you are researching.  For web-server performance, Node.js is slightly better than PyPy running Tornado.
  • This algorithm is a simple brute-force algorithm; there are many faster and better ones out there.
  • See the original article, Node.js vs Python vs PyPy – A Simple Performance Comparison, for more details on memory performance.
Posted by Chad Dotson in Misc, Programming, Software Engineering, Technology, 5 comments