Tips

Tips, suggestions, and other useful stuff.

The Python “in” Operator – Theoretical vs Actual Time Complexity

Background

Sometimes we generate or retrieve a list, set, or even a dict as a collection of things that we will be testing membership against.  Theoretically, a set, frozenset, or dict should be the fastest (and equivalent) forms of storage for this operation.  However, what I’ve found is that on some systems the set is faster and on others the dict is faster.  This difference may be very important if you’re writing real-time or near-real-time software.  So where am I going with this?

Big-O Notation – Advertised Complexity

Python has published the expected time complexities of their collection types.  I’ve copied the ones for the in operator below.  These Big-O numbers are exactly what you would expect since everything but a list is implemented using a hashing algorithm.  It should be noted, however, that the speed of the set, frozenset, and dict can be compromised if the objects stored do not implement a good hashing algorithm.

Type        Average   Worst
list        O(n)
set         O(1)      O(n)
frozenset   O(1)      O(n)
dict        O(1)      O(n)


More: Python Time Complexity

What I Found

Going back to my statement above, I found that on certain machines Python sets were faster and on other machines Python dicts were faster.  I could not directly reproduce sets being faster, so I tried to replicate it with a RHEL 7.1 machine on AWS.  Given that I was testing the optimal case for the collection (no collisions), I would have thought that set, frozenset, and dict would at least perform on par with each other.  I was surprised to find that, with the default Python interpreter, my tests showed that dicts are actually faster.  So I reran the tests with the corresponding version of PyPy and found that the expected results hold true: set and frozenset operate at virtually the same speed as dict.  I suspect the primary reason for the difference is the compiler used to build the Python binaries.  It was interesting, however, that PyPy performed as expected on all systems.
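A minimal sketch of this kind of measurement, using the standard timeit module.  This is not the benchmark code from the repository; the sizes, names, and the choice of searching for the last element are assumptions made only to keep the example short.

```python
import timeit

# Hypothetical sizes, chosen only so the sketch runs quickly.
ITEM_COUNT = 1000
REPEATS = 200

items = list(range(ITEM_COUNT))
containers = {
    "list": items,
    "set": set(items),
    "frozenset": frozenset(items),
    "dict": dict.fromkeys(items),
}


def time_membership(container, needle, repeats=REPEATS):
    """Return the total seconds taken by `repeats` membership tests."""
    return timeit.timeit(lambda: needle in container, number=repeats)


# Search for the last element, which is the slowest lookup for the list.
results = {
    name: time_membership(container, ITEM_COUNT - 1)
    for name, container in containers.items()
}

for name, seconds in sorted(results.items()):
    print("{0:<10} {1:.6f}s".format(name, seconds))
```

Running the same script under both CPython and PyPy on the same machine is one way to see whether the interpreter changes the ordering.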

The Data

I ran the benchmarks on OSX, Ubuntu 14.04, and RHEL 7.1 (courtesy of the AWS Free Tier).  I opted not to record the RHEL results, though, as they are similar to the Ubuntu results.

Benchmark        Result   Fastest   % Difference

OSX, Python
  list           5.47               150.641
  set            0.85                 9.877
  frozenset      0.85                 9.877
  dict           0.77     0.77        0.000

OSX, PyPy
  list           0.34                89.362
  set            0.13     0.13        0.000
  frozenset      0.13     0.13        0.000
  dict           0.13     0.13        0.000

Ubuntu, Python
  list           6.07               123.733
  set            1.44                 0.697
  frozenset      1.49                 4.110
  dict           1.43     1.43        0.000

Ubuntu, PyPy
  list           0.78               102.913
  set            0.25     0.25        0.000
  frozenset      0.25     0.25        0.000
  dict           0.26                 3.922
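The % Difference values appear to be the symmetric percent difference between each result and the fastest result in its group: the absolute difference divided by the mean of the two values.  That is an inference from the numbers themselves, not something stated alongside the data, and the function name below is hypothetical.

```python
def percent_difference(value, fastest):
    """Symmetric percent difference: |a - b| divided by the mean of a and b."""
    return abs(value - fastest) / ((value + fastest) / 2.0) * 100.0


# OSX / Python rows: list and set compared against the fastest result (dict, 0.77).
print(round(percent_difference(5.47, 0.77), 3))
print(round(percent_difference(0.85, 0.77), 3))
```

Because the denominator is the mean rather than the fastest value, the figure can exceed 100% without meaning "more than twice as slow".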

Recommendations

If you need to create a collection to test for existence, as in this example, favor set, frozenset, or dict, whichever makes sense for your situation.  If you are working with a list you’re given and you want to speed up the system, consider converting the list to a set.
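A small sketch of that conversion, with hypothetical data and names.  The conversion itself walks the whole list once, so it only pays off when the set is built once and then queried many times.

```python
# Hypothetical data: a list handed to us by some other part of the system.
blocked_users = ["alice", "bob", "carol"]

# Build the set once, outside any loop, so each later test is a hash lookup.
blocked_lookup = set(blocked_users)


def is_blocked(username):
    """Membership test against the set instead of the original list."""
    return username in blocked_lookup


print(is_blocked("bob"))
print(is_blocked("dave"))
```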

The Code

I’ve uploaded all the code to GitHub.  It is available here: https://github.com/chaddotson/container-membership-benchmark/.

Posted by Chad Dotson in Doing Things Better, Programming, Software Engineering, Tips, 0 comments

Python Logger

The Scenario

This scenario illustrates two possible mistakes people make when using the Python logging module.  Analyze the following code and look for issues.
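A sketch of the kind of code this scenario describes.  It is an illustration rather than the original listing; the divide function and the message text are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def divide(numerator, denominator):
    """Hypothetical function used only to trigger and log an exception."""
    try:
        return numerator / denominator
    except ZeroDivisionError as error:
        # Builds the message with str.format() and logs it at ERROR level.
        logger.error(
            "Failed to divide {0} by {1}: {2}".format(numerator, denominator, error)
        )
        return None


divide(1, 0)
```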

So what is wrong with that?

First and foremost, the code fails to use the existing Logger.exception method, which could, and in most cases should, be used when logging exceptions.  That method automatically adds the exception info to the log, meaning that you get the stack trace!  Secondly, this sample uses str.format to build the log message, when the logging library can in fact handle string formatting itself via old-style (%) format specifiers.

Fixing it

If I were to ignore the first problem, the following code is what I should have written.  The benefit here is that the formatting is only executed if the log message is to be captured, unlike the first method.
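A sketch of that intermediate fix, again with a hypothetical divide function: the call is still logger.error, but the arguments are handed to the logging module instead of being formatted up front.

```python
import logging

logger = logging.getLogger(__name__)


def divide(numerator, denominator):
    """Hypothetical function used only to trigger and log an exception."""
    try:
        return numerator / denominator
    except ZeroDivisionError as error:
        # The arguments are passed separately; logging only merges them
        # into the message if the record is actually going to be emitted.
        logger.error("Failed to divide %s by %s: %s", numerator, denominator, error)
        return None
```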

Taking both errors into account, we should have used the exception method instead of the error method on the logger, as well as the built-in formatting.  Given both of these, the code becomes:
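A sketch with both fixes applied, using the same hypothetical divide function.  Logger.exception logs at ERROR level and appends the current traceback, so the exception no longer needs to be formatted into the message by hand.

```python
import logging

logger = logging.getLogger(__name__)


def divide(numerator, denominator):
    """Hypothetical function used only to trigger and log an exception."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # logger.exception() is meant to be called from an exception handler;
        # it records the message plus the stack trace of the active exception.
        logger.exception("Failed to divide %s by %s", numerator, denominator)
        return None
```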

More Data

This scenario led me to quantify the cost in execution time.  The first set of data is related to logging alone; the second set extends to timing the different string formatting options.  As you can see from the data, using format is a good bit slower than the built-in “old-style” formatting in the logging package.  While it will add up, it isn’t a world-ending difference if done on a small scale.  Again, the time difference is largely due to the fact that no formatting takes place unless the message has a high enough level.  This data caused me to extend my study into timing the two formatting options on their own.  As you can see from the data, the “old style” is marginally slower than the format style.
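One way to take the first kind of measurement yourself is sketched below.  The logger name, the message, and the call count are assumptions; the logger is set to WARNING so that the DEBUG calls are filtered out, which is the case where deferred formatting should help the most.

```python
import logging
import timeit

# Hypothetical logger that discards everything below WARNING.
logger = logging.getLogger("format_timing_sketch")
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.WARNING)
logger.propagate = False

CALLS = 10000  # hypothetical repeat count


def eager():
    # The message is built by str.format() before logging sees it.
    logger.debug("value is {0}".format(42))


def lazy():
    # The message is only built if the record passes the level check.
    logger.debug("value is %s", 42)


eager_seconds = timeit.timeit(eager, number=CALLS)
lazy_seconds = timeit.timeit(lazy, number=CALLS)

print("str.format:       {0:.6f}s".format(eager_seconds))
print("logging %-style:  {0:.6f}s".format(lazy_seconds))
```

The absolute numbers will vary by machine and interpreter, so treat the output as a relative comparison only.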

Comparing old style to new style string formatting

In the end

You should use the functionality the API gives you.  In most cases, and certainly in the case of Python, it has been engineered to work, be fast, and be maintainable.  For more information on the logging module, check the Python docs for 2.7 or 3.5.

Posted by Chad Dotson in Doing Things Better, Programming, Software Engineering, Technology, Tips, 0 comments

PyCharm and Version Control

So, you want to add your PyCharm project files to a VCS, but you constantly deal with problems because each of your team members has a different name/location for their project interpreter.  There is a rather simple solution to this problem.  Basically, at the project level, PyCharm only cares about the name of the interpreter, not the location.  Follow these instructions to give your interpreter a name that is consistent across developers, and then use that name in your project files to fix the issue.

    1. Open PyCharm
    2. Select File >> Settings (or Configure >> Preferences if you don’t have a project open).
    3. Search for the Project Interpreter setting (this will also work if you don’t use the project interpreter in your run configurations).
    4. Hit the little gear box next to the project interpreter then select More.
    5. Select the interpreter from the virtual environment of your choice and hit the edit button.
    6. Now give it a unique name (preferably identifiable and related to your project).
    7. Now, select that unique name as your default project interpreter or in your run configurations.
    8. Commit
    9. Make sure each team member does the same.

Now whenever you update or commit, you won’t constantly see changes associated with people selecting their interpreter.

If you want to know more, I definitely recommend reading a thread over at Jetbrains’ Support.

Posted by Chad Dotson in Doing Things Better, Programming, Software Engineering, Tips, 0 comments

Automating Pylint with Gulp.js

Automating Pylint (and other Python tasks) can be achieved with several viable Python-based methods, but what if we used Gulp.js?  The following code snippet runs Pylint on the set of Python files defined by pySource.

Notes:

  • This is just a first cut.  I may find a better way.
  • I am aware that I could have simply used gulp-shell to call pylint with a collection of directories.
  • I am open to feedback on this.  Let me know if I’m doing something wrong or inefficient.

Posted by Chad Dotson in Doing Things Better, Software Engineering, Tips, 0 comments

No Comments – A Failure Twice

Everyone that writes code has encountered or written their fair share of undocumented code.

A failure twice?

I was encouraged to write this article because of something I read about the pinball game that was once included with Windows.  According to the article on MSDN, the pinball game was removed due to a bug that should have been fixable.  However, the overall quality of the code made repair impossible, and the game was removed from the distribution.  Two major items can be taken from this:

  • Failure 1: The code should have been self-documenting, with few comments required.
  • Failure 2: Code not easily understood should have been commented.

Properly commented code can be tricky!

In college they push you to comment your code while not stressing that over-documentation is also bad.  I remember writing programs for assignments that had comments on almost every line.  In the end it doesn’t buy you anything; it just restates the obvious and clutters the solution.  In the workplace, comments, and whether or not the code needs them, are hit and miss.

Some notes about comments:

  • Many comments are an acknowledgement of your failure to communicate.  Write better self-documenting code.
  • Outdated or wrong comments are worse than no comments.
  • Consider refactoring code that’s not self-documenting.  I find one of the biggest places this can be done is extracting methods from if statements.
  • Choose good, descriptive names.  I’m a little wordy in my names, but in the end most of my code can almost read like a sentence.
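As an illustration of extracting methods from if statements, here is a small before-and-after sketch.  The discount rule, the customer dictionary, and every name in it are hypothetical; the point is only that the extracted names say what the comment used to say.

```python
from datetime import date


# Before: the condition has to be decoded, so it tends to attract a comment.
def discount_before(customer, today):
    # Loyal customers get a discount during December.
    if customer["years_active"] >= 5 and today.month == 12:
        return 0.10
    return 0.0


# After: each part of the condition is extracted into a named function.
def is_loyal_customer(customer):
    return customer["years_active"] >= 5


def is_holiday_season(today):
    return today.month == 12


def discount_after(customer, today):
    if is_loyal_customer(customer) and is_holiday_season(today):
        return 0.10
    return 0.0


print(discount_after({"years_active": 6}, date(2015, 12, 1)))
```

Both versions behave the same; the second one reads closer to a sentence and leaves no comment to go out of date.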

Remember

The code is your best documentation.  Think about what you want to communicate with it and how to be as clear as possible.

Posted by Chad Dotson in Doing Things Better, Programming, Software Engineering, Tips, 0 comments

The Flood of Social Media Posts – What’s Wrong, How To Mitigate It, How To Fix It

What’s Wrong

Since its creation, social media has taken off at an exponential rate.  Each day more and more people are creating accounts and contributing to the feed.  Facebook is the service most readily adopted, mainly because of the people already on it.  In the case of Facebook, let’s say that the typical user has 200-300 friends, and let’s say that just a quarter of those (50 of 200) are enthusiastic posters (2-3 posts/day).  That equates to at least 100 posts per day; add to that the posts from your other friends.  Let’s guess an overall total of 200 posts/day on just that one social media outlet.  Facebook’s solution to the inundation of posts is the top posts vs recent posts feature.

Skip to the services that I think are the most susceptible to inundating users with content: Twitter and Pinterest.  As a user of Twitter, I seem to have trouble following more than 40 people.  Once upon a time I followed nearly 90 people and found that useful posts actually got buried in the noise.  Pinterest, on the other hand, seems to violate all that is good with respect to UI design; it is way too busy and your eyes don’t follow any particular lines.  Liberal use of Pinterest’s follow feature can unintentionally muddy the content you see.

Tips On Dealing With It

I think the only way to fix usage of Twitter is to limit the number of people you follow.  Pick high quality, low post rate users.  This will improve the overall quality of content you get.  For Pinterest, limit the number of users or boards you follow.

How To Fix It

How do you fix being inundated with content?  Facebook is certainly trying to fix it via their “Top Posts” feature.  The top posts feature attempts to make guesses about what you want to see based on your interests, what you’ve looked at in the past, and probably other metrics (and controversial stuff such as a recent study).  Even though I don’t care for their current top posts feature, I believe Facebook is on the right track.  It all boils down to the ability to listen to everything but only pay attention to what you’re interested in, even when you may not 100% know yourself.

I think the ultimate “Top Posts” algorithm would take into account the following:

  • What you’ve looked at in the past. (Given)
  • Users would be given preferred tags based on what they view.
  • It would tag each post, and each post would receive an initial ranking based on the types of posts the user normally makes.
  • The initial ranking score given to a user’s posts is a function of how well their posts have done in the past.
  • Each post’s rank is updated in each of its tags as people view them based on their rank in the tag category.

Up Next: I may decide to make a post discussing a theory of how people leave one service and adopt another.

Posted by Chad Dotson in Misc, Ramblings, Tips, 0 comments

Generating JSON Documents From SQLite Databases In Python

Special Note

This article assumes that you do not wish to use a more sophisticated ORM tool such as SQLAlchemy.

Some Setup

Let’s start with a quick-and-dirty SQLite database, given the following SQL.

You can create the SQLite database with the following command.

Some Different Methods

For this example, we want each record returned via the SQL select statement to be its own JSON document.  There are several ways of doing this.  All of them solve the problem reasonably well, but I was in search of the best way.  In checking python.org, I discovered that the sqlite connection object has an attribute called row_factory.  This attribute can be modified to provide selection results in a more advanced way.

Method 1 – My Preferred Method

From the Python docs, we find that they already have a good factory for generating dictionaries.  It is my opinion that this functionality should be more explicitly enabled in the language.

In this method, we override the row_factory attribute with a callable function that generates the python dictionary from the results.
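A sketch of this method.  The dict_factory function follows the example given in the sqlite3 documentation; the in-memory database, the people table, and its contents are hypothetical stand-ins for the database set up earlier.

```python
import sqlite3


def dict_factory(cursor, row):
    """Build a dict for one row, keyed by column name (as in the sqlite3 docs)."""
    return {column[0]: row[index] for index, column in enumerate(cursor.description)}


# Hypothetical in-memory table standing in for the article's database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
connection.execute("INSERT INTO people (name) VALUES ('Ada')")

# Every cursor created from now on returns dictionaries instead of tuples.
connection.row_factory = dict_factory
method_one_rows = connection.execute("SELECT id, name FROM people").fetchall()
print(method_one_rows)
connection.close()
```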

Method 2 – Almost As Good As Method 1

This method is just about as good as method 1.  Matter of fact, you can get away with this one and be just fine.  Functionally, the methods are almost identical.  With this method, the records can be accessed via index or via column name.  The biggest difference is that unlike method 1, these results don’t have the full functionality of a python dictionary.  For most people, this might be enough.
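The behavior described here matches the built-in sqlite3.Row class, so the sketch below assumes that is the factory in question.  The in-memory people table is again hypothetical.

```python
import sqlite3

# Hypothetical in-memory table standing in for the article's database.
connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row
connection.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
connection.execute("INSERT INTO people (name) VALUES ('Ada')")

row = connection.execute("SELECT id, name FROM people").fetchone()

by_index = row[1]      # positional access, like a tuple
by_name = row["name"]  # access by column name
as_dict = dict(row)    # convert when a real dictionary is needed

print(by_index, by_name, as_dict)
connection.close()
```

A Row is not a dict, so anything that expects a real dictionary (json.dumps, for example) needs the explicit conversion shown on the last assignment.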

Putting It All Together

The following code snippet will extract a group of dictionaries based on the select statement from the sqlite database and dump it to JSON for display.

The Code
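A minimal sketch of the whole flow, assuming the dictionary factory from Method 1.  The helper name rows_as_json_documents, the people table, and its rows are hypothetical; the real database and query would take their place.

```python
import json
import sqlite3


def dict_factory(cursor, row):
    """Build a dict for one row, keyed by column name."""
    return {column[0]: row[index] for index, column in enumerate(cursor.description)}


def rows_as_json_documents(connection, query):
    """Return one JSON string per row returned by `query`."""
    connection.row_factory = dict_factory
    cursor = connection.execute(query)
    return [json.dumps(record, sort_keys=True) for record in cursor]


# Hypothetical in-memory database standing in for the article's database.
connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO people (name) VALUES ('Ada');
    INSERT INTO people (name) VALUES ('Grace');
    """
)

documents = rows_as_json_documents(connection, "SELECT id, name FROM people ORDER BY id")
for document in documents:
    print(document)
connection.close()
```

Each element of documents is a separate JSON document, which is what the example set out to produce; dumping the whole list with a single json.dumps call would give one JSON array instead.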

The Results

Posted by Chad Dotson in Programming, Tips, 3 comments
DNS Changer And What It Means To You

With all the fear mongering by the news outlets, I thought I’d make a short post about DNS Changer.

DNS Changer is malware that, once it infects your computer, changes how the computer looks up website addresses.  Essentially, it tells your computer to use a service set up by criminals.  The criminals behind DNS Changer have since been caught, but because of the far-reaching implications, arrangements were made to turn the criminals’ service into a short-term legitimate service.  Monday that service will be turned off.

To make sure you’re not infected, go to this site, which was set up as a quick test: http://www.dns-ok.us/.  If the result is green, you are OK and have no need to worry.

If the test fails, you will need to go here: http://www.dcwg.org/fix/.  It contains the directions and links required to get your computer fixed.

For more information, go to the DNS Changer Working Group (http://www.dcwg.org/).

Posted by Chad Dotson in Featured, Tips, 0 comments