When people tell computers what to do.

Property based testing in Haskell with QuickCheck

Automated testing is (hopefully) becoming a normal part of the development workflow; ideally tests should be thought out and written even before the actual code, as an aid for specifying the problem to be solved. And that is not all: as the code base grows, the tests accumulated along the way help detect changes that break backwards compatibility. The agile community is especially keen on Test Driven Development (TDD) and Behaviour Driven Development (BDD), guidelines and tools meant to manage the development workflow and structure code "in the large".

However, little attention is usually given to the tool that actually performs the testing, with most projects defaulting to the venerable unit testing paradigm. Unit tests, even without the BDD structure, are not bad, but they force the developer to come up with the testing scenarios. This means that one usually runs out of ideas after specifying a few cases. A tool that helps address this problem is the Haskell library QuickCheck, originally conceived in 2000. Although this post focuses on the Haskell implementation, the library has since been ported to many other languages, including Erlang, Scala, Clojure and Python.

The core concept in QuickCheck is property-based testing: the developer writes a function that states a property of the function to be tested. The QuickCheck library then automatically generates data to feed to the function and tests that the property holds in all cases. The beauty of it all is that property-based testing can be used alongside unit-based testing, and both can be plugged into a BDD framework of choice. In that respect property checking is not meant to replace, but to augment any test suite, even legacy ones!

Taking our first steps: out-of-the-box testing.

Testing a property predicate can be as easy as passing it to the quickCheck function and letting the magic happen.
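The definition of prop_idem isn't shown in the post; a plausible version, assuming we are checking that sorting a list is idempotent (sorting twice gives the same result as sorting once), could be:

```haskell
import Data.List (sort)

-- Hypothetical example: prop_idem is not shown in the original post.
-- It states that sorting is idempotent: sorting twice equals sorting once.
prop_idem :: [Int] -> Bool
prop_idem xs = sort (sort xs) == sort xs
```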

Main> quickCheck prop_idem
+++ OK, passed 100 tests.

Notice that we don't even need to pass an argument to prop_idem, as the framework takes care of generating the test values automatically. By default QuickCheck generates up to 100 values to test the property. So far so good.

Now let's try writing a wrong function to fail tests, for instance something like

sum' :: [Int] -> Int
sum' (x:xs) = x + sum' xs

is an incomplete implementation of the sum function. As far as properties go, we know that addition is associative and commutative, so we can translate that into Haskell with

prop_sumAsso :: [Int] -> Bool
prop_sumAsso xs = sum' xs + sum' xs == sum' (xs ++ xs)

prop_sumComm :: [Int] -> Bool
prop_sumComm xs = sum' xs == sum' (reverse xs)

As the implementation of sum' is incomplete we expect QuickCheck to fail the tests; in fact running them produces similar results in both cases.

Main> quickCheck prop_sumAsso
** Failed! (after 1 test):
  Quick.hs:5:1-25: Non-exhaustive patterns in function sum'

QuickCheck exposed for us a corner case we did not think about: the empty list! We can quickly fix that by adding the missing case sum' [] = 0 and testing again, now enjoying watching all tests pass with flying colours.
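For completeness, the fixed function reads:

```haskell
sum' :: [Int] -> Int
sum' []     = 0           -- the corner case QuickCheck exposed
sum' (x:xs) = x + sum' xs
```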

Sometimes however we might write a correct function but a wrong property, for instance

prop_sumPos :: [Int] -> Bool
prop_sumPos xs = sum' xs >= 0

does not hold true in all cases, in fact when we try running it we get this

Main> quickCheck prop_sumPos
** Failed! Falsifiable (after 5 tests and 4 shrinks):

which per se isn't very helpful for debugging the wrong property. The first thing we can do is run QuickCheck in verbose mode, which prints the values that were generated and what made the property fail in the first place.

Main> verboseCheck prop_sumPos
** Failed! Passed:
Falsifiable (after 3 tests and 2 shrinks):

It is worth spending a few lines at this point explaining how QuickCheck generates its test cases and runs the tests. First it tries corner cases for the input data structure (here, the empty list), then generates progressively bigger data points. When a counterexample is found, QuickCheck uses the shrink function to do the opposite: reduce a large data point into progressively smaller ones. In this case the [1, -2] data point fails the property predicate, so QuickCheck shrinks it; if we try ourselves the result is

Main> shrink [1, -2]

different from the example before, but it helps explain that shrink generates "smaller" variants of the starting data point.

Going back to our failing test, QuickCheck tries to provide the smallest possible counterexample, which is why it presents as a result not the original value [1, -2], but a shrunk version of it, [-1]. Now that we understand this, it is easy to realise that our property can only hold as long as all values are greater than or equal to 0.

A step further: generating data with combinators.

Knowing this, we need a way to generate only positive numbers to populate the list, so that we can feed the property the right data. As what we need is just a special case of an existing data type, we should be able to use one of the many combinators that ship in the Test.QuickCheck.Gen module. In our case we need a list of positive numbers, so we can quickly write the following function:

posIntList :: (System.Random.Random a, Num a) => Gen [a]
posIntList = listOf $ choose (1,1000000)

Note that the type signature, as usual in Haskell, is a telltale: it tells us that we have a polymorphic function able to create a list of any type of number. Now that we have a generator, let's create a property that can be called by the quickCheck function.

gen_sumPos :: Property
gen_sumPos = forAll posIntList prop_sumPos

Now we can run this property with quickCheck gen_sumPos and watch it pass. A caveat to this approach is that our forAll property does not automatically get a shrink function, so failing tests might not be so easy to debug.

The final frontier: generating user-defined types.

So far we've seen only cases using out-of-the-box data types, but Haskell is famous for its powerful algebraic type system, so QuickCheck is also capable of dealing with user-defined data types.

There is no fully automatic way of getting QuickCheck to do that, which means we'll have to implement the methods of the Arbitrary typeclass, the standard QuickCheck way of specifying a generator for a custom data type. The Arbitrary class specifies two functions: arbitrary and shrink, both of which can be tricky to implement. The arbitrary function is in fact a value in the Gen monad, inside which the standard combinators can be composed together. One advantage of working in a monad is that the values produced by the combinators can be unpacked from the Gen type and modified or used to construct the custom data type.

To give an example let's create a Phone type that contains a full phone number including country and area code.

data Phone = Phone String String String
    deriving (Show, Ord, Eq)

The reason why I'm using strings here is that phone numbers are more like names than numbers: one uses a phone number to call somebody, not to do maths with it. However, when it comes to actually generating them in Haskell, I find it more convenient to use actual integer numbers.

instance Arbitrary Phone where
  arbitrary = do
    countryCode <- choose (1, 999) :: Gen Int
    areaCode <- choose (1, 9999) :: Gen Int
    phoneNumber <- choose (10000, 999999) :: Gen Int
    return (Phone (show countryCode) (show areaCode) (show phoneNumber))

The arbitrary implementation is pretty straightforward: for each section of the phone number a value is randomly generated and bound to a name; finally a Phone instance is constructed and returned. Notice that all the generative functions were annotated with Int to avoid ambiguity with the show function (such annotations are sometimes necessary in Haskell to allow the compiler to type-check our functions).

To make sure that our arbitrary function generates what we had in mind, we can use a type annotation again:

genPhone :: Gen Phone
genPhone = arbitrary

This function, thanks to its type signature, is automatically bound to our implementation of arbitrary; this allows us to call the sample function, which will print a few samples of our data type.

Main> sample genPhone
Phone "793" "5119" "448388"
Phone "351" "8035" "334695"
Phone "490" "1422" "113181"
Phone "352" "2140" "702930"
Phone "965" "7500" "89922"
Phone "914" "3584" "34219"
Phone "112" "785" "780162"
Phone "803" "8779" "52016"
Phone "141" "3299" "557328"
Phone "541" "4082" "674339"
Phone "93" "1747" "298804"

This was a quick tour of the powers of property-based testing; I hope this post inspired you to delve deeper into it and start squashing bugs in your programs.

Haskell optimisations for large datasets

Python and Haskell might seem worlds apart. One is object-oriented, the other purely functional; one is dynamic, the other static; one is interpreted, the other compiled. Yet I find the two languages have a similar feel: they both have a clean syntax, prefer a declarative approach to programming and provide high-level constructs.
In this post we're going to try both languages on a mapreduce-like job, the kind of work that is often necessary when analysing large amounts of data. Even though the task is a bit contrived, it works as a good example and is very simple to understand. The job consists of reading a text file containing 12 million rows, filtering out the lines that are too short or too long, counting the number of capital letters in each line, and then summing up all the partial counts.

An interesting aspect of mapreduce, in some respects reminiscent of old-school batch jobs, is that in many cases it only requires an initial IO phase to gather the data and a final IO phase to return the result. In between the two phases there can be an enormous amount of computation that is mostly pure. Furthermore, it is the type of computation that often allows for parallelisation. These are some of the reasons why, in my opinion, functional languages can be a good fit for this type of workload.

A standard Python implementation for the task at hand could be something like

from string import ascii_letters
from random import randrange, choice
from typing import Iterator, List

FILE_PATH = 'list.txt'

def gen_line() -> str:
    return ''.join((choice(ascii_letters) for _ in
                    range(randrange(3, 15)))) + '\n'

def gen_file(count: int) -> None:
    with open(FILE_PATH, 'w') as out_file:
        for line in range(count):
            out_file.write(gen_line())

def get_file(file_path: str) -> List[str]:
    with open(file_path, 'r') as in_file:
        str_list = []
        for line in in_file:
            str_list.append(line)
        return str_list

def mapping_fun(in_str: str) -> int:
    return len([char for char in in_str if char.isupper()])

def for_red(in_list: Iterator[int]) -> int:
    sum = 0
    for count in in_list: sum += count
    return sum

def buil_fil(in_list: List[str]) -> Iterator[str]:
    return filter(lambda row: len(row) > 3 and len(row) < 15, in_list)

def buil_map(in_list: Iterator[str]) -> Iterator[int]:
    return map(mapping_fun, in_list)

def sum_sum() -> int:
    return sum(buil_map(buil_fil(get_file(FILE_PATH))))

Here I'm using built-in functions that come with the standard Python distribution; the reader is welcome to check out the blog repository for alternative implementations, or better still experiment with their own.

The first two functions are used to generate the text file that will be used to run the performance tests. My testing environment is a laptop equipped with a 4-core i7 at 2 GHz and 8 GB of RAM running Debian Jessie; please note that running times will also differ depending on the text file, as it is created randomly. After generating the text file we can use IPython to time the function.

In [5]: %timeit mr.sum_sum()
1 loops, best of 3: 14.8 s per loop

This is the out of the box performance of Python, the program runs in constant memory and uses one CPU core only. It will provide the reference against which to compare Haskell.

The same program can be re-written in Haskell like so:

import Data.Char (isUpper)
import Data.List (foldl1')

filterStr :: [String] -> [String]
filterStr = filter (\s -> (length s > 3) && (length s < 15))

capCount :: String -> Int
capCount = foldl addUpper 0

addUpper :: Int -> Char -> Int
addUpper count char = if isUpper char then count + 1 else count

totCaps :: [String] -> Int
totCaps = sum . map capCount . filterStr

main :: IO ()
main = do
    file <- readFile "list.txt"
    print $ totCaps $ lines file

Let's compile the source code as is, with only the -rtsopts (Run Time System options) flag to allow for basic profiling; once that's done, let's run the executable with the +RTS -sstderr switch. This dumps to standard error a wealth of information regarding the performance of the program.

  28,213,590,928 bytes allocated in the heap
  12,120,060,960 bytes copied during GC
   2,229,231,024 bytes maximum residency (14 sample(s))
      47,675,496 bytes maximum slop
            4382 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     54149 colls,     0 par   10.30s   10.31s     0.0002s    0.0021s
  Gen  1        14 colls,     0 par    4.64s    5.99s     0.4282s    2.8306s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   15.46s  ( 15.41s elapsed)
  GC      time   14.94s  ( 16.31s elapsed)
  EXIT    time    0.04s  (  0.24s elapsed)
  Total   time   30.44s  ( 31.96s elapsed)

  %GC     time      49.1%  (51.0% elapsed)

  Alloc rate    1,824,941,198 bytes per MUT second

  Productivity  50.9% of total user, 48.5% of total elapsed

The first version of the compiled source actually runs at half the speed of Python and consumes as much memory as it can get its hands on, oh dear! On closer inspection of the printout it is clear that the amount of memory allocated on the heap, which therefore has to be handled by the garbage collector, is colossal. This is confirmed by the fact that almost half of the running time is taken by the GC itself. The root cause of the problem is most likely the last function in the pipeline, sum, which builds up a long chain of lazy thunks. Haskell is a lazy language, which means that the runtime evaluates a function only when its result is needed. But sum requires the entire list to be evaluated, and this has a knock-on effect all the way up to readFile, forcing it to read and allocate the entire 12-million-record list before actually doing any computation. So the first easy optimisation is to substitute sum with a better implementation:

totCaps = foldl1' (+) . map capCount . filterStr

The foldl1' function is a strict version of a left fold, which is best suited for working with very large or indeed boundless structures. After re-compiling, let's try running the program again.

  27,408,098,264 bytes allocated in the heap
   2,748,692,920 bytes copied during GC
         116,736 bytes maximum residency (1833 sample(s))
          41,984 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     50824 colls,     0 par    2.81s    2.81s     0.0001s    0.0003s
  Gen  1      1833 colls,     0 par    0.19s    0.17s     0.0001s    0.0003s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   15.05s  ( 15.25s elapsed)
  GC      time    3.00s  (  2.99s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   18.05s  ( 18.23s elapsed)

  %GC     time      16.6%  (16.4% elapsed)

  Alloc rate    1,821,378,140 bytes per MUT second

  Productivity  83.4% of total user, 82.5% of total elapsed

Now the runtime is doing much better: the GC allocates a lot less memory and is kept less busy. Notice that just by changing one function the entire chain of computation changed! This is one of the quirks of lazy evaluation: as the strict left fold forces complete evaluation at regular intervals, every function feeding it now behaves in the same manner. This means that the readFile function (the very first one in the computation chain) is no longer trying to read the entire file in one go, but runs in constant space!
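The strictness difference can be seen in miniature with a toy sketch (not from the post): foldl accumulates unevaluated thunks, while foldl' forces the accumulator at every step.

```haskell
import Data.List (foldl')

-- foldl builds the chain (((0 + 1) + 2) + 3) ... lazily, keeping every
-- intermediate thunk alive until the very end; foldl' evaluates the
-- accumulator eagerly, so the sum runs in constant space.
lazySum, strictSum :: [Int] -> Int
lazySum   = foldl  (+) 0   -- risks a huge thunk chain on very large lists
strictSum = foldl' (+) 0   -- constant-space
```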

If on the memory-management side things have improved dramatically, the actual MUT elapsed time did not change significantly. As it happens, the Glasgow Haskell Compiler ships with a few optimisation flags that can be turned on during compilation; the easiest to try out is -O (big O) or -O2. So let's try compiling and running both versions with the optimisation flag on. This is the version with sum:

  24,152,302,232 bytes allocated in the heap
   1,153,660,608 bytes copied during GC
         102,112 bytes maximum residency (2 sample(s))
          24,720 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     46408 colls,     0 par    1.22s    1.30s     0.0000s    0.0002s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   11.12s  ( 11.19s elapsed)
  GC      time    1.22s  (  1.30s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   12.35s  ( 12.49s elapsed)

  %GC     time       9.9%  (10.4% elapsed)

  Alloc rate    2,171,969,625 bytes per MUT second

  Productivity  90.1% of total user, 89.1% of total elapsed

And this is the one with foldl1':

  24,152,302,216 bytes allocated in the heap
   1,153,660,608 bytes copied during GC
         102,112 bytes maximum residency (2 sample(s))
          24,720 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     46408 colls,     0 par    1.14s    1.31s     0.0000s    0.0002s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   11.25s  ( 11.14s elapsed)
  GC      time    1.14s  (  1.31s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   12.39s  ( 12.46s elapsed)

  %GC     time       9.2%  (10.6% elapsed)

  Alloc rate    2,146,489,709 bytes per MUT second

  Productivity  90.8% of total user, 90.3% of total elapsed

This test is perhaps the most surprising: switching on the optimisation flag enables the compiler to do some really clever transformations. The strategy is smart enough to avoid mindless lazy evaluation and run the program in constant memory even with the plain sum function. Not only that (which is remarkable on its own), but the entire runtime is now faster on all metrics. Thanks to one flag, our program, despite being shorter than the Python version, is faster and does not consume more memory. Try beating that!

Python type hinting

The newest version of Python, 3.5, has recently been released. Together with some smaller improvements, the team delighted us with two big updates, one of which is, perhaps, the beginning of a revolution. What I'm referring to is the addition of type hinting to the Python language, which may snowball into one of the biggest changes since the move to version 3.

I'm being cautious because type hinting does not break backward compatibility, and is not even mandatory: developers can avoid using this feature completely. It will be interesting to see how well received and widespread hinting will be in the next few years, and how willing to change the Python community will be.

The history of type hinting is actually longer than I originally thought. Hinting builds on the little-known type annotation feature specified in PEP3107, which has been part of Python for several years already. Type annotations are information added to the declaration of functions and classes using standard Python syntax. CPython itself completely ignores those annotations at runtime, merely recording them on the function. For example:

>>> def add(x: int, y: int) -> int:
>>>   return x + y

is entirely identical to

>>> def add(x, y):
>>>   return x + y

As a proof running the following

>>> add(1, 2)
3
>>> add('hello ', 'reader')
'hello reader'

doesn't break, despite the type annotations; as a further proof, the behaviour is the same in all major versions of Python 3, 3.5 included.
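In fact the annotations are not thrown away: CPython stores them in the function's __annotations__ attribute, where tools (or curious users) can inspect them:

```python
def add(x: int, y: int) -> int:
    return x + y

# CPython records the annotations but never enforces them at runtime.
print(add.__annotations__)
# {'x': <class 'int'>, 'y': <class 'int'>, 'return': <class 'int'>}
```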

At this point type annotations might seem like a useless feature, because nothing in the official release actually uses them. In reality PEP3107 is more like an official response to a community need: some in the community wanted to add static type annotations to Python, and the official release provided a structured, endorsed way of doing it.

With time, external projects providing type checking sprung up, some of which are rather good. In particular MyPy inspired the Python developers to come up with a structured type system that could deal with the dynamic nature of the language while also providing optional static typing.

The result of that effort is PEP484, an extensive specification of a gradual typing system co-authored by Guido van Rossum and Jukka Lehtosalo, the author of the MyPy project. The purpose of this specification is to provide a rigorous description of static type checking that will help coordinate the efforts of the community. As the document states, type hinting is not static Python creeping in through the window (whether that would be good or bad is for another post), but a valuable bolt-on feature at the disposal of the programmer.

In fact Python 3.5 only contains a new typing module with classes and functions that comply with PEP484. That is to say, there is still no official static type checker shipping with Python, which explains why the examples shown above do not break even in Python 3.5.

While Python 3.5 ships with no type checker, the MyPy project implements one compatible with the latest version of Python 3.4 that can be fetched simply using pip. The MyPy checker is PEP484 compliant, though its implementation is not complete yet; that, however, will not prevent a curious mind from having some fun.

MyPy also comes with a typing module plus its very own type checker, which can be invoked by entering mypy <python source file> from the command line. The author considers the checker a linter that the user can run on source code.

The type system specified by PEP484 is actually quite rich; for the sake of this post I'm just going to introduce some of its salient features. For a start, let's go back to the previous function: if we run MyPy on the previous code we get

Argument 1 to "add" has incompatible type "str"; expected "int"
Argument 2 to "add" has incompatible type "str"; expected "int"

Here the type checker recognised the mistake of trying to add two strings.

The typing module comes with a series of convenience classes that help in creating types, for instance

def incBoth(x: int, y: int) -> (int, int):
    return (x + 1, y + 1)

will fail the check, throwing the exception

assert False, 'Unsupported type %s' % type
AssertionError: Unsupported type Tuple[int?, int?]

because the proper way to have tuples in PEP484 is to use the Tuple class, so the previous function becomes

def incBoth(x: int, y: int) -> Tuple[int, int]:
    return (x + 1, y + 1)

Another feature of Python's gradual typing is the possibility to create type aliases, that is, to bind a type constructor to a name for convenience's sake. Going back to the previous example, we can re-write the function as

from typing import Tuple

Pair = Tuple[int, int]

def incBoth(x: int, y: int) -> Pair:
    return (x + 1, y + 1)

So far we've seen only functions that take and return a single type, but what if we need more flexibility? For instance, the incBoth function will break if we pass a float instead of an integer, and yet addition is a valid operation for decimal numbers as well. This pattern is often referred to as generics, and rest assured Python is able to deal with that too. So if we want to re-write the function to accept any valid number in Python we can write:

from typing import TypeVar

Num = TypeVar('Num', int, float, complex)

def add(x: Num, y: Num) -> Num:
    return x + y

Here TypeVar is a type variable class that generates a data type that can be any one of the types passed to its constructor. This is reminiscent of ML-style algebraic type systems.
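To see the constrained type variable in action (a small sketch along the lines of the snippet above): the function still runs on any arguments at runtime, but a PEP484 checker such as MyPy will only accept the three listed numeric types.

```python
from typing import TypeVar

# Num may be bound to int, float or complex, and nothing else.
Num = TypeVar('Num', int, float, complex)

def add(x: Num, y: Num) -> Num:
    return x + y

print(add(1, 2))       # int arguments type-check: 3
print(add(1.5, 2.5))   # so do floats: 4.0
# add('a', 'b')        # runs fine at runtime, but a checker rejects it
```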

This is just a sample of the power and flexibility of the new Python type system, which also includes typing for classes, more powerful generics and more.

Mining Twitter - 2

In a previous post we started mining content from Twitter using a simple script activated every 10 minutes and a Cassandra back-end. After gathering data for a few weeks it's time to have some fun analysing what has been mined. This is where normal programming and data science start to differ significantly, for data analysis is often an explorative endeavour that should be carried out with a scientific mindset.

For this post we'll be using the excellent IPython notebook, which allows breaking the workflow down into units called cells, and coding and visualising in the same place.

As with every data analysis task, the first step is to fetch the data to be analysed. To do that, we first connect to Cassandra to get a rough idea of how much data we'll have to deal with; depending on that we can come up with a plan. First start the Cassandra shell with cqlsh from the command line.

cqlsh> SELECT COUNT(*) FROM trending.tweets;


Because of the small amount of data and the limited number of columns per row, we can afford to load all the data into the notebook in one go. To achieve that we simply add a few lines to the TwitterCassandraConnector class written for the previous post.

def read(self):
    query = 'SELECT * FROM trending.tweets;'
    self.session.row_factory = dict_factory
    return self.session.execute(query)

This method performs the simplest query and returns a paginator, an iterator that allows us to consume the query results asynchronously.

To start up the notebook, type ipython notebook from the command line. The notebook is a web-based interface that allows data to be analysed interactively; it does not make programming fast (which is why I still do the heavy lifting in vim), but for data analysis it is very good.

Data analysis is exploratory work: a data scientist will be given a black box and be asked to make sense of the data contained in the box. Initially it helps to have a few questions to answer, for our dataset it would be good to know how many retweets or how many people take part in trending topics. Amongst the data provided by Twitter we saved the number of retweets and the time of the oldest and newest post in the thread, these values will form the basis of our initial analysis.

Let's start by uploading our data from C*

cluster = gt.TwitterCassandraConnector()
cursor = cluster.read()

tweets = []

for item in cursor:
    tweets.append(item)

If you remember from the previous post the sampling time was 10 minutes, so it would make sense to group together all the posts that were captured at the same time. Also it is always good to have a sanity check after this operation, just to make sure that the data was processed correctly.

ten_min = util.group(tweets, itemgetter('creation_minute'))
for t in ten_min:
    res = ten_min[t]
    break

The reason why we're inspecting the first element of the ten_min structure with a broken-out-of for loop is that the util.group function is a generator: it does not return a list but an iterator, so to inspect it we have to actually materialise the grouping by running a for loop. Also notice that the res variable was inspected using res??; the double question marks are a nifty IPython feature for variable reflection. After making sure that the grouping was carried out correctly we may proceed.

Once we have formed our groups we can find out how many trending topics Twitter returned each time. We know from the documentation that it should be 10 each time; to make sure, we can quickly interrogate the data:

tweet_count = Counter([len(ten_min[trends]) for trends in ten_min])
tweet_count = [(key, tweet_count[key]) for key in tweet_count]
tweet_count = list(zip(*tweet_count))
plt.bar(tweet_count[0], tweet_count[1], align="center")
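The zip(*...) idiom in the snippet above transposes a list of (key, count) pairs into two parallel sequences, which is the shape plt.bar expects; in isolation:

```python
# Transposing pairs with zip(*...): one tuple of keys, one of counts.
pairs = [(10, 120), (9, 15), (8, 3)]
keys, counts = zip(*pairs)
print(keys)    # (10, 9, 8)
print(counts)  # (120, 15, 3)
```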

Inspecting the tweet_count variable tells us that in most cases Twitter actually returned 10 trending topics, but there were a few cases with fewer; it would be interesting to see whether that happened at night, when traffic is expected to be lower.

Satisfied that we have a relatively homogeneous dataset, we can start measuring trending topics. For now we can establish a measure of "virality" by checking the speed at which people contribute to a thread (velocity) and validating the initial findings with the number of retweets in each thread.

Measuring velocity once we have our groups is rather easy:

tweet_vel = []
for minute in ten_min:
    tweet_vel.append(map(tw.get_velocity, ten_min[minute]))

tweet_vel = list(chain(*tweet_vel))
plt.hist(tweet_vel, bins=50)
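The tw.get_velocity helper comes from the blog's own code; a hedged sketch, assuming each tweet dict stores the thread's oldest and newest post times and its tweet count (field names here are illustrative, not the original ones), could be:

```python
from datetime import datetime

def get_velocity(tweet):
    # Tweets per minute over the lifetime of the thread; the field names
    # are hypothetical stand-ins for the columns saved in Cassandra.
    span = (tweet['newest'] - tweet['oldest']).total_seconds() / 60.0
    return tweet['tweet_count'] / span if span else 0.0
```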

Plotting a histogram of the distribution we obtain this:

What the histogram tells us is that the vast majority of trending tweets are not very fast, and as speed increases the number of topics decreases sharply. An interesting finding is that after 20 tweets per minute there seems to be data only for multiples of 10. We can zoom in to make sure:

tweet_vel_filt = [vel for vel in tweet_vel if vel > 10]
plt.hist(tweet_vel_filt, bins=20)

The second plot confirms the finding. As it seems odd to have bars so precisely spaced, it makes sense to speculate that Twitter is binning the data behind the scenes and presenting it already cleaned up.

Any time a pattern is found in data, it is good practice to try to confirm it. To confirm the distribution of trending tweets we can plot another measure, the retweet count, and see if the distribution of the data is analogous.

retweets = []
for minute in ten_min:
    retweets.append(map(itemgetter('total_retweets'), ten_min[minute]))

retweets = list(chain(*retweets))

plt.hist(retweets, bins=50)

The plot shows that the distribution of retweets is very similar to that of tweet velocity, which confirms the initial findings. Interestingly, both distributions are not normal, but seemingly logarithmic or power law (the two are distinct and somewhat difficult to tell apart). Again we're faced with a hunch to confirm; to do that we could change the data representation slightly and plot another graph.


This last plot differs from all the previous ones as it is a scatter plot. More importantly, we've changed the abscissa from a linear to a logarithmic axis. If the distribution were perfectly exponential we would see a straight line, but as it happens real data is far from perfect. What we can see from this plot is that the top end of the most viral tweets has an approximate log(10) distribution. As we approach the long tail of the data (the left-hand side of the histogram and the right-hand side of the scatter plot) the tail grows even faster, i.e. more and more tweets have fewer retweets.

This was a very quick walkthrough of some of the analysis that can be done on a dataset. As often happens, understanding a few things about a dataset begs more questions to be answered; such is the life of the data scientist.

Currying and closures, functional programming only?

Currying and closures really shine in ML- or Lisp-style functional languages like Scala, Haskell, Scheme or Clojure, where they allow computation to be encapsulated elegantly. However, neither concept is implemented uniquely in functional languages: as ideas from different backgrounds converge, these days it is possible to have both in languages like Python which, despite being object-oriented, also partially implements the functional paradigm.

Before we kick off with some code, a bit of background: currying is named after the mathematician Haskell Curry, who developed techniques for partially applying functions in the lambda calculus.

Speaking of functions: in mathematics, a function is a relationship, or mapping, between a set of input values and a set of possible outputs. You can follow this [link](https://en.wikipedia.org/wiki/Function_%28mathematics%29) for a lengthy explanation of mathematical functions, but for now suffice it to say that a function can be thought of as a specification of how to manipulate an input to produce the output (mathematicians, please have mercy). Another important aspect of mathematical functions is that their specification is fixed, that is, they will always produce the same output given the same input.

Many programming languages also provide a construct called a function, although in most of them functions do not quite resemble their mathematical counterpart. For this reason it is perhaps easier to think of them as a technique for enclosing a bit of computation in a box so it can be accessed easily. In mathematics a function is just the definition of an operation, so each time we want to run the function we must supply all the input values. In programming, however, it is sometimes convenient to have a function access some of those values in the background, so that we can avoid supplying all of them every time. This behaviour is obtained with function currying, or partial application.

Currying can only be used in languages that feature first-class functions, that is, functions that can return functions or accept functions as parameters. Currying is rarely seen in pure OO languages like Java (at least up to 7), mainly because it is not a preferred technique in that paradigm. However, languages that partially implement functional programming, like Python, do allow it, and a programmer might find it useful. Here is an example implementation in Python.

def curry_pair(fun, first): return lambda second: fun(first, second)

The one above is the simplest implementation of a currying function: a function that takes another function and its first argument, and returns a new function ready to take the second argument and pass both to the original function. I find it amazing how 68 characters can pack in so many concepts. In action, the currying function could be used like so:

>>> def add(x, y): return x+y
>>> add_2 = curry_pair(add, 2)
>>> add_2(2)
4

The code shows that curry_pair has to both accept and return functions in order to work, which explains why only languages that implement first-class functions can allow currying.

Another concept, shown by curry_pair, that goes hand in hand with currying is the closure. A closure is a private namespace "closed" around the curried function, and it is what allows the curried function to access the values already supplied. Perhaps most importantly, rare exceptions apart, closures guarantee that the values already supplied can only be accessed, not changed. This property is sometimes called referential transparency, and it is what makes closed functions "mathematical": given a certain input, they will always return the same output.
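Python even lets us peek inside the closure: the supplied values live in cells on the returned function's `__closure__` attribute. A small sketch, reusing `curry_pair` and `add` from above:

```python
def curry_pair(fun, first): return lambda second: fun(first, second)

def add(x, y): return x + y

add_2 = curry_pair(add, 2)

# The already-supplied values (`fun` and `first`) are stored in closure
# cells attached to the returned lambda, not in any global namespace
cell_values = [c.cell_contents for c in add_2.__closure__]
print(cell_values)   # contains the captured `add` and 2
print(add_2(3))      # 5
```

The cells can be read but there is no ordinary way to rebind them from the outside, which is what gives the curried function its fixed, predictable behaviour.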

Finally, as the name implies, curry_pair can only work on functions that accept a pair of arguments. A Python workaround that makes the currying function deal with functions accepting an arbitrary number of parameters is:

def curry_2(fun, first):
    def wrapper(*args, **kwargs):
        return fun(first, *args, **kwargs)
    return wrapper

>>> def add_3(x, y, z): return x+y+z
>>> add_triplet = curry_2(add_3, 2)
>>> add_triplet(2, 2)
6

Which brings us further down the rabbit hole. You have probably already noticed that the definition of curry_2 looks suspiciously similar to that of a decorator. That is exactly the case, as a decorator can act as a "compile-time" currying function used to create a closure over another function. This is a good example of how ideas from maths and functional programming can trickle down to more industry-oriented paradigms like OO.

This is, however, as far as it is wise to draw the parallel, because decorators are not just currying functions: they can also be used on classes or even modules. Moreover, decorators create closures only when the module is first loaded, not at any point during run-time.
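To make the parallel concrete, here is a sketch of curry_2 driven through decorator syntax (the `curried_with` helper is made up for this example), alongside `functools.partial`, the standard-library function that packages the same idea:

```python
import functools

def curry_2(fun, first):
    def wrapper(*args, **kwargs):
        return fun(first, *args, **kwargs)
    return wrapper

def curried_with(first):
    # Hypothetical decorator factory: returns a decorator that
    # pre-applies `first` via curry_2 at module load time
    def decorator(fun):
        return curry_2(fun, first)
    return decorator

@curried_with(2)
def add(x, y):
    return x + y

print(add(3))  # 5 -- x was fixed to 2 when the decorator ran

# The standard library ships this idea as functools.partial
add_10 = functools.partial(lambda x, y: x + y, 10)
print(add_10(5))  # 15
```

In everyday Python code, `functools.partial` is usually preferred over a hand-rolled currying function, since it also handles keyword arguments and carries useful introspection attributes like `func` and `args`.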

Now that we've seen currying in an OO language, I'd like to show an example in Haskell, a language that curries functions out of the box. For example, foldr (similar to reduce in Python) takes 3 arguments: a function of 2 arguments, an initial value, and a sequence of values. Let's say we want to use a fold to calculate the sum of a list of integers; we can easily define it as:

Prelude> let sum = foldr (+) 0 :: [Int] -> Int
Prelude> sum [1,2,3,4,5]
15

Now the sum function is ready to accept any list of integers. What if we want the product of a list of Int (which, for the list [1..n], is the factorial of n)?

Prelude> let factorial = foldr (*) 1 :: [Int] -> Int
Prelude> factorial [1,2,3,4,5]
120

And isn't that just beautiful? Those two brief examples show how powerful fully implemented currying can be in allowing a programmer to re-use simple, general building blocks to create more specific ones.
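For comparison, the same two building blocks can be sketched in Python with `functools.reduce` and `functools.partial` (the names `list_sum` and `list_product` are made up; note that `reduce` folds from the left, but for these associative operations the result is the same):

```python
from functools import partial, reduce
from operator import add, mul

# partial plays the role of Haskell's built-in currying: fix the
# folding function now, supply the list later
list_sum = partial(reduce, add)
list_product = partial(reduce, mul)

# Unlike the Haskell versions, no initial value is fixed here,
# so calling either on an empty list would raise a TypeError
print(list_sum([1, 2, 3, 4, 5]))      # 15
print(list_product([1, 2, 3, 4, 5]))  # 120
```

The translation is more verbose than the Haskell one-liners, which is exactly the point of the article: languages that curry by default make this style of re-use effortless.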