When people tell computers what to do.

Mining Twitter - 2

In a previous post we started mining content from Twitter using a simple script activated every 10 minutes and a Cassandra back-end. After gathering data for a few weeks it's time to have some fun analysing what has been mined. This is where normal programming and data science start to differ significantly, for data analysis is often an explorative endeavour that should be carried out with a scientific mindset.

For this post we'll be using the excellent IPython notebook, which lets us break the workflow down into units called cells and lets us code and visualise in the same place.

As with every data analysis task the first step is to fetch the data to be analysed. To do that we first connect to Cassandra to get a rough idea of how much data we'll have to deal with; depending on that we can come up with a plan. First start the Cassandra shell with cqlsh from the command line.

cqlsh> SELECT COUNT(*) FROM trending.tweets;

 count
-------
31523
(1 rows)
cqlsh>

Because of the small amount of data and the limited number of columns per row we can afford to load all the data into the notebook in one go. To achieve that we simply add a few lines to the TwitterCassandraConnector class written for the previous post.

def read(self):
    query = 'SELECT * FROM trending.tweets;'
    self.session.row_factory = dict_factory
    return self.session.execute(query)

This method performs the simplest possible query and returns a paginated result set, an iterator that lets us consume the results lazily, page by page.

To start up the notebook type ipython notebook from the command line. The notebook is a web-based interface that lets you analyse data interactively; it does not make programming faster (which is why I still do the heavy lifting in vim), but for data analysis it is very good.

Data analysis is exploratory work: a data scientist is handed a black box and asked to make sense of the data inside it. Initially it helps to have a few questions to answer; for our dataset it would be good to know how many retweets there are, or how many people take part in trending topics. Amongst the data provided by Twitter we saved the number of retweets and the time of the oldest and newest post in each thread, and these values will form the basis of our initial analysis.

Let's start by loading our data from C*:

# Connect to the local Cassandra node and pull the whole table into memory.
cluster = gt.TwitterCassandraConnector()
cluster.connect(['127.0.0.1'])
cursor = cluster.read()

tweets = []

for item in cursor:
    tweets.append(item)

cluster.close()

If you remember from the previous post the sampling interval was 10 minutes, so it makes sense to group together all the posts that were captured at the same time. It is also always good to have a sanity check after this kind of operation, just to make sure the data was processed correctly.

ten_min = util.group(tweets, itemgetter('creation_minute'))

# Grab the first group and inspect it as a sanity check.
for t in ten_min:
    res = ten_min[t]
    break
res??

The reason we inspect the first element of the ten_min structure with a for loop that breaks immediately is that the util.group function is a generator: it does not return plain lists but an iterator, so to inspect the result we have to materialise the grouping by running a for loop. Also notice that the res variable was inspected using res??; the double question mark is a nifty IPython feature for object introspection. After making sure the grouping was carried out correctly we can proceed.
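util.group is a small helper from the mining module that is not shown in the post. A minimal sketch of what it might look like, assuming it simply maps each key to the list of items sharing that key (the real helper may well build the groups lazily instead):

from collections import defaultdict

def group(items, keyfunc):
    # Hypothetical stand-in for util.group: bucket items by keyfunc(item).
    groups = defaultdict(list)
    for item in items:
        groups[keyfunc(item)].append(item)
    return groups

# Usage mirroring the post:
# ten_min = group(tweets, itemgetter('creation_minute'))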

Once we have formed our groups we can find out how many trending topics Twitter returned each time. We know from the documentation that it should be 10 each time; to make sure, we can quickly interrogate the data:

# Count how many trending topics were captured in each 10-minute sample.
tweet_count = Counter([len(ten_min[trends]) for trends in ten_min])
tweet_count = [(key, tweet_count[key]) for key in tweet_count]
tweet_count?
tweet_count = list(zip(*tweet_count))
plt.bar(tweet_count[0], tweet_count[1], align="center")
plt.show()

Inspecting the tweet_count variable tells us that in most cases Twitter did return 10 trending topics, but in a few cases it returned fewer. It would be interesting to see whether that happened at night, when traffic is expected to be lower.
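A quick way to probe that hunch (a hedged sketch, assuming the group keys are datetime objects and that the grouping behaves like a dict) is to count at which hours the undersized samples were captured:

from collections import Counter

# Hours of the day at which fewer than 10 trending topics were returned.
short_sample_hours = Counter(
    minute.hour
    for minute in ten_min
    if len(ten_min[minute]) < 10
)
print(short_sample_hours)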

Satisfied that we have a relatively homogeneous dataset, we can start measuring trending topics. For now we can establish a measure of "virality" by checking the speed at which people contribute to a thread (velocity) and validate the initial findings against the number of retweets in each thread.
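tw.get_velocity is another helper from the mining module that the post does not show. A plausible sketch, assuming each record summarises the batch of up to 50 tweets returned by the search API and that velocity means tweets per minute over the span between the oldest and newest tweet in that batch:

# Hypothetical implementation of tw.get_velocity, not the original code.
TWEETS_PER_BATCH = 50  # the search call asked for 50 tweets per topic

def get_velocity(record):
    # Tweets per minute between the oldest and newest tweet in the record.
    span = record['newest_tweet_time'] - record['oldest_tweet_time']
    minutes = max(span.total_seconds() / 60.0, 1.0)  # guard against a zero span
    return TWEETS_PER_BATCH / minutes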

Measuring velocity once we have our groups is rather easy:

# Velocity (tweets per minute) for every trending topic in every sample.
tweet_vel = []
for minute in ten_min:
    tweet_vel.append(map(tw.get_velocity, ten_min[minute]))

tweet_vel = list(chain(*tweet_vel))
plt.hist(tweet_vel, bins=50)
plt.show()

Plotting a histogram of the distribution we obtain this:

What the histogram tells us is that the vast majority of trending tweets are not very fast, and as speed increases the number of topics decreases sharply. An interesting finding is that above 20 tweets per minute there seems to be data only at multiples of 10. We can zoom in to make sure:

tweet_vel_filt = [vel for vel in tweet_vel if vel > 10]
plt.hist(tweet_vel_filt, bins=20)
plt.show()

The second plot confirms the finding. As it seems odd to have bars so precisely spaced, it is reasonable to speculate that Twitter is binning the data behind the scenes and presenting it already rounded.
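One quick way to test the multiples-of-10 hunch (a small illustrative check, not from the original post) is to look at the remainders modulo 10 of the faster velocities:

from collections import Counter

# If the values really are binned, almost every remainder should be zero.
remainders = Counter(int(round(vel)) % 10 for vel in tweet_vel if vel > 20)
print(remainders)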

Any time a pattern is found in the data it is good practice to try to confirm it. To confirm the distribution of trending tweets we can plot another measure, the retweet count, and see whether its distribution is analogous.

# Retweet counts, grouped and flattened the same way as the velocities.
retweets = []
for minute in ten_min:
    retweets.append(map(itemgetter('total_retweets'), ten_min[minute]))

retweets = list(chain(*retweets))

plt.hist(retweets, bins=50)
plt.show()

The plot shows that the distribution of retweets is very similar to that of tweet velocity, which confirms the initial findings. Interestingly, neither distribution is normal: they look exponential or power law (the two are distinct and somewhat difficult to tell apart). Again we are faced with a hunch to confirm; to do that we can slightly change the data representation and plot another graph.

retweets.sort(reverse=True)
plt.plot(retweets)
plt.xscale('log')
plt.show()

This last plot differs from all the previous ones: it is a rank plot of the sorted values rather than a histogram, and, more importantly, the abscissa has been changed from linear to logarithmic. If the distribution were perfectly exponential we would see a straight line, but real data is rarely that tidy. What we can see from this plot is that the top end of the most viral tweets follows an approximately logarithmic trend. As we approach the long tail of the data (the left-hand side of the histogram and the right-hand side of this plot) the decay gets even faster, that is, more and more tweets have fewer retweets.

This was a very quick walkthrough of some of the analysis that can be done on a dataset. As often happens, understanding a few things about a dataset raises even more questions to answer; such is the life of the data scientist.

Currying and closures, functional programming only?

Currying and closures really shine in ML- or Lisp-style functional languages like Scala, Haskell, Scheme or Clojure, where they allow computation to be encapsulated elegantly. However, neither concept is implemented exclusively in functional languages: as ideas from different backgrounds converge, these days it is possible to have both in languages like Python, which despite being OO also partially implements the functional paradigm.

Before we kick off with some code, a bit of background: currying is named after the mathematician Haskell Curry, who developed techniques for partially applying functions in lambda calculus.

Speaking of functions: in mathematics a function is a relationship, or mapping, between a set of input values and a set of possible outputs. You can follow this [link](https://en.wikipedia.org/wiki/Function_%28mathematics%29) for a lengthy explanation of mathematical functions, but for now suffice it to say that a function can be thought of as a specification of how to manipulate an input to produce an output (mathematicians, have mercy). Another important aspect of mathematical functions is that their specification is fixed, that is, they will always produce the same output given the same input.

Many programming languages also provide a construct called a function, although in most of them functions do not quite resemble their mathematical counterpart. For this reason it is perhaps easier to think of them as a technique for enclosing a bit of computation in a box so it can be accessed easily. In mathematics a function is just the definition of an operation, so each time we want to apply it we must supply all the input values. In programming, however, it is sometimes convenient to have a function access some of its values in the background, so that we can avoid supplying all of them each time. This behaviour is obtained with function currying, or partial application.

Currying can only be used in languages that feature first-class functions, that is, functions that can return functions or accept functions as parameters. Currying is rarely seen in pure OO languages like Java (at least up to 7), mainly because it is not a preferred technique in that paradigm. However, some languages that partially implement functional programming, like Python, allow it, and a programmer might find it useful. Here is an example implementation in Python.

def curry_pair(fun, first): return lambda second: fun(first, second)

The one above is the simplest implementation of a currying function, that is, a function that takes another function and its first argument and returns a new function ready to take the second argument and pass it on to the original. I find it amazing how 68 characters can pack in so many concepts. In action the currying function can be used like so:

>>> def add(x, y): return x+y
>>> add_2 = curry_pair(add, 2)
>>> add_2(2)
4

The code shows that curry_pair has to both accept and return functions in order to work, which explains why only languages that implement first-class functions allow currying.

Another concept that goes hand in hand with currying, and that curry_pair also shows, is the closure. A closure is a private namespace "closed" around the curried function, and it is what allows the curried function to access the values that were already supplied. Perhaps most importantly (rare exceptions aside), closures guarantee that the values already supplied can only be accessed, not changed. This is sometimes called referential transparency, and it is what makes closed functions "mathematical": given a certain input they will always return the same output.
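To see the closure at work, note that the value captured by curry_pair stays fixed even if the variable we passed in is later rebound (a small illustrative example, not from the original post):

>>> base = 2
>>> add_base = curry_pair(add, base)
>>> base = 100       # rebinding the outer name...
>>> add_base(3)      # ...does not affect the value closed over
5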

Finally, as the name implies, curry_pair can only work on functions that accept a pair of arguments. A Python workaround that lets the currying function deal with functions accepting an arbitrary number of parameters is:

def curry_2(fun, first):
    def wrapper(*args, **kwargs):
        return fun(first, *args, **kwargs)
    return wrapper

>>> def add_3(x, y, z): return x+y+z
>>> add_triplet = curry_2(add_3, 2)
>>> add_triplet(2, 2)
6

Which brings us further down the rabbit hole. You have probably already noticed that the definition of curry_2 looks suspiciously similar to that of a decorator. That is exactly the case, as a decorator can act as a "compile-time" currying function used to create a closure over another function. This is a good example of how ideas from maths and functional programming can trickle down to more industry-oriented paradigms like OO.
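To make the parallel concrete, here is a hedged sketch of the same currying idea expressed as a decorator factory (a hypothetical example, not from the original post):

import functools

def prepend(first):
    # Return a decorator that curries `first` into the decorated function.
    def decorator(fun):
        @functools.wraps(fun)
        def wrapper(*args, **kwargs):
            return fun(first, *args, **kwargs)
        return wrapper
    return decorator

@prepend(2)
def add(x, y):
    return x + y

print(add(3))  # prints 5: the first argument is already baked in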

This, however, is as far as it is wise to draw the parallel, because decorators are not just currying functions: they can be applied to classes as well as functions. Also, decorators create their closures only when the module is first loaded, not at arbitrary points during run time.

Now that we've seen currying in an OO language, I'd like to show an example in Haskell, a language where functions are curried out of the box. For example foldr (roughly, reduce in Python) takes 3 arguments: a function taking 2 arguments, an initial value and a sequence of values. Let's assume we want to use a fold to calculate the sum of a list of integers; we could easily define it as:

Prelude> let sum = foldr (+) 0 :: [Int] -> Int
Prelude> sum [1,2,3,4,5]
15

Now the sum function is ready to accept any list of integers. What if we want to calculate a factorial by multiplying a list of Int together?

Prelude> let factorial = foldr (*) 1 :: [Int] -> Int
Prelude> factorial [1,2,3,4,5]
120

And isn't that just beautiful? Those two brief examples show how powerful fully implemented currying can be in allowing a programmer to re-use simple, general building blocks to create more specific ones.
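For comparison, here is a rough Python analogue of the same idea using functools.partial (hedged: Python's reduce folds from the left rather than the right, but for these commutative operators the result is the same):

>>> from functools import partial, reduce
>>> import operator
>>> sum_ints = partial(reduce, operator.add)   # roughly foldr (+) 0 on non-empty lists
>>> sum_ints([1, 2, 3, 4, 5])
15
>>> product = partial(reduce, operator.mul)    # roughly foldr (*) 1 on non-empty lists
>>> product([1, 2, 3, 4, 5])
120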

Mining Twitter - 1

Amongst social networks Twitter is by far the one that makes life easiest for programmers. This is because Twitter has been maintaining a high-quality REST API from the very beginning, which makes querying Twitter a doddle. For the data scientist, however, things are different: Twitter does not release a dump of all the tweets, and even with full access to a live feed just managing the traffic would be an enormous challenge. And we haven't even started analysing that data yet!

Still, not all is lost: the Twitter API has end-points that let us get some interesting data, so for this post series we're going to have a go at mining and analysing tweets that Twitter says are trending in the London area.

It is really important to stress at this point that what Twitter returns is not "what's going on in London" but a filtered report of what some Twitter users in London are interested in. In other words, a dataset collected in this fashion is biased by whatever Twitter wants us to see or not see.

Data capture

To do data analysis it is necessary to first have data, and as Twitter is an online service we'll also need an online system to capture the data we need. The easiest thing would be knocking out a simple script that saves data to a CSV file and triggering it with a cron job. But this is the 21st century, so doing something a bit more sophisticated than that won't be hard.

To capture data from Twitter we're going to use a Python script, scheduled every 10 minutes, that calls the API, does a modicum of processing, and saves the records to a Cassandra back-end. If this seems too sophisticated compared to the CSV approach, I assure you it is not :)

Cassandra instance

This is the easiest part: just go to the Datastax website, choose your operating system and follow the installation instructions, and you'll have a Cassandra instance up and running in 10 minutes.

Mining script

For this example we're going to use a simple Python script triggered as a background process, and we'll leave it running for a few days (or weeks) in order to retrieve enough data. First up is contacting the Twitter API: just go to the Twitter app management page and create a new app. This will provide you with OAuth keys to access the REST API.

In this example I'm saving the keys in a YAML configuration file that is loaded every time the script runs; this makes it very easy to plug in your own credentials and avoids exposing mine ;)

LONDON_WHOEID: 44418
APP_KEY:
APP_SECRET:
ACCESS_TOKEN:

Instead of calling the API URLs manually we're going to use a Python library that manages authentication and calls for us, specifically Twython. Please note that it is neither the only wrapper nor necessarily better than the others, so feel free to explore alternatives!

from yaml import load, dump
from twython import Twython

# Retrieve keys from configuration file
with open('twitter_keys.yaml', 'r') as file:
    keys = load(file)

# Retrieve the access token if not already present.
if not keys['ACCESS_TOKEN']:
    tw = Twython(keys['APP_KEY'], keys['APP_SECRET'], oauth_version=2)
    keys['ACCESS_TOKEN'] = tw.obtain_access_token()

    with open('twitter_keys.yaml', 'w') as file:
        file.write(dump(keys))

else:
    tw = Twython(keys['APP_KEY'], access_token=keys['ACCESS_TOKEN'])

# Retrieve trending topics for London
latest_trends = tw.get_place_trends(id=keys['LONDON_WHOEID'])
latest_trends = latest_trends[0]['trends']

Each time the API is called Twitter returns a list of trending topics for the required area. Each trend object contains a field called "query" that we'll use to perform another call, this time to the search API. The search returns a list of 50 tweet objects recently added to the topic's thread, and from those tweet objects we'll extract some data to be logged for later analysis.
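The post does not show this per-trend step in full. A hedged sketch of how it might look with Twython (field names come from the standard trend and tweet objects; the derived values mirror the columns of the table defined below):

from dateutil.parser import parse

# Hypothetical sketch of the per-trend processing, not the original script.
for trend in latest_trends:
    # Use the trend's 'query' field to search for the latest tweets in the thread.
    statuses = tw.search(q=trend['query'], count=50)['statuses']
    if not statuses:
        continue

    retweets = [s['retweet_count'] for s in statuses]
    times = [parse(s['created_at']) for s in statuses]

    record = {
        'name': trend['name'],
        'promoted_content': bool(trend.get('promoted_content')),
        'query': trend['query'],
        'url': trend['url'],
        'total_retweets': sum(retweets),
        'max_retweets': max(retweets),
        'newest_tweet_time': max(times),
        'oldest_tweet_time': min(times),
    }
    # ...add the creation_day / creation_minute timestamps, then log the record.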

An important part of using a NoSQL database is data modelling; what people say about NoSQL being schemaless isn't quite true (that would be a lot more akin to a jam jar than a database!). Data modelling in C*, however, is certainly different from the relational model we're used to: in the latter one thinks about how to store the data, while with C* one has to think about how the data is going to be queried. As a general rule it is best to forget about normalisation and realise that replication is good.

As the number of records saved per day is relatively small (a few hundred) I've chosen the day the record was saved as the partition key. This will allow fetching a whole day of records, or indeed the whole column family, easily. The preferred way to interact with C* is to use the Cassandra Query Language (CQL) for all operations, from creating tables, to inserting or updating rows, to querying the database.

class TwitterCassandraConnector(object):
    cluster = None
    session = None

    def create_schema(self):
        trending_tweets_keyspace = \
            """CREATE KEYSPACE IF NOT EXISTS trending
            WITH REPLICATION = { 'class': 'SimpleStrategy',
            'replication_factor': 1 };"""

        trending_tweets_table = \
            """CREATE TABLE IF NOT EXISTS trending.tweets(

            name                TEXT,
            promoted_content    BOOLEAN,
            query               TEXT,
            url                 TEXT,
            number_zeros        INT,
            total_retweets      INT,
            max_retweets        INT,
            newest_tweet_time   TIMESTAMP,
            oldest_tweet_time   TIMESTAMP,

            creation_day        TIMESTAMP,
            creation_minute     TIMESTAMP,
            created_at          TIMEUUID,

            PRIMARY KEY( (creation_day), created_at )
            )
            WITH CLUSTERING ORDER BY (created_at DESC);"""

        self.session.execute(trending_tweets_keyspace)
        self.session.execute(trending_tweets_table)

    def log(self, params):

        tweet_insert_query = """INSERT INTO trending.tweets(
                name,
                promoted_content,
                query,
                url,
                number_zeros,
                total_retweets,
                max_retweets,
                newest_tweet_time,
                oldest_tweet_time,
                creation_day,
                creation_minute,
                created_at)
                VALUES(
                '{name}',
                {promoted_content},
                '{query}',
                '{url}',
                {number_zeros},
                {total_retweets},
                {max_retweets},
                '{newest_tweet_time}',
                '{oldest_tweet_time}',
                '{creation_day}',
                '{creation_minute}',
                NOW());"""

        self.session.execute(tweet_insert_query.format(**params))
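With this schema in place, a single partition read is enough to fetch everything captured on a given day. A hedged example using the cassandra-driver, where session is the connected session held by the connector class and the date value is purely illustrative:

from datetime import datetime

# Fetch every record stored under one creation_day partition.
day_query = "SELECT * FROM trending.tweets WHERE creation_day = %s;"
rows = session.execute(day_query, [datetime(2015, 3, 1)])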

Finally, to launch the mining script we'll provide a simple shell script that activates the virtual environment for the session and performs all the operations required. After checking that everything works as expected we'll let the script run, ideally for a month, to gather data, and then we'll have some fun analysing it!

You can find the complete script on this GitHub repository.

Functional Programming Adoption

Functional programming lately has generated so much buzz that even the corporate world is turning its heavy head to have a look.

With Scala and Clojure the Java Virtual Machine (JVM) now boasts two mature functional languages that can be used alongside legacy Java codebases. The other big corporate player, Microsoft, added F#, another language with a functional approach, to its toolchain. Because of all these recent developments one might be led to believe that FP is a brand new cutting-edge technology, as I did when I first heard about it. In reality FP is one of the oldest programming paradigms, and even the corporate king Microsoft quietly started financing research on the purely functional language Haskell as far back as 1998, an effort still active to this day.

But if functional programming was born around 1959, why is it only now becoming a viable option? In other words, why is the industry only now becoming interested in the functional approach? Is functional a better paradigm than imperative or Object Oriented (OO)? And if so, will FP be able to repeat or even top the success of OO and become the most adopted paradigm?

Is FP better?

In the 1930s it was proved that as long as a language is Turing complete it is as expressive as it can be; in other terms, all Turing complete languages have the same expressive power. This means the same problems can be solved in any such language: if it can be done in C it can be done in Scheme and vice versa. With this in mind, FP as a paradigm is not inherently more powerful than OO or imperative programming, which immediately tells us that higher power cannot be the reason why FP is now being talked about and adopted more widely.

But even if all Turing complete languages are mathematically equivalent, in practice a language's paradigm, features and libraries make a big difference in how a user goes about solving a specific problem. On top of that every programmer has his or her own taste: I like language X while the workmate next to me likes Y; X and Y are supposed to be equivalent, but we would never swap them, because we don't enjoy using the other language. Using what we like makes us productive and makes us enjoy what we do. So ultimately the choice of language makes a difference in how productive, enjoyable and easy developing software is. In this respect functional languages have an edge over other paradigms, as we'll shortly see.

First things first

All computers have been able to do since their inception is take a sequence of actions and execute them. Actions can be "read data from memory", "do some calculation", "write the result on screen", and so on. It is no surprise then that the first paradigm to evolve was the imperative one, featured by languages like Fortran, Cobol or C. This is only logical, because all imperative languages do is make the do-one-thing-at-a-time business of computers a bit more human-like.

Imperative languages are relatively easy to grok, at least in their basic functionality. Despite being very different from any natural language they allow simple concepts to be expressed rather simply: assigning a value to a variable is a bit like giving names to things or people; a data type is a bit like different varieties of fruit (apples go with apples, oranges with oranges); and a loop is a bit like going through a phonebook to find the number we're after. We also intuitively understand actions that can only be done in sequence: you first break the eggs, then beat them, then mix them with flour and then bake the mixture. It is easy to see that changing the order of the actions makes a big difference, which is to say that the result of the whole program depends on when each action is executed. By the same token it is just as easy to understand that the outcome of an action can affect the outcome of all the following actions.

However, when the programs we write grow in size and have to do more complex things, matters start getting complicated. Because imperative languages tend to be so close to machine-speak, it becomes difficult to express abstract concepts. The capacity to abstract gives us the possibility of writing more general programs that can be used not just in one specific case but in many situations. Generality also makes it easier to break a program down into logical parts that can be composed into a complex system. As the human brain struggles to keep many concepts in mind at once, generality and modularity are a big bonus, and this is where the imperative paradigm is found wanting.
It was with this divide et impera intent that in the late 60s researchers came up with the idea of providing programmers with a new built-in abstraction: the object. Fast forward a few years and objects were added to C in what became C++. C++ gave imperative programmers a gentler entry into the world of objects, allowing them to keep using the imperative style while getting accustomed to objects as they went along. The model was so successful that objects are now regularly taught in most university courses, and OO is by far the most used paradigm.

Objects were successful because they solved some of the issues of modularity and generality, but they carried with them an increasingly heavy baggage. The OO programmer's must-read is a book written to help tame the complexity created by the very abstraction that was meant to tame complexity. Despite being widely successful the OO paradigm is not a cure-all: it has not completely replaced C, and it is found wanting in certain applications that are becoming topical. Of all the problems IT is facing today concurrency is one of the most talked about, and concurrent computations are notoriously hard to tackle with imperative or OO languages.

To picture concurrency, imagine having 10 people on the line who constantly need to talk to each other, while you control a switchboard that can only deal with one call at a time. All calls must progress in strict sequence, and the two parties must be matched exactly every time. And, oh, by the way: if you mess up even one call the whole program crashes, and if you block calls the wrong way it may hang forever. In imperative or OO languages this is mostly in the programmer's hands, and just to give a taste of how hard it is I'll cite Java's own documentation:

It is our basic belief that extreme caution is warranted when designing and building multi-threaded applications … use of threads can be very deceptive … in almost all cases they make debugging, testing, and maintenance vastly more difficult and sometimes impossible. Neither the training, experience, or actual practices of most programmers, nor the tools we have to help us, are designed to cope with the non-determinism … this is particularly true in Java … we urge you to think twice about using threads in cases where they are not absolutely necessary …

Which finally brings us back to the subject of this post: functional programming.

Why did it take you so long?

Generality, modularity and concurrency are where FP languages really shine. Because of the restrictions imposed by the paradigm, the programmer is forced to break the program down into small units, known as functions. Functions are also usually more restrictive than objects, trading flexibility for the guarantee that they execute consistently no matter when or where in the program they are called.

This allows for great modularity: a part of the program can be paused and resumed at will (concurrency), or even run on a different machine with the result collected when done (parallelism). It is important to stress that the last two features do not have to be managed directly by the programmer; the compiler can take care of them. Now you can see why it starts to make sense to use functional languages.

So why, despite all these features addressing pressing matters, did FP adoption come so late, and why is it still so low? There are certainly a few reasons, but to me the most important is how steep the entry barrier to the functional world is. As we saw before, simple imperative programming is relatively easy, and in most OO languages you do not have to commit to the Object fully, or at least not right away. In other words, the entry barrier of most OO languages is low and, perhaps even more importantly, the learning curve is gradual.

FP, on the other hand, is a different story. FP is not based on how a machine works but on mathematical theories, chiefly lambda calculus. That is to say that something as intuitive as a recipe for a cake is replaced by mathematics, which, together with contemporary art, must be one of the most abstract things mankind has come up with. There are still concepts shared between functional and other paradigms, but many intuitive imperative concepts are replaced by much more abstract mathematical ones: assignments become bindings, loops mutate into recursion, and all of a sudden people start talking about state and big O notation.

All this mind shifting imposes a big burden on the programmer; the entry barrier rises and becomes steeper, as it takes a lot more time and effort to start writing meaningful programs.

World domination?

Being a keen functional programmer myself I should be happy that FP is finally being recognised, and in fact I really am. However, there is still something that spoils the party. "FP is good for tackling complexity and concurrency" sounds a lot like "FP is a necessary evil for tackling complexity and concurrency", and from the corporate CTO's point of view it actually is like that. So although functional languages are excellent candidates to address current issues (and many more), I would really be surprised if even one of them got an adoption rate above 5%. What I see in the future are tools and systems programmed in functional languages, but in such a way that the functional nature of the tool is hidden to make it accessible to a wider audience. This is a shame, because functional programming is not just for the programmer elite: it makes you productive and it's jolly good fun!