In a previous post we started mining content from Twitter with a simple script that ran every 10 minutes and a Cassandra back-end. After gathering data for a few weeks it's time to have some fun analysing what has been mined. This is where ordinary programming and data science start to differ significantly, because data analysis is often an exploratory endeavour that should be carried out with a scientific mindset.
For this post we'll be using the excellent IPython notebook, which lets us break the workflow down into units called cells and code and visualise in the same place.
As with every data analysis task, the first step is to fetch the data to be analysed. To do that we first connect to Cassandra to get a rough idea of how much data we'll have to deal with; depending on that we can come up with a plan. First start the Cassandra shell by typing
cqlsh at the command line.
cqlsh> SELECT COUNT(*) FROM trending.tweets;

 count
-------
 31523

(1 rows)
cqlsh>
Because of the small amount of data and the limited number of columns per row, we can afford to load all the data into the notebook in one go. To achieve that we simply add a few lines to the
TwitterCassandraConnector class written for the previous post.
from cassandra.query import dict_factory

def read(self):
    query = 'SELECT * FROM trending.tweets;'
    self.session.row_factory = dict_factory  # return rows as dicts
    return self.session.execute(query)
This method performs the simplest possible query and returns a paged result set, an iterator that lets us consume the rows lazily as the driver fetches them.
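The dataset here is small enough that the driver's defaults are fine, but if the table grows it can help to set an explicit page size. A minimal sketch of a variant method, using cassandra-driver's SimpleStatement (the read_paged name is hypothetical, not part of the class from the previous post):

from cassandra.query import SimpleStatement, dict_factory

def read_paged(self, fetch_size=1000):
    # Fetch rows in pages of fetch_size; the returned result set
    # transparently requests the next page as we iterate over it.
    statement = SimpleStatement('SELECT * FROM trending.tweets;',
                                fetch_size=fetch_size)
    self.session.row_factory = dict_factory
    return self.session.execute(statement)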
To start up the notebook, type
ipython notebook at the command line. The notebook is a web-based interface for analysing data interactively; it does not make programming faster (which is why I still do the heavy lifting in vim), but for data analysis it is very good.
Data analysis is exploratory work: a data scientist is handed a black box and asked to make sense of the data inside it. Initially it helps to have a few questions to answer; for our dataset it would be good to know how many retweets there are, or how many people take part in trending topics. Amongst the data provided by Twitter we saved the number of retweets and the times of the oldest and newest posts in each thread, and these values will form the basis of our initial analysis.
Let's start by loading our data from C*:
from operator import itemgetter
from itertools import chain
from collections import Counter
import matplotlib.pyplot as plt

cluster = gt.TwitterCassandraConnector()
cluster.connect(['127.0.0.1'])
cursor = cluster.read()
tweets = []
for item in cursor:
    tweets.append(item)
cluster.close()
If you remember from the previous post, the sampling time was 10 minutes, so it makes sense to group together all the posts that were captured at the same time. It is also good practice to run a sanity check after this kind of operation, just to make sure the data was processed correctly.
ten_min = util.group(tweets, itemgetter('creation_minute'))
for t in ten_min:
    res = ten_min[t]
    break
res??
The reason we're inspecting the first element of the ten_min structure with this truncated for loop is that the
util.group function is a generator: it does not return a list but an iterator, so to inspect it we have to actually materialise the grouping by iterating over it. Also notice that the res variable was inspected using
res??; the double question mark is a nifty IPython feature for object introspection. After making sure that the grouping was carried out correctly, we may proceed.
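util.group itself lives in the helper module from the previous post and isn't reproduced here; an eager equivalent producing the same key-to-rows mapping might look like this (a sketch, not the original implementation, which is lazy):

from collections import defaultdict

def group(items, key):
    # Collect items sharing the same key(item) value into one list.
    # Unlike the original util.group, this version materialises the
    # whole mapping immediately.
    grouped = defaultdict(list)
    for item in items:
        grouped[key(item)].append(item)
    return grouped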
Once we have formed our groups we can find out how many trending topics Twitter returned at each sampling time. We know from the documentation that it should be 10 each time; to make sure, we can quickly interrogate the data:
tweet_count = Counter([len(ten_min[trends]) for trends in ten_min])
tweet_count = [(key, tweet_count[key]) for key in tweet_count]
tweet_count?
tweet_count = list(zip(*tweet_count))
plt.bar(tweet_count[0], tweet_count[1], align="center")
plt.show()
The tweet_count variable tells us that in most cases Twitter did return 10 trending topics, but there were a few cases where it returned fewer. It would be interesting to see whether those happened at night, when traffic is expected to be lower.
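A quick way to test that hunch, assuming the creation_minute keys are datetime objects (a sketch, not something run for this post):

# Hour of day for each sample that came back with fewer than 10 topics.
short_hours = [minute.hour for minute in ten_min if len(ten_min[minute]) < 10]
print(Counter(short_hours))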
Satisfied that we have a relatively homogeneous dataset we can start measuring trending topics. For now we can establish a measure of "virality" by checking the speed at which people contribute to a thread (velocity) and validate the initial findings with the number of retweets in each thread.
Measuring velocity once we have our groups is rather easy:
tweet_vel = []
for minute in ten_min:
    tweet_vel.append(map(tw.get_velocity, ten_min[minute]))
tweet_vel = list(chain(*tweet_vel))
plt.hist(tweet_vel, bins=50)
plt.show()
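tw.get_velocity belongs to the same helper module and isn't shown in this post; a plausible sketch, given that we stored the timestamps of the oldest and newest posts in each thread (the field names here are assumptions, not the actual schema):

def get_velocity(tweet):
    # Tweets per minute over the thread's lifetime, computed from the
    # oldest and newest timestamps saved for the thread.
    span = tweet['newest_post'] - tweet['oldest_post']
    minutes = max(span.total_seconds() / 60.0, 1.0)  # avoid division by zero
    return tweet['tweet_count'] / minutes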
Plotting a histogram of the distribution we obtain this:
What the histogram tells us is that the vast majority of trending tweets are not very fast, and as speed increases the number of topics decreases sharply. An interesting finding is that above 20 tweets per minute there seems to be data only at multiples of 10. We can zoom in to make sure:
tweet_vel_filt = [vel for vel in tweet_vel if vel > 10]
plt.hist(tweet_vel_filt, bins=20)
plt.show()
The second plot confirms the finding. As it seems odd to have bars so precisely spaced, it is reasonable to speculate that Twitter is binning the data behind the scenes and presenting it already cleaned up.
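One way to probe that speculation (a sketch; it assumes the velocities above the threshold come back as exact values):

# True would support the binning hunch: every fast velocity is a multiple of 10.
high = [vel for vel in tweet_vel if vel > 20]
print(all(vel % 10 == 0 for vel in high))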
Any time a pattern is found in data, it is good practice to try to confirm it. To confirm the distribution of trending tweets we can plot another measure, the retweet count, and see whether its distribution is analogous.
retweets = []
for minute in ten_min:
    retweets.append(map(itemgetter('total_retweets'), ten_min[minute]))
retweets = list(chain(*retweets))
plt.hist(retweets, bins=50)
plt.show()
The plot shows that the distribution of retweets is very similar to that of tweet velocity, which confirms the initial findings. Interestingly, neither distribution is normal; they look logarithmic or power law (the two are distinct and somewhat difficult to tell apart). Again we're faced with a hunch to confirm; to do that we can change the data representation slightly and plot another graph.
retweets.sort(reverse=True)
plt.plot(retweets)
plt.xscale('log')
plt.show()
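To help separate the two hypotheses we could go one step further: on a log-log plot a power law appears as a roughly straight line, while a logarithmic decay does not. A quick sketch:

# Rank-frequency view: straightness on log-log axes suggests a power law.
plt.plot(range(1, len(retweets) + 1), retweets)
plt.xscale('log')
plt.yscale('log')
plt.show()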
This last plot differs from all the previous ones: each retweet count is plotted against its rank, and, more importantly, we've changed the abscissa from a linear to a logarithmic scale. If the distribution were perfectly exponential we would see a straight line, but as it happens real data is far from perfect. What we can see from this plot is that the top end of the most viral tweets approximately follows a logarithmic (base-10) distribution. As we approach the long tail of the data (the left-hand side of the histogram and the right-hand side of this plot) the tail grows even faster, i.e. more and more tweets have fewer retweets.
This was a very quick walkthrough of some of the analysis that can be done on a dataset. As often happens, understanding a few things about a dataset raises yet more questions to be answered; such is the life of the data scientist.