Public tweets can be mined by companies looking to track human activity, monitor advertisement success and improve infrastructure. Alistair Charlton talks to Twitter data analyst Alistair Leak to find out more.

Tweets can be mined to better understand modern life, and even to attempt to make it better. (Credit: Reuters)

On the surface, Twitter is a crowded and noisy environment; businesses and news outlets mix with celebrities, parodies, politicians, Justin Bieber fans, more parodies and those simply trying to have a conversation.

It's messy and grows more congested with every user you follow, a world away from the controlled, walled garden of Facebook. But what if these tweets could be monitored, catalogued and combined with other datasets like the census and electoral register to create meaningful and valuable information?

That's exactly what a team of data miners at University College London is doing with the Uncertainty of Identity project. Though still in its early stages, the mining of data from tweets and their authors could become a powerful tool used to monitor and improve on modern life.

Currently working on a project called 'Data Mining to Understand International Dimensions to Online Identity' as part of his PhD, Alistair Leak explains that although you may think Twitter is free, you are in fact paying for the service with your details:

"Twitter doesn't make money from you directly, but it does make money from what you do and say. There are two sides to Twitter, you in essence are paying for the service with your information, and that's not something everyone realises."

As part of his project he has so far collected a 1% sample of tweets from all over the world, which amounts to around 400 million 140-character messages collected over the past eight months.

All the tweets

Twitter provides this 1% for free to anyone who asks for it, although this of course excludes tweets from users who have their account protected. But this is only the tip of the iceberg, with firms like DataSift getting access to every single tweet, which Leak tells me costs "millions of pounds."

Using the information within the tweets themselves, and in the profiles of their authors, Leak says that through a long-term study "you can start to build up a picture of how individuals live their lives; maybe not guess which house they live in, but you can guess the rough area they live in and the area they work in. From that you can start to establish their socio-economic characteristics.

"Initially we take the screen name or the user name and try to extract the first name and surname if it's available."

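The name-extraction step Leak describes could look something like the sketch below. It is only an illustration, not the project's actual method: the function names `split_name` and `split_handle` are hypothetical, and the rule of treating the first and last alphabetic words as first name and surname is an assumption.

```python
import re

def split_name(display_name):
    """Guess (first_name, surname) from a profile name; parts that cannot be guessed are None."""
    # Keep alphabetic runs (allowing O'Brien or Smith-Jones); ignore digits, emoji, punctuation.
    parts = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", display_name)
    if not parts:
        return (None, None)
    first = parts[0].title()
    # Assumption: the last word is the surname, if there is more than one word.
    surname = parts[-1].title() if len(parts) > 1 else None
    return (first, surname)

def split_handle(handle):
    """Split a camel-case or underscore-separated handle such as '@AlistairLeak' into words."""
    words = re.findall(r"[A-Z][a-z]+|[a-z]+", handle.lstrip("@"))
    return [w.capitalize() for w in words]

print(split_name("Alistair Leak"))
print(split_handle("@AlistairLeak"))
```

A real pipeline would also need to check the guessed words against a list of known forenames, since many profile names are not names at all.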

The next step is to find a use for the collected data. One example Leak gave was a train company researching which station or route to improve first.

"As much as it's easy to put a questionnaire in the station, people don't use them or at least not as often as they should. With hashtags and your own company Twitter account you can start to build up an idea of how users feel. But if people tweet the train is running late and don't say where or when, it's not really much use to the train companies."

Instead, Leak explains how you could look at data collected from tweets close to the station or route in question.

"This can't be done instantly, but you can build this data up over a period of months to build a better picture of the complaints."


People often tweet about where they are or where they're going. By pairing this with the location data embedded within each tweet (providing they choose to disclose their location, of course), it's possible to build up a better understanding of travel patterns.

"You can look at someone and find out their home, work and commute by looking at their tweets," Leak says. "On their own they're just a line, but lay that over an Ordnance Survey map and you can see they follow a railway. Then see the cluster of tweets at the start and end of that line, one in a residential area and one in a business area, and then you can make a fair assumption that this is where they live and that's where they work."

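As a rough sketch of the home-and-work guess Leak describes, geotagged tweets can be snapped to a coarse grid and counted by time of day. Everything specific here is an assumption made for illustration: the function names, the grid size of 0.01 degrees (on the order of a kilometre), and the choice of 8pm to 7am as "home" hours and 9am to 5pm as "work" hours.

```python
from collections import Counter

def grid_cell(lat, lon, size=0.01):
    """Snap a coordinate to a grid cell; size is in degrees (an assumed resolution)."""
    return (round(lat / size), round(lon / size))

def guess_home_and_work(tweets, size=0.01):
    """tweets: iterable of (hour_of_day, lat, lon). Returns (home_cell, work_cell)."""
    home, work = Counter(), Counter()
    for hour, lat, lon in tweets:
        cell = grid_cell(lat, lon, size)
        if hour >= 20 or hour < 7:
            home[cell] += 1      # evening and night tweets count towards "home"
        elif 9 <= hour < 17:
            work[cell] += 1      # office-hours tweets count towards "work"
    # The busiest cell in each time window is taken as the guess, or None if no data.
    home_cell = home.most_common(1)[0][0] if home else None
    work_cell = work.most_common(1)[0][0] if work else None
    return home_cell, work_cell
```

Tweets sent during the commute fall outside both windows here, which is why they show up as the "line" between the two clusters rather than in either one.
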
Add this data to the name, gender and age of users, and a clearer picture emerges. Incidentally, if a user doesn't reveal their age, Leak can use debit card records (which hold full names and dates of birth) to roughly work out a user's age based on when names come in and out of fashion. But it doesn't always work: "a search for Alistair suggests we're both 40..."
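
One simple way to turn name fashions into an age guess is to average the birth years of everyone on record with that first name. The sketch below assumes this averaging approach, which the article does not confirm; `estimate_age` is a hypothetical helper and the counts are invented purely for illustration.

```python
def estimate_age(first_name, births_by_year, reference_year):
    """Estimate age as reference_year minus the count-weighted mean birth year for a name."""
    counts = births_by_year.get(first_name.lower())
    if not counts:
        return None  # name not in the records, so no guess is possible
    total = sum(counts.values())
    mean_year = sum(year * n for year, n in counts.items()) / total
    return round(reference_year - mean_year)

# Invented counts, for illustration only; real figures would come from a records dataset.
toy_births = {"alistair": {1965: 100, 1975: 300, 1985: 100}}
print(estimate_age("Alistair", toy_births, reference_year=2013))
```

An estimate like this describes the typical holder of a name, not any individual, which is exactly why it can be wrong for a given user.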

Of the 1% sample of tweets Leak has access to, approximately 760,000 UK users include their location in every tweet they post. For those who don't, a mention of a town, city or country in their profile bio can be automatically scanned, although this makes the data less useful on a local scale.
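
Scanning a bio for a place name can be sketched as a lookup against a list of known places. The helper `place_from_bio` and the idea of a simple list-based gazetteer are assumptions for illustration, not a description of the project's tooling.

```python
import re

def place_from_bio(bio, gazetteer):
    """Return the first gazetteer place name found as a whole word in the bio, or None."""
    text = bio.lower()
    for place in gazetteer:
        # Word boundaries stop 'York' matching inside 'Yorkshire'.
        if re.search(r"\b" + re.escape(place.lower()) + r"\b", text):
            return place
    return None

print(place_from_bio("Data nerd. Based in Leeds, UK", ["London", "Leeds"]))
```

Plain matching of this kind is easily fooled, for example by someone who loves "reading" rather than living in Reading, so the results would need checking before use.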

A universal census

While the census is undeniably a large-scale and reliable means of recording the population and how we live our lives, there is no international standard for collecting the data, making it difficult to monitor trends across countries.

"One of the exciting parts of this data collection," Leak says, "is that the format of the data from Twitter will be consistent in any country. Whereas historically you'd have to create a methodology and an analysis for every individual country, with Twitter you can look at France, or America or wherever and start to significantly expand your knowledge without adjusting the way you find it."

When asked about how data collected from Twitter could be used to gauge the success of a marketing campaign, Leak suggested the Space Academy campaign currently run by Lynx deodorant.

"Very simply you could look at Twitter and say 'we have 200,000 people talking about Lynx Space Academy' but then you could look at it over time, revealing a peak in interest. When that starts to drop you can start to intelligently drag it out [with new adverts] to add more peaks each time interest starts to fall.

"Then you can ask 'who do we want to appeal to?' You can then extract users' first name from tweets about the campaign, and if that shows 60% male of course that's good because it's a male product. Then you could add billboards to an area and if the amount of Twitter activity in this specific area has increased, it should mean the location of the advert has been effective."

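The two checks Leak describes, spotting when interest peaks and falls away, and estimating the gender split of people tweeting, can be sketched as follows. The function names, the 50% drop threshold and the name-to-gender lookup table are all assumptions made for illustration.

```python
def peak_and_drop(daily_counts, drop_fraction=0.5):
    """Return (peak_index, drop_index): the busiest day, and the first later day
    whose count falls below drop_fraction of the peak (None if it never does)."""
    if not daily_counts:
        return None, None
    peak_index = max(range(len(daily_counts)), key=daily_counts.__getitem__)
    threshold = daily_counts[peak_index] * drop_fraction
    for i in range(peak_index + 1, len(daily_counts)):
        if daily_counts[i] < threshold:
            return peak_index, i
    return peak_index, None

def male_share(first_names, name_gender):
    """Fraction of recognised first names that the lookup table marks as male ('m')."""
    known = [name_gender[n.lower()] for n in first_names if n.lower() in name_gender]
    if not known:
        return None  # no recognised names, so no estimate
    return known.count("m") / len(known)
```

In this sketch the drop index would be the cue to release a new advert, and names missing from the lookup table are left out of the share rather than guessed.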

All this monitoring of tweets doubtless raises concerns about privacy, but Leak remains optimistic that people will continue to use their real name and offer up enough personal information to make the collected data trustworthy.

"Our optimism is that, mostly, people will use their real name. Back in the days of chat rooms and forums the internet was relatively anonymous, whereas now we want people to find us through the internet, especially if you use Twitter professionally, you use something close to your real name."

Looking to the future, no one yet knows whether Twitter data mining can be trusted enough to become a long-term success, and speaking to Leak highlights just how much development the process still requires. But with hundreds of millions of tweets posted every day, it can't be long before the eureka moment arrives and genuinely useful, rich data is extracted.