
Thasos specialises in converting real-time location data from mobile phones into "actionable information". The start-up has spent six years in stealth – one of those on-site at a $10bn-plus hedge fund.

Before Thasos, its founder Greg Skibiski was CEO of Sense Networks, another location data analytics company, which was recognised by Bloomberg Businessweek in the "Top 25 Most Intriguing New Businesses in the World" and named "The Next Google" on the cover of Newsweek. Since the Thasos website went live about three months ago, it too has begun to pick up awards.

Skibiski is an expert on big data and particularly nerdy when it comes to location data, which he has something of a love/hate relationship with. "It's a horrible dataset because it's really difficult to work with, but it's also an amazing dataset because it's so highly dimensional and has relevance to many important things," he says.

It took Thasos a number of years just to build the evaluation framework that grades the quality of data sources and determines whether those sources have predictive capabilities. A good dataset would involve many millions of users over multiple years. And once these high-quality data sources are feeding into the production platform, there are always upstream errors, which make some percentage of them unusable at any given time.

Being overly reliant on one upstream data source means that if something goes wrong with it, the signal may be incorrect without anyone realising, which can cost a fund millions. So it's important to have multiple independent streams of location data, notes Skibiski – unlike analysing credit card data, for example, where there might be just one credit card data provider and the data is going to be more or less the same accuracy all the time for any given ticker.

Skibiski recounted how hard it can be to pinpoint where such errors originate. "One time, we had all this erroneous data streaming in and we couldn't figure out what was going on until we finally tracked [the error] down with the data provider, and it took us over a month. Remember this is data exhaust in many cases so they [providers] are not using it for anything and not looking at it very carefully."

The example he gave involved servers that were processing the incoming location data from the phones. A timer on each server would grab the data file at a set time every day and send it on. One of the servers got too busy and was grabbing and sending the file before all the data had finished writing into it.

"That meant we saw less location pings at certain places, which caused us to think that the number of people there was less, which produced a bad read on the foot traffic for a particular retailer. It took us and the source a long time to figure out and de-bug that it was a quirky server error on a single machine, deep in the middle of their infrastructure," said Skibiski.
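One common way to guard against this kind of half-written file, sketched here purely as an illustration and not as a description of what the provider actually did, is for the writer to build the file under a temporary name and rename it into place only once it is complete. The function names `write_atomically` and `read_pings` and the `.partial` suffix are hypothetical.

```python
import os
import tempfile


def write_atomically(path, lines):
    """Write every line to a temporary file, then rename it into place.

    A reader polling `path` sees either the previous complete file or the
    new complete file, never a half-written one (assumes the temporary
    file and the target are on the same filesystem).
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # the rename is the "file is ready" signal
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_pings(path):
    """Read the published file; hypothetical stand-in for the daily grab."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]
```

With this pattern, a timer that fires early simply picks up the previous day's complete file (or finds none) rather than a truncated one, which is a failure that is much easier to notice.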

"We now have many different sources and combine them all together so that they cross-validate each other, so we know that we have a stable signal and we can quantify the expected error rate against ground truths that we track."

Once an anomaly has been detected, the offending data source is rotated out of the main feed; the problem must then be tracked down and fixed, and the historical data corrected. By the time the source is put back into the main feed, another one has usually broken. This is normal day-to-day business, said Skibiski.
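The cross-validation and rotation described above could look something like the following minimal sketch. It is not Thasos's method: it assumes each source has already been normalised to an estimate of the same quantity (say, daily visits to one store), uses the median across sources as the consensus, and treats the 25% tolerance as an arbitrary illustrative threshold. The function name `split_feeds` and the source labels are hypothetical.

```python
from statistics import median


def split_feeds(estimates, tolerance=0.25):
    """Compare each source's estimate with the cross-source median.

    `estimates` maps a source name to its (already normalised) estimate.
    Sources whose relative deviation from the median exceeds `tolerance`
    are rotated out; the rest stay in the main feed.
    """
    consensus = median(estimates.values())
    healthy, rotated_out = {}, {}
    for source, value in estimates.items():
        # Relative deviation from the consensus; 0.0 if the consensus is zero.
        deviation = abs(value - consensus) / consensus if consensus else 0.0
        if deviation > tolerance:
            rotated_out[source] = value
        else:
            healthy[source] = value
    return consensus, healthy, rotated_out


# Hypothetical daily visit estimates for one store from four sources.
consensus, healthy, rotated_out = split_feeds(
    {"source_a": 1000, "source_b": 1040, "source_c": 980, "source_d": 610}
)
print(consensus)    # 990.0
print(rotated_out)  # {'source_d': 610}
```

A median is used rather than a mean so that a single broken source, such as one sending truncated files, cannot drag the consensus towards itself.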

"In addition, for each feed of location data, the data is generated using some unique methodology, at different frequency rates, by groups of people that have different biases: one is older people; one is younger people; one is metro people; one is country people, driving around in cars.

"There are many different biases and noise or error patterns in each source, so all of these have to be corrected and normalised separately, and monitored independently against ground truths to make sure that all of the sources are more or less correct, and also cross-validate each other. Only after all this is working reliably do you have even the beginnings of a real-time feed that can be used by investors as a primary source.

"The real test isn't in finding the high quality data, though that does take forever given the amount of legal contracts and negotiation involved. It's the error correction, noise reduction, and normalisation processes – that's what our PhDs have spent years developing, that's the key IP that makes it work. And there's just no substitute for time in solving these kinds of complex problems. Every month we would measure success by watching the Bloomberg and seeing that we had shaved a few basis points off our out-of-sample error rate."
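To make the idea of normalising a source against ground truth and tracking an out-of-sample error rate concrete, here is a deliberately simple sketch. It is an assumption-laden toy, not the proprietary process Skibiski describes: it assumes a single source undercounts true foot traffic by a constant factor, fits that factor by least squares on a calibration period, and measures mean absolute percentage error on later days it has not seen. All names and figures are hypothetical.

```python
def fit_scale(source_counts, ground_truth):
    """Least-squares scale factor k minimising sum((k * s - g) ** 2)."""
    numerator = sum(s * g for s, g in zip(source_counts, ground_truth))
    denominator = sum(s * s for s in source_counts)
    if denominator == 0:
        raise ValueError("source counts are all zero; cannot fit a scale")
    return numerator / denominator


def mean_abs_pct_error(predicted, actual):
    """Mean absolute percentage error, as a fraction (0.01 = 100 basis points)."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Calibration period: the source sees roughly a tenth of true visits.
scale = fit_scale([100, 120, 90], [1000, 1200, 900])

# Out-of-sample period: apply the fitted scale to days not used for fitting.
predicted = [scale * s for s in [110, 95]]
error = mean_abs_pct_error(predicted, [1100, 1000])
print(f"scale={scale:.1f}, out-of-sample error={error:.3f}")
```

Evaluating only on days that were not used to fit the scale is what makes the figure an out-of-sample error; in practice each source would need its own correction for the demographic and sampling biases mentioned in the quote.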

Another challenge came from the direction of the cloud, recalls Skibiski. At the time, his company processed everything on Amazon AWS, but one of his big data sources did everything on Azure. It turned out the data was so big it could not be moved over from Azure's datacenter to the AWS datacenter in the same city and also processed within the same day. "We had to rewrite everything to run also in Azure because we had no choice," he said.

"So it's a massive amount of effort with location data. It took us years to be able to figure out how to do all this well enough to sell into the financial services industry. And keep in mind this is the second company I've started that just analyses location data, so I walked in knowing quite a bit.

"If you get raw credit card data, for instance, it's highly valuable and not as hard to work with – it's the poster child for alternative data. Location data, however, can be used to analyse many more things than just consumer spending. For example, hours worked in a company's factories, contractors shopping at lumberyards for building supplies, or employment across different sectors. And don't forget our data is international, that's a whole new world for real-time economic measurement.

"Of course if location data was as easy to work with as other kinds of data, you wouldn't need companies like Thasos, so I guess we've got to be grateful."

Greg Skibiski, founder of Thasos, will be talking about big data analytics at Newsweek's AI and data science conference in New York.