Enigma, a New York-based data technology company, has found clever ways to link together disparate data from more than one hundred thousand public datasets.
Matching attributes within data as diverse as H-1B visa applications, curated US sanctions, White House visitor logs and bills of lading might sound like a rather obscure pastime to many of us, but to those who understand the growing value of data, this is gold.
Enigma CEO and co-founder Hicham Oudghiri explains that "anything a government would file officially in the public domain we have set up to grab". Metadata is in the company's DNA, he says, and in fact one of the hardest problems is building a truly canonical capacity for linking very disparate data together.
"Various datasets are highly unlikely to speak to each other. The way one dataset refers to a company will be different from another; the way one dataset refers to a person will be different another. So we have to think about how you match attributes across datasets."
Clients who pay for Enigma's data as a commercial service (the company also offers Enigma Public, a range of free curated datasets) are buying programmatic access and availability through APIs and, in certain cases, curation.
Oudghiri used oil and gas datasets as an example: "We collect datasets from each and every state, so harmonising these under one schema, one paradigm, one way of referring to them - that would be the main value-add.
"In this instance we are literally going to each and every regulator in the process; some folks have that data available online, others require that we actually file the right to receive and it's sent on CDs and DVDs on a recurring basis.
"At any given moment there can be two dozen third parties, each with their own process, and their own timing for when they release the data. So all this data comes in and then needs to be loaded and cleansed; some of these government providers are running off mainframe systems that use 1980s technology, and we have to connect to them using something as old as a COBOL," he said.
These datasets are usually touched only by people who work in some area of the oil business, as part of their own processes, and are not generally used to extract value beyond that. But they are extremely valuable to people in the markets who may have been used to weekly reporting by the Energy Information Administration.
"It's about getting closer and closer to the source of information," said Oudghiri. "Some folks are even thinking about flying drones over certain production fields to see how empty or full the tanks are. But generally when you are getting closer to the source, the source always looks messier, and always needs more and more disambiguation."
Today Enigma works with many Fortune 500 companies, in areas ranging from financial services compliance to pharmaceutical safety monitoring and healthcare. Hedge funds and asset managers are some of the firm's most sophisticated and data-hungry clients.
"We have gotten to know a bunch of them; I'd say roughly three out of the top five quantitative firms are clients of ours."
Oudghiri said the funds tend to be secretive about their use cases, but they are generally looking for more granular content, as opposed to using aggregated statistics to do modelling. "Take import and export flows, as reported at the end of the month by countries. From Enigma they would be getting more granular data, day to day: here's actually what's on each and every shipping container.
"They are moving away from aggregate time series statistics and reconstituting that in a more customised panel of data by actually looking at the underlying event, observed transaction by transaction."
Hedge funds also tend to differ from other enterprise clients in that they are moving to operationalise what they do, said Oudghiri.
"They've bought in, they have methods, they have teams. Much like we do, they are looking to operationalise things like ingestion and data modelling. It's more like one contiguous machine, as opposed to a swat team that hones on a problem and throws out a theory and creates some sort of strategy for a trade."