Big data problems are tackled with many different components and systems, and one of the big challenges is interoperability. Plugging components together to move data from one system to another can carry high costs.
The Apache Arrow project is a good example of how to address issues of cost, speed and flexibility associated with interoperability. It's a collaboration of a dozen or so big data projects that decided to create a piece of technology enabling them to plug their systems together more efficiently, move data around very fast and process it in memory without a great deal of conversion.
Someone who is passionate about projects like this is Wes McKinney, the venerated data scientist who started the Pandas open source project while at hedge fund AQR Capital Management. McKinney has spent the last seven years working in the Python open source ecosystem, latterly working for Cloudera, building integrations between the Python open source data science stack and the big data Hadoop ecosystem.
He returned to the quantitative trading world last year as a software engineer at Two Sigma Investments. He will be giving a presentation at Newsweek's forthcoming Data Science Capital Markets event in London.
McKinney said disparate data management systems spend 80-90% of the time converting between one format and another. "Each system proverbially speaks a different language, so this [Apache Arrow] establishes a kind of efficient lingua franca for data that we can use to make the whole greater than the sum of its parts."
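The conversion cost McKinney describes can be illustrated with a minimal sketch that uses only the Python standard library. It is not Apache Arrow and does not use Arrow's API; it simply contrasts round-tripping numbers through a text format with handing a consumer a view onto the same in-memory buffer. The variable names and sample values are made up for illustration.

```python
import json
from array import array

# A column of 64-bit floats held in one contiguous binary buffer.
prices = array("d", [10.5, 20.25, 30.0, 40.75])

# Route 1: convert to a text format, then parse it back.
# Every value is formatted, copied and re-parsed along the way.
encoded = json.dumps(prices.tolist())
decoded = array("d", json.loads(encoded))

# Route 2: share the existing buffer. A memoryview exposes the same
# bytes to a second consumer without copying or converting anything.
view = memoryview(prices)

# Because the view shares memory with the original, an update made
# through one is visible through the other.
prices[0] = 99.0

print("round-tripped copy still holds:", decoded[0])
print("shared view now sees:", view[0])
print("bytes shared without conversion:", view.nbytes)
```

The second route is the idea behind an agreed in-memory format: if both sides understand the same layout, nothing needs to be translated when data changes hands.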
Last year, as part of the Apache Arrow project, McKinney worked with Hadley Wickham, a well-known developer in the R community. Together they built a small file format called Feather, an interoperable, high-speed data storage format for R and Python that has become very popular in both communities.
"You find that many data scientists are using both R and Python in their work and so they are able to sort of break down that wall and be able to transition more fluidly between the environments," said McKinney.
"There are certain tasks where R is a stronger tool, particularly in data visualisation and statistics, and there are certain tasks where Python is the stronger tool, in particular software engineering and machine learning. To be able to build a hybrid analysis environment where you can easily move back and forth is very useful."
This technology is also very relevant to the Apache Spark project, which features Python and R programming interfaces. These are generally slower than the native Scala interface. "You can use Spark with Python and R but you pay a performance penalty due to inefficient data transfer," said McKinney.
This work is being carried out as open source at Two Sigma, which has been collaborating with IBM and some of the Spark developers on building a better, tighter integration between Python and Apache Spark. Apache Arrow is the data interoperability technology being used to build that bridge, something McKinney is speaking about at the Spark Summit in Boston.
Two Sigma employs a large research staff and requires a first-class engineering team to drive innovation on its data science platform. Given the rapid pace of innovation in recent years, the firm has chosen to leverage the best of what's available in the open source technology stack. For example, Two Sigma has built an open source project called Flint, which is a scalable time series analytics package for Spark.
McKinney said: "That's filling a major need in the ecosystem. Spark excels on traditional SQL-type relational data and ETL (extract, transform, load) workloads; there's less of a strong tool for time series data and we work with a great deal of time series data, so that's one area where we are investing. We believe that participating in open source is the right way to go about that. We have also gotten involved in the Python Pandas project."
From his experience working on Pandas, McKinney says he learned lots of valuable things from industry users, who would bring to the table the real world problems they encounter. The daily grind of data cleaning may not seem too sexy, but it has helped define new features to be added to the project.
"You would be amazed at the number of different data input formats that you see in the wild; over the course of years we have had to evolve the tools in Pandas to be able to accommodate the needs of hundreds of thousands of users around the world," he said.