Wall Street is big business, and it is about to become even bigger with the rise of big data.
It is every investor's dream to know the direction of the market before it moves, which is why financial investment firms are driven to mine for data rather than for gold in the information economy. Traditionally, investors have based their decisions on fundamentals, intuition, and analysis drawn from traditional data sources, such as quarterly earnings reports, financial statement filings to the U.S. Securities and Exchange Commission (SEC), historical market data, institutional research reports and sometimes the so-called "expert networks."
The new data-driven paradigm, fueled by new alternative data sources, high-performance computing and predictive analytics, offers a more robust framework to generate data-driven investment theses. Data – from satellite images of areas of interest, automated drones, people-counting sensors, container ships' positions, credit card transactional data, jobs and layoffs reports, cell phones, social media, news articles, tweets, online search queries – is now the most valuable commodity for Wall Street. Applying predictive analytics to these alternative data sources can help discover and contextualize insights that can produce better predictions. Knowing something that only a few others know affords firms a competitive edge and positions them better to forge new strategies.
In the world of finance, the new data paradigm entails applying predictive analytics to new datasets that are collected from non-traditional financial data sources to discover novel and consistently predictive features, and potentially useful patterns about the entity in question beyond what is easily available from traditional financial data sources. For example, anonymized mobile signals that can show how many people are in an area of interest at a given time are now an instance of alternative financial data that can be used with other financial data to deduce an evidence-based insight to help make investment decisions.
In another instance, data scientists mined satellite images of shopping mall parking lots as an alternative data source to predict revenue numbers of major retailing corporations. In several cases, the number of cars in parking lots has the potential to be a predictive feature of retail corporations' sales numbers. Applying predictive analytics to satellite imagery goes beyond retail and food chains; it has also been used to make decisions in large agricultural investments. In particular, applying deep learning algorithms to satellite imagery can help investors assess potential investee companies. I was part of a project that analyzed nighttime satellite images of Earth to help predict changes in GDP per square kilometer. The hypothesis was that the intensity of light in a geographical area is correlated with the GDP of that area. Investors would buy that information to make stock allocation decisions.
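As a rough illustration of the nighttime-lights hypothesis, the sketch below fits a straight line relating light intensity to GDP per square kilometer. It is a minimal example, not the method used in the project: the `fit_line` helper is hypothetical, the numbers are invented for illustration, and a real study would work from calibrated satellite radiance and official GDP figures.

```python
# Illustrative sketch only: all values below are made up, not real satellite or GDP data.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical per-region values (arbitrary units).
light_intensity = [1.0, 2.0, 3.0, 4.0, 5.0]
gdp_per_km2 = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(light_intensity, gdp_per_km2)

# Estimate GDP per square kilometer for a region with light intensity 6.0.
predicted = slope * 6.0 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

A positive slope on real data would be consistent with the hypothesis, though a correlation alone would not establish that light intensity is a reliable predictor.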
Eyes in the sky are watching agricultural crops, and predictive analytics algorithms are applied to crop images to assess crop health and predict soil moisture and yield numbers. Furthermore, many hedge funds have been using satellite imagery to make decisions regarding oil and energy investments around the world. Predictive analytics can help process satellite images of oil tanks and can be used to estimate the level of crude oil in the tanks, revealing the supply of crude oil in the country of interest. As satellite images become increasingly available at various levels of resolution and frequency, new markets are emerging in precision agriculture, forestry, and the management of events such as floods, droughts, oil spills, and illegal fishing, which are directly linked to a number of investment decisions.
This data revolution in finance inspired me to start researching and developing an evidence-based decision support framework for understanding financial markets. Over the last few years, I have been working on several practical predictive analytics use cases of different levels of complexity, domains, data sources and business impacts, which ultimately led me to realize the need for a comprehensive framework with a fundamentally new approach. The framework mines an array of heterogeneous data sources and generates a set of hypotheses, their associated evidence, and fitness scores to better explain financial phenomena.
I started developing algorithms that aim to help us understand the factors that affect businesses' performance and earnings, and ways that these factors could be used to generate hypotheses and make predictions. The framework addresses a data science problem by first surveying a big universe of entities and data sources, and then at every iteration surveying narrowed sub-universes where new signals are extracted that may generate a new level of even narrower sub-universes. In many cases, there will be several winning hypotheses that might be applicable only for a period of time, in which case the algorithms will then need to update their knowledge base and possibly employ new data points.
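The iterative narrowing described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the framework itself: the `narrow_universe` function, the `keep_fraction` parameter, the entity names, and the two signals (`foot_traffic`, `sentiment`) are all hypothetical, and the scores are invented.

```python
# Toy sketch of iterative universe narrowing; all names and scores are hypothetical.

def narrow_universe(universe, signals, keep_fraction=0.5):
    """Apply one signal per iteration, keeping the top-scoring fraction each time.

    Returns the list of universes, from the full one to the narrowest.
    """
    current = list(universe)
    history = [current]
    for signal in signals:
        # Rank the current sub-universe by the newly extracted signal.
        ranked = sorted(current, key=signal, reverse=True)
        keep = max(1, int(len(ranked) * keep_fraction))
        current = ranked[:keep]
        history.append(current)
    return history


entities = [
    {"name": "E1", "foot_traffic": 0.9, "sentiment": 0.2},
    {"name": "E2", "foot_traffic": 0.8, "sentiment": 0.7},
    {"name": "E3", "foot_traffic": 0.1, "sentiment": 0.9},
    {"name": "E4", "foot_traffic": 0.7, "sentiment": 0.6},
    {"name": "E5", "foot_traffic": 0.3, "sentiment": 0.8},
    {"name": "E6", "foot_traffic": 0.6, "sentiment": 0.1},
    {"name": "E7", "foot_traffic": 0.2, "sentiment": 0.5},
    {"name": "E8", "foot_traffic": 0.4, "sentiment": 0.4},
]

# Each iteration uses a different (assumed) signal to narrow the sub-universe.
signals = [lambda e: e["foot_traffic"], lambda e: e["sentiment"]]

history = narrow_universe(entities, signals)
sizes = [len(u) for u in history]
final_names = [e["name"] for e in history[-1]]
print(sizes, final_names)
```

Keeping the full history rather than only the final sub-universe makes it easier to revisit an earlier level when a winning hypothesis stops holding and the knowledge base has to be updated.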
The framework relies on advances in several areas of computer science: swarm intelligence, predictive analytics, machine learning, information retrieval, natural language processing, and knowledge representation and reasoning. In one case in which we adopted this framework, the data science problem at hand was to assess and predict future earnings of a major furnishing business. Among the many hypotheses that were analyzed, one had a relatively high confidence score: the stock prices and performance of major home-furnishings companies are correlated, with a lag of approximately five weeks, with those of major home improvement supply retailers that sell tools and services. The hypothesis was supported by evidence extracted from different data sources such as customer data, online social media, and financial data.
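One common way to look for a lagged relationship like the five-week lag above is to compute the correlation between two series at several candidate lags and keep the lag where it peaks. The sketch below does this on synthetic weekly data. It is an assumption-laden illustration rather than the framework's actual scoring: the `pearson` and `best_lag` helpers are hypothetical, the series are generated rather than real prices, and the choice of which series leads is an assumption, since the text does not say.

```python
import math

# Illustrative sketch only: synthetic series, not real stock or sales data.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_lag(leader, follower, max_lag):
    """Return (lag, correlation) maximizing corr(leader[t], follower[t + lag])."""
    best = None
    for lag in range(max_lag + 1):
        # Align leader at week t with follower at week t + lag.
        xs = leader[: len(leader) - lag]
        ys = follower[lag:]
        r = pearson(xs, ys)
        if best is None or r > best[1]:
            best = (lag, r)
    return best


# Synthetic weekly series: the follower repeats the leader five weeks later.
leader = [math.sin(0.5 * t) + 0.05 * t for t in range(40)]
follower = [0.0] * 5 + leader[:-5]

lag, r = best_lag(leader, follower, max_lag=10)
print(f"best lag: {lag} weeks, correlation: {r:.3f}")
```

On real data the peak would be far less clean, and a high correlation at one lag can appear by chance when many lags and many pairs of series are tested, which is why comparing a hypothesis against alternatives matters.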
The fitness of this hypothesis was computed and compared against that of all other hypotheses to reduce the chance that the correlation was a coincidence. In a recent use case, our algorithms mined multiple data sources to build an early customer loyalty index for Apple Inc. that helped predict iPhone X sales and the upcoming "Supercycle" for Apple. In another scenario, the framework was used to identify the leaders and the outliers in the clothing sector in order to predict retail sales ahead of time. Insights that were extracted from big data helped predict major turning points in the data.
While the predictive analytics framework I am developing is still in its preliminary stages, advancements in artificial intelligence will enable it to accommodate diverse data science problems. I will, however, also use it to assess the legality and privacy preservation of the datasets in question. The avalanche of new data flowing into Wall Street, often provided by the so-called "data brokers," is certainly invaluable, but concerns have been raised over the means of data acquisition. I predict that we will start seeing a number of legal cases questioning the legality of those alternative data sources and data collection methods, as they will soon be the new form of insider trading. Wall Street firms should be conscientious about which datasets are legal to buy and to mine.
The new data paradigm that leads decision-making on Wall Street is to be celebrated but also regulated, as it has the power to impact decisions on Wall Street, consequently impacting our lives. As I prepare the next generation of data scientists in my classrooms at New York University, I often emphasize that a data scientist should not only have a good understanding of machine learning algorithms and be a creative strategist, but also have the critical ability to question the legality of the datasets in question.
Predictive analytics is certainly one of the most exciting and promising fields at the moment, and we can already see it shape our lives in a variety of ways.
Professor Anasse Bari of New York University, formerly with the World Bank Group, is a prominent figure in the realm of predictive analytics. Bari is teaching a new generation of data scientists at NYU and is conducting research on novel data mining frameworks to better model financial markets. He is providing data-driven insights that can help Wall Street hedge funds and other institutions make sound investment decisions. Bari is a Fulbright scholar born in Morocco who recently co-authored the second edition of the book Predictive Analytics for Dummies that was published in 2016 by John Wiley & Sons, Inc.