For fund managers, social media data is becoming essential grist to the machine learning mill.

Extracting value from a universe of data, analysing sentiment around company names (equities) or about anything else (macro), is a complex journey and we are only about 5% down that road.

The parameters by which an ever-expanding data set (Twitter, pictures, text, video and the like) is processed are still evolving: whether to rely on experts or on the wisdom of the crowd, and whether to derive sentiment from a "bag of words" or from structured linguistic analysis.

Last week's Unicom conference, AI, Machine Learning and Sentiment Analysis Applied to Finance (July 14), brought together a group of experts in this area. Professor Gautam Mitra of OptiRisk Systems introduced Elijah DePalma and James Cantarella of Thomson Reuters; Pierce Crosby of StockTwits; Anders Bally of Sentifi; Peter Hafez of RavenPack; and Stephen Morse of Twitter.

DePalma differed somewhat from the others because the Thomson Reuters sentiment engine uses only accredited Reuters news data, rather than raw social media chatter. DePalma explained: "When we extract features, the simpler approach that's often done on academic literature, is a 'bag of words' approach. What we are doing is a bit more sophisticated; we are doing linguistic parsing, where you are looking at the structure of the language - so you can think object, verb, subject-type representation."

An example of this in action could be the sentence: "IBM surpasses Microsoft". A simple bag of words approach would give IBM and Microsoft the same sentiment score. DePalma's news analytics engine recognises "IBM" as the subject, "surpasses" as the verb and "Microsoft" as the object, along with the positive or negative relationship between subject and object, which the sentiment scores reflect: IBM positive, Microsoft negative.

"So you are creating a grammatical parse tree and one of the benefits of this is, rather than bag of words where you might have tens of thousands of features, you have a low dimensional feature representation when you create these grammatical parse trees. This makes the last step there - classification - much faster.

"But it also makes the sentiment scores about 20% more accurate; from say 60 % accurate up to 80%. And keep in mind that among human readers, internal accuracy consistency is around 85%."

DePalma pointed out that the parsing approach also affects how Reuters handles foreign languages, as in the case of its Japanese news analytics service.

"Why not take auto translation engine like Google Translate, translate Japanese language to English and apply your engine? Because we would lose the language structure and essentially reduced to a bag of words type accuracy."

The unstructured "noisy" character of data such as Twitter has not stopped big hedge funds and asset managers analysing it in an attempt to get an edge over their competitors.

Stephen Morse, senior manager, data partnerships and sales at Twitter, said: "The financial space is a rapidly growing vertical for us. We serve hedge funds directly, prop traders, market makers, banks, fintech partners, etc.

"We are not a news organisation but events break on Twitter very commonly now - not only around significant financial events but act of god events. So this is a big use case in financial markets and sentiment analysis is a very common use case and we are seeing that at the 'cashtag' level.

"A number of CEOs start to communicate on Twitter before they do anything else, like Elon Musk. If you want to know what he's doing you have to go to Twitter - it's the first place he will go and, often it's the only place he communicates."

Morse said sentiment expressed by consumers about particular brands, which can also affect equity prices, is a new twist on the subject that is currently being explored and is likely to become much more common. Twitter data can also be used to gauge macro and geopolitical factors, he said, citing a study last year which showed it to be predictive of unemployment levels in the US.

StockTwits, which provides real time commentary on individual companies, was the inventor of the cashtag, adopted by Twitter over time.
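As a rough illustration of what analysis "at the cashtag level" involves, the sketch below pulls cashtags out of messages and counts mentions per ticker. The pattern (a dollar sign followed by one to five letters) and the function name `cashtag_counts` are assumptions made for this example, not the matching rules used by Twitter or StockTwits, and the sample messages are invented.

```python
import re
from collections import Counter

# Assumed pattern: '$' followed by one to five letters, e.g. $AAPL.
# Dollar amounts such as "$5" are not matched because digits are excluded.
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")


def cashtag_counts(messages):
    """Count how often each ticker is mentioned across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tag.upper() for tag in CASHTAG.findall(message))
    return counts


sample = [
    "Long $AAPL into earnings",
    "$aapl and $TSLA both up",
    "Coffee cost me $5 today",
]
print(cashtag_counts(sample))
```

Per-ticker counts like these are the raw material for the volume and sentiment signals discussed by the speakers.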

Pierce Crosby, business director and data evangelist at StockTwits, said: "Basically all of our conversations are structured around individual companies. But also we allow users to add binaries, so they add a bullish or a bearish tag to their messages.

"From a database standpoint, it becomes a classifier for a large database of data because you have these binaries that eliminate a lot of false positives, or things like people trying to be funny with their words."

Crosby said that while sentiment is the obvious low hanging fruit, the data can also be used to look into volatility of stocks. "I think on the macro level it's really interesting but on a company level and sector level it's also very interesting, where more or less we watch volumes spike in real time on either different asset classes or companies or ETFs, and as that actually translates into realised volatility.

"We have run a study that actually looks at the predictive element of crowds around, not just events, but just on daily trading activity. So basically trying to correlate volatility as it applies to companies from social data is becoming an area that people are really interested in."

Peter Hafez, chief data scientist at RavenPack, said an important concept right now is "democratising data". Big hedge funds and asset managers want to know they can get the data they need, whether in-house or external, at the time they need it.

He said: "There's a lot of new data being produced out there that we can take advantage of, and a lot of asset managers and hedge funds have become a little bit of data hoarders."

The data can be anything from emails to instant messages or legal documents. These can be fed to people using his company's data engine, which is like access to a private cloud. "In the end what people are trying to build is almost like an internal Amazon, where you can go on a platform and say, I want to get back what we know as a company about IBM.

"Then you'll get, this from the legal department, we know that from the Dow Jones news wires, we know this from Twitter, we know that from the analyst reports we get in our inbox. So you can take all of these different sources and combine it."

DePalma added: "In the last five or six months, I have had a number of client trials with large fund managers - large being more than several hundred billion AUM.

"One of them spoke transparently with me that they believe these behaviour finance tools will be maturing in the next five to 10 years and they, like their portfolio managers, are already incorporating these types of signals into discretionary process.

"So that as these tools become more reliable and mature, their portfolio managers will have already incorporated them, to use them for their large fund management."