Alternative data. Do you know of it, or use it (effectively) in your financial organization? If not, you ignore it at your own peril, because there is an explosion of data occurring all around you. Press releases, SEC filings, investor presentations, public records, ratings, social media, product reviews, job postings, and other information are just part of what is considered alternative data. It can be incredibly difficult to reach through traditional data collection methods, yet it can also contain highly valuable information for analysis of all types, and it is particularly useful in financial analyses and transactions. In the guest post below, Bitvore Chief Data Officer Greg Bolcer educates us on the many uses of alternative data and more. Take some time to read this piece, then think about the myriad ways financial institutions can use this type of data for a competitive edge or for deeper insights into their clients and investments. Fascinating!
Cindy Taylor/Publisher


What Is Alternative Data And Why Is It Important?

By Gregory Bolcer, Chief Data Officer, Bitvore

Alternative investment data, or alt-data, can be as simple as measuring and tracking positive or negative sentiment in news about a company, or as complex as looking at non-traditional data that isn't commonly collected and correlating it with the performance of a company.

Traditionally, alt-data came from paper receipts and other records that weren't originally available in electronic format, or from private company information that either isn't shared beyond an individual line of business or isn't captured at all. The latter typically happens when the storage costs can't be justified by any value the data might provide on its own, even though the data might be valuable when combined with other sources.

Alternative data can also be derived from individual or aggregate data by algorithms or machine learning applied to traditional sources, so that the results can be used as inputs to other analyses. Some of these sources are available through news, government agencies, or the companies themselves, or by licensing or purchasing data from other service industries.

One especially critical thing about alt-data is that it has a network effect. A network effect is when the value of the whole network is exponentially greater than the sum of its individual pieces. A telephone, for instance, is worthless if you are the only person in the world who possesses one. The more people who have telephones, the more potential calls and calling groups there are between them, and the total value grows accordingly. With n telephones in the world, the value is thus proportional to 2^n.

Storing alt-data just for the sake of storing data has a cost associated with it, because most alt-data is either costly to collect and extract or costly to store and analyze because of its size. There are nearly infinite ways you can assemble all of the data into collections, and each collection can then interact with the others in valuable ways, so here too the value is proportional to 2^n.
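
To make the 2^n claim concrete, here is a minimal, purely illustrative Python sketch (the dataset names are invented) that counts how many distinct, non-empty combinations n data sources can form:

    # With n datasets there are 2^n - 1 non-empty combinations to analyze.
    from itertools import combinations

    datasets = ["news", "sec_filings", "job_postings", "satellite", "reviews"]

    total = 0
    for k in range(1, len(datasets) + 1):
        total += len(list(combinations(datasets, k)))

    print(total)                   # 31 distinct combinations
    print(2 ** len(datasets) - 1)  # same number, computed directly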

While the theoretical value of combining data can be calculated, actually using the data takes a little longer, because the adoption costs have to be justified first.

Take Web servers, for instance. If you had gone to your company executives in the early 90s and told them that you were going to use the company's expensive network connection, clog it up with traffic, and run a piece of software on an expensive company machine so that people outside the company whom you don't know could grab proprietary company information that your competitors could use against you ... the executives would probably have fired you.

But that's exactly what happened with Web servers (minus the firing). Once the benefits of having a Web server far outweighed the initial costs and concerns, companies took advantage of the collective value. This is a perfect example of a network effect. Alt-data is just starting to provide enough value to overcome the initial costs and concerns, and its adoption will only accelerate from here.

What Is Unstructured Alternative Data?

Raw text is considered unstructured data, but the truth is, even raw text comes with some points of structure. What source did it come from? When was it published? Who is the author? At Bitvore, we focus mostly on semi-structured data like textual news items, though we also look at press releases, SEC filings, investor presentations, public records, ratings, social media, product reviews, job postings, and other information.

There are companies that do use more visual alt-data like satellite images of how many cars are sitting in a storage lot or how much foot traffic goes through various airports, buildings, malls, or public spaces. That sort of information, while useful, falls outside of our interest and customer areas.

Because we can infer some of this structure, we can reason about the data and derive more structure from it. Did it come from a reputable source, or is the source blacklisted? Was it written by a human, or is it robonews/junk? What is the subject of the story, and where did it take place? A lot of these early answers, which help us separate the invaluable, the valuable, and the valueless, can be derived structurally even before we apply more powerful machine learning algorithms.
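
As a rough illustration of that kind of structural triage (the field names, blacklist, and thresholds below are invented for the example, not Bitvore's actual pipeline), a few cheap checks can run before any machine learning:

    # Illustrative only: structural checks on a news item's metadata.
    from dataclasses import dataclass
    from datetime import datetime, timezone
    from typing import Optional

    BLACKLISTED_SOURCES = {"spam-news.example", "robonews.example"}

    @dataclass
    class NewsItem:
        source: str
        author: Optional[str]
        published: Optional[datetime]
        body: str

    def structural_triage(item: NewsItem) -> dict:
        """Derive cheap structural signals from metadata alone."""
        return {
            "blacklisted_source": item.source in BLACKLISTED_SOURCES,
            "has_author": bool(item.author),
            "has_publish_date": item.published is not None,
            "suspiciously_short": len(item.body.split()) < 50,
        }

    item = NewsItem("localpaper.example", "J. Reporter",
                    datetime(2019, 6, 10, tzinfo=timezone.utc),
                    "Salesforce signs definitive agreement to buy Tableau ...")
    print(structural_triage(item))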

Another source of semi-structured data is Web sites. The reason Web sites are only semi-structured is that you can't simply look up values on the site to answer your questions. Who is the CEO? Who is on the board? What is the last big deal the company did? For how much? With which customer? When did they last launch a product?

There are Web scraping technologies out there, but without doing a bit of analysis, it's hard to pull out the information or answers you need. The key question is: how do you get a machine to understand and answer these questions at the same level of quality as a human sitting down and digging through the Web site to find the answers? The answer is that humans and machines aren't perfect, but a little machine learning goes a long way toward doing far more, far faster, and for far more sites than is feasible with any number of humans.
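
A minimal sketch of the brittle, rule-based starting point that machine learning improves upon might look like the following (the URL and pattern are hypothetical, and a real extractor would need to handle far more cases):

    # Naive illustration: fetch a page and look for a "CEO <Name>" pattern.
    import re
    import requests  # third-party; pip install requests

    url = "https://example.com/about"  # hypothetical company page
    html = requests.get(url, timeout=10).text

    # Crudely strip tags, then search for a CEO mention.
    text = re.sub(r"<[^>]+>", " ", html)
    match = re.search(r"(?:CEO|Chief Executive Officer)[,:\s]+([A-Z][a-z]+ [A-Z][a-z]+)", text)
    print(match.group(1) if match else "no CEO pattern found")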

How Do Data Scientists Use Alternative Data To Build Predictive Models For Analysts?

There's an urban legend that gets passed along among alt-data data scientists. It starts out like an old joke: two guys walk into a bar. A stock analyst following Tesla is drinking away his sorrows because his clients keep asking him what is happening with the company. Tesla keeps promising tens of thousands of cars, but every time he visits, it is stockpiling thousands of cars that aren't moving anywhere.

His friend, who works in satellites, tells him he can look at the past month's satellite feed, since Moffett Field is right across the bay and his satellite flies right over the area. The practice expanded from there, to the point that people started live-streaming all the distribution centers as a way to predict whether there would be sufficient demand for the new model, and ultimately whether the share price would fall or rise.

This is an excellent example of unstructured data: simply a picture of how many cars are sitting on any given lot at any given time. Some users were even able to write automated counters and live-stream the locations so that traders could have the information on demand, any time they wanted. The problem with the whole thing is that the alt-data lacked context. As Tesla ramped up production, so did its temporary storage. Without knowing the other factors, even the fastest, most accurate alt-data, delivered in real time, is open to any number of wide interpretations.

Bitvore's Use Of Alt-Data Takes A Different Approach.

Alt-data isn't valuable without correlating it to more traditional data sources. The single most valuable source is timestamp-based news. While there are a lot of things that can be discovered that never show up in the news, those discoveries lack context without validation in the news. That's not to say all news sources are equivalent. There is a production cycle and an escalation process for certain items. Bitvore has gotten very good at identifying early news items that will be significant before they are well covered by more traditional, slower-moving media.

This expertise helps with predictive models. In the short term, we can find valuable news items by correlating the information with our alt-data and leveraging machine learning models that have been tuned on tens or hundreds of millions of records across various companies and industries. For longer-term predictions, we look for patterns in our analysis. We identify individual items with something called a signal. A signal is simply an indicator, with a very high degree of reliability, that something financially impactful happened. We also correlate that signal to the company that is mentioned. When we combine both the company and the signal, we come up with precision news: a highly reliable indicator that something important happened to a specific company.

Our latest predictive efforts use that highly reliable information to predict other signals. For instance, in our municipal product, if a city eliminates a fire, police, or ambulance service, forgoes teacher raises in a school district, or starts discussing pension costs (all signals in our system), we can predict with near certainty that it will announce a budget shortfall at the end of the fiscal year. Likewise, if a city goes on to announce a budget shortfall, raise new money by issuing new bonds, push through public employee raises, or raise property taxes (also all signals in our system), we can predict a city or county bankruptcy.
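
As a purely illustrative sketch of this kind of signal-pattern rule (the signal names and thresholds below are simplified stand-ins, not the actual Bitvore models):

    # Rule-of-thumb illustration only; real models weight many more factors.
    SHORTFALL_PRECURSORS = {
        "ServiceCut",             # e.g., eliminating fire, police, or ambulance service
        "ForgoneTeacherRaises",
        "PensionCostDiscussion",
    }

    BANKRUPTCY_PRECURSORS = {
        "BudgetShortfall",
        "NewBondIssue",
        "PublicEmployeeRaises",
        "PropertyTaxIncrease",
    }

    def flag_risks(observed_signals: set) -> list:
        """Return coarse risk flags for a municipality given observed signals."""
        flags = []
        if observed_signals & SHORTFALL_PRECURSORS:
            flags.append("likely budget shortfall at fiscal year end")
        if len(observed_signals & BANKRUPTCY_PRECURSORS) >= 2:
            flags.append("elevated bankruptcy risk")
        return flags

    print(flag_risks({"ServiceCut", "PensionCostDiscussion"}))
    print(flag_risks({"BudgetShortfall", "PropertyTaxIncrease"}))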

Companies follow similar patterns. Fundraising, an abundance of new product launches, executive churn, and various other patterns of signals can point toward a search for new money or fundraising, an attempt to sell the company (merger and acquisition), financial distress, or even bankruptcy. While these types of predictions are not absolute, just knowing there is a higher percentage chance over the course of the next two or four quarters is extremely useful information.

Why Do Data Scientists Spend 60-80% Of Their Time Dealing With Unstructured Alternative Data?

In short:

For data science, there is always a tradeoff between using a small but very clean data set and using a large and dirty one. There are many ways data can be dirty. The first is concordance. If you have several similar company names, e.g., Family Dollar Stores, Dollar Tree, Dollar General, Dollar Express, Dollar Holdings, you have a concordance problem. Which names refer to the same company and which are different? Which are still around and which have gone away? Sometimes it's hard even for humans to know the difference. Geographically, we have to differentiate between the City of West, Texas and West Texas, between Central Pennsylvania and the City of Center, Pennsylvania, and hundreds more really ambiguous items.
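
A crude first pass at the concordance problem can be made with plain string similarity, as in the sketch below (the threshold is arbitrary, and real entity resolution needs far more than this):

    # Illustrative only: flag company names that look like the same entity.
    from difflib import SequenceMatcher

    names = ["Family Dollar Stores", "Dollar Tree", "Dollar General",
             "Dollar Express", "Family Dollar Stores Inc."]

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = similarity(names[i], names[j])
            if score > 0.85:  # arbitrary threshold for this sketch
                print(f"possible match ({score:.2f}): {names[i]} <-> {names[j]}")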

Likewise, if our computers get a news item from, say, a local newspaper, there are often no indicators anywhere in the story, or sometimes even on the site, about which US state the paper is in. A human would have to go to the site, find the weather section, and try to figure out by hand where exactly the story was from. Luckily, we've been able to work out automated solutions to a lot of these issues.

Other issues, like stemming (a technique that reduces words to the same root independent of tense or usage), cause problems of their own. An example is hospital versus hospitality: the hospitality suite at the hospital was very inhospitable. If we're trying to differentiate between Marriott Suites, a Marriott-run senior center and hospital, and a publicly funded hospital that gives families grants to stay at the Marriott around the corner, it takes a lot of disambiguation.
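
The sketch below shows how easily an off-the-shelf stemmer conflates such words (it assumes the nltk package is installed; the exact stems may vary by stemmer and version):

    # Illustrative: a standard Porter stemmer collapsing related-looking words.
    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()
    for word in ["hospital", "hospitality", "inhospitable"]:
        print(word, "->", stemmer.stem(word))
    # "hospital" and "hospitality" tend to collapse to the same stem,
    # which is exactly the kind of conflation that forces extra disambiguation.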

Even before we get to the issues above, there are general cleansing issues. Take, for instance, the headline:

Salesforce Signs Definitive Agreement To Buy Tableau

The body of the article, however, is "Registration required", "no content found", "A valid subscription is required to read the article", "To continue reading register or subscribe below", or, very frequently, "Shop now in our online pharmacy". Typically we blacklist almost a billion individual news records a year. Blacklisting by itself is almost a full-time job.
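
A first-pass blacklist of this kind can be as simple as a set of patterns, as in the sketch below (these patterns are just the examples above, not Bitvore's actual rules):

    # Illustrative pattern-based blacklist for paywall and junk bodies.
    import re

    BLACKLIST_PATTERNS = [
        r"registration required",
        r"no content found",
        r"valid subscription is required",
        r"to continue reading,? register or subscribe",
        r"shop now in our online pharmacy",
    ]
    BLACKLIST_RE = re.compile("|".join(BLACKLIST_PATTERNS), re.IGNORECASE)

    def is_blacklisted(body: str) -> bool:
        return bool(BLACKLIST_RE.search(body))

    print(is_blacklisted("A valid subscription is required to read the article."))  # True
    print(is_blacklisted("Salesforce signs definitive agreement to buy Tableau."))  # False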

Malware, pharmacy scams, paid promotions, celebrities, obituaries, pornography, lifestyles, weddings, coupons, and so on: that's just the first pass. In addition to pattern-based blacklisting, we have machine learning models trained to remove old content, error pages, paywalls, dark Web content, robonews, and a variety of other things. Interestingly enough, the single biggest factor in identifying a fake or promotional news story is simply that there is no current date and/or no date comparable to when the story first appeared.

Once those records are cleaned up, we can start to look at duplicates and similarity clusters. We keep duplicates and similar stories in our system for analysis, but other than being identified, they hardly get analyzed further. Once the data is distilled, it can be analyzed for entities/entity extraction, sentiment, signals, geography, and any other features or data items that you want to look at. As mentioned before, mapping those things onto like entities requires careful processing and a lot of machine learning and human expertise.
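
A toy version of the duplicate check is word-set overlap between two headlines, as sketched below (production-scale similarity clustering would use shingling, hashing, and much more):

    # Illustrative near-duplicate check using Jaccard similarity over word sets.
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb)

    s1 = "Salesforce signs definitive agreement to buy Tableau"
    s2 = "Salesforce signs agreement to acquire Tableau"
    s3 = "City council discusses pension costs ahead of budget vote"

    print(round(jaccard(s1, s2), 2))  # high overlap: likely the same story
    print(round(jaccard(s1, s3), 2))  # no overlap: different story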

What Is AI-Ready Data?

Definition: Aggregated from multiple sources, normalized to appropriate domains, and cleansed of garbage.

At the point you have only clean data and reliable entity-extractions, you have the minimum needed for AI-ready data. Adding in other clean data values like geographies, signals, sentiment, scoring, or other differentiators and disambiguators allows data scientists to carve off just the data they want.

A large part of that is just simply being able to sort items by a value or only find items that frequently appear together. When you are looking at tens of thousands, hundreds of thousands, or millions of things, being able to perform large data operations to get exactly what you need to do a data science experiment becomes important.

Microsoft Excel, one of the favorite tools in a data scientist's toolkit, has a hard limit of roughly 1 million rows. Imagine trying to read a 5-million-row data file into Excel just so you can sort, rank, score, and excerpt the top 500,000 things you need for your experiment. It gets old really fast. Most data scientists will resort to tools designed to handle larger datasets, like Power BI or Tableau, which can take a long time for each operation. For simple filtering, they end up either putting the data into a database, dumping it into files, or writing scripts to find patterns.
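
A minimal sketch of the scripting route, assuming a hypothetical CSV with a numeric "score" column, streams the file in chunks with pandas and keeps only the top 500,000 rows:

    # Stream a large CSV in chunks and keep the top 500,000 rows by score.
    import pandas as pd

    tops = []
    for chunk in pd.read_csv("news_items.csv", chunksize=500_000):
        tops.append(chunk.nlargest(500_000, "score"))

    top = pd.concat(tops).nlargest(500_000, "score")
    top.to_csv("top_500k.csv", index=False)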

Sometimes just having a comma in your dataset is problematic when you are dealing with hundreds of thousands of things. Likewise, when you are dealing with unstructured text where you actually need the title of a news article, binary characters, double quotes inside of double quotes, single quotes, punctuation, and a variety of other things like character encodings can really mess up the best-laid tools.
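
For the comma-and-quote problem in particular, letting a proper CSV writer handle escaping goes a long way, as in this small sketch (the file name and values are invented):

    # Write titles containing commas and embedded quotes safely.
    import csv

    rows = [
        ("Salesforce Signs Definitive Agreement To Buy Tableau", 1.05),
        ('City says "no new taxes", then issues bonds', 0.42),  # comma and quotes
    ]

    with open("titles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        writer.writerow(["title", "score"])
        writer.writerows(rows)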

Likewise, even having the data in the right size and format is no guarantee. Joining data with other referenceable datasets is a black art in itself. Imagine you have a record that is a news article about Salesforce, and you want to join that information with Salesforce's employee count or revenue numbers. Instead of having a column of data for Org1's employee count, Org2's employee count, and OrgCombined's employee count (or worse, a Salesforce employee column in every single row of your database regardless of whether Salesforce appears in that row), you want to be able to do analytics on the combined employee count by joining the values from some third metadata source.
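
A minimal sketch of that join, with made-up employee figures and a hypothetical reference table, looks like this in pandas:

    # Attach company metadata to news rows via a join, not duplicated columns.
    import pandas as pd

    news = pd.DataFrame({
        "company": ["Salesforce", "Tableau", "Salesforce"],
        "signal":  ["MergersAcquisitions", "MergersAcquisitions", "FinancialHealth"],
    })
    metadata = pd.DataFrame({
        "company":   ["Salesforce", "Tableau"],
        "employees": [49_000, 4_200],  # made-up figures for illustration
    })

    joined = news.merge(metadata, on="company", how="left")
    print(joined)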

Other issues include unrolling or grouping. Say you have Salesforce and Tableau in one article with the signals MergersAcquisitions and FinancialHealth. Unrolling takes the two lists of things sitting in two different columns and breaks them out into individual rows so that you can do better analytics:

•  Salesforce,1.05,MergersAcquisitions

•  Salesforce,1.05,FinancialHealth

•  Tableau,1.05,MergersAcquisitions

•  Tableau,1.05,FinancialHealth

 
For our example, both signals belong to both companies. But if you are unrolling a CEO's name and a VP of Marketing's name from a sales agreement, how do you know which company the CEO works at and which one the VP of Marketing works at if they are two different companies? Sometimes you need to keep the extracted data together because there is a dependency that shouldn't be unrolled. City names and states are another example of this type of dependency that has to be preserved.
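
As a minimal sketch, the unrolling step above can be done with pandas (the column names are invented; dependent pairs such as a CEO tied to a specific company should not be exploded independently):

    # Unroll one article row with list-valued columns into one row per pair.
    import pandas as pd

    article = pd.DataFrame({
        "companies": [["Salesforce", "Tableau"]],
        "score":     [1.05],
        "signals":   [["MergersAcquisitions", "FinancialHealth"]],
    })

    unrolled = (article.explode("companies")
                       .explode("signals")
                       .rename(columns={"companies": "company", "signals": "signal"}))
    print(unrolled)  # four rows, matching the bulleted example above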

Finally, since time is a very important dimension for prediction, data scientists have to roll time up into hours, days, weeks, months, quarters, and years. If you want a count of how many signals happened for a given company last month, you get a number. You can then compare that number to a previous time frame, whether last month or the same period last year, depending on what type of analysis you are performing.
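
A minimal roll-up sketch, assuming a pandas DataFrame of extracted signals (the rows and timestamps below are invented), counts signals per company per month:

    # Count signals per company per month, then compare periods.
    import pandas as pd

    signals = pd.DataFrame({
        "company":   ["Salesforce", "Salesforce", "Tableau"],
        "signal":    ["FinancialHealth", "MergersAcquisitions", "MergersAcquisitions"],
        "timestamp": pd.to_datetime(["2019-05-14", "2019-06-10", "2019-06-10"]),
    })

    signals["month"] = signals["timestamp"].dt.to_period("M")
    monthly = signals.groupby(["company", "month"]).size().rename("signal_count")
    print(monthly)
    # Compare the latest month's count to the prior month or the same month last year.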

Having a strategy for all of these issues, and the tools to solve them easily, is what AI-ready data is all about. Eliminating the 60-80% of their time that data scientists spend making data ready for predictive analytics is exactly what Bitvore does. Bitvore creates AI-Ready Data.


 

 

About Greg Bolcer, CDO Bitvore

Greg is a serial entrepreneur who has founded three angel- and VC-funded companies and has been involved at an early stage, or as an advisor, in at least half a dozen more. Greg has a PhD and BS in Information and Computer Sciences from UC Irvine and an MS from USC. He started his career at Irvine as a researcher in Web protocols, standards, and applications under a series of DARPA-funded grants. He was formerly the Intel Architecture chair for the Peer to Peer working group and was named UCI's Distinguished Alumnus of the Year in 2004.

About Bitvore

Bitvore provides precision intelligence derived from world business news and information. Our products are deployed in over sixty of the world's largest financial institutions, allowing them to rapidly create augmented intelligence solutions to address their unique business requirements. Augmented intelligence solutions assist employees in making faster and more effective decisions, so they outperform the competition. To learn more, visit www.bitvore.com.

Download Interview Here