A large part of that is simply being able to sort items by a value or to find items that frequently appear together. When you are looking at tens of thousands, hundreds of thousands, or millions of things, being able to perform large data operations to get exactly what you need for a data science experiment becomes important.

Microsoft Excel, a favorite tool in many data scientists' toolkits, has a hard limit of roughly 1 million rows. Imagine trying to read a 5-million-row data file into Excel just so you can sort, rank, score, and excerpt the top 500,000 things you need for your experiment. It gets old really fast. Most data scientists resort to tools designed to handle larger datasets, such as Power BI or Tableau, which can take a long time for each operation. But for simple filtering, they end up either loading the data into a database, dumping it into files, or writing scripts to find patterns. Consider the difference between:

  • "Paris, Texas", 1.05, FinancialHealth
  • Paris/TX, 1.05, FinancialHealth

Sometimes just having a comma inside a field is problematic when you are dealing with hundreds of thousands of things. Likewise, when you are working with unstructured text where you actually need the title of a news article, binary characters, double quotes nested inside double quotes, single quotes, stray punctuation, and a variety of other issues like inconsistent character encodings can trip up even the best of tools.
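As a rough sketch, a chunked pandas read with an explicit quote character can pull a top slice out of a file far larger than Excel's row limit while keeping a quoted field like "Paris, Texas" in one column. The file name, column name, and chunk size below are assumptions, not anything from a real dataset:

    import pandas as pd

    # Hypothetical file, column names, and sizes; adjust to the real dataset.
    CHUNK_SIZE = 250_000
    TOP_N = 500_000

    best = None
    # quotechar='"' keeps "Paris, Texas" in a single field instead of
    # splitting it on the comma.
    for chunk in pd.read_csv("signals_5m_rows.csv", quotechar='"',
                             encoding="utf-8", chunksize=CHUNK_SIZE):
        candidate = chunk if best is None else pd.concat([best, chunk])
        # Keep only the current top rows so memory stays bounded.
        best = candidate.nlargest(TOP_N, "score")

    best.to_csv("top_500k.csv", index=False)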

Likewise, even having the data in the right size and format is no guarantee. Joining data with other reference datasets is a black art in itself. Imagine you have a record that is a news article about Salesforce, and you want to join it with Salesforce's employee count or revenue numbers. Instead of maintaining a column for Org1's employee count, another for Org2's, and another for the combined organization (or, worse, carrying a Salesforce employee-count column in every single row of your database whether or not Salesforce appears in that row), you want to join the values in from a third metadata source and run your analytics on the combined employee count.
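A minimal sketch of that kind of join, assuming the article records and the reference data share an organization name as the key; all column names and figures below are made-up placeholders:

    import pandas as pd

    # Hypothetical article-level records extracted from news text.
    articles = pd.DataFrame({
        "org": ["Salesforce", "Tableau", "Salesforce"],
        "score": [1.05, 0.92, 1.31],
        "signal": ["FinancialHealth", "MergersAcquisitions", "FinancialHealth"],
    })

    # Hypothetical reference data from a third metadata source
    # (placeholder figures, not real numbers).
    org_metadata = pd.DataFrame({
        "org": ["Salesforce", "Tableau"],
        "employee_count": [10_000, 1_000],
    })

    # Left join: every article row picks up its org's metadata on demand,
    # instead of carrying an employee-count column in every row of the source.
    enriched = articles.merge(org_metadata, on="org", how="left")
    print(enriched)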

Other issues include unrolling or grouping. Say you have Salesforce and Tableau in one article with signals MergersAcquisitions and FinancialHealth. Unrolling takes a row that holds two lists of things in two different columns and expands it into one row per combination, so that you can do better analytics:

  • [Salesforce,Tableau],1.05,[MergersAcquisitions,FinancialHealth]
  • Unrolled:
      • Salesforce,1.05,MergersAcquisitions
      • Salesforce,1.05,FinancialHealth
      • Tableau,1.05,MergersAcquisitions
      • Tableau,1.05,FinancialHealth
For our example, both signals belong to both companies. But if you are unrolling a CEO's name and a VP of Marketing's name from a sales agreement, how do you know which company the CEO works for and which the VP of Marketing works for if they are two different companies? Sometimes you need to keep the extracted data together because there is a dependency that should not be unrolled. City names and states are another example of this type of dependency that has to be preserved.
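When the lists really are independent, as in the Salesforce/Tableau row above, the unroll is a straightforward explode; the column names here are illustrative assumptions:

    import pandas as pd

    # One article row with list-valued columns (hypothetical column names).
    row = pd.DataFrame({
        "org": [["Salesforce", "Tableau"]],
        "score": [1.05],
        "signal": [["MergersAcquisitions", "FinancialHealth"]],
    })

    # Explode each list column in turn: one row becomes 2 orgs x 2 signals = 4 rows.
    unrolled = row.explode("org").explode("signal").reset_index(drop=True)
    print(unrolled)

For dependent values, such as a CEO and the company that person works for, you would keep the pair together in a single column (for example as a tuple) and explode only that column, so the pairing survives the unroll.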

Finally, since time is a very important dimension for prediction, data scientists have to roll time up into hours, days, weeks, months, quarters, or years. If you ask how many signals for a given company happened last month, you get a single number, and you can then compare it with the previous month or with the same period last year, depending on what type of analysis you are performing.
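A sketch of that monthly roll-up with pandas, assuming each signal record carries a timestamp; the column names and dates are placeholders:

    import pandas as pd

    # Hypothetical signal-level records with a publication timestamp.
    signals = pd.DataFrame({
        "org": ["Salesforce", "Salesforce", "Tableau", "Salesforce"],
        "signal": ["FinancialHealth", "MergersAcquisitions",
                   "FinancialHealth", "FinancialHealth"],
        "published": pd.to_datetime(
            ["2019-05-03", "2019-05-21", "2019-05-30", "2019-06-02"]),
    })

    # Roll timestamps up to calendar months and count signals per org per month.
    monthly = (signals
               .groupby(["org", pd.Grouper(key="published", freq="MS")])
               .size()
               .rename("signal_count")
               .reset_index())

    # Month-over-month comparison within each org.
    monthly["previous_month"] = monthly.groupby("org")["signal_count"].shift(1)
    print(monthly)

Swapping the freq for "D", "W", "QS", or "YS" gives daily, weekly, quarterly, or yearly roll-ups instead.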
