Likewise, if our computers get a news item from, say, a local newspaper, there are often no indicators anywhere in the story, or sometimes even on the site, about which US state the paper resides in. A human would have to go to the site, find the weather section, and try to figure out by hand where exactly the story was from. Luckily, we've been able to work out automated solutions to a lot of these issues.
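The sketch below shows the basic idea of one such automated lookup, purely as an illustration rather than a description of our actual pipeline: scan the visible site text for a known place name and map it to a state. The tiny city-to-state table and the function name are hypothetical.

```python
import re
from typing import Optional

# Purely hypothetical city-to-state gazetteer; a real one would be far larger
# and would need to handle ambiguous names (there are many Springfields).
CITY_TO_STATE = {
    "tacoma": "WA",
    "boise": "ID",
    "chattanooga": "TN",
}

def guess_state(site_text: str) -> Optional[str]:
    """Scan visible site text (masthead, about page, weather section)
    for a known city name and return its state, if any."""
    for city, state in CITY_TO_STATE.items():
        if re.search(rf"\b{city}\b", site_text, re.IGNORECASE):
            return state
    return None

print(guess_state("Weather for Tacoma and the surrounding counties"))  # -> WA
```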

Other issues arise from stemming, a technique that reduces words to a common root independent of tense or usage. An example of this is hospital versus hospitality: "The hospitality suite at the hospital was very inhospitable." If we're trying to differentiate between Marriott Suites, a Marriott-run senior center and hospital, and a publicly funded hospital that gives family grants to stay at the Marriott around the corner, it takes a lot of disambiguation.
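To see why stemming conflates these concepts, consider what a standard Porter stemmer does to the words in that sentence. The snippet below uses NLTK purely as an illustration; it is not a claim about the stemmer our pipeline uses.

```python
from nltk.stem import PorterStemmer  # NLTK used purely for illustration

stemmer = PorterStemmer()
for word in ["hospital", "hospitals", "hospitality", "inhospitable"]:
    print(word, "->", stemmer.stem(word))

# "hospital" and "hospitality" both reduce to the stem "hospit", so a
# stem-based index cannot tell a public hospital from a hotel's
# hospitality suite without further disambiguation.
```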

Even before we get to the issues above, there are more general cleansing problems. Take for instance the headline:

Salesforce Signs Definitive Agreement To Buy Tableau

The body of the article is "Registration required", "no content found", "A valid subscription is required to read the article", "To continue reading register or subscribe below", or, very frequently, "Shop now in our online pharmacy". Typically we blacklist almost a billion individual news records a year. Blacklisting by itself is almost a full-time job.
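A minimal sketch of that pattern-based pass might look like the following. The patterns are just the boilerplate phrases above, not our actual blacklist, which is far larger and maintained continuously.

```python
import re

# Illustrative patterns drawn from the boilerplate phrases above; a real
# blacklist covers many more variants and languages.
BLACKLIST_PATTERNS = [
    r"registration required",
    r"no content found",
    r"valid subscription is required",
    r"to continue reading,? register or subscribe",
    r"shop now in our online pharmacy",
]
BLACKLIST_RE = re.compile("|".join(BLACKLIST_PATTERNS), re.IGNORECASE)

def is_blacklisted(body: str) -> bool:
    """Return True when an article body is really junk or boilerplate."""
    return bool(BLACKLIST_RE.search(body))

print(is_blacklisted("A valid subscription is required to read the article"))  # True
print(is_blacklisted("Salesforce will acquire Tableau in an all-stock deal"))  # False
```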

Malware, pharmacy scams, paid promotions, celebrities, obituaries, pornography, lifestyles, weddings, coupons, and so on: that's just the first pass. In addition to pattern-based blacklisting, we have machine learning models trained to remove old content, error pages, paywalls, dark Web content, robonews, and a variety of other things. Interestingly enough, the single biggest factor in identifying a fake or promotional news story is simply that there is no current date and/or no date consistent with when the story first appeared.
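That date heuristic is simple to sketch. The field names and the 30-day cutoff below are assumptions for illustration, not our actual schema or thresholds.

```python
from datetime import datetime, timedelta
from typing import Optional

def looks_promotional(published_at: Optional[datetime],
                      first_seen_at: datetime,
                      max_lag_days: int = 30) -> bool:
    """Flag stories with no date, or a date far from when we first saw them."""
    if published_at is None:
        return True  # a missing date is the single biggest red flag
    return abs(first_seen_at - published_at) > timedelta(days=max_lag_days)

# A story with no date at all gets flagged; so does one "published" years
# before it first showed up in the feed.
print(looks_promotional(None, datetime(2019, 6, 10)))                   # True
print(looks_promotional(datetime(2015, 3, 1), datetime(2019, 6, 10)))   # True
print(looks_promotional(datetime(2019, 6, 10), datetime(2019, 6, 11)))  # False
```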

Once those records are cleaned up, we can start to look at duplicates and similarity clusters. We keep duplicates and similar stories in our system for analysis, but beyond identifying them, they hardly get analyzed further. Once the data is distilled, it can be analyzed for entities/entity extraction, sentiment, signals, geography, and any other features or data items you want to look at. As mentioned before, mapping those things into like entities requires careful processing and a lot of machine learning and human expertise.
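Going back to the duplicate step: a rough, illustrative way to spot near-duplicates is to break each story into word shingles and compare Jaccard similarity. At scale, techniques like MinHash/LSH are typically used instead; this sketch is not a description of our production system, and the 0.8 threshold is arbitrary.

```python
def shingles(text: str, n: int = 3) -> set:
    """Break text into overlapping word n-grams ('shingles')."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

s1 = "Salesforce Signs Definitive Agreement To Buy Tableau"
s2 = "Salesforce Signs Definitive Agreement To Buy Tableau Software"
sim = jaccard(shingles(s1), shingles(s2))
print(f"similarity = {sim:.2f}")  # ~0.83; above an illustrative threshold of
                                  # 0.8 we would treat these as near-duplicates
```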

What Is AI-Ready Data?

Definition: Aggregated from multiple sources, normalized to appropriate domains, and cleansed of garbage.

Once you have only clean data and reliable entity extractions, you have the minimum needed for AI-ready data. Adding in other clean data values, like geographies, signals, sentiment, scoring, or other differentiators and disambiguators, allows data scientists to carve off just the data they want.
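To make "carving off just the data they want" concrete, here is a purely illustrative example. The field names and values are assumptions, not a documented schema.

```python
# Field names and values here are assumptions for illustration, not a
# documented schema.
records = [
    {"entity": "Salesforce", "state": "CA", "sentiment": 0.6, "signal": "M&A"},
    {"entity": "Tableau",    "state": "WA", "sentiment": 0.4, "signal": "M&A"},
    {"entity": "Marriott",   "state": "MD", "sentiment": 0.1, "signal": "earnings"},
]

# Carve off just the M&A stories about West Coast entities.
west_coast_ma = [
    r for r in records
    if r["signal"] == "M&A" and r["state"] in {"CA", "WA", "OR"}
]
print(west_coast_ma)
```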
