Smart Filtering: How to Determine What to Capture and Keep When it Comes to Big Data
January 15, 2014
Featured article by David Pope and Dan Zaratsian, SAS
The Big Data Dilemma: What to Capture, and What to Keep
Technology enables you to capture every bit and byte, but should you? No. Not all of the data out there is relevant or useful. Organizations need to separate the meaningful information from the chatter and focus on what counts. Smart filtering allows you to do just that.
With smart filtering, the organization captures and stores only what is suspected of being relevant for further analysis, and can discard unnecessary documents during the initial retrieval. The goal is to reduce data noise and store only what is needed to answer business questions. Smart filters help identify the relevant data, so you don’t spend time searching large data stores simply because you don’t know which subsection of the data could contain value. If you are a manufacturer searching for clues to defects in warranty reports, instead of bringing in all the text from every warranty call, a smart filter’s embedded extraction rules would isolate and extract only the warranty calls likely to reveal the root cause of a defect. Smart filters are truly smart when they use more than if/then rules and Boolean searches.
The basics of smart filtering
Smart filters do this with embedded natural language processing and advanced linguistic techniques that identify and extract only the text initially believed to be relevant to the business question at hand. It’s the mind of a 21st-century librarian embedded on a microchip.
In addition to identifying the most relevant nuggets of information from the available universe of information, smart filters can help determine where to store this data, and direct it to the right location. Is the data highly relevant? Then you’d want to have it readily accessible in an operational database type of storage. Or is it lower relevance? If so, it can be stored in lower-cost storage, such as a Hadoop cluster.
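As a rough illustration of the routing idea described above, the following Python sketch scores a document and picks a storage tier. Everything in it is an assumption made for the example: the thresholds, the tier names, and the toy term-overlap score stand in for whatever relevance model a real smart filter would use.

```python
from dataclasses import dataclass

# Hypothetical thresholds and tier names, chosen only for illustration.
HIGH_RELEVANCE = 0.7
LOW_RELEVANCE = 0.2


@dataclass
class Document:
    doc_id: str
    text: str


def relevance_score(doc: Document, terms: set) -> float:
    """Toy score: the fraction of target terms that appear in the text."""
    if not terms:
        return 0.0
    words = set(doc.text.lower().split())
    return len(terms & words) / len(terms)


def route(doc: Document, terms: set) -> str:
    """Pick a storage tier (or discard) based on the relevance score."""
    score = relevance_score(doc, terms)
    if score >= HIGH_RELEVANCE:
        return "operational_db"   # highly relevant: keep readily accessible
    if score >= LOW_RELEVANCE:
        return "low_cost_store"   # lower relevance: cheaper storage
    return "discard"              # noise: do not store at all


if __name__ == "__main__":
    terms = {"pesticide", "residue", "wheat"}
    for d in [
        Document("a", "Pesticide residue found in wheat shipment"),
        Document("b", "Wheat prices rose"),
        Document("c", "Football scores"),
    ]:
        print(d.doc_id, route(d, terms))
```

Because the score is recomputed whenever the target terms change, the same function could in principle be rerun later to move a document between tiers as its relevance rises or falls.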
Now organizations have a way to analyze data up front, determine its relative importance, and use analytics to support automated processes that move data to the most appropriate storage location as it evolves from low to high relevance, or vice versa. It’s like being able to “try on” data in cyberspace before having to commit to store it.
Smart filtering in practice
A government agency is employing smart filtering to monitor various scientific information sources and media outlets to identify potential risks to food production. The organization is assessing more than 5 million unique sources of text (e.g. reports, documents, social media) looking for relationships between chemicals in the food production chain and possible side effects.
Given the volume of data involved, the organization had only been able to run the analysis once a month in the past. As the analysis is for safety reasons, month-old data isn’t nearly as effective as more recent data. Now the organization can customize information retrieval calls on those millions of texts across the entire food chain, homing in on the most relevant information before download. As search functions crawl the Web, smart filters with embedded extraction rules filter out the irrelevant content. The organization discovered that only about 10 percent of the data it previously stored was of interest. By narrowing down the data store and analysis to that critical 10 percent, the organization can now report much more frequently and deliver better and more timely alerts on emerging contaminants or other safety risks.
While smart filters can work with voluminous external data, they can also be employed on internal data that isn’t being categorized effectively. One example is a financial institution call center that examined details from lost accounts. It discovered that in many cases, customers were quite explicit that if they didn’t get a call back from an employee, or if a manager didn’t address their concern, they were going to end their relationship with the financial institution. The conversation had been correctly typed up, but the call center employee had manually categorized the call as “general call”. In a retrospective review of the data, carried out to decide whether to employ smart filtering, the organization discovered 17,000 examples exhibiting similar behavior, where the total asset level for all of these “at risk” customers was worth more than a billion dollars. Today, the financial institution uses smart filtering to detect potential relationship-ending comments, identify up-sell and cross-sell opportunities, and quickly alert the staff to address the customer in a meaningful way.
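The call center case can be sketched in a few lines of Python. This is only a simplified stand-in: the patterns and the "at_risk" label are hypothetical, and a production smart filter would rely on linguistic rules and statistical models rather than a short hand-written list of regular expressions.

```python
import re

# Hypothetical patterns that suggest a customer may end the relationship.
AT_RISK_PATTERNS = [
    re.compile(
        r"\b(close|closing|cancel|cancelling|canceling)\b.{0,40}\baccounts?\b",
        re.I,
    ),
    re.compile(
        r"\b(take|move|moving)\b.{0,40}\b(business|money|accounts?)\b"
        r".{0,20}\belsewhere\b",
        re.I,
    ),
    re.compile(r"\bunless\b.{0,60}\b(call(s|ed)? (me )?back|manager)\b", re.I),
]


def recategorize(note: str, manual_category: str) -> str:
    """Override the manual category when the note text signals attrition risk."""
    if any(p.search(note) for p in AT_RISK_PATTERNS):
        return "at_risk"
    return manual_category


if __name__ == "__main__":
    notes = [
        "Customer said she will close her account unless a manager calls her back.",
        "Customer asked about branch opening hours.",
    ]
    for note in notes:
        print(recategorize(note, "general call"), "|", note)
```

The point of the sketch is that the text of the note, not the manually chosen category, drives the alert, which is how calls filed under “general call” could still be surfaced.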
Tracking sentiment correctly — without blowing up your storage inbox
There has been a lot of buzz around sentiment analysis. What does discussion around sport utility vehicles and accidents mean to your car company? Or your insurance provider? How will a Federal Reserve announcement about interest rates affect the investments your firm holds? You don’t want to go through the bother of storing this information; you just want it delivered at the appropriate time. If you are a power company dealing with a widespread power outage, a quick read of the location of tweets about power loss could be used as an added layer to find downed lines. A publicly traded company can keep an eye on sentiment in the days following a quarterly earnings release. Non-governmental organizations can look for signs of unrest in nations where they keep aid workers. The smart filter, though, can’t work unless the underlying ontology is “smart enough” to filter out the words and phrases unrelated to what is being sought. Apple Computer does not want to know how people are using apples in their autumn desserts. A well-designed filter has rules attached that make decisions about what to capture and what to discard, what to surface and what is just useless “chatter”.
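To make the Apple example concrete, here is a minimal Python sketch of context-based disambiguation. The cue word lists are invented for illustration and are far smaller than a real ontology; the idea shown is only that surrounding words, not the keyword alone, decide whether a mention is kept.

```python
import re

# Hypothetical cue words; a real ontology would be much richer.
COMPANY_CUES = {"iphone", "ipad", "mac", "macbook", "ios",
                "earnings", "stock", "shares"}
FRUIT_CUES = {"pie", "cider", "orchard", "dessert", "desserts",
              "recipe", "baked", "cinnamon"}


def tokens(text: str) -> list:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_about_company(text: str) -> bool:
    """Keep a mention of 'apple' only when company cues outnumber fruit cues."""
    words = tokens(text)
    if "apple" not in words and "apple's" not in words:
        return False
    company = sum(w in COMPANY_CUES for w in words)
    fruit = sum(w in FRUIT_CUES for w in words)
    return company > fruit


if __name__ == "__main__":
    posts = [
        "Apple shares rose after strong iPhone earnings",
        "My favorite autumn dessert is a baked apple with cinnamon",
    ]
    for post in posts:
        print(is_about_company(post), "|", post)
```

A plain Boolean search for "apple" would capture both posts; the cue comparison is a crude version of the extra context that lets a filter discard the dessert post before it is stored.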
Smart filtering applies more than if/then rules and Boolean searches. It combines advanced analytics, natural language processing, and faceted if/then rules to target and extract relevant information. Without smart filtering, you may not be able to answer important questions in a timely manner, you will be stuck with an ever-growing volume of data and the attendant cost of storing it, and you may miss out on opportunities to take action on critical business decisions.
David Pope holds U.S. patents and has worked on data integration, forecasting and modeling solutions at SAS for 22 years. Dan Zaratsian is a Text Analytics Consultant for the SAS Business Analytics Practice.