In a connected computing world where illicit activity can take the form of data breaches or credit card fraud, anomaly detection provides a significant line of defense for detecting and preventing that activity. Numerous data scientists have written doctoral theses and applied for software patents that combine K-Means clustering, or similar techniques, with specific mathematical algorithms to detect these types of anomalies within a given data set. But what about the large number of software developers who work with non-critical data? This article presents a simple concept to help you find and correct anomalies within non-critical data.
Detecting anomalies is something our minds do every single day. While driving, we suddenly feel alert because we have encountered something outside of the norm: perhaps a point anomaly, something too far from the data surrounding it, or a contextual anomaly, something out of place for the particular situation. Encounters with anomalies like these cause us to make decisions such as slowing down the car. Our responses can be so ingrained in our behavior that we are unaware we are reacting at all. Anomaly detection helps shape our trust or mistrust of the data we encounter. As a software developer, if you have enough domain expertise to understand the data you are processing, you can use that same process to help validate and improve your data.
We recently developed a Retail Intelligence Application that gathered customer feedback and then used a Cognitive Intelligence Engine (AI) to determine customer sentiment. The data was then presented in various ways to help uncover new business insights based on customer preference and feedback. In most cases the AI was fairly accurate, but we also found cases where a particular sentence, an outlier within an overall customer review, caused an incorrect sentiment calculation for the review. Below is an example of an outlier sentence contained within an overall positive review:
“Nike has a much firmer sole and the New Balance upper has better water resistance than this shoe. But, overall we love the quality and performance of this shoe and we will be buying it again.”
The first sentence generates a significantly negative sentiment score which, when summed with the scores of the other sentences in the review, lowers the overall sentiment of an otherwise positive review. A traditional method for handling outliers like this is to discard them from the data set. While that solution corrects the sentiment score, it comes at the cost of losing valuable information about the product features the customer evaluates when choosing to purchase a new shoe. To resolve this, we stepped back from the incoming text to look at what other data was provided with a customer product review.
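The effect of a single outlier sentence on a summed score can be sketched in a few lines of Python. This is only an illustration: the per-sentence scores are made-up values on an assumed -1 to +1 scale, not output from the engine described above, and `review_sentiment` is a hypothetical helper.

```python
from typing import List


def review_sentiment(sentence_scores: List[float]) -> float:
    """Sum per-sentence scores into a single review-level score."""
    return sum(sentence_scores)


# Hypothetical scores for the two sentences of the review quoted above:
# the brand comparison scores negative, the closing sentence positive.
scores = [-0.8, 0.6]

# Keeping the outlier drags an otherwise positive review below zero.
with_outlier = review_sentiment(scores)

# Discarding the outlier restores the score, but the comparison
# sentence (and what it says about product features) is lost.
without_outlier = review_sentiment(scores[1:])

print(round(with_outlier, 2), round(without_outlier, 2))
```

With these assumed values the review scores below zero when the outlier is kept and above zero when it is dropped, which is the trade-off described above.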
For online reviews, customers also submit a 1-to-5 star rating indicating their overall happiness. Using this additional information, we chose among several ways of handling a sentence found to be an outlier. The following decisions, based on this additional information, improved the overall validity of our sentiment detection results.
- If a high star rating was provided, and all but one or two sentences contained a sentiment rating that matched this star rating, sentiment for the outlier values was adjusted upward until it came into a pre-determined range for the star rating.
- If a low star rating was provided, and all but one or two sentences contained a sentiment that reflected this rating, the outlier sentiments were adjusted downward.
- If there was a significant difference between the star rating provided and the majority of the textual sentiments detected, we chose not to use the star rating information. In these cases, no adjustment was made to the detected sentiment outliers.
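The three rules above could be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the score scale (-1 to +1), the range assigned to each star rating, the treatment of 3-star reviews, and the name `adjust_outliers` are all assumptions made for the example.

```python
from typing import List

# Assumed sentiment ranges on a -1..+1 scale (hypothetical values).
HIGH_STAR_RANGE = (0.2, 1.0)    # expected for 4- and 5-star reviews
LOW_STAR_RANGE = (-1.0, -0.2)   # expected for 1- and 2-star reviews
MAX_OUTLIERS = 2                # "all but one or two sentences"


def adjust_outliers(scores: List[float], stars: int) -> List[float]:
    """Pull outlier sentence scores into the range implied by the star rating."""
    if stars >= 4:
        low, high = HIGH_STAR_RANGE
    elif stars <= 2:
        low, high = LOW_STAR_RANGE
    else:
        # Assumption: a 3-star rating gives no clear signal, so change nothing.
        return list(scores)

    outliers = [s for s in scores if not low <= s <= high]
    matching = len(scores) - len(outliers)

    # Rule 3: if the text mostly disagrees with the stars, ignore the stars.
    if len(outliers) > MAX_OUTLIERS or matching < len(outliers):
        return list(scores)

    # Rules 1 and 2: clamp each score into the expected range. In-range
    # scores are unchanged; outliers move up (high stars) or down (low stars).
    return [min(max(s, low), high) for s in scores]


print(adjust_outliers([-0.8, 0.6], stars=5))
print(adjust_outliers([-0.8, -0.7, -0.6, 0.5], stars=5))
```

In the first call the single negative sentence is raised to the lower bound of the assumed 5-star range, so its text stays in the data set. In the second call most sentences contradict the star rating, so the scores are returned unchanged.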
We could look for additional information to help inform our data validation decisions because we understood the domain of online customer reviews. As you, the software developer, work with data, it is sometimes helpful to step back and ask, “What else do I know about this information that could help me validate the incoming data I am receiving?”
An advanced mathematics degree can help you create statistical models that validate and improve incoming data sets. But don’t discount the power of understanding the context from which your data is derived. That too can provide powerful tools and insights to help improve your overall data processing results.