Deriving Advanced Analytics from Flawed Data Sets
Self-service analytics and business intelligence (BI) investments have been an important step in advanced analytics maturity, granting entire organizations freedom to explore operating data rather than creating a bottleneck of information within the analytics team. However, self-service reports don’t guarantee access to the insights that really drive progress; as data volumes grow, throwing more data at business users will only make it more difficult to uncover the important, and often intricate, patterns within the data.
By leveraging advanced analytics and machine learning techniques, data scientists can unveil those patterns and provide recommendations and next best actions more efficiently than business users can on their own. Many companies have tried to reap the benefits of analytics with their existing data and fallen short, and the usual assumption is that they don’t have enough data, or the right data, to tackle advanced analytics.
As with any challenging business problem, the first step is to develop a focused problem statement. Once your problem is clear, identify what data is available and what may be missing. Many teams get to this point, decide there is not enough data for advanced analytics, and abandon their projects before they ever start.
With the tips below, companies can put their existing data to work, even if it’s flawed, and avoid getting stuck.
Generate New Attributes
If there’s a disconnect between the data available and the desired analytical results, consider developing attributes that describe the relationships between various data characteristics. These newly developed attributes often provide a more meaningful and influential variable to your analytics or modeling. For example, transform a transactional invoice history data set into a customer profile by creating attributes like number of days between invoices and average sales per invoice.
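As a rough illustration, here is a minimal pandas sketch of that invoice-to-profile transformation; the column names and values are hypothetical stand-ins for whatever your invoice system actually provides:

```python
import pandas as pd

# Hypothetical transactional invoice history: one row per invoice.
invoices = pd.DataFrame({
    "customer_id": [101, 101, 101, 202, 202],
    "invoice_date": pd.to_datetime(
        ["2024-01-05", "2024-02-04", "2024-03-20", "2024-01-10", "2024-04-15"]),
    "invoice_total": [500.0, 620.0, 480.0, 1200.0, 950.0],
})

invoices = invoices.sort_values(["customer_id", "invoice_date"])

# Days elapsed since each customer's previous invoice.
invoices["days_since_prev"] = (
    invoices.groupby("customer_id")["invoice_date"].diff().dt.days
)

# Roll the transactions up into one customer-profile row per customer.
profile = invoices.groupby("customer_id").agg(
    avg_days_between_invoices=("days_since_prev", "mean"),
    avg_sales_per_invoice=("invoice_total", "mean"),
    invoice_count=("invoice_total", "size"),
)
print(profile)
```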
Pro Tip: Make sure the data feeding the model is pertinent to the problem statement. Business users should identify the use case to ensure that the provided data is as relevant as possible. This may require a new column, generated from the combination of other components of the data set.
A blending of data sets can also provide relevant data. For example, master data about a product can be joined with transactional invoice data to give a more detailed view of customer buying patterns and help with customer segmentation. You can take this a step further and join your data with public information, such as weather data, to forecast the effect of the day’s weather on category sales by customer segment. Public data like this opens new avenues of analysis; extending the example, it could reveal which products are weather-driven so that insight can be built into promotional spend analysis.
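Here is a hedged sketch of that kind of blend; the weather feed, product master, and column names below are hypothetical examples, not a prescribed schema:

```python
import pandas as pd

# Hypothetical inputs: transactional sales, product master data, and daily weather.
sales = pd.DataFrame({
    "product_id": ["A1", "B2", "A1"],
    "sale_date": pd.to_datetime(["2024-06-01", "2024-06-01", "2024-06-02"]),
    "units_sold": [30, 12, 55],
})
products = pd.DataFrame({
    "product_id": ["A1", "B2"],
    "category": ["ice cream", "soup"],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2024-06-01", "2024-06-02"]),
    "high_temp_f": [72, 95],
})

# Enrich transactions with master data, then blend in the public weather data.
enriched = (
    sales.merge(products, on="product_id", how="left")
         .merge(weather, left_on="sale_date", right_on="date", how="left")
)

# Category sales by temperature band hint at which products are weather-driven.
enriched["temp_band"] = pd.cut(enriched["high_temp_f"], bins=[0, 80, 120],
                               labels=["mild", "hot"])
print(enriched.groupby(["category", "temp_band"], observed=True)["units_sold"].sum())
```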
Look for Correlation and Proxies
Correlation analysis is another way to gain a deeper understanding of the data. Working toward optimization typically involves many variables and a clear picture of how they affect one another. Are there highly correlated factors influencing the analytical model? Work with the business users to understand the model’s use case. Keep in mind that highly correlated inputs can dominate a machine learning model, so you generally want to choose just one from each correlated pair. For example, if you are trying to optimize your trade promotion program, you may have multiple data options for units sold and sale price, which are highly correlated. The trade promotion manager and the chief operating officer (COO) may also have different goals for the analysis: the trade manager wants to understand promotional lift based on units sold, while the COO wants to understand revenue and margin. To interpret results effectively, a data scientist needs to consider the business case, the users, and what they are interested in understanding.
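A quick correlation check like the following, shown here with made-up promotion data, is one simple way to spot those pairs before modeling:

```python
import pandas as pd

# Hypothetical promotion data where units_sold and revenue move together.
promo = pd.DataFrame({
    "units_sold": [100, 150, 200, 250, 300],
    "revenue": [1000, 1480, 2050, 2400, 3100],
    "promo_depth_pct": [5, 10, 15, 20, 25],
})

# Pairwise Pearson correlations across the candidate model inputs.
corr = promo.corr()
print(corr)

# Flag pairs above a chosen threshold so you can keep just one of each pair,
# picking whichever aligns with the stakeholder's goal (units vs. revenue).
threshold = 0.9
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```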
Companies that feel they don’t have the data to solve their largest business challenges through analytics don’t need to overhaul their processes or invest in significantly more resources to obtain it. Establishing a proxy can be a powerful alternative in the absence of the preferred data. A proxy should be statistically related to the data it represents, and understanding how tightly the two are related is crucial for judging accuracy. Statistical analysis and correlation can identify these proxy values and the mathematical relationships within your data. Proxies aren’t perfect solutions, but they can still be very useful.
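One simple way to vet a candidate proxy, sketched below with hypothetical pilot-store data, is to measure its correlation with the preferred metric wherever the two overlap:

```python
import pandas as pd

# Suppose in-store foot traffic (preferred measure) exists only at a few
# pilot stores, while loyalty-app check-ins (candidate proxy) exist everywhere.
# These column names and values are hypothetical.
pilot = pd.DataFrame({
    "foot_traffic": [820, 640, 910, 500, 760],
    "app_checkins": [310, 240, 355, 180, 290],
})

# Quantify how tightly the proxy tracks the real measure; r near +/-1 means
# the proxy is a reasonable stand-in, with a known degree of imprecision.
r = pilot["foot_traffic"].corr(pilot["app_checkins"])
print(f"Pearson r between proxy and target: {r:.3f}")
```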
Use Advanced Analytics like Machine Learning
A popular approach to predictive analytics is using machine learning models that run on data sets with a column of ‘labels’, the values to be predicted. Each label is a single class or numeric value, such as total sales or ‘high risk’ versus ‘low risk’. If this predicted value does not yet exist in the data set, there are ways to label each entry without countless hours of manual effort. A solid understanding of the problem statement and the data itself can be enough for a business to establish rules that automatically assign labels, as in the sketch below.
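Here is what such rule-based labeling might look like; the thresholds and column names are hypothetical examples of rules a business might agree on:

```python
import pandas as pd

# Hypothetical customer purchase history; the thresholds below encode
# business rules agreed with domain experts, not learned values.
customers = pd.DataFrame({
    "orders_last_year": [12, 2, 7, 0, 25],
    "avg_order_value": [85.0, 40.0, 60.0, 0.0, 120.0],
})

def engagement_label(row):
    """Assign a label from simple business rules instead of manual tagging."""
    if row["orders_last_year"] >= 10 and row["avg_order_value"] >= 75:
        return "high"
    if row["orders_last_year"] >= 5:
        return "medium"
    return "low"

customers["engagement"] = customers.apply(engagement_label, axis=1)
print(customers)
```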
For example, consider a business expanding into a new territory. To identify consumers in that area who are likely to be highly engaged with its brand, the company can label a historical data set of existing customers based on their purchase history. A predictive model trained on those labels will surface the demographic factors that most strongly indicate engagement level and flag the high-potential targets in the new market.
Technical Tip: Be sure to exclude the data that drives the newly created label from the model’s training set; otherwise that highly correlated value will dominate the results.
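Putting the example and the tip together, a minimal scikit-learn sketch might look like the following; the demographics, the excluded orders_last_year column, and the labels are all hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical labeled history: demographics plus the purchase field that
# was used to derive the "engagement" label via business rules.
df = pd.DataFrame({
    "age": [34, 22, 45, 60, 29, 51],
    "household_size": [2, 1, 4, 2, 3, 5],
    "urban": [1, 1, 0, 0, 1, 0],
    "orders_last_year": [12, 2, 7, 0, 25, 9],   # drove the label -> exclude
    "engagement": ["high", "low", "medium", "low", "high", "medium"],
})

# Train only on demographics; orders_last_year defined the label, so keeping
# it would just let the model rediscover the labeling rule.
features = ["age", "household_size", "urban"]
model = RandomForestClassifier(random_state=0).fit(df[features], df["engagement"])

# Feature importances suggest which demographics indicate engagement,
# and the fitted model can now score consumers in the new territory.
print(dict(zip(features, model.feature_importances_.round(3))))
```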
There are also automated approaches to illuminate patterns in data by grouping similar data rows together into clusters. These clusters may be used to validate long-standing beliefs of the business about the data, but they can also bring to light new patterns in the data that the team was not aware of. Sometimes these new patterns may even call into question the anecdotal or gut-feel biases held by the team and encourage additional analysis to find clarity.
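For instance, a k-means pass over the customer profiles built earlier, sketched below with hypothetical numbers, is one common automated approach; the choice of three clusters is arbitrary and worth reviewing with business users:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer-profile attributes (e.g., from the earlier rollup).
profiles = pd.DataFrame({
    "avg_days_between_invoices": [30, 28, 90, 85, 10, 12],
    "avg_sales_per_invoice": [500, 520, 150, 140, 2000, 1900],
})

# Scale first so one attribute's units don't dominate the distance metric.
scaled = StandardScaler().fit_transform(profiles)

# Group similar customers into clusters to compare against long-standing
# beliefs about the segments.
profiles["cluster"] = KMeans(n_clusters=3, n_init=10,
                             random_state=0).fit_predict(scaled)
print(profiles)
```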
Regardless of a company’s data maturity level, generating new attributes, establishing proxies, and applying machine learning can get stalled analytics projects in motion. Perfect predictive models remain extremely rare. While some applications demand near-perfection, most business functions can benefit tremendously from results in the 75-80% accuracy range. Don’t let the pursuit of perfection kill advanced analytics projects before they get off the ground. Take the time to identify your business problem and work through your true data options with both analysts and business users to draw answers from even the least organized data sets.
Contributions from Elise Watson.