How Can I Trust My Data? 4 Principles to Ensure Reliable Insights
Data is ubiquitous in business, to the extent that it can be overwhelming; yet everything from basic reporting to the outputs of complex Machine Learning (ML) and Artificial Intelligence (AI) models has become the lifeblood of strategic decision-making. With data originating from so many different sources—transactional data, online tracking, purchased data sets, and APIs, to name just a few—often updating in real time, how can stakeholders be confident that the data, and the insights derived from it, are reliable? When asking “how can I trust my data?”, there are a few basic principles that, when diligently applied, can produce reliable data that improves strategic business decisions.
Principles for Reliable Data
1. Engage everyone in answering, “what is the business case?”
For any data strategy to be successful, every person who plays a role in the strategy—from executives making strategic decisions, to the data architects and engineers in charge of acquiring, structuring, and storing the data, and everyone in between—must be aligned on a vision for how the data will be used.
Executives and other decision-makers must articulate what insights they are hoping to derive from business intelligence, forecasts, predictive models, and AI implementations. These insights must be realistic based on existing data sources and new data sources that may be built or acquired.
Data architects and engineers need to understand these goals so that they can properly store and structure the data. Clear goals, together with data that is validated from quality sources, well-structured, and made as simple as possible for those who will use it, are the foundation for being able to trust data.
This sounds simple enough, but often not all those involved “speak the same language.” Executives often do not understand data and analytics terminology, and those in the weeds working with data sometimes don’t understand bigger picture business strategies. Therefore, it is a good idea for someone who understands both to serve as a Business Translator, gathering requirements from those in charge of strategy, and translating those requirements into actionable steps for the data team to implement.
2. Choose the right architecture, and create a “single source of truth”
Data architecture can be complex, and there is no one-size-fits-all solution. No matter your data architecture—utilizing databases, data warehouses, data lakes, or some combination thereof—it is important to focus on a “single source of truth”: a table or set of tables of highly structured data centered on the most important business practices and objectives. Data engineers meticulously refine these tables from raw sources so that they are accurate, consistently updated, and contain most or all of the key information needed for analysis and modeling. They hold the key data behind our central question, “What is the business case?”
This single source of truth can be a dedicated set of tables in the data warehouse. Alternatively, James Mayfield with dbt Labs suggests carving out a section right in the middle of the data lake for what he calls the “island of truth,” so that these key data sets are housed among other unstructured data that can be easily merged for advanced analysis and modeling.
Only dedicated data stewards should have permissions to make changes to these tables to ensure the data quality is not accidentally compromised by someone using it for analysis.
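As a minimal sketch of what that refinement might look like, assuming a hypothetical raw_orders.csv extract and pandas as the engineering tool, a curated table could be built roughly as follows. The file, table, and column names are illustrative, not a prescribed implementation:

```python
import pandas as pd

# Hypothetical raw extract straight from the source system:
# columns assumed to be order_id, order_date, sku, qty, unit_price, status
raw_orders = pd.read_csv("raw_orders.csv")

# Drop records that fail basic validation
valid = raw_orders.dropna(subset=["order_id", "order_date", "sku"])
valid = valid[valid["qty"] > 0]

# Enforce consistent types so downstream users never have to guess
valid["order_date"] = pd.to_datetime(valid["order_date"])

# Derive key business fields once, centrally, instead of in every report
valid["gross_sales"] = valid["qty"] * valid["unit_price"]

# Publish the curated table; only data stewards may overwrite it
valid.to_parquet("warehouse/orders_source_of_truth.parquet")
```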
3. Align on key metrics
In addition to vertical alignment toward a common business case, a form of horizontal alignment is also important: agreement on the meanings of key metrics across departments.
Do all departments—say, accounting, operations, and procurement—agree on the difference between sales and net sales? Does net sales in accounting match net sales in operations? Is available inventory calculated the same way by procurement and operations?
Wherever possible, all departments should use the same data sets and definitions—ideally within the single source of truth—when calculating similar metrics. Facilitating cross-departmental communication to ensure alignment on KPIs and other metrics is another ideal role for the Business Translator. Finally, the agreed-upon meanings of variables within data sets and shared metrics should be well-documented and easily accessible.
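One lightweight way to enforce such alignment is to encode each agreed-upon metric in a single, version-controlled function that every department's reports call, rather than letting each team re-derive it. A minimal sketch, assuming hypothetical column names in the curated orders table:

```python
import pandas as pd

def net_sales(orders: pd.DataFrame) -> float:
    """The one agreed-upon definition of net sales:
    gross sales minus returns and discounts.

    Accounting, operations, and procurement all call this same
    function, so their numbers cannot silently diverge.
    """
    gross = (orders["qty"] * orders["unit_price"]).sum()
    returned = orders[orders["status"] == "returned"]
    returns = (returned["qty"] * returned["unit_price"]).sum()
    discounts = orders["discount"].sum()
    return gross - returns - discounts
```

If the definition of net sales ever changes, it changes in exactly one place, and every report picks up the new meaning on its next run.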
4. Bring documentation to the forefront
Data and analytics projects are exciting, and stakeholders are often eager to get to the key insights, leaving documentation of the processes that led to those insights as an afterthought. Adding to the pressure to neglect documentation, the time spent documenting these processes does not drive any immediate business impact in terms of revenue, cost-cutting, or efficiency.
Neglecting these crucial steps, however, can lead to disaster. People integral to creating these processes leave for other jobs, data sources emerge or are deprecated over time, and servers fail or are hacked, all of which can require rebuilding or changing existing processes. Without thorough, reliable documentation of every single step of each process, the result can be supply chain disruptions, unhappy customers, lost revenue, disgruntled employees, or any number of other nightmare scenarios for a data-driven business.
In addition to process documentation, creating data dictionaries is extremely valuable, especially for data contained in the single source of truth. Data dictionaries help all parties using the data to understand the meaning of every single variable in each data set.
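A data dictionary does not have to be elaborate; even a simple, version-controlled mapping that is checked against the actual table can catch undocumented fields. A hedged sketch, with hypothetical column names and definitions:

```python
import pandas as pd

# Hypothetical data dictionary for the curated orders table
DATA_DICTIONARY = {
    "order_id":    "Unique identifier assigned by the order management system",
    "order_date":  "Date the order was placed (not shipped), in UTC",
    "gross_sales": "Quantity times unit price, before returns and discounts, in USD",
}

orders = pd.read_parquet("warehouse/orders_source_of_truth.parquet")

# Every column in the single source of truth must have an agreed-upon definition
undocumented = set(orders.columns) - set(DATA_DICTIONARY)
if undocumented:
    raise ValueError(f"Columns missing from the data dictionary: {undocumented}")
```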
It is also helpful to create metadata for key data sets. Metadata summarizes existing data sets with information such as file or table names, storage locations, file sizes, record counts, and other summary details. Tracking metadata over time can help detect anomalies in key data sources when a sudden, unexpected change occurs.
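For example, a nightly job might snapshot a few pieces of metadata and compare them against the previous run. The file paths and the 20% threshold below are illustrative assumptions, not recommendations:

```python
import json
from pathlib import Path

import pandas as pd

TABLE = "warehouse/orders_source_of_truth.parquet"
SNAPSHOT = Path("metadata/orders_snapshot.json")

df = pd.read_parquet(TABLE)
current = {"table": TABLE, "rows": len(df), "columns": list(df.columns)}

if SNAPSHOT.exists():
    previous = json.loads(SNAPSHOT.read_text())
    # A sudden large swing in record count often signals a broken feed
    if previous["rows"] and abs(current["rows"] - previous["rows"]) / previous["rows"] > 0.20:
        print(f"WARNING: row count changed from {previous['rows']} to {current['rows']}")
    # A schema change can silently break every downstream report
    if current["columns"] != previous["columns"]:
        print("WARNING: table schema has changed since the last snapshot")

SNAPSHOT.parent.mkdir(parents=True, exist_ok=True)
SNAPSHOT.write_text(json.dumps(current))
```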
What About Trusting Models and AI?
Models and AI go beyond basic data analysis, using algorithms to make predictions, classify entities such as customers or products, and find hidden patterns in data sets. Some models are simple, with easily interpretable results. More sophisticated models, such as the deep neural networks used in AI, are “black box” models whose processes are effectively uninterpretable, even to sophisticated users.
The answer to how to trust models is twofold—thorough testing and regular updates.
Before being used for any real decision-making or business process, models should be thoroughly tested for accuracy and consistency by a qualified data scientist, AI engineer, or other expert practitioner, using methods like cross-validation, back-testing, or bootstrapping. Once the model is ready for implementation, it is often a good idea to conduct an A/B test, in which the model is rolled out to only a subset of the business so that its results can be compared against a control group over the same timeframe.
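As a simple illustration of one of those methods, the sketch below runs five-fold cross-validation with scikit-learn. The synthetic data and the random-forest model are placeholders standing in for a real, validated business data set and whatever model the team is evaluating:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for a real, validated business data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(random_state=42)

# Five-fold cross-validation: train on 4/5 of the data,
# score on the held-out 1/5, repeated five times
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```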
All models need to be re-trained, tested, and updated regularly. The behavior of customers, employees, markets, supply chains, and other entities changes over time, meaning that the data that drives a model will evolve and the model may drift. Models should be maintained according to strict policies that are outlined before they are ever implemented.
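One common way to decide when retraining is due is to compare the distribution of incoming data against the data the model was trained on. The sketch below uses a two-sample Kolmogorov-Smirnov test on a single feature as a stand-in for whatever drift metric a team's maintenance policy specifies; the data here is simulated:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # feature at training time
incoming_feature = rng.normal(loc=0.4, scale=1.0, size=5000)  # same feature in live traffic

# Two-sample KS test: has the feature's distribution shifted since training?
statistic, p_value = ks_2samp(training_feature, incoming_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic {statistic:.3f}); schedule retraining")
else:
    print("No significant drift detected on this feature")
```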
Closing Thoughts
It is worth restating that every decision surrounding data and analytics should come back to the question, “What is the business case?” Do not underestimate the role of the Business Translator in keeping all parties involved in the end-to-end process on the same page regarding project goals, how KPIs will be measured across departments, and the processes and procedures that safeguard data integrity over time.
Whether your company is in its data infancy or you are well-established and need a data strategy overhaul, Clarkston’s expert team of data architects, data engineers, data scientists, data analysts, business translators, and business strategists is ready to help. Reach out to us today.