Data debt is everyone’s problem. Gaps in data management processes, weak accountability, and general ambivalence toward data, combined with technical shortcomings, can create data debt, which the business feels either as time spent correcting data or as reduced trust in that data’s validity. While technical debt has a well-documented impact on IT operations, retail, consumer goods, life sciences, and other industries are just starting to understand the impacts of data debt on their business. Data debt is often dismissed as a data quality issue, but it can affect every area of the business, both increasing manual work and disrupting analytics efforts.
Most organizations see these issues but don’t understand the root cause or know how to address it. Companies that fail to review and pay down these core debts will face high costs and impediments to their growth.
What is Data Debt?
Data debt is the accumulation, over time, of problems caused by flawed processes, aging systems, and inattention to the health of the data ecosystem. While it may start hidden and unnoticed, it eventually becomes apparent in day-to-day operations. It is often a symptom of technical debt but can have deeper ramifications and financial impacts of its own. With the increasing volume and velocity of data, these problems will continue to grow and fester without intervention, eventually causing debilitating issues that disrupt the business.
Part of the challenge with data debt is that it is often viewed as an IT problem rather than an enterprise issue that involves everyone. Organizations that have experienced multiple failed system implementations or sudden timeline extensions on large projects are likely seeing the impacts of data debt. Similarly, if your teams struggle to agree on basic figures, there is data debt. Data debt also makes it hard to identify what data assets exist or where they can be accessed. IT cannot solve these problems on its own; it needs business engagement to pay down this debt.
Looking at a few key areas in depth, we can see the impacts that data debt can have as well as the possible methods to reduce that impact going forward.
Mismanaged Core Data
Mismanaged core data is often one of the easiest places to see the burden of data debt in an organization. Core data is the data that is repeated throughout the business: customer and vendor lists, product information, supplier records, and other key data sets.
If teams within the organization must enter the same information into multiple systems manually, like a customer’s address into a CRM, ERP, and a warehouse system, there is data debt. Similarly, data teams experience the impacts of debt when they must combine tables from multiple systems to get an enterprise-wide picture of all the items, customers, or any other core data. Rapid acquisitions without standardizing systems and data make this worse. An organization could have several systems containing the same customer or supplier information, and this disconnectedness could result in customers receiving duplicate invoices in a given month, among other operational problems.
These data silos can cause tremendous impacts. Something as simple as an ampersand in one system and an ‘and’ in another can lead an organization to overcount its vendors. Those discrepancies can also produce an incomplete picture of a given entity, duplicates within the same system, and painful reconciliation efforts. Additionally, when aggregates computed by one team differ from another’s, executive-level reporting becomes tricky if each business unit manages its own core data. All of the above limit the organization’s ability to measure performance quickly and increase the time spent on every task.
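The ampersand problem above can be sketched in a few lines. This is a minimal illustration, not a matching engine: the vendor names and the rules inside normalize() are assumptions chosen to show how trivial formatting differences inflate a naive count.

```python
def normalize(name: str) -> str:
    """Reduce trivial formatting differences before comparing records."""
    name = name.lower().strip()
    name = name.replace("&", "and")
    name = " ".join(name.split())  # collapse repeated whitespace
    return name

# Two real vendors, as they appear across three systems (illustrative data):
vendors = ["Smith & Sons", "Smith and Sons", "smith & sons ", "Acme Corp"]

naive_count = len(set(vendors))                    # counts raw strings: 4
true_count = len({normalize(v) for v in vendors})  # counts normalized names: 2

print(naive_count, true_count)  # prints "4 2"
```

A naive distinct count overstates the vendor base by a factor of two here; real matching tools apply far richer rules, but the failure mode is the same.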
The resolution of these core data debts is often time-consuming but can be tremendously rewarding. A straightforward method is to identify systems with duplicative data, select one point of truth for a given data set, then consolidate and feed that data to downstream systems via integrations. Depending on the architecture and age of the systems, this can be a massive project, so identifying a key pilot tied to a specific business goal, such as a consolidated supplier report or an invoice efficiency program, can help with executive buy-in. On the analytics side, data fabrics or data hubs can provide that consolidation point. Advanced capabilities in these systems can automatically recognize like records and combine them into a golden record for reporting, enabling analysts to pull data from a single point rather than multiple systems.
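To make the “like record” and golden-record idea concrete, here is a toy sketch using Python’s standard difflib. The sample records, the 0.85 similarity threshold, and the fewest-empty-fields survivorship rule are all illustrative assumptions; commercial data hubs use far more sophisticated matching and survivorship logic.

```python
import difflib

records = [
    {"name": "Jane Doe",  "email": "jane@example.com", "phone": ""},
    {"name": "Jane  Doe", "email": "",                 "phone": "555-0100"},
    {"name": "John Roe",  "email": "john@example.com", "phone": "555-0199"},
]

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same entity if they are close string matches."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def golden_records(records):
    """Group like records, then keep the most complete record per group."""
    groups = []
    for rec in records:
        for group in groups:
            if similar(rec["name"], group[0]["name"]):
                group.append(rec)
                break
        else:
            groups.append([rec])
    # Survivorship rule: the record with the fewest empty fields wins.
    return [max(g, key=lambda r: sum(bool(v) for v in r.values())) for g in groups]

for rec in golden_records(records):
    print(rec)
```

The two “Jane Doe” rows collapse into one golden record, leaving two customers instead of three. In practice, survivorship rules are usually field-by-field (e.g., most recently verified address wins), not record-by-record.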
External Data Sources
With the growth in exploratory analytics, firms are beginning to purchase and integrate more third-party data sources into their existing ecosystems. Third-party data is information provided by a group external to your organization. Common examples are weather or census data, which can be combined with enterprise information to provide a fuller picture of an event.
When bringing in this data, teams often take shortcuts to speed up onboarding. Analysts may manually transform the data in a non-repeatable way using Excel or another tool to prepare it for analysis. Sometimes this data is stored in personal OneDrive folders or shared drop sites but is not accessible from the organization’s visualization or analytical tools. Beyond this, any metadata or specification provided by the vendor is sometimes stored in a different location, or not stored at all.
While the approaches above can be sufficient for a one-time proof of concept or analysis, the impacts are truly felt when organizations seek to repeat that work or turn the analysis into a regularly accessible tool for the business. The manual transformation steps are often lost if there is no documentation or repeatable code, leading to unnecessary rework when the analysis is repeated. Beyond that, improper storage of the metadata explaining the contents of an external file creates challenges when a new team wants to leverage the data for a new project. Lastly, failing to evaluate the quality of the acquired data can undermine any future work, making actions taken based on these data risky.
These impacts can be mitigated by establishing defined processes for consuming third-party data sets. Flexible storage, such as data lakes, paired with data catalogs can make acquired data more accessible and give other teams insight into its contents. Enterprise-wide data management tools can also help create repeatable steps to transform data for analysis. Lastly, data governance can provide guardrails around quality management.
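A repeatable ingestion step might look like the sketch below: every transform lives in code rather than in someone’s spreadsheet, and a metadata “sidecar” file records where the data came from and what was done to it, so other teams can find and trust it later. The column names (temp_f), file paths, and the single unit-conversion transform are illustrative assumptions, not a real vendor feed.

```python
import csv
import datetime
import json
import pathlib

def ingest_weather_csv(source_path: str, landing_dir: str, vendor: str) -> dict:
    """Load a third-party CSV, apply coded transforms, and land it with metadata."""
    rows = []
    with open(source_path, newline="") as f:
        for row in csv.DictReader(f):
            # Repeatable transform: convert units once, in code, not in Excel.
            row["temp_c"] = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
            rows.append(row)

    landing = pathlib.Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    with open(landing / "weather.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

    # Sidecar metadata: lineage and transforms travel with the data set.
    metadata = {
        "vendor": vendor,
        "source_file": source_path,
        "row_count": len(rows),
        "transforms": ["temp_f converted to temp_c"],
        "ingested_at": datetime.date.today().isoformat(),
    }
    (landing / "weather.meta.json").write_text(json.dumps(metadata, indent=2))
    return metadata
```

Even this small discipline means the next analyst can rerun the ingestion, see exactly which transforms were applied, and judge whether the data fits a new project; a real data catalog would register the sidecar centrally rather than leaving it on disk.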
Where to Go from Here
Like technical debt, data debt must be reviewed and taken on strategically and intentionally. It is everyone’s problem and can have tremendous impacts on decision-making throughout the organization. Executives who can identify and resolve these issues will position their businesses for growth in the coming years. To consolidate these efforts, many organizations are creating a Chief Data Officer (CDO) role to champion them and manage the teams responsible for data enterprise-wide. Project teams should also be conscious of the influence their decisions have on data and dedicate time during or after projects to address those impacts.
Inaction isn’t an option. With the increasing velocity and volume of data, organizations must find a way to effectively manage this data or face the impacts of data debt.