While not as headline-grabbing as a new machine learning model, modern data architectures are the foundation of any successful analytical effort. Outdated architectures can restrict your organization’s ability to maximize the return on machine learning investments and can prevent the productionization of analytical pilot programs. Beyond those future areas of growth, an outdated data architecture can limit self-service reporting, reduce access to information, and complicate the creation of a sustainable governance program. To achieve a sustained digital and analytical transformation, companies must invest in building a modern data architecture.
Building a Modern Data Architecture
While IT architecture prompts thoughts of systems and software, creating a truly modern data platform also requires building out the organizational components that support the physical infrastructure. Those elements, along with a data-driven culture, are crucial to getting the most out of any architecture investment.
Data strategy is the series of decisions that leads to the implementation and upgrade of data systems. Once those systems are live, the strategy should refocus on supporting the key use cases that drive business and customer value. We often see IT and data teams attracted to trendy new methodologies like real-time streaming when the business cannot make decisions at that rate. That costly investment then never turns into business value. As a rule, use cases and decisions should stem from business priorities, with cost-effective solutions then designed to meet those priorities. As the organization matures and grows, so will the strategy that supports it.
The importance of this strategy is most clearly shown in the rise of the Chief Data Officer (CDO) in the last few years. Responsible for championing data, these executives traditionally create the data strategy and execute against it. They manage the various teams involved in the data lifecycle, including the governance, engineering, and advanced analytics groups both embedded in other areas of the business and as a centralized resource.
With all this increased access to data, governance and security structures must be put in place to ensure users have the appropriate access and that any sensitive information is protected. Data governance also sets guardrails and standards within the organization. Data lakes can become data swamps without a strong metadata management program defining the rules teams must follow when bringing in datasets. Metadata sharing can increase the effectiveness of analysts in the system by providing clearer paths to each dataset. Identifying data stewards and other organizational roles helps ensure teams view data the same way and follow these established governance procedures. Numerous systems in the market can help maintain governance rules across the functional areas of the organization.
While more fully outlined in our What is Data Engineering piece, data engineering plays a major role in architecture: this team provides the pipes for data to get into the storage systems and owns those repositories. The group must also handle the challenges of integrating disparate on-premises and cloud systems within the architecture and ensuring all users have the appropriate access.
The systems of a modern data architecture are the embodiment of the data strategy. These systems may exist in your IT landscape already, but can be improved by enabling cloud integrations, streaming data, and expanding them to support business use cases. These methods for storing data shouldn’t be viewed as an either/or choice, but rather as supporting different needs within the overall company data and analytical strategy.
Traditional data warehouses have been around since the mid-80s and are designed to store enterprise data for business visualization and reporting. These systems rely on heavily structured data models that consume information from operational systems and transform it into a defined structure. This helps ensure data from various source systems is valid and standardized, but these models can take considerable time to set up, leading to longer implementations. Additionally, edits to these models can be time-consuming depending on the scale of the update.
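As a minimal sketch of this transformation step, the Python below conforms records from two hypothetical operational systems (the field names, date formats, and source labels are illustrative, not from any particular product) into a single defined warehouse structure:

```python
from datetime import datetime

# Hypothetical raw records from two operational systems with
# inconsistent field names and date formats.
orders_system_a = [{"cust": "ACME", "amt": "125.50", "dt": "2021-03-01"}]
orders_system_b = [{"customer_name": "ACME", "total": 99.0, "order_date": "03/15/2021"}]

def to_warehouse_row(record, source):
    """Conform a raw operational record to the warehouse's defined structure."""
    if source == "a":
        return {
            "customer": record["cust"],
            "amount": float(record["amt"]),
            "order_date": datetime.strptime(record["dt"], "%Y-%m-%d").date(),
        }
    if source == "b":
        return {
            "customer": record["customer_name"],
            "amount": float(record["total"]),
            "order_date": datetime.strptime(record["order_date"], "%m/%d/%Y").date(),
        }
    raise ValueError(f"unknown source: {source}")

fact_orders = [to_warehouse_row(r, "a") for r in orders_system_a] + \
              [to_warehouse_row(r, "b") for r in orders_system_b]
```

Every row now shares one schema, which is what makes downstream reporting reliable; it is also why adding a new source system requires writing (and maintaining) another mapping like the ones above.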
Often used for business analytics, data warehouses can also have areas designated for exploratory analytics where users can develop reports and dashboards to support business use cases. These sandboxes can also support initial hypothesis testing that can identify possible machine learning use cases.
Data lakes are a more recent concept, starting in the 2000s, as the costs of storing large amounts of data started rapidly declining. Data lakes are a repository for all data within and external to an organization that could be valuable. This includes transaction-level data for internal systems, third-party information, event logs from machinery, and anything else that the organization views as valuable today or could be valuable in the future.
Unlike data warehouses which require modeling before loading, data lakes consume all data in its native format while adding metadata tags to indicate origin, load time, and other elements. This leads to a mix of structured, unstructured, and semi-structured data all contained within one system. Generally, data lakes have shorter implementation times, as intensive tasks like modeling are avoided during initial setup and when adding new sources. Once installed, data lakes help jumpstart analytical efforts by breaking down barriers between systems and providing a playground for analysts to interact with and share data.
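The ingestion pattern described above can be sketched in a few lines. This is a toy stand-in for object storage (the `ingest` function and the sample payloads are hypothetical), showing how the payload lands untouched while origin, format, and load time are captured as metadata:

```python
from datetime import datetime, timezone

data_lake = []  # stands in for object storage such as S3 or ADLS

def ingest(payload, origin, fmt):
    """Land data in its native format, attaching metadata tags rather than remodeling it."""
    entry = {
        "payload": payload,  # stored as-is: structured, semi-structured, or unstructured
        "metadata": {
            "origin": origin,
            "format": fmt,
            "load_time": datetime.now(timezone.utc).isoformat(),
        },
    }
    data_lake.append(entry)
    return entry

# Structured transaction data and an unstructured machine log coexist in one store.
ingest({"order_id": 1, "total": 99.0}, origin="erp", fmt="json")
ingest("2021-06-01T12:00:00 sensor=42 temp=71.3", origin="machine_logs", fmt="text")
```

Because no modeling happens at load time, adding a new source is cheap; the metadata tags are what later keep the lake navigable rather than a swamp.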
Data lakes also enable the use of specialized data stores, such as graph and time-series databases, for individual use cases. Graph has gained new life in the past few years, as it enables analysts to more closely understand relationships and patterns within data by storing information as nodes and edges rather than in the traditional table format. This allows for more robust network analysis and understanding of connections between elements. Time-series databases are optimized for storing historical trend data for forecasting. Facebook has built time-series databases to monitor its system performance; with systems and services writing thousands of entries to a single storage engine, this allows for real-time monitoring to ensure no service goes offline. Organizations that can quickly plug and play these new structures will further realize the value of their data.
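To make the nodes-and-edges idea concrete, here is a minimal property-graph sketch in plain Python (the node names and relationship labels are invented for illustration; a real deployment would use a graph database such as Neo4j):

```python
from collections import defaultdict

# Nodes carry attributes; edges link node ids with a relationship label.
nodes = {
    "alice": {"type": "customer"},
    "bob": {"type": "customer"},
    "acme": {"type": "retailer"},
}
edges = defaultdict(list)

def add_edge(src, dst, relationship):
    """Record an undirected relationship between two nodes."""
    edges[src].append((dst, relationship))
    edges[dst].append((src, relationship))

add_edge("alice", "acme", "shops_at")
add_edge("bob", "acme", "shops_at")

# A simple network question: which customers are connected via this retailer?
customers_of_acme = [n for n, rel in edges["acme"] if nodes[n]["type"] == "customer"]
```

Questions like "who is connected to whom, and through what" fall out of a single traversal here, whereas in a table format they would require repeated self-joins.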
There has also been an advent of structures known as data lakehouses. These allow data to be stored in an unstructured format while supporting structured queries through Spark or similar technologies, enabling reporting directly on that data. This takes advantage of the separation of storage and compute, combining the scalability of cloud storage with the performance of structured data.
Referenced in a recent Gartner Trends Report, data fabrics are a new term for a technology that has been emerging in this space for the past few years. They create a novel way to look at your data throughout the organization as an interconnected system, linking disparate sources and providing reusable standard integrations that can pull information together in real-time. These integrations can also enforce data quality standards, enabling you to standardize data while querying across systems. Similar to the systems above, the goal is to create a unified data environment, but data fabrics accomplish this more quickly by removing the need to move data to a new system.
Data fabrics (or data hubs) achieve data virtualization by removing the barriers between separate data stores, without requiring a central system to copy the data into. From the user perspective, this allows querying across multiple tables and systems, sometimes without even knowing it. With these combined data sources, hierarchies or ontologies can be built to treat each data element as a singular entity across systems.
For example, if you are a packaged goods company that sells to retailers via distributors, you may be selling to the same retailer through different distributors. You may be able to acquire data on that retailer from each distributor, from third-party sources, and from the retailer's own sales and promotional data. With a unified data fabric and hierarchy, you can then build a full picture of this retailer across these different sources to more fully understand their needs and performance.
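The retailer example above boils down to entity resolution plus aggregation. The sketch below uses invented feed layouts and a deliberately naive name-normalization rule (real hierarchies need far more robust matching) to show how records about one retailer collapse into a single entity:

```python
# Hypothetical feeds: two distributors and the retailer itself report on
# the same retailer under slightly different identifiers.
distributor_a = [{"retailer": "Acme Stores", "units_sold": 500}]
distributor_b = [{"retailer": "ACME STORES", "units_sold": 300}]
retailer_direct = [{"retailer": "Acme Stores", "promo_spend": 2000}]

def canonical_key(name):
    """Naive entity resolution: normalize names to one shared hierarchy key."""
    return name.strip().lower()

unified = {}
for feed in (distributor_a, distributor_b, retailer_direct):
    for record in feed:
        key = canonical_key(record["retailer"])
        entity = unified.setdefault(key, {"units_sold": 0, "promo_spend": 0})
        entity["units_sold"] += record.get("units_sold", 0)
        entity["promo_spend"] += record.get("promo_spend", 0)
```

After the loop, both distributors' volumes and the retailer's own promotional spend roll up under one key, which is the "full picture" the hierarchy provides.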
Enterprise Data Platforms
While sometimes referred to as separate systems, enterprise data platforms combine the above methods for storing and accessing data with a unified security and governance platform. This provides a one-stop shop for every analyst, user, and data scientist who needs information, all within a single system where data stewards and owners can also manage the associated metadata and data definitions. These platforms are further strengthened by the support model they usually include, in which additional resources can be rapidly spun up or down on demand, providing further flexibility to the organization.
Investments in data are critical for companies to stay relevant in 2021, allowing businesses to grow, automate, and better understand their outcomes. While the tasks outlined here may seem daunting, leveraging the data strategy elements to prioritize systems and teams can turn them into manageable projects that can be accomplished quickly and generate rapid returns on the investment. Performing analytical pilots alongside these larger data efforts can help demonstrate early wins, but realizing the full value of those analytical investments at scale requires a modern data architecture.