An introduction to a more flexible, contemporary approach to data movement that combines Agile's rapid iterations with DevOps' process automation. 

“The best time to plant a tree was 20 years ago. The second best time is now.” – Anonymous

Modernization will be a core focus in the data and analytics space over the next year, with time to value and speed to market as the primary gauges of success. This transformation has arguably always been happening gradually, within different scopes and across various technological eras. Despite changing technological landscapes, the core principles have persisted from the beginning: more data leads to more questions, and accuracy and integrity are key. The process of managing, moving, integrating, and analyzing data relies heavily on contracts between those providing the data and those consuming it. As modernization accelerates, many technologists and experts are rethinking how to handle data more flexibly.

What has changed, of course, is the technology. With the introduction of big data and cloud computing, data analytics looks more like software engineering than ever before. This advancement is driving a new set of expectations that data platforms must meet: renewed organizational pressure to accelerate value capture, customer demand for more streamlined experiences, and the user expectation that ‘it just needs to work.’ This demand is overwhelming the traditional IT department and driving users embedded in the business to seek out one-off analytics tools.

Introduction to DataOps

The modern data platform is greater than the sum of its parts, which include streaming data capture, big data engineering, search functionality, catalogues, machine learning technology, and more. It can be difficult to accurately capture the extensive and evolving features of a modern data platform in a word, leading to the introduction of the term ‘DataOps.’

DataOps replaces the traditional linear, contract-based way of operating with a modular, loosely coupled approach that accelerates the delivery of data products to users. At Apex, DataOps has been known for a few years now as ‘continuous analytics’: the continual monitoring of data and processes throughout an environment. The concept is the practical confluence of Agile, DevOps, and data analytics: Agile’s collaborative, rapid iterations of value demonstration combine with DevOps’ automation of processes to move data from source to target with far more flexibility than traditional approaches to data movement allow.

The practice of DataOps strives to address a changing technical landscape through technology and process. It injects transparency and reliability into both data-curation and value-creation activities by breaking the traditional, linear data pipeline into components that are transparent and can be validated independently. The true value of DataOps is realized when each stakeholder along the data supply chain thinks about how best to serve their ‘customer.’ To practice DataOps is to manage the development lifecycle in a way that prioritizes getting complete and accurate data into users’ hands as quickly as possible. Doing this doesn’t mean relaxing standards for quality assurance, change control, and governance; rather, it means programmatically instilling them into the data delivery process itself.
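
To make this concrete, below is a minimal sketch in Python of a pipeline broken into discrete, validated components; every function name, field, and check is a hypothetical placeholder. The point is that the contract between producer and consumer is enforced in code rather than by convention.

```python
# A minimal sketch of a modular, validated pipeline; all names,
# fields, and checks here are hypothetical placeholders.

def extract_orders() -> list[dict]:
    # In a real pipeline this would read from a source system.
    return [
        {"order_id": 1, "amount": 120.50},
        {"order_id": 2, "amount": 89.99},
    ]

def validate_orders(rows: list[dict]) -> list[dict]:
    # The 'contract': every row carries the fields the downstream
    # consumer depends on, with sane values.
    for row in rows:
        assert "order_id" in row and "amount" in row, "missing required field"
        assert row["amount"] >= 0, "amount must be non-negative"
    return rows

def load_orders(rows: list[dict]) -> None:
    # In a real pipeline this would write to the target platform.
    print(f"loaded {len(rows)} validated rows")

if __name__ == "__main__":
    # Each stage hands off data only after the contract is enforced.
    load_orders(validate_orders(extract_orders()))
```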

Getting Started

There are three major considerations when envisioning how DataOps will transform your organization: mindset, skillset, and toolset. ‘Mindset’ refers to the strategy around the way IT teams work with their business counterparts to realize DataOps, and how silos in teams and processes may hinder innovation. Essentially, consider the question: is the mindset of our organization conducive to a new way of delivering data? ‘Skillset’ refers to the readiness of people and teams to adopt new technology and ways of working. Shifts in technology and process require intentional change management, and because modern data technologies can be a paradigm shift from the drag-and-drop ETL solutions of the past, upskilling and continual education should be a priority for every data and IT leader. We will explore the first two considerations in more detail in subsequent articles in this series.

In this piece, we will focus on the ‘toolset’ aspect. Selecting tools that facilitate DataOps is key to successful modernization. We recommend modular, code-based, version-controlled applications for orchestrating data throughout the environment. Tools like Databricks, Kafka, Airflow, AWS Glue, Kubernetes, and dbt (data build tool) are all good options whose features naturally facilitate DataOps and support more productive development practices and processes. While some organizations still have valid reasons for not adopting cloud-native platforms, Apex believes the benefits of DataOps can still be realized in a hybrid or on-prem architecture, though the effort will certainly be more complex.
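
As an illustration of what ‘modular, code-based, and version-controlled’ looks like in practice, here is a minimal Airflow DAG sketch (assuming Airflow 2.4+; the DAG name, tasks, and logic are hypothetical placeholders). Each pipeline stage is a discrete, independently testable task defined in code that lives in your repository.

```python
# A minimal Airflow DAG sketch (Airflow 2.4+); the dag_id, task
# names, and logic are hypothetical placeholders for your pipeline.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def transform():
    print("apply business logic and data quality checks")

def load():
    print("publish curated data to the target platform")

with DAG(
    dag_id="orders_elt",              # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each stage is a discrete, independently testable task.
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```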

To implement DataOps tooling, we recommend looking at three key areas: version control, continuous integration and continuous deployment (CI/CD), and orchestration. First, move your data integration (i.e., ETL/ELT) code to a version control platform like Azure DevOps or GitHub. By shifting development to a centralized repository, your development teams will be better positioned to implement Agile, DataOps-centric delivery. Not only can ETL/ELT code be version controlled, but so can data definition language (DDL). Open-source tooling like Flyway (or Snowflake’s adaptation of it, schemachange) can be incorporated into version control so that changes to the database structures themselves are better governed.
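
As a sketch of how DDL governance can be reinforced in the repository itself, the hypothetical pre-commit check below enforces a Flyway/schemachange-style versioned naming convention (e.g., V1.1.1__add_orders_table.sql) on migration scripts. The directory layout and pattern are illustrative assumptions, not a prescribed standard.

```python
# A hypothetical pre-commit check that enforces a Flyway/schemachange-
# style versioned filename (e.g. V1.1.1__add_orders_table.sql) on DDL
# migration scripts before they are committed.
import re
import sys
from pathlib import Path

MIGRATIONS_DIR = Path("migrations")  # hypothetical repo layout
PATTERN = re.compile(r"^V\d+(\.\d+)*__[a-z0-9_]+\.sql$")

def main() -> int:
    bad = [
        p.name
        for p in MIGRATIONS_DIR.glob("*.sql")
        if not PATTERN.match(p.name)
    ]
    for name in bad:
        print(f"bad migration filename: {name}")
    return 1 if bad else 0

if __name__ == "__main__":
    sys.exit(main())
```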

Once your code is version controlled, CI/CD becomes a natural next step, allowing you to realize the strategy of testable, automated data code. Platforms like GitHub have CI/CD features (GitHub Actions) natively integrated, which makes it straightforward to embed validation directly into the deployment process. This dramatically increases the productivity of the development and QA teams tasked with ensuring data products reach production reliably.
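
For example, a CI pipeline might run a suite of automated tests like the hypothetical pytest check sketched below on every pull request, blocking deployment when a data transformation violates its expectations; the function and rules shown are illustrative assumptions.

```python
# A hypothetical pytest check that a CI pipeline (e.g. GitHub Actions)
# could run on every pull request before data code is deployed.

def normalize_amounts(rows):
    # Hypothetical transformation under test: round amounts to cents
    # and drop rows with no amount at all.
    return [
        {**row, "amount": round(row["amount"], 2)}
        for row in rows
        if row.get("amount") is not None
    ]

def test_amounts_are_rounded_and_complete():
    rows = [{"id": 1, "amount": 10.005}, {"id": 2, "amount": None}]
    result = normalize_amounts(rows)
    assert len(result) == 1                         # null amounts dropped
    assert result[0]["amount"] == round(10.005, 2)  # rounded to cents
```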

Finally, consider orchestration: the automation of the infrastructure and services that support your data products. Tactically, tools like Kubernetes, Docker, and Terraform are key candidates to adopt. They provide a framework for deploying validated data products at scale to meet the demands of your customers. Data and analytics workloads are typically ‘bursty,’ meaning they experience transient periods of high utilization, so orchestration tools are invaluable for meeting that increased demand without degrading the user experience. When activity subsides, the same tools seamlessly scale back down, saving organizations substantial money over the long term.
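
To illustrate the mechanics, here is a minimal sketch using the official Kubernetes Python client to scale a deployment out for a burst and back down afterward. The deployment name and namespace are hypothetical, and in practice a HorizontalPodAutoscaler would usually handle this automatically.

```python
# A minimal sketch of programmatic scale-up/scale-down using the
# official Kubernetes Python client. Deployment name and namespace
# are hypothetical; production setups would typically rely on a
# HorizontalPodAutoscaler rather than scaling by hand.
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    config.load_kube_config()  # reads local kubeconfig credentials
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

if __name__ == "__main__":
    # Scale out for a bursty analytics workload, then back down.
    scale_deployment("analytics-worker", "data-platform", replicas=8)
    scale_deployment("analytics-worker", "data-platform", replicas=2)
```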

Summary

Regardless of where your organization may be on its modernization journey, it is never too late to adopt DataOps. While a new toolset can seem daunting, even the smallest steps can deliver immediate benefits while setting the stage for a data platform team that rivals the most advanced Silicon Valley start-ups. Envision a future state where your data and platform teams deliver more quickly, delight users, and are empowered as true value-drivers within the organization. Then adopt a toolset that helps you realize that vision.