From Data Pipeline Purgatory to Production Heaven

Get rid of manual workflow deployments using Infrastructure as Code

Anna Geller
Dev Genius

--

You know the story. You write your data pipeline locally. You configure dependencies, and you validate that everything works as expected. If it does, it’s time to submit a pull request and ship it. Right?

Well, that’s easier said than done. Moving to production without affecting existing applications and data can be challenging. Managing environments, access to database schemas, permissions to cloud resources: the list goes on.

This post shows how combining open-source tools such as Terraform and Kestra makes moving from development to production easier.

Clear Separation Between Development and Production Environments Is Non-Negotiable

Maintaining a clear separation between development and production workloads is essential. Most teams leverage dedicated servers, databases, workspaces, or clusters to separate development and production environments.

Data pipelines are no exception. Your development instance can be open to contributions from engineers and domain experts within your organization, allowing rapid iteration without the overhead of CI/CD.

However, changes to production get deployed only after a peer review using infrastructure management and build tools like Terraform or GitHub Actions. In this post, we’ll focus on:

  • Terraform — an infrastructure-as-code (IaC) tool that declaratively manages environments and the underlying cloud infrastructure
  • Kestra — a simple, event-driven orchestrator that helps to implement workflows as code. Rather than making it an engineering-only effort, Kestra’s UI opens the process of building data workflows to more people within your organization, including domain experts not proficient in programming.

Both Terraform and Kestra leverage declarative configuration and modular components to automate provisioning and managing changes to your resources. You can maintain all Kestra resources via Terraform, and you can also reference Terraform variables in your Kestra workflows. When used together, these tools provide a reliable foundation for managing data operations.
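To make the second point concrete, here is a minimal sketch of referencing a Terraform variable from a Kestra flow definition. It assumes the Kestra Terraform provider’s `kestra_flow` resource; the variable name, namespace, and file path are illustrative:

```hcl
# Illustrative only: the bucket variable and flow file are placeholders.
variable "bucket" {
  type    = string
  default = "my-data-lake"
}

resource "kestra_flow" "ingest" {
  namespace = "company.team"
  flow_id   = "ingest"

  # templatefile() renders the flow YAML, substituting ${bucket}
  # with the Terraform variable's value before deployment.
  content = templatefile("flows/ingest.yml.tftpl", {
    bucket = var.bucket
  })
}
```

This keeps environment-specific values (bucket names, connection URLs) in Terraform, while the workflow logic stays in the flow’s YAML definition.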

Dedicated Development Environment for Business Stakeholders Improves Collaboration

Who should write ETL? Some argue that engineers shouldn’t write ETL or that business users should develop their own data pipelines. As a result, many companies opt for no-code solutions.

On the other end of the spectrum, engineers can manage the entire process in the form of custom scripts and cron jobs. However, this approach can leave both engineers and domain experts frustrated due to their dependency on each other and a lot of back-and-forth communication.

Data workflow tooling thus covers a wide spectrum, ranging from no-code to code-only.

Kestra aims to strike a balance between these two extremes. While you can build everything as code, you can also build everything exclusively from the UI. The YAML definition gets adjusted any time you make changes to a workflow from the UI or via an API call. This way, the orchestration logic is stored declaratively in code, even if some workflow components are modified in other ways.
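For readers unfamiliar with Kestra’s YAML format, a minimal flow looks roughly like this (the id, namespace, and task are placeholders; the log task type is from Kestra’s core tasks):

```yaml
# Minimal illustrative Kestra flow; edits made in the UI
# are reflected back into this same YAML definition.
id: hello-world
namespace: company.team

tasks:
  - id: log-message
    type: io.kestra.core.tasks.log.Log
    message: Hello from Kestra!
```

Whether this definition is typed into the embedded editor, assembled from the UI, or pushed via the API, the stored artifact is the same declarative YAML.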

Engineers can build custom plugins and blueprints that domain experts can use to create their own workflows from a simple UI form or code editor within a dedicated development environment. The UI has an embedded code editor, an integrated Version Control System, comprehensive plugin documentation, a DAG dependency view, and workflow blueprints — all that to make building workflows easy and enjoyable. Business users can reap the benefits such as code versioning, auto-completion, and syntax validation without having to know how to set up an IDE or work with Git.

Production Environments Should Be Exclusively Managed With Infrastructure as Code

In a production environment, you’ll typically want read-only access to prevent unreviewed or unauthorized changes. Modifications should only be made after a peer review, such as through a pull request or by approving a Terraform Cloud run. Once approved, changes are deployed to the production instance using Terraform.

Deploying production resources only from Terraform ensures stable, maintainable, and reproducible environments.

Terraform CI/CD for Data Workflows

The code snippet below will deploy all flows stored in the flows directory using just 20 lines of Terraform configuration.
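The original snippet is not reproduced in this text version; a configuration along these lines achieves the same result, assuming the `kestra-io/kestra` provider and a locally reachable Kestra instance (the URL is illustrative):

```hcl
terraform {
  required_providers {
    kestra = {
      source = "kestra-io/kestra"
    }
  }
}

provider "kestra" {
  url = "http://localhost:8080" # adjust to your Kestra instance
}

# Deploy every YAML file in the flows directory as a Kestra flow.
resource "kestra_flow" "flows" {
  for_each  = fileset(path.module, "flows/*.yml")
  flow_id   = yamldecode(templatefile(each.value, {}))["id"]
  namespace = yamldecode(templatefile(each.value, {}))["namespace"]
  content   = templatefile(each.value, {})
}
```

The `for_each` over `fileset` means adding a new flow is just adding a file to the `flows` directory; `yamldecode` extracts the flow’s `id` and `namespace` from its own definition, so nothing needs to be duplicated in Terraform.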

To use that code in a deployment pipeline, you can either leverage Terraform Cloud (the easiest and recommended) or use a remote backend (such as S3 or GCS) along with the setup-terraform GitHub Action. The video below gives a hands-on demo of how this works in practice.

Managing environments and CI/CD with Terraform and Kestra — hands-on demo
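For the GitHub Actions route, a deployment pipeline might be sketched as follows, using the `hashicorp/setup-terraform` action; the branch, secret names, and trigger are illustrative:

```yaml
# Illustrative CI/CD workflow: deploy flows on every push to main.
name: deploy-flows
on:
  push:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform init
      - run: terraform apply -auto-approve
        env:
          KESTRA_URL: ${{ secrets.KESTRA_URL }}
```

With a remote backend (such as S3 or GCS) configured in the `terraform` block, state is shared across runs and the pipeline stays the single path to production.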

Next Steps

Combining Terraform and Kestra provides a reliable foundation for managing workflows and infrastructure as code. This setup helps to balance fast feedback loops during development and the benefits of infrastructure-as-code for production environments. If you encounter any challenges while reproducing this demo, you can open a GitHub issue or ask via Slack.
