Why Kubernetes Has Become So Popular in Data Engineering
Written by Anna Geller
Containers have become the de facto standard for moving data projects to production. No more dependency management nightmares: projects developed on a local machine can be “shipped” to a staging and production cluster, typically with no surprises. Data pipelines and ML models are finally reproducible and can run anywhere in the same fashion. However, with an ever-growing number of containerized data workloads, orchestration platforms are becoming increasingly important.
1. Orchestrating containers
If you want to run reproducible data pipelines and ML models that can run anywhere, you probably know that a Docker image is the way to go. Everybody loves container images, from software engineers to DevOps and SREs. In the past, the challenge was to run containers at scale. With Kubernetes, especially when deployed to elastically scalable cloud services, managing the execution of containers at scale has become considerably less painful.
2. Communicating between teams
As much as many of us love building end-to-end data products, in reality, many enterprises split the responsibilities between who builds a data product and who runs it. Data scientists and engineers may build fantastic data pipelines, but in the end, they often need to hand them over to DevOps and SREs, who deploy and monitor them over time. Container images make this handover process much easier.
At the same time, containers facilitate sharing code and collaboration between data engineers, data scientists, and analysts who can all work in the same reproducible environment.
3. Declarative definition
Data engineering workloads these days are dynamic in nature. Imagine that you deployed your data pipeline to a Kubernetes cluster and it’s now failing due to an out-of-memory error. Fixing the issue may be a matter of increasing the memory size in your declarative deployment file and applying the changes to the cluster.
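To make the out-of-memory scenario concrete, here is a minimal sketch of that kind of fix. It assumes the pipeline runs as a Deployment; the manifest is written as a Python dict, and the names `etl-pipeline`, `etl`, the image reference, and the helper `with_memory` are all hypothetical.

```python
import copy

# Hypothetical Deployment manifest, expressed as a Python dict for illustration.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "etl-pipeline"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "etl-pipeline"}},
        "template": {
            "metadata": {"labels": {"app": "etl-pipeline"}},
            "spec": {
                "containers": [
                    {
                        "name": "etl",
                        "image": "example.com/etl-pipeline:1.0",  # placeholder image
                        "resources": {
                            "requests": {"memory": "512Mi"},
                            "limits": {"memory": "512Mi"},
                        },
                    }
                ]
            },
        },
    },
}


def with_memory(manifest, container_name, memory):
    """Return a copy of the manifest with a new memory request and limit."""
    updated = copy.deepcopy(manifest)
    containers = updated["spec"]["template"]["spec"]["containers"]
    for container in containers:
        if container["name"] == container_name:
            resources = container.setdefault("resources", {})
            resources.setdefault("requests", {})["memory"] = memory
            resources.setdefault("limits", {})["memory"] = memory
            return updated
    raise ValueError(f"no container named {container_name!r}")


# Raise the memory of the failing container; the original dict is left untouched.
fixed = with_memory(deployment, "etl", "2Gi")
```

The updated dict could then be written out with `json.dumps` and applied with `kubectl apply -f`, since kubectl accepts JSON as well as YAML manifests. In practice, many teams would simply edit the YAML file by hand and commit the change.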
A nice “side effect” of workflow and configuration as code is that everything is documented: not only the business logic but also where things are running and how many resources they consume. When other engineers need to work on your code, no guessing or back-and-forth communication is needed to get them up to speed with your data workloads.
Another benefit of declarative workflow and environment definition is that everything can be version-controlled. If something goes wrong, you can revert to the previous version. You can track changes made to your environment over time and provide an audit log for compliance. GitOps and MLOps made this approach popular, but containers and orchestration platforms are what made it possible.
6. Maintaining the health of your execution layer
One of the most prominent benefits of Kubernetes is that it will always attempt to maintain the desired state, and restart or recreate resources once they die. Of course, the self-healing “powers” of Kubernetes won’t fix all your problems (such as broken business logic), but at least you don’t need to intervene when an error occurs due to a temporary network issue that can get resolved with a restart or redeployment to a new pod.
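The idea of maintaining a desired state can be illustrated with a toy control loop. This is a simplified sketch, not how the Kubernetes controllers are actually implemented; the function `reconcile` and the pod names are made up for illustration.

```python
import itertools


def reconcile(desired_replicas, running_pods, new_pod_name):
    """One pass of a toy control loop: return the pods that should exist."""
    pods = list(running_pods)
    # Too few pods (for example, after a crash): create replacements.
    while len(pods) < desired_replicas:
        pods.append(new_pod_name())
    # Too many pods: drop the surplus.
    return pods[:desired_replicas]


counter = itertools.count(1)


def new_pod_name():
    """Generate a fresh, hypothetical pod name."""
    return f"worker-{next(counter)}"


pods = reconcile(3, [], new_pod_name)    # initial rollout: three pods are created
pods.remove("worker-2")                  # simulate one pod dying
pods = reconcile(3, pods, new_pod_name)  # the loop restores the desired count
print(pods)  # ['worker-1', 'worker-3', 'worker-4']
```

Note that the failed pod is not revived; it is replaced by a new one, which is why a transient failure can often be resolved without manual intervention.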
7. Seamlessly scale your execution platform as your data grows
Running your data pipelines on a single server probably won’t cut it for you these days. With growing amounts of data, it becomes difficult to manage data processing, regardless of whether we package our code as a container image or run it in a single local process.
If you need to scale your workloads across multiple nodes, Kubernetes (especially in combination with Helm charts) makes it easier to install Dask or Spark on a compute cluster and thus distribute data processing across multiple nodes. Most cloud providers offer autoscaling services or even provide a completely serverless data plane (for example, AWS EKS on Fargate and GCP GKE Autopilot). Those cloud vendors take care of scaling out worker nodes when needed, which largely removes the need to guess the required capacity up front.
Despite all the goodness, Kubernetes won’t give you immediate visibility into what happens in your data workloads, but it will make it easier to install a distributed compute engine and seamlessly scale it out to multiple nodes.
8. Development and production environment parity
A common challenge when building data products is that a development environment is often vastly different from how things are supposed to run in production. Containerized workloads make the move from development to production much easier. You may have different clusters or namespaces for staging and prod environments, and switching between them should be seamless.
9. Iterating faster
Iterations are crucial for building data workloads. Most data products are bad at first. You may need several cycles of testing various classifiers, tuning, enriching models with new data, testing various transformations and hyperparameters. With Kubernetes deployments, you can implement A/B testing or run multiple instances of the same ML training job but with different hyperparameters.
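As a rough sketch of the last point, the snippet below generates one Kubernetes Job manifest per hyperparameter combination. The image name, the Job names, the command-line flags, and the helper `training_jobs` are all hypothetical; it assumes the training container accepts its hyperparameters as `--name=value` arguments.

```python
import itertools


def training_jobs(image, grid):
    """Build one Kubernetes Job manifest per hyperparameter combination."""
    names = sorted(grid)
    combinations = itertools.product(*(grid[name] for name in names))
    jobs = []
    for index, values in enumerate(combinations):
        # Pass each hyperparameter to the container as a command-line argument.
        args = [f"--{name}={value}" for name, value in zip(names, values)]
        jobs.append({
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": f"train-{index}"},
            "spec": {
                "backoffLimit": 0,  # do not retry a failed training run
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {"name": "train", "image": image, "args": args}
                        ],
                    }
                },
            },
        })
    return jobs


# Two learning rates times three depths yields six independent Jobs.
grid = {"learning-rate": [0.01, 0.1], "max-depth": [3, 5, 7]}
jobs = training_jobs("example.com/train:1.0", grid)
```

Each manifest describes an independent Job, so the cluster can schedule the runs in parallel wherever capacity is available.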
The benefit becomes most apparent when you combine Kubernetes with tools that abstract away low-level details. For instance:
- to build and scale out your data flows, you can leverage Prefect with Dask on Kubernetes,
- to serve ML models or to perform A/B testing, you could leverage Seldon,
- to build visualizations, insights, metrics, and KPIs, you can use GoodData.CN.
10. Installing almost any software with Helm & further features
Kubernetes has become so universally popular that plenty of tools have been built or redesigned to work on K8s clusters. In the list above we just scratched the surface. With Kubernetes, you can leverage the same container orchestration platform to execute ETL jobs, train and serve ML models, and even build visualizations with tools that work on Kubernetes.
The common theme
So what’s the common theme of all the above points? It’s the fact that containers (and container orchestration platforms) allow us to:
- build reproducible code that can run the same way anywhere at any scale,
- reduce friction in the handoff between different teams,
- apply modern software engineering practices to data workloads, including GitOps, DevOps, and MLOps.
If you want to follow this paradigm for your data visualizations, GoodData has recently launched its cloud-native platform GoodData.CN that you can install on your Kubernetes cluster. You can start by running a single-image Docker container for local development of dashboards, metrics, and KPIs (in a DRY way!):
docker run --name gooddata -p 3000:3000 -p 5432:5432 \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES gooddata/gooddata-cn-ce:latest
Once you have developed your dashboards, metrics, and insights locally, you can export your declarative definitions and deploy them to a Kubernetes cluster. The GoodData team provides detailed documentation showing how you can install GoodData for various scenarios:
- AWS EKS with RDS Postgres database and ElastiCache for Redis,
- GCP GKE with Cloud SQL for Postgres and MemoryStore for Redis,
- Azure AKS with Azure Database for Postgres and Azure Cache for Redis,
- On-premises deployment,
- …and many helpful tips on how to manage organizations, workspaces, and data sources.
In this article, we looked at the features of Kubernetes (and container orchestration platforms in general) that made containers so universally popular among data teams. The sheer number of tools integrating with K8s and its presence on every major cloud platform make it a comprehensive execution layer for reproducible and scalable data workloads.
Thank you for reading!
Cover image by Lukas Hartmann from Pexels