Dagster

Dagster is an open source tool for running scripts that take data from one place, do some stuff to it, and then put the results somewhere else. People usually call this process ETL (extract, transform, load).

Dagster has four main components:

  • A web UI for inspecting runs and kicking off new ones
  • A database for storing logs and metadata about each run
  • A Code Location that holds the code for all the custom ETL scripts
  • Some kind of scheduler to actually run those ETL scripts

in Kubernetes

Kubernetes runs applications that are split into several containers. For example, Kubernetes might run a web app that has a frontend in one container, a backend in a second container, and a database in a third container. Kubernetes helps organize the compute, storage, and networking for those containers.

Dagster can run in a few different places, including locally, which is one of its selling points. I can develop pipelines locally and then deploy them to a Kubernetes cluster when it’s time to run them in a staging or production environment.

When Dagster is deployed on Kubernetes, each of the four components I mentioned earlier is one container (technically one “pod”). Whenever Dagster runs an ETL script, it usually creates a new pod just for that script. I’m simplifying, but that’s the basic idea. That’s really handy, because then I can use Kubernetes to manage those runs instead of trying to juggle them all myself.
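That per-run-pod behavior comes from Dagster's run launcher, which is configured in the Helm chart's values. As a sketch, here's roughly what the relevant fragment looks like; the key names are my reading of the Dagster Helm chart's defaults, so double-check them against your chart version:

```shell
# Sketch: write a values.yaml fragment selecting the launcher that starts
# one Kubernetes pod per run. Key names are assumptions based on the
# Dagster Helm chart's defaults -- verify against your chart version.
cat > run-launcher-values.yaml <<'EOF'
runLauncher:
  type: K8sRunLauncher
EOF

# Quick sanity check that the fragment was written.
grep -q 'K8sRunLauncher' run-launcher-values.yaml && echo 'fragment written'
```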

Here’s what it looks like when I develop ETL pipelines for Dagster:

  1. Write a pipeline on my local machine
  2. Debug the pipeline locally to make sure it’s doing what I want
  3. Push my changes to my version control system
  4. CI/CD deploys my new pipeline to my Kubernetes cluster

in Docker

That’s all well and good, but sometimes I have to make changes to Dagster’s Kubernetes deployment itself. For example, I might want to change the password of Dagster’s database. That means editing Dagster’s values.yaml and checking whether that change actually changed the password.
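For concreteness, here's roughly what that kind of change looks like. The key path is my recollection of the Dagster chart's PostgreSQL subchart layout, so treat the exact names (and the password itself, obviously) as placeholders:

```shell
# Hypothetical values override for Dagster's database password. The key
# path follows the Dagster Helm chart's PostgreSQL subchart as I remember
# it -- check your chart's values.yaml before trusting it.
cat > db-values.yaml <<'EOF'
postgresql:
  postgresqlPassword: s3cr3t-example   # hypothetical value, not a real secret
EOF

# Rolling it out would look like this (not run here, since it needs a cluster):
#   helm upgrade --install dagster dagster/dagster -f db-values.yaml

grep -q 'postgresqlPassword' db-values.yaml && echo 'override written'
```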

Now, I could push that change to version control, wait for CI/CD to deploy it to my cluster, and then check things out there.

Or.

I could run a Kubernetes cluster locally in Docker using kind. Then I can look at things locally without having to wait for CI/CD.

Docker runs single containers. It also does some orchestration, but it mostly keeps to itself. Kubernetes, on the other hand, is designed to control multiple physical machines. But with kind (short for “Kubernetes in Docker”), each cluster “node” is itself a Docker container, so I can run a whole single-node Kubernetes cluster inside one Docker container for debugging.
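Standing up that debugging cluster is a couple of commands. Here's a sketch, written to a script file so it can be reviewed before running; the cluster name is mine, and it assumes kind and kubectl are installed:

```shell
# Sketch: stand up and inspect a throwaway kind cluster. Written to a
# file (and only syntax-checked here) because actually running it needs
# a Docker daemon.
cat > kind-up.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Each kind "node" is a Docker container; the default is one node.
kind create cluster --name dagster-dev

# kind registers a kubectl context named kind-<cluster name>.
kubectl cluster-info --context kind-dagster-dev
EOF

bash -n kind-up.sh && echo 'syntax ok'
```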

As an added benefit, it’s a little hilarious when I step back and look at it. I can’t help but imagine someone shaking their head saying, “Where did it all go wrong?”

dinkind

On a lark, I wanted to see whether I could actually get Dagster running inside a local Kubernetes cluster in Docker.

To deploy Dagster pipelines to Kubernetes, I have to build a Docker image that has my ETL pipelines. This is the “Code Location” I mentioned in the first section. Then Dagster pulls that image so it has the code for the pipelines. That ended up being the hardest part to work out locally, but it really wasn’t too bad. Docker has an image I can use to host my own local registry. There’s even some documentation on how to use that with kind.
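The registry piece really is small. This sketch follows the pattern from Docker's registry image docs and kind's local-registry guide; the port mapping and container name are the ones I remember those docs using, so treat them as assumptions:

```shell
# Sketch: run a local image registry and push a Code Location image to
# it. Only syntax-checked here; running it for real needs a Docker daemon.
cat > local-registry.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Docker's own registry image, published on localhost:5001.
docker run -d --restart=always -p "127.0.0.1:5001:5000" \
  --name kind-registry registry:2

# Build, tag, and push a hypothetical Code Location image.
docker build -t my-code-location:dev .
docker tag my-code-location:dev localhost:5001/my-code-location:dev
docker push localhost:5001/my-code-location:dev
EOF

bash -n local-registry.sh && echo 'syntax ok'
```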

So I made a little script to:

  1. Spin up a local Kubernetes cluster
  2. Spin up a local Docker image registry
  3. Build my Code Location image
  4. Put that Code Location image in my local registry
  5. Deploy Dagster to my local Kubernetes cluster, including my custom Code Location
  6. Set up a port forward so I can see the Dagster UI in my browser
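Stitched together, those steps look roughly like the following sketch. The chart repo URL, release names, and the web UI service name are from memory (older chart versions call the UI service dagster-dagit rather than dagster-webserver), so verify everything against dinkind itself:

```shell
# Sketch of the whole dinkind-style flow, written to a file and only
# syntax-checked -- running it for real needs Docker, kind, kubectl,
# and helm installed.
cat > dinkind-sketch.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# 1. Local Kubernetes cluster (one Docker container per node).
kind create cluster --name dinkind

# 2. Local image registry next to it. Wiring the registry into the
#    cluster also needs a containerdConfigPatches entry in the kind
#    config plus `docker network connect kind kind-registry` -- see
#    kind's local-registry guide for the exact incantation.
docker run -d --restart=always -p "127.0.0.1:5001:5000" \
  --name kind-registry registry:2

# 3 + 4. Build the Code Location image and push it to the registry.
docker build -t localhost:5001/code-location:dev .
docker push localhost:5001/code-location:dev

# 5. Deploy Dagster with Helm, pointing it at the Code Location image
#    through a values file (not shown here).
helm repo add dagster https://dagster-io.github.io/helm
helm upgrade --install dagster dagster/dagster -f values.yaml

# 6. Forward the web UI to the browser. The service name varies by
#    chart version (dagster-webserver vs. dagster-dagit).
kubectl port-forward svc/dagster-webserver 3000:80
EOF

bash -n dinkind-sketch.sh && echo 'syntax ok'
```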

Check out dinkind on my GitHub for that script plus a demo Dagster project to get this whole thing working.