Version: 0.9

Kubernetes Operator

Manage ingestion pipelines declaratively using the Marmot Operator.

Instead of running marmot ingest from CLI scripts or the UI, the operator lets you define pipelines as Kubernetes resources. The cluster handles scheduling, retries, and lifecycle for you. Pipeline configuration lives alongside your other manifests, so changes go through the same review and GitOps workflow as everything else.

The operator watches Run resources and reconciles them into Kubernetes Jobs or CronJobs. Each pipeline runs as a separate pod, which enables a more granular permissions model: you don't have to grant a single Marmot process access to all your assets. Running pipelines in separate pods can also improve performance when you ingest a large number of assets regularly.

Prerequisites

The operator is deployed alongside Marmot via the Helm chart. See the Helm / Kubernetes guide to install Marmot first.

Enabling the Operator

Enable the operator in your Helm values:

```yaml
operator:
  enabled: true
```

Then apply the change:

```shell
helm upgrade marmot marmotdata/marmot -f values.yaml
```

Creating a Run

A Run resource defines an ingestion pipeline. The spec.runs array uses the same format as the CLI configuration file.

```yaml
apiVersion: runs.marmotdata.io/v1alpha1
kind: Run
metadata:
  name: my-pipeline
spec:
  schedule: "0 */6 * * *" # every 6 hours
  runs:
    - postgresql:
        host: "db.example.com"
        port: 5432
        database: "production"
        user: "readonly"
```

The resource's metadata.name is used as the pipeline name for tracking ingestion state.

```shell
kubectl apply -f my-pipeline.yaml
```

Pod Labels and Annotations

Use podLabels and podAnnotations to integrate ingestion pods with service meshes, observability tools or policy engines.

On AWS, this is particularly useful for providing credentials to plugins via IAM Roles for Service Accounts (IRSA). Instead of storing AWS credentials in your pipeline config, annotate the pod so it automatically receives IAM permissions:

```yaml
spec:
  podAnnotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/marmot-s3-reader"
  podLabels:
    team: data-engineering
  runs:
    - s3:
        bucket: "my-data-lake"
        region: "eu-west-1"
```

Manual Triggers

Trigger a scheduled pipeline outside its cron window by annotating the Run:

```shell
kubectl annotate run my-pipeline runs.marmotdata.io/trigger=true
```

This creates a temporary Job that runs immediately and cleans up after 60 seconds.

Teardown on Delete

By default, deleting a Run resource runs marmot ingest --destroy, which removes from Marmot all assets that pipeline previously discovered. Set teardownOnDelete: false if you want to keep existing assets after removing the Run.
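For example, a Run that should leave its discovered assets in place when it is retired might look like this (a minimal sketch; the pipeline name and source details are illustrative):

```yaml
apiVersion: runs.marmotdata.io/v1alpha1
kind: Run
metadata:
  name: legacy-warehouse
spec:
  teardownOnDelete: false # keep previously ingested assets when this Run is deleted
  runs:
    - postgresql:
        host: "db.example.com"
        port: 5432
        database: "production"
        user: "readonly"
```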

Reference

Run Spec

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `runs` | array | required | Source configurations, same format as the CLI YAML |
| `schedule` | string | | Cron expression. When set, creates a CronJob instead of a Job |
| `suspend` | boolean | `false` | Pause scheduled executions. Only applies when `schedule` is set |
| `concurrencyPolicy` | `Allow` / `Forbid` / `Replace` | `Forbid` | How to handle concurrent Job executions |
| `backoffLimit` | int | `3` | Retries before marking a Job as failed |
| `activeDeadlineSeconds` | int | | Maximum duration (seconds) a Job may run |
| `successfulJobsHistoryLimit` | int | `3` | Successful CronJob runs to retain |
| `failedJobsHistoryLimit` | int | `1` | Failed CronJob runs to retain |
| `resources` | ResourceRequirements | | CPU/memory requests and limits for the ingestion container |
| `podLabels` | map | | Additional labels applied to the pod template |
| `podAnnotations` | map | | Additional annotations applied to the pod template |
| `teardownOnDelete` | boolean | `true` | Run `marmot ingest --destroy` when the Run is deleted |
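Putting several of these fields together, a scheduled Run with retry and resource limits could look like the sketch below (the pipeline name, schedule, and source details are illustrative, not defaults):

```yaml
apiVersion: runs.marmotdata.io/v1alpha1
kind: Run
metadata:
  name: nightly-warehouse
spec:
  schedule: "0 2 * * *"       # creates a CronJob running daily at 02:00
  concurrencyPolicy: Forbid   # skip a run if the previous one is still active
  backoffLimit: 3             # retry a failed Job up to three times
  activeDeadlineSeconds: 1800 # fail any Job that runs longer than 30 minutes
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 512Mi
  runs:
    - postgresql:
        host: "warehouse.example.com"
        port: 5432
        database: "analytics"
        user: "readonly"
```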

Operator Helm Values

| Key | Default | Description |
| --- | --- | --- |
| `operator.enabled` | `false` | Enable the operator Deployment and CRD |
| `operator.replicas` | `1` | Number of operator replicas |
| `operator.leaderElect` | `true` | Enable leader election for HA |
| `operator.watchNamespace` | `""` (all) | Restrict watching to a single namespace |
| `operator.marmot.url` | auto-detected | Marmot API URL passed to Job pods |
| `operator.resources` | 100m/128Mi requests, 500m/256Mi limits | Operator pod resource requests and limits |
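A values file combining these keys might look like the following (a sketch; the namespace and URL are placeholder assumptions for a non-default setup):

```yaml
operator:
  enabled: true
  replicas: 2
  leaderElect: true                 # required when running multiple replicas
  watchNamespace: "data-platform"   # "" (the default) watches all namespaces
  marmot:
    url: "http://marmot.data-platform.svc:8080" # override auto-detection
```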

Next Steps