GKE, Airflow, Cloud Composer, and Persistent Volumes

Google Cloud Composer is a managed[1] version of Airflow that lets you schedule Docker images using KubernetesPodOperators. This is nice, except that there's curiously little clear documentation (or Stack Overflow discussion) on how to schedule a pod, mount a volume, and use the volume, which makes it annoying if you want to share information across pods (i.e. pod A does some stuff and writes to a volume, then pod B gets scheduled and mounts the same volume used by pod A). I'll go through that process here.

1. Compute Engine

Create a disk in Google Compute Engine with the desired size. Note that you should not use a shared disk, as Kubernetes won't let you mount a volume that's already in use. You also cannot use the boot disk of a node pool.
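
If you'd rather do this from the command line than the console, something like gcloud compute disks create persist-pods-disk --size=500GB --zone=<your cluster's zone> should work; the disk name matches the example below, while the size and zone are just placeholders for your own values.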

2. Persistent Volume Claim

Create a persistent volume and a persistent volume claim backed by the disk (https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/preexisting-pd), and run kubectl apply -f existing-pd.yaml. Note that you need both the volume (to register the fixed disk with Kubernetes) and the volume claim (to let the pods use the volume).

You'll see the disk in the cloud console (Kubernetes Engine->Storage).

For example, if I had a GCE disk with the name persist-pods-disk, the config would look like:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: persist-pods-disk-volume
spec:
  storageClassName: ""
  capacity:
    storage: 500G
  accessModes:
    - ReadWriteMany
  gcePersistentDisk:
    pdName: persist-pods-disk
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persist-pods-disk-claim
spec:
  # It's necessary to specify "" as the storageClassName
  # so that the default storage class won't be used, see
  # https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  storageClassName: ""
  volumeName: persist-pods-disk-volume
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500G
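
After you run kubectl apply, the claim should bind to the volume; kubectl get pvc persist-pods-disk-claim should report its status as Bound before you try to mount it from a pod.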

3. The DAG

Update your dag with a volume and a volume mount (https://airflow.apache.org/kubernetes.html).

For example, if I wanted to mount the disk into the pods at location /files, using the configuration given above:

from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount
from airflow.contrib.operators import kubernetes_pod_operator

volume_mount = VolumeMount('persist-disk',
                           mount_path='/files',
                           sub_path=None,
                           read_only=False)

volume_config = {
    'persistentVolumeClaim': {
        'claimName': 'persist-pods-disk-claim'  # the claim name from the Kube yaml above
    }
}
volume = Volume(name='persist-disk', configs=volume_config)  # this name is the literal volume name used in the pod's yaml

# ... other stuff

# name, image, retries, arguments, and affinity are defined elsewhere in the DAG file
operator = kubernetes_pod_operator.KubernetesPodOperator(
    task_id=name + "_task",
    name=name,
    namespace='default',
    image=image,
    image_pull_policy='Always',
    retries=retries,
    arguments=arguments,
    affinity={
        'nodeAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': {
                'nodeSelectorTerms': [{
                    'matchExpressions': [{
                        'key': 'cloud.google.com/gke-nodepool',
                        'operator': 'In',
                        'values': [
                            affinity,
                        ]
                    }]
                }]
            }
        }
    },
    volumes=[volume],
    volume_mounts=[volume_mount],
)
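
To make the pod A / pod B handoff concrete, here's a minimal sketch of a DAG with two pods sharing the disk: the first writes a file under /files and the second reads it back. The DAG id, schedule, bash image, and file path are placeholders of my own; the volume name and claim name are the ones defined above.

from datetime import datetime

from airflow import DAG
from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount
from airflow.contrib.operators import kubernetes_pod_operator

volume_mount = VolumeMount('persist-disk',
                           mount_path='/files',
                           sub_path=None,
                           read_only=False)
volume = Volume(name='persist-disk',
                configs={'persistentVolumeClaim':
                         {'claimName': 'persist-pods-disk-claim'}})

with DAG('shared_disk_example',
         start_date=datetime(2018, 9, 1),
         schedule_interval=None) as dag:

    # Pod A: writes a file onto the shared disk.
    writer = kubernetes_pod_operator.KubernetesPodOperator(
        task_id='write_task',
        name='writer',
        namespace='default',
        image='bash',  # placeholder image
        cmds=['bash', '-c'],
        arguments=['echo "hello from pod A" > /files/handoff.txt'],
        volumes=[volume],
        volume_mounts=[volume_mount],
    )

    # Pod B: mounts the same claim and reads the file pod A wrote.
    reader = kubernetes_pod_operator.KubernetesPodOperator(
        task_id='read_task',
        name='reader',
        namespace='default',
        image='bash',  # placeholder image
        cmds=['cat'],
        arguments=['/files/handoff.txt'],
        volumes=[volume],
        volume_mounts=[volume_mount],
    )

    writer >> reader  # pod B only runs after pod A has written the file

Because the two tasks run sequentially, the disk only needs to be attached to one pod at a time.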

Footnotes:

[1] I say it's managed, but it actually just deploys a Kube cluster and still expects you to mess around with the Airflow stuff yourself.

Posted: 2018-09-30
Filed Under: GCP, computer