GKE, Airflow, Cloud Composer, and Persistent Volumes
Google Cloud Composer is a managed[1] version of Airflow that lets you schedule Docker images using KubernetesPodOperators. This is nice, except that there's curiously little clear documentation or Stack Overflow material on how to schedule a pod, mount a volume, and actually use that volume, which is annoying if you want to share information across pods (i.e. pod A does some stuff and writes to a volume, then pod B gets scheduled and mounts the same volume used by pod A). I'll go through that process here.
1. Compute Engine
Create a disk in Google Compute Engine with the desired size. Note that you should not use a shared disk, as Kubernetes won't let you mount a volume that's already in use. You also cannot use the boot disk of a node pool.
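For example, a disk like the one used below could be created with gcloud; the size and zone here are placeholders and should match your cluster:

gcloud compute disks create persist-pods-disk \
    --size=500GB \
    --zone=us-central1-a   # use the same zone as your GKE node pool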
2. Persistent Volume Claim
Create a persistent volume and a persistent volume claim backed by the disk (https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/preexisting-pd), and run kubectl apply -f existing-pd.yaml. Note that you will need both the volume (to register the disk as a fixed volume in Kube) and the volume claim (to allow the pods to use the volume).
You'll see the disk in the cloud console (Kubernetes Engine->Storage).
For example, if I had a GCE disk with the name persist-pods-disk, the config would look like:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: persist-pods-disk-volume
spec:
  storageClassName: ""
  capacity:
    storage: 500G
  accessModes:
    - ReadWriteMany
  gcePersistentDisk:
    pdName: persist-pods-disk
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persist-pods-disk-claim
spec:
  # It's necessary to specify "" as the storageClassName
  # so that the default storage class won't be used, see
  # https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  storageClassName: ""
  volumeName: persist-pods-disk-volume
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500G
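After the kubectl apply, you can sanity-check that the claim actually bound to the volume (a claim stuck in Pending usually means a volumeName or storageClassName mismatch):

kubectl get pv persist-pods-disk-volume
kubectl get pvc persist-pods-disk-claim   # STATUS should show Bound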
3. The DAG
Update your DAG with a volume and a volume mount (https://airflow.apache.org/kubernetes.html).
For example, if I wanted to mount the disk into the pods at location /files, using the configuration given above:

# Airflow 1.x contrib import paths
from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount
from airflow.contrib.operators import kubernetes_pod_operator

volume_mount = VolumeMount('persist-disk',
                           mount_path='/files',
                           sub_path=None,
                           read_only=False)

volume_config = {
    'persistentVolumeClaim': {
        'claimName': 'persist-pods-disk-claim'  # uses the PersistentVolumeClaim given in the Kube yaml
    }
}

# The name here is the literal name given to the volume in the pod's yaml.
volume = Volume(name='persist-disk', configs=volume_config)

# ... other stuff

operator = kubernetes_pod_operator.KubernetesPodOperator(
    task_id=name + "_task",
    name=name,
    namespace='default',
    image=image,
    image_pull_policy='Always',
    retries=retries,
    arguments=arguments,
    affinity={
        'nodeAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': {
                'nodeSelectorTerms': [{
                    'matchExpressions': [{
                        'key': 'cloud.google.com/gke-nodepool',
                        'operator': 'In',
                        'values': [
                            affinity,
                        ]
                    }]
                }]
            }
        }
    },
    volumes=[volume],
    volume_mounts=[volume_mount],
)
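To make the pod-A-writes, pod-B-reads handoff from the intro concrete, here's a minimal self-contained sketch; the DAG id, schedule, bash image, and file path are placeholders I've made up, but the volume plumbing is the same as above:

# Pod A writes a file to the shared disk; pod B mounts the same claim
# and reads it once A has finished.
from datetime import datetime

from airflow import DAG
from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount
from airflow.contrib.operators import kubernetes_pod_operator

mount = VolumeMount('persist-disk', mount_path='/files',
                    sub_path=None, read_only=False)
shared_volume = Volume(
    name='persist-disk',
    configs={'persistentVolumeClaim': {'claimName': 'persist-pods-disk-claim'}})

with DAG('shared_disk_example',
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:

    # Pod A: writes a file onto the shared disk.
    writer = kubernetes_pod_operator.KubernetesPodOperator(
        task_id='write_file',
        name='write-file',
        namespace='default',
        image='bash:5',  # placeholder image
        cmds=['bash', '-c'],
        arguments=['echo "from pod A" > /files/handoff.txt'],
        volumes=[shared_volume],
        volume_mounts=[mount])

    # Pod B: mounts the same claim and reads what pod A wrote.
    reader = kubernetes_pod_operator.KubernetesPodOperator(
        task_id='read_file',
        name='read-file',
        namespace='default',
        image='bash:5',  # placeholder image
        cmds=['bash', '-c'],
        arguments=['cat /files/handoff.txt'],
        volumes=[shared_volume],
        volume_mounts=[mount])

    writer >> reader  # pod B only runs after pod A has written the file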
Footnotes:
[1] I say it's managed, but it actually just deploys a Kube cluster and still expects you to muck around with the Airflow stuff yourself.