GKE, Airflow, Cloud Composer, and Persistent Volumes
Google Cloud Composer is a managed[1] version of Airflow that allows you to schedule Docker images using KubernetesPodOperators. This is nice, except that there's curiously little clear documentation (or Stack Overflow discussion) on how to schedule a pod, mount a volume, and use that volume, which makes it annoying if you want to share information across pods (e.g. pod A does some work, writes to a volume, then pod B gets scheduled and mounts the same volume pod A used). I'll go through that process here.
1. Compute Engine
Create a disk in Google Compute Engine with the desired size. Note that you should not use a shared disk, as Kubernetes won't let you mount a volume that's already in use. You also cannot use the boot disk of a node pool.
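For example, something like the following creates a 500GB disk (the name and zone are just placeholders; the disk does need to live in the same zone as the GKE nodes that will mount it):
gcloud compute disks create persist-pods-disk \
    --size=500GB \
    --zone=us-central1-a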
2. Persistent Volume Claim
Create a persistent volume and persistent volume claim from the disk (https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/preexisting-pd), and run kubectl apply -f existing-pd.yaml. Note that you need both the volume (to register a fixed volume in Kube) and the volume claim (to allow the pods to use the volume).
You'll see the disk in the cloud console (Kubernetes Engine->Storage).
For example, if I had a GCE disk with the name persist-pods-disk, the config would look like:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: persist-pods-disk-volume
spec:
  storageClassName: ""
  capacity:
    storage: 500G
  accessModes:
    - ReadWriteMany
  gcePersistentDisk:
    pdName: persist-pods-disk
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persist-pods-disk-claim
spec:
  # It's necessary to specify "" as the storageClassName
  # so that the default storage class won't be used, see
  # https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  storageClassName: ""
  volumeName: persist-pods-disk-volume
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500G
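Once you've applied the yaml, you can sanity-check that the claim actually bound to the volume (the names here match the config above):
kubectl get pv persist-pods-disk-volume
kubectl get pvc persist-pods-disk-claim
# both should report STATUS "Bound"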
3. The DAG
Update your DAG with a volume and a volume mount (https://airflow.apache.org/kubernetes.html).
For example, if I wanted to mount the disk into the pods at /files, using the configuration given above:
# Airflow 1.x / Cloud Composer contrib imports
from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount
from airflow.contrib.operators import kubernetes_pod_operator

volume_mount = VolumeMount('persist-disk',
                           mount_path='/files',
                           sub_path=None,
                           read_only=False)

volume_config = {
    'persistentVolumeClaim': {
        'claimName': 'persist-pods-disk-claim'  # uses the persistentVolumeClaim given in the Kube yaml
    }
}
volume = Volume(name='persist-disk', configs=volume_config)  # the name here is the literal name given to the volume in the pod's yaml

# ... other stuff

operator = kubernetes_pod_operator.KubernetesPodOperator(
    task_id=name + "_task",
    name=name,
    namespace='default',
    image=image,
    image_pull_policy='Always',
    retries=retries,
    arguments=arguments,
    affinity={
        'nodeAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': {
                'nodeSelectorTerms': [{
                    'matchExpressions': [{
                        'key': 'cloud.google.com/gke-nodepool',
                        'operator': 'In',
                        'values': [
                            affinity,
                        ]
                    }]
                }]
            }
        }
    },
    volumes=[volume],
    volume_mounts=[volume_mount],
)
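To actually get the pod A / pod B sharing described at the top, give both operators the same volume and volume mount and chain them in the DAG. Rough sketch (the task names, images, and file paths here are made up):
writer = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='pod_a_task',
    name='pod-a',
    namespace='default',
    image='gcr.io/my-project/pod-a:latest',     # hypothetical image that produces output
    arguments=['--output', '/files/data.csv'],  # writes onto the mounted disk
    volumes=[volume],
    volume_mounts=[volume_mount],
)

reader = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='pod_b_task',
    name='pod-b',
    namespace='default',
    image='gcr.io/my-project/pod-b:latest',     # hypothetical image that consumes pod A's output
    arguments=['--input', '/files/data.csv'],
    volumes=[volume],
    volume_mounts=[volume_mount],
)

writer >> reader  # pod B only runs after pod A has finished writing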
Footnotes:
1. I say it's managed, but it really just deploys a Kube cluster and still expects you to fiddle with the Airflow stuff yourself.