In this post I tell you the story of how I shot myself in the foot with PodSecurityPolicies and how I recovered my cluster.
Docs
I will not explain in detail what PodSecurityPolicies (PSPs) are, because there are plenty of resources covering the concept (e.g. the official docs). The bottom line is that these policies determine what pods created in the cluster can and cannot do, from privileges, seccomp and Linux (kernel) capabilities to filesystem mounts and networking.
What I want to point out about the docs and other resources is that they explain what policies are and what you can do with them, and they throw in a few examples you can apply, but there is no guide on how to really implement them, because that depends on the use case. So here is my cluster as a concrete use case, and I'll tell you how I naively implemented PSPs, what went wrong and how I recovered. Hopefully at the end we can synthesize what a good approach looks like.
Doing it naively
One thing I understood at the beginning is that PSPs are not enabled by default in most k8s clusters. That means the admission controller is not enabled. This is the controller that admits a pod into the cluster if it can be validated against an available policy.
So the first mistake one might make is enabling the admission controller before creating any policies. Considering the above, if there are no available policies, no pod can be validated and thus none can be created. I did not fall for this: I started by creating a restrictive baseline policy.
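A minimal sketch of such a policy, assuming it is named baseline (that is the name that shows up in the pod annotations later); the exact field values are illustrative:

```yaml
# Illustrative restrictive baseline: no privileged mode, no host namespaces,
# no hostPath volumes and all extra Linux capabilities dropped.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: baseline
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - secret
    - emptyDir
    - downwardAPI
    - projected
    - persistentVolumeClaim
```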
With this in place you'd still not be able to create any pods. That is because the use of a PSP has to be granted via RBAC (look it up if you don't know what that is), specifically with Roles and RoleBindings.
Since you yourself rarely create pods directly (you create deployments and other resources, and the pods are created from them on behalf of service accounts), the best approach is to allow service accounts to use the policies.
If we want a general solution (in this case we want this restrictive PSP to be used cluster-wide), we can use a ClusterRole and a ClusterRoleBinding, which are not limited to a single namespace. Oh, I forgot to mention that PSPs are also cluster-level objects; they can be used in multiple or all namespaces.
So to implement what I drafted here, we need the following ClusterRole:
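Something along these lines (the psp-baseline name is just my choice; the important part is the use verb on that specific policy):

```yaml
# Grants the 'use' verb on the baseline PSP defined above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-baseline
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    resourceNames: ['baseline']
    verbs: ['use']
```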
Then we can bind this role (and this took me a bit to figure out) to all service accounts in the cluster with the following ClusterRoleBinding:
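The trick is the subject: the built-in system:serviceaccounts group covers every service account in every namespace (again a sketch, with names of my choosing):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-baseline-all-serviceaccounts
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-baseline
subjects:
  # All service accounts, cluster-wide
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts
```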
LGTM. I restarted some deployments and everything seemed to be fine, until some time later, when issues started to pop up.
Buckle up, this is gonna be bumpy
Notice: It is important to recognize that PSPs are applied to new pods only. Pods already running may not meet the PSP restrictions, but they have already been admitted to the cluster, so they keep running. Problems become visible when the corresponding resources are restarted or scaled, or pods are re-scheduled for some reason.
First it was the NFS Server Provisioner. It reported some error. Digging in, I found that it could not start a new pod, because the pod could not be validated against any PSP. It turned out that it uses some uncommon Linux capabilities (namely DAC_READ_SEARCH and SYS_RESOURCE) that are not allowed by my baseline PSP. OK, that's obvious. Let's see how to resolve it.
The NFS Provisioner was installed by Helm and it uses its own service account to spawn pods, so I had to create a PSP that allows those capabilities, a Role that allows using that PSP, and bind it to the nfs-provisioner service account via a RoleBinding.
The PSP:
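Roughly the baseline again, plus the two capabilities the provisioner needs (the name and exact fields are illustrative):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: nfs-provisioner
spec:
  privileged: false
  allowPrivilegeEscalation: false
  # The two uncommon capabilities the provisioner pods request
  allowedCapabilities:
    - DAC_READ_SEARCH
    - SYS_RESOURCE
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - secret
    - emptyDir
    - persistentVolumeClaim
```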
The role:
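A namespaced one this time; I'm assuming here that the chart is installed in a namespace called nfs, adjust to your setup:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: psp-nfs-provisioner
  namespace: nfs   # assumed namespace, use wherever the chart is installed
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    resourceNames: ['nfs-provisioner']
    verbs: ['use']
```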
The rolebinding:
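Binding it to the provisioner's service account (same assumed namespace):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-nfs-provisioner
  namespace: nfs   # assumed namespace, same as the Role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: psp-nfs-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-provisioner
    namespace: nfs
```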
That definitely solved the problem: the NFS provisioner was back up and running, everything was bright.
Maybe a week later, I updated a deployment that used a PersistentVolume. It could not start the new pod, because it could not mount the persistent volume. OK, I said, I know this one: this type of persistent volume can be mounted by only one pod at a time (ReadWriteOnce) and the scheduler starts the new pod before terminating the old one. Let's switch the update strategy to Recreate:
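In the Deployment spec this is a one-liner (shown here as a fragment):

```yaml
# Deployment spec fragment: terminate the old pod before starting the new one,
# so the ReadWriteOnce volume can be re-mounted.
spec:
  strategy:
    type: Recreate
```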
After that it still couldn’t mount the volume, but a different error came up:
MountVolume.SetUp failed for volume "pvc-cdf6f3bb-f284-42fd-8220-c4e7f4f86886" :
kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable
desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/csi.scaleway.com/csi.sock:
connect: no such file or directory"
I didn't quite get this, but it was clear that something was wrong with the Scaleway CSI plugin that should provide the persistent volume. So I went to the always helpful Scaleway Slack Community and asked. Within a few minutes we were able to figure out that the root of the problem was that the csi-node DaemonSet, which should run pods on all cluster nodes (and is responsible for managing the persistent volumes), was not running on the node that my new pod was scheduled to. Guess why?!
I think you guessed it: because of the PSP. The csi-node needs to mount the host filesystem to provide the persistent volumes (I assume it mounts block storage volumes on the host and presents them as persistent volumes for the pods running on that node), but the hostPath volume mount isn't permitted.
You probably know the solution as well: a new PSP that allows hostPath, a new Role to use it, and a binding to the service account that handles the csi-node DaemonSet.
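Just to show the pattern once more, such a policy could look roughly like this (a simplified sketch; a real CSI node plugin typically needs even more, e.g. privileged mode, so check the DaemonSet spec for what it actually asks for):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: csi-node
spec:
  privileged: true            # CSI node plugins usually need this too
  allowPrivilegeEscalation: true
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - hostPath                # the part my baseline did not allow
    - configMap
    - secret
    - emptyDir
```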
I hope you see the pattern here. The great hint in this case came from @sh4d1 on the Scaleway Slack channel, who advised me to describe the pod to be sure. And indeed, if you examine the pod description properly, you will see something like:
Annotations: ...
kubernetes.io/psp: baseline
So you can always tell what policy is used by a given pod.
kube-system and other exotic places
Yes, csi-node is not the only workload in the kube-system namespace that needs special rights. In some managed clusters (like GKE) the system components running there have their own PSPs and RBAC, but in others, or in self-managed clusters, don't take it for granted. Just a few examples: ingress controllers probably need to use hostPorts, metric collectors often need access to hostPID, hostNetwork and hostPath, log collectors definitely need to read logs from hostPath, and so on.
Finding all of these out the hard way can be painful or even damage your cluster. Theoretically you could go through all the pods and check what is defined in their SecurityContext, but in my experience that's not reliable. For example, I can see that the fluentd-logzio image mounts the /var/log directory of the host, but hostPath is not defined in its SecurityContext block.
So my solution (maybe it's not a best practice) is to simply create a privileged PSP and allow it to be used by any service account in the kube-system namespace. You shouldn't put random stuff there anyway.
An all-access PSP could look like this:
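For example, something like the well-known privileged example from the official docs (the all-access name is my choice):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: all-access
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities: ['*']
  volumes: ['*']
  hostNetwork: true
  hostPorts:
    - min: 0
      max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```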
and the RBAC:
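A ClusterRole plus a namespaced RoleBinding, so the grant only applies in kube-system (the names are again my choice):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-all-access
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    resourceNames: ['all-access']
    verbs: ['use']
---
# The RoleBinding is namespaced, so only kube-system service accounts get it.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-all-access
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-all-access
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts:kube-system
```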
Why did I use a ClusterRole here? This way we can reuse this all-access PSP in other namespaces if needed. Just be careful with that.
Another catch
Another point where things can go south is how the policy is chosen for a pod. To understand this, let's use an example.
Let's imagine we want to run a container that uses chown (changes file ownership) during its initialization. Not a best practice, but there are many publicly available images that do that. Also, our baseline PSP (rightfully) does not allow this capability to be used.
Our deployment looks like this:
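Something like this sketch; the image name and the path are made up, what matters is that the container chowns a directory during startup:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chown-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chown-app
  template:
    metadata:
      labels:
        app: chown-app
    spec:
      containers:
        - name: app
          # hypothetical image that runs 'chown -R 100:101 /some/path' on start
          image: example/chown-app:latest
          volumeMounts:
            - name: data
              mountPath: /some/path
      volumes:
        - name: data
          emptyDir: {}
```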
When we start the deployment, we will see that it goes into the dreaded CrashLoopBackOff state quite fast. If we check the logs, we will see something like:
chown -R 100:101 /some/path: operation not permitted
exit(1)
Do we know what's going on here? Yes: Kubernetes uses the PSP not just for admission control, but also to enforce the pod's SecurityContext. So if it validates the pod against our baseline PSP, and that policy does not allow the CHOWN capability, it will drop that capability in the pod's context. That means the kernel will refuse the operation regardless of whether the user is root or not: no chown at all on its watch.
How can we resolve this? Of course, we have to create another PSP that allows that capability and use RBAC to allow the service account (default, unless specified) that runs our deployment to use it.
Here is the PSP, you can figure out the needed RBAC now:
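A sketch: basically the baseline, but with CHOWN explicitly allowed and nothing forcibly dropped:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: allow-chown
spec:
  privileged: false
  allowPrivilegeEscalation: false
  # CHOWN may be requested by the pod; this policy drops nothing
  allowedCapabilities:
    - CHOWN
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - secret
    - emptyDir
    - persistentVolumeClaim
```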
And here is the catch: if we name our new PSP, say, allow-chown, everything works fine, but if we name it chown, it still crashes and we get the same log. How can that happen?
If we describe our pod in the latter case, we will find that the pod still uses the baseline PSP, while with the former name it will use the allow-chown one. That is because the admission controller looks at all the PSPs available to the service account in alphabetical order and uses the first one that the pod can be validated against. It's obvious that we cannot rely on naming alone to get the admission controller to use the proper PSP; rather, we should help the validator so that only the right one is valid.
We can do this by setting the pod’s SecurityContext like this:
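Strictly speaking the capability is requested in the container's securityContext inside the pod template, something like this fragment:

```yaml
# Pod template fragment: explicitly request CHOWN, so only a PSP that
# allows this capability can validate the pod.
spec:
  template:
    spec:
      containers:
        - name: app
          securityContext:
            capabilities:
              add:
                - CHOWN
```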
Applying this, when the admission controller begins validation, the baseline PSP will not be valid for this pod, because it does not allow the CHOWN capability. So it will move on and find that the only PSP valid for the pod's SecurityContext is the allow-chown/chown PSP; whatever the name is, it doesn't matter any more.
Conclusion
PodSecurityPolicy and admission control are very powerful tools to enforce secure workloads in our cluster, but they should be implemented with care and a lot of caution. They do not provide a way to dry-run them to see if something will break when they're applied. And it's not just your business workloads, but also system components and other tools that are essential for your cluster's operation. It is best to use a staging environment to test how each and every application/component behaves under certain restrictions before applying them to production.
In my experience many applications you can install via Helm contain the needed PSP, but often the RBAC isn’t set up correctly (I’ve seen this with Prometheus and NFS Server Provisioner as well), so it’s useful to check if it’s really effective.
Also, it is very important not to rely on a PSP being picked for a pod just because of alphabetical order, but to explicitly set the pod's SecurityContext to help the admission controller find the matching PSP.
Notice: PodSecurityPolicy is deprecated as of Kubernetes 1.21. That does not mean it can't be used any more; it will probably stay with us for a while, I assume. However, Kubernetes and the community will provide new ways to do admission control. For example, Open Policy Agent (OPA) is the new standard, which leverages the Rego query language for validation, and Gatekeeper is a Kubernetes admission controller implementation that uses OPA. This is a far more powerful and extensible solution than PSPs; for instance, it supports a dry-run mode to check if everything is OK before enforcing a policy, but that's just one example. So if you're just planning to make your cluster more secure, it's worth considering going with a more future-proof solution.
Discussion: https://twitter.com/iben12/status/1394666173824897033?s=20