In this post I tell you the story of how I shot myself in the foot with PodSecurityPolicies and how I recovered my cluster.
Docs
I will not explain in detail what PodSecurityPolicies (PSPs) are, because there are plenty of resources covering the concept (e.g. the official docs). The bottom line is that these policies determine what pods created in the cluster can and cannot do, from privileges, seccomp and Linux (kernel) capabilities to filesystem mounts and networking.
What I want to point out about the docs and other resources is that they explain what policies are and what you can do with them, and they throw in a few examples you can apply, but there is no guide on how to really implement them, because that depends on the use case. So here is my cluster as a concrete use case, and I'll tell you how I naively implemented PSPs, what went wrong and how I recovered. Hopefully at the end we can synthesize what a good approach looks like.
Doing it naively
One thing I understood at the beginning is that PSPs are not enabled by default in most k8s clusters. That means the admission controller is not enabled. This is the controller that admits a pod into the cluster if it can be validated against an available policy.
So the first mistake one might make is enabling the admission controller before creating any policies. Considering the above, if there are no available policies, no pod can be validated and thus none can be created. I did not fall for this: I started by creating a restrictive baseline policy.
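A minimal sketch of such a policy, assuming it is named baseline (that is the name that shows up in the pod annotations later); the exact field values are illustrative:

```yaml
# Illustrative restrictive baseline: no privileged mode, no host namespaces,
# no hostPath volumes and all extra Linux capabilities dropped.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: baseline
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - secret
    - emptyDir
    - downwardAPI
    - projected
    - persistentVolumeClaim
```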
With this in place you'd still not be able to create any pods. That is because the use of a PSP has to be granted via RBAC (look it up if you don't know what that is), specifically with Roles and RoleBindings.
Since you yourself rarely create pods directly (you create deployments and other resources, and the pods are created from them on behalf of service accounts), the best approach is to allow service accounts to use the policies.
If we want a general solution (in this case we want this restrictive PSP to be used cluster-wide), we can use a ClusterRole and a ClusterRoleBinding, which are not limited to a single namespace. Oh, I forgot to mention that PSPs are also cluster-level objects; they can be used in multiple or all namespaces.
So to implement what I drafted here, we need the following ClusterRole:
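Something along these lines (the psp-baseline name is just my choice; the important part is the use verb on that specific policy):

```yaml
# Grants the 'use' verb on the baseline PSP defined above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-baseline
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    resourceNames: ['baseline']
    verbs: ['use']
```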
Then we can bind this role (and this took me a bit to figure out) to all service accounts in the cluster with the following ClusterRoleBinding:
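The trick is the subject: the built-in system:serviceaccounts group covers every service account in every namespace (again a sketch, with names of my choosing):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-baseline-all-serviceaccounts
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-baseline
subjects:
  # All service accounts, cluster-wide
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts
```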
LGTM. I restarted some deployments and everything seemed to be fine, until some time later, when issues started to pop up.
Buckle up, this is gonna be bumpy
Notice: It is important to recognize that PSPs are applied to new pods only. Pods already running may not meet the PSP restrictions, but they have already been admitted to the cluster, so they keep running. Problems become visible when the corresponding resources are restarted or scaled, or pods are re-scheduled for some reason.
First it was the NFS Server Provisioner. It reported some error. Digging in, I found that it could not start a new pod, because the pod could not be validated against any PSP. It turned out that it uses some uncommon Linux capabilities (namely DAC_READ_SEARCH and SYS_RESOURCE) that are not allowed by my baseline PSP. OK, that's obvious. Let's see how to resolve it.
The NFS Provisioner was installed by Helm and it uses its own service account to spawn pods, so I had to create a PSP that allows those capabilities, a Role that allows using that PSP, and bind it to the nfs-provisioner service account via a RoleBinding.
The PSP:
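Roughly the baseline again, plus the two capabilities the provisioner needs (the name and exact fields are illustrative):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: nfs-provisioner
spec:
  privileged: false
  allowPrivilegeEscalation: false
  # The two uncommon capabilities the provisioner pods request
  allowedCapabilities:
    - DAC_READ_SEARCH
    - SYS_RESOURCE
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - secret
    - emptyDir
    - persistentVolumeClaim
```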
The role:
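A namespaced one this time; I'm assuming here that the chart is installed in a namespace called nfs, adjust to your setup:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: psp-nfs-provisioner
  namespace: nfs   # assumed namespace, use wherever the chart is installed
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    resourceNames: ['nfs-provisioner']
    verbs: ['use']
```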
The rolebinding:
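Binding it to the provisioner's service account (same assumed namespace):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-nfs-provisioner
  namespace: nfs   # assumed namespace, same as the Role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: psp-nfs-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-provisioner
    namespace: nfs
```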
That definitely solved the problem: the NFS provisioner was back up and running, everything was bright.
Maybe a week later, I updated a deployment that used a PersistentVolume. It could not start the new pod, because it could not mount the persistent volume. OK, I said, I know this one: this type of persistent volume can be mounted by only one pod at a time (ReadWriteOnce) and the scheduler starts the new pod before terminating the old one. Let's switch the update strategy to Recreate:
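In the Deployment spec this is a one-liner (shown here as a fragment):

```yaml
# Deployment spec fragment: terminate the old pod before starting the new one,
# so the ReadWriteOnce volume can be re-mounted.
spec:
  strategy:
    type: Recreate
```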
After that it still couldn’t mount the volume, but a different error came up:
MountVolume.SetUp failed for volume "pvc-cdf6f3bb-f284-42fd-8220-c4e7f4f86886" :
kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable
desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/csi.scaleway.com/csi.sock:
connect: no such file or directory"
I didn't quite get this, but it was clear that something was wrong with the Scaleway CSI plugin that should provide the persistent volume. So I went to the always helpful Scaleway Slack Community and asked. Within a few minutes we were able to figure out that the root of the problem was that the csi-node DaemonSet, which should run pods on all cluster nodes (and is responsible for managing the persistent volumes), was not running on the node that my new pod was scheduled to. Guess why?!
I think you guessed it: because of the PSP. The csi-node needs to mount the host filesystem to provide the persistent volumes (I assume it mounts block storage volumes on the host and presents them as persistent volumes for the pods running on that node), but the hostPath volume mount isn't permitted.
You probably know the solution as well: a new PSP that allows hostPath, a new Role to use it, and a binding to the service account that handles the csi-node DaemonSet.
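Just to show the pattern once more, such a policy could look roughly like this (a simplified sketch; a real CSI node plugin typically needs even more, e.g. privileged mode, so check the DaemonSet spec for what it actually asks for):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: csi-node
spec:
  privileged: true            # CSI node plugins usually need this too
  allowPrivilegeEscalation: true
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - hostPath                # the part my baseline did not allow
    - configMap
    - secret
    - emptyDir
```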
I hope you see the pattern here. The great hint in this case came from @sh4d1 on the Scaleway Slack channel, who advised me to describe the pod to be sure. And indeed, if you examine the pod description properly, you will see something like:
Annotations: ...
kubernetes.io/psp: baseline
So you can always tell what policy is used by a given pod.
kube-system and other exotic places
Yes, csi-node is not the only workload in the kube-system namespace that needs special rights. In some managed clusters (like GKE) the system components running there have their own PSPs and RBAC, but in others, or in self-managed clusters, don't take it for granted. Just a few examples: ingress controllers probably need to use hostPorts, metric collectors often need access to hostPID, hostNetwork and hostPath, log collectors definitely need to read logs from hostPath, and so on.
Finding all of these out the hard way can be painful or even damage your cluster. Theoretically you could go through all the pods and check what is defined in their SecurityContext, but in my experience that's not reliable. For example, I can see that the fluentd-logzio image mounts the /var/log directory of the host, but hostPath is not defined in its SecurityContext block.
So my solution (maybe it's not a best practice) is to simply create a privileged PSP and allow it to be used by any service account in the kube-system namespace. You shouldn't put random stuff there anyway.
An all-access PSP could look like this:
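For example, something like the well-known privileged example from the official docs (the all-access name is my choice):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: all-access
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities: ['*']
  volumes: ['*']
  hostNetwork: true
  hostPorts:
    - min: 0
      max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```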
and the RBAC:
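A ClusterRole plus a namespaced RoleBinding, so the grant only applies in kube-system (the names are again my choice):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-all-access
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    resourceNames: ['all-access']
    verbs: ['use']
---
# The RoleBinding is namespaced, so only kube-system service accounts get it.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-all-access
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-all-access
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts:kube-system
```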
Why did I use a ClusterRole here? This way we can reuse this all-access PSP in other namespaces if needed. Just be careful with that.
Another catch
Another point where things can go south is how the policy is chosen for a pod. To understand this, let's use an example.
Let's imagine we want to run a container that uses chown (changes file ownership) during its initialization. Not a best practice, but there are many publicly available images that do that. Also, our baseline PSP (rightfully) does not allow this capability to be used.
Our deployment looks like this:
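Something like this sketch; the image name and the path are made up, what matters is that the container chowns a directory during startup:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chown-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chown-app
  template:
    metadata:
      labels:
        app: chown-app
    spec:
      containers:
        - name: app
          # hypothetical image that runs 'chown -R 100:101 /some/path' on start
          image: example/chown-app:latest
          volumeMounts:
            - name: data
              mountPath: /some/path
      volumes:
        - name: data
          emptyDir: {}
```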
When we start the deployment, we will see that it goes into the dreaded CrashLoopBackOff state quite fast. If we check the logs, we will see something like:
chown -R 100:101 /some/path: operation not permitted
exit(1)
Do we know what's going on here? Yes: Kubernetes uses the PSP not just for admission control, but also to enforce the pod's SecurityContext. So if it validates the pod against our baseline PSP, and that policy does not allow the CHOWN capability, it will drop that capability in the pod's context. That means the kernel will refuse the operation regardless of whether the user is root or not: no chown at all on its watch.
How can we resolve this? Of course, we have to create another PSP that allows that capability and use RBAC to allow the service account (default, unless specified) that runs our deployment to use it.
Here is the PSP, you can figure out the needed RBAC now:
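A sketch: basically the baseline, but with CHOWN explicitly allowed and nothing forcibly dropped:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: allow-chown
spec:
  privileged: false
  allowPrivilegeEscalation: false
  # CHOWN may be requested by the pod; this policy drops nothing
  allowedCapabilities:
    - CHOWN
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - secret
    - emptyDir
    - persistentVolumeClaim
```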
And here is the catch: if we name our new PSP, say, allow-chown, everything works fine, but if we name it chown, it still crashes and we get the same log. How can that happen?
If we describe our pod in the latter case, we will find that the pod still uses the baseline PSP, while with the former name it will use the allow-chown one. That is because the admission controller looks at all the PSPs available to the service account in alphabetical order and uses the first one that the pod can be validated against. It's obvious that we cannot rely on naming alone to get the admission controller to use the proper PSP; rather, we should help the validator so that only the right one is valid.
We can do this by setting the pod’s SecurityContext like this:
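Strictly speaking the capability is requested in the container's securityContext inside the pod template, something like this fragment:

```yaml
# Pod template fragment: explicitly request CHOWN, so only a PSP that
# allows this capability can validate the pod.
spec:
  template:
    spec:
      containers:
        - name: app
          securityContext:
            capabilities:
              add:
                - CHOWN
```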
Applying this, when the admission controller begins validation, the baseline PSP will not be valid for this pod, because it does not allow the CHOWN capability. So it will move on and find that the only PSP valid for the pod's SecurityContext is the allow-chown/chown PSP; whatever the name is, it doesn't matter any more.
Conclusion
PodSecurityPolicy and admission control are very powerful tools to enforce secure workloads in our cluster, but they should be implemented with care and a lot of caution. They do not provide a way to dry-run them to see if something will break when they're applied. And it's not just your business workloads, but also system components and other tools that are essential for your cluster's operation. It is best to use a staging environment to test how each and every application/component behaves under certain restrictions before applying them to production.
In my experience many applications you can install via Helm contain the needed PSP, but often the RBAC isn’t set up correctly (I’ve seen this with Prometheus and NFS Server Provisioner as well), so it’s useful to check if it’s really effective.
Also, it is very important not to rely on a PSP being picked for a pod just because of alphabetical order, but to explicitly set the pod's SecurityContext to help the admission controller find the matching PSP.
Notice: PodSecurityPolicy is deprecated as of Kubernetes 1.21. That does not mean it can't be used any more; it will probably stay with us for a while, I assume. However, Kubernetes and the community will provide new ways to do admission control. For example, Open Policy Agent (OPA) is the new standard, which leverages the Rego query language for validation, and Gatekeeper is a Kubernetes admission controller implementation that uses OPA. This is a far more powerful and extensible solution than PSPs; for instance, it supports a dry-run mode to check if everything is OK before enforcing a policy, but that's just one example. So if you're just planning to make your cluster more secure, it's worth considering going with a more future-proof solution.
Discussion: https://twitter.com/iben12/status/1394666173824897033?s=20