In this post I will go through the choices I had to make while deploying my cluster and the theory behind the things I consider essential additions for a production grade cluster. Do not expect a step-by-step guide here (it will come later), but you can learn some not-so-obvious concepts behind k8s.
In my first post I explained what my goal is with this cluster; now it’s time to talk about the setup I actually have.
Deploying and operating Kubernetes is not an easy task, unless you leave it to the experts, as I recommended here. I’m using Scaleway’s k8s service — Kapsule — to create my cluster. Using a managed service makes things orders of magnitude simpler, but even a provider’s setup wizard will ask intimidating questions for a first-time user.
While it is possible to use infrastructure-as-code tools like Terraform to create the cluster if your provider has a usable API, this is usually only worth the additional complexity if you deploy clusters on a regular basis, where automation and consistent outcomes are a real gain. That’s not my case, so I just use the Scaleway UI to deploy the cluster with a few clicks.
Create
Let’s go through what decisions we have to make in this phase. While this is specific to Kapsule, you will have very similar options on other providers (I know GKE is almost the same), so bear with me.
Kubernetes version
Unless you know you depend on a specific version, it’s best to go with the latest one.
Node pool
The node pool specifies the attributes of the VMs that are spawned to act as your cluster nodes. One cluster may have multiple pools with different node types, so you can have diverse compute instances to support your workloads and scale each pool as required (e.g. you can spawn 5 regular nodes and 2 high-memory instances, or even some with GPUs attached for your ML purposes). We can debate what the ideal node size is, but it depends a lot on your use case. Here is a good article on it. In short, if you go with a bunch of smaller nodes, your cluster will be more resilient to losing some of them and you get more fine-grained scaling. However, the system overhead and leftover unusable resources may take a bigger share of your overall compute capacity. On the other hand, having a few very large nodes is painful if one goes down, and there may be a limit on the number of pods per node regardless of how much resource it has, so you may not be able to fully utilize them. Since I’m on a budget and don’t have workloads with extreme resource requirements, I set up a pool of DEV1-M (3 vCPU, 4GB RAM, 40GB disk) instances. At Scaleway (and other providers as well) node pools have two additional features: autoscaling and autohealing.
Autoscaling allows the cluster to bring up new nodes when needed and shut them down when there’s too much excess capacity. What is important here — and what I learned along the way — is that this does not happen by Kubernetes monitoring your resource usage, but by you setting sane resource requests on your workloads. I’m going to share more details on this in a later post on deployments; for now, let’s say one of your pods requests 1G of memory, your only node has 4G, and the pods already running on it request 3.2G in total. In this case the pod cannot be scheduled, since there’s no node in the cluster with as much unallocated memory as your pod requests. If you enabled autoscaling, the appropriate controller will spawn a new node through the provider’s API and add it to your cluster, as long as the maximum number of nodes hasn’t been reached yet. Conversely, if you remove a workload and the system determines that the overall resource requests can be satisfied with one node less, it will drain and shut one down.
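To make this tangible, here is a minimal sketch of a Deployment with resource requests set (the name, image and numbers are made up for illustration); these requests are what the scheduler and the cluster autoscaler base their decisions on:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # hypothetical name, for illustration only
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:1.19
          resources:
            requests:
              memory: "1Gi"   # scheduling and autoscaling decisions are based on these requests
              cpu: "250m"
            limits:
              memory: "1Gi"   # limits cap actual usage, but don't drive autoscaling
              cpu: "500m"
```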
Autohealing helps you — as the name suggests — keep your nodes healthy: if a node is failing (unavailable or reporting some other kind of error), it gets restarted, and if that doesn’t help, it gets thrown away and replaced by a new one based on the pool specs.
I’ve set up the pool to autoscale between 1 and 3 nodes and enabled autohealing.
Container Network Interface (CNI)
Now that’s a tough one. I asked a friend of mine, who actively contributes to one of the CNI plugins available for k8s, what he’d say about this in short, and he said exactly this:
I would say it’s fucking complicated and before making the choice you should consider what are your requirements for the network. And I’d link this article.
Well, I’m no expert in networking, and currently my requirement is that pods should be able to connect to each other within my cluster. That’s not too much and is actually covered by every one of them. Later on, however, I might want to go deeper into this bottomless rabbit hole. Ease of installation is usually another aspect of comparison between CNI plugins, but in a managed environment it usually boils down to selecting an option from a dropdown. Keep in mind, however, that switching CNI in a running cluster is not possible — experts say * — so what you choose now will stay with you. Long story short, I chose Calico, because of their cute cat logo and because, to my understanding, it’s performant and compatible with service mesh (Istio), if I ever want to experiment with that (probably never at this scale). If you want to make a more educated decision, read the above article, or if you’re not a network expert, this one from the Rancher blog.
UPDATE: Since this post was published I’ve had some experience with Calico and I have to say I wouldn’t choose it again. I have two reasons for that:
- I ran into this issue and it wasn’t funny. I noticed that sites hosted in my cluster went offline one after the other. I checked the relevant pods and found that they were in a constant restart cycle because the readiness check kept failing. Checking the events showed that it wasn’t that the pod didn’t respond to the http check; rather, it failed with an invalid argument error. On the Scaleway Slack channel, where I asked about this, I got an instant response asking whether I use Calico. After confirming I do, I was told to restart the Calico pods to resolve the issue. It did, but it wasn’t a great experience, and there is no permanent fix I know of other than re-installing Calico (which basically equals spawning a new cluster).
- I read this article by Sedat Gökcen where he explains why they moved on from Calico to Cilium. The main point is the difference in how these two plugins work: Calico uses iptables while Cilium uses eBPF. With a lot of services in your cluster, the iptables config will contain a huge number of rules, which can harm network performance since every packet has to be checked against them. Rule updates may also take ages. So at some point iptables becomes a bottleneck.
Addons
Scaleway offers two addons on Kapsule clusters: deployment of an ingress controller (Traefik or NGINX) and the Kubernetes dashboard.
Regarding the ingress controller: you’ll definitely need one to easily expose your applications to the internet; however, this auto-deployed version leaves very little room for configuration. Admittedly (by Scaleway staff), this option is only for experimenting, letting you quickly deploy something that’s publicly available, but it’s not meant to be used in production clusters. You can deploy the ingress controller of your choice later on with little extra effort, and I will show you how.
The Kubernetes dashboard is another part of the beginner experience: it provides a visual overview of your cluster, in contrast to the kubectl CLI, which you might find hard to grasp at first. But there are more useful tools for that purpose: Lens (if you’re into GUIs) or k9s (if you’re a terminal fan).
I left both options unchecked.
The die is cast…
all I have to do is click the button and wait for my cluster to be provisioned.
Essentials
My cluster is up and running in a few minutes, but is this enough? Sure, I can start experimenting, deploying applications, creating services, exposing them through node ports or ingresses, but to operate things reliably in the long run I will need a bit more than that. Namely:
- Helm
- monitoring
- log collection and aggregation
- security measures
- ingress controller
You might argue that this list is a bit off, since these things fall into quite different categories. Some of them are essential for any production application on any infrastructure, while others are rather specific to k8s. Anyway, all of these are things you should take care of in your cluster before anything else. Let’s go through them one by one; I’ll explain why I think they’re essential and will show you how to set them up later.
Helm
Helm is the de facto package manager for k8s. It makes it easy to share and deploy complex, multi-service applications. Instead of copying and editing a bunch of manifests (deployments, ConfigMaps, secrets, services, etc.), you can install a chart from a public (or your own private) repository, or even from a local folder. Configuration happens during installation via command-line arguments or by editing the values.yaml file that contains all the configurable options. In the background, Helm uses manifest templates and renders them with the values given.
Before Helm 3, it had a server-side component (Tiller) that had to be installed into the cluster, but that’s gone; now it’s just a command-line tool that you install on your computer and you’re ready to use it against your cluster.
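Just to illustrate the workflow (the chart and the values below are examples, not something this cluster depends on; check each chart’s documentation for its real options), installing a chart with your own overrides looks roughly like this:

```yaml
# Roughly: add a repo, then install a chart with your own values file, e.g.
#   helm repo add bitnami https://charts.bitnami.com/bitnami
#   helm install my-blog bitnami/wordpress -f values.yaml
#
# values.yaml -- only the options you want to override:
wordpressUsername: admin
service:
  type: ClusterIP       # expose it later through an ingress instead of a LoadBalancer
persistence:
  size: 5Gi
```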
Monitoring
By that I mean resource monitoring, which is one part of application visibility in a cloud environment. Quite soon you’ll have a lot of workloads spread across your cluster, so unless you have a centralized place where you can check how much resource each and every pod/deployment/etc. is using, you have no chance of finding out where all your memory has gone, or why one of your nodes is peaking at 100% CPU all the time while the others hardly do anything.
Luckily, you’re not the first one walking this road, so something of an industry standard has already emerged: the Prometheus + Grafana stack, optionally extended with Alertmanager to handle alerting. Prometheus is the brain here; it collects metrics from several sources and exposes them in a form that, for example, Grafana can display in a nice graphical way. With Grafana you can use a bunch of pre-built visualizations like bar/line/pie charts, single metrics, heatmaps, gauges, etc. to display your metrics in the most appropriate way and organize them into dashboards based on how they relate. While Grafana itself supports setting up alerts on your metrics in its GUI, most of the time people use Alertmanager for that purpose, where you can simply write the rules as code.
You can deploy this stack into the cluster itself quite easily with the help of Helm and see everything in one place instantly.
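As a rough sketch (the chart name and repository have changed over time, so treat the names below as assumptions and check the current docs), the community-maintained chart bundles Prometheus, Grafana and Alertmanager, and you only override what you need in a values file:

```yaml
# Roughly:
#   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
#   helm install monitoring prometheus-community/kube-prometheus-stack -f values.yaml
#
# values.yaml -- a few illustrative overrides:
grafana:
  adminPassword: "change-me"    # placeholder, use a proper secret in real life
alertmanager:
  enabled: true
prometheus:
  prometheusSpec:
    retention: 7d               # keep a week of metrics on this small cluster
```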
Logging
The other part of visibility is logs: you cannot just look at them the old way when you have dozens of pods distributed over several nodes (not to mention hundreds of pods on dozens of nodes); the only reliable method is to send them to a centralized place and provide a toolkit to browse them. Again, there is a battle-tested way to do this, and it’s called the ELK stack: Elasticsearch is responsible for storing and indexing the logs, Logstash provides a convenient way to ingest the incoming data, parse it and store it in Elasticsearch, while Kibana is the tool to query and display it.
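Collection on the cluster side is usually handled by a log shipper running as a DaemonSet on every node; Filebeat and Fluentd are common choices. As a hedged sketch, a minimal Filebeat configuration that tails container logs and forwards them to Elasticsearch could look roughly like this (the endpoint is a placeholder):

```yaml
# filebeat.yml -- mounted into a Filebeat DaemonSet on each node
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
processors:
  - add_kubernetes_metadata:        # enrich log lines with pod/namespace metadata
      host: ${NODE_NAME}
      matchers:
        - logs_path:
            logs_path: "/var/log/containers/"
output.elasticsearch:
  hosts: ["https://elasticsearch.example.com:9200"]   # placeholder endpoint
```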
While it is totally possible to operate your own ELK stack, I’d advise against it for two reasons. One is that, although it’s quite easy to deploy ELK into your cluster at first, later on — when you produce more and more logs — you’ll need more and more resources and effort to keep it scaling with your needs. The other is that when you have problems in your cluster, it can happen that you won’t be able to access your logs at all. In that case, good luck finding out what went wrong.
There are several LaaS (Logging as a Service) solutions out there; some of them offer hosted ELK directly (logz.io, logit.io), others have their own stack (Sematext, Loggly by SolarWinds, or Datadog, which acquired Logmatic), just to mention a few.
I’m using logz.io for their generous free tier and the stock Kibana UI, which I’m familiar with.
Security
Some argue that containerized infrastructure is by default more secure than traditional setups, because of isolation. I lean towards the opinion that this is only partly true. A misconfigured or — even worse — unconfigured cluster can be pretty vulnerable. If someone hacks your application and you have no additional safety measures, it is possible to break out of the container and, in the worst case, gain total control of your cluster. Fortunately, with k8s we have the tools to mitigate this. Role Based Access Control (RBAC) and PodSecurityPolicy are your friends here. RBAC allows you to configure who can do what — be it actual users or service accounts — while PodSecurityPolicy enforces what containers are allowed to do. This topic is worth another post, but let me tell you in short what I did. I created a baseline PodSecurityPolicy that prevents a few things:
- privileged containers (that can access any device on the host)
- privilege escalation
- use of host PID
- mounting host volumes (if you can mount the host filesystem, you can mount devices, configurations, etc., which is useful if you want to break out of the container or tamper with the host)
- use of host network
It’s not very restrictive, but it’s a good start. I granted this policy to the group of service accounts. Service accounts are responsible for creating the pods of any resource (e.g. Deployment, StatefulSet, Job, etc.), and they can create them applying this policy. If you (or any other user) try to create a pod manually with your own user account, you won’t be able to (unless you grant access to the policy for that specific user as well).
Notice: That is a great way to shoot yourself in the foot, so don’t do this. I will write about how to do it correctly soon.
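For reference, here is a rough sketch of what such a baseline policy and its binding could look like; it mirrors the restrictions listed above, the names are my own, and the required runAsUser/seLinux/fsGroup/supplementalGroups rules are left permissive:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: baseline
spec:
  privileged: false                  # no privileged containers
  allowPrivilegeEscalation: false    # no privilege escalation
  hostPID: false                     # no host PID namespace
  hostNetwork: false                 # no host network
  volumes:                           # no hostPath mounts, only the usual safe types
    - configMap
    - secret
    - emptyDir
    - persistentVolumeClaim
    - downwardAPI
    - projected
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
---
# allow all service accounts to "use" this policy
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-baseline
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["baseline"]
    verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-baseline-serviceaccounts
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-baseline
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts
```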
Ingress controller
You can live without an ingress controller, but not for very long. You probably want to expose multiple services via standard HTTP/S and not on some arbitrary ports. For this you could create a new load balancer for each service (your provider will do this for you if you create a service of type LoadBalancer), but this is cumbersome to handle and sometimes very slow. You have to find a tool to keep your domain names pointing to the right load balancer IPs, and on top of that you pay for every load balancer instance and public IP address. This setup is not really flexible either.
The ingress controller acts as a reverse proxy and router between your load balancer and your cluster. It allows you to direct all HTTP traffic to a single load balancer; from there it is forwarded to the ingress controller, which routes the requests to your services based on the domain/path/etc. rules you define with ingresses. It works much like NGINX on a single server instance with multiple vhosts. Ingress controllers also support TLS termination, even with automatic Let’s Encrypt certificates out of the box or via some helper service.
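For example, a single Ingress resource that routes a hostname to a service and asks for TLS could look roughly like this (the hostname and the cert-manager annotation are assumptions; on older clusters the API group may still be networking.k8s.io/v1beta1):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: blog
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # assumes cert-manager handles the certificates
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - blog.example.com          # placeholder domain
      secretName: blog-tls
  rules:
    - host: blog.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: blog          # hypothetical service name
                port:
                  number: 80
```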
One of the most popular ingress controllers is the Kubernetes/NGINX Ingress Controller maintained by the Kubernetes community. It is the simplest, offering a minimal feature set. Still on the simple and straightforward side are the NGINX Ingress Controller for Kubernetes (offered by NGINX, don’t get confused) and Traefik. Then there’s a whole bunch of them offering a myriad of features for advanced usage, including service mesh support, API gateways, authentication, WAF, JWT validation, you name it. Most of these don’t come for free, however.
I used Traefik (v1.7) on my Docker Swarm cluster, but they totally changed their concept with v2.x and it wasn’t quite clear to me at the time of deployment. I had some experience with the NGINX Ingress Controller by NGINX on k8s, though, so I went with that.
Ready Cluster One?
Yes, in theory. I may be missing something here, but these things seem essential to me. In my next post I will give you a hands-on walkthrough of how to actually set this stuff up.
Discussion: https://twitter.com/iben12/status/1298209813743177728?s=20
* Yes, of course, everything is possible, but even my friend was unsure about that. At least you may expect serious downtime in the process.