Background and inspiration#
I’ve been in the IT industry long enough to have heard countless discussions about immutable operating systems. While the concept is intriguing, the reality for many of us is that SSH remains a cornerstone of system management. We log in, execute commands, resolve issues, and log out—no persistence, no GitOps, nothing fancy.
With the increasing adoption of container technologies and public cloud services, it’s clear that moving away from traditional setups can reduce toil and enhance system maintainability. However, this shift often comes with its own set of challenges, such as increased complexity and costs. Additionally, some organizations can’t afford managed Kubernetes services like AKS, EKS, or GKE, and others are restricted from moving parts of their infrastructure to the cloud due to compliance reasons.
One particularly painful aspect of managing an on-premise Kubernetes (K8s) cluster is maintaining the control planes and worker nodes. Tasks such as patching vulnerabilities or ensuring compliance with CIS benchmarks can be tedious and error-prone. In this context, Talos Linux presents a compelling solution.
Talos Linux is a minimal, immutable Linux distribution designed specifically for Kubernetes. It eliminates SSH access to the VM, which prevents “ninja-ing” on the OS, leading to a more stable and secure foundation for your cluster. With only a small subset of binaries present, Talos Linux reduces the attack surface, making your Security Officer happy by not having to patch vulnerabilities in libraries you never knew existed, let alone were installed by default in your OS of choice.
In this post, I will guide you through setting up a K8s cluster from scratch using Vagrant and Talos Linux. This cluster will serve as the base for future blog posts, allowing me to quickly spin up a fresh cluster for new topics or recreate an empty environment whenever needed.
The technical stuff#
Target architecture#
To visualize the target architecture for our Talos cluster setup, I’ve created a simple diagram to clarify the components discussed in this post. We will be using Ubuntu 22.04 as the host OS, and Vagrant version 2.4.1 along with the vagrant-libvirt plugin version 0.12.2. To enhance the performance of the QEMU VMs, the kvm kernel module is enabled on my system.

Creating Talos VMs using Vagrant + Libvirt#
If you want to follow along, make sure you have Vagrant installed on your computer. I’ll refer you to HashiCorp’s docs and assume you’ll figure this part out by yourself. In the next steps, we will create our lab environment. I’m following along with this post and will do some exploration of the talosctl command after the cluster is up and running.
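Before starting, it can save some head-scratching to confirm that the tools used in this walkthrough are on your PATH. The helper below is a minimal sketch; the name require_commands is my own, and the tool list in the comment is simply what the later steps call.

```shell
# require_commands CMD...
# Reports every command that is missing from PATH on stderr and
# returns 1 if at least one is missing, 0 otherwise.
require_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example for this post (adjust to your own setup):
#   require_commands vagrant virsh wget curl
```

It only checks that the binaries exist; it does not verify versions or that the vagrant-libvirt plugin is installed.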
- Download the Talos ISO to /tmp

```shell
$ wget --timestamping --directory-prefix=/tmp https://github.com/siderolabs/talos/releases/download/v1.7.5/metal-amd64.iso
```
- Create the Vagrantfile to start three control plane nodes and one worker, and save it as ‘Vagrantfile’. If you don’t have that many resources available, you could use just one control plane node and allow scheduling on it by setting cluster.allowSchedulingOnControlPlanes: true in the machine-config we create later on.
```ruby
Vagrant.configure("2") do |config|
  # Three control plane nodes: 2 vCPUs and 2048 MB of memory each
  config.vm.define "control-plane-node-1" do |vm|
    vm.vm.provider :libvirt do |domain|
      domain.cpus = 2
      domain.memory = 2048
      # Attach the Talos ISO as a CD-ROM and add an empty 4G install disk
      domain.storage :file, :device => :cdrom, :path => "/tmp/metal-amd64.iso"
      domain.storage :file, :size => '4G', :type => 'raw'
      # Boot order: disk first, then the ISO
      domain.boot 'hd'
      domain.boot 'cdrom'
    end
  end
  config.vm.define "control-plane-node-2" do |vm|
    vm.vm.provider :libvirt do |domain|
      domain.cpus = 2
      domain.memory = 2048
      domain.storage :file, :device => :cdrom, :path => "/tmp/metal-amd64.iso"
      domain.storage :file, :size => '4G', :type => 'raw'
      domain.boot 'hd'
      domain.boot 'cdrom'
    end
  end
  config.vm.define "control-plane-node-3" do |vm|
    vm.vm.provider :libvirt do |domain|
      domain.cpus = 2
      domain.memory = 2048
      domain.storage :file, :device => :cdrom, :path => "/tmp/metal-amd64.iso"
      domain.storage :file, :size => '4G', :type => 'raw'
      domain.boot 'hd'
      domain.boot 'cdrom'
    end
  end
  # One worker node: 1 vCPU and 1024 MB of memory
  config.vm.define "worker-node-1" do |vm|
    vm.vm.provider :libvirt do |domain|
      domain.cpus = 1
      domain.memory = 1024
      domain.storage :file, :device => :cdrom, :path => "/tmp/metal-amd64.iso"
      domain.storage :file, :size => '4G', :type => 'raw'
      domain.boot 'hd'
      domain.boot 'cdrom'
    end
  end
end
```
- Run vagrant up and vagrant status to check the status of the VMs

```shell
$ vagrant status
Current machine states:

control-plane-node-1      running (libvirt)
control-plane-node-2      running (libvirt)
control-plane-node-3      running (libvirt)
worker-node-1             running (libvirt)
```
- Get the IP addresses of the VMs by running the following command

```shell
$ virsh list | awk '/node/ {print $2}' | xargs -t -L1 virsh domifaddr
virsh domifaddr talos_control-plane-node-3
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet16     52:54:00:bf:ea:9d    ipv4         192.168.121.75/24

virsh domifaddr talos_control-plane-node-2
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet17     52:54:00:70:78:c3    ipv4         192.168.121.42/24

virsh domifaddr talos_worker-node-1
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet18     52:54:00:1b:f2:4a    ipv4         192.168.121.254/24

virsh domifaddr talos_control-plane-node-1
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet19     52:54:00:d3:5c:44    ipv4         192.168.121.176/24
```
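Since we will be copying these addresses into several commands, a compact "domain IP" listing is handy. The function below is a rough sketch (domain_ips is a name I made up) that condenses the output above. It assumes the "virsh domifaddr <domain>" lines echoed by xargs -t are part of the stream; xargs writes those to stderr, so stderr has to be merged into the pipe.

```shell
# domain_ips: reads the combined output of
#   xargs -t -L1 virsh domifaddr
# on stdin and prints one "<domain> <ipv4>" pair per interface.
domain_ips() {
  awk '
    # Line echoed by xargs -t: remember which domain the next table is for
    $1 == "virsh" && $2 == "domifaddr" { domain = $3; next }
    # Table row: strip the /prefix-length from the address column
    $3 == "ipv4" { split($4, cidr, "/"); print domain, cidr[1] }
  '
}

# Usage (2>&1 pulls the xargs -t echo into the pipe):
#   virsh list | awk '/node/ {print $2}' | xargs -t -L1 virsh domifaddr 2>&1 | domain_ips
```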
- Get the talosctl binary installed on your system. I like to live dangerously and just pipe it straight to sh.

```shell
$ curl -sL https://talos.dev/install | sh
```
- We should be able to interact with one of our VMs, let’s get the disk status of control-plane-node-1

```shell
$ talosctl -n 192.168.121.176 disks --insecure
DEV        MODEL   SERIAL   TYPE   UUID   WWID   MODALIAS                    NAME   SIZE     BUS_PATH                            SUBSYSTEM          READ_ONLY   SYSTEM_DISK
/dev/vda   -       -        HDD    -      -      virtio:d00000002v00001AF4   -      4.3 GB   /pci0000:00/0000:00:03.0/virtio0/   /sys/class/block
```
Cool! We have successfully set up the VMs and can interact with them using the Talos API. The next step would be to install the actual OS itself since we’ve only created some empty VMs that are booted from the ISO. We need to persist the OS installation to disk, right? :)
Persisting OS to disk and bootstrapping our Kubernetes cluster#
For a highly available Kubernetes cluster, you need at least three control plane nodes, so that etcd keeps its quorum and can elect a new leader when one of your nodes fails. One of the pain points when building a high-availability control plane is giving clients a single IP or URL at which they can reach any of the control plane nodes.
Fortunately, Talos Linux supports a “Virtual” IP (VIP) address to access the Kubernetes API server, providing high availability with no other resources required.
Let’s pick an IP address in the same subnet as our VMs that none of them is using yet. I’m going with 192.168.121.100/24.
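Picking the VIP by eye works fine for four VMs, but the check is easy to script. The sketch below uses a made-up helper name (vip_is_free) and assumes a /24 network, so "same subnet" simply means "same first three octets". It only compares against the addresses you pass in; it knows nothing about the DHCP pool or other hosts on the network.

```shell
# vip_is_free CANDIDATE USED_IP...
# Succeeds when CANDIDATE shares its first three octets with every USED_IP
# (same /24) and is not equal to any of them.
vip_is_free() {
  local candidate=$1 prefix ip
  shift
  prefix=${candidate%.*}
  for ip in "$@"; do
    [ "${ip%.*}" = "$prefix" ] || return 1   # different /24
    [ "$ip" != "$candidate" ] || return 1    # already taken
  done
  return 0
}

# Usage with the addresses from this post:
#   vip_is_free 192.168.121.100 192.168.121.176 192.168.121.42 192.168.121.75 192.168.121.254 \
#     && echo "VIP looks free"
```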
- Use talosctl to generate a machine-config that we later push to our control plane nodes

```shell
$ talosctl gen config my-cluster https://192.168.121.100:6443 --install-disk /dev/vda
```
This command will generate three files:
- controlplane.yaml: Defines how the control planes should be configured.
- worker.yaml: Defines how the worker node should be configured.
- talosconfig: The client configuration file. Contains the Root CA, client certificate, and private key to access the Talos API.
I haven’t fully worked this out yet, but I think the .yaml files are the ones you want to store in Git, after moving the secrets they contain (private keys and tokens) into a secret management solution such as HashiCorp Vault or AWS Secrets Manager.
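The walkthrough continues by editing controlplane.yaml by hand to add the VIP. As a sketch of an alternative, talosctl gen config can merge a patch file into the control plane config at generation time via --config-patch-control-plane. The file name vip-patch.yaml is my own choice, and the patch content mirrors the VIP settings used in this post.

```shell
# Write the VIP settings to a standalone patch file
# (vip-patch.yaml is a hypothetical name).
cat > vip-patch.yaml <<'EOF'
machine:
  network:
    interfaces:
      - deviceSelector:
          busPath: "0*"
        dhcp: true
        vip:
          ip: 192.168.121.100
EOF

# Then run this INSTEAD of the plain gen config call above, not in addition
# to it (a second run would generate a fresh set of secrets):
#   talosctl gen config my-cluster https://192.168.121.100:6443 \
#     --install-disk /dev/vda \
#     --config-patch-control-plane @vip-patch.yaml
```

The upside is that the generated files never need manual edits, and the patch is small enough to review and keep in Git.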
- Add the virtual IP to the machine network section of controlplane.yaml and push the configuration to our initial control plane node. I’m going with control-plane-node-1/192.168.121.176.

```yaml
machine:
  network:
    interfaces:
      - deviceSelector:
          busPath: "0*" # should select any hardware network device, if you have just one, it will be selected
        dhcp: true
        vip:
          ip: 192.168.121.100
```

After saving, apply the machine-config to the node.

```shell
$ talosctl -n 192.168.121.176 apply-config --insecure --file controlplane.yaml
```
- Now, let’s update the talosconfig to set the endpoint IPs of our control planes.

```shell
$ export TALOSCONFIG=$(realpath ./talosconfig)
$ talosctl config endpoint 192.168.121.176 192.168.121.42 192.168.121.75
```
- Finally, bootstrap the cluster. Run this only once, against a single control plane node that already has its machine-config applied.

```shell
$ talosctl -n 192.168.121.176 bootstrap
```

If you get the following error, the node has not finished applying the machine-config yet. Wait a moment and try again:
error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority"
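Instead of re-running the command by hand until the node is ready, a small retry wrapper can do the waiting. This is a generic sketch (retry is my own helper, not part of talosctl), and it retries on any failure, not only on the TLS error shown above, so keep the attempt count modest.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between tries.
# Returns 1 after ATTEMPTS failed tries.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      echo "giving up after $i attempts: $*" >&2
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage: try the bootstrap up to 20 times, 15 seconds apart
#   retry 20 15 talosctl -n 192.168.121.176 bootstrap
```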
- If the bootstrap was successful, we can apply the machine-configs to the other nodes as well.

```shell
$ talosctl -n 192.168.121.42 apply-config --insecure --file controlplane.yaml
$ talosctl -n 192.168.121.75 apply-config --insecure --file controlplane.yaml
$ talosctl -n 192.168.121.254 apply-config --insecure --file worker.yaml
```
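With more nodes, it gets easy to push worker.yaml to a control plane by accident. One way to guard against that is to keep a small "IP role" inventory and generate the commands from it. The sketch below (apply_config_commands is a hypothetical helper) only prints the commands, so you can review them before piping the output to sh.

```shell
# apply_config_commands: reads "<ip> <role>" lines on stdin and prints one
# talosctl apply-config command per node, picking the file by role.
apply_config_commands() {
  local ip role file
  while read -r ip role; do
    case "$role" in
      controlplane) file=controlplane.yaml ;;
      worker)       file=worker.yaml ;;
      *) echo "unknown role for $ip: $role" >&2; return 1 ;;
    esac
    echo "talosctl -n $ip apply-config --insecure --file $file"
  done
}

# Usage with the remaining nodes from this post:
#   printf '%s\n' '192.168.121.42 controlplane' '192.168.121.75 controlplane' \
#     '192.168.121.254 worker' | apply_config_commands
```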
- After a while, you should see that all the members have joined:

```shell
$ talosctl -n 192.168.121.176 get members
NODE              NAMESPACE   TYPE     ID              VERSION   HOSTNAME        MACHINE TYPE   OS               ADDRESSES
192.168.121.176   cluster     Member   talos-6ut-95h   1         talos-6ut-95h   controlplane   Talos (v1.7.5)   ["192.168.121.176"]
192.168.121.176   cluster     Member   talos-efm-hy5   1         talos-efm-hy5   controlplane   Talos (v1.7.5)   ["192.168.121.42"]
192.168.121.176   cluster     Member   talos-1a3-57h   1         talos-1a3-57h   controlplane   Talos (v1.7.5)   ["192.168.121.75"]
192.168.121.176   cluster     Member   talos-m7j-04a   1         talos-m7j-04a   worker         Talos (v1.7.5)   ["192.168.121.254"]
```
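The member list also tells us how much failure the control plane can absorb. etcd needs a majority (quorum) of its members to stay available, which is floor(n/2) + 1 for n members. The sketch below (etcd_fault_tolerance is a name I made up) counts the controlplane rows in the output above and does that arithmetic; it assumes every control plane node runs an etcd member and that the machine type sits in the seventh column.

```shell
# etcd_fault_tolerance: reads `talosctl get members` output on stdin and
# prints the control plane count, the etcd quorum size, and how many
# control plane nodes can fail while quorum is kept.
etcd_fault_tolerance() {
  awk '
    $7 == "controlplane" { n++ }
    END {
      quorum = int(n / 2) + 1
      printf "control_planes=%d quorum=%d tolerated_failures=%d\n",
             n, quorum, (n > 0 ? n - quorum : 0)
    }
  '
}

# Usage:
#   talosctl -n 192.168.121.176 get members | etcd_fault_tolerance
```

For the three control planes in this lab, that works out to a quorum of two, so one node can fail. A fourth control plane would not improve that, which is why odd numbers are the usual choice.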
Getting the kubeconfig#
Now that our Talos cluster is alive, it’s time to get our kubeconfig so we can interact with Kubernetes.
- Execute the talosctl kubeconfig command against one of your control plane nodes

```shell
$ talosctl -n 192.168.121.176 kubeconfig ./kubeconfig
```
- We can see that Talos correctly set our control plane VIP to https://192.168.121.100:6443

```shell
$ yq '.clusters[].cluster.server' ./kubeconfig
https://192.168.121.100:6443
```
- Let’s verify that we can kubectl get node, pointing kubectl at the kubeconfig we just saved :)

```shell
$ kubectl --kubeconfig ./kubeconfig get node
NAME            STATUS   ROLES           AGE     VERSION
talos-1a3-57h   Ready    control-plane   3m18s   v1.30.1
talos-6ut-95h   Ready    control-plane   7m19s   v1.30.1
talos-efm-hy5   Ready    control-plane   2m51s   v1.30.1
talos-m7j-04a   Ready    <none>          3m18s   v1.30.1
```
Amazing, our Talos cluster is alive! Let’s play with some of the talosctl commands to get some information from the nodes.
Useful Talos commands#
- Snapshotting etcd is fairly simple using the following command. The snapshot file is saved on the machine where talosctl runs.

```shell
$ talosctl -n 192.168.121.176 etcd snapshot <filename>
```
- We can request directory listings using the ’list’ command. Let’s see how many binaries are in /usr/bin and /bin

```shell
$ talosctl -n 192.168.121.176 list /usr/bin
NODE              NAME
192.168.121.176   .
192.168.121.176   udevadm

$ talosctl -n 192.168.121.176 list /bin
NODE              NAME
192.168.121.176   .
192.168.121.176   containerd
192.168.121.176   containerd-shim
192.168.121.176   containerd-shim-runc-v2
192.168.121.176   runc
```
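To put a number on "very minimal", the listing can be counted instead of eyeballed. A small sketch, with count_entries being my own helper name; it assumes the output format shown above, with one header row and a "." entry for the directory itself.

```shell
# count_entries: counts directory entries in `talosctl list` output on stdin,
# skipping the header row and the "." entry.
count_entries() {
  awk 'NR > 1 && $2 != "." { n++ } END { print n + 0 }'
}

# Usage (run it once per directory):
#   talosctl -n 192.168.121.176 list /bin | count_entries
```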
As you can see, the OS is very minimal! I like it :)
- We can get the running processes using the “processes” command

```shell
$ talosctl -n 192.168.121.176 processes | grep kube-apiserver
192.168.121.176 2317 S 9 91.59 1.5 GB 277 MB /usr/local/bin/kube-apiserver --admission-control-config-file=/system/config/kubernetes/kube-apiserver/admission-control-config.yaml --advertise-address=192.168.121.176 --allow-privileged=true --anonymous-auth=false --api-audiences=https://192.168.121.100:6443 --audit-log-maxage=30 --audit-log-maxbackup=10 --audit-log-maxsize=100 --audit-log-path=/var/log/audit/kube/kube-apiserver.log --audit-policy-file=/system/config/kubernetes/kube-apiserver/auditpolicy.yaml --authorization-mode=Node,RBAC --bind-address=0.0.0.0 --client-ca-file=/system/secrets/kubernetes/kube-apiserver/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --encryption-provider-config=/system/secrets/kubernetes/kube-apiserver/encryptionconfig.yaml --etcd-cafile=/system/secrets/kubernetes/kube-apiserver/etcd-client-ca.crt --etcd-certfile=/system/secrets/kubernetes/kube-apiserver/etcd-client.crt --etcd-keyfile=/system/secrets/kubernetes/kube-apiserver/etcd-client.key --etcd-servers=https://localhost:2379 --kubelet-client-certificate=/system/secrets/kubernetes/kube-apiserver/apiserver-kubelet-client.crt --kubelet-client-key=/system/secrets/kubernetes/kube-apiserver/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --profiling=false --proxy-client-cert-file=/system/secrets/kubernetes/kube-apiserver/front-proxy-client.crt --proxy-client-key-file=/system/secrets/kubernetes/kube-apiserver/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/system/secrets/kubernetes/kube-apiserver/aggregator-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://192.168.121.100:6443 --service-account-key-file=/system/secrets/kubernetes/kube-apiserver/service-account.pub --service-account-signing-key-file=/system/secrets/kubernetes/kube-apiserver/service-account.key --service-cluster-ip-range=10.96.0.0/12 --tls-cert-file=/system/secrets/kubernetes/kube-apiserver/apiserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-min-version=VersionTLS12 --tls-private-key-file=/system/secrets/kubernetes/kube-apiserver/apiserver.key
```
- Fetching some memory information about the host

```shell
$ talosctl -n 192.168.121.176 memory
NODE              TOTAL   USED   FREE   SHARED   BUFFERS   CACHE   AVAILABLE
192.168.121.176   1953    671    116    64       3         1160    1208
```
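Raw numbers are harder to compare across nodes than a percentage. The sketch below (memory_used_percent is a hypothetical helper) derives a usage percentage from the output above as USED divided by TOTAL. Note that this ratio ignores buffers and cache, so it is only a rough indicator.

```shell
# memory_used_percent: reads `talosctl memory` output on stdin and prints
# "<node> <used-percent>" per node, computed as USED / TOTAL.
memory_used_percent() {
  awk 'NR > 1 && $2 > 0 { printf "%s %.1f\n", $1, $3 * 100 / $2 }'
}

# Usage:
#   talosctl -n 192.168.121.176 memory | memory_used_percent
```

For the node above, 671 out of 1953 comes to roughly 34%.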
- Getting dmesg logs, we could even use -f to follow the log stream

```shell
$ talosctl -n 192.168.121.176 dmesg -f
<---SNIP--->
192.168.121.176: kern: info: [2024-07-24T19:24:52.119964041Z]: cni0: port 1(veth73a2de7b) entered blocking state
192.168.121.176: kern: info: [2024-07-24T19:24:52.120919041Z]: cni0: port 1(veth73a2de7b) entered disabled state
192.168.121.176: kern: info: [2024-07-24T19:24:52.121837041Z]: veth73a2de7b: entered allmulticast mode
192.168.121.176: kern: info: [2024-07-24T19:24:52.122678041Z]: veth73a2de7b: entered promiscuous mode
192.168.121.176: kern: info: [2024-07-24T19:24:52.162855041Z]: cni0: port 1(veth73a2de7b) entered blocking state
192.168.121.176: kern: info: [2024-07-24T19:24:52.162860041Z]: cni0: port 1(veth73a2de7b) entered forwarding state
192.168.121.176: kern: info: [2024-07-24T19:24:55.130502041Z]: cni0: port 2(vethfa968507) entered blocking state
192.168.121.176: kern: info: [2024-07-24T19:24:55.132040041Z]: cni0: port 2(vethfa968507) entered disabled state
192.168.121.176: kern: info: [2024-07-24T19:24:55.133490041Z]: vethfa968507: entered allmulticast mode
192.168.121.176: kern: info: [2024-07-24T19:24:55.134858041Z]: vethfa968507: entered promiscuous mode
192.168.121.176: kern: info: [2024-07-24T19:24:55.183104041Z]: cni0: port 2(vethfa968507) entered blocking state
192.168.121.176: kern: info: [2024-07-24T19:24:55.184078041Z]: cni0: port 2(vethfa968507) entered forwarding state
192.168.121.176: user: warning: [2024-07-24T19:39:52.571932041Z]: [talos] pcap: packets captured 1095, polls 50, socket stats: drops 0, packets 1127, queue freezes 0
192.168.121.176: user: warning: [2024-07-24T19:40:07.867826041Z]: [talos] pcap: packets captured 195, polls 12, socket stats: drops 0, packets 205, queue freezes 0
```
- Getting Talos services status

```shell
$ talosctl -n 192.168.121.176 services
NODE              SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
192.168.121.176   apid         Running   OK       28m15s ago    Health check successful
192.168.121.176   containerd   Running   OK       28m22s ago    Health check successful
192.168.121.176   cri          Running   OK       28m19s ago    Health check successful
192.168.121.176   dashboard    Running   ?        28m21s ago    Process Process(["/sbin/dashboard"]) started with PID 1788
192.168.121.176   etcd         Running   OK       27m43s ago    Health check successful
192.168.121.176   kubelet      Running   OK       24m21s ago    Health check successful
192.168.121.176   machined     Running   OK       28m27s ago    Health check successful
192.168.121.176   syslogd      Running   OK       28m26s ago    Health check successful
192.168.121.176   trustd       Running   OK       28m14s ago    Health check successful
192.168.121.176   udevd        Running   OK       28m26s ago    Health check successful
```
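When checking several nodes, it is quicker to list only the services that are not reporting OK. A minimal sketch, assuming the column layout shown above (unhealthy_services is a made-up name):

```shell
# unhealthy_services: reads `talosctl services` output on stdin and prints
# "<service> <health>" for every service whose HEALTH column is not "OK".
unhealthy_services() {
  awk 'NR > 1 && $4 != "OK" { print $2, $4 }'
}

# Usage:
#   talosctl -n 192.168.121.176 services | unhealthy_services
```

On the node above this prints only the dashboard service, whose health shows as "?" rather than OK.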
Conclusion#
In this blog post, I’ve demonstrated that managing an on-premise Kubernetes cluster doesn’t have to be overly complex. Many companies may not be aware of solutions like Talos Linux, or they might be hesitant to adopt them. Fortunately, Sidero Labs, the company behind Talos, offers enterprise support, which can alleviate some of the concerns organizations may have. Embracing Talos is a significant opportunity to simplify infrastructure management. By managing infrastructure as code with Talos Linux, organizations can create more secure, immutable systems and a more resilient foundation for their infrastructure.