
docs: audit cluster management section #485

Open
Iheanacho-ai wants to merge 1 commit into siderolabs:main from Iheanacho-ai:cluster-management

Conversation

@Iheanacho-ai
Member

No description provided.

@github-project-automation github-project-automation bot moved this to To Do in Planning Apr 9, 2026
@talos-bot talos-bot moved this from To Do to In Review in Planning Apr 9, 2026
@Iheanacho-ai Iheanacho-ai marked this pull request as draft April 9, 2026 07:29
@Iheanacho-ai Iheanacho-ai force-pushed the cluster-management branch 9 times, most recently from 7685b14 to eaab1ce on April 10, 2026 14:13
Signed-off-by: Amarachi Iheanacho <amarachi.iheanacho@siderolabs.com>
@Iheanacho-ai Iheanacho-ai marked this pull request as ready for review April 10, 2026 16:45

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

```yaml
machines:
- <existing-worker-uuid>
- <new-worker-uuid> # add the new machine UUID here
```
Member


You should add a section for when someone is using a machine set. In that case it's just a number change (no UUID)
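For the machine-set case, the doc could show something like this (a sketch assuming Omni's cluster-template `machineClass` syntax; the class name is a placeholder):

```yaml
kind: Workers
name: workers
machineClass:
  name: my-machine-class # placeholder machine class name
  size: 3                # scaling is just changing this number; no UUIDs
```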

To remove a control plane node:

```yaml
kind: ControlPlane
```
Member


nit: your example above puts the ControlPlane example first and the workers second.

import { version } from '/snippets/custom-variables.mdx';

Refer to the general guide on creating a cluster to get started. To create a hybrid cluster, navigate to the cluster, then apply the following cluster patch by clicking on "Config Patches", and create a new patch with the target of "Cluster":
A hybrid cluster is a Kubernetes cluster whose nodes span multiple networks or infrastructure types, for example, a mix of bare metal machines, cloud virtual machines, on-premises virtual machines, or single-board computers (SBCs).
Member


SBCs aren't a different type from bare metal.

A hybrid cluster is a Kubernetes cluster whose nodes span multiple networks or infrastructure types, for example, a mix of bare metal machines, cloud virtual machines, on-premises virtual machines, or single-board computers (SBCs).

<img src="./images/create-a-hybrid-cluster-create-patch-kubescan-enabled.png" alt="Create Patch"/>
By default, Kubernetes assumes all nodes can reach each other directly on the same network. When nodes are spread across different networks, this assumption breaks down. <a href={`../../talos/${version}/networking/kubespan`}>Kubespan</a> addresses this by establishing an encrypted WireGuard tunnel between every node in the cluster, so that all nodes can communicate securely regardless of where they are hosted.
Member


Suggested change
By default, Kubernetes assumes all nodes can reach each other directly on the same network. When nodes are spread across different networks, this assumption breaks down. <a href={`../../talos/${version}/networking/kubespan`}>Kubespan</a> addresses this by establishing an encrypted WireGuard tunnel between every node in the cluster, so that all nodes can communicate securely regardless of where they are hosted.
Kubernetes requires that all nodes be able to reach each other directly, without NAT. When nodes are spread across different networks, this assumption breaks down. <a href={`../../talos/${version}/networking/kubespan`}>KubeSpan</a> addresses this by establishing an encrypted WireGuard tunnel between every node in the cluster. The tunnel flattens the network so all nodes can communicate securely regardless of where they are hosted.

3. Select **Config Patches** from the dropdown.
4. Click **Create Patch** to open the **Create Patch** page.
5. Apply the following patch:
```yaml
# ...
```
Member


Talos 1.13 is going to have a multi-doc config, so we might want to add tabs for that now. This config will still work, but the new config types are recommended.

</Tab>
</Tabs>

Once this patch is applied, all node-to-node traffic in the cluster will be encrypted using WireGuard, allowing nodes to communicate with each other securely regardless of which network they are on.
Member


We should add a warning that network throughput with WireGuard is significantly lower than without it. If people want native network throughput between nodes on the same network, they need to set up the `filters.excludeAdvertisedNetworks` configuration.
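A sketch of the patch such a warning could show (the `filters.excludeAdvertisedNetworks` key is taken from this comment; its exact placement under the KubeSpan config is an assumption to verify against the Talos reference):

```yaml
machine:
  network:
    kubespan:
      enabled: true
      filters:
        # Assumed placement: networks listed here are not advertised over
        # KubeSpan, so same-network traffic keeps native throughput.
        excludeAdvertisedNetworks:
          - 192.168.1.0/24
```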

@@ -1,45 +1,77 @@
---
title: Expose an HTTP Service from a Cluster
Member


Suggested change
title: Expose an HTTP Service from a Cluster
title: Expose a Workload via Service Proxy

I'm not sure if this would be easier to search for or clearer about what the document is for. The original title could apply to load balancers or any other way to expose a service.

omni-kube-service-exposer.sidero.dev/label: Sample Nginx
#omni-kube-service-exposer.sidero.dev/prefix: myservice
omni-kube-service-exposer.sidero.dev/prefix: myservice
omni-kube-service-exposer.sidero.dev/icon: H4sICB0B1mQAA25naW54LXN2Z3JlcG8tY29tLnN2ZwBdU8ly2zAMvfcrWPZKwiTANWM5015yyiHdDr1kNLZsa0axvKix8/cFJbvNdCRCEvEAPDxQ8/vLSydem+Op7XeVtGCkaHbLftXuNpX8Pax1kveL+UetxY9919erZiWG/k58+/kgvjb7Xonz+Qyn182RP2DZvyjx0OyaYz30x38o8dhemqP43vfdSWi9+DDnCHFuV8O2ksmY/UWKbdNutsPfz9e2OX/pL5U0wghCvqVgqrtTJbfDsL+bzUrhM0F/3MzQGDPjlHIxH9qhaxbrtmueh7d987zbtLvLfDZtz/f1sBWrSj5aD9klhVswwdfWgLNJXR+GL6sgRwSP6QmRd53yELzCCMmRShCjqyFmLOsWwCiIKS01GJOUA0qZHQUby5ZXlsAGjkv8wmuK00A+gDfxoD1DSREQOm0teBdVgOA4wqdY1i0i+AiG4lOGbFEhg7icZWJIgCMz+It1DA/hYDQXScxVjyyohpCprBt7SswylJze49htVNxQjk6xDuSXTAs12OQgUGLWMRenLj4pTsNb11SSde/uPhmbA2U5e6c3qxBiEdhTOhhO77CIwxvJ55p7NVlN1owX+xkOJhUb3M1OTuShAZpQIoK72mtcSF5bwExLoxECjsqzssgIzdMLB2IdiPViApHbsTwhH1KNkIgFHO2tTOB54pjfXu3k4QLechmK9lCGzfm9s0XbQtmWfqa4NB0Oo1lzVtUsx6wjKxtYBcKSMkJOyGzJBbYxBM0aBypZfdBRJyDCz0zNRjXZKw0D/J75KFApFvPVTt73kv/6b0Lr9bqMp/wziz8W9M/pAwQAAA==
Member


This renders without colors and looks weird. Does it need to be quoted?
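Incidentally, the `H4sIC...` prefix indicates the value is a gzip-compressed, base64-encoded SVG (the gzip magic bytes base64-encode to `H4sI`). If we document regenerating it, a sketch (assuming that encoding; `-w0` is GNU base64):

```bash
# Write a trivial SVG, then gzip + base64 it the way the icon annotation value looks
printf '%s' "<svg xmlns='http://www.w3.org/2000/svg'></svg>" > icon.svg
gzip -c icon.svg | base64 -w0
```

The output starts with `H4sI`, matching the annotation value above.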


spec:
containers:
- name: workload-proxy-example-nginx
image: nginx:stable-alpine-slim
Member


we should switch this image to our example workload https://docs.siderolabs.com/talos/v1.12/getting-started/deploy-first-workload

```

### Troubleshooting
## Troubleshoot exposed services
Member


There's also a Kubernetes workload that runs in the cluster which they can inspect for troubleshooting. We should mention that somewhere and show them how to look at the logs.
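Something along these lines could work for that section (the workload name, namespace, and pod name are placeholders, not taken from the docs):

```bash
# Locate the exposer workload (name assumed; adjust to what is actually deployed)
kubectl get pods --all-namespaces | grep -i exposer

# Tail its logs using the namespace and pod name from the output above
kubectl logs --namespace <namespace> <pod-name> --tail=100
```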

---

This guide will walk you through the steps to import an existing Talos cluster into Omni.
If you have an existing Talos cluster running outside of Omni, you can import it so that Omni can manage it going forward. The import process connects your Talos nodes to Omni, preserves the existing cluster configuration as config patches, and registers the cluster as a managed resource, without resetting or disrupting the running workloads.
Member


Suggested change
If you have an existing Talos cluster running outside of Omni, you can import it so that Omni can manage it going forward. The import process connects your Talos nodes to Omni, preserves the existing cluster configuration as config patches, and registers the cluster as a managed resource, without resetting or disrupting the running workloads.
If you have existing Talos clusters running without Omni management, you can import them to be managed by Omni. The import process connects your Talos nodes to Omni, preserves the existing cluster configuration as config patches, and registers the cluster as a managed resource, without resetting or disrupting the running workloads.

<Info>
This is an experimental feature. It does not support Talos installations with custom built Linux kernel or custom built extensions.
</Info>
> **Note:** This is an experimental feature. Clusters with a custom-built Linux kernel or custom-built extensions are not supported. If your cluster uses either of these, do not proceed with this guide.
Member


I don't think this is considered experimental anymore


### Step 3: Unlock the cluster

When you are ready for Omni to begin managing the cluster, unlock it by running the following command, replacing `<cluster-name>` with the name of your cluster:
Member


We should mention what happens when a cluster is unlocked: the endpoint is changed to Omni and other patches will be applied. They can see pending changes in the Omni UI. I think there's a CLI to see the diff too.


Understanding how Omni handles schematics and config patches during import helps you anticipate the outcome and troubleshoot any issues that arise.

### Image schematic
Member


This should be higher in the guide and maybe set as a prereq


The import command uses the `--initial-talos-version` and `--initial-kubernetes-version` values to generate the default machine config that Omni would produce for each node. It then compares that default config against the actual config running on each node and generates a config patch representing the difference, effectively capturing all customisations made to the cluster since it was first created.

Certain machine config fields that are not permitted on Omni-managed clusters are excluded from the generated patches.
Member


We should provide examples

- **Use Option 2** if the cluster was unlocked and further modified after import, making the backed-up configs potentially out of date.

The cluster has to be in the `locked` state to be able to abort an import operation.
### Option 1: Restore from backup
Member


Maybe make these tabs

@@ -1,40 +1,46 @@
---
title: Restore Etcd of a Cluster Managed by Cluster Templates
Member


Is it different when the cluster isn't managed by cluster templates? Maybe we can shorten this title.

The output will look like this:
The output will look similar to this:

```
Member


Should this have formatting?


```bash
omnictl get clusteruuid my-cluster
omnictl get clusteruuid <cluster-name>
Member


You use `<cluster-name>` in 3 commands. Maybe we should export a variable so the other commands are copy/pastable.
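For example (only `omnictl get clusteruuid` is taken from the doc; the export pattern is the suggestion here):

```bash
# Set the cluster name once so the remaining commands are copy/pastable
export CLUSTER_NAME=<cluster-name>

omnictl get clusteruuid "${CLUSTER_NAME}"
```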

The output will look like this:
The output will look similar to this:

```
Member


formatting?

Omni upgrades control plane nodes first, verifying that the etcd cluster is healthy and will remain healthy after each node leaves the etcd cluster before proceeding.

> Note: you cannot lock control plane nodes, as it is not supported to have the Kubernetes version of a worker higher than that of the control plane nodes in a cluster - this may result in API version incompatibility.
For each node, Omni drains and cordons it, updates the OS, then uncordons it. All upgrades use the `--preserve=true` flag, which retains ephemeral data on the node.
Member


I think preserve has been the default behavior since ~1.10 (maybe 1.11). No flag is needed even with talosctl.

You can still point out that ephemeral data (including container images) and user volumes on the node are not erased.
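e.g. a note along these lines (the flags are standard `talosctl upgrade` usage; the node and version values are placeholders):

```bash
# On recent Talos, upgrades keep ephemeral data by default; no --preserve flag needed.
# Container images and user volumes on the node are not erased.
talosctl upgrade --nodes <node-ip> --image ghcr.io/siderolabs/installer:<talos-version>
```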


### What happens during a Kubernetes upgrade

Kubernetes upgrades are non-disruptive to workloads and proceed in the following order:
Member


Do workloads restart during the upgrade? For some reason I thought they did, which would be disruptive.


### Apply updated Kubernetes manifests

Omni does not automatically apply updates to Kubernetes bootstrap manifests during an upgrade.
Member


You may want to say what this includes: CoreDNS, kube-proxy, the CNI.


#### Format of the audit log
<Tabs>
<Tab title="UI">
Member


Omni 1.7 has an integrated audit log viewer.


Member


title: Audit logs
description: View and manage activity logs in Omni.
title: Audit Logs
description: View, configure, and interpret activity logs in Omni.
Member


We often get people asking how they can export their audit logs to a different log platform. We don't have a solution for them right now, but we should at least have a section that answers that question and update it when we do have a solution.


Projects

Status: In Review


3 participants