feat: Add Hyper-V support #1775

Open · rmvangun wants to merge 4 commits into main from feat/platform-hyperv

@rmvangun (Contributor) commented May 10, 2026

Note

Medium Risk

Introduces a new platform target and modifies a shared kustomize patch; the externalTrafficPolicy: Cluster change affects all existing NodePort deployments and has been accepted by the author as a known trade-off.

Overview

This PR introduces Hyper-V as a first-class Windsor platform, delivering a platform-hyperv facet, a compute/hyperv Terraform module, and a new cluster/talos/config module that generates per-node Talos machineconfigs and wraps them into CIDATA seed ISOs for VM boot-time delivery. It extends cluster/talos to accept pre-generated secrets so the CIDATA path and the cluster bootstrap step share the same CA, eliminating a redundant machineconfig apply on the Hyper-V path.

The most recent commit resolves the earlier NIC-naming fragility by changing var.network.interface in cluster/talos/config to default to the e* glob, which covers both eth0 and enX0 naming conventions across Talos versions. A moved block in cluster/talos/main.tf ensures state continuity when the new var.machine_secrets count gate is introduced to existing non-Hyper-V deployments. Two low-severity findings remain open: the talos provider in cluster/talos/config lacks a version constraint (the lock file covers normal use, but init -upgrade without it would pull latest), and bench_ip extraction uses colon-splitting that silently breaks for IPv6 cluster endpoints.

Test coverage is solid: both Terraform modules ship tftest.hcl suites and the facet has a .test.yaml covering the main topology combinations including NodePort multi-VM, custom CIDR, and gitops pull-mode variants.

Reviewed by Claude for commit 887a8a4.

@github-actions

Caution

🩺 Integration diagnosis — mode: upgrade, phase: upgrade

Commit: 93298d0 · Run: #25640902044

Likely cause: The PR refactored talos_machine_secrets.this into module.secrets[0].talos_machine_secrets.this without a Terraform moved {} block, causing the cluster CA to be regenerated on upgrade — breaking TLS trust with the already-running Talos node.

Evidence: The baseline succeeded and the cluster came up healthy. The upgrade-mode windsor up (step 24) failed after ~12 seconds. Post-failure talosctl version produced transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: Ed25519 verification failure" while trying to verify candidate authority certificate "talos") — a clear CA mismatch. The terraform state (windsor-state.tar.gz) confirms the resource was relocated to module.main.module.secrets[0].talos_machine_secrets.this, meaning Terraform destroyed and recreated talos_machine_secrets.this, issuing new CA key material. The running node retained the old CA, so every subsequent Talos API call (machine config apply, bootstrap, health check) was rejected at the TLS layer. The controlplane container log also shows a cascade of TLS failures: remote error: tls: internal error on kubelet health probes.

Suggested next step: Add a moved { from = talos_machine_secrets.this to = module.secrets[0].talos_machine_secrets.this } block inside terraform/cluster/talos/main.tf so Terraform preserves the existing secrets resource address on upgrade instead of destroying and recreating it.
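
The suggested block, written out as a minimal sketch. The two addresses are taken from the diagnosis above; its exact placement within the file is an assumption.

```hcl
# terraform/cluster/talos/main.tf
# Tells Terraform that the existing secrets resource now lives inside the
# count-gated module, so the state entry is renamed rather than destroyed
# and recreated (which would issue a new cluster CA).
moved {
  from = talos_machine_secrets.this
  to   = module.secrets[0].talos_machine_secrets.this
}
```

With this in place, a `terraform plan` against an existing deployment would be expected to report the resource as moved instead of scheduling a replacement.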

Live read-only inspection by Claude. Support bundle attached to the run artifacts.

@claude claude Bot left a comment

Automated review by Claude. See inline comments and the summary comment below.

Comment thread kustomize/gateway/base/envoy/nodeport/patches/helm-release.yaml
Comment thread terraform/cluster/talos/config/main.tf
Comment thread contexts/_template/facets/platform-hyperv.yaml Outdated
rmvangun added 2 commits May 12, 2026 21:38
…_block and adjust test cases for new IP addresses

@claude claude Bot left a comment

Automated review by Claude. See inline comments and the summary in the PR description.

source = "windsorcli/hyperv"
version = "~> 0.2"
}
talos = {

Talos provider has no version constraint

Low Severity · terraform/cluster/talos/config/main.tf:36

The hyperv provider in the same block pins ~> 0.2, but talos carries no version field. The lock file currently captures 0.11.0, which is fine for everyday use. Running terraform init -upgrade or initializing a fresh workspace without the lock file will pull the latest available version, which could silently introduce breaking argument-shape changes if the upstream provider makes a major bump. Adding version = "~> 0.11" would keep the declared constraint aligned with the lock file entry.
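
A sketch of what the constrained block might look like. The `hyperv` entry is copied from the diff context above; the `talos` source address is not visible in the excerpt, so `siderolabs/talos` is an assumption here.

```hcl
terraform {
  required_providers {
    hyperv = {
      source  = "windsorcli/hyperv"
      version = "~> 0.2"
    }
    talos = {
      source  = "siderolabs/talos" # assumed; the source line is not shown in the excerpt
      version = "~> 0.11"          # permits >= 0.11.0 and < 1.0.0, consistent with the 0.11.0 lock entry
    }
  }
}
```

The pessimistic operator keeps `init -upgrade` within the 0.x series, so a future 1.0 release would need an explicit constraint change to be picked up.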

nat_internal_prefix: "${network.cidr_block ?? '10.5.0.0/16'}"
nat_prefix_length: "${split(network.cidr_block ?? '10.5.0.0/16', '/')[1]}"
nat_host_address: "${cidrhost(network.cidr_block ?? '10.5.0.0/16', 1)}"
bench_ip: "${(cluster.endpoint ?? '') != '' ? split(split(cluster.endpoint, '://')[1], ':')[0] : ''}"

bench_ip extraction fails for IPv6 cluster endpoints

Low Severity · contexts/_template/facets/platform-hyperv.yaml:109

The expression splits cluster.endpoint on :// then on : to extract the host portion. For an IPv4 endpoint like https://192.168.3.77:6443 this is correct, but for https://[::1]:6443 it returns [ (the first colon hit is inside the bracket). Port-forward rules written with [ as the external target would silently fail at the Windows netsh / Add-NetNatStaticMapping level. If IPv6 bench addresses are out of scope for this target, noting it in the cluster.endpoint schema description would prevent silent misconfiguration when someone tries it.
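
To illustrate the difference, here is a sketch in plain Terraform syntax. It is not facet syntax: the facet uses its own expression language, whose `split` takes the string first, and whether a regex function is available there is not confirmed. The endpoint values and local names are hypothetical.

```hcl
locals {
  # Hypothetical sample endpoints.
  endpoint_v4 = "https://192.168.3.77:6443"
  endpoint_v6 = "https://[::1]:6443"

  # Naive colon split, mirroring the facet expression.
  # Note that Terraform's split takes the separator first.
  naive_v4 = split(":", split("://", local.endpoint_v4)[1])[0] # "192.168.3.77"
  naive_v6 = split(":", split("://", local.endpoint_v6)[1])[0] # "["

  # Bracket-aware extraction: capture either a bracketed IPv6 literal
  # or a run of characters up to the first colon or slash.
  host_pattern = "^[a-z]+://(\\[[^\\]]+\\]|[^:/]+)"

  host_v4 = regex(local.host_pattern, local.endpoint_v4)[0] # "192.168.3.77"
  host_v6 = regex(local.host_pattern, local.endpoint_v6)[0] # "[::1]"

  # Strip the brackets if the consumer expects a bare address.
  bare_v6 = trimsuffix(trimprefix(local.host_v6, "["), "]") # "::1"
}
```

Whether the downstream port-forward rules expect the bracketed or the bare form is left open here; the sketch only shows that the host can be recovered without splitting on the first colon.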
