@@ -5,18 +5,25 @@ description: "Operational instructions for etcd database."
55
66The `etcd` database backs Kubernetes control plane state, so `etcd` health is critical for Kubernetes availability.
77
8+ > Note: Commands in the `talosctl etcd` namespace work only on Talos control plane nodes.
9+ > Wherever `<IPx>` appears on this page, it refers to the IP address of a control plane node.
10+
811## Space Quota
912
1013The `etcd` database space quota is set to 2 GiB by default.
1114If the database size exceeds the quota, `etcd` will stop operations until the issue is resolved.
1215
1316This condition can be checked with the `talosctl etcd alarm list` command:
1417
15- ``` bash
16- $ talosctl -n <IP> etcd alarm list
18+ {{< tabpane >}}
19+ {{< tab header="Command" lang="Bash" >}}
20+ talosctl -n <IP> etcd alarm list
21+ {{< /tab >}}
22+ {{< tab header="Output" lang="Console" >}}
1723NODE MEMBER ALARM
1824172.20.0.2 a49c021e76e707db NOSPACE
19- ```
25+ {{< /tab >}}
26+ {{< /tabpane >}}
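
The alarm check above can be scripted. The following is a minimal sketch, assuming the three-column `NODE MEMBER ALARM` layout shown in the output tab; `nospace_nodes` is a hypothetical helper name, not part of `talosctl`.

```bash
# Hypothetical helper: reads `talosctl etcd alarm list` output on stdin and
# prints every node that reports a NOSPACE alarm.
# Exit status is 1 if at least one such alarm was found, 0 otherwise.
nospace_nodes() {
  awk '
    # Skip the header line; the third column carries the alarm type.
    NR > 1 && $3 == "NOSPACE" { print $1; found = 1 }
    END { exit found ? 1 : 0 }'
}
```

A possible invocation is `talosctl -n <IP> etcd alarm list | nospace_nodes`, which prints nothing and returns success when no node is over quota.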
2027
2128If the Kubernetes database contains lots of resources, the space quota can be increased to match the actual usage.
2229The recommended maximum size is 8 GiB.
@@ -43,34 +50,177 @@ If the database runs over the space quota (see above), but the actual in use dat
4350
4451Current database size can be checked with the `talosctl etcd status` command:
4552
46- ```bash
47- $ talosctl -n <CP1>,<CP2>,<CP3> etcd status
53+ {{< tabpane >}}
54+ {{< tab header="Command" lang="Bash" >}}
55+ talosctl -n <IP1>,<IP2>,<IP3> etcd status
56+ {{< /tab >}}
57+ {{< tab header="Output" lang="Console" >}}
4858NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
4959172.20.0.3 ecebb05b59a776f1 21 MB 6.0 MB (29.08%) ecebb05b59a776f1 53391 4 53391 false
5060172.20.0.2 a49c021e76e707db 17 MB 4.5 MB (26.10%) ecebb05b59a776f1 53391 4 53391 false
5161172.20.0.4 eb47fb33e59bf0e2 20 MB 5.9 MB (28.96%) ecebb05b59a776f1 53391 4 53391 false
52- ```
62+ {{< /tab >}}
63+ {{< /tabpane >}}
5364
5465If any of the nodes are over the database size quota, alarms will be printed in the `ERRORS` column.
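
To spot members that may benefit from defragmentation, the `IN USE` percentage can be extracted from the status output. This is a rough sketch, assuming the percentage appears as a field of the form `(29.08%)` as in the sample output; `defrag_candidates` is a hypothetical helper, and the default threshold of 50 is an arbitrary illustrative value, not a recommendation from the `etcd` or Talos projects.

```bash
# Hypothetical helper: reads `talosctl etcd status` output on stdin and prints
# "<node> <in-use %>" for members whose in-use share of the database file is
# below the threshold passed as $1 (default: 50).
defrag_candidates() {
  awk -v threshold="${1:-50}" '
    NR > 1 {
      for (i = 1; i <= NF; i++) {
        # The IN USE column ends with a field such as "(29.08%)".
        if ($i ~ /^\([0-9.]+%\)$/) {
          # Strip the leading "(" and the trailing "%)".
          pct = substr($i, 2, length($i) - 3)
          if (pct + 0 < threshold + 0) print $1, pct
        }
      }
    }'
}
```

For example, `talosctl -n <IP1>,<IP2>,<IP3> etcd status | defrag_candidates 40` would list only the members with less than 40% of the database file in use.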
5566
5667To defragment the database, run the `talosctl etcd defrag` command:
5768
5869```bash
59- talosctl -n <CP1> etcd defrag
70+ talosctl -n <IP1> etcd defrag
6071```
6172
62- > Note: defragmentation is a resource-intensive operation, so it is recommended to run it on a single node at a time.
73+ > Note: Defragmentation is a resource-intensive operation, so it is recommended to run it on a single node at a time.
6374> Defragmenting a live member blocks it from reading and writing data while it rebuilds its state.
6475
6576Once the defragmentation is complete, the database size will closely match the in-use size:
6677
67- ```bash
68- $ talosctl -n <CP1> etcd status
78+ {{< tabpane >}}
79+ {{< tab header="Command" lang="Bash" >}}
80+ talosctl -n <IP1> etcd status
81+ {{< /tab >}}
82+ {{< tab header="Output" lang="Console" >}}
6983NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
7084172.20.0.2 a49c021e76e707db 4.5 MB 4.5 MB (100.00%) ecebb05b59a776f1 56065 4 56065 false
71- ```
85+ {{< /tab >}}
86+ {{< /tabpane >}}
7287
7388## Snapshotting
7489
7590Regular backups of the `etcd` database should be performed to ensure that the cluster can be restored in case of a failure.
7691This procedure is described in the [disaster recovery]({{< relref "disaster-recovery" >}}) guide.
92+
93+ ## Downgrade v3.6 to v3.5
94+
95+ Before beginning, check `etcd` health and download a snapshot, as described in [disaster recovery]({{< relref "disaster-recovery" >}}).
96+ If something goes wrong with the downgrade, this backup can be used to roll back to the existing `etcd` version.
97+
98+ This example shows how to downgrade `etcd` in a Talos cluster.
99+
100+ ### Step 1: Check Downgrade Requirements
101+
102+ Is the cluster healthy and running v3.6.x?
103+
104+ {{< tabpane >}}
105+ {{< tab header="Command" lang="Bash" >}}
106+ talosctl -n <IP1>,<IP2>,<IP3> etcd status
107+ {{< /tab >}}
108+ {{< tab header="Output" lang="Console" >}}
109+ NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER PROTOCOL STORAGE ERRORS
110+ 172.20.0.4 a2b8a1f794bdd561 3.6 MB 2.2 MB (61.59%) a49c021e76e707db 4703 2 4703 false 3.6.4 3.6.0
111+ 172.20.0.3 912415ee6ed360c4 3.5 MB 2.2 MB (61.88%) a49c021e76e707db 4703 2 4703 false 3.6.4 3.6.0
112+ 172.20.0.2 a49c021e76e707db 3.5 MB 2.2 MB (62.06%) a49c021e76e707db 4703 2 4703 false 3.6.4 3.6.0
113+ {{< /tab >}}
114+ {{< /tabpane >}}
115+
116+ ### Step 2: Download Snapshot
117+
118+ [Download the snapshot backup]({{< relref "disaster-recovery" >}}) to provide a downgrade path should any problems occur.
119+
120+ ### Step 3: Validate Downgrade
121+
122+ Validate the downgrade target version before enabling the downgrade:
123+
124+ - Only one minor version can be downgraded at a time; for example, downgrading from v3.6 directly to v3.4 is not allowed.
125+ - Do not move on to the next step until the validation succeeds.
126+
127+ {{< tabpane >}}
128+ {{< tab header="Command" lang="Bash" >}}
129+ talosctl -n <IP1> etcd downgrade validate 3.5
130+ {{< /tab >}}
131+ {{< tab header="Output" lang="Console" >}}
132+ NODE MESSAGE
133+ 172.20.0.2 downgrade validate success, cluster version 3.6
134+ {{< /tab >}}
135+ {{< /tabpane >}}
136+
137+ ### Step 4: Enable Downgrade
138+
139+ {{< tabpane >}}
140+ {{< tab header="Command" lang="Bash" >}}
141+ talosctl -n <IP1> etcd downgrade enable 3.5
142+ {{< /tab >}}
143+ {{< tab header="Output" lang="Console" >}}
144+ NODE MESSAGE
145+ 172.20.0.2 downgrade enable success, cluster version 3.6
146+ {{< /tab >}}
147+ {{< /tabpane >}}
148+
149+ After the downgrade is enabled, the cluster starts operating with the v3.5 protocol, which is the downgrade target version.
150+ In addition, `etcd` automatically migrates the schema to the downgrade target version, which usually happens very fast.
151+ Confirm that the storage version of all servers has been migrated to v3.5 by checking the status before moving on to the next step.
152+
153+ {{< tabpane >}}
154+ {{< tab header="Command" lang="Bash" >}}
155+ talosctl -n <IP1>,<IP2>,<IP3> etcd status
156+ {{< /tab >}}
157+ {{< tab header="Output" lang="Console" >}}
158+ NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER PROTOCOL STORAGE ERRORS
159+ 172.20.0.3 912415ee6ed360c4 3.5 MB 1.9 MB (54.92%) a49c021e76e707db 5152 2 5152 false 3.6.4 3.5.0
160+ 172.20.0.2 a49c021e76e707db 3.5 MB 1.9 MB (54.64%) a49c021e76e707db 5152 2 5152 false 3.6.4 3.5.0
161+ 172.20.0.4 a2b8a1f794bdd561 3.6 MB 1.9 MB (54.44%) a49c021e76e707db 5152 2 5152 false 3.6.4 3.5.0
162+ {{< /tab >}}
163+ {{< /tabpane >}}
164+
165+ > Note: Once the downgrade is enabled, the cluster keeps operating with the v3.5 protocol even if all the servers are still running the v3.6 binary, unless the downgrade is canceled with `talosctl -n <IP1> etcd downgrade cancel`.
166+
167+ ### Step 5: Patch Machine Config
168+
169+ Before patching a node, check whether its `etcd` member is the leader.
170+ We recommend downgrading the leader last.
171+ If the server to be downgraded is the leader, you can avoid some downtime by using `forfeit-leadership` to hand leadership to another server before stopping this one.
172+
173+ ```bash
174+ talosctl -n <IP1> etcd forfeit-leadership
175+ ```
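
To decide whether forfeiting leadership is needed for a given node, the current leader can be read from the status output by comparing the `MEMBER` and `LEADER` columns. The sketch below assumes the column layout shown in the sample outputs on this page, where the `LEADER` ID directly follows the `(xx.xx%)` field; `etcd_leader_node` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `talosctl etcd status` output on stdin and prints
# the node whose MEMBER ID equals the LEADER ID.
etcd_leader_node() {
  awk '
    NR > 1 {
      for (i = 1; i < NF; i++)
        # The LEADER column is assumed to follow the "(xx.xx%)" field.
        if ($i ~ /%\)$/ && $2 == $(i + 1)) print $1
    }'
}
```

For example, `talosctl -n <IP1>,<IP2>,<IP3> etcd status | etcd_leader_node` prints the IP address of the node that should be patched last.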
176+
177+ Create a patch file that points to the desired `etcd` image:
178+
179+ ```yaml
180+ # etcd-patch.yaml
181+ cluster:
182+   etcd:
183+     image: gcr.io/etcd-development/etcd:v3.5.22
184+ ```
185+
186+ Apply the patch so that the machine keeps the same configuration but runs the new `etcd` version.
187+
188+ {{< tabpane >}}
189+ {{< tab header="Command" lang="Bash" >}}
190+ talosctl -n <IP1> patch machineconfig --patch @etcd-patch.yaml --mode reboot
191+ {{< /tab >}}
192+ {{< tab header="Output" lang="Console" >}}
193+ patched MachineConfigs.config.talos.dev/v1alpha1 at the node 172.20.0.2
194+ Applied configuration with a reboot
195+ {{< /tab >}}
196+ {{< /tabpane >}}
197+
198+ Verify that each member, and then the entire cluster, becomes healthy with the new v3.5 `etcd`:
199+
200+ {{< tabpane >}}
201+ {{< tab header="Command" lang="Bash" >}}
202+ talosctl -n <IP1>,<IP2>,<IP3> etcd status
203+ {{< /tab >}}
204+ {{< tab header="Output" lang="Console" >}}
205+ NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER PROTOCOL STORAGE ERRORS
206+ 172.20.0.2 a49c021e76e707db 3.5 MB 3.1 MB (88.05%) a2b8a1f794bdd561 13116 4 13116 false 3.5.22 3.5.0
207+ 172.20.0.4 a2b8a1f794bdd561 3.6 MB 3.1 MB (88.12%) a2b8a1f794bdd561 13116 4 13116 false 3.6.4 3.5.0
208+ 172.20.0.3 912415ee6ed360c4 3.5 MB 3.1 MB (88.30%) a2b8a1f794bdd561 13116 4 13116 false 3.6.4 3.5.0
209+ {{< /tab >}}
210+ {{< /tabpane >}}
211+
212+ ### Step 6: Continue on the Remaining Control Plane Nodes
213+
214+ When all members are downgraded, check the health and status of the cluster, and confirm that every member reports v3.5 in the `PROTOCOL` column and 3.5.0 in the `STORAGE` column, as in the output below:
215+
216+ {{< tabpane >}}
217+ {{< tab header="Command" lang="Bash" >}}
218+ talosctl -n <IP1>,<IP2>,<IP3> etcd status
219+ {{< /tab >}}
220+ {{< tab header="Output" lang="Console" >}}
221+ NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER PROTOCOL STORAGE ERRORS
222+ 172.20.0.2 a49c021e76e707db 4.5 MB 4.5 MB (100.00%) 912415ee6ed360c4 13865 5 13865 false 3.5.22 3.5.0
223+ 172.20.0.4 a2b8a1f794bdd561 4.6 MB 4.6 MB (100.00%) 912415ee6ed360c4 13865 5 13865 false 3.5.22 3.5.0
224+ 172.20.0.3 912415ee6ed360c4 4.6 MB 4.6 MB (99.64%) 912415ee6ed360c4 13865 5 13865 false 3.5.22 3.5.0
225+ {{< /tab >}}
226+ {{< /tabpane >}}
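
The final version check can also be automated. This is a minimal sketch, assuming the column layout of the sample outputs on this page, where `PROTOCOL` is the sixth field after the `(xx.xx%)` field; `all_members_at` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `talosctl etcd status` output on stdin and
# succeeds only if every member's PROTOCOL value starts with the minor
# version passed as $1 (for example "3.5"). Mismatching nodes are printed.
all_members_at() {
  awk -v want="$1" '
    NR > 1 {
      for (i = 1; i <= NF; i++)
        if ($i ~ /%\)$/) {
          # Assumed layout after "(xx.xx%)": LEADER, RAFT INDEX, RAFT TERM,
          # RAFT APPLIED INDEX, LEARNER, PROTOCOL.
          proto = $(i + 6)
          if (index(proto, want ".") != 1) { print $1, proto; bad = 1 }
        }
    }
    END { exit bad ? 1 : 0 }'
}
```

For example, `talosctl -n <IP1>,<IP2>,<IP3> etcd status | all_members_at 3.5` returns a non-zero status and lists the remaining nodes while any member still reports a v3.6 version.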