| This issue typically occurs when:
- A node is removed from the ETCD cluster.
- The node still retains stale ETCD data.
- The ETCD service fails to start or continuously restarts.
For the purposes of this article, the HA cluster nodes are defined as follows:
- HA1 – 10.65.48.44 -> ETCD is down (unhealthy).
- HA2 – 10.65.49.161 -> Node is healthy.
- HA3 – 10.65.49.162 -> Node is healthy.
Running the following command shows ETCD as down or unstable:

    phstatus

Check the service status:

    systemctl status etcd

In some cases the status shows 'success' or 'exit code = 1', which can be misleading because the service is continuously restarting.

Log into a healthy node and run the following to see the exception:

    etcdctl \
      --endpoints=http://10.65.49.161:2379,http://10.65.49.162:2379,http://10.65.48.44:2379 \
      endpoint health -w table

Output:

    {"level":"warn","ts":"2026-03-24T15:01:39.396771-0700","caller":"flags/flag.go:94","msg":"unrecognized environment variable","environment-variable":"ETCDCTL_API=3"}
    {"level":"warn","ts":"2026-03-24T15:01:44.399821-0700","logger":"client","caller":"v3@v3.6.4/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002e0000/10.65.48.44:2379","method":"/etcdserverpb.KV/Range","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing: dial tcp 10.65.48.44:2379: connect: connection refused\""}
    +--------------------------+--------+--------------+---------------------------+
    |         ENDPOINT         | HEALTH |     TOOK     |           ERROR           |
    +--------------------------+--------+--------------+---------------------------+
    | http://10.65.49.162:2379 |   true |   4.342763ms |                           |
    | http://10.65.49.161:2379 |   true |   4.545808ms |                           |
    |  http://10.65.48.44:2379 |  false | 5.000977199s | context deadline exceeded |
    +--------------------------+--------+--------------+---------------------------+
    Error: unhealthy cluster

This indicates that one node (10.65.48.44) is unhealthy.
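When the health table has to be checked repeatedly, the unhealthy rows can be picked out automatically. The following is a minimal sketch, assuming the table layout shown above; `unhealthy_endpoints` is a hypothetical helper name, not an etcdctl feature.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of etcdctl): read the output of
# `etcdctl ... endpoint health -w table` on stdin and print every
# endpoint whose HEALTH column is "false".
unhealthy_endpoints() {
    # Table rows are split on "|": field 2 is ENDPOINT, field 3 is HEALTH.
    # Border lines and JSON log lines contain no "|" and are skipped.
    awk -F'|' '
        NF >= 5 {
            endpoint = $2
            health = $3
            gsub(/[ \t]/, "", endpoint)
            gsub(/[ \t]/, "", health)
            if (health == "false") print endpoint
        }'
}
```

Run on a healthy node, piping the same `endpoint health -w table` command into `unhealthy_endpoints` would print `http://10.65.48.44:2379` for the output shown in this article, and nothing once all nodes are healthy.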
On the unhealthy node, run the following command:

    journalctl -u etcd -n 50 --no-pager | grep error

Example output:

    the member has been permanently removed from the cluster

Root cause: This issue occurs when:
- The node was removed from the ETCD cluster.
- The node still contains old member ID data in /var/lib/etcd.
- ETCD detects a mismatch and refuses to start.
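The root cause above can be confirmed from the journal without reading it by eye. This is a minimal sketch that assumes the log message appears exactly as in the example output; `member_was_removed` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when the text on stdin contains the message
# etcd logs after this member has been removed from the cluster,
# and exit non-zero otherwise. The match is fixed-string and case-insensitive.
member_was_removed() {
    grep -qiF 'the member has been permanently removed from the cluster'
}
```

On the unhealthy node, `journalctl -u etcd -n 50 --no-pager | member_was_removed && echo "stale member data"` would print the note only when the message is present in the last 50 log lines.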
Remediation steps:

Step 1 - Identify the stale member.

    etcdctl member list

Example output:

    {"level":"warn","ts":"2026-03-31T17:30:04.513625-0700","caller":"flags/flag.go:94","msg":"unrecognized environment variable","environment-variable":"ETCDCTL_API=3"}
    6219912d81186549, started, HA-Super2_10_65_49_161, http://10.65.49.161:2380, http://10.65.49.161:2379, false
    debaf717ada6ae1e, started, HA-Super3_10_65_49_162, http://10.65.49.162:2380, http://10.65.49.162:2379, false
    fa288190cc028248, started, AH-SUPER_10_65_48_44, http://10.65.48.44:2380, http://10.65.48.44:2379, false

Broken node ID: fa288190cc028248.

Step 2 - Log into a healthy node and remove the stale member.

    etcdctl --endpoints=http://10.65.49.161:2379,http://10.65.49.162:2379 member remove fa288190cc028248

Output:

    Member fa288190cc028248 removed from cluster 92445e086e4dd2e2

Step 3 - Re-add the node.

    etcdctl --endpoints=http://10.65.49.161:2379,http://10.65.49.162:2379 member add AH-SUPER_10_65_48_44 --peer-urls=http://10.65.48.44:2380

Output:

    Member fa288190cc028248 added to cluster
    ETCD_NAME="AH-SUPER_10_65_48_44"
    ETCD_INITIAL_CLUSTER="AH-SUPER_10_65_48_44=http://10.65.48.44:2380,HA-Super2_10_65_49_161=http://10.65.49.161:2380,HA-Super3_10_65_49_162=http://10.65.49.162:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.65.48.44:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"

Log back into the broken node, stop etcd, remove its stale data, and recreate the data directory.

Step 4 - Clean the stale data on the broken node.

    systemctl stop etcd
    rm -rf /var/lib/etcd/*

Step 5 - Recreate the directory.

    mkdir -p /var/lib/etcd
    chown etcd:etcd /var/lib/etcd
    chmod 700 /var/lib/etcd

Step 6 - Update the configuration.

    vi /etc/etcd/etcd.conf

Paste:

    ETCD_NAME="AH-SUPER_10_65_48_44"
    ETCD_DATA_DIR="/var/lib/etcd"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.65.48.44:2380"
    ETCD_LISTEN_PEER_URLS="http://10.65.48.44:2380"
    ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
    ETCD_ADVERTISE_CLIENT_URLS="http://10.65.48.44:2379"
    ETCD_INITIAL_CLUSTER="AH-SUPER_10_65_48_44=http://10.65.48.44:2380,HA-Super2_10_65_49_161=http://10.65.49.161:2380,HA-Super3_10_65_49_162=http://10.65.49.162:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
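To avoid typing mistakes when editing by hand, the Step 6 configuration could also be generated from the values returned by `member add`. The sketch below assumes the same eight settings, ports (2379 and 2380), and data directory used in this article; `render_etcd_conf` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an etcd.conf body for a node that is rejoining
# an existing cluster. Ports and data directory are the ones used in this
# article and are assumptions for any other environment.
# Arguments: <member name> <node IP> <ETCD_INITIAL_CLUSTER value>
render_etcd_conf() {
    local name="$1" ip="$2" initial_cluster="$3"
    cat <<EOF
ETCD_NAME="${name}"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://${ip}:2380"
ETCD_LISTEN_PEER_URLS="http://${ip}:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://${ip}:2379"
ETCD_INITIAL_CLUSTER="${initial_cluster}"
ETCD_INITIAL_CLUSTER_STATE="existing"
EOF
}
```

For example, calling `render_etcd_conf 'AH-SUPER_10_65_48_44' '10.65.48.44' "$initial_cluster"`, with `initial_cluster` set to the ETCD_INITIAL_CLUSTER value from the Step 3 output, prints the block above. Review the result before writing it to /etc/etcd/etcd.conf.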

Step 7 - Start etcd.

    systemctl daemon-reload
    systemctl start etcd
    systemctl status etcd

Expected output:

    Active: active (running)
    published local member to cluster through raft
    ready to serve client requests

Step 8 - Verify cluster health.

    etcdctl --endpoints=http://10.65.49.161:2379,http://10.65.49.162:2379,http://10.65.48.44:2379 endpoint health -w table

All nodes should show HEALTH = true.

Step 9 - Verify the member list.

    etcdctl member list

Expected (the re-added node is listed as started, with a new member ID):

    4c00291e5f9a435c, started, AH-SUPER_10_65_48_44 ... |