FortiSIEM
FortiSIEM provides Security Information and Event Management (SIEM) and User and Entity Behavior Analytics (UEBA)
aebadi
Staff
Article Id 384217
Description This article describes how to troubleshoot ETCD being down due to the error 'panic: tocommit(962234) is out of range'.
Scope FortiSIEM v7.3.x.
Solution

ETCD uses the Raft consensus algorithm, which requires all members to agree on the order of operations. If a member's state is out of sync, Raft prevents it from joining the cluster.

 

Initial Diagnosis.

  1. Check the process and try to bring it up:

 

[root@super-1 ~]# systemctl status etcd.service
● etcd.service - Etcd Server
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sun 2025-03-23 12:33:54 PDT; 2h 24min ago
Process: 3190 ExecStart=/bin/bash -c GOMAXPROCS=$(nproc) /usr/bin/etcd (code=exited, status=2)
Main PID: 3190 (code=exited, status=2)

Mar 23 12:33:54 super-1 systemd[1]: Failed to start Etcd Server.
Mar 23 12:33:54 super-1 systemd[1]: etcd.service: Service RestartSec=100ms expired, scheduling restart.
Mar 23 12:33:54 super-1 systemd[1]: etcd.service: Scheduled restart job, restart counter is at 5.
Mar 23 12:33:54 super-1 systemd[1]: Stopped Etcd Server.
Mar 23 12:33:54 super-1 systemd[1]: etcd.service: Start request repeated too quickly.
Mar 23 12:33:54 super-1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Mar 23 12:33:54 super-1 systemd[1]: Failed to start Etcd Server.

--

[root@super-1 ~]# systemctl start etcd.service
Job for etcd.service failed because the control process exited with error code.
See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@super-1 ~]#

 

  2. Check the journal entries to see why the service failed:

 

[root@super-1 ~]# journalctl -u etcd 

 
(Screenshot: journalctl output for the etcd service showing the repeated panic error.)

 

The log above is truncated, but the actual error repeats over and over: 'panic: tocommit(14244845) is out of range [lastIndex(14265)]. Was the raft log corrupted, truncated, or lost?'
This means the etcd member's committed index (state.commit) is out of range compared to the cluster's actual commit. To avoid data corruption, Raft does not allow the member to join the cluster, and ETCD fails to start with a panic.
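
To confirm the failure reason quickly without scrolling through the full journal, the panic line can be filtered out directly (a minimal example; the commit and index values will differ on every system):

journalctl -u etcd --no-pager | grep -i 'panic: tocommit'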

 

  3. Check the ETCD member list and the endpoint health:

 

[root@super-1 ~]# etcdctl member list
{"level":"warn","ts":"2025-03-23T15:37:43.696021-0700","logger":"etcd-client","caller":"v3@v3.5.13/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000162000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded

--

[root@super-1 ~]# etcdctl endpoint health
{"level":"warn","ts":"2025-03-23T15:38:04.882974-0700","logger":"client","caller":"v3@v3.5.13/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00010c000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
127.0.0.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

[root@super-1 ~]#
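
Since the local etcd is down, these checks can also be run against another cluster member to confirm that the rest of the cluster is reachable (the endpoint below is a placeholder; replace it with the IP of a healthy member, and add certificate options if the deployment requires them):

etcdctl --endpoints=<healthy-member-ip>:2379 endpoint health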

 

Remediation.

In such circumstances, performing a rolling restart can help the affected node re-join the ETCD cluster.

 

  1. Ensure quorum is still intact: Before proceeding with a rolling restart, check that enough Etcd nodes are still running to maintain quorum.
    When a node is restarted, it will rejoin the cluster once it has recovered and re-synced with the remaining nodes. During this process, it might lag a bit in syncing the latest data, but this will not cause issues as long as the quorum is intact.
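
For example, from a node where etcd is still running, the status of all members can be listed in one command (the output includes each member's leader flag and Raft index; certificate options may also be required depending on the deployment):

etcdctl endpoint status --cluster -w table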

  2. Backup before making changes: Always take a backup of the Etcd cluster before performing a rolling restart. While ETCD is designed to handle restarts, a backup is essential to recover from any unforeseen issues.

 

etcdctl snapshot save /path/to/backup.db
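
Optionally, verify the integrity of the saved snapshot (a quick check using the same example path as above; on newer etcd releases the equivalent etcdutl command may be preferred):

etcdctl snapshot status /path/to/backup.db -w table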

 

  3. If a single node in the cluster is already down, restarting that node immediately may not be necessary, especially if quorum is intact and the cluster is still functioning; the remaining nodes can be restarted first and the down node recovered afterwards. Perform a rolling restart of the nodes one by one, as shown in the sketch below.
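
A minimal rolling-restart sequence, run on each node one at a time (wait for the node to report healthy before moving on to the next one):

systemctl restart etcd.service

etcdctl endpoint health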

     

  4. After restarting the nodes, monitor the cluster to ensure it has fully recovered and is healthy (see the commands below and the repeated check after them):

 

 

etcdctl member list

 

(Screenshot: etcdctl member list output showing all members started.)

 

etcdctl endpoint health

 

(Screenshot: etcdctl endpoint health output reporting the endpoint as healthy.)

 

systemctl status etcd.service

 

(Screenshot: systemctl status etcd.service showing the service active and running.)
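
To keep watching the cluster while it re-syncs, the health check can be repeated at a fixed interval (a simple example using watch; the 5-second interval is arbitrary):

watch -n 5 etcdctl endpoint health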

 

Solution.

The solution in this article is to perform a rolling restart of the etcd cluster nodes after a backup has been taken.

If the cluster still shows warnings after the rolling restart, open a TAC ticket for further review and troubleshooting.