When disaster recovery mode needs to be configured, a few verification steps can be performed to make sure the replication process will work properly. Follow the steps below before setting up disaster recovery mode.

Requirements:
2 licenses: 1 configured on each node.
Good network connection between the 2 supers.
DNS or a virtual IP redirecting to the active node (see the lookup example below).

Warning: If a High Availability (HA) cluster is required, it can only be set up on the primary site. HA is not supported on the secondary site.
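To verify the last requirement (the DNS record or virtual IP pointing at the active node), a simple lookup can be run from a client machine. This is only a sketch: fortisiem.example.com is a placeholder for the FQDN actually used by operators and agents.

# The returned address should be the IP of the current active (primary) super.
nslookup fortisiem.example.com
# or, using dig:
dig +short fortisiem.example.com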
Check the state of the 2 supers:
 The 2 nodes need to have: 
The exact same version.
Normal health in Admin -> Health.
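These checks can also be cross-checked from the CLI on each super. phstatus is the standard FortiSIEM process status utility; phshowVersion.sh is assumed to be available to print the installed version (if it is not present on the build in use, compare the versions in the GUI instead).

# Run on both supers; all phoenix processes should be reported as up.
phstatus
# Print the installed version to confirm both nodes match (assumes this helper script exists).
/opt/phoenix/bin/phshowVersion.sh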
   
Storage configured:
When using EventDB, both sides are independent.
When using ClickHouse, configure the node at Admin -> Setup -> Storage Online, but the ClickHouse cluster itself will be configured after the DR setup.
On the secondary site, add the Workers before setting up DR mode, because once DR is up and the secondary is in standby, no other node can be added to the secondary site cluster.
Check that the required ports are open. A dedicated network between the supers without filtering is recommended, but if a filter is in place, make sure the following ports are open (a connectivity sketch is provided after the port list):
22/tcp - ssh service.
161-162/udp - SNMP.
443/tcp - https - GUI and APIs.
2046/tcp - NFS service.
123/udp - NTP.
5432/tcp - postgresql service.
2323/tcp - for connection testing.
514/tcp.
514/udp.
111/tcp.
2049/tcp.
6514/tcp.
6100/tcp.
5555/tcp.
6343/tcp.
6343/udp.
6379/tcp.
3000/tcp.
1470/tcp.
7900-7950/tcp.
6666-6671/tcp.
16666-16671/tcp.
19999/tcp.
20000-30000/tcp.
60002-60003/tcp.
2055/udp.
2181/tcp.
8081/tcp.
2888/tcp.
3888/tcp.
8123/tcp.
9000/tcp.
9009/tcp.
9440/tcp.
9010/tcp.
8443/tcp.
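As referenced above, a quick reachability test of the TCP ports can be run from one super toward the other before enabling DR mode. This is a minimal sketch using bash's built-in /dev/tcp redirection (no extra tool assumed); super2_IP is a placeholder and the port list is only a sample subset of the table above. UDP ports (123, 161-162, 514, 2055, 6343) cannot be validated this way.

# Test a sample of the required TCP ports against the peer super.
for p in 22 443 2046 5432 2323 514 111 2049 6514 9000 8443; do
  if timeout 3 bash -c "echo > /dev/tcp/super2_IP/$p" 2>/dev/null; then
    echo "tcp/$p reachable"
  else
    echo "tcp/$p closed or filtered"
  fi
done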
 
 
Check the link. It is important to have a stable link between the supers to ensure proper replication and keep the replicated data as up-to-date as possible. A connection of 100 Mbit/s with low latency is recommended.
 
Here are some commands to assess the quality of the connection. Run the following on the primary site CLI as root:

yum -y install iperf
Complete!

firewall-cmd --zone=fortisiem --add-port=2323/tcp
success

iperf -s -p 2323
------------------------------------------------------------
 Server listening on TCP port 2323
 TCP window size: 8.00 MByte (default)
 ------------------------------------------------------------
 [ 1] local super1_IP port 2323 connected with super2_IP port 53032
 [ ID] Interval Transfer Bandwidth
 [ 1] 0.00-10.00 sec 16.0 GBytes 13.8 Gbits/sec
 [ 2] local super1_IP port 2323 connected with super2_IP port 48112
 [ ID] Interval Transfer Bandwidth
 [ 2] 0.00-10.00 sec 16.2 GBytes 13.9 Gbits/sec
 [ 3] local super1_IP port 2323 connected with super2_IP port 48238
 [ ID] Interval Transfer Bandwidth
 [ 3] 0.00-10.00 sec 16.2 GBytes 13.9 Gbits/sec
ping -i 5 -c 5 super2_IP

On the secondary site CLI as root:

yum -y install iperf
Complete!

firewall-cmd --zone=fortisiem --add-port=2323/tcp
success
 i=0; while [ $i -lt 3 ]; do ((i++)); iperf -c super1_IP -p 2323; sleep 5;done
 ------------------------------------------------------------
 Client connecting to super1_IP, TCP port 2323
 TCP window size: 8.00 MByte (default)
 ------------------------------------------------------------
 [ 1] local super2_IP port 53032 connected with super1_IP port 2323
 [ ID] Interval Transfer Bandwidth
 [ 1] 0.00-10.01 sec 16.0 GBytes 13.8 Gbits/sec
 ------------------------------------------------------------
 Client connecting to super1_IP, TCP port 2323
 TCP window size: 8.00 MByte (default)
 ------------------------------------------------------------
 [ 1] local super2_IP port 48112 connected with super1_IP port 2323
 [ ID] Interval Transfer Bandwidth
 [ 1] 0.00-10.00 sec 16.2 GBytes 13.9 Gbits/sec
 ------------------------------------------------------------
 Client connecting to super1_IP, TCP port 2323
 TCP window size: 8.00 MByte (default)
 ------------------------------------------------------------
 [ 1] local super2_IP port 48238 connected with super1_IP port 2323
 [ ID] Interval Transfer Bandwidth
 [ 1] 0.00-10.02 sec 16.2 GBytes 13.9 Gbits/sec
ping -i 5 -c 5 super1_IP
PING super1_IP (super1_IP) 56(84) bytes of data.
 64 bytes from super1_IP: icmp_seq=1 ttl=64 time=0.271 ms
 64 bytes from super1_IP: icmp_seq=2 ttl=64 time=0.414 ms
 64 bytes from super1_IP: icmp_seq=3 ttl=64 time=0.358 ms
 64 bytes from super1_IP: icmp_seq=4 ttl=64 time=0.502 ms
 64 bytes from super1_IP: icmp_seq=5 ttl=64 time=0.286 ms
--- super1_IP ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 20586ms
 rtt min/avg/max/mdev = 0.271/0.366/0.502/0.086 ms
Check the results of the previous commands: if any errors occur, or if the values are poor (high latency, low throughput), review the network between the 2 supers before going further.
Clear previous ssh-key traces. When DR mode has already been set up in the past and then removed, and/or the super nodes have changed their ssh-keys, any previous expired/unwanted SSH keys configured for the supers' IPs and hostnames must be removed.

From the CLI on both super nodes as the root user:

su admin
sed -i -e '/super1_ip/d' -e '/super1_hostname/d' /opt/phoenix/bin/.ssh/authorized_keys
sed -i -e '/super2_ip/d' -e '/super2_hostname/d' /opt/phoenix/bin/.ssh/authorized_keys
sed -i -e '/super1_ip/d' -e '/super1_hostname/d' /opt/phoenix/bin/.ssh/known_hosts
sed -i -e '/super2_ip/d' -e '/super2_hostname/d' /opt/phoenix/bin/.ssh/known_hosts
egrep 'super1_ip|super1_hostname|super2_ip|super2_hostname' /opt/phoenix/bin/.ssh/authorized_keys /opt/phoenix/bin/.ssh/known_hosts

If the egrep command returns no output, the traces of old ssh-keys have been removed from authorized_keys and known_hosts, and the system is ready for a new ssh-key to be configured (see the next step).
Configure SSH keys. On the 2 supers CLI as root, leave the answers blank:

su admin
ssh-keygen -t rsa -b 4096
Generating public/private rsa key pair.
Enter file in which to save the key (/opt/phoenix/bin/.ssh/id_rsa):
 Created directory '/opt/phoenix/bin/.ssh'.
 Enter passphrase (empty for no passphrase):
 Enter same passphrase again:
 Your identification has been saved in /opt/phoenix/bin/.ssh/id_rsa.
 Your public key has been saved in /opt/phoenix/bin/.ssh/id_rsa.pub.
 The key fingerprint is:
 SHA256:sG9fNfWmeOCaWEwstc7qdV6VwL3OiEXVkIpbXeRoRrk admin@fsmSup
 The key's randomart image is:
+---[RSA 4096]----+
|   (randomart)   |
+----[SHA256]-----+
Repeat the same commands on the other super.
Set up Disaster Recovery mode. From each super node CLI as root, run the next commands to gather the required information and watch the live logs:

hostname
ifconfig
phgetUUID | sed 's/.*UUID: //g'
cat /opt/phoenix/bin/.ssh/id_rsa.pub
tail -f /opt/phoenix/log/phoenix.log | grep '-DR'

From the primary super GUI, go to Admin -> License -> Nodes -> Add -> Secondary Supervisor, then fill in the form with the results of the previous commands.

Check the replication health at Admin -> Health -> Replication Health.
Check the SSH connection. For the replication to work, an SSH connection without any errors or interactions (fingerprint validation, password prompt, and so on) is required between the 2 nodes.

Run the following commands from the primary site CLI as root:

su admin
ssh admin@super2_IP echo test_OK
test_OK

From the secondary site CLI as root:

su admin
ssh admin@super1_IP echo test_OK
test_OK

If any interaction is prompted (for example a fingerprint confirmation), answer it, then make sure the next execution of those commands no longer asks any question. For any errors, review the steps from 'Clear previous ssh-key traces'.
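To verify the non-interactive behavior in a scripted way, the standard OpenSSH BatchMode option can be used: it makes ssh fail instead of prompting, so any remaining interaction shows up as an error. This is a generic OpenSSH sketch, not a FortiSIEM-specific command; super2_IP is a placeholder and the same test can be run in the other direction.

# Run as the admin user on the primary site super.
ssh -o BatchMode=yes -o ConnectTimeout=10 admin@super2_IP echo test_OK
# Expected output is only "test_OK"; any prompt or "Permission denied" error means the key setup must be reviewed.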
Configure ClickHouse storage replication. Only when using ClickHouse, run the next commands on the secondary site super CLI as root:

/opt/phoenix/phscripts/clickhouse/cleanup_clickhouse_keeper.sh
/opt/phoenix/phscripts/clickhouse/cleanup_clickhouse.sh

Then go to Admin -> Settings -> Database -> ClickHouse Config and configure the secondary site replicas and keepers.
Check the replication state from the CLI. From the primary site super CLI as root:

ls -al /opt/phoenix/cache/replication/
total 12
 drwxrwxr-x 2 admin admin 46 Jan 13 14:13 .
 drwxr-xr-x 30 admin admin 4096 Jan 14 00:10 ..
 -rw------- 1 admin admin 1302 Jan 13 14:13 complete_status.xml
 -rw-rw-r-- 1 admin admin 8 May 15 2024 .role
From the secondary site super CLI as root:

ls -al /opt/phoenix/cache/replication/
total 16
 drwxrwxr-x 2 admin admin 64 Jan 13 09:17 .
 drwxr-xr-x 51 admin admin 4096 Jan 16 17:55 ..
 -rw-rw-r-- 1 admin admin 114 Jan 14 00:19 cmdbstatus
 -rw-rw-r-- 1 admin admin 11 Jan 14 00:16 last_finish_svnlite
 -rw-rw-r-- 1 admin admin 10 May 15 2024 .role
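The contents and timestamps of these files can give a quick hint about the replication role and activity. Their exact semantics are not documented in this article, so treat the commands below as informational only; in particular, the assumption that .role records whether the node acts as primary or secondary is based on the file name, not on documented behavior.

# Informational only; file semantics are assumed, not documented here.
cat /opt/phoenix/cache/replication/.role
stat -c '%y %n' /opt/phoenix/cache/replication/*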
   
Troubleshooting. After checking that the secondary super is in good health and the connection link is OK, here are additional troubleshooting steps:

EventDB, ProfileDB, and Configuration are not in sync: make sure the SSH key is correct by following the section 'Check the SSH connection'.

Profile database is not in sync: check the sizes of the profileDB files on each node.

Primary node CLI as root:

ls -l /opt/phoenix/cache/profile.db*
-rw-r--r-- 1 admin admin 2285568 Dec 19 00:10 /opt/phoenix/cache/profile.db
 -rw-r--r-- 1 admin admin 32768 Dec 19 00:10 /opt/phoenix/cache/profile.db-shm
 -rw-r--r-- 1 admin admin 1104192 Dec 19 00:10 /opt/phoenix/cache/profile.db-wal
Secondary node CLI as root:

ls -l /opt/phoenix/cache/profile.db*
-rw-r--r-- 1 admin admin 2285568 Dec 5 00:11 /opt/phoenix/cache/profile.db
If the sizes do not match, replication is not in sync; try to sync it manually to surface any issues.

On the primary node:

su admin
rsync -v /opt/phoenix/cache/profile.db admin@SECONDARY_IP:/opt/phoenix/cache/
profile.db
sent 1,134,396 bytes  received 9,155 bytes  762,367.33 bytes/sec
total size is 2,285,568  speedup is 2.00
If any interaction is prompted, the SSH key configuration needs to be reviewed. If a connection error is displayed or the transfer is extremely slow, review the network and link between the supers.

Configuration is not in sync: check the sizes of the /svn disk on each node.

Primary node CLI as root:

df -h /svn
Filesystem Size Used Avail Use% Mounted on
 /dev/sdc1 60G 461M 60G 1% /svn
Secondary node CLI as root:

df -h /svn
Filesystem Size Used Avail Use% Mounted on
 /dev/sdc1 60G 461M 60G 1% /svn
If the sizes do not match, replication is not in sync; try to sync it manually to surface any issues.

On the primary node:

su admin
rsync -av /svn/ admin@SECONDARY_IP:/svn/

If any interaction is prompted, the SSH key configuration needs to be reviewed. If a connection error is displayed or the transfer is extremely slow, review the network and link between the supers.

CMDB is not in sync: check the details in the article: How to do advanced checks on the CMDB replication.
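Matching sizes alone do not prove that the replicated copies are identical. As an optional extra check for the profile database, a checksum can be compared on both nodes; this is a generic Linux sketch, not a FortiSIEM-specific procedure, and small differences right after a replication cycle can be expected because the primary copy keeps being updated.

# Run the same command on the primary and on the secondary, then compare the hashes.
md5sum /opt/phoenix/cache/profile.db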