Troubleshooting Tip: How to troubleshoot performance issues

mbenvenuti · ‎03-13-2024

Description

This article describes how to troubleshoot performance issues while using FortiSIEM.

Scope

FortiSIEM.

Solution

Follow the steps below when:

- The GUI is abnormally slow.
- The page is frozen at 'Initializing...' after the login page.
- There is latency on the Super-Worker connections.
- Some incidents do not display event details.
- Queries take a long time to return results.

Check from a web browser on the same local network. When connecting to FortiSIEM, the web browser's traffic may cross long-distance links, including internal or external proxies.

- Connect from a web browser on the same local network as the FortiSIEM Super.
- If FortiSIEM behaves better from this local web browser, there is a link issue somewhere along the path (VPN, proxy, internet connection, etc.) that must be investigated further.
- If there is no change, continue with the other checks in this document.
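To compare the two paths with numbers rather than impressions, response times can be measured with curl from both the remote workstation and a host on the Super's local network. The following is a minimal sketch, not a FortiSIEM tool: the function name time_gui and the host name super.example.local are placeholders, and -k is used on the assumption that the GUI may present a self-signed certificate.

```shell
# Print connect time, time to first byte, and total time (in seconds)
# for one HTTPS request to the given host.
# time_gui is a hypothetical helper name, not a FortiSIEM tool.
time_gui() {
    curl -k -s -o /dev/null \
        -w 'connect=%{time_connect} first_byte=%{time_starttransfer} total=%{time_total}\n' \
        "https://$1/"
}

# Usage (run once from the remote site and once from the local network):
#   time_gui super.example.local
```

If the remote timings are much higher than the local ones, the path between the browser and the Super is the more likely cause.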

Check the health details. From the GUI, go to Admin -> Health -> Cloud Health and select the icon next to the node state to open a pop-up window with the summary.

This display points out any issues. Use the arrows to check the statistics in detail.

Check for major system issues.
From an SSH session on the FortiSIEM Super as root, run the following command:

journalctl -k --no-pager

Check for major kernel errors.
Check for long interrupts such as the following:

Dec 23 13:07:39 FortiSIEM kernel: perf: interrupt took too long (3941 > 3923), lowering kernel.perf_event_max_sample_rate to 50000
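When the kernel log is long, counting the matching messages can be quicker than reading through it. The helper below is a small sketch with a hypothetical name (count_long_interrupts); it only assumes that the message text matches the example above.

```shell
# Count kernel "interrupt took too long" messages read from standard input.
# count_long_interrupts is a hypothetical helper name.
count_long_interrupts() {
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so "|| true" keeps the helper usable under "set -e".
    grep -c 'perf: interrupt took too long' || true
}

# Usage on the FortiSIEM Super, as root:
#   journalctl -k --no-pager | count_long_interrupts
#   journalctl -k -p err --no-pager     # kernel messages of priority "err" and above
```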

Check for a high number of interrupts, especially in the 'Non-maskable interrupts' row, with the following command:

cat /proc/interrupts

A high count can be caused by the underlying hardware or by the VM platform hosting FortiSIEM.
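Because /proc/interrupts shows counters accumulated since boot, comparing two readings is more telling than a single one. As a sketch, the hypothetical helper nmi_total below sums the per-CPU counters of the NMI row so that the two readings are easy to compare.

```shell
# Sum the per-CPU counters of the NMI row from /proc/interrupts-style input.
# nmi_total is a hypothetical helper name.
nmi_total() {
    awk '$1 == "NMI:" {
        total = 0
        # Only the numeric fields are per-CPU counters; the trailing
        # "Non-maskable interrupts" label is skipped.
        for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) total += $i
        print total
    }'
}

# Usage: take two readings some time apart and compare them.
#   nmi_total < /proc/interrupts; sleep 60; nmi_total < /proc/interrupts
```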

Check the Health Assessment results.

The following tool reports FortiSIEM health information:

get-fsm-health.py --local -o /tmp/fsm-health.log

cat /tmp/fsm-health.log

This command displays an overview of the health of the FortiSIEM node.
- The 'Health Assessment' section hints at where to investigate first.
- 'App Server Exceptions' and 'Backend Errors' list the most frequently repeated errors. Fixing the top errors will help.

Check CPU and memory stats.
From an SSH session on the FortiSIEM Super as root, run the following command:

phstatus.py -a

Check the user and system CPU usage (us and sy).
- If those values are high and %id is low, look at the CPU% column in the lower table and check whether a process is constantly using CPU.
- If one or several ph processes are constantly high, examine the process activity more closely with the following command (replace process1 and process2 with the names of the processes with high CPU usage) and watch for explicit errors:

tail -f /opt/phoenix/log/phoenix.log | grep -Ei 'process1|process2'

If none of the listed processes is high, run the following command to identify which process at the top of the list is using CPU and investigate further:

top
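For a non-interactive snapshot that can be saved or compared later, the process list can also be filtered with ps and awk. This is a sketch only: high_cpu is a hypothetical helper name and the limit of 50 is an arbitrary illustration. Note that the pcpu value reported by ps is CPU time divided by the lifetime of the process, so it highlights sustained usage rather than short spikes.

```shell
# List processes whose CPU percentage is at or above a limit.
# Input: lines of "pcpu pid comm". high_cpu and the default limit of 50
# are illustrative choices, not FortiSIEM recommendations.
high_cpu() {
    awk -v limit="${1:-50}" '$1 + 0 >= limit + 0 { print $3, "pid=" $2, "cpu=" $1 "%" }'
}

# Usage:
#   ps -eo pcpu,pid,comm --no-headers | high_cpu 50
```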

Check memory and swap usage: Mem should have some free memory left, and no swap should be in use.
- Check the memory usage of the processes. Large values are normal for phFortiInsightAI and AppSvr (if the UEBA feature is not and will not be used, the service can be deactivated, as explained in this article: Technical Tip: How to deactivate UEBA/phFortiInsightAI service).
- Check processes for abnormal activity with the tail command provided above.
- If a lot of swap is in use, increase the memory.
- Heavy swap usage can also be caused by poor storage performance (see the next point).
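Swap usage can also be read outside of phstatus.py with the standard free command. The helper below is a minimal sketch with a hypothetical name (swap_used_mb); it assumes the usual "Swap: total used free" row layout of free -m.

```shell
# Print the swap currently in use, in MiB, from "free -m" output on standard input.
# swap_used_mb is a hypothetical helper name.
swap_used_mb() {
    # The Swap row reads: Swap: <total> <used> <free>
    awk '$1 == "Swap:" { print $3 }'
}

# Usage:
#   free -m | swap_used_mb
```

A value that stays above 0 over several readings matches the situation described in the list above.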

Note that the recommended resources are 32 cores and 32 GB of memory. Refer to the following guide:

FortiSIEM Sizing Guide

Check disks and online storage stats. For local disks:

From the GUI, run any query under Analytics, and at the same time run the following from the FortiSIEM console:

iostat -dhxm 5 5

Check the r_await, w_await, aqu-sz, and %util values for each disk device. They should be as close to 0 as possible. Higher values mean that the system has to wait before reading or writing, which must be avoided.
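Reading these columns by eye across several reports is error-prone, so the output can be filtered with awk. The sketch below looks up the column positions from the header line, since they differ between sysstat versions. It assumes iostat is run without -h, because -h changes the layout in some versions. The helper name flag_busy_disks and the 10 ms limit are illustrative assumptions, not FortiSIEM thresholds.

```shell
# Print devices whose r_await or w_await exceeds a limit (in ms).
# Input: output of "iostat -dxm" (without -h) on standard input.
# flag_busy_disks and the default limit of 10 are illustrative assumptions.
flag_busy_disks() {
    awk -v limit="${1:-10}" '
        # Header line: remember the position of every column by name.
        $1 == "Device" || $1 == "Device:" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Device lines: only evaluated once a header has been seen.
        NF > 1 && ("r_await" in col) && ("w_await" in col) {
            r = $(col["r_await"]); w = $(col["w_await"]); u = $(col["%util"])
            if (r + 0 > limit + 0 || w + 0 > limit + 0)
                print $1, "r_await=" r, "w_await=" w, "util=" u
        }'
}

# Usage (while a query is running in the GUI):
#   iostat -dxm 5 5 | flag_busy_disks 10
```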

iotop

This shows which processes are generating a lot of I/O.

When online storage is on NFS:

From the GUI, run any query under Analytics, and at the same time run the following from the FortiSIEM console:

nfsiostat 2 3 > /tmp/nfsio_query.txt

cat /tmp/nfsio_query.txt

Check the latencies in the avg RTT, avg exe, and avg queue columns. If one of those values is higher than 10 ms, the NFS server setup needs to be reviewed.
- The NFS server needs to be close to the FortiSIEM.
- Review the server's hardware and resources: interface rates, CPU, RAM, and disk type.
- A dedicated NFS server for FortiSIEM is strongly recommended.

Another load test can be applied:

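# Write test: flush caches, then write about 2 GB to the NFS mount while sampling NFS statistics.
echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/zero of=/data/test_file.bin bs=1M count=2000 conv=fsync & nfsiostat 2 3 > /tmp/nfsio_write.txt
wait
# Read test: flush caches again so the file is read from the NFS server, not from RAM.
echo 3 > /proc/sys/vm/drop_caches
dd if=/data/test_file.bin of=/dev/null bs=1M & nfsiostat 2 3 > /tmp/nfsio_read.txt
wait
rm -f /data/test_file.bin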

cat /tmp/nfsio_write.txt

cat /tmp/nfsio_read.txt

Using high-performance NVMe SSDs for online data is strongly recommended.

VM platform usage (FortiSIEM-VM only). A FortiSIEM-VM is hosted on a machine whose hardware resources (CPU, RAM, disk) are shared with other VMs.

If those VMs are loaded and using a lot of resources, this can affect FortiSIEM performance and stability.

- Check the resource usage of the host and the other VMs, and adjust the resource allocation in the VM management platform.
- Isolate the FortiSIEM VM on one dedicated host without any other VMs and check the behavior.
- If there are several hosts in cluster mode, migrate the FortiSIEM VM to another host and check again.
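On hypervisors that expose steal time to the guest (not all of them do), contention with other VMs can also be seen from inside the FortiSIEM VM. The sketch below computes the share of CPU time stolen since boot from /proc/stat. The helper name steal_pct is hypothetical, and a value of 0 does not rule out contention on platforms that do not report steal time.

```shell
# Print the share of CPU time stolen by the hypervisor since boot, in percent,
# from /proc/stat-style input. steal_pct is a hypothetical helper name.
steal_pct() {
    awk '$1 == "cpu" {
        # Fields 2-9 of the aggregate row: user nice system idle iowait irq softirq steal.
        total = 0
        for (i = 2; i <= 9; i++) total += $i
        if (total > 0) printf "%.1f\n", 100 * $9 / total
    }'
}

# Usage:
#   steal_pct < /proc/stat
```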

Error logs related to performance. In /opt/glassfish/domains/domain1/logs/server.log:

[2024-05-28T02:00:01.184+0200] [glassfish 5.1] [ERROR] [] [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] [tid: _ThreadID=331 _ThreadName=PHScheduler_Worker-20] [timeMillis: 1716854401184] [levelValue: 1000] [[
ERROR: deadlock detected
Detail: Process 85521 waits for ShareLock on transaction 152825958; blocked by process 85522.
Process 85522 waits for ShareLock on transaction 152825964; blocked by process 85521.
Hint: See server log for query details.
Where: while deleting tuple (0,60) in relation "ph_change_set"]]

To fix this, move the VM to a host with faster disks (especially for the /cmdb disk) and increase the number of CPUs.
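To judge how often this happens and which tables are involved, the deadlock entries can be summarized per relation. This is a sketch that assumes every entry has the same shape as the example above (an "ERROR: deadlock detected" line followed by a line naming the relation); deadlock_relations is a hypothetical helper name.

```shell
# Count deadlock entries per database relation from server.log-style input.
# deadlock_relations is a hypothetical helper name.
deadlock_relations() {
    awk '
        /ERROR: deadlock detected/ { pending = 1 }
        # The first relation named after a deadlock line belongs to that entry.
        pending && /in relation "/ {
            rel = $0
            sub(/.*in relation "/, "", rel)
            sub(/".*/, "", rel)
            count[rel]++
            pending = 0
        }
        END { for (r in count) print r, count[r] }' | sort
}

# Usage:
#   deadlock_relations < /opt/glassfish/domains/domain1/logs/server.log
```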

In /opt/phoenix/log/phoenix.log on a Worker or Collector:

2024-05-28T01:03:25.248477+02:00 machine phEventHandler: [PH_EVT_HANDLER_ERR]:[eventSeverity]=PHL_ERROR,[procName]=phEventHandler,[fileName]=phHttpRequestHandler.cpp,[lineNumber]=313,[phLogDetail]=Server is not running or congested, reject upload request

2024-05-28T01:01:09.649952+02:00 collector phEventPackager[90872]: [PH_HTTP_RESPONSE_FAILURE]:[eventSeverity]=PHL_WARNING,[procName]=phEventPackager,[fileName]=phHttpClient.cpp,[lineNumber]=614,[errorNo]=500,[phLogDetail]=HTTP response code failure

To fix this, check the Application Server on the FortiSIEM Super node, which is probably in a bad state.
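Grouping the rejected uploads by hour can help correlate the congestion with load peaks or scheduled jobs on the Super. The sketch below assumes that each log line starts with an ISO 8601 timestamp, as in the examples above; rejects_per_hour is a hypothetical helper name.

```shell
# Count "reject upload request" messages per hour from phoenix.log-style input.
# rejects_per_hour is a hypothetical helper name.
rejects_per_hour() {
    # The first 13 characters of the timestamp are the date and hour,
    # for example 2024-05-28T01.
    awk '/reject upload request/ { count[substr($0, 1, 13)]++ }
         END { for (h in count) print h, count[h] }' | sort
}

# Usage on a Worker or Collector:
#   rejects_per_hour < /opt/phoenix/log/phoenix.log
```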
