Troubleshooting Tip: initial commands to collect for memory debugging

lol · ‎11-13-2024

Description

This article lists the CLI commands to collect for a methodical initial investigation of memory-related issues on FortiGate.

Scope

FortiGate.

Solution

FortiGate memory troubleshooting can be difficult.

This article provides a simplified, structured method to collect the relevant debug outputs for initial troubleshooting.

It focuses on the most important outputs to collect so that Technical Support can resolve the issue.

Memory areas:

The Linux kernel tracks memory by the area in which it is allocated.
For the meaning of each area, refer to the Linux kernel documentation (proc.txt) and search for '/proc/meminfo'.

On FortiGate, most memory-related issues are observed in the following areas:

Cached - memory allocated for disk I/O
Active - memory allocated for recently active processes
Shmem - shared memory for different processes accessing the same memory
Slab - kernel allocated memory

General FortiGate commands to check in which area memory is allocated.

In FortiGate, the command 'get hardware memory' will show where memory is allocated by the kernel.

The commands 'fnsysctl cat /proc/meminfo' and 'diagnose hardware sysinfo memory' show the same data.

To troubleshoot any memory issue, collect data while the memory is still allocated.

As a general guideline, this is at 75% memory usage or above, e.g. while the device is in conserve mode.

Collect a debug report (the same as 'execute tac report'):

diag debug report

Alternatively, collect the following commands individually. They are already included in a debug report, so a debug report is preferred:

get sys stat

get sys perf stat

get hardware memory

Example output with interesting memory areas highlighted:

get system status
Version: FortiGate-100F v7.6.0,build3401,240724 (GA.F)
Serial-Number: FG100FTK12345678
Hostname: firewall02
Current HA mode: a-p, secondary
Cluster uptime: 533 days, 0 hours, 50 minutes, 55 seconds
Cluster state change time: 2024-08-30 09:47:00
System time: Fri Sep 6 00:01:05 2024

get system performance status
Memory: 3701384k total, 2951756k used (79.7%), 548876k free (14.8%), 200752k freeable (5.5%) <--- 79.7% memory usage
Uptime: 6 days, 14 hours, 16 minutes

get hardware memory
MemTotal: 3701664 kB <-----
MemFree: 2518100 kB <-----
Buffers: 14444 kB
Cached: 341552 kB <-----
SwapCached: 0 kB
Active: 724644 kB <-----
Inactive: 54092 kB
Active(anon): 449044 kB
Inactive(anon): 8616 kB
Active(file): 275600 kB
Inactive(file): 45476 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 422820 kB
Mapped: 76144 kB
Shmem: 34880 kB <-----
Slab: 152464 kB <-----
SReclaimable: 9348 kB
SUnreclaim: 143116 kB
KernelStack: 3488 kB
PageTables: 20992 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 1850832 kB
Committed_AS: 10386140 kB
VmallocTotal: 260046784 kB
VmallocUsed: 75272 kB
VmallocChunk: 259873640 kB
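When the outputs have been saved to a file, the comparison of the four areas can be done offline. The following is a minimal Python sketch, not a Fortinet tool: it assumes the text formats shown in the example output above, and the function names (parse_used_pct, parse_meminfo, area_shares) are hypothetical helpers introduced only for illustration. It reads the used percentage from the 'Memory:' line, checks it against the 75% guideline, and ranks Cached, Active, Shmem and Slab as a share of MemTotal.

```python
import re

# Areas this article singles out, and its general 75% guideline.
AREAS = ("Cached", "Active", "Shmem", "Slab")
THRESHOLD_PCT = 75.0


def parse_used_pct(perf_line):
    """Return the used percentage from the 'Memory:' line of
    'get system performance status'."""
    match = re.search(r"used \(([\d.]+)%\)", perf_line)
    if match is None:
        raise ValueError("no 'used (NN.N%)' field found")
    return float(match.group(1))


def parse_meminfo(text):
    """Map each 'Name: <value> kB' line to an integer number of kB.
    Trailing annotations such as '<-----' are ignored."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\w()]+):\s+(\d+)\s+kB", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def area_shares(meminfo):
    """Return (area, kB, percent of MemTotal), largest area first."""
    total = meminfo["MemTotal"]
    rows = [(name, meminfo[name], 100.0 * meminfo[name] / total)
            for name in AREAS if name in meminfo]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Values copied from the example output in this article.
perf_line = ("Memory: 3701384k total, 2951756k used (79.7%), "
             "548876k free (14.8%), 200752k freeable (5.5%)")
meminfo_text = """\
MemTotal: 3701664 kB <-----
MemFree: 2518100 kB <-----
Cached: 341552 kB <-----
Active: 724644 kB <-----
Active(anon): 449044 kB
Shmem: 34880 kB <-----
Slab: 152464 kB <-----
"""

used_pct = parse_used_pct(perf_line)
print(f"used {used_pct}%, at or above guideline: {used_pct >= THRESHOLD_PCT}")
for name, kb, pct in area_shares(parse_meminfo(meminfo_text)):
    print(f"{name:<8}{kb:>10} kB {pct:5.1f}% of MemTotal")
```

The ranking only indicates where to look first; the raw outputs should still be shared with Technical Support.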

If conserve mode occurred in the past, the output of 'get hardware memory' is written to the crashlog at the time of the event.

This helps to understand where memory was allocated at the time of the issue.

This data can be seen with the command 'diag debug crashlog read' or in a 'diag debug report' where this command is already included.

The output below is already filtered for the most interesting areas.

diag debug report

...

diagnose debug crashlog read
10: 2023-10-12 11:09:05 service=kernel conserve=on total="24140 MB" used="21247 MB" red="21243 MB"
11: 2023-10-12 11:09:05 green="19795 MB" msg="Kernel enters memory conserve mode"
12: 2023-10-12 11:09:07 MemTotal: 24720008 kB
13: 2023-10-12 11:09:07 MemFree: 1725984 kB
16: 2023-10-12 11:09:07 Cached: 1270152 kB
18: 2023-10-12 11:09:07 Active: 15681444 kB <----- Most memory was allocated in active mem.
32: 2023-10-12 11:09:08 Shmem: 593924 kB
33: 2023-10-12 11:09:08 Slab: 1459632 kB
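The same comparison can be made on a saved crashlog. The sketch below is an illustration in Python under the assumption that crashlog lines follow the 'index: date time Field: value kB' layout shown above; the names MEMINFO_LINE, snapshot_from_crashlog and largest_area are hypothetical. It handles a single conserve mode event; if the log holds several events, later values overwrite earlier ones, so the text should be split per event first.

```python
import re

# Matches lines such as '18: 2023-10-12 11:09:07 Active: 15681444 kB'.
MEMINFO_LINE = re.compile(
    r"^\d+:\s+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+([\w()]+):\s+(\d+)\s+kB"
)
AREAS = ("Cached", "Active", "Shmem", "Slab")


def snapshot_from_crashlog(text):
    """Collect the meminfo fields logged with one conserve mode event."""
    fields = {}
    for line in text.splitlines():
        match = MEMINFO_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def largest_area(fields):
    """Return (area, kB, percent of MemTotal) for the biggest tracked area."""
    present = {name: fields[name] for name in AREAS if name in fields}
    name = max(present, key=present.get)
    return name, present[name], 100.0 * present[name] / fields["MemTotal"]


# Lines copied from the crashlog excerpt in this article.
crashlog = """\
10: 2023-10-12 11:09:05 service=kernel conserve=on total="24140 MB" used="21247 MB" red="21243 MB"
11: 2023-10-12 11:09:05 green="19795 MB" msg="Kernel enters memory conserve mode"
12: 2023-10-12 11:09:07 MemTotal: 24720008 kB
13: 2023-10-12 11:09:07 MemFree: 1725984 kB
16: 2023-10-12 11:09:07 Cached: 1270152 kB
18: 2023-10-12 11:09:07 Active: 15681444 kB
32: 2023-10-12 11:09:08 Shmem: 593924 kB
33: 2023-10-12 11:09:08 Slab: 1459632 kB
"""

name, kb, pct = largest_area(snapshot_from_crashlog(crashlog))
print(f"largest area: {name} ({kb} kB, {pct:.1f}% of MemTotal)")
```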

Once the area or areas with the most memory usage have been isolated, collect the output of further commands so that Technical Support can find the cause.

Also share the outputs collected in the first step, i.e. the output of 'diag debug report'.

The simplified steps are as follows:

If memory is high in cached memory, collect data about files on the disk.

fnsysctl df -h
fnsysctl du -d 1 -a
fnsysctl du -alLH /

If memory is high in Active memory, collect data about active user space processes.

diagnose sys top-mem 99
diagnose sys top 1 99 5

If memory is high in Shmem shared memory, collect data about shared memory and files on the disk.

diagnose hardware sysinfo shm
fnsysctl ls -al /dev/shm
fnsysctl du -d 1 -a /dev/shm/
fnsysctl ls -al /tmp
fnsysctl du -d 1 -a /tmp
fnsysctl cat /proc/sysvipc/shm

fnsysctl df -h
fnsysctl du -d 1 -a /
fnsysctl du -alLH /

If memory is high in kernel slabs, collect slab information.

diag hardware sysinfo slab

This is the same as the following command:

fnsysctl cat /proc/slabinfo

The command is also included in a debug report, so a debug report is preferred as it contains additional details.

diag debug report
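Once the slab output has been saved, the caches can be ranked by approximate size to see which one dominates. The following Python sketch assumes the output uses the standard Linux slabinfo version 2.1 column layout (name, active_objs, num_objs, objsize, ...); whether a given FortiOS build prints exactly this layout is an assumption to verify against the real output. The function name rank_slab_caches is hypothetical and the sample values are made up for illustration. The size is estimated as num_objs * objsize, which ignores per-slab overhead.

```python
def rank_slab_caches(slabinfo_text, top=5):
    """Rank slab caches by num_objs * objsize (bytes), largest first."""
    caches = []
    for line in slabinfo_text.splitlines():
        # Skip blank lines, the version line and the column header.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        parts = line.split()
        name, num_objs, objsize = parts[0], int(parts[2]), int(parts[3])
        caches.append((name, num_objs * objsize))
    caches.sort(key=lambda item: item[1], reverse=True)
    return caches[:top]


# Made-up sample in slabinfo 2.1 layout, for illustration only.
sample = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
inode_cache 900 910 600 13 2 : tunables 0 0 0 : slabdata 70 70 0
kmalloc-1024 5200 5248 1024 16 4 : tunables 0 0 0 : slabdata 328 328 0
dentry 21000 21504 192 21 1 : tunables 0 0 0 : slabdata 1024 1024 0
"""

for name, size in rank_slab_caches(sample):
    print(f"{name:<16}{size / 1024:>10.0f} kB")
```

A cache that keeps growing between two collections taken some time apart is usually more informative than a single large value.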

With these initial details collected while the memory is allocated (above 75%), the root cause can be isolated quickly.

Note that the base memory consumption of smaller devices with 2 GB of memory or less can be quite high, at times close to the 75% threshold mentioned. For small devices, it is best to implement the memory optimizations described in the KB article 'Technical Tip: Free up memory to avoid conserve mode'. With a lower baseline, FortiOS has to misallocate more memory before reaching the 75% threshold, which makes the problem easier to identify.
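The simplified steps in this article amount to a lookup from memory area to follow-up commands. The Python sketch below only restates that mapping with the commands listed above; the names FOLLOW_UP and commands_for are hypothetical, and the script does not connect to a FortiGate. It always starts with the debug report and removes commands that appear under more than one area.

```python
# Follow-up commands per memory area, as listed in this article.
FOLLOW_UP = {
    "Cached": [
        "fnsysctl df -h",
        "fnsysctl du -d 1 -a",
        "fnsysctl du -alLH /",
    ],
    "Active": [
        "diagnose sys top-mem 99",
        "diagnose sys top 1 99 5",
    ],
    "Shmem": [
        "diagnose hardware sysinfo shm",
        "fnsysctl ls -al /dev/shm",
        "fnsysctl du -d 1 -a /dev/shm/",
        "fnsysctl ls -al /tmp",
        "fnsysctl du -d 1 -a /tmp",
        "fnsysctl cat /proc/sysvipc/shm",
        "fnsysctl df -h",
        "fnsysctl du -d 1 -a /",
        "fnsysctl du -alLH /",
    ],
    "Slab": [
        "diag hardware sysinfo slab",
    ],
}


def commands_for(areas):
    """Return the debug report plus the follow-up commands for the given
    areas, in order and without duplicates."""
    plan = ["diag debug report"]
    for area in areas:
        for command in FOLLOW_UP[area]:
            if command not in plan:
                plan.append(command)
    return plan


for command in commands_for(["Cached", "Shmem"]):
    print(command)
```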

Note:

The 'fnsysctl' command requires super_admin access (an administrator account with the super_admin permission profile); otherwise FortiGate returns an error. For further information, see the KB article 'Technical Tip: fnsysctl command returns Unknown action 0'.
