Solution
This article covers the output of the following commands:
get system performance
diagnose test application sqlrptcached 2
diagnose debug application sqlrptcached 8
diagnose test application oftpd 2
diagnose test application oftpd 5
Together, these commands are useful for checking FortiAnalyzer performance.
Command: get system performance.
The following is the FortiAnalyzer get sys performance output. This command provides a snapshot of the system's resource utilization, including CPU, memory, hard disk, and flash disk performance. Each section is explained below:
get system performance
CPU:
Used: 54.22%
Used(Excluded NICE): 54.22%
%used %user %nice %sys %idle %iowait %irq %softirq
CPU0 26.60 0.19 0.00 0.39 73.40 25.83 0.00 0.19
CPU1 64.15 3.29 0.00 4.26 35.85 56.59 0.00 0.00
CPU2 57.93 0.59 0.00 1.17 42.07 56.16 0.00 0.00
CPU3 34.17 0.97 0.00 1.17 65.83 31.65 0.00 0.39
CPU4 57.59 0.39 0.00 0.78 42.41 56.42 0.00 0.00
CPU5 69.79 0.39 0.00 1.17 30.21 68.23 0.00 0.00
CPU6 90.33 1.55 0.00 1.35 9.67 87.04 0.00 0.39
CPU7 45.74 0.39 0.00 1.36 54.26 43.60 0.00 0.39
CPU8 59.81 0.19 0.00 0.58 40.19 59.03 0.00 0.00
CPU9 76.41 0.58 0.00 0.58 23.59 75.24 0.00 0.00
CPU10 87.94 0.39 0.00 0.97 12.06 86.38 0.00 0.19
CPU11 53.86 0.58 0.00 1.93 46.14 51.16 0.00 0.19
CPU12 26.26 0.19 0.00 0.39 73.74 25.68 0.00 0.00
CPU13 63.99 0.39 0.00 0.78 36.01 62.62 0.00 0.20
CPU14 66.47 0.19 0.00 0.97 33.53 64.73 0.00 0.58
CPU15 59.07 0.77 0.00 0.97 40.93 57.14 0.00 0.19
CPU16 38.72 0.39 0.00 0.58 61.28 37.74 0.00 0.00
CPU17 46.99 0.19 0.00 0.97 53.01 45.83 0.00 0.00
CPU18 78.53 1.55 0.00 0.97 21.47 75.82 0.00 0.19
CPU19 66.41 0.97 0.00 1.55 33.59 63.88 0.00 0.00
CPU20 18.91 0.00 0.00 0.39 81.09 18.52 0.00 0.00
CPU21 58.22 0.58 0.00 1.16 41.78 56.29 0.00 0.19
CPU22 62.67 0.19 0.00 1.35 37.33 61.12 0.00 0.00
CPU23 46.51 1.36 0.00 1.94 53.49 43.22 0.00 0.00
CPU24 12.45 4.86 0.00 1.36 87.55 6.23 0.00 0.00
CPU25 69.46 1.36 0.00 1.36 30.54 66.73 0.00 0.00
CPU26 15.98 2.53 0.00 0.78 84.02 12.48 0.00 0.19
CPU27 44.47 0.58 0.00 0.97 55.53 42.91 0.00 0.00
CPU28 60.08 0.58 0.00 1.36 39.92 58.14 0.00 0.00
CPU29 51.36 0.39 0.00 0.78 48.64 50.19 0.00 0.00
CPU30 84.17 1.35 0.00 1.16 15.83 81.47 0.00 0.19
CPU31 38.88 0.77 0.00 1.16 61.12 36.94 0.00 0.00
CPU32 9.13 0.19 0.00 0.58 90.87 8.16 0.00 0.19
CPU33 56.03 0.78 0.00 0.78 43.97 54.47 0.00 0.00
CPU34 67.57 0.78 0.00 0.78 32.43 66.02 0.00 0.00
CPU35 54.74 0.58 0.00 1.55 45.26 52.42 0.00 0.19
CPU36 16.12 0.19 0.00 0.78 83.88 14.95 0.00 0.19
CPU37 60.27 0.39 0.00 1.16 39.73 58.33 0.00 0.39
CPU38 68.41 0.39 0.00 1.16 31.59 66.67 0.00 0.19
CPU39 61.39 0.19 0.00 1.74 38.61 59.27 0.00 0.19
CPU40 54.17 0.58 0.00 1.55 45.83 51.84 0.00 0.19
CPU41 60.04 0.00 0.00 1.17 39.96 58.87 0.00 0.00
CPU42 45.42 0.39 0.00 0.39 54.58 44.64 0.00 0.00
CPU43 76.12 5.83 0.00 2.33 23.88 67.77 0.00 0.19
CPU44 19.15 1.55 0.00 1.16 80.85 16.25 0.00 0.19
CPU45 60.43 0.39 0.00 0.78 39.57 59.06 0.00 0.19
CPU46 59.19 0.39 0.00 0.77 40.81 57.64 0.00 0.39
CPU47 99.42 0.00 0.00 0.97 0.58 98.25 0.00 0.19
Memory:
Total: 198,695,780 KB
Used: 20,092,740 KB 10.1%
Total (Excluding Swap): 131,586,920 KB
Used (Excluding Swap): 20,092,740 KB 15.3%
Hard Disk:
Total: 80,957,203,120 KB
Used: 73,599,911,924 KB 90.9%
Inode-Total: 4,286,053,632
Inode-Used: 4,185,325,226 97.6%
IOStat: tps r_tps w_tps r_kB/s w_kB/s queue wait_ms svc_ms %util sampling_sec
3274.1 2483.8 790.3 25660.9 15899.4 74.8 22.8 0.3 90.8 706.53
Flash Disk:
Total: 1,007,512 KB
Used: 374,500 KB 37.2%
Inode-Total: 65,536
Inode-Used: 35 0.1%
IOStat: tps r_tps w_tps r_kB/s w_kB/s queue wait_ms svc_ms %util sampling_sec
3.0 2.8 0.3 309.4 0.3 0.0 0.6 0.4 0.1 706.53
- CPU.
This section details the CPU usage across the system's 48 cores (CPU0 to CPU47). The overall CPU usage is 54.22%, which includes all types of CPU activity (user processes, system processes, I/O wait, etc.).
Key Metrics:
- Overall CPU Usage:
- Used: 54.22% – The total CPU utilization across all cores.
- Used (Excluding NICE): 54.22% – The same as total usage here, as no processes are running with the nice priority (which adjusts process priority).
Per-Core Breakdown: For each CPU core, the following metrics are provided (a short sanity-check sketch follows this list):
- %used: Total CPU usage for the core (e.g., CPU47 is at 99.42%, indicating it’s nearly fully utilized, while CPU32 is at 9.13%, indicating low usage).
- %user: Percentage of CPU time spent on user-level processes (e.g., running applications). Most cores have low user activity (0.19% to 5.83%).
- %nice: Percentage of CPU time spent on user processes with adjusted priority. All cores show 0.00%, meaning no nice processes are running.
- %sys: Percentage of CPU time spent on system-level processes (e.g., kernel tasks). Typically low, ranging from 0.39% to 4.26%.
- %idle: Percentage of time the CPU is idle (not processing tasks). Higher idle percentages (e.g., CPU32 at 90.87%) indicate underutilization, while low idle percentages (e.g., CPU47 at 0.58%) indicate heavy usage.
- %iowait: Percentage of time the CPU is waiting for I/O operations (e.g., disk or network). This is significant for many cores, with CPU47 at 98.25% and CPU6 at 87.04%, suggesting I/O bottlenecks.
- %irq and %softirq: Percentage of CPU time spent handling hardware and software interrupts. All cores show 0.00% or very low values, indicating minimal interrupt activity.
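For any core, %used is simply everything that is not idle: the sum of %user, %nice, %sys, %iowait, %irq, and %softirq, and %used plus %idle adds up to 100. A minimal Python sketch checking this against the CPU47 row above (small rounding differences are expected):

# CPU47 values copied from the output above
user, nice, sys, idle, iowait, irq, softirq = 0.00, 0.00, 0.97, 0.58, 98.25, 0.00, 0.19
used = user + nice + sys + iowait + irq + softirq
print(f"reconstructed %used: {used:.2f}")         # ~99.41 (reported 99.42, rounding)
print(f"%used + %idle:       {used + idle:.2f}")  # ~99.99, i.e. 100 within rounding

This makes it easy to see that almost all of CPU47's utilization is %iowait rather than real user or system work.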
Analysis:
- The system is under moderate CPU load overall (54.22%), but some cores (e.g., CPU6, CPU10, CPU18, CPU30, CPU47) are heavily utilized, with usage above 80%.
- High %iowait values on many cores (e.g., CPU47 at 98.25%, CPU6 at 87.04%) indicate that the CPUs are spending significant time waiting for I/O operations, likely disk-related. This suggests a potential bottleneck in disk performance, which could be slowing down the system.
- Low %user and %sys values indicate that the CPU load is not driven by user applications or system processes but rather by I/O wait times.
Recommendations:
- Investigate the cause of high I/O wait times. This could be due to heavy disk activity (e.g., logging, report generation, or database operations in FortiAnalyzer).
- Check if the system is handling a large volume of logs or running resource-intensive tasks (e.g., analytics, reports).
- Consider optimizing disk I/O by reducing logging verbosity, archiving old logs, or upgrading to faster storage (e.g., SSDs).
- Memory.
This section shows memory usage on the FortiAnalyzer system.
Key Metrics (a quick calculation after this list shows how the two percentages are derived):
- Total: 198,695,780 KB (~198 GB) – Total physical memory available, including swap space.
- Used: 20,092,740 KB (~20 GB, 10.1%) – Amount of memory currently in use, including swap.
- Total (Excluding Swap): 131,586,920 KB (~131 GB) – Total physical RAM (excluding swap).
- Used (Excluding Swap): 20,092,740 KB (~20 GB, 15.3%) – Physical RAM in use.
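The two percentages can be reproduced directly from the totals: the Used figure divided by the total including swap gives 10.1%, while the same figure divided by physical RAM alone gives 15.3%. A quick Python sketch:

used_kb  = 20_092_740
total_kb = 198_695_780   # RAM + swap
ram_kb   = 131_586_920   # RAM only
print(f"Used of total (incl. swap): {used_kb / total_kb * 100:.1f}%")  # 10.1
print(f"Used of physical RAM only : {used_kb / ram_kb * 100:.1f}%")    # 15.3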
Analysis:
- The system is using only 15.3% of its physical RAM, indicating that memory is not a bottleneck.
- The low memory usage suggests that the FortiAnalyzer has ample memory capacity for its current workload.
- Hard Disk.
This section describes the usage and performance of the hard disk used for storing logs, reports, and other data.
Key Metrics:
- Total: 80,957,203,120 KB (~80.9 TB) – Total disk capacity.
- Used: 73,599,911,924 KB (~73.6 TB, 90.9%) – Amount of disk space used.
- Inode-Total: 4,286,053,632 – Total number of inodes (file metadata structures).
- Inode-Used: 4,185,325,226 (97.6%) – Number of inodes in use.
- IOStat Metrics (a short sketch after this list derives a few secondary figures from these values):
- tps: 3274.1 – Transactions per second (I/O operations).
- r_tps: 2483.8 – Read transactions per second.
- w_tps: 790.3 – Write transactions per second.
- r_kB/s: 25,660.9 – Read throughput (KB per second).
- w_kB/s: 15,899.4 – Write throughput (KB per second).
- queue: 74.8 – Average number of I/O requests in the queue.
- wait_ms: 22.8 – Average time I/O requests wait in the queue (milliseconds).
- svc_ms: 0.3 – Average service time for I/O requests (milliseconds).
- %util: 90.8% – Disk utilization percentage.
- sampling_sec: 706.53 – Duration of the sampling period (seconds).
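A few secondary figures can be derived from these IOStat values and help with interpretation: the average read and write request sizes and the combined data rate. A short Python sketch using the numbers above:

r_tps, w_tps    = 2483.8, 790.3
r_kbs, w_kbs    = 25660.9, 15899.4
wait_ms, svc_ms = 22.8, 0.3
print(f"avg read size : {r_kbs / r_tps:.1f} kB")             # ~10.3 kB per read
print(f"avg write size: {w_kbs / w_tps:.1f} kB")             # ~20.1 kB per write
print(f"combined rate : {(r_kbs + w_kbs) / 1024:.1f} MB/s")  # ~40.6 MB/s
# wait_ms (22.8) >> svc_ms (0.3): requests spend far longer queued than being
# serviced, which points to a saturated device rather than slow individual I/Os.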
Analysis:
- The hard disk is 90.9% full, which is concerning as it is approaching capacity. Low disk space can lead to performance issues or system failures if logs or reports cannot be written.
- The inode usage is 97.6%, indicating that the filesystem is nearly out of inodes. This could prevent new files from being created, even if disk space is available.
- The disk is heavily utilized (90.8%), with a high transaction rate (3274.1 tps) and significant read/write activity. The high queue value (74.8) and wait_ms (22.8 ms) suggest that the disk is struggling to keep up with I/O demands, contributing to the high %iowait seen in the CPU section.
- The disk’s performance metrics indicate a bottleneck, likely exacerbating the CPU’s I/O wait times.
Recommendations:
- Urgently free up disk space by archiving or deleting old logs, reports, or unused data. FortiAnalyzer’s log management features can help automate log pruning.
- Address the inode exhaustion by removing unnecessary small files or reconfiguring the filesystem to allocate more inodes (if possible).
- Consider adding more storage capacity or migrating to faster disks (e.g., SSDs) to handle the high I/O load.
- Optimize FortiAnalyzer settings to reduce disk I/O, such as lowering log retention periods or disabling unnecessary analytics.
Command: diag test application sqlrptcached 2.
The command diag test application sqlrptcached 2 provides detailed diagnostic information about the SQL report cache daemon (sqlrptcached), which manages the caching of FortiView and report data for efficient log processing and analytics. Each section of the output is explained below, with insights into what it indicates about system performance, especially in the context of the previous get sys performance output, which highlighted disk I/O bottlenecks and high disk usage.
diag test application sqlrptcached 2
work-state: Running
Fortiview/NOC-SOC: Enabled, Report auto-cache: Disabled, FortiView auto-cache: Enabled
Number of hcache workers: 10, restart=0
Number of log table on-hold: all=978(rpt=0)
Number of log table pending: all=997(fv=997, increment-fv=0)
Number of log table read: all=1(rpt=0, fv=1, increment-fv=0)
Number of log table done: all=0(rpt=0, fv=0, increment-fv=0)
Last log table for hcache: FGTADOM65986-FGT-tlog-1703088690
Number of hcache requests sent: 1, resource waits=74
Number of log table analyze: analyze=0 check=0 pending=0 ondemand-analyze=0 runerr=0
FortiView hcache load: rounds=0, tbl=0, cleanup=0
Charts cache: batch-load=0 Adom-init=1/20 Adom-rpt-load=1, shm ncmdb ver: 1349
Total loaded auto-cache charts: FortiView=50 Report=0
ncmdb:
cache hit : sch=0, layout=0, chart=0, macro=0, dataset=0
calls : sch(foreach)=1, layout(find)=0, chart(find)=0, macro(find)=0, dataset(find)=0
rpt sched cache: ctor/dtor=0/0, count/cap=0/10000, est-elem-sz(B)=2016, est-total-sz(kB)=0
rpt layout cache: ctor/dtor=0/0, count/cap=0/500, est-elem-sz(B)=1584888, est-total-sz(kB)=0
rpt chart cache: ctor/dtor=0/0, count/cap=0/50000, est-elem-sz(B)=13616, est-total-sz(kB)=0
rpt macro cache: ctor/dtor=0/0, count/cap=0/2000, est-elem-sz(B)=3200, est-total-sz(kB)=0
rpt dataset cache: ctor/dtor=0/0, count/cap=0/50000, est-elem-sz(B)=68256, est-total-sz(kB)=0
Estimated total cache size (kB): 0
Overview.
The sqlrptcached process handles the caching of log data for FortiView (real-time network analytics) and reports (scheduled or on-demand analytics). It uses a hybrid cache (hcache) to store processed log data, reducing the need for repeated SQL queries to the disk-based log database. The output shows the current state of the caching process, including worker threads, log table processing, and cache usage.
Detailed Breakdown.
- General Status.
- work-state: Running
- The sqlrptcached process is active and operational.
- Fortiview/NOC-SOC: Enabled, Report auto-cache: Disabled, FortiView auto-cache: Enabled
- FortiView/NOC-SOC: Enabled, meaning FortiView (real-time analytics dashboards) and NOC-SOC features are active.
- Report auto-cache: Disabled: Automatic caching for scheduled reports is turned off, so reports are generated on-demand or not cached.
- FortiView auto-cache: Enabled: FortiView data is automatically cached to improve performance for real-time analytics.
- Number of hcache workers: 10, restart=0
- There are 10 worker threads processing hcache tasks (e.g., loading log tables into cache).
- restart=0: No worker threads have restarted, indicating stable operation.
Analysis: The process is running normally with FortiView caching enabled, which is likely contributing to the disk I/O load seen in the previous output (high %iowait and disk utilization). Report auto-caching being disabled means reports are not pre-cached, which could reduce some disk I/O but may slow down report generation.
- Log Table Processing.
This section shows the status of log tables being processed for caching. Log tables store raw log data, and sqlrptcached processes them for FortiView and reports.
- Number of log table on-hold: all=978 (rpt=0)
- 978 tables are on hold, meaning they are not currently being processed but are queued for future caching.
- rpt=0: None of these are for reports (consistent with report auto-cache being disabled).
- Number of log table pending: all=997 (fv=997, increment-fv=0)
- 997 tables are pending processing, all for FortiView (fv=997).
- increment-fv=0: No incremental FortiView updates are pending (incremental updates process only new log data).
- Number of log table read: all=1 (rpt=0, fv=1, increment-fv=0)
- Only 1 table is currently being read, for FortiView (fv=1).
- Number of log table done: all=0 (rpt=0, fv=0, increment-fv=0)
- No tables have completed processing for caching.
- Last log table for hcache: FGTADOM65986-FGT-tlog-1703088690
- The most recent log table being processed is identified by this name, likely corresponding to a specific FortiGate device or ADOM (Administrative Domain) with a timestamp.
Analysis:
- A large number of log tables (978 on-hold, 997 pending) indicates a significant backlog of FortiView data waiting to be cached. This backlog is likely contributing to the high disk I/O load observed in the get sys performance output (e.g., 3274.1 tps, 90.8% disk utilization).
- Only 1 table is actively being read, suggesting that the system is processing tables slowly, possibly due to disk I/O bottlenecks or limited worker thread capacity.
- The focus on FortiView (fv=997) aligns with FortiView auto-cache being enabled, but the lack of completed tables (all=0) suggests processing is stalled or very slow.
Recommendations:
- Investigate why so many tables are pending/on-hold. This could be due to insufficient processing resources or disk I/O limitations.
- Consider increasing the number of hcache workers (if configurable) or optimizing disk performance (e.g., upgrading to SSDs) to handle the backlog.
- Review FortiView settings to reduce the volume of data being cached (e.g., limit the scope of FortiView dashboards).
- Hcache Requests.
- Number of hcache requests sent: 1, resource waits=74
- Only 1 hcache request has been sent, indicating minimal active caching activity.
- resource waits=74: The process has encountered 74 instances where it had to wait for resources (likely disk I/O or memory), further evidence of a bottleneck.
Analysis:
- The low number of hcache requests (1) and high resource waits (74) strongly suggest that the caching process is being throttled by resource constraints, most likely disk I/O, given the high %iowait and disk utilization from the previous output.
- This aligns with the large backlog of pending/on-hold tables, as the system struggles to read log data from the disk.
Recommendations:
- Address the disk I/O bottleneck by reducing disk load (e.g., archiving old logs, reducing log retention) or upgrading storage.
- Monitor resource waits over time to see if they decrease after optimizing disk performance.
- Log Table Analysis.
- Number of log table analyze: analyze=0, check=0, pending=0, ondemand-analyze=0, runerr=0.
- No log tables are being analyzed, checked, or pending analysis, and there are no errors (runerr=0).
- This suggests that the analysis phase (e.g., generating analytics from cached data) is not currently active, likely because the system is stuck in the data loading phase.
Analysis:
- The lack of analysis activity is expected given the backlog in loading log tables. The system is prioritizing reading data into the cache before analyzing it.
Recommendations:
- No immediate action is needed here, but resolving the backlog in log table processing will allow analysis to proceed.
- FortiView Hcache Load.
- FortiView hcache load: rounds=0, tbl=0, cleanup=0
- No rounds of hcache loading have completed (rounds=0), no tables have been loaded (tbl=0), and no cleanup operations have occurred (cleanup=0).
- This indicates that FortiView caching is not progressing, likely due to the disk I/O bottleneck.
Analysis:
- The lack of progress in FortiView hcache loading reinforces the idea that the system is struggling to read log data from the disk.
Recommendations:
- Same as above: address disk I/O issues and review FortiView caching settings.
- Charts Cache.
- Charts cache: batch-load=0, Adom-init=1/20, Adom-rpt-load=1, shm ncmdb ver: 1349
- batch-load=0: No batch loading of charts has occurred.
- Adom-init=1/20: Only 1 out of 20 ADOMs (Administrative Domains) has initialized its cache, indicating that cache initialization is incomplete.
- Adom-rpt-load=1: One ADOM has loaded report-related data.
- shm ncmdb ver: 1349: The shared memory (shm) version of the NCMDB (Network Configuration Management Database) is 1349, which is an internal versioning detail.
- Total loaded auto-cache charts: FortiView=50, Report=0
- 50 FortiView charts have been loaded into the cache, likely for real-time dashboards.
- 0 report charts are cached, consistent with Report auto-cache: Disabled.
Analysis:
- The partial initialization of ADOMs (1/20) suggests that the system is still setting up caches for different ADOMs, which could be slowed by resource constraints.
- The 50 FortiView charts indicate active use of FortiView dashboards, contributing to the caching workload.
- The lack of report charts aligns with the disabled report auto-cache setting.
Recommendations:
- Monitor ADOM initialization progress to ensure all ADOMs complete their cache setup.
- If FortiView usage is heavy, consider limiting the number of active charts or dashboards to reduce caching demands.
- NCMDB (Network Configuration Management Database) Cache.
This section details the cache usage for various components of the NCMDB, which stores configuration data for schedules, layouts, charts, macros, and datasets.
- cache hit: sch=0, layout=0, chart=0, macro=0, dataset=0
- No cache hits for any component, meaning the cache is not being used to retrieve data (likely because caching is still in the loading phase).
- calls: sch(foreach)=1, layout(find)=0, chart(find)=0, macro(find)=0, dataset(find)=0
- One call was made to iterate through schedules (sch(foreach)=1), but no other components were queried.
- Cache Details for Each Component (a short sketch after this list estimates their maximum potential sizes):
- rpt sched cache: 0 schedules cached, capacity for 10,000.
- rpt layout cache: 0 layouts cached, capacity for 500.
- rpt chart cache: 0 charts cached, capacity for 50,000.
- rpt macro cache: 0 macros cached, capacity for 2,000.
- rpt dataset cache: 0 datasets cached, capacity for 50,000.
- Estimated total cache size (kB): 0: No memory is currently used by the NCMDB cache.
- ctor/dtor=0/0: No cache objects have been created or destroyed.
- est-elem-sz(B): Estimated size of each element (e.g., 1,584,888 bytes for layouts, 68,256 bytes for datasets).
- est-total-sz(kB): Estimated total size of each cache (currently 0 due to no cached items).
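Multiplying each cache's capacity (count/cap) by its estimated element size (est-elem-sz) gives a rough upper bound on how much memory these caches could consume if fully populated. A quick Python sketch of those ceilings (current usage is 0, so these are potential sizes, not actual usage):

# capacity and estimated element size (bytes), from the output above
caches = {
    "sched":   (10_000,     2_016),
    "layout":  (500,    1_584_888),
    "chart":   (50_000,    13_616),
    "macro":   (2_000,      3_200),
    "dataset": (50_000,    68_256),
}
for name, (cap, elem_bytes) in caches.items():
    print(f"rpt {name:7s} cache: up to ~{cap * elem_bytes / 1024**2:,.0f} MB if full")
# layout (~756 MB) and dataset (~3,255 MB) dominate the theoretical footprint.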
Analysis:
- The NCMDB cache is effectively unused (cache hit=0, count=0 for all components), indicating that the system has not yet populated these caches.
- This is likely due to the backlog in log table processing, as the system is focused on loading raw log data before caching processed configurations.
- The large capacities (e.g., 50,000 for charts, 10,000 for schedules) suggest that the system is designed to handle significant caching, but it’s not currently utilizing this capacity.
Recommendations:
- No immediate action needed for the NCMDB cache, as its lack of usage is a symptom of the log table processing backlog.
- Resolving the disk I/O bottleneck will allow the system to progress to caching configuration data.
Overall Analysis.
The diag test application sqlrptcached 2 output confirms that the FortiAnalyzer is struggling to process FortiView data due to a large backlog of log tables (978 on-hold, 997 pending) and resource constraints (74 resource waits). This aligns with the get sys performance output, which showed:
- High disk utilization (90.8%) and I/O wait times (up to 98.25% on some CPU cores).
- A nearly full hard disk (90.9%) and inode exhaustion (97.6%).
The primary issues are:
- Disk I/O Bottleneck: The large number of pending log tables and high resource waits indicate that the system cannot read log data from the disk quickly enough, slowing down FortiView caching.
- Disk Space and Inode Exhaustion: The nearly full disk and inode usage limit the system’s ability to write new data or create new files, exacerbating the backlog.
- FortiView Load: Enabled FortiView auto-caching and 50 active charts are driving significant disk I/O, contributing to the bottleneck.
Action Plan.
- Address Disk Space and Inode Issues (Critical):
- Free up disk space: Archive or delete old logs/reports using FortiAnalyzer’s log management tools (e.g., set shorter retention periods).
- Resolve inode exhaustion: Identify and remove unnecessary small files or reconfigure the filesystem to allocate more inodes (if possible).
- Add storage: If feasible, expand disk capacity or migrate to faster storage (e.g., SSDs) to handle the high I/O load.
- Optimize FortiView Caching:
- Review FortiView settings to reduce the number of active charts or dashboards (currently 50 charts).
- Limit the scope of FortiView data (e.g., fewer devices, shorter time ranges) to reduce caching demands.
- If performance remains poor, consider temporarily disabling FortiView auto-cache to alleviate disk I/O load, though this may impact real-time analytics.
- Monitor and Tune sqlrptcached:
- Check if the number of hcache workers (10) can be increased to process the backlog faster (consult FortiAnalyzer documentation or Fortinet support).
- Monitor the backlog of pending/on-hold tables over time to confirm that optimizations are effective.
- Investigate Resource Waits:
- The 74 resource waits suggest disk I/O is the primary limiter. After addressing disk issues, re-run diag test application sqlrptcached 2 to verify if resource waits decrease.
- Long-Term Improvements:
- Upgrade to faster storage (e.g., SSDs or high-performance RAID) to handle FortiAnalyzer’s I/O demands.
- Scale the FortiAnalyzer deployment (e.g., add more collectors or analyzers in a distributed setup) if the current hardware is insufficient for the log volume.
Command: diagnose debug application sqlrptcached 8.
The command 'diag debug application sqlrptcached 8' enables debug logging for the sqlrptcached process at level 8, which likely focuses on performance-related metrics, such as I/O wait usage. The output shows repeated messages indicating that the I/O wait percentage is exceeding a predefined threshold:
FAZ-3500G # diag debug application sqlrptcached 8
FAZ-3500G # diag debug enable
FAZ-3500G # [1785] iowait usage (45.4%) is over limit (13%).
[1785] iowait usage (45.4%) is over limit (13%).
[1785] iowait usage (56.3%) is over limit (13%).
[1785] iowait usage (56.3%) is over limit (13%).
[1785] iowait usage (60.4%) is over limit (13%).
[1785] iowait usage (60.4%) is over limit (13%).
[1785] iowait usage (56.0%) is over limit (13%).
[1785] iowait usage (56.0%) is over limit (13%).
[1785] iowait usage (56.0%) is over limit (13%).
[1785] iowait usage (37.9%) is over limit (13%).
[1785] iowait usage (37.9%) is over limit (13%).
[1785] iowait usage (37.9%) is over limit (13%).
[1785] iowait usage (51.9%) is over limit (13%).
[1785] iowait usage (51.9%) is over limit (13%).
[1785] iowait usage (43.8%) is over limit (13%).
[1785] iowait usage (43.8%) is over limit (13%).
[1785] iowait usage (43.8%) is over limit (13%).
[1785] iowait usage (47.4%) is over limit (13%).
diag debug disable
This output provides debug information related to the sqlrptcached process, specifically focusing on I/O wait usage, which is critical given the disk I/O bottlenecks observed in the previous outputs (get sys performance and diag test application sqlrptcached 2). Below, this output will be explained, its implications will be analyzed, and some recommendations will be provided, tying it back to the earlier data.
- Format of Messages:
- [1785] iowait usage (X%) is over limit (13%)
- [1785]: Likely the process ID (PID) of the sqlrptcached process.
- iowait usage (X%): The percentage of time the CPU is waiting for I/O operations (e.g., disk reads/writes) associated with the sqlrptcached process.
- is over limit (13%): The I/O wait usage exceeds the FortiAnalyzer’s configured threshold of 13%, triggering these warnings.
- Observed I/O Wait Values:
- The I/O wait percentages reported are: 45.4%, 56.3%, 60.4%, 56.0%, 37.9%, 51.9%, 43.8%, and 47.4%.
- These values are significantly higher than the threshold of 13%, with the highest being 60.4% and the lowest 37.9%.
- Repetition:
- Some values (e.g., 45.4%, 56.3%, 60.4%, 56.0%, 37.9%, 43.8%) appear multiple times, suggesting that these high I/O wait conditions persisted over multiple checks during the debug period. A small sketch for summarizing a captured debug log follows this list.
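If the debug output is captured to a file, the repeated messages can be summarized offline. A minimal Python sketch, assuming the lines follow the exact format shown above (the file name iowait_debug.txt is only a placeholder):

import re

# Matches lines such as: [1785] iowait usage (45.4%) is over limit (13%).
pattern = re.compile(r"iowait usage \((\d+(?:\.\d+)?)%\) is over limit \((\d+(?:\.\d+)?)%\)")

values = []
with open("iowait_debug.txt") as f:   # captured 'diag debug' output
    for line in f:
        m = pattern.search(line)
        if m:
            values.append(float(m.group(1)))

if values:
    avg = sum(values) / len(values)
    print(f"{len(values)} samples, iowait min/avg/max: "
          f"{min(values):.1f} / {avg:.1f} / {max(values):.1f} %")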
Analysis.
This output confirms and reinforces the disk I/O bottleneck identified in the earlier outputs:
- High I/O Wait:
- The I/O wait percentages (ranging from 37.9% to 60.4%) are alarmingly high, far exceeding the 13% threshold. This indicates that the sqlrptcached process is spending a significant amount of time waiting for disk I/O operations, which slows down log table processing and caching for FortiView.
- These values align with the get sys performance output, where many CPU cores showed high %iowait (e.g., 98.25% on CPU47, 87.04% on CPU6), and the disk was at 90.8% utilization with a high transaction rate (3274.1 tps).
- Impact on sqlrptcached:
- The diag test application sqlrptcached 2 output showed a large backlog of log tables (978 on-hold, 997 pending) and 74 resource waits, indicating that the sqlrptcached process is struggling to read log data from the disk.
- The high I/O wait values in this debug output explain why only 1 log table was being read and no tables were completed (done=0), as the process is bottlenecked by slow disk access.
- Disk Constraints:
- The get sys performance output highlighted critical disk issues:
- 90.9% disk usage (73.6 TB used out of 80.9 TB).
- 97.6% inode usage (4.18 billion inodes used out of 4.29 billion), limiting the ability to create new files.
- These conditions likely exacerbate the I/O wait issue, as the nearly full disk and inode exhaustion slow down read/write operations.
- FortiView Load:
- The diag test application sqlrptcached 2 output indicated that FortiView auto-cache is enabled with 50 active charts and 997 pending FortiView log tables. This heavy FortiView workload is a primary driver of the disk I/O load, as sqlrptcached attempts to cache large amounts of log data for real-time analytics.
- Threshold Violation:
- The 13% I/O wait threshold is a FortiAnalyzer internal limit to ensure smooth operation. Consistently exceeding this threshold (by up to 60.4%) suggests that the system is under significant strain, likely impacting performance for FortiView, report generation, and other analytics tasks.
Implications.
- Performance Degradation: The high I/O wait times are slowing down the sqlrptcached process, leading to delays in FortiView data availability and potential lags in real-time analytics dashboards.
- Risk of System Instability: The nearly full disk and inode exhaustion increase the risk of system failures if new logs or files cannot be written.
- Resource Contention: The disk bottleneck is affecting CPU performance (high %iowait across cores), reducing the system’s ability to handle other tasks efficiently.
Recommendations.
Building on the action plan from the previous outputs, the following is a tailored set of recommendations to address the high I/O wait issues highlighted in this debug output:
- Urgently Address Disk Space and Inode Issues:
- Free up disk space:
- Use FortiAnalyzer’s log management tools (GUI or CLI, e.g., execute log delete) to archive or delete old logs and reports. Focus on reducing the 90.9% disk usage.
- Set shorter log retention periods (e.g., reduce from months to weeks) to prevent rapid disk fill-up.
- Resolve inode exhaustion:
- Identify directories with many small files (e.g., log fragments) using commands like find / -type f | wc -l or FortiAnalyzer’s filesystem tools.
- Delete unnecessary files or consolidate logs to free up inodes (currently at 97.6%).
- If possible, reconfigure the filesystem to allocate more inodes (consult Fortinet support or documentation, as this may require reformatting).
- Add storage capacity:
- If budget allows, expand disk capacity or migrate to high-performance storage (e.g., SSDs or RAID arrays optimized for I/O).
- Consider offloading logs to a secondary FortiAnalyzer or FortiCloud for long-term storage.
- Reduce FortiView Load:
- Limit FortiView auto-caching:
- Review the 50 active FortiView charts (diag test application sqlrptcached 2) and disable unnecessary dashboards or reduce their scope (e.g., fewer devices, shorter time ranges).
- Temporarily disable FortiView auto-cache (via GUI: FortiView > Settings) to reduce disk I/O load, though this may impact real-time analytics.
- Optimize FortiView workload:
- Check which devices or ADOMs are generating the most logs (use get log device stats or GUI log statistics).
- Reduce log ingestion for high-volume devices (e.g., adjust logging levels on FortiGates to exclude low-priority logs).
- Clear backlog:
- The 997 pending FortiView log tables indicate a significant backlog. After addressing disk I/O, monitor the backlog using diag test application sqlrptcached 2 to ensure it decreases.
- Optimize sqlrptcached Performance:
- Increase hcache workers:
- The current 10 hcache workers may be insufficient for the workload. Check FortiAnalyzer documentation or contact Fortinet support to see if this can be increased (e.g., via hidden CLI commands).
- Monitor resource waits:
- The 74 resource waits from the previous output are likely tied to the high I/O wait. Re-run diag test application sqlrptcached 2 after disk optimizations to verify if waits decrease.
- Tune I/O wait threshold:
- The 13% I/O wait threshold may be too conservative for your workload. Consult Fortinet support to determine if it can be adjusted (though this is unlikely to resolve the root issue).
- Improve Disk I/O Performance:
- Upgrade storage:
- The high disk utilization (90.8%) and transaction rate (3274.1 tps) suggest the current disk subsystem is overwhelmed. SSDs or high-performance RAID configurations can significantly reduce I/O wait times.
- Optimize I/O scheduling:
- Check if the FortiAnalyzer’s disk I/O scheduler can be tuned (e.g., switching to deadline or noop for better performance). This requires advanced access and Fortinet support guidance.
- Reduce disk I/O load:
- Disable unnecessary analytics features (e.g., unused FortiView widgets, non-critical reports).
- Schedule resource-intensive tasks (e.g., report generation) during off-peak hours.
- Monitor and Validate Fixes:
- Re-run diagnostics:
- After implementing changes, re-run diag debug application sqlrptcached 8 to check if I/O wait percentages drop below 13%.
- Monitor get sys performance to confirm reduced %iowait across CPU cores and lower disk utilization.
- Use diag test application sqlrptcached 2 to verify that the backlog of pending log tables (997) decreases.
- Check system logs:
- Review FortiAnalyzer logs (via GUI or get system log) for additional errors related to disk I/O or sqlrptcached failures.
- Engage Fortinet support:
- Share these outputs (get sys performance, diag test application sqlrptcached 2, and diag debug application sqlrptcached 8) with Fortinet support for deeper analysis. They may provide firmware updates, configuration tweaks, or hardware recommendations.
- Long-Term Strategy:
- Scale FortiAnalyzer deployment:
- If the FAZ-3500G is consistently overwhelmed, consider a distributed setup with additional FortiAnalyzer units (e.g., a collector for log ingestion and an analyzer for processing).
- Implement log management policies:
- Establish strict log retention and archiving policies to prevent disk and inode exhaustion.
- Upgrade hardware:
- The FAZ-3500G is a high-end model, but its disk subsystem may not be optimized for your workload. Evaluate newer models or cloud-based solutions like FortiAnalyzer Cloud.
Tying to Previous Outputs.
- get sys performance:
- The high %iowait (e.g., 98.25% on CPU47) and disk utilization (90.8%) directly correlate with the 37.9%–60.4% I/O wait in this output, confirming that disk I/O is the primary bottleneck.
- The nearly full disk (90.9%) and inode exhaustion (97.6%) are critical issues that must be addressed to reduce I/O wait.
- diag test application sqlrptcached 2:
- The large backlog of log tables (978 on-hold, 997 pending) and 74 resource waits are symptoms of the high I/O wait, as sqlrptcached cannot process data quickly enough.
- The focus on FortiView (50 charts, FortiView auto-cache enabled) is driving the disk I/O load, which this debug output quantifies as exceeding the 13% threshold.
Command: diagnose test application oftpd 2.
This command provides detailed diagnostic information about the oftpd process, which is the FortiAnalyzer's daemon responsible for handling log and data transfers from FortiGate devices and other managed devices, as well as processing REST API requests and other communications. This output is particularly relevant given the disk I/O bottlenecks and high I/O wait times observed in the previous outputs (get sys performance, diag test application sqlrptcached 2, and diag debug application sqlrptcached 8). Below, the output is explained and analyzed, with recommendations to address the issues, tying them back to the earlier findings.
The oftpd process (Object File Transfer Protocol Daemon) manages incoming log data, keepalive messages, REST API requests, and other communications between the FortiAnalyzer and managed devices (e.g., FortiGates). The diagnose test application oftpd 2 output provides a snapshot of the process’s activity, including session statistics, worker thread performance, request statistics, and thread pool usage. This output helps us understand the load on the FortiAnalyzer from log ingestion and API interactions, which likely contribute to the disk I/O bottleneck.
Detailed Breakdown.
- Session Statistics.
- session: max=160021, used=14815, idle-closed=632318, now = 1707931908 (2024/02/14 14:31:48)
- max=160021: The maximum number of concurrent sessions the oftpd process can handle.
- used=14815: Currently, 14,815 sessions are active, which is a significant but not overwhelming portion of the maximum capacity (about 9.3%).
- idle-closed=632318: A total of 632,318 sessions have been closed due to inactivity, indicating high session turnover.
- now = 1707931908 (2024/02/14 14:31:48): Timestamp of the output. A short sketch reproducing the session percentage and this timestamp follows this list.
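Both the utilization percentage and the human-readable timestamp can be reproduced from the raw values. A quick Python sketch (datetime.fromtimestamp renders the epoch value in the local timezone of the machine running the script, so it may differ from the appliance's own display):

from datetime import datetime

used, maximum = 14815, 160021
print(f"session usage: {used / maximum * 100:.1f}%")   # ~9.3%

now_epoch = 1707931908                                  # 'now' field from the output
print(datetime.fromtimestamp(now_epoch))                # the appliance shows 2024/02/14 14:31:48 in its own timezone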
Analysis:
- The 14,815 active sessions suggest a high volume of devices (e.g., FortiGates) sending logs or communicating with the FortiAnalyzer. This is consistent with a large-scale deployment.
- The session count is well below the maximum (160,021), so session capacity is not a bottleneck.
- The large number of idle-closed sessions indicates frequent connections and disconnections, which is typical in environments with many devices sending logs periodically.
- Worker Threads (wkr).
- wkr.0–wkr.3: Four worker threads are handling oftpd tasks.
- num-runs: Number of times each worker has executed its main loop:
- wkr.0: 96,226,884
- wkr.1: 94,111,752
- wkr.2: 94,010,605
- wkr.3: 93,399,325
- last-run=0s ago: All workers ran recently, indicating active processing.
- mainloop max-run: Maximum time a single loop execution took:
- wkr.0: 1.380s
- wkr.1: 1.570s
- wkr.2: 1.420s
- wkr.3: 1.570s
- avg-run=0.000s: Average loop execution time is negligible, suggesting most tasks are quick.
Analysis:
- The high num-runs (over 93 million per worker) indicates that the workers are constantly processing tasks, likely due to the high volume of incoming logs and requests.
- The max-run times (1.38–1.57 seconds) are relatively long for individual tasks, suggesting occasional delays, possibly due to disk I/O bottlenecks (as seen in previous outputs).
- The low avg-run time (0.000s) indicates that most tasks complete quickly, but the occasional long-running tasks could contribute to resource contention.
- Socket File Descriptors.
- sock-fd: new=1,472,026, del=1,456,619
- new=1,472,026: Number of new socket file descriptors created.
- del=1,456,619: Number of socket file descriptors closed.
- The difference (15,407) roughly aligns with the 14,815 active sessions, as each session typically uses a socket.
Analysis:
- The high number of socket creations and deletions reflects the dynamic nature of device connections, consistent with the 632,318 idle-closed sessions.
- No issues are apparent here, as socket management appears stable.
- Device Management (DVM).
- dvm: upd-ip=0, upd-devname=1925, ign-thrtl=0, ign-hamb=0, find-devid=0, find-devid-ev=0, find-ha-member=0
- upd-devname=1925: The FortiAnalyzer has updated the device name for 1,925 devices, likely reflecting the number of managed devices (e.g., FortiGates).
- Other metrics (e.g., upd-ip, ign-thrtl, find-devid) are zero, indicating no significant activity in those areas (e.g., no IP updates, no throttling, no device ID lookups).
Analysis:
- The 1,925 devices suggest a large-scale deployment, which explains the high volume of log data and sessions.
- No throttling (ign-thrtl=0) or HA-related issues (ign-hamb=0) are present, so device management is functioning normally.
- CMDB Cache Events.
- cmdb cache event: add=2, edit=1924, del=1
- The Configuration Management Database (CMDB) cache has recorded:
- 2 additions of new entries.
- 1,924 edits to existing entries.
- 1 deletion.
- These events likely correspond to device configuration changes or updates.
Analysis:
- The 1,924 edits align with the 1,925 device name updates, suggesting frequent device status or configuration changes.
- The CMDB cache is functioning as expected, with no apparent issues.
- Unregistered Devices and REST API.
- unreg-dev: num=0: No unregistered devices are attempting to connect.
- restapi-batch-req-max: 1: The maximum number of batched REST API requests is 1, indicating no batching is occurring.
Analysis:
- No unregistered devices is a good sign, as it means all connections are from authorized devices.
- The low REST API batching suggests that API requests are processed individually, which may increase processing overhead but is not a major issue.
- Request Statistics.
This section details the types and volumes of requests handled by oftpd, along with their processing times and status.
- Request Types and Metrics:
- LOGIN_REQUEST_LEGACY (1,421,905):
- Legacy login requests from devices, averaging 0.000s, max 0.050s.
- No active or rejected requests.
- STORE_REQUEST (144,484,783):
- Log storage requests, the most frequent type, averaging 0.000s, max 0.000s.
- 150 active (max 150), no rejections.
- These are logs being sent from devices to be stored on the FortiAnalyzer’s disk.
- KEEPALIVE (189,478,566):
- Keepalive messages to maintain device connections, averaging 0.000s, max 0.000s.
- No active or rejected requests.
- Highest volume, indicating frequent device check-ins.
- DISK_USAGE_REQUEST (14,530):
- Requests for disk usage information, averaging 0.000s, max 0.020s.
- No active or rejected requests.
- FGT RELIABLE LOG REQUEST (1,045,984):
- Reliable log transfer requests, averaging 0.000s, max 0.000s.
- No active or rejected requests.
- CHANGE NOTICE (44,770,957):
- Notifications of configuration or status changes, averaging 0.000s, max 0.000s.
- No active or rejected requests.
- RESTAPI REQUEST (281,854):
- REST API requests, averaging 0.620s, max 28.520s (significantly higher than others).
- No active or rejected requests.
- Long max time suggests occasional complex or slow API calls.
- RESTAPI RESPONSE (8,310,932):
- REST API responses, averaging 0.000s, max 0.000s.
- 141 active, no rejections.
- GET STATUS (133,349):
- Status queries, averaging 0.000s, max 0.000s.
- No active or rejected requests.
- HAINFO NOTIFICATION (14,473):
- High Availability (HA) notifications, averaging 0.000s, max 0.000s.
- No active or rejected requests.
- FGT INTERFACE STATE (1,003,639):
- Interface state updates from FortiGates, averaging 0.000s, max 0.000s.
- 9 active, no rejections.
Analysis:
- High Volume Requests:
- KEEPALIVE (189M) and STORE_REQUEST (144M) dominate, reflecting the constant communication and log ingestion from 1,925 devices.
- The 144M STORE_REQUESTs indicate a massive volume of logs being written to disk, directly contributing to the high disk I/O load (90.8% utilization, 3274.1 tps) and I/O wait times (37.9%–60.4%) seen in previous outputs.
- RESTAPI REQUEST:
- The 0.620s average and 28.520s max processing times are concerning, as they suggest that some API calls are significantly slower, possibly due to disk I/O bottlenecks or complex queries.
- The 8.3M RESTAPI RESPONSEs with 141 active indicate ongoing API activity, likely from GUI users or external integrations.
- No Rejections:
- The absence of rejected requests (rejected=0) is positive, indicating that the oftpd process is handling the load without dropping requests, despite the disk bottleneck.
- Login Statistics.
- login: total=1,421,899, rejected=69
- 1,421,899 login attempts, with only 69 rejections (0.005%), indicating stable authentication.
- feature-set: total=46,738, send=46,727, error=0
- Feature set updates (e.g., device capabilities) are processed successfully with no errors.
- fos keep-alive interruptions: 0
- No interruptions in FortiOS keepalive messages, indicating stable device connections.
Analysis:
- Login and feature-set processing is stable, with minimal rejections and no errors.
- The low rejection rate suggests that device authentication is not a bottleneck.
- Error Statistics.
- ssl-negotiation: 16,591.
- 16,591 SSL negotiation errors occurred during session establishment, likely due to network issues, certificate mismatches, or timeouts.
Analysis:
- The SSL errors are a small fraction of the 1.4M login requests (1.2%), but they indicate potential network or configuration issues with some devices.
- These errors may contribute to session turnover (632,318 idle-closed) but are unlikely to be a major driver of the disk I/O bottleneck.
- Helper Threads Statistics.
The oftpd process uses two thread pools to handle tasks:
- Thread Pool 0:
- Threads running=3 (max 300, capacity 400).
- #new=28, #quit=25: Moderate thread turnover.
- queued-task=0: No tasks are queued, indicating low load.
- Threads are mostly idle, with task processing times ranging from 1m18s to 25m46s and low load (0.01).
- Thread Pool 1:
- Threads running=300 (max 300, capacity 400).
- #new=108,570, #quit=108,270: High thread turnover, with recent quits (19.755s ago).
- queued-task=4,392, oldest-queued-task=0.577s: A significant backlog of 4,392 tasks is waiting to be processed, with the oldest task queued for 0.577s.
- Threads are heavily utilized, handling STORE_REQUEST and RESTAPI RESPONSE tasks:
- Task processing times: Range from 14m30s to 39m30s.
- Load: High, ranging from 0.67 to 0.78, indicating near-capacity operation.
- Active tasks: Include STORE_REQUEST (log storage) and RESTAPI RESPONSE, with processing times up to 5.302s for STORE_REQUEST and 2.225s for RESTAPI RESPONSE.
- Extra-info: IP addresses and ports of devices (e.g., 10.127.2.86:2118) indicate specific FortiGates sending logs or API requests.
Analysis:
- Thread Pool 0: Lightly loaded with only 3 threads and no queued tasks, suggesting it handles low-priority or infrequent tasks.
- Thread Pool 1: Fully utilized with 300 threads and a backlog of 4,392 tasks, indicating significant pressure on the oftpd process.
- The high load (0.67–0.78) and long task processing times (up to 39m30s) for STORE_REQUEST and RESTAPI RESPONSE suggest that disk I/O bottlenecks are slowing down log storage and API responses.
- The 4,392 queued tasks correlate with the high disk I/O load (90.8% utilization, 3274.1 tps) and I/O wait times (37.9%–60.4%), as log storage tasks are likely waiting for disk access.
- The RESTAPI RESPONSE tasks with long processing times (e.g., 2.225s) align with the 28.520s max for REST API requests, indicating that API performance is also affected by the disk bottleneck.
Overall Analysis.
The diagnose test application oftpd 2 output highlights the significant load on the FortiAnalyzer from log ingestion and API activity, which is exacerbating the disk I/O bottleneck observed in previous outputs:
- High Log Ingestion Load:
- The 144M STORE_REQUESTs from 1,925 devices indicate a massive volume of logs being written to disk, directly contributing to the high disk utilization (90.8%) and I/O wait times (37.9%–60.4%).
- The 4,392 queued tasks in Thread Pool 1, many of which are STORE_REQUEST tasks, suggest that the oftpd process is struggling to keep up with log storage due to disk constraints.
- REST API Load:
- The 281,854 RESTAPI REQUESTs and 8.3M RESTAPI RESPONSEs, with some taking up to 28.520s, indicate significant API activity, likely from GUI users or external integrations.
- The long processing times for API tasks in Thread Pool 1 (e.g., 2.225s) suggest that disk I/O bottlenecks are slowing down API responses.
- Disk Constraints:
- The nearly full disk (90.9%) and inode exhaustion (97.6%) are critical issues, limiting the FortiAnalyzer’s ability to write logs and process tasks efficiently.
- The high disk I/O load is affecting both oftpd (log storage) and sqlrptcached (FortiView caching), as seen in the 997 pending log tables and 74 resource waits from the sqlrptcached output.
- Thread Pool Pressure:
- Thread Pool 1’s full utilization (300 threads, 4,392 queued tasks) indicates that the oftpd process is at capacity, unable to process tasks faster due to disk I/O limitations.
- FortiView and API Interaction:
- The high log ingestion load (STORE_REQUEST) feeds into the FortiView caching process (sqlrptcached), which is struggling with a backlog of 997 pending tables due to the same disk bottleneck.
- REST API activity may be driven by FortiView dashboards or external queries, further taxing the system.
Recommendations.
Building on the action plans from previous outputs, the following is a tailored set of recommendations to address the issues highlighted in the oftpd output, focusing on reducing log ingestion load, optimizing API performance, and alleviating disk I/O bottlenecks:
- Urgently Address Disk Space and Inode Issues (Critical):
- Free up disk space:
- Use FortiAnalyzer’s GUI or CLI (execute log delete) to archive or delete old logs and reports, reducing the 90.9% disk usage.
- Set shorter log retention periods (e.g., 7–30 days instead of months) to prevent rapid disk fill-up.
- Resolve inode exhaustion:
- Identify directories with many small files using find / -type f | wc -l or FortiAnalyzer’s filesystem tools.
- Delete unnecessary files or consolidate logs to free up inodes (97.6% used).
- If feasible, reconfigure the filesystem to allocate more inodes (consult Fortinet support, as this may require reformatting).
- Add storage capacity:
- Expand disk capacity or migrate to high-performance storage (e.g., SSDs or RAID arrays) to handle the high I/O load (3274.1 tps).
- Consider offloading logs to FortiAnalyzer Cloud or a secondary FortiAnalyzer for long-term storage.
- Reduce Log Ingestion Load:
- Optimize device logging:
- Review the 1,925 managed devices and identify high-volume log sources using get log device stats or the FortiAnalyzer GUI (Log View > Device Statistics).
- Reduce logging levels on FortiGates for non-critical events (e.g., disable debug or informational logs, focus on security and traffic logs).
- Limit STORE_REQUEST volume:
- The 144M STORE_REQUESTs are a primary driver of disk I/O. Implement log filters on FortiGates to send only essential logs to the FortiAnalyzer.
- Schedule log uploads from devices during off-peak hours (if possible) to distribute the load.
- Monitor queued tasks:
- The 4,392 queued tasks in Thread Pool 1 indicate a backlog. After reducing log volume, re-run diagnose test application oftpd 2 to verify if the backlog decreases.
- Optimize REST API Performance:
- Investigate slow API requests:
- The 28.520s max for RESTAPI REQUESTs and long processing times (e.g., 2.225s in Thread Pool 1) suggest complex or disk-intensive queries.
- Check the FortiAnalyzer GUI or external integrations (e.g., FortiManager, third-party tools) for heavy API usage. Limit API calls for non-essential queries.
- Reduce API load:
- Restrict GUI access to essential users or schedule API-heavy tasks (e.g., custom reports) during low-traffic periods.
- If FortiView dashboards are driving API requests, reduce the number of active charts (50 currently, per sqlrptcached output).
- Tune API batching:
- The restapi-batch-req-max: 1 indicates no batching. Consult Fortinet support to see if batching can be enabled to reduce API overhead.
- Optimize FortiView Integration:
- Reduce FortiView load:
- The sqlrptcached output showed 997 pending FortiView log tables and 50 active charts. Disable unnecessary FortiView dashboards or limit their scope (e.g., fewer devices, shorter time ranges).
- Temporarily disable FortiView auto-cache (GUI: FortiView > Settings) to reduce disk I/O, though this may impact real-time analytics.
- Coordinate with log ingestion:
- The high STORE_REQUEST volume feeds into FortiView caching. Reducing log ingestion (as above) will alleviate pressure on both oftpd and sqlrptcached.
- Improve Disk I/O Performance:
- Upgrade storage:
- The high disk utilization (90.8%) and I/O wait times (37.9%–60.4%) indicate that the current disk subsystem is overwhelmed. SSDs or high-performance RAID can reduce I/O wait.
- Optimize I/O scheduling:
- Check if the disk I/O scheduler can be tuned (e.g., to deadline or noop). This requires Fortinet support guidance.
- Reduce disk I/O load:
- Disable non-critical analytics features (e.g., unused FortiView widgets, scheduled reports).
- Offload report generation to a secondary FortiAnalyzer or schedule it during off-peak hours.
- Tune oftpd Performance:
- Increase thread capacity:
- Thread Pool 1 is at capacity (300 threads, 4,392 queued tasks). Check with Fortinet support if the maximum thread count (currently 300, capacity 400) can be increased or if additional worker threads can be allocated.
- Monitor SSL errors:
- The 16,591 SSL negotiation errors are minor but should be investigated. Check device certificates, network stability, or FortiAnalyzer SSL settings for misconfigurations.
- Balance worker load:
- The four workers (wkr.0–wkr.3) are heavily utilized. If configurable, consider adding more workers to distribute the load (consult Fortinet support).
- Monitor and Validate Fixes:
- Re-run diagnostics:
- After implementing changes, re-run diagnose test application oftpd 2 to check if queued tasks (4,392) and long processing times (e.g., RESTAPI RESPONSE) decrease.
- Monitor get sys performance for reduced %iowait and disk utilization.
- Re-run diag debug application sqlrptcached 8 to verify that I/O wait drops below 13%.
- Check system logs:
- Review FortiAnalyzer logs (GUI or get system log) for errors related to oftpd, disk I/O, or SSL issues.
- Engage Fortinet support:
- Share all outputs (get sys performance, diag test application sqlrptcached 2, diag debug application sqlrptcached 8, and diagnose test application oftpd 2) with Fortinet support for tailored recommendations, especially regarding thread tuning, SSL errors, or firmware updates.
- Long-Term Strategy:
- Scale FortiAnalyzer deployment:
- The FAZ-3500G is handling 1,925 devices and 144M STORE_REQUESTs, which may be pushing its limits. Consider a distributed setup with additional FortiAnalyzer units (e.g., collectors for log ingestion, analyzers for processing).
- Implement log management policies:
- Establish strict log retention and archiving policies to prevent disk and inode exhaustion.
- Upgrade hardware:
- Evaluate newer FortiAnalyzer models or cloud-based solutions (e.g., FortiAnalyzer Cloud) for better performance in large-scale deployments.
Tying to Previous Outputs.
- get sys performance:
- The high disk utilization (90.8%) and I/O wait (up to 98.25% on some CPU cores) are driven by the 144M STORE_REQUESTs in this output, as log storage tasks overwhelm the disk.
- The nearly full disk (90.9%) and inode exhaustion (97.6%) exacerbate the bottleneck, slowing down oftpd tasks like STORE_REQUEST.
- diag test application sqlrptcached 2:
- The 997 pending FortiView log tables and 74 resource waits are downstream effects of the high log ingestion load (STORE_REQUEST) from oftpd, as logs must be written to disk before sqlrptcached can process them.
- The 50 FortiView charts and enabled auto-caching increase disk I/O, compounding the pressure from oftpd.
- diag debug application sqlrptcached 8:
- The high I/O wait (37.9%–60.4%, exceeding 13% threshold) directly correlates with the 4,392 queued tasks in oftpd’s Thread Pool 1, as log storage tasks wait for disk access.
- The sqlrptcached backlog is a symptom of the same disk bottleneck affecting oftpd.
Command: diagnose test application oftpd 5.
diagnose test application oftpd 5
Request-stats (now = 1707931909(2024/02/14 14:31:49)):
type num average(s) max(s) dev running rejected (server busy)
------------------------------------------------------------------------------------------------------------------
LOGIN_REQUEST_LEGACY 1421908 0.000 0.050 FGT60FTK2xxxxx6R 0 0
STORE_REQUEST 144489476 0.000 0.000 150 (150 max) 0
KEEPALIVE 189479541 0.000 0.000 0 0
DISK_USAGE_REQUEST 14530 0.000 0.020 FGT60FTK2xxxxxGX 0 0
FGT RELIABLE LOG REQUEST 1045987 0.000 0.000 0 0
CHANGE NOTICE 44771112 0.000 0.000 0 0
RESTAPI REQUEST 281854 0.620 28.520 FGT60FTK2xxxxx56 0 0
RESTAPI RESPONSE 8311005 0.000 0.000 141 0
GET STATUS 133349 0.000 0.000 0 0
HAINFO NOTIFICATION 14473 0.000 0.000 0 0
FGT INTERFACE STATE 1003644 0.000 0.000 9 0
This command provides a focused view of the request statistics for the oftpd process, which handles log and data transfers from managed devices (e.g., FortiGates) and processes REST API requests. This output is very similar to the request statistics section of the previous diagnose test application oftpd 2 output, but it was captured one second later (2024/02/14 14:31:49 vs. 14:31:48). Below, this output is analyzed and compared with the previous outputs, with recommendations tying it back to the ongoing disk I/O bottleneck and high I/O wait issues identified earlier.
Detailed Breakdown.
Below is a comparison of the request statistics between the current diagnose test application oftpd 5 output and the previous diagnose test application oftpd 2 output, highlighting changes and key observations (a short sketch deriving the per-second rates follows the table):
Type Num (oftpd 5) Num (oftpd 2) Delta Average(s) Max(s) Dev Running Rejected
LOGIN_REQUEST_LEGACY 1,421,908 1,421,905 +3 0.000 0.050 FGT60FTK2xxxxx6R 0 0
STORE_REQUEST 144,489,476 144,484,783 +4,693 0.000 0.000 - 150 (150 max) 0
KEEPALIVE 189,479,541 189,478,566 +975 0.000 0.000 - 0 0
DISK_USAGE_REQUEST 14,530 14,530 0 0.000 0.020 FGT60FTKxxxxx5GX 0 0
FGT RELIABLE LOG REQUEST 1,045,987 1,045,984 +3 0.000 0.000 - 0 0
CHANGE NOTICE 44,771,112 44,770,957 +155 0.000 0.000 - 0 0
RESTAPI REQUEST 281,854 281,854 0 0.620 28.520 FGT60FTK2xxxxx56 0 0
RESTAPI RESPONSE 8,311,005 8,310,932 +73 0.000 0.000 - 141 0
GET STATUS 133,349 133,349 0 0.000 0.000 - 0 0
HAINFO NOTIFICATION 14,473 14,473 0 0.000 0.000 - 0 0
FGT INTERFACE STATE 1,003,644 1,003,639 +5 0.000 0.000 - 9 0
Key Observations.
- Request Volume Changes:
- STORE_REQUEST: Increased by 4,693 (from 144,484,783 to 144,489,476), indicating a high rate of log storage requests (~4,693 per second; an extrapolation of this rate follows this list). This is a primary driver of the disk I/O load.
- KEEPALIVE: Increased by 975, reflecting frequent device check-ins (~975 per second).
- RESTAPI RESPONSE: Increased by 73, showing ongoing API activity.
- CHANGE NOTICE: Increased by 155, indicating device configuration or status updates.
- Smaller increases in LOGIN_REQUEST_LEGACY (+3), FGT RELIABLE LOG REQUEST (+3), and FGT INTERFACE STATE (+5) suggest steady but less significant activity.
- RESTAPI REQUEST, DISK_USAGE_REQUEST, GET STATUS, and HAINFO NOTIFICATION show no change, indicating stable activity in these areas.
- Performance Metrics:
- Average(s) and Max(s) remain unchanged from the previous output:
- Most requests (e.g., STORE_REQUEST, KEEPALIVE, RESTAPI RESPONSE) have an average processing time of 0.000s and max of 0.000s, indicating fast processing for most tasks.
- RESTAPI REQUEST stands out with a high average of 0.620s and max of 28.520s, suggesting occasional slow API calls, likely due to disk I/O bottlenecks.
- LOGIN_REQUEST_LEGACY and DISK_USAGE_REQUEST have max times of 0.050s and 0.020s, respectively, which are minor.
- Running tasks:
- STORE_REQUEST: Still at 150 (max 150), indicating the process is at capacity for log storage tasks.
- RESTAPI RESPONSE: 141 active, consistent with ongoing API activity.
- FGT INTERFACE STATE: 9 active, minor activity.
- Rejected: All remain 0, indicating no requests are being dropped due to server overload.
- Device Associations:
- The devices associated with the longest processing times (FGT60FTK2xxxxx6R, FGT60FTK2xxxxxGX, FGT60FTK2xxxxx56) are unchanged, suggesting these FortiGates may be generating complex or high-volume requests (e.g., for RESTAPI REQUEST).
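For scale, and assuming the one-second delta between the two captures is representative of the sustained rate (a longer sampling window would give a more reliable figure), ~4,693 STORE_REQUESTs per second extrapolates to roughly 4,693 × 3,600 ≈ 16.9 million log storage requests per hour, or about 405 million per day, all of which must be written to a disk that is already 90.9% full.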
Analysis.
This output reinforces the findings from the previous diagnose test application oftpd 2 output and ties directly to the disk I/O bottleneck observed across all outputs:
- High Log Ingestion Load:
- The 4,693 new STORE_REQUESTs in one second (~4,693/second) highlight the immense volume of logs being sent from 1,925 devices (per oftpd 2). This is a major contributor to the high disk utilization (90.8%, 3274.1 tps) and I/O wait times (37.9%–60.4%, per diag debug application sqlrptcached 8).
- The 150 active STORE_REQUESTs (at maximum capacity) indicate that the oftpd process is fully occupied with log storage, likely waiting on disk I/O, as seen with the 4,392 queued tasks in the oftpd 2 Thread Pool 1.
- REST API Load:
- The RESTAPI REQUEST metrics (average 0.620s, max 28.520s) and 141 active RESTAPI RESPONSEs confirm significant API activity, likely from FortiView dashboards, GUI users, or external integrations.
- The long max time (28.520s) suggests that some API calls are disk-intensive, exacerbating the I/O bottleneck. This aligns with the 2.225s processing times for RESTAPI RESPONSE in the oftpd 2 Thread Pool 1.
- Disk Constraints:
- The nearly full disk (90.9%) and inode exhaustion (97.6%, per get sys performance) continue to limit the FortiAnalyzer’s ability to write logs and process tasks efficiently.
- The high volume of STORE_REQUESTs directly contributes to the disk I/O load, which in turn slows down the sqlrptcached process (997 pending FortiView log tables, 74 resource waits, per diag test application sqlrptcached 2).
- No Rejections:
- The absence of rejected requests (rejected=0) is positive, indicating that the oftpd process is handling the load without dropping tasks, despite the disk bottleneck.
- Comparison to Previous Output:
- The small time difference (1 second) means the changes are incremental but significant:
- The 4,693 new STORE_REQUESTs and 975 new KEEPALIVEs reflect the high rate of log ingestion and device communication.
- The 73 new RESTAPI RESPONSEs and 141 active tasks show sustained API activity.
- The consistency in average(s), max(s), and running tasks suggests that the system is in a steady state of high load, constrained by disk I/O.
Implications.
- Disk I/O Bottleneck: The high rate of STORE_REQUESTs (~4,693/second) is a primary driver of the disk I/O load, contributing to the high utilization (90.8%) and I/O wait times (37.9%–60.4%). This slows down both oftpd (log storage) and sqlrptcached (FortiView caching).
- API Performance: The long processing times for RESTAPI REQUEST (up to 28.520s) indicate that API-driven tasks (e.g., FortiView dashboards, external queries) are also affected by the disk bottleneck, leading to potential GUI lag or slow analytics.
- System Strain: The FortiAnalyzer is handling a large-scale deployment (1,925 devices, 14,815 active sessions) but is struggling due to disk constraints, risking performance degradation or system instability if the disk fills completely.
Recommendations.
The recommendations build on those provided for the previous outputs, with a focus on reducing the log ingestion load (STORE_REQUEST), optimizing API performance, and addressing the critical disk constraints. Since this output is a subset of diagnose test application oftpd 2, the recommendations are consistent but emphasize the high rate of STORE_REQUESTs observed here.
- Urgently Address Disk Space and Inode Issues (Critical):
- Free up disk space:
- Use FortiAnalyzer’s GUI (Log View > Log Management) or CLI (execute log delete) to archive or delete old logs and reports, reducing the 90.9% disk usage.
- Set shorter log retention periods (e.g., 7–30 days) to prevent rapid disk fill-up.
- Resolve inode exhaustion:
- Identify directories containing many small files (for example, with per-directory counts such as find <dir> -type f | wc -l where shell access is available) or use FortiAnalyzer’s filesystem tools.
- Delete unnecessary files or consolidate logs to free up inodes (97.6% used).
- If possible, reconfigure the filesystem to allocate more inodes (consult Fortinet support, as this may require reformatting).
- Add storage capacity:
- Expand disk capacity or migrate to high-performance storage (e.g., SSDs or RAID arrays) to handle the high I/O load (3274.1 tps, 4,693 STORE_REQUESTs/second).
- Consider offloading logs to FortiAnalyzer Cloud or a secondary FortiAnalyzer for long-term storage.
- Reduce Log Ingestion Load:
- Optimize device logging:
- The 144M STORE_REQUESTs (~4,693/second) from 1,925 devices are overwhelming the disk. Use get log device stats or the FortiAnalyzer GUI (Log View > Device Statistics) to identify high-volume log sources.
- On FortiGates, reduce logging levels for non-critical events (e.g., disable debug or informational logs, focus on security and traffic logs).
- Example CLI on a FortiGate (syntax varies by FortiOS version, so verify against the CLI reference for the running release): config log fortianalyzer filter; set filter-type exclude; set filter "logid(1,2,3)"; end (adjust the log IDs as needed).
- Limit STORE_REQUEST volume:
- Implement log filters on FortiGates to send only essential logs to the FortiAnalyzer.
- Schedule log uploads during off-peak hours (if configurable) to distribute the load.
- Monitor queued tasks:
- The 4,392 queued tasks in oftpd 2 Thread Pool 1 are linked to STORE_REQUEST delays. After reducing log volume, re-run diagnose test application oftpd 2 to check if the backlog decreases.
- Optimize REST API Performance:
- Investigate slow API requests:
- The 28.520s max for RESTAPI REQUEST and 0.620s average indicate disk-intensive or complex queries. Check the FortiAnalyzer GUI, FortiView dashboards, or external integrations (e.g., FortiManager, third-party tools) for heavy API usage.
- Limit API calls for non-essential queries (e.g., disable auto-refreshing dashboards).
- Reduce API load:
- Restrict GUI access to essential users or schedule API-heavy tasks (e.g., custom reports) during low-traffic periods.
- Reduce the number of active FortiView charts (50 currently, per sqlrptcached 2) to lower API and disk I/O demands.
- Tune API batching:
- The restapi-batch-req-max: 1 (from oftpd 2) suggests no batching. Consult Fortinet support to enable batching, which could reduce API overhead.
- Optimize FortiView Integration:
- Reduce FortiView load:
- The 997 pending FortiView log tables and 50 active charts (per sqlrptcached 2) are downstream effects of the high STORE_REQUEST volume. Disable unnecessary FortiView dashboards or limit their scope (e.g., fewer devices, shorter time ranges).
- Temporarily disable FortiView auto-cache (GUI: FortiView > Settings) to reduce disk I/O, though this may impact real-time analytics.
- Coordinate with log ingestion:
- Reducing STORE_REQUEST volume will alleviate pressure on sqlrptcached, helping clear the 997 pending tables and 74 resource waits.
- Improve Disk I/O Performance:
- Upgrade storage:
- The high disk utilization (90.8%) and I/O wait times (37.9%–60.4%) are driven by STORE_REQUESTs. SSDs or high-performance RAID can significantly reduce I/O wait.
- Optimize I/O scheduling:
- Check if the disk I/O scheduler can be tuned (e.g., to deadline or noop). This requires Fortinet support guidance.
- Reduce disk I/O load:
- Disable non-critical analytics features (e.g., unused FortiView widgets, scheduled reports).
- Offload report generation to a secondary FortiAnalyzer or schedule it during off-peak hours.
- Tune oftpd Performance:
- Increase thread capacity:
- The 150 active STORE_REQUESTs (max 150) and 4,392 queued tasks (from oftpd 2) indicate that oftpd is at capacity. Check with Fortinet support if the thread count in Thread Pool 1 (300, capacity 400) or STORE_REQUEST limit can be increased.
- Monitor SSL errors:
- The 16,591 SSL negotiation errors (from oftpd 2) are minor but should be investigated. Check device certificates or network stability.
- Balance worker load:
- The four workers in oftpd 2 are heavily utilized. Explore adding more workers if configurable (consult Fortinet support).
- Monitor and Validate Fixes:
- Re-run diagnostics:
- After implementing changes, re-run diagnose test application oftpd 5 and diagnose test application oftpd 2 to monitor STORE_REQUEST rates and queued tasks.
- Check diag debug application sqlrptcached 8 for reduced I/O wait (below 13%) and get sys performance for lower disk utilization and %iowait; a simple parsing sketch for automating this check follows the recommendations list.
- Check system logs:
- Review FortiAnalyzer logs (GUI or get system log) for errors related to oftpd, disk I/O, or SSL issues.
- Engage Fortinet support:
- Share all outputs (get sys performance, diag test application sqlrptcached 2, diag debug application sqlrptcached 8, diagnose test application oftpd 2, and diagnose test application oftpd 5) with Fortinet support for expert assistance, especially regarding oftpd tuning, disk optimization, or firmware updates.
- Long-Term Strategy:
- Scale FortiAnalyzer deployment:
- The FAZ-3500G is managing 1,925 devices and handling ~4,693 STORE_REQUESTs/second, which may be pushing its limits. Consider a distributed setup with additional FortiAnalyzer units (e.g., collectors for log ingestion, analyzers for processing).
- Implement log management policies:
- Establish strict log retention and archiving policies to prevent disk and inode exhaustion.
- Upgrade hardware:
- Evaluate newer FortiAnalyzer models or cloud-based solutions (e.g., FortiAnalyzer Cloud) for better performance in large-scale deployments.
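As referenced in the monitoring recommendation above, the threshold checks can be scripted against a saved capture rather than read by eye. The sketch below is a minimal, unofficial Python example: it scans a saved get system performance output for per-CPU %iowait above the 13% guideline used in this article and for high disk %util. The file name, the assumed column order, and the 80% disk-utilization ceiling are illustrative assumptions to adjust locally.
# Scan a saved 'get system performance' capture for high per-CPU %iowait
# and high disk %util (assumed column order; adjust to the firmware in use).
import re

IOWAIT_THRESHOLD = 13.0   # %iowait guideline cited in the sqlrptcached analysis
UTIL_THRESHOLD = 80.0     # example disk %util ceiling (assumption)

def check_capture(path):
    busy_cpus, disk_util = [], None
    with open(path) as f:
        for line in f:
            cols = line.split()
            # Per-CPU rows look like "CPU7 45.74 0.39 0.00 1.36 54.26 43.60 0.00 0.39";
            # cols[6] is %iowait in that layout.
            if cols and re.fullmatch(r"CPU\d+", cols[0]) and len(cols) >= 8:
                iowait = float(cols[6])
                if iowait > IOWAIT_THRESHOLD:
                    busy_cpus.append((cols[0], iowait))
            # The first all-numeric IOStat data row is the hard disk; cols[8] is %util.
            elif disk_util is None and len(cols) >= 10 and cols[0].replace(".", "", 1).isdigit():
                disk_util = float(cols[8])
    return busy_cpus, disk_util

cpus, util = check_capture("get_sys_performance.txt")  # assumed file name
print(f"{len(cpus)} CPU core(s) above {IOWAIT_THRESHOLD}% iowait")
if util is not None and util > UTIL_THRESHOLD:
    print(f"Hard disk %util {util} exceeds {UTIL_THRESHOLD}%")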
Tying to Previous Outputs.
- get sys performance:
- The high disk utilization (90.8%) and I/O wait (up to 98.25% on some CPU cores) are directly driven by the 144M STORE_REQUESTs (~4,693/second), as logs are written to disk.
- The nearly full disk (90.9%) and inode exhaustion (97.6%) exacerbate the bottleneck, slowing down oftpd’s log storage tasks.
- diag test application sqlrptcached 2:
- The 997 pending FortiView log tables and 74 resource waits result from the high STORE_REQUEST volume, as logs must be written before sqlrptcached can process them for FortiView.
- The 50 FortiView charts and enabled auto-caching increase disk I/O, compounding the pressure from oftpd.
- diag debug application sqlrptcached 8:
- The high I/O wait (37.9%–60.4%, exceeding the 13% threshold) is linked to the disk load from STORE_REQUESTs, which also causes the 4,392 queued tasks in oftpd 2 Thread Pool 1.
- diagnose test application oftpd 2:
- The 4,693 new STORE_REQUESTs in one second confirm the high log ingestion rate, aligning with the 144M total and 150 active tasks in both outputs.
- The RESTAPI REQUEST (max 28.520s) and 141 active RESTAPI RESPONSEs are consistent, highlighting API performance issues due to disk bottlenecks.