If general troubleshooting indicates a problem with hosts, you can use the Hosts Overview to dig deeper. The Hosts Overview table highlights the idleness (or busyness) of cluster hosts by showing cluster resource consumption broken down by host.
From the “left-nav” menu, select Platform Spotlight > Hosts.
Click each sub-column heading in turn to sort the table by that metric and see which hosts have the greatest resource usage.
Color highlights for outlier values indicate the distance (in number of standard deviations) from the column’s average value. The darker (closer to red) the color, the greater the difference between the value and the column’s average value.
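To make the highlighting rule concrete, here is a minimal sketch of how a value's distance from the column average can be expressed in standard deviations. The host names and values are hypothetical, and the use of the population standard deviation is an assumption; the product may compute its highlights differently.

```python
from statistics import mean, pstdev


def outlier_distances(values):
    """Return each value's distance from the column mean, in standard deviations.

    Assumption: population standard deviation; the product may use another variant.
    """
    avg = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # Every value equals the average, so nothing is an outlier.
        return [0.0 for _ in values]
    return [(v - avg) / sd for v in values]


# Hypothetical column of per-host values (for example, memory used, in percent).
column = {"host-a": 40.0, "host-b": 42.0, "host-c": 41.0, "host-d": 90.0}
distances = dict(zip(column, outlier_distances(list(column.values()))))

# The host farthest from the average would get the darkest highlight.
most_extreme = max(distances, key=lambda host: abs(distances[host]))
```

In this sample, `host-d` sits farthest from the column average, so it is the one that would stand out in the table.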
The displayed memory metrics are key to learning which cluster hosts are having memory issues. When you find the problem hosts, you can allocate more memory to them or reconfigure the workload to avoid memory starvation or pressure.
To see whether a host is experiencing memory pressure, compare the Total Memory to the Inactive Plus Free Memory Percent data. A low percentage means that little of the host’s total memory is readily available.
To determine if the systems are thrashing (spending more time swapping memory than running apps), compare the Free Swap Memory, Used Swap Memory, and Total Swap Memory data. A large used value (relative to the total value) indicates that the system is thrashing.
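The swap comparison can be sketched as a simple ratio of used swap to total swap. The function names, the sample figures, and the 0.5 threshold are all assumptions for illustration; the documentation does not define how large "large" is, so pick a threshold that fits your cluster.

```python
def swap_used_ratio(used_swap, total_swap):
    """Return the fraction of swap space in use (0.0 when no swap is configured)."""
    if total_swap <= 0:
        return 0.0
    return used_swap / total_swap


def may_be_thrashing(used_swap, total_swap, threshold=0.5):
    """Flag a host whose used swap is large relative to its total swap.

    Assumption: the 0.5 threshold is illustrative, not a product default.
    """
    return swap_used_ratio(used_swap, total_swap) >= threshold


# Hypothetical hosts: (used swap, total swap), both in the same unit (GB here).
hosts = {"host-a": (0.5, 8.0), "host-b": (7.0, 8.0), "host-c": (0.0, 0.0)}
suspects = [name for name, (used, total) in hosts.items()
            if may_be_thrashing(used, total)]
```

With these sample values, only `host-b` is flagged, because 7 of its 8 GB of swap are in use.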
Load Issues in Hosts
The Fifteen Minute Load Average, Number of Running Threads, and Number of Total Threads metrics are key to learning which cluster hosts are experiencing load issues. Typically, a load average that’s higher than five times the number of system cores indicates that jobs are experiencing CPU contention (starvation). When you find the problem hosts, you can reconfigure the workload to avoid overloading the system.
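The rule of thumb above can be written as a one-line check. This is a sketch only: the function name, host names, and sample values are hypothetical, and the core counts are assumed to come from your own inventory rather than from the Hosts Overview table.

```python
def has_cpu_contention(load_avg_15, cores, factor=5.0):
    """Apply the rule of thumb: load average above factor x cores suggests contention."""
    return load_avg_15 > factor * cores


# Hypothetical hosts: (fifteen-minute load average, number of cores).
hosts = {"host-a": (12.0, 8), "host-b": (95.0, 16), "host-c": (41.0, 8)}
overloaded = sorted(name for name, (load, cores) in hosts.items()
                    if has_cpu_contention(load, cores))
```

For an 8-core host the limit is 40, so `host-c` (load 41) is flagged while `host-a` (load 12) is not; for a 16-core host the limit is 80, so `host-b` (load 95) is flagged as well.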