Plan Your Cluster’s Alarms

Before you set up alarms, you should have an understanding of your clusters. By considering the problems that have previously occurred, you can predict which issues are most likely to happen again. Additionally, by knowing your system’s performance requirements and customer SLAs (service level agreements), you can configure alarms at thresholds that allow you to address small problems before they become major issues.

Predefined Alarms (YARN)

A set of predefined alarms in the Pepperdata dashboard monitors basic system resource usage, such as CPU load and storage. You cannot delete these alarms, but you can disable and enable them, customize them for your system (see Edit Alarms), and use them as the basis for planning and configuring additional alarms.

The predefined alarms appear in the Alarms Settings tab of the Alarms & Alerts page. (To open the Alarms & Alerts page, click the Alarms icon in the “top-nav” menu and select View All Alarms, or select Alarms in the “left-nav” menu.)

If you want to create a custom alarm that’s similar to a predefined alarm, the easiest approach is to use the predefined alarm’s query as the starting point for the new alarm, rather than creating the alarm from a chart view or working out the required query string yourself. For more information, see Create Alarms From the Alarms Page.
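As an illustration of starting from a predefined query, the following minimal Python sketch splits a query into its parameters and swaps one of them. It assumes that alarm queries follow ordinary URL query-string syntax, as the defaults in the predefined alarms table (for example, h=*&m=n_15mlavg_per_core) appear to. The helper name replace_param is hypothetical and is not part of Pepperdata; confirm any resulting query in the dashboard before relying on it.

```python
from urllib.parse import parse_qsl, urlencode


def replace_param(query: str, key: str, value: str) -> str:
    """Return a copy of `query` with `key` set to `value` (appended if absent).

    Hypothetical helper: assumes the alarm query uses URL query-string syntax.
    """
    pairs = parse_qsl(query, keep_blank_values=True)
    result = []
    replaced = False
    for k, v in pairs:
        if k == key:
            result.append((k, value))
            replaced = True
        else:
            result.append((k, v))
    if not replaced:
        result.append((key, value))
    # Keep "*" literal so the wildcard is not percent-encoded.
    return urlencode(result, safe="*")


# Default query of the predefined CPU Load alarm.
cpu_load_query = "h=*&m=n_15mlavg_per_core"

# Inspect the parameters of the predefined query.
print(parse_qsl(cpu_load_query))

# Swap the metric for the one used by the predefined Memory alarm.
print(replace_param(cpu_load_query, "m", "c_rsspct"))
```

In this sketch, swapping the metric parameter of the CPU Load query yields the same string as the predefined Memory alarm's default query.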

The table describes the predefined alarms for YARN clusters and their default queries, thresholds, and firing sensitivity. The firing sensitivity is the percentage of time, within a given period, for which the threshold must be exceeded before the alarm is actually fired (triggered).

Predefined Alarms: YARN Clusters
Title | Description | Query | Threshold | Firing Sensitivity
CPU Load | node Fifteen minute load average per core, by Host | h=*&m=n_15mlavg_per_core | > 5 | > 1% of time in any five minute period
Memory | User RAM percentage, by Host | h=*&m=c_rsspct | > 90 | > 1% of time in any five minute period
Disk I/O | node Percent of time doing I/O, by Host | h=*&m=n_dnmsdips_max_pct | > 90 | > 1% of time in any five minute period
Storage | node Percent of disk space used, by Host | h=*&m=c_capacity_disk_pct | > 90 | > 1% of time in any five minute period
HBase Garbage Collection | Task JVM old garbage collection time, by Host, by App | h=*&j=hbase&m=trfjgot | > 45000 | > 1% of time in any five minute period
Swap Status | node Swap state, by Host | h=*&m=n_sdss | > 0 | > 1% of time in any five minute period
Resource Manager Metrics | ResourceManager active nodes | m=rmc_act | < 1 | > 30% of time in any 30 minute period
Too Few Kafka Broker Active Controllers * | Kafka active controller count | m=kafka_activecontroller | < 1 | > 100% of time in any one minute period
Too Many Kafka Broker Active Controllers * | Kafka active controller count | m=kafka_activecontroller | > 1 | > 100% of time in any one minute period
Kafka Offline Partitions * | Kafka offline partitions count | m=kafka_offlinepartitionscount | 0 | > 100% of time in any one minute period

* Available only when Streaming Spotlight is enabled.
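To make the relationship between threshold and firing sensitivity concrete, the following Python sketch evaluates a rule such as “> 5 for > 1% of time in any five minute period” against a series of metric samples. It is an illustration only, not Pepperdata's implementation: it assumes evenly spaced samples, so that the fraction of breaching samples in a window approximates the fraction of time, and the function name alarm_fires and the sampling interval are hypothetical.

```python
from typing import Callable, Sequence


def alarm_fires(
    values: Sequence[float],
    breaches: Callable[[float], bool],
    interval_seconds: float,
    window_seconds: float,
    min_fraction: float,
) -> bool:
    """Return True if, in any sliding window of `window_seconds`, the share of
    samples that breach the threshold is strictly greater than `min_fraction`.

    Assumption: samples are evenly spaced `interval_seconds` apart.
    """
    window_len = max(1, int(window_seconds // interval_seconds))
    flags = [bool(breaches(v)) for v in values]
    count = 0  # breaching samples inside the current window
    for i, flag in enumerate(flags):
        count += flag
        if i >= window_len:
            # Drop the sample that just slid out of the window.
            count -= flags[i - window_len]
        if count / window_len > min_fraction:
            return True
    return False


# CPU Load style rule: value > 5 for more than 1% of any five-minute period,
# with one (assumed) sample per second.
load = [1.0] * 296 + [6.0] * 4
print(alarm_fires(load, lambda v: v > 5, 1, 300, 0.01))

# Resource Manager style rule: active nodes < 1 for more than 30% of any
# 30-minute period, with one (assumed) sample per minute.
active_nodes = [0] * 10 + [5] * 50
print(alarm_fires(active_nodes, lambda v: v < 1, 60, 1800, 0.30))
```

Both example series fire under these assumptions: 4 of 300 samples (about 1.3%) breach in the first case, and 10 of 30 samples (about 33%) in the second. Raising the percentage or lengthening the period makes an alarm less sensitive to brief spikes.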

Alarms for Common Problems

Common cluster problems for which you might want to create alarms include:

  • Slow (Long Running) Jobs
  • Excessive Container Asks
  • Specific (Named) Job Ran Too Long
  • Job Failed or Killed

For information about the associated metrics and the steps for creating the corresponding alarms, see Create Alarms for Common Problems.