Create Alarms for Common Problems

There are several ways to create alarms: (1) from the App Details page, in the Alarms tab or, for YARN clusters, the Bottlenecks tab; (2) from chart views; (3) from the Alarms page; and (4) by using the Pepperdata REST API. The result is identical regardless of how you create the alarm, so choose whichever method is most convenient. For example, you might create alarms during your regular workflow of examining charts while troubleshooting your clusters, or, if you already know the applicable metrics for a long list of alarms, you might create them all quickly from the Alarms page.

Straggler Tasks

A common cause of slow (long running) apps is straggler tasks—the last remaining task for an app whose progress is greater than 90% and that has had more than 50 containers in the past. If you have an important or critical app, you might want to configure an alarm that fires if the app spends more than a given amount of time, or more than a given percentage of its overall runtime, in the straggler task state. Typical causes of straggler tasks include running the app on particularly busy hosts and issues with the app’s data partitioning.
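
To make this definition and the two alarm conditions concrete, here is a small sketch in Python. It is illustrative only, not Pepperdata’s detection logic; the argument names and example thresholds are hypothetical stand-ins for the quantities described above.

    # Illustrative only; this is not Pepperdata's detection logic. The argument
    # names are hypothetical stand-ins for the quantities described above.

    def in_straggler_task_state(remaining_tasks, app_progress_pct, past_containers):
        """Straggler task state: only one task remains, the app's progress is
        greater than 90%, and the app has had more than 50 containers."""
        return remaining_tasks == 1 and app_progress_pct > 90 and past_containers > 50

    def straggler_alarm_fires(seconds_in_state, total_runtime_seconds,
                              max_seconds=600, max_fraction=0.10):
        """Fire if the app spends more than a given amount of time, or more than
        a given percentage of its overall runtime, in the straggler task state.
        The 600-second and 10% thresholds here are example values, not defaults."""
        return (seconds_in_state > max_seconds
                or seconds_in_state / total_runtime_seconds > max_fraction)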

The procedure describes how to create the alarm from the applicable metric’s chart. You could also manually create the alarm (see Create Alarms From the Alarms Page) or, for YARN clusters, create the alarm from the Bottlenecks tab in the App Details page (see Bottlenecks Tab).

Procedure

  1. In the left-nav menu, select Charts.

  2. In the filter bar, click Metrics.

  3. In the search box, clear any previously selected metrics, and enter the search term, “straggler”.

  4. Select the metric, and click Apply.

  5. In the top-right of the metric chart, click the Create or View Alarms icon (add_alert), enter a name for the alarm (such as “App in straggler task state”) and a threshold value, and click the Save icon (save).

When this alarm fires, you’ll want to investigate why the app got stuck in the straggler task state. If you set up alert notifications, the notification includes a link to the chart with the applicable metrics and data for the alarm.

Slow (Long Running) Jobs

One of the most useful conditions to monitor with Pepperdata is slow jobs; for example, an alarm that is triggered any time a job runs longer than a given amount of time. This example shows how to configure an alarm for any job that runs longer than five (5) minutes.

The procedure describes how to create the alarm from the applicable metric’s chart. You could also manually create the alarm; see Create Alarms From the Alarms Page.

Procedure

  1. In the left-nav menu, select Charts.

  2. In the filter bar, click Metrics.

  3. In the search box, clear any previously selected metrics, and enter the search term, “job execution duration seconds”.

  4. Select the metric from the General sub-category and the ResourceManager group, and click Apply.

  5. In the top-right of the metric chart, click the Create or View Alarms icon (add_alert), enter a name for the alarm (such as “job ran >5 minutes”) and a threshold value (for five minutes, use 300 seconds), and click the Save icon (save).

    To change the threshold from an upper limit (above which the alarm fires) to a lower limit (below which the alarm fires), to change the advanced threshold percentages and times, or to add email override addresses for receiving alert notifications when the alarm fires and clears, edit the alarm after you create it; see Edit Alarms. By default, alarms are enabled when you create them, and they fire when the metric value exceeds the threshold for 1% of the time in any five-minute period. To disable a newly created alarm, see Disable and Enable Alarms.
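
To make the default firing rule concrete, the following small sketch (illustrative only, not Pepperdata code) models it for this example’s 300-second threshold, assuming one metric sample per second over the five-minute window.

    # Illustrative model of the default firing rule described in the note above;
    # this is not Pepperdata's implementation. Samples are assumed to be
    # one-per-second readings of the job execution duration metric, in seconds.

    WINDOW_SECONDS = 5 * 60     # the five-minute evaluation window
    THRESHOLD_SECONDS = 300     # fire for jobs that run longer than five minutes
    FIRE_FRACTION = 0.01        # default: above the threshold for 1% of the window

    def alarm_fires(samples, threshold=THRESHOLD_SECONDS, fire_fraction=FIRE_FRACTION):
        """Return True if the metric exceeds the threshold for at least
        fire_fraction of the samples in the most recent five-minute window."""
        window = samples[-WINDOW_SECONDS:]
        above = sum(1 for value in window if value > threshold)
        return above >= fire_fraction * len(window)

    # Example: a job whose execution duration has climbed to 360 seconds.
    samples = list(range(1, 361))
    print(alarm_fires(samples))   # True: above 300 s for 60 of the last 300 samples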

When this alarm fires, you’ll want to investigate why the job ran so long. If you set up alert notifications, the notification includes a link to the chart with the applicable metrics and data for the alarm.

Excessive Container Asks

Configuring an alarm for excessive container asks can help you avoid queue contention issues. When someone asks for too many containers, they consume too much queue capacity; how many containers is too many depends on your environment.

The procedure provides the query and threshold values to use when manually creating the alarm. You could also create it from a chart with the container asks metric selected.

Procedure

  1. Navigate to the Alarms & Alerts page by either of the following methods:

    • From the top-nav menu, click the Alarms icon, and select View All Alarms.
    • From the left-nav menu, select Alarms.
  2. Below the Look Up Alarm search, click the New Cluster Alarm icon (add_alert).

  3. Enter the title, query, and threshold value; optionally change the advanced threshold options; and optionally, enter an email override for alerts.

    The table shows the typical values for this alarm.

    Title       Too many container asks
    Query       j=*&m=rm_ask_containers
    Threshold   Above 960
  4. Click Save.
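
If you create this alarm through the Pepperdata REST API instead (one of the methods listed at the beginning of this section), the request payload mirrors the fields in the table. The sketch below uses Python’s requests library; the endpoint path, payload field names, and authentication header are assumptions for illustration only, so consult the Pepperdata REST API reference for the actual resource names and schema. Only the title, query, and threshold values come from the table.

    import requests

    # Hypothetical sketch only: the endpoint path ("/api/alarms"), payload field
    # names, and auth header are assumptions, not the documented API. The title,
    # query, and threshold values come from the table above.
    DASHBOARD_URL = "https://dashboard.example.com"   # your Pepperdata dashboard URL
    API_TOKEN = "your-api-token"                      # your API credentials

    alarm = {
        "title": "Too many container asks",
        "query": "j=*&m=rm_ask_containers",
        "threshold": 960,
        "direction": "above",
    }

    response = requests.post(
        f"{DASHBOARD_URL}/api/alarms",
        json=alarm,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    print("Created alarm:", response.json())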

Specific (Named) Job Ran Too Long

A common scenario for addressing SLA (service level agreement) concerns is to configure an alarm that fires when a given, named job exceeds a given overall job duration.

The procedure describes how to create the alarm from the applicable metric’s chart. You could also manually create the alarm; see Create Alarms From the Alarms Page.

Procedure

  1. In the left-nav menu, select Charts.

  2. In the filter bar, click Metrics.

  3. In the search box, clear any previously selected metrics, and enter the search term, “overall job duration seconds”.

  4. Select the metric from either the General or the Resource Manager metrics sub-category, and click Apply; it is the same metric, so you can select either one.

  5. Click the Breakdown By filter, clear (de-select) the Summary series, select the Application series, and enter the application name in the Name filter.

  6. Click Apply.

  7. In the top-right of the resulting metric chart, click the Create or View Alarms icon (add_alert), enter a name for the alarm (such as “your-job-name exceeded your-threshold seconds”) and a threshold value, and click the Save icon (save).

    To change the threshold from an upper limit (above which the alarm fires) to a lower limit (below which the alarm fires), to change the advanced threshold percentages and times, or to add email override addresses for receiving alert notifications when the alarm fires and clears, edit the alarm after you create it; see Edit Alarms. By default, alarms are enabled when you create them, and they fire when the metric value exceeds the threshold for 1% of the time in any five-minute period. To disable a newly created alarm, see Disable and Enable Alarms.

Metrics that are related to overall job duration seconds are job queue duration seconds and job execution duration seconds. You can configure alarms for these metrics in the same manner as for overall job duration seconds.

Job Failed or Killed

A common scenario for addressing SLA (service level agreement) concerns is to configure an alarm that fires whenever a job fails or is killed.

The procedure describes how to create the alarm from the applicable metric’s chart. You could also manually create the alarm; see Create Alarms From the Alarms Page.

Procedure

  1. In the left-nav menu, select Charts.

  2. In the filter bar, click Metrics.

  3. Select either or both of the following metrics by entering them into the search box and selecting them from the Metrics column.

    • “Application state” (rm_state)
    • “Application final status” (rm_final_status)

    When a metric appears in more than one metrics sub-category—which is true for both rm_state and rm_final_status—you can select it from any sub-category; it is the same metric.

  4. Click Apply.

  5. In the top-right of the resulting metric chart, click the Create or View Alarms icon (add_alert), enter a name for the alarm (such as “Job your-job-name failed”), and enter a threshold value—the enumeration value for the ResourceManager state/status:

    • For the Application state, failed jobs are enumeration value 6, and killed jobs are enumeration value 7.
    • For the Application final status, failed jobs are enumeration value 2, and killed jobs are enumeration value 3.
  6. Click the Save icon (save).

    To change the threshold from an upper limit (above which the alarm fires) to a lower limit (below which the alarm fires), to change the advanced threshold percentages and times, or to add email override addresses for receiving alert notifications when the alarm fires and clears, edit the alarm after you create it; see Edit Alarms. By default, alarms are enabled when you create them, and they fire when the metric value exceeds the threshold for 1% of the time in any five-minute period. To disable a newly created alarm, see Disable and Enable Alarms.
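
If you also post-process exported metric values outside the dashboard, it can help to keep the enumeration values from step 5 in code. The following sketch simply restates those values; the names are illustrative.

    # Enumeration values for failed and killed jobs, as listed in step 5.
    # Only the numeric values come from the procedure above; the names here
    # are illustrative.
    RM_STATE = {"failed": 6, "killed": 7}            # "Application state" (rm_state)
    RM_FINAL_STATUS = {"failed": 2, "killed": 3}     # "Application final status" (rm_final_status)

    def job_failed_or_killed(rm_state_value, rm_final_status_value):
        """Return True if either metric reports the job as failed or killed."""
        return (rm_state_value in RM_STATE.values()
                or rm_final_status_value in RM_FINAL_STATUS.values())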