Spark Recommendations

Pepperdata recommendations for Spark applications are generated by the Application Profiler (which must be configured, enabled, and running before an app begins in order for recommendations to be generated). The Spark tile in the Recommendations section of the Pepperdata dashboard shows how many recommendations were made during the last 24 hours, along with their severity levels. (If there are no recommendations for Spark apps, the Spark tile does not appear on the dashboard.)

Recommendations information is shown in several places in the Pepperdata dashboard:

  • To see a table of all the Spark apps that received recommendations at a given severity level, click the linked severity text in the Spark tile.

  • To see the recommendations’ severity levels for all recently run Spark applications, open the Applications Overview page: use the left-nav menu to select App Spotlight > Applications (and filter by Spark apps).

  • To view the Application Profiler report, click the title of the Spark tile, or use the left-nav menu to select App Spotlight > Application Profiler (and filter by Spark apps).

Although heuristics and recommendations are closely related, the terms are not interchangeable.

  • Heuristics are the rules and triggering/firing thresholds against which Pepperdata compares the actual metric values for your applications.

  • When thresholds are crossed, Pepperdata analyzes the data and provides the relevant recommendations.

For example, a single heuristic might have a low and a high threshold, from which Pepperdata can provide distinct recommendations such as "Too long average task runtime" and "Too short average task runtime". That is, there is not a 1:1 correspondence of heuristics to recommendations.
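
As a rough illustration of that relationship, the following sketch shows one heuristic with two thresholds yielding two distinct recommendations. It is not Pepperdata's implementation; the metric name and threshold values are illustrative assumptions, and only the two recommendation names come from the example above.

  from typing import Optional

  # Illustrative thresholds for a single "average task runtime" heuristic.
  LOW_THRESHOLD_S = 2.0
  HIGH_THRESHOLD_S = 600.0

  def average_task_runtime_heuristic(avg_task_runtime_s: float) -> Optional[str]:
      """Return a recommendation name if a threshold fires, otherwise None."""
      if avg_task_runtime_s >= HIGH_THRESHOLD_S:
          return "Too long average task runtime"
      if avg_task_runtime_s <= LOW_THRESHOLD_S:
          return "Too short average task runtime"
      return None  # within the acceptable band: no recommendation is generated

  print(average_task_runtime_heuristic(750.0))  # -> Too long average task runtime
  print(average_task_runtime_heuristic(1.5))    # -> Too short average task runtime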

The table describes the Pepperdata recommendations for Spark applications: each recommendation’s name, whether Application Profiler must be configured and enabled in order for Pepperdata to generate the recommendation, the recommendation type (general guidance or specific tuning values to change), for which environments it’s applicable (YARN and/or Kubernetes), what triggered the recommendation (the cause), and the text of the actual recommendation.

  • Except where noted, the recommendations are only for non-streaming Spark applications.

    Support for recommendations for streaming Spark applications requires Supervisor v6.5.23 or later.
  • Due to the nature and availability of the metrics that Pepperdata analyzes in order to generate recommendations, the recommendations for a given Spark app are not available until approximately thirty (30) minutes after the app finishes.

  • Because Pepperdata is continually improving the recommendations as more and more applications and queries are profiled, the name, cause, and/or recommendation text might be slightly different from what’s shown in this documentation.

    For details about how the recommendations appear in an application’s detail page, see Recommendations Tab.

Spark Recommendations

Batch Processing is taking too long
  Cause: Ratio of processing delay to Batch interval specified is too high
  Recommendation: Consider reducing the batch processing time.
  Notes:
    • This recommendation is only for streaming applications.
    • This recommendation’s heuristic helps you answer the question, “Is the system able to process batches as quickly as they’re generated?”

Batch Queue Delay is too large
  Cause: The Queueing delay for batches keeps increasing
  Recommendation: Consider reducing the batch processing time.
  Notes:
    • This recommendation is only for streaming applications.
    • This recommendation’s heuristic helps you answer the question, “Is the system able to process batches as quickly as they’re generated?”

Not enough Parallelism
  Cause: Number of cores used in the system is less than optimal. To use more cores reduce the block interval times.
  Recommendation: Consider reducing the block interval time.
  Notes:
    • This recommendation is only for streaming applications.
    • This recommendation’s heuristic helps you answer the question, “Is the streaming job utilizing the hardware resources enough?”
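
For the three streaming recommendations above, the batch interval and block interval are fixed when the application starts. The following is a minimal PySpark sketch, assuming the legacy DStream API; the 5-second batch interval and 100 ms block interval are illustrative values, not tuned settings.

  from pyspark import SparkConf, SparkContext
  from pyspark.streaming import StreamingContext

  conf = (SparkConf()
          .setAppName("streaming-interval-sketch")
          # A shorter block interval creates more blocks (and therefore more tasks)
          # per batch, which is how "Not enough Parallelism" is typically addressed.
          .set("spark.streaming.blockInterval", "100ms"))  # Spark's default is 200ms
  sc = SparkContext(conf=conf)

  # The batch interval (here 5 seconds) is fixed when the StreamingContext is
  # created; batch processing time must stay below it, or the queueing delay
  # keeps growing.
  ssc = StreamingContext(sc, batchDuration=5)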

Unused GPUs
  Cause:
    • One GPU allocated, but it was unused.
    • Of <X> GPUs allocated, at least 1 GPU is not being used at any given time.
    • Of <X> GPUs allocated, at least <Y> of them are not being used at any given time.
  Recommendation: The software may be misconfigured to not use all the GPUs or written such that the GPUs are being used in bursts. Check the individual GPU utilization graphs in the GPU tab. Verify that the application is configured to use all the GPUs (not only CPUs), and be sure that the application code is written to utilize GPUs as much as possible.
  Note: Because the number of unused GPUs can vary over the duration of the job, we report the lowest number obtained from our data sampling.

Low usage of GPU resources
  Cause:
    • One GPU allocated, but it has low resource utilization (<20% core usage and <20% memory usage).
    • Of <X> GPUs allocated, at least 1 GPU has core and memory utilization <= 20% at any given time.
    • Of <X> GPUs allocated, at least <Y> of them have core and memory utilization <= 20% at any given time.
  Recommendation: Use GPUs with fewer cores and/or less memory.
  Note: This is a low-severity recommendation.
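
One thing to check for either GPU recommendation is whether the app actually requests GPU scheduling. The sketch below assumes Spark 3.x with GPU-aware scheduling; the amounts are illustrative, and on YARN and Kubernetes a GPU discovery script is typically also required.

  from pyspark.sql import SparkSession

  # Illustrative GPU scheduling settings (Spark 3.x). Adjust the amounts to match
  # how many GPUs each executor should hold and how many tasks may share one GPU.
  spark = (SparkSession.builder
           .appName("gpu-scheduling-sketch")
           .config("spark.executor.resource.gpu.amount", "1")   # GPUs per executor
           .config("spark.task.resource.gpu.amount", "0.25")    # 4 tasks share one GPU
           .getOrCreate())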

Excessive {driver/executor} memory wasted
  Cause: App requested <N> GB of memory for {driver/executor}, but maximum usage was <N> GB
  Recommendation:
    • Change {spark.driver.memory/spark.executor.memory} from CURRENT VALUE to PROPOSED VALUE
    • Change {spark.yarn.driver.memoryOverhead/spark.yarn.executor.memoryOverhead} from CURRENT VALUE to PROPOSED VALUE
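
A hedged example of where such memory settings are usually applied; the values below are placeholders, not the CURRENT/PROPOSED values that Pepperdata computes for a particular app.

  from pyspark.sql import SparkSession

  # Placeholder values; substitute the proposed values from the recommendation.
  # Note: spark.driver.memory only takes effect when set at launch time (for
  # example, spark-submit --driver-memory 4g or spark-defaults.conf), not from
  # inside an already-running driver.
  spark = (SparkSession.builder
           .appName("memory-sizing-sketch")
           .config("spark.executor.memory", "8g")
           # YARN property; on Spark 2.3+ the equivalent unprefixed name is
           # spark.executor.memoryOverhead.
           .config("spark.yarn.executor.memoryOverhead", "1024")  # MiB
           .getOrCreate())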

GC (garbage collection) duration too high
  Cause: Ratio of total GC time : total executors duration is too high: <N>
  Recommendation: Change spark.executor.memory from CURRENT VALUE to PROPOSED VALUE

GC (garbage collection) duration too high
  Cause: Ratio of total GC time : total executors duration is too high: <N>
  Recommendation: Consider increasing the value of spark.executor.memory

Executors killed
  Cause: This is usually caused by an out of memory error. Affected stages: stage: <N> attempt: <N>[, stage: <N> attempt: <N>]
  Recommendation: Consider increasing heap size with spark.executor.memory or overhead with spark.yarn.executor.memoryOverhead. For more information, review Application Errors in the <Spark Tab>.

Containers killed by YARN
  Cause: Container(s) exceeded configured memory limit. Affected stages: stage: <N> attempt: <N>[, stage: <N> attempt: <N>]
  Recommendation: Consider increasing spark.yarn.executor.memoryOverhead. For more information, review Application Errors in the <Spark Tab>.

Total number of unused storage memory executors
  Cause: <N> of the <N> executors in the application used < <N>% of the Spark storage memory that was available for RDD caching.
  Recommendation: Look for opportunities to cache RDDs, which will improve application response time and resource usage efficiency.
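
As a brief, hedged illustration of acting on this recommendation with the DataFrame API (the path and column names are hypothetical): caching a result that several later actions reuse keeps it in the otherwise idle storage memory instead of recomputing it each time.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("storage-memory-sketch").getOrCreate()

  # Hypothetical input path and columns, for illustration only.
  events = spark.read.parquet("hdfs:///path/to/events")
  errors = events.filter(F.col("level") == "ERROR").cache()   # reused by both actions below

  error_count = errors.count()
  errors_by_service = errors.groupBy("service").count().collect()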

Stages with OOM errors
  Cause: Some tasks have failed due to overhead memory error
  Recommendation: Change spark.executor.memory from CURRENT VALUE to PROPOSED VALUE

Stages with OOM errors
  Cause: Some tasks have failed due to OutOfMemory Error
  Recommendation:
    • Consider increasing the value of spark.executor.memory
    • Consider decreasing the value of spark.memory.fraction
    • Consider decreasing the value of spark.executor.cores

Stages with MemoryOverHead errors
  Cause: Some tasks have failed due to overhead memory error
  Recommendation: Consider increasing the value of spark.yarn.executor.memoryOverhead
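
A rough way to reason about the spark.executor.memory, spark.memory.fraction, and spark.executor.cores suggestions in the OOM recommendation above (an approximation, not Spark's exact memory model): each executor's unified execution/storage pool is roughly (spark.executor.memory − 300 MB) × spark.memory.fraction, the spark.executor.cores concurrently running tasks compete for that pool, and the remainder of the heap holds user data structures. The arithmetic below uses assumed values.

  # Assumed settings, for illustration only.
  executor_memory_mb = 8 * 1024   # spark.executor.memory = 8g
  memory_fraction = 0.6           # spark.memory.fraction (Spark's default)
  executor_cores = 4              # spark.executor.cores

  unified_mb = (executor_memory_mb - 300) * memory_fraction
  per_task_mb = unified_mb / executor_cores
  user_heap_mb = (executor_memory_mb - 300) * (1 - memory_fraction)

  print(f"~{per_task_mb:.0f} MB of execution/storage memory per concurrent task")
  print(f"~{user_heap_mb:.0f} MB of heap left for user objects and metadata")
  # Raising executor memory, lowering memory_fraction (more heap for user
  # objects), or lowering cores (fewer tasks sharing the pool) all add headroom.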

Executor average task time distribution for stage <N>
  Cause: Spark executor average task time skew. The ratio of the <N>th percentile average task time used by any executor to the median average task time used by all executors is <N>, which is >= the firing threshold of <N>.
  Recommendation: To speed up your app: Increase the number of data partitions of RDD of stage <N> (name = <module and line info>) which enables more even distribution of work across executors. Consider using the RDD repartition transformation, which triggers a shuffle.

Executor average task time distribution for stage <N> and attempt <N>
  Cause: Spark executor average task time skew. The ratio of the <N>th percentile average task time used by any executor to the median average task time used by all executors is <N>, which is >= the firing threshold of <N>.
  Recommendation: To speed up your app and improve parallelism: Increase the number of data partitions of RDD of stage <N> (name = <name info>) which enables more even distribution of work across executors. Consider using the RDD repartition transformation, which triggers a shuffle.

Executor average shuffle read bytes distribution for stage <N>
  Cause: Spark executor average shuffle read bytes skew. The ratio of the <N>th percentile average shuffle read bytes used by any executor to the median average shuffle read bytes used by all executors is <N>, which is >= the firing threshold of <N>.
  Recommendation: To speed up your app: Increase the number of data partitions of RDD of stage <N> (name = <module and line info>) which enables more even distribution of work across executors. Consider using the RDD repartition transformation, which triggers a shuffle.

Executor average shuffle read bytes distribution for stage <N> and attempt <N>
  Cause: Spark executor average shuffle read bytes skew. The ratio of the <N>th percentile average shuffle read bytes used by any executor to the average shuffle read bytes used by all executors is <N>, which is >= the firing threshold of <N>.
  Recommendation: To speed up your app and improve parallelism: Increase the number of data partitions of RDD of stage <N> (name = <name info>) which enables more even distribution of work across executors. Consider using the RDD repartition transformation, which triggers a shuffle.

Executor average shuffle write bytes distribution for stage <N> and attempt <N>
  Cause: Spark executor average shuffle write bytes skew. The ratio of the <N>th percentile average shuffle write bytes used by any executor to the median average shuffle write bytes used by all executors is <N>, which is >= the firing threshold of <N>.
  Recommendation: To speed up your app and improve parallelism: Increase the number of data partitions of RDD of stage <N> (name = <module and line info>) which enables more even distribution of work across executors. Consider using the RDD repartition transformation, which triggers a shuffle.

Executor average input bytes distribution for stage <N> and attempt <N>
  Cause: Spark executor average input bytes skew. The ratio of the <N>th percentile input bytes used by any executor to the median average input bytes used by all executors is <N>, which is >= the firing threshold of <N>.
  Recommendation: To speed up your app and improve parallelism: Increase the number of data partitions of RDD of stage <N> (name = <module and line info>) which enables more even distribution of work across executors. Consider using the RDD repartition transformation, which triggers a shuffle.
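
The repartition suggestion in the skew recommendations above looks like the following minimal sketch; the input paths, the "join_key" column, and the target of 400 partitions are assumptions used to illustrate the call, not recommended values.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

  # Spreading a skewed stage's data across more partitions (and therefore more tasks).
  rdd = spark.sparkContext.textFile("hdfs:///path/to/input")  # hypothetical path
  rebalanced = rdd.repartition(400)                           # triggers a shuffle

  # DataFrame equivalent; repartitioning by a key column (hypothetical name) can
  # also help when a few keys dominate the data.
  df = spark.read.parquet("hdfs:///path/to/table")
  rebalanced_df = df.repartition(400, "join_key")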

Spark stages with long average executor runtimes
  Cause: Executor runtimes too long for Spark stages. Affected stages: stage <N>, attempt <N> (runtime: <duration>)[, stage <N>, attempt <N> (runtime: <duration>)].
  Recommendation: To speed up executors: Review the application’s driver log to determine the root cause of the long runtimes for executors, and refactor the jobs as needed before running them again.

Spark job failure rate
  Cause: Too high a rate of Spark job failures. <N>% of jobs failed, which is >= the firing threshold of <N>%. Failed jobs: job <N>[, <N>].
  Recommendation: To decrease job failures review the errors in the <Spark Tab>.

Spark jobs with high task failure rates
  Cause: Too high a rate of task failures for Spark jobs. Affected jobs: job <N>[, <N>] (task failure rate: <N>).
  Recommendation: To decrease task failures review the errors in the <Spark Tab>.

Spark jobs with high task failure rates
  Cause: Too high a rate of task failures for Spark jobs. Affected jobs: job <Job-Name>[, <Job-Name>].
  Recommendation: To decrease task failures: Review the application’s driver log to determine the root cause of the task failures, and refactor the jobs as needed before running them again.

Spark stage failure rate
  Cause: Too high a rate of stage failures. <N>% of stages failed, which is >= the firing threshold of <N>%. Failed stages: stage <N>, attempt <N>[, stage <N>, attempt <N>].
  Recommendation: To decrease stage failures review the errors in the <Spark Tab>.

Spark stage failure rate
  Cause: Too high a rate of stage failures. <N>% of stages failed, which is >= the firing threshold of <N>%. Failed stages: stage <N>[, stage <N>].
  Recommendation: To decrease stage failures: Review the application’s driver log to determine the root cause of the stage failures, and refactor the jobs as needed before running them again.

Spark stages with high task failure rates
  Cause: Too high a rate of task failures for Spark stages. Affected stages: stage <N>, attempt <N> (task failure rate: <N>)[, stage <N>, attempt <N> (task failure rate: <N>)].
  Recommendation: To decrease task failures review the errors in the <Spark Tab>.

Spark stages with high task failure rates
  Cause: Too high a rate of task failures for Spark stages. Affected stages: stage <N>[, stage <N>].
  Recommendation: To decrease task failures: Review the application’s driver log to determine the root cause of the task failures, and refactor the jobs as needed before running them again.

Too many recomputations for RDD <RDD-Name> (id <RDD-Id>)
  Cause: RDD <RDD-Name> (id <N>) is recomputed <N> times.
  Recommendation: To speed up your app, cache the RDD at <data structure> at <module and line info>.

Unnecessary caching of RDD <RDD-Name> (id <RDD-Id>)
  Cause: RDD <RDD-Name> (id <N>) is cached but is never recomputed.
  Recommendation: To reduce storage memory waste, do not cache the RDD that is created at <data structure> at <module and line info>.
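
A minimal sketch covering both RDD-caching recommendations above: persist an RDD that later actions would otherwise recompute, and unpersist (or simply don't cache) one that is never reused. The path and parsing logic are illustrative.

  from pyspark import StorageLevel
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("rdd-caching-sketch").getOrCreate()

  lines = spark.sparkContext.textFile("hdfs:///path/to/events")   # hypothetical path
  parsed = lines.map(lambda line: line.split(","))
  parsed.persist(StorageLevel.MEMORY_ONLY)   # reused by both actions below

  total = parsed.count()
  error_count = parsed.filter(lambda fields: fields[0] == "ERROR").count()

  parsed.unpersist()   # free storage memory once the RDD is no longer needed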

spark.shuffle.service.enabled
  Cause: Spark shuffle service is not enabled.
  Recommendation: Change spark.shuffle.service.enabled from CURRENT VALUE to PROPOSED VALUE

spark.shuffle.service.enabled
  Cause: Spark shuffle service is not enabled.
  Recommendation: To speed up your app: Enable the Spark shuffle service by setting the app’s configured spark.shuffle.service.enabled=true

spark.serializer
  Cause: KryoSerializer is not enabled.
  Recommendation: Change spark.serializer from CURRENT VALUE to PROPOSED VALUE

spark.serializer
  Cause: KryoSerializer is not enabled.
  Recommendation: To speed up your app: Enable the Kryo serializer by setting the app’s configured spark.serializer=org.apache.spark.serializer.KryoSerializer
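
The two configuration recommendations above (external shuffle service and Kryo serialization) map onto the settings in the sketch below. This is a hedged example using the standard property values; the external shuffle service additionally has to be running on the cluster's nodes (for example, as the YARN auxiliary service).

  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .appName("shuffle-and-serializer-sketch")
           # Requires the external shuffle service to be deployed on the nodes;
           # commonly paired with dynamic allocation.
           .config("spark.shuffle.service.enabled", "true")
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           # Optional: registering frequently serialized classes improves Kryo's
           # efficiency (spark.kryo.classesToRegister takes a comma-separated list).
           .getOrCreate())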