Hive on Tez Recommendations

Pepperdata recommendations for Hive on Tez queries are generated by the Query Profiler, which is automatically enabled when you configure Query Spotlight. The Hive tile in the Recommendations section of the Pepperdata dashboard Home page shows how many recommendations for Hive on Tez queries were made during the last 24 hours, along with their severity levels. (If there are no recommendations for Hive on Tez queries, the Hive tile does not appear on the dashboard.)

Supervisor v6.5.x or later is required to receive recommendations for Hive on Tez queries.

Recommendations information is shown in several places in the Pepperdata dashboard:

  • To see a table of all the Hive on Tez queries that received recommendations, sorted by severity level, click any severity’s linked text in the Hive tile.

  • To see the recommendations’ severity levels for all recently run queries, show the Queries Overview page by using the left-nav menu to select Query Spotlight > Queries.

  • To view the Query Profiler report (for all profiled queries), click the title of the Hive tile, or use the left-nav menu to select Query Spotlight > Query Profiler.

For each Pepperdata recommendation for Hive on Tez queries, the table lists the recommendation’s name, its type (general guidance or specific tuning values to change), the cause that triggered it, the text of the actual recommendation, and notes that provide additional information.

Because Pepperdata is continually improving the recommendations as more and more applications and queries are profiled, the name, cause, and/or recommendation text might be slightly different from what’s shown in this documentation.

For details about how the recommendations appear in an application’s detail page, see Recommendations Tab.
Hive on Tez Recommendations

Excessive GC duration

Cause: Your tasks spent an average of <N>% of their time on garbage collection (GC). Long GC duration contributes to task duration and slows down the application.

Recommendation: Start by allocating more memory to tasks or by tuning the GC configuration before re-running the query. Increase the Tez container size (hive.tez.container.size) to a small multiple of the minimum YARN container size (yarn.scheduler.minimum-allocation-mb). Ensure that 80% of the allocated memory is provided to the Java heap (tez.container.max.java.heap.fraction = 0.8). Ensure that large hash tables are used for map joins (we recommend setting hive.auto.convert.join.noconditionaltask.size = 0.33 * hive.tez.container.size).

Notes:

  • Excessive GC time increases total task duration and application run time. Increasing the available memory or tuning the GC configuration can improve GC efficiency.

  • This recommendation’s severity ranges from Low to Critical depending on the ratio of the average task GC time to the average task CPU time.
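
For example, a minimal session-level sketch of these settings, assuming an illustrative yarn.scheduler.minimum-allocation-mb of 1024 MB and a 4096 MB container (whether a given property can be overridden per session depends on your cluster configuration):

    -- Illustrative values only: a 4096 MB Tez container is a 4x multiple of an
    -- assumed 1024 MB yarn.scheduler.minimum-allocation-mb.
    SET hive.tez.container.size=4096;
    -- Give 80% of the container memory to the Java heap.
    SET tez.container.max.java.heap.fraction=0.8;
    -- Roughly 0.33 * 4096 MB, expressed in bytes, for map-join hash tables.
    SET hive.auto.convert.join.noconditionaltask.size=1417339207;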

Imbalanced work across Mappers

Cause (one of the following):

  • Input data skew across mappers. One group (<N> mapper vertices that worked on an average of <N SIZE> of data each) worked on >= the firing threshold of <N> times more data than the other group (<N> mapper vertices that worked on an average of <N SIZE> of data each).

  • Imbalanced time spent across mapper tasks. One group (<N> mapper vertices that spent an average of <N TIME> each) spent >= the firing threshold of <N> times more than the other group (<N> mapper vertices that spent an average of <N TIME> each).

Recommendation: To speed up your query, decrease the number of tasks per mapper vertex by decreasing the value of tez.grouping.split-count. Also, increase the values of tez.grouping.min-size (default = 50 MB) and tez.grouping.max-size (default = 1 GB).

Notes: This recommendation analyzes the HDFS input data and runtime across all the mapper vertices to check for data or time skew. If both data and time skew exist, the one with the worse severity is shown as the cause.
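
As a sketch (the value is illustrative, not prescriptive), the split count can be lowered at the session level so that each mapper vertex runs fewer tasks on more data apiece:

    -- Illustrative only: ask Tez to group the input into fewer, larger splits.
    -- Pair this with larger tez.grouping.min-size/max-size values if needed.
    SET tez.grouping.split-count=50;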

Imbalanced work across Reducers

Cause (one of the following):

  • Input data skew across reducers. One group (<N> reducer vertices that worked on an average of <N SIZE> of data each) worked on >= the firing threshold of <N> times more data than the other group (<N> reducer vertices that worked on an average of <N SIZE> of data each).

  • Imbalanced time spent across reducer tasks. One group (<N> reducer vertices that spent an average of <N TIME> each) spent >= the firing threshold of <N> times more than the other group (<N> reducer vertices that spent an average of <N TIME> each).

Recommendation: To speed up your query, decrease the number of tasks per reducer vertex by increasing the value of hive.exec.reducers.bytes.per.reducer. Also, limit the number of reducers by setting hive.exec.reducers.max, and ensure that auto reducer parallelism is enabled (set hive.tez.auto.reducer.parallelism = true).

Notes: This recommendation analyzes the local input data and runtime across all the reducer vertices to check for data or time skew. If both data and time skew exist, the one with the worse severity is shown as the cause.
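
For example, a session-level sketch (the sizes and the reducer cap are illustrative):

    -- Illustrative values: give each reducer more data so that fewer reducers run.
    SET hive.exec.reducers.bytes.per.reducer=1073741824;  -- 1 GB per reducer, in bytes
    SET hive.exec.reducers.max=200;                        -- cap the number of reducers
    SET hive.tez.auto.reducer.parallelism=true;            -- let Tez trim reducers at runtime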

Too short average mapper runtime

Cause: Runtimes are too short for mappers. <N> mappers took an average of <N TIME>. Average mapper input size was <N SIZE>.

Recommendation: To speed up your query, reduce the number of mappers (tasks per mapper vertex), which lets each mapper process more data, by increasing the values of the mapper split-size properties: tez.grouping.min-size and tez.grouping.max-size.

Notes: This heuristic analyzes the average runtimes of mapper tasks.
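
For instance, a sketch that raises both split-size bounds for the session (the sizes are illustrative; both properties take values in bytes):

    -- Illustrative values, in bytes: larger splits mean fewer, longer-running mappers.
    SET tez.grouping.min-size=268435456;   -- 256 MB (default is 50 MB)
    SET tez.grouping.max-size=2147483648;  -- 2 GB (default is 1 GB)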

Too long average mapper runtime

Cause: Runtimes are too long for mappers. <N> mappers took an average of <N TIME>. Average mapper input size was <N SIZE>.

Recommendation: To speed up your query, increase the number of mappers (tasks per mapper vertex), which lets each mapper process less data, by decreasing the values of the mapper split-size properties: tez.grouping.min-size and tez.grouping.max-size.

Notes: This heuristic analyzes the average runtimes of mapper tasks.
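
Conversely, a sketch that lowers both bounds so the input is split into more, smaller pieces (illustrative byte values):

    -- Illustrative values, in bytes: smaller splits mean more mappers, each with less data.
    SET tez.grouping.min-size=16777216;   -- 16 MB (default is 50 MB)
    SET tez.grouping.max-size=268435456;  -- 256 MB (default is 1 GB)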

Table Statistics out of date

Cause: Statistics for the following tables are out of date: <table1>, <table2>[, … <tableN>]

Recommendation: Run compute stats on the following table(s): <table1>, <table2>[, … <tableN>]. For example: ANALYZE TABLE <table_name> COMPUTE STATISTICS;

Notes: Hive’s Cost-Based Optimizer (CBO) chooses the best query plan based on the table statistics. Missing or out-of-date statistics can lead to inaccurate plans and poor query performance.
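
For example, for a hypothetical, unpartitioned table named sales, the statistics can be refreshed and then spot-checked:

    -- 'sales' is a hypothetical table name used for illustration.
    ANALYZE TABLE sales COMPUTE STATISTICS;
    -- The Table Parameters section of the output should show updated values
    -- such as numRows and totalSize.
    DESCRIBE FORMATTED sales;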

Missing Column Statistics

Cause: The following tables are missing column statistics: <table1>, <table2>[, … <tableN>]

Recommendation: Run column statistics for the frequently accessed columns in the following table(s): <table1>, <table2>[, … <tableN>]. For example: ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS <comma-separated column list>;

Notes: Although Hive generates table statistics by default, it does not generate column statistics by default because that’s a resource-intensive operation. To make the Hive Cost-Based Optimizer (CBO) fully functional, you must generate the column statistics manually.
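
For example, assuming a hypothetical sales table whose region and amount columns appear frequently in filters and joins:

    -- 'sales', 'region', and 'amount' are hypothetical names used for illustration.
    ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS region, amount;
    -- Spot-check the collected statistics for a single column.
    DESCRIBE FORMATTED sales region;
    -- Depending on the Hive version and configuration, the optimizer reads column
    -- statistics only when this property is enabled.
    SET hive.stats.fetch.column.stats=true;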