Hive on Tez Recommendations
Pepperdata recommendations for Hive on Tez queries are generated by the Query Profiler, which is automatically enabled when you configure Query Spotlight. The Hive tile in the Recommendations section of the Pepperdata dashboard Home page shows how many recommendations for Hive on Tez queries were made during the last 24 hours, along with their severity levels. (If there are no recommendations for Hive on Tez queries, the Hive tile does not appear on the dashboard.)
Recommendation information appears in several places in the Pepperdata dashboard:
- To see a table of all the Hive on Tez queries that received recommendations, sorted by severity level, click any severity’s linked text in the Hive tile.
- To see the recommendations’ severity levels for all recently run queries, open the Queries Overview page: in the left-nav menu, select Query Spotlight > Queries.
- To view the Query Profiler report (for all profiled queries), click the title of the Hive tile, or use the left-nav menu to select Query Spotlight > Query Profiler.
The following table describes the Pepperdata recommendations for Hive on Tez queries: each recommendation’s name, its type (general guidance or specific tuning values to change), what triggered the recommendation (the cause), the text of the actual recommendation, and notes that provide additional information.
For details about how the recommendations appear in an application’s detail page, see Recommendations Tab.
| Name | Type (Guidance or Tuning) | Cause | Recommendation | Notes |
| Excessive GC duration | | Your tasks spent an average of <N>% of the time on Garbage Collection. Long GC duration contributes to task duration and slows down the application. | Start by allocating more memory to tasks or by tuning the GC configuration before re-running the query. Increase the Tez container size ( | |
| Imbalanced work across Mappers | | | To speed up your query: Decrease the number of tasks per mapper vertex by increasing the value of | This recommendation analyzes the HDFS input data and runtime across all the mapper vertices to check for data or time skew. If both data and time skew exist, the one with the worse severity is shown as the cause. |
| Imbalanced work across Reducers | | | To speed up your query: Decrease the number of tasks per reducer vertex by increasing the value of | This recommendation analyzes the local input data and runtime across all the reducer vertices to check for data or time skew. If both data and time skew exist, the one with the worse severity is shown as the cause. |
| Too short average mapper runtime | | Runtimes are too short for mappers. <N> mappers took an average of <N TIME>. Average mapper input size was <N SIZE>. | To speed up your query, reduce the number of mappers (tasks per mapper vertex), which lets each mapper process more data, by increasing the mapper split size properties’ values: | This heuristic analyzes the average runtimes of mapper tasks. |
| Too long average mapper runtime | | Runtimes are too long for mappers. <N> mappers took an average of <N TIME>. Average mapper input size was <N SIZE>. | To speed up your query, increase the number of mappers (tasks per mapper vertex), which lets each mapper process less data, by decreasing the mapper split size properties’ values: | This heuristic analyzes the average runtimes of mappers. |
| Table Statistics out of date | | Statistics for the following tables are out of date: <table1>, <table2>[, … <tableN>] | Run compute stats on the following table(s): <table1>, <table2>[, … <tableN>]. For example: | Hive’s Cost-Based Optimizer (CBO) chooses the best query plan based on the table statistics. Missing or out-of-date statistics can lead to inaccurate plans and poor query performance. |
| Missing Column Statistics | | The following tables are missing column statistics: <table1>, <table2>[, … <tableN>] | Run column statistics for the frequently accessed columns in the following table(s): <table1>, <table2>[, … <tableN>]. For example: | Although Hive generates table statistics by default, it does not generate column statistics by default because that’s a resource-intensive operation. To make the Hive Cost-Based Optimizer (CBO) fully-functional, you must manually generate the column statistics. |
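The two statistics recommendations list the affected tables, so one way to act on them is to generate the corresponding Hive statements from that list. The following is a minimal sketch, not Pepperdata tooling: the helper name `analyze_statements`, the table names `sales` and `customers`, and the column names are hypothetical placeholders. It relies only on Hive's `ANALYZE TABLE ... COMPUTE STATISTICS` and `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` statements. Depending on your Hive version, partitioned tables may also need a `PARTITION` clause, which this sketch does not add.

```python
def analyze_statements(tables, columns_by_table=None):
    """Build HiveQL ANALYZE statements for the given tables.

    tables: iterable of table names (hypothetical placeholders here).
    columns_by_table: optional dict mapping a table name to the list of
        frequently accessed columns that need column statistics.
    """
    columns_by_table = columns_by_table or {}
    statements = []
    for table in tables:
        # Refresh table-level statistics used by the Cost-Based Optimizer.
        statements.append(f"ANALYZE TABLE {table} COMPUTE STATISTICS;")
        columns = columns_by_table.get(table)
        if columns:
            # Column statistics are not generated by default, so request
            # them explicitly for the listed columns only.
            column_list = ", ".join(columns)
            statements.append(
                f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list};"
            )
    return statements


if __name__ == "__main__":
    # Hypothetical tables and columns taken from a recommendation's table list.
    for statement in analyze_statements(
        ["sales", "customers"],
        {"sales": ["customer_id", "order_date"]},
    ):
        print(statement)
```

Run the printed statements in your usual Hive client. Limiting column statistics to the frequently accessed columns keeps the cost of this resource-intensive operation down.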