Impala Recommendations

Pepperdata recommendations for Impala queries are generated by the Query Profiler, which is automatically enabled when you configure Query Spotlight. The Impala tile in the Recommendations section of the Pepperdata dashboard shows how many recommendations were made during the last 24 hours, along with their severity levels. (If there are no recommendations for Impala queries, the Impala tile does not appear on the dashboard.)

Supervisor v6.4.x or later is required to receive recommendations for Impala queries.

Recommendations information is shown in several places in the Pepperdata dashboard:

  • To see a table of all the queries that received recommendations at a given severity level, click the linked severity text in the Impala tile.

  • To see the recommendations’ severity levels for all recently run queries, show the Queries Overview page by using the left-nav menu to select Query Spotlight > Queries.

  • To view the Query Profiler report (for all profiled queries), click the title of the Impala tile, or use the left-nav menu to select Query Spotlight > Query Profiler.

The table describes the Pepperdata recommendations for Impala queries: each recommendation’s name, its type (general guidance or specific tuning values to change), what triggered the recommendation (the cause), the text of the actual recommendation, and notes that provide additional information.

Because Pepperdata is continually improving the recommendations as more and more applications and queries are profiled, the name, cause, and/or recommendation text might be slightly different from what’s shown in this documentation.

For details about how the recommendations appear in an application’s detail page, see Recommendations Tab.
Impala Recommendations
Name Type Cause Recommendation Notes
Guidance Tuning

Impala tables missing statistics

The following tables are missing relevant table and/or column statistics: <table1>, <table2>

Run compute stats on the following tables: <table1>, <table2>

The Impala query planner uses table and column statistics to generate effective plans. Missing statistics can cause inaccurate plans and poor query performance.

Impala statistics out of date

The table and/or column statistics for the following tables are out-of-date: <table1>, <table2>

Run compute stats on the following tables: <table1>, <table2>

The Impala query planner uses table and column statistics to generate effective plans. Out-of-date statistics can cause inaccurate plans—wrong join types or wrong join order—which in turn leads to longer query runtimes.

Impala disable codegen

The query took <N> milliseconds. Codegen phase is enabled for this query.

Set query option DISABLE_CODEGEN to 1

For simple queries, the codegen phase can represent a significant amount of the total runtime. Disabling codegen could speed up the query.

Impala Cross join

Cross join found in the query. The result set of the Cross join is greater than or equal to <X>.

Rewrite the query to add join conditions and eliminate Cross joins.

Cross join of large tables can be resource intensive.

Impala join count and rows processed by join operations exceeds threshold

The query has more than <X> join operations. The total number of rows processed is exceeding <N>.

Denormalize tables to eliminate the need of joins.

A lot of joins that process a large number of rows can be resource intensive.

Impala query planning exceeded threshold

The query planning phase consumed <X>% of the overall query execution time.

Simplify the query by reducing the number of filters, columns returned or joins.

A query planning phase that consumes a large percentage of the overall runtime indicates that the query is complex or that a metadata refresh occurred during that planing phase.

Impala sort operations spilled to disk

The sort operation spilled <SIZE> to disk for sorting <SIZE> of data.

Increase the memory for the query by using the MEM_LIMIT query option.

Reduce the number of rows processed by adding WHERE clause in SELECT statement.

Reduce the number of rows processed by adding join conditions.

Runtimes can increase when a sort operation must process large data that does not fit into the memory allocated for the query.

Impala HDFS read skew

The following hosts read <N> or more times higher HDFS data than the average HDFS data read by other hosts in the cluster.

The following hosts read between <N> and <> times higher HDFS data than the average HDFS data read by other hosts in the cluster.

If the query is artificially small, perhaps for benchmarking purposes, scale it up to process a larger data set. Aim for a “sweet spot” where each node reads 2 GB or more from HDFS per query. Queries that process lower volumes than that could experience inconsistent performance that smooths out as queries become more data-intensive.

Reduce the amount of compression applied to the data. For the text data files, the highest degree of compression (gzip) produces unsplittable files that are more difficult for Impala to process in parallel, and require extra memory during processing to hold the compressed and uncompressed data simultaneously. For binary formats such as Parquet and Avro, compression can result in fewer data blocks overall, but remember that when queries process relatively few blocks, there is less opportunity for parallel execution and many nodes in the cluster might sit idle.

Runtimes can increase when one host takes substantially longer than the other hosts to finish the work. The extra time needed for the slow host can become the dominant factor in overall query performance.

Impala HDFS partition pruning

<N> files were read out of <N> on host: <host-URL>.

Consider using the partition key column in the WHERE clause of the query. To see the existing partitions of a table, use the “SHOW PARTITIONS” statement.

Consider using a partition key that is frequently used to filter query results. To see the existing partitions of a table, use the “SHOW PARTITIONS” statement.

Partition pruning speeds up the query by pre-selecting only the partitions and files that have data that is relevant to the query, which reduces the amount of data to process.