Configure Spark History Servers (Cloud)
If you’re using Application Profiler to fetch history data for Spark apps, you can customize the connection timeout value and/or add a second Spark History Server for monitoring.
Configure Connection Timeout for Spark History Server
In environments with extreme network latency or frequent connectivity issues, it can be helpful to increase the connection timeout setting for REST requests that fetch Spark app data from the Spark History Server. By default, Pepperdata waits five (5) seconds before timing out, but you can change this value to suit your environment.
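For example, in an environment where history fetches routinely take longer than five seconds, you might raise the timeout to 30 seconds (an illustrative value; choose one appropriate for your network):

export PD_JOBHISTORY_SPARK_CONNECTION_TIMEOUT_SEC=30

The procedure below shows where this setting goes.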
Related Topics
- Configure History Fetcher Retries. To ensure that application history is successfully fetched from the applicable component (MapReduce Job History Server for MapReduce apps, Spark History Server for Spark apps, or YARN Timeline Server for Tez apps), the Pepperdata Supervisor uses a two-phase approach. Phase 1 makes the initial fetch attempt and, if it fails, up to three retries. Phase 2 makes an additional attempt and, by default, up to five more retries, with the interval between retries increasing by a factor of five each time. You can customize the number of retries for each phase, which might be required for environments with extreme network latency or frequent connectivity issues.
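For example, assuming a one-second initial retry interval (an illustrative value; the actual base interval depends on your configuration), the default phase 2 schedule would wait approximately 1, 5, 25, 125, and 625 seconds before its successive retries.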
Procedure
1. In your cloud environment (such as GCP or AWS), configure the connection timeout, PD_JOBHISTORY_SPARK_CONNECTION_TIMEOUT_SEC.
   a. From the environment’s cluster configuration folder (in the cloud), download the Pepperdata configuration file, /etc/pepperdata/pepperdata-config.sh, to a location where you can edit it.
   b. Open the file for editing, and add the environment variable for the connection timeout, in the following format. Be sure to replace the default connection timeout (5 seconds) with your custom value.
      export PD_JOBHISTORY_SPARK_CONNECTION_TIMEOUT_SEC=5
   c. Save your changes and close the file.
   d. Upload the revised file to overwrite the original pepperdata-config.sh file.
      If there are no already-running hosts with Pepperdata, you are done with this procedure. Do not perform the remaining steps.
2. Open a command shell (terminal session) and log in to any already-running host as a user with sudo privileges.
   Important: You can begin with any host on which Pepperdata is running, but be sure to repeat the login (this step), copying the bootstrap file (next step), and loading the revised Pepperdata configuration (the following step) on every already-running host.
3. From the command line, copy the Pepperdata bootstrap script that you extracted from the Pepperdata package to any convenient location; in this procedure’s steps, we’ve copied it to /tmp.
   - For Amazon EMR clusters:
     aws s3 cp s3://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
   - For Google Dataproc clusters:
     sudo gsutil cp gs://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
4. Load the revised configuration by running the Pepperdata bootstrap script.
   - For EMR clusters:
     - You can use the long-option form of the --bucket, --upload-realm, and --is-running arguments as shown, or their short-option equivalents, -b, -u, and -r.
     - The --is-running (-r) option is required for bootstrapping an already-running host prior to Supervisor version 7.0.13.
     - Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts. Include the --proxy-address (or --emr-proxy-address for Supervisor version 8.0.24 or later) argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses the https protocol.
     - If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use the http or https protocol.)
     - If you are using a script from an earlier Supervisor version that has the --cluster (-c) argument instead of the --upload-realm (-u) argument (which was introduced in Supervisor v6.5), you can continue using the script and its old arguments; they are backward compatible.
     - Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata. Specify either or both of the following options in the bootstrap’s optional arguments, substituting your values for the <my-retries> and <my-timeout> placeholders shown in the command. (For an illustrative sketch of this retry logic, see the example after this procedure.)
       - --max-retry-attempts (default: 10): the maximum number of retry attempts to make after the initial describe-cluster call.
       - --max-timeout (default: 60): the maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 through the calculated timeout (inclusive), which introduces the desired jitter.
     # For Supervisor versions before 7.0.13:
     sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> --is-running [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]

     # For Supervisor versions 7.0.13 to 8.0.23:
     sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]

     # For Supervisor versions 8.0.24 and later:
     sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--emr-proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
   - For Dataproc clusters:
     sudo bash /tmp/bootstrap <bucket-name> <realm-name>
     The script finishes with a Pepperdata installation succeeded message.
5. Repeat steps 2–4 on every already-running host in your cluster.
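To make the optional arguments concrete, a hypothetical invocation for Supervisor 8.0.24 or later, with a proxy, might look like the following. All three values are placeholders; substitute your own bucket name, realm name, and proxy address.

sudo bash /tmp/bootstrap --bucket my-pd-bucket --upload-realm my-realm --emr-proxy-address https://proxy.example.com:8443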
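As a rough illustration of the backoff-and-jitter behavior that the --max-retry-attempts and --max-timeout options control, consider the following bash sketch. It is not the bootstrap script’s actual code; the cluster ID is a hypothetical placeholder, and the doubling backoff base is an assumption.

#!/bin/bash
# Illustrative sketch only: exponential backoff with jitter around
# "aws emr describe-cluster", approximating the retry behavior described
# above. Not the Pepperdata bootstrap script's actual implementation.
CLUSTER_ID="j-XXXXXXXXXXXXX"  # hypothetical placeholder
MAX_RETRY_ATTEMPTS=10         # corresponds to --max-retry-attempts (default 10)
MAX_TIMEOUT=60                # corresponds to --max-timeout (default 60 seconds)

attempt=0
until aws emr describe-cluster --cluster-id "$CLUSTER_ID" > /dev/null 2>&1; do
  attempt=$((attempt + 1))
  if [ "$attempt" -gt "$MAX_RETRY_ATTEMPTS" ]; then
    echo "describe-cluster still failing after $MAX_RETRY_ATTEMPTS retries" >&2
    exit 1
  fi
  # Exponential backoff: raise the ceiling each attempt (doubling is an
  # assumption here), capped at MAX_TIMEOUT.
  backoff=$((2 ** attempt))
  if [ "$backoff" -gt "$MAX_TIMEOUT" ]; then
    backoff=$MAX_TIMEOUT
  fi
  # Jitter: wait a random 1..backoff seconds (inclusive), as described above.
  sleep $(((RANDOM % backoff) + 1))
done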
Read from Two Spark History Servers
If you have two Spark History Servers (typically because you’re in the midst of upgrading to a newer version, and you’re temporarily running both versions during the migration), you can configure Pepperdata to read from both of them. Perform the configuration for the second Spark History Server on any persistent, static host other than the MapReduce Job History Server host (which is the primary history fetcher host for Pepperdata). On the chosen host, add environment variables for the second Spark History Server to the Pepperdata configuration.
Prerequisites
- Complete the regular Pepperdata installation steps; see Installing Pepperdata.
- Complete the regular Pepperdata configuration steps, including the configuration of the first Spark History Server; see Configuring Pepperdata.
Procedure
1. In your cloud environment (such as GCP or AWS), add the configuration for the second Spark History Server.
   a. From the environment’s cluster configuration folder (in the cloud), download the Pepperdata configuration file, /etc/pepperdata/pepperdata-config.sh, to a location where you can edit it.
   b. Open the file for editing, and add the required environment variables as follows.
      - If the spark-defaults.conf file contains the correct assignment for spark.yarn.historyServer.address for the second Spark History Server, configure the SPARK_CONF_DIR environment variable to match:
        export SPARK_CONF_DIR=your-path-to-second-spark-conf-directory
        Where your-path-to-second-spark-conf-directory is the directory that contains the spark-defaults.conf file.
      - If the spark-defaults.conf file does not include spark.yarn.historyServer.address (or its value is incorrect) and you can edit the spark-defaults.conf file:
        1. Edit the spark-defaults.conf file so that it includes the correct assignment for spark.yarn.historyServer.address for the second Spark History Server.
        2. In the pepperdata-config.sh file, configure the SPARK_CONF_DIR environment variable to match:
           export SPARK_CONF_DIR=your-path-to-second-spark-conf-directory
           Where your-path-to-second-spark-conf-directory is the directory that contains the spark-defaults.conf file.
      - For all other cases, edit the pepperdata-config.sh file to include the PD_SPARK_HISTORY_SERVER_ADDRESS environment variable, and set its value to the second Spark History Server’s fully-qualified URL.
   c. Disambiguate the two Spark History Servers by adding the following configuration options. (A consolidated example of these settings appears after this procedure.)
      export PD_BYPASS_JOBHISTORY_IS_LOCAL_CHECK=1
      export PD_DO_JOBHISTORY_STARTUP_CHECK=0
      export PD_JOBHISTORY_FETCHERS="spark"
   d. Save your changes and close the file.
   e. Upload the revised file to overwrite the original pepperdata-config.sh file.
      If there are no already-running hosts with Pepperdata, you are done with this procedure. Do not perform the remaining steps.
2. Open a command shell (terminal session) and log in to any persistent, already-running host other than the MapReduce Job History Server host (which is the primary history fetcher host for Pepperdata).
3. From the command line, copy the Pepperdata bootstrap script that you extracted from the Pepperdata package to any convenient location; in this procedure’s steps, we’ve copied it to /tmp.
   - For Amazon EMR clusters:
     aws s3 cp s3://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
   - For Google Dataproc clusters:
     sudo gsutil cp gs://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
4. Load the revised configuration by running the Pepperdata bootstrap script.
   - For EMR clusters:
     - You can use the long-option form of the --bucket, --upload-realm, and --is-running arguments as shown, or their short-option equivalents, -b, -u, and -r.
     - The --is-running (-r) option is required for bootstrapping an already-running host prior to Supervisor version 7.0.13.
     - Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts. Include the --proxy-address (or --emr-proxy-address for Supervisor version 8.0.24 or later) argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses the https protocol.
     - If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use the http or https protocol.)
     - If you are using a script from an earlier Supervisor version that has the --cluster (-c) argument instead of the --upload-realm (-u) argument (which was introduced in Supervisor v6.5), you can continue using the script and its old arguments; they are backward compatible.
     - Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata. Specify either or both of the following options in the bootstrap’s optional arguments, substituting your values for the <my-retries> and <my-timeout> placeholders shown in the command. (For an illustrative sketch of this retry logic, see the example following the connection-timeout procedure earlier on this page.)
       - --max-retry-attempts (default: 10): the maximum number of retry attempts to make after the initial describe-cluster call.
       - --max-timeout (default: 60): the maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 through the calculated timeout (inclusive), which introduces the desired jitter.
     # For Supervisor versions before 7.0.13:
     sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> --is-running [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]

     # For Supervisor versions 7.0.13 to 8.0.23:
     sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]

     # For Supervisor versions 8.0.24 and later:
     sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--emr-proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
   - For Dataproc clusters:
     sudo bash /tmp/bootstrap <bucket-name> <realm-name>
     The script finishes with a Pepperdata installation succeeded message.
5. Repeat steps 2–4 on any other already-running hosts that you want to read from the second Spark History Server.
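To tie the second-server settings together, the additions to pepperdata-config.sh on the chosen host might look like the following sketch. The server URL is an illustrative placeholder; substitute your own values.

# Illustrative additions to /etc/pepperdata/pepperdata-config.sh on the host
# that reads from the second Spark History Server. The URL is a placeholder.
export PD_SPARK_HISTORY_SERVER_ADDRESS=http://second-shs.example.com:18080

# Disambiguation settings (step 1c of this procedure):
export PD_BYPASS_JOBHISTORY_IS_LOCAL_CHECK=1
export PD_DO_JOBHISTORY_STARTUP_CHECK=0
export PD_JOBHISTORY_FETCHERS="spark"

If you use the SPARK_CONF_DIR approach instead, omit PD_SPARK_HISTORY_SERVER_ADDRESS and point SPARK_CONF_DIR at the directory whose spark-defaults.conf correctly assigns spark.yarn.historyServer.address for the second server.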