Set Up a Pepperdata Proxy (Cloud)
If your Pepperdata-enabled cluster hosts must be isolated from other hosts—perhaps in accordance with custom firewall rules—you can use a proxy server on your network to enable Pepperdata functionality.
Pepperdata is fully integrated with the standard https_proxy environment variable, which you can configure in the Pepperdata configuration file.
If applicable for your environment and version of Pepperdata Supervisor, add the proxy settings to the Pepperdata configuration.
• EMR clusters with already-running hosts. Skip this step. Instead, perform steps 2–4 for every already-running host, and be sure to specify the proxy server by using the --proxy-address argument when you run the Pepperdata bootstrap script.
• EMR clusters without already-running hosts. Skip this procedure. Instead, specify the proxy server by including the --proxy-address argument in the Optional arguments field when you create the cluster with the Create cluster wizard.
• Dataproc clusters. Perform this step and, if there are already-running hosts, steps 2–4 for each already-running host.
In your cloud environment (such as GDP or AWS), configure the proxy settings.
If there are no already-running hosts with Pepperdata, you are done with this procedure. Do not perform the remaining steps.
From the environment’s cluster configuration folder (in the cloud), download the Pepperdata configuration file,
/etc/pepperdata/pepperdata-config.sh, to a location where you can edit it.
Open the file for editing, and uncomment and update the environment variable for the proxy host and port.
Be sure to replace the placeholders, such as my_proxy_port, with your actual proxy server name and port number.
Save your changes and close the file.
Upload the revised file to overwrite the original.
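As a rough illustration of the edit described above, the sketch below shows how an uncommented https_proxy setting might look in a shell-style configuration file. The export syntax, the host proxy.example.com, and the port 3128 are assumptions for illustration only; the commented-out line in your copy of pepperdata-config.sh is the authoritative template.

```shell
# Hypothetical values; substitute your own proxy server name and port number.
my_proxy_url="proxy.example.com"
my_proxy_port="3128"

# Assumed form of the uncommented setting (the real file may word it differently).
export https_proxy="${my_proxy_url}:${my_proxy_port}"

# Sanity check before uploading the file: the port must be all digits.
case "$my_proxy_port" in
  ''|*[!0-9]*) echo "invalid port: $my_proxy_port" >&2 ;;
  *) echo "https_proxy=$https_proxy" ;;
esac
```

Running a check like this on your workstation before uploading can catch a mistyped port early, because a bad value otherwise surfaces only when hosts fail to reach the dashboard.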
Open a command shell (terminal session) and log in to any already-running host as a user with sudo privileges.
Important: You can begin with any host on which Pepperdata is running, but be sure to repeat the login (this step), the copying of the bootstrap file (next step), and the loading of the revised Pepperdata configuration (the following step) on every already-running host.
From the command line, copy the Pepperdata bootstrap script that you extracted from the Pepperdata package from its local location to any location; in this procedure’s steps, we’ve copied it to the location shown in the following commands.
For Amazon EMR clusters:
aws s3 cp s3://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
For Google Dataproc clusters:
sudo gsutil cp gs://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
Load the revised configuration by running the Pepperdata bootstrap script.
For EMR clusters:
You can use the --long-options form of the arguments (such as --is-running) as shown or their -short-option equivalents. The --is-running (-r) option is required for bootstrapping an already-running host prior to Supervisor version 7.0.13.
Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts. Include the --proxy-address argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address.
If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address.
If you are using a script from an earlier Supervisor version that has the -c arguments instead of the -u arguments (which were introduced in Supervisor v6.5), you can continue using the script and its old arguments. They are backward compatible.
Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrap script uses to retrieve the cluster’s metadata.
Specify either or both of the following options in the bootstrap’s Optional arguments. Be sure to substitute your values for the <my-retries> and <my-timeout> placeholders that are shown in the command.
max-retry-attempts (default=10): Maximum number of retry attempts to make after the initial call.
max-timeout (default=60): Maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 to the calculated timeout (inclusive), which introduces the desired jitter.
# For Supervisor versions before 7.0.13:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> --is-running [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
# For Supervisor versions 7.0.13 and later:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
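To make the retry options more concrete, the following sketch prints the waits that an exponential backoff with jitter might produce for the default values. It is not the bootstrap script's actual code: the doubling schedule (2, 4, 8, ... seconds) and the function names calculated_timeout and jittered_wait are assumptions for illustration. Only the defaults (10 retries, 60 seconds) and the "random number from 1 to the calculated timeout" rule come from the option descriptions.

```shell
#!/bin/sh
# Sketch only: the doubling schedule is an assumption, not Pepperdata's documented formula.
max_retry_attempts=10   # documented default
max_timeout=60          # documented default, in seconds

# Calculated timeout for a retry: doubles with each attempt, capped at max_timeout.
calculated_timeout() {
  ct_attempt=$1
  ct_timeout=1
  while [ "$ct_attempt" -gt 0 ] && [ "$ct_timeout" -lt "$max_timeout" ]; do
    ct_timeout=$((ct_timeout * 2))
    ct_attempt=$((ct_attempt - 1))
  done
  if [ "$ct_timeout" -gt "$max_timeout" ]; then
    ct_timeout=$max_timeout
  fi
  echo "$ct_timeout"
}

# Jittered wait: a random whole number from 1 to the calculated timeout, inclusive.
jittered_wait() {
  jw_limit=$(calculated_timeout "$1")
  jw_seed=$(( $(date +%s) + $$ + $1 ))
  awk -v limit="$jw_limit" -v seed="$jw_seed" \
    'BEGIN { srand(seed); print int(rand() * limit) + 1 }'
}

# Print the wait each retry would use, without sleeping or calling any AWS command.
attempt=1
while [ "$attempt" -le "$max_retry_attempts" ]; do
  echo "retry $attempt: wait $(jittered_wait "$attempt")s (at most $(calculated_timeout "$attempt")s)"
  attempt=$((attempt + 1))
done
```

Under these assumptions, lowering max-timeout shortens the longest possible wait per retry, while lowering max-retry-attempts shortens the total time the bootstrap can spend retrying.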
For Dataproc clusters:
sudo bash /tmp/bootstrap <bucket-name> <realm-name>
The script finishes with a Pepperdata installation succeeded message.
Repeat steps 2–4 on every already-running host in your cluster.
(Optional) Verify that hosts can connect to the Pepperdata dashboard through the proxy server as it’s configured.
In a cluster without Pepperdata—so a different cluster from the one you are using for Pepperdata—log in to any running host.
Try to connect to the Pepperdata dashboard through the proxy server as it’s configured.
Be sure to replace the <my_proxy_url> and <my_proxy_port> placeholders with the same proxy server name and port number that you configured in step 1.
curl --proxy <my_proxy_url>:<my_proxy_port> --tlsv1.2 -v https://upload-main.pepperdata.com
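If you want a pass/fail result rather than verbose output, the same check can be wrapped in a small helper that relies on curl's exit status. This is a sketch, not part of the Pepperdata tooling: the function name check_proxy and the 30-second limit are assumptions. Note that curl exits with 0 whenever the connection and TLS handshake succeed, even if the server returns an HTTP error status, so this tests reachability through the proxy only.

```shell
# Hypothetical helper: check_proxy <my_proxy_url> <my_proxy_port>
# Reports whether the dashboard upload host is reachable through the proxy.
check_proxy() {
  if curl --proxy "$1:$2" --tlsv1.2 --silent --show-error \
       --output /dev/null --max-time 30 \
       https://upload-main.pepperdata.com; then
    echo "proxy OK"
  else
    # $? here still holds curl's exit status from the if condition.
    echo "proxy check failed (curl exit $?)" >&2
    return 1
  fi
}
```

For example, running check_proxy with your proxy server name and port number as its two arguments prints "proxy OK" on success. On failure it prints curl's exit code, which you can look up in the curl documentation to tell a refused proxy connection from a timeout.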