Disable/Enable Pepperdata Data Collection for a Host (Cloud)

Occasionally you might not want Pepperdata to collect data from a cluster host on which Pepperdata is installed, or you might want to re-enable data collection for a host where you previously disabled it. In either case, you can disable or enable data collection for the host by setting the host’s PD_COLLECT_AND_UPLOAD environment variable.

Disable Pepperdata Data Collection for a Running Host

A typical reason for disabling Pepperdata data collection for a host is that you want to install Pepperdata in a test environment that mimics your Pepperdata-installed production environment as closely as possible, but you do not want Pepperdata to collect data from the test hosts. Another is that you want to install Pepperdata to manage edge hosts that are not managed by YARN, and you want to omit those hosts from Pepperdata capacity planning calculations.

When you disable data collection for a host, the Collector (the pepcollectd daemon) stops collecting data from the other Pepperdata agents and stops sending data to the Pepperdata dashboard. Charts, tables, and reports include all the data that was collected and sent to the dashboard before data collection was disabled, but contain no data for the period during which data collection is disabled. The remaining Pepperdata agents, such as PepAgent, continue to run, collecting metrics and performing calculations for dynamic allocation.

Procedure

Reconfigure the host that you want to disable.
1. Download the Pepperdata configuration file, pepperdata-config.sh, from the host that you want to disable to a location where you can edit it.
2. Open the file for editing, find the PD_COLLECT_AND_UPLOAD environment variable, and change its value from 1 to 0.
3. Save your changes and close the file.
4. Upload the revised file to overwrite the original pepperdata-config.sh file.
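The edit in step 2 can also be made non-interactively. The following is a minimal sketch, not a Pepperdata-supplied tool: it assumes the variable appears on its own line in the form `export PD_COLLECT_AND_UPLOAD=1` (the actual layout of pepperdata-config.sh may differ), and the helper name `set_collect_flag` and the `OTHER_SETTING` line are hypothetical. It operates on a throwaway file so that nothing on a real host is touched.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: the setting is written as
#   export PD_COLLECT_AND_UPLOAD=<0|1>
# on a single line; check your own pepperdata-config.sh first.

# set_collect_flag FILE VALUE  (hypothetical helper)
# Rewrites the value after "PD_COLLECT_AND_UPLOAD=" and leaves every other line alone.
set_collect_flag() {
  local file="$1" value="$2"
  sed -E "s/^([[:space:]]*(export[[:space:]]+)?PD_COLLECT_AND_UPLOAD=).*/\1${value}/" \
    "$file" > "${file}.new" && mv "${file}.new" "$file"
}

# Demonstration on a temporary stand-in for the downloaded configuration file.
demo_config="$(mktemp)"
printf 'export OTHER_SETTING=abc\nexport PD_COLLECT_AND_UPLOAD=1\n' > "$demo_config"

set_collect_flag "$demo_config" 0     # disable data collection
grep 'PD_COLLECT_AND_UPLOAD' "$demo_config"
```

After checking the result, you would still upload the revised file as described in step 4.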
Open a command shell (terminal session) and log in to the host that you want to disable, as a user with sudo privileges.
From the command line, copy the Pepperdata bootstrap script that you extracted from the Pepperdata package from where you stored it to any location on the host; in this procedure’s steps, we’ve copied it to /tmp.
- For Amazon EMR clusters:
  
  aws s3 cp s3://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
- For Google Dataproc clusters:
  
  sudo gsutil cp gs://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
Load the revised configuration by running the Pepperdata bootstrap script.
- For EMR clusters:
  - You can use the long-option forms of the arguments as shown (--bucket, --upload-realm, and --is-running) or their short-option equivalents (-b, -u, and -r).
  - For Supervisor versions earlier than 7.0.13, the --is-running (-r) option is required when bootstrapping an already-running host.
  - Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts.
    
    Include the --proxy-address argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses the https protocol.
  - If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use the http or https protocol.)
  - If you are using a script from an earlier Supervisor version that has the --cluster (-c) argument instead of the --upload-realm (-u) argument, which was introduced in Supervisor v6.5, you can continue using the script and its old arguments. They are backward compatible.
  - Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata.
    
    Specify either or both of the following options among the bootstrap script’s optional arguments. Be sure to substitute your values for the <my-retries> and <my-timeout> placeholders that are shown in the command.
    - --max-retry-attempts (default=10): Maximum number of retry attempts to make after the initial describe-cluster call.
    - --max-timeout (default=60): Maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 to the calculated timeout (inclusive), which introduces the desired jitter.
```
# For Supervisor versions before 7.0.13:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> --is-running [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
   
# For Supervisor versions 7.0.13 and later:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
```
- For Dataproc clusters:
  
  sudo bash /tmp/bootstrap <bucket-name> <realm-name>
The script finishes with a Pepperdata installation succeeded message.

Data collection immediately stops, and no more data (including data that was already collected but that hasn’t yet been uploaded due to the two- to five-minute upload interval) appears in the Pepperdata dashboard. Data previously uploaded to the dashboard remains available for charts, tables, and reports.
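To make the retry options for the EMR bootstrap more concrete, here is an illustrative sketch of exponential backoff with jitter. It is not the Pepperdata bootstrap's actual code: the doubling schedule that starts at 2 seconds is an assumption, and the function names `backoff_cap` and `jittered_wait` are hypothetical. Only the two defaults (10 retries, 60 seconds) and the "random number from 1 to the calculated timeout" rule come from the option descriptions above.

```shell
#!/usr/bin/env bash
# Sketch only: the real bootstrap's schedule is not documented here.

# backoff_cap ATTEMPT MAX_TIMEOUT
# Upper bound (seconds) for retry number ATTEMPT (1-based).
# Assumption: the bound doubles each retry (2, 4, 8, ...) until it reaches MAX_TIMEOUT.
backoff_cap() {
  local attempt="$1" max_timeout="$2" cap
  if [ "$attempt" -ge 30 ]; then
    cap="$max_timeout"                 # avoid shifting into very large numbers
  else
    cap=$(( 1 << attempt ))
    if [ "$cap" -gt "$max_timeout" ]; then cap="$max_timeout"; fi
  fi
  echo "$cap"
}

# jittered_wait CAP
# Random whole number of seconds between 1 and CAP, inclusive (uses bash's RANDOM).
jittered_wait() {
  echo $(( (RANDOM % $1) + 1 ))
}

max_retry_attempts=10   # documented default
max_timeout=60          # documented default

attempt=1
while [ "$attempt" -le "$max_retry_attempts" ]; do
  cap="$(backoff_cap "$attempt" "$max_timeout")"
  echo "retry ${attempt}: bound ${cap}s, sample wait $(jittered_wait "$cap")s"
  attempt=$(( attempt + 1 ))
done
```

Under these assumptions the bound reaches the 60-second ceiling at the sixth retry, so raising --max-timeout only affects later retries, while raising --max-retry-attempts extends how long the bootstrap keeps trying.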

Enable Pepperdata Data Collection for a Running Host

If data collection was disabled for a host, whether during installation or sometime later, you can manually re-enable it at any time. When you re-enable data collection for a host, data uploading resumes immediately. Depending on the configured disk space and data retention settings (see Configure Pepperdata Logs Retention and Disk Usage), up to seven days of previous data is also uploaded.

Procedure

Reconfigure the host that you want to enable.
1. Download the Pepperdata configuration file, pepperdata-config.sh, from the host that you want to enable to a location where you can edit it.
2. Open the file for editing, find the PD_COLLECT_AND_UPLOAD environment variable, and change its value from 0 to 1.
3. Save your changes and close the file.
4. Upload the revised file to overwrite the original pepperdata-config.sh file.
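Before uploading, it can help to confirm which state the edited file now describes. The sketch below is a hypothetical helper (`collect_state` is not a Pepperdata command) and assumes the setting is a single line of the form `PD_COLLECT_AND_UPLOAD=<0|1>`, optionally prefixed with `export` and optionally quoted; adjust it if your pepperdata-config.sh is laid out differently.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: one "PD_COLLECT_AND_UPLOAD=<value>" line, with or without "export".

# collect_state FILE  (hypothetical helper)
# Prints "enabled" for 1, "disabled" for 0, and "unknown" for anything else.
collect_state() {
  local value
  # Print whatever follows the "=", keep the last assignment, and drop any quotes.
  value="$(sed -n -E 's/^[[:space:]]*(export[[:space:]]+)?PD_COLLECT_AND_UPLOAD=//p' "$1" \
            | tail -n 1 | tr -d "\"'")"
  case "$value" in
    1) echo "enabled" ;;
    0) echo "disabled" ;;
    *) echo "unknown" ;;
  esac
}

# Demonstration on a temporary stand-in for the edited configuration file.
demo_config="$(mktemp)"
printf 'export PD_COLLECT_AND_UPLOAD=1\n' > "$demo_config"
collect_state "$demo_config"
```

An "unknown" result means the line was not found or holds an unexpected value, in which case it is safer to inspect the file by hand than to upload it.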
Open a command shell (terminal session) and log in to the host that you want to enable, as a user with sudo privileges.
From the command line, copy the Pepperdata bootstrap script that you extracted from the Pepperdata package from where you stored it to any location on the host; in this procedure’s steps, we’ve copied it to /tmp.
- For Amazon EMR clusters:
  
  aws s3 cp s3://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
- For Google Dataproc clusters:
  
  sudo gsutil cp gs://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
Load the revised configuration by running the Pepperdata bootstrap script.
- For EMR clusters:
  - You can use the long-option forms of the arguments as shown (--bucket, --upload-realm, and --is-running) or their short-option equivalents (-b, -u, and -r).
  - For Supervisor versions earlier than 7.0.13, the --is-running (-r) option is required when bootstrapping an already-running host.
  - Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts.
    
    Include the --proxy-address argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses the https protocol.
  - If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use the http or https protocol.)
  - If you are using a script from an earlier Supervisor version that has the --cluster (-c) argument instead of the --upload-realm (-u) argument, which was introduced in Supervisor v6.5, you can continue using the script and its old arguments. They are backward compatible.
  - Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata.
    
    Specify either or both of the following options among the bootstrap script’s optional arguments. Be sure to substitute your values for the <my-retries> and <my-timeout> placeholders that are shown in the command.
    - --max-retry-attempts (default=10): Maximum number of retry attempts to make after the initial describe-cluster call.
    - --max-timeout (default=60): Maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 to the calculated timeout (inclusive), which introduces the desired jitter.
```
# For Supervisor versions before 7.0.13:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> --is-running [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
   
# For Supervisor versions 7.0.13 and later:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
```
- For Dataproc clusters:
  
  sudo bash /tmp/bootstrap <bucket-name> <realm-name>
The script finishes with a Pepperdata installation succeeded message.
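Because the EMR command differs only in whether --is-running is present, the choice can be scripted once you know your Supervisor version. The following is a rough sketch under several assumptions: you supply the version string yourself (this page does not say how to look it up), `sort -V` is available (it is in GNU coreutils), and the helper names `needs_is_running` and `bootstrap_args` are hypothetical. It only prints the argument list; it does not run the bootstrap script.

```shell
#!/usr/bin/env bash
# Sketch only: prints arguments, never invokes /tmp/bootstrap.

# needs_is_running VERSION  (hypothetical helper)
# Succeeds (exit status 0) when VERSION is earlier than 7.0.13.
needs_is_running() {
  local threshold="7.0.13"
  [ "$1" != "$threshold" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$threshold" | sort -V | head -n 1)" = "$1" ]
}

# bootstrap_args VERSION BUCKET REALM  (hypothetical helper)
# Prints the required EMR arguments, adding --is-running only for older versions.
bootstrap_args() {
  if needs_is_running "$1"; then
    echo "--bucket $2 --upload-realm $3 --is-running"
  else
    echo "--bucket $2 --upload-realm $3"
  fi
}

# The version numbers here are example inputs; the bucket and realm are placeholders.
bootstrap_args 7.0.12 '<bucket-name>' '<realm-name>'
bootstrap_args 7.0.13 '<bucket-name>' '<realm-name>'
```

The first call prints an argument list that ends in --is-running and the second omits it, matching the two command forms shown for EMR clusters.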