Configure Pepperdata Logs Retention and Disk Usage (Cloud)

By default, Pepperdata sets a disk usage cap (PD_MAX_LOG_DIR_SIZE) of 5 GB as the maximum size for its accumulated metrics and message log files. So long as this cap is not reached, Pepperdata retains log files for seven (7) days (PD_MAX_LOG_AGE_DAYS) before deleting them, and the Pepperdata Collector (the pepcollectd daemon) uploads data that is up to seven (7) days old (PD_LOG_PROC_MAX_AGE_DAYS). When the disk usage cap is reached, Pepperdata deletes enough log files, starting with the oldest ones, to reduce the disk usage to less than the cap.
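The way the age limit and the size cap interact can be illustrated with a small simulation. The sketch below only models the policy as it is described above; it is not Pepperdata's implementation. The function name plan_deletions and the "age size name" record format are hypothetical and exist only for this illustration.

```shell
# Illustrative model only: prints the names of the log files that the
# described policy would remove, oldest first.
# Usage: plan_deletions CAP_BYTES MAX_AGE_DAYS < records
# Each input record is "age_in_days size_in_bytes file_name" (hypothetical format).
plan_deletions() {
  sort -k1,1nr | awk -v cap="$1" -v max_age="$2" '
    { age[NR] = $1 + 0; size[NR] = $2 + 0; name[NR] = $3; total += $2 }
    END {
      for (i = 1; i <= NR; i++) {
        # Remove a file if it is past the age limit, or if the total size
        # still reaches the cap after the removals made so far.
        if (age[i] > max_age + 0 || total >= cap + 0) {
          print name[i]
          total -= size[i]
        }
      }
    }'
}

# Example: 800 bytes of logs, a 600-byte cap, and a 7-day age limit.
printf '%s\n' '1 400 c.log' '9 100 a.log' '3 300 b.log' | plan_deletions 600 7
# prints a.log (past the age limit), then b.log (needed to get below the cap)
```

In this example the newest file, c.log, survives because the total drops to 400 bytes, which is below the cap, once the two older files are gone.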

The age caps limit how long log files remain eligible for uploading and when they become ready for deletion. They can be important for business requirements, such as retaining sensitive files for a given amount of time or for custom processing. For controlling disk usage, however, the PD_MAX_LOG_DIR_SIZE size cap is the appropriate setting to focus on.
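Before changing the size cap, it can help to see how much of it a host currently uses. The following is a minimal sketch that assumes the standard du utility is available; the helper names percent_of_cap and dir_usage_bytes are hypothetical, and the fallback values are the defaults stated on this page.

```shell
# Percentage of the cap that a given usage represents, to one decimal place.
# Usage: percent_of_cap USED_BYTES CAP_BYTES
percent_of_cap() {
  awk -v used="$1" -v cap="$2" 'BEGIN { printf "%.1f\n", 100 * used / cap }'
}

# Current size of a directory in bytes, from du (which reports 1 KiB units with -k).
# Usage: dir_usage_bytes DIRECTORY
dir_usage_bytes() {
  du -sk "$1" | awk '{ printf "%.0f\n", $1 * 1024 }'
}

# Report usage for the log directory, falling back to the documented defaults
# when PD_LOG_DIR or PD_MAX_LOG_DIR_SIZE is not set in this shell.
log_dir=${PD_LOG_DIR:-/var/log/pepperdata}
cap=${PD_MAX_LOG_DIR_SIZE:-5368709120}
if [ -d "$log_dir" ]; then
  used=$(dir_usage_bytes "$log_dir")
  echo "$log_dir uses $used of $cap bytes ($(percent_of_cap "$used" "$cap")%)"
fi
```

The figure from du is an approximation of what counts toward the cap; how Pepperdata itself measures the directory is not specified here.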

To override the default disk usage cap and/or log retention policies, you can add any of the following environment variables to the Pepperdata configuration. For RPM/DEB-based installations, add the environment variables to the Pepperdata configuration file, /etc/pepperdata/pepperdata-config.sh. For Parcel for Cloudera/Cloudera Manager-based installations/management, add the environment variables to the appropriate Cloudera Manager template. See the procedure for details.

- PD_LOG_DIR: (default=/var/log/pepperdata) Directory to which Pepperdata writes its log files.
- PD_MAX_LOG_DIR_SIZE: (default=5368709120, which is 5 GB) Size cap (maximum total size), in bytes, of all the log files in the directory specified by the PD_LOG_DIR environment variable.

  When the PepAgent (the pepagentd daemon) starts, it verifies that there is sufficient capacity on the partition where PD_LOG_DIR is located. If the capacity is less than PD_MAX_LOG_DIR_SIZE, the PepAgent will not start.
- PD_MAX_LOG_AGE_DAYS: (default=the value of PD_LOG_PROC_MAX_AGE_DAYS) Number of days a log file is retained before Pepperdata deletes it.
- PD_LOG_PROC_MAX_AGE_DAYS: (default=7) Maximum age, in days, of a log file that the Pepperdata Collector (the pepcollectd daemon) will upload to the Pepperdata dashboard.

  If you lose connectivity to Pepperdata for longer than the PD_LOG_PROC_MAX_AGE_DAYS value, pepcollectd will be unable to upload the log file before it exceeds PD_LOG_PROC_MAX_AGE_DAYS, and the log file’s data will be lost.
- PD_ARCHIVE_DIR: (no default) Directory in which to archive old log files instead of deleting them when they exceed the maximum age (the PD_LOG_PROC_MAX_AGE_DAYS environment variable value). Not applicable unless the PD_CLEAN_LOG_DIR environment variable is enabled (its value set to 1).
- PD_CLEAN_LOG_DIR: (default=1/enabled) Whether Pepperdata cleans (deletes or archives) its log files.
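As a rough illustration of how these variables fit together, the sketch below shows overrides that could be added to pepperdata-config.sh, followed by a pre-check that mirrors the PepAgent startup condition. Every value is a hypothetical example rather than a recommendation, /mnt/pepperdata-archive is an invented path, and the helper names are not part of Pepperdata. The pre-check assumes that "capacity" means the total size of the partition as reported by df; if the PepAgent actually tests free space, compare against the "Available" column instead.

```shell
# Example overrides (hypothetical values).
export PD_LOG_DIR=/var/log/pepperdata
export PD_MAX_LOG_DIR_SIZE=10737418240     # 10 * 1024 * 1024 * 1024 bytes
export PD_LOG_PROC_MAX_AGE_DAYS=14         # upload window, in days
export PD_MAX_LOG_AGE_DAYS=14              # retention, in days
export PD_CLEAN_LOG_DIR=1                  # cleaning enabled
export PD_ARCHIVE_DIR=/mnt/pepperdata-archive   # hypothetical archive path

# Total size, in bytes, of the partition that holds a directory.
# Uses POSIX df output (-P) in 1 KiB units (-k).
partition_capacity_bytes() {
  df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $2 * 1024 }'
}

# Succeeds when the partition holding DIRECTORY is at least CAP_BYTES in size.
# Usage: cap_fits_partition DIRECTORY CAP_BYTES
cap_fits_partition() {
  capacity=$(partition_capacity_bytes "$1")
  if [ -z "$capacity" ]; then
    return 2   # df produced no usable output
  fi
  awk -v c="$capacity" -v cap="$2" 'BEGIN { exit !(c >= cap) }'
}

# Check the configured cap on a host where the log directory already exists.
if [ -d "$PD_LOG_DIR" ] && ! cap_fits_partition "$PD_LOG_DIR" "$PD_MAX_LOG_DIR_SIZE"; then
  echo "Partition for $PD_LOG_DIR is smaller than PD_MAX_LOG_DIR_SIZE" >&2
fi
```

Setting PD_MAX_LOG_AGE_DAYS explicitly is optional in this example, because it defaults to the value of PD_LOG_PROC_MAX_AGE_DAYS.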

Procedure

1. In your cloud environment (such as GDP or AWS), reconfigure the log retention and disk usage.
   1. From the environment’s cluster configuration folder (in the cloud), download the Pepperdata configuration file, /etc/pepperdata/pepperdata-config.sh, to a location where you can edit it.
   2. Open the file for editing, and add any of the disk usage environment variables, in the following format. Be sure to replace THE-VARIABLE-NAME and the-variable-value with the actual environment variable’s name and value.
      ```
      export THE-VARIABLE-NAME=the-variable-value
      ```
   3. Save your changes and close the file.
   4. Upload the revised file to overwrite the original pepperdata-config.sh file.

   If there are no already-running hosts with Pepperdata, skip steps 2–5 of this procedure.
2. Open a command shell (terminal session) and log in to any already-running host as a user with sudo privileges.

   Important: You can begin with any host on which Pepperdata is running, but be sure to repeat all three actions on every already-running host: logging in (this step), copying the bootstrap script (next step), and loading the revised Pepperdata configuration (the following step).
3. From the command line, copy the Pepperdata bootstrap script (the one that you extracted from the Pepperdata package) from its local location to any location. The commands in this procedure use /tmp.
- For Amazon EMR clusters:
  
  aws s3 cp s3://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
- For Google Dataproc clusters:
  
  sudo gsutil cp gs://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
4. Load the revised configuration by running the Pepperdata bootstrap script.
- For EMR clusters:
  - You can use the --long-options form of the --bucket, --upload-realm, and --is-running arguments as shown or their -short-option equivalents, -b, -u, and -r.
  - The --is-running (-r) option is required for bootstrapping an already-running host prior to Supervisor version 7.0.13.
  - Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts.
    
    Include the --proxy-address (or --emr-proxy-address for Supervisor version 8.0.24 or later) argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses https protocol.
  - If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use http or https protocol.)
  - If you are using a script from an earlier Supervisor version that has the --cluster or -c arguments instead of the --upload-realm or -u arguments (which were introduced in Supervisor v6.5), respectively, you can continue using the script and its old arguments. They are backward compatible.
  - Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata.
    
    Specify either or both of the following options in the bootstrap’s Optional arguments. Be sure to substitute your values for the <my-retries> and <my-timeout> placeholders that are shown in the command.
    - --max-retry-attempts: (default=10) Maximum number of retry attempts to make after the initial describe-cluster call.
    - --max-timeout: (default=60) Maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 through the calculated timeout (inclusive), which introduces the desired jitter.
```
# For Supervisor versions before 7.0.13:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> --is-running [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
   
# For Supervisor versions 7.0.13 to 8.0.23:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
   
# For Supervisor versions 8.0.24 and later:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--emr-proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
```
- For Dataproc clusters:
  
  sudo bash /tmp/bootstrap <bucket-name> <realm-name>
The script finishes with a Pepperdata installation succeeded message.
5. Repeat steps 2–4 on every already-running host in your cluster.
6. (Only for the PD_LOG_DIR environment variable) In your cloud environment (such as GDP or AWS), add the associated property to the Pepperdata configuration.
   1. From the environment’s cluster configuration folder (in the cloud), download the Pepperdata site file, /etc/pepperdata/pepperdata-site.xml, to a location where you can edit it.
   2. Open the file for editing, and add the pepperdata.log.baseDir property.

      Be sure that you set the PD_LOG_DIR environment variable and the pepperdata.log.baseDir property to the same location. If the locations do not match, not all metrics are sent to Pepperdata, and not all metric log files are deleted or archived.
      ```
      <property>
        <name>pepperdata.log.baseDir</name>
        <value>your/pepperdata/log/dir</value>
      </property>
      ```
      Malformed XML files can cause operational errors that can be difficult to debug. To prevent such errors, we recommend that you use a linter, such as xmllint, after you edit any .xml configuration file.
   3. Save your changes and close the file.
   4. Upload the revised file to overwrite the original pepperdata-site.xml file.

   If there are no already-running hosts with Pepperdata, you are done with this procedure. Do not perform the remaining steps.
7. Open a command shell (terminal session) and log in to any already-running host as a user with sudo privileges.

   Important: You can begin with any host on which Pepperdata is running, but be sure to repeat both actions on every already-running host: logging in (this step) and loading the revised Pepperdata configuration (next step).
8. Load the revised configuration by running the Pepperdata bootstrap script.
- For EMR clusters:
  - You can use the --long-options form of the --bucket, --upload-realm, and --is-running arguments as shown or their -short-option equivalents, -b, -u, and -r.
  - The --is-running (-r) option is required for bootstrapping an already-running host prior to Supervisor version 7.0.13.
  - Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts.
    
    Include the --proxy-address (or --emr-proxy-address for Supervisor version 8.0.24 or later) argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses https protocol.
  - If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use http or https protocol.)
  - If you are using a script from an earlier Supervisor version that has the --cluster or -c arguments instead of the --upload-realm or -u arguments (which were introduced in Supervisor v6.5), respectively, you can continue using the script and its old arguments. They are backward compatible.
  - Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata.
    
    Specify either or both of the following options in the bootstrap’s Optional arguments. Be sure to substitute your values for the <my-retries> and <my-timeout> placeholders that are shown in the command.
    - --max-retry-attempts: (default=10) Maximum number of retry attempts to make after the initial describe-cluster call.
    - --max-timeout: (default=60) Maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 through the calculated timeout (inclusive), which introduces the desired jitter.
```
# For Supervisor versions before 7.0.13:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> --is-running [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
   
# For Supervisor versions 7.0.13 to 8.0.23:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
   
# For Supervisor versions 8.0.24 and later:
sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--emr-proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
```
- For Dataproc clusters:
  
  sudo bash /tmp/bootstrap <bucket-name> <realm-name>
The script finishes with a Pepperdata installation succeeded message.
9. Repeat steps 7–8 on every already-running host in your cluster.
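The retry options for describe-cluster that are mentioned in the bootstrap steps describe exponential backoff with jitter. The sketch below shows one common form of that technique so that the effect of the two options is easier to picture. It is not the bootstrap script's actual code: the function names are hypothetical, and the rule that the calculated timeout doubles with each retry (2, 4, 8, ... seconds) is an assumption, because this page does not state how the timeout is calculated.

```shell
# Illustrative sketch only; not the Pepperdata bootstrap's actual code.

# Upper bound, in seconds, for the wait before retry number ATTEMPT:
# 2^ATTEMPT (an assumed growth rule), limited to MAX_TIMEOUT.
# Usage: backoff_ceiling ATTEMPT MAX_TIMEOUT
backoff_ceiling() {
  ceiling=1
  i=0
  while [ "$i" -lt "$1" ] && [ "$ceiling" -lt "$2" ]; do
    ceiling=$((ceiling * 2))
    i=$((i + 1))
  done
  if [ "$ceiling" -gt "$2" ]; then
    ceiling=$2
  fi
  echo "$ceiling"
}

# Random whole number of seconds from 1 through CEILING (inclusive): the jitter.
# Usage: jittered_wait CEILING
jittered_wait() {
  awk -v n="$1" -v seed="$(( $(date +%s) + $$ ))" \
    'BEGIN { srand(seed); print 1 + int(rand() * n) }'
}

# Run a command; if it fails, retry it up to MAX_RETRIES more times,
# sleeping for a jittered, capped, exponentially growing interval between tries.
# Usage: retry_with_backoff MAX_RETRIES MAX_TIMEOUT command [args...]
retry_with_backoff() {
  max_retries=$1
  max_timeout=$2
  shift 2
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -gt "$max_retries" ]; then
      return 1
    fi
    sleep "$(jittered_wait "$(backoff_ceiling "$attempt" "$max_timeout")")"
  done
}
```

With the documented defaults, a call such as retry_with_backoff 10 60 followed by the command to run would make one initial attempt and at most ten retries, and no single wait would exceed 60 seconds. Randomizing each wait spreads out retries from hosts that start at the same time, which is the usual reason for adding jitter.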