Dataproc: Install Pepperdata

Supported versions: See the Google Dataproc entries for Pepperdata 7.1.x in the table of Supported Platforms by Pepperdata Version.

To activate Pepperdata (that is, to inject the necessary instrumentation) on an existing, running cluster, activate Pepperdata on the already-running hosts and add the activation commands to the cluster’s existing bootstrap configuration. To activate Pepperdata on a new Dataproc cluster, add the Pepperdata bootstrap script to the cluster’s configuration.

Create New Cluster with Pepperdata

This procedure configures Pepperdata activation for a cluster that does not exist yet, so that Pepperdata is activated on its hosts as they are created. You can use either the Cloud Console or the command line.

Assumptions

This procedure assumes that you will not need to leverage any custom cluster management functions, such as certificate management. If such additional (non-Pepperdata) functions are needed, you should create a “helper bootstrap” script to invoke those functions and call the Pepperdata bootstrap script. In this case, upload the helper bootstrap script to the cluster configuration folder, and use its location and filename for the Advanced options > Add initialization action in the procedure.
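A helper bootstrap of the kind described above might look like the following minimal sketch. It is not a Pepperdata-supplied script: the function names, the `DRY_RUN` switch, and the variable names are hypothetical, and it assumes the Pepperdata bootstrap script can be fetched with `gsutil` and run with the bucket and realm as arguments, as in the existing-cluster procedure on this page.

```shell
#!/bin/sh
# Hypothetical helper bootstrap: run site-specific setup first, then call the
# Pepperdata bootstrap script. All names and defaults below are placeholders.

PD_BOOTSTRAP_URI="${PD_BOOTSTRAP_URI:-gs://<my-bucket>/install-packages/<base-dir>/dataproc/bootstrap}"
PD_BUCKET="${PD_BUCKET:-<my-bucket>}"
PD_REALM="${PD_REALM:-<my-cluster>}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Placeholder for non-Pepperdata functions such as certificate management.
custom_cluster_setup() {
    run echo "custom cluster setup goes here"
}

# Fetch the Pepperdata bootstrap script and run it with bucket and realm.
run_pepperdata_bootstrap() {
    run gsutil cp "$PD_BOOTSTRAP_URI" /tmp/bootstrap || return 1
    run bash /tmp/bootstrap "$PD_BUCKET" "$PD_REALM"
}

custom_cluster_setup && run_pepperdata_bootstrap
```

With the default `DRY_RUN=1` the script only prints what it would do, which makes it easy to review. Change the default to `0` in the copy you upload as the initialization action.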

Procedure: Cloud Console

In your GDP environment, click Create Cluster to create the cluster.
Under Advanced options, click Add initialization action.
Browse to locate the bootstrap script, which should be located at gs://<my-bucket>/install-packages/<base-dir>/dataproc/bootstrap.

The <base-dir> corresponds to supervisor-X.Y.Z-<distribution>, where <distribution> is the final part of the package name, without the file type; for example, supervisor-X.Y.Z-H26_YARN2_A.

Click Add metadata, and enter the required information.

SwapSpace is an integer representing the amount (in gigabytes) of swap space to create. Provide a value of zero if you do not wish to create swap space.

To remove swap space after the installation, ssh to every node in the cluster and do the following:

sudo swapoff -v /swapfile to disable the swap file.
sudo sed -i.bak '/^\/swapfile/d' /etc/fstab to remove its entry from the /etc/fstab file.
sudo rm /swapfile to delete the swap file.

Substitute your bucket name and cluster name for <my-bucket> and <my-cluster>.

Key	Value
SwapSpace	Size of swap space to create, in gigabytes (integer; 0 for none)
PepperdataBucket	`<my-bucket>`
Realm	`<my-cluster>`

Click Create to finish.

The cluster is created, the Pepperdata software is installed, and the Pepperdata services are automatically started.
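If you later need to remove the swap space from every node, the three swap-removal commands in the procedure above can be wrapped in a loop. This is only a sketch under stated assumptions: the node names, the zone, and the use of `gcloud compute ssh` to reach each node are illustrative, and `sudo` is applied to every step on the assumption that the login user is not root.

```shell
#!/bin/sh
# Sketch: print (or run) the swap-removal steps on each cluster node.
# NODES and ZONE are hypothetical; replace them with your own values.
NODES="${NODES:-my-cluster-m my-cluster-w-0 my-cluster-w-1}"
ZONE="${ZONE:-us-central1-a}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print only; 0 = execute over ssh

# The three steps, chained with && so that a failure stops the sequence.
REMOVE_SWAP_CMD="sudo swapoff -v /swapfile && sudo sed -i.bak '/^\\/swapfile/d' /etc/fstab && sudo rm /swapfile"

remove_swap_on_nodes() {
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            printf 'gcloud compute ssh %s --zone=%s --command="%s"\n' \
                "$node" "$ZONE" "$REMOVE_SWAP_CMD"
        else
            gcloud compute ssh "$node" --zone="$ZONE" \
                --command="$REMOVE_SWAP_CMD" || return 1
        fi
    done
}

remove_swap_on_nodes
```

Run it once with the default `DRY_RUN=1` to check the generated commands, then set `DRY_RUN=0` to execute them.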

Procedure: Command Line

If you prefer to create the cluster from the command line (in the Cloud Shell or a terminal window), or if you use scripts or another automation framework, be sure to specify the initialization action for the Pepperdata bootstrap script and the required metadata in your clusters create command.

For example:

gcloud dataproc clusters create <my-cluster> \
    --region=${REGION} \
    --initialization-actions=gs://<my-bucket>/install-packages/<base-dir>/dataproc/bootstrap \
    --metadata=PepperdataBucket=<my-bucket>,Realm=<my-cluster>
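
For scripted cluster creation, the command above can be assembled from a few inputs. The sketch below derives `<base-dir>` from the package filename, validates the swap size, and prints the resulting command for review. The function names are hypothetical, the `.tgz` and `.tar.gz` suffixes are assumptions (this page does not state the package file type), and it assumes `SwapSpace` is passed alongside the other metadata keys.

```shell
#!/bin/sh
# Sketch: assemble a clusters create command from a package filename.
# Function names and the .tgz/.tar.gz suffixes are assumptions.

# base_dir_from_package PACKAGE_FILENAME
# Strips any directory and the file type to get <base-dir>.
base_dir_from_package() {
    name=${1##*/}
    name=${name%.tar.gz}
    name=${name%.tgz}
    printf '%s\n' "$name"
}

# is_valid_swap VALUE: succeeds only for a non-negative integer (0 = no swap).
is_valid_swap() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# create_cluster_cmd CLUSTER REGION BUCKET PACKAGE_FILENAME SWAP_GB
# Prints the command instead of running it. Realm is set to the cluster name.
create_cluster_cmd() {
    if ! is_valid_swap "$5"; then
        echo "SwapSpace must be a non-negative integer" >&2
        return 1
    fi
    base_dir=$(base_dir_from_package "$4")
    printf 'gcloud dataproc clusters create %s \\\n' "$1"
    printf '    --region=%s \\\n' "$2"
    printf '    --initialization-actions=gs://%s/install-packages/%s/dataproc/bootstrap \\\n' "$3" "$base_dir"
    printf '    --metadata=SwapSpace=%s,PepperdataBucket=%s,Realm=%s\n' "$5" "$3" "$1"
}

# Example with placeholder values.
create_cluster_cmd "my-cluster" "us-central1" "my-bucket" "supervisor-X.Y.Z-H26_YARN2_A.tgz" 0
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. Pipe the output to `sh` only after you have checked it.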

Add Pepperdata to Existing/Running Cluster

This procedure configures Pepperdata activation on an existing/running cluster. It covers both the already-running hosts and hosts that will be created in the future.

Prerequisites

Every currently-running host in the cluster must already have an initialization (bootstrap) script.

If there is no initialization script, you must destroy the cluster and re-create it so that every host has an initialization script. The script can be empty or you can follow the procedure for activating Pepperdata on a new cluster.
Be sure that you’ve installed Pepperdata on every already-running host.
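
One way to check the first prerequisite is to ask Dataproc which initialization scripts the cluster was created with. The following is a sketch, not an official check: the function names are hypothetical, and the `--format` projection assumes the initialization scripts appear under `config.initializationActions[].executableFile` in the cluster description.

```shell
#!/bin/sh
# Sketch: decide whether a cluster already has an initialization script.

# list_init_actions CLUSTER REGION
# Requires gcloud and a real cluster; prints the configured script locations.
list_init_actions() {
    gcloud dataproc clusters describe "$1" --region="$2" \
        --format='value(config.initializationActions[].executableFile)'
}

# has_init_action TEXT: succeeds if TEXT names at least one script.
has_init_action() {
    [ -n "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

# Demonstration with a literal value, so no cluster is needed here.
if has_init_action "gs://my-bucket/my-init.sh"; then
    echo "initialization script found"
fi

# Real usage (not executed in this sketch):
#   has_init_action "$(list_init_actions my-cluster us-central1)" ||
#       echo "no initialization script: re-create the cluster first"
```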

Procedure

Although you can perform the procedure steps manually on every already-running host, if you have many hosts we recommend that you use an existing automation framework or create a shell script with the required commands.
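
As a starting point for such a script, the sketch below wraps the two per-host activation commands from this procedure in a function and checks the captured output for the success message. The function names and the log file argument are hypothetical, and the exact wording held in `SUCCESS_TEXT` is an assumption based on the message described in this procedure.

```shell
#!/bin/sh
# Sketch: activate Pepperdata on one host and confirm that it succeeded.
# SUCCESS_TEXT is an assumed wording of the script's final message.
SUCCESS_TEXT="${SUCCESS_TEXT:-Pepperdata installation succeeded}"

# activation_succeeded LOGFILE: succeeds if the log contains SUCCESS_TEXT.
activation_succeeded() {
    grep -qF -- "$SUCCESS_TEXT" "$1"
}

# activate_host SCRIPT_URI BUCKET REALM LOGFILE
# Needs gsutil and sudo on the host, so it is defined but not called here.
activate_host() {
    sudo gsutil cp "$1" /tmp/bootstrap || return 1
    sudo bash /tmp/bootstrap "$2" "$3" 2>&1 | tee "$4"
    activation_succeeded "$4"
}

# Real usage on a host (placeholders as used elsewhere on this page):
#   activate_host gs://<pd-bootstrap-script-from-install-packages> \
#       <bucket-name> <realm-name> /tmp/pd-activate.log
```

Checking the log rather than the exit status alone is a deliberate choice here, because the pipeline through `tee` would otherwise hide a failure of the bootstrap script.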

Beginning with any already-running host, log in to the host.
Activate Pepperdata on the already-running host.
1. From the command line, copy the Pepperdata bootstrap script (which you extracted from the Pepperdata package) from its storage location to a location on the host.
  
  For example:
  
  sudo gsutil cp gs://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
2. Run the Pepperdata bootstrap script.
  
  sudo bash /tmp/bootstrap <bucket-name> <realm-name>
  
  The script finishes with a Pepperdata installation succeeded message.
3. Repeat steps 1–2 on every already-running host.
Add bootstrap actions to activate Pepperdata on new hosts as they’re created.

Important: For existing/running clusters, do not try to change the configured pointer/name of the bootstrap script to the Pepperdata bootstrap script. Doing so will not result in activating Pepperdata on new hosts as they’re created.
1. Download a copy of your existing cluster bootstrap script (not the Pepperdata bootstrap script) to a location where you can edit it.
2. Open the script for editing.
3. Add the same activation commands to it that you previously ran for the already-running hosts.
```
sudo gsutil cp gs://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap
sudo bash /tmp/bootstrap <bucket-name> <realm-name>
```
4. Save your changes and close the file.
5. Upload the revised file to overwrite the original cluster bootstrap script.
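
The edit in steps 1 through 5 can also be scripted. The sketch below appends the two activation commands to a local copy of the cluster bootstrap script only if they are not already there, so it is safe to run more than once. The function name and the `gs://<path-to-existing-cluster-bootstrap>` placeholder are hypothetical.

```shell
#!/bin/sh
# Sketch: add the Pepperdata activation commands to a local copy of the
# existing cluster bootstrap script, skipping the edit if already applied.

# append_activation LOCAL_SCRIPT PD_SCRIPT_URI BUCKET REALM
append_activation() {
    if grep -qF -- "bash /tmp/bootstrap" "$1" 2>/dev/null; then
        echo "activation commands already present in $1"
        return 0
    fi
    cat >> "$1" <<EOF

# Activate Pepperdata on this host
sudo gsutil cp $2 /tmp/bootstrap
sudo bash /tmp/bootstrap $3 $4
EOF
}

# Typical flow (requires gsutil; not executed in this sketch):
#   gsutil cp gs://<path-to-existing-cluster-bootstrap> ./cluster-bootstrap.sh
#   append_activation ./cluster-bootstrap.sh \
#       gs://<pd-bootstrap-script-from-install-packages> <bucket-name> <realm-name>
#   gsutil cp ./cluster-bootstrap.sh gs://<path-to-existing-cluster-bootstrap>
```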