EMR: Install Pepperdata

Supported versions: See the Amazon Elastic MapReduce (EMR) entries for Pepperdata 8.0.x in the table of Supported Platforms by Pepperdata Version

To activate Pepperdata (that is, inject the necessary instrumentation) on an existing, running cluster, activate Pepperdata on the already-running hosts and add the activation commands to the cluster’s existing bootstrap configuration. To activate Pepperdata on a new EMR cluster, add the Pepperdata bootstrap script to the cluster’s configuration.

Create an Identity and Access Management (IAM) Role for Pepperdata Access

An IAM role is required to give Pepperdata the access it needs to operate in the EMR environment. You can create policies that grant access to all resources, or only to specific resources such as the clusters where a given feature is required. For details about IAM service roles and permissions, refer to the Customize IAM Roles page from Amazon.

If you want to use an existing role for the cluster in which you’re installing Pepperdata, you can, as long as it provides the necessary permissions for the Pepperdata bootstrap script’s actions; see Verify that IAM Role Permissions are Sufficient.

Procedure

  1. In your Amazon AWS environment, create a policy for S3 read-access. You can use any name for the policy; in this procedure, we’ll call it policy-pepperdata-s3-read.

    S3 read access enables the cluster’s bootstrap script to access and download the Pepperdata packages and configuration from the S3 bucket.
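
    For reference, the following is a minimal sketch of such a policy created with the AWS CLI; it is not official Pepperdata guidance, and it assumes <my-bucket> is the bucket that holds the Pepperdata packages. Your exact actions and resource scoping may differ.

    aws iam create-policy \
      --policy-name policy-pepperdata-s3-read \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": ["arn:aws:s3:::<my-bucket>", "arn:aws:s3:::<my-bucket>/*"]
        }]
      }'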

  2. (Ephemeral Cluster Name Support) Create a policy that grants read access to the ListBootstrapActions API action, which enables Pepperdata to filter overview pages, charts, and tables by the ephemeral cluster name.

    You can use any name for the policy; in this procedure, we’ll call it policy-pepperdata-ephemeral-cluster-name-read.
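
    A minimal sketch of this policy with the AWS CLI follows; scoping the Resource to specific cluster ARNs instead of "*" is an option if you want to limit the functionality to given clusters.

    aws iam create-policy \
      --policy-name policy-pepperdata-ephemeral-cluster-name-read \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": ["elasticmapreduce:ListBootstrapActions"],
          "Resource": "*"
        }]
      }'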

  3. Create a custom IAM role and assign the policies you just created: in this example, policy-pepperdata-s3-read and policy-pepperdata-ephemeral-cluster-name-read.

    You can use any name for the IAM role; in this procedure, we’ll call it PepperdataRole.
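
    A sketch of the role creation with the AWS CLI, assuming the two policies above and an <account-id> placeholder for your AWS account ID. Because the role is assigned to the cluster’s EC2 instances, its trust policy must allow EC2 to assume it, and the role must be wrapped in an instance profile:

    # Create the role with an EC2 trust policy
    aws iam create-role --role-name PepperdataRole \
      --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Principal": { "Service": "ec2.amazonaws.com" },
          "Action": "sts:AssumeRole"
        }]
      }'

    # Attach the policies created in steps 1 and 2
    aws iam attach-role-policy --role-name PepperdataRole \
      --policy-arn arn:aws:iam::<account-id>:policy/policy-pepperdata-s3-read
    aws iam attach-role-policy --role-name PepperdataRole \
      --policy-arn arn:aws:iam::<account-id>:policy/policy-pepperdata-ephemeral-cluster-name-read

    # Expose the role to EC2 as an instance profile
    aws iam create-instance-profile --instance-profile-name PepperdataRole
    aws iam add-role-to-instance-profile --instance-profile-name PepperdataRole \
      --role-name PepperdataRole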

Verify that IAM Role Permissions are Sufficient

Before you create a new cluster with Pepperdata or add Pepperdata to an existing cluster, use the Pepperdata check_IAM_role_permissions utility script to verify that the cluster’s IAM role (the role you plan to use for a new cluster, or the one already used by the existing cluster) provides the necessary permissions for the Pepperdata bootstrap script’s actions. If the utility script fails, check the output log file for errors, resolve them, and re-run the script. Repeat until the script succeeds and outputs a Success! message.

If you are creating a new cluster and will use an IAM role that you created specifically for Pepperdata access (Create an Identity and Access Management (IAM) Role for Pepperdata Access), you can skip this procedure: a role created as described there already has all the necessary permissions.

Prerequisites

Before you run the Pepperdata check_IAM_role_permissions utility script, identify the target cluster for the verification:

  • If you’ll be creating a new cluster in which to run Pepperdata, choose a target cluster that was created with the same role that you’ll be using for the Pepperdata-enabled cluster.

  • If you’ll be adding Pepperdata to an existing cluster, that cluster is the target cluster.

Procedure

  1. Download the check_IAM_role_permissions_aws.tgz tarball from the helpsite.

    Even if you previously downloaded the tarball, download a fresh copy; both the tarball and, correspondingly, this procedure may have changed.
  2. Extract the check_IAM_role_permissions script from the tarball.

    tar xvf check_IAM_role_permissions_aws.tgz

  3. Upload the check_IAM_role_permissions script to your Amazon AWS environment.

  4. Open a command shell (terminal session) and log in to the target cluster.

  5. Run the script.

    ./check_IAM_role_permissions

    For each policy that is tested, the script outputs its name and whether the test PASSED, was NOT TESTED (some policies cannot be tested in isolation, but are internally tested by other policy tests), or FAILED.

    The script then outputs the final result:

    • If all tests passed, the message is:

      STATUS: Success! This cluster has the necessary role permissions for installing Pepperdata.

      You are done with this procedure; do not perform the remaining steps.
    • If any test failed, the message is:

      STATUS: Failure. This cluster lacks the necessary role permissions for installing Pepperdata.

  6. (Only if the status is Failure) Correct the errors.

    1. For each failed policy test, add the missing permissions to an existing IAM role, or create a new IAM role for the necessary permissions.

      • If you’ll be creating a new cluster in which to run Pepperdata, you can use the IAM role of the target cluster. If you do not want to change that role’s permissions, you can create a new IAM role instead, but be sure to restart this procedure from the beginning so that you run the script on a target cluster that is created with the new IAM role.

      • If you’ll be adding Pepperdata to an existing cluster, you must add the permissions to the IAM role that was used when the cluster was created.

    2. Repeat step 5 (rerun the script).

Create New Cluster with Pepperdata

This procedure is for configuring Pepperdata activation for hosts that will be created in the future, in a cluster that will also be created in the future; that is, there is no existing or running cluster yet.

Assumptions

This procedure assumes that you will not need to leverage any custom cluster management functions, such as certificate management. If such additional (non-Pepperdata) functions are needed, create a “helper bootstrap” script that invokes those functions and then calls the Pepperdata bootstrap script, as sketched below. In this case, upload the helper bootstrap script to the cluster configuration folder, and use its location and filename for the Script location field in the procedure.
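
A minimal sketch of such a helper bootstrap script, assuming install_certificates is a hypothetical placeholder for your own custom function and that the Pepperdata bootstrap lives at the standard install-packages path:

    #!/bin/bash
    set -euo pipefail

    install_certificates() {
        # ... your custom (non-Pepperdata) certificate-management commands ...
        :
    }

    # Run the custom functions first
    install_certificates

    # Then hand off to the Pepperdata bootstrap script
    aws s3 cp s3://<my-bucket>/install-packages/<base-dir>/emr/bootstrap /tmp/bootstrap
    sudo bash /tmp/bootstrap --bucket <my-bucket> --upload-realm <my-realm>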

Prerequisites

  • Ensure that Security-Enhanced Linux (SELinux) is disabled. By default, EMR disables SELinux. But if it’s been enabled, you must disable it before activating Pepperdata.

  • Ensure that the IAM role that you want to use for the cluster provides the necessary permissions for the Pepperdata bootstrap script’s actions.

    Either create a role that has the required permissions (see Create an Identity and Access Management (IAM) Role for Pepperdata Access), or verify that an existing role’s permissions are sufficient (see Verify that IAM Role Permissions are Sufficient).

Procedure

  1. In your Amazon AWS environment, launch the Create cluster wizard, and enter the cluster name.

  2. (Autoscaling Optimization, EMR-managed scaling) If you’re configuring autoscaling optimization for the cluster, and the cluster uses EMR-managed scaling (not the custom automatic scaling policy in EMR), create the autoscaling policy now.

  3. Under Additional Options, add a Bootstrap Action, and select Custom Action.

  4. Enter the required information.

    • Be sure to substitute the name of your base directory, bucket name, and realm name for the <base-dir>, <my-bucket>, and <my-realm> placeholders, respectively, that are shown in the table.

    • You can use the long-option form of the --bucket and --upload-realm arguments, as shown in the table, or their short-option equivalents, -b and -u, respectively.

    • Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts.

      Include the --proxy-address (or --emr-proxy-address for Supervisor version 8.0.24 or later) argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses https protocol.

    • If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use http or https protocol.)

    • If you are using a script from an earlier Supervisor version that has the --cluster (-c) argument instead of the --upload-realm (-u) argument (introduced in Supervisor v6.5), you can continue using the script and its old arguments; they are backward compatible.

    • Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata. (A sketch of this retry behavior appears after the field information below.)

      Specify either or both of the following options in the bootstrap’s Optional arguments. Be sure to substitute your values for the <my-retries> and <my-timeout> placeholders that are shown in the table.

      • --max-retry-attempts—(default=10) Maximum number of retry attempts to make after the initial describe-cluster call.

      • --max-timeout—(default=60) Maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 through the calculated timeout (inclusive), which introduces the desired jitter.

    • Do not include the --is-running (-r) option in the bootstrap’s Optional arguments. This option is only for bootstrapping an already-running host prior to Supervisor version 7.0.13, not for bootstrapping new hosts as they’re created.

    Field information:

    • Name: Pepperdata APM

    • Script location: s3://<my-bucket>/install-packages/<base-dir>/emr/bootstrap

      The <base-dir> corresponds to supervisor-X.Y.Z-<distribution>, where <distribution> is the final part of the package name, without the file type; for example, supervisor-X.Y.Z-H26_YARN2_A.

    • Optional arguments: --bucket <my-bucket> --upload-realm <my-realm> [--emr-proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
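
    The following is an illustrative sketch of the retry behavior described above, not the actual Pepperdata implementation. It assumes a CLUSTER_ID variable holding the cluster’s ID: the wait before each retry grows exponentially, is capped at the maximum timeout, and the actual sleep is drawn at random from 1 through that timeout.

      MAX_RETRY_ATTEMPTS=10   # --max-retry-attempts
      MAX_TIMEOUT=60          # --max-timeout, in seconds
      attempt=0
      until aws emr describe-cluster --cluster-id "$CLUSTER_ID" > /tmp/cluster.json 2>/dev/null; do
          attempt=$((attempt + 1))
          if [ "$attempt" -gt "$MAX_RETRY_ATTEMPTS" ]; then
              echo "describe-cluster failed after $MAX_RETRY_ATTEMPTS retries" >&2
              exit 1
          fi
          timeout=$((2 ** attempt))                       # exponential backoff
          [ "$timeout" -gt "$MAX_TIMEOUT" ] && timeout=$MAX_TIMEOUT
          sleep $(( (RANDOM % timeout) + 1 ))             # jitter: 1..timeout inclusive
      done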
  5. Click Add to add the bootstrap script.

  6. Configure the cluster as you want.

    Be sure to assign an IAM role that has all the necessary permissions for Pepperdata to the EC2 instance profile.

    Steps

    1. Navigate to the Step 4: Security > Permissions section, and select Custom.

    2. In the EC2 instance profile list, select the IAM role that you want.

  7. Click Create Cluster.

    The cluster is created, the Pepperdata software is installed, and the Pepperdata services are automatically started.

Add Pepperdata to Existing/Running Cluster

This procedure is for configuring Pepperdata activation on hosts in an existing/running cluster: already-running hosts and hosts that will be created in the future.

Prerequisites

  • Every currently-running host in the cluster must already have an initialization (bootstrap) script.

    If there is no initialization script, you must destroy the cluster and re-create it so that every host has an initialization script; the script can be empty. Alternatively, you can follow the procedure for activating Pepperdata on a new cluster.

  • Be sure that you’ve installed Pepperdata on every already-running host.

  • Verify that the cluster’s IAM roles provide the necessary permissions for the Pepperdata bootstrap script’s actions; see Verify that IAM Role Permissions are Sufficient.

Procedure

Although you can manually perform the procedure steps on every already-running host, to save time, especially if you have many hosts, we recommend that you use an existing automation framework or create a shell script with the required commands.

  1. Beginning with any already-running host, log in to the host.

  2. Activate Pepperdata on the already-running host.

    1. From the command line, copy the Pepperdata bootstrap script, extracted from the Pepperdata package, from its location in your Amazon S3 environment to a local location on the host.

      For example:

      aws s3 cp s3://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap

    2. Run the Pepperdata bootstrap script.

      • You can use the long form of the --bucket and --upload-realm arguments as shown or their short equivalents, -b and -u.

      • Optionally, you can create swap space by using the --swap-space argument followed by an integer representing the amount of swap space in gigabytes (G). If you specify zero, the script does not create swap space.

        To remove swap space after the installation, ssh to every node in the cluster and do the following:

        1. sudo swapoff -v /swapfile to disable the swap file.

        2. sudo sed -i.bak '/^\/swapfile/d' /etc/fstab to remove its entry from the /etc/fstab file.

        3. sudo rm /swapfile to delete the swap file.

      • Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts.

        Include the --proxy-address (or --emr-proxy-address for Supervisor version 8.0.24 or later) argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses https protocol.

      • If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use http or https protocol.)

      • If you are using a script from an earlier Supervisor version that has the --cluster (-c) argument instead of the --upload-realm (-u) argument (introduced in Supervisor v6.5), you can continue using the script and its old arguments; they are backward compatible.

      • Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata.

        Specify either or both of the following options in the bootstrap command. Be sure to substitute your values for the <my-retries> and <my-timeout> placeholders that are shown in the command.

        • --max-retry-attempts—(default=10) Maximum number of retry attempts to make after the initial describe-cluster call.

        • --max-timeout—(default=60) Maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 through the calculated timeout (inclusive), which introduces the desired jitter.

      For Supervisor version 8.0.24 or later:

      sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--emr-proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
      

      For Supervisor version 7.0.13 to 8.0.23:

      sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
      

      For Supervisor versions prior to 7.0.13:

      sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name> --is-running [--proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
      

      The script finishes with a Pepperdata installation succeeded message.

    3. Repeat steps a–b on every already-running host.
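
      For example, a sketch that automates steps a–b across hosts, assuming a hosts.txt file with one hostname per line and passwordless ssh access to each host:

      while read -r host; do
          ssh "$host" 'aws s3 cp s3://<pd-bootstrap-script-from-install-packages> /tmp/bootstrap &&
              sudo bash /tmp/bootstrap --bucket <bucket-name> --upload-realm <realm-name>'
      done < hosts.txt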

  3. Add bootstrap actions to activate Pepperdata on new hosts as they’re created.

    1. Download a copy of your existing cluster bootstrap script (not the Pepperdata bootstrap script) to a location where you can edit it.

    2. Open the script for editing.

    3. Add the same activation commands to it that you previously ran for the already-running hosts.

      • You can use the long form of the --bucket and --upload-realm arguments as shown or their short equivalents, -b and -u.

      • Optionally, you can create swap space by using the --swap-space argument followed by an integer representing the amount of swap space in gigabytes (G). If you specify zero, the script does not create swap space.

        To remove swap space after the installation, ssh to every node in the cluster and do the following:

        1. sudo swapoff -v /swapfile to disable the swap file.

        2. sudo sed -i.bak '/^\/swapfile/d' /etc/fstab to remove its entry from the /etc/fstab file.

        3. sudo rm /swapfile to delete the swap file.

      • Optionally, you can specify a proxy server for the AWS Command Line Interface (CLI) and Pepperdata-enabled cluster hosts.

        Include the --proxy-address (or --emr-proxy-address for Supervisor version 8.0.24 or later) argument when running the Pepperdata bootstrap script, specifying its value as a fully-qualified host address that uses https protocol.

      • If you’re using a non-default EMR API endpoint (by using the --endpoint-url argument), include the --emr-api-endpoint argument when running the Pepperdata bootstrap script. Its value must be a fully-qualified host address. (It can use http or https protocol.)

      • If you are using a script from an earlier Supervisor version that has the --cluster (-c) argument instead of the --upload-realm (-u) argument (introduced in Supervisor v6.5), you can continue using the script and its old arguments; they are backward compatible.

      • Optionally, you can override the default exponential backoff and jitter retry logic for the describe-cluster command that the Pepperdata bootstrapping uses to retrieve the cluster’s metadata.

        Specify either or both of the following options in the bootstrap command. Substitute your values for the <my-retries> and <my-timeout> placeholders that are shown in the command.

        • --max-retry-attempts—(default=10) Maximum number of retry attempts to make after the initial describe-cluster call.

        • --max-timeout—(default=60) Maximum number of seconds to wait before the next retry call to describe-cluster. The actual wait time for a given retry is a random number from 1 through the calculated timeout (inclusive), which introduces the desired jitter.

      • Do not include the --is-running (-r) option in the bootstrap command. This option is only for bootstrapping an already-running host prior to Supervisor version 7.0.13, not for bootstrapping new hosts as they’re created.

      ARCHITECTURE=$(uname -m)
      PD_BUCKET=<bucket-name> # Name of the S3 bucket for Pepperdata that you previously created
      PD_PKG=<pd-ver-pkg>     # Supervisor version and package; for example: 7.0.9-H30_YARN3
      REALM_NAME=<realm-name>
      
      # Suffix for the package directory on 64-bit ARM hosts
      PD_ARCHITECTURE=
      if [ "$ARCHITECTURE" = "arm64" ] || [ "$ARCHITECTURE" = "aarch64" ]; then
          PD_ARCHITECTURE=-aarch64
      fi
      
      aws s3 cp s3://${PD_BUCKET}/install-packages/supervisor-${PD_PKG}${PD_ARCHITECTURE}/emr/bootstrap /tmp/bootstrap
      sudo bash /tmp/bootstrap --bucket ${PD_BUCKET} --upload-realm ${REALM_NAME} [--emr-proxy-address <proxy-url:proxy-port>] [--emr-api-endpoint <endpoint-url:endpoint-port>] [--max-retry-attempts <my-retries>] [--max-timeout <my-timeout>]
      
    4. Save your changes and close the file.

    5. Upload the revised file to overwrite the original cluster bootstrap script.
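
      For example, a sketch of the upload, assuming the cluster bootstrap script lives in your S3 bucket (substitute your actual path and filename):

      aws s3 cp ./bootstrap-script.sh s3://<my-bucket>/<path-to-cluster-bootstrap-script>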