(Application Spotlight) Configure Application Profiler (Cloud)
On This Page
- Prerequisites
- Supported Authentication Protocols for Application Profiler
- Task 1: Configure Pepperdata to Monitor Application History
- Task 2: (Kerberized Clusters) Configure HTTP/HTTPS Endpoint Authentication
- Task 3: (Basic Access Authentication) Add BA Authentication Credentials
- Task 4: Access the Application Profiler on the Pepperdata Dashboard
- Post Requisite: (Hadoop 2) Confirm Near Real-Time Data Collection
Prerequisites
Before you begin configuring Application Profiler, ensure that your system meets the required prerequisites.
- Pepperdata must be installed on the host running the MapReduce Job History Server
- MapReduce Job History Server must be running
- (Spark Monitoring) Spark History Server must be running
- Your cluster uses a supported authentication protocol; see Supported Authentication Protocols for Application Profiler, below
Supported Authentication Protocols for Application Profiler
To enable Application Profiler to fetch application data from the MapReduce Job History Server/Spark History Server, your cluster must use a Pepperdata-supported authentication protocol:
- No authentication.
- Pseudo auth (also known as Hadoop's simple authentication): the server authenticates requests based on the `user.name` query string parameter contained in the request.
- Kerberos.
- Basic access (BA) authentication: uses standard fields in the HTTP header to specify the user name and password; for details, see https://en.wikipedia.org/wiki/Basic_access_authentication.
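For illustration, both the pseudo-auth and BA mechanisms can be exercised from a shell. The host, port, and credentials below are placeholders for this sketch, not values from this document:

```shell
# Hypothetical Job History Server endpoint; substitute your own host and port.
JHS="http://jhs.example.com:19888"

# Pseudo auth: the caller's identity rides in the user.name query string parameter.
echo "GET ${JHS}/ws/v1/history?user.name=hdfs"

# BA authentication: curl's -u flag sends an "Authorization: Basic" header whose
# value is the base64 encoding of "username:password".
CRED=$(printf 'user:pass' | base64)
echo "Authorization: Basic ${CRED}"
```

In live use, `curl -u your-username:your-password "$JHS/ws/v1/history"` constructs the same header for you.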
Task 1: Configure Pepperdata to Monitor Application History
Procedure
1. Download a copy of your existing cluster-level Pepperdata configuration file, `pepperdata-config.sh`, from the environment's cluster configuration folder (in the cloud) to a location where you can edit it.

2. Open the file for editing, and revise it as follows.

   - Change the value of `PD_JOBHISTORY_MONITOR_ENABLED` to `1`.

   - To enable Spark application monitoring, add the configuration according to your environment.

     Note: If you're using Application Profiler to fetch history data for Spark apps, you can customize the connection timeout value and/or add a second Spark History Server for monitoring. See Configure Spark History Servers.

     - If the `spark-defaults.conf` file contains the correct assignment for `spark.yarn.historyServer.address` for the first (or only) Spark History Server, configure the `SPARK_CONF_DIR` environment variable to match:

       `export SPARK_CONF_DIR=your-path-to-first-spark-conf-directory`

       where `your-path-to-first-spark-conf-directory` is the directory that contains the `spark-defaults.conf` file.

     - If the `spark-defaults.conf` file does not include `spark.yarn.historyServer.address` (or its value is incorrect), and you can edit the `spark-defaults.conf` file:

       1. Edit the `spark-defaults.conf` file so that it includes the correct assignment for `spark.yarn.historyServer.address` for the first Spark History Server.
       2. In the `pepperdata-config.sh` file, configure the `SPARK_CONF_DIR` environment variable to match: `export SPARK_CONF_DIR=your-path-to-first-spark-conf-directory`, where `your-path-to-first-spark-conf-directory` is the directory that contains the `spark-defaults.conf` file.

     - For all other cases, edit the `pepperdata-config.sh` file to include the `PD_SPARK_HISTORY_SERVER_ADDRESS` environment variable, and set its value to the first (or only) Spark History Server's fully qualified URL.

   Example of modifications to the `pepperdata-config.sh` file:

   ```
   export PD_JOBHISTORY_MONITOR_ENABLED=1

   # For Spark Application Monitoring
   export SPARK_CONF_DIR=your-path-to-spark-conf-directory

   # Or, if the spark-defaults.conf file does not contain the correct assignment
   # for spark.yarn.historyServer.address, and you cannot edit it:
   # export PD_SPARK_HISTORY_SERVER_ADDRESS=http(s)://url-to-your-spark-history:port
   ```

3. Save your changes and close the file.

4. Upload the revised file to the cluster configuration folder to overwrite the original `pepperdata-config.sh` file.
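Before pointing `SPARK_CONF_DIR` at a directory, it can help to confirm that its `spark-defaults.conf` actually assigns `spark.yarn.historyServer.address`. A minimal sketch, using a throwaway `/tmp` directory as a stand-in for your real Spark conf directory:

```shell
# Stand-in conf directory for illustration; use your real Spark conf path instead.
SPARK_CONF_DIR=/tmp/example-spark-conf
mkdir -p "$SPARK_CONF_DIR"
printf 'spark.yarn.historyServer.address http://shs.example.com:18080\n' \
  > "$SPARK_CONF_DIR/spark-defaults.conf"

# If this prints a line, the assignment is present, so SPARK_CONF_DIR can match;
# if it prints nothing, fall back to PD_SPARK_HISTORY_SERVER_ADDRESS instead.
grep '^spark.yarn.historyServer.address' "$SPARK_CONF_DIR/spark-defaults.conf"
```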
Task 2: (Kerberized Clusters) Configure HTTP/HTTPS Endpoint Authentication
If the core services of the ResourceManagers and the MapReduce Job History Server are Kerberized (secured with Kerberos), add the authentication type for the auxiliary HTTP/HTTPS endpoint service to the Pepperdata configuration file, `pepperdata-config.sh`.

Note: Even when the core services use `kerberos` authentication, the auxiliary services, such as HTTP/HTTPS, can use either `simple` or `kerberos` authentication.

Prerequisites
- Be sure that the Kerberos principal has access to the ResourceManager and MapReduce Job History Server endpoints (HTTP or HTTPS).
- Be sure that you added the required environment variables, `PD_AGENT_PRINCIPAL` and `PD_AGENT_KEYTAB_LOCATION`, to the Pepperdata configuration during the installation process (Task 4. (Kerberized clusters) Enable Kerberos Authentication).
Procedure
1. For your Kerberized ResourceManager host, determine its authentication type by logging in to the host and running the following cURL command, where `{your-protocol}` is `http` or `https`:

   ```
   curl --tlsv1.2 -kI {your-protocol}://RM_HOST:PORT/ws/v1/cluster/info | grep WWW-Authenticate
   ```

   - If the returned response is `WWW-Authenticate: Negotiate`, the authentication type (`your-rm-auth-type`) is `kerberos`.
   - Otherwise, nothing is returned, and the authentication type (`your-rm-auth-type`) is `simple`.

2. For your Kerberized MapReduce Job History Server host, determine its authentication type by logging in to the host and running the following cURL command, where `{your-protocol}` is `http` or `https`:

   ```
   curl --tlsv1.2 -kI {your-protocol}://JHS_HOST:PORT/ws/v1/history | grep WWW-Authenticate
   ```

   - If the returned response is `WWW-Authenticate: Negotiate`, the authentication type (`your-jhs-auth-type`) is `kerberos`.
   - Otherwise, nothing is returned, and the authentication type (`your-jhs-auth-type`) is `simple`.

3. For your Kerberized YARN Timeline Server host, determine its authentication type by logging in to the host and running the following cURL command, where `{your-protocol}` is `http` or `https`:

   ```
   curl --tlsv1.2 -kI {your-protocol}://TIMELINE_SERVER_HOST:PORT/ws/v1/timeline | grep WWW-Authenticate
   ```

   - If the returned response is `WWW-Authenticate: Negotiate`, the authentication type (`your-timeline-server-auth-type`) is `kerberos`.
   - Otherwise, nothing is returned, and the authentication type (`your-timeline-server-auth-type`) is `simple`.
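The decision rule in the steps above can be sketched as a small shell helper. The function name and the captured-header demo are illustrative, not part of the Pepperdata tooling:

```shell
# Reads HTTP response headers on stdin; prints "kerberos" if the endpoint
# challenged with "WWW-Authenticate: Negotiate", otherwise prints "simple".
auth_type_from_headers() {
  if grep -q 'WWW-Authenticate: Negotiate'; then
    echo kerberos
  else
    echo simple
  fi
}

# Demo with captured headers. In live use, pipe the cURL output instead:
#   curl --tlsv1.2 -kI {your-protocol}://RM_HOST:PORT/ws/v1/cluster/info | auth_type_from_headers
printf 'HTTP/1.1 401 Unauthorized\nWWW-Authenticate: Negotiate\n' | auth_type_from_headers
```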
4. On the MapReduce Job History Server, add the environment variables for the HTTP/HTTPS endpoint's authentication type for the ResourceManager and the MapReduce Job History Server.

   1. Log in to the MapReduce Job History Server host, and download a copy of its existing Pepperdata configuration file, `pepperdata-config.sh`, to a location where you can edit it.

   2. Open the file for editing, and add the required environment variables. Be sure to substitute the authentication type (`simple` or `kerberos`, as you determined in the previous steps) for the `your-rm-auth-type`, `your-jhs-auth-type`, and `your-timeline-server-auth-type` placeholders.

      ```
      # For ResourceManager:
      export PD_JOBHISTORY_RESOURCE_MANAGER_HTTP_AUTH_TYPE=your-rm-auth-type

      # For MapReduce Job History Server:
      export PD_JOBHISTORY_MR_HISTORY_SERVER_HTTP_AUTH_TYPE=your-jhs-auth-type
      ```

   3. Save your changes and close the file.

   4. Upload the revised file to overwrite the original `pepperdata-config.sh` file.
5. Revise the Pepperdata configuration that is used for future hosts.

   1. Download a copy of your existing cluster-level Pepperdata configuration file, `pepperdata-config.sh`, from the environment's cluster configuration folder (in the cloud) to a location where you can edit it.

   2. Open the file for editing, and add the required environment variables. Be sure to substitute the authentication type (`simple` or `kerberos`, as you determined in the previous steps) for the `your-rm-auth-type`, `your-jhs-auth-type`, and `your-timeline-server-auth-type` placeholders.

      ```
      # For ResourceManager:
      export PD_JOBHISTORY_RESOURCE_MANAGER_HTTP_AUTH_TYPE=your-rm-auth-type

      # For MapReduce Job History Server:
      export PD_JOBHISTORY_MR_HISTORY_SERVER_HTTP_AUTH_TYPE=your-jhs-auth-type
      ```

   3. Save your changes and close the file.

   4. Upload the revised file to overwrite the original cluster-level Pepperdata configuration file, `pepperdata-config.sh`.
6. (YARN 3.x) For YARN 3.x environments (which typically align with Hadoop 3.x-based distros such as EMR 6.x), add authentication properties to the Pepperdata configuration to enable REST access.

   Note: If you already configured the authentication properties during the installation process, you do not need to do so again, and you should skip this procedure now.

   1. Log in to the ResourceManager host, and download a copy of the host's existing Pepperdata site file, `pepperdata-site.xml`, from the environment's cluster configuration folder (in the cloud) to a location where you can edit it.

   2. Open the file for editing, and add the required properties. Be sure to substitute your HTTP service policy, `HTTP_ONLY` or `HTTPS_ONLY`, for the `your-http-service-policy` placeholder in the following code snippet.

      For Kerberized clusters, the HTTP service policy is usually `HTTPS_ONLY`. But you should check with your cluster administrator or look for the value of the `yarn.http.policy` property in the cluster's `yarn-site.xml` file or the Hadoop configuration.

      ```
      <property>
        <name>pepperdata.agent.yarn.http.authentication.type</name>
        <value>kerberos</value>
      </property>
      <property>
        <name>pepperdata.agent.yarn.http.policy</name>
        <value>your-http-service-policy</value>
      </property>
      ```

      Note: Malformed XML files can cause operational errors that can be difficult to debug. To prevent such errors, we recommend that you use a linter, such as `xmllint`, after you edit any .xml configuration file.

   3. Save your changes and close the file.

   4. Upload the revised file to overwrite the original `pepperdata-site.xml` file.
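Putting the `xmllint` recommendation into practice might look like the following sketch; the `/tmp` file is a stand-in written only for illustration, not your real `pepperdata-site.xml`:

```shell
# Write a stand-in properties snippet to lint (illustrative path and content).
cat > /tmp/pepperdata-site-check.xml <<'EOF'
<configuration>
  <property>
    <name>pepperdata.agent.yarn.http.authentication.type</name>
    <value>kerberos</value>
  </property>
</configuration>
EOF

# xmllint exits non-zero and prints the parse error if the XML is malformed;
# run it against your edited file before uploading.
if command -v xmllint >/dev/null 2>&1; then
  xmllint --noout /tmp/pepperdata-site-check.xml && echo "well-formed"
fi
```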
Task 3: (Basic Access Authentication) Add BA Authentication Credentials
For Basic access (BA) authentication, add the BA authentication credentials for the monitored applications’ servers to the Pepperdata configuration.
Procedure
1. Log in to the MapReduce Job History Server host.

2. Download a copy of its existing Pepperdata configuration file, `pepperdata-config.sh`, to a location where you can edit it.

3. Open the file for editing, and add the required environment variables. Be sure to substitute your user name and password for the `your-username` and `your-password` placeholders. (The same environment variables are used to configure the BA authentication credentials for the ResourceManager, MapReduce Job History Server, and YARN Timeline Server.)

   ```
   # For ResourceManager, MapReduce Job History Server, YARN Timeline Server
   export PD_AGENT_SIMPLE_OR_BASIC_AUTH_USERNAME=your-username
   export PD_AGENT_BASIC_AUTH_PASSWORD=your-password

   # For Spark History Server
   export PD_JOBHISTORY_SPARK_HISTORY_BASIC_AUTH_USERNAME=your-username
   export PD_JOBHISTORY_SPARK_HISTORY_BASIC_AUTH_PASSWORD=your-password
   ```

4. Save your changes and close the file.

5. Upload the revised file to overwrite the original `pepperdata-config.sh` file.
Task 4: Access the Application Profiler on the Pepperdata Dashboard
The Application Profiler interface is integrated into the Pepperdata Dashboard.
The Applications and Recommendations sections of the dashboard Cluster View show pertinent data for every application that is monitored by Application Profiler.
- To view detailed metrics of a highlighted application, click its link in a tile.
- To view the table of applications that have Pepperdata recommendations of a given severity, click that severity in the applicable Recommendations tile (for the app type).
- To view data for all monitored applications, use the left-nav menu, and select App Spotlight > Application Profiler.
Post Requisite: (Hadoop 2) Confirm Near Real-Time Data Collection
After you finish configuring and customizing the cluster and bootstrapping it, you can confirm that Application Profiler is correctly configured for near real-time monitoring in Hadoop 2 (for example, in Amazon EMR 5.x or Google Dataproc 1.3–1.5) by viewing the data collection process stats (MapReduce Job History Server retrieval) at the following URL. Be sure to replace the `your-jobhistory-server-host` placeholder with the host name of your actual MapReduce Job History Server.

```
http://your-jobhistory-server-host:50505/JobHistoryMonitor
```
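A quick shell sketch for building and checking that URL; the host value is the placeholder from above, and the `curl` line is commented out because it only succeeds on a live cluster:

```shell
# Substitute your real Job History Server host for the placeholder.
JHS_HOST=your-jobhistory-server-host
MONITOR_URL="http://${JHS_HOST}:50505/JobHistoryMonitor"
echo "$MONITOR_URL"

# On a live cluster, fetch the data collection stats page (uncomment to run):
# curl -sf "$MONITOR_URL"
```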