Assigning Workflow Ids for Grouping and Chargeback Reporting

A workflow Id identifies all the applications/jobs that function together for a single purpose. Grouping or filtering metrics by workflow Id enables chargeback reporting, filtering charts, and viewing resource consumption by workflow.

Uses for Workflow Ids

Although workflow Ids are primarily used to enable chargeback reporting, they also support filtering charts and viewing resource consumption by workflow.

Pepperdata Workflow Id: YARN Clusters

In YARN clusters, the Oozie and Hive workflow schedulers automatically assign their own workflow Ids. To enable Pepperdata workflow-related functionality, however, you must manually configure a Pepperdata workflow Id key: oozie.job.id for Oozie jobs, or pepperdata.workflow.id for all other jobs.

Procedure

  • In the application configuration, set the appropriate Pepperdata workflow Id key (oozie.job.id for Oozie jobs, or pepperdata.workflow.id for all other job types) to the value that you want to assign.

    How to assign the Pepperdata workflow Id depends on your particular environment.

    The examples show the most broadly applicable method: specifying the parameter and its value in a command-line invocation for running an app. Check with your system administrator to determine whether there is a custom app/job manager or other framework or method for overriding default configuration settings; a programmatic alternative for Spark applications is sketched after the examples.

Examples

  • For Oozie jobs

    • Submitted through YARN:

      yarn jar -Doozie.job.id=group01-522 my_application.jar <myoptions>
      
    • Submitted through Spark:

      spark-submit --conf spark.hadoop.oozie.job.id=group01-522 --class com.company.application.MainClass my_application.jar
      
  • For all other types of jobs

    • Submitted through YARN:

      yarn jar -Dpepperdata.workflow.id=group01-522 my_application.jar <myoptions>
      
    • Submitted through Spark:

      spark-submit --conf spark.hadoop.pepperdata.workflow.id=group01-522 --class com.company.application.MainClass my_application.jar
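
If your environment sets Spark configuration in application code rather than on the command line, you can also set the key programmatically when the SparkSession is created. The following is a minimal sketch in Scala; the MyApplication name and the group01-522 value are placeholders, and whether Pepperdata picks up a value set this way depends on your environment, so treat it as a sketch to adapt rather than a drop-in replacement for the command-line method.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: set the Pepperdata workflow Id from application code.
    // The spark.hadoop. prefix propagates the key into the job's Hadoop
    // configuration as pepperdata.workflow.id (use oozie.job.id for Oozie jobs).
    val spark = SparkSession.builder()
      .appName("MyApplication")                                     // placeholder name
      .config("spark.hadoop.pepperdata.workflow.id", "group01-522") // placeholder value
      .getOrCreate()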
      

Pepperdata Workflow Id: Kubernetes Clusters

In Kubernetes clusters, Pepperdata associates a workflow with a Spark application by using the DAG (Directed Acyclic Graph) name (dag_name) and task name (task_name) labels on the driver Pod and executor Pods; you configure these labels in the Pepperdata dashboard.

Procedure

  • Add the applicable Spark properties for labels—dag_name and task_name—to all Pepperdata-monitored Spark applications.

    • Add the properties to the same <spark-job>.yaml files that you configured for Pepperdata instrumentation (see Activate Pepperdata for Spark Applications). A sketch of where the properties can sit in such a file follows the property listing below.

    • For a given app, the dag_name values must be the same for the driver and executor properties. Likewise, the task_name values must be the same for a given app’s driver and executor properties.

    • Be sure to replace the your-* placeholder names with the actual names.

    # For Spark applications:
    "spark.kubernetes.driver.label.dag_name": "your-dag-name"
    "spark.kubernetes.executor.label.dag_name": "your-dag-name"
    "spark.kubernetes.driver.label.task_name": "your-task-name"
    "spark.kubernetes.executor.label.task_name": "your-task-name"
    

What to do next

  • Associate the dag_name and task_name label attributes with specific applications; see Configure Labels.