Assigning Workflow Ids for Grouping and Chargeback Reporting

A workflow Id identifies all the applications/jobs that function together for a single purpose. Grouping or filtering metrics by workflow Id enables chargeback reporting, filtering charts, and viewing resource consumption by workflow.

Uses for Workflow Ids

Although workflow Ids are primarily used to enable chargeback reporting, they also support filtering charts and viewing resource consumption by workflow.

Pepperdata Workflow Id: YARN Clusters

In YARN clusters, the Oozie and Hive workflow schedulers automatically assign their own workflow Ids. To enable Pepperdata workflow-related functionality, however, you must manually configure a Pepperdata workflow Id key: oozie.job.id for Oozie jobs, or pepperdata.workflow.id for all other jobs.

Procedure

  • In the application configuration, set the appropriate Pepperdata workflow Id key (oozie.job.id for Oozie jobs, or pepperdata.workflow.id for all other job types) to the value that you want to assign.

    How to assign the Pepperdata workflow Id depends on your particular environment.

    The examples show the most broadly applicable method: specifying the parameter and its value in a command-line invocation for running an app. Check with your system administrator to determine whether there is a custom app/job manager or other framework or method for overriding default configuration settings; a programmatic alternative for Spark applications is sketched after the examples.

Examples

  • For Oozie jobs

    • Submitted through YARN:

      yarn jar -Doozie.job.id=group01-522 my_application.jar <myoptions>
      
    • Submitted through Spark:

      spark-submit --conf spark.hadoop.oozie.job.id=group01-522 --class com.company.application.MainClass my_application.jar
      
  • For all other types of jobs

    • Submitted through YARN:

      yarn jar -Dpepperdata.workflow.id=group01-522 my_application.jar <myoptions>
      
    • Submitted through Spark:

      spark-submit --conf spark.hadoop.pepperdata.workflow.id=group01-522 --class com.company.application.MainClass my_application.jar
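
If your environment sets Spark configuration in application code rather than on the command line, you can also set the key programmatically when the SparkSession is created. The following is a minimal sketch in Scala; the MyApplication name and the group01-522 value are placeholders, and whether Pepperdata picks up a value set this way depends on your environment, so treat it as a sketch to adapt rather than a drop-in replacement for the command-line method.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: set the Pepperdata workflow Id from application code.
    // The spark.hadoop. prefix propagates the key into the job's Hadoop
    // configuration as pepperdata.workflow.id (use oozie.job.id for Oozie jobs).
    val spark = SparkSession.builder()
      .appName("MyApplication")                                     // placeholder name
      .config("spark.hadoop.pepperdata.workflow.id", "group01-522") // placeholder value
      .getOrCreate()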
      

Pepperdata Workflow Id: Kubernetes Clusters

In Kubernetes clusters, Pepperdata associates a workflow with a Spark application by using the DAG (Directed Acyclic Graph) name (dag_name) and task name (task_name) labels on the driver Pod and executor Pods; you configure these labels in the Pepperdata dashboard.

Procedure

  • Add the applicable Spark properties for labels—dag_name and task_name—to all Pepperdata-monitored Spark applications.

    • Add the properties to the same <spark-job>.yaml files that you configured for Pepperdata instrumentation (see Activate Pepperdata for Spark Applications). A sketch of where the properties can sit in such a file follows the property listing below.

    • For a given app, the dag_name values must be the same for the driver and executor properties. Likewise, the task_name values must be the same for a given app’s driver and executor properties.

    • Be sure to replace the your-* placeholder names with the actual names.

    # For Spark applications:
    "spark.kubernetes.driver.label.dag_name": "your-dag-name"
    "spark.kubernetes.executor.label.dag_name": "your-dag-name"
    "spark.kubernetes.driver.label.task_name": "your-task-name"
    "spark.kubernetes.executor.label.task_name": "your-task-name"
    

What to do next

  • Associate the dag_name and task_name label attributes with specific applications; see Configure Labels.