DataprocSubmitPigJobOperator
Provider: Google
Start a Pig query job on a Cloud Dataproc cluster. The parameters of the operation will be passed to the cluster.
Access Instructions
Install the Google provider package into your Airflow environment.
Import the operator into your DAG file and instantiate it with the parameters you need.
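For example, after installing the apache-airflow-providers-google package, the import and instantiation might look like the sketch below; the DAG id, cluster name, region, and query are placeholder values, not defaults shipped with the provider.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitPigJobOperator

with DAG(
    dag_id="dataproc_pig_example",        # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    submit_pig = DataprocSubmitPigJobOperator(
        task_id="submit_pig",
        query="sh echo hello;",           # Pig script passed inline as a string
        cluster_name="my-cluster",        # assumed existing Dataproc cluster
        region="us-central1",             # assumed cluster region
    )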
Documentation
Start a Pig query job on a Cloud Dataproc cluster. The parameters of the operation will be passed to the cluster.
It is good practice to define dataproc_* parameters, such as the cluster name and UDF jars, in the default_args of the DAG:
default_args = {
    "cluster_name": "cluster-1",
    "dataproc_pig_jars": [
        "gs://example/udf/jar/datafu/1.2.0/datafu.jar",
        "gs://example/udf/jar/gpig/1.2/gpig.jar",
    ],
}
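These defaults are then inherited by every task in the DAG. A minimal sketch, assuming Airflow 2 and that the key names above match the operator's constructor arguments in your provider version (newer provider releases expose the jars as dataproc_jars rather than dataproc_pig_jars):

with DAG(
    dag_id="pig_udf_example",             # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    default_args=default_args,            # cluster name and UDF jars applied to matching operator args
) as dag:
    pig_task = DataprocSubmitPigJobOperator(
        task_id="run_pig",
        query="a_pig_script.pig",         # resolved as a Jinja-templated .pig file
        region="us-central1",             # assumed cluster region
    )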
You can pass a Pig script either as a string or as a file reference. Use variables to pass values that the Pig script resolves on the cluster, or use params to have values resolved in the script as Jinja template parameters.
Example:
t1 = DataprocSubmitPigJobOperator(
    task_id='dataproc_pig',
    query='a_pig_script.pig',
    variables={'out': 'gs://example/output/{{ds}}'},
    dag=dag,
)
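As an alternative sketch, the script can also be passed inline as a string; the $out placeholder is substituted on the cluster from the variables mapping. The input path and region here are assumptions for illustration, not part of the original example.

t2 = DataprocSubmitPigJobOperator(
    task_id='dataproc_pig_inline',
    query="""
        data = LOAD 'gs://example/input/*.csv' USING PigStorage(',');
        STORE data INTO '$out';
    """,
    variables={'out': 'gs://example/output/{{ds}}'},
    region='us-central1',   # assumed cluster region
    dag=dag,
)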
See also
For more detail about job submission, have a look at the reference: https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs