DataprocSubmitPigJobOperator

Google

Start a Pig query job on a Cloud Dataproc cluster. The parameters of the operation will be passed to the cluster.


Last Updated: Feb. 25, 2023

Access Instructions

Install the Google provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
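A minimal sketch of those two steps, assuming the standard provider package name and a recent provider version:

# Install the provider into your Airflow environment, e.g.:
#   pip install apache-airflow-providers-google
# Then import the operator in your DAG file:
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitPigJobOperator,
)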

Parameters

query: The query or reference to the query file (pg or pig extension). (templated)
query_uri: The HCFS URI of the script that contains the Pig queries (see the sketch after this list).
variables: Map of named parameters for the query. (templated)
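A minimal sketch of the two alternative ways to supply the script, inline via query or from Cloud Storage via query_uri; the region, cluster name, and URIs below are illustrative, not part of this page:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitPigJobOperator

# Inline Pig query passed as a string.
pig_inline = DataprocSubmitPigJobOperator(
    task_id="pig_inline",
    query="define sin HiveUDF('sin');",
    region="us-central1",       # illustrative region
    cluster_name="cluster-1",   # illustrative cluster
)

# Pig script stored in Cloud Storage, referenced by its HCFS URI.
pig_from_gcs = DataprocSubmitPigJobOperator(
    task_id="pig_from_gcs",
    query_uri="gs://example/scripts/a_pig_script.pig",
    variables={"out": "gs://example/output/{{ ds }}"},  # named parameters for the query
    region="us-central1",
    cluster_name="cluster-1",
)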

Documentation


It is good practice to define dataproc_* parameters, such as the cluster name and UDF jars, in the default_args of the DAG.

default_args = {
    "cluster_name": "cluster-1",
    "dataproc_pig_jars": [
        "gs://example/udf/jar/datafu/1.2.0/datafu.jar",
        "gs://example/udf/jar/gpig/1.2/gpig.jar",
    ],
}
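A minimal sketch of attaching these default_args to a DAG so every Dataproc task inherits them; the DAG id and start date are assumed for illustration:

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="example_dataproc_pig",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,                   # Airflow 2.4+; use schedule_interval on older versions
    default_args=default_args,       # the dict defined above
) as dag:
    ...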

You can pass a Pig script as a string or as a file reference. Use variables to pass values that the Pig script resolves on the cluster, or use Jinja template parameters that are resolved in the script before it is submitted.

Example:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitPigJobOperator

t1 = DataprocSubmitPigJobOperator(
    task_id="dataproc_pig",
    query="a_pig_script.pig",
    variables={"out": "gs://example/output/{{ds}}"},
    region="us-central1",  # required in recent provider versions; value is illustrative
    dag=dag,
)

See also

For more detail on job submission, have a look at the reference: https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs
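For orientation, a rough sketch of the Job body this operator ultimately submits through that API; the field values are illustrative:

# Approximate shape of the request body for projects.regions.jobs.submit.
job_body = {
    "job": {
        "placement": {"clusterName": "cluster-1"},
        "pigJob": {
            "queryFileUri": "gs://example/scripts/a_pig_script.pig",
            "scriptVariables": {"out": "gs://example/output/2023-02-25"},
            "jarFileUris": ["gs://example/udf/jar/datafu/1.2.0/datafu.jar"],
        },
    }
}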
