DataflowCreateJavaJobOperator
Google
Start a Java Cloud Dataflow batch job. The parameters of the operation will be passed to the job.
Access Instructions
Install the Google provider package into your Airflow environment.
Import the module into your DAG file and instantiate it with your desired params.
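For example, assuming a standard Airflow 2.x environment, the Google provider is distributed as apache-airflow-providers-google and the operator is importable from the dataflow operators module:

pip install apache-airflow-providers-google

Then, in your DAG file:

from airflow.providers.google.cloud.operators.dataflow import DataflowCreateJavaJobOperator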
Documentation
Start a Java Cloud Dataflow batch job. The parameters of the operation will be passed to the job.
This class is deprecated. Please use providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator.
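As a rough migration sketch (assuming the apache-airflow-providers-apache-beam package is installed; the jar path, options, job name, project and location values below are illustrative, not prescribed by this documentation), an equivalent run with the replacement operator might look like:

from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

# Run the same self-executing jar on Dataflow via the Beam provider.
task = BeamRunJavaPipelineOperator(
    task_id="normalize-cal",
    runner="DataflowRunner",
    jar="{{var.value.gcp_dataflow_base}}pipeline-ingress-cal-normalize-1.0.jar",
    pipeline_options={
        "autoscalingAlgorithm": "BASIC",
        "maxNumWorkers": "50",
    },
    # Dataflow-specific settings (project, region, job name) move into
    # DataflowConfiguration; the values here are placeholders.
    dataflow_config=DataflowConfiguration(
        job_name="normalize-cal",
        project_id="my-gcp-project",
        location="us-central1",
    ),
    dag=dag,
)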
Example:
default_args = {"owner": "airflow","depends_on_past": False,"start_date": (2016, 8, 1),"email": ["alex@vanboxel.be"],"email_on_failure": False,"email_on_retry": False,"retries": 1,"retry_delay": timedelta(minutes=30),"dataflow_default_options": {"project": "my-gcp-project","zone": "us-central1-f","stagingLocation": "gs://bucket/tmp/dataflow/staging/",},}dag = DAG("test-dag", default_args=default_args)task = DataflowCreateJavaJobOperator(gcp_conn_id="gcp_default",task_id="normalize-cal",jar="{{var.value.gcp_dataflow_base}}pipeline-ingress-cal-normalize-1.0.jar",options={"autoscalingAlgorithm": "BASIC","maxNumWorkers": "50","start": "{{ds}}","partitionType": "DAY",},dag=dag,)
See also
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params
See also
For more information on how to use this operator, take a look at the guide: Java SDK pipelines
Note that both dataflow_default_options and options will be merged to specify the pipeline execution parameters. dataflow_default_options is expected to hold high-level options, for instance project and zone information, which apply to all Dataflow operators in the DAG.
It is good practice to define the dataflow_* parameters, such as the project, zone and staging location, in the default_args of the DAG.
default_args = {"dataflow_default_options": {"zone": "europe-west1-d","stagingLocation": "gs://my-staging-bucket/staging/",}}
You need to pass the path to your Dataflow pipeline as a file reference with the jar parameter; the jar needs to be a self-executing jar (see the documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use options to pass options on to your job.
t1 = DataflowCreateJavaJobOperator(
    task_id="dataflow_example",
    jar="{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar",
    options={
        "autoscalingAlgorithm": "BASIC",
        "maxNumWorkers": "50",
        "start": "{{ds}}",
        "partitionType": "DAY",
        "labels": {"foo": "bar"},
    },
    gcp_conn_id="airflow-conn-id",
    dag=my_dag,
)