DataprocJobBaseOperator

Google

The base class for operators that launch jobs on Dataproc.

Last Updated: Feb. 25, 2023

Access Instructions

Install the Google provider package into your Airflow environment.
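The provider package is typically installed with pip; the package name below follows the standard Airflow provider naming scheme:

```shell
pip install apache-airflow-providers-google
```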

Import the module into your DAG file and instantiate it with your desired params.

Parameters

region (Required): The region in which the Dataproc cluster is created.
job_name: The job name used in the Dataproc cluster. By default this is the task_id appended with the execution date, but it can be templated. The name is always appended with a random number to avoid name clashes.
cluster_name: The name of the Dataproc cluster.
project_id: The ID of the Google Cloud project the cluster belongs to. If not specified, the project is inferred from the provided GCP connection.
dataproc_properties: Map of Hive properties. Ideal to put in default arguments. (templated)
dataproc_jars: HCFS URIs of JAR files to add to the CLASSPATH of the Hive server and Hadoop MapReduce (MR) tasks. Can contain Hive SerDes and UDFs. (templated)
gcp_conn_id: The connection ID to use when connecting to Google Cloud.
delegate_to: The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
labels: The labels to associate with this job. Label keys must contain 1 to 63 characters and conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and conform to RFC 1035. No more than 32 labels can be associated with a job.
job_error_states: Job states that should be considered error states. Any state in this set results in an error being raised and failure of the task. For example, if the CANCELLED state should also be considered a task failure, pass in {'ERROR', 'CANCELLED'}. Possible values are currently only 'ERROR' and 'CANCELLED', but this could change in the future. Defaults to {'ERROR'}.
impersonation_chain: Optional service account to impersonate using short-term credentials, or a chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, each identity in the list must grant the Service Account Token Creator IAM role to the directly preceding identity, with the first account in the list granting this role to the originating account. (templated)
asynchronous: Flag to return immediately after submitting the job to the Dataproc API. Useful for submitting long-running jobs and waiting on them asynchronously using the DataprocJobSensor.
deferrable: Run the operator in deferrable mode.
polling_interval_seconds: Time in seconds between polls for job completion. The value is considered only when running in deferrable mode. Must be greater than 0.
dataproc_job_id
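To illustrate the job_name behavior described above (task_id plus execution date, with a random suffix appended to avoid name clashes), here is a minimal sketch in plain Python. The helper name and exact formatting are assumptions for illustration, not the operator's actual implementation:

```python
import uuid


def build_job_name(task_id: str, execution_date: str) -> str:
    """Sketch of a unique job name: task_id + execution date + random suffix.

    Hypothetical helper illustrating the documented naming behavior; the
    real operator's implementation may differ.
    """
    # Normalize characters that are not typically allowed in job ids.
    base = f"{task_id}_{execution_date}".replace(":", "_").replace("-", "_")
    # A random suffix guards against clashes when the same task re-runs.
    return f"{base}_{uuid.uuid4().hex[:8]}"


name = build_job_name("run_hive_query", "2023-02-25T00:00:00")
```

Because the suffix is random, submitting the same task for the same execution date twice still produces distinct job names.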

Documentation

The base class for operators that launch jobs on Dataproc.
