DataprocCreateClusterOperator

Google

Create a new cluster on Google Cloud Dataproc. The operator waits until the creation succeeds or an error occurs in the creation process.


Last Updated: Feb. 25, 2023

Access Instructions

Install the Google provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
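
A minimal sketch of the two steps above, assuming the provider was installed with `pip install apache-airflow-providers-google`. The project ID, region, cluster name, and task ID are placeholder values, and `make_create_cluster_task` is a hypothetical helper, not part of the provider. The import sits inside the helper only so the snippet can be loaded where Airflow is not installed; in a DAG file it would normally go at the top.

```python
# Placeholder arguments; replace them with values from your own project.
CREATE_CLUSTER_KWARGS = {
    "task_id": "create_cluster",
    "project_id": "my-gcp-project",
    "region": "europe-west1",
    "cluster_name": "example-cluster",
    "cluster_config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}


def make_create_cluster_task(**overrides):
    """Instantiate the operator; intended to be called inside a `with DAG(...)` block."""
    # Imported lazily so this module loads even without Airflow (illustration only).
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
    )

    # Keyword overrides replace the placeholder values above.
    return DataprocCreateClusterOperator(**{**CREATE_CLUSTER_KWARGS, **overrides})
```

Inside a DAG definition, `create_cluster = make_create_cluster_task()` would then register the task with the enclosing DAG.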

Parameters

project_id: The ID of the Google Cloud project in which to create the cluster. (templated)
cluster_name (required): Name of the cluster to create.
labels: Labels that will be assigned to the created cluster.
cluster_config: Required. The cluster config to create. If a dict is provided, it must be of the same form as the protobuf message ClusterConfig.
virtual_cluster_config: Optional. The virtual cluster config, used when creating a Dataproc cluster that does not directly control the underlying compute resources, for example, when creating a Dataproc-on-GKE cluster.
region (required): The region where the Dataproc cluster is created.
delete_on_error: If true, the cluster is deleted if it is created in the ERROR state. Defaults to true.
use_if_exists: If true, use the existing cluster.
request_id: Optional. A unique ID used to identify the request. If the server receives two CreateClusterRequest requests with the same ID, the second request is ignored and the first google.longrunning.Operation created and stored in the backend is returned.
retry: A retry object used to retry requests. If None is specified, requests will not be retried.
timeout: The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.
metadata: Additional metadata that is provided to the method.
gcp_conn_id: The connection ID to use when connecting to Google Cloud.
impersonation_chain: Optional service account to impersonate using short-term credentials, or a chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, each identity in the list must grant the Service Account Token Creator IAM role to the directly preceding identity, with the first account in the list granting this role to the originating account. (templated)
deferrable: Run the operator in deferrable mode.
polling_interval_seconds: Time, in seconds, to wait between calls to check the run status.
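
As an illustration of the parameters above, the following sketch builds a `cluster_config` dict using field names from the ClusterConfig protobuf message (`master_config`, `worker_config`, `num_instances`, `machine_type_uri`, `disk_config`). The helper `build_cluster_config`, the machine type, disk size, label, and polling interval are all assumptions chosen for the example, not values recommended by the provider.

```python
def build_cluster_config(num_workers=2, machine_type="n1-standard-4", boot_disk_gb=100):
    """Return a dict shaped like the ClusterConfig protobuf message (hypothetical helper)."""

    def instance_group(count):
        # One instance group: node count, machine type, and boot disk settings.
        return {
            "num_instances": count,
            "machine_type_uri": machine_type,
            "disk_config": {
                "boot_disk_type": "pd-standard",
                "boot_disk_size_gb": boot_disk_gb,
            },
        }

    return {
        "master_config": instance_group(1),
        "worker_config": instance_group(num_workers),
    }


# Optional operator arguments from the list above, with example values.
OPTIONAL_KWARGS = {
    "labels": {"team": "data-eng"},
    "delete_on_error": True,
    "use_if_exists": True,
    "deferrable": True,
    "polling_interval_seconds": 30,
}

cluster_config = build_cluster_config(num_workers=3)
```

These values would be passed to the operator as `cluster_config=cluster_config, **OPTIONAL_KWARGS`, alongside the required `cluster_name` and `region`.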

Documentation

Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an error occurs in the creation process. If the cluster already exists and use_if_exists is True then the operator will:

  • if the cluster state is ERROR, delete the cluster if delete_on_error is set, then raise an error

  • if the cluster state is CREATING, wait for creation to finish and then check for the ERROR state

  • if the cluster state is DELETING, wait for deletion to finish and then create a new cluster
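
The three cases above can be summarised as a small decision function. This is only a sketch of the documented behaviour, not the operator's actual implementation; the function name and the step strings are hypothetical, and the fallback for any other state is an assumption based on the use_if_exists description.

```python
from typing import List


def plan_for_existing_cluster(state: str, delete_on_error: bool = True) -> List[str]:
    """Return the steps taken when the cluster already exists and use_if_exists is True."""
    if state == "ERROR":
        # Delete only when requested, then fail the task.
        steps = ["delete cluster"] if delete_on_error else []
        return steps + ["raise error"]
    if state == "CREATING":
        return ["wait for creation", "check for ERROR state"]
    if state == "DELETING":
        return ["wait for deletion", "create new cluster"]
    # Assumption: any other state means the existing cluster is reused as-is.
    return ["use existing cluster"]


for example_state in ("ERROR", "CREATING", "DELETING", "RUNNING"):
    print(example_state, "->", plan_for_existing_cluster(example_state))
```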

Please refer to https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters for a detailed explanation of the different parameters. Most of the configuration parameters detailed in the link are available as parameters to this operator.

See also

For more information on how to use this operator, take a look at the guide: Create a Cluster
