DataprocCreateSparkJobOperator

Yandex

Runs a Spark job in a Data Proc cluster.


Last Updated: Oct. 23, 2022

Access Instructions

Install the Yandex provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired parameters.
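A minimal sketch of these two steps, assuming the provider package is `apache-airflow-providers-yandex` and that the operator lives in the `yandexcloud_dataproc` module of the provider at the time of writing:

```python
# Install the provider into your Airflow environment first:
#   pip install apache-airflow-providers-yandex

# Import path assumed from the Yandex provider package.
from airflow.providers.yandex.operators.yandexcloud_dataproc import (
    DataprocCreateSparkJobOperator,
)
```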

Parameters

main_jar_file_uri: URI of the JAR file containing the job. Can be placed in HDFS or S3.
main_class: Name of the main class of the job.
file_uris: URIs of files used in the job. Can be placed in HDFS or S3.
archive_uris: URIs of archive files used in the job. Can be placed in HDFS or S3.
jar_file_uris: URIs of JAR files used in the job. Can be placed in HDFS or S3.
properties: Properties for the job.
args: Arguments to be passed to the job.
name: Name of the job. Used for labeling.
cluster_id: ID of the cluster to run the job in. Will try to take the ID from the Dataproc Hook object if it is specified. (templated)
connection_id: ID of the Yandex.Cloud Airflow connection.
packages: List of Maven coordinates of JARs to include on the driver and executor classpaths.
repositories: List of additional remote repositories to search for the Maven coordinates given with --packages.
exclude_packages: List of groupId:artifactId entries to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts.
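A hedged example of instantiating the operator inside a DAG with a subset of these parameters. The bucket URIs, main class, cluster ID, and connection ID are placeholders, and the import path is assumed from the Yandex provider package:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.yandex.operators.yandexcloud_dataproc import (
    DataprocCreateSparkJobOperator,
)

with DAG(
    dag_id="dataproc_spark_job_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    spark_job = DataprocCreateSparkJobOperator(
        task_id="run_spark_job",
        # Placeholder URIs; the JAR and supporting files can live in HDFS or S3.
        main_jar_file_uri="s3a://my-bucket/jobs/spark-job.jar",
        main_class="org.example.MainClass",
        file_uris=["s3a://my-bucket/data/config.json"],
        jar_file_uris=["s3a://my-bucket/jars/extra-lib.jar"],
        args=["--input", "s3a://my-bucket/input/", "--output", "s3a://my-bucket/output/"],
        properties={"spark.executor.memory": "2g"},
        name="example-spark-job",
        # cluster_id may be omitted to let the operator take it from the Dataproc hook.
        cluster_id="my-cluster-id",
        connection_id="yandexcloud_default",
    )
```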

Documentation

Runs a Spark job in a Data Proc cluster.
