DataprocCreatePysparkJobOperator

Provider: Yandex

Runs a PySpark job in a Yandex Data Proc cluster.



Access Instructions

Install the Yandex provider package into your Airflow environment.

Import the operator into your DAG file and instantiate it with your desired parameters, as sketched below.
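
A minimal setup sketch, assuming the provider is installed from the apache-airflow-providers-yandex package; the module path shown here matches provider releases around the time of this page and may differ in other versions.

# Install the provider first, e.g.: pip install apache-airflow-providers-yandex
from airflow.providers.yandex.operators.yandexcloud_dataproc import (
    DataprocCreatePysparkJobOperator,
)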

Parameters

main_python_file_uri: URI of the Python file with the job. Can be placed in HDFS or S3.
python_file_uris: URIs of Python files used in the job. Can be placed in HDFS or S3.
file_uris: URIs of files used in the job. Can be placed in HDFS or S3.
archive_uris: URIs of archive files used in the job. Can be placed in HDFS or S3.
jar_file_uris: URIs of JAR files used in the job. Can be placed in HDFS or S3.
properties: Properties for the job.
args: Arguments to be passed to the job.
name: Name of the job. Used for labeling.
cluster_id: ID of the cluster to run the job in. Will try to take the ID from the Dataproc Hook object if it is specified. (templated)
connection_id: ID of the Yandex.Cloud Airflow connection.
packages: List of Maven coordinates of JARs to include on the driver and executor classpaths.
repositories: List of additional remote repositories to search for the Maven coordinates given with --packages.
exclude_packages: List of groupId:artifactId entries to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts.
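
The sketch below shows one way to instantiate the operator inside a DAG using several of the parameters above. The DAG name, bucket URIs, cluster ID, and connection ID are placeholder values, not defaults of the operator.

from datetime import datetime

from airflow import DAG
from airflow.providers.yandex.operators.yandexcloud_dataproc import (
    DataprocCreatePysparkJobOperator,
)

with DAG(
    dag_id="dataproc_pyspark_example",   # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_pyspark = DataprocCreatePysparkJobOperator(
        task_id="run_pyspark_job",
        # All URIs below are placeholder paths in an Object Storage bucket.
        main_python_file_uri="s3a://my-bucket/jobs/main.py",
        python_file_uris=["s3a://my-bucket/jobs/helpers.py"],
        file_uris=["s3a://my-bucket/jobs/config.json"],
        archive_uris=["s3a://my-bucket/jobs/extra.zip"],
        args=["--run-date", "{{ ds }}"],
        properties={"spark.submit.deployMode": "cluster"},
        name="example-pyspark-job",
        # cluster_id is templated; it can also be omitted if a cluster was
        # created earlier in the same DAG and its ID is available to the hook.
        cluster_id="c9q0example0000000000",      # placeholder cluster ID
        connection_id="yandexcloud_default",     # adjust to your connection ID
    )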

Documentation

Runs a PySpark job in a Yandex Data Proc cluster. The target cluster is identified by cluster_id, and Yandex.Cloud credentials are taken from the Airflow connection specified by connection_id.
