DataprocSubmitPySparkJobOperator

Google

Start a PySpark job on a Cloud Dataproc cluster.

Last Updated: Feb. 25, 2023

Install the Google provider package (apache-airflow-providers-google) into your Airflow environment.

Import the operator from airflow.providers.google.cloud.operators.dataproc into your DAG file and instantiate it with the parameters you need.

main (required): The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file. (templated)

arguments: Arguments for the job. (templated)

archives: List of archived files that will be unpacked in the work directory. Should be stored in Cloud Storage.

files: List of files to be copied to the working directory.

pyfiles: List of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
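
As a rough illustration of how the parameters above fit together, the following sketch collects them in a dictionary and passes them to the operator. The bucket, file paths, task_id, cluster name, and region are hypothetical placeholders. The cluster_name and region arguments are not described on this page and are assumed here from the operator's base class, so check them against your installed provider version.

```python
# Hypothetical Cloud Storage bucket, used only for illustration.
BUCKET = "gs://example-bucket"

# Keyword arguments matching the parameters described above.
# All paths are made-up examples.
PYSPARK_PARAMS = {
    "main": f"{BUCKET}/jobs/word_count.py",        # required; must be a .py file
    "arguments": ["--run-date", "{{ ds }}"],       # templated by Airflow at run time
    "archives": [f"{BUCKET}/deps/resources.zip"],  # unpacked in the work directory
    "files": [f"{BUCKET}/conf/settings.json"],     # copied to the working directory
    "pyfiles": [f"{BUCKET}/deps/helpers.zip"],     # .py, .egg, or .zip
}


def make_pyspark_task(dag=None):
    """Build the task. Requires apache-airflow-providers-google to be installed."""
    # Imported inside the function so this module loads without Airflow.
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocSubmitPySparkJobOperator,
    )

    return DataprocSubmitPySparkJobOperator(
        task_id="submit_pyspark_job",    # hypothetical task name
        cluster_name="example-cluster",  # assumed: name of an existing cluster
        region="us-central1",            # assumed: region of that cluster
        dag=dag,
        **PYSPARK_PARAMS,
    )
```

In a DAG file, calling make_pyspark_task(dag=my_dag) would attach the task to a DAG object named my_dag (also a placeholder). Only main is required among the parameters listed on this page, so the other dictionary entries can be dropped when the job does not need them.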
