SparkSubmitOperator

Apache Spark

This operator is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the "spark-submit" binary is in the PATH.


Last Updated: Mar. 15, 2023

Access Instructions

Install the Apache Spark provider package into your Airflow environment.

Import the operator into your DAG file and instantiate it with your desired parameters.
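A minimal sketch of both steps, assuming the provider package has been installed (for example with pip install apache-airflow-providers-apache-spark); the DAG id, schedule, and application path below are placeholders:

```python
# A minimal sketch; the DAG id, schedule, and application path are placeholders.
import pendulum

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="example_spark_submit",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_job",
        application="/path/to/app.py",  # a jar or py file reachable from the worker
        conn_id="spark_default",        # a Spark connection configured in Airflow
    )
```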

Parameters

application: The application that is submitted as a job, either a jar or py file. (templated)

conf: Arbitrary Spark configuration properties. (templated)

conn_id: The spark connection id as configured in Airflow administration. When an invalid connection_id is supplied, it will default to yarn.

files: Upload additional files to the executor running the job, separated by a comma. Files will be placed in the working directory of each executor. For example, serialized objects. (templated)

py_files: Additional Python files used by the job; can be .zip, .egg or .py. (templated)

jars: Submit additional jars to upload and place them in the executor classpath. (templated)

driver_class_path: Additional, driver-specific, classpath settings. (templated)

java_class: The main class of the Java application.

packages: Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. (templated)

exclude_packages: Comma-separated list of Maven coordinates of jars to exclude while resolving the dependencies provided in 'packages'. (templated)

repositories: Comma-separated list of additional remote repositories to search for the Maven coordinates given with 'packages'.

total_executor_cores: (Standalone & Mesos only) Total cores for all executors. (Default: all the available cores on the worker)

executor_cores: (Standalone & YARN only) Number of cores per executor. (Default: 2)

executor_memory: Memory per executor, e.g. 1000M, 2G. (Default: 1G)

driver_memory: Memory allocated to the driver, e.g. 1000M, 2G. (Default: 1G)

keytab: Full path to the file that contains the keytab. (templated)

principal: The name of the Kerberos principal used for the keytab. (templated)

proxy_user: User to impersonate when submitting the application. (templated)

name: Name of the job. (Default: airflow-spark) (templated)

num_executors: Number of executors to launch.

status_poll_interval: Seconds to wait between polls of the driver status in cluster mode. (Default: 1)

application_args: Arguments for the application being submitted. (templated)

env_vars: Environment variables for spark-submit; also supported in yarn and k8s modes. (templated)

verbose: Whether to pass the verbose flag to the spark-submit process for debugging.

spark_binary: The command to use for spark submit. Some distros may use spark2-submit.
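As a hedged illustration of how several of these parameters combine, the sketch below submits a templated PySpark job; every path, configuration value, resource size, and the connection id are placeholders rather than recommendations:

```python
# A sketch only; paths, resource sizes, and the job name are placeholders.
# The Spark connection can also be supplied via an environment variable, e.g.:
#   export AIRFLOW_CONN_SPARK_DEFAULT='spark://spark-master:7077'
submit_etl = SparkSubmitOperator(
    task_id="submit_etl",
    application="/opt/spark/jobs/etl.py",            # a jar or py file (templated)
    conn_id="spark_default",
    conf={"spark.executor.memoryOverhead": "512m"},  # arbitrary Spark properties
    py_files="/opt/spark/libs/helpers.zip",          # comma-separated if several
    executor_cores=2,
    executor_memory="2G",
    driver_memory="1G",
    num_executors=4,
    name="etl-{{ ds }}",                             # job name is templated
    application_args=["--date", "{{ ds }}"],         # args are templated too
    verbose=True,                                    # pass --verbose to spark-submit
)
```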

Documentation

This operator is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the "spark-submit" binary is in the PATH.
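One quick way to confirm the PATH requirement from inside the Airflow environment is a standard-library lookup; this is a convenience check, not part of the operator's API:

```python
import shutil

# shutil.which returns the full path to the binary, or None if it is not on the PATH.
if shutil.which("spark-submit") is None:
    raise RuntimeError("spark-submit not found on PATH")
```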

See also

For more information on how to use this operator, take a look at the guide: SparkSubmitOperator
