SparkSubmitHook

Apache Spark

This hook is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the “spark-submit” binary is in the PATH.

Last Updated: Mar. 22, 2023

Access Instructions

Install the Apache Spark provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired parameters.
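
For example, a minimal sketch (the application path below is a hypothetical placeholder, and the hook falls back to its default Spark connection when none is specified):

from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook

# Minimal sketch: instantiate the hook and submit an application.
# Assumes the "spark-submit" binary is on PATH and a Spark connection is
# configured in Airflow. The application path is a hypothetical placeholder.
hook = SparkSubmitHook(
    name="airflow-spark-example",  # job name (default is airflow-spark)
    verbose=True,                  # pass the verbose flag to spark-submit
)
hook.submit(application="/path/to/your_spark_job.py")

In practice, a call like this would typically sit inside a Python callable executed by a task in your DAG.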

Parameters

conf: Arbitrary Spark configuration properties.
spark_conn_id: The Spark connection id as configured in Airflow administration. When an invalid connection id is supplied, it defaults to yarn.
files: Additional files to upload to the executors running the job, separated by commas. Files are placed in the working directory of each executor (for example, serialized objects).
py_files: Additional Python files used by the job; can be .zip, .egg, or .py.
archives: Archives that Spark should unzip (and possibly tag with #ALIAS) into the application working directory.
driver_class_path: Additional, driver-specific classpath settings.
jars: Additional jars to upload and place on the executor classpath.
java_class: The main class of the Java application.
packages: Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.
exclude_packages: Comma-separated list of Maven coordinates of jars to exclude while resolving the dependencies provided in 'packages'.
repositories: Comma-separated list of additional remote repositories to search for the Maven coordinates given with 'packages'.
total_executor_cores: (Standalone and Mesos only) Total cores for all executors (default: all available cores on the worker).
executor_cores: (Standalone, YARN, and Kubernetes only) Number of cores per executor (default: 2).
executor_memory: Memory per executor, e.g. 1000M, 2G (default: 1G).
driver_memory: Memory allocated to the driver, e.g. 1000M, 2G (default: 1G).
keytab: Full path to the file that contains the keytab.
principal: The name of the Kerberos principal used for the keytab.
proxy_user: User to impersonate when submitting the application.
name: Name of the job (default: airflow-spark).
num_executors: Number of executors to launch.
status_poll_interval: Seconds to wait between polls of driver status in cluster mode (default: 1).
application_args: Arguments for the application being submitted.
env_vars: Environment variables for spark-submit. Supported in YARN and Kubernetes modes as well.
verbose: Whether to pass the verbose flag to the spark-submit process for debugging.
spark_binary: The command to use for spark submit; some distributions may use spark2-submit.
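
As a hedged illustration of how these parameters come together (all concrete values below, such as memory sizes, the Maven coordinate, paths, and arguments, are assumptions for the example rather than recommendations):

from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook

# Sketch of a more fully configured hook. Every concrete value here
# (memory sizes, Maven coordinate, paths, arguments) is illustrative only.
hook = SparkSubmitHook(
    conf={"spark.sql.shuffle.partitions": "200"},        # arbitrary Spark configuration properties
    packages="org.apache.spark:spark-avro_2.12:3.3.2",   # Maven coordinates added to driver and executor classpaths
    executor_cores=2,
    executor_memory="2G",
    driver_memory="1G",
    num_executors=4,
    name="airflow-spark-etl",
    env_vars={"SPARK_ENV": "prod"},                      # environment variables for spark-submit
    application_args=["--run-date", "2023-03-22"],       # forwarded to the submitted application
    verbose=True,
)

# submit() assembles the spark-submit command from these arguments and runs it.
hook.submit(application="/path/to/etl_job.py")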
