SparkJDBCOperator

Apache Spark

This operator extends the SparkSubmitOperator specifically for performing data transfers to/from JDBC-based databases with Apache Spark. As with the SparkSubmitOperator, it assumes that the “spark-submit” binary is available on the PATH.
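Because the operator relies on the spark-submit binary being on the PATH, it can help to check for it on the worker before scheduling a job. The following is a minimal sketch using only the Python standard library; the helper name `find_spark_submit` is hypothetical and not part of the provider.

```python
import shutil
from typing import Optional


def find_spark_submit(binary: str = "spark-submit") -> Optional[str]:
    """Return the resolved path of the binary, or None if it is not on PATH."""
    return shutil.which(binary)


path = find_spark_submit()
if path is None:
    print("spark-submit was not found on PATH; the Spark job could not be launched")
else:
    print(f"spark-submit found at {path}")
```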

Last Updated: Oct. 23, 2022

Access Instructions

Install the Apache Spark provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired parameters.
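The two steps above might look like the sketch below. The provider is typically installed with `pip install apache-airflow-providers-apache-spark`. Every value in the dictionary (connection IDs, table names, driver class, jar path) is a hypothetical placeholder, and the helper `build_transfer_task` is an illustration rather than part of the provider. The import is placed inside the function only so that the snippet can be read and checked without Airflow installed; in a real DAG file it would normally sit at the top.

```python
# Hypothetical values throughout: replace them with your own connections,
# tables, driver class, and jar location.
JDBC_TO_SPARK_KWARGS = {
    "task_id": "jdbc_to_spark_orders",
    "cmd_type": "jdbc_to_spark",               # read from JDBC, write to the metastore
    "spark_conn_id": "spark_default",
    "jdbc_conn_id": "jdbc_default",
    "jdbc_table": "public.orders",
    "jdbc_driver": "org.postgresql.Driver",
    "spark_jars": "/opt/jars/postgresql.jar",  # jar that contains the JDBC driver
    "metastore_table": "staging.orders",
    "save_mode": "overwrite",
    "save_format": "parquet",
    "num_executors": 2,                        # keeps the JDBC connection count low
}


def build_transfer_task(dag):
    """Instantiate the operator on the given DAG using the settings above."""
    # Local import: requires the Apache Spark provider package at call time.
    from airflow.providers.apache.spark.operators.spark_jdbc import SparkJDBCOperator

    return SparkJDBCOperator(dag=dag, **JDBC_TO_SPARK_KWARGS)
```

Keeping the arguments in a plain dictionary makes it easy to reuse the same settings for several tables by overriding only `jdbc_table` and `metastore_table`.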

Parameters

spark_app_name: Name of the job (default: airflow-spark-jdbc).

spark_conn_id: The Spark connection ID as configured in Airflow administration.

spark_conf: Any additional Spark configuration properties.

spark_py_files: Additional Python files used (.zip, .egg, or .py).

spark_files: Additional files to upload to the container running the job.

spark_jars: Additional jars to upload and add to the driver and executor classpath.

num_executors: Number of executors to run. Set this to limit the number of connections made to the JDBC database.

executor_cores: Number of cores per executor.

executor_memory: Memory per executor (e.g. 1000M, 2G).

driver_memory: Memory allocated to the driver (e.g. 1000M, 2G).

verbose: Whether to pass the verbose flag to spark-submit for debugging.

keytab: Full path to the file that contains the keytab.

principal: The name of the Kerberos principal used for the keytab.

cmd_type: Which way the data should flow. Two values are possible: spark_to_jdbc (data written by Spark from the metastore to JDBC) and jdbc_to_spark (data written by Spark from JDBC to the metastore).

jdbc_table: The name of the JDBC table.

jdbc_conn_id: Connection ID used for the connection to the JDBC database.

jdbc_driver: Name of the JDBC driver to use for the JDBC connection. This driver (usually a jar) should be passed in the spark_jars parameter.

metastore_table: The name of the metastore table.

jdbc_truncate: (spark_to_jdbc only) Whether Spark should truncate the JDBC table or drop and recreate it. This only takes effect if save_mode is set to overwrite. If the schema is different, Spark cannot truncate and will drop and recreate the table.

save_mode: The Spark save mode to use (e.g. overwrite, append).

save_format: (jdbc_to_spark only) The Spark save format to use (e.g. parquet).

batch_size: (spark_to_jdbc only) The size of the batch to insert per round trip to the JDBC database. Defaults to 1000.

fetch_size: (jdbc_to_spark only) The size of the batch to fetch per round trip from the JDBC database. The default depends on the JDBC driver.

num_partitions: The maximum number of partitions that Spark can use simultaneously, for both spark_to_jdbc and jdbc_to_spark operations. This also caps the number of JDBC connections that can be opened.

partition_column: (jdbc_to_spark only) A numeric column used to partition the metastore table. If specified, you must also specify num_partitions, lower_bound, and upper_bound.

lower_bound: (jdbc_to_spark only) Lower bound of the range of the numeric partition column to fetch. If specified, you must also specify num_partitions, partition_column, and upper_bound.

upper_bound: (jdbc_to_spark only) Upper bound of the range of the numeric partition column to fetch. If specified, you must also specify num_partitions, partition_column, and lower_bound.

create_table_column_types: (spark_to_jdbc only) The database column data types to use instead of the defaults when creating the table. Specify the data type information in the same format as CREATE TABLE column syntax (e.g. “name CHAR(64), comments VARCHAR(1024)”). The specified types should be valid Spark SQL data types.
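The partitioning parameters above (partition_column, lower_bound, upper_bound, and num_partitions) have to be supplied together and apply only to jdbc_to_spark transfers. A small pre-flight check can catch an incomplete combination before the job is submitted. This is a sketch under those stated rules only; the helper `check_partition_args` is hypothetical and does not reproduce any validation the operator itself may perform.

```python
PARTITION_KEYS = ("partition_column", "lower_bound", "upper_bound")


def check_partition_args(kwargs: dict) -> None:
    """Raise ValueError if the partitioning parameters are inconsistent."""
    given = [key for key in PARTITION_KEYS if kwargs.get(key) is not None]
    if not given:
        return  # no partitioning requested, nothing to check

    # Once any of the three is set, all three plus num_partitions are needed.
    required = PARTITION_KEYS + ("num_partitions",)
    missing = [key for key in required if kwargs.get(key) is None]
    if missing:
        raise ValueError(
            f"{', '.join(given)} set, but missing: {', '.join(missing)}"
        )

    # The documentation marks these parameters as jdbc_to_spark only.
    if kwargs.get("cmd_type") != "jdbc_to_spark":
        raise ValueError("partitioning parameters apply to jdbc_to_spark only")


# Example with hypothetical values: a complete, valid combination.
check_partition_args(
    {
        "cmd_type": "jdbc_to_spark",
        "partition_column": "id",
        "lower_bound": "0",
        "upper_bound": "1000000",
        "num_partitions": 8,
    }
)
```

Calling the helper on the keyword arguments before constructing the operator surfaces a missing bound at DAG parse time rather than inside the Spark job.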