Access Instructions
Install the Qubole provider package into your Airflow environment.
Import the operator into your DAG file and instantiate it with your desired parameters.
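A minimal sketch of these two steps, assuming Airflow 2.x with the apache-airflow-providers-qubole package installed and a Qubole connection named qubole_default configured in Airflow; the DAG id, dates, and query are illustrative placeholders:

    # One-time install (shell): pip install apache-airflow-providers-qubole
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.qubole.operators.qubole import QuboleOperator

    with DAG(
        dag_id="qubole_example",              # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Instantiate the operator with a command type and its command-specific kwargs.
        show_tables = QuboleOperator(
            task_id="hive_show_tables",
            command_type="hivecmd",
            query="SHOW TABLES",
            qubole_conn_id="qubole_default",  # Airflow connection holding the Qubole auth token
        )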
Documentation
Execute tasks (commands) on QDS (https://qubole.com).
See also
For more information on how to use this operator, take a look at the guide: Execute tasks
- kwargs:
- command_type
type of command to be executed, e.g. hivecmd, shellcmd, hadoopcmd
- tags
array of tags to be assigned to the command
- cluster_label
cluster label on which the command will be executed
- name
name to be given to the command
- notify
whether to send email on command completion or not (default is False)
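For illustration, the common kwargs above might be combined on a single task (reusing the import and DAG context from the Access Instructions sketch); the query, cluster label, tags, and name are placeholders, and the list-valued tags follows the "array of tags" description above:

    daily_report = QuboleOperator(
        task_id="daily_report",
        command_type="hivecmd",
        query="SELECT COUNT(*) FROM events",   # placeholder inline query
        cluster_label="default",               # placeholder cluster label
        tags=["airflow", "daily"],             # tags assigned to the command
        name="daily_event_count",              # name given to the command
        notify=True,                           # send email on command completion
    )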
Arguments specific to command types
- hivecmd:
- query
inline query statement
- script_location
s3 location containing query statement
- sample_size
size of sample in bytes on which to run query
- macros
macro values used in the query
- hive-version
Specifies the Hive version to be used, e.g. 0.13, 1.2, etc.
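As a hedged hivecmd sketch (the S3 path, sample size, and cluster label are hypothetical), a query stored in S3 might be run as:

    hive_from_s3 = QuboleOperator(
        task_id="hive_from_s3",
        command_type="hivecmd",
        script_location="s3://my-bucket/scripts/report.hql",  # hypothetical S3 script
        sample_size=1048576,                                   # run the query on a 1 MiB sample
        cluster_label="hive-cluster",                          # hypothetical cluster label
    )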
- prestocmd:
- query
inline query statement
- script_location
s3 location containing query statement
- macros
macro values used in the query
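A minimal prestocmd sketch (the query and cluster label are placeholders):

    presto_show_tables = QuboleOperator(
        task_id="presto_show_tables",
        command_type="prestocmd",
        query="SHOW TABLES",              # inline Presto query
        cluster_label="presto-cluster",   # hypothetical cluster label
    )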
- hadoopcmd:
- sub_command
must be one of ["jar", "s3distcp", "streaming"], followed by one or more args
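A hedged hadoopcmd sketch; the jar and S3 paths are placeholders, and sub_command carries the sub-command name followed by its args in one string:

    hadoop_jar = QuboleOperator(
        task_id="hadoop_jar",
        command_type="hadoopcmd",
        # "jar" sub-command followed by its arguments
        sub_command="jar s3://my-bucket/jars/wordcount.jar "
                    "-input s3://my-bucket/input -output s3://my-bucket/output",
        cluster_label="hadoop-cluster",   # hypothetical cluster label
    )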
- shellcmd:
- script
inline command with args
- script_location
s3 location containing the shell script
- files
list of files in s3 bucket as file1,file2 format. These files will be copied into the working directory where the qubole command is being executed.
- archives
list of archives in s3 bucket as archive1,archive2 format. These will be unarchived into the working directory where the qubole command is being executed.
- parameters
any extra args which need to be passed to script (only when script_location is supplied)
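A hedged shellcmd sketch (the S3 paths and parameters are placeholders):

    shell_task = QuboleOperator(
        task_id="shell_task",
        command_type="shellcmd",
        script_location="s3://my-bucket/scripts/cleanup.sh",  # hypothetical shell script in S3
        parameters="2023-01-01 full",                         # extra args passed to the script
        files="s3://my-bucket/conf/settings.conf",            # copied into the working directory
    )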
- pigcmd:
- script
inline Pig Latin statements (latin_statements)
- script_location
s3 location containing pig query
- parameters
any extra args which need to be passed to script (only when script_location is supplied)
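A minimal pigcmd sketch (the S3 path and parameters are placeholders):

    pig_task = QuboleOperator(
        task_id="pig_task",
        command_type="pigcmd",
        script_location="s3://my-bucket/scripts/transform.pig",  # hypothetical Pig script in S3
        parameters="key1=value1 key2=value2",                    # extra args passed to the script
    )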
- sparkcmd:
- program
the complete Spark Program in Scala, R, or Python
- cmdline
spark-submit command line; all required arguments must be specified in cmdline itself.
- sql
inline sql query
- script_location
s3 location containing query statement
- language
language of the program, Scala, R, or Python
- app_id
ID of a Spark job server app
- arguments
spark-submit command line arguments. If cmdline is selected, this should not be used because all required arguments and configurations are to be passed in the cmdline itself.
- user_program_arguments
arguments that the user program takes in
- macros
macro values used in the query
- note_id
ID of the notebook to run
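A hedged sparkcmd sketch running an inline Scala program; the program body, arguments, and cluster label are placeholders:

    spark_program = '''
    object SparkPi {
      def main(args: Array[String]): Unit = println("placeholder Spark job")
    }
    '''

    spark_task = QuboleOperator(
        task_id="spark_task",
        command_type="sparkcmd",
        program=spark_program,            # complete Spark program (Scala here)
        language="scala",                 # language of the program
        arguments="--class SparkPi",      # spark-submit arguments (not used with cmdline)
        cluster_label="spark-cluster",    # hypothetical cluster label
    )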
- dbtapquerycmd:
- db_tap_id
data store ID of the target database, in Qubole.
- query
inline query statement
- macros
macro values used in the query
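A minimal dbtapquerycmd sketch (the data store ID and query are placeholders):

    db_query = QuboleOperator(
        task_id="db_query",
        command_type="dbtapquerycmd",
        db_tap_id=2064,           # hypothetical Qubole data store ID
        query="SHOW TABLES",      # inline query against the data store
    )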
- dbexportcmd:
- mode
Can be 1 for Hive export or 2 for HDFS/S3 export
- schema
Db schema name; if not specified, it is assumed accordingly by the database
- hive_table
Name of the hive table
- partition_spec
partition specification for Hive table.
- dbtap_id
data store ID of the target database, in Qubole.
- db_table
name of the db table
- db_update_mode
allowinsert or updateonly
- db_update_keys
columns used to determine the uniqueness of rows
- export_dir
HDFS/S3 location from which data will be exported.
- fields_terminated_by
hex of the char used as column separator in the dataset
- use_customer_cluster
whether to use the customer cluster to run the command
- customer_cluster_label
the label of the cluster to run the command on
- additional_options
Additional Sqoop options, if needed; enclose the options in double or single quotes, e.g. '--map-column-hive id=int,data=string'
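A hedged dbexportcmd sketch for a Hive-table export (mode 1); the table names, partition spec, and data store ID are placeholders:

    db_export = QuboleOperator(
        task_id="db_export",
        command_type="dbexportcmd",
        mode=1,                              # 1 = Hive export
        hive_table="airline_trips",          # hypothetical Hive table
        partition_spec="dt=20230101",        # hypothetical partition specification
        dbtap_id=2064,                       # hypothetical Qubole data store ID
        db_table="exported_airline_trips",   # target table in the database
        db_update_mode="allowinsert",        # insert rows rather than update-only
    )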
- dbimportcmd:
- mode
1 (simple), 2 (advanced)
- hive_table
Name of the hive table
- schema
Db schema name; if not specified, it is assumed accordingly by the database
- hive_serde
Output format of the Hive Table
- dbtap_id
data store ID of the target database, in Qubole.
- db_table
name of the db table
- where_clause
where clause, if any
- parallelism
number of parallel db connections to use for extracting data
- extract_query
SQL query to extract data from db. $CONDITIONS must be part of the where clause.
- boundary_query
Query used to get the range of row IDs to be extracted
- split_column
Column used as row ID to split data into ranges (mode 2)
- use_customer_cluster
whether to use the customer cluster to run the command
- customer_cluster_label
the label of the cluster to run the command on
- additional_options
Additional Sqoop options, if needed; enclose the options in double or single quotes
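A hedged dbimportcmd sketch using simple mode (mode 1); the table names, data store ID, and where clause are placeholders:

    db_import = QuboleOperator(
        task_id="db_import",
        command_type="dbimportcmd",
        mode=1,                          # 1 = simple mode
        dbtap_id=2064,                   # hypothetical Qubole data store ID
        db_table="orders",               # source table in the database
        hive_table="orders_imported",    # target Hive table
        where_clause="id < 100000",      # optional filter on the source table
        parallelism=2,                   # parallel db connections for extraction
    )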
- jupytercmd:
- path
Path of the Jupyter notebook to be run, including the notebook name and extension.
- arguments
Valid JSON to be sent to the notebook. Specify the parameters in the notebook and pass their values using the JSON format, where the key is the parameter's name and the value is the parameter's value. Supported parameter types are string, integer, float, and boolean.
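A hedged jupytercmd sketch; the notebook path, cluster label, and parameter JSON are placeholders:

    jupyter_task = QuboleOperator(
        task_id="jupyter_task",
        command_type="jupytercmd",
        cluster_label="jupyter-cluster",             # hypothetical cluster label
        path="Users/me/notebooks/report.ipynb",      # hypothetical notebook path, with extension
        arguments='{"run_date": "2023-01-01"}',      # JSON parameters passed to the notebook
    )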