S3ToHiveOperator

Apache Hive

Moves data from S3 to Hive. The operator downloads a file from S3 and stores it locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor’s metadata.

Last Updated: Mar. 21, 2023

Access Instructions

Install the Apache Hive provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
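As a minimal sketch of the two steps above, the snippet below imports the operator from the Apache Hive provider and instantiates it inside a DAG. The bucket, key, table name, column types, DAG id, and the helper names s3_to_hive_kwargs and build_dag are hypothetical placeholders. The connection ids shown are assumed to be the provider defaults, the s3:// URL form of s3_key is an assumption, and schedule=None assumes Airflow 2.4 or later.

```python
from datetime import datetime


def s3_to_hive_kwargs():
    """Keyword arguments for S3ToHiveOperator; every value is a placeholder."""
    return {
        "task_id": "s3_to_hive_staging",
        # Templated: rendered per run. The s3:// URL form is an assumption.
        "s3_key": "s3://example-bucket/exports/{{ ds }}/orders.csv",
        # Column names in the file mapped to their Hive types.
        "field_dict": {"order_id": "BIGINT", "customer": "STRING", "amount": "DOUBLE"},
        # Dot notation targets the (hypothetical) "staging" database.
        "hive_table": "staging.orders",
        "delimiter": ",",
        "create": True,
        "recreate": False,
        "partition": {"ds": "{{ ds }}"},
        # The first line holds column names and is checked against field_dict.
        "headers": True,
        "check_headers": True,
        "aws_conn_id": "aws_default",
        "hive_cli_conn_id": "hive_cli_default",
    }


def build_dag():
    # Imported lazily so the arguments above can be inspected without Airflow.
    from airflow import DAG
    from airflow.providers.apache.hive.transfers.s3_to_hive import S3ToHiveOperator

    with DAG(
        dag_id="example_s3_to_hive",
        start_date=datetime(2023, 3, 21),
        schedule=None,
        catchup=False,
    ) as dag:
        S3ToHiveOperator(**s3_to_hive_kwargs())
    return dag
```

In a real DAG file you would typically assign the result at module level, for example dag = build_dag(), so that the scheduler can discover it.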

Parameters

s3_key (required): The key to be retrieved from S3. (templated)

field_dict (required): A dictionary with the field names in the file as keys and their Hive types as values.

hive_table (required): Target Hive table; use dot notation to target a specific database. (templated)

delimiter: Field delimiter in the file.

create: Whether to create the table if it doesn’t exist.

recreate: Whether to drop and recreate the table at every execution.

partition: Target partition as a dict of partition columns and values. (templated)

headers: Whether the file contains column names on the first line.

check_headers: Whether the column names on the first line should be checked against the keys of field_dict.

wildcard_match: Whether the s3_key should be interpreted as a Unix wildcard pattern.

aws_conn_id: Source S3 connection.

verify: Whether to verify SSL certificates for the S3 connection. By default, SSL certificates are verified. You can provide the following values: False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified. path/to/cert/bundle.pem: a filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.

hive_cli_conn_id: Reference to the Hive CLI connection id.

input_compressed: Boolean to determine whether file decompression is required to process headers.

tblproperties: TBLPROPERTIES of the Hive table being created.

select_expression: S3 Select expression.
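To make the wildcard_match and check_headers descriptions more concrete, here is a small standalone sketch of what each option means. It is an illustration only and not the provider's implementation: match_keys and headers_match are hypothetical helpers, the key names are made up, and the case-insensitive header comparison is an assumption.

```python
from fnmatch import fnmatchcase


def match_keys(keys, pattern):
    """Return the keys that match a Unix-style wildcard pattern."""
    return [key for key in keys if fnmatchcase(key, pattern)]


def headers_match(header_line, field_dict, delimiter=","):
    """Check the first line's column names against field_dict keys, in order.

    The comparison ignores case, which is an assumption of this sketch.
    """
    names = [name.strip().lower() for name in header_line.rstrip("\r\n").split(delimiter)]
    expected = [name.lower() for name in field_dict]
    return names == expected


# Hypothetical object keys in a bucket.
keys = [
    "exports/2023-03-21/orders_01.csv",
    "exports/2023-03-21/orders_01.csv.gz",
    "exports/2023-04-01/orders_01.csv",
]
march_csv = match_keys(keys, "exports/2023-03-*/orders_*.csv")

fields = {"order_id": "BIGINT", "customer": "STRING", "amount": "DOUBLE"}
header_ok = headers_match("order_id,customer,amount\n", fields)
```

Note that in Python's fnmatch, * also matches the / character, so a single * can span several path segments of a key.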

Documentation

Note that the table generated in Hive uses STORED AS textfile, which isn’t the most efficient serialization format. If a large amount of data is loaded and/or the table is queried heavily, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
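The staging pattern described above could be sketched as follows: S3ToHiveOperator loads the text-format staging table, and a HiveOperator then copies the rows into the final table. The helper below only builds the HiveQL string for that second step. The function name and table names are hypothetical, and the sketch assumes the final table already exists in a more efficient format (for example ORC) and that the staging table is unpartitioned and holds only the current load.

```python
def staging_to_final_hql(staging_table, final_table, columns, partition=None):
    """Build an INSERT OVERWRITE statement that copies staging rows to the final table."""
    column_list = ", ".join(columns)
    partition_clause = ""
    if partition:
        # Static partition spec, e.g. PARTITION (ds='2023-03-21').
        pairs = ", ".join(f"{name}='{value}'" for name, value in partition.items())
        partition_clause = f" PARTITION ({pairs})"
    return (
        f"INSERT OVERWRITE TABLE {final_table}{partition_clause}\n"
        f"SELECT {column_list}\n"
        f"FROM {staging_table};"
    )


hql = staging_to_final_hql(
    staging_table="staging.orders",
    final_table="warehouse.orders",
    columns=["order_id", "customer", "amount"],
    partition={"ds": "2023-03-21"},
)
```

The resulting string can be passed as the hql argument of a HiveOperator task placed downstream of the S3ToHiveOperator task.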