HiveStatsCollectionOperator

Apache Hive

Gathers partition statistics using a dynamically generated Presto query and inserts them into a MySQL table (the table format is given under Documentation). Stats are overwritten if you rerun the same date/partition.

Last Updated: Oct. 26, 2022

Access Instructions

Install the Apache Hive provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
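
As a rough illustration of the two steps above, the sketch below assumes the provider was installed from the apache-airflow-providers-apache-hive package and that the operator lives in airflow.providers.apache.hive.operators.hive_stats. The task id, table name, partition and excluded column are hypothetical placeholders, and passing the partition as a column-to-value mapping is an assumption that should be checked against your provider version.

```python
# Hypothetical values throughout: the task id, table name, partition and
# excluded column are placeholders, not names taken from the provider docs.
STATS_TASK_KWARGS = {
    "task_id": "collect_hive_stats",
    "table": "my_db.my_table",        # database.table_name (templated)
    "partition": {"ds": "{{ ds }}"},  # assumed shape: partition column -> value
    "excluded_columns": ["raw_payload"],
}


def build_stats_task(dag=None):
    """Create the operator; call this from a DAG file where Airflow is installed."""
    # Imported lazily so this sketch can be loaded without Airflow present.
    # In a real DAG file the import would normally sit at the top of the module.
    from airflow.providers.apache.hive.operators.hive_stats import (
        HiveStatsCollectionOperator,
    )

    return HiveStatsCollectionOperator(dag=dag, **STATS_TASK_KWARGS)
```

Keeping the arguments in a plain dict makes them easy to inspect or reuse across several tables; passing them directly to the constructor works just as well.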

Parameters

metastore_conn_id: Reference to the Hive Metastore connection id.

table (required): The source table, in the format database.table_name. (templated)

partition (required): The source partition. (templated)

extra_exprs: Dict of expressions to run against the table, where keys are metric names and values are Presto-compatible expressions.

excluded_columns: List of columns to exclude; consider excluding blobs, large JSON columns, …

assignment_func: A function that receives a column name and a type and returns a dict of metric names and Presto expressions. If None is returned, the global defaults are applied. If an empty dictionary is returned, no stats are computed for that column.
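
The three return cases described for assignment_func can be sketched as follows. The function name, the type groupings and the chosen metrics are hypothetical. The keys are written as (column, metric) pairs, an assumption modeled on how the provider's default expressions appear to be keyed; the description above only says "metric names", so check the provider source for your version before relying on this shape.

```python
# Hypothetical type groupings; adjust to the Hive types in your tables.
NUMERIC_TYPES = {"bigint", "int", "double", "float"}
SKIPPED_TYPES = {"binary", "map", "array"}


def assign_exprs(col, col_type):
    """Sketch of an assignment_func: column name and type in, expressions out."""
    if col_type in SKIPPED_TYPES:
        # Empty dict: no stats are computed for this column.
        return {}
    if col_type in NUMERIC_TYPES:
        # Assumed key shape: (column, metric) -> Presto expression.
        return {
            (col, "sum"): f"SUM({col})",
            (col, "min"): f"MIN({col})",
            (col, "max"): f"MAX({col})",
        }
    # None: fall back to the operator's global defaults.
    return None
```

A function like this would be passed as assignment_func=assign_exprs when instantiating the operator.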

Documentation

Gathers partition statistics using a dynamically generated Presto query and inserts them into a MySQL table with the format below. Stats are overwritten if you rerun the same date/partition.

CREATE TABLE hive_stats (
    ds VARCHAR(16),
    table_name VARCHAR(500),
    metric VARCHAR(200),
    value BIGINT
);
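
The overwrite-on-rerun behavior can be pictured as a delete followed by an insert. The sketch below is not the operator's code: it uses an in-memory SQLite database as a stand-in for MySQL, a hypothetical write_stats helper, and made-up dates, table names and values, and it assumes that rows are matched on ds and table_name.

```python
import sqlite3


def write_stats(conn, ds, table_name, stats):
    """Replace any existing rows for (ds, table_name), then insert new metrics."""
    with conn:  # one transaction: the delete and the inserts commit together
        conn.execute(
            "DELETE FROM hive_stats WHERE ds = ? AND table_name = ?",
            (ds, table_name),
        )
        conn.executemany(
            "INSERT INTO hive_stats (ds, table_name, metric, value) "
            "VALUES (?, ?, ?, ?)",
            [(ds, table_name, metric, value) for metric, value in stats.items()],
        )


# SQLite stands in for MySQL here; the schema mirrors the table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hive_stats ("
    " ds VARCHAR(16), table_name VARCHAR(500),"
    " metric VARCHAR(200), value BIGINT)"
)

write_stats(conn, "2024-01-01", "my_db.my_table", {"count": 100})
write_stats(conn, "2024-01-01", "my_db.my_table", {"count": 120})  # rerun

rows = conn.execute("SELECT metric, value FROM hive_stats").fetchall()
print(rows)  # only the values from the second run remain
```

Because the second call deletes the first run's rows before inserting, rerunning the same date leaves a single set of metrics rather than duplicates.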