Apache Airflow Provider - XLSX

XLSX

CommunityData Processing

An Apache Airflow provider for interacting with Microsoft Excel.

Version
1.0.1
Downloads
1,324/month
Last Published
Mar. 25, 2022
Quick Install

Airflow Provider XLSX

Apache Airflow operators for converting XLSX files from/to Parquet, CSV and JSON.

Build Status PyPI version PyPI Downloads Code style: black

System Requirements

  • Airflow Versions
    • 2.0 or newer

Installation

$ pip install airflow-provider-xlsx

Operators

FromXLSXOperator

Read an XLSX or XLS file and convert it into Parquet, CSV, JSON, JSON Lines(one line per record) file.

API Documentation

Example

XLSX Source

image

Airflow Task

from xlsx_provider.operators.from_xlsx_operator import FromXLSXOperator
xlsx_to_jsonl = FromXLSXOperator(
task_id='xlsx_to_jsonl',
source='{{ var.value.tmp_path }}/test.xlsx',
target='{{ var.value.tmp_path }}/test.jsonl',
file_format='jsonl',
dag=dag
)

JSON Lines Output

{"month": "Jan", "high": -12.2, "mean": -16.2, "low": -20.1, "precipitation": 19}
{"month": "Feb", "high": -10.3, "mean": -14.7, "low": -19.1, "precipitation": 14}
{"month": "Mar", "high": -2.6, "mean": -7.2, "low": -11.8, "precipitation": 15}
{"month": "Apr", "high": 8.1, "mean": 3.2, "low": -1.7, "precipitation": 24}
{"month": "May", "high": 17.5, "mean": 11.6, "low": 5.6, "precipitation": 36}
{"month": "Jun", "high": 24, "mean": 18.2, "low": 12.3, "precipitation": 58}
{"month": "Jul", "high": 25.7, "mean": 20.2, "low": 14.7, "precipitation": 72}
{"month": "Aug", "high": 22.2, "mean": 17, "low": 11.7, "precipitation": 66}
{"month": "Sep", "high": 16.6, "mean": 11.5, "low": 6.4, "precipitation": 44}
{"month": "Oct", "high": 6.8, "mean": 3.4, "low": 0, "precipitation": 38}

FromXLSXQueryOperator

Execute an SQL query an XLSX/XLS file and export the result into a Parquet or CSV file

This operators loads an XLSX or XLS file into an in-memory SQLite database, executes a query on the db and stores the result into a Parquet, CSV, JSON, JSON Lines(one line per record) file. The output columns names and types are determinated by the SQL query output.

API Documentation

Example

XLSX Source

image

SQL Query

select
g as high_tech_sector,
h as eur_bilion,
i as share
from
high_tech
where
_index > 1
and high_tech_sector <> ''
and lower(high_tech_sector) <> 'total'

Airflow Task

from xlsx_provider.operators.from_xlsx_query_operator import FromXLSXQueryOperator
xlsx_to_csv = FromXLSXQueryOperator(
task_id='xlsx_to_csv',
source='{{ var.value.tmp_path }}/high_tech.xlsx',
target='{{ var.value.tmp_path }}/high_tech.parquet',
file_format='csv',
csv_delimiter=',',
table_name='high_tech',
worksheet='Figure 3',
query='''
select
g as high_tech_sector,
h as eur_bilion,
i as share
from
high_tech
where
_index > 1
and high_tech_sector <> ''
and lower(high_tech_sector) <> 'total'
''',
dag = dag
)

Output

high_tech_sector,value,share
Pharmacy,78280,0.231952169555313
Electronics-telecommunications,75243,0.222954583130376
Scientific instruments,64010,0.189670433253542
Aerospace,44472,0.131776952366115
Computers office machines,21772,0.0645136852766778
Non-electrical machinery,20813,0.0616714981835167
Chemistry,19776,0.058598734453222
Electrical machinery,9730,0.028831912195612
Armament,3384,0.0100300315856265

ToXLSXOperator

Read a Parquest, CSV, JSON, JSON Lines(one line per record) file and convert it into XLSX.

API Documentation

Example
from xlsx_provider.operators.to_xlsx_operator import ToXLSXOperator
parquet_to_xlsx = ToXLSXOperator(
task_id='parquet_to_xlsx',
source='{{ var.value.tmp_path }}/test.parquet',
target='{{ var.value.tmp_path }}/test.xlsx',
dag=dag
)

Links