Data Science Modeling Using Google Cloud Storage XCom Backend

A sample data science pipeline demonstrating extraction from BigQuery to modeling that uses an XCom backend in Google Cloud Storage to pass intermediary data between tasks.

Tags: Data Science, ETL/ELT


Last Updated: Oct. 1, 2021

Run this DAG

1. Install the Astronomer CLI (skip this step if you already have the CLI installed):

2. Download the repository:

3. Navigate to where the repository was cloned and start the DAG:

Example DAGs for Data Science and Machine Learning Use Cases

These examples are meant to be a guide/scaffold for Data Science and Machine Learning pipelines that can be implemented in Airflow.

To keep the examples easy to follow, much of the data processing and modeling code has intentionally been kept simple.

Examples

  1. xcom_gcs_ds.py - A simple DS pipeline from data extraction to modeling.

    • Pulls data from BigQuery into a dataframe using the Google provider's BigQueryHook, then preps the data, trains, and builds the model
    • Data is passed between the tasks using XComs
    • Uses GCS as an XCom backend to easily track intermediary data in a scalable, external system
  2. xcom_gcs_ds_k8sExecutor.py - A simple DS pipeline from data extraction to modeling that leverages the flexibility of the Kubernetes Executor.

    • This DAG can only be used with the Kubernetes Executor.
    • Same components as example #1, except that each task is now executed in its own pod with custom configs.
    • Uses pod_override to allocate more resources to tasks that need them for proper or faster execution.