Tutorial Redshift DAG

Run an ETL pipeline that resumes a Redshift cluster, extracts data from S3, transforms data, loads data back to S3, then pauses the cluster.

Tags: ETL/ELT, Databases



Customer Success Redshift Tutorial

This DAG demonstrates how to use the following:

  • RedshiftSQLOperator
  • RedshiftPauseClusterOperator
  • RedshiftResumeClusterOperator
  • RedshiftClusterSensor
  • RedshiftToS3Operator
  • S3ToRedshiftOperator
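
The repo's redshift_example_dag.py wires these operators into the resume → extract → transform → load → pause flow described above. The snippet below is only a minimal sketch of that pattern, not the actual DAG from the repo: identifiers such as `my-redshift-cluster`, `my-bucket`, and `my_table` are placeholder assumptions you would replace, and the exact import paths can vary between versions of the Amazon provider package.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_cluster import (
    RedshiftPauseClusterOperator,
    RedshiftResumeClusterOperator,
)
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
from airflow.providers.amazon.aws.sensors.redshift_cluster import RedshiftClusterSensor
from airflow.providers.amazon.aws.transfers.redshift_to_s3 import RedshiftToS3Operator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="redshift_example_dag_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Resume the paused cluster, then wait until it reports "available".
    resume_cluster = RedshiftResumeClusterOperator(
        task_id="resume_cluster",
        cluster_identifier="my-redshift-cluster",  # placeholder
    )
    wait_for_cluster = RedshiftClusterSensor(
        task_id="wait_for_cluster",
        cluster_identifier="my-redshift-cluster",  # placeholder
        target_status="available",
    )

    # Extract: COPY a file from S3 into a Redshift table.
    load_from_s3 = S3ToRedshiftOperator(
        task_id="load_from_s3",
        schema="public",
        table="my_table",
        s3_bucket="my-bucket",
        s3_key="input/data.csv",
        copy_options=["CSV"],
    )

    # Transform: run SQL inside Redshift over the redshift_default connection.
    transform = RedshiftSQLOperator(
        task_id="transform",
        sql=[
            "DROP TABLE IF EXISTS public.my_table_agg;",
            "CREATE TABLE public.my_table_agg AS SELECT col, COUNT(*) AS cnt FROM public.my_table GROUP BY col;",
        ],
    )

    # Load: UNLOAD the transformed table back to S3.
    unload_to_s3 = RedshiftToS3Operator(
        task_id="unload_to_s3",
        schema="public",
        table="my_table_agg",
        s3_bucket="my-bucket",
        s3_key="output/my_table_agg_",
    )

    # Pause the cluster again once the pipeline is done.
    pause_cluster = RedshiftPauseClusterOperator(
        task_id="pause_cluster",
        cluster_identifier="my-redshift-cluster",  # placeholder
    )

    resume_cluster >> wait_for_cluster >> load_from_s3 >> transform >> unload_to_s3 >> pause_cluster
```

Note that `redshift_default` and `aws_default` (configured in the connection steps below) are the default connection IDs for these operators, so they don't need to be passed explicitly.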

Prerequisites

  • Astro CLI or Astrocloud CLI
  • Accessible Redshift Cluster
  • Account with read/write access to an S3 Bucket
  • Airflow Instance (If you plan on deploying)

Steps to Use

If you are using the astro CLI instead of the astrocloud CLI, simply replace astrocloud with astro in the commands below.

Run the following in your terminal:

  1. git clone git@github.com:astronomer/cs-tutorial-redshift.git
  2. cd cs-tutorial-redshift
  3. astrocloud dev start

Add redshift_default & aws_default connections to your sandbox

  1. Go to your sandbox at http://localhost:8080/home

  2. Navigate to Connections (i.e. Admin >> Connections)

  3. Add a new connection with the following parameters:

    • Connection Id: redshift_default
    • Connection Type: Amazon Redshift
    • Host: <Your-Redshift-Endpoint>
    • Schema: <Your-Redshift-Database>
    • Login: <Your-Redshift-Login>
    • Password: <Your-Redshift-Password>
    • Port: <Your-Redshift-Port>
  4. Add another connection with the following parameters:

    • Connection Id: aws_default
    • Connection Type: Amazon Web Services
    • Extra: {"aws_access_key_id": "", "aws_secret_access_key": "", "region_name": ""}
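
Once both connections exist, one optional way to sanity-check them (an assumed extra step, not part of the repo) is to exercise the corresponding hooks from a scratch DAG or a Python shell inside the scheduler container. Hook import paths may differ slightly across Amazon provider versions, and the bucket name below is a placeholder.

```python
# Optional sanity check: verify both connection IDs resolve and work.
from airflow.providers.amazon.aws.hooks.redshift_sql import RedshiftSQLHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

redshift = RedshiftSQLHook(redshift_conn_id="redshift_default")
print(redshift.get_first("SELECT 1"))  # expect (1,) if the Redshift connection works

s3 = S3Hook(aws_conn_id="aws_default")
print(s3.check_for_bucket("<your-s3-bucket>"))  # expect True for a reachable bucket
```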

In order to use all of the components from this POC, the account associated with your aws_default connection will need the following permissions:

  • Access to perform read/write actions for a pre-configured S3 Bucket
  • Access to interact with the Redshift Cluster, specifically:
    • redshift:DescribeClusters
    • redshift:PauseCluster
    • redshift:ResumeCluster

Replace variables with values from S3 and Redshift Cluster

In redshift_example_dag.py, you'll need to replace variables such as cluster_identifier, from_redshift_table, s3_bucket, schema, and table with values that actually exist in your Redshift cluster and S3 bucket.
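
For instance, the filled-in variables might end up looking something like this (every value below is a placeholder assumption, not a value from the repo):

```python
# Placeholder values -- replace each with identifiers that actually exist
# in your own Redshift cluster and S3 bucket.
cluster_identifier = "my-redshift-cluster"
schema = "public"
table = "my_table"
from_redshift_table = "public.my_table"
s3_bucket = "my-example-bucket"
```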


After following these steps, you should be able to run the tasks in the redshift_example_dag. Enjoy!