Tutorial Redshift DAG
Run an ETL pipeline that resumes a Redshift cluster, extracts data from S3, transforms data, loads data back to S3, then pauses the cluster.
Customer Success Redshift Tutorial
This DAG demonstrates how to use the following:
- RedshiftSQLOperator
- RedshiftPauseClusterOperator
- RedshiftResumeClusterOperator
- RedshiftClusterSensor
- RedshiftToS3Operator
- S3ToRedshiftOperator
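To show how these pieces fit together, here is a minimal sketch of the resume → load → transform → unload → pause flow. It is not the repo's `redshift_example_dag.py`: the cluster identifier, bucket, schema, table, and SQL are hypothetical placeholders, and the import paths can vary slightly between Amazon provider versions.

```python
# Minimal sketch only -- placeholder names throughout, not the repo's redshift_example_dag.py.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_cluster import (
    RedshiftPauseClusterOperator,
    RedshiftResumeClusterOperator,
)
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
from airflow.providers.amazon.aws.sensors.redshift_cluster import RedshiftClusterSensor
from airflow.providers.amazon.aws.transfers.redshift_to_s3 import RedshiftToS3Operator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

CLUSTER_ID = "my-redshift-cluster"  # placeholder
S3_BUCKET = "my-etl-bucket"         # placeholder
SCHEMA = "public"                   # placeholder
TABLE = "my_table"                  # placeholder

with DAG(
    dag_id="redshift_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Wake the cluster up before doing any work
    resume = RedshiftResumeClusterOperator(
        task_id="resume_cluster",
        cluster_identifier=CLUSTER_ID,
        aws_conn_id="aws_default",
    )

    # Wait until the cluster reports "available"
    wait_for_cluster = RedshiftClusterSensor(
        task_id="wait_for_cluster",
        cluster_identifier=CLUSTER_ID,
        target_status="available",
        aws_conn_id="aws_default",
    )

    # Load a file from S3 into Redshift
    load = S3ToRedshiftOperator(
        task_id="s3_to_redshift",
        s3_bucket=S3_BUCKET,
        s3_key=f"input/{TABLE}.csv",
        schema=SCHEMA,
        table=TABLE,
        copy_options=["csv"],
        redshift_conn_id="redshift_default",
        aws_conn_id="aws_default",
    )

    # Run a transformation inside Redshift
    transform = RedshiftSQLOperator(
        task_id="transform",
        sql=f"CREATE TABLE IF NOT EXISTS {SCHEMA}.{TABLE}_clean AS SELECT * FROM {SCHEMA}.{TABLE};",
        redshift_conn_id="redshift_default",
    )

    # Unload the transformed table back to S3
    unload = RedshiftToS3Operator(
        task_id="redshift_to_s3",
        s3_bucket=S3_BUCKET,
        s3_key=f"output/{TABLE}_clean",
        schema=SCHEMA,
        table=f"{TABLE}_clean",
        redshift_conn_id="redshift_default",
        aws_conn_id="aws_default",
    )

    # Pause the cluster again to save cost
    pause = RedshiftPauseClusterOperator(
        task_id="pause_cluster",
        cluster_identifier=CLUSTER_ID,
        aws_conn_id="aws_default",
    )

    resume >> wait_for_cluster >> load >> transform >> unload >> pause
```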
Prerequisites
- Astro CLI or Astrocloud CLI
- An accessible Redshift cluster
- An account with read/write access to an S3 bucket
- An Airflow instance (if you plan on deploying)
Steps to Use
If you are using the `astro` CLI instead of the `astrocloud` CLI, you can simply replace `astrocloud` in the commands below with `astro`.
Run the following in your terminal:
```sh
git clone git@github.com:astronomer/cs-tutorial-redshift.git
cd cs-tutorial-redshift
astrocloud dev start
```
Add `redshift_default` & `aws_default` connections to your sandbox
Go to your sandbox at http://localhost:8080/home
Navigate to Connections (i.e. Admin >> Connections)
Add a new connection with the following parameters:
- Connection Id: redshift_default
- Connection Type: Amazon Redshift
- Host: <Your-Redshift-Endpoint>
- Schema: <Your-Redshift-Database>
- Login: <Your-Redshift-Login>
- Password: <Your-Redshift-Password>
- Port: <Your-Redshift-Port>
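If you want to confirm this connection works before triggering the DAG, a quick check from a Python shell inside your Airflow environment might look like the sketch below. `RedshiftSQLHook` ships with the Amazon provider; the query is just an example.

```python
# Optional sanity check for the redshift_default connection (not part of the tutorial repo).
from airflow.providers.amazon.aws.hooks.redshift_sql import RedshiftSQLHook

hook = RedshiftSQLHook(redshift_conn_id="redshift_default")
# Returns the first row of the result, e.g. ('your_database',)
print(hook.get_first("SELECT current_database();"))
```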
Add another connection with the following parameters:
- Connection Id: aws_default
- Connection Type: Amazon Web Services
- Extra: {"aws_access_key_id": "", "aws_secret_access_key": "", "region_name": ""}
In order to use all of the components from this POC, the account associated with your `aws_default` connection will need the following permissions:
- Access to perform read/write actions for a pre-configured S3 bucket
- Access to interact with the Redshift cluster, specifically:
  - redshift:DescribeClusters
  - redshift:PauseCluster
  - redshift:ResumeCluster
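To spot-check that the `aws_default` credentials actually have this access, you could run something like the following from a Python shell inside your Airflow environment. The bucket and cluster names are placeholders, and the hook import paths may vary by Amazon provider version.

```python
# Optional permission check for the aws_default connection (placeholder names).
from airflow.providers.amazon.aws.hooks.redshift_cluster import RedshiftHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Confirms the credentials can reach the S3 bucket used by the DAG
s3 = S3Hook(aws_conn_id="aws_default")
print(s3.check_for_bucket("my-etl-bucket"))             # True if the bucket is reachable

# Confirms redshift:DescribeClusters works against your cluster
redshift = RedshiftHook(aws_conn_id="aws_default")
print(redshift.cluster_status("my-redshift-cluster"))   # e.g. "available" or "paused"
```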
Replace variables with values from S3 and Redshift Cluster
In `redshift_example_dag.py`, you'll need to replace variables like `cluster_identifier`, `from_redshift_table`, `s3_bucket`, `schema`, and `table` with the corresponding values that actually exist in your Redshift cluster and S3 storage.
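For illustration only, the replaced values might look something like this; every value below is hypothetical, so substitute resources that exist in your own account:

```python
# Hypothetical replacement values -- substitute your own resources.
cluster_identifier = "my-redshift-cluster"     # the cluster the DAG resumes/pauses
from_redshift_table = "public.my_table_clean"  # table unloaded back to S3
s3_bucket = "my-etl-bucket"                    # bucket used for both load and unload
schema = "public"                              # target schema for the S3 load
table = "my_table"                             # target table for the S3 load
```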
After following these steps, you should be able to run the tasks in the `redshift_example_dag`. Enjoy!