Orchestration of Redshift Stored Procedures using AWS Managed Airflow

I have many Redshift stored procedures (15-20); some can run asynchronously, while many have to run synchronously.
I tried scheduling them in async and sync fashion using AWS EventBridge, but found many limitations (failure handling and orchestration).
I went ahead and used AWS Managed Airflow (MWAA).
How can we set up the Redshift cluster connection in Airflow, so that we can call our stored procedures in Airflow DAGs and have the stored procedures run in the Redshift cluster?
Is there a RedshiftOperator available for the connection, or can we create a direct connection to the Redshift cluster using the connection option in the Airflow menu?
If possible, can we achieve all of this using the AWS console only, without the AWS CLI?

How can we set up the Redshift cluster connection in Airflow, so that we can call our stored procedures in Airflow DAGs and have the stored procedures run in the Redshift cluster?
You can use Airflow Connections to connect to Redshift. This is the native approach for managing connections to external services such as databases.
Managing Connections (Airflow)
Amazon Redshift Connection (Airflow)
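As a rough illustration (not part of the original answer), the same connection fields can be expressed with Airflow's Connection model in Python; the conn_id, host, database name, user, and password below are placeholders you would replace with your own. In MWAA you would typically enter these fields through Admin > Connections in the Airflow UI instead.

from airflow.models.connection import Connection

# Hypothetical connection details -- replace with your own cluster endpoint and credentials.
redshift_conn = Connection(
    conn_id="redshift_default",
    conn_type="postgres",  # Redshift speaks the Postgres wire protocol
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    schema="dev",          # Redshift database name
    login="awsuser",
    password="my-secret-password",
    port=5439,
)

# The URI form is handy if you prefer to supply the connection via an
# environment variable (AIRFLOW_CONN_REDSHIFT_DEFAULT) or a secrets backend.
print(redshift_conn.get_uri())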
Is there a RedshiftOperator available for the connection, or can we create a direct connection to the Redshift cluster using the connection option in the Airflow menu?
You can use the PostgresOperator to execute SQL commands in the Redshift cluster. When initializing the PostgresOperator, set the postgres_conn_id parameter to the Redshift connection ID (e.g. redshift_default). Example:
PostgresOperator(
    task_id="call_stored_proc",
    postgres_conn_id="redshift_default",
    sql="sql/stored_proc.sql",
)
PostgresOperator (Airflow)
How-to Guide for PostgresOperator (Airflow)
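To tie this back to mixing synchronous and asynchronous stored procedures, here is a minimal DAG sketch, assuming Airflow 2.x with the Postgres provider installed, a connection named redshift_default, and stored procedures with the hypothetical names sp_load_stage, sp_transform_a, sp_transform_b, and sp_finalize. Tasks with no dependency between them run in parallel, while the >> chaining enforces the synchronous ordering.

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="redshift_stored_procedures",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Must finish before the transforms start (synchronous step).
    load_stage = PostgresOperator(
        task_id="sp_load_stage",
        postgres_conn_id="redshift_default",
        sql="CALL sp_load_stage();",
    )

    # These two have no dependency on each other, so Airflow can run them in parallel.
    transform_a = PostgresOperator(
        task_id="sp_transform_a",
        postgres_conn_id="redshift_default",
        sql="CALL sp_transform_a();",
    )
    transform_b = PostgresOperator(
        task_id="sp_transform_b",
        postgres_conn_id="redshift_default",
        sql="CALL sp_transform_b();",
    )

    # Runs only after both transforms have succeeded.
    finalize = PostgresOperator(
        task_id="sp_finalize",
        postgres_conn_id="redshift_default",
        sql="CALL sp_finalize();",
    )

    load_stage >> [transform_a, transform_b] >> finalize

Per-task failure handling (retries, retry_delay, on_failure_callback) can then be set on each operator, which covers the failure-handling gap mentioned with EventBridge.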
If possible, can we achieve all of this using the AWS console only, without the AWS CLI?
No, it's not possible to achieve this only using the AWS console.

Related

AWS Batch from on prem

I have an on-premises schedule that's too dense and there are a lot of errors. I'd like to explore options for migrating some of the workload to the cloud. Can I connect to on-premises resources using AWS Batch? I'd like to connect to an on-prem database/warehouse, run some of these jobs in the cloud using Spot Instances, and drop the output into S3, but I wasn't sure whether to use AWS Batch, AWS Glue, or a combination of the two. Is there a different option?

How do I periodically refresh the data in an RDS table automatically?

I have a web application that uses two databases. I want to schedule an SQL query to be executed at set intervals to select data from the first database and put it in a table in the second one, which the web application takes its data from.
You can do that using a scheduled Lambda function.
From the AWS Prescriptive Guidance pattern Schedule jobs for Amazon RDS and Aurora PostgreSQL using Lambda and Secrets Manager:
For on-premises databases and databases that are hosted on Amazon Elastic Compute Cloud (Amazon EC2) instances, database administrators often use the cron utility to schedule jobs. For example, a job for data extraction or a job for data purging can easily be scheduled using cron. For these jobs, database credentials are typically either hard-coded or stored in a properties file. However, when you migrate to Amazon Relational Database Service (Amazon RDS) or Amazon Aurora PostgreSQL, you lose the ability to log in to the host instance to schedule cron jobs. This pattern describes how to use AWS Lambda and AWS Secrets Manager to schedule jobs for Amazon RDS and Aurora PostgreSQL databases after migration.
https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/schedule-jobs-for-amazon-rds-and-aurora-postgresql-using-lambda-and-secrets-manager.html
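As a rough sketch of that pattern (not taken from the linked guide), a Lambda function triggered on a schedule could fetch database credentials from Secrets Manager and copy rows from the source database to the target; the secret names, table names, and query below are placeholders.

import json

import boto3
import psycopg2  # bundled with the function, e.g. via a Lambda layer


def get_connection(secret_id):
    # Hypothetical secrets holding host/port/dbname/username/password as JSON.
    secret = boto3.client("secretsmanager").get_secret_value(SecretId=secret_id)
    creds = json.loads(secret["SecretString"])
    return psycopg2.connect(
        host=creds["host"],
        port=creds.get("port", 5432),
        dbname=creds["dbname"],
        user=creds["username"],
        password=creds["password"],
    )


def lambda_handler(event, context):
    source = get_connection("app/source-db")
    target = get_connection("app/target-db")
    try:
        with source.cursor() as src_cur, target.cursor() as tgt_cur:
            # Placeholder query: pull the rows the web application needs.
            src_cur.execute("SELECT id, name, updated_at FROM source_table;")
            rows = src_cur.fetchall()

            # Refresh the target table with the latest snapshot.
            tgt_cur.execute("TRUNCATE refreshed_table;")
            tgt_cur.executemany(
                "INSERT INTO refreshed_table (id, name, updated_at) VALUES (%s, %s, %s);",
                rows,
            )
        target.commit()
    finally:
        source.close()
        target.close()
    return {"rows_copied": len(rows)}

An EventBridge (CloudWatch Events) rule with a rate or cron schedule expression then invokes the function at the desired interval.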

Can't reach flask in Spark master node using Amazon EMR

I want to understand if it's possible to use a Flask application connected to the Spark master node in Amazon EMR. The goal is to call Flask from a web app to retrieve Spark outputs. Ports are open in the Amazon EMR cluster's security group, but I can't reach it from outside on its port.
What do you think about it? Are there any other solutions?
While it is totally possible to call Flask (or anything else) running on EMR, depending on what you are doing you might find Apache Livy handy. The good thing is that Livy is fully supported by EMR. You can use Livy to submit jobs and to retrieve results synchronously or asynchronously; it gives you a REST API to interact with Spark.
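As a rough sketch (assuming Livy is reachable on the EMR master node's default port 8998 and the security group allows it), submitting a batch job and polling its state over Livy's REST API could look like this; the host name and S3 path are placeholders.

import time

import requests

LIVY_URL = "http://ec2-xx-xx-xx-xx.compute.amazonaws.com:8998"  # hypothetical EMR master DNS

# Submit a PySpark script stored in S3 as a Livy batch.
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={"file": "s3://my-bucket/jobs/my_spark_job.py", "args": ["2024-01-01"]},
    headers={"Content-Type": "application/json"},
)
batch_id = resp.json()["id"]

# Poll until the batch reaches a terminal state, then read the driver log.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(10)

print("final state:", state)
print(requests.get(f"{LIVY_URL}/batches/{batch_id}/log").json())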

Can GCP Dataproc sqoop data (or run other jobs on) from local DB?

Can GCP Dataproc sqoop import data from local DB to put into GCP Storage (without GCP VPC)?
We have a remote Oracle DB connected to our local network via a VPN tunnel, from which we extract data each day with Apache Sqoop running on a Hadoop cluster. We would like to replace this process with a GCP Dataproc cluster to run the Sqoop jobs and GCP Storage.
I found this article that appears to do something similar, Moving Data with Apache Sqoop in Google Cloud Dataproc, but it assumes that users have a GCP VPC (which I did not intend on purchasing).
So my question is:
Without this VPC connection, would the Cloud Dataproc cluster know how to get the data from the DB on our local network using the job submission API?
If so, how would this work (perhaps I do not understand enough about how Hadoop jobs work / get data)?
If not, is there some other way?
Without using VPC/VPN you will not be able to grant Dataproc access to your local DB.
Instead of using a VPC, you can use a VPN if it meets your needs better: https://cloud.google.com/vpn/docs/
The only other option you have is to open up your local DB to the Internet so that Dataproc can access it without VPC/VPN, but this is inherently insecure.
Installing the GCS connector on-prem might work in this case. It will not require VPC/VPN.

Simplest way to get data from AWS mysql RDS to AWS Elasticsearch?

I have data in an AWS RDS instance, and I would like to pipe it over to an AWS Elasticsearch (ES) instance, preferably updating once an hour or similar.
On my local machine, with a local mysql database and Elasticsearch database, it was easy to set this up using Logstash.
Is there a "native" AWS way to do the same thing? Or do I need to set up an EC2 server and install Logstash on it myself?
You can achieve the same thing with your local Logstash: simply point your jdbc input to your RDS database and the elasticsearch output to your AWS ES instance. If you need to run this regularly, then yes, you'd need to set up a small instance to run Logstash on.
A more "native" AWS solution to achieve the same thing would include the use of Amazon Kinesis and AWS Lambda.
Here's a good article explaining how to connect it all together, namely:
how to stream RDS data into a Kinesis Stream,
how to configure a Lambda function to handle the stream, and
how to push the data to your AWS ES instance.
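As a rough sketch of the Lambda piece of that pipeline (not taken from the linked article), a handler could decode the Kinesis records and index them into the Elasticsearch domain; the endpoint and index name are placeholders, and authentication (IAM request signing or basic auth) is omitted for brevity.

import base64
import json

import requests

# Hypothetical domain endpoint and index name.
ES_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"
INDEX = "rds-data"


def lambda_handler(event, context):
    indexed = 0
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        doc = json.loads(payload)

        # Index the document; add IAM request signing or basic auth as your domain policy requires.
        resp = requests.post(
            f"{ES_ENDPOINT}/{INDEX}/_doc",
            json=doc,
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()
        indexed += 1

    return {"indexed": indexed}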
Take a look at Amazon DMS. It's usually used for DB migrations; however, it also supports continuous data replication. This might simplify the process and be cost-effective.
You can use AWS Database Migration Service to perform continuous data replication. Continuous data replication has a multitude of use cases including Disaster Recovery instance synchronization, geographic database distribution and Dev/Test environment synchronization. You can use DMS for both homogeneous and heterogeneous data replications for all supported database engines. The source or destination databases can be located in your own premises outside of AWS, running on an Amazon EC2 instance, or it can be an Amazon RDS database. You can replicate data from a single database to one or more target databases or data from multiple source databases can be consolidated and replicated to one or more target databases.
https://aws.amazon.com/dms/