AWS Batch from on-prem

I have an on-premises schedule that's too dense, and there are a lot of errors. I'd like to explore options for migrating some of the workload to the cloud. Can I connect to on-premises resources using AWS Batch? I'd like to connect to an on-prem database/warehouse, run some of these jobs in the cloud using Spot instances, and drop the output into S3, but I wasn't sure whether to use AWS Batch, AWS Glue, or a combination of the two. Is there a different option?

Related

What AWS service to use for running batch jobs

I need to run Java code that talks to a MySQL db in AWS and does some ETL on a nightly schedule. Which AWS service can I use for this?
I would recommend looking at the following:
AWS Glue ETL
AWS Batch (sketched below)
AWS ECS / Fargate scheduled tasks
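For example, with the AWS Batch option, here is a minimal sketch of kicking off such a nightly job via boto3 (the job queue and job definition names are hypothetical; the Java code itself would be packaged in the job definition's Docker image, and in practice the call would typically be fired by an EventBridge schedule rule):

import boto3

# Assumed: a job queue and a job definition wrapping the Java ETL image
# already exist in the account.
batch = boto3.client("batch")

response = batch.submit_job(
    jobName="nightly-mysql-etl",
    jobQueue="etl-job-queue",
    jobDefinition="java-etl-job:1",
)
print(response["jobId"])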

Orchestration of Redshift Stored Procedures using AWS Managed Airflow

I have many Redshift stored procedures (15-20); some can run asynchronously, while many have to run in a synchronous manner.
I tried scheduling them in an async and sync manner using Amazon EventBridge but found many limitations (failure handling and orchestration), so I went ahead with AWS Managed Airflow (MWAA).
How can we set up the Redshift cluster connection in Airflow, so that we can call our stored procedures in Airflow DAGs and they will run on the Redshift cluster?
Is there a RedshiftOperator for the connection, or can we create a direct connection to the Redshift cluster using the Connections option in the Airflow UI?
If possible, can we achieve all this using the AWS console only, without the AWS CLI?
How can we set up the Redshift cluster connection in Airflow, so that we can call our stored procedures in Airflow DAGs and they will run on the Redshift cluster?
You can use Airflow Connections to connect to Redshift. This is the native approach for managing connections to external services such as databases.
Managing Connections (Airflow)
Amazon Redshift Connection (Airflow)
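Connections can also be defined through Airflow's AIRFLOW_CONN_{CONN_ID} environment-variable convention instead of the UI. A minimal sketch (the host and credentials are hypothetical; in MWAA you would normally create the connection in the Airflow UI or store it in Secrets Manager):

import os

# Hypothetical placeholder values. Redshift speaks the Postgres wire
# protocol, so a postgres:// URI works with the PostgresOperator below.
# In a real deployment this variable is set in the environment of the
# Airflow scheduler/workers, not inside a DAG file.
os.environ["AIRFLOW_CONN_REDSHIFT_DEFAULT"] = (
    "postgres://awsuser:my_password@example-cluster.abc123"
    ".us-east-1.redshift.amazonaws.com:5439/dev"
)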
Is there a RedshiftOperator for the connection, or can we create a direct connection to the Redshift cluster using the Connections option in the Airflow UI?
You can use the PostgresOperator to execute SQL commands on the Redshift cluster. When initializing the PostgresOperator, set the postgres_conn_id parameter to the Redshift connection ID (e.g. redshift_default). Example:

from airflow.providers.postgres.operators.postgres import PostgresOperator

call_stored_proc = PostgresOperator(
    task_id="call_stored_proc",
    postgres_conn_id="redshift_default",
    sql="sql/stored_proc.sql",
)
PostgresOperator (Airflow)
How-to Guide for PostgresOperator (Airflow)
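Putting it together, a minimal DAG sketch (the DAG id, schedule, and procedure names are hypothetical) that also shows how to express the sync/async ordering from the question: the load step runs first, then the two transforms run in parallel:

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="redshift_stored_procs",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PostgresOperator(
        task_id="load",
        postgres_conn_id="redshift_default",
        sql="CALL etl.load_staging();",
    )
    transform_a = PostgresOperator(
        task_id="transform_a",
        postgres_conn_id="redshift_default",
        sql="CALL etl.transform_a();",
    )
    transform_b = PostgresOperator(
        task_id="transform_b",
        postgres_conn_id="redshift_default",
        sql="CALL etl.transform_b();",
    )

    # load must finish first (synchronous); the two transforms then run
    # in parallel (asynchronous with respect to each other).
    load >> [transform_a, transform_b]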
If possible, can we achieve all this using the AWS console only, without the AWS CLI?
No, it's not possible to achieve this only using the AWS console.

AWS Batch vs AWS Step Functions for Control M migration

The goal is to migrate our jobs from Control M to AWS, but before I do that I want to better understand the differences between AWS Batch and AWS Step Functions. From what I've understood, AWS Step Functions seems more encompassing, in that I can have one of my steps run AWS Batch.
Can you explain the difference between AWS Batch and AWS Step Functions? Which is better suited to migrate to from Control M? (Maybe this is a matter of preference.)
AWS Batch is a service for running offline (batch) workloads. With Batch, you can easily set up your workload in a Docker image and define the set of instance types, and how many instances, that will run it.
AWS Step Functions is a serverless workflow management service. It only provides a way to connect other AWS services: you cannot run a script in Step Functions itself; you only define the workflow, passing input/output between the AWS services it invokes.
That said, you can use both services together to migrate from Control M to AWS, possibly alongside other AWS services such as Lambda (for small workloads), SNS (for e-mail), and S3 (for storage). For instance, a Step Functions state can submit an AWS Batch job and wait for it to finish, as sketched below.
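As an illustration of how the two fit together, a minimal sketch (all ARNs and resource names are hypothetical) of creating a Step Functions state machine whose single state submits an AWS Batch job and waits for it to complete:

import json

import boto3

# Amazon States Language definition: the batch:submitJob.sync integration
# submits the job and blocks the state until the job succeeds or fails.
definition = {
    "StartAt": "RunBatchJob",
    "States": {
        "RunBatchJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "migrated-control-m-job",
                "JobQueue": "my-job-queue",
                "JobDefinition": "my-job-definition:1",
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="control-m-migration-demo",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsBatchRole",
)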

AWS and GCP centrally managed airflows and Dataflow equivalent for AWS

I have two questions to ask:
1. My company has two instances of Airflow running, one on a GCP-provisioned cluster and another on an AWS-provisioned cluster. Since GCP has Composer, which helps you manage Airflow, is there a way to have the Airflow DAGs on the AWS cluster managed by GCP as well?
2. For batch ETL/streaming jobs (in Python), GCP has Dataflow (Apache Beam). What's the AWS equivalent of that?
Thanks!
No, you can't do that; as of now you have to run Airflow on AWS, provisioning and managing it yourself. There are a few options to run it on: EC2, ECS + Fargate, or EKS.
Dataflow is roughly equivalent to Amazon Elastic MapReduce (EMR) or AWS Batch. Moreover, if you want to run your current Apache Beam jobs, you can run Apache Beam on EMR and everything should work the same, as sketched below.
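As a minimal sketch of that portability (the bucket paths are hypothetical, and reading from S3 assumes the apache-beam[aws] extras are installed), the same Beam pipeline code runs on Dataflow or on a Spark cluster on EMR; only the runner option changes:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# "SparkRunner" targets a Spark cluster such as EMR; swapping in
# "DataflowRunner" would run the identical pipeline on GCP Dataflow.
options = PipelineOptions(["--runner=SparkRunner"])

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("s3://my-bucket/input/*.txt")
        | "Count" >> beam.combiners.Count.Globally()
        | "Write" >> beam.io.WriteToText("s3://my-bucket/output/counts")
    )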

What query to run to determine Amazon Athena version?

I'd like to determine what version of Amazon Athena I'm connected to by running a query. Is this possible? If so, what is the query?
Searching Google, SO, and the AWS docs has not turned up an answer.
Amazon Redshift launches as a cluster, with virtual machines being used for that specific cluster. The cluster must be specifically updated between versions because it is continuously running and is accessible by only one AWS account. Think of it as software running on your own virtual machines.
From Amazon Redshift Clusters:
Amazon Redshift provides a setting, Allow Version Upgrade, to specify whether to automatically upgrade the Amazon Redshift engine in your cluster if a new version of the engine becomes available.
Amazon Athena, however, is a fully-managed service. There is no cluster to be created -- you simply provide your query and it uses the metastore to know where to find data. Think of it just like Amazon S3 -- many servers provide access to multiple AWS customers simultaneously.
From Amazon Athena – Interactive SQL Queries for Data in Amazon S3:
Behind the scenes, Athena parallelizes your query, spreads it out across hundreds or thousands of cores, and delivers results in seconds.
As a fully-managed service, there is only ever one version of Amazon Athena, which is the version that is currently available.
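To make the serverless model concrete, here is a minimal sketch of submitting a query via boto3 (the database, table, and results bucket are hypothetical): you supply only the SQL, the database, and an S3 location for results, with no cluster endpoint involved:

import boto3

athena = boto3.client("athena")

# Athena reads the table definition from the metastore (Glue Data Catalog)
# and runs the query on its managed fleet; results land in the S3 bucket.
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
print(response["QueryExecutionId"])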