AWS and GCP centrally managed airflows and Dataflow equivalent for AWS - amazon-web-services

I have two questions to ask:
So my company has 2 instances of airflow running, one on a GCP
provisioned cluster and another on a AWS provisioned cluster. Since
GCP has Composer, which helps you to manage airflow, is there a way
to sort of integrate the airflow DAGs on the AWS cluster to be
managed by GCP as well?
For Batch ETL/Streaming jobs(in python), GCP has Dataflow (Apache
Beam) for that. What's the AWS equivalent of that?
Thanks!

No, you can't do it, till now you have to use AWS, provision it and manage by yourself. There are some options you can choose: EC2, ECS + Fargate, EKS
Dataflow is equivalent to Amazon Elastic MapReduce (EMR) or AWS Batch Dataflow. Moreover if you want to run current Apache Beam jobs, you can provision Apache Beam in EMR and everything should be the same

Related

What AWS service to use for running batch jobs

I need to run java code that talk to MySql db in AWS and does some ETL on a nightly frequency. Which AWS service can I used for this?
I would recommend looking at the following:
AWS Glue ETL
AWS Batch
AWS ECS / Fargate scheduled tasks

AWS EMR or Create Own Cluster

Could someone help me in deciding which would be better AWS EMR or creating own cluster in AWS? I am using airflow to create AWS EMR via terraform , run the job and destroy cluster. However did anyone created a spark cluster in AWS without EMR e.g. using ECS Fargate and docker image from bitnami/spark e.g. link or something along the same line in AWS. Thank you

Is it possible to run kubeflow pipelines or notebooks using AWS EMR as Spark Master/Driver

I am trying to implement as solution on an EKS cluster where jobs are expected to be submitted using kubeflow central dashboard by users/developers. To include spark as a service for users on platform I tried to have standalone spark installation on EKS cluster where everything other config will have to managed by admin. So managed service EMR could be possibly used here as an independent service and will be triggered only when job is submitted.
I an trying to make EMR on EC2 or EMR on EKS available as an endpoint to be used in kubeflow notebooks or pipelines. Tried various things but could not have any robust solution for it.
So if anybody has any sort of experience in the same please feel free to drop in your suggestions.

AWS EMR free usage

I am very new to cloud based services. I want to try impala queries on AWS EMR and EC2. Is it possible, Can I create a free account for EC2/EMR. If yes then how?
Impala is not available as a standard option in Amazon EMR.
You would probably need to launch your own Hadoop cluster on Amazon EC2 instances.
However, the AWS Free Usage Tier only provides micro-sized EC2 instances, which are not appropriate for a Hadoop cluster.

What should be suitable configuration to set up 2-3 node hadoop cluster on AWS?

What should be suitable configuration to set up 2-3 node hadoop cluster on AWS ?
I want to set-up Hive, HBase, Solr, Tomcat on hadoop cluster with purpose of doing small POC's.
Also please suggest option to go with EMR or with EC2 and manually set up cluster on that.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (eg Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to deploy your own Hadoop cluster under Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR