AWS EMR or Create Own Cluster - amazon-web-services

Could someone help me in deciding which would be better AWS EMR or creating own cluster in AWS? I am using airflow to create AWS EMR via terraform , run the job and destroy cluster. However did anyone created a spark cluster in AWS without EMR e.g. using ECS Fargate and docker image from bitnami/spark e.g. link or something along the same line in AWS. Thank you

Related

Is it possible to run kubeflow pipelines or notebooks using AWS EMR as Spark Master/Driver

I am trying to implement as solution on an EKS cluster where jobs are expected to be submitted using kubeflow central dashboard by users/developers. To include spark as a service for users on platform I tried to have standalone spark installation on EKS cluster where everything other config will have to managed by admin. So managed service EMR could be possibly used here as an independent service and will be triggered only when job is submitted.
I an trying to make EMR on EC2 or EMR on EKS available as an endpoint to be used in kubeflow notebooks or pipelines. Tried various things but could not have any robust solution for it.
So if anybody has any sort of experience in the same please feel free to drop in your suggestions.

Does terraform provide a resource to create emr notebooks?

I have an transient Emr cluster up and ready, I want to run a simple pyspark script on the emr notebook.
Is there any way to create and modify the emr notebook through terraform?
Thanks in advance.
As far as i know, AWS says "You create an EMR notebook using the Amazon EMR console. Creating notebooks using the AWS CLI or the Amazon EMR API is not supported." [AWS Documentation on creating EMR Notebook][1]
You can create a notebook via console, the notebook will be stored in S3 as .ipynb, by giving the relative path, you can execute notebook on the cluster. Refer boto3 for more info [Boto3 Documentation][2]
[1]: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-create.html
[2]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.start_notebook_execution
Yes, you can create and modify an EMR cluster from Terraform and choose which tools will be installed, but this seems like the "hard way". Easier would be a Sagemaker Notebook or using the new Glue Databrew tool.

EMR notebooks template for cloudformation

I am aware of AWS cloudformation EMR resource to create Clusters. But, I could not find any instructions about EMR notebooks. Is there a cloudformation resource for EMR notebooks or similar alternative?
EMR Notebooks can only be created manually using the AWS EMR console. From the documentation (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-create.html):
You create an EMR notebook using the Amazon EMR console. Creating notebooks using the AWS CLI or the Amazon EMR API is not supported.
Since there is no API for this I don't think there will be a way to create notebooks using CloudFormation or similar tools.

AWS and GCP centrally managed airflows and Dataflow equivalent for AWS

I have two questions to ask:
So my company has 2 instances of airflow running, one on a GCP
provisioned cluster and another on a AWS provisioned cluster. Since
GCP has Composer, which helps you to manage airflow, is there a way
to sort of integrate the airflow DAGs on the AWS cluster to be
managed by GCP as well?
For Batch ETL/Streaming jobs(in python), GCP has Dataflow (Apache
Beam) for that. What's the AWS equivalent of that?
Thanks!
No, you can't do it, till now you have to use AWS, provision it and manage by yourself. There are some options you can choose: EC2, ECS + Fargate, EKS
Dataflow is equivalent to Amazon Elastic MapReduce (EMR) or AWS Batch Dataflow. Moreover if you want to run current Apache Beam jobs, you can provision Apache Beam in EMR and everything should be the same

What should be suitable configuration to set up 2-3 node hadoop cluster on AWS?

What should be suitable configuration to set up 2-3 node hadoop cluster on AWS ?
I want to set-up Hive, HBase, Solr, Tomcat on hadoop cluster with purpose of doing small POC's.
Also please suggest option to go with EMR or with EC2 and manually set up cluster on that.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (eg Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to deploy your own Hadoop cluster under Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR