Spark on EMR | EKS | YARN - amazon-web-services

We are migrating from on-premise to the AWS stack. I have a doubt and am often confused about how Apache Spark works in AWS and similar setups.
I will just share my current understanding of on-premise Spark running on YARN. When an application is submitted to the Spark cluster, an application master is created on one of the data nodes (as a container), and it takes care of the application by spawning executor tasks on the data nodes. This means the Spark code is shipped to the nodes where the data resides, which means less network transfer. Moreover, this is logical and easy to visualise (at least to me).
But suppose the same Spark application runs on AWS: it fetches its data from S3 and runs on top of EKS. Here, as I understand it, the Spark driver and the executor tasks are spawned in Kubernetes pods.
- Then does this mean the data has to be transferred over the network from S3 to the EKS cluster, to the node where the executor pod is spawned?
I have seen some videos that use EMR on top of EKS, but I am a little confused here.
- Since EMR provides the Spark runtime, why do we use EKS here? Can't we run EMR alone for Spark applications in an actual production environment? (I know that EKS can be a replacement for YARN in the Spark world.)
- Can't we run Spark on top of EKS without using EMR? (I am thinking of EMR as a cluster in which Spark drivers and executors can run.)
Edit: This is a query more about Spark's integration with Kubernetes, not specific to AWS.
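For concreteness, here is a minimal sketch of the kind of S3 read I have in mind on Kubernetes; the bucket, path, and credentials-provider setting are just placeholders, and the hadoop-aws/s3a connector is assumed to be on the classpath:

    from pyspark.sql import SparkSession

    # Executors running in EKS pods pull these S3 objects over the network;
    # there is no data-local placement the way HDFS + YARN provides.
    spark = (SparkSession.builder
             .appName("s3-read-on-k8s")
             # illustrative s3a credentials setting; the exact setup depends on the image/IAM config
             .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                     "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
             .getOrCreate())

    df = spark.read.parquet("s3a://my-bucket/path/to/data/")  # placeholder bucket/path
    print(df.count())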

Related

Why do we need HDFS on EMR when we have S3

At our workplace, we use AWS services for all our data infrastructure and service needs. Our Hive tables are external tables, and the actual data files are stored in S3. We use Apache Spark for data ingestion and transformation. We have an ever-running EMR cluster with 1 master node (always running) and 1 core node (always running); whenever data processing happens, additional core nodes and task nodes are added and then removed once processing is done. Our EC2 instances have EBS volumes as temporary storage/scratch space for the executors.
Given this context, I am wondering why we need HDFS in our EMR cluster at all. I also see that the HDFS NameNode service is always running on the master node and the DataNode service is running on the core node. They do have some blocks they are managing, but I am not able to find which files those belong to. Also, the total size of all the blocks is very small (~2 GB).
Software versions used
Python version: 3.7.0
PySpark version: 2.4.7
EMR version: 5.32.0
If you know the answer to this question, can you please help me understand this need for HDFS? Please let me know if you have any questions for me.
HDFS in EMR is a built-in component that is provided to store secondary information, such as the credentials your Spark executors need to authenticate themselves to read a resource, or log files. In my personal experience I have also used it as a staging area, storing partial results during a long computation so that if something went wrong in the middle I would have a checkpoint to resume from instead of restarting the computation from scratch. Storing the final result on HDFS is strongly discouraged.
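A rough sketch of that staging pattern, with purely illustrative names (the bucket, HDFS directory, and column are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stage an expensive intermediate result on the cluster's HDFS so that a failure
    # later in the pipeline does not force a full recomputation from S3.
    intermediate = spark.read.parquet("s3a://my-bucket/raw/").filter("event_date >= '2021-01-01'")
    intermediate.write.mode("overwrite").parquet("hdfs:///tmp/staging/step1/")

    # On resume, pick up from the staged copy instead of redoing the earlier steps.
    step1 = spark.read.parquet("hdfs:///tmp/staging/step1/")
    step1.write.mode("overwrite").parquet("s3a://my-bucket/curated/")  # final result goes to S3, not HDFS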
Spark on EMR runs on YARN, which itself uses HDFS. The Spark executors run inside of YARN containers, and Spark distributes the Spark code and config by placing it in HDFS and distributing it to all of the nodes running the Spark executors in the YARN containers. Additionally, the Spark Event Logs from each running and completed application are stored in HDFS by default.

Is it possible to run kubeflow pipelines or notebooks using AWS EMR as Spark Master/Driver

I am trying to implement a solution on an EKS cluster where jobs are expected to be submitted by users/developers through the Kubeflow central dashboard. To offer Spark as a service to the platform's users, I tried a standalone Spark installation on the EKS cluster, where every other piece of configuration would have to be managed by an admin. So the managed service EMR could possibly be used here as an independent service, triggered only when a job is submitted.
I am trying to make EMR on EC2 or EMR on EKS available as an endpoint that can be used in Kubeflow notebooks or pipelines. I have tried various things but could not arrive at a robust solution.
So if anybody has any experience with this, please feel free to drop in your suggestions.
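What I had in mind, very roughly, is a pipeline step that triggers a job on an EMR on EKS virtual cluster. The sketch below uses boto3; the virtual cluster ID, role ARN, release label, and script location are all placeholders, and I am not sure this is the right approach:

    import boto3

    # Submit a Spark job to an EMR on EKS virtual cluster from a pipeline/notebook step.
    emr = boto3.client("emr-containers", region_name="us-east-1")

    response = emr.start_job_run(
        virtualClusterId="abcdefgh1234567890",  # placeholder virtual cluster ID
        name="kubeflow-triggered-job",
        executionRoleArn="arn:aws:iam::111122223333:role/emr-eks-job-role",  # placeholder role
        releaseLabel="emr-6.2.0-latest",  # placeholder release label
        jobDriver={
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://my-bucket/scripts/job.py",  # placeholder script location
                "sparkSubmitParameters": "--conf spark.executor.instances=2",
            }
        },
    )
    print(response["id"])  # job run id, for polling status later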

Is it a good practice to have an AWS EMR standing cluster always running structured streaming?

I have a Spark Structured Streaming job which takes data as input from AWS MSK (Kafka) and writes to AWS S3. Is it a good idea to have a standing AWS EMR cluster always running this job? Or are there better ways to manage this infrastructure?
Please let me know if you need further details.
You need some worker pool that is consuming and writing.
Your other options include using EKS instead of YARN on EMR to run Spark, or you could skip Spark entirely and run the Kafka Connect S3 Sink on an EC2/EKS cluster instead.
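For reference, the worker pool would essentially just keep a query like the sketch below running; the broker list, topic, bucket, and checkpoint path are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("msk-to-s3").getOrCreate()

    # Read from MSK (Kafka).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "b-1.my-msk:9092,b-2.my-msk:9092")  # placeholder brokers
              .option("subscribe", "my-topic")                                       # placeholder topic
              .load())

    # Continuously write to S3; the checkpoint location is what lets the job resume after restarts.
    query = (events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
             .writeStream
             .format("parquet")
             .option("path", "s3a://my-bucket/events/")                        # placeholder bucket
             .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
             .start())

    query.awaitTermination()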

What would be a suitable configuration for a 2-3 node Hadoop cluster on AWS?

What would be a suitable configuration to set up a 2-3 node Hadoop cluster on AWS?
I want to set up Hive, HBase, Solr, and Tomcat on the Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR or to manually set up the cluster on EC2.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (e.g. Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to set up your own Hadoop cluster on Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR
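As a hedged sketch, a small cluster like that can also be launched programmatically with boto3; the instance types, key pair, and roles below are placeholders, and Solr and Tomcat are not EMR-managed applications, so they would have to be installed separately (e.g. via bootstrap actions):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Launch a 3-node cluster (1 master + 2 core) with Hadoop, Hive and HBase pre-installed.
    response = emr.run_job_flow(
        Name="poc-cluster",
        ReleaseLabel="emr-5.32.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "HBase"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",   # placeholder instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
            "Ec2KeyName": "my-keypair",          # placeholder key pair
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])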

How do I get Spark installed on Amazon's EMR core/worker nodes/instances while creating the cluster?

I am trying to launch an EMR cluster with the Spark (1.6.0) and Hadoop (Distribution: Amazon 2.7.1) applications. The release label is emr-4.4.0.
The cluster gets set up as needed, but it doesn't run the Spark master (on the master instance) as a daemon process, and I also cannot find Spark installed on the worker (core) instances (the Spark directory under /usr/lib/ has just the lib and yarn directories).
I'd like to run the Spark master and worker daemons as soon as the cluster has been set up (i.e., the workers connect to the master automatically and become part of the Spark cluster).
How do I achieve this? Or am I missing something?
Thanks in advance!
Spark on EMR is installed in YARN mode. That is why you are not able to see standalone master and worker daemons. See http://spark.apache.org/docs/latest/running-on-yarn.html
Standalone Spark master and worker daemons are spawned only in Spark standalone mode; see http://spark.apache.org/docs/latest/spark-standalone.html
Now, if you do want to run standalone Spark masters and workers on EMR, you can start them manually:

    # on the master instance
    /usr/lib/spark/sbin/start-master.sh

    # on each core instance, pointing the worker at the master's URL (port 7077 by default)
    /usr/lib/spark/sbin/start-slave.sh spark://<master-private-dns>:7077

and configure them accordingly.