How do I get Spark installed on Amazon's EMR core/worker nodes/instances while creating the cluster?

I am trying to launch an EMR cluster with the Spark (1.6.0) and Hadoop (Distribution: Amazon 2.7.1) applications. The release label is emr-4.4.0.
The cluster gets set up as needed, but it doesn't run the Spark master as a daemon process on the master instance, and I also cannot find Spark installed on the worker (core) instances (the Spark directory under /usr/lib/ has just the lib and yarn directories).
I'd like the Spark master and worker nodes to be running as soon as the cluster has been set up (i.e., the workers connect to the master automatically and become part of the Spark cluster).
How do I achieve this? Or am I missing something?
Thanks in advance!

Spark on EMR is installed in YARN mode. This is why you do not see the standalone master and slave daemons. http://spark.apache.org/docs/latest/running-on-yarn.html
Standalone Spark master and worker daemons are spawned only in Spark standalone mode. http://spark.apache.org/docs/latest/spark-standalone.html
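You can confirm this by submitting a job against YARN from the master node (a sketch; the examples-jar path follows the /usr/lib/spark layout on emr-4.x, which is an assumption on my part):

spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/lib/spark-examples.jar 100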
Now, if you do want to run Spark masters and workers on EMR anyway, you can do so using
/usr/lib/spark/sbin/start-master.sh
/usr/lib/spark/sbin/start-slave.sh
and configuring them accordingly.
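For example (a sketch: port 7077 is the standalone master's default, and <master-private-dns> is a placeholder for your master instance's hostname):

# on the master instance
/usr/lib/spark/sbin/start-master.sh

# on each core instance, point the worker at the master
/usr/lib/spark/sbin/start-slave.sh spark://<master-private-dns>:7077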

Related

Spark on EMR | EKS | YARN

I am migrating from on-premises to the AWS stack, and I am often confused about how Apache Spark works on AWS and similar platforms.
I will share my current understanding of on-premises Spark running on YARN. When an application is submitted to the Spark cluster, an application master is created on one of the data nodes (as a container), and it takes care of the application by spawning executor tasks on the data nodes. This means the Spark code is shipped to the nodes where the data resides, so there is less network transfer. Moreover, this is logical and easy to visualize (at least to me).
But suppose the same Spark application runs on AWS, fetching its data from S3 and running on top of EKS. Here, as I understand it, the Spark driver and the executor tasks will be spawned in Kubernetes pods.
- Does this mean the data has to be transferred over the network from S3 to the EKS cluster, to the node where the executor pod gets spawned?
I have seen some videos that use EMR on top of EKS, but I am a little confused here.
- Since EMR provides the Spark runtime, why do we use EKS here? Can't we run EMR alone for Spark applications in an actual production environment? (I know that EKS can be a replacement for YARN in the Spark world.)
- Can't we run Spark on top of EKS without using EMR? (I am thinking of EMR as a cluster in which Spark drivers and executors can run.)
Edit - This is a query more about Kubernetes integration with Spark, not specific to AWS.
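For the last point: yes, you can run Spark on EKS without EMR by using Spark's native Kubernetes support. A minimal sketch of a submission (the API-server address, container image, bucket, and main class are all placeholders; reading from S3 also assumes the hadoop-aws/S3A jars are baked into the image):

spark-submit \
  --master k8s://https://<k8s-api-server>:443 \
  --deploy-mode cluster \
  --name my-app \
  --class <main-class> \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  s3a://<bucket>/my-app.jar

The driver and executors then run as pods, and Kubernetes plays the resource-manager role that YARN plays on EMR.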

Flink JobManager HA on EMR

Stack
EMR: emr-6.1.0 (1 master, 4 core nodes)
EMR installed apps: FLINK 1.11.0
AWS documentation says (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-configure.html):
Beginning with Amazon EMR version 5.28.0, JobManager high availability is also enabled automatically. No manual configuration is needed.
But when I send a kill signal to the Flink JobManager YARN container (yarn container -signal container_1601027657994_0003_01_000001 GRACEFUL_SHUTDOWN, and likewise with FORCEFUL_SHUTDOWN), nothing happens; YARN won't restart the app.
Do I need to enable ZooKeeper on the EMR cluster as well? (Most probably yes; otherwise I don't understand how Flink would know from which savepoint to restart the application.)
Should I use an EMR cluster with 3 master nodes to have HA for Flink?
Yes. To have JobManager HA you need an EMR cluster with 3 master nodes; EMR then automatically adds the failover configuration to flink-conf.yaml (tested with EMR 6.1.0).
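To verify, you can inspect the generated configuration on the master nodes (a sketch; the /etc/flink/conf path is my assumption for EMR, and the values shown are placeholders, but the keys are standard Flink HA options):

grep 'high-availability' /etc/flink/conf/flink-conf.yaml
# expected entries, roughly:
#   high-availability: zookeeper
#   high-availability.zookeeper.quorum: <master1>:2181,<master2>:2181,<master3>:2181
#   high-availability.storageDir: hdfs:///flink/recovery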

What is the best Airflow architecture for AWS EMR clusters?

I have an AWS EMR cluster with 1 master node, 30 core nodes and some auto-scaled task nodes.
Currently, hundreds of Hive and MySQL jobs are run by Oozie on the cluster.
I'm going to move some jobs from Oozie to Airflow.
I googled how to apply Airflow to my cluster.
I found claims that all DAGs should be located on every node and that an Airflow worker must be installed on all nodes.
But my DAGs will be updated frequently and new DAGs will be added frequently, while the cluster has about 100 nodes, some of which are auto-scaled.
And, as you know, only the master node has the Hive/MySQL applications on the cluster.
So I am very confused.
Can someone describe an Airflow architecture that works for my EMR cluster?
Airflow worker nodes are not the same as EMR nodes.
In a typical setup, a Celery worker ("Airflow worker node") reads from a queue of jobs and executes them using the appropriate operator (in this case probably a SparkSubmitOperator, or possibly an SSHOperator).
Celery workers would not run on your EMR nodes as those are dedicated to running Hadoop jobs.
Celery workers would likely run on EC2s outside of your EMR cluster.
One common solution for having the same DAGs on every Celery worker is to put the DAGs on network storage (such as EFS) and mount that network drive on the Celery worker EC2 instances.
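A minimal sketch of that layout on one worker EC2 instance (assumptions: amazon-efs-utils is installed, Airflow 2.x with the Celery executor is already configured, and the filesystem ID and mount point are placeholders):

# mount the shared DAG store on the worker EC2 (not on EMR nodes)
sudo mkdir -p /mnt/airflow-dags
sudo mount -t efs <efs-id>:/ /mnt/airflow-dags

# point the worker at the shared DAGs and start it
export AIRFLOW__CORE__DAGS_FOLDER=/mnt/airflow-dags
airflow celery worker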

Setting up Spark on an existing EC2 cluster

I have to access some big files in Amazon S3 buckets and do processing on them, and for this I was planning to use Apache Spark. I have 2 EC2 instances for this learning project. They are unused except for some small cron jobs, so could I use them to install and run Spark? If so, how do I install Spark on the existing EC2 boxes so that I can make one a master and the other a slave?
If it helps, I installed Spark in standalone mode on one box, and on the other as well, setting one as master and the other as slave. The detailed instructions I followed are:
https://spark.apache.org/docs/1.2.0/spark-standalone.html#installing-spark-standalone-to-a-cluster
See the tutorial on Apache Spark Cluster on EC2 here http://www.supergloo.com/fieldnotes/apache-spark-cluster-amazon-ec2-tutorial/
Yes, you can easily create a master/slave setup with 2 AWS instances. Set SPARK_MASTER_IP=<instance1-private-IP> in spark-env.sh on both instances, and put instance 2's private IP in the slaves file in the conf folder; these configurations are the same on both machines. Set the other options (memory, cores, etc.) as well. Then you can start the cluster from the master. Make sure Spark is installed at the same location on both machines.
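A minimal sketch of that setup (the private IPs are placeholders, and the paths assume Spark is unpacked at /opt/spark on both boxes):

# conf/spark-env.sh (identical on both instances)
export SPARK_MASTER_IP=<instance1-private-ip>

# conf/slaves (identical on both instances): one worker host per line
<instance2-private-ip>

# then, from the master box, start the master plus all listed slaves over SSH
/opt/spark/sbin/start-all.sh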

How to make the slave nodes work for Spark cluster using EMR?

I tried to run a job on my Spark cluster using EMR. The cluster has one master and two slaves, and each node (master or slave) has 32 cores. The job was submitted using "Add Step" in the console, with the configuration set as below:
sparkConf.setMaster("local[24]").set("spark.executor.memory", "40g")
.set("spark.driver.memory", "40g");
Then I noticed that the two slaves weren't doing any work (CPU usage close to 0); only the master was working hard. How do I fix this problem and make the slaves work?
Thanks!
When you specify a 'local' master, the whole job runs in a single process on the machine where it is launched; it is not distributed over the nodes.
You should follow the doc:
http://spark.apache.org/docs/1.2.0/spark-standalone.html
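On EMR specifically, the usual fix is to remove the hard-coded setMaster("local[24]") from the code and submit against YARN instead (a sketch; the memory values mirror the question, and the class and jar names are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 40g --executor-memory 40g \
  --class <main-class> s3://<bucket>/my-job.jar

With --master yarn, the executors are scheduled onto the core (slave) nodes, so all three machines share the work.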
I've only recently started working with Spark on EMR, but I've found these examples extremely helpful for launching / configuring the cluster and submitting Spark jobs.