Spark standalone mode on AWS EMR - amazon-web-services

I'm able to run Spark on AWS EMR without much trouble following the documentation but from what I see it always uses YARN instead of the standalone manager. Is there any way to use the standalone mode instead of YARN easily? I don't really feel like hacking the bootstrap scripts to turn off yarn and deploy spark master/workers myself.
I'm running into a weird YARN related bug and I was hoping it won't happen with standalone manager.

As far as I know there are no way to run in standalone mode on EMR unless you go back to the old ami-versions instead of using the emr-release-label. The old ami-version will however cause other problems with newer versions of Spark, so I wouldn't go that way.
What you can do is to launch ordinary EC2-instances with Spark instead of using EMR. If you have a local Spark installation, go to the ec2 folder and use spark-ec2 to launch the cluster, like this:
./spark-ec2 --copy-aws-credentials --key-pair=MY_KEY --identity-file=MY_PEM_FILE.pem --region=MY_PREFERED_REGION --instance-type=INSTANCE_TYPE --slaves=NUMBER_OF_SLAVES --hadoop-major-version=2 --ganglia launch NAME_OF_JOB
I suspect that you have jar-files that are needed, so they have to be copied onto the cluster (copy to master first, ssh to master and copy them onto the slaves from there. ./spark-ec2/copy-dir on master will copy a directory onto all slaves). Then restart Spark:
./spark/sbin/stop-master.sh
./spark/sbin/stop-slaves.sh
./spark/sbin/start-master.sh
./spark/sbin/start-slaves.sh
and you are ready to launch Spark in standalone mode:
./spark/bin/spark-submit --deploy-mode client ...

Related

Run EC2 --user-data file only after instance has fully launched with Status OK

I have a --user-data file that downloads a few python scripts that runs a docker as soon as the server is launched.
It seems that the EC2 launch processes interferes with the python script / docker I'm running and is destroying the connection to the docker.
How I've dealt with this issue so far is to wait 5 minutes before running the python script with a simple timer.sleep(300) method. But this feels messy. Any way I can check for what launch processes that are interfering? or a clean solution might be to look for the completion of the launch processes, but I don't know what those processes are, or how I would check against it.
In your question, I see Build and run docker, I think its a lot expensive for your EC2 start up,
I suggest you to externalize the build/push the image docker in other part like CodeBuild/CodePipeline/ECR.... and why not using EC2/ECS/Fargate solution to pull the image docker and run it with ECS agent.
But if you want to use EC2, you can generate a custom AMI with your predefined lib and packages, and in your userdata, you can just pull the image and start it

AWS Batch Failing to launch Dockerfile - standard_init_linux.go:219: exec user process caused: exec format error

I am attempting to use AWS Batch to launch a linux server, which will in essence perform the fetch and go example included within AWS (to download a SH from S3 and run it).
Does AWS Batch work at all for anyone?
The aws fetch_and_go example always fails, even followed someone elses guide online which mimicked the aws example.
I have tried creating Dockerfile for amazonlinux:latest and ubuntu:20.04 with numerous RUN and CMD.
The scripts always seem to fail with the error:
standard_init_linux.go:219: exec user process caused: exec format error
I thought at first this was relevant to my deployment access rights maybe within the amazonlinux so have played with chmod 777, chmod -x etc on the she file.
The final nail in the coffin, my current script is litterely:
FROM ubuntu:20.04
Launch this using AWS Batch, no command or parameters passed through and it still fails with the same error code. This is almost hinting to me that there is either a setup issue with my AWS Batch (which im using default wizard settings, except changing to an a1.medium server) or that AWS Batch has some major issues.
Has anyone had any success with AWS Batch launching their own Dockerfiles ? Could they share their examples and/or setup parameters?
Thank you in advance.
A1 instances are ARM based first-generation Graviton CPU. It is highly likely the image you are trying to run something that is expecting x86 CPU (Intel or AMD). Any instance class with a "g" in it ("c6g" or "m5g") are Graviton2 which is also ARM based and will not work for the default examples.
You can test whether a specific container will run by launching an A1 instance yourself and running the container (after installing docker). My guess is that you will get the same error. Running on Intel or AMD instances should work.
To leverage Batch with ARM your containerized application will need to work on ARM. If you point me to the exact example, I can give more details on how to adjust to run on A1 or Graviton2 instances.
I had the same issue, and it was because I build the image locally on my M1 Mac.
Try adding --platform linux/amd64 to your docker build command before pushing if this is your case.
In addition to the other comment. You can create multi-arch images yourself which will provide the correct architecture.
https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/

Can you load standard zeppelin interpreter settings from S3?

Our company is building up a suite of common internal Spark functions and jobs, and I'd like to make sure that our data scientists have access to all of these when they prototype in Zeppelin.
Ideally, I'd like a way for them to start up a Zeppelin notebook on AWS EMR, and have the dependency jar we build automatically loaded onto it without them having to manually type in the maven information manually every time (private repo location/credentials, package info, etc).
Right now we have the dependency jar loaded on S3, and with some work we could get a private maven repository to host it on.
I see that ZEPPELIN_INTERPRETER_DIR saves off interpreter settings, but I don't think it can load from a common default location (like S3, or something)
Is there a way to tell Zeppelin on an EMR cluster to load it's interpreter settings from a common location? I can't be the first person to want this.
Other thoughts I've had but have not tried yet:
Have a script that uses aws cmd line options to start a EMR cluster with all the necessary settings pre-made for you. (Could also upload the .jar dependency if we can't get maven to work)
Use a infrastructure-as-code framework to start up the clusters with the required settings.
I don't believe it's possible to tell EMR to load settings from a common location. The first thought you included is the way to go imo - you would aws emr create ... and that create would include a shell script step to replace /etc/zeppelin/conf.dist/interpreter.json by downloading the interpreter.json of interest from S3, and then hard restart zeppelin (sudo stop zeppelin; sudo start zeppelin).

WebHCat on Amazon's EMR?

Is it possible or advisable to run WebHCat on an Amazon Elastic MapReduce cluster?
I'm new to this technology and I was wonder if it was possible to use WebHCat as a REST interface to run Hive queries. The cluster in question is running Hive.
I wasn't able to get it working but WebHCat is actually installed by default on Amazon's EMR instance.
To get it running you have to do the following,
chmod u+x /home/hadoop/hive/hcatalog/bin/hcat
chmod u+x /home/hadoop/hive/hcatalog/sbin/webhcat_server.sh
export TEMPLETON_HOME=/home/hadoop/.versions/hive-0.11.0/hcatalog/
export HCAT_PREFIX=/home/hadoop/.versions/hive-0.11.0/hcatalog/
/home/hadoop/hive/hcatalog/webhcat_server.sh start
You can then confirm that it's running on port 50111 using curl,
curl -i http://localhost:50111/templeton/v1/status
To hit 50111 on other machines you have to open the port up in the EC2 EMR security group.
You then have to configure the users you going to "proxy" when you run queries in hcatalog. I didn't actually save this configuration, but it is outlined in the WebHCat documentation. I wish they had some concrete examples there but basically I ended up configuring the local 'hadoop' user as the one that run the queries, not the most secure thing to do I am sure, but I was just trying to get it up and running.
Attempting a query then gave me this error,
{"error":"Server IPC version 9 cannot communicate with client version
4"}
The workaround was to switch off of the latest EMR image (3.0.4 with Hadoop 2.2.0) and switch to a Hadoop 1.0 image (2.4.2 with Hadoop 1.0.3).
I then hit another issues where it couldn't find the Hive jar properly, after struggling with the configuration more, I decided I had dumped enough time into trying to get this to work and decided to communicate with Hive directly (using RBHive for Ruby and JDBC for the JVM).
To answer my own question, it is possible to run WebHCat on EMR, but it's not documented at all (Googling lead me nowhere which is why I created this question in the first place, it's currently the first hit when you search "WebHCat EMR") and the WebHCat documentation leaves a lot to be desired. Getting it to work seems like a pain, though my hope is that by writing up the initial steps someone will come along and take it the rest of the way and post a complete answer.
I did not test it but, it should be doable.
EMR allows to customise the bootstrap actions, i.e. the scripts run where the nodes are started. You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster
See more details at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html.
I would create a shell script to install WebHCat and test your script on a regular EC2 instance first (outside the context of EMR - just as a test to ensure your script is OK)
You can use EC2's user-data properties to test your script, typically :
#!/bin/bash
curl http://path_to_your_install_script.sh | sh
Then - once you know the script is working - make it available to the cluster on a S3 bucket and follow these instructions to include your script as custom bootstrap action of your cluster.
--Seb

Configuring AmazonLinux AMI instances

I am trying to setup an AMI such that, when booted it will auto configure itself with a defined "configuration" somewhere on a server. I came across Chef and Puppet. Considering Puppet, I was able to run though their examples but couldn't see one for auto configuration from master. I found out that Puppet Enterprise is not supported on "Amazon Linux". Team chose Amazon Linux and would like keep that instead of going to other OS just because one tool doesn't support it. Can someone please give me some idea about how I could achieve this? (I am trying to stay away from home grown shell scripts over a good industry adopted tool for maintainability)
What I have done in the past is to copy /etc/rc.local to /etc/rc.local.orig, and then configure /etc/rc.local to kick off a puppet run and then pave over itself.
/etc/rc.local:
#!/bin/bash
##
#add pre-puppeting stuff here, I add the hostname in "User-data" when creating the VM so I can set the hostname before checking in
##
/usr/bin/puppet agent --test
/bin/cp -f /etc/rc.local.orig /etc/rc.local
/sbin/init 6
AWS CloudFormation is one of Amazon's recommended ways to provision servers (and other cloud resources, too). You declare all the resources you need in a JSON file, and specify how to provision each server by declaring packages to install, services to run, files to create, and commands to run when the server is created. See the user guide for more information. I also wrote a couple of blog posts about getting started with it.