I am downloading a MySQL connector JAR file to the spark/jars folder using a bootstrap action. I use the following:
sudo aws s3 cp s3://buck/emrtest/mysql-connector-java-5.1.39-bin.jar /usr/lib/spark/jars
Everything downloads correctly, but I eventually get a provisioning error and the cluster terminates. I get this error:
On 5 slave instances (including i-0505b9beda64e9,i-0f85f4664e1359 and i-00d346a73f717b), application provisioning failed
It doesn't fail on my master node, but it fails on my slave nodes. I have checked my logs and they don't give me any information. Why does this fail, and how would I go about downloading this jar file to every node in a bootstrap fashion?
Thanks!
I figured out the answer. First off, the logging for this failure is not there; the master node still launches even on a failure.
I was retrieving a file from a private S3 bucket. Note: AWS configs do not get inherited by your EMR cluster nodes, so they could not access the private bucket.
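For illustration, a minimal sketch of such a bootstrap script, assuming the cluster's EC2 instance profile role is granted read access to the private bucket (since local credentials are not inherited):
#!/bin/bash
# Hypothetical bootstrap script (copy-mysql-connector.sh); runs on every node,
# master and slaves alike, before the cluster starts processing data.
# Relies on the cluster's EC2 instance profile role for S3 read access.
set -e
sudo aws s3 cp s3://buck/emrtest/mysql-connector-java-5.1.39-bin.jar /usr/lib/spark/jars/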
Related
I need all the core nodes on an EMR cluster to contain a keystore file in the /usr/local/spark/conf/ directory. This becomes particularly challenging when core nodes are resized, because any newly brought up core node will not have the keystore file and won't even have the /usr/local/spark/conf/ directory. I need to automate the process of populating this directory with the keystore file on any newly brought up core node.
I've created a shell script to create the /usr/local/spark/conf directory and then populate that with the keystore file by fetching it from Amazon S3. The problem is getting this shell script to automatically run on any newly brought up EMR core node.
#!/bin/bash
# Create the Spark conf directory and populate it from S3.
mkdir -p /usr/local/spark/conf/
cd /usr/local/spark/conf/
# Fetch the truststore and keystore from the bucket.
aws s3 cp s3://my_bucket/certs/cacerts .
aws s3 cp s3://my_bucket/certs/keystore.jks .
Yes, you can use the bootstrap action feature to run a predefined script from S3:
You can use a bootstrap action to install additional software or customize the configuration of cluster instances. Bootstrap actions are scripts that run on the cluster after Amazon EMR launches the instance using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data. If you add nodes to a running cluster, bootstrap actions also run on those nodes in the same way. You can create custom bootstrap actions and specify them when you create your cluster.
See https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
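For example, a hedged sketch of attaching the script as a bootstrap action from the AWS CLI (release label, instance settings, key name, and the script's S3 path are placeholders):
aws emr create-cluster \
  --name "spark-with-keystore" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --bootstrap-actions Path=s3://my_bucket/bootstrap/copy-keystore.sh,Name=CopyKeystore
Because bootstrap actions also run on nodes added to a running cluster, any core node brought up during a resize gets the keystore directory populated automatically.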
I am submitting a Spark application job JAR to EMR, and it uses a property file. I could put the property file into S3, then download it while creating the EMR cluster and copy it to some location on the EMR box. Is this the best approach, and how can I do it at bootstrap time, while creating the EMR cluster itself?
Yes, you can do this at cluster creation time. In the "Edit software settings" section of the cluster creation wizard, you can add your own configuration or a JSON file (stored in an S3 location), and using this setting you can pass configuration parameters to the EMR cluster at creation time. For more details, please check the following links:
Amazon EMR Cluster Configurations
Configuring Applications
AWS CLI
Hope this will help you.
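As an illustration, a rough sketch of passing a classification JSON stored in S3 at creation time from the CLI (the file name, bucket, and the spark-defaults property below are placeholders, not taken from the question):
cat > myConfig.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.APP_CONFIG": "/home/hadoop/app.properties"
    }
  }
]
EOF
aws s3 cp myConfig.json s3://my_bucket/config/myConfig.json
aws emr create-cluster \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations https://s3.amazonaws.com/my_bucket/config/myConfig.json
Alternatively, a bootstrap action (as in the previous answer) can simply aws s3 cp the property file onto every node at a fixed path.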
I am running Spark jobs on an AWS EMR cluster, submitting them from a client host machine.
The client machine is just an EC2 instance that submits jobs to EMR with YARN in cluster mode.
The problem is that Spark saves temp files of about 200 MB each, like:
/tmp/spark-456184c9-d59f-48f4-9b0560b7d310655/__spark_conf__6943938018805427428.zip
The tmp folder fills up with such files very fast, and I start getting failed jobs with the error:
No space left on device
I tried to configure spark.local.dir in spark-defaults.conf to point to my S3 bucket, but it prepends the user directory to the path, like this: /home/username/s3a://my-bucket/spark-tmp-folder
Could you please suggest how I can fix this problem?
I uploaded the zip archive with the Spark libs (__spark_conf__6943938018805427428.zip) to the S3 bucket. Then I specified it in spark-defaults.conf on my client host machine that submits the jobs, in the property spark.yarn.archive s3a://mybucket/libs/spark_libs.zip.
Now Spark stages only the configs in the local tmp folder, which take only 170 KB instead of 200 MB.
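A rough sketch of what this looks like, assuming the Spark jars live in /usr/lib/spark/jars on the client host and reusing the bucket path from above:
# Package the Spark jars once and upload them to S3 (paths are assumptions).
cd /usr/lib/spark/jars
zip -r /tmp/spark_libs.zip .
aws s3 cp /tmp/spark_libs.zip s3://mybucket/libs/spark_libs.zip
# Point spark.yarn.archive at the uploaded archive on the submitting host,
# so the large archive is no longer rebuilt and staged in /tmp for every job.
echo "spark.yarn.archive s3a://mybucket/libs/spark_libs.zip" >> "$SPARK_HOME"/conf/spark-defaults.conf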
I have written Python code in Spark and I want to run it on Amazon's Elastic MapReduce (EMR).
My code works great on my local machine, but I am slightly confused about how to run it on AWS.
More specifically, how should I transfer my Python code over to the master node? Do I need to copy my Python code to my S3 bucket and execute it from there? Or should I SSH into the master node and scp my Python code to the Spark folder on the master?
For now, I tried running the code locally on my terminal and connecting to the cluster address (I did this by reading the output of Spark's --help flag, so I might be missing a few steps here):
./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
--master spark://hadoop#ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
mypythoncode.py
I tried it with and without my permissions file i.e.
-i permissionsfile.pem
However, it fails, and the stack trace shows something along the lines of:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
......
......
Is my approach correct, and do I just need to resolve the access issues to get going, or am I heading in the wrong direction?
What is the right way of doing it?
I searched a lot on YouTube but couldn't find any tutorials on running Spark on Amazon's EMR.
If it helps, the dataset I am working on is part of Amazon's public datasets.
Go to EMR and create a new cluster... [recommendation: start with 1 node only, just for testing purposes].
Click the checkbox to install Spark; you can uncheck the other boxes if you don't need those additional programs.
Configure the cluster further by choosing a VPC and a security key (SSH key, a.k.a. pem key).
Wait for it to boot up. Once your cluster says "Waiting", you're free to proceed.
[Spark submission via the GUI] In the GUI, you can add a Step and select a Spark job, upload your Spark file to S3, and then choose the path to that newly uploaded S3 file. Once it runs, it will either succeed or fail. If it fails, wait a moment, and then click "view logs" on that step's line in the list of steps. Keep tweaking your script until you've got it working.
[Submission via the command line] SSH into the driver node following the SSH instructions at the top of the page. Once inside, use a command-line text editor to create a new file, paste the contents of your script in, and then run spark-submit yourNewFile.py. If it fails, you'll see the error output straight to the console. Tweak your script and re-run. Do that until you've got it working as expected.
Note: running jobs from your local machine against a remote cluster is troublesome, because you may actually be making your local instance of Spark responsible for some expensive computations and for data transfer over the network. That's why you want to submit AWS EMR jobs from within EMR.
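For instance, a hedged sketch of the command-line route, reusing the cluster address and key file from the question (EMR's default master user hadoop and an on-PATH spark-submit are assumed):
# Copy the script to the master node, then log in and submit it against YARN.
scp -i permissionsfile.pem mypythoncode.py hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com:~
ssh -i permissionsfile.pem hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com
# On the master node:
spark-submit --master yarn --deploy-mode cluster mypythoncode.py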
There are typically two ways to run a job on an Amazon EMR cluster (whether for Spark or other job types):
Log in to the master node and run Spark jobs interactively. See: Access the Spark Shell
Submit jobs to the EMR cluster. See: Adding a Spark Step
If you have Apache Zeppelin installed on your EMR cluster, you can use a web browser to interact with Spark.
The error you are experiencing says that files were accessed via the s3n: protocol, which requires AWS credentials to be provided. If, instead, the files were accessed via s3:, I suspect that the credentials would be sourced from the IAM role that is automatically assigned to nodes in the cluster, and this error would be resolved.
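A sketch of the second option from the AWS CLI, with a placeholder cluster id and S3 paths; inside the job, reading the public dataset with s3:// paths lets the nodes' IAM role supply the credentials:
# Submit the Python script (uploaded to S3 beforehand) as a Spark step.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=MyPythonJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my_bucket/code/mypythoncode.py]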
I'm running a Python script, using Boto3 (first time using boto/3), on my local server, which monitors an S3 bucket for new files. When it detects new files in the bucket, it starts a stopped EC2 instance, which has software loaded onto it to process said files, and then it needs to somehow instruct S3/EC2 to copy the new files from S3 to EC2. How can I achieve that using a Boto3 script running on my local server?
Essentially, the script running locally is the orchestrator of the process: it needs to start the instance when there are new files to process, have them processed on the EC2 instance, and copy the processed files back to S3. I'm currently stuck trying to figure out how to get the files copied from S3 to EC2 by the script running locally. I'd like to avoid having to download from S3 to the local server and then upload to EC2.
Suggestions/ideas?
You should consider using Lambda for any S3 event-based processing. Why launch and run servers when you don't have to?
If the name of the bucket and other parameters don't change, you can achieve it simply by having a script on your EC2 instance that pulls the latest content from the bucket, and setting this script to be triggered every time your EC2 instance starts up.
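A minimal sketch of such a script, with placeholder bucket and directory names; it assumes the instance's IAM role (or stored credentials) can read the bucket, and it could be wired to run at boot via a @reboot cron entry or a systemd unit:
#!/bin/bash
# sync-from-s3.sh (hypothetical): pull the newest objects into a local working
# directory each time the instance boots.
set -e
mkdir -p /opt/processing/incoming
aws s3 sync s3://my_bucket/incoming/ /opt/processing/incoming/
# Example cron entry (crontab -e):
#   @reboot /opt/processing/sync-from-s3.sh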
If the S3 command parameters do change and you must run the copy from your local machine with boto, you'll need to find a way to SSH into the EC2 instance using boto. Check this module: boto.manage.cmdshell, and a similar question: Boto Execute shell command on ec2 instance.