How to make Spark save its temp files on S3? - amazon-web-services

I am running Spark jobs on an AWS EMR cluster, submitting them from a client host machine.
The client machine is just an EC2 instance that submits jobs to EMR with YARN in cluster mode.
The problem is that Spark saves temp files of about 200 MB each, like:
/tmp/spark-456184c9-d59f-48f4-9b0560b7d310655/__spark_conf__6943938018805427428.zip
The tmp folder fills up with such files very quickly, and jobs start failing with the error:
No space left on device
I tried configuring spark.local.dir in spark-defaults.conf to point to my S3 bucket, but it prepends the user directory to the path, like this: /home/username/s3a://my-bucket/spark-tmp-folder
Could you please suggest how I can fix this problem?

I uploaded a zip archive with the Spark libs to the S3 bucket.
Then I specified it in spark-defaults.conf, in the property
spark.yarn.archive s3a://mybucket/libs/spark_libs.zip, on the
client host machine that submits the jobs.
Now Spark only stages the configs in the local tmp folder, which takes
only about 170 KB instead of 200 MB.
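
For reference, a rough sketch of how this can be set up; the jars directory and bucket path below are assumptions, so adjust them to your installation:

# On the client host: package the Spark jars once and upload the archive
cd /usr/lib/spark/jars && zip -r /tmp/spark_libs.zip .
aws s3 cp /tmp/spark_libs.zip s3://mybucket/libs/spark_libs.zip

# spark-defaults.conf on the client host that submits the jobs
spark.yarn.archive    s3a://mybucket/libs/spark_libs.zip

With spark.yarn.archive pointing at S3, only the small config archive needs to be staged locally, as described above.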

Related

Nextflow script with both 'local' and 'awsbatch' executor

I have a Nextflow pipeline executed in AWS Batch. Recently, I tried to add a process that uploads files from the local machine to an S3 bucket so I don't have to upload files manually before each run. I wrote a Python script that handles the upload and wrapped it into a Nextflow process. Since I am uploading from the local machine, I want the upload process to run with
executor 'local'
This requires the Fusion filesystem to be enabled in order to have a work directory in S3. But when I enable the Fusion filesystem, I don't have access to my local filesystem. In my understanding, when the Fusion filesystem is enabled, the task runs in a Wave container without access to the host filesystem. Does anyone have experience with running Nextflow with FusionFS enabled, and how do you access the host filesystem? Thanks!
I don't think you need to manage a hybrid workload here. Pipeline inputs can be stored either locally or in an S3 bucket. If your files are stored locally and you specify a working directory in S3, Nextflow will already try to upload them into the staging area for you. For example, if you specify your working directory in S3 using -work-dir 's3://mybucket/work', Nextflow will try to stage the input files under s3://mybucket/work/stage-<session-uuid>. Once the files are in the staging area, Nextflow can then begin to submit jobs that require them.
Note that a Fusion file system is not strictly required to have your working directory in S3. Nextflow includes support for S3. Either include your AWS access and secret keys in your pipeline configuration or use an IAM role to allow your EC2 instances full access to S3 storage.
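If it helps, a minimal nextflow.config sketch under those assumptions; the bucket name, region, and credential values are placeholders:

// nextflow.config
workDir = 's3://mybucket/work'

aws {
    region    = 'us-east-1'        // placeholder region
    accessKey = '<YOUR_ACCESS_KEY>'
    secretKey = '<YOUR_SECRET_KEY>'
}
// Or omit accessKey/secretKey entirely and rely on an IAM role that grants
// the EC2 instances access to S3, as mentioned above.

With this in place, Nextflow stages local input files into the S3 staging area itself; no Fusion filesystem or hybrid 'local' executor process is needed.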

Spark on EMR - Downloading Different Jar Files

Using a bootstrap action, I am downloading a MySQL JAR file into the spark/jars folder. I use the following:
sudo aws s3 cp s3://buck/emrtest/mysql-connector-java-5.1.39-bin.jar /usr/lib/spark/jars
Everything downloads correctly, but I eventually get a provisioning error and the cluster terminates. I get this error:
On 5 slave instances (including i-0505b9beda64e9,i-0f85f4664e1359 and i-00d346a73f717b), application provisioning failed
It doesn't fail on my master node, but it fails on my slave nodes. I have checked my logs and they don't give me any information. Why does this fail, and how would I go about downloading this JAR file to every node in a bootstrap fashion?
Thanks!
I figured out the answer. First off, the logging for this is not there; the master node launches even on a failure.
The problem was that I was retrieving a file from a private S3 bucket. Note: your local AWS configs (credentials) do not get inherited by the nodes in your EMR cluster.
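
To illustrate, a minimal bootstrap-action sketch under that assumption; the bucket name is a placeholder, and the key point is that the cluster's EC2 instance profile (not your local AWS config) must grant read access to the private bucket:

#!/bin/bash
# Runs on every node (master and slaves) while the cluster is provisioning.
# The EMR EC2 instance profile must allow s3:GetObject on the private bucket,
# because local AWS credentials are not inherited by the cluster nodes.
set -e
sudo aws s3 cp s3://my-private-bucket/emrtest/mysql-connector-java-5.1.39-bin.jar /usr/lib/spark/jars/

Registered at cluster creation time, e.g. with --bootstrap-actions Path=s3://my-private-bucket/bootstrap.sh, it runs on every node, which is what makes the JAR available cluster-wide.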

Where to put application property file for spark application running on AWS EMR

I am submitting a Spark application job JAR to EMR, and it uses a property file. I could put the file into S3 and, while creating the EMR cluster, download it and copy it to some location on the EMR box at bootstrap time, if that is the best approach. How can I do this while creating the EMR cluster itself, at bootstrapping time?
Check the following snapshot.
In the "Edit software settings" section you can add your own configuration inline or as a JSON file (stored in an S3 location), and with this setting you can pass configuration parameters to the EMR cluster at creation time. For more details, please check the following links:
Amazon EMR Cluster Configurations
Configuring Applications
AWS CLI
Hope this will help you.
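
A rough sketch of such a configuration JSON, assuming the property file lives in S3; the classification is the standard spark-defaults one, but the property name itself is made up for illustration:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.myapp.properties.path": "s3://mybucket/conf/application.properties"
    }
  }
]

You can paste this into the "Edit software settings" box in the console, or pass it on the CLI with aws emr create-cluster ... --configurations file://./configurations.json.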

How to run Python Spark code on Amazon AWS?

I have written Python code in Spark and I want to run it on Amazon's Elastic MapReduce.
My code works great on my local machine, but I am slightly confused over how to run it on Amazon's AWS.
More specifically, how should I transfer my Python code over to the master node? Do I need to copy my Python code to my S3 bucket and execute it from there? Or should I SSH into the master and scp my Python code to the Spark folder on the master?
For now, I tried running the code locally in my terminal and connecting to the cluster address (I did this by reading the output of the --help flag of spark, so I might be missing a few steps here):
./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
--master spark://hadoop#ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
mypythoncode.py
I tried it with and without my permissions file, i.e.
-i permissionsfile.pem
However, it fails, and the stack trace shows something along the lines of:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
......
......
Is my approach correct and I just need to resolve the access issues to get going, or am I heading in the wrong direction?
What is the right way of doing it?
I searched a lot on YouTube but couldn't find any tutorials on running Spark on Amazon's EMR.
If it helps, the dataset I am working with is part of Amazon's public datasets.
Go to EMR and create a new cluster... [recommendation: start with 1 node only, just for testing purposes].
Click the checkbox to install Spark; you can uncheck the other boxes if you don't need those additional programs.
Configure the cluster further by choosing a VPC and a security key (SSH key, a.k.a. PEM key).
Wait for it to boot up. Once your cluster says "Waiting", you're free to proceed.
[Spark submission via the GUI] In the GUI, you can add a Step and select Spark job, upload your Spark file to S3, and then choose the path to that newly uploaded S3 file. Once it runs, it will either succeed or fail. If it fails, wait a moment and then click "view logs" on that Step's line in the list of steps. Keep tweaking your script until you've got it working (the CLI equivalent is sketched below).
[Submission via the command line] SSH into the driver node following the SSH instructions at the top of the page. Once inside, use a command-line text editor to create a new file and paste the contents of your script in. Then run spark-submit yourNewFile.py. If it fails, you'll see the error output printed straight to the console. Tweak your script and re-run until you've got it working as expected.
Note: running jobs from your local machine against a remote cluster is troublesome because you may actually be making your local instance of Spark responsible for some expensive computations and for data transfer over the network. That's why you want to submit AWS EMR jobs from within EMR.
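For reference, a rough sketch of the CLI equivalent of adding such a Spark step; the cluster ID and S3 path are placeholders:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="My PySpark job",ActionOnFailure=CONTINUE,Args=[s3://mybucket/mypythoncode.py]

The step itself runs on the cluster, so your local machine is not involved in the computation or the data transfer.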
There are typically two ways to run a job on an Amazon EMR cluster (whether for Spark or other job types):
Log in to the master node and run Spark jobs interactively. See: Access the Spark Shell
Submit jobs to the EMR cluster. See: Adding a Spark Step
If you have Apache Zeppelin installed on your EMR cluster, you can use a web browser to interact with Spark.
The error you are experiencing is saying that files were accessed via the s3n: protocol, which requires AWS credentials to be provided. If, instead, the files were accessed via s3:, I suspect that the credentials would be sourced from the IAM role that is automatically assigned to nodes in the cluster, and this error would be resolved.
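To make that concrete, a minimal PySpark sketch of reading input via the s3:// scheme from a script run on the cluster itself (the bucket and prefix are placeholders), relying on the IAM role attached to the EMR nodes rather than explicit keys:

# run on the EMR master node with: spark-submit mypythoncode.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read-example").getOrCreate()

# On EMR, s3:// paths are handled by EMRFS, which picks up the cluster's IAM role,
# so no fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey settings are needed.
df = spark.read.text("s3://mybucket/path/to/input/")
print(df.count())

spark.stop()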

Copy files from S3 onto EC2 instance using Boto3 (script running on local server)?

I'm running a Python script, using Boto3 (first time using boto/3), on my local server, which monitors an S3 bucket for new files. When it detects new files in the bucket, it starts a stopped EC2 instance, which has software loaded onto it to process those files, and then needs to somehow instruct S3/EC2 to copy the new files from S3 to the EC2 instance. How can I achieve that using a Boto3 script running on my local server?
Essentially, the script running locally is the orchestrator of the process: it needs to start the instance when there are new files to process, have them processed on the EC2 instance, and copy the processed files back to S3. I'm currently stuck trying to figure out how to get the files copied over to EC2 from S3 by the script running locally. I'd like to avoid having to download from S3 to the local server and then upload to EC2.
Suggestions/ideas?
You should consider using Lambda for any S3 event-based processing. Why launch and run servers when you don't have to?
If the name of the bucket and the other parameters don't change, you can achieve it simply by having a script on your EC2 instance that pulls the latest content from the bucket, and setting that script to be triggered every time your EC2 instance starts up (a sketch of the local side follows below).
If the S3 command parameters do change and you must run it from your local machine with boto, you'll need to find a way to SSH into the EC2 instance using boto. Check this module: boto.manage.cmdshell and a similar question: Boto Execute shell command on ec2 instance
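
If the startup-script approach works for you, a minimal boto3 sketch of the local orchestrator side; the instance ID, region, bucket, and paths are placeholders:

import boto3

INSTANCE_ID = "i-0123456789abcdef0"   # placeholder: the stopped processing instance
REGION = "us-east-1"                  # placeholder: the instance's region

ec2 = boto3.client("ec2", region_name=REGION)

# Start the stopped instance and wait until it is actually running.
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

# From here, a startup script on the instance itself (for example a cron @reboot
# entry or a systemd unit running `aws s3 sync s3://my-bucket/incoming /data/incoming`)
# pulls the new files down, so nothing passes through the local server.

The local script only starts the instance; the copy from S3 to EC2 happens on the instance at boot, which avoids routing the files through your local server.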