how could I launch Trino in AWS EMR? - amazon-web-services

When I create the EMR cluster, in the 'Application' step I have chosen 'Trino'. I can confirm this.
When I connect to the Master Node using SSH, and type 'presto --version' they give me 'presto:command not found'.
Also tried 'presto-cli' as EMR docs said, still got 'presto-cli' not found.
Also,as Trino Docs, I should go to the 'bin/launcher' directory and launch trino. However, I do not know where is this in my Cluster.
I am a noob. Could you please tell me how to run/launch trino in AWS EMR?

If you're looking to startup the Trino CLI, you should use 'trino-cli'. (Recall that PrestoSQL was renamed to Trino a couple of years ago which triggered a whole host of renaming both inside and out.)

Related

Disable multipart upload on EMR

The goal is to disable the multipart upload on Amazon EMR.
The guide says enter classification=core-site,properties=[fs.s3.multipart.uploads.enabled=false] in Edit Software Settings when creating the EMR cluster.
My questions are:
Can we modify the configurations for existing EMR cluster? If so, how to do it?
Can we achieve the same goal by putting sparkSession.sparkContext.hadoopConfiguration.set("fs.s3.multipart.uploads.enabled","false") in the jar to be executed on EMR?
Unfortunately, you cannot currently modify configurations on a running EMR cluster, but if it's possible for you to start a new one, you could use the AWS EMR Console to clone your current cluster's configuration then modify the configuration before launching it. (Note: Only the configuration is cloned, not any of the data that may be stored in HDFS or on the cluster instances' local disks.)
However, I believe that what you asked about in your second question will work as intended. Have you tried this and found it not to work?

run spark on AWS EMR by passing credentials

I am new to EMR and tried to launch Spark job as a step using something like command-runner.jar spark-submit --deploy-mode cluster --class com.xx.xx.className s3n://mybuckets/spark-jobs.jar
However, the spark job needs credentials as environment variables, my question is what is the best way to pass the credentials as environment variables to the spark jobs.
Thanks!
Have a look here: AWS EMR 4.0 - How can I add a custom JAR step to run shell commands and here: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
try running step like this(arguments): /usr/bin/spark-submit --deploy-mode cluster --class
I came to your question googling the solution for myself. Right now as a temp solution, I am passing the credential as cmd line parameters. In the future I am thinking to add a custom bootstrap script which will fetch data from service and create the ~/.aws/credentials and config files.
I hope this helps or if you have discovered any other option do post here.

AWS EMR Spark 1.0

Is there a way to force Amazon EMR to use Spark 1.0.1? The current selectable versions stop at 1.4.1.
I am using the Alternating Least Squares implementation within MLlib, and since v1.1 they have implemented weighted regularization and for specific reasons (research study) I do not want this implementation, rather I am trying to access the non-weighted regularization version they had implemented in v1.0.
I am using Zepplin notebooks with Scala if that helps.
Is working with Zeppelin a requirement? Because if so, it could be very difficult. Zeppelin is compiled against a specific version of Spark so downgrading the jar will most likely fail.
Otherwise, if you are ok with not using Zeppelin and instead using the EMR Step API, then you might be able to spin up an EMR cluster with a bootstrap action that installs spark-assembly 1.0.1. I said it might work, because there's no guarantee that the current EMR version is compatible with a 2 year old version of Spark.
To create the cluster:
Create a cluster from the UI, make sure to uncheck Spark from the additional software menu
Add a custom bootstrap action and use the script at s3://support.elasticmapreduce/spark/install-spark with arguments -v 1.0.1
(See https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark for configuration options)
To run spark using the EMR Step API:
Upload your compiled jar to s3, then submit a step against that cluster
Cluster ID: the id of your cluster (ex j-XXXXXXXX)
Region of cluster. Where you created your EMR cluster. Ex us-west-2
Your spark main class: This is where you put your ml pipeline code.
Your jar: you have to upload the jar with your code to S3 so your cluster can download it
arg1, arg2: arguments to your main (optional)
aws emr add-steps --cluster-id --steps \
Name=SparkPi,Jar=s3://.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,--class,com.your.spark.class.MainApp,s3://>/your.jar,arg1,arg2],ActionOnFailure=CONTINUE
(Taken from the official github repo at https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md)
Also if that fails, install Hadoop and check out https://spark.apache.org/docs/1.0.1/running-on-yarn.html
Or you could also run 1.0.1 locally on your laptop if your data is small.
Good luck.
Amazon EMR provide a list of supported versions of software packages you can install by selecting a drop menu. Nothing stop you from installing additional custom software with a bootstrap action. I had some experience installing java 8 when EMR was supporting only Java 7. It is a bit painful but totally possible.
EMR supports Spark 1.6.0. Take a look at their latest release of emr-4.4.0: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html

Installing Impala 2.3 on Amazon EMR

I see that Impala 2.3 is only supported on Cloudera CDH 5.5 & above. Impala 2.2 can be installed on Amazon EMR as there is Bootstrap script available on GitHub & you don't require Cloudera installation.
However, I don't see any way to install Cloudera CDH 5.5 or 5.6 on Amazon EMR. I want to install Impala 2.3 so is there any way through which Impala 2.3 can be installed on Amazon EMR?
Well, my previous answer has been deleted as long as "does not provide an answer to the question". I'm not going to argue if it's better to have a partially incorrect answer to this question or if making categorical claims without foundation is a good answer :/.
In any case, I'm not giving up :)
Yes, it's possible to install "anything" on the paper.
Once you launch the EMR cluster, all instances will appear on your EC2 console. The only thing is that you have to be careful assigning the right permissions to access thru SSH to your instances. My suggestion is to create a specific security group with the access and assign this extra security group to the instances using the Advanced configuration of the cluster.
By having the proper configuration, you could ssh into any instance and install anything (you should be able to scp any file or download from internet if you have the proper configuration of your VPC). Note that the user will be "hadoop" instead "ec2-root" but this is documented on the EMR user guide.
Keep in mind that the cluster is "Terminated" so, the EMR instances are volatile and the installation is not going to survive the cluster termination.
On the other hand, using the latest versions of EMR AMIs and the latest capabilities of AWS (I think that it was all the time the case, but, it doesn't matter now) you should be able to create some actions on the bootstrap and install anything you want.
Using the "Advanced configuration" of your cluster, you can access to the "Bootstrap" actions to be executed on your cluster. You could even have different actions depending on the node type (master, core, tasks). You should store your scripts (and/or jar files) on an S3 bucket and made this bucket available to your cluster. On the paper, you could install Impala on these EC2 instances comprising the EMR cluster but I'm not sure if this will work.
For more information, you can read http://docs.aws.amazon.com//emr/latest/ManagementGuide/emr-plan-bootstrap.html
And for a previous version of EMR AMI and not so recent version of Impala you can read https://github.com/awslabs/emr-bootstrap-actions/tree/master/impala
Thanks Mark, you forced me to elaborate better my comment.
No, it is not possible to "install" anything on EMR because it's a PaaS provided by AWS. But if your goal is to run a newer version of Impala on AWS, there is an AWS Quick Start path for installing CDH 5.x (including Impala) that makes the process relatively easy.
http://aws.amazon.com/quickstart/

how to run/install oozie in EMR cluster

I want to orchestrate my EMR jobs. so I thought oozie will be good fit. I have done some POCs on oozie workflow but in local mode, its fairly simple and great.
But I dont understand how to use oozie on EMR cluster.
Based on some search I got to know that aws doesnt come with oozie so we have install it explicitly as a bootstrap action.
Most people point to this link
https://github.com/lila/emr-oozie-sample
But since I am new to aws(EMR) I am still confused how to use it.
It will be great, If anyone can simplify it for me providing some steps or something.
Thanks
I have had some question, which i posted to AWS technical support and i got below reply. I tried it and Oozie is all installed and running with no extra efforts required.
In order to have Oozie installed on an EMR cluster you need to install Hue. The reason is that currently Oozie on EMR is installed as a dependency for Hue. Hue is supported on AMIs 3.3.0 and 3.3.1 as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html. After launching an EMR cluster with Hue -> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hue.html installed you should be able to use Oozie immediately as it will be already configured and started.
EMR 4.x and 5.x series releases now come with Oozie as an optional application. There's also been a recent blog post on the AWS Big Data Blog outlining how to get started with it:
https://blogs.aws.amazon.com/bigdata/post/TxZ4KDBGBMZYJL/Use-Apache-Oozie-Workflows-to-Automate-Apache-Spark-Jobs-and-more-on-Amazon-EMR
That github project installs Oozie as well, so you don't need to take care of it. The configuration for the Oozie installation is in the next link:
https://github.com/lila/emr-oozie-sample/blob/master/config/config-oozie.sh
After that, there are some tasks you can execute from the command shell:
create:
ssh:
sshproxy:
socksproxy:
So, if you follow his instructions you only need to run some of this tasks in order to create and execute an EMR task using Oozie.
For those who are interested, I have cloned the repo and updated the Oozie installer script to support Hadoop 2.4.0 and Oozie 4.0.1
https://github.com/davideanastasia/emr-oozie-sample
Firstly, this is not a direct answer to this question.
EMR integrates with Data Pipeline - Amazon's own scheduler and data workflow orchestrator. Amazon expects you to use Data Pipeline with EMR. It can create, start and terminate EMR clusters, managing cluster lifecycle etc. Evaluate that to see if that fits your needs better..