I am new to AWS services and am trying out some use cases. I want to create EMR clusters on demand with some predefined configurations and applications/scripts installed. I was planning to create a snapshot of an existing EMR cluster (or at least the namenode) initially and then use it every time I want to create other clusters. But after some Google searching, I couldn't find any way to capture a snapshot of an EMR cluster. Is it possible to create a snapshot, or is there any other alternative that would help with my use case?
Appreciate any kind of help.
Thanks
It is not possible to create a snapshot of an EMR cluster node, and you cannot use a custom AMI when running a cluster. (Note: newer EMR releases, 5.7.0 and later, do support custom AMIs.) However, you can install software on the cluster nodes at cluster creation time using custom bootstrap actions. You can create your custom bootstrap scripts and use them every time you launch a new cluster. This way you can achieve functionality similar to what you are seeking.
For more information about using bootstrap actions on EMR, see: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#bootstrapCustom
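As a sketch of that workflow with the AWS CLI (the bucket name, packages, and instance settings below are hypothetical placeholders, and the commands require configured AWS credentials):

```shell
# 1) Write a bootstrap script; it runs on every node when the cluster is created.
cat > install-tools.sh <<'EOF'
#!/bin/bash
set -euo pipefail
sudo yum install -y htop                                # install any packages you need
aws s3 cp s3://my-bucket/scripts/ /home/hadoop/scripts/ --recursive  # pull your own scripts
EOF

# 2) Upload the script to S3 and reference it at launch time.
aws s3 cp install-tools.sh s3://my-bucket/bootstrap/install-tools.sh
aws emr create-cluster \
  --name "preconfigured-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/install-tools.sh,Name=InstallTools
```

Because the script lives in S3, every cluster you launch with this `--bootstrap-actions` flag comes up with the same software installed, which is the repeatability a snapshot would have given you.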
Let us know if you need any further assistance.
Related
I have a recurring task where I need to clone an existing EMR cluster (except with a different name). I have been doing this in the AWS Console (basically, finding the EMR cluster in the console, clicking "Clone", changing the name, then "Create cluster"). Is there a way to do this from the command line so that I can automate it? I have checked aws emr create-cluster help but nothing seems relevant. Thanks!
I think this is what you are looking for:
Assuming that you want the cluster to be a clone of the starting state of the original cluster, just create the first EMR cluster from a CloudFormation template and then create new clusters from the same template as needed. Here's an example template.
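A minimal sketch of that approach (the release label, instance types, and IAM roles in the template are placeholders to adapt to your setup):

```shell
# Define the cluster once as a CloudFormation template.
cat > emr-cluster.yaml <<'EOF'
Parameters:
  ClusterName:
    Type: String
Resources:
  Cluster:
    Type: AWS::EMR::Cluster
    Properties:
      Name: !Ref ClusterName
      ReleaseLabel: emr-5.30.0
      Applications:
        - Name: Spark
      Instances:
        MasterInstanceGroup:
          InstanceCount: 1
          InstanceType: m5.xlarge
        CoreInstanceGroup:
          InstanceCount: 2
          InstanceType: m5.xlarge
      JobFlowRole: EMR_EC2_DefaultRole
      ServiceRole: EMR_DefaultRole
EOF

# Launch as many identically configured clusters as you need,
# changing only the name parameter each time.
aws cloudformation create-stack --stack-name analytics-a \
  --template-body file://emr-cluster.yaml \
  --parameters ParameterKey=ClusterName,ParameterValue=analytics-a
```

Deleting the stack tears the cluster down again, so this also doubles as a clean automation path for short-lived clusters.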
Cloning a Cluster Using the Console
You can use the Amazon EMR console to clone a cluster, which makes a copy of the configuration of the original cluster to use as the basis for a new cluster.
To clone a cluster using the console
From the Cluster List page, click a cluster to clone.
At the top of the Cluster Details page, click Clone.
In the dialog box, choose Yes to include the steps from the original cluster in the cloned cluster. Choose No to clone the original cluster's configuration without including any of the steps.
Note
For clusters created using AMI 3.1.1 and later (Hadoop 2.x) or AMI 2.4.8 and later (Hadoop 1.x), if you clone a cluster and include steps, all system steps (such as configuring Hive) are cloned along with user-submitted steps, up to 1,000 total. Any older steps that no longer appear in the console's step history cannot be cloned. For earlier AMIs, only 256 steps can be cloned (including system steps). For more information, see Submit Work to a Cluster.
The Create Cluster page appears with a copy of the original cluster's configuration. Review the configuration, make any necessary changes, and then click Create Cluster.
We are moving from an on-premises environment to Google Cloud Dataproc for Spark jobs. I am able to build the cluster and SSH to the master node for job execution. I am not clear on how to build the edge node where we can allow users to log in and submit jobs. Is it going to be another GCE VM? Any thoughts or best practices?
A new VM instance is a good option to map the EdgeNode role from other architectures:
You can execute your job from the Master node which you can make accessible through SSH.
You will need to find a balance between simplicity (SSH access to the master) and security (a dedicated edge node).
Please note that IAM can help to allow individual users to submit jobs by assigning Dataproc Editor role.
Don't forget the ability Dataproc offers to create ephemeral clusters. This means you create a cluster, execute your job, and then delete the cluster.
Using ephemeral clusters avoids unnecessary costs. The script you create for this can be executed from any machine that has the Google Cloud SDK installed, e.g. on-prem servers or your PC.
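The points above might look like this with the gcloud CLI; the project, cluster name, region, user, and job jar are all hypothetical, and the commands assume an authenticated Cloud SDK:

```shell
# Grant an individual user permission to submit jobs (the IAM point above).
gcloud projects add-iam-policy-binding my-project \
  --member=user:analyst@example.com --role=roles/dataproc.editor

# Ephemeral pattern: create the cluster, run the job, delete the cluster.
gcloud dataproc clusters create ephemeral-spark \
  --region=us-central1 --num-workers=2

gcloud dataproc jobs submit spark \
  --cluster=ephemeral-spark --region=us-central1 \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/jobs/my-job.jar

gcloud dataproc clusters delete ephemeral-spark \
  --region=us-central1 --quiet
```

Because `gcloud dataproc jobs submit` goes through the Dataproc API rather than SSH, users never need shell access to the master, which sidesteps the edge-node question for many teams.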
I have a Dataproc cluster with 10 nodes and Presto installed. The autoscaling function of the cluster is on. I wonder: when Presto is running and the cluster scales up, will Presto be able to pick up and use the additional nodes automatically? I didn't find an answer in Google's docs.
My concern is that if I need to manually restart Presto, it defeats the purpose of autoscaling. My hope is that the cluster can autoscale when Presto gets a larger job.
Presto will automatically pick up new nodes as the cluster scales.
However, be aware that Dataproc autoscaling currently only supports scaling based on YARN metrics (see the docs). Your cluster won't autoscale based on Presto query load, but rather the load on YARN.
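For reference, a Dataproc autoscaling policy is defined entirely in terms of YARN metrics; a sketch follows (the numbers are illustrative placeholders, not tuning recommendations, and the policy/cluster names are hypothetical):

```shell
# Every knob in the policy refers to YARN (pending/available memory),
# which is why Presto query load alone won't trigger scaling.
cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

gcloud dataproc autoscaling-policies import presto-cluster-policy \
  --source=autoscaling-policy.yaml --region=us-central1

gcloud dataproc clusters update my-cluster \
  --autoscaling-policy=presto-cluster-policy --region=us-central1
```

So if your cluster runs Presto alongside YARN workloads, scaling events will be driven by the YARN side, and Presto simply benefits from whatever nodes appear.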
The goal is to disable multipart upload on Amazon EMR.
The guide says enter classification=core-site,properties=[fs.s3.multipart.uploads.enabled=false] in Edit Software Settings when creating the EMR cluster.
My questions are:
Can we modify the configurations of an existing EMR cluster? If so, how?
Can we achieve the same goal by putting sparkSession.sparkContext.hadoopConfiguration.set("fs.s3.multipart.uploads.enabled","false") in the jar to be executed on EMR?
Unfortunately, you cannot currently modify configurations on a running EMR cluster, but if it's possible for you to start a new one, you could use the AWS EMR Console to clone your current cluster's configuration then modify the configuration before launching it. (Note: Only the configuration is cloned, not any of the data that may be stored in HDFS or on the cluster instances' local disks.)
However, I believe that what you asked about in your second question will work as intended. Have you tried this and found it not to work?
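To illustrate the first point: since the setting can't be changed in place, you'd bake the classification into the new cluster at launch. A sketch (cluster name, size, and release label are placeholders):

```shell
# Same classification the guide describes, expressed as a JSON configurations file.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.multipart.uploads.enabled": "false"
    }
  }
]
EOF

aws emr create-cluster \
  --name "no-multipart" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --configurations file://configurations.json
```

The per-job approach from the second question sets the same `fs.s3.multipart.uploads.enabled` key on the Hadoop configuration at runtime, so it targets the same setting, just scoped to that Spark application rather than the whole cluster.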
I just deployed a Presto Sandbox cluster on AWS using EMR. Is there any way to add connectors to my Presto cluster apart from manually (ssh) creating the properties and then restarting the cluster?
If you're looking for a UI to add a connector, Presto itself doesn't offer that and as far as I know Amazon EMR doesn't either. I'm afraid you'll have to add connectors manually by SSH-ing to the master node, creating the appropriate file, distributing it to all the nodes and then restarting everything.
Adding connectors to Presto on EMR does require a manual restart, as you mention. You might be able to use a CloudFormation template to automate some of this, or you can try something like Ahana Cloud (https://ahana.io/ahana-cloud/), which is a managed service for Presto on AWS.
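For completeness, the manual route on an EMR master node looks roughly like this; the PostgreSQL connector and the connection details are hypothetical stand-ins for whichever connector you need, and the config path/restart command can differ between EMR releases:

```shell
# Create a catalog properties file for the new connector.
sudo tee /etc/presto/conf/catalog/postgresql.properties >/dev/null <<'EOF'
connector.name=postgresql
connection-url=jdbc:postgresql://db.example.com:5432/mydb
connection-user=presto
connection-password=secret
EOF

# Copy the same catalog file to every worker node, then restart Presto
# on each node so the connector is loaded.
sudo systemctl restart presto-server   # newer EMR (Amazon Linux 2); older releases use stop/start
```

Until a restart happens on all nodes, queries against the new catalog will fail, which is why there's no hot-add path today.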