Will DataProc Presto pick up new nodes automatically?

I have a DataProc cluster with 10 nodes and Presto installed. Autoscaling is enabled on the cluster. If the cluster scales up while Presto is running, will Presto pick up and use the additional nodes automatically? I didn't find an answer in Google's docs.
My concern is that if I need to restart Presto manually, it defeats the purpose of autoscaling. My hope is that the cluster can autoscale when Presto gets a larger job.

Presto will automatically pick up new nodes as the cluster scales.
However, be aware that Dataproc autoscaling currently only supports scaling based on YARN metrics (see the docs). Your cluster won't autoscale based on Presto query load, but rather the load on YARN.
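One way to confirm that new workers have joined is to query Presto's `system.runtime.nodes` table before and after a scale-up. The following is a minimal sketch, assuming a DB-API style cursor obtained from a Presto client library of your choice; the helper name `count_active_workers` is hypothetical.

```python
def count_active_workers(cursor):
    """Return how many active, non-coordinator nodes Presto currently sees.

    `cursor` is assumed to be a DB-API style cursor connected to the
    Presto coordinator (an assumption; any client exposing execute()
    and fetchall() would do).
    """
    cursor.execute(
        "SELECT node_id, coordinator, state FROM system.runtime.nodes"
    )
    rows = cursor.fetchall()
    # Count only worker nodes that are ready to take work.
    return sum(
        1
        for _node_id, is_coordinator, state in rows
        if not is_coordinator and state == "active"
    )
```

If the count rises after the cluster scales up, Presto has registered the new nodes without a restart.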


Turn on/off AWS EMR clusters

How can I turn EMR clusters on and off? The only option I can find is to terminate a cluster permanently. What if I do not need the cluster at night and I do not want to create a new cluster every morning?
You can't do this. Stopping an EMR cluster is not supported. You simply terminate it when you don't need it.
To protect your data, you should be using EMRFS, which allows an EMR cluster to read data directly from S3. This way, there is no need to copy any data from S3 to HDFS, and nothing is lost when the cluster is terminated.
You can enable the scale-up/scale-down policies available in the EMR UI and resize your cluster based on multiple metrics, e.g. RAM/CPU utilization. You can also create an external job that sends scale-up/scale-down commands to EMR via awscli, and schedule such jobs to run in the morning and in the evening.
In my experience, resizing works well on task nodes, while resizing core nodes requires an HDFS sync that only works if you aren't running any tasks on the cluster.
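The scheduled resize job described above could also be written with boto3 instead of awscli. This is a rough sketch, assuming the cluster uses instance groups (not instance fleets) and has a single task group; the helper names and the cluster ID in the usage comment are hypothetical.

```python
def find_task_group_id(emr_client, cluster_id):
    """Return the ID of the first TASK instance group in the cluster."""
    response = emr_client.list_instance_groups(ClusterId=cluster_id)
    for group in response["InstanceGroups"]:
        if group["InstanceGroupType"] == "TASK":
            return group["Id"]
    raise LookupError(f"no TASK instance group found in {cluster_id}")


def resize_task_group(emr_client, cluster_id, target_count):
    """Ask EMR to resize the task group to `target_count` instances."""
    group_id = find_task_group_id(emr_client, cluster_id)
    emr_client.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[
            {"InstanceGroupId": group_id, "InstanceCount": target_count}
        ],
    )
    return group_id


# Usage (hypothetical cluster ID; run from a scheduler such as cron):
#   import boto3
#   emr = boto3.client("emr")
#   resize_task_group(emr, "j-XXXXXXXXXXXXX", 10)  # morning: scale up
#   resize_task_group(emr, "j-XXXXXXXXXXXXX", 1)   # evening: scale down
```

Only the task group is resized here, since resizing core nodes involves HDFS as noted above.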

Autoscaling a running Hadoop cluster setup on AWS EC2

My goal is to understand how I can auto-scale a Hadoop cluster on AWS EC2.
I am exploring AWS offerings from an elastic scaling perspective, both for Hadoop as a service (EMR) and for Hadoop on EC2.
For EMR, I gathered that performance metrics can be monitored using CloudWatch and the user can be alerted once they reach a set threshold; the cluster can then be scaled up or down depending on its utilization.
This approach would require some custom implementation to automate the steps. (Correct me if I am missing anything here.)
For Hadoop on EC2, I came across the Auto Scaling option, which can add or remove instances according to configured scaling policies.
But I am not clear on how a newly added node would get bootstrapped into the cluster automatically. How would YARN know that it can spawn a new container on this newly added node?
Does auto-scaling work for a master-slave kind of setup as well, or is it limited to web applications?
There is also 'Qubole', which offers services to manage Hadoop on AWS. Should that be used to manage cluster scaling automatically?

What should be suitable configuration to set up 2-3 node hadoop cluster on AWS?

I want to set up Hive, HBase, Solr and Tomcat on a Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR or to use EC2 and set up the cluster manually.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (e.g. Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to deploy your own Hadoop cluster on Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR

Presto Sandbox cluster on AWS EMR - add connector (catalog/.properties)

I just deployed a Presto Sandbox cluster on AWS using EMR. Is there any way to add connectors to my Presto cluster other than creating the properties files manually (over SSH) and then restarting the cluster?
If you're looking for a UI to add a connector, Presto itself doesn't offer that and as far as I know Amazon EMR doesn't either. I'm afraid you'll have to add connectors manually by SSH-ing to the master node, creating the appropriate file, distributing it to all the nodes and then restarting everything.
Adding connectors to Presto on EMR does require a manual restart, as you mention. You might be able to use a CloudFormation template (CFT) to automate some of this, or you can try something like Ahana Cloud https://ahana.io/ahana-cloud/ which is a managed service for Presto in AWS.
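The "appropriate file" in the manual steps above is a catalog properties file, one per connector. As a small sketch of what generating one might look like, assuming a MySQL connector with placeholder host and credentials (all values below are hypothetical):

```python
def render_catalog_properties(settings):
    """Render a dict of connector settings as Presto .properties text."""
    lines = [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only; substitute your own.
mysql_catalog = render_catalog_properties({
    "connector.name": "mysql",
    "connection-url": "jdbc:mysql://example-host:3306",
    "connection-user": "presto_user",
    "connection-password": "change-me",
})

print(mysql_catalog)
```

The resulting text would be saved as `mysql.properties` in the catalog directory on every node (`etc/catalog/` in a stock Presto install; the exact path on EMR is not given in this thread), followed by a Presto restart.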

Possibility of taking snapshot of AWS EMR cluster or namenode

I am new to AWS services and am trying out some use cases. I want to create EMR clusters on demand with some predefined configurations and applications/scripts installed. I was planning to create a snapshot of an existing EMR cluster, or at least of the namenode, and then use it every time I want to create another cluster. But after some Google searching, I couldn't find any way to capture a snapshot of an EMR cluster. Is it possible to create a snapshot? Or is there an alternative that would help with my use case?
Appreciate any kind of help.
It is not possible to create a snapshot of an EMR cluster node, and you cannot use a custom AMI when running a cluster (EMR releases 5.7.0 and later do support custom AMIs). However, you can install software on the cluster nodes at cluster creation time using custom bootstrap actions. You can create your own bootstrap scripts and use them every time you launch a new cluster. This way you can achieve functionality similar to what you are seeking.
For more information using bootstrap actions on EMR please visit: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#bootstrapCustom
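As an illustration, a bootstrap action can be described in the shape that boto3's `run_job_flow` expects for its `BootstrapActions` parameter. This is only a sketch; the helper name, bucket, script path and argument are hypothetical.

```python
def bootstrap_action(name, script_s3_path, args=()):
    """Build one entry for the BootstrapActions list of run_job_flow."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,  # script stored in S3, run on each node
            "Args": list(args),
        },
    }


# Hypothetical script location; replace with your own bucket and script.
actions = [
    bootstrap_action(
        "install-tools",
        "s3://my-bucket/bootstrap/install-tools.sh",
        ["--verbose"],
    )
]

# These would be passed as BootstrapActions=actions to
# boto3.client("emr").run_job_flow(...), alongside the other required
# cluster settings (name, release label, instances, roles).
print(actions)
```

Keeping the script in S3 means every new cluster starts from the same setup, which covers most of what a snapshot would have provided.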
Let us know if you need any further assistance.