Is it possible to use Auto Scaling with Elastic MapReduce? - amazon-web-services

I would like to know if I can use Auto Scaling to automatically scale Amazon EC2 capacity up or down according to CPU utilization with Elastic MapReduce.
For example, I start a MapReduce job with only 1 instance, but if that instance reaches, say, 50% utilization, I want the Auto Scaling group I created to start a new instance. Is this possible?
Do you know if it is possible? Or, because Elastic MapReduce is "elastic", does it automatically start more instances when it needs them, without any configuration?

You need Qubole: http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/
We have never seen any of our users/customers use vanilla auto-scaling successfully with Hadoop. Hadoop is stateful. Nodes hold HDFS data and intermediate outputs. Deleting nodes based on CPU/memory just doesn't work. Adding nodes needs sophistication - this isn't a website. One needs to look at the sizes of the jobs submitted and the speed at which they are completing.
We run the largest Hadoop clusters, easily, on AWS (for our customers). And they auto-scale all the time. And they use spot instances. And it costs the same as EMR.

No, Auto Scaling cannot be used with Amazon Elastic MapReduce (EMR).
It is possible to scale EMR via API or Command-Line calls, adding and removing Task Nodes (which do not host HDFS storage). Note that it is not possible to remove Core Nodes (because they host HDFS storage, and removing nodes could lead to lost data). In fact, this is the only difference between Core and Task nodes.
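As a rough sketch (the cluster ID is a placeholder), resizing a cluster's Task instance group via the API could look like this with boto3:

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID

# Find the TASK instance group; Core nodes host HDFS and should not be shrunk.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Grow (or shrink) the Task group to the desired number of nodes.
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": 5}],
)
```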
It is also possible to change the number of nodes from within an EMR "Step". Steps are executed sequentially, so the cluster could be made larger prior to a step requiring heavy processing, and could be reduced in size in a subsequent step.
From the EMR Developer Guide:
You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.
CPU would not be a good metric on which to base scaling of an EMR cluster, since Hadoop will keep all nodes as busy as possible while a job is running. A better metric would be the number of jobs waiting, so that the cluster can be sized to finish them more quickly.
See also:
Stackoverflow: Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?
Stackoverflow: Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?

Related

Increase and decrease AWS instances CPUs automatically

Is there a way in AWS to increase and decrease an instance's CPUs depending on load? I have been paying a lot of money for AWS by statically increasing and decreasing instance cores, even when no clients are using the software.
To be more specific, clients can upload an Excel file and the software will do some calculations that take time depending on the instance's cores. With 2 cores it takes 30 minutes to complete, and with 96 cores it takes only a couple of minutes.
Is there a way to automatically increase the cores to 96 when clients are using the website and uploading files, and automatically decrease the cores to 2 when nothing is happening and clients are either not using the website or just viewing existing data without taking a new action?
If not, can I add a schedule in AWS to change the instance type? For example, run the instance on a 2-core type (e.g. t2.large), change the instance type to 96 cores (e.g. c5a.24xlarge) only from 1pm-6pm, and then switch it back to 2 cores after that?
I'm very new to AWS and DevOps in general, and I have been reading about AWS Auto Scaling groups, but I'm not sure if this is the answer to my problem.
No, it is not possible to "scale CPU cores". (Commonly known as Vertical scaling.)
Instead, the recommended method is to add/remove parallel capacity based upon demand.
If you are using Amazon EC2, then you can launch more instances or terminate existing instances. This can be automated through Amazon EC2 Auto Scaling, which can monitor metrics (e.g. CPU utilization) and then launch/terminate instances automatically. You would typically put a Load Balancer in front of these instances if they are web servers, or the instances might be 'worker nodes' that pull work from a queue.
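As a minimal sketch (the group name is hypothetical), a target tracking policy on average CPU could be attached to an existing Auto Scaling group like this with boto3:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Placeholder group name; keeps average CPU near 50% by adding/removing instances.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```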
If you are using containers (Docker, Kubernetes) then Amazon ECS/Amazon EKS can automatically add/remove tasks to meet demand for your application.
If you are using AWS Lambda functions, then they 'scale' by allowing multiple functions to run in parallel. Lambda functions typically exit when they have finished processing, so there is no charge when there is nothing to process.
These are all examples of Horizontal scaling, where capacity is added/removed in parallel.

Autoscaling Kubernetes based on number of Jobs on AWS EKS

My cluster sometimes gets a "burst" of information and generates a large number of Kubernetes Jobs at once. And in other times I have ~0 active jobs.
I'm wondering how can I make it autoscale the number of nodes to continuously be able to process all these jobs in a reasonable time-frame.
I specifically use AWS EKS and each job takes a few minutes to complete.
EKS allows you to deploy the Cluster Autoscaler, so that when a new Job cannot be scheduled due to a lack of available CPU/memory, an extra node is added to the cluster.
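As a rough illustration (the image name and resource sizes are placeholders), submitting Jobs with explicit CPU/memory requests is what gives the Cluster Autoscaler something to act on; a minimal sketch with the official Python client:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
batch = client.BatchV1Api()

# Placeholder image and sizes; explicit requests let the Cluster Autoscaler
# see that pending Jobs need an extra node.
job = client.V1Job(
    metadata=client.V1ObjectMeta(generate_name="burst-job-"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="worker",
                        image="my-worker:latest",
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "1", "memory": "2Gi"}
                        ),
                    )
                ],
            )
        )
    ),
)
batch.create_namespaced_job(namespace="default", body=job)
```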

How to Automate Redshift Cluster Start/Stop for night time?

I have an AWS Redshift cluster (dc2.8xlarge) and currently I am paying a huge bill each month for running the cluster 24/7.
Is there a way I can automate the cluster uptime so that it runs during the day, and I can stop the cluster at 8 PM in the evening and start it again at 8 AM in the morning?
Update: Stop/Start is now available. See: Amazon Redshift launches pause and resume
Amazon Redshift does not have a Start/Stop concept. However, there are a few options...
You could resize the cluster so that it costs less. A Redshift cluster is sized for compute and for storage. You could reduce the number of nodes as long as you retain enough nodes for your storage needs.
Also, Amazon Redshift has introduced RA3 nodes with managed storage, enabling independent compute and storage scaling, which means you might be able to scale down to a single node. (This is a new node type, so I'm not sure how it works.)
Another option is to take a Snapshot and Shutdown the cluster. This will result in no costs for the cluster (but the Snapshot will be charged). Then, create a new cluster from the Snapshot when you want the cluster again.
Scheduling the above can be done in Amazon CloudWatch Events, which can trigger an AWS Lambda function. Within the function, you can make the necessary API calls to the Amazon Redshift service.
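As a minimal sketch (the cluster identifier is a placeholder), a Lambda function using the pause/resume actions mentioned in the update above could look like this:

```python
import boto3

redshift = boto3.client("redshift")
CLUSTER_ID = "my-cluster"  # placeholder cluster identifier

def handler(event, context):
    # The CloudWatch Events rule passes {"action": "pause"} or {"action": "resume"}.
    if event.get("action") == "pause":
        redshift.pause_cluster(ClusterIdentifier=CLUSTER_ID)
    else:
        redshift.resume_cluster(ClusterIdentifier=CLUSTER_ID)
```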
If you are concerned with the general cost of your cluster, you might want to downsize from the dc2.8xlarge. You could either use multiple dc2.large nodes, or even consider a move to ds2.xlarge, which has a lower cost per TB of data stored.
Good news :)
We can now pause and resume the Redshift cluster (from both the Console and the CLI).
Check out the link:
https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
Now we can pause and resume an AWS Redshift cluster.
We can also schedule the pause and the resume, which is a very important feature to check on the costs.
Link: https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
This will help you automate the cluster uptime and downtime, so that the cluster runs during the day, is paused automatically at a specific time in the evening, and starts again automatically in the morning.
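As a rough sketch of scheduling the pause (cluster name, role ARN, and schedule are placeholders), Redshift scheduled actions can be created with boto3:

```python
import boto3

redshift = boto3.client("redshift")

# Placeholder names/ARNs; pauses the cluster at 20:00 UTC every day.
redshift.create_scheduled_action(
    ScheduledActionName="pause-nightly",
    TargetAction={"PauseCluster": {"ClusterIdentifier": "my-cluster"}},
    Schedule="cron(0 20 * * ? *)",
    IamRole="arn:aws:iam::123456789012:role/RedshiftScheduledActionRole",
)

# A matching morning action would use {"ResumeCluster": {...}} with its own cron expression.
```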
It's pretty easy to use the open-source https://cloudcustodian.io to automate nighttime/weekend off-hours on Redshift and other AWS resources.

Autoscaling a Cassandra cluster on AWS

I have been trying to auto-scale a 3-node Cassandra cluster with Replication Factor 3 and Consistency Level 1 on Amazon EC2 instances. Despite the load balancer, one of the autoscaled nodes has zero CPU utilization while the other autoscaled node has considerable traffic on it.
I have experimented more than 4 times with auto-scaling a 3-node cluster with RF 3 / CL 1, and the CPU utilization on one of the autoscaled nodes is still zero. The overall CPU utilization drops, but one of the autoscaled nodes is consistently idle from the point of auto scaling.
Note that the two nodes launched at the point of autoscaling are started by the same launch configuration. The two nodes have the same configuration in every aspect. There is an alarm that triggers the addition of the nodes, and the scaling policy is set as per that alarm.
Can there be a bash script that can be run on the user data?
For example, altering the keyspaces?
Can someone let me know what could be the reason behind this behavior?
AWS auto scaling and load balancing are not a good fit for Cassandra. Cassandra has its own built-in clustering with seed nodes to discover the other members of the cluster, so there is no need for an ELB. And auto scaling can screw you up because the data has to be re-balanced between the nodes.
https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
Yes, you don't need an ELB for Cassandra.
So you created a single-node Cassandra cluster and created some keyspace. Then you scaled Cassandra to three nodes. You found one new node was idle when accessing the existing keyspace. Is this understanding correct? Did you alter the existing keyspace's replication factor to 3? If not, the existing keyspace's data will still have 1 replica.
When adding the new nodes, Cassandra will automatically balance some tokens to the new nodes. This is probably why you are seeing load on one of the new nodes, which happens to get some tokens that hold keyspace data.
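As a rough sketch of altering the keyspace (contact point and keyspace name are placeholders), using the Python driver:

```python
from cassandra.cluster import Cluster

# Placeholder contact point and keyspace name.
cluster = Cluster(["10.0.0.1"])
session = cluster.connect()

# Raise the replication factor so data is replicated to the new nodes,
# then run `nodetool repair` on each node to stream the existing data.
session.execute(
    "ALTER KEYSPACE my_keyspace "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
)
```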

Kubernetes - adding more nodes

I have a basic cluster, which has a master and 2 nodes. The 2 nodes are part of an aws autoscaling group - asg1. These 2 nodes are running application1.
I need to be able to have further nodes, that are running application2 be added to the cluster.
Ideally, I'm looking to maybe have a multi-region setup, whereby aplication2 can be run in multiple regions, but be part of the same cluster (not sure if that is possible).
So my question is, how do I add nodes to a cluster, more specifically in AWS?
I've seen a couple of articles where people have spun up the instances and then manually logged in to install the kubelet and various other things, but I was wondering if it could be done in a more automatic way?
Thanks
If you followed these instructions, you should have an autoscaling group for your minions.
Go to the AWS console and scale up the autoscaling group. That should do it.
If you did it somehow manually, you can clone a machine by selecting an existing minion/slave and choosing "launch more like this".
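For example, a minimal sketch of scaling the group up programmatically (assuming it is named asg1, as in the question):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Bump the desired capacity of the minion group (asg1 in the question) by one node.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["asg1"]
)["AutoScalingGroups"][0]

autoscaling.set_desired_capacity(
    AutoScalingGroupName="asg1",
    DesiredCapacity=group["DesiredCapacity"] + 1,
    HonorCooldown=False,
)
```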
As Pablo said, you should be able to add new nodes (in the same availability zone) by scaling up your existing ASG. This will provision new nodes that will be available for you to run application2. Unless your applications can't share the same nodes, you may also be able to run application2 on your existing nodes without provisioning new nodes if your nodes are big enough. In some cases this can be more cost effective than adding additional small nodes to your cluster.
To your other question, Kubernetes isn't designed to be run across regions. You can run a multi-zone configuration (in the same region) for higher availability applications (which is called Ubernetes Lite). Support for cross-region application deployments (Ubernetes) is currently being designed.