Is there an `initial_workers` (cluster.yaml) replacement mechanism in ray tune? - ray

Let me briefly describe my use case. Assume I want to spin up a cluster with 10 workers on AWS:
In the past I always used the initial_workers: 10, min_workers: 0, max_workers: 10 options (cluster.yaml) to initially spin up the cluster to full capacity and then exploit the automated downscaling of the cluster based on idle time. So at the end of a job, when almost all trials have terminated and the full capacity of the cluster is no longer needed, nodes are automatically removed.
Now that the initial_workers option is gone (#12444), it is not really clear to me how to accomplish the same downscaling behavior.
I experimented with the programmatic way to request resources (ray.autoscaler.sdk.request_resources) before and after tune.run, but this seems to be the same as setting the min_workers field, and I can only downscale the cluster after all jobs have terminated.
I also tried to set upscaling_speed, but for some reason upscaling is very slow and seems to add only one node at a time (I am requesting GPUs). There is also always only one pending task, which I do not really understand yet. (Unfortunately I also do not really have the time to investigate this fully :()
Currently I am using the programmatic approach described above, which works fine (a rough sketch below), but then I have a lot of idle resources at the end of the job that run for hours before I can downscale.
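Roughly, the workaround looks like the following. This is a minimal sketch, not my exact code: my_trainable, the bundle shapes, and the trial resources are placeholders.

```python
import ray
from ray import tune
from ray.autoscaler.sdk import request_resources


def my_trainable(config):
    ...  # placeholder: actual training code goes here


ray.init(address="auto")

# Ask the autoscaler up front for enough room for ~10 GPU trials
# (placeholder bundle shape: 1 GPU + 4 CPUs per trial).
request_resources(bundles=[{"GPU": 1, "CPU": 4}] * 10)

tune.run(
    my_trainable,
    num_samples=100,                          # placeholder
    resources_per_trial={"cpu": 4, "gpu": 1},
)

# Clear the request afterwards so the autoscaler can idle-terminate
# nodes back down to min_workers.
request_resources(num_cpus=0)
```

The problem is that the explicit request pins the nodes for the whole run, so the cluster only shrinks after everything has finished.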
It would be great if someone could point me in the right direction to solve this.
Thx

With Ray 1.3.0 the autoscaler issues I observed seem to be resolved, and now the cluster scales with the pending trials as expected (using AWS EC2 g4dn instances). So there is no need for the initial_workers option anymore.

Related

GCP Autoscale Down Protection

I have a set of long-running tasks that have to run on Compute Engine and have to scale. Each task takes approximately 3 hours, so to handle this I thought about using the architecture described in
https://cloud.google.com/solutions/using-cloud-pub-sub-long-running-tasks
While it works fine, there is one huge problem: on scale-down, I'd really like to avoid it removing a VM that is currently running a task! I'd potentially lose 3 hours' worth of processing.
Is there a way to ensure that scale-down doesn't remove a VM that has a long-running task in progress?
EDIT: A few people have asked me to elaborate on my task. It's similar to what's described in the link above: many long-running tasks that need to run on a GPU. There is a chunk of data that needs to be processed (video encoding); it takes roughly 4 hours, though anywhere from 1 to 6 hours depending on the length of the video, and once completed it outputs to a bucket. Just like the architecture above, it would be nice to have the cluster scale up based on queue size. But when scaling down, I'd like to ensure it's not removing currently running tasks, which is what is happening now. Being GPU-bound means I can't use the CPU metric.
I think you should probably add more details about what kind of task you are running. However, as Jhon Hanley suggested, it is worth taking a look at Cloud Tasks, and also at the documentation that talks about the scaling risks.
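For context, the worker pattern from the linked Pub/Sub solution looks roughly like the sketch below: pull one message, keep extending its ack deadline while the long job runs, and only acknowledge it once the encode is done, so a VM that does get killed mid-job at least causes the message to be redelivered rather than lost. This does not by itself prevent the scale-in; it is a minimal Python sketch assuming the newer google-cloud-pubsub client, and the project, subscription, and encode_video function are placeholders.

```python
import threading
import time

from google.cloud import pubsub_v1

PROJECT_ID = "my-project"        # placeholder
SUBSCRIPTION = "encode-jobs"     # placeholder

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)


def encode_video(payload: bytes) -> None:
    """Placeholder for the 1-6 hour GPU encoding job."""
    ...


response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
for received in response.received_messages:
    job = threading.Thread(target=encode_video, args=(received.message.data,))
    job.start()

    # Keep the message leased while the job runs so Pub/Sub does not
    # redeliver it to another worker; renew well inside the deadline.
    while job.is_alive():
        subscriber.modify_ack_deadline(
            request={
                "subscription": sub_path,
                "ack_ids": [received.ack_id],
                "ack_deadline_seconds": 600,  # maximum per extension
            }
        )
        time.sleep(300)

    # Only acknowledge once the encode finished, so work lost to a killed
    # VM eventually becomes visible again and is retried elsewhere.
    subscriber.acknowledge(
        request={"subscription": sub_path, "ack_ids": [received.ack_id]}
    )
```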

DynamoDB on-demand mode suddenly stops working

I have a table that is incrementally populated with a lambda function every hour. The write capacity metric is full of predictable spikes and throttling was normally avoided by relying on the burst capacity.
The first three loads after turning on on-demand mode kept working. Thereafter it stopped loading new entries into the table and began to time out (the timeout has been raised from ~10 seconds to its current limit of 4 minutes). The lambda function was not modified at all.
Does anyone know why might this be happening?
EDIT: I just see timeouts in the logs (screenshots in the original post: logs before failure, logs after failure, errors and availability %).
Since you are using Lambda to perform the incremental writes, this issue is more than likely on the Lambda side, and that is where I would start looking. Do you have CloudWatch logs to look through? If you cannot find anything, open a case with AWS support.
Unless this was recently fixed, there is a known bug in Lambda where you can get a series of timeouts. We encountered it on a project I worked on: a lambda would just start up and sit there doing nothing, quite like yours.
So like Kirk, I'd guess the problem is with the Lambda, not DynamoDB.
At the time there was no fix. As a workaround, we had another Lambda checking the one that suffered from failures and rerunning the failed runs (a rough sketch of that watchdog pattern is below). Not sure if there are other solutions. Maybe deleting everything and setting it back up again (with your fingers crossed :))? That should be easy enough if everything is in CloudFormation.
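A hedged sketch of such a watchdog, not our exact code: it runs on a schedule, looks for Lambda timeout messages in the loader's log group over the last hour, and re-invokes the loader asynchronously if it finds any. The function name and the one-hour window are placeholders, and the loader has to be idempotent for the re-run to be safe.

```python
import time

import boto3

TARGET_FUNCTION = "hourly-dynamodb-loader"   # placeholder name

logs = boto3.client("logs")
lambda_client = boto3.client("lambda")


def handler(event, context):
    # "Task timed out" is what the Lambda runtime writes to the log on a timeout.
    one_hour_ago_ms = int((time.time() - 3600) * 1000)
    result = logs.filter_log_events(
        logGroupName=f"/aws/lambda/{TARGET_FUNCTION}",
        startTime=one_hour_ago_ms,
        filterPattern='"Task timed out"',
    )

    if result["events"]:
        # Re-run the loader asynchronously; it must be idempotent
        # (ours only appended the missing hour's data).
        lambda_client.invoke(FunctionName=TARGET_FUNCTION, InvocationType="Event")
```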

Elastic Beanstalk Auto Scaling - Which metric should I use?

I've got an application that is built in Node.js and is primarily used to post photos to (up to 25 MB). The app resizes each photo to thumbnail size and moves both the thumbnail and the full-size image to S3. When the uploads begin, they usually come in bursts of 10-15 pictures, rinse, wash, repeat in 5-minute durations. I'm seeing a lot of scaling, and the trigger is the default 6MB NetworkOut trigger. My question is: is moving the photos to S3 considered NetworkOut? Or should I consider a different scaling trigger? So far the app hasn't stuttered, so I'm hesitant to fix what ain't broken, but I am seeing quite a bit of scaling so I thought I would investigate. Thanks for any help!
The short answer: scale whenever a resource is constrained. E.g., if your instances can't keep up with network IO, or CPU is above 80%, then scale. And yes, sending any data from your EC2 instance is NetworkOut traffic. You've got to get that data from point A to B somehow :)
As you go up in size on EC2 instances you get more memory and CPU along with more network IO. If you don't see issues with transfers, you may want to switch the auto scaling trigger over to watch CPU or memory instead. In an app I'm working on, users can start jobs which require a bit of CPU, so I have my auto scaling set to scale when CPU is over 80%. But you might have a process that consumes a lot of memory and not much CPU...
On a side note, you may want to think about having your uploads go directly to your S3 bucket and using a Lambda to trigger the resize routine. This has several advantages over your current design. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
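The linked AWS example is in Node.js, but the flow is simply: the client uploads straight to an uploads bucket, the S3 event triggers a Lambda, and the Lambda resizes and writes the thumbnail out. A rough Python sketch of that handler, assuming placeholder bucket names and Pillow packaged with the function (or supplied via a layer):

```python
import io

import boto3
from PIL import Image  # must be bundled with the function or provided as a layer

s3 = boto3.client("s3")
THUMBNAIL_BUCKET = "my-app-thumbnails"   # placeholder
THUMBNAIL_SIZE = (200, 200)


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Download the original upload, resize in memory, write the thumbnail.
        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        image = Image.open(io.BytesIO(original))
        image.thumbnail(THUMBNAIL_SIZE)

        buffer = io.BytesIO()
        image.save(buffer, format=image.format or "JPEG")
        buffer.seek(0)
        s3.put_object(Bucket=THUMBNAIL_BUCKET, Key=key, Body=buffer)
```

With that in place, the big full-size upload never touches your Beanstalk instances, so NetworkOut stops being the thing that drives scaling.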
I suggest getting familiar with the instance metrics. You can then recognize your app-specific bottlenecks on the current instance type and count.
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced-metrics.html

Amazon Elasticsearch 1.5 hangs while re-indexing

I'm using a micro ES instance from Amazon with 2 nodes.
However, while I'm re-indexing my data (around 300,000 docs, 300 MB), the instance becomes unresponsive several times. It usually hangs when something tries to read from the instance at the same time.
I'm using this instance in production for my website, and this issue causes me big headaches.
Is anyone experiencing the same issues? Would it help if I moved to:
1) a larger instance?
2) an upgrade to the 2.x version?
Thank you
The only time I've had issues with ES queries being unresponsive during re-indexing is when the resources are exhausted, so I would advocate a larger instance.
You should use CloudWatch metrics to determine the resource usage on your current instance during a period where it's running well and also during your re-index, and use that information to decide on the best instance type. The following table will give you an idea of what resources you get: https://aws.amazon.com/elasticsearch-service/pricing/
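A quick way to pull those numbers, assuming boto3 access to the account; the domain name and client (account) ID dimensions are placeholders, and the domain metrics live in the AWS/ES namespace:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

dimensions = [
    {"Name": "DomainName", "Value": "my-es-domain"},   # placeholder
    {"Name": "ClientId", "Value": "123456789012"},     # placeholder account id
]

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)   # long enough to cover a re-index run

# CPU, JVM memory pressure, and free storage are the usual suspects
# when a small domain hangs under re-indexing load.
for metric in ["CPUUtilization", "JVMMemoryPressure", "FreeStorageSpace"]:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Average"], point["Maximum"])
```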

Using any of the Amazon Web Services, how could I schedule something to happen 1 year from now?

I'd like to be able to create a "job" that will execute in an arbitrary time from now... Let's say 1 year from now. I'm trying to come up with a stable, distributed system that doesn't rely on me maintaining a server and scheduling code. (Obviously, I'll have to maintain the servers to execute the job).
I realize I can poll SimpleDB every few seconds and check to see if there's anything that needs to be executed, but this seems very inefficient. Ideally I could create an Amazon SNS topic that would fire off at the appropriate time, but I don't think that's possible.
Alternatively, I could create a message in Amazon SQS that would not be visible for 1 year. After 1 year it becomes visible, and my polling code picks up on it and executes it.
It would seem this is a topic, like Singletons or Inversion of Control, that PhDs have discussed and come up with best practices for. I can't find the articles, if there are any.
Any ideas?
Cheers!
The easiest way for most people to do this would be to run at least an EC2 server with a cron job on the EC2 server to trigger an action. However, the cost of running an EC2 server 24 hours a day for a year just to trigger an action would be around $170 at the cheapest (8G t1.micro with Heavy Utilization Reserved Instance). Plus, you have to monitor that server and recover from failures.
I have sketched out a different approach to running jobs on a schedule that uses AWS resources completely. It's a bit more work, but does not have the expense or maintenance issues with running an EC2 instance.
You can set up an Auto Scaling schedule (cron format) to start an instance at some point in the future, or on a recurring schedule (e.g., nightly). When you set this up, you specify the job to be run in a user-data script for the launch configuration.
I've written out sample commands in the following article, along with special settings you need to take care of for this to work with Auto Scaling:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
With this approach, you only pay for the EC2 instance hours when the job is actually running and the server can shut itself down afterwards.
This wouldn't be a reasonable way to schedule tens of thousands of emails with an individual timer for each, but it can make a lot of sense for large, infrequent jobs (a few times a day to once per year).
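In boto3 terms, the one-shot schedule part of this looks roughly like the sketch below; the group name, action name, and date are placeholders, and the launch configuration's user-data script is what actually runs the job and shuts the instance down, per the special settings the article describes.

```python
from datetime import datetime

import boto3

autoscaling = boto3.client("autoscaling")

# At the scheduled time (placeholder date, "1 year from now") the group
# scales from 0 to 1, the instance boots, and its user-data script runs
# the job and shuts the instance down.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="yearly-job-group",        # placeholder
    ScheduledActionName="run-yearly-job",           # placeholder
    StartTime=datetime(2026, 3, 1, 2, 0, 0),        # placeholder date/time
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
)
```

A second scheduled action (or the job itself setting the desired capacity back to 0) brings the group back down afterwards so the instance isn't replaced after it terminates.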
I think it really depends on what kind of job you want to execute in 1 year, and whether that value (1 year) is actually hypothetical. There are many ways to schedule a task; Windows and Linux both offer a service for this, Windows with Task Scheduler and Linux with cron. In addition to those operating-system-specific solutions, you can use maintenance tasks on MS SQL Server, and I'm sure many of the larger databases have similar features.
Without knowing more about what you plan on doing, it's kind of hard to suggest any more alternatives, since I think many of the other solutions would be specific to the technologies and platforms you plan on using. If you want to provide some more insight into what you're going to be doing with these tasks, then I'd be more than happy to expand my answer to be more helpful.