Auto scaling batch jobs on Elastic Beanstalk - amazon-web-services

I am trying to set up a scalable background image processing using beanstalk.
My setup is the following:
Application server (running on Elastic Beanstalk) receives a file, puts it on S3 and sends a request to process it over SQS.
Worker server (also running on Elastic Beanstalk) polls the SQS queue, takes the request, load original image from S3, processes it resulting in 10 different variants and stores them back on S3.
These upload events are happening at a rate of about 1-2 batches per day, 20-40 pics each batch, at unpredictable times.
Problem:
I am currently using one micro-instance for the worker. Generating one variant of the picture can take anywhere from 3 seconds to 25-30 (it seems first ones are done in 3, but then micro instance slows down, I think this is by its 2 second bursty workload design). Anyway, when I upload 30 pictures that means the job takes: 30 pics * 10 variants each * 30 seconds = 2.5 hours to process??!?!
Obviously this is unacceptable, I tried using "small" instance for that, the performance is consistent there, but its about 5 seconds per variant, so still 30*10*5 = 26 minutes per batch. Still not really acceptable.
What is the best way to attack this problem which will get fastest results and will be price efficient at the same time?
Solutions I can think of:
Rely on beanstalk auto-scaling. I've tried that, setting up auto scaling based on CPU utilization. That seems very slow to react and unreliable. I've tried setting measurement time to 1 minute, and breach duration at 1 minute with thresholds of 70% to go up and 30% to go down with 1 increments. It takes the system a while to scale up and then a while to scale down, I can probably fine tune it, but it still feels weird. Ideally I would like to get a faster machine than micro (small, medium?) to use for these spikes of work, but with beanstalk that means I need to run at least one all the time, since most of the time the system is idle that doesn't make any sense price-wise.
Abandon beanstalk for the worker, implement my own monitor of of the SQS queue running on a micro, and let it fire up larger machine(or group of larger machines) when there are enough pending messages in the queue, terminate them the moment we detect queue is idle. That seems like a lot of work, unless there is a solution for this ready out there. In any case, I lose the benefits of beanstalk of deploying the code through git, managing environments etc.
I don't like any of these two solutions
Is there any other nice approach I am missing?
Thanks

CPU utilization on a micro instance is probably not the best metric to use for autoscaling in this case.
Length of the SQS queue would probably be the better metric to use, and the one that makes the most natural sense.
Needless to say, if you can budget for a bigger base-line machine everything would run that much faster.

Related

Running background processes in Google Cloud Run

I have a lightweight server that runs cron jobs at a given time. As I understand Google Cloud Run only processes incoming requests and then becomes idle after a short time if there is no other request to process. Hence, it is not advisable to deploy that cron service to Cloud Run.
Out of curiosity, I deployed the following server that starts up and then prints a log every hour.
const express = require('express');
const app = express();
setInterval(() => console.log('ping!'), 1000 * 60 * 60);
app.listen(process.env.PORT, () => {
console.log('server listening');
})
I deployed it with a minimum and maximum instance count of 1. It has not received any request and when I checked back the next day, it was precisely printing the log every hour. Was this coincidence or can I use this setup for production?
If you set the min instance to 1 and the CPU always on to true, yes, you can perform background compute intensive processing without CPU Throttling (in your hello world case, you can use the few CPU % allowed to the idle instance without the CPU always on option).
BUT, and the but is very important, you will pay for 1 Cloud Run instance always up. In addition, is you receive request, you can scale up and have more than 1 instance up and running. Does it make sense to have several instances with the same CRON scheduling? (except if you set the max instance to 1).
At the end, the best pattern is to host the scheduling outside, on Cloud Scheduler, and then to query your instance to perform the task. It's serverless, you can handle several task in parallel, it's scalable.
From my understanding no.
From the documentation here, Google indicates that the CPU of idle instances is throttled to nearly zero. I suppose this means that very simple operation can still be performed (e.g. logging a string every hour). I guess you could test it more extensively by doing some more complex operations and evaluate the processing time of these operations.
Either way, I would not count on it in a production environment. There is no guarantee that the CPU "throttled to nearly zero" will be able to complete the operations you need in a reasonable time delay.

How does Cloud Run scaling down to zero affect long-computation jobs or external API requests?

I'm new to using Cloud Run and the idea of scaling down to zero is very appealing to me, but I have question about a few scenarios about its usage:
If I have a Cloud Run instance querying an external API endpoint, would the instance winds down while waiting for the response if no additional requests come in (i.e. I set the query time out to 60min, and no requests are received in that 60 min)?
If the Cloud Run instance is running computation that lasts for longer than 24 hour, or perhaps even days, without receiving requests, could it be trusted to carry out the computation until it's done without being randomly shutdown or restarted for servicing or other purposes (I ask this because Cloud Run is primarily intended as for stateless applications, but I have infrequent computation jobs that may take a long time that may be considered "stateful" in short-term context).
Does CPU utilization impact auto-scaling (e.g. if I have a computationally intensive job not configured for distributed computing running on one instance, would this trigger Cloud Run to spawn additional instances?)
If you deep dive in the documentation, I'm quite sure that you can find your answers. So, here a summary
(Interesting read).The Cloud Run instances are shut down only when they aren't in used (usually 15 minutes (can change at any time, no commitment, only observations) without request handling). In your case, if you are in a request handling context, no worries, your instance won't be killed, it is in use! Note: don't send an HTTP response before the end of the processing. Background process/jobs aren't considered in a request context. The context is considered from the receipt of the request to the response (OK or KO) back. Partial response/streaming is accepted.
Cloud run instance can, potentially, live more than 24h, but nothing is guaranteed. And, because the request handling is limited to 1h, you can't run process longer that that. I recommend you to have a look to GKE autopilot or to run a container on a Compute Engine and stop the VM at the end of the processing to save resources and money (or a hack to run your container on AI PLatform custom training; even if you train nothing, you run a custom container on a serverless platform!). If you can, I recommend you to design your workload to be split in several small and parallelizable jobs
Yes, it's described here. But keep in mind that only 1 request is processed on one instance. If you send a request that trigger an intensive compute job, the request will be only processed on the same instance (that can have several CPUs if your workload is compliant with that). And if another request comes in during the intensive processing, another Cloud Run instance will be spawn to handle it; only the new request.

Estimate SQS processing time and load

I am going to use AWS SQS(regular queue, not FIFO) to process different client side metrics.
I’m expect to have ~400 messages per second (worst case).My SQS message will contain S3 location of the file.
I created an application, which will listen to my SQS Queue, and process messages from it.
By process I mean:
read SQS message ->
take S3 location from that SQS message ->
call S3 client ->
Read that file ->
Add a few additional fields —>
Publish data from this file to AWS Kinesis Firehose.
Similar process will be for each SQS message in the Queue. The size of S3 file is small, less than 0,5 KB.
How can calculate if I will be able to process those 400 messages per second? How can I estimate that my solution would handle x5 increase in data?
How can calculate if I will be able to process those 400 messages per second? How can I estimate that my solution would handle x5 increase in data?
Test it! Start with a small scale, and do the math to extrapolate from there. Make your test environment as close to what it will be in production as feasible.
On a single host and single thread, the math is simple:
1000 / AvgTotalTimeMillis = AvgMessagesPerSecond, or
1000 / AvgMessagesPerSecond = AvgTotalTimeMillis
How to approach testing this:
Start with a single thread and host, and generate some timing metrics for each step that you outlined, along with a total time.
Figure out your average/max/min time, and how many messages per second that translates to
400 messages per second on a single thread & host would be under 3ms per message. Hopefully this makes it obvious you need multiple threads/hosts.
Scale up!
Now that you know how much a single thread can handle, figure out how many threads a single host can effectively handle (you'll need to experiment). Consider batching messages where possible - SQS provides batch operations.
Use math to calculate how many hosts you need
If you need 5X that number, go up from there
While you're doing this math, consider any limits of the systems you're using:
Review the throttling limits of SQS / S3 / Firehose / etc. If you plan to use Lambda to do the work instead of EC2, it has limits too. Make sure you're within those limits, and consider contacting AWS support if you are close to exceeding them.
A few other suggestions based on my experience:
Based on your workflow outline & details, using EC2 you can probably handle a decent number of threads per host
M5.large should be more than enough - you can probably go smaller, as the performance bottleneck will likely be networking I/O to fetch and send messages.
Consider using autoscaling to handle message spikes for when you need to increase throughput, though keep in mind autoscaling can take several minutes to kick in.
The only way to determine this is to create a test environment that mirrors your scenario.
If your solution is designed to handle messages in parallel, it should be possible to scale-up your system to handle virtually any workload.
A good architecture would be to use AWS Lambda functions to process the messages. Lambda defaults to 1000 concurrent functions. So, if a function takes 3 seconds to run, it would support 333 messages per second consistently. You can request for the Lambda concurrency to be increased to handle higher workloads.
If you are using Amazon EC2 instead of Lambda functions, then it would just be a matter of scaling-out and adding more EC2 instances with more workers to handle whatever workload you desired.

Elasticsearch percolation dead slow on AWS EC2

Recently we switched our cluster to EC2 and everything is working great... except percolation :(
We use Elasticsearch 2.2.0.
To reindex (and percolate) our data we use a separate EC2 c3.8xlarge instance (32 cores, 60GB, 2 x 160 GB SSD) and tell our index to include only this node in allocation.
Because we'll distribute it amongst the rest of the nodes later, we use 10 shards, no replicas (just for indexing and percolation).
There are about 22 million documents in the index and 15.000 percolators. The index is a tad smaller than 11GB (and so easily fits into memory).
About 16 php processes talk to the REST API doing multi percolate requests with 200 requests in each (we made it smaller because of the performance, it was 1000 per request before).
One percolation request (a real one, tapped off of the php processes running) is taking around 2m20s under load (of the 16 php processes). That would've been ok if one of the resources on the EC2 was maxed out but that's the strange thing (see stats output here but also seen on htop, iotop and iostat): load, cpu, memory, heap, io; everything is well (very well) within limits. There doesn't seem to be a shortage of resources but still, percolation performance is bad.
When we back off the php processes and try the percolate request again, it comes out at around 15s. Just to be clear: I don't have a problem with a 2min+ multi percolate request. As long as I know that one of the resources is fully utilized (and I can act upon it by giving it more of what it wants).
So, ok, it's not the usual suspects, let's try different stuff:
To rule out network, coordination, etc issues we also did the same request from the node itself (enabling the client) with the same pressure from the php processes: no change
We upped the processors configuration in elasticsearch.yml and restarted the node to fake our way to a higher usage of resources: no change.
We tried tweaking the percolate and get pool size and queue size: no change.
When we looked at the hot threads, we DiscovereUsageTrackingQueryCachingPolicy was coming up a lot so we did as suggested in this issue: no change.
Maybe it's the amount of replicas, seeing Elasticsearch uses those to do searches as well? We upped it to 3 and used more EC2 to spread them out: no change.
To determine if we could actually use all resources on EC2, we did stress tests and everything seemed fine, getting it to loads of over 40. Also IO, memory, etc showed no issues under high strain.
It could still be the batch size. Under load we tried a batch of just one percolator in a multi percolate request, directly on the data & client node (dedicated to this index) and found that it used 1m50s. When we tried a batch of 200 percolators (still in one multi percolate request) it used 2m02s (which fits roughly with the 15s result of earlier, without pressure).
This last point might be interesting! It seems that it's stuck somewhere for a loooong time and then goes through the percolate phase quite smoothly.
Can anyone make anything out of this? Anything we have missed? We can provide more data if needed.
Have a look at the thread on the Elastic Discuss forum to see the solution.
TLDR;
Use multiple nodes on one big server to get better resource utilization.

AWS EC2 reaches high CPU uses during nighttime

I just implemented a few alarms with CloudWatch last week and I noticed a strange behavior with EC2 small instances everyday between 6h30 and 6h45 (UTC time).
I implemented one alarm to warn me when a AutoScallingGroup have its CPU over 50% during 3 minutes (average sample) and another alarm to warn me when the same AutoScallingGroup goes back to normal, which I considered to be CPU under 30% during 3 minutes (also average sample). Did that 2 times: one for zone A, and another time for zone B.
Looks OK, but something is happening during 6h30 to 6h45 that takes certain amount of processing for 2 to 5 min. The CPU rises, sometimes trigger the "High use alarm", but always triggers the "returned to normal alarm". Our system is currently under early stages of development, so no user have access to it, and we don't have any process/backups/etc scheduled. We barely have Apache+PHP installed and configured, so it can only be something related to the host machines I guess.
Anybody can explain what is going on and how can we solve it besides increasing the sample time or % in the "return to normal" alarm? Folks at Amazon forum said the Service Team would have a look once they get a chance, but it's been almost a week with no return.