Using AWS SQS for Job Queuing but minimizing "workers" uptime

I am designing my first Amazon AWS project and I could use some help with the queue processing.
This service accepts processing jobs, either via an ASP.net Web API service or a GUI web site (which just calls the API). Each job has one or more files associated with it and some rules about the type of job. I want to queue each job as it comes in, presumably using AWS SQS. The jobs will then be processed by a "worker", which is a Python script with a .Net wrapper. The Python script is an existing batch processor that cannot be altered/customized for AWS, hence the .Net wrapper that manages the AWS portions and passes the correct parameters to the Python script.
The issue is that we will not have a huge number of jobs, but each job is somewhat compute intensive. One of the reasons to go to AWS was to minimize infrastructure costs. I plan on having the frontend web site (Web API + ASP.net MVC4 site) run on Elastic Beanstalk. But I would prefer not to have a dedicated worker machine always online polling for jobs, since these workers need to be somewhat "beefier" instances (for processing) and it would cost us a lot to have them mostly sit idle.
Is there a way to only run the web portion on Beanstalk and then have the worker process only spin up if there are items in the queue? I realize I could have a micro "controller" instance always online polling and then have it control the compute spin-up, but even that seems like it shouldn't be needed. Can EC2 instances be started based on a non-zero SQS queue size? So basically: the web API adds a job to the queue, something watches the queue and sees it's non-zero, this triggers the EC2 worker to start, it spins up and polls the queue on startup. It processes jobs until the queue is empty, then something triggers it to shut down.

You can use Auto Scaling in conjunction with SQS to dynamically start and stop EC2 instances. There is an AWS blog post that describes the architecture you are thinking of.
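As a rough illustration, here is a minimal sketch with boto3 of wiring a CloudWatch alarm on queue depth to a scale-out policy. The group name "worker-asg" and queue name "job-queue" are placeholders; with the group's minimum size set to 0, the alarm can take the fleet from zero to one instance when work arrives, and a matching scale-in alarm (not shown) can bring it back down.

```python
# Minimal sketch (boto3): scale out an Auto Scaling group when the SQS
# queue is non-empty. "worker-asg" and "job-queue" are placeholder names.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# A simple scaling policy that adds one worker instance.
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName="worker-asg",
    PolicyName="scale-out-on-queue",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# Alarm on queue depth; fires whenever at least one message is visible.
cloudwatch.put_metric_alarm(
    AlarmName="jobs-waiting",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "job-queue"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)
```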

Related

AWS ECS: Is there a way for code inside a task container to set some custom status for outside services to read?

I am building a file processing service in AWS and here is what I have now as a manager-worker architecture:
A nodejs application running in an EC2 instance, serving as a manager node; in this EC2 instance, there is also a RabbitMQ service hosting a job queue
An ECS service running multiple task containers, which also run nodejs code. The code in every task container runs some custom business logic for processing a job. The task containers get the jobs from the above RabbitMQ job queue. When there are jobs enqueued in the RabbitMQ queue, the jobs are assigned to the ECS task containers and each task container starts processing its job.
Now, this ECS service should scale up or down. When there are no jobs in the queue (which happens very frequently), I just want to keep one worker container alive to keep costs down.
When a large number of jobs arrives at the manager and is enqueued into the job queue, the manager has to figure out how to scale up.
It needs to figure out how many new worker containers to add to the ECS service. To do this, it needs to know:
the number of task containers in the ECS service now;
the status of each container: is it currently processing a job?
This second point leads to my question: is there a way to set a custom status on the task, such that this status can be read by the application on the EC2 instance through some AWS ECS API?
As others have noted in the comments, there isn't any built-in AWS method to do this. I have two suggestions that I hope can accomplish what you want to do:
Create a Lambda function that runs on a regular interval and calls your RabbitMQ API to check the queue length. It can then use the ECS API to set the desired task count for your service. You can have as much control as you want over the thresholds and strategy for scaling in your code.
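A minimal sketch of that first suggestion, assuming the RabbitMQ management plugin is enabled and reachable from Lambda; the management URL, credentials, cluster, and service names are all placeholders:

```python
# Sketch: scheduled Lambda that sizes an ECS service from RabbitMQ queue depth.
# The management URL, credentials, and ECS names below are placeholders.
import base64
import json
import math
import os
import urllib.request

import boto3

ecs = boto3.client("ecs")

# e.g. http://rabbit-host:15672/api/queues/%2F/jobs (management plugin)
RABBIT_URL = os.environ["RABBIT_MGMT_URL"]
JOBS_PER_TASK = 5   # how many queued jobs justify one worker task
MAX_TASKS = 10

def handler(event, context):
    creds = base64.b64encode(b"user:password").decode()  # placeholder credentials
    req = urllib.request.Request(RABBIT_URL, headers={"Authorization": "Basic " + creds})
    with urllib.request.urlopen(req) as resp:
        queue = json.load(resp)

    backlog = queue["messages"]  # total messages currently in the queue
    # Always keep one task alive; cap the fleet at MAX_TASKS.
    desired = min(MAX_TASKS, max(1, math.ceil(backlog / JOBS_PER_TASK)))

    ecs.update_service(
        cluster="worker-cluster",
        service="worker-service",
        desiredCount=desired,
    )
```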
Consider using AWS Batch. The compute backend for Batch is also ECS based, so it might not be such a big change. Long-running jobs where you want to scale processing up and down are its sweet spot. If you want, you can queue the work directly in Batch and skip Rabbit. Or, if you still need to use Rabbit, you could create a small job, in Lambda or anywhere else, that pulls the messages out and creates an AWS Batch job for each. Batch supports running on EC2 ECS clusters, but it can also use Fargate, so it could simplify your management even further.
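If you went the Rabbit-to-Batch route, the bridge could be as small as this sketch; the job name, Batch queue, and job definition are hypothetical:

```python
# Sketch: submit one AWS Batch job per message drained from Rabbit.
# "processing-queue" and "processor:1" are placeholder Batch resources.
import boto3

batch = boto3.client("batch")

def submit_jobs(messages):
    for msg in messages:
        batch.submit_job(
            jobName="file-processing-job",
            jobQueue="processing-queue",
            jobDefinition="processor:1",
            # Pass the message payload to the container as a command argument.
            containerOverrides={"command": ["node", "process.js", msg]},
        )
```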

AWS System Design on Preconfigured EC2s

I have an AWS workflow which is as follows:
API call -> Lambda Function (Paramiko Remote Connect) -> EC2 -> output
Basically, I have an API call, which triggers a lambda function. Within the lambda function, I remote connect to a preconfigured EC2 instance using Python Paramiko, run some commands on the ec2 instance, and then return the output. I have two main concerns with this design: 1.) latency and 2.) scalability.
For Latency:
When I call the API, it takes 8-9 seconds to run, but if I were to run the job directly on the EC2 instance, it would take 1-2 seconds. Do ssh_client.connect() and ssh_client.exec_command() cause significantly increased runtime? Also, I am implementing this on a t2.micro Ubuntu 18.04 free-tier EC2 instance. Would using a paid instance type make a difference in runtime?
For Scalability:
I am sure AWS has a solution for this, but suppose that there are several simultaneous API calls. I am sure that I can't have only 1 available EC2 instance to run the job. Should I have multiple EC2 instances preconfigured and use a load-balancer? What AWS features can I use to scale this system?
If anything is unclear, please ask and I will elaborate.
Rather than using Paramiko, the more "cloud-friendly" method of running commands on an EC2 instance would be to use AWS Systems Manager Run Command, which uses an agent to run commands on the instance. It can even run commands on multiple instances, and also on-premises computers that have the agent installed.
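A minimal sketch of that approach with boto3, assuming the instance already runs the SSM agent and has an instance profile permitting Systems Manager; the instance ID and shell command are placeholders:

```python
# Sketch: run a command on an instance via SSM Run Command instead of SSH.
# The instance ID and shell command are placeholders.
import boto3

ssm = boto3.client("ssm")
instance_id = "i-0123456789abcdef0"

resp = ssm.send_command(
    InstanceIds=[instance_id],
    DocumentName="AWS-RunShellScript",  # built-in SSM document
    Parameters={"commands": ["python3 /opt/jobs/run_job.py"]},
)
command_id = resp["Command"]["CommandId"]

# Wait for the command to finish, then fetch its output.
ssm.get_waiter("command_executed").wait(CommandId=command_id, InstanceId=instance_id)
result = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
print(result["Status"], result["StandardOutputContent"])
```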
Another design choice is to push a "job" message to an Amazon SQS queue. The worker instances can poll the SQS queue asking for work. When they receive a message, they can perform the work. This is more of an asynchronous model, because the main system does not 'wait' for the job to finish, so it needs a return path to provide the results (e.g. another SQS queue). However, it is highly scalable and more resilient, with no load balancer required. This is a common design pattern.
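The worker side of that pattern can be sketched like this, assuming a "job-queue" for requests and a "results-queue" as the return path (both names, and the run_job function, are placeholders):

```python
# Sketch: an SQS-polling worker with a second queue as the return path.
# Queue names and run_job() are placeholders for your own logic.
import boto3

sqs = boto3.resource("sqs")
jobs = sqs.get_queue_by_name(QueueName="job-queue")
results = sqs.get_queue_by_name(QueueName="results-queue")

def run_job(payload):
    """Placeholder for the actual processing logic."""
    return "processed: " + payload

while True:
    # Long polling: wait up to 20 seconds for a message instead of busy-looping.
    for message in jobs.receive_messages(WaitTimeSeconds=20):
        output = run_job(message.body)
        results.send_message(MessageBody=output)
        message.delete()  # delete only after the work succeeded
```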

Best place in AWS for a long running subscription background service

I am currently trying to set up a system in AWS that utilises Event Sourcing & CQRS. I've got everything working on the command side, and this is storing the events into Aurora. I've got SqlEventStore as my event-sourcing store, and that has a subscription mechanism that will listen for new events and then fire a function appropriately.
So far it's all set up in Lambda, but I can't have the subscription in Lambda as Lambdas aren't always running, so my first thought was running this side in Fargate with a Docker container. From my reading, though, this seems to need to be fired by a task, rather than sit in the container on a subscription.
So my question is really, where is best to have a long running process in AWS that just sits listening for things to happen, rather than responding to a prod from something like a Lambda.
So my question is really, where is best to have a long running process in AWS that just sits listening for things to happen, rather than responding to a prod from something like a Lambda.
I suggest going with an ECS container of the Fargate or EC2 launch type. With Fargate you do not need to manage servers; it is similar to Lambda but more suitable for such a long-running process.
This seems to need to be fired by a task, rather than sit in the container on a subscription.
No, you can run Fargate in two ways:
Running as a long-running service
Firing a task based on a CloudWatch event or a schedule (perform the task and terminate)
"AWS Fargate now supports the ability to run tasks on a regular, scheduled basis and in response to CloudWatch Events. This makes it easier to launch and stop container services that you need to run only at certain times." (from the AWS Fargate scheduled tasks announcement)
Where is best to have a long-running process in AWS that just sits listening for things to happen, rather than responding to an event from something like a Lambda?
If your task is supposed to run for a long time, then Lambda is not for you; there is always a timeout with Lambda.
If you do not want to manage the server and the process is supposed to run for a long time, then Fargate is for you, and it is fine for the container to sit and listen for events.
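For the first option (a long-running service), a minimal boto3 sketch might look like the following; the cluster, task definition, and subnet are placeholders, and the task definition is assumed to already exist:

```python
# Sketch: keep one long-running Fargate task alive as an ECS service.
# Cluster, task definition, and subnet IDs are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="subscription-cluster",
    serviceName="event-listener",
    taskDefinition="subscription-listener:1",
    desiredCount=1,  # ECS restarts the task if it ever dies
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)
```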
You can also explore AWS Glue Python shell jobs for long-running services.

Parallel processing with load balancing on AWS

I have the below use case. Need some help in figuring out the best options on AWS.
I have a python script which needs to be executed for 200 different datasets.
I need to run each dataset on an AWS instance. The maximum number of instances I can have is 10 (so I need to run 20 rounds of 10 instances in parallel to complete my 200 jobs).
All the instances will use a common MongoDB instance to store/read data for the Python scripts.
This is not a web application, just a simple Python script invocation.
The Python script won't provide any exit codes once it has completed (it is a third-party script that I don't have control over). So I need to figure out when an AWS instance completes its job, so I can send the next dataset for processing (a kind of load balancing).
Sounds like a typical use case for SQS, a distributed queue.
An Auto Scaling group managing EC2 instances
An SQS queue managing calculation jobs
A small script polling for new jobs from SQS and executing the Python script (see the sketch after the links below)
CloudWatch alarms scaling the Auto Scaling group up and down based on the number of jobs in the SQS queue
General approach: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-using-sqs-queue.html
Using PaaS Elastic Beanstalk for this kind of setup: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
Example implementation: https://cloudonaut.io/antivirus-for-s3-buckets/
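The "small script" from the list above could be sketched like this; since the third-party script reports no useful exit code, the wrapper simply treats process completion as "done" and deletes the message, which lets SQS hand out the next dataset (the queue name and script path are placeholders):

```python
# Sketch: wrapper that pulls one dataset at a time from SQS and runs the
# third-party script. Queue name and script path are placeholders.
import subprocess

import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="dataset-jobs")

while True:
    for message in queue.receive_messages(WaitTimeSeconds=20):  # long poll
        dataset = message.body
        # Blocks until the process exits; completion is our "done" signal
        # because the script itself provides no meaningful exit code.
        subprocess.run(["python", "process_dataset.py", dataset])
        message.delete()  # frees this worker to receive the next dataset
```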

PHP AWS Elastic Beanstalk background workers

I have deployed my application using Elastic Beanstalk, since this gives me a very easy deployment flow to multiple instances at once using "git aws.push".
I'd like to add background-processing support to my application. The background worker will use the same codebase and simply start up a long-lived PHP script that continuously looks for tasks to execute. What AWS service should I use to create such a worker instance?
Should I use EB for this as well, or should I try to set up a standard EC2 instance (since I don't need it to be publicly available)? I guess that's the right way of doing it, and then create a deployment flow that makes it easy to deploy both to my EC2 worker instances and to the Elastic Beanstalk app? Or is there a better way of doing this?
AWS EB now supports worker environments. They're just a different kind of environment, with two differences:
They don't have a cnamePrefix (whatever.elasticbeanstalk.com)
Instead, they have an SQS queue bound to them
On each instance, they run a daemon called sqsd which basically polls the environment's SQS queue and forwards each message to the local HTTP server.
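For illustration (in Python, though the question is about PHP), the application side only needs to accept the HTTP POSTs that sqsd forwards; returning 200 tells sqsd the message was processed and can be deleted. The endpoint path here is the default, but it is configurable per worker environment:

```python
# Sketch: minimal endpoint for an EB worker environment. sqsd POSTs each
# SQS message body here; an HTTP 200 marks the message as processed.
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_task():
    task = request.get_data(as_text=True)  # raw SQS message body
    # ... run the background job here ...
    print("processing:", task)
    return "", 200
```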
I believe it is worth a try.
If the worker is just polling a queue for jobs and does not require an ELB, then all you need to do is work with EC2, SQS, and probably S3. You can start EC2 instances as part of an Auto Scaling group that, for example, is configured to scale as a function of the depth of the SQS queue. When there is no work to do you can run the minimum number of EC2 instances, but if the queue gets deep, auto scaling will spin up more.