Using AWS ECS service tasks as disposable/consumable workers?

Right now I have a web app running on ECS and have a pretty convoluted method of running background jobs:
I have a single task service that polls an SQS queue. When it reads a message, it attempts to place the requested task on the cluster. If this fails due to lack of available resources, the service backs off/sleeps for a period before trying again.
What I'd like to move to instead is as follows:
Run a multi-task worker service. Each task periodically polls the queue. When a message is received, the task runs the job itself (as opposed to trying to schedule a new task) and then exits. The AWS service scheduler then replenishes the service with a new task. This is analogous to gunicorn's prefork model.
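A minimal sketch of such a worker, assuming boto3 and a hypothetical queue URL: it long-polls once, runs the job, deletes the message, and exits cleanly so the scheduler replaces it.

```python
import sys
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

def main():
    sqs = boto3.client("sqs")
    messages = []
    while not messages:
        # Long-poll so idle workers make few API calls.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        messages = resp.get("Messages", [])
    msg = messages[0]
    run_job(msg["Body"])  # the actual work
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    sys.exit(0)  # clean exit; the ECS service scheduler starts a replacement task

def run_job(body: str):
    print("processing", body)

if __name__ == "__main__":
    main()
```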
My only concern is that I may be abusing the concept of services: are planned, frequent service task exits well supported, or should service tasks only exit when something bad happens, like an error?
Thanks

Related

AWS ECS: Is there a way for code inside a task container to set some custom status for outside services to read?

I am building a file processing service in AWS and here is what I have now as a manager-worker architecture:
A Node.js application running in an EC2 instance, serving as the manager node; this EC2 instance also hosts a RabbitMQ service with a job queue
An ECS service running multiple task containers, which also run Node.js code. The code in every task container runs some custom business logic for processing a job. The task containers get their jobs from the RabbitMQ queue above: when jobs are enqueued, they are assigned to the ECS task containers, which start processing them.
Now, this ECS service should scale up or down. When there are no jobs in the queue (which happens very frequently), I just want to keep one worker container alive so that I can save on costs.
When a large number of jobs arrives at the manager and is enqueued into the job queue, the manager has to figure out how to scale up.
It needs to figure out how many new worker containers to add to the ECS service. To do this, it needs to know:
the number of task containers in the ECS service now;
the status of each container: is it currently processing a job?
This second point leads to my question: is there a way to set a custom status on the task, such that this status can be read by the application on the EC2 instance through some AWS ECS API?
As others have noted in the comments, there isn't any built-in AWS method to do this. I have two suggestions that I hope can accomplish what you want to do:
Create a Lambda function that runs on a regular interval and calls into your RabbitMQ API to check the queue length. It can then use the ECS API to set the desired task count for your service. You can have as much control as you want over the thresholds and scaling strategy in your code.
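A minimal Lambda sketch of that idea, assuming boto3 and the RabbitMQ management API; the management URL, credentials, cluster, service, and queue names are all placeholders.

```python
import base64
import json
import urllib.request
import boto3

RABBIT_API = "http://rabbit.internal:15672/api/queues/%2F/jobs"  # assumed mgmt URL
AUTH = base64.b64encode(b"guest:guest").decode()                 # assumed credentials
CLUSTER, SERVICE = "worker-cluster", "worker-service"            # assumed names
MIN_TASKS, MAX_TASKS, JOBS_PER_TASK = 1, 20, 5

def handler(event, context):
    req = urllib.request.Request(RABBIT_API, headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req) as resp:
        depth = json.load(resp)["messages"]  # total messages in the queue
    # One task per JOBS_PER_TASK messages, clamped to [MIN_TASKS, MAX_TASKS].
    desired = max(MIN_TASKS, min(MAX_TASKS, -(-depth // JOBS_PER_TASK)))
    boto3.client("ecs").update_service(
        cluster=CLUSTER, service=SERVICE, desiredCount=desired
    )
    return {"queueDepth": depth, "desiredCount": desired}
```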
Consider using AWS Batch. The compute backend for Batch is also ECS-based, so it might not be such a big change. Long-running jobs where you want to scale the processing up and down are its sweet spot. If you want, you can queue the work directly in Batch and skip Rabbit. Or, if you still need to use Rabbit, you could create a small job in Lambda, or anywhere else, that pulls the messages out and creates an AWS Batch job for each. Batch supports running on EC2 ECS clusters, but it can also use Fargate, which could simplify your management even further.
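If you keep Rabbit, that bridge job could be as small as this sketch; the queue, Batch job queue, and job-definition names are placeholders, and pika is a common RabbitMQ client.

```python
import boto3
import pika

batch = boto3.client("batch")
conn = pika.BlockingConnection(pika.ConnectionParameters("rabbit.internal"))  # placeholder host
channel = conn.channel()

while True:
    method, _props, body = channel.basic_get(queue="jobs")  # placeholder queue
    if method is None:
        break  # queue drained
    batch.submit_job(
        jobName="file-processing",
        jobQueue="processing-queue",   # assumed Batch job queue
        jobDefinition="processor:1",   # assumed job definition
        containerOverrides={"command": ["process", body.decode()]},
    )
    channel.basic_ack(method.delivery_tag)

conn.close()
```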

Self destruct AWS ECS tasks after reaching RUNNING state

I have an ECS Task set as a target on a CloudWatch Event rule that fires on the S3 event pattern below.
The rule fires correctly on a PUT operation in an S3 bucket and starts the ECS Task set as its target.
The Task reaches the RUNNING state... and remains in the RUNNING state until it is stopped. I currently use the CLI to stop the task. Note that this task is not part of an ECS Service; it is a stand-alone task intended to do one specific job.
Is there a way to have the Task self-destruct after it reaches the RUNNING state and finishes the intended work? I could wait 30 minutes or even a few hours... but ultimately the task needs to STOP by itself.
This becomes particularly difficult to manage when there are thousands of S3 PUT operations invoking the CloudWatch rule, which in turn starts thousands of tasks. I am looking for a way to stop these tasks once they reach the RUNNING state and finish the intended work.
Any suggestions?
If you really have to stick with what you are doing, you could invoke another Lambda function to stop the task once a certain stage is reached in your application running as the Docker container. Beware of integration hell, though!
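If you take that route, the stopping Lambda itself is a single ECS call; the cluster name and task ARN are assumed to arrive in the invocation event.

```python
import boto3

def handler(event, context):
    # The calling application is assumed to pass these in the event payload.
    boto3.client("ecs").stop_task(
        cluster=event["cluster"],   # e.g. the processing cluster name
        task=event["taskArn"],      # ARN of the running task to stop
        reason="work complete",
    )
```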
What you are trying to do would be better handled by AWS Lambda and the Batch service. You can specify a Docker image to run, and once the operation is done, the Docker process exits.
Refer to this: https://medium.com/swlh/aws-batch-to-process-s3-events-388a77d0d9c2
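For completeness, a sketch of the "exit when done" approach, which is all a stand-alone task needs: an ECS task stops on its own once its container's main process exits. The bucket/key environment variables are an assumption about how the event target passes input.

```python
import os
import sys
import boto3

def main():
    # Assumed: the event target passes the object location via container overrides.
    bucket = os.environ["S3_BUCKET"]
    key = os.environ["S3_KEY"]
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    process(body)   # the intended work
    sys.exit(0)     # process exits -> container stops -> task moves to STOPPED

def process(data: bytes):
    print(f"processed {len(data)} bytes")

if __name__ == "__main__":
    main()
```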

How to set up a long running Django command in Google Cloud Platform

I have recently moved my site to Google Cloud Run.
The problem is I also need to move a couple of cron jobs that run a Django command every day inside a container. What is the preferred way of doing this if I don't want to pay for a full Kubernetes cluster with always running node instances?
I would like the task to run and then spin the server down, just as Cloud Run does when I get an incoming request. I have searched through all the documentation, but I am having trouble finding the correct solution for long-running tasks inside containers that do not require an underlying server in Google Cloud.
Can someone point me in the right direction?
Cloud Run request timeout limit is 15 minutes.
Cloud Functions function timeout limit is 540 seconds.
For long-running tasks, spinning a Compute Engine instance up and down as needed is the preferred option.
An example of how to schedule, run, and stop Compute Engine instances automatically is explained nicely here:
Scheduling compute instances with Cloud Scheduler
In brief: the actual instance start/stop is performed by Cloud Functions. On a timetable, Cloud Scheduler publishes the required tasks to a Cloud Pub/Sub queue, which triggers these functions. At the end of its main logic, your code can also publish a message to Cloud Pub/Sub to run the 'Stop this instance' task.
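A hedged sketch of what such a Pub/Sub-triggered start/stop Cloud Function might look like; the project ID, zone, and message format are assumptions for illustration.

```python
import base64
import json
import googleapiclient.discovery  # google-api-python-client

def handle_instance_event(event, context):
    """Pub/Sub-triggered function that starts or stops a Compute Engine instance."""
    payload = json.loads(base64.b64decode(event["data"]).decode())
    compute = googleapiclient.discovery.build("compute", "v1")
    instances = compute.instances()
    op = instances.start if payload["action"] == "start" else instances.stop
    op(
        project="my-project",                       # placeholder project ID
        zone=payload.get("zone", "us-central1-a"),  # assumed message fields
        instance=payload["instance"],
    ).execute()
```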
How to process the task in Django?
It can be the same Django app started with a WSGI server to process incoming requests (like a regular Django site), but with increased request/response and other timeouts and a long WSGI worker life... In this case, the task is a regular HTTP request to a Django view.
It can be just one script (or a Django management command) run at instance startup to automatically execute one task (see the sketch after this list).
You may also want to pass additional arguments for the task; in this case, you can publish to Cloud Pub/Sub one 'Start instance' task plus one main-logic task with custom arguments, and make your code pull from Pub/Sub first.
More Django-native: use Celery and start the Celery worker as a separate Compute Engine instance.
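A minimal sketch of the management-command option above, assuming the Cloud Function flow just described; the project, topic, and instance names are placeholders.

```python
from django.core.management.base import BaseCommand
from google.cloud import pubsub_v1

class Command(BaseCommand):
    help = "Run the scheduled job once, then request instance shutdown."

    def handle(self, *args, **options):
        self.do_work()
        # Ask the stop Cloud Function (via Pub/Sub) to shut this instance down.
        publisher = pubsub_v1.PublisherClient()
        topic = publisher.topic_path("my-project", "instance-control")  # placeholders
        publisher.publish(
            topic, b'{"action": "stop", "instance": "task-runner"}'
        ).result()

    def do_work(self):
        self.stdout.write("task complete")  # replace with the real command logic
```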
One possible way to use just one Celery worker without all the other parts (i.e. a broker, since there is no official built-in Cloud Pub/Sub support) and to pull/push tasks from/to Cloud Pub/Sub (see the sketch after these steps):
run the Celery worker with a dummy filesystem broker
add the target method as a @periodic_task that runs, say, every 30 seconds
at the start of the task, subscribe to the Cloud Pub/Sub queue, check for a new task, receive one, and start processing
at the end of the task, publish the results to Cloud Pub/Sub along with a call to 'Stop this instance'
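A hedged sketch of that Celery setup, using the modern beat_schedule equivalent of @periodic_task; the project, subscription, and broker folder paths are placeholders.

```python
from datetime import timedelta
from celery import Celery
from google.cloud import pubsub_v1

app = Celery("worker", broker="filesystem://")
app.conf.broker_transport_options = {
    "data_folder_in": "/tmp/celery",   # dummy local filesystem broker folders
    "data_folder_out": "/tmp/celery",
}
app.conf.beat_schedule = {
    "poll-pubsub": {"task": "worker.poll_pubsub", "schedule": timedelta(seconds=30)},
}

@app.task(name="worker.poll_pubsub")
def poll_pubsub():
    subscriber = pubsub_v1.SubscriberClient()
    sub = subscriber.subscription_path("my-project", "tasks-sub")  # placeholders
    resp = subscriber.pull(subscription=sub, max_messages=1, timeout=10)
    for received in resp.received_messages:
        run_task(received.message.data)  # the actual long-running work
        subscriber.acknowledge(subscription=sub, ack_ids=[received.ack_id])
    # After publishing results, a "stop this instance" message would go out here.

def run_task(payload: bytes):
    print("running", payload)
```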
There is also Cloud Tasks (timeout limit: 10 minutes with auto-startup, 24 hours with manual startup) as a Cloud Run companion for asynchronous tasks, but in this case Cloud Pub/Sub is more suitable.

Best place in AWS for a long running subscription background service

I currently am trying to set up a system in AWS that utilises EventSourcing & CQRS. I've got everything working on the Command side, and this is storing the events into Aurora. I've got SqlEventStore as my EventSourcing store and that has a Subscription mechanism that will listen for new events and then fire a function appropriately.
So far it's all set up in Lambda, but I can't have the subscription in Lambda as Lambdas aren't always running, so my first thought was running this side in Fargate with a Docker container. From my reading, though, this seems to need to be fired by a task, rather than sit in the container on a subscription.
So my question is really, where is best to have a long running process in AWS that just sits listening for things to happen, rather than responding to a prod from something like a Lambda.
So my question is really, where is best to have a long running process in AWS that just sits listening for things to happen, rather than responding to a prod from something like a Lambda.
I suggest going with a Fargate or EC2-type ECS container. With Fargate you do not need to manage servers; it is similar to Lambda but more suitable for such a long-running process.
This seems to need to be fired by a task, rather than sit in the container on a subscription.
No, you can run Fargate in two ways:
running as a long-running service
firing the service based on a CloudWatch Event or a schedule (perform the task and terminate)
AWS Fargate now supports the ability to run tasks on a regular, scheduled basis and in response to CloudWatch Events. This makes it easier to launch and stop container services that you need to run only at certain times.
AWS Fargate
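A sketch of that second mode using boto3; all ARNs, the subnet, and the events role (which must be allowed to run ECS tasks) are placeholders.

```python
import boto3

events = boto3.client("events")
# Rule fires on a schedule; a pattern-based rule would work the same way.
events.put_rule(Name="nightly-worker", ScheduleExpression="rate(1 day)")
events.put_targets(
    Rule="nightly-worker",
    Targets=[{
        "Id": "fargate-task",
        "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/workers",
        "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/worker:1",
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0123456789abcdef0"],
                    "AssignPublicIp": "ENABLED",
                }
            },
        },
    }],
)
```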
Where is best to have a long-running process in AWS that just sits listening for things to happen, rather than responding to an event from something like a Lambda
If your task is supposed to run for a long time, then Lambda is not for you; there is always a timeout in the case of Lambda.
If you do not want to manage servers, and the process is supposed to run for a long time, then Fargate is for you, and it is fine for it to sit and listen for events.
You can also explore AWS Glue Python Shell for long-running serverless services.

Using AWS SQS for Job Queuing but minimizing "workers" uptime

I am designing my first Amazon AWS project and I could use some help with the queue processing.
This service accepts processing jobs, either via an ASP.net Web API service or a GUI web site (which just calls the API). Each job has one or more files associated with it and some rules about the type of job. I want to queue each job as it comes in, presumably using AWS SQS. The jobs will then be processed by a "worker" which is a python script with a .Net wrapper. The python script is an existing batch processor that cannot be altered/customized for AWS, hence the wrapper in .Net that manages the AWS portions and passing in the correct params to python.
The issue is that we will not have a huge number of jobs, but each job is somewhat compute-intensive. One of the reasons for going to AWS was to minimize infrastructure costs. I plan on having the frontend web site (Web API + ASP.net MVC4 site) run on Elastic Beanstalk. But I would prefer not to have a dedicated worker machine always online polling for jobs, since these workers need to be somewhat "beefier" instances (for processing) and it would cost us a lot for them to mostly sit doing nothing.
Is there a way to run only the web portion on Beanstalk and have the worker process spin up only when there are items in the queue? I realize I could have a micro "controller" instance always online polling and have it control the compute spin-up, but even that seems like it shouldn't be needed. Can EC2 instances be started based on a non-zero SQS queue size? So basically: the Web API adds a job to the queue, something watches the queue and sees it's non-zero, this triggers the EC2 worker to start, it spins up and polls the queue on startup. It processes until the queue is empty, then something triggers it to shut down.
You can use Auto Scaling in conjunction with SQS to dynamically start and stop EC2 instances. There is an AWS blog post that describes the architecture you are thinking of.
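A minimal boto3 sketch of that setup: a simple scaling policy plus a CloudWatch alarm on the queue's visible-message count. The group and queue names are placeholders; a mirrored scale-in alarm, and a minimum group size of zero, would complete the loop so workers shut down when the queue is empty.

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Add one instance whenever the alarm below fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="worker-asg",       # placeholder group
    PolicyName="scale-out-on-backlog",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# Alarm on queued (visible) messages in the job queue.
cloudwatch.put_metric_alarm(
    AlarmName="sqs-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "jobs"}],  # placeholder queue
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```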