Parallel processing with load balancing on AWS

I have the use case below and need some help figuring out the best options on AWS.
I have a Python script which needs to be executed for 200 different datasets.
I need to run each dataset on an AWS instance. The maximum number of instances I can have is 10 (so I need to run 20 rounds on 10 instances in parallel to complete my 200 jobs).
All the instances will use a common MongoDB instance to store/read data for the Python scripts.
This is not a web application, just a simple Python script invocation.
The Python script won't provide any exit code once it completes (it's a third-party script and I don't have control over it). So I need a way to figure out when an AWS instance completes its job so I can send it the next dataset to process (a kind of load balancing).

Sounds like a typical use case for SQS, a distributed queue.
Auto Scaling Group managing EC2 Instances
SQS queue managing calculation jobs
A small script polling for new jobs from SQS and executing the Python script (a sketch follows after the links below)
CloudWatch alarms scaling up and down Auto Scaling Group based on number of jobs in SQS queue
General approach: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-using-sqs-queue.html
Using PaaS Elastic Beanstalk for this kind of setup: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
Example implementation: https://cloudonaut.io/antivirus-for-s3-buckets/
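
A minimal sketch of that polling script, assuming each SQS message body identifies a dataset and that the third-party script is invoked as process.py (both placeholder assumptions, not from the question). Since the script gives no exit code, the worker treats "the subprocess returned" as job completion and only then deletes the message:

```python
# Minimal worker loop (a sketch): queue URL and process.py are placeholders.
import subprocess

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/datasets"  # placeholder

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,      # long polling; avoids hammering SQS
        VisibilityTimeout=3600,  # hide the message while the job runs
    )
    for msg in resp.get("Messages", []):
        dataset = msg["Body"]
        # The third-party script exposes no exit status, so "the process
        # returned" is treated as "the job finished".
        subprocess.run(["python", "process.py", dataset])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Run one copy of this loop on each instance and SQS effectively load-balances the 200 jobs across the fleet; if an instance dies mid-job, the visibility timeout expires and the message becomes available to another worker.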

Related

AWS ECS: Is there a way for code inside a task container to set some custom status for outside services to read?

I am building a file processing service in AWS and here is what I have now as a manager-worker architecture:
A nodejs application running on an EC2 instance, serving as a manager node; on this EC2 instance there is also a RabbitMQ service hosting a job queue
An ECS service running multiple task containers, which also run nodejs code. The code in every task container runs some custom business logic for processing a job. The task containers get their jobs from the RabbitMQ job queue above: when jobs are enqueued in RabbitMQ, they are assigned to the ECS task containers, which start processing them.
Now, this ECS service should scale up or down. When there are no jobs in the queue (which happens very frequently), I want to keep just one worker container alive to save on cost.
When a large number of jobs arrives at the manager and is enqueued into the job queue, the manager has to figure out how to scale up.
It needs to figure out how many new worker containers to add to the ECS service. And to do this, it needs to know:
the number of task containers in the ECS service now;
the status of each container: is it currently processing a job?
This second point leads to my question: is there a way to set a custom status on a task, such that this status can be read by the application on the EC2 instance through some AWS ECS API?
As others have noted in the comments, there isn't any built-in AWS method to do this. I have two suggestions that I hope can accomplish what you want to do:
Create a Lambda function that runs on a regular interval and calls your RabbitMQ API to check the queue length. It can then use the ECS API to set the desired task count for your service. You have as much control as you want over the thresholds and scaling strategy in your code. A sketch follows below.
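A hedged sketch of that first suggestion, assuming the RabbitMQ management API is reachable from Lambda; the environment variable names and the jobs-per-task ratio are placeholders, not anything from the question:

```python
# Scheduled Lambda: read RabbitMQ queue depth, set the ECS desired count.
import base64
import json
import math
import os
import urllib.request

import boto3

JOBS_PER_TASK = 5  # assumed ratio of queued jobs to worker tasks

def handler(event, context):
    # Queue depth via RabbitMQ's management API (%2F is the default vhost).
    url = (f"http://{os.environ['RABBITMQ_HOST']}:15672"
           f"/api/queues/%2F/{os.environ['QUEUE_NAME']}")
    req = urllib.request.Request(url)
    creds = f"{os.environ['RABBITMQ_USER']}:{os.environ['RABBITMQ_PASS']}"
    req.add_header("Authorization",
                   "Basic " + base64.b64encode(creds.encode()).decode())
    depth = json.load(urllib.request.urlopen(req))["messages"]

    # Always keep one worker alive; add one task per JOBS_PER_TASK jobs.
    boto3.client("ecs").update_service(
        cluster=os.environ["CLUSTER"],
        service=os.environ["SERVICE"],
        desiredCount=max(1, math.ceil(depth / JOBS_PER_TASK)),
    )
```

This sidesteps the custom-status question entirely: queue depth, not per-container status, drives scaling. Note that scaling in will stop tasks, so workers should handle SIGTERM and finish or requeue their current job before exiting.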
Consider using AWS Batch. The compute backend for Batch is also ECS-based, so it might not be such a big change. Long-running jobs where you want to scale processing up and down are its sweet spot. If you want, you can queue the work directly in Batch and skip Rabbit. Or, if you still need Rabbit, you could create a small shim in Lambda, or anywhere else, that pulls the messages out and creates an AWS Batch job for each (sketched below). Batch supports running on EC2 ECS clusters, but it can also use Fargate, so it could simplify your management even further.
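If you went the Batch route while keeping Rabbit, the shim could look roughly like this; the host, queue, and job names are placeholders, and pika is assumed as the RabbitMQ client:

```python
# Drain RabbitMQ and submit one AWS Batch job per message (sketch).
import boto3
import pika

batch = boto3.client("batch")
conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.internal"))
channel = conn.channel()

while True:
    method, properties, body = channel.basic_get(queue="jobs")
    if method is None:
        break  # queue drained
    batch.submit_job(
        jobName="file-processing",
        jobQueue="processing-queue",  # the Batch job queue, not Rabbit
        jobDefinition="processor:1",
        containerOverrides={"environment": [
            {"name": "JOB_PAYLOAD", "value": body.decode()},
        ]},
    )
    channel.basic_ack(method.delivery_tag)
```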

Migrating on-premises Python ETL scripts that feed a Splunk Forwarder from a syslog box to AWS?

I've been asked to migrate on-premises Python ETL scripts that live on a syslog box over to AWS. These scripts run as cron-jobs and output logs that a Splunk Forwarder parses and sends to our Splunk instance for indexing.
My initial idea was to deploy a CloudWatch-triggered Lambda function that spins up an EC2 instance, runs the ETL scripts cloned to that instance, and then brings the instance down. Another idea was to containerize the scripts and run them as ECS task definitions. They take approximately 30 minutes to run.
Any help moving forward would be nice; I would like to deploy this as IaC, preferably in troposphere/boto3.
Another idea was to containerize the scripts and run them as task definitions
This is probably the best approach. You can include the Splunk universal forwarder container in your task definition (ensuring both containers are configured to mount the same storage where the logs are held) to get the logs into Splunk. You can schedule task execution much like Lambda functions. As an alternative to the forwarder container, if you can configure the scripts to log to stdout/stderr instead of log files, you can set up your Docker log driver to send output directly to Splunk.
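For illustration, the log-driver variant could be registered with boto3 roughly like this (the image URI, Splunk endpoint, and token are placeholders; the splunk log driver works on both EC2 and Fargate launch types):

```python
# Register a task definition that ships container stdout/stderr to Splunk.
import boto3

ecs = boto3.client("ecs")
ecs.register_task_definition(
    family="etl-scripts",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    containerDefinitions=[{
        "name": "etl",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/etl:latest",  # placeholder
        "essential": True,
        "logConfiguration": {
            "logDriver": "splunk",
            "options": {
                "splunk-url": "https://splunk.example.com:8088",  # HEC endpoint (assumed)
                "splunk-token": "REPLACE_WITH_HEC_TOKEN",
            },
        },
    }],
)
```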
Assuming you don't already have a cluster with capacity to run the task, you can use a capacity provider for the ASG attached to the ECS cluster to automatically provision instances into the cluster whenever the task needs to run (and scale down after the task completes).
Or use Fargate tasks with EFS storage and you don't have to worry about cluster provisioning at all.
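The scheduling side can then be a CloudWatch Events/EventBridge rule that runs the Fargate task directly; the ARNs, subnet, and the role that lets Events call RunTask are all assumptions:

```python
# Run the ETL task on a cron schedule, no Lambda or always-on instance needed.
import boto3

events = boto3.client("events")
events.put_rule(Name="etl-nightly", ScheduleExpression="cron(0 2 * * ? *)")
events.put_targets(
    Rule="etl-nightly",
    Targets=[{
        "Id": "etl-task",
        "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/etl",    # cluster ARN (placeholder)
        "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",  # allows Events to call RunTask
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/etl-scripts",
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0123456789abcdef0"],  # placeholder
                    "AssignPublicIp": "ENABLED",
                },
            },
        },
    }],
)
```

The same resources can be expressed in troposphere if you prefer templates over API calls, per the IaC preference in the question.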

How to run cron job only on single instance in AWS AutoScaling?

I have scheduled 2 cronjobs for my application.
My application server is in an Auto Scaling group and I keep a minimum of 2 instances for high availability. Everything is working fine, but the cron jobs run multiple times because of the 2 instances in the Auto Scaling group.
I cannot limit the group size to 1 because my application is already in production and I prefer to keep HA.
How can I limit the cron jobs to run on a single instance? Or should I use other services like AWS Lambda or AWS Elastic Beanstalk?
Firstly, you should consider whether running the crons on these instances is suitable. If you're trying to keep the application highly available and customers interact with these instances directly, what will the performance impact of the crons be?
Perhaps consider using a separate Auto Scaling group (or a standalone instance) with a size of 1 to run these crons. You could launch the instance or update the Auto Scaling group just before the cron needs to run, and then automate the shutdown after it has completed.
Otherwise you would need a locking mechanism for your script: at the start of each run, the script checks whether a lock is already in progress, and if not, writes a lock to mark the job as running. To further reduce the chance of a collision between servers, add jitter (a random number of seconds of sleep) to the start of your script. A sketch follows after the list below.
Suitable technologies for writing a lock are below:
DynamoDB using strongly consistent reads.
EFS for a Linux application, or FSX for a Windows application.
S3 using strong consistency.
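A sketch of the DynamoDB option; the table name, key schema, and TTL window are assumptions, and the conditional write is what makes the lock safe:

```python
# Acquire a cron lock via a DynamoDB conditional write.
import random
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def acquire_lock(job_name, ttl_seconds=300):
    # Jitter so two instances don't race at the exact same second.
    time.sleep(random.uniform(0, 5))
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName="cron-locks",  # assumed table with "job" as partition key
            Item={"job": {"S": job_name},
                  "expires": {"N": str(now + ttl_seconds)}},
            # Succeeds only if no unexpired lock row exists.
            ConditionExpression="attribute_not_exists(#j) OR #exp < :now",
            ExpressionAttributeNames={"#j": "job", "#exp": "expires"},
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another instance holds the lock
        raise

if acquire_lock("nightly-report"):
    pass  # run the actual cron job body here
```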
The solutions suggested by Chris Williams sound reasonable if using a Lambda function is not an option.
One way to simulate cron job is by using CloudWatch Events (now known as EventBridge) in conjunction with AWS Lambda.
First, write a Lambda function with the code that needs to be executed on a schedule. EventBridge schedule expressions support cron syntax.
You can then use a schedule expression with an EventBridge/CloudWatch Events rule in the same way as a crontab entry, with the Lambda function as the target, as sketched below.
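Wiring this up with boto3 might look roughly as follows; the rule name, schedule, and function ARN are placeholders:

```python
# Schedule a Lambda function with an EventBridge cron rule.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:my-cron-job"  # placeholder

# Either cron() or rate() syntax works as a schedule expression.
events.put_rule(Name="my-cron-schedule", ScheduleExpression="cron(30 5 * * ? *)")
events.put_targets(Rule="my-cron-schedule",
                   Targets=[{"Id": "1", "Arn": FUNCTION_ARN}])

# EventBridge also needs permission to invoke the function.
lambda_client.add_permission(
    FunctionName="my-cron-job",
    StatementId="allow-eventbridge",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn="arn:aws:events:us-east-1:123456789012:rule/my-cron-schedule",
)
```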
Alternatively, you can enable termination protection on one of the instances and attach the necessary role and permissions for Systems Manager. Once the instance shows up as a managed instance in Systems Manager, you can create a scheduled event in CloudWatch that runs SSM documents against it. If you are running a bash script, convert it to an SSM document and set that document as the target, or use the AWS-RunShellScript document for running commands.

Is using CloudWatch Events to trigger cron jobs on an EC2 instance overkill?

We have an EC2 server that runs cronjobs. Currently there is a crontab on that server that holds the cronjob settings. Everything runs perfectly fine on this server.
Would it be overkill to use AWS CloudWatch Events to trigger the crons instead? I.e., create a CloudWatch event that calls a Lambda to run a shell command on the EC2 instance.
My thinking is that these would be possible benefits:
no need to manage a crontab file on the EC2 server
easier to activate/deactivate specific cronjobs
It looks like there are indeed benefits, according to the AWS docs:
https://aws.amazon.com/blogs/compute/scheduling-ssh-jobs-using-aws-lambda/
Decouple job schedule and AMI: If your cron jobs are part of an AMI, each schedule change requires you to create a new AMI version, and update existing instances running with that AMI. This is both cumbersome and time-consuming. Using scheduled Lambda functions, you can keep the job schedule outside of your AMI and change the schedule on the fly.
Flexible targeting of EC2 instances: By abstracting the job schedule from AMI and EC2 instances, you can flexibly target a subset of your EC2 instance fleet based on tags or other conditions. In this example, we are targeting EC2 instances with the “Environment=Dev” tag.
Intelligent scheduling: With scheduled Lambda functions, you can add custom logic to your abstracted job scheduler.
In my experience it's not overkill at all. I have used the same setup with great success, running around 50 different jobs with heavy workloads.
My setup was slightly different:
The CloudWatch scheduled event called a Lambda, which in turn put a message on an SQS queue; an application running on EC2 instance(s) grabbed messages from SQS and processed them.
The SQS queue was added simply for robustness.
But this may or may not make sense in your use case.
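The Lambda side of that setup can stay very small. A sketch, assuming a queue named "jobs" and an illustrative payload:

```python
# Scheduled Lambda that enqueues a job for the EC2 workers to pick up.
import json

import boto3

sqs = boto3.client("sqs")

def handler(event, context):
    queue_url = sqs.get_queue_url(QueueName="jobs")["QueueUrl"]  # assumed queue name
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=json.dumps({"job": "nightly-report"}))
```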

Using AWS SQS for Job Queuing but minimizing "workers" uptime

I am designing my first Amazon AWS project and I could use some help with the queue processing.
This service accepts processing jobs, either via an ASP.net Web API service or a GUI web site (which just calls the API). Each job has one or more files associated with it and some rules about the type of job. I want to queue each job as it comes in, presumably using AWS SQS. The jobs will then be processed by a "worker", which is a Python script with a .Net wrapper. The Python script is an existing batch processor that cannot be altered/customized for AWS, hence the .Net wrapper that manages the AWS portions and passes the correct params to Python.
The issue is that we will not have a huge number of jobs, but each job is somewhat compute-intensive. One of the reasons to go to AWS was to minimize infrastructure costs. I plan on having the frontend web site (Web API + ASP.net MVC4 site) run on Elastic Beanstalk. But I would prefer not to have a dedicated worker machine always online polling for jobs, since these workers need to be somewhat "beefier" instances (for processing) and it would cost us a lot for them to sit mostly doing nothing.
Is there a way to run only the web portion on Beanstalk and then have the worker process spin up only if there are items in the queue? I realize I could have a micro "controller" instance always online polling and then have it control the compute spin-up, but even that seems like it shouldn't be needed. Can EC2 instances be started based on a non-zero SQS queue size? So basically: the web API adds a job to the queue, something watches the queue and sees it's non-zero, this triggers the EC2 worker to start, it spins up and polls the queue on startup. It processes jobs until the queue is empty, then something triggers it to shut down.
You can use Auto Scaling in conjunction with SQS to dynamically start and stop EC2 instances. There is an AWS blog post that describes the architecture you are thinking of.
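The core of that pattern is a CloudWatch alarm on queue depth wired to Auto Scaling policies. A sketch of the scale-out half; the queue name and the scaling-policy ARN are placeholders for resources you would create separately:

```python
# Alarm when the queue is non-empty and fire a scale-out policy.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="scale-out-on-queue-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "jobs"}],  # placeholder queue
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[
        # ARN of a scaling policy on the worker ASG (placeholder)
        "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:0f00f0f0:"
        "autoScalingGroupName/workers:policyName/scale-out",
    ],
)
```

A mirrored alarm on a queue depth of zero can drive a scale-in policy so the beefy workers shut down when there is nothing left to process.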