Running Batch Jobs on Amazon ECS

I'm very new to using AWS, and even more so to ECS. I have developed an application that can take an S3 link, download the data from that link, process the data, and then output some information about that data. I've already packaged this application in a Docker container, and it now resides in the Amazon container registry (ECR).

What I want to do now is start up a cluster, send an S3 link to each EC2 instance running Docker, have all the container instances crunch the numbers, and return all the results back to a single node. I don't quite understand how I am supposed to change my application at this point. Do I need to make the application running in the Docker container a service? Or should I just send commands to the containers via SSH? Then, assuming I get that far, how do I communicate with the cluster to farm out the work for potentially hundreds of S3 links? Ideally, since my application is very compute intensive, I'd like to run only one container per EC2 instance.
Thanks!

This is hard to answer since it bundles a lot of questions without a lot of research behind them.
My initial thought is to make it completely stateless.
You're on the right track by making them start up and process via S3. You should expand this to use something like an SQS queue. Those SQS messages would contain an S3 link. Your application will start up, grab a message from SQS, process the link it got, and delete the message.
The next thing is to not output to a console of any kind. Send the output somewhere else instead, such as a different SQS queue.
This removes the requirement for the boxes to talk to each other. This will speed things up, make it infinitely scalable and remove the strange hackery around making them communicate.
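A minimal sketch of that worker loop in Python with boto3; the queue URLs and process_s3_link() are placeholders for your own resources and code, not a definitive implementation:

```python
import json
import boto3

sqs = boto3.client("sqs")
IN_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-links"   # placeholder
OUT_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/results"   # placeholder

def process_s3_link(link):
    # Placeholder: download the object behind the link and crunch the numbers.
    return {"link": link, "result": "..."}

while True:
    resp = sqs.receive_message(QueueUrl=IN_QUEUE, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)            # long polling
    for msg in resp.get("Messages", []):
        result = process_s3_link(msg["Body"])                 # body is the S3 link
        sqs.send_message(QueueUrl=OUT_QUEUE, MessageBody=json.dumps(result))
        sqs.delete_message(QueueUrl=IN_QUEUE, ReceiptHandle=msg["ReceiptHandle"])
```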
Also, why one container per instance? Two threads at 50% are usually the same as one at 100%. Remove this requirement and you can use ECS + Lambda + CloudWatch to scale based on the number of messages: more than 10,000, scale up; fewer than 100, scale down, that kind of thing. This means you can throw millions of messages into SQS and just let ECS scale up to process them and output somewhere else to be consumed.
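One possible sketch of those thresholds as CloudWatch alarms; the queue name and the scaling-policy ARNs they trigger are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def queue_depth_alarm(name, comparison, threshold, action_arn):
    # Alarm on SQS queue depth; the action would be an Application Auto Scaling
    # policy on the ECS service (ARN is a placeholder).
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": "work-queue"}],  # placeholder
        Statistic="Average",
        Period=60,
        EvaluationPeriods=1,
        Threshold=threshold,
        ComparisonOperator=comparison,
        AlarmActions=[action_arn],
    )

queue_depth_alarm("scale-out", "GreaterThanThreshold", 10_000,
                  "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:scale-out")  # placeholder
queue_depth_alarm("scale-in", "LessThanThreshold", 100,
                  "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:scale-in")   # placeholder
```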

I agree with Marc Young: you need to make this stateless and decouple the communication layer from the app.
For an application like this I would put the S3 links into a queue (RabbitMQ is a good one; I personally don't care for SQS, but it's also an option). Then have your worker nodes in ECS pull messages off the queue and process them.
It sounds like you have another app that does the processing. Depending on the output, you could then put the result into another processing queue and use the same model, or just put it directly into a database of some sort (or into S3 as files).
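If you go with RabbitMQ instead of SQS, a rough pika-based worker might look like this; the broker host, queue name, and process_link() are placeholders:

```python
import pika

def process_link(s3_link):
    # Placeholder: download and process the data behind the link.
    pass

def on_message(channel, method, properties, body):
    process_link(body.decode())
    channel.basic_ack(delivery_tag=method.delivery_tag)    # ack only after success

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.internal"))
channel = connection.channel()
channel.queue_declare(queue="s3-links", durable=True)
channel.basic_qos(prefetch_count=1)                        # one message at a time per worker
channel.basic_consume(queue="s3-links", on_message_callback=on_message)
channel.start_consuming()
```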
In addition to what Marc said about autoscaling, consider using CloudWatch + spot instances to manage the cost of your ECS container instances. Particularly for heavy compute tasks, you can get big discounts that way.

Related

Calling Lambda functions programmatically every minute at scale

While I have worked with AWS for a bit, I'm stuck on how to correctly approach the following use case.
We want to design an uptime monitor for up to 10K websites.
The monitor should run from multiple AWS regions, ping websites to check whether they are available, and measure the response time. With a Lambda function, I can ping the site, pass the result to an SQS queue, and process it. So far, so good.
However, I want to run this function every minute. I also want the ability to add and delete monitors, so if I no longer want to monitor website "A" from region "us-west-1", I can remove it, or the other way round, add a website to a region.
Ideally, all this would run serverless and be deployable to custom regions with CloudFormation.
What services should I go with?
I have been thinking about EventBridge, where I wanted to create custom events for every website in every region and then send the result over SNS to a central processing Lambda. But I'm not sure this is the way to go.
Alternatively, I wanted to build a scheduler lambda that fetches the websites it has to schedule from a DB and then invokes the fetcher lambda. But I was not sure about the delay since I want to have the functions triggered every minute. The architecture should monitor 10K websites and even more if possible.
Feel free to give me any advice you have :)
Kind regards.
In my opinion Lambda is not the correct solution for this problem. Your costs will be very high and it may not scale to what you want to ultimately do.
A c5.9xlarge EC2 instance costs about USD $1.53/hour and has a 10 Gbit network. With 36 vCPUs, a threaded program could take care of a large percentage - maybe all 10K - of your load. It could still be run in multiple regions on demand and push to an SQS queue. That's around $1,100/month/region without pre-purchasing EC2 time.
A Lambda running 10,000 times per minute, for 5 seconds each time, and using only 128 MB, would be around USD $4,600/month/region.
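Roughly how those two numbers fall out, using list prices at the time; check the current pricing pages before relying on them:

```python
HOURS_PER_MONTH = 730

# EC2: one c5.9xlarge on-demand instance per region.
ec2_monthly = 1.53 * HOURS_PER_MONTH                       # ~= $1,117

# Lambda: 10,000 invocations/minute, 5 s each, 128 MB memory.
invocations = 10_000 * 60 * HOURS_PER_MONTH                # ~= 438M invocations/month
gb_seconds = invocations * 5 * (128 / 1024)                # duration * memory in GB
lambda_monthly = gb_seconds * 0.0000166667 \
               + (invocations / 1_000_000) * 0.20          # compute + request charges

print(f"EC2:    ${ec2_monthly:,.0f}/month/region")         # roughly $1,100
print(f"Lambda: ${lambda_monthly:,.0f}/month/region")      # roughly $4,650
```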
Coupled with the management interface you're alluding to, the EC2 instance could handle pretty much everything you want to do. Of course, you'd want to scale and likely have at least two EC2 instances for failover, but with two of them you're still at less than half the cost of the Lambda. As you scale to 100,000 websites, it's just a matter of adding machines.
There are a ton of other choices but understand that serverless does not mean cost efficient in all use cases.

Best AWS architecture solution for migrating data to cloud

Say I have 4 or 5 data sources that I access through API calls. The data aggregation and mining is all scripted in a Python file. Let's say the output is all structured data. I know there are plenty of considerations, but from a high level, what would some possible solutions look like if I ultimately wanted to run analysis in BI software?
Can I host the Python script in Lambda, set a daily trigger to run it, and then have the output stored in RDS/Aurora? Or, since the applications I'm making API calls to aren't in AWS, would I need the data to be in an AWS instance before running a Lambda function?
Or should I host the Python script on an EC2 instance and use Lambda to trigger a daily refresh that just stores the data on EC2/EBS or in Redshift?
I'm just starting to learn AWS cloud architecture, so my knowledge is fairly limited. It just seems like there can be multiple solutions to any problem, so I'm not sure whether the two ideas above are viable.
You've mentioned two approaches that would work. Ultimately it really depends on your use case, budget, etc., and you are right: in AWS you will usually have different solutions that can solve the same problem. For example, another possible solution could be to Dockerize your Python script and run it on a container service (ECS/EKS). But considering you just started with AWS, I will focus on the approaches you mentioned, as they are probably the two most common ones.
In short, based on your description, I would not suggest going with EC2, because it adds complexity to your use case and, moreover, extra costs. If you imagine the final setup, you will need to configure and manage the instance itself: its instance type, AMI, your script deployment, access to the internet, subnets, etc. Also, a minor thing to clarify: you would probably set a cron expression on the instance to trigger your script (not a Lambda reaching out to the EC2 instance!). As you can see, that's quite a big setup for little benefit (except maybe gaining some experience with AWS ;)), and the instance would be idle most of the time, which is far from optimal.
If you just have to run a daily Python script and need to store the output somewhere, I would suggest using Lambda for the processing. You can simply have a scheduled event (the preferred way is now Amazon EventBridge) that triggers your Lambda function once a day. Then, depending on your output and how you need to process it, you can obviously use RDS from Lambda with the Python SDK, but you can also use S3 as blob storage if you don't need to run specific queries, for example if you can store your output in JSON format.
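A minimal sketch of that scheduled-Lambda-to-S3 approach; the bucket name and the fetch_from_sources() stand-in for your own API-calling code are placeholders:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def fetch_from_sources():
    # Placeholder: call your external APIs, aggregate, and return structured data.
    return {"example": "data"}

def handler(event, context):
    # Triggered once a day by an EventBridge scheduled rule.
    data = fetch_from_sources()
    key = f"daily-exports/{datetime.now(timezone.utc):%Y-%m-%d}.json"
    s3.put_object(Bucket="my-output-bucket", Key=key,      # placeholder bucket
                  Body=json.dumps(data))
    return {"written": key}
```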
Note that one limitation of Lambda is that it can only run for 15 minutes per execution. The good thing is that by default Lambda has internet access, so you don't need to worry about any gateway setup and can reach your external endpoints.
Also, from a cost perspective, running one Lambda per day combined with S3 should be free or almost free; Lambda pricing is very cheap. Running an EC2 instance or RDS (which is also an instance) 24/7 will cost you some money.
Lambda with storage in S3 is the way to go. EC2 / EBS costs add up over time and EC2 will limit the parallelism you can achieve.
Look into Step Functions as a way to organize and orchestrate your Lambdas. I have Python code that copies 500K+ files to S3, which takes a week to run. If I copy the files in parallel (500-ish at a time), the process takes about 10 hours. The parallelism is limited by the sourcing system, as I can overload it by going wider. The main Lambda launches the file-copy Lambdas at a controlled rate, terminates after a few minutes of run time, and returns the last file updated to the controlling Step Function. The Step Function then restarts the main Lambda where the last one left off.
Since you have multiple sources you can have multiple top level Lambdas running in parallel all from the same Step Function and each launching a controlled number of worker Lambdas. You won't overwhelm S3 but you will want to make sure you don't overload your sources.
The best part of this is that it costs pennies (at the scale I'm using it).
Once the data is in S3 I'm copying it up to Redshift and transforming it. These processes are also part of the Step Function through additional Lambda Functions.
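A hypothetical sketch of that controller pattern; the worker function name, the list_pending_files() helper, and the rate limit are placeholders, not the actual code:

```python
import json
import time

import boto3

lambda_client = boto3.client("lambda")

def list_pending_files(after=None):
    # Placeholder: return source-system file keys that still need copying.
    return []

def handler(event, context):
    last_key = event.get("last_key")               # where the previous run stopped
    for key in list_pending_files(after=last_key):
        lambda_client.invoke(
            FunctionName="copy-one-file",           # placeholder worker Lambda
            InvocationType="Event",                 # asynchronous, fire-and-forget
            Payload=json.dumps({"key": key}),
        )
        last_key = key
        time.sleep(0.1)                             # crude rate limit on the source system
        # Stop well before the Lambda timeout; the Step Function re-invokes this
        # handler with last_key so it picks up where it left off.
        if context.get_remaining_time_in_millis() < 60_000:
            return {"last_key": last_key, "done": False}
    return {"last_key": last_key, "done": True}
```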

AWS service for video post processing

I have a local program that inputs a video, uses a tensorflow model to do object classification, and then does a bunch of processing on the objects. I want to get this running in AWS, but there is a dizzying array of AWS services. My desired flow is:
video gets uploaded to s3 --> do classification and processing on each frame of said video --> store results in s3.
I've used Lambda for similar work, but this program relies on 2 different models and its overall size is ~800 MB.
My original thought was to run an EC2 instance that can be triggered when S3 receives a video. Is this the right approach?
You can consider creating a Docker image containing your code, dependencies, and the model. Then you can push it to ECR and create a task definition and a Fargate cluster. When the task definition is ready, you can set up a CloudWatch event rule that is triggered upon S3 upload, and as its target you can select the Fargate resources that were created at the beginning.
There's a tutorial with a similar case available here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/CloudWatch-Events-tutorial-ECS.html
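An equivalent wiring, sketched here as an S3-notification Lambda that starts the Fargate task itself rather than using the CloudWatch Events target directly; the cluster, task definition, subnet, and container names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # Invoked by an S3 event notification for the uploaded video.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    ecs.run_task(
        cluster="video-processing",                 # placeholder cluster
        taskDefinition="classify-video",            # placeholder task definition
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-12345678"],     # placeholder subnet
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [{
                "name": "classifier",               # placeholder container name
                "environment": [
                    {"name": "INPUT_BUCKET", "value": bucket},
                    {"name": "INPUT_KEY", "value": key},
                ],
            }]
        },
    )
```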
I think you're on the right track. I would configure S3 to send new object notifications to an SQS queue. Then you can have your EC2 instance poll the queue for pending tasks. I would probably go with ECS + Fargate for this, but EC2 also works.
You can use AWS Elemental to split the video file and distribute the parts to different Lambdas, so you can scale it and process it in parallel.

What to use: AWS Fargate or AWS Beanstalk?

I have a Java application that reads from an SQS queue, does some business processing, and finally writes the result to a datastore. As the SQS queue grows, I want to be able to scale to read more messages and process them. Each SQS message will take about 15 to 20 minutes to process. I was looking at a service like AWS Fargate or AWS Beanstalk to deploy my application. Money is not a concern but usability is. What would be the best platform?
Fargate would be an ideal solution, as it has the following advantages over Beanstalk:
It's serverless
More fine-grained control for custom application architectures.
No need to write EB extensions.
Build and test the image locally and promote the same image to Fargate.
With application autoscaling, you can scale on the go.
Pricing is per second with a 1-minute minimum
FAQ:
https://aws.amazon.com/fargate/faqs/
Pricing:
https://aws.amazon.com/fargate/pricing/
I've had a very similar use case to this, and I used AWS Batch (which was not available in 2014, when the question was asked).
https://aws.amazon.com/batch/
In my case I was processing audio and video files from the queue.
You can set a Lambda to fire on the SQS queue and have it drop the job onto Batch for processing.
If you have the minimum cluster size set to zero then you will have no servers running when there is no work to do, but you can have them autoscale up to process as much work as you require when the jobs come in.
The advantage compared to Lambda is that the code that executes can be any container, with as many resources as you want to throw at it.
For your use case it will be perfect, but for anything that can complete processing in a few seconds or a minute, it's worth making each job process more than one task per execution, or all of the time will be spent firing up and shutting down containers.
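A minimal sketch of that SQS-to-Batch hand-off; the job queue and job definition names are placeholders:

```python
import boto3

batch = boto3.client("batch")

def handler(event, context):
    # Invoked by an SQS event source mapping; each record becomes one Batch job.
    for record in event["Records"]:
        batch.submit_job(
            jobName="media-processing-job",
            jobQueue="media-queue",                 # placeholder Batch job queue
            jobDefinition="media-job-def",          # placeholder job definition
            containerOverrides={
                "environment": [
                    {"name": "MESSAGE_BODY", "value": record["body"]},
                ]
            },
        )
```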

AWS SQS and other services

My company has a messaging system that sends real-time messages in JSON format, and it's not built on AWS.
Our team is trying to use AWS SQS to receive these messages, and then use DynamoDB to store them.
I'm thinking of using EC2 to read the messages and then save them.
Is there any better solution, or how should I go about this? I don't have much experience here.
First of all, EC2 is infrastructure in the cloud; it is similar to a physical machine with an OS in a local setup. If you want to create an application that fetches the data from Amazon SQS (messages in JSON format) and pushes it into DynamoDB (a NoSQL database), your design is correct, as both SQS and DynamoDB have thorough JSON support. Once your application is ready, you deploy it on an EC2 machine.
To achieve this, your application needs an asynchronous, buffered SQS consumer that will consume the messages (the SQS message size limit is 256 KB), so whichever application is publishing the messages needs to keep each message under 256 KB.
Please refer to the link below for an SQS consumer:
is putting sqs-consumer to detect receiveMessage event in sqs scalable
Once you have consumed the message from the SQS queue, you need to save it in DynamoDB, which you can easily do using a CRUD repository. With a repository you can directly save the JSON into a DynamoDB table, but please be sure to configure the provisioned write capacity based on your request rate, because the higher the provisioned capacity, the higher the cost. Please refer to the link below for configuring a table's write capacity.
Dynamodb reading and writing units
In general, you'll have a setup something like this:
The EC2 instances (one or more) will read your queue every few seconds to see if there is anything there. If so, they will write this data to DynamoDB.
Based on what you're saying, you'll have fewer than 1,000,000 reads from SQS in a month, so you can start out on the free tier for that. You can have a single EC2 instance initially, and it can be a very small instance - a t2.micro should be more than sufficient. And you don't need more than a few writes per second on DynamoDB.
The advantage of SQS is that if for some reason your EC2 instance is temporarily unavailable the messages continue to queue up and you won't lose any of them.
From a coding perspective, you don't mention your development environment but there are AWS libraries available for a pretty wide variety of environments. I develop in Java and the code to do this would be maybe 100 lines. I would guess that other languages would be similar. Make sure you look at long polling in the language you're using - it can help to speed up the processing and save you money.
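For illustration (in Python rather than Java), a minimal sketch of that polling loop with long polling enabled; the queue URL and table name are placeholders, and it assumes each JSON message already contains the table's key attributes:

```python
import json
import boto3

sqs = boto3.client("sqs")
table = boto3.resource("dynamodb").Table("messages")   # placeholder table
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming"  # placeholder

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,          # long polling: fewer empty responses, lower cost
    )
    for msg in resp.get("Messages", []):
        item = json.loads(msg["Body"])
        table.put_item(Item=item)    # assumes the JSON includes the table's key attributes
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```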