I have a local program that takes a video as input, uses a TensorFlow model to do object classification, and then does a bunch of processing on the detected objects. I want to get this running in AWS, but there is a dizzying array of AWS services. My desired flow is:
video gets uploaded to s3 --> do classification and processing on each frame of said video --> store results in s3.
I've used Lambda for similar work, but this program relies on 2 different models and its overall size is ~800 MB.
My original thought is to run an EC2 instance that can be triggered when S3 receives a video. Is this the right approach?
You can consider creating a Docker image containing your code, dependencies, and the models, then pushing it to ECR and creating a task definition and a Fargate cluster. Once the task definition is ready, you can set up a CloudWatch Events rule that triggers on S3 upload, with the Fargate resources you created at the beginning as its target.
There's a tutorial with a similar case available here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/CloudWatch-Events-tutorial-ECS.html
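The rule-and-target wiring could be sketched in boto3 roughly as below. The rule name, ARNs, subnets, and the event pattern are all placeholders to adapt; the pattern shown matches EventBridge's native "Object Created" notifications, which have to be enabled on the bucket (a CloudTrail-based pattern would look different).

```python
import json

def s3_object_created_pattern(bucket):
    # Matches EventBridge's native "Object Created" S3 events; requires
    # EventBridge notifications to be turned on for the bucket.
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket]}},
    }

def wire_rule(bucket, cluster_arn, task_def_arn, subnet_ids, role_arn):
    import boto3  # imported here so the helper above stays dependency-free
    events = boto3.client("events")
    events.put_rule(
        Name="video-uploaded",
        EventPattern=json.dumps(s3_object_created_pattern(bucket)),
        State="ENABLED",
    )
    events.put_targets(
        Rule="video-uploaded",
        Targets=[{
            "Id": "run-fargate-task",
            "Arn": cluster_arn,    # the ECS cluster ARN
            "RoleArn": role_arn,   # role that lets the rule call ecs:RunTask
            "EcsParameters": {
                "TaskDefinitionArn": task_def_arn,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {"Subnets": subnet_ids},
                },
            },
        }],
    )
```

Depending on your VPC setup you may also need `AssignPublicIp` in the `awsvpcConfiguration` so the task can pull the image from ECR.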
I think you're on the right track. I would configure S3 to send new-object notifications to an SQS queue, then have your EC2 instance poll the queue for pending tasks. I would probably go with ECS + Fargate for this, but EC2 also works.
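A minimal sketch of that polling loop, assuming the bucket publishes standard S3 event notifications straight to the queue; the queue URL and the processing callback are placeholders:

```python
import json

def parse_s3_notification(body):
    # Extract (bucket, key) pairs from a standard S3 event notification body
    event = json.loads(body)
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]

def poll_forever(queue_url, process):
    import boto3
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling: fewer empty responses
        )
        for msg in resp.get("Messages", []):
            for bucket, key in parse_s3_notification(msg["Body"]):
                process(bucket, key)
            # delete only after successful processing, so a crash
            # lets the message reappear after the visibility timeout
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```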
You can use AWS Elemental to split the video file and distribute the parts to different Lambdas, so you can scale out and process it in parallel.
Related
I am new to the AWS world and I am trying to implement a process where data written into S3 by AWS EMR is loaded into AWS Redshift. I am using Terraform to create S3, Redshift, and other supporting functionality. For loading the data I am using a Lambda function that is triggered when the Redshift cluster is up. The Lambda function has the code to copy the data from S3 to Redshift. Currently the process seems to work fine, and the amount of data is low.
My question is
This approach seems to work right now, but I don't know how it will hold up once the volume of data increases, or what happens if the Lambda function times out.
Can someone please suggest an alternate way of handling this scenario, even if it can be handled without Lambda? One alternative I came across while searching this topic is AWS Data Pipeline.
Thank you
A serverless approach I've recommended clients move to in this case is the Redshift Data API (plus Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion, and if that is all you need to do then you're done.
If you need to take additional actions after the COPY, then you need a polling Lambda that checks when the COPY completes; the Redshift Data API enables this too. Once the COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
launches the first Lambda (which initiates the COPY)
has a wait loop that calls the "status checker" Lambda every 30 seconds (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
launches the additional-actions Lambda once the status checker says the COPY is complete
The Step function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and Step Function as one unit.
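The launch-and-poll pattern could look roughly like this with boto3's `redshift-data` client. The cluster, database, user, table, and file format are placeholders, and in practice the Step Function's wait state plays the role of the sleep loop:

```python
import time

def build_copy_sql(table, s3_path, iam_role):
    # COPY loads directly from S3 into Redshift; the format clause
    # depends on what EMR actually wrote (Parquet here is an example)
    return (f"COPY {table} FROM '{s3_path}' "
            f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET")

def start_copy(cluster_id, database, db_user, sql):
    import boto3
    rsd = boto3.client("redshift-data")
    resp = rsd.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return resp["Id"]  # statement id, used for later status checks

def wait_for_statement(statement_id, interval=30):
    # The "status checker" step: poll until the statement finishes
    import boto3
    rsd = boto3.client("redshift-data")
    while True:
        status = rsd.describe_statement(Id=statement_id)["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            return status
        time.sleep(interval)
```

The launching Lambda calls `start_copy` and returns immediately; the statement id is what the Step Function passes to the checker on each loop iteration.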
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can keep your current solution running while you implement an alternative.
I wouldn't recommend Data Pipeline, as it adds overhead (it will start an EC2 instance to run your commands). Your problem is simply the timeout, so you can use either ECS Fargate or a Glue Python Shell job. Either of them can be triggered by a CloudWatch Event on an S3 event.
a. Using ECS Fargate, you'll have to take care of the Docker image and set up the ECS infrastructure, i.e. a task definition and a cluster (simple for Fargate).
b. Using a Glue Python Shell job, you'll simply have to deploy your Python script to S3 (along with the required packages as wheel files) and link those files in the job configuration.
Both of these options are serverless, and you can choose one based on ease of deployment and your comfort level with Docker.
ECS doesn't have any timeout limit, while the timeout limit for Glue is 2 days.
Note: to trigger an AWS Glue job from a CloudWatch Event, you'll have to use a Lambda function in between, as CloudWatch Events doesn't support starting a Glue job as a target yet.
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html
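That bridging Lambda can stay tiny. A sketch, assuming a CloudTrail-based S3 event shape and a hypothetical Glue job name:

```python
def extract_s3_location(event):
    # CloudTrail-based S3 events carry bucket/key under requestParameters;
    # adjust these paths if your rule delivers a different event shape
    params = event.get("detail", {}).get("requestParameters", {})
    return params.get("bucketName"), params.get("key")

def handler(event, context):
    import boto3
    bucket, key = extract_s3_location(event)
    glue = boto3.client("glue")
    run = glue.start_job_run(
        JobName="load-to-redshift",  # hypothetical Glue Python Shell job
        Arguments={"--s3_bucket": bucket or "", "--s3_key": key or ""},
    )
    return run["JobRunId"]
```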
In my architecture, when a new file arrives in an S3 bucket, a Lambda function triggers an ECS task.
The problem occurs when I receive multiple files at the same time: the Lambda triggers multiple instances of the same ECS task, and they all act on the same shared resources.
I want to ensure only one instance of a specific ECS task is running at a time. How can I do that?
Is there a specific setting that can ensure it?
I tried querying the ECS cluster before running a new instance of the task, but (using the AWS Python SDK) I didn't receive any information while the task was in PROVISIONING status; the SDK only returns data once the task is PENDING or RUNNING.
Thank you
I don't think you can control that, because your S3 events will keep triggering new tasks. Checking whether the task is already running is the more fragile route, and you might miss executions if you receive a lot of files.
You should approach this differently to achieve what you want. If you want only one task processing at a time, forget about triggering the ECS task from the S3 event. It will work better if you introduce a queue: have the S3 event add the information (via a Lambda, maybe?) to an SQS queue.
From there you can have an ECS service doing a SQS long polling and processing one message at a time.
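A sketch of that forwarding Lambda, assuming a standard S3 event payload and a `QUEUE_URL` environment variable on the function:

```python
import json

def records_to_messages(event):
    # One SQS message per S3 record, carrying just what the worker needs
    return [
        json.dumps({"bucket": r["s3"]["bucket"]["name"],
                    "key": r["s3"]["object"]["key"]})
        for r in event.get("Records", [])
    ]

def handler(event, context):
    import boto3, os
    sqs = boto3.client("sqs")
    for body in records_to_messages(event):
        sqs.send_message(QueueUrl=os.environ["QUEUE_URL"], MessageBody=body)
```

The ECS service on the other end then decides the concurrency: a single-task service draining this queue gives you exactly one consumer, no matter how many files land at once.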
I'm new to CloudWatch Events and to Fargate. I want to trigger a Fargate task (Python) to run whenever a file is uploaded to a specific S3 bucket. I can get the task to run whenever I upload a file, and can see the name in the event log; however, I can't figure out a simple way to read the event data in Fargate. I've been researching this for the past couple of days and haven't found a solution other than reading the event log, or using a Lambda to invoke the task and put the event data in a message queue.
Is there a simple way to obtain the event data in Fargate with boto3? It's likely that I'm not looking in the right places or asking the right question.
Thanks
One of the easiest options you can configure is two targets for the same S3 upload event:
Push the same event to SQS.
Launch the Fargate task at the same time.
Read the event message from SQS once Fargate is up (no Lambda in between). The same task definition will work for your normal use case; just make sure you exit the process after reading the message from SQS.
So whenever the Fargate task comes up, it reads its message from SQS.
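A one-shot worker entrypoint matching this setup might look like the sketch below; the queue URL comes from the task environment, the message is assumed to be a standard S3 event notification, and the actual processing step is elided:

```python
import json
import os
import sys

def pick_object(body):
    # Assumes the queued message is a standard S3 event notification
    rec = json.loads(body)["Records"][0]
    return rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"]

def main():
    import boto3
    sqs = boto3.client("sqs")
    queue_url = os.environ["QUEUE_URL"]
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    messages = resp.get("Messages", [])
    if not messages:
        sys.exit(0)  # task started before the message landed; nothing to do
    bucket, key = pick_object(messages[0]["Body"])
    # ... download s3://bucket/key and process it here ...
    sqs.delete_message(
        QueueUrl=queue_url, ReceiptHandle=messages[0]["ReceiptHandle"])
    # returning from main() lets the process exit, which stops the task
```

`main()` would be the container's entrypoint. Note the long-poll wait: because the rule fires the task and the SQS message in parallel, the task may come up a moment before its message is visible.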
To do this you would need to use an input transformer.
Each time an event rule is triggered, a JSON object is made accessible for use in the transformation.
As the event itself is not accessible within the container (like with Lambda functions), the idea is that you would actually forward key information as environment variables and manipulate in your container.
At this time it does not look like every service supports this in the console, so you have the following options:
CloudFormation
Terraform
CLI
You can view a tutorial for this exact scenario from this link.
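For reference, the `put_targets` call with an input transformer might look roughly like this in boto3. The JSON paths assume CloudTrail-based S3 events, and the rule, ARN, and container names are placeholders; check both against your actual events and task definition:

```python
import json

def transformer_input(container_name):
    # Map fields out of the event, then splice them into container
    # environment variables; <bucket>/<key> are the transformer's
    # placeholder syntax, filled from InputPathsMap at delivery time
    paths = {
        "bucket": "$.detail.requestParameters.bucketName",
        "key": "$.detail.requestParameters.key",
    }
    template = json.dumps({
        "containerOverrides": [{
            "name": container_name,
            "environment": [
                {"name": "S3_BUCKET", "value": "<bucket>"},
                {"name": "S3_KEY", "value": "<key>"},
            ],
        }]
    })
    return {"InputPathsMap": paths, "InputTemplate": template}

def attach_target(rule, cluster_arn, task_def_arn, role_arn, container_name):
    import boto3
    events = boto3.client("events")
    events.put_targets(
        Rule=rule,
        Targets=[{
            "Id": "fargate-with-env",
            "Arn": cluster_arn,
            "RoleArn": role_arn,
            "EcsParameters": {
                "TaskDefinitionArn": task_def_arn,
                "LaunchType": "FARGATE",
            },
            "InputTransformer": transformer_input(container_name),
        }],
    )
```

The container then reads `S3_BUCKET`/`S3_KEY` from its environment, no queue or Lambda required.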
I'm very new to using AWS, and even more so to ECS. Currently, I have developed an application that can take an S3 link, download the data from that link, process the data, and then output some information about it. I've already packaged this application up in a Docker container, and it now resides in the Amazon container registry. What I want to do now is start up a cluster, send an S3 link to each EC2 instance running Docker, have all the container instances crunch the numbers, and return the results back to a single node. I don't quite understand how I am supposed to change my application at this point. Do I need to make the application running in the Docker container a service? Or should I just send commands to the containers via SSH? Then, assuming I get that far, how do I communicate with the cluster to farm out the work for potentially hundreds of S3 links? Ideally, since my application is very compute-intensive, I'd like to run only one container per EC2 instance.
Thanks!
Your question is hard to answer since it packs in a lot of sub-questions without a lot of research done.
My initial thought is to make it completely stateless.
You're on the right track by making them start up and process via S3. You should expand this to use something like an SQS queue. Those SQS messages would contain an S3 link. Your application will start up, grab a message from SQS, process the link it got, and delete the message.
The next thing is to not output to a console of any kind. Output somewhere else instead, like a different SQS queue.
This removes the requirement for the boxes to talk to each other. This will speed things up, make it infinitely scalable and remove the strange hackery around making them communicate.
Also, why one container per instance? Two threads at 50% is usually the same as one at 100%. Remove this requirement and you can use ECS + Lambda + CloudWatch to scale based on the number of messages: more than 10,000, scale up; fewer than 100, scale down; that kind of thing. This means you can throw millions of messages into SQS and just let ECS scale up to process them, outputting somewhere else for something to consume.
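That queue-depth-based scaling could be sketched as a small scheduled function; the thresholds and step size here are illustrative, not recommendations:

```python
def desired_count(backlog, current, high=10000, low=100, max_tasks=50):
    # Pure scaling policy: nudge the task count up or down by 5
    if backlog > high:
        return min(current + 5, max_tasks)
    if backlog < low:
        return max(current - 5, 0)
    return current

def rescale(queue_url, cluster, service):
    import boto3
    sqs = boto3.client("sqs")
    ecs = boto3.client("ecs")
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    svc = ecs.describe_services(
        cluster=cluster, services=[service])["services"][0]
    target = desired_count(backlog, svc["desiredCount"])
    if target != svc["desiredCount"]:
        ecs.update_service(
            cluster=cluster, service=service, desiredCount=target)
```

A CloudWatch scheduled rule invoking a Lambda that runs `rescale` every minute or so is one simple way to drive this.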
I agree with Marc Young, you need to make this stateless and decouple the communication layer from the app.
For an application like this I would put the S3 links into a queue (RabbitMQ is a good one; I personally don't care for SQS, but it's also an option). Then have your worker nodes in ECS pull messages off the queue and process them.
It sounds like you have another app that does processing. Depending on the output, you could then put the result into another processing queue and use the same model, or just stuff it directly into a database of some sort (or as files in S3).
In addition to what Marc said about autoscaling, consider using cloudwatch + spot instances to manage the cost of your ECS container instances. Particularly for heavy compute tasks, you can get big discounts that way.
I have an AWS-hosted website which takes images for processing and adds them to SQS. Is it possible to automatically start the processing instance whenever there is something in the queue using AWS services, or should I do it manually in my backend's code?
Yes, you can use EC2 and SQS in conjunction. Please go through this blog: https://aws.amazon.com/articles/1464
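A rough sketch of the check-queue-then-start idea with boto3; the instance ID and queue URL are placeholders, and a scheduled Lambda or cron job could run it periodically:

```python
def should_start(backlog, instance_state):
    # Only wake a worker that is actually stopped and has work waiting
    return backlog > 0 and instance_state == "stopped"

def wake_worker(queue_url, instance_id):
    import boto3
    sqs = boto3.client("sqs")
    ec2 = boto3.client("ec2")
    backlog = int(sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]["ApproximateNumberOfMessages"])
    state = ec2.describe_instances(InstanceIds=[instance_id]) \
        ["Reservations"][0]["Instances"][0]["State"]["Name"]
    if should_start(backlog, state):
        ec2.start_instances(InstanceIds=[instance_id])
```

The instance itself would poll the queue and could call `ec2.stop_instances` on itself (or simply shut down) once the queue is empty.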