AWS Lambda vs AWS Batch - amazon-web-services

I am currently working on a project where I need to merge two significantly large CSV files into one (both are a few hundred MBs). I am fairly new to AWS. I am aware of the memory allocation and execution time limitations of Lambda. Other than that, are there any advantages of using Batch jobs over Lambda for this project? Is there any other AWS component which is more suitable for this task? Either the Lambda or the Batch job will be triggered inside a Step Function by an SNS notification.

A Lambda function has some limitations:
Execution time: 15 minutes
RAM: 3 GB
Disk space in /tmp: only 512 MB, so it is difficult to store any file larger than that on Lambda
The good points are that it is cheap and boots up fast.
I suggest you use ECS (both Fargate and EC2 container instances work well).

Try using a Python function in Lambda that writes to S3 with boto3.
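To make that suggestion concrete, here is a minimal sketch of such a Lambda handler. The bucket and key names are placeholders, and it assumes the function is configured with enough memory to buffer both files in RAM (avoiding the 512 MB /tmp limit); it is a sketch of the approach, not a hardened implementation.

```python
# Minimal sketch: merge two CSVs from S3 into one object with boto3.
# Bucket/key names are hypothetical; both files are buffered in memory,
# so the Lambda needs enough memory configured to hold them.
import io
import boto3

s3 = boto3.client("s3")

BUCKET = "my-data-bucket"                      # placeholder
SOURCES = ["input/part1.csv", "input/part2.csv"]
TARGET = "output/merged.csv"

def handler(event, context):
    merged = io.BytesIO()
    for i, key in enumerate(SOURCES):
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
        lines = body.iter_lines()
        header = next(lines)
        if i == 0:
            merged.write(header + b"\n")       # keep the header only once
        for line in lines:
            merged.write(line + b"\n")
    merged.seek(0)
    s3.upload_fileobj(merged, BUCKET, TARGET)
    return {"merged_key": TARGET}
```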

Related

synchronizing, scheduling and executing Node.js scripts on AWS

Usually, we store our code on GitHub and then deploy it to AWS Lambda.
We are now challenged with a specific Node.js script:
It takes roughly an hour to run, so we can't deploy it on a Lambda because of that.
It needs to run just once a month.
Once in a while we'll update the script in our GitHub repository, and we want the script in AWS to stay in sync when we make changes (e.g. using a pipeline).
This script copies files from S3 and processes them locally. It does some heavy lifting with data.
What would be the recommended way to set this up on AWS?
The serverless approach fits nicely since you will run the work only once per month, and data transfer between Lambda and S3 (in the same region) is free. If Lambda suits your use case except for the execution time constraint, and you can "track the progress" of the processing, you can create a state machine that invokes your Lambda as a step in a loop until all S3 data chunks have been processed. Each Lambda execution can take up to 15 minutes, while a state machine execution can run well beyond 1 hour. Regarding ops, you can have a trigger on your GitHub repository that publishes a new version of the Lambda; you can use AWS CloudFormation, CDK or any other suitable tool for that.
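A minimal sketch of the chunked Lambda such a state machine could loop over, assuming the work can be partitioned by S3 key. The bucket name and process_object() are placeholders; a Choice state in the state machine would re-invoke the function until the returned "done" flag is true.

```python
# Sketch of a "chunked" worker Lambda for a Step Functions loop.
# Each invocation processes one page of S3 keys and returns a
# continuation token for the next iteration.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-input-bucket"      # placeholder

def process_object(key):
    # heavy lifting for a single object goes here
    pass

def handler(event, context):
    kwargs = {"Bucket": BUCKET, "MaxKeys": 100}
    if event.get("continuation_token"):
        kwargs["ContinuationToken"] = event["continuation_token"]

    page = s3.list_objects_v2(**kwargs)
    for obj in page.get("Contents", []):
        process_object(obj["Key"])

    return {
        "done": not page.get("IsTruncated", False),
        "continuation_token": page.get("NextContinuationToken"),
    }
```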

AWS lambda function for copying data into Redshift

I am new to the AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using Terraform to create S3, Redshift and the other supporting functionality. For loading the data I am using a Lambda function which gets triggered when the Redshift cluster is up. The Lambda function has the code to copy the data from S3 to Redshift. Currently the process seems to work fine, and the amount of data is currently low.
My question is
This approach seems to work right now, but I don't know how it will behave once the volume of data increases, and what happens if the Lambda function times out.
Can someone please suggest an alternate way of handling this scenario, even if it can be handled without Lambda? One alternative I came across while searching this topic is AWS Data Pipeline.
Thank you
A serverless approach I've recommended clients move to in this case is the Redshift Data API (plus Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion, and if that is all you need to do then you're done.
If you need to take additional actions after the COPY, then you need a polling Lambda that checks when the COPY completes; this is also enabled by the Redshift Data API. Once the COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
launches the first Lambda (initiates the COPY)
has a wait loop that calls the "status checker" Lambda every 30 sec (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
launches the additional-actions Lambda once the status checker Lambda says the COPY completed
The Step function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and Step Function as one unit.
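A hedged sketch of the two Lambdas described above, using the Redshift Data API via boto3. The cluster identifier, database, user, table, S3 prefix and IAM role ARN are all placeholders.

```python
# Sketch of the "initiate COPY" and "status checker" Lambdas the
# Step Function orchestrates, using the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data")

def start_copy(event, context):
    """First Lambda: kick off the COPY and return immediately."""
    resp = rsd.execute_statement(
        ClusterIdentifier="my-cluster",                     # placeholder
        Database="dev",
        DbUser="awsuser",
        Sql="COPY my_table FROM 's3://my-bucket/prefix/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
            "FORMAT AS CSV;",
    )
    return {"statement_id": resp["Id"]}

def check_copy(event, context):
    """Status-checker Lambda the wait loop calls every ~30 seconds."""
    status = rsd.describe_statement(Id=event["statement_id"])["Status"]
    # Status is one of SUBMITTED, PICKED, STARTED, FINISHED, FAILED, ABORTED
    return {"statement_id": event["statement_id"], "status": status}
```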
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement an alternative solution in the meantime.
I wouldn't recommend Data Pipeline as it might be overkill (it will start an EC2 instance to run your commands). Your problem is simply the timeout, so you could use either ECS Fargate or a Glue Python Shell job. Either of them can be triggered by a CloudWatch Events rule on an S3 event.
a. Using ECS Fargate, you'll have to take care of the Docker image and set up the ECS infrastructure, i.e. task definition and cluster (simple for Fargate).
b. Using a Glue Python Shell job, you'll simply have to deploy your Python script to S3 (along with the required packages as wheel files) and link those files in the job configuration.
Both of these options are serverless, and you may choose one based on ease of deployment and your comfort level with Docker.
ECS doesn't have any timeout limit, while the timeout limit for Glue is 2 days.
Note: to trigger an AWS Glue job from a CloudWatch Event, you'll have to use a Lambda function, as CloudWatch Events doesn't support starting Glue jobs directly yet (see the sketch below).
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html
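A minimal sketch of that bridging Lambda, assuming the Glue job name is a placeholder and the incoming event is an EventBridge S3 "Object Created" notification (adjust the field lookups to whatever event shape you actually receive).

```python
# Small Lambda that bridges a CloudWatch/EventBridge rule to Glue,
# since the rule cannot start the Glue job directly.
import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Pass the S3 object that triggered the event through as job arguments
    detail = event.get("detail", {})
    resp = glue.start_job_run(
        JobName="merge-csv-python-shell",              # hypothetical job name
        Arguments={
            "--source_bucket": detail.get("bucket", {}).get("name", ""),
            "--source_key": detail.get("object", {}).get("key", ""),
        },
    )
    return {"job_run_id": resp["JobRunId"]}
```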

Timeout problems with bulk processing of files using AWS Lambda

I have a Lambda function which I'm expecting to exceed 15 minutes of execution time. What should I do so it will keep running until I have processed all of my files?
If you can, figure out how to scale your workload horizontally. This means splitting your workload so it runs on many Lambdas instead of one "super" Lambda. You don't provide a lot of details, so I'll list a couple of common ways of doing this:
Create an SQS queue and each lambda takes one item off of the queue and processes it.
Use an S3 trigger so that when a new file is added to a bucket, a Lambda processes that file (see the sketch below).
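A sketch of the S3-trigger variant: each new object invokes its own Lambda, so the 15-minute limit applies per file rather than to the whole batch. process_file() is a placeholder for your actual logic.

```python
# S3-notification-triggered Lambda: one invocation per uploaded file.
import urllib.parse
import boto3

s3 = boto3.client("s3")

def process_file(bucket, key):
    # per-file work goes here
    pass

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        process_file(bucket, key)
```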
If you absolutely need to process for longer than 15 minutes you can look into other serverless technologies like AWS Fargate. Non-serverless options might include AWS Batch or running EC2.
15 minutes is the maximum execution time available for AWS Lambda functions.
If your processing takes more than that, then you should break it into more than one Lambda. You can trigger them in sequence or in parallel depending on your execution logic.

Accessing Large files stored in AWS s3 using AWS Lambda functions

I have a file of more than 30 GB stored in S3, and I want to write a Lambda function which will access that file, parse it and then run some algorithm on it.
I am not sure if my Lambda function can take that big a file and work on it, as the max execution time for a Lambda function is 300 sec (5 min).
I found the S3 Transfer Acceleration feature, but will it help?
Considering the scenario, other than a Lambda function, can anyone suggest another service to host my code as a microservice and parse the file?
Thanks in advance
It depends entirely on the processing requirements and the frequency of processing.
You can use Amazon EMR for parsing the file and running the algorithm, and based on the requirement you can terminate the cluster or keep it alive for frequent processing. https://aws.amazon.com/emr/getting-started/
You can try the Amazon Athena service (recently launched), which will help you parse and process files stored in S3; the infrastructure is taken care of by Amazon. http://docs.aws.amazon.com/athena/latest/ug/getting-started.html
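For illustration, a hedged sketch of submitting an Athena query over S3 data with boto3. The database, table and results location are placeholders, and it assumes an Athena table has already been defined over the S3 prefix.

```python
# Submit an Athena query over files in S3; Athena handles the compute.
import boto3

athena = boto3.client("athena")

def run_query():
    resp = athena.start_query_execution(
        QueryString="SELECT col_a, COUNT(*) FROM my_table GROUP BY col_a",
        QueryExecutionContext={"Database": "my_database"},        # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]
```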
For complex processing flow requirements, you can use combinations of AWS services like AWS Data Pipeline (to manage the flow) and AWS EMR or EC2 (to run the processing task). https://aws.amazon.com/datapipeline/
Hope this helps, thanks

How to scale Lambda when /tmp is reused?

I have a Lambda function that reads from DynamoDB and creates a large file (~500 MB) in /tmp that is finally uploaded to S3. Once uploaded, the Lambda clears the file from /tmp (since there is a high probability that the instance may be reused).
This function takes about 1 minute to execute, even if you ignore the latencies.
In this scenario, when I try to invoke the function again in less than 1 minute, I have no control over whether I will have enough space to write to /tmp, and my function fails.
Questions:
1. What are the known workarounds in this kind of scenario?
(Potentially give more space in /tmp, or ensure a clean /tmp is given for each new execution)
2. What are the best practices regarding file creation and management in Lambda?
3. Can I attach another EBS volume or other storage to Lambda for execution?
4. Is there a way to have file-system-like access to S3, so that my function can write directly to S3 instead of using /tmp?
I doubt that two concurrently running instances of AWS Lambda will share /tmp or any other local resource, since they must execute in complete isolation, so your error should have a different explanation. If you mean that a subsequent invocation of AWS Lambda reuses the same instance, then you should simply clear /tmp on your own (see the sketch below).
In general, if your Lambda is a resource hog, you are better off doing that work in an ECS container worker and using the Lambda for launching ECS tasks, as described here.
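A sketch of that defensive cleanup: wipe leftovers in /tmp at the start of every invocation so a reused container cannot run out of ephemeral storage. The handler body is a placeholder for the DynamoDB-to-file-to-S3 work.

```python
# Clear /tmp at the start of each invocation in case the container
# (and therefore /tmp) is being reused from a previous run.
import os
import shutil

def clean_tmp():
    for name in os.listdir("/tmp"):
        path = os.path.join("/tmp", name)
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
        else:
            try:
                os.remove(path)
            except OSError:
                pass

def handler(event, context):
    clean_tmp()
    # ... generate the ~500 MB file, upload it to S3, then return ...
```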
You are likely running into the 512 MB /tmp limit of AWS Lambda.
You can improve your performance and address your problem by keeping the file in memory instead, since the memory limit for Lambda functions could already go as high as 1.5 GB when this was written.
Starting March 2022, Lambda supports increasing the /tmp directory's maximum size up to 10,240 MB.
More information is available in the AWS announcement.
It is now straightforward: the storage, named ephemeral storage, can be increased up to 10 GB in the general configuration of the Lambda function.
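For completeness, a sketch of setting the same option with boto3 rather than the console; the function name is a placeholder.

```python
# Raise a function's ephemeral storage (/tmp) to the 10,240 MB maximum.
import boto3

lam = boto3.client("lambda")

lam.update_function_configuration(
    FunctionName="my-csv-merge-function",      # hypothetical function name
    EphemeralStorage={"Size": 10240},          # size in MB, 512-10,240
)
```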