How to scale Lambda when /tmp is reused? - amazon-web-services

I have a Lambda function that reads from DynamoDB and creates a large file (~500 MB) in /tmp that is finally uploaded to S3. Once it is uploaded, the Lambda clears the file from /tmp (since there is a high probability that the instance may be reused).
This function takes about 1 minute to execute, even ignoring latencies.
In this scenario, when I invoke the function again within less than a minute, I have no control over whether I will have enough space to write to /tmp, and my function fails.
Questions:
1. What are the known workarounds for this kind of scenario?
(Potentially getting more space in /tmp, or ensuring a clean /tmp for each new execution.)
2. What are the best practices regarding file creation and management in Lambda?
3. Can I attach an EBS volume or other storage to Lambda for execution?
4. Is there a way to have file-system-like access to S3, so that my function can write directly to S3 instead of using /tmp?

I doubt that two concurrently running instances of AWS Lambda share /tmp or any other local resource, since they must execute in complete isolation, so your error should have a different explanation. If you mean that a subsequent invocation of AWS Lambda reuses the same instance, then you should simply clear /tmp on your own.
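If you go the clear-it-yourself route, a minimal sketch of the handler could look like this (assuming everything under /tmp belongs to this function and is safe to delete):

```python
import os
import shutil

def clear_tmp(path="/tmp"):
    """Remove anything left in /tmp by a previous invocation of this environment."""
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            shutil.rmtree(full, ignore_errors=True)
        else:
            os.remove(full)

def handler(event, context):
    clear_tmp()  # start from a clean slate even if the execution environment is reused
    # ... read from DynamoDB, build the file in /tmp, upload it to S3 ...
    return {"status": "ok"}
```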
In general, if your Lambda is a resource hog, you are better off doing that work in an ECS container worker and using the Lambda only for launching ECS tasks, as described here.

You are likely running into the 512 MB /tmp limit of AWS Lambda.
You can improve your performance and address your problem by storing the file in memory, since the memory limit for Lambda functions can go as high as 1.5 GB.
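As a rough illustration of the in-memory approach (bucket and key names here are invented), you can build the file in a BytesIO buffer and hand it straight to boto3, so /tmp is never touched:

```python
import io
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    buffer = io.BytesIO()
    # ... write the generated content into the buffer instead of a file in /tmp ...
    buffer.write(b"example content\n")
    buffer.seek(0)
    # upload_fileobj streams the buffer to S3 (using multipart upload for large objects)
    s3.upload_fileobj(buffer, "my-output-bucket", "exports/report.csv")
    return {"status": "uploaded"}
```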

As of March 2022, Lambda supports increasing the /tmp directory's maximum size to 10,240 MB (10 GB).
More information available here.

It is now even easier: this storage, named Ephemeral Storage, can be increased to 10 GB. The setting is available in the general configuration of the Lambda function.
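The same setting can also be applied programmatically; here is a sketch with boto3 (the function name is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

# Raise the function's ephemeral /tmp storage to the 10 GB maximum.
lambda_client.update_function_configuration(
    FunctionName="my-export-function",   # placeholder name
    EphemeralStorage={"Size": 10240},    # size in MB, valid range 512-10240
)
```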

Related

Best AWS architecture solution for migrating data to cloud

Say I have 4 or 5 data sources that I access through API calls. The data aggregation and mining is all scripted in a Python file. Let's say the output is all structured data. I know there are plenty of considerations, but from a high level, what would some possible solutions look like if I ultimately wanted to run analysis in BI software?
Can I host the Python script in Lambda, set a daily trigger to run it, and then have the output stored in RDS/Aurora? Or, since the applications I'm making API calls to aren't in AWS, would I need the data to be in an AWS instance before running a Lambda function?
Or should I host the Python script on an EC2 instance and use Lambda to trigger a daily refresh that just stores the data in EC2/EBS or Redshift?
I'm just starting to learn AWS cloud architecture, so my knowledge is fairly limited. It seems like there can be multiple solutions to any problem, so I'm not sure whether the two ideas above are viable.
You've mentioned two approaches that would work. Ultimately it very much depends on your use case, budget, etc., and you are right: in AWS you will usually have different solutions that can solve the same problem. For example, another possible solution could be to Dockerize your Python script and run it on a container service (ECS/EKS). But considering you just started with AWS, I will focus on the approaches you mentioned, as they are probably the two most common ones.
In short, based on your description, I would not suggest going with EC2, because it adds complexity to your use case and extra costs on top. If you picture the final setup, you will need to configure and manage the instance itself, its instance type, AMI, your script deployment, internet access, subnets, etc. Also, a minor thing to clarify: you would probably set a cron expression on the instance to trigger your script (not a Lambda reaching into the EC2 instance!). As you can see, it is quite a big setup for little benefit (except maybe gaining some experience with AWS ;)), and the instance would be idle most of the time, which is far from optimal.
If you just have to run a daily Python script and need to store the output somewhere, I would suggest using Lambda for the processing: you can simply have a scheduled event (the preferred way is now Amazon EventBridge) that triggers your Lambda function once a day. Then, depending on your output and how you need to process it, you can of course use RDS from Lambda via the Python SDK, but you can also use S3 as blob storage if you don't need to run specific queries - for example, if you can store your output in JSON format.
Note that one limitation of Lambda is that it can only run for 15 minutes per execution. The good thing is that by default Lambda has internet access, so you don't need to worry about any gateway setup and can reach your external endpoints.
Also, from a cost perspective, running one Lambda per day combined with S3 should be free or almost free, since Lambda pricing is very cheap. Running an EC2 instance or RDS (which is also an instance) 24/7 will cost you some money.
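A small sketch of that setup (bucket and key names are placeholders, and the EventBridge schedule itself would be configured separately): the handler pulls from your APIs and drops the aggregated output into S3 as JSON.

```python
import datetime
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-aggregation-bucket"   # placeholder

def fetch_and_aggregate():
    # ... call your 4-5 external APIs here and return structured data ...
    return [{"source": "api-1", "value": 42}]

def handler(event, context):
    data = fetch_and_aggregate()
    key = f"daily/{datetime.date.today().isoformat()}.json"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(data).encode("utf-8"),
        ContentType="application/json",
    )
    return {"written": key}
```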
Lambda with storage in S3 is the way to go. EC2 / EBS costs add up over time and EC2 will limit the parallelism you can achieve.
Look into Step Functions as a way to organize and orchestrate your Lambdas. I have Python code that copies 500K+ files to S3 and takes a week to run. If I copy the files in parallel (500-ish at a time), the process takes about 10 hours; the parallelism is limited by the sourcing system, as I can overload it by going wider. The main Lambda launches the file-copy Lambdas at a controlled rate, terminates itself after a few minutes of run time, and returns the last file updated to the controlling Step Function. The Step Function then restarts the main Lambda where the last one left off.
Since you have multiple sources, you can have multiple top-level Lambdas running in parallel, all from the same Step Function, each launching a controlled number of worker Lambdas. You won't overwhelm S3, but you will want to make sure you don't overload your sources.
The best part of this is that it costs pennies (at the scale I'm using it).
Once the data is in S3, I copy it into Redshift and transform it. These processes are also part of the Step Function, through additional Lambda functions.
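A rough sketch of that controller pattern (not the exact code described above; the bucket name, time budget, and the copy_one helper are placeholders for however each file actually gets copied, e.g. by invoking a worker Lambda): the function processes keys until its time budget is nearly spent, then returns the last key handled so the Step Function can re-invoke it with that marker.

```python
import boto3

s3 = boto3.client("s3")
SOURCE_BUCKET = "source-bucket"   # placeholder
SAFETY_MS = 60_000                # stop with a minute of runtime to spare

def copy_one(key):
    # ... launch or perform the copy of a single object ...
    pass

def handler(event, context):
    start_after = event.get("last_key", "")
    last_key = start_after
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, StartAfter=start_after):
        for obj in page.get("Contents", []):
            copy_one(obj["Key"])
            last_key = obj["Key"]
            if context.get_remaining_time_in_millis() < SAFETY_MS:
                # Hand the marker back to the Step Function, which restarts us here.
                return {"done": False, "last_key": last_key}
    return {"done": True, "last_key": last_key}
```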

AWS Serverless: Force parallel lambda execution based on request or HTTP API parameters

Is there a way to force AWS to execute a Lambda request coming from an API Gateway resource in a particular execution environment? We're in a use case where we use one codebase with various models that are 100-300 MB, so each on its own is small enough to fit in the ephemeral storage, but together they are too big to play well.
Currently, a second invocation with a different model will use the existing (warmed-up) Lambda function and run out of storage.
I'm hoping to attach something like a parameter to the request that forces Lambda to create parallel versions of the same function for each of the models, so that we don't run over the 512 MB limit and can optimize the cold-boot times, ideally without duplicating the function and having to maintain it in multiple places.
I've tried to investigate Step Functions, but I'm not sure if there's an option for parameter-based conditionality there. AWS suggests using EFS to circumvent the ephemeral storage limits, but from what I can find, using EFS will be a lot slower than reading from the ephemeral /tmp/ directory.
To my knowledge: no, you cannot control the execution environments. The only thing you can do is limit the number of concurrent executions.
So you never know whether a single execution environment is serving all your events triggered from API Gateway or several are running in parallel, and you have no control over which execution environment serves the next request.
If your issue is the /tmp directory limit of AWS Lambda, why not try EFS?
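A hedged sketch of what that could look like (the mount path and file names are assumptions; you would attach an EFS access point to the function): read the model straight from the EFS mount and keep it cached in memory across warm invocations, so /tmp is bypassed entirely.

```python
import os

EFS_MOUNT = "/mnt/models"   # assumed EFS access point mount path configured on the function

def load_model(name):
    """Read a model file straight from the EFS mount instead of staging it in /tmp."""
    path = os.path.join(EFS_MOUNT, name)
    with open(path, "rb") as f:
        return f.read()        # replace with your framework's loader

_model_cache = {}              # survives across warm invocations of this environment

def handler(event, context):
    name = event["model"]                      # e.g. "model-a.bin"
    if name not in _model_cache:
        _model_cache[name] = load_model(name)  # EFS read paid once per warm environment
    # ... run inference with _model_cache[name] ...
    return {"model": name, "size_bytes": len(_model_cache[name])}
```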

Python Script as a Cron on AWS S3 buckets

I have a Python script which copies files from one S3 bucket to another S3 bucket. This script needs to run every Sunday at a specific time. Having read some articles and answers, I tried to use AWS Lambda + CloudWatch Events. The script runs for a minimum of 30 minutes; would Lambda still be a good fit, given that Lambda can run for a maximum of 15 minutes? Or is there another way? I could create an EC2 box and run it as a cron job, but that would be expensive. Is there another standard way?
The more appropriate way would be to use an AWS Glue Python shell job, as it falls under the serverless umbrella and you are charged as you go.
This way you will only be charged for the time your code runs.
You also don't need to manage any EC2 instance for this. It is like an extended Lambda.
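The script body for such a Glue Python shell job can be a plain boto3 copy loop; a sketch with placeholder bucket names:

```python
import boto3

s3 = boto3.resource("s3")
SOURCE = "source-bucket"        # placeholder
DEST = "destination-bucket"     # placeholder

def copy_all():
    dest_bucket = s3.Bucket(DEST)
    for obj in s3.Bucket(SOURCE).objects.all():
        # Server-side copy: the data moves within S3, not through the job itself.
        dest_bucket.copy({"Bucket": SOURCE, "Key": obj.key}, obj.key)

if __name__ == "__main__":
    copy_all()
```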
If the two buckets are supposed to stay in sync, i.e. all files from bucket #1 should eventually be synced to bucket #2, then there are various replication options in S3.
Otherwise, look at S3 Batch Operations. You can derive the list of files that you need to copy from S3 Inventory, which will give you additional context on the files, such as date/time uploaded, size, storage class, etc.
Unfortunately, the Lambda 15-minute execution limit is a hard stop, so Lambda is not suitable for this use case as a single big-bang run.
You could use multiple Lambda invocations to go through the objects one at a time and move them. However, you would need a DynamoDB table (or something similar) to keep track of what has and has not been moved.
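A sketch of that bookkeeping (table and attribute names are made up): claim each key in DynamoDB with a conditional write before copying, so a later run skips objects that were already handled.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("copied-objects")   # placeholder table

def already_copied(key):
    """Atomically claim the key; returns True if a previous run already claimed it."""
    try:
        table.put_item(
            Item={"object_key": key},
            ConditionExpression="attribute_not_exists(object_key)",
        )
        return False
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True
        raise

def copy_key(key, source_bucket, dest_bucket):
    if already_copied(key):
        return
    s3.copy_object(
        Bucket=dest_bucket,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )
```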
A couple of other options would be:
S3 Replication, which will keep one bucket in sync with the other.
An S3 Batch Operations job.
Or, if these are data files, you can always use AWS Glue.
You can certainly use Amazon EC2 for a long-running batch job.
A t3.micro Linux instance costs $0.0104 per hour, and a t3.nano is half that price, charged per-second.
Just add a command at the end of the User Data script that will shut down the instance:
sudo shutdown now -h
If you launch the instance with Shutdown Behavior = Terminate, then the instance will self-terminate.

Aws lambda vs aws batch

I am currently working on a project where I need to merge two significantly large CSV files into one (both are a few hundred MB). I am fairly new to AWS. I am aware of the memory allocation and execution time limitations of Lambda. Other than that, are there any advantages of using Batch jobs over Lambda for this project? Is there any other AWS component that is more suitable for this task? Either the Lambda or the Batch job will be triggered inside a Step Function via an SNS notification.
A Lambda function has some limitations:
Execution time: 15 minutes
RAM: 3 GB
Disk space: /tmp is only 512 MB, which makes it difficult to store any file larger than that on Lambda
The good points are that it is cheap and boots up fast.
I suggest you use ECS (both the Fargate and EC2 launch types work well).
Try using a Python function in Lambda that writes to S3 with boto3.
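A rough sketch of that idea (bucket and key names are invented, and it assumes both CSVs share the same header and together fit in the function's memory): stream the two objects, drop the second header, and write the merged result back with boto3.

```python
import io
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"   # placeholder

def handler(event, context):
    merged = io.BytesIO()

    # First file: keep everything, including the header line.
    first = s3.get_object(Bucket=BUCKET, Key="input/a.csv")["Body"].read()
    merged.write(first)
    if not first.endswith(b"\n"):
        merged.write(b"\n")

    # Second file: skip its header line, append the rest.
    second = s3.get_object(Bucket=BUCKET, Key="input/b.csv")["Body"]
    lines = second.iter_lines()
    next(lines, None)                      # drop the duplicate header
    for line in lines:
        merged.write(line + b"\n")

    merged.seek(0)
    s3.upload_fileobj(merged, BUCKET, "output/merged.csv")
    return {"merged_bytes": merged.getbuffer().nbytes}
```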

How to improve lambda performance?

Hi, I am trying to understand the Lambda architecture in depth. Below is my understanding of Lambda.
Whenever we invoke a Lambda function, a container spins up. If we select Python as the runtime, a Python container spins up. Then there is the cold start: if we don't call the Lambda for a long time, the container becomes inactive, a new container has to be created, and it takes some time to spin it up. This is a cold start. Now I am a bit confused here. If I want to avoid this delay, what is the right approach? We could trigger the Lambda every 5 minutes using CloudWatch. Are there any other good approaches to handle this?
Also, there is the /tmp folder where we can store static files. Is /tmp not part of the container? Whenever a new container spins up, is the /tmp data lost or does it remain? Can someone help me understand these concepts and tell me the best approaches to handle this? Any help would be appreciated. Thank you.
You are correct that there is a cold start issue, but it has been observed that it depends on a lot of factors (runtime, memory, zip size, etc. - for example, a Java Lambda will have a longer cold start than a Python one). It was basically a big problem for Lambdas inside a user-defined VPC, where there is the overhead of creating an elastic network interface before invoking the Lambda. But a recent rollout has changed this, and you should no longer see this problem: improved VPC networking for AWS Lambda.
Also, at re:Invent 2019 AWS announced Provisioned Concurrency, so Lambda functions using Provisioned Concurrency will execute with consistent start-up latency:
With Provisioned Concurrency, functions can instantaneously serve a burst of traffic with consistent start-up latency for every invoke up to the specified scale. Customers only pay for the amount of concurrency that they configure and for the period of time that it is configured.
Regarding /tmp, please note that each Lambda function receives 512 MB of non-persistent disk space in its own /tmp directory, so you cannot rely on it (see the Lambda limits documentation). If you are looking for persistent storage, you should be using S3.
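For completeness, Provisioned Concurrency is configured per published version or alias of a function; here is a sketch with boto3 (function name and alias are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep 5 execution environments initialized for the "live" alias of the function.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="my-function",            # placeholder
    Qualifier="live",                      # an alias or version number, not $LATEST
    ProvisionedConcurrentExecutions=5,
)
```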