Hi, I am trying to understand the Lambda architecture in depth. Below is my understanding of Lambda.
Whenever we create a Lambda function, a container spins up. If we select Python as the runtime, a Python container spins up. Then there is the cold start. For example, if we didn't call the Lambda for a long time, the container becomes inactive. The next call needs a new container, and it takes some time to spin it up. This is the cold start. Now I am a bit confused here. If I want to avoid this delay, what is the right approach? We can trigger the Lambda every 5 minutes using CloudWatch. Are there any other good approaches to handle this?
Also, there is the /tmp folder where we can store static files. So is /tmp not part of the container? Whenever a new container spins up, will the /tmp data be lost or will it remain? Can someone help me understand these concepts and tell me the best approaches to handle this? Any help would be appreciated. Thank you.
You are correct that there is a cold start issue, but it has been observed that it depends on a lot of factors (runtime, memory, zip size, ...; e.g. a Java Lambda will have a longer cold start than a Python one). It was mainly a big problem for Lambdas inside a user-defined VPC, where there was the overhead of creating an elastic network interface before invoking the Lambda. A recent rollout has changed this, and you should no longer see this problem: see improved VPC networking for Lambda.
Also, at re:Invent 2019 AWS announced Provisioned Concurrency, so Lambda functions using Provisioned Concurrency will execute with consistent start-up latency.
With Provisioned Concurrency, functions can instantaneously serve a
burst of traffic with consistent start-up latency for every invoke up
to the specified scale. Customers only pay for the amount of concurrency that they configure and for the period of time that it is configured.
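To make the cold/warm distinction concrete, here is a minimal sketch in Python. It relies on the fact that module-level code runs once per execution environment while the handler runs on every invocation. The names `handler` and `is_warmup_ping`, and the choice to return early on scheduled pings (the "trigger every 5 minutes" idea from the question), are illustrative assumptions rather than anything AWS prescribes.

```python
import time

# Module-level code runs once per execution environment, i.e. during the cold start.
_ENVIRONMENT_STARTED_AT = time.time()
_invocation_count = 0


def is_warmup_ping(event):
    """Return True for scheduled events sent by a CloudWatch/EventBridge rule."""
    return (
        isinstance(event, dict)
        and event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def handler(event, context):
    """Report whether this invocation hit a cold or a warm environment."""
    global _invocation_count
    _invocation_count += 1
    cold_start = _invocation_count == 1  # first call in this environment

    if is_warmup_ping(event):
        # Keep-warm ping: skip the real work and return immediately.
        return {"warmup": True, "cold_start": cold_start}

    return {
        "warmup": False,
        "cold_start": cold_start,
        "environment_age_seconds": round(time.time() - _ENVIRONMENT_STARTED_AT, 3),
    }
```

A scheduled ping only keeps one environment warm, so a burst of concurrent requests can still hit cold starts. That is the gap Provisioned Concurrency is meant to close.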
Regarding /tmp, please note that each Lambda function receives 512 MB of non-persistent disk space in its own /tmp directory, so you cannot rely on it (see Lambda limits). If you are looking for persistent storage, you should be using S3.
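Since /tmp may or may not survive between invocations, one way to use it safely is as a best-effort cache that is always allowed to be empty. The sketch below assumes a hypothetical `fetch` callable that knows how to reload the file from durable storage (S3, for example); the function name and signature are mine, not part of any AWS API.

```python
import os
import tempfile


def load_static_file(name, fetch, cache_dir=None):
    """Return (data, from_cache), treating /tmp as a cache that may be empty.

    `fetch` is a hypothetical callable that reloads the bytes from durable
    storage whenever the cached copy is missing (for example on a new container).
    """
    cache_dir = cache_dir or tempfile.gettempdir()  # normally /tmp on Lambda
    path = os.path.join(cache_dir, name)

    if os.path.exists(path):
        # Warm container: the file written by an earlier invocation is still there.
        with open(path, "rb") as fh:
            return fh.read(), True

    # New container (or first call): reload and cache for later invocations.
    data = fetch()
    with open(path, "wb") as fh:
        fh.write(data)
    return data, False
```

The code never assumes the file exists, so losing /tmp on a new container costs one extra fetch instead of causing an error.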
I have one Lambda function that tests URLs using Puppeteer and Chrome.
When I invoke 50 Lambdas at the same time, Chrome is not able to load all the passed URLs.
What could be the reason for it?
I suspect it shares the CPU with time slicing.
One of the best features of AWS Lambda functions is scalability. It means it will increase the needed resources to perform the task. It is impossible to share the CPU because it will destroy the whole concept of Serverless in Lambda Functions. BUT, these scenarios could be your problem:
Invocations that reuse the same warm container also see the same /tmp directory, so files left behind by earlier runs accumulate there. Your code might store more than the allowed ephemeral storage, which might be the reason for your problem. I suggest checking the invocation logs to see if you can find any errors regarding the ephemeral storage.
As you said, you are sending 50 requests at the same time. If the target is just a single server, it might be flooded and its memory might fill up. In that case, the server can't respond to you anymore.
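To check the ephemeral storage theory, one option is to log how full /tmp is before and after the browser work. This is a rough sketch using only the standard library; `tmp_usage_mb` and the placeholder where Puppeteer/Chrome would run are assumptions for illustration.

```python
import shutil
import tempfile


def tmp_usage_mb(path=None):
    """Return total/used/free space of the temp directory in whole megabytes."""
    path = path or tempfile.gettempdir()  # normally /tmp on Lambda
    usage = shutil.disk_usage(path)
    mb = 1024 * 1024
    return {
        "total_mb": usage.total // mb,
        "used_mb": usage.used // mb,
        "free_mb": usage.free // mb,
    }


def handler(event, context):
    before = tmp_usage_mb()
    # ... placeholder: launch Chrome and load the URLs here ...
    after = tmp_usage_mb()
    # print() output ends up in the function's CloudWatch log stream.
    print({"tmp_before": before, "tmp_after": after})
    return {"tmp_before": before, "tmp_after": after}
```

If `free_mb` shrinks from one invocation to the next on the same container, the browser is probably leaving profile or cache files behind.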
While I have worked with AWS for a bit, I'm stuck on how to correctly approach the following use case.
We want to design an uptime monitor for up to 10K websites.
The monitor should run from multiple AWS regions, ping websites to check whether they are available, and measure the response time. With a Lambda function, I can ping the site, pass the result to an SQS queue and process it. So far, so good.
However, I want to run this function every minute. I also want to have the ability to add and delete monitors. So if I don't want to monitor website "A" from region "us-west-1" I would like to do that. Or the other way round, add a website to a region.
Ideally, all this would run serverless and be deployable to custom regions with CloudFormation.
What services should I go with?
I have been thinking about EventBridge, where I would create custom events for every website in every region and then send the result over SNS to a central processing Lambda. But I'm not sure this is the way to go.
Alternatively, I wanted to build a scheduler lambda that fetches the websites it has to schedule from a DB and then invokes the fetcher lambda. But I was not sure about the delay since I want to have the functions triggered every minute. The architecture should monitor 10K websites and even more if possible.
Feel free to give me any advice you have :)
Kind regards.
In my opinion Lambda is not the correct solution for this problem. Your costs will be very high and it may not scale to what you want to ultimately do.
A c5.9xlarge EC2 instance costs about USD $1.53/hour and has a 10 Gbit network. With 36 vCPUs, a threaded program could take care of a large percentage, maybe all 10k, of your load. It could still be run in multiple regions on demand and push to an SQS queue. That's around $1100/month/region without pre-purchasing EC2 time.
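For what it's worth, the "threaded program" could look roughly like the sketch below. It is only a starting point: the worker count, the timeout and the result fields are my assumptions, and any HTTP error or timeout is simply recorded as "down".

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def check_site(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch one URL and report availability and response time."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
        ok = 200 <= status < 400
    except Exception:
        # Timeouts, DNS failures and HTTP 4xx/5xx (raised as HTTPError) land here.
        status, ok = None, False
    elapsed_ms = (time.monotonic() - started) * 1000
    return {"url": url, "ok": ok, "status": status, "elapsed_ms": elapsed_ms}


def check_all(urls, max_workers=200, check=check_site):
    """Check many sites concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(check, urls))
```

The work is I/O-bound, so the thread count matters far more than the CPU count. Each result could then be pushed to the SQS queue as described.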
A Lambda, running 10000 times / minute and running 5 seconds every time and taking only 128MB would be around USD $4600/month/region.
Coupled with the management interface you're alluding to, the EC2 instance could handle pretty much everything you want to do. Of course, you'd want to scale and likely have at least two EC2 instances for failover, but with two of them you're still at less than half the cost of the Lambda. As you scale to 100,000 websites, it's a matter of adding machines.
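The arithmetic behind these estimates can be reproduced with a few lines of Python. The EC2 hourly rate is the one quoted above; the Lambda prices ($0.0000166667 per GB-second and $0.20 per million requests) are the on-demand list prices I am assuming, with no free tier, so check current pricing for your region before relying on them.

```python
HOURS_PER_MONTH = 730              # average hours in a month
EC2_HOURLY_USD = 1.53              # c5.9xlarge on-demand rate quoted above

# Assumed Lambda list prices; verify against current AWS pricing.
LAMBDA_USD_PER_GB_SECOND = 0.0000166667
LAMBDA_USD_PER_MILLION_REQUESTS = 0.20


def ec2_monthly_usd(instances=1):
    """Monthly cost of running `instances` EC2 instances around the clock."""
    return instances * EC2_HOURLY_USD * HOURS_PER_MONTH


def lambda_monthly_usd(invocations_per_minute, seconds_per_run, memory_mb, days=30):
    """Monthly Lambda cost: compute (GB-seconds) plus the per-request charge."""
    invocations = invocations_per_minute * 60 * 24 * days
    gb_seconds = invocations * seconds_per_run * (memory_mb / 1024)
    compute = gb_seconds * LAMBDA_USD_PER_GB_SECOND
    requests = invocations / 1_000_000 * LAMBDA_USD_PER_MILLION_REQUESTS
    return compute + requests


if __name__ == "__main__":
    print(f"1 x EC2: ${ec2_monthly_usd(1):,.0f}/month")
    print(f"2 x EC2: ${ec2_monthly_usd(2):,.0f}/month")
    print(f"Lambda:  ${lambda_monthly_usd(10_000, 5, 128):,.0f}/month")
```

Under these assumptions one instance comes to about $1,117/month, two to about $2,234, and the Lambda setup to about $4,586 (432 million invocations and 270 million GB-seconds), which matches the rounded figures above.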
There are a ton of other choices but understand that serverless does not mean cost efficient in all use cases.
Say I have 4 or 5 data sources that I access through API calls. The data aggregation and mining is all scripted in a Python file. Let's say the output is all structured data. I know there are plenty of considerations, but from a high level, what would some possible solutions look like if I ultimately wanted to run analysis in BI software?
Can I host the Python script in Lambda, set a daily trigger to run it, and then have the output stored in RDS/Aurora? Or, since the applications I'm making API calls to aren't in AWS, would I need the data to be in an AWS instance before running a Lambda function?
Or should I host the Python script on an EC2 instance and use Lambda to trigger a daily refresh that just stores the data in EC2 EBS or Redshift?
Just starting to learn AWS cloud architecture so my knowledge is fairly limited. Just seems like there can be multiple solutions to any problem so not sure if the 2 ideas above are viable.
You've mentioned two approaches, and both would work. Ultimately it really depends on your use case, budget, etc., and you are right: in AWS you will usually have different solutions that can solve the same problem. For example, another possible solution could be to Dockerize your Python script and run it on a container service (ECS/EKS). But considering you have just started with AWS, I will focus on the approaches you mentioned, as they are probably the two most common ones.
In short, based on your description, I would not suggest going with EC2 because it adds complexity to your use case and, moreover, extra costs. If you picture the final setup, you will need to configure and manage the instance itself, its instance type, AMI, your script deployment, internet access, subnets, etc. Also a minor thing to clarify: you would probably set a cron expression on the instance to trigger your script (not a Lambda reaching into the EC2 instance!). As you can see, that is quite a big setup for little benefit (except maybe gaining some experience with AWS ;)), and the instance would be idle most of the time, which is far from optimal.
If you just have to run a daily Python script and need to store the output somewhere, I would suggest using Lambda for the processing. You can simply have a scheduled event (the preferred way is now Amazon EventBridge) that triggers your Lambda function once a day. Then, depending on your output and how you need to process it, you can use RDS from Lambda using the Python SDK, but you can also use S3 as blob storage if you don't need to run specific queries, for example if you can store your output in JSON format.
Note that one limitation to lambda is that it can only run for 15 minutes straight per execution. The good thing is that by default lambda has internet access so you don't need to care about any gateway setup and can reach your external endpoints.
Also, from a cost perspective, running one Lambda per day combined with S3 should be free or almost free. Lambda pricing is very cheap. Running an EC2 instance or RDS (which is also an instance) 24/7 will cost you some money.
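As a rough illustration of the Lambda plus S3 option, the storage step could look like the sketch below. The bucket name is passed in by the caller, the key layout (`daily/YYYY/MM/DD/output.json`) is an arbitrary choice of mine, and `boto3` is imported lazily because it is only needed when no client is supplied.

```python
import json
from datetime import datetime, timezone


def build_key(prefix, now=None):
    """Build a date-partitioned object key such as daily/2024/01/02/output.json."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}/{now:%Y/%m/%d}/output.json"


def store_output(records, bucket, prefix="daily", s3_client=None, now=None):
    """Serialize `records` to JSON and upload them to S3; returns the object key."""
    if s3_client is None:
        import boto3  # included in the Lambda Python runtime
        s3_client = boto3.client("s3")

    key = build_key(prefix, now)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

The daily handler would call your API aggregation code and pass its result to `store_output`. The function's execution role needs permission to write to the bucket.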
Lambda with storage in S3 is the way to go. EC2 / EBS costs add up over time and EC2 will limit the parallelism you can achieve.
Look into Step Functions as a way to organize and orchestrate your Lambdas. I have Python code that copies 500K+ files to S3 and takes a week to run. If I copy the files in parallel (500-ish at a time), this process takes about 10 hours. The parallelism is limited by the source system, as I can overload it by going wider. The main Lambda launches the file copy Lambdas at a controlled rate, terminates after a few minutes of run time, and returns the last file processed to the controlling Step Function. The Step Function then restarts the main Lambda where the last one left off.
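The "stop early and hand the position back" pattern can be sketched as below. This is a simplified stand-in for my setup, not the actual code: it copies files inline instead of launching worker Lambdas, it passes the whole file list in the event (too large for a real 500K-file run), and the `copy_file` hook, `next_index` field and 60-second safety margin are all illustrative assumptions.

```python
def handler(event, context, copy_file=None, safety_margin_ms=60_000):
    """Process files until time runs short, then return where to resume.

    The returned dict is meant to be fed back in as the next `event` by the
    Step Function until `done` is True.
    """
    files = event["files"]
    index = event.get("next_index", 0)
    copy = copy_file or (lambda name: None)  # placeholder for the real copy

    # Stop well before the Lambda timeout so the state can be returned cleanly.
    while (
        index < len(files)
        and context.get_remaining_time_in_millis() > safety_margin_ms
    ):
        copy(files[index])
        index += 1

    return {"files": files, "next_index": index, "done": index >= len(files)}
```

In the state machine, a Choice state can loop back to the same task while `done` is false, which gets around the 15-minute limit without any long-running process.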
Since you have multiple sources you can have multiple top level Lambdas running in parallel all from the same Step Function and each launching a controlled number of worker Lambdas. You won't overwhelm S3 but you will want to make sure you don't overload your sources.
The best part of this is that it costs pennies (at the scale I'm using it).
Once the data is in S3 I'm copying it up to Redshift and transforming it. These processes are also part of the Step Function through additional Lambda Functions.
I have a batch job that I need to run on AWS, and I'm wondering what the best service to use is. The job needs to run once a day, so I think that AWS Lambda with a CloudWatch rule triggering it would naturally do it. However, I'm starting to think that AWS Lambda is meant to be used as a service that handles requests. The official AWS library for integrating Spring Boot is very oriented towards handling HTTP requests, and when creating a Lambda via the AWS Console, only test cases that send an input to the Lambda can be written.
Then, is this a use case for AWS Lambda? Also, these functions can run up to 15 minutes. What should I use if my job needs to run longer?
The purpose of Lambda, as compared to AWS EC2, is to simplify building smaller, on-demand applications that are responsive to events and new information.
If your batch job finishes within the 15-minute limit, you can go with a Lambda function.
But if you need proper batch processing, you should check AWS Batch.
Here is a nice article which demonstrates the usage of AWS Batch.
If you are already using a batch framework like Spring Batch, you can also take a look at ECS scheduled tasks with Fargate.
With ECS Fargate you can launch and stop container services that you need to run only at certain times.
Here are some related articles on Fargate event and scheduled task and Scheduled Tasks.
If you're confident that your function will run for at most 15 minutes, AWS Lambda could be the solution. Here are the AWS Lambda limits that could help you decide on that.
Also note that Lambda has cold starts: the first invocation on a new container runs slower, and later invocations on the warm container are faster. Here are some good reads about it that could help you decide on the Lambda direction, but feel free to check any other articles that explain it better.
This one gives a brief list of things you would want to consider and the factors affecting cold starts.
This one might have a deeper explanation of the cold start with regards to how it works internally.
What should I use if my job needs to run longer?
Depending on your infrastructure, you could maybe explore Scheduled Tasks.
How can I check which Lambda functions are running using the AWS CLI?
It seems that there is no command to check it:
aws lambda XXXX
I have several scripts running, and I'd like to monitor the situation.
It is enough to show how many functions are running.
Thank you
Watching the CloudWatch logs would be the best way to monitor whether a Lambda is running or not. These logs are not real time, but they may be near enough to real time for your needs. You could ask CloudWatch for the last X minutes of logs for a particular Lambda and monitor the timing of the log statements. As Aniket Chopade stated though, knowing why you're trying to do this could help someone provide a better solution.
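Asking CloudWatch for the last X minutes of logs could look roughly like this with boto3. It assumes the function writes to the default `/aws/lambda/<function-name>` log group and only reads the first page of results; the helper names are mine.

```python
import time


def recent_log_events(function_name, minutes=5, logs_client=None, now_ms=None):
    """Return log events from the last `minutes` minutes for one function."""
    if logs_client is None:
        import boto3  # only needed when no client is injected
        logs_client = boto3.client("logs")

    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    response = logs_client.filter_log_events(
        logGroupName=f"/aws/lambda/{function_name}",  # default log group name
        startTime=now_ms - minutes * 60 * 1000,       # epoch milliseconds
        endTime=now_ms,
    )
    # Only the first page is read; follow `nextToken` for busy functions.
    return response.get("events", [])


def count_started_invocations(events):
    """Count the START lines that Lambda writes at the beginning of each invocation."""
    return sum(
        1 for event in events if event.get("message", "").startswith("START RequestId")
    )
```

Comparing the number of `START` lines with the number of `END` lines in the same window gives a rough idea of how many invocations are still in flight.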
You will not see any state such as "running" in the CLI output.
From the user's perspective, Lambda functions are always up and running (actually the right word is invokable) because they respond to triggers.
There is a concept of keeping them "warm", which means keeping one physical instance alive with containers running in it. But again, this level of detail is hidden from Lambda's users.
I am curious why you want to know such state for lambda functions.