Why does my Python app always cold start twice on AWS Lambda?

I have a Lambda function in Python where I am loading a large machine learning model during the cold start. The code is something like this:
from uuid import uuid4

from fastapi import FastAPI
from mangum import Mangum

# app_logger and endpoints come from the application's own modules (myapp.*)
uuid = uuid4()
app_logger.info("Loading model... %s" % uuid)
endpoints.embedder.load()

def create_app() -> FastAPI:
    app = FastAPI()
    app.include_router(endpoints.router)
    return app

app_logger.info("Creating app... %s" % uuid)
app = create_app()
app_logger.info("Loaded app. %s" % uuid)
handler = Mangum(app)
The first time after deployment, AWS Lambda seems to start the Lambda twice as seen by the two different UUIDs. Here are the logs:
2023-01-05 21:44:40.083 | INFO | myapp.app:<module>:47 - Loading model... 76a5ac6f-a4fc-490e-b21c-83bb5ef458eb
2023-01-05 21:44:42.406 | INFO | myapp.embedder:load:31 - Loading embedding model
2023-01-05 21:44:50.626 | INFO | myapp.app:<module>:47 - Loading model... c633a9c6-bcfc-44d5-bacf-9834b39ee300
2023-01-05 21:44:51.878 | INFO | myapp.embedder:load:31 - Loading embedding model
2023-01-05 21:45:00.418 | INFO | myapp.app:<module>:59 - Creating app... c633a9c6-bcfc-44d5-bacf-9834b39ee300
2023-01-05 21:45:00.420 | INFO | myapp.app:<module>:61 - Loaded app. c633a9c6-bcfc-44d5-bacf-9834b39ee300
This happens consistently. It runs for about 10 seconds the first time, then seems to restart and do it all again. There are no errors in the logs that indicate why this happens. I have my Lambda configured with 4 GB of memory, and it always loads with < 3 GB used.
Any ideas why this happens and how to avoid it?

To summarize all the learnings in the comments so far:
AWS limits the init phase to 10 seconds. This is explained here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html
If initialization exceeds 10 seconds, Lambda initializes the function again, this time without the limit.
If you hit the 10-second limit, there are two ways to deal with this:
Initialize the model lazily during the first invocation instead of at module load time (see the sketch after this summary). The downside is that you don't get the CPU boost and lower-cost initialization of the init phase.
Use provisioned concurrency. Init is not limited to 10 seconds, but this is more expensive and can still run into the same problems as not using it, e.g. if you get a burst in usage.
Moving my model to EFS does improve startup time compared to S3 and Docker layer caching, but it is not sufficient to make it init in < 10 seconds. It might work for other use cases with slightly smaller models though.
Perhaps someday SnapStart will address this problem for Python. Until then, I am going back to EC2.
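For completeness, here is a minimal sketch of the lazy-initialization option above, assuming the same FastAPI + Mangum setup as the question (app_logger and endpoints are the app's own modules, imports omitted); the middleware-based trigger is just one illustrative way to defer the load until the first invocation.

_model_loaded = False  # module-level flag, shared across invocations of this environment

def ensure_model_loaded():
    # Load the model on the first invocation instead of during the init phase,
    # so the 10-second init limit no longer applies (at the cost of a slower
    # first request and no init-phase CPU boost).
    global _model_loaded
    if not _model_loaded:
        app_logger.info("Loading model lazily...")
        endpoints.embedder.load()
        _model_loaded = True

app = create_app()

@app.middleware("http")
async def load_model_middleware(request, call_next):
    # Runs before every request; the actual load happens only once.
    ensure_model_loaded()
    return await call_next(request)

handler = Mangum(app)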

Related

Best wsgi service for handling webhooks with few resources

Currently working on a virtual server with 2 CPUs and 4 GB of RAM. I am running Flask + uWSGI + nginx to host the webserver. The server needs to accept about 10 of the roughly 2,500 requests it receives per day. The requests that don't pass take about 2 ms on average, yet the queue is consistently backed up. The issues I have been encountering lately are both speed and duplication when it does work: the accepted webhooks are sent on to another server, and I either get duplicates or completely miss a bunch.
[uwsgi]
module = wsgi
master = true
processes = 4
enable-threads = true
threads = 2
socket = API.sock
chmod-socket = 660
vacuum = true
harakiri = 10
die-on-term = true
This is my current .ini file. I have messed around with harakiri and have spent countless hours reading through the uWSGI documentation trying different things; it is unbelievably frustrating.
(Screenshot of systemctl status for the API service omitted.)
The check for it looks similar to this (I've redacted some info):
@app.route('/api', methods=['POST'])
def handle_callee():
    authorization = request.headers.get('authorization')
    if authorization == SECRET and check_callee(request.json):
        data = request.json
        name = data["payload"]["object"]["caller"]["name"]
        create_json(name, data)
        return 'success', 200
    else:
        return 'failure', 204
The JSON is then parsed through a number of functions. This is my first time deploying a WSGI service and I don't know if my configuration is incorrect. I've poured hours of research into trying to fix this. Should I try switching to Gunicorn? I asked this question differently a couple of days ago but to no avail, so I'm trying to add more context in hopes someone can point me in the right direction. I don't even know whether the "req: 12/31" in the systemctl status shows how many requests that PID has handled so far and how many are queued. Any insight into this situation would make my week. I've been unable to fix this for about two weeks of trying different configs: increasing workers and processes, tweaking harakiri, disabling logging. None of it has gotten the requests to process at the speed I need.
Thank you to anyone who took the time to read this. I am still learning and have tried to add as much context as possible; if you need more, I will gladly respond. I just can't wrap my head around this issue.
You would need to take a systematic approach to figuring out:
How many requests per second you can handle
What your app's bottlenecks and scaling factors are
CloudBees has written a great article on performance tuning for uWSGI + Flask + nginx.
To give an overview of the steps to tune your service here is what it might look like:
First, you need to make sure you have the required tooling, particularly a benchmarking tool like Apache Bench, k6, etc.
Establish a baseline. This means that you configure your application with the minimum setup to run, i.e. a single process and a single thread, no multi-threading (see the sketch after this list). Run the benchmark and record the results.
Start tweaking the setup. Add threads, processes, etc.
Benchmark after the tweaks.
Repeat steps 2 & 3 until you see the upper limits, and understand the service characteristics - are you CPU/IO bound?
Try changing the hardware/vm, as some offerings come with penalties in performance due to shared CPU with other tenants, bandwidth, etc.
Tip: Try to run the benchmark tool from a different system than the one you are benchmarking, since it also consumes resources and loads the system further.
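As a concrete starting point for the baseline step above, here is a minimal load-test sketch in Python, assuming the requests package is installed; the URL, secret, and payload shape are placeholders, and a dedicated tool such as Apache Bench or k6 will give you more reliable numbers.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost/api"        # placeholder endpoint under test
HEADERS = {"authorization": "..."}  # placeholder secret
PAYLOAD = {"payload": {"object": {"caller": {"name": "test"}}}}  # shape guessed from the handler
N_REQUESTS = 200
CONCURRENCY = 10

def hit(_):
    # Send one POST and measure its wall-clock latency.
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, headers=HEADERS, timeout=10)
    return resp.status_code, time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(hit, range(N_REQUESTS)))
elapsed = time.perf_counter() - start

latencies = sorted(t for _, t in results)
print("requests/sec: %.1f" % (N_REQUESTS / elapsed))
print("p50 latency:  %.1f ms" % (latencies[len(latencies) // 2] * 1000))
print("p95 latency:  %.1f ms" % (latencies[int(len(latencies) * 0.95)] * 1000))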
In your code sample you have two methods, create_json(name, data) and check_callee(request.json); do you know their performance characteristics?
Note: Can't comment so had to write this as an answer.

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow env in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want and dump the filtered results to S3 bucket B. It needs to read every minute since the data is coming in every minute. Every run processes about 200MB of json data.
My initial setup used env class mw1.small with 10 worker machines. If I run the task only once in this setting, it takes about 8 minutes to finish each run, but when I start the schedule to run every minute, most runs cannot finish, start to take much longer (around 18 minutes), and display this error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried expanding the env class to mw1.large with 15 workers. More jobs were able to complete before the error showed up, but it still could not keep up with the speed of ingesting every minute. The Negsignal.SIGKILL error would still show up before even reaching the worker machine maximum.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
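For reference, here is a minimal sketch of the kind of DAG described above, assuming Airflow 2.x on MWAA with boto3 available on the workers; the bucket names, keys, and filter predicate are placeholders rather than the actual code.

import json
from datetime import datetime, timedelta

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

def filter_and_copy(**context):
    # Read the latest batch from bucket A, keep only the wanted records,
    # and write the filtered result to bucket B.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="bucket-a", Key="incoming/latest.json")  # placeholder key
    records = json.loads(obj["Body"].read())
    kept = [r for r in records if r.get("keep")]  # placeholder filter predicate
    s3.put_object(
        Bucket="bucket-b",
        Key="filtered/%s.json" % context["ts_nodash"],
        Body=json.dumps(kept),
    )

with DAG(
    dag_id="filter_s3_every_minute",
    start_date=datetime(2021, 9, 1),
    schedule_interval="* * * * *",      # run every minute
    catchup=False,
    max_active_runs=1,                  # avoid piling up overlapping runs
    dagrun_timeout=timedelta(minutes=10),
) as dag:
    PythonOperator(task_id="filter_and_copy", python_callable=filter_and_copy)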
I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This will make sure your worker machine runs 1 job at a time, preventing multiple jobs from sharing the worker, hence saving memory and reducing runtime.

Verify whether a VM is running or stopped, and for how long, using Kusto Query Language

I am new to Kusto Query Language. I need some help: how can I check how long a VM has been shut down, and, if it is running, how long it has been running? Can you please help me with that? I am just starting to learn Kusto Query Language.
When the VM is stopped, events named "Deallocate Virtual Machine" are sent to the AzureActivity table. When the VM is started, events named "Start Virtual Machine" are sent to the AzureActivity table.
So it's easy to find out whether the VM is running or stopped with the query below (in Azure Monitor -> Logs):
AzureActivity
| where OperationName in ("Deallocate Virtual Machine","Start Virtual Machine")
| project TimeGenerated,OperationName
| top 1 by TimeGenerated desc
If the query result contains Deallocate Virtual Machine, the VM is in the stopped state; otherwise, it is running.
Next, since we know the VM status (for example, the VM is in the stopped state), we can write a query to calculate how long it has been stopped. To do that, we subtract the time when the VM was stopped from the current time. The query looks like this:
let stop_time = AzureActivity
| where OperationName == "Deallocate Virtual Machine"
| project TimeGenerated
| top 1 by TimeGenerated desc;
AzureActivity
| extend the_time = now() - toscalar(stop_time)
| project the_time
| top 1 by the_time
You can also modify the query above to calculate the running time if the VM is currently in the running state.

How to improve web-service API throughput?

I'm new to creating web services, so I'd like to know what I'm missing on the performance side (assuming I'm missing something).
I've built a simple Flask app. Nothing fancy; it just reads from the DB and responds with the result.
uWSGI is used for the WSGI layer. I've run multiple tests and set processes=2 and threads=5 based on performance monitoring.
processes = 2
threads = 5
enable-threads = True
AWS ALB is used as the load balancer. The uWSGI + Flask app is dockerized and launched in ECS (3 containers, 1 vCPU each).
For each DB hit, the Flask app takes 1-1.5 s to get the data. There is no other lag on the app side. I know it can be optimised, but assuming that the request processing time takes 1-1.5 s, can the throughput be increased?
The throughput I'm seeing is ~60 requests per second. I feel it's too low. Is there any way to increase the throughput with the same infra?
Am I missing something here, or is the throughput reasonable given that the DB hit takes 1.5 s?
Note : It's synchronous.

Simple HelloWorld app on Cloud Run (or Knative) seems too slow

I deployed a sample HelloWorld app on Google Cloud Run, which is basically Knative, and every call to the API takes 1.4 seconds at best, end-to-end. Is it supposed to be so?
The sample app is at https://cloud.google.com/run/docs/quickstarts/build-and-deploy
I deployed the very same app on my localhost as a docker container and it takes about 22ms, end-to-end.
The same app on my GKE cluster takes about 150 ms, end-to-end.
import os

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    target = os.environ.get('TARGET', 'World')
    return 'Hello {}!\n'.format(target)

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
I have a little experience with FaaS and I expected API calls to get faster as I invoked them in a row (cold start vs. warm start).
But no matter how many times I execute the command, it doesn't go below 1.4 seconds.
I don't think network distance is the dominant factor here; the round-trip time via ping to the API endpoint is only about 50 ms.
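For reference, here is a minimal sketch of the kind of repeated end-to-end measurement described above, assuming the requests package; the service URL is a placeholder for the actual Cloud Run URL.

import time

import requests

URL = "https://helloworld-xxxxxxxx-uc.a.run.app/"  # placeholder Cloud Run URL

# Time several back-to-back requests so the cold start (first call) can be
# compared against warm invocations (subsequent calls).
for i in range(10):
    start = time.perf_counter()
    resp = requests.get(URL, timeout=30)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print("request %d: HTTP %d in %.0f ms" % (i, resp.status_code, elapsed_ms))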
So my questions are as follows:
Is it potentially an unintended bug? Is it a technical difficulty that will be resolved eventually? Or maybe nothing's wrong and this is just the SLA of Knative?
If nothing's wrong with Google Cloud Run and/or Knative, what is the dominant time-consuming factor here for my API call? I'd love to learn the mechanism.
Additional Details:
Where I am located: Seoul/Asia
The region for my Cloud Run app: us-central1
type of Internet connection I am testing under: Business, Wired
app's container image size: 343.3MB
the bucket location that Container Registry is using: gcr.io
WebPageTest from Seoul/Asia (warmup time):
Content Type: text/html
Request Start: 0.44 s
DNS Lookup: 249 ms
Initial Connection: 59 ms
SSL Negotiation: 106 ms
Time to First Byte: 961 ms
Content Download: 2 ms
WebPageTest from Chicago/US (warmup time):
Content Type: text/html
Request Start: 0.171 s
DNS Lookup: 41 ms
Initial Connection: 29 ms
SSL Negotiation: 57 ms
Time to First Byte: 61 ms
Content Download: 3 ms
ANSWER by Steren, the Cloud Run product manager:
We have detected high latency when calling Cloud Run services from some particular regions in the world. Sadly, Seoul seems to be one of them.
[Update: This person has a networking problem in his area. I tested his endpoint from Seattle with no problems. Details in the comments below.]
I have worked with Cloud Run constantly for the past several months. I have deployed several production applications and dozens of test services. I am in Seattle, Cloud Run is in us-central1. I have never noticed a delay. Actually, I am impressed with how fast a container starts up.
For one of my services, I am seeing cold start time to first byte of 485ms. Next invocation 266ms, 360ms. My container is checking SSL certificates (2) on the Internet. The response time is very good.
For another service which is a PHP website, time to first byte on cold start is 312ms, then 94ms, 112ms.
What could be factors that are different for you?
How large is your container image? Check Container Registry for the size. My containers are under 100 MB. The larger the container the longer the cold start time.
Where is the bucket located that Container Registry is using? You want the bucket to be in us-central1 or at least in the US. This will change soon when new Cloud Run regions are announced.
What type of Internet connection are you testing under? Home or business? Wireless or Ethernet? Where in the world are you testing from? Launch a temporary Compute Engine instance, repeat your tests against Cloud Run, and compare. This will remove your ISP from the equation.
Increase the memory allocated to the container. Does this affect performance? Python/Flask does not require much memory; my containers are typically 128 MB and 256 MB. Container images are loaded into memory, so if you have a bloated container, you might not have enough memory left, reducing performance.
What does Stackdriver logs show you? You can see container starts, requests, and container terminations.
(Cloud Run product manager here)
We have detected high latency when calling Cloud Run services from some particular regions in the world. Sadly, Seoul seems to be one of them.
We will explicitly capture this as a known issue, and we are working on fixing it before General Availability. Feel free to open a new issue in our public issue tracker.