Scalable Inference Server for Object Detection - django

I have created a Django service (nginx + Gunicorn) for object detection models.
In my case I have 50+ models with a ResNet-50-based backbone.
Server machine specification:
16 CPUs
64 GB RAM
I have pre-loaded all the models in my service. I am running 20 inference requests in parallel.
The issue I am facing is that Gunicorn intermittently restarts one of its 8 workers while inference is running, and this is not due to a timeout. The restarted worker then reloads the models, and my inference requests fail in the meantime.
I have noticed that this might be because all of the server's CPUs are at 100% utilisation.
Can you please suggest a solution, or another way to run inference as a service?
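A rough way to sanity-check the setup described above is a back-of-the-envelope budget for memory and CPU threads. This is only a sketch under stated assumptions: the per-model footprint (`model_mb`) and the number of compute threads each inference uses (`threads_per_inference`) are hypothetical placeholders that would have to be measured on the real service, and it assumes that every Gunicorn worker holds its own copy of all models and handles one request at a time. If the estimate exceeds the machine's RAM, or the thread count far exceeds the CPU count, a worker being killed and restarted under load would not be surprising, though nothing in the question confirms that this is the cause.

```python
def resource_budget(workers, models_per_worker, model_mb,
                    parallel_requests, threads_per_inference,
                    cpus, ram_gb):
    """Rough memory/CPU budget for a multi-worker inference service."""
    # Memory needed if each worker process keeps a private copy of every model.
    total_model_gb = workers * models_per_worker * model_mb / 1024
    # Assumption: one request per worker at a time, so at most `workers`
    # inferences run concurrently, however many requests are in flight.
    concurrent = min(parallel_requests, workers)
    busy_threads = concurrent * threads_per_inference
    return {
        "model_memory_gb": total_model_gb,
        "memory_fits": total_model_gb <= ram_gb,
        "busy_threads": busy_threads,
        "cpu_oversubscription": busy_threads / cpus,
    }


# From the question: 8 workers, 50 models, 20 parallel requests,
# 16 CPUs, 64 GB RAM. model_mb=200 and threads_per_inference=16 are
# hypothetical placeholders, not measurements.
budget = resource_budget(workers=8, models_per_worker=50, model_mb=200,
                         parallel_requests=20, threads_per_inference=16,
                         cpus=16, ram_gb=64)
print(budget)
```

Replacing the two placeholder values with measured ones (for example, the resident memory of one worker after loading its models) turns this into a quick check of whether the worker count is realistic for the machine.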

Related

How to solve high request latency on Cloud Run (nearly 4 minutes)

Hi everyone, at this point I am at a loss. I am in the process of bringing our application up in Cloud Run. I did have a little bit of worry about running "serverless" for this application stack. I have one frontend NextJS application and one backend GraphQL (Node) application. I use schema stitching against the backend to connect to our managed CMS service.
I bumped the max number of instances of the Serverless VPC Access connector, and this really helped: slow requests went from 5 in 6 down to 1 in 6.
Right now, 1 in 6 requests is very slow (nearly 4 minutes to resolve). The slow request is the one where the browser asks the API to return information from the CMS plus some information about that CMS data from our internal DB. I have this same application running on Heroku with fewer resources and it runs smoothly, which to me rules out a code issue.
Areas I have checked:
Cloud SQL CPU and connection pool (very low)
Container Image size (small as a node app can be)
Keeping min and max number of containers at the same number
Allocating CPU
I am at the point of thinking the backend application is just not a proper fit for Cloud Run, but I do like the simplicity of getting applications running in Cloud Run and its approach to serverless.
Any help appreciated

Django application with ASGI Uvicorn increasing the latencies by 4x

We are using a Django application (https://github.com/saleor/saleor) to handle our e-commerce use cases. We are using ASGI with Uvicorn in production with 4 workers. Infra setup:
4 instances of 4 core 16 GB machines for hosting the Django application (Saleor).
The app is deployed using docker on all the instances.
2 instances of 4 core 16 GB for Celery.
Hosted PostgreSQL solution with one primary and one replica.
Saleor uses Django and Graphene to implement GraphQL APIs. One of the APIs is createCheckout, which takes around 150 ms to 250 ms depending on the payload entities. When running a load test with 1 user, the API consistently gives similar latencies. When the number of concurrent users increases to 10, the latencies increase about 4 times (1 sec - 1.3 secs). With 20 users, they reach more than 10 seconds.
Average CPU usage doesn't exceed 60%. While tracing the latencies, we found that the core APIs take no more than 150-250 ms even with 20 users making concurrent requests. This means that all the additional latency is being added at the ASGI + Uvicorn layer.
Not sure what we are missing here. From a deployment perspective, we have followed the standard Django + ASGI + Uvicorn setup for production. Any help or suggestions in this regard would be appreciated.
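The pattern described above, where handler time stays flat while end-to-end latency grows with concurrency, is what queueing in front of the handler looks like. A minimal sketch, assuming all requests arrive at once and only a fixed number of them (`lanes`, a hypothetical name) can execute at the same time. It illustrates the effect only and is not a model of Uvicorn's internals; the 200 ms service time is simply a value picked from the 150-250 ms range in the question.

```python
import heapq


def queued_latencies(n_requests, service_ms, lanes):
    """End-to-end latency (ms) of each request when n_requests arrive
    together and at most `lanes` of them can run concurrently."""
    free_at = [0] * lanes          # time at which each lane becomes free
    heapq.heapify(free_at)
    latencies = []
    for _ in range(n_requests):
        start = heapq.heappop(free_at)   # earliest available lane
        finish = start + service_ms      # handler time itself never changes
        heapq.heappush(free_at, finish)
        latencies.append(finish)         # arrival time is 0, so latency == finish
    return latencies


for lanes in (1, 4):
    lat = queued_latencies(n_requests=10, service_ms=200, lanes=lanes)
    print(lanes, "lane(s): mean", sum(lat) / len(lat), "ms, max", max(lat), "ms")
```

Comparing traced handler time with client-side latency in this way helps separate time spent waiting for a free lane from time spent doing work.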
We had similar issues when the Saleor setup did not include the Celery runner. Can you make sure that Celery is connected via Redis and that it processes requests for every checkout as expected? If not, Saleor cannot run any async tasks and tries to run them synchronously, adding a lot of latency.

504 gateway timeout for any requests to Nginx with a lot of free resources

We have been maintaining a project internally that has both web and mobile application platforms. The backend of the project is developed in Django 1.9 (Python 3.4) and deployed in AWS.
The server stack consists of Nginx, Gunicorn, Django and PostgreSQL. We use a Redis-based cache server to serve resource-intensive queries. Our AWS resources include:
t1.medium EC2 (2 core, 4 GB RAM)
PostgreSQL RDS with one additional read-replica.
Right now Gunicorn is set to create 5 workers (following the 2*n+1 rule). Load-wise, there are about 20-30 mobile users making requests every minute and 5-10 users checking the web panel every hour. So I would say the load is not very high.
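The 2*n+1 rule mentioned above can be written down directly in a Gunicorn configuration file, which is plain Python. A minimal sketch, assuming n is the number of CPU cores; the helper name `workers_for` is made up for illustration, while `workers` is the Gunicorn setting for the number of worker processes.

```python
import multiprocessing


def workers_for(cores):
    """Commonly quoted Gunicorn sizing rule: (2 x cores) + 1."""
    return 2 * cores + 1


# Gunicorn reads the module-level name `workers` from its config file.
workers = workers_for(multiprocessing.cpu_count())
```

On the 2-core instance from the question this gives 5 workers, which matches the setup described.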
Now this setup works alright on about 80% of days. But when something goes wrong (for example, we detect a bug in the live system and have to switch off the server for maintenance for a few hours; in the meantime, the mobile apps build up a queue of requests, so when we bring the backend back up, a lot of users hit the system at the same time), the server stops behaving normally and starts responding with 504 Gateway Timeout errors.
Surprisingly, every time this happened, we found the server resources (CPU, memory) to be 70-80% free and the database connection pool mostly free.
Any idea where the problem is? How to debug? If you have already faced a similar issue, please share the fix.
Thank you,

How does Heroku determine the number of web processes to run per dyno?

I'm using Heroku to host a Django application, and I'm using Waitress as my web server.
I run 2 (2X) dynos, and I see in the New Relic Instances tab that I have 10 instances running.
I was wondering how Heroku determines the number of web server processes to run on one dyno when using Waitress.
I know that when using Gunicorn there is a way to set the number of processes per dyno, but I didn't see any way to define it in Waitress.
Thanks!
In Waitress, there is a master process and (by default) 4 worker threads or processes. You can change this if you wish. Here are the docs for these options for waitress-serve:
http://waitress.readthedocs.org/en/latest/runner.html#runner
--threads=INT
Number of threads used to process application logic, default is 4.
So if you have 2 dynos, and 5 (4+1) threads on each, then the total would come to 10 instances for this app in the RPM dashboard.
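The thread count quoted above can also be set when starting Waitress from Python rather than through `waitress-serve`. A minimal sketch: `waitress.serve` and its `host`, `port` and `threads` arguments are part of Waitress, while the `WAITRESS_THREADS` environment variable and the `serve_kwargs` helper are hypothetical names used only for this example. Heroku passes the port to bind in the `PORT` environment variable.

```python
import os


def serve_kwargs(environ=None):
    """Build keyword arguments for waitress.serve from the environment."""
    environ = os.environ if environ is None else environ
    return {
        "host": "0.0.0.0",
        "port": int(environ.get("PORT", "8000")),
        # WAITRESS_THREADS is a made-up variable; 4 is Waitress's default.
        "threads": int(environ.get("WAITRESS_THREADS", "4")),
    }


def main(app):
    # Imported here so the helper above works without Waitress installed.
    from waitress import serve
    serve(app, **serve_kwargs())
```

Here `app` stands for the project's WSGI application object, for example the one returned by Django's `get_wsgi_application()`.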
One can add more processes to the dynos, as the maximum supported on Heroku 2X dynos is much higher:
2X dynos support no more than 512
https://devcenter.heroku.com/articles/dynos#process-thread-limits
But, you may want to check out some discussion on tuning this vs Gunicorn:
Waitress differs, in that it has an async master process that buffers the
entire client body before passing it onto a sync worker. Thus, the server
is resilient to slow clients, but is also guaranteed to process a maximum
of (default) 4 requests at once. This saves the database from overload, and
makes scaling services much more predictable.
Because waitress has no external dependencies, it also keeps the heroku
slug size smaller.
https://discussion.heroku.com/t/waitress-vs-gunicorn-for-docs/33
So after I talked to New Relic support, they clarified the issue.
Apparently only processes are counted in the Instances tab (threads do not count).
In my Procfile I am also monitoring RabbitMQ workers, which add instances to the Instances tab, hence the mismatch.
To quote their answer:
I clarified with our developers how exactly we measure instances for the Instances tab. The Python agent views each monitored process as one instance. Threads do not count as additional instances. I noticed that you're monitoring not only your django/waitress app, but also some background tasks. It looks like the background tasks plus the django processes are adding up to that total of 10 processes being monitored.

Django app freezing with a few concurrent requests

I have a django app without views, I only use it to provide a REST API using django-piston package.
Since I deployed it to Amazon EC2 with mod_wsgi, it freezes after some requests, and CPU usage goes to 100%, split between the python and httpd processes.
I'm using Postgres 8.4, Python 2.5 and Django 'ENGINE': 'django.contrib.gis.db.backends.postgis'.
Logs don't show me any problem. How can I debug the problem?
Sounds like you're on a micro instance. Micro instances are able to burst large amounts of CPU for a VERY short amount of time; after that they must drop to very low background levels for an extended duration or else Amazon will harshly throttle them. If you're getting concurrent requests, most likely even a lightly CPU-intensive app would cause the throttling to kick in.
Micro instances are only usable for very, very light traffic on something like a very basic blog, and that's about it.
Their user guide goes into this in detail: Micro Instance guide.