I have a FastCGI app handling several hundred requests per second. Most requests finish in a millisecond or less, but some take more than half a second. By outputting some extra timing information, I was able to pin the problem down to the call to FCGX_Finish_r.
As far as I can tell, this call flushes all data written out to Apache (mod_fastcgi). But why is it blocking?
I tried narrowing it down to the response size: although the longest response durations (around 2 seconds) occur with the largest responses (around 120 KB), those same large responses can also finish in as little as 16 ms.
-- EDIT --
I ran some more numbers, and the bad timings seem to occur once the responses grow beyond 16 KB. Beyond that point, though, there is no correlation between the size and the duration of the slow responses.
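For reference, the timing is measured around the finish call in a standard libfcgi accept loop along these lines (a minimal sketch - the response-building part is illustrative):

    #include <fcgiapp.h>
    #include <chrono>
    #include <cstdio>

    int main() {
        FCGX_Init();
        FCGX_Request request;
        FCGX_InitRequest(&request, 0, 0);            // listen on the default FastCGI socket

        while (FCGX_Accept_r(&request) == 0) {
            FCGX_FPrintF(request.out, "Content-Type: text/plain\r\n\r\n");
            // ... write the response body (up to ~120 KB) to request.out ...

            auto t0 = std::chrono::steady_clock::now();
            FCGX_Finish_r(&request);                 // <-- the call that sometimes blocks
            auto t1 = std::chrono::steady_clock::now();

            long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
            std::fprintf(stderr, "FCGX_Finish_r took %lld ms\n", ms);
        }
        return 0;
    }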
I am sending around 20,000 messages per second; these can come from a number of arbitrary threads (the messages are processed before being sent to MQTTnet).
I have found that the fewer the threads, the better the performance; going over 16 simultaneous senders causes MQTTnet to grind to a halt, even at 10k messages per second.
It is not the threads that are slow: I poll the MQTTnet managed client buffer size every 10 seconds and see it increasing to the point where it becomes full (at the limit that I have set).
This is with the most recent code version, and it is something I first noticed a number of months ago (as of August 2020). It was highlighted by my recent ThreadRipper system upgrade: my code creates a number of threads equal to the environment processor count, so the same code base went from 8 threads on the previous hardware to 48 on the new hardware, and that caused the "failure".
48 decode/send threads caused MQTTnet to grind to a halt, whereas 4 to 8 threads were fine and performant. I can see the speed on the NIC to the MQTT server drop from 8 Mbps (with 4 to 8 sending threads) to less than 100 kbps when higher thread counts are used.
Local or remote MQTT server makes no difference - as mentioned, I can see the send buffer within MQTTnet increase, and it will do so until memory exhaustion unless a limit is set (either way, it drops messages once under duress from the threads in this state).
Note that in both cases the total number of messages being sent per second remained the same - the only variable was the number of worker threads the messages were being sent from.
Is this a bug, or something I am doing wrong? Should I create my own queue to front-end the managed client and dispatch one message at a time? (I don't want to reinvent the wheel, but I want to ensure I am using the library correctly.)
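If a front-end queue does turn out to be the right approach, the pattern is a multi-producer / single-consumer funnel, so only one thread ever hands messages to the client. MQTTnet itself is a .NET library, so the C++ sketch below only illustrates the structure; all names are placeholders, and publish_to_client stands in for the actual MQTTnet enqueue call:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>

    // Multi-producer / single-consumer funnel: many decode/send threads push,
    // one dispatcher thread pops and is the only thread that touches the client.
    class MessageFunnel {
    public:
        void push(std::string msg) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                queue_.push(std::move(msg));
            }
            cv_.notify_one();
        }

        // Runs on the single dispatcher thread.
        template <typename Publish>
        void run(Publish publish_to_client) {
            for (;;) {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                std::string msg = std::move(queue_.front());
                queue_.pop();
                lock.unlock();
                publish_to_client(msg);   // serialized hand-off to the MQTT client
            }
        }

    private:
        std::queue<std::string> queue_;
        std::mutex mutex_;
        std::condition_variable cv_;
    };

Producer threads just call push(); one dedicated thread runs run() with a callable that forwards to the real client, so the client never sees more than one concurrent caller.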
I have found this seems to be related to running with the debugger attached versus starting without debugging - starting without debugging is orders of magnitude faster and can scale all the way to 48 threads (as per the environment processor count) without any issue and without any queuing whatsoever.
Strange, as the message volume is the same in both cases; the only difference is the thread count as mentioned (and even with the debugger attached, 8 threads can keep up without issue).
There seems to be an overhead when debugging with multiple sending threads - which may be obvious, but I couldn't find anywhere this was called out.
We have a very large database with around 300,000 URLs. These URLs have to be pinged to get data from them (the URLs are radio stations playing songs; the data we want is the metadata).
Some of them are active at times and inactive at others.
At any given time, around 80,000 are active. Some respond slowly, some respond quickly. I have a server, and I am thinking of doing this in C++.
My goal is to ping and parse (or crawl) them within 1 minute and keep repeating the process, because the information (the song playing on them) can change over time, mostly every 2-7 minutes. But I am not sure if this is possible.
What should my approach be?
I have thought of creating two programs: one to test whether a URL is active or not, run twice a day, and to record how long it generally takes to respond - whether it is usually slow, or just responding more slowly right now.
The other would do the actual crawling, where the fastest URLs are crawled first, with some dedicated threads for URLs which respond faster.
I would love better ideas or solutions for this. Can anyone tell me how to do the maths to work out the number of dedicated threads I should allot to each group to get the results in the least amount of time?
CPU performance is not your bottleneck at the moment; what you need to avoid is stalling on the network layer: if the request timeout is 60 seconds, you have 16 threads, and you hit 16 very slow servers (which will all time out eventually), you are stalled for 60 seconds and not processing anything else.
So I would start with, say, 500 threads (and a 15-30 s timeout, if you know the very slow radios can still fit within that), keep some statistics about their turnaround, and keep adding more worker threads dynamically for every request which didn't get a response within 2-3 seconds. 80,000/500 = 160, so each "normally quick" worker thread then has to ping around 160 URLs; if each one takes 2 seconds, that's still 320 s ≈ 5 min! So 500 sounds like a minimum.
That said, having 500+ threads will somewhat burden CPU and memory (not sure how much; with a decent thread/memory model implementation, 500 doesn't sound like much for a modern x86 CPU with GBs of RAM - even 5000 still sounds reasonable), but I would worry a lot more about the network layer and about possible firewalls along the way: you need server-grade networking for that volume of requests (if I tried something like this from home, my own router would filter me out with default settings, detecting it as some kind of DoS attack).
So gather some statistics on how long a request takes on average, then take your target time (2-7 min) and divide the number of URLs by the number each thread can handle in that time: with an average ping of 5 s and a round time of 3 min, that's 300,000/(3*60/5) = 8,333.33, so at least ~8,334 threads are needed. Then you will have to profile your app to verify that with 8,000+ threads it will not choke on something else, and that it really handles the task as expected.
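The same back-of-the-envelope calculation in code, so you can plug in your own measured numbers (the values below are just the example figures from above):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double urls = 300000;            // total URLs to check per round
        const double avg_response_s = 5.0;     // measured average response time
        const double target_round_s = 3 * 60;  // desired time for one full round

        // Each thread can handle target_round_s / avg_response_s URLs per round,
        // so the minimum thread count is the total divided by that.
        const double urls_per_thread = target_round_s / avg_response_s;  // 36
        const double min_threads = std::ceil(urls / urls_per_thread);    // 8334

        std::printf("URLs per thread per round: %.0f\n", urls_per_thread);
        std::printf("Minimum threads needed:    %.0f\n", min_threads);
        return 0;
    }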
(The other option is to fire asynchronous HTTP requests from a single thread, but that sort of approach creates its own threads for each task anyway, so I would rather manage the threads myself and use synchronous HTTP calls.)
And thinking about the dynamic-growth mechanics: you can keep counters of how many new requests were added in the last second and how many finished (either responded or failed); after a few seconds of running, these should start to form a kind of "throughput" statistic, and if the throughput is below the desired threshold, you can add more threads.
Regarding active/inactive: keep the response time/last-seen/last-check together with the URL, and add some further logic to check a URL only when it makes sense (e.g. not within the next 60 s if it just responded, or check an inactive one only 6 h after the last test). You also need to avoid checking the same URL in two different threads at the same time, so some central manager code should feed the threads with targets (maybe a thread-safe FIFO queue... in fact you can use its size to estimate how well the worker threads are keeping up, and add more threads when you see the queue is not emptying fast enough - that avoids adding the statistics code to the threads themselves). A rough sketch of that manager/worker setup follows.
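This is only an illustration of the structure, not production code: check_url stands in for the actual HTTP ping/parse, and the pool sizes and thresholds are made-up numbers.

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    std::queue<std::string> work;            // URLs waiting to be checked
    std::mutex m;
    std::condition_variable cv;

    void check_url(const std::string& url) { /* synchronous HTTP ping + metadata parse */ }

    void worker() {
        for (;;) {                           // workers run until the process exits
            std::string url;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [] { return !work.empty(); });
                url = std::move(work.front());
                work.pop();
            }
            check_url(url);                  // each URL is handled by exactly one thread
        }
    }

    int main() {
        std::vector<std::thread> workers;
        for (int i = 0; i < 500; ++i)        // initial pool
            workers.emplace_back(worker);

        for (;;) {                           // manager loop
            // ... push URLs that are due for a check (respect last-seen/last-check) ...
            cv.notify_all();
            std::this_thread::sleep_for(std::chrono::seconds(5));

            std::lock_guard<std::mutex> lock(m);
            if (work.size() > 10000 && workers.size() < 8000)  // queue not draining fast enough
                for (int i = 0; i < 100; ++i)
                    workers.emplace_back(worker);
        }
    }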
I have a multi-threaded C++ network application which listens for UDP packets as input; this data then hops through the application, is processed via various queues and threads, and is finally pushed out on a TCP socket.
What I am seeing is that if inputs come in slowly, say 5/sec, the total response time (in-to-out) is slow (say 100 ms), and if the inputs come in fast, e.g. 20/sec, the response time is also fast (~50 ms). This observation is really weird, and it also messes up the response times in the fast case because the first response is always slow. Just to be clear: the application is doing exactly the same amount of work in both the slow and the fast cases.
Things that have been tried to investigate this:
It's a dual-Xeon box running a Linux 2.6 kernel - Turbo Boost is disabled, and I made sure the processors are in C0 states.
Network causes have been eliminated; the root cause is within the box (in software or hardware).
I have a fake input going through the system from input to output on a timer to keep the application "warm" - no effect. (The application's worker threads are busy-waiting and pinned to cores.)
perf data points indicate that EVERYTHING gets slower - which would basically mean the processors are slowing down when not under continuous load - but nothing else suggests that (i7z/turbostat), or I am reading them incorrectly.
Does anyone have insight into what might be happening?
In my experience, this would be the nasty doings of Nagle's devilish device. It sounds extremely plausible to me that with more data, the buffer fills up a full TCP packet, which then gets sent immediately. Without much data, the TCP packet sits waiting for the ACK from the other side, which is, as we all know, delayed.
Solution: make it a habit to disable Nagle's algorithm as the first thing you do after creating any send-capable TCP socket.
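Concretely, that means setting TCP_NODELAY on the sending socket right after you create it - a minimal sketch using the POSIX socket API:

    #include <netinet/in.h>
    #include <netinet/tcp.h>   // TCP_NODELAY
    #include <sys/socket.h>
    #include <cstdio>

    // Disable Nagle's algorithm so small writes are sent immediately
    // instead of waiting for the previous segment's ACK.
    bool disable_nagle(int sock_fd) {
        int flag = 1;
        if (setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) != 0) {
            std::perror("setsockopt(TCP_NODELAY)");
            return false;
        }
        return true;
    }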
I have an iOS app that inserts a record into a MySQL database via a Python/Django server, and then immediately queries that information back.
I am using [NSURLConnection sendAsynchronousRequest:queue:completionHandler:] with a processing block to parse data from a server JSON object.
The insert and the query are run from separate sendAsynchronousRequest calls, but the latter is invoked from the completionHandler block of the former, preventing any race conditions.
On the simulator, this works perfectly and returns the inserted data 100% of the time.
On a physical iPad running iOS 7, the response never contains the just-inserted data.
If I insert a long pause between the two sendAsynchronousRequests using [performSelector:withObject:afterDelay:], it does eventually work - but I need to add a delay of 2-3 seconds.
On the simulator, the timing between the insert and the correct query response is less than 500 ms, so such a long delay should not be necessary, and the difference does not seem to be explained by slower execution on the simulator.
I have tried Cache-Control: max-age=0 and Expires headers in my server code, and neither makes a difference, so I don't currently believe it is an NSURLConnection caching issue.
What else am I overlooking here? Why does this work so perfectly on the Simulator but not on a physical iPad?
Thanks!
I finally figured this out.
My server code was filtering results based on a calculated time period which ended "now". The clock on my iPad was a few seconds ahead of the server, while my Mac's clock was in sync with the server. As a result, the records posted from the iPad were being filtered out.
I know this depends on the box hardware, but for example if 100 processes are configured, the default queue size is also 100. Does it make sense to increase PassengerMaxRequestQueueSize to 200 or 300? Probably this depends on free memory. Thoughts?
The best answer would explain the setting and give one or two examples, assuming the server processes a request in 2-3 seconds.
Thanks in advance!
Why you should limit queuing
Any requests that aren't immediately handled by an application process are queued. Queuing is usually bad: it often means that your server cannot handle requests quickly enough.
A larger queue means that requests are less likely to be dropped. But this comes with a drawback: during busy times, the larger the queue, the longer your visitors have to wait before they see a response. This causes them to click reload, making the queue even longer (their previous request will stay in the queue; the OS does not know that they've disconnected until it tries to send data back to the visitor), or causes them to leave in frustration.
So having a limit on the queue is a good thing. It limits the impact of the above situation.
You should ensure that requests are queued as little as possible. That could mean:
Making your app faster (if your workload is CPU bound).
Upgrading to faster hardware (if your workload is CPU bound).
Increasing your app's concurrency settings (if your workload is I/O bound), e.g. by increasing the number of processes or threads.
If you cannot prevent requests from being queued, then the next best thing to do is to keep the queue short, and to display a friendly error message upon reaching the queue limit. Something like, "We're sorry, a lot of people are visiting us right now. Please try again later." The documentation for PassengerMaxRequestQueueSize tells you how to do that.
Optimal value for the queue size
It's hard to say what the optimal queue size should be. A good rule of thumb is: set the request queue size to the maximum number of requests you can handle in one second. Depending on your situation you may have to tweak things a little bit.
This rule of thumb comes from the notion of expected burst traffic. How many simultaneous requests do you expect on your server?
Suppose that your queue size is 100, and that for whatever reason you receive 150 requests at the same time. Suppose that your server is fast enough to handle 150 requests in half a second, so you know it's not a performance problem. But if you have a request queue size of 100, then 50 of those requests will be dropped with a "Request queue full" error.
In such a situation, you should set the queue size to the maximum number of concurrent requests that you think you can safely handle without performance issues.
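For the burst example above, that would mean something like the following in your Apache configuration (150 is just the illustrative burst size from this example):

    PassengerMaxRequestQueueSize 150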
This SO question and the Passenger docs here talk more about working with this. If you want more information about why this is happening on your server you can try running passenger-status (usually you need to run this as root).
If you would like to set a custom error page when visitors see this issue you can use the following (in Apache) to set a custom error page:
PassengerErrorOverride on
ErrorDocument 503 /error503.html
As mentioned by Hongli, you can also change the PassengerMaxRequestQueueSize setting to a higher number to queue more requests. You can also set it to 0 to disable the limit (for most situations this is not an optimal solution, however).
For reference, the default error message a visitor to your site will see when bumping against this limit is:
This website is under heavy load
We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.