Multi threaded curl handling multiple connections at the same time

Does the curl_multi interface spawn new threads internally to handle multiple requests concurrently? Is it equivalent to spawning threads manually and just using curl_easy handles? Which is more performant? I need to make up to 1000 concurrent requests.
https://curl.haxx.se/libcurl/c/multithread.html
Is using curl_multi equal to the example in the link above?
From: https://curl.haxx.se/libcurl/c/libcurl-multi.html
Enable multiple simultaneous transfers in the same thread without making it complicated for the application.
What does it mean? How does it handle multiple transfers in the same thread? I could as well create 100 threads with 100 curl_easy handles and make requests there.
Maybe the question should be: When to use multiple threads and when to use curl_multi.

There's no easy or simple answer. libcurl allows you and your application to make the decision and supports working in either mode.
The libcurl multi interface is a single-core, single-thread way to do a large number of parallel transfers in the same thread. It allows for easy reuse of caches, connections and more. That has clear advantages, but it can leave the transfers CPU-bound on that single CPU.
Doing multi-threaded transfers gives each thread/handle its own cache, connection pool and so on, which changes when those are useful, but it makes the transfers less likely to be CPU-bound because you can spread them out over a larger set of cores/CPUs.
Which design is the right one for you is not easy for us to tell.

No, all the connections on a single curl_multi handle operate on the same thread. It uses a single select/poll/epoll event loop and non-blocking sockets to process all the connections concurrently on the same thread.
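As a rough illustration of the single-threaded event loop described above, here is a minimal sketch using plain POSIX poll() on non-blocking descriptors. It is not libcurl's code or API: drain_all is a hypothetical helper, and any readable descriptors (pipes, sockets) can stand in for the transfers. It only shows how one thread can service many transfers by waiting on all of them at once.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: read every descriptor in `fds` to EOF from one thread.
// Returns the total number of bytes read across all descriptors.
std::size_t drain_all(const std::vector<int>& fds) {
    std::vector<pollfd> watched;
    for (int fd : fds) {
        assert(fd >= 0);
        // Non-blocking mode: read() returns at once instead of stalling the loop.
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
        pollfd p{};
        p.fd = fd;
        p.events = POLLIN;
        watched.push_back(p);
    }

    std::size_t total = 0;
    std::size_t open_count = watched.size();
    char buffer[4096];

    while (open_count > 0) {
        // A single blocking call waits on all transfers at the same time.
        if (poll(watched.data(), watched.size(), -1) < 0) {
            if (errno == EINTR) continue;
            break;
        }
        for (pollfd& p : watched) {
            if (p.fd < 0 || p.revents == 0) continue;
            ssize_t n = read(p.fd, buffer, sizeof buffer);
            if (n > 0) {
                total += static_cast<std::size_t>(n);
            } else if (n == 0 || (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)) {
                p.fd = -1;  // EOF or hard error; poll() ignores negative descriptors
                --open_count;
            }
        }
    }
    return total;
}
```

No descriptor ever blocks the others here, which is the property the multi interface relies on; the cost is that all processing shares one core.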

Related

Performance of multithreaded TCP networking

I'm working on a project using the TCP protocol that may have to work with many 100s or more connections at once.
As such, I am uncertain which method I should use to collect and send this data.
I was wondering whether the principle of more threads = more performance applies here.
My reason for doubt is because all data still has to be fed through the network connection, of which most devices only have 1 active at a time. In addition, I know that repeated context switching can reduce performance as well.
However, I've seen from other sources suggesting that multithreading does indeed scale network performance to a point, and if that's true, why?
Currently, I'm using the Non-Boost variant of ASIO to handle networking.
Thanks in advance for any assistance.

ASIO is a wrapper around epoll/IOCP, and as such is optimized for high-performance non-blocking I/O. It's possible to achieve hundreds of thousands of simultaneous connections with this setup on a single thread. Indeed, the old-fashioned "a thread per client" setup could never reach this level of performance due to the context switching overhead.
With that said, depending on the protocol used, handling network requests and replies takes some CPU time, and on a high-rate network it might saturate the single CPU core on which the io_service is running. In that case it is possible to parallelize the io_service so that completion routines can run on more than one core. Still no context switching would take place if the number of threads doesn't exceed the number of available CPU cores/hardware threads. Context switching occurs when the same core needs to handle multiple threads and also when switching between user and kernel mode (i.e. twice for each system call).
Benchmark your server to see how many clients it can handle on a single thread. Chances are it will be enough. Parallelizing io_service comes at a cost of having to deal with completion routines running in parallel, which almost always requires additional synchronization, which means additional overhead.

You want about the same number of threads as you have CPU cores, including hyperthreaded ones. Not more.
Each thread deals with a subset of the sockets. That way, you maximize CPU parallelism, but minimize overhead.
If you truly need hundreds of connections and require low latency, you should consider UDP, where a single socket can receive from many remote addresses. But you then have to implement reliability yourself. Still, that's how multi-player AAA game servers are typically run, and there are good reasons for it.
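The thread-per-core layout suggested in this answer can be sketched as follows. Both function names are hypothetical, and round-robin assignment is just one assumed way to split the sockets; the answer does not say how the subsets are chosen.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: one worker per hardware thread, never fewer than one.
unsigned worker_count() {
    unsigned n = std::thread::hardware_concurrency();  // may return 0 if unknown
    return n == 0 ? 1u : n;
}

// Hypothetical helper: assign socket descriptors to workers round-robin,
// so that each worker thread owns a fixed subset of the sockets.
std::vector<std::vector<int>> partition_sockets(const std::vector<int>& sockets,
                                                unsigned workers) {
    assert(workers > 0);
    std::vector<std::vector<int>> groups(workers);
    for (std::size_t i = 0; i < sockets.size(); ++i) {
        groups[i % workers].push_back(sockets[i]);
    }
    return groups;
}
```

Each group would then be handed to its own thread running an event loop over just those sockets, so no socket is ever touched by two threads.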

Multi-Threading vs Single-Threading is a hard topic, and I think it all depends on the point of view of your implementation.
If you have a good event-driven system on one thread, using a single thread for low-level network I/O will probably be better.
Spawning threads carries a performance penalty of its own, as the system has to schedule them. Using the extra processors helps, of course, but as you said, at the low level all threads need some kind of synchronization, which is another penalty, unless you are using one socket per thread.
One major drawback of multi-threading (one socket per thread) on networks is that your system will most likely be vulnerable to 'slow loris' attacks.
Wikipedia for slow loris
Computerphile video on slow loris
So I think you are better off using multiple threads for other long-waiting or time-consuming tasks. Of course, you should use non-blocking I/O.

How to limit Boost.Asio memory

I'm having trouble managing the work .post()'ed to Boost.Asio's io_context, having multiple questions about it (newbie warning).
Background: I'm writing a library that connects to a large number of different hosts for short periods at a time each (connect, send data, receive answer, close), and I figured I would use Boost.Asio. The documentation is scarce (too DRY?)
My current approach is this: (assuming a quad core machine): two physical cores run CPU bound sync operations, and post() additional work items to io_context. Two other threads are .run()ing and performing completion handlers.
1- The work scheduler
As per this amazing answer,
Boost.Asio may start some of the work as soon as it has been told about it, and other times it may wait to do the work at a later point in time.
When does boost.asio do what? On what basis is the queued work later processed?
2- Multiple Producers/ Multiple Consumers
As per This article,
At its core, Boost Asio provides a task execution framework that you can use to perform operations of any kind. You create your tasks as function objects and post them to a task queue maintained by Boost Asio. You enlist one or more threads to pick these tasks (function objects) and invoke them. The threads keep picking up tasks, one after the other till the task queues are empty at which point the threads do not block but exit.
I am failing to find a way to put a cap on the length of this task queue. This answer gives a couple of solutions, but they both involve locking, something I'd like to avoid as much as possible.
3- Are strands really necessary? How do I "disable them"
As detailed in this answer, boost uses an implicit strand per connection. Making potentially millions of connections, the memory savings by "bypassing" strands make sense to me. As the requests I make are independent (different host to each request), operations I make within a single connection is already serialized (callback chain) so I have no overlapping reads & writes, and no synchronization is expected from Boost.Asio. Does it make sense for me to try and bypass strands? If so, how?
4- Scaling design approach (A bit vague because I have no clue)
As stated in my background section, I'm running two io_contexts on two physical cores, each with two threads, one for writing and one for reading. My goal here is to spew packets as fast as I can, and I have already:
Compiled asio with BoringSSL (OpenSSL is a serious bottleneck)
Wrote my own c-ares resolver service to avoid async-ish DNS queries running in a thread loop.
But my network driver still starts timing out when multiple connections are opened. So how do I dynamically adjust Boost.Asio's throughput so that the network adapter can cope with it?
My question(s) is most likely ill-informed, as I'm no expert in network programming, and I know this is a complex problem. I'd appreciate it if someone left pointers for me to look at before closing the question or making it "dead".
Thank you.

IOCP Critical Section Design

I'm running a fully operational IOCP TCP socket application. Today I was thinking about the Critical Section design and now I have one nagging question in my head: a global Critical Section or one per client? I came to this because, as I see it, there is no point in using multiple worker threads if every thread depends on a single lock, right? I mean... right now I don't see any performance issue with 100 simultaneous clients, but what if there were 10000?
My shared resource is a pre-allocated per-client struct, so each client has its own IO context, socket and so on. There is no resource sharing between clients, so I think that is another point in favour of the per-client CS. I use one accept thread and 8 (processors * 2) worker threads. This application is basically designed for small (< 1KB) packets, but is sometimes used for file streaming.

The "correct" answer probably depends on your design, the number of concurrent clients and the performance that you require from the hardware that you have available.
In general, I find it best to go with the simplest thing that works and then profile to locate hot spots.
However... You say that you have no inter-client shared resources so I assume the only synchronisation that you need to do is around 'per-connection' state.
Since it's per connection the obvious (to me) design would be for the per-connection state to contain its own critical section. What do you perceive to be the downside of this approach?
The problem with a single shared lock is that you introduce contention between connections (and threads) that have no reason to block each other. This will adversely affect performance and will likely become a hot-spot as connection numbers rise.
Once you have a per connection lock you might want to look at avoiding using it as often as possible by having the IOCP threads simply lock to place completions in a per connection queue for processing. This has the advantage of allowing a single IOCP thread to work on each connection and preventing a single connection from having additional IOCP threads blocking on it. It also works well with 'skip completion port on success' processing.
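The per-connection queue idea from the last paragraph can be sketched in portable C++. This is an illustration under assumptions, not Windows IOCP code: Connection, on_completion and the int payload are hypothetical, and std::mutex stands in for a CRITICAL_SECTION. A thread that receives a completion locks only long enough to enqueue it; if another thread is already processing that connection it returns immediately instead of blocking.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical per-connection state with its own lock.
struct Connection {
    std::mutex lock;
    std::deque<int> pending;   // completions waiting to be processed
    bool processing = false;   // true while some thread is draining `pending`
    std::vector<int> handled;  // touched only by the thread that is draining
};

// Called by any I/O thread when a completion for `c` arrives.
void on_completion(Connection& c, int completion) {
    {
        std::lock_guard<std::mutex> guard(c.lock);
        c.pending.push_back(completion);
        if (c.processing) return;  // another thread will pick this up
        c.processing = true;       // this thread becomes the drainer
    }
    for (;;) {
        int item;
        {
            std::lock_guard<std::mutex> guard(c.lock);
            assert(c.processing);  // only the draining thread gets here
            if (c.pending.empty()) {
                c.processing = false;
                return;
            }
            item = c.pending.front();
            c.pending.pop_front();
        }
        // The actual work happens outside the lock.
        c.handled.push_back(item);
    }
}
```

The lock is held only for queue operations, so I/O threads never wait on each other for the duration of the processing itself, and completions for one connection are handled in arrival order by one thread at a time.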

Boost Asio single threaded performance

I am implementing a custom server that needs to maintain a very large number (100K or more) of long-lived connections. The server simply passes messages between sockets and doesn't do any serious data processing. Messages are small, but many of them are received/sent every second. Reducing latency is one of the goals. I realize that using multiple cores won't improve performance, and therefore I decided to run the server in a single thread by calling the run_one or poll methods of the io_service object. In any case, a multi-threaded server would be much harder to implement.
What are the possible bottlenecks? Syscalls, bandwidth, completion queue / event demultiplexing? I suspect that dispatching handlers may require locking (that is done internally by the asio library). Is it possible to disable event queue locking (or any other locking) in boost.asio?
EDIT: related question. Does syscall performance improve with multiple threads? My feeling is that because syscalls are atomic/synchronized by the kernel adding more threads won't improve speed.

You might want to read my question from a few years ago, I asked it when first investigating the scalability of Boost.Asio while developing the system software for the Blue Gene/Q supercomputer.
Scaling to 100k or more connections should not be a problem, though you will need to be aware of the obvious resource limitations such as the maximum number of open file descriptors. If you haven't read the seminal C10K paper, I suggest reading it.
After you have implemented your application using a single thread and a single io_service, I suggest investigating a pool of threads invoking io_service::run(), and only then investigate pinning an io_service to a specific thread and/or cpu. There are multiple examples included in the Asio documentation for all three of these designs, and several questions on SO with more information. Be aware that as you introduce multiple threads invoking io_service::run() you may need to implement strands to ensure the handlers have exclusive access to shared data structures.

Using boost::asio, you can write a single-threaded or a multi-threaded server at approximately the same development cost. You can write a single-threaded version first, then convert it to a multi-threaded one if needed.
Typically, the only bottleneck in boost::asio is that the epoll/kqueue reactor runs under a mutex, so only one thread is doing epoll at a time. This can decrease performance when you have a multi-threaded server that serves a very large number of very small packets. But in my opinion it should still be faster than a plain single-threaded server.
Now about your task. If you just want to pass messages between connections, I think it should be a multi-threaded server. The problem is syscalls (recv/send etc.). A single instruction is very cheap for the CPU, but a syscall is not a "light" operation (everything is relative, but it is heavy relative to the other work in your task). So with a single thread you will get a large syscall overhead, which is why I recommend a multi-threaded scheme.
Also, you can split the io_service and use the "io_service per thread" idiom. I think this should give the best performance, but it has a drawback: if one io_service gets too long a queue, the other threads will not help it, so some connections may slow down. On the other hand, with a single io_service, a queue overrun can lead to large locking overhead. All you can do is implement both variants and measure bandwidth/latency. It should not be too difficult to implement both.

Is select() Ok to implement single socket read/write timeout?

I have an application processing network communication with blocking calls. Each thread manages a single connection. I've added a timeout on the read and write operation by using select prior to read or write on the socket.
Select is known to be inefficient when dealing with a large number of sockets. But is it OK, in terms of performance, to use it with a single socket, or are there more efficient methods to add timeout support to single-socket calls? The benefit of select is that it is portable.

Yes, that's no problem, and you do want some timeout mechanism so that badly behaving clients etc. don't leak resources.
Note that having a large number of threads is even more inefficient than having select dealing with a large number of sockets.
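For reference, the pattern from the question (a select() call with a timeout before a blocking read on one descriptor) might look like this minimal POSIX sketch. wait_readable is a hypothetical helper name and the millisecond parameter is an assumption for illustration.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical helper: returns true if `fd` becomes readable within
// `timeout_ms` milliseconds, false on timeout or error.
bool wait_readable(int fd, int timeout_ms) {
    // select() cannot watch descriptors numbered FD_SETSIZE or higher.
    assert(fd >= 0 && fd < FD_SETSIZE);

    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(fd, &readable);

    timeval timeout;
    timeout.tv_sec = timeout_ms / 1000;
    timeout.tv_usec = (timeout_ms % 1000) * 1000;

    // The first argument is the highest watched descriptor plus one.
    int ready = select(fd + 1, &readable, nullptr, nullptr, &timeout);
    return ready > 0 && FD_ISSET(fd, &readable);
}
```

A caller would invoke wait_readable(sock, 5000) and only call read() or recv() when it returns true. With a single descriptor the set is tiny, so the scanning cost that hurts select() with thousands of sockets does not arise.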

If you think select is inefficient with a large number of sockets, try handling a large number of sockets with one thread per socket. You are in for a world of pain. You will have problems scaling to 1000 threads.
What I have done in the past is this:
Group sockets in groups of X (512, 1024).
Have one or two threads run over those groups and select(), then hand off the sockets with new data into a queue.
Have a number of worker threads work off those sockets with new data. How many depends on how much I need to max out the CPU ;)
This way I don't have a super über select() with TONS of items, and I also don't waste ridiculous amounts of memory on threads (hint: every thread needs its own stack. With ONLY 2 MB each, that is 2 GB for 1000 sockets - talk about inefficient) or waste huge amounts of CPU on useless context switches.

The question with threads/select is whether you want to avoid clients blocking each other. If this is not an issue, then work single-threaded. If it is, choose an appropriate threading scheme (1 thread per connection, worker threads per connection, worker thread per request,...).
When working with 1 thread per connection, a select per read/write is a decent solution, but generally speaking, it is better to work with non-blocking sockets in combination with select to avoid blocking in situations where only a part of the expected message arrives and then do a select after writing.
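The partial-message case mentioned above can be sketched with a non-blocking descriptor plus select(). This is an assumption-laden illustration: read_exact is a hypothetical helper, and the timeout here applies to each wait rather than to the whole message.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical helper: try to read exactly `len` bytes from `fd`.
// Returns the number of bytes read (fewer than `len` if a wait timed out
// or the peer closed the connection), or -1 on a hard error.
ssize_t read_exact(int fd, char* out, std::size_t len, int timeout_ms) {
    assert(fd >= 0 && fd < FD_SETSIZE);
    // Non-blocking mode: read() never stalls waiting for the rest of a message.
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    std::size_t got = 0;
    while (got < len) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fd, &readable);
        timeval timeout;
        timeout.tv_sec = timeout_ms / 1000;
        timeout.tv_usec = (timeout_ms % 1000) * 1000;

        int ready = select(fd + 1, &readable, nullptr, nullptr, &timeout);
        if (ready < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (ready == 0) break;  // timed out waiting for more data

        ssize_t n = read(fd, out + got, len - got);
        if (n > 0) {
            got += static_cast<std::size_t>(n);
        } else if (n == 0) {
            break;  // peer closed before the full message arrived
        } else if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR) {
            return -1;
        }
    }
    return static_cast<ssize_t>(got);
}
```

If only part of the expected message arrives, the loop goes back to select() instead of blocking inside read(), so the thread regains control after the timeout.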
