Efficiency of select() versus recv() with MSG_PEEK for asynchronous C++ sockets

I would like to know what would be most efficient when checking for incoming data (asynchronously). Let's say I have 500 connections. I have 3 scenarios (that I can think of):
Using select() to check FD_SETSIZE sockets at a time, then iterating over all of them to receive the data. (Wouldn't this require two calls to recv for each socket returned? MSG_PEEK to allocate a buffer then recv() it again which would be the same as #3)
Using select() to check one socket at a time. (Wouldn't this also be like #3? It requires the two calls to recv.)
Use recv() with MSG_PEEK one socket at a time, allocate a buffer then call recv() again. Wouldn't this be better because we can skip all the calls to select()? Or is the overhead of one recv() call too much?
I've already coded situations 1 and 2, but I'm not sure which one to use. Sorry if I'm a bit unclear.
Thanks

FD_SETSIZE is typically 1024, so you can check all 500 connections at once. Then, you will perform the two recv calls only on those which are ready -- say, for a very busy system, half a dozen of them each time around, for example. With the other approaches you need about 500 more syscalls (the huge number of "failing" recv or select calls you perform on the many hundreds of sockets which will not be ready at any given time!-).
In addition, with approach 1 you can block until at least one connection is ready (no overhead in that case, which won't be rare in systems that aren't all that busy) -- with the other approaches, you'll need to be "polling", i.e., churning continuously, burning huge amounts of CPU to no good purpose (or, if you sleep a while after each loop of checks, then you'll have a delay in responding despite the system not being at all busy -- eep!-).
That's why I consider polling to be an anti-pattern: frequently used, but nevertheless destructive. Sometimes you have absolutely no alternative (which basically tells you that you're having to interact with very badly designed systems -- alas, sometimes in this imperfect life you do have to!-), but when any decent alternative does exist, doing polling nevertheless is really a very bad design practice and should be avoided.
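To make the comparison concrete, here is a minimal sketch of approach 1, assuming POSIX sockets and a container of already-connected descriptors (the names clients and handle_data are illustrative, not from the question's code):
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

void serve_once(std::vector<int>& clients)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    int maxfd = -1;
    for (int fd : clients) {                  // only currently connected descriptors
        FD_SET(fd, &readfds);
        if (fd > maxfd) maxfd = fd;
    }
    // Block until at least one socket is readable -- no polling, no busy loop.
    if (select(maxfd + 1, &readfds, nullptr, nullptr, nullptr) <= 0) return;
    char buf[8192];
    for (int fd : clients) {
        if (!FD_ISSET(fd, &readfds)) continue;    // not ready: no recv() wasted on it
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n > 0) { /* handle_data(fd, buf, n); */ }
        else       { /* 0 = peer closed, <0 = error: close(fd) and drop it */ }
    }
}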

You can simply do a rough efficiency comparison of the 3 solutions across two scenarios:
Scenario A (0/500 incoming data)
for solution #1, you only invoke single select()
for solution #2, you need 500 select()
for solution #3, you need 500 recv()
Scenario B (250/500 incoming data)
for solution #1, single select() + (500 recv())
for solution #2, 500 select() + (500 recv())
for solution #3, 750 recv()
(This assumes that sockets with no incoming data are skipped without allocating a buffer.)
The answer is obvious :)

...most efficient when checking for incoming data (asynchronously). Let's say I have 500 connections. I have 3 scenarios (that I can think of):
Using select() to check FD_SETSIZE sockets at a time, then iterating over all of them to receive the data. (Wouldn't this require two calls to recv for each socket returned? MSG_PEEK to allocate a buffer then recv() it again which would be the same as #3)
I trust you're carefully constructing your fd set with only the descriptors that are currently connected...? You then iterate over the set and only issue recv() for those that have read or exception/error conditions (the latter difference being between BSD and Windows implementations). While it's ok functionally (and arguably elegant conceptually), in most real-world applications you don't need to peek before recv-ing: even if you're unsure of the message size and know you could peek it from a buffer, you should consider whether you can:
process the message in chunks (e.g. read whatever's a good unit of work - maybe 8k, process it, then read the next <=8k into the same buffer... see the sketch after this list)
read into a buffer that's big enough for most/all messages, and only dynamically allocate more if you find the message is incomplete
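For instance, the chunked variant could look roughly like this (a sketch only; process_chunk is a hypothetical helper and 8k is just an example size):
#include <sys/socket.h>

void read_in_chunks(int fd)
{
    char buf[8192];                                  // fixed buffer, reused for every read
    ssize_t n;
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0) {
        // process_chunk(buf, n);   // hypothetical: consume this unit of work and stop
        //                          // looping once your protocol says the message is complete
    }
    // n == 0 means the peer closed the connection; n < 0 means an error.
}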
Using select() to check one socket at a time. (Wouldn't this also be like #3? It requires the two calls to recv.)
Not good at all. If you stay single-threaded, you'd need to put a 0 timeout value on select and spin like crazy through the listening and client descriptors. Very wasteful of CPU time, and it will vastly degrade latency.
Use recv() with MSG_PEEK one socket at a time, allocate a buffer then call recv() again. Wouldn't this be better because we can skip all the calls to select()? Or is the overhead of one recv() call too much?
(Ignoring that it's better to try to avoid MSG_PEEK) - how would you know which socket to MSG_PEEK or recv() on? Again, if you're single threaded, then either you'd block on the first peek/recv attempt, or you use non-blocking mode and then spin like crazy through all the descriptors hoping a peek/recv will return something. Wasteful.
So, stick to 1 or move to a multithreaded model. For the latter, the simplest approach to begin with is to have the listening thread loop calling accept, and each time accept yields a new client descriptor it should spawn a new thread to handle the connection. These client-connection handling threads can simply block in recv(). That way, the operating system itself does the monitoring and wake-up of threads in response to events, and you can trust that it will be reasonably efficient. While this model sounds easy, you should be aware that multi-threaded programming has lots of other complications - if you're not already familiar with it you may not want to try to learn that at the same time as socket I/O.
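A minimal sketch of that thread-per-connection model, assuming C++11 threads (listen_fd, handle_client and process are illustrative names):
#include <sys/socket.h>
#include <unistd.h>
#include <thread>

void handle_client(int fd)
{
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0) {   // blocking recv: the OS wakes the thread
        // process(buf, n);   // hypothetical
    }
    close(fd);                                         // 0 = orderly close, <0 = error
}

void accept_loop(int listen_fd)
{
    for (;;) {
        int client_fd = accept(listen_fd, nullptr, nullptr);
        if (client_fd < 0) continue;
        std::thread(handle_client, client_fd).detach();  // one thread per connection
    }
}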

Related

Any way to know how many bytes will be sent on TCP before sending?

I'm aware that the ::send within a Linux TCP server can limit the sending of the payload such that ::send needs to be called multiple times until the entire payload is sent.
i.e. Payload is 1024 bytes
sent_bytes = ::send(fd, ...) where sent_bytes is only 256 bytes so this needs to be called again.
Is there any way to know exactly how many bytes can be sent before sending? Whether the socket will accept the entire message, or whether the send will be split up, and by how much?
Example Case
2 messages are sent to the same socket by different threads at the same time on the same TCP client via ::send(). In some cases where messages are large, multiple calls to ::send() are required, as not all the bytes are sent in the initial call. Thus, go with the loop solution until all the bytes are sent. The loop is mutexed, so it can be seen as thread safe: each thread has to perform the sending after the other. But my worry is that because TCP is a stream, the client will receive fragments of each message, and I was thinking that by adding framing to each message I could rebuild the message on the client side, if I knew how many bytes are sent at a time.
Although the call to ::send() is done sequentially, is there any chance that the byte stream is still mixed?
Effectively, could this happen:
Server Side
Message 1: "CiaoCiao"
Message 2: "HelloThere"
Client Side
Received Message: "CiaoHelloCiaoThere"
Although the call to ::send() is done sequentially, is there any chance that the byte stream is still mixed?
Of course. Not only is there a chance of that, it is pretty much a certainty: it is going to happen at one point or another. Guaranteed.
sent to the same socket by different threads
It will be necessary to handle the synchronization at this level, by employing a mutex that each thread locks before sending its message and unlocks only after the entire message is sent.
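A minimal sketch of that idea, assuming C++11 and one mutex per socket (names are illustrative and error handling is reduced to the bare minimum):
#include <sys/socket.h>
#include <mutex>

std::mutex send_mutex;                                  // one per socket

bool send_whole_message(int fd, const char* data, size_t len)
{
    std::lock_guard<std::mutex> lock(send_mutex);       // held until the ENTIRE message is out
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = ::send(fd, data + sent, len - sent, 0);
        if (n <= 0) return false;                       // error/disconnect: caller must handle it
        sent += static_cast<size_t>(n);
    }
    return true;
}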
It goes without saying that this leaves open the possibility that a blocked/hung socket will result in a single thread locking this mutex for an excessive amount of time, until the socket times out and your execution thread ends up dealing with a failed send() or write(), in whatever fashion it is already doing now (you are, of course, checking the return value from send/write, and handling the exception conditions appropriately).
There is no single, cookie-cutter, paint-by-numbers solution to this that works in every situation, in every program that needs to do something like this. Each eventual solution needs to be tailored to each program's unique requirements and purpose. Just one possibility would be a dedicated execution thread that handles all socket input/output, with all your other execution threads sending their messages to the socket thread instead of writing to the socket directly. This would avoid having every execution thread wedged by a hung socket, at the expense of growing the memory that holds all the unsent data.
But that's just one possible approach; the number of possible alternative solutions has no limit. You will need to figure out which logic/algorithm-based solution will work best for your specific program. There is no operating system/kernel-level indication that will give you any kind of a guarantee as to how much data a send() or write() call on a socket will accept.
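For illustration only, the dedicated-socket-thread possibility mentioned above could be shaped roughly like this (C++11 assumed; the queue and function names are hypothetical):
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <sys/socket.h>

std::mutex              outbox_mutex;
std::condition_variable outbox_cv;
std::queue<std::string> outbox;              // unsent messages, growing if the socket stalls

void enqueue_message(std::string msg)        // called by any worker thread, never blocks on the socket
{
    { std::lock_guard<std::mutex> lock(outbox_mutex); outbox.push(std::move(msg)); }
    outbox_cv.notify_one();
}

void sender_thread(int fd)                   // the only thread that ever calls ::send()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(outbox_mutex);
        outbox_cv.wait(lock, [] { return !outbox.empty(); });
        std::string msg = std::move(outbox.front());
        outbox.pop();
        lock.unlock();
        for (size_t sent = 0; sent < msg.size(); ) {
            ssize_t n = ::send(fd, msg.data() + sent, msg.size() - sent, 0);
            if (n <= 0) return;              // socket failed: real code would report and recover
            sent += static_cast<size_t>(n);
        }
    }
}
// Start it once per socket, e.g. std::thread(sender_thread, fd).detach();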

UDP send() to localhost under Winsock throwing away packets?

Scenario is rather simple... not allowed to use "sendto()" so using "send()" instead...
Under winsock2.2, normal operation on a brand new i7 machine running Windows 7 Professional...
Using SOCK_DGRAM socket, Client and Server console applications connect over localhost (127.0.0.1) to test things ...
Have to use packets of constant size...
Client socket uses connect(), Server socket uses bind()...
Client sends N packets using series of BLOCKING send() calls. Server only uses ioctlsocket call with FIONREAD, running in a while loop to constantly printf() number of bytes awaiting to be received...
PACKETS GET LOST UNLESS I PUT SLEEP() WITH A CONSIDERABLE AMOUNT OF TIME... What I mean is the number of bytes on the receiving socket differs between runs if I do not use SLEEP()...
Have played with changing buffer sizes, situation did not change much, except now there is no overflow, but the problem with the delay remains the same ...
I have seen many discussions about the issue between send() and recv(), but in this scenario, recv() is not even involved...
Thoughts anyone?
(P.S. The constraints under which I am programming are required for reasons beyond my control, so no WSA, .NET, MFC, STL, BOOST, QT or other stuff)
It is NOT an issue of buffer overflow, for three reasons:
Both incoming and outgoing buffers are set and checked to be significantly larger than ALL of the information being sent.
There is no recv(), only checking of the incoming buffer via an ioctl() call; recv() is called long after, upon user input.
When a Sleep() of >40ms is added between the send()s, the whole thing works, i.e. if there were an overflow, no amount of Sleep() would have helped (again, see point 2).
PACKETS GET LOST UNLESS I PUT SLEEP() WITH A CONSIDERABLE AMOUNT OF TIME... What I mean is the number of bytes on the receiving socket differs between runs if I do not use SLEEP()...
This is expected behavior; as others have said in the comments, UDP packets can and do get dropped for any reason. In the context of localhost-only communication, however, the reason is usually that a fixed-size packet buffer somewhere is full and can't hold the incoming UDP packet. Note that UDP has no concept of flow control, so if your receiving program can't keep up with your sending program, packet loss is definitely going to occur as soon as the buffers get full.
As for what to do about it, the insert-a-call-to-sleep() solution isn't particularly good, because you have no good way of knowing what the "right" sleep duration ought to be. (Too short a sleep() and you'll still drop packets; too long a sleep() and you're transferring data more slowly than you otherwise could; and of course the "best" value will likely vary from one computer to the next, or one moment to the next, in non-obvious ways.)
One thing you could do is switch to a different transport protocol such as TCP, or (since you're only communicating within localhost), a simple pipe or socketpair. These protocols have the lossless FIFO semantics that you are looking for, so they might be the right tool for the job.
Assuming you are required to use UDP, however, UDP packet loss will be a fact of life for you, but there are some things you can do to reduce packet loss:
send() in blocking mode, or if using non-blocking send(), be sure to wait until the UDP socket select()'s as ready-for-write before calling send(). (I know you said you send() in blocking mode; I'm just including this for completeness)
Make your SO_RCVBUF setting as large as possible on the receiving UDP socket(s). The larger the buffer, the lower the chance of it filling up to capacity (see the sketch after this list).
In the receiving program, be sure that the thread that calls recv() does nothing else that would ever hold it off from getting back to the next recv() call. In particular, no blocking operations (even printf() is a blocking operation that can slow your thread down, especially under Windows where the DOS prompt is infamous for slow scrolling under load)
Run your receiver's network recv() loop in a separate thread that does nothing else but call recv() and place the received data into a FIFO queue (or other shared data structure) somewhere. Then another thread can do the less time-critical work of examining and parsing the data in the FIFO, without fear of causing a dropped packet.
Run the UDP-receive thread at the highest priority you can convince the OS to let you run at. The fewer other tasks that can hold off the UDP-receive thread, the fewer opportunities for packets to get dropped during those hold-off periods.
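As a concrete illustration of the SO_RCVBUF point above (Winsock-flavored, since the question uses Winsock; udpSocket and the 1MB figure are placeholders):
int rcvBufSize = 1024 * 1024;   // ask for a 1MB kernel receive buffer on the receiving UDP socket
if (setsockopt(udpSocket, SOL_SOCKET, SO_RCVBUF,
               (const char *) &rcvBufSize, sizeof(rcvBufSize)) != 0)
{
    // setsockopt failed; note the OS may also silently cap the value, so you can
    // read it back with getsockopt(SO_RCVBUF) to see what you actually got.
}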
Just keep in mind that no matter how clever you are at reducing the chances for UDP packet loss, UDP packet loss will still happen. So regardless you need to come up with a design that allows your programs to still function in a reasonably useful manner even when packets are lost. This could be done by implementing some kind of automatic-resend mechanism, or (depending on what you are trying to accomplish) by designing the protocol such that packet loss can simply be ignored.

Can single-buffer blocking WSASend deliver partial data?

I've pretty much always used send() with sockets and now I'm moving on to the WSA functions. With send(), I have a sendall() helper that ensures all data is delivered even if it doesn't happen in one try and a partial send occurs on the first call.
So, instead of learning the hard way or over-complicating code when I don't have to, decided to ask you:
Can a blocking WSASend() send partial data or does it send everything before it returns or fails? Or should I check the bytes sent vs. expected to send and keep at it until everything is delivered?
ANSWER: Overlapped WSASend() does not send partial data; if it ever does, it means the connection has terminated. I've never encountered that case yet.
From the WSASend docs:
If the socket is non-blocking and stream-oriented, and there is not sufficient space in the transport's buffer, WSASend will return with only part of the application's buffers having been consumed. Given the same buffer situation and a blocking socket, WSASend will block until all of the application buffer contents have been consumed.
I haven't tried this behavior though. BTW, why do you rewrite your code to use the WSA functions? Switching from the standard BSD socket API just to use the socket with basically the same blocking behavior doesn't really seem like a good idea to me. Just keep the old blocking code with send() and the "retry" code; this way it's portable and bulletproof. It is not saving 1-2 comparisons that makes your IO code performant.
Switch to the specialized WSA functions only if you are trying to exploit some Windows-specific strengths, or if you want to use non-blocking sockets with WSAWaitForMultipleObjects, which is a bit better than the standard select; but even in that case you can simply go with send and recv, as I did.
In my opinion, using epoll/kqueue/IOCP (or a library that abstracts these away) with sockets is the way to go. There are some very basic tasks that can be done with blocking sockets, but if you cross the line and need nonblocking sockets, then switching straight to epoll/kqueue/IOCP is the way to go instead of programming painful select or WSAWaitForMultipleObjects based APIs. epoll/kqueue/IOCP are not only better but also easier to program than the select based alternatives. Really. They are more modern APIs that were invented based on more experience. (Although they are not cross-platform, but even select has portability issues...)
The previously mentioned APIs for Linux/BSD/Windows are based on the same concept, but in my opinion the simplest and easiest to learn is the epoll API of Linux. It is way better than a select call, yet 100x easier to program once you get the idea. If you start using IOCP on Windows, it may seem a bit more complicated.
If you haven't yet used these APIs then definitely give epoll a go if you are familiar with Linux, and then on Windows implement the same with IOCP, which is based on a similar concept with somewhat more complicated overlapped IO programming. With IOCP you will have a reason for using WSASend, because you cannot start overlapped IO on a socket with send(), but you can do that with WSASend() (or WriteFile()).
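For reference, the epoll loop described above boils down to roughly this (Linux-only, level-triggered, illustrative sketch rather than production code):
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void epoll_loop(int listen_fd)
{
    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);
    epoll_event events[64];
    char buf[8192];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);            // block until something is ready
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                         // new connection: register it too
                int client = accept(listen_fd, nullptr, nullptr);
                ev.events = EPOLLIN;
                ev.data.fd = client;
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &ev);
            } else {
                ssize_t r = recv(fd, buf, sizeof buf, 0);
                if (r <= 0) { epoll_ctl(ep, EPOLL_CTL_DEL, fd, nullptr); close(fd); }
                // else: handle buf[0..r)
            }
        }
    }
}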
EDIT: If you are going for max performance with IOCP then here are some additional hints:
Drop blocking operations. This is very important. A serious networking engine can not afford blocking IO. It simply doesn't scale on any of the platforms. Do overlapped operations for both send and receive, overlapped IO is the big gun of windows.
Setup a thread pool that processes the completed IO operations. Setup test clients that bomb your server with real-world-usage-like messages and parallel connection counts and under stress tweak the buffer sizes and thread counts for your actual target hardware.
Set the SO_RCVBUF and SO_SNDBUF sizes of your sockets to zero and play around with the size of the buffers that you are using to send and receive data. Setting the rcv/send buf of the socket handle to zero allows the TCP stack to receive/send data directly to/from your buffers, avoiding an additional copy between your userspace buffers and the socket buffers. The optimal size for these buffers is also subject to tweaking. I usually use buffer sizes of at least a few tens of KB, but sometimes, in case of large-volume transfers, 1-2MB buffer sizes are better, depending on the number of parallel busy connections. Again, tweak the values while stressing the server with test clients that do activity similar to real-world clients. When you are ready with the first working version of your network engine, build on top of it a test client that can simulate many (maybe thousands of) parallel clients, depending on the real-world usage of your server.
You will need "per connection software send buffers" inside your network engine and you may (or may not) want to control the max size of the send buffers. In case of reaching the max send buffer size you may want to block or discard messages/data depending on what you want to do, encapsulate this special buffer and provide two nice interfaces to it: one for the threads that are putting data into this buffer and another interface that is used by the IOCP sender code. This buffer is usually a very critical part of the whole thing and I usually had a lot of bugs around this part of the code so make sure to design its interface nicely to minimize the number of bugs. Depending on how your application constructs and puts messages into the queue you can play around a lot with the internal implementation (size of storage chunks, nagle-like optimizations, ...).

Correct use of recv() in C++

I'm working on my own FTP client in C++, but I'm stuck at the function recv(). When I get data with recv(), it can be incomplete, because I'm using the TCP protocol, so I have to call recv in a loop. The problem is that when I call recv after everything that should be received has been received, the server blocks and my program is stuck.
I don't know how many bytes I'm going to receive, so I can't control it and stop it when it's done. I have found two not-very-elegant solutions for now:
The first is to use string.substr() (or TR1 regex) to find the needed expression and then stop calling recv before it blocks.
The second is to set up a timeval structure and then control the socket through the setsockopt() function. The problem is the long response time, and I can still get incomplete, corrupted data.
The question is: is there any clean and elegant solution for this?
The obvious thing to do is to transmit the length of the to-be-received message ahead (many protocols, including for example HTTP do that, to address the exact same issue). That way, you know that when you have received amount X, no more will come.
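A minimal sketch of such length-prefixing, assuming POSIX sockets and a 4-byte length header in network byte order (helper names are illustrative):
#include <arpa/inet.h>
#include <sys/socket.h>
#include <cstdint>
#include <string>

static bool recv_exact(int fd, char* buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0);
        if (n <= 0) return false;                  // connection closed or error
        got += static_cast<size_t>(n);
    }
    return true;
}

bool recv_framed(int fd, std::string& out)
{
    uint32_t netLen = 0;
    if (!recv_exact(fd, reinterpret_cast<char*>(&netLen), sizeof netLen)) return false;
    out.resize(ntohl(netLen));                     // sender must have written htonl(length) first
    return out.empty() || recv_exact(fd, &out[0], out.size());
}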
This will work fine 99.9% of the time and will catastrophically fail in the 0.1% of cases where the server is lying to you or where the server crashes unexpectedly or someone stumbles over the network cable (or something similar happens). Sadly, the "connection" established by TCP is an illusion, and you don't have much of a means to detect when the connection dies. The other end can go down, and you will not notice anything, unless you try to send and get an error (or until several hours later).
Therefore, you also need a backup strategy for when things don't go quite as good as expected. You might either use select or poll to know when data is available, so you don't block forever for a message that will never come.
Using threads to solve the block-at-end problem (as proposed in other answers) is not a very good option since blocking isn't the actual problem. The actual problem is that you don't know when you have reached the end of the transmission. Having a worker thread block at the end of the transmission will "work", but will leave the worker thread blocked indefinitely, consuming resources and with an uncertain, system-dependent fate.
You cannot join the thread before exiting, since it is blocked (so trying to join it would deadlock your main thread). When your process exits and the socket is closed, the thread will unblock, but will (at least on some operating systems, e.g. Windows) be terminated immediately after. This likely won't do much evil, but terminating a thread in an uncontrolled way is always less desirable than having it exit properly. On other operating systems, you may have a lingering thread remaining.
Since you are using C++, there are alternative libraries that greatly simplify network programming compared to stock C. My personal favourite is Boost::Asio, however others are available. These libraries not only save you the pain of coding in C, but also provide asynchronous capabilities to work around your blocking problem.
The typical approach is to use select()/pselect() or poll()/ppoll(). Both allow you to specify a timeout in order to exit if there is no incoming data.
However, I don't see why you would "call recv after everything that should be received" has already arrived. It would be extremely inefficient to rely on the timeout even when there are no network problems...
Either you send the size of the data before the data itself, and that's how much you read, or the data connection is terminated with an EOF; in that case read() will return 0 and you exit.
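For completeness, the EOF-terminated variant is just a loop that reads until the return value hits zero (a sketch with illustrative names):
#include <sys/socket.h>
#include <string>

bool recv_until_eof(int fd, std::string& out)
{
    char buf[4096];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n == 0)  return true;                  // EOF: the sender closed the connection
        if (n < 0)   return false;                 // error
        out.append(buf, static_cast<size_t>(n));
    }
}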
I can think of two options that will not require a major rewrite of your existing code and a third one which is more radical:
use non-blocking I/O and poll for data periodically. You can do other work while a message remains incomplete or no further data can be read from the socket.
use a separate worker thread to do the I/O. Even if it blocks on synchronous recv() calls, your main thread can continue to do work. The worker thread can transfer the data it receives to the main thread for processing once a complete message is received via TCP.
use an OS specific feature (I/O completion ports on Windows or aio on Linux), but these are far more complex and you should definitely consider Boost.Asio before going this route.
You can put the recv function in its own thread and do the processing in another thread.

Should I use multiple threads for a multi socket client?

I understand that for most cases using threads in Qt networking is overkill and unnecessary, especially if you do it the proper way and use the readyRead() signal. However, my "client" application will have multiple sockets open (about 5) at one time. It is possible for there to be data coming in on all sockets at the same time. I am really not going to be doing any intense processing with the incoming data. Simply reading it in and then sending out a signal to update the GUI with the newly received data. Do you think a single thread application should be able to handle all of the data coming in?
I understand that I haven't shown you any code and that my description is pretty vague and it could very well depend on how it performs once implemented, but from a general design perspective and your guys' expertise, what is your opinion?
Unless you are receiving really high-bandwidth streams (e.g. megabytes per second rather than kilobytes per second), a single-threaded design should be sufficient. Keep in mind that the OS's networking stack is running "in the background" at all times, receiving TCP packets and storing the received data inside fixed-size in-kernel memory buffers. This happens in parallel with your program's execution, so in most cases the fact that your program is single-threaded and busy dealing with a GUI update (or another socket) won't hamper your computer's reception of TCP packets.
The case where a single-threaded design would cause a slowdown of TCP traffic is if your program (via Qt) didn't call recv() quickly enough, such that the kernel's TCP-receive buffer for a socket became entirely filled with data. At that point the kernel would have no choice but to start dropping incoming TCP packets for that socket, which would cause the server to have to re-send those TCP packets, and that would cause the socket's TCP receive rate to slow down, at least temporarily. However, that problem can be avoided by making sure the buffers never (or at least rarely) get full.
The obvious way to do that is to ensure that your program reads all of the incoming data as quickly as possible -- something that QTCPSocket does by default. The only thing you need to do is make sure that your GUI updates don't take an inordinate amount of time -- and Qt's widget-update routines are fairly efficient, so they shouldn't, unless you have a really elaborate GUI or an inefficient custom paintEvent() routine or etc.
If that's not sufficient, the next thing you could do (if necessary) is tell the OS's TCP stack to increase the size of its in-kernel TCP receive buffer, e.g. by doing:
int fd = (int) myQTCPSocketObject.socketDescriptor();   // QTcpSocket exposes the OS-level descriptor via socketDescriptor()
int newBufSizeBytes = 128*1024;                         // request 128kB kernel recv-buffer for this socket
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, (const char *) &newBufSizeBytes, sizeof(newBufSizeBytes)) != 0) perror("setsockopt");
Doing that would give your (single) thread more time to react before incoming packets start getting dropped for lack of in-kernel buffer space.
If, after trying all that, you still aren't getting the network performance you need, then you can try going multithreaded. I doubt it will come to that, but if it does, it needn't affect your program's design too much; you'd just write a wrapper class (called SocketThread or something) that holds your QTCPSocket object and runs an internal thread that handles the reading from the socket, and emits a bytesReceived(QByteArray) signal whenever the thread reads data from the socket. The rest of your code would remain approximately the same; just modify it to hold the SocketThread object instead of a QTCPSocket, and connect the SocketThread's bytesReceived(QByteArray) signal to a corresponding slot (via a QueuedConnection, of course, for thread-safety) and use that instead of responding directly to readReady().
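Such a wrapper could be sketched roughly like this (Qt 5 assumed; names and details are illustrative rather than a drop-in implementation):
#include <QThread>
#include <QTcpSocket>
#include <QByteArray>

class SocketThread : public QThread
{
    Q_OBJECT
public:
    explicit SocketThread(qintptr sockFd, QObject* parent = nullptr)
        : QThread(parent), m_fd(sockFd) {}
signals:
    void bytesReceived(QByteArray data);
protected:
    void run() override
    {
        QTcpSocket socket;                    // created here so it lives in this thread
        socket.setSocketDescriptor(m_fd);
        while (socket.waitForReadyRead(-1))   // block until data arrives (-1 = no timeout)
            emit bytesReceived(socket.readAll());
    }
private:
    qintptr m_fd;
};
// Connect bytesReceived() to your GUI slot with Qt::QueuedConnection.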
Implement it without threads, using a thread-considerate design(*), measure the delay your data experiences, decide if it is within acceptable bounds. Then decide if you need to use threads to capture it more rapidly.
From your description, the key bottleneck is going to be the GUI's reception of the "data ready" signal and rendering it. If you use the approach of sending lots of these signals, your GUI is going to be doing more re-renders.
If you use a single-thread approach, you can marshal the network reads and get all the updates and then refresh the GUI directly. As you've described it, this sounds like it will have the least degree of contention.
(* try to avoid constructs which will require an entire rewrite if you go threaded, but don't put so much effort into making it thread-proof that it will actually need threads to make it efficient, e.g. don't wrap everything with mutex calls)
I do not know much about Qt, but this could be a typical scenario where you use select() to multiplex multiple socket accesses with a single thread.
If the selecting thread is used mainly for handling the data from/to the sockets, you will be very fast (as you will have fewer context switches). So if you are not transferring really huge amounts of data, it is quite possible that you will be faster with a single-threaded solution.
That being said, I would go with the solution that best fits your needs, something that you can implement in a fair amount of time. Implementing select (async) can be quite a hassle, an overkill that might not be needed.
It's a C-like approach, but I hope it helps anyway.