I have some legacy code that does this:
select(nFD + 1, &tReadFds, NULL, NULL, &timer);
.............
if (FD_ISSET(nFD, &tReadFds))
n = read(nFD,len,x);
Is the read() going to consume everything in nFD's receive buffer, assuming 'len' (the buffer) and 'x' (the byte count) are big enough?
I think select() here is acting just as a way of blocking until data becomes available in the receive buffer.
In a nutshell, select() is a function that blocks until at least one of the file descriptors you ask about is ready (or an optional timeout expires), and upon return it tells you the set of file descriptors on which you can call read() (or write()) without blocking.
Such a function is crucial if you want to provide a persistent service while processing I/O with only a single thread: you cannot afford to block on one idle descriptor while another has data waiting, so you need a reliable way to know which descriptors can be serviced without blocking.
Edit. Here's an example of a typical single-threaded select-server, in pseudo-code:
while (true)
{
    select(...);
    read_available_data();
    process_data_and_do_work(); // expensive
}
Such a server never has to be idle, and the expensive processing function can take up almost all the available computing time (it just has to make sure to return when it needs more data). And because select() blocks rather than spins, the process consumes no CPU while it waits, so this plays nicely in a multi-process environment.
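Here is a minimal compilable rendering of that pseudo-code (my own illustration; process_data_and_do_work() is assumed to exist elsewhere, and error handling is trimmed):

#include <sys/select.h>
#include <unistd.h>

void process_data_and_do_work(const char *data, ssize_t n); // expensive; assumed elsewhere

void serve(int sock) // sock: an already-connected TCP socket
{
    char buf[4096];
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);
        // Blocks until sock is readable; no CPU is consumed while waiting.
        if (select(sock + 1, &readfds, NULL, NULL, NULL) < 0)
            break; // error (real code would retry on EINTR)
        if (FD_ISSET(sock, &readfds)) {
            ssize_t n = read(sock, buf, sizeof buf); // read_available_data()
            if (n <= 0)
                break; // 0 = peer closed the connection, <0 = error
            process_data_and_do_work(buf, n);
        }
    }
}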
The code snippet is calling select() with a non-NULL timeout parameter. The code is waiting up to some maximum amount of time for the socket to become readable. If the timeout elapses, the socket is not readable, FD_ISSET() will return false, and the read() call is skipped. However, if the socket becomes readable before the timeout elapses, FD_ISSET() will return true, and a call to read() is guaranteed not to block the calling thread. It will return immediately, either returning whatever data is currently in the socket's receive buffer (up to len bytes max), or returning 0 if the remote party has disconnected gracefully.
I have been struggling with epoll for the last few days and right now I'm in the middle of nowhere ;)
There's a lot of information on the Internet, and obviously in the system man pages, but I have probably taken an overdose and I'm a bit confused.
In my server app (a backend to nginx) I'm waiting for data from clients in edge-triggered (ET) mode:
event_template.events = EPOLLIN | EPOLLRDHUP | EPOLLET
Things got curious when I noticed that nginx was responding with 502 despite my seeing a successful send() on my side. I ran wireshark to sniff and realised that my server was sending data (trying, and getting RST) to another machine on the net. So I decided that the socket descriptor was invalid and this was a sort of "undefined behaviour". Finally, I found out that on a second recv() I was getting zero bytes, which means the connection has been closed and I'm not allowed to send data back any more. Nevertheless, I was getting not just EPOLLIN from epoll, but EPOLLRDHUP along with it.
Question: Do I have to close the socket just for reading when recv() returns zero, and shutdown(SHUT_WR) later on during EPOLLRDHUP processing?
Reading from the socket in a nutshell:
std::array<char, BatchSize> batch;
ssize_t total_count = 0, count = 0;
do {
    count = recv(_handle, batch.data(), batch.size(), MSG_DONTWAIT);
    if (0 == count && 0 == total_count) {
        /// #??? Do I need to wait zero just on first iteration?
        close();
        return total_count;
    } else if (count < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /// #??? Will be back with next EPOLLIN?!
            break;
        }
        _last_error = errno;
        /// #brief just log the error
        return 0;
    }
    if (count > 0) {
        total_count += count;
        /// DATA!
        if (count < static_cast<ssize_t>(batch.size())) {
            /// #??? Received less than requested - no sense to repeat recv, otherwise I need one more turn?!
            return total_count;
        }
    }
} while (count > 0);
Probably my general mistake was the attempt to send data on an invalid socket descriptor, and everything that happened later was just a consequence. But I continued to dig ;) The second part of my question is about writing to a socket with MSG_DONTWAIT as well.
As far as I now know, send() may also return -1 with EAGAIN, which means that I'm supposed to subscribe to EPOLLOUT and wait until the kernel buffer is free enough to accept some data from me. Is this right? But what if the client won't wait that long? Or, may I call blocking send() (anyway, I'm sending from a different thread) and guarantee that everything I hand to the kernel will really be sent to the peer because of setsockopt(SO_LINGER)? And a final guess which I ask you to confirm: I'm allowed to read and write simultaneously, but N>1 concurrent writes are a data race, and all I need to deal with that is a mutex.
Thanks to everyone who at least read to the end :)
Question: Do I have to close the socket just for reading when recv() returns zero, and shutdown(SHUT_WR) later on during EPOLLRDHUP processing?
No, there is no particular reason to perform that somewhat convoluted sequence of actions.
Having received a 0 return value from recv(), you know that the connection is at least half-closed at the network layer. You will not receive anything further from it, and I would not expect EPoll operating in edge-triggered mode to further advertise its readiness for reading, but that does not in itself require any particular action. If the write side remains open (from a local perspective) then you may continue to write() or send() on it, though you will be without a mechanism for confirming receipt of what you send.
What you actually should do depends on the application-level protocol or message exchange pattern you are assuming. If you expect the remote peer to shut down the write side of its endpoint (connected to the read side of the local endpoint) while awaiting data from you, then by all means do send the data it anticipates. Otherwise, you should probably just close the whole connection and stop using it when recv() signals end-of-file by returning 0. Note well that close()ing the descriptor will remove it automatically from any EPoll interest sets in which it is enrolled, but only if there are no other open file descriptors referring to the same open file description.
Either way, until you do close() the socket, it remains valid, even if you cannot successfully communicate over it. Until then, there is no reason to expect that messages you attempt to send over it will go anywhere other than possibly to the original remote endpoint. Attempts to send may succeed, or they may appear to succeed even though the data never arrive at the far end, or they may fail with one of several different errors.
/// #??? Do I need to wait zero just on first iteration?
You should take action on a return value of 0 whether any data have already been received or not. Not necessarily identical action, but either way you should arrange one way or another to get it out of the EPoll interest set, quite possibly by closing it.
/// #??? Will be back with next EPOLLIN?!
If recv() fails with EAGAIN or EWOULDBLOCK then EPoll might very well signal read-readiness for it on a future call. Not necessarily the very next one, though.
/// #??? Received less than requested - no sense to repeat recv, otherwise I need one more turn?!
Receiving less than you requested is a possibility you should always be prepared for. It does not necessarily mean that another recv() won't return any data, and if you are using edge-triggered mode in EPoll then assuming the contrary is dangerous. In that case, you should continue to recv(), in non-blocking mode or with MSG_DONTWAIT, until the call fails with EAGAIN or EWOULDBLOCK.
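In other words, something like this sketch (assuming fd is a non-blocking socket registered with EPOLLET; handle_data() is an invented placeholder):

#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>

void handle_data(const char *data, ssize_t n); // invented placeholder

void drain(int fd)
{
    for (;;) {
        char buf[4096];
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n > 0) {
            handle_data(buf, n);   // consume what arrived, then try again
        } else if (n == 0) {
            close(fd);             // orderly shutdown by the peer
            return;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return;                // drained; wait for the next EPOLLIN
        } else if (errno != EINTR) {
            close(fd);             // real error
            return;
        }
    }
}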
As far as I now know, send() may also return -1 with EAGAIN, which means that I'm supposed to subscribe to EPOLLOUT and wait until the kernel buffer is free enough to accept some data from me. Is this right?
send() certainly can fail with EAGAIN or EWOULDBLOCK. It can also succeed, but send fewer bytes than you requested, which you should be prepared for. Either way, it would be reasonable to respond by subscribing to EPOLLOUT events on the file descriptor, so as to resume sending later.
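For example (a sketch, not authoritative: it assumes fd is already in the interest set of epfd, and that the caller buffers any unsent tail itself):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <cerrno>

// Try to send; on a short or would-block send, re-arm for EPOLLOUT so the
// event loop can resume sending later. Returns bytes consumed, or -1 on error.
ssize_t try_send(int epfd, int fd, const char *data, size_t len)
{
    ssize_t n = send(fd, data, len, MSG_DONTWAIT);
    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
        return -1;                               // real error
    if (n < 0 || static_cast<size_t>(n) < len) {
        struct epoll_event ev = {};
        ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev); // tell us when writable again
    }
    return n < 0 ? 0 : n;
}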
But what if the client won't wait that long?
That depends on what the client does in such a situation. If it closes the connection then a future attempt to send() to it would fail with a different error. If you were registered only for EPOLLOUT events on the descriptor then I suspect it would be possible, albeit unlikely, to get stuck in a condition where that attempt never happens because no further event is signaled. That likelihood could be reduced even further by registering for and correctly handling EPOLLRDHUP events, too, even though your main interest is in writing.
If the client gives up without ever closing the connection then EPOLLRDHUP probably would not be useful, and you're more likely to get the stale connection stuck indefinitely in your EPoll. It might be worthwhile to address this possibility with a per-FD timeout.
Or, may I call blocking send() (anyway, I'm sending from a different thread) and guarantee that everything I hand to the kernel will really be sent to the peer because of setsockopt(SO_LINGER)?
If you have a separate thread dedicated entirely to sending on that specific file descriptor then you can certainly consider blocking send()s. The main drawback is that you cannot implement a timeout on top of that, but other than that, what would such a thread do anyway other than block, either on sending data or on waiting for more data to send?
I don't see quite what SO_LINGER has to do with it, though, at least on the local side. By default, the kernel will make every attempt to deliver data you have already handed to it via send(), even if you close() the socket while data are still buffered. What SO_LINGER actually controls is the behaviour of close() when unsent data remain: with it enabled, close() blocks until the data are transmitted and acknowledged or the linger timeout expires (and a linger timeout of zero discards the data and resets the connection).
None of this can guarantee that the data are successfully delivered to the remote peer, however. Nothing can guarantee that.
And a final guess which I ask you to confirm: I'm allowed to read and write simultaneously, but N>1 concurrent writes are a data race, and all I need to deal with that is a mutex.
Sockets are full-duplex, yes. Moreover, POSIX requires most functions, including send() and recv(), to be thread safe. Nevertheless, multiple threads writing to the same socket is asking for trouble, for the thread safety of individual calls does not guarantee coherency across multiple calls.
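For instance, a minimal sketch of serializing whole messages with a mutex (blocking send() assumed; the names are mine):

#include <mutex>
#include <sys/socket.h>

std::mutex send_mutex;

// Send the entire buffer under one lock, so messages written by different
// threads cannot interleave on the socket.
bool send_message(int fd, const char *buf, size_t len)
{
    std::lock_guard<std::mutex> lock(send_mutex);
    while (len > 0) {
        ssize_t n = send(fd, buf, len, 0);
        if (n < 0)
            return false;
        buf += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}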
The WaitNamedPipe function allows a pipe client application to synchronously wait for an available connection on a named pipe server. You then call CreateFile to open the pipe as a client. Pseudocode:
// loop works around race condition with WaitNamedPipe and CreateFile
HANDLE hPipe;
while (true) {
    if (WaitNamedPipe says connection is ready) {
        hPipe = CreateFile(...);
        if (hPipe ok or last error is NOT pipe busy) {
            break; // hPipe is valid or last error is set
        }
    } else {
        break; // WaitNamedPipe failed
    }
}
The problem is that these are all blocking, synchronous calls. What is a good way to do this asynchronously? I can't seem to find an API that uses overlapped I/O to do this. For pipe servers, for example, the ConnectNamedPipe function provides an lpOverlapped parameter, allowing a server to asynchronously wait for a client. The pipe server can then call WaitForMultipleObjects and wait for the I/O operation to complete, or for any other event to be signaled (for example, an event signaling the thread to cancel pending I/O and terminate).
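For reference, the server-side pattern just described looks roughly like this (a sketch; it assumes hPipe was created with FILE_FLAG_OVERLAPPED, hCancelEvent is an illustrative shutdown event, and error handling is trimmed):

OVERLAPPED ov = {};
ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL); // manual-reset event
if (!ConnectNamedPipe(hPipe, &ov)) {
    DWORD err = GetLastError();
    if (err == ERROR_IO_PENDING) {
        HANDLE handles[2] = { ov.hEvent, hCancelEvent };
        DWORD which = WaitForMultipleObjects(2, handles, FALSE, INFINITE);
        if (which == WAIT_OBJECT_0 + 1) {
            CancelIo(hPipe); // shutdown requested: cancel the pending connect
        }
    } else if (err != ERROR_PIPE_CONNECTED) {
        // real failure (ERROR_PIPE_CONNECTED just means a client raced in)
    }
}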
The only way I can think of is to call WaitNamedPipe in a loop with a short, finite timeout and check other signals if it times out. Alternatively, in a loop call CreateFile, check other signals, and then call Sleep with a short delay (or WaitNamedPipe). For example:
HANDLE hPipe;
while (true) {
    hPipe = CreateFile(...);
    if (hPipe not valid and pipe is busy) {
        // sleep 100 milliseconds; alternatively, call WaitNamedPipe with timeout
        Sleep(100);
        // TODO: check other signals here to see if we should abort I/O
    } else
        break;
}
But this method stinks to high heaven in my opinion. If a pipe isn't available for a while, the thread continues to run - sucking up CPU, using power, requiring memory pages to remain in RAM, etc. In my mind, a thread that relies on Sleep or short timeouts does not perform well and is a sign of sloppy multi-threaded programming.
But what's the alternative in this case?
WaitNamedPipe is completely useless, and will just use all the CPU if you specify a timeout and there's no server waiting for it.
Just call CreateFile over and over with a Sleep like you're doing, and move it to other threads as you see appropriate. There is no API alternative.
The only "benefit" WaitNamedPipe provides is if you want to know if you can connect to a named pipe but you explicitly don't want to actually open a connection. It's junk.
If you really want to be thorough, your only options are
Ensure that whatever program is opening the named pipe always calls CreateNamedPipe again immediately after its named pipe is connected to.
Have your program actually check if that program is running.
If your intent is really not to have additional connections, still call CreateNamedPipe, and when someone connects, tell them to go away until they've waited a given amount of time, then close the pipe.
Why can't the server just create more pipes? The performance hit in the scenario you describe isn't a problem if it is rare.
I.e. if there are usually enough pipes to go round what does it matter if you use CreateFile/Sleep instead of WaitForMultipleObjects? The performance hit will not matter.
I also have to question the need for overlapped IO in a client. How many servers is it communicating with at a time? If the answer is less than, say, 10 you could reasonably create a thread per connection.
Basically, I am saying I think the reason there is no overlapped WaitNamedPipe is because there is no reasonable use-case which requires it.
You can open the pipe file system at \\.\pipe\ and then use DeviceIoControl to send FSCTL_PIPE_WAIT.
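A sketch of that approach, based on the FILE_PIPE_WAIT_FOR_BUFFER definition from the WDK's ntifs.h (redefined below for illustration; treat the details as an assumption to verify against your SDK):

#include <windows.h>
#include <winioctl.h>
#include <cstdlib>
#include <cstring>
#include <cwchar>

#define FSCTL_PIPE_WAIT CTL_CODE(FILE_DEVICE_NAMED_PIPE, 6, METHOD_BUFFERED, FILE_ANY_ACCESS)

typedef struct _FILE_PIPE_WAIT_FOR_BUFFER {
    LARGE_INTEGER Timeout;    // relative timeouts are negative, in 100 ns units
    ULONG NameLength;         // byte length of Name
    BOOLEAN TimeoutSpecified; // FALSE = use the server's default timeout
    WCHAR Name[1];            // pipe name WITHOUT the \\.\pipe\ prefix
} FILE_PIPE_WAIT_FOR_BUFFER;

// hFs is the pipe file system itself, opened for overlapped I/O, e.g.:
// CreateFileW(L"\\\\.\\pipe\\", GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE,
//             NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
// Issues an overlapped "wait for pipe instance" for e.g. name = L"mypipe".
// ov->hEvent is signaled when an instance becomes available, so it can sit
// in a WaitForMultipleObjects alongside a cancel event.
BOOL WaitPipeAsync(HANDLE hFs, const wchar_t *name, OVERLAPPED *ov)
{
    DWORD nameBytes = (DWORD)(wcslen(name) * sizeof(WCHAR));
    DWORD size = FIELD_OFFSET(FILE_PIPE_WAIT_FOR_BUFFER, Name) + nameBytes;
    FILE_PIPE_WAIT_FOR_BUFFER *req = (FILE_PIPE_WAIT_FOR_BUFFER *)calloc(1, size);
    if (!req) return FALSE;
    req->TimeoutSpecified = FALSE;
    req->NameLength = nameBytes;
    memcpy(req->Name, name, nameBytes);
    DWORD bytes;
    BOOL ok = DeviceIoControl(hFs, FSCTL_PIPE_WAIT, req, size, NULL, 0, &bytes, ov);
    free(req); // safe: METHOD_BUFFERED input is copied by the I/O manager
    return ok || GetLastError() == ERROR_IO_PENDING;
}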
I have a simple server/client program that I'm working on. I'm using select() to wait for data to come in on a TCP socket before reading it in. When the data comes in, I use several recv() and select() calls to read in the chunks until I have it all. Then I loop back to the initial select() call and see if the client has anything else to send.
struct timeval timeoutCounter;
fd_set readFileDescriptor;
do {
    timeoutCounter.tv_sec = 30;
    timeoutCounter.tv_usec = 0;
    FD_ZERO(&readFileDescriptor);
    FD_SET(socket, &readFileDescriptor);
    cout << "This line always prints, every iteration through the loop.\n";
    dataReady = select(socket + 1, &readFileDescriptor, NULL, NULL, &timeoutCounter);
    cout << "This line only prints the first time I call select(). "
         << "The second time it hangs before reaching this line.\n";
    // ... recv(), select(), recv(), select(), etc in a loop until I have all the data
    // send() a response to the client
} while (dataReady > 0);
I started off with all of this in a big, hard-to-read function, and it worked. Then I broke it out into a separate class from the one that accept()s the connection, and now its behavior is different. The first data set that the user sends comes in fine. But the client waits for a response from the server and then sends a second set of data to the socket. However, select() doesn't return after the client sends the second set of data; it blocks until it times out.
I've already ruled the client out as being the problem; the packets are sent fine and at the appropriate time. I've also tried printing the socket file descriptor to prove that it does not change somewhere. Does anyone have any idea why this code might not work? What are the factors that might cause select() to block?
EDIT: It looks like my code runs fine on 32-bit machines, but fails on 64-bit machines. I still haven't solved the problem, but that narrows it down a good bit.
Without seeing your complete code it's hard to tell what might be wrong. However, the select() function modifies the fd_set values passed to it. You will need to make sure that you reinitialise each fd_set value before calling select(), so that you include the socket(s) you want.
Remember also that the recv() function will block until it gets some data (or the socket is closed) so unless you really need the timeout functionality you may not even need to call select(). Finally, the recv() function will return if any data is available, not necessarily all of what you asked for. You will have to repeatedly call recv() in a loop to get all the data. This is true even if you're reading a small number of bytes.
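For example, a common sketch for reading an exact number of bytes from a blocking socket (my own illustration):

#include <sys/socket.h>

// Read exactly len bytes; returns false on EOF or error.
bool recv_all(int sock, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(sock, buf + got, len - got, 0);
        if (n <= 0)
            return false;   // 0 = connection closed, <0 = error
        got += static_cast<size_t>(n);
    }
    return true;
}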
One can use poll/select when writing a server that can service multiple clients all in the same thread. select and poll, however, need a file descriptor to work. For this reason, I am uncertain how to perform simple asynchronous operations, like implementing a simple callback to break up a long-running operation, or a delayed callback, without exiting the select/poll loop. How does one go about doing this? Ideally, I would like to do this without resorting to spawning new threads.
In a nutshell, I am looking for a mechanism with which I can perform ALL asynchronous operations. The Windows WaitForMultipleObjects or the Symbian TRequestStatus seem much better suited to generalized asynchronous operations.
For arbitrary callbacks, maintain a POSIX pipe (see pipe(2)). When you want to do a deferred call, write a struct consisting of a function pointer and optional context pointer to the write end. The read end is just another input for select. If it selects readable, read the same struct, and call the function with the context as argument.
For timed callbacks, maintain a list in order of due time. Entries in the list are structs of e.g. { due time (as interval since previous callback); function pointer; optional context pointer }. If this list is empty, block forever in select(). Otherwise, timeout when the first event is due. Before each call to select, recalculate the first event's due time.
Hide the details behind a reasonable interface.
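A sketch of the deferred-call half of that design (POSIX; the names are mine, and a write of this small struct is atomic since its size is well under PIPE_BUF):

#include <unistd.h>
#include <fcntl.h>

struct deferred_call {
    void (*fn)(void *); // function to run on the event-loop thread
    void *ctx;          // opaque context passed through
};

int pipefd[2];          // pipefd[0] goes into the select() read set

void init_deferred(void)
{
    pipe(pipefd);
    fcntl(pipefd[0], F_SETFL, O_NONBLOCK); // so draining stops when empty
}

// Producer side (any thread): schedule fn(ctx).
void defer(void (*fn)(void *), void *ctx)
{
    deferred_call c = { fn, ctx };
    write(pipefd[1], &c, sizeof c); // atomic: sizeof c <= PIPE_BUF
}

// Event-loop side: call when select() reports pipefd[0] readable.
void drain_deferred(void)
{
    deferred_call c;
    while (read(pipefd[0], &c, sizeof c) == (ssize_t)sizeof c)
        c.fn(c.ctx); // run the callback on the loop thread
}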
select() and poll() are syscalls: your program calls into the OS kernel, and the calling thread can do nothing else while waiting for the call to return, unless you use another thread.
Although select() and poll() are used for async I/O, the functions themselves are not async: they will block (unless you specify a timeout) until something happens on one of the descriptors you are watching.
The best strategy would be to check the descriptors from time to time (specifying a small timeout value), and if there is nothing ready, do what you want to do in idle time; otherwise, process the I/O.
You could take advantage of the timeout of select() or poll() to do your background stuff periodically:
for ( ;; ) {
    ...
    int fds = select(<fds and timeout>);
    if (fds < 0) {
        <error occurred>
    } else if (fds == 0) {
        <handle timeout, do some background work.>
    } else {
        <handle the active file descriptors>
    }
}
For an immediate callback using the select loop, one can use one of the special files, like /dev/zero, that are always readable. This will allow select() to return immediately, while still allowing other files to become active as well.
For timed delays, I can only think of using the timeout on select().
Both of the above don't feel great, so please send better answers.
When I run this program
OVERLAPPED o;

int main()
{
    ..
    CreateIoCompletionPort(....);
    for (int i = 0; i < 10; i++)
    {
        WriteFile(.., &o);
        OVERLAPPED* po;
        GetQueuedCompletionStatus(.., &po);
    }
}
it seems that WriteFile doesn't return until the write has completed, and at the same time GetQueuedCompletionStatus() receives the completion. The behavior is like a synchronous I/O operation rather than an asynchronous one.
Why is that?
If the file handle and volume have write caching enabled, the file operation may complete with just a memory copy to cache, to be flushed lazily later. Since there is no actual IO taking place, there's no reason to do async IO in that case.
Internally, each IO operation is represented by an IRP (IO request packet). It is created by the kernel and given to the filesystem to handle the request, where it passes down through layered drivers until the request becomes an actual disk controller command. That driver will make the request, mark the IRP as pending and return control of the thread. If the handle was opened for overlapped IO, the kernel gives control back to your program immediately. Otherwise, the kernel will wait for the IRP to complete before returning.
Not all IO operations make it all the way to the disk, however. The filesystem may determine that the write should be cached, and not written until later. There is even a special path for operations that can be satisfied entirely using the cache, called fast IO. Even if you make an asynchronous request, fast IO is always synchronous because it's just copying data into and out of cache.
Process monitor, in advanced output mode, displays the different modes and will show blank in the status field while an IRP is pending.
There is a limit to how much data is allowed to be outstanding in the write cache. Once it fills up, write operations will no longer complete immediately. Try writing a lot of data at once, with many operations.
I wrote a blog posting a while back entitled "When are asynchronous file writes not asynchronous" and the answer was, unfortunately, "most of the time". See the posting here: http://www.lenholgate.com/blog/2008/02/when-are-asynchronous-file-writes-not-asynchronous.html
The gist of it is:
For security reasons Windows extends files in a synchronous manner
You can attempt to work around this by setting the end of the file to a large value before you start and then trimming the file to the correct size when you finish (see the sketch after this list).
You can tell the cache manager to use your buffers and not its own, by using FILE_FLAG_NO_BUFFERING
At least it's not as bad as if you're forced to use FILE_FLAG_WRITE_THROUGH
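Here is a rough sketch of the extend-then-trim workaround from the list above (expectedMaxSize and finalSize are assumptions supplied by the caller; hFile was opened with FILE_FLAG_OVERLAPPED):

#include <windows.h>

void preextend(HANDLE hFile, LONGLONG expectedMaxSize)
{
    LARGE_INTEGER sz;
    sz.QuadPart = expectedMaxSize; // assumed upper bound, known up front
    SetFilePointerEx(hFile, sz, NULL, FILE_BEGIN);
    SetEndOfFile(hFile);           // pay the synchronous extension cost once
}

void trim(HANDLE hFile, LONGLONG finalSize)
{
    LARGE_INTEGER sz;
    sz.QuadPart = finalSize;       // after all overlapped writes complete
    SetFilePointerEx(hFile, sz, NULL, FILE_BEGIN);
    SetEndOfFile(hFile);           // trim to the real size
}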
If GetQueuedCompletionStatus returns, then the call to WriteFile has completed (and has returned); but if the operation is asynchronous, the system can still modify the OVERLAPPED structure o even after WriteFile itself has returned.
from this page in MSDN:
For asynchronous write operations, hFile can be any handle opened with the CreateFile function using the FILE_FLAG_OVERLAPPED flag or a socket handle returned by the socket or accept function.
also, from this page:
If a handle is provided, it has to have been opened for overlapped I/O completion. For example, you must specify the FILE_FLAG_OVERLAPPED flag when using the CreateFile function to obtain the handle.