Epoll zero recv() and negative(EAGAIN) send()

Epoll zero recv() and negative(EAGAIN) send() - c++

I was struggling with epoll last days and I'm in the middle of nowhere right now ;)
There's a lot of information on the Internet and obviously in the system man but I probably took an overdose and a bit confused.
In my server app(backend to nginx) I'm waiting for data from clients in the ET mode:
event_template.events = EPOLLIN | EPOLLRDHUP | EPOLLET
Everything has become curious when I have noticed that nginx is responding with 502 despite I could see successful send() on my side. I run wireshark
to sniff and have realised that my server sends(trying and getting RST) data to another machine on the net. So, I decided that socket descriptor is invalid and this is sort of "undefined behaviour". Finally, I found out that on a second recv() I'm getting zero bytes which means that connection has to be closed and I'm not allowed to send data back anymore. Nevertheless, I was getting from epoll not just EPOLLIN but EPOLLRDHUP in a row.
Question: Do I have to close socket just for reading when recv() returns zero and shutdown(SHUT_WR) later on during EPOLLRDHUP processing?
Reading from socket in a nutshell:
std::array<char, BatchSize> batch;
ssize_t total_count = 0, count = 0;
do {
count = recv(_handle, batch.begin(), batch.size(), MSG_DONTWAIT);
if (0 == count && 0 == total_count) {
/// #??? Do I need to wait zero just on first iteration?
close();
return total_count;
} else if (count < 0) {
if (errno == EAGAIN || errno == EWOULDBLOCK) {
/// #??? Will be back with next EPOLLIN?!
break ;
}
_last_error = errno;
/// #brief just log the error
return 0;
}
if (count > 0) {
total_count += count;
/// DATA!
if (count < batch.size()) {
/// #??? Received less than requested - no sense to repeat recv, otherwise I need one more turn?!
return total_count;
}
}
} while (count > 0);
Probably, my the general mistake was attempt to send data on invalid socket descriptor and everything what happens later is just a consequence. But, I continued to dig ;) My second part of a question is about writing to a socket in MSG_DONTWAIT mode as well.
As far as I now know, send() may also return -1 and EAGAIN which means that I'm supposed to subscribe on EPOLLOUT and wait when kernel buffer will be free enough to receive some data from my me. Is this right? But what if client won't wait so long? Or, may I call blocking send(anyway, I'm sending on a different thread) and guarantee the everything what I send to kernel will be really sent to peer because of setsockopt(SO_LINGER)? And a final guess which I ask to confirm: I'm allowed to read and write simultaneously, but N>1 concurrent writes is a data race and everything that I have to deal with it is a mutex.
Thanks to everyone who at least read to the end :)

Questions: Do I have to close socket just for reading when recv()
returns zero and shutdown(SHUT_WR) later on during EPOLLRDHUP
processing?
No, there is no particular reason to perform that somewhat convoluted sequence of actions.
Having received a 0 return value from recv(), you know that the connection is at least half-closed at the network layer. You will not receive anything further from it, and I would not expect EPoll operating in edge-triggered mode to further advertise its readiness for reading, but that does not in itself require any particular action. If the write side remains open (from a local perspective) then you may continue to write() or send() on it, though you will be without a mechanism for confirming receipt of what you send.
What you actually should do depends on the application-level protocol or message exchange pattern you are assuming. If you expect the remote peer to shutdown the write side of its endpoint (connected to the read side of the local endpoint) while awaiting data from you then by all means do send the data it anticipates. Otherwise, you should probably just close the whole connection and stop using it when recv() signals end-of-file by returning 0. Note well that close()ing the descriptor will remove it automatically from any Epoll interest sets in which it is enrolled, but only if there are no other open file descriptors referring to the same open file description.
Any way around, until you do close() the socket, it remains valid, even if you cannot successfully communicate over it. Until then, there is no reason to expect that messages you attempt to send over it will go anywhere other than possibly to the original remote endpoint. Attempts to send may succeed, or they may appear to do even though the data never arrive at the far end, or the may fail with one of several different errors.
/// #??? Do I need to wait zero just on first iteration?
You should take action on a return value of 0 whether any data have already been received or not. Not necessarily identical action, but either way you should arrange one way or another to get it out of the EPoll interest set, quite possibly by closing it.
/// #??? Will be back with next EPOLLIN?!
If recv() fails with EAGAIN or EWOULDBLOCK then EPoll might very well signal read-readiness for it on a future call. Not necessarilly the very next one, though.
/// #??? Received less than requested - no sense to repeat recv, otherwise I need one more turn?!
Receiving less than you requested is a possibility you should always be prepared for. It does not necessarily mean that another recv() won't return any data, and if you are using edge-triggered mode in EPoll then assuming the contrary is dangerous. In that case, you should continue to recv(), in non-blocking mode or with MSG_DONTWAIT, until the call fails with EAGAIN or EWOULDBLOCK.
As far as I now know, send() may also return -1 and EAGAIN which means that I'm supposed to subscribe on EPOLLOUT and wait when kernel buffer will be free enough to receive some data from my me. Is this right?
send() certainly can fail with EAGAIN or EWOULDBLOCK. It can also succeed, but send fewer bytes than you requested, which you should be prepared for. Either way, it would be reasonable to respond by subscribing to EPOLLOUT events on the file descriptor, so as to resume sending later.
But what if client won't wait so long?
That depends on what the client does in such a situation. If it closes the connection then a future attempt to send() to it would fail with a different error. If you were registered only for EPOLLOUT events on the descriptor then I suspect it would be possible, albeit unlikely, to get stuck in a condition where that attempt never happens because no further event is signaled. That likelihood could be reduced even further by registering for and correctly handling EPOLLRDHUP events, too, even though your main interest is in writing.
If the client gives up without ever closing the connection then EPOLLRDHUP probably would not be useful, and you're more likely to get the stale connection stuck indefinitely in your EPoll. It might be worthwhile to address this possibility with a per-FD timeout.
Or, may I call blocking send(anyway, I'm sending on a different
thread) and guarantee the everything what I send to kernel will be
really sent to peer because of setsockopt(SO_LINGER)?
If you have a separate thread dedicated entirely to sending on that specific file descriptor then you can certainly consider blocking send()s. The only drawback is that you cannot implement a timeout on top of that, but other than that, what would such a thread do if it blocking either on sending data or on receiving more data to send?
I don't see quite what SO_LINGER has to do with it, though, at least on the local side. The kernel will make every attempt to send data that you have already dispatched via a send() call to the remote peer, even if you close() the socket while data are still buffered, regardless of the value of SO_LINGER. The purpose of that option is to receive (and drop) straggling data associated with the connection after it is closed, so that they are not accidentally delivered to another socket.
None of this can guarantee that the data are successfully delivered to the remote peer, however. Nothing can guarantee that.
And a final guess which I ask to confirm: I'm allowed to read and
write simultaneously, but N>1 concurrent writes is a data race and
everything that I have to deal with it is a mutex.
Sockets are full-duplex, yes. Moreover, POSIX requires most functions, including send() and recv(), to be thread safe. Nevertheless, multiple threads writing to the same socket is asking for trouble, for the thread safety of individual calls does not guarantee coherency across multiple calls.

Related

Socket data race [duplicate]

This question already has answers here:
Are parallel calls to send/recv on the same socket valid?
(3 answers)
Closed 7 years ago.
Sockets can generally two way communicate, therefore the same socket can be used to send and recv.
If I wanted to send some data (on another thread) while the socket is getting read, what would the kernel do? This is applied for both parts.
Consider this example: the server is sending you a file and say it will take a lot (low uplink or a very big file). The user gets bored and decides to SIGINT you. You catch it and tell the server to stop sending the file (with some kind of message).
Will you be able to send to tell the server to stop sending even though you're reading from it? And of course, that's applied to the server-side as well.
Hopefully I've been enough clear.

If I wanted to send some data (on another thread) while the socket is getting read, what would the kernel do?
Nothing special... sockets aren't like garden hoses... there's just some meta-data added to a packet that's sent between the machines, so the reading and writing happen independently (apart perhaps from if one side calls recv() on a socket that has unsent data in the local buffers due to the Nagle algorithm, which bunches up data into sensible size packets, it might time-out immediately and send whatever it can, but any tuning of that would be an implementation latency-tuning detail and doesn't change the big picture or way the client and server call the TCP API).
Consider this example: the server is sending you a file and say it will take a lot (low uplink or a very big file). The user gets bored and decides to SIGINT you. You catch it and tell the server to stop sending the file (with some kind of message). Will you be able to send to tell the server to stop sending even though you're reading from it? And of course, that's applied to the server-side as well.
The kernel accepts a limited amount of data to be sent, and a limited amount of data received, after which it forces the sending side to wait until some has been consumed before sending more. So, if you've sent data to a server, then get a local SIGINT and send a "oh, cancel that" in the same way, the server must read all the already-sent data before it can see the "oh, cancel that". If instead of sending it "in the same way" you turn on the Out Of Band (OOB) flag while sending the cancel message, then the server can (if it's written to do so) detect that there's OOB data and read it before it's completed reading/processing the other data. It will still need to read and discard whatever in-band data you've already sent, but the flow control / buffering mentioned above means that should be a manageable amount - far less than your file size might be. Throughout all this, whatever you want to recv or the server sends is independent and unaffected by the large client->server send, any OOB data etc..
There's a discussion and example code from GNU at http://www.gnu.org/software/libc/manual/html_node/Out_002dof_002dBand-Data.html

Thread 1 can safely write to the socket (with send) whilst thread 2 reads from the socket (with recv). What you need to be careful of is that at the point where you close() the socket the threads are synchronised, else the file descriptor may be used elsewhere, so the other thread (if not synchronized) could read from a file descriptor now used for something else. One way to achieve this would be for your reading thread to shutdown the file descriptor, which should cause the other end to drop the connection and thus error an in-progress send.

what does blocking means in setsockopt parameter SO_RCVTIMEO

When i was taking a look at setsockopt from msdn link. i came across a parameter SO_RCVTIMEO, it description is "Sets the timeout, in milliseconds, for blocking receive calls." I thought the socket listen operation is event driven which means when kernel drained frame from NIC card it notify my program socket, so what is the blocking all about?

The recv and WSARecv functions are blocking. They are not event driven (at least not at the calling level). Even when blocking has a timeout (as set with the SO_RECTIMEO option), they are not event driven as far as your code is concerned. In that case, they are just pseudo-blocking (arguably non-blocking depending on how short the timeout is).
When you call WSARecv, it will wait until data is ready to be read. While data is not ready to be read, it just waits. This is why it's considered blocking.
You are correct that at it's core networking is event driven. Under the hood, computers are, by nature, event driven. It's the way hardware works. Hardware interrupts are essentially events. You're right that at a low level what is happening is that your NIC card is telling the OS that it's ready to be read. At that level, it is indeed event based.
The problem is that WSARecv waits for that event.
Here's a hopefully clear analogy. Imagine that you for some reason cannot leave your house. Now imagine that your friend F lives next door. Additionally, assume that your other friend G is at your house.
Now imagine that you give G a piece of paper with a question on it and ask him to take it to F.
Once the question has been sent, imagine that you send G to go get F's response. This is like the recv call. G will wait until F has written down his response, then he will bring it to you. G does not immediately turn around and come back if F hasn't written it yet.
This is where the gap comes from. G is indeed aware of the "F wrote!" events, but you're not. You're not directly watching the piece of paper.
Setting a timeout means that you're telling G to wait at most some amount of time before giving up and coming back. In this situation, G is still waiting on F to write, but if F doesn't write within x milliseconds, G turns around and comes back empty handed.
Basically the pseudo code of recv is vaguely like:
1) is data available?
1a) Yes: read it and return
1b) No: GOTO 2
2) Wait until an event is received
2a) GOTO 1
I know this has been a horribly convoluted explanation, but my main point is this: recv is interacting with the events, not your code. recv blocks until one of those events is received. If a timeout is set, it blocks until either one of those events is received, or the timeout is reached.

Sockets are NOT event-driven by default. You have to write extra code to enable that. A socket is initially created in a blocking mode instead. This means that a call to send(), recv(), or accept() will block the calling thread indefinately by default until the requested operation is finished.
For recv(), that means the calling thread is blocked until there is at least 1 byte available to read from the socket's receive buffer, or until a socket error occurs, whichever occurs first. SO_RCVTIMEO allows you to set a timeout on the blocking read so recv() exits with a WSAETIMEDOUT error if no incoing data becomes available before the timeout elapses.
Another way to implement a timeout is to set the socket to a non-blocking mode instead via ioctlsocket(FIONBIO) and then call select() with a timeout, then call recv() or accept() only if select() reports that the socket is in a readible state, and send() only if select() reports the socket is in a writable state. But this requires more code to manage cases where the socket would enter a blocking state, causing operations to fail with WSAEWOULDBLOCK errors.

Non-blocking TCP socket and flushing right after send?

I am using Windows socket for my application(winsock2.h). Since the blocking socket doesn't let me control connection timeout, I am using non-blocking one. Right after send command I am using shutdown command to flush(I have to). My timeout is 50ms and the thing I want to know is if the data to be sent is so big, is there a risk of sending only a portion of data or sending nothing at all? Thanks in advance...
hSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
u_long iMode=1;
ioctlsocket(hSocket,FIONBIO,&iMode);
connect(hSocket, (sockaddr*)(&sockAddr),sockAddrSize);
send(hSocket, sendbuf, sendlen, 0);
shutdown(hSocket, SD_BOTH);
Sleep(50);
closesocket(hSocket);

Non-blocking TCP socket and flushing right after send?
There is no such thing as flushing a TCP socket.
Since the blocking socket doesn't let me control connection timeout
False. You can use select() on a blocking socket.
I am using non-blocking one.
Non sequitur.
Right after send command I am using shutdown command to flush(I have to).
You don't have to, and shutdown() doesn't flush anything.
My timeout is 50ms
Why? The time to send data depends on the size of the data. Obviously. It does not make any sense whatsoever to use a fixed timeout for a send.
and the thing I want to know is if the data to be sent is so big, is there a risk of sending only a portion of data or sending nothing at all?
In blocking mode, all the data you provided to send() will be sent if possible. In non-blocking mode, the amount of data represented by the return value of send() will be sent, if possible. In either case the connection will be reset if the send fails. Whatever timeout mechanism you superimpose can't possibly change any of that: specifically, closing the socket asynchronously after a timeout will only cause the close to be appended to the data being sent. It will not cause the send to be aborted.
Your code wouldn't pass any code review known to man. There is zero error checking; the sleep is completely pointless; and shutdown before close is redundant. If the sleep is intended to implement a timeout, it doesn't.
I want to be sending data as fast as possible.
You can't. TCP implements flow control. There is exactly nothing you can do about that. You are rate-limited by the receiver.
Also the 2 possible cases are: server waits too long to accept connection
There is no such case. The client can complete a connection before the server ever calls accept(). If you're trying to implement a connect timeout shorter than the default of about a minute, use select().
or receive.
Nothing you can do about that: see above.
So both connecting and writing should be done in max of 50ms since the time is very important in my situation.
See above. It doesn't make sense to implement a fixed timeout for operations that take variable time. And 50ms is far too short for a connect timeout. If that's a real issue you should keep the connection open so that the connect delay only happens once: in fact you should keep TCP connections open as long as possible anyway.
I have to flush both write and read streams
You can't. There is no operation in TCP that will flush either a read stream or a write stream.
because the server keeps sending me unnecessarly big data and I have limited internet connection.
Another non sequitur. If the server sends you data, you have to read it, otherwise you will stall the server, and that doesn't have anything to do with flushing your own write stream.
Actually I don't even want a single byte from the server
Bad luck. You have to read it. [If you were on BSD Unix you could shutdown the socket for input, which would cause data from the server to be thrown away, but that doesn't work on Windows: it causes the server to get a connection reset.]

Thanks to EJP and Martin, now I have created a second thread to check. Also in the code I posted in my question, I added "counter=0;" line after the "send" line and removed shutdown. It works just as I wanted now. It never waits more than 50ms :) Really big thanks
unsigned __stdcall SecondThreadFunc( void* pArguments )
{
while(1)
{
counter++;
if (counter > 49)
{
closesocket(hSocket);
counter = 0;
printf("\rtimeout");
}
Sleep(1);
}
return 0;
}

Unix Sockets: select() with more than one set associated does more than it should do

I have the following select call for tcp sockets:
ret = select(nfds + 1, &rfds, &rfds2, NULL, &tv);
rfds2 is used when I send to large data (non-blocking mode). And rfds is there to detect if we received something on the socket.
Now, when the send buffer is empty, I detect it with rfds2. But at the same time I get the socket back in rfds, although there is nothing that I received on that socket.
Is that the intended behaviour of the select-call? How can I distinguish orderly between the send and the recieve case?

Now, when the send buffer is empty, I
detect it with rfds2
That's not correct. select() will detect when the send buffer has room. It is hardly ever correct to register a socket for OP_READ and OP_WRITE simultaneously. OP_WRITE is almost always ready, except in the brief intervals when the send buffer is full.

Thanks for your answers. I have found the problem for myself:
The faulty code was after the select call (how I used FD_ISSET() to determine which action I can do).
I think my assumption is true, that there is only a socket in rfds, when there is really some data that can be received.

If the socket is non-blocking that seems to be the expected behaviour. The manual page for select has this to say about the readfds argument:
Those listed in readfds will be
watched to see if characters become
available for reading (more
precisely, to see if a read will not
block; in particular, a file
descriptor is also ready on
end-of-file)
Because the socket is non-blocking it is true that a read would not block and hence it is reasonable for that bit to be set.
It shouldn't cause a problem because if you try and read from the socket you will simply get nothing returned and the read won't block.
As a rule of thumb, whenever select returns you should process each socket that it indicates is ready, either reading and processing whatever data is available if it returns as ready-to-read, or writing more data if it returns as ready-to-write. You shouldn't assume that only one event will be signalled each time it returns.

recv() with errno=107:(transport endpoint connected)

well..I use a typical model of epoll+multithread to handle massive sockets, that is, I have a thread called epollWorkThread that use epoll_wait to handle i/o sockets. While there's an event of EPOLLIN, recv() will do the work and I do use the noblocking mode to allow immediate return. And recv() is indeed in a while(true) loop.
Everything is fine in the intial time(maybe a couple of hours or maybe minutes or if I'm lucky days), I can receive the information. But some time later, recv() insists to return -1 with the errno = 107(ENOTCONN). The other peer of the transport is written in AS3 which makes sure that the socket is connected. So I'm confused by the recv() behaviour. Thank you in advance and any comment is appreciated!

Errno 107 means that the socket is NOT connected (any more).
There are several reasons why this could happen. Assuming you're right and both sides of the connection claim that the socket is still open, an intermediate router/switch may have dropped the connection due to a timeout. The safest way to avoid such things from happen is to periodically send a 'health' or 'keep-alive' message. (Thus the intermediate router/switch accepts the connection as living...)=

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js