I have a C++ application that includes this function:
int
mySelect(const int fdMaxPlus1,
fd_set *readFDset,
fd_set *writeFDset,
struct timeval *timeout)
{
retry:
const int selectReturn
= ::select(fdMaxPlus1, readFDset, writeFDset, NULL, timeout);
if (selectReturn < 0 && EINTR == errno) {
// Interrupted system call, such as for profiling signal, try again.
goto retry;
}
return selectReturn;
}
Normally, this code works just fine. In one instance, however, I saw it get into an infinite loop where select() kept failing with errno set to EINTR. In this case, the caller had set the timeout to zero seconds and zero microseconds, meaning don't wait and return the select() result immediately. I thought that EINTR only occurs when a signal handler runs, so why would I keep getting a signal over and over again (for over 12 hours)? This is CentOS 5. Once I ran this under the debugger to see what was happening, the code returned without EINTR after a couple of iterations. Note that the fd being checked is a socket.
I could add a retry limit to the above code, but I'd like to understand what is going on first.
On Linux, select(2) may modify the timeout argument (passed by address) to reflect the time not slept. So you should copy it before each call and pass the copy:
retry:
struct timeval timeoutcopy = *timeout; // assumes timeout is non-NULL
const int selectReturn
= ::select(fdMaxPlus1, readFDset, writeFDset, NULL, &timeoutcopy);
(in your code, the timeout is probably zero or very small after a few iterations, or even after the first one)
BTW, I suggest using poll(2) rather than select (poll is more C10K-problem friendly).
BTW, EINTR happens on any signal (see signal(7)), even without a registered signal handler.
You might use strace to understand the overall behavior of your program.
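A minimal sketch of the two ideas discussed in this thread: restoring the timeout before every attempt, and capping the number of EINTR retries. The name mySelectBounded and the maxRetries parameter are made up for illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/select.h>
#include <sys/time.h>

// Hypothetical variant of mySelect: retries on EINTR at most maxRetries
// times and hands select() a fresh copy of the timeout on every attempt,
// because Linux may overwrite the struct it is given.
int mySelectBounded(int fdMaxPlus1, fd_set *readFDset, fd_set *writeFDset,
                    const struct timeval *timeout, int maxRetries)
{
    assert(maxRetries >= 0);
    for (int attempt = 0; ; ++attempt) {
        struct timeval timeoutCopy;
        struct timeval *timeoutArg = nullptr;  // NULL means block indefinitely
        if (timeout != nullptr) {
            timeoutCopy = *timeout;
            timeoutArg = &timeoutCopy;
        }
        const int rc =
            ::select(fdMaxPlus1, readFDset, writeFDset, nullptr, timeoutArg);
        if (rc < 0 && errno == EINTR && attempt < maxRetries) {
            continue;  // interrupted by a signal: try again
        }
        return rc;  // ready count, timeout, hard error, or retries exhausted
    }
}
```

Note that each retry waits for the full timeout again, so the total wait can exceed the caller's value; tracking a deadline would be needed if that matters.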
Related
I came to maintain a piece of software that does:
/*... init, setting timeout, etc ... */
FD_ZERO(&set);
FD_SET(socket_, &set);
int selectRes = select(socket_ + 1, &set, NULL, NULL, &timeout);
if (selectRes < 0) {
throw IoException("Select: ", errno);
}
if (selectRes == 0) {
throw TimeoutException();
}
/* ... then handle recvfrom, throw IoException if return < 0 ... */
IoException should cause the program to terminate. Timeout exception resumes operation. Passing without exception loops back. socket_ is a UNIX datagram socket (reading messages from another local process).
This program runs at very high priority (required to react to messages quickly) but it's expected to be idle most of the time, hanging on select()'s timeout waiting for incoming messages. Meanwhile, it seems like it sometimes hogs 100% of CPU time (without receiving enough messages to warrant such behavior). The occurrence is rather erratic, and the program's high priority makes debugging it very hard (it's a small single-core embedded Linux system, and everything else grinds to a halt).
I'm worried about the NULL in the errorfds position - is testing the return value of select() enough in this case, or may select() return immediately (with 0) if there's an error condition on the socket but errorfds is NULL, and keep repeating doing so every time it loops back to select()?
Or alternatively, what other circumstances, other than an avalanche of messages, could make select() exit immediately (or maybe wait in a spinlock instead of freeing up the CPU time)?
In server code, I want to use pselect to wait for clients to connect as well as to monitor the standard output of the processes that I create and send it to the client (like a simplified remote shell).
I tried to find examples on how to use pselect but I haven't found any. The socket where the client can connect is already set up and works, as I verified that with accept(). SIGTERM is blocked.
Here is the code where I try to use pselect:
void waitClient()
{
fd_set readers;
fd_set writers;
fd_set exceptions;
struct timespec ts;
// Loop until we get a sigterm to shutdown
while(getSigTERM() == false)
{
FD_ZERO(&readers);
FD_ZERO(&writers);
FD_ZERO(&exceptions);
FD_SET(fileno(stdin), &readers);
FD_SET(fileno(stdout), &writers);
FD_SET(fileno(stderr), &writers);
FD_SET(getServerSocket()->getSocketId(), &readers);
//FD_SET(getServerSocket()->getSocketId(), &writers);
memset(&ts, 0, sizeof(struct timespec));
int pret = pselect(FD_SETSIZE, &readers, &writers, &exceptions, &ts, &mSignalMask);
// Here pselect always returns with 2. What does this mean?
cout << "pselect returned..." << pret << endl;
cout.flush();
}
}
So what I want to know is how to wait with pselect until an event is received, because currently pselect always returns immediately with a value of 2. I tried setting the timeout to NULL, but that doesn't change anything.
Is the return value of pselect (if positive) the file descriptor that caused the event?
I'm using fork() to create new processes (not implemented yet), and I know that I have to wait() on them. Can I wait on them here as well? I suppose I need to catch the SIGCHLD signal, so how would I use that? wait() on the child would also block; can I just do a peek and then continue with pselect? Otherwise I would have two concurrent blocking waits.
It returns immediately because the file descriptors in the writers set are ready. The standard output streams will almost always be ready for writing.
And if you check a select manual page, you will see that the return value is -1 on error, 0 on timeout, or a positive number telling you how many file descriptors are ready.
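To illustrate the point about the return value, here is a small sketch that watches only descriptors for reading and uses FD_ISSET to find out which ones are ready. The helper name waitReadable and its parameters are hypothetical; the NULL signal mask is an assumption made to keep the example short (with it, pselect behaves like select with respect to signals).

```cpp
#include <cassert>
#include <sys/select.h>
#include <unistd.h>  // pipe()/read()/write() for callers of this helper

// Hypothetical helper: waits until fdA or fdB becomes readable.
// Returns the raw pselect() result (-1 error, 0 timeout, >0 ready count)
// and reports through readyA/readyB which descriptors are readable.
int waitReadable(int fdA, int fdB, const struct timespec *timeout,
                 bool &readyA, bool &readyB)
{
    assert(fdA >= 0 && fdA < FD_SETSIZE && fdB >= 0 && fdB < FD_SETSIZE);
    fd_set readers;
    FD_ZERO(&readers);
    FD_SET(fdA, &readers);
    FD_SET(fdB, &readers);
    const int maxFd = (fdA > fdB) ? fdA : fdB;
    // No write set: stdout-like descriptors are almost always writable and
    // would make the call return at once. A NULL timeout blocks indefinitely.
    const int ready =
        ::pselect(maxFd + 1, &readers, nullptr, nullptr, timeout, nullptr);
    readyA = ready > 0 && FD_ISSET(fdA, &readers);
    readyB = ready > 0 && FD_ISSET(fdB, &readers);
    return ready;
}
```

The return value is a count, not a descriptor number, which is why the sets have to be inspected afterwards.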
A lot of system calls, like close(fd), can be interrupted by a signal. In this case, usually -1 is returned and errno is set to EINTR.
The question is what is the right thing to do? Say, I still want this fd to be closed.
What I can come up with is:
while( close( fd ) == -1 )
if( errno != EINTR ) {
ReportError();
break;
}
Can anybody suggest a better/more elegant/standard way to handle this situation?
UPDATE:
As noticed by mux, the SA_RESTART flag can be used when installing the signal handler.
Can somebody tell me which functions are guaranteed to be restartable on all POSIX systems (not only Linux)?
Some system calls are restartable, which means the kernel will restart the call if it is interrupted, provided the SA_RESTART flag was used when installing the signal handler. The signal(7) man page says:
If a blocked call to one of the following interfaces is interrupted by a signal handler, then the call will be automatically restarted after the signal handler returns if the SA_RESTART flag was used; otherwise the call will fail with the error EINTR:
It doesn't mention whether close() is restartable, but these are:
read(2), readv(2), write(2), writev(2), and ioctl(2)
open(2)
wait(2), wait3(2), wait4(2), waitid(2), and waitpid(2)
accept(2), connect(2), recv(2), recvfrom(2), recvmsg(2), send(2), sendto(2), and sendmsg(2)
flock(2) and fcntl(2)
mq_receive(3), mq_timedreceive(3), mq_send(3), and mq_timedsend(3)
sem_wait(3) and sem_timedwait(3)
futex(2)
Note that those details, specifically the list of non-restartable calls, are Linux-specific.
I posted a related question about which system calls are restartable and whether POSIX specifies this. It is specified by POSIX, but it's optional, so you should check the list of non-restartable calls for your OS; if a call is not on that list, it should be restartable. This is my question:
How to know if a Linux system call is restartable or not?
Update: close() is a special case: it's not restartable and should not be retried on Linux. See this answer for more details:
https://stackoverflow.com/a/14431867/1157444
Assuming you're after shorter code, you can try something like:
while (((rc = close (fd)) == -1) && (errno == EINTR));
if (rc == -1)
complainBitterly (errno);
Assuming you're after more readable code in addition to shorter, just create a function:
int closeWithRetry (int fd);
and place your readable code in there. Then it doesn't really matter how long it is, it's still a one-liner where you call it, but you can make the function body itself very readable:
int closeWithRetry (int fd) {
// Initial close attempt.
int rc = close (fd);
// As long as you failed with EINTR, keep trying.
// Possibly with a limit (count or time-based).
while ((rc == -1) && (errno == EINTR))
rc = close (fd);
// Once either success or non-retry failure, return error code.
return rc;
}
For the record: on essentially every UNIX, close() must not be retried if it returns EINTR. DO NOT put an EINTR retry loop in place for close() like you would for waitpid() or read(). See this page for more details: http://austingroupbugs.net/view.php?id=529 On Linux, Solaris, BSD and others, retrying close() is incorrect. HP-UX is the only common(!) system I could find that requires this.
EINTR means something very different for read() and select() and waitpid() and so on than it does for close(). For most calls, you retry on EINTR because you asked for something to be done which blocks, and if you were interrupted that means it didn't happen, so you try again. For close(), the action you requested was for an entry to be removed from the fd table, which is instantaneous, without error, and will always happen no matter what close() returns.[*] The only reason close() blocks is that sometimes, for special semantics (like TCP linger), it can wait until I/O is done before returning. If close returns EINTR, that means that you asked it to wait but it couldn't. However, the fd was still closed; you just lost your chance to wait on it.
Conclusion: unless you know you can't receive signals, using close() for waiting is a very stupid thing to do. Use an application-level ACK (TCP) or an fsync (file I/O) to make sure any writes were completed before closing the fd.
[*] There is a caveat: if another thread of the process is inside a blocking syscall on the same fd, well, ... it depends.
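As a rough sketch of the "do not retry" advice, assuming Linux-like semantics where the descriptor is already released when close() reports EINTR: call close() exactly once and treat EINTR as "closed". The name closeOnce is made up for this example, and per the note above this would be the wrong approach on HP-UX.

```cpp
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Hypothetical helper: closes fd exactly once and never loops on EINTR.
// Assumption (Linux-like systems): after EINTR the descriptor is already
// gone, so a second close() could hit a descriptor that another thread has
// opened in the meantime.
// Returns 0 if the descriptor was released, otherwise the errno value.
int closeOnce(int fd)
{
    assert(fd >= 0);
    if (::close(fd) == 0) {
        return 0;
    }
    const int err = errno;
    return (err == EINTR) ? 0 : err;  // EINTR: closed, only the wait was cut short
}
```

If the data must be known to have reached its destination, that check (fsync or an application-level ACK) belongs before this call, not in a retry loop around it.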
Since it seems that I can't find a solution to my original problem, I tried to do a little workaround. I'm simply trying to set a timeout to the connect() call of my TCP Socket.
I want the connect() to be blocking but not until the usual 75 seconds timeout, I want to define my own.
I have already tried select(), which worked for the timeout, but I couldn't get a connection (that was my initial problem, as described here).
So now I found another way to deal with it: just do a blocking connect() call but interrupt it with an alarm, like this:
signal(SIGALRM, connect_alarm);
int secs = 5;
alarm(secs);
if (connect(m_Socket, (struct sockaddr *)&addr, sizeof(addr)) < 0 )
{
if ( errno == EINTR )
{
debug_printf("Timeout");
m_connectionStatus = STATUS_CLOSED;
return ERR_TIMEOUT;
}
else
{
debug_printf("Other Err");
m_connectionStatus = STATUS_CLOSED;
return ERR_NET_SOCKET;
}
}
with
static void connect_alarm(int signo)
{
debug_printf("SignalHandler");
return;
}
This is the solution I found on the Internet in a thread here on stackoverflow. If I use this code the program starts the timer and then goes into the connect() call. After the 5 seconds the signal handler is fired (as seen on the console with the printf()), but after that the program still remains within the connect() function for 75 seconds. Actually every description says that the connect_alarm() should interrupt the connect() function but it seems it doesn't in my case. Is there any way to get the desired result for my problem?
signal is a massively under-specified interface and should be avoided in new code. On some versions of Linux, I believe it provides "BSD semantics", which means (among other things) that it sets SA_RESTART by default.
Use sigaction instead, do not specify SA_RESTART, and you should be good to go.
...
Well, except for the general fragility and unavoidable race conditions, that is. connect will return EINTR for any signal, not just SIGALRM. More troublesome, if the system happens to be under heavy load, it could take more than 5 seconds between the call to alarm and the call to connect, in which case you will miss the signal and block in connect forever.
Your earlier attempt, using non-blocking sockets with connect and select, was a much better idea. I would suggest debugging that.
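For reference, a sketch of what "sigaction without SA_RESTART" looks like. To keep it self-contained it interrupts a blocking read() on a descriptor instead of connect(); the names onAlarm, installInterruptingAlarm and readWithAlarm are invented for this example. It uses a repeating interval timer, which is an addition of mine rather than something suggested above: a first tick that fires before the call blocks is then followed by another one.

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

// Empty handler: its only job is to make the blocked syscall return EINTR.
extern "C" void onAlarm(int) {}

// Installs onAlarm for SIGALRM WITHOUT SA_RESTART. Returns 0 on success.
int installInterruptingAlarm()
{
    struct sigaction sa = {};
    sa.sa_handler = onAlarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  // deliberately no SA_RESTART
    return sigaction(SIGALRM, &sa, nullptr);
}

// Hypothetical demo: blocks in read() on fd, interrupted every `millis` ms.
// Returns 0 if a byte was read (or EOF), otherwise the errno from read().
int readWithAlarm(int fd, long millis)
{
    assert(millis > 0);
    struct itimerval timer = {};
    timer.it_value.tv_sec = millis / 1000;
    timer.it_value.tv_usec = (millis % 1000) * 1000;
    timer.it_interval = timer.it_value;  // repeat, so a missed tick cannot hang us
    setitimer(ITIMER_REAL, &timer, nullptr);

    char byte;
    const ssize_t n = ::read(fd, &byte, 1);
    const int err = (n < 0) ? errno : 0;

    struct itimerval off = {};
    setitimer(ITIMER_REAL, &off, nullptr);  // cancel the timer
    return err;
}
```

With the handler installed through signal() on a system that implies SA_RESTART, the same read() would simply be restarted instead of failing with EINTR.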
While it's relatively easy to set up alarm(2) (apart from the pain of signal handling and system call interruptions), the more efficient way of timing out TCP connection attempts is the non-blocking connect, which also allows you to initiate multiple connections and wait on all of them, handling successes and failures one at a time.
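A minimal sketch of the non-blocking connect approach, using poll() for the wait. The function name connectWithTimeout and its return convention are assumptions for illustration, not code from the question.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>  // sockaddr_in, htonl() for callers
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: connects sock to addr, waiting at most timeoutMs.
// Returns 0 on success, ETIMEDOUT on timeout, otherwise an errno value.
int connectWithTimeout(int sock, const struct sockaddr *addr, socklen_t len,
                       int timeoutMs)
{
    assert(sock >= 0 && addr != nullptr);
    const int flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0 || fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0) {
        return errno;
    }

    int result = 0;
    if (::connect(sock, addr, len) < 0) {
        if (errno != EINPROGRESS) {
            result = errno;  // immediate failure
        } else {
            struct pollfd pfd = { sock, POLLOUT, 0 };
            int rc;
            // Simplification: an EINTR retry waits for the full timeout again.
            do {
                rc = ::poll(&pfd, 1, timeoutMs);
            } while (rc < 0 && errno == EINTR);

            if (rc == 0) {
                result = ETIMEDOUT;
            } else if (rc < 0) {
                result = errno;
            } else {
                // Writable: the attempt finished; SO_ERROR says how.
                int soError = 0;
                socklen_t soLen = sizeof soError;
                if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &soError, &soLen) < 0) {
                    result = errno;
                } else {
                    result = soError;
                }
            }
        }
    }
    fcntl(sock, F_SETFL, flags);  // restore the original (blocking) mode
    return result;
}
```

Becoming writable only means the attempt has finished, so the SO_ERROR check is what distinguishes a real connection from a refused one.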
The man pages for select() do not list EAGAIN as possible error code for the select() function.
Can anyone explain in which cases the select() can produce EAGAIN error?
If I understand the select_tut man page correctly, EAGAIN can be produced by sending a signal to a process that is blocked in select(). Is this correct?
Since I am using select() in blocking mode with timeout, like this:
bool selectRepeat = true;
int res = 0;
timeval selectTimeout( timeout );
while ( true == selectRepeat )
{
res = ::select( fd.Get() + 1,
NULL,
&writeFdSet,
NULL,
&selectTimeout );
selectRepeat = ( ( -1 == res ) && ( EINTR == errno ) );
}
should I just repeat the loop when the error number is EAGAIN?
select() will not return EAGAIN under any circumstance.
It may, however, return EINTR if interrupted by a signal (This applies to most system calls).
EAGAIN (or EWOULDBLOCK) may be returned from read, write, recv, send, etc.
EAGAIN is technically not an error, but an indication that the operation terminated without completing, and you should...er...try it again. You will probably want to write logic to retry, but not infinitely. If that was safe, they would have done it themselves in the API.
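A sketch of "retry, but not infinitely" for a non-blocking descriptor: on EAGAIN/EWOULDBLOCK, wait briefly for readability with poll() and try again, up to a fixed number of attempts. The name readWithRetry and the maxAttempts/waitMs parameters are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>  // O_NONBLOCK for callers that set up the descriptor
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: reads from a non-blocking fd with bounded retries.
// Returns the byte count, 0 on EOF, or -1 with errno set
// (EAGAIN if every attempt found nothing to read).
ssize_t readWithRetry(int fd, void *buf, size_t len, int maxAttempts, int waitMs)
{
    assert(maxAttempts > 0);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        const ssize_t n = ::read(fd, buf, len);
        if (n >= 0) {
            return n;  // data or EOF
        }
        if (errno == EINTR) {
            continue;  // interrupted before reading anything; counts as an attempt
        }
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            return -1;  // real error, errno already set
        }
        // Nothing available yet: wait up to waitMs for the fd to become readable.
        // The poll() result is ignored; the next read() reports the actual state.
        struct pollfd pfd = { fd, POLLIN, 0 };
        ::poll(&pfd, 1, waitMs);
    }
    errno = EAGAIN;
    return -1;
}
```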
If you are thinking that returning a silly non-error error code like that is kinda bad client interface design, you aren't the first. It turns out EAGAIN as an error code has a long, interesting history in Unix. Among other things, it spawned the widely circulated essay on software design, The Rise of Worse-is-Better. There are a couple of paragraphs in the middle that explain why Unix needs to return this sometimes. Yes, it does indeed have something to do with receiving interrupts during an I/O. They call it PC loser-ing.
Many credit this essay as one of the inspirations for Agile programming.