I have a problem with a piece of legacy C++/Winsock code that is part of a multi-threaded socket server. The application creates a thread that handles connections from clients, of which there are typically a couple of hundred connected at any one time. It typically runs without a problem for several days (continuously), and then suddenly stops accepting connections. This only happens in production, never in test.
It uses WSAEventSelect() to detect FD_ACCEPT network events. The (simplified) code for the connection handler is:
SOCKET listener;
HANDLE hStopEvent;
// ... initialise listener and hStopEvent, and other stuff ...
HANDLE hAcceptEvent = WSACreateEvent();
WSAEventSelect(listener, hAcceptEvent, FD_ACCEPT);
HANDLE rghEvents[] = { hStopEvent, hAcceptEvent };
bool bExit = false;
while(!bExit)
{
    DWORD nEvent = WaitForMultipleObjects(2, rghEvents, FALSE, INFINITE);
    switch(nEvent)
    {
    case WAIT_OBJECT_0:
        bExit = true;
        break;
    case WAIT_OBJECT_0 + 1:
        HandleConnect();
        WSAResetEvent(hAcceptEvent);
        break;
    case WAIT_ABANDONED_0:
    case WAIT_ABANDONED_0 + 1:
    case WAIT_FAILED:
        LogError();
        break;
    }
}
From detailed logging I know that, when the problem occurs, the thread enters WaitForMultipleObjects() and never emerges, even though there are clients attempting to connect and waiting for an accept. The WAIT_FAILED and WAIT_ABANDONED_x conditions never occur.
While I haven't ruled out a config problem on the server, or even some kind of resource leak (I can't find anything), I am also wondering if the event created by WSACreateEvent() is somehow being 'disassociated' from the FD_ACCEPT network event, causing it to never fire.
So, am I doing something wrong here? Is there something I should be doing that I'm not? Or a better way? I'd appreciate any suggestions! Thanks.
EDIT
The socket is a non-blocking socket.
EDIT
Problem solved by using the approach suggested by kipkennedy (below). Changed hAcceptEvent to be an auto-reset event, and removed the call to WSAResetEvent(), which was no longer needed.
Maybe FD_ACCEPT is signaled during HandleConnect(), after accept() has been called but before HandleConnect() returns and WSAResetEvent() runs. WSAResetEvent() then clears that signal as well, and nothing ever calls accept() again to re-enable it. For example, the following sequence is possible:
Event signaled, WaitForMultipleObjects() returns
During HandleConnect(), sometime after accept() is called, the event is signaled again
HandleConnect() returns
WSAResetEvent() resets the event, masking the second signal
WaitForMultipleObjects() never returns: as far as Winsock is concerned, it has already signaled the event for the pending connection, and no subsequent accept() call has re-enabled it
A couple of possible solutions: 1) loop on accept() in HandleConnect() until it fails with WSAEWOULDBLOCK; 2) use an auto-reset event, or reset the event immediately before calling HandleConnect() rather than after
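The sequence above can be replayed with a small toy model. This is only a sketch in portable C++, not Winsock code: ToyListener and its members are hypothetical names that mimic the documented behaviour (once the event has been signaled, nothing more is signaled until accept() is called, and accept() signals again if connections are still pending).

```cpp
#include <cassert>

// Toy stand-in for a listening socket registered for FD_ACCEPT.
// All names here are hypothetical; this models the documented rules only.
struct ToyListener {
    int  pending  = 0;     // connections waiting in the backlog
    bool armed    = true;  // may the stack signal the event right now?
    bool signaled = false; // state of the manual-reset event object

    void connectionArrives() {
        ++pending;
        if (armed) { signaled = true; armed = false; }
    }
    // accept() is the re-enabling call, even when it fails.
    bool accept() {
        bool got = pending > 0;
        if (got) --pending;
        armed = true;
        if (pending > 0) { signaled = true; armed = false; } // level-triggered
        return got;
    }
    void resetEvent() { signaled = false; }
};

// Original order: handle first, reset afterwards. Returns the final state.
ToyListener resetAfterHandling() {
    ToyListener s;
    s.connectionArrives();   // event signaled, the wait returns
    bool ok = s.accept();    // HandleConnect() accepts connection #1
    assert(ok);
    (void)ok;
    s.connectionArrives();   // connection #2 arrives, event signaled again
    s.resetEvent();          // the late reset wipes out that signal
    s.connectionArrives();   // connection #3: not armed, so no signal
    return s;
}

// Suggested order: reset first, then handle.
ToyListener resetBeforeHandling() {
    ToyListener s;
    s.connectionArrives();   // event signaled, the wait returns
    s.resetEvent();          // clear the signal we are about to service
    bool ok = s.accept();    // HandleConnect() accepts connection #1
    assert(ok);
    (void)ok;
    s.connectionArrives();   // connection #2 signals and stays signaled
    return s;
}
```

In the first function the model ends with two connections pending and the event unsignaled, which is the hang described in the question. In the second, the event is still signaled for the waiting connection. An auto-reset event behaves like the second function, because the reset happens as the wait returns.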
Code looks fine. The only thing I can suggest is calling WSAWaitForMultipleEvents() instead of the general-purpose WaitForMultipleObjects().
From reading the docs, it appears that WSAEventSelect() is as parsimonious about notifications as WSAAsyncSelect(). The stack doesn't signal FD_ACCEPT every time a connection comes in. To Winsock, the notification is its way of saying:
You called accept() earlier, and it failed with WSAEWOULDBLOCK. Go ahead and call it again, it should succeed this time.
The solution is to call accept() before calling WSAEventSelect(), and only call WSAEventSelect() after you get WSAEWOULDBLOCK. For this to work as you expect, you need to have set the listening socket to non-blocking. (That might seem obvious, but it isn't actually required.)
After an accept event has occurred, you must not call WSAResetEvent(hAcceptEvent). Instead, call WSAEnumNetworkEvents(listener, hAcceptEvent, &some_struct). This function clears the internal network event record of the socket (and copies it into some_struct), after which you can receive new connections.
We have a thread which is reading off of a socket. We ran into an issue on a network with a little more latency than we are used to, where our read loop would seemingly stop getting notified of read events on the socket. Original code (some error checking removed):
HANDLE hEventSocket = WSACreateEvent();
WSAEventSelect(pIOParams->sock, hEventSocket, FD_READ | FD_CLOSE);
std::array<HANDLE, 2> ahEvents;
// This is an event handle that can be called from another thread to
// get this read thread to exit
ahEvents[0] = pIOParams->hEventStop;
ahEvents[1] = hEventSocket;
while(pIOParams->bIsReading)
{
    // wait for stop or I/O events
    DWORD dwTimeout = 30000; // in ms
    dwWaitResult = WSAWaitForMultipleEvents(ahEvents.size(), ahEvents.data(), FALSE, dwTimeout, FALSE);
    if(dwWaitResult == WSA_WAIT_TIMEOUT)
    {
        CLogger::LogPrintf(LogLevel::LOG_DEBUG, "CSessionClient", "WSAWaitForMultipleEvents time out");
        continue;
    }
    if(dwWaitResult == WAIT_OBJECT_0) // check to see if we were signaled to stop from another thread
    {
        break;
    }
    if(dwWaitResult == WAIT_OBJECT_0 +1)
    {
        // determine which I/O operation triggered event
        if (WSAEnumNetworkEvents(pIOParams->sock, hEventSocket, &NetworkEvents) != 0)
        {
            int err = WSAGetLastError();
            CLogger::LogPrintf(LogLevel::LOG_WARN, "CSessionClient", "WSAEnumNetworkEvents failed (%d)", err);
            break;
        }
        // HERE IS THE LINE WE REMOVED THAT SEEMED TO FIX THE PROBLEM
        WSAResetEvent(hEventSocket);
        // Handle events on socket
        if (NetworkEvents.lNetworkEvents & FD_READ)
        {
            // Do stuff to read from socket
        }
        if (NetworkEvents.lNetworkEvents & FD_CLOSE)
        {
            // Handle that the socket was closed
            break;
        }
    }
}
Here is the issue: With WSAResetEvent(hEventSocket); in the code, sometimes the program works and reads all of the data from the server, but sometimes, it seems to get stuck in a loop receiving WSA_WAIT_TIMEOUT, even though the server appears to have data queued up for it.
While the program is looping receiving WSA_WAIT_TIMEOUT, Process Hacker shows the socket connected in a normal state.
Now we know that WSAEnumNetworkEvents will reset hEventSocket, but it doesn't seem like the additional call to WSAResetEvent should hurt. It also doesn't make sense that it permanently messes up the signaling. I would expect that perhaps we wouldn't get notified of the last chunk of data to be read, as data could have been read in between the call to WSAEnumNetworkEvents and WSAResetEvent, but I would assume that once additional data came in on the socket, the hEventSocket would get raised.
The stranger part of this is that we have been running this code for years, and we're only now seeing this issue.
Any ideas why this would cause an issue?
Calling WSAResetEvent() manually introduces a race condition that can put your socket into a bad state.
After WSAEnumNetworkEvents() returns, the event is signaled again when new data arrives, or when unread data is left over from an earlier read, but ONLY if the socket is in the proper state to signal that event.
If the event does get signaled before you call WSAResetEvent(), you lose that signal.
Per the WSAEventSelect() documentation:
Having successfully recorded the occurrence of the network event (by setting the corresponding bit in the internal network event record) and signaled the associated event object, no further actions are taken for that network event until the application makes the function call that implicitly reenables the setting of that network event and signaling of the associated event object.
FD_READ
The recv, recvfrom, WSARecv, WSARecvEx, or WSARecvFrom function.
...
Any call to the reenabling routine, even one that fails, results in reenabling of recording and signaling for the relevant network event and event object.
...
For FD_READ, FD_OOB, and FD_ACCEPT network events, network event recording and event object signaling are level-triggered. This means that if the reenabling routine is called and the relevant network condition is still valid after the call, the network event is recorded and the associated event object is set. This allows an application to be event-driven and not be concerned with the amount of data that arrives at any one time.
What that means is that if you manually reset the event after calling WSAEnumNetworkEvents(), the event will NOT be signaled again until AFTER you perform a read on the socket (which re-enables the signaling of the event for read operations) AND either new data arrives afterwards or you didn't read all of the data that was available.
By resetting the event manually, you lose the signal that allows WSAWaitForMultipleEvents() to tell you to call WSAEnumNetworkEvents() so it can then tell you to read from the socket. Without that read, the event will never be signaled again when data is waiting to be read. The only other condition you registered that can signal the event is a socket closure.
Since WSAEnumNetworkEvents() already resets the event for you, DON'T reset the event manually!
You already pass the event handle to WSAEnumNetworkEvents, which resets the event atomically. That is, the event is reset only as the pending event data is copied.
With a direct call to WSAResetEvent, a data notification can be lost: you call WSAEnumNetworkEvents to get the current status and reset the event, more data then arrives and sets the event again, and your WSAResetEvent call clears it before the next loop iteration. Unless yet more data comes in, you won't be told about the data that has already arrived.
Far better to just let WSAEnumNetworkEvents deal with the event state.
I have a loop which basically calls this every few seconds (after the timeout):
while(true){
    if(finished)
        return;
    switch(select(FD_SETSIZE, &readfds, 0, 0, &tv)){
        case SOCKET_ERROR : report bad stuff etc; return;
        default : break;
    }
    // do stuff with the incoming connection
}
So basically for every few seconds (which is specified by tv), it reactivates the listening.
This is run on thread B (not a main thread). There are times when I want to end this acceptor loop immediately from thread A (main thread), but it seems like I have to wait until the time interval finishes.
Is there a way to disrupt the select function from another thread so thread B can quit instantly?
The easiest way is probably to use pipe(2) to create a pipe and add the read end to readfds. When the other thread wants to interrupt the select() just write a byte to it, then consume it afterward.
Yes, you create a connected pair of sockets. Thread B adds one end of the pair to the set it passes to select(), and thread A writes a byte to the other end. As soon as A writes, select() returns in B; do not forget to read that byte from the socket afterwards.
This is the most standard and common way to interrupt selects.
Notes:
Under Unix, use socketpair to create the pair of sockets. Under Windows it is a little bit tricky, but searching for "Windows socketpair" will turn up code samples.
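A minimal POSIX sketch of this idea follows. The helper names wakeSelect and waitForWake are made up for illustration, and a real loop would also have its listening socket in readfds; here only the wake-up socket is watched to keep the example short.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

// Called by the interrupting thread (thread A): writes one byte to its end.
bool wakeSelect(int wakeWriteFd) {
    assert(wakeWriteFd >= 0);
    const char byte = 'x';
    return write(wakeWriteFd, &byte, 1) == 1;
}

// Called by the waiting thread (thread B). Returns true if select() returned
// because of the wake-up byte, false on timeout or error.
bool waitForWake(int wakeReadFd, int timeoutSeconds) {
    assert(wakeReadFd >= 0);
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(wakeReadFd, &readfds);
    timeval tv;
    tv.tv_sec = timeoutSeconds;
    tv.tv_usec = 0;
    int ready = select(wakeReadFd + 1, &readfds, nullptr, nullptr, &tv);
    if (ready <= 0 || !FD_ISSET(wakeReadFd, &readfds)) return false;
    char byte;
    // Consume the byte so that the next select() blocks again.
    return read(wakeReadFd, &byte, 1) == 1;
}
```

Given int fds[2] filled in by socketpair(AF_UNIX, SOCK_STREAM, 0, fds), thread B would call waitForWake(fds[0], ...) inside its loop, and thread A would call wakeSelect(fds[1]) right after setting finished.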
Can't you just make the timeout sufficiently short (like 10ms or so?).
These "just create a dummy connection"-type solutions seem sort of hacked. I personally think that if an application is well designed, concurrent tasks never have to be interrupted forcefully; the worker just has to check often enough (this is also a reason why boost.threads do not have a terminate function).
Edit: Made this answer CV. It is bad, but it might help others to understand why it is bad, which is explained in the comments.
You can call shutdown(Sock, SHUT_RDWR) from the main thread to make the waiting select() call return, so the other thread can exit without waiting for the timeout to expire.
cheers. :)
I have a Windows service written in C++ that functions as a TCP server listening for incoming connections.
I initialized the server socket and put the accept code in a separate thread. This will accept and process the incoming connections.
However, I also need to stop this thread in case the service receives the STOP signal. So I thought of creating an event object using CreateEvent and waiting for it to be signaled. This waiting would happen in the thread that creates the accept thread. So I could use the TerminateThread function to stop the accept thread when the STOP signal is received.
However, MSDN says that
TerminateThread is a dangerous function that should only be used in the most extreme cases.
How strictly should this be followed and is my approach correct? What could be another way of doing this?
In Windows, you can wake up a blocking accept call from another thread simply by calling closesocket. The blocking accept call will return INVALID_SOCKET, and your code has a chance to break out of whatever loop it is in by checking some other exit condition that you have already set (e.g. a global variable).
This also works with Mac (and likely BSD derivatives) with the close function, but not Linux. The more universal UNIX solution to this problem is here.
Some pseudo code for the Windows solution is below.
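#include <winsock2.h>
#include <atomic>

void ProcessClient(SOCKET client);       // handles one connection; defined elsewhere

SOCKET _listenSocket;
std::atomic<bool> _needToExit{false};    // written by one thread, read by another
HANDLE _hThread;

void MakeListenThreadExit()
{
    _needToExit = true;
    // closing the listening socket makes the blocked accept() fail
    closesocket(_listenSocket);
    _listenSocket = INVALID_SOCKET;
    // wait for the thread to exit
    WaitForSingleObject(_hThread, INFINITE);
}

DWORD __stdcall ListenThread(void* context)
{
    while (!_needToExit)
    {
        sockaddr_in addr = {};
        int addrSize = sizeof(addr);
        SOCKET client = accept(_listenSocket, (sockaddr*)&addr, &addrSize);
        if ((client == INVALID_SOCKET) || _needToExit)
        {
            break;
        }
        ProcessClient(client);
    }
    return 0;
}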
In this situation, don't use accept() on a blocking socket. Use a non-blocking socket instead. Then you can use select() with a timeout so your thread can check for a termination condition periodically. Or better, use WSACreateEvent() with WSAEventSelect(). Create two event objects, one to detect client connections, and one to detect thread termination. You can then use WSAWaitForMultipleEvents() to wait on both events at the same time. Use WSASetEvent() to signal the termination event when needed, and call accept() or WSAAccept() whenever the other event is signaled. WSAWaitForMultipleEvents() will tell you which event to act on.
I am using a socket with overlapped operations and event-based completion notification.
Have 2 events, one for data, the other to cancel long send/recv:
HANDLE events[] = { m_hDataEvent, m_hInterruptEvent };
then calling WSASend,
WSASend(m_Socket, &DataBuf, 1, NULL, 0, &SendOverlapped, NULL);
followed by
WSAWaitForMultipleEvents(2, events, FALSE, INFINITE, FALSE);
which is set up to return when any one event is signaled.
Now assume send is in progress, and m_hInterruptEvent is signaled.
WSAWaitForMultipleEvents returns; technically, the function that called send can return as well and delete its internally allocated buffers.
What is not clear to me: WSASend may still be working in the background, and deleting the buffers will cause data corruption in the best case.
What would be the proper way to stop the background Send/Receive, if the socket needs to be used for something else immediately?
I looked at CancelIo(), but MSDN never mentions it in relation to sockets. Does it work with file-based I/O only?
It makes no sense to try to cancel it once sent. Even if you succeeded you would have a problem because the receiving application would not have any idea that the transmission was interrupted. Your new message will be mistaken for the end of the old message.
If you feel the need to cancel long sends, you should probably look at your application design.
Send in chunks and check for cancellation in between chunks. Ensure you have a way of communicating to the receiver that the transmission was cancelled.
Close the socket to cancel. Again, ensure the client has a way to know that this is an interrupted transmission (for example if the client knows the total length in advance they will recognise an interrupted transmission).
Just wait for it to succeed in the background and don't worry. If you have urgent messages use a separate connection for them.
For your particular question "What would be the proper way to stop the background Send/Receive, if the socket needs to be used for something else immediately", the answer is: Sockets are cheap - Just use two - one for the slow transmission the other for the urgent messages.
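The first option (send in chunks and check for cancellation between chunks) might look roughly like the sketch below. SendFn is a hypothetical callback standing in for whatever actually writes to the socket, so nothing here is tied to a particular socket API. Telling the receiver about the cancellation is left to the caller.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical transport callback: tries to send `len` bytes and returns
// the number of bytes actually sent, or a value <= 0 on error.
using SendFn = std::function<long(const char*, std::size_t)>;

enum class SendResult { Completed, Cancelled, Failed };

// Sends `len` bytes in pieces of at most `chunkSize`, checking the cancel
// flag before each piece. `bytesSent` reports how far the transfer got.
SendResult sendInChunks(const char* data, std::size_t len, std::size_t chunkSize,
                        const std::atomic<bool>& cancel, const SendFn& sendSome,
                        std::size_t& bytesSent) {
    assert(chunkSize > 0);
    bytesSent = 0;
    while (bytesSent < len) {
        if (cancel.load()) return SendResult::Cancelled;  // stop between chunks
        std::size_t want = std::min(chunkSize, len - bytesSent);
        long n = sendSome(data + bytesSent, want);
        if (n <= 0) return SendResult::Failed;
        bytesSent += static_cast<std::size_t>(n);
    }
    return SendResult::Completed;
}
```

Because each chunk is sent to completion before the flag is checked again, no buffer is freed while a send is still in flight, and bytesSent tells the caller how much the receiver may already have.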