I'm writing a UDP server application for Windows desktop/server.
My code uses the Winsock (WSA) API that Windows recommends, in the following way (this is my simplified receivePacket method):
#include <winsock2.h>
#define MAX_SIZE 1024
struct Packet
{
    unsigned int size;
    char buffer[MAX_SIZE];
};
// The packet is taken by reference so the caller sees the received data.
bool receivePacket(Packet& packet)
{
    WSABUF wsa_buffer[2];
    wsa_buffer[0].buf = reinterpret_cast<char*>(&packet.size);
    wsa_buffer[0].len = sizeof(packet.size);
    wsa_buffer[1].buf = packet.buffer;
    wsa_buffer[1].len = MAX_SIZE;
    bool retval = false;
    DWORD flags = 0;
    DWORD bytes_recv = 0;
    sockaddr_in client_addr;
    int client_addr_len = sizeof(client_addr);
    if(WSARecvFrom(_socket, wsa_buffer, sizeof(wsa_buffer)/sizeof(wsa_buffer[0]), &bytes_recv, &flags, (sockaddr *)&client_addr, &client_addr_len, NULL, NULL) == 0)
    {
        //Packet received successfully
        retval = true;
    }
    else
    {
        //Report
    }
    return retval;
}
Now, when I try to close my application gracefully (not network-wise, but application-wise, going through all the destructors and so on), I need to unblock this call.
To do this, I call the shutdown(_socket, SD_BOTH) method. Unfortunately, the call to shutdown itself BLOCKS!
After reading every relevant page on MSDN, I couldn't find any explanation of why this happens, other ways of attacking the problem, or any way out.
Another thing I tried was SO_RCVTIMEO. Surprisingly, this socket option didn't work as expected either.
Is there any problem with my code/approach?
Did you call shutdown on a duplicated handle? Calling shutdown on the same handle waits for any active operation on that handle to complete.
I'm writing an IOCP server for video streaming from a desktop client to a browser.
Both sides use the WebSocket protocol to unify the server's architecture (and because there is no other way for browsers to perform a full-duplex exchange).
The working thread starts like this:
unsigned int __stdcall WorkerThread(void * param){
int ThreadId = (int)param;
OVERLAPPED *overlapped = nullptr;
IO_Context *ctx = nullptr;
Client *client = nullptr;
DWORD transfered = 0;
BOOL QCS = 0;
while(WAIT_OBJECT_0 != WaitForSingleObject(EventShutdown, 0)){
QCS = GetQueuedCompletionStatus(hIOCP, &transfered, (PULONG_PTR)&client, &overlapped, INFINITE);
if(!client){
if( Debug ) printf("No client\n");
break;
}
ctx = (IO_Context *)overlapped;
if(!QCS || (QCS && !transfered)){
printf("Error %d\n", WSAGetLastError());
DeleteClient(client);
continue;
}
switch(auto opcode = client->ProcessCurrentEvent(ctx, transfered)){
// Client owed to receive some data
case OPCODE_RECV_DEBT:{
if((SOCKET_ERROR == client->Recv()) && (WSA_IO_PENDING != WSAGetLastError())) DeleteClient(client);
break;
}
// Client received all data or the beginning of new message
case OPCODE_RECV_DONE:{
std::string message;
client->GetInput(message);
// Analyzing the first byte of the WebSocket frame
switch( opcode = message[0] & 0xFF ){
// HTTP_HANDSHAKE is 'G' - from GET HTTP...
case HTTP_HANDSHAKE:{
message = websocket::handshake(message);
while(!client->SetSend(message)) Sleep(1); // Set outgoing data
if((SOCKET_ERROR == client->Send()) && (WSA_IO_PENDING != WSAGetLastError())) DeleteClient(client);
break;
}
// Browser sent a closing frame (0x88) - performing clean WebSocket closure
case FIN_CLOSE:{
websocket::frame frame;
frame.parse(message);
frame.masked = false;
if( frame.pl_len == 0 ){
unsigned short reason = 1000;
frame.payload.resize(sizeof(reason));
frame.payload[0] = (reason >> 8) & 0xFF;
frame.payload[1] = reason & 0xFF;
}
frame.pack(message);
while(!client->SetSend(message)) Sleep(1);
if((SOCKET_ERROR == client->Send()) && (WSA_IO_PENDING != WSAGetLastError())) DeleteClient(client);
shutdown(client->Socket(), SD_SEND);
break;
}
IO context struct:
struct IO_Context{
OVERLAPPED overlapped;
WSABUF data;
char buffer[IO_BUFFER_LENGTH];
unsigned char opcode;
unsigned long long debt;
std::string message;
IO_Context(){
debt = 0;
opcode = 0;
data.buf = buffer;
data.len = IO_BUFFER_LENGTH;
overlapped.Offset = overlapped.OffsetHigh = 0;
overlapped.Internal = overlapped.InternalHigh = 0;
overlapped.Pointer = nullptr;
overlapped.hEvent = nullptr;
}
~IO_Context(){ while(!HasOverlappedIoCompleted(&overlapped)) Sleep(1); }
};
Client Send function:
int Client::Send(){
int var_buf = O.message.size();
// "O" is IO_Context for Output
O.data.len = (var_buf>IO_BUFFER_LENGTH)?IO_BUFFER_LENGTH:var_buf;
var_buf = O.data.len;
memcpy(O.data.buf, O.message.data(), var_buf); // copy the first var_buf bytes of the pending message (needs <cstring>)
O.message.erase(0, O.data.len);
return WSASend(connection, &O.data, 1, nullptr, 0, &O.overlapped, nullptr);
}
When the desktop client disconnects (it just calls closesocket(), with no shutdown()), GetQueuedCompletionStatus returns TRUE and sets transfered to 0. In this case WSAGetLastError() returns 64 (The specified network name is no longer available), which makes sense: the client disconnected (see the line with if(!QCS || (QCS && !transfered))). But when the browser disconnects, the error codes confuse me. It can be 0, 997 (pending operation), 87 (invalid parameter)... and there are no codes related to the end of the connection.
Why does IOCP select these events? How can it select a pending operation? Why is the error 0 when 0 bytes were transferred? It also leads to endless attempts to delete the object associated with the overlapped structure, because the destructor calls ~IO_Context(){ while(!HasOverlappedIoCompleted(&overlapped)) Sleep(1); } for safe deletion. In the DeleteClient call the socket is closed with closesocket(), but, as you can see, I post a shutdown(client->Socket(), SD_SEND); call before it (in the FIN_CLOSE section).
I understand that a connection has two sides and that closing it on the server side does not mean the other side will close it too. But I need to create a stable server that is immune to bad and half-open connections. For example, a user of the web application can rapidly press F5 to reload the page a few times (yeah, some dudes do so :) ). The connection will reopen a few times, and the server must not lag or crash because of this.
How should I handle these "bad" events in IOCP?
There are many mistakes in this code.
while(WAIT_OBJECT_0 != WaitForSingleObject(EventShutdown, 0)){
QCS = GetQueuedCompletionStatus(hIOCP, &transfered, (PULONG_PTR)&client, &overlapped, INFINITE);
This is an inefficient and incorrect way to stop WorkerThread. First, the WaitForSingleObject call and the EventShutdown event are redundant. More importantly, they fail to shut the thread down: if the thread is waiting for a packet inside GetQueuedCompletionStatus when you signal EventShutdown, the event does not interrupt the GetQueuedCompletionStatus call, and the thread keeps waiting forever. The correct way to shut down is to call PostQueuedCompletionStatus(hIOCP, 0, 0, 0) instead of SetEvent(EventShutdown); when a worker thread sees client == 0, it breaks out of the loop. You usually need multiple worker threads rather than a single one, so call PostQueuedCompletionStatus(hIOCP, 0, 0, 0) exactly once per worker thread. You also need to synchronize these calls with I/O: post them only after all I/O has completed and no new I/O packets will be queued to the IOCP, so that the "null packets" are the last ones queued to the port.
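The "one null packet per worker, posted last" idea can be sketched without any Windows API. In the illustration below a plain blocking queue stands in for the completion port, so CompletionQueue, Completion and runAndShutdown are hypothetical names, not part of IOCP; post() plays the role of PostQueuedCompletionStatus and wait() the role of GetQueuedCompletionStatus.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a completion packet; key == nullptr is the "null packet".
struct Completion { void* key; };

// Hypothetical stand-in for the completion port: a blocking FIFO queue.
class CompletionQueue {
public:
    void post(Completion c) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(c);
        }
        ready_.notify_one();
    }
    Completion wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        Completion c = queue_.front();
        queue_.pop();
        return c;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Completion> queue_;
};

// Runs `workers` threads over `jobs` packets, then stops them with one null
// packet per worker. Returns how many real packets were handled.
int runAndShutdown(std::size_t workers, int jobs) {
    CompletionQueue port;
    std::atomic<int> handled{0};
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < workers; ++i) {
        pool.emplace_back([&port, &handled] {
            for (;;) {
                Completion c = port.wait();
                if (c.key == nullptr) break;   // null packet: leave the loop
                ++handled;                     // real packet: process it
            }
        });
    }
    static int dummyClient;                    // any non-null key will do here
    for (int i = 0; i < jobs; ++i) port.post(Completion{&dummyClient});
    // The null packets go last, exactly one per worker.
    for (std::size_t i = 0; i < workers; ++i) port.post(Completion{nullptr});
    for (std::thread& t : pool) t.join();
    return handled.load();
}
```

Because each worker consumes exactly one null packet and then exits, posting fewer of them than there are workers would leave some thread blocked forever, and posting them before the last real packet could leave that packet unprocessed.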
if(!QCS || (QCS && !transfered)){
printf("Error %d\n", WSAGetLastError());
DeleteClient(client);
continue;
}
If !QCS, the value in client is not guaranteed to be initialized; you simply cannot use it, and calling DeleteClient(client); is wrong under this condition.
When an object (client) is used from several threads, who must delete it? What happens if one thread deletes the object while another is still using it? The correct solution is to use reference counting on such an object (client). And based on your code, do you have a single client per hIOCP? You retrieve the pointer to the client from the completion key, which is the same for every I/O operation on a socket bound to hIOCP. All of this is wrong design.
You need to store the pointer to the client in IO_Context, add a reference to the client in the IO_Context constructor, and release the client in the IO_Context destructor.
class IO_Context : public OVERLAPPED {
Client *client;
ULONG opcode;
// ...
public:
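IO_Context(Client *client, ULONG opcode) : client(client), opcode(opcode) {
RtlZeroMemory(static_cast<OVERLAPPED*>(this), sizeof(OVERLAPPED)); // the OVERLAPPED part must start zeroed
client->AddRef();
}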
~IO_Context() {
client->Release();
}
void OnIoComplete(ULONG transfered) {
OnIoComplete(RtlNtStatusToDosError(Internal), transfered);
}
void OnIoComplete(ULONG error, ULONG transfered) {
client->OnIoComplete(opcode, error, transfered);
delete this;
}
void CheckIoError(ULONG error) {
switch(error) {
case NOERROR:
case ERROR_IO_PENDING:
break;
default:
OnIoComplete(error, 0);
}
}
};
Also, do you have a single IO_Context? If so, this is a fatal error: the IO_Context must be unique for every I/O operation.
if (IO_Context* ctx = new IO_Context(client, op))
{
ctx->CheckIoError(WSAxxx(ctx) == 0 ? NOERROR : WSAGetLastError());
}
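The ownership rule described above (every in-flight operation holds a reference to its client, and the client is destroyed only when the last reference goes away) can be tried out in isolation. The following is a minimal portable sketch, not the answer's actual classes: Client and IoContext here are simplified stand-ins with no OVERLAPPED or socket, and the destroyedFlag parameter is a hypothetical hook that exists only so the destruction can be observed.

```cpp
#include <atomic>

// Simplified stand-in for the client object; it deletes itself on the last Release().
class Client {
public:
    explicit Client(bool* destroyedFlag = nullptr) : destroyed_(destroyedFlag) {}
    void AddRef() { refs_.fetch_add(1, std::memory_order_relaxed); }
    void Release() {
        if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this;
    }
    void OnIoComplete(unsigned /*opcode*/, unsigned /*error*/, unsigned /*bytes*/) {
        ++completions_;
    }
    int completions() const { return completions_.load(); }
private:
    ~Client() { if (destroyed_) *destroyed_ = true; }  // reachable only via Release()
    std::atomic<long> refs_{1};        // the creator holds the first reference
    std::atomic<int> completions_{0};
    bool* destroyed_;
};

// One context per I/O operation; it keeps the client alive while the operation is in flight.
class IoContext {
public:
    IoContext(Client* client, unsigned opcode) : client_(client), opcode_(opcode) {
        client_->AddRef();
    }
    ~IoContext() { client_->Release(); }
    void OnIoComplete(unsigned error, unsigned bytes) {
        client_->OnIoComplete(opcode_, error, bytes);
        delete this;                   // the operation is over, drop its reference
    }
private:
    Client* client_;
    unsigned opcode_;
};
```

In this arrangement no thread ever decides on its own to delete the client. The creator releases its reference when it is done, and the object goes away only after the last pending operation has completed.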
and in the worker threads:
ULONG WINAPI WorkerThread(void * param)
{
ULONG_PTR key;
OVERLAPPED *overlapped;
ULONG transfered;
while(GetQueuedCompletionStatus(hIOCP, &transfered, &key, &overlapped, INFINITE)) {
switch (key){
case '_io_':
static_cast<IO_Context*>(overlapped)->OnIoComplete(transfered);
continue;
case 'stop':
// ...
return 0;
default: __debugbreak();
}
}
__debugbreak();
return GetLastError();
}
Code like while(!HasOverlappedIoCompleted(&overlapped)) Sleep(1); is always wrong. Absolutely and always. Never write such code.
ctx = (IO_Context *)overlapped; gives the correct result in your concrete case, but it is not nice and can break if you change the definition of IO_Context. You can use CONTAINING_RECORD(overlapped, IO_Context, overlapped) if you use struct IO_Context{ OVERLAPPED overlapped; }, but it is better to use class IO_Context : public OVERLAPPED and static_cast<IO_Context*>(overlapped).
Now, about why IOCP "selects" these events and how to handle these "bad" events:
The IOCP does not select anything. It simply signals when an I/O operation completes. That is all. Which specific WSA errors you get on different network operations is completely independent of whether you use IOCP or any other completion mechanism.
On a graceful disconnect it is normal for a recv operation to complete with error code 0 and 0 bytes transferred. You need to keep a recv request permanently active once the connection is established, and if a recv completes with 0 bytes transferred, this means that a disconnect has happened.
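The rule for interpreting a receive completion can be written down as a small decision function. This is only a sketch: RecvOutcome and classifyRecvCompletion are hypothetical names, and it assumes the receive was posted with a non-empty buffer (a deliberately zero-length receive would also complete with 0 bytes without meaning a disconnect).

```cpp
enum class RecvOutcome { Data, GracefulClose, Failed };

// error: the Win32/WSA error code of the completed receive (0 means success).
// bytesTransferred: the byte count reported for that completion.
RecvOutcome classifyRecvCompletion(unsigned long error, unsigned long bytesTransferred)
{
    if (error != 0) return RecvOutcome::Failed;                    // e.g. 64: connection dropped
    if (bytesTransferred == 0) return RecvOutcome::GracefulClose;  // peer closed its sending side
    return RecvOutcome::Data;                                      // process data, post the next recv
}
```

Only the Data case should lead to another receive being posted; the other two cases both end with the connection being torn down, and differ only in what gets logged.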
I'm using czmq for interprocess communication.
There are 2 processes :
The server, receiving requests and sending replies but also sending events.
The client, sending requests and receiving replies but also listening to the events.
I have already successfully implemented the "request/reply" pattern with REQ/REP (details below).
Now I want to implement the notification mechanism.
I want my server to send its events without caring whether anyone receives them and without being blocked in any way.
The client listens to those events but should it crash, it mustn't have any impact on the server.
I believe PUB/SUB is the most appropriate pattern, but if not do not hesitate to enlighten me.
Here's my implementation (cleaned from checks and logs) :
The server publishes the events
Server::eventIpcPublisher = zsock_new_pub("#ipc:///tmp/events.ipc");
void Server::OnEvent(uint8_t* eventData, size_t dataSize) {
if (Server::eventIpcPublisher != nullptr) {
int retCode = zsock_send(Server::eventIpcPublisher, "b", eventData, dataSize);
}
}
The client listens to them in a dedicated thread
void Client::RegisterToEvents(const std::function<void(uint8_t*, size_t)>& callback) {
zsock_t* eventIpcSubscriber = zsock_new_sub(">ipc:///tmp/events.ipc", "");
listening = true;
while (listening) {
byte* receptionBuffer;
size_t receptionBufferSize;
int retCode = zsock_recv(eventIpcSubscriber, "b", &receptionBuffer, &receptionBufferSize);
// --> NEVER REACHED <--
if (retCode == 0) {
callback(static_cast<uint8_t*>(receptionBuffer), receptionBufferSize);
}
}
zsock_destroy(&eventIpcSubscriber);
}
It doesn't work:
The server sends with return code 0, as if everything is ok,
The client doesn't receive anything (blocked on receive).
Help would be much appreciated, thanks in advance!
Chris.
PS: here is the REQ/REP that I have already implemented with success (no help needed here, just for comprehension)
The client sends a request and then waits for the answer.
uint8_t* MulticamApi::GetDatabase(size_t& sizeOfData) {
zsock_t* requestSocket = zsock_new_req(">ipc:///tmp/requests.ipc");
if (requestSocket == nullptr)
return nullptr;
byte* receptionBuffer;
size_t receptionBufferSize;
int retCode = zsock_send(requestSocket, "i", static_cast<int>(IpcComm_GetClipDbRequest));
if (retCode != 0) {
sizeOfData = 0;
return nullptr;
}
retCode = zsock_recv(requestSocket, "b", &receptionBuffer, &receptionBufferSize);
databaseData.reset(new MallocGuard(static_cast<void*>(receptionBuffer)));
sizeOfData = receptionBufferSize;
return static_cast<uint8_t*>(databaseData->Data());
}
A dedicated thread in the server listens to requests, processes them and replies. (don't worry, delete is handled somewhere else)
U32 Server::V_OnProcessing(U32 waitCode) {
protocolIpcWriter = zsock_new_rep("#ipc:///tmp/requests.ipc");
while (running) {
int receptionInt = 0;
int retCode = zsock_recv(protocolIpcWriter, "i", &receptionInt);
if ((retCode == 0) && (receptionInt == static_cast<int>(IpcComm_GetClipDbRequest))) {
GetDatabase();
}
sleep(1);
}
zsock_destroy(&protocolIpcWriter);
return 0;
}
void Server::GetDatabase() {
uint32_t dataSize = 10820 * 340;
uint8_t* data = new uint8_t[dataSize];
uint32_t nbBytesWritten = DbHelper::SaveDbToBuffer(data, dataSize);
int retCode = zsock_send(protocolIpcWriter, "b", data, nbBytesWritten);
}
I know my question's old, but for the record: I switched from czmq to the base zmq API and everything went smoothly. A colleague of mine also had issues with the czmq layer and switched to zmq to fix them, so that's definitely what I recommend.
I did a few tests with an I/O-Completion port and winsock sockets.
I noticed that sometimes, after I receive data from a connection and then immediately call WSARecv again on that socket, it returns immediately with error 259 (ERROR_NO_MORE_ITEMS).
I am wondering why the system flags the overlapped transaction with this error instead of keeping the recv call blocking/waiting for incoming data.
Do you know what the reason for this is?
I would be glad to hear about your thoughts.
Edit: Code
do
{
OVERLAPPED* pOverlapped = nullptr;
DWORD dwBytes = 0; ULONG_PTR ulKey = 0;
//Dequeue a completion packet
if(!m_pIOCP->GetCompletionStatus(&dwBytes, &ulKey, &pOverlapped, INFINITE))
DebugBreak();
//Evaluate
switch(((MYOVERLAPPED*)pOverlapped)->WorkType)
{
case ACCEPT_OVERLAPPED_TYPE:
{
//cast
ACCEPT_OVERLAPPED* pAccept = (ACCEPT_OVERLAPPED*)pOverlapped;
//Associate the newly accepted connection with the IOCP
if(!m_pIOCP->AssociateHandle((HANDLE)(pAccept->pSockClient)->operator SOCKET(), 1))
{
//Association failed: close the socket and delete the overlapped structure
}
//Call recv
RECV_OVERLAPPED* pRecvAction = new RECV_OVERLAPPED;
pRecvAction->pSockClient = pAccept->pSockClient;
short s = (pRecvAction->pSockClient)->Recv(pRecvAction->strBuf, pRecvAction->pWSABuf, 10, pRecvAction);
if(s == Inc::REMOTECONNECTION_CLOSED)
{
//Error stuff
}
//Call accept again (create a new ACCEPT_OVERLAPPED to ensure overlapped being zeroed out)
ACCEPT_OVERLAPPED *pNewAccept = new ACCEPT_OVERLAPPED;
pNewAccept->pSockListen = pAccept->pSockListen;
pNewAccept->pSockClient = new Inc::CSocket((pNewAccept->pSockListen)->Accept(nullptr, nullptr, pNewAccept));
//delete the old overlapped struct
delete pAccept;
}
break;
case RECV_OVERLAPPED_TYPE:
{
RECV_OVERLAPPED* pOldRecvAction = (RECV_OVERLAPPED*)pOverlapped;
if(!pOldRecvAction->InternalHigh)
{
//Connection has been closed: delete the socket(implicitly closes the socket)
Inc::CSocket::freewsabuf(pOldRecvAction->pWSABuf); //free the wsabuf
delete pOldRecvAction->pSockClient;
}
else
{
//Call recv again (create a new RECV_OVERLAPPED)
RECV_OVERLAPPED* pNewRecvAction = new RECV_OVERLAPPED;
pNewRecvAction->pSockClient = pOldRecvAction->pSockClient;
short sRet2 = (pNewRecvAction->pSockClient)->Recv(pNewRecvAction->strBuf, pNewRecvAction->pWSABuf, 10, pNewRecvAction);
//Free the old wsabuf
Inc::CSocket::freewsabuf(pOldRecvAction->pWSABuf);
delete pOldRecvAction;
}
Error checking omitted...
The Recv member function is a simple wrapper around the WSARecv call which creates the WSABUF and the receive buffer itself (which needs to be cleaned up by the user via freewsabuf, just to mention)...
It looks like I was sending less data than was requested by the receiving side.
But since it's an overlapped operation, receiving only a small chunk of the requested amount over the TCP connection triggers the completion indication with the error ERROR_NO_MORE_ITEMS, meaning there was nothing more to receive than what it already had.
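Since TCP is a byte stream, a receive can legitimately complete with fewer bytes than the buffer asked for, so the receiving side usually accumulates chunks until a whole message is present. A minimal sketch of that bookkeeping follows. MessageAssembler is a hypothetical helper, and the fixed message length of 10 in the usage note simply mirrors the 10 passed to Recv in the code above; a real protocol would normally carry its own length field.

```cpp
#include <cstddef>
#include <string>

// Collects the bytes of successive receive completions until a full
// fixed-size message is available.
class MessageAssembler {
public:
    explicit MessageAssembler(std::size_t expected) : expected_(expected) {}

    // Feed the bytes of one completion; returns true once a full message is buffered.
    bool feed(const char* data, std::size_t n) {
        buffer_.append(data, n);
        return buffer_.size() >= expected_;
    }

    // Number of bytes still missing from the current message.
    std::size_t missing() const {
        return buffer_.size() >= expected_ ? 0 : expected_ - buffer_.size();
    }

    // Removes and returns the first complete message; call only after feed() returned true.
    std::string take() {
        std::string message = buffer_.substr(0, expected_);
        buffer_.erase(0, expected_);
        return message;
    }

private:
    std::size_t expected_;
    std::string buffer_;
};
```

For example, with MessageAssembler assembler(10), a completion carrying 4 bytes leaves missing() at 6, and the next receive only needs to be posted for the remainder. Any bytes beyond the first 10 stay buffered as the start of the next message.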
I'm trying to make a select-server in order to receive connection from several clients (all clients will connect to the same port).
The server accepts the first 2 clients, but unless one of them disconnects, it will not accept a new one.
I start listening on the server port like this:
listen(m_socketId, SOMAXCONN);
and using the select command like this:
int selected = select(m_maxSocketId + 1, &m_socketReadSet, NULL, NULL, 0);
I've added some code.
bool TcpServer::Start(char* ipAddress, int port)
{
m_active = true;
FD_ZERO(&m_socketMasterSet);
bool listening = m_socket->Listen(ipAddress, port);
// Start listening.
m_maxSocketId = m_socket->GetId();
FD_SET(m_maxSocketId, &m_socketMasterSet);
if (listening == true)
{
StartThread(&InvokeListening);
StartReceiving();
return true;
}
else
{
return false;
}
}
void TcpServer::Listen()
{
while (m_active == true)
{
m_socketReadSet = m_socketMasterSet;
int selected = select(m_maxSocketId + 1, &m_socketReadSet, NULL, NULL, 0);
if (selected <= 0)
continue;
bool accepted = Accept();
if (accepted == false)
{
ReceiveFromSockets();
}
}
}
bool TcpServer::Accept()
{
int listenerId = m_socket->GetId();
if (FD_ISSET(listenerId, &m_socketReadSet))
{
struct sockaddr_in remoteAddr;
int addrSize = sizeof(remoteAddr);
unsigned int newSockId = accept(listenerId, (struct sockaddr *)&remoteAddr, &addrSize);
if (newSockId == -1) // Invalid socket...
{
return false;
}
if (newSockId > m_maxSocketId)
{
m_maxSocketId = newSockId;
}
m_clientUniqueId++;
// Remembering the new socket, so we'll be able to check its state
// the next time.
FD_SET(newSockId, &m_socketMasterSet);
CommEndPoint remote(remoteAddr);
CommEndPoint local = m_socket->GetLocalPoint();
ClientId* client = new ClientId(m_clientUniqueId, newSockId, local, remote);
m_clients.Add(client);
StoreNewlyAcceptedClient(client);
char acceptedMsg = CommInternalServerMsg::ConnectionAccepted;
Server::Send(CommMessageType::Internal, client, &acceptedMsg, sizeof(acceptedMsg));
return true;
}
return false;
}
I hope it's enough :)
what's wrong with it?
By far the most common error with select() is not re-initializing the fd sets on every iteration. The second, third, and fourth arguments are updated by the call, so you have to populate them again every time.
Post more code, so people can actually help you.
Edit 0:
fd_set on Windows is a mess :)
It's not allowed to copy construct fd_set objects:
m_socketReadSet = m_socketMasterSet;
This combined with Nikolai's correct statement that select changes the set passed in probably accounts for your error.
poll (On Windows, WSAPoll) is a much friendlier API.
Windows also provides WSAEventSelect and (Msg)WaitForMultipleObjects(Ex), which don't have a direct equivalent on Unix but allow you to wait on sockets, files, thread synchronization events, timers, and UI messages at the same time.
I'm not sure if this is a known issue that I am running into, but I couldn't find a good search string that would give me any useful results.
Anyway, here's the basic rundown:
We've got a relatively simple application that takes data from a source (DB or file) and streams that data over TCP to connected clients as new data comes in. It's a relatively low number of clients; I would say at most 10 clients per server, so we have the following rough design:
client: connects to the server and sets itself to read (with the timeout set higher than the server heartbeat message frequency). It blocks on read.
server: one listening thread that accepts connections and then spawns a writer thread to read from the data source and write to the client. The writer thread is also detached (we use boost::thread, so we just call the .detach() function). It blocks on writes indefinitely, but does check errno for errors before writing. We start the servers using a single Perl script that calls "fork" for each server process.
The problem(s):
At seemingly random times, the client will shut down with a "connection terminated (SUCCESFUL)" message, indicating that the remote server shut down the socket on purpose. However, when this happens the SERVER application ALSO closes, without any errors or anything. It just crashes.
Now, to compound the problem, we have multiple instances of the server app being started by a startup script, running different files and different ports. When ONE of the servers crashes like this, ALL the servers crash out.
Both the server and the client use the same "Connection" library created in-house. It's mostly a C++ wrapper for the C socket calls.
Here's some rough code for the write and read functions in the Connection library:
int connectionTimeout_read = 60 * 60 * 1000;
int Socket::readUntil(char* buf, int amount) const
{
int readyFds = epoll_wait(epfd,epEvents,1,connectionTimeout_read);
if(readyFds < 0)
{
status = convertFlagToStatus(errno);
return 0;
}
if(readyFds == 0)
{
status = CONNECTION_TIMEOUT;
return 0;
}
int fd = epEvents[0].data.fd;
if( fd != socket)
{
status = CONNECTION_INCORRECT_SOCKET;
return 0;
}
int rec = recv(fd,buf,amount,MSG_WAITALL);
if(rec == 0)
status = CONNECTION_CLOSED;
else if(rec < 0)
status = convertFlagToStatus(errno);
else
status = CONNECTION_NORMAL;
lastReadBytes = rec;
return rec;
}
int Socket::write(const void* buf, int size) const
{
int readyFds = epoll_wait(epfd,epEvents,1,-1);
if(readyFds < 0)
{
status = convertFlagToStatus(errno);
return 0;
}
if(readyFds == 0)
{
status = CONNECTION_TERMINATED;
return 0;
}
int fd = epEvents[0].data.fd;
if(fd != socket)
{
status = CONNECTION_INCORRECT_SOCKET;
return 0;
}
if(epEvents[0].events != EPOLLOUT)
{
status = CONNECTION_CLOSED;
return 0;
}
int bytesWrote = ::send(socket, buf, size,0);
if(bytesWrote < 0)
status = convertFlagToStatus(errno);
lastWriteBytes = bytesWrote;
return bytesWrote;
}
Any help solving this mystery bug would be great! At the VERY least, I would like it to NOT crash out the server even if the client crashes (which is really strange to me, since there is no two-way communication).
Also, for reference, here is the server listening code:
while(server.getStatus() == connection::CONNECTION_NORMAL)
{
connection::Socket s = server.listen();
if(s.getStatus() != connection::CONNECTION_NORMAL)
{
fprintf(stdout,"failed to accept a socket. error: %s\n",connection::getStatusString(s.getStatus()));
}
DATASOURCE* dataSource;
dataSource = open_datasource(XXXX); /* edited */
if(dataSource == NULL)
{
fprintf(stdout,"FATAL ERROR. DATASOURCE NOT FOUND\n");
return;
}
boost::thread fileSender(Sender(s,dataSource));
fileSender.detach();
}
...And also here is the spawned child sending thread:
::signal(SIGPIPE,SIG_IGN);
//const int headerNeeds = 29;
const int BUFFERSIZE = 2000;
char buf[BUFFERSIZE];
bool running = true;
while(running)
{
memset(buf,'\0',BUFFERSIZE*sizeof(char));
unsigned int readBytes = 0;
while((readBytes = read_datasource(buf,sizeof(unsigned char),BUFFERSIZE,dataSource)) == 0)
{
boost::this_thread::sleep(boost::posix_time::milliseconds(1000));
}
socket.write(buf,readBytes);
if(socket.getStatus() != connection::CONNECTION_NORMAL)
running = false;
}
fprintf(stdout,"socket error: %s\n",connection::getStatusString(socket.getStatus()));
socket.close();
fprintf(stdout,"sender exiting...\n");
Any insights would be welcome! Thanks in advance.
You've probably got everything backwards... when the server crashes, the OS will close all sockets. So the server crash happens first and causes the client to get the disconnect message (FIN flag in a TCP segment, actually), the crash is not a result of the socket closing.
Since you have multiple server processes crashing at the same time, I'd look at resources they share, and also any scheduled tasks that all servers would try to execute at the same time.
EDIT: You don't have a single client connecting to multiple servers, do you? Note that TCP connections are always bidirectional, so the server process does get feedback if a client disconnects. Some internet providers have even been caught generating RST packets on connections that fail some test for suspicious traffic.
Write a signal handler. Make sure it uses only raw I/O functions to log problems (open, write, close, not fwrite, not printf).
Check return values. Check for negative return value from write on a socket, but check all return values.
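As a rough illustration of those two pieces of advice, the sketch below installs a handler that logs with write(2) only, ignores SIGPIPE process-wide, and wraps write in a loop that checks every return value. It is a POSIX-only sketch under assumptions: the names logFatalSignal, installHandlers and writeAll are hypothetical, and the choice of SIGTERM and SIGSEGV as the signals to log is just an example.

```cpp
#include <cerrno>
#include <cstddef>
#include <signal.h>
#include <unistd.h>

// Async-signal-safe handler: it uses only write(2) and _exit(2), no stdio.
extern "C" void logFatalSignal(int sig)
{
    static const char msg[] = "server: fatal signal received, exiting\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    (void)sig;
    _exit(1);
}

// Ignores SIGPIPE for the whole process and logs SIGTERM/SIGSEGV before exiting.
bool installHandlers()
{
    struct sigaction ignore{};
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);
    if (sigaction(SIGPIPE, &ignore, nullptr) != 0) return false;

    struct sigaction fatal{};
    fatal.sa_handler = logFatalSignal;
    sigemptyset(&fatal.sa_mask);
    return sigaction(SIGTERM, &fatal, nullptr) == 0 &&
           sigaction(SIGSEGV, &fatal, nullptr) == 0;
}

// Writes all `size` bytes to fd. Returns false with errno set on the first
// error that is not an interrupted call (for example EPIPE after the peer closed).
bool writeAll(int fd, const char* data, std::size_t size)
{
    while (size > 0) {
        ssize_t n = write(fd, data, size);
        if (n < 0) {
            if (errno == EINTR) continue;   // interrupted by a signal: retry
            return false;
        }
        data += n;
        size -= static_cast<std::size_t>(n);
    }
    return true;
}
```

With SIGPIPE ignored, a write to a connection whose peer has gone away fails with EPIPE instead of killing the process, so the sender thread can log the error and exit on its own. A handler like this cannot catch SIGKILL, so it would not have logged anything if the processes were being killed that way.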
Thanks for all the comments and suggestions.
After looking through the code and adding the signal handling as Ben suggested, the applications themselves are far more stable. Thank you for all your input.
The original problem, however, was due to a rogue script that one of the admins was running as root that would randomly kill certain processes on the server-side machine (I won't get into what it was trying to do in reality; suffice it to say it was buggy).
Lesson learned: check the environment.
Thank you all for the advice.