libircclient : Selective connection absolutely impossible to debug - c++

I'm not usually the type to post a question, and more to search why something doesn't work first, but this time I did everything I could, and I just can't figure out what is wrong.
So here's the thing:
I'm currently programming an IRC Bot, and I'm using libircclient, a small C library to handle IRC connections. It's working pretty great, it does the job and is kinda easy to use, but ...
I'm connecting to two different servers, and so I'm using the custom networking loop, which uses the select function. On my personal computer, there's no problem with this loop, and everything works great.
But (Here's the problem), on my remote server, where the bot will be hosted, I can connect to one server but not the other.
I tried to debug everything I could. I even went to examine the sources of libircclient, to see how it worked, and put some printfs where I could, and I could see where does it comes from, but I don't understand why it does this.
So here's the code for the server (The irc_session_t objects are encapsulated, but it's normally kinda easy to understand. Feel free to ask for more informations if you want to):
// Connect the first session
first.connect();
// Connect the osu! session
second.connect();
// Initialize sockets sets
fd_set sockets, out_sockets;
// Initialize sockets count
int sockets_count;
// Initialize timeout struct
struct timeval timeout;
// Set running as true
running = true;
// While the server is running (Which means always)
while (running)
{
// First session has disconnected
if (!first.connected())
// Reconnect it
first.connect();
// Second session has disconnected
if (!second.connected())
// Reconnect it
second.connect();
// Reset timeout values
timeout.tv_sec = 1;
timeout.tv_usec = 0;
// Reset sockets count
sockets_count = 0;
// Reset sockets and out sockets
FD_ZERO(&sockets);
FD_ZERO(&out_sockets);
// Add sessions descriptors
irc_add_select_descriptors(first.session(), &sockets, &out_sockets, &sockets_count);
irc_add_select_descriptors(second.session(), &sockets, &out_sockets, &sockets_count);
// Select something. If it went wrong
int available = select(sockets_count + 1, &sockets, &out_sockets, NULL, &timeout);
// Error
if (available < 0)
// Error
Utils::throw_error("Server", "run", "Something went wrong when selecting a socket");
// We have a socket
if (available > 0)
{
// If there was something wrong when processing the first session
if (irc_process_select_descriptors(first.session(), &sockets, &out_sockets))
// Error
Utils::throw_error("Server", "run", Utils::string_format("Error with the first session: %s", first.get_error()));
// If there was something wrong when processing the second session
if (irc_process_select_descriptors(second.session(), &sockets, &out_sockets))
// Error
Utils::throw_error("Server", "run", Utils::string_format("Error with the second session: %s", second.get_error()));
}
The problem in this code is that this line:
irc_process_select_descriptors(second.session(), &sockets, &out_sockets)
Always return an error the first time it's called, and only for one server. The weird thing is that on my Windows computer, it works perfectly, while on the Ubuntu server, it just doesn't want to, and I just can't understand why.
I did some in-depth debug, and I saw that libircclient does this:
if (session->state == LIBIRC_STATE_CONNECTING && FD_ISSET(session->sock, out_set))
And this is where everything goes wrong. The session state is correctly set to LIBIRC_STATE_CONNECTING, but the second thing, FD_ISSET(session->sock, out_set) always return false. It returns true for the first session, but for the second session, never.
The two servers are irc.twitch.tv:6667 and irc.ppy.sh:6667. The servers are correctly set, and the server passwords are correct too, since everything works fine on my personal computer.
Sorry for the very long post.
Thanks in advance !

Alright, after some hours of debug, I finally got the problem.
So when a session is connected, it will enter in the LIBIRC_STATE_CONNECTING state, and then when calling irc_process_select_descriptors, it will check this:
if (session->state == LIBIRC_STATE_CONNECTING && FD_ISSET(session->sock, out_set))
The problem is that select() will alter the sockets sets, and will remove all the sets that are not relevant.
So if the server didn't send any messages before calling the irc_process_select_descriptors, FD_ISSET will return 0, because select() thought that this socket is not relevant.
I fixed it by just writing
if (session->state == LIBIRC_STATE_CONNECTING)
{
if(!FD_ISSET(session->sock, out_set))
return 0;
...
}
So it will make the program wait until the server has sent us anything.
Sorry for not having checked everything !

Related

I need help figuring out tcp sockets (clsocket)

I am having trouble figuring out sockets i am just asking the server for data at a position (glm::i64vec4) and expecting a response but the position gets way off when i get the response and the data for that position reflects that (aka my voxel game make a kinda cool looking but useless mess)
It's probably just me not understanding sockets whatsoever or maybe something weird with this library
one thought i had is it was maybe something to do with mismatching blocking and non blocking on the server and client
but when i switched the server to blocking (and put each client in a seperate thread from each other and the accepting process) it did nothing
if i'm doing something really stupid please tell me i know next to nothing about sockets
here is some code that probably looks horrible
Server Code
std::deque <CActiveSocket*> clients;
CPassiveSocket socket;
socket.Initialize();
socket.SetNonblocking();//I'm doing this so i don't need multiple threads for clients
socket.Listen("0.0.0.0",port);
while (1){
{
CActiveSocket* c;
if ((c = socket.Accept()) != NULL){
clients.emplace_back(c);
}
}
for (CActiveSocket*& c : clients){
c->Receive(sizeof(glm::i64vec4));
if (c->GetBytesReceived() == sizeof(glm::i64vec4)){
chkpkt chk;
chk.pos = *(glm::i64vec4*)c->GetData();
LOOP3D(chksize+2){
chk.data(i,j,k).val = chk.pos.y*chksize+j;
chk.data(i,j,k).id=0;
}
while (c->Send((uint8*)&chk,sizeof(chkpkt)) != sizeof(chkpkt)){}
}
}
}
Client Code
//v is a glm::i64vec4
//fsock is set to Blocking
if(fsock.Send((uint8*)&v,sizeof(glm::i64vec4)))
if (fsock.Receive(sizeof(chkpkt))){
tthread::lock_guard<tthread::fast_mutex> lock(wld->filemut);
wld->ichks[v]=(*(chkpkt*)fsock.GetData()).data;//i tried using the position i get back from the server to set this (instead of v) but that made it to where nothing loaded
//i checked it and the chunks position never lines up with what i sent
}
Without your complete application codes it's unrealistic to offer any suggestions of any particular lines of code correction.
But it seems like you are using this library. It doesn't matter if not, because most of time when doing network programming, socket's weird behavior make some problems somewhat universal. Thus there are a few suggestions for the portion of socket application in your project:
It suffices to have BLOCKING sockets.
Most of time socket's read have somewhat weird behavior, that is, it might not receive the requested size of bytes at a time. Due to this, you need to repeatedly call read until the receiving buffer is read thoroughly. For a complete and robust solution you can refer to Stevens's readn routine ([Ref.1], page122).
If you are using exactly the library mentioned above, you can see that your fsock.Receive eventually calls recv. And recv is just an variant of read[Ref.2], thus the solutions for both of them are just identical. And this pattern might help:
while(fsock.Receive(sizeof(chkpkt))>0)
{
// ...
}
Ref.1: https://mathcs.clarku.edu/~jbreecher/cs280/UNIX%20Network%20Programming(Volume1,3rd).pdf
Ref.2: https://man7.org/linux/man-pages/man2/recv.2.html#DESCRIPTION

Winsock - Client disconnected, closesocket loop / maximum connections

I am learning Winsock and trying to create some easy programs to get to know it. I managed to create server which can handle multiple connections also manage them and client according to all tutorials, it is working how it was supposed to but :
I tried to make loop where I check if any of clients has disconnected and if it has, I wanted to close it.
I managed to write something which would check if socket is disconnected but it does not connect 2 or more sockets at one time
Anyone can give me reply how to make working loop checking through every client if it has disconnected and close socket ? It is all to make something like max clients connected to server at one time. Thanks in advance.
while (true)
{
ConnectingSocket = accept (ListeningSocket, (SOCKADDR*)&addr, &addrlen);
if (ConnectingSocket!=INVALID_SOCKET)
{
Connections[ConnectionsCounter] = ConnectingSocket;
char *Name = new char[64];
ZeroMemory (Name,64);
sprintf (Name, "%i",ConnectionsCounter);
send (Connections[ConnectionsCounter],Name,64,0);
cout<<"New connection !\n";
ConnectionsCounter++;
char data;
if (ConnectionsCounter>0)
{
for (int i=0;i<ConnectionsCounter;i++)
{
if (recv(Connections[i],&data,1, MSG_PEEK))
{
closesocket(Connections[i]);
cout<<"Connection closed.\n";
ConnectionsCounter=ConnectionsCounter-1;
}
}
}
}
Sleep(50);
}
it seems that you want to manage multiple connections using a single thread. right?
Briefly socket communication has two mode, block and non-block. The default one is block mode. let's focus your code:
for (int i=0;i<ConnectionsCounter;i++)
{
if (recv(Connections[i],&data,1, MSG_PEEK))
{
closesocket(Connections[i]);
cout<<"Connection closed.\n";
ConnectionsCounter=ConnectionsCounter-1;
}
}
In the above code, you called the recv function. and it will block until peer has sent msg to you, or peer closed the link. So, if you have two connection now namely Connections[0] and Connections[1]. If you were recv Connections[0], at the same time, the Connections[1] has disconnected, you were not know it. because you were blocking at recv(Connections[0]). when the Connections[0] sent msg to you or it closed the socket, then loop continue, finally you checked it disconnect, even through it disconnected 10 minutes ago.
To solve it, I think you need a book Network Programming for Microsoft Windows . There are some method, such as one thread one socket pattern, asynchronous communication mode, non-block mode, and so on.
Forgot to point out the bug, pay attention here:
closesocket(Connectons[i]);
cout<<"Connection closed.\n";
ConnectionsCounter=ConnectionsCounter-1;
Let me give an example to illustrate it. now we have two Connections with index 0 and 1, and then ConnectionsCount should be 2, right? When the Connections[0] is disconnected, the ConnectionsCounter is changed from 2 to 1. and loop exit, a new client connected, you save the new client socket as Connections[ConnectionsCounter(=1)] = ConnectingSocket; oops, gotting an bug. because the disconnected socket's index is 0, and index 1 was used by another link. you are reusing the index 1.
why not try to use vector to save the socket.
hope it helps~

zeromq: reset REQ/REP socket state

When you use the simple ZeroMQ REQ/REP pattern you depend on a fixed send()->recv() / recv()->send() sequence.
As this article describes you get into trouble when a participant disconnects in the middle of a request because then you can't just start over with receiving the next request from another connection but the state machine would force you to send a request to the disconnected one.
Has there emerged a more elegant way to solve this since the mentioned article has been written?
Is reconnecting the only way to solve this (apart from not using REQ/REP but use another pattern)
As the accepted answer seem so terribly sad to me, I did some research and have found that everything we need was actually in the documentation.
The .setsockopt() with the correct parameter can help you resetting your socket state-machine without brutally destroy it and rebuild another on top of the previous one dead body.
(yeah I like the image).
ZMQ_REQ_CORRELATE: match replies with requests
The default behaviour of REQ sockets is to rely on the ordering of messages to match requests and responses and that is usually sufficient. When this option is set to 1, the REQ socket will prefix outgoing messages with an extra frame containing a request id. That means the full message is (request id, 0, user frames…). The REQ socket will discard all incoming messages that don't begin with these two frames.
Option value type int
Option value unit 0, 1
Default value 0
Applicable socket types ZMQ_REQ
ZMQ_REQ_RELAXED: relax strict alternation between request and reply
By default, a REQ socket does not allow initiating a new request with zmq_send(3) until the reply to the previous one has been received. When set to 1, sending another message is allowed and has the effect of disconnecting the underlying connection to the peer from which the reply was expected, triggering a reconnection attempt on transports that support it. The request-reply state machine is reset and a new request is sent to the next available peer.
If set to 1, also enable ZMQ_REQ_CORRELATE to ensure correct matching of requests and replies. Otherwise a late reply to an aborted request can be reported as the reply to the superseding request.
Option value type int
Option value unit 0, 1
Default value 0
Applicable socket types ZMQ_REQ
A complete documentation is here
The good news is that, as of ZMQ 3.0 and later (the modern era), you can set a timeout on a socket. As others have noted elsewhere, you must do this after you have created the socket, but before you connect it:
zmq_req_socket.setsockopt( zmq.RCVTIMEO, 500 ) # milliseconds
Then, when you actually try to receive the reply (after you have sent a message to the REP socket), you can catch the error that will be asserted if the timeout is exceeded:
try:
send( message, 0 )
send_failed = False
except zmq.Again:
logging.warning( "Image send failed." )
send_failed = True
However! When this happens, as observed elsewhere, your socket will be in a funny state, because it will still be expecting the response. At this point, I cannot find anything that works reliably other than just restarting the socket. Note that if you disconnect() the socket and then re connect() it, it will still be in this bad state. Thus you need to
def reset_my_socket:
zmq_req_socket.close()
zmq_req_socket = zmq_context.socket( zmq.REQ )
zmq_req_socket.setsockopt( zmq.RCVTIMEO, 500 ) # milliseconds
zmq_req_socket.connect( zmq_endpoint )
You will also notice that because I close()d the socket, the receive timeout option was "lost", so it is important set that on the new socket.
I hope this helps. And I hope that this does not turn out to be the best answer to this question. :)
There is one solution to this and that is adding timeouts to all calls. Since ZeroMQ by itself does not really provide simple timeout functionality I recommend using a subclass of the ZeroMQ socket that adds a timeout parameter to all important calls.
So, instead of calling s.recv() you would call s.recv(timeout=5.0) and if a response does not come back within that 5 second window it will return None and stop blocking. I had made a futile attempt at this when I run into this problem.
I'm actually looking into this at the moment, because I am retro fitting a legacy system.
I am coming across code constantly that "needs" to know about the state of the connection. However the thing is I want to move to the message passing paradigm that the library promotes.
I found the following function : zmq_socket_monitor
What it does is monitor the socket passed to it and generate events that are then passed to an "inproc" endpoint - at that point you can add handling code to actually do something.
There is also an example (actually test code) here : github
I have not got any specific code to give at the moment (maybe at the end of the week) but my intention is to respond to the connect and disconnects such that I can actually perform any resetting of logic required.
Hope this helps, and despite quoting 4.2 docs, I am using 4.0.4 which seems to have the functionality
as well.
Note I notice you talk about python above, but the question is tagged C++ so that's where my answer is coming from...
Update: I'm updating this answer with this excellent resource here: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ Socket programming is complicated so do checkout the references in this post.
None of the answers here seem accurate or useful. The OP is not looking for information on BSD socket programming. He is trying to figure out how to robustly handle accept()ed client-socket failures in ZMQ on the REP socket to prevent the server from hanging or crashing.
As already noted -- this problem is complicated by the fact that ZMQ tries to pretend that the servers listen()ing socket is the same as an accept()ed socket (and there is no where in the documentation that describes how to set basic timeouts on such sockets.)
My answer:
After doing a lot of digging through the code, the only relevant socket options passed along to accept()ed socks seem to be keep alive options from the parent listen()er. So the solution is to set the following options on the listen socket before calling send or recv:
void zmq_setup(zmq::context_t** context, zmq::socket_t** socket, const char* endpoint)
{
// Free old references.
if(*socket != NULL)
{
(**socket).close();
(**socket).~socket_t();
}
if(*context != NULL)
{
// Shutdown all previous server client-sockets.
zmq_ctx_destroy((*context));
(**context).~context_t();
}
*context = new zmq::context_t(1);
*socket = new zmq::socket_t(**context, ZMQ_REP);
// Enable TCP keep alive.
int is_tcp_keep_alive = 1;
(**socket).setsockopt(ZMQ_TCP_KEEPALIVE, &is_tcp_keep_alive, sizeof(is_tcp_keep_alive));
// Only send 2 probes to check if client is still alive.
int tcp_probe_no = 2;
(**socket).setsockopt(ZMQ_TCP_KEEPALIVE_CNT, &tcp_probe_no, sizeof(tcp_probe_no));
// How long does a con need to be "idle" for in seconds.
int tcp_idle_timeout = 1;
(**socket).setsockopt(ZMQ_TCP_KEEPALIVE_IDLE, &tcp_idle_timeout, sizeof(tcp_idle_timeout));
// Time in seconds between individual keep alive probes.
int tcp_probe_interval = 1;
(**socket).setsockopt(ZMQ_TCP_KEEPALIVE_INTVL, &tcp_probe_interval, sizeof(tcp_probe_interval));
// Discard pending messages in buf on close.
int is_linger = 0;
(**socket).setsockopt(ZMQ_LINGER, &is_linger, sizeof(is_linger));
// TCP user timeout on unacknowledged send buffer
int is_user_timeout = 2;
(**socket).setsockopt(ZMQ_TCP_MAXRT, &is_user_timeout, sizeof(is_user_timeout));
// Start internal enclave event server.
printf("Host: Starting enclave event server\n");
(**socket).bind(endpoint);
}
What this does is tell the operating system to aggressively check the client socket for timeouts and reap them for cleanup when a client doesn't return a heart beat in time. The result is that the OS will send a SIGPIPE back to your program and socket errors will bubble up to send / recv - fixing a hung server. You then need to do two more things:
1. Handle SIGPIPE errors so the program doesn't crash
#include <signal.h>
#include <zmq.hpp>
// zmq_setup def here [...]
int main(int argc, char** argv)
{
// Ignore SIGPIPE signals.
signal(SIGPIPE, SIG_IGN);
// ... rest of your code after
// (Could potentially also restart the server
// sock on N SIGPIPEs if you're paranoid.)
// Start server socket.
const char* endpoint = "tcp://127.0.0.1:47357";
zmq::context_t* context;
zmq::socket_t* socket;
zmq_setup(&context, &socket, endpoint);
// Message buffers.
zmq::message_t request;
zmq::message_t reply;
// ... rest of your socket code here
}
2. Check for -1 returned by send or recv and catch ZMQ errors.
// E.g. skip broken accepted sockets (pseudo-code.)
while (1):
{
try
{
if ((*socket).recv(&request)) == -1)
throw -1;
}
catch (...)
{
// Prevent any endless error loops killing CPU.
sleep(1)
// Reset ZMQ state machine.
try
{
zmq::message_t blank_reply = zmq::message_t();
(*socket).send (blank_reply);
}
catch (...)
{
1;
}
continue;
}
Notice the weird code that tries to send a reply on a socket failure? In ZMQ, a REP server "socket" is an endpoint to another program making a REQ socket to that server. The result is if you go do a recv on a REP socket with a hung client, the server sock becomes stuck in a broken receive loop where it will wait forever to receive a valid reply.
To force an update on the state machine, you try send a reply. ZMQ detects that the socket is broken, and removes it from its queue. The server socket becomes "unstuck", and the next recv call returns a new client from the queue.
To enable timeouts on an async client (in Python 3), the code would look something like this:
import asyncio
import zmq
import zmq.asyncio
#asyncio.coroutine
def req(endpoint):
ms = 2000 # In milliseconds.
sock = ctx.socket(zmq.REQ)
sock.setsockopt(zmq.SNDTIMEO, ms)
sock.setsockopt(zmq.RCVTIMEO, ms)
sock.setsockopt(zmq.LINGER, ms) # Discard pending buffered socket messages on close().
sock.setsockopt(zmq.CONNECT_TIMEOUT, ms)
# Connect the socket.
# Connections don't strictly happen here.
# ZMQ waits until the socket is used (which is confusing, I know.)
sock.connect(endpoint)
# Send some bytes.
yield from sock.send(b"some bytes")
# Recv bytes and convert to unicode.
msg = yield from sock.recv()
msg = msg.decode(u"utf-8")
Now you have some failure scenarios when something goes wrong.
By the way -- if anyone's curious -- the default value for TCP idle timeout in Linux seems to be 7200 seconds or 2 hours. So you would be waiting a long time for a hung server to do anything!
Sources:
https://github.com/zeromq/libzmq/blob/84dc40dd90fdc59b91cb011a14c1abb79b01b726/src/tcp_listener.cpp#L82 TCP keep alive options preserved for client sock
http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/ How does keep alive work
https://github.com/zeromq/libzmq/blob/master/builds/zos/README.md Handling sig pipe errors
https://github.com/zeromq/libzmq/issues/2586 for information on closing sockets
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
https://github.com/zeromq/libzmq/issues/976
Disclaimer:
I've tested this code and it seems to be working, but ZMQ does complicate testing this a fair bit because the client re-connects on failure? If anyone wants to use this solution in production, I recommend writing some basic unit tests, first.
The server code could also be improved a lot with threading or polling to be able to handle multiple clients at once. As it stands, a malicious client can temporarily take up resources from the server (3 second timeout) which isn't ideal.

mysql reconnect c++

Right now I have a C++ client application that uses mysql.h to connect to a MYSQL database and have to preform some logic in case there is a disconnect. I'm wondering if this is the best way to reconnect to a MYSQL database in a situation where my client gets disconnected.
bool MYSQL::Reconnect(const char *host, const char *user, const char *passwd, const char *db)
{
bool out = false;
pid_t command_pid = fork();
if (command_pid == 0)
{
while(1)
{
sleep(1);
if (mysql_real_connect(&m_mysql, host, user, passwd, db, 0, NULL, 0) == NULL )
{
fprintf(stderr, "Failed to connect to database: Error: %s\n",
mysql_error(&m_mysql));
}
else
{
m_connected = true;
out = true;
break;
}
}
exit(0);
}
if (command_pid < 0)
fprintf(stderr, "Could not fork process[reconnect]: %s\n", mysql_error(&m_mysql));
return out;
}
Right now i take in all my parameters and preform a fork. the child process attempts to reconnect every second with a sleep() statement. Is this a good way to do this? Thanks
Sorry, but your code doesn't do what you think it does, Kaiser Wilhelm.
In essence, you're trying to treat a fork like a thread, which it is not.
When you fork a child, the parent process is completely cloned, including file and socket descriptors, which is how your program is connected to the MySQL database server. That is, both the parent and the child end up with their own copy of the same connection to the database server when you fork. I assume the parent only calls this Reconnect() method when it sees the connection drop, and stops using its copy of the now-defunct MySQL connection object, m_mysql. If so, the parent's copy of the connection is just as useless as the client's when you start the reconnect operation.
The thing is, the reverse is not also true: once the child manages to reconnect to the database server, the parent's connection object remains defunct. Nothing the child does propagates back up to the parent. After the fork, the two processes are completely independent, except insofar as they might try to access some I/O resource they initially shared. For example, if you called this Reconnect() while the connection was up and continued using the connection in the parent, the child's attempts to talk to the DB server on the same connection would confuse either mysqld or libmysqlclient, likely causing data corruption or a crash.
As hinted above, one solution to this is to use threads instead of forking. Beware, however, of the many problems with using threads with the MySQL C API.
Given a choice, I'd rather use asynchronous I/O to do the background connection attempt within the application's main thread, but the MySQL C API doesn't allow that.
It seems you're trying to avoid blocking your main application thread while attempting the DB server reconnection. It may be that you can get away with doing it synchronously anyway by setting the connect timeout to 1 second, which is fine when the MySQL server is on the same machine or same LAN as the client. If you could tolerate your main thread blocking for up to a second for connection attempts to fail — worst case happening when the server is on a separate machine and it's physically disconnected or firewalled — this would probably be a cleaner solution than threads. The connection attempt can fail much quicker if the server machine is still running and the port isn't firewalled, such as when it is rebooting and the TCP/IP stack is [still] up.
As far as I can tell, this doesn't do what you intended.
Logical issues
Reconnect doesn't "perform some logic in case there is a disconnect" at all.
It attempts to connect over and over again until it succeeds, then stops. That's it. The state of the connection is never checked again. If the connection drops, this code knows nothing about it.
Technical issues
Also pay close attention to the technical issues that Warren raises.
Sure, it's perfectly OK. You might want to think about replacing the while ( 1 ) loop with something like
while ( NULL == mysql_real_connect( ... )) {
sleep( 1 );
...
}
which is the kind of idiom that one learns by practice, but your code works just fine as far as I can see. Don't forget to put a counter inside the while loop.

Socket in use error when reusing sockets

I am writing an XMLRPC client in c++ that is intended to talk to a python XMLRPC server.
Unfortunately, at this time, the python XMLRPC server is only capable of fielding one request on a connection, then it shuts down, I discovered this thanks to mhawke's response to my previous query about a related subject
Because of this, I have to create a new socket connection to my python server every time I want to make an XMLRPC request. This means the creation and deletion of a lot of sockets. Everything works fine, until I approach ~4000 requests. At this point I get socket error 10048, Socket in use.
I've tried sleeping the thread to let winsock fix its file descriptors, a trick that worked when a python client of mine had an identical issue, to no avail.
I've tried the following
int err = setsockopt(s_,SOL_SOCKET,SO_REUSEADDR,(char*)TRUE,sizeof(BOOL));
with no success.
I'm using winsock 2.0, so WSADATA::iMaxSockets shouldn't come into play, and either way, I checked and its set to 0 (I assume that means infinity)
4000 requests doesn't seem like an outlandish number of requests to make during the run of an application. Is there some way to use SO_KEEPALIVE on the client side while the server continually closes and reopens?
Am I totally missing something?
The problem is being caused by sockets hanging around in the TIME_WAIT state which is entered once you close the client's socket. By default the socket will remain in this state for 4 minutes before it is available for reuse. Your client (possibly helped by other processes) is consuming them all within a 4 minute period. See this answer for a good explanation and a possible non-code solution.
Windows dynamically allocates port numbers in the range 1024-5000 (3977 ports) when you do not explicitly bind the socket address. This Python code demonstrates the problem:
import socket
sockets = []
while True:
s = socket.socket()
s.connect(('some_host', 80))
sockets.append(s.getsockname())
s.close()
print len(sockets)
sockets.sort()
print "Lowest port: ", sockets[0][1], " Highest port: ", sockets[-1][1]
# on Windows you should see something like this...
3960
Lowest port: 1025 Highest port: 5000
If you try to run this immeditaely again, it should fail very quickly since all dynamic ports are in the TIME_WAIT state.
There are a few ways around this:
Manage your own port assignments and
use bind() to explicitly bind your
client socket to a specific port
that you increment each time your
create a socket. You'll still have
to handle the case where a port is
already in use, but you will not be
limited to dynamic ports. e.g.
port = 5000
while True:
s = socket.socket()
s.bind(('your_host', port))
s.connect(('some_host', 80))
s.close()
port += 1
Fiddle with the SO_LINGER socket
option. I have found that this
sometimes works in Windows (although
not exactly sure why):
s.setsockopt(socket.SOL_SOCKET,
socket.SO_LINGER, 1)
I don't know if this will help in
your particular application,
however, it is possible to send
multiple XMLRPC requests over the
same connection using the
multicall method. Basically
this allows you to accumulate
several requests and then send them
all at once. You will not get any
responses until you actually send
the accumulated requests, so you can
essentially think of this as batch
processing - does this fit in with
your application design?
Update:
I tossed this into the code and it seems to be working now.
if(::connect(s_, (sockaddr *) &addr, sizeof(sockaddr)))
{
int err = WSAGetLastError();
if(err == 10048) //if socket in user error, force kill and reopen socket
{
closesocket(s_);
WSACleanup();
WSADATA info;
WSAStartup(MAKEWORD(2,0), &info);
s_ = socket(AF_INET,SOCK_STREAM,0);
setsockopt(s_,SOL_SOCKET,SO_REUSEADDR,(char*)&x,sizeof(BOOL));
}
}
Basically, if you encounter the 10048 error (socket in use), you can simply close the socket, call cleanup, and restart WSA, the reset the socket and its sockopt
(the last sockopt may not be necessary)
i must have been missing the WSACleanup/WSAStartup calls before, because closesocket() and socket() were definitely being called
this error only occurs once every 4000ish calls.
I am curious as to why this may be, even though this seems to fix it.
If anyone has any input on the subject i would be very curious to hear it
Do you close the sockets after using it?