Winsock sendto returns error 10049 (WSAEADDRNOTAVAIL) for broadcast address after network adapter is disabled or physically disconnected - c++

I am working on a p2p application and to make testing simple, I am currently using udp broadcast for the peer discovery in my local network. Each peer binds one udp socket to port 29292 of the ip address of each local network interface (discovered via GetAdaptersInfo) and each socket periodically sends a packet to the broadcast address of its network interface/local address. The sockets are set to allow port reuse (via setsockopt SO_REUSEADDR), which enables me to run multiple peers on the same local machine without any conflicts. In this case there is only a single peer on the entire network though.
This all works perfectly fine (tested with 2 peers on 1 machine and 2 peers on 2 machines) UNTIL a network interface is disconnected. When deactivacting the network adapter of either my wifi or an USB-to-LAN adapter in the windows dialog, or just plugging the usb cable of the adapter, the next call to sendto will fail with return code 10049. It doesn't matter if the other adapter is still connected, or was at the beginning, it will fail. The only thing that doesn't make it fail is deactivating wifi through the fancy win10 dialog through the taskbar, but that isn't really a surprise because that doesn't deactivate or remove the adapter itself.
I initially thought that this makes sense because when the nic is gone, how should the system route the packet. But: The fact that the packet can't reach its target has absolutely nothing to do with the address itsself being invalid (which is what the error means), so I suspect I am missing something here. I was looking for any information I could use to detect this case and distinguish it from simply trying to sendto INADDR_ANY, but I couldn't find anything. I started to log every bit of information which I suspected could have changed, but its all the same on a successfull sendto and the one that crashes (retrieved via getsockopt):
250 16.24746[886] [debug|debug] local address: 192.168.178.35
251 16.24812[886] [debug|debug] no remote address
252 16.25333[886] [debug|debug] type: SOCK_DGRAM
253 16.25457[886] [debug|debug] protocol: IPPROTO_UDP
254 16.25673[886] [debug|debug] broadcast: 1, dontroute: 0, max_msg_size: 65507, rcv_buffer: 65536, rcv_timeout: 0, reuse_addr: 1, snd_buffer: 65536, sdn_timeout: 0
255 16.25806[886] [debug|debug] Last WSA error on socket was WSA Error Code 0: The operation completed successfully.
256 16.25916[886] [debug|debug] target address windows formatted: 192.168.178.255
257 16.25976[886] [debug|debug] target address 192.168.178.255:29292
258 16.26138[886] [debug|assert] ASSERT FAILED at D:\Workspaces\spaced\source\platform\win32_platform.cpp:4141: sendto failed with (unhandled) WSA Error Code 10049: The requested address is not valid in its context.
The nic that got removed is this one:
1.07254[0] [platform|info] Discovered Network Interface "Realtek USB GbE Family Controller" with IP 192.168.178.35 and Subnet 255.255.255.0
And this is the code that does the sending (dlog_socket_information_and_last_wsaerror generates all the output that is gathered using getsockopt):
void send_slice_over_udp_socket(Socket_Handle handle, Slice<d_byte> buffer, u32 remote_ip, u16 remote_port){
PROFILE_FUNCTION();
auto socket = (UDP_Socket*) sockets[handle.handle];
ASSERT_VALID_UDP_SOCKET(socket);
dlog_socket_information_and_last_wsaerror(socket);
if(socket->is_dummy)
return;
if(buffer.size == 0)
return;
DASSERT(socket->state == Socket_State::created);
u64 bytes_left = buffer.size;
sockaddr_in target_socket_address = create_socket_address(remote_ip, remote_port);
#pragma warning(push)
#pragma warning(disable: 4996)
dlog("target address windows formatted: %s", inet_ntoa(target_socket_address.sin_addr));
#pragma warning(pop)
unsigned char* parts = (unsigned char*)&remote_ip;
dlog("target address %hhu.%hhu.%hhu.%hhu:%hu", parts[3], parts[2], parts[1], parts[0], remote_port);
int sent_bytes = sendto(socket->handle, (char*) buffer.data, bytes_left > (u64) INT32_MAX ? INT32_MAX : (int) bytes_left, 0, (sockaddr*)&target_socket_address, sizeof(target_socket_address));
if(sent_bytes == SOCKET_ERROR){
#define LOG_WARNING(message) log_nonreproducible(message, Category::platform_network, Severity::warning, socket->handle); return;
switch(WSAGetLastError()){
//#TODO handle all (more? I guess many should just be asserted since they should never happen) cases
case WSAEHOSTUNREACH: LOG_WARNING("socket %lld, send failed: The remote host can't be reached at this time.");
case WSAECONNRESET: LOG_WARNING("socket %lld, send failed: Multiple UDP packet deliveries failed. According to documentation we should close the socket. Not sure if this makes sense, this is a UDP port after all. Closing the socket wont change anything, right?");
case WSAENETUNREACH: LOG_WARNING("socket %lld, send failed: the network cannot be reached from this host at this time.");
case WSAETIMEDOUT: LOG_WARNING("socket %lld, send failed: The connection has been dropped, because of a network failure or because the system on the other end went down without notice.");
case WSAEADDRNOTAVAIL:
case WSAENETRESET:
case WSAEACCES:
case WSAEWOULDBLOCK: //can this even happen on a udp port? I expect this to be fire-and-forget-style.
case WSAEMSGSIZE:
case WSANOTINITIALISED:
case WSAENETDOWN:
case WSAEINVAL:
case WSAEINTR:
case WSAEINPROGRESS:
case WSAEFAULT:
case WSAENOBUFS:
case WSAENOTCONN:
case WSAENOTSOCK:
case WSAEOPNOTSUPP:
case WSAESHUTDOWN:
case WSAECONNABORTED:
case WSAEAFNOSUPPORT:
case WSAEDESTADDRREQ:
ASSERT(false, tprint_last_wsa_error_as_formatted_message("sendto failed with (unhandled) ")); break;
default: ASSERT(false, tprint_last_wsa_error_as_formatted_message("sendto failed with (undocumented) ")); //The switch case above should have been exhaustive. This is a bug. We either forgot a case, or maybe the docs were lying? (That happened to me on android. Fun times. Well. Not really.)
}
#undef LOG_WARNING
}
DASSERT(sent_bytes >= 0);
total_bytes_sent += (u64) sent_bytes;
bytes_left -= (u64) sent_bytes;
DASSERT(bytes_left == 0);
}
The code that generates the address from ip and port looks like this:
sockaddr_in create_socket_address(u32 ip, u16 port){
sockaddr_in address_info;
address_info.sin_family = AF_INET;
address_info.sin_port = htons(port);
address_info.sin_addr.s_addr = htonl(ip);
memset(address_info.sin_zero, 0, 8);
return address_info;
}
The error seems to be a little flaky. It reproduces 100% of the time until it decides not to anymore. After a restart its usually back.
I am looking for a solution to handle this case correctly. I could of course just re-do the network interface discovery when the error occurs, because I "know" that I don't give any broken IPs to sendto, but that would just be a heuristic. I want to solve the actual problem.
I also don't quite understand when error 10049 is supposed to fire exactly anyway. Is it just if I pass an ipv6 address to a ipv4 socket, or send to 0.0.0.0? There is no flat out "illegal" ipv4 address after all, just ones that don't make sense from context.
If you know what I am missing here, please let me know!

This is a issue people have been facing up for a while , and people suggested to read the documentation provided by Microsoft on the following issue .
"Btw , I don't know whether they are the same issues or not but the error thrown back the code are same, that's why I have attached a link for the same!!"
https://learn.microsoft.com/en-us/answers/questions/537493/binding-winsock-shortly-after-boot-results-in-erro.html

I found a solution (workaround?)
I used NotifyAddrChange to receive changes to the NICs and thought it for some reason didn't trigger when I disabled the NIC. Turns out it does, I'm just stupid and stopped debugging too early: There was a bug in the code that diffs the results from GetAdaptersInfo to the last known state to figure out the differences, so the application missed the NIC disconnecting. Now that it observes the disconnect, it can kill the sockets before they try to send on the disabled NIC, thus preventing the error from happening. This is not really a solution though, since there is a race condition here (NIC gets disabled before send and after check for changes), so I'll still have to handle error 10049.
The bug was this:
My expectation was that, when I disable a NIC, iterating over all existing NICs would show the disabled NIC as disabled. That is not what happens. What happens is that the NIC is just not in the list of existing NICs anymore, even though the windows dialog will still show it (as disabled). That is somewhat suprising to me but not all that unreasonable I guess.
Before I had these checks to detect changes in the NICs:
Did the NIC exist before, was enabled and is now disabled -> disable notification
Did the NIC exist before, was disabled and is now enabled -> enable notification
Did the NIC not exist before, is not enabled -> enable notification
And the fix was adding a fourth one:
Is there an existing NIC that was not in the list of NICs anymore -> disable notification
I'm still not 100% happy that there is the possibility of getting a somewhat ambiguous error on a race condition, but I might call it a day here.

Related

Linux socket C/C++ - What is the best way to check if ip/port is already in use?

I have a system that can start multiple instances.
Every instance has a client and a server.
They are connected over socket/TCP
Every instance is started by starting a client.
The client starts (checks if IP is available, if not increase the IP by 1, checks again ...) -
The client starts the server with the free IP and connects to it. (for legacy reasons has to be like this)
Instance numbers 2, 3, 4, 5 work without issues.
...
Instance number 6. -> Fails on checking if the first IP in the range is available.
To check if IP is already in use, I do not close the socket on the server side so that it can accept the additional connection.
On the client-side, I check if I can connect to the server-side with the following code:
bool CheckIPInUse(char *ip)
{
bool ret = false;
int port = 12345;
int sock;
struct sockaddr_in serv_addr;
serv_addr.sin_family = AF_INET;
serv_addr.sin_port = htons(port);
// **non blocking** because I want the check to be fast.
sock = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
inet_pton(AF_INET, ip, &serv_addr.sin_addr);
int ret_conn = connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr));
if (ret_conn == 0){
fprintf(stdout, "connected");
ret = true;
}
else if (ret_conn < 0 && (errno != EINPROGRESS)){
fprintf(stdout, "failed to connect");
}
else
{
int check_if_connected = 10;
while (check_if_connected--)
{
socklen_t len = sizeof(serv_addr);
int ret_getpeer = getpeername(sock, (struct sockaddr *)&serv_addr, &len);
if (ret_getpeer == 0)
{
fprintf(stdout, "connected");
ret = true;
break;
}
usleep(100000);
}
}
close(sock);
return ret;
}
This works for the first 5 instances.
6th instance fails to connect to the first IP in range and tries to start the server with IP which is already in use. (always the 6th).
Is there any better way to check programmatically if IP/Port is already busy?
Any ideas on what to check. for failure in the instance number 6?
The only way to check if an ip/port on a server is available is to bind() to it. If it worked, it was available (but not any more).
Any approach that involves a test connect()ion first, to see if it fails, or anything along the lines of poking somewhere in /proc to see which IPs and ports are in use -- nothing along these lines will ever be 100% foolproof. That's because even if you reach the conclusion that the port is available, it may no longer be by the time you get around to try to bind() to it.
Now, you can take, as a starting position, that a particular IP and/or port range is reserved for your application's use, and you only wish to arbitrate IP/port allocation between different instances of your application. In that case you can do that pretty much whatever you want, you're not limited to attempting to actually start instances of your application, and hope for the best. One simplistic approach is to use lock files in /var/tmp to represent all possible IP/port combination, and have your application try, in turn, to acquire a lock on the corresponding lock file, first, and once it's official, and the lock file is acquired, then the corresponding IP/port then can be established at your leisure, but the lock file must remain locked until the IP/port is no longer in use.
But in terms of attempting to check if a socket port is available, or not, the only way to do it is to bind() it, because that, by definition, is what it does. You could attempt to implement a multi-layered approach, like trying to connect() first, and then attempt to bind() it, and if the bind() fails, then keep looking for a free port. But that's creating extra complexity, without much of a benefit.
Did you check that the server did not meet its maximum backlog length ?
You may be getting "connection refused" if the server you are trying to connect to
has more pending connections then the defined backlog.
So if multiple clients are testing at the same time, one of them may encounter this.
The most probable cause of your problem is that your client is getting a connect from the server due to the listen queue. The best way to avoid this problem is to close the socket on which you call accept(2) once all the instances are in use, and reopen it again when any of the server instances are finished.
The listen queue makes the kernel to accept (send the SYN/ACK segment) connections on the otherwise not yet open socket waiting, and this will make the connection establishment quicker for the next server instances if many such connections are entering in the system. All those connections are handled in the accept(2) socket, so the best way to accept five such connections is to close the accept socket as soon as the last connection has been established (this will not avoid the problem if a connection happens to enter the server in the time between one accept(2) and the next, but the connection so established will be closed as soon as the accept socket is still open)
In my opinion, you should have a master server process that forks new processes to handle the different connection and closes the accept socket as soon as it reaches the full capacity. Once one of the servers attending the connections closes one of them, it should reopen the accept socket and accept a new connection.
IMHO, also the most robust way of implementing such a system is to allow the extra connections to get in, but not attend them, so the connection remains open in case a new client happens to enter, and it can close it if the server doesn't attend it in a timeout interval. Having a sixth client already connected, but waiting for the server to say hello, will leave you in a state in which you can start talking to the server as soon as the last service ends.

Socket is open after process, that opened it finished

After closing client socket on sever side and exit application, socket still open for some time.
I can see it via netstat
Every 0.1s: netstat -tuplna | grep 6676
tcp 0 0 127.0.0.1:6676 127.0.0.1:36065 TIME_WAIT -
I use log4cxx logging and telnet appender. log4cxx use apr sockets.
Socket::close() method looks like that:
void Socket::close() {
if (socket != 0) {
apr_status_t status = apr_socket_close(socket);
if (status != APR_SUCCESS) {
throw SocketException(status);
}
socket = 0;
}
}
And it's successfully processed. But after program is finished I can see opened socket via netstat, and if it starts again log4cxx unable to open 6676 port, because it is busy.
I tries to modify log4cxx.
Shutdown socket before close:
void Socket::close() {
if (socket != 0) {
apr_status_t shutdown_status = apr_socket_shutdown(socket, APR_SHUTDOWN_READWRITE);
printf("Socket::close shutdown_status %d\n", shutdown_status);
if (shutdown_status != APR_SUCCESS) {
printf("Socket::close WTF %d\n", shutdown_status != APR_SUCCESS);
throw SocketException(shutdown_status);
}
apr_status_t close_status = apr_socket_close(socket);
printf("Socket::close close_status %d\n", close_status);
if (close_status != APR_SUCCESS) {
printf("Socket::close WTF %d\n", close_status != APR_SUCCESS);
throw SocketException(close_status);
}
socket = 0;
}
}
But it didn't helped, bug still reproduced.
This is not a bug. Time Wait (and Close Wait) is by design for safety purpose. You may however adjust the wait time. In any case, on server's perspective the socket is closed and you are relax by the ulimit counter, it has not much visible impact unless you are doing stress test.
As noted by Calvin this isn't a bug, it's a feature. Time Wait is a socket state that says, this socket isn't in use any more but nevertheless can't be reused quite yet.
Imagine you have a socket open and some client is sending data. The data may be backed up in the network or be in-flight when the server closes its socket.
Now imagine you start the service again or start some new service. The packets on the wire aren't aware that its a new service and the service can't know the packets were destined for a service that's gone. The new service may try to parse the packets and fail because they're in some odd format or the client may get an unrelated error back and keep trying to send, maybe because the sequence numbers don't match and the receiving host will get some odd error. With timed wait the client will get notified that the socket is closed and the server won't potentially get odd data. A win-win. The time it waits should be sofficient for all in-transit data to be flused from the system.
Take a look at this post for some additional info: Socket options SO_REUSEADDR and SO_REUSEPORT, how do they differ? Do they mean the same across all major operating systems?
TIME_WAIT is a socket state to allow all in travel packets that could remain from the connection to arrive or dead before the connection parameters (source address, source port, desintation address, destination port) can be reused again. The kernel simply sets a timer to wait for this time to elapse, before allowing you to reuse that socket again. But you cannot shorten it (even if you can, you had better not to do it), because you have no possibility to know if there are still packets travelling or to accelerate or kill them. The only possibility you have is to wait for a socket bound to that port to timeout and pass from the state TIME_WAIT to the CLOSED state.
If you were allowed to reuse the connection (I think there's an option or something can be done in the linux kernel) and you receive an old connection packet, you can get a connection reset due to the received packet. This can lead to more problems in the new connection. These are solved making you wait for all traffic belonging to the old connection to die or reach destination, before you use that socket again.

Why would connect() give EADDRNOTAVAIL?

I have in my application a failure that arose which does not seem to be reproducible. I have a TCP socket connection which failed and the application tried to reconnect it. In the second call to connect() attempting to reconnect, I got an error result with errno == EADDRNOTAVAIL which the man page for connect() says means: "The specified address is not available from the local machine."
Looking at the call to connect(), the second argument appears to be the address to which the error is referring to, but as I understand it, this argument is the TCP socket address of the remote host, so I am confused about the man page referring to the local machine. Is it that this address to the remote TCP socket host is not available from my local machine? If so, why would this be? It had to have succeeded calling connect() the first time before the connection failed and it attempted to reconnect and got this error. The arguments to connect() were the same both times.
Would this error be a transient one which, if I had tried calling connect again might have gone away if I waited long enough? If not, how should I try to recover from this failure?
Check this link
http://www.toptip.ca/2010/02/linux-eaddrnotavail-address-not.html
EDIT: Yes I meant to add more but had to cut it there because of an emergency
Did you close the socket before attempting to reconnect? Closing will tell the system that the socketpair (ip/port) is now free.
Here are additional items too look at:
If the local port is already connected to the given remote IP and port (i.e., there's already an identical socketpair), you'll receive this error (see bug link below).
Binding a socket address which isn't the local one will produce this error. if the IP addresses of a machine are 127.0.0.1 and 1.2.3.4, and you're trying to bind to 1.2.3.5 you are going to get this error.
EADDRNOTAVAIL: The specified address is unavailable on the remote machine or the address field of the name structure is all zeroes.
Link with a bug similar to yours (answer is close to the bottom)
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4294599
It seems that your socket is basically stuck in one of the TCP internal states and that adding a delay for reconnection might solve your problem as they seem to have done in that bug report.
This can also happen if an invalid port is given, like 0.
If you are unwilling to change the number of temporary ports available (as suggested by David), or you need more connections than the theoretical maximum, there are two other methods to reduce the number of ports in use. However, they are to various degrees violations of the TCP standard, so they should be used with care.
The first is to turn on SO_LINGER with a zero-second timeout, forcing the TCP stack to send a RST packet and flush the connection state. There is one subtlety, however: you should call shutdown on the socket file descriptor before you close, so that you have a chance to send a FIN packet before the RST packet. So the code will look something like:
shutdown(fd, SHUT_RDWR);
struct linger linger;
linger.l_onoff = 1;
linger.l_linger = 0;
// todo: test for error
setsockopt(fd, SOL_SOCKET, SO_LINGER,
(char *) &linger, sizeof(linger));
close(fd);
The server should only see a premature connection reset if the FIN packet gets reordered with the RST packet.
See TCP option SO_LINGER (zero) - when it's required for more details. (Experimentally, it doesn't seem to matter where you set setsockopt.)
The second is to use SO_REUSEADDR and an explicit bind (even if you're the client), which will allow Linux to reuse temporary ports when you run, before they are done waiting. Note that you must use bind with INADDR_ANY and port 0, otherwise SO_REUSEADDR is not respected. Your code will look something like:
int opts = 1;
// todo: test for error
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
(char *) &opts, sizeof(int));
struct sockaddr_in listen_addr;
listen_addr.sin_family = AF_INET;
listen_addr.sin_port = 0;
listen_addr.sin_addr.s_addr = INADDR_ANY;
// todo: test for error
bind(fd, (struct sockaddr *) &listen_addr, sizeof(listen_addr));
// todo: test for addr
// saddr is the struct sockaddr_in you're connecting to
connect(fd, (struct sockaddr *) &saddr, sizeof(saddr));
This option is less good because you'll still saturate the internal kernel data structures for TCP connections as per netstat -an | grep -e tcp -e udp | wc -l. However, you won't start reusing ports until this happens.
I got this issue. I got it resolve by enabling tcp timestamp.
Root cause:
After connection close, Connections will go in TIME_WAIT state for some
time.
During this state if any new connections comes with same IP and PORT,
if SO_REUSEADDR is not provided during socket creation then socket bind()
will fail with error EADDRINUSE.
But even though after providing SO_REUSEADDR also sockect connect() may
fail with error EADDRNOTAVAIL if tcp timestamp is not enable on both side.
Solution:
Please enable tcp timestamp on both side client and server.
echo 1 > /proc/sys/net/ipv4/tcp_timestamps
Reason to enable tcp_timestamp:
When we enable tcp_tw_reuse, sockets in TIME_WAIT state can be used before they expire, and the kernel will try to make sure that there is no collision regarding TCP sequence numbers. If we enable tcp_timestamps, it will make sure that those collisions cannot happen. However, we need TCP timestamps to be enabled on both ends. See the definition of tcp_twsk_unique for the gory details.
reference:
https://serverfault.com/questions/342741/what-are-the-ramifications-of-setting-tcp-tw-recycle-reuse-to-1
Another thing to check is that the interface is up. I got confused by this one recently while using network namespaces, since it seems creating a new network namespace produces an entirely independent loopback interface but doesn't bring it up (at least, with Debian wheezy's versions of things). This escaped me for a while since one doesn't typically think of loopback as ever being down.

How do I receive raw, layer 2 packets in C/C++?

How do I receive layer 2 packets in POSIXy C++? The packets only have src and dst MAC address, type/length, and custom formatted data. They're not TCP or UDP or IP or IGMP or ARP or whatever - they're a home-brewed format given unto me by the Hardware guys.
My socket(AF_PACKET, SOCK_RAW, IPPROTO_RAW) never returns from its recvfrom().
I can send fine, I just can't receive no matter what options I fling at the network stack.
(Platform is VxWorks, but I can translate POSIX or Linux or whatever...)
receive code (current incarnation):
int s;
if ((s = socket(AF_PACKET, SOCK_RAW, IPPROTO_RAW)) < 0) {
printf("socket create error.");
return -1;
}
struct ifreq _ifr;
strncpy(_ifr.ifr_name, "lltemac0", strlen("lltemac0"));
ioctl(s, IP_SIOCGIFINDEX, &_ifr);
struct sockaddr_ll _sockAttrib;
memset(&_sockAttrib, 0, sizeof(_sockAttrib));
_sockAttrib.sll_len = sizeof(_sockAttrib);
_sockAttrib.sll_family = AF_PACKET;
_sockAttrib.sll_protocol = IFT_ETHER;
_sockAttrib.sll_ifindex = _ifr.ifr_ifindex;
_sockAttrib.sll_hatype = 0xFFFF;
_sockAttrib.sll_pkttype = PACKET_HOST;
_sockAttrib.sll_halen = 6;
_sockAttrib.sll_addr[0] = 0x00;
_sockAttrib.sll_addr[1] = 0x02;
_sockAttrib.sll_addr[2] = 0x03;
_sockAttrib.sll_addr[3] = 0x12;
_sockAttrib.sll_addr[4] = 0x34;
_sockAttrib.sll_addr[5] = 0x56;
int _sockAttribLen = sizeof(_sockAttrib);
char packet[64];
memset(packet, 0, sizeof(packet));
if (recvfrom(s, (char *)packet, sizeof(packet), 0,
(struct sockaddr *)&_sockAttrib, &_sockAttribLen) < 0)
{
printf("packet receive error.");
}
// code never reaches here
I think the way to do this is to write your own Network Service that binds to the MUX layer in the VxWorks network stack. This is reasonably well documented in the VxWorks Network Programmer's Guide and something I have done a number of times.
A custom Network Service can be configured to see all layer 2 packets received on a network interface using the MUX_PROTO_SNARF service type, which is how Wind River's own WDB protocol works, or packets with a specific protocol type.
It is also possible to add a socket interface to your custom Network Service by writing a custom socket back-end that sits between the Network Service and the socket API. This is not required if you are happy to do the application processing in the Network Service.
You haven't said which version of VxWorks you are using but I think the above holds for VxWorks 5.5.x and 6.x
Have you tried setting the socket protocol to htons(ETH_P_ALL) as prescribed in packet(7)? What you're doing doesn't have much to do with IP (although IPPROTO_RAW may be some wildcard value, dunno)
I think this is going to be a bit tougher problem to solve than you expect. Given that it's not IP at all (or apparently any other protocol anything will recognize), I don't think you'll be able to solve your problem(s) entirely with user-level code. On Linux, I think you'd need to write your own device agnostic interface driver (probably using NAPI). Getting it to work under VxWorks will almost certainly be non-trivial (more like a complete rewrite from the ground-up than what most people would think of as a port).
Have you tried confirming via Wireshark that a packet has actually been sent from the other end?
Also, for debugging, ask your hardware guys if they have a debug pin (you can attach to a logic analyzer) that they can assert when it receives a packet. Just to make sure that the hardware is getting the packets fine.
First you need to specify the protocol as ETH_P_ALL so that your interface gets all the packet. Set your socket to be on promiscuous mode. Then bind your RAW socket to an interface before you perform a receive.

Socket in use error when reusing sockets

I am writing an XMLRPC client in c++ that is intended to talk to a python XMLRPC server.
Unfortunately, at this time, the python XMLRPC server is only capable of fielding one request on a connection, then it shuts down, I discovered this thanks to mhawke's response to my previous query about a related subject
Because of this, I have to create a new socket connection to my python server every time I want to make an XMLRPC request. This means the creation and deletion of a lot of sockets. Everything works fine, until I approach ~4000 requests. At this point I get socket error 10048, Socket in use.
I've tried sleeping the thread to let winsock fix its file descriptors, a trick that worked when a python client of mine had an identical issue, to no avail.
I've tried the following
int err = setsockopt(s_,SOL_SOCKET,SO_REUSEADDR,(char*)TRUE,sizeof(BOOL));
with no success.
I'm using winsock 2.0, so WSADATA::iMaxSockets shouldn't come into play, and either way, I checked and its set to 0 (I assume that means infinity)
4000 requests doesn't seem like an outlandish number of requests to make during the run of an application. Is there some way to use SO_KEEPALIVE on the client side while the server continually closes and reopens?
Am I totally missing something?
The problem is being caused by sockets hanging around in the TIME_WAIT state which is entered once you close the client's socket. By default the socket will remain in this state for 4 minutes before it is available for reuse. Your client (possibly helped by other processes) is consuming them all within a 4 minute period. See this answer for a good explanation and a possible non-code solution.
Windows dynamically allocates port numbers in the range 1024-5000 (3977 ports) when you do not explicitly bind the socket address. This Python code demonstrates the problem:
import socket
sockets = []
while True:
s = socket.socket()
s.connect(('some_host', 80))
sockets.append(s.getsockname())
s.close()
print len(sockets)
sockets.sort()
print "Lowest port: ", sockets[0][1], " Highest port: ", sockets[-1][1]
# on Windows you should see something like this...
3960
Lowest port: 1025 Highest port: 5000
If you try to run this immeditaely again, it should fail very quickly since all dynamic ports are in the TIME_WAIT state.
There are a few ways around this:
Manage your own port assignments and
use bind() to explicitly bind your
client socket to a specific port
that you increment each time your
create a socket. You'll still have
to handle the case where a port is
already in use, but you will not be
limited to dynamic ports. e.g.
port = 5000
while True:
s = socket.socket()
s.bind(('your_host', port))
s.connect(('some_host', 80))
s.close()
port += 1
Fiddle with the SO_LINGER socket
option. I have found that this
sometimes works in Windows (although
not exactly sure why):
s.setsockopt(socket.SOL_SOCKET,
socket.SO_LINGER, 1)
I don't know if this will help in
your particular application,
however, it is possible to send
multiple XMLRPC requests over the
same connection using the
multicall method. Basically
this allows you to accumulate
several requests and then send them
all at once. You will not get any
responses until you actually send
the accumulated requests, so you can
essentially think of this as batch
processing - does this fit in with
your application design?
Update:
I tossed this into the code and it seems to be working now.
if(::connect(s_, (sockaddr *) &addr, sizeof(sockaddr)))
{
int err = WSAGetLastError();
if(err == 10048) //if socket in user error, force kill and reopen socket
{
closesocket(s_);
WSACleanup();
WSADATA info;
WSAStartup(MAKEWORD(2,0), &info);
s_ = socket(AF_INET,SOCK_STREAM,0);
setsockopt(s_,SOL_SOCKET,SO_REUSEADDR,(char*)&x,sizeof(BOOL));
}
}
Basically, if you encounter the 10048 error (socket in use), you can simply close the socket, call cleanup, and restart WSA, the reset the socket and its sockopt
(the last sockopt may not be necessary)
i must have been missing the WSACleanup/WSAStartup calls before, because closesocket() and socket() were definitely being called
this error only occurs once every 4000ish calls.
I am curious as to why this may be, even though this seems to fix it.
If anyone has any input on the subject i would be very curious to hear it
Do you close the sockets after using it?