Boost 1.62 socket broken after reconnect in Docker container - C++

We have a C++ application using Boost 1.62 and libssl 1.0 that opens a TLS connection to a remote lighttpd web server.
This works fine on every device we have already rolled out. Now we are trying to run this application inside a container. The application starts up and everything is fine, but:
When the connection gets reset for any reason, the application attempts to reconnect by making a new TCP connection with the same socket.
Establishing an HTTPS connection (TLS) over that socket then fails with EOF. The application then tries to reconnect again and gets the same fault -> endless reconnection loop.
I recorded the traffic and saw the following:
1. Everything is alright.
2. A TLS alert is recorded, sometimes also a TCP reset.
3. Client sends SYN
4. Server sends SYN, ACK
5. Client sends ACK
6. Client sends FIN, ACK
7. Server sends ACK
8. Server sends FIN, ACK
9. Client sends ACK
Steps 3 to 7 occur in less than 3 ms.
As soon as step 7 has passed, a new connection is made, starting again at step 3.
I'm using Ubuntu 18.04 on the host and as the base image (both x64).
Both host and container use the same libraries, so I don't think it's an issue with the libraries used.
The application has been running in production for over a year on several arm32v7 and x64 devices; this error has never occurred there.
Oddly, if the application is configured to use plain HTTP instead of HTTPS, the error does not occur.
Any suggestions what this might be? Based on my knowledge, I can rule out the following:
wrong dependencies
misconfigured kernel (container and host use the same)
Thanks for your help

After a disconnect, we reused the old socket. This works flawlessly outside Docker; inside Docker it produces the behaviour described above.
Workaround: discard the broken socket and create a new one.
This is bad for performance, though.
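For illustration, a minimal sketch of that workaround with Boost.Asio and asio::ssl (the function name, host/port parameters and error handling are placeholders, not the original application's code):

#include <boost/asio.hpp>
#include <boost/asio/ssl.hpp>
#include <memory>
#include <string>

namespace asio = boost::asio;
namespace ssl  = boost::asio::ssl;
using ssl_stream = ssl::stream<asio::ip::tcp::socket>;

// Build a brand-new TCP socket + TLS stream for every reconnect instead of
// reusing the old one; the caller simply destroys the stale stream object.
std::unique_ptr<ssl_stream> reconnect(asio::io_service& io,
                                      ssl::context& ctx,
                                      const std::string& host,
                                      const std::string& port)
{
    asio::ip::tcp::resolver resolver(io);
    asio::ip::tcp::resolver::query query(host, port);

    auto stream = std::make_unique<ssl_stream>(io, ctx);
    asio::connect(stream->lowest_layer(), resolver.resolve(query));  // fresh TCP connection
    stream->handshake(ssl::stream_base::client);                     // fresh TLS handshake
    return stream;
}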

Related

gRPC C++ client blocking when attempting to connect channel on unreachable IP

I'm trying to enhance some client C++ code using gRPC to support failover between 2 LAN connections.
I'm unsure if I found a bug in gRPC, or more likely that I'm doing something wrong.
Both the server and client machines are on the same network with dual LAN connections, which I'll call LAN-A and LAN-B.
The server is listening on 0.0.0.0:5214, so it accepts connections on both LANs.
I tried creating the channel on the client with both IPs and using various load-balancing options, e.g.:
string all_endpoints = "ipv4:172.24.1.12:5214,10.23.50.123:5214";
grpc::ChannelArguments args;
args.SetLoadBalancingPolicyName("pick_first");
_chan = grpc::CreateCustomChannel(all_endpoints,
                                  grpc::InsecureChannelCredentials(),
                                  args);
_stub = std::move(Service::NewStub(_chan));
When I start up the client and server with all LAN connections functioning, everything works perfectly. However, if I kill one of the connections, or start the client with one of the connections down, gRPC seems to block forever on that subchannel. I would expect it to use the subchannel that is still functioning.
As an experiment, I implemented some code to only try to connect on 1 channel (the non-functioning one in this case), and then wait 5 seconds for a connection. If the deadline is exceeded, then we create a new channel and stub.
if (!_chan->WaitForConnected(std::chrono::system_clock::now() +
                             std::chrono::milliseconds(5000)))
{
    lan_failover();
}
The stub is a unique_ptr, so it should be destroyed; the channel is a shared_ptr. What I see is that I can successfully connect on my new channel, but when my code returns, gRPC takes over and blocks indefinitely on what appears to be an attempt to connect on the old channel. I would expect gRPC to close/delete the channel that is no longer used. I don't see any functions available in the C++ version, either on the channel or globally, that would force the shutdown/closure of a channel.
I'm at a loss as to how to get gRPC to stop trying to connect on failed channels; any help would be greatly appreciated.
Thanks!
Here is some gRPC debug output I see when I start up with the first load-balancing implementation mentioned above, while one of the two LANs is not functioning (blocking forever):
https://pastebin.com/S5s9E4fA
You can enable keepalives. Example usage: https://github.com/grpc/grpc/blob/master/test/cpp/end2end/flaky_network_test.cc#L354-L358
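As a rough sketch (the interval values here are only illustrative, not recommendations, and all_endpoints is the string from the snippet above), client-side keepalive pings can be enabled through channel arguments before the channel is created:

#include <grpcpp/grpcpp.h>

grpc::ChannelArguments args;
args.SetLoadBalancingPolicyName("pick_first");
args.SetInt(GRPC_ARG_KEEPALIVE_TIME_MS, 10000);           // send a ping every 10 s
args.SetInt(GRPC_ARG_KEEPALIVE_TIMEOUT_MS, 5000);         // fail the connection if no ack within 5 s
args.SetInt(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 1);  // ping even when no RPC is in flight
args.SetInt(GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA, 0);    // allow unlimited pings without payload data

auto channel = grpc::CreateCustomChannel(all_endpoints,
                                         grpc::InsecureChannelCredentials(),
                                         args);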
Just wanted to let everyone know the problem wasn't with gRPC, but with the way our systems were configured with a SAN that was being written to. The SAN was mounted through the LAN connection I was using to test failover, and the process was actually blocking because it was trying to access that SAN. The stack trace was misleading because it showed the gRPC thread.

Client doesn't detect Server disconnection

In my application (C++) I have a service exposed as:
rpc foo(stream Request) returns (Reply) {}
The issue is that when the server goes down (Ctrl-C), the stream on the client side keeps going; indeed,
grpc::ClientWriter::Write
doesn't return false. I can confirm with netstat that there is no connection between the client and the server (apart from a TIME_WAIT one that goes away after a while), yet the client keeps calling Write without errors.
Is there a way to see if the underlying connection is still up instead of relying on the Write return value? I use gRPC version 1.12.
Update
I discovered that the underlying channel goes into the IDLE state, but ClientWriter::Write still doesn't report the error; I don't know if this is intended. During the streaming I'm now trying to re-establish a connection with the server every time the channel state is not GRPC_CHANNEL_READY.
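A minimal sketch of that polling approach (channel stands for the std::shared_ptr<grpc::Channel> behind the writer; the five-second deadline and the recovery step are placeholders, not the original code):

#include <grpcpp/grpcpp.h>
#include <chrono>
#include <memory>

// Returns true when the channel is (or becomes) READY within five seconds;
// otherwise the caller would restart the streaming RPC (or rebuild the channel).
bool channel_ready(const std::shared_ptr<grpc::Channel>& channel) {
    grpc_connectivity_state state = channel->GetState(/*try_to_connect=*/true);
    if (state == GRPC_CHANNEL_READY)
        return true;
    channel->WaitForStateChange(state,
        std::chrono::system_clock::now() + std::chrono::seconds(5));
    return channel->GetState(false) == GRPC_CHANNEL_READY;
}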
This can happen in a few scenarios, but the most common element is a connection issue. We have KEEPALIVE support in gRPC to tackle exactly this issue. For C++, please refer to https://github.com/grpc/grpc/blob/master/doc/keepalive.md for how to set this up. Essentially, endpoints send pings at certain intervals and expect a reply within a certain timeframe.

RakNet tutorial dropping clients

Sorry for the noobish question, but I can't find any resources online clearly stating whether this should work or not, and all tutorials / sample code always use localhost ^^ Soooo...
I'm trying to set up a simple server / client using RakNet. I'm literally just following the first tutorial (http://www.jenkinssoftware.com/raknet/manual/tutorial.html), just trying to get the client to connect to the server and keep the connection alive for a bit.
It all works great as long as I use 127.0.0.1 or 192.168.0.XXX, I can start the server, then the client, the server detects the connection request and sends the reply to the client, the client receives the reply and prints out "connection accepted" and such, and I can exchange messages between the client and the server.
However, if I try using my actual (external) IP, the server does not seem to detect the connection request (if you look at the tutorial code, it doesn't print "incoming connection"), but the client still receives a reply from somewhere ("Our connection request has been accepted").
After this initial semi-successful connection, no more packets are received by either server or client, and the client inevitably gets disconnected after a few seconds (a timeout, I assume?).
Port is open on the router, and the app runs fine as long as I keep it on localhost.
So my question is: is it even possible to run a server and client on the same machine / IP which is sitting behind a router?
The RakNet documentation section about NAT punch-through and UDP forwarding does mention that no more than one client and server can run on the same machine, but I was under the impression that one server / one client would not be an issue?
Thanks in advance to anybody who can shed some light on this!!
Forgot to mention my firewall is disabled !

How to handle SSL connection premature closure

I am writing a proxy server that proxies SSL connections, and it all works perfectly fine for normal traffic. However, when there is a large file transfer (anything over 20 KB), like an email attachment, the connection is reset at the TCP level before the file has finished being written. I am using non-blocking I/O and am spawning a thread for each connection.
When a connection comes in I do the following:
Spawn a thread
Connect to the client (unencrypted) and read the CONNECT request (all other requests are ignored)
Create a secure connection (SSL, using the OpenSSL API) to the server
Tell the client that we contacted the server (unencrypted)
Create a secure connection to the client, and start proxying data between the two, using a select loop to determine when reading and writing can occur (a rough sketch of this relay step follows the list)
Once the underlying sockets are closed, or there is an error, the connection is closed and the thread is terminated.
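For context, a minimal sketch of one direction of such a relay step (hypothetical names, not the poster's code; a real non-blocking loop would return to select() instead of spinning, and would remember partial writes across iterations):

#include <openssl/ssl.h>

// Copy whatever is readable from src to dst. Returns false when the relay
// should stop (peer closed the connection or a fatal error occurred).
bool relay_once(SSL* src, SSL* dst) {
    char buf[16 * 1024];
    int n = SSL_read(src, buf, sizeof(buf));
    if (n <= 0) {
        int err = SSL_get_error(src, n);
        if (err == SSL_ERROR_WANT_READ || err == SSL_ERROR_WANT_WRITE)
            return true;                  // nothing to read yet; select() again
        return false;                     // close_notify, EOF or fatal error
    }
    int off = 0;
    while (off < n) {                     // handle short writes (e.g. with SSL_MODE_ENABLE_PARTIAL_WRITE)
        int w = SSL_write(dst, buf + off, n - off);
        if (w <= 0) {
            int err = SSL_get_error(dst, w);
            if (err == SSL_ERROR_WANT_READ || err == SSL_ERROR_WANT_WRITE)
                continue;                 // retry with the exact same arguments
            return false;
        }
        off += w;
    }
    return true;
}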
Like I said, this works great for normal-sized data (regular web pages and other things) but fails as soon as a file is too large, with either an error code (depending on the web app being used) or an "Error: Connection Interrupted".
I have no idea what is causing the connection to close, whether it's something TCP-, HTTP-, or SSL-specific, and I can't find any information on it at all. In some browsers it starts to work if I put a sleep statement immediately after the SSL_write, but this seems to cause other issues in other browsers. The sleep doesn't have to be long, really just a delay. I currently have it set to 4 ms per write and 2 ms per read, and this fixes it completely in older Firefox, in Chrome with HTTP uploads, and in Opera.
Any leads would be appreciated, and let me know if you need any more information. Thanks in advance!
-Sam
If the web app thinks an uploaded file is too large, what does it do? If it's entitled to just close the connection, that will cause an ECONNRESET at the sender: 'connection reset'. Whatever it does, as you're writing a proxy, and assuming there are no bugs in your code causing this, your mission is to mirror whatever happens on your upstream connection back down the downstream connection. In this case the answer is to do just what you're doing: close the upstream and downstream sockets. If you got an incoming close_notify from the server, do an orderly SSL close to the client; if you got ECONNRESET, just close the client socket directly, bypassing SSL.
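A compact sketch of that mirroring logic (ssl_server, ssl_client and client_fd are hypothetical names for the two TLS sessions and the raw client socket):

#include <openssl/ssl.h>
#include <unistd.h>

// Called after SSL_read() on the server-side session returned read_result <= 0:
// mirror the upstream close down to the client connection.
void mirror_close(SSL* ssl_server, SSL* ssl_client, int client_fd, int read_result) {
    int err = SSL_get_error(ssl_server, read_result);
    if (err == SSL_ERROR_ZERO_RETURN) {
        // The server sent close_notify: perform an orderly TLS shutdown
        // towards the client before closing the socket.
        SSL_shutdown(ssl_client);
    }
    // On ECONNRESET or any other hard error, skip the TLS shutdown and
    // just close the raw socket.
    close(client_fd);
}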

telnet client connection stops receiving data, server is still sending

I'm working in an embedded Linux environment.
It launches a telnet daemon on startup, which listens on a particular port and launches a program when a connection is received.
i.e.
telnetd -l /usr/local/bin/PROGA -p 1234
PROGA will output some data at irregular intervals. When it is not outputting data, every X period of time it sends out a 'heartbeat'-type string to let the client know that we are still active, i.e. "heartbeat\r\n".
After a random amount of time, the client (a Linux version of telnet, launched by: telnet xxx.xxx.xxx.xxx 1234) will fail to receive the 'heartbeat\r\n'.
The data the client sees:
heartbeat
heartbeat
heartbeat
...
heartbeat
[nothing, should have received heartbeat]
[nothing forever]
The heartbeat is sent with:
result = printf("%s", heartbeat);
Checking result, it is always the length of heartbeat. Logging to syslog shows that the printf() executes successfully at the proper intervals.
I've since added a tcdrain and an fflush, which both return success but do not seem to help the situation.
Any help would be appreciated.
UPDATE: I got a Wireshark capture from the server side. Very clearly the heartbeat is being sent continuously; no hiccups, no delays. I did find something interesting on the client, though. The client in this test case (telnet on Ubuntu 9.04) seems to suddenly stop receiving the heartbeat (as described above). Wireshark confirms this: a big pause in packets. Once the client has stopped receiving the heartbeat, pressing any keystroke on the client seems to trigger a spew of data from the client's buffer (all heartbeats). Wireshark on the client also shows this massive amount of data arriving in one packet.
Unfortunately I don't really know what this means. Is this a line-mode on/off thing? Line endings (\r\n) are very clearly coming through.
UPDATE 2: Running netcat instead of telnetd, the problem is not reproducible.
The first thing I would do is get out Wireshark and try to find out whether the server is truly sending the message. It would be instructive to run Wireshark at the server as well as on a third-party PC. Is there anything different about the last heartbeat?
Edit: Well, that was an interesting find on your client.
It seems like there's some sort of terminal thing in the way. You may want to use netcat rather than telnetd. netcat is designed for sending arbitrary data over a TCP session in raw mode, without any special formatting, and it has the ability to hook an arbitrary process up to a socket. On a Windows machine you can use PuTTY in raw mode to accomplish the same thing.
It may still be worth examining traffic with a third party between your client and server. The kernel may be optimizing away writes to the network and internally buffering data. That's the only way to ensure that what you see is what's really happening on the wire.