Sometimes cannot connect to HTTP server on Windows (the target machine actively refused the connection) - c++

I'm attempting to run some mock tests for a C++ library on Windows. This involves spinning up an HTTP server (https://gitlab.com/eidheim/Simple-Web-Server) so that the library can make some requests to it on localhost:8080.
Most of the time this works OK, but I'm running into some confusing issues at the OS level. If I run the test suite application too soon after a previous run, I get errors stating that the HTTP connections could not be made to the server, because "the target machine actively refused it". This causes a few issues on our CI server, as we run 32-bit tests followed by 64-bit tests, and some of the time the second batch of tests will fail because of these connection rejections. If I wait a bit between runs then the connections usually work again (though not always).
It should be noted that the HTTP server is started up at the beginning of each individual test, and torn down after. It is not persistent between tests or between process runs, so there should be no dangling connections or anything like that. This is what leads me to believe this is an issue at the OS level.
Is there any good reason why this might be happening? It's quite difficult for me to debug because I'm not even really sure what the problem is - maybe Windows is not allowing port 8080 to be re-used until after a certain timeout?

Related

What is required to get a BSD-sockets-based program to do LAN networking under Emscripten?

Background: I've got a C++/Qt-based application that communicates with servers on the user's LAN. It uses non-blocking TCP and UDP sockets, and the networking is implemented via calls to the BSD sockets API (i.e. socket()/send()/recv()/select()/etc). It all works well.
The other day, just for fun, I decided to recompile the application using emscripten, so that it could run as a WebAssembly app inside a web browser.
This worked surprisingly well -- within an hour or two, I had my app up and running inside Google Chrome. However, the app's usefulness in this configuration is severely limited by the fact that it isn't able to connect to any servers -- presumably this is because it is running in a restricted/sandboxed environment.
If I wanted to pursue this line of development beyond the clever-hack-demo stage and try to make it useful, I would need to find a way for my program to discover and connect to servers on the user's LAN.
My question is: is that functionality at all possible for an Emscripten/WebAssembly-based app to perform? If so, what steps would I need to take? (i.e. would it require upgrading the LAN's servers to handle WebSocket-based connections? Would it require adding some sort of proxy server to run on the web server that the web page was served from? Is UDP even a thing in a web-app context? Are there other hoops that would also have to be jumped through?)

Service blocks windows startup

We have an automatically started service which in some cases spends a lot of time loading necessary data, let's say 10 minutes. During this time it works as expected (processing some huge data files required to start). I report the progress via the C++ SetServiceStatus function, and that is working fine.
This service does not depend on anything, and only one service depends on it, which is again our own. That dependent service is started after those 10 minutes, because it needs the first "server" service to be fully running and accepting requests.
I thought that Windows would start all the other automatic services (in less than 10 minutes, as usual) and then start working normally, but the system is completely blocked during startup (I can't log in to the computer or ping it) until this one specific service is started (reports SERVICE_RUNNING via SetServiceStatus). When our service has completely started, the other missing system services (required for network, remote desktop, whatever; it's quite random) are also started. Is this normal behaviour? Why are non-dependent processes (such as remote desktop, network connections, etc.) waiting for this process? Am I missing something?
I tried adding some dependencies to postpone the startup of my service, but I ended up with many dependencies and the behaviour was still somewhat random (as the order of services is random). Sometimes I was able to log in but, for example, the Start button only started working after those 10 minutes, once my service had started. I am not sure what "the last service" to depend on is, or which services to include in my dependency list, and on some computers these services can be disabled, which can bring new problems... so I don't like this solution very much.
Another option was the Delayed Start option for our service. This should start the service once all other automatic services are running. This works fine in that Windows boots, the computer is running and responding, and our service is started, but the performance is very bad, many times slower than usual; it seems that delayed-start services have a much lower priority or something like that.
My only current solution is to report to the system that my service is running (via the SetServiceStatus function) but to continue loading (this works, I tested it). But then we have a problem with our dependent service, as it needs to be started only when the first one is really ready. It can be solved, but I still wonder how this is possible, and whether there is something I could use to keep the current design, in which an automatically started service reports "started" only when it is really fully started and prepared to work. Thanks for any ideas.
Set SERVICE_RUNNING as soon as possible, and then continue processing in the background. Make your other service resilient to the first service being in a running state but not yet ready to serve requests.
The longer the service stays in the starting state, the more problems we get across different Windows versions.
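To illustrate the "report running early, keep loading in the background" idea, here is a minimal, portable sketch. It deliberately leaves out the Windows service plumbing: in a real service you would report SERVICE_RUNNING via SetServiceStatus just before starting the worker. The names BackgroundLoadingService and waitUntilReady are made up for this example, and it assumes both sides live in one process; two separate services would need some form of IPC (a named event, a status request, etc.) to share the "ready" flag.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical stand-in for a service that is reported as "running" right
// away and finishes its slow initialisation on a worker thread.
class BackgroundLoadingService {
public:
    // `load` is the slow start-up work (e.g. reading large data files).
    explicit BackgroundLoadingService(std::function<void()> load)
        : worker_([this, load = std::move(load)] {
              load();
              ready_.store(true, std::memory_order_release);
          }) {}

    ~BackgroundLoadingService() {
        if (worker_.joinable()) worker_.join();
    }

    // True once the background load has finished.
    bool ready() const { return ready_.load(std::memory_order_acquire); }

private:
    std::atomic<bool> ready_{false};  // declared first so it exists before the thread starts
    std::thread worker_;
};

// What a resilient dependent might do: poll until the first service is
// ready or the timeout expires. Returns true if it became ready in time.
inline bool waitUntilReady(const BackgroundLoadingService& svc,
                           std::chrono::milliseconds timeout,
                           std::chrono::milliseconds pollInterval =
                               std::chrono::milliseconds(10)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!svc.ready()) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(pollInterval);
    }
    return true;
}
```

The dependent service would call something like waitUntilReady (or simply retry its requests) instead of assuming that "running" means "ready".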

SQL-Server Connection Fails after Network Reconnect

I am working on an update to an application that uses DAO to access an SQL Server. I know, but let's consider DAO a requirement for now.
The application runs all the time in the system tray and periodically performs SQL server operations. Since it is running all the time, and users of the application will be on laptops and transitioning between buildings, I've designed it to quietly transition between active and inactive states. When the database connection is successful operations resume.
I have one last issue before I release this update: When a connection is dropped, then reestablished, the SQL operations fail. This occurs only if I have specified the hostname in my connection string. If I use the IP, everything is fine (but I need to be able to use hostname).
Here is the behavior:
1) Everything working. Good network connection, database operations are fine.
2) Lost connection. Little 'x' appears on task bar icon, and nothing else. All ok.
3) Reconnect.
At step 3, I get an 'ODBC--call failed' error when I run the first query. Interestingly, the database is first opened without error.
If I skip step 1, and start the application when the connection is down, everything works fine in step 3, hostname or not.
I expect this is an issue with the DAO engine caching the DNS entry after the first connection, although the destination IP does not change, so I'm not sure about that. I have tried flushing the Windows DNS cache (from the command prompt) to no effect. The same behavior occurs even when I'm using my local hostname with a local SQL server I set up for development. 127.0.0.1 has no problems.
I also tried to CoUninitialize() the DAO interface between active times, but I had trouble getting this to work. If someone thinks that would help I will work harder at it.
This behavior is the same in Windows XP or 7.
Thanks for anything you've got!
Edit: I should have mentioned - I am closing the database connection between the attempts, then reopening it with
m_pDb = m_pDaoEngine->OpenDatabase()
I ended up biting the bullet and converting the application to ADO. Everything works nicely now, and database operations are much faster to boot.

Is there a way for the cache to stay up without timeout after crash in AppFabric Cache?

First, my setup, which is used for testing purposes:
3 Virtual Machines running with the following configuration:
MS Windows 2008 Server Standard Edition
Latest version of AppFabric Cache
Each one has a local network share where the config file is stored (I have added all the machines in each config)
The cache is distributed but not high availability (we don't have the Enterprise version of Windows)
Each host is configured as lead, so according to the documentation at least one host should be allowed to crash.
Each machine has the website I am testing installed, and local cache configured
One Linux machine that is used as a proxy (Varnish is used) to distribute the traffic for testing purposes.
That's the setup; now on to the problem. The scenario I am testing simulates one of the servers crashing and then being brought back into the cluster. I have problems both with the server crashing and with bringing it back up. The steps I am using to test it:
Direct the traffic with Varnish on the linux machine to one server only.
Log in to make sure there is something in the cache.
Unplug the network cable for one of the other servers (simulates that server crashing)
Now I get a cache timeout and a service error. I want the application to stay up on the servers that didn't crash, but it takes some time for the cache to come back up on the remaining servers. Is that how it should be? Plugging the network cable back in and starting the host causes a similar problem.
So my question is whether I have missed something. What I would like to see is that if one server crashes, the cache remains up, since a majority of the leads are still up, and that starting the crashed server again brings it back gracefully into the cluster without causing any problems on the other hosts. But that might not be how it works?
I ran through a similar test scenario a few months ago where I had a test client generating load on a 3 lead-server cluster with a variety of Puts, Gets, and Removes. I rebooted one of the servers multiple times while the load test was running and the cache stayed online. If I remember correctly, there were a limited number of errors as that server rebooted, but overall the cache appeared to remain healthy.
I'm not sure why you're not seeing similar results, but I would try removing the Varnish proxy from your test and see if that helps.

Socket re-connection failure

System Background:
It's basically a client/server application. The server is an embedded device and the client is a Windows app developed in C++.
Issue: After a runtime of about a week, communication between client and server breaks. Because of this, the server is not able to connect back to the client and needs a restart to recover. It looks like the system is experiencing a socket re-connection problem. Also, the network sometimes experiences intermittent failures.
Abrupt Termination at remote end
Port locking
I'd like some suggestions on how to clean up the socket or shut it down cleanly so that re-connection happens properly. Are there other, alternative solutions?
Thanks,
Hussain
It does not sound like you are in a position to easily write a stress test app to reproduce this more quickly out of band, which is what I would normally suggest. A pragmatic solution might be to periodically restart the server and client at a time when you think the system is least busy, or when problems arise. This sounds like cheating but many production systems I have been involved with take this approach to maximize system uptime.
My preferred solution here would be to abstract the server and client socket code (hopefully your design allows this to be done without too much work) and use it to implement client and server test apps that can be used to stress test only the socket code by simulating a lot of normal socket traffic in a short space of time - this helps identify timing windows and edge cases that could cause problems over time, and might speed up the process of obtaining a debuggable repro - you can simulate network error in your test code by dropping the socket on the client or server periodically.
A further step to take on the strategic front would be to ensure that you have good diagnostics in your socket handlers on both the client and server side. Track socket open and close, with special focus on your socket error and reconnect paths, given that you know the network is unreliable. Make sure the logs are written sequentially, with a timestamp on each entry. Something as simple as this might quickly show you what errors or conditions trigger your problems. You can quickly make sure the logs are correct and complete using the test apps I mentioned above.
One thing you might want to check is that you are not being hit by an inability to reuse addresses. Sometimes when a socket gets closed, it cannot be immediately reused for a reconnect attempt, as there is still residual activity on one end or the other. You may be able to get around this (based on my Windows/Winsock experience) by experimenting with SO_REUSEADDR and SO_LINGER on your sockets. However, my first focus in your case would be on ensuring the socket code on client and server handles all errors and mainline cases correctly, before worrying about this.
A common issue is that when a connection is dropped, the OS keeps it around in the TIME_WAIT state. If you want to restart the server socket, it will not be able to reopen the same port right away, because as far as the OS is concerned the port is still in use.
To avoid that, you need to set the parameter SO_REUSEADDR so that the OS allows you to reuse the port if it is in TIME_WAIT state for a server socket.
Example:
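int optval = 1;
// Set SO_REUSEADDR to true (1) on socket s1; do this before calling bind().
// The cast is needed on Winsock, where the option value is passed as a const char*.
setsockopt(s1, SOL_SOCKET, SO_REUSEADDR, (const char*)&optval, sizeof optval);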
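A slightly fuller sketch of the same idea, showing where the call sits relative to bind() and listen(). The function name openReusableListener and the choice of the loopback address are assumptions made for this example. On Windows, WSAStartup() must have been called beforehand. Also be aware that SO_REUSEADDR does not mean quite the same thing under Winsock as under BSD sockets (on Windows it can let another socket bind to a port that is already in use), so it is worth checking the Winsock documentation before relying on it there.

```cpp
#include <cstdint>
#ifdef _WIN32
  #include <winsock2.h>
  #include <ws2tcpip.h>
  using socket_t = SOCKET;
  const socket_t kInvalidSocket = INVALID_SOCKET;
  inline void closeSocket(socket_t s) { ::closesocket(s); }
#else
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>
  using socket_t = int;
  const socket_t kInvalidSocket = -1;
  inline void closeSocket(socket_t s) { ::close(s); }
#endif

// Hypothetical helper: opens a TCP listening socket on 127.0.0.1:port with
// SO_REUSEADDR set before bind(). Returns kInvalidSocket on failure.
inline socket_t openReusableListener(std::uint16_t port) {
    socket_t s = ::socket(AF_INET, SOCK_STREAM, 0);
    if (s == kInvalidSocket) return kInvalidSocket;

    // The option must be set before bind() to have any effect on it.
    int optval = 1;
    if (::setsockopt(s, SOL_SOCKET, SO_REUSEADDR,
                     reinterpret_cast<const char*>(&optval), sizeof optval) != 0) {
        closeSocket(s);
        return kInvalidSocket;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (::bind(s, reinterpret_cast<const sockaddr*>(&addr), sizeof addr) != 0 ||
        ::listen(s, SOMAXCONN) != 0) {
        closeSocket(s);
        return kInvalidSocket;
    }
    return s;
}
```

A server that is restarted frequently would call openReusableListener with its fixed port on every start and treat kInvalidSocket as a reason to log the error and retry.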
I'm experiencing something similar with encrypted connections. I believe in my case it is because the client dropped the connection and reconnected in less than the 4-minute FIN_WAIT period. The initial connection is recycled (by the OS) and the server doesn't see the dropout. The SSL authentication is lost when the client loses its connection, so the client tries to re-authenticate. This happens during what the server considers the middle of a conversation. The server then hangs up on the client. I think the server's SSL code considers this a man-in-the-middle attack, or just gets confused, and closes the connection.