Restarting Application: Delay - c++

I'm coding a restart feature into my newest Crysis Wars server modification that remotely reboots the server. This is useful if the server has a problem that a simple system reload does not fix, and it also lets me tell the server to restart at a specified time in order to free up memory.
I have coded the required functions to achieve this, and the application itself has no problem restarting. The issue is that the port is not released quickly enough, so the new instance of the application cannot function properly.
What I'm looking for is a way to shut the program down and have it launch again two seconds later, instead of immediately. This would give Windows enough time to free the port the server was using and clean up any remaining memory.
Please note: I have removed my other (related) question, since apparently closing a program's port is impossible unless the program is told to do so when the port is assigned, which is something I cannot do because I don't have access to the source code that binds to the port.
The code, in case it's required:
int CScriptBind_GameRules::Restart(IFunctionHandler *pH)
{
    bool arg1 = false;
    const char *arg2 = "";

    // Read r.enable and r.line from the Dynamic script table
    gEnv->pScriptSystem->BeginCall("Dynamic", "GetValue");
    gEnv->pScriptSystem->PushFuncParam("r.enable");
    gEnv->pScriptSystem->EndCall(arg1);

    gEnv->pScriptSystem->BeginCall("Dynamic", "GetValue");
    gEnv->pScriptSystem->PushFuncParam("r.line");
    gEnv->pScriptSystem->EndCall(arg2);

    if (arg1)
    {
        LogMsg(2, "System restart initiated.");
        if (arg2)
        {
            LogMsg(2, "System Reboot.");
            // Run the restart command line through os.execute
            gEnv->pScriptSystem->BeginCall("os", "execute");
            gEnv->pScriptSystem->PushFuncParam(arg2);
            gEnv->pScriptSystem->EndCall();
            // Attempt to close the server port (this is the part that does not work)
            close((int)gEnv->pConsole->GetCVar("sv_port")->GetString());
            return pH->EndFunction();
        }
        else
        {
            LogMsg(2, "Internal Failure.");
            return pH->EndFunction();
        }
    }

    LogMsg(2, "System restart cancelled: Feature is Disabled.");
    return pH->EndFunction();
}

What I usually do is add a command-line parameter, 'StartupDelay'. When the server (or whatever) starts up, before attempting to run the listener etc., it checks for that parameter. If there is no parameter, it starts up normally; if it finds 'StartupDelay=2000', it sleeps for 2 seconds before attempting to start the listener, etc.
Result: if started from a desktop icon, it starts immediately. If it needs to 'restart itself', it sets the parameter to instruct the new instance of itself to wait as directed.
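A minimal sketch of that approach on Windows, in case it helps. The names ApplyStartupDelay and RelaunchSelf and the exact parameter format are my own illustration, not part of the answer; the Win32 calls (Sleep, GetModuleFileNameA, CreateProcessA) are standard:

#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Check the command line for "StartupDelay=<milliseconds>" and sleep before
// binding the listener. The parameter name is just an example.
void ApplyStartupDelay(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i)
    {
        if (std::strncmp(argv[i], "StartupDelay=", 13) == 0)
        {
            int delayMs = std::atoi(argv[i] + 13);
            if (delayMs > 0)
                Sleep(delayMs); // give Windows time to release the old port
        }
    }
}

// When restarting, relaunch our own executable with the delay parameter,
// then exit so the old instance releases its port.
void RelaunchSelf()
{
    char exePath[MAX_PATH] = {0};
    GetModuleFileNameA(NULL, exePath, MAX_PATH);

    char cmdLine[MAX_PATH + 32];
    std::snprintf(cmdLine, sizeof(cmdLine), "\"%s\" StartupDelay=2000", exePath);

    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi = {0};
    if (CreateProcessA(exePath, cmdLine, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
    {
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
    ExitProcess(0);
}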

Related

Crashing when calling QTcpSocket::setSocketDescriptor()

My project uses QTcpSocket and the setSocketDescriptor() function. The code is very ordinary:
QTcpSocket *socket = new QTcpSocket();
socket->setSocketDescriptor(this->m_socketDescriptor);
This code worked fine most of the time, until I ran a performance test on Windows Server 2016 and the crash occurred. Debugging with the crash dump, here is the log:
0000004f`ad1ff4e0 : ucrtbase!abort+0x4e
00000000`6ed19790 : Qt5Core!qt_logging_to_console+0x15a
000001b7`79015508 : Qt5Core!QMessageLogger::fatal+0x6d
0000004f`ad1ff0f0 : Qt5Core!QEventDispatcherWin32::installMessageHook+0xc0
00000000`00000000 : Qt5Core!QEventDispatcherWin32::createInternalHwnd+0xf3
000001b7`785b0000 : Qt5Core!QEventDispatcherWin32::registerSocketNotifier+0x13e
000001b7`7ad57580 : Qt5Core!QSocketNotifier::QSocketNotifier+0xf9
00000000`00000001 : Qt5Network!QLocalSocket::socketDescriptor+0x4cf7
00000000`00000000 : Qt5Network!QAbstractSocket::setSocketDescriptor+0x256
In the stderr log, I see these messages:
CreateWindow() for QEventDispatcherWin32 internal window failed (Not enough storage is available to process this command.)
Qt: INTERNAL ERROR: failed to install GetMessage hook: 8, Not enough storage is available to process this command.
Here is the function in the Qt codebase where the code stopped:
void QEventDispatcherWin32::installMessageHook()
{
    Q_D(QEventDispatcherWin32);

    if (d->getMessageHook)
        return;

    // setup GetMessage hook needed to drive our posted events
    d->getMessageHook = SetWindowsHookEx(WH_GETMESSAGE, (HOOKPROC) qt_GetMessageHook, NULL, GetCurrentThreadId());
    if (Q_UNLIKELY(!d->getMessageHook)) {
        int errorCode = GetLastError();
        qFatal("Qt: INTERNAL ERROR: failed to install GetMessage hook: %d, %s",
               errorCode, qPrintable(qt_error_string(errorCode)));
    }
}
I did some research: the error "Not enough storage is available to process this command." may mean the OS (Windows) does not have enough resources to process this function (SetWindowsHookEx) and fails to create the hook; Qt then raises a fatal error and my app is killed.
I tested this on Windows Server 2019 and the app works fine; no crashes appear.
I just want to understand the meaning of the error message (stderr), because I don't really know what "Not enough storage" refers to. Is it perhaps a limit or a bug in Windows Server 2016? If so, is there any way to overcome this issue on Windows Server 2016?
The error "Not enough storage is available to process this command" usually occurs on Windows servers when a registry value is set incorrectly, or when the configuration is not set correctly after a recent reset or reinstallation.
Below is a verified procedure for this issue:
Click on Start > Run > regedit and press Enter.
Find the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\LanmanServer\Parameters.
Locate IRPStackSize.
If this value does not exist, right-click the Parameters key, click New > DWORD Value, and type IRPStackSize as the name.
The name of the value must match exactly (same combination of uppercase and lowercase letters) as above.
Right-click IRPStackSize and click Modify.
Select Decimal, enter a value higher than 15 (the maximum is 50 decimal), and click OK.
Close the registry editor and restart your computer.
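If you prefer to script that change instead of using regedit, here is a minimal Win32 sketch (my own addition, not part of the procedure above; it has to run elevated):

#include <windows.h>
#include <cstdio>

int main()
{
    HKEY key = nullptr;
    LONG rc = RegCreateKeyExA(HKEY_LOCAL_MACHINE,
                              "SYSTEM\\CurrentControlSet\\services\\LanmanServer\\Parameters",
                              0, nullptr, 0, KEY_SET_VALUE, nullptr, &key, nullptr);
    if (rc != ERROR_SUCCESS)
    {
        std::fprintf(stderr, "Could not open the LanmanServer Parameters key (%ld)\n", rc);
        return 1;
    }

    DWORD value = 20; // any decimal value higher than 15, up to 50
    rc = RegSetValueExA(key, "IRPStackSize", 0, REG_DWORD,
                        reinterpret_cast<const BYTE *>(&value), sizeof(value));
    if (rc != ERROR_SUCCESS)
        std::fprintf(stderr, "Could not set IRPStackSize (%ld)\n", rc);

    RegCloseKey(key);
    return rc == ERROR_SUCCESS ? 0 : 1;
}

A reboot is still needed afterwards, just as with the manual steps.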
After researching for a few days, I was finally able to configure the Windows Server 2016 settings (registry) to prevent the crash.
So basically it is a limitation of the OS itself, called the desktop heap limitation.
https://learn.microsoft.com/en-us/troubleshoot/windows-server/performance/desktop-heap-limitation-out-of-memory
(The funny thing is that the error message says "Not enough storage is available to process this command", but the real problem comes down to the desktop heap limitation.)
So for the solution, follow the steps in this link: https://learn.microsoft.com/en-us/troubleshoot/system-center/orchestrator/increase-maximum-number-concurrent-policy-instances
I increased the 3rd parameter of SharedSection to 2048 and it fixed the issue.
Summary steps:
Desktop Heap for the non-interactive desktops is identified by the third parameter of the SharedSection= segment of the following registry value:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows
The default data for this registry value will look something like the following:
%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,3072,512 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off MaxRequestThreads=16
The value to be entered into the Third Parameter of the SharedSection= segment should be based on the calculation of:
(number of desired concurrent policies) * 10 = (third parameter value)
Example: if you want 200 concurrent policy instances, then 200 * 10 = 2000; rounding up to a nice memory number gives you 2048 as the third parameter, resulting in the following update to the registry value:
SharedSection=1024,3072,2048
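If you want to check the result from code after the reboot, here is a small sketch (again my own addition) that just reads the raw value back so you can inspect the SharedSection= segment:

#include <windows.h>
#include <cstdio>

int main()
{
    HKEY key = nullptr;
    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                      "System\\CurrentControlSet\\Control\\Session Manager\\SubSystems",
                      0, KEY_QUERY_VALUE, &key) != ERROR_SUCCESS)
    {
        std::fprintf(stderr, "Could not open the SubSystems key\n");
        return 1;
    }

    char data[2048] = {0};
    DWORD size = sizeof(data);
    DWORD type = 0;
    // The "Windows" value is a REG_EXPAND_SZ string containing the SharedSection= segment
    if (RegQueryValueExA(key, "Windows", nullptr, &type,
                         reinterpret_cast<LPBYTE>(data), &size) == ERROR_SUCCESS)
    {
        std::printf("%s\n", data); // look for SharedSection=1024,3072,2048
    }

    RegCloseKey(key);
    return 0;
}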

How to detect cause of Dart VM crash

I have two Dart apps running on Amazon (AWS Ubuntu), which are:
Self-hosted http API
Worker that handles background tasks on a timer
Both apps use PostgreSQL. They were occasionally crashing so, in addition to trying to find the root causes, I also implemented a supervisor script that just detects whether those 2 main apps are running and restarts them as needed.
Now the problem I need to solve is that the supervisor script is crashing, or the VM is crashing. It happens every few days.
I don't think it is a memory leak because if I increase the polling rate from 10s to much more often (1 ns), it correctly shows in the Dart Observatory that it exhausts 30MB and then garbage-collects and starts over at low memory usage, and keeps cycling.
I don't think it's an uncaught exception because the infinite loop is completely enclosed in try/catch.
I'm at a loss for what else to try. Is there a VM dump file that can be examined if the VM really crashed? Is there any other technique to debug the root cause? Is Dart just not stable enough to run apps for days at a time?
This is the main part of the code in the supervisor script:
///never ending function checks the state of the other processes
Future pulse() async {
  while (true) {
    sleep(new Duration(milliseconds: 100)); //DEBUG - was seconds:10
    try {
      //detect restart (as signaled from existence of restart.txt)
      File f_restart = new File('restart.txt');
      if (await f_restart.exists()) {
        log("supervisor: restart detected");
        await f_restart.delete();
        await endBoth();
        sleep(new Duration(seconds: 10));
      }

      //if restarting or either proc crashed, restart it
      bool apiAlive = await isRunning('api_alive.txt', 3);
      if (!apiAlive) await startApi();
      bool workerAlive = await isRunning('worker_alive.txt', 8);
      if (!workerAlive) await startWorker();

      //if it's time to send mail, run that process
      if (utcNow().isAfter(_nextMailUtc)) {
        log("supervisor: starting sendmail");
        Process.start('dart', [rootPath() + '/sendmail.dart'], workingDirectory: rootPath());
        _nextMailUtc = utcNow().add(_mailInterval);
      }
    } catch (ex) {}
  }
}
If you have the Observatory up, you can get a crash dump with:
curl localhost:<your observatory port>/_getCrashDump
I'm not totally sure if this is related but Process.start returns a future which I don't believe will be caught by your try/catch if it completes with an error...

libircclient : Selective connection absolutely impossible to debug

I'm not usually the type to post a question; I'd rather search for why something doesn't work first. But this time I did everything I could, and I just can't figure out what is wrong.
So here's the thing:
I'm currently programming an IRC Bot, and I'm using libircclient, a small C library to handle IRC connections. It's working pretty great, it does the job and is kinda easy to use, but ...
I'm connecting to two different servers, and so I'm using the custom networking loop, which uses the select function. On my personal computer, there's no problem with this loop, and everything works great.
But (Here's the problem), on my remote server, where the bot will be hosted, I can connect to one server but not the other.
I tried to debug everything I could. I even went through the sources of libircclient to see how it worked, put some printfs where I could, and I could see where it comes from, but I don't understand why it does this.
So here's the code for the server (the irc_session_t objects are encapsulated, but it's normally quite easy to understand; feel free to ask for more information if you want):
// Connect the first session
first.connect();
// Connect the osu! session
second.connect();

// Initialize sockets sets
fd_set sockets, out_sockets;
// Initialize sockets count
int sockets_count;
// Initialize timeout struct
struct timeval timeout;

// Set running as true
running = true;

// While the server is running (which means always)
while (running)
{
    // First session has disconnected
    if (!first.connected())
        // Reconnect it
        first.connect();

    // Second session has disconnected
    if (!second.connected())
        // Reconnect it
        second.connect();

    // Reset timeout values
    timeout.tv_sec = 1;
    timeout.tv_usec = 0;

    // Reset sockets count
    sockets_count = 0;

    // Reset sockets and out sockets
    FD_ZERO(&sockets);
    FD_ZERO(&out_sockets);

    // Add sessions descriptors
    irc_add_select_descriptors(first.session(), &sockets, &out_sockets, &sockets_count);
    irc_add_select_descriptors(second.session(), &sockets, &out_sockets, &sockets_count);

    // Select something. If it went wrong
    int available = select(sockets_count + 1, &sockets, &out_sockets, NULL, &timeout);

    // Error
    if (available < 0)
        // Error
        Utils::throw_error("Server", "run", "Something went wrong when selecting a socket");

    // We have a socket
    if (available > 0)
    {
        // If there was something wrong when processing the first session
        if (irc_process_select_descriptors(first.session(), &sockets, &out_sockets))
            // Error
            Utils::throw_error("Server", "run", Utils::string_format("Error with the first session: %s", first.get_error()));

        // If there was something wrong when processing the second session
        if (irc_process_select_descriptors(second.session(), &sockets, &out_sockets))
            // Error
            Utils::throw_error("Server", "run", Utils::string_format("Error with the second session: %s", second.get_error()));
    }
}
The problem in this code is that this line:
irc_process_select_descriptors(second.session(), &sockets, &out_sockets)
always returns an error the first time it is called, and only for one server. The weird thing is that on my Windows computer it works perfectly, while on the Ubuntu server it just doesn't, and I can't understand why.
I did some in-depth debugging, and I saw that libircclient does this:
if (session->state == LIBIRC_STATE_CONNECTING && FD_ISSET(session->sock, out_set))
And this is where everything goes wrong. The session state is correctly set to LIBIRC_STATE_CONNECTING, but the second condition, FD_ISSET(session->sock, out_set), always returns false. It returns true for the first session, but never for the second.
The two servers are irc.twitch.tv:6667 and irc.ppy.sh:6667. The servers are correctly set, and the server passwords are correct too, since everything works fine on my personal computer.
Sorry for the very long post.
Thanks in advance!
Alright, after some hours of debugging, I finally found the problem.
So when a session starts connecting, it enters the LIBIRC_STATE_CONNECTING state, and then when irc_process_select_descriptors is called, it checks this:
if (session->state == LIBIRC_STATE_CONNECTING && FD_ISSET(session->sock, out_set))
The problem is that select() alters the socket sets you pass in, clearing every descriptor that is not ready.
So if the server hasn't sent us anything before irc_process_select_descriptors is called, FD_ISSET returns 0, because select() considered that socket not ready.
I fixed it by just writing:
if (session->state == LIBIRC_STATE_CONNECTING)
{
    if (!FD_ISSET(session->sock, out_set))
        return 0;
    ...
}
So it makes the program wait until select() actually reports the socket as ready before processing it.
Sorry for not having checked everything!
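For reference, here is a minimal standalone sketch (my own illustration, not libircclient code) of the behaviour described above: select() overwrites the sets you pass in, so a descriptor only remains set in the write set once the pending connect has actually completed.

#include <sys/select.h>
#include <sys/time.h>

// 'sock' is assumed to be a non-blocking socket with a connect() in progress.
bool connect_finished(int sock)
{
    fd_set out_set;
    FD_ZERO(&out_set);
    FD_SET(sock, &out_set); // ask select() to watch the connecting socket for writability

    struct timeval timeout;
    timeout.tv_sec = 1;
    timeout.tv_usec = 0;

    // select() clears every descriptor that is not ready
    int available = select(sock + 1, NULL, &out_set, NULL, &timeout);

    // While the connection is still in progress, FD_ISSET() is false here,
    // which is exactly what irc_process_select_descriptors was tripping over
    return available > 0 && FD_ISSET(sock, &out_set);
}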

Boost UDP socket issue on unix - bind: address already in use

First of all, I know there are several other threads on the same theme, but I was unable to find anything in those that could help me so I'll try to be very specific with my situation.
I have set up a simple UDP client / UDP server pair that is responsible for sending data between several parallel simulations. That is, every instance of the simulator runs in a separate thread and sends data on a UDP socket. In the master thread, the server runs and routes the messages between the simulations.
The (for this problem) important parts of the server code look like this:
UDPServer::UDPServer(boost::asio::io_service &m_io_service) :
    m_socket(m_io_service, udp::endpoint(udp::v4(), PORT_NUMBER)),
    m_endpoint(boost::asio::ip::address::from_string("127.0.0.1"), PORT_NUMBER)
{
    this->start_receive();
};

void UDPServer::start_receive() {
    // Set SO_REUSEADDR to true
    boost::asio::socket_base::reuse_address option(true);
    this->m_socket.set_option(option);

    // Specify what happens when a message is received (it should call the handle_receive function)
    this->m_socket.async_receive_from(boost::asio::buffer(this->recv_buffer),
        this->m_endpoint,
        boost::bind(&UDPServer::handle_receive, this,
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred));
};
This works fine on my Windows workstation.
The thing is, I want to be able to run this on a Linux cluster, which is why I compiled it and tried to run it on a cluster node. The code compiled without a hitch, but when I try to run it I get the error
bind: address already in use
I use a port number above 1024 and have verified that it is not in use by another program. And as seen above, I also set the reuse_address option, so I really don't know what else could be wrong.
To portably use SO_REUSEADDR you need to set the option before binding the socket to the wildcard address:
UDPServer::UDPServer(boost::asio::io_service &m_io_service) :
    m_socket(m_io_service, udp::v4()),
    m_endpoint()
{
    boost::asio::socket_base::reuse_address option(true);
    this->m_socket.set_option(option);
    this->m_socket.bind(udp::endpoint(udp::v4(), PORT_NUMBER));
    this->start_receive();
}
In your original code, the constructor that takes an endpoint constructs, opens and binds the socket in a single line - it's concise but not very flexible. Here we're constructing and opening the socket in the constructor call, and then binding it later after we set the option.
As an aside, there's not much point initialising m_endpoint if you're just going to use it as the out argument of async_receive_from anyway.
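To illustrate that aside, here is a rough sketch of the handle_receive member that the boost::bind above refers to (the member names come from the question; the body is my own guess): async_receive_from fills m_endpoint with the sender's address every time a datagram arrives, so whatever it was initialised to is simply overwritten.

#include <iostream>

// Sketch only: recv_buffer, m_socket and m_endpoint are assumed to be members
// of UDPServer, as in the question.
void UDPServer::handle_receive(const boost::system::error_code &error,
                               std::size_t bytes_transferred)
{
    if (!error)
    {
        // m_endpoint now holds the sender's address and port
        std::cout << "Received " << bytes_transferred
                  << " bytes from " << m_endpoint << std::endl;
    }

    // Re-arm the asynchronous receive so the server keeps listening
    this->start_receive();
}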
Try running the following command on Linux to see if the port is already being used by another program.
netstat -antup | grep 1024
If you are getting "address already in use" then it is definitely being used by some other program. If the above command yields a result, kill the process ID reported in the output. If this does not work, try changing the port number to some other arbitrary port and check whether the problem persists.

mysql reconnect c++

Right now I have a C++ client application that uses mysql.h to connect to a MySQL database and has to perform some logic in case there is a disconnect. I'm wondering if this is the best way to reconnect to a MySQL database when my client gets disconnected.
bool MYSQL::Reconnect(const char *host, const char *user, const char *passwd, const char *db)
{
    bool out = false;
    pid_t command_pid = fork();

    if (command_pid == 0)
    {
        // Child process: retry the connection once per second until it succeeds
        while (1)
        {
            sleep(1);
            if (mysql_real_connect(&m_mysql, host, user, passwd, db, 0, NULL, 0) == NULL)
            {
                fprintf(stderr, "Failed to connect to database: Error: %s\n",
                        mysql_error(&m_mysql));
            }
            else
            {
                m_connected = true;
                out = true;
                break;
            }
        }
        exit(0);
    }

    if (command_pid < 0)
        fprintf(stderr, "Could not fork process[reconnect]: %s\n", mysql_error(&m_mysql));

    return out;
}
Right now I take in all my parameters and perform a fork. The child process attempts to reconnect every second with a sleep() statement. Is this a good way to do this? Thanks.
Sorry, but your code doesn't do what you think it does, Kaiser Wilhelm.
In essence, you're trying to treat a fork like a thread, which it is not.
When you fork a child, the parent process is completely cloned, including file and socket descriptors, which is how your program is connected to the MySQL database server. That is, both the parent and the child end up with their own copy of the same connection to the database server when you fork. I assume the parent only calls this Reconnect() method when it sees the connection drop, and stops using its copy of the now-defunct MySQL connection object, m_mysql. If so, the parent's copy of the connection is just as useless as the client's when you start the reconnect operation.
The thing is, the reverse is not true: once the child manages to reconnect to the database server, the parent's connection object remains defunct. Nothing the child does propagates back up to the parent. After the fork, the two processes are completely independent, except insofar as they might try to access some I/O resource they initially shared. For example, if you called this Reconnect() while the connection was up and continued using the connection in the parent, the child's attempts to talk to the DB server on the same connection would confuse either mysqld or libmysqlclient, likely causing data corruption or a crash.
As hinted above, one solution to this is to use threads instead of forking. Beware, however, of the many problems with using threads with the MySQL C API.
Given a choice, I'd rather use asynchronous I/O to do the background connection attempt within the application's main thread, but the MySQL C API doesn't allow that.
It seems you're trying to avoid blocking your main application thread while attempting the DB server reconnection. It may be that you can get away with doing it synchronously anyway by setting the connect timeout to 1 second, which is fine when the MySQL server is on the same machine or same LAN as the client. If you could tolerate your main thread blocking for up to a second for connection attempts to fail — worst case happening when the server is on a separate machine and it's physically disconnected or firewalled — this would probably be a cleaner solution than threads. The connection attempt can fail much quicker if the server machine is still running and the port isn't firewalled, such as when it is rebooting and the TCP/IP stack is [still] up.
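A rough sketch of that synchronous approach, reusing the m_mysql and m_connected members from the question (ReconnectBlocking is a name I made up; MYSQL_OPT_CONNECT_TIMEOUT is the standard C API option):

// Attempt a single, bounded reconnect without forking. Blocks the caller for
// at most about one second if the server is unreachable.
bool MYSQL::ReconnectBlocking(const char *host, const char *user,
                              const char *passwd, const char *db)
{
    unsigned int timeout_seconds = 1;
    mysql_options(&m_mysql, MYSQL_OPT_CONNECT_TIMEOUT, &timeout_seconds);

    if (mysql_real_connect(&m_mysql, host, user, passwd, db, 0, NULL, 0) == NULL)
    {
        fprintf(stderr, "Reconnect failed: %s\n", mysql_error(&m_mysql));
        m_connected = false;
        return false;
    }

    m_connected = true;
    return true;
}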
As far as I can tell, this doesn't do what you intended.
Logical issues
Reconnect doesn't "perform some logic in case there is a disconnect" at all.
It attempts to connect over and over again until it succeeds, then stops. That's it. The state of the connection is never checked again. If the connection drops, this code knows nothing about it.
Technical issues
Also pay close attention to the technical issues that Warren raises.
Sure, it's perfectly OK. You might want to think about replacing the while ( 1 ) loop with something like
while ( NULL == mysql_real_connect( ... )) {
    sleep( 1 );
    ...
}
which is the kind of idiom that one learns by practice, but your code works just fine as far as I can see. Don't forget to put a counter inside the while loop.
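Fleshed out with a counter, the idiom might look like this inside a Reconnect()-style method (max_attempts and the member names are assumptions based on the question's code, not part of the answer):

int attempts = 0;
const int max_attempts = 30; // illustrative limit so the loop cannot spin forever

while (NULL == mysql_real_connect(&m_mysql, host, user, passwd, db, 0, NULL, 0))
{
    fprintf(stderr, "Reconnect attempt %d failed: %s\n", ++attempts, mysql_error(&m_mysql));
    if (attempts >= max_attempts)
        return false; // give up and let the caller decide what to do next
    sleep(1);
}

m_connected = true;
return true;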