How to detect cause of Dart VM crash - amazon-web-services

I have two Dart apps running on Amazon (AWS Ubuntu), which are:
- Self-hosted HTTP API
- Worker that handles background tasks on a timer
Both apps use PostgreSQL. They were occasionally crashing so, in addition to trying to find the root causes, I also implemented a supervisor script that just detects whether those 2 main apps are running and restarts them as needed.
Now the problem I need to solve is that the supervisor script is crashing, or the VM is crashing. It happens every few days.
I don't think it is a memory leak, because if I increase the polling rate from 10 s to much more often (1 ns), the Dart Observatory correctly shows memory climbing to 30MB, being garbage-collected back down to low usage, and the cycle repeating indefinitely.
I don't think it's an uncaught exception because the infinite loop is completely enclosed in try/catch.
I'm at a loss for what else to try. Is there a VM dump file that can be examined if the VM really crashed? Is there any other technique to debug the root cause? Is Dart just not stable enough to run apps for days at a time?
This is the main part of the code in the supervisor script:
///never ending function checks the state of the other processes
Future pulse() async {
  while (true) {
    sleep(new Duration(milliseconds: 100)); //DEBUG - was seconds:10
    try {
      //detect restart (as signaled from existence of restart.txt)
      File f_restart = new File('restart.txt');
      if (await f_restart.exists()) {
        log("supervisor: restart detected");
        await f_restart.delete();
        await endBoth();
        sleep(new Duration(seconds: 10));
      }
      //if restarting or either proc crashed, restart it
      bool apiAlive = await isRunning('api_alive.txt', 3);
      if (!apiAlive) await startApi();
      bool workerAlive = await isRunning('worker_alive.txt', 8);
      if (!workerAlive) await startWorker();
      //if it's time to send mail, run that process
      if (utcNow().isAfter(_nextMailUtc)) {
        log("supervisor: starting sendmail");
        Process.start('dart', [rootPath() + '/sendmail.dart'], workingDirectory: rootPath());
        _nextMailUtc = utcNow().add(_mailInterval);
      }
    } catch (ex) {}
  }
}

If you have the Observatory up, you can get a crash dump with:
curl localhost:<your observatory port>/_getCrashDump
I'm not totally sure if this is related, but Process.start returns a Future, and if that future completes with an error, I don't believe your try/catch will catch it...
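A minimal sketch of one way to keep such an error inside the try/catch, assuming the same sendmail invocation and log helper as in the question (the startSendmail wrapper name and log wording are illustrative): await the future, so a failed spawn is thrown where the catch can see it.

import 'dart:io';

Future startSendmail(String root) async {
  try {
    // Awaiting the future means a failed spawn throws here, inside the
    // try/catch, instead of becoming an unhandled asynchronous error.
    Process p = await Process.start('dart', [root + '/sendmail.dart'],
        workingDirectory: root);
    log("supervisor: sendmail started, pid ${p.pid}");
  } catch (ex) {
    log("supervisor: sendmail failed to start: $ex");
  }
}

If you would rather keep the call fire-and-forget, attaching .catchError(...) to the future returned by Process.start at least observes the error instead of letting it escape the loop's try/catch.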

Related

Google Cloud Functions Java 11 (Beta) Runtime - Performance Issue

I have created a new Cloud Function using Java 11 (Beta) Runtime to handle HTML form submission for my static site. It's a simple 3-field form (name, email, message). No file upload is involved. The function does 2 things primarily:
- Creates a pull request with BitBucket
- Sends email to me using SendGrid
NOTE: It also verifies recaptcha but I've disabled it for testing.
The function, when run on my local machine (base model 2019 MacBook Pro 13"), takes about 3 secs. I'm based in SE Asia. The same function, when deployed to Google Cloud us-central1, takes about 25 secs (8 times slower). I have almost the same code running in production as part of a Servlet on the GAE Java 8 runtime, also in the US Central region, for a few years. It takes about 2-3 secs including recaptcha verification and sending the email. I'm trying to port it over to a Cloud Function, but the performance is about 10 times slower with the Cloud Function even without recaptcha verification.
For comparison, the Cloud Function is running on a 256MB / 400MHz instance, whereas my GAE Java 8 runtime runs on an F1 (128MB / 600MHz) instance. The function is using only about 75MB of memory. The function is configured to accept unauthenticated requests.
I noticed that even basic String concatenation like: String c = a + b; takes a good 100ms on the Cloud Function. I have timed the calls and a simple string concatenation of about 15 strings into one takes about 1.5-2.0 seconds.
Moreover, writing a small message (~ 1KB) to the HttpURLConnection output stream and reading the response back takes about 10 seconds (yes, seconds)!
/* Writing < 1KB to the output stream takes about 4-5 secs */
wr = new OutputStreamWriter(con.getOutputStream());
wr.write(encodedParams);
wr.flush();
wr.close();
/* Reading the response also takes about 4-5 secs */
String responseMessage = con.getResponseMessage();
Similarly, the SendGrid code below takes another 10 secs to send the email. It takes about 1 sec on my local machine.
Email from = new Email(fromEmail, fromName);
Email to = new Email(toEmail, toName);
Email replyTo = new Email(replyToEmail, replyToName);
Content content = new Content("text/html", body);
Mail mail = new Mail(from, subject, to, content);
mail.setReplyTo(replyTo);
SendGrid sg = new SendGrid(SENDGRID_API_KEY);
Request sgRequest = new Request();
Response sgResponse = null;
try {
    sgRequest.setMethod(Method.POST);
    sgRequest.setEndpoint("mail/send");
    sgRequest.setBody(mail.build());
    sgResponse = sg.api(sgRequest);
} catch (IOException ex) {
    throw ex;
}
Something is obviously wrong with the Cloud Function. Since my original code is running on GAE Java 8 runtime, it was very easy for me to port it over to the Cloud Function with minor changes. Otherwise I would have gone with NodeJS runtime. I'm also not seeing any of the performance issues when running this function on my local machine.
Can someone help me make sense of the slow performance issue?
What you're seeing is almost certainly due to the "cold start" cost associated with the creation of a new server instance to handle the request. This is an issue with all types of Cloud Functions, as described in the documentation:
Several of the recommendations in this document center around what is known as a cold start. Functions are stateless, and the execution environment is often initialized from scratch, which is called a cold start. Cold starts can take significant amounts of time to complete. It is best practice to avoid unnecessary cold starts, and to streamline the cold start process to whatever extent possible (for example, by avoiding unnecessary dependencies).
I would expect JVM languages to have an even longer cold start time due to the amount of time that it takes to initialize a JVM, in addition to the server instance itself.
Other than the advice above, there is very little one can do to effectively mitigate cold starts. Efforts to keep a function warm are not as effective as you might imagine. There is a lot of discussion about this on the internet if you wish to search.
Keep in mind that the Java runtime is also in beta, so you can expect improvements in the future. The same thing happened with the other runtimes.
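One mitigation that does help within a given instance is to pay expensive setup costs once, outside the request path, so warm invocations reuse them. A minimal sketch against the Cloud Functions Java API (the class name and field are illustrative, not taken from the question's code):

import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;
import com.sendgrid.SendGrid;

public class ContactFormFunction implements HttpFunction {
  // Constructed once per server instance; every warm invocation
  // reuses it instead of rebuilding the client per request.
  private static final SendGrid SENDGRID =
      new SendGrid(System.getenv("SENDGRID_API_KEY"));

  @Override
  public void service(HttpRequest request, HttpResponse response) throws Exception {
    // Only per-request work belongs here: build the Mail and call
    // SENDGRID.api(...) as in the question's snippet.
    response.getWriter().write("ok");
  }
}

This does not shorten the cold start itself, but it stops client construction from being repeated on every warm request.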

crossbar Out-of-memory when pub runs faster than sub

I was doing some pub/sub testing with autobahn-cpp. However, I found that when you publish data at a frequency faster than the sub endpoint can consume it, the router (Crossbar) caches the data and its memory usage increases. Eventually, the router uses up all the memory and is killed by the OS.
For example
publisher:
while (1)
{
    session->publish("com.pub.test", std::make_tuple(std::string("hello, world")));
    std::this_thread::sleep_for(std::chrono::seconds(1)); // sleep 1 s
} // publish a string every second
subscriber:
void topic1(const autobahn::wamp_event& event)
{
    try
    {
        auto s = event.argument<std::string>(0);
        std::cerr << s << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(2)); // needs 2 s to finish the job
    }
    catch (std::exception& e)
    {
        std::cerr << e.what() << std::endl;
    }
}

main()
{
    ...
    session->subscribe("com.pub.test", &topic1);
    ...
} // pub runs faster than the sub can consume
After several hours:
2016-01-7 10:11:32+0000 [Controller 16142] Worker 16145: Process connection gone (A process has ended with a probable error condition: process ended by signal 9.)
dmesg:
Out of memory: Kill process 16145 (Crossbar.io Wor) score 4 or sacrifice child
My questions:
Is this normal (using up all the memory and being killed by the OS)?
Or are there any config options that can be set to limit the memory usage?
I found a similar issue: https://github.com/crossbario/crossbar/issues/48
System info: Ubuntu 14.04 (32-bit), CPython 2.7.6, Crossbar.io 0.11.1, Autobahn 0.10.9
The client is filling up with messages it hasn't delivered yet.
This is a "feature" of message-based protocols.
Instead of request -> response, it's request -> response + response + response + etc.
You're running into "backpressure": the queue of responses to send is filling up faster than the client can receive them.
You should stop producing, or drop responses. Do you need all the responses, or just the latest?
There is some "backpressure" documentation from uWebSockets.
There is an "Observable" pattern (similar to Promises) that can help; RxJS is the JavaScript version, but I'm sure there is something similar for C++. It's like a streaming promise library.
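If only the latest value matters, one way to act on that advice from the subscriber side is a conflating buffer: the event callback overwrites a single slot, and the slow worker only ever processes the most recent message, so no backlog can build up behind it. A minimal sketch, not Autobahn-specific (the function names are illustrative):

#include <chrono>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

std::mutex mtx;
std::string latest;       // single slot: a new message overwrites the old one
bool has_latest = false;

// Called from the event loop for every incoming event.
void on_message(const std::string& msg)
{
    std::lock_guard<std::mutex> lock(mtx);
    latest = msg;         // silently drop anything not yet processed
    has_latest = true;
}

// Slow worker: only ever handles the most recent message.
void worker()
{
    while (true)
    {
        std::string msg;
        bool have = false;
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (has_latest)
            {
                msg = latest;
                has_latest = false;
                have = true;
            }
        }
        if (have)
        {
            std::cerr << msg << std::endl;
            std::this_thread::sleep_for(std::chrono::seconds(2)); // the 2 s job
        }
        else
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }
}

Whether dropping is acceptable depends on the application; if every message matters, the producer has to be throttled instead.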

Android usb host input bulktransfer fails to read randomly when data available

The following code runs inside a thread and reads input data coming over USB. Approximately every 80 readings it misses one of the packets coming from an STM32 board. The board is programmed to send data packets to the Android tablet every second.
// Non-working code
while (running) {
    int resp = bulkTransfer(mInEp, mBuf, mBuf.length, 1000);
    if (resp > 0) {
        dispatchMessage(mBuf);
    } else if (resp < 0) {
        showsBufferEmptyMessage();
    }
}
I was looking at the Missile Launcher example from Android and other libraries on the internet, and they put a delay of 50 ms between each poll. Doing this solves the missing-packet problem.
// Working code
while (running) {
    int resp = bulkTransfer(mInEp, mBuf, mBuf.length, 1000);
    if (resp > 0) {
        dispatchMessage(mBuf);
    } else if (resp < 0) {
        showsBufferEmptyMessage();
    }
    try {
        Thread.sleep(50);
    } catch (Exception e) {}
}
Does anyone know the reason why the delay works? Most of the libraries on GitHub have this delay and, as I mentioned before, the Google example does too.
I am putting down my results regarding this problem. It seems that the UsbDeviceConnection.bulkTransfer(...) method has some bug when called continuously. The solution was to use the asynchronous API, the UsbRequest class. Using this approach I could read from the input endpoint without a delay, and no data was lost during the whole stress test. So the direction to take is the asynchronous UsbRequest instead of the synchronous bulkTransfer.
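For reference, the asynchronous pattern looks roughly like this. It is a sketch, assuming mConnection is the UsbDeviceConnection the endpoint was claimed on and mInEp is the IN endpoint from the question; dispatchMessage and running are the question's own:

import android.hardware.usb.UsbDeviceConnection;
import android.hardware.usb.UsbEndpoint;
import android.hardware.usb.UsbRequest;
import java.nio.ByteBuffer;

void readLoop(UsbDeviceConnection mConnection, UsbEndpoint mInEp) {
    ByteBuffer buffer = ByteBuffer.allocate(mInEp.getMaxPacketSize());
    UsbRequest request = new UsbRequest();
    request.initialize(mConnection, mInEp);
    while (running) {
        buffer.clear();
        request.queue(buffer, buffer.capacity());  // queue an asynchronous read
        // requestWait() blocks until a queued request completes
        if (mConnection.requestWait() == request) {
            dispatchMessage(buffer.array());       // the packet is in the buffer
        }
    }
    request.close();
}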

Restarting Application: Delay

I'm coding a restart feature into my newest Crysis Wars server modification that remotely reboots the server. This is useful if the server has a problem that a simple system reload does not fix; it is also useful for telling the server to restart at a specified time in order to free up memory.
I have coded the required functions in order to achieve this, and the application itself has no problem restarting. The issue is that the port is not closed quickly enough, resulting in a new instance of the application that cannot function properly.
I am looking for an ideal solution whereby the program shuts down and launches two seconds later, instead of immediately. Doing this will give Windows enough time to free the port that the server was using and to clean up any existing memory.
Please note: I have removed my other (related) question, since apparently closing a program's port is impossible unless it is told to do so when it is assigned the port, which is something I cannot do since I don't have access to the source code that binds to the port.
The code, if it's required
int CScriptBind_GameRules::Restart(IFunctionHandler *pH)
{
    bool arg1 = false;
    const char *arg2 = "";
    gEnv->pScriptSystem->BeginCall("Dynamic", "GetValue");
    gEnv->pScriptSystem->PushFuncParam("r.enable");
    gEnv->pScriptSystem->EndCall(arg1);
    gEnv->pScriptSystem->BeginCall("Dynamic", "GetValue");
    gEnv->pScriptSystem->PushFuncParam("r.line");
    gEnv->pScriptSystem->EndCall(arg2);
    if (arg1)
    {
        LogMsg(2, "System restart initiated.");
        if (arg2)
        {
            LogMsg(2, "System Reboot.");
            gEnv->pScriptSystem->BeginCall("os", "execute");
            gEnv->pScriptSystem->PushFuncParam(arg2);
            gEnv->pScriptSystem->EndCall(), close((int)gEnv->pConsole->GetCVar("sv_port")->GetString());
            return pH->EndFunction();
        }
        else
        {
            LogMsg(2, "Internal Faliure.");
            return pH->EndFunction();
        }
        return pH->EndFunction();
    }
    LogMsg(2, "System restart cancelled: Feature is Disabled.");
    return pH->EndFunction();
}
What I usually do is add a command-line parameter, 'StartupDelay'. When the server (or whatever) starts up, before attempting to run the listener etc., it checks for that parameter. If there is no param, it runs up normally; if it finds 'StartupDelay=2000', it sleeps for 2 seconds before attempting to start the listener.
The result: if started from a desktop icon, it starts immediately. If it needs to restart itself, it sets the parameter to instruct the new instance of itself to wait as directed.
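A minimal sketch of that pattern (the parameter name and parsing are illustrative, not from the mod's actual code):

#include <chrono>
#include <cstdlib>
#include <cstring>
#include <thread>

int main(int argc, char* argv[])
{
    // If we were relaunched with StartupDelay=<ms>, wait before
    // binding the port so Windows has time to release it.
    for (int i = 1; i < argc; ++i)
    {
        if (std::strncmp(argv[i], "StartupDelay=", 13) == 0)
        {
            int delayMs = std::atoi(argv[i] + 13);
            std::this_thread::sleep_for(std::chrono::milliseconds(delayMs));
        }
    }
    // ... start the listener / bind the server port as usual ...
    return 0;
}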

Jetty 8.1 flooding the log file with "Dispatched Failed" messages

We are using Jetty 8.1 as an embedded HTTP server. Under overload conditions the server sometimes starts flooding the log file with these messages:
warn: java.util.concurrent.RejectedExecutionException
warn: Dispatched Failed! SCEP#76107610{l(...)<->r(...),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}...
The same message is repeated thousands of times, and the amount of logging appears to slow down the whole system. The messages themselves are fine; our request handler is just too slow to process the requests in time. But the huge number of repeated messages actually makes things worse and makes it more difficult for the system to recover from the overload.
So, my question is: is this a normal behaviour, or are we doing something wrong?
Here is how we set up the server:
Server server = new Server();
SelectChannelConnector connector = new SelectChannelConnector();
connector.setAcceptQueueSize( 10 );
server.setConnectors( new Connector[]{ connector } );
server.setThreadPool( new ExecutorThreadPool( 32, 32, 60, TimeUnit.SECONDS,
        new ArrayBlockingQueue<Runnable>( 10 )));
The SelectChannelEndPoint is the origin of this log message.
To stop seeing it, just set the named logger org.eclipse.jetty.io.nio.SelectChannelEndPoint to LEVEL=OFF.
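How you set that depends on which logging backend Jetty is bound to. Two sketches, to be verified against your actual logging setup: with Jetty's built-in StdErrLog it is a per-logger system property, and with an slf4j backend such as logback it is a logger entry.

java -Dorg.eclipse.jetty.io.nio.SelectChannelEndPoint.LEVEL=OFF -jar yourserver.jar

<!-- logback.xml -->
<logger name="org.eclipse.jetty.io.nio.SelectChannelEndPoint" level="OFF"/>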
Now as for why you see it, that is more interesting to the developers of Jetty. Can you detail what specific version of Jetty you are using and also what specific JVM you are using?