What causes libdispatch error EVFILT_MACHPORT in MacOSX Sierra? - c++

Good morning,
I am facing a crash in my application. When the user tries to start it, he waits like a minute and then a std::exception is raised. Really I could not reproduce the bug by myself, but it seems quite a common problem.
The only thing I could track is the following line in syslog:
BUG in libdispatch client: kevent[EVFILT_MACHPORT] monitored resource vanished before the source cancel handler was invoked
Then, I start to google it and I can not find much more...I can only "suppose" that is some problem with GCD (that I do not use afaik, or at least not directly...). What I saw in Internet is that it is related with MacOSX Sierra. But the majority of forum have no answer, just a lot of tries without a unique result. Maybe the only web page that seems a bit clear about a workaround (that I have not tested, and anyway I do not want to use) is this.
So...:
someone has clear what can cause the exception in libdispatch?
someone can give me some good link, official documentation or something?
Is true that can be a bug in Sierra without updates?
Could it be related with the installer of an application?
Someone knows a way to reproduce this exception with a test program?

This libdispatch log message is not fatal, and is almost certainly not related to your crash, which sounds like an abort due to an uncaught C++ exception (without a crashreport/backtrace it is difficult to say anything more). libdispatch does not itself generate any C++ exceptions FWIW.
As to the meaning of that specific log message, it relates to the following section in the dispatch_source_create(3) manpage:
CANCELLATION:
Important: a cancellation handler is required for file descriptor and mach port based sources in order
to safely close the descriptor or destroy the port. Closing the descriptor or port before the cancellation handler has run may result in a race condition: if a new descriptor is allocated with the same
value as the recently closed descriptor while the source's event handler is still running, the event
handler may read/write data to the wrong descriptor.
If you see the EVFILT_MACHPORT "vanished" log message, somebody in your process has violated that API contract and deallocated a machport while a dispatch source was still monitoring it (causing the kernel to generate an EV_VANISHED kevent), see the source code.

Related

My server exited with code 137

I wrote a C++ server/client pair using C++11, boost::asio and HDF5. The server was running fine for a some time (2 days), and then it stopped with code 137. Since I executed the server with an infinite loop, it was restarted.
Unfortunately, my error logs don't provide sufficient information to understand the problem. So I've been trying to understand what this code means. It seems there's consensus that this means it's an error of 128+9, with 9 meaning that the program was killed with kill -9. Now I'm not sure at all why this happened. I need help to find out.
By reading further, I found out that it could have been killed by the system because it exceeded a certain allowed execution time, and thus the system killed it. Now this is not so unlikely, since my linux server is provided by my university, so they could be applying some kind of security to do this. I read about something called timeout in linux. My first question is: How can I know if this is the cause of the problem?
My second question is: what should I check also to understand this problem? What would you do? Please advise.
If you require any additional information, please ask.
Thanks.
Sounds like you have blown through memory limits and your linux memory manager sent SIGKILL to your process. In that case you should check /var/log/messages file to see if there is anything about it. That's the first thing I would do. Check with your sysadmin if you don't have permissions.

OpenSSL decryption failed or bad record mac boost::asio

I'm writing a transparent intercepting HTTPS capable proxy using boost::asio + openSSL. I have a default server context where I specify that the server is a TLSv1.2 server, when a client connects, I extract the host from the hello and use SSL_set_SSL_CTX to set the context (which either already exists or I've just created it after spoofing the upstream cert) and initiate the server (downstream) read/write volley as well as the upstream.
This was working before I started storing and sharing contexts. On each new incoming connection, I was creating a new client socket and context, loading ca-bundle as verify file, then creating a new server context, getting the spoofed certificate. It was functioning, but I started developing issues where EC_KEY objects were being double freed and such. I learned from another question of mine that I was going about this the wrong way and began refactoring to recycle and share CTX objects. To be specific, I'm using a single client CTX shared across the board that loads, at program startup, the CA-Bundle for verification.
However, since this refactor, I'm getting this on both the client and the server:
decryption failed or bad record mac
..mixed with a bajillion "short read"s. If I try to force everything TLSv1.2, I get
block cipher pad is wrong
Those errors are given to me after a read/write has failed and I call async_shutdown on either upstream or downstream sockets, which in the callback, error is set (so the shutdown failed).
I've scoured the interwebs finding jira posts from places like apache httpd and nginx where this error was fixed in different ways (resizing read buffers to be larger, openSSL patches, forcing SSLv3, so on and so forth).
I thought there might be an issue with multithreading (my io-service uses a thread pool) but I can see in the code that boost do_init sets locking mechanics for openSSL and all of my IO are wrapped into a single strand.
I'm at a total loss and am wondering if anyone can shed light on what might be happening. I realize I've posted no code, that's because I've got hundreds and hundreds of lines of it and don't want to turn people off with a huge code dump. I realize however this is a rather complicated program and thus a complicated issue so please ask and I'll provide whatever I can.
Edit
I guess I should mention for completeness that I'm getting these errors on both openssl 1.0.2 and 1.0.2a, Win 8.1 x64 and I'm intercepting and routing the http/https traffic through my proxy with with WinDivert.
Edit 2
Reduced entire program to 1 thread, same effect. Created new client CTX for each client connection, same issue. Tried disabling AES-NI, issue persists. Tried different computer, same effect. Recompiled openssl from source (was using precompiled binaries), issue persists. Tried setting additional OP_ workaround flags described in current docs related to downgrade detection, padding bugs, so on and so forth, issue persist. I think I'll just start randomly mashing the keyboard and compile button soon.
I was going to just delete this question, but I decided to answer it in light of the fact that nowhere on the net (that I could find) actually pointed to a correct solution to this problem. I've read every single report about this error that one could find and every single one of those reports, the people "solved" or "reduced" this error in a different way. Every single one of them, a different solution. This is what helped make this issue so difficult to reason out, because everyone everywhere has a different underlying causal explanation.
It's complicated, ready? This error will present itself if you cancel/abort a pending async SSL operation. Mind->boom(). It'll be even more confusing if you do what the docs say and use async_shutdown to do so, because even the call back to async_shutdown will fail (error code is set) and your error message will randomly be something stupid like "decryption failed or bad record mac" or "block cipher pad is wrong" or "SSLv3 alert!" so on and so forth. When seeing errors like this, ignore the errors and analyze the control flow of your IO ops, somewhere you're either prematurely ending them or getting them out of order.
In my case, the premature end was (sort of) intentional, since during this stupid heavy refactor I decided to change things outside the scope of the problem, like my HTTPHeader parser, which I bugged out and ended up cause it to fail nearly 100% and thus aborting the connections. :) The error strings were masking the real cause by telling me encryption failed for some reason or another. Dumb mistake I know, but I take comfort in being the first one (apparently) to recognize it. :)
Open a powershell and type this
(Invoke-WebRequest -Uri status.dev.azure.com).StatusDescription
https://devblogs.microsoft.com/devops/deprecating-weak-cryptographic-standards-tls-1-0-and-1-1-in-azure-devops-services/

Failed to resume in time Crashlog

I am trying to figure out a "Failed to resume in time" problem. In one of our testers devices (which is an iPhone 4S with the latest OS) it happens very frequently, whereas in my own device it doesn't seem to happen at all.
Anyway, I got a few crashlogs. I am unable to trace the root of the cause though. I understand that the issue might be
1.When a process is holding up the main thread for too long.
2.When there is a memory issue.
I don't think the memory is much of an issue since it seems to happen when the user leaves the main menu and comes back. Nothing much is happening in the main menu so it probably is a task that runs too long.
Here is an excerpt from the crash log:
Can somebody help me or guide me on who I can trace the cause of the issue? Is there anyway to turn off the watchdog timer(probably not huh?) Also, what does highlighted thread refer to?
I have already checked my applicationDidBecomeActive & applicationWillEnterForeground to make sure there is nothing going on there.
To my knowledge there are no synchronous calls being made at this point. Does Reachability use synchronous calls to check for internet? How can I check for that?
I am not making any large data transfers upon resume.
I notice that GameCenter automatically logs in or check for log in upon resuming your app. Is there anyway to prevent this? Could this possibly cause a time out issue?
I tried doing a time profile, but I am not able to understand how to use it to analyze. If you can provide a good resource for that, that would be amazing.
Thanks!!!
You're currently in "trying to find the issue mode". You should switch to "try to find out how much of an issue this really is" mode.
So go find another 4S (actually as many as you can) to rule out that it's a device-specific issue. If it happens on all 4S it should be easier to pinpoint. If not, have someone else look over it, discuss possible causes. The peer programming approach often helps when you're stuck in a dead-end situation.
If the issue is only on that one device, you might want to check if it's broken (or "jailbroken") or might simply need a hard reboot (hold power and home for 10+ seconds).
If it only happens on some devices but not all, try to find what they have in common. This could be language/locale, or dictation, practically any kind of setting the user might have changed. If necessary, write a logger that logs as many settings as possible to your (web) server so you can compare settings one-by-one and quickly discard those that aren't in synch.
If only very few devices are affected, you could also ignore the issue and hope that additional crash logs from users will reveal the key to the issue.
Finally, there's always the option to disable suspend on terminate and instead terminate the app when the home button is pressed (as it was pre iOS 4). Unless of course the app has to run in background.

Displaying exception debug information to users

I'm currently working on adding exceptions and exception handling to my OSS application. Exceptions have been the general idea from the start, but I wanted to find a good exception framework and in all honesty, understand C++ exception handling conventions and idioms a bit better before starting to use them. I have a lot of experience with C#/.Net, Python and other languages that use exceptions. I'm no stranger to the idea (but far from a master).
In C# and Python, when an unhandled exception occurs, the user gets a nice stack trace and in general a lot of very useful priceless debugging information. If you're working on an OSS application, having users paste that info into issue reports is... well let's just say I'm finding it difficult to live without that. For this C++ project, I get "The application crashed", or from more informed users, "I did X, Y and Z, and then it crashed". But I want that debugging information too!
I've already (and with great difficulty) made my peace with the fact that I'll never see a cross-platform and cross-compiler way of getting a C++ exception stack trace, but I know I can get the function name and other relevant information.
And now I want that for my unhandled exceptions. I'm using boost::exception, and they have this very nice diagnostic_information thingamajig that can print out the (unmangled) function name, file, line and most importantly, other exception specific information the programmer added to that exception.
Naturally, I'll be handling exceptions inside the code whenever I can, but I'm not that naive to think I won't let a couple slip through (unintentionally, of course).
So what I want to do is wrap my main entry point inside a try block with a catch that creates a special dialog that informs the user that an error has occurred in the application, with more detailed information presented when the user clicks "More" or "Debug info" or whatever. This would contain the string from diagnostic_information. I could then instruct the users to paste this information into issue reports.
But a nagging gut feeling is telling me that wrapping everything in a try block is a really bad idea. Is what I'm about to do stupid? If it is (and even if it's not), what's a better way to achieve what I want?
Putting a try/catch block in main() is okay, it doesn't cause any problems. The program is dead on an unhandled exception anyway. It isn't going to be helpful at all in your quest to get the all-important stack trace though. That info is gonzo when the catch block traps the exception.
Catching a C++ exception won't be very helpful either. The odds that the program dies on a an exception derived from std::exception are pretty slim. Although it could happen. Much more likely in a C/C++ app is death due to hardware exceptions, AccessViolation being numero uno. Trapping those requires the __try and __except keywords in your main() method. Again, very little context is available, you've basically only got an exception code. An AV also tells you which exact memory location caused the exception.
This is not just a cross-platform issue btw, you can't get a good stack trace on any platform. There is no reliable way to walk the stack, there are too many optimizations (like framepointer omission) that make this a perilous journey. It is the C/C++ way: make it as fast as possible, leave no clue what happened when it blows up.
What you need to do is debug these kind of problems the C/C++ way. You need to create a minidump. It is roughly analogous to the "core dump" of old, a snapshot of the process image at the time the exception happens. Back then, you actually got a complete dump of the core. There's been progress, nowadays it is "mini", somewhat necessary because a complete core dump would take close to 2 gigabytes. It actually works pretty well to diagnose the program state.
On Windows, that starts by calling SetUnhandledExceptionFilter(), you provide a callback function pointer to a function that will run when your program dies on an unhandled exception. Any exception, C++ as well as SEH. Your next resource is dbghelp.dll, available in the Debugging Tools for Windows download. It has an entrypoint called MiniDumpWriteDump(), it creates a minidump.
Once you get the file created by MiniDumpWriteDump(), you're pretty golden. You can load the .dmp file in Visual Studio, almost like it's a project. Press F5 and VS grinds away for a while trying to load .pdb files for the DLLs loaded in the process. You'll want to setup the symbol server, that's very important to get good stack traces. If everything works, you'll get a "debug break" at the exact location where the exception was thrown". With a stack trace.
Things you need to do to make this work smoothly:
Use a build server to create the binaries. It needs to push the debugging symbols (.pdb files) to a symbol server so they are readily available when you debug the minidump.
Configure the debugger so it can find the debugging symbols for all modules. You can get the debugging symbols for Windows from Microsoft, the symbols for your code needs to come from the symbol server mentioned above.
Write the code to trap the unhandled exception and create the minidump. I mentioned SetUnhandledExceptionFilter() but the code that creates the minidump should not be in the program that crashed. The odds that it can write the minidump successfully are fairly slim, the state of the program is undetermined. Best thing to do is to run a "guard" process that keeps an eye on a named Mutex. Your exception filter can set the mutex, the guard can create the minidump.
Create a way for the minidump to get transferred from the client's machine to yours. We use Amazon's S3 service for that, terabytes at a reasonable rate.
Wire the minidump handler into your debug database. We use Jira, it has a web-service that allows us to verify the crash bucket against a database of earlier crashes with the same "signature". When it is unique, or doesn't have enough hits, we ask the crash manager code to upload the minidump to Amazon and create the bug database entry.
Well, that's what I did for the company I work for. Worked out very well, it reduced crash bucket frequency from thousands to dozens. Personal message to the creators of the open source ffdshow component: I hate you with a passion. But you're no longer crashing our app anymore! Buggers.
Wrapping all your code in one try/catch block is a-ok. It won't slow down the execution of anything inside it, for example. In fact, all my programs have (code similar to) this framework:
int execute(int pArgc, char *pArgv[])
{
// do stuff
}
int main(int pArgc, char *pArgv[])
{
// maybe setup some debug stuff,
// like splitting cerr to log.txt
try
{
return execute(pArgc, pArgv);
}
catch (const std::exception& e)
{
std::cerr << "Unhandled exception:\n" << e.what() << std::endl;
// or other methods of displaying an error
return EXIT_FAILURE;
}
catch (...)
{
std::cerr << "Unknown exception!" << std::endl;
return EXIT_FAILURE;
}
}
No it's not stupid. It's a very good idea, and it costs virtually nothing at runtime until you hit an unhandled exception, of course.
Be aware that there is already an exception handler wrapping your thread, provided by the OS (and another one by the C-runtime I think). You may need to pass certain exceptions on to these handlers to get correct behavior. In some architectures, accessing mis-aligned data is handled by an exception handler. so you may want to special case EXCEPTION_DATATYPE_MISALIGNMENT and let it pass on to the higher level exception handler.
I include the registers, the app version and build number, the exception type and a stack dump in hex annotated with module names and offsets for hex values that could be addresses to code. Be sure to include the version number and build number/date of your exe.
You can also use VirtualQuery to turn stack values into "ModuleName+Offset" pretty easily. And that, combined with a .MAP file will often tell you exactly where you crashed.
I found that I could train beta testers to send my the text pretty easily, but in the early days what I got was a picture of the error dialog rather than the text. I think that's because a lot of users don't know you can right click on any Edit control to get a menu with "Select All" and "Copy". If I was going to do it again, I would add a button
that copied that text to the clipboard so that it can easily be pasted into an email.
Even better if you want to go to the trouble of haveing a 'send error report' button, but just giving users a way to get the text into their own emails gets you most of the way there, and doesn't raise any red flags about "what information am I sharing with them?"
In fact, boost::diagnostic_information has been designed specifically to be used in a "global" catch(...) block, to display information about exceptions which should not have reached it. However, note that the string returned by boost::diagnostic_information is NOT user-friendly.

Getting rid of the evil delay caused by ShellExecute

This is something that's been bothering me a while and there just has to be a solution to this. Every time I call ShellExecute to open an external file (be it a document, executable or a URL) this causes a very long lockup in my program before ShellExecute spawns the new process and returns. Does anyone know how to solve or work around this?
EDIT: And as the tags might indicate, this is on Win32 using C++.
I don't know what is causing it, but Mark Russinovich (of sysinternal's fame) has a really great blog where he explains how to debug these kinds of things. A good one to look at for you would be The Case of the Delayed Windows Vista File Open Dialogs, where he debugged a similar issue using only process explorer (it turned out to be a problem accessing the domain). You can of course do similar things using a regular windows debugger.
You problem is probably not the same as his, but using these techniques may help you get closer to the source of the problem. I suggest invoking the CreateProcess call and then capturing a few stack traces and seeing where it appears to be hung.
The Case of the Process Startup Delays might be even more relevant for you.
Are you multithreaded?
I've seen issues with opening files with ShellExecute. Not executables, but files associated an application - usually MS Office. Applications that used DDE to open their files did some of broadcast of a message to all threads in all (well, I don't know if it was all...) programs. Since I wasn't pumping messages in worker threads in my application I'd hang the shell (and the opening of the file) for some time. It eventually timed out waiting for me to process the message and the application would launch and open the file.
I recall using PeekMessage in a loop to just remove messages in the queue for that worker thread. I always assumed there was a way to avoid this in another way, maybe create the thread differently as to never be the target of messages?
Update
It must have not just been any thread that was doing this but one servicing a window. Raymond (link 1) knows all (link 2). I bet either CoInitialize (single threaded apartment) or something in MFC created a hidden window for the thread.