How to debug stuck Akka stream? - akka

I am trying to debug an akka stream which in one situation, is pulling new data OK, but in another situation, is getting "stuck" and not pulling new data. This is all fully reproducible, but the second situation is quite complicated and requires several minutes of setup to get going. The thing is, this Akka stream is actually created mostly by akka-http and akka-actor code, so I don't actually know what it looks like under the hood, and it's not obvious from reading the code that I've read so far.
So what I would like to do is:
See what the overall materialised stream graph looks like, or at least see all the construction in one place, rather than scattered over various source files. Is there a way to log out the materialised graph?
See what is causing the relevant pull to happen in the first situation. The thing is, there are other pulls happening which may or may not be relevant, so just putting a breakpoint on a pull-related method might lead to confusing, irrelevant hits.
Figure out why that same pull is not happening in the second situation.
I already have a rough idea of what the problem probably is - my code is not asynchronous enough - but I need to understand what is going on precisely, so that I can write a regression test that reproduces the problem quickly, without the need for all this complicated setup that I have in the second situation.
I have enabled the following debug options and enabled DEBUG level logging for everything, but it's still not logging enough information, as far as I can see.
akka {
http {
host-connection-pool {
idle-timeout = infinite
}
}
io {
tcp {
trace-logging = on
windows-connection-abort-workaround-enabled = auto
}
}
loglevel = "DEBUG"
}

Related

CPU Usage gradually increases in dotnet core webservice

I have a .net Core web service which seems to slowly increase its cpu usage.
meaning at the first day it won't go past 10%, the second day it can go up to 20% and so on.
Using the TOP command in linux, all my webservices seems to sometime be shown there (probably when a request is made) and afterward disappear.
This specific process after running for a while just stays there constantly consuming cpu even when no request has been made.
the API still working fine, it seems like there are some threads that just keeps hanging and consuming cpu. last time I checked I had 5 threads that consumed 3-4% cpu and didn't die for some reason.
My guess is that in some specific scenario a thread just stays alive consuming cpu.
The app runs on ubuntu machine, my first step was trying to create a dump file with ProcDump so I can analyze those threads and maybe find where they are hanging.
ProcDump generates a huge 21gb file, which trying to analyze with lldb throws out of memory exception. even tried transferring it to a windows machine to debug with windbg , no help there as it couldn't open the file.
As there is no specific exception or anything I can't really share any piece of code as I have no idea where the issue is... just kind a hoping for some suggestion that might help me get to a solution or at least understand where the problem is.
Thanks a lot for reading, cheers
You could try using something like jetBrains’ DotMemory, they also have a fairly high level but helpful guide https://www.jetbrains.com/help/dotmemory/How_to_Find_a_Memory_Leak.html it also worth checking your startup file and double checking the services you’ve registered are used in the correct way ie not added as scoped when they should be transient or even a singleton etc
so iv'e been at it for a while.
Eventually found out that my problem was with HttpClient
Probably some bad mix of static class and creating new instances of HttpClient that causes the issue Iv'e explained above.
Solved it by utilizing HttpClientFactory as explained here -
https://learn.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.1
Lesson learned :)
A little late but Procdump for Linux just added .NET Core 3 support that generates much more managable sized core dumps. It automatically detects if the target process is .NET Core and does the right thing (i.e., no need to specify switches).

Synchronous single file download - is it the right approach in a GUI Qt application?

I'm developing an updater for my application in Qt, primarily to get to know the framework (I realize there are multiple ready-made solutions available, that's not relevant here). It is a basic GUI application using a QMainWindow subclass for its main window and an MyAppUpdater class to perform the actual program logic.
The update information (version, changelog, files to be downloaded) is stored on my server as an XML file. The first thing the updater should do after it sets up the UI is query that server, get the XML file, parse it and display info to the user. Here's where I have a problem though; coming from a procedural/C background, I'd initiate a synchronous download, set a timeout of maybe 3 seconds, then see what happens - if I manage to download the file correctly, I'll parse it and carry on, otherwise display an error.
However, seeing how inconvenient something like that is to implement in Qt, I've come to believe that its network classes are designed in a different way, with a different approach in mind.
I was thinking about initiating an asynchronous download in, say, InitVersionInfoDownload, and then connecting QNetworkReply's finished signal to a slot called VersionInfoDownloadComplete, or something along these lines. I'd also need a timer somewhere to implement timeout checks - if the slot is not invoked after say 3 seconds, the update should be aborted. However, this approach seems overly complicated and in general inadequate to the situation; I cannot proceed without retrieving this file from the server, or indeed do anything while waiting for it to be downloaded, so an asynchronous approach seems inappropriate in general.
Am I mistaken about that, or is there a better way?
TL;DR: It's the wrong approach in any GUI application.
how inconvenient something like that is to implement in Qt
It's not meant to be convenient, since whenever I see a shipping product that behaves that way, I have an urge to have a stern talk with the developers. Blocking the GUI is a usability nightmare. You never want to code that way.
coming from a procedural/C background, I'd initiate a synchronous download, set a timeout of maybe 3 seconds, then see what happens
If you write any sort of machine or interface control code in C, you probably don't want it to be synchronous either. You'd set up a state machine and process everything asynchronously. When coding embedded C applications, state machines make hard things downright trivial. There are several solutions out there, QP/C would be a first class example.
was thinking about initiating an asynchronous download in, say, InitVersionInfoDownload, and then connecting QNetworkReply's finished signal to a slot called VersionInfoDownloadComplete, or something along these lines. I'd also need a timer somewhere to implement timeout checks - if the slot is not invoked after say 3 seconds, the update should be aborted. However, this approach seems overly complicated
It is trivial. You can't discuss such things without showing your code: perhaps you've implemented it in some horribly verbose manner. When done correctly, it's supposed to look lean and sweet. For some inspiration, see this answer.
I cannot proceed without retrieving this file from the server, or indeed do anything while waiting for it to be downloaded
That's patently false. Your user might wish to cancel the update and exit your application, or resize its window, or minimize/maximize it, or check the existing version, or the OS might require a window repaint, or ...
Remember: Your user and the environment are in control. An application unresponsive by design is not only horrible user experience, but also makes your code harder to comprehend and test. Pseudo-synchronous spaghetti gets out of hand real quick. With async design, it's trivial to use signal spy or other products to introspect what the application is doing, where it's stuck, etc.

OpenSSL decryption failed or bad record mac boost::asio

I'm writing a transparent intercepting HTTPS capable proxy using boost::asio + openSSL. I have a default server context where I specify that the server is a TLSv1.2 server, when a client connects, I extract the host from the hello and use SSL_set_SSL_CTX to set the context (which either already exists or I've just created it after spoofing the upstream cert) and initiate the server (downstream) read/write volley as well as the upstream.
This was working before I started storing and sharing contexts. On each new incoming connection, I was creating a new client socket and context, loading ca-bundle as verify file, then creating a new server context, getting the spoofed certificate. It was functioning, but I started developing issues where EC_KEY objects were being double freed and such. I learned from another question of mine that I was going about this the wrong way and began refactoring to recycle and share CTX objects. To be specific, I'm using a single client CTX shared across the board that loads, at program startup, the CA-Bundle for verification.
However, since this refactor, I'm getting this on both the client and the server:
decryption failed or bad record mac
..mixed with a bajillion "short read"s. If I try to force everything TLSv1.2, I get
block cipher pad is wrong
Those errors are given to me after a read/write has failed and I call async_shutdown on either upstream or downstream sockets, which in the callback, error is set (so the shutdown failed).
I've scoured the interwebs finding jira posts from places like apache httpd and nginx where this error was fixed in different ways (resizing read buffers to be larger, openSSL patches, forcing SSLv3, so on and so forth).
I thought there might be an issue with multithreading (my io-service uses a thread pool) but I can see in the code that boost do_init sets locking mechanics for openSSL and all of my IO are wrapped into a single strand.
I'm at a total loss and am wondering if anyone can shed light on what might be happening. I realize I've posted no code, that's because I've got hundreds and hundreds of lines of it and don't want to turn people off with a huge code dump. I realize however this is a rather complicated program and thus a complicated issue so please ask and I'll provide whatever I can.
Edit
I guess I should mention for completeness that I'm getting these errors on both openssl 1.0.2 and 1.0.2a, Win 8.1 x64 and I'm intercepting and routing the http/https traffic through my proxy with with WinDivert.
Edit 2
Reduced entire program to 1 thread, same effect. Created new client CTX for each client connection, same issue. Tried disabling AES-NI, issue persists. Tried different computer, same effect. Recompiled openssl from source (was using precompiled binaries), issue persists. Tried setting additional OP_ workaround flags described in current docs related to downgrade detection, padding bugs, so on and so forth, issue persist. I think I'll just start randomly mashing the keyboard and compile button soon.
I was going to just delete this question, but I decided to answer it in light of the fact that nowhere on the net (that I could find) actually pointed to a correct solution to this problem. I've read every single report about this error that one could find and every single one of those reports, the people "solved" or "reduced" this error in a different way. Every single one of them, a different solution. This is what helped make this issue so difficult to reason out, because everyone everywhere has a different underlying causal explanation.
It's complicated, ready? This error will present itself if you cancel/abort a pending async SSL operation. Mind->boom(). It'll be even more confusing if you do what the docs say and use async_shutdown to do so, because even the call back to async_shutdown will fail (error code is set) and your error message will randomly be something stupid like "decryption failed or bad record mac" or "block cipher pad is wrong" or "SSLv3 alert!" so on and so forth. When seeing errors like this, ignore the errors and analyze the control flow of your IO ops, somewhere you're either prematurely ending them or getting them out of order.
In my case, the premature end was (sort of) intentional, since during this stupid heavy refactor I decided to change things outside the scope of the problem, like my HTTPHeader parser, which I bugged out and ended up cause it to fail nearly 100% and thus aborting the connections. :) The error strings were masking the real cause by telling me encryption failed for some reason or another. Dumb mistake I know, but I take comfort in being the first one (apparently) to recognize it. :)
Open a powershell and type this
(Invoke-WebRequest -Uri status.dev.azure.com).StatusDescription
https://devblogs.microsoft.com/devops/deprecating-weak-cryptographic-standards-tls-1-0-and-1-1-in-azure-devops-services/

Failed to resume in time Crashlog

I am trying to figure out a "Failed to resume in time" problem. In one of our testers devices (which is an iPhone 4S with the latest OS) it happens very frequently, whereas in my own device it doesn't seem to happen at all.
Anyway, I got a few crashlogs. I am unable to trace the root of the cause though. I understand that the issue might be
1.When a process is holding up the main thread for too long.
2.When there is a memory issue.
I don't think the memory is much of an issue since it seems to happen when the user leaves the main menu and comes back. Nothing much is happening in the main menu so it probably is a task that runs too long.
Here is an excerpt from the crash log:
Can somebody help me or guide me on who I can trace the cause of the issue? Is there anyway to turn off the watchdog timer(probably not huh?) Also, what does highlighted thread refer to?
I have already checked my applicationDidBecomeActive & applicationWillEnterForeground to make sure there is nothing going on there.
To my knowledge there are no synchronous calls being made at this point. Does Reachability use synchronous calls to check for internet? How can I check for that?
I am not making any large data transfers upon resume.
I notice that GameCenter automatically logs in or check for log in upon resuming your app. Is there anyway to prevent this? Could this possibly cause a time out issue?
I tried doing a time profile, but I am not able to understand how to use it to analyze. If you can provide a good resource for that, that would be amazing.
Thanks!!!
You're currently in "trying to find the issue mode". You should switch to "try to find out how much of an issue this really is" mode.
So go find another 4S (actually as many as you can) to rule out that it's a device-specific issue. If it happens on all 4S it should be easier to pinpoint. If not, have someone else look over it, discuss possible causes. The peer programming approach often helps when you're stuck in a dead-end situation.
If the issue is only on that one device, you might want to check if it's broken (or "jailbroken") or might simply need a hard reboot (hold power and home for 10+ seconds).
If it only happens on some devices but not all, try to find what they have in common. This could be language/locale, or dictation, practically any kind of setting the user might have changed. If necessary, write a logger that logs as many settings as possible to your (web) server so you can compare settings one-by-one and quickly discard those that aren't in synch.
If only very few devices are affected, you could also ignore the issue and hope that additional crash logs from users will reveal the key to the issue.
Finally, there's always the option to disable suspend on terminate and instead terminate the app when the home button is pressed (as it was pre iOS 4). Unless of course the app has to run in background.

File corruption detection and error handling

I'm a newbie C++ developer and I'm working on an application which needs to write out a log file every so often, and we've noticed that the log file has been corrupted a few times when running the app. The main scenarios seems to be when the program is shutting down, or crashes, but I'm concerned that this isn't the only time that something may go wrong, as the application was born out of a fairly "quick and dirty" project.
It's not critical to have to the most absolute up-to-date data saved, so one idea that someone mentioned was to alternatively write to two log files, and then if the program crashes at least one will still have proper integrity. But this doesn't smell right to me as I haven't really seen any other application use this method.
Are there any "best practises" or standard "patterns" or frameworks to deal with this problem?
At the moment I'm thinking of doing something like this -
Write data to a temp file
Check the data was written correctly with a hash
Rename the original file, and put the temp file in place.
Delete the original
Then if anything fails I can just roll back by just deleting the temp, and the original be untouched.
You must find the reason why the file gets corrupted. If the app crashes unexpectedly, it can't corrupt the file. The only thing that can happen is that the file is truncated (i.e. the last log messages are missing). But the app can't really jump around in the file and modify something elsewhere (unless you call seek in the logging code which would surprise me).
My guess is that the app is multi threaded and the logging code is being called from several threads which can easily lead to data corrupted before the data is written to the log.
You probably forgot to call fsync() every so often, or the data comes in from different threads without proper synchronization among them. Hard to tell without more information (platform, form of corruption you see).
A workaround would be to use logfile rollover, ie. starting a new file every so often.
I really think that you (and others) are wasting your time when you start adding complexity to log files. The whole point of a log is that it should be simple to use and implement, and should work most of the time. To that end, just write the log to an unbuffered stream (l;ike cerr in a C++ program) and live with any, very occasional in my experience, snafus.
OTOH, if you really need an audit trail of everything your app does, for legal reasons, then you should be using some form of transactional storage such as a SQL database.
Not sure if your app is multi-threaded -- if so, consider using Active Object Pattern (PDF) to put a queue in front of the log and make all writes within a single thread. That thread can commit the log in the background. All logs writes will be asynchronous, and in order, but not necessarily written immediately.
The active object can also batch writes.