Profiling a legacy application

I am using an old version of the Metastorm workflow designer.
We support this while we rewrite it in Microsoft technologies.
After a few changes, the "MAP" (*.epc) has become exceedingly slow to work with and to "PUBLISH".
The publish writes the map and its binaries to the DB, which a service then picks up and executes.
However, the publish now "hangs" and never completes: what used to finish in 15 minutes now runs in excess of 3 hours without completing.
I can see the CPU is being hammered, but memory seems fine.
I ran Process Monitor, but it does not show me much, which leads me to believe the process is doing something other than the norm, or the map has grown to a point that is leading it to destruction.
My question: How else can I profile this black box exe?

Why does my program run faster on first launch than on subsequent launches?

I have been working for 2.5 years on a personal flight sim project in my leisure time, written in C++ and using OpenGL on a Windows 7 PC.
I recently had to move to Windows 10. The hardware is exactly the same. I reinstalled Code::Blocks.
It turns out that on the first launch of my project after system start, performance is OK, similar to what I used to see with Windows 7. But the second, third, and all subsequent launches give me lower performance, with significantly less fluidity in frame rate compared to the first run, detectable by eye. This never happened with Windows 7.
Every time I start my system, the first run is fast and the next ones are slower.
I had a look at Task Manager while doing some runs. The first run is handled by one of the 4 cores of my CPU (Core i5-6500) at approximately 85%. For the next runs, the load is spread across the 4 cores. During those slower runs on 4 cores, I tried to modify the affinity and direct my program to only one core, without significant improvement in performance. The selected core was working at full load, though.
My C++ code doesn't explicitly use any threading at this stage. From my modest programmer's point of view, there is only one main thread running in main(). In Task Manager, I can see that some 10 to 14 threads are alive when my program runs. I guess (wrongly?) that they are implicitly created by the use of joysticks, TrackIR, or other communication tasks with the GPU...
Could it come from memory not being correctly freed when my program stops? I thought Windows would free it properly, even if I forgot some 'delete' after using 'new'.
Has anyone encountered a similar situation? Any explanations come to mind?
Any suggestions to better understand these facts? Obviously, my ultimate goal is to have a consistent performance level whatever the number of launches.
[Task Manager screenshots of the first and second runs were attached here.]
Well, I got problems when switching to Win10 for clients at my work too. Here are a few I encountered, all because Windows 10 has changed up the scheduling of processes, creating a lot of issues like:
#1 lock-free (blockless) thread synchronization techniques from older Windows versions no longer working
A well-placed Sleep() sometimes helps. Btw, similar problems were encountered when switching from W2K to WXP.
#2 huge slowdowns and frequent freezes of a few seconds in older single-threaded apps
Usually, setting affinity to a single core solves this. You can do this in Task Manager just to check, and if it helps then you can do it in code too. Here is an example of how to do it with WinAPI:
Cache size estimation on your system?
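A minimal sketch of the WinAPI call (the helper name is mine; SetThreadAffinityMask works the same way for a single thread):

#include <windows.h>

// Pin the current process to one logical core, mirroring what
// "Set affinity" in Task Manager does.
bool PinToCore(unsigned core)
{
    DWORD_PTR mask = (DWORD_PTR)1 << core;  // one bit per logical CPU
    return SetProcessAffinityMask(GetCurrentProcess(), mask) != 0;
}

// e.g. call PinToCore(0); early in main()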
#3 messed-up driver timings causing zombie processes, or even a total freeze and/or BSOD
I deal with USB in my work and it's a nightmare sometimes on Win10. On top of all this, Win10 tends to enforce the wrong drivers on devices (like gfx cards, custom USB systems, etc...).
#4 apps being auto-frozen/closed if they do not respond to their WndProc in time
In Windows 10 the timeout is much, much smaller than in older versions. If this is your case, you can try running in compatibility mode (set in the icon's properties on the desktop) for an older Windows (however, that does not work for #1 and #2), or change the app's code to speed up its response. For example, in the VCL you can call Application->ProcessMessages() from inside blocking code to remedy this (see the sketch below)... Or you can use threads for the heavy lifting... just be careful with rendering and using WinAPI, as calling some WinAPI functions (any window/visual related stuff) from outside the main thread causes havoc...
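A minimal sketch of that remedy (VCL/C++Builder; DoHeavyChunk and jobCount are hypothetical stand-ins for your blocking work):

for (int i = 0; i < jobCount; i++)
{
    DoHeavyChunk(i);                 // hypothetical unit of blocking work
    Application->ProcessMessages();  // pump the message queue so Windows
                                     // does not mark the app unresponsive
}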
On top of all this, old IDEs (especially for MCUs) don't work properly anymore, and the new ones are usually much worse to work with (or unusable because of a lack of functionality that was present in older versions), so I stayed faithful to Windows 7 for development purposes.
If none of the above helps, then try to log the times some of your tasks need... it might show you which part of the code is the problem. I usually do this using a timing graph like this:
Both the x and y axes are time, and each task has its own color and row in the graph. The graph scrolls in time (to the left side in my case) and has a changeable time scale. The numbers show the actual and max (or sliding average) values...
This way I can see if some task is taking too much time or even overlapping its next execution; peaks are also nicely visible, and all of it runs at runtime without any debug tools, which might change the behavior of execution.
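A minimal sketch of such per-task timing in plain C++ (the names are illustrative; the real version would feed the scrolling graph instead of a log file):

#include <chrono>
#include <cstdio>

// Logs how long a scope took; drop one of these at the top of each task.
struct TaskTimer
{
    const char *name;
    std::chrono::steady_clock::time_point start;
    explicit TaskTimer(const char *n)
        : name(n), start(std::chrono::steady_clock::now()) {}
    ~TaskTimer()
    {
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start).count();
        if (FILE *f = std::fopen("timings.log", "a"))
        {
            std::fprintf(f, "%s %lld us\n", name, us);
            std::fclose(f);
        }
    }
};

void renderFrame()
{
    TaskTimer t("renderFrame");  // duration is written on scope exit
    // ... actual work ...
}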

CPU Usage gradually increases in dotnet core webservice

I have a .NET Core web service which seems to slowly increase its CPU usage,
meaning on the first day it won't go past 10%, the second day it can go up to 20%, and so on.
Using the top command in Linux, all my web services seem to appear there occasionally (probably when a request is made) and afterwards disappear.
This specific process, after running for a while, just stays there, constantly consuming CPU even when no request has been made.
The API is still working fine; it seems like there are some threads that just keep hanging and consuming CPU. Last time I checked, I had 5 threads that consumed 3-4% CPU and didn't die for some reason.
My guess is that in some specific scenario a thread just stays alive consuming CPU.
The app runs on an Ubuntu machine. My first step was to create a dump file with ProcDump so I could analyze those threads and maybe find where they are hanging.
ProcDump generates a huge 21 GB file; trying to analyze it with lldb throws an out-of-memory exception. I even tried transferring it to a Windows machine to debug with WinDbg; no help there, as it couldn't open the file.
As there is no specific exception or anything, I can't really share any piece of code, as I have no idea where the issue is... just kind of hoping for some suggestions that might help me get to a solution or at least understand where the problem is.
Thanks a lot for reading, cheers
You could try using something like JetBrains' dotMemory; they also have a fairly high-level but helpful guide: https://www.jetbrains.com/help/dotmemory/How_to_Find_a_Memory_Leak.html. It is also worth checking your startup file and double-checking that the services you've registered are used in the correct way, i.e. not added as scoped when they should be transient or even a singleton, etc.
So I've been at it for a while.
Eventually I found out that my problem was with HttpClient.
Probably some bad mix of a static class and creating new instances of HttpClient caused the issue I've explained above.
I solved it by utilizing HttpClientFactory, as explained here:
https://learn.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.1
Lesson learned :)
A little late, but ProcDump for Linux just added .NET Core 3 support that generates much more manageably sized core dumps. It automatically detects whether the target process is .NET Core and does the right thing (i.e., no need to specify switches).

Process spawned by Windows service runs 3 to 4 times more slowly than spawned by GUI

I have written a service application in Borland C++. It works fine. In the ServiceStart(TService *Sender, bool &Started) routine, I call mjwinrun to launch a process which picks up and processes macros. This process has no UI, and any errors are logged to a file. It continues to run until the server is restarted, shut down, or the process is terminated using Task Manager. Here is mjwinrun :-
int mjwinrun(AnsiString cmd)
{
    STARTUPINFO mjstupinf;
    PROCESS_INFORMATION mjprcinf;

    // CreateProcess requires a zeroed STARTUPINFO with cb set to its size.
    memset(&mjstupinf, 0, sizeof(STARTUPINFO));
    mjstupinf.cb = sizeof(STARTUPINFO);

    // Launch cmd in the current directory, inheriting handles, no special flags.
    if (!CreateProcess(NULL, cmd.c_str(), NULL, NULL, TRUE, 0, NULL,
                       GetCurrentDir().c_str(), &mjstupinf, &mjprcinf))
    {
        LogMessage("Could not launch " + cmd);
        return -1;
    }

    // We don't keep the handles; closing them lets the kernel objects go
    // away when the child exits (the child keeps running regardless).
    CloseHandle(mjprcinf.hThread);
    CloseHandle(mjprcinf.hProcess);
    return mjprcinf.dwProcessId;
}
cmd is the command line for launching the macro queue processor. I used a macro that is CPU/memory intensive and got it to write its timings to a file. Here is what I found :-
1) If the macro processor is launched from the command line within a logged-on session, no matter what Windows core it is running under, the macro completes in 6 seconds.
2) If the macro processor is launched from a service starting up on Vista core or earlier (using mjwinrun above), the macro completes in 6 seconds.
3) If the macro processor is launched from a service starting up on Windows 7 core or later (using mjwinrun above), the macro completes in more than 18 seconds.
I have tried all the different flags for CreateProcess and none of them makes a difference. I have tried all the different accounts for the service and that makes no difference. I tried setting all of the various priorities for tasks, I/O and page, but they make no difference. It's as if the service's spawned processes are somehow throttled, not in I/O terms, but in CPU/memory usage terms. Any ideas what changed in Windows 7 onwards?
I isolated code to reproduce this, and it eventually boiled down to calls to the database engine to look up a field definition (the TTable methods FindField and FieldByName). These took much longer on a table with a lot of fields when run in a service app instead of a GUI app.

I devised my own method to store mappings from field names to field definitions, since I always open my databases with a central routine. I used an array of strings indexed by the Tag property on each table (common to all BCB objects), where each string is composed of ;fieldname;fieldnumber; pairs, and then did a .Pos of the field name to get the field number. fieldnumber is zero-padded to a width of 4. This only uses a few hundred KB of RAM for the entire app and all of its databases. Once in place, the service app runs at the same speed as the GUI app.

The only thing I can think of that may explain this is that service apps have a fixed heap (I think I read 48 MB somewhere as the default) for themselves and any process they spawn. With lots of fields, the memory overflowed and had to thrash to VM on the disk. The GUI app had no such limit and was able to do the lookup entirely in real memory. However, I may be completely wrong.

One thing I have learnt is that FieldByName and FindField are expensive TTable functions to call, and I have now supplanted them all with my own mechanism, which seems to work much better and much faster. Here is my lookup routine :-
AnsiString fldsbytag[MXSPRTBLS+100];

TField *fldfromtag(TAdsTable *tbl, AnsiString fld)
{
    // Find ";FIELDNAME;" in the table's mapping string (Pos is 1-based,
    // 0 means not found; fall back to the slow lookup in that case).
    int fi = fldsbytag[tbl->Tag].Pos(";" + fld.UpperCase() + ";"), gi;
    if (fi == 0) return tbl->FindField(fld);

    // The 4-digit zero-padded field number follows the name and its
    // trailing semicolon.
    gi = StrToIntDef(fldsbytag[tbl->Tag].SubString(fi + fld.Length() + 2, 4), -1);
    if (gi < 0 || gi >= tbl->Fields->Count) return tbl->FindField(fld);
    return tbl->Fields->Fields[gi];
}
It will be very difficult to give an authoritative answer to this question without a lot more details.
However, a factor to consider is the Windows foreground priority boost described here.
You may want to read Russinovich's book chapter on processes/threads, in particular the material on scheduling. You can find PDFs of the book online (there are two volumes that together make up the whole book). I believe the latest (or next-to-latest) edition covers the changes in Windows 7.

ColdFusion scheduler threads eating CPU

I've got CF10 running on a dev box, Windows 7, 64 bit.
Periodically, every minute or so, the CPU usage for CF10 will spike up to 100% for about 20 seconds and come back down. It's pretty regular.
I've found it difficult to diagnose this issue. I've seen talk of client-variable purges, logging, monitoring, and all manner of things, but I've turned all of these off to no avail.
With VisualVM, I've managed to track the issue down to the 'scheduler' threads. I have 5 of these in a waiting state. Periodically each will run, bumping up the CPU dramatically.
Taking a thread dump, it seems that all of these threads are calling java.io.WinNTFileSystem.getBooleanAttributes, something I've seen mentioned a few times as potentially problematic.
UPDATE: Recently I've been playing with onSessionEnd in another app, and discovered that the scheduler-x threads appear to be internal to ColdFusion; my onSessionEnd tasks always seem to run in one of these threads.
Looking in the temp folder, I can see that a lot of Ehcache folders have been created, which I think are to do with query caching. The apps I have running make use of this fairly extensively. I thought clearing the temp folder out might improve performance, but it has had no effect.
It's worth noting that if I start the CF service without actually calling any of my apps, the problem does not occur. That might suggest the issue is with the apps themselves; however, they do not cause any issues in production, only on this box.
There are no scheduled tasks set up either.
Below is an example of one of the threads causing high CPU. I'd appreciate any help in diagnosing what this thread is doing and why, as well as how to potentially stop it from using so many resources.
"scheduler-2" - Thread t#84
java.lang.Thread.State: RUNNABLE
at java.io.WinNTFileSystem.getBooleanAttributes(Native Method)
at java.io.File.isDirectory(File.java:849)
at coldfusion.watch.Watcher.accept(Watcher.java:352)
at java.io.File.listFiles(File.java:1252)
at coldfusion.watch.Watcher.getFiles(Watcher.java:386)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.checkWatchedDirectories(Watcher.java:166)
at coldfusion.watch.Watcher.run(Watcher.java:216)
at coldfusion.scheduling.ThreadPool.run(ThreadPool.java:211)
at coldfusion.scheduling.WorkerThread.run(WorkerThread.java:71)
My environment:
Win 7 64-bit
CF10 Update 12
JDK 1.8.0_11
The issue occurs on multiple versions of the JVM; this particular version is currently used to make monitoring available.
My java settings:
Min heap size: 512mb
Max heap size: 1024mb
-server -XX:MaxPermSize=512m -XX:+UseParallelGC -Xbatch -Dcoldfusion.home={application.home} -Dcoldfusion.rootDir={application.home} -Dcoldfusion.libPath={application.home}/lib -Dorg.apache.coyote.USE_CUSTOM_STATUS_MSG_IN_HEADER=true -Dcoldfusion.jsafe.defaultalgo=FIPS186Random -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8701 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
I'd be lying if I said I understood what all of these settings do!
Sorry if you're one of those people who believes all CF developers should be Java app-stack experts. I am not.
Any help, much appreciated. ;)
Using FusionReactor 6, I was able to solve this for us today. We were using this.javaSettings to hot-load Java class files. The watchInterval from this.javaSettings drives the DirectoryWatcher at the specified interval. In our case, I had lowered it to one second.
How I solved it: I set a breakpoint in FusionReactor and could see that it was constantly stuck scanning the directory above the one I had specified in this.javaSettings. This directory has enough files and subfolders that it looks like one DirectoryWatcher was unable to finish before the next one was created. Had ColdFusion just stuck to the subfolder I specified in this.javaSettings, it would not have been a problem.
Example:
This.javaSettings = {
loadPaths = ["\externals\lib\"]
, loadColdFusionClassPath = true
, reloadOnChange = true
, watchInterval = 1
};
In the above case, lib has just 5 files. However, "externals" is loaded with stuff. In the breakpoint, it was typically looking at stuff in "externals."
Do you have scheduled tasks running that use the CFFILE tag? They tend to be resource hogs. Spinning these into their own threads may help with the CPU spike.
Another thought: looking at the JVM settings,
•Min heap size: 512mb
•Max heap size: 1024mb
These establish the minimum and maximum memory available to the Java virtual machine.
-server -XX:MaxPermSize=512m
This is the amount of memory dedicated to the Java permanent generation.
You've got half of your JVM-allocated memory dedicated to the permanent generation. Try bumping the maximum heap size up to 2048mb and restarting the ColdFusion service, as shown below. It could go higher depending on whether or not you're running a 64-bit operating system.
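For example, the relevant settings might become (these values are a starting point, not a measured recommendation):
•Min heap size: 512mb
•Max heap size: 2048mb
-server -XX:MaxPermSize=512m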

ADsOpenObject() returns -2147024882 (0x8007000E) -> OUT_OF_MEMORY

I have a C++ DLL that is used for authentication and gets loaded by a Windows service for every login.
In that DLL I use the Windows ADSI function ADsOpenObject() to get a user object from Active Directory.
HRESULT hr = ADsOpenObject(L"LDAP://rootDSE",
L"username",
L"password",
m_dwADSFlags,
IID_IDirectorySearch,
(void**)&m_DSSearch);
Generally this has worked for years. But currently I get the error code
-2147024882 (0x8007000E)
which is OUT_OF_MEMORY. When I restart the service that is using my DLL, it runs fine for weeks, but then the errors start occurring again.
Now I can't find what is out of memory. Task Manager looks fine and there is plenty of free memory.
What can I do to fix that?
which is OUT_OF_MEMORY.
It is E_OUTOFMEMORY, a COM error code. The description isn't very helpful; this error code tends to be returned for any "out of resources" error by Microsoft code, not just memory. It could be an internal limit being reached, or a WinAPI call that fails.
And it is not necessarily limited to the software directly involved. A misbehaving device driver that leaks kernel pool memory can be the indirect source of the mishap, for example.
You'll be lucky if you can find something in the Application event log; look both on the machine that reports the error and on the domain server. Task Manager might give a clue: add the Handles, GDI Objects, USER Objects, Commit size, Paged pool and NP Pool columns. It's pretty hard to give specific advice beyond this. It is no doubt a leak; it quacks loudly like one when you have to restart periodically to recover. Good luck hunting it down.
Are you calling Release on m_DSSearch? Also, if you are searching, you need to call CloseSearchHandle or AbandonSearch. If you are not doing either of those, you could be slowly leaking memory.
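For illustration, a sketch of the cleanup being described (the filter and attribute list are placeholders, and ADS_SECURE_AUTHENTICATION stands in for your m_dwADSFlags):

#include <windows.h>
#include <activeds.h>  // link with ActiveDS.lib and ADSIid.lib

void SearchAndCleanUp()
{
    IDirectorySearch *pSearch = NULL;
    HRESULT hr = ADsOpenObject(L"LDAP://rootDSE", L"username", L"password",
                               ADS_SECURE_AUTHENTICATION,
                               IID_IDirectorySearch, (void**)&pSearch);
    if (FAILED(hr)) return;

    ADS_SEARCH_HANDLE hSearch = NULL;
    LPWSTR attrs[] = { (LPWSTR)L"cn" };  // placeholder attribute
    hr = pSearch->ExecuteSearch((LPWSTR)L"(objectClass=user)", attrs, 1, &hSearch);
    if (SUCCEEDED(hr))
    {
        // ... iterate with GetNextRow / GetColumn ...
        pSearch->CloseSearchHandle(hSearch);  // free the search handle
    }
    pSearch->Release();  // free the COM interface itself
}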
Your process could have fragmented the heap to the point where ADsOpenObject can't find a contiguous piece of memory that is large enough.
You can use VMMap to profile your memory usage:
http://technet.microsoft.com/en-us/sysinternals/dd535533
You could try enabling low fragmentation heap:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366750(v=vs.85).aspx
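A sketch of what that opt-in looks like (on Vista and later the LFH is typically enabled by default, so this mainly matters for older systems or heaps created with HeapCreate):

#include <windows.h>

void EnableLowFragmentationHeap()
{
    // 2 == enable the LFH, per the HeapCompatibilityInformation class
    ULONG lfh = 2;
    HeapSetInformation(GetProcessHeap(), HeapCompatibilityInformation,
                       &lfh, sizeof(lfh));
}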
If it is only starting to appear now and it wasn't there earlier, I would suppose one of two possibilities: it has to do with the time (more specifically the year; a millennium bug of some kind). Another option would be a problem with the 32/64-bit architecture.
Now mind that I don't program C++. But I do know a bit about MS errors (I have worked on an MS helpdesk...).