Process spawned by a Windows service runs 3 to 4 times more slowly than when spawned from a GUI - C++

I have written a service application in Borland C++. It works fine. In the ServiceStart(TService *Sender, bool &Started) routine, I call mjwinrun to launch a process which picks up and processes macros. This process has no UI, and any errors are logged to a file. It continues to run until the server is restarted, shut down, or the process is terminated using Task Manager. Here is mjwinrun :-
int mjwinrun(AnsiString cmd)
{
    STARTUPINFO mjstupinf;
    PROCESS_INFORMATION mjprcinf;
    // Zero the startup info and set its size, as CreateProcess requires.
    memset(&mjstupinf, 0, sizeof(STARTUPINFO));
    mjstupinf.cb = sizeof(STARTUPINFO);
    // Launch the command line in the service's current directory.
    if (!CreateProcess(NULL, cmd.c_str(), NULL, NULL, TRUE, 0, NULL,
                       GetCurrentDir().c_str(), &mjstupinf, &mjprcinf))
    {
        LogMessage("Could not launch " + cmd);
        return -1;
    }
    // Close the handles we don't need; the process ID is enough to identify the child.
    CloseHandle(mjprcinf.hThread);
    CloseHandle(mjprcinf.hProcess);
    return mjprcinf.dwProcessId;
}
cmd is the command line for launching the macro queue processor. I used a macro that is CPU/Memory intensive and got it to write its timings to a file. Here is what I found :-
1) If the macro processor is launched from the command line within a logged on session, no matter what Windows core it is running under, the macro is completed in 6 seconds.
2) If the macro processor is launched from a service starting up on Vista core or earlier (using mjwinrun above), the macro is completed in 6 seconds.
3) If the macro processor is launched from a service starting up on Windows 7 core or later (using mjwinrun above), the macro is completed in more than 18 seconds.
I have tried all the different flags for CreateProcess and none of them makes a difference. I have tried all the different accounts for the service, and that makes no difference either. I tried setting the various task, I/O, and page priorities, but none of them helps. It's as if the service's spawned processes are somehow throttled, not in I/O terms, but in CPU/memory usage terms. Any ideas what changed from Windows 7 onwards?

I isolated code to reproduce this, and it eventually boiled down to calls to the database engine to look up a field definition (the TTable methods FindField and FieldByName). These took much longer on a table with a lot of fields when run from a service app than from a GUI app.

I devised my own method to store mappings from field names to field definitions, since I always open my databases with a central routine. I used an array of strings indexed by the Tag property on each table (common to all BCB objects), where each string is composed of ;fieldname;fieldnumber; pairs, and then did a .Pos of the field name to get the field number. fieldnumber is zero-padded to a width of 4. This only uses a few hundred KB of RAM for the entire app and all of its databases. Once in place, the service app runs at the same speed as the GUI app.

The only explanation I can think of is that service apps have a fixed heap (I think I read 48 MB somewhere as the default) for themselves and any process they spawn. With lots of fields, the memory overflowed and had to thrash to virtual memory on disk, while the GUI app had no such limit and was able to do the lookup entirely in real memory. However, I may be completely wrong. One thing I have learnt is that FieldByName and FindField are expensive TTable functions to call, and I have now replaced them all with my own mechanism, which works much better and much faster. Here is my lookup routine :-
AnsiString fldsbytag[MXSPRTBLS+100];   // one ";FIELDNAME;nnnn;..." map string per table, indexed by Tag

TField *fldfromtag(TAdsTable *tbl, AnsiString fld)
{
    // Find ";FIELDNAME;" in the table's map string (names are stored uppercased).
    int fi = fldsbytag[tbl->Tag].Pos(";" + fld.UpperCase() + ";"), gi;
    if (fi == 0) return tbl->FindField(fld);          // not in the map - fall back to FindField
    // The zero-padded 4-digit field number follows the name and its trailing ';'.
    gi = StrToIntDef(fldsbytag[tbl->Tag].SubString(fi + fld.Length() + 2, 4), -1);
    if (gi < 0 || gi >= tbl->Fields->Count) return tbl->FindField(fld);
    return tbl->Fields->Fields[gi];
}
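For completeness, here is a minimal sketch (my own, not from the original post) of how the map string for each table might be built in the central open routine. The helper name buildfldmap is hypothetical, and it assumes each table's Tag already holds its unique index into fldsbytag:

// Hypothetical helper: call once after a table is opened. Builds the
// ";FIELDNAME;nnnn;..." string that fldfromtag expects, with field numbers
// zero-padded to a width of 4.
void buildfldmap(TAdsTable *tbl)
{
    AnsiString map = ";";
    for (int i = 0; i < tbl->Fields->Count; i++)
    {
        AnsiString num = IntToStr(i);
        while (num.Length() < 4) num = "0" + num;   // zero-pad to 4 digits
        map += tbl->Fields->Fields[i]->FieldName.UpperCase() + ";" + num + ";";
    }
    fldsbytag[tbl->Tag] = map;
}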

It will be very difficult to give an authoritative answer to this question without a lot more details.
However, a factor to consider is the Windows foreground priority boost described here.
You may want to read Russinovich's book chapter on processes/threads, in particular the material on scheduling. You can find PDFs of the book online (there are two volumes that together make up the whole book). I believe the latest (or next-to-latest) edition covers the changes in Windows 7.
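As a quick experiment (my suggestion, not part of the answer above), you could have the spawned macro processor log the scheduling-related settings it actually ends up with in both scenarios, e.g. the foreground quantum setting and its own priority class. A minimal sketch:

// Hedged sketch: dump scheduling knobs to the macro processor's log so the
// service-spawned and GUI-spawned runs can be compared side by side.
#include <windows.h>
#include <stdio.h>

void logschedulinginfo(FILE *log)
{
    // Win32PrioritySeparation controls the foreground quantum/priority boost
    // (see Russinovich's scheduling chapter).
    DWORD sep = 0, size = sizeof(sep);
    if (RegGetValueA(HKEY_LOCAL_MACHINE,
                     "SYSTEM\\CurrentControlSet\\Control\\PriorityControl",
                     "Win32PrioritySeparation", RRF_RT_REG_DWORD,
                     NULL, &sep, &size) == ERROR_SUCCESS)
        fprintf(log, "Win32PrioritySeparation = 0x%lx\n", sep);

    // Priority class and boost state of the current (spawned) process.
    fprintf(log, "PriorityClass = 0x%lx\n", GetPriorityClass(GetCurrentProcess()));
    BOOL boostDisabled = FALSE;
    if (GetProcessPriorityBoost(GetCurrentProcess(), &boostDisabled))
        fprintf(log, "PriorityBoostDisabled = %d\n", boostDisabled);
}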

Related

Why does my program run faster on first launch than on next launches?

I have been working for 2.5 years on a personal flight sim project in my leisure time, written in C++ and using OpenGL on a Windows 7 PC.
I recently had to move to Windows 10. The hardware is exactly the same. I reinstalled Code::Blocks.
It turns out that on the first launch of my project after the system starts, performance is OK, similar to what I used to see with Windows 7. But the second, third, and all subsequent launches give me lower performance, with significantly less fluidity in the frame rate compared to the first run, detectable by eye. This never happened with Windows 7.
Every time I start my system, the first run is fast and the next ones are slower.
I had a look at the Task Manager while doing some runs. The first run is handled by one of the 4 cores of my CPU (Core i5-6500) at approximately 85%. For the next runs, the load is spread across the 4 cores. During those slower runs on 4 cores, I tried to modify the affinity and direct my program to only one core, without significant improvement in performance. The selected core was working at full load, though.
My C++ code doesn't explicitly use any threading functions at this stage. From my modest programmer's point of view, there is only one main thread, run in main(). In the Task Manager, I can see that some 10 to 14 threads are alive when my program runs. I guess (wrongly?) that they are implicitly created by the use of joysticks, TrackIR, or other communication tasks with the GPU...
Could it come from memory not being correctly freed when my program stops? I thought Windows would free it properly, even if I forgot some 'delete' calls after using 'new'.
Has anyone encountered a similar situation? Any explanations come to mind?
Any suggestions to better understand these facts? Obviously, my ultimate goal is to have a consistent performance level regardless of the number of launches.
Screenshots of the first and second runs, as viewed in Task Manager, were attached here.
Well, I ran into problems when switching clients to Windows 10 at my work too. Here are a few I encountered, all caused by Windows 10 having changed how processes are scheduled, which creates a lot of issues like:
1) Lock-free (blockless) thread synchronization techniques from older versions of Windows no longer work.
A well-placed Sleep() sometimes helps. By the way, similar problems were encountered when switching from Win2k to WinXP.
2) Huge slowdowns and frequent freezes of a few seconds in older single-threaded apps.
Usually, setting the affinity to a single core solves this. You can do it in Task Manager just to check, and if it helps you can do it in code too. Here is an example of how to do it with WinAPI:
Cache size estimation on your system?
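For reference, a minimal sketch (my own) of pinning the current process to a single core with the WinAPI; core 0 is an arbitrary choice:

// Minimal sketch: restrict the current process to a single CPU core.
// Pick whichever core suits your app; bit N of the mask selects logical CPU N.
#include <windows.h>

bool pin_to_single_core(int core)
{
    DWORD_PTR mask = (DWORD_PTR)1 << core;
    return SetProcessAffinityMask(GetCurrentProcess(), mask) != 0;
}

// You can also pin just one thread instead of the whole process:
//   SetThreadAffinityMask(GetCurrentThread(), mask);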
3) Messed-up driver timings causing zombie processes, or even total freezes and/or BSODs.
I deal with USB in my work, and it is sometimes a nightmare on Windows 10. On top of all this, Windows 10 tends to force the wrong drivers onto devices (like graphics cards, custom USB systems, etc.).
4) Apps are automatically frozen or closed if they do not respond to their WndProc in time.
In Windows 10 the timeout is much, much smaller than in older versions. If this is your case, you can try running in compatibility mode (set in the icon's properties on the desktop) for an older Windows (however, that does not help with #1 and #2), or change the app's code to speed up its response. For example, in VCL you can call Application->ProcessMessages() from inside the blocking code to remedy this... Or you can use threads for the heavy lifting... just be careful with rendering and with using WinAPI, as calling some WinAPI functions (anything window/visual related) from outside the main thread causes havoc...
On top of all this, old IDEs (especially for MCUs) don't work properly anymore, and the new ones are usually much worse to work with (or unusable because they lack functionality that was present in older versions), so I stayed faithful to Windows 7 for development purposes.
If none of the above helps, then try to log the time each of your tasks needs... it might show you which part of the code is the problem. I usually do this with a timing graph like this:
Both the x and y axes are time, and each task has its own color and row in the graph. The graph scrolls in time (to the left in my case) and has a changeable time scale. The numbers show the actual and max (or sliding average) values...
This way I can see if some task is taking too much time or even overlapping its next execution; peaks are also nicely visible, and it all runs at runtime without any debug tools, which might change the behavior of the execution.
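The graphing code itself is not shown here, but a minimal sketch of the underlying per-task measurement (QueryPerformanceCounter based; the logtask helper and log format are only illustrative) could look like this:

// Minimal sketch: per-task wall-clock timing with QueryPerformanceCounter.
#include <windows.h>
#include <stdio.h>

double nowms()
{
    static LARGE_INTEGER freq = { 0 };
    if (freq.QuadPart == 0) QueryPerformanceFrequency(&freq);
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return 1000.0 * t.QuadPart / freq.QuadPart;
}

// Wrap each task you care about and append its duration to a log; the log can
// later be rendered as the per-task rows of the timing graph described above.
void logtask(const char *name, void (*task)(void))
{
    double t0 = nowms();
    task();
    double t1 = nowms();
    FILE *f = fopen("timings.log", "a");
    if (f) { fprintf(f, "%s %.3f ms\n", name, t1 - t0); fclose(f); }
}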

COM Surrogate Server Timeout

I have a Win32/MFC application that depends on two separate STA COM DLL servers that I created many years ago using C++/ATL. These are large DLL servers with multiple interfaces and are also successfully used in other contexts and client programs. Several years ago, I had to create 64-bit versions of these 32-bit servers, and my 32-bit MFC app needed to be able to use either the 32-bit or 64-bit version of the DLL COM server (chosen with a checkbox).
Because a 32-bit process can't load a 64-bit COM server DLL in-process, I worked around this by having the MFC app create the 64-bit servers in the system surrogate (DLLHOST.EXE) by replacing
CoCreateInstance(..., CLSCTX_INPROC_SERVER, ...)
with
CoCreateInstance(..., CLSCTX_LOCAL_SERVER | CLSCTX_ACTIVATE_64_BIT_SERVER, ...)
Some updates were required, like adding an interface to copy environment variables into the server process and set the server/surrogate's working directory (the surrogate starts in SYSTEM32), but the other interfaces were all remoteable. This all seems to work perfectly and I can now use the 32-bit and 64-bit servers interchangeably from the 32-bit app by flipping a switch.
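A minimal sketch of what that switch might look like on the client side (the CreateMyServer helper and the use64bit flag are placeholders, not the poster's actual code):

// Hypothetical helper: create the server in-process (32-bit DLL) or in the
// 64-bit DLLHOST.EXE surrogate, depending on a UI checkbox.
#include <windows.h>
#include <objbase.h>

HRESULT CreateMyServer(REFCLSID clsid, REFIID iid, bool use64bit, void **ppv)
{
    DWORD ctx = use64bit
        ? (CLSCTX_LOCAL_SERVER | CLSCTX_ACTIVATE_64_BIT_SERVER)  // out-of-proc surrogate
        : CLSCTX_INPROC_SERVER;                                  // normal 32-bit in-proc load
    return CoCreateInstance(clsid, NULL, ctx, iid, ppv);
}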
There is, however, one problem that I haven't been able to solve: making the surrogate quickly terminate when the client releases the last interface. The surrogate hangs around for 3-5 seconds after all remote interfaces are released by the MFC client -- presumably an optimization, hoping the client will come back. If the MFC app re-launches the server with CoCreateInstance() during that 3-5 seconds, it reconnects to the same "dirty" surrogate. The server code is not serially re-usable (it packages up many thousands of lines of legacy ANSI "C" code with lots of static variables) so reconnecting to the same instance is just not possible.
I worked around this several years ago by having the startup interface return a COM error code indicating the server is waiting to be recycled (better than a crash). However, the servers are launched when the end user presses a toolbar button in the MFC app, so this means the user gets a message like "wait a few seconds and try again". That works, but the bad part is that every fresh launch attempt resets the 3-5 second counter that keeps the surrogate from exiting. And impatient users are complaining. I'll add this all works perfectly in-process, with CoFreeUnusedLibraries() working as expected.
I tried a number of things already -- everything short of coding an ExitProcess() in the server, which seems inappropriate. There seems to be no way to tell the surrogate that the application is complete and should not wait for more connections. The MS documentation claims omitting the RunAs attribute in the AppID might help (I had it set to "Interactive User"), but it didn't. It also mentions REGCLS_SINGLEUSE, but then says "Do not set REGCLS_SINGLEUSE or REGCLS_MULTIPLEUSE when you register a surrogate for DLL servers" and "REGCLS_SINGLEUSE and REGCLS_MULTIPLEUSE should not be used for DLL servers loaded into surrogates", and I don't have control over the surrogate's class factory as far as I know.
It looks like COM+ might provide some control over recycling, as it seems to have a RecycleActivationLimit option that I might be able to set to 0, but I have no idea what it would take to convert this into a COM+ server.
The other possibility is to write a custom surrogate.
If there's no easy answer, I might just resort to greying out the button until the server vanishes -- but since I can't probe the server without extending its lifetime, I guess I could add a shared mutex and wait for it to vanish. Ugh.
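A rough sketch of that mutex idea (my own reading of it, not something from the post): the server creates a named mutex at startup and never closes it, so the client can poll for the name to disappear before re-launching. The mutex name is a placeholder:

// In the 64-bit server (created at startup and intentionally never closed, so
// it lives exactly as long as the surrogate process):
//   HANDLE g_alive = CreateMutex(NULL, FALSE, TEXT("Global\\MyServerSurrogateAlive"));

// In the 32-bit MFC client: returns true once no surrogate holds the mutex,
// i.e. the previous DLLHOST.EXE instance has finally exited.
#include <windows.h>

bool SurrogateGone()
{
    HANDLE h = OpenMutex(SYNCHRONIZE, FALSE, TEXT("Global\\MyServerSurrogateAlive"));
    if (h == NULL)
        return GetLastError() == ERROR_FILE_NOT_FOUND;   // name no longer exists
    CloseHandle(h);                                      // still alive; try again later
    return false;
}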
Is RecycleActivationLimit somehow available to regular COM applications? Any other suggestions are most welcome.

Listen For Process Start and End

I'm new to Windows API programming. I am aware that there are ways to check whether a process is already running (via enumeration). However, I was wondering if there is a way to listen for when a process starts and ends (for example, notepad.exe) and then perform some action when the start or end of that process has been detected. I assume one could run a continuous enumerate-and-check loop every small unit of time, but I was wondering if there is a cleaner solution.
Use WMI, Win32_ProcessStartTrace and Win32_ProcessStopTrace classes. Sample C# code is here.
You'll need to write the equivalent C++ code. Which, erm, isn't quite that compact. It's mostly boilerplate, the survival guide is available here.
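For illustration, here is a condensed sketch of what the C++ equivalent might look like: a semisynchronous WQL event query against Win32_ProcessStartTrace. Error handling is trimmed, the notepad.exe filter is only an example, and these trace classes generally require running elevated:

// Hedged sketch: listen for notepad.exe starts via WMI from C++.
#define _WIN32_DCOM
#include <windows.h>
#include <wbemidl.h>
#include <comdef.h>
#include <iostream>
#pragma comment(lib, "wbemuuid.lib")

int main()
{
    HRESULT hr = CoInitializeEx(0, COINIT_MULTITHREADED);
    if (FAILED(hr)) return 1;
    CoInitializeSecurity(NULL, -1, NULL, NULL, RPC_C_AUTHN_LEVEL_DEFAULT,
                         RPC_C_IMP_LEVEL_IMPERSONATE, NULL, EOAC_NONE, NULL);

    IWbemLocator *locator = NULL;
    hr = CoCreateInstance(CLSID_WbemLocator, 0, CLSCTX_INPROC_SERVER,
                          IID_IWbemLocator, (LPVOID *)&locator);
    if (FAILED(hr)) { CoUninitialize(); return 1; }

    IWbemServices *services = NULL;
    hr = locator->ConnectServer(_bstr_t(L"ROOT\\CIMV2"), NULL, NULL, 0, 0, 0, 0, &services);
    if (FAILED(hr)) { locator->Release(); CoUninitialize(); return 1; }

    CoSetProxyBlanket(services, RPC_C_AUTHN_WINNT, RPC_C_AUTHZ_NONE, NULL,
                      RPC_C_AUTHN_LEVEL_CALL, RPC_C_IMP_LEVEL_IMPERSONATE,
                      NULL, EOAC_NONE);

    // Semisynchronous event query: Next() blocks until a matching event arrives.
    IEnumWbemClassObject *enumerator = NULL;
    hr = services->ExecNotificationQuery(
        _bstr_t("WQL"),
        _bstr_t("SELECT * FROM Win32_ProcessStartTrace WHERE ProcessName = 'notepad.exe'"),
        WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY,
        NULL, &enumerator);

    while (SUCCEEDED(hr))
    {
        IWbemClassObject *obj = NULL;
        ULONG returned = 0;
        enumerator->Next(WBEM_INFINITE, 1, &obj, &returned);
        if (!returned) break;

        VARIANT pid;
        obj->Get(L"ProcessID", 0, &pid, NULL, NULL);
        std::wcout << L"notepad.exe started, PID " << pid.uintVal << std::endl;
        VariantClear(&pid);
        obj->Release();
    }

    enumerator->Release();
    services->Release();
    locator->Release();
    CoUninitialize();
    return 0;
}

The same structure works for Win32_ProcessStopTrace if you want end-of-process notifications.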
If you can run code in kernel, check Detecting Windows NT/2K process execution.
Hans Passant has probably given you the best answer, but... It is slow and fairly heavy-weight to write in C or C++.
On versions of Windows less than or equal to Vista, you can get 95ish% coverage with a Windows WH_CBT hook, which can be set with SetWindowsHookEx.
There are a few problems:
This misses some service starts/stops which you can mitigate by keeping a list of running procs and occasionally scanning the list for changes. You do not have to keep procs in this list that have explorer.exe as a parent/grandparent process. Christian Steiber's proc handle idea is good for managing the removal of procs from the table.
It misses things executed directly by the kernel. This can be mitigated the same way as #1.
There are misbehaved apps that do not follow the hook system rules which can cause your app to miss notifications. Again, this can be mitigated by keeping a process table.
The positives are it is pretty lightweight and easy to write.
For Windows 7 and up, look at SetWinEventHook. I have not written the code to cover Win7 so I have no comments.
Process handles are actually objects that you can "Wait" for, with something like "WaitForMultipleObjects".
While it doesn't send a notification of some sort, you can do this as part of your event loop by using the MsgWaitForMultipleObjects() version of the call to combine it with your message processing.
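For example, a minimal sketch (my own, assuming you already have a process handle with SYNCHRONIZE access, e.g. from CreateProcess or OpenProcess) of combining the wait with message processing:

// Hedged sketch: pump window messages until the watched process exits.
#include <windows.h>

void PumpUntilProcessExits(HANDLE hProcess)
{
    for (;;)
    {
        DWORD wait = MsgWaitForMultipleObjects(1, &hProcess, FALSE,
                                               INFINITE, QS_ALLINPUT);
        if (wait == WAIT_OBJECT_0)
        {
            // The process handle is signaled: the watched process has exited.
            break;
        }
        else if (wait == WAIT_OBJECT_0 + 1)
        {
            // Window messages arrived; keep the UI responsive.
            MSG msg;
            while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
            {
                TranslateMessage(&msg);
                DispatchMessage(&msg);
            }
        }
        else
        {
            break;   // WAIT_FAILED or abandoned wait
        }
    }
}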
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options
You can place a registry key here named after your process, then add a REG_SZ value named 'Debugger' containing your listener application's name to relay the process-start notification.
Unfortunately, there is no comparable zero-overhead approach to receiving process exit notifications that I know of.
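For illustration, a hypothetical sketch of installing that value from code (the notepad.exe target and the listener path are placeholders, and writing under HKLM requires administrative rights). Note that with this mechanism Windows launches the 'Debugger' program instead of the target, with the target's command line appended, so the listener must start the real process itself:

// Hypothetical sketch: register a listener under Image File Execution Options
// for notepad.exe.
#include <windows.h>
#include <string.h>

bool InstallIfeoListener()
{
    const char *key = "SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\"
                      "Image File Execution Options\\notepad.exe";
    const char *listener = "C:\\tools\\mylistener.exe";   // placeholder path

    HKEY hKey;
    if (RegCreateKeyExA(HKEY_LOCAL_MACHINE, key, 0, NULL, 0,
                        KEY_SET_VALUE, NULL, &hKey, NULL) != ERROR_SUCCESS)
        return false;

    LONG rc = RegSetValueExA(hKey, "Debugger", 0, REG_SZ,
                             (const BYTE *)listener, (DWORD)strlen(listener) + 1);
    RegCloseKey(hKey);
    return rc == ERROR_SUCCESS;
}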

Profiling a legacy application

I am using an old version of a Metastorm workflow designer.
We support this while we rewrite it in Microsoft technologies.
After a few changes, the "MAP" (*.epc) has become exceedingly slow to work with and to "PUBLISH".
The publish writes the map and its binaries to the DB which then a service will pick up and execute.
However, the publish "hangs" and never completes: what used to take about 15 minutes now runs in excess of 3 hours without completing.
I can see the CPU is being hammered but memory seems fine.
I ran Process Monitor, but it does not show me much, which leads me to believe that either the process is doing something other than the norm, or the map has grown to a point which is leading it to destruction.
My question: How else can I profile this black box exe?

Running out of file descriptors for mmapped files despite high limits in a multithreaded web app

I have an application that mmaps a large number of files. 3000+ or so. It also uses about 75 worker threads. The application is written in a mix of Java and C++, with the Java server code calling out to C++ via JNI.
It frequently, though not predictably, runs out of file descriptors. I have upped the limits in /etc/security/limits.conf to:
* hard nofile 131072
/proc/sys/fs/file-max is 101752. The system is a Linode VPS running Ubuntu 8.04 LTS with kernel 2.6.35.4.
Opens fail from both the Java and C++ parts of the code after a certain point. Netstat doesn't show a large number of open sockets ("netstat -n | wc -l" is under 500). The number of open files in either lsof or /proc/{pid}/fd is about the expected 2000-5000.
This has had me grasping at straws for a few weeks (not constantly, but in flashes of fear and loathing every time I start getting notifications of things going boom).
There are a couple other loose threads that have me wondering if they offer any insight:
Since the process has about 75 threads, if the mmapped files were somehow taking up one file descriptor per thread, then the numbers would add up. That said, doing a recursive count of the entries in /proc/{pid}/task/*/fd currently lists 215575 fds, so it would seem that it should already be hitting the limits and it isn't, so that seems unlikely.
Apache + Passenger are also running on the same box, and come in second for the largest number of file descriptors, but even with children none of those processes weigh in at over 10k descriptors.
I'm unsure where to go from there. Obviously something's making the app hit its limits, but I'm completely blank for what to check next. Any thoughts?
So, from all I can tell, this appears to have been an issue specific to Ubuntu 8.04. After upgrading to 10.04, a month has gone by without a single instance of this problem. The configuration didn't change, so I'm led to believe that this must have been a kernel bug.
Your setup uses a huge chunk of code that may be guilty of leaking too: the JVM. Maybe you can switch between the Sun and the open-source JVMs as a way to check whether that code is by chance the culprit. There are also different garbage collector strategies available for the JVM; using a different one, or different heap sizes, will cause more or fewer garbage collections (which, in Java, include the closing of descriptors).
I know it's kind of far-fetched, but it seems like you've already followed all the other options ;)