_beginthreadex leaking memory - c++

The code below is my entire test program. Each time I press ENTER, the RAM used by the process increases by 4 K (it keeps increasing without stopping; I am watching it in Task Manager). What is wrong? The same thing happens with _beginthread.
I am trying to write a server, and I want to process each connection with a thread. (Note that this means I can't join the thread, because that would stop the main thread from accepting new connections.)
#include <windows.h>
#include <process.h>   // _beginthreadex / _endthreadex
#include <cstdio>      // getchar

unsigned __stdcall thread_test(void *)
{
    for(int i = 0; i < 10000; i++)
    {
        i += 1;
        i -= 1;
    } // simulating processing
    _endthreadex( 0 );
    return 0;
}
int main()
{
    HANDLE hThread;
    while(1)
    {
        getchar();
        hThread = (HANDLE)_beginthreadex( NULL, 0, thread_test, 0, 0, NULL );
        CloseHandle( hThread );
    }
}
Compiled with Code::Blocks and Visual Studio.
EDIT: I've done some tests, and the memory stops growing once it reaches around 133,000 K (when the program starts, it uses around 800 K); but at that point the program runs about 4-5 times slower than it did in the beginning (the more memory it uses, the slower it runs), so it would not be good for my server to run like that.
EDIT 2: I've moved to Visual Studio 2013 and the problem is gone.
EDIT 3: If I test the code above in Visual Studio 2013, it shows no leak. But if I use _beginthreadex in a small server program, it leaks like before, 4 K per request. Here is the server test code (it does nothing; it only demonstrates the leak) that I use: http://pastebin.com/EDmJXkZU . You can compile it and test it by typing your IP into the address bar of a browser.

Task Manager is not showing the RAM actually used by your program. For a better view, use Resource Monitor (reachable from Task Manager) and watch the private bytes. But all memory monitors show only "virtual memory", which is commonly retained by the runtime library instead of being freed back to Windows. You don't have a real problem.
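If you want to watch a more meaningful number from inside the process itself, PSAPI can report private bytes and working set. A minimal sketch (my own helper, assuming you link against psapi.lib); call it before and after spawning a batch of threads and compare:

#include <windows.h>
#include <psapi.h>   // GetProcessMemoryInfo; link with psapi.lib
#include <cstdio>

// Print the private bytes and working set of the current process.
void PrintMemoryUsage()
{
    PROCESS_MEMORY_COUNTERS_EX pmc = {};
    if (GetProcessMemoryInfo(GetCurrentProcess(),
                             reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&pmc),
                             sizeof(pmc)))
    {
        printf("Private bytes: %zu KB, working set: %zu KB\n",
               pmc.PrivateUsage / 1024, pmc.WorkingSetSize / 1024);
    }
}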

Related

C++ _popen() windows leaks paged pool memory

The main application runs as a Windows service, and that process starts other C++ console processes with all console windows hidden, i.e. the parent process is a Windows service and the child processes are non-console applications.
We observed that the system's paged pool memory increases during calls to _popen() on the customer's system, Windows Server 2016. The application runs clean on our lab system with the same OS.
Using the Windows Performance tool xperf, we captured logs and checked the call stack; a picture is attached for reference.
void CMachine::GetJavaVersion()
{
    m_stJavaVersion.m_strName = " Java version";
    CPUChar strVersion[64] = { 0 };
    BOOL bFound = CheckJREVersion(strVersion, 64);
    BYTE bytColorSt = RED;
    string strRemark;
    FILE *fp = NULL;
    char version[130] = { 0 };
    BOOL bFoundVersion = FALSE;

    fp = _popen("java -version 2>&1", "r");
    while (fp && fgets(version, sizeof version, fp))
    {
        string strTmp = version;
        if (strTmp.find("version") != string::npos)
        {
            bFoundVersion = TRUE;
            break;
        }
    }
    if (fp) _pclose(fp);
    ....
PoolMon trace
Memory:33401164K Avail:30057324K PageFlts: 92362 InRam Krnl:20212K P:776328K
Commit:3228052K Limit:37595468K Peak:4747992K Pool N:182820K P:782568K
System pool information
Tag Type Allocs Frees Diff Bytes Per Alloc
Toke Paged 10546816 ( 390) 10319712 ( 382) 227104 324868080 ( 11392) 1430
CM31 Paged 42886 ( 0) 20849 ( 0) 22037 101154816 ( 0) 4590
SeAt Paged 44678436 (1662) 43769798 (1630) 908638 87253680 ( 3072) 96
QINi Paged 234 ( 0) 1 ( 0) 233 60293216 ( 0) 258769
MmSt Paged 2683066 ( 79) 2670922 ( 83) 12144 27223856 ( 3312) 2241
Eric Lippert writes about benchmark mistakes. I think mistake #1 applies to your case:
Mistake #1: Choosing a bad metric.
Why do you measure "paged pool" to determine a memory leak?
Paged memory is the memory that is swapped out to disk. This happens because the physical RAM is needed for something else. What is the physical RAM needed for? Probably for running the process that you start.
Once the memory is swapped to disk, it may take a while until it is swapped back to RAM. That only happens when some other application tries to access the memory, and that may take minutes, if it happens at all.
I also tend to say that memory isn't leaked during a method call but after a method call. After the method call, all variables should be destroyed and the related resources should be released.
If you are told that the paged pool is the cause, then ask for proof.
On my Windows 10 system, the paged pool limit is 17 GB. This can be shown by Process Explorer in View/System Information with Symbols configured.
If you're running java -version so often that it leaks 17 GB of kernel memory, then something is seriously wrong. Of course there will be a pipe or something to redirect the output from Java to your application so you can read the stream. There will also be other kernel objects like a process, a thread etc.
Even with 1 kB of kernel memory leak for each call, you would need to call that 17 million times to exhaust the paged pool. If that's the case, maybe you should consider caching the result anyway. It should be unlikely that server admins install and uninstall Java 17 million times in a few days.
For monitoring the paged pool, you can try Poolmon with /p /P command line parameters. Poolmon is part of the WDK.
Problems in your code:
Your code has at least 2 problems:
1) If "version" never appears in the output, your code might run in an endless loop. How could that happen? It's unlikely, but if I rename my HelloWorld.exe to java.exe, it could.
2) If "version" does appear in the output, but "ver" happens to land at the end of one fgets buffer and "sion" at the start of the next, you'll never find out it was actually there, and again your code could run into an endless loop.

How could just loading a dll lead to 100 CPU load in my main application?

I have a perfectly working program which connects to a video camera (an IDS uEye camera) and continuously grabs frames from it and displays them.
However, when loading a specific dll before connecting to the camera, the program runs with 100% CPU load. If I load the dll after connecting to the camera, the program runs fine.
int main()
{
    INT nRet = IS_NO_SUCCESS;

    // init camera (open next available camera)
    m_hCam = (HIDS)0;

    // (A) Uncomment this for 100% CPU load:
    // HMODULE handle = LoadLibrary(L"myInnocentDll.dll");

    // This is the call to the 3rd-party camera vendor's library:
    nRet = is_InitCamera(&m_hCam, 0);

    // (B) Uncomment this instead of (A) and the CPU load won't change
    // HMODULE handle = LoadLibrary(L"myInnocentDll.dll");

    if (nRet == IS_SUCCESS)
    {
        /*
         * Please note: I have removed all lines which are not necessary for the exploit.
         * Therefore this is NOT a full example of how to properly initialize an IDS camera!
         */
        is_GetSensorInfo(m_hCam, &m_sInfo);
        GetMaxImageSize(m_hCam, &m_s32ImageWidth, &m_s32ImageHeight);

        m_nColorMode = IS_CM_BGR8_PACKED; // IS_CM_BGRA8_PACKED;
        m_nBitsPerPixel = 24;             // 32;
        nRet |= is_SetColorMode(m_hCam, m_nColorMode);

        // allocate image memory.
        if (is_AllocImageMem(m_hCam, m_s32ImageWidth, m_s32ImageHeight, m_nBitsPerPixel, &m_pcImageMemory, &m_lMemoryId) != IS_SUCCESS)
        {
            return 1;
        }
        else
        {
            is_SetImageMem(m_hCam, m_pcImageMemory, m_lMemoryId);
        }
    }
    else
    {
        return 1;
    }

    std::thread([&]() {
        while (true) {
            is_FreezeVideo(m_hCam, IS_WAIT);
            /*
             * Usually, the image memory would now be grabbed via is_GetImageMem(),
             * but as it is not needed for the exploit, I removed it as well.
             */
        }
    }).detach();

    cv::waitKey(0);
}
Independently of the camera driver actually used, in what way could loading a DLL change its performance so that it occupies 100% of all available CPU cores? When using the Visual Studio Diagnostic Tools, the excess CPU time is attributed to "[External Call] SwitchToThread" and not to myInnocentDll.
Loading just the dll without the camera initialization does not result in 100% CPU load.
My first thought was that some static initializers in myInnocentDll.dll configure some threading behavior, but I found nothing pointing in that direction. What should I look for in the code of myInnocentDll.dll?
After a lot of digging I found the answer, and it is both frustratingly simple and frustrating in itself:
It is Microsoft's poor support of OpenMP. When I disabled OpenMP in my project, the camera driver runs just fine.
The reason seems to be that Microsoft's OpenMP runtime uses busy waiting. There is also the possibility of configuring OMP_WAIT_POLICY manually, but as I was not depending on OpenMP anyway, disabling it was the easiest solution for me.
https://developercommunity.visualstudio.com/content/problem/589564/how-to-control-omp-wait-policy-for-openmp.html
https://support.microsoft.com/en-us/help/2689322/redistributable-package-fix-high-cpu-usage-when-you-run-a-visual-c-201
I still don't understand why the CPU load only went up when using the camera and not when running the rest of my solution, even though the camera library is pre-built and my disabling/enabling of OpenMP compilation cannot have any effect on it. And I also don't understand why they bothered to make a hotfix for VS2010 but have no real fix as of VS2019, which I am using. But the problem is averted.
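If disabling OpenMP is not an option, the wait policy mentioned above can in principle be set through the OMP_WAIT_POLICY environment variable before the runtime starts. Whether a given OpenMP runtime actually honors it is exactly what the first link above discusses, so treat this as a sketch to try, not a guaranteed fix:

#include <cstdlib>

int main()
{
    // Must happen before the OpenMP runtime initializes, i.e. before the
    // first parallel region and before loading any library that uses OpenMP.
    // PASSIVE asks waiting OpenMP threads to sleep instead of busy-waiting.
    _putenv_s("OMP_WAIT_POLICY", "PASSIVE");

    // ... initialize the camera, load DLLs, run the application ...
    return 0;
}

Setting the variable in the launching environment (e.g. a batch file or the service configuration) achieves the same thing without touching the code.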
You can disable the CPU idle state in the IDS camera manager, but then the minimum processor state in the Windows power plan is set to 100%.
I think this is worth mentioning here, even though you have already solved your problem.

multi-threading limit?

I am writing a multi-threaded program in C++ on Linux.
Currently, I keep an array of threads, and every second I check which of them have finished and restart those. Is this bad? I need to keep this program running for a long time. As it is now, I get a code 11 (EAGAIN) from pthread_create after a certain number of restart loops (the 100th loop in the last trial). I figured that by reusing threads and making sure only a small number of them run at any one time, I would not hit any limit. The array I use only has a size of 8 (and of course I am not starting 8 each time, just those that have stopped).
Any ideas?
My code is below:
if ( loop_times == 0 || pthread_kill(threads[t], 0) != 0 )
{
    rc = pthread_create(&threads[t], NULL, thread_stall, (void *)NULL);
    if (rc) {
        printf("ERROR; return code from pthread_create() is %d\n", rc);
        exit(-1);
    }
    thread_count++;
}
The loop_times variable is only there so that I can get into the loop and start the threads the first time. Otherwise I get a SEGFAULT because the threads haven't been started yet.
Also, I have been wanting to see the value of PTHREAD_THREADS_MAX, but I can't print it (even when including limits.h)
If you want to use multiple threads, it is better to go for a thread pool.
Start a set of threads as detached ones, and then through a queue you can send work to each thread so that it can process it and wait for the next input from you, as in the sketch below.
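As a rough illustration only (queue size, job type, and the -1 shutdown sentinel are mine, not from your code), a pool of pthreads fed from a mutex-protected queue could look like this; here I join the workers at shutdown instead of detaching them, so their resources are reclaimed deterministically:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_WORKERS 8
#define QUEUE_SIZE  64

/* Very small work queue: one int per job, -1 means "shut down". */
static int queue[QUEUE_SIZE];
static int q_head = 0, q_tail = 0, q_count = 0;
static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond  = PTHREAD_COND_INITIALIZER;

static void queue_push(int job)
{
    pthread_mutex_lock(&q_mutex);
    while (q_count == QUEUE_SIZE)          /* wait while the queue is full */
        pthread_cond_wait(&q_cond, &q_mutex);
    queue[q_tail] = job;
    q_tail = (q_tail + 1) % QUEUE_SIZE;
    q_count++;
    pthread_cond_broadcast(&q_cond);        /* wake a waiting worker */
    pthread_mutex_unlock(&q_mutex);
}

static int queue_pop(void)
{
    pthread_mutex_lock(&q_mutex);
    while (q_count == 0)                    /* wait while the queue is empty */
        pthread_cond_wait(&q_cond, &q_mutex);
    int job = queue[q_head];
    q_head = (q_head + 1) % QUEUE_SIZE;
    q_count--;
    pthread_cond_broadcast(&q_cond);        /* wake a waiting producer */
    pthread_mutex_unlock(&q_mutex);
    return job;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;)
    {
        int job = queue_pop();
        if (job < 0)                        /* shutdown sentinel */
            break;
        printf("processing job %d\n", job); /* real work goes here */
    }
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, NULL);

    for (int job = 0; job < 100; job++)     /* hand work to the pool */
        queue_push(job);

    for (int i = 0; i < NUM_WORKERS; i++)   /* one sentinel per worker */
        queue_push(-1);
    for (int i = 0; i < NUM_WORKERS; i++)   /* join instead of leaking threads */
        pthread_join(workers[i], NULL);
    return 0;
}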
As it turns out, my problem was that I needed to pthread_join each thread before restarting it. After that, I stopped getting code 11 and stopped having "still reachable" memory when running it through Valgrind.
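For completeness, a sketch of how that fix slots into the loop from the question (the finished-thread test is kept as-is; the only addition is the pthread_join before the slot is reused):

if ( loop_times == 0 || pthread_kill(threads[t], 0) != 0 )
{
    if (loop_times != 0)
        pthread_join(threads[t], NULL);   /* reap the old thread, freeing its resources */

    rc = pthread_create(&threads[t], NULL, thread_stall, (void *)NULL);
    if (rc) {
        printf("ERROR; return code from pthread_create() is %d\n", rc);
        exit(-1);
    }
    thread_count++;
}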

CreateThread failure on a longterm run

I'm writing a program in C++ using the WinAPI to monitor a directory for newly arriving files and send them in a certain order. The files are derived from a live video stream, so a unit consists of 2 files, an audio file and a video file, and units must be sent in sequence, e.g. (1.mp3, 1.avi); (2.mp3, 2.avi)... The architecture is:
1) detect a new file added to the folder, insert file name to the input queue
2) organize files into units, insert units into unit queue
3) send unit by unit
But since I monitor the directory for added files, I need to make sure a file is complete, i.e. ready to send: the notification arrives when the file is created, but it still has to be filled with data and closed. So I pop a file name from the input queue either when the queue holds more than one file (the notification for the next file means the previous one is ready to send) or on a timeout (10 seconds), so any file should be done within 10 seconds.
In general this program runs and works properly. But if the send procedure takes too long, the unit queue grows, and after some number of units are buffered in the unit queue the bug appears.
time[END] = 0;
time[START] = clock();

HANDLE hIOMutex2 = CreateMutex(NULL, FALSE, NULL);
WaitForSingleObject(hIOMutex2, INFINITE);

hTimer = CreateThread(NULL, 0, Timer, time, 0, &ThreadId1);
if (hTimer == NULL)
    printf("Timer Error\n");

ReleaseMutex(hIOMutex2);

ReadDirectoryChangesW(hDir, szBuffer, sizeof(szBuffer) / sizeof(TCHAR), FALSE, FILE_NOTIFY_CHANGE_FILE_NAME, &dwBytes, NULL, NULL);

HANDLE hIOMutex = CreateMutex(NULL, FALSE, NULL);
WaitForSingleObject(hIOMutex, INFINITE);
time[END] = clock();
TerminateThread(hTimer, 0);
ReleaseMutex(hIOMutex);
After around 800 units are buffered in the queue, my program gives me the "Timer Error" message; if I'm right, that means the program can't create the thread. But in this code the program terminates the timer thread exactly when a file is created in the directory, so I'm confused by this bug. Also interesting is that even with this error my program continues to send units as usual, so it doesn't look like an OS problem; it looks like wrong thread creation/termination, at least it seems that way to me.
Also, I'm providing the Timer code below in case it is helpful.
DWORD WINAPI Timer(LPVOID in)
{
    clock_t* time = (clock_t*) in;
    while (TRUE)
    {
        if (((clock() - time[START]) / CLOCKS_PER_SEC >= 10) && (!time[END]) && (!output.empty()))
        {
            Send();
            if (output.empty())
            {
                ExitThread(0);
            }
        }
        else if ((output.empty()) || (time[END]))
        {
            break;
        }
        else
        {
            Sleep(10);
        }
    }
    ExitThread(0);
    return 0;
}
Could anyone here please give me some advice on how to solve this bug? Thanks in advance.
Using TerminateThread is a bad idea in many ways. In your case, it makes your program fail because it doesn't release the memory for the thread's stack. The failure comes when your program has consumed all available virtual memory and CreateThread() cannot reserve enough memory for another thread. Only ever use TerminateThread while exiting a program.
You'll have to do this in a smarter way: either ask the thread to exit nicely by signaling an event, or don't consume such an expensive system resource just for handling a file. A simple timer and one thread can do this too.
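A sketch of the "ask the thread to exit nicely" approach (the event and thread names are made up for the example; the timer bookkeeping from your code would go inside the loop):

#include <windows.h>
#include <cstdio>

// Manual-reset event the timer thread polls instead of being terminated.
HANDLE g_hStopTimer = NULL;

DWORD WINAPI TimerThread(LPVOID)
{
    // Wake every 10 ms; leave as soon as the stop event is signaled.
    while (WaitForSingleObject(g_hStopTimer, 10) == WAIT_TIMEOUT)
    {
        // ... timeout bookkeeping / Send() logic goes here ...
    }
    return 0;   // a normal return releases the thread's stack
}

int main()
{
    g_hStopTimer = CreateEvent(NULL, TRUE, FALSE, NULL);   // manual reset, non-signaled
    HANDLE hTimer = CreateThread(NULL, 0, TimerThread, NULL, 0, NULL);

    // ... directory monitoring ...

    SetEvent(g_hStopTimer);                 // instead of TerminateThread
    WaitForSingleObject(hTimer, INFINITE);  // let the thread finish cleanly
    CloseHandle(hTimer);
    CloseHandle(g_hStopTimer);
    return 0;
}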

EnterCriticalSection Deadlock

I'm having what appears to be a deadlock situation with a multi-threaded logging application.
A little background:
My main application has 4-6 threads running. The main thread is responsible for monitoring the health of various things I'm doing, updating GUIs, etc. Then I have a transmit thread and a receive thread; they talk to physical hardware. I sometimes need to debug the data that the transmit and receive threads are seeing, i.e. print to a console without interrupting them, due to the time-critical nature of the data. The data, by the way, is on a USB bus.
Because of the threading nature of the application, I want to create a debug console that I can send messages to from my other threads. The debug console runs as a low-priority thread and implements a ring buffer, so that printing to the console quickly stores the message in the ring buffer and sets an event. The console's thread waits (WaitForSingleObject) on events from the inbound messages. When an event is detected, the console thread updates a GUI display with the message. Simple, eh? The printing calls and the console thread use a critical section to control access.
NOTE: I can adjust the ring buffer size if I see that I am dropping messages (at least that's the idea).
In a test application, the console works very well if I call its Print method slowly via mouse clicks. I have a button I can press to send messages to the console, and it works. However, if I put any sort of load on it (many calls to the Print method), everything deadlocks. When I trace the deadlock, my IDE's debugger stops at EnterCriticalSection and sits there.
NOTE: If I remove the Lock/UnLock calls and just use Enter/LeaveCriticalSection (see the code), it sometimes works, but I still end up in a deadlock. To rule out deadlocks caused by stack push/pops, I now call Enter/LeaveCriticalSection directly, but this did not solve my issue. What's going on here?
Here is one Print statement that allows me to pass a simple int to the display console.
void TGDB::Print(int I)
{
    //Lock();
    EnterCriticalSection(&CS);
    if( !SuppressOutput )
    {
        //swprintf( MsgRec->Msg, L"%d", I);
        sprintf( MsgRec->Msg, "%d", I);
        MBuffer->PutMsg(MsgRec, 1);
    }
    SetEvent( m_hEvent );
    LeaveCriticalSection(&CS);
    //UnLock();
}

// My Lock/UnLock methods
void TGDB::Lock(void)
{
    EnterCriticalSection(&CS);
}

bool TGDB::TryLock(void)
{
    return( TryEnterCriticalSection(&CS) );
}

void TGDB::UnLock(void)
{
    LeaveCriticalSection(&CS);
}
// This is how I implemented the Console's thread routines
DWORD WINAPI TGDB::ConsoleThread(PVOID pA)
{
    DWORD rVal;
    TGDB *g = (TGDB *)pA;
    return( g->ProcessMessages() );
}

DWORD TGDB::ProcessMessages()
{
    DWORD rVal;
    bool brVal;
    int MsgCnt;

    do
    {
        rVal = WaitForMultipleObjects(1, &m_hEvent, true, iWaitTime);
        switch(rVal)
        {
        case WAIT_OBJECT_0:
            EnterCriticalSection(&CS);
            //Lock();
            if( KeepRunning )
            {
                Info->Caption = "Rx";
                Info->Refresh();
                MsgCnt = MBuffer->GetMsgCount();
                for(int i = 0; i < MsgCnt; i++)
                {
                    MBuffer->GetMsg( MsgRec, 1);
                    Log->Lines->Add(MsgRec->Msg);
                }
            }
            brVal = KeepRunning;
            ResetEvent( m_hEvent );
            LeaveCriticalSection(&CS);
            //UnLock();
            break;

        case WAIT_TIMEOUT:
            EnterCriticalSection(&CS);
            //Lock();
            Info->Caption = "Idle";
            Info->Refresh();
            brVal = KeepRunning;
            ResetEvent( m_hEvent );
            LeaveCriticalSection(&CS);
            //UnLock();
            break;

        case WAIT_FAILED:
            EnterCriticalSection(&CS);
            //Lock();
            brVal = false;
            Info->Caption = "ERROR";
            Info->Refresh();
            aLine.sprintf("Console error: [%d]", GetLastError() );
            Log->Lines->Add(aLine);
            aLine = "";
            LeaveCriticalSection(&CS);
            //UnLock();
            break;
        }
    } while( brVal );

    return( rVal );
}
MyTest1 and MyTest2 are just two test functions that I call in response to a button press. MyTest1 never causes a problem no matter how fast I click the button. MyTest2 deadlocks nearly every time.
// No deadlock
void TTest::MyTest1()
{
    if(gdb)
    {
        // elsewhere: gdb = new TGDB;
        gdb->Print(++I);
    }
}

// Causes a deadlock
void TTest::MyTest2()
{
    if(gdb)
    {
        // elsewhere: gdb = new TGDB;
        gdb->Print(++I);
        gdb->Print(++I);
        gdb->Print(++I);
        gdb->Print(++I);
        gdb->Print(++I);
        gdb->Print(++I);
        gdb->Print(++I);
        gdb->Print(++I);
    }
}
UPDATE:
I found a bug in my ring buffer implementation. Under heavy load, when the buffer wrapped, I didn't detect a full buffer properly, so calls into the buffer did not return correctly. I'm pretty sure that issue is now resolved, and once I fixed it, performance got much better. However, if I decrease iWaitTime, my deadlock (or freeze-up) returns.
After further tests with a much heavier load, it appears my deadlock is not gone. Under a very heavy load I still deadlock, or at least my app freezes up, though nowhere near as often as before I fixed the ring buffer problem. If I double the number of Print calls in MyTest2, I can easily lock up every time.
Also, my updated code is reflected above. I now make sure my SetEvent and ResetEvent calls are inside the critical section.
With those options closed off, I would ask questions about this "Info" object. Is it a window? Which window is it parented to, and which thread was it created on?
If Info, or its parent window, was created on the other thread, then the following situation might occur:
The Console Thread is inside a critical section, processing a message.
The Main thread calls Print() and blocks on a critical section waiting for the Console Thread to release the lock.
The Console thread calls a function on Info (Caption), which results in the system sending a message (WM_SETTEXT) to the window. SendMessage blocks because the target thread is not in a message alertable state (isn't blocked on a call to GetMessage/WaitMessage/MsgWaitForMultipleObjects).
Now you have a deadlock.
This kind of #$(%^ can happen whenever you mix blocking routines with anything that interacts with windows. The only appropriate blocking function to use on a GUI thread is MsgWaitForMultipleObjects; otherwise SendMessage calls to windows hosted on the thread can easily deadlock.
Avoiding this involves two possible approaches:
1) Never do any GUI interaction in worker threads. Only use PostMessage to dispatch non-blocking UI update commands to the UI thread, OR
2) Use kernel event objects + MsgWaitForMultipleObjects (on the GUI thread) to ensure that even when you are blocking on a resource, you are still dispatching messages (see the sketch below).
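A rough outline of the second approach (the event handle and the message handling are placeholders, not taken from the code above):

#include <windows.h>

// Wait for hEvent on a GUI thread without starving SendMessage calls
// from other threads: keep pumping window messages while we wait.
void WaitOnGuiThread(HANDLE hEvent)
{
    for (;;)
    {
        DWORD r = MsgWaitForMultipleObjects(1, &hEvent, FALSE,
                                            INFINITE, QS_ALLINPUT);
        if (r == WAIT_OBJECT_0)
            break;                       // our event fired
        if (r == WAIT_FAILED)
            break;                       // give up rather than spin

        // r == WAIT_OBJECT_0 + 1: window message traffic is pending.
        MSG msg;
        while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
        {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
    }
}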
Without knowing where it is deadlocking, this code is hard to figure out. Two comments though:
Given that this is C++, you should use an RAII/auto object to perform the lock and unlock, just in case it ever becomes non-catastrophic for Log to throw an exception (a minimal guard is sketched below).
You are resetting the event in response to WAIT_TIMEOUT. This leaves a small window of opportunity for a second Print() call to set the event while the worker thread has returned from WaitForMultipleObjects but before it has entered the critical section, which will result in the event being reset while there is actually data pending.
But you do need to debug it and reveal where it "deadlocks". If one thread IS stuck on EnterCriticalSection, then we can find out why. If neither thread is, then the incomplete printing is just the result of an event getting lost.
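A minimal RAII guard of the kind meant by the first comment could look like this (CsLock is my own name, not part of the question's code):

#include <windows.h>

// Scoped lock for a CRITICAL_SECTION: released automatically on scope exit,
// including when an exception unwinds the stack.
class CsLock
{
public:
    explicit CsLock(CRITICAL_SECTION& cs) : m_cs(cs) { EnterCriticalSection(&m_cs); }
    ~CsLock() { LeaveCriticalSection(&m_cs); }
    CsLock(const CsLock&) = delete;
    CsLock& operator=(const CsLock&) = delete;
private:
    CRITICAL_SECTION& m_cs;
};

// Usage inside Print():
//   CsLock lock(CS);
//   ... work ...
//   // LeaveCriticalSection happens in the destructor, even on exceptions.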
I would strongly recommend a lockfree implementation.
Not only will this avoid potential deadlock, but debug instrumentation is one place where you absolutely do not want to take a lock. The impact of formatting debug messages on timing of a multi-threaded application is bad enough... having locks synchronize your parallel code just because you instrumented it makes debugging futile.
What I suggest is an SList-based design (The Win32 API provides an SList implementation, but you can build a thread-safe template easily enough using InterlockedCompareExchange and InterlockedExchange). Each thread will have a pool of buffers. Each buffer will track the thread it came from, after processing the buffer, the log manager will post the buffer back to the source thread's SList for reuse. Threads wishing to write a message will post a buffer to the logger thread. This also prevents any thread from starving other threads of buffers. An event to wake the logger thread when a buffer is placed into the queue completes the design.
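A rough sketch of the SList part of that design (the buffer layout and names are mine; note that on x64 SLIST_ENTRY objects must be aligned to MEMORY_ALLOCATION_ALIGNMENT, so heap-allocated entries should come from _aligned_malloc or a similarly aligned pool):

#include <windows.h>

// One log message, pushable onto a lock-free SList.
struct LogEntry
{
    SLIST_ENTRY link;        // first member; carries the SLIST alignment requirement
    char        text[256];
};

SLIST_HEADER g_logQueue;
HANDLE       g_hLogEvent;    // wakes the logger thread

void InitLogQueue()
{
    InitializeSListHead(&g_logQueue);
    g_hLogEvent = CreateEvent(NULL, FALSE, FALSE, NULL);   // auto-reset
}

// Producer side: any thread can post a buffer without taking a lock.
void PostLogEntry(LogEntry* e)
{
    InterlockedPushEntrySList(&g_logQueue, &e->link);
    SetEvent(g_hLogEvent);
}

// Consumer side (logger thread): grab the whole pending list at once.
// Entries come back in LIFO order and would need reversing if display
// order matters.
void DrainLogQueue()
{
    for (PSLIST_ENTRY p = InterlockedFlushSList(&g_logQueue); p != NULL; )
    {
        LogEntry* e = CONTAINING_RECORD(p, LogEntry, link);
        p = p->Next;
        // ... format / display e->text, then push e back onto the
        //     owning thread's free SList for reuse ...
    }
}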