Can run in gdb, but segmentation fault when run directly

My program gets a segmentation fault when I run it normally. However, it works just fine when I run it under gdb. Moreover, the rate of segmentation faults increases when I increase the sleep time in the philo function. I am using Ubuntu 12.04. Any help or pointers would be appreciated. Here is my code:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <time.h>
#include <semaphore.h>
#include <errno.h>

#define STACKSIZE 10000
#define NUMPROCS 5
#define ROUNDS 10

int ph[NUMPROCS];
// cs[i] is the chopstick between philosopher i and i+1
sem_t cs[NUMPROCS], dead;

int philo() {
    int i = 0;
    int cpid = getpid();
    int phno;
    for (i = 0; i < NUMPROCS; i++)
        if (ph[i] == cpid) phno = i;
    for (i = 0; i < ROUNDS; i++) {
        // Add your entry protocol here
        if (sem_wait(&dead) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_wait(&cs[phno]) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_wait(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0) {
            perror(NULL);
            return 1;
        }
        // Start of critical section -- simulation of slow n++
        int sleeptime = 20000 + rand()%50000;
        printf("philosopher %d is eating by chopsticks %d and %d\n", phno, phno, (phno-1+NUMPROCS)%NUMPROCS);
        usleep(sleeptime);
        // End of critical section
        // Add your exit protocol here
        if (sem_post(&dead) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_post(&cs[phno]) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_post(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0) {
            perror(NULL);
            return 1;
        }
    }
    return 0;
}

int main(int argc, char **argv) {
    int i;
    void *stack[NUMPROCS];
    srand(time(NULL));
    // initialize semaphores
    for (i = 0; i < NUMPROCS; i++) {
        if (sem_init(&cs[i], 1, 1) != 0) {
            perror(NULL);
            return 1;
        }
    }
    if (sem_init(&dead, 1, 4) != 0) {
        perror(NULL);
        return 1;
    }
    for (i = 0; i < NUMPROCS; i++) {
        stack[i] = malloc(STACKSIZE);
        if (stack[i] == NULL) {
            printf("Error allocating memory\n");
            exit(1);
        }
        // create a child that shares the data segment
        ph[i] = clone(philo, stack[i]+STACKSIZE-1, CLONE_VM|SIGCHLD, NULL);
        if (ph[i] < 0) {
            perror(NULL);
            return 1;
        }
    }
    for (i = 0; i < NUMPROCS; i++) wait(NULL);
    for (i = 0; i < NUMPROCS; i++) free(stack[i]);
    return 0;
}

A typical Heisenbug: if you look at it, it disappears. In my experience, getting a segv only outside gdb or vice versa is a sign of using uninitialized memory or depending on actual pointer addresses. Normally valgrind is ruthlessly accurate at detecting those. Unfortunately (my) valgrind cannot handle your clone outside the pthread context.
Visual inspection suggests it is not a memory problem. Only the stacks are allocated on the heap and their use looks OK. Except that you manipulate them through a void * pointer and then add something to it, which is not allowed in standard C (it is a GNU extension). Proper would be to use a char *, but the GNU extension does what you want.
Subtracting one from the top address of the stack is probably not necessary and might cause alignment errors on simple implementations of clone, but again I don't think that is the problem, as clone will most likely align the stack top again. And admittedly the manual page of clone is not very clear about the exact location of the address: "topmost address of the memory space".
Just waiting for a state change of a child and assuming it died is a bit sloppy, and taking away its stack at that point might lead to segmentation faults, but again I don't think that is the problem, because you are probably not frantically sending signals to your philosophers.
If I run your application the philosophers can finish their dinner undisturbed both inside and outside gdb, so the following is a guess. Let's call the parent process that clones the philosophers "the table". Once a philosopher is cloned, the table stores the returned pid in ph, i.e. it assigns that number to a chair. The first thing a philosopher does is look for his chair. If he doesn't find it, he is left with an uninitialized phno, which is then used to index his semaphores. Now this may very well lead to segmentation faults.
The implementation assumes that control returns to the table before the philosophers start running. I can't find such a guarantee in the manual page, and I would actually expect it not to hold. Also, the clone interface offers the possibility to place process ids in memory shared between the child and the parent, suggesting this is a recognized problem (see the parameters pid and ctid). If those are used, the pid will be written before either the table or the just-cloned philosopher gets control.
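For illustration, here is a sketch (untested, my adaptation of the question's clone call) using those parameters, so the kernel publishes the child's pid before either side runs; it also drops the -1 from the stack top, as discussed above:

/* Sketch: let the kernel fill ph[i] before anyone runs.
 * pid_t is an int on Linux, hence the reuse of ph[i] for both stores. */
ph[i] = clone(philo, (char *)stack[i] + STACKSIZE,
              CLONE_VM | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID | SIGCHLD,
              NULL,               /* arg passed to philo() */
              (pid_t *)&ph[i],    /* ptid: written to parent's memory */
              NULL,               /* tls: unused here */
              (pid_t *)&ph[i]);   /* ctid: written before the child runs */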
It is quite possible that this error explains the difference between running inside and outside gdb, because gdb is well aware of the processes spawned under its supervision and may treat them differently than the operating system does.
Alternatively you could assign a semaphore to the table, so that nobody sits down at the table until the table says so, obviously after it has assigned all chairs. This would make a much better use of the semaphore dead.
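A sketch of that gate (the naming is mine):

sem_t seated;                          /* declare next to cs[] and dead */
sem_init(&seated, 1, 0);               /* in main(): gate starts closed */

/* first statement of philo(): */
sem_wait(&seated);                     /* block until the table says so */

/* in main(), after the clone loop has filled all of ph[]: */
for (i = 0; i < NUMPROCS; i++)
    sem_post(&seated);                 /* open the gate for everyone */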
BTW, you are of course fully aware that the setup of your solution does allow for the situation where all philosophers end up each holding one fork (eh, chopstick) and starve to death waiting for the other. Luckily the chances of that happening are very slim.

ph[i] = clone(philo, stack[i]+STACKSIZE-1, CLONE_VM|SIGCHLD, NULL) ;
This creates a thread of execution, which glibc knows nothing about. As such, glibc does not create any thread-specific internal structures that it needs for e.g. dynamic symbol resolution.
With such setup, calling into any glibc function from your philo function invokes undefined behavior, and you sometimes crash (because the dynamic loader will use main thread's private data to perform symbol resolution, and because the loader assumes that each thread has its own private area, but you've violated this assumption by creating clones which share the single private area "behind glibc's back").
If you look at a core dump, there is a high chance that the actual crash happens in ld.so, which would confirm my guess.
Don't ever use clone directly (unless you know what you are doing). Use pthread_create instead.
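A sketch of the same structure with pthreads (my adaptation, not the answer author's code); passing the chair number as the thread argument also removes the pid lookup entirely, and with threads of one process sem_init's pshared argument becomes 0:

#include <pthread.h>
#include <stdint.h>

void *philo_thread(void *arg) {
    int phno = (int)(intptr_t)arg;  /* chair number handed in directly */
    /* ... entry protocol, eating, exit protocol as before ... */
    return NULL;
}

/* in main(), replacing the stack allocation and the clone loop: */
pthread_t tids[NUMPROCS];
for (i = 0; i < NUMPROCS; i++)
    if (pthread_create(&tids[i], NULL, philo_thread, (void *)(intptr_t)i) != 0) {
        perror("pthread_create");
        return 1;
    }
for (i = 0; i < NUMPROCS; i++)
    pthread_join(tids[i], NULL);    /* no manual stack management needed */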
Here is what I see in the core that I just got (which is exactly the problem I described):
Program terminated with signal 4, Illegal instruction.
#0 _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
239 vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
(gdb) bt
#0 _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
#1 0x00007fb694e1dc45 in _dl_fixup (l=<optimized out>, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:127
#2 0x00007fb694e0dee5 in _dl_runtime_resolve () at ../sysdeps/x86_64/dl-trampoline.S:42
#3 0x00000000004009ec in philo ()
#4 0x00007fb69486669d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

32-bit malloc() returns NULL when opening many threads?

I have a sample C++ program as below:
#include <windows.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    void * pointerArr[20000];
    int i = 0, j;
    for (i = 0; i < 20000; i++) {
        void * pointer = malloc(131125);
        if (pointer == NULL) {
            printf("i = %d, out of memory!\n", i);
            getchar();
            break;
        }
        pointerArr[i] = pointer;
    }
    for (j = 0; j < i; j++) {
        free(pointerArr[j]);
    }
    getchar();
    return 0;
}
When I run it with Visual Studio 32-bit Debug, I get the following result:
The program can use nearly 2 GB of memory before running out of memory.
This is normal behavior.
However, when I add code to start a thread inside the for loop, as below:
#include <windows.h>
#include <stdio.h>

DWORD WINAPI thread_func(VOID* pInArgs)
{
    Sleep(100000);
    return 0;
}

int main(int argc, char* argv[])
{
    void * pointerArr[20000];
    int i = 0, j;
    for (i = 0; i < 20000; i++) {
        CreateThread(NULL, 0, thread_func, NULL, 0, NULL);
        void * pointer = malloc(131125);
        if (pointer == NULL) {
            printf("i = %d, out of memory!\n", i);
            getchar();
            break;
        }
        pointerArr[i] = pointer;
    }
    for (j = 0; j < i; j++) {
        free(pointerArr[j]);
    }
    getchar();
    return 0;
}
The result is as below:
The memory in use is still just around 200 MB, but malloc returns NULL.
Could anyone help explain why the program cannot use memory up to 2 GB before running out of memory?
Does it mean that creating many threads like the above causes a memory leak?
In my real application, this error occurs when I create about 800 threads; the RAM in use at the time of "out of memory" is around 300 MB.
As noted in a comment by #macroland, the main thing happening here is that each thread is consuming 1 MiB for its stack (see MSDN CreateThread and Thread Stack Size). You say malloc returns NULL once the total you have directly allocated reaches 200 MB. Since you are allocating 131125 bytes at a time, that is 200 MB / 131125 B = 1525 threads. Their cumulative stack space will be around 1.5 GB. Adding the 200 MB of malloc memory is 1.7 GB, and miscellaneous overhead likely accounts for the rest.
So, why does Task Manager not show this? Because the full 1 MiB of thread stack space is not actually allocated (also called committed), rather it is reserved. See VirtualAlloc and the MEM_RESERVE flag. The address space has been reserved for expansion up to 1 MiB, but initially only 64 KiB are allocated, and Task Manager only counts the latter. But reserved memory will not be unilaterally repurposed by malloc until the reservation is lifted, so once it runs out of available address space, it has to return NULL.
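The difference is easy to demonstrate; a small sketch (Win32, error handling omitted):

// Reserve 1 MiB of address space: commit counters barely move, yet the
// range is unavailable to malloc until it is released.
char *p = (char *)VirtualAlloc(NULL, 1 << 20, MEM_RESERVE, PAGE_NOACCESS);
// Commit only the first 64 KiB so that it can actually be touched:
VirtualAlloc(p, 1 << 16, MEM_COMMIT, PAGE_READWRITE);
p[0] = 42;                      // fine: committed
// p[1 << 16] = 42;             // would fault: reserved, but not committed
VirtualFree(p, 0, MEM_RELEASE); // release the whole reservation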
What tool can show this? I don't know of anything off the shelf (even Process Explorer does not seem to show a count of reserved memory). What I have done in the past is write my own little routine that uses VirtualQuery to enumerate the entire address space, including reserved ranges. I recommend you do the same; it's not much code to write, and very handy when coding for 32-bit Windows because the 2 GiB address space gets cramped very easily (DLLs are an obvious reason, but the default malloc also will leave unexpected reservations behind in response to certain allocation patterns even if you free everything).
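A minimal sketch of such a routine (my own version, not the answer's original code):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    MEMORY_BASIC_INFORMATION mbi;
    SIZE_T reserved = 0, committed = 0;
    char *addr = NULL;
    // Walk the whole address space region by region; VirtualQuery reports
    // the state of the region containing addr and that region's size.
    while (VirtualQuery(addr, &mbi, sizeof(mbi)) != 0) {
        if (mbi.State == MEM_RESERVE) reserved  += mbi.RegionSize;
        if (mbi.State == MEM_COMMIT)  committed += mbi.RegionSize;
        addr += mbi.RegionSize;
    }
    printf("reserved: %lu KiB, committed: %lu KiB\n",
           (unsigned long)(reserved >> 10), (unsigned long)(committed >> 10));
    return 0;
}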
In any case, if you want to create thousands of threads in a 32-bit Windows process, be sure to pass a non-zero value as the dwStackSize parameter to CreateThread, and also pass STACK_SIZE_PARAM_IS_A_RESERVATION as dwCreationFlags. The minimum is 64 KiB, which will be plenty if you avoid recursive algorithms in the threads.
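For example, a sketch based on the question's thread_func:

// Reserve only 64 KiB of address space per thread instead of the 1 MiB
// default; STACK_SIZE_PARAM_IS_A_RESERVATION makes dwStackSize the
// reservation size rather than the initial commit size.
HANDLE h = CreateThread(NULL, 64 * 1024, thread_func, NULL,
                        STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);
if (h != NULL)
    CloseHandle(h); // the handle itself is not needed afterwards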
Addendum: In a comment, #iinspectable cautions against using thousands of threads, citing Raymond Chen's 2005 blog post Does Windows have a limit of 2000 threads per process?. I agree that doing so is questionable for a variety of reasons; it is not my intent to endorse the practice, rather I'm just explaining one necessary element.

Recursive threading with C++ gives "Resource temporarily unavailable"

So I'm trying to create a program that implements a function that generates a random number n and, based on n, creates n threads. The main thread is responsible for printing the minimum and maximum of the leaves. The depth of the hierarchy, including the main thread, is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>

using namespace std;

// a structure to keep the needed information of each thread
struct ThreadInfo
{
    long randomN;
    int level;
    bool run;
    int maxOfVals;
    double minOfVals;
};

// The start address (function) of the threads
void ChildWork(void* a) {
    ThreadInfo* info = (ThreadInfo*)a;
    // Generate random value n
    srand(time(NULL));
    double n = rand()%6+1;
    // initialize the thread info with n value
    info->randomN = n;
    info->maxOfVals = n;
    info->minOfVals = n;
    // the depth of recursion should not be more than 3
    if(info->level > 3)
    {
        info->run = false;
    }
    // Create n threads and run them
    ThreadInfo* childInfo = new ThreadInfo[(int)n];
    for(int i = 0; i < n; i++)
    {
        childInfo[i].level = info->level + 1;
        childInfo[i].run = true;
        std::thread tt(ChildWork, &childInfo[i]);
        tt.detach();
    }
    // checks if any child threads are working
    bool anyRun = true;
    while(anyRun)
    {
        anyRun = false;
        for(int i = 0; i < n; i++)
        {
            anyRun = anyRun || childInfo[i].run;
        }
    }
    // once all child threads are done, we find their max and min value
    double maximum = 1, minimum = 6;
    for(int i = 0; i < n; i++)
    {
        // cout<<childInfo[i].maxOfVals<<endl;
        if(childInfo[i].maxOfVals >= maximum)
            maximum = childInfo[i].maxOfVals;
        if(childInfo[i].minOfVals < minimum)
            minimum = childInfo[i].minOfVals;
    }
    info->maxOfVals = maximum;
    info->minOfVals = minimum;
    // we set the info->run value to false, so that the parent thread of this thread will know that it is done
    info->run = false;
}

int main()
{
    ThreadInfo info;
    srand(time(NULL));
    double n = rand()%6+1;
    cout << "n is: " << n << endl;
    // initializing thread info
    info.randomN = n;
    info.maxOfVals = n;
    info.minOfVals = n;
    info.level = 1;
    info.run = true;
    std::thread t(ChildWork, &info);
    t.join();
    while(info.run);
    info.maxOfVals = max<unsigned long>(info.randomN, info.maxOfVals);
    info.minOfVals = min<unsigned long>(info.randomN, info.minOfVals);
    cout << "Max is: " << info.maxOfVals << " and Min is: " << info.minOfVals;
}
The code compiles with no error, but when I execute it, it gives me this :
libc++abi.dylib: terminating with uncaught exception of type
std::__1::system_error: thread constructor failed: Resource
temporarily unavailable Abort trap: 6
You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function ChildWork I see two mistakes:
As someone already pointed out in the comments, you check the info level of a thread and then go on to create more threads regardless of the result of that check (see the sketch after this list).
Within the for loop that spawns your new threads, you increment the info level right before you spawn the actual thread. However, you increment members of a freshly created instance of ThreadInfo here: ThreadInfo* childInfo = new ThreadInfo[(int)n]. All instances within childInfo hold a level of 0, so basically the level of each thread you spawn is 1.
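A sketch of the missing early return for the first mistake (the rest of ChildWork unchanged):

void ChildWork(void* a) {
    ThreadInfo* info = (ThreadInfo*)a;
    // ... initialize info->randomN, maxOfVals, minOfVals as before ...
    // the depth of recursion should not be more than 3
    if (info->level > 3)
    {
        info->run = false;
        return; // without this return, children are spawned regardless
    }
    // ... create and monitor child threads only below this point ...
}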
In general avoid using threads to achieve concurrency for I/O bound operations (*). Just use threads to achieve concurrency for independent CPU bound operations. As a rule of thumb you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance.
(*) You should always use direct function calls and an event based system to run pseudo concurrent I/O operations. You do not need any threading to do so. For example a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice your software is composed of multiple parts, developed by independent developers and maintained in different modes, so it is ok to have some threads which could be theoretically avoided.
Multithreading is still rocket science in 2019. Especially in C++. Do not do it unless you know exactly what you are doing. Here is a good series of blog posts that handle threads.

C++ threads: add two arrays into one result array [duplicate]

I'm fairly new to threads in C. For this program I need to declare a thread which I create in a for loop, and which is meant to print out its printfs.
I can't seem to get it to print in the correct order. Here's my code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NUM_THREADS 16

void *thread(void *thread_id) {
    int id = *((int *) thread_id);
    printf("Hello from thread %d\n", id);
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        int code = pthread_create(&threads[i], NULL, thread, &i);
        if (code != 0) {
            fprintf(stderr, "pthread_create failed!\n");
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}
//gcc -o main main.c -lpthread
That's the classic example for understanding multi-threading.
The threads run concurrently, scheduled by the OS scheduler.
There is no such thing as a "correct order" when we are talking about running in parallel.
Also, there is such a thing as buffer flushing for stdout output. Meaning, when you printf something, it is not promised to appear immediately, but only once some buffer limit or timeout is reached.
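If you want to rule buffering out in an experiment, a one-line sketch:

printf("Hello from thread %d\n", id);
fflush(stdout); // push the line out now rather than at a buffer limit/exit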
Also, if you want to do the work in the "correct order", meaning wait until the first thread finishes its work before starting the next one, consider using "join":
http://man7.org/linux/man-pages/man3/pthread_join.3.html
UPD:
Passing a pointer to thread_id is also incorrect in this case, as a thread may print an id that doesn't belong to it (thanks Kevin).
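For instance, a sketch combining both fixes (assuming the arrays are declared in main as in the question):

int ids[NUM_THREADS];
for (int i = 0; i < NUM_THREADS; i++) {
    ids[i] = i; // each thread gets its own stable copy of the id
    pthread_create(&threads[i], NULL, thread, &ids[i]);
}
for (int i = 0; i < NUM_THREADS; i++)
    pthread_join(threads[i], NULL); // main returns only after all output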

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.

void malloc_wrapper() {
    // ...
    pthread_mutex_lock(&alloc_mutex);
    if (boolmutex) {
        printf("mutex misbehaving\n");
        __THROW_ERROR__; // this happens!
    }
    boolmutex = true;
    // manipulate linked list here
    boolmutex = false;
    pthread_mutex_unlock(&alloc_mutex);
    // ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with less threads seems to make the corruptions less common, but not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out of bounds writes but glibc still occasionally fails with a corrupted heap error (Can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list to keep this post from getting any longer.
Thanks in advance for any help!
Thanks to all commenters. I tried nearly all suggestions with no results, until I finally decided to write a simple memory allocation stress test - one that runs a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>

#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS

// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
    int* alive_flag = (int*)arg;
    int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
    int cnt = 0;
    timeval t_pre, t_post;
    gettimeofday(&t_pre, NULL);

    const int ALLOCATE=1, FREE=0;
    const unsigned int MINSIZE=500, MAXSIZE=1000;
    const int MAX_ALLOC=10000;
    char* membufs[MAXSIZE];
    unsigned int membufs_size = 0;
    int num_allocs = 0, num_frees = 0;

    while(1)
    {
        int action;
        // Decide whether to allocate or free a memory block.
        // if we have less than MINSIZE buffers, allocate.
        if (membufs_size < MINSIZE) action = ALLOCATE;
        // if we have MAXSIZE, free.
        else if (membufs_size >= MAXSIZE) action = FREE;
        // else, decide randomly.
        else {
            action = ((rand() & 0x1)? ALLOCATE : FREE);
        }

        if (action == ALLOCATE) {
            // choose size to allocate, from 1 to MAX_ALLOC bytes
            size_t size = (rand() % MAX_ALLOC) + 1;
            // allocate and fill memory
            char* buf = (char*)malloc(size);
            memset(buf, 0x77, size);
            // add buffer to list
            membufs[membufs_size] = buf;
            membufs_size++;
            assert(membufs_size <= MAXSIZE);
            num_allocs++;
        }
        else { // action == FREE
            // choose a random buffer to free
            size_t pos = rand() % membufs_size;
            assert(pos < membufs_size);
            // free and remove from list by replacing entry with last member
            free(membufs[pos]);
            membufs[pos] = membufs[membufs_size-1];
            membufs_size--;
            assert(membufs_size >= 0);
            num_frees++;
        }

        // once in 10 seconds print a status update
        gettimeofday(&t_post, NULL);
        if (t_post.tv_sec - t_pre.tv_sec >= 10) {
            printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
            gettimeofday(&t_pre, NULL);
        }

        // indicate alive to main thread
        *alive_flag = ALIVE_INDICATOR;
    }
    return NULL;
}

int main()
{
    int alive_flag[NUM_THREADS];
    printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);

    // start a thread for each core
    for (int i=0; i<NUM_THREADS; i++) {
        alive_flag[i] = i; // tell each thread its ID.
        pthread_t th;
        int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
        assert(ret == 0);
    }

    while(1) {
        sleep(10);
        // check that all threads are alive
        bool ok = true;
        for (int i=0; i<NUM_THREADS; i++) {
            if (alive_flag[i] != ALIVE_INDICATOR)
            {
                printf("Thread %d is not responding\n", i);
                ok = false;
            }
        }
        assert(ok);
        for (int i=0; i<NUM_THREADS; i++)
            alive_flag[i] = 0;
    }
    return 0;
}

Windows 7 memory management - how to prevent concurrent threads from blocking

I'm working on a program consisting of two concurrent threads. One (here "Clock") performs some computation on a regular basis (10 Hz) and is quite memory-intensive. The other one (here "hugeList") uses even more RAM but is not as time-critical as the first one, so I decided to reduce its priority to THREAD_PRIORITY_LOWEST. Yet, when that thread frees most of the memory it has used, the critical one doesn't manage to keep its timing.
I was able to condense the problem down to this bit of code (make sure optimizations are turned off!):
While Clock tries to keep a 10 Hz timing, the hugeList thread allocates and frees more and more memory, not organized in any sort of chunks.
#include "stdafx.h"
#include <stdio.h>
#include <forward_list>
#include <time.h>
#include <windows.h>
#include <vector>
void wait_ms(double _ms)
{
clock_t endwait;
endwait = clock () + _ms * CLOCKS_PER_SEC/1000;
while (clock () < endwait) {} // active wait
}
void hugeList(void)
{
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_LOWEST);
unsigned int loglimit = 3;
unsigned int limit = 1000;
while(true)
{
for(signed int cnt=loglimit; cnt>0; cnt--)
{
printf(" Countdown %d...\n", cnt);
wait_ms(1000.0);
}
printf(" Filling list...\n");
std::forward_list<double> list;
for(unsigned int cnt=0; cnt<limit; cnt++)
list.push_front(42.0);
loglimit++;
limit *= 10;
printf(" Clearing list...\n");
while(!list.empty())
list.pop_front();
}
}
void Clock()
{
clock_t start = clock()-CLOCKS_PER_SEC*100/1000;
while(true)
{
std::vector<double> dummyData(100000, 42.0); // just get some memory
printf("delta: %d ms\n", (clock()-start)*1000/CLOCKS_PER_SEC);
start = clock();
wait_ms(100.0);
}
}
int main()
{
DWORD dwThreadId;
if (CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)&Clock, (LPVOID) NULL, 0, &dwThreadId) == NULL)
printf("Thread could not be created");
if (CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)&hugeList, (LPVOID) NULL, 0, &dwThreadId) == NULL)
printf("Thread could not be created");
while(true) {;}
return 0;
}
First of all I noticed that allocating memory for the linked list is way faster than freeing it.
On my machine (Windows 7), at around the 4th iteration of the hugeList method, the Clock thread gets significantly disturbed (delays of up to 200 ms). The effect disappears without the dummyData vector "asking" for some memory in the Clock thread.
So,
Is there any way of increasing the priority of memory allocation for the Clock-Thread in Win7?
Or do I have to split both operations onto two contexts (processes)?
Note that my original code uses some communication via shared variables, which would require some kind of IPC if I chose the second option.
Note that my original code gets stuck for about 1 sec when the equivalent of the "hugeList" method clears a boost::unordered_map and enters ntdll.dll!RtlInitializeCriticalSection many, many times
(observed with Sysinternals Process Explorer).
Note that the effects observed are not due to swapping; I'm using 1.4 GB of my 16 GB (64-bit Win7).
edit:
Just wanted to let you know that up to now I haven't been able to solve my issue. Splitting the two parts of the code into two processes does not seem to be an option, since my time is rather limited and I've never worked with processes so far. I'm afraid I won't be able to get a running version done in time.
However, I managed to reduce the effects by reducing the number of memory deallocations made by the non-critical thread. This was achieved by using a fast pooling memory allocator (like the one provided in the boost library).
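As a sketch of that approach (my example with boost::pool, not the original code; the pool holds fixed-size elements):

#include <boost/pool/pool.hpp>

// A per-thread pool of fixed-size elements: frees return memory to the
// thread's own pool instead of contending on the shared process heap,
// and purge_memory() gives everything back in a few large deallocations.
boost::pool<> node_pool(sizeof(double));

void* p = node_pool.malloc(); // take one element from the pool
node_pool.free(p);            // return it to the pool, not to the heap
// node_pool.purge_memory();  // on teardown: release all blocks at once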
There does not seem to be a way of explicitly creating certain objects (like e.g. the huge forward list in my example) on some sort of thread-private heap that would not require synchronisation.
For further reading:
http://bmagic.sourceforge.net/memalloc.html
Do threads have a distinct heap?
Memory Allocation/Deallocation Bottleneck?
http://software.intel.com/en-us/articles/avoiding-heap-contention-among-threads
http://www.boost.org/doc/libs/1_55_0/libs/pool/doc/html/boost_pool/pool/introduction.html
Replacing std::forward_list with std::list, I ran your code on a Core i7 4 GB machine until 2 GB were consumed. No disturbances at all (in a debug build).
P.S.
Yes, the release build recreates the issue. I replaced the forward list with an array:
double* p = new double[limit];
for (unsigned int cnt = 0; cnt < limit; cnt++)
    p[cnt] = 42.0;

and

for (unsigned int cnt = 0; cnt < limit; cnt++)
    p[cnt] = -1;
delete [] p;
It does not recreate the issue then.
It seems the thread scheduler is punishing the process for asking for lots of small memory chunks.