Exception in thread and stack trace shows call to _endthreadex - c++

I am currently debugging a crash in one of my applications and stumbled upon something that looks a bit odd to me.
In my application I start several threads using _beginthreadex function. At some random point in time (sometimes after several hours) one of the threads crashes for a yet unknown reason. The stack trace at the time of the crash looks like this:
my.dll!thread(void * p=0x241e0140) Line 792 + 0xf bytes C
msvcr90.dll!__endthreadex() + 0x44 bytes
msvcr90.dll!__endthreadex() + 0xd8 bytes
kernel32.dll!_BaseThreadStart@8() + 0x37 bytes
where thread is the main loop function of my thread. I omitted the stack frames above my thread function here. The crash is an access violation and happens each time at the same location with the exact same bad pointer value.
What made me suspicious is that __endthreadex appears in the stack trace at all. I tried to reproduce this in a small sample program and caused an access violation there too. However, the stack trace looks different:
test.exe!thread(void * vpb=0x00000000) Line 9 + 0x3 bytes C++
test.exe!_callthreadstartex() Line 348 + 0x6 bytes C
test.exe!_threadstartex(void * ptd=0x000328e8) Line 326 + 0x5 bytes C
kernel32.dll!_BaseThreadStart@8() + 0x37 bytes
The corresponding code is this:
static unsigned __stdcall thread( void * vpb )
{
    int *a = (int*)0xdeadbeef;
    *a = 0;
    return 1;
}

int main()
{
    HANDLE th = (HANDLE)_beginthreadex( NULL, 0, thread, 0, CREATE_SUSPENDED, NULL );
    for(int i=0; i<INT_MAX;)
    {
        ++i;
    }
    ResumeThread(th);
    for(int i=0; i<INT_MAX;)
    {
        ++i;
    }
    return 0;
}
The original code looks similar to this but is obviously more complex. So I'm now wondering whether the differences in the stack traces are caused by my modifications to the code when implementing the small example, or whether all this shows that the stack is corrupted.
So my question is basically: Is it possible that _BaseThreadStart calls __endthreadex, which in turn calls the thread function itself? And if so, can someone explain why this is the case? To me it looks odd, and I wonder whether it points to some problem related to my initial problem of the crashing thread.
Thanks in advance for your suggestions.

Related

32-bit malloc() returns NULL when opening many threads?

I have a sample C++ program as below:
#include <windows.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    void * pointerArr[20000];
    int i = 0, j;
    for (i = 0; i < 20000; i++) {
        void * pointer = malloc(131125);
        if (pointer == NULL) {
            printf("i = %d, out of memory!\n", i);
            getchar();
            break;
        }
        pointerArr[i] = pointer;
    }
    for (j = 0; j < i; j++) {
        free(pointerArr[j]);
    }
    getchar();
    return 0;
}
When I run it with Visual Studio as a 32-bit Debug build, I get the following result:
The program can use nearly 2 GB of memory before running out.
This is normal behavior.
However, when I add code to start a thread inside the for loop, as below:
#include <windows.h>
#include <stdio.h>

DWORD WINAPI thread_func(VOID* pInArgs)
{
    Sleep(100000);
    return 0;
}

int main(int argc, char* argv[])
{
    void * pointerArr[20000];
    int i = 0, j;
    for (i = 0; i < 20000; i++) {
        CreateThread(NULL, 0, thread_func, NULL, 0, NULL);
        void * pointer = malloc(131125);
        if (pointer == NULL) {
            printf("i = %d, out of memory!\n", i);
            getchar();
            break;
        }
        pointerArr[i] = pointer;
    }
    for (j = 0; j < i; j++) {
        free(pointerArr[j]);
    }
    getchar();
    return 0;
}
The result is as below:
The memory in use is still only around 200 MB, but malloc returns NULL.
Could anyone help explain why the program cannot use memory up to 2 GB before running out?
Does this mean that creating many threads like the above causes a memory leak?
In my real application, this error occurs when I create about 800 threads; the RAM in use at the time of "out of memory" is around 300 MB.
As noted in a comment by @macroland, the main thing happening here is that each thread is consuming 1 MiB for its stack (see MSDN CreateThread and Thread Stack Size). You say malloc returns NULL once the total you have directly allocated reaches 200 MB. Since you are allocating 131125 bytes at a time, that is 200 MB / 131125 B = 1525 threads. Their cumulative stack space will be around 1.5 GB. Adding the 200 MB of malloc memory is 1.7 GB, and miscellaneous overhead likely accounts for the rest.
So, why does Task Manager not show this? Because the full 1 MiB of thread stack space is not actually allocated (also called committed), rather it is reserved. See VirtualAlloc and the MEM_RESERVE flag. The address space has been reserved for expansion up to 1 MiB, but initially only 64 KiB are allocated, and Task Manager only counts the latter. But reserved memory will not be unilaterally repurposed by malloc until the reservation is lifted, so once it runs out of available address space, it has to return NULL.
What tool can show this? I don't know of anything off the shelf (even Process Explorer does not seem to show a count of reserved memory). What I have done in the past is write my own little routine that uses VirtualQuery to enumerate the entire address space, including reserved ranges. I recommend you do the same; it's not much code to write, and very handy when coding for 32-bit Windows, because the 2 GiB address space gets cramped very easily (DLLs are an obvious reason, but the default malloc will also leave unexpected reservations behind in response to certain allocation patterns, even if you free everything).
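For illustration, a minimal sketch of such a routine (my own code, not the exact one described above; %Iu is the MSVC printf length specifier for SIZE_T):

#include <windows.h>
#include <stdio.h>

int main()
{
    SIZE_T free_b = 0, reserved_b = 0, committed_b = 0;
    MEMORY_BASIC_INFORMATION mbi;
    unsigned char *addr = 0;
    // Walk the user address space region by region; VirtualQuery returns
    // the number of bytes written to mbi, or 0 once past the end.
    while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        if (mbi.State == MEM_FREE)         free_b      += mbi.RegionSize;
        else if (mbi.State == MEM_RESERVE) reserved_b  += mbi.RegionSize;
        else if (mbi.State == MEM_COMMIT)  committed_b += mbi.RegionSize;
        addr += mbi.RegionSize;
    }
    printf("free: %Iu KiB, reserved: %Iu KiB, committed: %Iu KiB\n",
           free_b / 1024, reserved_b / 1024, committed_b / 1024);
    return 0;
}

Run something like this inside the problem program (or against another process via VirtualQueryEx and a process handle) and you will see the reserved figure balloon as threads are created, even while Task Manager shows little change.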
In any case, if you want to create thousands of threads in a 32-bit Windows process, be sure to pass a non-zero value as the dwStackSize parameter to CreateThread, and also pass STACK_SIZE_PARAM_IS_A_RESERVATION as dwCreationFlags. The minimum is 64 KiB, which will be plenty if you avoid recursive algorithms in the threads.
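For example (a sketch reusing thread_func from the question above; 64 KiB is the documented minimum):

// Reserve only 64 KiB of address space for this thread's stack instead of
// the default 1 MiB; STACK_SIZE_PARAM_IS_A_RESERVATION makes dwStackSize
// set the reservation size rather than the initial commit.
HANDLE h = CreateThread(NULL,
                        64 * 1024,                         // dwStackSize
                        thread_func, NULL,
                        STACK_SIZE_PARAM_IS_A_RESERVATION, // dwCreationFlags
                        NULL);

With that change, 2000 threads reserve about 125 MiB of address space instead of about 2 GiB (2000 x 1 MiB).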
Addendum: In a comment, @IInspectable cautions against using thousands of threads, citing Raymond Chen's 2005 blog post Does Windows have a limit of 2000 threads per process?. I agree that doing so is questionable for a variety of reasons; it is not my intent to endorse the practice, rather I'm just explaining one necessary element.

Access old stack frames

If I'm understanding this right, each time you call a C++ function, SP (and possibly BP) get moved to allocate some temporary space on the stack — a stack frame. And when the function returns, the pointers get moved back, deallocating the stack frame again.
But it appears to me that the data on the old stack frame is still there, it's just not referenced any more. Is there some way to make GDB show me these deleted stack frames? (Obviously once you enter a new stack frame it will at least partially overwrite any previous ones... but until then it seems like it should be possible.)
But it appears to me that the data on the old stack frame is still there, it's just not referenced any more.
Correct.
Is there some way to make GDB show me these deleted stack frames?
You can trivially look at the unused stack with GDB's examine (x) command. For example:
void fn()
{
    int x[100];
    for (int j = 0; j < 100; j++) x[j] = (0x1234 << 12) + j;
}

int main()
{
    fn();
    return 0;
}
Build and debug with:
gcc -g t.c
gdb -q ./a.out
(gdb) start
Temporary breakpoint 1 at 0x115f: file t.c, line 10.
Starting program: /tmp/a.out
Temporary breakpoint 1, main () at t.c:10
10 fn();
(gdb) n
11 return 0;
(gdb) x/40x $rsp-0x40
0x7fffffffdc60: 0x0123405c 0x0123405d 0x0123405e 0x0123405f
0x7fffffffdc70: 0x01234060 0x01234061 0x01234062 0x01234063
0x7fffffffdc80: 0x55555170 0x00005555 0x55555040 0x00000064
0x7fffffffdc90: 0xffffdca0 0x00007fff 0x55555169 0x00005555
0x7fffffffdca0: 0x55555170 0x00005555 0xf7a3a52b 0x00007fff
0x7fffffffdcb0: 0x00000000 0x00000000 0xffffdd88 0x00007fff
0x7fffffffdcc0: 0x00080000 0x00000001 0x5555515b 0x00005555
0x7fffffffdcd0: 0x00000000 0x00000000 0xa91c6994 0xc8f4292d
0x7fffffffdce0: 0x55555040 0x00005555 0xffffdd80 0x00007fff
0x7fffffffdcf0: 0x00000000 0x00000000 0x00000000 0x00000000
Here you can clearly see x still on stack: 0x7fffffffdc60 is where x[92] used to be, 0x7fffffffdc70 is where x[96] used to be, etc.
There is no easy way to make GDB interpret that data as locals of fn, though.
Stack frames do not contain any information about their own size or boundaries; rather, this knowledge is hardcoded into each function's code. There is a (stack) frame pointer register, and using it makes it possible to walk the stack up, but not down: in the current function you know the boundaries of the current frame, but there is no information about what might lie below it.
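To illustrate the "up, but not down" point, here is a sketch of my own for walking the frame-pointer chain upward (x86-64 GCC; it assumes the code was built with frame pointers, e.g. -fno-omit-frame-pointer):

#include <stdio.h>

// With frame pointers in use, each frame stores the caller's RBP at fp[0]
// and the return address at fp[1]. Nothing analogous exists for the dead
// frames below the stack pointer, which is why GDB cannot rebuild them.
void walk(void)
{
    void **fp = (void**)__builtin_frame_address(0); // GCC builtin
    while (fp && (void**)fp[0] > fp) { // stop once the chain ends
        printf("frame at %p, return address %p\n", (void*)fp, fp[1]);
        fp = (void**)fp[0];
    }
}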

What could cause a mutex to misbehave?

I've spent the last couple of months debugging a rare crash somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch out-of-bounds memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with a known pattern to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block. (There are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further; besides, I assume any write more than 1 KB out of bounds would eventually trigger a segfault anyway.) This bounds checker has found other problems in the past, so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.

void malloc_wrapper() {
    // ...
    pthread_mutex_lock(&alloc_mutex);
    if (boolmutex) {
        printf("mutex misbehaving\n");
        __THROW_ERROR__; // this happens!
    }
    boolmutex = true;
    // manipulate linked list here
    boolmutex = false;
    pthread_mutex_unlock(&alloc_mutex);
    // ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
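For reference, the arrangement looked roughly like this (a sketch; the array sizes here are arbitrary, not the ones I actually used):

// Canary arrangement: known-pattern arrays surround the mutex, so any
// out-of-bounds write large enough to reach the mutex should hit them too.
struct GuardedMutex {
    char guard_before[4096]; // filled with a fixed pattern at init
    pthread_mutex_t mutex;
    char guard_after[4096];  // checked whenever the bug reproduces
};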
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads makes the corruption less common, but it does not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad-initialization issue. The memory bounds checker never catches actual out-of-bounds writes, but glibc still occasionally fails with a corrupted heap error. (Can such an error be caused by something other than an oob write?) Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in others, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I won't list here, to keep this post from getting any longer.
Thanks in advance for any help!
Thanks to all commenters. I tried nearly all the suggestions with no results, until I finally decided to write a simple memory allocation stress test: one that runs a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad-core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>

#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS

// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
    int* alive_flag = (int*)arg;
    int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
    int cnt = 0;
    timeval t_pre, t_post;
    gettimeofday(&t_pre, NULL);

    const int ALLOCATE=1, FREE=0;
    const unsigned int MINSIZE=500, MAXSIZE=1000;
    const int MAX_ALLOC=10000;
    char* membufs[MAXSIZE];
    unsigned int membufs_size = 0;
    int num_allocs = 0, num_frees = 0;

    while(1)
    {
        int action;

        // Decide whether to allocate or free a memory block.
        // if we have less than MINSIZE buffers, allocate.
        if (membufs_size < MINSIZE) action = ALLOCATE;
        // if we have MAXSIZE, free.
        else if (membufs_size >= MAXSIZE) action = FREE;
        // else, decide randomly.
        else {
            action = ((rand() & 0x1)? ALLOCATE : FREE);
        }

        if (action == ALLOCATE) {
            // choose size to allocate, from 1 to MAX_ALLOC bytes
            size_t size = (rand() % MAX_ALLOC) + 1;

            // allocate and fill memory
            char* buf = (char*)malloc(size);
            memset(buf, 0x77, size);

            // add buffer to list
            membufs[membufs_size] = buf;
            membufs_size++;
            assert(membufs_size <= MAXSIZE);
            num_allocs++;
        }
        else { // action == FREE
            // choose a random buffer to free
            size_t pos = rand() % membufs_size;
            assert (pos < membufs_size);

            // free and remove from list by replacing entry with last member
            free(membufs[pos]);
            membufs[pos] = membufs[membufs_size-1];
            membufs_size--;
            assert(membufs_size >= 0);
            num_frees++;
        }

        // once in 10 seconds print a status update
        gettimeofday(&t_post, NULL);
        if (t_post.tv_sec - t_pre.tv_sec >= 10) {
            printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
            gettimeofday(&t_pre, NULL);
        }

        // indicate alive to main thread
        *alive_flag = ALIVE_INDICATOR;
    }
    return NULL;
}

int main()
{
    int alive_flag[NUM_THREADS];
    printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);

    // start a thread for each core
    for (int i=0; i<NUM_THREADS; i++) {
        alive_flag[i] = i; // tell each thread its ID.
        pthread_t th;
        int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
        assert(ret == 0);
    }

    while(1) {
        sleep(10);

        // check that all threads are alive
        bool ok = true;
        for (int i=0; i<NUM_THREADS; i++) {
            if (alive_flag[i] != ALIVE_INDICATOR)
            {
                printf("Thread %d is not responding\n", i);
                ok = false;
            }
        }
        assert(ok);

        for (int i=0; i<NUM_THREADS; i++)
            alive_flag[i] = 0;
    }
    return 0;
}

C++ VirtualQueryEx infinite loop

I'm currently re-creating a memory modifier application using C++; the original was in C#.
All credit goes to "gimmeamilk", whose tutorials I've been following on YouTube (video 1 of 8). I would highly recommend these tutorials to anyone attempting to create a similar application.
The problem I have is that my VirtualQueryEx seems to run forever. The process I'm scanning is "notepad.exe", whose pid I pass to the application via a command-line parameter.
std::cout<<"Create scan started\n";

// These are all the flags that will be used to determine if a memory block is writable.
#define WRITABLE (PAGE_READWRITE | PAGE_WRITECOPY | PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY)

MEMBLOCK * mb_list = NULL; // pointer to the head of the linked list to be returned
MEMORY_BASIC_INFORMATION meminfo; // holder for the VirtualQueryEx return struct
unsigned char *addr = 0; // holds the value to pass to VirtualQueryEx

HANDLE hProc = OpenProcess(PROCESS_ALL_ACCESS, false, pid);
if(hProc)
{
    while(1)
    {
        if(VirtualQueryEx(hProc, addr, &meminfo, sizeof(meminfo)) == 0)
        {
            break;
        }
        // Bitwise test of meminfo.State against MEM_COMMIT; this filters out
        // memory that the process has reserved but not committed.
        if((meminfo.State & MEM_COMMIT) && (meminfo.Protect & WRITABLE))
        {
            MEMBLOCK * mb = create_memblock(hProc, &meminfo);
            if(mb)
            {
                mb->next = mb_list;
                mb_list = mb;
            }
        }
        // Move the address along by adding the length of the current block.
        addr = (unsigned char *)meminfo.BaseAddress + meminfo.RegionSize;
    }
}
else
{
    std::cout<<"Failed to open process\n";
}
std::cout<<"Create scan finished\n";
return mb_list;
The output from this code results in
Create scan started on process:7228
Then it does not print anything else to the console. Unfortunately, the example source code linked from the YouTube video is no longer available.
(7228 will change based on the current pid of notepad.exe)
Edit - reply to the question from @Hans Passant:
I still don't understand. What I think I'm doing is:
Starting an infinite loop
{
    Testing using vqx if the address is valid and populating my MEM_BASIC_etc..
    {
        (has the process committed to using that addr of memory)(is the memory writable)
        {
            create memblock etc
        }
    }
    move the address along by the size of the current block
}
My program is 32-bit and so is notepad (as far as I'm aware).
Is my problem that, because I'm using a 64-bit OS, I'm actually inspecting half of a block (a block here meaning the unit assigned by the OS in memory), causing it to loop?
Big thanks for your help! I want to understand my problem as well as fix it.
Your problem is that you're compiling a 32-bit program and using it to parse the memory of a 64-bit program (on 64-bit Windows, the notepad.exe you launch is 64-bit). You define addr as an unsigned char pointer, which in this case is 32 bits in size. It cannot contain a 64-bit address, which is the cause of your problem: once the scan reaches regions beyond the reach of a 32-bit pointer, the computed next address is truncated, so the loop can wrap around and rescan instead of ever terminating.
If your target process is 64 bit, compile your program as 64 bit as well. For 32 bit target processes, compile for 32 bit. This is typically the best technique for dealing with the memory of external processes and is the fastest solution.
Depending on what you're doing, you can also use #ifdef and other conditionals to use 64 bit variables depending on the target, but the original solution is usually easier.
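As a purely defensive measure, you can also make the scan loop detect the wrap-around instead of spinning forever. A minimal sketch, assuming the variables from the question's code (this stops the scan cleanly, but it still cannot let a 32-bit scanner see anything above 4 GB):

// Compute the next address in a pointer-sized integer and bail out if it
// no longer moves forward (i.e. the addition wrapped around).
ULONG_PTR next = (ULONG_PTR)meminfo.BaseAddress + meminfo.RegionSize;
if (next <= (ULONG_PTR)addr)
{
    break; // wrapped past the top of our address space
}
addr = (unsigned char *)next;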

My trampoline won't bounce (detouring, C++, GCC)

It feels like I'm abusing Stack Overflow with all my questions, but it's a Q&A forum after all :) Anyhow, I have been using detours for a while now, but I have yet to implement one of my own (I've used wrappers earlier). Since I want to have complete control over my code (who doesn't?), I have decided to implement a fully functional detour of my own, so I can understand every single byte of my code.
The code (below) is as simple as possible; the problem, though, is not. I have successfully implemented the detour (i.e. a hook to my own function), but I haven't been able to implement the trampoline.
Whenever I call the trampoline, depending on the offset I use, I get either a "segmentation fault" or an "illegal instruction". Both cases end the same way, though: "core dumped". I think it is because I've mixed up the "relative address" (note: I'm pretty new to Linux, so I have far from mastered GDB).
As commented in the code, depending on sizeof(jmpOp) at line 66 (the trampoline offset calculation), I either get an illegal instruction or a segmentation fault. I'm sorry if it's something obvious, I'm staying up way too late...
// Header files
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include "global.h" // Contains typedefines for byte, ulong, ushort etc...
#include <cstring>

bool ProtectMemory(void * addr, int flags)
{
    // Constant holding the page size value
    const size_t pageSize = sysconf(_SC_PAGE_SIZE);

    // Calculate relative page offset
    size_t temp = (size_t) addr;
    temp -= temp % pageSize;

    // Update address
    addr = (void*) temp;

    // Update memory area protection
    return !mprotect(addr, pageSize, flags);
}

const byte jmpOp[] = { 0xE9, 0x00, 0x00, 0x00, 0x00 };

int Test(void)
{
    printf("This is testing\n");
    return 5;
}

int MyTest(void)
{
    printf("This is ******\n");
    return 9;
}

typedef int (*TestType)(void);

int main(int argc, char * argv[])
{
    // Fetch addresses
    byte * test = (byte*) &Test;
    byte * myTest = (byte*) &MyTest;

    // Call original
    Test();

    // Update memory access for 'test' function
    ProtectMemory((void*) test, PROT_EXEC | PROT_WRITE | PROT_READ);

    // Allocate memory for the trampoline
    byte * trampoline = new byte[sizeof(jmpOp) * 2];

    // Do copy operations
    memcpy(trampoline, test, sizeof(jmpOp));
    memcpy(test, jmpOp, sizeof(jmpOp));

    // Setup trampoline
    trampoline += sizeof(jmpOp);
    *trampoline = 0xE9;

    // I think this address is incorrect, how should I calculate it? With the current
    // status (commented 'sizeof(jmpOp)') the program crashes with "Illegal instruction".
    // If I uncomment it, and use either + or -, a segmentation fault will occur...
    *(uint*)(trampoline + 1) = ((uint) test - (uint) trampoline)/* + sizeof(jmpOp)*/;
    trampoline -= sizeof(jmpOp);

    // Make the trampoline executable (and read/write)
    ProtectMemory((void*) trampoline, PROT_EXEC | PROT_WRITE | PROT_READ);

    // Setup detour
    *(uint*)(test + 1) = ((uint) myTest - (uint) test) - sizeof(jmpOp);

    // Call 'detoured' func
    Test();

    // Call trampoline (crashes)
    ((TestType) trampoline)();

    return 0;
}
In case of interest, this is the output during a normal run (with the exact code above):
This is testing
This is **
Illegal instruction (core dumped)
And this is the result if I use +/- sizeof(jmpOp) at line 66 (the trampoline offset calculation):
This is testing
This is ******
Segmentation fault (core dumped)
NOTE: I'm running Ubuntu 32 bit and compile with g++ global.cpp main.cpp -o main -Iinclude
You're not going to be able to indiscriminately copy the first 5 bytes of Test() into your trampoline, followed by a jump to the 6th instruction byte of Test(), because you don't know whether those first 5 bytes comprise an integral number of x86 variable-length instructions. To do this, you'll have to do at least a minimal amount of automated disassembly of the Test() function to find an instruction boundary that's 5 or more bytes past the beginning of the function, copy that many bytes to your trampoline, and THEN append your jump (which won't be at a fixed offset within your trampoline). Note that on a typical RISC processor (like PPC), you wouldn't have this problem, as all instructions are the same width.
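To make that concrete, here is a rough sketch of the scheme, reusing byte, uint and ProtectMemory from the question. instr_boundary() is a hypothetical stand-in for the length-disassembler step described above, not a real library function. Note also that the rel32 of an E9 jump is measured from the end of the jump instruction itself, which is the other thing the original offset calculation got wrong:

// Hypothetical helper: returns the number of bytes up to the first
// instruction boundary at least min_len bytes into the function.
size_t instr_boundary(const byte * code, size_t min_len);

byte * make_trampoline(byte * test)
{
    size_t n = instr_boundary(test, sizeof(jmpOp)); // n >= 5, whole instructions only

    byte * tramp = new byte[n + sizeof(jmpOp)];
    memcpy(tramp, test, n);  // copy only complete instructions
    tramp[n] = 0xE9;         // JMP rel32 back into the rest of Test()

    // rel32 is relative to the address *after* the 5-byte jump.
    *(uint*)(tramp + n + 1) = (uint)(test + n) - (uint)(tramp + n + 5);

    ProtectMemory((void*) tramp, PROT_EXEC | PROT_WRITE | PROT_READ);
    return tramp;
}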