Static Variables and Thread-Local Storage - c++

Background:
I have discovered something of an interesting edge case relating to static memory initialization across multiple threads. Specifically, I am using Howard Hinnant's TZ library which has been working fine for the rest of my code across many different threads.
Now, I am developing a logging class which relies on yet another thread and condition variable. Unfortunately, when I attempt to format a chrono time_point using date::make_zoned(date::locate_zone("UTC"), tp) the library crashes. Upon digging through tz.cpp, I find that the time zone database returned internally is evaluating to NULL. This all comes from the following snippet:
tzdb_list&
get_tzdb_list()
{
    static tzdb_list tz_db = create_tzdb();
    return tz_db;
}
As can be seen, the database list is stored statically. With a few printf()s and some time with GDB I can see that the same db is returned for multiple calls from the main thread but returns NULL when called from my logger thread.
If, however, I change the declaration of tzdb_list to:
static thread_local tzdb_list tz_db = create_tzdb();
Everything works as expected. This is not surprising as thread_local will cause each thread to do the heavy-lifting of creating a standalone instance of tzdb_list. Obviously this is wasteful of memory and can easily cause problems later. As such, I really don't see this as a viable solution.
Questions:
What about the invocation of one thread versus another would cause static memory to behave differently? If anything, I would expect the opposite of what is happening (e.g., for the threads to 'fight' over initialized memory, not for one of them to receive a NULL pointer).
How is it possible for a returned static reference to have multiple different values in the first place (in my case, valid memory versus NULL)?
With thread_local built into the library I get wildly different memory locations on opposite ends of the addressable region; why? I suspect that this has to do with where thread memory is allocated versus the main process memory but do not know the exact details of thread allocation regions.
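For reference, the difference between the two declarations can be sketched as follows. This is a minimal illustration, not the library's code: Database, shared_instance() and per_thread_instance() are hypothetical names. A plain function-local static should yield the same address from every thread, while the thread_local variant yields one object per thread. On typical Linux/glibc setups a new thread's thread-local block is allocated together with that thread's stack, which is one plausible reason for the addresses being so far apart.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the library's database object.
struct Database { int value = 42; };

// One instance for the whole process. C++11 guarantees the initialization
// runs exactly once, even if several threads reach it concurrently.
Database& shared_instance() {
    static Database db;
    return db;
}

// One instance per thread, constructed on that thread's first call.
Database& per_thread_instance() {
    static thread_local Database db;
    return db;
}

std::uintptr_t shared_address() {
    return reinterpret_cast<std::uintptr_t>(&shared_instance());
}

std::uintptr_t per_thread_address() {
    return reinterpret_cast<std::uintptr_t>(&per_thread_instance());
}

// Addresses of both instances as observed from a freshly started thread.
struct Addresses {
    std::uintptr_t shared;
    std::uintptr_t per_thread;
};

Addresses addresses_in_new_thread() {
    Addresses result{0, 0};
    std::thread worker([&result] {
        result.shared = shared_address();
        result.per_thread = per_thread_address();
    });
    worker.join();
    assert(result.shared != 0);
    return result;
}
```

Comparing addresses_in_new_thread() with the values seen on the main thread should show identical shared addresses and differing per-thread addresses. A static that reads as NULL from only one thread therefore points at something other than ordinary static semantics.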
Reference:
My logging thread is created with:
outputThread = std::thread(Logger::outputHandler, &outputQueue);
And the actual output handler / invocation of the library (LogMessage is just a typedef for std::tuple):
void Logger::outputHandler(LogQueue *queue)
{
    LogMessage entry;
    std::stringstream ss;
    while (1)
    {
        queue->pop(entry); // Blocks on a condition variable
        ss << date::make_zoned(date::locate_zone("UTC"), std::get<0>(entry))
           << ":" << levelId[std::get<1>(entry)]
           << ":" << std::get<3>(entry) << std::endl;
        // Printing stuff
        ss.str("");
        ss.clear();
    }
}
Additional code and output samples available on request.
EDIT 1
This is definitely a problem in my code. When I strip everything out my logger works as expected. What is strange to me is that my test case in the full application is just two prints in main and a call to the logger before manually exiting. None of the rest of the app initialization is run but I am linking in all support libraries at that point (Microsoft CPP REST SDK, MySQL Connector for C++ and Howard's date library (static)).
It is easy for me to see how something could be stomping this memory but, even in the "full" case in my application, I don't know why the prints on the main thread would work but the next line calling into the logger would fail. If something were going sideways at init I would expect all calls to break.
I also noticed that if I make my logger static the problem goes away. Of course, this changes the memory layout so it doesn't rule out heap / stack smashing. What I do find interesting is that I can declare the logger globally or on the stack at the start of main() and both will segfault in the same way. If I declare the logger as static, however, both global and stack-based declaration work.
Still trying to create a minimal test case which reproduces this.
I am already linking with -lpthread; have been pretty much since the inception of this application.
OS is Fedora 27 x86_64 running on an Intel Xeon. Compiler:
$ g++ --version
g++ (GCC) 7.3.1 20180130 (Red Hat 7.3.1-2)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

It appears that this problem was caused by a bug in tz.cpp which has since been fixed.
The bug was that there was a namespace scope variable whose initialization was not guaranteed in the proper order. This was fixed by turning that variable into a function-local-static to ensure the proper initialization order.
My apologies to all who might have been impacted by this bug. And my thanks to all those who have reported it.
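The kind of fix described above can be sketched like this. It is a generic illustration of the pattern only, not the actual tz.cpp code; compute_install_dir(), install_dir() and the path string are hypothetical placeholders.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; this is not the actual tz.cpp code.

// Fragile form: a namespace-scope object. If a static initializer in
// another translation unit uses it before this translation unit's
// initializers have run, it sees an object that is not yet constructed.
//
//     static std::string install_dir = compute_install_dir();

// Placeholder for whatever work the real initializer does.
inline std::string compute_install_dir() {
    return "/usr/share/zoneinfo";  // illustrative value only
}

// Robust form: a function-local static is constructed the first time
// control passes through its declaration, so every caller receives a
// fully constructed object regardless of translation-unit ordering.
inline const std::string& install_dir() {
    static const std::string dir = compute_install_dir();
    return dir;
}
```

Every caller goes through install_dir(), so there is no longer a window in which the variable can be observed before its initializer has run.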

Related

Implementing Thread Local Storage in Software

We are porting an embedded application from Windows CE to a different system. The current processor is an STM32F4. Our current codebase heavily uses TLS. The new prototype is running KEIL CMSIS RTOS which has very reduced functionality.
On http://www.keil.com/support/man/docs/armcc/armcc_chr1359124216560.htm it says that thread local storage is supported since 5.04. Right now we are using 5.04. The problem is that when linking our program with a variable definition of __thread int a; the linker cannot find __aeabi_read_tp which makes sense to me.
My question is: Is it enough to implement __aeabi_read_tp ourselves, or is there more to it?
If it simply is not possible for us: Is there a way to implement TLS only in software? Let's not talk about performance there for now.
EDIT
I tried implementing __aeabi_read_tp by looking at old FreeBSD source and other sources. While the function is mostly implemented in assembly, I found a version in C which boils down to this:
extern "C"
{
    extern osThreadId svcThreadGetId(void);
    void *__aeabi_read_tp()
    {
        return (void*)svcThreadGetId();
    }
}
What this basically does is give me the ID (void*) of my currently executing thread. If I understand correctly that is what we want. Can this possibly work?
Not considering performance and not going into CMSIS RTOS specifics (which are unknown to me), you can allocate the space needed for your variables either on the heap or as a static or global variable. I would suggest an array of structures. Then, when you create a thread, pass a pointer to the next unused structure to your thread function.
In case of static or global variable, it would be good if you know how many threads are working in parallel for limiting the size of preallocated memory.
EDIT: Added sample of TLS implementation based on pthreads:
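#include <pthread.h>
#include <stddef.h>
#define MAX_PARALLEL_THREADS 10
/* Per-thread data; the actual fields are application-specific. */
struct tls_data {
    int placeholder; /* Hypothetical field; replace with the real per-thread variables. */
};
static pthread_t threads[MAX_PARALLEL_THREADS];
static struct tls_data tls_data[MAX_PARALLEL_THREADS];
static int tls_data_free_index = 0;
static void *worker_thread(void *arg) {
    /* Not static: each thread needs its own pointer to its own slot. */
    struct tls_data *data = (struct tls_data *) arg;
    /* Code omitted. */
    return NULL;
}
static int spawn_thread(void) {
    if (tls_data_free_index >= MAX_PARALLEL_THREADS) {
        /* Consider increasing MAX_PARALLEL_THREADS. */
        return -1;
    }
    /* Prepare thread data - code omitted. */
    if (pthread_create(&threads[tls_data_free_index], NULL, worker_thread,
                       &tls_data[tls_data_free_index]) != 0) {
        return -1;
    }
    /* Claim the slot so that the next thread gets a fresh structure. */
    tls_data_free_index++;
    return 0;
}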
The not-so-impressive solution is a std::map<threadID, T>. Needs to be wrapped with a mutex to allow new threads.
For something more convoluted, see this idea
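A minimal sketch of the map-based approach, assuming std::thread and std::mutex are available. On the Keil target the key would presumably be the RTOS thread ID (osThreadId in the question) and the lock an RTOS mutex instead. ThreadLocalMap is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>

// Software-only per-thread storage: one T per calling thread, looked up
// by std::this_thread::get_id(). ThreadLocalMap is a hypothetical name.
template <typename T>
class ThreadLocalMap {
public:
    // Returns the calling thread's slot, value-initializing it on first use.
    T& local() {
        std::lock_guard<std::mutex> lock(mutex_);
        // std::map never relocates its nodes, so the reference stays valid
        // after the lock is released, even when other threads insert.
        return slots_[std::this_thread::get_id()];
    }

    // Drops the calling thread's slot; call this before the thread exits.
    void release() {
        std::lock_guard<std::mutex> lock(mutex_);
        slots_.erase(std::this_thread::get_id());
    }

    // Number of threads that currently own a slot.
    std::size_t size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_.size();
    }

private:
    std::mutex mutex_;
    std::map<std::thread::id, T> slots_;
};
```

Each access pays for a lock and a tree lookup, and slots are not cleaned up automatically when a thread ends, which is why release() exists. That is the price of doing this without compiler support.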
I believe this is possible, but probably tricky.
Here's a paper describing how __thread or thread_local behaves in ELF images (though it doesn't talk about ARM architecture for AEABI):
https://www.akkadia.org/drepper/tls.pdf
The executive summary is:
The linker creates .tbss and/or .tdata sections in the resulting executable to provide a prototype image of the thread local data needed for each thread.
At runtime, each thread control block (TCB) has a pointer to a dynamic thread-local vector table (dtv in the paper) that contains the thread-local storage for that thread. It is lazily allocated and initialized the first time a thread attempts to access a thread-local variable. (presumably by __aeabi_read_tp())
Initialization copies the prototype .tdata image and memsets the .tbss image into the allocated storage.
When source code accesses thread-local variables, the compiler generates code to read the thread pointer from __aeabi_read_tp(), and do all the appropriate indirection to get at the storage for that thread-local variable.
The compiler and linker are doing all the work you'd expect them to, but you need to initialize and return a "thread pointer" that is properly structured and filled out the way the compiler expects it to be, because it's generating instructions directly to follow the hops.
There are a few ways that TLS variables are accessed, as mentioned in this paper, which, again, may or may not totally apply to your compiler and architecture:
http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
But, the problems are roughly the same. When you have runtime-loaded libraries that may bring their own .tbss and .tdata sections, it gets more complicated. You have to expand the thread-local storage for any thread that suddenly tries to access a variable introduced by a library loaded after the storage for that thread was initialized. The compiler has to generate different access code depending on where the TLS variable is declared. You'd need to handle and test all the cases you would want to support.
It's years later, so you probably already solved or didn't solve your problem. In this case, it is (was) probably easiest to use your OS's TLS API directly.
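The initialization step from the summary above (copy the .tdata image, zero the .tbss part) can be sketched in isolation. This is only an illustration under simplifying assumptions: TlsTemplate and make_tls_block() are hypothetical, alignment requirements are ignored, and the layout the ARM compiler actually expects behind the pointer returned by __aeabi_read_tp() is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical description of a module's TLS template: the initialized
// image (.tdata) followed by zero-initialized space (.tbss).
struct TlsTemplate {
    const unsigned char* tdata;  // start of the initialized image
    std::size_t tdata_size;      // bytes to copy from tdata
    std::size_t tbss_size;       // bytes to zero after the copy
};

// Builds one thread's private TLS block from the template.
std::vector<unsigned char> make_tls_block(const TlsTemplate& tmpl) {
    std::vector<unsigned char> block(tmpl.tdata_size + tmpl.tbss_size);
    if (tmpl.tdata_size != 0) {
        std::memcpy(block.data(), tmpl.tdata, tmpl.tdata_size);
    }
    if (tmpl.tbss_size != 0) {
        // The vector is already zero-filled; the memset is kept to mirror
        // what a raw allocator-based implementation would have to do.
        std::memset(block.data() + tmpl.tdata_size, 0, tmpl.tbss_size);
    }
    return block;
}
```

Every thread gets its own copy, so a write through one thread's block never shows up in another's.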

Debugging lambda memory corruption || Automatically watch object pointer in GDB

TL;DR: How do I automatically add a watch in gdb when a function is called so I can debug some memory corruption?
I am currently dealing with some memory corruption in C++
I am mostly seeing 4-5 types of recurring crashes, all of which make little to no sense, so I'm guessing they have to be related to memory corruption.
These crashes only happen on the production server, roughly every 2-5 hours.
Most of them consist of accessing or passing a null pointer where it can't possibly have existed in the first place.
One of these places is a lambda capturing this. (see below)
Obviously looked at core dumps and even had gdb attached while it crashed
valgrind: I've spent hours staring at multiple instances of valgrind with no success.
Enabled gcc's stack protection (-fstack-protector-all)
I have tried looking over the code & the changes, but it has been impossible for me to find anything (100k lines of code total, "On master, 10,437 files have changed and there have been 3,352,600 additions and 85,495 deletions." since the last release on the production server). I might have just plain missed something, or not looked in the right spots - I can't tell.
Used cppcheck to see if there was something plain obvious wrong with the code
If there is an easier/more straightforward method of finding where the corruption occurs, feel free to suggest that too.
Let's look at some simplified code.
I have a class, Socket, which manages a client connection.
It is constructed something like this
Listener::OnAccept(fd){
    Socket* s = new Socket();
    if (s->Setup(fd)){
        // push into a vector and do some other things
    }
}
Socket::Setup calls (virtual) OnConnect of the Socket class, which then creates a ping event, using a lambda:
Socket::OnConnect(){
    m_pingEvent = new Event([this](Event* e){
        if (!this->GotPong()){
            // close connection
        }else{
            this->Ping();
        }
    }, 30 /*seconds*/, true /* loop */);
}
Event accepts an std::function as the callback
m_pingEvent is deleted in the destructor (if set) which will cancel the event if it is running.
What happens (rarely) is that the lambda calls Ping on a nullptr, which calls m_pingPacket->Send() on this=0x1f8, which leads to a segfault.
My question - or rather my proposed solution - would be watching the captured this pointer for writes, which definitely shouldn't happen.
There is only one small issue with that..
How would I even watch such a large number of pointers without manually adding each one? (about 400 concurrent connections with a lot of (dis)connects)
As for the captured data I found this is in the __closure object:
(gdb) frame 2
#2 0x081b9d63 in operator() (e=0x9b2a748, __closure=0xb5a8318)
at net/socket/Client.cpp:151
151 net/socket/Client.cpp: No such file or directory.
(gdb) ptype __closure
type = const struct {
net::socket::Client * const __this;
} * const
Which I can get when creating the lambda easily by just moving the lambda to "auto callback = " which will be of type:
(gdb) info locals
callback = {__this = 0xb4dd0948}
(gdb) ptype callback
type = struct {
net::socket::Client * const __this;
}
(gdb) print callback
$1 = {__this = 0xb4dd0948}
(This is gcc version 4.7.2 (Debian 4.7.2-5) for reference, might be different with other compilers/versions)
Shortly before posting I realized the struct would probably change address once moved into the std::function (is this correct?)
I've been digging through the GNU "functional" header, but I haven't really been able to find anything yet. I'll keep looking (and updating this).
Another note: I am posting this full description with all of the details included in case anyone has an easier solution for me. (XY Problem)
Edit:
(gdb) print *(void**)m_pingEvent->m_callback._M_functor._M_unused._M_object
$8 = (void *) 0xb4dd56d8
(gdb) print this
$4 = (net::socket::Client * const) 0xb4dd56d8
Found it :)
Edit2:
break net/socket/Client.cpp:158
commands
silent
watch -l m_pingEvent->m_callback._M_functor._M_unused._M_object
continue
end
This has two flaws: you can only watch 4 addresses at a time & there is no way to delete the watch once the object is freed.
So it's unusable.
Edit 3:
I've figured out how to do the watching using this python script I wrote (linking this one externally since it's quite long): https://gist.github.com/imermcmaps/4a6d8a1577118645acf3
Next issue is making sense of the output..
Added watch 7 -> 0x10eb2200
Hardware watchpoint 7: -location m_pingEvent->m_callback._M_functor._M_unused._M_obj
Old value = (void *) 0x10eba4b0
New value = (void *) 0x10eba400
net::Packet::Packet (this=0x10eb1088) at ../shared/net/Packet.cpp:13
Like it's saying it changed from an old value, which shouldn't even be the original value, since I'm checking if the this pointer and the pointer value match, which they do.
Edit 4 (yay):
Turns out watch -l doesn't work like I want it to.
Manually grabbing the address and then watching that address seems to work
How do I automatically add a watch in gdb when a function is called so
I can debug some memory corruption?
Memory corruption is often detected only after the real corruption has already occurred, possibly in some other module loaded into your process. Manual debugging may therefore not be very useful for large, complex projects, because any third-party module or library loaded into your process may also cause this problem. From your post it looks like the problem is not always reproducible, which suggests a threading/synchronization problem that leads to memory corruption. Based on my experience, I strongly suggest you concentrate on reproducing the problem under dynamic analysis tools (Valgrind/Helgrind).
You mentioned in your question that you are able to run your program under Valgrind. In case you have not run it this way yet, you may want to start your program (a.out) as follows:
$ valgrind --tool=memcheck --db-attach=yes ./a.out
This way Valgrind automatically attaches a debugger (GDB) to your program when the first memory error is detected, so that you can do live debugging. This seems to be the best way to find the root cause of your problem. (Valgrind 3.11 and later removed --db-attach; there, the built-in gdbserver via --vgdb-error=1 serves the same purpose.)
However, I think there may be a data race that leads to the memory corruption. So you may want to use Helgrind to find data races and other threading problems that might cause this.
For more information on these, you may refer to the following posts:
https://stackoverflow.com/a/22658693/2724703
https://stackoverflow.com/a/22617989/2724703

I get thread sleep error in C++11 threading

I created a thread using C++11 thread class and I want the thread to sleep in a loop.
When the this_thread::sleep_for() function is called, I get an exception saying:
Run-Time Check Failure #2 - Stack around the variable '_Now' was
corrupted.
My code is below:
std::chrono::milliseconds duration( 5000 );
while (m_connected)
{
    this->CheckConnection();
    std::this_thread::sleep_for(duration);
}
I presume _Now is a local variable somewhere deep in the implementation of sleep_for. If it gets corrupted, either there is a bug in that function (unlikely) or some other part of your application is writing through dangling pointers (much more likely).
The most likely cause is that, some time before calling sleep_for, you hand out a pointer to a local variable; the pointer stays around and is written to by another thread while this thread sleeps.
If you were on Linux, I'd recommend you to try valgrind (though I am not certain it can catch invalid access to stack), but on Windows I don't know about any tool for debugging this kind of problems. You can do careful review and you can try disabling various parts of functionality to see when the problem goes away to narrow down where it might be.
I also used to use duma library with some success, but it can only catch invalid access to heap, not stack.
Note: Both clang and gcc are further along in implementing C++11 than MSVC++, so if you don't use much Windows-specific stuff, it might be easy to port and try valgrind on it. Gcc and especially clang are also known for giving much better static diagnostics than MSVC++, so if you compile it with gcc or clang, you may get some warning that will point you to the problem.

Howto debug double deletes in C++?

I'm maintaining a legacy application written in C++. It crashes every now and then and Valgrind tells me it's a double delete of some object.
What are the best ways to find the bug that is causing a double delete in an application you don't fully understand and which is too large to be rewritten ?
Please share your best tips and tricks!
Here are some general suggestions that have helped me in that situation:
Turn your logging level up to full debug, if you are using a logger. Look for suspicious stuff in the output. If your app doesn't log pointer allocations and deletes of the object/class under suspicion, it's time to insert some cout << "class Foo constructed, ptr= " << this << endl; statements in your code (and corresponding delete/destructor prints).
Run valgrind with --db-attach=yes. I've found this very handy, if a bit tedious. Valgrind will show you a stack trace every time it detects a significant memory error or event and then ask you if you want to debug it. You may find yourself repeatedly pressing 'n' many many times if your app is large, but keep looking for the line of code where the object in question is first (and secondly) deleted.
Just scour the code. Look for construction/deletion of the object in question. Sadly, sometimes it winds up being in a 3rd party library :-(.
Update: Just found this out recently: Apparently gcc 4.8 and later (if you can use GCC on your system) has some new built-in features for detecting memory errors, the "address sanitizer". Also available in the LLVM compiler system.
Yep. What @OliCharlesworth said. There's no surefire way of testing a pointer to see if it points to allocated memory, since it really is just the memory location itself.
The biggest problem your question implies is the lack of reproducibility. Continuing with that in mind, you're stuck with changing simple 'delete' constructs to delete foo;foo = NULL;.
Even then the best case scenario is "it seems to occur less" until you've really stamped it down.
I'd also ask by what evidence Valgrind suggests it's a double-delete problem. Might be a better clue lingering around in there.
It's one of the simpler truly nasty problems.
This may or may not work for you.
Long time ago I was working on 1M+ lines program that was 15 years old at the time. Faced with the exact same problem - double delete with huge data set. With such data any out of the box "memory profiler" would be a no go.
Things that were on my side:
It was very reproducible - we had macro language and running same script exactly the same way reproduced it every time
Sometime during the history of the project someone decided that "#define malloc my_malloc" and "#define free my_free" had some use. These didn't do much more than call built-in malloc() and free() but project already compiled and worked this way.
Now the trick/idea:
void* my_malloc(size_t size)
{
    static int allocation_num = 0; // it was single threaded
    char* p = (char*)builtin_malloc(size + 16); // 16-byte header in front of the user block
    if (!p) return NULL;
    *(int*)p = ++allocation_num; // header: allocation number
    *(p + sizeof(int)) = 0; // header: not freed
    return p + 16;
}
void my_free(void* p)
{
    if (!p) return;
    char* header = (char*)p - 16; // step back from the user block to the header
    if (*(header + sizeof(int)))
    {
        // this is a double free, check the allocation number in *(int*)header
        // then rerun the app with this in my_malloc:
        // if (allocation_num == XXX) debug_break();
    }
    *(header + sizeof(int)) = 1; // freed
    //builtin_free(header); // do not do this until problem is figured out
}
With new/delete it might be trickier, but still with LD_PRELOAD you might be able to replace malloc/free without even recompiling your app.
You are probably upgrading from a version that treated delete differently than the new version.
Probably what the previous version did when delete was called was a check like if (X != NULL){ delete X; X = NULL;}, whereas the new version just performs the delete.
You might need to go through and check for pointer assignments, tracking references to object names from construction to deletion.
I've found this useful: backtrace() on Linux. (You have to compile with -rdynamic.) This lets you find out where that double free is coming from by putting a try/catch block around all memory operations (new/delete), then printing out your stack trace in the catch block.
This way you can narrow down the suspects much faster than running valgrind.
I wrapped backtrace in a handy little class so that I can just say:
try {
...
} catch (...) {
StackTrace trace;
std::cerr << "Double free!!!\n" << trace << std::endl;
throw;
}
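The wrapper class itself is not shown above, so here is a minimal sketch of what such a class could look like, assuming glibc's <execinfo.h>. The name StackTrace matches the usage above, but everything inside is an assumption rather than the original author's code. Note also that a double free does not in general raise a C++ exception, so where the trace gets printed (a catch block, or a hook inside a replaced delete/free) depends on the application.

```cpp
#include <cassert>
#include <cstdlib>
#include <execinfo.h>
#include <ostream>

// Captures the call stack at construction time via glibc's backtrace().
class StackTrace {
public:
    StackTrace() : depth_(backtrace(frames_, kMaxFrames)) {}

    // Number of frames that were captured.
    int depth() const { return depth_; }

    friend std::ostream& operator<<(std::ostream& os, const StackTrace& t) {
        // backtrace_symbols() returns a single malloc'ed array that the
        // caller must release with free().
        char** symbols = backtrace_symbols(t.frames_, t.depth_);
        if (symbols == nullptr) {
            return os << "<no symbols available>\n";
        }
        for (int i = 0; i < t.depth_; ++i) {
            os << "  #" << i << ' ' << symbols[i] << '\n';
        }
        std::free(symbols);
        return os;
    }

private:
    static constexpr int kMaxFrames = 64;
    void* frames_[kMaxFrames];
    int depth_;
};
```

Without -rdynamic the output mostly consists of raw addresses, which can still be resolved afterwards with a tool such as addr2line.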
On Windows, assuming the app is built with MSVC++, you can take advantage of the extensive heap debugging tools built into the debug version of the standard library.
Also on Windows, you can use Application Verifier. If I recall correctly, it has a mode that forces each allocation onto a separate page with protected guard pages in between. It's very effective at finding buffer overruns, but I suspect it would also be useful for a double-free situation.
Another thing you could do (on any platform) would be to make a copy of the sources that are transformed (perhaps with macros) so that every instance of:
delete foo;
is replaced with:
{ delete foo; foo = nullptr; }
(The braces help in many cases, though it's not perfect.) That will turn many instances of double-free into a null pointer reference, making it much easier to detect. It doesn't catch everything; you might have a copy of a stale pointer, but it can help squash a lot of the common use-after-delete scenarios.
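Where editing the call sites is an option, the same effect can be had with a small helper instead of a macro-style source transformation. A minimal sketch; delete_and_null is a hypothetical name:

```cpp
#include <cassert>

// Deletes the object and clears the caller's pointer in one step.
// delete on a null pointer is a no-op, so a second call through the
// same pointer variable is harmless.
template <typename T>
void delete_and_null(T*& pointer) {
    delete pointer;
    pointer = nullptr;
}
```

As with the brace version, only the pointer variable passed in is cleared. Copies of the pointer held elsewhere stay stale.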

My code crashes on delete this

I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, only once in a while. About 95% of the time the code runs perfectly fine but sometimes this seems to be corrupted somehow and crash.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
    long nRefCount = InterlockedDecrement(&m_nRefCount);
    if(nRefCount == 0)
    {
        delete this;
    }
    return nRefCount;
}
The object has been created with a new, it is not in any kind of array.
The most obvious answer is: don't delete this.
If you insist on doing that, then use the common ways of finding bugs:
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)
It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and in case of overridden operator new, or multiple copies of the C++ runtime, the particular new that matches delete found in the current scope)
Crashes upon deallocation can be a pain: It is not supposed to happen, and when it happens, the code is too complicated to easily find a solution.
Note: The use of InterlockedDecrement makes me assume you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
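A minimal sketch of such logging, assuming the calls are placed by hand in the constructor and destructor of the suspect class. LifetimeLog is a hypothetical helper, not something from the question's code base. Besides printing the this pointer, it remembers which addresses are currently alive, so a second destruction of the same address is flagged immediately.

```cpp
#include <cassert>
#include <iostream>
#include <mutex>
#include <set>

// Hypothetical helper: logs every construction and destruction together
// with the this pointer and tracks which addresses are currently alive.
class LifetimeLog {
public:
    // Returns false if the address was already registered as alive.
    bool constructed(const void* self) {
        std::lock_guard<std::mutex> lock(mutex_);
        bool fresh = alive_.insert(self).second;
        std::clog << "constructed " << self
                  << (fresh ? "" : "  <-- already alive!") << '\n';
        return fresh;
    }

    // Returns false if the address was not alive, i.e. a double destruction
    // or the destruction of something that was never constructed.
    bool destroyed(const void* self) {
        std::lock_guard<std::mutex> lock(mutex_);
        bool known = alive_.erase(self) == 1;
        std::clog << "destroyed   " << self
                  << (known ? "" : "  <-- not alive!") << '\n';
        return known;
    }

private:
    std::mutex mutex_;
    std::set<const void*> alive_;
};
```

A typical use would be a single global LifetimeLog, with constructed(this) in the constructor and destroyed(this) in the destructor of the class under suspicion.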
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You should try to grab some code from the internet that prints out the current stack for each log entry. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in debug build, or have the PDB file along the executable file, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code, and here Google is your friend. The thing to remember is that you can't have a __try/__except handler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "it happens only 5% of the time" symptom could be caused by different code paths being executed, or by the fact that you have multiple threads playing together with the same data.
The InterlockedDecrement part bothers me: Is your object living in multiple threads? And is m_nRefCount correctly aligned and volatile LONG?
The correctly aligned and LONG part are important, here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on a 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on a 32-bit boundary. Are there #pragma pack() instructions in your code? Does your project file change the default alignment (I assume you're working in Visual Studio)?
For the volatile part, InterlockedDecrement seems to generate a read/write memory barrier, so the volatile part should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).
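For comparison, here is a portable sketch of the same reference-counting pattern using std::atomic<long>, which sidesteps the LONG/alignment/volatile questions above because the type itself guarantees atomic access. This assumes a C++11 compiler. RefCounted is a hypothetical stand-in for CImageBuffer, and the destroyed flag exists only so that the destruction can be observed from outside.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the question's CImageBuffer.
class RefCounted {
public:
    // Optionally reports destruction through a flag so callers can observe it.
    explicit RefCounted(bool* destroyed_flag = nullptr)
        : ref_count_(1), destroyed_flag_(destroyed_flag) {}

    long AddRef() {
        return ref_count_.fetch_add(1, std::memory_order_relaxed) + 1;
    }

    long Release() {
        // acq_rel so that writes made by other owners are visible to the
        // thread that ends up running the destructor.
        long remaining = ref_count_.fetch_sub(1, std::memory_order_acq_rel) - 1;
        if (remaining == 0) {
            delete this;  // no member may be touched after this line
        }
        return remaining;
    }

private:
    // Private destructor: the object can only die through Release(), which
    // also rules out stack and static instances at compile time.
    ~RefCounted() {
        if (destroyed_flag_ != nullptr) *destroyed_flag_ = true;
    }

    std::atomic<long> ref_count_;
    bool* destroyed_flag_;
};
```

This does not protect against an unbalanced Release() or a caller using the object after its own Release(). Those remain the usual suspects when delete this crashes only occasionally.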