Infrequent segmentation fault in accessing boost::unordered_multimap or struct - c++

I'm having trouble debugging a segmentation fault. I'd appreciate tips on how to go about narrowing in on the problem.
The error appears when an iterator tries to access an element of a struct Infection, defined as:
struct Infection {
public:
    explicit Infection( double it, double rt ) : infT( it ), recT( rt ) {}
    double infT; // infection start time
    double recT; // scheduled recovery time
};
These structs are kept in a special structure, InfectionMap:
typedef boost::unordered_multimap< int, Infection > InfectionMap;
Every member of class Host has an InfectionMap carriage. Recovery times and associated host identifiers are kept in a priority queue. When a scheduled recovery event arises in the simulation for a particular strain s in a particular host, the program searches through that host's carriage to find the Infection whose recT matches the recovery time (double recoverTime). (For reasons that aren't worth going into, it's not as expedient for me to use recT as the key to InfectionMap; the strain s is more useful, and coinfections with the same strain are possible.)
assert( carriage.size() > 0 );
pair<InfectionMap::iterator,InfectionMap::iterator> ret = carriage.equal_range( s );
InfectionMap::iterator it;
for ( it = ret.first; it != ret.second; it++ ) {
    if ( ((*it).second).recT == recoverTime ) { // produces seg fault
        carriage.erase( it );
    }
}
I get a "Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: KERN_INVALID_ADDRESS at address..." on the line specified above. The recoverTime is fine, and the assert(...) in the code is not tripped.
As I said, this seg fault appears 'randomly' after thousands of successful recovery events.
How would you go about figuring out what's going on? I'd love ideas about what could be wrong and how I can further investigate the problem.
Update
I added a new assert and a check just inside the for loop:
assert( carriage.size() > 0 );
assert( carriage.count( s ) > 0 );
pair<InfectionMap::iterator,InfectionMap::iterator> ret = carriage.equal_range( s );
InfectionMap::iterator it;
cout << "carriage.count(" << s << ")=" << carriage.count(s) << endl;
for ( it = ret.first; it != ret.second; it++ ) {
    cout << "(*it).first=" << (*it).first << endl; // error here
    if ( ((*it).second).recT == recoverTime ) {
        carriage.erase( it );
    }
}
The EXC_BAD_ACCESS error now appears at the (*it).first call, again after many thousands of successful recoveries. Can anyone give me tips on how to figure out how this problem arises? I'm trying to use gdb. Frame 0 from the backtrace reads
"#0 0x0000000100001d50 in Host::recover (this=0x100530d80, s=0, recoverTime=635.91148029170529) at Host.cpp:317"
I'm not sure what useful information I can extract here.
Update 2
I added a break; after the carriage.erase(it). This works.

Correct me if I'm wrong, but erasing an item from an unordered multimap invalidates the iterator pointing at the erased element, and that is exactly the iterator your loop increments next. Try "it = carriage.erase(it)" when you erase, and only increment it otherwise. (ret.second should remain valid, since unordered containers only invalidate iterators to the erased elements and it points past the range you're erasing from.)
Update in reply to your latest update:
The reason breaking out of the loop after calling "carriage.erase(it)" fixed the bug is that you stopped incrementing an invalidated iterator.
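For concreteness, here is a minimal sketch of the corrected loop, using the definitions from the question. It assumes a Boost version whose erase(iterator) returns an iterator to the next element (recent versions do, mirroring C++11); the helper name eraseRecovered is mine:

#include <boost/unordered_map.hpp>
#include <utility>

struct Infection {
    explicit Infection( double it, double rt ) : infT( it ), recT( rt ) {}
    double infT; // infection start time
    double recT; // scheduled recovery time
};

typedef boost::unordered_multimap< int, Infection > InfectionMap;

void eraseRecovered( InfectionMap& carriage, int s, double recoverTime )
{
    std::pair<InfectionMap::iterator, InfectionMap::iterator> ret =
        carriage.equal_range( s );
    InfectionMap::iterator it = ret.first;
    while ( it != ret.second ) {
        if ( it->second.recT == recoverTime ) {
            // erase() invalidates only iterators to the erased element and
            // returns the next valid iterator; ret.second is unaffected.
            it = carriage.erase( it );
        } else {
            ++it; // advance only when nothing was erased
        }
    }
}

If at most one Infection can match a given recoverTime, keeping the break; after the erase is equally correct and sidesteps the invalidation question entirely.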

Compile the program with debugging symbols (e.g. g++ -g) and run it under gdb. When you get an EXC_BAD_ACCESS crash you'll drop into the gdb command line. At that point you can type bt to get a backtrace, which will show you how you got to the point where the crash occurred.

Related

Shared Lua state between pthreads seg-faults if not executing a coroutine

First of all, I know my question looks familiar, but I am actually not asking why a seg-fault occurs when sharing a Lua state between different pthreads. I am asking why it does not seg-fault in the specific case described below.
I tried to organize it as well as I could but I realize it is very long. Sorry about that.
A bit of background:
I am writing a program which is using the Lua interpreter as a base for the user to execute instructions and using the ROOT libraries (https://root.cern.ch/) to display graphs, histograms, etc...
All of this works just fine, but then I tried to implement a way for the user to start a background task while keeping the ability to input commands at the Lua prompt, so they can do something else entirely while the task finishes, or request that it stop, for instance.
My first attempt was the following:
First on the Lua side I load some helper functions and initialize global variables
-- Lua script
RootTasks = {}
NextTaskToStart = nil

function SetupNewTask(taskname, fn, ...)
    local task = function(...)
        local rets = table.pack(fn(...))
        RootTasks[taskname].status = "done"
        return table.unpack(rets)
    end

    RootTasks[taskname] = {
        task = SetupNewTask_C(task, ...),
        status = "waiting",
    }

    NextTaskToStart = taskname
end
Then on the C side
// inside the C++ script
int SetupNewTask_C ( lua_State* L )
{
    // just a function to check if the argument is valid
    if ( !CheckLuaArgs ( L, 1, true, "SetupNewTask_C", LUA_TFUNCTION ) ) return 0;

    int nvals = lua_gettop ( L );

    lua_newtable ( L );

    for ( int i = 0; i < nvals; i++ )
    {
        lua_pushvalue ( L, 1 );
        lua_remove ( L, 1 );
        lua_seti ( L, -2, i+1 );
    }

    return 1;
}
Basically, the user provides the function to execute followed by the parameters to pass, and this just builds a table with the function as the first entry and the arguments as the subsequent entries. The table ends up on top of the stack; I retrieve it and store it in a global variable.
The next step is on the Lua side
-- Lua script
function StartNewTask(taskname, fn, ...)
    SetupNewTask(taskname, fn, ...)
    StartNewTask_C()
    RootTasks[taskname].status = "running"
end
and on the C side
// In the C++ script
// 'lua', used below, is a pointer to the lua_State
// created when starting the Lua interpreter
void* NewTaskFn ( void* arg )
{
    // TryGetGlobalField is a helper function to get global fields from
    // strings like "something.field.subfield"

    // Retrieve the name of the task to be started (it has been pushed as
    // a global variable by the previous call to SetupNewTask_C)
    TryGetGlobalField ( lua, "NextTaskToStart" );
    if ( lua_type ( lua, -1 ) != LUA_TSTRING )
    {
        cerr << "Next task to schedule is undetermined..." << endl;
        return nullptr;
    }
    string nextTask = lua_tostring ( lua, -1 );
    lua_pop ( lua, 1 );

    // Now we get the actual table with the function to execute
    // and the arguments
    TryGetGlobalField ( lua, ( string ) ( "RootTasks." + nextTask ) );
    if ( lua_type ( lua, -1 ) != LUA_TTABLE )
    {
        cerr << "This task does not exist or has an invalid format..." << endl;
        return nullptr;
    }

    // The field "task" of that table contains the function
    // and the arguments
    lua_getfield ( lua, -1, "task" );
    if ( lua_type ( lua, -1 ) != LUA_TTABLE )
    {
        cerr << "This task has an invalid format..." << endl;
        return nullptr;
    }
    lua_remove ( lua, -2 );

    int taskStackPos = lua_gettop ( lua );

    // The first element of the table we retrieved is the function, so the
    // number of arguments for that function is the table length - 1
    int nargs = lua_rawlen ( lua, -1 ) - 1;

    // That will be the function
    lua_geti ( lua, taskStackPos, 1 );

    // And the arguments...
    for ( int i = 0; i < nargs; i++ )
    {
        lua_geti ( lua, taskStackPos, i+2 );
    }
    lua_remove ( lua, taskStackPos );

    // Reset the global variable NextTaskToStart, as we are about
    // to start the scheduled task.
    lua_pushnil ( lua );
    TrySetGlobalField ( lua, "NextTaskToStart" );

    // Let's go!
    lua_pcall ( lua, nargs, LUA_MULTRET, 0 );

    return nullptr; // a pthread start routine must return a value
}
int StartNewTask_C ( lua_State* L )
{
    pthread_t newTask;
    pthread_create ( &newTask, nullptr, NewTaskFn, nullptr );
    return 0;
}
So for instance a call in the Lua interpreter to
> StartNewTask("PeriodicPrint", function(str) for i=1,10 do print(str);
>> sleep(1); end end, "Hello")
will print "Hello" every second for the next 10 seconds. It will then return from execution and everything is wonderful.
Now if I ever hit the ENTER key while that task is running, the program dies in horrible seg-fault suffering (which I don't copy here, as the error log is different each time it seg-faults; sometimes there is no error output at all).
So I read a bit online about what could be the matter, and I found several mentions that a lua_State is not thread-safe. I don't really understand why just hitting ENTER makes it flip out, but that's not really the point here.
I discovered by accident that this approach can work without any seg-faulting after a tiny modification: instead of running the function directly, if it is wrapped in a coroutine, everything I wrote above works just fine.
Replace the previous Lua-side function SetupNewTask with:
function SetupNewTask(taskname, fn, ...)
    local task = coroutine.create( function(...)
        local rets = table.pack(fn(...))
        RootTasks[taskname].status = "done"
        return table.unpack(rets)
    end )

    local taskfn = function(...)
        coroutine.resume(task, ...)
    end

    RootTasks[taskname] = {
        task = SetupNewTask_C(taskfn, ...),
        routine = task,
        status = "waiting",
    }

    NextTaskToStart = taskname
end
I can execute several tasks at once for extended periods of time without getting any seg-faults. So we finally come to my questions:
Why does using a coroutine work? What is the fundamental difference in this case? I just call coroutine.resume and never yield (or do anything else for that matter), then just wait for the coroutine to be done, and that's it.
Are coroutines doing something I do not suspect?
That it seems as if nothing broke doesn't mean that it actually works, so…
What's in a lua_State?
(This is what a coroutine is.)
A lua_State stores this coroutine's state – most importantly its stack, CallInfo list, a pointer to the global_State, and a bunch of other stuff.
If you hit return in the REPL of the standard Lua interpreter, the interpreter tries to run the code you typed. (An empty line is also a program.) This involves putting it on the Lua stack, calling some functions, etc. etc. If you have code running in a different OS thread that is also using the same Lua stack/state… well, I think it's clear why this breaks, right? (One part of the problem is caching of stuff that "doesn't"/shouldn't change (but changes because another thread is also messing with it). Both threads are pushing/popping stuff on the same stack and step on each other's feet. If you want to dig through the code, luaV_execute may be a good starting point.)
So now you're using two different coroutines, and all the obvious sources of problems are gone. Now it works, right…? Nope, because coroutines share state,
The global_State!
This is where the "registry", string cache, and all the things related to garbage collection live. And while you got rid of the main "high-frequency" source of errors (stack handling), many many other "low-frequency" sources remain. A brief (non-exhaustive!) list of some of them:
You can potentially trigger a garbage collection step by any allocation, which will then run the GC for a bit, which uses its shared structures. And while allocations usually don't trigger the GC, the GCdebt counter that controls this is part of the global state, so once it crosses the threshold, allocations on multiple threads at the same time have a good chance to start the GC on several threads at once. (If that happens, it'll almost certainly explode violently.) Any allocation means, among others
creating tables, coroutines, userdata, …
concatenating strings, reading from files, tostring(), …
calling functions(!) (if that requires growing the stack or allocating a new CallInfo slot)
etc.
(Re-)Setting a thing's metatable may modify GC structures. (If the metatable has __gc or __mode, it gets added to a list.)
Adding new fields to tables, which may trigger a resize. If you're also accessing it from another thread during the resize (even just reading existing fields), well… *boom*. (Or not boom, because while the data may have moved to a different area, the memory where it was before is probably still accessible. So it might "work" or only lead to silent corruption.)
Even if you stopped the GC, creating new strings is unsafe because it may modify the string cache.
And then probably lots of other things…
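Not part of the original answer, but to make the rule "no two OS threads may touch the same global_State concurrently" concrete, here is a minimal hedged sketch that serializes all access to one shared lua_State behind a mutex; lua_state_mutex and run_chunk are hypothetical names:

#include <mutex>
#include <thread>

extern "C" {
#include <lua.h>
#include <lauxlib.h>
#include <lualib.h>
}

// Hypothetical mutex guarding every lua_* call on the shared state.
static std::mutex lua_state_mutex;
static lua_State* lua = nullptr;

// Run one chunk of Lua code while holding the lock, so no other OS
// thread can touch the shared global_State at the same time.
static void run_chunk ( const char* code )
{
    std::lock_guard<std::mutex> guard ( lua_state_mutex );
    luaL_dostring ( lua, code );
}

int main()
{
    lua = luaL_newstate();
    luaL_openlibs ( lua );

    // Both threads use the same lua_State, but never concurrently.
    std::thread t1 ( run_chunk, "for i = 1, 1e6 do local t = {i} end" );
    std::thread t2 ( run_chunk, "for i = 1, 1e6 do local s = tostring(i) end" );
    t1.join();
    t2.join();

    lua_close ( lua );
    return 0;
}

The catch is that holding the lock for a whole chunk makes the threads take turns rather than run in parallel; held across a long-running task, it would block the prompt for the task's entire duration, defeating the original purpose. The real alternatives are one completely independent lua_State per thread (created with luaL_newstate, sharing nothing) or building Lua with the lua_lock/lua_unlock hooks defined so the VM serializes itself.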
Making it Fail
For fun, you can re-build Lua and #define both HARDSTACKTESTS and HARDMEMTESTS (e.g. at the very top of luaconf.h). This will enable some code that will reallocate the stack and run a full GC cycle in many places. (For me, it does 260 stack reallocations and 235 collections just until it brings up the prompt. Just hitting return (running an empty program) does 13 stack reallocations and 6 collections.) Running your program that seems to work with that enabled will probably make it crash… or maybe not?
Why it might still "work"
So for instance a call in the Lua interpreter to
StartNewTask("PeriodicPrint", function(str)
    for i=1,10 do print(str); sleep(1); end
end, "Hello")
Will produce for the next 10 seconds a print of "Hello" every second.
In this particular example, there's not much happening. All the functions and strings are allocated before you start the thread. Without HARDSTACKTESTS, you might be lucky and the stack is already big enough. Even if the stack needs to grow, the allocation (& collection cycle because HARDMEMTESTS) may have the right timing so that it doesn't break horribly. But the more "real work" that your test program does, the more likely it will be that it will crash. (One good way to do that is to create lots of tables and stuff so the GC needs more time for the full cycle and the time window for interesting race conditions gets bigger. Or maybe just repeatedly run a dummy function really fast like for i = 1, 1e9 do (function() return i end)() end on 2+ threads and hope for the best… err, worst.)

Successful settimeofday() function randomly locks up application

I have a C++ application running on a Raspberry Pi (DietPi Distro - Jessie) and am using GPS data to update the system time at boot. The code is simple, however, it crashes or locks up the application about 50% of the time. No exceptions are thrown and I've tried to capture any stderr in a log file with no success. Occasionally I see a segmentation fault, but I think this may be unrelated.
The portion of the code that clearly causes the crash is "settimeofday(&tv, NULL)". If I comment out only that line, the application runs fine. Here's the segment of code that assigns the timeval 'tv' and changes the system time:
//Convert gps_data_t* member 'time' to timeval
timeval tv;
double wholeseconds, decimalseconds, offsettime;
offsettime = gpsdata->fix.time - (5.0 * 3600.0);
decimalseconds = modf(offsettime, &wholeseconds);
tv.tv_sec = static_cast<int32_t>(wholeseconds);
tv.tv_usec = static_cast<int32_t>(decimalseconds * 1000000.0);

//Set system time - THIS IS CAUSING CRASHES, WHY?
if ( settimeofday(&tv, NULL) >= 0) {
    std::cout << "Time set successful!" << '\n';
} else {
    std::cout << "Time set failure!" << '\n';
}
A point I would like to make is that setting the time succeeds even when the system crashes. I have seen it fail in the case where gpsdata->fix.time is 'NaN', and the code seems to handle that well and just reports a failure. My own theories about possible causes:
This is a multi-threaded program where several other threads are in a sleep state (std::this_thread::sleep_for() is used extensively). Does changing the system time while these threads are asleep interfere with when they come out of sleep?
I know there is a time service (NTP?) in the Debian distro that manages system time synchronization. Could this be interfering?
Anyways, I've got some more experimenting to do but it seems like something somebody may recognize immediately. All advice is appreciated.
A few other points, I've followed this link to remove the ntpd service and the issue still stands, ruling that cause out. Furthermore, I found this link that says changing the system time during a sleeping thread doesn't impact when it wakes up. So now my two theories are shot. Any other ideas are appreciated!
Because of the occasional segmentation fault, which may or may not be related to the freezing/crashing, I went ahead and updated the code to remove the only source of undefined behavior I could identify. I added uniform initialization for all the variables used in the modf function and made my timeval const. I also changed the type casts per the advice below. The behavior is still the same.
//Loop until first GPS lock to set system time
while ( (gpsdata == NULL) ||
        (gpsdata->fix.mode <= 1) ||
        (gpsdata->fix.time < 1) ||
        std::isnan(gpsdata->fix.time) ) {
    gpsdata = gps_rec.read();
}

//Convert gps_data_t* member 'time' to timeval
double offsettime{ gpsdata->fix.time - (5.0 * 3600.0) }; //5.0 hr offset for EST
double seconds{ 0.0 };
double microseconds{ 1000000.0 * modf(offsettime, &seconds) };
const timeval tv{ static_cast<time_t>(seconds),
                  static_cast<suseconds_t>(microseconds) };

//Set system time - THIS IS CAUSING CRASHES, WHY?
if ( settimeofday(&tv, NULL) >= 0) {
    std::cout << "Time set successful!" << '\n';
} else {
    std::cout << "Time set failure!" << '\n';
}

Memory leak using a class method

I have a problem with my program: I have a memory leak. I know this because when I look at the memory usage it never stops increasing, and then the program crashes.
I have noticed that it happens because of this line (when I comment it out there is no problem):
replica[n].SetTempId(i+1); // MEMORY LEAK ON THIS LINE !!
I have checked that n is never bigger than the size of my array (which is a vector, see below).
My class:
class Replica
{
    /* All the functions in this class are in the 'public:' section */
public:
    Replica();
    ~Replica();
    void SetEnergy(double E);
    void SetConfName(string config_name);
    void SetTempId(int id_temp);
    int GetTempId();
    string GetConfName();
    double GetEnergy();
    void SetUp();
    void SetDown();
    void SetZero();
    int GetUpDown();

    /* All the attributes in this class are in the 'private:' section */
private:
    double m_E;           // The energy of the replica
    string m_config_name; // The name of the config file associated
    int m_id_temp;        // The id of the temperature where the spin configuration is.
    int m_updown;
};
The method:
void Replica::SetTempId(int id_temp)
{
    m_id_temp = id_temp;
}
I initialised my objects like this:
vector<Replica> replica(n_temp); // we create a table that will contain information on the replicas
The constructor:
Replica::Replica() : m_E(0), m_config_name(""), m_updown(0), m_id_temp(0)
{
}
How I initialize my vector:
for(int i=0; i<=n_temp-1; i++) // we fill each replica object with information on a given spin configuration
{
    string temp_folder="";
    temp_folder = spin_folder + "/T=" + to_string(Tf[i]) + ".dat";
    replica[i].SetEnergy(Ef[i]+i);       // we save the energy of the config (to avoid recalculating it)
    replica[i].SetConfName(temp_folder); // we save the name of the file where the config is saved
                                         // (to avoid huge variables that would slow down the program)
    replica[i].SetTempId(i);
    replica[i].SetZero();
    if(i==0)
        replica[i].SetDown();
    if(i==(n_temp-1))
        replica[i].SetUp();
}
I am a beginner in C++, so it is probably a basic mistake.
Thank you for your help!
I have read your answers.
But it is hard to write a minimal example: I tried to delete some stuff, but as soon as I delete lines it works.
In fact the problem is very "random"; for example, when I delete this line:
replica[n].SetTempId(i+1);
it works, but I can also keep this line and delete another line of my code and it will work too (I don't know if you see what I mean).
The bug is very hard to find because of this "randomness"...
I can also say that when it crashes the program tells me:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
So, could you give me guesses as to what could cause this error (since I can't manage to write a minimal example)?
I don't do any dynamic allocation in my code.
The structure of my code is like this:
while(Nsw_c<Nsw)
{
    cout << "test1";
    // a
    // lot
    // of
    // code
    // with some Nsw_c++
    // and this cout at the end of the loop
    cout << " Nsw_c : " << Nsw_c << endl << " i " << i << " compteur_bug " << compteur_bug;
}
cout << "test2";
It ALWAYS freezes on the cout above, which is at the end of the loop.
I know this because neither test2 nor test1 is displayed when it freezes, and that cout is the next one to run.
Nsw, Nsw_c and i are integers lower than 100 (they are not too big).
To be more precise, if I replace the cout at the end of the loop with another cout like this:
cout << " test ";
it will also freeze at the same place.
In fact the program always freezes at the end of my while loop (just before evaluating the condition).
But Nsw and Nsw_c are not big at all, so that is why it is strange.
I tried to replace the condition Nsw_c < Nsw with just "1" and it didn't freeze anymore. So it is probably a problem with the condition, but both are just "normal" integers, so...
Thanks!
I have launched gdb (I just learnt to use it) and typed:
catch throw std::bad_alloc
The debugger then does this (I don't know if it can help):
not stopped at a C++ exception catchpoint
Catchpoint 1 (exception thrown), 0xb7f25290 in __cxa_throw () from /usr/lib/i386-linux-gnu/libstdc++.so.6

semop() failing at signaling

I'm trying to write a program in C++, compiled in GCC 4.6.1 on Ubuntu 11.10, and the IPC is giving me a hard time. To demonstrate, here's my code for signaling a semaphore, with semid and semnum already supplied:
struct sembuf x;
x.sem_num = semnum;
x.sem_op = 1;
x.sem_flg = SEM_UNDO;

int old_value = semctl(semid, 0, GETVAL);
if(semop(semid, &x, 1) < 0)
{
    std::cerr << "semaphore failed to signal" << std::endl;
}
else if(semctl(semid, 0, GETVAL) == old_value)
{
    std::cerr << "signal returned OK, but didn't work" << std::endl;
}
The code for "wait" is similar; the main difference, of course, is that sem_op is set to -1. Sometimes I get the first error message here, but as often as not I get the second, which makes no sense at all to me. The first, I imagine I could hunt for an error code (though I'm not sure if that depends on C++11 features I'm not supposed to use), but I've got no idea how to even begin addressing the second. Rebooting didn't work. GDB isn't being much help, especially when "next" and "step" seem to jump around back and forth instead of going forward in sequence.

Seg faults with pthreads_mutex

I am implementing a particle interaction simulator in pthreads, and I keep getting segmentation faults in my pthreads code. The fault occurs in the following loop, which each thread runs at the end of each timestep in my thread_routine:
for (int i = first; i < last; i++)
{
    get_id(particles[i], box_id);
    pthread_mutex_lock(&locks[box_id.x + box_no * box_id.y]);
    //cout << box_id.x << "," << box_id.y << "," << thread_id << "l" << endl;
    box[box_id.x][box_id.y].push_back(&particles[i]);
    //cout << box_id.x << box_id.y << endl;
    pthread_mutex_unlock(&locks[box_id.x + box_no * box_id.y]);
}
The strange thing is that if I uncomment one (it doesn't matter which one) or both of the couts, the program runs as expected, with no errors occurring (but this obviously kills performance, and isn't an elegant solution), giving correct output.
box is a globally declared
vector < vector < vector < particle_t*> > > box
which represents a decomposition of my (square) domain into boxes.
When the loop starts, box[i][j].size() has been set to zero for all i, j, and the loop is supposed to put particles back into the box-structure (the get_id function gives correct results, I've checked)
The array of locks is declared globally as
pthread_mutex_t * locks;
and its size is set and the locks initialized by thread 0 before the other threads are created:
locks = (pthread_mutex_t *) malloc( box_no*box_no * sizeof( pthread_mutex_t ) );
for (int i = 0; i < box_no*box_no; i++)
{
    pthread_mutex_init(&locks[i], NULL);
}
Do you have any idea what could cause this? The code also runs if the number of processors is set to 1, and it seems that the more processors I run on, the earlier the seg fault occurs (it has run through the entire simulation once on two processors, but that seems to be an exception).
Thanks
This is only an educated guess, but based on the problem going away if you use one lock for all the boxes: push_back has to allocate memory, which it does via the std::allocator template. I don't think allocator is guaranteed to be thread-safe and I don't think it's guaranteed to be partitioned, one for each vector, either. (The underlying operator new is thread-safe, but allocator usually does block-slicing tricks to amortize operator new's cost.)
Is it practical for you to use reserve to preallocate space for all your vectors ahead of time, using some conservative estimate of how many particles are going to wind up in each box? That's the first thing I'd try.
The other thing I'd try is using one lock for all the boxes, which we know works, but moving the lock/unlock operations outside the for loop so that each thread gets to stash all its items at once. That might actually be faster than what you're trying to do -- less lock thrashing.
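For illustration, a sketch of that second suggestion; box, particles, get_id, first, and last are the names from the question, while global_lock, stash_particles, and the locally declared box_id are hypothetical:

// Hypothetical single lock shared by all boxes; initialized once
// before the worker threads start.
pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

// Inside thread_routine: lock once per timestep instead of once per
// particle, so each thread stashes all of its items in one go.
void stash_particles(int first, int last)
{
    box_id_t box_id; // assuming the type used for box_id in the question

    pthread_mutex_lock(&global_lock);
    for (int i = first; i < last; i++)
    {
        get_id(particles[i], box_id);
        box[box_id.x][box_id.y].push_back(&particles[i]);
    }
    pthread_mutex_unlock(&global_lock);
}

Combining this with a box[x][y].reserve(estimate) pass before the threads start (the first suggestion) would remove both the allocator contention and most of the locking overhead.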
Are the box and box[i] vectors initialized properly? You only say the innermost set of vectors are set. Otherwise it looks like box_id's x or y component is wrong and running off the end of one of your arrays.
What part of the loop is it crashing on?