I am doing simple string operations in code that crashes with a segmentation fault, and I cannot work out what the exact problem is.
Can someone please take a look?
The backtrace from the core is:
(gdb) bt
#0 0x00007f595dee41da in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1 0x00007f595deea105 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#2 0x0000000000401d04 in getNodeInfo (node=0x7fffbfb4ba83 "TCU-0")
at hwdetails.cpp:294
#3 0x0000000000402178 in main (argc=3, argv=0x7fffbfb4aef8)
at hwdetails.cpp:369
The crash occurs at line 294, which is the cout statement.
LdapDN is a char * and is not NULL.
if ( Epath && (Epath->Entry[0].EntityType == SAHPI_ENT_UNSPECIFIED ||
               Epath->Entry[0].EntityType == SAHPI_ENT_ROOT )) {
    // nothing is mapped. Degrade the ldap dn path to slot.
    if (LdapDN) {
        std::cout << "LdapDN " << LdapDN << std::endl;
    }
    std::string ldapDN;
    ldapDN = LdapDN;
    std::string slot = LDAP_PIU_ID;
    if ( ldapDN.compare(0, slot.length(), slot) != 0 ) {
        size_t pos = ldapDN.find(slot);
        if ( pos != std::string::npos ) {
            ldapDN = ldapDN.substr(pos);
            LdapDN = (char *)ldapDN.c_str();
            //getEntityPathFromLdapDn(ldapDN.c_str(), epath, domid);
        }
    }
}
A crash in _dl_fixup generally means that you have corrupted the state of the runtime loader.
The two most common causes are:
Heap corruption (overflow) or
Mismatched parts of glibc itself.
If you are not setting e.g. LD_LIBRARY_PATH to point to a non-standard glibc, then we can forget about reason #2.
For #1, run your program under Valgrind, and make sure it detects no errors.
If Valgrind in fact reports no errors, use the disas and info registers GDB commands, update your question with their output, and you may receive additional help.
This is a problem with the GOT (global offset table). _dl_runtime_resolve is the procedure that updates the GOT when a function from a dynamic library is called for the first time. Subsequent calls use the updated GOT entry.
When a function from a dynamic library (for example printf() from libc.so) is called in your code for the first time:
go to the PLT (procedure linkage table). The PLT is a trampoline that gets the correct address of the function being called from the GOT.
from the PLT, go to the GOT
return to the PLT
call _dl_runtime_resolve
store the actual function address in the GOT
call the function from the dynamic library
The second time the function is called:
go to the PLT
go to the GOT
the GOT entry now holds a direct jump to the function's address in the dynamic library.
From then on the GOT entry refers straight to the function, so repeated calls are fast and do not go through _dl_runtime_resolve.
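The lazy-binding steps above can be modelled with an ordinary function pointer. This is only a toy sketch, not how ld.so is implemented: gotSlot, resolverStub, realSquare and callViaPlt are made-up names, and a single pointer stands in for one GOT entry.

```cpp
// Toy model of lazy binding: one "GOT slot" that starts out pointing at a
// resolver and is patched to the real function on the first call.
// All names here are hypothetical; the real mechanism lives in ld.so.
using Fn = int (*)(int);

int resolveCount = 0;                     // how many times the slow path ran

int realSquare(int x) { return x * x; }   // stands in for the library function

int resolverStub(int x);                  // forward declaration for the initial slot value
Fn gotSlot = resolverStub;                // models a single GOT entry

// Models _dl_runtime_resolve: find the real target, patch the slot, then call it.
int resolverStub(int x) {
    ++resolveCount;
    gotSlot = realSquare;
    return gotSlot(x);
}

// Models the PLT trampoline: always jump through whatever the slot holds.
int callViaPlt(int x) { return gotSlot(x); }
```

Calling callViaPlt twice runs resolverStub only on the first call. If something overwrites gotSlot, or the data the resolver depends on, in between, the next indirect jump goes wrong, which is the kind of failure that surfaces as a crash inside _dl_fixup.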
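I see a memory leak here:
You essentially lose your previous LdapDN string when you do the following. LdapDN also ends up pointing into the buffer of the local ldapDN, so it dangles once ldapDN goes out of scope.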
if ( pos != std::string::npos ) {
ldapDN = ldapDN.substr(pos);
LdapDN = (char *)ldapDN.c_str();
//getEntityPathFromLdapDn(ldapDN.c_str(), epath, domid);
}
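A minimal sketch of the same trimming logic that stays in std::string throughout, so no raw pointer outlives the buffer it points into. degradeToSlot and its parameter names are hypothetical; slot stands in for LDAP_PIU_ID from the question.

```cpp
#include <string>

// Hypothetical helper: returns dn trimmed so that it starts at the first
// occurrence of slot. Returns dn unchanged if slot is already a prefix
// or does not occur at all.
std::string degradeToSlot(const std::string& dn, const std::string& slot) {
    if (dn.compare(0, slot.length(), slot) == 0) {
        return dn;                       // already starts with the slot id
    }
    const std::string::size_type pos = dn.find(slot);
    if (pos == std::string::npos) {
        return dn;                       // slot id not present
    }
    return dn.substr(pos);               // the caller owns the returned copy
}
```

The caller keeps the returned std::string alive for as long as it needs the text and calls c_str() only at the point where a C API requires it.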
Related
Program in C/C++ runs on embedded PowerPC under debugger with HW break points capabilities.
There is global variable 'char Name[256]' known in 2 files and 2 tasks correspondingly. One task reads Name, another fills it with a text, '1234567...', for example.
At some moment, global variable Name gets corrupted. When asked for the variable address gdb shows (and application prints by debug printouts) address equal to 0x31323334.
How can I catch this bug with HW breakpoints? That is, at what address should I put the HWBP?
When I look into assembler, I see:
lis 9,Name@ha
lwz 9,Name@l(9)
So how can memory corruption change the code without affecting the application flow? Shouldn't it crash immediately?
Thanks a lot in advance.
0x31323334 is "1234" sans null terminator. Further, "global variable address corruption" does not make much sense for global variables (whose addresses do not change), nor for an array of size 256 (unless you're using a pointer somewhere and it's the pointer that is being corrupted). So I suspect you might be unfamiliar with GDB.
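To see why 0x31323334 reads as "1234", here is a small sketch that packs four characters into a 32-bit word the way a big-endian load would. It assumes the PowerPC target is running big-endian; bytesAsBigEndianU32 is a hypothetical helper name.

```cpp
#include <cstdint>

// Hypothetical helper: interprets the first four bytes of `bytes` as one
// 32-bit word with the first byte most significant (big-endian order).
std::uint32_t bytesAsBigEndianU32(const char* bytes) {
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i) {
        value = (value << 8) | static_cast<unsigned char>(bytes[i]);
    }
    return value;
}
```

With the input "1234" this yields 0x31323334, so a pointer-sized slot holding that value has very likely been overwritten with the start of the text being copied into Name.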
When using GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1 on x86 (admittedly, not ppc, but basically the same software) with the following test file:
// g++ test.cpp -g
#include <iostream>
char Name[256] = "123456789";
int main() {
Name[0] = 'a';
std::cout << Name << std::endl;
}
I can get the following output from GDB:
(gdb) break main
Breakpoint 2 at 0x40086a: file test.cpp, line 6.
(gdb) r
Starting program: /home/keithb/dev/mytest/a.out
Breakpoint 2, main () at test.cpp:6
6 Name[0] = 'a';
(gdb) whatis Name
type = char [256]
(gdb) print Name
$1 = "123456789", '\000' <repeats 246 times>
(gdb) print &Name
$2 = (char (*)[256]) 0x6010c0 <Name>
In any case, if you really do want to set a "hardware breakpoint" (GDB calls those "watchpoints"), you can get the address of Name prior to corruption. Then just set the watchpoint and wait for your program to write to that address.
(gdb) c
Continuing.
a23456789
[Inferior 1 (process 21878) exited normally]
(gdb) delete 2
(gdb) watch *0x6010c0
Hardware watchpoint 3: *0x6010c0
(gdb) r
Starting program: /home/keithb/dev/mytest/a.out
Hardware watchpoint 3: *0x6010c0
Old value = 875770417
New value = 875770465
main () at test.cpp:7
7 std::cout << Name << std::endl;
(gdb)
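The Old value and New value in that transcript are the first four bytes of Name read as one int, because watch *0x6010c0 watches an int-sized location. A small sketch that decodes them, assuming a little-endian x86 host as in the session above; littleEndianU32AsText is a hypothetical helper name.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: splits a 32-bit value into its four bytes, least
// significant first, which is the order they occupy in memory on x86.
std::string littleEndianU32AsText(std::uint32_t value) {
    std::string text;
    for (int i = 0; i < 4; ++i) {
        text.push_back(static_cast<char>(value & 0xFFu));
        value >>= 8;
    }
    return text;
}
```

Decoding 875770417 gives "1234" and 875770465 gives "a234", which matches the single write Name[0] = 'a'.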
My program gets a segmentation fault when I run it normally. However, it works just fine if I run it under gdb. Moreover, the rate of segmentation faults increases when I increase the sleep time in the philo function. I am using Ubuntu 12.04. Any help or pointers are appreciated. Here is my code:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <time.h>
#include <semaphore.h>
#include <errno.h>
#define STACKSIZE 10000
#define NUMPROCS 5
#define ROUNDS 10
int ph[NUMPROCS];
//cs[i] is the chopstick between philosopher i and i+1
sem_t cs[NUMPROCS], dead;
int philo() {
int i = 0;
int cpid = getpid();
int phno;
for (i=0; i<NUMPROCS; i++)
if(ph[i] == cpid) phno = i;
for (i=0; i < ROUNDS ; i++){
// Add your entry protocol here
if (sem_wait(&dead) != 0) {
perror(NULL);
return 1;
}
if (sem_wait(&cs[phno]) != 0) {
perror(NULL);
return 1;
}
if (sem_wait(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0){
perror(NULL);
return 1;
}
// Start of critical section -- simulation of slow n++
int sleeptime = 20000 + rand()%50000;
printf("philosopher %d is eating by chopsticks %d and %d\n",phno,phno,(phno-1+NUMPROCS)%NUMPROCS);
usleep(sleeptime) ;
// End of critical section
// Add your exit protocol here
if (sem_post(&dead) != 0) {
perror(NULL);
return 1;
}
if (sem_post(&cs[phno]) != 0) {
perror(NULL);
return 1;
}
if (sem_post(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0){
perror(NULL);
return 1;
}
}
return 0;
}
int main( int argc, char ** argv){
int i;
void* stack[NUMPROCS];
srand(time(NULL));
//initialize semaphores
for (i=0; i<NUMPROCS; i++) {
if (sem_init(&cs[i],1,1) != 0){
perror(NULL);
return 1;
}
}
if (sem_init(&dead,1,4) != 0){
perror(NULL);
return 1;
}
for (i = 0; i < NUMPROCS; i++){
stack[i] = malloc(STACKSIZE) ;
if ( stack[i] == NULL ) {
printf("Error allocating memory\n") ;
exit(1) ;
}
// create a child that shares the data segment
ph[i] = clone(philo, stack[i]+STACKSIZE-1, CLONE_VM|SIGCHLD, NULL) ;
if (ph[i] < 0) {
perror(NULL) ;
return 1;
}
}
for (i=0; i < NUMPROCS; i++) wait(NULL);
for (i=0; i < NUMPROCS; i++) free(stack[i]);
return 0 ;
}
A typical Heisenbug: if you look at it, it disappears. In my experience, getting a segv only outside gdb (or vice versa) is a sign of using uninitialized memory or of depending on actual pointer addresses. Normally valgrind is ruthlessly accurate in detecting those. Unfortunately (my) valgrind cannot handle your clone outside the pthread context.
Visual inspection suggests it is not a memory problem. Only the stacks are allocated on the heap and their use looks OK, except that you hold them in a void * pointer and then do arithmetic on it, which is not allowed in standard C (it is a GNU extension). Using a char * would be proper, but the GNU extension does what you want.
Subtracting one from the top address of the stack is probably not necessary and might cause alignment errors on simple implementations of clone, but again I don't think that is the problem, as clone most likely will align the stack top again. And admittedly the manual page of clone is not very clear about the exact location of the address: "topmost address of the memory space".
Just waiting for a state change of a child and assuming it died is a bit sloppy and then taking away its stack might lead to segmentation faults, but again I don't think that is the problem, because you are probably not frantically sending signals to your philosophers.
If I run your application, the philosophers can finish their dinner undisturbed both inside and outside gdb, so the following is a guess. Let's call the parent process that clones philosophers "the table". Once a philosopher is cloned, the table stores the returned pid in ph; say it assigns that number to a chair. The first thing a philosopher does is look for his chair. If he doesn't find it, he will have an uninitialized phno, which is then used to access his semaphores. This may very well lead to segmentation faults.
The implementation assumes that control is returned to the table before the philosophers start. I can't find such a guarantee in the manual page and I would actually expect this not to be true. Also, the clone interface can place process ids in memory shared between the child and the parent, suggesting this is a recognized problem (see parameters ptid and ctid). If those are used, the pid will be written before either the table or the just-cloned philosopher gets control.
It is highly possible that this error explains the difference between inside and outside gdb, because gdb is well aware of the processes that are spawned under its supervision and may treat them differently than the operating system.
Alternatively you could assign a semaphore to the table. So nobody sits at the table until the table says so, obviously after it assigned all chairs. This would make a much better use for the semaphore dead.
BTW. You are of course fully aware that the setup of your solution does allow for the situation where all philosophers end up each having one fork (eh chopstick) and starve to death waiting for the other. Luckily chances of that happening are very slim.
ph[i] = clone(philo, stack[i]+STACKSIZE-1, CLONE_VM|SIGCHLD, NULL) ;
This creates a thread of execution, which glibc knows nothing about. As such, glibc does not create any thread-specific internal structures that it needs for e.g. dynamic symbol resolution.
With such setup, calling into any glibc function from your philo function invokes undefined behavior, and you sometimes crash (because the dynamic loader will use main thread's private data to perform symbol resolution, and because the loader assumes that each thread has its own private area, but you've violated this assumption by creating clones which share the single private area "behind glibc's back").
If you look at a core dump, there is a high chance that the actual crash happens in ld.so, which would confirm my guess.
Don't ever use clone directly (unless you know what you are doing). Use pthread_create instead.
Here is what I see in the core that I just got (which is exactly the problem I described):
Program terminated with signal 4, Illegal instruction.
#0 _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
239 vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
(gdb) bt
#0 _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
#1 0x00007fb694e1dc45 in _dl_fixup (l=<optimized out>, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:127
#2 0x00007fb694e0dee5 in _dl_runtime_resolve () at ../sysdeps/x86_64/dl-trampoline.S:42
#3 0x00000000004009ec in philo ()
#4 0x00007fb69486669d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
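A hedged sketch of how the philosophers might look when created through the threading library instead of raw clone. It swaps in std::thread and std::mutex for pthread_create and the POSIX semaphores in the question, passes each philosopher its index as an argument so nobody has to look up a pid in ph, and always locks the lower-numbered chopstick first so the all-hold-one deadlock cannot occur. runPhilosophers and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: runs `count` philosophers for `rounds` meals each and
// returns the total number of meals eaten. Needs at least two philosophers.
int runPhilosophers(int count, int rounds) {
    if (count < 2) {
        return 0;  // with one philosopher both chopsticks would be the same mutex
    }
    std::vector<std::mutex> chopsticks(count);
    std::atomic<int> meals{0};
    std::vector<std::thread> philosophers;

    for (int id = 0; id < count; ++id) {
        // id is captured by value: each thread knows its index from the start.
        philosophers.emplace_back([&chopsticks, &meals, id, count, rounds] {
            const int left = id;
            const int right = (id - 1 + count) % count;
            // Fixed global lock order (lower index first) prevents deadlock.
            const int first = std::min(left, right);
            const int second = std::max(left, right);
            for (int round = 0; round < rounds; ++round) {
                std::lock_guard<std::mutex> a(chopsticks[first]);
                std::lock_guard<std::mutex> b(chopsticks[second]);
                ++meals;  // "eating" happens while both chopsticks are held
            }
        });
    }
    for (std::thread& philosopher : philosophers) {
        philosopher.join();  // wait for every philosopher before cleaning up
    }
    return meals.load();
}
```

Because the threads are created by the library, each one gets its own thread-local area, so lazy symbol resolution inside calls such as printf has the per-thread state it expects. On older toolchains this needs to be built with -pthread.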
I have encountered a strange problem when compiling my program using 64-bit g++ 4.7.0 on a Fedora 17 x86_64 machine (the same program works well on a 32-bit Fedora).
The program is too complicated and I cannot figure out an easy way to produce a small code sample. But from the following gdb record, you can see the problem.
Program received signal SIGSEGV, Segmentation fault.
0x000000000042a4b0 in boost::shared_ptr<cppPNML::details::ddObj>::operator!(this=0x100000007)
at /usr/include/boost/smart_ptr/detail/operator_bool.hpp:55
55 return px == 0;
Missing separate debuginfos, use: debuginfo-install gnome-keyring-3.4.1-3.fc17.x86_64
(gdb) bt
#0 0x000000000042a4b0 in boost::shared_ptr<cppPNML::details::ddObj>::operator! (this=0x100000007)
at /usr/include/boost/smart_ptr/detail/operator_bool.hpp:55
#1 0x00000000004202a5 in cppPNML::pnNode::getBBox (this=0xffffffff) at cpp_pnml.cpp:131
#2 0x000000000040eca4 in draw_page (g=..., painter=...) at pnml2pdf.cpp:178
#3 0x000000000040e3b9 in main (argc=2, argv=0x7fffffffe188) at pnml2pdf.cpp:106
(gdb) up
#1 0x00000000004202a5 in cppPNML::pnNode::getBBox (this=0xffffffff) at cpp_pnml.cpp:131
131 if(!p_) return pair<double, double>(0,0);
(gdb) up
#2 0x000000000040eca4 in draw_page (g=..., painter=...) at pnml2pdf.cpp:178
178 boost::tie(w, h) = node.getBBox();
(gdb) p node
$1 = {<cppPNML::pnObj> = {_vptr.pnObj = 0x79a490, p_ = {px = 0x7c40a0, pn = {pi_ = 0x7c4170}}}, <No data fields>}
(gdb) l
173 QRectF bound(0,0,0,0);
174
175 // nodes
176 for(pnNode node = g.front<pnNode>(); node.valid(); node = node.next()) {
177 double h, w, x, y, wa, ha, xa, ya, angle;
178 boost::tie(w, h) = node.getBBox();
179 angle = atan2(h, w);
180 boost::tie(x, y) = node.getPosition();
181 wa = 0; ha = 0; xa = 0; ya = 0;
182
(gdb)
The program being debugged is a graphic printing program (pnml2pdf) that draws a graph to PDF using Qt4.
The object node belongs to class pnNode, which is defined by my own graphic data structure library (quite complex, https://github.com/wsong83/cppPNML).
The segmentation fault occurs where the smart pointer appears to be uninitialized.
From the backtrace you can see that the this pointer of node.getBBox() is invalid.
However, printing node from one frame up shows that it is actually OK.
I am totally confused here.
Does anyone have a clue, or need more code? Thanks in advance!
Update:
Thanks to the advice from @atzz, I am now certain that the calculation of the this pointer in the member method getBBox() produced a wrong address. The problem is not caused by any source code error (directly linking the object files eliminates the segmentation fault), but by the 64-bit static library generation command "ar" (pnNode is defined in a static library rather than an object file). It now seems that the static library is wrong and causes the wrong this calculation.
Still digging... Will update the result if anyone is still interested to know.
Is this an optimised build or a debug build? Looks to me like it should be failing on line 176 not line 178.
Are you sure the loop is right? Looks like you are going over the end. I suspect your implementation of node.valid() either doesn't do the right thing, or is the wrong thing for the loop test.
The value 0xffffffff looks like an iterator end() value, so I think you either need to test your loop against that, or make sure valid() rejects it, e.g. pnObj::valid() const { return p_ != NULL && p_ != 0xffffffff; }
Also the way you are implementing next() just looks wrong. Creating an iterator, searching for the string ID and then calling next() on the iterator?
I'm using gdb to debug some c++ code. At the moment the code I'm looking at iterates through an array of pointers, which are either a pointer to some object or a NULL pointer.
If I just display list[index]->member, gdb complains when list[index] is null. Is there any way to display the member only if list[index] is not null? I know you can set conditional breakpoints (condition <bp-num> <exp>) but I'm not sure how that would help.
The code in question is:
for (int i=0;i<BSIZE*BSIZE;i++){
if (vms[i]==target) {valid=true; break;}
}
where vms is the array of pointers.
Since display accepts arbitrary expressions, you can try something like the following display command:
display (list[index]) ? list[index]->member : "null"
I'm not sure if that cleans things up well enough for what you want - you'll still get a display, but it won't be a complaint.
Basically the condition works like this:
#include <iostream>
int main() {
for (int i=0; i<10; ++i) {
std::cerr << i << std::endl;
}
}
You can debug it like this:
(gdb) break 5
Breakpoint 1 at 0x100000d0e: file foobar.cpp, line 5.
(gdb) condition 1 i==3
(gdb) r
Starting program: /private/tmp/foobar
Reading symbols for shared libraries ++. done
0
1
2
Breakpoint 1, main () at foobar.cpp:5
5 std::cerr << i << std::endl;
I'm having trouble debugging a segmentation fault. I'd appreciate tips on how to go about narrowing in on the problem.
The error appears when an iterator tries to access an element of a struct Infection, defined as:
struct Infection {
public:
explicit Infection( double it, double rt ) : infT( it ), recT( rt ) {}
double infT; // infection start time
double recT; // scheduled recovery time
};
These structs are kept in a special structure, InfectionMap:
typedef boost::unordered_multimap< int, Infection > InfectionMap;
Every member of class Host has an InfectionMap carriage. Recovery times and associated host identifiers are kept in a priority queue. When a scheduled recovery event arises in the simulation for a particular strain s in a particular host, the program searches through carriage of that host to find the Infection whose recT matches the recovery time (double recoverTime). (For reasons that aren't worth going into, it's not as expedient for me to use recT as the key to InfectionMap; the strain s is more useful, and coinfections with the same strain are possible.)
assert( carriage.size() > 0 );
pair<InfectionMap::iterator,InfectionMap::iterator> ret = carriage.equal_range( s );
InfectionMap::iterator it;
for ( it = ret.first; it != ret.second; it++ ) {
if ( ((*it).second).recT == recoverTime ) { // produces seg fault
carriage.erase( it );
}
}
I get a "Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: KERN_INVALID_ADDRESS at address..." on the line specified above. The recoverTime is fine, and the assert(...) in the code is not tripped.
As I said, this seg fault appears 'randomly' after thousands of successful recovery events.
How would you go about figuring out what's going on? I'd love ideas about what could be wrong and how I can further investigate the problem.
Update
I added a new assert and a check just inside the for loop:
assert( carriage.size() > 0 );
assert( carriage.count( s ) > 0 );
pair<InfectionMap::iterator,InfectionMap::iterator> ret = carriage.equal_range( s );
InfectionMap::iterator it;
cout << "carriage.count(" << s << ")=" << carriage.count(s) << endl;
for ( it = ret.first; it != ret.second; it++ ) {
cout << "(*it).first=" << (*it).first << endl; // error here
if ( ((*it).second).recT == recoverTime ) {
carriage.erase( it );
}
}
The EXC_BAD_ACCESS error now appears at the (*it).first call, again after many thousands of successful recoveries. Can anyone give me tips on how to figure out how this problem arises? I'm trying to use gdb. Frame 0 from the backtrace reads
"#0 0x0000000100001d50 in Host::recover (this=0x100530d80, s=0, recoverTime=635.91148029170529) at Host.cpp:317"
I'm not sure what useful information I can extract here.
Update 2
I added a break; after the carriage.erase(it). This works.
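Correct me if I'm wrong, but I would bet that erasing an item from an unordered multimap invalidates every iterator pointing to that item, so the it++ that follows the erase operates on an invalid iterator. Try "it = carriage.erase(it)" and advance it only when nothing was erased. ret.first may refer to the erased item too, so don't reuse it afterwards.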
Update in reply to your latest update:
The reason breaking out of the loop after calling "carriage.erase(it)" fixed the bug is that you stopped trying to access an erased iterator.
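If more than one matching entry can exist, break is not enough. A minimal sketch of the erase-while-iterating pattern follows, using std::unordered_multimap as a stand-in for boost::unordered_multimap (an assumption; erase returns the next iterator in both). eraseRecovered is a hypothetical name and Infection is reduced to its two fields.

```cpp
#include <cstddef>
#include <unordered_map>

struct Infection {
    double infT;  // infection start time
    double recT;  // scheduled recovery time
};

using InfectionMap = std::unordered_multimap<int, Infection>;

// Hypothetical helper: erases every infection with strain s whose recT equals
// recoverTime and returns how many entries were removed.
std::size_t eraseRecovered(InfectionMap& carriage, int s, double recoverTime) {
    std::size_t erased = 0;
    auto range = carriage.equal_range(s);
    for (auto it = range.first; it != range.second; /* advanced in the body */) {
        if (it->second.recT == recoverTime) {
            it = carriage.erase(it);  // erase returns the next valid iterator
            ++erased;
        } else {
            ++it;                     // only step forward when nothing was erased
        }
    }
    return erased;
}
```

range.second stays usable because erase only invalidates iterators to the element that was removed, and the loop never erases the element range.second refers to.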
Compile the program with gcc -g and run it under gdb. When you get an EXC_BAD_ACCESS crash you'll drop into the gdb command line. At that point you can type bt to get a backtrace, which will show you how you got to the point where the crash occurred.