gdb strange behaviour ([next] jumps a few lines back in a block of code) - c++

I have noticed some pretty bizarre behaviour from gdb when debugging a straight block of code.
I ran gdb normally with the following commands:
gdb ./exe
break main
run
next
then [enter] a few times.
What I got as a result was:
35 world.generations(generations);
(gdb)
36 world.popSize(100);
(gdb)
37 world.eliteSize(5);
(gdb)
41 world.setEvaluationFnc( eval );
(gdb)
37 world.eliteSize(5);
(gdb)
39 world.pXOver(0.9);
(gdb)
38 world.pMut(0.9);
(gdb)
41 world.setEvaluationFnc( eval );
(gdb)
There is absolutely no reason to run over those lines twice. I do not understand this behaviour. The code looks as follows:
(gdb) list 39
34 SimpleGA<MySpecimen> world;
35 world.generations(generations);
36 world.popSize(100);
37 world.eliteSize(5);
38 world.pMut(0.9);
39 world.pXOver(0.9);
40
41 world.setEvaluationFnc( eval );
42
43 world.setErrorSink(stderrSink);
I am not sure whether I should disregard it or whether there is something wicked going on in my code. The app uses OpenMP and is compiled to use it; however, info threads says there is only one thread running. Also, everything seems to give proper results, and even if those lines were executed twice there should be no problem, as they are mostly plain setters.
Has anyone seen something like this, or do you have any hints on where to investigate? I failed on my own =).
Thanks for hints,
luk32.

Most likely it is the compiler rearranging the code. I suppose the "new" order still works correctly?
If possible, try to debug with optimizations turned off; that increases the likelihood of the executable staying closer to the source code.

Related

Qt slot callback breaks straight into my destructor

Take a look at this amazing call stack:
1 FNFolderTreeDir::refreshSubDir fnfoldertreedir.cpp 42 0x13faa3287
2 FNFolderTreeDriveRootHive::processDirLoaded fnfoldertreedriveroothive.cpp 25 0x13faa2175
3 FNFolderTreeDriveRootHive::qt_static_metacall moc_fnfoldertreedriveroothive.cpp 74 0x13faa5df5
4 QMetaObject::activate qobject.cpp 3742 0x51744922
... lots of Qt internals here ...
33 _purecall purevirt.c 58 0x7feedc2d0cc
34 FNFolderTreeNode::purge fnfoldertreenode.cpp 28 0x13fa9cf53
35 FNFolderTreeNode::~FNFolderTreeNode fnfoldertreenode.cpp 23 0x13fa9ceeb
FNFolderTreeDriveRootHive::processDirLoaded is a slot connected to the QFileSystemModel::directoryLoaded signal. So, what happened here is that my destructor was happily doing some internal cleanup when Qt interrupted it at a seemingly random point to call a slot to refresh the very object that has been reduced to the base class already. Of course, it crashed.
A couple of questions, if I may:
How is this even possible? I suspect that Qt used either QAbstractItemModel::beginRemoveColumns or endRemoveRows to dispatch the callback - I call them both in FNFolderTreeNode::purge. Should it not be doing this in the main loop only?
How can I prevent this behaviour? I could try disconnecting the slot as the first thing in the destructor, but where is the guarantee that that will not be interrupted too?

gdb - identify the reason of the segmentation fault

I have a server/client system that runs well on my machines, but it core dumps on one of the users' machines (OS: CentOS 5). Since I don't have access to the user's machine, I built a debug-mode binary and asked the user to try it. The crash did happen again after around two days of running, and he sent me the core dump file. Loading the core dump file with gdb does show the crash location, but I don't understand the reason (sorry, my previous experience is mostly with Windows; I don't have much experience with Linux/gdb). I would like to have your input. Thanks!
1. the /var/log/messages at the user's machine shows the segfault:
Jan 16 09:20:39 LPZ08945 kernel: LSystem[4688]: segfault at 0000000000000000 rip 00000000080e6433 rsp 00000000f2afd4e0 error 4
This message indicates that there is a segfault at instruction pointer 80e6433 and stack pointer f2afd4e0. It looks like the program tries to read or write at address 0.
2. load the core dump file into gdb and it shows the crash location:
$gdb LSystem core.19009
GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
... (many lines of outputs from gdb omitted)
Core was generated by `./LSystem'.
Program terminated with signal 11,
Segmentation fault.
'#0' 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214
214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
gdb says the crash occurs at Line 214?
3. Frame information. (at Frame #0)
(gdb) info frame
Stack level 0, frame at 0xf2afd7e0:
eip = 0x80e6433 in CLClient::connectToServer (liccomm/LClient.cpp:214); saved eip 0x80e6701
called by frame at 0xf2afd820
source language c++.
Arglist at 0xf2afd7d8, args: this=0xf2afd898, conn=11
Locals at 0xf2afd7d8, Previous frame's sp is 0xf2afd7e0
Saved registers:
ebx at 0xf2afd7cc, ebp at 0xf2afd7d8, esi at 0xf2afd7d0, edi at 0xf2afd7d4, eip at 0xf2afd7dc
The frame is at f2afd7e0; why is it different from the rsp in Part 1, which is f2afd4e0? I guess the user may have provided me with a mismatched core dump file (whose pid is 19009) and /var/log/messages file (which indicates pid 4688).
4. The source
(gdb) list +
209
210 //pHost is declared as struct hostent* and 'pHost = gethostbyname(serverAddress);'
211 memset( &a4, 0, sizeof(a4) );
212 a4.sin_family = AF_INET;
213 a4.sin_port = htons( nPort );
214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
215
216 aalen = sizeof(a4);
217 aa = (struct sockaddr *)&a4;
I could not see anything wrong with Line 214, and this part of the code must have run many times during the two days of runtime.
5. The variables
Since gdb indicated that Line 214 was the culprit, I printed everything:
memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
(gdb) print a4.sin_addr
$1 = {s_addr = 0}
(gdb) print &(a4.sin_addr)
$2 = (in_addr *) 0xf2afd794
(gdb) print pHost->h_addr_list[0]
$3 = 0xa24af30 "\202}\204\250"
(gdb) print pHost->h_length
$4 = 4
(gdb) print memcpy
$5 = {} 0x2fcf90
So I basically printed everything that's at Line 214. ('pHost->h_addr_list[0]' is 'pHost->h_addr' due to '#define h_addr h_addr_list[0]')
I was not able to catch anything wrong. Did you catch anything fishy? Is it possible that the memory was corrupted somewhere else? I appreciate your help!
[edited] 6. back trace
(gdb) bt
'#0' 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214
'#1' 0x080e6701 in CLClient::connectToLMServer (this=0xf2afd898) at liccomm/LClient.cpp:121
... (Frames 2~7 omitted, not relevant)
'#8' 0x080937f2 in handleConnectionStarter (par=0xf3563f98) at LManager.cpp:166
'#9' 0xf7f5fb41 in ?? ()
'#10' 0xf3563f98 in ?? ()
'#11' 0xf2aff31c in ?? ()
'#12' 0x00000000 in ?? ()
I followed the nested calls. They are correct.
The problem with the memcpy is that the source location is not of the same type as the destination.
You should use inet_addr to convert addresses from string to binary:
a4.sin_addr = inet_addr(pHost->h_addr);
The previous code may not work depending on the implementation (some may return struct in_addr, others will return unsigned long), but the principle is the same.

FORTRAN IV/66 program stalls in DO loops

I copied a FORTRAN IV program from a thesis, so it presumably worked at the time it was written. I compiled it with gfortran. When running, it stalls in an integration subroutine. I have tried easing off the residuals, but to no avail. I am asking for help because (presuming no mistakes in the code) gfortran might not like the archaic 66/IV code, and updating it is beyond my abilities.
The program gets stuck by line 9, so I wonder if the DO loops are responsible. Note that lines 1 and 6 are unusual to me because ',1' has been added to the ends: e.g. =1,N,1.
I don't think it's necessary to show the FUNC subroutine called on line 5 but am happy to provide it if necessary.
If you need more detailed information I am happy to provide it.
00000001 13 DO 22 TDP=QDP,7,1
00000002 TD=TDP-1
00000003 X=X0+H0
00000004 IF(TD.EQ.QD) GOTO 15
00000005 CALL FUNC(N,DY,X,Y,J)
00000006 15 DO 21 RD=1,N,1
00000007 GOTO (120,121,122,123,124,125,126),TDP
00000008 120 RK(5*N*RD)=Y(RD)
00000009 GOTO 21
00000010 121 RK(RD)=HD*DY(RD)
00000011 H0=0.5*HD
00000012 F0=0.5*RK(RD)
00000013 GOTO 20
00000014 122 RK(N+RD)=HD*DY(RD)
00000015 F0=0.25*(RK(RD)+RK(N+RD))
00000016 GOTO 20
00000017 123 RK(2*N+RD)=HD*DY(RD)
00000018 H0=HD
00000019 F0=-RK(N+RD)+2.*RK(2*N+RD)
00000020 GOTO 20
00000021 124 RK(3*N+RD)=HD*DY(RD)
00000022 H0=0.66666666667*HD
00000023 F0=(7.*RK(RD)+10.*RK(N+RD)+RK(3*N+RD))/27.
00000024 GOTO 20
00000025 125 RK(4*N+RD)=HD*DY(RD)
00000026 H0=0.2*HD
00000027 F0=(28.*RK(RD)-125.*RK(N+RD)+546.*RK(2*N+RD)+54.*RK(3*N+RD)-
00000028 1378.*RK(4*N+RD))/625.
00000029 GOTO 20
00000030 126 RK(6*N+RD)=HD*DY(RD)
00000031 F0=0.1666666667*(RK(RD)+4.*RK(2*N+RD)+RK(3*N+RD))
00000032 X=X0+HD
00000033 ER=(-42.*RK(RD)-224.*RK(2*N+RD)-21.*RK(3*N+RD)+162.*RK(4*N+RD)
00000034 1+125.*RK(6*N+RD))/67.2
00000035 YN=RK(5*N+RD)+F0
00000036 IF(ABS(YN).LT.1E-8) YN=1
00000037 ER=ABS(ER/YN)
00000038 IF(ER.GT.G0) GOTO 115
00000039 IF(ED.GT.ER) GOTO 20
00000040 QD=-1
00000041 20 Y(RD)=RK(5*N+RD)+F0
00000042 21 CONTINUE
00000043 22 CONTINUE
It's difficult to be certain (I'm not entirely sure your snippet exactly matches your source file), but your problem might arise from an old FORTRAN gotcha: a 0 in column 6 is (or rather was) treated as a blank. Any other non-blank character in column 6 is/was treated as a continuation indicator, but not the 0.
Not all f66 compilers adhered to the convention of executing a loop at least once, but it was a common (non-portable) assumption.
Similarly, the assumption of all static variables was not a portable one, but can be implemented by adding a SAVE statement, beginning with f77. A further assumption that SAVE variables will be zero-initialized is even more non-portable, but most compilers have an option to implement that.
If an attempt is being made to resurrect old code, it is probably worthwhile to get it working before modernizing it incrementally so as to make it more self-documenting. The computed GOTO looks like a relatively sane one, which could be replaced by SELECT CASE at a possible expense of optimization. Here the recent uses of the term "modernization" become contradictory.
The ,1 bits are there to get the compiler to spot errors. It is quite common to mistype the following
DO 10 I = 1.7
That is perfectly legal, since spaces are allowed in variable names: it simply assigns 1.7 to a variable named DO10I. If you wish to avoid that, put in the extra number. The following will generate errors:
DO 10 I = 1.7,1
DO 10 I = 1,7.1
DO 10 I = 1.7.1
Re the program getting stuck: try putting a CONTINUE statement between labels 21 and 22. The IF-GOTO is the same as IF-NOT-THEN in later versions of Fortran, and the computed GOTO is the same as a SELECT statement. You don't need to recode it: there is nothing wrong with it other than youngsters getting confused whenever they see GOTO. All you need to do is indent it and it becomes obvious. So what you will have is
DO 22 TDP = QDP, 7, 1
...
DO 23 RD = 1, N, 1
GOTO (...) TDP
...
GOTO 21
...
GOTO 20
...
GOTO 20
...
20 CONTINUE
Y(RD) = ...
21 CONTINUE
23 CONTINUE
22 CONTINUE
You will probably end up with far more code if you try recoding it, and it will look exactly the same except that the GOTOs have been replaced by other words. It is possible that the compiler is generating the wrong code, so just help it by putting in a few dummy (CONTINUE) statements.

Strange runtime error when using vectors

I have this very strange problem with my code at runtime when I'm running it in parallel with mpi:
*** glibc detected *** ./QuadTreeConstruction: munmap_chunk(): invalid pointer: 0x0000000001fbf180 ***
======= Backtrace: =========
/lib/libc.so.6(+0x776d6)[0x7f38763156d6]
./QuadTreeConstruction(_ZN9__gnu_cxx13new_allocatorImE10deallocateEPmm+0x20)[0x423f04]
./QuadTreeConstruction(_ZNSt13_Bvector_baseISaIbEE13_M_deallocateEv+0x50)[0x423e72]
./QuadTreeConstruction(_ZNSt13_Bvector_baseISaIbEED2Ev+0x1b)[0x423c79]
./QuadTreeConstruction(_ZNSt6vectorIbSaIbEED1Ev+0x18)[0x4237d2]
./QuadTreeConstruction(_Z22findLocalandGhostCellsRK8QuadTreeRK6ArrayVIiES5_iRS3_S6_+0x849)[0x41dbbd]
./QuadTreeConstruction(main+0xa32)[0x41ca49]
/lib/libc.so.6(__libc_start_main+0xfe)[0x7f38762bcd8e]
./QuadTreeConstruction[0x41b029]
My code is valgrind-clean, up to some errors that are internal to the MPI library (I use OpenMPI and have seen them many times before; they have never been an issue; see http://www.open-mpi.org/faq/?category=debugging#valgrind_clean). I do not have any problem when running in serial.
I have been able to track the problem down to a SIGABRT using gdb, and here is the stack when the code breaks:
0 raise raise.c 64 0x7f4bd8655ba5
1 abort abort.c 92 0x7f4bd86596b0
2 __libc_message libc_fatal.c 189 0x7f4bd868f65b
3 malloc_printerr malloc.c 6283 0x7f4bd86996d6
4 __gnu_cxx::new_allocator<unsigned long>::deallocate new_allocator.h 95 0x423f04
5 std::_Bvector_base<std::allocator<bool> >::_M_deallocate stl_bvector.h 444 0x423e72
6 std::_Bvector_base<std::allocator<bool> >::~_Bvector_base stl_bvector.h 430 0x423c79
7 std::vector<bool, std::allocator<bool> >::~vector stl_bvector.h 547 0x4237d2
8 findLocalandGhostCells mpi_partition.cpp 249 0x41dbbd
9 main mpi_partition.cpp 111 0x41ca49
This looks like memory corruption, but I have absolutely no clue what is causing it. Basically, the code breaks inside a function that looks something like this:
void findLocalandGhostCells() {
    std::vector<bool> foo(fooSize, false);
    // do stuff with foo; nothing crazy -- I promise
    return;
}
Does anyone have any idea what I should be doing now? :(
If you are quite sure that the vector operation itself is correct and non-crazy, try tracing the members of your vector step by step. It's possible that some of your other operations corrupted the vector's memory blocks: for example, a memcpy that invaded the vector's memory.

What exactly does C++ profiling (google cpu perf tools) measure?

I'm trying to get started with Google Perf Tools to profile some CPU-intensive applications. It's a statistical calculation that dumps each step to a file using `ofstream`. I'm not a C++ expert, so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The "_write$UNIX2003" (where can I find out what this is?) appears to come from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% is in SliceStep::DoStep and `_write$UNIX2003` goes away. However, my application does not speed up, as measured by total time; the whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I'd expect, except that I see no visible increase in performance (0.1 seconds on a 10-second calculation). The code is essentially:
ofstream outfile("out.txt");
for loop:
SliceStep::DoStep()
outfile << 'result'
outfile.close()
Update: I am timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong, which means you have to do more and better measurements.
First, I suggest starting with another simple tool: the time command. It should give you a rough idea of where your time is spent.
If the results are still not conclusive you need a better testcase:
Use a larger problem
Do a warmup before measuring: do some loops and start any measurement afterwards (in the same process).
Tiristan: It's all in user. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
Printing 100000 lines to the console using Python results in something like:
for i in xrange(100000):
    print i
To console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python test.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is:
Your internal measurements and time show that you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?
_write$UNIX2003 probably refers to the write POSIX system call, which outputs to the terminal. I/O is very slow compared to almost anything else, so it makes sense that your program is spending a lot of time there if you are writing a fair bit of output.
I'm not sure why your program doesn't speed up when you remove the output, but I can't really make a guess based only on the information you've given. It would be nice to see some of the code, or even the perftools output when the cout statements are removed.
Google perftools collects samples of the call stack, so what you need is some visibility into those.
According to the documentation, you can display the call graph at statement or address granularity. That should tell you what you need to know.