How to debug DPDK libraries to diagnose segmentation fault? - dpdk

I am working with DPDK version 18.11.8 stable on Linux, using a gcc x64 build.
At runtime I get a segmentation fault. Running gdb on the core dump gives this backtrace:
#0 0x0000000000f65680 in rte_eth_devices ()
#1 0x000000000048a03a in rte_eth_rx_burst (nb_pkts=7,
rx_pkts=0x7fab40620480, queue_id=0, port_id=<optimized out>)
at
/opt/dpdk/dpdk-18.08/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:3825
#2 Socket_poll (ucRxPortId=<optimized out>, ucRxQueId=ucRxQueId at entry=0
'\000', uiMaxNumOfRxFrm=uiMaxNumOfRxFrm at entry=7,
pISocketListener=pISocketListener at entry=0xf635d0 <FH_gtFrontHaulObj+16>)
at /data/<snip>/SocketClass.c:2188
#3 0x000000000048b941 in FH_perform (args_ptr=<optimized out>) at
/data/<snip>/FrontHaul.c:281
#4 0x00000000005788e4 in eal_thread_loop ()
#5 0x00007fab419fae65 in start_thread () from /lib64/libpthread.so.0
#6 0x00007fab4172388d in clone () from /lib64/libc.so.6
So it seems that rte_eth_rx_burst() calls rte_eth_devices () and that function crashes, presumably because of an illegal memory access. Possibly a hugepages problem?
I want to enable more debug info in DPDK. I am building DPDK using:
usertools/dpdk-setup.sh
Am I correct in thinking that the build commands in that script use make and I should modify the appropriate:
config/defconfig_*
file (defconfig_x86_64-native-linuxapp-gcc in my case) ?
If so, would these values be appropriate?
CONFIG_RTE_LIBRTE_ETHDEV_DEBUG=y
RTE_LOG_LEVEL=RTE_LOG_DEBUG
RTE_LIBRTE_ETHDEV_DEBUG=y
(not sure whether all values should be prefixed by 'CONFIG_'?)
I tried building DPDK using:
$ export EXTRA_CFLAGS='-O0 -g'
$ make install T=x86_64-native-linuxapp-gcc
but that gave no extra info in the backtrace.

EDIT: error is identified update is Fixed and running without crashing now
using chat room dpdk-debug, we were able to rebuild the libraries and application with proper CFLAGS. Using gdb have identified the probable cause is in rte_eth_rx_burst not being passed with pointer array for mbuf.
Based on the GDB details for frame 1, it looks the application is not build with the EXTRA_CFLAGS (assuming you are using DPDK example Makefile). The right way to build an DPDK application for debugging is to follow the steps as
cd [dpdk target folder]
make clean
make EXTRA_CFLAGS='-O0 -ggdb'
cd [application folder]
make EXTRA_CFLAGS='-O0 -ggdb'
then use GDB in TUI or non-TUI mode to analyze the error.
note:
one of the most common mistakes I commit in rx_burst, is passing *mbuf_array instead of **mbuf_array as the argument.
if custom Makefile is used for the application, pass the EXTRA_CFLAGS as CFLAGS+="-O0 -ggdb"

Related

gdb only finds some debug symbols

So I am experiencing this really weird behavior of gdb on Linux (KDE Neon 5.20.2):
I start gdb and load my executable using the file command:
(gdb) file mumble
Reading symbols from mumble...
As you can see it did find debug symbols. Then I start my program (using start) which causes gdb to pause at the entry to the main function. At this point I can also print out the back trace using bt and it works as expected.
If I now continue my program and interrupt it at any point during startup, I can still display the backtrace without issues. However if I do something in my application that happens in another thread than the startup (which all happens in thread 1) and interrupt my program there, gdb will no longer be able to display the stacktrace properly. Instead it gives
(gdb) bt
#0 0x00007ffff5bedaff in ?? ()
#1 0x0000555556a863f0 in ?? ()
#2 0x0000555556a863f0 in ?? ()
#3 0x0000000000000004 in ?? ()
#4 0x0000000100000001 in ?? ()
#5 0x00007fffec005000 in ?? ()
#6 0x00007ffff58a81ae in ?? ()
#7 0x0000000000000000 in ?? ()
which shows that it can't find the respective debug symbols.
I compiled my application with cmake (gcc) using -DCMAKE_BUILD_TYPE=Debug. I also ensured that a bunch of debug symbols are present in the binary using objdump --debug mumble (Which also printed a few lines of objdump: Error: LEB value too large, but I'm not sure if this is related to the problem I am seeing).
While playing around with gdb, I also encountered the error
Cannot find user-level thread for LWP <SomeNumber>: generic error
a few times, which lets me suspect that maybe there is indeed some issue invloving threads here...
Finally I tried starting gdb and before loading my binary using set verbose on which yields
(gdb) set verbose on
(gdb) file mumble
Reading symbols from mumble...
Reading in symbols for /home/user/Documents/Git/mumble/src/mumble/main.cpp...done.
This does also look suspicious to me as only main.cpp is explicitly listed here (even though the project has much, much more source files). I should also note that all successful backtraces that I am able to produce (as described above) all originate from main.cpp.
I am honestly a bit clueless as to what might be the root issue here. Does someone have an idea what could be going on? Or how I could investigate further?
Note: I also tried using clang as a compiler but the result was the same.
Used program versions:
cmake: 3.18.4
gcc: 9.3.0
clang: 10.0.0
make: 4.2.1

Debugging a haxe based ndk app on android with gdb

I am trying to debug a openfl/haxe app on android with gdb. Since haxe/openfl compiles to c++ which is then compiled using the ndk, I am basically trying to debug an ndk app.
I got gdbserver attached to the apps process and can remote debug it using arm-linux-androideabi-gdb. But as soon as I try to get a backtrace (the app is running fine at this moment), I get this:
#0 0xb6e70b10 in ?? ()
#1 0xb6e4833c in ?? ()
#2 0xb6e4833c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I understand that this can happen when I corrupted the stack by using bad pointers, but since the app is running fine (am actually looking for a runtime error, not a crash), that is not the case.

Caffe Compilation Error: gflags.cc' is being linked both statically and dynamically into this executable

I am trying to install caffe following this tutorial
Basically I have the following error when I type the last make command:
me#dl-01:/home/me/caffe-master$ make runtest
.build_release/tools/caffe
caffe: command line brew
usage: caffe command args
commands:
train train or finetune a model
test score a model
device_query show GPU diagnostic information
time benchmark model execution time
Flags from tools/caffe.cpp:
-gpu (Run in GPU mode on given device ID.) type: int32 default: -1
-iterations (The number of iterations to run.) type: int32 default: 50
-model (The model definition protocol buffer text file..) type: string
default: ""
-snapshot (Optional; the snapshot solver state to resume training.)
type: string default: ""
-solver (The solver definition protocol buffer text file.) type: string
default: ""
-weights (Optional; the pretrained weights to initialize finetuning. Cannot
be set simultaneously with snapshot.) type: string default: ""
.build_release/test/test_all.testbin 0 --gtest_shuffle
ERROR: something wrong with flag 'flagfile' in file '/root/glog-0.3.3/gflags-master/src/gflags.cc'. One possibility: file '/root/glog-0.3.3/gflags-master/src/gflags.cc' is being linked both statically and dynamically into this executable.
make: *** [runtest] Error 1
I don't understand how to solve this error. Did anybody find this error before? how can I solve it?
Whether or not you've already solved this somewhere else, I'm posting the answer here in-case others run into the same problem.
Primarily, this problem seems to have come about because we don't always read things properly and blindly follow all instructions thinking they all apply to our case. hint: they don't.
In the installation instructions for Caffe (presuming Ubuntu instructions), there is a section which states:
Everything is packaged in 14.04.
sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler
Blindly ignoring the next title, which states clearly:
Remaining dependencies, 12.04
we go on to install these dependencies, building and installing as required, resulting in the unfortunate side-effect of having 2 versions of libgflags, one dynamic (in /usr/lib[/x86_x64] and one static in /usr/local/lib
Resolution
Promise ourselves failthfully we'll read instructions properly next time around.
Uninstall libgflags
sudo apt-get remove -y libgflags
Delete make install versions
sudo rm -f /usr/local/lib/libgflags.a /usr/local/lib/libgflags_nothreads.a
sudo rm -rf /usr/local/include/gflags
Clean Caffe build
cd <path>/<to>/caffe
make clean
Re-install libgflags package
sudo apt-get install -y libgflags-dev
Rebuild Caffe
make all
make test
make runtest
Et Voila. All tests should now run and you're ready to rock the deep-learning boat.
I've worked out a way to debug this issue analytically. In my case, I was cross-compiling for an older ABI, so apt-get wasn't an option and I was compiling all dependencies manually.
First let's take a look at what this issue actually is. In the Google GFlags library, flags are declared through global objects. When the global object's constructor is run, it calls into the GFlags library to register that command line flag. If the global constructor gets run multiple times (due to multiple versions of the library containing it being loaded into memory), then the GFlags register method dies with an error.
What does GLog have to do with this? Well, GLog uses GFlags, and it has globally declared flag objects. Even if GFlags is linked correctly, if the GLog library gets loaded multiple times, you get an error pointing to logging.cc in GLog.
Sounds like quite a mess, huh. Even if GLog and GFlags are linked as shared in most cases, if another library links to a static version or some other version, kaboom.
Luckily, we can debug this issue using GDB and other tools, if you're willing to delve through some tricky symbol analysis.
First, you'll want to run GDB on the Python interpreter when it tries to import caffe:
gdb --args python -c 'import caffe'
Now, run the program once through so that GDB can pick up all the libraries it imports:
(gdb) r
Now, we can set a breakpoint on the place in the function (FlagRegistry::RegisterFlag()) that prints the error message, and run it again. Note that this line number is from my version of GFlags (2.2.2), you may have to look at the source code of your GFlags version and get the line number.
(gdb) break gflags.c:728
(gdb) r
Hopefully, GDB should then break on the first instance of the error (if not, check that gflags has been built with debugging symbols).
Look at the backtrace:
(gdb) bt
#0 google::(anonymous namespace)::FlagRegistry::RegisterFlag (this=0xa33b30, flag=0x1249d20) at dev/gflags-2.2.2/src/gflags.cc:728
#1 0x00007ffff0f3247a in _GLOBAL__sub_I_logging.cc () from prefix/lib/libcaffe2.so
#2 0x00007ffff7de76ca in call_init (l=<optimized out>, argc=argc#entry=3, argv=argv#entry=0x7fffffffdb08, env=env#entry=0x7fffffffdb28) at dl-init.c:72
#3 0x00007ffff7de77db in call_init (env=0x7fffffffdb28, argv=0x7fffffffdb08, argc=3, l=<optimized out>) at dl-init.c:30
#4 _dl_init (main_map=main_map#entry=0xd9c2a0, argc=3, argv=0x7fffffffdb08, env=0x7fffffffdb28) at dl-init.c:120
#5 0x00007ffff7dec8f2 in dl_open_worker (a=a#entry=0x7fffffffcf70) at dl-open.c:575
#6 0x00007ffff7de7574 in _dl_catch_error (objname=objname#entry=0x7fffffffcf60, errstring=errstring#entry=0x7fffffffcf68, mallocedp=mallocedp#entry=0x7fffffffcf5f,
operate=operate#entry=0x7ffff7dec4e0 <dl_open_worker>, args=args#entry=0x7fffffffcf70) at dl-error.c:187
#7 0x00007ffff7debdb9 in _dl_open (file=0x9aee70 "prefix/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state.so", mode=-2147483646,
caller_dlopen=0x51bb39 <_PyImport_GetDynLoadFunc+233>, nsid=-2, argc=<optimized out>, argv=<optimized out>, env=0x7fffffffdb28) at dl-open.c:660
#8 0x00007ffff75ecf09 in dlopen_doit (a=a#entry=0x7fffffffd1a0) at dlopen.c:66
#9 0x00007ffff7de7574 in _dl_catch_error (objname=0xabf9f0, errstring=0xabf9f8, mallocedp=0xabf9e8, operate=0x7ffff75eceb0 <dlopen_doit>, args=0x7fffffffd1a0) at dl-error.c:187
#10 0x00007ffff75ed571 in _dlerror_run (operate=operate#entry=0x7ffff75eceb0 <dlopen_doit>, args=args#entry=0x7fffffffd1a0) at dlerror.c:163
#11 0x00007ffff75ecfa1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#12 0x000000000051bb39 in _PyImport_GetDynLoadFunc ()
<snip>
Well that's a lot to deal with, but let's focus on the line that's actually important:
#1 0x00007ffff0f3247a in _GLOBAL__sub_I_logging.cc () from prefix/lib/libcaffe2.so
This is the call to the constructor for the global variables in logging.cc (which is part of GLog). As you can see, this call is in libcaffe2.so, meaning that GLog has been statically linked to libcaffe2.so [I was using caffe2, but this procedure should be the same for both].
You can then set a breakpoint on google::(anonymous namespace)::FlagRegistry::RegisterFlag and rerun the program from the start. Look at each call to RegisterFlag(), and figure out where this particular flag was registered the first time. If the library providing the flag is a shared library, then it should only ever get registered from that .so file, and nowhere else.
To confirm the diagnosis, you can use
nm <library> | grep _GLOBAL__sub_I_logging.cc
to check for that init function in a library file. Once you've found your culprit, you'll need to rebuild it so that it doesn't link to GFlags/GLog statically.
I also had two libraries installed, a shared .so library and a static .a library. I removed them all as well as the /usr/local/include/glog folder.
The .so file I had brought over when I (cross) compiled the system, while the .a was from a native and up-to-date build.
Ultimately it came down to building glog (natively) in such a way that it provided the .so files.
I started with a clean download:
git clone git://github.com/google/glog
Then I edited CMakeLists.txt.
Where it says:
add_library (glog
${GLOG_SRCS}
)
I changed it to:
add_library (glog SHARED
${GLOG_SRCS}
)
Next you should be able to follow the other instructions. For my particular case I had to use slightly different instructions, not saying you have to do this. For me it was:
mkdir build
cd build
export CXXFLAGS="-fPIC"
cmake ..
make
sudo make install
This gave me the .so files and put them in the right place. Then I started over with caffe and it fixed the error for me.

freeglut segfaults, no useful info from gdb

In this program I'm writing, I've been using freeglut and, generally, it has been working. However, sometimes when there is some issue in the program that often has nothing to do with rendering at all, I get a segfault at glutInit() and no explanation from GDB.
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7942409 in glutInit () from /usr/lib/x86_64-linux-gnu/libglut.so.3
Backtrace:
#0 0x00007ffff7942409 in glutInit () from /usr/lib/x86_64-linux-gnu/libglut.so.3
#1 0x0000000000415d4c in initGL () at ../gfx/render.cpp:62
#2 0x00000000004035f3 in main () at battle.cpp:49
Running with rendering disabled produces no errors.
So, I am wondering what I need to do to get more information on these failures. Can I get the backtrace to look inside liblut.so.3?
(As an aside, any recommendations for a more reliable toolkit than freeglut are appreciated.)
Can I get the backtrace to look inside liblut.so.3?
You already have a backtrace that is looking inside libglut.so.3.
You need to either
install debug symbols for it (sudo apt-get install freeglut3-dbg or some such), or
compile libglut.so from source, or
debug at assembly level: x/i $pc, disas, info registers, etc.

gdb no symbol table loaded for core file

There was a core dump produced at the customer end for my application and while looking at the backtrace I don't have the symbols loaded...
(gdb) where
#0 0x000000364c032885 in ?? ()
#1 0x000000364c034065 in ?? ()
#2 0x0000000000000000 in ?? ()
(gdb) bt full
#0 0x000000364c032885 in ?? ()
No symbol table info available.
#1 0x000000364c034065 in ?? ()
No symbol table info available.
#2 0x0000000000000000 in ?? ()
No symbol table info available.
One think I want to mention in here is that the application being used is build with -g option.
To me it seems that the required libraries are not being loaded. I tried to load the libraries manually using the "symbol-file", but this doesn't help.
What could be the possible issue?
No symbol table info available.
Chances are you invoked GDB incorrectly. Don't do this:
gdb core
gdb -c core
Do this instead:
gdb exename core
Also see this answer for what you'll likely have to do to get meaningful crash stack trace for a core from customer's machine.
I was facing a similar issue and later found out that I am missing -g option, Make sure you have compiled the binary with -g.
This happens when you run gdb with path to executable that does not correspond to the one that produced the core dump.
Make sure that you provide gdb with the correct path.
<put an example of correct code or commands here>