I have used gdb before, where the bt (backtrace) command shows the line numbers at which the program crashed. That makes it very easy to locate where a bug (e.g. a segmentation fault) occurs.
I am looking for a similar feature in Bazel. I compiled with
bazel build -c dbg //...
This already gives more informative output than -c opt, but it does not show line numbers.
Is it possible to get the line numbers in Bazel as well? My stack trace is included below.
*** Aborted at 1619445043 (unix time) try "date -d @1619445043" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 3177568 (TID 0x7f325db34880) from PID 0; stack trace: ***
@ 0x7f325dda53c0 (unknown)
@ 0x55d0e0d6e6b2 std::vector<>::operator[]()
@ 0x55d0e0d693f8 _ZZ11MineDCRulesvENKUlRKN9codelearn18GitHubFileRevisionERKN10eventgraph19OriginalTreesChangeERKNS3_18SourceGraphStorageES9_E_clES2_S6_S9_S9_
@ 0x55d0e0d6a096 _ZNSt17_Function_handlerIFvRKN9codelearn18GitHubFileRevisionERKN10eventgraph19OriginalTreesChangeERKNS4_18SourceGraphStorageESA_EZ11MineDCRulesvEUlS3_S7_SA_SA_E_E9_M_invokeERKSt9_Any_dataS3_S7_SA_SA_
@ 0x55d0e0d935cc std::function<>::operator()()
@ 0x55d0e0d81466 _ZZ36TraverseDatalogAnalyzedFilesInternalRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiibP9StringSetPN4absl5MutexEPKSt8functionIFvRKN9codelearn18GitHubFileRevisionERKN10eventgraph19OriginalTreesChangeERKNSH_18SourceGraphStorageESN_EEbENKUlS6_E_clES6_
@ 0x55d0e0d888a2 _ZNSt17_Function_handlerIFvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEZ36TraverseDatalogAnalyzedFilesInternalS7_iibP9StringSetPN4absl5MutexEPKSt8functionIFvRKN9codelearn18GitHubFileRevisionERKN10eventgraph19OriginalTreesChangeERKNSJ_18SourceGraphStorageESP_EEbEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
@ 0x55d0e0dacb93 std::function<>::operator()()
@ 0x55d0e0da90e5 _Z19ForEachFileInternalP16SimpleThreadPoolRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEibRKSt8functionIFvS8_EE.localalias
@ 0x55d0e0da98cc _Z19ForEachFileInternalP16SimpleThreadPoolRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEibRKSt8functionIFvS8_EE.localalias
@ 0x55d0e0da98cc _Z19ForEachFileInternalP16SimpleThreadPoolRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEibRKSt8functionIFvS8_EE.localalias
@ 0x55d0e0da9b57 ForEachFile()
@ 0x55d0e0d81c79 TraverseDatalogAnalyzedFilesInternal()
@ 0x55d0e0d83271 TraverseDatalogAnalyzedFilesWithMetadata()
@ 0x55d0e0d699f8 MineDCRules()
@ 0x55d0e0d69e3c main
@ 0x7f325db680b3 __libc_start_main
@ 0x55d0e0d6836e _start
@ 0x0 (unknown)
tools/scripts/train.sh: line 94: 3177568 Segmentation fault (core dumped) $DC_MINER $LOGGING --repos_dir $RAWDATA --transform $LANGUAGE --histories_dir $HISTORIES --eval_histories $TEST_HISTORIES --num_parallel_threads $NUM_CPU --out_dir $EMB_DIR --database $DATABASE/$LANGUAGE --ontology_dir $DATABASE/rules
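A trace in this format carries raw addresses rather than file and line information, so a first step toward line numbers is to pull the addresses out. The following is a minimal sketch, not a Bazel feature: it only parses frame lines (accepting either '@' or '#' as the frame marker) into address and symbol pairs. The names FRAME_RE and parse_frames are hypothetical. The resulting addresses could then be handed to an external symbolizer such as addr2line, with the caveat that for a position-independent executable the runtime addresses would presumably need the load base subtracted first.

```python
import re

# One frame line: a marker ('@' or '#'), a hex address, then the symbol text.
FRAME_RE = re.compile(r"^\s*[@#]\s+(0x[0-9a-fA-F]+)\s+(.*)$")


def parse_frames(trace_text):
    """Return (address, symbol) pairs for every frame line in trace_text."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1), 16), match.group(2).strip()))
    return frames


# Three frames copied from the trace above.
sample = """\
@ 0x55d0e0d6e6b2 std::vector<>::operator[]()
@ 0x55d0e0d699f8 MineDCRules()
@ 0x55d0e0d69e3c main
"""
frames = parse_frames(sample)
for address, symbol in frames:
    print(hex(address), symbol)
```

Header lines such as the "PC:" and "*** SIGSEGV" lines do not start with a frame marker, so they are skipped.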
I'm trying to use Ray from a Flask web application.
The whole thing runs in a Docker container.
The Ray version is 0.8.6 and Flask is 1.1.2.
When I start the web application, Ray tries to initialize twice, as it seems, and then the processes crash. I added the memory limits later on because there were some warnings about insufficient shared memory (the Docker Compose setting is "shm_size: '4gb'").
If I start Ray in the same container without Flask, it runs well.
import os
import flask
import ray
from flask import Flask

def create_app(test_config=None):
    app = Flask(__name__, instance_relative_config=True)
    app.config.from_mapping(
        SECRET_KEY='dev',
        DEBUG = True
    )
    # ensure the instance folder exists
    try:
        os.makedirs(app.instance_path)
    except OSError:
        pass
    if ray.is_initialized() == False:
        ray.init(ignore_reinit_error=True,
                 include_webui=False,
                 object_store_memory=1*1024*1014*1024,
                 redis_max_memory=2*1024*1014*1024)
        ray.worker.global_worker.run_function_on_all_workers(setup_ray_logger)

    @app.route('/api/GetAccountRatings', methods=['GET'])
    def GetAccountRatings():
        return ...

    return app
When I start the flask web app with:
export FLASK_APP="mifad.api:create_app()"
export FLASK_ENV=development
flask run --host=0.0.0.0 --port=8084
I get the following error messages:
* Serving Flask app "mifad.api:create_app()" (lazy loading)
* Environment: development
* Debug mode: on
* Running on http://0.0.0.0:8084/ (Press CTRL+C to quit)
* Restarting with stat
Failed to set SIGTERM handler, processes mightnot be cleaned up properly on exit.
* Debugger is active!
* Debugger PIN: 331-620-174
Failed to set SIGTERM handler, processes mightnot be cleaned up properly on exit.
2020-07-06 07:38:10,382 INFO resource_spec.py:212 -- Starting Ray with 59.18 GiB memory available for workers and up to 0.99 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-06 07:38:10,610 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-07-06 07:38:10,675 INFO resource_spec.py:212 -- Starting Ray with 59.13 GiB memory available for workers and up to 0.99 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-06 07:38:10,781 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-07-06 07:38:11,043 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-07-06 07:38:11,479 ERROR import_thread.py:93 -- ImportThread: Error 111 connecting to 172.29.0.2:44946. Connection refused.
2020-07-06 07:38:11,481 ERROR worker.py:949 -- print_logs: Connection closed by server.
2020-07-06 07:38:11,488 ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
2020-07-06 07:38:11,899 ERROR import_thread.py:93 -- ImportThread: Error while reading from socket: (104, 'Connection reset by peer')
2020-07-06 07:38:11,901 ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
2020-07-06 07:38:11,908 ERROR worker.py:949 -- print_logs: Connection closed by server.
F0706 07:38:17.390182 4555 4659 service_based_gcs_client.cc:104] Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress
*** Check failure stack trace: ***
@ 0x7ff84ae8061d google::LogMessage::Fail()
@ 0x7ff84ae81a8c google::LogMessage::SendToLog()
@ 0x7ff84ae802f9 google::LogMessage::Flush()
@ 0x7ff84ae80511 google::LogMessage::~LogMessage()
@ 0x7ff84ae5dde9 ray::RayLog::~RayLog()
@ 0x7ff84ac39cea ray::gcs::ServiceBasedGcsClient::GetGcsServerAddressFromRedis()
@ 0x7ff84ac39f37 _ZNSt17_Function_handlerIFSt4pairISsiEvEZN3ray3gcs21ServiceBasedGcsClient7ConnectERN5boost4asio10io_contextEEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7ff84ac6ffb7 ray::rpc::GcsRpcClient::Reconnect()
@ 0x7ff84ac71da8 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc19AddProfileDataReplyEEZNS4_12GcsRpcClient14AddProfileDataERKNS4_21AddProfileDataRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7ff84ac4251d ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7ff84ab96870 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7ff84b0b80df boost::asio::detail::scheduler::do_run_one()
@ 0x7ff84b0b8cf1 boost::asio::detail::scheduler::run()
@ 0x7ff84b0b9c42 boost::asio::io_context::run()
@ 0x7ff84ab7db10 ray::CoreWorker::RunIOService()
@ 0x7ff84a7763e7 execute_native_thread_routine_compat
@ 0x7ff84deed6db start_thread
@ 0x7ff84dc1688f clone
F0706 07:38:17.804720 4553 4703 service_based_gcs_client.cc:104] Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress
*** Check failure stack trace: ***
@ 0x7fedd65e261d google::LogMessage::Fail()
@ 0x7fedd65e3a8c google::LogMessage::SendToLog()
@ 0x7fedd65e22f9 google::LogMessage::Flush()
@ 0x7fedd65e2511 google::LogMessage::~LogMessage()
@ 0x7fedd65bfde9 ray::RayLog::~RayLog()
@ 0x7fedd639bcea ray::gcs::ServiceBasedGcsClient::GetGcsServerAddressFromRedis()
@ 0x7fedd639bf37 _ZNSt17_Function_handlerIFSt4pairISsiEvEZN3ray3gcs21ServiceBasedGcsClient7ConnectERN5boost4asio10io_contextEEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fedd63d1fb7 ray::rpc::GcsRpcClient::Reconnect()
@ 0x7fedd63d3da8 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc19AddProfileDataReplyEEZNS4_12GcsRpcClient14AddProfileDataERKNS4_21AddProfileDataRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7fedd63a451d ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7fedd62f8870 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7fedd681a0df boost::asio::detail::scheduler::do_run_one()
@ 0x7fedd681acf1 boost::asio::detail::scheduler::run()
@ 0x7fedd681bc42 boost::asio::io_context::run()
@ 0x7fedd62dfb10 ray::CoreWorker::RunIOService()
@ 0x7fedd5ed83e7 execute_native_thread_routine_compat
@ 0x7fedd968f6db start_thread
@ 0x7fedd93b888f clone
Aborted (core dumped)
What am I doing wrong?
Best regards,
Bernd
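One possible reading of the log above: the line "* Restarting with stat" suggests that the Flask debug reloader is active, in which case create_app() may run in two processes, each of which calls ray.init(). The following is a minimal sketch of a guard, assuming that Werkzeug marks the reloaded child process with the environment variable WERKZEUG_RUN_MAIN=true. The helper name should_start_ray is hypothetical, and this has not been verified against Ray 0.8.6.

```python
import os


def should_start_ray(debug, environ=None):
    """Decide whether this process should call ray.init().

    Assumption: with the debug reloader active, the app factory runs in a
    watcher process and again in a child process, and only the child (marked
    with WERKZEUG_RUN_MAIN=true) serves requests.
    """
    environ = os.environ if environ is None else environ
    if not debug:
        return True  # no reloader, so there is only one process
    return environ.get("WERKZEUG_RUN_MAIN") == "true"


# Watcher process (reloader parent): skip Ray start-up.
print(should_start_ray(debug=True, environ={}))
# Reloaded child that serves requests: start Ray here.
print(should_start_ray(debug=True, environ={"WERKZEUG_RUN_MAIN": "true"}))
```

Inside create_app() the ray.init(...) call could then be wrapped in if should_start_ray(app.debug):. Running once with flask run --no-reload would be another way to check whether the reloader is involved at all.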
I'm running GNU grep under gdb on Linux and single-stepping through it. After about 12 steps, control is transferred to setlocale.c, for which no source code is available.
In the example session below, no source code information is available after step 12, and the list command just shows the file name.
Is there a way of getting gdb to keep stepping until a file with source code is available again? Alternatively, is there a way of telling gdb to keep stepping until control is transferred to a different file?
Example session, showing source code initially available and then unavailable for setlocale.c:
(gdb) start
Temporary breakpoint 1 at 0x402e50: file grep.c, line 2415.
Starting program: ~/ws/opt/grep/out/bin/grep --context=20 -r --line-number --byte-offset --include=\*.c int .
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Temporary breakpoint 1, main (argc=0x8, argv=0x7fffffffdaa8) at grep.c:2415
2415 {
(gdb) l
2410 return result;
2411 }
2412
2413 int
2414 main (int argc, char **argv)
2415 {
2416 char *keys = NULL;
2417 size_t keycc = 0, oldcc, keyalloc = 0;
2418 int matcher = -1;
2419 bool with_filenames = false;
(gdb) s 12
__GI_setlocale (category=category@entry=0x6, locale=locale@entry=0x420b7b "") at setlocale.c:220
220 setlocale.c: No such file or directory.
(gdb) l
215 in setlocale.c
You need gdb's finish command. It runs until the current stack frame returns, which takes you out of a frame that has no source code available. You can use it as many times as needed until you are back in a stack frame with source code. See the gdb documentation.
I ended up writing a simple gdb script using the Python API to do this. It keeps stepping until control is transferred to a different file, regardless of whether that involves adding a new stack frame or leaving the current one.
The script can be loaded with source leave_this_file.py. It defines a command called leave_this_file that can be invoked with no arguments, or given a number of times to repeat.
The script is a little makeshift: it parses the output of the gdb command frame 0 rather than using one of gdb's proper APIs for inspecting frames.
import gdb

MAX_STEPS = 10000


def get_file_name():
    """extract the file name for the bottommost frame"""
    # example string:
    #   #0  main (argc=0x7, argv=0x7fffffffdaa8) at grep.c:2415
    #   <source fragment>
    where_str = gdb.execute("frame 0", from_tty=False, to_string=True)
    # last word of first line is file:line
    file_line = where_str.splitlines()[0].split()[-1]
    filename, _, line = file_line.rpartition(":")
    # confirm that line number is an int, raise otherwise
    int(line)
    return filename


def step_out_of_file_once():
    orig_file_name = get_file_name()
    current_file_name = orig_file_name
    counter = 0
    # keep stepping until the file changes, giving up after MAX_STEPS
    for x in range(MAX_STEPS):
        gdb.execute("step", from_tty=False, to_string=True)
        counter += 1
        current_file_name = get_file_name()
        if orig_file_name != current_file_name:
            break
    print("%s: %30s, %s: %s" % ("new", current_file_name, "steps", counter))


class LeaveThisFile(gdb.Command):
    """step out of the current file"""

    def __init__(self):
        gdb.Command.__init__(
            self, "leave_this_file", gdb.COMMAND_DATA, gdb.COMPLETE_SYMBOL, True
        )

    def invoke(self, arg, from_tty):
        # interpret the arg as a number of times to execute the command
        # 1 by default
        if arg:
            arg = int(arg)
        else:
            arg = 1
        for x in range(arg):
            step_out_of_file_once()


# register the command with gdb
LeaveThisFile()
Here's some example output when running GNU grep under gdb
2415 {
(gdb) startQuit
(gdb) source leave_this_file.py
(gdb) leave_this_file 15
new: setlocale.c, steps: 18
new: pthread_rwlock_wrlock.c, steps: 8
new: ../sysdeps/unix/sysv/linux/x86/hle.h, steps: 3
new: pthread_rwlock_wrlock.c, steps: 1
new: setlocale.c, steps: 7
new: ../sysdeps/x86_64/multiarch/../strcmp.S, steps: 1
new: setlocale.c, steps: 48
new: getenv.c, steps: 4
new: ../sysdeps/x86_64/strlen.S, steps: 2
new: getenv.c, steps: 16
new: ../sysdeps/x86_64/multiarch/../strcmp.S, steps: 64
new: getenv.c, steps: 53
new: setlocale.c, steps: 16
new: ../sysdeps/x86_64/multiarch/../strchr.S, steps: 5
new: setlocale.c, steps: 23
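The script above gets the file name by parsing the text output of frame 0. As a rough sketch of the alternative it mentions, gdb's Python API exposes gdb.Frame.find_sal(), whose result carries a symtab attribute (None when gdb has no symbol table for the frame) with a filename. The helper name frame_file_name is hypothetical, and the stand-in objects below exist only so the function can be exercised outside gdb.

```python
from types import SimpleNamespace


def frame_file_name(frame):
    """Return the source file of a gdb.Frame-like object, or None if unknown."""
    sal = frame.find_sal()
    if sal.symtab is None:
        return None
    return sal.symtab.filename


# Inside gdb this would be called as frame_file_name(gdb.newest_frame()).
# Outside gdb, stand-in objects with the same attributes show the behaviour.
fake_frame = SimpleNamespace(
    find_sal=lambda: SimpleNamespace(
        symtab=SimpleNamespace(filename="grep.c"), line=2415
    )
)
no_symbols = SimpleNamespace(find_sal=lambda: SimpleNamespace(symtab=None, line=0))

print(frame_file_name(fake_frame))
print(frame_file_name(no_symbols))
```

Returning None for frames without a symbol table lets the caller decide whether such a frame counts as a different file.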
I am having trouble identifying the full command from a core dump file using gdb.
The crashed command itself can be long, e.g.
myCommand -f log/SlaRunTimeReport.rep -I input/myFile.txt -t output/myFile.txt
But when I use gdb to identify the command from the "Core was generated by" line, e.g. by executing
gdb -c core.56536
The output:
GNU gdb (GDB) Red Hat Enterprise Linux 7.10-20.el7
….
Core was generated by `myCommand -f log/SlaRunTimeReport.rep -I input/myFile.t'.
As you can see, the full command (executable + parameters) is cut off in the middle:
`myCommand -f log/SlaRunTimeReport.rep -I input/myFile.t'
In addition, the strings command did not help to identify the full command either:
strings core.56536 | grep PMRunTimeReport
The output:
myCommand
myCommand -f log/SlaRunTimeReport.rep -I input/myFile.t
Is there any way to get the full command that caused the failure from the core dump file?
Thanks in advance
Is there any way to get from coredump file the full command that caused the failure
There are multiple ways, but running strings is the wrong way.
If you built your program with debug info, you should be able to simply execute the up command until you reach main, and then examine argv[0] through argv[argc-1].
If your main was not built with debug info, or if it doesn't use argc and argv, you should be able to recover that info from the __libc_argc and __libc_argv variables. Example:
$ ./a.out foo bar baz $(python -c 'print "a" * 500')
Aborted (core dumped)
$ gdb -q ./a.out core
Core was generated by `./a.out foo bar baz aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'.
Note that the "generated by" is truncated -- it comes from a fixed length array inside of struct prpsinfo, saved in NT_PRPSINFO ELF note in the core.
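To illustrate why a fixed-length field loses the tail of a long command line, here is a small model of the truncation. It assumes the field holds 80 bytes including the terminating NUL (the ELF_PRARGSZ value used on Linux, taken here as an assumption). The helper name psargs_preview is hypothetical, and this is a simulation rather than a reader for real core files.

```python
ELF_PRARGSZ = 80  # assumed size of the pr_psargs field on Linux


def psargs_preview(argv, size=ELF_PRARGSZ):
    """Model the command line as it would fit in a fixed-size psargs field."""
    command_line = " ".join(argv)
    # One byte is reserved for the terminating NUL.
    return command_line[: size - 1]


argv = ["./a.out", "foo", "bar", "baz", "a" * 500]
preview = psargs_preview(argv)
print(len(" ".join(argv)), len(preview))
```

Whatever the exact field size, the point is that the full argument vector still lives in the process memory captured by the core, which is what the rest of this answer reads.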
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fab38cfcf2b in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.27-15.fc28.x86_64
(gdb) p (int)__libc_argc
$1 = 5
(gdb) p ((char**)__libc_argv)[0]@5
$2 = {0x7ffede43289f "./a.out", 0x7ffede4328a7 "foo", 0x7ffede4328ab "bar",
0x7ffede4328af "baz",
0x7ffede4328b3 'a' <repeats 200 times>...}
This last line is actually a lie -- we know that 'a' repeats 500 times.
We can fix it like so:
(gdb) set print elem 0
(gdb) p ((char**)__libc_argv)[0]@5
$3 = {0x7ffede43289f "./a.out", 0x7ffede4328a7 "foo", 0x7ffede4328ab "bar",
0x7ffede4328af "baz",
0x7ffede4328b3 'a' <repeats 500 times>}
Voila: we now have the complete command.
Lastly, if you install debug info for GLIBC, you can simply look at the __libc_start_main frame (which called your main):
(gdb) set backtrace past-main
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fab38ce7561 in __GI_abort () at abort.c:79
#2 0x00000000004004ef in main () at foo.c:3
#3 0x00007fab38ce918b in __libc_start_main (main=0x4004e6 <main>, argc=5, argv=0x7ffede431118,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffede431108)
at ../csu/libc-start.c:308
#4 0x000000000040042a in _start ()
Here you can clearly see argc and argv in frame 3, and can examine that argv like so:
(gdb) fr 3
#3 0x00007fab38ce918b in __libc_start_main (main=0x4004e6 <main>, argc=5, argv=0x7ffede431118,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffede431108)
at ../csu/libc-start.c:308
308 result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
(gdb) p argv[0]@5
$1 = {0x7ffede43289f "./a.out", 0x7ffede4328a7 "foo", 0x7ffede4328ab "bar",
0x7ffede4328af "baz",
0x7ffede4328b3 'a' <repeats 500 times>}
I am using OCaml 4.04.0+spacetime in order to profile my app's memory usage. However, I fail to install jbuilder 1.0 with this compiler. The error message is:
# opam-version 1.2.2 (aa258ecc06d3aea5a67f442a4ffd23f2a457180b)
# os linux
# command ocaml bootstrap.ml
# path /home/opam/.opam/4.04.0+spacetime/build/jbuilder.1.0+beta14
# compiler 4.04.0+spacetime
# exit-code 137
# env-file /home/opam/.opam/4.04.0+spacetime/build/jbuilder.1.0+beta14/jbuilder-23292-ffb3fd.env
# stdout-file /home/opam/.opam/4.04.0+spacetime/build/jbuilder.1.0+beta14/jbuilder-23292-ffb3fd.out
# stderr-file /home/opam/.opam/4.04.0+spacetime/build/jbuilder.1.0+beta14/jbuilder-23292-ffb3fd.err
### stdout ###
# /home/opam/.opam/4.04.0+spacetime/bin/ocamllex.opt -q src/sexp_lexer.mll
# /home/opam/.opam/4.04.0+spacetime/bin/ocamllex.opt -q src/meta_lexer.mll
# /home/opam/.opam/4.04.0+spacetime/bin/ocamldep.opt -modules src/action.ml src/action_intf.ml src/alias.ml src/ansi_color.ml src/arg_spec.ml src/artifacts.ml src/bin.ml src/build.ml src/build_interpret.ml src/build_system.ml src/clflags.ml src/cm_kind.ml src/config.ml src/context.ml src/file_tree.ml src/findlib.ml src/future.ml src/gen_meta.ml src/gen_rules.ml src/glob_lexer.boot.ml src/import.ml src/install.ml src/io.ml src/jbuild.ml src/jbuild_load.ml vendor/boot/jbuilder_opam_file_format.ml vendor/boot/jbuilder_re.ml src/js_of_ocaml_rules.ml src/lib.ml src/lib_db.ml src/loc.ml src/log.ml src/main.ml src/merlin.ml src/meta.ml src/meta_lexer.ml src/ml_kind.ml src/mode.ml src/module.ml src/module_compilation.ml src/ocaml_flags.ml src/ocamldep.ml src/odoc.ml src/opam_file.ml src/ordered_set_lang.ml src/package.ml src/path.ml src/sexp.ml src/sexp_lexer.ml src/string_with_vars.ml src/super_context.ml src/top_closure.ml src/utils.ml src/utop.ml src/vfile_kind.ml src/watermarks.ml src/workspace.ml > boot-depends.txt
# /home/opam/.opam/4.04.0+spacetime/bin/ocamlopt.opt -w -40 -o boot.exe unix.cmxa boot.ml
### stderr ###
# Killed
This problem also occurs with OCaml 4.05.0+spacetime. My machine runs Ubuntu 16.04.
I have tried on macOS and the problem does not come up there, so this could be a Linux-specific problem.
Does anyone know what's going on?
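The report above shows exit-code 137 together with "Killed" on stderr. Under the common shell convention that an exit status of 128 + N means the process was terminated by signal N, the code can be decoded as in this small sketch. The helper name describe_exit_code is hypothetical, and the signal name lookup assumes a POSIX system.

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit status, where 128 + N means signal N."""
    if code > 128:
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            return "exit status %d (no matching signal)" % code
    return "exit status %d" % code


print(describe_exit_code(137))
```

A SIGKILL during a build is often, though not always, the kernel's out-of-memory killer at work. Checking the kernel log on the Ubuntu machine would be one way to confirm or rule that out.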
I have trained my own network, and the training went fine. I have also used 'caffe time', and it estimates the time for the forward and backward passes normally. However, when I run this (using this ref):
./build/examples/cpp_classification/classification.bin models/own_net/deploy.prototxt examples/RSR_50k_all_1k_db/snapshot_iter_10000.caffemodel examples/RSR_50k_all_1k_db/mean.binaryproto examples/RSR_50k_all_1k_db/labels.txt /home/ubuntu/datasets/RSR_50k_1ll_1k/Testing/[0]/outfile243.jpg
This generates an error:
F0426 10:10:50.063822 2714 classification.cpp:63] Check failed: net_->num_outputs() == 1 (2 vs. 1) Network should have exactly one output.
*** Check failure stack trace: ***
@ 0xf6c5d060 (unknown)
@ 0xf6c5cf5c (unknown)
@ 0xf6c5cb78 (unknown)
@ 0xf6c5ef98 (unknown)
@ 0xd10c Classifier::Classifier()
@ 0xb0a2 main
@ 0xf672c632 (unknown)
Aborted
When I use the same command to classify the stock cat image using caffenet, it works just fine. I suspect there is a problem with the label file. My label file simply lists all the labels, one per line. Any idea what I am doing wrong?
Unfortunately, the net is not mine, so I don't think I am allowed to share the full structure. However, it has some conv, relu, and fc layers and ends with this layer:
layer {
  name: "prob"
  type: "Softmax"
  bottom: "ip3"
  top: "prob"
}
which I suspect might be the culprit.
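The check that fails compares the number of network outputs with 1 and reports 2, so it may help to list which blobs end up as outputs. The following is a rough sketch, assuming that an output is any top blob that no later layer consumes as a bottom. The layer list is hypothetical, since the real prototxt is not shown, and net_outputs is an illustrative helper rather than a Caffe API.

```python
def net_outputs(layers):
    """Return blobs that are produced but never consumed, in layer order.

    Each layer is a dict with optional 'bottom' and 'top' lists. A blob
    becomes available when a layer lists it as a top and stops being
    available once a later layer lists it as a bottom.
    """
    available = []
    for layer in layers:
        for blob in layer.get("bottom", []):
            if blob in available:
                available.remove(blob)
        for blob in layer.get("top", []):
            if blob not in available:
                available.append(blob)
    return available


# Hypothetical network: a data layer whose "label" top is never used.
layers = [
    {"name": "data", "top": ["data", "label"]},
    {"name": "ip3", "bottom": ["data"], "top": ["ip3"]},
    {"name": "prob", "bottom": ["ip3"], "top": ["prob"]},
]
print(net_outputs(layers))
```

In this made-up example the unused "label" blob is the second output. Applying the same bookkeeping to the real deploy.prototxt would show whether the extra output comes from something other than the final Softmax layer.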