Where does ray[tune] record `python-core-worker-*.log` files? - ray

Maybe a dumb question since I'm new to ray, but my (single machine) hyperparameter tuning test run is getting some ERRORed trials, and I'm seeing the following messages suggesting I need to find worker logs in order to see a stack trace from my experiment code:
2021-03-14 22:16:56,198 WARNING worker.py:1107 -- A worker died or was killed while executing task ffffffffffffffff06c82ce3c4361bef34d813e601000000.
2021-03-14 22:16:56,201 ERROR trial_runner.py:616 -- Trial experiment_fn_82158_00001: Error processing event.
Traceback (most recent call last):
...
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
Anyone know where to find those log files?
I checked the experiment directory under local_dir, but the only relevant file there is an error.txt with the same message.

If you are running the optimization on Linux, the logs are under
/tmp/ray/<your session>/logs
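For example, on a typical single-machine Linux setup (and assuming you haven't pointed Ray at a custom temp directory), Ray also keeps a session_latest symlink, so something like this should list them:
ls /tmp/ray/session_latest/logs/python-core-worker-*.log
The stack trace from the failing trial usually shows up in those files, and often in the matching worker-*.err files in the same directory.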

Related

How to debug a hanging job resulting from reading from lustre?

I have a job in interruptible sleep state (S), hanging for a few hours.
I can't use gdb (gdb hangs when attaching to the PID).
I can't use strace either, because strace resumes the hanging job =(
The WCHAN field shows the PID is waiting on ptlrpc. After some searching online, it looks like this is a Lustre operation. Print output also revealed that the program is stuck reading data from Lustre. Any idea or suggestion on how to proceed with the diagnosis? Or a possible reason why the hang happens?
You can check /proc/$PID/stack on the client to see the whole stack of the process, which would give you some more information about what the process is doing (ptlrpc_set_wait() is just the generic "wait for RPC completion" function).
That said, what is more likely to be useful is to check the kernel console error messages (dmesg and/or /var/log/messages) to see what is going on. Lustre is definitely not shy about logging errors when there is a problem.
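For example (a rough sketch; replace $PID with the hung process ID, and note that log paths vary by distribution):
cat /proc/$PID/stack
dmesg | grep -iE 'lustre|ptlrpc'
grep -i lustre /var/log/messages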
Very likely this will show that the client is waiting on a server to complete the RPC, so you'll also have to check dmesg and/or /var/log/messages on the server to see what the problem is there. There are several existing docs that go into detail about how to debug Lustre issues:
https://wiki.lustre.org/Diagnostic_and_Debugging_Tools
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
At that point, you are probably best off searching for existing Lustre bugs at https://jira.whamcloud.com/ using the first error messages that are reported, or maybe a stack trace. It is very likely (depending on what error is being hit) that there is already a fix available, and upgrading to the latest maintenance release (2.12.7 currently), or applying a patch (if the bug was only recently fixed), will solve your problem.

Asio Bad File Descriptor only on some systems

Recently I wrote a Discord bot in C++ with the sleepy-discord library.
The problem is that when I run the bot, it shows me the following errors:
[2021-05-29 18:30:29] [info] Error getting remote endpoint: asio.system:9 (Bad file descriptor)
[2021-05-29 18:30:29] [error] handle_connect error: Timer Expired
[2021-05-29 18:30:29] [info] asio async_shutdown error: asio.ssl:336462100 (uninitialized)
I searched far and wide for what could trigger this, but the answers always say something like "a socket wasn't opened", and so on.
The thing is, it works on a lot of systems, but yesterday I rented a VM (running the same system as my computer), and it seems to be the only one giving me this issue.
What could be the reason for this?
Edit: I was asked to provide a reproducible example, but I am not sure how I would write a minimal one, so instead I'm linking the bot in question:
https://github.com/ElandaOfficial/jucedoc
Update:
I tinkered around a bit in the library I am using and was able to increase the websocketpp log level; thankfully, that got me one more line of information:
[2021-05-29 23:49:08] [fail] WebSocket Connection Unknown - "" /?v=8 0 websocketpp.transport:9 Timer Expired
The error triggers when you call s.remote_endpoint() on a socket that is not connected or is no longer connected.
It would happen e.g. when you try to print the endpoint of the socket after an IO error. The usual way to work around that is to store a copy of the remote endpoint as soon as the connection is established, so you don't have to retrieve it when it's too late.
As to why it's happening on that particular VM, you have to shift focus to the root cause. It might be that accept is failing (possibly due to limits on the number of file descriptors, available memory, etc.).
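A minimal sketch of that caching idea, using standalone Asio (the session struct and the handler names are invented for illustration, not taken from sleepy-discord or websocketpp):

#include <asio.hpp>
#include <iostream>

struct session {
    asio::ip::tcp::socket socket;
    asio::ip::tcp::endpoint peer;  // cached copy, safe to log after errors

    explicit session(asio::io_context& io) : socket(io) {}

    // call this once the connect/accept handler reports success
    void on_connected() {
        asio::error_code ec;
        peer = socket.remote_endpoint(ec);  // non-throwing overload
        if (ec)
            std::cerr << "remote_endpoint failed: " << ec.message() << "\n";
    }

    // later, when an IO error arrives, the socket may already be dead,
    // so report the cached endpoint instead of querying the socket again
    void on_error(const asio::error_code& ec) {
        std::cerr << "connection to " << peer << " failed: " << ec.message() << "\n";
    }
};

The error_code overload of remote_endpoint() also avoids the exception / "Bad file descriptor" path if you do end up querying the socket late.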

arm-none-eabi-gdb on stm32: warning: unrecognized item "timeout" in "qSupported" response

I'm using the command-line to do my stm32 development. CubeIDE and Atom are too heavyweight for the specs of my machine.
I compile an elf and bin file with debug support and upload the bin to the stm32. It is a simple LED blink program, and it works.
I start stlink-server, and it reports port 7184. In another terminal I type:
$ arm-none-eabi-gdb
file app.elf
target remote localhost:7184
I do not get a response for about 30 seconds, then arm-none-eabi-gdb reports:
Ignoring packet error, continuing...
warning: unrecognized item "timeout" in "qSupported" response
Ignoring packet error, continuing...
Remote replied unexpectedly to 'vMustReplyEmpty': timeout
stlink-server reports:
Error: recv returned 0. Remote side has closed gracefully. Good.
But not good!
So, what do I do? I can't seem to halt the stm32, set breakpoints, run, etc..
I'm running a mish-mash of stlink-server, arm-none-eabi-gcc, and arm-none-eabi-gdb from various sources, which might not be helping.
I'm using a Chinese ST-LINK v2, which I hear might not have all the pins wired up for debugging, and that I have to short some pins. It uploads the bin OK, though.
Update 1 OK, perhaps a little progress (??)
I start st-util, which reports:
2020-07-06T14:50:03 INFO common.c: F1xx Medium-density: 20 KiB SRAM, 64 KiB flash in at least 1 KiB pages.
2020-07-06T14:50:03 INFO gdb-server.c: Listening at *:4242...
So then in a separate console I type:
$ arm-none-eabi-gdb
(gdb) target remote localhost:4242
(gdb) file app.elf
(gdb) load app.elf
You can't do that when your target is `exec'
Oh. Also:
(gdb) r
Don't know how to run. Try "help target".
So I think I'm getting closer. It appears that I can set breakpoints, and maybe I've just run the commands in the wrong order.
I think maybe I have to do:
exec app.elf
but that doesn't seem to respect the breakpoints.
Hmmm.
Update 2 The saga continues.
This seems better:
$ arm-none-eabi-gdb
(gdb) target remote localhost:4242
(gdb) file app.elf
(gdb) b 26
(gdb) continue
That seems to respect breakpoints; but the debugger reports:
Continuing.
Note: automatically using hardware breakpoints for read-only addresses.
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0800000c in _reset ()
(gdb) print i
No symbol "i" in current context
Hmmm. It seems that the program is now no longer in main(), but in a signal trap, and hence i is not in the current context (even though I defined it in main()).
So reaching a breakpoint basically causes the machine to reset?? Which kinda defeats the point of debugging. So I think I must be doing something wrong (?) and there's a better way of doing it?
Update 3
I switched to the Arduino IDE and uploaded a sketch. Using the procedure above, I didn't get the signal trap problem. I was able to debug my program, set breakpoints, and inspect variables. Nice. The Arduino is obviously incorporating some "secret sauce" that I had not added to my own non-Arduino code.
So, it mostly works. Mostly.
Try resetting the MCU before the load:
target remote localhost:4242
file app.elf
monitor reset halt
load app.elf
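Putting that together with the breakpoint steps from the updates above, a full session looks something like this (st-util listening on port 4242 as before; the breakpoint location is just an example):
$ arm-none-eabi-gdb app.elf
(gdb) target remote localhost:4242
(gdb) monitor reset halt
(gdb) load
(gdb) break main
(gdb) continue
Resetting and halting first means the image is written while the core is stopped, so execution restarts from the reset handler of the code you just loaded, and the breakpoints then behave as expected.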

Why is celery attempting to execute a task that no longer exists?

I am a new hire at a company that is using Django with Celery for asynchronous communication. I am new to Celery as well.
The previous programmer left a Celery task defined which is now obsolete. Every 30 seconds it performs some database retrievals, does a bunch of logic, and then throws the results away. This was apparently a dead end; a different system was completed later on that does what this was actually meant to do.
I have removed the task definition from the Python source file it was found in, and also removed the following section from our settings.py file:
CELERYBEAT_SCHEDULE = {
    "runs-every-30-seconds": {
        "task": "foo.tasks.execute_useless_task",
        "schedule": timedelta(seconds=30),
    },
}
The problem is that Celery is still trying to execute it whenever I start up Django and Celery through supervisor. From /var/log/supervisor/celeryd-stdout---supervisor-FZ2Xqo.log (lines truncated for brevity):
ERROR 2014-11-20 15:26:35,637 base+498 Failed to submit message: u'Unknown task ignored: Task of kind \'foo.tasks.execute_useless_task\' is not registered, please make sure it\'s imported...
ERROR 2014-11-20 15:26:35,751 base+504 Unknown task ignored: Task of kind 'foo.tasks.execute_useless_task' is not registered, please make sure it's imported...
ERROR 2014-11-20 15:26:35,764 base+504 Unknown task ignored: Task of kind 'foo.tasks.execute_useless_task' is not registered, please make sure it's imported...
...and so on...
I have been trying to read through the Celery docs all day to find out exactly how Celery could know about the task now that I've removed all trace of it from our codebase (while being pulled in 50 different directions at once, being a developer, admin, and technical support desk all at the same time, naturally). Most of the documentation or discussion I have found is about a FAILURE to execute tasks, and reversing the directions I found there has led nowhere.
If anyone could direct me to some documentation that specifies exactly all the ways that one can register tasks with celery through Django, so that I can figure out how the heck celery could possibly know that task existed, I would appreciate it.
Answered my own question.
I had expected the message queue to be empty, but it was not. Clearing out Redis fixed the problem for me:
# redis-cli
> FLUSHDB
> FLUSHALL
Answer taken from this question: How do I delete everything in Redis?
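If you'd rather not wipe the whole Redis instance, draining just Celery's queue should also do it. Assuming your project is set up as the Celery app (the project name below is a placeholder; with old django-celery setups the equivalent is python manage.py celery purge):
$ celery -A yourproject purge
This discards every message still sitting in the broker, including the stale foo.tasks.execute_useless_task invocations that beat had already queued.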

Help with a cryptic error message with KGDB - Bogus trace status reply from target: E22

I'm using gdb to connect to a 2.6.31.13 Linux kernel patched with KGDB over Ethernet, and when I try to detach the debugger I get this:
(gdb) quit
A debugging session is active.
Inferior 1 [Remote target] will be killed.
Quit anyway? (y or n) y
Bogus trace status reply from target: E22
After that the session is still open; I can keep pressing Ctrl+D over and over, and the debugger doesn't exit.
I've searched Google for that message and there are just 5 results (and none of them are useful :-/ ).
Any idea what it could be and how to fix it?
If you cleared all breakpoints on the target and "C"ontinued from the latest breakpoint (assuming that the target code didn't crash, etc.), I think you'll be safe: when running, kgdb won't be talking to your gdb anyway, since, if I recall correctly, it only handles the link when stopped (at a breakpoint or exception), awaiting commands.
A few Ctrl-C presses in quick succession, if needed, to get control back in gdb, then "q", and that's it.
That's at least my experience when debugging .ko modules...
I suspect gdb is saying this because it doesn't realize that it is talking to kgdb rather than to a remote gdb server. I don't imagine kgdb would agree to kill a kernel thread just because the debugger exited, anyway!
Hmmm, afterthought:
You're talking about kgdb 'lite', the one that is now part of the kernel tree, right? Because that's the only one I have experience with...
PS on June 3:
I had never seen the exact message you mentioned until I moved to the 2.6.32 kernel (as part of migrating my dev and target machines to Lucid). And then, surprise, I ran into it too. Here, it seems to happen in hopeless situations (like a segfault, or kgdb seemingly running away after missing a breakpoint or single step). The only cure I have found so far is to pkill ddd (gdb) on the dev machine and reboot the target. But the good news is that the kgdb in 2.6.32 seems to be quite a bit more stable than the one in Karmic (2.6.31).
Ctrl+Z should help you quit.