I compiled my C++ file via a makefile, and I run the resulting binary through that makefile as well. This multi-threaded application also uses 99% of the CPU. I am using Ubuntu 16.04.1 LTS as my OS.
After three days of running, I realized that the application had suddenly stopped, and I saw this unexpected error message on the terminal:
Makefile:: recipe for target 'myMain' failed
make: *** myMain Killed
There is no other error message. The application failed without any exception message. And I am fairly confident about the programs I write (with respect to failures), even though nobody writes a foolproof application.
I have never seen a make: *** something Killed message before, either.
Unfortunately, this is a case I cannot easily reproduce to see what is wrong.
I am wondering: does make or Ubuntu have any mechanism that kills an application if it runs for a long time and consumes a huge amount of resources?
Update
Thanks to user Basile Starynkevitch, this is the result I received from dmesg:
[351059.556308] Out of memory: Kill process 2794 (main) score 882 or sacrifice child
[351059.556318] Killed process 2794 (main) total-vm:30432908kB, anon-rss:13530324kB, file-rss:0kB
It's most likely that your program was the victim of the Linux kernel's OOM Killer. See also this question and its answers.
Out of memory: Kill process
Most likely you were running as a regular user and your environment was restricted by the resource limits listed by the ulimit -a command (either memory or the number of processes). Once a hard limit is hit, the process is killed by the Linux kernel.
If you've got enough memory, it's possible to increase these limits (ulimit -Sv); otherwise you need to add more RAM to your machine or some extra swap space.
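For what it's worth, a process can also inspect (and, up to the hard limit, raise) these limits programmatically with getrlimit()/setrlimit(), the same knobs that ulimit manipulates. A minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_AS is the virtual address space limit, i.e. what
       `ulimit -v` reports (in kB; rlim_cur is in bytes). */
    if (getrlimit(RLIMIT_AS, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    if (rl.rlim_cur == RLIM_INFINITY)
        printf("soft limit: unlimited\n");
    else
        printf("soft limit: %llu kB\n",
               (unsigned long long)rl.rlim_cur / 1024);

    /* The soft limit may be raised up to the hard limit without
       any special privileges. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_AS, &rl) != 0)
        perror("setrlimit");

    return 0;
}

If the soft limit printed here is unlimited and the process was still killed, it is the system-wide OOM killer at work, as the dmesg lines above show.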
For more details about this behaviour, see: Kernel - Out Of Memory Management.
When the machine is low on memory, old page frames are reclaimed, but despite reclaiming pages the kernel may find that it is unable to free enough pages to satisfy a request, even when scanning at the highest priority. If it fails to free enough page frames, out_of_memory() is called to see whether the system is out of memory and a process needs to be killed.
Related
I have a job in interruptible sleep state (S) that has been hanging for a few hours.
I can't use gdb (gdb hangs when attaching to the PID).
I can't use strace either; strace resumes the hanging job =(
The WCHAN field shows the PID is waiting on ptlrpc. After some searching online, it looks like this is a Lustre operation. The program's printed output also revealed that it is stuck reading data from Lustre. Any idea or suggestion on how to proceed with the diagnosis? Or possible reasons why the hang happens?
You can check /proc/$PID/stack on the client to see the whole stack of the process, which would give you some more information about what the process is doing (ptlrpc_set_wait() is just the generic "wait for RPC completion" function).
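A plain cat /proc/$PID/stack is enough interactively, but if you want to sample the stack repeatedly from a small monitoring tool, a trivial reader looks like this (note that reading another process's kernel stack generally requires root):

#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    char line[256];
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    /* /proc/<pid>/stack lists the in-kernel call chain of the
       process, e.g. ptlrpc_set_wait() somewhere near the top. */
    snprintf(path, sizeof(path), "/proc/%s/stack", argv[1]);

    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }

    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);

    fclose(f);
    return 0;
}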
That said, what is more likely to be useful is to check the kernel console error messages (dmesg and/or /var/log/messages) to see what is going on. Lustre is definitely not shy about logging errors when there is a problem.
Very likely this will show that the client is waiting on a server to complete the RPC, so you'll also have to check dmesg and/or /var/log/messages on the server to see what the problem is there. There are several existing docs that go into detail about how to debug Lustre issues:
https://wiki.lustre.org/Diagnostic_and_Debugging_Tools
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
At that point, you are probably best off checking for existing Lustre bugs at https://jira.whamcloud.com/, searching for the first error messages that are reported, or maybe a stack trace. It is very likely (depending on what error is being hit) that there is already a fix available, and upgrading to the latest maintenance release (2.12.7 currently), or applying a patch (if the bug was fixed only recently), will solve your problem.
I've been trying to write a kernel module (using SystemTap) that intercepts system calls, captures their information, and appends it to a system-call buffer in a kmalloc'd region. I have implemented an mmap file operation so that a user-space process can access this kmalloc'd region and read from it.
Note: for now, the module only intercepts the memfd_create system call. To test this out, I compiled a test application that calls memfd_create twice.
SystemTap script/kernel module code
In addition to the kernel module, I also wrote a user-space application that periodically reads the system calls from this buffer, determines whether each system call is legitimate or malicious, and then adds a response to a response-buffer region (also part of the kmalloc'd region and accessible via mmap) indicating whether to let the system call proceed or to terminate the calling process.
User space application code
The kernel module also has a timer that fires every few milliseconds to check the response buffer for responses added by user space. Depending on the response, the kernel module either terminates the calling process or lets it proceed.
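For reference, since the code itself isn't posted here: the user-space side of such a scheme usually boils down to mmap'ing the module's character device and treating the mapping as a pair of shared ring buffers. A rough sketch of that idea, where the device path and struct layout are invented for illustration and are not the actual code:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical layout of the kmalloc'd region shared with the
   module; the real struct lives in the unposted kernel code. */
struct shared_region {
    uint32_t syscall_head;   /* advanced by the kernel module  */
    uint32_t syscall_tail;   /* advanced by this process       */
    uint32_t response_head;  /* advanced by this process       */
    /* ... fixed-size syscall and response slots follow ...    */
};

int main(void)
{
    /* "/dev/sysmon" is a made-up device name for illustration. */
    int fd = open("/dev/sysmon", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct shared_region *r =
        mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (r == MAP_FAILED) { perror("mmap"); return 1; }

    /* Poll for newly intercepted syscalls, then write verdicts
       into the response buffer for the module's timer to pick up. */
    printf("head=%u tail=%u\n", r->syscall_head, r->syscall_tail);

    munmap(r, 4096);
    close(fd);
    return 0;
}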
The issue I am facing is that after intercepting a few system calls (I keep executing the test application) and processing them properly, I start having problems executing normal commands in user space. For example, a simple command like ls:
[virsec#redhat7 stap-test]$ ls
Segmentation fault
[virsec#redhat7 stap-test]$ strace ls
execve("/usr/bin/ls", ["ls"], [/* 30 vars */]) = -1 EFAULT (Bad address)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
+++ killed by SIGSEGV +++
Segmentation fault
This happens with every terminal command that I run. The dmesg output shows nothing but the printk debug output of the kernel module. There is no kernel panic; the kernel module and the user-space application are still running and waiting for the next system call to intercept. What do you think could be the issue? Let me know if you need any more information. I was not able to post any code snippets because I think the code makes much more sense as a whole.
I'm working on a large-scale application that spawns numerous processes for dealing with various tasks. In some situations, the OS will kill one of my processes because of memory pressure. That's OK; it's entirely expected, and the parent process handles it gracefully.
What I'd like is to find out why a process was killed. If it was killed because of memory pressure, I want to respawn the task a little later. If it was killed for any other reason, say an assertion failure or an out-of-bounds memory access, I want to log it and investigate.
So, here's my question: how do you find out that a child process was killed because the OS needed the memory?
Question applies to:
Windows;
MacOS;
Linux;
(for bonus points, I'm also interested in Android, but that's not my priority).
Processes are not running as root/admin.
On Linux, you can find out whether a process was killed by the OS by reading the syslog (/var/log/messages or /var/log/syslog on some distributions) or via the dmesg command.
If you spawned the process, you can also detect that it was killed with the SIGKILL(9) signal, as opposed to the SIGSEGV(11) signal that corresponds to the app crashing all by itself, or SIGINT(2)/SIGTERM(15), which mean the application was asked to terminate gracefully.
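On POSIX systems, the parent recovers that signal from the child's wait status; a minimal sketch, where the child simply calls abort() to stand in for a real crash:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: simulate a crash; in real code this would be the
           spawned worker process. */
        abort();
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return 1;
    }

    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        if (sig == SIGKILL)
            printf("killed (possibly the OOM killer): respawn later\n");
        else if (sig == SIGSEGV)
            printf("crashed on its own: log and investigate\n");
        else
            printf("terminated by signal %d\n", sig);
    } else if (WIFEXITED(status)) {
        printf("exited normally with code %d\n", WEXITSTATUS(status));
    }
    return 0;
}

Note that SIGKILL only tells you that something killed the process; cross-checking dmesg for an OOM-killer entry is what ties it to memory pressure.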
Regarding Windows, I only know that this type of monitoring can be enabled via the Application Event Log. There's a GUI application that can help you set it up.
When the OS intervenes in the execution of a process in order to kill it, it does so via signals.
What you can do (on *IX-based/-like platforms) is run dmesg.
It outputs the kernel activity logs.
From there, you can identify the signal that was sent to your process.
For example, this code below --
#include <stdio.h>

int main (void)
{
    char *p = NULL;

    /* Dereferencing a NULL pointer triggers SIGSEGV. */
    printf ("\n%c", *p);
    return 0;
}
causes this output, obtained from dmesg --
[8478285.606105] crash.out[16830]: segfault at 0 ip 0000000000400531 sp 00007fffc373b090 error 4 in crash.out[400000+1000]
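(In this line, "error 4" is the page-fault error code: a user-mode read of an unmapped address, which matches the NULL dereference above.)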
So, the title says it all.
Is it possible for one process to have two tracers?
I am playing around with ptrace, and I can see that whenever someone attaches to a process, the PID of the tracer shows up under TracerPid in /proc/<pid>/status. However, is it possible to have two tracers?
I have two programs (a tracer and a tracee). I ran the tracee in debug mode, and then I ran the tracer and got the error Operation not permitted (even with root permissions).
Regards,
golobich
They can't. This is indirectly confirmed in the ptrace man page:
EPERM  The specified process cannot be traced. This could be because
       the tracer has insufficient privileges (the required capability
       is CAP_SYS_PTRACE); unprivileged processes cannot trace
       processes that they cannot send signals to or those running
       set-user-ID/set-group-ID programs, for obvious reasons.
       Alternatively, the process may already be being traced, or (on
       kernels before 2.6.26) be init(1) (PID 1).
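You can demonstrate this yourself: start a process under gdb, then run a second tracer against the same PID and watch PTRACE_ATTACH fail with EPERM. A minimal sketch:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)atoi(argv[1]);

    /* If <pid> already has a tracer (e.g. gdb), this attach is
       expected to fail with EPERM, even when run as root. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        fprintf(stderr, "PTRACE_ATTACH: %s\n", strerror(errno));
        return 1;
    }

    /* Otherwise we became the (single) tracer: wait for the
       attach-stop, then let the process go again. */
    waitpid(pid, NULL, 0);
    puts("attached; the process had no other tracer");
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}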
I have an application that I need to debug on a target system.
All the relevant TRACE macros are in place to send messages to the debug window; however, I'm having difficulty finding a way to prevent the spam there.
You see, this application is regularly creating and terminating threads, so I am getting a large number of "The thread 0x23CF2B8A has exited with code 0 (0x0)" messages.
I've looked through the various menu options but I can't seem to find a way to disable this automated output.
Is there any way I can do this to clean up my debug window?
Sounds like you could do with a worker thread pool or a fixed number of threads.
Should you go with a fixed number of threads, you will also gain performance, for example by using as many threads as there are CPUs.
Another argument against creating large numbers of threads on the fly is backward compatibility: Windows used to leak resources (on XP SP1, if I remember correctly) when creating/destroying threads, so the process eventually could not ::CreateThread() at all. (Hopefully this is fixed by now, but don't count on it.)
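To make the fixed-pool suggestion concrete, here is a minimal Win32 sketch: the workers are created once at startup and pull task indices off a semaphore-guarded counter (a stand-in for a real work queue), so no threads are created or destroyed per task and the debug window stays quiet:

#include <stdio.h>
#include <windows.h>

#define POOL_SIZE 4    /* ideally: the CPU count, via GetSystemInfo() */
#define NUM_TASKS 16

static HANDLE g_work_sem;          /* one count per queued item     */
static volatile LONG g_next_task;  /* index of the next task to run */

static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    for (;;) {
        /* Block until a task (or a shutdown token) is available. */
        WaitForSingleObject(g_work_sem, INFINITE);
        LONG task = InterlockedIncrement(&g_next_task) - 1;
        if (task >= NUM_TASKS)
            return 0;              /* consumed a shutdown token */
        printf("thread %lu: task %ld\n", GetCurrentThreadId(), task);
    }
}

int main(void)
{
    HANDLE threads[POOL_SIZE];
    int i;

    /* NUM_TASKS work items plus one shutdown token per worker. */
    g_work_sem = CreateSemaphore(NULL, NUM_TASKS + POOL_SIZE,
                                 NUM_TASKS + POOL_SIZE, NULL);

    /* The pool is created exactly once, up front. */
    for (i = 0; i < POOL_SIZE; i++)
        threads[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);

    WaitForMultipleObjects(POOL_SIZE, threads, TRUE, INFINITE);
    for (i = 0; i < POOL_SIZE; i++)
        CloseHandle(threads[i]);
    CloseHandle(g_work_sem);
    return 0;
}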