Cannot build: Bazel fails with a "server terminated" message.
The most common cause I've seen for this kind of thing (I can't say for sure without the jvm.out) is running out of memory, which results in the OOM killer terminating the Bazel server process. Between the Bazel server process and a bunch of C++ compilations running in parallel, it's easy to run out. Using a lower --jobs is the most direct way to limit that. --local_ram_resources will help in theory, but Bazel doesn't have a very good idea of how much RAM each compilation command uses, so it's very approximate.
I am writing a program (in C++11) that can optionally be run in parallel using MPI. The project uses CMake for its configuration, and CMake automatically disables MPI if it cannot be found and displays a warning message about it.
However, I am worried about a perfectly plausible use case whereby a user configures and compiles the program on an HPC cluster, forgets to load the MPI module, and does not notice the warning. That same user might then try to run the program, notice that mpirun is not found, load the MPI module, but forget to recompile. If the user then runs the program with mpirun, this will work, but the program will just run a number of times without any parallelization, as MPI was disabled at compile time. To prevent the user from thinking the program is running in parallel, I would like to make the program display an error message in this case.
My question is: how can I detect that my program is being run in parallel without using MPI library functions (as MPI was disabled at compile time)? mpirun just launches the program a number of times, but as far as I know it does not tell the processes it launches that they are being run in parallel.
I thought about letting the program write some test file, and then check if that file already exists, but apart from the fact that this might be tricky to do due to concurrency problems, there is no guarantee that mpirun will even launch the various processes on nodes that share a file system.
I also considered using a system variable to communicate between the processes, but as far as I know there is no system-independent way of doing this (and again, it might cause concurrency issues, as there is no way to coordinate the system calls of the various processes).
So at the moment I have run out of ideas, and I would very much appreciate any suggestions that might help me achieve this. Preferred solutions should be operating-system independent, although a UNIX-only solution would already be of great help.
Basically, you want to detect, in your non-MPI code path, whether you are being run by mpirun or a similar launcher. There is a very similar question, "How can my program detect whether it was launched via mpirun", that already presents one non-portable solution.
Check for environment variables that are set by mpirun. See e.g.:
http://www.open-mpi.org/faq/?category=running#mpi-environmental-variables
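As a rough illustration, the environment-variable approach boils down to a check like the following. The variable names are from memory of what Open MPI and MPICH/Hydra-style launchers export, so treat the list as an assumption to verify against your MPI implementation:

    // Sketch only: test for environment variables that common MPI launchers
    // export to the processes they start. The exact names differ between
    // implementations, so this list is an assumption, not a guarantee.
    #include <cstdlib>

    bool probably_launched_by_mpirun() {
        const char* vars[] = {
            "OMPI_COMM_WORLD_SIZE",  // Open MPI
            "PMI_SIZE",              // MPICH / Hydra-based launchers
            "PMI_RANK",
            "MV2_COMM_WORLD_SIZE"    // MVAPICH2
        };
        for (const char* v : vars) {
            if (std::getenv(v) != nullptr) {
                return true;
            }
        }
        return false;
    }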
As another option, you could get the process id of the parent process and its process name, and compare the name with a list of known MPI launcher binaries such as orted, slurmstepd, hydra, ...1. Everything about that is unfortunately again non-portable.
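A Linux-only sketch of that second idea, reading the parent's command name from /proc; both the mechanism and the list of launcher names are assumptions and, as said, non-portable:

    // Sketch only (Linux): read /proc/<ppid>/comm and compare it against a
    // hand-maintained list of launcher names. The list below is illustrative
    // and will need adjusting for your cluster's MPI stack.
    #include <unistd.h>
    #include <fstream>
    #include <string>

    bool parent_looks_like_mpi_launcher() {
        std::ifstream comm("/proc/" + std::to_string(getppid()) + "/comm");
        std::string name;
        if (!std::getline(comm, name)) {
            return false;  // no /proc or unreadable entry
        }
        return name == "mpirun" || name == "mpiexec" || name == "orted" ||
               name == "hydra_pmi_proxy" || name == "slurmstepd";
    }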
Since launching itself is not clearly defined by the MPI standard, there cannot be a standard way to detect it.
1: Only from memory; please don't take the list literally.
From a user experience point of view, I would argue that always showing a clear message about how the program is being run, such as:
Running FancySimulator serially. If you see this as part of mpirun, rebuild FancySimulator with FANCYSIM_MPI=True.
or
Running FancySimulator in parallel with 120 MPI processes.
would "solve" the problem. A user getting 120 garbled messages will hopefully notice.
I have a program which runs for a long time, about 3 weeks. It's actually a simulation application.
After about that long, the memory usually fills up, the system becomes unresponsive, and I have to restart the whole computer. I really don't want to do that, and since we are talking about Ubuntu Linux 14.04 LTS, I think there must be a way to avoid it. Swap is turned off, because paging parts of the program out to swap would slow it down too much.
The program is written partly in C++ (about 10%) and partly in Fortran (about 90%), and is compiled and linked using the GNU compilers (g++ and gfortran).
Getting to my question:
Is there a good way to protect the system against programs like this that mess it up, other than running them in a virtual machine?
P.S.: I know the program has bugs, but I cannot fix them right now, so I want to protect the system against hangs. I also cannot use a debugger, because the run takes far too long.
Edit:
After some comments, I want to clarify a few things. The code is far too complex. I don't have the time to fix the bugs, and for some versions I don't even get the source code. I have to run it, because we are forced to do so. You don't always have a choice.
Not running the program is not an option, because it still produces some results. So restarting the system is a workaround, but I would like to do better. I now consider ulimit an option; I hadn't thought about that one. It might help.
Limiting this crappy application's memory is the easiest part. You can, for example, use Docker (https://goldmann.pl/blog/2014/09/11/resource-management-in-docker/#_memory) or cgroups, which act like a kind of virtual machine but with much less overhead. ulimit may also be an option, as mentioned in the comments.
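If Docker or cgroups are not available, the same idea behind ulimit can be packaged into a tiny wrapper that sets an address-space limit and then execs the unmodified simulation binary. A minimal sketch, with an arbitrary 8 GiB cap as a placeholder:

    // Sketch only: cap the address space with setrlimit and exec the real
    // program. Limits survive exec, so the simulation itself needs no rebuild.
    // The 8 GiB value is a placeholder to adjust for your machine.
    #include <sys/resource.h>
    #include <unistd.h>
    #include <cstdio>

    int main(int argc, char* argv[]) {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
            return 1;
        }
        struct rlimit lim;
        lim.rlim_cur = lim.rlim_max = 8ULL * 1024 * 1024 * 1024;  // 8 GiB
        if (setrlimit(RLIMIT_AS, &lim) != 0) {
            std::perror("setrlimit");
            return 1;
        }
        execvp(argv[1], &argv[1]);  // only returns on failure
        std::perror("execvp");
        return 1;
    }

With a cap in place, the simulation gets allocation failures (or is killed) instead of dragging the whole machine down, which leads to the next point.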
The real question here is: if your simulation program gets killed when it runs out of memory, can you actually use the results it has generated so far? Does the program do any checkpointing so it can recover from a crash?
Badly written programs with memory leaks also frequently have more serious problems, such as overflows, which can render the results totally useless if you are doing real science.
You may try valgrind to debug the memory issues. Fortran compilers also have nice options for array bounds checking (e.g. gfortran's -fcheck=bounds); you should enable them if you can.
I was required to use Microsoft Windows to build Windows binaries of an open-source project. While the output I got works, I'm wondering whether aggressive anti-malware software could affect the final binary in a bad way (unexpected errors during compilation, problems when using the compiled binary, or a binary that does not run at all).
The question arose after I saw the anti-malware service's CPU time consumption increase specifically while compiling, using up to 600% more CPU time than it usually does during ordinary work on the computer.
If your binary happens to cause a false positive, the AV software will indeed try to modify it, but not silently. So yes, it's safe: if there's a problem, you will know about it.
In my experience, the only thing antivirus software does is delete files, not modify them, and you should be notified when a file is deleted. Still, check whether the antivirus software notifies you during compilation; if it does not, you can expect your program to have been compiled correctly. After compilation the situation may look different: your program can be wrongly flagged for its behavior and deleted when it is executed. By then, however, it has already been compiled correctly.
No doubt the anti-virus software was reacting to the intermediate files created during compilation and linking. In a large project, they can be quite big and would take a fair amount of processing to filter through looking for trouble.
If you have any doubt, you don't have to use Windows. You can use Wine on Linux to run MSVC safely and with no AV software. At a minimum, you could compare (as in diff) the products of a Linux-based build with those from Windows to gain confidence that all is as it should be.
Guys, could you please recommend a tool for spotting memory corruption in a production multithreaded server built with C++ and running under Linux x86_64? I'm currently facing the following problem: every few hours my server crashes with a segfault, and the core dump shows that the error happens in malloc/calloc, which is definitely a sign of memory being corrupted somewhere.
Actually I have already tried some tools without much luck. Here is my experience so far:
Valgrind is a great (I'd even say the best) tool, but it slows the server down too much, making it unusable in production. I tried it on a staging server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my staging server under Valgrind for several hours but still couldn't spot any serious errors.
ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the staging server, in random weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't support threading well? I have no idea.
DUMA - same story as ElectricFence, but even worse. While EF produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is definitely built with the -g flag).
dmalloc - I configured the server to use it instead of the standard malloc routines; however, it hangs after several minutes. Attaching gdb to the process reveals it's hung somewhere inside dmalloc. :(
I'm gradually going crazy and simply don't know what to do next. I still have the following tools to try: mtrace and mpatrol, but maybe someone has a better idea?
I'd greatly appreciate any help on this issue.
Update: I managed to find the source of the bug. However, I found it on the staging server, not the production one, using helgrind/DRD/tsan: there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could have been discovered on the production server without significant slowdowns...
Yes, C/C++ memory corruption problems are tough.
I have also used Valgrind several times; sometimes it revealed the problem and sometimes it didn't.
While examining Valgrind output, don't be too quick to dismiss its results. Sometimes, after considerable time spent elsewhere, you'll find that Valgrind gave you the clue in the first place, but you ignored it.
Another piece of advice is to compare the code changes against the last known stable release. That's not a problem if you use some sort of version control system (e.g. svn). Examine all memory-related functions (e.g. memcpy, memset, sprintf, new, delete/delete[]).
Compile your program with gcc 4.1 (or later) and the -fstack-protector-all switch. If the memory corruption is caused by stack smashing, this should be able to detect it. You might need to play with some of the additional SSP parameters.
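For illustration, here is a toy example (nothing to do with your server) of the kind of stack smashing SSP is meant to catch; built with g++ -fstack-protector-all, a run typically aborts with a "stack smashing detected" message instead of corrupting memory silently:

    // Toy example only: deliberately overruns a stack buffer. With
    // -fstack-protector-all, the clobbered canary is detected when the
    // function returns and the program aborts with a diagnostic.
    #include <cstring>

    void smash() {
        char buf[8];
        std::strcpy(buf, "definitely longer than eight bytes");  // overflow
    }

    int main() {
        smash();
        return 0;
    }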
Have you tried -fmudflap? (The GCC manual describes the related instrumentation options.)
You can try IBM Purify, but I'm afraid it is not open source.
Google Perftools, which is open source, may be of help; see the heap checker documentation.
Try this one:
http://www.hexco.de/rmdebug/
I used it extensively; it has a low performance impact (it mostly increases RAM usage), but the allocation algorithm is the same. It has always proven enough to find any allocation bug. Your program will crash as soon as the bug occurs, and you will get a detailed log.
I'm not sure if it would have caught your particular bug, but the MALLOC_CHECK_ environment variable (malloc man page) turns on additional checking in the default Linux malloc implementation, and typically doesn't have a significant runtime cost.
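To illustrate what it catches, here is a toy heap overrun (unrelated to your server); with MALLOC_CHECK_=3 exported before starting the binary, glibc typically aborts at the free() with a diagnostic instead of letting the corrupted heap blow up later inside malloc:

    // Toy example only: overruns a heap block so the allocator's internal
    // bookkeeping for the next chunk is clobbered. glibc's extra checks
    // (enabled via MALLOC_CHECK_) typically report this at free() time.
    #include <cstdlib>
    #include <cstring>

    int main() {
        char* p = static_cast<char*>(std::malloc(8));
        std::memset(p, 0x41, 64);  // writes far past the 8-byte block
        std::free(p);              // corruption is usually detected here
        return 0;
    }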
I am using a C++ library that is meant to be multi-threaded, and the number of worker threads can be set with a variable. The library uses pthreads. The problem appears when I run the application that is provided as a test of the library on a quad-core machine using 3 threads or more: the application exits with a segmentation fault. When I insert some tracing "cout"s into parts of the library, the problem goes away and the application finishes normally.
When running on a single-core machine, no matter how many threads are used, the application finishes normally.
How can I figure out where the problem stems from?
Is it some kind of synchronization error? How can I find it? Is there any tool I can use to check the code?
Sounds like you're using Linux (you mention pthreads). Have you considered running valgrind?
Valgrind has tools for checking for data races (helgrind) and memory problems (memcheck). Valgrind may be able to find such an error in a debug build without needing to reproduce the crash that the release build produces.
Some general debugging recommendations.
Make sure your build has symbols (compile with -g). This option is orthogonal to other build options (i.e. the decision to build with symbols is independent of the optimization level).
Once you have symbols, take a close look at the call stack of where the seg fault occurs. To do this, first make sure your environment is configured to generate core files (ulimit -c unlimited) and then after the crash, load the program/core in the debugger (gdb /path/to/prog /path/to/core). Once you know what part of your code is causing the crash, that should give you a better idea of what is going wrong.
You are running into a race condition, where multiple threads interact with the same resource without proper synchronization.
There are a whole host of possible culprits, but without the source anything we say is a guess.
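As a concrete illustration of the term (a toy program, not your library): two threads update the same variable with no synchronization, which is exactly the pattern helgrind reports:

    // Toy data race: both threads perform an unsynchronized read-modify-write
    // on a shared counter. On one core the bad interleavings are rare; on a
    // quad-core they happen constantly, matching the reported symptoms.
    // Fix by protecting the counter with a mutex or making it std::atomic.
    #include <iostream>
    #include <thread>

    int counter = 0;  // shared, unprotected

    void work() {
        for (int i = 0; i < 1000000; ++i) {
            ++counter;  // racy
        }
    }

    int main() {
        std::thread a(work);
        std::thread b(work);
        a.join();
        b.join();
        std::cout << counter << "\n";  // usually well below 2000000
        return 0;
    }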
You want to create a core file and then debug the application with it. This sets the debugger up with the state of the application at the point it crashed and lets you examine the variables, registers, etc.
How to do this will vary depending on your system.
A quick Google revealed this:
http://www.codeguru.com/forum/archive/index.php/t-299035.html
Hope this helps.