I have a fairly involved program that uses an in-house FFT algorithm. I recently decided to try FFTW for a performance increase. As a simple test to ensure that FFTW would link and run, I added the following code to the beginning of the application; however, when I run it, I get a segmentation fault when creating the fftwf_plan:
const size_t size = 1024;
vector<complex<float> > data(size);
for(size_t i = 0; i < size; ++i) data[i] = complex<float>(i, -i);
fftwf_plan plan =
    fftwf_plan_dft_1d(size,
                      (fftwf_complex*)&data[0],
                      (fftwf_complex*)&data[0],
                      FFTW_FORWARD,
                      FFTW_ESTIMATE);
// ^ seg faults here ^
fftwf_execute(plan);
fftwf_destroy_plan(plan);
Any ideas what would be causing this?
Using FFTW 3.3. I tried two different compilers, g++ 4.1.1 and icc 11.1. Also, the core file shows nothing of significance:
Thread 1.1: Error at 0x00000000
Stack Trace: PC: 000000, FP=Hex Address
EDIT
I reconfigured FFTW to add debug, using the following commands:
setenv CFLAGS "-fPIC -g -O0"
configure --enable-shared --enable-float --enable-debug
make
make install
When the program hits the segmentation fault, it is at a random location inside the fftwf_plan_dft_1d() call; however, the stack trace always shows that it is in or below the function search, which is called by mkplan.
Apparently the issue stems from multi-threading. Even though the main functions in FFTW are thread safe (e.g. fftwf_execute), the functions that create a plan are not. This doesn't fully explain why just running a test on startup failed; however, when I encapsulated the plan creation in mutex locks, the segmentation faults ceased.
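For reference, here is a minimal sketch of that fix, assuming C++11 for std::mutex (the lock name and wrapper function are just for illustration); plan creation and destruction are serialized, while fftwf_execute() itself needs no lock:
#include <complex>
#include <mutex>
#include <vector>
#include <fftw3.h>

static std::mutex fftw_plan_mutex;  // one global lock for all planner calls

void run_fft(std::vector<std::complex<float> >& data) {
    fftwf_plan plan;
    {
        std::lock_guard<std::mutex> lock(fftw_plan_mutex);    // planner is not thread safe
        plan = fftwf_plan_dft_1d(static_cast<int>(data.size()),
                                 reinterpret_cast<fftwf_complex*>(&data[0]),
                                 reinterpret_cast<fftwf_complex*>(&data[0]),
                                 FFTW_FORWARD, FFTW_ESTIMATE);
    }
    fftwf_execute(plan);                                      // thread safe, no lock needed
    {
        std::lock_guard<std::mutex> lock(fftw_plan_mutex);    // destruction touches planner state too
        fftwf_destroy_plan(plan);
    }
}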
The creation and destruction of plans must be single-threaded:
fftw_init_threads();
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    fftw_plan plan;                    // private to each iteration/thread
    #pragma omp critical
    {
        plan = fftw_plan_dft_1d(...);  // plan creation is not thread safe
    }
    fftw_execute(plan);                // or fftw_execute_dft() for multiple in/out FFT operations
    #pragma omp critical
    {
        fftw_destroy_plan(plan);
    }
}
fftw_cleanup_threads();
I'm 3 years late, but I've just stumbled upon a very similar problem, also when using multi-threading (--enable-openmp and fftw_plan_with_nthreads(omp_get_max_threads())). Mine seg faulted on fftw_destroy_plan(p).
It turned out that I didn't pay attention when restructuring my code, and I was calling fftw_cleanup_threads() before calling fftw_destroy_plan(p) ... silly, I know, but it got me chasing my tail for about 1h.
When using multi-threading, fftw_cleanup_threads() needs to be called after all fftw* functions, just as fftw_init_threads() needs to be called before any fftw* function.
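A minimal sketch of that ordering, assuming the threaded double-precision API and arbitrary sizes (link against the fftw3_threads or fftw3_omp library):
#include <fftw3.h>

int main() {
    fftw_init_threads();                 // must come before any other fftw_* call
    fftw_plan_with_nthreads(4);          // e.g. use 4 threads per transform

    fftw_complex* buf = fftw_alloc_complex(1024);
    fftw_plan p = fftw_plan_dft_1d(1024, buf, buf, FFTW_FORWARD, FFTW_ESTIMATE);

    fftw_execute(p);

    fftw_destroy_plan(p);                // destroy all plans first...
    fftw_free(buf);
    fftw_cleanup_threads();              // ...then clean up, as the last fftw_* call
    return 0;
}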
I am working on a fairly large C++ project with a strong emphasis on performance. It therefore relies on the Intel MKL library and on OpenMP. I recently observed a considerable memory leak that I could narrow down to the following minimal example:
#include <atomic>
#include <iostream>
#include <thread>

class Foo {
public:
    Foo() : calculate(false) {}

    // Start the thread
    void start() {
        if (calculate) return;
        calculate = true;
        thread = std::thread(&Foo::loop, this);
    }

    // Stop the thread
    void stop() {
        if (!calculate) return;
        calculate = false;
        if (thread.joinable())
            thread.join();
    }

private:
    // Function containing the loop that is continually executed
    void loop() {
        while (calculate) {
            #pragma omp parallel
            {
            }
        }
    }

    std::atomic<bool> calculate;
    std::thread thread;
};

int main() {
    Foo foo;
    foo.start();
    foo.stop();
    foo.start();

    // Let the program run until the user inputs something
    int a;
    std::cin >> a;

    foo.stop();
    return 0;
}
When compiled with Visual Studio 2013 and executed, this code leaks up to 200 MB of memory per second (!).
By modifying the above code only a little, the leak totally disappears. For instance:
If the program is not linked against the MKL library (which is obviously not needed here), there is no leak.
If I tell OpenMP to use only one thread, (i.e. I set the environment variable OMP_NUM_THREADS to 1), there is no leak.
If I comment out the line #pragma omp parallel, there is no leak.
If I don't stop the thread and start it again with foo.stop() and foo.start(), there is no leak.
Am I doing something wrong here, or am I missing something?
MKL's parallel (default) driver is built against Intel's OpenMP runtime. MSVC compiles OpenMP applications against its own runtime, which is built around the Win32 ThreadPool API. The two most likely don't play nicely together. It is only safe to use the parallel MKL driver with OpenMP code built using the Intel C/C++/Fortran compilers.
It should be fine if you link your OpenMP code with the serial driver of MKL. That way, you may call MKL from multiple threads at the same time and get concurrent serial instances of MKL. Whether n concurrent serial MKL calls are slower than, comparable to or faster than a single threaded MKL call on n threads is likely dependent on the kind of computation and the hardware.
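As a rough sketch of what I mean (my illustration, with made-up sizes), assuming you link against the sequential MKL driver and enable OpenMP in your own code, each OpenMP thread then runs its own serial MKL call:
#include <mkl.h>
#include <vector>

int main() {
    const int n = 256;
    std::vector<double> a(n * n, 1.0), b(n * n, 1.0);

    #pragma omp parallel
    {
        std::vector<double> c(n * n, 0.0);              // per-thread output buffer
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, a.data(), n, b.data(), n,      // shared read-only inputs
                    0.0, c.data(), n);                  // one serial MKL call per thread
    }
    return 0;
}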
Note that Microsoft no longer supports its own OpenMP runtime. MSVC's OpenMP support is stuck at version 2.0, which is more than a decade behind the current specification. There are probably bugs in the runtime (and there are bugs in the compiler's OpenMP support itself), and those are not likely to get fixed. Microsoft doesn't want you to use OpenMP and would rather you favour their own Parallel Patterns Library instead. But PPL is not portable to other platforms (e.g. Linux), so you should really be using Intel Threading Building Blocks (TBB). If you want quality OpenMP support under Windows, use the Intel compiler or one of the GCC ports. (I don't work for Intel.)
Our project uses a few boost 1.48 libraries on several platforms, including Windows, Mac, Android, and iOS.
We are able to consistently (nontrivially but reliably) get the iOS version of the project to crash, and from our investigation we see that ~thread_data_base is being called on the thread's thread_info while its thread is still running.
This seems to happen as a result of the smart pointer reaching a zero count, even though it is obviously still in scope in the thread_proxy function which creates it and runs the requested function in the thread.
This seems to happen in various cases - the call stack is not identical between crashes, though there are a few variations which are common.
Just to be clear - this often requires running code which is creating hundreds of threads, though there are never more than about 30 running simultaneously. I have "been lucky" and hit it very early in the run as well, but that's rare.
I created a version of the destructor which actually catches the code red-handed:
in libs/thread/src/pthread/thread.cpp:
thread_data_base::~thread_data_base()
{
    boost::detail::thread_data_base* const thread_info = detail::get_current_thread_data();
    void *void_thread_info = (void *) thread_info;
    void *void_this = (void *) this;
    // is somebody destructing the thread_data other than its own thread?
    // (remember that its own thread should no longer point to it anyway,
    // because of the call to detail::set_current_thread_data(0) in thread_proxy)
    if (void_thread_info) { // == void_this) {
        __builtin_trap();
    }
}
I should note that (as seen from the commented-out code) I had previously checked to see that void_thread_info == void_this because I
was only checking for the case where the thread's current thread_info was killing itself.
I have also seen cases where the value returned by get_current_thread_data is non-zero and
different from "this", which is really weird.
Also, when I first wrote that version of the code, I wrote:
if (((void*)thread_info) == ((void*)this))
and at run time I got some very weird exception that said something about a virtual function table or something like that - I don't remember. I decided that it was trying to call "==" for this object type and was unhappy with that, so I rewrote it as above, putting the conversions to void * on separate lines of code. That in itself is quite suspicious to me. I am not one to rush to blame compilers, but...
I should also note that when we did catch this happening with the trap, we saw the destructor for ~shared_count appear twice consecutively on the stack in the Xcode source view. Doubly weird.
We tried to look at the disassembly, but couldn't make much out of it.
Again - it looks like this is always a result of the shared_count which seems to be owned by
the shared_ptr which owns the thread_info reaching zero too early.
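To illustrate why that is fatal (a schematic only, not Boost's actual code): if the use-count update is not a genuinely atomic operation on ARM, two owners releasing at about the same time can mis-count, and the object is destroyed while a thread still holds a reference - which matches the premature ~thread_data_base above.
#include <thread>

struct naive_shared_count {
    long use_count = 1;                           // plain long: NOT atomic, NOT safe
    void add_ref() { ++use_count; }
    bool release() { return --use_count == 0; }   // data race if two threads call this concurrently
};

int main() {
    naive_shared_count sc;
    sc.add_ref();                                 // two owners now
    auto drop = [&sc] { if (sc.release()) { /* the managed object would be deleted here */ } };
    std::thread t(drop);                          // both owners release at (roughly) the same time,
    drop();                                       // so the non-atomic decrement can misbehave
    t.join();
    return 0;
}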
Update: it seems that it is possible to get into situations which reach the above trap without the situation doing any harm. Since fixing the issue (see answer) I have seen it happen, but always after thread_info->run() has finished executing. Don't yet understand how...but it's working.
Some additional info:
I should note that the boost.sh from Pete Goodliffe (and modified by others) that is commonly used to compile boost for iOS
has the following note in the header:
: ${EXTRA_CPPFLAGS:="-DBOOST_AC_USE_PTHREADS -DBOOST_SP_USE_PTHREADS"}
# The EXTRA_CPPFLAGS definition works around a thread race issue in
# shared_ptr. I encountered this historically and have not verified that
# the fix is no longer required. Without using the posix thread primitives
# an invalid compare-and-swap ARM instruction (non-thread-safe) was used for the
# shared_ptr use count causing nasty and subtle bugs.
#
# Should perhaps also consider/use instead: -BOOST_SP_USE_PTHREADS
I use those flags, but to no avail.
I found the following which is very tantalizing - it looks like they had the same issue in std::thread:
http://llvm.org/bugs/show_bug.cgi?format=multiple&id=12730
That suggested using an alternate implementation inside boost for ARM processors which also seems to directly address this issue:
spinlock_gcc_arm.hpp
The version included with boost 1.48 uses outdated ARM assembly.
I took the updated version from boost 1.52, but I'm having trouble compiling it.
I get the following error:
predicated instructions must be in IT block
I found a reference to what looks to be a similar use of this instruction here:
https://zeromq.jira.com/browse/LIBZMQ-414
I was able to use the same idea to get the 1.52 code to compile by modifying
the code as follows (I inserted an appropriate IT instruction)
__asm__ __volatile__(
"ldrex %0, [%2]; \n"
"cmp %0, %1; \n"
"it ne; \n"
"strexne %0, %1, [%2]; \n"
BOOST_SP_ARM_BARRIER :
"=&r"( r ): // outputs
"r"( 1 ), "r"( &v_ ): // inputs
"memory", "cc" );
But in any case, there are ifdefs in this file which look for the ARM architecture, which is not defined that way in my environment. After I simply edited the file so that only the ARMv7 code
was left, the compiler complained about the definition of BOOST_SP_ARM_BARRIER:
In file included from ./boost/smart_ptr/detail/spinlock.hpp:35:
./boost/smart_ptr/detail/spinlock_gcc_arm.hpp:39:13: error: instruction requires a CPU feature not currently enabled
BOOST_SP_ARM_BARRIER :
^
./boost/smart_ptr/detail/spinlock_gcc_arm.hpp:13:32: note: expanded from macro 'BOOST_SP_ARM_BARRIER'
# define BOOST_SP_ARM_BARRIER "dmb"
Any ideas??
Figured this out. It turns out that the boost.sh script that I mention in the question chose the incorrect boost flag to address this problem - instead of BOOST_SP_USE_PTHREADS (and the other flag there with it, BOOST_AC_USE_PTHREADS), what is needed on iOS is BOOST_SP_USE_SPINLOCK. This ends up giving pretty much the identical solution used in the std::thread issue referred to in the question.
If you are compiling for any modern iOS device which uses ARMv7, but using an older boost (we are using 1.48), you need to copy the file spinlock_gcc_arm.hpp from a more recent boost (like 1.52). That file is #ifdef'd for the different ARM architectures, but it is not clear to me that the defines it is looking for are defined in the iOS compile environment when using the script. So you can either edit the file (violent but effective) or invest some time to figure out how to make this tidy and correct.
In any case, you may need to insert the extra assembly instruction that I did above in the question:
"it ne; \n"
I have not yet gone back to see if I can delete that now that I have my compile environment working properly.
However, we're not done yet. The code used in boost for this option includes, as discussed, ARM assembly language instructions. The ARM chips support two instruction sets which can't be mixed in a given module (not sure of the exact scope, but evidently file-by-file is an acceptable granularity when compiling). The instructions used in boost for this locking include non-Thumb instructions, but iOS by default uses the Thumb instruction set. The boost code, aware of the instruction-set issue, checks to see that you have ARM enabled but not Thumb, but by default in iOS, Thumb is on.
Getting the compiler to generate non-Thumb ARM code depends on which compiler you are using for iOS - Apple's LLVM or LLVM GCC. GCC is deprecated, and Apple's LLVM is the default when you use Xcode.
For the default Clang + Apple LLVM 4.1, you need to compile using the -mno-thumb flag. Any files in your iOS app which use any part of boost that involves smart pointers will also have to be compiled with -mno-thumb.
To compile boost like this, I think you can just add -mno-thumb to the EXTRA_CPP_FLAGS in the script. (I modified the user-config.jam directly while experimenting and haven't yet gone back to clean up.)
For your app, in Xcode you need to select your target, then go into the Build Phases tab, and there select Compile Sources. There you have the option of adding compile flags, so for each relevant file (which includes boost), add the -mno-thumb flag. You can also do this directly in project.pbxproj, where each file has
settings = { COMPILER_FLAGS = ""; };
you just change this to
settings = { COMPILER_FLAGS = "-mno-thumb"; };
But there's a little more. You also have to modify the darwin.jam file in the tools/build/v2/tools directory. In boost 1.48, there is code that says:
case arm :
{
options = -arch armv6;
}
This has to be modified to
case arm :
{
options = -arch armv7 ;
}
Finally, in the boost.sh script, in the function writeBjamUserConfig(), you should remove the references to -arch armv6.
If somebody knows how to do this a little more generally and cleanly, I'm sure we'd all benefit. For now, this is where I've gotten to, and I hope that this will help other iOS boost.thread users. I hope that the various variants of the boost.sh iOS script out there will be updated. I plan to add some more links to this answer later.
Update: For a great article which describes the issue on the processor level,
see here:
http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu
Enjoy!
I use boost.asio, boost.thread, boost.smart_ptr etc. on the iOS platform. The app always crashes when run in release mode, throwing SIGABRT. The crash call stack is:
__stack_chk_fail
boost::asio::detail::completion_handler
boost::asio::detail::task_io_service_operation::complete
boost::asio::detail::task_io_service::do_run_one
boost::asio::detail::task_io_service::run
boost::asio::io_service::run
![Screenshot: creating an asio work object with a new thread and io_service][1]
When trying to solve the problem, I found the following articles:
[boost-thread-threads-not-starting-on-the-iphone-ipad-in-release-build][2]
[The issue of spin_lock and thumb on iOS][3]
Then I tried adding -mno-thumb to my project's compile flags, and the problem that occurred in release mode is gone.
However, a new bug appeared: EXC_ARM_DA_ALIGN, which crashed where I try to convert network data to host endianness.
As [this article][4] says, ARM instructions require that memory accesses be properly aligned.
Following the article [Exc_arm_da_align][5], I fixed it by using memcpy for the data conversion, instead of casting the pointer directly.
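A minimal sketch of that memcpy fix (the buffer name and field offset are made up for illustration):
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>

uint32_t read_u32(const unsigned char* packet, std::size_t offset) {
    // Crashes with EXC_ARM_DA_ALIGN if (packet + offset) is not 4-byte aligned:
    //   return ntohl(*reinterpret_cast<const uint32_t*>(packet + offset));
    uint32_t raw;
    std::memcpy(&raw, packet + offset, sizeof(raw));  // byte-wise copy works for any alignment
    return ntohl(raw);                                // then convert network to host byte order
}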
[1]: http://i.stack.imgur.com/3ijF4.png
[2]: http://stackoverflow.com/questions/4201262/boost-thread-threads-not-starting-on-the-iphone-ipad-in-release-builds/4245821#4245821
[3]: http://groups.google.com/group/boost-list/browse_thread/thread/7dc1e80659182ab3
[4]: https://brewx.qualcomm.com/bws/content/gi/common/appseng/en/knowledgebase/docs/kb95.html
[5]: http://www.cnblogs.com/unionfind/archive/2013/02/25/2932262.html
Assume that /Qpar is set, and consider the following code:
#pragma loop(hint_parallel(8))
for(int i = 0; i < u; i++)
{
SomeExpensiveCall();
}
My u is small (~50), and SomeExpensiveCall takes ~1 second. The code doesn't appear to be getting parallelized (I commented out the hint and there was no change). Is there any way I can force the compiler to parallelize this?
Something I just thought of - would this have anything to do with the fact that the project containing the above code is in a static library that is linked into a C++/CLI DLL that does not (and cannot) have /Qpar?
Thanks
/Qpar-report:2 ought to tell you what's happening. Most likely it doesn't want to parallelize the loop because the function call has potential side effects.
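If the auto-parallelizer still refuses, one alternative (my suggestion, not part of the original answer) is an explicit OpenMP worksharing loop, which leaves the independence judgement to you rather than the compiler; this assumes SomeExpensiveCall() really is safe to run concurrently:
// Requires /openmp (MSVC) or -fopenmp; dynamic scheduling helps when iterations vary in cost.
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < u; i++)
{
    SomeExpensiveCall();
}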
I'm writing a scientific program to solve Maxwell's equations with C++. The task is data parallel and I want to use OpenMP to make the program parallel. But when I use OpenMP to parallelise a for loop inside a function and run my code, the program gets SIGABRT. I couldn't find out what went wrong. Please help.
The for loop is as follows:
#pragma omp parallel for
for (int i = 0; i < totalNoOfElementsInSecondMesh; i++) {
    FEMSecondMeshElement2D *secondMeshElement = (FEMSecondMeshElement2D *)mesh->secondMeshFEMElement(i);

    if (secondMeshElement->elementType == FEMDelectricElement) {
        if (solutionType == TE)
            calculateEzFieldForDielectricElement(secondMeshElement, i, currentSecondMeshIndex, nextFirstMeshIndex);
        else
            calculateHzFieldForDielectricElement(secondMeshElement, i, currentSecondMeshIndex, nextFirstMeshIndex);
    } else if (secondMeshElement->elementType == FEMXPMLDielectricElement) {
        if (solutionType == TE)
            calculateEzFieldForDielectricPMLElement((FEMPMLSecondMeshElement2D *)secondMeshElement, i, currentSecondMeshIndex, nextFirstMeshIndex);
        else
            calculateHzFieldForDielectricPMLElement((FEMPMLSecondMeshElement2D *)secondMeshElement, i, currentSecondMeshIndex, nextFirstMeshIndex);
    }
}
The compiler is llvm-gcc which came with Xcode 4.2 by default.
Please help.
It is possible you've run into a compiler problem on Lion. See this link:
https://plus.google.com/101546077160053841119/posts/9h35WKKqffL
You can download gcc 4.7 pre-compiled for Lion from a link on that page, and that seems to work fine.
The most likely reason your program crashes is memory corruption when accessing FEMSecondMeshElement2D* secondMeshElement, currentSecondMeshIndex or nextFirstMeshIndex, depending on what the other functions in the if clauses do to them.
I recommend carefully checking how those variables are accessed and declaring them properly as thread-private or shared beforehand.
e.g.
FEMSecondMeshElement2D *secondMeshElement = NULL;
#pragma omp parallel for private(secondMeshElement)
...
Did you try to compile your program with debugging and all warnings enabled, i.e. with the -g -Wall flags?
Then you can use a debugger (e.g. gdb) to debug it.
You can enable core(5) dumps by setting the RLIMIT_CORE resource limit appropriately, with setrlimit(2) or the ulimit shell builtin which calls it. Once you have a core file, gdb can be used for post-mortem analysis. There is also gcore(1) to force a core dump.
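A small sketch of the setrlimit(2) route mentioned above - raise RLIMIT_CORE at program start so a crash actually leaves a core file (raising the hard limit may require privileges):
#include <sys/resource.h>
#include <cstdio>

int main() {
    struct rlimit rl;
    rl.rlim_cur = RLIM_INFINITY;               // soft limit on core file size
    rl.rlim_max = RLIM_INFINITY;               // hard limit (raising it may fail without privileges)
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        std::perror("setrlimit(RLIMIT_CORE)");

    // ... rest of the program; a crash from here on should leave a core dump ...
    return 0;
}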
How does one determine where the mistake is in the code that causes a segmentation fault?
Can my compiler (gcc) show the location of the fault in the program?
GCC can't do that, but GDB (a debugger) sure can. Compile your program using the -g switch, like this:
gcc program.c -g
Then use gdb:
$ gdb ./a.out
(gdb) run
<segfault happens here>
(gdb) backtrace
<offending code is shown here>
Here is a nice tutorial to get you started with GDB.
Where the segfault occurs is generally only a clue as to where "the mistake which causes" it is in the code. The given location is not necessarily where the problem resides.
Also, you can give valgrind a try: if you install valgrind and run
valgrind --leak-check=full <program>
then it will run your program and display stack traces for any segfaults, as well as any invalid memory reads or writes and memory leaks. It's really quite useful.
You could also use a core dump and then examine it with gdb. To get useful information you also need to compile with the -g flag.
Whenever you get the message:
Segmentation fault (core dumped)
a core file is written into your current directory. And you can examine it with the command
gdb your_program core_file
The file contains the state of the memory when the program crashed. A core dump can be useful during the deployment of your software.
Make sure your system doesn't set the core dump file size to zero. You can set it to unlimited with:
ulimit -c unlimited
Be careful though: core dumps can become huge.
There are a number of tools available which help with debugging segmentation faults, and I would like to add my favorite to the list: AddressSanitizer (often abbreviated ASan).
Modern¹ compilers come with the handy -fsanitize=address flag, which adds some compile-time and run-time overhead in exchange for more error checking.
According to the documentation these checks include catching segmentation faults by default. The advantage here is that you get a stack trace similar to gdb's output, but without running the program inside a debugger. An example:
int main() {
    volatile int *ptr = (int*)0;
    *ptr = 0;
}
$ gcc -g -fsanitize=address main.c
$ ./a.out
AddressSanitizer:DEADLYSIGNAL
=================================================================
==4848==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x5654348db1a0 bp 0x7ffc05e39240 sp 0x7ffc05e39230 T0)
==4848==The signal is caused by a WRITE memory access.
==4848==Hint: address points to the zero page.
#0 0x5654348db19f in main /tmp/tmp.s3gwjqb8zT/main.c:3
#1 0x7f0e5a052b6a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x26b6a)
#2 0x5654348db099 in _start (/tmp/tmp.s3gwjqb8zT/a.out+0x1099)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/tmp.s3gwjqb8zT/main.c:3 in main
==4848==ABORTING
The output is slightly more complicated than what gdb would output but there are upsides:
There is no need to reproduce the problem to receive a stack trace. Simply enabling the flag during development is enough.
ASan catches a lot more than just segmentation faults. Many out-of-bounds accesses will be caught even if that memory area was accessible to the process (see the short example after the footnote).
¹ That is Clang 3.1+ and GCC 4.8+.
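For example (my own snippet, not from the ASan documentation), the out-of-bounds read below usually does not segfault because the byte just past the allocation is still mapped, yet ASan reports it as a heap-buffer-overflow with a full stack trace:
#include <cstdlib>

int main() {
    char* buf = static_cast<char*>(std::malloc(16));
    char c = buf[16];             // one past the end: typically no SIGSEGV at all
    std::free(buf);
    return c;                     // keep the read alive so it is not optimized away
}
Compiled with g++ -g -fsanitize=address, the run aborts with a heap-buffer-overflow report instead of silently returning whatever happened to be in that byte.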
All of the above answers are correct and recommended; this answer is intended only as a last-resort if none of the aforementioned approaches can be used.
If all else fails, you can always recompile your program with various temporary debug-print statements (e.g. fprintf(stderr, "CHECKPOINT REACHED # %s:%i\n", __FILE__, __LINE__);) sprinkled throughout what you believe to be the relevant parts of your code. Then run the program and observe which debug-print was the last one printed before the crash occurred - you know your program got that far, so the crash must have happened after that point. Add or remove debug-prints, recompile, and run the test again, until you have narrowed it down to a single line of code. At that point you can fix the bug and remove all of the temporary debug-prints.
It's quite tedious, but it has the advantage of working just about anywhere - the only times it might not work are if you don't have access to stdout or stderr for some reason, or if the bug you are trying to fix is a race condition whose behavior changes when the timing of the program changes (since the debug-prints slow down the program and change its timing).
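A tiny sketch of such a checkpoint helper (the macro and function names are arbitrary), so that all the prints can later be removed by deleting a single definition:
#include <cstdio>

#define CHECKPOINT() \
    std::fprintf(stderr, "CHECKPOINT REACHED @ %s:%d\n", __FILE__, __LINE__)

void suspect_function() {
    CHECKPOINT();        // the last checkpoint printed before the crash bounds the search
    // ... code under suspicion ...
    CHECKPOINT();
}

int main() {
    suspect_function();
    return 0;
}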
Lucas's answer about core dumps is good. In my .cshrc I have:
alias core 'ls -lt core; echo where | gdb -core=core -silent; echo "\n"'
to display the backtrace by entering 'core'. And the date stamp, to ensure I am looking at the right file :(.
Added: If there is a stack corruption bug, then the backtrace applied to the core dump is often garbage. In this case, running the program within gdb can give better results, as per the accepted answer (assuming the fault is easily reproducible). And also beware of multiple processes dumping core simultaneously; some OS's add the PID to the name of the core file.
This is a crude way to find the exact line after which there was the segmentation fault.
Define a line-logging function:
#include <iostream>

void log(int line) {
    std::cout << line << std::endl;
}
Find and replace all the semicolons after the log function with "; log(__LINE__);".
Make sure to remove the log() calls that this replacement inserted inside for (;;) loop headers.
If you have a reproducible exception like a segmentation fault, you can use a tool such as a debugger to reproduce the error.
Here is how I used to find the source code location even for non-reproducible errors. It is based on the Microsoft compiler toolchain, but the underlying idea carries over:
1. Save the MAP file for each binary (DLL, EXE) before you give it to the customer.
2. If an exception occurs, look up the exception address in the MAP file and determine the function whose start address is just below it. As a result you know the function where the exception occurred.
3. Subtract the function's start address from the exception address. The result is the offset within the function.
4. Recompile the source file containing the function with assembly listing enabled.
5. Extract the function's assembly listing.
6. The listing includes the offset of each instruction in the function. Look up the source code line that matches the offset within the function.
7. Evaluate the assembler code for that source code line. The offset points exactly to the assembler instruction that caused the exception. Evaluate the code of this single source line; with a bit of experience with the compiler output you can tell what caused the exception.
Be aware that the reason for the exception might be at a totally different location, e.g. the code dereferenced a NULL pointer, but the actual reason why the pointer is NULL can be somewhere else.
Steps 6 and 7 are a bonus, since you asked only for the line of code, but I recommend being aware of them.
I hope you can get a similar environment with the GCC compiler for your platform. If you don't have a usable MAP file, use the toolchain's tools to get the addresses of the functions. I am sure the ELF file format supports this.
In case any of you (like me!) were looking for this same question but with gfortran rather than gcc: the compiler is much more capable these days, and before resorting to the debugger you can also try out these compile options. For me, this identified exactly the line of code where the error occurred and which variable I was accessing out of bounds to cause the segmentation fault.
-O0 -g -Wall -fcheck=all -fbacktrace