How can I locate a zgemm error thrown by MKL? - fortran

I have a big Fortran code and for some calculations I get this to stdout:
Intel MKL ERROR: Parameter 13 was incorrect on entry to ZGEMM
I tried to check the ldc-parameter for my most common zgemms, but I can't possibly check all of them by hand. Is there a way to trigger an error rather than just a warning, so I can find the location and possibly even get a core-dump?
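One way to avoid checking every call by hand is to put a small guard in front of each ZGEMM call that aborts as soon as the leading dimension is invalid. The reference BLAS requires ldc >= max(1, m) for ZGEMM. The sketch below is written in C++ purely for illustration (the same test is a short IF block in Fortran). The names zgemm_ldc_ok, require_zgemm_ldc and REQUIRE_ZGEMM_LDC are hypothetical and not part of MKL.

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper (not an MKL function): true when ldc satisfies the
// reference BLAS rule for ZGEMM, ldc >= max(1, m).
inline bool zgemm_ldc_ok(int m, int ldc) {
    return ldc >= std::max(1, m);
}

// Hypothetical guard: report the call site and abort. std::abort raises
// SIGABRT, which produces a core dump when core dumps are enabled.
inline void require_zgemm_ldc(int m, int ldc, const char* file, int line) {
    if (!zgemm_ldc_ok(m, ldc)) {
        std::fprintf(stderr, "%s:%d: invalid ldc=%d for m=%d\n", file, line, ldc, m);
        std::abort();
    }
}

// Convenience macro that records the location of the offending call.
#define REQUIRE_ZGEMM_LDC(m, ldc) require_zgemm_ldc((m), (ldc), __FILE__, __LINE__)
```

Placing REQUIRE_ZGEMM_LDC(m, ldc) immediately before a call turns the silent warning into a hard stop at the first bad call, so the backtrace points at the caller.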

Related

How can I debug Eigen alignment errors when they seem unrelated to the exact code which triggers them

I am writing code which uses the Eigen matrix library for coordinate transforms, and also PCL for point cloud processing (which also uses Eigen a lot). I keep getting assertion errors from Eigen, about unaligned accesses, despite the fact that I have observed everything in the documentation about alignment of Eigen types (https://eigen.tuxfamily.org/dox/group__DenseMatrixManipulation__Alignement.html).
I can only trigger this assertion when some Eigen code has run before, but was unsuccessful in pinpointing what the exact conditions are. For instance, this is the code that crashes:
Affine3f Transform::getAffine() const {
// ... Vector3f translation(...)
// ... Quaternionf rotation(...)
Affine3f affine = Affine3f::Identity(); /// <---
affine.translate(translation);
affine.rotate(rotation);
return affine;
}
but only if some Eigen code has been executed before. Maybe that is because the problem only arises after some allocations have been made by the Eigen::aligned_allocator.
However, the help pages tell me I should use a debugger to check exactly which object is unaligned:
For example, if you're using GCC, you can use the GDB debugger as follows:
$ gdb ./my_program # Start GDB on your program
> run # Start running your program
... # Now reproduce the crash!
> bt # Obtain the backtrace
Now that you know precisely where in your own code the problem is happening, read on to
understand what you need to change.
I am of course doing that, but the code crashing here seems to satisfy all the requirements.
Question
How can I effectively debug what code causes the misalignment when the error is only triggered during later allocations?
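One thing that may help narrow this down is to log, at the points where objects are created, whether their addresses actually have the alignment Eigen expects, instead of waiting for the assertion to fire later. The helper below is a minimal sketch. The name is_aligned is hypothetical, and the 16-byte figure is an assumption that fits SSE builds (builds with AVX enabled may need 32).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when the address p is a multiple of `alignment`
// bytes. `alignment` must be non-zero.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For example, is_aligned(&affine, 16) or is_aligned(this, 16) inside a constructor shows which instance ended up at a misaligned address, which is usually the object that owns the Eigen member rather than the line that asserts.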

C++ - How to debug SIGILL ILL_ILLOPN

Recently I ran into a crash while the following statement is getting executed
static const float kDefaultTolerance = DoubleToFloat(0.25);
where DoubleToFloat is defined as below
static inline float DoubleToFloat(double x){
return static_cast<float>(x);
}
The log shows the following:
09-04 01:08:50.727 882 882 F DEBUG : signal 4 (SIGILL), code 2 (ILL_ILLOPN), fault addr 0x7f9e3ca96818
From what I have read about SIGILL, it is raised when a process tries to execute an invalid instruction. So I suspect the compiler (clang in my case) is generating bad code when translating the snippet above. How can I check what the compiler is generating and see what is going wrong in this particular case? Also, are there any tools for debugging this kind of issue?
I had a similar problem today. In the end, I found that the cause was that AVX instructions were being used for floating-point operations, but the computer did not support the AVX instruction set. You could try using the SSE instruction set instead.
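If an unsupported instruction set is the suspect, a quick runtime check can confirm it before digging through disassembly. The sketch below assumes GCC or Clang targeting x86, where the builtin __builtin_cpu_supports is available. On other compilers or architectures it simply reports false. The function name cpu_has_avx is hypothetical.

```cpp
// Hypothetical helper: true when the running CPU reports AVX support.
// Uses the GCC/Clang builtin __builtin_cpu_supports, which this sketch only
// compiles in when targeting x86; everywhere else it returns false.
inline bool cpu_has_avx() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}
```

If this returns false on the machine that crashes while the binary was built with AVX enabled, rebuilding for a lower instruction set is the likely fix.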

Reed solomon error correction and false positives

I have a Reed-Solomon encoder/decoder. After manipulating data and evaluating the results, I have experienced the following 3 cases:
The decoder decodes the message correctly and does not throw an error
The decoder decodes the message to a wrong result without complaining, effectively producing a false positive. The chance should be very low, but it can happen even if the amount of manipulated data is far below the error correction ability (even after changing a single bit...)
The decoder fails (throws an error) if more data is manipulated than its error correction ability allows.
Are all 3 cases valid for a proper Reed-Solomon decoder? I am especially unsure about case 2, where the decoder would produce a wrong result (without throwing an error), even if there are much fewer errors than what is allowed by its correction abilities...?
mis-correction below error correction ability
This would indicate a bug in the code. An RS decoder should never fail if there are ⌊(n-k)/2⌋ or fewer errors.
detection when there are more errors than the error correction ability
Even if there are more than ⌊(n-k)/2⌋ errors, there is a good chance that an RS decoder will still detect an uncorrectable error. A working RS decoder should only ever produce a valid codeword or indicate an uncorrectable error, and most error patterns do not leave the received word within ⌊(n-k)/2⌋ error symbols of any valid codeword. Miscorrection of more than ⌊(n-k)/2⌋ errors involves the decoder creating an additional ⌊(n-k)/2⌋ or fewer error symbols, resulting in a valid codeword, but one that differs from the original by n-k+1 or more symbols.
Detecting an uncorrectable error can be done by regenerating syndromes for the corrected codeword, but it's usually caught sooner when solving the error locator polynomial (normally done by looping through all possible locator values), when it produces fewer locators than it should due to duplicate or missing roots.
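The syndrome check mentioned above can be sketched in a few lines. This is a minimal illustration, not the code from the demo programs linked below. It assumes GF(2^8) with field polynomial 0x11d, alpha = 2, and generator roots alpha^0 .. alpha^(nsym-1). All function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Poly = std::vector<std::uint8_t>;  // coefficients, highest degree first

// Multiply in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11d, an assumption).
std::uint8_t gf_mul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t r = 0;
    while (b != 0) {
        if (b & 1) r ^= a;
        const bool carry = (a & 0x80) != 0;
        a = static_cast<std::uint8_t>(a << 1);
        if (carry) a ^= 0x1d;
        b >>= 1;
    }
    return r;
}

// Evaluate p(x) with Horner's rule.
std::uint8_t poly_eval(const Poly& p, std::uint8_t x) {
    std::uint8_t y = 0;
    for (std::uint8_t c : p) y = static_cast<std::uint8_t>(gf_mul(y, x) ^ c);
    return y;
}

// Systematic encoder: appends nsym parity symbols so that the codeword
// polynomial has roots alpha^0 .. alpha^(nsym-1).
Poly rs_encode(const Poly& msg, std::size_t nsym) {
    Poly gen{1};
    std::uint8_t root = 1;
    for (std::size_t i = 0; i < nsym; ++i) {      // gen *= (x + alpha^i)
        Poly next(gen.size() + 1, 0);
        for (std::size_t j = 0; j < gen.size(); ++j) {
            next[j] ^= gen[j];
            next[j + 1] ^= gf_mul(gen[j], root);
        }
        gen = next;
        root = gf_mul(root, 2);
    }
    Poly out(msg);
    out.resize(msg.size() + nsym, 0);
    for (std::size_t i = 0; i < msg.size(); ++i) {  // remainder of msg * x^nsym
        const std::uint8_t coef = out[i];
        if (coef == 0) continue;
        for (std::size_t j = 1; j < gen.size(); ++j)
            out[i + j] ^= gf_mul(gen[j], coef);
    }
    for (std::size_t i = 0; i < msg.size(); ++i) out[i] = msg[i];
    return out;
}

// True when every syndrome c(alpha^i), i = 0 .. nsym-1, is zero.
bool syndromes_all_zero(const Poly& codeword, std::size_t nsym) {
    std::uint8_t root = 1;
    for (std::size_t i = 0; i < nsym; ++i) {
        if (poly_eval(codeword, root) != 0) return false;
        root = gf_mul(root, 2);
    }
    return true;
}
```

A decoder can run syndromes_all_zero on its own output. If the "corrected" word still has a non-zero syndrome, it is not a valid codeword and the decoder should report an uncorrectable error instead of returning it.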
I wrote some interactive RS demo programs in C, for both 4 bit and 8 bit fields, that include the 3 most common decoders (PGZ (matrix), BM (discrepancy), SY (extended Euclid)). Note the SY - extended Euclid decoders in my examples emulate a hardware register oriented solution, two registers, always shift left, each register holds two polynomials where the split shifts left along with the register. The right half of each register is reversed (least significant coefficient first). The wiki article example may be easier to follow.
http://rcgldr.net/misc/eccdemo4.zip
http://rcgldr.net/misc/eccdemo8.zip

forrtl: severe (157): Program Exception - access violation

I am using "thrgibbs1f90b", one of the BLUPF90 family of programs, which is written in Fortran and used for Gibbs sampling to estimate the variance component for binary data. Each time I try to run thrgibbs1f90b I get the following error message:
forrtl: severe (157): Program Exception - access violation
Image PC Routine Line Source
thrgibbs1f90b.exe 0000000140021961 Unknown Unknown Unknown
thrgibbs1f90b.exe 000000014000BB5B Unknown Unknown Unknown
thrgibbs1f90b.exe 000000014026B41C Unknown Unknown Unknown
thrgibbs1f90b.exe 000000014024F4E3 Unknown Unknown Unknown
kernel32.dll 0000000076E2652D Unknown Unknown Unknown
ntdll.dll 0000000076F5C521 Unknown Unknown Unknown
Any idea why I have this message?
Thanks!
Two educated guesses
The program has tried to read from or write to an array element which doesn't exist, such as the 26th element of a 25-element array.
There is a mismatch between the dummy arguments specified for a procedure and the actual arguments in a call to the procedure; for example passing a 4-byte real value when an 8-byte value is expected (or vice-versa)
Either of these might lead to an attempt to access a memory location to which the program's process has no rights of access. There are many other possible causes, but in my experience these are the most common errors in Fortran programs which give rise to such error messages.
Both of these are easy to spot: you need to (re-)compile your program with compiler options set to check for these conditions (with Intel Fortran, for example, /check:bounds and /warn:interfaces).

Memory error: Dereference null pointer/ SSE misalignment

I'm compiling a program on a remote Linux server. The program compiles, but when I run it, it ends abruptly. So I debugged the program using DDT, which reports the following error:
Process 0:
Memory error detected in ClassName::function (filename.cpp:6462).
Thread 1 attempted to dereference a null pointer or execute an SSE instruction with an
incorrectly aligned memory address (the latter may sometimes occur spuriously if guard
pages are enabled)
Tip: Use the stack list and the local variables to explore your program's current
state and identify the source of the error.
Can anyone please tell me what exactly this error means?
The line where the program stops looks like this:
SumUtility = ParaEst[0] + hhincome * ParaEst[71] + IsBlack * ParaEst[61] + IsBachAss * (ParaEst[55]);
This is within a switch case.
These are the variable types
vector<double> ParaEst;
double hhincome;
int IsBlack, IsBachAss;
Thanks for the help!
It means that:
ParaEst is NULL or a bad Pointer
ParaEst's individual array values are not aligned to 16-byte boundaries, required for SSE.
hhincome, IsBlack, or IsBachAss are not aligned to 16-byte boundaries and are SSE type values.
SumUtility is not aligned to 16-bytes and is a SSE type field.
If you could post the assembly code of the exact line that failed along with the register values at that instruction, we could tell you exactly which of the above conditions has failed. It would also help to see the type of each variable, to narrow down the root cause.
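Since the failing line indexes ParaEst at fixed positions (0, 55, 61, 71), one cheap thing to rule out is an index past the end of the vector. A minimal sketch, assuming ParaEst is a std::vector<double> as stated in the question; the helper name checked_get is hypothetical:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: bounds-checked read. std::vector::at throws
// std::out_of_range when i >= v.size(), instead of reading invalid memory.
inline double checked_get(const std::vector<double>& v, std::size_t i) {
    return v.at(i);
}
```

Temporarily replacing ParaEst[71] with checked_get(ParaEst, 71) (or ParaEst.at(71)) turns a silent bad read into an exception that names the problem.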
Ok... The problem finally got fixed.
The issue was that the expression where the code was breaking down was in a newly defined function. However, for some weird reason, running the makefile did not pick up these changes and the build was still using the previously compiled .o file. This resulted in garbage values being assigned to the variables within this new function. To top things off, the program calls this function as its first step, hence the systematic breakdown. The technical aspect of this was what Michael alluded to.
After this, I would recommend always having a make clean target in the makefile and using it. Why running the makefile failed to recompile the modified source file is an issue that definitely warrants further discussion.
Thanks for the responses!!