forrtl: severe (157): Program Exception - access violation - fortran

I am using "thrgibbs1f90b", one of the BLUPF90 family of programs. It is written in Fortran and uses Gibbs sampling to estimate variance components for binary data. Every time I try to run thrgibbs1f90b I get the following error message:
forrtl: severe (157): Program Exception - access violation
Image PC Routine Line Source
thrgibbs1f90b.exe 0000000140021961 Unknown Unknown Unknown
thrgibbs1f90b.exe 000000014000BB5B Unknown Unknown Unknown
thrgibbs1f90b.exe 000000014026B41C Unknown Unknown Unknown
thrgibbs1f90b.exe 000000014024F4E3 Unknown Unknown Unknown
kernel32.dll 0000000076E2652D Unknown Unknown Unknown
ntdll.dll 0000000076F5C521 Unknown Unknown Unknown
Any idea why I have this message?
Thanks!

Two educated guesses:
1. The program has tried to read from or write to an array element which doesn't exist, such as the 26th element of a 25-element array.
2. There is a mismatch between the dummy arguments specified for a procedure and the actual arguments in a call to the procedure; for example, passing a 4-byte real value when an 8-byte value is expected (or vice versa).
Either of these might lead to an attempt to access a memory location to which the program's process has no rights of access. There are many other possible causes, but in my experience these are the most common errors in Fortran programs which give rise to such error messages.
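Both of these are easy to spot: you need to (re-)compile your program with the compiler options that check for these conditions, i.e. run-time array bounds checking and procedure interface checking (with Intel Fortran, for example, -check bounds and -warn interfaces, or /check:bounds and /warn:interfaces on Windows).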

Related

"Program received signal SIGSEGV: Segmentation fault - invalid memory reference." when using large-size array and MPI_BARRIER

I am using Fortran with MPI (Cray's compiler) for my code. I run on 512 cores, and I found that once my variable exceeds a certain size, the code crashes at MPI_BARRIER with the error message
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
.
.
.
One possibly useful piece of information: I print a tag (i.e., write(*,*) "tag") before calling MPI_BARRIER, and I found that the number of printed tags (426) plus the number of repeated error messages (86) equals the number of cores I used (512).
I think this is a memory issue. I use Slurm to submit my job, and I remember trying something like "ulimit -s unlimited" (I can't find the web page now...), but I haven't been able to solve the problem.

Better runtime error in C++ for vectors and address boundary error

In Python, when we access an index out of the array range we get an error output that gives the exact location in the code that had this error:
array = []
index = 0
array[index]
IndexError Traceback (most recent call last)
Untitled-1 in <cell line: 3>()
1 array = []
2 index = 0
----> 3 array[index]
IndexError: list index out of range
But code like this in C++ only gives us a generic address boundary error with both the GCC and Clang compilers:
#include <vector>
int main(int argc, const char **argv) {
std::vector<int> array{};
int index = 0;
int value = array[index];
return 0;
}
Is there a way to get better runtime errors with more detail in C++?
"only give us a generic Address boundary error"
No, it isn't even guaranteed to do that. Accessing out-of-bounds with [] causes undefined behavior in C++. If it happens, you lose any guarantee on the program's behavior. It may fail with some kind of error, but it might also just continue running and produce wrong output, or do anything else. This is a very important difference from Python that must be understood. In C++, if you violate language rules or preconditions of library functions, there is no guarantee that the compiler or the program will tell you about it. In many cases it will just not behave as expected.
To figure out where such an error comes from, you usually run your program under a debugger, which will tell you the line at which e.g. a segmentation fault occurred (if one happened) and allows you to step through the code.
You can guarantee that indexing a std::vector out-of-bounds will generate an error message by using its .at member function instead of indexing with brackets. If the index is then out-of-bounds an exception will be thrown which you can catch or let propagate out of main to terminate the program with some error message. However, the exception doesn't typically carry information about the point at which it was thrown. Again you'll need to run under a debugger to get that information.
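As a rough illustration of the .at approach, the sketch below wraps .at in a helper that re-throws std::out_of_range with the caller's file and line added to the message, which is one way to work around the exception not recording where it was thrown. The names checked_read and CHECKED_READ are made up for this example and are not part of the standard library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): bounds-checked read that prefixes
// the exception message with the call site passed in by the macro below.
inline int checked_read(const std::vector<int>& v, std::size_t index,
                        const char* file, int line) {
    try {
        return v.at(index);  // throws std::out_of_range when index >= v.size()
    } catch (const std::out_of_range& e) {
        // Re-throw with "file:line: " in front of the library's own message.
        throw std::out_of_range(std::string(file) + ":" + std::to_string(line) +
                                ": " + e.what());
    }
}

// Hypothetical macro: captures the file and line of the expression that uses it.
#define CHECKED_READ(v, index) checked_read((v), (index), __FILE__, __LINE__)
```

With this in place, replacing array[index] in the question's program by CHECKED_READ(array, index) throws instead of invoking undefined behavior. If the exception is left uncaught, the program terminates, and common implementations print the message, including the file and line, on the way out.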
Depending on your compiler and platform, if you keep using [], you may also be able to compile your program with a sanitizer enabled which will print a diagnostic including source lines when such an out-of-bounds access occurs. For example GCC and Clang have the address sanitizer which can be enabled with -fsanitize=address at compile-time. The option -g should be added as well to generate debug symbols that will be used in the sanitizer output to reference the source locations.

How can I locate a zgemm error thrown by MKL?

I have a big Fortran code and for some calculations I get this to stdout:
Intel MKL ERROR: Parameter 13 was incorrect on entry to ZGEMM
I tried to check the ldc-parameter for my most common zgemms, but I can't possibly check all of them by hand. Is there a way to trigger an error rather than just a warning, so I can find the location and possibly even get a core-dump?
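One way to narrow the search is to validate the arguments yourself just before each call and fail loudly with a label for the call site. The sketch below is in C++ and assumes MKL applies the same entry checks as the reference BLAS ZGEMM, where parameter 13 is ldc and must satisfy ldc >= max(1, m). The functions gemm_bad_parameter and require_valid_gemm are hypothetical helpers, not MKL routines; the same checks translate directly into a small Fortran wrapper.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of MKL): repeats the argument checks that the
// reference BLAS xGEMM routines perform on entry. Returns 0 when the arguments
// pass, otherwise the 1-based position of the first offending parameter in
// ZGEMM(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc).
inline int gemm_bad_parameter(char transa, char transb, int m, int n, int k,
                              int lda, int ldb, int ldc) {
    auto is_no_trans = [](char op) { return op == 'N' || op == 'n'; };
    auto is_valid_op = [&](char op) {
        return is_no_trans(op) || op == 'T' || op == 't' || op == 'C' || op == 'c';
    };
    if (!is_valid_op(transa)) return 1;
    if (!is_valid_op(transb)) return 2;
    if (m < 0) return 3;
    if (n < 0) return 4;
    if (k < 0) return 5;
    // Column-major storage: each leading dimension must cover the stored rows.
    const int rows_a = is_no_trans(transa) ? m : k;
    const int rows_b = is_no_trans(transb) ? k : n;
    if (lda < std::max(1, rows_a)) return 8;
    if (ldb < std::max(1, rows_b)) return 10;
    if (ldc < std::max(1, m)) return 13;
    return 0;
}

// Hypothetical wrapper: throws instead of printing, and names the call site
// (a free-form label chosen by the caller, e.g. a routine name).
inline void require_valid_gemm(const std::string& call_site, char transa,
                               char transb, int m, int n, int k,
                               int lda, int ldb, int ldc) {
    const int bad = gemm_bad_parameter(transa, transb, m, n, k, lda, ldb, ldc);
    if (bad != 0) {
        throw std::invalid_argument(call_site + ": parameter " +
                                    std::to_string(bad) +
                                    " would be rejected by ZGEMM");
    }
}
```

Note that the check on ldc uses max(1, m), so ldc = 0 is rejected even when m = 0; empty matrices are an easy case to overlook when checking calls by hand.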

Reed solomon error correction and false positives

I have a Reed-Solomon encoder/decoder. After manipulating data and evaluating the results, I have experienced the following 3 cases:
1. The decoder decodes the message correctly and does not throw an error.
2. The decoder decodes the message to a wrong result without complaining, effectively producing a false positive. The chance should be very low, but it can happen even if the amount of manipulated data is far below the error correction ability (even after changing a single bit...).
3. The decoder fails (throws an error) if more data is manipulated than its error correction ability allows.
Are all 3 cases valid for a proper Reed-Solomon decoder? I am especially unsure about case 2, where the decoder would produce a wrong result (without throwing an error), even if there are much fewer errors than what is allowed by its correction abilities...?
mis-correction below error correction ability
This would indicate a bug in the code. An RS decoder should never fail if there are ⌊(n-k)/2⌋ or fewer errors.
correction detects when there are more errors than the error correction ability
Even if there are more than ⌊(n-k)/2⌋ errors, there is a good chance that a RS decoder will still detect an uncorrectable error, as most error patterns would not result in a received codeword that is within ⌊(n-k)/2⌋ or fewer error symbols of a valid codeword, since a working RS decoder should only produce a valid codeword or indicate an uncorrectable error. Miscorrection of more than ⌊(n-k)/2⌋ errors involves the decoder creating an additional ⌊(n-k)/2⌋ or fewer error symbols, resulting in a valid codeword, but one that differs from the original by n-k+1 or more symbols.
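The arithmetic behind these bounds can be sketched as follows. The helper names are invented for this example, and the (n, k) values in the usage note are illustrative only. The last function follows from the triangle inequality: a miscorrection lands on a different codeword, which is at least n-k+1 symbols from the original, while the decoder itself changes at most ⌊(n-k)/2⌋ symbols.

```cpp
// Hypothetical helpers (illustration only) for an RS(n, k) code with
// n total symbols and k data symbols, assuming n > k > 0.
struct RsLimits {
    int parity_symbols;   // n - k
    int max_correctable;  // t = floor((n - k) / 2)
    int min_distance;     // d = n - k + 1
};

constexpr RsLimits rs_limits(int n, int k) {
    return {n - k, (n - k) / 2, n - k + 1};
}

// Smallest number of symbol errors for which a miscorrection is possible at
// all: the received word must be within t symbols of a *different* codeword,
// so errors + t >= d, i.e. errors >= d - t.
constexpr int min_errors_for_miscorrection(int n, int k) {
    return (n - k + 1) - (n - k) / 2;
}
```

For example, rs_limits(255, 223) gives 32 parity symbols, t = 16 and d = 33, and min_errors_for_miscorrection(255, 223) is 17. A correctly working decoder for that code therefore cannot produce a wrong codeword from 16 or fewer symbol errors, which is why case 2 in the question points to a bug.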
Detecting an uncorrectable error can be done by regenerating syndromes for the corrected codeword, but it's usually caught sooner when solving the error locator polynomial (normally done by looping through all possible locator values), when it produces fewer locators than it should due to duplicate or missing roots.
I wrote some interactive RS demo programs in C, for both 4 bit and 8 bit fields, that include the 3 most common decoders (PGZ (matrix), BM (discrepancy), SY (extended Euclid)). Note the SY - extended Euclid decoders in my examples emulate a hardware register oriented solution, two registers, always shift left, each register holds two polynomials where the split shifts left along with the register. The right half of each register is reversed (least significant coefficient first). The wiki article example may be easier to follow.
http://rcgldr.net/misc/eccdemo4.zip
http://rcgldr.net/misc/eccdemo8.zip

error forrtl: severe (174): SIGSEGV, segmentation fault occurred matrix multiplication

I tried to implement a simple matrix multiplication, but I keep getting the error
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
test_performance. 00000000004708F1 Unknown Unknown Unknown
test_performance. 000000000046F047 Unknown Unknown Unknown
test_performance. 000000000043F544 Unknown Unknown Unknown
test_performance. 000000000043F356 Unknown Unknown Unknown
test_performance. 0000000000423DFF Unknown Unknown Unknown
test_performance. 000000000040384D Unknown Unknown Unknown
libpthread.so.0 00002AD8B44769F0 Unknown Unknown Unknown
test_performance. 00000000004034A8 Unknown Unknown Unknown
test_performance. 0000000000402ECE Unknown Unknown Unknown
libc.so.6 00002AD8B46A6BE5 Unknown Unknown Unknown
test_performance. 0000000000402DD9 Unknown Unknown Unknown
This is my Code:
PROGRAM test_performance
IMPLICIT NONE
INTEGER :: DIM_M, DIM_L, DIM_N, index1, index2,index3,index4
INTEGER, DIMENSION(4,4) :: A,B,C
DIM_L=4
DIM_M=4
DIM_N=4
DO index1=1,DIM_M
DO index2=1,DIM_L
print *, 'here I am!'
A(index1,index2)=index1+index2
END DO
END DO
DO index3=1,DIM_L
DO index4=1,DIM_N
B(index3,index4)=index3+index4
END DO
END DO
print *,'A= ',A
print *,'B= ',B
CALL MATRIXMULTIPLICATION
PRINT *, 'C=', C
END PROGRAM test_performance
SUBROUTINE MATRIXMULTIPLICATION(A,B,C, DIM_M, DIM_L, DIM_N)
INTEGER, INTENT(IN) :: DIM_M, DIM_L, DIM_N
INTEGER, INTENT(IN) :: A(4,4), B(4,4)
INTEGER, INTENT(OUT) :: C(4,4)
INTEGER :: ii=1,jj=1, kk=1
DO ii=1, DIM_M
DO jj=1, DIM_N
DO kk=1, DIM_L
C(ii,jj)=C(ii,jj)+A(ii,ll)*B(ll,jj)
END DO
END DO
END DO
END SUBROUTINE MATRIXMULTIPLICATION
I don't know why I get this error, since the dimensions and all the indices should be fine. I tried everything I could think of to find the error, but I no longer have any clue what it could be.
The statement
CALL MATRIXMULTIPLICATION
doesn't include the arguments needed when the routine is called. A poor solution would be to simply replace that statement by
CALL MATRIXMULTIPLICATION(A,B,C, DIM_M, DIM_L, DIM_N)
A better solution would be, however, to make the subroutine's interface explicit. There are a number of ways of doing this, one being to put it into a module and use that module. For a single subroutine that might be overkill, but it is definitely the way to go as your programs become larger and more complex.
A simple, straightforward solution, satisfactory for your current purposes, would be to move the line
END PROGRAM test_performance
to follow the line
END SUBROUTINE MATRIXMULTIPLICATION
and, where the END PROGRAM line originally was, to insert the line
contains
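If you had written your program along these lines in the first place, the compiler would have seen your egregious error and pointed it out to you. As it stands, the subroutine is external to the program and the compiler can't match its dummy and actual arguments at compile time; as written, that argument matching is the programmer's responsibility, one you've rather fluffed. Note also that the subroutine lacks IMPLICIT NONE, so the index ll in the innermost loop (where the loop variable kk was presumably intended) is an implicitly typed, undefined integer, and C is summed into without first being set to zero; both need fixing before the result can be right.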
Further improvements would be to make your subroutine handle arrays of any size and to not bother passing the array dimensions through the argument list. Fortran arrays carry their size and shape information with them, on the rare occasion a routine needs to know them explicitly it can make enquiries.
Even easier would be to use the matmul intrinsic and to spend your time programming other, perhaps more challenging and more interesting, parts of your code.