FORTRAN Error: BUS ERROR/SEG while printing a variable - fortran

All,
I am working on estimation of a physical variable for my research and I ran into an odd issue. I already know the roots and differences between bus error and segmentation fault. But the problem is that, when I am printing a four dimension variable like this somewhere:
PRINT *, RH(1,1,1,1)
I am able to get reasonable value.
But, again, the same command in a loop leads to an error. Sometimes it shows me bus error and sometimes SEG one.
The variable has never been modified from the first PRINT command until the
second one.
I am running this on a cluster machine with 32GB RAM. So, RAM capacity should suffice.
RH is defined as REAL*8 :: RH(IIPAR,JJPAR,LLPAR,25) !RH
I would be thankful if someone could help on this.
Best,
Amir

Related

Writing out arrays fixes a problem for a reason unknown to me

When I run certain programs using FORTRAN 77 with the GNU gfortran compiler I have come across the same problem several times and I'm hoping someone has insight. The value I want should be ~ 1, but it writes out at the end of a program as something well over 10^100. This is generally a problem restricted to arrays for me. The improper value often goes away when I write out the value of this at some stage in the program before (which inevitably happens in trying to troubleshoot the issue).
I have tried initializing arrays explicitly and have tried some array bound checking as well as some internal logical checks in my program. The issue, to me and my research supervisor, seems to be pathological.
WRITE(*,*)"OK9999", INV,INVV
vs
WRITE(*,*)"OK9999", INV,INVV,NP,I1STOR(NP),I2STOR(NP)
The former gives incorrect results for what I would expect for the variables INV and INVV, the latter correct. This is the 'newest' example of this problem, which has on and off affected me for about a year.
The greater context of these lines is:
WRITE(*,*)"AFTER ENERGY",I1STOR(1),I2STOR(1)
DO 167 NP=1!,NV
IF(I1STOR(NP).NE.0) THEN
INV = I1STOR(NP)
INVV = I2STOR(NP)
WRITE(*,*)"OK9999", INV,INVV,NP,I1STOR(NP),I2STOR(NP)
PAUSE
ENDIF
I1STOR(1) and I2STOR(1) are written correctly in the first case "AFTER ENERGY" above. If I write out the value of NP after the DO 167 line, this will also remedy the situation.
My expectation would be that writing out a variable wouldn't affect its value. Often times I am doing large, time-intensive calculations where the ultimate value is way off and in many cases it has traced back to the situation where writing the value out (to screen or file) magically alleviates the problem. Any help would be sincerely appreciated.

Segmentation fault for array, but only if a component of a derived type

Pretty simple setup, using gfortran 4.8.5 on linux (red hat):
I get a segfault if my array of reals (inside a derived type) has size > 2,000,000. This seems to be a standard stack/heap issue as my stack size is 8mb if I check with ulimit.
There is no problem if the array is NOT inside a derived type
Note that as #francescalus guesses, removing the initial value = 0.0 eliminates the problem
Edit to add: Note that I have posted a followup question Segmentation fault related to component of derived type that represents a more realistic use case and further narrows down the conditions under which this seems to occur.
program main
call sub1 ! seg fault if col size > 2,100,000
call sub2 ! works fine at col size = 100,000,000
end program main
subroutine sub1
type table
real :: col(2100000) = 0.0 ! works if "= 0.0" removed
end type table
type(table) :: table1
table1%col = 1.0
end subroutine sub1
subroutine sub2
real :: col(100000000) = 0.0
col = 1.0
end subroutine sub2
Some obvious questions here:
Is this expected behavior, or some bug that was fixed in newer versions of gfortran?
Am I following standard fortran operating procedures here, or doing something wrong?
What is the recommended way to avoid this (please assume that I am unable to update to a newer version of gfortran in the near term)? I will almost certainly solve with an allocatable array component for reasons not specific to this question, but that might not be an ideal general solution and I would like to know of all good options I have here.
In particular, is initializing the components of a derived type bad practice?
This is likely to be a runtime issue due to insufficient stack, rather than a bug with gfortran.
Gfortran uses the stack to store automatic arrays and other initialization data. When code does not create problems when one such array is small, but segfaults when the size of the array increases, a possible reason is running out of stack.
The issue seems to be the same in more recent versions of gfortran. I compiled and ran your program with gfortran 4.8.4, 4.9.3, 5.5.0, 6.4.0, 7.3.0 and 8.2.0. In all cases I obtained a segmentation fault with the default stack size, but no error when the stack size was slightly increased.
$ ./sfa
Segmentation fault
$ ulimit -s
8192
$ ulimit -s 8256
$ ./sfa && echo "DONE"
DONE
Your problem may be solved by running
$ ulimit -s unlimited
before executing your binary. I am not aware of any particular penalty for doing this, but programmers more aware of the fine details of memory management, such as compiler developers, may think otherwise.
Initializing the components of a derived type is not bad practice, but as you can see, it can create problems with the stack if the component is a big array - be it due to the storage of the component itself, or to the storage of memory to work on the RHS of the assignment. If the component is made allocatable and allocated in a subroutine, the array is stored in the heap rather than in the stack, and this issue is usually avoided. In this case, it may be about actually setting the values of the array dynamically in a subroutine rather than at compile time. It may be less elegant, but I think it's worth it, since it's the typical example of code development work that prevents avoidable, environment-related errors when executing the binary.
Your code above is standards compliant. As explained in the comments, lack of explicit interfaces for subroutines is not good practice, but for these simple subroutines it's not against the rules.
Some compilers have flags that allow you to change where some objects are allocated in memory. While it may fix a particular issue, flags are compiler dependent, and usually not equivalent when comparing different compilers. Using dynamic memory via allocatables is a more robust solution, according to my experience.
Finally, note that, if you are using OpenMP, the ulimit command above only affects the master thread - you need to set the stack size of each of the other threads via the environment variable OMP_STACKSIZE, which cannot be unlimited. And bear in mind that non-master threads running out of stack are a problem much more difficult to diagnose, since the binary may stop without a proper Segmentation fault error.
These are not necessarily useful solutions, but below are some conditions under which the seg fault disappears. A couple of people mentioned the lack of an explicit interface (as bad practice though not technically incorrect), and it seems that this might be one key here as either of these two changes to the code gets rid of the seg fault, although it's not quite that simple, as I'll explain:
Put everything in main, with no subroutine calls
Put the type definition table in a module
Let me expand on #2 briefly. Simply taking the example in the OP and then giving it an explicit interface by putting the subroutine in a module does NOT work. However, if I put the type definition in a module and then use it (as shown below) the segfault does not occur:
program main
use table_mod
type(table) :: table1
table1%col = 1.0
end program main

printing variables in GDB seems to change code behavior?

I'm debugging a large, ancient, piece of code that we just upgraded the OS/driver for, the entire thing is running 32 bit. The original developers of the code are long gone and much of it is still a black box to me.
I'm running it on the debugger. I narrowed down on a particular if statement within a larger loop, I need the 'else' part of the for loop to run to update some variables, but it was never running; implying that the variable that is being checked in the 'if' statement is always true.
Eventually I stepped into the method call (a simple getter on a private boolean) and printed the content of the variable. When I print the variable it is false, and the 'else' method will be entered when I return.
To experiment I've tried allowing the loop to run for 10 minutes, the 'else' method is never entered (as indicated by a breakpoint not being hit). Then when I print the variable being checked it's false and the variable is entered. It doesn't matter how long I let it run, or how many times I break and continue before printing the variable, the same pattern holds, I enter the 'else' method IFF I print the content of the variable that is being checked first.
To rule out some sort of datarace I've tried sitting at the breakpoint in question for the length of time it takes to do a print statement, a delay without a print doesn't result in entering the 'else' method.
What could cause such an odd behavior? Since we had issues with different architectures, running a 32 bit program on a 64 bit OS and, more importantly, the driver that it uses was not tested for 32 bit for years until they recompiled the driver for me under a 32 bit architecture, which would make me suspect the driver except that particular line of code that is misbehaving isn't touching the driver in any way. Still I suspect some sort of overflow or underflow may be happening due to a confusion caused by trying to force an old 32 bit program to run.
However, even assuming this could cause such an odd behavior, I don't know how to confirm if that is happening or otherwise debug a program where the act of looking at it changes it's behavior. I'd love any tip on what could cause such a problem or how I could move forward with debugging it.
Dammit Jim I'm a programmer, not a Quantum Mechanic!

std::cout printing forbidden system info to the console

I have a c++ program which runs continuously and in order to debug it I print some values to the console using cout. This has been working fine for the past week.
Now however, after about 30 seconds it will print information that it should not have access to, and stops printing what I tell it to completely.
Here is an example:
Object 104 dist: 0.111448
Object 104 dist: 0.113334
Object 104 dist: 0.0950752
JRE_HOME=/jreLS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;3^C
In the above example, I tell the program to print the "Object dist" on every iteration, then suddenly it spits out this JRE_HOME info that it should not have access to, then stops printing all together. Apart from the printing stopping, the program continues to run fine.
I am worried that I might have some kind of memory leak or buffer overflow, but I do not know how to diagnose this as I am fairly new to C++ and have no idea what might be causing this (I have looked through my recent code changes and nothing stands out as an obvious cause).
I have restarted my computer but each time it just prints out more random system info which it should not have access to which again leads me to believe it is reading from some part of system memory.
What could be causing this?
that it should not have access to
You are mistaken, each process has a copy of the environment block in its memory space.
You are seeing snippets of environment variables because you are using an invalid pointer, that happens to land within the environment block instead of the intended data.

Memory error: Dereference null pointer/ SSE misalignment

I'm compiling a program on remote linux server. The program compiled. However when I run it the program ends abruptly. So I debugged the program using DDT. It spits out the following error:
Process 0:
Memory error detected in ClassName::function (filename.cpp:6462).
Thread 1 attempted to dereference a null pointer or execute an SSE instruction with an
incorrectly aligned memory address (the latter may sometimes occur spuriously if guard
pages are enabled)
Tip: Use the stack list and the local variables to explore your program's current
state and identify the source of the error.
Can anyone please tell me what exactly this error means?
The line where the program stops looks like this:
SumUtility = ParaEst[0] + hhincome * ParaEst[71] + IsBlack * ParaEst[61] + IsBachAss * (ParaEst[55]);
This is within a switch case.
These are the variable types
vector<double> ParaEst;
double hhincome;
int IsBlack, Is BachAss;
Thanks for the help!
It means that:
ParaEst is NULL or a bad Pointer
ParaEst's individual array values are not aligned to 16-byte boundaries, required for SSE.
hhincome, IsBlack, or IsBachAss are not aligned to 16-byte boundaries and are SSE type values.
SumUtility is not aligned to 16-bytes and is a SSE type field.
If you could post the assembly code of the exact line that failed along with the register values of that assembler line, we could tell you exactly which of the above conditions have failed. It would also help to see the types of each variable shown to help narrow root the cause.
Ok... The problem finally got fixed.
The issue was that the expression where the code was breaking down was in a newly defined function. However for some weird reason running the make-file did not incorporate these changes and was still compiling using the previously compiled .o file. This resulted in garbage values being assigned to the variables within this new function. To top things off the program calls this function as a first step. Hence there was this systematic breakdown. The technical aspect of this was what Michael alluded to.
After this I would always recommend to use a make clean option in the make file. The issue of why running the make file is failing to compile the modified source file is an issue that definitely warrants further discussion.
Thanks for the responses!!