I need some pointers to solve a problem that I can only describe in a limited way. I was given some code written in Fortran 77 by a senior scientist, and I can't post it on a public forum for ownership reasons. It isn't big (750 lines), but between the implicit declarations and the goto statements it is very hard to read, so I am having trouble locating the source of the error. Here is the problem:
When I compile the code with ifort, it runs fine and gives me sensible numbers, but when I compile it with gfortran, it compiles cleanly yet does not give the right answer. The code is a numerical root finder for a complex plasma physics problem: the ifort-compiled version finds the root, but the gfortran-compiled version fails to find it.
Any ideas on how to proceed looking for a solution? I will update the question to reflect the actual problem when I find one.
Some things to investigate, not necessarily in the order I would try them:
Use your compiler(s) to check everything they are capable of checking, including and especially array bounds (for run-time confidence) and subroutine argument matching.
Use of uninitialised variables.
The kinds of real, complex and integer variables; the compilers (or your compilation options) may default to different kinds.
Common blocks, equivalences, entry, ... other now deprecated or obsolete features.
Finally, perhaps not a matter for immediate investigation but something you ought to do sooner (right choice) or later (wrong choice), make the effort to declare IMPLICIT NONE in all scopes and to write explicit declarations for all entities.
Related
Say I have a C++ project which has been working well for years.
Say also this project might (I need to verify) contain undefined behaviour.
So maybe the compiler has been kind to us and the program doesn't misbehave even though there is UB.
Now imagine I want to add some features to the project, e.g. add the Crypto++ library to it.
But the actual code I add, say from Crypto++, is legitimate.
Here I read:
Your code, if part of a larger project, could conditionally call some
3rd party code (say, a shell extension that previews an image type in
a file open dialog) that changes the state of some flags (floating
point precision, locale, integer overflow flags, division by zero
behavior, etc). Your code, which worked fine before, now exhibits
completely different behavior.
But I can't gauge exactly what the author means. Is he saying that even by adding, say, the Crypto++ library to my project, my project can suddenly start working incorrectly, even though the code I add from Crypto++ is legitimate?
Is this realistic?
Any links which can confirm this?
It is hard for me to explain to the people involved that just adding a library might increase risk. Maybe someone can help me formulate how to explain this?
When source code invokes undefined behaviour, it means that the standard gives no guarantee about what could happen. It can work perfectly in one compilation run, but simply compiling it again with a newer version of the compiler or of a library could make it break. Changing the optimisation level on the compiler can have the same effect.
A common example is reading one element past the end of an array. Suppose you expect that element to be null, and by chance the next memory location contains a 0 under normal conditions (say it is an error flag). It will work without problems. But suppose now that on another compilation run, after changing something totally unrelated, the memory layout is slightly different and the location after the array is no longer that flag (which kept a constant value) but a variable taking other values. Your program will break and will be hard to debug, because if that variable is used as a pointer, you could overwrite memory in random places.
TL/DR: If one version works but you suspect UB in it, the only correct way forward is to consistently remove all possible UB from the code before any change. Alternatively, you can keep the working version untouched, but beware: you may have to change it later...
Over the years, C has mutated into a weird hybrid of a low-level language and a high-level language, where code provides a low-level description of a way of performing a task, and modern compilers then try to convert that into a high-level description of what the task is and then implement efficient code to perform that task (possibly in a way very different from what was specified). In order to facilitate the translation from the low-level sequence of steps into the higher-level description of the operations being performed, the compiler needs to make certain assumptions about the conditions under which those low-level steps will be performed. If those assumptions do not hold, the compiler may generate code which malfunctions in very weird and bizarre ways.
Complicating the situation is the fact that there are many common programming constructs which might be legal if certain parts of the rules were a little better thought-out, but which as the rules are written would authorize compilers to do anything they want. Identifying all the places where code does things which arguably should be legal, and which have historically worked correctly 99.999% of the time, but might break for arbitrary reasons can be very difficult.
Thus, one may wish for the addition of a new library not to break anything, and most of the time one's wish might come true, but unfortunately it's very difficult to know whether any code may have lurking time bombs within it.
Unfortunately I am not working with open code right now, so please consider this a question of pure theoretical nature.
The C++ project I am working with seems to be definitely crippled by the following options. At least GCC 4.3 - 4.8 all cause the same problems; I didn't notice any trouble with the 3.x series (these options may not have existed there, or may have worked differently). The affected platforms are Linux x86 and Linux ARM. The options themselves are enabled automatically at the -O1 or -O2 level, so I first had to find out which options were causing it:
tree-dominator-opts
tree-dse
tree-fre
tree-pre
gcse
cse-follow-jumps
It's not my own code, but I have to maintain it, so how could I possibly find the source of the trouble these options cause? Once I disabled the optimizations above with their "-fno-" counterparts, the code works.
On a side note, the project does work flawlessly with Visual Studio 2008, 2010 and 2013 without any noticeable problems or special compiler options. Granted, the code is not 100% cross-platform, so some parts are Windows/Linux specific, but even so I'd like to know what's happening here.
It's not a vital question, since I can make the code run flawlessly, but I am still interested in how to track down such problems.
So to make it short: How to identify and find the affected code?
I doubt it's a giant GCC bug and maybe there is not even a real fix for the code I am working with, but it's of real interest for me.
I take it that most of these options are eliminations of some kind, and I have read the explanations for them, but I still have no idea where I would start.
First of all: try using a debugger. If the program crashes, check the backtrace for places to look for the faulty function. If the program misbehaves (wrong outputs), you should be able to tell where that happens by carefully placing breakpoints.
If that doesn't help and the project is small, you could try compiling a subset of your project with the "-fno-" options that stop your program from misbehaving. You could brute-force your way to the smallest subset of faulty .cpp files and work from there. Note: a search strategy with good complexity (e.g. bisecting the set of files) could save you a lot of time.
If, by any chance, there is a single faulty .cpp file, then you could further factor its contents into several .cpp files to see which functions are the cause of misbehavior.
TL;DR
How can one protect against binary incompatibility caused by a typo in a compiler argument, when preprocessor directives in shared (possibly templated) headers control conditional compilation across different compilation units?
Ex.
g++ ... -DYOUR_NORMAl_FLAG ... -o libA.so
/**Another compilation unit, or even project. **/
g++ ... -DYOUR_NORMA1_FLAG ... -o libB.so
/**Another compilation unit, or even project. **/
g++ ... -DYOUR_NORMAI_FLAG ... main.cpp libA.so //The possibilities!
The Basic Story
Recently, I ran into a strange bug: the symptom was a single SIGSEGV, which always seemed to occur at the same location after recompiling. This led me to believe there was some kind of memory corruption going on, and that the underlying pointer was not a pointer at all but some data section.
I'll spare you the long and strenuous journey, which took almost two otherwise perfectly good work days, to track down the problem. Suffice it to say that Valgrind, GDB, nm, readelf, Electric Fence, GCC's stack-smashing protection, and several more measures/methods/approaches all failed.
In utter devastation, my attention turned to the finest details in the build process, which was analogous to:
Build one small library.
Build one large library, which uses the small one.
Build the test suite of the large library.
There was only a problem when the large library was used as a static or dynamic library dependency (i.e. the dynamic linker loaded it automatically; no dlopen).
In the test case where all the code of the library was simply included in the tests, everything worked: this was the most important clue.
The "Solution"
In the end, it turned out to be the simplest thing: a single (!) typo.
It turns out the compilation flags differed by a single character between the test suite and the large library: a define which controlled the behavior of the small library was misspelled.
Critical morsel of information: the small library had some templates. These were used directly in every case, without explicit instantiation in advance. The contents of one of the templated classes changed when the flag was toggled: some data fields were simply not present when the flag was defined!
The linker noticed nothing of this. (Since the class was templated, the resultant symbols were weak.)
The code used dynamic casts, and the class affected by this problem inherited from the mangled class, at which point things went south.
My question is as follows: how would you protect against this kind of problem? Are there any tools or solutions which address this specific issue?
Future Proofing
I've thought of two things, and believe no protection can be built on the object file level:
1: Save the options implemented as preprocessor symbols in some well-defined place, preferably extracted by a separate build step. Provide a check script which uses this to check all compiler defines, and the defines in user code; integrate this check into the build process. Possibly use Levenshtein distance or similar to check for misspellings. Expensive, and the script/solution can get complicated. Possible problems with similar flags (but why have them?), and additional files must accompany the compiled library code. (Well, maybe with DWARF 2 this is untrue, but let's assume we don't want that.)
2: Centralize build options: cheap, with customization left open (think makefile.local), but it makes for monolithic monstrosities and strong coupling between projects.
I'd like to go ahead and quench a few likely flame-inducing embers possibly flaring up in some readers: "do not use preprocessor symbols" is not an option here.
Conditional compilation does have its place in high-performance code, and doing everything with templates and enable_ifs would needlessly overcomplicate things. While the above situation is usually not desirable, it can arise from the development process.
Please assume you have no control over the situation, assume you have legacy code, assume everything you can to force yourself to avoid side-stepping.
If those won't do, generalize to ABI-incompatibility detection, though this might escalate the scope of the question too much for SO.
I'm aware of:
http://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html
DT_SONAME is not applicable.
Other version schemes therein are not applicable either - they were designed to protect a package which is in itself not faulty.
Mixing C++ ABIs to build against legacy libraries
Static analysis tool to detect ABI breaks in C++
If it matters, don't have a default case.
#ifdef YOUR_NORMAL_FLAG
// some code
#elif defined(YOUR_SPECIAL_FLAG)
// some other code
#else
// in case of a typo, this is a compilation error
# error "No flag specified"
#endif
This may lead to a long list of compiler options if conditional compilation is overused, but there are ways around that, such as defining config files
flag=normal
flag2=special
which get parsed by build scripts to generate the options, and which can check for typos along the way, or which could be parsed directly from the Makefile.
I have to work on a Fortran program which used to be compiled using Compaq Visual Fortran 6.6. I would prefer to work with gfortran, but I have run into lots of problems.
The main problem is that the generated binaries behave differently. My program takes an input file and then has to generate an output file. But sometimes, when using the binary compiled by gfortran, it crashes before the end, or gives different numerical results.
This is a program written by researchers which uses a lot of floating-point numbers.
So my question is: what are the differences between these two compilers which could lead to this kind of problem?
edit:
My program computes the values of some parameters over numerous iterations. At the beginning, everything goes well. After several iterations, some NaN values appear (only when compiled with gfortran).
edit:
Thank you, everybody, for your answers.
So I used the Intel compiler, which helped me by giving some useful error messages.
The origin of my problems is that some variables are not initialized properly. It looks like, when compiling with Compaq Visual Fortran, these variables automatically take 0 as their value, whereas with gfortran (and Intel) they take random values, which explains the numerical differences that add up over the following iterations.
So now the solution is a better understanding of the program in order to correct these missing initializations.
There can be several reasons for such behaviour.
What I would do is:
Switch off any optimization
Switch on all debug options. If you have access to e.g. intel compiler, use ifort -CB -CU -debug -traceback. If you have to stick to gfortran, use valgrind, its output is somewhat less human-readable, but it's often better than nothing.
Make sure there are no implicitly typed variables; use implicit none in all modules and all code blocks.
Use consistent float types. I personally always use real*8 as the only float type in my codes. If you use external libraries, you may need to change the call signatures of some routines (e.g., BLAS has different routine names for single- and double-precision variables).
If you are lucky, it's just a variable that doesn't get initialized properly, and you'll catch it with one of these techniques. Otherwise, as M.S.B. suggested, a deeper understanding of what the program really does is necessary. And yes, you might just have to check the algorithm manually, starting from the point where you say 'some NaN values appear'.
Different compilers can emit different instructions for the same source code. If a numerical calculation is on the boundary of working, one set of instructions might work and another not. Most compilers have options to use more conservative floating-point arithmetic versus optimizations for speed; I suggest checking the available options for the compilers you are using. More fundamentally, this problem, particularly the fact that the compilers agree for several iterations and then diverge, may be a sign that the numerical approach of the program is borderline. A simplistic solution is to increase the precision of the calculations, e.g., from single to double. Perhaps also tweak parameters, such as a step size or similar parameter. Better still would be to gain a deeper understanding of the algorithm and possibly make a more fundamental change.
I don't know about the crash, but some differences in the results of numerical code on an Intel machine can come from one compiler using 80-bit doubles and the other 64-bit doubles, if not for variables then perhaps for temporary values. Moreover, floating-point computation is sensitive to the order in which elementary operations are performed, and different compilers may generate different sequences of operations.
Differences in different type implementations, differences in various non-Standard vendor extensions, could be a lot of things.
Here are just some of the language features that differ (compare gfortran and Intel). Programs written to the Fortran standard behave the same with every compiler, but many people don't know which features are standard and which are vendor extensions, and so use the extensions anyway; when the code is then compiled with a different compiler, trouble arises.
If you post the code somewhere I could take a quick look at it; otherwise, like this, 'tis hard to say for certain.
Consider a situation. We have some specific C++ compiler, a specific set of compiler settings and a specific C++ program.
We compile that specific program with that compiler and those settings two times, doing a "clean compile" each time.
Should the machine code emitted be the same (I don't mean timestamps and other bells and whistles, I mean only real code that will be executed) or is it allowed to vary from one compilation to another?
The C++ standard certainly doesn't say anything to prevent this from happening. In reality, however, a compiler is normally deterministic, so given identical inputs it will produce identical output.
The real question is mostly what parts of the environment it considers as its inputs -- there are a few that seem to assume characteristics of the build machine reflect characteristics of the target, and vary their output based on "inputs" that are implicit in the build environment instead of explicitly stated, such as via compiler flags. That said, even that is relatively unusual. The norm is for the output to depend on explicit inputs (input files, command line flags, etc.)
Offhand, I can only think of one fairly obvious thing that changes "spontaneously": some compilers and/or linkers embed a timestamp into their output file, so a few bytes of the output will change from one build to the next. But this will only be in the metadata embedded in the file, not a change to the actual code that's generated.
According to the as-if rule in the standard, as long as a conforming program (e.g., no undefined behavior) cannot tell the difference, the compiler is allowed to do whatever it wants. In other words, as long as the program produces the same output, there is no restriction in the standard prohibiting this.
From a practical point of view, I wouldn't use a compiler that does this to build production software. I want to be able to recompile a release made two years ago (with the same compiler, etc) and produce the same machine code. I don't want to worry that the reason I can't reproduce a bug is that the compiler decided to do something slightly different today.
There is no guarantee that they will be the same. See also http://www.mingw.org/wiki/My_executable_is_sometimes_different :
My executable is sometimes different, when I compile and recompile the same source. Is this normal?
Yes; by default, and by design, MinGW's GCC does not produce ConsistentOutput unless you patch it.
EDIT: Found this post that seems to explain how to make them the same.
I'd bet it would vary every time due to some metadata the compiler writes (for instance, C#-compiled DLLs always differ in some bytes even if I "build" twice in a row without changing anything). But in any case, I would never rely on the output not varying.