A simple test case between clang++/g++/gfortran - c++

I ran across this question on scicomp which involves computing a sum. There you can see a C++ and a similar Fortran implementation. Interestingly, the Fortran version was faster by about 32%.
I was not sure about their result and tried to reproduce the situation. Here are the (very slightly) different codes I ran:
c++
#include <iostream>
#include <complex>
#include <cmath>
#include <iomanip>

int main ()
{
    const double alpha = 1;
    std::cout.precision(16);
    std::complex<double> sum = 0;
    const std::complex<double> a = std::complex<double>(1,1)/std::sqrt(2.);
    for (unsigned int k=1; k<10000000; ++k)
    {
        sum += std::pow(a, k)*std::pow(k, -alpha);
        if (k % 1000000 == 0)
            std::cout << k << ' ' << sum << std::endl;
    }
    return 0;
}
fortran
implicit none
integer, parameter :: dp = kind(0.d0)
complex(dp), parameter :: i_ = (0, 1)
real(dp) :: alpha = 1
complex(dp) :: s = 0
integer :: k
do k = 1, 10000000
    s = s + ((i_+1)/sqrt(2._dp))**k * k**(-alpha)
    if (modulo(k, 1000000) == 0) print *, k, s
end do
end
I compile the above codes using gcc 4.6.3 and clang 3.0 on an Ubuntu 12.04 LTS machine, all with the -O3 flag. Here are my timings:
time ./a.out
gfortran
real 0m1.538s
user 0m1.536s
sys 0m0.000s
g++
real 0m2.225s
user 0m2.228s
sys 0m0.000s
clang
real 0m1.250s
user 0m1.244s
sys 0m0.004s
Interestingly, I can also see that the Fortran code is faster than the C++ by about the same 32% when gcc is used. Using clang, however, the C++ code actually runs faster by about 19%. Here are my questions:
Why is the g++-generated code slower than gfortran's? Since they are from the same compiler family, does this mean (this) Fortran code can simply be translated into faster code? Is this generally the case with Fortran vs C++?
Why is clang doing so well here? Is there a Fortran front end for the LLVM compiler? If there is, will the code it generates be even faster?
UPDATE:
Using -ffast-math -O3 options generates the following results:
gfortran
real 0m1.515s
user 0m1.512s
sys 0m0.000s
g++
real 0m1.478s
user 0m1.476s
sys 0m0.000s
clang
real 0m1.253s
user 0m1.252s
sys 0m0.000s
Now the g++ version runs as fast as gfortran's, and clang is still faster than both. Adding -fcx-fortran-rules to the above options does not significantly change the results.

The time differences will be related to the time it takes to execute pow, as the rest of the code is relatively simple. You can check this by profiling. The question then is what the compiler does to compute the power function.
My timings: ~1.20 s for the Fortran version with gfortran -O3, and 1.07 s for the C++ version compiled with g++ -O3 -ffast-math. Note that -ffast-math doesn't matter for gfortran, as pow will be called from a library, but it makes a huge difference for g++.
In my case, for gfortran, it's the function _gfortran_pow_c8_i4 that gets called (source code). Its implementation is the usual way to compute integer powers: exponentiation by squaring. With g++, on the other hand, it's a function template from the libstdc++ library, but I don't know how that's implemented. Apparently, it's slightly better written/optimizable. I don't know to what extent the function is compiled on the fly, considering it's a template. For what it's worth, the Fortran version compiled with ifort and the C++ version compiled with icc (using the -fast optimization flag) both give the same timings, so I guess these use the same library functions.
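For illustration, here is a minimal C++ sketch of that scheme; it mirrors the structure of such a routine but is not the actual libgfortran source:
#include <complex>

std::complex<double> pow_squaring(std::complex<double> a, unsigned int k)
{
    std::complex<double> result = 1;
    while (k != 0)
    {
        if (k & 1)      // this bit of the exponent is set: multiply in the current power
            result *= a;
        a *= a;         // square the base for the next bit
        k >>= 1;
    }
    return result;
}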
If I just write a power function in Fortran with explicit complex arithmetic (explicitly writing out the real and imaginary parts), it's as fast as the C++ version compiled with g++ (but then -ffast-math slows it down, so I stuck to only -O3 with gfortran):
complex(8) function pow_c8_i4(a, k)
    implicit none
    integer, intent(in) :: k
    complex(8), intent(in) :: a
    real(8) :: Re_a, Im_a, Re_pow, Im_pow, tmp
    integer :: i
    Re_pow = 1.0_8
    Im_pow = 0.0_8
    Re_a = real(a)
    Im_a = aimag(a)
    i = k
    do while (i.ne.0)
        if (iand(i,1).eq.1) then
            tmp = Re_pow
            Re_pow = Re_pow*Re_a - Im_pow*Im_a
            Im_pow = tmp   *Im_a + Im_pow*Re_a
        end if
        i = ishft(i,-1)
        tmp = Re_a
        Re_a = Re_a**2 - Im_a**2
        Im_a = 2*tmp*Im_a
    end do
    pow_c8_i4 = cmplx(Re_pow, Im_pow, 8)
end function
In my experience, using explicit real and imaginary parts in Fortran implementations is faster, although it is of course very convenient to use the complex types.
Final note: even though it's just an example, calling the power function each iteration is extremely inefficient. Instead, you should of course just multiply a by itself each iteration.
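A minimal sketch of that last point, applied to the loop from the question (one complex multiply per iteration replaces std::pow(a, k)):
#include <complex>
#include <cmath>
#include <iostream>

int main()
{
    const double alpha = 1;
    const std::complex<double> a = std::complex<double>(1, 1) / std::sqrt(2.);
    std::complex<double> ak = a;         // invariant: ak == a**k at the top of the loop
    std::complex<double> sum = 0;
    std::cout.precision(16);
    for (unsigned int k = 1; k < 10000000; ++k)
    {
        sum += ak * std::pow(k, -alpha); // one multiply instead of std::pow(a, k)
        ak *= a;
    }
    std::cout << sum << std::endl;
    return 0;
}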

I believe your problem is in the output part. It is well known that C++ streams (std::cout) are often quite inefficient. While different compilers may optimize this, it is usually a good idea to rewrite performance-critical output using the C printf function instead of std::cout.
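As a sketch of what I mean (the helper name is made up), the progress line from the question could be printed like this:
#include <cstdio>
#include <complex>

void report(unsigned int k, const std::complex<double> &sum)
{
    // %.16g roughly matches the std::cout.precision(16) used in the question
    std::printf("%u (%.16g,%.16g)\n", k, sum.real(), sum.imag());
}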

Fortran vectorize log inside a for loop

Here is a minimal working example.
program test
    implicit none
    double precision :: c1, c2, rate
    integer :: ci, cj, cr, cm, i
    integer, parameter :: max_iter = 10000000 ! 10^7
    c1 = 0.0d+0
    CALL system_clock(count_rate=cr)
    CALL system_clock(count_max=cm)
    rate = REAL(cr)
    CALL SYSTEM_CLOCK(ci)
    do i = 1, max_iter
        c1 = c1 + log(DBLE(i))
    end do
    CALL SYSTEM_CLOCK(cj)
    WRITE(*,*) "system_clock : ", (cj - ci)/rate
    print *, c1
end program test
When I compile with gfortran -Ofast -march=core-avx2 -fopt-info-vec-optimized, the loop with the log function does not get vectorized. I have also tried -O3, but the result does not change.
But if I write the equivalent C++ code,
#include <iostream>
#include <chrono>
#include <cstdio>
#include <cmath>

using namespace std;
using namespace std::chrono;

int main()
{
    double c1 = 0;
    const int max_iter = 10000000; // 10^7
    auto start = high_resolution_clock::now();
    for (int i = 1; i <= max_iter; i++)
    {
        c1 += log(i);
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    cout << duration.count() << " ms" << '\n';
    printf("%0.15f\n", c1);
    return 0;
}
and compile it with g++ -Ofast -march=core-avx2 -fopt-info-vec-optimized, the loop gets vectorized and runs almost 10 times faster.
What should I do to get the Fortran loop vectorized?
The problem with vectorizing loops that include math functions (like log) is that the compiler has to be taught the semantics of the vectorized math functions. If you look at the assembler output, you can see that the Fortran version calls the "normal" scalar function (a line like call log), whereas your C++ version calls the vectorized version (call _ZGVdN4v___log_finite). There has been some work towards making GFortran understand the glibc vector math library (libmvec), but I'm not sure what the current status is. See the thread starting at https://gcc.gnu.org/legacy-ml/gcc/2018-04/msg00062.html and continuing in June 2018 at https://gcc.gnu.org/legacy-ml/gcc/2018-06/msg00167.html for more details.

Puzzling performance difference between mac and a relatively powerful desktop

My original intention in writing this piece of code was to measure the performance difference between operating on an entire array with one function call and operating on the individual elements of the array,
i.e. comparing the following two statements:
function_vector(x, y, z, n);
vs
for(int i=0; i<n; i++){
    function_scalar(x[i], y[i], z[i]);
}
where function_* does some substantial but identical calculations.
With -ffast-math turned on, the scalar version is roughly 2x faster on multiple machines I have tested on.
However, what's puzzling is the comparison of timings on two different machines, both using gcc 6.3.0:
# on desktop with Intel-Core-i7-4930K-Processor-12M-Cache-up-to-3_90-GHz
g++ loop_test.cpp -o loop_test -std=c++11 -O3
./loop_test
vector time = 12.3742 s
scalar time = 10.7406 s
g++ loop_test.cpp -o loop_test -std=c++11 -O3 -ffast-math
./loop_test
vector time = 11.2543 s
scalar time = 5.70873 s
# on mac with Intel-Core-i5-4258U-Processor-3M-Cache-up-to-2_90-GHz
g++ loop_test.cpp -o loop_test -std=c++11 -O3
./loop_test
vector time = 2.89193 s
scalar time = 1.87269 s
g++ loop_test.cpp -o loop_test -std=c++11 -O3 -ffast-math
./loop_test
vector time = 2.38422 s
scalar time = 0.995433 s
By all means, the first machine is superior in terms of cache size, clock speed, etc. Still, the code runs roughly 5x faster on the second machine.
Question:
Can this be explained? Or am I doing something wrong here?
Link to the code: https://gist.github.com/anandpratap/262a72bd017fdc6803e23ed326847643
Edit
Following comments from ShadowRanger, I added the __restrict__ keyword to function_vector and the -march=native compilation flag. This gives:
# on desktop with Intel-Core-i7-4930K-Processor-12M-Cache-up-to-3_90-GHz
vector time = 1.3767 s
scalar time = 1.28002 s
# on mac with Intel-Core-i5-4258U-Processor-3M-Cache-up-to-2_90-GHz
vector time = 1.05206 s
scalar time = 1.07556 s
Odds are possible pointer aliasing is limiting optimizations in the vectorized case.
Try changing the declaration of function_vector to:
void function_vector(double *__restrict__ x, double *__restrict__ y, double *__restrict__ z, const int n){
to use g++'s non-standard support for a feature matching C99's restrict keyword.
Without it, function_vector likely has to assume that the writes to x[i] could be modifying values in y or z, so it can't do read-ahead to get the values.
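As a minimal sketch of the effect (this function_vector body is a stand-in; the real one is in the linked gist):
void function_vector(double *__restrict__ x,
                     const double *__restrict__ y,
                     const double *__restrict__ z,
                     const int n)
{
    // with restrict, y[i] and z[i] can stay in registers across the store to x[i];
    // without it, the compiler must assume the store may alias them and reload
    for (int i = 0; i < n; ++i)
        x[i] = y[i] * z[i] + y[i]; // stand-in for the substantial math
}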

C++ Centralizing SIMD usage

I have a library and a lot of projects depending on it. I want to optimize certain procedures inside the library using SIMD extensions. However, it is important for me to stay portable, so to the user it should be quite abstract.
I'll say at the beginning that I don't want to use some other great library that does the trick. I actually want to understand whether what I want is possible, and to what extent.
My very first idea was to have a "vector" wrapper class so that the usage of SIMD is transparent to the user, with a "scalar" vector class that could be used when no SIMD extension is available on the target machine.
The naive thought came to my mind to use the preprocessor to select one vector class out of many, depending on the target the library is compiled for. So one scalar vector class, one with SSE (something like this, basically: http://fastcpp.blogspot.de/2011/12/simple-vector3-class-with-sse-support.html), and so on... all with the same interface.
This gives me good performance, but it would mean I have to compile the library for every SIMD ISA that I use. I would rather evaluate the processor capabilities dynamically at runtime and select the "best" implementation available.
So my second guess was to have a general "vector" class with abstract methods. The "processor evaluator" function would then return instances of the optimal implementation. Obviously this would lead to ugly code, but the pointer to the vector object could be stored in a smart-pointer-like container that just delegates the calls to the vector object. Actually I would prefer this method because of its abstraction, but I'm not sure whether calling virtual methods would kill the performance I gain from using SIMD extensions. A rough sketch of this idea is shown below.
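Roughly, the idea would look like this (names made up; only the scalar fallback is shown, and an SSE version would implement the same interface):
#include <cstddef>
#include <memory>

struct IVector
{
    virtual ~IVector() = default;
    virtual std::size_t size() const = 0;     // lanes: 1 for scalar, 4 for SSE, ...
    virtual void load(const float *src) = 0;
    virtual void store(float *dst) const = 0;
};

struct ScalarVector : IVector                 // fallback when no SIMD is available
{
    float lane;
    std::size_t size() const override { return 1; }
    void load(const float *src) override { lane = *src; }
    void store(float *dst) const override { *dst = lane; }
};

// the "processor evaluator" would pick the best implementation at runtime
std::unique_ptr<IVector> make_best_vector()
{
    return std::make_unique<ScalarVector>();  // SSE/AVX branches would go here
}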
The last option I figured out would be to optimize whole routines and select the optimal one at runtime. I don't like this idea so much, because it forces me to implement whole functions multiple times. I would prefer to do it once; using my idea of the vector class, I would like to do something like this, for example:
void Memcopy(float *dst, const float *src, size_t size) // float* rather than void*, so the pointer arithmetic below is valid
{
    vector v;
    for (size_t i = 0; i < size; i += v.size())
    {
        v.load(src);
        v.store(dst);
        dst += v.size();
        src += v.size();
    }
}
I assume here that size is a suitable value, so that no overlap happens. This example should just show what I would prefer to have. The size method of the vector object would, for example, just return 4 when SSE is used and 1 for the scalar version.
Is there a proper way to implement this using only runtime information without losing too much performance? Abstraction is more important to me than performance, but as this is a performance optimization, I wouldn't include it if it didn't speed up my application.
I also found this on the web: http://compeng.uni-frankfurt.de/?vc
It's open source, but I don't understand how the correct vector class is chosen.
Your idea will only compile to efficient code if everything inlines at compile time, which is incompatible with runtime CPU dispatching. For v.load(), v.store(), and v.size() to actually be different at runtime depending on the CPU, they'd have to be actual function calls, not single instructions. The overhead would be killer.
If your library has functions that are big enough to work without being inlined, then function pointers are great for dispatching based on runtime CPU detection. (e.g. make multiple versions of memcpy, and pay the overhead of runtime detection once per call, not twice per loop iteration.)
This shouldn't be visible in your library's external API/ABI, unless your functions are mostly so short that the overhead of an extra (direct) call/ret matters. In the implementation of your library functions, put each sub-task that you want to make a CPU-specific version of into a helper function. Call those helper functions through function pointers.
Start with your function pointers initialized to versions that will work on your baseline target, e.g. SSE2 for x86-64, scalar or SSE2 for legacy 32-bit x86 (depending on whether you care about Athlon XP and Pentium III), and probably scalar for non-x86 architectures. In a constructor or library init function, do a CPUID and update the function pointers to the best version for the host CPU. Even if your absolute baseline is scalar, you could make your "good performance" baseline something like SSSE3, and not spend much/any time on SSE2-only routines. Even if you're mostly targeting SSSE3, some of your routines will probably end up only requiring SSE2, so you might as well mark them as such and let the dispatcher use them on CPUs that only do SSE2.
Updating the function pointers shouldn't even require any locking. Any calls that happen from other threads before your constructor is done setting the function pointers may get the baseline version, but that's fine. Storing a pointer to an aligned address is atomic on x86. If it's not atomic on any platform where you have a version of a routine that needs runtime CPU detection, use C++11 std::atomic (with memory_order_relaxed stores and loads, not the default sequential consistency, which would trigger a full memory barrier on every load). It matters a lot that there's minimal overhead when calling through the function pointers, and it doesn't matter what order different threads see the changes to the function pointers. They're write-once.
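A minimal sketch of that setup (hypothetical names; the two memcpy variants would be defined in separately compiled files with the appropriate -m flags; __builtin_cpu_supports is a GCC/clang builtin):
#include <atomic>
#include <cstddef>

void memcpy_sse2(void *dst, const void *src, std::size_t n);  // baseline version
void memcpy_avx2(void *dst, const void *src, std::size_t n);  // built with -mavx2

// Write-once function pointer; relaxed ordering is enough, because a thread
// that still sees the baseline version just runs slower, never incorrectly.
static std::atomic<void (*)(void *, const void *, std::size_t)>
    memcpy_impl{&memcpy_sse2};

void library_init()  // run once, e.g. from a global constructor
{
    if (__builtin_cpu_supports("avx2"))
        memcpy_impl.store(&memcpy_avx2, std::memory_order_relaxed);
}

void my_memcpy(void *dst, const void *src, std::size_t n)
{
    memcpy_impl.load(std::memory_order_relaxed)(dst, src, n);
}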
x264 (the heavily optimized open-source H.264 video encoder) uses this technique extensively, with arrays of function pointers. See x264_mc_init_mmx(), for example. (That function handles all CPU dispatching for the motion compensation functions, from MMX to AVX2.) I assume libx264 does the CPU dispatching in the "encoder init" function. If you don't have a function that users of your library are required to call, then you should look into some kind of mechanism for running global constructor/init functions when programs using your library start up.
If you want this to work with very C++ey code (C++ish? Is that a word?), i.e. templated classes and functions, the program using the library will probably have to do the CPU dispatching and arrange to get baseline and multiple CPU-requirement versions of the functions compiled.
I do exactly this with a fractal project. It works with vector sizes of 1, 2, 4, 8, and 16 for float and 1, 2, 4, and 8 for double. I use a CPU dispatcher at run time to select among the following instruction sets: SSE2, SSE4.1, AVX, AVX+FMA, and AVX512.
The reason I use a vector size of 1 is to test performance. There is already a SIMD library that does all this: Agner Fog's Vector Class Library. He even includes example code for a CPU dispatcher.
The VCL emulates hardware such as AVX on systems that only have SSE (and even AVX512 for SSE). It just implements AVX twice (or four times for AVX512), so in most cases you can simply use the largest vector size you want to target.
//#include "vectorclass.h"
void Memcopy(float *dst, const float *src, size_t size)
{
    Vec8f v; // eight floats, using AVX hardware or AVX emulated with SSE twice
    for (size_t i = 0; i < size; i += v.size())
    {
        v.load(src);
        v.store(dst);
        dst += v.size();
        src += v.size();
    }
}
(However, writing an efficient memcpy is complicated. For large sizes you should consider non-temporal stores, and on Ivy Bridge and above use rep movsb instead.) Notice that this code is identical to what you asked for, except that I changed the word vector to Vec8f.
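For illustration of the non-temporal-store remark, a rough sketch of a streaming copy (assumes AVX is enabled at compile time, 32-byte-aligned pointers, and a float count that is a multiple of 8):
#include <immintrin.h>
#include <cstddef>

void stream_copy(float *dst, const float *src, std::size_t n_floats)
{
    for (std::size_t i = 0; i < n_floats; i += 8)
    {
        __m256 v = _mm256_load_ps(src + i);
        _mm256_stream_ps(dst + i, v); // non-temporal: bypasses the cache on the way out
    }
    _mm_sfence(); // make the streaming stores globally visible
}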
Using the VCL, a CPU dispatcher, templating, and macros, you can write your code/kernel so that it looks nearly identical to scalar code, without source code duplication for every different instruction set and vector size. It's your binaries that will be bigger, not your source code.
I have described CPU dispatchers several times. You can also see an example using templating and macros for a dispatcher here: alias of a function template
Edit: Here is an example of part of my kernel to calculate the Mandelbrot set for a set of pixels equal to the vector size. At compile time I set TYPE to float, double, or doubledouble and N to 1, 2, 4, 8, or 16. The type doubledouble, which I created and added to the VCL, is described here. This produces vector types of Vec1f, Vec4f, Vec8f, Vec16f, Vec1d, Vec2d, Vec4d, Vec8d, doubledouble1, doubledouble2, doubledouble4, and doubledouble8.
template<typename TYPE, unsigned N>
static inline intn calc(floatn const &cx, floatn const &cy, floatn const &cut, int32_t maxiter) {
    floatn x = cx, y = cy;
    intn n = 0;
    for (int32_t i = 0; i < maxiter; i++) {
        floatn x2 = square(x), y2 = square(y);
        floatn r2 = x2 + y2;
        booln mask = r2 < cut;
        if (!horizontal_or(mask)) break;
        add_mask(n, mask);
        floatn t = x*y; mul2(t);
        x = x2 - y2 + cx;
        y = t + cy;
    }
    return n;
}
So my SIMD code for several different data types and vector sizes is nearly identical to the scalar code I would use. I have not included the part of my kernel which loops over each super-pixel.
My build file looks something like this
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass kernel.cpp -okernel_sse2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse4.1 -Ivectorclass kernel.cpp -okernel_sse41.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx -Ivectorclass kernel.cpp -okernel_avx.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma -Ivectorclass kernel.cpp -okernel_avx2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma -Ivectorclass kernel_fma.cpp -okernel_fma.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx512f -mfma -Ivectorclass kernel.cpp -okernel_avx512.o
g++ -m64 -Wall -Wextra -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass frac.cpp vectorclass/instrset_detect.cpp kernel_sse2.o kernel_sse41.o kernel_avx.o kernel_avx2.o kernel_avx512.o kernel_fma.o -o frac
Then the dispatcher looks something like this
int iset = instrset_detect();
fp_float1 = NULL;
fp_floatn = NULL;
fp_double1 = NULL;
fp_doublen = NULL;
fp_doublefloat1 = NULL;
fp_doublefloatn = NULL;
fp_doubledouble1 = NULL;
fp_doubledoublen = NULL;
fp_float128 = NULL;
fp_floatn_fma = NULL;
fp_doublen_fma = NULL;
if (iset >= 9) {
    fp_float1        = &manddd_AVX512<float,1>;
    fp_floatn        = &manddd_AVX512<float,16>;
    fp_double1       = &manddd_AVX512<double,1>;
    fp_doublen       = &manddd_AVX512<double,8>;
    fp_doublefloat1  = &manddd_AVX512<doublefloat,1>;
    fp_doublefloatn  = &manddd_AVX512<doublefloat,16>;
    fp_doubledouble1 = &manddd_AVX512<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX512<doubledouble,8>;
}
else if (iset >= 8) {
    fp_float1        = &manddd_AVX<float,1>;
    fp_floatn        = &manddd_AVX2<float,8>;
    fp_double1       = &manddd_AVX2<double,1>;
    fp_doublen       = &manddd_AVX2<double,4>;
    fp_doublefloat1  = &manddd_AVX2<doublefloat,1>;
    fp_doublefloatn  = &manddd_AVX2<doublefloat,8>;
    fp_doubledouble1 = &manddd_AVX2<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX2<doubledouble,4>;
}
....
This sets a function pointer for each of the possible datatype/vector-size combinations for the instruction set found at runtime. Then I can call whatever function I'm interested in.
Thanks Peter Cordes and Z boson. With both your replies I came to a solution that satisfies me.
I chose Memcopy just as an example, because everyone knows it and because of its beautiful simplicity (but also slowness) when implemented naively, in contrast to SIMD optimizations, which are often not very readable anymore but of course much faster.
I now have two classes (more are possible, of course), a scalar vector and an SSE vector, both with inline methods. To the user I expose something like:
typedef void(*MEM_COPY_FUNC)(void *, const void *, size_t);
extern MEM_COPY_FUNC memCopyPointer;
I declare my function something like this, as Z boson pointed out:
template<typename VectorType>
void MemCopyTemplate(void *pDest, const void *prc, size_t size)
{
    VectorType v;
    byte *pDst;
    const byte *pSrc;
    uint32 mask;
    pDst = (byte *)pDest;
    pSrc = (const byte *)prc;
    mask = v.GetSize() - 1; // assumes GetSize() returns a power-of-two byte count
    while (size & mask)     // copy the remainder bytewise
    {
        *pDst++ = *pSrc++;
        --size;             // decrement so this loop terminates
    }
    while (size)
    {
        v.Load(pSrc);
        v.Store(pDst);
        pDst += v.GetSize();
        pSrc += v.GetSize();
        size -= v.GetSize();
    }
}
And at runtime, when the library is loaded, I use CPUID to set either
memCopyPointer = MemCopyTemplate<ScalarVector>;
or
memCopyPointer = MemCopyTemplate<SSEVector>;
as you both suggested. Thanks a lot.

Fortran COMPLEX calculates different from C++

I have completed a port from Fortran to C++ but have discovered some differences in the COMPLEX type. Consider the following codes:
      PROGRAM CMPLX
      COMPLEX*16 c
      REAL*8 a
      c = (1.23456789, 3.45678901)
      a = AIMAG(1.0 / c)
      WRITE (*, *) a
      END
And the C++:
#include <complex>
#include <iostream>
#include <iomanip>

int main()
{
    std::complex<double> c(1.23456789, 3.45678901);
    double a = (1.0 / c).imag();
    std::cout << std::setprecision(15) << " " << a << std::endl;
}
Compiling the C++ version with clang++ or g++, I get the output: -0.256561150444368
Compiling the Fortran version however gives me: -0.25656115049876993
I mean, don't both languages follow IEEE 754? If I run the following in Octave (MATLAB):
octave:1> c=1.23456789+ 3.45678901i
c = 1.2346 + 3.4568i
octave:2> c
c = 1.2346 + 3.4568i
octave:3> output_precision(15)
octave:4> c
c = 1.23456789000000e+00 + 3.45678901000000e+00i
octave:5> 1 / c
ans = 9.16290109820952e-02 - 2.56561150444368e-01i
I get the same as the C++ version. What is up with the Fortran COMPLEX type? Am I missing some compiler flags? -ffast-math doesn't change anything. I want to produce exactly the same 15 decimals in C++ and Fortran, so I can more easily spot porting differences.
Any Fortran gurus around? Thanks!
In the Fortran code replace
c = (1.23456789, 3.45678901)
with
c = (1.23456789d0, 3.45678901d0)
Without a kind, the real literals you use on the rhs are, most likely, 32-bit reals, and you probably want 64-bit reals. The suffix d0 causes the compiler to create the 64-bit reals closest to the values you provide. I've glossed over some details in this, and there are other (possibly better) ways of specifying the kind of a real literal, but this approach should work OK on any current Fortran compiler.
I don't know C++ very well, so I'm not sure whether the C++ code has the same problem.
If I read your question correctly, the two codes produce the same answer to 8 significant figures, the limit of single precision.
As for IEEE 754 compliance, that standard does not cover, so far as I am aware, complex arithmetic. I expect the floating-point arithmetic used behind the scenes produces results on complex numbers within expected error bounds in most cases, but I'm not aware that they are guaranteed in the way error bounds on floating-point arithmetic are.
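For what it's worth, you can check this explanation from the C++ side with a small test (not part of either original program): round the constants through float first, the way the kind-less Fortran literals are rounded, and the C++ result should then match the Fortran output:
#include <complex>
#include <iostream>
#include <iomanip>

int main()
{
    std::complex<double> exact(1.23456789, 3.45678901);
    // the same constants rounded to single precision first, as in the Fortran code
    std::complex<double> truncated((float)1.23456789, (float)3.45678901);
    std::cout << std::setprecision(17)
              << (1.0 / exact).imag() << '\n'      // agrees with the Octave result
              << (1.0 / truncated).imag() << '\n'; // should agree with gfortran
}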
I would propose changing all the Fortran constants to double precision,
1.23456789_8 (or 1.23456789D00) etc.,
and using DIMAG instead of AIMAG.

Using Fortran 77 subprogram as stand-alone, calling from C++

So I've been avoiding Fortran like the plague, but finally my time has come... I need to take part of someone else's Fortran code (let's call it program A) and do two things with it:
(1) Merge it with a third person's Fortran code (let's call it program B) so that B can call A
(2) Merge it with my C++ code (program C) so that C can call A
B and C are optimization algorithms, and A is a collection of benchmark functions... But before all that awesomeness can happen, I must first compile the portion of A that I need. All the subroutines of A that I need are contained in one file. I've been getting it into shape based on information I got online (e.g. adding "IMPLICIT NONE" to the code and making it suitable for gfortran). But I've got two stubborn bugs and a warning (I'll leave the warning for another post).
Here's how I am currently compiling it (via a Makefile):
all:
	gfortran -c progA.FOR  # compile only (-c), so that progA.o exists for the link step
	g++ -c progC.cpp
	g++ -o Program.out progA.o progC.o
	rm *.o
But the first line fails to complete with the following errors:
FIRST ERROR:
SUBROUTINE TP1(MODE)
1
Error: Unclassifiable statement at (1)
RELEVANT CODE (starting from the top of the file):
      IMPLICIT NONE
      INTEGER NMAX,MMAX,LMAX,MNNMAX,LWA,LIWA,LACTIV,N,NILI,NINL,
     /        NELI,NENL,NEX, MODE
      PARAMETER (NMAX = 101,
     /           MMAX = 50,
     /           LMAX = 50,
     /           MNNMAX = NMAX + NMAX + MMAX + 2,
     /           LWA = 2*NMAX*NMAX + 33*NMAX + 10*MMAX + 200,
     /           LIWA = MMAX + NMAX + 150,
     /           LACTIV = 2*MMAX + 15)
      LOGICAL INDEX1,INDEX2
      SUBROUTINE TP1(MODE)
      COMMON/L1/N,NILI,NINL,NELI,NENL
      COMMON/L2/X(2)
      COMMON/L4/GF(2)
      COMMON/L6/FX
      COMMON/L9/INDEX1
      COMMON/L10/INDEX2
      COMMON/L11/LXL
      COMMON/L12/LXU
      COMMON/L13/XL(2)
      COMMON/L20/LEX,NEX,FEX,XEX(2)
      REAL*8 X,G,GF,GG,FX,XL,XU,FEX,XEX
      LOGICAL LXL(2),LXU(2),LEX
      GOTO (1,2,3,4,4),MODE
    1 N=2
      NILI=0
      NINL=0
      NELI=0
      NENL=0
      X(1)=-2.D0
      X(2)=1.D0
      LXL(1)=.FALSE.
      LXL(2)=.TRUE.
      LXU(1)=.FALSE.
      LXU(2)=.FALSE.
      XL(2)=-1.5D0
      LEX=.TRUE.
      NEX=1
      XEX(1)=1.D0
      XEX(2)=1.D0
      FEX=0.D0
      RETURN
    2 FX=100.D0*(X(2)-X(1)**2)**2+(1.D0-X(1))**2
      RETURN
    3 GF(2)=200.D0*(X(2)-X(1)**2)
      GF(1)=-2.D0*(X(1)*(GF(2)-1.D0)+1.D0)
    4 RETURN
      END
I do not understand why this error appears, since there are over 300 other subroutines declared in exactly the same way (e.g. SUBROUTINE TP2(MODE), ..., SUBROUTINE TP300(MODE)).
SECOND ERROR:
HX=TP273A(X)
1
Error: Return type mismatch of function 'tp273a' at (1) (REAL(4)/REAL(8))
RELEVANT CODE:
      SUBROUTINE TP273(MODE)
      COMMON/L1/N,NILI,NIML,NELI,NENL
      COMMON/L2/X
      COMMON/L4/GF
      COMMON/L6/FX
      COMMON/L11/LXL
      COMMON/L12/LXU
      COMMON/L20/LEX,NEX,FEX,XEX
      LOGICAL LEX,LXL(6),LXU(6)
      REAL*8 X(6),FX,GF(6),FEX,XEX(6),HX,DFLOAT
      GOTO (1,2,3,4,4)MODE
    1 N=6
      NILI=0
      NINL=0
      NELI=0
      NENL=0
      DO 6 I=1,6
      X(I)=0.D+0
      XEX(I)=0.1D+1
      LXL(I)=.FALSE.
    6 LXU(I)=.FALSE.
      LEX=.TRUE.
      NEX=1
      FEX=0.D+0
      RETURN
    2 HX=TP273A(X)
      FX=0.1D+2*HX*(0.1D+1+HX)
      RETURN
    3 HX=TP273A(X)
      DO 7 I=1,6
    7 GF(I)=0.2D+2*(0.16D+2-DFLOAT(I))*(X(I)-0.1D+1)
     1      *(0.1D+1+0.2D+1*HX)
    4 RETURN
      END

      REAL*8 FUNCTION TP273A (X)
      REAL*8 X(6),DFLOAT
      TP273A=0
      DO 10 I=1,6
   10 TP273A=TP273A+(0.16D+2-DFLOAT(I))*(X(I)-0.1D+1)**2
      RETURN
      END
After reading Physics Forums I tried renaming the variable "TP273A" to "TP273Avar" so that it would not have the same name as the function. This did not resolve the error. Also, I replaced the "1" with "F" just below "7 GF(I) = ..." and recompiled. Nothing changed. I'm pretty sure the changes I just mentioned are necessary anyway, but there must be something else going on.
I have also read Data type mismatch in fortran and Function return type mismatch, so I naively tried adding "module mycode" to the top and "end module mycode" to the bottom of the file to no avail.
After this is all said and done, my goal is to call these subroutines from C++ using code similar to:
#include <kitchensink>

extern "C"
{
    void tp1_(int *mode); // gfortran lowercases the name and appends an underscore
}

int main()
{
    int mode = 2;
    tp1_(&mode); // Fortran expects every argument by address, so pass a pointer
    return 0;
}
Once the Fortran code compiles, I want to modify the subroutines so that C++ can pass a std::vector X to TP#_(2, *X, *Y) and get back the computed value of Y. My std::vector X will replace COMMON/L2 X in each of the subroutines, and Y will be the value of FX computed in the subroutines. I used Mixing Fortran and C as guidance for the above C++ code.
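A sketch of what I'm aiming for (the three-argument signature is my planned modification, not the current code; as above, I'm assuming gfortran's convention of lowercased names with a trailing underscore and all arguments passed by address):
#include <vector>

extern "C" void tp1_(int *mode, double *x, double *y);

int main()
{
    std::vector<double> x = {-2.0, 1.0}; // the starting point TP1 uses
    double y = 0.0;
    int mode = 2;                        // mode 2 computes the objective FX
    tp1_(&mode, x.data(), &y);           // x.data() hands Fortran a plain array
    return 0;
}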
As for the "B calls A" part, I hope that it will be as simple as compiling A along with B and adding CALL TP1(MODE) lines wherever I need them.
Any and all guidance will be greatly appreciated!!!
You cannot have statements just sitting in a file outside of a compilation unit. Compilation units can be subroutines, functions, modules, or programs. In your case you have some statements (the first of them being IMPLICIT NONE) and only after them does the subroutine TP1 begin.
Either organize the procedures in a module and keep the common part before the contains section (more work on the C++ interoperability will follow if you are a Fortran newbie), or include the IMPLICIT NONE and the other declarations in every subroutine separately. Are you sure you even need this, if the code worked before?