Optimizing if-then-else statement in Fortran 77

For my C++ code, I asked this question about two days ago. But I realize now that I have to do the coding in Fortran, since the kernels I write are going to be part of an existing application written in Fortran 77. Therefore I am posting this question again, this time in the context of Fortran. Thank you.
I have different functions for square matrix multiplication depending on the matrix size, which varies from 8x8 through 20x20. The functions differ from each other because each employs a different strategy for optimization, namely, different loop permutations and different loop unroll factors. The matrix size is invariant during the life of a program and is known at compile time. My goal is to reduce the time taken to decide which function must be used. For example, a naive implementation is:
      IF (matrixSize .EQ. 8) THEN
         CALL mxm8(A, B, C)
      ELSE IF (matrixSize .EQ. 9) THEN
         CALL mxm9(A, B, C)
      ...
      ELSE IF (matrixSize .EQ. 20) THEN
         CALL mxm20(A, B, C)
      END IF
The time taken to decide which function to use for every matrix multiplication is non-trivial in this case, especially since matrix multiplication happens frequently in the code. Thanks in advance for any suggestions on how to handle this in Fortran 77.

If matrixSize is a compile time constant in a language sense (i.e. it is a Fortran PARAMETER), then I would expect most optimising compilers to take advantage of that, and completely eliminate a runtime branch.
If matrixSize is not a compile time constant, then you should make it one. Facilities provided in later Fortran language revisions (modules) make it very easy to propagate such a runtime constant from a single point of definition to a point of use.
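Since this question grew out of a C++ version, the same idea expressed in C++ terms may help; this is a minimal sketch, assuming hypothetical kernels mxm8/mxm16 standing in for the question's routines:

// Hypothetical per-size kernels, standing in for the question's mxm8..mxm20:
void mxm8(const double* A, const double* B, double* C);
void mxm16(const double* A, const double* B, double* C);

// The size as a language-level constant, playing the role of a Fortran PARAMETER:
constexpr int kMatrixSize = 16;

void multiply(const double* A, const double* B, double* C) {
    // An ordinary if-chain: because kMatrixSize is a compile-time constant,
    // an optimising compiler folds the comparisons and emits a direct call
    // to the single matching kernel; no runtime test survives.
    if (kMatrixSize == 8)       mxm8(A, B, C);
    else if (kMatrixSize == 16) mxm16(A, B, C);
}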
Note that conforming Fortran 77 is also conforming Fortran 90, and with very few exceptions, will also be conforming Fortran 2015.

If it is known at compile time, then you only need one version of this function. It seems like you could just put each version of the function in its own object file or library, and then link to the appropriate one.
If you meant to say it is known at runtime, but does not change over the course of an execution, then you could have 13 versions of the code, one for each size, and use one set of IFs to decide which to use.

Related

How much do C/C++ compilers optimize conditional statements?

I recently ran into a situation where I wrote the following code:
for(int i = 0; i < (size - 1); i++)
{
// do whatever
}
// Assume 'size' will be constant during the duration of the for loop
When looking at this code, it made me wonder how exactly the for loop condition is evaluated on each iteration. Specifically, I'm curious whether the compiler would 'optimize away' any additional arithmetic that has to be done on each pass. In my case, would this code get compiled such that (size - 1) would have to be evaluated on every loop iteration? Or is the compiler smart enough to realize that the size variable won't change, and so precalculate the value instead of re-evaluating it each iteration?
This then got me thinking about the general case where you have a conditional statement that may specify more operations than necessary.
As an example, how would the following pieces of code compile:
if(6)
if(1+1+1+1+1+1)
int foo = 1;
if(foo + foo + foo + foo + foo + foo)
How smart is the compiler? Will the 3 cases listed above be converted into the same machine code?
And while I'm at it, why not list another example. What does the compiler do if you are doing an operation within a conditional that won't have any effect on the end result? Example:
if(2*(val))
// Assume val is an int that can take on any value
In this example, the multiplication is completely unnecessary. While this case seems a lot stupider than my original case, the question still stands: will the compiler be able to remove this unnecessary multiplication?
Question:
How much optimization is involved with conditional statements?
Does it vary based on compiler?
Short answer: the compiler is exceptionally clever, and will generally optimise those cases that you have presented (including utterly ignoring irrelevant conditions).
One of the biggest hurdles language newcomers face in truly understanding C++ is that there is not a one-to-one relationship between their code and what the computer executes. The entire purpose of the language is to create an abstraction. You are defining the program's semantics, but the computer has no responsibility to actually follow your C++ code line by line; indeed, if it did, it would be abhorrently slow compared to the speed we expect from modern computers.
Generally speaking, unless you have a reason to micro-optimise (game developers come to mind), it is best to almost completely ignore this facet of programming, and trust your compiler. Write a program that takes the inputs you want, and gives the outputs you want, after performing the calculations you want… and let your compiler do the hard work of figuring out how the physical machine is going to make all that happen.
Are there exceptions? Certainly. Sometimes your requirements are so specific that you do know better than the compiler, and you end up optimising. You generally do this after profiling and determining what your bottlenecks are. And there's also no excuse to write deliberately silly code. After all, if you go out of your way to ask your program to copy a 50MB vector, then it's going to copy a 50MB vector.
But, assuming sensible code that means what it looks like, you really shouldn't spend too much time worrying about this. Modern compilers are so good at optimising that you'd be a fool to try to keep up.
The C++ language specification permits the compiler to make any optimization that results in no observable change to the expected results (the "as-if" rule).
If the compiler can determine that size is constant and will not change during execution, it can certainly make that particular optimization.
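For illustration, the transformation amounts to hoisting the loop-invariant bound. This is a source-level sketch of the effect (real compilers do this on their intermediate representation, and the wrapper function process is hypothetical):

// Effectively what loop-invariant code motion produces for the
// question's loop (illustrative, not actual compiler output):
void process(int size) {
    const int limit = size - 1;      // evaluated once, not per iteration
    for (int i = 0; i < limit; i++) {
        // do whatever
    }
}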
Alternatively, if the compiler can determine that i is not used in the loop body (and its value is not used afterwards), i.e. that it serves only as a counter, it might very well rewrite the loop to:
for(int i = 1; i < size; i++)
because that performs the same number of iterations while comparing against size directly, which might produce smaller code. Even if i is used in some fashion, the compiler can still make this change and then adjust every other use of i so that the observable results are still the same.
To summarize: anything goes. The compiler may or may not make any optimization change as long as the observable results are the same.
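As a concrete instance of "anything goes", here is a sketch of the if(2*(val)) case from the question (the function nonzero is hypothetical, and actual code generation varies by compiler and flags):

// 2*val is zero exactly when val is zero, and signed overflow is
// undefined behaviour anyway, so the compiler may drop the
// multiplication entirely.
bool nonzero(int val) {
    return 2 * val;   // typically compiled the same as: return val != 0;
}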
Yes, there is a lot of optimization, and it is very complex.
It varies based on the compiler, and it also varies based on the compiler options.
Check
https://meta.stackexchange.com/questions/25840/can-we-stop-recommending-the-dragon-book-please
for some book recommendations if you really want to understand what a compiler may do. It is a very complex subject.
You can also compile to assembly with the -S option (gcc / g++) to see what the compiler is really doing. Use -O3 / ... / -O0 / -O to experiment with different optimization levels.

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now, which should I prefer, and when: warpSize or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.
Let's get a couple of points straight. The warp size isn't a compile-time constant and shouldn't be treated as one. It is an architecture-specific runtime immediate constant (whose value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX; however, that changed at least six years ago, if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile-time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant.
From the runtime API as a device property.
Don't declare your own constant for the warp size; that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime to get it. If you have a statically declared in-kernel array that you need to dimension by the warp size, use templates and select the correct instance at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
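A minimal sketch of that template approach (all names here are hypothetical): each instantiation bakes in one warp size, and the host queries the device to pick the right instance at runtime.

#include <cuda_runtime.h>

template <int WarpSize>
__global__ void kernel() {
    __shared__ float buf[4 * WarpSize];   // statically dimensioned by warp size
    buf[threadIdx.x % WarpSize] = 0.0f;
    // ... kernel body using buf ...
}

void launch(dim3 grid, dim3 block) {
    int dev = 0, ws = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&ws, cudaDevAttrWarpSize, dev);
    if (ws == 32) kernel<32><<<grid, block>>>();
    // add further instances here if a different warp size ever ships
}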
Contrary to talonmies's answer, I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language; on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
The day we get a different warp size (e.g. 64) many things will change:
The value of warpSize will obviously have to be adjusted.
Many warp-level intrinsics will need their signatures adjusted, or a new version produced, e.g. int __ballot; and while int does not have to be 32-bit, it most commonly is!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
as that would be overly complex for what is usually a time-critical piece of code.
The computation of warpIdx and laneIdx from threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize, this suddenly becomes quite an expensive operation!
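For comparison, this is the shift-and-mask form that the division and modulo boil down to when the divisor is the literal 32 (same math as above, written out):

warpIdx = threadIdx.x >> 5;   // threadIdx.x / 32
laneIdx = threadIdx.x & 31;   // threadIdx.x % 32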
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
Also, if the amount of shared memory depends on the warpSize this forces you to use the dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- this forces you to do pointer arithmetic yourself and manually compute the sum of all memory usage.
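To make that inconvenience concrete, here is a sketch of the pointer arithmetic the dynamic-shared-memory route forces on you (n and m are hypothetical element counts):

__global__ void kernel(int n, int m) {
    extern __shared__ unsigned char smem[];
    float* a = reinterpret_cast<float*>(smem);   // first: n floats
    int*   b = reinterpret_cast<int*>(a + n);    // then: m ints, right after
    // ... use a and b ...
}
// The host must sum the sizes itself at launch time:
// kernel<<<grid, block, n * sizeof(float) + m * sizeof(int)>>>(n, m);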
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for a newer architecture, you just need to alter this one piece of code.
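A hypothetical kernel using that header constant shows the payoff: static shared-memory sizing and cheap lane math both work again, because warp_size is a true compile-time constant.

__global__ void reduceKernel() {
    __shared__ float partial[warp_size];   // static shared array, no dynamic shmem
    int lane = threadIdx.x % warp_size;    // folds to a mask, as discussed above
    partial[lane] = 0.0f;
    // ... warp-level reduction body ...
}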

C++ compiler optimization for complex equations

I have some equations that involve multiple operations that I would like to run as fast as possible. Since the C++ compiler breaks it down into machine code anyway, does it matter if I break it up into multiple lines, like
A=4*B+4*C;
D=3*E/F;
G=A*D;
vs
G=12*E*(B+C)/F;
My need is more complex than this, but I think it conveys the idea. Also, if this is in a function that gets called in a loop, does defining double A, D cost CPU time versus making them class variables?
Using a modern compiler (Clang/GCC/VC++/Intel), it won't really matter. The best thing you can do is worry about how readable your code will be and turn on optimizations; compiler designers are well aware of issues like these and design their compilers to (for the most part) optimize accordingly.
If I had to say which would be slower, I would assume the first way, since there would be three extra mov instructions, but I could be wrong. Either way, this isn't something you should worry about too much.
If these variables are integers, that second code fragment is not a valid optimization of the first. For B=1, C=1, E=1, F=6, you have:
A=4*B+4*C; // 8
D=3*E/F; // 0
G=A*D; // 0
and
G=12*E*(B+C)/F; // 4
If floating point, then it really depends on what compiler, what compiler options, and what CPU you have.

Is it possible to convert all regular programming tasks to compile time using meta-programming?

I read about meta-programming and found it really interesting: for example, checking whether a number is prime, or calculating a Fibonacci number. I'm curious about its practical usage: if we could convert all runtime solutions to meta-programming, then applications would perform much better. Let's say we want to find the max value of an array. At run time that takes O(n) if the array is not sorted. Is it possible to get O(1) with meta-programming?
Thanks,
Chan
You can't because metaprogramming only works for inputs that are known at compile time. So you can have a metafunction that calculates a Fibonacci number given a constant known at compile time:
int value = Fibonacci<5>::Value;
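For reference, a minimal sketch of how such a Fibonacci metafunction might be written (the classic recursive formulation; this definition is not from the original post):

template <unsigned N>
struct Fibonacci {
    static const int Value = Fibonacci<N - 1>::Value + Fibonacci<N - 2>::Value;
};
template <> struct Fibonacci<0> { static const int Value = 0; };
template <> struct Fibonacci<1> { static const int Value = 1; };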
But it won't work for values that are inputted by a user at runtime:
int input = GetUserInput();
int value = Fibonacci<input>::Value; // Does not compile
Sure, you can recompile the program every time you get new values, but that becomes impractical for non-trivial programs.
Keep in mind that metaprogramming in C++ is basically a "useful accidental abuse" of the way C++ handles templates. Template metaprogramming was definitely not what the C++ standards committee had in mind when creating the C++ standards prior to C++0x. You can only push the compiler so much until you get internal compiler errors (that has changed nowadays with newer compilers, but you still shouldn't go overboard).
There's an (advanced-level) book dedicated to C++ template metaprogramming if you want to see what they are really useful for.
If it ain't known when you hit the compile button, then it won't be solvable by meta-programming.
If you're talking about processing data known at compile-time (as opposed to known at run-time), then theoretically, yes.
In practice, no. Any non-trivial task quickly becomes a tangled nightmare of impenetrable template code, giving even more impenetrable error messages when they fail to compile. Furthermore, most C++ compilers can only tolerate a certain depth of template nesting before they explode.
Sure. Of course, you can't make any system calls, all users will need a compiler to run your program, user input will have to take the form of defining constant expressions, but yeah...if you really, really wanted to you could write just about any program in C++ template code so that it 'runs' during compilation rather than runtime.
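To make the question's own example concrete, here is a sketch of "max of an array" computed at compile time. It uses C++14 constexpr rather than the classic template metaprogramming these answers discuss, and it only works because the data itself is a compile-time constant:

constexpr int data[] = {3, 1, 4, 1, 5};

constexpr int maxOf() {
    // The O(n) scan still happens, but during compilation, not at run time.
    int m = data[0];
    for (int v : data)
        if (v > m) m = v;
    return m;
}

static_assert(maxOf() == 5, "evaluated entirely at compile time");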

How should I compare a C++ metaprogram with C code? (runtime)

I have ported a C program to a C++ template metaprogram. Now I want to compare the runtimes.
Since there is almost no runtime in the C++ program, how should I compare these two programs?
Can I compare the C runtime with the C++ compile time? Or is it just not comparable?
You can compare anything you want to compare. There is no one true rule of what should be compared.
You can compare the time each version takes to execute, or you can compare the time taken to compile each.
Or you can compare the length of the program, or the number of 'r' characters in the source file.
You could compare the timestamp of each file.
How you should compare the two programs depend on what you want to show!
If you want to show that one executes faster than the other, then run both, time how long they take to execute, and compare those numbers.
If you want to show that one compiles faster than the other, then measure how long each takes to compile.
If you think the relation between the compile time of the C++ program and the run time of the C program is relevant, then compare those.
Decide what it is you want to show. Then you'll know what to compare.
If I understand correctly, you've rewritten a C program with one that is entirely template-based? As a result, you're comparing the time it takes to run the C program with a C++ program that takes almost no time but simply writes the result out.
In this case, I don't think it's quite comparable: the end user will see the C program take x seconds to run, and the C++ one complete immediately. However, the developer will see the C program compile in x seconds, and the C++ one compile in many more seconds.
You could compare the C++ compile time to the C run time, and if the app is designed to produce a result and never run twice, then yes, you can compare the times in this way. If the program is designed to be run multiple times, then the run time is what you need to compare.
I just hope you put a LOT of comments in your C++ template code though :)
PS. I'm curious - how long does the C take to run, compared to the compile time for both?
Since the C++ program will always produce the same result, why bother with any of it? Compute the result once using either program, and then replace both with:
#include <stdio.h>
int main()
{
    printf("<insert correct output here>\n");
    return 0;
}
I think what would make sense is to compare compile times of the two programs, then runtimes, then you can calculate after how many runs you have amortized the additional compile time.
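For illustration, with made-up numbers: if the templated version adds 30 seconds of compile time and the C version needs 0.5 seconds per run (while the templated binary is effectively instant), the extra compile time is amortized after 30 / 0.5 = 60 runs.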
This is what I think you're trying to do:
You haven't said what your C program does, so let's say it computes a cosine to some specified degree of accuracy. You've converted this program into a C++ template-based equivalent which does the same thing, but at compile time, to yield a compile-time constant value. This is a reasonable thing to do, as you may have an algorithm that uses "hard-coded" cosine values and you prefer not to have a table of random-looking numbers. See this article for an example of real-world use for this (or do a search for Blitz and/or Todd Veldhuizen for more examples).
In which case, you therefore want to compare the compile-time performance of the C++ cosine calculator against the run-time performance of the original C version.
A direct comparison of the time to compile the C++ source file against the time to run the C version will almost certainly show the compile time to be significantly slower. But this is hardly a fair comparison since the compiler is doing a lot more than just "executing" the template code.
EDIT: You could compensate for the compiler overhead by creating a copy of your C++ program which has some simple code equivalent to what the templated code would generate, i.e. you hand-compile your templated code, if that makes sense. If you then time the compilation of that source, the difference between that time and the time to compile your original templated C++ program is presumably just the time required to execute the template.
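For example, with a hypothetical workload: if the templated program computes Fibonacci<30>, the hand-compiled control is simply static const int result = 832040; (the value of F(30), taking F(0) = 0). Timing the compilation of that version measures the compiler's fixed overhead, and the difference between the two compile times approximates the cost of executing the template.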
Today's C and C++ compilers share the same backends, and hence most likely generate the same assembly code.
C++ is just a more annotated C, and you can still write good C while Cplusplusing ;)
C is just C++'s older brother.