Temporary array creation and routine GEMM - fortran

When I run a Fortran code, I get this warning:
Fortran runtime warning: An array temporary was created for argument '_formal_11' of procedure 'zgemm'
related to this part of the code:
do iw = w0, w1
   do a = 1, nmodes
      Vw(a,:) = V(a,:)*w(iw,:)
   end do
   call zgemm('N', 'C', &
              nmodes, nmodes, nbnd*nbnd, &
              (1.d0,0.0d0), &
              Vw, nmodes, &
              V, nmodes, &
              (0.d0,0.0d0), VwV(iw,:,:), nmodes)
end do
If I have understood correctly, the warning is related to passing non-contiguous arrays, which could affect performance. I would like to take care of this. However, it is not clear to me what exactly the problem is here, and what I could do to solve it.

What is going on is that you activated compiler flags that warn you about temporary array creation at runtime.
Before getting to more explanation, we have to take a better look at what an array is. An array is an area in memory, together with the information needed to interpret it correctly. That information includes, but is not limited to, the data type of the elements, the number of dimensions, the start and end index of each dimension, and, most importantly, the gap (stride) between two successive elements.
In very simplistic terms, Fortran 77 and below do not have a built-in mechanism to pass the gap between successive elements. So when there is no explicit interface for the called subroutine, the compiler ensures that there is no gap between successive elements by copying the data to a temporary contiguous array. This is a safe mechanism that ensures the predictable behavior of the subroutine.
When using modules, Fortran 90 and above use a descriptor to pass this information to the called subroutine; that works hand in hand with assumed-shape declarations of arrays. This is also a simplistic description.
In summary, this is a warning that matters only if performance is affected, as Vladimir said.
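For the zgemm loop above, here is a minimal sketch of one possible fix, assuming you are free to reorder the dimensions of VwV so that the slice passed as the result matrix is contiguous (sizes are made up; link with a BLAS library):

program vwv_demo
   use, intrinsic :: iso_fortran_env, only: real64
   implicit none
   ! Made-up sizes for illustration only.
   integer, parameter :: nmodes = 4, nbnd = 3, w0 = 1, w1 = 5
   complex(real64) :: V(nmodes, nbnd*nbnd), Vw(nmodes, nbnd*nbnd)
   complex(real64) :: w(w0:w1, nbnd*nbnd)
   ! iw is now the LAST dimension, so VwV(:,:,iw) is a contiguous
   ! block and zgemm receives it without an array temporary.
   complex(real64) :: VwV(nmodes, nmodes, w0:w1)
   integer :: iw, a

   V = (1.d0, 0.d0)
   w = (2.d0, 0.d0)
   do iw = w0, w1
      do a = 1, nmodes
         Vw(a,:) = V(a,:)*w(iw,:)
      end do
      call zgemm('N', 'C', nmodes, nmodes, nbnd*nbnd, &
                 (1.d0,0.d0), Vw, nmodes, V, nmodes, &
                 (0.d0,0.d0), VwV(:,:,iw), nmodes)
   end do
end program vwv_demo

If reordering the dimensions of VwV is not an option, copying the slice into a contiguous buffer yourself (which is exactly what the runtime already does for you) is the remaining choice.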

Related

Can I efficiently use an array of integers rather than 30 integer variables?

I'm trying to port C++ code from a developer who uses global variables called
p0, p1, ..., p30
of integer type.
I wondered if I could not just use an array int p[31]; and access them as p[0], p[1],...
It seems plausible that there would be no performance hit if the indices were always passed as constants. Then I could just pass his data as extern int p[];.
Obviously I could use descriptive macros for the various indices to make the code clearer.
I know that this sounds like weird code, but the developer seems to have a "neurodiverse" personality, and we can't just tell him to mend his ways. Performance is very important in the module he is working on.
I don't see any danger in the replacement of variables with an array.
Modern compilers are very good at optimizing code.
You normally can assume that there will be no difference between using individual variables p0, … p30 and an std::array<int, 31> (or an int[31]), if they are used in the same way and if you use only constants for accessing the array.
A compiler is not required to keep an std::array or an int[] as such, but can completely or partially optimize it away as long as it complies with the as-if rule.
Variables (including arrays) only need to exist in memory if the compiler can't determine their contents at compile time and/or if the registers are not sufficient to do all manipulations related to those variables using only those registers.
If they exist in memory, they need to be referenced by their address; for both a pN and a p[N] (if N is constant), the address where the value lies in memory can be determined in the same way at compile time.
If you are unsure whether the generated code is the same, you can always compare the output generated by the compiler (this can e.g. be done on godbolt), or use the corresponding compiler flags if you don't want to submit code to a foreign service.
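As a minimal sketch (with hypothetical names) that you could paste into godbolt to compare the generated code:

#include <array>

// Hypothetical stand-ins for the developer's globals.
int p0 = 1, p1 = 2, p2 = 3;
std::array<int, 31> p{1, 2, 3};

int sum_vars() { return p0 + p1 + p2; }

// With constant indices, the element addresses are known at compile
// time, so optimizers typically emit the same code as sum_vars().
int sum_array() { return p[0] + p[1] + p[2]; }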

CUDA, Using 2D and 3D Arrays

There are a lot of questions online about allocating, copying, indexing, etc., 2D and 3D arrays in CUDA. I'm getting a lot of conflicting answers, so I'm attempting to compile past questions to see if I can ask the right ones.
First link: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/
Problem: Allocating a 2d array of pointers
User solution: use mallocPitch
"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)
"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu
Second link: https://devtalk.nvidia.com/default/topic/413905/passing-a-multidimensional-array-to-kernel-how-to-allocate-space-in-host-and-pass-to-device-/
Problem: Allocating space on host and passing it to device
Sub link: https://devtalk.nvidia.com/default/topic/398305/cuda-programming-and-performance/dynamically-allocate-array-of-structs/
Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.
Third link: Allocate 2D Array on Device Memory in CUDA
Problem: Allocating and transferring 2d arrays
User solution: use mallocPitch
Other solution: flatten it
Fourth link: How to use 2D Arrays in CUDA?
Problem: Allocate and traverse 2d arrays
Submitted solution: Does not show allocation
Other solution: squash it
There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.
Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row yet the functions mallocPitch and memcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?
The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.
Another solution I was considering was to make a matrix class that uses a 1D pointer array, but I can't find a way to implement the double bracket operator.
Also according to this link: Copy an object to device?
and the sub link answer: cudaMemcpy segmentation fault
This gets a little iffy.
The classes I want to use CUDA with all have 2/3D arrays, and wouldn't there be a lot of overhead in converting those to 1D arrays for CUDA?
I know I've asked a lot, but in summary: should I get used to squashed arrays as a fact of life, or can I use the 2D allocate-and-copy functions without getting bad overhead like in the solution where alloc and cpy are called in a for loop?
Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.
cudaMallocPitch/cudaMemcpy2D:
First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly subscripted or doubly dereferenced. For additional example usage, here is one of many questions on this, and here is a fully worked example. Another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.
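For illustration, a minimal sketch of a pitched allocation plus a 2D copy (sizes made up); note that both the source and destination are single pointers:

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const int width = 100, height = 50;    // elements per row, rows
    float *h_src = (float *)calloc(width * height, sizeof(float));
    float *d_dst;                          // single pointer, not float**
    size_t pitch;
    cudaMallocPitch((void **)&d_dst, &pitch, width * sizeof(float), height);
    cudaMemcpy2D(d_dst, pitch, h_src, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);
    // element (row, col) on the device lives at:
    // *((float *)((char *)d_dst + row * pitch) + col)
    cudaFree(d_dst);
    free(h_src);
    return 0;
}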
general, dynamically allocated 2D case:
If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:
there is additional, non-trivial complexity
the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1.
(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
Also, here is a thrust method for building a general dynamically allocated 2D array.
flattening:
If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!). However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
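As a minimal sketch of the flattened approach (kernel name and sizes are made up):

#include <cuda_runtime.h>

// Simulated 2D access over a flat allocation: element (row, col)
// lives at index row * width + col.
__global__ void addOne(float *data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        data[row * width + col] += 1.0f;
}

int main() {
    const int width = 64, height = 32;
    float *d_data;
    cudaMalloc((void **)&d_data, width * height * sizeof(float));
    cudaMemset(d_data, 0, width * height * sizeof(float));
    dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
    addOne<<<grid, block>>>(d_data, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}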
general, dynamically allocated 3D case:
As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so it is even less efficient. Here is a fully worked example (2nd code example).
special case: array width known at compile time:
Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) are known at compile time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and we can use doubly-subscripted access with considerably less complexity than the general case, with no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The already-mentioned answer here gives a fully worked example of that in the 3D case (first code example), and the answer here gives a 2D example of this special case.
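A minimal sketch of this special case (kernel name and sizes are made up):

#include <cuda_runtime.h>

// Width known at compile time: rows have a fixed size, so
// data[row][col] needs only one pointer dereference.
const int W = 64;

__global__ void scale(float (*data)[W], int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < W)
        data[row][col] *= 2.0f;
}

int main() {
    const int height = 32;
    float (*d_data)[W];
    cudaMalloc((void **)&d_data, height * W * sizeof(float));
    cudaMemset(d_data, 0, height * W * sizeof(float));
    dim3 block(16, 16), grid(W / 16, (height + 15) / 16);
    scale<<<grid, block>>>(d_data, height);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}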
doubly-subscripted host code, singly-subscripted device code:
Finally another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based off a flat allocation and a manually-created pointer "tree", however this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so less efficient, and there is some complexity associated with building the pointer "tree", for use in device code (e.g. it would necessitate an additional cudaMemcpy operation, probably).
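A minimal host-side sketch of that arrangement (sizes made up):

#include <cstdlib>

int main() {
    const int h = 4, w = 5;
    // Contiguous storage plus a host-only row-pointer "tree":
    // host code may use rows[i][j]; the device gets the flat pointer.
    float *flat = (float *)calloc(h * w, sizeof(float));
    float **rows = (float **)malloc(h * sizeof(float *));
    for (int i = 0; i < h; i++) rows[i] = flat + i * w;
    rows[2][3] = 1.0f;           // doubly-subscripted on the host
    // A cudaMemcpy of flat (h*w*sizeof(float) bytes) would hand the
    // same data to the device, indexed there as flat[i*w + j].
    free(rows);
    free(flat);
    return 0;
}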
From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.

Fortran runtime warning: temporary array

I get the Fortran runtime warning "An array temporary was created" when running my code (compiled with gfortran), and I would like to know if there is a better way to resolve this warning.
My original code is something like this:
allocate(flx_est(lsign,3))
allocate(flx_err(lsign,3))
do i=1,lsign
   call combflx_calc(flx_est(i,:),flx_err(i,:))
enddo
Inside the subroutine I define the variables like this:
subroutine combflx_calc(flx_est,flx_err)
   use,intrinsic :: ISO_Fortran_env, only: real64
   implicit none
   real(real64),intent(inout) :: flx_est(3),flx_err(3)
The flx_est and flx_err vectors may change inside the subroutine depending on several conditions, and I need to update their values accordingly.
Fortran does not seem to like this structure. I can solve it by defining temporary variables:
tmp_flx_est=flx_est(i,:)
tmp_flx_err=flx_err(i,:)
call combflx_calc(tmp_flx_est,tmp_flx_err)
flx_est(i,:)=tmp_flx_est
flx_err(i,:)=tmp_flx_err
But it seems to me quite a silly way to fix it.
As you may see I'm not an expert with Fortran, so any help is more than welcome.
One way is to pass an assumed-shape array,
real(real64),intent(inout) :: flx_est(:),flx_err(:)
the other is to exchange the dimensions of your array so that you can pass a contiguous section of the 2D array:
call combflx_calc(flx_est(:,i),flx_err(:,i))
The problem is that the explicit-size dummy arguments of your procedure (var(n)) require contiguous arrays, whereas assumed-shape arrays can have a stride.
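For illustration, a minimal sketch of the assumed-shape variant; it needs an explicit interface, e.g. by placing the subroutine in a module (the body here is just a stub):

module flx_mod
contains
   ! Assumed-shape dummies: the caller passes a descriptor that
   ! carries the stride, so a non-contiguous slice such as
   ! flx_est(i,:) needs no temporary copy.
   subroutine combflx_calc(flx_est, flx_err)
      use, intrinsic :: ISO_Fortran_env, only: real64
      implicit none
      real(real64), intent(inout) :: flx_est(:), flx_err(:)
      flx_est = 0.0_real64   ! stub body: update as needed
      flx_err = 0.0_real64
   end subroutine combflx_calc
end module flx_mod

The calling loop stays exactly as in the question; only a use flx_mod statement needs to be added to the caller.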
Your array temporary is being created because you are passing a strided array to your subroutine. Fortran arrays are column-major, so the leftmost index varies fastest; or, better said, the leftmost index is contiguous in memory and each index to the right is strided over those to the left.
When you call
call combflx_calc(flx_est(i,:),flx_err(i,:))
these slices are your 3-vectors, strided in memory by the length lsign. The subroutine expects variables of a single dimension contiguous in memory, which the variable you pass into it is not. Thus, a temporary must be made for the subroutine to operate on and then copied back into your array slice.
Your "fix" does not change this, it just not longer warns about a temporary because you are using an explicitly created variable rather than the runtime doing it for you.
Vladimir's answer gives you options to avoid the temporary, so I will not duplicate them here.

About the order of input parameters

For a function/method that takes many input parameters, does it make a difference if they are passed in different orders? If it does, in what aspects (readability, efficiency, ...)? I am more curious about what I should do for my own functions/methods.
It seems to me that:
Parameters passed by reference/pointer often come before parameters passed by value. For example:
void* memset( void* dest, int ch, std::size_t count );
Destination parameters often come before source parameters. For example:
void* memcpy( void* dest, const void* src, std::size_t count );
Except for some hard constraints, i.e., parameters with default values must come last. For example:
size_type find( const basic_string& str, size_type pos = 0 ) const;
They are functionally equivalent (achieve the same goal) no matter what order they are passed in.
There are a few reasons it can matter - listed below. The C++ Standard itself doesn't mandate any particular behaviours in this space, so there's no portable way to reason about performance impact, and even if something's demonstrably (slightly) faster in one executable, a change anywhere in the program, or to the compiler options or version, might remove or even reverse the earlier benefit. In practice it's extremely rare to hear people talk about parameter ordering being of any significance in their performance tuning. If you really care you'd best examine your own compiler's output and/or benchmark resultant code.
Exceptions
The order of evaluation of expressions passed to function parameters is unspecified, and it's quite possible that it could be affected by changes to the order they appear in the source code, with some combinations working better in the CPU execution pipeline, or raising an exception earlier that short-circuits some other parameter preparation. This could be a significant performance factor if some of the parameters are temporary objects (e.g. results of expressions) that are expensive to allocate/construct and destruct/deallocate. Again, any change to the program could remove or reverse a benefit or penalty observed earlier, so if you care about this you should create a named temporary for parameters you want evaluated first before making the function call.
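A minimal sketch (hypothetical types and functions) of pinning the evaluation order with a named temporary:

#include <string>

struct Widget { std::string payload; };

Widget expensive() { return {"big"}; }   // stand-in for a costly call
int    cheap()     { return 42; }
void   process(const Widget&, int) {}

int main() {
    // In process(expensive(), cheap()) the evaluation order of the
    // arguments is unspecified; naming a temporary guarantees that
    // expensive() runs first.
    Widget w = expensive();
    process(w, cheap());
}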
Registers vs cache (stack memory)
Some parameters may be passed in registers, while others are pushed onto the stack - which effectively means memory that is, at best, in the fastest CPU cache - and implies their handling may be slower.
If the function ends up accessing all the parameters anyway, and the choice is between putting parameter X in a register and Y on the stack or vice versa, it doesn't matter much how they're passed. But given the function may have conditions affecting which variables are actually used (if statements, switches, loops that may or may not be entered, early returns or breaks, etc.), it's potentially faster if a variable that's not actually needed sits on the stack while one that is needed is in a register.
See http://en.wikipedia.org/wiki/X86_calling_conventions for some background and information on calling conventions.
Alignment and padding
Performance could theoretically be affected by the minutiae of parameter-passing conventions: the parameters may need particular alignment for any access, or perhaps just full-speed access, on the stack, and the compiler might choose to pad rather than reorder the values it pushes. It's hard to imagine that being significant unless the data for parameters was on the scale of cache page sizes.
Non-performance factors
Some of the other factors you mention can be quite important - for example, I tend to put any non-const pointers and references first, and name the function load_xxx, so I have a consistent expectation of which parameters may be modified and which order to pass them. There's no particularly dominant convention though.
Strictly speaking it doesn't matter - parameters are pushed onto the stack, and the function accesses them by taking them from the stack in some way.
However, most C/C++ compilers allow you to specify alternative calling conventions. For example, Visual C++ supports the __fastcall convention, which stores the first 2 parameters in the ECX and EDX registers, which (in theory) should give you a performance improvement in the right circumstances.
There's also __thiscall which stores the this pointer in the ECX register. If you're doing C++ then this may be useful.
There are some answers here mentioning calling conventions. They have nothing to do with your question: no matter what calling convention you use, the order in which you declare the parameters doesn't matter. It doesn't matter which parameters are passed in registers and which are passed on the stack, as long as the same number of parameters are passed in registers and the same number of parameters are passed on the stack. Please note that parameters that are larger than the native architecture size (4 bytes for 32-bit and 8 bytes for 64-bit) are passed by address, so they are passed with the same speed as smaller data.
Let's take an example:
You have a function with 6 parameters, and you have a calling convention, let's call it CA, that passes one parameter in a register and the rest (5 in this case) on the stack, and a second calling convention, let's call it CB, that passes 4 parameters in registers and the rest (in this case 2) on the stack.
Now, of course CB will be faster than CA, but it has nothing to do with the order in which the parameters are declared. For CA, it will be as fast no matter which parameter you declare first (the register one) and which you declare 2nd, 3rd, ..., 6th (on the stack), and for CB it will be as fast no matter which 4 arguments you declare for registers and which you declare as the last 2 stack parameters.
Now, regarding your question:
The only rule that is mandatory is that optional parameters must be declared last. No non-optional parameter can follow an optional parameter.
Other than that, you can use whatever order you want, and the only strong advice I can give you is be consistent. Choose a model and stick to it.
Some guidelines you could consider:
destination comes before source. This is to be close to destination = source.
the size of the buffer comes after the buffer: f(char * s, unsigned size)
input parameters first, output parameters last (this conflicts with the first one I gave you)
But there is no "wrong" or "right" or even a universal accepted guideline for the order of the parameters. Choose something and be consistent.
Edit
I thought of a "wrong" way to order your parameters: alphabetical order :).
Edit 2
For example, both for CA, if I pass a vector(100) and an int, it will be better if vector(100) comes first, i.e. use registers to load the larger data type. Right?
No. As I have mentioned, the data size doesn't matter. Let's talk about a 32-bit architecture (the same discussion is valid for any architecture: 16-bit, 64-bit, etc.). Let's analyze the 3 cases we can have regarding the size of the parameters in relation to the native size of the architecture.
Same size: 4-byte parameters. Nothing to talk about here.
Smaller size: a 4-byte register will be used, or 4 bytes will be allocated on the stack. So nothing interesting here either.
Larger size (e.g. a struct with many fields, or a static array): no matter which method is chosen for passing this argument, the data resides in memory, and what is passed is a pointer (size 4 bytes) to that data. Again, we have a 4-byte register or 4 bytes on the stack.
The size of the parameters doesn't matter.
Edit 3
As @TonyD explained, the order matters if you don't access all the parameters. See his answer.
I have somehow found a few related pages.
https://softwareengineering.stackexchange.com/questions/101346/what-is-best-practice-on-ordering-parameters-in-a-function
https://google.github.io/styleguide/cppguide.html#Function_Parameter_Ordering
So, first, Google's C++ style guide does not really answer the question, since it does not address the actual order within the input parameters or within the output parameters.
The other page basically suggests ordering parameters in a way that makes them easy to understand and use.
For the sake of readability, I personally prefer to order parameters alphabetically. But you can also work out some strategy for naming the parameters so that they are nicely ordered and still easy to understand and use.

Fortran arrays and subroutines (sub arrays)

I'm going through a Fortran code, and one bit has me a little puzzled.
There is a subroutine, say
SUBROUTINE SSUB(X,...)
REAL*8 X(0:N1,1:N2,0:N3-1),...
...
RETURN
END
Which is called in another subroutine by:
CALL SSUB(W(0,1,0,1),...)
where W is a 'working array'. It appears that a specific value from W is passed to X; however, X is dimensioned as an array. What's going on?
This is a not-uncommon idiom for getting the subroutine to work on a subset (rectangular in N dimensions) of the original array.
All parameters in Fortran (at least before Fortran 90) are passed by reference, so the actual array argument is resolved as a location in memory. Choose a location inside the space allocated for the whole array, and the subroutine manipulates only part of the array.
Biggest issue: you have to be aware of how the array is laid out in memory and how Fortran's array indexing scheme works. Fortran uses column-major array ordering, which is the opposite convention from C. Consider an array that is 5x5 in size (and index both dimensions from 0 to make the comparison with C easier). In both languages (0,0) is the first element in memory. In C the next element in memory is [0][1], but in Fortran it is (1,0). This affects which indexes you drop when choosing a subspace: if the original array is A(i,j,k,l), and the subroutine works on a three-dimensional subspace (as in your example), in C it works on Aprime[i=constant][j][k][l], but in Fortran it works on Aprime(i,j,k,l=constant).
The other risk is wrap-around. The dimensions of the (sub)array in the subroutine have to match those in the calling routine, or strange, strange things will happen (think about it). So if A is declared of size (0:4,0:5,0:6,0:7), and we call with element A(0,1,0,1), the receiving routine is free to start the index of each dimension wherever it likes, but must make the sizes (5,6,7), or else; but since we started at j=1, that means the last element in the j direction actually wraps around! The thing to do about this is not to use the last element. Making sure that happens is the programmer's job, and is a pain in the butt. Take care. Lots of care.
In Fortran, variables are passed by address, so W(0,1,0,1) is both a value and an address. Basically, you pass the subarray starting at W(0,1,0,1).
This is called "sequence association". In this case, what appears to be a scalar, an element of an array (the actual argument in the caller), is associated with an array (implicitly its first element), the dummy argument in the subroutine. Thereafter the elements of the arrays are associated by storage order, known as "sequence". This was done in Fortran 77 and earlier for various reasons, here apparently for a workspace array; perhaps the programmer was doing their own memory management. This is retained in Fortran >= 90 for backwards compatibility but, IMO, doesn't belong in new code.
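For illustration, a minimal sketch of sequence association in the Fortran 77 style (sizes made up):

      PROGRAM SEQDEMO
C     Made-up sizes for illustration: W holds 2x3x2x2 = 24 elements.
      REAL*8 W(0:1,1:3,0:1,1:2)
C     Pass element W(0,2,0,1); SSUB treats it as the start of a
C     rank-1 array, associated by storage sequence.
      CALL SSUB(W(0,2,0,1), 6)
      PRINT *, W(0,2,0,1)
      END

      SUBROUTINE SSUB(X, N)
      INTEGER N
      REAL*8 X(N)
      INTEGER I
      DO 10 I = 1, N
         X(I) = 1.D0
   10 CONTINUE
      RETURN
      END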