Is there a gfortran option for prefetch?

I would like to use prefetch in my code to improve cache behavior. For example, assume I have this array of indexes: indexes = [9,3,2,6,7,5,8,4,1,10] and the code below:
do i = 1, 10
  total = total + arr(indexes(i)) * i
end do
So the cache behavior of indexes is good, while that of arr is bad.
What I want is something like this:
do i = 1, 10
  prefetch(arr(indexes(i+1)))
  total = total + arr(indexes(i)) * i
end do
I've seen this: https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/language-reference/a-to-z-reference/o-to-p/prefetch-and-noprefetch.html but I was looking for a version for gfortran too, or better: compiler independent.

That loop is so short that it will most likely be fully optimized out at any decent level of compiler optimization. Anyway, I think everything depends on the nature of your indexes array. As a rule of thumb, ask yourself:
Does indexes change often, seldom, or never?
If it never changes:
Make it a parameter
Also make the multiplier a parameter: integer, parameter :: fixed_i(*) = [(i,i=1,10)]
That way the compiler has all the information at compile time.
If it changes only occasionally, consider pre-processing it to at least some extent. At a minimum, I would:
Make a temporary array tmp_i = [(i,i=1,10)], then sort indexes in ascending order (and reorder tmp_i accordingly)

Depending on your specific real conditions, I would try manually pre-fetching. Basically, assign a new array using the indexed order:
! -- Manual prefetch, ideally done once
do i=1,10
indexed(i) = arr(indexes(i))
end do
! -- Use cached data
do i=1,10
total = total + indexed(i) * i
end do
In my opinion, this works best when:
There are not too many indexes arrays
Each indexes array is used relatively frequently
memory usage is not a bottleneck
I doubt there is a compiler-independent solution to this in the compile options.

Related

Golang concurrent array access

Is it safe to access the same array from multiple goroutines, when every goroutine works on a slice, pointing to the same underlying array but without overlapping ?
Like:
var arr [100]int
sliceA := arr[:50]
sliceB := arr[50:]
go WorkOn(sliceA)
go WorkOn(sliceB)
Just imagine "WorkOn" would do something fancy.

As long as you can guarantee the areas won't overlap, it's fine.
By guarantee I mean: whoever works on sliceA should not be allowed to do sliceA = append(sliceA, a, b, c), because the append would then start running into sliceB's territory.
Relevant here is the Go 1.2 documentation on a new language element, three-index slices:
Go 1.2 adds the ability to specify the capacity as well as the length when using a slicing operation on an existing array or slice. A slicing operation creates a new slice by describing a contiguous section of an already-created array or slice:
var array [10]int
slice := array[2:4]
The capacity of the slice is the maximum number of elements that the slice may hold, even after reslicing; it reflects the size of the underlying array. In this example, the capacity of the slice variable is 8.
Go 1.2 adds new syntax to allow a slicing operation to specify the capacity as well as the length. A second colon introduces the capacity value, which must be less than or equal to the capacity of the source slice or array, adjusted for the origin. For instance,
slice = array[2:4:7]
sets the slice to have the same length as in the earlier example but its capacity is now only 5 elements (7-2). It is impossible to use this new slice value to access the last three elements of the original array.
In this three-index notation, a missing first index ([:i:j]) defaults to zero but the other two indices must always be specified explicitly. It is possible that future releases of Go may introduce default values for these indices.
Further details are in the design document.

Actually jimt's answer MAY be wrong. It depends... :)
E.g. if you are using a []uint8, then an operation like
p[2] = 5
is essentially this
tmp = p[0..3] // this is 32 bit
tmp[2] = 5
p[0..3] = tmp // yeah this is all fake syntax but you'll get it
This is because your CPU is 32 (or even 64) bit. So that is actually more efficient although it seems more complex.
But as you can see, you are WRITING p[0,1,3] although you only intended to write to p[2]. This can create some fun bugs to debug! :)
If your data is, e.g., pointers, this issue should not occur: array elements are laid out in memory so that the problem does not arise as long as each element is as wide as your machine's native word.

Which algorithm brings the best performance? [duplicate]

This question already has answers here:
Why are elementwise additions much faster in separate loops than in a combined loop?
(10 answers)
What is the overhead in splitting a for-loop into multiple for-loops, if the total work inside is the same? [duplicate]
(4 answers)
Closed 9 years ago.
I have a piece of code that is really dirty.
I want to optimize it a little. Does it make any difference which of the following structures I use, or are they identical from a performance point of view in C++?
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
....
or
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
if
if ... else ...
if
if ... else ...
....
for end
Thanks in Advance!

Both are O(n). As we do not know the guts of the various for loops it is impossible to say.
BTW - Mark it as pseudo code and not C++

The 1st one may spend less time incrementing/testing i and conditionally branching (assuming the compiler's optimiser doesn't reduce it to the equivalent of the second one anyway), but with loop unrolling the time taken for the i loop may be insignificant compared to the time spent within the loop anyway.
Countering that, it's easily possible that the choice of separate versus combined loops will affect the ratio of cache hits, and that could significantly impact either version: it really depends on the code. For example, if each of the three if/else statements accessed different arrays at index i, then they'll be competing for CPU cache and could slow each other down. On the other hand, if they accessed the same array at index i, doing different steps in some calculation, then it's probably better to do all three steps while those memory pages are still in cache.
There are potential impacts other than caches, from register allocation to the speed of I/O devices (e.g. if each loop operates on lines/records from a distinct file on different physical drives, it's very probably faster to process some of each file in one loop rather than process each file sequentially), etc.
If you care, benchmark your actual application with representative data.
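A minimal sketch of such a benchmark, assuming a made-up workload: three independent element-wise updates on three arrays. The names run_split, run_fused and time_us are hypothetical, and the operations stand in for whatever the real if/else bodies do. It only shows how the two structures could be timed side by side.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Variant 1: one loop per operation (three passes over the index range).
long long run_split(std::vector<int>& a, std::vector<int>& b, std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += 1;
    for (std::size_t i = 0; i < b.size(); ++i) b[i] *= 2;
    for (std::size_t i = 0; i < c.size(); ++i) c[i] -= 3;
    long long sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] + b[i] + c[i];
    return sum;
}

// Variant 2: all operations fused into a single pass.
long long run_fused(std::vector<int>& a, std::vector<int>& b, std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    long long sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] += 1;
        b[i] *= 2;
        c[i] -= 3;
        sum += a[i] + b[i] + c[i];
    }
    return sum;
}

// Runs a callable once and returns the elapsed wall time in microseconds.
template <typename F>
long long time_us(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

To compare, call each variant through time_us on fresh copies of the same data, with optimization enabled and array sizes close to the real ones. A single run is noisy, so repeat the measurement several times.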

Just from the structure of the loop it is not possible to say which approach will be faster.
Algorithmically, both have the same complexity, O(n). However, they might perform differently depending on the kind of operation you are performing on the elements and on the size of the container.
The size of the container may have an impact on locality and hence on performance. Generally speaking, you want to do as much work on the data as you can once it is in the cache, so I would prefer the second approach. To get a clear picture, you should actually measure the performance of your approach.

The second is only slightly more efficient than the first. You save:
Initialization of loop index
Calling size()
Comparing the loop index with size()
Incrementing the loop index
These are very minor optimizations. Do it if it doesn't impact readability.

I would expect the second approach to be at least marginally more optimal in most cases as it can leverage the locality of reference with respect to access to elements of the entity collection/set. Note that in the first approach, each for loop would need to start accessing elements from the beginning; depending on the size of the cache, the size of the list and the extent to which compiler can infer and optimize, this may lead to cache misses when a new for loop attempts to read an element even though that element would have been read already by a preceding loop.

Reducing for loop overhead

I have a need to iterate through an amortization formula, which looks like this:
R = ( L * (r / m) ) / ( 1 - pow( (1 + (r / m)), (-1 * m * t) ) );
I'm using a for loop for iteration, and incrementing the L (loan value) by 1 each time. The loop works just fine, but it did make me wonder about something else, which is the value (or lack thereof) in performing basic operations before a loop executes and then referencing those values through a variable. For example, I could further modify this function to look like
// outside for loop
amortization = (r/m)/(1 - pow( (1+(r/m)), (-1*m*t) ) )
// inside for loop
R = L * amortization
This way, instead of having to perform lots of math operations on every iteration of the loop, I can just reference the variable amortization and perform a single operation.
What I'm wondering is how relevant this is. Is there any actual value in extracting these operations, or is the time saved so small that we're talking about a savings of milliseconds in a for loop that iterates approx. 200,000 times? Follow-up question: would extracting operations like this be worth it if I were doing more expensive operations like sqrt?
(note: in case it matters, I'm asking about this specifically with c++ in mind)

Compilers would apply an optimization technique here called loop-invariant code motion. It does pretty much what you did manually, i.e. it extracts the constant part of an expression that is evaluated repeatedly in the loop into a precomputed value stored in a variable (or register). Hence you are unlikely to gain any performance by doing this yourself.
Of course if it's critical speed-wise, you should profile and/or review the assembly code produced by compiler in both cases.
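For reference, a small sketch of the two versions being compared, using the formula from the question. The function names and the parameters first_loan and count are made up for illustration. The two results can differ in the last bits because the floating-point operations are grouped differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Evaluates the full formula for every loan value L.
std::vector<double> payments_naive(double r, double m, double t,
                                   double first_loan, std::size_t count) {
    assert(m != 0.0);
    std::vector<double> out;
    out.reserve(count);
    for (std::size_t k = 0; k < count; ++k) {
        const double L = first_loan + static_cast<double>(k);
        out.push_back((L * (r / m)) / (1.0 - std::pow(1.0 + r / m, -m * t)));
    }
    return out;
}

// Computes the loop-invariant factor once, then does one multiply per L.
std::vector<double> payments_hoisted(double r, double m, double t,
                                     double first_loan, std::size_t count) {
    assert(m != 0.0);
    const double amortization = (r / m) / (1.0 - std::pow(1.0 + r / m, -m * t));
    std::vector<double> out;
    out.reserve(count);
    for (std::size_t k = 0; k < count; ++k) {
        const double L = first_loan + static_cast<double>(k);
        out.push_back(L * amortization);
    }
    return out;
}
```

Compiling both with optimization enabled and comparing the generated assembly shows whether your compiler hoists the pow call on its own.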

Compilers already move loop invariant code to the outside of the loop. This optimisation is known as "Loop Invariant Code Motion" or "Hoisting Invariants".
If you want to know how much it affects performance then the only way you are going to know is if you try. I would imagine that if you are doing this 200,000 times then it certainly could affect performance (if the compiler doesn't already do it for you).

As others have mentioned, good compilers will do this sort of optimization automatically. However...
First, pow is probably a library function, so your compiler might or might not know it is a "pure" function (i.e., that its behavior depends only on its arguments). If not, it will be unable to perform this optimization, because for all it knows pow might print a message or something.
Second, I think factoring this expression out of the loop makes the code easier to understand, anyway. The human reading your code can also factor out this expression "automatically", but why make them? It just distracts them from your algorithm's flow.
So I would say to make this change regardless.
That said, if you really care about performance, get a decent profiler and do not worry about micro-optimizations until it tells you to. Your priorities early on should be (a) use a decent algorithm and (b) implement it clearly. And not in that order.

When you have optimizations turned on, moving constant expressions out of a loop is something that compilers are pretty good at doing on their own, so this might not buy you any speed-up anyway.
But if the compiler doesn't do it, this looks like a reasonable thing to try, and then time it IF this is actually taking longer than you require.

Setting all array elements to an integer

I have an array,
int a[size];
I want to set all the array elements to 1.
Some of the elements in the array are already set to 1, so would it be better to check each element with a conditional statement, like
for (int index = 0; index < size; index++)
{
if (a[index] != 1)
a[index] = 1;
}
or to set all the elements no matter what? What would the difference be?

Your code has two paths in the loop, depending on each value:
Read from array, comparison, and branch
Read from array, comparison, and write
That's not worth it. Just write.
If you want, you can do the same by calling
std::fill(a, a + size, 1);
If the array is of type char instead of int, it will likely call memset. And platform-specific implementations of fill can offer the compiler optimization hints.
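A short sketch comparing the two approaches on a plain int array. The function names are made up for illustration. Both leave the array in the same state; the second simply writes unconditionally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Variant from the question: test each element, write only when it differs.
void set_ones_checked(int* a, std::size_t size) {
    assert(a != nullptr || size == 0);
    for (std::size_t i = 0; i < size; ++i) {
        if (a[i] != 1) {
            a[i] = 1;
        }
    }
}

// Unconditional write of every element through the standard algorithm.
void set_ones_fill(int* a, std::size_t size) {
    assert(a != nullptr || size == 0);
    std::fill(a, a + size, 1);
}
```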

Just set all the elements to 1. Code for simplicity and readability first. If you find that the code runs too slow, profile it to see where improvements need to be made (although I highly doubt performance problems can come from setting elements of an array of integers to a certain value).

I'm guessing you are just looking for understanding and not battling a real performance issue... this just wouldn't show up under measurement and here's why:
Normally whenever a cached memory processor (i.e. most of today's desktop CPUs) has to write a value to memory, the cache line that contains the address must be read from (relatively slow) RAM. The value is then modified by a CPU write to the cache. The entire cache line is eventually written back to main RAM.
When you are performing operations over a range of contiguous addresses like your array, the CPU will be able to perform several operations very quickly over one cache line before it is written back. It then moves on to the next cache line, which was previously fetched in anticipation.
Most likely performing the test before writing the value will not be measurably different than just writing for several reasons:
Branch prediction makes this process extremely efficient.
The compiler will have done some really powerful optimizations.
The memory transfer to cache RAM will be the real rate determining step.
So just write your code for clarity. Measure the difference if you are still curious.

Use an std::vector instead.
#include <vector>
...
std::vector<int> a(10, 1);
// access elements just as you would with a C array
std::cout << "Second element is: " << a[1] << std::endl;
Now you have an array of 10 integers all set to 1. If you already have an initialised vector, i.e. a vector filled with values other than one, you can use fill, like this:
#include <algorithm>
...
std::fill(a.begin(), a.end(), 1);

I wouldn't expect there to be a noticeable difference unless size is a very large value - however, if you're wanting the optimal variant then just setting all values to 1 would be the more performant option - I'm certain that the conditional will take more time than a simple assignment even if the assignment is then deemed not needed.

With C++11, you can use the range-based for to set all values:
int a[size];
for(auto &v: a) {
v = 1;
}
The &v iterates by reference, so the loop variable is assignable.
This format is a nice alternative to std::fill, and really comes into its own if the assigned value is a more complicated expression.

C++ using precalculated limiters in for loops

In scripting languages like PHP having a for loop like this would be a very bad idea:
string s("ABCDEFG");
int i;
for( i = 0; i < s.length(); i ++ )
{
cout << s[ i ];
}
This is just an example; I'm not building a program like this. (For those who feel they have to tell me why this piece of code <insert bad thing about it here>.)
If this C++ example were translated to a similar PHP script, the length of the string would be calculated on every loop cycle. That would cause an enormous performance loss in realistic scripts.
I thought the same would apply to C++ programs, but when I look at tutorials, several open-source libraries and other pieces of code, I see that the limiter for the loop isn't precalculated.
Should I precalculate the length of the string s?
Why isn't the limiter always precalculated? (seen this in tutorials and examples)
Is there some sort of optimization done by the compiler?

It's all relative.
PHP is interpreted, but if s.length drops into a compiled part of the PHP interpreter, it will not be slow. But even if it is slow, what about the time spent in s[i], and what about the time spent in cout <<?
It's really easy to focus on loop overhead while getting swamped with other stuff.
Like if you wrote this in C++, and cout were writing to the console, do you know what would dominate? cout would, far and away, because that innocent-looking << operator invokes a huge pile of library code and system routines.

You should learn to justify simpler code. Try to convince yourself that sooner or later you will replace the string::length implementation with a more optimized one. (Even though your project will most likely miss all deadlines, and optimizing string::length will be the least of your problems.) This kind of thinking will help you focus on things that really matter, even though it's not always easy...

It depends on how the string is implemented.
On null terminated strings you have to calculate the size on every iteration.
std::string is a container and the size should be returned in O(1) time,
it depends (again) on the implementation.

The optimizer may indeed be able to optimize the call to length away if it's able to determine that its value won't change. Nevertheless, you're on the safe side if you precalculate it (in many cases, however, optimization won't be possible because it's not clear to the compiler whether the condition variable could possibly be changed during the loop).
In many cases, it just doesn't matter because the loop in question is not performance-relevant. Using the classic for(int i=0; i < somewhat(); ++i) is both less work to type and easier to read than for(int i=0,end=somewhat(); i < end; ++i).
Note that the C++ compiler will usually inline small functions, such as length (which would usually retrieve a precalculated length from the string object). Interpreted scripting languages usually need a dictionary lookup for a function call, so for C++ the relative overhead of the redundant check once per loop iteration is probably much smaller.
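To make the two loop forms concrete, here is a small sketch on a made-up task, counting uppercase ASCII letters. The function names are hypothetical. Both return the same result; the only difference is whether length() sits in the loop condition or is read once into end.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Classic form: length() is part of the condition tested on every iteration.
std::size_t count_upper_plain(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.length(); ++i) {
        if (s[i] >= 'A' && s[i] <= 'Z') ++n;
    }
    assert(n <= s.length());
    return n;
}

// Precalculated form: the bound is read once in the init-statement.
// This is only valid because s is not modified inside the loop.
std::size_t count_upper_cached(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0, end = s.length(); i < end; ++i) {
        if (s[i] >= 'A' && s[i] <= 'Z') ++n;
    }
    assert(n <= s.length());
    return n;
}
```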

You're correct, s.length() will normally be evaluated on every loop iteration. You're better off writing:
size_t len = s.length();
for (size_t i = 0; i < len; ++i) {
...
}
Instead of the above. That said, for a loop with only a few iterations, it doesn't really matter how often the call to length() will be made.

I don't know about php but I can tell what c++ does.
Consider:
std::string s("Rajendra");
for (unsigned int i = 0; i < s.length(); i++)
{
std::cout << s[i] << std::endl;
}
If you look up the definition of length() (right-click on length() and choose "Go To Definition", or, if you are using Visual Assist X, place the cursor on length() and press Alt+G), you will find the following:
size_type __CLR_OR_THIS_CALL length() const
{ // return length of sequence
return (_Mysize);
}
where _Mysize is a data member holding the current size. This shows that the length of the string is stored, and each call to length() simply returns the stored value.
However, in my personal opinion, this coding style is bad and best avoided. I would prefer the following:
std::string s("Rajendra");
std::size_t len = s.length();
for (std::size_t i = 0; i < len; i++)
{
std::cout << s[i] << std::endl;
}
This way you avoid calling length() once per iteration, which saves pushing and popping a stack frame each time if the call is not inlined. That can add up when your string is large.
Hope that helps.

Probably.
For readability.
Sometimes. It depends on how good it is at detecting that the length will not change inside the loop.

Short answer, because there are situations where you want it called each time.
someone else's explanation: http://bytes.com/topic/c/answers/212351-loop-condition-evaluation

Well - as this is a very common scenario, most compilers will precalculate the value. Especially when looping through arrays and very common types - string might be one of them.
In addition, introducing an additional variable might destroy some other loop optimizations - it really depends on the compiler you use and might change from version to version.
Thus in some scenarios, the "optimization" could backfire.
If the code is not a real hot spot where every tick of performance matters, you should write it as you did: no "manual" precalculation.
Readability is also very important when writing code! Optimizations should be done very carefully and only after intensive profiling!

std::string::length() returns a value that is stored in the container. It is already precalculated.

In this particular case, std::string.length() is usually (but not necessarily) a constant-time operation and usually pretty efficient.
For the general case, the loop termination condition can be any expression, not just a comparison of the loop index variable. Indeed, C/C++ does not recognize any particular index, just an initializer expression, a loop test expression and a loop counter expression (which is simply executed every time through). The C/C++ for loop is basically syntactic sugar for a while loop.

The compiler may be able to save the result of the call and optimize away all of the extra function calls, but it may not. However, the cost of the function call will be quite low since all it has to do is return an int. On top of that, there's a good chance that it will be inlined, removing the cost of the function call altogether.
However, if you really care, you should profile your code and see whether precalculating the value makes it any faster. But it wouldn't hurt to just choose to precalculate it anyway. It won't cost you anything. Odds are, in most cases though, it doesn't matter. There are some containers - like list - where size() might not be an O(1) operation, and then precalculating would be a really good idea, but for most it probably doesn't matter much - especially if the contents of your loop are enough to overwhelm the cost of such an efficient function call. For std::string, it should be O(1), and will probably be optimized away, but you'd have to test to be sure - and of course things like the optimization level that you compile at could change the results.
It's safer to precalculate but often not necessary.

std::string::length() returns a precalculated value. Some other STL containers may recalculate their size every time you call size(),
e.g.
std::list::size() recalculates the size in some pre-C++11 implementations, and
std::vector::size() returns a precalculated value.
It depends on how the internal storage of the container is implemented:
std::vector is a contiguous dynamic array and std::list is a linked list.

You could precalculate the length of the string only if you KNOW the string won't ever change inside the loop.
I don't know why it is done this way in tutorials. A few guesses:
1) To get you in the habit so you don't get hosed when you are changing the value of the string inside the loop.
2) Because it is easier to understand.
Yes, the optimizer will try to improve this if it can determine that the string won't change.
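A sketch of a loop where the string does change inside the body, so the bound must be re-read each time. The function remove_spaces is a made-up example. With a length cached before the loop, i could run past the new end once characters have been erased.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Removes every space from a copy of the input.
// The string shrinks inside the loop, so length() is deliberately
// evaluated on every iteration instead of being cached up front.
std::string remove_spaces(std::string s) {
    for (std::size_t i = 0; i < s.length(); /* advanced in the body */) {
        if (s[i] == ' ') {
            s.erase(i, 1);  // length() drops by one; stay on the same index
        } else {
            ++i;
        }
    }
    assert(s.find(' ') == std::string::npos);
    return s;
}
```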

Just for information, on my computer, g++ 4.4.2, with -O3, with the given piece of code, the function std::string::length() const is called 8 times.
I agree it's precalculated and the function is very likely to be inlined. Still very interesting to know when you are using your own functions/class.

If your program performs a lot of operations on strings all the time like copying strings ( with memcpy ) then it makes sense to cache the length of the string.
A good example of this I have seen in opensource code is redis.
In sds.h ( Simple Dynamic Strings ) look at sdshdr structure:
struct sdshdr {
long len;
long free;
char buf[];
};
As you can see it caches the length ( in len field ) of the character array buf and also the free space available ( in free field ) after the null character.
So the total memory allocated to buf is
len + free
This saves a realloc in case the buf needs to be modified and it fits in the space already available.
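A rough C++ sketch of the same idea, not the actual Redis code. The type CachedString, its append function and the doubling growth policy are all assumptions made for illustration. It only shows how caching len and free lets an append skip reallocation when the spare space is large enough.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical length-caching buffer modelled on the len/free idea.
struct CachedString {
    std::size_t len = 0;          // bytes in use, excluding the terminator
    std::size_t free = 0;         // spare bytes available after the terminator
    std::vector<char> buf{'\0'};  // storage of len + free + 1 bytes

    // Appends n bytes and returns true if the storage had to grow.
    bool append(const char* data, std::size_t n) {
        assert(data != nullptr);
        bool grew = false;
        if (n > free) {
            // Assumed policy: allocate twice the space needed right now.
            const std::size_t capacity = 2 * (len + n);
            buf.resize(capacity + 1);  // +1 for the terminator
            free = capacity - len;
            grew = true;
        }
        std::memcpy(buf.data() + len, data, n);
        len += n;
        free -= n;
        buf[len] = '\0';
        assert(len + free + 1 == buf.size());
        return grew;
    }
};
```

After a first append of 5 bytes, a second append of 2 bytes fits in the spare space and reports no growth, and the length is available from len without scanning for the terminator.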
