Is it bad practice to operate on a structure and assign the result to the same structure? Why? - c++

I don't recall seeing examples of code like this hypothetical snippet:
cpu->dev.bus->uevent = (cpu->dev.bus->uevent) >> 16; //or the equivalent using a macro
in which a member in a large structure gets dereferenced using pointers, operated on, and the result assigned back to the same field of the structure.
The kernel seems to be a place where such large structures are frequent but I haven't seen examples of it and became interested as to the reason why.
Is there a performance reason for this, maybe related to the time required to follow the pointers? Is it simply not good style and if so, what is the preferred way?

There's nothing wrong with the statement syntactically, but it's easier to code it like this:
cpu->dev.bus->uevent >>= 16;

It's much more a matter of history: the kernel is mostly written in C (not C++), and -in the original development intention- (K&R era) C was thought of as a "high-level assembler", whose statements and expressions should have a literal correspondence between C and ASM. In that environment, ++i, i+=1 and i=i+1 were completely different things that translated into completely different CPU instructions.
Compiler optimizations, at that time, were not so advanced and popular, so the idea of following the pointer chain twice was often avoided by first storing the resulting destination address in a local temporary variable (most likely a register) and then doing the assignment
(like int* p = &a->b->c->d; *p = *p >> 16;)
or by using a compound assignment like a->b->c >>= 16;.
With today's computers (multicore processors, multilevel caches and pipelining), execution inside registers can be ten times faster than memory access, so following three pointers is faster than storing an address in memory, thus reversing the priority of the old trade-off.
Compiler optimization, then, can freely change the produced code to tune it for size or speed, depending on which is considered more important and on what kind of processor you are working with.
So -nowadays- it doesn't really matter if you write ++i or i+=1 or i=i+1: the compiler will most likely produce the same code, attempting to access i only once. And following the pointer chain twice will most likely be rewritten as the equivalent of cpu->dev.bus->uevent >>= 16, since >>= corresponds to a single machine instruction on x86-derivative processors.
That said ("it doesn't really matter"), it is also true that code style tends to reflect the styles and fashions of the age it was first written in (since later developers tend to maintain consistency).
Your code is not "bad" by itself, it just looks "odd" in the place it is usually written.
Just to give you an idea of what pipelining and branch prediction are, consider the comparison of two vectors:
bool equal(size_t n, int* a, int* b)
{
    for(size_t i=0; i<n; ++i)
        if(a[i]!=b[i]) return false;
    return true;
}
Here, as soon as we find something different we short-circuit and say they are different.
Now consider this:
bool equal(size_t n, int* a, int* b)
{
    register size_t c=0;
    for(register size_t i=0; i<n; ++i)
        c+=(a[i]==b[i]);
    return c==n;
}
There is no shortcut: even if we find a difference, we continue to loop and count.
But having removed the if from inside the loop, if n isn't that big (let's say less than 20) this can be 4 or 5 times faster!
An optimizing compiler can even recognize this situation - provided there are no differing side effects - and rework the first code into the second!

I see nothing wrong with something like that, it appears as innocuous as:
i = i + 42;
If you're accessing the data items a lot, you could consider something like:
tSomething *cdb = cpu->dev.bus;
cdb->uevent = cdb->uevent >> 16;
// and many more accesses to cdb here
but, even then, I'd tend to leave it to the optimiser, which tends to do a better job than most humans anyway :-)

There's nothing inherently wrong by doing
cpu->dev.bus->uevent = (cpu->dev.bus->uevent) >> 16;
but depending on the type of uevent, you need to be careful when shifting right like that, so you don't accidentally shift unexpected bits into your value. For instance, if it's a 64-bit value
uint64_t uevent = 0xDEADBEEF00000000;
uevent = uevent >> 16; // now uevent is 0x0000DEADBEEF0000;
if you thought you shifted a 32-bit value and then pass the new uevent to a function taking a 64-bit value, you're not passing 0xBEEF0000, as you might have expected. Since the sizes fit (64-bit value passed as 64-bit parameter), you won't get any compiler warnings here (which you would have if you passed a 64-bit value as a 32-bit parameter).
Also interesting to note is that the above operation, while similar to
i = ++i;
which is undefined behavior in C (and in C++ before C++11; see http://josephmansfield.uk/articles/c++-sequenced-before-graphs.html for details), is still well defined, since there are no side effects in the right-hand-side expression.

Related

In C++, when an element of an array is used multiple times in a loop, is it better to assign it to another variable?

Let's say I have a loop that will do some math on an array x. Is it better to assign a temporary double inside the loop at each iteration or should I use the array[i] every time?
By better I mean performance-wise, using C++. I wonder if C++ has some vectorization or cache optimization that I'm ruining?
Also what if I call a function using this array and I might need values of a function multiple times, so I usually do the same with functions. I assume this would be better than calling the function many times.
How about if the loop uses omp parallel, I assume this should be safe, correct?
for(int i=0; i<N; i++){
    double xi = X[i];
    double fi = f(xi);
    t[i] = xi*xi + fi + xi/fi;
}
elcuco is correct. Any compiler worth its salt will be able to optimise out something this trivial. What matters here is code readability; personally I find X[i] to be a little easier to look at in this situation.
I will note that if you are repeatedly writing very long expressions, e.g. X.something.something.darkside[i][j], it might make sense to use a clearly named reference, e.g. auto& the_emperor = X.something.something.darkside[i][j].
Modern compilers (last 10 years) will optimise it out. Don't worry about it.
EDIT:
This has been discussed in StackOverflow a few times:
Will compiler optimize and reuse variable
In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)
This official documentation explains it; IMHO the relevant options are
-fmerge-all-constants, -fivopts and maybe -ftree-coalesce-vars. Clang and MSVC have similar options; feel free to research them yourself or link them here.
In practice, when a compiler sees a memory read (a variable, or array value) it will read it into a register, and unless the location is marked volatile, the compiler can assume it did not change and will not issue instructions to re-read it.
Having said the magical volatile word: It should not be used for threading. It should be used for hardware mapped memory (for example, video card memory or external ports).

How to effectively apply bitwise operation to (large) packed bit vectors?

I want to implement
void bitwise_and(
    char* __restrict__ result,
    const char* __restrict__ lhs,
    const char* __restrict__ rhs,
    size_t length);
or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details - alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions etc.
I'm sure this has (more than one) fast existing implementations, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bit_set - but I don't want to spend the time constructing one of those.
So do I... Copy-paste from an existing library? Find a library which can 'wrap' a raw packed bits array in memory with a nice object? Roll my own implementation anyway?
Notes:
I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
Obviously, making copies of the input arrays is out of the question - that would probably nearly-double the execution time.
I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
If it helps you any, assume it's uint64_ts rather than chars (that doesn't really matter - if the char array is unaligned we can just treat the heading and trailing chars separately).
This answer is going to assume you want the fastest possible way and are happy to use platform-specific things. Your optimising compiler may be able to produce similar code to the below from normal C, but in my experience across a few compilers something as specific as this is still best hand-written.
Obviously like all optimisation tasks, never assume anything is better/worse and measure, measure, measure.
If you could lock down your architecture to x86 with at least SSE2 you would do:
#include <emmintrin.h> // SSE2 intrinsics

void bitwise_and(
    char* result,
    const char* lhs,
    const char* rhs,
    size_t length)
{
    while (length >= 16)
    {
        // load into 16-byte registers
        auto lhsReg = _mm_loadu_si128((const __m128i*)lhs);
        auto rhsReg = _mm_loadu_si128((const __m128i*)rhs);
        // do the op
        auto res = _mm_and_si128(lhsReg, rhsReg);
        // save off again
        _mm_storeu_si128((__m128i*)result, res);
        // book keeping
        length -= 16;
        result += 16;
        lhs += 16;
        rhs += 16;
    }
    // Do the tail end. Assuming that the array is large, the most
    // that the following code can run is 15 times, so I'm not
    // bothering to optimise. You could do it in 64-bit then 32-bit
    // then 16-bit then char chunks if you wanted...
    while (length)
    {
        *result = *lhs & *rhs;
        length -= 1;
        result += 1;
        lhs += 1;
        rhs += 1;
    }
}
This compiles to ~10asm instructions per 16 bytes (+ change for the leftover and a little overhead).
The great thing about doing intrinsics like this (over hand-rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) on top of what you write. It also handles register allocation.
If you can guarantee aligned data you can save an asm instruction (use _mm_load_si128 instead, and the compiler will be clever enough to avoid a second load and use it as a direct memory operand to the 'pand').
If you can guarantee AVX2+ then you can use the 256-bit version and handle 32 bytes per ~10 asm instructions.
On ARM there are similar NEON instructions.
If you wanted to do multiple ops just add the relevant intrinsic in the middle and it'll add 1 asm instruction per 16 bytes.
I'm pretty sure with a decent processor you don't need any additional cache control.
Don't do it this way. The individual operations will look great, sleek asm, nice performance .. but a composition of them will be terrible. You cannot make this abstraction, nice as it looks. The arithmetic intensity of those kernels is almost the worst possible (the only worse one is doing no arithmetic, such as a straight up copy), and composing them at a high level will retain that awful property. In a sequence of operations each using the result of the previous one, the results are written and read again a lot later (in the next kernel), even though the high level flow could be transposed so that the result the "next operation" needs is right there in a register. Also, if the same argument appears twice in an expression tree (and not both as operands to one operation), they will be streamed in twice, instead of reusing the data for two operations.
It doesn't have that nice warm fuzzy feeling of "look at all this lovely abstraction" about it, but what you should do is find out at a high level how you're combining your vectors, and then try to chop that in pieces that make sense from a performance perspective. In some cases that may mean making big ugly messy loops that will make people get an extra coffee before diving in, that's just too bad then. If you want performance, you often have to sacrifice something else. Usually it's not so bad, it probably just means you have a loop that has an expression consisting of intrinsics in it, instead of an expression of vector-operations that each individually have a loop.

Code reordering due to optimization

I've heard so many times that an optimizer may reorder your code that I'm starting to believe it.
Are there any examples or typical cases where this might happen, and how can I avoid such a thing (e.g. I want a benchmark to be impervious to this)?
There are LOTS of different kinds of "code-motion" (moving code around), and it's caused by lots of different parts of the optimisation process:
Move these instructions around, because it's a waste of time to wait for the memory read to complete without putting at least one or two instructions between the memory read and the operation using the content we got from memory.
Move things out of loops, because they only need to happen once (if you call x = sin(y) once or 1000 times without changing y, x will have the same value, so there is no point in doing that inside a loop - so the compiler moves it out).
Move code around based on "compiler expects this code to hit more often than the other bit, so better cache-hit ratio if we do it this way" - for example error handling being moved away from the source of the error, because it's unlikely that you get an error [compilers often understand commonly used functions and that they typically result in success].
Inlining - code is moved from the actual function into the calling function. This often leads to OTHER effects such as reduction in pushing/poping registers from the stack and arguments can be kept where they are, rather than having to move them to the "right place for arguments".
I'm sure I've missed some cases in the above list, but this is certainly some of the most common.
The compiler is perfectly within its rights to do this, as long as it doesn't have any "observable difference" (other than the time it takes to run and the number of instructions used - those "don't count" in observable differences when it comes to compilers)
There is very little you can do to prevent the compiler from reordering your code - you can write code that ensures the order to some degree. So for example, we can have code like this:
{
    int sum = 0;
    for(i = 0; i < large_number; i++)
        sum += i;
}
Now, since sum isn't being used, the compiler can remove it. Adding some code that prints the sum would ensure that it's "used" according to the compiler.
Likewise:
for(i = 0; i < large_number; i++)
{
    do_stuff();
}
if the compiler can figure out that do_stuff doesn't actually change any global value, or similar, it will move code around to form this:
do_stuff();
for(i = 0; i < large_number; i++)
{
}
The compiler may also remove - in fact almost certainly will - the, now, empty loop so that it doesn't exist at all. [As mentioned in the comments: If do_stuff doesn't actually change anything outside itself, it may also be removed, but the example I had in mind is where do_stuff produces a result, but the result is the same each time]
(The above happens if you remove the printout of results in the Dhrystone benchmark for example, since some of the loops calculate values that are never used other than in the printout - this can lead to benchmark results that exceed the highest theoretical throughput of the processor by a factor of 10 or so - because the benchmark assumes the instructions necessary for the loop were actually there, and says it took X nominal operations to execute each iteration)
There is no easy way to ensure this doesn't happen, aside from ensuring that do_stuff either updates some variable outside the function, or returns a value that is "used" (e.g. summing up, or something).
Another example of removing/omitting code is where you store values repeatedly to the same variable multiple times:
int x;
for(i = 0; i < large_number; i++)
    x = i * i;
can be replaced with:
x = (large_number-1) * (large_number-1);
Sometimes, you can use volatile to ensure that something REALLY happens, but in a benchmark that CAN be detrimental, since the compiler then also can't optimise code that it SHOULD optimise (if you are not careful with how you use volatile).
If you have some SPECIFIC code that you care particularly about, it would be best to post it (and compile it with several state of the art compilers, and see what they actually do with it).
[Note that moving code around is definitely not a BAD thing in general - I do want my compiler (whether it is the one I'm writing myself, or one that I'm using that was written by someone else) to make optimisation by moving code, because, as long as it does so correctly, it will produce faster/better code by doing so!]
Most of the time, reordering is only allowed in situations where the observable effects of the program are the same - this means you shouldn't be able to tell.
Counterexamples do exist; for example, the order of evaluation of operands is unspecified and an optimizer is free to rearrange things. You can't predict the order of these two function calls, for example:
int a = foo() + bar();
Read up on sequence points to see what guarantees are made.

Programming Myth? Pre vs. post increment and decrement speed [duplicate]

Is there a performance difference between i++ and ++i if the resulting value is not used?
Executive summary: No.
i++ could potentially be slower than ++i, since the old value of i
might need to be saved for later use, but in practice all modern
compilers will optimize this away.
We can demonstrate this by looking at the code for this function,
both with ++i and i++.
$ cat i++.c
extern void g(int i);
void f()
{
    int i;
    for (i = 0; i < 100; i++)
        g(i);
}
The files are the same, except for ++i and i++:
$ diff i++.c ++i.c
6c6
< for (i = 0; i < 100; i++)
---
> for (i = 0; i < 100; ++i)
We'll compile them, and also get the generated assembler:
$ gcc -c i++.c ++i.c
$ gcc -S i++.c ++i.c
And we can see that both the generated object and assembler files are the same.
$ md5 i++.s ++i.s
MD5 (i++.s) = 90f620dda862cd0205cd5db1f2c8c06e
MD5 (++i.s) = 90f620dda862cd0205cd5db1f2c8c06e
$ md5 *.o
MD5 (++i.o) = dd3ef1408d3a9e4287facccec53f7d22
MD5 (i++.o) = dd3ef1408d3a9e4287facccec53f7d22
From Efficiency versus intent by Andrew Koenig :
First, it is far from obvious that ++i is more efficient than i++, at least where integer variables are concerned.
And :
So the question one should be asking is not which of these two operations is faster, it is which of these two operations expresses more accurately what you are trying to accomplish. I submit that if you are not using the value of the expression, there is never a reason to use i++ instead of ++i, because there is never a reason to copy the value of a variable, increment the variable, and then throw the copy away.
So, if the resulting value is not used, I would use ++i. But not because it is more efficient: because it correctly states my intent.
A better answer is that ++i will sometimes be faster but never slower.
Everyone seems to be assuming that i is a regular built-in type such as int. In this case there will be no measurable difference.
However if i is complex type then you may well find a measurable difference. For i++ you must make a copy of your class before incrementing it. Depending on what's involved in a copy it could indeed be slower since with ++i you can just return the final value.
Foo Foo::operator++(int) // postfix
{
    Foo oldFoo = *this; // copy existing value - could be slow
    // yadda yadda, do increment
    return oldFoo;
}
Another difference is that with ++i you have the option of returning a reference instead of a value. Again, depending on what's involved in making a copy of your object this could be slower.
A real-world example of where this can occur would be the use of iterators. Copying an iterator is unlikely to be a bottle-neck in your application, but it's still good practice to get into the habit of using ++i instead of i++ where the outcome is not affected.
Short answer:
There is never any difference between i++ and ++i in terms of speed. A good compiler should not generate different code in the two cases.
Long answer:
What every other answer fails to mention is that the difference between ++i and i++ only makes sense within the expression in which it is found.
In the case of for(i=0; i<n; i++), the i++ is alone in its own expression: there is a sequence point before the i++ and there is one after it. Thus the only machine code generated is "increase i by 1" and it is well-defined how this is sequenced in relation to the rest of the program. So if you would change it to prefix ++, it wouldn't matter in the slightest, you would still just get the machine code "increase i by 1".
The differences between ++i and i++ only matters in expressions such as array[i++] = x; versus array[++i] = x;. Some may argue and say that the postfix will be slower in such operations because the register where i resides has to be reloaded later. But then note that the compiler is free to order your instructions in any way it pleases, as long as it doesn't "break the behavior of the abstract machine" as the C standard calls it.
So while you may assume that array[i++] = x; gets translated to machine code as:
Store value of i in register A.
Store address of array in register B.
Add A and B, store results in A.
At this new address represented by A, store the value of x.
Store value of i in register A // inefficient because extra instruction here, we already did this once.
Increment register A.
Store register A in i.
the compiler might as well produce the code more efficiently, such as:
Store value of i in register A.
Store address of array in register B.
Add A and B, store results in B.
Increment register A.
Store register A in i.
... // rest of the code.
Just because you as a C programmer is trained to think that the postfix ++ happens at the end, the machine code doesn't have to be ordered in that way.
So there is no difference between prefix and postfix ++ in C. Now what you as a C programmer should be wary of is people who inconsistently use prefix in some cases and postfix in others, without any rationale why. This suggests that they are uncertain about how C works or that they have incorrect knowledge of the language. This is always a bad sign; it in turn suggests that they are making other questionable decisions in their program, based on superstition or "religious dogmas".
"Prefix ++ is always faster" is indeed one such false dogma that is common among would-be C programmers.
Taking a leaf from Scott Meyers, More Effective C++, Item 6: Distinguish between prefix and postfix forms of increment and decrement operations.
The prefix version is always preferred over the postfix in regards to objects, especially in regards to iterators.
The reason for this if you look at the call pattern of the operators.
// Prefix
Integer& Integer::operator++()
{
    *this += 1;
    return *this;
}

// Postfix
const Integer Integer::operator++(int)
{
    Integer oldValue = *this;
    ++(*this);
    return oldValue;
}
Looking at this example it is easy to see how the prefix operator will always be more efficient than the postfix, because of the postfix's need for a temporary object.
This is why when you see examples using iterators they always use the prefix version.
But as you point out for int's there is effectively no difference because of compiler optimisation that can take place.
Here's an additional observation if you're worried about micro-optimisation. Decrementing loops can 'possibly' be more efficient than incrementing loops (depending on the instruction set architecture, e.g. ARM). Given:
for (i = 0; i < 100; i++)
On each loop you will have one instruction each for:
Adding 1 to i.
Comparing whether i is less than 100.
A conditional branch if i is less than 100.
Whereas a decrementing loop:
for (i = 100; i != 0; i--)
The loop will have an instruction for each of:
Decrement i, setting the CPU register status flag.
A conditional branch depending on CPU register status (Z==0).
Of course this works only when decrementing to zero!
Remembered from the ARM System Developer's Guide.
First of all: the difference between i++ and ++i is negligible in C.
To the details.
1. The well known C++ issue: ++i is faster
In C++, ++i is more efficient iff i is some kind of an object with an overloaded increment operator.
Why?
In ++i, the object is first incremented, and can subsequently be passed as a const reference to any other function. This is not possible if the expression is foo(i++), because now the increment needs to be done before foo() is called, but the old value needs to be passed to foo(). Consequently, the compiler is forced to make a copy of i before it executes the increment operator on the original. The additional constructor/destructor calls are the bad part.
As noted above, this does not apply to fundamental types.
2. The little known fact: i++ may be faster
If no constructor/destructor needs to be called, which is always the case in C, ++i and i++ should be equally fast, right? No. They are virtually equally fast, but there may be small differences, which most other answerers got the wrong way around.
How can i++ be faster?
The point is data dependencies. If the value needs to be loaded from memory, two subsequent operations need to be done with it: incrementing it, and using it. With ++i, the increment needs to be done before the value can be used. With i++, the use does not depend on the increment, and the CPU may perform the use in parallel with the increment operation. The difference is at most one CPU cycle, so it is really negligible, but it is there. And it is the other way round than many would expect.
Please don't let the question of "which one is faster" be the deciding factor of which to use. Chances are you're never going to care that much, and besides, programmer reading time is far more expensive than machine time.
Use whichever makes most sense to the human reading the code.
#Mark
Even though the compiler is allowed to optimize away the (stack-based) temporary copy of the variable, and gcc (in recent versions) does so,
that doesn't mean all compilers will always do so.
I just tested it with the compilers we use in our current project, and 3 out of 4 do not optimize it.
Never assume the compiler gets it right, especially if the possibly faster, but never slower, code is as easy to read.
If you don't have a really stupid implementation of one of the operators in your code:
always prefer ++i over i++.
I have been reading through most of the answers here and many of the comments, and I didn't see any reference to the one instance that I could think of where i++ is more efficient than ++i (and perhaps surprisingly --i was more efficient than i--). That is for C compilers for the DEC PDP-11!
The PDP-11 had assembly instructions for pre-decrement of a register and post-increment, but not the other way around. The instructions allowed any "general-purpose" register to be used as a stack pointer. So if you used something like *(i++) it could be compiled into a single assembly instruction, while *(++i) could not.
This is obviously a very esoteric example, but it does provide the exception where post-increment is more efficient (or I should say was, since there isn't much demand for PDP-11 C code these days).
In C, the compiler can generally optimize them to be the same if the result is unused.
However, in C++ if using other types that provide their own ++ operators, the prefix version is likely to be faster than the postfix version. So, if you don't need the postfix semantics, it is better to use the prefix operator.
I can think of a situation where postfix is slower than prefix increment:
Imagine a processor where register A is used as the accumulator and is the only register usable in many instructions (some small microcontrollers are actually like this).
Now imagine the following program and their translation into a hypothetical assembly:
Prefix increment:
a = ++b + c;
; increment b
LD A, [&b]
INC A
ST A, [&b]
; add with c
ADD A, [&c]
; store in a
ST A, [&a]
Postfix increment:
a = b++ + c;
; load b
LD A, [&b]
; add with c
ADD A, [&c]
; store in a
ST A, [&a]
; increment b
LD A, [&b]
INC A
ST A, [&b]
Note how the value of b was forced to be reloaded. With prefix increment, the compiler can just increment the value and go ahead with using it, possibly avoid reloading it since the desired value is already in the register after the increment. However, with postfix increment, the compiler has to deal with two values, one the old and one the incremented value which as I show above results in one more memory access.
Of course, if the value of the increment is not used, such as a single i++; statement, the compiler can (and does) simply generate an increment instruction regardless of postfix or prefix usage.
As a side note, I'd like to mention that an expression in which there is a b++ cannot simply be converted to one with ++b without any additional effort (for example by adding a - 1). So comparing the two if they are part of some expression is not really valid. Often, where you use b++ inside an expression you cannot use ++b, so even if ++b were potentially more efficient, it would simply be wrong. Exception is of course if the expression is begging for it (for example a = b++ + 1; which can be changed to a = ++b;).
I always prefer pre-increment, however ...
I wanted to point out that even in the case of calling the operator++ function, the compiler will be able to optimize away the temporary if the function gets inlined. Since the operator++ is usually short and often implemented in the header, it is likely to get inlined.
So, for practical purposes, there likely isn't much of a difference between the performance of the two forms. However, I always prefer pre-increment, since it seems better to directly express what I'm trying to say, rather than relying on the optimizer to figure it out.
Also, giving the optimizer less to do likely means the compiler runs faster.
My C is a little rusty, so I apologize in advance. Speedwise, I can understand the results. But, I am confused as to how both files came out to the same MD5 hash. Maybe a for loop runs the same, but wouldn't the following 2 lines of code generate different assembly?
myArray[i++] = "hello";
vs
myArray[++i] = "hello";
The first one writes the value to the array, then increments i. The second increments i then writes to the array. I'm no assembly expert, but I just don't see how the same executable would be generated by these 2 different lines of code.
Just my two cents.

Can gcc/g++ tell me when it ignores my register?

When compiling C/C++ codes using gcc/g++, if it ignores my register, can it tell me?
For example, in this code
int main()
{
    register int j;
    int k;
    for(k = 0; k < 1000; k++)
        for(j = 0; j < 32000; j++)
            ;
    return 0;
}
j will be used as register, but in this code
int main()
{
    register int j;
    int k;
    for(k = 0; k < 1000; k++)
        for(j = 0; j < 32000; j++)
            ;
    int * a = &j;
    return 0;
}
j will be a normal variable.
Can it tell me whether a variable I used register is really stored in a CPU register?
You can fairly safely assume that GCC ignores the register keyword, except perhaps at -O0. However, it shouldn't make a difference one way or another, and if you are at that level of detail, you should already be reading the assembly code.
Here is an informative thread on this topic: http://gcc.gnu.org/ml/gcc/2010-05/msg00098.html . Back in the old days, register indeed helped compilers to allocate a variable into registers, but today register allocation can be accomplished optimally, automatically, without hints. The keyword does continue to serve two purposes in C:
In C, it prevents you from taking the address of a variable. Since registers don't have addresses, this restriction can help a simple C compiler. (Simple C++ compilers don't exist.)
A register object cannot be declared restrict. Because restrict pertains to addresses, their intersection is pointless. (C++ does not yet have restrict, and anyway, this rule is a bit trivial.)
For C++, the keyword has been deprecated since C++11 and proposed for removal from the standard revision scheduled for 2017.
Some compilers have used register on parameter declarations to determine the calling convention of functions, with the ABI allowing mixed stack- and register-based parameters. This seems to be nonconforming, it tends to occur with extended syntax like register("A1"), and I don't know whether any such compiler is still in use.
With respect to modern compilation and optimization techniques, the register annotation does not make any sense at all. In your second program you take the address of j, and registers do not have addresses, but the same local or static variable could perfectly well be stored in two different memory locations during its lifetime, or sometimes in memory and sometimes in a register, or not exist at all. Indeed, an optimizing compiler would compile your nested loops as nothing, because they do not have any effects, and simply assign their final values to k and j. And then omit even those assignments, because the remaining code does not use the values.
You can't get the address of a register variable in C, plus the compiler can totally ignore you; C99 standard, section 6.7.1 (pdf):

The implementation may treat any register declaration simply as an auto declaration. However, whether or not addressable storage is actually used, the address of any part of an object declared with storage-class specifier register cannot be computed, either explicitly (by use of the unary & operator as discussed in 6.5.3.2) or implicitly (by converting an array name to a pointer as discussed in 6.3.2.1). Thus, the only operator that can be applied to an array declared with storage-class specifier register is sizeof.
Unless you're mucking around on 8-bit AVRs or PICs, the compiler will probably laugh at you thinking you know best and ignore your pleas. Even on them, I've thought I knew better a couple times and found ways to trick the compiler (with some inline asm), but my code exploded because it had to massage a bunch of other data to work around my stubbornness.
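You can see the constraint from 6.7.1 in action with a quick compile test. A sketch, assuming gcc is on your PATH (the temp-file path is arbitrary):

```shell
# Write a C file that takes the address of a 'register' variable,
# then check whether gcc accepts it. It should not: this is a
# constraint violation in C, so gcc must issue a diagnostic.
cat > /tmp/reg_demo.c <<'EOF'
int main(void) {
    register int j = 0;
    int *p = &j;   /* error: address of register variable requested */
    return *p;
}
EOF
if gcc -c /tmp/reg_demo.c -o /tmp/reg_demo.o 2>/dev/null; then
    result=compiled
else
    result=rejected
fi
echo "$result"
```

In C++ the same program compiles (the keyword is simply ignored there), which is exactly the difference discussed below.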
This question, some of the answers, and several other discussions of the 'register' keyword I've seen seem to assume implicitly that all locals are mapped either to a specific register or to a specific memory location on the stack. This was generally true until 15-25 years ago, and it's true if you turn off optimizing, but it's not true at all when standard optimizing is performed. Locals are now seen by optimizers as symbolic names that you use to describe the flow of data, rather than as values that need to be stored in specific locations.
Note: by 'locals' here I mean: scalar variables, of storage class auto (or 'register'), which are never used as the operand of '&'. Compilers can sometimes break up auto structs, unions or arrays into individual 'local' variables, too.
To illustrate this: suppose I write this at the top of a function:
int factor = 8;
.. and then the only use of the factor variable is to multiply by various things:
arr[i + factor*j] = arr[i - factor*k];
In this case - try it if you want - there will be no factor variable. The code analysis will show that factor is always 8, and so all the multiplies will turn into <<3 shifts. If you did the same thing in 1985 C, factor would get a location on the stack, and there would be multiply instructions, since the compilers basically worked one statement at a time and didn't remember anything about the values of the variables. Back then programmers would be more likely to use #define factor 8 to get better code in this situation, while keeping factor adjustable.
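Written out as compilable code, the factor example looks like this (the function names are hypothetical; the second function shows what the optimizer effectively produces from the first):

```c
/* 'factor' is provably constant: assigned once, never modified,
 * never has its address taken. The optimizer propagates the 8
 * and strength-reduces the multiply to a shift. */
int scaled_index(int i, int j) {
    int factor = 8;
    return i + factor * j;
}

/* What the generated code effectively does instead. */
int scaled_index_shift(int i, int j) {
    return i + (j << 3);
}
```

Compile the first function at -O2 and inspect the assembly: there is no stack slot for factor and no multiply instruction.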
If you use -O0 (optimization off) - you will indeed get a variable for factor. This will allow you, for instance, to step over the factor=8 statement, and then change factor to 11 with the debugger, and keep going. In order for this to work, the compiler can't keep anything in registers between statements, except for variables which are assigned to specific registers; and in that case the debugger is informed of this. And it can't try to 'know' anything about the values of variables, since the debugger could change them. In other words, you need the 1985 situation if you want to change local variables while debugging.
Modern compilers generally compile a function as follows:
(1) when a local variable is assigned to more than once in a function, the compiler creates different 'versions' of the variable so that each one is only assigned in one place. All of the 'reads' of the variable refer to a specific version.
(2) Each of these locals is assigned to a 'virtual' register. Intermediate calculation results are also assigned variables/registers; so
a = b*c + 2*k;
becomes something like
t1 = b*c;
t2 = 2;
t3 = k*t2;
a = t1 + t3;
(3) The compiler then takes all these operations, and looks for common subexpressions, etc. Since each of the new registers is only ever written once, it is rather easier to rearrange them while maintaining correctness. I won't even start on loop analysis.
(4) The compiler then tries to map all these virtual registers onto actual registers in order to generate code. Since each virtual register has a limited lifetime, it is possible to reuse actual registers heavily - 't1' in the above is only needed until the add which generates 'a', so it could be held in the same register as 'a'. When there are not enough registers, some of the virtual registers can be allocated to memory -- or -- a value can be held in a certain register, stored to memory for a while, and loaded back into a (possibly) different register later. On a load-store machine, where only values in registers can be used in computations, this second strategy accommodates that nicely.
From the above, this should be clear: it's easy to determine that the virtual register mapped to factor is the same as the constant '8', and so all multiplications by factor are multiplications by 8. Even if factor is modified later, that's a 'new' variable and it doesn't affect prior uses of factor.
Another implication, if you write
vara = varb;
.. it may or may not be the case that there is a corresponding copy in the code. For instance
int *resultp = ...
int acc = arr[0] + arr[1];
int acc0 = acc;                              // save this for later
int more = func(resultp,3) + func(resultp,-3);
acc += more;                                 // add some more stuff
if( ...){
    resultp = getptr();
    resultp[0] = acc0;
    resultp[1] = acc;
}
In the above, the two 'versions' of acc (initial, and after adding 'more') could be in two different registers, and 'acc0' would then be the same as the initial 'acc'. So no register copy would be needed for 'acc0 = acc'.
Another point: the 'resultp' is assigned to twice, and since the second assignment ignores the previous value, there are essentially two distinct 'resultp' variables in the code, and this is easily determined by analysis.
An implication of all this: don't be hesitant to break out complex expressions into smaller ones using additional locals for intermediates, if it makes the code easier to follow. There is basically zero run-time penalty for this, since the optimizer sees the same thing anyway.
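As a sketch of that claim, the two forms below compute the same value, and an SSA-based optimizer sees the same dataflow for both (the function names are invented for illustration):

```c
/* One dense expression... */
int dense(int b, int c, int k) {
    return b * c + 2 * k;
}

/* ...versus the same computation broken into named intermediates.
 * These locals become the same virtual registers ('t1', 't3' in the
 * expansion above) that the compiler creates anyway, so there is no
 * run-time cost to naming them. */
int with_locals(int b, int c, int k) {
    int prod = b * c;
    int twice_k = 2 * k;
    return prod + twice_k;
}
```

At any optimization level above -O0, both functions typically compile to identical code.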
If you're interested in learning more you could start here: http://en.wikipedia.org/wiki/Static_single_assignment_form
The point of this answer is to (a) give some idea of how modern compilers work and (b) point out that asking the compiler, if it would be so kind, to put a particular local variable into a register -- doesn't really make sense. Each 'variable' may be seen by the optimizer as several variables, some of which may be heavily used in loops, and others not. Some variables will vanish -- e.g. by being constant; or, sometimes, the temp variable used in a swap. Or calculations not actually used. The compiler is equipped to use the same register for different things in different parts of the code, according to what's actually best on the machine you are compiling for.
The notion of hinting the compiler as to which variables should be in registers assumes that each local variable maps to a register or to a memory location. This was true back when Kernighan + Ritchie designed the C language, but is not true any more.
Regarding the restriction that you can't take the address of a register variable: Clearly, there's no way to implement taking the address of a variable held in a register, but you might ask - since the compiler has discretion to ignore the 'register' - why is this rule in place? Why can't the compiler just ignore the 'register' if I happen to take the address? (as is the case in C++).
Again, you have to go back to the old compiler. The original K+R compiler would parse a local variable declaration, and then immediately decide whether to assign it to a register or not (and if so, which register). Then it would proceed to compile expressions, emitting the assembler for each statement one at a time. If it later found that you were taking the address of a 'register' variable, which had been assigned to a register, there was no way to handle that, since the assignment was, in general, irreversible by then. It was possible, however, to generate an error message and stop compiling.
Bottom line, it appears that 'register' is essentially obsolete:
C++ compilers ignore it completely
C compilers ignore it except to enforce the restriction about & - and possibly don't ignore it at -O0 where it could actually result in allocation as requested. At -O0 you aren't concerned about code speed though.
So, it's basically there now for backward compatibility, and probably on the basis that some implementations could still be using it for 'hints'. I never use it -- and I write real-time DSP code, and spend a fair bit of time looking at generated code and finding ways to make it faster. There are many ways to modify code to make it run faster, and knowing how compilers work is very helpful. It's been a long time indeed since I last found that adding 'register' to be among those ways.
Addendum
I excluded above, from my special definition of 'locals', variables to which & is applied (these are of course included in the usual sense of the term).
Consider the code below:
void
somefunc()
{
    int h, w;
    extern int pitch;

    get_hw( &h, &w );   // get shape of array
    for( int i = 0; i < h; i++ ){
        for( int j = 0; j < w; j++ ){
            Arr[i*pitch + j] = generate_func(i,j);
        }
    }
}
This may look perfectly harmless. But if you are concerned about execution speed, consider this: The compiler is passing the addresses of h and w to get_hw, and then later calling generate_func. Let's assume the compiler knows nothing about what's in those functions (which is the general case). The compiler must assume that the call to generate_func could be changing h or w. That's a perfectly legal use of the pointer passed to get_hw - you could store it somewhere and then use it later, as long as the scope containing h,w is still in play, to read or write those variables.
Thus the compiler must store h and w in memory on the stack, and can't determine anything in advance about how long the loop will run. So certain optimizations will be impossible, and the loop could be less efficient as a result (in this example, there's a function call in the inner loop anyway, so it may not make much of a difference, but consider the case where there's a function which is occasionally called in the inner loop, depending on some condition).
Another issue here is that generate_func could change pitch, and so i*pitch needs to done each time, rather than only when i changes.
It can be recoded as:
void
somefunc()
{
    int h0, w0;
    int h, w;
    extern int pitch;
    int apit = pitch;

    get_hw( &h0, &w0 ); // get shape of array
    h = h0;
    w = w0;
    for( int i = 0; i < h; i++ ){
        for( int j = 0; j < w; j++ ){
            Arr[i*apit + j] = generate_func(i,j);
        }
    }
}
Now the variables apit,h,w are all 'safe' locals in the sense I defined above, and the compiler can be sure they won't be changed by any function calls. Assuming I'm not modifying anything in generate_func, the code will have the same effect as before but could be more efficient.
Jens Gustedt has suggested the use of the 'register' keyword as a way of tagging key variables to inhibit the use of & on them, e.g. by others maintaining the code (It won't affect the generated code, since the compiler can determine the lack of & without it). For my part, I always think carefully before applying & to any local scalar in a time-critical area of the code, and in my view using 'register' to enforce this is a little cryptic, but I can see the point (unfortunately it doesn't work in C++ since the compiler will just ignore the 'register').
Incidentally, in terms of code efficiency, the best way to have a function return two values is with a struct:
struct hw {     // this is what get_hw returns
    int h, w;
};

void
somefunc()
{
    int h, w;
    int i, j;
    struct hw hwval = get_hw();     // get shape of array
    h = hwval.h;
    w = hwval.w;
    ...
This may look cumbersome (and is cumbersome to write), but it will generate cleaner code than the previous examples. The 'struct hw' will actually be returned in two registers (on most modern ABIs anyway). And due to the way the 'hwval' struct is used, the optimizer will effectively break it up into two 'locals' hwval.h and hwval.w, and then determine that these are equivalent to h and w -- so hwval will essentially disappear in the code. No pointers need to be passed, no function is modifying another function's variables via pointer; it's just like having two distinct scalar return values. This is much easier to do now in C++11 - with std::tie and std::tuple, you can use this method with less verbosity (and without having to write a struct definition).
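For completeness, a minimal self-contained sketch of the struct-return pattern (the dimensions and the body of get_hw are invented for illustration):

```c
/* A small two-int struct fits the register-return rules of most
 * modern ABIs, so no pointer is passed and no caller variable is
 * written through a pointer. */
struct hw { int h, w; };

struct hw get_hw(void) {
    struct hw r = { 480, 640 };   /* pretend we queried the array shape */
    return r;                     /* returned in registers on most ABIs */
}
```

The caller's copies (h = hwval.h; w = hwval.w;) are then 'safe' locals in the sense defined earlier: nothing can alias them.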
Your second example is invalid in C. So you can see that the register keyword does change something (in C). It is there for exactly this purpose: to inhibit taking the address of a variable. So don't take its name "register" literally; it is a misnomer, but stick to its definition.
That C++ seems to ignore register, well, they must have their reasons, but I find it kind of sad to again find one of these subtle differences where valid code in one language is invalid in the other.