Can a C compiler rearrange stack variables? - c++

I have worked on projects for embedded systems in the past where we have rearranged the order of declaration of stack variables to decrease the size of the resulting executable. For instance, if we had:
void func()
{
char c;
int i;
short s;
...
}
We would reorder this to be:
void func()
{
int i;
short s;
char c;
...
}
Because of alignment issues the first one resulted in 12 bytes of stack space being used and the second one resulted in only 8 bytes.
Is this standard behavior for C compilers or just a shortcoming of the compiler we were using?
It seems to me that a compiler should be able to reorder stack variables to favor smaller executable size if it wanted to. It has been suggested to me that some aspect of the C standard prevents this, but I haven't been able to find a reputable source either way.
As a bonus question, does this also apply to C++ compilers?
Edit
If the answer is yes, C/C++ compilers can rearrange stack variables, can you give an example of a compiler that definitely does this? I'd like to see compiler documentation or something similar that backs this up.
Edit Again
Thanks everybody for your help. For documentation, the best thing I've been able to find is the paper Optimal Stack Slot Assignment in GCC (pdf), by Naveen Sharma and Sanjiv Kumar Gupta, which was presented at the GCC Summit in 2003.
The project in question here was using the ADS compiler for ARM development. It is mentioned in the documentation for that compiler that ordering declarations like I've shown can improve performance, as well as stack size, because of how the ARM-Thumb architecture calculates addresses in the local stack frame. That compiler didn't automatically rearrange locals to take advantage of this. The paper linked here says that as of 2003 GCC also didn't rearrange the stack frame to improve locality of reference for ARM-Thumb processors, but it implies that you could.
I can't find anything that definitely says this was ever implemented in GCC, but I think this paper counts as proof that you're all correct. Thanks again.

Not only can the compiler reorder the stack layout of the local variables, it can assign them to registers, assign them to live sometimes in registers and sometimes on the stack, it can assign two locals to the same slot in memory (if their live ranges do not overlap) and it can even completely eliminate variables.

As there is nothing in the standard prohibiting that for C or C++ compilers, yes, the compiler can do that.
It is different for aggregates (i.e. structs), where the relative order must be maintained, but still the compiler may insert pad bytes to achieve preferable alignment.
IIRC newer MSVC compilers use that freedom in their fight against buffer overflows of locals.
As a side note, in C++, the order of destruction must be reverse order of declaration, even if the compiler reorders the memory layout.
(I can't quote chapter and verse, though, this is from memory.)
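Because struct members must keep their declared order, a struct is a convenient way to reproduce the alignment arithmetic from the question. The following is a minimal sketch: the struct and function names are made up for illustration, and the 12-versus-8-byte figures assume a typical ABI with a 4-byte int and a 2-byte short.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical structs mirroring the two declaration orders in the question. */
struct declared_order {
    char  c;   /* typically followed by 3 padding bytes so that i is aligned */
    int   i;
    short s;   /* typically followed by 2 bytes of tail padding */
};

struct size_order {
    int   i;
    short s;
    char  c;   /* typically followed by 1 byte of tail padding */
};

/* Prints the sizes and member offsets chosen by the current compiler. */
void print_struct_layouts(void)
{
    printf("declared_order: size %zu, offsets c=%zu i=%zu s=%zu\n",
           sizeof(struct declared_order),
           offsetof(struct declared_order, c),
           offsetof(struct declared_order, i),
           offsetof(struct declared_order, s));
    printf("size_order:     size %zu, offsets i=%zu s=%zu c=%zu\n",
           sizeof(struct size_order),
           offsetof(struct size_order, i),
           offsetof(struct size_order, s),
           offsetof(struct size_order, c));
}
```

For locals the compiler is free to pick the second layout on its own; for struct members it is not, which is why reordering by hand still matters there.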

The compiler is even free to remove the variable from the stack and make it register only if analysis shows that the address of the variable is never taken/used.

The stack need not even exist (in fact, the C99 standard does not have a single occurrence of the word "stack"). So yes, the compiler is free to do whatever it wants as long as that preserves the semantics of variables with automatic storage duration.
As for an example: I encountered many times a situation where I could not display a local variable in the debugger because it was stored in a register.

The compiler for the Texas Instruments 62xx series of DSPs is capable of, and does, "whole program optimization" (you can turn it off).
This is where your code gets rearranged, not just the locals. So order of execution ends up being not quite what you might expect.
C and C++ don't actually promise a memory model (in the sense of say the JVM), so things can be quite different and still legal.
For those who don't know them, the 62xx family are 8-instruction-per-clock-cycle DSPs; at 750 MHz, they peak at 6e+9 instructions per second. Some of the time anyway. They do parallel execution, but instruction ordering is done in the compiler, not in the CPU as on an Intel x86.
PICs and Rabbit embedded boards don't have stacks unless you ask especially nicely.

A compiler might not even be using a stack at all for data. If you're on a platform so tiny that you're worrying about 8 vs 12 bytes of stack, then it's likely that there will be compilers which have pretty specialised approaches. (Some PIC and 8051 compilers come to mind)
What processor are you compiling for?

This is compiler-specific; one could write a compiler that does the inverse if one wanted it that way.

A decent compiler will put local variables in registers if it can. Variables should only be placed on the stack if there is excessive register pressure (not enough room) or if the variable's address is taken, meaning it needs to live in memory.
As far as I know, there is nothing that says variables need to be placed at any specific location or alignment on the stack for C/C++; the compiler will put them wherever is best for performance and/or whatever is convenient for compiler writers.
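One way to see what a particular compiler does is to record the addresses of a few locals. This is only a sketch: the helper name is hypothetical, taking the addresses forces the variables into memory (they could otherwise live in registers), and the relative order of the three addresses is whatever the compiler happens to choose.

```c
#include <stdint.h>

/* Hypothetical helper: stores the addresses of three locals, as integers,
   in out[0..2]. The order and spacing of the addresses is unspecified. */
void local_addresses(uintptr_t out[3])
{
    char  c = 0;
    int   i = 0;
    short s = 0;

    out[0] = (uintptr_t)&c;
    out[1] = (uintptr_t)&i;
    out[2] = (uintptr_t)&s;
}
```

Printing the three values at different optimization levels, or with different compilers, is a quick way to check whether declaration order is preserved. The only things that can be relied on are that the three objects are distinct and suitably aligned.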

AFAIK there is nothing in the definition of C or C++ specifying how the compiler should order local variables on the stack. I would say that relying on what the compiler may do in this case is a bad idea, because the next version of your compiler may do it differently. If you spend time and effort to order your local variables to save a few bytes of stack, those few bytes had better be really critical for the functioning of your system.

There's no need for idle speculation about what the C standard requires or does not require: recent drafts are freely available online from the ANSI/ISO working group.

This does not answer your question but here is my 2 cents about a related issue...
I did not have the problem of stack space optimization but I had the problem of mis-alignment of double variables on the stack. A function may be called from any other function and the stack pointer value may have any un-aligned value. So I have come up with the idea below. This is not the original code, I just wrote it...
#include <stdint.h>
#pragma pack(push, 16)
typedef struct _S_speedy_struct{
double fval[4];
int64_t lval[4];
int32_t ival[8];
}S_speedy_struct;
#pragma pack(pop)
int function(...)
{
int i, rv;
uintptr_t t; // wide enough to hold a pointer, unlike int on 64-bit targets
S_speedy_struct *ptr;
char buff[112]; // sizeof(struct) + alignment
// ugly , I know , but it works...
t = (uintptr_t)buff;
t += 15; // alignment - 1
t &= ~(uintptr_t)15; // round down to a multiple of 16
ptr = (S_speedy_struct *)t;
// speedy code goes on...
}
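With a C11 compiler the manual rounding can often be replaced by an alignment specifier. The sketch below assumes C11 alignas from <stdalign.h> and a target that supports 16-byte alignment for automatic variables; the type and function names are hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>

typedef struct {
    double  fval[4];
    int64_t lval[4];
    int32_t ival[8];
} speedy_t;

/* Returns 1 if the over-aligned local really starts on a 16-byte boundary. */
int aligned_local_is_16_byte_aligned(void)
{
    alignas(16) speedy_t s = {0};  /* the compiler aligns the stack slot */
    s.fval[0] = 1.0;               /* touch the object so it is actually used */
    return ((uintptr_t)&s % 16) == 0;
}
```

This leaves the stack adjustment to the compiler, so no oversized char buffer or pointer cast is needed.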

Related

How does reordering numerical code in order to avoid temporary variables make the code faster?

In my experience (this is a statement, not the question), avoiding non-constant local variables in favor of const variables, or avoiding local variables altogether, enables the C++ compiler to generate faster code.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
Any other explanation? e.g. Compiler giving up on certain optimization levels, as soon as the code gets too complex in order to avoid astronomical compile times?
No, assignments don't force the compiler to insert a sync point. If the variables are local, and don't affect anything visible outside your function, the compiler will remove all unneeded variables, as part of the usual "register allocation" optimization it does.
If your code is so complex it approaches the limit of what the compiler can keep in memory, additional local variables can make the compiler give up and produce unoptimized code. However, this is a very rare edge-case; and it can be triggered on any change in code, not only regarding local variables.
Generally, compiler optimization is hard to reason about, outside of well-known problems (aliasing, loop-carried dependencies, etc). You might feel like you found some related consideration, but it could disappear when you upgrade your compiler or switch to a different one.
Assignments to local variables that you don't subsequently modify allow the compiler to assume that that value in that variable won't change. It might therefore decide (for example) to store it in a register for the 'usage-span' of the variable. This is a simple optimisation, and no self-respecting compiler is going to miss it (unless perhaps register pressure means it is forced to spill).
An example of where this might speed up the code (and maybe reduce code size a little also) is to assign a member variable to a local and then subsequently use that instead of the member variable. If you are confident that the value is not going to change, this might help the compiler generate better code. But then again, it might be a good way of introducing bugs, you do have to be careful playing games like this.
As Thomas Matthews said in the comments, another advantage of doing what you might consider to be a redundant assignment is to help with debugging. It allows the variable to be inspected (and perhaps adjusted) during a debugging run and that can be really handy. I'm not proud, I make mistakes, so I do it a lot.
Just my $0.02
It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to arr[i] might actually load multiple times if the compiler can't prove that no assignment through another pointer to the same type could have modified that array element. float *__restrict arr can help the compiler figure it out, or float ai = arr[i]; can tell the compiler to read it once and keep using the same value, regardless of other stores.
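The effect of reading arr[i] into a local can be made visible with overlapping arguments. This is a minimal sketch with hypothetical function names: the two functions give the same result when out and arr do not overlap, and different results when they do, which is exactly why the compiler may not merge the loads in the first version on its own.

```c
/* Re-reads arr[i] on every iteration. If out and arr overlap, a store
   through out may change arr[i], so the compiler generally has to reload it. */
void add_elem_reload(float *out, const float *arr, int i, int n)
{
    for (int k = 0; k < n; ++k)
        out[k] += arr[i];
}

/* Reads arr[i] once into a local; later stores through out cannot affect ai. */
void add_elem_once(float *out, const float *arr, int i, int n)
{
    const float ai = arr[i];
    for (int k = 0; k < n; ++k)
        out[k] += ai;
}
```

Calling both with the same array as out and arr shows the difference: the first version picks up the updated element after the first iteration, the second keeps using the original value.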
Of course, if optimization is disabled, more statements are typically slower than using fewer large expressions, and store/reload latency is usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But -O0 (no optimization) is supposed to be slow. If you're compiling without at least -O2, preferably -O3 -march=native -ffast-math -flto, that's your problem.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of a * b + c into fma(a,b,c) is only allowed within one expression, if at all.
GCC defaults to -ffp-contract=fast, allowing it across expressions. clang defaults to strict or no, but supports -ffp-contract=fast. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If fast makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
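To make the expression-boundary distinction concrete, here is a minimal sketch with hypothetical function names. Both functions compute the same mathematical value; whether either one is compiled to a fused multiply-add depends on the target and on the -ffp-contract setting discussed above.

```c
/* One expression: the multiply and the add sit in the same expression,
   which is the case where contraction into an FMA may be permitted. */
double mul_add_one_expr(double a, double b, double c)
{
    return a * b + c;
}

/* Two statements: the product is assigned to t first. Under strict rules
   the rounded product in t must be used for the add; a setting such as
   -ffp-contract=fast lets the compiler fuse across the statements anyway. */
double mul_add_two_stmts(double a, double b, double c)
{
    double t = a * b;
    return t + c;
}
```

Comparing the generated assembly for the two functions under different -ffp-contract values is a direct way to check whether this rule explains a slowdown seen with temporaries.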
(Legacy x87 80-bit FP math, or other unusual machines with FLT_EVAL_METHOD!=0 - FP math happens at higher precision, and rounding to float or double costs extra). Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that, -fno-float-store. But -std=c++11 or whatever (instead of -std=gnu++11) will enforce that extra rounding work (a store/reload which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either float or double according to the type of the data, with instructions like mulsd (scalar double) or mulss (scalar single). So it implements FLT_EVAL_METHOD == 0 instead of x87's 2. Hopefully nobody in 2023 is building number crunching code for 32-bit x87 and caring about the performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.

How stable is C/C++ structure padding under the AAPCS (ARM ABI)?

Question
The C99 standard tells us:
There may be unnamed padding within a structure object, but not at its beginning.
and
There may be unnamed padding at the end of a structure or union.
I am assuming this applies to any of the C++ standards too, but I have not checked them.
Let's assume a C/C++ application (i.e. both languages are used in the application) running on an ARM Cortex-M would store some persistent data on a local medium (a serial NOR-flash chip for instance), and read it back after power cycling, possibly after an upgrade of the application itself in the future. The upgraded application may have been compiled with an upgraded compiler (we assume gcc).
Let's further assume that the developer is lazy (that's not me, of course), and directly streams some plain C or C++ structs to flash, instead of first serializing them as any paranoid experienced developer would do.
In fact, the developer in question is lazy, but not totally ignorant, since he has read the AAPCS (Procedure Call Standard for the Arm Architecture).
His rationale, besides laziness, is the following:
He does not want to pack the structs to avoid misalignment problems in the rest of the application.
The AAPCS specifies a fixed alignment for every single fundamental data type.
The only rational motivation for padding is to achieve proper alignment.
Therefore, he thinks, padding (and therefore member offsetof and total sizeof) is fully determined for any C or C++ struct by the AAPCS.
Therefore, he further reasons, there is no way my application would not be able to interpret some read back data that an earlier version of the same application would have written (assuming, of course, that the offset of the data in flash memory has not changed between writing and reading).
However, the developer has a conscience and he is a little worried:
The C standard does not mention any reason for padding. Achieving proper alignment may be the only rational reason for padding, but compilers are free to pad as much as they want, according to the standard.
How can he be sure that his compiler really follows the AAPCS?
Could his assumptions suddenly be broken by some apparently unrelated compiler flag that he would start using, or by a compiler upgrade?
My question is: how dangerously does that lazy developer live? In other words, how stable is padding in C/C++ structs under the assumptions above?
Conclusion
Two weeks after this question was asked, the only answer that has been
received does not really answer the asked question. I have also asked
the exact same question on an ARM community forum,
but got no answer at all.
I however choose to accept 3246135 as the answer because:
I take the absence of proper answer as very relevant information
for this case. The correctness of solutions to software problems
should be obvious. The assumptions made in my question may be true,
but I cannot easily prove it. Additionally, if the assumptions are
incorrect, the consequences, in the general case, could be
catastrophic.
Compared to the risk, the burden on the developer when using the
strategy exposed in the answer seems
very reasonable. Assuming a constant endianness (which is quite easy
to enforce), it is a hundred percent-safe (any deviation will generate
an error at compile-time) and it is much lighter than a full-blown
serialization. Basically, the strategy exposed in
the answer is a mandatory minimum
price to pay in order to make one's C/C++ structs persistent independently of any ABI.
If you are a developer asking yourself the question above, please do
not be lazy, and use instead the strategy exposed in the accepted
answer, or an alternative strategy that guarantees a constant padding
across software releases.
You can never be 100% sure that the compiler won't introduce padding in some capacity. However, you can mitigate the risks by following a few rules:
Use fixed size types for all members, i.e. uint32_t, int64_t, etc.
Start each member at an offset that is a multiple of the member's size (or if the member is an array / struct, the size of the largest member).
Avoid bitfields
Note that doing this will likely introduce some explicit padding fields to satisfy alignment.
For example:
struct orig {
int a;
char b;
int c[10];
short d;
char e[15];
long f;
int g;
};
The size of this struct's members, assuming sizeof(short) == 2, sizeof(int) == 4, and sizeof(long) == 8, would be 74. If you take into account likely padding:
struct orig_padded {
int a;
char b;
char pad1[3];
int c[10];
short d;
char e[15];
char pad2[7];
long f;
int g;
char pad3[4];
};
You have a struct size of 88.
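With some rearranging we can remove the padding between the members, leaving only a few bytes at the end:
struct reordered {
int64_t f;
int32_t a;
int32_t c[10];
int32_t g;
int16_t d;
char b;
char e[15];
char pad[6];
};
By ordering the fields in descending order of size, we basically remove padding between the fields and only leave potential padding at the end. The members sum to 74 bytes, but where int64_t requires 8-byte alignment the compiler rounds the struct up to 80, so the 6 trailing bytes are declared explicitly here. Note also the use of fixed sizes to avoid some surprises. Then as a safeguard, we add:
static_assert(sizeof(struct reordered) == 80, "unexpected padding in struct reordered");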
So if the compiled size of the struct ever changes, you'll know at compile time.
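The same safeguard can be extended from the total size to every member offset. The following is a sketch on a smaller, hypothetical struct (sample_record and its fields are not from the question), assuming C11 static_assert from <assert.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical persistent record: members in descending size order,
   with the last byte declared explicitly instead of left to the compiler. */
struct sample_record {
    int64_t timestamp;
    int32_t value;
    int16_t flags;
    uint8_t kind;
    uint8_t reserved;   /* explicit padding */
};

/* Any layout change becomes a compile-time error. */
static_assert(sizeof(struct sample_record) == 16, "size changed");
static_assert(offsetof(struct sample_record, timestamp) == 0, "offset changed");
static_assert(offsetof(struct sample_record, value) == 8, "offset changed");
static_assert(offsetof(struct sample_record, flags) == 12, "offset changed");
static_assert(offsetof(struct sample_record, kind) == 14, "offset changed");
static_assert(offsetof(struct sample_record, reserved) == 15, "offset changed");
```

Checking offsets as well as the size catches the case where two members are swapped or resized in a way that happens to leave the total unchanged.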
For more details, take a look at The Lost Art of Structure Packing.

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now - which should I prefer, and when? : warpSize, or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.
Let's get a couple of points straight. The warp size isn't a compile time constant and shouldn't be treated as one. It is an architecture specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX, however that changed at least 6 years ago if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant
From the runtime API as a device property
Don't declare your own constant for the warp size, that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime to get it. If you have a statically declared in-kernel array you need to dimension from the warp size, use templates and select the correct instance at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
Contrary to talonmies's answer I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
The day we get a different warp size (e.g. 64) many things will change:
The warpSize will have to be adjusted obviously
Many warp-level intrinsics will need their signature adjusted, or a new version produced, e.g. int __ballot, and while int does not need to be 32-bit, it is most commonly so!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
that would be overly complex in something that is usually a time-critical piece of code.
warpIdx and laneIdx computation out of threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize this suddenly becomes a quite expensive operation!
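The shift-and-mask point can be illustrated in plain host-side C rather than CUDA. This is only a sketch with hypothetical names: with a literal divisor of 32 the compiler can emit a shift and a mask, while a divisor known only at run time generally needs real division and remainder instructions.

```c
enum { WARP_SIZE = 32 };  /* hypothetical compile-time constant */

/* Divisor known at compile time: equivalent to a shift and a mask. */
unsigned warp_index(unsigned tid) { return tid / WARP_SIZE; }
unsigned lane_index(unsigned tid) { return tid % WARP_SIZE; }

/* What the compiler can reduce the two functions above to. */
unsigned warp_index_shift(unsigned tid) { return tid >> 5; }
unsigned lane_index_mask(unsigned tid)  { return tid & 31u; }

/* Divisor only known at run time: the compiler cannot assume a power of two. */
unsigned warp_index_runtime(unsigned tid, unsigned warp_size)
{
    return tid / warp_size;
}
unsigned lane_index_runtime(unsigned tid, unsigned warp_size)
{
    return tid % warp_size;
}
```

All three variants return the same values for a warp size of 32; the difference is only in the instructions the compiler is able to generate.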
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
Also, if the amount of shared memory depends on the warpSize this forces you to use the dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- this forces you to do pointer arithmetic yourself and manually compute the sum of all memory usage.
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for newer architecture, you just need to alter this one piece of code.

Is uninitialized local variable the fastest random number generator?

I know that reading an uninitialized local variable is undefined behaviour (UB), and that the value may be a trap representation that affects later operations. But sometimes I want a random number only for visual purposes and will not use it anywhere else in the program, for example to give something a random color in a visual effect:
void updateEffect(){
for(int i=0;i<1000;i++){
int r;
int g;
int b;
star[i].setColor(r%255,g%255,b%255);
bool isVisible;
star[i].setVisible(isVisible);
}
}
Is that faster than
void updateEffect(){
for(int i=0;i<1000;i++){
star[i].setColor(rand()%255,rand()%255,rand()%255);
star[i].setVisible(rand()%2==0?true:false);
}
}
and also faster than other random number generators?
As others have noted, this is Undefined Behavior (UB).
In practice, it will (probably) actually (kind of) work. Reading from an uninitialized register on x86[-64] architectures will indeed produce garbage results, and probably won't do anything bad (as opposed to e.g. Itanium, where registers can be flagged as invalid, so that reads propagate errors like NaN).
There are two main problems though:
It won't be particularly random. In this case, you're reading from the stack, so you'll get whatever was there previously. Which might be effectively random, completely structured, the password you entered ten minutes ago, or your grandmother's cookie recipe.
It's Bad (capital 'B') practice to let things like this creep into your code. Technically, the compiler could insert reformat_hdd(); every time you read an undefined variable. It won't, but you shouldn't do it anyway. Don't do unsafe things. The fewer exceptions you make, the safer you are from accidental mistakes all the time.
The more pressing issue with UB is that it makes your entire program's behavior undefined. Modern compilers can use this to elide huge swaths of your code or even go back in time. Playing with UB is like a Victorian engineer dismantling a live nuclear reactor. There's a zillion things to go wrong, and you probably won't know half of the underlying principles or implemented technology. It might be okay, but you still shouldn't let it happen. Look at the other nice answers for details.
Also, I'd fire you.
Let me say this clearly: we do not invoke undefined behavior in our programs. It is never ever a good idea, period. There are rare exceptions to this rule; for example, if you are a library implementer implementing offsetof. If your case falls under such an exception you likely know this already. In this case we know using uninitialized automatic variables is undefined behavior.
Compilers have become very aggressive with optimizations around undefined behavior and we can find many cases where undefined behavior has led to security flaws. The most infamous case is probably the Linux kernel null pointer check removal which I mention in my answer to C++ compilation bug? where a compiler optimization around undefined behavior turned a finite loop into an infinite one.
We can read CERT's Dangerous Optimizations and the Loss of Causality (video) which says, amongst other things:
Increasingly, compiler writers are taking advantage of undefined
behaviors in the C and C++ programming languages to improve
optimizations.
Frequently, these optimizations are interfering with
the ability of developers to perform cause-effect analysis on their
source code, that is, analyzing the dependence of downstream results
on prior results.
Consequently, these optimizations are eliminating
causality in software and are increasing the probability of software
faults, defects, and vulnerabilities.
Specifically with respect to indeterminate values, the C standard defect report 451: Instability of uninitialized automatic variables makes for some interesting reading. It has not been resolved yet but introduces the concept of wobbly values, which means the indeterminateness of a value may propagate through the program and the value can appear to differ at different points in the program.
I don't know of any examples where this happens but at this point we can't rule it out.
Real examples, not the result you expect
You are unlikely to get random values. A compiler could optimize away the loop altogether. For example, with this simplified case:
void updateEffect(int arr[20]){
for(int i=0;i<20;i++){
int r ;
arr[i] = r ;
}
}
clang optimizes it away (see it live):
updateEffect(int*): # #updateEffect(int*)
retq
or perhaps get all zeros, as with this modified case:
void updateEffect(int arr[20]){
for(int i=0;i<20;i++){
int r ;
arr[i] = r%255 ;
}
}
see it live:
updateEffect(int*): # #updateEffect(int*)
xorps %xmm0, %xmm0
movups %xmm0, 64(%rdi)
movups %xmm0, 48(%rdi)
movups %xmm0, 32(%rdi)
movups %xmm0, 16(%rdi)
movups %xmm0, (%rdi)
retq
Both of these cases are perfectly acceptable forms of undefined behavior.
Note, if we are on an Itanium we could end up with a trap value:
[...]if the register happens to hold a special not-a-thing value,
reading the register traps except for a few instructions[...]
Other important notes
It is interesting to note the variance between gcc and clang noted in the UB Canaries project over how willing they are to take advantage of undefined behavior with respect to uninitialized memory. The article notes (emphasis mine):
Of course we need to be completely clear with ourselves that any such expectation has nothing to do with the language standard and everything to do with what a particular compiler happens to do, either because the providers of that compiler are unwilling to exploit that UB or just because they have not gotten around to exploiting it yet. When no real guarantee from the compiler provider exists, we like to say that as-yet unexploited UBs are time bombs: they’re waiting to go off next month or next year when the compiler gets a bit more aggressive.
As Matthieu M. points out What Every C Programmer Should Know About Undefined Behavior #2/3 is also relevant to this question. It says amongst other things (emphasis mine):
The important and scary thing to realize is that just about any
optimization based on undefined behavior can start being triggered on
buggy code at any time in the future. Inlining, loop unrolling, memory
promotion and other optimizations will keep getting better, and a
significant part of their reason for existing is to expose secondary
optimizations like the ones above.
To me, this is deeply dissatisfying, partially because the compiler
inevitably ends up getting blamed, but also because it means that huge
bodies of C code are land mines just waiting to explode.
For completeness' sake I should probably mention that implementations can choose to make undefined behavior well defined; for example, gcc allows type punning through unions, while in C++ this seems like undefined behavior. If this is the case the implementation should document it, and this will usually not be portable.
No, it's terrible.
The behaviour of using an uninitialised variable is undefined in both C and C++, and it's very unlikely that such a scheme would have desirable statistical properties.
If you want a "quick and dirty" random number generator, then rand() is your best bet. In its implementation, all it does is a multiplication, an addition, and a modulus.
The fastest generator I know of requires you to use a uint32_t as the type of the pseudo-random variable I, and use
I = 1664525 * I + 1013904223
to generate successive values. You can choose any initial value of I (called the seed) that takes your fancy. Obviously you can code that inline. The standard-guaranteed wraparound of an unsigned type acts as the modulus. (The numeric constants are hand-picked by that remarkable scientific programmer Donald Knuth.)
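Written out as code, the recurrence above might look like the following sketch. The function names are hypothetical, and taking the top bits for a color component is an assumption added here, chosen because the low bits of a power-of-two-modulus generator like this one repeat with very short periods.

```c
#include <stdint.h>

/* Advances the generator: I = 1664525 * I + 1013904223 (mod 2^32).
   The unsigned 32-bit wraparound performs the modulus. */
uint32_t lcg_next(uint32_t *state)
{
    *state = 1664525u * *state + 1013904223u;
    return *state;
}

/* Hypothetical helper: a value in 0..255 taken from the top 8 bits. */
unsigned lcg_color_component(uint32_t *state)
{
    return lcg_next(state) >> 24;
}
```

A caller keeps one uint32_t seed and passes its address on every call, for example three calls to lcg_color_component for r, g and b. This stays fully defined behavior at a cost of one multiply and one add per value.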
Good question!
Undefined does not mean it's random. Think about it, the values you'd get in global uninitialized variables were left there by the system or your/other applications running. Depending what your system does with no longer used memory and/or what kind of values the system and applications generate, you may get:
The same value every time.
One of a small set of values.
Values in one or more small ranges.
Many values divisible by 2/4/8, from pointers on a 16/32/64-bit system.
...
The values you'll get completely depend on which non-random values are left by the system and/or applications. So, indeed there will be some noise (unless your system wipes no longer used memory), but the value pool from which you'll draw will by no means be random.
Things get much worse for local variables because these come directly from the stack of your own program. There is a very good chance that your program will actually write these stack locations during the execution of other code. I estimate the chances of luck in this situation to be very low, and every 'random' code change you make tests that luck again.
Read about randomness. As you'll see randomness is a very specific and hard to obtain property. It's a common mistake to think that if you just take something that's hard to track (like your suggestion) you'll get a random value.
Many good answers, but allow me to add another and stress the point that in a deterministic computer, nothing is random. This is true for both the numbers produced by a pseudo-RNG and the seemingly "random" numbers found in areas of memory reserved for C/C++ local variables on the stack.
BUT... there is a crucial difference.
The numbers generated by a good pseudorandom generator have the properties that make them statistically similar to truly random draws. For instance, the distribution is uniform. The cycle length is long: you can get millions of random numbers before the cycle repeats itself. The sequence is not autocorrelated: for instance, you will not begin to see strange patterns emerge if you take every 2nd, 3rd, or 27th number, or if you look at specific digits in the generated numbers.
In contrast, the "random" numbers left behind on the stack have none of these properties. Their values and their apparent randomness depend entirely on how the program is constructed, how it is compiled, and how it is optimized by the compiler. By way of example, here is a variation of your idea as a self-contained program:
#include <stdio.h>
void notrandom(void)
{
int r, g, b;
printf("R=%d, G=%d, B=%d", r&255, g&255, b&255);
}
int main(int argc, char *argv[])
{
int i;
for (i = 0; i < 10; i++)
{
notrandom();
printf("\n");
}
return 0;
}
When I compile this code with GCC on a Linux machine and run it, it turns out to be rather unpleasantly deterministic:
R=0, G=19, B=0
R=130, G=16, B=255
R=130, G=16, B=255
R=130, G=16, B=255
R=130, G=16, B=255
R=130, G=16, B=255
R=130, G=16, B=255
R=130, G=16, B=255
R=130, G=16, B=255
R=130, G=16, B=255
If you looked at the compiled code with a disassembler, you could reconstruct what was going on, in detail. The first call to notrandom() used an area of the stack that was not used by this program previously; who knows what was in there. But after that call to notrandom(), there is a call to printf() (which the GCC compiler actually optimizes to a call to putchar(), but never mind) and that overwrites the stack. So the next and subsequent times, when notrandom() is called, the stack will contain stale data from the execution of putchar(), and since putchar() is always called with the same arguments, this stale data will always be the same, too.
So there is absolutely nothing random about this behavior, nor do the numbers obtained this way have any of the desirable properties of a well-written pseudorandom number generator. In fact, in most real-life scenarios, their values will be repetitive and highly correlated.
Indeed, like others, I would also seriously consider firing someone who tried to pass off this idea as a "high performance RNG".
Undefined behavior means that the authors of compilers are free to ignore the problem because programmers will never have a right to complain whatever happens.
While in theory anything can happen once you enter UB land (including demons flying out of your nose), what it normally means is that compiler authors just won't care and, for local variables, the value will be whatever is in the stack memory at that point.
This also means that the content will often be "strange" but fixed, or slightly random, or variable but with a clearly evident pattern (e.g. increasing values at each iteration).
You certainly cannot expect it to be a decent random generator.
Undefined behaviour is undefined. It doesn't mean that you get an undefined value; it means that the program can do anything and still meet the language specification.
A good optimizing compiler should take
void updateEffect(){
for(int i=0;i<1000;i++){
int r;
int g;
int b;
star[i].setColor(r%255,g%255,b%255);
bool isVisible;
star[i].setVisible(isVisible);
}
}
and compile it to a noop. This is certainly faster than any alternative. It has the downside that it will not do anything, but such is the downside of undefined behaviour.
Not mentioned yet, but code paths that invoke undefined behavior are allowed to do whatever the compiler wants, e.g.
void updateEffect(){}
Which is certainly faster than your correct loop, and because of UB, is perfectly conformant.
For security reasons, new memory assigned to a program has to be cleared; otherwise information such as passwords could leak from one application into another. Only when you reuse memory do you get values other than 0. And it is very likely that on the stack the previous value is simply fixed, because the previous use of that memory is fixed.
Your particular code example would probably not do what you are expecting. While technically each iteration of the loop re-creates the local variables for the r, g, and b values, in practice it's the exact same memory space on the stack. Hence it won't get re-randomized with each iteration, and you will end up assigning the same 3 values for each of the 1000 colors, regardless of how random the r, g, and b are individually and initially.
Indeed, if it did work, I would be very curious as to what's re-randomizing it. The only thing I can think of would be an interleaved interrupt that piggybacked atop that stack, which is highly unlikely. Perhaps internal optimization that kept those as register variables rather than as true memory locations, where the registers get re-used further down in the loop, would do the trick, too, especially if the set visibility function is particularly register-hungry. Still, far from random.
As most people here have mentioned, this is undefined behavior. Undefined also means that you may (luckily) get some valid integer value, and in that case this will be faster, as no rand function call is made.
But don't use it in practice. I am sure this will give terrible results, as luck is not with you all the time.
Really bad! Bad habit, bad result.
Consider:
A_Function_that_use_a_lot_the_Stack();
updateEffect();
If the function A_Function_that_use_a_lot_the_Stack() always performs the same initialization, it leaves the stack with the same data on it. That data is what we get when calling updateEffect(): always the same value!
I performed a very simple test, and it wasn't random at all.
#include <stdio.h>
int main() {
int a;
printf("%d\n", a);
return 0;
}
Every time I ran the program, it printed the same number (32767 in my case) -- you can't get much less random than that. This is presumably whatever the startup code in the runtime library left on the stack. Since it uses the same startup code every time the program runs, and nothing else varies in the program between runs, the results are perfectly consistent.
You need to have a definition of what you mean by 'random'.
A sensible definition requires that the values you get have little correlation with one another. That's something you can measure. It's also not trivial to achieve in a controlled, reproducible manner. So undefined behaviour is certainly not what you are looking for.
There are certain situations in which uninitialized memory may be safely read using type "unsigned char*" [e.g. a buffer returned from malloc]. Code may read such memory without having to worry about the compiler throwing causality out the window, and there are times when it may be more efficient to have code be prepared for anything memory might contain than to ensure that uninitialized data won't be read (a commonplace example of this would be using memcpy on partially-initialized buffer rather than discretely copying all of the elements that contain meaningful data).
Even in such cases, however, one should always assume that if any combination of bytes will be particularly vexatious, reading it will always yield that pattern of bytes (and if a certain pattern would be vexatious in production, but not in development, such a pattern won't appear until code is in production).
Reading uninitialized memory might be useful as part of a random-generation strategy in an embedded system where one can be sure the memory has never been written with substantially-non-random content since the last time the system was powered on, and if the manufacturing process used for the memory causes its power-on state to vary in semi-random fashion. Code should work even if all devices always yield the same data, but in cases where e.g. a group of nodes each need to select arbitrary unique IDs as quickly as possible, having a "not very random" generator which gives half the nodes the same initial ID might be better than not having any initial source of randomness at all.
As others have said, it will be fast, but not random.
What most compilers will do for local variables is to grab some space for them on the stack, but not bother setting it to anything (the standard says they don't need to, so why slow down the code you're generating?).
In this case, the value you'll get will depend on what was previously on the stack: if you call a function before this one that has a hundred local char variables all set to 'Q', and then call your function after that one returns, then you'll probably find your "random" values behave as if you've memset() them all to 'Q's.
Importantly for your example function, these values won't change each time you read them; they'll be the same every time. So you'll get 1000 stars all set to the same colour and visibility.
Also, nothing says that the compiler shouldn't initialize these values, so a future compiler might do so.
In general: bad idea, don't do it.
(like a lot of "clever" code level optimizations really...)
As others have already mentioned, this is undefined behavior (UB), but it may "work".
Apart from the problems already mentioned by others, I see one other problem (disadvantage): it will not work in any language other than C and C++. I know that this question is about C++, but if you can write code that is good C++ and Java code with no trouble, then why not? Maybe some day someone will have to port it to another language, and searching for bugs caused by "magic trick" UB like this will definitely be a nightmare (especially for an inexperienced C/C++ developer).
Here is a question about another similar UB. Just imagine yourself trying to find a bug like this without knowing about this UB. If you want to read more about such strange things in C/C++, read the answers to the linked question and see this GREAT slideshow. It will help you understand what's under the hood and how it works; it's not just another slideshow full of "magic". I'm quite sure that even most experienced C/C++ programmers can learn a lot from it.
It is not a good idea to rely on a language's undefined behaviour for any logic. In addition to what has been mentioned and discussed in this post, I would like to mention that with a modern C++ approach/style such a program may not even compile.
This was mentioned in my previous post, which describes the advantages of the auto feature and contains a useful link on the same topic.
https://stackoverflow.com/a/26170069/2724703
So, if we change the above code and replace the actual types with auto, the program would not even compile.
void updateEffect(){
for(int i=0;i<1000;i++){
auto r;
auto g;
auto b;
star[i].setColor(r%255,g%255,b%255);
auto isVisible;
star[i].setVisible(isVisible);
}
}
I like your way of thinking. Really outside the box. However, the tradeoff is really not worth it. A memory-runtime tradeoff is a thing; accepting undefined behavior in exchange for runtime is not.
It must give you a very unsettling feeling to know you are using such "random" values in your business logic. I wouldn't do it.
Use 7757 every place you are tempted to use uninitialized variables. I picked it randomly from a list of prime numbers:
it is defined behavior
it is guaranteed to not always be 0
it is prime
it is likely to be as statistically random as uninitialized variables
it is likely to be faster than uninitialized variables since its value is known at compile time
There is one more possibility to consider.
Modern compilers (ahem g++) are so intelligent that they go through your code to see what instructions affect state, and what don't, and if an instruction is guaranteed to NOT affect the state, g++ will simply remove that instruction.
So here's what will happen. g++ will definitely see that you are reading, performing arithmetic on, saving, what is essentially a garbage value, which produces more garbage. Since there is no guarantee that the new garbage is any more useful than the old one, it will simply do away with your loop. BLOOP!
This method is useful, but here's what I would do. Combine UB (Undefined Behaviour) with rand() speed.
Of course, reduce the number of rand() calls executed, but mix them in so the compiler doesn't do anything you don't want it to.
And I won't fire you.
Using uninitialized data for randomness is not necessarily a bad thing if done properly. In fact, OpenSSL does exactly this to seed its PRNG.
Apparently this usage wasn't well documented however, because someone noticed Valgrind complaining about using uninitialized data and "fixed" it, causing a bug in the PRNG.
So you can do it, but you need to know what you're doing and make sure that anyone reading your code understands this.

Force compiler to not optimize side-effect-less statements

I was reading some old game programming books and as some of you might know, back in that day it was usually faster to do bit hacks than do things the standard way. (Converting float to int, mask sign bit, convert back for absolute value, instead of just calling fabs(), for example)
Nowadays it is almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.
But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:
void float_to_int(float f)
{
int i = static_cast<int>(f); // has no side-effects
}
Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.
The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.
I'm using Visual Studio 2008 / G++ (3.4.4).
Edit
To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that with this the statements with no side-effect will be optimized out, hence the situation.
Edit Again
To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.
We all know that the old tricks aren't very useful anymore, I'm merely curious how not useful they are. Just plain curiosity. Sure, life could go on without me knowing just how these old hacks perform against modern day CPU's, but it never hurts to know.
So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.
Premature quoting of Knuth is the root of all annoyance.
Assignment to a volatile variable should never be optimized away, so this might give you the result you want:
static volatile int i = 0;
void float_to_int(float f)
{
i = static_cast<int>(f); // volatile store: an observable side effect
}
So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements
You are by definition skewing the results.
Here's how to fix the problem of trying to profile "dummy" code that you wrote just to test: For profiling, save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get any other optimizations it can put in to make the code fast.
In this case I suggest you make the function return the integer value:
int float_to_int(float f)
{
return static_cast<int>(f);
}
Your calling code can then exercise it with a printf to guarantee it won't optimize it out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.
extern int float_to_int(float f);
int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);
Now compare this to an empty function like:
int take_float_return_int(float /* f */)
{
return 1;
}
Which should also be external.
The difference in times should give you an idea of the expense of what you're trying to measure.
What always worked on all compilers I used so far:
extern volatile int writeMe = 0;
void float_to_int(float f)
{
writeMe = static_cast<int>(f);
}
Note that this skews results; both methods should write to writeMe.
volatile tells the compiler "the value may be accessed without your notice", thus the compiler cannot omit the calculation and drop the result. To block propagation of input constants, you might need to run them through an extern volatile, too:
extern volatile float readMe = 0;
extern volatile int writeMe = 0;
void float_to_int(float f)
{
writeMe = static_cast<int>(f);
}
int main()
{
readMe = 17;
float_to_int(readMe);
}
Still, all optimizations in between the read and the write can be applied "with full force". The read and write to the global variable are often good "fenceposts" when inspecting the generated assembly.
Without the extern the compiler may notice that a reference to the variable is never taken, and thus determine it can't be volatile. Technically, with Link Time Code Generation, it might not be enough, but I haven't found a compiler that aggressive. (For a compiler that does indeed remove the access, the reference would need to be passed to a function in a DLL loaded at runtime.)
Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, if the code behaves as if no optimisation takes place. However, you can often trick them into not doing so if you indicate that value might be used later, so I would change your code to:
int float_to_int(float f)
{
return static_cast<int>(f); // has no side-effects
}
As others have suggested, you will need to examine the assembler output to check that this approach actually works.
You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.
It quite clearly states that casting between float and int is a really bad idea because it requires a store from the int register to memory followed by a load into a float register. These operations cause a bubble in the pipeline and waste many precious cycles.
A function call incurs quite a bit of overhead, so I would remove this anyway.
Adding a dummy += i; is no problem, as long as you keep this same bit of code in the alternate profile too (that is, in the code you are comparing it against).
Last but not least: generate asm code. Even if you cannot code in asm, the generated code is typically understandable since it will have labels and commented C code behind it. So you know (sort of) what happens, and which bits are kept.
R
p.s. found this too:
inline float pslNegFabs32f(float x){
__asm{
fld x //Push 'x' into st(0) of FPU stack
fabs
fchs //change sign
fstp x //Pop from st(0) of FPU stack
}
return x;
}
supposedly also very fast. You might want to profile this too. (although it is hardly portable code)
Return the value?
int float_to_int(float f)
{
return static_cast<int>(f); // has no side-effects
}
and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.
You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.
If you are using Microsoft's compiler, cl.exe, you can use the following statement to turn optimization on/off on a per-function level.
#pragma optimize("", {on | off})
Turn optimizations off for functions defined after the current line:
#pragma optimize("", off)
Turn optimizations back on:
#pragma optimize("", on)
For example, in the code linked below, you can notice 3 things:
The compiler optimization flag is set (/O2), so the code will get optimized.
Optimizations are turned off for the first function, square(), and turned back on before square2() is defined.
The amount of assembly code generated for the first function is higher. In the second function, no assembly code is generated for the int i = num; statement.
Thus, while the first function is not optimized, the second function is.
See https://godbolt.org/z/qJTBHg for link to this code on compiler explorer.
A similar directive exists for gcc too - https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html
A micro-benchmark around this statement will not be representative of using this approach in a genuine scenario; the surrounding instructions and their effect on the pipeline and cache are generally as important as any given statement in itself.
GCC 4 now does a lot of micro-optimizations that GCC 3.4 never did. GCC 4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin(), fabs(), etc., as well as optimizing such calls to their FPU, SSE or 3D Now! equivalents.
I know the Intel compiler is also extremely good at these kinds of optimizations.
My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.
Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.
All of this is due to the fact that newer CPUs are pipelined - instructions are decoded and dispatched to fast computation units. Instructions no longer run in terms of clock cycles, and are more sensitive to cache misses and inter-instruction dependencies.
Check the AMD and Intel processor programming manuals for all the gritty details.