I have tried to use restrict qualified pointers, and I have encountered a problem.
The program below is a simple one, written only to present the problem.
The calc_function uses three pointers, which are restrict qualified so they "SHALL" not alias each other. When compiling this code in Visual Studio, the function is inlined, and for some reason Visual Studio 2010 then ignores the qualifiers. If I disable inlining, the code executes more than six times faster (from 2200 ms down to 360 ms). But I do not want to disable inlining for the whole project, or even the whole file (because then there would be call overhead in e.g. all getters and setters, which would be horrible).
(Might the only solution be to disable inlining of only this function?)
I have tried to create temporary restrict qualified pointers in the function, both at the top and in the inner loop, to tell the compiler that I promise there is no aliasing, but the compiler won't believe me and it does not work.
I have also tried tweaking compiler settings, but the only thing I have found that works is to disable inlining.
I would appreciate some help to solve this optimization problem.
To run the program (in release mode), don't forget to use the arguments 0 1000 2000.
The reason for taking the offsets from user input/program arguments is to make sure the compiler cannot know whether or not there is aliasing between the pointers a, b and c.
#include <cstdlib>
#include <cstdio>
#include <ctime>

// Data table that a, b and c will point into, so the compiler can't know if they alias.
const size_t listSize = 10000;
int data[listSize];

//void calc_function(int * a, int * b, int * c){
void calc_function(int *__restrict a, int *__restrict b, int *__restrict c){
    for(size_t y=0; y<1000*1000; ++y){ // <- Extra loop to be able to measure the time.
        for(size_t i=0; i<1000; ++i){
            *a += *b;
            *c += *a;
        }
    }
}

int main(int argc, char *argv[]){ // argv SHALL be "0 1000 2000" (with no quotes)
    // init
    for(size_t i=0; i<listSize; ++i)
        data[i] = i;
    // get the offsets for a, b and c from argv (0, 1000, 2000)
    int aOff, bOff, cOff;
    sscanf(argv[1],"%d",&aOff);
    sscanf(argv[2],"%d",&bOff);
    sscanf(argv[3],"%d",&cOff);
    int *a = data + aOff; // a, b and c will (after the specified argv) be
    int *b = data + bOff; // a = &data[0], b = &data[1000], c = &data[2000],
    int *c = data + cOff; // so they will not alias, and the compiler can't know that.
    // calculate and take the time
    time_t start = clock();
    calc_function(a,b,c);
    time_t end = clock();
    time_t t = (end-start);
    printf("calc_function %ld (clock ticks)\n", (long)t);
    system("PAUSE");
    return EXIT_SUCCESS;
}
If you declare a function with __declspec(noinline), it will force it not to be inlined:
http://msdn.microsoft.com/en-us/library/kxybs02x%28v=vs.80%29.aspx
You can use this to manually disable inlining on a per-function basis.
As for restrict, the compiler is free to ignore it and use it only when it wants to. So fiddling around with different versions of the same code is somewhat unavoidable when attempting to "trick" a compiler into such optimizations.
So I recently stumbled on this code somewhere, where it copies one string to another using just one line of code with the help of a while loop. However, I am not able to understand how and why it works:
#include <iostream>
using namespace std;

int main()
{
    char arr1[100];
    cin.getline(arr1, 100);
    char arr2[100];
    int i = -1;
    while(arr2[i] = arr1[++i]);
    cout << arr1 << endl << arr2 << endl;
    return 0;
}
Can somebody explain to me what is happening behind the scenes?
And moreover, if the above code works fine, then why don't the ones below?
#include <iostream>
using namespace std;

int main()
{
    char arr1[100];
    cin.getline(arr1, 100);
    char arr2[100];
    int i = 0;
    while(arr2[i++] = arr1[i]);
    cout << arr1 << endl << arr2 << endl;
    return 0;
}
And another one:
#include <iostream>
using namespace std;

int main()
{
    char arr1[100];
    cin.getline(arr1, 100);
    char arr2[100];
    int i = 0;
    while(arr2[++i] = arr1[i]);
    cout << arr1 << endl << arr2 << endl;
    return 0;
}
The code snippet is relying on an order-of-evaluation guarantee that was added in C++17.
Since C++17 it is guaranteed that the right-hand side of a = operator is evaluated first. As a consequence the loop is equivalent to
int i = -1;
while(true) {
i++;
arr2[i] = arr1[i];
if(!arr2[i])
break;
};
Except that one would normally start at i = 0; and put the i++; at the end of the loop iteration, it should now be clearer what is happening. The loop breaks when a null character is encountered, so it expects that arr1 is a null-terminated string and won't copy the whole array.
Before C++17 the order of evaluation was not specified and the code had undefined behavior as a consequence.
If you change the loop to int i=0; while(arr2[++i] = arr1[i]);, then (since C++17) you execute ++i only after indexing arr1[i], but before indexing arr2. As a consequence you are not copying to the beginning of arr2. Again, before C++17 this is undefined behavior.
int i=0; while(arr2[i++] = arr1[i]); should work correctly since C++17 as well. It does the increment only after indexing both arrays. Again, before C++17 it has undefined behavior.
You shouldn't use any of these, since they are hard to reason about and have undefined behavior if the user happens to set the C++ version switch to something before C++17 or tries to compile them as C, where they have undefined behavior in all versions.
Also int may be too small to hold all indices of a string. Prefer std::size_t (which however is unsigned and so the first variant won't work).
Utilities for things like copying strings should be written in functions, not inline every place they're used. That makes it simpler to avoid the complexities of incrementing the same variable twice:
void copy_string(char* dest, const char *src) {
while (*dest++ = *src++)
;
}
Yes, I know, some people like to have their compiler ignore the rules and refuse to compile valid, well-defined code like this. If your compiler is set that way, figure out how to rewrite that code to make your compiler happy, and perhaps think about who's the boss: you or your compiler.
Let's say we have the following two pieces of code:
int *a = (int *)malloc(sizeof(*a));
int *b = (int *)malloc(sizeof(*b));
And
int *a = (int *)malloc(2 * sizeof(*a));
int *b = a + 1;
Both of them allocate two integers on the heap and (assuming normal usage) they should be equivalent. The first seems slower, as it calls malloc twice, while the second may result in more cache-friendly code. The second, however, is possibly insecure, as we can accidentally overwrite the value b points to just by incrementing a and writing through the resulting pointer (or someone malicious who knows where a is can instantly change the value b points to).
It's possible that the above claims are not true (for example the speed is questioned here: Minimizing the amount of malloc() calls improves performance?) but my question is just: Can the compiler do this type of transformation or is there something fundamentally different between the two according to the standard? If it is possible, what compiler flags (let's say gcc) can allow it?
In reality, no, the compiler will never combine the 2 malloc() calls into a single malloc() call automatically. Each call to malloc() returns the address of a new memory block, there is no guarantee that the allocated blocks will be located anywhere close to each other, and each allocated block must be free()'d individually. So no compiler will ever assume anything about the relationship between multiple allocated blocks and try to optimize their allocations for you.
Now, it is possible that in a very simplified use-case, where the allocation and deallocation were in the same scope, and if it can be proven to be safe to do so, then the compiler vendor might decide to try to optimize, ie:
void doIt()
{
int *a = (int *)malloc(sizeof(*a));
int *b = (int *)malloc(sizeof(*b));
...
free(a);
free(b);
}
Could become:
void doIt()
{
void *ptr = malloc(sizeof(int) * 2);
int *a = (int *)ptr;
int *b = a + 1;
...
free(ptr);
}
But in reality, no compiler vendor will actually attempt to do this. It is not worth the effort, or the risk, for such little gain. And it would not work in more complex scenarios anyway, eg:
void doIt()
{
int *a = (int *)malloc(sizeof(*a));
int *b = (int *)malloc(sizeof(*b));
...
UseAndFree(a, b);
}
void UseAndFree(int *a, int *b)
{
...
free(a);
free(b);
}
No, it can't, because the compiler (in general) doesn't know when a and b might get free()'d, and if it allocates them both as part of a single allocation, then it would need to free() them both at the same time also.
There are a number of reasons why this will likely never happen, but the most important is lifetimes: these allocations, if made independently, can be freed independently. If made together, they're locked to the same lifetime.
This sort of nuance is best expressed by the developer rather than determined by the compiler.
Is the second "insecure" in that you can overwrite values? In C, and by extension C++, the language does not protect you from bad programming. You are free to shoot yourself in the foot at any time, using any means necessary:
int a;
int b;
int* p = &a;
p[1] = 9; // Bullet, meet foot
(&b)[-1] = 9; // Why not?
If you want to allocate N of something by all means use calloc() to express it, or an appropriately sized malloc(). Doing individual allocations is pointless unless there's a good reason.
Normally you wouldn't allocate a single int, that's kind of useless, but there are cases where that might be the only reasonable option. Typically it's larger blocks of things, like a full struct or a character buffer.
First of all:
int *a = (int *)malloc(8);
int *b = a + 4;
Is not what you think. You want:
int *a = malloc(sizeof(*a) * 2);
int *b = a + 1;
It shows that pointer arithmetic is something you need to learn.
Secondly: the compiler does not change anything in your code, and it will not combine any function calls in one. What you try to achieve is a micro-optimization. If you want to use a larger chunk of memory simply use arrays.
int *a = malloc(sizeof(*a) * 2);
a[0] = 5;
a[1] = 6;
/* some other code */
free(a);
Do not use "magic" numbers in malloc; use only the sizeof of the objects. And do not cast the result of malloc (in C).
I've done exactly that with a bignum library, but you only free the one pointer.
//initialization every time program runs
extern bignum_t *scratch00; //these are useful for taylor series, etc.
extern bignum_t *scratch01;
extern bignum_t *scratch02;
.
.
.
bignum_t *bn_malloc(int bignums)
{
return(malloc(bignums * bn_numbytes));
}
.
.
.
//bignums specific to the program being written at the moment
bignum_t *numerator;
bignum_t *denom;
bignum_t *denom_add;
bignum_t *accum;
bignum_t *term;
.
.
.
numerator = bn_malloc(1);
denom = bn_malloc(1);
denom_add = bn_malloc(1);
accum = bn_malloc(1);
term = bn_malloc(1);
I have function that receives an array of pointers like so:
void foo(int *ptrs[], int num, int size)
{
/* The body is an example only */
for (int i = 0; i < size; ++i) {
for (int j = 0; j < num-1; ++j)
ptrs[num-1][i] += ptrs[j][i];
}
}
What I want to convey to the compiler is that the pointers ptrs[i] are not aliases of each other and that the arrays ptrs[i] do not overlap. How shall I do this? My ulterior motive is to encourage automatic vectorization.
Also, is there a way to get the same effect as __restrict__ on an iterator of a std::vector ?
restrict, unlike the more common const, is a property of the pointer rather than the data pointed to. It therefore belongs on the right side of the '*' declarator-modifier. [] in a parameter declaration is another way to write *. Putting these things together, you should be able to get the effect you want with this function prototype:
void foo(int *restrict *restrict ptrs, int num, int size)
{
/* body */
}
and no need for new names. (Not tested. Your mileage may vary. restrict is a pure optimization hint and may not actually do anything constructive with your compiler.)
Something like:
void foo(int *ptrs[], int num, int size)
{
/* The body is an example only */
for (int i = 0; i < size; ++i) {
for (int j = 0; j < num-1; ++j) {
int * restrict a = ptrs[num-1];
int * restrict b = ptrs[j];
a[i] += b[i];
}
}
... should do it, I think, in C99. I don't think there's any way in C++, but many C++ compilers also support restrict.
In C++, pointer arguments are assumed not to alias if they point to fundamentally different types ("strict aliasing" rules).
In C99, the "restrict" keyword specifies that a pointer argument does not alias any other pointer argument.
Call std::memcpy. memcpy's declaration has restrict-qualified parameters if your language version and compiler support it, and most compilers will lower it into vector instructions when the size of the copied region is small.
I want to write code that compiles conditionally and according to the following two cases:
CASE_A:
for(int i = 1; i <= 10; ++i){
// do something...
}
CASE_B: ( == !CASE_A)
{
const int i = 0;
// do something...
}
That is, in case A, I want a normal loop over the variable i but, in case B, I want to restrict the local variable i to only a special case (designated here as i = 0). Obviously, I could write something along the lines of:
for(int i = (CASE_A ? 1 : 0); i <= (CASE_A ? 10 : 0); ++i){
// do something
}
However, I do not like this design as it doesn't allow me to take advantage of the const declaration in the special case B. Such declaration would presumably allow for lots of optimization as the body of this loop benefits greatly from a potential compile-time replacement of i by its constant value.
Looking forward to any tips from the community on how to efficiently achieve this.
Thank you!
EDITS:
CASE_A vs CASE_B can be evaluated at compile-time.
i is not passed as reference
i is not re-evaluated in the body (otherwise const would not make sense), but I am not sure the compiler will go through the effort to certify that
Assuming you aren't over-simplifying your example, it shouldn't matter. Provided CASE_A can be evaluated at compile time, the code:
for( int i = 0; i <= 0; ++i ) {
do_something_with( i );
}
is going to generate the same machine code as:
const int i = 0;
do_something_with( i );
for any decent compiler (with optimization turned on, of course).
In researching this, I find there is a fine point here. If i gets passed to a function via a pointer or reference, the compiler can't assume it doesn't change. This is true even if the pointer or reference is const! (Since the const can be cast away in the function.)
Seems to be the obvious solution:
template<int CASE>
void do_case();
template<>
void do_case<CASE_A>()
{
for(int i = 1; i <= 10; ++i){
do_something( i );
}
}
template<>
void do_case<CASE_B>()
{
do_something( 0 );
}
// Usage
...
do_case<CURRENT_CASE>(); // CURRENT_CASE is the compile time constant
If your CASE_A/CASE_B determination can be expressed as a compile-time constant, then you can do what you want in a nice, readable format using something like the following (which is just a variation on your example of using the ?: operator for the for loop initialization and condition):
enum {
kLowerBound = (CASE_A ? 1 : 0),
kUpperBound = (CASE_A ? 10 : 0)
};
for (int i = kLowerBound; i <= kUpperBound; ++i) {
// do something
}
This makes it clear that the for loop bounds are compile time constants - note that I think most compilers today would have no problem making that determination even if the ?: expressions were used directly in the for statement's controlling clauses. However, I do think using enums makes it more evident to people reading the code.
Again, any compiler worth its salt today should recognize when i is invariant inside the loop, and in the CASE_B situation also determine that the loop will never iterate. Making i const won't benefit the compiler's optimization possibilities.
If you're convinced that the compiler might be able to optimize better if i is const, then a simple modification can help:
for (int ii = kLowerBound; ii <= kUpperBound; ++ii) {
const int i = ii;
// do something
}
I doubt this will help the compiler much (but check it's output - I could be wrong) if i isn't modified or has its address taken (even by passing it as a reference). However, it might help you make sure that i isn't inappropriately modified or passed by reference/address in the loop.
On the other hand, you might actually see a benefit to optimizations produced by the compiler if you use the const modifier on it - in the cases where the address of i is taken or the const is cast away, the compiler is still permitted to treat i as not being modified for its lifetime. Any modifications that might be made by something that cast away the const would be undefined behavior, so the compiler is allowed to ignore that they might occur. Of course, if you have code that might do this, you have bigger worries than optimization. So it's more important to make sure that there are no 'behind the back' modification attempts to i than to simply marking i as const 'for optimization', but using const might help you identify whether modifications are made (but remember that casts can continue to hide that).
I'm not quite sure that this is what you're looking for, but I'm using this macro version of the vanilla for loop, which forces the loop counter to be const to catch any modification of it in the body:
#define FOR(type, var, start, maxExclusive, inc) if (bool a = true) for (type var##_ = start; var##_ < maxExclusive; a=true,var##_ += inc) for (const auto var = var##_;a;a=false)
Usage:
#include <stdio.h>
#define FOR(type, var, start, maxExclusive, inc) if (bool a = true) for (type var##_ = start; var##_ < maxExclusive; a=true,var##_ += inc) for (const auto var = var##_;a;a=false)
int main()
{
FOR(int, i, 0, 10, 1) {
printf("i: %d\n", i);
}
// does the same as:
for (int i = 0; i < 10; i++) {
printf("i: %d\n", i);
}
// FOR catches some bugs:
for (int i = 0; i < 10; i++) {
i += 10; // is legal but bad
printf("i: %d\n", i);
}
FOR(int, i, 0, 10, 1) {
i += 10; // is illegal and will not compile
printf("i: %d\n", i);
}
return 0;
}
I am trying to show by example that the prefix increment is more efficient than the postfix increment.
In theory this makes sense: i++ needs to be able to return the unincremented original value and therefore store it, whereas ++i can return the incremented value without storing the previous value.
But is there a good example to show this in practice?
I tried the following code:
int array[100];
int main()
{
for(int i = 0; i < sizeof(array)/sizeof(*array); i++)
array[i] = 1;
}
I compiled it using gcc 4.4.0 like this:
gcc -Wa,-adhls -O0 myfile.cpp
I did this again, with the postfix increment changed to a prefix increment:
for(int i = 0; i < sizeof(array)/sizeof(*array); ++i)
The result is identical assembly code in both cases.
This was somewhat unexpected. It seemed that by turning off optimizations (with -O0) I should see a difference that demonstrates the concept. What am I missing? Is there a better example to show this?
In the general case, the post-increment will result in a copy where a pre-increment will not. Of course this will be optimized away in a large number of cases, and in the cases where it isn't, the copy operation will be negligible (i.e., for built-in types).
Here's a small example that shows the potential inefficiency of post-increment.
#include <stdio.h>
class foo
{
public:
int x;
foo() : x(0) {
printf( "construct foo()\n");
};
foo( foo const& other) {
printf( "copy foo()\n");
x = other.x;
};
foo& operator=( foo const& rhs) {
printf( "assign foo()\n");
x = rhs.x;
return *this;
};
foo& operator++() {
printf( "preincrement foo\n");
++x;
return *this;
};
foo operator++( int) {
printf( "postincrement foo\n");
foo temp( *this);
++x;
return temp;
};
};
int main()
{
foo bar;
printf( "\n" "preinc example: \n");
++bar;
printf( "\n" "postinc example: \n");
bar++;
}
The results from an optimized build (which actually removes a second copy operation in the post-increment case due to RVO):
construct foo()
preinc example:
preincrement foo
postinc example:
postincrement foo
copy foo()
In general, if you don't need the semantics of the post-increment, why take the chance that an unnecessary copy will occur?
Of course, it's good to keep in mind that a custom operator++() - either the pre or post variant - is free to return whatever it wants (or even do whatever it wants), and I'd imagine that there are quite a few that don't follow the usual rules. Occasionally I've come across implementations that return "void", which makes the usual semantic difference go away.
You won't see any difference with integers. You need to use iterators or something where post and prefix really do something different. And you need to turn all optimisations on, not off!
I like to follow the rule of "say what you mean".
++i simply increments. i++ increments and has a special, non-intuitive result of evaluation. I only use i++ if I explicitly want that behavior, and use ++i in all other cases. If you follow this practice, when you do see i++ in code, it's obvious that post-increment behavior really was intended.
Several points:
First, you're unlikely to see a major performance difference in any way
Second, your benchmarking is useless if you have optimizations disabled. What we want to know is if this change gives us more or less efficient code, which means that we have to use it with the most efficient code the compiler is able to produce. We don't care whether it is faster in unoptimized builds, we need to know if it is faster in optimized ones.
For built-in datatypes like integers, the compiler is generally able to optimize the difference away. The problem mainly occurs for more complex types with overloaded increment iterators, where the compiler can't trivially see that the two operations would be equivalent in the context.
You should use the code that most clearly expresses your intent. Do you want to "add one to the value", or "add one to the value, but keep working on the original value a bit longer"? Usually the former is the case, and then a pre-increment better expresses your intent.
If you want to show the difference, the simplest option is simply to implement both operators, and point out that one requires an extra copy while the other does not.
This code and its comments should demonstrate the differences between the two.
struct a {
    int index;
    some_ridiculously_big_type big;
    //etc...
};

// prefix ++a
a& operator++ (a& _a) {
    ++_a.index;
    return _a;
}

// postfix a++
a operator++ (a& _a, int) {
    a copy = _a;   // copies the whole object, including the *big* member
    ++_a.index;
    return copy;
}

// now the program
int main (void) {
    a my_a;

    // prefix:
    // 1. updates my_a.index
    // 2. copies my_a.index to b
    int b = (++my_a).index;

    // postfix:
    // 1. creates a copy of my_a, including the *big* member
    // 2. updates my_a.index
    // 3. copies index out of the **copy** of my_a that was created in step 1
    int c = (my_a++).index;
}
You can see that the postfix has an extra step (step 1) which involves creating a copy of the object. This has implications for both memory consumption and runtime. That is why prefix is more efficient than postfix for non-basic types.
Depending on some_ridiculously_big_type and also on whatever you do with the result of the incrememt, you'll be able to see the difference either with or without optimizations.
In response to Mihail, this is a somewhat more portable version of his code:
#include <cstdio>
#include <ctime>
using namespace std;
#define SOME_BIG_CONSTANT 100000000
#define OUTER 40
int main( int argc, char * argv[] ) {
int d = 0;
time_t now = time(0);
if ( argc == 1 ) {
for ( int n = 0; n < OUTER; n++ ) {
int i = 0;
while(i < SOME_BIG_CONSTANT) {
d += i++;
}
}
}
else {
for ( int n = 0; n < OUTER; n++ ) {
int i = 0;
while(i < SOME_BIG_CONSTANT) {
d += ++i;
}
}
}
int t = time(0) - now;
printf( "%d\n", t );
return d % 2;
}
The outer loops are there to allow me to fiddle the timings to get something suitable on my platform.
I don't use VC++ any more, so I compiled it (on Windows) with:
g++ -O3 t.cpp
I then ran it by alternating:
a.exe
and
a.exe 1
My timing results were approximately the same for both cases. Sometimes one version would be faster by up to 20% and sometimes the other. This I would guess is due to other processes running on my system.
Try using a while loop, or do something with the returned value, e.g.:
#define SOME_BIG_CONSTANT 1000000000
int _tmain(int argc, _TCHAR* argv[])
{
int i = 1;
int d = 0;
DWORD d1 = GetTickCount();
while(i < SOME_BIG_CONSTANT + 1)
{
d += i++;
}
DWORD t1 = GetTickCount() - d1;
printf("%d", d);
printf("\ni++ > %d <\n", t1);
i = 0;
d = 0;
d1 = GetTickCount();
while(i < SOME_BIG_CONSTANT)
{
d += ++i;
}
t1 = GetTickCount() - d1;
printf("%d", d);
printf("\n++i > %d <\n", t1);
return 0;
}
Compiled with VS 2005 using /O2 or /Ox, tried on my desktop and on laptop.
On the laptop I stably get something like the following; on the desktop the numbers are a bit different (but the ratio is about the same):
i++ > 8xx <
++i > 6xx <
xx means that numbers are different e.g. 813 vs 640 - still around 20% speed up.
And one more point - if you replace "d +=" with "d = " you will see nice optimization trick:
i++ > 935 <
++i > 0 <
However, it's quite specific. And after all, I don't see any reason to change my mind and conclude that there is no difference :)
Perhaps you could just show the theoretical difference by writing out both versions with x86 assembly instructions? As many people have pointed out before, compiler will always make its own decisions on how best to compile/assemble the program.
If the example is meant for students not familiar with the x86 instruction set, you might consider using the MIPS32 instruction set -- for some odd reason many people seem to find it to be easier to comprehend than x86 assembly.
OK, all this prefix/postfix "optimization" is just... a big misunderstanding.
The major idea is that i++ returns its original value and thus requires copying that value.
This may be correct for some inefficient implementations of iterators. However, in 99% of cases, even with STL iterators, there is no difference, because the compiler knows how to optimize it and the actual iterators are just pointers that look like a class. And of course there is no difference for primitive types like integers or pointers.
So... forget about it.
EDIT: Clarification
As I mentioned, most STL iterator classes are just pointers wrapped in classes, with all member functions inlined, which allows optimizing away the irrelevant copy.
And yes, if you have your own iterators without inlined member functions, then it may work slower. But you should understand what the compiler does and what it does not.
As a small proof, take this code:
#include <vector>
#include <set>
using namespace std;

int sum1(vector<int> const &v)
{
    int n = 0;
    for(auto x=v.begin();x!=v.end();x++)
        n+=*x;
    return n;
}

int sum2(vector<int> const &v)
{
    int n = 0;
    for(auto x=v.begin();x!=v.end();++x)
        n+=*x;
    return n;
}

int sum3(set<int> const &v)
{
    int n = 0;
    for(auto x=v.begin();x!=v.end();x++)
        n+=*x;
    return n;
}

int sum4(set<int> const &v)
{
    int n = 0;
    for(auto x=v.begin();x!=v.end();++x)
        n+=*x;
    return n;
}
Compile it to assembly and compare sum1 with sum2, and sum3 with sum4...
I can just tell you: gcc gives exactly the same code with -O2.