Should division by one get a special case? (C++)

If we have a division by one in an inner loop, is it smart to add special case treatment to eliminate the division:
BEFORE:

    int collapseFactorDepth...
    for (int i = 0; i < numPixels; i++)
    {
        pDataTarget[i] += pPixelData[i] / collapseFactorDepth;
    }

AFTER:

    if (collapseFactorDepth != 1)
    {
        for (int i = 0; i < numPixels; i++)
        {
            pDataTarget[i] += pPixelData[i] / collapseFactorDepth;
        }
    }
    else
    {
        for (int i = 0; i < numPixels; i++)
        {
            pDataTarget[i] += pPixelData[i];
        }
    }
Can the compiler work this out by itself? Do modern CPUs contain any means of optimizing this?
I am particularly interested in whether you consider the additional code worthwhile compared to the performance gain (is there any?).
Background:
numPixels is big.
collapseFactorDepth is 1 about 90% of the time.
Modern CPUs: Intel x86/amd64 architecture
Please don't consider wider issues; the memory overhead of loading the data is already optimized.
Let's not sweat that we should probably do this as a double multiplication anyway.

As a general rule, the answer is no. Write clear code first, and optimize it later, when the profiler tells you that you have a problem.
The only way to answer whether this particular optimization will help in this particular hotspot is: "measure it and see".
Unless collapseFactorDepth is almost always 1, or numPixels is very large (at least thousands and possibly more), I would not expect the optimization to help (branches are expensive).
You are much more likely to benefit from using SSE or similar SIMD instructions.
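As a sketch of why: the question already concedes that a double multiplication is on the table. Precomputing the reciprocal once turns the per-pixel divide into a multiply, which vectorizes readily. collapse is a hypothetical name, and note the rounding caveat in the comment:

    // Hedged sketch: hoist the divide out as one reciprocal multiply.
    // NOTE: rounding differs from exact integer division; for example
    // 6 * (1.0 / 3) == 1.999...8, which truncates to 1, not 2.
    void collapse(int* pDataTarget, const int* pPixelData,
                  int numPixels, int collapseFactorDepth)
    {
        const double inv = 1.0 / collapseFactorDepth; // one divide in total
        for (int i = 0; i < numPixels; i++)
            pDataTarget[i] += static_cast<int>(pPixelData[i] * inv);
    }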

Follow Martin Bonner's advice. Optimize when you need to.
When you need to:
    int identity(int pixel)
    {
        return pixel;
    }

    template<int collapseFactorDepth>
    int div(int pixel)
    {
        return pixel / collapseFactorDepth;
    }

    struct Div
    {
        int collapseFactorDepth_;

        explicit Div(int collapseFactorDepth)
            : collapseFactorDepth_(collapseFactorDepth) {}

        int operator()(int pixel) const
        {
            return pixel / collapseFactorDepth_;
        }
    };

    template<typename T>
    void fn(int* pDataTarget, const int* pPixelData, int numPixels, T op)
    {
        for (int i = 0; i < numPixels; i++)
        {
            pDataTarget[i] += op(pPixelData[i]);
        }
    }

    void fn(int* pDataTarget, const int* pPixelData, int numPixels)
    {
        fn(pDataTarget, pPixelData, numPixels, identity);
    }

    template<int collapseFactorDepth>
    void fnComp(int* pDataTarget, const int* pPixelData, int numPixels)
    {
        fn(pDataTarget, pPixelData, numPixels, div<collapseFactorDepth>);
    }

    void fn(int* pDataTarget, const int* pPixelData, int numPixels, int collapseFactorDepth)
    {
        fn(pDataTarget, pPixelData, numPixels, Div(collapseFactorDepth));
    }
This gives you a convenient default behaviour, a compile-time divide (which may be faster than a divide by a runtime int) when the divisor is known, and a way (passing a Div) to specify the behaviour at runtime.
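For illustration, hypothetical call sites for the three variants (the buffers and values here are made up):

    const int N = 4;
    int target[N] = {};
    const int pixels[N] = {16, 32, 48, 64};

    fn(target, pixels, N);        // default: divisor is 1, no division at all
    fnComp<4>(target, pixels, N); // divisor known at compile time
    fn(target, pixels, N, 3);     // divisor known only at runtime (wraps Div)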

Enable fast-math in Clang on a per-function basis?

GCC provides a way to selectively optimize a function or section of code with fast-math using attributes. Is there a way to enable the same in Clang with pragmas or attributes?
I understand Clang provides some pragmas to specify floating-point flags, but none of those pragmas enable fast-math.
PS: A similar question was asked before but was not answered in the context of Clang.
UPD: I missed that you mentioned pragmas. Option 1 is fast-math for a single operation, as far as I understand. I am not sure about flushing subnormals; I hope that is not affected.
I didn't find per-function options, but I did find two pragmas that can help.
Let's say we want a dot product.
Option 1:

    #include <cstddef>

    float innerProductF32(const float* a, const float* b, std::size_t size) {
        float res = 0.f;
        for (std::size_t i = 0; i != size; ++i) {
            #pragma float_control(precise, off)
            res += a[i] * b[i];
        }
        return res;
    }
Option 2:

    #include <cstddef>

    float innerProductF32(const float* a, const float* b, std::size_t size) {
        float res = 0.f;
        _Pragma("clang loop vectorize(enable) interleave(enable)")
        for (std::size_t i = 0; i != size; ++i) {
            res += a[i] * b[i];
        }
        return res;
    }
The second one is less powerful (it does not generate fma instructions), but maybe fma is not what you want anyway.
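For contrast, here is a sketch of the GCC-only attribute the question alludes to; Clang does not honor it, which is the whole problem:

    // GCC-specific function attribute; Clang does not apply it.
    __attribute__((optimize("fast-math")))
    float innerProductGcc(const float* a, const float* b, std::size_t size) {
        float res = 0.f;
        for (std::size_t i = 0; i != size; ++i)
            res += a[i] * b[i];
        return res;
    }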

Performance gap between vector<bool> and array

I was trying to solve a coding problem in C++ which counts the number of prime numbers less than a non-negative number n.
So I first came up with some code:
    int countPrimes(int n) {
        vector<bool> flag(n+1, 1);
        for (int i = 2; i < n; i++)
        {
            if (flag[i] == 1)
                for (long j = i; i*j < n; j++)
                    flag[i*j] = 0;
        }
        int result = 0;
        for (int i = 2; i < n; i++)
            result += flag[i];
        return result;
    }
which takes 88 ms and uses 8.6 MB of memory. Then I changed my code into:
    int countPrimes(int n) {
        // vector<bool> flag(n+1, 1);
        bool flag[n+1];
        fill(flag, flag+n+1, true);
        for (int i = 2; i < n; i++)
        {
            if (flag[i] == 1)
                for (long j = i; i*j < n; j++)
                    flag[i*j] = 0;
        }
        int result = 0;
        for (int i = 2; i < n; i++)
            result += flag[i];
        return result;
    }
which takes 28 ms and 9.9 MB. I don't really understand why there is such a performance gap in both running time and memory consumption. I have read related questions like this one and that one, but I am still confused.
EDIT: I reduced the running time to 40 ms with 11.5 MB of memory after replacing vector<bool> with vector<char>.
std::vector<bool> isn't like any other vector. The documentation says:
std::vector<bool> is a possibly space-efficient specialization of
std::vector for the type bool.
That's why it may use up less memory than an array, because it might represent multiple boolean values with one byte, like a bitset. It also explains the performance difference, since accessing it isn't as simple anymore. According to the documentation, it doesn't even have to store it as a contiguous array.
std::vector<bool> is a special case. It is a specialized template: each value is stored in a single bit, so bit operations are needed to access it. This is memory-compact, but it has a couple of drawbacks (for example, there is no way to obtain a pointer to a bool inside this container).
With bool flag[n+1]; the compiler will usually allocate the memory in the same manner as for char flag[n+1]; and it will do that on the stack, not on the heap.
Now, depending on page sizes, cache misses, and the values of i, one can be faster than the other. It is hard to predict (for small n the array will be faster, but for larger n the result may change).
As an interesting experiment, you can change std::vector<bool> to std::vector<char>. In this case you will have a memory layout similar to the array's, but it will be located on the heap rather than the stack.
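To make the pointer drawback concrete, a minimal standard-C++ illustration (not tied to any particular implementation):

    #include <vector>

    void demo() {
        std::vector<bool> v(8, false);
        // bool* p = &v[0];   // does not compile: operator[] returns a proxy
        auto ref = v[0];      // std::vector<bool>::reference, not bool&
        ref = true;           // assignment writes through to the packed bit
    }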
I'd like to add some remarks to the good answers already posted.
The performance differences between std::vector<bool> and std::vector<char> may vary (a lot) between different library implementations and different sizes of the vectors.
See e.g. those quick benches: clang++ / libc++(LLVM) vs. g++ / libstdc++(GNU).
This: bool flag[n+1]; declares a variable-length array, which (despite some performance advantages due to its being allocated on the stack) has never been part of the C++ standard, even though some (C99-compliant) compilers provide it as an extension.
Another way to increase the performances could be to reduce the amount of calculations (and memory occupation) by considering only the odd numbers, given that all the primes except for 2 are odd.
If you can bear the less readable code, you could try to profile the following snippet.
    int countPrimes(int n)
    {
        if (n < 2)
            return 0;

        // Sieve from 3 up to n; the number of odd numbers between 3 and n is n / 2 - 1.
        int sieve_size = n / 2 - 1;
        std::vector<char> sieve(sieve_size);
        int result = 1;  // 2 is a prime.

        for (int i = 0; i < sieve_size; ++i)
        {
            if (sieve[i] == 0)
            {
                // It's a prime; no need to scan the vector again.
                ++result;
                // Some ugly index transformations are needed here:
                // slot i holds the odd number i * 2 + 3.
                int prime = i * 2 + 3;
                for (int j = prime * 3, k = prime * 2; j <= n; j += k)
                    sieve[j / 2 - 1] = 1;
            }
        }
        return result;
    }
Edit
As Peter Cordes noted in the comments, using an unsigned type for the variable j lets the compiler implement j/2 as cheaply as possible: signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs far enough to prove that j will always be non-negative.
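A sketch of that change, touching only the inner loop (the cast is safe here because the early return guarantees n >= 2):

    // With unsigned j, j / 2 can compile to a plain right shift.
    for (unsigned j = prime * 3, k = prime * 2; j <= unsigned(n); j += k)
        sieve[j / 2 - 1] = 1;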
It's also possible to reduce the number of candidates by exploiting the fact that all primes (past 2 and 3) are one below or one above a multiple of 6.
I am getting different timings and memory usage than the ones mentioned in the question when compiling with g++-7.4.0 -g -march=native -O2 -Wall and running on a Ryzen 5 1600 CPU:
vector<bool>: 0.038 seconds, 3344 KiB memory, IPC 3.16
vector<char>: 0.048 seconds, 12004 KiB memory, IPC 1.52
bool[N]: 0.050 seconds, 12644 KiB memory, IPC 1.69
Conclusion: vector<bool> is the fastest option because of its higher IPC (instructions per clock).
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <vector>

    size_t countPrimes(size_t n) {
        std::vector<bool> flag(n+1, 1);
        //std::vector<char> flag(n+1, 1);
        //bool flag[n+1]; std::fill(flag, flag+n+1, true);
        for (size_t i = 2; i < n; i++) {
            if (flag[i] == 1) {
                for (size_t j = i; i*j < n; j++) {
                    flag[i*j] = 0;
                }
            }
        }
        size_t result = 0;
        for (size_t i = 2; i < n; i++) {
            result += flag[i];
        }
        return result;
    }

    int main() {
        { // Raise the stack limit so the bool flag[n+1] variant doesn't overflow the stack.
            const rlim_t kStackSize = 16*1024*1024;
            struct rlimit rl;
            int result = getrlimit(RLIMIT_STACK, &rl);
            if (result != 0) abort();
            if (rl.rlim_cur < kStackSize) {
                rl.rlim_cur = kStackSize;
                result = setrlimit(RLIMIT_STACK, &rl);
                if (result != 0) abort();
            }
        }
        printf("%zu\n", countPrimes(10e6));
        return 0;
    }

Best way to parallelize this recursion using OpenMP

I have the following recursive function (NOTE: it is stripped of all unimportant details):
    int recursion(...) {
        int minimum = INFINITY;
        for (int i = 0; i < C; i++) {
            int foo = recursion(...);
            if (foo < minimum) {
                minimum = foo;
            }
        }
        return minimum;
    }
Note 2: The recursion is finite, though not in this simplified example, so please ignore that. The point of this question is how to approach this problem correctly.
I was thinking about using tasks, but I am not sure how to use them correctly; in particular, how to parallelize the inner loop.
EDIT 1: The recursion tree isn't well balanced. It is being used with a dynamic programming approach, so as time goes on, a lot of values are re-used from previous passes. This worries me a lot and I think it will be a big bottleneck.
C is somewhere around 20.
The metric for "best" is fastest :)
It will run on 2x Xeon, so there is plenty of HW power available.
Yes, you can use OpenMP tasks to exploit parallelism on multiple recursion levels and to ensure that imbalances don't cause wasted cycles.
I would collect the results in a vector and compute the minimum outside. You could also perform a guarded (critical / lock) minimum computation within the task.
Avoid spawning tasks / allocating memory for the minimum if you are too deep in the recursion, where the overhead/work ratio becomes too bad. The strongest solution is to create two separate (parallel/serial) recursive functions. That way you have zero runtime overhead once you switch to the serial function, as opposed to checking the recursion depth against a threshold every time in a unified function.
    int recursion(...) {
        int result;
        #pragma omp parallel
        #pragma omp single
        result = recursion_par(..., 0);  // a return here would jump out of the structured block
        return result;
    }

    int recursion_ser(...) {
        int minimum = INFINITY;
        for (int i = 0; i < C; i++) {
            int foo = recursion_ser(...);
            if (foo < minimum) {
                minimum = foo;
            }
        }
        return minimum;
    }

    int recursion_par(..., int depth) {
        std::vector<int> foos(C);
        for (int i = 0; i < C; i++) {
            // shared(foos) is needed: locals are firstprivate in tasks by
            // default, so writes to a private copy of foos would be lost.
            #pragma omp task shared(foos)
            {
                if (depth < threshold) {
                    foos[i] = recursion_par(..., depth + 1);
                } else {
                    foos[i] = recursion_ser(...);
                }
            }
        }
        #pragma omp taskwait
        return *std::min_element(std::begin(foos), std::end(foos));
    }
Obviously you must not do any nasty things with global / shared state within the unimportant details.
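If you would rather keep a single recursive function, a hedged alternative (OpenMP 3.1 and later) is the final clause, which makes all descendant tasks past the cutoff execute serially; note it still pays the per-call vector allocation that the two-function split avoids:

    // Sketch: tasks below the cutoff are "final" and run undeferred.
    #pragma omp task shared(foos) final(depth >= threshold)
    foos[i] = recursion_par(..., depth + 1);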

Executable runs faster on Wine than Windows -- why?

Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
    #include <cstdlib>
    #include <cmath>

    int irand(int top) {
        return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
    }

    template<typename T>
    class Vector {
        T *vec;
        const int sz;
    public:
        Vector(int n) : sz(n) {
            vec = new T[sz];
        }
        ~Vector() {
            delete [] vec;
        }
        int size() const { return sz; }
        const T & operator [] (int i) const { return vec[i]; }
        T & operator [] (int i) { return vec[i]; }
    };

    int main() {
        const int tmax = 20000; // increase this to make it run longer
        const int m = 10000;

        Vector<int> vec(150);
        for (int i = 0; i < vec.size(); ++i)
            vec[i] = 0;

        // main loop
        for (int t = 0; t < tmax; ++t)
            for (int j = 0; j < m; ++j) {
                int s = irand(100) + 1;
                vec[s] += 1;
            }

        return 0;
    }
UPDATE
It seems that if I replace irand() above with something deterministic such as
    int irand(int top) {
        static int c = 0;
        return (c++) % top;
    }
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I have replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point was to show that no system libraries are being called explicitly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or the compiler generates calls to something else?
    class ran1 {
        static const int table_len = 32;
        static const int int_max = (1u << 31) - 1;
        int idum;
        int next;
        int *shuffle_table;

        void propagate() {
            const int int_quo = 1277731;
            int k = idum / int_quo;
            idum = 16807*(idum - k*int_quo) - 2836*k;
            if (idum < 0)
                idum += int_max;
        }

    public:
        ran1() {
            shuffle_table = new int[table_len];
            seedrand(54321);
        }
        ~ran1() {
            delete [] shuffle_table;
        }
        void seedrand(int seed) {
            idum = seed;
            for (int i = table_len-1; i >= 0; i--) {
                propagate();
                shuffle_table[i] = idum;
            }
            next = idum;
        }
        double frand() {
            int i = next / (1 + (int_max-1)/table_len);
            next = shuffle_table[i];
            propagate();
            shuffle_table[i] = idum;
            return next / (int_max + 1.0);
        }
    } rng;

    int irand(int top) {
        return int(std::floor(rng.frand() * top));
    }
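(One quick way to test the floor() hypothesis, as a sketch: rng.frand() * top is non-negative here, so plain truncation gives the same result and avoids the library call entirely.)

    int irand(int top) {
        return int(rng.frand() * top); // truncation == floor for non-negative values
    }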
edit: It turned out that the culprit was floor() and not rand() as I suspected; see the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
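A minimal sketch of such a test (a hypothetical harness, not from the original post):

    #include <cstdlib>
    #include <cstdio>
    #include <ctime>

    int main() {
        std::clock_t start = std::clock();
        long long sum = 0; // checksum keeps the calls from being optimized away
        for (long i = 0; i < 100000000L; ++i)
            sum += std::rand();
        double secs = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::printf("%.2f s (checksum %lld)\n", secs, sum);
        return 0;
    }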
edit: I've had a look at the WINE source code, and here is its implementation of rand():
    /*********************************************************************
     * rand (MSVCRT.@)
     */
    int CDECL MSVCRT_rand(void)
    {
        thread_data_t *data = msvcrt_get_thread_data();

        /* this is the algorithm used by MSVC, according to
         * http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
        data->random_seed = data->random_seed * 214013 + 2531011;
        return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
    }
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.
Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions
of a Windows computer by providing alternative implementations of the
DLLs that Windows programs call,[citation needed] and a process to
substitute for the Windows NT kernel. This method of duplication
differs from other methods that might also be considered emulation,
where Windows programs run in a virtual machine.[2] Wine is
predominantly written using black-box testing reverse-engineering, to
avoid copyright issues.
This implies that the developers of Wine could replace an API call with anything at all, as long as the end result was the same as you would get with a native Windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.
From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.
While Wine is basically Windows, you're still comparing apples to oranges. And not only is it apples and oranges; the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.

Is using string.length() in loop efficient?

For example, given a string s, is this:

    for (int x = 0; x < s.length(); x++)

better than this?

    int length = s.length();
    for (int x = 0; x < length; x++)
Thanks,
Joel
In general, you should avoid function calls in the condition part of a loop, if the result does not change during the iteration.
The canonical form is therefore:
    for (std::size_t x = 0, length = s.length(); x != length; ++x)
Note 3 things here:
The initialization can initialize more than one variable
The condition is expressed with != rather than <
I use pre-increment rather than post-increment
(I also changed the type, because a negative length is nonsense, and the string interface is defined in terms of std::string::size_type, which is normally std::size_t on most implementations.)
Though... I admit that it's not so much for performance as for readability:
The double initialization means that the scope of both x and length is as tight as necessary
By memoizing the result, the reader is not left wondering whether the length may vary during iteration
Using pre-increment is usually better when you do not need a temporary holding the "old" value
In short: use the best tool for the job at hand :)
It depends on the inlining and optimization abilities of the compiler. Generally, the second variant will most likely be faster (better: it will be either faster or as fast as the first snippet, but almost never slower).
However, in most cases it doesn't matter, so people tend to prefer the first variant for its shortness.
It depends on your C++ implementation / library, the only way to be sure is to benchmark it. However, it's effectively certain that the second version will never be slower than the first, so if you don't modify the string within the loop it's a sensible optimisation to make.
How efficient do you want to be?
If you don't modify the string inside the loop, the compiler will easily see that the size doesn't change. Don't make it any more complicated than you have to!
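As a side note (not part of the original answers): since C++11, a range-based for loop sidesteps the question entirely when you only need the characters, not the index:

    for (char c : s) {
        // use c; no repeated length() call, no manual index
    }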
Although I am not necessarily encouraging you to do so, it appears it is surprisingly faster to call .length() every iteration than to store the length in an int (at least on my computer; keep in mind that I'm using an MSI gaming laptop with a 4th-gen i5, but that shouldn't really affect which way is faster).
Test code for constant call:
    #include <iostream>
    #include <string>
    using namespace std;

    int main()
    {
        string g = "01234567890";
        for (unsigned int rep = 0; rep < 25; rep++)
        {
            g += g;
        } // loop used to double the length 25 times

        int a = 0;
        //int b = g.length();
        for (unsigned int rep = 0; rep < g.length(); rep++)
        {
            a++;
        }
        return a;
    }
On average, this ran for 385ms according to Code::Blocks
And here's the code that stores the length in a variable:
    #include <iostream>
    #include <string>
    using namespace std;

    int main()
    {
        string g = "01234567890";
        for (unsigned int rep = 0; rep < 25; rep++)
        {
            g += g;
        } // loop used to double the length 25 times

        int a = 0;
        int b = g.length();
        for (unsigned int rep = 0; rep < b; rep++)
        {
            a++;
        }
        return a;
    }
And this averaged around 420ms.
I know this question already has an accepted answer, but there haven't been any practically tested answers, so I decided to throw my 2 cents in. I had the same question as you, but I didn't find any helpful answers here, so I ran my own experiment.
Is s.length() inlined so that it just returns a member variable? Then no. Otherwise, you pay the cost of the dereference and of setting up the stack frame, all the usual overhead of a function call, on each iteration.