Counting down in for-loops

Counting down in for-loops - c++

I believe (from some research reading) that counting down in for-loops is actually more efficient and faster in runtime. My full software code is C++
I currently have this:
for (i=0; i<domain; ++i) {
my 'i' is unsigned resgister int,
also 'domain' is unsigned int
in the for-loop i is used for going through an array, e.g.
array[i] = do stuff
converting this to count down messes up the expected/correct output of my routine.
I can imagine the answer being quite trivial, but I can't get my head round it.
UPDATE: 'do stuff' does not depend on previous or later iteration. The calculations within the for-loop are independant for that iteration of i. (I hope that makes sense).
UPDATE: To achieve a runtime speedup with my for-loop, do I count down and if so remove the unsigned part when delcaring my int, or what other method?
Please help.

There is only one correct method of looping backwards using an unsigned counter:
for( i = n; i-- > 0; )
{
// Use i as normal here
}
There's a trick here, for the last loop iteration you will have i = 1 at the top of the loop, i-- > 0 passes because 1 > 0, then i = 0 in the loop body. On the next iteration i-- > 0 fails because i == 0, so it doesn't matter that the postfix decrement rolled over the counter.
Very non obvious I know.

I'm guessing your backward for loop looks like this:
for (i = domain - 1; i >= 0; --i) {
In that case, because i is unsigned, it will always be greater than or equal to zero. When you decrement an unsigned variable that is equal to zero, it will wrap around to a very large number. The solution is either to make i signed, or change the condition in the for loop like this:
for (i = domain - 1; i >= 0 && i < domain; --i) {
Or count from domain to 1 rather than from domain - 1 to 0:
for (i = domain; i >= 1; --i) {
array[i - 1] = ...; // notice you have to subtract 1 from i inside the loop now
}

This is not an answer to your problem, because you don't seem to have a problem.
This kind of optimization is completely irrelevant and should be left to the compiler (if done at all).
Have you profiled your program to check that your for-loop is a bottleneck? If not, then you do not need to spend time worrying about this. Even more so, having "i" as a "register" int, as you write, makes no real sense from a performance standpoint.
Even without knowing your problem domain, I can guarantee you that both the reverse-looping technique and the "register" int counter will have negligible impact on your program's performance. Remember, "Premature optimization is the root of all evil".
That said, better spent optimization time would be on thinking about the overall program structure, data structures and algorithms used, resource utilization, etc.

Checking to see if a number is zero can be quicker or more efficient than a comparison. But this is the sort of micro-optimization you really shouldn't worry about - a few clock cycles will be greatly dwarfed by just about any other perf issue.
On x86:
dec eax
jnz Foo
Instead of:
inc eax
cmp eax, 15
jl Foo

It has nothing to do with counting up or down. What can be faster is counting toward zero. Michael's answer shows why — x86 gives you a comparison with zero as an implicit side effect of many instructions, so after you adjust your counter, you just branch based on the result instead of doing an explicit comparison. (Maybe other architectures do that, too; I don't know.)
Borland's Pascal compilers are notorious for performing that optimization. The compiler transforms this code:
for i := x to y do
foo(i);
into an internal representation more akin to this:
tmp := Succ(y - x);
i := x;
while tmp > 0 do begin
foo(i);
Inc(i);
Dec(tmp);
end;
(I say notorious not because the optimization affects the outcome of the loop, but because the debugger displays the counter variable incorrectly. When the programmer inspects i, the debugger may display the value of tmp instead, causing no end of confusion and panic for programmers who think their loops are running backward.)
The idea is that even with the extra Inc or Dec instruction, it's still a net win, in terms of running time, over doing an explicit comparison. Whether you can actually notice that difference is up for debate.
But note that the conversion is something the compiler would do automatically, based on whether it deemed the transformation worthwhile. The compiler is usually better at optimizing code than you are, so don't spend too much effort competing with it.
Anyway, you asked about C++, not Pascal. C++ "for" loops aren't quite as easy to apply that optimization to as Pascal "for" loops are because the bounds of Pascal's loops are always fully calculated before the loop runs, whereas C++ loops sometimes depend on the stopping condition and the loop contents. C++ compilers need to do some amount of static analysis to determine whether any given loop could fit the requirements for the kind of transformation Pascal loops qualify for unconditionally. If the C++ compiler does the analysis, then it could do a similar transformation.
There's nothing stopping you from writing your loops that way on your own:
for (unsigned i = 0, tmp = domain; tmp > 0; ++i, --tmp)
array[i] = do stuff
Doing that might make your code run faster. Like I said before, though, you probably won't notice. The bigger cost you pay by manually arranging your loops like that is that your code no longer follows established idioms. Your loop is a perfectly ordinary "for" loop, but it no longer looks like one — it has two variables, they're counting in opposite directions, and one of them isn't even used in the loop body — so anyone reading your code (including you, a week, a month, or a year from now when you've forgotten the "optimization" you were hoping to achieve) will need to spend extra effort proving to himself or herself that the loop is indeed an ordinary loop in disguise.
(Did you notice that my code above used unsigned variables with no danger of wrapping around at zero? Using two separate variables allows that.)
Three things to take away from all this:
Let the optimizer do its job; on the whole it's better at it than you are.
Make ordinary code look ordinary so that the special code doesn't have to compete to get attention from people reviewing, debugging, or maintaining it.
Don't do anything fancy in the name of performance until testing and profiling show it to be necessary.

If you have a decent compiler, it will optimize "counting up" just as effectively as "counting down". Just try a few benchmarks and you'll see.

So you "read" that couting down is more efficient? I find this very difficult to believe unless you show me some profiler results and the code. I can buy it under some circumstances, but in the general case, no. Seems to me like this is a classic case of premature optimization.
Your comment about "register int i" is also very telling. Nowadays, the compiler always knows better than you how to allocate registers. Don't bother using using the register keyword unless you have profiled your code.

When you're looping through data structures of any sort, cache misses have a far bigger impact than the direction you're going. Concern yourself with the bigger picture of memory layout and algorithm structure instead of trivial micro-optimisations.

You may try the following, which compiler will optimize very efficiently:
#define for_range(_type, _param, _A1, _B1) \
for (_type _param = _A1, _finish = _B1,\
_step = static_cast<_type>(2*(((int)_finish)>(int)_param)-1),\
_stop = static_cast<_type>(((int)_finish)+(int)_step); _param != _stop; \
_param = static_cast<_type>(((int)_param)+(int)_step))
Now you can use it:
for_range (unsigned, i, 10,0)
{
cout << "backwards i: " << i << endl;
}
for_range (char, c, 'z','a')
{
cout << c << endl;
}
enum Count { zero, one, two, three };
for_range (Count, c, three, zero)
{
cout << "backwards: " << c << endl;
}
You may iterate in any direction:
for_range (Count, c, zero, three)
{
cout << "forward: " << c << endl;
}
The loop
for_range (unsigned,i,b,a)
{
// body of the loop
}
will produce the following code:
mov esi,b
L1:
; body of the loop
dec esi
cmp esi,a-1
jne L1

Hard to say with information given but... reverse your array, and count down?

Jeremy Ruten rightly pointed out that using an unsigned loop counter is dangerous. It's also unnecessary, as far as I can tell.
Others have also pointed out the dangers of premature optimization. They're absolutely right.
With that said, here is a style I used when programming embedded systems many years ago, when every byte and every cycle did count for something. These forms were useful for me on the particular CPUs and compilers that I was using, but your mileage may vary.
// Start out pointing to the last elem in array
pointer_to_array_elem_type p = array + (domain - 1);
for (int i = domain - 1; --i >= 0 ; ) {
*p-- = (... whatever ...)
}
This form takes advantage of the condition flag that is set on some processors after arithmetical operations -- on some architectures, the decrement and testing for the branch condition can be combined into a single instruction. Note that using predecrement (--i) is the key here -- using postdecrement (i--) would not have worked as well.
Alternatively,
// Start out pointing *beyond* the last elem in array
pointer_to_array_elem_type p = array + domain;
for (pointer_to_array_type p = array + domain; p - domain > 0 ; ) {
*(--p) = (... whatever ...)
}
This second form takes advantage of pointer (address) arithmetic. I rarely see the form (pointer - int) these days (for good reason), but the language guarantees that when you subtract an int from a pointer, the pointer is decremented by (int * sizeof (*pointer)).
I'll emphasize again that whether these forms are a win for you depends on the CPU and compiler that you're using. They served me well on Motorola 6809 and 68000 architectures.

In some later arm cores, decrement and compare takes only a single instruction. This makes decrementing loops more efficient than incrementing ones.
I don't know why there isn't an increment-compare instruction also.
I'm surprised that this post was voted -1 when it's a true issue.

Everyone here is focusing on performance. There is actually a logical reason to iterate towards zero that can result in cleaner code.
Iterating over the last element first is convenient when you delete invalid elements by swapping with the end of the array. For bad elements not adjacent to the end we can swap into the end position, decrease the end bound of the array, and keep iterating. If you were to iterate toward the end then swapping with the end could result in swapping bad for bad. By iterating end to 0 we know that the element at the end of the array has already been proven valid for this iteration.
For further explanation...
If:
You delete bad elements by swapping with one end of the array and changing the array bounds to exclude the bad elements.
Then obviously:
You would swap with a good element i.e. one that has already been tested in this iteration.
So this implies:
If we iterate away from the variable bound then elements between the variable bound and the current iteration pointer have been proven good. Whether the iteration pointer gets ++ or -- doesn't matter. What matters is that we're iterating away from the variable bound so we know that the elements adjacent to it are good.
So finally:
Iterating towards 0 allows us to use only one variable to represent the array bounds. Whether this matters is a personal decision between you and your compiler.

What matters much more than whether you're increasing or decreasing your counter is whether or not you're going up memory or down memory. Most caches are optimized for going up memory, not down memory. Since memory access time is the bottleneck that most programs today face, this means that changing your program so that you go up memory can result in a performance boost even if this requires comparing your counter to a non-zero value. In some of my programs, I saw a significant improvement in performance by changing my code to go up memory instead of down it.
Skeptical? Here's the output that I got:
sum up = 705046256
sum down = 705046256
Ave. Up Memory = 4839 mus
Ave. Down Memory = 5552 mus
sum up = inf
sum down = inf
Ave. Up Memory = 18638 mus
Ave. Down Memory = 19053 mus
from running this program:
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
template<class Iterator, typename T>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, T a, T b) {
std::random_device rnd_device;
std::mt19937 generator(rnd_device());
std::uniform_int_distribution<T> dist(a, b);
for (auto it = start; it != one_past_end; it++)
*it = dist(generator);
return ;
}
template<class Iterator>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, double a, double b) {
std::random_device rnd_device;
std::mt19937_64 generator(rnd_device());
std::uniform_real_distribution<double> dist(a, b);
for (auto it = start; it != one_past_end; it++)
*it = dist(generator);
return ;
}
template<class RAI, class T>
inline void sum_abs_up(RAI first, RAI one_past_last, T &total) {
T sum = 0;
auto it = first;
do {
sum += *it;
it++;
} while (it != one_past_last);
total += sum;
}
template<class RAI, class T>
inline void sum_abs_down(RAI first, RAI one_past_last, T &total) {
T sum = 0;
auto it = one_past_last;
do {
it--;
sum += *it;
} while (it != first);
total += sum;
}
template<class T> std::chrono::nanoseconds TimeDown(
std::vector<T> &vec, const std::vector<T> &vec_original,
std::size_t num_repititions, T &running_sum) {
std::chrono::nanoseconds total{0};
for (std::size_t i = 0; i < num_repititions; i++) {
auto start_time = std::chrono::high_resolution_clock::now();
sum_abs_down(vec.begin(), vec.end(), running_sum);
total += std::chrono::high_resolution_clock::now() - start_time;
vec = vec_original;
}
return total;
}
template<class T> std::chrono::nanoseconds TimeUp(
std::vector<T> &vec, const std::vector<T> &vec_original,
std::size_t num_repititions, T &running_sum) {
std::chrono::nanoseconds total{0};
for (std::size_t i = 0; i < num_repititions; i++) {
auto start_time = std::chrono::high_resolution_clock::now();
sum_abs_up(vec.begin(), vec.end(), running_sum);
total += std::chrono::high_resolution_clock::now() - start_time;
vec = vec_original;
}
return total;
}
int main() {
std::size_t num_repititions = 1 << 10;
{
typedef int ValueType;
auto lower = std::numeric_limits<ValueType>::min();
auto upper = std::numeric_limits<ValueType>::max();
std::vector<ValueType> vec(1 << 24);
FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
const auto vec_original = vec;
ValueType sum_up = 0, sum_down = 0;
auto time_up = TimeUp(vec, vec_original, num_repititions, sum_up).count();
auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
std::cout << "sum up = " << sum_up << '\n';
std::cout << "sum down = " << sum_down << '\n';
std::cout << "Ave. Up Memory = " << time_up/(num_repititions * 1000) << " mus\n";
std::cout << "Ave. Down Memory = "<< time_down/(num_repititions * 1000) << " mus"
<< std::endl;
}
{
typedef double ValueType;
auto lower = std::numeric_limits<ValueType>::min();
auto upper = std::numeric_limits<ValueType>::max();
std::vector<ValueType> vec(1 << 24);
FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
const auto vec_original = vec;
ValueType sum_up = 0, sum_down = 0;
auto time_up = TimeUp(vec, vec_original, num_repititions, sum_up).count();
auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
std::cout << "sum up = " << sum_up << '\n';
std::cout << "sum down = " << sum_down << '\n';
std::cout << "Ave. Up Memory = " << time_up/(num_repititions * 1000) << " mus\n";
std::cout << "Ave. Down Memory = "<< time_down/(num_repititions * 1000) << " mus"
<< std::endl;
}
return 0;
}
Both sum_abs_up and sum_abs_down do the same thing and are timed they same way with the only difference being that sum_abs_up goes up memory while sum_abs_down goes down memory. I even pass vec by reference so that both functions access the same memory locations. Nevertheless, sum_abs_up is consistently faster than sum_abs_down. Give it a run yourself (I compiled it with g++ -O3).
FYI vec_original is there for experimentation, to make it easy for me to change sum_abs_up and sum_abs_down in a way that makes them alter vec while not allowing these changes to affect future timings.
It's important to note how tight the loop that I'm timing is. If a loop's body is large then it likely won't matter whether its iterator goes up or down memory since the time it takes to execute the loop's body will likely completely dominate. Also, it's important to mention that with some rare loops, going down memory is sometimes faster than going up it. But even with such loops it's rarely ever the case that going up was always slower than going down (unlike loops that go up memory, which are very often always faster than the equivalent down-memory loops; a small handful of times they were even 40+% faster).
The point is, as a rule of thumb, if you have the option, if the loop's body is small, and if there's little difference between having your loop go up memory instead of down it, then you should go up memory.

Related

What is the fastest implementation for accessing and changing a long array of boolean?

I want to implement a very long boolean array (as a binary genome) and access some intervals to check if that interval is all true or not, and in addition I want to change some intervals value,
For example, I can create 4 representations:
boolean binaryGenome1[10e6]={false};
vector<bool> binaryGenome2; binaryGenome2.resize(10e6);
vector<char> binaryGenome3; binaryGenome3.resize(10e6);
bitset<10e6> binaryGenome4;
and access this way:
inline bool checkBinGenome(long long start , long long end){
for(long long i = start; i < end+1 ; i++)
if(binaryGenome[i] == false)
return false;
return true;
}
inline void changeBinGenome(long long start , long long end){
for(long long i = start; i < end+1 ; i++)
binaryGenome[i] = true;
}
vector<char> and normal boolean array (ass stores every boolean in a byte) both seem to be a poor choice as I need to be efficient in space. But what are the differences between vector<bool> and bitset?
Somewhere else I read that vector has some overhead as you can choose it's size and compile time - "overhead" for what - accessing? And how much is that overhead?
As I want to access array elements many times using CheckBinGenome() and changeBinGenome(), what is the fastest implementation?

Use std::bitset It's the best.

If the length of the data is known at compile time, consider std::array<bool> or std::bitset. The latter is likely to be more space-efficient (you'll have to measure whether the associated extra work in access times outweighs the speed gain from reducing cache pressure - that will depend on your workload).
If your array's length is not fixed, then you'll need a std::vector<bool> or std::vector<char>; there's also boost::dynamic_bitset but I've never used that.
If you will be changing large regions at once, as your sample implies, it may well be worth constructing your own representation and manipulating the underlying storage directly, rather than one bit at a time through the iterators. For example, if you use an array of char as the underlying representation, then setting a large range to 0 or 1 is mostly a memset() or std::fill() call, with computation only for the values at the start and end of the range. I'd start with a simple implementation and a good set of unit tests before trying anything like that.
It is (at least theoretically) possible that your Standard Library has specialized versions of algorithms for the iterators of std::vector<bool>, std::array<bool> and/or std::bitset that do exactly the above, or you may be able to write and contribute such specializations. That's a better path if possible - the world may thank you, and you'll have shared some of the maintenance responsibility.
Important note
If using std::array<bool>, you do need to be aware that, unlike other std::array<> instantiations, it does not implement the standard container semantics. That's not to say it shouldn't be used, but make sure you understand its foibles!

E.g., checking whether all the elements are true
I am really NOT sure whether this will give us more overheads than speedup or not. Actually I think that nowadays CPU can do this quite fast, are you really experiencing a poor performance? (or is this just a skeleton of your real problem?)
#include <omp.h>
#include <iostream>
#include <cstring>
using namespace std;
#define N 10000000
bool binaryGenome[N];
int main() {
memset(binaryGenome, true, sizeof(bool) * N);
int shouldBreak = 0;
bool result = true;
cout << result << endl;
binaryGenome[9999995] = false;
bool go = true;
uint give = 0;
#pragma omp parallel
{
uint start, stop;
#pragma omp critical
{
start = give;
give += N / omp_get_num_threads();
stop = give;
if (omp_get_thread_num() == omp_get_num_threads() - 1)
stop = N;
}
while (start < stop && go) {
if (!binaryGenome[start]) {
cout << start << endl;
go = false;
result = false;
}
++start;
}
}
cout << result << endl;
}

running speed of permutation function using different methods results in unexpected results

I have implemented a isPermutation function which given two string will return true if the two are permutation of each other, otherwise it will return false.
One uses c++ sort algorithm twice, while the other uses an array of ints to keep track of string count.
I ran the code several times and every time the sorting method is faster. Is my array implementation wrong?
Here is the output:
1
0
1
Time: 0.088 ms
1
0
1
Time: 0.014 ms
And the code:
#include <iostream> // cout
#include <string> // string
#include <cstring> // memset
#include <algorithm> // sort
#include <ctime> // clock_t
using namespace std;
#define MAX_CHAR 255
void PrintTimeDiff(clock_t start, clock_t end) {
std::cout << "Time: " << (end - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
}
// using array to keep a count of used chars
bool isPermutation(string inputa, string inputb) {
int allChars[MAX_CHAR];
memset(allChars, 0, sizeof(int) * MAX_CHAR);
for(int i=0; i < inputa.size(); i++) {
allChars[(int)inputa[i]]++;
}
for (int i=0; i < inputb.size(); i++) {
allChars[(int)inputb[i]]--;
if(allChars[(int)inputb[i]] < 0) {
return false;
}
}
return true;
}
// using sorting anc comparing
bool isPermutation_sort(string inputa, string inputb) {
std::sort(inputa.begin(), inputa.end());
std::sort(inputb.begin(), inputb.end());
if(inputa == inputb) return true;
return false;
}
int main(int argc, char* argv[]) {
clock_t start = clock();
cout << isPermutation("god", "dog") << endl;
cout << isPermutation("thisisaratherlongerinput","thisisarathershorterinput") << endl;
cout << isPermutation("armen", "ramen") << endl;
PrintTimeDiff(start, clock());
start = clock();
cout << isPermutation_sort("god", "dog") << endl;
cout << isPermutation_sort("thisisaratherlongerinput","thisisarathershorterinput") << endl;
cout << isPermutation_sort("armen", "ramen") << endl;
PrintTimeDiff(start, clock());
return 0;
}

To benchmark this you have to eliminate all the noise you can.
The easiest way to do this is to wrap it in a loop that repeats the call to each 1000 times or so, then only spit out the value every 10 iterations. This way they each have a similar caching profile. Throw away values that are bogus due (eg blowouts due to context switches by the OS).
I got your method marginally faster by doing this. An excerpt.
method 1 array Time: 0.768 us
method 2 sort Time: 0.840333 us
method 1 array Time: 0.621333 us
method 2 sort Time: 0.774 us
method 1 array Time: 0.769 us
method 2 sort Time: 0.856333 us
method 1 array Time: 0.766 us
method 2 sort Time: 0.850333 us
method 1 array Time: 0.802667 us
method 2 sort Time: 0.89 us
method 1 array Time: 0.778 us
method 2 sort Time: 0.841333 us
I used rdtsc which works better for me on this system. 3000 cycles per microsecond is close enough for this, but please do make it more accurate if you care about precision of the readings.
#if defined(__x86_64__)
static uint64_t rdtsc()
{
uint64_t hi, lo;
__asm__ __volatile__ (
"xor %%eax, %%eax\n"
"cpuid\n"
"rdtsc\n"
: "=a"(lo), "=d"(hi)
:: "ebx", "ecx");
return (hi << 32)|lo;
}
#else
#error wrong architecture - implement me
#endif
void PrintTimeDiff(uint64_t start, uint64_t end) {
std::cout << "Time: " << (end - start)/double(3000) << " us" << std::endl;
}

you cannot check performance differences between implementations putting in the mix calls to std::cout. isPermutation and isPermutation_sort are some order of magnitude faster than a call to std::cout (and, anyway, prefer \n over std::endl).
for testing you have to activate compiler optimizations. Doing so the compiler will apply the loop-invariant code motion optimization and you'll probably get the same results for both implementations.
A more effective way of testing is:
int main()
{
const std::vector<std::string> bag
{
"god", "dog", "thisisaratherlongerinput", "thisisarathershorterinput",
"armen", "ramen"
};
static std::mt19937 engine;
std::uniform_int_distribution<std::size_t> rand(0, bag.size() - 1);
const unsigned stop = 1000000;
unsigned counter = 0;
std::clock_t start = std::clock();
for (unsigned i(0); i < stop; ++i)
counter += isPermutation(bag[rand(engine)], bag[rand(engine)]);
std::cout << counter << '\n';
PrintTimeDiff(start, clock());
counter = 0;
start = std::clock();
for (unsigned i(0); i < stop; ++i)
counter += isPermutation_sort(bag[rand(engine)], bag[rand(engine)]);
std::cout << counter << '\n';
PrintTimeDiff(start, clock());
return 0;
}
I have 2.4s for isPermutations_sort vs 2s for isPermutation (somewhat similar to Hal's results). Same with g++ and clang++.
Printing the value of counter has the double benefit of:
triggering the as-if rule (the compiler cannot remove the for loops);
allowing a first check of your implementations (the two values cannot be too distant).
There're some things you have to change in your implementation of isPermutation:
pass arguments as const references
bool isPermutation(const std::string &inputa, const std::string &inputb)
just this change brings the time down to 0.8s (of course you cannot do the same with isPermutation_sort).
you can use std::array and std::fill instead of memset (this is C++ :-)
avoid premature pessimization and prefer preincrement. Only use postincrement if you're going to use the original value
do not mix signed and unsigned value in the for loops (inputa.size() and i). i should be declared as std::size_t
even better, use the range based for loop.
So something like:
bool isPermutation(const std::string &inputa, const std::string &inputb)
{
std::array<int, MAX_CHAR> allChars;
allChars.fill(0);
for (auto c : inputa)
++allChars[(unsigned char)c];
for (auto c : inputb)
{
--allChars[(unsigned char)c];
if (allChars[(unsigned char)c] < 0)
return false;
}
return true;
}
Anyway both isPermutation and isPermutation_sort should have this preliminary check:
if (inputa.length() != inputb.length())
return false;
Now we are at 0.55s for isPermutation vs 1.1s for isPermutation_sort.
Last but not least consider std::is_permutation:
for (unsigned i(0); i < stop; ++i)
{
const std::string &s1(bag[rand(engine)]), &s2(bag[rand(engine)]);
counter += std::is_permutation(s1.begin(), s1.end(), s2.begin());
}
(0.6s)
EDIT
As observed in BeyelerStudios' comment Mersenne-Twister is too much in this case.
You can change the engine to a simpler one.:
static std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647> engine;
This further lowers the timings. Luckily the relative speeds remain the same.
Just to be sure I've also checked with a non random access scheme obtaining the same relative results.

Your idea amounts to using a Counting Sort on both strings, but with the comparison happening on the count array, rather than after writing out sorted strings.
It works well because a byte can only have one of 255 non-zero values. Zeroing 256B of memory, or even 4*256B, is pretty cheap, so it works well even for fairly short strings, where most of the count array isn't touched.
It should be fairly good for very long strings, at least in some cases. It's pretty heavily dependent on a good and a heavily pipelined L1 cache, because scattered increments to the count array produces scattered read-modify-writes. Repeated occurrences create a dependency chain of with a store-load round-trip in it. This is a big glass-jaw for this algorithm, on CPUs where many loads and stores can be in flight at once (with their latencies happening in parallel). Modern x86 CPUs should run it pretty well, since they can sustain a load + store every clock cycle.
The initial count of inputa compiles to a very tight loop:
.L15:
movsx rdx, BYTE PTR [rax]
add rax, 1
add DWORD PTR [rsp-120+rdx*4], 1
cmp rax, rcx
jne .L15
This brings us to the first major bug in your code: char can be signed or unsigned. In the x86-64 ABI, char is signed, so allChars[(int)inputa[i]]++; sign-extends it for use as an array index. (movsx instead of movzx). Your code will write outside the array bounds on non-ASCII characters that have their high bit set. So you should have written allChars[(unsigned char)inputa[i]]++;. Note that casting to (unsigned) doesn't give the result we want (see comments).
Note that clang makes much worse code (v3.7.1 and v3.8, both with -O3), with a function call to std::basic_string<...>::_M_leak_hard() inside the inner loop. (Leak as in leak a reference, I think.) #manlio's version doesn't have this problem, so I guess for (auto c : inputa) syntax helps clang figure out what's happening.
Also, using std::string when your callers have char[] forces them to construct a std::string. That's kind of silly, but it is helpful to be able to compare string lengths.
GNU libc's std::is_permutation uses a very different strategy:
First, it skips any common prefix that's identical without permutation in both strings.
Then, for each element in inputa:
count the occurrences of that element in inputb. Check that it matches the count in inputa.
There are a couple optimizations:
Only compare counts the first time an element is seen: find duplicates by searching from the beginning of inputa, and if the match position isn't the current position, we've already checked this element.
check that the match count in inputb is != 0 before counting matches in the rest of inputa.
This doesn't need any temporary storage, so it can work when the elements are large. (e.g. an array of int64_t, or an array of structs).
If there is a mismatch, this is likely to find it early, before doing as much work. There are probably a few cases of inputs where the counting version would take less time, but probably for most inputs the library algorithm is best.
std::is_permutation uses std::count, which should be implemented very well with SSE / AVX vectors. Unfortunately, it's auto-vectorized in a really stupid way by both gcc and clang. It unpacks bytes to 64bit integers before accumulating them in vector elements, to avoid overflow. So it spends most of its instructions shuffling data around, and is probably slower than a scalar implementation (which you'd get from compiling with -O2, or with -O3 -fno-tree-vectorize).
It could and should only do this every few iterations, so the inner loop of count can just be something like pcmpeqb / psubb, with a psadbw every 255 iterations. Or pcmpeqb / pmovmskb / popcnt / add, but that's slower.
Template specializations in the library could help a lot for std::count for 8, 16, and 32bit types whose equality can be checked with bitwise equality (integer ==).

How to zero a vector<bool>?

I have a vector<bool> and I'd like to zero it out. I need the size to stay the same.
The normal approach is to iterate over all the elements and reset them. However, vector<bool> is a specially optimized container that, depending on implementation, may store only one bit per element. Is there a way to take advantage of this to clear the whole thing efficiently?
bitset, the fixed-length variant, has the set function. Does vector<bool> have something similar?

There seem to be a lot of guesses but very few facts in the answers that have been posted so far, so perhaps it would be worthwhile to do a little testing.
#include <vector>
#include <iostream>
#include <time.h>
int seed(std::vector<bool> &b) {
srand(1);
for (int i = 0; i < b.size(); i++)
b[i] = ((rand() & 1) != 0);
int count = 0;
for (int i = 0; i < b.size(); i++)
if (b[i])
++count;
return count;
}
int main() {
std::vector<bool> bools(1024 * 1024 * 32);
int count1= seed(bools);
clock_t start = clock();
bools.assign(bools.size(), false);
double using_assign = double(clock() - start) / CLOCKS_PER_SEC;
int count2 = seed(bools);
start = clock();
for (int i = 0; i < bools.size(); i++)
bools[i] = false;
double using_loop = double(clock() - start) / CLOCKS_PER_SEC;
int count3 = seed(bools);
start = clock();
size_t size = bools.size();
bools.clear();
bools.resize(size);
double using_clear = double(clock() - start) / CLOCKS_PER_SEC;
int count4 = seed(bools);
start = clock();
std::fill(bools.begin(), bools.end(), false);
double using_fill = double(clock() - start) / CLOCKS_PER_SEC;
std::cout << "Time using assign: " << using_assign << "\n";
std::cout << "Time using loop: " << using_loop << "\n";
std::cout << "Time using clear: " << using_clear << "\n";
std::cout << "Time using fill: " << using_fill << "\n";
std::cout << "Ignore: " << count1 << "\t" << count2 << "\t" << count3 << "\t" << count4 << "\n";
}
So this creates a vector, sets some randomly selected bits in it, counts them, and clears them (and repeats). The setting/counting/printing is done to ensure that even with aggressive optimization, the compiler can't/won't optimize out our code to clear the vector.
I found the results interesting, to say the least. First the result with VC++:
Time using assign: 0.141
Time using loop: 0.068
Time using clear: 0.141
Time using fill: 0.087
Ignore: 16777216 16777216 16777216 16777216
So, with VC++, the fastest method is what you'd probably initially think of as the most naive -- a loop that assigns to each individual item. With g++, the results are just a tad different though:
Time using assign: 0.002
Time using loop: 0.08
Time using clear: 0.002
Time using fill: 0.001
Ignore: 16777216 16777216 16777216 16777216
Here, the loop is (by far) the slowest method (and the others are basically tied -- the 1 ms difference in speed isn't really repeatable).
For what it's worth, in spite of this part of the test showing up as much faster with g++, the overall times were within 1% of each other (4.944 seconds for VC++, 4.915 seconds for g++).

Try
v.assign(v.size(), false);
Have a look at this link:
http://www.cplusplus.com/reference/vector/vector/assign/
Or the following
std::fill(v.begin(), v.end(), 0)

You are out of luck. std::vector<bool> is a specialization that apparently does not even guarantee contiguous memory or random access iterators (or even forward?!), at least based on my reading of cppreference -- decoding the standard would be the next step.
So write implementation specific code, pray and use some standard zeroing technique, or do not use the type. I vote 3.
The recieved wisdom is that it was a mistake, and may become deprecated. Use a different container if possible. And definitely do not mess around with the internal guts, or rely on its packing. Check if you have dynamic bitset in your std library mayhap, or roll your own wrapper around std::vector<unsigned char>.

I ran into this as a performance issue recently. I hadn't tried looking for answers on the web but did find that using assignment with the constructor was 10x faster using g++ O3 (Debian 4.7.2-5) 4.7.2. I found this question because I was looking to avoid the additional malloc. Looks like the assign is optimized as well as the constructor and about twice as good in my benchmark.
unsigned sz = v.size(); for (unsigned ii = 0; ii != sz; ++ii) v[ii] = false;
v = std::vector(sz, false); // 10x faster
v.assign(sz, false); > // 20x faster
So, I wouldn't say to shy away from using the specialization of vector<bool>; just be very cognizant of the bit vector representation.

Use the std::vector<bool>::assign method, which is provided for this purpose.
If an implementation is specific for bool, then assign, most likely, also implemented appropriately.

If you're able to switch from vector<bool> to a custom bit vector representation, then you can use a representation designed specifically for fast clear operations, and get some potentially quite significant speedups (although not without tradeoffs).
The trick is to use integers per bit vector entry and a single 'rolling threshold' value that determines which entries actually then evaluate to true.
You can then clear the bit vector by just increasing the single threshold value, without touching the rest of the data (until the threshold overflows).
A more complete write up about this, and some example code, can be found here.

It seems that one nice option hasn't been mentioned yet:
auto size = v.size();
v.resize(0);
v.resize(size);
The STL implementer will supposedly have picked the most efficient means of zeroising, so we don't even need to know which particular method that might be. And this works with real vectors as well (think templates), not just the std::vector<bool> monstrosity.
There can be a minuscule added advantage for reused buffers in loops (e.g. sieves, whatever), where you simply resize to whatever will be needed for the current round, instead of to the original size.

As an alternative to std::vector<bool>, check out boost::dynamic_bitset (https://www.boost.org/doc/libs/1_72_0/libs/dynamic_bitset/dynamic_bitset.html). You can zero one (ie, set each element to false) out by calling the reset() member function.
Like clearing, say, std::vector<int>, reset on a boost::dynamic_bitset can also compile down to a memset, whereas you probably won't get that with std::vector<bool>. For example, see https://godbolt.org/z/aqSGCi

Efficiency of vector index access vs iterator access

I have question to correct my understanding of efficiency of accessing elements of a vector by using index access (with operator []) or using an iterator.
My understanding is "iterator" is more efficient than "index access".
(also I think vector::end() is more efficient than vector::size()).
Now I wrote sample code measure it (under Windows 7 using Cygwin, with g++ 4.5.3)
The index access loop version (formerly labeled as random access):
int main()
{
std::vector< size_t > vec ( 10000000 );
size_t value = 0;
for( size_t x=0; x<10; ++x )
{
for ( size_t idx = 0; idx < vec.size(); ++idx )
{
value += vec[idx];
}
return value;
}
}
The iterator loop code is this:
for (std::vector< size_t >::iterator iter = vec.begin(); iter != vec.end(); ++iter) {
value = *iter;
}
I am surprised to see that the "index access" version is much quicker. I used the time command to "measure". The numbers were :
results using g++ source.cpp (no optimizations)
index access
real 800ms
iterator access
real 2200ms
Do these numbers make sense? (I repeated the runs multiple times) And I wondered what details I miss and why I am mistaken...
results using g++ -O2
index access, time real: ~200ms
iterator access, time real: ~200ms
I repeated tests on different platform (amd64 w/ g++ and power7 w xlC) and see that all the time I used optimized code the example programs have similar execution time.
edit changed code to add values ( value += *iter ) instead of just using assignment. Added details about compiler options. Added new numbers for using -O2.
*edit2 changed title correcting "iterator efficiency" to "accesses efficiency".

Without seeing the test harnesses, the compiler options, and how you
measured the time, it's hard to say anything. Also, a good compiler may
be able eliminate the loop in one case or the other, since the loop has
no effect on the value returned. Still, depending on the
implementation, it wouldn't surprise me to see iterators significantly
faster than indexing (or vice versa).
With regards to your "understanding", there's nothing inherent about the
type of iterator and its performance. You can write forward iterators
which are very fast, or very slow, just as you can write random access
iterators which are very fast or very slow. Globally, the types of data
structures which will support random access iterators are likely to have
better locality than those which don't, which might work in favor of
random access iterators; but that's really not enough to be able to make
any reasonable generalizations.

When I compile both programs with -O2 (Linux, GCC 4.6.1), they run equally fast.
Then: your first program is not using iterators, it is using indices. These are different concepts.
Your second program is in fact using random access iterators, because that is what std::vector<T>::iterators are. The restrictions on std::vector are designed in such a way that an iterator can be implemented as a simple pointer into the dynamic array that a vector encapsulates.
begin should be just as fast as size. The only difference between the two in a typical implementation of std::vector is that end might need to compute begin() + size(), though size might also be implemented as (roughly) end() - begin(). The compiler might optimize both away in the loop, though.

With optimizations the two codes should be (near) identical. Try -O2.
Without optimizations and added debug information your measurements will be quite misleading.

In your first example, you dereference each individual item using value = vec[idx];, which causes an offset of element_size * index to be calculated each time you access an element.
Since a vector consists of elements lined up in a continuous block of memory, a vector iterator is usually just implemented as a simple pointer, so iterating through a vector (like in your second example) just involves advancing the pointer one element after each iteration.
If you enable optimizations (try -O2 or -O3), however, the compiler will likely optimize your loop in the first example to something similar to the second example, making the performance nearly identical.

I have found iterators to be faster, actually. Try refactoring your iterator loop to something like the following and you may see this as well:
#include <ctime>
#include <vector>
#include <iostream>
using namespace std;
int main()
{
std::vector< size_t > vec ( 1000000 );
size_t value = 0;
srand ( time(NULL) );
clock_t start,stop;
int ncycle = 10000;
start = std::clock();
for( size_t x=0; x<ncycle; ++x ) {
for ( size_t idx = 0; idx < vec.size(); ++idx )
vec[idx] += rand();
}
stop = std::clock();
cout << start << " " << stop << endl;
cout << "INDEX: " << (double((stop - start)) / CLOCKS_PER_SEC) / ncycle << " seconds per cycle" << endl;
start = std::clock();
for( size_t x=0; x<ncycle; ++x ) {
for (std::vector< size_t >::iterator iter = vec.begin(), end = vec.end(); iter != end; ++iter)
*iter += rand();
}
stop = std::clock();
cout << "ITERATOR: " << (double((stop - start)) / CLOCKS_PER_SEC) / ncycle << " seconds per cycle" << endl;
}
The result is the following on my pc, showing that iterators have a slight lead:
INDEX: 0.012069 seconds per cycle
ITERATOR: 0.011482 seconds per cycle
You should note that I used an addition of rand(); this prevents the compiler from optimizing out something that it can calculate at compile time. Compilers seem to have a much easier time doing so with intrinsic arrays than with vectors, and that can misleadingly give arrays an advantage over vectors.
I compiled the above with "icpc -fast". slavik was right on about having to do calculations on indices vs incrementing when using iterators (ala pointers).

Speed accessing a std::vector by iterator vs by operator[]/index?

Say, I have a
std::vector<SomeClass *> v;
in my code and I need to access its elements very often in the program, looping them forward and backward .
Which is the fastest access type between those two ?
Iterator access:
std::vector<SomeClass *> v;
std::vector<SomeClass *>::iterator i;
std::vector<SomeClass *>::reverse_iterator j;
// i loops forward, j loops backward
for( i = v.begin(), j = v.rbegin(); i != v.end() && j != v.rend(); i++, j++ ){
// some operations on v items
}
Subscript access (by index)
std::vector<SomeClass *> v;
unsigned int i, j, size = v.size();
// i loops forward, j loops backward
for( i = 0, j = size - 1; i < size && j >= 0; i++, j-- ){
// some operations on v items
}
And, does const_iterator offer a faster way to access vector elements in case I do not have to modify them?

The performance difference is likely negligable or none (the compiler might optimise them to be identical); you should worry about other things, like whether your program is correct (a slow but correct program is better than a fast and incorrect program). There are other advantages to using iterators though, such as being able to change the underlying container to one with no operator[] without modifying your loops. See this question for more.
const_iterators will most likely have none, or negligable, performance difference compared to ordinary iterators. They are designed to improve the correctness of your program by preventing modifying things that shouldn't be modified, not for performance. The same goes for the const keyword in general.
In short, optimisation should not be a concern of yours until two things have happened: 1) you have noticed it runs too slowly and 2) you have profiled the bottlenecks. For 1), if it ran ten times slower than it could, but is only ever run once and takes 0.1ms, who cares? For 2), make sure it's definitely the bottleneck, otherwise optimising it will have nearly no measurable effect on performance!

A simple loop-based benchmark has been fulfilled. I used VS 2010 SP1 (release configuration).
Use iterators (*it = *it + 1;)
Use indices (vs[i] = vs[i] + 1;)
In several billions of iterations the second approach turned out to be a bit faster, by 1%. The result (indices are slightly faster than iterators) is reproducible but the difference, as I said, is very small.

I had a test yesterday, use [] vs iterator, the code is create a vector with some elements and remove some elements from the vector.
This is the code uses operator [] to access elements
TimeSpent([](){
std::vector<int> vt = { 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
for (int i = int(vt.size()) - 1; i >= 0; i--)
{
if (vt[i] % 2 == 0)
{
//cout << "removing " << vt[i] << endl;
vt.erase(vt.begin() + i);
}
}
});
The following code is about access vector elements by using iterator
TimeSpent([](){
std::vector<int> vt = { 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
for (std::vector<int>::iterator num = vt.begin(); num != vt.end();)
{
if (*num % 2 == 0)
{
num = vt.erase(num);
}
else
{
++num;
}
}
});
Tested by calling them by this function separately
void TimeSpent(std::function<void()> func)
{
const int ONE_MIL = 10000;
long times = ONE_MIL;
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
while (times > 0)
{
func();
--times;
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
cout << "time elapsed : " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << endl;
}
Tested environment is visual studio 2013 pro. version 4.5.51650
The results are :
operator[] : 192
iterator : 212
Summary: when we access the vector container, operator [] is faster than iterator.

I believe that vector iterators are implemented as pointers internally (in a good STL implementation), so in general there should be negligible performance difference between the two idioms. But if you want to know how these perform on your platform, why don't you measure it with a little test program? I don't think it would take more than 5 minutes to measure execution time of e.g. 1 million iterations with both variants...

As always, it depends. Normally I wouldn't think you'd see any kind of difference, but only you can determine that by profiling your code. Some compilers implement vector iterators as raw pointers, and some don't. Also, in debug builds, some compilers may be using a checked iterator, which may be slower. But in production mode it may not be any different. Profile it and see.

With optimization (-O2) the timings should improve (should be nearly identical).

In terms of speed, I think might be almost same.
Better, you can profile and check anyway.
At least you can reduce the number of variables used :)
for( i = 0; i < size ; i++){
// some operations on v items
v[i];
v[size-i+1];
}
About const_iterator: Pls refer my Q: Are const_iterators faster ?

You are not only prematurely optimizing, you are micro-optimizing. This is an evil almost as bad as the former (the difference being that very, very, very rarely it is actually necessary to micro-optimize). Put the two together and you've got a recipe for disaster.
If you run a profiler and find this area of code is a bottleneck then you will need to optimize. You don't optimize by trying to reduce your loop from taking 23 clock cycles to 22. You optimize by finding ways to reduce the O() of your algorithm. But until you run a profiler you should be paying more attention to design than any other concern.

I'd go for iterators, but what I would optimize is calling end() in the loop and would change preincrement to postincrement. I.e. I'd
std::vector<SomeClass *> v;
std::vector<SomeClass *>::iterator i,ie;
std::vector<SomeClass *>::reverse_iterator j,je;
// i loops forward, j loops backward
for( i=v.begin(),ie=v.end(), j=v.rbegin(),je=v.rend(); i!=ie && j!=je; ++i,++j ){
// some operations on v items
}
And I don't think it's premature microoptimization, it's just writing better code. Much less evil than calling every attempt to write efficient code premature microoptimization and substituting thinking with profiling.

I was confused about something similar and wrote a program to test the performance : https://github.com/rajatkhanduja/Benchmarks/blob/master/C%2B%2B/vectorVsArray.cpp
Here's the relevant observations for reading/writing to vector<int> of size 1m using g++ (without any optimization flags), on Linux-i686 (64-bit machine) with 7.7 GB RAM:-
Time taken to write to vector using indices. : 11.3909 ms
Time taken to read from vector using indices, sequentially. : 4.09106 ms
Time taken to read from vector using indices, randomly. : 39 ms
Time taken to write to vector using iterators (sequentially). : 24.9949 ms
Time taken to read from vector using iterators (sequentially). : 18.8049 ms

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js