Is using string.length() in a loop efficient? - C++

For example, assuming a string s, is this:
for (int x = 0; x < s.length(); x++)
better than this?
int length = s.length();
for (int x = 0; x < length; x++)
Thanks,
Joel

In general, you should avoid function calls in the condition part of a loop, if the result does not change during the iteration.
The canonical form is therefore:
for (std::size_t x = 0, length = s.length(); x != length; ++x)
Note 3 things here:
The initialization can initialize more than one variable
The condition is expressed with != rather than <
I use pre-increment rather than post-increment
(I also changed the type, because a negative length is nonsense and the string interface is defined in terms of std::string::size_type, which is normally std::size_t on most implementations.)
Though... I admit that it's not so much for performance as for readability:
The double initialization means that the scope of both x and length is as tight as necessary
By memoizing the result, the reader is not left in doubt as to whether the length may vary during iteration
Using pre-increment is usually better when you do not need a temporary holding the "old" value
In short: use the best tool for the job at hand :)
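If you don't actually need the index, a C++11 range-based for sidesteps the whole question - a minimal sketch:
#include <string>

void process(const std::string& s)
{
    for (char c : s)   // no index, no repeated length() call
    {
        // ... use c ...
    }
}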

It depends on the inlining and optimization abilities of the compiler. Generally, the second variant will most likely be faster (more precisely: it will be either faster than or as fast as the first snippet, but almost never slower).
However, in most cases it doesn't matter, so people tend to prefer the first variant for its shortness.

It depends on your C++ implementation / library, the only way to be sure is to benchmark it. However, it's effectively certain that the second version will never be slower than the first, so if you don't modify the string within the loop it's a sensible optimisation to make.

How efficient do you want to be?
If you don't modify the string inside the loop, the compiler will easily see that the size doesn't change. Don't make it any more complicated than you have to!
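For illustration, the transformation the optimizer applies in that case looks roughly like this (a hand-written equivalent, not actual compiler output):
#include <cstddef>
#include <string>

void walk(const std::string& s)
{
    // `for (int x = 0; x < s.length(); x++)` effectively becomes:
    const std::size_t len = s.length();   // call hoisted out of the loop
    for (std::size_t x = 0; x < len; ++x)
    {
        // ... body that provably does not modify s ...
    }
}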

Although I am not necessarily encouraging you to do so, it appears it is surprisingly faster to call .length() every iteration than to store the length in an int (at least on my computer - an MSI gaming laptop with a 4th-gen i5 - though the hardware shouldn't really affect which way is faster).
Test code for constant call:
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string g = "01234567890";
    for (unsigned int rep = 0; rep < 25; rep++)
    {
        g += g;   // double the length 25 times
    }
    int a = 0;
    //int b = g.length();
    for (unsigned int rep = 0; rep < g.length(); rep++)
    {
        a++;
    }
    return a;
}
On average, this ran for 385ms according to Code::Blocks
And here's the code that stores the length in a variable:
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string g = "01234567890";
    for (unsigned int rep = 0; rep < 25; rep++)
    {
        g += g;   // double the length 25 times
    }
    int a = 0;
    int b = g.length();
    for (unsigned int rep = 0; rep < b; rep++)
    {
        a++;
    }
    return a;
}
And this averaged around 420ms.
I know this question already has an accepted answer, but there haven't been any practically tested answers, so I decided to throw my 2 cents in. I had the same question as you, but I didn't find any helpful answers here, so I ran my own experiment.

Is s.length() inlined, simply returning a member variable? Then no, there is no difference. Otherwise you pay the cost of dereferencing and pushing onto the stack - all the usual overhead of a function call - on every iteration.
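For illustration, here is a sketch (not the actual library source) of why an inlined length() is essentially free - it is just a load of a stored member:
#include <cstddef>

class MyString {
    std::size_t size_ = 0;   // length is stored, not computed
public:
    std::size_t length() const { return size_; }   // a single load once inlined
};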

Related

Fastest way to create a vector of indices from distance matrix in C++

I have a distance matrix D of size n by n and a constant L as input. I need to create a vector v containing all entries of D whose value is at most L. Here v must be in a specific order v = [v1 v2 .. vn], where vi contains the entries of the ith row of D with value at most L. The order of entries within each vi is not important.
I wonder whether there is a fast way to create v using a vector, an array or any other data structure, plus parallelization. What I did is use for loops, and it is very slow for large n.
vector<int> v;
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        if (D(i,j) <= L) v.push_back(j);
    }
}
The best way mostly depends on the context. If you are aiming for GPU parallelization you should take a look at OpenCL.
For CPU-based parallelization the standard C++ <thread> library is probably your best bet, but you need to be careful:
Threads take time to create, so if n is relatively small (<1000 or so) threading will slow you down
D(i,j) has to be readable by multiple threads at the same time
v has to be writable by multiple threads; a standard vector won't cut it
v may be a 2D vector with the vi as its subvectors, but these have to be initialized before the parallelization:
std::vector<std::vector<int>> v;
v.reserve(n);
for (size_t i = 0; i < n; i++)
{
    v.push_back(std::vector<int>());
}
You need to decide how many threads you want to use. If this is for one machine only, hardcoding is a valid option. There is a function in the thread library that reports the number of supported threads, but it is more of a hint than something trustworthy.
// hardware_concurrency() gives you a hint at how many threads should run, but it's not optimal
size_t threadAmount = std::thread::hardware_concurrency();
std::vector<std::thread> t;    // to store the threads in
t.reserve(threadAmount - 1);   // threadAmount-1 extra threads (we already have the main thread)
To start a thread you need a function it can execute. In this case this is to read through part of your matrix.
void CheckPart(size_t start, size_t amount, int L, std::vector<std::vector<int>>& vec)
{
    // n and D(i,j) as in the question
    for (size_t i = start; i < start + amount; i++)
    {
        for (size_t j = 0; j < n; j++)
        {
            if (D(i,j) <= L)
            {
                vec[i].push_back(j);
            }
        }
    }
}
Now you need to split your matrix into parts of about n/threadAmount rows each and start the threads. The thread constructor needs a function and its parameters, but it will always copy the parameters, even if the function expects a reference. To pass a real reference, wrap it in std::ref().
int i = 0;
int rows;
for (size_t a = 0; a < threadAmount - 1; a++)
{
    rows = n / threadAmount + ((n % threadAmount > a) ? 1 : 0);
    t.push_back(std::thread(CheckPart, i, rows, L, std::ref(v)));
    i += rows;
}
The threads are now running, and all that is left to do is run the last block on the main thread:
CheckPart(i, n/threadAmount, L, v);
After that you need to wait for the threads finishing and clean them up:
for (unsigned int a = 0; a < threadAmount - 1; a++)
{
    if (t[a].joinable())
    {
        t[a].join();
    }
}
Please note that this is just a quick and dirty example. Different problems might need different implementation, and since I can't guess the context the help I can give is rather limited.
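For reference, here is a self-contained assembly of the pieces above (a sketch with made-up sizes and data; the flat vector D and the accessor at() are my stand-ins for the question's D(i,j)):
#include <cstddef>
#include <thread>
#include <vector>

int main()
{
    const std::size_t n = 2000;
    const int L = 10;
    std::vector<int> D(n * n, 0);   // placeholder data
    auto at = [&](std::size_t i, std::size_t j) { return D[i * n + j]; };

    std::vector<std::vector<int>> v(n);   // one subvector per row, pre-initialized

    auto checkPart = [&](std::size_t start, std::size_t amount) {
        for (std::size_t i = start; i < start + amount; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (at(i, j) <= L)
                    v[i].push_back(static_cast<int>(j));
    };

    std::size_t threadAmount = std::thread::hardware_concurrency();
    if (threadAmount == 0) threadAmount = 2;   // hardware_concurrency() may return 0

    std::vector<std::thread> t;
    t.reserve(threadAmount - 1);
    std::size_t i = 0;
    for (std::size_t a = 0; a < threadAmount - 1; ++a)
    {
        std::size_t rows = n / threadAmount + ((n % threadAmount > a) ? 1 : 0);
        t.emplace_back(checkPart, i, rows);
        i += rows;
    }
    checkPart(i, n - i);   // the main thread takes the remaining rows
    for (auto& th : t)
        th.join();
}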
In consideration of the comments, I made the appropriate corrections (in emphasis).
Have you looked for tips on writing performant code, threading, asm instructions (if the generated assembly is not exactly what you want) and OpenCL for parallel processing? If not, I strongly recommend it!
In some cases, declaring all for-loop variables outside the for loop (to avoid declaring them repeatedly) makes things faster, but not in this case (comment from our friend Paddy).
Also, using new instead of vector can be faster, as we see here: Using arrays or std::vectors in C++, what's the performance gap? - and I tested it: the vector version was 6 seconds slower than the one with new, which took only 1 second. I guess the safety and ease-of-management guarantees that come with std::vector are not desired when someone is after performance, especially since using new is not that difficult - just avoid heap overflow in your index calculations and remember to use delete[].
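A minimal sketch of the raw-new alternative mentioned above (the sizes are examples; note the flat index arithmetic and the mandatory delete[]):
#include <cstddef>

int main()
{
    const std::size_t n = 1000;
    int* D = new int[n * n]();   // one flat, zero-initialized allocation
    // ... index row i, column j as D[i * n + j] ...
    delete[] D;                  // don't forget, or the memory leaks
}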
user4581301 is correct here, and the following statement is untrue: Finally, if you build D in an array instead of a matrix (or maybe if you copy D into a constant array, maybe...), it will be much more cache-friendly and will save one for-loop statement.

What is the fastest implementation for accessing and changing a long array of boolean?

I want to implement a very long boolean array (as a binary genome), access some intervals to check whether an interval is all true, and in addition change the values in some intervals.
For example, I can create 4 representations:
bool binaryGenome1[10000000] = {false};
vector<bool> binaryGenome2; binaryGenome2.resize(10000000);
vector<char> binaryGenome3; binaryGenome3.resize(10000000);
bitset<10000000> binaryGenome4;
and access this way:
inline bool checkBinGenome(long long start, long long end)
{
    for (long long i = start; i < end + 1; i++)
        if (binaryGenome[i] == false)
            return false;
    return true;
}
inline void changeBinGenome(long long start, long long end)
{
    for (long long i = start; i < end + 1; i++)
        binaryGenome[i] = true;
}
vector<char> and a plain bool array (which stores every boolean in a byte) both seem to be a poor choice, as I need to be space-efficient. But what are the differences between vector<bool> and bitset?
Somewhere else I read that vector has some overhead because its size is chosen at run time rather than compile time - overhead for what, accessing? And how much is that overhead?
As I want to access array elements many times using checkBinGenome() and changeBinGenome(), what is the fastest implementation?
Use std::bitset. It's the best.
If the length of the data is known at compile time, consider std::array<bool, N> or std::bitset<N>. The latter is likely to be more space-efficient (you'll have to measure whether the associated extra work in access times outweighs the speed gain from reducing cache pressure - that will depend on your workload).
If your array's length is not fixed, then you'll need a std::vector<bool> or std::vector<char>; there's also boost::dynamic_bitset but I've never used that.
If you will be changing large regions at once, as your sample implies, it may well be worth constructing your own representation and manipulating the underlying storage directly, rather than one bit at a time through the iterators. For example, if you use an array of char as the underlying representation, then setting a large range to 0 or 1 is mostly a memset() or std::fill() call, with computation only for the values at the start and end of the range. I'd start with a simple implementation and a good set of unit tests before trying anything like that.
It is (at least theoretically) possible that your Standard Library has specialized versions of algorithms for the iterators of std::vector<bool>, std::array<bool> and/or std::bitset that do exactly the above, or you may be able to write and contribute such specializations. That's a better path if possible - the world may thank you, and you'll have shared some of the maintenance responsibility.
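To make that concrete, here is a rough sketch of setting a bit range [start, end) directly on the underlying bytes (LSB-first layout; the function name and representation are my own, not from any library):
#include <cstddef>
#include <cstdint>
#include <cstring>

void setRange(std::uint8_t* bits, std::size_t start, std::size_t end)
{
    if (start >= end) return;
    std::size_t firstByte = start / 8;
    std::size_t lastByte  = (end - 1) / 8;
    std::uint8_t firstMask = static_cast<std::uint8_t>(0xFFu << (start % 8));
    std::uint8_t lastMask  = static_cast<std::uint8_t>(0xFFu >> (7 - (end - 1) % 8));
    if (firstByte == lastByte) {              // the whole range is inside one byte
        bits[firstByte] |= (firstMask & lastMask);
        return;
    }
    bits[firstByte] |= firstMask;             // partial byte at the front
    bits[lastByte]  |= lastMask;              // partial byte at the back
    if (lastByte > firstByte + 1)             // whole bytes in the middle: bulk set
        std::memset(bits + firstByte + 1, 0xFF, lastByte - firstByte - 1);
}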
Important note
If using std::vector<bool>, you do need to be aware that, unlike other std::vector<> instantiations, it does not implement the standard container semantics (it packs its elements as bits and hands out proxy objects instead of references). That's not to say it shouldn't be used, but make sure you understand its foibles!
E.g., checking whether all the elements are true
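One such foible, sketched (this is standard std::vector<bool> behaviour, not an extension):
#include <vector>

int main()
{
    std::vector<bool> v(8, false);
    // bool& r = v[0];   // does not compile: operator[] returns a proxy, not bool&
    auto ref = v[0];     // proxy object; writing through it flips the packed bit
    ref = true;          // v[0] is now true
    // &v[0] is not a bool*, so code expecting a contiguous bool array breaks
}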
I am really NOT sure whether this will give us more overhead than speedup. Actually I think a modern CPU can do this quite fast - are you really seeing poor performance? (Or is this just a skeleton of your real problem?)
#include <omp.h>
#include <iostream>
#include <cstring>
using namespace std;

#define N 10000000
bool binaryGenome[N];

int main() {
    memset(binaryGenome, true, sizeof(bool) * N);
    bool result = true;
    cout << result << endl;
    binaryGenome[9999995] = false;   // plant a single false value
    bool go = true;                  // shared early-exit flag (not synchronized; fine for a demo)
    unsigned int give = 0;
    #pragma omp parallel
    {
        unsigned int start, stop;
        #pragma omp critical         // hand each thread its own chunk
        {
            start = give;
            give += N / omp_get_num_threads();
            stop = give;
            if (omp_get_thread_num() == omp_get_num_threads() - 1)
                stop = N;            // the last thread takes the remainder
        }
        while (start < stop && go) {
            if (!binaryGenome[start]) {
                cout << start << endl;
                go = false;          // signal the other threads to stop
                result = false;
            }
            ++start;
        }
    }
    cout << result << endl;
}

Is continue instant?

In the following two code snippets, is there actually any difference in compilation or execution speed?
for (int i = 0; i < 50; i++)
{
    if (i % 3 == 0)
        continue;
    printf("Yay");
}
and
for (int i = 0; i < 50; i++)
{
    if (i % 3 != 0)
        printf("Yay");
}
Personally, in situations where there is a lot more than a print statement, I've been using the first method to reduce the amount of indentation for the contained code. I've been wondering for a while, so I figured it was about time I asked whether it actually has any effect beyond the visual one.
Reply to Alf (I couldn't get code working in the comments...):
More accurate to my usage is something along the lines of a "handleObjectMovement" function which would include
for each object
    if object position is static
        continue
    deal with velocity and jazz
compared with
for each object
    if object position is not static
        deal with velocity and jazz
Hence me not using return. Essentially "if it's not relevant to this iteration, move on"
The behaviour is the same, so the runtime speed should be the same unless the compiler does something stupid (or unless you disable optimisation).
It's impossible to say whether there's a difference in compilation speed, since it depends on the details of how the compiler parses, analyses and translates the two variations.
If speed is important, measure it.
If you know which branch of the condition has the higher probability, you can use the GCC-style likely/unlikely macros built on __builtin_expect; a sketch follows below.
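For example (a sketch; these macro names are the common convention around GCC's __builtin_expect, not something GCC ships):
#include <cstdio>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int main()
{
    for (int i = 0; i < 50; i++)
    {
        if (unlikely(i % 3 == 0))   // hint: treat this branch as the cold path
            continue;
        std::printf("Yay");
    }
}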
How about getting rid of the check altogether?
for (int t = 0; t < 33; t++)
{
    // t + t/2 + 1 enumerates exactly the 33 numbers in [0, 50)
    // that are not divisible by 3: 1, 2, 4, 5, 7, 8, ...
    int i = t + (t >> 1) + 1;
    printf("%d\n", i);
}

Which one will be faster

Just calculating the sum of two arrays, with a slight modification in the code:
#include <cstdio>

int main()
{
    int a[10000] = {0}; //initialize something
    int b[10000] = {0}; //initialize something
    int sumA = 0, sumB = 0;
    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
        sumB += b[i];
    }
    printf("%d %d", sumA, sumB);
}
OR
#include <cstdio>

int main()
{
    int a[10000] = {0}; //initialize something
    int b[10000] = {0}; //initialize something
    int sumA = 0, sumB = 0;
    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
    }
    for (int i = 0; i < 10000; i++)
    {
        sumB += b[i];
    }
    printf("%d %d", sumA, sumB);
}
Which code will be faster?
There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (CPU, memory bandwidth, etc.).
The size of the data in your arrays (ints in your example) affects the result, as it has an impact on the use of the processor cache. Often you will find example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 accesses memory in a more efficient way).
Here's some code with timing, built using VS2005:
#include <windows.h>
#include <iostream>
using namespace std;

int main()
{
    LARGE_INTEGER start, middle, end;
    const int count = 1000000;
    // value-initialize the arrays: summing uninitialized memory is undefined behaviour
    int *a = new int [count] (),
        *b = new int [count] (),
        *c = new int [count] (),
        *d = new int [count] ();
    int suma = 0, sumb = 0, sumc = 0, sumd = 0;

    QueryPerformanceCounter (&start);
    for (int i = 0; i < count; ++i)
    {
        suma += a [i];
        sumb += b [i];
    }
    QueryPerformanceCounter (&middle);
    for (int i = 0; i < count; ++i)
    {
        sumc += c [i];
    }
    for (int i = 0; i < count; ++i)
    {
        sumd += d [i];
    }
    QueryPerformanceCounter (&end);

    cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
    cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
    cout << "Done." << endl << suma << sumb << sumc << sumd;
    return 0;
}
Running this, the latter version is usually faster.
I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:
00401110 mov edx,dword ptr [eax-0Ch]
00401113 add edx,dword ptr [eax-8]
00401116 add eax,14h
00401119 add edx,dword ptr [eax-18h]
0040111C add edx,dword ptr [eax-10h]
0040111F add edx,dword ptr [eax-14h]
00401122 add ebx,edx
00401124 sub ecx,1
00401127 jne main+110h (401110h)
Here's the register usage:
eax = used to index the array
ebx = the grand total
ecx = loop counter
edx = sum of the five integers accessed in one iteration of the loop
There are a few interesting things here:
The compiler has unrolled the loop five times.
The order of memory access is not contiguous.
It updates the array index in the middle of the loop.
It sums five integers then adds that to the grand total.
To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are as this code is quite counter-intuitive.
In theory, due to cache effects, the second one should be faster.
Caches are optimized to fetch and keep chunks of data, so the first access brings a big chunk of the first array into the cache. In the first code, accessing the second array may force some of the first array's data out of the cache, requiring more memory accesses.
In practice both approaches will take more or less the same time, with the first being a little better given the size of actual caches and the likelihood of no data being evicted at all.
Note: This sounds a lot like homework. In real life, for these sizes the first option will be slightly faster, but this only applies to this concrete example; nested loops, bigger arrays or, especially, smaller caches would have a significant impact on performance depending on the order.
The first one will be faster. The program will not need to run the loop twice. It's not much work, but some cycles are lost on incrementing the loop variable and evaluating the loop condition.
For me (GCC -O3), measuring shows that the second version is faster by some 25%, which can be explained by a more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course you'll need to repeat the operation thousands of times before the difference becomes significant.
I also tried std::accumulate from the <numeric> header, which is the simple way to implement the second version, and it was in turn a tiny bit faster than the second version (probably due to a more compiler-friendly looping mechanism?):
sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);
The first one will be faster because you loop from 1 to 10000 only once.
The C++ Standard says nothing about it; it is implementation dependent. It looks like you are trying to do premature optimization. It shouldn't bother you until it is a bottleneck in your program. If it is, use a profiler to find out which one is faster on your particular platform.
Until then, I'd prefer the first variant because it looks more readable (or, better, std::accumulate).
If the data is large enough that both arrays cannot be cached at once (as in example 1) but a single array can (as in example 2), then the code of the first example will be slower than that of the second.
Otherwise the code of the first example will be faster.
The first one will probably be faster. The memory access pattern will allow the (modern) CPU to manage the caches efficiently (prefetch), even while accessing two arrays.
Much faster, if your CPU allows it and the arrays are aligned: use SSE3 instructions to process 4 ints at a time. A sketch of the idea follows below.
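A sketch of that idea using SSE2 integer intrinsics (packed 32-bit adds already exist in SSE2, so SSE3 isn't strictly needed; alignment and tail handling are kept minimal):
#include <emmintrin.h>   // SSE2

int sum4(const int* a, int n)
{
    __m128i acc = _mm_setzero_si128();
    int i = 0;
    for (; i + 4 <= n; i += 4)   // process 4 ints per step
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i*)(a + i)));
    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, acc);   // spill and sum the four lanes
    int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i)   // leftover tail
        sum += a[i];
    return sum;
}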
If you meant a[i] instead of a[10000] (and for b, respectively) and if your compiler performs loop distribution optimizations, the first one will be exactly the same as the second. If not, the second will perform slightly better.
If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).
Food for thought for some answers that were voted up: how many additions are performed in each version of the code?

Which is faster/preferred: memset or for loop to zero out an array of doubles?

double d[10];
int length = 10;
memset(d, length * sizeof(double), 0);
//or
for (int i = length; i--;)
    d[i] = 0.0;
If you really care you should try and measure. However the most portable way is using std::fill():
std::fill( array, array + numberOfElements, 0.0 );
Note that for memset you have to pass the number of bytes, not the number of elements because this is an old C function:
memset(d, 0, sizeof(double)*length);
memset can be faster since it is written in assembler, whereas std::fill is a template function which simply does a loop internally.
But for type safety and more readable code I would recommend std::fill() - it is the C++ way of doing things - and consider memset only if a performance optimization is needed at this place in the code.
Try this, if only to be cool xD
{
    double *to = d;
    int n = (length + 7) / 8;    // note: assumes length > 0
    switch (length % 8) {        // Duff's device: jump into the unrolled loop
        case 0: do { *to++ = 0.0;
        case 7:      *to++ = 0.0;
        case 6:      *to++ = 0.0;
        case 5:      *to++ = 0.0;
        case 4:      *to++ = 0.0;
        case 3:      *to++ = 0.0;
        case 2:      *to++ = 0.0;
        case 1:      *to++ = 0.0;
                } while (--n > 0);
    }
}
Assuming the loop length is an integral constant expression, the most probable outcome is that a good optimizer will recognize both the for-loop and the memset(0). The result would be that the generated assembly is essentially equal. Perhaps the choice of registers or the setup could differ, but the marginal cost per double should really be the same.
In addition to the several bugs and omissions in your code, using memset is not portable. You can't assume that a double with all zero bits is equal to 0.0. First make your code correct, then worry about optimizing.
memset(d, 0, 10 * sizeof(*d));
is likely to be faster. Like they say, you can also use
std::fill_n(d, 10, 0.);
but that is most likely just a prettier way of writing the loop.
calloc(length, sizeof(double))
According to IEEE-754, the bit representation of a positive zero is all zero bits, and there's nothing wrong with requiring IEEE-754 compliance. (If you need to zero out the array to reuse it, then pick one of the above solutions).
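A minimal sketch of the calloc approach (error handling kept to the essentials):
#include <cstdlib>

int main()
{
    const std::size_t length = 10;
    double* d = static_cast<double*>(std::calloc(length, sizeof(double)));
    if (!d)
        return 1;   // allocation failed
    // on IEEE-754 platforms, d[0] .. d[length-1] now read as 0.0
    std::free(d);
    return 0;
}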
According to this Wikipedia article on IEEE 754-1985 64-bit floating point, a bit pattern of all 0s does indeed properly initialize a double to 0.0. Unfortunately your memset code doesn't do that.
Here is the code you ought to be using:
memset(d, 0, length * sizeof(double));
As part of a more complete package...
{
    double *d;
    int length = 10;
    d = malloc(sizeof(d[0]) * length);
    memset(d, 0, length * sizeof(d[0]));
}
Of course, that's dropping the error checking you should be doing on the return value of malloc. sizeof(d[0]) is slightly better than sizeof(double) because it's robust against changes in the type of d.
Also, if you use calloc(length, sizeof(d[0])) it will clear the memory for you and the subsequent memset will no longer be necessary. I didn't use it in the example because then it seems like your question wouldn't be answered.
memset will always be faster if debug mode or a low optimization level is used. At higher optimization levels, std::fill and std::fill_n will be equivalent to it.
For example, for the following code under Google Benchmark:
(Test setup: xubuntu 18, GCC 7.3, Clang 6.0)
#include <cstring>
#include <cstdio>
#include <algorithm>
#include <benchmark/benchmark.h>

double total = 0;

static void memory_memset(benchmark::State& state)
{
    int ints[50000];
    for (auto _ : state)
    {
        std::memset(ints, 0, sizeof(int) * 50000);
    }
    for (int counter = 0; counter != 50000; ++counter)
    {
        total += ints[counter];
    }
}

static void memory_filln(benchmark::State& state)
{
    int ints[50000];
    for (auto _ : state)
    {
        std::fill_n(ints, 50000, 0);
    }
    for (int counter = 0; counter != 50000; ++counter)
    {
        total += ints[counter];
    }
}

static void memory_fill(benchmark::State& state)
{
    int ints[50000];
    for (auto _ : state)
    {
        std::fill(std::begin(ints), std::end(ints), 0);
    }
    for (int counter = 0; counter != 50000; ++counter)
    {
        total += ints[counter];
    }
}

// Register the functions as benchmarks
BENCHMARK(memory_filln);
BENCHMARK(memory_fill);
BENCHMARK(memory_memset);

int main(int argc, char** argv)
{
    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    printf("Total = %f\n", total);
    getchar();
    return 0;
}
Gives the following results in release mode for GCC (-O2;-march=native):
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 16488 ns 16477 ns 42460
memory_fill 16493 ns 16493 ns 42440
memory_memset 8414 ns 8408 ns 83022
And the following results in debug mode (-O0):
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 87209 ns 87139 ns 8029
memory_fill 94593 ns 94533 ns 7411
memory_memset 8441 ns 8434 ns 82833
While at -O3 or with clang at -O2, the following is obtained:
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 8437 ns 8437 ns 82799
memory_fill 8437 ns 8437 ns 82756
memory_memset 8436 ns 8436 ns 82754
TLDR: use memset unless told you absolutely have to use std::fill or a for-loop, at least for POD types which are not non-IEEE-754 floating-points. There are no strong reasons not to.
(note: the for loops counting the array contents are necessary for clang not to optimize away the google benchmark loops entirely (it will detect they're not used otherwise))
The example will not work because you have to allocate memory for your array. You can do this on the stack or on the heap.
This is an example of doing it on the stack:
double d[50] = {0.0};
No memset is needed after that.
Don't forget to compare against a properly optimized for loop if you really care about performance: some variant of Duff's device if the array is sufficiently long, and prefix --i, not postfix i-- (although most compilers will probably correct that automatically).
Although I'd question whether this is the most valuable thing to be optimising. Is this genuinely a bottleneck for the system?
memset(d, 0, 10) would be wrong as it only nulls 10 bytes, not 10 doubles.
prefer std::fill as the intent is clearest.
In general the memset is going to be much faster; make sure you get your length right (obviously your example has not (m)allocated or defined the array of doubles correctly). Now, if it truly ends up holding only a handful of doubles, the loop may turn out to be faster. But by the point where the fill loop dwarfs the handful of setup instructions, memset will typically use larger and sometimes aligned chunks to maximize speed.
As usual, test and measure (although in this case you may end up in the cache and the measurement may turn out to be bogus).
One way of answering this question is to quickly run the code through Compiler Explorer. If you check this link, you'll see the assembly for the following code:
#include <array>
#include <algorithm>
#include <cstring>

void do_memset(std::array<char, 1024>& a) {
    memset(&a, 'q', a.size());
}

void do_fill(std::array<char, 1024>& a) {
    std::fill(a.begin(), a.end(), 'q');
}

void do_loop(std::array<char, 1024>& a) {
    for (int i = 0; i < a.size(); ++i) {
        a[i] = 'q';
    }
}
The answer (at least for clang) is that with optimization levels -O0 and -O1, the assembly is different and std::fill will be slower because the use of the iterators is not optimized out. For -O2 and higher, do_memset and do_fill produce the same assembly. The loop ends up calling memset on every item in the array even with -O3.
Assuming release builds tend to run -O2 or higher, there are no performance considerations and I'd recommend using std::fill when it's available, and memset for C.
If you're required not to use the STL...
double aValues[10];
ZeroMemory(aValues, sizeof(aValues));
ZeroMemory at least makes the intent clear.
As an alternative to everything proposed, I can suggest NOT setting the array to all zeros at startup. Instead, set a cell's value to zero only when you first access it. This sidesteps your question and may be faster; a sketch follows below.
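A sketch of one way to do that, using a generation counter so that "clearing" is O(1) (all names here are mine, for illustration only):
#include <cstddef>
#include <vector>

class LazyZeroArray {
    std::vector<double>   value;
    std::vector<unsigned> stamp;   // generation in which value[i] was last written
    unsigned generation = 1;
public:
    explicit LazyZeroArray(std::size_t n) : value(n), stamp(n, 0) {}
    double get(std::size_t i) const {
        return stamp[i] == generation ? value[i] : 0.0;   // untouched cells read as 0
    }
    void set(std::size_t i, double v) {
        value[i] = v;
        stamp[i] = generation;
    }
    void clear() { ++generation; }   // "re-zero" everything without touching memory
};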
I think you mean
memset(d, 0, length * sizeof(d[0]))
and
for (int i = length; --i >= 0; ) d[i] = 0;
Personally, I do either one, but I suppose std::fill() is probably better.