Implement a switch statement using a jump table in C++

Implement a switch statement using a jump table in C++ - c++

I am trying to improve my parser's speed. And switch-case, sometimes it's useful, but I see it's still slow. Not sure - if C++ supports this feature (address checkpoints (with additional parameter)), it's great!
Simple example :
enum Transport //MOTORBIKE = 1, CAR = 2, ... SHIP = 10
Transport foo = //unknown
switch(foo)
{
case MOTORBIKE : /*do something*/ break;
case CAR : /*do something*/ break;
//////////////////////////
case SHIP : /*do something*/ break;
}
If the variable foo is SHIP, at least the program has to re-check the value up to ten times! -> It's still slow.
If C++ supports checkpoints :
Transport foo = //unknown
__checkpoint smart_switch;
goto (smart_switch + foo); //instant call!!!
smart_switch + MOTORBIKE : /*do something*/ goto __end;
smart_switch + CAR : /*do something*/ goto __end;
smart_switch + [...] : /*do something*/ goto __end;
////////////////////////////////////////////////////////////
smart_switch + SHIP : /*do something*/ goto __end;
__end : return 0;
It doesn't generate any jump tables, and then check per value. Maybe it doesn't work well with default case. The only thing is smart_switch + CAR -> smart_switch + SHIP may have different addresses so if C++ evaluate them as real addresses, the process will fail. So when compiling the compiler just has to convert them to real addresses.
Does C++ support this feature? And does it greatly improve speed & performance?

What you are talking about is called a jump table. The jump table is usually an array of relative addresses where the program execution control can be transferred. Here is an example of how you can implement one:
#include <ctime>
#include <cstdlib>
#include <cstdio>
int main()
{
static constexpr void* jump_table[] =
{
&&print_0, &&print_1, &&print_2,
&&print_3, &&print_4, &&print_5
};
std::srand(std::time(nullptr));
int v = std::rand();
if (v < 0 || v > 5)
goto out;
goto *jump_table[v];
print_0:
std::printf("zero\n");
goto out;
print_1:
std::printf("one\n");
goto out;
print_2:
std::printf("two\n");
goto out;
print_3:
std::printf("three\n");
goto out;
print_4:
std::printf("four\n");
goto out;
print_5:
std::printf("five\n");
goto out;
out:
return EXIT_SUCCESS;
}
However, I seriously doubt two things. The first doubt is that using jump table will make your program faster. An indirect jump is relatively expensive and is badly predicted by the hardware. Chances are that if you only have three values then you are better off simply comparing each of them using "if-then-else" statement. For a lot of sparse values (i.e. 1, 100, 250, 500 etc.), you are better off doing a binary search rather than blowing up the size of your table. In either case, this is just a head of a huge iceberg when it comes to switch statements. So unless you know all of the details and know where compiler did the wrong thing for your particular case, don't even bother trying to change switch to something else — you will never outsmart the compiler and will only make your program slower.
The second doubt is actually that switching is the bottleneck of your parser. Most likely it is not. So in order to save yourself a lot of valuable time, try profiling your code first to know for sure what is the slowest part of your program. Usually it goes in steps like this:
Profile and find the bottleneck.
Figure out why this is a bottleneck and come up with a reasonable idea of how to improve the code for speed.
Try to improve the code.
Go to step #1.
And there is no exit from this loop. Optimization is something you can spend your entire life doing. At some point, you will have to assume the program is fast enough and there are no bottlenecks :)
Also, I have written a more comprehensive analysis with in-depth (more or less) details on how switch statements are implemented by compilers and when and when not try to engage in trying to outsmart them. Please find the article here.

Yes C/C++ does support this feature, and it is in... standard switch. I have no idea where you get the idea that switch will check each value, but you are wrong. Yes I heard that some compilers can generate better code for pretty big cases (many variants like probably several hundreds), but I do not think that it is yours.
For example following code compiled by gcc without any optimization:
enum E { One, Two, Three, Four, Five };
void func( E e )
{
int res;
switch( e ) {
case One : res = 10; break;
case Two : res = 20; break;
case Three : res = 30; break;
case Four : res = 40; break;
case Five : res = 50; break;
}
}
generates following:
_Z4func1E:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -20(%rbp)
movl -20(%rbp), %eax
cmpl $4, %eax
ja .L1
movl %eax, %eax
movq .L8(,%rax,8), %rax
jmp *%rax
.section .rodata
.align 8
.align 4
.L8:
.quad .L3
.quad .L4
.quad .L5
.quad .L6
.quad .L7
.text
.L3:
movl $10, -4(%rbp)
jmp .L1
.L4:
movl $20, -4(%rbp)
jmp .L1
.L5:
movl $30, -4(%rbp)
jmp .L1
.L6:
movl $40, -4(%rbp)
jmp .L1
.L7:
movl $50, -4(%rbp)
nop
.L1:
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
As you can see it simply jumps into particular position without checking each value.

You can just build an array with function pointers and index at it by the enum

Related

If Else construct

if (a == 1)
//do something
else if (a == 2)
//do something
else if (a == 3)
//do something
else if (a == 4)
//do something
else if (a == 5)
//do something
else if (a == 6)
//do something
else if (a == 7)
//do something
else if (a == 8)
//do something
Now imagine, we know that a will mostly be 7 and we execute this block of code several times in a program. Will moving the (a == 7 ) check to top improve any time performance? That is:
if (a == 7)
//do something
else if (a == 1)
//do something
else if (a == 2)
//do something
else if (a == 3)
//do something
and so on. Does it improve anything or it's just wishful thinking?

You can use switch case for improving the performance of program
switch (a)
{
case 1:
break;
case 2:
break;
case 3:
break;
case 4:
break;
case 5:
break;
case 6:
break;
case 7:
break;
}

Since the if conditions are checked in the order specified, yes. Whether it is measurable (and, hence, should you care) will depend on how many times that portion of the code is called.

Imagine you go to a hotel and are given a room with number 7.
You have to go across the hall checking every room until you find the room with number 7.
Will the time taken depend on how many rooms you checked before you got the one you have been alloted?
Yes..
But know this, in your scenario the time difference will be very very minute to be noticed.
For scenarios where there are too many numbers to be checked, putting the one in the beginning which occur many number of times does improve performance. In fact, this methodology is used by some network protocols for comparing protocol numbers

There is some penalty to be paid in case of compiler couldn't make the constructs to a jump table, I would think the switch/case implementation will be compiled as a jump table in assembly and if else not as a jump table then switch/case has an edge over if else. Again I guess this depends on architecture and compilers.
In case of switch/case compiler will be able to generate the asm jump table only based on the constants (eg. consecutive values) that we provide.
The test i ran on my machine gave the assembly for if/else as this (not jump table) ,
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $7, -4(%rbp)
cmpl $1, -4(%rbp)
jne .L2
movl $97, %edi
call putchar
jmp .L3
.L2:
**cmpl $2, -4(%rbp)
jne .L4**
movl $97, %edi
call putchar
jmp .L3
.L4:
**cmpl $3, -4(%rbp)
jne .L5**
movl $97, %edi
call putchar
jmp .L3
.L5:
**cmpl $4, -4(%rbp)
jne .L6**
movl $97, %edi
call putchar
jmp .L3
.L6:
**cmpl $5, -4(%rbp)
jne .L7**
movl $97, %edi
call putchar
jmp .L3
.L7:
**cmpl $6, -4(%rbp)
jne .L8**
movl $97, %edi
call putchar
jmp .L3
.L8:
cmpl $7, -4(%rbp)
jne .L9
movl $97, %edi
call putchar
jmp .L3
.L9:
cmpl $8, -4(%rbp)
jne .L3
movl $97, %edi
call putchar
.L3:
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
But for switch/case (jump table),
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $7, -4(%rbp)
cmpl $7, -4(%rbp)
ja .L2
movl -4(%rbp), %eax
movq .L4(,%rax,8), %rax
jmp *%rax
.section .rodata
.align 8
.align 4
.L4:
.quad .L2
.quad .L3
.quad .L5
.quad .L6
.quad .L7
.quad .L8
.quad .L9
.quad .L10
.text
.L3:
movl $97, %edi
call putchar
jmp .L2
.L5:
movl $97, %edi
call putchar
jmp .L2
.L6:
movl $97, %edi
call putchar
jmp .L2
.L7:
movl $97, %edi
call putchar
jmp .L2
.L8:
movl $97, %edi
call putchar
jmp .L2
.L9:
movl $97, %edi
call putchar
jmp .L2
.L10:
movl $97, %edi
call putchar
nop
.L2:
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
From the tests i feel that switch/case is better as it doesn't have to go through the earlier entries to find a match.
I would suggest to try gcc -S option to generate the assembly to check the asm to see.

TL;DR version
For so few values, any differences in speed will be immeasurably small, and you'd be better off sticking with the more straightforward, easier-to-understand version. It isn't until you need to start searching through tables containing thousands to millions of entries that you'll want something smarter than a linear ordered search.
James Michener Version
Another possibility not yet mentioned is to do a partitioned search, like so:
if ( a > 4 )
{
if ( a > 6 )
{
if ( a == 7 ) // do stuff
else // a == 8, do stuff
}
else
{
if ( a == 5 ) // do stuff
else // a == 6, do stuff
}
}
else
{
if ( a > 2 )
{
if ( a == 3 ) // do stuff
else // a == 4, do stuff
}
else
{
if ( a == 1 ) // do stuff
else // a == 2, do stuff
}
}
No more than three tests are performed for any value of a. Of course, no less than three tests are performed for any value of a, either. On average, it should give better performance than the naive 1-8 search when the majority of inputs are 7, but...
As with all things performance-related, the rule is measure, don't guess. Code up different versions, profile them, analyze the results. For testing against so few values, it's going to be hard to get reliable numbers; you'll need to execute each method thousands of times for a given value just to get a useful non-zero time measurement (it also means that any difference between the methods will be ridiculously small).
Stuff like this can also be affected by compiler optimization settings. You'll want to build at different optimization levels and re-run your tests.
Just for giggles, I coded up my own version measuring several different approaches:
naive - the straightforward test from 1 to 8 in order;
sevenfirst - check for 7 first, then 1 - 6 and 8;
eightfirst - check from 8 to 1 in reverse order;
partitioned - use the partitioned search above;
switcher - use a switch statement instead of if-else;
I used the following test harness:
int main( void )
{
size_t counter[9] = {0};
struct timeval start, end;
unsigned long total_nsec;
void (*funcs[])(int, size_t *) = { naive, sevenfirst, eightfirst, partitioned, switcher };
srand(time(NULL));
printf("%15s %15s %15s %15s %15s %15s\n", "test #", "naive", "sevenfirst", "eightfirst", "partitioned", "switcher" );
printf("%15s %15s %15s %15s %15s %15s\n", "------", "-----", "----------", "----------", "-----------", "--------" );
unsigned long times[5] = {0};
for ( size_t t = 0; t < 20; t++ )
{
printf( "%15zu ", t );
for ( size_t f = 0; f < 5; f ++ )
{
total_nsec = 0;
for ( size_t i = 0; i < 1000; i++ )
{
int a = generate();
gettimeofday( &start, NULL );
for ( size_t j = 0; j < 10000; j++ )
(*funcs[f])( a, counter );
gettimeofday( &end, NULL );
}
total_nsec += end.tv_usec - start.tv_usec;
printf( "%15lu ", total_nsec );
times[f] += total_nsec;
memset( counter, 0, sizeof counter );
}
putchar('\n');
}
putchar ('\n');
printf( "%15s ", "average:" );
for ( size_t i = 0; i < 5; i++ )
printf( "%15f ", (double) times[i] / 20 );
putchar ('\n' );
return 0;
}
The generate function produces random numbers from 1 through 8, weighted so that 7 appears half the time. I run each method 10000 times per generated value to get measurable times, for 1000 generated values.
I didn't want the performance difference between the various control structures to get swamped by the // do stuff code, so each case just increments a counter, such as
if ( a == 1 )
counter[1]++;
This also gave me a way to verify that my number generator was working properly.
I run through the whole sequence 20 times and average the results. Even so, the numbers can vary a bit from run to run, so don't trust them too deeply. If nothing else, they show that changes at this level don't result in huge improvements. For example:
test # naive sevenfirst eightfirst partitioned switcher
------ ----- ---------- ---------- ----------- --------
0 121 100 118 119 111
1 110 100 131 120 115
2 110 100 125 121 111
3 115 125 117 105 110
4 120 116 125 110 115
5 129 100 110 106 116
6 115 176 105 106 115
7 111 100 111 106 110
8 139 100 106 111 116
9 125 100 136 106 111
10 106 100 105 106 111
11 126 112 135 105 115
12 116 120 135 110 115
13 120 105 106 111 115
14 120 105 105 106 110
15 100 131 106 118 115
16 106 113 116 111 110
17 106 105 105 118 111
18 121 113 103 106 115
19 130 101 105 105 116
average: 117.300000 111.100000 115.250000 110.300000 113.150000
Numbers are in microseconds. The code was built using gcc 4.1.2 with no optimization running on a SLES 10 system1.
So, running each method 10000 times for 1000 values, averaged over 20 runs, gives a total variation of about 7 μsec. That's really not worth getting exercised over. For something that's only searching among 8 distinct values and isn't going to run more than "several times", you're not going to see any measurable improvement in performance regardless of the method used. Stick with the method that's the easiest to read and understand.
Now, for searching a table containing several hundreds to thousands to millions of entries, you definitely want to use something smarter than a linear search.
1. It should go without saying, but the results above are only valid for this code built with this specific compiler and running on this specific system.

There should be a slight difference, but this would depend on the platform and the nature of the comparison - different compilers may optimise something like this differently, different architectures will have different effects as well, and it also depends on what the comparison actually is (if it is a more complex comparison than a simple primitive type comparison, for example)
It is probably good practice to test the specific case you are actually going to use, IF this is something that is likely to actually be a performance bottleneck.
Alternatively, a switch statement, if usable, should have the same performance for any value independent of order because it is implemented using an offset in memory as opposed to successive comparisons.

Probably not, you still have the same number of conditions and still either of them might evaluate to true, even if you check for a == 7 first, all other conditions are potentially true therefore will be evaluated.
The block of code that eis executed if a == 7 might be executed quicker when the program runs - but essentially you're code is still the same, with the same number of statements.

Is it faster to iterate through the elements of an array with pointers incremented by 1? [duplicate]

This question already has answers here:
Efficiency: arrays vs pointers
(14 answers)
Closed 7 years ago.
Is it faster to do something like
for ( int * pa(arr), * pb(arr+n); pa != pb; ++pa )
{
// do something with *pa
}
than
for ( size_t k = 0; k < n; ++k )
{
// do something with arr[k]
}
???
I understand that arr[k] is equivalent to *(arr+k), but in the first method you are using the current pointer which has incremented by 1, while in the second case you are using a pointer which is incremented from arr by successively larger numbers. Maybe hardware has special ways of incrementing by 1 and so the first method is faster? Or not? Just curious. Hope my question makes sense.

If the compiler is smart enought (and most of compilers is) then performance of both loops should be ~equal.
For example I have compiled the code in gcc 5.1.0 with generating assembly:
int __attribute__ ((noinline)) compute1(int* arr, int n)
{
int sum = 0;
for(int i = 0; i < n; ++i)
{
sum += arr[i];
}
return sum;
}
int __attribute__ ((noinline)) compute2(int* arr, int n)
{
int sum = 0;
for(int * pa(arr), * pb(arr+n); pa != pb; ++pa)
{
sum += *pa;
}
return sum;
}
And the result assembly is:
compute1(int*, int):
testl %esi, %esi
jle .L4
leal -1(%rsi), %eax
leaq 4(%rdi,%rax,4), %rdx
xorl %eax, %eax
.L3:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdx, %rdi
jne .L3
rep ret
.L4:
xorl %eax, %eax
ret
compute2(int*, int):
movslq %esi, %rsi
xorl %eax, %eax
leaq (%rdi,%rsi,4), %rdx
cmpq %rdx, %rdi
je .L10
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
.L10:
rep ret
main:
xorl %eax, %eax
ret
As you can see, the most heavy part (loop) of both functions is equal:
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
But in more complex examples or in other compiler the results might be different. So you should test it and measure, but most of compilers generate similar code.
The full code sample: https://goo.gl/mpqSS0

This cannot be answered. It depends on your compiler AND on your machine.
A very naive compiler would translate the code as is to machine code. Most machines indeed provide an increment operation that is very fast. They normally also provide relative addressing for an address with an offset. This could take a few cycles more than absolute addressing. So, yes, the version with pointers could potentially be faster.
But take into account that every machine is different AND that compilers are allowed to optimize as long as the observable behavior of your program doesn't change. Given that, I would suggest a reasonable compiler will create code from both versions that doesn't differ in performance.

Any reasonable compiler will generate code that is identical inside the loop for these two choices - I looked at the code generated for iterating over a std::vector, using for-loop with an integer for the iterator or using a for( auto i: vec) type construct [std::vector internally has two pointers for the begin and end of the stored values, so like your pa and pb]. Both gcc and clang generates identical code inside the loop itself [the exact details of the loop is subtly different between the compilers, but other than that, there's no difference]. The setup of the loop was subtly different, but unless you OFTEN do loops of less than 5 items [and if so, why do you worry?], the actual content of the loop is what matters, not the bit just before the actual loop.
As with ALL code where performance is important, the exact code, compiler make and version, compiler options, processor make and model, will make a difference to how the code performs. But for the vast majority of processors and compilers, I'd expect no measurable difference. If the code is really critical, measure different alternatives and see what works best in your case.

Why '==' is slow on std::string?

While profiling my application I realized that a lot of time is spent on string comparisons. So I wrote a simple benchmark and I was surprised that '==' is much slower than string::compare and strcmp! here is the code, can anyone explain why is that? or what's wrong with my code? because according to the standard '==' is just an operator overload and simply returnes !lhs.compare(rhs).
#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr = 10000000000;//10 Billion
int len = 100;
int main() {
srand(time(0));
string s1(len,random()%128);
string s2(len,random()%128);
uint64_t a = 0;
Timer t;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(s1 == s2)
a = i;
}
t.end();
cout<<"== took:"<<t.elapsedMillis()<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(s1.compare(s2)==0)
a = i;
}
t.end();
cout<<".compare took:"<<t.elapsedMillis()<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(strcmp(s1.c_str(),s2.c_str()))
a = i;
}
t.end();
cout<<"strcmp took:"<<t.elapsedMillis()<<endl;
return a;
}
And here is the result:
== took:5986.74
.compare took:0.000349
strcmp took:0.000778
And my compile flags:
CXXFLAGS = -O3 -Wall -fmessage-length=0 -std=c++1y
I use gcc 4.9 on a x86_64 linux machine.
Obviously using -o3 does some optimizations which I guess rolls out the last two loops totally; however, using -o2 still the results are weird:
for 1 billion iterations:
== took:19591
.compare took:8318.01
strcmp took:6480.35
P.S. Timer is just a wrapper class to measure spent time; I am absolutely sure about it :D
Code for Timer class:
#include <chrono>
#ifndef SRC_TIMER_H_
#define SRC_TIMER_H_
class Timer {
std::chrono::steady_clock::time_point start;
std::chrono::steady_clock::time_point stop;
public:
Timer(){
start = std::chrono::steady_clock::now();
stop = std::chrono::steady_clock::now();
}
virtual ~Timer() {}
inline void begin() {
start = std::chrono::steady_clock::now();
}
inline void end() {
stop = std::chrono::steady_clock::now();
}
inline double elapsedMillis() {
auto diff = stop - start;
return std::chrono::duration<double, std::milli> (diff).count();
}
inline double elapsedMicro() {
auto diff = stop - start;
return std::chrono::duration<double, std::micro> (diff).count();
}
inline double elapsedNano() {
auto diff = stop - start;
return std::chrono::duration<double, std::nano> (diff).count();
}
inline double elapsedSec() {
auto diff = stop - start;
return std::chrono::duration<double> (diff).count();
}
};
#endif /* SRC_TIMER_H_ */

UPDATE: output of improved benchmark at http://ideone.com/rGc36a
== took:21
.compare took:21
strcmp took:14
== took:21
.compare took:25
strcmp took:14
The thing that proved crucial to get it working meaningfully was "outwitting" the compiler's ability to predict the strings being compared at compile time:
// more strings that might be used...
string s[] = { {len,argc+'A'}, {len,argc+'A'}, {len, argc+'B'}, {len, argc+'B'} };
if(s[i&3].compare(s[(i+1)&3])==0) // trickier to optimise
a += i; // cumulative observable side effects
Note that in general, strcmp is not functionally equivalent to == or .compare when the text may embed NULs, as the former will get to "exit early". (That's not the reason it's "faster" above, but do read below for comments re possible variations with string length/content etc..)
Discussion / Earlier answer
Just have a look at your implementation - e.g.
echo '#include <string>' > stringE.cc
g++ -E stringE.cc | less
Search for the basic_string template, then for the operator== working on two string instances - mine is:
template<class _Elem,
class _Traits,
class _Alloc> inline
bool __cdecl operator==(
const basic_string<_Elem, _Traits, _Alloc>& _Left,
const basic_string<_Elem, _Traits, _Alloc>& _Right)
{
return (_Left.compare(_Right) == 0);
}
Notice that operator== is inline and simply calls compare. There's no way it's consistently significantly slower with normal optimisation levels enabled, though the optimiser might occasionally happen to optimise one loop better than another due to subtle side effects of surrounding code.
Your ostensible problem will have been caused by e.g. your code being optimised beyond the point of doing the intended work, for loops arbitrarily unrolled to different degrees, or other quirks or bugs in the optimisation or your timing. That's not unusual when you have unvarying inputs and loops that don't have any cumulative side-effects (i.e. the compiler can work out that intermediate values of a are not used, so only the last a = i need take affect).
So, learn to write better benchmarks. In this case, that's a bit tricky as having lots of distinct strings in memory ready to invoke the comparisons on, and selecting them in a way that the optimiser can't predict at compile time that's still fast enough not to overwhelm and obscure the impact of the string comparison code, is not an easy task. Further, beyond a point - comparing things spread across more memory makes cache affects more relevant to the benchmark, which further obscures the real comparison performance.
Still, if I were you I'd read some strings from a file - pushing each to a vector, then loop over the vector doing each of the three comparison operations between adjacent elements. Then the compiler can't possibly predict any pattern in the outcomes. You might find compare/== faster/slower than strcmp for strings often differing in the first character or three, but the other way around for long strings that are equal or only differing near the end, so make sure you try different kinds of input before you conclude you understand the performance profile.

Either your timings are screwy, or your compiler has optimised some of your code out of existence.
Think about it, ten billion operations in 0.000349 milliseconds (I'll use 0.000500 milliseconds, or half a microsecond, to make my calculations easier) means that you're performing twenty trillion operations per second.
Even if one operation could be done in a single clock cycle, that would be 20,000 GHz, a bit beyond the current crop of CPUs, even with their massively optimised pipelines and multiple cores.
And, given that the -O2 optimised figures are more on par with each other (== taking about double the time of compare), the "code optimised out of existence" possibility is looking far more likely.
The doubling of time could easily be explained as ten billion extra function calls, due to operator== needing to call compare to do its work.
As further support, examine the following table, showing figures in milliseconds (third column is simple divide-by-ten scale of second column so that both first and third columns are for a billion iterations):
-O2/1billion -O3/10billion -O3/1billion Improvement
(a) (b) (c = b / 10) (a / c)
============ ============= ============ ===========
oper== 19151 5987 599 32
compare 8319 0.0005 0.00005 166,380,000
It beggars belief that -O3 could speed up the == code by a factor of about 32 but manage to speed up the compare code by a factor of a few hundred million.
I strongly suggest you have a look at the assembler code generated by your compiler (such as with the gcc -S option) to verify that it's actually doing that work it's claiming to do.

The problem is that the compiler is making a lot of serious optimizations to your code.
Here's the modified code:
#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr = 500000000;//10 Billion
int len = 100;
int main() {
srand(time(0));
string s1(len,random()%128);
string s2(len,random()%128);
uint64_t a = 0;
Timer t;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(s1 == s2)
a += i;
}
t.end();
cout<<"== took:"<<t.elapsedMillis()<<",a="<<a<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(s1.compare(s2)==0)
a+=i;
}
t.end();
cout<<".compare took:"<<t.elapsedMillis()<<",a="<<a<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(strcmp(s1.c_str(),s2.c_str()) == 0)
a+=i;
}
t.end();
cout<<"strcmp took:"<<t.elapsedMillis()<<",a="<<a<< endl;
return a;
}
where I've added asm volatile("" : "+g"(s2)); to force the compiler to run the comparison. I've also added <<",a="< to force the compiler to compute a.
The output is now:
== took:10221.5,a=0
.compare took:10739,a=0
strcmp took:9700,a=0
Can you explain why strcmp is faster than .compare which is slower than ==? however, the speed differences are marginal, but significant.
It actually makes sense! :p

The speed analysis below is wrong - thanks to Tony D for pointing out my error. The criticisms and advice for better benchmarks still apply though.
All the previous answers deal with the compiler optimisation issues in your benchmark, but don't answer why strcmp is still slightly faster.
strcmp is likely faster (in the corrected benchmarks) due to the strings sometimes containing zeros. Since strcmp uses C-strings it can exit when it comes across the string termination char '\0'. std::string::compare() treats '\0' as just another char and continues until the end of the string array.
Since you have non-deterministically seeded the RNG, and only generated two strings, your results will change with every run of the code. (I'd advise against this in benchmarks.) Given the numbers, 28 times out of 128, there ought to be no advantage. 10 times out of 128 you will get more than a 10-fold speed up. And so on.
As well as defeating the compiler's optimiser, I would suggest that, next time, you generate a new string for each comparison iteration, allowing you to average away such effects.

Compiled the code with gcc -O3 -S --std=c++1y. The result is here. gcc version is:
gcc (Ubuntu 4.9.1-16ubuntu6) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Look at it, we can be the first loop (operator ==) is like this: (comment is added by me)
movq itr(%rip), %rbp
movq %rax, %r12
movq %rax, 56(%rsp)
testq %rbp, %rbp
je .L25
movq 16(%rsp), %rdi
movq 32(%rsp), %rsi
xorl %ebx, %ebx
movq -24(%rsi), %rdx ; length of string1
cmpq -24(%rdi), %rdx ; compare lengths
je .L53 ; compare content only when length is the same
.L10
; end of loop, print out follows
;....
.L53:
.cfi_restore_state
call memcmp ; compare content
xorl %edx, %edx ; zero loop count
.p2align 4,,10
.p2align 3
.L13:
testl %eax, %eax ; check result
cmove %rdx, %rbx ; a = i
addq $1, %rdx ; i++
cmpq %rbp, %rdx ; i < itr?
jne .L13
jmp .L10
; ....
.L25:
xorl %ebx, %ebx
jmp .L10
We can see that operator == is inline, only a call to memcmp is there. And for operator ==, if the length is different, the content is not compared.
Most importantly, compare is done only once. The loop content only contains i++;, a=i;, i<itr;.
For the second loop (compare()):
movq itr(%rip), %r12
movq %rax, %r13
movq %rax, 56(%rsp)
testq %r12, %r12
je .L14
movq 16(%rsp), %rdi
movq 32(%rsp), %rsi
movq -24(%rdi), %rbp
movq -24(%rsi), %r14 ; read and compare length
movq %rbp, %rdx
cmpq %rbp, %r14
cmovbe %r14, %rdx ; save the shorter length of the two string to %rdx
subq %r14, %rbp ; length difference in %rbp
call memcmp ; content is always compared
movl $2147483648, %edx ; 0x80000000 sign extended
addq %rbp, %rdx ; revert the sign bit of %rbp (length difference) and save to %rdx
testl %eax, %eax ; memcmp returned 0?
jne .L14 ; no, string different
testl %ebp, %ebp ; memcmp returned 0. Are lengths the same (%ebp == 0)?
jne .L14 ; no, string different
movl $4294967295, %eax ; string compare equal
subq $1, %r12 ; itr - 1
cmpq %rax, %rdx
cmovbe %r12, %rbx ; a = itr - 1
.L14:
; output follows
There no loop at all here.
In compare(), as it should return plus, minus, or zero based on the comparison, string content is always compared. memcmp called once.
For the third loop (strcmp()), the assembly is the most simple:
movq itr(%rip), %rbp ; itr to %rbp
movq %rax, %r12
movq %rax, 56(%rsp)
testq %rbp, %rbp
je .L16
movq 32(%rsp), %rsi
movq 16(%rsp), %rdi
subq $1, %rbp ; itr - 1 to %rbp
call strcmp
testl %eax, %eax ; test compare result
cmovne %rbp, %rbx ; if not equal, save itr - 1 to %rbx (a)
.L16:
These also no loop at all. strcmp is called, and if the strings are not equal (as in your code), save itr-1 to a directly.
So your benchmark cannot test the running time for operator ==, compare() or strcmp(). The are all called only once, not able to show the running time difference.
As to why operator == takes the most time, it is because for operator==, the compiler for some reason did not eliminate the loop. The loop takes time (but the loop does not contain string comparison at all).
And from the assembly shown, we may assume that operator == may be fastest, because it won't do string comparison at all if the length of the two strings are different. (Of course, under gcc4.9.1 -O3)

Fastest way to reset a value in every struct element of a vector?

Very much like this question, except that instead vector<int> I have vector<struct myType>.
If I want to reset (or for that matter, set to some value) myType.myVar for every element in the vector, what's the most efficient method?
Right now I'm iterating through:
for(int i=0; i<myVec.size(); i++) myVec.at(i).myVar = 0;
But since vectors are guaranteed to be stored contiguously, there's surely a better way?

Resetting will need to traverse every element of the vector, so it will need at least O(n) complexity. Your current algorithm takes O(n).
In this particular case you can use operator[] instead of at (that might throw an exception). But I doubt that's the bottleneck of your application.
On this note you should probably use std::fill:
std::fill(myVec.begin(), myVec.end(), 0);
But unless you want to go byte level and set a chunk of memory to 0, which will not only cause you headaches but will also make you lose portability in most cases, there's nothing to improve here.

Instead of the below code
for(int i=0; i<myVec.size(); i++) myVec.at(i).myVar = 0;
do it as follows:
size_t sz = myVec.size();
for(int i=0; i<sz; ++i) myVec[i].myVar = 0;
As the "at" method internally checks whether index is out of range or not. But as your loop index is taking care(myVec.size()), you can avoid the extra check. Otherwise this is the fastest way to do it.
EDIT
In addition to that, we can store the size() of vector prior to executing the for loop.This would ensure that there is no further call of method size() inside the for loop.

One of the fastest ways would be to perform loop unwinding and break the speed limit posed by conventional for loops that cause great cash spills. In your case, as it's a run time thing, there's no way to apply template metaprogramming so a variation on good old Duff's device would do the trick
#include <iostream>
#include <vector>
using namespace std;
struct myStruct {
int a;
double b;
};
int main()
{
std::vector<myStruct> mV(20);
double val(20); // the new value you want to reset to
int n = (mV.size()+7) / 8; // 8 is the size of the block unroll
auto to = mV.begin(); //
switch(mV.size()%8)
{
case 0: do { (*to++).b = val;
case 7: (*to++).b = val;
case 6: (*to++).b = val;
case 5: (*to++).b = val;
case 4: (*to++).b = val;
case 3: (*to++).b = val;
case 2: (*to++).b = val;
case 1: (*to++).b = val;
} while (--n>0);
}
// just printing to verify that the value is set
for (auto i : mV) std::cout << i.b << std::endl;
return 0;
}
Here I choose to perform an 8-block unwind, to reset the value (let's say) b of a myStruct structure. The block size can be tweaked and loops are effectively unrolled. Remember this is the underlying technique in memcpy and one of the optimizations (loop unrolling in general) a compiler will attempt (actually they're quite good at this so we might as well let them do their job).

In addition to what's been said before, you should consider that if you turn on optimizations, the compiler will likely perform loop-unrolling which will make the loop itself faster.

Also pre increment ++i takes few instructions less than post increment i++.
Explanation here

Beware of spending a lot of time thinking about optimization details which the compiler will just take care of for you.
Here are four implementations of what I understand the OP to be, along with the code generated using gcc 4.8 with --std=c++11 -O3 -S
Declarations:
#include <algorithm>
#include <vector>
struct T {
int irrelevant;
int relevant;
double trailing;
};
Explicit loop implementations, roughly from answers and comments provided to OP. Both produced identical machine code, aside from labels.
.cfi_startproc
movq (%rdi), %rsi
void clear_relevant(std::vector<T>* vecp) { movq 8(%rdi), %rcx
for(unsigned i=0; i<vecp->size(); i++) { xorl %edx, %edx
vecp->at(i).relevant = 0; xorl %eax, %eax
} subq %rsi, %rcx
} sarq $4, %rcx
testq %rcx, %rcx
je .L1
.p2align 4,,10
.p2align 3
.L5:
void clear_relevant2(std::vector<T>* vecp) { salq $4, %rdx
std::vector<T>& vec = *vecp; addl $1, %eax
auto s = vec.size(); movl $0, 4(%rsi,%rdx)
for (unsigned i = 0; i < s; ++i) { movl %eax, %edx
vec[i].relevant = 0; cmpq %rcx, %rdx
} jb .L5
} .L1:
rep ret
.cfi_endproc
Two other versions, one using std::for_each and the other one using the range for syntax. Here there is a subtle difference in the code for the two versions (other than the labels):
.cfi_startproc
movq 8(%rdi), %rdx
movq (%rdi), %rax
cmpq %rax, %rdx
je .L17
void clear_relevant3(std::vector<T>* vecp) { .p2align 4,,10
for (auto& p : *vecp) p.relevant = 0; .p2align 3
} .L21:
movl $0, 4(%rax)
addq $16, %rax
cmpq %rax, %rdx
jne .L21
.L17:
rep ret
.cfi_endproc
.cfi_startproc
movq 8(%rdi), %rdx
movq (%rdi), %rax
cmpq %rdx, %rax
void clear_relevant4(std::vector<T>* vecp) { je .L12
std::for_each(vecp->begin(), vecp->end(), .p2align 4,,10
[](T& o){o.relevant=0;}); .p2align 3
} .L16:
movl $0, 4(%rax)
addq $16, %rax
cmpq %rax, %rdx
jne .L16
.L12:
rep ret
.cfi_endproc

A virtual function returning a small structure - return value vs output parameter?

I have a virtual function in a hotspot code that needs to return a structure as a result. I have these two options:
virtual Vec4 generateVec() const = 0; // return value
virtual void generateVec(Vec4& output) const = 0; // output parameter
My question is, is there generally any difference in the performance of these functions? I'd assume the second one is faster, because it does not involve copying data on the stack. However, the first one is often much more convenient to use. If the first one is still slightly slower, would this be measurable at all? Am I too obsessed :)
Let me stress that that this function will be called millions of times every second, but also that the size of the structure Vec4 is small - 16 bytes.

As has been said, try them out - but you will quite possibly find that Vec4 generateVec() is actually faster. Return value optimization will elide the copy operation, whereas void generateVec(Vec4& output) may cause an unnecessary initialisation of the output parameter.
Is there any way you can avoid making the function virtual? If you're calling it millions of times a sec that extra level of indirection is worth looking at.

Code called millions of times per second implies you really do need to optimize for speed.
Depending on how complex the body of the derived generateVec's is, the difference between the two may be unnoticeable or could be massive.
Best bet is to try them both and profile to see if you need to worry about optimizing this particular aspect of the code.

Feeling a bit bored, so I came up with this:
#include <iostream>
#include <ctime>
#include <cstdlib>
using namespace std;
struct A {
int n[4];
A() {
n[0] = n[1] = n[2] = n[3] = rand();
}
};
A f1() {
return A();
}
A f2( A & a ) {
a = A();
}
const unsigned long BIG = 100000000;
int main() {
unsigned int sum = 0;
A a;
clock_t t = clock();
for ( unsigned int i = 0; i < BIG; i++ ) {
a = f1();
sum += a.n[0];
}
cout << clock() - t << endl;
t = clock();
for ( unsigned int i = 0; i < BIG; i++ ) {
f2( a );
sum += a.n[0];
}
cout << clock() - t << endl;
return sum & 1;
}
Results with -O2 optimisation are that there is no significant difference.

There are chances that the first solution is faster.
A very nice article :
http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/

Just out of curiosity, I wrote 2 similar functions (uses 8-byte data types) to check their assembly code.
long long int ret_val()
{
long long int tmp(1);
return tmp;
}
// ret_val() assembly
.globl _Z7ret_valv
.type _Z7ret_valv, #function
_Z7ret_valv:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
subl $16, %esp
movl $1, -8(%ebp)
movl $0, -4(%ebp)
movl -8(%ebp), %eax
movl -4(%ebp), %edx
leave
ret
.cfi_endproc
Surprisingly, the pass-by-value method below required a few more instructions:
void output_val(long long int& value)
{
long long int tmp(2);
value = tmp;
}
// output_val() assembly
.globl _Z10output_valRx
.type _Z10output_valRx, #function
_Z10output_valRx:
.LFB1:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
subl $16, %esp
movl $2, -8(%ebp)
movl $0, -4(%ebp)
movl 8(%ebp), %ecx
movl -8(%ebp), %eax
movl -4(%ebp), %edx
movl %eax, (%ecx)
movl %edx, 4(%ecx)
leave
ret
.cfi_endproc
These functions were called in a test code as:
long long val = ret_val();
long long val2;
output_val(val2);
Compiled by gcc.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Implement a switch statement using a jump table in C++ - c++

You can just build an array with function pointers and index at it by the enum

Related

If Else construct

Is it faster to iterate through the elements of an array with pointers incremented by 1? [duplicate]

Why '==' is slow on std::string?

Fastest way to reset a value in every struct element of a vector?

A virtual function returning a small structure - return value vs output parameter?

Categories

Resources