I've been writing C++11 code for quite some time now, and haven't done any benchmarking of it, only expecting things like vector operations to "just be faster" now with move semantics. So when actually benchmarking with GCC 4.7.2 and clang 3.0 (default compilers on Ubuntu 12.10 64-bit) I get very unsatisfying results. This is my test code:
EDIT: With regards to the (good) answers posted by #DeadMG and #ronag, I changed the element type from std::string to my::string which does not have a swap(), and made all inner strings larger (200-700 bytes) so that they shouldn't be the victims of SSO.
EDIT2: COW was the reason. Adapted the code again following the great comments: changed the storage from std::string to std::vector<char> and left out the copy/move constructors (letting the compiler generate them instead). Without COW, the speed difference is actually huge.
EDIT3: Re-added the previous solution when compiled with -DCOW. This makes the internal storage a std::string rather than a std::vector<char> as requested by #chico.
#include <string>
#include <vector>
#include <fstream>
#include <iostream>
#include <algorithm>
#include <functional>
static std::size_t dec = 0;
namespace my { class string
{
public:
    string( ) { }
#ifdef COW
    string( const std::string& ref ) : str( ref ), val( dec % 2 ? - ++dec : ++dec ) {
#else
    string( const std::string& ref ) : val( dec % 2 ? - ++dec : ++dec ) {
        str.resize( ref.size( ) );
        std::copy( ref.begin( ), ref.end( ), str.begin( ) );
#endif
    }
    bool operator<( const string& other ) const { return val < other.val; }
private:
#ifdef COW
    std::string str;
#else
    std::vector< char > str;
#endif
    std::size_t val;
}; }
template< typename T >
void dup_vector( T& vec )
{
    T v = vec;
    for ( typename T::iterator i = v.begin( ); i != v.end( ); ++i )
#ifdef CPP11
        vec.push_back( std::move( *i ) );
#else
        vec.push_back( *i );
#endif
}
int main( )
{
    std::ifstream file;
    file.open( "/etc/passwd" );
    std::vector< my::string > lines;
    while ( ! file.eof( ) )
    {
        std::string s;
        std::getline( file, s );
        lines.push_back( s + s + s + s + s + s + s + s + s );
    }
    while ( lines.size( ) < ( 1000 * 1000 ) )
        dup_vector( lines );
    std::cout << lines.size( ) << " elements" << std::endl;
    std::sort( lines.begin( ), lines.end( ) );
    return 0;
}
What this does is read /etc/passwd into a vector of lines, then duplicate this vector onto itself over and over until we have at least 1 million entries. This is where the first optimization should kick in: not only the explicit std::move() you see in dup_vector(), but push_back itself should also perform better when it needs to reallocate (create a new, larger array and transfer the existing elements), because the transfer can be done with moves instead of copies.
Finally, the vector is sorted. This should definitely be faster when you don't need to copy temporary objects each time two elements are swapped.
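For reference, a generic swap of two elements boils down to roughly this sketch (not the actual library code): three moves in C++11 versus three deep copies in C++98, which is exactly where sorting elements with heap-allocated contents should win.
template< typename T >
void swap_sketch( T& a, T& b )
{
#ifdef CPP11
    T tmp = std::move( a );   // steals a's buffer, no allocation (std::move lives in <utility>)
    a = std::move( b );
    b = std::move( tmp );
#else
    T tmp = a;                // full copy, allocates
    a = b;
    b = tmp;
#endif
}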
I compile and run this two ways, one being as C++98, the next as C++11 (with -DCPP11 for the explicit move):
1> $ rm -f a.out ; g++ --std=c++98 test.cpp ; time ./a.out
2> $ rm -f a.out ; g++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
3> $ rm -f a.out ; clang++ --std=c++98 test.cpp ; time ./a.out
4> $ rm -f a.out ; clang++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
With the following results (twice for each compilation):
GCC C++98
1> real 0m9.626s
1> real 0m9.709s
GCC C++11
2> real 0m10.163s
2> real 0m10.130s
So, it's slightly slower to run when compiled as C++11 code. Similar results go for clang:
clang C++98
3> real 0m8.906s
3> real 0m8.750s
clang C++11
4> real 0m8.858s
4> real 0m9.053s
Can someone tell me why this is? Are the compilers optimizing so well even when compiling for pre-C++11 that they practically reach move-semantics behaviour after all? If I add -O2, all code runs faster, but the results between the different standards are almost the same as above.
EDIT: New results with my::string rather than std::string, and with larger individual strings:
$ rm -f a.out ; g++ --std=c++98 test.cpp ; time ./a.out
real 0m16.637s
$ rm -f a.out ; g++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
real 0m17.169s
$ rm -f a.out ; clang++ --std=c++98 test.cpp ; time ./a.out
real 0m16.222s
$ rm -f a.out ; clang++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
real 0m15.652s
There are very small differences between C++98 and C++11 with move semantics: slightly slower with C++11 on GCC and slightly faster on clang, but still very small differences.
EDIT2: Now without std::string's COW, the performance improvement is huge:
$ rm -f a.out ; g++ --std=c++98 test.cpp ; time ./a.out
real 0m10.313s
$ rm -f a.out ; g++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
real 0m5.267s
$ rm -f a.out ; clang++ --std=c++98 test.cpp ; time ./a.out
real 0m10.218s
$ rm -f a.out ; clang++ --std=c++11 -DCPP11 test.cpp ; time ./a.out
real 0m3.376s
With optimization, the difference is a lot bigger too:
$ rm -f a.out ; g++ -O2 --std=c++98 test.cpp ; time ./a.out
real 0m5.243s
$ rm -f a.out ; g++ -O2 --std=c++11 -DCPP11 test.cpp ; time ./a.out
real 0m0.803s
$ rm -f a.out ; clang++ -O2 --std=c++98 test.cpp ; time ./a.out
real 0m5.248s
$ rm -f a.out ; clang++ -O2 --std=c++11 -DCPP11 test.cpp ; time ./a.out
real 0m0.785s
The above shows C++11 coming out roughly 6-7 times faster.
Thanks for the great comments and answers. I hope this post will be useful and interesting to others too.
This should definitely be faster when you don't need to copy temporary
objects each time two elements are swapped.
std::string has a swap member, so sort will already use that, and its internal implementation effectively already amounts to move semantics. And you won't see a difference between copy and move for std::string as long as SSO is involved. In addition, some versions of GCC still have a non-C++11-conforming COW-based implementation, which also would not see much difference between copy and move.
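For my::string from the question, which deliberately has no swap(), one could add one so that even the C++98 build's sort avoids deep copies; a rough sketch along those lines (names match the question's class, the namespace-scope overload is shown as a comment):
// hypothetical addition to my::string -- not part of the benchmark above
void swap( string& other )
{
    str.swap( other.str );        // std::string / std::vector<char> swap is O(1)
    std::swap( val, other.val );
}
// plus, at namespace scope, so std::iter_swap and friends can find it via ADL:
// namespace my { inline void swap( string& a, string& b ) { a.swap( b ); } }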
This is probably due to the small string optimization, which can occur (depending on the compiler) for strings shorter than e.g. 16 characters. I would guess that all the lines in the file are quite short, since they come from /etc/passwd.
When the small string optimization is active for a particular string, a move is effectively done as a copy.
You will need to have larger strings to see any speed improvements with move semantics.
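If you want a rough feel for where that threshold sits on your own standard library (the exact size is implementation-defined; values around 15 or 22 characters are common), a minimal sketch:
#include <iostream>
#include <string>
int main( )
{
    std::string s;
    // the capacity of an empty string usually hints at the size of the internal SSO buffer
    std::cout << "SSO buffer holds about " << s.capacity( ) << " chars" << std::endl;
    // only strings longer than this allocate, and only then can a move beat a copy
    return 0;
}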
I think you'll need to profile the program. Maybe most of the time is spent in the line T v = vec; and in the std::sort(..) of a vector of over a million strings, which has nothing to do with move semantics.
I am writing a program that calls an external string array from within a compiled static library.
When I compile and run the program in 64-bit, it works without issue. However, when I try to call the external array when compiling the code in 32-bit, it gives a Segmentation Fault when running main.
Here is the code:
Header declaration honeyB_lib.h:
#ifndef HONEYB_LIB_H_
#define HONEYB_LIB_H_
#include <string>
extern std::string honeyB_libs[];
#endif
Extern definition honeyB_lib.cpp:
#include <string>
std::string honeyB_libs[] = { "libHoneyB.so", "libHoneyB3.so", "libHoneyB2.so", "" };
Extern use honeyB_fcn.cpp:
deque<string> get_array()
{
    deque<string> dst;
    int i = 0;
    for(;;)
    {
        if(honeyB_libs[i] == "")
            break;
        else
        {
            dst.push_front(honeyB_libs[i]);
            i++;
        }
    }
    return dst;
}
The Makefile to compile this is as follows:
all:
$(CC) -c -Wall -fPIC source.cpp
$(CC) -g -c -fPIC honeyB_fcn.cpp
ar rcs libHB.a honeyB_fcn.o
g++ -g -c -fPIC honeyB_lib.cpp
g++ --whole-archive -shared -o libHoneyB.so source.o honeyB_lib.o libHB.a
g++ -L. -o main main.cpp -lHoneyB
This works without issue when main() is called. However, when I compile as 32-bit with the following:
all32:
$(CC) -m32 -c -Wall -fPIC source.cpp
$(CC) -m32 -g -c -fPIC honeyB_fcn.cpp
ar rcs libHB.a honeyB_fcn.o
g++ -m32 -g -c -fPIC honeyB_lib.cpp
g++ --whole-archive -m32 -shared -o libHoneyB.so source.o honeyB_lib.o libHB.a
g++ -m32 -L. -o main main.cpp -lHoneyB
The code gives a Segmentation Fault. If I remove the reference to honeyB_libs[] in honeyB_fcn.cpp, the code compiles and executes.
Does anybody have any idea why this fails for 32-bit, but works for 64?
Thanks in advance.
Order of initialization between different translation units is undefined. You have no guarantee that global variables in HoneyB_lib.cpp will be initialized before they are used in HoneyB_fcn.cpp. The only reason it worked for the 64-bit version is because you got lucky.
There are a couple workarounds:
Define the array in honeyB_lib.h, wrapped in an anonymous namespace to get around the ODR. Each TU that includes your header will have its own copy of the array.
Again, define the array in the header, but put it inside a function that returns the array. The compiler should optimize it out everywhere, but if not, you can make the array static in the scope of the function and return it by reference (i.e. make it a singleton); a sketch of this follows below.
As a side note, I'd recommend a std::array instead of a raw array; this will let you do honeyB_libs.size() (or even for (auto&& lib : honeyB_libs) {...}) instead of relying on the "" sentinel value, which would clean up your get_array function a bit.
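A minimal sketch of the second workaround combined with the std::array suggestion, assuming C++11 is available (names mirror the question; treat it as illustrative, not a drop-in patch):
// honeyB_lib.h -- header-only, construct-on-first-use
#ifndef HONEYB_LIB_H_
#define HONEYB_LIB_H_
#include <array>
#include <string>
inline const std::array<std::string, 3>& honeyB_libs()
{
    // the local static is constructed the first time this function is called,
    // so there is no cross-TU initialization order to get wrong
    static const std::array<std::string, 3> libs = {
        "libHoneyB.so", "libHoneyB3.so", "libHoneyB2.so"
    };
    return libs;
}
#endif

// usage, e.g. in honeyB_fcn.cpp -- the "" sentinel is no longer needed:
// for (const auto& lib : honeyB_libs()) dst.push_front(lib);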
Thank you for the help. It appears that the problem had to do with how the std::string array is set up in 32-bit vs 64-bit builds. Changing honeyB_libs[] from a std::string array to a const char* array solved the issue.
honeyB_lib.h
extern const char* honeyB_libs[];
honeyB_lib.cpp
const char* honeyB_libs[] = { "libHoneyB.so", "libHoneyB3.so", "libHoneyB2.so", "" };
function.cpp
deque<string> get_array()
{
    deque<string> dst;
    string temp;
    int i = 0;
    for(;;)
    {
        if(strlen(honeyB_libs[i]) == 0)
            break;
        else
        {
            temp = honeyB_libs[i];
            dst.push_front(temp);
            i++;
        }
    }
    return dst;
}
Doing this allows my program to compile and run as both 64-bit and 32-bit.
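The likely reason this sidesteps the problem: an array of const char* initialized from string literals only needs constant initialization, which is done before any code runs, whereas an array of std::string needs dynamic initialization whose order across translation units is unspecified (the issue described in the answer above). A tiny illustrative sketch, names hypothetical:
// constant-initialized: just pointers into the read-only string table,
// safe to read from any other TU's dynamic initializers
const char* libs_c[] = { "libA.so", "libB.so", "" };

// dynamically initialized: the std::string constructors run at some unspecified
// point relative to other TUs' initializers -- reading this too early is undefined behavior
// std::string libs_s[] = { "libA.so", "libB.so", "" };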
With rvalue references and move semantics, C++11's swap/sort speed should be equal to or better than C++03's. So I designed a simple experiment to test this.
I compile and run it with -O2, with c++03 and c++11 standard.
$g++ test.cpp -O2 && ./a.out
10240000 end construction
sort 10240000 spent1.40035
$g++ test.cpp -O2 -std=c++11 && ./a.out
10240000 end construction
sort 10240000 spent2.25684
So it seems that with C++11 enabled, the program is slower.
I'm on a fairly new Mac, with this gcc environment:
$gcc -v
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
Below is source code:
#include<string>
#include<algorithm>
#include<vector>
#include<cstdlib>
#include<cstdio>
#include<iostream>
#include<ctime>
using namespace std;
string randomString()
{
    const size_t scale=600;
    char ret[scale];
    for(size_t i=0;i<scale;++i)
    {
        double rand0to1=(double)rand()/RAND_MAX;
        ret[i]=(char)rand0to1*92+33;
    }
    return ret;
}
int main()
{
    srand(time(NULL));
    const size_t scale=10240000;
    vector<string> vs;
    vs.reserve(scale);
    for(size_t i=0;i<scale;++i)
    {
        vs.push_back(randomString());
    }
    cout<<vs.size()<<" end construction\n";
    clock_t begin=clock();
    sort(vs.begin(),vs.end());
    clock_t end=clock();
    double duration=(double)(end-begin)/CLOCKS_PER_SEC;
    cout<<"sort "<<scale<<" spent"<<duration<<"\n";
    return 0;
}
Is there any error in my program or my understanding? How can I explain my test result?
Really need your expertise on this!
Your test code has several issues.
The string you generate in ret is not null-terminated, so it will pick up garbage from the stack, which is likely to change with compiler settings. This is the most likely cause of your strange results: the C++11 version ends up sorting longer strings.
Your cast applies to rand0to1 alone, so (char)rand0to1 is always 0 and every generated character is 33 ('!'); the strings are all identical apart from the garbage tail. Not an actual problem for the measurements, but probably not what you're interested in testing.
You should not use a truly random seed for benchmarking. You want to produce the same strings on every run to get reproducibility.
This fixed version of the code:
#include<string>
#include<algorithm>
#include<vector>
#include<cstdlib>
#include<cstdio>
#include<iostream>
#include<ctime>
using namespace std;
string randomString()
{
    const size_t scale=600;
    char ret[scale];
    for(size_t i=0;i<scale;++i)
    {
        double rand0to1=(double)rand()/RAND_MAX;
        ret[i]=(char)(rand0to1*92+33);
    }
    ret[scale-1] = 0;
    return ret;
}
int main()
{
    srand(1);
    const size_t scale=10240000;
    vector<string> vs;
    vs.reserve(scale);
    for(size_t i=0;i<scale;++i)
    {
        vs.push_back(randomString());
    }
    cout<<vs.size()<<" end construction\n";
    clock_t begin=clock();
    sort(vs.begin(),vs.end());
    clock_t end=clock();
    double duration=(double)(end-begin)/CLOCKS_PER_SEC;
    cout<<"sort "<<scale<<" spent "<<duration<<"\n";
    return 0;
}
produces what I believe you were expecting:
$ g++ -O2 -std=c++03 test.cpp && ./a.out
10240000 end construction
sort 10240000 spent 10.8765
$ g++ -O2 -std=c++11 test.cpp && ./a.out
10240000 end construction
sort 10240000 spent 8.72834
By the way, g++ from Xcode on a Mac is actually clang. But the results are similar:
$ clang++ -O2 -std=c++03 test.cpp && ./a.out
10240000 end construction
sort 10240000 spent 10.9408
$ clang++ -O2 -std=c++11 test.cpp && ./a.out
10240000 end construction
sort 10240000 spent 8.33261
Tested with g++ 6.2.1 and clang 3.9.0. The -std=c++03 switch is important as without it, g++ compiles in a mode which gives the fast times.
This is a simple C++ program using valarrays:
#include <iostream>
#include <valarray>
int main() {
    using ratios_t = std::valarray<float>;
    ratios_t a{0.5, 1, 2};
    const auto& res ( ratios_t::value_type(256) / a );
    for(const auto& r : ratios_t{res})
        std::cout << r << " " << std::endl;
    return 0;
}
If I compile and run it like this:
g++ -O0 main.cpp && ./a.out
The output is as expected:
512 256 128
However, if I compile and run it like this:
g++ -O3 main.cpp && ./a.out
The output is:
0 0 0
Same happens if I use -O1 optimization parameter.
GCC version is (latest in Archlinux):
$ g++ --version
g++ (GCC) 6.1.1 20160707
However, if I try with clang, both
clang++ -std=gnu++14 -O0 main.cpp && ./a.out
and
clang++ -std=gnu++14 -O3 main.cpp && ./a.out
produce the same correct result:
512 256 128
Clang version is:
$ clang++ --version
clang version 3.8.0 (tags/RELEASE_380/final)
I've also tried with GCC 4.9.2 on Debian, where executable produces the correct result.
Is this a possible bug in GCC or am I doing something wrong? Can anyone reproduce this?
EDIT: I managed to reproduce the issue also on Homebrew version of GCC 6 on Mac OS.
valarray and auto do not mix well.
This creates a temporary object, then applies operator/ to it:
const auto& res ( ratios_t::value_type(256) / a );
The libstdc++ valarray uses expression templates so that operator/ returns a lightweight object that refers to the original arguments and evaluates them lazily. You use const auto& which causes the expression template to be bound to the reference, but doesn't extend the lifetime of the temporary that the expression template refers to, so when the evaluation happens the temporary has gone out of scope, and its memory has been reused.
It will work fine if you do:
ratios_t res = ratios_t::value_type(256) / a;
Update: as of today, GCC trunk will give the expected result for this example. I've modified our valarray expression templates to be a bit less error-prone, so that it's harder (but still not impossible) to create dangling references. The new implementation should be included in GCC 9 next year.
It's the result of careless implementation of operator/ (const T& val, const std::valarray<T>& rhs) (and most probably other operators over valarrays) using lazy evaluation:
#include <iostream>
#include <valarray>
int main() {
    using ratios_t = std::valarray<float>;
    ratios_t a{0.5, 1, 2};
    float x = 256;
    const auto& res ( x / a );
    // x = 512; // <-- uncommenting this line affects the output
    for(const auto& r : ratios_t{res})
        std::cout << r << " ";
    return 0;
}
With the "x = 512" line commented out, the output is
512 256 128
Uncomment that line and the output changes to
1024 512 256
Since in your example the left-hand side argument of the division operator is a temporary, the result is undefined.
UPDATE
As Jonathan Wakely correctly pointed out, the lazy-evaluation based implementation becomes a problem in this example due to the usage of auto.
I have written a small test where I'm trying to compare the run speed of resizing a container and then subsequently using std::generate_n to fill it up. I'm comparing std::string and std::vector<char>. Here is the program:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <random>
#include <vector>
int main()
{
    std::random_device rd;
    std::default_random_engine rde(rd());
    std::uniform_int_distribution<int> uid(0, 25);
#define N 100000
#ifdef STRING
    std::cout << "String.\n";
    std::string s;
    s.resize(N);
    std::generate_n(s.begin(), N,
        [&]() { return (char)(uid(rde) + 65); });
#endif
#ifdef VECTOR
    std::cout << "Vector.\n";
    std::vector<char> v;
    v.resize(N);
    std::generate_n(v.begin(), N,
        [&]() { return (char)(uid(rde) + 65); });
#endif
    return 0;
}
And my Makefile:
test_string:
g++ -std=c++11 -O3 -Wall -Wextra -pedantic -pthread -o test test.cpp -DSTRING
valgrind --tool=callgrind --log-file="test_output" ./test
cat test_output | grep "refs"
test_vector:
g++ -std=c++11 -O3 -Wall -Wextra -pedantic -pthread -o test test.cpp -DVECTOR
valgrind --tool=callgrind --log-file="test_output" ./test
cat test_output | grep "refs"
And the comparisons for certain values of N:
N=10000
String: 1,865,367
Vector: 1,860,906
N=100000
String: 5,295,213
Vector: 5,290,757
N=1000000
String: 39,593,564
Vector: 39,589,108
std::vector<char> comes out ahead every time. Since it seems to be more performant, what is even the point of using std::string?
I used #define N 100000000, tested 3 times for each scenario, and in all scenarios string is faster. I am not using Valgrind here; measuring instruction counts with it does not make much sense for this comparison.
OS: Ubuntu 14.04. Arch: x86_64. CPU: Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz.
$COMPILER -std=c++11 -O3 -Wall -Wextra -pedantic -pthread -o test x.cc -DVECTOR
$COMPILER -std=c++11 -O3 -Wall -Wextra -pedantic -pthread -o test x.cc -DSTRING
Times:
compiler/variant | time(1) | time(2) | time(3)
---------------------------+---------+---------+--------
g++ 4.8.2/vector Times: | 1.724s | 1.704s | 1.669s
g++ 4.8.2/string Times: | 1.675s | 1.678s | 1.674s
clang++ 3.5/vector Times: | 1.929s | 1.934s | 1.905s
clang++ 3.5/string Times: | 1.616s | 1.612s | 1.619s
std::vector comes out ahead every time. Since it seems to be more
performant, what is even the point of using std::string?
Even if we suppose that your observation holds true for a wide range of different systems and different application contexts, it would still make sense to use std::string for various reasons, which are all rooted in the fact that a string has different semantics than a vector. A string is a piece of text (at least simple, non-internationalised English text), a vector is a collection of characters.
Two things come to mind:
Ease of use. std::string can be constructed from string literals, has a lot of convenient operators and can be subject to string-specific algorithms. Try std::string x = "foo" + ("bar" + boost::algorithm::replace_all_copy(f(), "abc", "ABC")).substr(0, 10) with a std::vector<char>... (a small std-only sketch of this point follows after the list).
std::string is implemented with Small-String Optimization (SSO) in MSVC, eliminating heap allocation entirely in many cases. SSO is based on the observation that strings are often very short, which certainly cannot be said about vectors.
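To make the ease-of-use point concrete without Boost, a small std-only sketch of conveniences std::string offers that std::vector<char> lacks:
#include <iostream>
#include <string>
#include <vector>
int main()
{
    std::string s = "foo";                      // constructed from a literal
    s += "bar";                                 // concatenated in place
    std::cout << s.substr(0, 3) << " / 'bar' found at " << s.find("bar") << '\n';

    const char lit[] = "foobar";
    std::vector<char> v(lit, lit + 6);          // no literal construction, no +=, no find()/substr()
    std::cout << std::string(v.begin(), v.end()) << '\n';   // converted back just to print it
    return 0;
}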
To see the SSO point in practice, try the following:
#include <iostream>
#include <vector>
#include <string>
int main()
{
    char const array[] = "short string";
#ifdef STRING
    std::cout << "String.\n";
    for (int i = 0; i < 10000000; ++i) {
        std::string s = array;
    }
#endif
#ifdef VECTOR
    std::cout << "Vector.\n";
    for (int i = 0; i < 10000000; ++i) {
        std::vector<char> v(std::begin(array), std::end(array));
    }
#endif
}
The std::string version should outperform the std::vector version, at least with MSVC. The difference is about 2-3 seconds on my machine. For longer strings, the results should be different.
Of course, this does not really prove anything either, except two things:
Performance tests depend a lot on the environment.
Performance tests should test what will realistically be done in a real program. In the case of strings, your program may deal with many small strings rather than a single huge one, so test small strings.
Consider the following scheme. We have 3 files:
main.cpp:
#include <cstdio>
#include <ctime>
#include "class.h"

int main() {
    clock_t begin = clock();
    int a = 0;
    for (int i = 0; i < 1000000000; ++i) {
        a += i;
    }
    clock_t end = clock();
    printf("Number: %d, Elapsed time: %f\n",
           a, double(end - begin) / CLOCKS_PER_SEC);
    begin = clock();
    C b(0);
    for (int i = 0; i < 1000000000; ++i) {
        b += C(i);
    }
    end = clock();
    printf("Number: %d, Elapsed time: %f\n",
           a, double(end - begin) / CLOCKS_PER_SEC);
    return 0;
}
class.h:
#include <iostream>
struct C {
public:
    int m_number;
    C(int number);
    void operator+=(const C & rhs);
};
class.cpp:
#include "class.h"

C::C(int number)
    : m_number(number)
{
}

void
C::operator+=(const C & rhs) {
    m_number += rhs.m_number;
}
Files are compiled using clang++ with flags -std=c++11 -O3.
What I expected were very similar performance results, since I thought the compiler would inline the operator calls rather than emit real function calls. The reality, though, was a bit different; here is the result:
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 5.375751
I played around a bit and found out that if I paste all of the code from class.* into main.cpp, the speed dramatically improves and the results are very similar.
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 0.000003
Then I realized that this behavior is probably caused by the fact that the compilation of main.cpp and class.cpp is completely separate, and therefore the compiler is unable to perform adequate optimizations.
My question: is there any way of keeping the 3-file scheme and still achieving the same level of optimization as if the files were merged into one and then compiled? I have read something about 'unity builds' but that seems like overkill.
Short answer
What you want is link time optimization. Try the answer from this question. I.e., try:
clang++ -O4 -emit-llvm main.cpp -c -o main.bc
clang++ -O4 -emit-llvm class.cpp -c -o class.bc
llvm-link main.bc class.bc -o all.bc
opt -std-compile-opts -std-link-opts -O3 all.bc -o optimized.bc
clang++ optimized.bc -o yourExecutable
You should see that your performance reaches the one that you had when pasting everything into main.cpp.
Long answer
The problem is that the compiler cannot inline your overloaded operator during linking, because it no longer has its definition in a form which it can use to inline it (it cannot inline bare machine code). Thus, the operator call in main.cpp will stay a real function call to the function declared in class.cpp. A function call is very expensive in comparison to a simple inlined addition which can be optimized further (e.g., vectorized).
When you enable link time optimization, the compiler is able to do this. As you see above, you first create LLVM intermediate representation bitcode (the .bc files, which I will simply call llvm code hereinafter) instead of machine code.
You then link these files to a new .bc file which still contains llvm code instead of machine code. In contrast to machine code, the compiler is able to perform inlining on llvm code. opt is the llvm optimizer (be sure to install llvm), which performs the inlining and further link time optimizations. Then, we call clang++ a final time to generate executable machine code from the optimized llvm code.
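As a side note, reasonably recent clang versions can also do this with the single -flto flag, without running llvm-link and opt by hand, provided the linker understands LLVM bitcode (e.g. lld, gold with the LLVM plugin, or Apple's ld64); a sketch:
clang++ -c -O2 -flto main.cpp
clang++ -c -O2 -flto class.cpp
clang++ -O2 -flto main.o class.o -o yourExecutable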
For People with GCC
The answer above is only for clang. GCC (g++) users must use the -flto flag both during compilation and during linking to enable link time optimization. It is simpler than with clang: simply add -flto everywhere:
g++ -c -O2 -flto main.cpp
g++ -c -O2 -flto class.cpp
g++ -o myprog -flto -O2 main.o class.o
The technique you are looking for is called Link Time Optimization.
From the timing data, it is obvious that the compiler doesn't just generate better code for the trivial case: it doesn't execute any code at all to sum up a billion numbers; the loop is evidently eliminated entirely and the result computed at compile time. That doesn't happen in real life. You are not performing a useful benchmark. You want to test code that is at least complicated enough to avoid stupid/clever things like this.
I'd re-run the test, but change the loop to
for (int i = 0; i < 1000000000; ++i) if (i != 1000000) {
// ...
}
so that the compiler is forced to actually add up the numbers.
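Another common trick (not from the answer above, just a widely used benchmarking technique) is to make the accumulator volatile, so every addition is an observable access the compiler must actually perform; keep in mind that the volatile stores themselves then become part of what you measure. A minimal sketch of the first loop:
#include <cstdio>
#include <ctime>
int main()
{
    clock_t begin = clock();
    volatile int a = 0;                  // each += below must really load and store a
    for (int i = 0; i < 1000000000; ++i) {
        a += i;                          // cannot be folded into a compile-time constant
    }
    clock_t end = clock();
    printf("Number: %d, Elapsed time: %f\n",
           a, double(end - begin) / CLOCKS_PER_SEC);
    return 0;
}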