Why std::u16string is slower than array of char16_t? - c++

After some performance experiments, it seemed that using char16_t arrays can sometimes boost performance by up to 40-50%. Using std::u16string without any copying or allocations should be as fast as a C array, yet the benchmarks show the opposite.
Here is the code I've written for the benchmark (it uses the Google Benchmark library):
#include "benchmark/benchmark.h"
#include <string>
static std::u16string str;
static char16_t *str2;
static void BM_Strings(benchmark::State &state) {
while (state.KeepRunning()) {
for (size_t i = 0; i < str.size(); i++){
benchmark::DoNotOptimize(str[i]);
}
}
}
static void BM_CharArray(benchmark::State &state) {
while (state.KeepRunning()) {
for (size_t i = 0; i < str.size(); i++){
benchmark::DoNotOptimize(str2[i]);
}
}
}
BENCHMARK(BM_Strings);
BENCHMARK(BM_CharArray);
static void init(){
str = u"Various applications of randomness have led to the development of several different methods ";
str2 = (char16_t *) str.c_str();
}
int main(int argc, char** argv) {
init();
::benchmark::Initialize(&argc, argv);
::benchmark::RunSpecifiedBenchmarks();
}
It shows the following result:
Run on (8 X 2200 MHz CPU s)
2017-07-11 23:05:57
Benchmark Time CPU Iterations
---------------------------------------------------
BM_Strings 1832 ns 1830 ns 365938
BM_CharArray 928 ns 926 ns 712577
I'm using clang (Apple LLVM version 8.1.0 (clang-802.0.42)) on mac. With optimizations turned on the gap is smaller but still noticeable:
Benchmark Time CPU Iterations
---------------------------------------------------
BM_Strings 242 ns 241 ns 2906615
BM_CharArray 161 ns 161 ns 4552165
Can someone explain what's going on here and why there is a difference?
Updated (mixed the order and added a few warm-up steps):
Benchmark Time CPU Iterations
---------------------------------------------------
BM_CharArray 670 ns 665 ns 903168
BM_Strings 856 ns 854 ns 817776
BM_CharArray 166 ns 166 ns 4369997
BM_Strings 225 ns 225 ns 3149521
Also including the compile flags I'm using:
/usr/bin/clang++ -I{some includes here} -O3 -std=c++14 -stdlib=libc++ -Wall -Wextra -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk -O3 -fsanitize=address -Werror -o CMakeFiles/BenchmarkString.dir/BenchmarkString.cpp.o -c test/benchmarks/BenchmarkString.cpp

Because of the way libc++ implements the small string optimization, on every dereference it needs to check whether the string contents are stored in the string object itself or on the heap. Because the indexing is wrapped in benchmark::DoNotOptimize, it needs to perform this check every time a character is accessed. When accessing the string data via a pointer the data is always external, and so requires no check.
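To make that concrete, here is a deliberately simplified sketch of a small-string-optimized string (not libc++'s actual layout; the member names are invented for illustration) showing why each operator[] call carries a branch that a raw char16_t* never pays:
#include <cstddef>

// Hypothetical, simplified SSO string: the real libc++ representation differs,
// but the per-access branch is the point being illustrated.
struct sso_u16string {
    static constexpr std::size_t kSsoCap = 11;
    bool is_long;                     // discriminates the two representations
    std::size_t len;
    union {
        char16_t small[kSsoCap];      // characters stored inside the object
        char16_t* heap;               // or a pointer to heap storage
    };

    char16_t& operator[](std::size_t i) {
        // This check runs on every access unless the compiler can hoist it;
        // wrapping each access in DoNotOptimize prevents that hoisting.
        return is_long ? heap[i] : small[i];
    }
};
Since str2 is a bare pointer, BM_CharArray has no such discrimination to perform, which is consistent with the measured gap.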

Interestingly I am unable to reproduce your results. I can barely detect a difference between the two.
The (incomplete) code I used is shown here:
hol::StdTimer timer;

using index_type = std::size_t;

index_type const N = 100'000'000;
index_type const SIZE = 1024;

static std::u16string s16;
static char16_t const* p16;

int main(int, char** argv)
{
    std::generate_n(std::back_inserter(s16), SIZE,
        []{ return (char)hol::random_number((int)'A', (int)'Z'); });
    p16 = s16.c_str();

    unsigned sum;
    {
        sum = 0;
        timer.start();
        for(index_type n = 0; n < N; ++n)
            for(index_type i = 0; i < SIZE; ++i)
                sum += s16[i];
        timer.stop();
        RESULT("string", sum, timer);
    }
    {
        sum = 0;
        timer.start();
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t i = 0; i < SIZE; ++i)
                sum += p16[i];
        timer.stop();
        RESULT("array ", sum, timer);
    }
}
Output:
string: (670240768) 17.575232 secs
array : (670240768) 17.546145 secs
Compiler:
GCC 7.1
g++ -std=c++14 -march=native -O3 -D NDEBUG

With a plain char16_t array you access the elements directly, while with the string you go through the overloaded operator[]:
reference
operator[](size_type __pos)
{
#ifdef _GLIBCXX_DEBUG_PEDANTIC
    __glibcxx_check_subscript(__pos);
#else
    // as an extension v3 allows s[s.size()] when s is non-const.
    _GLIBCXX_DEBUG_VERIFY(__pos <= this->size(),
                          _M_message(__gnu_debug::__msg_subscript_oob)
                          ._M_sequence(*this, "this")
                          ._M_integer(__pos, "__pos")
                          ._M_integer(this->size(), "size"));
#endif
    return _M_base()[__pos];
}
and _M_base() is:
_Base& _M_base() { return *this; }
Now, my guesses are that either:
_M_base() does not get inlined, so every read pays for an extra function call,
or
one of those subscript checks happens.
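Whatever the exact cause, a common way to sidestep any per-access overhead is to hoist the data pointer and size out of the loop so the hot path indexes raw memory. A minimal sketch against the question's benchmark (BM_StringsViaData is a hypothetical addition, not part of the original code):
static void BM_StringsViaData(benchmark::State &state) {
    while (state.KeepRunning()) {
        // Fetch the pointer and size once, instead of going through
        // operator[] (and whatever checks it performs) per character.
        const char16_t *p = str.data();
        const size_t n = str.size();
        for (size_t i = 0; i < n; i++) {
            benchmark::DoNotOptimize(p[i]);
        }
    }
}
BENCHMARK(BM_StringsViaData);
If this variant runs at char-array speed, the overhead really is in the per-element string access rather than elsewhere.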

Related

OpenMP: copying a vector using 'multithreading'

For a certain coding application I need to copy a vector consisting of big objects, so I want to make it more efficient. I'll give the old code below, along with an attempt to use OpenMP to make it more efficient.
std::vector<Object> Objects, NewObjects;
Objects.reserve(30);
NewObjects.reserve(30);

// old code
Objects = NewObjects;

// new code
omp_set_num_threads(30);
#pragma omp parallel{
    Objects[omp_get_thread_num()] = NewObjects[omp_get_thread_num()];
}
Would this give the same result? Or are there issues since I access the vector Objects? I thought it might work since I don't access the same index/Object.
omp_set_num_threads(30) does not guarantee that you obtain 30 threads; you may get fewer, and then your code will not work properly. You have to use a loop and parallelize it with OpenMP:
#pragma omp parallel for
for(size_t i = 0; i < NewObjects.size(); ++i)
{
    Objects[i] = NewObjects[i];
}
Note that it may not be faster than the serial version, because parallel execution has significant overheads.
If you use a C++17 compiler, the best idea is to use std::copy with a parallel execution policy:
std::copy(std::execution::par, NewObjects.begin(), NewObjects.end(), Objects.begin());
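As a rough, self-contained sketch of how that call fits together (assuming a C++17 toolchain; <execution> is required, and on libstdc++ the parallel policies are typically backed by TBB, so linking it may be needed; Object here is a stand-in for your real type):
#include <algorithm>
#include <execution>
#include <vector>

struct Object {                 // stand-in for the real big object
    std::vector<char> payload;
};

int main() {
    std::vector<Object> NewObjects(30);
    std::vector<Object> Objects(30);   // destination must already be sized
    std::copy(std::execution::par,
              NewObjects.begin(), NewObjects.end(),
              Objects.begin());
}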
I created a benchmark to see how fast my test machine copies objects:
#include <benchmark/benchmark.h>
#include <omp.h>
#include <vector>
constexpr int operator "" _MB(unsigned long long v) { return v * 1024 * 1024; }
class CopyableBigObject
{
public:
CopyableBigObject(const size_t s) : vec(s) {}
CopyableBigObject(const CopyableBigObject& other) = default;
CopyableBigObject(CopyableBigObject&& other) = delete;
~CopyableBigObject() = default;
CopyableBigObject& operator =(const CopyableBigObject&) = default;
CopyableBigObject& operator =(CopyableBigObject&&) = delete;
char& operator [](const int index) { return vec[index]; }
size_t size() const { return vec.size(); }
private:
std::vector<char> vec;
};
// Force some work on the objects so they are not optimized away
int calculated_value(std::vector<CopyableBigObject>& vec)
{
int sum = 0;
for (int x = 0; x < vec.size(); ++x)
{
for (int index = 0; index < vec[x].size(); index += 100)
{
sum += vec[x][index];
}
}
return sum;
}
static void BM_copy_big_objects(benchmark::State& state)
{
const size_t number_of_objects = state.range(0);
const size_t data_size = state.range(1);
for (auto _ : state)
{
std::vector<CopyableBigObject> src{ number_of_objects, CopyableBigObject(data_size) };
std::vector<CopyableBigObject> dest;
state.counters["src"] = calculated_value(src);
dest = src;
state.counters["dest"] = calculated_value(dest);
}
}
static void BM_copy_big_objects_in_parallel(benchmark::State& state)
{
const size_t number_of_objects = state.range(0);
const size_t data_size = state.range(1);
const int number_of_threads = state.range(2);
for (auto _ : state)
{
std::vector<CopyableBigObject> src{ number_of_objects, CopyableBigObject(data_size) };
std::vector<CopyableBigObject> dest{ number_of_objects, CopyableBigObject(0) };
state.counters["src"] = calculated_value(src);
#pragma omp parallel num_threads(number_of_threads)
{
if (omp_get_thread_num() == 0)
{
state.counters["number_of_threads"] = omp_get_num_threads();
}
#pragma omp for
for (int x = 0; x < src.size(); ++x)
{
dest[x] = src[x];
}
}
state.counters["dest"] = calculated_value(dest);
}
}
BENCHMARK(BM_copy_big_objects)
->Unit(benchmark::kMillisecond)
->Args({ 30, 16_MB })
->Args({ 1000, 1_MB })
->Args({ 100, 8_MB });
BENCHMARK(BM_copy_big_objects_in_parallel)
->Unit(benchmark::kMillisecond)
->Args({ 100, 1_MB, 1 })
->Args({ 100, 8_MB, 1 })
->Args({ 800, 1_MB, 1 })
->Args({ 100, 8_MB, 2 })
->Args({ 100, 8_MB, 4 })
->Args({ 100, 8_MB, 8 });
BENCHMARK_MAIN();
These are results I got on my test machine, an old Xeon workstation:
Run on (4 X 2394 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 4096 KiB (x4)
L3 Unified 16384 KiB (x1)
Load Average: 0.25, 0.14, 0.10
--------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
BM_copy_big_objects/30/16777216 30.9 ms 30.5 ms 24 dest=0 src=0
BM_copy_big_objects/1000/1048576 0.352 ms 0.349 ms 1987 dest=0 src=0
BM_copy_big_objects/100/8388608 4.62 ms 4.57 ms 155 dest=0 src=0
BM_copy_big_objects_in_parallel/100/1048576/1 0.359 ms 0.355 ms 2028 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/100/8388608/1 4.67 ms 4.61 ms 151 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/800/1048576/1 0.357 ms 0.353 ms 1983 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/100/8388608/2 5.29 ms 5.23 ms 132 dest=0 number_of_threads=2 src=0
BM_copy_big_objects_in_parallel/100/8388608/4 5.32 ms 5.25 ms 133 dest=0 number_of_threads=4 src=0
BM_copy_big_objects_in_parallel/100/8388608/8 5.57 ms 3.98 ms 175 dest=0 number_of_threads=8 src=0
As I expected, parallelizing copying does not improve performance. However, copying large objects is slower than I expected.
Given you stated that you use C++14, there are a number of things you can try which could improve performance:
Move the objects using the move-constructor / move-assignment combination or unique_ptr instead of copying.
Defer making copies of member variables until you really need them by using Copy-On-Write.
This will make copying cheap until you have to update a big object.
If a large proportion of your objects are not updated after they have been copied, then you should get a performance boost (a minimal sketch of the idea appears at the end of this answer).
Make sure your class definitions are using the most compact representation. I have seen classes be different sizes depending on whether it is a release build or a debug build because the compiler was using padding for the release build but not the debug build.
Possibly rewrite so copying is avoided altogether.
Without knowing the specific details of your objects, it is not possible to give a specific answer. However, this should point to a full solution.
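For the Copy-On-Write item above, here is a minimal sketch of the idea (CowBigObject is a hypothetical wrapper; the use_count check is not thread-safe and only illustrates deferring the deep copy until the first write):
#include <cstddef>
#include <memory>
#include <vector>

// Copies of CowBigObject share one payload until someone writes to it.
class CowBigObject {
public:
    explicit CowBigObject(std::size_t s)
        : data_(std::make_shared<std::vector<char>>(s)) {}

    // Copying only bumps a reference count instead of copying the payload.
    CowBigObject(const CowBigObject&) = default;
    CowBigObject& operator=(const CowBigObject&) = default;

    char read(std::size_t i) const { return (*data_)[i]; }

    void write(std::size_t i, char v) {
        // Detach (deep-copy) only when the payload is shared and about to change.
        if (data_.use_count() > 1)
            data_ = std::make_shared<std::vector<char>>(*data_);
        (*data_)[i] = v;
    }

private:
    std::shared_ptr<std::vector<char>> data_;
};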

How to improve the speed of merkle root calculation in C++?

I am trying to optimise the merkle root calculation as much as possible. So far, I implemented it in Python which resulted in this question and the suggestion to rewrite it in C++.
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>
std::vector<unsigned char> double_sha256(std::vector<unsigned char> a, std::vector<unsigned char> b)
{
unsigned char inp[64];
int j=0;
for (int i=0; i<32; i++)
{
inp[j] = a[i];
j++;
}
for (int i=0; i<32; i++)
{
inp[j] = b[i];
j++;
}
const EVP_MD *md_algo = EVP_sha256();
unsigned int md_len = EVP_MD_size(md_algo);
std::vector<unsigned char> out( md_len );
EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
return out;
}
std::vector<std::vector<unsigned char> > calculate_merkle_root(std::vector<std::vector<unsigned char> > inp_list)
{
std::vector<std::vector<unsigned char> > out;
int len = inp_list.size();
if (len == 1)
{
out.push_back(inp_list[0]);
return out;
}
for (int i=0; i<len-1; i+=2)
{
out.push_back(
double_sha256(inp_list[i], inp_list[i+1])
);
}
if (len % 2 == 1)
{
out.push_back(
double_sha256(inp_list[len-1], inp_list[len-1])
);
}
return calculate_merkle_root(out);
}
int main()
{
std::ifstream infile("txids.txt");
std::vector<std::vector<unsigned char> > txids;
std::string line;
int count = 0;
while (std::getline(infile, line))
{
unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
std::vector<unsigned char> buf2;
for (int i=31; i>=0; i--)
{
buf2.push_back(
buf[i]
);
}
txids.push_back(
buf2
);
count++;
}
infile.close();
std::cout << count << std::endl;
std::vector<std::vector<unsigned char> > merkle_root_hash;
for (int k=0; k<1000; k++)
{
merkle_root_hash = calculate_merkle_root(txids);
}
std::vector<unsigned char> out0 = merkle_root_hash[0];
std::vector<unsigned char> out;
for (int i=31; i>=0; i--)
{
out.push_back(
out0[i]
);
}
static const char alpha[] = "0123456789abcdef";
for (int i=0; i<32; i++)
{
unsigned char c = out[i];
std::cout << alpha[ (c >> 4) & 0xF];
std::cout << alpha[ c & 0xF];
}
std::cout.put('\n');
return 0;
}
However, the performance is worse compared to the Python implementation (~4s):
$ g++ test.cpp -L/usr/local/opt/openssl/lib -I/usr/local/opt/openssl/include -lcrypto
$ time ./a.out
1452
289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e
real 0m9.245s
user 0m9.235s
sys 0m0.008s
The complete implementation and the input file are available here: test.cpp and txids.txt.
How can I improve the performance? Are the compiler optimizations enabled by default? Are there faster sha256 libraries than openssl available?
There are plenty of things you can do to optimize the code.
Here is the list of the important points:
compiler optimizations need to be enabled (using -O3 in GCC);
std::array can be used instead of the slower dynamically-sized std::vector (since the size of a hash is 32), one can even define a new Hash type for clarity;
parameters should be passed by reference (C++ passes parameters by copy by default);
the C++ vectors can be reserved to pre-allocate the memory space and avoid unneeded copies;
OPENSSL_free must be called to release the allocated memory of OPENSSL_hexstr2buf;
push_back should be avoided when the size is a constant known at compile time;
using std::copy is often faster (and cleaner) than a manual copy;
std::reverse is often faster (and cleaner) than a manual loop;
the size of a hash is supposed to be 32, but one can check that using assertions to be sure it is fine;
count is not needed as it is the size of the txids vector;
Here is the resulting code:
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <cstring>
#include <array>
#include <algorithm>
#include <cassert>
#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>
using Hash = std::array<unsigned char, 32>;
Hash double_sha256(const Hash& a, const Hash& b)
{
assert(a.size() == 32 && b.size() == 32);
unsigned char inp[64];
std::copy(a.begin(), a.end(), inp);
std::copy(b.begin(), b.end(), inp+32);
const EVP_MD *md_algo = EVP_sha256();
assert(EVP_MD_size(md_algo) == 32);
unsigned int md_len = 32;
Hash out;
EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
return out;
}
std::vector<Hash> calculate_merkle_root(const std::vector<Hash>& inp_list)
{
std::vector<Hash> out;
int len = inp_list.size();
out.reserve(len/2+2);
if (len == 1)
{
out.push_back(inp_list[0]);
return out;
}
for (int i=0; i<len-1; i+=2)
{
out.push_back(double_sha256(inp_list[i], inp_list[i+1]));
}
if (len % 2 == 1)
{
out.push_back(double_sha256(inp_list[len-1], inp_list[len-1]));
}
return calculate_merkle_root(out);
}
int main()
{
std::ifstream infile("txids.txt");
std::vector<Hash> txids;
std::string line;
while (std::getline(infile, line))
{
unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
Hash buf2;
std::copy(buf, buf+32, buf2.begin());
std::reverse(buf2.begin(), buf2.end());
txids.push_back(buf2);
OPENSSL_free(buf);
}
infile.close();
std::cout << txids.size() << std::endl;
std::vector<Hash> merkle_root_hash;
for (int k=0; k<1000; k++)
{
merkle_root_hash = calculate_merkle_root(txids);
}
Hash out0 = merkle_root_hash[0];
Hash out = out0;
std::reverse(out.begin(), out.end());
static const char alpha[] = "0123456789abcdef";
for (int i=0; i<32; i++)
{
unsigned char c = out[i];
std::cout << alpha[ (c >> 4) & 0xF];
std::cout << alpha[ c & 0xF];
}
std::cout.put('\n');
return 0;
}
On my machine, this code is 3 times faster than the initial version and 2 times faster than the Python implementation.
This implementation spends >98% of its time in EVP_Digest. As a result, if you want faster code, you could try to find a faster hashing library, although OpenSSL should already be pretty fast. The current code already manages to compute 1.7 million hashes per second sequentially on a mainstream CPU, which is quite good. Alternatively, you can also parallelize the program using OpenMP (this is roughly 5 times faster on my 6-core machine); a sketch of that idea follows.
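As a rough sketch of that OpenMP idea, applied to one level of the pair-hashing loop from the code above (calculate_merkle_level is a hypothetical helper name; the output is pre-sized so each iteration writes an independent slot, and the file must be compiled with -fopenmp):
std::vector<Hash> calculate_merkle_level(const std::vector<Hash>& inp_list)
{
    const int len = inp_list.size();
    if (len == 1)
        return inp_list;                    // root reached; nothing to combine
    // Pre-size the output so parallel iterations never touch the same element.
    std::vector<Hash> out((len + 1) / 2);
    #pragma omp parallel for
    for (int i = 0; i < len / 2; ++i)
        out[i] = double_sha256(inp_list[2 * i], inp_list[2 * i + 1]);
    if (len % 2 == 1)
        out.back() = double_sha256(inp_list[len - 1], inp_list[len - 1]);
    return out;
}
Each level is then reduced in turn (recursively or in a loop), exactly as calculate_merkle_root already does.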
I decided to implement Merkle root and SHA-256 computation from scratch, with full SHA-256 implemented, using a SIMD (Single Instruction, Multiple Data) approach for SSE2, AVX2, and AVX512.
My code below, in the AVX2 case, is about 3.5x faster than the OpenSSL version, and 7.3x faster than Python's hashlib implementation.
Here I provide the C++ implementation; I also made a Python implementation with the same speed (because it uses the C++ code at its core), see the related post for that. The Python implementation is definitely easier to use than the C++ one.
My code is quite complex, both because it contains a full SHA-256 implementation and because it has a class abstracting the SIMD operations, plus many tests.
First I provide timings, made on Google Colab because they have a quite advanced AVX2 processor there:
MerkleRoot-Ossl 1274 ms
MerkleRoot-Simd-GEN-1 1613 ms
MerkleRoot-Simd-GEN-2 1795 ms
MerkleRoot-Simd-GEN-4 788 ms
MerkleRoot-Simd-GEN-8 423 ms
MerkleRoot-Simd-SSE2-1 647 ms
MerkleRoot-Simd-SSE2-2 626 ms
MerkleRoot-Simd-SSE2-4 690 ms
MerkleRoot-Simd-AVX2-1 407 ms
MerkleRoot-Simd-AVX2-2 403 ms
MerkleRoot-Simd-AVX2-4 489 ms
Ossl is the OpenSSL implementation for comparison; the rest are my implementations. AVX512 gives an even bigger speed improvement; it is not tested here because Colab has no AVX512 support. The actual speedup depends on the processor's capabilities.
Compilation is tested both in Windows (MSVC) and Linux (CLang), using following commands:
Windows with OpenSSL support: cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1 -DSHS_HAS_OPENSSL=1 /MD -Id:/bin/OpenSSL/include/ /link /LIBPATH:d:/bin/OpenSSL/lib/ libcrypto_static.lib libssl_static.lib Advapi32.lib User32.lib Ws2_32.lib (provide your directory with installed OpenSSL). If OpenSSL support is not needed, use cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1. Here too, instead of AVX2 you may use SSE2 or AVX512. Windows OpenSSL can be downloaded from here.
Linux CLang compilation is done through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe -DSHS_HAS_OPENSSL=1 -lssl -lcrypto if OpenSSL is needed, and if not needed then clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe. As you can see, the most recent clang-12 is used; to install it run bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" (this command is described here). The Linux version automatically detects the current CPU architecture and uses the best SIMD instruction set.
My code needs C++20 standard support, as it uses some advanced features that make the implementation easier.
I implemented OpenSSL support in my library only to compare timings, to show that my AVX2 version is 3-3.5x faster.
I'm also providing timings done on GodBolt, but those are only an example of AVX-512 usage, as GodBolt CPUs have advanced AVX-512 support. Don't use GodBolt to actually measure timings, because the timings there jump up and down by up to 5x, seemingly because of active processes being evicted by the operating system. I'm also providing a GodBolt playground link (this link may have slightly outdated code; use the newest link to the code at the bottom of my post):
MerkleRoot-Ossl 2305 ms
MerkleRoot-Simd-GEN-1 2982 ms
MerkleRoot-Simd-GEN-2 3078 ms
MerkleRoot-Simd-GEN-4 1157 ms
MerkleRoot-Simd-GEN-8 781 ms
MerkleRoot-Simd-GEN-16 349 ms
MerkleRoot-Simd-SSE2-1 387 ms
MerkleRoot-Simd-SSE2-2 769 ms
MerkleRoot-Simd-SSE2-4 940 ms
MerkleRoot-Simd-AVX2-1 251 ms
MerkleRoot-Simd-AVX2-2 253 ms
MerkleRoot-Simd-AVX2-4 777 ms
MerkleRoot-Simd-AVX512-1 257 ms
MerkleRoot-Simd-AVX512-2 741 ms
MerkleRoot-Simd-AVX512-4 961 ms
Examples of my code usage can be seen inside the Test() function, which tests all the functionality of my library. My code is a bit dirty because I didn't want to spend much time creating a beautiful library, but rather to make a proof of concept that a SIMD-based implementation can be considerably faster than the OpenSSL version.
If you really want to use my boosted SIMD-based version instead of OpenSSL, if you care about speed very much, and you have questions about how to use it, please ask me in the comments or in chat.
Also, I didn't bother implementing a multi-core/multi-threaded version; I think it is obvious how to do that, and you can and should implement it without difficulty.
I'm providing an external link to the code below, because my code is around 51 KB in size, which exceeds the allowed 30 KB of text for a StackOverflow post.
sha256_simd.cpp

Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?

EDIT: I added two more benchmarks, to compare the use of realloc with the C array and of reserve() with the std::vector. From the last analysis it seems that realloc has a large influence, even if called only 30 times. Checking the documentation, I guess this is due to the fact that realloc can return a completely new pointer, copying the old one.
To complete the scenario I also added the code and graph for fully allocating the array during initialisation. The difference from reserve() is tangible.
Compile flags: only the optimisation described in the graph, compiling with g++ and nothing more.
Original question:
I made a benchmark comparing std::vector against a new/delete array: when I add 1 billion integers, the second code is dramatically faster than the one using the vector, especially with optimisation turned on.
I suspect that this is caused by the vector internally calling realloc too many times. This would be the case if the vector did not grow by doubling its size every time it gets filled (here the number 2 has nothing special; what matters is that its size grows geometrically).
In such a case the calls to realloc would be only O(log n) instead of O(n).
If this is what causes the slowness of the first code, how can I tell std::vector to grow geometrically?
Note that calling reserve once would work in this case but not in the more general case in which the number of push_back is not known in advance.
black line
#include <vector>

int main(int argc, char * argv[]) {
    const unsigned long long size = 1000000000;
    std::vector<int> b(size);
    for(int i = 0; i < size; i++) {
        b[i] = i;
    }
    return 0;
}
blue line
#include <vector>

int main(int argc, char * argv[]) {
    const int size = 1000000000;
    std::vector<int> b;
    for(int i = 0; i < size; i++) {
        b.push_back(i);
    }
    return 0;
}
green line
#include <vector>

int main(int argc, char * argv[]) {
    const int size = 1000000000;
    std::vector<int> b;
    b.reserve(size);
    for(int i = 0; i < size; i++) {
        b.push_back(i);
    }
    return 0;
}
red line
int main(int argc, char * argv[]) {
    const int size = 1000000000;
    int * a = new int [size];
    for(int i = 0; i < size; i++) {
        a[i] = i;
    }
    delete [] a;
    return 0;
}
orange line
#include <cstdlib>

int main(int argc, char * argv[]) {
    const unsigned long long size = 1000000000;
    int * a = (int *)malloc(size*sizeof(int));
    int next_power = 1;
    for(int i = 0; i < size; i++) {
        a[i] = i;
        if(i == next_power - 1) {
            next_power *= 2;
            a = (int*)realloc(a, next_power*sizeof(int));
        }
    }
    free(a);
    return 0;
}
EDIT: checking .capacity(), as suggested, we saw that the growth is indeed exponential. So why is the vector so slow?
The optimized C style array is optimized to nothing.
On godbolt:
xorl %eax, %eax
retq
that is the program.
Whenever you have a program optimized to nearly 0s you should consider this possibility.
The optimizer sees you are doing nothing with the allocated memory, notes that allocating unused memory may have zero side effects, and eliminates the allocation.
And writing to memory then never reading it also has zero side effects.
In comparison, the compiler has difficulty proving that the vector's allocation is useless. Probably the compiler developers could teach it to recognize unused std vectors like they recognize unused raw C arrays, but that optimization really is a corner case, and it causes lots of problems profiling in my experience.
Note that the vector-with-reserve at any optimization level is basically the same speed as the unoptimized C style version.
In the C style code, the only thing to optimize is "don't do anything". In the vector code, the unoptimized version is full of extra stack frames and debug checks to ensure you don't go out of bounds (and crash cleanly if you do).
Note that on a Linux system, allocating huge chunks of memory doesn't do anything except fiddle with the virtual memory table. Only when the memory is touched does it actually find some zero'd physical memory for you.
Without reserve, the std vector has to guess an initial small size, resize it and copy it, and repeat. This causes a 50% performance loss, which seems reasonable to me.
With reserve, it actually does the work. The work takes just under 5s.
Adding to a vector via push_back does cause it to grow geometrically. Geometric growth results in an asymptotic average of 2-3 copies of each piece of data being made.
As for realloc, std::vector does not realloc. It allocates a new buffer, and copies the old data, then discards the old one.
Realloc attempts to grow the buffer, and if it cannot it bitwise copies the buffer.
That is more efficient than std vector can manage for bitwise copyable types. I'd bet the realloc version actually never copies; there is always memory space to grow the vector into (in a real program this may not be the case).
The lack of realloc in std library allocators is a minor flaw. You'd have to invent a new API for it, because you'd want it to work for non-bitwise copy (something like "try grow allocated memory", which if it fails leaves it up to you to grow the allocation).
when I add 1 billion integers and the second code is dramatically faster than the one using the vector
That's... completely unsurprising. One of your cases involves a dynamically sized container that has to readjust for its load, and the other involves a fixed size container that doesn't. The latter simply has to do way less work, no branching, no additional allocations. The fair comparison would be:
std::vector<int> b(size);
for(int i = 0; i < size; i++) {
    b[i] = i;
}
This now does the same thing as your array example (well, almost - new int[size] default-initializes all the ints whereas std::vector<int>(size) zero-initializes them, so it's still more work).
It doesn't really make sense to compare these two to each other. If the fixed-size int array fits your use case, then use it. If it doesn't, then don't. You either need a dynamically sized container or not. If you do, performing slower than a fixed-size solution is something you're implicitly giving up.
If this is what causes the slowness of the first code, how can I tell std::vector to grow geometrically?
std::vector is already mandated to grow geometrically; it's the only way to maintain O(1) amortized push_back complexity.
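If you want to see that growth for yourself, a small sketch that prints capacity() whenever it changes makes the geometric pattern visible (the exact growth factor is implementation-defined, commonly 1.5x or 2x):
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    std::size_t last_cap = v.capacity();
    for (int i = 0; i < 1000000; ++i) {
        v.push_back(i);
        if (v.capacity() != last_cap) {
            // Each reallocation multiplies the capacity by a constant factor.
            last_cap = v.capacity();
            std::cout << "size " << v.size() << " -> capacity " << last_cap << '\n';
        }
    }
}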
Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?
Your test neither supports that conclusion, nor does it prove the opposite. However, I would assume that reallocation is called a linear number of times unless there is contrary evidence.
Update: Your new test is apparently evidence against your non-logarithmic reallocation hypothesis.
I suspect that this is caused by the vector internally calling realloc too many times.
Update: Your new test shows that some of the difference is due to reallocations... but not all. I suspect that the remainder is due to the fact that the optimizer was able to prove (but only in the case of the non-growing array) that the array values are unused, and chose not to loop and write them at all. If you were to make sure that the written values are actually used, then I would expect that the non-growing array would have similar optimized performance to the reserving vector.
The difference (between reserving code and non-reserving vector) in optimized build is most likely due to doing more reallocations (compared to no reallocations of the reserved array). Whether the number of reallocations is too much is situational and subjective. The downside of doing fewer reallocations is more wasted space due to overallocation.
Note that the cost of reallocation of large arrays comes primarily from copying of elements, rather than memory allocation itself.
In unoptimized build, there is probably additional linear overhead due to function calls that weren't expanded inline.
how can I tell std::vector to grow geometrically?
Geometric growth is required by the standard. There is no way and no need to tell std::vector to use geometric growth.
Note that calling reserve once would work in this case but not in the more general case in which the number of push_back is not known in advance.
However, a general case in which the number of push_back is not known in advance is a case where the non-growing array isn't even an option and so its performance is irrelevant for that general case.
This isn't comparing geometric growth to arithmetic (or any other) growth. It's comparing pre-allocating all the space necessary to growing the space as needed. So let's start by comparing std::vector to some code that actually does use geometric growth, and use both in ways that put the geometric growth to use¹. So, here's a simple class that does geometric growth:
class my_vect {
    int *data;
    size_t current_used;
    size_t current_alloc;
public:
    my_vect()
        : data(nullptr)
        , current_used(0)
        , current_alloc(0)
    {}

    void push_back(int val) {
        if (nullptr == data) {
            data = new int[1];
            current_alloc = 1;
        }
        else if (current_used == current_alloc) {
            int *temp = new int[current_alloc * 2];
            for (size_t i=0; i<current_used; i++)
                temp[i] = data[i];
            swap(temp, data);
            delete [] temp;
            current_alloc *= 2;
        }
        data[current_used++] = val;
    }

    int &at(size_t index) {
        if (index >= current_used)
            throw bad_index();
        return data[index];
    }

    int &operator[](size_t index) {
        return data[index];
    }

    ~my_vect() { delete [] data; }
};
...and here's some code to exercise it (and do the same with std::vector):
int main() {
    std::locale out("");
    std::cout.imbue(out);
    using namespace std::chrono;

    std::cout << "my_vect\n";
    for (int size = 100; size <= 1000000000; size *= 10) {
        auto start = high_resolution_clock::now();
        my_vect b;
        for(int i = 0; i < size; i++) {
            b.push_back(i);
        }
        auto stop = high_resolution_clock::now();
        std::cout << "Size: " << std::setw(15) << size << ", Time: " << std::setw(15) << duration_cast<microseconds>(stop-start).count() << " us\n";
    }

    std::cout << "\nstd::vector\n";
    for (int size = 100; size <= 1000000000; size *= 10) {
        auto start = high_resolution_clock::now();
        std::vector<int> b;
        for (int i = 0; i < size; i++) {
            b.push_back(i);
        }
        auto stop = high_resolution_clock::now();
        std::cout << "Size: " << std::setw(15) << size << ", Time: " << std::setw(15) << duration_cast<microseconds>(stop - start).count() << " us\n";
    }
}
I compiled this with g++ -std=c++14 -O3 my_vect.cpp. When I execute that, I get this result:
my_vect
Size: 100, Time: 8 us
Size: 1,000, Time: 23 us
Size: 10,000, Time: 141 us
Size: 100,000, Time: 950 us
Size: 1,000,000, Time: 8,040 us
Size: 10,000,000, Time: 51,404 us
Size: 100,000,000, Time: 442,713 us
Size: 1,000,000,000, Time: 7,936,013 us
std::vector
Size: 100, Time: 40 us
Size: 1,000, Time: 4 us
Size: 10,000, Time: 29 us
Size: 100,000, Time: 426 us
Size: 1,000,000, Time: 3,730 us
Size: 10,000,000, Time: 41,294 us
Size: 100,000,000, Time: 363,942 us
Size: 1,000,000,000, Time: 5,200,545 us
I undoubtedly could optimize the my_vect to keep up with std::vector (e.g., initially allocating space for, say, 256 items would probably be a pretty large help). I haven't attempted to do enough runs (and statistical analysis) to be at all sure that std::vector is really dependably faster than my_vect either. Nonetheless, this seems to indicate that when we compare apples to apples, we get results that are at least roughly comparable (e.g., within a fairly small, constant factor of each other).
1. As a side note, I feel obliged to point out that this still doesn't really compare apples to apples--but at least as long as we're only instantiating std::vector over int, many of the obvious differences are basically covered up.
This post includes:
Wrapper classes over realloc and mremap to provide reallocation functionality.
A custom vector class.
A performance test.
// C++17
#include <benchmark/benchmark.h> // Google Benchmark lib, for benchmarking.
#include <new> // For std::bad_alloc.
#include <memory> // For std::allocator_traits, std::uninitialized_move.
#include <cstdlib> // For C heap management API.
#include <cstddef> // For std::size_t, std::max_align_t.
#include <cassert> // For assert.
#include <utility> // For std::forward, std::declval,
namespace linux {
#include <sys/mman.h> // For mmap, mremap, munmap.
#include <errno.h> // For errno.
auto get_errno() noexcept {
return errno;
}
}
/*
 * Allocators.
 * These allocators will have non-standard-compliant behavior if type T's copy ctor has side effects.
 */

// class mrealloc is useful for allocating small amounts of space for
// std::vector.
//
// Can prevent copying of data and memory fragmentation if there's enough
// contiguous memory at the original place.
template <class T>
struct mrealloc {
using pointer = T*;
using value_type = T;
auto allocate(std::size_t len) {
if (auto ret = std::malloc(len))
return static_cast<pointer>(ret);
else
throw std::bad_alloc();
}
auto reallocate(pointer old_ptr, std::size_t old_len, std::size_t len) {
if (auto ret = std::realloc(old_ptr, len))
return static_cast<pointer>(ret);
else
throw std::bad_alloc();
}
void deallocate(void *ptr, std::size_t len) noexcept {
std::free(ptr);
}
};
// class mmaprealloc is suitable for large memory use.
//
// It is useful for situations where std::vector can grow to a huge
// size.
//
// The user can call reserve without worrying about wasting a lot of memory.
//
// It can prevent data copies and memory fragmentation at any time.
template <class T>
struct mmaprealloc {
using pointer = T*;
using value_type = T;
auto allocate(std::size_t len) const
{
return allocate_impl(len, MAP_PRIVATE | MAP_ANONYMOUS);
}
auto reallocate(pointer old_ptr, std::size_t old_len, std::size_t len) const
{
return reallocate_impl(old_ptr, old_len, len, MREMAP_MAYMOVE);
}
void deallocate(pointer ptr, std::size_t len) const noexcept
{
assert(linux::munmap(ptr, len) == 0);
}
protected:
auto allocate_impl(std::size_t _len, int flags) const
{
if (auto ret = linux::mmap(nullptr, get_proper_size(_len), PROT_READ | PROT_WRITE, flags, -1, 0))
return static_cast<pointer>(ret);
else
fail(EAGAIN | ENOMEM);
}
auto reallocate_impl(pointer old_ptr, std::size_t old_len, std::size_t _len, int flags) const
{
if (auto ret = linux::mremap(old_ptr, old_len, get_proper_size(_len), flags))
return static_cast<pointer>(ret);
else
fail(EAGAIN | ENOMEM);
}
static inline constexpr const std::size_t magic_num = 4096 - 1;
static inline auto get_proper_size(std::size_t len) noexcept -> std::size_t {
return round_to_pagesize(len);
}
static inline auto round_to_pagesize(std::size_t len) noexcept -> std::size_t {
return (len + magic_num) & ~magic_num;
}
static inline void fail(int assert_val)
{
auto _errno = linux::get_errno();
assert(_errno == assert_val);
throw std::bad_alloc();
}
};
template <class T>
struct mmaprealloc_populate_ver: mmaprealloc<T> {
auto allocate(size_t len) const
{
return mmaprealloc<T>::allocate_impl(len, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE);
}
};
namespace impl {
struct disambiguation_t2 {};
struct disambiguation_t1 {
constexpr operator disambiguation_t2() const noexcept { return {}; }
};
template <class Alloc>
static constexpr auto has_reallocate(disambiguation_t1) noexcept -> decltype(&Alloc::reallocate, bool{}) { return true; }
template <class Alloc>
static constexpr bool has_reallocate(disambiguation_t2) noexcept { return false; }
template <class Alloc>
static inline constexpr const bool has_reallocate_v = has_reallocate<Alloc>(disambiguation_t1{});
} /* impl */
template <class Alloc>
struct allocator_traits: public std::allocator_traits<Alloc> {
using Base = std::allocator_traits<Alloc>;
using value_type = typename Base::value_type;
using pointer = typename Base::pointer;
using size_t = typename Base::size_type;
static auto reallocate(Alloc &alloc, pointer prev_ptr, size_t prev_len, size_t new_len) {
if constexpr(impl::has_reallocate_v<Alloc>)
return alloc.reallocate(prev_ptr, prev_len, new_len);
else {
auto new_ptr = Base::allocate(alloc, new_len);
// Move existing array
for(auto _prev_ptr = prev_ptr, _new_ptr = new_ptr; _prev_ptr != prev_ptr + prev_len; ++_prev_ptr, ++_new_ptr) {
new (_new_ptr) value_type(std::move(*_prev_ptr));
_new_ptr->~value_type();
}
Base::deallocate(alloc, prev_ptr, prev_len);
return new_ptr;
}
}
};
template <class T, class Alloc = std::allocator<T>>
struct vector: protected Alloc {
using alloc_traits = allocator_traits<Alloc>;
using pointer = typename alloc_traits::pointer;
using size_t = typename alloc_traits::size_type;
pointer ptr = nullptr;
size_t last = 0;
size_t avail = 0;
~vector() noexcept {
alloc_traits::deallocate(*this, ptr, avail);
}
template <class ...Args>
void emplace_back(Args &&...args) {
if (last == avail)
double_the_size();
alloc_traits::construct(*this, &ptr[last++], std::forward<Args>(args)...);
}
void double_the_size() {
if (__builtin_expect(!!(avail), true)) {
avail <<= 1;
ptr = alloc_traits::reallocate(*this, ptr, last, avail);
} else {
avail = 1 << 4;
ptr = alloc_traits::allocate(*this, avail);
}
}
};
template <class T>
static void BM_vector(benchmark::State &state) {
for(auto _: state) {
T c;
for(auto i = state.range(0); --i >= 0; )
c.emplace_back((char)i);
}
}
static constexpr const auto one_GB = 1 << 30;
BENCHMARK_TEMPLATE(BM_vector, vector<char>) ->Range(1 << 3, one_GB);
BENCHMARK_TEMPLATE(BM_vector, vector<char, mrealloc<char>>) ->Range(1 << 3, one_GB);
BENCHMARK_TEMPLATE(BM_vector, vector<char, mmaprealloc<char>>) ->Range(1 << 3, one_GB);
BENCHMARK_TEMPLATE(BM_vector, vector<char, mmaprealloc_populate_ver<char>>)->Range(1 << 3, one_GB);
BENCHMARK_MAIN();
Performance test.
All the performance tests were done on:
Debian 9.4, Linux version 4.9.0-6-amd64 (debian-kernel#lists.debian.org)(gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02)
Compiled using clang++ -std=c++17 -lbenchmark -lpthread -Ofast main.cc
The command I used to run this test:
sudo cpupower frequency-set --governor performance
./a.out
Here's the output of google benchmark test:
Run on (8 X 1600 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 6144K (x1)
----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------------------------
BM_vector<vector<char>>/8 58 ns 58 ns 11476934
BM_vector<vector<char>>/64 324 ns 324 ns 2225396
BM_vector<vector<char>>/512 1527 ns 1527 ns 453629
BM_vector<vector<char>>/4096 7196 ns 7196 ns 96695
BM_vector<vector<char>>/32768 50145 ns 50140 ns 13655
BM_vector<vector<char>>/262144 549821 ns 549825 ns 1245
BM_vector<vector<char>>/2097152 5007342 ns 5006393 ns 146
BM_vector<vector<char>>/16777216 42873349 ns 42873462 ns 15
BM_vector<vector<char>>/134217728 336225619 ns 336097218 ns 2
BM_vector<vector<char>>/1073741824 2642934606 ns 2642803281 ns 1
BM_vector<vector<char, mrealloc<char>>>/8 55 ns 55 ns 12914365
BM_vector<vector<char, mrealloc<char>>>/64 266 ns 266 ns 2591225
BM_vector<vector<char, mrealloc<char>>>/512 1229 ns 1229 ns 567505
BM_vector<vector<char, mrealloc<char>>>/4096 6903 ns 6903 ns 102752
BM_vector<vector<char, mrealloc<char>>>/32768 48522 ns 48523 ns 14409
BM_vector<vector<char, mrealloc<char>>>/262144 399470 ns 399368 ns 1783
BM_vector<vector<char, mrealloc<char>>>/2097152 3048578 ns 3048619 ns 229
BM_vector<vector<char, mrealloc<char>>>/16777216 24426934 ns 24421774 ns 29
BM_vector<vector<char, mrealloc<char>>>/134217728 262355961 ns 262357084 ns 3
BM_vector<vector<char, mrealloc<char>>>/1073741824 2092577020 ns 2092317044 ns 1
BM_vector<vector<char, mmaprealloc<char>>>/8 4285 ns 4285 ns 161498
BM_vector<vector<char, mmaprealloc<char>>>/64 5485 ns 5485 ns 125375
BM_vector<vector<char, mmaprealloc<char>>>/512 8571 ns 8569 ns 80345
BM_vector<vector<char, mmaprealloc<char>>>/4096 24248 ns 24248 ns 28655
BM_vector<vector<char, mmaprealloc<char>>>/32768 165021 ns 165011 ns 4421
BM_vector<vector<char, mmaprealloc<char>>>/262144 1177041 ns 1177048 ns 557
BM_vector<vector<char, mmaprealloc<char>>>/2097152 9229860 ns 9230023 ns 74
BM_vector<vector<char, mmaprealloc<char>>>/16777216 75425704 ns 75426431 ns 9
BM_vector<vector<char, mmaprealloc<char>>>/134217728 607661012 ns 607662273 ns 1
BM_vector<vector<char, mmaprealloc<char>>>/1073741824 4871003928 ns 4870588050 ns 1
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/8 3956 ns 3956 ns 175037
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/64 5087 ns 5086 ns 133944
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/512 8662 ns 8662 ns 80579
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/4096 23883 ns 23883 ns 29265
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/32768 158374 ns 158376 ns 4444
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/262144 1171514 ns 1171522 ns 593
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/2097152 9297357 ns 9293770 ns 74
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/16777216 75140789 ns 75141057 ns 9
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/134217728 636359403 ns 636368640 ns 1
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/1073741824 4865103542 ns 4864582150 ns 1

Vectors and Arrays in C++

The performance difference between C++ vectors and plain arrays has been extensively discussed, for example here and here. Usually the discussions conclude that vectors and arrays are similar in terms of performance when accessed with the [] operator and the compiler is able to inline functions. That is what I expected, but I came across a case where it seems not to be true. The functionality of the lines below is quite simple: a 3D volume is taken and swept, and some kind of small 3D mask is applied a certain number of times. Depending on the VERSION macro, the volumes will be declared as vectors and accessed through the at operator (VERSION=2), declared as vectors and accessed via [] (VERSION=1), or declared as plain arrays (VERSION=0).
#include <vector>
#define NX 100
#define NY 100
#define NZ 100
#define H 1
#define C0 1.5f
#define C1 0.25f
#define T 3000
#if !defined(VERSION) || VERSION > 2 || VERSION < 0
#error "Bad version"
#endif
#if VERSION == 2
#define AT(_a_,_b_) (_a_.at(_b_))
typedef std::vector<float> Field;
#endif
#if VERSION == 1
#define AT(_a_,_b_) (_a_[_b_])
typedef std::vector<float> Field;
#endif
#if VERSION == 0
#define AT(_a_,_b_) (_a_[_b_])
typedef float* Field;
#endif
#include <iostream>
#include <omp.h>
int main(void) {
#if VERSION != 0
Field img(NX*NY*NY);
#else
Field img = new float[NX*NY*NY];
#endif
double end, begin;
begin = omp_get_wtime();
const int csize = NZ;
const int psize = NZ * NX;
for(int t = 0; t < T; t++ ) {
/* Swap the 3D volume and apply the "blurring" coefficients */
#pragma omp parallel for
for(int j = H; j < NY-H; j++ ) {
for( int i = H; i < NX-H; i++ ) {
for( int k = H; k < NZ-H; k++ ) {
int eindex = k+i*NZ+j*NX*NZ;
AT(img,eindex) = C0 * AT(img,eindex) +
C1 * (AT(img,eindex - csize) +
AT(img,eindex + csize) +
AT(img,eindex - psize) +
AT(img,eindex + psize) );
}
}
}
}
end = omp_get_wtime();
std::cout << "Elapsed "<< (end-begin) <<" s." << std::endl;
/* Access img field so we force it to be deleted after accouting time */
#define WHATEVER 12.f
if( img[ NZ ] == WHATEVER ) {
std::cout << "Whatever" << std::endl;
}
#if VERSION == 0
delete[] img;
#endif
}
One would expect the code to perform the same with VERSION=1 and VERSION=0, but the output is as follows:
VERSION 2 : Elapsed 6.94905 s.
VERSION 1 : Elapsed 4.08626 s
VERSION 0 : Elapsed 1.97576 s.
If I compile without OMP (I've got only two cores), I get similar results:
VERSION 2 : Elapsed 10.9895 s.
VERSION 1 : Elapsed 7.14674 s
VERSION 0 : Elapsed 3.25336 s.
I always compile with GCC 4.6.3 and the compilation options -fopenmp -finline-functions -O3 (I of course remove -fopenmp when I compile without OMP). Is there something I'm doing wrong, for example when compiling? Or should we really expect this difference between vectors and arrays?
PS: I cannot use std::array because the compiler I depend on doesn't support the C++11 standard. With ICC 13.1.2 I get similar behavior.
I tried your code, using chrono to measure the time.
I compiled with clang (version 3.5) and libc++:
clang++ test.cc -std=c++1y -stdlib=libc++ -lc++abi -finline-functions -O3
The result is essentially the same for VERSION 0 and VERSION 1; there's no big difference. They are both 3.4 seconds on average (I use a virtual machine, so it is slower).
Then I tried g++ (version 4.8.1):
g++ test.cc -std=c++1y -finline-functions -O3
The result shows that VERSION 0 takes roughly 4.4 seconds and VERSION 1 roughly 5.2 seconds.
I then tried clang++ with libstdc++:
clang++ test.cc -std=c++11 -finline-functions -O3
Voilà, the result is back to 3.4 seconds again.
So, it's purely an optimization "bug" of g++.

How fast is D compared to C++?

I like some features of D, but I would be interested to know whether they come with a runtime penalty.
To compare, I implemented a simple program that computes scalar products of many short vectors, both in C++ and in D. The result is surprising:
D: 18.9 s [see below for final runtime]
C++: 3.8 s
Is C++ really almost five times as fast or did I make a mistake in the D
program?
I compiled C++ with g++ -O3 (gcc-snapshot 2011-02-19) and D with dmd -O (dmd 2.052) on a moderately recent Linux desktop. The results are reproducible over several runs and the standard deviations are negligible.
Here the C++ program:
#include <iostream>
#include <random>
#include <chrono>
#include <string>
#include <vector>
#include <array>
typedef std::chrono::duration<long, std::ratio<1, 1000>> millisecs;
template <typename _T>
long time_since(std::chrono::time_point<_T>& time) {
long tm = std::chrono::duration_cast<millisecs>( std::chrono::system_clock::now() - time).count();
time = std::chrono::system_clock::now();
return tm;
}
const long N = 20000;
const int size = 10;
typedef int value_type;
typedef long long result_type;
typedef std::vector<value_type> vector_t;
typedef typename vector_t::size_type size_type;
inline value_type scalar_product(const vector_t& x, const vector_t& y) {
value_type res = 0;
size_type siz = x.size();
for (size_type i = 0; i < siz; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = std::chrono::system_clock::now();
// 1. allocate and fill randomly many short vectors
vector_t* xs = new vector_t [N];
for (int i = 0; i < N; ++i) {
xs[i] = vector_t(size);
}
std::cerr << "allocation: " << time_since(tm_before) << " ms" << std::endl;
std::mt19937 rnd_engine;
std::uniform_int_distribution<value_type> runif_gen(-1000, 1000);
for (int i = 0; i < N; ++i)
for (int j = 0; j < size; ++j)
xs[i][j] = runif_gen(rnd_engine);
std::cerr << "random generation: " << time_since(tm_before) << " ms" << std::endl;
// 2. compute all pairwise scalar products:
time_since(tm_before);
result_type avg = 0;
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
avg += scalar_product(xs[i], xs[j]);
avg = avg / N*N;
auto time = time_since(tm_before);
std::cout << "result: " << avg << std::endl;
std::cout << "time: " << time << " ms" << std::endl;
}
And here the D version:
import std.stdio;
import std.datetime;
import std.random;
const long N = 20000;
const int size = 10;
alias int value_type;
alias long result_type;
alias value_type[] vector_t;
alias uint size_type;
value_type scalar_product(const ref vector_t x, const ref vector_t y) {
value_type res = 0;
size_type siz = x.length;
for (size_type i = 0; i < siz; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = Clock.currTime();
// 1. allocate and fill randomly many short vectors
vector_t[] xs;
xs.length = N;
for (int i = 0; i < N; ++i) {
xs[i].length = size;
}
writefln("allocation: %i ", (Clock.currTime() - tm_before));
tm_before = Clock.currTime();
for (int i = 0; i < N; ++i)
for (int j = 0; j < size; ++j)
xs[i][j] = uniform(-1000, 1000);
writefln("random: %i ", (Clock.currTime() - tm_before));
tm_before = Clock.currTime();
// 2. compute all pairwise scalar products:
result_type avg = cast(result_type) 0;
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
avg += scalar_product(xs[i], xs[j]);
avg = avg / N*N;
writefln("result: %d", avg);
auto time = Clock.currTime() - tm_before;
writefln("scalar products: %i ", time);
return 0;
}
To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:
-O -inline -release -noboundscheck
EDIT: I've tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The commandline I used was gdmd -O -release -inline (gdmd is a wrapper around gdc which accepts dmd options).
Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.
One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show very similar performance to C and C++ code compiled with the same compiler backend. Benchmarks that do heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.
I've put some effort into improving GC performance lately and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.
This is a very instructive thread, thanks for all the work to the OP and helpers.
One note - this test is not assessing the general question of abstraction/feature penalty or even that of backend quality. It focuses on virtually one optimization (loop optimization). I think it's fair to say that gcc's backend is somewhat more refined than dmd's, but it would be a mistake to assume that the gap between them is as large for all tasks.
Definitely seems like a quality-of-implementation issue.
I ran some tests with the OP's code and made some changes. I actually got D going faster for LDC/clang++, operating on the assumption that arrays must be allocated dynamically (xs and associated scalars). See below for some numbers.
Questions for the OP
Is it intentional that the same seed be used for each iteration of C++, while not so for D?
Setup
I have tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the type of the numbers used to access and modify the size of arrays.
After this, I made the following changes:
Used uninitializedArray to avoid default inits for scalars in xs (probably made the biggest difference). This is important because D normally default-inits everything silently, which C++ does not.
Factored out printing code and replaced writefln with writeln
Changed imports to be selective
Used pow operator (^^) instead of manual multiplication for final step of calculating average
Removed the size_type and replaced appropriately with the new index_type alias
...thus resulting in scalar2.d (pastebin):
import std.stdio : writeln;
import std.datetime : Clock, Duration;
import std.array : uninitializedArray;
import std.random : uniform;
alias result_type = long;
alias value_type = int;
alias vector_t = value_type[];
alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint
immutable long N = 20000;
immutable int size = 10;
// Replaced for loops with appropriate foreach versions
value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
value_type res = 0;
for(index_type i = 0; i < size; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = Clock.currTime;
auto countElapsed(in string taskName) { // Factor out printing code
writeln(taskName, ": ", Clock.currTime - tm_before);
tm_before = Clock.currTime;
}
// 1. allocate and fill randomly many short vectors
vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
for(index_type i = 0; i < N; ++i)
xs[i] = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
countElapsed("allocation");
for(index_type i = 0; i < N; ++i)
for(index_type j = 0; j < size; ++j)
xs[i][j] = uniform(-1000, 1000);
countElapsed("random");
// 2. compute all pairwise scalar products:
result_type avg = 0;
for(index_type i = 0; i < N; ++i)
for(index_type j = 0; j < N; ++j)
avg += scalar_product(xs[i], xs[j]);
avg /= N ^^ 2;// Replace manual multiplication with pow operator
writeln("result: ", avg);
countElapsed("scalar products");
return 0;
}
After testing scalar2.d (which prioritized optimization for speed), out of curiosity I replaced the loops in main with foreach equivalents, and called it scalar3.d (pastebin):
import std.stdio : writeln;
import std.datetime : Clock, Duration;
import std.array : uninitializedArray;
import std.random : uniform;
alias result_type = long;
alias value_type = int;
alias vector_t = value_type[];
alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint
immutable long N = 20000;
immutable int size = 10;
// Replaced for loops with appropriate foreach versions
value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
value_type res = 0;
for(index_type i = 0; i < size; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = Clock.currTime;
auto countElapsed(in string taskName) { // Factor out printing code
writeln(taskName, ": ", Clock.currTime - tm_before);
tm_before = Clock.currTime;
}
// 1. allocate and fill randomly many short vectors
vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
foreach(ref x; xs)
x = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
countElapsed("allocation");
foreach(ref x; xs)
foreach(ref val; x)
val = uniform(-1000, 1000);
countElapsed("random");
// 2. compute all pairwise scalar products:
result_type avg = 0;
foreach(const ref x; xs)
foreach(const ref y; xs)
avg += scalar_product(x, y);
avg /= N ^^ 2;// Replace manual multiplication with pow operator
writeln("result: ", avg);
countElapsed("scalar products");
return 0;
}
I compiled each of these tests using an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages:
clang 3.6.0-3
ldc 1:0.15.1-4
dtools 2.067.0-2
I used the following commands to compile each:
C++: clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
D: rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>
Results
The results (screenshot of raw console output) of each version of the source as follows:
scalar.cpp (original C++):
allocation: 2 ms
random generation: 12 ms
result: 29248300000
time: 2582 ms
C++ sets the standard at 2582 ms.
scalar.d (modified OP source):
allocation: 5 ms, 293 μs, and 5 hnsecs
random: 10 ms, 866 μs, and 4 hnsecs
result: 53237080000
scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs
This ran for ~2957 ms. Slower than the C++ implementation, but not too much.
scalar2.d (index/length type change and uninitializedArray optimization):
allocation: 2 ms, 464 μs, and 2 hnsecs
random: 5 ms, 792 μs, and 6 hnsecs
result: 59
scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs
In other words, ~1860 ms. So far this is in the lead.
scalar3.d (foreaches):
allocation: 2 ms, 911 μs, and 3 hnsecs
random: 7 ms, 567 μs, and 8 hnsecs
result: 189
scalar products: 2 secs, 182 ms, and 366 μs
~2182 ms is slower than scalar2.d, but faster than the C++ version.
Conclusion
With the correct optimizations, the D implementation actually went faster than its equivalent C++ implementation using the LLVM-based compilers available. The current gap between D and C++ for most applications seems only to be based on limitations of current implementations.
dmd is the reference implementation of the language and thus most work is put into the frontend to fix bugs rather than optimizing the backend.
"in" is faster in your case cause you are using dynamic arrays which are reference types. With ref you introduce another level of indirection (which is normally used to alter the array itself and not only the contents).
Vectors are usually implemented with structs where const ref makes perfect sense. See smallptD vs. smallpt for a real-world example featuring loads of vector operations and randomness.
Note that 64-Bit can also make a difference. I once missed that on x64 gcc compiles 64-Bit code while dmd still defaults to 32 (will change when the 64-Bit codegen matures). There was a remarkable speedup with "dmd -m64 ...".
Whether C++ or D is faster is likely to be highly dependent on what you're doing. I would think that when comparing well-written C++ to well-written D code, they would generally either be of similar speed, or C++ would be faster, but what the particular compiler manages to optimize could have a big effect completely aside from the language itself.
However, there are a few cases where D stands a good chance of beating C++ for speed. The main one which comes to mind would be string processing. Thanks to D's array slicing capabilities, strings (and arrays in general) can be processed much faster than you can readily do in C++. For D1, Tango's XML processor is extremely fast, thanks primarily to D's array slicing capabilities (and hopefully D2 will have a similarly fast XML parser once the one that's currently being worked on for Phobos has been completed). So, ultimately whether D or C++ is going to be faster is going to be very dependent on what you're doing.
Now, I am suprised that you're seeing such a difference in speed in this particular case, but it is the sort of thing that I would expect to improve as dmd improves. Using gdc might yield better results and would likely be a closer comparison of the language itself (rather than the backend) given that it's gcc-based. But it wouldn't surprise me at all if there are a number of things which could be done to speed up the code that dmd generates. I don't think that there's much question that gcc is more mature than dmd at this point. And code optimizations are one of the prime fruits of code maturity.
Ultimately, what matters is how well dmd performs for your particular application, but I do agree that it would definitely be nice to know how well C++ and D compare in general. In theory, they should be pretty much the same, but it really depends on the implementation. I think that a comprehensive set of benchmarks would be required to really test how well the two presently compare however.
You can write C code in D, so as far as which is faster, it will depend on a lot of things:
What compiler you use
What feature you use
how aggressively you optimize
Differences in the first aren't fair to drag in. The second might give C++ an advantage as it, if anything, has fewer heavy features. The third is the fun one: D code in some ways is easier to optimize because in general it is easier to understand. Also it has the ability to do a large degree of generative programming, allowing things like verbose and repetitive but fast code to be written in shorter forms.
Seems like a quality of implementation issue. For example, here's what I've been testing with:
import std.datetime, std.stdio, std.random;
version = ManualInline;
immutable N = 20000;
immutable Size = 10;
alias int value_type;
alias long result_type;
alias value_type[] vector_type;
result_type scalar_product(in vector_type x, in vector_type y)
in
{
assert(x.length == y.length);
}
body
{
result_type result = 0;
foreach(i; 0 .. x.length)
result += x[i] * y[i];
return result;
}
void main()
{
auto startTime = Clock.currTime();
// 1. allocate vectors
vector_type[] vectors = new vector_type[N];
foreach(ref vec; vectors)
vec = new value_type[Size];
auto time = Clock.currTime() - startTime;
writefln("allocation: %s ", time);
startTime = Clock.currTime();
// 2. randomize vectors
foreach(ref vec; vectors)
foreach(ref e; vec)
e = uniform(-1000, 1000);
time = Clock.currTime() - startTime;
writefln("random: %s ", time);
startTime = Clock.currTime();
// 3. compute all pairwise scalar products
result_type avg = 0;
foreach(vecA; vectors)
foreach(vecB; vectors)
{
version(ManualInline)
{
result_type result = 0;
foreach(i; 0 .. vecA.length)
result += vecA[i] * vecB[i];
avg += result;
}
else
{
avg += scalar_product(vecA, vecB);
}
}
avg = avg / (N * N);
time = Clock.currTime() - startTime;
writefln("scalar products: %s ", time);
writefln("result: %s", avg);
}
With ManualInline defined I get 28 seconds, but without I get 32. So the compiler isn't even inlining this simple function, which I think it's clear it should be.
(My command line is dmd -O -noboundscheck -inline -release ....)