I am trying to optimise the Merkle root calculation as much as possible. So far, I implemented it in Python, which resulted in this question and the suggestion to rewrite it in C++.
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>
std::vector<unsigned char> double_sha256(std::vector<unsigned char> a, std::vector<unsigned char> b)
{
unsigned char inp[64];
int j=0;
for (int i=0; i<32; i++)
{
inp[j] = a[i];
j++;
}
for (int i=0; i<32; i++)
{
inp[j] = b[i];
j++;
}
const EVP_MD *md_algo = EVP_sha256();
unsigned int md_len = EVP_MD_size(md_algo);
std::vector<unsigned char> out( md_len );
EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
return out;
}
std::vector<std::vector<unsigned char> > calculate_merkle_root(std::vector<std::vector<unsigned char> > inp_list)
{
std::vector<std::vector<unsigned char> > out;
int len = inp_list.size();
if (len == 1)
{
out.push_back(inp_list[0]);
return out;
}
for (int i=0; i<len-1; i+=2)
{
out.push_back(
double_sha256(inp_list[i], inp_list[i+1])
);
}
if (len % 2 == 1)
{
out.push_back(
double_sha256(inp_list[len-1], inp_list[len-1])
);
}
return calculate_merkle_root(out);
}
int main()
{
std::ifstream infile("txids.txt");
std::vector<std::vector<unsigned char> > txids;
std::string line;
int count = 0;
while (std::getline(infile, line))
{
unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
std::vector<unsigned char> buf2;
for (int i=31; i>=0; i--)
{
buf2.push_back(
buf[i]
);
}
txids.push_back(
buf2
);
count++;
}
infile.close();
std::cout << count << std::endl;
std::vector<std::vector<unsigned char> > merkle_root_hash;
for (int k=0; k<1000; k++)
{
merkle_root_hash = calculate_merkle_root(txids);
}
std::vector<unsigned char> out0 = merkle_root_hash[0];
std::vector<unsigned char> out;
for (int i=31; i>=0; i--)
{
out.push_back(
out0[i]
);
}
static const char alpha[] = "0123456789abcdef";
for (int i=0; i<32; i++)
{
unsigned char c = out[i];
std::cout << alpha[ (c >> 4) & 0xF];
std::cout << alpha[ c & 0xF];
}
std::cout.put('\n');
return 0;
}
However, the performance is worse compared to the Python implementation (~4s):
$ g++ test.cpp -L/usr/local/opt/openssl/lib -I/usr/local/opt/openssl/include -lcrypto
$ time ./a.out
1452
289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e
real 0m9.245s
user 0m9.235s
sys 0m0.008s
The complete implementation and the input file are available here: test.cpp and txids.txt.
How can I improve the performance? Are compiler optimizations enabled by default? Are there SHA-256 libraries faster than OpenSSL?
There are plenty of things you can do to optimize the code.
Here is the list of the important points:
compiler optimizations need to be enabled (using -O3 with GCC);
std::array can be used instead of the slower dynamically-sized std::vector (since the size of a hash is always 32); one can even define a Hash type for clarity;
parameters should be passed by reference (C++ passes parameters by value by default);
the C++ vectors can be reserved to pre-allocate the memory space and avoid unneeded copies;
OPENSSL_free must be called to release the memory allocated by OPENSSL_hexstr2buf;
push_back should be avoided when the size is a constant known at compile time;
std::copy is often faster (and cleaner) than a manual copy loop;
std::reverse is often faster (and cleaner) than a manual loop;
the size of a hash should always be 32, and assertions can be used to check that this is indeed the case;
count is not needed, since it is simply the size of the txids vector.
Here is the resulting code:
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <cstring>
#include <array>
#include <algorithm>
#include <cassert>
#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>
using Hash = std::array<unsigned char, 32>;
Hash double_sha256(const Hash& a, const Hash& b)
{
assert(a.size() == 32 && b.size() == 32);
unsigned char inp[64];
std::copy(a.begin(), a.end(), inp);
std::copy(b.begin(), b.end(), inp+32);
const EVP_MD *md_algo = EVP_sha256();
assert(EVP_MD_size(md_algo) == 32);
unsigned int md_len = 32;
Hash out;
EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
return out;
}
std::vector<Hash> calculate_merkle_root(const std::vector<Hash>& inp_list)
{
std::vector<Hash> out;
int len = inp_list.size();
out.reserve(len/2+2);
if (len == 1)
{
out.push_back(inp_list[0]);
return out;
}
for (int i=0; i<len-1; i+=2)
{
out.push_back(double_sha256(inp_list[i], inp_list[i+1]));
}
if (len % 2 == 1)
{
out.push_back(double_sha256(inp_list[len-1], inp_list[len-1]));
}
return calculate_merkle_root(out);
}
int main()
{
std::ifstream infile("txids.txt");
std::vector<Hash> txids;
std::string line;
while (std::getline(infile, line))
{
unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
Hash buf2;
std::copy(buf, buf+32, buf2.begin());
std::reverse(buf2.begin(), buf2.end());
txids.push_back(buf2);
OPENSSL_free(buf);
}
infile.close();
std::cout << txids.size() << std::endl;
std::vector<Hash> merkle_root_hash;
for (int k=0; k<1000; k++)
{
merkle_root_hash = calculate_merkle_root(txids);
}
Hash out0 = merkle_root_hash[0];
Hash out = out0;
std::reverse(out.begin(), out.end());
static const char alpha[] = "0123456789abcdef";
for (int i=0; i<32; i++)
{
unsigned char c = out[i];
std::cout << alpha[ (c >> 4) & 0xF];
std::cout << alpha[ c & 0xF];
}
std::cout.put('\n');
return 0;
}
On my machine, this code is 3 times faster than the initial version and 2 times faster than the Python implementation.
This implementation spends >98% of its time in EVP_Digest. As a result, if you want faster code, you could look for a faster hashing library, although OpenSSL should already be quite fast. The current code computes 1.7 million hashes per second sequentially on a mainstream CPU, which is quite good. Alternatively, you can parallelize the program using OpenMP (this is roughly 5 times faster on my 6-core machine).
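As a rough sketch of the OpenMP idea (not the exact code I benchmarked), assuming the Hash type, double_sha256 and headers from the code above plus compilation with -fopenmp, one level of the tree can be hashed in parallel because every pair is independent:
std::vector<Hash> calculate_merkle_root_omp(const std::vector<Hash>& inp_list)
{
    const int len = inp_list.size();
    if (len == 1)
        return inp_list;                       // reached the root
    std::vector<Hash> out((len + 1) / 2);      // one output hash per pair
    #pragma omp parallel for
    for (int i = 0; i < len - 1; i += 2)
        out[i / 2] = double_sha256(inp_list[i], inp_list[i + 1]);
    if (len % 2 == 1)                          // odd count: pair the last hash with itself
        out[len / 2] = double_sha256(inp_list[len - 1], inp_list[len - 1]);
    return calculate_merkle_root_omp(out);
}
Each thread writes to distinct entries of out and EVP_Digest creates its own context per call, so no locking is needed.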
I decided to implement the Merkle root and SHA-256 computation from scratch (a full SHA-256 implementation) using a SIMD (Single Instruction, Multiple Data) approach, for SSE2, AVX2 and AVX-512.
My code below is, in the AVX2 case, about 3.5x faster than the OpenSSL version and 7.3x faster than Python's hashlib implementation.
Here I provide the C++ implementation; I also made a Python implementation with the same speed (because it uses the C++ code at its core), see the related post for it. The Python implementation is definitely easier to use than the C++ one.
My code is quite complex, both because it contains a full SHA-256 implementation and because it has a class abstracting over the SIMD operations, plus many tests.
First I provide timings, made on Google Colab because they have a quite advanced AVX2 processor there:
MerkleRoot-Ossl 1274 ms
MerkleRoot-Simd-GEN-1 1613 ms
MerkleRoot-Simd-GEN-2 1795 ms
MerkleRoot-Simd-GEN-4 788 ms
MerkleRoot-Simd-GEN-8 423 ms
MerkleRoot-Simd-SSE2-1 647 ms
MerkleRoot-Simd-SSE2-2 626 ms
MerkleRoot-Simd-SSE2-4 690 ms
MerkleRoot-Simd-AVX2-1 407 ms
MerkleRoot-Simd-AVX2-2 403 ms
MerkleRoot-Simd-AVX2-4 489 ms
Ossl is the OpenSSL implementation; the rest are my implementations. AVX-512 improves the speed even further; it is not tested here because Colab has no AVX-512 support. The actual speedup depends on the processor's capabilities.
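To illustrate the core idea only (this is a hedged sketch, not an excerpt from my library): with AVX2, one 256-bit register holds eight independent 32-bit SHA-256 state words, so each bitwise step of a round advances eight separate hashes at once. For example, the Ch and big Σ1 functions of a round could be written like this:
#include <immintrin.h>

// Sketch only: eight independent SHA-256 states, one per 32-bit lane of a
// __m256i. Requires AVX2 (compile with -mavx2 or /arch:AVX2).
template <int N>
static inline __m256i rotr32(__m256i x)                    // rotate each lane right by N
{
    return _mm256_or_si256(_mm256_srli_epi32(x, N), _mm256_slli_epi32(x, 32 - N));
}

static inline __m256i big_sigma1(__m256i e)                // Σ1(e) = ROTR6 ^ ROTR11 ^ ROTR25
{
    return _mm256_xor_si256(rotr32<6>(e), _mm256_xor_si256(rotr32<11>(e), rotr32<25>(e)));
}

static inline __m256i ch(__m256i e, __m256i f, __m256i g)  // Ch(e,f,g) = (e & f) ^ (~e & g)
{
    return _mm256_xor_si256(_mm256_and_si256(e, f), _mm256_andnot_si256(e, g));
}
The rest of the round (additions and the message schedule) uses _mm256_add_epi32 and similar lane-wise operations in the same way, which is where the speedup over scalar code comes from.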
Compilation is tested both on Windows (MSVC) and on Linux (Clang), using the following commands:
Windows with OpenSSL support: cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1 -DSHS_HAS_OPENSSL=1 /MD -Id:/bin/OpenSSL/include/ /link /LIBPATH:d:/bin/OpenSSL/lib/ libcrypto_static.lib libssl_static.lib Advapi32.lib User32.lib Ws2_32.lib (point the paths at your installed OpenSSL directory). If OpenSSL support is not needed, use cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1. Instead of AVX2 you may also use SSE2 or AVX512 here. A Windows OpenSSL build can be downloaded from here.
Linux Clang compilation is done with clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe -DSHS_HAS_OPENSSL=1 -lssl -lcrypto if OpenSSL is needed, and otherwise with clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe. As you can see, the recent clang-12 is used; to install it, run bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" (this command is described here). The Linux version automatically detects the current CPU architecture and uses the best available SIMD instruction set.
My code needs C++20 support, as it uses some advanced features that make the implementation easier.
I implemented OpenSSL support in my library only to compare timings and show that my AVX2 version is 3-3.5x faster.
I also provide timings taken on GodBolt, but only as an example of AVX-512 usage, since the GodBolt CPUs have advanced AVX-512 support. Don't use GodBolt to actually measure timings, because the timings there jump up and down by as much as 5x, apparently due to active processes being evicted by the operating system. I also provide a GodBolt playground link (it may contain slightly outdated code; use the newest link to the code at the bottom of my post):
MerkleRoot-Ossl 2305 ms
MerkleRoot-Simd-GEN-1 2982 ms
MerkleRoot-Simd-GEN-2 3078 ms
MerkleRoot-Simd-GEN-4 1157 ms
MerkleRoot-Simd-GEN-8 781 ms
MerkleRoot-Simd-GEN-16 349 ms
MerkleRoot-Simd-SSE2-1 387 ms
MerkleRoot-Simd-SSE2-2 769 ms
MerkleRoot-Simd-SSE2-4 940 ms
MerkleRoot-Simd-AVX2-1 251 ms
MerkleRoot-Simd-AVX2-2 253 ms
MerkleRoot-Simd-AVX2-4 777 ms
MerkleRoot-Simd-AVX512-1 257 ms
MerkleRoot-Simd-AVX512-2 741 ms
MerkleRoot-Simd-AVX512-4 961 ms
Examples of how to use my code can be seen inside the Test() function, which exercises all the functionality of the library. My code is a bit dirty because I didn't want to spend much time creating a beautiful library; it is rather a proof of concept showing that a SIMD-based implementation can be considerably faster than the OpenSSL version.
If you really want to use my SIMD-based version instead of OpenSSL, you care a lot about speed, and you have questions about how to use it, please ask me in the comments or in chat.
I also didn't bother implementing a multi-core/multi-threaded version; I think it is obvious how to do that, and you should be able to implement it without difficulty.
I am providing an external link to the code below, because my code is around 51 KB in size, which exceeds the 30 KB text limit for a StackOverflow post.
sha256_simd.cpp
Related
Performance of a simple program (generate 1,200,000 unique, randomly shuffled integers, then sort them) is slower when I run it from Qt Creator a second time after recompilation (and on all subsequent runs until the next recompilation).
#include <iostream>
#include <random>
#include <algorithm>
#include <chrono>
#include <iterator>
#include <cstdint>
using size_type = std::uint32_t;
alignas(64) size_type v[1200000];
// behaviour really not depends on CPU affinity
#ifdef __linux__
#include <sched.h>
#endif
int main()
{
#ifdef __linux__
{
cpu_set_t m;
int status;
CPU_ZERO(&m);
CPU_SET(0, &m);
status = sched_setaffinity(0, sizeof(m), &m);
if (status != 0) {
perror("sched_setaffinity");
}
}
#endif
std::mt19937 g(0);
for (size_type i = 1; i < std::size(v); ++i) {
v[i] = std::exchange(v[g() % i], i);
}
for (size_type i = 0; i < 10; ++i) { // first output not depends on number of iterations
auto start = std::chrono::high_resolution_clock::now();
std::sort(std::begin(v), std::end(v));
std::cout << std::chrono::duration_cast< std::chrono::microseconds >(std::chrono::high_resolution_clock::now() - start).count() << std::endl;
}
}
Say, the first time it prints:
97896
26069
25628
25771
25863
25722
25976
25855
25687
25735
and then:
137238
35056
34880
34468
34746
27309
25781
25932
25502
25383
yet another run (and all further runs look like the second and third):
137648
35086
34966
26005
26305
26435
25683
25440
25981
25632
If I recompile the program, it all repeats again.
If I recompile the program and run it from the console, then all outputs start from a value near 137000, even the first one, and look like this:
137207
35059
35035
34844
34563
34586
34466
34132
34327
34487
If it matters, I build and run the above program on Ubuntu Desktop 16.04.3 64-bit, on an AMD A10-7800 Radeon R7 (12 Compute Cores 4C+8G) with 8 GB RAM and an SSD, without root privileges and without a debugger attached. I use g++-7 -m32 -march=native -mtune=native -O3, gold and ccache.
I expected the inverse results, because of (maybe) branch-prediction caching or some other caching (if that is possible at all between consecutive runs of the same code), but the results are discouraging.
After porting some legacy code from win32 to win64, I discussed the best strategy to remove the warning "possible loss of data" (What's the best strategy to get rid of "warning C4267 possible loss of data"?). I'm about to replace many unsigned int by size_t in my code.
However, my code is performance-critical (I can't even run it in Debug... too slow).
I did a quick benchmarking:
#include "stdafx.h"
#include <iostream>
#include <chrono>
#include <string>
template<typename T> void testSpeed()
{
auto start = std::chrono::steady_clock::now();
T big = 0;
for ( T i = 0; i != 100000000; ++i )
big *= std::rand();
std::cout << "Elapsed " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count() << "ms" << std::endl;
}
int main()
{
testSpeed<size_t>();
testSpeed<unsigned int>();
std::string str;
std::getline( std::cin, str ); // pause
return 0;
}
Compiled for x64, it outputs:
Elapsed 2185ms
Elapsed 2157ms
Compiled for x86, it outputs:
Elapsed 2756ms
Elapsed 2748ms
So apparently using size_t instead of unsigned int has an insignificant performance impact. But is that really always the case (it's hard to benchmark performance this way)?
Does/may changing unsigned int into size_t impact CPU performance (now a 64-bit object will be manipulated instead of a 32-bit one)?
Definitely not. On modern (and even older) CPUs, 64-bit integer operations perform as fast as 32-bit operations.
Example on my i7 4600u for the arithmetic operation a * b / c:
(int32_t) * (int32_t) / (int32_t) : 1.3 nsec
(int64_t) * (int64_t) / (int64_t) : 1.3 nsec
Both tests compiled for x64 target (same target as yours).
However, if your code manages big objects full of integers (big arrays of integers, for example), using size_t instead of unsigned int may have an impact on performance if the cache-miss count increases (the bigger data may exceed the cache capacity). The most reliable way to check the impact on performance is to test your app in both cases. Use your own type typedef'ed to either size_t or unsigned int, then benchmark your application, as sketched below.
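A minimal sketch of that switchable typedef (index_t and USE_SIZE_T_INDEX are hypothetical names, not from the question):
#include <cstddef>
#include <vector>

// Flip the alias via a build flag and benchmark the whole application with each choice.
#ifdef USE_SIZE_T_INDEX
using index_t = std::size_t;
#else
using index_t = unsigned int;
#endif

double sum(const std::vector<double>& v)
{
    double s = 0.0;
    for (index_t i = 0; i < v.size(); ++i)   // 8-byte index in one build, 4-byte in the other
        s += v[i];
    return s;
}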
In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.
Let's assume I want to calculate the inner product of two fixed-size arrays. My naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
//fill with arbitrary numbers
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known at compile time and all operations can be mapped directly to native processor instructions, the compiler should easily be able to generate the "optimal" machine code from this (loop unrolling, vectorization / FMA instructions ...).
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds some additional indirections, and even if everything is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where this region lies. While this is certainly possible in principle, I wonder whether the typical C++ compiler will actually do it.
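One way to take the iterator machinery out of the equation (just a sketch, using the same arrays as above) is to hand the algorithm raw pointers, so the contiguous layout is explicit:
double sum = std::inner_product(v1.data(), v1.data() + v1.size(), v2.data(), 0.0);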
So my question is: do you think there can be a performance benefit in implementing standard algorithms that operate on fixed-size arrays myself, or will the STL version always outperform a manual implementation?
As suggested, I did some measurements, and
for the code below
compiled with VS2013 for x64 in release mode
executed on a Win8.1 Machine with an i7-2640M,
the algorithm version is consistently slower by about 20% (15.6-15.7 s vs. 12.9-13.1 s). The relative difference also stays roughly constant over two orders of magnitude for N and REPS.
So I guess the answer is: Using standard library algorithms CAN hurt performance.
It would still be interesting to know whether this is a general problem or whether it is specific to my platform, compiler and benchmark. You are welcome to post your own results or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>
#define USE_STD_ALGORITHM
using namespace std;
using namespace std::chrono;
static const size_t N = 10000000; //size of the arrays
static const size_t REPS = 1000; //number of repetitions
array<double, N> a1;
array<double, N> a2;
int main(){
srand(10);
for (size_t i = 0; i < N; ++i) {
a1[i] = static_cast<double>(rand())*0.01;
a2[i] = static_cast<double>(rand())*0.01;
}
double res = 0.0;
auto start=high_resolution_clock::now();
for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
for (size_t t = 0; t < N; ++t) {
res+= a1[t] * a2[t];
}
#endif
}
auto end = high_resolution_clock::now();
std::cout << res << " "; // <-- necessary, so that loop isn't optimized away
std::cout << duration_cast<milliseconds>(end - start).count() <<" ms"<< std::endl;
}
/*
* Update: Results (ubuntu 14.04 , haswell)
* STL: algorithm
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3551 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9378 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
*
* loop:
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3543 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9613 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
*/
EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5 s vs. 14 s).
I have a program that starts up and within about 5 minutes the virtual size of the process is about 13 GB. It runs on Linux and uses Boost, the GNU C++ library and various other third-party libraries.
After 5 minutes the virtual size stays at 13 GB and the RSS stays steady at around 5 GB.
I can't just run it in a debugger, because at startup about 30 threads are started, each of which runs its own code and performs various allocations. So stepping through and checking virtual memory at breakpoints in different parts of the code is not feasible.
I thought of changing the program to start each thread one at a time to make it easier to track memory allocation, but before doing this: are there any good tools?
Valgrind is fairly slow; maybe tcmalloc could provide the info?
I would use valgrind (perhaps running it for an entire night) or else use the Boehm GC.
Alternatively, use the proc(5) filesystem to understand (e.g. through /proc/$pid/statm and /proc/$pid/maps) when a lot of memory gets allocated.
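For instance, a minimal sketch (assuming Linux; log_memory_usage is just an illustrative helper) that a process can call at interesting points to log its own memory usage:
#include <fstream>
#include <iostream>

// Print the virtual size and resident set size of the current process, in pages,
// as reported by the first two fields of /proc/self/statm.
void log_memory_usage(const char* tag)
{
    std::ifstream statm("/proc/self/statm");
    long vm_pages = 0, rss_pages = 0;
    statm >> vm_pages >> rss_pages;
    std::cout << tag << ": vm=" << vm_pages << " pages, rss=" << rss_pages << " pages\n";
}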
The most important thing is to find memory leaks. If the memory doesn't grow after startup, it is less of an issue.
Perhaps adding instance counters to each class might help (use atomic integers, or mutexes to serialize them).
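A sketch of that counter idea with a std::atomic, for a hypothetical class Widget (the copy constructor has to be counted as well):
#include <atomic>

class Widget
{
public:
    Widget()              { ++live_count; }
    Widget(const Widget&) { ++live_count; }
    ~Widget()             { --live_count; }
    static long instances() { return live_count.load(); }   // how many Widgets exist right now
private:
    static std::atomic<long> live_count;
};
std::atomic<long> Widget::live_count{0};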
If the program's source code is big (e.g. a million source lines), so that spending several days or weeks is worth the effort, customizing the GCC compiler (e.g. with MELT) might be relevant.
a std::set minibenchmark
You mentioned a big std::set based upon a million rows.
#include <set>
#include <string>
#include <string.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <time.h>
class MyElem
{
int _n;
char _s[16-sizeof(_n)];
public:
MyElem(int k) : _n(k)
{
snprintf (_s, sizeof(_s), "%d", k);
};
~MyElem()
{
_n=0;
memset(_s, 0, sizeof(_s));
};
int n() const
{
return _n;
};
std::string str() const
{
return std::string(_s);
};
bool less(const MyElem&x) const
{
return _n < x._n;
};
};
bool operator < (const MyElem& l, const MyElem& r)
{
return l.less(r);
}
typedef std::set<MyElem> MySet;
void bench (int cnt, MySet& set)
{
for (long i=0; i<(long)cnt*1024; i++)
set.insert(MyElem(i));
time_t now = 0;
time (&now);
set.insert (((now) & 0xfffffff) * 100);
}
int main (int argc, char** argv)
{
MySet s;
clock_t cstart, cend;
int c = argc>1?atoi(argv[1]):256;
if (c<16) c=16;
printf ("c=%d Kiter\n", c);
cstart = clock();
bench (c, s);
cend = clock();
int x = getpid();
char cmdbuf[64];
snprintf(cmdbuf, sizeof(cmdbuf), "pmap %d", x);
printf ("running %s\n", cmdbuf);
fflush (NULL);
system(cmdbuf);
putchar('\n');
printf ("at end c=%d Kiter clockdiff=%.2f millisec = %.f µs/Kiter\n",
c, (cend-cstart)*1.0e-3, (double)(cend-cstart)/c);
if (s.find(x) != s.end())
printf("set has %d\n", x);
else
printf("set don't contain %d\n", x);
return 0;
}
Notice the 16-byte sizeof(MyElem). On Debian/Sid/AMD64 with GCC 4.8.1 (Intel i3770K processor, 16 GB RAM), compiling that benchmark with g++ -Wall -O1 tset.cc -o ./tset-01
With 32768 thousand iterations, i.e. 32M elements:
total 2109592K
(last line above given by pmap)
at end c=32768 Kiter clockdiff=16470.00 millisec = 503 µs/Kiter
And the elapsed time reported by my zsh:
./tset-01 32768 16.77s user 0.54s system 99% cpu 17.343 total
This is about 2.1 GB, so perhaps 64.3 bytes per element including the set's per-member overhead (since sizeof(MyElem)==16, the set seems to have a non-negligible cost of perhaps 6 words per element).
This function reads an array of doubles from a string:
vector<double> parseVals(string& str) {
stringstream ss(str);
vector<double> vals;
double val;
while (ss >> val) vals.push_back(val);
return vals;
}
When called with a string containing 1 million numbers, the function takes 7.8 seconds to execute (Core i5, 3.3GHz). This means that 25000 CPU cycles are spent to parse ONE NUMBER.
user315052 has pointed out that the same code runs an order of magnitude faster on his system, and further testing has shown very large performance differences among different systems and compilers (also see user315052's answer):
1. Win7, Visual Studio 2012RC or Intel C++ 2013 beta: 7.8 sec
2. Win7, mingw / g++ 4.5.2 : 4 sec
3. Win7, Visual Studio 2010 : 0.94 sec
4. Ubuntu 12.04, g++ 4.7 : 0.65 sec
I have found a great alternative in the Boost.Spirit library. The code is safe, concise and extremely fast (0.06 seconds on VC2012, 130x faster than stringstream).
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
vector<double> parseVals4(string& str) {
vector<double> vals;
qi::phrase_parse(str.begin(), str.end(),
*qi::double_ >> qi::eoi, ascii::space, vals);
return vals;
}
Although this solves the problem from a practical standpoint, I would still like to know why the performance of stringstream is so inconsistent. I profiled the program to identify the bottleneck, but the STL code looks like gibberish to me. Comments from anybody familiar with STL internals would be much appreciated.
PS: Optimization is O2 or better in all of the above timings. Neither instantiation of stringstream nor the reallocation of vector figure in the program profile. Virtually all of the time is spent inside the extraction operator.
On my Linux VM running on a 1.6 GHz i7, it takes less than half a second. My conclusion is that the parsing is not as slow as you are observing it to be. There must be some other artifact that you are measuring to cause your observation to be so vastly different from mine. So that we can be more sure we are comparing apples to apples, I'll provide what I did.
Edit: On my Linux system I have g++ 4.6.3, compiled with -O3. Since I don't have the MS or Intel compilers, I used cygwin g++ 4.5.3, also compiled with -O3. Another fact: my Windows 7 is 64-bit, as is my Linux VM, while I believe cygwin only runs in 32-bit mode. On Linux, I got the following output:
elapsed: 0.46 stringstream
elapsed: 0.11 strtod
On cygwin, I got the following:
elapsed: 1.685 stringstream
elapsed: 0.171 strtod
I speculate that the difference between cygwin and Linux performance has something to do with MS library dependencies. Note that the cygwin environment is just on the host machine of the Linux VM.
This is the routine I timed that used istringstream.
std::vector<double> parseVals (std::string &s) {
std::istringstream ss(s);
std::vector<double> vals;
vals.reserve(1000000);
double val;
while (ss >> val) vals.push_back(val);
return vals;
}
This is the routine I timed that used strtod.
std::vector<double> parseVals2 (char *s) {
char *p = 0;
std::vector<double> vals;
vals.reserve(1000000);
do {
double val = strtod(s, &p);
if (s == p) break;
vals.push_back(val);
s = p+1;
} while (*p);
return vals;
}
This is the routine I used to populate the string with one million doubles.
std::string one_million_doubles () {
std::ostringstream oss;
double x = RAND_MAX/(1.0 + rand()) + rand();
oss << x;
for (int i = 1; i < 1000000; ++i) {
x = RAND_MAX/(1.0 + rand()) + rand();
oss << " " << x;
}
return oss.str();
}
This is the routine I used to do the timing:
template <typename PARSE, typename S>
void time_parse (PARSE p, S s, const char *m) {
struct tms start;
struct tms finish;
long ticks_per_second;
std::vector<double> vals_vec;
times(&start);
vals_vec = p(s);
times(&finish);
assert(vals_vec.size() == 1000000);
ticks_per_second = sysconf(_SC_CLK_TCK);
std::cout << "elapsed: "
<< ((finish.tms_utime - start.tms_utime
+ finish.tms_stime - start.tms_stime)
/ (1.0 * ticks_per_second))
<< " " << m << std::endl;
}
And, this was the main function:
int main ()
{
std::string vals_str;
vals_str = one_million_doubles();
std::vector<char> s(vals_str.begin(), vals_str.end());
time_parse(parseVals, vals_str, "stringstream");
time_parse(parseVals2, &s[0], "strtod");
}
Your overhead is both in the repeated instantiation of the std::stringstream and in the parsing itself. If your numbers are plain and do not use any locale-dependent formatting, then I suggest #include <cstdlib> and std::strtod().
Converting a string to a double is slow because your Core i5 CPU does not have that conversion built in.
While that CPU can natively convert a short to a float or to an int at comparatively fast speeds, the conversion you describe must be done step by step, analyzing each character and deciding whether it is part of the double and how.
What you're observing is representative of the actual work that needs to be done, considering that each double may look like -.0 or INF or 4E6 or -NAN. It may need to be truncated, it probably needs to be approximated, and it may not be a valid double at all.
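As a small illustration (a throwaway sketch, not code from the question), even the C library parser has to accept all of those spellings:
#include <cstdio>
#include <cstdlib>

int main()
{
    const char* samples[] = { "-.0", "INF", "4E6", "-NAN" };
    for (const char* s : samples)
    {
        char* end = nullptr;
        double d = std::strtod(s, &end);             // accepts signs, exponents, INF, NAN, ...
        std::printf("%-5s -> %g (consumed %td characters)\n", s, d, end - s);
    }
}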
Parsing is a pretty involved task here. To parse a double, the code has to match either a decimal or a floating-point number, then it has to extract that substring and do the actual string-to-double conversion. This means that for each double in your string the characters are scanned at least twice, plus whatever other work is done to get to the next double. The other part, as mentioned, is that a vector is not the most efficient when it resizes. But mainly, parsing and converting strings is just slow.
You construct a stringstream object every time you call that function, which is potentially very expensive.
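If the construction really dominates, one sketch (assuming single-threaded use; parseValsReuse is an illustrative variant of the question's function) is to reuse a single stream instead of building a new one on every call:
#include <sstream>
#include <string>
#include <vector>

std::vector<double> parseValsReuse(const std::string& str)
{
    static std::stringstream ss;   // constructed once, reused on every call (not thread-safe)
    ss.str(str);                   // replace the buffer contents
    ss.clear();                    // reset eof/fail flags from the previous run
    std::vector<double> vals;
    double val;
    while (ss >> val) vals.push_back(val);
    return vals;
}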
However, we don't have enough information to answer your question. Are you compiling with optimizations turned on all the way? Is your function being inlined, or is there a function call with every invocation?
For a suggestion on how to speed things up, you should consider boost::lexical_cast<double>(str)
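For example, a sketch of the lexical_cast route, assuming Boost is available and the string holds a single value (it throws on malformed input):
#include <boost/lexical_cast.hpp>
#include <string>

double parse_one(const std::string& str)
{
    // Throws boost::bad_lexical_cast if str is not a valid double.
    return boost::lexical_cast<double>(str);
}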