I decided to figure out which of Protobuf, FlatBuffers and Cap'n Proto would be the best/fastest serialization for my application. In my case I'm sending some kind of byte/char array over a network (the reason I serialize to that format). So I made simple implementations for all three where I serialize and deserialize a string, a float and an int. This gave unexpected results: Protobuf being the fastest. I would call them unexpected since both Cap'n Proto and FlatBuffers claim to be faster options. Before I accept this I would like to see if I unintentionally cheated in my code somehow. If I did not cheat, I would like to know why Protobuf is faster (exactly why is probably impossible). Could the messages be too simple for Cap'n Proto and FlatBuffers to really make them shine?
My timings:
Time taken flatbuffers: 14162 microseconds
Time taken capnp: 60259 microseconds
Time taken protobuf: 12131 microseconds
(times from one machine; only the relative comparison is likely to be relevant.)
UPDATE: The above numbers are not representative of CORRECT usage, at least not for capnp -- see answers & comments.
flatbuffer code:
int main (int argc, char *argv[]){
std::string s = "string";
float f = 3.14;
int i = 1337;
std::string s_r;
float f_r;
int i_r;
flatbuffers::FlatBufferBuilder message_sender;
int steps = 10000;
auto start = high_resolution_clock::now();
for (int j = 0; j < steps; j++){
auto autostring = message_sender.CreateString(s);
auto encoded_message = CreateTestmessage(message_sender, autostring, f, i);
message_sender.Finish(encoded_message);
uint8_t *buf = message_sender.GetBufferPointer();
int size = message_sender.GetSize();
message_sender.Clear();
//Send stuffs
//Receive stuffs
auto received_message = GetTestmessage(buf);
s_r = received_message->string_()->str();
f_r = received_message->float_();
i_r = received_message->int_();
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
cout << "Time taken flatbuffer: " << duration.count() << " microseconds" << endl;
return 0;
}
cap'n proto code:
int main (int argc, char *argv[]){
char s[] = "string";
float f = 3.14;
int i = 1337;
const char * s_r;
float f_r;
int i_r;
::capnp::MallocMessageBuilder message_builder;
Testmessage::Builder message = message_builder.initRoot<Testmessage>();
int steps = 10000;
auto start = high_resolution_clock::now();
for (int j = 0; j < steps; j++){
//Encoding
message.setString(s);
message.setFloat(f);
message.setInt(i);
kj::Array<capnp::word> encoded_array = capnp::messageToFlatArray(message_builder);
kj::ArrayPtr<char> encoded_array_ptr = encoded_array.asChars();
char * encoded_char_array = encoded_array_ptr.begin();
size_t size = encoded_array_ptr.size();
//Send stuffs
//Receive stuffs
//Decoding
kj::ArrayPtr<capnp::word> received_array = kj::ArrayPtr<capnp::word>(reinterpret_cast<capnp::word*>(encoded_char_array), size/sizeof(capnp::word));
::capnp::FlatArrayMessageReader message_receiver_builder(received_array);
Testmessage::Reader message_receiver = message_receiver_builder.getRoot<Testmessage>();
s_r = message_receiver.getString().cStr();
f_r = message_receiver.getFloat();
i_r = message_receiver.getInt();
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
cout << "Time taken capnp: " << duration.count() << " microseconds" << endl;
return 0;
}
protobuf code:
int main (int argc, char *argv[]){
std::string s = "string";
float f = 3.14;
int i = 1337;
std::string s_r;
float f_r;
int i_r;
Testmessage message_sender;
Testmessage message_receiver;
int steps = 10000;
auto start = high_resolution_clock::now();
for (int j = 0; j < steps; j++){
message_sender.set_string(s);
message_sender.set_float_m(f);
message_sender.set_int_m(i);
int len = message_sender.ByteSize();
char encoded_message[len];
message_sender.SerializeToArray(encoded_message, len);
message_sender.Clear();
//Send stuffs
//Receive stuffs
message_receiver.ParseFromArray(encoded_message, len);
s_r = message_receiver.string();
f_r = message_receiver.float_m();
i_r = message_receiver.int_m();
message_receiver.Clear();
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
cout << "Time taken protobuf: " << duration.count() << " microseconds" << endl;
return 0;
}
I'm not including the message definition files since they are simple and most likely have nothing to do with it.
In Cap'n Proto, you should not reuse a MessageBuilder for multiple messages. The way you've written your code, every iteration of your loop will make the message bigger, because you're actually adding on to the existing message rather than starting a new one. To avoid memory allocation with each iteration, you should pass a scratch buffer to MallocMessageBuilder's constructor. The scratch buffer can be allocated once outside the loop, but you need to create a new MallocMessageBuilder each time around the loop. (Of course, most people don't bother with scratch buffers and just let MallocMessageBuilder do its own allocation, but if you choose that path in this benchmark, then you should also change the Protobuf benchmark to create a new message object for every iteration rather than reusing a single object.)
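A minimal sketch of the scratch-buffer approach, reusing the question's Testmessage schema and loop variables (the 128-word scratch size is an arbitrary guess, not something from the original code):
// Scratch space allocated once, outside the loop; must be large enough for one message.
capnp::word scratch[128];
for (int j = 0; j < steps; j++){
    // The builder expects zeroed memory, so clear the scratch space before reusing it.
    memset(scratch, 0, sizeof(scratch));
    ::capnp::MallocMessageBuilder message_builder(kj::arrayPtr(scratch, sizeof(scratch) / sizeof(scratch[0])));
    Testmessage::Builder message = message_builder.initRoot<Testmessage>();
    message.setString(s);
    message.setFloat(f);
    message.setInt(i);
    // ... serialize, "send", "receive" and read back as in the original loop ...
}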
Additionally, your Cap'n Proto code is using capnp::messageToFlatArray(), which allocates a whole new buffer to put the message into and copies the entire message over. This is not the most efficient way to use Cap'n Proto. Normally, if you were writing the message to a file or socket, you would write directly from the message's original backing buffer(s) without making this copy. Try doing this instead:
kj::ArrayPtr<const kj::ArrayPtr<const capnp::word>> segments =
message_builder.getSegmentsForOutput();
// Send segments
// Receive segments
capnp::SegmentArrayMessageReader message_receiver_builder(segments);
Or, to make things more realistic, you could write the message out to a pipe and read it back in, using capnp::writeMessageToFd() and capnp::StreamFdMessageReader. (To be fair, you would need to make the protobuf benchmark write to / read from a pipe as well.)
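For illustration, a rough sketch of that pipe-based round trip (POSIX only, error handling omitted, and it assumes the generated header for the question's Testmessage schema is included):
#include <unistd.h>              // pipe(), close()
#include <capnp/message.h>
#include <capnp/serialize.h>
void round_trip_once(){
    int fds[2];
    pipe(fds);                   // error handling omitted for brevity
    ::capnp::MallocMessageBuilder builder;
    Testmessage::Builder msg = builder.initRoot<Testmessage>();
    msg.setString("string");
    msg.setFloat(3.14f);
    msg.setInt(1337);
    // Write the message's segments straight to the write end of the pipe...
    capnp::writeMessageToFd(fds[1], builder);
    // ...and parse it back from the read end.
    capnp::StreamFdMessageReader reader(fds[0]);
    Testmessage::Reader received = reader.getRoot<Testmessage>();
    (void)received.getInt();
    close(fds[0]);
    close(fds[1]);
}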
(I'm the author of Cap'n Proto and Protobuf v2. I'm not familiar with FlatBuffers so I can't comment on whether that code has any similar issues...)
On benchmarks
I've spent a lot of time benchmarking Protobuf and Cap'n Proto. One thing I've learned in the process is that most simple benchmarks you can create will not give you realistic results.
First, any serialization format (even JSON) can "win" given the right benchmark case. Different formats will perform very, very differently depending on the content. Is it string-heavy, number-heavy, or object heavy (i.e. with deep message trees)? Different formats have different strengths here (Cap'n Proto is incredibly good at numbers, for example, because it doesn't transform them at all; JSON is incredibly bad at them). Is your message size incredibly short, medium-length, or very large? Short messages will mostly exercise the setup/teardown code rather than body processing (but setup/teardown is important -- sometimes real-world use cases involve lots of small messages!). Very large messages will bust the L1/L2/L3 cache and tell you more about memory bandwidth than parsing complexity (but again, this is important -- some implementations are more cache-friendly than others).
Even after considering all that, you have another problem: Running code in a loop doesn't actually tell you how it performs in the real world. When run in a tight loop, the instruction cache stays hot and all the branches become highly predictable. So a branch-heavy serialization (like protobuf) will have its branching cost swept under the rug, and a code-footprint-heavy serialization (again... like protobuf) will also get an advantage. This is why micro-benchmarks are only really useful to compare code against other versions of itself (e.g. to test minor optimizations), NOT to compare completely different codebases against each other. To find out how any of this performs in the real world, you need to measure a real-world use case end-to-end. But... to be honest, that's pretty hard. Few people have the time to build two versions of their whole app, based on two different serializations, to see which one wins...
Related
I am experimenting with std::async to populate a vector. The idea behind it is to use multi-threading to save time. However, running some benchmark tests I find that my non-async method is faster!
#include <algorithm>
#include <vector>
#include <future>
std::vector<int> Generate(int i)
{
std::vector<int> v;
for (int j = i; j < i + 10; ++j)
{
v.push_back(j);
}
return v;
}
Async:
std::vector<std::future<std::vector<int>>> futures;
for (int i = 0; i < 200; i+=10)
{
futures.push_back(std::async(
[](int i) { return Generate(i); }, i));
}
std::vector<int> res;
for (auto &&f : futures)
{
auto vec = f.get();
res.insert(std::end(res), std::begin(vec), std::end(vec));
}
Non-async:
std::vector<int> res;
for (int i = 0; i < 200; i+=10)
{
auto vec = Generate(i);
res.insert(std::end(res), std::begin(vec), std::end(vec));
}
My benchmark test shows that the async method is 71 times slower than non-async. What am I doing wrong?
std::async has two modes of operation:
std::launch::async
std::launch::deferred
In this case, you've called std::async without specifying either one, which means it's allowed to choose either one. std::launch::deferred basically means do the work on the calling thread. So std::async returns a future, and with std::launch::deferred, the action you've requested won't be carried out until you call .get on that future. It can be kind of handy under a few circumstances, but it's probably not what you want here.
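For example, here is a minimal sketch (reusing the question's Generate()) of the difference between letting std::async choose a policy and forcing a thread:
#include <future>
#include <vector>
std::vector<int> Generate(int i);    // as defined in the question
void launch_policy_demo(){
    // May run deferred: the work only happens inside get(), on the calling thread.
    auto maybe_lazy = std::async(Generate, 0);
    // Forces a new thread of execution (or throws std::system_error if none can be created).
    auto forced = std::async(std::launch::async, Generate, 10);
    auto v1 = maybe_lazy.get();
    auto v2 = forced.get();
}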
Even if you specify std::launch::async, you need to realize that this starts up a new thread of execution to carry out the action you've requested. It then has to create a future, and use some sort of signalling from the thread to the future to let you know when the computation you've requested is done.
All of that adds a fair amount of overhead--anywhere from microseconds to milliseconds or so, depending on the OS, CPU, etc.
So, for asynchronous execution to make sense, the "stuff" you do asynchronously typically needs to take tens of milliseconds at the very least (and hundreds of milliseconds might be a more sensible lower threshold). I wouldn't get too wrapped up in the exact cutoff, but it needs to be something that takes a while.
So, filling an array asynchronously probably only makes sense if the array is quite a lot larger than you're dealing with here.
For filling memory, you'll quickly run into another problem though: most CPUs are enough faster than main memory that if all you're doing is writing to memory, there's a pretty good chance that a single thread will already saturate the path to memory, so even at best doing the job asynchronously will only gain a little, and may still pretty easily cause a slow-down.
The ideal case for asynchronous operation would be something like one thread that's heavily memory bound, but another that (for example) reads a little bit of data, and does a lot of computation on that small amount of data. In this case, the computation thread will mostly operate on its data in the cache, so it won't get in the way of the memory thread doing its thing.
There are multiple factors that are causing the Multithreaded code to perform (much) slower than the Singlethreaded code.
Your array sizes are too small
Multithreading often has negligible-to-no effect on datasets that are particularly small. In both versions of your code, you're generating 2000 integers, and each Logical Thread (which, because std::async is often implemented in terms of thread pools, might not be the same as a Software Thread) is only generating 10 integers. The cost of spooling up a thread every 10 integers far outweighs the benefit of generating those integers in parallel.
You might see a performance gain if each thread were instead responsible for, say, 10,000 integers each, but you'll probably instead have a different issue:
All your code is bottlenecked by an inherently serial process
Both versions of the code copy the generated integers into a host vector. It would be one thing if the act of generating those integers was itself a time consuming process, but in your case, it's likely just a matter of a small, fast bit of assembly generating each integer.
So the act of copying each integer into the final vector is probably not inherently faster than generating each integer, meaning a sizable chunk of the "work" being done is completely serial, defeating the whole purpose of multithreading your code.
Fixing the code
Compilers are very good at their jobs, so in trying to revise your code, I was only barely able to get multithreaded code that was faster than the serial code. Multiple executions had varying results, so my general assessment is that this kind of code is bad at being multithreaded.
But here's what I came up with:
#include <algorithm>
#include <vector>
#include <future>
#include <chrono>
#include <iostream>
#include <iomanip>
//#1: Constants
constexpr int BLOCK_SIZE = 500000;
constexpr int NUM_OF_BLOCKS = 20;
std::vector<int> Generate(int i) {
std::vector<int> v;
for (int j = i; j < i + BLOCK_SIZE; ++j) {
v.push_back(j);
}
return v;
}
void asynchronous_attempt() {
std::vector<std::future<void>> futures;
//#2: Preallocated Vector
std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
auto it = res.begin();
for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i+=BLOCK_SIZE)
{
futures.push_back(std::async(
[it](int i) {
auto vec = Generate(i);
//#3 Copying done multithreaded
std::copy(vec.begin(), vec.end(), it + i);
}, i));
}
for (auto &&f : futures) {
f.get();
}
}
void serial_attempt() {
//#4 Changes here to show fair comparison
std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
auto it = res.begin();
for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i+=BLOCK_SIZE) {
auto vec = Generate(i);
it = std::copy(vec.begin(), vec.end(), it);
}
}
int main() {
using clock = std::chrono::steady_clock;
std::cout << "Theoretical # of Threads: " << std::thread::hardware_concurrency() << std::endl;
auto begin = clock::now();
asynchronous_attempt();
auto end = clock::now();
std::cout << "Duration of Multithreaded Attempt: " << std::setw(10) << (end - begin).count() << "ns" << std::endl;
begin = clock::now();
serial_attempt();
end = clock::now();
std::cout << "Duration of Serial Attempt: " << std::setw(10) << (end - begin).count() << "ns" << std::endl;
}
This resulted in the following output:
Theoretical # of Threads: 2
Duration of Multithreaded Attempt: 361149213ns
Duration of Serial Attempt: 364785676ns
Given that this was on an online compiler (here) I'm willing to bet the multithreaded code might win out on a dedicated machine, but I think this at least demonstrates that the two methods are at least on par.
Below are the changes I made, that are ID'd in the code:
We've dramatically increased the number of integers being generated, to force the threads to do actual meaningful work, instead of getting bogged down on OS-level housekeeping
The vector has its size pre-allocated. No more frequent resizing.
Now that the space has been preallocated, we can multithread the copying instead of doing it in serial later.
We have to change the serial code so it also preallocates + copies so that it's a fair comparison.
Now, we've ensured that all the code is indeed running in parallel, and while it's not amounting to a substantial improvement over the serial code, it's at least no longer exhibiting the degenerate performance losses we were seeing before.
First of all, you are not forcing std::async to work asynchronously (you would need to specify the std::launch::async policy to do so). Second of all, it's overkill to asynchronously create a std::vector of 10 ints. It's just not worth it. Remember - using more threads does not mean that you will see a performance benefit! Creating a thread (or even using a thread pool) introduces some overhead which, in this case, seems to dwarf the benefits of running the tasks asynchronously.
Thanks #NathanOliver ;>
I was kind of bored, so I wanted to try using std::thread and eventually measure the performance of single- and multithreaded console applications. This is a two-part question. So I started with a single-threaded sum of a massive vector of ints (800,000 ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added range based and iterator based for loop and measured the same way with chrono::high_resolution_clock.
for (auto& val : ints)
sum += val;
for (auto it = ints.begin(); it != ints.end(); ++it)
sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug version, so I changed to release and the difference was ~1 ms in favor of the index-based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference of 1 ms in release mode?
I moved on to the thread creation, and tried to do a parallel sum of the array with this lambda (captured everything by reference so I could use vector of ints and a mutex previously declared) using index based for.
auto func = [&](int start, int total, int index)
{
int partial_sum = 0;
auto s = chrono::high_resolution_clock::now();
for (int i = start; i < start + total; ++i)
partial_sum += ints[i];
auto e = chrono::high_resolution_clock::now();
auto d = e - s;
m.lock();
cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
sum += partial_sum;
m.unlock();
};
for (int i = 0; i < 8; ++i)
threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is: what's going on here? A solution with 2 threads ended up at ~30ms as well. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 threads, so yes I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.
EDIT: Sorry forgot the mention this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 compiler or whatever it's called.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but really called. For template code, you really need to turn on optimisation to avoid having all the code taken literally. For example, it's likely that your iterator code will call ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which calls operator== and so on and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, change into turbo mode, change into heat protection mode. And rounding the times to milliseconds isn't really helpful. One thread could have taken 6.49 milliseconds and another 6.51, and they got rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls and function calls have certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
Doesn't use the result of the loop at all so when optimized, the compiler should simply choose to throw away the code resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The difference of 1ms may be produced by high granularity of the "high_resolution_clock" in the used implementation of the standard library and by differences in process scheduling during the execution. I measured the index based for being 0.04 ms slower, but that result is meaningless.
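As an aside, a simple way to keep release-mode timings meaningful is to make the result observable, e.g. by printing the sum after the measurement. A minimal self-contained sketch (not the question's exact code):
#include <chrono>
#include <iostream>
#include <vector>
int main(){
    std::vector<int> ints(800000, 1);
    auto start = std::chrono::high_resolution_clock::now();
    long long sum = 0;
    for (auto& val : ints)
        sum += val;
    auto end = std::chrono::high_resolution_clock::now();
    // Printing the sum makes the loop's result observable, so the compiler cannot discard it.
    std::cout << "sum = " << sum << ", range loop: "
              << std::chrono::duration<double, std::milli>(end - start).count() << " ms\n";
}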
Aside from how std::thread is implemented on Windows, I would like to draw your attention to your available execution units and context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS, admittedly most of which are asleep, that need time and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check if Visual Studio has Iterator checking enabled in debugging. When it's enabled every time an iterator is used it is bounds-checked and such. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
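If you just want to see the effect of checked iterators in a Debug build, they can be disabled with a preprocessor define. A sketch (caveat: every translation unit and linked library must agree on this value, otherwise you get link errors):
// Must be defined before any standard-library header is included,
// e.g. on the compiler command line or at the very top of each source file.
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>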
Lastly: have a look at the /openmp switch (-fopenmp on GCC). If you enable that and apply the OpenMP #pragmas to your for-loops you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool. :) ) and OpenMP's performance is pretty damn good.
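A rough sketch of what an OpenMP version of the sum could look like (compile with /openmp in Visual Studio or -fopenmp with GCC; the reduction clause combines the per-thread partial sums):
#include <iostream>
#include <vector>
int main(){
    std::vector<int> ints(800000, 1);
    long long sum = 0;
    // Each thread sums a chunk of the loop; 'reduction' merges the partial sums at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < static_cast<int>(ints.size()); ++i)
        sum += ints[i];
    std::cout << "sum = " << sum << std::endl;
}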
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much which would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing, and the 8-thread time totals being so much greater, which as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this is demonstrating just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated. (I see no thread.start() or threads.StartAll() or the like.) So, during the time the first thread instance gets active, the main thread might (or might not) be preempted. There is no guarantee that the threads are spawned on individual cores, after all (thread affinity).
If you have a closer look at POSIX APIs, there is the notion of "application context" and "system context", which basically implies, that there might be an OS policy in place which would not use all cores for 1 application.
On Windows (this is where you were testing), maybe the threads are not being spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay. (Such as completion ports etc.).
Unfortunately my machine is pretty fast, so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out that, typically, it starts to pay off to go parallel when the computation time is well beyond the length of a time slice (rule of thumb).
Here my "native" Windows implementation, which - for a large enough array finally makes the threads win over a single threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
Range( const int32_t *p, size_t l)
: data(p)
, length(l)
, result(0)
{}
const int32_t *data;
size_t length;
int32_t result;
};
static int32_t Sum(const int32_t * data, size_t length)
{
int32_t sum = 0;
const int32_t *end = data + length;
for (; data != end; data++)
{
sum += *data;
}
return sum;
}
static int32_t TestSingleThreaded(const Range& range)
{
return Sum(range.data, range.length);
}
DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
Range * myRange = reinterpret_cast<Range*>(lpParameter);
myRange->result = Sum(myRange->data, myRange->length);
return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<HANDLE> threadHandles;
threadHandles.reserve(ncores);
for (size_t i = 0; i < ncores; ++i)
{
threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
if (WAIT_OBJECT_0 == waitResult)
{
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
}
else
{
throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
}
for (auto& h : threadHandles)
{
::CloseHandle(h);
}
return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<std::thread> threads;
for (size_t i = 0; i < ncores; ++i)
{
threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
for (auto& t : threads)
{
t.join();
}
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
return result;
}
void TestNativeThreads()
{
const size_t DATA_SIZE = 800000000ULL;
typedef std::vector<int32_t> DataVector;
DataVector data;
data.reserve(DATA_SIZE);
for (size_t i = 0; i < DATA_SIZE; ++i)
{
data.push_back(static_cast<int32_t>(i));
}
Range r = { data.data(), data.size() };
std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
int32_t result = TestSingleThreaded(r);
std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Single threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithNCores(r, 8);
std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Multi threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithSTLThreads(r, 8);
std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "std::thread sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
}
Here is the output of this code on my machine:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last but not least, I feel urged to mention that, the way this code is written, it is more a memory I/O performance benchmark than a core CPU computation benchmark.
Better computation benchmarks would use small amounts of data that are local, fit into CPU caches, etc.
Maybe it would be interesting to experiment with the splitting of the data into ranges. What if each thread were "jumping" over the data from the start to an end with a gap of ncores? Thread 1: 0 8 16... Thread 2: 1 9 17 ... etc.? Maybe then the "locality" of the memory could gain extra speed.
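A sketch of what that interleaved split could look like, as a variant of the Sum() helper above (hypothetical; each thread k would sum elements k, k+ncores, k+2*ncores, ...):
static int32_t SumStrided(const int32_t *data, size_t length, size_t offset, size_t stride)
{
    int32_t sum = 0;
    for (size_t i = offset; i < length; i += stride)
    {
        sum += data[i];
    }
    return sum;
}
// Thread i would then compute SumStrided(range.data, range.length, i, ncores).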
I'm testing how reading multiple streams of data affects a CPUs caching performance. I'm using the following code to benchmark this. The benchmark reads integers stored sequentially in memory and writes partial sums back sequentially. The number of sequential blocks that are read from is varied. Integers from the blocks are read in a round-robin manner.
#include <iostream>
#include <vector>
#include <chrono>
using std::vector;
void test_with_split(int num_arrays) {
int num_values = 100000000;
// Fix up the number of values. The effect of this should be insignificant.
num_values -= (num_values % num_arrays);
int num_values_per_array = num_values / num_arrays;
// Initialize data to process
auto results = vector<int>(num_values);
auto arrays = vector<vector<int>>(num_arrays);
for (int i = 0; i < num_arrays; ++i) {
arrays.emplace_back(num_values_per_array);
}
for (int i = 0; i < num_values; ++i) {
arrays[i%num_arrays].emplace_back(i);
results.emplace_back(0);
}
// Try to clear the cache
const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
char *c = (char *)malloc(size);
for (int i = 0; i < 100; i++)
for (int j = 0; j < size; j++)
c[j] = i*j;
free(c);
auto start = std::chrono::high_resolution_clock::now();
// Do the processing
int sum = 0;
for (int i = 0; i < num_values; ++i) {
sum += arrays[i%num_arrays][i/num_arrays];
results[i] = sum;
}
std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
int main() {
int num_arrays = 1;
while (true) {
test_with_split(num_arrays++);
}
}
Here are the timings for splitting 1-80 ways on an Intel Core 2 Quad CPU Q9550 @ 2.83 GHz:
The bump in the speed soon after 8 streams makes sense to me, as the processor has an 8-way associative L1 cache. The 24-way associative L2 cache in turn explains the bump at 24 streams. These especially hold if I'm getting the same effects as in Why is one loop so much slower than two loops?, where multiple big allocations always end up in the same associativity set. To compare I've included at the end timings when the allocation is done in one big block.
However, I don't fully understand the bump from one stream to two streams. My own guess would be that it has something to do with prefetching to L1 cache. Reading the Intel 64 and IA-32 Architectures Optimization Reference Manual it seems that the L2 streaming prefetcher supports tracking up to 32 streams of data, but no such information is given for the L1 streaming prefetcher. Is the L1 prefetcher unable to keep track of multiple streams, or is there something else at play here?
Background
I'm investigating this because I want to understand how organizing entities in a game engine as components in the structure-of-arrays style affects performance. For now it seems that the data required by a transformation being in two components vs. it being in 8-10 components won't matter much with modern CPUs. However, the testing above suggests that sometimes it might make sense to avoid splitting some data into multiple components if that would allow a "bottlenecking" transformation to only use one component, even if this means that some other transformation would have to read data it is not interested in.
Allocating in one block
Here are the timings if, instead of allocating multiple blocks of data, only one block is allocated and accessed in a strided manner. This does not change the bump from one stream to two, but I've included it for the sake of completeness.
And here is the modified code for that:
void test_with_split(int num_arrays) {
int num_values = 100000000;
num_values -= (num_values % num_arrays);
int num_values_per_array = num_values / num_arrays;
// Initialize data to process
auto results = vector<int>(num_values);
auto array = vector<int>(num_values);
for (int i = 0; i < num_values; ++i) {
array.emplace_back(i);
results.emplace_back(0);
}
// Try to clear the cache
const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
char *c = (char *)malloc(size);
for (int i = 0; i < 100; i++)
for (int j = 0; j < size; j++)
c[j] = i*j;
free(c);
auto start = std::chrono::high_resolution_clock::now();
// Do the processing
int sum = 0;
for (int i = 0; i < num_values; ++i) {
sum += array[(i%num_arrays)*num_values_per_array+i/num_arrays];
results[i] = sum;
}
std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
Edit 1
I made sure that the 1 vs 2 splits difference was not due to the compiler unrolling the loop and optimizing the first iteration differently. Using __attribute__((noinline)) I made sure the work function is not inlined into the main function. I verified that it did not happen by looking at the generated assembly. The timings after these changes were the same.
To answer the main part of your question: Is the L1 prefetcher able to keep track of multiple streams?
No. This is actually because the L1 cache doesn't have a prefetcher at all. The L1 cache isn't big enough to risk speculatively fetching data that might not be used. It would cause too many evictions and adversely impact any software that isn't reading data in specific patterns suited to that particular L1 cache prediction scheme. Instead L1 caches data that has been explicitly read or written. L1 caches are only helpful when you are writing data and re-reading data that has recently been accessed.
The L1 cache implementation is not the reason for your profile bump from 1X to 2X array depth. On streaming reads like what you've set up, the L1 cache plays little or no factor in performance. Most of your reads are coming directly from the L2 cache. In your first example using nested vectors, some number of reads are probably pulled from L1 (see below).
My guess is your bump from 1X to 2X has a lot to do with the algo and how the compiler is optimizing it. If the compiler knows num_arrays is a constant equal to 1, then it will automatically eliminate a lot of per-iteration overhead for you.
Now for the second part: why is the second version faster?
The reason for the second version being faster is not so much in how the data is arranged in physical memory, but rather what under-the-hood logic change a nested std::vector<std::vector<int>> type implies.
In the nested (first) case, compiled code performs the following steps:
Accesses top-level std::vector class. This class contains a pointer to the start of the data array.
That pointer value must be loaded from memory.
Add current loop offset [i%num_arrays] to that pointer.
Access nested std::vector class data. (Likely L1 cache hit)
Load pointer to the vector's start of data array. (Likely L1 cache hit)
Add loop offset [i/num_arrays]
Read data. Finally!
(note that the chances of getting L1 cache hits on steps #4 and #5 decrease drastically after 24 streams or so, due to the likelihood of evictions before the next iteration through the loop)
The second version, by comparison:
Accesses top-level std::vector class.
Load pointer to the vector's start of data array.
Add offset [(i%num_arrays)*num_values_per_array+i/num_arrays]
Read data!
An entire set of under-the-hood steps are removed. The calculation for offset is slightly longer since it needs an extra multiply by num_values_per_array. But the other steps more than make up for it.
I have a vector<bool> and I'd like to zero it out. I need the size to stay the same.
The normal approach is to iterate over all the elements and reset them. However, vector<bool> is a specially optimized container that, depending on implementation, may store only one bit per element. Is there a way to take advantage of this to clear the whole thing efficiently?
bitset, the fixed-length variant, has the set function. Does vector<bool> have something similar?
There seem to be a lot of guesses but very few facts in the answers that have been posted so far, so perhaps it would be worthwhile to do a little testing.
#include <vector>
#include <iostream>
#include <time.h>
int seed(std::vector<bool> &b) {
srand(1);
for (int i = 0; i < b.size(); i++)
b[i] = ((rand() & 1) != 0);
int count = 0;
for (int i = 0; i < b.size(); i++)
if (b[i])
++count;
return count;
}
int main() {
std::vector<bool> bools(1024 * 1024 * 32);
int count1= seed(bools);
clock_t start = clock();
bools.assign(bools.size(), false);
double using_assign = double(clock() - start) / CLOCKS_PER_SEC;
int count2 = seed(bools);
start = clock();
for (int i = 0; i < bools.size(); i++)
bools[i] = false;
double using_loop = double(clock() - start) / CLOCKS_PER_SEC;
int count3 = seed(bools);
start = clock();
size_t size = bools.size();
bools.clear();
bools.resize(size);
double using_clear = double(clock() - start) / CLOCKS_PER_SEC;
int count4 = seed(bools);
start = clock();
std::fill(bools.begin(), bools.end(), false);
double using_fill = double(clock() - start) / CLOCKS_PER_SEC;
std::cout << "Time using assign: " << using_assign << "\n";
std::cout << "Time using loop: " << using_loop << "\n";
std::cout << "Time using clear: " << using_clear << "\n";
std::cout << "Time using fill: " << using_fill << "\n";
std::cout << "Ignore: " << count1 << "\t" << count2 << "\t" << count3 << "\t" << count4 << "\n";
}
So this creates a vector, sets some randomly selected bits in it, counts them, and clears them (and repeats). The setting/counting/printing is done to ensure that even with aggressive optimization, the compiler can't/won't optimize out our code to clear the vector.
I found the results interesting, to say the least. First the result with VC++:
Time using assign: 0.141
Time using loop: 0.068
Time using clear: 0.141
Time using fill: 0.087
Ignore: 16777216 16777216 16777216 16777216
So, with VC++, the fastest method is what you'd probably initially think of as the most naive -- a loop that assigns to each individual item. With g++, the results are just a tad different though:
Time using assign: 0.002
Time using loop: 0.08
Time using clear: 0.002
Time using fill: 0.001
Ignore: 16777216 16777216 16777216 16777216
Here, the loop is (by far) the slowest method (and the others are basically tied -- the 1 ms difference in speed isn't really repeatable).
For what it's worth, in spite of this part of the test showing up as much faster with g++, the overall times were within 1% of each other (4.944 seconds for VC++, 4.915 seconds for g++).
Try
v.assign(v.size(), false);
Have a look at this link:
http://www.cplusplus.com/reference/vector/vector/assign/
Or the following
std::fill(v.begin(), v.end(), 0)
You are out of luck. std::vector<bool> is a specialization that apparently does not even guarantee contiguous memory or random access iterators (or even forward?!), at least based on my reading of cppreference -- decoding the standard would be the next step.
So write implementation specific code, pray and use some standard zeroing technique, or do not use the type. I vote 3.
The received wisdom is that it was a mistake, and may become deprecated. Use a different container if possible. And definitely do not mess around with the internal guts, or rely on its packing. Check if you have dynamic bitset in your std library mayhap, or roll your own wrapper around std::vector<unsigned char>.
I ran into this as a performance issue recently. I hadn't tried looking for answers on the web, but did find that using assignment with the constructor was 10x faster using g++ -O3 (Debian 4.7.2-5) 4.7.2. I found this question because I was looking to avoid the additional malloc. Looks like the assign is optimized as well as the constructor, and about twice as good in my benchmark.
unsigned sz = v.size(); for (unsigned ii = 0; ii != sz; ++ii) v[ii] = false;
v = std::vector<bool>(sz, false); // 10x faster
v.assign(sz, false); // 20x faster
So, I wouldn't say to shy away from using the specialization of vector<bool>; just be very cognizant of the bit vector representation.
Use the std::vector<bool>::assign method, which is provided for this purpose.
If an implementation is specific for bool, then assign, most likely, also implemented appropriately.
If you're able to switch from vector<bool> to a custom bit vector representation, then you can use a representation designed specifically for fast clear operations, and get some potentially quite significant speedups (although not without tradeoffs).
The trick is to use integers per bit vector entry and a single 'rolling threshold' value that determines which entries actually then evaluate to true.
You can then clear the bit vector by just increasing the single threshold value, without touching the rest of the data (until the threshold overflows).
A more complete write up about this, and some example code, can be found here.
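A minimal sketch of that idea (hypothetical names; each slot stores the 'generation' in which it was last set, so clearing is just bumping the current generation; the trade-off is one integer of storage per bit):
#include <algorithm>
#include <cstdint>
#include <vector>
class FastClearBitVector {
public:
    explicit FastClearBitVector(std::size_t n) : gen_(1), slots_(n, 0) {}
    void set(std::size_t i)        { slots_[i] = gen_; }
    bool test(std::size_t i) const { return slots_[i] == gen_; }
    // O(1) clear: entries written in older generations no longer compare equal.
    void clear_all() {
        if (++gen_ == 0) {                                   // counter wrapped around:
            std::fill(slots_.begin(), slots_.end(), 0);      // fall back to a real clear
            gen_ = 1;
        }
    }
private:
    std::uint32_t gen_;
    std::vector<std::uint32_t> slots_;
};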
It seems that one nice option hasn't been mentioned yet:
auto size = v.size();
v.resize(0);
v.resize(size);
The STL implementer will supposedly have picked the most efficient means of zeroising, so we don't even need to know which particular method that might be. And this works with real vectors as well (think templates), not just the std::vector<bool> monstrosity.
There can be a minuscule added advantage for reused buffers in loops (e.g. sieves, whatever), where you simply resize to whatever will be needed for the current round, instead of to the original size.
As an alternative to std::vector<bool>, check out boost::dynamic_bitset (https://www.boost.org/doc/libs/1_72_0/libs/dynamic_bitset/dynamic_bitset.html). You can zero it out (i.e., set each element to false) by calling the reset() member function.
Like clearing, say, std::vector<int>, reset on a boost::dynamic_bitset can also compile down to a memset, whereas you probably won't get that with std::vector<bool>. For example, see https://godbolt.org/z/aqSGCi
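A quick usage sketch, assuming Boost is available:
#include <boost/dynamic_bitset.hpp>
int main() {
    boost::dynamic_bitset<> bits(1024 * 1024 * 32);   // all bits start out false
    bits.set(12345);                                  // set one bit
    bits.reset();                                     // zero everything; the size stays the same
    return bits.test(12345) ? 1 : 0;
}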
I made a test to compare string operations in several languages for choosing a language for the server-side application. The results seemed normal until I finally tried C++, which surprised me a lot. So I wonder whether I missed some optimization, and I've come here for help.
The tests are mainly intensive string operations, including concatenation and searching. The tests were performed on Ubuntu 11.10 amd64 with GCC 4.6.1. The machine is a Dell Optiplex 960 with 4 GB RAM and a quad-core CPU.
in Python (2.7.2):
def test():
x = ""
limit = 102 * 1024
while len(x) < limit:
x += "X"
if x.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0:
print("Oh my god, this is impossible!")
print("x's length is : %d" % len(x))
test()
which gives result:
x's length is : 104448
real 0m8.799s
user 0m8.769s
sys 0m0.008s
in Java (OpenJDK-7):
public class test {
public static void main(String[] args) {
int x = 0;
int limit = 102 * 1024;
String s="";
for (; s.length() < limit;) {
s += "X";
if (s.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ") > 0)
System.out.printf("Find!\n");
}
System.out.printf("x's length = %d\n", s.length());
}
}
which gives result:
x's length = 104448
real 0m50.436s
user 0m50.431s
sys 0m0.488s
in Javascript (Nodejs 0.6.3)
function test()
{
var x = "";
var limit = 102 * 1024;
while (x.length < limit) {
x += "X";
if (x.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0)
console.log("OK");
}
console.log("x's length = " + x.length);
}();
which gives result:
x's length = 104448
real 0m3.115s
user 0m3.084s
sys 0m0.048s
in C++ (g++ -Ofast)
It's not surprising that Node.js performs better than Python or Java. But I expected libstdc++ to give much better performance than Node.js, so the result really surprised me.
#include <iostream>
#include <string>
using namespace std;
void test()
{
int x = 0;
int limit = 102 * 1024;
string s("");
for (; s.size() < limit;) {
s += "X";
if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != string::npos)
cout << "Find!" << endl;
}
cout << "x's length = " << s.size() << endl;
}
int main()
{
test();
}
which gives result:
x length = 104448
real 0m5.905s
user 0m5.900s
sys 0m0.000s
Brief Summary
OK, now let's see the summary:
javascript on Nodejs(V8): 3.1s
Python on CPython 2.7.2 : 8.8s
C++ with libstdc++: 5.9s
Java on OpenJDK 7: 50.4s
Surprisingly! I tried "-O2, -O3" in C++ but noting helped. C++ seems about only 50% performance of javascript in V8, and even poor than CPython. Could anyone explain to me if I had missed some optimization in GCC or is this just the case? Thank you a lot.
It's not that std::string performs poorly (as much as I dislike C++), it's that string handling is so heavily optimized for those other languages.
Your comparisons of string performance are misleading, and presumptuous if they are intended to represent more than just that.
I know for a fact that Python string objects are completely implemented in C, and indeed on Python 2.7, numerous optimizations exist due to the lack of separation between unicode strings and bytes. If you ran this test on Python 3.x you will find it considerably slower.
Javascript has numerous heavily optimized implementations. It's to be expected that string handling is excellent here.
Your Java result may be due to improper string handling, or some other poor case. I expect that a Java expert could step in and fix this test with a few changes.
As for your C++ example, I'd expect performance to slightly exceed the Python version. It does the same operations, with less interpreter overhead. This is reflected in your results. Preceding the test with s.reserve(limit); would remove reallocation overhead.
I'll repeat that you're only testing a single facet of the languages' implementations. The results for this test do not reflect the overall language speed.
I've provided a C version to show how silly such pissing contests can be:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
void test()
{
int limit = 102 * 1024;
char s[limit];
size_t size = 0;
while (size < limit) {
s[size++] = 'X';
if (memmem(s, size, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
fprintf(stderr, "zomg\n");
return;
}
}
printf("x's length = %zu\n", size);
}
int main()
{
test();
return 0;
}
Timing:
matt#stanley:~/Desktop$ time ./smash
x's length = 104448
real 0m0.681s
user 0m0.680s
sys 0m0.000s
So I went and played a bit with this on ideone.org.
Here is a slightly modified version of your original C++ program, but with the appending in the loop eliminated, so it only measures the call to std::string::find(). Note that I had to cut the number of iterations to ~40%, otherwise ideone.org would kill the process.
#include <iostream>
#include <string>
int main()
{
const std::string::size_type limit = 42 * 1024;
unsigned int found = 0;
//std::string s;
std::string s(limit, 'X');
for (std::string::size_type i = 0; i < limit; ++i) {
//s += 'X';
if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
++found;
}
if(found > 0)
std::cout << "Found " << found << " times!\n";
std::cout << "x's length = " << s.size() << '\n';
return 0;
}
My results at ideone.org are time: 3.37s. (Of course, this is highly questionable, but indulge me for a moment and wait for the other result.)
Now we take this code and swap the commented lines, to test appending, rather than finding. Note that, this time, I had increased the number of iterations tenfold in trying to see any time result at all.
#include <iostream>
#include <string>
int main()
{
const std::string::size_type limit = 1020 * 1024;
unsigned int found = 0;
std::string s;
//std::string s(limit, 'X');
for (std::string::size_type i = 0; i < limit; ++i) {
s += 'X';
//if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
// ++found;
}
if(found > 0)
std::cout << "Found " << found << " times!\n";
std::cout << "x's length = " << s.size() << '\n';
return 0;
}
My results at ideone.org, despite the tenfold increase in iterations, are time: 0s.
My conclusion: This benchmark is, in C++, highly dominated by the searching operation, the appending of the character in the loop has no influence on the result at all. Was that really your intention?
The idiomatic C++ solution would be:
#include <iostream>
#include <string>
#include <algorithm>
int main()
{
const int limit = 102 * 1024;
std::string s;
s.reserve(limit);
const std::string pattern("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
for (int i = 0; i < limit; ++i) {
s += 'X';
if (std::search(s.begin(), s.end(), pattern.begin(), pattern.end()) != s.end())
std::cout << "Omg Wtf found!";
}
std::cout << "X's length = " << s.size();
return 0;
}
I could speed this up considerably by putting the string on the stack, and using memmem -- but there seems to be no need. Running on my machine, this is over 10x the speed of the python solution already.
[On my laptop]
time ./test
X's length = 104448
real 0m2.055s
user 0m2.049s
sys 0m0.001s
That is the most obvious one: please try to do s.reserve(limit); before the main loop.
Documentation is here.
I should mention that direct usage of standard classes in C++ in the same way you are used to in Java or Python will often give you sub-par performance if you are unaware of what is done behind the scenes. There is no magical performance in the language itself; it just gives you the right tools.
My first thought is that there isn't a problem.
C++ gives second-best performance, nearly ten times faster than Java. Maybe all but Java are running close to the best performance achievable for that functionality, and you should be looking at how to fix the Java issue (hint - StringBuilder).
In the C++ case, there are some things to try to improve performance a bit (a sketch follows after this list). In particular...
s += 'X'; rather than s += "X";
Declare string searchpattern ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"); outside the loop, and pass this to the find calls. A std::string instance knows its own length, whereas a C string requires a linear-time check to determine that, and this may (or may not) be relevant to std::string::find performance.
Try using std::stringstream, for a similar reason to why you should be using StringBuilder for Java, though most likely the repeated conversions back to string will create more problems.
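A sketch of the first two tweaks applied to the question's loop (same limit and behaviour otherwise; the stringstream idea is left out, per the caveat above):
#include <iostream>
#include <string>
int main()
{
    const int limit = 102 * 1024;
    std::string s;
    const std::string searchpattern("ABCDEFGHIJKLMNOPQRSTUVWXYZ");  // built once, outside the loop
    while (static_cast<int>(s.size()) < limit) {
        s += 'X';                                                    // single char, not a C string
        if (s.find(searchpattern, 0) != std::string::npos)
            std::cout << "Find!" << std::endl;
    }
    std::cout << "x's length = " << s.size() << std::endl;
    return 0;
}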
Overall, the result isn't too surprising though. JavaScript, with a good JIT compiler, may be able to optimise a little better than C++ static compilation is allowed to in this case.
With enough work, you should always be able to optimise C++ better than JavaScript, but there will always be cases where that doesn't just naturally happen and where it may take a fair bit of knowledge and effort to achieve that.
What you are missing here is the inherent complexity of the find search.
You are executing the search 102 * 1024 (104 448) times. A naive search algorithm will, each time, try to match the pattern starting from the first character, then the second, etc...
Therefore, you have a string that is going from length 1 to N, and at each step you search the pattern against this string, which is a linear operation in C++. That is N * (N+1) / 2 = 5 454 744 576 comparisons. I am not as surprised as you are that this would take some time...
Let us verify the hypothesis by using the overload of find that searches for a single A:
Original: 6.94938e+06 ms
Char : 2.10709e+06 ms
About 3 times faster, so we are within the same order of magnitude. Therefore the use of a full string is not really interesting.
Conclusion ? Maybe that find could be optimized a bit. But the problem is not worth it.
Note: and to those who tout Boyer Moore, I am afraid that the needle is too small, so it won't help much. May cut an order of magnitude (26 characters), but no more.
For C++, try to use std::string for "ABCDEFGHIJKLMNOPQRSTUVWXYZ" - in my implementation, string::find(const charT* s, size_type pos = 0) const calculates the length of the string argument.
I just tested the C++ example myself. If I remove the call to std::string::find, the program terminates in no time. Thus the allocations during string concatenation are no problem here.
If I add a variable std::string abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" and replace the occurrence of "ABC...XYZ" in the call to std::string::find, the program needs almost the same time to finish as the original example. This again shows that the allocations, as well as computing the string's length, do not add much to the runtime.
Therefore, it seems that the string search algorithm used by libstdc++ is not as fast for your example as the search algorithms of javascript or python. Maybe you want to try C++ again with your own string search algorithm which fits your purpose better.
C and C++ are not easy languages, and it takes years to learn how to write fast programs.
Here is a strncmp(3) version, modified from the C version:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
void test()
{
int limit = 102 * 1024;
char s[limit];
size_t size = 0;
while (size < limit) {
s[size++] = 'X';
if (!strncmp(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
fprintf(stderr, "zomg\n");
return;
}
}
printf("x's length = %zu\n", size);
}
int main()
{
test();
return 0;
}
Your test code is checking a pathological scenario of excessive string concatenation. (The string-search part of the test could have probably been omitted, I bet you it contributes almost nothing to the final results.) Excessive string concatenation is a pitfall that most languages warn very strongly against, and provide very well known alternatives for, (i.e. StringBuilder,) so what you are essentially testing here is how badly these languages fail under scenarios of perfectly expected failure. That's pointless.
An example of a similarly pointless test would be to compare the performance of various languages when throwing and catching an exception in a tight loop. All languages warn that exception throwing and catching is abysmally slow. They do not specify how slow, they just warn you not to expect anything. Therefore, to go ahead and test precisely that, would be pointless.
So, it would make a lot more sense to repeat your test substituting the mindless string concatenation part (s += "X") with whatever construct is offered by each one of these languages precisely for avoiding string concatenation. (Such as class StringBuilder.)
As mentioned by sbi, the test case is dominated by the search operation.
I was curious how the speed of the text allocation compares between C++ and JavaScript.
System: Raspberry Pi 2, g++ 4.6.3, node v0.12.0, g++ -std=c++0x -O2 perf.cpp
C++ : 770ms
C++ without reserve: 1196ms
Javascript: 2310ms
C++
#include <iostream>
#include <string>
#include <chrono>
using namespace std;
using namespace std::chrono;
void test()
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
int x = 0;
int limit = 1024 * 1024 * 100;
string s("");
s.reserve(1024 * 1024 * 101);
for(int i=0; s.size()< limit; i++){
s += "SUPER NICE TEST TEXT";
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();
cout << duration << endl;
}
int main()
{
test();
}
JavaScript
function test()
{
var time = process.hrtime();
var x = "";
var limit = 1024 * 1024 * 100;
for(var i=0; x.length < limit; i++){
x += "SUPER NICE TEST TEXT";
}
var diff = process.hrtime(time);
console.log('benchmark took %d ms', diff[0] * 1e3 + diff[1] / 1e6 );
}
test();
It seems that Node.js uses better algorithms for substring search. You could implement one yourself and try it out.