Why is pmr::string so slow in these benchmarks?

Trying out the example in Section 5.9.2, "Class monotonic_buffer_resource", of the following article on Polymorphic Memory Resources by Pablo Halpern:
Doc No: N3816
Date: 2013-10-13
Author: Pablo Halpern
phalpern@halpernwightsoftware.com
Polymorphic Memory Resources - r1
(Originally N3525 – Polymorphic Allocators)
The article claims that:
The monotonic_buffer_resource class is designed for very fast memory allocations
in situations where memory is used to build up a few objects and then is released all
at once when those objects go out of scope.
and that:
A particularly good use for a monotonic_buffer_resource is to provide memory for
a local variable of container or string type. For example, the following code
concatenates two strings, looks for the word “hello” in the concatenated string, and
then discards the concatenated string after the word is found or not found. The
concatenated string is expected to be no more than 80 bytes long, so the code is
optimized for these short strings using a small monotonic_buffer_resource [...]
I've benchmarked the example using the Google Benchmark library and Boost.Container 1.69's polymorphic resources, compiled and linked as release binaries with g++-8 on an Ubuntu 18.04 LTS Hyper-V virtual machine, with the following code:
// overload using pmr::string
static bool find_hello(const boost::container::pmr::string& s1, const boost::container::pmr::string& s2)
{
    using namespace boost::container;
    char buffer[80];
    pmr::monotonic_buffer_resource m(buffer, 80);
    pmr::string s(&m);
    s.reserve(s1.length() + s2.length());
    s += s1;
    s += s2;
    return s.find("hello") != pmr::string::npos;
}
// overload using std::string
static bool find_hello(const std::string& s1, const std::string& s2)
{
    std::string s{};
    s.reserve(s1.length() + s2.length());
    s += s1;
    s += s2;
    return s.find("hello") != std::string::npos;
}
static void allocator_local_string(::benchmark::State& state)
{
    CLEAR_CACHE(2 << 12);
    using namespace boost::container;
    pmr::string s1(35, 'c'), s2(37, 'd');
    for (auto _ : state)
    {
        ::benchmark::DoNotOptimize(find_hello(s1, s2));
    }
}
// pmr::string with monotonic buffer resource benchmark registration
BENCHMARK(allocator_local_string)->Repetitions(5);
static void allocator_global_string(::benchmark::State& state)
{
    CLEAR_CACHE(2 << 12);
    std::string s1(35, 'c'), s2(37, 'd');
    for (auto _ : state)
    {
        ::benchmark::DoNotOptimize(find_hello(s1, s2));
    }
}
// std::string using std::allocator and global allocator benchmark registration
BENCHMARK(allocator_global_string)->Repetitions(5);
Here are the results: [benchmark output not reproduced here; the pmr::string version runs roughly five times slower than the std::string version.]
How is the pmr::string benchmark so slow compared to std::string?
I assume std::string's std::allocator should call "new" on the reserve call and construct each character afterwards when executing:
s += s1;
s += s2;
Compare that to a pmr::string using a polymorphic allocator that holds the monotonic_buffer_resource: reserving memory should boil down to simple pointer arithmetic, with no call to "new" needed, since the 80-byte char buffer should be sufficient. Subsequently, it would construct each character just as std::string does.
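To make that mental model concrete, here is a minimal conceptual sketch of what a monotonic (bump) allocation amounts to. This is an illustration of the idea only, not Boost's actual implementation, and it assumes the requested alignment is a power of two:
#include <cstddef>
#include <cstdint>
#include <new>

struct bump_arena {
    char* cur; // next free byte in the buffer
    char* end; // one past the last byte of the buffer

    void* allocate(std::size_t bytes, std::size_t align) {
        auto p = reinterpret_cast<std::uintptr_t>(cur);
        p = (p + align - 1) & ~static_cast<std::uintptr_t>(align - 1); // round up to the alignment
        if (p + bytes > reinterpret_cast<std::uintptr_t>(end))
            throw std::bad_alloc{};                   // the real resource falls back to an upstream resource
        cur = reinterpret_cast<char*>(p + bytes);     // bump the pointer
        return reinterpret_cast<void*>(p);
    }
    // No per-allocation deallocate: the memory is released all at once
    // when the arena goes out of scope.
};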
So, considering that the only differing operation between the pmr::string version of find_hello and the std::string version is the call that reserves memory, with pmr::string using stack allocation and std::string using heap allocation:
Is my benchmark wrong?
Is my interpretation of how allocation should occur wrong?
Why is the pmr::string benchmark approximately 5 times slower than the std::string benchmark?

There is a combination of things that makes Boost's pmr::basic_string slower:
Construction of the pmr::monotonic_buffer_resource has some cost (17 nanoseconds here).
pmr::basic_string::reserve reserves more than one requires. It reserves 96 bytes in this case, which is more than the 80 bytes you have.
Reserving in pmr::basic_string is not free, even when the buffer is big enough (an extra 8 nanoseconds here).
The concatenation of the strings is costly (an extra 64 ns here).
pmr::basic_string::find has a suboptimal implementation, and this is the real cause of the poor speed. GCC's std::basic_string::find uses __builtin_memchr to find the first character that might match, whereas Boost does it all in one big loop. This appears to be the main cost, and it is what makes Boost run slower than std.
So, after increasing the buffer and comparing boost::container::string with boost::container::pmr::string, the pmr version comes out slightly slower (293 ns vs. 276 ns). This is because new and delete are actually quite fast in such micro-benchmarks, faster than the more complicated pmr machinery (17 ns just for constructing the resource). In fact, the default Linux/gcc new/delete reuse the same pointer again and again. This optimization has a very simple and fast implementation, and it also works great with the CPU cache.
As a proof, try this out (without optimization):
#include <iostream>

int main()
{
    for (int i = 0; i < 10; ++i)
    {
        char* ptr = new char[96];
        std::cout << (void*) ptr << '\n';
        delete[] ptr;
    }
}
This prints the same pointer again and again.
The theory is that in a real program, where new/delete don't behave that nicely and can't reuse the same block again and again, new/delete slow down the execution much more and cache locality becomes quite poor. In such a case the pmr-plus-buffer approach is worth it.
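For reference, the same pattern is available in the C++17 standard library via <memory_resource>. A minimal sketch, with the buffer sized at 128 bytes as a guess to stay clear of the 96-byte reservation mentioned above:
#include <memory_resource>
#include <string>

// Standard-library version of the example: the stack buffer backs all
// allocations, and everything is released when m is destroyed.
static bool find_hello(const std::pmr::string& s1, const std::pmr::string& s2)
{
    char buffer[128]; // sized past 96 bytes, since reserve may round capacity up
    std::pmr::monotonic_buffer_resource m(buffer, sizeof buffer);
    std::pmr::string s(&m);
    s.reserve(s1.length() + s2.length());
    s += s1;
    s += s2;
    return s.find("hello") != std::pmr::string::npos;
}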
Conclusion: the implementation of Boost's pmr string is slower than GCC's string, and the pmr machinery is slightly more costly than the simple default new/delete scenario.

Related

Why is StringCopyFromLiteral faster than StringCopyFromString?

The Quick C++ Benchmarks example:
static void StringCopyFromLiteral(benchmark::State& state) {
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        std::string from_literal("hello");
        // Make sure the variable is not optimized away by compiler
        benchmark::DoNotOptimize(from_literal);
    }
}
// Register the function as a benchmark
BENCHMARK(StringCopyFromLiteral);

static void StringCopyFromString(benchmark::State& state) {
    // Code before the loop is not measured
    std::string x = "hello";
    for (auto _ : state) {
        std::string from_string(x);
    }
}
// Register the function as a benchmark
BENCHMARK(StringCopyFromString);
http://quick-bench.com/IcZllt_14hTeMaB_sBZ0CQ8x2Ro
What if I understand assembly...
More results:
http://quick-bench.com/39fLTvRdpR5zdapKSj2ZzE3asCI
The answer is simple. In the case where you construct an std::string from a small string literal, the compiler optimizes this case by directly populating the contents of the string object using constants in assembly. This avoids expensive looping as well as tests to see whether small string optimization (SSO) can be applied. In this case it knows SSO can be applied so the code the compiler generates simply involves writing the string directly into the SSO buffer.
Note this assembly code in the StringCopyFromLiteral case:
// Populate SSO buffer (each set of 4 characters is backwards since
// x86 is little-endian)
19.63% movb $0x6f,0x4(%r15) // "o"
19.35% movl $0x6c6c6568,(%r15) // "lleh"
// Set size
20.26% movq $0x5,0x10(%rsp) // size = 5
// Probably set heap pointer. 0 (nullptr) = use SSO buffer
20.07% movb $0x0,0x1d(%rsp)
You're looking at the constant values right there. That's not very much code, and no loop is required. In fact, the std::string constructor doesn't even have to be invoked! The compiler is just putting stuff in memory in the same places where the std::string constructor would.
If the compiler cannot apply this optimization, the results are quite different -- in particular, if we "hide" the fact that the source is a string literal by first copying the literal into a char array, the results flip:
char x[] = "hello";
for (auto _ : state) {
    std::string created_string(x);
    benchmark::DoNotOptimize(created_string);
}
Now the "from-char-pointer" case takes twice as long! Why?
I suspect that this is because the "copy from char pointer" case cannot simply check to see how long the string is by looking at a value. It needs to know whether small string optimization can be performed. There are a few ways it could go about this:
Measure the length of the string first, make an allocation (if needed), then copy the source to the destination. In the case where SSO does apply (it almost certainly does here) I'd expect this to take twice as long since it has to walk the source twice -- once to measure, once to copy.
Copy from the source character-by-character, appending to the new string. This requires testing on each append operation whether the string is now too long for SSO and needs to be copied into a heap-allocated char array. If the string is currently in a heap-allocated array, it needs to instead test if the allocation needs to be resized. This would also take quite a bit longer since there is at least one test for each character in the source string.
Copy from the source in chunks to lower the number of tests that need to be performed and to avoid walking the source twice. This would be faster than the character-by-character approach both because the number of tests would be lower and, because the source is not being walked twice, the CPU memory cache is going to be more effective. This would only show significant speed improvements for long strings, which we don't have here. For short strings it would work about the same as the first approach (measure, then copy).
Contrast this to the case when it's copying from another string object: it can simply look at the size() of the other string and immediately know whether it can perform SSO, and if it can't perform SSO then it also knows exactly how much memory to allocate for the new string.
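As a sketch of that last point applied to the char-pointer case: if the call site already knows the length, handing it to the (const char*, count) constructor overload removes the measuring pass. This is an untested illustration reusing the benchmark fragment above:
char x[] = "hello";
for (auto _ : state) {
    // The length is known at compile time here, so no measuring walk is needed.
    std::string created_string(x, sizeof(x) - 1);
    benchmark::DoNotOptimize(created_string);
}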

Fast move CString to std::string

I'm working in a codebase with a mixture of CString, const char* and std::string (non-unicode), where all new code uses std::string exclusively. I've now had to do the following:
{
    CString tempstring;
    load_cstring_legacy_method(tempstring);
    stdstring = tempstring;
}
and worry about performance. The strings are DNA sequences so we can easily have 100+ of them with each of them ~3M characters. Note that adjusting load_cstring_legacy_method is not an option. I did a quick test:
// 3M characters
const int stringsize = 3000000;
const int repeat = 1000;
std::chrono::steady_clock::time_point startTime = std::chrono::steady_clock::now();
for (int i = 0; i < repeat; ++i) {
    CString cstring('A', stringsize);
    std::string stdstring(cstring); // Comment out
    cstring.Empty();
}
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - startTime).count() << " ms" << std::endl;
and commenting out the std::string line gives 850 ms; with the assignment it's 3600 ms. The magnitude of the difference is surprising, so I guess the benchmark might not be doing what I expect. Assuming there is a penalty, is there a way I can avoid it?
So your question is how to make the std::string construction faster?
On my machine, comparing this
std::string stdstring(cstring); // 4741 ms
I get better performance this way:
std::string stdstring(cstring, stringsize); // 3419 ms
or if the std::string already exists like the first part of your question suggests:
stdstring.assign(cstring, stringsize); // 3408 ms
Use a more efficient memory allocator. Something like a memory arena/region would substantially help with allocation costs.
If you're really, really desperate, you could theoretically combine ReleaseBuffer with some hideous allocator hacks to avoid the copy altogether. This would involve a lot of pain, though.
In addition, if you have a serious problem, you could consider changing your string implementation. The std::string that ships with Visual Studio employs SSO, or Small String Optimization. This does exactly what it sounds like: it optimizes very small strings, which are quite common all around but not necessarily good for this use case. Another implementation, like COW, could be more appropriate (be super careful if doing so in a multi-threaded environment).
Finally, if you're using an old version of VS, you should also consider upgrading. Move semantics are a huge instawin as far as performance goes.
CString is probably the Unicode version, which explains the slowness. The generic conversion routine cannot assume that the characters used are limited to "ACGT".
You, however, can, and you can shamelessly take advantage of that.
{
    CString tempstring;
    load_cstring_legacy_method(tempstring);
    int len = tempstring.GetLength();
    stdstring.reserve(len);
    for (int i = 0; i != len; ++i)
    {
        stdstring.push_back(static_cast<char>(tempstring[i]));
    }
}
Portable? Only so far as CString is, so Windows variants.
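A more compact variant of the same narrowing copy, assuming MFC's CString interface (GetString() and GetLength()): the range assign performs the same implicit character narrowing as the explicit loop, which is only safe here because the data is known to be limited to "ACGT".
{
    CString tempstring;
    load_cstring_legacy_method(tempstring);
    const TCHAR* p = tempstring.GetString();
    // Narrows each TCHAR to char, just like the static_cast loop above.
    stdstring.assign(p, p + tempstring.GetLength());
}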

Is simple but frequent usage of std::stringstream a premature pessimization?

I have a simple scenario. I need to join two C-strings together into a std::string. I have decided to do this in one of two ways:
Solution 1
void ProcessEvent(char const* pName) {
    std::string fullName;
    fullName.reserve(50); // Ensure minimal reallocations for small event names (50 is an arbitrary limit).
    fullName += "com.domain.events.";
    fullName += pName;
    // Use fullName as needed
}
Solution 2
void ProcessEvent(char const* pName) {
    std::ostringstream ss;
    ss << "com.domain.events." << pName;
    std::string fullName{ss.str()};
    // Use fullName as needed
}
I like Solution 2 better because the code is more natural; Solution 1 looks like a response to a bottleneck measured in performance testing. However, Solution 1 exists for two reasons:
It's a light optimization to reduce allocations. Event management in this application is used quite frequently so there might be benefits (but no measurements have been taken).
I've heard criticism regarding STL streams WRT performance. Some have recommended to only use stringstream when doing heavy string building, especially those involving number conversions and/or usage of manipulators.
Is it a premature pessimization to prefer solution 2 for its simplicity? Or is it a premature optimization to choose solution 1? I'm wondering if I'm too overly concerned about STL streams.
Let's measure it
A quick test with the following functions:
void func1(const char* text) {
    std::string s;
    s.reserve(50);
    s += "com.domain.event.";
    s += text;
}

void func2(const char* text) {
    std::ostringstream oss;
    oss << "com.domain.event." << text;
    std::string s = oss.str();
}
Running each 100 000 times in a loop gives the following results on average on my computer (using gcc-4.9.1):
func1 : 37 milliseconds
func2 : 87 milliseconds
That is, func1 is more than twice as fast.
That being said, I would recommend using the clearest, most readable syntax until you really need the performance. Implement a testable program first, then optimize if it's too slow.
Edit:
As suggested by @Ken P:
void func3(const char* text) {
    std::string s = "com.domain.event" + std::string{text};
}
func3 : 27 milliseconds
The simplest solution is often the fastest.
You didn't mention the third alternative: not pre-allocating anything at all in the string and just letting the optimizer do what it's best at.
Given these two functions, func1 and func3:
void func1(const char* text) {
    std::string s;
    s.reserve(50);
    s += "com.domain.event.";
    s += text;
    std::cout << s;
}

void func3(const char* text) {
    std::string s;
    s += "com.domain.event.";
    s += text;
    std::cout << s;
}
The example at http://goo.gl/m8h2Ks shows that in the gcc assembly for func1, just reserving space for 50 characters adds three extra instructions compared to func3, where no pre-allocation is done. One of them is a string append call, which in turn gives some overhead:
leaq 16(%rsp), %rdi
movl $50, %esi
call std::basic_string<char>::append(char const*, unsigned long)
Looking at the code alone doesn't guarantee that func3 is faster than func1, though, just because it has fewer instructions. Cache effects and other things also contribute to the actual performance, which can only be properly assessed by measuring, as others have pointed out.
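Since the advice here is to measure, a minimal timing harness along those lines might look like this. It is an illustration only: func1 and func3 are the functions above (note they write to stdout, so redirect output when timing), and the iteration count is arbitrary:
#include <chrono>
#include <iostream>

// Times a callable over a fixed number of iterations and reports milliseconds.
template <typename F>
long long time_ms(F f, int iterations = 100000) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f("eventName");
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

int main() {
    std::cout << "func1: " << time_ms(func1) << " ms\n";
    std::cout << "func3: " << time_ms(func3) << " ms\n";
}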

C++ std::string append vs push_back()

This really is a question just for my own interest, one that I haven't been able to settle through the documentation.
I see on http://www.cplusplus.com/reference/string/string/ that append has complexity:
"Unspecified, but generally up to linear in the new string length."
while push_back() has complexity:
"Unspecified; Generally amortized constant, but up to linear in the new string length."
As a toy example, suppose I wanted to append the characters "foo" to a string. Would
myString.push_back('f');
myString.push_back('o');
myString.push_back('o');
and
myString.append("foo");
amount to exactly the same thing? Or is there any difference? You might figure that append would be more efficient because the compiler would know how much memory is required to extend the string by the specified number of characters, while push_back may need to secure memory on each call.
In C++03 (for which most of "cplusplus.com"'s documentation is written), the complexities were unspecified because library implementers were allowed to do Copy-On-Write or "rope-style" internal representations for strings. For instance, a COW implementation might require copying the entire string if a character is modified and there is sharing going on.
In C++11, COW and rope implementations are banned. You should expect constant amortized time per character added or linear amortized time in the number of characters added for appending to a string at the end. Implementers may still do relatively crazy things with strings (in comparison to, say std::vector), but most implementations are going to be limited to things like the "small string optimization".
In comparing push_back and append, push_back deprives the underlying implementation of potentially useful length information which it might use to preallocate space. On the other hand, append requires that an implementation walk over the input twice in order to find that length, so the performance gain or loss is going to depend on a number of unknowable factors such as the length of the string before you attempt the append. That said, the difference is probably extremely Extremely EXTREMELY small. Go with append for this -- it is far more readable.
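For completeness: when the call site already knows the length, the double-walk concern disappears, because append has an overload that takes an explicit count. A small illustration on the toy example above:
// The (const char*, count) overload of append skips the measuring pass.
myString.append("foo", 3);
// Equivalently for a literal, letting the compiler supply the size:
myString.append("foo", sizeof("foo") - 1);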
I had the same doubt, so I made a small test to check (g++ 4.8.5 with the C++11 profile on Linux, Intel, 64-bit under VMware Fusion).
And the result is interesting:
push   : 19
append : 21
++++   : 34
It could be that this is because of the (large) string length, but operator+= is very expensive compared with push_back and append.
It is also interesting that when the operator receives only a character (not a string), it behaves very similarly to push_back.
So as not to depend on previously allocated variables, each test cycle is defined in its own scope.
Note: vCounter simply uses gettimeofday to measure the differences.
TimeCounter vCounter;
{
    string vTest;
    vCounter.start();
    for (int vIdx = 0; vIdx < 1000000; vIdx++) {
        vTest.push_back('a');
        vTest.push_back('b');
        vTest.push_back('c');
    }
    vCounter.stop();
    cout << "push :" << vCounter.elapsed() << endl;
}
{
    string vTest;
    vCounter.start();
    for (int vIdx = 0; vIdx < 1000000; vIdx++) {
        vTest.append("abc");
    }
    vCounter.stop();
    cout << "append :" << vCounter.elapsed() << endl;
}
{
    string vTest;
    vCounter.start();
    for (int vIdx = 0; vIdx < 1000000; vIdx++) {
        vTest += 'a';
        vTest += 'b';
        vTest += 'c';
    }
    vCounter.stop();
    cout << "++++ :" << vCounter.elapsed() << endl;
}
Let me add one more opinion here.
I personally consider it better to use push_back() when adding characters one by one from another string. For instance:
string FilterAlpha(const string& s) {
    string new_s;
    for (auto& it : s) {
        if (isalpha(it)) new_s.push_back(it);
    }
    return new_s;
}
If using append() here, I would have to replace push_back(it) with append(1, it), which is not as readable to me.
Yes, I would also expect append() to perform better for the reasons you gave, and in a situation where you need to append a string, using append() (or operator+=) is certainly preferable (not least also because the code is much more readable).
But what the Standard specifies is the complexity of the operation. And that is generally linear even for append(), because ultimately each character of the string being appended (and possibly all existing characters, if reallocation occurs) needs to be copied. This is true even if memcpy or similar is used.
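To see the amortized-constant behaviour concretely, one can watch how rarely push_back actually reallocates. A small sketch (the growth factor and exact capacities are implementation-defined):
#include <cstddef>
#include <iostream>
#include <string>

int main() {
    std::string s;
    std::size_t cap = s.capacity();
    for (int i = 0; i < 1000; ++i) {
        s.push_back('x');
        if (s.capacity() != cap) { // a reallocation happened
            cap = s.capacity();
            std::cout << "size " << s.size() << " -> capacity " << cap << '\n';
        }
    }
}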

C++ string literals vs. const strings

I know that string literals in C/C++ have static storage duration, meaning that they live "forever", i.e. as long as the program runs.
Thus, if I have a function that is being called very frequently and uses a string literal like so:
void foo(int val)
{
    std::stringstream s;
    s << val;
    lbl->set_label("Value: " + s.str());
}
where the set_label function takes a const std::string& as a parameter.
Should I be using a const std::string here instead of the string literal or would it make no difference?
I need to minimise as much runtime memory consumption as possible.
Edit:
I meant to compare the string literal with a const std::string prefix("Value: "); that is initialized in some sort of a constants header file.
Also, the concatenation here returns a temporary (let us call it "Value: 42"), and a const reference to this temporary is passed to set_label(). Am I correct in this?
Thank you again!
Your program operates on the same literal every time. There is no more efficient form of storage. A std::string would be constructed, duplicated on the heap, then freed every time the function runs, which would be a total waste.
This will use less memory and run much faster (use snprintf if your compiler supports it):
void foo(int val)
{
    char msg[32];
    lbl->set_label(std::string(msg, sprintf(msg, "Value: %d", val)));
}
For even faster implementations, check out C++ performance challenge: integer to std::string conversion
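For reference, the const std::string variant the question asks about could use a function-local static, so the string is constructed exactly once rather than on every call. A sketch only, reusing the question's surrounding code:
void foo(int val)
{
    static const std::string prefix("Value: "); // constructed once, on the first call
    std::stringstream s;
    s << val;
    lbl->set_label(prefix + s.str());
}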
How will you build your const std::string? If you do it from some string literal, in the end it will just be worse (or identical, if the compiler does a good job). A string literal does not consume much memory, and it is static memory besides, which may not be the kind of memory you are running low on.
If you can read all your string literals from, say, a file, and give the memory back to the OS when the strings are no longer used, there may be some way to reduce the memory footprint (but it will probably slow the program down considerably).
But there are probably many other ways to reduce memory consumption before resorting to that kind of thing.
Store them in some kind of resource and load/unload them as necessary.