As far as my knowledge of resource management goes, allocating something on the heap (with operator new) should always be slower than allocating on the stack (automatic storage), because the stack is a LIFO structure: it requires minimal bookkeeping, and computing the next address to allocate is trivial.
So far, so good. Now look at the following code:
#include <chrono>
#include <future>
#include <iostream>
#include <memory>

using std::cout;
using std::cin;
using std::endl;

int bar() { return 42; }

int main()
{
    auto s1 = std::chrono::steady_clock::now();
    std::packaged_task<int()> pt1(bar);
    auto e1 = std::chrono::steady_clock::now();

    auto s2 = std::chrono::steady_clock::now();
    auto sh_ptr1 = std::make_shared<std::packaged_task<int()> >(bar);
    auto e2 = std::chrono::steady_clock::now();

    auto first = std::chrono::duration_cast<std::chrono::nanoseconds>(e1-s1);
    auto second = std::chrono::duration_cast<std::chrono::nanoseconds>(e2-s2);

    cout << "Regular: " << first.count() << endl
         << "Make shared: " << second.count() << endl;

    pt1();
    (*sh_ptr1)();

    cout << "As you can see, both are working correctly: "
         << pt1.get_future().get() << " & "
         << sh_ptr1->get_future().get() << endl;

    return 0;
}
The results seem to contradict the stuff explained above:
Regular: 6131
Make shared: 843
As you can see, both are working correctly: 42 & 42
Program ended with exit code: 0
In the second measurement, besides the call to operator new, the constructor of the std::shared_ptr (auto sh_ptr1) also has to run. I can't seem to understand why this is faster than regular allocation.
What is the explanation for this?
The problem is that the first call to the constructor of std::packaged_task is responsible for initializing a load of per-thread state that is then unfairly attributed to pt1. This is a common problem of benchmarking (particularly microbenchmarking) and is alleviated by warmup; try reading How do I write a correct micro-benchmark in Java?
If I copy your code but run both parts first, the results are the same to within the limits of the resolution of the system clock. This demonstrates another rule of microbenchmarking: run small tests multiple times so that the total time can be measured accurately.
With warmup and running each part 1000 times, I get the following (example):
Regular: 132.986
Make shared: 211.889
The difference (approx 80ns) accords well with the rule of thumb that malloc takes 100ns per call.
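A minimal sketch of what that warmed-up, repeated measurement might look like (the warmup objects and N = 1000 are my own choices for illustration, not the only valid ones):
#include <chrono>
#include <future>
#include <iostream>
#include <memory>

int bar() { return 42; }

int main()
{
    constexpr int N = 1000;

    // Warmup: pay any one-time (e.g. per-thread) initialization cost
    // before measuring.
    { std::packaged_task<int()> warm(bar); }
    { auto p = std::make_shared<std::packaged_task<int()>>(bar); }

    auto s1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        std::packaged_task<int()> pt(bar);  // constructed and destroyed N times
    }
    auto e1 = std::chrono::steady_clock::now();

    auto s2 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        auto p = std::make_shared<std::packaged_task<int()>>(bar);
    }
    auto e2 = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::cout << "Regular: "
              << std::chrono::duration_cast<ns>(e1 - s1).count() / double(N) << "\n"
              << "Make shared: "
              << std::chrono::duration_cast<ns>(e2 - s2).count() / double(N) << "\n";
}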
It is a problem with your micro-benchmark: if you swap the order in which you measure the timings, you get the opposite results (demo).
It looks like the first call of the std::packaged_task constructor causes a big hit. Adding an untimed
std::packaged_task<int()> ignore(bar);
before measuring the time fixes this problem (demo):
Regular: 505
Make shared: 937
I've tried your example at ideone and got a result similar to yours:
Regular: 67950
Make shared: 696
Then I reversed the order of tests:
auto s2 = std::chrono::steady_clock::now();
auto sh_ptr1 = std::make_shared<std::packaged_task<int()> >(bar);
auto e2 = std::chrono::steady_clock::now();
auto s1 = std::chrono::steady_clock::now();
std::packaged_task<int()> pt1(bar);
auto e1 = std::chrono::steady_clock::now();
and found an opposite result:
Regular: 548
Make shared: 68065
So that's not a difference of stack vs. heap, but a difference between the first and second call. Maybe you need to look into the internals of std::packaged_task.
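A quick sketch to make that first-call effect visible: time several consecutive constructions and watch the first one stand out (the count of 5 is arbitrary):
#include <chrono>
#include <future>
#include <iostream>

int bar() { return 42; }

int main()
{
    // On affected platforms, the first construction is expected to be
    // the outlier; subsequent ones should be much cheaper.
    for (int i = 0; i < 5; ++i) {
        auto s = std::chrono::steady_clock::now();
        std::packaged_task<int()> pt(bar);
        auto e = std::chrono::steady_clock::now();
        std::cout << "Construction " << i << ": "
                  << std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count()
                  << " ns\n";
    }
}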
Related
In C++17 if we design a class like this:
class Editor {
public:
    // "copy" constructor
    Editor(const std::string& text) : _text {text} {}
    // "move" constructor (note the const here; see the updates and answers below)
    Editor(const std::string&& text) : _text {std::move(text)} {}
private:
    std::string _text;
};
It might seem (to me at least), that the "move" constructor should be much faster than the "copy" constructor.
But if we try to measure actual times, we will see something different:
#include <chrono>
#include <iostream>
#include <string>

using namespace std;

int current_time()
{
    // NB: narrowing to int; see the update below about overflow for long strings
    return chrono::high_resolution_clock::now().time_since_epoch().count();
}

int main()
{
    int N = 100000;

    auto t0 = current_time();
    for (int i = 0; i < N; i++) {
        std::string a("abcdefgh"s);
        Editor {a}; // copy!
    }

    auto t1 = current_time();
    for (int i = 0; i < N; i++) {
        Editor {"abcdefgh"s};
    }

    auto t2 = current_time();

    cout << "Copy: " << t1 - t0 << endl;
    cout << "Move: " << t2 - t1 << endl;
}
Both copy and move times are in the same range. Here's one of the outputs:
Copy: 36299550
Move: 35762602
I tried with strings as long as 285604 characters, with the same result.
Question: why is the "copy" constructor Editor(const std::string& text) : _text {text} {} so fast? Doesn't it actually create a copy of the input string?
Update: I ran the benchmark given here using the following line: g++ -std=c++1z -O2 main.cpp && ./a.out
Update 2: Fixing the move constructor, as #Caleth suggests (removing const from const std::string&& text), improves things!
Editor(std::string&& text) : _text {std::move(text)} {}
Now the benchmark looks like this:
Copy: 938647
Move: 64
It also depends on your optimization flags. With no optimization, you can (and I did!) get even worse results for the move:
Copy: 4164540
Move: 6344331
Running the same code with -O2 optimization gives a much different result:
Copy: 1264581
Move: 791
See it live on Wandbox.
That's with clang 9.0. On GCC 9.1, the difference is about the same for -O2 and -O3 but not quite as stark between copy and move:
Copy: 775
Move: 508
I'm guessing that's the small string optimization kicking in.
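One way to check that guess (a sketch; the SSO buffer size is implementation-specific, and the capacity of a default-constructed string is only a reasonable proxy for it):
#include <iostream>
#include <string>

int main()
{
    std::string s;
    // With SSO, short strings live inside the string object itself, so
    // even an empty string reports a nonzero capacity (15 on libstdc++,
    // 22 on libc++, at the time of writing).
    std::cout << "sizeof(std::string): " << sizeof(std::string) << '\n';
    std::cout << "capacity of an empty string: " << s.capacity() << '\n';
}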
In general, the containers in the standard library work best with optimizations on because they have a lot of little functions that the compiler can easily inline and collapse when asked to do so.
Also in that first constructor, per Herb Sutter, "Prefer passing a read-only parameter by value if you’re going to make a copy of the parameter anyway, because it enables move from rvalue arguments."
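A sketch of Sutter's by-value "sink" style applied to this Editor: one constructor covers both cases, copying from lvalue arguments and moving from rvalues:
#include <string>
#include <utility>

class Editor {
public:
    // The by-value parameter is copy-constructed from lvalue arguments
    // and move-constructed from rvalues; either way it is then moved
    // into the member, so at most one copy is ever made.
    explicit Editor(std::string text) : _text {std::move(text)} {}
private:
    std::string _text;
};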
Update: For very long strings (300k characters), the results are similar to the above (now using std::chrono::duration in milliseconds to avoid int overflows) with GCC 9.1 and optimizations:
Copy: 22560
Move: 1371
and without optimizations:
Copy: 22259
Move: 1404
const std::string&& looks like a typo.
You can't move from it, so you get a copy instead.
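A tiny demonstration: std::move on a const string yields a const std::string&&, which cannot bind to the move constructor, so the copy constructor is selected and the source is left intact:
#include <iostream>
#include <string>
#include <utility>

int main()
{
    const std::string s = "a long enough string to defeat the small string optimization";
    std::string t = std::move(s);  // binds to the COPY constructor
    std::cout << "source after 'move': " << s << '\n';  // still intact
}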
So your test is really looking at the number of times we have to "build" a string object.
So in the first test:
for (int i = 0; i < N; i++) {
    std::string a("abcdefgh"s); // Build a string once.
    Editor {a}; // copy!        // Here you build the string again.
}                               // So basically two expensive memory
                                // allocations and a copy of the string.
While in the second test:
for (int i = 0; i < N; i++) {
Editor {"abcdefgh"s}; // You build a string once.
// Then internally you move the allocated
// memory (so only one expensive memory
// allocation and copying the string
}
So the difference between the two loops is one extra string copy.
The problem here is that I, as a human, can spot one easy optimization (and the compiler is better at this than me):
for (int i = 0; i < N; i++) {
    std::string a("abcdefgh"s); // This string is only used in a single
                                // place where it is passed to a
                                // function as a const parameter.
                                // So we can optimize it out of the loop.
    Editor {a};
}
So if we manually yank the string outside the loop (equivalent to a valid compiler optimization), this loop has the same effect:
std::string a("abcdefgh"s);
for (int i = 0; i < N; i++) {
    Editor {a};
}
Now this loop only has one allocation and one copy.
So now both loops look the same in terms of the expensive operations.
Now, as a human, I am not going to spot (quickly) all the optimizations possible. I am just trying to point out that with a quick test like this you will not spot a lot of the optimizations that the compiler will do, and thus estimates and timings like this are hard to get right.
On paper you're right, but in practice this is quite easily optimisable so you'll probably find the compiler has ruined your benchmark.
You could benchmark with "optimisations" turned off, but that in itself holds little real-world benefit. It may be possible to trick the compiler in release mode by adding some code that prevents such an optimisation, but off the top of my head I can't imagine what that would look like here.
It's also a relatively small string that can be copied really quickly nowadays.
I think you should just trust your instinct here (because it's correct), while remembering that in practice it might not actually make a lot of difference. But the move certainly won't be worse than the copy.
Sometimes we can and should write obviously "more efficient" code without being able to prove that it'll actually perform better on any particular day of the week with any particular phase of the moon/planetary alignment, because compilers are already trying to make your code as fast as possible.
People may tell you that this is therefore a "premature optimisation", but it really isn't: it's just sensible code.
I had 2 questions regarding the std::shared_ptr control block:
(1) Regarding size:
How can I programmatically find the exact size of the control block for a std::shared_ptr?
(2) Regarding logic:
Additionally, boost::shared_ptr mentions that it is completely lock-free with respect to changes in the control block. (Starting with Boost release 1.33.0, shared_ptr uses a lock-free implementation on most common platforms.) I don't think std::shared_ptr follows the same approach - is this planned for any future C++ version? Doesn't this also mean that boost::shared_ptr is a better idea for multithreaded cases?
(1) Regarding size: How can I programmatically find the exact size of the control block for a std::shared_ptr?
There is no way. It's not directly accessible.
(2) Regarding logic: Additionally, boost::shared_ptr mentions that it is completely lock-free with respect to changes in the control block. (Starting with Boost release 1.33.0, shared_ptr uses a lock-free implementation on most common platforms.) I don't think std::shared_ptr follows the same approach - is this planned for any future C++ version? Doesn't this also mean that boost::shared_ptr is a better idea for multithreaded cases?
Absolutely not. Lock-free implementations are not always better than implementations that use locks. Having an additional constraint, at best, doesn't make the implementation worse, but it cannot possibly make it better.
Consider two equally competent programmers, each doing their best to implement shared_ptr. One must produce a lock-free implementation. The other is completely free to use their best judgment. There is simply no way the one that must produce a lock-free implementation can come out ahead, all other things being equal. At best, a lock-free implementation is optimal and they'll both produce one. At worst, on this platform a lock-free implementation has huge disadvantages, yet one implementer must still use one. Yuck.
The control block is not exposed. In implementations I have read, it is dynamic in size so it can store the deleter contiguously (and/or, in the case of make_shared, the object itself).
In general it contains at least 3 pointer-sized fields: the weak count, the strong count, and the deleter invoker.
At least one implementation relies on RTTI; others do not.
Operations on the counts use atomic operations in the implementations I have read; note that C++ does not require atomic operations to all be lock-free (I believe a platform that doesn't have pointer-sized lock-free operations can still be a conforming C++ platform).
The counts are consistent with each other and with themselves, but no attempt is made to keep them consistent with the state of the pointed-to object. This is why using raw shared_ptrs as copy-on-write pImpls may be error-prone on some platforms.
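Purely as an illustration of that layout, here is a hypothetical sketch of a control block along those lines (not any real library's type; real ones also hold allocator storage and, for make_shared, inline object storage):
#include <atomic>

struct control_block {
    std::atomic<long> strong_count{1};  // shared_ptr owners
    std::atomic<long> weak_count{1};    // weak_ptr owners (+1 while strong > 0)

    virtual void destroy_object() = 0;  // invokes the type-erased deleter
    virtual void destroy_self() = 0;    // frees the block itself
    virtual ~control_block() = default;
};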
(1)
Of course it is best to check the implementation itself; however, you can still make some checks from your program.
The control block is allocated dynamically, so to determine its size you can overload operator new.
You can also check whether std::make_shared provides some optimization of the control block size.
In a proper implementation I would expect that this makes two allocations (one for the A object and one for the control block):
std::shared_ptr<A> i(new A());
However, this makes only one allocation (with the A object then initialized using placement new):
auto a = std::make_shared<A>();
Consider the following example:
#include <iostream>
#include <memory>
#include <cstdlib>

void * operator new(size_t size)
{
    std::cout << "Requested allocation: " << size << std::endl;
    void * p = malloc(size);
    return p;
}

class A {};

class B
{
    int a[8];
};

int main()
{
    std::cout << "Sizeof int: " << sizeof(int) << ", A(empty): " << sizeof(A) << ", B(8 ints): " << sizeof(B) << std::endl;
    {
        std::cout << "Just new:" << std::endl;
        std::cout << "- int:" << std::endl;
        std::shared_ptr<int> i(new int());
        std::cout << "- A(empty):" << std::endl;
        std::shared_ptr<A> a(new A());
        std::cout << "- B(8 ints):" << std::endl;
        std::shared_ptr<B> b(new B());
    }
    {
        std::cout << "Make shared:" << std::endl;
        std::cout << "- int:" << std::endl;
        auto i = std::make_shared<int>();
        std::cout << "- A(empty):" << std::endl;
        auto a = std::make_shared<A>();
        std::cout << "- B(8 ints):" << std::endl;
        auto b = std::make_shared<B>();
    }
}
The output I received (of course it is hardware- and compiler-specific):
Sizeof int: 4, A(empty): 1, B(8 ints): 32
Just new:
- int:
Requested allocation: 4
Requested allocation: 24
The first allocation is for the int (4 bytes), the next one is for the control block (24 bytes).
- A(empty):
Requested allocation: 1
Requested allocation: 24
- B(8 ints):
Requested allocation: 32
Requested allocation: 24
It looks like the control block is (most probably) 24 bytes.
Here is why you should use make_shared:
Make shared:
- int:
Requested allocation: 24
Only one allocation: int + control block = 24 bytes, less than before.
- A(empty):
Requested allocation: 24
- B(8 ints):
Requested allocation: 48
Here one could expect 56 (32 + 24), but it looks like the implementation is optimized. If you use make_shared, a pointer to the actual object is not needed in the control block, and its size drops to 16 bytes.
Another way to estimate the size of the control block is:
std::cout << sizeof(std::enable_shared_from_this<int>);
In my case:
16
So I would say that the size of the control block in my case is 16-24 bytes, depending on how it was created.
The following code contains two nearly identical lambdas, but when compiled with g++-5 it produces different answers. The lambda which uses the auto keyword in the argument declaration compiles fine, but returns zero instead of the correct count of 1. Why? I should add that the code produces the correct output with g++-6.
g++-5 -std=c++14 file.cc
./a.out
Output:
f result=0 (incorrect result from lambda f)
...
g result=1 (correct result from lambda g)
...
#include <iostream>
#include <set>
#include <vector>
#include <algorithm>

using namespace std;

enum obsMode { Hbw, Lbw, Raw, Search, Fold };

int main(int, char **)
{
    static set<obsMode> legal_obs_modes = { Hbw, Lbw, Raw, Search, Fold };
    vector<obsMode> obs_mode = { Hbw, Lbw, Hbw, Lbw };

    // I named the lambdas to illustrate the issue
    auto f = [&] (auto i) -> void
    {
        cout << "f result=" << legal_obs_modes.count(i) << endl;
    };
    auto g = [&] (obsMode i) -> void
    {
        cout << "g result=" << legal_obs_modes.count(i) << endl;
    };

    // f does not work
    for_each(obs_mode.begin(), obs_mode.end(), f);
    // g does work
    for_each(obs_mode.begin(), obs_mode.end(), g);
    return 0;
}
A somewhat deep dive into the issue shows what the problem is.
It seems to exist all the way from 5.1 up to 6.1 (according to Compiler Explorer), and I notice this targeted-for-fix-in-6.2 bug report which may be related. The errant code was:
#include <iostream>
#include <functional>

int main() {
    static int a;
    std::function<void(int)> f = [](auto) { std::cout << a << '\n'; };
    a = 1;
    f(0);
}
and it printed 0 rather than the correct 1. Basically, the use of statics and lambdas caused some trouble, in that a static variable was made available to the lambda, at the time of lambda creation, as a copy. For that particular bug report, it meant that the static variable always seemed to have the value it had when the lambda was created, regardless of what you'd done with it in the meantime.
I originally thought that this couldn't be related since the static in this question was initialised on declaration, and never changed after lambda creation. However, if you place the following line before creating the lambdas and as the first line in each lambda, and compile (again, on Compiler Explorer) with x86-64 6.1 with options --std=c++14:
cout << &legal_obs_modes << ' ' << legal_obs_modes.size() << '\n';
then you'll see something very interesting (I've reformatted a little for readability):
0x605220 5
0x605260 0 f result=0
0x605260 0 f result=0
0x605260 0 f result=0
0x605260 0 f result=0
0x605220 5 g result=1
0x605220 5 g result=1
0x605220 5 g result=1
0x605220 5 g result=1
The failing f ones have a size of zero rather than five, and a totally different address. The zero size is indication enough that count will return zero, simply because there are no elements in an empty set. I suspect the different address is a manifestation of the same problem covered in the linked bug report.
You can actually see this in the Compiler Explorer assembler output, where the two different lambdas load up a different set address:
mov edi, 0x605260 ; for f
mov edi, 0x605220 ; for g
Making the set automatic instead of static causes the problem to go away entirely. The address is the same within both lambdas and outside of them, 0x7ffd808eb050 (on-stack rather than in static area, hence the vastly changed value). This tends to gel with the fact that statics aren't actually captured in lambdas, because they're always supposed to be at the same address, so can just be used as-is.
So, the problem appears to be that the f lambda, with its auto-deduced parameter, is making a copy of the static data instead of using it in-situ. And I don't mean a good copy, I mean one akin to being from photo-copier that ran out of toner sometime in 2017 :-)
So, in answer to your specific question about whether this was a bug or not, I think the consensus would be a rather emphatic yes.
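For completeness, a sketch of the workaround implied above, if you are stuck on an affected compiler: make the set automatic so the lambda genuinely captures it by reference:
#include <iostream>
#include <set>
#include <vector>
#include <algorithm>
using namespace std;

enum obsMode { Hbw, Lbw, Raw, Search, Fold };

int main()
{
    // Automatic instead of static: the lambda now captures the set by
    // reference, sidestepping the buggy static-handling path.
    set<obsMode> legal_obs_modes = { Hbw, Lbw, Raw, Search, Fold };
    vector<obsMode> obs_mode = { Hbw, Lbw, Hbw, Lbw };

    auto f = [&] (auto i) -> void
    {
        cout << "f result=" << legal_obs_modes.count(i) << endl;
    };
    for_each(obs_mode.begin(), obs_mode.end(), f);
}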
After testing this code:
#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <utility>

void x(std::vector<std::string>&& v) { }
void y(const std::vector<std::string>& v) { }

int main() {
    std::vector<std::string> v = {};

    auto tp = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        x(std::move(v));
    auto t2 = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - tp);
    std::cout << "1- It took: " << time.count() << " seconds\n";

    tp = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        y(v);
    t2 = std::chrono::high_resolution_clock::now();
    time = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - tp);
    std::cout << "2- It took: " << time.count() << " seconds\n";

    std::cin.get();
}
I find that using a const reference is actually ~15 s faster than using move semantics. Why is that? I thought move semantics were faster; otherwise, why would they have been added? What did I get wrong about move semantics? Thanks.
Your code makes no sense. Here is a simpler version of your code, substituted with int and cleaned up. Here is the assembly version of the code, compiled with -std=c++11 -O2:
https://goo.gl/6MWLNp
There is NO difference between the assembly for the rvalue and lvalue functions. Whatever the cause is doesn't matter, because the test itself doesn't make use of move semantics.
The reason is probably that the compiler optimizes both functions to the same assembly. You're not doing anything with either, so there's no point in doing anything different in the assembly than a simple ret.
Here is a better example, this time, swapping the first two items in the vector:
https://goo.gl/Sp6sk4
Ironically, you can see that the second function actually just calls the rvalue reference version automatically as a part of its execution.
Assuming that a function A which calls B is slower than just executing B directly, x() should outperform y().
std::move() itself has an additional cost. All other things being equal, calling std::move() is more costly than not calling std::move(). This is why the "move semantics" version is slower in the code you gave us. In reality, the code is slower because you're not actually doing anything; both functions simply return as soon as they execute. You can also see that one version appears to call std::move() while the other doesn't.
Edit: The above doesn't appear to be true. std::move() is not usually a true function call; it is mainly a static_cast<T&&> that depends on some template stuff.
In the example I gave you, I'm actually making use of the move semantics. Most of the assembly isn't important, but you can see that y() calls x() as a part of its execution. y() should therefore be slower than x().
tl;dr: You're not actually using move semantics because your functions don't need to do anything at all. Make the functions use copying/moving, and you'll see that even the assembly uses part of the "move semantics" code as a part of its copying code.
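A sketch of a benchmark where the two paths genuinely differ, by making each function actually consume its argument (the vector shape and loop count are arbitrary choices; timings will vary):
#include <chrono>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Each call now consumes the argument, so copy vs. move matters.
void x(std::vector<std::string>&& v) { std::vector<std::string> sink(std::move(v)); }
void y(const std::vector<std::string>& v) { std::vector<std::string> sink(v); }

int main() {
    const int N = 100000;
    const std::vector<std::string> proto(100, std::string(100, 'x'));

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) {
        auto v = proto;     // one copy to rebuild the source...
        x(std::move(v));    // ...then a cheap move into the sink
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) {
        auto v = proto;     // one copy to rebuild the source...
        y(v);               // ...then a SECOND copy inside y()
    }
    auto t2 = std::chrono::high_resolution_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "move path: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n";
    std::cout << "copy path: " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
}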
Out of curiosity, I tested the size of a lambda expression. My first thought was that it would be 4 bytes big, like a function pointer. Strangely, the output of my first test was 1:
auto my_lambda = [&]() -> void {};
std::cout << sizeof(my_lambda) << std::endl;
Then I tested with some calculations inside the lambda, the output still being 1:
auto my_lambda2 = [&]() -> void {int i=5, j=23; std::cout << i*j << std::endl;};
std::cout << sizeof(my_lambda2) << std::endl;
My next idea was kind of random, but the output finally changed, displaying the expected 4:
auto my_lambda3 = [&]() -> void {std::cout << sizeof(my_lambda2) << std::endl;};
std::cout << sizeof(my_lambda3) << std::endl;
At least in Visual Studio 2010; Ideone still displays 1 as the output.
I know of the standard rule that a lambda expression cannot appear in an unevaluated context, but AFAIK that only applies to direct lambda use, like
sizeof([&]() -> void {std::cout << "Forbidden." << std::endl;})
on which VS2010 prompts me with a compiler error.
Anyone got an idea what's going on?
Thanks to #Hans Passant's comment under the question, the solution was found. My original assumption was wrong: I thought every object would be captured by the lambda, but that isn't the case; only those in the enclosing scope are, and only if they are actually used.
And for every one of those captured objects, 4 bytes are used (the size of the reference).
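A small demonstration of that (the exact numbers are implementation-specific; MSVC 2010 on x86 used 4-byte references, while a 64-bit ABI will typically show 8 per capture):
#include <iostream>

int main()
{
    int a = 1, b = 2;

    auto none = [] {};                       // captures nothing
    auto one  = [&a] { return a; };          // one reference captured
    auto two  = [&a, &b] { return a + b; };  // two references captured

    // Typical output on a 64-bit ABI: 1 8 16
    std::cout << sizeof(none) << ' ' << sizeof(one) << ' ' << sizeof(two) << '\n';
}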
Visual Studio probably doesn't implement lambda objects as functions. You're probably getting an object back. Who knows what it looks like. If you're truly interested you could always look at your variables with a debugger and see what they look like...if it'll let you.