Race in std::future::unwrap() and std::future::get() - c++

A follow-up with reference to the extensions proposed in N3721, "Improvements to std::future<T> and Related APIs".
#include <iostream>
#include <future>
#include <exception>
#include <thread>

using std::cout;
using std::endl;

int main() {
    auto prom_one = std::promise<std::future<int>>{};
    auto fut_one = prom_one.get_future();

    std::thread{[prom_one = std::move(prom_one)]() mutable {
        auto prom_two = std::promise<int>{};
        auto fut_two = prom_two.get_future();
        prom_two.set_value(1);
        prom_one.set_value(std::move(fut_two));
    }}.detach();

    auto inner_fut_unwrap = fut_one.unwrap();
    auto inner_fut_get = fut_one.get();

    auto th_one = std::thread{[&]() {
        cout << inner_fut_unwrap.get() << endl;
    }};
    auto th_two = std::thread{[&]() {
        cout << inner_fut_get.get() << endl;
    }};

    th_one.join();
    th_two.join();
    return 0;
}
In the code above, which will win the race to print 1? th_one or th_two?
To clarify what race I was talking about, there are two (potential) racy situations here, the latter being the one that is really confusing me.
The first is in the setting and unwrapping of the inner future; the unwrapped future should act as a suitable proxy for the inner future even when the actual set_value has not been called on the inner future. So unwrap() must return a proxy that exposes a thread-safe interface regardless of what happens on the other side.
The other situation is what happens to a get() from a future when a proxy for it already exists elsewhere, in this example inner_fut_unwrap is the proxy for inner_fut_get. In such a situation, which should win the race? The unwrapped future or the future fetched via a call to get() on the outer future?

This code makes me worried that there is some kind of misunderstanding about what futures and promises are, and what .get() does. It's also a bit odd to pull in using-declarations for std::cout and std::endl while qualifying everything else with std::.
Let's break it down. Here's the important part:
#include <iostream>
#include <future>

int main() {
    auto prom_one = std::promise<std::future<int>>{};
    auto fut_one = prom_one.get_future();
    auto inner_fut_unwrap = fut_one.unwrap();
    auto inner_fut_get = fut_one.get();
    // Boom! throws std::future_error
So neither thread "wins" the race, since neither thread actually gets a chance to run. Note from the document you linked, for .unwrap(), on p13:
Removes the outer-most future and returns a proxy to the inner future.
So the outer-most future, fut_one, is not valid. When you call .get(), it throws std::future_error [1]. There is no race.
[1]: Not guaranteed. Technically undefined behavior.
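To see the underlying rule in plain standard C++ (without the proposed unwrap()), here is a minimal sketch: the first get() consumes the shared state, valid() becomes false, and a second get() is undefined behavior that common implementations such as libstdc++ and MSVC happen to report by throwing std::future_error.

#include <future>
#include <iostream>

int main() {
    std::promise<int> prom;
    auto fut = prom.get_future();
    prom.set_value(1);

    std::cout << fut.get() << "\n";     // consumes the shared state
    std::cout << std::boolalpha
              << "valid() after get(): " << fut.valid() << "\n";  // false

    try {
        fut.get();  // undefined behavior per the standard; common
                    // implementations throw future_error (no_state)
    } catch (const std::future_error& e) {
        std::cout << "future_error: " << e.code().message() << "\n";
    }
    return 0;
}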

Related

Pass by reference to a function accepting a universal reference

I am trying to use an API which sets the value of a variable based on an HTTP call. The function through which I register the variable to be set on the HTTP call takes a parameter of type T&&. I would like to access this variable from a different thread. I tried to simplify the problem and represent it in the following code as two threads accessing a variable the same way.
#include <iostream>
#include <thread>
#include <chrono>
#include <string>

class Values
{
public:
    int i;
    std::string s;
};

template<typename T>
void WriteCycle(T&& i)
{
    using namespace std::chrono_literals;
    while (true)
    {
        i++;
        std::cout << i << std::endl;
        std::this_thread::sleep_for(500ms);
    }
}

template<typename T>
void ReadCycle(T&& i)
{
    using namespace std::chrono_literals;
    while (true)
    {
        std::cout << i << std::endl;
        std::this_thread::sleep_for(500ms);
    }
}

int main() {
    auto v = new Values();
    std::thread t1(WriteCycle<int>, v->i);
    std::thread t2(ReadCycle<int>, v->i);
    t1.join();
    t2.join();
}
Currently the read thread does not see any change in the variable. I read up on perfect forwarding, move semantics and forwarding references, but I did not get it (I mostly program in .NET; my C++ knowledge is pre-C++11). I tried all combinations of std::move, std::ref and std::forward but I cannot get the read thread to see the change from the write thread. Is there a way to solve this without changing the T&& input type of the functions (since that is part of the API I am trying to use)? How can I solve this in a thread-safe way?
This notation:
template<typename T>
void WriteCycle(T&& i)
doesn't really mean an rvalue reference; it is a universal (forwarding) reference, which can be an lvalue reference or an rvalue reference depending on what you pass.
In your case it turns into a plain lvalue reference, so it has nothing to do with move semantics. The problem is that the std::thread constructor copies its arguments rather than passing references, so you need std::ref to get around that:
auto myRef = std::ref(v->i);
std::thread t1(WriteCycle<int&>, myRef);
std::thread t2(ReadCycle<int&>, myRef);
However, this still isn't enough on its own: you also need to synchronize access between the threads, using a mutex or an atomic.
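As a minimal sketch of that combination, assuming the shared value can be made a std::atomic<int> (WriteCycle and ReadCycle keep their T&& signatures from the question; the loops are bounded here only so the example terminates):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// The counter is std::atomic<int>, so both threads see each other's updates
// without a data race; std::ref makes std::thread pass a reference instead
// of a copy.
template <typename T>
void WriteCycle(T&& i) {
    using namespace std::chrono_literals;
    for (int n = 0; n < 5; ++n) {
        ++i;
        std::this_thread::sleep_for(500ms);
    }
}

template <typename T>
void ReadCycle(T&& i) {
    using namespace std::chrono_literals;
    for (int n = 0; n < 5; ++n) {
        std::cout << i << std::endl;
        std::this_thread::sleep_for(500ms);
    }
}

int main() {
    std::atomic<int> counter{0};
    std::thread t1(WriteCycle<std::atomic<int>&>, std::ref(counter));
    std::thread t2(ReadCycle<std::atomic<int>&>, std::ref(counter));
    t1.join();
    t2.join();
}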

C++ destruction point guarantee multi-threading

I have the following simplified code, which I thought was fine while writing it, but I have seen some random access violations.
Initially I thought that as long as the arguments passed to async were on the stack, and not temporaries, the code would be safe. I also assumed that filename and extraResult would be destroyed (or at least considered gone) at the closing brace where they leave scope.
I did some more research and read about the "as if" rule that compilers apparently use for optimisation. I've also often seen stack variables optimised away in the debugger right after their last use.
My question here is basically: is it guaranteed that those stack variables will be around for the entire duration of the async function running? The .get() call on the future obviously synchronises the call before the two stack variables leave scope.
My current thinking is that it's not thread safe, since the compiler can't see the variables being used after the call to the function and may therefore think it is safe to remove them. I can easily change the code to eliminate the problem (if there is one), but I really want to understand this.
The randomness of the access violations, occurring more on some computers than others, suggests there is a real problem and that scheduling order dictates whether it shows up.
Any help is much appreciated.
#include <future>
#include <fstream>
#include <string>
#include <iostream>

int write_some_file(const char * const filename, int * extra_result)
{
    std::ofstream fs;
    try {
        fs.open(filename);
    } catch (std::ios_base::failure e) {
        return 1;
    }
    fs << "Hello";
    *extra_result = 1;
    return 0;
}

int main(void)
{
    std::string filename {"myffile.txt"};
    int extraResult = 0;
    auto result = std::async(std::launch::async, write_some_file, filename.c_str(), &extraResult);
    // Do some other work
    // ...
    int returnCode = result.get();
    std::cout << returnCode << std::endl;
    std::cout << extraResult << std::endl;
    return 0;
}

boost interprocess managed_mapped_file find failing

I am trying to share a structure across processes using interprocess in Boost.
I've defined the mapped file to use a null mutex family because I was having problems with it locking, and I don't mind doing the synchronisation myself.
What I am having problems with though is finding of objects.
I have the following declaration:
typedef boost::interprocess::basic_managed_mapped_file<
    char,
    boost::interprocess::rbtree_best_fit<boost::interprocess::null_mutex_family,
                                         boost::interprocess::offset_ptr<void>>,
    boost::interprocess::flat_map_index>
    my_mapped_file;
In process A, I do:
m_managedMappedFile.reset(new my_mapped_file(bip::open_or_create, filename, filesize));
auto hdr = m_managedMappedFile->find_or_construct<Foo>(bip::unique_instance)();
auto x = m_managedMappedFile->find<Foo>(bip::unique_instance);
Which works as I would expect, i.e. it finds the object. Now, in process B:
m_managedMappedFile.reset(new my_mapped_file(bip::open_only, filename));
auto ret = m_managedMappedFile->find<Foo>(bip::unique_instance);
For some reason the find method returns null in process B. I realise I must be doing something daft, but can't figure it out.
Can anyone help?
You should not have to bypass the locking mechanism of the default bip::managed_mapped_file indexes.
See if you can run the following with success:
#include <cassert>
#include <iostream>
#include <boost/interprocess/managed_mapped_file.hpp>

namespace bip = boost::interprocess;

struct X {
    int i;
};

int main()
{
    {
        bip::managed_mapped_file f(bip::open_or_create, "/tmp/mmf.bin", 1ul << 20);
        if (!f.find<X>(bip::unique_instance).first) {
            auto xp = f.find_or_construct<X>(bip::unique_instance)();
            assert(xp);
            xp->i = 42;
        }
    }
    {
        bip::managed_mapped_file f(bip::open_only, "/tmp/mmf.bin");
        auto xp = f.find<X>(bip::unique_instance).first;
        if (xp)
            std::cout << "value: " << xp->i++ << "\n";
    }
}
This should print 42 on the first run (or after the file has been recreated), and increasing numbers on each subsequent run.
I'm going to have a look at the implementation behind the unique_instance_t* overloads of the segment managers, but I suspect they might not work because the mutex policy was nulled. This is just a hunch though, at the moment.
I'd focus on finding out why you can't get Interprocess managed_mapped_file working in the default configuration, on your platform & installation.
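If the unique_instance overloads do turn out to misbehave with the nulled mutex family, one thing worth trying is a named object instead. A minimal sketch reusing the question's typedef (the file path and the name "TheFoo" are placeholders, and the two blocks stand in for the two processes):

#include <cassert>
#include <iostream>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/mem_algo/rbtree_best_fit.hpp>
#include <boost/interprocess/indexes/flat_map_index.hpp>
#include <boost/interprocess/sync/mutex_family.hpp>
#include <boost/interprocess/offset_ptr.hpp>

namespace bip = boost::interprocess;

// Same custom manager as in the question: char, rbtree_best_fit with the
// null mutex family, flat_map_index.
typedef bip::basic_managed_mapped_file<
    char,
    bip::rbtree_best_fit<bip::null_mutex_family, bip::offset_ptr<void>>,
    bip::flat_map_index>
    my_mapped_file;

struct Foo { int value; };

int main() {
    {
        // "Process A": create the file and a named object instead of
        // relying on the unique_instance overload.
        my_mapped_file f(bip::open_or_create, "/tmp/custom_mmf.bin", 1ul << 20);
        Foo* foo = f.find_or_construct<Foo>("TheFoo")();
        foo->value = 42;
    }
    {
        // "Process B": reopen the file and look the object up by name.
        my_mapped_file f(bip::open_only, "/tmp/custom_mmf.bin");
        Foo* foo = f.find<Foo>("TheFoo").first;
        assert(foo);
        std::cout << foo->value << "\n";
    }
}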

Using boost::future with continuations and boost::when_all

I would like to use boost::future with continuations and boost::when_all / boost::when_any.
Boost trunk - not 1.55 - includes implementations for the latter (modeled after the proposal here, upcoming for C++14/17 and Boost 1.56).
This is what I have (and it does not compile):
#include <iostream>

#define BOOST_THREAD_PROVIDES_FUTURE
#define BOOST_THREAD_PROVIDES_FUTURE_CONTINUATION
#define BOOST_THREAD_PROVIDES_FUTURE_WHEN_ALL_WHEN_ANY
#include <boost/thread/future.hpp>

using namespace boost;

int main() {
    future<int> f1 = async([]() { return 1; });
    future<int> f2 = async([]() { return 2; });
    auto f3 = when_all(f1, f2);
    f3.then([](decltype(f3)) {
        std::cout << "done" << std::endl;
    });
    f3.get();
}
Clang 3.4 bails out with an error; here is an excerpt:
/usr/include/c++/v1/memory:1685:31: error: call to deleted constructor of 'boost::future<int>'
::new((void*)__p) _Up(_VSTD::forward<_Args>(__args)...);
Am I doing it wrong or is this a bug?
The problem is that when_all may only be called with rvalue future or shared_future. From N3857:
template <typename... T>
see below when_all(T&&... futures);
Requires: T is of type future<R> or shared_future<R>.
Thanks to the reference-collapsing rules, passing an lvalue results in T being deduced as future<R>&, in violation of the stated requirement. The Boost implementation doesn't check this precondition, so you get an error deep in the template code where what should be a move of an rvalue future turns into an attempted copy of an lvalue future.
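A tiny, self-contained illustration of that deduction (the probe function here is hypothetical, purely to show what T becomes):

#include <future>
#include <type_traits>
#include <utility>

// With a forwarding parameter T&&, an lvalue argument deduces T as an
// lvalue reference, which is exactly what violates the "future<R> or
// shared_future<R> only" requirement quoted above.
template <typename T>
void probe(T&&) {
    static_assert(!std::is_lvalue_reference<T>::value,
                  "pass futures as rvalues, e.g. via std::move");
}

int main() {
    std::future<int> f;
    probe(std::move(f));  // OK: T = std::future<int>
    // probe(f);          // would fail the static_assert: T = std::future<int>&
}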
You need to either move the futures into the when_all parameters:
auto f3 = when_all(std::move(f1), std::move(f2));
or avoid naming them in the first place:
auto f = when_all(async([]{ return 1; }),
                  async([]{ return 2; }));
Also, you must get the future returned from then instead of the intermediate future:
auto done = f.then([](decltype(f)) {
    std::cout << "done" << std::endl;
});
done.get();
since the future upon which you call then is moved into the parameter of the continuation. From the description of then in N3857:
Postcondition:
The future object is moved to the parameter of the continuation function
valid() == false on original future object immediately after it returns
Per 30.6.6 [futures.unique_future]/3:
The effect of calling any member function other than the destructor, the move-assignment operator, or valid on a future object for which valid() == false is undefined.
You could avoid most of these issues in C++14 by not naming the futures at all:
when_all(
    async([]{ return 1; }),
    async([]{ return 2; })
).then([](auto&) {
    std::cout << "done" << std::endl;
}).get();

About unique_ptr performances

I often read that unique_ptr is preferred in most situations over shared_ptr, because unique_ptr is non-copyable and has move semantics, while shared_ptr adds overhead due to copying and ref-counting.
But when I test unique_ptr in some situations, it appears to be noticeably slower (in access) than its counterparts.
For example, under gcc 4.5:
Edit: the print method doesn't actually print anything.
#include <iostream>
#include <string>
#include <memory>
#include <chrono>
#include <vector>

using namespace std;

class Print {
public:
    void print() {}
};

void test()
{
    typedef vector<shared_ptr<Print>> sh_vec;
    typedef vector<unique_ptr<Print>> u_vec;

    sh_vec shvec;
    u_vec uvec;

    // can't use initializer_list with unique_ptr
    for (int var = 0; var < 100; ++var) {
        shared_ptr<Print> p(new Print());
        shvec.push_back(p);

        unique_ptr<Print> p1(new Print());
        uvec.push_back(move(p1));
    }

    //-------------test shared_ptr-------------------------
    auto time_sh_1 = std::chrono::system_clock::now();
    for (auto var = 0; var < 1000; ++var)
    {
        for (auto it = shvec.begin(), end = shvec.end(); it != end; ++it)
        {
            (*it)->print();
        }
    }
    auto time_sh_2 = std::chrono::system_clock::now();
    cout << "test shared_ptr : " << (time_sh_2 - time_sh_1).count() << " microseconds." << endl;

    //-------------test unique_ptr-------------------------
    auto time_u_1 = std::chrono::system_clock::now();
    for (auto var = 0; var < 1000; ++var)
    {
        for (auto it = uvec.begin(), end = uvec.end(); it != end; ++it)
        {
            (*it)->print();
        }
    }
    auto time_u_2 = std::chrono::system_clock::now();
    cout << "test unique_ptr : " << (time_u_2 - time_u_1).count() << " microseconds." << endl;
}
On average I get (g++ -O0):
shared_ptr : 1480 microseconds
unique_ptr : 3350 microseconds
Where does the difference come from? Is it explainable?
UPDATED on Jan 01, 2014
I know this question is pretty old, but the results are still valid on G++ 4.7.0 and libstdc++ 4.7. So, I tried to find out the reason.
What you're benchmarking here is the dereferencing performance using -O0 and, looking at the implementation of unique_ptr and shared_ptr, your results are actually correct.
unique_ptr stores the pointer and the deleter in a ::std::tuple, while shared_ptr stores a naked pointer handle directly. So, when you dereference the pointer (using *, ->, or get()) you have an extra call to ::std::get<0>() in unique_ptr, whereas shared_ptr returns the pointer directly. On gcc-4.7, even when optimized and inlined, ::std::get<0>() is a bit slower than the direct pointer access. When optimized and inlined, gcc-4.8.1 fully omits the overhead of ::std::get<0>(): on my machine, when compiled with -O3, the compiler generates exactly the same assembly code, which means they are literally the same.
All in all, using the current implementation, shared_ptr is slower on creation, moving, copying and reference counting, but equally fast *on dereferencing*.
NOTE: print() is empty in the question and the compiler omits the loops when optimized. So, I slightly changed the code to correctly observe the optimization results:
#include <iostream>
#include <string>
#include <memory>
#include <chrono>
#include <vector>

using namespace std;

class Print {
public:
    void print() { i++; }
    int i{ 0 };
};

void test() {
    typedef vector<shared_ptr<Print>> sh_vec;
    typedef vector<unique_ptr<Print>> u_vec;

    sh_vec shvec;
    u_vec uvec;

    // can't use initializer_list with unique_ptr
    for (int var = 0; var < 100; ++var) {
        shvec.push_back(make_shared<Print>());
        uvec.emplace_back(new Print());
    }

    //-------------test shared_ptr-------------------------
    auto time_sh_1 = std::chrono::system_clock::now();
    for (auto var = 0; var < 1000; ++var) {
        for (auto it = shvec.begin(), end = shvec.end(); it != end; ++it) {
            (*it)->print();
        }
    }
    auto time_sh_2 = std::chrono::system_clock::now();
    cout << "test shared_ptr : " << (time_sh_2 - time_sh_1).count()
         << " microseconds." << endl;

    //-------------test unique_ptr-------------------------
    auto time_u_1 = std::chrono::system_clock::now();
    for (auto var = 0; var < 1000; ++var) {
        for (auto it = uvec.begin(), end = uvec.end(); it != end; ++it) {
            (*it)->print();
        }
    }
    auto time_u_2 = std::chrono::system_clock::now();
    cout << "test unique_ptr : " << (time_u_2 - time_u_1).count()
         << " microseconds." << endl;
}

int main() { test(); }
NOTE: that is not a fundamental problem and can easily be fixed by discarding the use of ::std::tuple in the current libstdc++ implementation.
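To visualize the layout difference described above, here is a grossly simplified toy sketch (not the real libstdc++ code; ownership, destruction and the control block are omitted):

#include <iostream>
#include <memory>
#include <tuple>

// Toy layout sketch only: the libstdc++ unique_ptr of that era kept its raw
// pointer inside a std::tuple together with the deleter, so dereferencing
// went through std::get<0>, while shared_ptr keeps the raw pointer as a
// plain member and returns it directly.
template <class T, class D = std::default_delete<T>>
struct toy_unique_ptr {
    std::tuple<T*, D> data_;   // pointer bundled with the deleter
    T& operator*() const { return *std::get<0>(data_); }
};

template <class T>
struct toy_shared_ptr {
    T* ptr_;                   // raw pointer, accessed directly
    // (reference-counting control block omitted)
    T& operator*() const { return *ptr_; }
};

int main() {
    int a = 1, b = 2;
    toy_unique_ptr<int> u{std::make_tuple(&a, std::default_delete<int>{})};
    toy_shared_ptr<int> s{&b};
    std::cout << *u << " " << *s << "\n";  // prints "1 2"
}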
All you did in the timed blocks is access them. That won't involve any additional overhead at all. The increased time probably comes from the console output scrolling. You can never, ever do I/O in a timed benchmark.
And if you want to test the overhead of ref counting, then actually do some ref counting. How is the increased time for construction, destruction, assignment and other mutating operations of shared_ptr going to factor into your timings at all if you never mutate the shared_ptr?
Edit: If there's no I/O then where are the compiler optimizations? They should have nuked the whole thing. Even ideone junked the lot.
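To actually exercise the mutating operations this answer is pointing at, a rough sketch of a benchmark that copies a shared_ptr (touching the reference count) versus moving a unique_ptr might look like this (illustrative only; build with optimizations, and note that a serious benchmark also has to defeat dead-code elimination, as the benchmarking advice further down mentions):

#include <chrono>
#include <iostream>
#include <memory>
#include <utility>

// Times N shared_ptr copies (each an atomic increment/decrement of the
// reference count) against N unique_ptr moves (plain pointer shuffling).
// Iteration count and names are arbitrary, and a real benchmark would need
// to prevent the compiler from optimizing the loops away entirely.
int main() {
    const int iterations = 1000000;

    auto sp = std::make_shared<int>(42);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::shared_ptr<int> copy = sp;   // ref-count up, then down at scope exit
        (void)copy;
    }
    auto t2 = std::chrono::steady_clock::now();

    std::unique_ptr<int> up(new int(42));
    auto t3 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::unique_ptr<int> moved = std::move(up);  // just swaps raw pointers
        up = std::move(moved);                       // hand ownership back
    }
    auto t4 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::cout << "shared_ptr copies: "
              << std::chrono::duration_cast<us>(t2 - t1).count() << " us\n"
              << "unique_ptr moves:  "
              << std::chrono::duration_cast<us>(t4 - t3).count() << " us\n";
}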
You're not testing anything useful here.
What you are talking about: copy
What you are testing: iteration
If you want to test copying, you actually need to perform a copy. Both smart pointers should have similar performance when it comes to reading, because a good shared_ptr implementation keeps a local copy of the raw pointer and returns it directly.
EDIT:
Regarding the new elements:
It's not even worth talking about speed when using debug code, in general. If you care about performance, you will use release code (-O2 in general) and thus that's what should be measured, as there can be significant differences between debug and release code. Most notably, inlining of template code can seriously decrease the execution time.
Regarding the benchmark:
I would add another round of measurements: naked pointers. Normally, unique_ptr and naked pointers should have the same performance; it would be worth checking, though it need not necessarily hold in debug mode.
You might want to "interleave" the execution of the two batches or, if you cannot, take the average of each over several runs. As it is, if the computer slows down towards the end of the benchmark, only the unique_ptr batch will be affected, which will skew the measurement.
You might be interested in learning more from Neil: The Joy of Benchmarks, it's not a definitive guide, but it's quite interesting. Especially the part about forcing side-effects to avoid dead-code removal ;)
Also, be careful about how you measure. The resolution of your clock might be coarser than it appears. If the clock is only refreshed every 15 us, for example, then any measurement around 15 us is suspect. This can be an issue when measuring release code (you might need to add a few more iterations to the loop).
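As a rough way to eyeball that last point, here is a small sketch that estimates the effective tick of the clock used in the benchmark (illustrative only):

#include <chrono>
#include <iostream>

// Takes pairs of back-to-back readings and reports the smallest nonzero
// difference observed, which gives a rough idea of how fine-grained the
// clock really is on this system.
int main() {
    using clock = std::chrono::system_clock;  // the clock used in the question
    auto smallest = clock::duration::max();
    for (int i = 0; i < 100000; ++i) {
        auto a = clock::now();
        auto b = clock::now();
        if (b > a && (b - a) < smallest)
            smallest = b - a;
    }
    std::cout << "smallest observed tick: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(smallest).count()
              << " ns\n";
}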