Consider a large set of heap-allocated objects used to transmit computation results from other threads. Destroying these can be an expensive task.
I am in a scenario where these objects are not needed any more, and I just want them to be garbage-collected without blocking the main thread. I considered using std::async in the following way to destroy them asynchronously:
#include <memory>
#include <future>
#include <vector>
struct Foo
{
    // Complex structure containing heap allocated data.
};

void f() {
    std::vector<std::unique_ptr<Foo>> graveyard;
    // Process some foo and add them to graveyard.

    // Move the vector into a task with an empty body; its contents are
    // then destroyed outside f. Note the returned std::future is
    // discarded: if the task runs asynchronously, that temporary
    // future's destructor blocks until the task has completed.
    std::async([fooMoved = std::move(graveyard)]{});
}
In the specific program I am working on (using Visual Studio), profiling seems to confirm that this is faster, at least with respect to the main thread, than just letting graveyard destroy its contents at the end of f's scope.
Is this kind of improvement a glitch specific to the conditions of my test, or is it a reliable way to deallocate large sets of heap data? Do you see any potential drawbacks? Is there a better way to perform a similar task?
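For comparison, here is a minimal sketch of the same idea with an explicitly detached std::thread instead of std::async; this is an illustration, not the original code. In practice, implementations destroy the captured state on the new thread, so the deallocation work happens off the main thread:

#include <memory>
#include <thread>
#include <utility>
#include <vector>

struct Foo
{
    // Complex structure containing heap allocated data.
};

void f() {
    std::vector<std::unique_ptr<Foo>> graveyard;
    // Process some foo and add them to graveyard.

    // The lambda's captured copy of graveyard travels with the new
    // thread and is destroyed there after the (empty) body runs.
    // Detaching means f() never blocks on it, but also that the thread
    // may still be running at program exit.
    std::thread([fooMoved = std::move(graveyard)]{}).detach();
}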
Related
This is more of a general question than a specific coding problem. Does C++ prevent multiple threads from trying to allocate the same memory addresses (and if so, how)?
For example:
#include <vector>
#include <thread>
int main() {
    std::vector<int> x, y;
    std::thread do_work([&x] () {
        /* push_back a lot of ints to x */
    });
    /* push_back a lot of ints to y */
    do_work.join();
    /* do more stuff */
}
When the two vectors allocate memory because they reached their capacity, is it possible that both vectors try to allocate the same piece of memory since the heap is shared among threads? Is code like this unsafe because it potentially creates races?
Memory allocation (via malloc/new/HeapAlloc and such) is thread-safe by default, as long as you've compiled your application against thread-safe runtimes (which you will have, unless you explicitly change that).
Each vector will get its own slice of memory whenever it resizes, but once a slice is freed, any (other) thread could end up getting it the next time an allocation occurs.
You could, however, mess things up if you replace your allocators. For example, if you overload operator new so that you're no longer getting memory from a thread-safe source: https://en.cppreference.com/w/cpp/memory/new/operator_new
You could also opt into a non-thread-safe version of malloc by replacing it via some sort of library preload on Linux (overriding 'malloc' using the LD_PRELOAD mechanism).
But assuming you're not doing any of that, the default implementations of user-space allocators (new/malloc) are thread-safe, and OS-level allocators (VirtualAlloc/mmap) are always thread-safe.
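To make the operator new point concrete, here is a minimal sketch (my own illustration, not from the answer above) of a replacement that stays thread-safe: the underlying source, std::malloc, is already synchronized, and the extra bookkeeping state is guarded by a mutex:

#include <cstdlib>
#include <mutex>
#include <new>

static std::mutex g_alloc_mutex;          // guards the bookkeeping below
static std::size_t g_bytes_allocated = 0; // shared state that needs the lock

void* operator new(std::size_t size)
{
    {
        std::lock_guard<std::mutex> lock(g_alloc_mutex);
        g_bytes_allocated += size; // the shared counter is what needs guarding
    }
    if (void* p = std::malloc(size)) // malloc itself is thread-safe
        return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept
{
    std::free(p);
}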
I am trying to introduce logging into a multithreaded application. Currently, I am just using std::cout from the different threads. In that case, however, the order of the output gets jumbled: even when one thread logs earlier, its output may appear in stdout after the log of another thread.
One solution would be to move all logging to an extra thread, but I don't want to manage one more thread. So I am thinking of using std::async to do the logging from the different threads. Is this possible? Are there any suggested ways to do this? Also, is the order of execution of std::async guaranteed?
#include <iostream>
#include <future>

void print(int i)
{
    std::cout << i << std::endl;
}

int main()
{
    auto a1 = std::async(std::launch::async, print, 1);
    auto a2 = std::async(std::launch::async, print, 2);
    auto a3 = std::async(std::launch::async, print, 3);
    a3.wait();
    a2.wait();
    a1.wait();
}
For the above code, is it guaranteed that the order of output will be
1
2
3
?
The whole point of std::async(std::launch::async, ...) is that it's asynchronous (thus using not only "async" in the name, but repeating it in the first parameter).
You're not guaranteed much of anything about the relative order of things happening in threads created with std::async, unless you force synchronization using something like an std::mutex, std::condition_variable, or some of the synchronizing primitives in <atomic>.
You say you don't want to manage one more thread, but then you create and manage not only one, but three more threads. I don't quite understand how this makes sense.
My own tendency would be to create a type to handle logging. Whether it uses a separate thread or not is its own internal affair. The threads doing the real work just do something like log(error) << "error 12345"; and it's up to the logging object to implement that efficiently. Yes, if you have a lot of other threads contending for use of the single logging object, it's likely to be best off running in a thread of its own, but the callers should neither know nor care one way or the other about that.
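For illustration, here is a minimal sketch of such a type (the names and message format are my assumptions, and it relies on C++17's guaranteed copy elision for the factory's return). Each statement buffers its message locally and flushes it atomically under a mutex, so whether the object later moves the actual writing to its own thread is invisible to callers:

#include <iostream>
#include <mutex>
#include <sstream>

enum Level { info, error };

class LogLine {
public:
    explicit LogLine(Level lvl) { oss_ << (lvl == error ? "[E] " : "[I] "); }
    ~LogLine() { // flush the whole line atomically
        static std::mutex mtx;
        std::lock_guard<std::mutex> lock(mtx);
        std::cout << oss_.str() << '\n';
    }
    template <class T>
    LogLine& operator<<(const T& v) { oss_ << v; return *this; }
private:
    std::ostringstream oss_; // per-statement buffer, nothing shared until flush
};

LogLine log(Level lvl) { return LogLine(lvl); }

// usage: log(error) << "error " << 12345;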
For the above code, is it guaranteed that the order of output will be
1
2
3
?
No. wait() waits until the value is available. It doesn't start the execution; it merely blocks until it's done.
It may well happen that all of those futures are ready even before you call any of the wait()s. In that case it's quite obvious that the order is not guaranteed (and is completely unrelated to the order of those wait() calls).
A way to offload logging from the current thread to elsewhere is to have a queue between the thread wanting to log and a thread actually writing the log to disk (or whatever). A queue is a good object to use, because it preserves order.
There are probably far better ways of doing this than what I'm about to describe (and I did this quite a while ago, so it's bound to have been improved upon). It's possible to adapt log4cpp to have a thread accept logging requests submitted via a std::queue. A std::queue is not a multithreaded thing in itself, so what I've done before is to create a log event class, manage instances of it with shared_ptrs, and put the shared_ptrs to log event objects on the std::queue (so posting to the queue is small and fast, with a minimum lock time on the necessary mutex). Then I added ZeroMQ with a PUSH/PULL pattern to allow multiple PUSHers (posters onto the std::queue) to send 1-byte-long messages to wake up a single PULL thread (which polls the ZeroMQ socket and pulls from the std::queue). So logging consisted of creating a log event object, acquiring a mutex, pushing a shared_ptr to the log event object onto a std::queue, releasing the mutex, and finally pushing a 1-byte message into a ZeroMQ socket.
Yes, it's a fairly horrific blend of std::queue and ZeroMQ, but it was quick to dispatch an arbitrarily long log event without having to serialise the log event data in the thread sending it.
A possible embellishment would be to turn off the mutex locking on the shared_ptr (it's not needed), or just use a raw pointer instead and try to remember to call delete.
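Sketched with standard components only, that architecture looks roughly like the following (a std::condition_variable plays the role of the ZeroMQ wake-up message, and shared_ptr-to-string stands in for the log event class; all names are illustrative):

#include <condition_variable>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class LogQueue {
public:
    LogQueue() : writer_([this] { run(); }) {}
    ~LogQueue() {
        { std::lock_guard<std::mutex> lock(mtx_); done_ = true; }
        cv_.notify_one();
        writer_.join();
    }
    // Called from any thread: push a pointer, notify, return quickly.
    void post(std::string msg) {
        { std::lock_guard<std::mutex> lock(mtx_);
          q_.push(std::make_shared<std::string>(std::move(msg))); }
        cv_.notify_one();
    }
private:
    void run() {
        std::unique_lock<std::mutex> lock(mtx_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                auto msg = q_.front(); q_.pop();
                lock.unlock();            // write without holding the lock
                std::cout << *msg << '\n';
                lock.lock();
            }
            if (done_) return;
        }
    }
    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<std::shared_ptr<std::string>> q_;
    bool done_ = false;
    std::thread writer_; // declared last so it starts after the other members
};

Because a single queue feeds a single writer thread, messages are emitted in the order they were posted.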
I want to have a multithreaded function that allocates some memory for an object obj and returns the allocated memory. My current single-threaded and multi-threaded versions are below.
The multi-threaded version has no race conditions, but it runs slowly when many threads are trying to get the lock: even after the allocation and pointer update, each remaining thread still needs to acquire and release the same lock, which causes a multi-threading performance drop. I wonder if there are other ways to improve performance.
#include <mutex>

struct multi_level_tree {
    multi_level_tree* ptr = nullptr; // pointer to the next-level array of 256 nodes
    std::mutex mtx;
};
multi_level_tree tree[256]; // A global object that every thread needs to access and update

/* Single threaded */
multi_level_tree* get_ptr(multi_level_tree* cur, int idx) {
    if (!cur[idx].ptr)
        cur[idx].ptr = new multi_level_tree[256](); // allocate the next level
    return cur[idx].ptr;
}

/* Multi threaded with mutex */
multi_level_tree* get_ptr(multi_level_tree* cur, int idx) {
    if (!cur[idx].ptr) {
        cur[idx].mtx.lock(); // other threads wait here, and go one by one
        /* Critical Section Start */
        if (!cur[idx].ptr)
            cur[idx].ptr = new multi_level_tree[256](); // allocation takes a while
        /* Critical Section End */
        cur[idx].mtx.unlock();
    }
    return cur[idx].ptr;
}
The code I am looking for should have the following properties.
When the first thread has allocated the memory, it should alert all other threads waiting for it.
All other threads should be unblocked at the same time.
No race conditions.
The challenges in the problem:
* The tree is sparse, with multiple levels; initializing all of it up front is impossible considering the memory we have
* It is similar to the double-checked locking problem, but I was trying to avoid std::atomic
The point of this code is to implement a multi-level array as a global variable. Except for the lowest level, each array is a list of pointers to the next-level array. Since this data structure needs to grow dynamically, I ran into this problem.
how to have only one thread go through critical section
You could use a mutex. There's an example in your question.
It is not the most optimal solution for synchronised initialisation, though. A simple improvement is to use a local static, in which case the compiler is responsible for implementing the synchronisation:
T& get_T() {
    static T instance;
    return instance;
}
but runs slow when a lot of threads are trying to get the lock
This problem is inherent with serialising the access to the same data structure. A way to improve performance is to avoid doing that in the first place.
In this particular example, it appears that you could simply initialise the resource while the process is still single threaded, and start the parallel threads only after the initialisation is complete. That way no locking is required to access the pointer.
If that is not an option, another approach is to simply call get_ptr once in each thread, and store a copy locally. That way the locking overhead remains minimal.
Even better would be to have separate data structures in each thread. This is useful when threads only produce data, and don't need to access results from other threads.
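As a sketch of the fetch-once suggestion (it assumes the multi_level_tree and mutex-based get_ptr definitions from the question): each thread pays the synchronization cost for a node once, then works through its cached pointer:

void worker(multi_level_tree* root, int idx)
{
    // Pay the locking cost for this node once per thread...
    multi_level_tree* level2 = get_ptr(root, idx);

    // ...then reuse the cached pointer; reaching this level again needs
    // no further synchronization.
    for (int i = 0; i < 256; ++i) {
        multi_level_tree* child = get_ptr(level2, i); // locks only per child node
        (void)child; // use child ...
    }
}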
Regarding the edited example: you might benefit from a lock-free tree implementation; it may be difficult to implement, however.
Since you cannot easily fix it (it is inherent to concurrent access), I have an idea that may improve or decrease performance rather substantially, though.
If this resource really is used that often and contention is detrimental, you could try an Active Object (https://en.wikipedia.org/wiki/Active_object) together with a Boost lockfree queue (https://www.boost.org/doc/libs/1_66_0/doc/html/lockfree/reference.html#header.boost.lockfree.queue_hpp). Use atomic store/load on the future objects, and the process becomes completely lockless; on the other hand, it requires a dedicated thread to maintain the structure. The performance of such a solution depends heavily on how often the resource is used.
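A minimal sketch of the Boost lockfree queue part of that idea (illustrative only; note that boost::lockfree::queue requires a trivially copyable, trivially destructible element type, so the request struct below is kept plain):

#include <boost/lockfree/queue.hpp>
#include <thread>

// Hypothetical request handed to the Active Object's single maintainer thread.
struct Request { int node_index; };

boost::lockfree::queue<Request> requests(1024); // fixed capacity, lock-free

void maintainer_loop()
{
    Request r;
    for (;;) {
        while (requests.pop(r)) {
            // perform the allocation / tree update for r.node_index here,
            // then fulfil the corresponding future, e.g. via an atomic store
        }
        std::this_thread::yield(); // simplistic idle strategy
    }
}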
From the comment from @WilliamClements, I see this is a double-checked locking problem itself. The original multi-threaded code in my question may be broken. To program it correctly, I switched to atomic pointers to prevent ordering problems with load/store instructions.
However, the example still used a lock that I wanted to get rid of. Therefore, I chose std::atomic::compare_exchange_strong (rather than the weak form, which may fail spuriously when not used in a loop) to update the pointer only when its value is nullptr. This way, only one thread successfully updates the pointer, and any thread that loses the exchange releases the memory it had requested.
This code is doing very well for me so far.
#include <atomic>
#include <cstdlib>

struct multi_level_tree {
    std::atomic<multi_level_tree*> ptr;
};
multi_level_tree tree[256];

multi_level_tree* get_ptr(multi_level_tree* cur, int idx) {
    multi_level_tree* observed = cur[idx].ptr.load();
    if (!observed) {
        /* Critical Section Start */
        // Allocate tentatively; zero-fill so the child pointers start null.
        multi_level_tree* tmp =
            static_cast<multi_level_tree*>(calloc(256, sizeof(multi_level_tree)));
        multi_level_tree* expected = nullptr;
        if (cur[idx].ptr.compare_exchange_strong(expected, tmp)) {
            /* successfully updated, keep our allocation */
            observed = tmp;
        } else {
            /* already updated by another thread, release ours */
            free(tmp);
            observed = expected; // the pointer the other thread installed
        }
        /* Critical Section End */
    }
    return observed;
}
I wonder, is it safe to implement it like this?
typedef shared_ptr<Foo> FooPtr;
FooPtr* gPtrToFooPtr; // global variable

// init (before any thread has been created)
void init()
{
    gPtrToFooPtr = new FooPtr(new Foo);
}
// thread A, B, C, ..., K
// Once thread Z executes read_and_drop(),
// no more calls to read() from any thread.
// But it is possible that even after read_and_drop() has returned,
// some thread is still in the read() function.
void read()
{
    FooPtr a = *gPtrToFooPtr;
    // do useful things (read only)
}

// thread Z (executed once)
void read_and_drop()
{
    FooPtr b = *gPtrToFooPtr;
    // do useful things with b (read only)
    b.reset();
}
We do not know which thread would do the actual release.
Does boost's shared_ptr do the release safely under circumstance like this?
According to boost's document, thread safety of shared_ptr is:
A shared_ptr instance can be "read" (accessed using only const
operations) simultaneously by multiple threads. Different shared_ptr
instances can be "written to" (accessed using mutable operations such
as operator= or reset) simultaneously by multiple threads.
As far as I am concerned, the code above does not violate any of the thread-safety criteria mentioned above, and I believe it should run fine. Can anyone tell me whether I am right or wrong?
Thanks in advance.
Edited 2012-06-20 01:00 UTC+9
The pseudo code above works fine. The shared_ptr implementation guarantees correct behaviour under circumstances where multiple threads access instances of it (each thread MUST access its own instance of shared_ptr, created via the copy constructor).
Note that in the pseudo code above, you must delete gPtrToFooPtr to have the shared_ptr implementation finally release (drop the reference count by one on) the object it owns (not a proper expression, since it is not an auto_ptr, but who cares ;) ). And in this case, you must be aware of the fact that it may cause SIGSEGV in a multithreaded application.
How do you define 'safe' here? If you define it as 'I want to make sure that the object is destroyed exactly once', then YES, the release is safe. However, the problem is that the two threads share one smart pointer in your example. This is not safe at all. The reset() performed by one thread might not be visible to the other thread.
As stated by the documentation, smart pointers offer the same guarantees as built in types (i.e., pointers). Therefore, it is problematic to perform an unguarded write while an other thread might still be reading. It is undefined when that other reading thread will see writes of the other one. Therefore, while one thread calls reset() the pointer might NOT be reset in the other thread, since the shared_ptr instance itself is shared.
If you want some sort of thread safety, you have to use two shared pointer instances. Then, of course, resetting one of them WILL NOT release the object, since the other thread still has a reference to it. Usually this behaviour is intended.
However, I think the bigger problem is that you are misusing shared_ptrs. It is quite uncommon to use pointers to shared_ptrs and to allocate the shared_ptr on the heap (using new). If you do that, you again have the very problem you wanted smart pointers to solve (you now have to manage the lifetime of the shared_ptr itself). Maybe check out some example code about smart pointers and their usage first.
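As a small illustration of the two-instances point (my own sketch, not code from the question): give each thread its own shared_ptr copy up front; resetting one copy then never races with the other, and the object stays alive until the last copy is gone:

#include <memory>
#include <thread>

void example()
{
    auto master = std::make_shared<int>(42);

    std::thread reader([copy = master] { // the reader owns its own instance
        int value = *copy; // the object is guaranteed alive here
        (void)value;
    });

    master.reset(); // affects only this instance; reader's copy keeps the object alive
    reader.join();
}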
For your own good, I will be honest.
Your code is doing many things and almost all are simply useless and absurd.
typedef shared_ptr<Foo> FooPtr;
FooPtr *gPtrToFooPtr // global variable
A raw pointer to a smart pointer cancels the advantage of automatic resource management and does not solve any problem.
void read()
{
FooPtr a = *gPtrToFooPtr;
// do useful things (read only)
}
a is not used in any meaningful way.
{
FooPtr b = ...
b.reset();
}
b.reset() is useless here, b is about to be destroyed anyway. b has no purpose in this function.
I am afraid you have no idea what you are doing, what smart pointers are for, how to use shared_ptr, or how to do MT programming; you end up with this absurd pile of useless features that does not solve the problem.
What about doing simple things simply:
Foo f;

// called before other functions
void init()
{
    // prepare f
}

// called in many threads {R1, R2, ... Rn} in parallel
void read()
{
    // use f (read-only)
}

// called after all threads {R1, R2, ... Rn} have terminated
void read_and_drop()
{
    // reset f
}
read_and_drop() must not be called before it can be guaranteed that other threads are not reading f.
To your edit:
Why not call reset() first on the global shared_ptr?
If you were the last one to access the object, fine: it is deleted, and then you delete the shared_ptr on the heap.
If some other thread still uses it, you reduce the ref count by one, and "disconnect" the global ptr from the (still existing) object that is pointed-to. You can then safely delete the shared_ptr on the heap without affecting any thread that might still use it.
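In code, that teardown order might look like this (a sketch; it assumes, as the question states, that no thread touches gPtrToFooPtr afterwards):

// thread Z, once all readers are done taking copies
void drop()
{
    gPtrToFooPtr->reset(); // drop this reference; the Foo is deleted only
                           // if no other thread still holds a copy
    delete gPtrToFooPtr;   // destroy the heap-allocated shared_ptr itself
    gPtrToFooPtr = nullptr;
}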
In one of my projects, I identified a call to std::deque<T>::clear() as a major bottleneck.
I therefore decided to move this operation to a dedicated, low-priority thread:
template <class T>
void SomeClass::parallelClear(T& c)
{
    if (!c.empty())
    {
        T* temp = new T;
        c.swap(*temp); // swap contents (fast)

        // Deallocate on a separate thread; the boost::thread object
        // detaches on destruction, so the deletion proceeds on its own.
        boost::thread deleteThread([=] () { delete temp; });

        // Windows specific: lower the worker's priority.
        // (SetThreadPriority is the per-thread call; SetPriorityClass
        // applies to processes.)
        SetThreadPriority(deleteThread.native_handle(), THREAD_PRIORITY_BELOW_NORMAL);
    }
}

void SomeClass::clear(std::deque<ComplexObject>& hugeDeque)
{
    parallelClear(hugeDeque);
}
This seems to work fine (VisualC++ 2010), but I wonder if I overlooked any major flaw. I would welcome your comments about the above code.
Additional information:
SomeClass::clear() is called from a GUI thread, and the user interface is unresponsive until the call returns. The hugeDeque, on the other hand, is unlikely to be accessed by that thread for several seconds after clearing.
This is only valid if you guarantee that access to the heap is serialized. Windows does serialize access to the primary heap by default, but it's possible to turn this behaviour off and there's no guarantee that it holds across platforms or libraries. As such, I'd be careful depending on it- make sure that it's explicitly noted that it depends on the heap being shared between threads and that the heap is thread-safe to access.
I personally would suggest that using a custom allocator to match the allocation/deallocation pattern would be the best performance improvement here; remember that creating threads has a non-trivial overhead.
Edit: If you are using a GUI/worker-thread style of threading design, then really, you should create, manage, and destroy the deque on the worker thread.
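To make the custom-allocator suggestion concrete, here is a sketch using C++17's polymorphic allocators (which postdate the Visual C++ 2010 mentioned in the question; the shape of the idea, not the specific tool, is the point). All of the deque's nodes come from one arena, so per-node deallocation is a no-op and the memory is returned in a single cheap step:

#include <deque>
#include <memory_resource> // C++17

struct ComplexObject { /* ... */ };

void arena_clear_demo()
{
    std::pmr::monotonic_buffer_resource arena;
    {
        std::pmr::deque<ComplexObject> hugeDeque(&arena);
        // fill and use hugeDeque ...
    } // element destructors still run here, but deallocation into the
      // monotonic arena is a no-op, so teardown is far cheaper
    arena.release(); // return all memory in one step
}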
Be aware that it is not certain this will improve the overall performance of your application. The Windows standard heap (including the low-fragmentation heap) is not laid out for frequently handing allocations from one thread to another for freeing. This will work, but it might produce quite a processing overhead.
The documentation of the hoard memory allocator might be a starting point do get deeper into that:
http://www.cs.umass.edu/~emery/hoard/hoard-documentation.html
Your approach will improve responsiveness, though.
In addition to the things mentioned by the other posters, you should consider whether the objects contained in the collection have thread affinity; for example, COM objects in a single-threaded apartment may not be amenable to this kind of trick.