tbb::task_scheduler_init custom allocator? - C++

So I am trying to use parallel_for_each.
I have code where I do:
Source s;
parallel_for_each(begin(_allocs), end(_allocs), [&s](Allocs_t::value_type allocation) {
    // cool stuff with allocation
});
This works, and works well. However, I've seen in many posts that I should call tbb::task_scheduler_init before scheduling tasks.
The problem is that I override malloc and calloc, and I can't have the init call malloc and calloc (which it does).
So the questions are:
Why does it work well? DOES it work well?
Is there a way to give Intel TBB a specific allocator for all its purposes?
Thanks

Instantiation of a tbb::task_scheduler_init object is optional. TBB has a lazy auto-initialization mechanism which constructs everything on the first call to TBB algorithms/scheduler. Auto-initialization is equivalent to constructing a global task_scheduler_init object just before your first call to TBB.
Of course, if you need to override the default number of threads, specify the scope where TBB should be initialized, or specify the size of the stack for workers, explicit initialization is unavoidable.
The TBB scheduler uses either its own scalable allocator (tbbmalloc.dll, libtbbmalloc.so, ...) if it is found next to the TBB binaries, or it falls back to plain malloc otherwise. There is no way to explicitly specify any other allocator to be used by the scheduler (unlike TBB containers, which have a corresponding template argument).
Given all the above, I think you have two [combinable] options:
Ensure that TBB scheduler uses its own allocator, so you don't have to worry about correct replacement of malloc for TBB.
Or/and, ensure that the only task_scheduler_init object is created and destroyed at points (a scope) where malloc/free are in a consistent state.

Related

What mechanism ensures that std::shared_ptr control block is thread-safe?

From articles like std::shared_ptr thread safety, I know that the control block of a std::shared_ptr is guaranteed to be thread-safe by the standard whilst the actual data pointed to is not inherently thread-safe (i.e., it is up to me as the user to make it so).
What I haven't been able to find in my research is an answer to how this is guaranteed. What I mean is, what mechanism specifically is used to ensure that the control block is thread-safe (and thus that an object is only deleted once)?
I ask because I am using the newlib-nano C++ library for embedded systems along with FreeRTOS. These two are not inherently designed to work with each other. Since I never wrote any code to ensure that the control block is thread-safe (e.g., no code for a critical section or mutex), I can only assume that it may not actually be thread-safe under FreeRTOS.
There isn't really much machinery required for this. For a rough sketch (not including all the requirements/features of the standard std::shared_ptr):
You only need to make sure that the reference counter is atomic, that it is incremented/decremented atomically and accessed with acquire/release semantics (actually some of the accesses can even be relaxed).
Then, when the last instance of a shared pointer for a given control block is destroyed and it decrements the reference count to zero (this needs to be checked atomically together with the decrement, using e.g. the return value of std::atomic::fetch_sub), the destructor knows that no other thread holds a reference to the control block anymore, and it can simply destroy the managed object and clean up the control block.
MSVC uses InterlockedIncrement to increment the refcount, which is an atomic increment.
My best guess from looking at the C++ library code is that the reference count is implemented with atomic operations. I am thinking that all the code eventually boils down to a set of built-in functions implemented by the compiler for the specific architecture I am using (ARM). Here is a list of those built-ins provided by GCC: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

Using C++11 std::thread native_handle and whether I need to CloseHandle

I'm starting to get into the practice of using C++11 std::threads. Typically, with Win32, I need to call CloseHandle whenever I have a handle to a thread. Do I still need to call CloseHandle when I use a C++11 native_handle? Also, if I don't use C++11 native handles, do thread handles get cleaned up properly?
Of course not.
Thread objects have a destructor which releases any operating-system-specific resources that the object may have acquired. (Note that you must join() or detach() the thread before the std::thread object is destroyed; destroying a still-joinable thread calls std::terminate.)
Actually, every (good) C++ object has a destructor which cleans up whatever needs to be cleaned up, and this destructor (when the code is written correctly) is called automatically by the program.
This idiom is known as RAII: every object has a constructor which acquires the resources that the object needs, and a peer destructor which releases them when the object goes out of scope.
When done correctly, this technique is far more powerful than C-style manual resource management or a "high-level" garbage collector.
As a word of advice, if the standard provides you with some utility, ignore the corresponding Win32 API completely. The standard does not depend on an operating-system-specific API to work correctly.
No, they are RAII, like shared pointers.
The main things you need to worry about are:
synchronization (no pain, no gain!)
that you cannot copy threads (deleted copy constructor), so you pass them by reference (or move them).
Details here.
http://www.cplusplus.com/reference/thread/thread/thread/

C++ Thread safe vector.erase

I wrote a threaded Renderer for SFML which takes pointers to drawable objects and stores them in a vector to be draw each frame. Starting out adding objects to the vector and removing objects to the vector would frequently cause Segmentation faults (SIGSEGV). To try and combat this, I would add objects that needed to be removed/added to a queue to be removed later (before drawing the frame). This seemed to fix it, but lately I have noticed that if I add many objects at one time (or add/remove them fast enough) I will get the same SIGSEGV.
Should I be using locks when I add/remove from the vector?
You need to understand the thread-safety guarantees the C++ standard gives (and which pre-C++11 implementations for possibly-concurrent systems already gave). The standard containers are thread-safe in the following sense:
It is OK to have multiple concurrent threads reading the same container.
If there is one thread modifying a container there shall be no concurrent threads reading or writing the same container.
Different containers are independent of each other.
Many people misunderstand the thread-safety of containers to mean that these rules are imposed by the container implementation: they are not! It is your responsibility to obey these rules.
The reason these aren't, and actually can't, be imposed by the containers is that they don't have an interface suitable for this. Consider for example the following trivial piece of code:
if (!c.empty()) {
    auto value = c.back();
    // do something with the read value
}
The container can control access within the calls to empty() and back(). However, between these calls it necessarily has to release any synchronization facilities, i.e. by the time the thread tries to read c.back() the container may be empty again! There are essentially two ways to deal with this problem:
You use external locking, spanning the entire range of accesses which are interdependent in some form, whenever there is a possibility that a concurrent thread may be modifying the container.
You change the interface of the containers to become monitors. However, the container interface isn't at all suitable to be changed in this direction because monitors essentially only support "fire and forget" style of interfaces.
Both strategies have their advantages, and the standard library containers clearly support the first style, i.e. they require external locking when used concurrently with the potential of at least one thread modifying the container. They don't require any kind of locking (neither internal nor external) if only one thread ever uses them in the first place. This is actually the scenario they were designed for. The thread-safety guarantees given for them are in place to guarantee that no internal facilities are used which are not thread-safe, say a per-object iterator object, or a memory allocation facility shared by multiple threads without being thread-safe, etc.
To answer the original question: yes, you need to use external synchronization, e.g. in the form of mutex locks, if you modify the container in one thread and read it in another thread.
Should I be using locks when I add/remove from the vector?
Yes. If you're using the vector from two threads at the same time and it reallocates, then the backing allocation may be swapped out and freed behind the other thread's back. The other thread would be reading/writing freed memory, or memory in use by another, unrelated allocation.

Clearing STL in dedicated thread

In one of my projects, I identified a call to std::deque<T>::clear() as a major bottleneck.
I therefore decided to move this operation in a dedicated, low-priority thread:
template <class T>
void SomeClass::parallelClear(T& c)
{
    if (!c.empty())
    {
        T* temp = new T;
        c.swap(*temp); // swap contents (fast)
        // deallocate on a separate thread; the boost::thread object
        // detaches on destruction, so the deletion continues in the background
        boost::thread deleteThread([=] () { delete temp; });
        // Windows specific: lower the deleter thread's priority
        // (SetThreadPriority, not SetPriorityClass, which expects a process handle)
        SetThreadPriority(deleteThread.native_handle(), THREAD_PRIORITY_BELOW_NORMAL);
    }
}

void SomeClass::clear(std::deque<ComplexObject>& hugeDeque)
{
    parallelClear(hugeDeque);
}
This seems to work fine (VisualC++ 2010), but I wonder if I overlooked any major flaw. I would welcome your comments about the above code.
Additional information:
SomeClass::clear() is called from a GUI thread, and the user interface is unresponsive until the call returns. The hugeDeque, on the other hand, is unlikely to be accessed by that thread for several seconds after clearing.
This is only valid if you guarantee that access to the heap is serialized. Windows does serialize access to the primary heap by default, but it's possible to turn this behaviour off and there's no guarantee that it holds across platforms or libraries. As such, I'd be careful depending on it- make sure that it's explicitly noted that it depends on the heap being shared between threads and that the heap is thread-safe to access.
Personally, I would suggest that using a custom allocator to match the allocation/deallocation pattern would be the best performance improvement here; remember that creating threads has a non-trivial overhead.
Edit: If you are using GUI/Worker thread style threading design, then really, you should create, manage and destroy the deque on the worker thread.
Be aware that it is not certain this will improve the overall performance of your application. The Windows standard heap (and the low-fragmentation heap as well) is not designed for frequently handing allocations from one thread to another for deallocation. This will work, but it might incur quite a processing overhead.
The documentation of the Hoard memory allocator might be a starting point to dig deeper into that:
http://www.cs.umass.edu/~emery/hoard/hoard-documentation.html
Your approach will, though, improve responsiveness etc.
In addition to the things mentioned by the other posters, you should consider whether the objects contained in the collection have thread affinity; for example, COM objects in a single-threaded apartment may not be amenable to this kind of trick.

What is the most efficient implementation of a java like object monitor in C++?

In Java, each object has a synchronization monitor. So I guess the implementation is pretty compact in terms of memory usage and hopefully fast as well.
When porting this to C++, what would be the best implementation for it? I think there must be something better than pthread_mutex_init, or is the object overhead in Java really so high?
Edit: I just checked that pthread_mutex_t on Linux/i386 is 24 bytes. That's huge if I have to reserve this space for each object.
In a sense it's worse than pthread_mutex_init, actually. Because of Java's wait/notify you kind of need a paired mutex and condition variable to implement a monitor.
In practice, when implementing a JVM you hunt down and apply every single platform-specific optimisation in the book, and then invent some new ones, to make monitors as fast as possible. If you can't do a really fiendish job of that, you definitely aren't up to optimising garbage collection ;-)
One observation is that not every object needs to have its own monitor. An object which isn't currently synchronised doesn't need one. So the JVM can create a pool of monitors, and each object could just have a pointer field, which is filled in when a thread actually wants to synchronise on the object (with a platform-specific atomic compare and swap operation, for instance). So the cost of monitor initialisation doesn't have to add to the cost of object creation. Assuming the memory is pre-cleared, object creation can be: decrement a pointer (plus some kind of bounds check, with a predicted-false branch to the code that runs gc and so on); fill in the type; call the most derived constructor. I think you can arrange for the constructor of Object to do nothing, but obviously a lot depends on the implementation.
In practice, the average Java application isn't synchronising on very many objects at any one time, so monitor pools are potentially a huge optimisation in time and memory.
The Sun HotSpot JVM implements thin locks using compare and swap. If an object is locked, the waiting thread waits on the monitor of the thread which locked the object. This means you only need one heavy lock per thread.
I'm not sure how Java does it, but .NET doesn't keep the mutex (or its analog; the structure that holds it is called a "syncblk" there) directly in the object. Rather, it has a global table of syncblks, and each object references its syncblk by index into that table. Furthermore, objects don't get a syncblk as soon as they're created; instead, one is created on demand at the first lock.
I assume (note, I do not know how it actually does that!) that it uses atomic compare-and-exchange to associate the object and its syncblk in a thread-safe way:
Check the hidden syncblk_index field of our object for 0. If it's not 0, lock it and proceed, otherwise...
Create a new syncblk in global table, get the index for it (global locks are acquired/released here as needed).
Compare-and-exchange to write it into object itself.
If previous value was 0 (assume that 0 is not a valid index, and is the initial value for the hidden syncblk_index field of our objects), our syncblk creation was not contested. Lock on it and proceed.
If previous value was not 0, then someone else had already created a syncblk and associated it with the object while we were creating ours, and we have the index of that syncblk now. Dispose the one we've just created, and lock on the one that we've obtained.
Thus the overhead per-object is 4 bytes (assuming 32-bit indices into syncblk table) in best case, but larger for objects which actually have been locked. If you only rarely lock on your objects, then this scheme looks like a good way to cut down on resource usage. But if you need to lock on most or all your objects eventually, storing a mutex directly within the object might be faster.
Surely you don't need such a monitor for every object!
When porting from Java to C++, it strikes me as a bad idea to just copy everything blindly. The best structure for Java is not the same as the best for C++, not least because Java has garbage collection and C++ doesn't.
Add a monitor to only those objects that really need it. If only some instances of a type need synchronization then it's not that hard to create a wrapper class that contains the mutex (and possibly condition variable) necessary for synchronization. As others have already said, an alternative is to use a pool of synchronization objects with some means of choosing one for each object, such as using a hash of the object address to index the array.
I'd use the boost thread library or the new C++0x standard thread library for portability rather than relying on platform specifics at each turn. Boost.Thread supports Linux, MacOSX, win32, Solaris, HP-UX and others. My implementation of the C++0x thread library currently only supports Windows and Linux, but other implementations will become available in due course.