Avoid local() calls for tbb enumerable_thread_specific variables - c++

I have a code which uses tbb::enumerable_thread_specific variables, and in the deep place of the call stack the thread local variables are used. The naive implementation leads to a lot of local() function calls.
Now I want to avoid local() function calls by passing parameters hierarchically. Is there a simpler way of doing this? I have many places with local() function calls if I do not pass Foo as a parameter, but the code would be messy if I do. I have been looking for possible usage of an array with size equal to the number of threads, and use thread-id to access the thread local variable, but it seems tbb does not provide that (in contrast to omp_get_thread_num() in OpenMP).
See more descriptions here:
https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/804043

Repeating and expanding my own answer from the TBB forum:
You can use tbb::this_task_arena::max_concurrency() and tbb::this_task_arena::current_thread_index() to implement array-based custom thread local storage. The first function gives the upper limit for the number of working threads; to a degree it's TBB equivalent for omp_get_num_threads(). The second one gives an index of the current thread within the limit, similarly to omp_get_thread_num().

Ryan. Before suggesting something else, I would suggest you try to use enumerable_thread_specific if you can. It provides one feature you may have trouble getting in general: each variable is guaranteed to line up on a cache line, which eliminates false sharing.
If you decide to manage your own thread-local storage, you must
Allocate the storage
Assign the storage to a thread, and
(potentially) free the storage.
Remember also that TBB does not guarantee a particular number of threads, though in general it will give you what you ask for. Be careful of oversubscription.
You can use any storage that does not get reallocated, so std::vector<T> is out. I'd suggest you use a concurrent_vector<T>, which doesn't get moved on expanding the array.
So you have to assign each thread a slot in the vector. That index can be stored in TLS. Then use this index to fetch the instance from your concurrent_vector. This can be an expensive operation if the vector is fragmented.
You can also use the threadID of the thread to hash into storage. If you are willing to allocate a hash map once and never resize, this will work; otherwise you have to manage a chain of hash tables and walk through the chain looking for your instance. If I remember right enumerable_thread_specific uses this technique.
You can see it is a non-trivial to implement your own version, and you'll always do better if you use a stack variable in each thread and pass that as a formal parameter. Your problem may not be structured that way, though.

Related

Performance: should I use a global variable in a function which gets called often?

First off, let me get of my chest the fact that I'm a greenhorn trying to do things the right way which means I get into a contradiction about what is the right way every now and then.
I am modifying a driver for a peripheral which contains a function - lets call it Send(). In the function I have a timestamp variable so the function loops for a specified amount of time.
So, should I declare the variable global (that way it is always in memory and no time is lost for declaring it each time the function runs) or do I leave the variable local to the function context (and avoid a bad design pattern with global variables)?
Please bear in mind that the function can be called multiple times per milisecond.
Speed of execution shouldn't be significantly different for a local vs. a global variable. The only real difference is where the variable lives. Local variables are allocated on the stack, global variables are in a different memory segment. It is true that local variables are allocated every time you enter a routine, but allocating memory is a single instruction to move the stack pointer.
There are much more important considerations when deciding if a variable should be global or local.
When implementing a driver, try to avoid global variables as much as possible, because:
They are thread-unsafe, and you have no idea about the scheduling scheme of the user application (in fact, even without threads, using multiple instances of the same driver is a potential problem).
It automatically yields the creation of data-section as part of the executable image of any application that links to your driver (which is something that the application programmer might want to avoid).
Did you profile a fully-optimized, release build of your code and identify the bottleneck to be small allocations in this function?
The change you are proposing is a micro-optimization; a change to a small part of your code with the intent to make it more efficient. If the question to the above question is "no" as I'd expect, you shouldn't even be thinking of such things.
Select the correct algorithm for your code. Write your code using idiomatic techniques. Do not write in micro-optimizations. You might be surprised how good your compiler is at optimizing your code for you. It will often be able to optimize away these small allocations, but even if it can't you still don't know if the performance penalty imposed by them is even noticeable or significant.
For drivers, with is usually position independent, global variables are accessed indirectly with GOT table unless IP-relative operations is available (i.e. x86_64, ARM, etc)
In case of GOT, you can think it as an extra indirect pointer.
However, even with an extra pointer it won't make any observable difference if it's "only" called in mill-second frequency.

parallelization with openMP - stack or heap variables

I have one working solution for parallelization . However, execution time is very very slightly improved by parallelization. I thinks it comes from the fact I new and delete some variable in the loop. I would like it to be stack created, BUT Command class is abstract, and must remain abstract. What can I do to work around that? How to improve time spent on these very long loops???
#pragma omp parallel for reduction(+:functionEvaluation)
for (int i=rowStart;i<rowEnd+1;i++)
{
Model model_(varModel_);
model_.addVariable("i", i);
model_.addVariable("j", 1);
Command* command_ = formulaCommand->duplicate(&model_);
functionEvaluation += command_->execute().toDouble();
delete command_;
}
The problem may also lie elsewhere! advice welcome!!
thanks and regards.
You may want to play with the private or firstprivate clauses.
Your #pragma would include ...private(varModel, formulaCommand)... or similar, and then each thread would have its own copy of those variables. Using firstprivate will ensure that the thread-specific variable has the initial value copied in, instead of being uninitialized. This would remove the need to new and delete, assuming you can just modify instances for each loop iteration.
This may or may not work as desired, as you haven't provided a lot of detail.
I think you should try to use a mechanism to reuse allocated memory. You probably don't know the size nor the alignment of the Command object coming, so a "big enough" buffer will not suffice. I'd make your duplicate method take two arguments, the second being the reference to a boost::pool. If the pool object is big enough just construct the new Command object inside it, if it's not expand it, and construct into it. boost::pool will handle alignment issues for you, so you don't have to think about it. This way, you'll have to do dynamic memory allocation only a few times per thread.
By the way, it's in general not good practice to return raw pointers in C++. Use smart pointers instead, it's simply better that way without any buts... Well, there's a but in this case :), since with my suggestion you'd be doing some under the hood custom memory management. Still, the bestest practice would be to write a custom smart pointer which handles your special case gracefully, without risking the user to mess up. You could of course do like everyone else and make en exception in this case :) (My advice still holds under normal circumstances though, f.x. in the question above, you should normally use something like boost::scoped_ptr)

Accessing different data members belonging to the same object from 2 different thread in C++

I have a few objects I need to perform actions on from different threads in c++. I known it is necessary to lock any variable that may be used by more than one thread at the same time, but what if each thread is accessing (writing to) a different data member of the same object? For example, each thread is calling a different method of the object and none of the methods called modify the same data member. Is it safe as long as I don't access the same data member or do I need to lock the whole object anyway?
I've looked around for explanations and details on this topic but every example seems to focus on single variables or non-member functions.
To summarize:
Can I safely access 2 different data members of the same object from 2 different thread without placing a lock on the whole object?
It is effectively safe, but will strongly reduce the performance of your code if you do that often. Computers use things called "cache lines" and if two processors are working on the same cache line they'll have to pass it back & forth all the time, slowing your work down.
Yes, it is safe to access different members of one object by different thread.
I think you can do that fine. But you better be sure that that the method internals never change to access the same data or the calling program doesn't decide to call another method that another thread is already using etc.
So possible, but potentially dangerous. But then it will also be quicker because you'll be avoiding calls to get mutexes. Pick your poison.
Well, yes, OK you can do it but, as others have pointed out, you should not have to. IMHO, access to data members should be via getter/setter methods so that any necessary mutexing/criticalSectioning/semaphoring/whatever is encapsulated within the object.
Is it safe as long as I don't access the same data member or do I need to lock the whole object anyway?
The answer totally depends upon the design of the class, However I would still say that it is always recommended to think 100 times before allowing multiple threads to access same object. Given the fact, If you are sure that the data is really independent their is NO need to lock the whole object.
Then a different question arises, "If variables are indeed independent Why they are in same class ?" Be careful threading kills if mistaken.
You might want to be careful. See for example http://gcc.gnu.org/ml/gcc/2012-02/msg00032.html
Depending on how the fields are accessed, you might run across similar hard to find problems.

Best practice when calling initialize functions multiple times?

This may be a subjective question, but I'm more or less asking it and hoping that people share their experiences. (As that is the biggest thing which I lack in C++)
Anyways, suppose I have -for some obscure reason- an initialize function that initializes a datastructure from the heap:
void initialize() {
initialized = true;
pointer = new T;
}
now When I would call the initialize function twice, an memory leak would happen (right?). So I can prevent this is multiple ways:
ignore the call (just check wether I am initialized, and if I am don't do anything)
Throw an error
automatically "cleanup" the code and then reinitialize the thing.
Now what is generally the "best" method, which helps keeping my code manegeable in the future?
EDIT: thank you for the answers so far. However I'd like to know how people handle this is a more generic way. - How do people handle "simple" errors which can be ignored. (like, calling the same function twice while only 1 time it makes sense).
You're the only one who can truly answer the question : do you consider that the initialize function could eventually be called twice, or would this mean that your program followed an unexpected execution flow ?
If the initialize function can be called multiple times : just ignore the call by testing if the allocation has already taken place.
If the initialize function has no decent reason to be called several times : I believe that would be a good candidate for an exception.
Just to be clear, I don't believe cleanup and regenerate to be a viable option (or you should seriously consider renaming the function to reflect this behavior).
This pattern is not unusual for on-demand or lazy initialization of costly data structures that might not always be needed. Singleton is one example, or for a class data member that meets those criteria.
What I would do is just skip the init code if the struct is already in place.
void initialize() {
if (!initialized)
{
initialized = true;
pointer = new T;
}
}
If your program has multiple threads you would have to include locking to make this thread-safe.
I'd look at using boost or STL smart pointers.
I think the answer depends entirely on T (and other members of this class). If they are lightweight and there is no side-effect of re-creating a new one, then by all means cleanup and re-create (but use smart pointers). If on the other hand they are heavy (say a network connection or something like that), you should simply bypass if the boolean is set...
You should also investigate boost::optional, this way you don't need an overall flag, and for each object that should exist, you can check to see if instantiated and then instantiate as necessary... (say in the first pass, some construct okay, but some fail..)
The idea of setting a data member later than the constructor is quite common, so don't worry you're definitely not the first one with this issue.
There are two typical use cases:
On demand / Lazy instantiation: if you're not sure it will be used and it's costly to create, then better NOT to initialize it in the constructor
Caching data: to cache the result of a potentially expensive operation so that subsequent calls need not compute it once again
You are in the "Lazy" category, in which case the simpler way is to use a flag or a nullable value:
flag + value combination: reuse of existing class without heap allocation, however this requires default construction
smart pointer: this bypass the default construction issue, at the cost of heap allocation. Check the copy semantics you need...
boost::optional<T>: similar to a pointer, but with deep copy semantics and no heap allocation. Requires the type to be fully defined though, so heavier on dependencies.
I would strongly recommend the boost::optional<T> idiom, or if you wish to provide dependency insulation you might fall back to a smart pointer like std::unique_ptr<T> (or boost::scoped_ptr<T> if you do not have access to a C++0x compiler).
I think that this could be a scenario where the Singleton pattern could be applied.

Boost shared_ptr use_count function

My application problem is the following -
I have a large structure foo. Because these are large and for memory management reasons, we do not wish to delete them when processing on the data is complete.
We are storing them in std::vector<boost::shared_ptr<foo>>.
My question is related to knowing when all processing is complete. First decision is that we do not want any of the other application code to mark a complete flag in the structure because there are multiple execution paths in the program and we cannot predict which one is the last.
So in our implementation, once processing is complete, we delete all copies of boost::shared_ptr<foo>> except for the one in the vector. This will drop the reference counter in the shared_ptr to 1. Is it practical to use shared_ptr.use_count() to see if it is equal to 1 to know when all other parts of my app are done with the data.
One additional reason I'm asking the question is that the boost documentation on the shared pointer shared_ptr recommends not using "use_count" for production code.
Edit -
What I did not say is that when we need a new foo, we will scan the vector of foo pointers looking for a foo that is not currently in use and use that foo for the next round of processing. This is why I was thinking that having the reference counter of 1 would be a safe way to ensure that this particular foo object is no longer in use.
My immediate reaction (and I'll admit, it's no more than that) is that it sounds like you're trying to get the effect of a pool allocator of some sort. You might be better off overloading operator new and operator delete to get the effect you want a bit more directly. With something like that, you can probably just use a shared_ptr like normal, and the other work you want delayed, will be handled in operator delete for that class.
That leaves a more basic question: what are you really trying to accomplish with this? From a memory management viewpoint, one common wish is to allocate memory for a large number of objects at once, and after the entire block is empty, release the whole block at once. If you're trying to do something on that order, it's almost certainly easier to accomplish by overloading new and delete than by playing games with shared_ptr's use_count.
Edit: based on your comment, overloading new and delete for class sounds like the right thing to do. If anything, integration into your existing code will probably be easier; in fact, you can often do it completely transparently.
The general idea for the allocator is pretty much the same as you've outlined in your edited question: have a structure (bitmaps and linked lists are both common) to keep track of your free objects. When new needs to allocate an object, it can scan the bit vector or look at the head of the linked list of free objects, and return its address.
This is one case that linked lists can work out quite well -- you (usually) don't have to worry about memory usage, because you store your links right in the free object, and you (virtually) never have to walk the list, because when you need to allocate an object, you just grab the first item on the list.
This sort of thing is particularly common with small objects, so you might want to look at the Modern C++ Design chapter on its small object allocator (and an article or two since then by Andrei Alexandrescu about his newer ideas of how to do that sort of thing). There's also the Boost::pool allocator, which is generally at least somewhat similar.
If you want to know whether or not the use count is 1, use the unique() member function.
I would say your application should have some method that eliminates all references to the Foo from other parts of the app, and that method should be used instead of checking use_count(). Besides, if use_count() is greater than 1, what would your program do? You shouldn't be relying on shared_ptr's features to eliminate all references, your application architecture should be able to eliminate references. As a final check before removing it from the vector, you could assert(unique()) to verify it really is being released.
I think you can use shared_ptr's custom deleter functionality to call a particular function when the last copy has been released. That way, you're not using use_count at all.
You would need to hold something other than a copy of the shared_ptr in your vector so that the shared_ptr is only tracking the outstanding processing.
Boost has several examples of custom deleters in the shared_ptr docs.
I would suggest that instead of trying to use the shared_ptr's use_count to keep track, it might be better to implement your own usage counter. this way you will have full control over this rather than using the shared_ptr's one which, as you rightly suggest, is not recommended. You can also pre-set your own counter to allow for the number of threads you know will need to act on the data, rather than relying on them all being initialised at the beginning to get their copies of the structure.