How can I find out what tasks are executing which iterations of a
forall loop?
For example, I'd like to get a feel for how the different DynamicIters behave,
use DynamicIters;
var r = 1..1000;
var A: [r] int;
forall i in adaptive(r) {
A[i] = ???;
}
I can use here.id to discover what locale a forall loop has put an
iteration on, but I don't know how to "see" what task within the
locale each iteration was assigned to.
Chapel's design intentionally avoids providing a standard, language-level way to query a task's ID because we didn't want to impose a particular numbering scheme or the overhead that might be required to maintain such a feature across distinct underlying runtime/OS/hardware choices. When writing coforall loops, a standard trick for creating virtual task IDs is to do something like this:
coforall (i, tid) in zip(myIter(), 0..) do
Since each iteration of the loop executes as a separate task, tid will uniquely number each of them starting at 0. But as you're noting, since task creation is abstracted away into iterators when using forall loops, they don't have a straightforward equivalent -- you'd typically need to modify the task-parallel constructs in the parallel iterators that are driving the loop in order to reason about what tasks are being created and what they are doing.
In the specific case of the DynamicIters module that you're curious about, there is a config param named debugDynamicIters that enables printing of diagnostic information, so if you compile your program with -sdebugDynamicIters=true, you'll get some sense of what the tasks are doing. And of course, you can also modify the iterators themselves (located in $CHPL_HOME/modules/standard/DynamicIters.chpl) to add additional debug printing.
It is possible to go outside of the language and access the task IDs that the runtime uses, though there's no guarantee that this will be portable across different runtime tasking options (e.g., qthreads, fifo, massivethreads) nor that it will continue to work across future versions of Chapel. For example, in Chapel 1.15.0, the following code works:
extern proc chpl_task_getId(): chpl_taskID_t;
forall i in adaptive(r) do
writeln("task ", chpl_task_getId(), " owns iter ", i);
The type chpl_taskID_t is an opaque type that's internal to the implementation: it can be printed out, but there's no guarantee that it will map to any particular underlying type across tasking options or use any specific set of values.
Assume we have a Container maintaining a set of int values, plus a flag for each value indicating whether the value is valid. Invalid values are considered to be INT_MAX. Initially, all values are invalid. When a value is accessed for the first time, it is set to INT_MAX and its flag is set to valid.
#include <climits>
#include <vector>

struct Container {
int& operator[](int i) {
if (!isValid[i]) {
values[i] = INT_MAX; // (*)
isValid[i] = true; // (**)
}
return values[i];
}
std::vector<int> values;
std::vector<bool> isValid;
};
Now, another thread reads container values concurrently:
// This member is allowed to overestimate value i, but it must not underestimate it.
int Container::get(int i) {
return isValid[i] ? values[i] : INT_MAX;
}
This is perfectly valid code, but it is crucial that lines (*) and (**) are executed in the given order.
Does the standard guarantee in this case that the lines are executed in the given order? At least from a single-threaded perspective, the lines could be interchanged, couldn't they?
If not, what is the most efficient way to ensure their order? This is high-performance code, so I cannot go without -O3 and do not want to use volatile.
There is no synchronization here. If you access these values from one thread and change them from another thread, you get undefined behavior. You'd either need a lock around all accesses, in which case things are fine, or you'd need to make all your std::vector elements std::atomic<T>, in which case you can control visibility of the values using the appropriate memory-order parameters.
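For illustration only, here is a minimal sketch of that release/acquire pattern (the names AtomicContainer and touch() are mine, and the int& interface of the original operator[] is dropped, since atomics can't be exposed that way):

#include <atomic>
#include <climits>
#include <cstddef>
#include <vector>

struct AtomicContainer {
    explicit AtomicContainer(std::size_t n) : values(n), isValid(n) {}

    // First access: write the value, then publish it by releasing the flag.
    void touch(std::size_t i) {
        values[i].store(INT_MAX, std::memory_order_relaxed);  // (*)
        isValid[i].store(true, std::memory_order_release);    // (**) publishes (*)
    }

    // May overestimate, never underestimates: the acquire load synchronizes
    // with the release store above, so seeing the flag implies seeing the value.
    int get(std::size_t i) const {
        return isValid[i].load(std::memory_order_acquire)
                   ? values[i].load(std::memory_order_relaxed)
                   : INT_MAX;
    }

    std::vector<std::atomic<int>>  values;
    std::vector<std::atomic<bool>> isValid;
};

Note that std::vector<bool> can't hold atomics, so the flag vector becomes std::vector<std::atomic<bool>> here.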
There seems to be a misunderstanding of what synchronization, and in particular atomic operations, do: their purpose is to make code fast! That may appear counterintuitive, so here is the explanation: non-atomic operations should be as fast as possible, and there are deliberately no guarantees about how exactly they access memory. As long as the compiler and execution system produce the correct results, they are free to do whatever they need or want to do. To achieve good performance, interactions between different threads are assumed not to exist.
In a concurrent system there are, however, interactions between threads. This is where atomic operations enter the stage: they allow the specification of exactly the synchronization that is needed. Thus, they let you tell the compiler the minimal constraints it has to obey to make the thread interaction correct. The compiler will use these indicators to generate the best possible code to achieve what is specified. That code may be identical to code not using any synchronization, although in practice it is normally also necessary to prevent the CPU from reordering operations. As a result, correct use of the synchronization results in the most efficient code with only the absolutely necessary overhead.
The tricky part is finding which synchronizations are needed and minimizing them. Simply not having any will allow the compiler and the CPU to reorder operations freely and won't work.
Since the question mentioned volatile, please note that volatile is entirely unrelated to concurrency! The purpose of volatile is to inform the system that an address points to memory whose accesses may have side effects; primarily it is used to make memory-mapped I/O or hardware control accessible. Due to that potential for side effects, it is one of the two aspects of C++ defining the semantics of programs (the other one is I/O using standard library I/O facilities).
So I was trying to optimize an array operation in Julia, but noticed that I was occasionally getting a rather large error in my matrix. I also noticed that there existed the possibility of concurrently writing to the same index of a SharedArray in Julia. I was wondering whether Julia can safely handle that. If not, how might I be able to handle it?
Here is a basic example of my issue
for (x, y) in (some list of arbitrary indices into J)
    J[x, y] += some_value
end
Can Julia handle this case or, like C, will there exist the possibility of overwriting the data? Are there atomic operations in Julia to compensate for this?
Shared arrays deliberately have no locking, since locking can be expensive. The easiest approach is to assign non-overlapping work to different processes. However, you might search to see whether someone has written a locking library, or have a go at it yourself: https://en.wikipedia.org/wiki/Mutual_exclusion
I'm writing some generic code which basically will have a vector of objects being updated by a set of controllers.
The code is a bit complex in my specific context but a simplification would be:
#include <memory>
#include <vector>

template< class T >
class Controller
{
public:
virtual ~Controller(){}
virtual void update( T& ) = 0;
// and potentially other functions used in other cases than update
};
template< class T >
class Group
{
public:
typedef std::shared_ptr< Controller<T> > ControllerPtr;
void add_controller( ControllerPtr ); // register a controller
void remove_controller( ControllerPtr ); // remove a controller
void update(); // update all objects using controllers
private:
std::vector< T > m_objects;
std::vector< ControllerPtr > m_controllers;
};
I intentionally didn't use std::function because I can't use it in my specific case.
I also intentionally use shared pointers instead of raw pointers, this is not important for my question actually.
Anyway, it's the update() implementation that interests me here.
I can do it two ways.
A) For each controller, update all objects.
template< class T >
void Group<T>::update()
{
for( auto& controller : m_controllers )
for( auto& object : m_objects )
controller->update( object );
}
B) For each object, update by applying all controllers.
template< class T >
void Group<T>::update()
{
for( auto& object : m_objects )
for( auto& controller : m_controllers )
controller->update( object );
}
"Measure! Measure! Measure!" you will say and I fully agree, but I can't measure what I don't use. The problem is that it's generic code. I don't know the size of T, I just assume it will not be gigantic, maybe small, maybe still a bit big. Really I can't assume much about T other than it is designed to be contained in a vector.
I also don't know how many controllers or T instances will be used. In my current use cases, there would be widely different counts.
The question is: which solution would be the most efficient in general?
I'm thinking about cache coherency here. Also, I assume this code would be used on different compilers and platforms.
My gut tells me that updating the instruction cache is certainly faster than updating the data cache, which would make solution B) the more efficient in general. However, I learnt not to trust my gut when I have doubts about performance, so I'm asking here.
The solution I'm getting to would allow the user to choose (using a compile-time policy) which update implementation to use with each Group instance, but I want to provide a default policy and I can't decide which one would be the most efficient for most of the cases.
We have living proof that modern compilers (Intel C++ in particular) are able to interchange loops, so it shouldn't really matter for you.
I remember it from the great Mysticial's answer:
Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate!
Wikipedia article about the topic
Detecting whether loop interchange can be done requires checking if the swapped code will really produce the same results. In theory it could be possible to prepare classes that won't allow for the swap, but then again, it could be possible to prepare classes that would benefit from either version more.
Cache-Friendliness Is Close to Godliness
Knowing nothing else about how the update methods of individual Controllers behave, I think the most important factor in performance would be cache-friendliness.
Considering cache effectiveness, the only difference between the two loops is that m_objects are laid out contiguously (because they are contained in the vector) and they are accessed linearly in memory (because the loop is in order) but m_controllers are only pointed to here and they can be anywhere in memory and moreover, they can be of different types with different update() methods that themselves can reside anywhere. Therefore, while looping over them we would be jumping around in memory.
With respect to the cache, the two loops would behave like this (things are never simple and straightforward when you are concerned about performance, so bear with me!):
Loop A: The inner loop runs efficiently (unless the objects are large - hundreds or thousands of bytes - or they store their data outside themselves, e.g., std::string) because the cache access pattern is predictable and the CPU will prefetch consecutive cachelines so there won't be much stalling on reading memory for the objects. However, if the size of the vector of objects is larger than the size of the L2 (or L3) cache, each iteration of the outer loop will require reloading of the entire cache. But again, that cache reloading will be efficient!
Loop B: If indeed the controllers have many different types of update() methods, the inner loop here may cause wild jumping around in memory, but all these different update functions will be working on data that is cached and available (especially if objects are large or themselves contain pointers to data scattered in memory). Unless the update() methods access so much memory themselves (because, e.g., their code is huge or they require a large amount of their own - i.e. controller - data) that they thrash the cache on each invocation; in which case all bets are off anyway.
So, I suggest the following strategy generally, which requires information that you probably don't have:
If objects are small (or smallish!) and POD-like (don't contain pointers themselves) definitely prefer loop A.
If objects are large and/or complex, or if there are many many different types of complex controllers (hundreds or thousands of different update() methods) prefer loop B.
If objects are large and/or complex, and there are so very many of them that iterating over them will thrash the cache many times (millions of objects), and the update() methods are many and they are very large and complex and require a lot of other data, then I'd say the order of your loop doesn't make any difference and you need to consider redesigning objects and controllers.
Sorting the Code
If you can, it may be beneficial to sort the controllers based on their type! You can use some internal mechanism in Controller or something like typeid() or another technique to sort the controllers based on their type, so the behavior of consecutive update() passes become more regular and predictable and nice.
This is a good idea regardless of which loop order you choose to implement, but it will have much more effect in loop B.
However, if you have so much variation among controllers (i.e. if practically all are unique) this won't help much. Also, obviously, if you need to preserve the order that controllers are applied, you won't be able to do this.
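If the order of application is flexible, a sketch of that sorting could look like this (building on the Controller<T> declaration from the question; the free function name is mine):

#include <algorithm>
#include <memory>
#include <typeindex>
#include <vector>

// Group controllers of the same dynamic type together so that consecutive
// update() calls keep hitting the same vtable and code. stable_sort keeps
// the relative order of controllers that share a type.
template< class T >
void sort_controllers_by_type( std::vector< std::shared_ptr< Controller<T> > >& controllers )
{
    std::stable_sort( controllers.begin(), controllers.end(),
        []( const std::shared_ptr< Controller<T> >& a,
            const std::shared_ptr< Controller<T> >& b )
        { return std::type_index( typeid( *a ) ) < std::type_index( typeid( *b ) ); } );
}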
Adaptation and Improvisation
It should not be hard to implement both loop strategies and select between them at compile-time (or even runtime) based on either user hint or based on information available at compile time (e.g. size of T or some traits of T; if T is small and/or a POD, you probably should use loop A.)
You can even do this at runtime, basing your decision on the number of objects and controllers and anything else you can find out about them.
But, these kinds of "Klever" tricks can get you into trouble as the behavior of your container will depend on weird, opaque and even surprising heuristics and hacks. Also, they might and will even hurt performance in some cases, because there are many other factors meddling in performance of these two loops, including but not limited to the nature of the data and the code in objects and controllers, the exact sizes and configurations of cache levels and their relative speeds, the architecture of CPU and the exact way it handles prefetching, branch prediction, cache misses, etc., the code that the compiler generates, and much more.
If you want to use this technique (implementing both loops and switching between them are compile- and/or run-time) I highly suggest that you let the user do the choosing. You can accept a hint about which update strategy to use, either as a template parameter or a constructor argument. You can even have two update functions (e.g. updateByController() and updateByObject()) that the user can call at will.
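As a sketch of such a default policy (assuming Group also declares updateByController() and updateByObject(), and using C++17's if constexpr; the 64-byte threshold is an arbitrary guess, not a measured value):

#include <type_traits>

template< class T >
void Group<T>::updateByController()   // loop A: controllers outer, objects inner
{
    for( auto& controller : m_controllers )
        for( auto& object : m_objects )
            controller->update( object );
}

template< class T >
void Group<T>::updateByObject()       // loop B: objects outer, controllers inner
{
    for( auto& object : m_objects )
        for( auto& controller : m_controllers )
            controller->update( object );
}

template< class T >
void Group<T>::update()
{
    // Default heuristic: small, trivially copyable objects favor the
    // contiguous inner loop over objects; everything else favors loop B.
    if constexpr( std::is_trivially_copyable_v<T> && sizeof( T ) <= 64 )
        updateByController();
    else
        updateByObject();
}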
On Branch Prediction
The only interesting branch here is the virtual update call, and as an indirect call through two pointers (the pointer to the controller instance and then the pointer to its vtable) it is quite hard to predict. However, sorting controllers based on type will help immensely with this.
Also remember that a mispredicted branch will cause a stall of a few to a few dozen CPU cycles, but for a cache miss, the stall will be in the hundreds of cycles. Of course, a mispredicted branch can cause a cache miss too, so... As I said before, nothing is simple and straightforward when it comes to performance!
In any case, I think cache friendliness is by far the most important factor in performance here.
Thanks to the help I received in this post:
I have a nice, concise recursive function to traverse a tree in postfix order:
deque <char*> d;
void Node::postfix()
{
if (left != __nullptr) { left->postfix(); }
if (right != __nullptr) { right->postfix(); }
d.push_front(cargo);
return;
};
This is an expression tree. The branch nodes are operators randomly selected from an array, and the leaf nodes are values or the variable 'x', also randomly selected from an array.
char *values[10]={"1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","x"};
char *ops[4]={"+","-","*","/"};
As this will be called billions of times during a run of the genetic algorithm of which it is a part, I'd like to optimize it for speed. I have a number of questions on this topic which I will ask in separate postings.
The first is: how can I get access to each 'cargo' as it is found. That is: instead of pushing 'cargo' onto a deque, and then processing the deque to get the value, I'd like to start processing it right away.
Edit: This question suggests that processing the deque afterwards is a better way.
I don't yet know about parallel processing in C++, but this would ideally be done concurrently on two different processors.
In Python, I'd make the function a generator and access succeeding 'cargo's using .next().
See the above Edit.
But I'm using C++ to speed up the Python implementation. I'm thinking that this kind of tree has been around for a long time, and somebody has probably optimized it already. Any ideas? Thanks
Of course, you'd first want to measure the cost overhead before you bother with optimization here, as your genetic algorithm's next-generation production and mutations may swamp the evaluation time.
Once you've determined you want to optimize...
the obvious answer is to compile the expression ("as much as possible"). Fortunately, there's lots of ways to "compile".
If you are implementing this in Python, you may be able to ask Python (I'm not an expert) to compile a constructed abstract syntax tree into a function, and that might be a lot faster, especially if CPython supports this.
It appears that you are implementing this in C++, however. In that case, I wouldn't evaluate the expression tree as you have defined it, as it means lots of tree walking, indirect function calls, etc. which are pretty expensive.
One cheesy trick is to spit out the actual expression as a text string with appropriate C++ function body text around it into a file, and launch a C++ compiler on that.
You can automate all the spit-compile-relink with enough script magic, so that if you do this rarely, this would work and you would get expression evaluation as fast as the machine can do it.
Under the assumption you don't want to do that, I'd be tempted to walk your expression tree before you start the evaluation process, and "compile" that tree into a set of actions stored in a linear array called "code". The actions would be defined by an enum:
enum actions {
// general actions first
pushx, // action to push x on a stack
push1,
push2, // action to push 2 on a stack
...
pushN,
add,
sub,
mul, // action multiply top two stack elements together
div,
...
// optimized actions
add1,
sub1,
mul1,
div1, // action to divide top stack element by 1
...
addN,
subN,
...
addx,
subX,
...
};
In this case, I've defined the actions to implement a push-down stack expression evaluator, because that's easy to understand. Fortunately your expression vocabulary is pretty limited, so your actions can also be pretty limited (they'd be more complex if you had arbitrary variables or constants).
The expression ((x*2.0)+x)-1 would be executed by the series of actions
pushx
mul2
addx
sub1
It's probably hard to get a lot better than this.
One might instead define the actions to implement a register-oriented expression evaluator following the model of a multi-register CPU; that would enable even faster execution (I'd guess by a factor of two, but only if the expression got really complex).
What you want are actions that cover the most general computation you need to do (so you can always choose a valid action sequence regardless of your original expression) and actions that occur frequently in the expressions you encounter (add1 is pretty typical in machine code; I don't know what your statistics are like, and your remark that you are doing genetic programming suggests you don't know what the statistics will be, but you can measure them somehow or make an educated guess).
Your inner loop for evaluation would then look like (sloppy syntax here):
float stack[max_depth];
int stack_depth = 0;
for (int i = 0; i < expression_length; i++)
{
   switch (code[i]) // one case for each opcode in the enum
   {
      case pushx: stack[stack_depth++] = x;
                  break;
      case push1: stack[stack_depth++] = 1;
                  break;
      ...
      case add:   stack[stack_depth-2] += stack[stack_depth-1];  // combine the top two entries
                  stack_depth--;
                  break;
      ...
      case subx:  stack[stack_depth-1] -= x;  // operate on the top entry in place
                  break;
      ...
   }
}
// stack[0] contains the answer here
The above code implements a very fast "threaded interpreter" for a pushdown stack expression evaluator.
Now "all" you need to do is to generate the content of the code array. You can do that
by using your original expression tree, executing your original recursive expression tree walk, but instead of doing the expression evaluation, write the action that your current expression evaluator would do into the code array, and spitting out special case actions when you find them (this amounts to "peephole optimization"). This is classic compilation from trees, and you can find out a lot more about how to do this in pretty much any compiler book.
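A rough sketch of that pass, reusing the postfix shape of the question's traversal (the Node layout and the emitted cases are illustrative; the peephole selection of add1/mul2-style opcodes is left out, and div is renamed divide to avoid clashing with the C library):

#include <cstring>
#include <vector>

enum Action { pushx, push1, push2, add, sub, mul, divide /* , ... */ };

struct Node {                    // mirrors the question's tree node
    const char* cargo;           // "+", "-", "*", "/", "x", "1.0", ...
    Node* left  = nullptr;
    Node* right = nullptr;
};

// Postfix walk that emits opcodes into the linear code array instead of
// evaluating the tree.
void compile(const Node* n, std::vector<Action>& code)
{
    if (n->left)  compile(n->left,  code);   // operands first,
    if (n->right) compile(n->right, code);   // then the operator
    if      (!std::strcmp(n->cargo, "x"))   code.push_back(pushx);
    else if (!std::strcmp(n->cargo, "1.0")) code.push_back(push1);
    else if (!std::strcmp(n->cargo, "2.0")) code.push_back(push2);
    else if (!std::strcmp(n->cargo, "+"))   code.push_back(add);
    else if (!std::strcmp(n->cargo, "-"))   code.push_back(sub);
    else if (!std::strcmp(n->cargo, "*"))   code.push_back(mul);
    else if (!std::strcmp(n->cargo, "/"))   code.push_back(divide);
    // ... remaining constants and special-case opcodes go here
}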
Yes, this is all a fair bit of work. But then, you decided to run a genetic algorithm, which is computationally pretty expensive.
Lots of good suggestions in this thread for speeding up tree iteration:
Tree iterator, can you optimize this any further?
As for the problem, I guess you could process the cargo in a different thread, but seeing as you aren't actually doing THAT much with it, you would probably end up spending more time in thread synchronisation mechanisms than doing any actual work.
You may find that, instead of just pushing it onto the deque, processing each cargo as you go along makes things run faster. Or you may find that processing it all in a separate loop at the end is faster. The best way to find out is to try both methods with a variety of different inputs and time them.
Assuming that processing a cargo is expensive enough that locking a mutex is relatively cheap, you can use a separate thread to access the queue as you put items on it.
Thread 1 would execute your current logic, but it would lock the queue's mutex before adding an item and unlock it afterwards.
Then thread 2 would just loop forever, checking the size of the queue. If it's not empty, it locks the queue, pulls off all available cargo and processes it, then repeats the loop. If no cargo is available, it sleeps for a short period of time and repeats.
If the locking is too expensive you can build up a queue of queues: First you put say 100 items into a cargo queue, and then put that queue into a locked queue (like the first example). Then start on a new "local" queue and continue.
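For what it's worth, here is a minimal sketch of that two-thread scheme, using a condition variable instead of the sleep-and-poll loop described above (process_cargo and the other names are placeholders, not the poster's code):

#include <condition_variable>
#include <deque>
#include <mutex>

std::deque<const char*>  cargo_queue;
std::mutex               queue_mutex;
std::condition_variable  queue_cv;
bool                     producer_done = false;

void push_cargo(const char* cargo)               // called from the tree walk (thread 1)
{
    { std::lock_guard<std::mutex> lock(queue_mutex); cargo_queue.push_back(cargo); }
    queue_cv.notify_one();
}

void finish()                                    // called once the walk is done
{
    { std::lock_guard<std::mutex> lock(queue_mutex); producer_done = true; }
    queue_cv.notify_one();
}

void consume(void (*process_cargo)(const char*)) // body of thread 2
{
    std::unique_lock<std::mutex> lock(queue_mutex);
    for (;;) {
        queue_cv.wait(lock, [] { return !cargo_queue.empty() || producer_done; });
        while (!cargo_queue.empty()) {
            const char* c = cargo_queue.front();
            cargo_queue.pop_front();
            lock.unlock();                       // do the expensive work unlocked
            process_cargo(c);
            lock.lock();
        }
        if (producer_done) return;
    }
}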
I have a large number of sets of integers, which I have, in turn, put into a vector of pointers. I need to be able to update these sets of integers in parallel without causing a race condition. More specifically, I am using OpenMP's "parallel for" construct.
For dealing with shared resources, OpenMP offers a handy "atomic directive," which allows one to avoid a race condition on a specific piece of memory without using locks. It would be convenient if I could use the "atomic directive" to prevent simultaneous updating to my integer sets, however, I'm not sure whether this is possible.
Basically, I want to know whether the following code could lead to a race condition
vector< set<int>* > membershipDirectory(numSets, new set<int>);
#pragma omp for schedule(guided,expandChunksize)
for(int i=0; i<100; i++)
{
set<int>* sp = membershipDirectory[rand() % numSets];
#pragma omp atomic
sp->insert(45);
}
Note that I use a random integer for the index, because in my application, any thread might access any index (there is a random element in my larger application, but I need not go into details).
I have seen a similar example of this for incrementing an integer, but I'm not sure whether it works when working with a pointer to a container as in my case.
After searching around, I found the OpenMP C and C++ API manual on openmp.org, and in section 2.6.4, the limitations of the atomic construct are described.
Basically, the atomic directive can only be used with the following operators:
Unary: ++, -- (prefix and postfix)
Binary: +, -, *, /, ^, &, |, <<, >>
So I will just use locks!
(In some situations critical sections might be preferable, but in my case locks will provide fine grained access to the shared resource, yielding better performance than a critical section.)
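A sketch of that fine-grained locking with one omp_lock_t per set (sets are stored by value here for brevity, and rand() just mirrors the question's sketch; it isn't guaranteed to be thread-safe):

#include <omp.h>
#include <cstdlib>
#include <set>
#include <vector>

int main()
{
    const int numSets = 16;
    std::vector<std::set<int>> membershipDirectory(numSets);
    std::vector<omp_lock_t>    locks(numSets);     // one lock per set
    for (auto& l : locks) omp_init_lock(&l);

    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < 100; i++) {
        int idx = std::rand() % numSets;           // any thread may hit any set
        omp_set_lock(&locks[idx]);                 // protects only this one set
        membershipDirectory[idx].insert(45);
        omp_unset_lock(&locks[idx]);
    }

    for (auto& l : locks) omp_destroy_lock(&l);
    return 0;
}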
You should not use atomic where the expression is a function call; it only applies to simple expressions (possibly with built-ins: power, square root).
Instead, use a critical section (either named or default).
Your code is not clear. Assuming that membershipDirectory[5] is actually membershipDirectory[i], the atomic directive is not needed: for two processors, for example, OpenMP produces two threads, one handling the i = 0-49 interval and the other 50-99, so there is no need to protect membershipDirectory[i]. The atomic directive is required to protect some common resource which does not depend on the loop index, for example a total sum.