Remote variable declarations - chapel

How do remote variable declarations work? I've tried augmenting an ordinary variable declaration with the on clause as described in section 26.2.1 of the Chapel language specification, but it doesn't seem to work. For example, this line of code:
on Locales[1] var x: [0..10] real;
fails to compile, with the error syntax error: near 'var'.

In short, the syntax is specified but it isn't currently implemented. Unfortunately the language spec doesn't currently point out that it's a future feature.
Thanks for pointing out the issue. This one is arguably better as a GitHub issue against the Chapel project, so I've created an issue to track the problem.
The typical workaround is to choose one of:
Use nested on statements to achieve the desired effect
Allocate a class instance in an on statement
Use a distributed array
Here I will describe each.
First, we need a slightly longer example. Suppose you were trying to write:
on Locales[1] var A: [0..10] real; // declare array stored on Locales[1]
A = 1; // on Locale 0, set every element of A to 1
writeln(A); // on Locale 0, print out the array
// print out the locale storing each element
for x in A {
write(x.locale.id, " ");
}
writeln();
An equivalent way to write that using nested on statements is this:
on Locales[1] {
  var A: [0..10] real;
  on Locales[0] {
    A = 1;
    writeln(A);
    for x in A {
      write(x.locale.id, " ");
    }
    writeln();
  }
}
// result, when run on 2 locales:
// from printing array elements:
// 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
// from printing element locales:
// 1 1 1 1 1 1 1 1 1 1 1
Note that in the example, we know the assignment is to happen on Locale 0. If we didn't know which locale we were running on, we could save it into a variable before the first on (e.g. var fromLocale = here;) and use that variable in the second on.
In some cases, it might be more convenient to use the on statement to specify where a variable is initialized without changing where it is declared. Right now, this can be done with class instances. Note that these are not garbage collected - you'll need to either use Owned/Shared or ensure delete is called.
In the spirit of a simpler answer to the question, I'll show a version that calls delete.
class MyArrayWrapper {
var A: [0..10] real;
}
var myObject: MyArrayWrapper; // starts out as nil
on Locales[1] {
// set myObject to a new instance
// since we do that on Locales[1], it is allocated there
// and the contained array is stored there too.
myObject = new MyArrayWrapper();
}
myObject.A = 1;
writeln(myObject.A);
for x in myObject.A {
write(x.locale.id, " ");
}
writeln();
delete myObject;
// result, when run on 2 locales:
// from printing array elements:
// 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
// from printing element locales:
// 1 1 1 1 1 1 1 1 1 1 1
Now, how could we achieve the same using a distributed array?
First, consider this example, which distributes the 11 elements among whatever Locales the Chapel program is run with.
use BlockDist;
const MyDom = {0..10}; // this domain represents the index set 0..10
// declare a Block-distributed index set 0..10
// by default, this is distributed over all available Locales
const MyBlockDistributedDomain = MyDom dmapped Block(boundingBox=MyDom);
// declare a Block-distributed array
var BlockDistributedA: [MyBlockDistributedDomain] real;
BlockDistributedA = 1;
writeln(BlockDistributedA);
for x in BlockDistributedA {
write(x.locale.id, " ");
}
writeln();
// result, when run on 2 locales:
// from printing array elements:
// 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
// from printing element locales:
// 0 0 0 0 0 0 1 1 1 1 1
That distributed the array over the available locales, but that behavior is just a default for the Block distribution. We can specify the locales to use with an argument to the Block constructor as the following example shows:
use BlockDist;
const MyDom = {0..10};
// This time, specify the target locales for the Block distribution to use.
// Here we pass in an anonymous array storing just Locales[1], so that
// the resulting array only stores elements on Locales[1].
const MyDistributedDomain = MyDom dmapped Block(boundingBox=MyDom,
targetLocales=[Locales[1]]);
var DistributedA: [MyDistributedDomain] real;
DistributedA = 1;
writeln(DistributedA);
for x in DistributedA {
write(x.locale.id, " ");
}
writeln();
// result, when run on 2 locales:
// from printing array elements:
// 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
// from printing element locales:
// 1 1 1 1 1 1 1 1 1 1 1

Related

Chapel: Can you re-index a domain in place?

A great man once said, I have a matrix A. But this time she has a friend B. Like the Montagues and Capulets, they have different domains.
// A.domain is { 1..10, 1..10 }
// B.domain is { 0.. 9, 0.. 9 }
for ij in B.domain {
if B[ij] <has a condition> {
// poops
A[ij] = B[ij];
}
}
My guess is I need to reindex so that the B.domain is {1..10, 1..10}. Since B is an input, I get push back from the compiler. Any suggestions?
There's a reindex array method to accomplish exactly this, and you can create a ref to the result to prevent creating a new array:
var Adom = {1..10,1..10},
Bdom = {0..9, 0..9};
var A: [Adom] real,
B: [Bdom] real;
// Set B to 1.0
B = 1;
// 0-based reference to A -- note that Bdom must be same shape as Adom
ref A0 = A.reindex(Bdom);
// Set all of A's values to B's values
for ij in B.domain {
A0[ij] = B[ij];
}
// Confirm A is now 1.0
writeln(A);
The Chapel compiler must object; the documentation is explicit and clear on this:
Note that querying an array's domain via the .domain method or the function argument query syntax does not result in a domain expression that can be reassigned. In particular, we cannot do:
VarArr.domain = {1..2*n};
If <has_a_condition> is non-interfering and free of side effects, one solution is to translate the indices explicitly, similar to this pure, contiguous-domain index translation:
forall ij in B.domain do {
if <has_a_condition> {
A[ ij(1) + A.domain.dims()(1).low,
ij(2) + A.domain.dims()(2).low
] = B[ij];
}
}

Array of real to binary (0/1)

I have an array :
0.3 0.4 0.65 1.45
-1.2 6.0 -3.49 3.9
And I would like to have 0 if value is negative and 1 if positive :
1 1 1 1
0 1 0 1
Is there a way to do this without a loop like:
DO X=1,Xmax
Do Y=1,Ymax
IF(Array(X,Y)>0)THEN
Array(X,Y)=1
END IF
END DO
END DO
I'm a fan of the where approach as given by Vladimir F, but I can also suggest a related one.
merge is an intrinsic elemental function which takes two sources and a mask:
array = MERGE(0., 1., array.lt.0.)
As a slight correction to Vladimir F's sign:
array = SIGN(0.5, array) + 0.5
Note the switching of order compared with the other answer.
With the elemental nature of merge and sign it is possible to mix scalar desired values with the array and array mask.
As both of these can naturally be modified to assign the value to another variable (even creating an integer one), I'll show an alternative where for completeness:
where (array.lt.0.)
another_array=0
elsewhere
another_array=1
end where
for another_array appropriately shaped.
I'm having way too much fun with this. This one does not require that the numbers fit into integers:
ARRAY = 0.5 * ARRAY / ABS(ARRAY) + 0.5
The most straightforward:
where (array>=0)
array = 1
else where
array = 0
end where
It is not very handy that the sign function needs another array for the magnitudes, because
array = sign(array, halfs) + 0.5
requires an array with 0.5's of the same shape as array.
Actually it should be array = sign(0.5, array) + 0.5 as shown by francescalus. I even looked into the manual and then switched the arguments anyway...
It's ugly, but if you want a one-liner:
ARRAY = CEILING( ARRAY / CEILING(ABS(ARRAY)) )
Vladimir wants FAST!
REAL(KIND=8) :: ARRAY(4,2) = RESHAPE ( &
(/ 0.3, 0.4, 0.65, 1.45, -1.2, 6.0, -3.49, 3.9 /), (/4,2/) )
INTEGER(KIND=8) :: IARRAY(4,2)
EQUIVALENCE (ARRAY, IARRAY)
ARRAY = 1 - IBITS( IARRAY,63,1 )
:D

Can one generate a grid of the Locales where a Distribution is mapped?

If I run the following code:
use BlockDist;
config const dimension: int = 5;
const space = {0..#dimension, 0..#dimension};
const matrixBlock: domain(2) dmapped Block(boundingBox=space) = space;
var A : [matrixBlock] int;
[a in A] a = a.locale.id;
writeln(A);
on 4 Locales, I get:
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
2 2 2 3 3
2 2 2 3 3
Is there a A.<function> which returns the matrix (below)?
0 1
2 3
Or, is this something I should implement?
The expression A.targetLocales() gives you almost what you asked for, and maybe something you'll find even more useful: Rather than the array of ints you requested, it gives you an array of the target locales themselves. Thus, writeln(A.targetLocales()) prints the 2x2 locale array:
LOCALE0 LOCALE1
LOCALE2 LOCALE3
This routine, and others related to locality queries on arrays, can be found in the domain and array operations section of the online documentation, under the array type.
The expression A.targetLocales().id ought to give you what you want, but due to a longstanding unimplemented feature it does not (at least, as of version 1.15 of Chapel). In short, this asks each locale for its ID and should result in an array of ints with the same size and shape as the target locales array; yet because promotion doesn't preserve shape as intended, the shape is lost unless you preserve it yourself. For example, writeln(A.targetLocales().id) results in:
0 1 2 3
rather than:
0 1
2 3
However, you can assign such promoted expressions into an array of the desired shape. So one way to get your desired array of ints today would be to write:
// declare an array whose domain is the same as that of A's target locales
// and initialize the array using the IDs of A's targetLocales array:
var IDs: [A.targetLocales().domain] int = A.targetLocales().id;
Finally, note that you can pass your own array of locales into the Block() distribution's constructor if you wish to specify a specific target locale set, rather than using the default set of target locales that it sets up for you. For example, adding the two following lines:
const locGridSpace = {0..#numLocales, 0..0};
const locGrid: [locGridSpace] locale = [(i,j) in locGridSpace] Locales[i];
will create a numLocales x 1 array of locales which can then be passed into your call to Block() as follows:
const matrixBlock: domain(2) dmapped Block(boundingBox=space,
targetLocales=locGrid) = space;
Alternatively, you could arrange some or all of the locales into some other shape or ordering. The main restriction is that the rank of the targetLocales array matches that of the domains to which the distribution is applied. (So targetLocales must be 2D when distributing 2D domains and arrays and 3D for 3D domains and arrays).
Arrays, domains, and distributions all have a targetLocales() method that returns the array of locales over which the array/domain/distribution is distributed. See: Domain and Array Operations documentation.
The following calls:
writeln(A.targetLocales());
writeln(A.domain.targetLocales());
writeln(A.domain.dist.targetLocales());
Will all print:
LOCALE0 LOCALE1
LOCALE2 LOCALE3
To extract the integral ids, you can then use the .id accessor:
var targetLocs = A.targetLocales();
var targetLocIDs: [targetLocs.domain] int = targetLocs.id;
writeln(targetLocIDs);
Prints:
0 1
2 3

C++ Intel TBB and Microsoft PPL, how to use next_permutation in a parallel loop?

I have Visual Studio 2012 with Intel parallel studio 2013 installed, so I have Intel TBB.
Say I have the following piece of code:
const int cardsCount = 12; // will be READ by all threads
// the required number of cards of each colour to complete its set:
// NOTE that the required number of cards of each colour is not the same as the total number of cards of this colour available
int required[] = {2,3,4}; // will be READ by all threads
Card cards[cardsCount]; // will be READ by all threads
int cardsIndices[cardsCount];// this will be permuted, permutations need to be split among threads !
// set "cards" to 4 cards of each colour (3 colours total = 12 cards)
// set cardsIndices to {0,1,2,3...,11}
// this variable will be written to by all threads, maybe have one for each thread and combine them later?? or can I use concurrent_vector<int> instead !?
int logColours[] = {0,0,0};
int permutationsCount = fact(cardsCount);
for (int pNum=0; pNum<permutationsCount; pNum++) // I want to make this loop parallel !!
{
    int countColours[3] = {0,0,0}; // local loop variable, no problem with multithreading
    for (int i=0; i<cardsCount; i++)
    {
        Card c = cards[cardsIndices[i]]; // accessed "cards"
        countColours[c.Colour]++; // local loop variable, np.
        // we got the required number of cards of this colour to complete it
        if (countColours[c.Colour] == required[c.Colour]) // read global variable "required" !
        {
            // log that we completed this colour and go to next permutation
            logColours[c.Colour] ++; // should I use a concurrent_vector<int> for this shared variable?
            break;
        }
    }
    std::next_permutation(cardsIndices, cardsIndices+cardsCount); // !! this is my main issue
}
What I'm calculating is how many times we will complete a colour if we pick randomly from available cards, and that's done exhaustively by going through each permutation possible and picking sequentially, when a colour is "complete" we break and go to the next permutation. Note that we have 4 cards of each colour but the required number of cards to complete each colour is {2,3,4} for Red, Green, Blue. 2 red cards are enough to complete red and we have 4 available, hence red is more likely to be completed than blue which requires all 4 cards to be picked.
I want to make this for-loop parallel, but my main problem is how to deal with the "cards" permutations. There are ~0.5 billion permutations here (12!); if I have 4 threads, how can I split them into 4 quarters and let each thread go through one of them?
What if I don't know the number of cores the machine has and I want the program to automatically choose the right number of concurrent threads? surely there must be a way to do that using Intel or Microsoft tools?
This is my Card struct just in case:
struct Card
{
public:
    int Colour;
    int Symbol;
};
Let N = cardsCount, M = required[0] * required[1] * ... * required[maxColor].
Then, actually, your problem could be easily solved in O(N * M) time. In your case, that is 12 * 2 * 3 * 4 = 288 operations. :)
One possible way to do this is to use a recurrence relation.
Consider a function logColours f(n, required). Let n be the number of cards already considered; required is the vector from your example. The function returns the answer in a vector logColours.
You are interested in f(12, {2,3,4}). A brief recursive calculation inside f could be written like this:
std::vector<int> f(int n, std::vector<int> require) {
    if (cache[n].count(require)) {
        // we have already calculated the function with the same arguments, do not recalculate it
        return cache[n][require];
    }
    std::vector<int> logColours(maxColor, 0); // maxColor = 3 in your example
    for (int putColor=0; putColor<maxColor; ++putColor) {
        if (/* there is still at least one card with color 'putColor'*/) {
            // put a card of color 'putColor' on place 'n'
            if (require[putColor] == 1) {
                // means we've reached the needed amount of cards of color 'putColor'
                ++logColours[putColor];
            } else {
                --require[putColor];
                std::vector<int> logColoursRec = f(n+1, require);
                ++require[putColor];
                // merge child array into your own.
                for (int i=0; i<maxColor; ++i)
                    logColours[i] += logColoursRec[i];
            }
        }
    }
    // store logColours in the cache entry for these function arguments
    // (the parameter is named 'require', not 'required')
    cache[n][require] = std::move(logColours);
    return cache[n][require];
}
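The cache could be implemented as a std::unordered_map<int, std::map<std::vector<int>, std::vector<int>>>. Note that std::vector<int> has no std::hash specialization, so it can only be the key of a std::unordered_map if you supply your own hasher; an ordered std::map works out of the box.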
Once you understand the main idea, you'll be able to implement it in even more efficient code.
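To make the recurrence concrete, here is a minimal sketch of one way it could be written so that it compiles and counts permutations of distinct cards. It is not the answerer's exact code: the names countCompletions, Needed, Counts, kColours and kCardsPerColour are made up for illustration, it assumes 4 cards of every colour as in the question, and it weights each step by the number of cards of that colour still available so that the three totals add up to 12!.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>

constexpr int kColours = 3;        // red, green, blue (assumed, as in the question)
constexpr int kCardsPerColour = 4; // assumed: 4 cards of every colour
using Needed = std::array<int, kColours>;           // cards still needed per colour
using Counts = std::array<std::int64_t, kColours>;  // permutations won per colour

std::int64_t factorial(int n) {
    std::int64_t result = 1;
    for (int i = 2; i <= n; ++i) result *= i;
    return result;
}

// For the state "stillNeeded", returns how many orderings of the cards not yet
// placed let each colour be the first one to reach its required count.
Counts countCompletions(const Needed& required, const Needed& stillNeeded,
                        std::map<Needed, Counts>& cache) {
    const auto hit = cache.find(stillNeeded);
    if (hit != cache.end()) return hit->second; // state already solved

    int placed = 0; // cards placed so far follows from the state itself
    for (int c = 0; c < kColours; ++c) placed += required[c] - stillNeeded[c];
    const int total = kColours * kCardsPerColour;

    Counts result{};
    for (int c = 0; c < kColours; ++c) {
        const int available = kCardsPerColour - (required[c] - stillNeeded[c]);
        if (available <= 0) continue; // no card of this colour left to place
        if (stillNeeded[c] == 1) {
            // This card completes colour c; the rest may follow in any order.
            result[c] += available * factorial(total - placed - 1);
        } else {
            Needed next = stillNeeded;
            --next[c];
            const Counts sub = countCompletions(required, next, cache);
            for (int i = 0; i < kColours; ++i) result[i] += available * sub[i];
        }
    }
    cache[stillNeeded] = result;
    return result;
}

Counts countCompletions(const Needed& required) {
    for (int needed : required) assert(needed >= 1); // each colour needs at least one card
    std::map<Needed, Counts> cache; // std::array has operator<, so std::map suffices
    return countCompletions(required, required, cache);
}
```

Calling countCompletions({2, 3, 4}) visits at most 2 * 3 * 4 = 24 distinct states, which is why this is so much cheaper than enumerating all 12! permutations.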
You can easily make your code run in parallel with 1, 2, ..., or cardsCount threads by fixing the first element of the permutation and calling std::next_permutation on the other elements independently in each thread.
Consider the following code:
// declarations
// #pragma omp parallel may be here
{ // start of a parallel section
    const int start = (cardsCount * threadIndex) / threadNumber;
    const int end = (cardsCount * (threadIndex + 1)) / threadNumber;
    int cardsIndices[cardsCount]; // a local array for each thread
    for (int firstElement = start; firstElement < end; ++firstElement) { // not const: it is incremented
        cardsIndices[0] = firstElement;
        // fill the rest of cardsIndices with the elements of [0, cardsCount)
        // in ascending order, skipping firstElement
        do {
            // your calculations go here
        } while (std::next_permutation(cardsIndices + 1, cardsIndices + cardsCount)); // note the +1 here
    }
}
If you wish to use OpenMP as a parallelization tool, you only have to
add #pragma omp parallel just before the parallel section. And use
omp_get_thread_num() function to get a thread index.
You also do not have to use a concurrent_vector here; it would
probably make your program extremely slow. Use a thread-specific
accumulation array instead:
std::vector<std::array<int, 3>> logColours(threadNumber); // one zero-initialized row per thread
++logColours[threadIndex][c.Colour];
If Card is a rather heavy class, I would suggest using const Card& c = ... instead of copying each time Card c = ....
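As a rough sketch of the same idea without OpenMP, the version below uses std::thread, gives every thread its own range of first elements, and keeps a private tally per thread that is merged at the end. The names tallyParallel and Tally are invented for this example, and the card-counting rule is copied from the question's loop; treat it as an illustration under those assumptions rather than a tuned implementation.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Card { int Colour; int Symbol; };

using Tally = std::array<long long, 3>; // completions per colour

// Counts, over all permutations of `cards`, which colour reaches its
// required count first. Thread t handles the permutations whose first
// element lies in its own slice of [0, n).
Tally tallyParallel(const std::vector<Card>& cards,
                    const std::array<int, 3>& required,
                    std::size_t threadCount) {
    assert(threadCount > 0);
    const std::size_t n = cards.size();
    std::vector<Tally> perThread(threadCount, Tally{});
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t start = n * t / threadCount;
            const std::size_t end = n * (t + 1) / threadCount;
            Tally local{}; // private accumulator, no sharing while counting
            for (std::size_t first = start; first < end; ++first) {
                std::vector<std::size_t> idx{first};
                for (std::size_t i = 0; i < n; ++i)
                    if (i != first) idx.push_back(i);
                // The tail starts sorted, so next_permutation visits all (n-1)! orders.
                do {
                    std::array<int, 3> seen{};
                    for (std::size_t i : idx) {
                        const Card& c = cards[i];
                        if (++seen[c.Colour] == required[c.Colour]) {
                            ++local[c.Colour];
                            break;
                        }
                    }
                } while (std::next_permutation(idx.begin() + 1, idx.end()));
            }
            perThread[t] = local; // each thread writes only its own slot
        });
    }
    for (std::thread& w : workers) w.join();

    Tally total{};
    for (const Tally& part : perThread)
        for (std::size_t c = 0; c < total.size(); ++c) total[c] += part[c];
    return total;
}
```

Because the slices are cut by first element, at most cards.size() threads get any work; extra threads simply receive an empty range.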
I guess this is an amateur-friendly version of what @Ixanezis means
If red wins
the final outcome will be: 2 red, 0-2 green, 0-3 blue
Say the winning red is A, and the other red is B, there are 12 ways to get A and B.
The following are the possible cases:
Cases             #Cards after A   #Cards before A   #pick green   #pick blue
0 green, 0 blue:  10! = 3628800    1! = 1            1             1
0 green, 1 blue:   9! = 362880     2! = 2            1             4
0 green, 2 blue:   8! = 40320      3! = 6            1             6
0 green, 3 blue:   7! = 5040       4! = 24           1             4
1 green, 0 blue:   9! = 362880     2! = 2            4             1
1 green, 1 blue:   8! = 40320      3! = 6            4             4
1 green, 2 blue:   7! = 5040       4! = 24           4             6
1 green, 3 blue:   6! = 720        5! = 120          4             4
2 green, 0 blue:   8! = 40320      3! = 6            6             1
2 green, 1 blue:   7! = 5040       4! = 24           6             4
2 green, 2 blue:   6! = 720        5! = 120          6             6
2 green, 3 blue:   5! = 120        6! = 720          6             4
Let's take the sum-product of those 4 columns (multiply across each row, then add the rows): = 29064960, then multiply by 12 = 348779520
Similarly, you can calculate the cases where green wins and where blue wins.
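The table can be checked mechanically. The sketch below recomputes the red case from the same four factors per row; the function names redWins, factorial and choose are made up for this example, and the loop bounds (at most 2 green and 3 blue cards before the winning red) are taken from the table above.

```cpp
#include <cassert>
#include <cstdint>

std::int64_t factorial(int n) {
    std::int64_t result = 1;
    for (int i = 2; i <= n; ++i) result *= i;
    return result;
}

// Binomial coefficient "n choose k" for small arguments.
std::int64_t choose(int n, int k) {
    assert(0 <= k && k <= n);
    return factorial(n) / (factorial(k) * factorial(n - k));
}

// Number of permutations of the 12 cards in which red is completed first.
std::int64_t redWins() {
    std::int64_t sum = 0;
    for (int green = 0; green <= 2; ++green) {
        for (int blue = 0; blue <= 3; ++blue) {
            const std::int64_t after = factorial(10 - green - blue); // cards after A
            const std::int64_t before = factorial(1 + green + blue); // B plus the others before A
            sum += after * before * choose(4, green) * choose(4, blue);
        }
    }
    return 12 * sum; // 4 * 3 ordered ways to pick the reds A and B
}
```

redWins() returns 348779520, matching the figure above.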
You can use std::thread::hardware_concurrency() from <thread>. Quoting from "C++ Concurrency in Action" by A. Williams -
One feature of the C++ Standard Library that helps here is
std::thread::hardware_concurrency(). This function returns an
indication of the number of threads that can truly run concurrently
for a given execution of a program. On a multicore system it might be
the number of CPU cores, for example.
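One detail worth handling: std::thread::hardware_concurrency() is only a hint and returns 0 when the value cannot be determined. A small helper along these lines (chooseThreadCount and its fallback parameter are names invented for this sketch) picks a usable thread count either way.

```cpp
#include <cassert>
#include <thread>

// Returns the hinted number of hardware threads, or `fallback`
// when the implementation cannot tell (hint == 0).
unsigned chooseThreadCount(unsigned fallback = 2) {
    assert(fallback > 0);
    const unsigned hinted = std::thread::hardware_concurrency();
    return hinted != 0 ? hinted : fallback;
}
```

The result could then be passed as threadNumber to whichever scheme splits the permutations among threads.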

Unexpected result regarding comparing two pointer-related integer values in C++

I have a BST of three elements {1, 2, 3}. Its structure looks like
2
/ \
1 3
Now I try to calculate the height of each node using BSTHeight() defined below, and I have a problem with the height of '2', whose value is supposed to be 1 since the heights of '1' and '3' are defined as 0. My problem is that when I use the heights of the two children of '2' directly (see part 2 highlighted below), its height is ALWAYS 0. However, its value is correct if I use two temporary integer variables (see part 1 highlighted below). I couldn't see any difference between the two approaches in terms of functionality. Can anyone help explain why?
void BSTHeight(bst_node *p_node)
{
    if (!p_node)
        return;
    if (!p_node->p_lchild && !p_node->p_rchild) {
        p_node->height = 0;
    } else if (p_node->p_lchild && p_node->p_rchild) {
        BSTHeight(p_node->p_lchild);
        BSTHeight(p_node->p_rchild);
#if 0 // part 1
        int lchild_height = p_node->p_lchild->height;
        int rchild_height = p_node->p_rchild->height;
        p_node->height = 1 + ((lchild_height > rchild_height) ? lchild_height : rchild_height);
#else // part 2
        p_node->height = 1 + ((p_node->p_lchild->height) > (p_node->p_rchild->height)) ? (p_node->p_lchild->height) : (p_node->p_rchild->height);
#endif
    } else if (!p_node->p_lchild) {
        BSTHeight(p_node->p_rchild);
        p_node->height = 1 + p_node->p_rchild->height;
    } else {
        BSTHeight(p_node->p_lchild);
        p_node->height = 1 + p_node->p_lchild->height;
    }
}
The problem lies in operator precedence. Addition (and the comparison) bind more tightly than the ternary operator, so part 2 is parsed as ((1 + left) > right) ? left : right. You must therefore surround the ternary operator (?:) with brackets.
Below is the corrected version. Note that all the brackets you used were superfluous, so I've removed them and added the only pair that is needed:
1 + (p_node->p_lchild->height > p_node->p_rchild->height ?
p_node->p_lchild->height : p_node->p_rchild->height);
Even better would be to use std::max (from <algorithm>) instead:
1 + std::max(p_node->p_lchild->height, p_node->p_rchild->height)
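The difference is easy to reproduce with plain integers. In this small sketch the function names heightAsWritten and heightFixed are made up, and the two int parameters stand in for the children's height fields.

```cpp
#include <algorithm>

// Same shape as "part 2": the compiler reads it as
// ((1 + left) > right) ? left : right, so the "1 +" never reaches the result.
int heightAsWritten(int left, int right) {
    return 1 + (left) > (right) ? (left) : (right);
}

// Intended meaning: one more than the taller child.
int heightFixed(int left, int right) {
    return 1 + std::max(left, right);
}
```

With two leaf children, heightAsWritten(0, 0) evaluates (1 > 0) ? 0 : 0 and yields 0, while heightFixed(0, 0) yields 1.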