Consider the following scatter operation:
var A : [DomA] EltType;
var Indices : [DomA] IndexType;
var B : [DomB] EltType;
[(iSrc, iDst) in zip(DomA, Indices)] B[iDst] = A[iSrc];
where the domains are distributed. What is the best way to do this? In particular, is there an easy way to aggregate messages to avoid sending many small messages (assuming that sizeOf(EltType) is small)?
The Chapel team is actively working on aggregation and it is extensively used in Arkouda, but there is currently no built-in support for aggregation. See https://github.com/chapel-lang/chapel/issues/16963 for more information about the current efforts.
If you want to try the current aggregators you can copy AggregationPrimitives.chpl and CopyAggregation.chpl from https://github.com/chapel-lang/chapel/tree/993f9bd/test/studies/bale/aggregation.
Your main loop would then look something like:
forall (iSrc, iDst) in zip(DomA, Indices) with (var agg = new DstAggregator(EltType)) {
  agg.copy(B[iDst], A[iSrc]);
}
Or maybe a little cleaner as:
forall (iDst, a) in zip(Indices, A) with (var agg = new DstAggregator(EltType)) {
  agg.copy(B[iDst], a);
}
These aggregators should provide significant performance speedups over your unaggregated loop: 2-3x on Cray Aries networks, ~1000x on InfiniBand networks, and ~5000x or more on commodity Ethernet networks.
Longer term, aggregators will be part of the standard library and in many cases, including your example, the compiler should be able to tell that aggregation is safe/legal and automatically use it.
Related
I'm trying to come up with a way to define a flow graph (think TBB) defined at runtime. Currently, we use TBB to define the nodes and the edges between the nodes at compile time. This is sort of annoying because we have people who want to add processing steps and modify the processing chain without recompiling the whole application or really having to know anything about the application beyond how to add processing kernels. In an ideal world I would have some sort of plugin framework using dlls. We already have the software architected so that each node in TBB represents a processing step so it's pretty easy to add stuff if you're willing to recompile.
As a first step, I was trying to come up with a way to define a TBB flow graph in YAML but it was a massive rabbit hole. Does anyone know if something like this exists before I go all in on implementing this from scratch? It will be a fun project but no point in duplicating work.
I am not sure if anything like this exists in a TBB companion library, but it is definitely doable to implement a small subset of the Flow Graph functionality that is configurable at runtime.
If the data that transit through your graph have a well-defined type, i.e. your nodes are basically function_node<T, T>, things are manageable. If the graph transforms data from one type to another it gets more complicated; one solution would be to use a variant of these types and handle the possibly incompatible types at runtime. That really depends on the level of flexibility required.
With:
$ cat nodes.txt
# type concurrency params...
multiply unlimited 2
affine serial 5 -3
and
$ cat edges.txt
# src dst
0 1
1 2
where index 0 is a source node, here is a scaffold of how I would implement it:
using namespace oneapi::tbb::flow;

using data_t = double;
using node_t = function_node<data_t, data_t>;

graph g;
std::vector<node_t> nodes;

auto node_factory = [&g](std::string type, std::string concurrency, std::string params) -> node_t {
  // Implement a dynamic factory of processing nodes
};

// Add the source node first; it has a different type than the processing
// nodes, so keep it in its own variable rather than in the vector
input_node<data_t> source(g,
  [&](oneapi::tbb::flow_control& fc) -> data_t { /*...*/ });

// Parse the node description file and populate the node vector using the factory
for (auto&& n : node_descriptions)   // node_descriptions: parsed rows of nodes.txt
  nodes.push_back(node_factory(n.type, n.concurrency, n.params));

// Parse the edge description file and call make_edge accordingly;
// index 0 is the source node, index i >= 1 maps to nodes[i - 1]
for (auto&& e : edges) {             // edges: parsed rows of edges.txt
  if (e.src == 0)
    make_edge(source, nodes[e.dst - 1]);
  else
    make_edge(nodes[e.src - 1], nodes[e.dst - 1]);
}

// Run the graph
source.activate();
g.wait_for_all();
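As a rough sketch of what such a factory could do for the two node types in nodes.txt (the free function make_node, the meaning of the parameters, and the multiply/affine bodies are assumptions for illustration, not an existing API):

#include <oneapi/tbb/flow_graph.h>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

using data_t = double;
using node_t = oneapi::tbb::flow::function_node<data_t, data_t>;

// Hypothetical factory body: maps one row of nodes.txt to a function_node.
node_t make_node(oneapi::tbb::flow::graph& g, const std::string& type,
                 const std::string& concurrency, const std::string& params) {
    const std::size_t limit = (concurrency == "serial")
        ? oneapi::tbb::flow::serial
        : oneapi::tbb::flow::unlimited;
    std::istringstream in(params);

    if (type == "multiply") {          // "multiply unlimited 2" -> x * 2
        data_t factor; in >> factor;
        return node_t(g, limit, [factor](data_t x) { return x * factor; });
    }
    if (type == "affine") {            // "affine serial 5 -3" -> 5 * x + (-3)
        data_t a, b; in >> a >> b;
        return node_t(g, limit, [a, b](data_t x) { return a * x + b; });
    }
    throw std::invalid_argument("unknown node type: " + type);
}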
I'm working on a real-time machine control system which performs a series of tasks and should react to a large number of inputs. I've decided to implement this system using a state machine.
I've used simple switch/case-based state machines in the past and would like to transition to a more maintainable solution. At the moment I'm a little confused as to how to handle input and transitions.
For example, I have an AnalogInput class which provides me with measurement values that I should monitor. Say I have a state WaitForThreshold, which should read an AnalogInput and then transition if the threshold is reached.
Do I
a) pass a reference to the AnalogInput class to the WaitForThreshold class and allow it to monitor the input itself, signaling to the StateMachine class that it wishes to transition,
b) create dedicated events, e.g. LaserMeasurementAtThreshold, and a state transition map: StateMachine.addTransition(State A, Event e, State B), or
c) create more generic events such as AnalogInputChanged and implement event handlers for each of the events, which again signal to the StateMachine that a transition is desired?
Option a is essentially the larger version of a simple switch/case state machine, which could get messy with time but offers great flexibility; b/c seem more structured and clean, but it looks like I may have to jump through a lot of hoops to implement relatively simple tasks because the number of events can be very large.
Can someone offer some insight on the design of state machines where a large number of input sources and types must be monitored, and events are largely state-specific (most events pertain only to a single state)?
Are there other alternatives to state machine design for controlling a system where a sequence of steps must be implemented (non-linear; looping and branching must be possible)?
Language: C++
Thanks
I believe this would be clearer to implement as a table of transitions:
typedef void (*Pointer_To_Transition_Function)();
struct Table_Entry
{
Input_Type input_value;
Pointer_To_Transition_Function p_trans_function;
};
static const Table_Entry Transition_Table[] =
{
{4, Read_Sensor},
{2, Start_Motor},
};
static const size_t transition_quantity =
sizeof(Transition_Table) / sizeof(Transition_Table[0]);
//...
for (size_t index = 0; index < transition_quantity; ++index)
{
if (input_value == Transition_Table[index].input_value)
{
Pointer_To_Transition_Function p_function = Transition_Table[index].p_trans_function;
// Call the function:
p_function();
break;
}
}
You could use std::map, but a std::map has to be initialized at run time. The table (array) is static and constant, so it can be placed into a read-only memory segment (convenient for embedded systems) and doesn't use dynamic memory allocation.
Edit 1: ASCII drawing of the table
+-------------+--------------------------------+
| Input value | Pointer to transition function |
+-------------+--------------------------------+
| 4 | Read sensor |
+-------------+--------------------------------+
| 2 | Start motor |
+-------------+--------------------------------+
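For contrast, the std::map alternative mentioned above might look like the following sketch (Input_Type and the stub function bodies are assumptions); note that the map is populated at run time and allocates its nodes dynamically:

#include <map>

typedef int Input_Type;                 // assumption: some integral input type
typedef void (*Pointer_To_Transition_Function)();

void Read_Sensor() { /* read the analog input */ }
void Start_Motor() { /* start the motor */ }

// Constructed at run time, entries live on the heap
static const std::map<Input_Type, Pointer_To_Transition_Function> transition_map =
{
    {4, &Read_Sensor},
    {2, &Start_Motor},
};

void dispatch(Input_Type input_value)
{
    auto it = transition_map.find(input_value);
    if (it != transition_map.end())
        it->second();                   // call the transition function
}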
I have a class (let's call it checker) and different kinds of classes that execute tasks (let's call them tasks). Each task belongs to several categories.
Each task runs and at some point asks the checker if it is allowed to do something. The checker answers according to the system state and according to the task's categories; a task can be in multiple categories.
How would you implement that? (C++, but I don't really think it's language-specific.)
I was thinking of adding a list of categories to each task and having a function that takes a category and answers whether the task belongs to it.
class Task;   // forward declaration

class checker {
public:
    static bool is_allowed(Task* task);
};

class Task
{
public:
    bool is_belonging_to_category(Category cat);
    void some_task_to_do()
    {
        ...
        if (checker::is_allowed(this)) { .... }
        else { .... }
    }
};
Is there a better way to solve this? Maybe some known design pattern...
This looks like questionable design. You're making tasks the objects.
Let's say your tasks are: Eat, Drink, and Be_Merry
If you make each of those tasks objects, they'll have to maintain a reference to the actual individual that they operate on, then when the condition is met they'll need to modify state on the given individual.
This is a violation of Object Oriented Design which defines an object as:
A tight coupling or association of data structures with the methods or functions that act on the data
Notice that you have split the "methods or functions that act on the data" from the object. Instead you should have modeled the objects Jack and Jill which had methods: Eat, Drink, and BeMerry
As for checker, whether it's parceled out will depend on whether you're using push or pull coding. If you're doing push coding, then checker is simply a holding area for the behavioral properties of Jack and Jill; in that case the properties should be pushed to Jack and Jill rather than held in checker. If they are properties for all Jack or Jill objects, use a static property. If, however, you are using pull coding, then the information is unavailable until you attempt to execute the task. In this case the checker should probably be a singleton that Jack and Jill access in the process of performing their tasks.
EDIT:
Your comment reveals further tragedy in the design. It seems as though you've kicked off a bunch of threads which are busy-waiting on checker. This indicates that you need to be using pull coding. Your Jack and Jill objects need to maintain booleans for the tasks they are actively involved in, for example m_is_going_to_school; then, when checker gets the condition that would stop the busy waiting in your design, it instead kicks off the goToSchool method.
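A minimal sketch of the pull-style singleton checker suggested above (the class name, the int bitmask argument, and my_categories are only assumptions for illustration):

class Checker {
public:
    static Checker& instance() {
        static Checker c;          // created on first use, thread-safe since C++11
        return c;
    }
    bool is_allowed(int categories) const {
        // consult the current system state and the task's categories here
        return true;
    }
private:
    Checker() = default;
};

// A task (or a Jack/Jill method) pulls the decision at the moment it runs:
// if (Checker::instance().is_allowed(my_categories)) { goToSchool(); }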
You could make a vector to store all the allowed options. You can write a bool function (like you have) called IsAllowed that takes a string argument and checks whether the option the task is about to do is allowed; if not, return false. That's just my idea, though; of course there are a zillion different ways to implement this. If you want multiple choices, you can make a 2D vector and see whether the corresponding row contains any of the options. Good luck!
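A minimal sketch of that idea (the member names and the allow() helper are assumptions):

#include <algorithm>
#include <string>
#include <vector>

class checker {
    std::vector<std::string> allowed_options;   // every option currently allowed
public:
    void allow(const std::string& option) { allowed_options.push_back(option); }

    bool IsAllowed(const std::string& option) const {
        return std::find(allowed_options.begin(), allowed_options.end(), option)
               != allowed_options.end();
    }
};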
If you know the maximum number of categories in advance, I'd recommend using Bit Flags to do this.
enum Category {
CATEGORY_A = 1,
CATEGORY_B = 1 << 1,
CATEGORY_C = 1 << 2,
CATEGORY_D = 1 << 3,
};
class Task {
int32_t categories_;
public:
Task() : categories_(0) {}
void add_category(Category cat) {
categories_ |= cat;
}
void run() {
  // can_run is declared in namespace checker below
  if (checker::can_run(categories_)) { /* ... */ }
}
};
This allows testing for multiple categories all at once:
namespace checker {
bool can_run(int32_t categories) {
int32_t cannot_run_right_now = CATEGORY_A | CATEGORY_C;
if((categories & cannot_run_right_now) != 0) {
return false;
}
...
}
}
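As a usage example (hypothetical, and assuming the can_run() body above is completed and declared before Task):

Task t;
t.add_category(CATEGORY_B);
t.add_category(CATEGORY_D);
// categories_ is now CATEGORY_B | CATEGORY_D; with CATEGORY_A | CATEGORY_C
// blocked as above, (categories & cannot_run_right_now) == 0, so the check
// in can_run() does not reject this task.
t.run();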
Well, it depends. If you are sure you know how many categories there will be and that it is not some gigantic number, then you might store this information as an integer: if the n-th bit is 1, the task belongs to the n-th category. Then, depending on the state of the system, you might create another integer that serves as a mask. In the end you would just do a bitwise AND, ( (mask & categories) != 0 ), to determine whether the task and the mask share a common bit.
On the other hand, if the number of categories is unknown, you might just keep a list of the categories each task belongs to. Make a dictionary of [SYSTEM_STATE] => [CATEGORIES_AVAILABLE] and check:
bool is_allowed(Task* task) {
    for (Category sysC : stateCategories[sys.GetState()]) {
        for (Category taskC : task->GetCategories()) {
            if (sysC == taskC) return true;
        }
    }
    return false;
}
That would of course be slow for a big number of categories.
You could improve this method by keeping the categories in another data structure in which lookup is not O(n), so that the code would look like this:
bool is_allowed(Task* task) {
    for (Category sysC : stateCategories[sys.GetState()]) {
        if (task->GetCategories().count(sysC)) {   // e.g. a std::set<Category>
            return true;
        }
    }
    return false;
}
It depends.
Suppose X is a raw, labeled (i.e., with training labels) data set, and Process(X) returns a set of Y instances that have been encoded with attributes and converted into a Weka-friendly file like Y.arff.
Also suppose Process() has some 'leakage': some instances Leak = X - Y can't be encoded consistently and need to get a default classification FOO. The training labels are also known for the Leak set.
My question is how I can best introduce instances from Leak into the Weka evaluation stream AFTER some classifier has been applied to the subset Y, folding the Leak instances in with their default classification label, before performing evaluation across the full set X? In code:
DataSource LeakSrc = new DataSource("leak.arff");
Instances Leak = LeakSrc.getDataSet();
DataSource Ysrc = new DataSource("Y.arff");
Instances Y = Ysrc.getDataSet();
classfr.buildClassifier(Y);
// YunionLeak = ??
eval.crossValidateModel(classfr, YunionLeak);
Maybe this is a specific example of folding together results from multiple classifiers?
The bounty is closing, but Mark Hall, in another forum (http://list.waikato.ac.nz/pipermail/wekalist/2015-November/065348.html), deserves what will have to count as the current answer:
You'll need to implement building the classifier for the cross-validation in your code. You can still use an Evaluation object to compute stats for your modified test folds though, because the stats it computes are all additive. Instances.trainCV() and Instances.testCV() can be used to create the folds:
http://weka.sourceforge.net/doc.stable/weka/core/Instances.html#trainCV(int,%20int,%20java.util.Random)
You can then call buildClassifier() to process each training fold, modify the test fold to your heart's content, and then iterate over the instances in the test fold while making use of either Evaluation.evaluateModelOnce() or Evaluation.evaluateModelOnceAndRecordPrediction(). The latter version is useful if you need the area-under-the-curve summary metrics (as these require predictions to be retained).
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnce(weka.classifiers.Classifier,%20weka.core.Instance)
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnceAndRecordPrediction(weka.classifiers.Classifier,%20weka.core.Instance)
Depending on your classifier, it could be very easy! Weka has an interface called UpdateableClassifier; any class implementing this interface can be updated after it has been built! The following classes implement this interface:
HoeffdingTree
IBk
KStar
LWL
MultiClassClassifierUpdateable
NaiveBayesMultinomialText
NaiveBayesMultinomialUpdateable
NaiveBayesUpdateable
SGD
SGDText
It can then be updated with something like the following:
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/data/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);
// Build the classifier from the dataset header only; instances are streamed in below
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null) {
nb.updateClassifier(current);
}
I am trying to perform some numerical computation on a large distributed data set. The algorithms fit the MapReduce model well with the additional property that output from the map step is small in size compared to the input data. Data can be considered read-only and is statically distributed over the nodes (except for re-balancing on fail-over). Note that this is somewhat contrary to the standard word-count examples where the input data is sent to the nodes performing the map step.
This implies that the map step shall be executed in parallel on all nodes, processing each node's local data, while it is acceptable that the output from the map step is sent to one node for the reduce step.
What is the best way to implement this with GridGain?
It seems there was a reduce(..) method on the GridCache/GridCacheProjection interfaces in earlier versions of GridGain, but it is no longer present. Is there any replacement? I am thinking of a mechanism that takes a map closure and executes it, distributed, on each datum exactly once while avoiding copying any input data across the network.
The (somewhat manual) approach I have come up with so far is the following:
public class GridBroadcastCountDemo {
public static void main(String[] args) throws GridException {
try (Grid grid = GridGain.start(CONFIG_FILE)) {
GridFuture<Collection<Integer>> future = grid.forRemotes().compute().broadcast(new GridCallable<Integer>() {
@Override
public Integer call() throws Exception {
GridCache<Integer, float[]> cache = grid.cache(CACHE_NAME);
int count = 0;
for (float[] array : cache.primaryValues()) {
count += array.length;
}
return count;
}
});
int totalCount = 0;
for (int count : future.get()) {
totalCount += count;
}
// expect size of input data
System.out.println(totalCount);
}
}
}
There is however no guarantee that each datum is processed exactly once with this approach. E.g. when re-balancing takes place while the GridCallables are executed, part of the data could be processed zero or multiple times.
GridGain Open Source (which is now Apache Ignite) has the ComputeTask API, which has both map() and reduce() methods. If you are looking for a reduce() method, then ComputeTask is definitely the right API for you.
For now your implementation is OK. Apache Ignite is adding a feature where a node will not be considered primary until the migration is fully finished. It should be coming soon.