GridGain: MapReduce with node-local data processing?

I am trying to perform some numerical computation on a large distributed data set. The algorithms fit the MapReduce model well with the additional property that output from the map step is small in size compared to the input data. Data can be considered read-only and is statically distributed over the nodes (except for re-balancing on fail-over). Note that this is somewhat contrary to the standard word-count examples where the input data is sent to the nodes performing the map step.
This implies that the map step shall be executed in parallel on all nodes, processing each node's local data, while it is acceptable that the output from the map step is sent to one node for the reduce step.
What is the best way to implement this with GridGain?
It seems there was a reduce(..) method on the GridCache/GridCacheProjection interfaces in earlier versions of GridGain, but it is no longer present. Is there any replacement? I am thinking of a mechanism that takes a map closure and executes it in a distributed fashion on each datum exactly once, without copying any input data across the network.
The (somewhat manual) approach I have come up with so far is the following:
public class GridBroadcastCountDemo {
    public static void main(String[] args) throws GridException {
        try (Grid grid = GridGain.start(CONFIG_FILE)) {
            GridFuture<Collection<Integer>> future = grid.forRemotes().compute().broadcast(new GridCallable<Integer>() {
                @Override
                public Integer call() throws Exception {
                    // Runs on each node: count the elements this node holds as primary.
                    GridCache<Integer, float[]> cache = grid.cache(CACHE_NAME);
                    int count = 0;
                    for (float[] array : cache.primaryValues()) {
                        count += array.length;
                    }
                    return count;
                }
            });
            // Reduce step: sum the per-node counts on the calling node.
            int totalCount = 0;
            for (int count : future.get()) {
                totalCount += count;
            }
            // expect size of input data
            System.out.println(totalCount);
        }
    }
}
There is however no guarantee that each datum is processed exactly once with this approach. E.g. when re-balancing takes place while the GridCallables are executed, part of the data could be processed zero or multiple times.

GridGain Open Source (which is now Apache Ignite) has the ComputeTask API, which has both map() and reduce() methods. If you are looking for a reduce() method, then ComputeTask is definitely the right API for you.
For now your implementation is OK. Apache Ignite is adding a feature where a node will not be considered primary until the migration is fully finished. It should be coming soon.

State Machine design handling data input

I'm working on a real-time machine control system, which performs a series of tasks and should react to a large number of inputs. I've decided to implement this system using a state machine.
I've used simple switch/case based state machines in the past and would like to transition to a more maintainable solution. At the moment I'm a little confused as to how to handle input and transitions.
For example, I have an AnalogInput class which provides me with measurement values which I should monitor. Say I have a state WaitForThreshold, which should read an AnalogInput and then transition if the threshold is reached.
Do I
a) pass a reference to the AnalogInput to the WaitForThreshold class and allow it to monitor the input itself, signaling to the StateMachine class that it wishes to transition,
b) create dedicated events, such as LaserMeasurementAtThreshold, and a state transition map: StateMachine.addTransition(State A, Event e, State B), or
c) create more generic events, such as AnalogInputChanged, and implement event handlers for each of the events, which again signal to the StateMachine that a transition is desired?
Option a is essentially a larger version of a simple switch/case state machine, which could get messy with time but offers great flexibility. Options b and c seem more structured and clean, but it seems I may have to jump through a lot of hoops to implement relatively simple tasks, because the number of events can be very large.
Can someone offer some insight on the design of state machines where a large number of input sources and types must be monitored, and events are largely state-specific (most events pertain only to a single state)?
Are there possibly other alternatives to state machine design to control a system where a sequence of steps must be implemented (non-linear; looping and branching must be possible)?
Language: C++
Thanks
I believe this would be clearer to implement as a table of transitions:
typedef void (*Pointer_To_Transition_Function)();
struct Table_Entry
{
    Input_Type input_value;
    Pointer_To_Transition_Function p_trans_function;
};
static const Table_Entry Transition_Table[] =
{
    {4, Read_Sensor},
};
static const size_t transition_quantity =
    sizeof(Transition_Table) / sizeof(Transition_Table[0]);
//...
// Find the entry matching the current input and run its function.
for (size_t index = 0; index < transition_quantity; ++index)
{
    if (input_value == Transition_Table[index].input_value)
    {
        Pointer_To_Transition_Function p_function = Transition_Table[index].p_trans_function;
        // Call the function:
        p_function();
        break;
    }
}
You could use std::map, but the std::map has to be initialized during run-time. The table (array) is static and constant, thus it can be placed into a read-only memory segment (convenient for embedded systems); and doesn't use dynamic memory allocation.
Edit 1: ASCII drawing of the table
+-------------+--------------------------------+
| Input value | Pointer to transition function |
+-------------+--------------------------------+
|      4      | Read sensor                    |
+-------------+--------------------------------+
|      2      | Start motor                    |
+-------------+--------------------------------+
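The table above is keyed on the input value only. As a rough sketch of how the same idea could cover the question's option (b), the table can be keyed on the pair (current state, event) and yield the next state. The state and event names below are hypothetical and chosen only for illustration; they are not taken from the original system.

```cpp
#include <cstddef>

// Hypothetical states and events, for illustration only.
enum class State { Idle, WaitForThreshold, Running };
enum class Event { Start, ThresholdReached, Stop };

struct Transition {
    State from;   // state in which the event is relevant
    Event event;  // event that triggers the transition
    State to;     // state entered afterwards
};

// Static, constant table: no dynamic allocation, can be placed in read-only memory.
static const Transition kTransitions[] = {
    {State::Idle,             Event::Start,            State::WaitForThreshold},
    {State::WaitForThreshold, Event::ThresholdReached, State::Running},
    {State::Running,          Event::Stop,             State::Idle},
};
static const std::size_t kTransitionCount =
    sizeof(kTransitions) / sizeof(kTransitions[0]);

// Returns the next state, or the current state if no table row matches
// (events that do not pertain to the current state are ignored).
State next_state(State current, Event event) {
    for (std::size_t i = 0; i < kTransitionCount; ++i) {
        if (kTransitions[i].from == current && kTransitions[i].event == event) {
            return kTransitions[i].to;
        }
    }
    return current;
}
```

Because most events pertain to a single state, each event typically adds just one row, and unknown (state, event) pairs fall through without any per-state handler code.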

splitting tasks to categories

I have a class (let's call it checker) and different kinds of classes that execute tasks (let's call them tasks). Each task belongs to several categories.
Each task runs and at some point asks checker whether it is allowed to do something. checker answers according to the system state and the task's categories. A task can be in multiple categories.
How would you implement that? (C++, but I don't really think it's language specific.)
I was thinking of adding a list of categories to each task and having a function that takes a category and answers whether the task belongs to it.
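class Task; // forward declaration, needed by checker
class checker {
public:
    // static, so tasks can call checker::is_allowed(this)
    static bool is_allowed(Task * task);
};
class Task
{
public:
    bool is_belonging_to_category(Category cat);
    void some_task_to_do()
    {
        ...
        if (checker::is_allowed(this)) { ....}
        else {....}
    }
};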
Is there a better way to solve this? Maybe some known design pattern...
This looks like questionable design. You're making tasks the objects.
Let's say your tasks are: Eat, Drink, and Be_Merry
If you make each of those tasks objects, they'll have to maintain a reference to the actual individual that they operate on, then when the condition is met they'll need to modify state on the given individual.
This is a violation of Object Oriented Design which defines an object as:
A tight coupling or association of data structures with the methods or functions that act on the data
Notice that you have split the "methods or functions that act on the data" from the object. Instead you should have modeled the objects Jack and Jill which had methods: Eat, Drink, and BeMerry
As far as checker, whether it's parceled out will depend upon whether you're using a push or a pull coding. If you're doing push coding, then checker is simply a holding area for the behavioral properties of Jack and Jill, in such a case the properties should be pushed to Jack and Jill rather than held in checker. If they are properties for all Jack or Jill objects, use a static property. If however you are using pull coding then the information is unavailable until you attempt to execute the task. In this case the checker should probably be a singleton that Jack and Jill access in the process of performing their tasks.
EDIT:
Your comment reveals further tragedy in the design. It seems as though you've kicked off a bunch of threads which are doing busy waiting on checker. This indicates that you need to be using a pull coding. Your Jack and Jill objects need to maintain booleans for the tasks they are actively involved in, for example m_is_going_to_school; then, when checker gets the condition that would stop the busy waiting in your design, it should instead kick off the goToSchool method.
You could make a vector to store all the allowed options. You can make a bool function (like you have) called IsAllowed that takes a string argument and checks whether the option the task is about to perform is allowed; if not, it returns false. That's just my idea, though; of course there are a zillion different ways to implement this. If you want multiple choices, you can make a 2D vector and see whether the corresponding row has any of the options. Good luck!
If you know the maximum number of categories in advance, I'd recommend using Bit Flags to do this.
enum Category {
CATEGORY_A = 1,
CATEGORY_B = 1 << 1,
CATEGORY_C = 1 << 2,
CATEGORY_D = 1 << 3,
};
class Task {
int32_t categories_;
public:
Task() : categories_(0) {}
void add_category(Category cat) {
categories_ |= cat;
}
void run() {
checker::can_run(categories_);
}
};
This allows testing for multiple categories all at once:
namespace checker {
bool can_run(int32_t categories) {
int32_t cannot_run_right_now = CATEGORY_A | CATEGORY_C;
if((categories & cannot_run_right_now) != 0) { // parentheses needed: != binds tighter than &
return false;
}
...
}
}
Well, it depends. If you are 100% sure that you know how many categories there will be, and that it is not some gigantic number, then you might store this information as an integer: if the n-th bit is 1, the task belongs to the n-th category. Then, depending on the state of the system, you might create another integer that serves as a mask. In the end you would just do a bitwise AND, ( (mask & categories) != 0 ), to determine whether the task and the mask share a common bit.
On the other hand, if the number of categories is unknown, you might just keep a list of the categories each task belongs to. Make a dictionary of [SYSTEM_STATE] => [CATEGORIES_AVAILABLE] and check
bool is_allowed(Task * task){
foreach (Category sysC in stateCategories[sys.GetState()])
{
foreach (Category taskC in task.GetCategories())
{
if(sysC == taskC) return true;
}
}
return false;
}
That would of course be slow for a big number of categories.
You could improve this method by storing the categories in another data structure in which searching is not O(n), so that the code would look like this:
bool is_allowed(Task * task){
    foreach (Category sysC in stateCategories[sys.GetState()])
    {
        if (task.GetCategories().Contains(sysC)) {
            return true;
        }
    }
    return false;
}
It depends
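The pseudocode above is not valid C++. A minimal C++ sketch of the same lookup, assuming each task stores its categories in a std::set (O(log n) membership test) and the caller passes in the categories allowed in the current system state, might look like this. The Category values and the Task layout are hypothetical.

```cpp
#include <set>
#include <vector>

// Hypothetical categories, for illustration only.
enum class Category { A, B, C, D };

struct Task {
    std::set<Category> categories;  // categories this task belongs to
};

// Returns true if the task belongs to at least one category that is
// allowed in the current system state.
bool is_allowed(const Task& task, const std::vector<Category>& allowed_now) {
    for (Category c : allowed_now) {
        if (task.categories.count(c) != 0) {
            return true;
        }
    }
    return false;
}
```

The list of allowed categories would come from whatever maps the system state to categories; that mapping is left out here because the original does not specify it.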

Lazy computation of items in list until required element is found

I am trying to get my head around making this requirement as efficient as possible, because it is part of a combinatorial problem solver, so every little bit helps in the grand scheme of things.
Let's say I have a list of elements, in this case called transitions.
val possibleTransitions : List[Transition] = List[...] //coming from somewhere
I want to perform a (somewhat expensive) computation on each transition, to obtain another object, in this case called a State.
The natural way for me to do it is using a for-comprehension or a map. The former for me is more convenient because I want to filter out a few irrelevant State objects, such as those which were already processed earlier.
val newStates = for {
transition <- possibleTransitions
state <- computeExpensiveOperation(transition)
if (confirmNewState(state))
} yield state
State contains a value, let's call it value(), which indicates some kind of attractiveness of that state. If the value() is very low (attractive), I want to discard the rest of the list and use that state. Since possibleTransitions could be a very long list (thousands), ideally I avoid running computeExpensiveOperation on the remaining transitions if, for example, the first State object already has the value() I want.
On the other hand, if I don't find any item with an attractive value() I want to keep all of them and add them to another list.
val newPending = pending ++ newStates
I was trying to use a Stream for this, to avoid computing all the values before processing them. If I use find() and I don't find the required item, then I won't be able to get the items in the stream (since it's use-once).
The only thing I can see possible at the moment is to use possibleItems.toStream() in the for-comprehension and create another collection, iterating through each item one by one until either I find the item (and discard the collection) or not (and use the collection with all items).
Am I missing some smarter more efficient way to do this?
I would use lazy views and convert them to a stream to cache the intermediate result, then you can get the information you need:
val newStates = for {
transition <- possibleTransitions.view
state <- computeExpensiveOperation(transition)
if (confirmNewState(state))
} yield state
val newStatesStream = newStates.toStream // cache results
val attractive = newStatesStream.find(isAttractive(_))
attractive match {
case Some(a) => // do whatever
case None => {
val newPending = pending ++ newStatesStream
}
}
As the stream is lazy it will only be computed until the first element is found in the line with val attractive. If there is no attractive element the complete stream will be computed and cached and None will be returned.
When computing the new pending elements we can just append this stream to pending. (By the way: pending should probably be a Queue)
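For readers more at home in C++ (the language used for the added examples in this collection), the same "compute on demand, cache what was computed, stop at the first attractive element" behaviour can be sketched with an explicit loop. This is only an illustration of the idea, not a translation of the Scala API: Transition, State, and the two callbacks are hypothetical stand-ins, and the confirmNewState filter is omitted for brevity.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-ins for the question's types.
struct Transition { int id; };
struct State { int value; };

struct ScanResult {
    bool found = false;           // true if an attractive state stopped the scan
    State attractive{};           // valid only when found is true
    std::vector<State> computed;  // non-attractive states computed so far
};

ScanResult scan(const std::vector<Transition>& transitions,
                const std::function<State(const Transition&)>& compute,
                const std::function<bool(const State&)>& is_attractive) {
    ScanResult result;
    for (const Transition& t : transitions) {
        State s = compute(t);          // expensive step, run only when reached
        if (is_attractive(s)) {
            result.found = true;
            result.attractive = s;
            return result;             // remaining transitions are never computed
        }
        result.computed.push_back(s);  // kept in case nothing attractive turns up
    }
    return result;
}
```

If found is false, computed holds every state and can be appended to pending; otherwise it can simply be discarded.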

Adding weka instances after classification but before evaluation?

Suppose X is a raw, labeled (ie, with training labels) data set, and Process(X) returns a set of Y instances
that have been encoded with attributes and converted into a weka-friendly file like Y.arff.
Also suppose Process() has some 'leakage':
some instances Leak = X-Y can't be encoded consistently, and need
to get a default classification FOO. The training labels are also known for the Leak set.
My question is how I can best introduce instances from Leak into the
weka evaluation stream AFTER some classifier has been applied to the
subset Y, folding the Leak instances in with their default
classification label, before performing evaluation across the full set X? In code:
DataSource LeakSrc = new DataSource("leak.arff");
Instances Leak = LeakSrc.getDataSet();
DataSource Ysrc = new DataSource("Y.arff");
Instances Y = Ysrc.getDataSet();
classfr.buildClassifier(Y);
// YunionLeak = ??
eval.crossValidateModel(classfr, YunionLeak);
Maybe this is a specific example of folding together results
from multiple classifiers?
The bounty is closing, but Mark Hall, in another forum (http://list.waikato.ac.nz/pipermail/wekalist/2015-November/065348.html), deserves what will have to count as the current answer:
You’ll need to implement building the classifier for the cross-validation
in your code. You can still use an evaluation object to compute stats for
your modified test folds though, because the stats it computes are all
additive. Instances.trainCV() and Instances.testCV() can be used to create
the folds:
http://weka.sourceforge.net/doc.stable/weka/core/Instances.html#trainCV(int,%20int,%20java.util.Random)
You can then call buildClassifier() to process each training fold, modify
the test fold to your hearts content, and then iterate over the instances
in the test fold while making use of either Evaluation.evaluateModelOnce()
or Evaluation.evaluateModelOnceAndRecordPrediction(). The later version is
useful if you need the area under the curve summary metrics (as these
require predictions to be retained).
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnce(weka.classifiers.Classifier,%20weka.core.Instance)
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnceAndRecordPrediction(weka.classifiers.Classifier,%20weka.core.Instance)
Depending on your classifier, it could be very easy! Weka has an interface called UpdateableClassifier; any classifier implementing it can be updated after it has been built. The following classes implement this interface:
HoeffdingTree
IBk
KStar
LWL
MultiClassClassifierUpdateable
NaiveBayesMultinomialText
NaiveBayesMultinomialUpdateable
NaiveBayesUpdateable
SGD
SGDText
It can then be updated something like the following:
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/data/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null) {
nb.updateClassifier(current);
}

Best tree/heap data structure for fixed set of nodes with changing values + need top 20 values?

I'm writing something like a game in C++ where I have a database table containing the current score for each user. I want to read that table into memory at the start of the game, quickly change each user's score while the game is being played in response to what each user does, and then when the game ends write the current scores back to the database. I also want to be able to find the 20 or so users with the highest scores. No users will be added or deleted during the short period when the game is being played. I haven't tried it yet, but updating the database might take too much time during the period when the game is being played.
Fixed set of users (might be 10,000 to 50,000 users)
Will map user IDs to their score and other user-specific information.
User IDs will be auto_increment values.
If the structure has a high memory overhead that's probably not an issue.
If the program crashes during gameplay it can just be re-started.
Greatly prefer something already available, such as open source/public domain code.
Quickly get a user's current score.
Quickly add to a user's current score (and return their current score)
Quickly get 20 users with highest score.
No deletes.
No inserts except when the structure is first created, and how long that takes isn't critical.
Getting the top 20 users will only happen every five or ten seconds, but getting/adding will happen much more frequently.
If not for the last, I could just create a memory block equal to sizeof(user) * max(user id) and put each user at user id * sizeof(user) for fast access. Should I do that plus some other structure for the Top 20 feature, or is there one structure that will handle all of this together?
Use a std::map. In the incredibly unlikely event that it ever shows up in your profiling, you could maybe think about changing to something more exotic. Memory overhead for 50k users will be around a megabyte or two.
I doubt that iterating over a map with 50k entries every 5-10 seconds, to find the top scores, will introduce significant overhead. If it does, though, either use a Boost multi-index container, or maintain a separate structure for the hi-scores (a heap, or just an array of pointers to the current top 20, in order). Just with an array / vector of 20, the code to increment a score might look something like this (assuming scores only go up, not down):
player.score += points;
if (player.score > hiscores[19]->score) {
hiscore_dirty = true;
}
And the code to get the hi-scores:
if (hiscore_dirty) {
recalculate_hiscores();
hiscore_dirty = false;
}
std::for_each(hiscores.begin(), hiscores.end(), do_something);
If your "auto-increment" and "no delete" policies are fixed forever (i.e. you will never delete users from the DB), and therefore user ids truly are a contiguous range from 0 to the limit, then you should just use a std::vector instead of a std::map.
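As a rough sketch of the std::vector suggestion, assuming user ids really are contiguous indices starting at 0: scores live in a vector indexed by id, and the top scorers are recomputed on demand with std::partial_sort, which is cheap enough to run every few seconds for tens of thousands of users. The ScoreBoard class and its member names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class ScoreBoard {
public:
    explicit ScoreBoard(std::size_t user_count) : scores_(user_count, 0) {}

    // O(1): adds points and returns the user's new score.
    // Assumes user_id is a contiguous index in [0, user_count).
    long add(std::size_t user_id, long points) {
        assert(user_id < scores_.size());
        scores_[user_id] += points;
        return scores_[user_id];
    }

    // Returns the ids of the n highest-scoring users, best first.
    // The relative order of users with equal scores is unspecified.
    std::vector<std::size_t> top(std::size_t n) const {
        std::vector<std::size_t> ids(scores_.size());
        for (std::size_t i = 0; i < ids.size(); ++i) ids[i] = i;
        n = std::min(n, ids.size());
        std::partial_sort(ids.begin(), ids.begin() + n, ids.end(),
                          [this](std::size_t a, std::size_t b) {
                              return scores_[a] > scores_[b];
                          });
        ids.resize(n);
        return ids;
    }

private:
    std::vector<long> scores_;  // index = user id
};
```

Unlike the dirty-flag approach, this version also stays correct if scores can go down, at the cost of scanning all users on each call to top().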
You might be interested in Fibonacci Heap. This has O(1) (amortized) increaseKey and findMax.
For more info on Heap in general refer: Heap Data Structure, especially the table which compares different heaps.
An implementation of Fibonacci Heap can be found here which you can perhaps use/get inspired from: http://resnet.uoregon.edu/~gurney_j/jmpc/fib.html
First of all, given that you have a Key/Value scenario, you should probably use an Associative Container.
If you are using plain old C++ and do not have Boost available, follow Steve Jessop's suggestion and simply use a std::map. If you have either C++0x or Boost, you'd better use a hash_map or unordered_map: it just matches your requirements better (you don't need to order the players by id after all, you just want to find them quickly) and will probably be faster given the number of players.
For managing the top20 you have 2 choices:
You could use the Boost.MultiIndex library to create one unique container that both offers fast lookup on ID (using a hash map) and an ordered index on the score... however it's a bit of a waste to order all players when you only need 20 of them
You can simply manage a separate structure, like a vector of pointers to users, and each time you modify the score of a user, check whether it should replace a user in the vector
The last solution, though simple, assumes that a player cannot lose points... it's much more difficult if that may happen.
class UsersCollection;
class User
{
public:
    void incrementScore(size_t term);
    size_t score() const { return mScore; } // read by UserSort
private:
    size_t mId;
    size_t mScore;
    UsersCollection& mCollection;
};
class UsersCollection
{
public:
static const size_t MNumberHiScores = 20;
static const size_t MNotAChampion = -1;
UsersCollection(DBConnection const&);
// returns either the position of the user in
// the hi scores vector or MNotAChampion
size_t insertUserInHiScores(User const& user);
private:
std::unordered_map<size_t, User> mUsers;
std::vector<User const*> mHiScores; // [1]
};
void User::incrementScore(size_t term)
{
mScore += term;
mCollection.insertUserInHiScores(*this);
}
struct UserSort // orders users by descending score; std::sort needs no base class
{
bool operator()(User const* lhs, User const* rhs) const
{
return lhs->score() > rhs->score();
}
};
size_t UsersCollection::insertUserInHiScores(User const& user)
{
std::vector<User const*>::const_iterator it =
std::find(mHiScores.begin(), mHiScores.end(), &user);
if (it == mHiScores.end()) // not among the hiscores
{
mHiScores.push_back(&user);
}
std::sort(mHiScores.begin(), mHiScores.end(), UserSort());
if (mHiScores.size() > MNumberHiScores) // purge if too many users
{
User const* last = mHiScores.back();
mHiScores.pop_back();
if (&user == last) return MNotAChampion;
}
// return position in the vector in the [0, MNumberHiScores) range
return std::find(mHiScores.begin(), mHiScores.end(), &user)
- mHiScores.begin();
}
Note (1): using a set may seem a good idea however a set presumes that the elements do not change and it is not the case. It could work if we were very careful:
remove the user from the set before changing the score
putting the user back in once it has changed
optionally popping the last elements if there are too many of them
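The three steps in the note can be sketched as follows, assuming scores only ever increase (the same assumption the vector-based solution makes). The set stores (score, id) pairs, so the key that gets removed and reinserted is a plain value rather than a pointer to a mutable user. The HiScores class and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

class HiScores {
public:
    HiScores(std::size_t user_count, std::size_t limit)
        : scores_(user_count, 0), limit_(limit) {}

    // Assumes points >= 0, i.e. scores never decrease.
    void add(std::size_t user_id, long points) {
        assert(user_id < scores_.size());
        assert(points >= 0);
        // 1. Remove the user's old entry (no effect if the user is not ranked).
        ranked_.erase(std::make_pair(scores_[user_id], user_id));
        // 2. Change the score while the user is outside the set.
        scores_[user_id] += points;
        // 3. Put the user back in with the new score.
        ranked_.insert(std::make_pair(scores_[user_id], user_id));
        // 4. Pop the lowest entry if there are too many.
        if (ranked_.size() > limit_) {
            ranked_.erase(ranked_.begin());
        }
    }

    // Ids of the ranked users, highest score first.
    std::vector<std::size_t> best_first() const {
        std::vector<std::size_t> ids;
        for (auto it = ranked_.rbegin(); it != ranked_.rend(); ++it) {
            ids.push_back(it->second);
        }
        return ids;
    }

private:
    std::vector<long> scores_;                       // index = user id
    std::size_t limit_;                              // e.g. 20
    std::set<std::pair<long, std::size_t>> ranked_;  // ascending by (score, id)
};
```

Each update costs O(log k) for k ranked users instead of the full sort in insertUserInHiScores, but, as noted above, the approach breaks down if a ranked user can lose points, because an unranked user might then deserve the freed slot.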