Is MapReduce one form of Continuation-Passing Style (CPS)? - mapreduce

As the title says. I was reading Yet Another Language Geek: Continuation-Passing Style and I was sort of wondering if MapReduce can be categorized as one form of Continuation-Passing Style aka CPS.
I am also wondering how CPS can utilise more than one computer to perform a complex computation. Maybe CPS makes it easier to work with the Actor model.

I would say they're opposites. MapReduce obviously lends itself to distribution, where Map can do subtasks independently. With CPS you write a recursive function where each call is waiting on a smaller case to get back.
I think CPS is one of the programming techniques that Guy Steele describes as things we need to outgrow and unlearn, in his talk on The Future of Parallel: What's a Programmer to do?

I wouldn't say so. MapReduce does execute user-defined functions, but these are better known as "callbacks". I think CPS is a very abstract concept that is commonly used to model the behavior of better-known concepts like functions, coroutines, callbacks and loops. It is generally not used directly.
Then again, I may be confusing CPS with continuations themselves. I'm not an expert on either one.

Both CPS and MapReduce make use of higher order functions. This means that both involve functions that take functions as arguments.
In the case of CPS you pass a function (called a continuation) as an argument, saying what to do with the result. Typically (but not always) the continuation is used once. It's a function that specifies how the whole of the rest of the computation should continue. This also makes it a serial kind of thing. Typically you have one thread of execution and the continuation specifies how it's going to continue.
In the case of MapReduce you're providing function arguments that are used multiple times. These argument functions don't really represent the whole of the rest of the computation, but just little building blocks that are used over and over again. The "over and over" bit can often be distributed over multiple machines making this a parallel kind of thing.
So you're right to see a commonality. But one isn't really an example of the other.
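To make the contrast concrete, here is a small Python sketch (my own illustration, not from the original discussion): the CPS continuation is invoked once per call and represents everything that happens next, while the map/reduce function arguments are little building blocks reused across many elements.

```python
from functools import reduce

# CPS: every function takes an extra argument, the continuation,
# which represents the whole of the rest of the computation.
def add_cps(a, b, k):
    return k(a + b)        # hand the result to "what happens next"

def square_cps(x, k):
    return k(x * x)

# (1 + 2) squared in CPS: one serial chain, each continuation used once.
cps_result = add_cps(1, 2, lambda s: square_cps(s, lambda r: r))

# map/reduce: the function arguments are small building blocks,
# reapplied to every element (and so easy to distribute).
mr_result = reduce(lambda a, b: a + b, map(lambda x: x * x, [1, 2, 3]))

print(cps_result, mr_result)
```

Note how the CPS version fixes a single serial chain of "what happens next", whereas the mapped/reduced functions say nothing about ordering across elements.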

Map-reduce is an implementation. The coding interface which lets you use that implementation could use continuations; it's really a matter of how the framework and job control are abstracted. Consider declarative interfaces for Hadoop such as Pig, or declarative languages in general such as SQL; the machinery below the interface may be implemented in many ways.
For example, here's an abstracted Python map-reduce implementation:
def mapper(input_tuples):
    "Return a generator of (key, item) pairs with qualifying keys"
    # we are seeing a partition of input_tuples
    return ((key, item) for (key, item) in input_tuples if key > 1)

def reducer(input_tuples):
    "Return a generator of items with qualifying keys"
    # we are seeing a partition of the sorted mapper output
    return (item for (key, item) in input_tuples if key != 'foo')

def run_mapreduce(input_tuples):
    # partitioning is magically run across boxes
    mapper_inputs = partition(input_tuples)
    # each mapper is magically run on a separate box
    mapper_outputs = (mapper(input) for input in mapper_inputs)
    # partitioning and sorting are magically run across boxes
    reducer_inputs = partition(
        sort(pair for output in mapper_outputs for pair in output))
    # each reducer is magically run on a separate box
    reducer_outputs = (reducer(input) for input in reducer_inputs)
    return reducer_outputs
And here's the same implementation using coroutines, with even more magical abstraction hidden away:
def mapper_reducer(input_tuples):
    # we are seeing a partition of input_tuples
    # yield mapper output to caller, get reducer input back
    reducer_input = yield (
        (key, item) for (key, item) in input_tuples if key > 1)
    # we are seeing a partition of reducer_input tuples again, but the
    # caller of this continuation has partitioned and sorted
    # yield reducer output to caller
    yield (item for (key, item) in reducer_input if key != 'foo')
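For concreteness, here is a minimal single-machine driver for a coroutine in this style (my own sketch, with no distribution magic: the caller plays the framework and does the sorting between the two phases):

```python
def mapper_reducer(input_tuples):
    # mapper phase: keep pairs with qualifying keys
    reducer_input = yield [(key, item) for (key, item) in input_tuples if key > 1]
    # reducer phase: the caller has sorted the pairs before sending them back
    yield [item for (key, item) in reducer_input if key != 'foo']

data = [(0, 'dropped'), (2, 'b'), (3, 'c'), (2, 'a')]
coro = mapper_reducer(data)
mapper_output = next(coro)                         # run the mapper phase
reducer_output = coro.send(sorted(mapper_output))  # sort, then run the reducer
print(reducer_output)
```

The yield expression is what plays the continuation-like role here: the coroutine suspends with the mapper output and resumes with whatever the framework sends back.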

Related

Mapping into multiple maps in parallel with Java 8 Streams

I'm iterating over a CloseableIterator and currently adding each element to a HashMap (dealing with conflicts as needed). My goal is to do this process in parallel: add to multiple hashmaps in chunks, using parallelism to speed up the process, then reduce to a single hashmap.
Not sure how to do the first step, using streams to map into multiple hashmaps in parallel. Appreciate help.
Parallel streams collected into Collectors.toMap will already process the stream on multiple threads and then combine per-thread maps as a final step. Or in the case of toConcurrentMap multiple threads will process the stream and combine data into a thread-safe map.
If you only have an Iterator (as opposed to an Iterable or a Spliterator), it's probably not worth parallelizing. In Effective Java, Josh Bloch states that:
Even under the best of circumstances, parallelizing a pipeline is unlikely to increase its performance if the source is from Stream.iterate, or the intermediate operation limit is used.
An Iterator has only a next method, which (typically) must be called sequentially. Thus, any attempt to parallelize would be doing essentially what Stream.iterate does: sequentially starting the stream and then sending the data to other threads. There's a lot of overhead that comes with this transfer, and the cache is not on your side at all. There's a good chance that it wouldn't be worth it, except maybe if you have few elements to iterate over and you have a lot of work to do on each one. In this case, you may as well put them all into an ArrayList and parallelize from there.
It's a different story if you can get a reasonably parallelizable Stream. You can get these if you have a good Iterable or Spliterator. If you have a good Spliterator, you can get a Stream using the StreamSupport.stream methods. Any Iterable has a spliterator method. If you have a Collection, use the parallelStream method.
A Map in Java has key-value pairs, so I'm not exactly sure what you mean by "putting into a HashMap." For this answer, I'll assume that you mean that you're making a call to the put method where the key is one of the elements and the value is Boolean.TRUE. If you update your question, I can give a more specific answer.
In this case, your code could look something like this:
public static <E> Map<E, Boolean> putInMap(Stream<E> elements) {
    return elements.parallel()
                   .collect(Collectors.toConcurrentMap(
                       e -> e, e -> Boolean.TRUE, (a, b) -> Boolean.TRUE));
}
e -> e is the key mapper, making it so that the keys are the elements.
e -> Boolean.TRUE is the value mapper, making it so that the stored values are true.
(a, b) -> Boolean.TRUE is the merge function, deciding how to merge two elements into one.

Rules of thumb for function arguments ordering in Clojure

What (if any) are the rules for deciding the order of parameters for functions in Clojure core?
Functions like map and filter expect a data structure as the last
argument.
Functions like assoc and select-keys expect a data
structure as the first argument.
Functions like map and filter expect a function as the first
argument.
Functions like update-in expect a function as the last argument.
This can cause pain when using the threading macros (I know I can use as->), so what is the reasoning behind these decisions? It would also be nice to know so that my functions can conform as closely as possible to those written by the great man.
Functions that operate on collections (and so take and return data structures, e.g. conj, merge, assoc, get) take the collection first.
Functions that operate on sequences (and therefore take and return an abstraction over data structures, e.g. map, filter) take the sequence last.
Becoming more aware of the distinction [between collection functions and sequence functions] and when those transitions occur is one of the more subtle aspects of learning Clojure.
(Alex Miller, in this mailing list thread)
This is an important part of working intelligently with Clojure's sequence API. Notice, for instance, that they occupy separate sections in the Clojure Cheatsheet. This is not a minor detail; it is central to how the functions are organized and how they should be used.
It may be useful to review this description of the mental model when distinguishing these two kinds of functions:
I am usually very aware in Clojure of when I am working with concrete
collections or with sequences. In many cases I find the flow of data
starts with collections, then moves into sequences (as a result of
applying sequence functions), and then sometimes back to collections
when it comes to rest (via into, vec, or set). Transducers have
changed this a bit as they allow you to separate the target collection
from the transformation and thus it's much easier to stay in
collections all the time (if you want to) by applying into with a
transducer.
When I am building up or working on collections, typically the code
constructing it is "close" and the collection types are known and
obvious. Generally sequential data is far more likely to be vectors
and conj will suffice.
When I am thinking in "sequences", it's very rare for me to do an
operation like "add last" - instead I am thinking in whole collection
terms.
If I do need to do something like that, then I would probably convert
back to collections (via into or vec) and use conj again.
Clojure's FAQ has a few good rules of thumb and visualization techniques for getting an intuition of collection/first-arg versus sequence/last-arg.
Rather than have this be a link-only question, I'll paste a quote of Rich Hickey's response to the Usenet question "Argument order rules of thumb":
One way to think about sequences is that they are read from the left,
and fed from the right:
<- [1 2 3 4]
Most of the sequence functions consume and produce sequences. So one
way to visualize that is as a chain:
map<- filter<-[1 2 3 4]
and one way to think about many of the seq functions is that they are
parameterized in some way:
(map f)<-(filter pred)<-[1 2 3 4]
So, sequence functions take their source(s) last, and any other
parameters before them, and partial allows for direct parameterization
as above. There is a tradition of this in functional languages and
Lisps.
Note that this is not the same as taking the primary operand last.
Some sequence functions have more than one source (concat,
interleave). When sequence functions are variadic, it is usually in
their sources.
I don't think variable arg lists should be a criteria for where the
primary operand goes. Yes, they must come last, but as the evolution
of assoc/dissoc shows, sometimes variable args are added later.
Ditto partial. Every library eventually ends up with a more order-
independent partial binding method. For Clojure, it's #().
What then is the general rule?
Primary collection operands come first. That way one can write -> and
its ilk, and their position is independent of whether or not they have
variable arity parameters. There is a tradition of this in OO
languages and CL (CL's slot-value, aref, elt - in fact the one that
trips me up most often in CL is gethash, which is inconsistent with
those).
So, in the end there are 2 rules, but it's not a free-for-all.
Sequence functions take their sources last and collection functions
take their primary operand (collection) first. Not that there aren't
a few kinks here and there that I need to iron out (e.g. set/
select).
I hope that helps make it seem less spurious,
Rich
Now, how one distinguishes between a "sequence function" and a "collection function" is not obvious to me. Perhaps others can explain this.

Difference between fold and reduce revisited

I've been reading a nice answer to Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)? provided by samthebest and I am not sure if I understand all the details:
According to the answer (reduce vs foldLeft):
A big big difference (...) is that reduce should be given a commutative monoid, (...)
This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists.
and
Reduce is defined formally as part of the MapReduce paradigm,
I am not sure how these two statements combine. Can anyone shed some light on that?
I tested different collections and I haven't seen a performance difference between reduce and foldLeft. It looks like ParSeq is a special case, is that right?
Do we really need order to define fold?
we cannot define fold because chunks do not have an ordering and fold only requires associativity, not commutativity.
Why couldn't it be generalized to unordered collections?
As mentioned in the comments, the term reduce means different things when used in the context of MapReduce and when used in the context of functional programming.
In MapReduce, the system groups the results of the map function by a given key and then calls the reduce operation to aggregate values for each group (so reduce is called once for each group). You can see it as a function (K, [V]) -> R taking the group key K together with all the values belonging to the group [V] and producing some result.
In functional programming, reduce is a function that aggregates elements of some collection when you give it an operation that can combine two elements. In other words, you define a function (V, V) -> V and the reduce function uses it to aggregate a collection [V] into a single value V.
When you want to add numbers [1,2,3,4] using + as the function, the reduce function can do it in a number of ways:
It can run from the start and calculate (((1+2)+3)+4)
It can also calculate a = 1+2 and b = 3+4 in parallel and then add a+b!
The foldLeft operation is, by definition always proceeding from the left and so it always uses the evaluation strategy of (1). In fact, it also takes an initial value, so it evaluates something more like (((0+1)+2)+3)+4). This makes foldLeft useful for operations where the order matters, but it also means that it cannot be implemented for unordered collections (because you do not know what "left" is).
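A small Python sketch of the two evaluation strategies (Python's functools.reduce is a left fold; tree_reduce below is my stand-in for what a parallel reduce may do with an associative operation):

```python
from functools import reduce
import operator

# Left fold: always (((0+1)+2)+3)+4, strictly sequential.
left_fold = reduce(operator.add, [1, 2, 3, 4], 0)

# Tree-shaped reduction: combine neighbouring pairs, as a parallel
# reduce might do on separate workers, then repeat on the results.
def tree_reduce(op, xs):
    while len(xs) > 1:
        pairs = [xs[i:i + 2] for i in range(0, len(xs), 2)]
        xs = [op(p[0], p[1]) if len(p) == 2 else p[0] for p in pairs]
    return xs[0]

tree = tree_reduce(operator.add, [1, 2, 3, 4])
print(left_fold, tree)
```

For an associative (and here commutative) operation like +, both strategies agree; for a non-associative operation only the left fold has a well-defined answer, which is the point being made above.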

Design for customizable string filter

Suppose I have tons of filenames in my_dir/my_subdir, formatted in some way:
data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00
For example data11_7TeV is the data_type, 00179691 the run number, NTUP_PHOTON the data format.
I want to write an interface to do something like this:
dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");
// don't do the filtering, be lazy
cout << dataset.count(); // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS"); // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();
I want lazy behaviour, for example no real filtering has to be done before any action like count or get_list. I don't want to redo the filtering if it is already done.
I'm just learning about design patterns, and I think I can use:
an abstract base class AbstractFilter that implements filter* methods
a factory to decide, from the called method, which decorator to use
every time I call a filter* method I return a decorated class, for example:
AbstractFilter::filter_run(string arg) {
    decorator = factory.get_decorator_run(arg); // if arg is "> 00179691" returns FilterRunGreater(00179691)
    return decorator(this);
}
a proxy that builds a regex to filter the filenames, but doesn't do the filtering
I'm also learning jQuery and I'm using a similar chaining mechanism.
Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new formats in the filenames.
I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).
As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.
But, let's say you read all the matching directory entries into an array A.
Now, for filtering, it seems your requirements can generally be met using std::multimap find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc. that map to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.
There are a lot of optimisations possible if you happen to have unstated insights / restrictions re your data and filtering needs (which is very likely). For example:
if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.
Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.
I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.
I would use a strategy pattern. Your DataManager is constructing a DataSet type, and the DataSet has a FilteringPolicy assigned. The default can be a NullFilteringPolicy which means no filters. If the DataSet member function filter_type(string t) is called, it swaps out the filter policy class with a new one. The new one can be factory constructed via the filter_type param. Methods like filter_run() can be used to add filtering conditions onto the FilterPolicy. In the NullFilterPolicy case it's just no-ops. This seems straightforward to me, I hope this helps.
EDIT:
To address the method chaining you simply need to return *this; i.e. return a reference to the DataSet class. This means you can chain DataSet methods together. It's what the C++ iostream libraries do when you implement operator>> or operator<<.
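As a rough sketch of that lazy, cached chaining idea (Python for brevity; the class and method names here are invented for illustration, not part of any framework):

```python
import re

class DataSet:
    """Chainable, lazy filter container: filters only accumulate until an
    action (count / get_list) forces the matching to run, and the result
    is cached so the filtering is never repeated."""
    def __init__(self, filenames, filters=None):
        self._filenames = filenames
        self._filters = filters or []
        self._cache = None              # memoised filtering result

    def filter_regex(self, pattern):
        # return a new DataSet so chained calls don't mutate shared state
        return DataSet(self._filenames, self._filters + [re.compile(pattern)])

    def _run(self):
        if self._cache is None:         # filter once, then reuse
            self._cache = [f for f in self._filenames
                           if all(p.search(f) for p in self._filters)]
        return self._cache

    def count(self):
        return len(self._run())

    def get_list(self):
        return self._run()

ds = DataSet(["data11_a.NTUP", "data12_b.NTUP", "data11_c.D2AOD"])
print(ds.filter_regex("data11").filter_regex("NTUP").count())
```

Each filter_* call returns the container itself (here, a fresh one), which is exactly the "return *this" chaining described above, with the laziness and caching living in _run().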
First of all, I think that your design is pretty smart and lends itself well to the kind of behavior you are trying to model.
Anyway, my understanding is that you are trying to build a sort of "Domain Specific Language", whereby you can chain "verbs" (the various filtering methods) representing actions on, or connecting, "entities" (where the variability is represented by the different naming formats that could exist, although you do not say much about this).
In this respect, a very interesting discussion is found in Martin Fowler's book "Domain Specific Languages". Just to give you a taste of what it is about, here you can find an interesting discussion of the "Method Chaining" pattern, defined as:
“Make modifier methods return the host object so that multiple modifiers can be invoked in a single expression.”
As you can see, this pattern describes the very chaining mechanism you are positing in your design.
Here you have a list of all the patterns that were found useful in defining such DSLs. Again, you will easily find there several specialized patterns that you are also implying in your design, or describing by way of more generic patterns (like the decorator). A few of them are: Regex Table Lexer, Method Chaining, Expression Builder, etc. And many more that could help you further specify your design.
All in all, I could add my grain of salt by saying that I see a place for a "command processor" pattern in your specification, but I am pretty confident that by deploying the powerful abstractions that Fowler proposes you will be able to come up with a much more specific and precise design, covering aspects of the problem that right now are simply hidden by the "generality" of the GoF pattern set.
It is true that this could be "overkill" for a problem like the one you are describing, but as an exercise in pattern oriented design it can be very insightful.
I'd suggest starting with the Boost iterator library - e.g. the filter iterator.
(And, of course, Boost includes a very nice regex library.)

Lisp is for List Processing. Is there a language for Tree Processing?

The name for Lisp derives from LISt Processing. Linked lists are the major data structure of Lisp languages, and Lisp source code is itself made up of lists. As a result, Lisp programs can manipulate source code as a data structure (this is known as homoiconicity).
However, a list is by definition a sequential construct. This encourages us to solve problems using sequential language idioms (algorithms that process one thing at a time and accumulate the results). For example, in a Lisp where cons cells are used to implement singly-linked lists, the car operation returns the first element of the list, while cdr returns the rest of the list. My vision is of a functional language for parallel execution, that splits problems into roughly equal sub-problems, recursively solves them, and combines the sub-solutions.
Pretty much every programming language's code is already parsed into trees; is there a homoiconic language like Lisp, but with trees as the major data structure? btw, I'd call it Treep, for TREE Processing.
Update: An interesting presentation (PDF) from 2009 by Guy Steele on parallel algorithms & data structures, Organizing Functional Code for Parallel Execution: or, foldl and foldr Considered Slightly Harmful.
Lisp Lists ARE trees, and Lisp code is a tree, just like any other code.
(+ (* 1 3) (* 4 6))
is a tree:
     +
    / \
   /   \
  *     *
 / \   / \
1   3 4   6
And it's not just binary trees.
(+ 1 2 3)
   +
  /|\
 / | \
1  2  3
So, perhaps Lisp is your answer as well as your question.
I don't see that the change would be very profound. Lisp certainly doesn't have any problem with letting lists be members of other lists, so it can easily represent trees, and algorithms on trees.
Conversely, every list can be regarded as a tree of a particular shape (in various ways).
I would say that the major data structure of Lisp languages is the cons cell. One of the things you can easily build with cons cells is a linked list, but that's by no means the only data structure.
A cons cell is a pair of data items, but there's nothing that says a value has to be in the left cell and a pointer in the right cell (as in a linked list). If you allow both cells to contain either values or pointers themselves, it's easy to build binary (or with a bit more work, n-ary) tree structures. Building upon these structures, one can build dictionaries or B-trees or any other data structure you might think of.
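A quick sketch of that point in Python, modelling a cons cell as a plain pair (illustrative only):

```python
# Cons cells as plain Python tuples: (car, cdr).
def cons(car, cdr):
    return (car, cdr)

# A linked list is right-nested cons cells ending in None:
lst = cons(1, cons(2, cons(3, None)))       # the list (1 2 3)

# A binary tree puts values (or further cells) in *both* slots:
tree = cons(cons(1, 3), cons(4, 6))         # the tree ((1 . 3) . (4 . 6))

def list_to_python(cell):
    # walk the cdr chain, collecting each car
    out = []
    while cell is not None:
        out.append(cell[0])
        cell = cell[1]
    return out

print(list_to_python(lst))
```

The same pair primitive yields a list when only the cdr slot recurses, and a tree when both slots do, which is exactly the flexibility being described.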
OP appears to be interested in languages that process trees and have some parallelism support.
Our DMS Software Reengineering Toolkit is a general-purpose program analysis and transformation engine. It parses programming language source text into trees (OP's first interest), enables analysis and processing of those trees, and regeneration of source text from those trees.
DMS is driven by explicit grammars that describe the language being processed, and it has a lot of production-quality language definitions available for it.
DMS's tree processing machinery provides support for parallelism at two different levels, one supported directly by DMS's underlying parallel programming language, PARLANSE, which is inspired by LISP but isn't nearly so dynamic.
PARLANSE provides for "teams" of parallel activities called "grains" (as in sand on a beach, the idea is that there are lots of grains). Teams may be structured dynamically, by (classically) forking new grains and adding them to a (dynamic) team; this is useful but not spectacularly fast. Teams may be structured statically, including "fork this fixed size set of grains as pure parallel":
(|| a b c)
and "create a set of grains whose execution order is controlled by a specified partial order" (!| ... ). You write the partial order directly in the fork call:
(!| step1 a
    step2 b
    step3 (>> step1 step2) c
    step4 (>> step2) d )
which encodes the fact that action c occurs after (later, >>, in time) actions a and b complete. The statically formed teams are precompiled by the PARLANSE compiler into pretty efficient structures, in that the grains can be launched and killed very quickly, allowing pretty small grain sizes (a few hundred instructions). Standard parallelism libraries have much higher overhead than this.
The basic method for processing trees is pretty conventional: there's a PARLANSE library of facilities for inspecting tree nodes, walking up and down the tree, creating new nodes and splicing them in place. Recursive procedures are often used to visit/update a tree.
The key observation here is that a tree visit which visits some set of children can be coded sequentially, or easily coded as a static parallel team. So it is pretty easy to manually code parallel tree visits, at the cost of writing lots of special cases to parallel-fork for each tree node type. This is the "divide and conquer" that seems of interest to the OP. (You can of course enumerate a node's children and use a dynamic team to fork grains for each child, but this isn't as fast.) This is the first level of parallelism used to process trees in DMS.
The second level of parallelism comes through a DMS-supplied DSL that implements attribute grammars (AGs).
AGs are functional languages that decorate the BNF with a set of computations. One can write a simple calculator with an AG in DMS:
sum = sum + product;
<<Calculator>>: sum[0].result = sum[1].result + product.result;
This causes an attribute, "result", to be computed for the root (sum[0]) by combining the result attribute of the first child (sum[1]) and the second child (product.result).
So-called synthesized attributes propagate information up the tree from the leaves; inherited attributes propagate information down from parents. Attribute grammars in general, and DMS's AGs, allow mixing these, so information can flow up and down the tree in arbitrary order.
Most AGs are purely functional, e.g., no side effects; DMS's allows side effects which complicates the notation but is pretty useful in practice. A nice example is construction of symbol tables by attribute evaluation; current-scope is passed down the tree, local-declaration blocks create new current scopes and pass them down, and individual declarations store symbol table data into the symbol table entry received from a parent.
DMS's AG compiler analyzes the attribute computations, determines how information flows across the entire tree, and then generates a parallel PARLANSE program to implement the information flow per tree node type. It can do a data flow analysis on each AG rule to determine the information flow and what has to happen first vs. later vs. last. For our simple sum rule above, it should be clear that the attributes of the children have to be computed before the attribute of the root can be computed. It turns out that PARLANSE's partial order constructions are a perfect way to encode this information, and they even handle nicely the side effects we added to AGs.
The result is that DMS compiles an AG specification into a highly parallel PARLANSE program. Our C++ front end name/type resolver is implemented as about 150K lines of DMS AG (yes, it takes that much to describe how C++ type information is used), which compiles to some 700K SLOC of parallel PARLANSE code. And it works (and runs in parallel on x86 multicore machines) without any thought or debugging, and it seems to scale nicely.
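A toy Python stand-in for the synthesized-attribute idea described above (not DMS or PARLANSE; the node encoding is invented): each node's result attribute is computed from its children's results, and since the children are independent, their evaluation is exactly what can be forked in parallel.

```python
# Toy bottom-up attribute evaluation for a calculator-style grammar.
# Nodes are tuples: ("num", value) or ("sum"/"prod", left, right).
def evaluate(node):
    kind = node[0]
    if kind == "num":
        return node[1]     # leaf: the attribute is the literal itself
    # the two children are independent subtrees, so these recursive
    # calls are the part a parallel evaluator could run concurrently
    left, right = evaluate(node[1]), evaluate(node[2])
    return left + right if kind == "sum" else left * right

tree = ("sum", ("prod", ("num", 1), ("num", 3)),
               ("prod", ("num", 4), ("num", 6)))
print(evaluate(tree))      # (1*3) + (4*6)
```

The partial-order scheduling described above generalizes this: attributes flowing down (inherited) as well as up (synthesized) impose ordering constraints among these per-node computations.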
The canonical example of a homoiconic tree processing language is XSLT. But you probably don't want to write (nor read) anything substantial in it.
I think Clojure has trees instead of lists and uses them, as you wrote, for concurrency purposes.
I'm also looking for a tree (or better, graph) processing homoiconic language, and so far have found no candidates. Let's make a list of required elements of such a language, and maybe we'll find some variant:
homoiconicity: based on tree (or better graph) interpretation
attribute grammar support in language core
structural pattern matching with unification
full set of typical graph algorithms
Smalltalk-like interactive environment with rich visualization support
lexer/parser engine in the language core for loading any textual data
embedded (LLVM-based) compiler framework for low-level code generation and optimisation for computation-intensive tasks
"Lisp" is only a name and an ancient one. It doesn't describe everything that is done in Lisp. Just like Fortran wasn't exclusively a formula translator, and not everything in Cobol is "business oriented".
The processing referred to in "Lisp" is actually tree processing, because it is the processing of nested lists. Just the "nested" wasn't included in the name.
The ANSI Common Lisp standard in fact normatively defines "tree" and "tree structure" in the glossary.
Lisp lists are a convention in the tree structure: right-leaning recursion (through the cdr slot of a cons cell) represents a list. There is a notation to go with it: the tree structure (a . (b . (c . (d . nil)))) can be shortened to (a b c d), according to the notational rule that a cons cell written as (anything . nil) can also be printed as (anything), and the rule that (foo . (bar)) can be written (foo bar).
The term "list processing" was probably emphasized not because trees are ruled out but because the majority of the growth in the trees is in the direction understood as horizontal: that is to say, the processing of nested lists.
Most of the list-based data structures processed in typical Lisp programs, even when nested, are much deeper in the cdr dimension than in the car dimension.