I'm currently learning Clojure for fun, and today I ran into this article about reducers. I find it not only interesting but also tricky. As a beginner:
I know how to use the core map, filter and reduce.
I understand that core/map, core/filter... return a seqable coll.
Rich Hickey mentioned that core.reducers/map... returns a reducible coll.
The implementations of core/map... and core.reducers/map... look very similar to me. My questions are:
How does a reducible coll make a difference, in layman's terms?
Can anyone give me some trivial examples of reducers in action?
Thank you so much
For me, the main idea of reducers is that they actually do less than map/filter/reduce. Reducers do not specify whether they execute lazily or eagerly, serially or in parallel, on a collection or on another type of data structure, and what they produce may be a collection or something else. Examples:
map/filter/reduce must be passed a collection and must produce a collection; a reducer does not have to do either. This idea of reducers is extended in transducers, so that the same transducer can be applied to either a collection or a core.async channel.
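For illustration, here is a minimal sketch of that idea (assuming Clojure 1.7+ for transducers and core.async on the classpath):

```clojure
(require '[clojure.core.async :as a])

;; One transformation, defined with no execution context attached:
(def xf (comp (map inc) (filter even?)))

;; Applied to a collection:
(into [] xf (range 10))   ; => [2 4 6 8 10]

;; The same transducer applied to a channel:
(def c (a/chan 1 xf))
(a/>!! c 1)
(a/<!! c)                 ; => 2
```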
Reducers also do not specify how they are executed. map/filter/reduce always execute serially across a collection, never in parallel. If you want to execute across a collection in parallel, you must use a different function: pmap. You could imagine that if you wanted to filter in parallel, you could also create a function pfilter (this doesn't exist, but you could write it). Instead of creating a parallel version of each function, reducers simply say "I don't care how I'm executed", and it's up to another function (fold in the case of reducers) to decide whether the execution should be done in parallel or not. fold is more general than pmap because the reducers it applies can do filtering or mapping (or be composed to do both).
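A minimal sketch of that division of labour, using the standard clojure.core.reducers namespace (a vector source is what lets fold split the work):

```clojure
(require '[clojure.core.reducers :as r])

(def v (vec (range 1000000)))

;; Neither r/map nor r/filter says how this will run; fold may execute
;; it in parallel over segments of the vector and combine partial sums:
(r/fold + (r/filter even? (r/map inc v)))
```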
In general, because reducers make fewer assumptions about what they apply to, what they produce or how they are applied, they are more flexible and can therefore be used in a wider variety of situations. This is useful because reducers focus on "what your program does" rather than "how it does it". This means that your code can evolve (e.g. from a single thread to multiple threads, or even to a distributed application) without necessarily having to touch the part of the program that forms the core logic of what your code does.
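To close with the trivial example the question asks for, here is a sketch of the visible difference between a seqable and a reducible coll:

```clojure
(require '[clojure.core.reducers :as r])

(map inc [1 2 3])    ; => (2 3 4), a lazy seq you can consume directly
(r/map inc [1 2 3])  ; => a reducible: it holds a recipe, not results

;; A reducible only does its work when something reduces it:
(reduce + (r/map inc [1 2 3]))   ; => 9
(into [] (r/map inc [1 2 3]))    ; => [2 3 4]
```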
Related
In the Strange Loop presentation on Transducers, Rich Hickey mentions a concept called 'parallel' in a table.
You can easily see examples of seqs and into and channels using transducers, and you can work out that 'Observables' refers to RxJava.
My question is: what is the 'parallel' concept in Rich Hickey's transducers Strange Loop talk? Is it a list of futures, or pmap, or something else?
There have been some thoughts about creating parallel transducible processes. This is being tracked as CLJ-1553. Currently we are not planning to address this in Clojure 1.7, but would like to do something in Clojure 1.8.
It is possible now to set up a reducer that uses a transducer as the bottom reduce phase (along with more traditional combiner fns) but ideally we would be able to leverage the "self-reducible" concept embodied by persistent vectors and maps to support transduce in parallel in a more natural way.
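A sketch of that existing possibility: a fold whose bottom reduce phase is built from a transducer, with an ordinary combiner fn (assumes Clojure 1.7+ for transducers):

```clojure
(require '[clojure.core.reducers :as r])

;; combinef: + (its zero-arg call, (+) => 0, supplies each segment's init)
;; reducef:  the (map inc) transducer applied to + as the bottom step
(r/fold + ((map inc) +) (vec (range 10)))   ; => 55
```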
It is most likely right now that this would emerge as some sort of preduce function, but there is still much to be decided.
One problematic area is dealing with kv forms - reducers made some choices there that are difficult or inconvenient with transducers, so that needs to be worked through.
The concept is simply that of performing computation in parallel. There are multiple possible implementations:
clojure.core.reducers/fold, which is similar to reduce, except it should only be used with associative reduction functions and it's backed by a protocol which exploits the tree structure of various Clojure data structures to parallelize the computational effort. It's not actually transducer-friendly yet, but it is reducer-friendly and it seems that a transducer-enabled version is bound to arrive eventually.
Recent releases of core.async with transducer support export a function called pipeline which parallelizes channel → channel transducer-based transformations.
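A minimal sketch of pipeline in use (the channel setup is illustrative; this assumes a core.async release with transducer support):

```clojure
(require '[clojure.core.async :as a])

(def in  (a/to-chan (range 10)))   ; source channel fed from a collection
(def out (a/chan))

;; Run the (map inc) transducer over up to 4 items concurrently,
;; while preserving the order of results on the output channel:
(a/pipeline 4 out (map inc) in)

(a/<!! (a/into [] out))            ; => [1 2 3 4 5 6 7 8 9 10]
```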
I was reading about Clojure reducers, introduced in 1.5, here: https://github.com/clojure/clojure/blob/master/changes.md. My understanding is that they're a performance enhancement on the existing map/filter/reduce functions. So if that's the case, I'm wondering why they are in a new namespace and do not simply replace the existing map/filter/reduce implementations. Stated differently, why would I ever not choose to use the new reducers feature?
EDIT:
In response to the initial two answers, here is a clarification:
I'm going to quote the release notes here:
Reducers provide a set of high performance functions for working with collections. The actual fold/reduce algorithms are specified via the collection being reduced. This allows each collection to define the most efficient way to reduce its contents.
This does not sound to me like the new map/filter/reduce functions are inherently parallel. For example, further down in the release notes it states:
It contains a new function, fold, which is a parallel reduce+combine
So unless the release notes are poorly written, it would appear to me that there is one new function, fold, which is parallel, and the other functions are collection-specific implementations that aim to produce the highest performance possible for the particular collection. Am I simply misreading the release notes here?
Foreword: you have a problem and you are going to use parallelism. Now problems two have you.
They're a replacement in the sense that they do the work in parallel (versus the plain old sequential map etc.). Not all operations can be parallelized (in many cases the operation has to be at least associative; also think about lazy sequences and iterators). Moreover, not every operation can be parallelized efficiently (there is always some coordination overhead, and sometimes the overhead is greater than the gain from parallelization).
They also cannot replace the old implementations in some cases, for instance if you have infinite sequences or if you actually require sequential processing of the collection.
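A sketch of the infinite-sequence caveat:

```clojure
(require '[clojure.core.reducers :as r])

;; Lazy seqs compose fine over an infinite range:
(take 5 (map inc (range)))        ; => (1 2 3 4 5)

;; Reducers are eager: nothing happens until the reduce, and the reduce
;; then tries to consume the whole (infinite) input, so it never terminates:
;; (reduce + (r/map inc (range)))
```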
A couple of good reasons you might decide not to use reducers:
You need to maintain backwards compatibility with Clojure 1.4. This makes it tricky to use reducers in library code, for example, where you don't know what Clojure version your users will be running.
In some circumstances there are better options: for example if you are dealing with numerical arrays then you will almost certainly be better off using something like core.matrix instead.
I found the following write-up by Rich Hickey that, while still somewhat confusing, cleared (some) things up for me: http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
In particular the summary:
By adopting an alternative view of collections as reducible, rather than seqable things, we can get a complementary set of fundamental operations that tradeoff laziness for parallelism, while retaining the same high-level, functional programming model. Because the two models retain the same shape, we can easily choose whichever is appropriate for the task at hand.
I am thinking about a method to handle data more efficiently. Let me explain:
Currently, there is a class called Rules. It has a lot of member functions, like Rules::isForwardEligible(), Rules::isCurrentNumberEligible(), and so on. These functions are used to check specific situations (when other processes call them), and all of them return a bool value.
The bodies of these functions are ifs which query the DB to compare data and finally return true or false.
So the whole flow is like: if(Rules::isCurrentNumberEligible()) ---> check the content of Rules::isCurrentNumberEligible() ---> if(xxxx) (where xxxx is another function that queries the DB again). I think this way of doing it is not good, and I want to improve it.
What I am imagining is to use less code but fetch more of the information per query.
In the first step, if(Rules::isCurrentNumberEligible()), I could query several tables at once, so that nestings like if(xxx){if(xx){if(xx)....}} occur less often. One solution is to build a class whose role is like a coordinator, which is asked for the different queries each time. Is that suitable?
I am not sure this is a good way to handle it, or maybe there are better solutions out there. Please help me, thanks!
The classical algorithm for rule-based systems is the Rete algorithm. It strives to minimize the number of rules to be evaluated. The trick is that re-evaluating a rule does not make sense unless at least one related fact has changed.
In general, the rules which promise the maximum information gain should be queried first. This helps pin down the respective case in as few questions as possible.
A physician performing a differential diagnosis would always order his/her questions from general to specific. In information theory this is called the principle of maximum entropy.
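This is not Rete itself, but a minimal sketch of its central idea (all names here are hypothetical): cache each rule's boolean result and re-evaluate only the rules whose facts have changed.

```clojure
;; rules:         maps of {:id ..., :depends-on [fact-keys], :check (fn [] bool)}
;; changed-facts: a set of fact keys that changed since the last pass
;; cache:         {rule-id boolean} from the previous pass
(defn refresh-rules [rules changed-facts cache]
  (reduce (fn [cache {:keys [id depends-on check]}]
            (if (some changed-facts depends-on)
              (assoc cache id (check))  ; a related fact changed: hit the DB again
              cache))                   ; otherwise reuse the cached result
          cache
          rules))
```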
I often see people say that if you can do X in some language, you can do Y in another language, which is the Turing-completeness argument. So you'll often see (usually in a snide comment) "sure you can do X with Y, because Y is also Turing complete."
I took CS theory a long time ago, but I don't think this is always true, because I'm not sure where Turing fits into concurrency. For example, there are programming languages where, with the right hardware, you can make things execute at exactly the same time, and others where that is not possible.
I understand this is probably more of a hardware/driver issue than a language issue, but I'm curious: does concurrency change what it means to be Turing complete, and if so, how? Can you be more than Turing complete?
EDIT:
The original reason I asked this question was in large part due to quantum computing. The accepted answer doesn't say this, but quantum computing is (ostensibly) a subset of Turing computation.
This is a confusing topic for many people; you're not alone. The issue is that there are two different definitions of "possible" in play here. One definition of "possible" is how you're using it: is it possible to do concurrency, is it possible to operate a giant robot using the language, is it possible to make the computer enjoy strawberries, etc. This is the layman's definition of "possible".
Turing completeness has nothing to do with what's possible in the above sense. Certainly, concurrency isn't possible everywhere because (for at least some definition of concurrency) it needs to be the case that the language can produce code that can run on two or more different processors simultaneously. A language that can only compile to code that will run on a single processor thus would not be capable of concurrency. It could still be Turing-complete, however.
Turing completeness has to do with the kinds of mathematical functions that can be computed on a machine given enough memory and running time. For purposes of evaluating mathematical functions, a single processor can do everything multiple processors can because it can emulate multiple processors by executing one processor's logic after the other. The (unproven and unprovable, though falsifiable) statement that all mathematical functions that could be computed on any device are computable using a Turing machine is the so-called Church-Turing thesis.
A programming language is called Turing-complete if you can prove that you can emulate any Turing machine using it. Combining this with the Church-Turing thesis, this implies that the programming language is capable of evaluating every type of mathematical function that any device could evaluate, given enough time and memory. Most languages are Turing-complete because this only requires the capacity to allocate dynamic arrays and use loops and if-statements as well as some basic data operations.
Going in the other direction, since a Turing machine can be constructed to emulate any processor we currently know of and programming languages do what they do using processors, no current programming language has more expressive power than a Turing machine, either. So the computation of mathematical functions is equally possible across all common programming languages. Functions computable in one are computable in another. This says nothing about performance - concurrency is essentially a performance optimization.
Yes and no. There is no known model of computation that can do things Turing machines cannot do and still be called computation, as opposed to magic¹. Hence, in that sense, there is nothing beyond Turing completeness.
On the other hand, you may be familiar with the saying that “there is no problem that cannot be solved by adding a layer of indirection”. So we might want to distinguish between models of computation that map directly to Turing machines and models of computation that require a level of indirection. “Adding a level of indirection” is not a precise mathematical concept in general, but in many specific cases you can observe the level of indirection. Often the easiest way to prove that some paradigm of computation is Turing-computable is to write an interpreter for it on a Turing machine, and that is exactly a level of indirection.
So let's have a look at what it means to model concurrency. You mention the ability to “execute things to happen exactly at the same time”. That's a specific kind of concurrency, called parallelism, and as far as concurrency goes it's a highly restricted model. The world of concurrency is a lot wilder than this. Nonetheless, parallelism already allows things that require some form of indirection when modeled on a Turing machine.
Consider the following problem: given computer programs A and B (passed on the tape of a universal Turing machine), execute them both, and return the result of either program; your program must terminate unless both A and B are non-terminating. In a purely sequential world, you can execute A and return the result; or you can execute B and return the result. But if you start by executing A, and it happens to be a non-terminating program while B does terminate, then your execution strategy does not solve the problem. And similarly, if you start by executing B, your execution strategy does not solve the problem because B might not terminate even if A does.
Given that it is undecidable whether A or B terminates, you cannot base your decision of which one to execute first on that. However, there is a very simple way to modify your Turing machine to execute the programs in parallel: put A and B on separate tapes, duplicate your automaton, and execute one step of each program until one of the two terminates. By adding this level of processing, you can solve the parallel execution problem.
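The same idea can be sketched in Clojure (illustrative only, not a Turing machine): run both computations concurrently and take whichever result arrives first.

```clojure
(defn parallel-or [thunk-a thunk-b]
  (let [result (promise)]
    (future (deliver result (thunk-a)))  ; deliver is a no-op if the promise
    (future (deliver result (thunk-b)))  ; has already been delivered
    @result))                            ; blocks until one thunk finishes

;; Returns :done even though the first thunk never terminates
;; (the stuck future's thread leaks; acceptable for a sketch):
(parallel-or (fn [] (loop [] (recur)))
             (fn [] :done))
```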
Solving this problem only required a slight modification to the model (it is easy to model a dual-tape Turing machine with a single-tape machine). I nonetheless mention it because it is an important example in [lambda calculus](http://en.wikipedia.org/wiki/Lambda_calculus), another important model of computation. The operation of reducing (evaluating) two lambda terms in parallel until one of them reaches a normal form (terminates) is called Plotkin's parallel or. It is known that it is not possible to write a lambda term (a lambda calculus program) that implements parallel or. Hence the lambda calculus is said to be “inherently sequential”.
The reason I mention the lambda calculus here is that most programming languages are closer to the lambda calculus than they are to Turing machines. So as a programmer, insights from the lambda calculus are often more important than insights from Turing machines. The example of parallel or shows that adding concurrency to a language² can open possibilities that are not available in the original language.
It is possible to add concurrency to a sequential language through essentially the same trick as on Turing machines: execute a small piece of thread A, then a small piece of thread B, and so on. In fact, if you don't do that in your language, the operating system's kernel can typically do it for you. Strictly speaking, this provides concurrent execution of threads, but still using a single processor.
As a theoretical model, this kind of threaded execution suffers the limitation that it is deterministic. Indeed, any system that can be modeled directly on Turing machines is deterministic. When dealing with concurrent systems, it is often important to be able to write non-deterministic programs. Often the exact order in which the multiple threads of computation are interleaved is irrelevant. So two programs are equivalent if they do essentially the same computation, but in a slightly different order. You can make a model of concurrent computation out of a model of sequential computation by looking at sets of possible interleavings instead of single program runs, but that adds a level of indirection that is difficult to manage. Hence most models of concurrency bake nondeterminism into the system. When you do that, you can't run on a Turing machine any more.
¹ In this respect, thought (what happens in our brain) is still magic in the sense that we have no idea how it's done; we don't have a scientific understanding of it. Anything we know how to reproduce (not in the biological sense!) is Turing-computable.
² Note that here, the language includes everything you can't define by yourself. In this sense, the standard library is part of “the language”.
We are going to write a concurrent program in Clojure to extract keywords from a huge amount of incoming mail, which will be cross-checked against a database.
One of my teammates has suggested using Erlang to write this program instead.
I want to note that I am new to functional programming, so I am in some doubt as to whether Clojure is a good choice for writing this program, or whether Erlang is more suitable.
Do you really mean concurrent or distributed?
If you mean concurrent (multi-threaded, multi-core etc.), then I'd say Clojure is the natural solution.
Clojure's STM model is perfectly designed for multi-core concurrency, since it is very efficient at storing and managing shared state between threads. If you want to understand more, it is well worth watching this excellent video.
Clojure STM allows safe mutation of data by concurrent threads. Erlang sidesteps this problem by making everything immutable, which is fine in itself but doesn't help when you genuinely need shared mutable state. If you want shared mutable state in Erlang, you have to implement it with a set of message interactions, which is neither efficient nor convenient (that's the price of a shared-nothing model....)
You will get inherently better performance with Clojure if you are in a concurrent setting on a large machine, since Clojure doesn't rely on message passing and hence communication between threads can be much more efficient.
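A minimal sketch of what that shared mutable state looks like with Clojure's STM (an atom would be more idiomatic for a lone counter; refs show the coordinated, multi-value case):

```clojure
(def processed     (ref 0))
(def keywords-seen (ref #{}))

(defn record-mail! [kw]
  (dosync                          ; both updates commit atomically and are
    (alter processed inc)          ; retried on conflict with other threads
    (alter keywords-seen conj kw)))

(dorun (pmap record-mail! [:urgent :invoice :spam :urgent]))
@processed       ; => 4
@keywords-seen   ; => #{:urgent :invoice :spam}
```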
If you mean distributed (i.e. many different machines sharing work over a network which are effectively running as isolated processes) then I'd say Erlang is the more natural solution:
Erlang's immutable, shared-nothing, message-passing style forces you to write code in a way that can be distributed. So idiomatic Erlang can automatically be distributed across multiple machines and run in a distributed, fault-tolerant setting.
Erlang is therefore very well optimised for this use case, so would be the natural choice and would certainly be the quickest to get working.
Clojure could do it as well, but you will need to do much more work yourself (i.e. you'd either need to implement or choose some form of distributed computing framework) - Clojure does not currently come with such a framework by default.
In the long term, I hope that Clojure develops a distributed computing framework that matches Erlang - then you can have the best of both worlds!
The two languages and runtimes take different approaches to concurrency:
Erlang structures programs as many lightweight processes communicating between one another. In this case, you will probably have a master process sending jobs and data to many workers and more processes to handle the resulting data.
Clojure favors a design where several threads share data and state using common data structures. It sounds particularly suitable for cases where many threads access the same data (read-only) and share little mutable state.
You need to analyze your application to determine which model suits you best. This may also depend on the external tools you use -- for example, the ability of the database to handle concurrent requests.
Another practical consideration is that Clojure runs on the JVM, where many open-source libraries are available.
Clojure is a Lisp running on the JVM. Erlang is designed from the ground up to be highly fault-tolerant and concurrent.
I believe the task is doable with either of these languages and many others as well. Your experience will depend on how well you understand the problem and how well you know the language. If you are new to both, I'd say the problem will be challenging no matter which one you choose.
Have you thought about something like Lucene/Solr? It's great software for indexing and searching documents. I don't know what "cross checking" means for your context, but this might be a good solution to consider.
My approach would be to write a simple test in each language and test the performance of each one. Both languages are somewhat different to C style languages and if you aren't used to them (and you don't have a team that is used to them) you may end up with a maintenance nightmare.
I'd also look at using something like Groovy 1.8. Groovy now includes GPars to enable parallel computing. String and file manipulation in Groovy is very easy indeed.
It depends on what you mean by huge.
Strings in Erlang are painful...
but:
If huge means tens of distributed machines, then go with Erlang and write the workers in text-friendly languages (Python? Perl?). You will have a distributed layer on top, with highly concurrent local workers. Each worker would be represented by an Erlang process. If you need more performance, rewrite your workers in C. In Erlang it is super easy to talk to other languages.
If huge still means one strong machine, go with the JVM. It is not huge then.
If huge means hundreds of machines, I think you will need something stronger and Google-like (BigTable, map/reduce), probably on a C++ stack. Erlang is still OK, but you will need good devs to code it.