large scale data mining with clojure - clojure

I'm looking for a good reference on
large scale data mining with Clojure
I know of many good clojure programming books (Programming Clojure, Joy of Clojure, ...), and many good data mining text books (mining of massive data sets, managing gigabytes, ...). However I'm not aware of any reference that specifically addresses
large scale data mining with Clojure
The "with clojure" part is rather important to me for the following reasons:
* most theoretical analysis uses big-Oh running time, which ignores constants
* constants matter, if it ends up being a matter of 1 second vs 1 hour (for things that need to be real time)
* or 1 hour vs 1 week (for batch jobs)
In particular, I think there's a lot of interplay between the JVM, Clojure Data Structures, whether data is stored in memory or lazily read from disk -- that can have the "same" algorithm have drastically different running times by "slightly" different implementations.
Thus, my question (all of the above was to avoid being closed by "Check Google"):
what is a good resource on massive data mining with Clojure?
Thanks!

I don't think anyone's yet written a good comprehensive reference. But there is certainly lots of work going on in this space (my own company included!)
Some interesting links to follow up:
Storm - distributed realtime computation using Clojure. Could be used for large scale data mining.
http://www.infoq.com/presentations/Why-Prismatic-Goes-Faster-With-Clojure - interesting video regarding Clojure performance and optimisation for machine learning applications
Incanter - probably the leading Clojure library for statistics and data visualisation
Weka - very comprehensive data mining / machine learning library for Java (and hence very easy to use directly from Clojure)

There is a wonderful book that is coming out in May 2013: Clojure Data Analysis Cookbook. I will probably buy it.
http://www.amazon.co.uk/Clojure-Data-Analysis-Cookbook-ebook/dp/B00BECVV9C/ref=sr_1_1?s=books&ie=UTF8&qid=1360697819&sr=1-1
In Detail
Data is everywhere and it's increasingly important to be able to gain
insights that we can act on. Using Clojure for data analysis and
collection, this book will show you how to gain fresh insights and
perspectives from your data with an essential collection of practical,
structured recipes.
"The Clojure Data Analysis Cookbook" presents recipes for every stage
of the data analysis process. Whether scraping data off a web page,
performing data mining, or creating graphs for the web, this book has
something for the task at hand.
You'll learn how to acquire data, clean it up, and transform it into
useful graphs which can then be analyzed and published to the
Internet. Coverage includes advanced topics like processing data
concurrently, applying powerful statistical techniques like Bayesian
modelling, and even data mining algorithms such as K-means clustering,
neural networks, and association rules.
Approach
Full of practical tips, the "Clojure Data Analysis Cookbook" will help
you fully utilize your data through a series of step-by-step, real
world recipes covering every aspect of data analysis.
Who this book is for
Prior experience with Clojure and data analysis techniques and
workflows will be beneficial, but not essential.

Related

Client / server reactive sync in Clojure / Clojurescript

Is there an idiomatic way to do reactive data synchronization between browser and server with Clojure and Clojurescript? What are the pros and cons of one technique vs another?
Having used Meteor.js in the past this sort of reactive database sync is highly preferable to manually writing routes and polling for updates. A pub/sub system lets web developers write less boilerplate code to move data around. Clojure seems like it would be a natural fit for such a technique. I have been unable to determine if this is a solved problem in the clj/cljs ecosystem.
The short answer is "no".
I've used Meteor.js in production, and also looked for the same when getting into CLJ(S). The closest I'm aware of are:
Datsync
Uses Datomic and Datoms as the sync protocol
Seems to be alpha
Fulcro
Has help for optimistic updates, but not direct data sync like Meteor has with Minimongo, closer to Meteor methods (ie a client and server implementation of the same call)
Production-ready, widely used
Why hasn't a Meteor-killer hasn't been written in CLJ(S)?
It's hard to do (at all, performantly)
The company behind Meteor raised tens of millions of dollars over the last decade to work in this space and still didn't resolve all the scalability or semantic rough edges in the data sync design
The datastores Clojure people tend to use (Datomic or relational) likely make data sync even harder, and certainly quite different (it wouldn't be a port from Meteor's implementation).
Clojure developers tend to prefer assembling a system around the needs of a particular problem/solution/requirement rather than assembling a solution around a particular feature/framework (eg Mongo documents that sync full-stack).
People are still thinking about this though, both inside the Clojure community and out. Check out:
Baqend (and their blog comparing with Meteor.js, Firebase, RechinkDB)
"The Web After Tomorrow" (by the developer of DataScript)
Reactive Datalog
The only meaningful way to compare is to consider them both in your context. I remember giving up Meteor's magic felt significant at the time but haven't found myself wanting to go back. I see it as trading magic for simplicity and flexibility.

Which algorithm is best for Text Summarization?

Cosine, Dice, Jaccard out of these algorithms which algorithm is best for Text Summarization?
None of them is an algorithm for text summarization at all.
They are similarity measures that can be applied for two texts.
Extractive summarization
Extractive summarization means identifying important sections of the text and generating them verbatim producing a subset of the sentences from the original text; while abstractive summarization reproduces important material in a new way after interpretation and examination of the text using advanced natural language techniques to generate a new shorter text that conveys the most critical information from the original one.
This is where the model identifies the important sentences and phrases from the original text and only outputs those.
Abstractive Summarization
Abstractive Summarization is more advanced and closer to human-like interpretation. Though it has more potential (and is generally more interesting for researchers and developers), so far the more traditional methods have proved to yield better results.
The model produces a completely different text that is shorter than the original, it generates new sentences in a new form, just like humans do. In this tutorial, we will use transformers for this approach.
reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
Abstractive Summarize
len(reference_text.split())
from transformers import pipeline
summarization = pipeline("summarization")
abstractve_summarization = summarization(reference_text)[0]["summary_text"]
Output
In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)
Extractive summarization
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
summarizer = LexRankSummarizer()
extractve_summarization = summarizer(parser.document,2)
extractve_summarization = ' '.join([str(s) for s in list(extractve_summarization)])
Extractive Output
Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.
seq2seq deep learning models are mostly used for this case,
this is a blog series that talks in much detail from the very beginning of how seq2seq works till reaching the newest research approaches
Also this repo collects multiple implementations on building a text summarization model, it runs these models on google colab, and hosts the data on google drive, so no matter how powerful your computer is, you can use google colab which is a free system to train your deep models on
If you like to see the text summarization in action, you can use this free api.
I truly hope this helps

will Spark support Clojure?

I am about to start learning functional programming and Clojure appeals to me the most, I love its community, syntax and concept of immutable data structures. I am also interested in bio inspired ML for rich data Numenta.
However, my huge concern is that Spark does not support it as yet, and Spark rocks!!!
There is a Flambo Flambo Clojure ,but is it a satisfactory solution?
My course and job is in Scala. Should I defeat and enter Scala world or do you think that I should focus solely on Clojure?
Being the author or Sparkling (thanks to Josh Rosen for pointing that out), I can tell you that we use it at our company for ETL processing.
Here's what's good:
it provides a Clojuristic way of interacting with Spark, and as you can see in the presentation "Big Data with style - the Clojure/Spark way"
it's optimized for performance
it's used in production and there are others also using it
Here's what's missing:
There's currently no support for Spark Streaming, Spark SQL, Spark Dataframes or Spark ML. That might come in the future, I'm happy to accept pull requests, but it's currently not main focus (at the time of writing, April 2015).
I hope this helps you make up your mind on going with Clojure or starting to learn Scala.
It's hard to say that something like Spark doesn't support Clojure. It would make more sense to ask if there are good libraries to use that project that are easy to use from Clojure. From googling around Flambo looks like a viable option and at the various clojure conferences I hear incidental talk of using Spark in several contexts.
I would say that there is fairly low technical risk in using spark from Clojure so you are free to make this choice based on the other constraints of your working environment and pojects. Being particularly biased toward Clojure I whole heartedly recommend at least trying it and see what parts of the language and ecosystem work well for you.

Is Neo4J a good fit for clojure?

I have found that relational databases are a very good fit for Clojure as the set functions (project/join/union etc) map very nicely to a database schema making Clojure almost a perfect fit for using with databases.
I was wondering how Clojure fits in with graph databases like Neo4j however?
Neo4J has clojure'ey bindings here and here and here
you can get the leiningen and maven config for each of these from clojars
allegrograph is another similar graph data store that is widely supported in clojure. so there is some evidence that the answer may be yes!
graph stores lend them selves well to immutable trees which may be an even better fit to Clojure than sets but this is all fairly subjective. The most objective answer I can give is to point to existing use of graph-stores/triple-stores.
Mark Watson's book (free pdf version: http://www.markwatson.com/opencontent/book_java.pdf), a lesser known Clojure book he self-published last year, covers some useful graph technology, mainly allegrograph.
I myself don't have much experience with graph db libraries, but the above-cited book mentions that neo4j is optimized to traverse graphs, whereas allegrograph is optimized for subgraph matching. So the choice will likely depend on your specific application.
If you go with allegograph, the author of that book waives the AGPL license on his wrappers for production use if you buy copies of his book, and of course can be used under the conditions of the license freely https://github.com/mark-watson/java_practical_semantic_web
The clojure-neo4j wrapper library exists, though it's unclear if it would be code-rotted or ready for use given the last commit date https://github.com/JulianMorrison/neo4j-clojure. The most recently updated fork, by mattrepl, however was not that long ago: https://github.com/mattrepl/clojure-neo4j.git

Is Communicating Sequential Processes ever used in large multi threaded C++ programs?

I'm currently writing a large multi threaded C++ program (> 50K LOC).
As such I've been motivated to read up alot on various techniques for handling multi-threaded code. One theory I've found to be quite cool is:
http://en.wikipedia.org/wiki/Communicating_sequential_processes
And it's invented by a slightly famous guy, who's made other non-trivial contributions to concurrent programming.
However, is CSP used in practice? Can anyone point to any large application written in a CSP style?
Thanks!
CSP, as a process calculus, is fundamentally a theoretical thing that enables us to formalize and study some aspects of a parallel program.
If you instead want a theory that enables you to build distributed programs, then you should take a look to parallel structured programming.
Parallel structural programming is the base of the current HPC (high-performance computing) research and provides to you a methodology about how to approach and design parallel programs (essentially, flowcharts of communicating computing nodes) and runtime systems to implements them.
A central idea in parallel structured programming is that of algorithmic skeleton, developed initially by Murray Cole. A skeleton is a thing like a parallel design pattern with a cost model associated and (usually) a run-time system that supports it. A skeleton models, study and supports a class of parallel algorithms that have a certain "shape".
As a notable example, mapreduce (made popular by Google) is just a kind of skeleton that address data parallelism, where a computation can be described by a map phase (apply a function f to all elements that compose the input data), and a reduce phase (take all the transformed items and "combine" them using an associative operator +).
I found the idea of parallel structured programming both theoretical sound and practical useful, so I'll suggest to give a look to it.
A word about multi-threading: since skeletons addresses massive parallelism, usually they are implemented in distributed memory instead of shared. Intel has developed a tool, TBB, which address multi-threading and (partially) follows the parallel structured programming framework. It is a C++ library, so probably you can just start using it in your projects.
Yes and no. The basic idea of CSP is used quite a bit. For example, thread-safe queues in one form or another are frequently used as the primary (often only) communication mechanism to build a pipeline out of individual processes (threads).
Hoare being Hoare, however, there's quite a bit more to his original theory than that. He invented a notation for talking about the processes, defined a specific set of signals that can be sent between the processes, and so on. The notation has since been refined in various ways, quite a bit of work put into proving various aspects, and so on.
Application of that relatively formal model of CSP (as opposed to just the general idea) is much less common. It's been used in a few systems where high reliability was considered extremely important, but few programmers appear interested in learning (yet another) formal design notation.
When I've designed systems like this, I've generally used an approach that's less rigorous, but (at least to me) rather easier to understand: a fairly simple diagram, with boxes representing the processes, and arrows representing the lines of communication. I doubt I could really offer much in the way of a proof about most of the designs (and I'll admit I haven't designed a really huge system this way), but it's worked reasonably well nonetheless.
Take a look at the website for a company called Verum. Their ASD technology is based on CSP and is used by companies like Philips Healthcare, Ericsson and NXP semiconductors to build software for all kinds of high-tech equipment and applications.
So to answer your question: Yes, CSP is used on large software projects in real-life.
Full disclosure: I do freelance work for Verum
Answering a very old question, yet it seems important that one
There is Go where CSPs are a fundamental part of the language. In the FAQ to Go, the authors write:
Concurrency and multi-threaded programming have a reputation for difficulty. We believe this is due partly to complex designs such as pthreads and partly to overemphasis on low-level details such as mutexes, condition variables, and memory barriers. Higher-level interfaces enable much simpler code, even if there are still mutexes and such under the covers.
One of the most successful models for providing high-level linguistic support for concurrency comes from Hoare's Communicating Sequential Processes, or CSP. Occam and Erlang are two well known languages that stem from CSP. Go's concurrency primitives derive from a different part of the family tree whose main contribution is the powerful notion of channels as first class objects. Experience with several earlier languages has shown that the CSP model fits well into a procedural language framework.
Projects implemented in Go are:
Docker
Google's download server
Many more
This style is ubiquitous on Unix where many tools are designed to process from standard in to standard out. I don't have any first hand knowledge of large systems that are build that way, but I've seen many small once-off systems that are
for instance this simple command line uses (at least) 3 processes.
cat list-1 list-2 list-3 | sort | uniq > final.list
This system is only moderately sized, but I wrote a protocol processor that strips away and interprets successive layers of protocol in a message that used a style very similar to this. It was an event driven system using something akin to cooperative threading, but I could've used multithreading fairly easily with a couple of added tweaks.
The program is proprietary (unfortunately) so I can't show off the source code.
In my opinion, this style is useful for some things, but usually best mixed with some other techniques. Often there is a core part of your program that represents a processing bottleneck, and applying various concurrency increasing techniques there is likely to yield the biggest gains.
Microsoft had a technology called ActiveMovie (if I remember correctly) that did sequential processing on audio and video streams. Data got passed from one filter to another to go from input to output format (and source/sink). Maybe that's a practical example??
The Wikipedia article looks to me like a lot of funny symbols used to represent somewhat pedestrian concepts. For very large or extensible programs, the formalism can be very important to check how the (sub)processes are allowed to interact.
For a 50,000 line class program, you're probably better off architecting it as you see fit.
In general, following ideas such as these is a good idea in terms of performance. Persistent threads that process data in stages will tend not to contend, and exploit data locality well. Also, it is easy to throttle the threads to avoid data piling up as a fast stage feeds a slow stage: just block the fast one if its output buffer grows too big.
A little bit off-topic but for my thesis I used a tool framework called TERRA/LUNA which aims for software development for Embedded Control Systems but is used heavily for all sorts of software development at my institute (so only academical use here).
TERRA is a graphical CSP and software architecture editor and LUNA is both the name for a C++ library for CSP based constructs and the plugin you'll find in TERRA to generate C++ code from your CSP models.
It becomes very handy in combination with FDR3 (a CSP refinement checker) to detect any sort of (dead/life/etc) lock or even profiling.