Clustering (fkmeans) with Mahout using Clojure

I am trying to write a short script to cluster my data with Clojure (calling Mahout classes). My input data is in this format (the output of a PHP script):
format: (tag) (image) (frequency)
tag_sit image_a 0
tag_sit image_b 1
tag_lorem image_a 1
tag_lorem image_b 0
tag_dolor image_a 0
tag_dolor image_b 1
tag_ipsum image_a 1
tag_ipsum image_b 1
tag_amit image_a 1
tag_amit image_b 0
... (more)
Then I write the records into a SequenceFile using this Clojure script:
#!./bin/clj

(ns sensei.sequence.core)

(require 'clojure.string)
(require 'clojure.java.io)

(import org.apache.hadoop.conf.Configuration)
(import org.apache.hadoop.fs.FileSystem)
(import org.apache.hadoop.fs.Path)
(import org.apache.hadoop.io.SequenceFile)
(import org.apache.hadoop.io.Text)

(import org.apache.mahout.math.VectorWritable)
(import org.apache.mahout.math.SequentialAccessSparseVector)

(with-open [reader (clojure.java.io/reader *in*)]
  (let [hadoop_configuration ((fn []
                                (let [conf (new Configuration)]
                                  (. conf set "fs.default.name" "hdfs://localhost:9000/")
                                  conf)))
        hadoop_fs (FileSystem/get hadoop_configuration)]
    (reduce
      (fn [writer [index value]]
        (. writer append index value)
        writer)
      (SequenceFile/createWriter
        hadoop_fs
        hadoop_configuration
        (new Path "test/sensei")
        Text
        VectorWritable)
      (map
        (fn [[tag row_vector]]
          (let [input_index (new Text tag)
                input_vector (new VectorWritable)]
            (. input_vector set row_vector)
            [input_index input_vector]))
        (map
          (fn [[tag photo_list]]
            (let [photo_map (apply hash-map photo_list)
                  input_vector (new SequentialAccessSparseVector (count (vals photo_map)))]
              (loop [frequency_list (vals photo_map)]
                (if (zero? (count frequency_list))
                  [tag input_vector]
                  (when-not (zero? (count frequency_list))
                    (. input_vector set
                       (mod (count frequency_list) (count (vals photo_map)))
                       (Integer/parseInt (first frequency_list)))
                    (recur (rest frequency_list)))))))
          (reduce
            (fn [result next_line]
              (let [[tag photo frequency] (clojure.string/split next_line #" ")]
                (update-in result [tag]
                           #(if (nil? %)
                              [photo frequency]
                              (conj % photo frequency)))))
            {}
            (line-seq reader)))))))
Basically it turns the input into a sequence file in this format:
key (Text): $tag_uri
value (VectorWritable): a vector (cardinality = number of documents) with numeric indexes and the respective frequencies <0:1 1:0 2:0 3:1 4:0 ...>
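To sanity-check what was written, the file can be read back with SequenceFile.Reader. A minimal sketch (assuming the same Hadoop/Mahout imports and the test/sensei path used by the writer above; SequenceFile$Reader is the one extra import):

(import org.apache.hadoop.io.SequenceFile$Reader)

;; Read the sequence file back and print each key/vector pair.
(let [conf (doto (new Configuration)
             (.set "fs.default.name" "hdfs://localhost:9000/"))
      fs (FileSystem/get conf)
      reader (new SequenceFile$Reader fs (new Path "test/sensei") conf)
      tag (new Text)
      vw (new VectorWritable)]
  (while (. reader next tag vw)
    (println (str tag) "=>" (str (. vw get))))
  (. reader close))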
Then I proceed to do the actual clustering with this script (following this blog post):
#!./bin/clj

(ns sensei.clustering.fkmeans)

(import org.apache.hadoop.conf.Configuration)
(import org.apache.hadoop.fs.Path)

(import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
(import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
(import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)

(let [hadoop_configuration ((fn []
                              (let [conf (new Configuration)]
                                (. conf set "fs.default.name" "hdfs://127.0.0.1:9000/")
                                conf)))
      input_path (new Path "test/sensei")
      output_path (new Path "test/clusters")
      clusters_in_path (new Path "test/clusters/cluster-0")]
  (FuzzyKMeansDriver/run
    hadoop_configuration
    input_path
    (RandomSeedGenerator/buildRandom
      hadoop_configuration
      input_path
      clusters_in_path
      (int 2)
      (new EuclideanDistanceMeasure))
    output_path
    (new EuclideanDistanceMeasure)
    (double 0.5) ;; convergence delta
    (int 10)     ;; max iterations
    (float 5.0)  ;; fuzziness m
    true         ;; run clustering
    false        ;; emit most likely
    (double 0.0) ;; threshold
    false))      ;; runSequential
However, I am getting output like this:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
11/08/25 15:20:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new compressor
11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new decompressor
11/08/25 15:20:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/08/25 15:20:17 INFO input.FileInputFormat: Total input paths to process : 1
11/08/25 15:20:17 INFO mapred.JobClient: Running job: job_local_0001
11/08/25 15:20:17 INFO mapred.MapTask: io.sort.mb = 100
11/08/25 15:20:17 INFO mapred.MapTask: data buffer = 79691776/99614720
11/08/25 15:20:17 INFO mapred.MapTask: record buffer = 262144/327680
11/08/25 15:20:17 WARN mapred.LocalJobRunner: job_local_0001
java.lang.IllegalStateException: No clusters found. Check your -c path.
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.setup(FuzzyKMeansMapper.java:62)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
11/08/25 15:20:18 INFO mapred.JobClient: map 0% reduce 0%
11/08/25 15:20:18 INFO mapred.JobClient: Job complete: job_local_0001
11/08/25 15:20:18 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
at clojure.lang.Util.runtimeException(Util.java:153)
at clojure.lang.Compiler.eval(Compiler.java:6417)
at clojure.lang.Compiler.load(Compiler.java:6843)
at clojure.lang.Compiler.loadFile(Compiler.java:6804)
at clojure.main$load_script.invoke(main.clj:282)
at clojure.main$script_opt.invoke(main.clj:342)
at clojure.main$main.doInvoke(main.clj:426)
at clojure.lang.RestFn.invoke(RestFn.java:436)
at clojure.lang.Var.invoke(Var.java:409)
at clojure.lang.AFn.applyToHelper(AFn.java:167)
at clojure.lang.Var.applyTo(Var.java:518)
at clojure.main.main(main.java:37)
Caused by: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runIteration(FuzzyKMeansDriver.java:252)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersMR(FuzzyKMeansDriver.java:421)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:345)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
at clojure.lang.Compiler.eval(Compiler.java:6406)
... 10 more
When runSequential is set to true:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
11/09/07 14:32:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new compressor
11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.lang.IllegalStateException: Clusters is empty!
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersSeq(FuzzyKMeansDriver.java:361)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:343)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
at clojure.lang.Compiler.eval(Compiler.java:6465)
at clojure.lang.Compiler.load(Compiler.java:6902)
at clojure.lang.Compiler.loadFile(Compiler.java:6863)
at clojure.main$load_script.invoke(main.clj:282)
at clojure.main$script_opt.invoke(main.clj:342)
at clojure.main$main.doInvoke(main.clj:426)
at clojure.lang.RestFn.invoke(RestFn.java:436)
at clojure.lang.Var.invoke(Var.java:409)
at clojure.lang.AFn.applyToHelper(AFn.java:167)
at clojure.lang.Var.applyTo(Var.java:518)
at clojure.main.main(main.java:37)
I have also rewritten the fkmeans script in this form:
#!./bin/clj

(ns sensei.clustering.fkmeans)

(import org.apache.hadoop.conf.Configuration)
(import org.apache.hadoop.fs.Path)

(import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
(import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
(import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)

(let [hadoop_configuration ((fn []
                              (let [conf (new Configuration)]
                                (. conf set "fs.default.name" "hdfs://localhost:9000/")
                                conf)))
      driver (new FuzzyKMeansDriver)]
  (. driver setConf hadoop_configuration)
  (. driver
     run
     (into-array String ["--input" "test/sensei"
                         "--output" "test/clusters"
                         "--clusters" "test/clusters/clusters-0"
                         "--clustering"
                         "--overwrite"
                         "--emitMostLikely" "false"
                         "--numClusters" "3"
                         "--maxIter" "10"
                         "--m" "5"])))
but I am still getting the same error as the first version :/
The command-line tool runs fine:
$ bin/mahout fkmeans --input test/sensei --output test/clusters --clusters test/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
However, it does not return the points when I try clusterdump, even though the --clustering option was given in the previous command and --pointsDir is defined here:
$ ./bin/mahout clusterdump --seqFileDir test/clusters/clusters-1 --pointsDir test/clusters/clusteredPoints --output sensei.txt
Mahout version used: 0.6-snapshot; Clojure: 1.3.0-snapshot.
Please let me know if I missed anything.

My guess is that the Mahout implementation of fuzzy-c-means needs initial clusters to start with, which you maybe did not supply?
Also, it sounds a bit as if you are running single-node. Note that for single-node setups you should avoid all the Mahout/Hadoop overhead and just use a regular clustering algorithm. Hadoop/Mahout comes at quite a cost that only pays off when you can no longer process the data on a single system. It is not "map reduce" unless you run it on a large number of systems.
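If the seed step is the suspect, one way to rule it out is to build the initial clusters as a separate step and list the -c path before invoking the driver. A minimal sketch (assuming the same Mahout 0.6 classes and paths as in the question; FileSystem is an extra import):

(import org.apache.hadoop.fs.FileSystem)

;; Seed the initial clusters explicitly, then verify the path is non-empty.
(let [conf (doto (new Configuration)
             (.set "fs.default.name" "hdfs://localhost:9000/"))
      fs (FileSystem/get conf)
      clusters-in (RandomSeedGenerator/buildRandom
                    conf
                    (new Path "test/sensei")
                    (new Path "test/clusters/clusters-0")
                    (int 2)
                    (new EuclideanDistanceMeasure))]
  ;; buildRandom returns the path it wrote to; list its contents
  (doseq [status (.listStatus fs clusters-in)]
    (println "seed:" (str (.getPath status)))))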

Related

Logging to two files in Timbre

I'm trying to log to two different files from the same namespace with Timbre, or if that's not possible, at least to different files from two different namespaces.
Inspecting timbre/*config*, I get the impression that I'd need two configuration maps to configure something like that. I can create another config map and use it with timbre/log* in place of the standard config map, but I can't shake off the feeling that that's not how this is supposed to be used...?
(timbre/log* timbre/*config* :info "Test with standard config")
AFAIK, the easiest way is indeed to create two config maps:
(def config1
  {:level :debug
   :appenders {:spit1 (appenders/spit-appender {:fname "file1.log"})}})

(def config2
  {:level :debug
   :appenders {:spit2 (appenders/spit-appender {:fname "file2.log"})}})

(timbre/with-config config1
  (info "This will print in file1"))

(timbre/with-config config2
  (info "This will print in file2"))
A second way would be to write your own appender based on the spit-appender:
https://github.com/ptaoussanis/timbre/blob/master/src/taoensso/timbre/appenders/core.cljx
(defn my-spit-appender
  "Returns a simple `spit` file appender for Clojure."
  [& [{:keys [fname] :or {fname "./timbre-spit.log"}}]]
  {:enabled?   true
   :async?     false
   :min-level  nil
   :rate-limit nil
   :output-fn  :inherit
   :fn
   (fn self [data]
     (let [{:keys [output_]} data]
       (try
         ;; SOME LOGIC HERE TO CHOOSE THE FILE TO OUTPUT TO ...
         (spit fname (str (force output_) "\n") :append true)
         (catch java.io.IOException e
           (if (:__spit-appender/retry? data)
             (throw e) ; Unexpected error
             (let [_    (have? enc/nblank-str? fname)
                   file (java.io.File. ^String fname)
                   dir  (.getParentFile (.getCanonicalFile file))]
               (when-not (.exists dir) (.mkdirs dir))
               (self (assoc data :__spit-appender/retry? true))))))))})
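Either way, the custom appender still needs to be registered with Timbre. A minimal sketch (assuming the my-spit-appender above; :my-spit is an arbitrary key):

(require '[taoensso.timbre :as timbre])

;; Merge the custom appender into Timbre's global config.
(timbre/merge-config!
  {:appenders {:my-spit (my-spit-appender {:fname "file1.log"})}})

(timbre/info "Routed through the custom appender")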

Clojure - tests with components strategy

I am implementing an app using Stuart Sierra's Component library. As he states in the README:
Having a coherent way to set up and tear down all the state associated
with an application enables rapid development cycles without
restarting the JVM. It can also make unit tests faster and more
independent, since the cost of creating and starting a system is low
enough that every test can create a new instance of the system.
What would be the preferred strategy here? Something similar to JUnit oneTimeSetUp/oneTimeTearDown, or really between each test (similar to setUp/tearDown)?
And if between each test, is there a simple way to start/stop a system for all tests (before and after) without repeating the code every time?
Edit: sample code to show what I mean:
(defn test-component-lifecycle [f]
  (println "Setting up test-system")
  (let [s (system/new-test-system)]
    (f s) ;; I cannot pass an argument here ( https://github.com/clojure/clojure/blob/master/src/clj/clojure/test.clj#L718 ), so how can I pass a system in the parameters of a test?
    (println "Stopping test-system")
    (component/stop s)))

(use-fixtures :once test-component-lifecycle)
Note: I am talking about unit testing here.
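For reference, a common way around the no-arguments limitation is to bind the started system to a dynamic var in a :once fixture, so tests read it instead of receiving it as a parameter. A sketch (assuming the question's system/new-test-system; :some-component is a placeholder key):

(def ^:dynamic *system* nil)

(defn test-component-lifecycle [f]
  (let [s (component/start (system/new-test-system))]
    (try
      ;; tests run in this thread, so binding is visible to them
      (binding [*system* s]
        (f))
      (finally
        (component/stop s)))))

(use-fixtures :once test-component-lifecycle)

(deftest example-test
  (is (some? (:some-component *system*))))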
I would write a macro which takes a system map, starts all components before running the tests, and stops all components after testing.
For example:
(ns de.hh.new-test
  (:require [clojure.test :refer :all]
            [com.stuartsierra.component :as component]))

;;; Macro to start and stop components
(defmacro with-started-components [bindings & body]
  `(let [~(bindings 0) (component/start ~(bindings 1))]
     (try
       (let* ~(destructure (vec (drop 2 bindings)))
         ~@body)
       (catch Exception e1#)
       (finally
         (component/stop ~(bindings 0))))))

;; Test component
(defprotocol Action
  (do-it [self]))

(defrecord TestComponent [state]
  component/Lifecycle
  (start [self]
    (println "====> start")
    (assoc self :state (atom state)))
  (stop [self]
    (println "====> stop")
    self) ;; return the component from stop
  Action
  (do-it [self]
    (println "====> do action")
    @(:state self)))

;; Test
(deftest ^:focused component-test
  (with-started-components
    [system (component/system-map :test-component (->TestComponent "startup-state"))
     test-component (:test-component system)]
    (is (= "startup-state" (do-it test-component)))))
Running the test, you should see output like this:
====> start
====> do action
====> stop
Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

Clojure: Get "OutOfMemoryError Java heap space" when parsing a big log file

Hi all.
I want to parse big log files using Clojure. The structure of each line record is "UserID,Latitude,Longitude,Timestamp".
My implementation steps are:
----> Read the log file & get the top-n user list
----> Find each top-n user's records and store them in a separate log file (UserID.log).
The implementation source code:
;======================================================
;; requires and forward declarations added so the snippet loads as-is;
;; MAIN-PATH is defined elsewhere in the original code
(require '[clojure.java.io :as io]
         '[clojure.string :as string])
(declare parse-recur update-res find-write-recur find-recur update-vec
         create-write-file MAIN-PATH)

(defn parse-file
  ""
  [file n]
  (with-open [rdr (io/reader file)]
    (println "001 begin with open ")
    (let [lines (line-seq rdr)
          res (parse-recur lines)
          sorted (into (sorted-map-by (fn [key1 key2]
                                        (compare [(get res key2) key2]
                                                 [(get res key1) key1])))
                       res)]
      (println "Statistic result : " res)
      (println "Top-N User List : " sorted)
      (find-write-recur lines sorted n))))

(defn parse-recur
  ""
  [lines]
  (loop [ls lines
         res {}]
    (if ls
      (recur (next ls)
             (update-res res (first ls)))
      res)))

(defn update-res
  ""
  [res line]
  (let [params (string/split line #",")
        id (if (> (count params) 1) (params 0) "0")]
    (if (res id)
      (update-in res [id] inc)
      (assoc res id 1))))

(defn find-write-recur
  "Get each user's records and store them in a separate log file"
  [lines sorted n]
  (loop [x n
         sd sorted
         id (first (keys sd))]
    (if (and (> x 0) sd)
      (do (create-write-file id
                             (find-recur lines id))
          (recur (dec x)
                 (rest sd)
                 (nth (keys sd) 1))))))

(defn find-recur
  ""
  [lines id]
  (loop [ls lines
         res []]
    (if ls
      (recur (next ls)
             (update-vec res id (first ls)))
      res)))

(defn update-vec
  ""
  [res id line]
  (let [params (string/split line #",")
        id_ (if (> (count params) 1) (params 0) "0")]
    (if (= id id_)
      (conj res line)
      res)))

(defn create-write-file
  "Create a new file and write information into the file."
  ([file info-lines]
   (with-open [wr (io/writer (str MAIN-PATH file))]
     (doseq [line info-lines] (.write wr (str line "\n")))))
  ([file info-lines append?]
   (with-open [wr (io/writer (str MAIN-PATH file) :append append?)]
     (doseq [line info-lines] (.write wr (str line "\n"))))))
;======================================================
I tested this in the REPL with (parse-file "./DATA/log.log" 3) and got these results:

Records      Size     Time     Result
1,000        42KB     <1s      OK
10,000       420KB    <1s      OK
100,000      4.3MB    3s       OK
1,000,000    43MB     15s      OK
6,000,000    258MB    >20min   "OutOfMemoryError Java heap space java.lang.String.substring (String.java:1913)"
======================================================
Here are the questions:
1. How can I fix the error when I try to parse a big log file, like > 200MB?
2. How can I optimize the function to run faster?
3. How can the function deal with logs of more than 1GB in size?
I am still new to Clojure; any suggestion or solution will be appreciated~
Thanks
As a direct answer to your questions, from a little Clojure experience:
The quick and dirty fix for running out of memory boils down to giving the JVM more memory. You can try adding this to your project.clj:
:jvm-opts ["-Xmx1G"] ;; or more
That will make Leiningen launch the JVM with a higher memory cap.
This kind of work is going to use a lot of memory no matter how you work it. @Vidya's suggestion to use a library is definitely worth considering. However, there's one optimization that you can make that should help a little.
Whenever you're dealing with your (line-seq ...) object (a lazy sequence), you should make sure to maintain it as a lazy seq. Calling next on it forces the next element eagerly (it has to check whether more items exist), while rest stays fully lazy. Use rest instead. Take a look at the clojure site, especially the section on laziness:
(rest aseq) - returns a possibly empty seq, never nil
[snip]
a (possibly) delayed path to the remaining items, if any
You may even want to traverse the log twice: once to pull just the username from each line as a lazy seq, and again to filter out those users. This will minimize the amount of the file you're holding onto at any one time.
Making sure your function is lazy should reduce the sheer overhead of having the file as a sequence in memory. Whether that's enough to parse a 1GB file, I don't think I can say.
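To make the point concrete, here is a sketch of a single lazy pass that computes per-user counts without retaining the head of the line seq (it assumes the "UserID,..." line format from the question; user-counts is a hypothetical helper, not the asker's code):

(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; One pass over the file; reduce consumes line-seq without holding
;; its head, so consumed lines can be garbage-collected.
(defn user-counts [file]
  (with-open [rdr (io/reader file)]
    (reduce (fn [counts line]
              (let [id (first (string/split line #","))]
                (update-in counts [id] (fnil inc 0))))
            {}
            (line-seq rdr))))

;; (user-counts "./DATA/log.log") ;=> map of UserID -> record count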
You definitely don't need Cascalog or Hadoop simply to parse a file which doesn't fit into your Java heap. This SO question provides some working examples of how to process large files lazily. The main point is that you need to keep the file open while you traverse the lazy seq. Here is what worked for me in a similar situation:
(defn lazy-file-lines [file]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader file))))
You can map, reduce, count, etc. over this lazy sequence:
(count (lazy-file-lines "/tmp/massive-file.txt"))
;=> <a large integer>
The parsing is a separate, simpler problem.
I am also relatively new to Clojure, so there are no obvious optimizations I can see. Hopefully others more experienced can offer some advice. But I feel like this is simply a matter of the data size being too big for the tools at hand.
For that reason, I would suggest using Cascalog, a Clojure abstraction that runs over Hadoop or on your local machine. I think the syntax for querying big log files would be pretty straightforward for you.
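For a flavor of what that looks like, here is a hypothetical Cascalog query counting records per user (an untested sketch, assuming cascalog.api on the classpath and the comma-separated format from the question):

(use 'cascalog.api)
(require '[cascalog.ops :as c]
         '[clojure.string :as string])

;; Map each line to its UserID field.
(defn line->user [line]
  (first (string/split line #",")))

;; Count records per user; runs locally or on a Hadoop cluster.
(?<- (stdout)
     [?user ?count]
     ((hfs-textline "./DATA/log.log") ?line)
     (line->user ?line :> ?user)
     (c/count ?count))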

Clojure Agent Parallel HTTP IllegalStateException and await-for

I am working through the example of making parallel HTTP requests in Clojure at
http://lethain.com/a-couple-of-clojure-agent-examples/
In particular:
(ns parallel-fetch
  (:import [java.io InputStream InputStreamReader BufferedReader]
           [java.net URL HttpURLConnection]))

(defn get-url [url]
  (let [conn (.openConnection (URL. url))]
    (.setRequestMethod conn "GET")
    (.connect conn)
    (with-open [stream (BufferedReader.
                         (InputStreamReader. (.getInputStream conn)))]
      (.toString (reduce #(.append %1 %2)
                         (StringBuffer.) (line-seq stream))))))

(defn get-urls [urls]
  (let [agents (doall (map #(agent %) urls))]
    (doseq [agent agents] (send-off agent get-url))
    (apply await-for 5000 agents)
    (doall (map #(deref %) agents))))

(prn (get-urls '("http://lethain.com" "http://willarson.com")))
When I run this, I get the following error:
IllegalStateException await-for in transaction
What does this mean and how do I fix it?
Taking the comment on the question into account:
A transaction is being set up in the process of loading your namespace, and since it has a call to get-urls at the top-level, the await-for happens in that transaction and throws the exception.
The best way to fix that is to put the prn / get-urls form inside a function and only call it once the namespace is loaded. (If you wanted to run this code as a standalone app, with lein run or java -jar on an uberjar, you'd put a call to that function inside -main.)
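For instance, a minimal restructuring along those lines (a sketch; the namespace and URLs come from the question):

;; Defer the call so nothing runs while the namespace is loading.
(defn -main [& args]
  (prn (get-urls '("http://lethain.com" "http://willarson.com"))))

With :main parallel-fetch in project.clj, lein run will invoke -main only after loading completes.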
Incidentally, the transaction is set up when you use :reload-all, but not without it. (See the private functions load-lib, which checks for the presence of :reload-all and decides to use the private function load-all if it's there, and load-all itself, which is where the transaction is set up. Here's a link to the 1.5.1 source.)

How to load resources from a specific .jar file using clojure.java.io

In clojure.java.io there is an io/resource function, but as far as I can tell it only loads resources from the classpath of the currently running program. Is there a way to specify the .jar file that the resource is in?
For example:
I have a jar file: /path/to/abc.jar
abc.jar when unzipped contains some/text/output.txt in the root of the unzipped directory
output.txt contains the string "The required text that I want."
I need functions that can do these operations:
(list-jar "/path/to/abc.jar" "some/text/")
;; => "output.txt"
(read-from-jar "/path/to/abc.jar" "some/text/output.txt")
;; => "The required text that I want"
Thanks in advance!
From Ankur's comments, I managed to piece together the functions that I needed.
The java.util.jar.JarFile object does the job. You can call the method (.entries (JarFile. a-path)) to get the list of files, but instead of returning a tree structure,
i.e.:
/dir-1
  /file-1
  /file-2
  /dir-2
    /file-3
  /dir-3
    /file-4
it returns a flat enumeration of filenames:
/dir-1/file-1, /dir-1/file-2, /dir-1/dir-2/file-3, /dir-1/dir-3/file-4
The functions I needed are defined below:
(import java.util.jar.JarFile)

(defn list-jar [jar-path inner-dir]
  (if-let [jar (JarFile. jar-path)]
    ;; note: last on a string yields a character, so compare with \/
    (let [inner-dir (if (and (not= "" inner-dir) (not= \/ (last inner-dir)))
                      (str inner-dir "/")
                      inner-dir)
          entries (enumeration-seq (.entries jar))
          names (map (fn [x] (.getName x)) entries)
          snames (filter (fn [x] (= 0 (.indexOf x inner-dir))) names)
          fsnames (map #(subs % (count inner-dir)) snames)]
      fsnames)))

(defn read-from-jar [jar-path inner-path]
  (if-let [jar (JarFile. jar-path)]
    (if-let [entry (.getJarEntry jar inner-path)]
      (slurp (.getInputStream jar entry)))))
Usage:
(read-from-jar "/Users/Chris/.m2/repository/lein-newnew/lein-newnew/0.3.5/lein-newnew-0.3.5.jar"
               "leiningen/new.clj")
;=> "The list of built-in templates can be shown with `lein help new`....."
(list-jar "/Users/Chris/.m2/repository/lein-newnew/lein-newnew/0.3.5/lein-newnew-0.3.5.jar" "leiningen")
;; => (new/app/core.clj new/app/project.clj .....)