Slow lexer in Clojure

I'm trying to write a simple lexer in clojure. For now, it recognizes only white-space separated identifiers.
(refer 'clojure.set :only '[union])
(defn char-range-set
"Generate set containing all characters in the range [from; to]"
[from to]
(set (map char (range (int from) (inc (int to))))))
(def ident-initial (union (char-range-set \A \Z) (char-range-set \a \z) #{\_}))
(def ident-subseq (union ident-initial (char-range-set \0 \9)))
(defn update-lex [lex token source]
(assoc (update lex :tokens conj token) :source source))
(defn scan-identifier [lex]
(assert (ident-initial (first (:source lex))))
(loop [[c & cs :as source] (rest (:source lex))
value [(first (:source lex))]]
(if (ident-subseq c)
(recur cs (conj value c))
(update-lex lex {:type :identifier :value value} source))))
(defn scan [{tokens :tokens [c & cs :as source] :source :as lex}]
(cond
(Character/isWhitespace c) (assoc lex :source cs)
(ident-initial c) (scan-identifier lex)))
(defn tokenize [source]
(loop [lex {:tokens [] :source source}]
(if (empty? (:source lex))
(:tokens lex)
(recur (scan lex)))))
(defn measure-tokenizer [n]
(let [s (clojure.string/join (repeat n "abcde "))]
(time (tokenize s))
(* n (count "abcde "))))
The lexer takes about 15 seconds to process approximately 6 million characters.
=> (measure-tokenizer 1000000)
"Elapsed time: 15865.909399 msecs"
After that, I converted all maps and vectors into transients. This gave no improvement.
I've also implemented an analogous algorithm in C++. It takes only 0.2 seconds for the same input.
My question is: how can I improve my code? Am I using Clojure data structures incorrectly?
UPDATE:
So here's my C++ code.
#include <iostream>
#include <vector>
#include <chrono>
#include <unordered_set>
#include <cstdlib>
#include <string>
#include <cctype>
using namespace std;
struct Token
{
enum { IDENTIFIER = 1 };
int type;
string value;
};
class Lexer
{
public:
Lexer(const std::string& source)
: mSource(source)
, mIndex(0)
{
initCharSets();
}
std::vector<Token> tokenize()
{
while (mIndex < mSource.size())
{
scan();
}
return mResult;
}
private:
void initCharSets()
{
for (char c = 'a'; c <= 'z'; ++c)
mIdentifierInitial.insert(c);
for (char c = 'A'; c <= 'Z'; ++c)
mIdentifierInitial.insert(c);
mIdentifierInitial.insert('_');
mIdentifierSubsequent = mIdentifierInitial;
for (char c = '0'; c <= '9'; ++c)
mIdentifierSubsequent.insert(c);
}
void scan()
{
skipSpaces();
if (mIndex < mSource.size())
{
if (mIdentifierInitial.find(mSource[mIndex]) != mIdentifierInitial.end())
{
scanIdentifier();
}
mResult.push_back(mToken);
}
}
void scanIdentifier()
{
size_t i = mIndex;
while ((i < mSource.size()) && (mIdentifierSubsequent.find(mSource[i]) != mIdentifierSubsequent.end()))
++i;
mToken.type = Token::IDENTIFIER;
mToken.value = mSource.substr(mIndex, i - mIndex);
mIndex = i;
}
void skipSpaces()
{
while ((mIndex < mSource.size()) && std::isspace(mSource[mIndex]))
++mIndex;
}
unordered_set<char> mIdentifierInitial;
unordered_set<char> mIdentifierSubsequent;
string mSource;
size_t mIndex;
vector<Token> mResult;
Token mToken;
};
void measureBigString(int n)
{
std::string substri = "jobbi ";
std::string bigstr;
for (int i =0 ;i < n;++i)
bigstr += substri;
Lexer lexer(bigstr);
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
lexer.tokenize();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << n << endl;
std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() <<std::endl;
std::cout << "\n\n\n";
}
int main()
{
measureBigString(1000000);
return 0;
}

I don't see anything obviously wrong with this code. I wouldn't expect transients to help you too much as you are not bulk loading, but rather updating once per loop (plus I doubt that's actually the slowest part).
My guess at which things are slow:
Checking the character in the set (requires hashing and traversing the internal hash tree). Instead of building sets, writing functions that do an int-based check on the character ranges (> this, < that, etc.) would not be as pretty but would almost certainly be faster, particularly if you take care to use primitive type hints and avoid boxing to objects.
Each time through the loop updates a value nested in a hashmap. That's not going to be the fastest operation. Keeping the token vector as an independent transient vector would be faster and would avoid rebuilding the upper tree. Depending on how far outside idiomatic Clojure and into Java land you want to go, you could also use a mutable ArrayList. It's dirty, but it's fast; if you constrain the scope of who is exposed to that mutable state, I would consider something like this. Conceptually, it's the same thing as a transient vector.
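A rough sketch of both suggestions, assuming ASCII-only identifiers: the set lookups become integer range checks, the source string is walked by index with `.charAt` rather than as a seq, and tokens accumulate in a transient vector kept outside any map. The names `ident-initial?`, `ident-subseq?` and `tokenize-indexed` are made up for this illustration, and token values are strings here rather than vectors of characters as in the question.

```clojure
(defn ident-initial?
  "True when c is an ASCII letter or an underscore."
  [c]
  (let [n (int c)]
    (or (<= 65 n 90)      ; A-Z
        (<= 97 n 122)     ; a-z
        (== n 95))))      ; _

(defn ident-subseq?
  "True when c may appear after the first character of an identifier."
  [c]
  (or (ident-initial? c)
      (<= 48 (int c) 57))) ; 0-9

(defn tokenize-indexed
  "Returns a vector of identifier tokens found in the string source."
  [^String source]
  (let [n (.length source)]
    (loop [i 0
           tokens (transient [])]
      (if (>= i n)
        (persistent! tokens)
        (let [c (.charAt source i)]
          (cond
            ;; Skip whitespace by advancing the index only.
            (Character/isWhitespace c)
            (recur (inc i) tokens)

            ;; Find the end of the identifier, then cut it out in one step.
            (ident-initial? c)
            (let [end (loop [j (inc i)]
                        (if (and (< j n) (ident-subseq? (.charAt source j)))
                          (recur (inc j))
                          j))]
              (recur end (conj! tokens {:type :identifier
                                        :value (subs source i end)})))

            :else
            (throw (ex-info "Unexpected character" {:index i :char c}))))))))

(tokenize-indexed "abcde  x_1 ")
```

Whether this closes the gap with the C++ version would need measuring; it only shows the shape of the change.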

UPDATE:
Another significant improvement comes from avoiding vector destructuring. Replacing code like this:
(let [[c & cs] xs] ...)
with:
(let [c (first xs)
cs (rest xs)] ...)
gives another 2x performance improvement. Altogether you get a 26x speedup, which should be on par with the C++ implementation.
So in short:
Type hints avoid all the reflective calls.
Records give you optimised access to and update of properties.
first and rest avoid vector destructuring, which uses nth/nthFrom and performs sequential access on a seq.
Hopefully vector destructuring can be optimised to avoid nthFrom for common cases like this (where only first and rest appear in the binding).
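To show the first/rest pattern in isolation, here is a small sketch. `split-identifier` is a hypothetical helper written for this illustration, not part of the lexer; it takes the character predicate as an argument so that it stands on its own.

```clojure
(defn split-identifier
  "Returns [identifier-chars remaining-seq], walking source with first/rest
  instead of destructuring [c & cs] on every iteration."
  [ident-char? source]
  (loop [source (seq source)
         value  []]
    (let [c (first source)]
      (if (and c (ident-char? c))
        (recur (rest source) (conj value c))
        [value source]))))

;; A set works as the predicate, as in the original code.
(split-identifier #{\a \b \c} "abc de")
```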
FIRST TUNING - with type hints and a record:
You can also use a record instead of a generic map:
(refer 'clojure.set :only '[union])
(defn char-range-set
"Generate set containing all characters in the range [from; to]"
[from to]
(set (map char (range (int from) (inc (int to))))))
(def ident-initial (union (char-range-set \A \Z) (char-range-set \a \z) #{\_}))
(def ident-subseq (union ident-initial (char-range-set \0 \9)))
(defrecord Token [type value])
(defrecord Lex [tokens source])
(defn update-lex [^Lex lex ^Token token source]
(assoc (update lex :tokens conj token) :source source))
(defn scan-identifier [^Lex lex]
(let [[x & xs] (:source lex)]
(loop [[c & cs :as source] xs
value [x]]
(if (ident-subseq c)
(recur cs (conj value c))
(update-lex lex (Token. :identifier value) source)))))
(defn scan [^Lex lex]
(let [[c & cs] (:source lex)
tokens (:tokens lex)]
(cond
(Character/isWhitespace ^char c) (assoc lex :source cs)
(ident-initial c) (scan-identifier lex))))
(defn tokenize [source]
(loop [lex (Lex. [] source)]
(if (empty? (:source lex))
(:tokens lex)
(recur (scan lex)))))
(use 'criterium.core)
(defn measure-tokenizer [n]
(let [s (clojure.string/join (repeat n "abcde "))]
(bench (tokenize s))
(* n (count "abcde "))))
(measure-tokenizer 1000)
Using criterium:
Evaluation count : 128700 in 60 samples of 2145 calls.
Execution time mean : 467.378916 µs
Execution time std-deviation : 329.455994 ns
Execution time lower quantile : 466.867909 µs ( 2.5%)
Execution time upper quantile : 467.984646 µs (97.5%)
Overhead used : 1.502982 ns
Comparing to the original code:
Evaluation count : 9960 in 60 samples of 166 calls.
Execution time mean : 6.040209 ms
Execution time std-deviation : 6.630519 µs
Execution time lower quantile : 6.028470 ms ( 2.5%)
Execution time upper quantile : 6.049443 ms (97.5%)
Overhead used : 1.502982 ns
The optimized version is roughly 13x faster. With n=1,000,000, it now takes about 0.5 seconds.

Related

Clojure translate from Java

I'm starting to learn Clojure and have decided that doing some projects on HackerRank is a good way to do that. What I'm finding is that my Clojure solutions are horribly slow. I'm assuming that's because I'm still thinking imperatively or just don't know enough about how Clojure operates. The latest problem I wrote solutions for was Down To Zero II. Here's my Java code
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class Solution {
private static final int MAX_NUMBER = 1000000;
private static final BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
public static int[] precompute() {
int[] values = new int[MAX_NUMBER];
values[0] = 0;
values[1] = 1;
for (int i = 1; i < MAX_NUMBER; i += 1) {
if ((values[i] == 0) || (values[i] > (values[i - 1] + 1))) {
values[i] = (values[i - 1] + 1);
}
for (int j = 1; j <= i && (i * j) < MAX_NUMBER; j += 1) {
int mult = i * j;
if ((values[mult] == 0) || (values[mult] > (values[i] + 1))) {
values[mult] = values[i] + 1;
}
}
}
return values;
}
public static void main(String[] args) throws Exception {
int numQueries = Integer.parseInt(reader.readLine());
int[] values = Solution.precompute();
for (int loop = 0; loop < numQueries; loop += 1) {
int query = Integer.parseInt(reader.readLine());
System.out.println(values[query]);
}
}
}
My Clojure implementation is
(def MAX-NUMBER 1000000)
(defn set-i [out i]
(cond
(= 0 i) (assoc out i 0)
(= 1 i) (assoc out i 1)
(or (= 0 (out i))
(> (out i) (inc (out (dec i)))))
(assoc out i (inc (out (dec i))))
:else out))
(defn set-j [out i j]
(let [mult (* i j)]
(if (or (= 0 (out mult)) (> (out mult) (inc (out i))))
(assoc out mult (inc (out i)))
out)))
;--------------------------------------------------
; Precompute the values for all possible inputs
;--------------------------------------------------
(defn precompute []
(loop [i 0 out (vec (repeat MAX-NUMBER 0))]
(if (< i MAX-NUMBER)
(recur (inc i) (loop [j 1 new-out (set-i out i)]
(if (and (<= j i) (< (* i j) MAX-NUMBER))
(recur (inc j) (set-j new-out i j))
new-out)))
out)))
;--------------------------------------------------
; Read the number of queries
;--------------------------------------------------
(def num-queries (Integer/parseInt (read-line)))
;--------------------------------------------------
; Precompute the solutions
;--------------------------------------------------
(def values (precompute))
;--------------------------------------------------
; Read and process each query
;--------------------------------------------------
(loop [iter 0]
(if (< iter num-queries)
(do
(println (values (Integer/parseInt (read-line))))
(recur (inc iter)))))
The Java code runs in about 1/10 of a second on my machine, while the Clojure code takes close to 2 seconds. Since it's the same machine, with the same JVM, it means I'm doing something wrong in Clojure.
How do people go about trying to translate this type of code? What are the gotchas that are causing it to be so much slower?
I'm going to do some transformations to your code (which might be slightly outside of what you were originally asking)
and then address your more specific questions.
I know it's almost two years later, but after running across your question and spending way too much time fighting with
HackerRank and its time limits, I thought I would post an answer. Does achieving a solution within HR's environment and
time limits make us better Clojure programmers? I didn't learn the answer to that. But I'll share what I did learn.
I found a slightly slimmer version of your same algorithm. It still has two loops, but the update only happens once in
the inner loop, and many of the conditions are handled in a min function. Here is my adaptation of it:
(defn compute
"Returns a vector of down-to-zero counts for all numbers from 0 to m."
[m]
(loop [i 2 out (vec (range (inc m)))]
(if (<= i m)
(recur (inc i)
(loop [j 1 out out]
(let [ij (* i j)]
(if (and (<= j i) (<= ij m))
(recur (inc j)
(assoc out ij (min (out ij) ;; current value
(inc (out (dec ij))) ;; steps from value just below
(inc (out i))))) ;; steps from a factor
out))))
out)))
Notice we're still using loop/recur (twice), we're still using a vector to hold the output. But some differences:
We initialize out to incrementing integers. This is the worst case number of steps for every value, and once
initialized, we don't have to test that a value equals 0 and we can skip indices 0 and 1 and start the outer loop at
index 2. (We also fix a bug in your original and make sure out contains MAX-NUMBER+1 values.)
All three tests happen inside a min function that encapsulates the original logic: a value will be
updated only if it's a shorter number of steps from the number just below it, or from one of its factors.
The tests are now simple enough that we don't need to break them out into separate functions.
This code (along with your original) is fast enough to pass some of the test cases in HR, but not all. Here are some
things to speed this up:
Use int-array instead of vec. This means we'll use aset instead of assoc and aget instead of calling out
with an index. It also means that loop/recur isn't the best structure anymore (because we are no longer passing
around new versions of an immutable vector, but actually mutating a Java int array); instead we'll use doseq.
Type hints. This alone makes a huge speed difference. When testing your code, include a form at the top (set! *warn-on-reflection* true) and you'll see where Clojure is having to do extra work to figure out what types it is
dealing with.
Use custom I/O functions to read the input. HR's boilerplate I/O code is supposed to let you focus on solving the
challenge and not worry about I/O, but it is basically garbage, and often the culprit behind your program timing out.
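As a small illustration of the type-hint tip (the function names `len-reflective` and `len-hinted` are hypothetical), the pair below differs only in the hint. With the flag set, compiling the first definition should print a reflection warning, while the second compiles to a direct method call.

```clojure
(set! *warn-on-reflection* true)

;; No hint: the compiler cannot resolve .length at compile time, so it
;; falls back to a reflective call at run time.
(defn len-reflective [s]
  (.length s))

;; With a ^String hint the method is resolved at compile time.
(defn len-hinted [^String s]
  (.length s))

(len-hinted "abcde")
```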
Below is a version that incorporates the tips above and runs fast enough to pass all test cases. I've included my custom
I/O approach that I've been using for all my HR challenges. One nice benefit of using doseq is we can include a
:let and a :while clause within the binding form, removing some of the indentation within the body of doseq. Also
notice a few strategically placed type hints that really speed up the program.
(ns down-to-zero-int-array)
(set! *warn-on-reflection* true)
(defn compute
"Returns an int array of down-to-zero counts for all numbers from 0 to m."
^ints [m]
(let [out ^ints (int-array (inc m) (range (inc m)))]
(doseq [i (range 2 (inc m)) j (range 1 (inc i)) :let [ij (* i j)] :while (<= ij m)]
(aset out ij (min (aget out ij)
(inc (aget out (dec ij)))
(inc (aget out i)))))
out))
(let [tokens ^java.io.StreamTokenizer
(doto (java.io.StreamTokenizer. (java.io.BufferedReader. *in*))
(.parseNumbers))]
(defn next-int
"Read next integer from input. As fast as `read-line` for a single value,
and _much_ faster than `read-line`+`split` for multiple values on same line."
[]
(.nextToken tokens)
(int (.-nval tokens))))
(def MAX 1000000)
(let [q (next-int)
down-to-zero (compute MAX)]
(doseq [n (repeatedly q next-int)]
(println (aget down-to-zero n))))

Optimize tail-recursion in Clojure: exponential moving average

I'm new to Clojure and trying to implement an exponential moving average function using tail recursion. After battling a little with stack overflows using lazy-seq and concat, I got to the following implementation which works, but is very slow:
(defn ema3 [c a]
(loop [ct (rest c) res [(first c)]]
(if (= (count ct) 0)
res
(recur
(rest ct)
(into;NOT LAZY-SEQ OR CONCAT
res
[(+ (* a (first ct)) (* (- 1 a) (last res)))]
)
)
)
)
)
For a 10,000 item collection, Clojure will take around 1300ms, whereas a Python Pandas call such as
s.ewm(alpha=0.3, adjust=True).mean()
will only take 700 us. How can I reduce that performance gap? Thank you,
Personally I would do this lazily with reductions. It's simpler to do than using loop/recur or building up a result vector by hand with reduce, and it also means you can consume the result as it is built up, rather than needing to wait for the last element to be finished before you can look at the first one.
If you care most about throughput then I suppose Taylor Wood's reduce is the best approach, but the lazy solution is only very slightly slower and is much more flexible.
(defn ema3-reductions [c a]
(let [a' (- 1 a)]
(reductions
(fn [ave x]
(+ (* a x)
(* a' ave)))
(first c)
(rest c))))
user> (quick-bench (dorun (ema3-reductions (range 10000) 0.3)))
Evaluation count : 288 in 6 samples of 48 calls.
Execution time mean : 2.336732 ms
Execution time std-deviation : 282.205842 µs
Execution time lower quantile : 2.125654 ms ( 2.5%)
Execution time upper quantile : 2.686204 ms (97.5%)
Overhead used : 8.637601 ns
nil
user> (quick-bench (dorun (ema3-reduce (range 10000) 0.3)))
Evaluation count : 270 in 6 samples of 45 calls.
Execution time mean : 2.357937 ms
Execution time std-deviation : 26.934956 µs
Execution time lower quantile : 2.311448 ms ( 2.5%)
Execution time upper quantile : 2.381077 ms (97.5%)
Overhead used : 8.637601 ns
nil
Honestly in that benchmark you can't even tell the lazy version is slower than the vector version. I think my version is still slower, but it's a vanishingly trivial difference.
You can also speed things up if you tell Clojure to expect doubles, so it doesn't have to keep double-checking the types of a, c, and so on.
(defn ema3-reductions-prim [c ^double a]
(let [a' (- 1.0 a)]
(reductions (fn [ave x]
(+ (* a (double x))
(* a' (double ave))))
(first c)
(rest c))))
user> (quick-bench (dorun (ema3-reductions-prim (range 10000) 0.3)))
Evaluation count : 432 in 6 samples of 72 calls.
Execution time mean : 1.720125 ms
Execution time std-deviation : 385.880730 µs
Execution time lower quantile : 1.354539 ms ( 2.5%)
Execution time upper quantile : 2.141612 ms (97.5%)
Overhead used : 8.637601 ns
nil
Another 25% speedup, not too bad. I expect you could squeeze out a bit more by using primitives in either a reduce solution or with loop/recur if you were really desperate. It would be especially helpful in a loop because you wouldn't have to keep boxing and unboxing the intermediate results between double and Double.
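A sketch of the loop/recur variant hinted at above, assuming the input is a finite collection of numbers. `ema3-loop` is a hypothetical name, and unlike the versions above it also coerces the first element to a double. The running average stays a primitive double between iterations and results are collected in a transient vector; it has not been benchmarked here.

```clojure
(defn ema3-loop
  "Exponential moving average of c with smoothing factor a, as a vector."
  [c ^double a]
  (let [a' (- 1.0 a)]
    (if-let [s (seq c)]
      (loop [xs  (next s)
             ave (double (first s))
             res (transient [(double (first s))])]
        (if xs
          ;; ave and ave' stay unboxed doubles inside the loop.
          (let [ave' (+ (* a (double (first xs))) (* a' ave))]
            (recur (next xs) ave' (conj! res ave')))
          (persistent! res)))
      [])))

(ema3-loop [1 2 3] 0.5) ; => [1.0 1.5 2.25]
```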
If res is a vector (which it is in your example) then using peek instead of last yields much better performance:
(defn ema3 [c a]
(loop [ct (rest c) res [(first c)]]
(if (= (count ct) 0)
res
(recur
(rest ct)
(into
res
[(+ (* a (first ct)) (* (- 1 a) (peek res)))])))))
Your example on my computer:
(time (ema3 (range 10000) 0.3))
"Elapsed time: 990.417668 msecs"
Using peek:
(time (ema3 (range 10000) 0.3))
"Elapsed time: 9.736761 msecs"
Here's a version using reduce that's even faster on my computer:
(defn ema3 [c a]
(reduce (fn [res ct]
(conj
res
(+ (* a ct)
(* (- 1 a) (peek res)))))
[(first c)]
(rest c)))
;; "Elapsed time: 0.98824 msecs"
Take these timings with a grain of salt. Use something like criterium for more thorough benchmarking. You might be able to squeeze out more gains using mutability/transients.

Clojure reduced function

I know there is the reduced function to terminate such an infinite reduction, but I am curious why the second version (with range without an argument) doesn't terminate the reduction when it reaches 105?
user=> (reduce (fn [a v] (if (< a 100) (+ a v) a)) (range 2000))
105
user=> (reduce (fn [a v] (if (< a 100) (+ a v) a)) (range))
As you mention, and for those who come along later googling for reduced: the reducing function does have the ability to declare the final answer of the reduction explicitly, with the guarantee that no further input will be consumed, by returning the result of calling (reduced the-final-answer)
user> (reduce (fn [a v]
(if (< a 100)
(+ a v)
(reduced a)))
(range))
105
In this case, once the collected result passes 100, the next iteration stops the reduction rather than contributing its value to the answer. This does consume one extra value from the input stream that is not included in the result.
user> (reduce (fn [a v]
(let [res (+ a v)]
(if (< res 100)
res
(reduced res))))
(range))
105
This finishes the reduction as soon as threshold is met and does not consume any extra values from the lazy (and infinite) collection.
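The difference between the two variants can be made visible by counting how often the reducing function is called. This is only an illustrative sketch; `reduce-counting`, `stop-on-next-step` and `stop-immediately` are hypothetical names.

```clojure
(defn reduce-counting
  "Reduces (range) with f and returns [result number-of-calls-to-f]."
  [f]
  (let [calls  (atom 0)
        result (reduce (fn [a v]
                         (swap! calls inc)
                         (f a v))
                       (range))]
    [result @calls]))

;; Checks the accumulator on entry, so it needs one more call to notice.
(def stop-on-next-step
  (fn [a v] (if (< a 100) (+ a v) (reduced a))))

;; Checks the new sum before returning it, so it stops straight away.
(def stop-immediately
  (fn [a v] (let [res (+ a v)] (if (< res 100) res (reduced res)))))

(reduce-counting stop-on-next-step) ; => [105 15]
(reduce-counting stop-immediately)  ; => [105 14]
```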
Because reduce applies the function to every element in the sequence (range), (range) is fully realized.
(range)
produces an infinite sequence, and
(fn [a v] (if (< a 100) (+ a v) a))
doesn't stop the loop, it is being applied to every element.
Executed at the REPL
(reduce (fn [a v] (if (< a 100) (+ a v) a)) (range))
means we eagerly want to get and print the result, so the REPL hangs.

Why is this looping function so slow compared to map?

I looked at map's source code, which basically keeps creating lazy sequences. I would think that iterating over a collection and adding to a transient vector would be faster, but clearly it isn't. What don't I understand about Clojure's performance behavior?
;=> (time (do-with / (range 1 1000) (range 1 1000)))
;"Elapsed time: 23.1808 msecs"
;
; vs
;=> (time (doall (map #(/ %1 %2) (range 1 1000) (range 1 1000))))
;"Elapsed time: 2.604174 msecs"
(defn do-with
[fn coll1 coll2]
(let [end (count coll1)]
(loop [i 0
res (transient [])]
(if
(= i end)
(persistent! res)
(let [x (nth coll1 i)
y (nth coll2 i)
r (fn x y)]
(recur (inc i) (conj! res r)))
))))
In order of conjectured impact on relative results:
Your do-with function uses nth to access the individual items in the input collections. nth operates in linear time on ranges, making do-with quadratic. Needless to say, this will kill performance on large collections.
range produces chunked seqs and map handles those extremely efficiently. (Essentially it produces chunks of up to 32 elements -- here it will in fact be exactly 32 -- by running a tight loop over the internal array of each input chunk in turn, placing results in internal arrays of output chunks.)
Benchmarking with time doesn't give you steady state performance. (Which is why one should really use a proper benchmarking library; in the case of Clojure, Criterium is the standard solution.)
Incidentally, (map #(/ %1 %2) xs ys) can simply be written as (map / xs ys).
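A quick REPL-style sketch of two of the points above. `xs` and `quotients` are illustrative names, and `mapv` is swapped in here as an eager, vector-returning alternative to `(doall (map ...))`; the answer itself does not use it.

```clojure
(def xs (range 1 1000))

;; A range is handed out as a chunked seq.
(chunked-seq? (seq xs)) ; => true

;; / can be passed directly; no wrapping lambda is needed.
(def quotients (mapv / xs xs))

(count quotients)       ; => 999
```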
Update:
I've benchmarked the map version, the original do-with and a new do-with version with Criterium, using (range 1 1000) as both inputs in each case (as in the question text), obtaining the following mean execution times:
;;; (range 1 1000)
new do-with          170.383334 µs
(doall (map ...))    230.756753 µs
original do-with      15.624444 ms
Additionally, I've repeated the benchmark using a vector stored in a Var as input rather than ranges (that is, with (def r (vec (range 1 1000))) at the start and using r as both collection arguments in each benchmark). Unsurprisingly, the original do-with came in first -- nth is very fast on vectors (plus using nth with a vector avoids all the intermediate allocations involved in seq traversal).
;;; (vec (range 1 1000))
original do-with      73.975419 µs
new do-with           87.399952 µs
(doall (map ...))    153.493128 µs
Here's the new do-with with linear time complexity:
(defn do-with [f xs ys]
(loop [xs (seq xs)
ys (seq ys)
ret (transient [])]
(if (and xs ys)
(recur (next xs)
(next ys)
(conj! ret (f (first xs) (first ys))))
(persistent! ret))))

Why is Clojure much faster than mit-scheme for equivalent functions?

I found this code in Clojure to sieve out the prime numbers below n:
(defn sieve [n]
(let [n (int n)]
"Returns a list of all primes from 2 to n"
(let [root (int (Math/round (Math/floor (Math/sqrt n))))]
(loop [i (int 3)
a (int-array n)
result (list 2)]
(if (>= i n)
(reverse result)
(recur (+ i (int 2))
(if (< i root)
(loop [arr a
inc (+ i i)
j (* i i)]
(if (>= j n)
arr
(recur (do (aset arr j (int 1)) arr)
inc
(+ j inc))))
a)
(if (zero? (aget a i))
(conj result i)
result)))))))
Then I wrote the equivalent (I think) code in Scheme (I use mit-scheme)
(define (sieve n)
(let ((root (round (sqrt n)))
(a (make-vector n)))
(define (cross-out t to dt)
(cond ((> t to) 0)
(else
(vector-set! a t #t)
(cross-out (+ t dt) to dt)
)))
(define (iter i result)
(cond ((>= i n) (reverse result))
(else
(if (< i root)
(cross-out (* i i) (- n 1) (+ i i)))
(iter (+ i 2) (if (vector-ref a i)
result
(cons i result))))))
(iter 3 (list 2))))
The timing results are:
For Clojure:
(time (reduce + 0 (sieve 5000000)))
"Elapsed time: 168.01169 msecs"
For mit-scheme:
(time (fold + 0 (sieve 5000000)))
"Elapsed time: 3990 msecs"
Can anyone tell me why mit-scheme is more than 20 times slower?
update: "the difference was in interpreted/compiled mode. After I compiled the mit-scheme code, it was running comparably fast. – abo-abo Apr 30 '12 at 15:43"
Modern incarnations of the Java Virtual Machine have extremely good performance when compared to interpreted languages. A significant amount of engineering resource has gone into the JVM, in particular the hotspot JIT compiler, highly tuned garbage collection and so on.
I suspect the difference you are seeing is primarily down to that. For example, if you look at "Are the Java programs faster?" you can see a comparison of Java vs Ruby which shows that Java outperforms by a factor of 220 on one of the benchmarks.
You don't say what JVM options you are running your clojure benchmark with. Try running java with the -Xint flag which runs in pure interpreted mode and see what the difference is.
Also, it's possible that your example is too small to really warm-up the JIT compiler. Using a larger example may yield an even larger performance difference.
To give you an idea of how much HotSpot is helping you, I ran your code on my 2011 MBP (quad core 2.2 GHz), using Java 1.6.0_31 with default opts (-server hotspot) and in interpreted mode (-Xint), and saw a large difference:
; with -server hotspot (best of 10 runs)
>(time (reduce + 0 (sieve 5000000)))
"Elapsed time: 282.322 msecs"
838596693108
; in interpreted mode using -Xint cmdline arg
> (time (reduce + 0 (sieve 5000000)))
"Elapsed time: 3268.823 msecs"
838596693108
As to comparing Scheme and Clojure code, there were a few things to simplify at the Clojure end:
don't rebind the mutable array in loops;
remove many of those explicit primitive coercions; this makes no difference to performance. As of Clojure 1.3 literals in function calls compile to primitives if such a function signature is available, and generally the difference in performance is so small that it gets quickly drowned by any other operations happening in a loop;
add a primitive long annotation into the fn signature, thus removing the rebinding of n;
call to Math/floor is not needed -- the int coercion has the same semantics.
Code:
(defn sieve [^long n]
(let [root (int (Math/sqrt n))
a (int-array n)]
(loop [i 3, result (list 2)]
(if (>= i n)
(reverse result)
(do
(when (< i root)
(loop [inc (+ i i), j (* i i)]
(when (< j n) (aset a j 1) (recur inc (+ j inc)))))
(recur (+ i 2) (if (zero? (aget a i))
(conj result i)
result)))))))
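When simplifying a sieve like this, it helps to cross-check the output for small n against something obviously correct. The sketch below is one way to do that by trial division; `naive-primes` is a hypothetical helper, far too slow for benchmarking, and it returns the primes below n to match the sieve's upper bound.

```clojure
(defn naive-primes
  "Primes below n by trial division; slow, intended for cross-checking only."
  [n]
  (filter (fn [k]
            ;; k is prime when no d with d*d <= k divides it.
            (every? #(pos? (rem k %))
                    (take-while #(<= (* % %) k) (range 2 k))))
          (range 2 n)))

(naive-primes 30) ; => (2 3 5 7 11 13 17 19 23 29)

;; Possible use, given some sieve function in scope:
;; (= (sieve 1000) (naive-primes 1000))
```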