I have some Clojure code that is simulating and then processing numerical data. The data are basically vectors of double values; the processing mainly involves summing their values in various ways. I will include some code below, but my question is (I think) more general - I just don't have a clue how to interpret the hprof results.
Anyway, my test code is:
(defn spin [n]
  (let [c 6000
        signals (spin-signals c)]
    (doseq [_ (range n)] (time (spin-voxels c signals)))))

(defn -main []
  (spin 4))
where spin-voxels should be more expensive than spin-signals (especially when repeated multiple times). I can give the lower level routines, but I think this question is more about me not understanding basics of the traces (below).
When I compile this with lein and then do some simple profiling:
> java -cp classes:lib/clojure-1.3.0-beta1.jar -agentlib:hprof=cpu=samples,depth=10,file=hprof.vec com.isti.compset.stack
"Elapsed time: 14118.772924 msecs"
"Elapsed time: 10082.015672 msecs"
"Elapsed time: 9212.522973 msecs"
"Elapsed time: 12968.23877 msecs"
Dumping CPU usage by sampling running threads ... done.
and the profile trace looks like:
CPU SAMPLES BEGIN (total = 4300) Sun Aug 28 15:51:40 2011
rank self accum count trace method
1 5.33% 5.33% 229 300791 clojure.core$seq.invoke
2 5.21% 10.53% 224 300786 clojure.core$seq.invoke
3 5.05% 15.58% 217 300750 clojure.core$seq.invoke
4 4.93% 20.51% 212 300787 clojure.lang.Numbers.add
5 4.74% 25.26% 204 300799 clojure.core$seq.invoke
6 2.60% 27.86% 112 300783 clojure.lang.RT.more
7 2.51% 30.37% 108 300803 clojure.lang.Numbers.multiply
8 2.42% 32.79% 104 300788 clojure.lang.RT.first
9 2.37% 35.16% 102 300831 clojure.lang.RT.more
10 2.37% 37.53% 102 300840 clojure.lang.Numbers.add
which is pretty cool. Up to here, I am happy. I can see that I am wasting time with generic handling of numerical values.
So I look at my code and decide that, as a first step, I will replace vec with d-vec:
(defn d-vec [collection]
  (apply conj (vector-of :double) collection))
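As a sanity check, a quick REPL sketch (not from my original code) of what d-vec builds:

(def dv (d-vec [1.0 2.0 3.0]))
(type dv)   ;=> clojure.core.Vec, backed by a primitive double array
(nth dv 1)  ;=> 2.0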
I am not sure that will be sufficient - I suspect I will also need to add some type annotations in various places - but it seems like a good start. So I compile and profile again:
> java -cp classes:lib/clojure-1.3.0-beta1.jar -agentlib:hprof=cpu=samples,depth=10,file=hprof.d-vec com.isti.compset.stack
"Elapsed time: 15944.278043 msecs"
"Elapsed time: 15608.099677 msecs"
"Elapsed time: 16561.659408 msecs"
"Elapsed time: 15416.414548 msecs"
Dumping CPU usage by sampling running threads ... done.
Ewww. So it is significantly slower. And the profile?
CPU SAMPLES BEGIN (total = 6425) Sun Aug 28 15:55:12 2011
rank self accum count trace method
1 26.16% 26.16% 1681 300615 clojure.core.Vec.count
2 23.28% 49.45% 1496 300607 clojure.core.Vec.count
3 7.74% 57.18% 497 300608 clojure.lang.RT.seqFrom
4 5.59% 62.77% 359 300662 clojure.core.Vec.count
5 3.72% 66.49% 239 300604 clojure.lang.RT.first
6 3.25% 69.74% 209 300639 clojure.core.Vec.count
7 1.91% 71.66% 123 300635 clojure.core.Vec.count
8 1.03% 72.68% 66 300663 clojure.core.Vec.count
9 1.00% 73.68% 64 300644 clojure.lang.RT.more
10 0.79% 74.47% 51 300666 clojure.lang.RT.first
11 0.75% 75.22% 48 300352 clojure.lang.Numbers.double_array
12 0.75% 75.97% 48 300638 clojure.lang.RT.more
13 0.64% 76.61% 41 300621 clojure.core.Vec.count
14 0.62% 77.23% 40 300631 clojure.core.Vec.cons
15 0.61% 77.84% 39 300025 java.lang.ClassLoader.defineClass1
16 0.59% 78.43% 38 300670 clojure.core.Vec.cons
17 0.58% 79.00% 37 300681 clojure.core.Vec.cons
18 0.54% 79.55% 35 300633 clojure.lang.Numbers.multiply
19 0.48% 80.03% 31 300671 clojure.lang.RT.seqFrom
20 0.47% 80.50% 30 300609 clojure.lang.Numbers.add
I have included more rows here because this is the part I don't understand.
Why on earth is Vec.count appearing so often? It's a method that returns the size of the vector: a single-line lookup of a field.
I assume I am slower because I am still jumping back and forth between Double and double, and that things may improve again when I add more type annotations. But I don't understand what I have now, so I am not sure that blundering forwards makes much sense.
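By "type annotations" I mean something like this hypothetical helper (not code from my project), which sums a primitive double array without boxing:

(defn sum-doubles ^double [^doubles xs]
  ;; areduce works on the primitive array directly, so no Double boxing
  (areduce xs i acc 0.0 (+ acc (aget xs i))))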
Please, can anyone explain the dump above in general terms? I promise I don't repeatedly call count - instead I have lots of maps and reduces and a few explicit loops.
I wondered if I am perhaps being confused by the JIT? Maybe I am missing a pile of information because functions are being inlined? Oh, and I am using 1.3.0-beta1 because it appears to have more sensible number handling.
[UPDATE] I summarised my experiences at http://www.acooke.org/cute/Optimising1.html - got a 5x speedup (actually was 10x after cleaning some more and moving to 1.3) despite never understanding this.
Calling seq on a Vec object (an object created by vector-of) creates a VecSeq object.
The VecSeq created on a Vec calls Vec.count in its method internal-reduce, which is used by clojure.core/reduce.
So it seems that a vector created by vector-of calls Vec.count while reducing. And since you mentioned that the code does a lot of reducing, this seems to be the cause.
What remains spooky is that Vec.count seems to be very simple:
clojure.lang.Counted
(count [_] cnt)
a simple getter that doesn't do any counting.
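A minimal REPL sketch (mine, assuming Clojure 1.3) of the path described above:

(def v (apply conj (vector-of :double) [1.0 2.0 3.0]))
(type v)        ;=> clojure.core.Vec
(type (seq v))  ;=> clojure.core.VecSeq
(reduce + v)    ;=> 6.0, dispatched through VecSeq's internal-reduce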
Just thinking out loud, it looks like your code is doing a lot of back and forth conversion to/from seqs.
Looking at RT.seqFrom, this calls ArraySeq.createFromObject:
if (array == null || Array.getLength(array) == 0)
    return null;
So could it be that vec uses fast vector access, while d-vec forces the use of arrays and a call to the slow java.lang.reflect.Array.getLength method (which uses reflection)?
I followed this tutorial to learn Clojure (https://www.youtube.com/watch?v=zFPiPBIkAcQ at around 2:26). In the last example, the game "Snake" is programmed.
(ns ...tests.snake
  (:import
    (java.awt Color Dimension)
    (javax.swing JPanel JFrame Timer JOptionPane)
    (java.awt.event ActionListener KeyListener KeyEvent)))
...
113 (defn game-panel [frame snake apple]
114   (proxy [JPanel ActionListener KeyListener] []
115     ;JPanel
116     (paintComponent [g]
117       (proxy-super paintComponent g)
118       (paint g @apple)
119       (paint g @snake))
120     (getPreferredSize []
121       (Dimension. (* (inc field-width) point-size)
122                   (* (inc field-height) point-size)))
123     ;ActionListener
124     (actionPerformed [e]
125       (update-positions snake apple)
126       (if (lose? @snake)
127         (do
128           (reset-game snake apple)
129           (JOptionPane/showMessageDialog frame "You lose")))
130       (if (win? @snake)
131         (do
132           (reset-game snake apple)
133           (JOptionPane/showMessageDialog "You win")))
134       (.repaint this))
135     (keyPressed [e]
136       (let [direction (directions (.getKeyCode e))]
137         (if direction
138           (update-direction snake direction))))
139     (keyReleased [e])
140     (keyTyped [e])))
I get an IllegalArgumentException there when using "proxy".
; Syntax error (IllegalArgumentException) compiling new at (c:\[...]\Clojure_Project\tests\snake.clj:114:3).
; Unable to resolve classname: ...tests.snake.proxy$javax.swing.JPanel$ActionListener$KeyListener$1b88ffec
I thought at first it might be related to the fact that I am passing multiple arguments, but that doesn't seem to be the problem.
I use Visual Studio Code and the "Getting Started REPL" from Calva (because I don't know how to connect another one).
I don't know, did I forget to install something or import something?
I tried to look at the code of "proxy", but since I'm not really familiar with the language yet, it didn't help me much.
my code: https://github.com/shadowprincess/clojure-learning
When looking at your question initially, I thought that your ...tests.snake namespace was just something you elided and not the actual namespace name.
But given your repo, seems like it's the real namespace name that you're using.
That's invalid - you can't start a namespace name with a dot. Rename it to tests.snake and the error will go away.
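For reference, a minimal sketch of a valid declaration (file location assumed):

;; src/tests/snake.clj
(ns tests.snake
  (:import (javax.swing JPanel JFrame Timer JOptionPane)))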
Unfortunately, your code in the repo still won't work because there are many other errors, but they should be easy for you to figure out on your own. And as general advice - don't run your whole project by sending separate forms to the REPL. Learn how to launch it with a single command (e.g. lein run, with Leiningen) - it'll induce good practices that will be useful even when using the REPL.
I have a high load app with many users requesting it with various GET params. Imagine giving different answers to different polls. Save vote, show latest poll results.
To mitigate the back pressure issue I was thinking about creating a top-level atom to store all the latest poll results for all polls.
So the workflow is like this:
boot an app => app pulls in the latest poll results and populates the atom.
new request comes in => increment the votes counter in that atom for the specific poll, and add the vote payload to a core.async queue listener (working in a separate thread) to persist it to the database eventually (sketched below).
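Here is a minimal sketch of what I mean (all names are hypothetical, and the database call is stubbed out):

(require '[clojure.core.async :as async])

(defonce poll-results (atom {}))        ; {poll-id {option vote-count}}
(defonce vote-queue (async/chan 1024))  ; buffered queue feeding the writer

(defn record-vote! [poll-id option payload]
  ;; in-memory update, visible to the next read immediately
  (swap! poll-results update-in [poll-id option] (fnil inc 0))
  ;; hand the payload off for eventual persistence
  (async/put! vote-queue payload))

;; background thread draining the queue into the database
(async/thread
  (loop []
    (when-some [payload (async/<!! vote-queue)]
      ;; (persist-to-db! payload) <- placeholder for the real DB write
      (recur))))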
The goal I'm trying to achieve:
each new request gets the latest poll results with near-instant response time (avoiding a network call to persistent storage)
An obvious drawback of this approach is that a redeploy will result in some temporary data loss. That is not very critical; deploys can be postponed.
The reason why I'm interested in this tricky approach, and not just in using RabbitMQ/Kafka, is that it sounds like a really cool and simple architecture with very few "moving parts" (just the JVM + a database) to get the job done.
More data is always good. Let's time incrementing a counter in an atom:
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require
    [criterium.core :as crit]))

(def cum (atom 0))

(defn incr []
  (swap! cum inc))

(defn timer []
  (spy :time
    (crit/quick-bench
      (dotimes [ii 1000] (incr)))))  ; note: (incr), so the swap! actually runs

(dotest
  (timer))
with the result:
-------------------------------
Clojure 1.10.1 Java 14
-------------------------------
Testing tst.demo.core
Evaluation count : 1629096 in 6 samples of 271516 calls.
Execution time mean : 328.476758 ns
Execution time std-deviation : 37.482750 ns
Execution time lower quantile : 306.738888 ns ( 2.5%)
Execution time upper quantile : 393.249204 ns (97.5%)
Overhead used : 1.534492 ns
So 1000 calls to incr take only about 330 ns. How long does it take to ping google.com?
PING google.com (172.217.4.174) 56(84) bytes of data.
64 bytes from lax28s01-in-f14.1e100.net (172.217.4.174): icmp_seq=1 ttl=54 time=14.6 ms
64 bytes from lax28s01-in-f14.1e100.net (172.217.4.174): icmp_seq=2 ttl=54 time=14.9 ms
64 bytes from lax28s01-in-f14.1e100.net (172.217.4.174): icmp_seq=3 ttl=54 time=15.0 ms
64 bytes from lax28s01-in-f14.1e100.net (172.217.4.174): icmp_seq=4 ttl=54 time=17.8 ms
64 bytes from lax28s01-in-f14.1e100.net (172.217.4.174): icmp_seq=5 ttl=54 time=16.9 ms
Let's call it 15 ms. So the ratio is:
ratio = 15e-3 / 330e-9 => 45000x
Your operations with the atom are overwhelmed by the network I/O time, so there is no problem storing the application state in the atom, even for a large number of users.
You may also be interested to know that the developers of the Datomic database have stated that concurrency in the database is managed by a single atom.
I have around 900000 RECORDS:
(defparameter RECORDS
  '((293847 "john menk" "john.menk@example.com" 0123456789 2300 2760 "CHEQUE" 012345 "menk freeway" "high rose")
    (244841 "january agami" "j.a@example.com" 0123456789 2300 2760 "CHEQUE" 012345 "ishikawa street" "fremont apartments")
    ...))
(These are read from a file. The above code is provided only as an example. It helps show the internal structure of this data.)
For quick prototyping I use aliased names for selectors:
(defmacro alias (new-name existing-name)
  "Alias NEW-NAME to EXISTING-NAME. EXISTING-NAME has to be a function."
  `(setf (fdefinition ',new-name) #',existing-name))
(progn
  (alias account-number first)
  (alias full-name second)
  (alias email third)
  (alias mobile fourth)
  (alias average-paid fifth)
  (alias highest-paid sixth)
  (alias usual-payment-mode seventh)
  (alias pincode eighth)
  (alias road ninth)
  (alias building tenth))
Now I run:
(time (loop for field in '(full-name email)
            append (loop for record in RECORDS
                         when (cl-ppcre:scan ".*?january.*?agami.*?"
                                             (funcall (symbol-function field) record))
                         collect record)))
The REPL outputs:
...
took 1,714 milliseconds (1.714 seconds) to run.
During that period, and with 4 available CPU cores,
1,698 milliseconds (1.698 seconds) were spent in user mode
9 milliseconds (0.009 seconds) were spent in system mode
40 bytes of memory allocated.
...
Define a function doing the same thing:
(defun searchx (regex &rest fields)
  (loop for field in fields
        append (loop for record in RECORDS
                     when (cl-ppcre:scan regex (funcall (symbol-function field) record))
                     collect record)))
And then call it:
(time (searchx ".*?january.*?agami.*?" 'full-name 'email))
The output:
...
took 123,389 milliseconds (123.389 seconds) to run.
992 milliseconds ( 0.992 seconds, 0.80%) of which was spent in GC.
During that period, and with 4 available CPU cores,
118,732 milliseconds (118.732 seconds) were spent in user mode
4,569 milliseconds ( 4.569 seconds) were spent in system mode
2,970,867,648 bytes of memory allocated.
501 minor page faults, 0 major page faults, 0 swaps.
...
It's almost 70 times slower?!
I thought maybe it was a computer-specific issue, so I ran the same code on two different machines: a MacBook Air and a MacBook Pro. The individual timings vary, but the behaviour is consistent: on both machines, calling it as a function takes much longer than calling it directly. Surely the overhead of a single function call should not be that large.
Then I thought Clozure CL might be responsible, so I ran the same code in SBCL; even there the behaviour is similar. The difference isn't as big, but it's still pretty big: about 22 times slower.
SBCL output when running direct:
Evaluation took:
1.519 seconds of real time
1.477893 seconds of total run time (0.996071 user, 0.481822 system)
97.30% CPU
12 lambdas converted
2,583,290,520 processor cycles
492,536 bytes consed
SBCL output when running as a function:
Evaluation took:
33.522 seconds of real time
33.472137 seconds of total run time (33.145166 user, 0.326971 system)
[ Run times consist of 0.254 seconds GC time, and 33.219 seconds non-GC time. ]
99.85% CPU
56,989,918,442 processor cycles
2,999,581,336 bytes consed
Why is calling the code as a function so much slower? And how do I fix it?
The difference is probably due to the regular expression.
Here the regex is a literal string:
(cl-ppcre:scan ".*?january.*?agami.*?"
(funcall (symbol-function field) record))
The cl-ppcre:scan function has a compiler macro that detects this case and generates a (load-time-value (cl-ppcre:create-scanner ...)) expression; since the literal string cannot possibly depend on runtime values, this is acceptable, and it means the scanner for the literal regex is built once, at load time, rather than on every call.
The compiler macro might be applied in your test too, in which case the load-time-value is probably executed only once.
In the following code, however, the regular expression is a runtime value, obtained as an input of the function:
(defun searchx (regex &rest fields)
(loop for field in fields
append (loop for record in RECORDS
when (cl-ppcre:scan regex (funcall (symbol-function field) record))
collect record)))
In that case, the scanner object is built when evaluating scan, i.e. each time the loop iterates over a record.
In order to test this hypothesis, you may want to do the following:
(defun searchx (regex &rest fields)
  (loop with scanner = (cl-ppcre:create-scanner regex)
        for field in fields
        append (loop for record in RECORDS
                     when (cl-ppcre:scan scanner (funcall (symbol-function field) record))
                     collect record)))
Alternatively, do not change the function but give it a scanner:
(time (searchx (cl-ppcre:create-scanner ".*?january.*?agami.*?")
               'full-name
               'email))
I'm trying to measure a function's performance by measuring the time for each iteration.
During the process, I found even if I do nothing, the results still vary quite a bit.
e.g.
volatile long count = 0;
for (int i = 0; i < N; ++i) {
    measure.begin();  // take a timestamp
    ++count;
    measure.end();    // take a second timestamp, record the delta
}
In measure.end(), I measure the time difference and keep an unordered_map to keep track of the time-count.
I've used clock_gettime as well as rdtsc, but there is always about 1% of the data points lying far away from the mean, by a factor of up to about 1000.
Here's what the above loop generates:
T: count percentile
18 117563 11.7563%
19 111821 22.9384%
21 201605 43.0989%
22 541095 97.2084%
23 2136 97.422%
24 2783 97.7003%
...
406 1 99.9994%
3678 1 99.9995%
6662 1 99.9996%
17945 1 99.9997%
18148 1 99.9998%
18181 1 99.9999%
22800 1 100%
mean:21
So whether it's ticks or ns, the worst case of 22800 is about 1000 times bigger than the mean.
I set isolcpus in grub and ran this with taskset. The simple loop does almost nothing, and the hash table used for the time-count statistics is outside of the timed region.
What am I missing?
I'm running this on a laptop with Ubuntu installed; the CPU is an Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz.
Thank you for all the answers.
The main interrupt that I couldn't stop is the local timer interrupt. It seems the new 3.10 kernel supports full tickless operation, so I'll try that one.
I'm trying to get started with Google Perf Tools to profile some CPU-intensive applications. It's a statistical calculation that dumps each step to a file using ofstream. I'm not a C++ expert, so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The _write$UNIX2003 (where can I find out what this is?) appears to come from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% is in SliceStep::DoStep and _write$UNIX2003 goes away. However, my application does not speed up, as measured by total time; the whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I'd expect, except I see no visible increase in performance (0.1 second on a 10-second calculation). The code is essentially:
ofstream outfile("out.txt");
for (...) {  // main loop
    SliceStep::DoStep();
    outfile << result;
}
outfile.close();
Update: I'm timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another easy tool: the time command. This should get you a rough idea of where your time is spent.
If the results are still not conclusive you need a better testcase:
Use a larger problem
Do a warmup before measuring. Do some loops and start any measurement afterwards (in the same process).
Tiristan: It's all in user. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
Printing 100000 lines to the console using python results in something like:
for i in xrange(100000):
print i
To console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python print.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is:
Your internal measurements and time show you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?
_write$UNIX2003 probably refers to the write POSIX system call, which performs the actual output. I/O is very slow compared to almost anything else, so it makes sense that your program is spending a lot of time there if you are writing a fair bit of output.
I'm not sure why your program doesn't speed up when you remove the output, but I can't really make a guess based only on the information you've given. It would be nice to see some of the code, or even the perftools output when the output statements are removed.
Google perftools collects samples of the call stack, so what you need is to get some visibility into those.
According to the doc, you can display the call graph at statement or address granularity. That should tell you what you need to know.
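For example, something like the following (binary and profile file names assumed) displays the samples at line granularity:

pprof --text --lines ./myapp myapp.prof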