Is it possible to use something like pointers or references in Julia, as in C/C++ or C#? I'm wondering because it would be helpful to pass large objects by pointer/reference rather than by value. For example, using pointers, memory for an object can be allocated once for the whole program and the pointer can then be passed around. As I imagine it, this would improve performance in terms of both memory and computation.
Here is simple C++ code showing what I'm trying to do in Julia:
#include <iostream>

void some_function(int* variable) {    // declare function
    *variable += 1;                    // add a value to the variable
}

int main() {
    int very_big_object = 1;              // define variable
    some_function( &very_big_object );    // pass pointer of very_big_object to some_function
    std::cout << very_big_object;         // write value of very_big_object to stdout
    return 0;                             // end of the program
}
Output:
2
A new object is created, and its pointer is then passed to some_function, which modifies the object through that pointer. Returning the new value is not necessary, because the program edited the original object, not a copy. After some_function executes, the value of the variable is printed to show how it has changed.
While you can manually obtain pointers for Julia objects, this is actually not necessary to obtain the performance and results you want. As the manual notes:
Julia function arguments follow a convention sometimes called "pass-by-sharing", which means that values are not copied when they are passed to functions. Function arguments themselves act as new variable bindings (new locations that can refer to values), but the values they refer to are identical to the passed values. Modifications to mutable values (such as Arrays) made within a function will be visible to the caller. This is the same behavior found in Scheme, most Lisps, Python, Ruby and Perl, among other dynamic languages.
Consequently, you can just pass the object to the function normally and then operate on it in place. By convention in Julia, functions that mutate their input arguments (which includes any in-place function) have names that end in !. The only catch is that you have to be careful not to do anything within the body of the function that would trigger a copy of the object.
So, for example
function foo(A)
    B = A .+ 1 # This makes a copy, uh oh
    return B
end

function foo!(A)
    A .= A .+ 1 # Mutate A in-place
    return A # Technically don't have to return anything at all, but there is also no performance cost to returning the mutated A
end
(Note in particular the . in front of the = in the second version, making it the broadcasted assignment .=; this is critical. If we had left it off, we would not have mutated A at all, but merely rebound the name A (within the scope of the function) to the result of the right-hand side, and the second version would then be entirely equivalent to the first.)
If we then benchmark these, you can see the difference clearly:
julia> large_array = rand(10000,1000);
julia> using BenchmarkTools
julia> @benchmark foo(large_array)
BenchmarkTools.Trial: 144 samples with 1 evaluation.
Range (min … max): 23.315 ms … 78.293 ms ┊ GC (min … max): 0.00% … 43.25%
Time (median): 27.278 ms ┊ GC (median): 0.00%
Time (mean ± σ): 34.705 ms ± 12.316 ms ┊ GC (mean ± σ): 23.72% ± 23.04%
▁▄▃▄█ ▁
▇█████▇█▇▆▅▃▁▃▁▃▁▁▁▄▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▅▃▄▁▄▄▃▄▆▅▆▃▅▆▆▆▅▃ ▃
23.3 ms Histogram: frequency by time 55.3 ms <
Memory estimate: 76.29 MiB, allocs estimate: 2.
julia> @benchmark foo!(large_array)
BenchmarkTools.Trial: 729 samples with 1 evaluation.
Range (min … max): 5.209 ms … 9.655 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.529 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.845 ms ± 955.282 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▄▇██▇▆ ▁ ▁
▂▂▃▂▄▃▇▄▅▅████████▇█▆██▄▅▅▆▆▄▅▆▇▅▆▄▃▃▄▃▄▃▃▃▂▂▃▃▃▃▃▃▂▃▃▃▂▂▂▃ ▃
5.21 ms Histogram: frequency by time 9.33 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Note especially the difference in memory usage and allocations in addition to the ~4x time difference. At this point, pretty much all the time is being spent on the actual addition, so the only thing left to optimize if this were performance-critical code would be to make sure that the code is efficiently using all of your CPU's SIMD vector registers and instructions, which you can do with the LoopVectorization.jl package:
using LoopVectorization

function foo_simd!(A)
    @turbo A .= A .+ 1 # Mutate A in-place. Equivalently: @turbo @. A = A + 1
    return A
end
julia> @benchmark foo_simd!(large_array)
BenchmarkTools.Trial: 986 samples with 1 evaluation.
Range (min … max): 4.873 ms … 7.387 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.922 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.061 ms ± 330.307 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▄▁▁ ▂▂▅▁
█████████████▆▆▇▆▅▅▄▅▅▅▆▅▅▅▅▅▅▆▅▅▄▁▄▅▄▅▄▁▄▁▄▁▅▄▅▁▁▄▅▁▄▁▁▄▁▄ █
4.87 ms Histogram: log(frequency) by time 6.7 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
This buys us a bit more performance, but it looks like for this particular case Julia's normal compiler probably already found some of these SIMD optimizations.
Now, if for any reason you still want a literal pointer, you can always get this with Base.pointer, though note that this comes with some significant caveats and is generally not what you want.
help?> Base.pointer
pointer(array [, index])
Get the native address of an array or string, optionally at a given location index.
This function is "unsafe". Be careful to ensure that a Julia reference to array exists
as long as this pointer will be used. The GC.@preserve macro should be used to protect
the array argument from garbage collection within a given block of code.
Calling Ref(array[, index]) is generally preferable to this function as it guarantees
validity.
While Julia uses "pass-by-sharing" and you normally do not have to (and do not want to) use pointers, in some cases you actually do want to, and you can!
You construct references with Ref{Type} and then dereference them using [].
Consider the following function mutating its argument.
function mutate(v::Ref{Int})
    v[] = 999
end
This can be used in the following way:
julia> vv = Ref(33)
Base.RefValue{Int64}(33)
julia> mutate(vv);
julia> vv
Base.RefValue{Int64}(999)
julia> vv[]
999
For a longer discussion of passing by reference in Julia, please have a look at this post: How to pass an object by reference and value in Julia?
Look at this snippet:
#include <atomic>
#include <thread>

typedef volatile unsigned char Type;
// typedef std::atomic_uchar Type;

void fn(Type *p) {
    for (int i=0; i<500000000; i++) {
        (*p)++;
    }
}

int main() {
    const int N = 4;
    std::thread thr[N];
    alignas(64) Type buffer[N*64];

    for (int i=0; i<N; i++) {
        thr[i] = std::thread(&fn, &buffer[i*1]);
    }
    for (int i=0; i<N; i++) {
        thr[i].join();
    }
}
This little program increments four adjacent bytes a lot of times from four different threads. Before, I used the rule: don't use the same cache line from different threads, as cache line sharing is bad. So I expected that a four thread version (N=4) is much slower than a one thread version (N=1).
However, these are my measurements (on a Haswell CPU):
N=1: 1 sec
N=4: 1.2 sec
So N=4 is not much slower. If I use different cache lines (replace *1 with *64), then N=4 becomes a little faster: 1.1 sec.
The same measurements for atomic access (swap the comments at typedef), same cache line:
N=1: 3.1 sec
N=4: 48 sec
So the N=4 case is much slower (as I expected). If different cache lines are used, then N=4 has performance similar to N=1: 3.3 sec.
I don't understand the reason behind these results. Why don't I get a serious slowdown in the non-atomic, N=4 case? The four cores have the same memory in their caches, so they must synchronize it somehow, don't they? How can they run almost perfectly in parallel? Why does only the atomic case get a serious slowdown?
I think I need to understand how memory gets updated in this case. In the beginning, no core has buffer in its cache. After one iteration of the for loop (in fn), all 4 cores have buffer in their cache lines, but each core writes a different byte. How do these cache lines get synchronized (in the non-atomic case)? How does the cache know which byte is dirty? Or is there some other mechanism to handle this case? Why is this mechanism so much cheaper (actually, almost free) than the atomic one?
What you are seeing is basically the effect of the store buffer combined with store-to-load forwarding allowing each core to work mostly independently, despite sharing a cache line. As we will see below, it is truly a weird case where more contention is bad, up to a point, then even more contention suddenly makes things really fast!
On the conventional view of contention, your code looks like something that should be highly contended and therefore much slower than ideal. What happens, however, is that as soon as each core gets a single pending write into its store buffer, all later reads can be satisfied from that buffer (store-to-load forwarding), and later writes just go into the buffer as well, even after the core has lost ownership of the cache line. This turns most of the work into a totally local operation. The cache line still bounces around between the cores, but it is decoupled from the core's execution path and is only needed to actually commit the stores now and then[1].
The std::atomic version can't use this magic at all since it has to use locked operations to maintain atomicity and defeat the store buffer, so you see both the full cost of contention and the cost of the long-latency atomic operations[2].
Let's try to actually collect some evidence that this is what's occurring. All of the discussion below deals with the non-atomic version of the benchmark that uses volatile to force reads and writes from buffer.
Let's first check the assembly, to make sure it's what we expect:
0000000000400c00 <fn(unsigned char volatile*)>:
400c00: ba 00 65 cd 1d mov edx,0x1dcd6500
400c05: 0f 1f 00 nop DWORD PTR [rax]
400c08: 0f b6 07 movzx eax,BYTE PTR [rdi]
400c0b: 83 c0 01 add eax,0x1
400c0e: 83 ea 01 sub edx,0x1
400c11: 88 07 mov BYTE PTR [rdi],al
400c13: 75 f3 jne 400c08 <fn(unsigned char volatile*)+0x8>
400c15: f3 c3 repz ret
It's straightforward: a five-instruction loop with a byte load, an increment of the loaded byte, a byte store, and finally the loop counter update and conditional jump back to the top. Here, gcc has missed an optimization by separating the sub and the jne with the store, which inhibits macro-fusion, but overall it's fine, and the store-forwarding latency is going to limit the loop in any case.
Next, let's take a look at the number of L1D misses. Every time a core needs to write into the line that has been stolen away, it will suffer an L1D miss, which we can measure with perf. First, the single threaded (N=1) case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
1070.188749 task-clock (msec) # 0.998 CPUs utilized
2,775,874,257 cycles # 2.594 GHz
2,504,256,018 instructions # 0.90 insn per cycle
501,139,187 L1-dcache-loads # 468.272 M/sec
69,351 L1-dcache-load-misses # 0.01% of all L1-dcache hits
1.072119673 seconds time elapsed
It is about what we expect: essentially zero L1D misses (0.01% of the total, probably mostly from interrupts and other code outside the loop), and just over 500,000,000 hits (matching almost exactly the number of loop iterations). Note also that we can easily calculate the cycles per iteration: about 5.55. This primarily reflects the cost of store-to-load forwarding[5], plus one cycle for the increment, since this is a carried dependency chain: the same location is repeatedly updated (and volatile means it can't be hoisted into a register).
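Spelled out from the counters above (just restating the numbers already shown):
\[ 2{,}775{,}874{,}257 \ \text{cycles} \; / \; 500{,}000{,}000 \ \text{iterations} \approx 5.55 \ \text{cycles per iteration} \]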
Let's take a look at the N=4 case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
5920.758885 task-clock (msec) # 3.773 CPUs utilized
15,356,014,570 cycles # 2.594 GHz
10,012,249,418 instructions # 0.65 insn per cycle
2,003,487,964 L1-dcache-loads # 338.384 M/sec
61,450,818 L1-dcache-load-misses # 3.07% of all L1-dcache hits
1.569040529 seconds time elapsed
As expected, the L1 loads jump from 500 million to 2 billion, since there are 4 threads each doing 500 million loads. The number of L1D misses also jumped by about a factor of 1,000, to about 60 million. Still, that number is not a lot compared to the 2 billion loads (and 2 billion stores, not shown, but we know they are there). That's ~33 loads and ~33 stores for every miss. It also means about 250 cycles between each miss.
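For reference, those ratios fall straight out of the counters above:
\[ 2{,}003{,}487{,}964 \ \text{loads} \; / \; 61{,}450{,}818 \ \text{misses} \approx 33 \ \text{loads per miss} \]
\[ 15{,}356{,}014{,}570 \ \text{cycles} \; / \; 61{,}450{,}818 \ \text{misses} \approx 250 \ \text{cycles per miss} \]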
That doesn't really fit the model of the cache line bouncing around erratically between the cores, where as soon as a core gets the line, another core demands it. We know that lines bounce between cores sharing an L2 in perhaps 20-50 cycles, so the ratio of one miss every 250 cycles seems way too low.
Two Hypotheses
A couple of ideas spring to mind for the behavior described above:
Perhaps the MESI protocol variant used in this chip is "smart" and recognizes that one line is hot among several cores, but that only a small amount of work is being done each time a core gets the line, and that the line spends more time moving between L1 and L2 than actually satisfying loads and stores for any core. In light of this, some smart component in the coherence protocol decides to enforce a kind of minimum "ownership time" for each line: after a core gets the line, it keeps it for N cycles, even if demanded by another core (the other cores just have to wait).
This would help balance out the overhead of cache line ping-pong with real work, at the cost of "fairness" and responsiveness of the other cores, kind of like the trade-off between unfair and fair locks, and counteracting the effect described here, where the faster & fairer the coherency protocol is, the worse some (usually synthetic) loops may perform.
Now I've never heard of anything like that (and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction), but it's certainly possible!
The store-buffer effect described is actually occurring, so most operations can complete almost locally.
Some Tests
Let's try to distinguish two cases with some modifications.
Reading and Writing Distinct Bytes
The obvious approach is to change the fn() work function so that the threads still contend on the same cache line, but where store-forwarding can't kick in.
How about we just read from location x and then write to location x + 1? We'll give each thread two consecutive locations (i.e., thr[i] = std::thread(&fn, &buffer[i*2])) so each thread is operating on two private bytes. The modified fn() looks like:
for (int i=0; i<500000000; i++) {
    unsigned char temp = p[0];
    p[1] = temp + 1;
}
The core loop is pretty much identical to earlier:
400d78: 0f b6 07 movzx eax,BYTE PTR [rdi]
400d7b: 83 c0 01 add eax,0x1
400d7e: 83 ea 01 sub edx,0x1
400d81: 88 47 01 mov BYTE PTR [rdi+0x1],al
400d84: 75 f2 jne 400d78
The only thing that's changed is that we write to [rdi+0x1] rather than [rdi].
Now as I mentioned above, the original (same location) loop is actually running fairly slowly at about 5.5 cycles per iteration even in the best-case single-threaded case, because of the loop-carried load->add->store->load... dependency. This new code breaks that chain! The load no longer depends on the store so we can execute everything pretty much in parallel and I expect this loop to run at about 1.25 cycles per iteration (5 instructions / CPU width of 4).
Here's the single threaded case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
318.722631 task-clock (msec) # 0.989 CPUs utilized
826,349,333 cycles # 2.593 GHz
2,503,706,989 instructions # 3.03 insn per cycle
500,973,018 L1-dcache-loads # 1571.815 M/sec
63,507 L1-dcache-load-misses # 0.01% of all L1-dcache hits
0.322146774 seconds time elapsed
So about 1.65 cycles per iteration[3], roughly three times faster than incrementing the same location.
How about 4 threads?
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
22299.699256 task-clock (msec) # 3.469 CPUs utilized
57,834,005,721 cycles # 2.593 GHz
10,038,366,836 instructions # 0.17 insn per cycle
2,011,160,602 L1-dcache-loads # 90.188 M/sec
237,664,926 L1-dcache-load-misses # 11.82% of all L1-dcache hits
6.428730614 seconds time elapsed
So it's about 4 times slower than the same-location case. Rather than being just a bit slower than the single-threaded case, it is now about 20 times slower. This is the contention you've been looking for! Note also that the number of L1D misses has increased by about a factor of 4 as well, nicely explaining the performance degradation and consistent with the idea that when store-to-load forwarding can't hide the contention, misses increase by a lot.
Increasing the Distance Between Stores
Another approach would be to increase the distance in time/instructions between the store and the subsequent load. We can do this by incrementing SPAN consecutive locations in the fn() function, rather than always the same location. E.g., if SPAN is 4, increment 4 consecutive locations like:
for (long i=0; i<500000000 / 4; i++) {
    p[0]++;
    p[1]++;
    p[2]++;
    p[3]++;
}
Note that we still do 500 million increments in total, just spread out among 4 bytes. Intuitively you would expect overall performance to increase, since you now have SPAN parallel dependency chains, each roughly 1/SPAN as long; in the case above you might expect performance to improve by a factor of 4, since the 4 parallel chains can proceed at about 4 times the total throughput.
Here's what we actually get for time (measured in cycles) for the 1-thread and 3-thread[4] cases, for SPAN values from 1 to 20:
Initially you see performance increase substantially in both the single- and multi-threaded cases; the increase from a SPAN of one to two and three is close to what you would expect theoretically under perfect parallelism, in both cases.
The single-threaded case reaches an asymptote of about 4.25x faster than the single-location write: at this point the store-forwarding latency isn't the bottleneck and other bottlenecks have taken over (max IPC and store port contention, mostly).
The multi-threaded case is very different, however! Once you hit a SPAN of about 7, the performance rapidly gets worse, leveling out at about 2.5 times worse than the SPAN=1 case and almost 10x worse compared to the best performance at SPAN=5. What happens is that store-to-load forwarding stops occurring because the store and subsequent load are far enough apart in time/cycles that the store has retired to L1, so the load actually has to get the line and participate in MESI.
Also plotted are the L1D misses, which as mentioned above are indicative of "cache line transfers" between cores. The single-threaded case has essentially zero, and they are uncorrelated with the performance. The performance of the multi-threaded case, however, pretty much exactly tracks the cache misses. With SPAN values in the 2 to 6 range, where store-forwarding is still working, there are proportionally fewer misses. Evidently the core is able to "buffer up" more stores between each cache line transfer since the core loop is faster.
Another way to think of it is that in the contended case L1D misses are basically constant per unit-time (which makes sense, since they are basically tied to the L1->L2->L1 latency, plus some coherency protocol overhead), so the more work you can do in between the cache line transfers, the better.
Here's the code for the multi-span case:
void fn(Type *p) {
    for (long i=0; i<500000000 / SPAN; i++) {
        for (int j = 0; j < SPAN; j++) {
            p[j]++;
        }
    }
}
The bash script to run perf for all SPAN values from 1 to 20:
PERF_ARGS=${1:--x, -r10}
for span in {1..20}; do
g++ -std=c++11 -g -O2 -march=native -DSPAN=$span cache-line-increment.cpp -lpthread -o cache-line-increment
perf stat ${PERF_ARGS} -e cycles,L1-dcache-loads,L1-dcache-load-misses,machine_clears.count,machine_clears.memory_ordering ./cache-line-increment
done
Finally, "transpose" the results into proper CSV:
FILE=result1.csv; for metric in cycles L1-dcache-loads L1-dcache-load-misses; do { echo $metric; grep $metric $FILE | cut -f1 -d,; } > ${metric}.tmp; done && paste -d, *.tmp
A Final Test
There's a final test you can do to show that each core is effectively doing most of its work in private: use the version of the benchmark where the threads work on the same location (which doesn't change the performance characteristics) and examine the sum of the final counter values (you'd need int counters rather than char). If everything were atomic, you'd get a sum of 2 billion; in the non-atomic case, how close the total is to that value is a rough measure of how frequently the cores were passing the lines around. If the cores are working almost totally privately, the value would be closer to 500 million than to 2 billion, and I'd guess that's what you'll find (a value fairly close to 500 million).
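To make that concrete, here is a minimal sketch of such a check (my own illustration, not the benchmark code from the question): four threads race to increment the same non-atomic counter, and the final value tells you how many increments survived.

#include <iostream>
#include <thread>

// Shared, non-atomic counter: every thread increments the same location.
// volatile forces the loads and stores but provides no atomicity, so
// concurrent read-modify-write sequences can overwrite each other's updates.
volatile int counter = 0;

void fn() {
    for (int i = 0; i < 500000000; i++) {
        counter = counter + 1;   // racy increment, same idea as (*p)++ in the benchmark
    }
}

int main() {
    const int N = 4;
    std::thread thr[N];
    for (int i = 0; i < N; i++) thr[i] = std::thread(fn);
    for (int i = 0; i < N; i++) thr[i].join();

    // With perfect atomicity this would print 2,000,000,000. A result much
    // closer to 500,000,000 means each core spent most of its time updating
    // a privately buffered copy of the value.
    std::cout << counter << "\n";
}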
With some more clever incrementing, you could even have each thread track how often the value it incremented came from its own last increment rather than another thread's increment (e.g., by using a few bits of the value to stash a thread identifier). With an even more clever test you could practically reconstruct the way the cache line moved around between the cores (is there a pattern, e.g., does core A prefer to hand off to core B?) and which cores contributed most to the final value, and so on.
That's all left as an exercise :).
[1] On top of that, if Intel has a coalescing store buffer where later stores that fully overlap earlier ones kill the earlier stores, it would only have to commit one value to L1 (the latest store) every time it gets the line.
[2] You can't really separate the two effects here, but we will do it later by defeating store-to-load forwarding.
[3] A bit more than I expected, perhaps due to bad scheduling leading to port pressure. If gcc would just allow the sub and jne to fuse, it runs at 1.1 cycles per iteration (still worse than the 1.0 I'd expect). It will do that if I use -march=haswell instead of -march=native, but I'm not going to go back and change all the numbers.
[4] The results hold with 4 threads as well, but I only have 4 cores and I'm running stuff like Firefox in the background, so using one less core makes the measurements a lot less noisy. Measuring time in cycles helps a lot too.
[5] On this CPU architecture, store forwarding where the load arrives before the store data is ready seems to alternate between 4 and 5 cycles, for an average of 4.5 cycles.
The atomic version has to ensure that some other thread will be able to read the result in a sequentially consistent fashion. So there are fences for each write.
The volatile version does not make any relationships visible to the other cores, so it does not try to synchronize the memory so that it is visible on other cores. For a multi-threaded system using C++11 or newer, volatile is not a mechanism for communicating between threads.
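As a small illustration of that last point (my own sketch, not code from the question): std::atomic is the C++11 mechanism for a shared counter, and even with relaxed memory ordering the increment is an atomic read-modify-write, which is exactly the part that is expensive under contention; the volatile increment is just an ordinary load and store and formally a data race.

#include <atomic>

volatile unsigned char v = 0;
std::atomic<unsigned char> a{0};

void bump_volatile() {
    v = v + 1;                                  // plain load + store; a data race if
                                                // another thread writes v concurrently
}

void bump_atomic_relaxed() {
    a.fetch_add(1, std::memory_order_relaxed);  // atomic RMW (a locked instruction on x86):
                                                // well-defined, but still serializes on the line
}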
While working on a competitive programming problem I discovered an interesting issue that drastically reduced the performance of some of my code. After much experimentation I have managed to reduce the issue to the following minimal example:
module Main where

main = interact handle

handle :: String -> String
-- handle s = show $ sum l
-- handle s = show $ length l
-- handle s = show $ seq (length l) (sum l)
  where
    l = [0..10^8] :: [Int]
If you uncomment each commented line individually, compile with ghc -O2 test.hs and run with time ./test > /dev/null, you should get something like the following:
For sum l:
0.02user 0.00system 0:00.03elapsed 93%CPU (0avgtext+0avgdata 3380maxresident)k
0inputs+0outputs (0major+165minor)pagefaults 0swaps
For length l:
0.02user 0.00system 0:00.02elapsed 100%CPU (0avgtext+0avgdata 3256maxresident)k
0inputs+0outputs (0major+161minor)pagefaults 0swaps
For seq (length l) (sum l):
5.47user 1.15system 0:06.63elapsed 99%CPU (0avgtext+0avgdata 7949048maxresident)k
0inputs+0outputs (0major+1986697minor)pagefaults 0swaps
Look at that huge increase in peak memory usage. This makes some sense, because both sum and length can lazily consume the list as a stream, while the seq triggers evaluation of the whole list, which must then be stored. But the seq version of the code uses just shy of 8 GB of memory to handle a list that contains just 400 MB of actual data. The purely functional nature of Haskell lists could explain some small constant factor, but a 20-fold increase in memory seems unintended.
This behaviour can be triggered by a number of things. Perhaps the easiest way is using force from Control.DeepSeq, but the way in which I originally encountered this was while using Data.Array.IArray (I can only use the standard library) and trying to construct an array from a list. The implementation of Array is monadic, and so was forcing the evaluation of the list from which it was being constructed.
If anyone has any insight into the underlying cause of this behaviour, I would be very interested to learn why this happens. I would of course also appreciate any suggestions as to how to avoid this issue, bearing in mind that I have to use just the standard library in this case, and that every Array constructor takes and eventually forces a list.
I hope you find this issue as interesting as I did, but hopefully less baffling.
EDIT: user2407038's comment made me realize I had forgotten to post profiling results. I have tried profiling this code and the profiler simply states that 100% of allocations are performed in handle.l, so it seems that simply anything that forces the evaluation of the list uses huge amounts of memory. As I mentioned above, using the force function from Control.DeepSeq, constructing an Array, or anything else that forces the list causes this behaviour. I am confused as to why it would ever require 8 GB of memory to compute a list containing 400 MB of data. Even if every element in the list required two 64-bit pointers, that is still only a factor of 5, and I would think GHC would be able to do something more efficient than that. If not this is an obvious bottleneck for the Array package, as constructing any array inherently requires us to allocate far more memory than the array itself.
So, ultimately: Does anyone have any idea why forcing a list requires such huge amounts of memory, which has such a high cost on performance?
EDIT: user2407038 provided a link to the very helpful GHC Memory Footprint reference. This explains exactly the data sizes of everything, and almost entirely explains the huge overhead: An [Int] is specified as requiring 5N+1 words of memory, which at 8 bytes per word gives 40 bytes per element. In this example that would suggest 4 GB, which accounts for half the total peak usage. It is easy to then believe that the evaluation of sum would then add a similar factor, so this answers my question.
Thanks to all commenters for your help.
EDIT: As I mentioned above, I originally encountered this behaviour while trying to construct an Array. Having had a bit of a dig into GHC.Arr, I have found what I think is the root cause of this behaviour when constructing an array: the constructor folds over the list to compose a program in the ST monad that it then runs. Obviously the ST computation can't be executed until it is completely composed, and in this case the ST construct will be large and linear in the size of the input. To avoid this behaviour we would have to somehow modify the constructor to stream elements from the list as it adds them in ST.
There are multiple factors that come into play here. The first one is that GHC will lazily lift l out of handle. This would enable handle to reuse l, so that you don't have to recalculate it every time, but in this case it creates a space leak. You can check this by dumping the simplified core with -ddump-simpl:
Main.handle_l :: [Int]
[GblId,
Str=DmdType,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 40 0}]
Main.handle_l =
case Main.handle3 of _ [Occ=Dead] { GHC.Types.I# y_a1HY ->
GHC.Enum.eftInt 0 y_a1HY
}
The functionality to calculate the [0..10^7][1] is hidden away in other functions, but essentially, handle_l = [0..10^7], at top level (TopLvl=True). It won't get reclaimed, since you may or may not use handle again. If we use handle s = show $ length l instead, l itself will be inlined: you will not find any TopLvl=True function that has type [Int].
So GHC detects that you use l twice and creates a top-level CAF. How big is that CAF? An Int takes two words:
data Int = I# Int#
One for I#, one for Int#. How much for [Int]?
data [a] = [] | (:) a ([a]) -- pseudo, but similar
That's one word for [], and three words for (:) a ([a]). A list of [Int] of size N will therefore have a total size of (3N + 1) + 2N words, in your case 5N+1 words. Given your memory usage, I assume a word is 8 bytes on your platform, so we end up with
5 * 10^8 * 8 bytes = 4,000,000,000 bytes
So how do we get rid of that list? The first option we have is to get rid of l:
handle _ = show $ seq (length [0..10^8]) (sum [0..10^8])
This will now run in constant memory due to foldr/build fusion rules. While we have [0..10^8] there twice, the two occurrences don't share the same name. If we check the runtime statistics (+RTS -s), we will see that it runs in constant memory:
> SO.exe +RTS -s
5000000050000000 4,800,066,848 bytes allocated in the heap
159,312 bytes copied during GC
43,832 bytes maximum residency (2 sample(s))
20,576 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9154 colls, 0 par 0.031s 0.013s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 4.188s ( 4.232s elapsed)
GC time 0.031s ( 0.013s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 4.219s ( 4.247s elapsed)
%GC time 0.7% (0.3% elapsed)
Alloc rate 1,146,284,620 bytes per MUT second
Productivity 99.3% of total user, 98.6% of total elapsed
But that's not really nice, since we now have to track all the uses of [0..10^8]. What if we create a function instead?
handle :: String -> String
handle _ = show $ seq (length $ l ()) (sum $ l ())
  where
    {-# INLINE l #-}
    l _ = [0..10^7] :: [Int]
This works, but we must inline l, otherwise we get the same problem as before if we use optimizations. -O1 (and -O2) enable -ffull-laziness, which—together with common subexpression elimination—would lift l () to the top. So we either need to inline it or use -O2 -fno-full-laziness to prevent that behaviour.
[1] I had to decrease the list size, otherwise I would have started swapping.
A given benchmark consists of 35% loads, 10% stores, 16% branches, 27% integer ALU operations, 8% FP +/-, 3% FP * and 1% FP /. We want to compare the benchmark as run on two processors. CPI of P1 = 5.05 and CPI of P2 = 3.58.
You are considering two possible enhancements for the Processor 1. One enhancement is a better memory organization, which would improve the average CPI for FP/ instructions from 30 to 2. The other enhancement is a new multiply-and-add instruction that would reduce the number of ALU instructions by 20% while still maintaining the average CPI of 4 for the remaining ALU instructions. Unfortunately, there is room on the processor chip for only one of these two enhancements, so you must choose the enhancement that provides better overall performance. Which one would you choose, and why?
So for this part CPI (FP/) = 5.05 - 0.01(30 - 2) = 4.77
But, I am not able to find the new CPI for ALU.
Is it -> CPI (ALU) = 5.05 - 0.20 (4 - 4) = 5.05? I am probably wrong about this.
Caveat: This may only be a partial answer, because I'm not sure what you mean by "CPI". It could be "cost per instruction", but I'm guessing it means "cycles per instruction". And we may need more information for a fuller answer.
The original cost for FP/ is 1% * 30 --> 30 cycles per 100 instructions. The enhancement brings this to 1% * 2 --> 2, so the improvement is 30 - 2 --> 28.
The original cost for ALU is 27% * 4 --> 108 cycles per 100 instructions. With a 20% reduction in the number of ALU instructions executed, this becomes 0.8 * 27% * 4 --> 86.4. The improvement is 108 - 86.4 --> 21.6.
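Side by side, per 100 instructions of the original mix (this just restates the arithmetic above in equation form):
\[ \Delta_{\text{FP/}} = 1 \times (30 - 2) = 28 \ \text{cycles per 100 instructions} \]
\[ \Delta_{\text{ALU}} = 0.2 \times 27 \times 4 = 21.6 \ \text{cycles per 100 instructions} \]
so for the same instruction mix the FP/ (memory organization) enhancement saves more cycles.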
So [I think] that answers your basic question.
And, I might choose the improvement for FP.
But, I'd be careful with this. And, the following could be wrong, overthinking the problem, but I offer it anyway.
The FP improvement just speeds up the instruction. But, the number of cycles for the FP/ is reduced and these cycles can be used for other things.
The ALU improvement frees up some cycles that can be used for other things.
In both cases, we don't know what the additional instructions might be. That is, we're changing the percentages of everything after the proposed improvement. We have to assume that the new "windfall" instructions will follow the stated original percentages. But we may have to calculate the post-improvement, adjusted percentages.
We could recalculate things [by solving for unknowns] from:
505 == 35*loads + 10*stores + 16*branches + 27*ALU + 8*FPadd + 3*FPmul + 1*FPdiv
... if we knew the CPI for the other instructions (e.g. the CPI for a load, etc.). But, this is missing information.
I want to read an input file (in C/C++) and process each line independently as fast as possible. The processing takes a few ticks itself, so I decided to use OpenMP threads. I have this code:
#pragma omp parallel num_threads(num_threads)
{
    string line;
    while (true) {
        #pragma omp critical(input)
        {
            getline(f, line);
        }
        if (f.eof())
            break;
        process_line(line);
    }
}
My question is, how do I determine the optimal number of threads to use? Ideally, I would like this to be dynamically detected at runtime. I don't understand the DYNAMIC schedule option for parallel, so I can't say if that would help. Any insights?
Also, I'm not sure how to determine the optimal number "by hand". I tried various numbers for my specific application. I would have thought the CPU usage reported by top would help, but it doesn't(!). In my case, the CPU usage stays consistently at around num_threads*(85-95). However, using pv to observe the speed at which I'm reading input, I noted that the optimal number is around 2-5; above that, the input rate drops. So my question is: why would I then see a CPU usage of 850 when using 10 threads? Can this be due to some inefficiency in how OpenMP handles threads waiting to get into the critical section?
EDIT: Here are some timings. I obtained them with:
for NCPU in $(seq 1 20) ; do echo "NCPU=$NCPU" ; { pv -f -a my_input.gz | pigz -d -p 20 | { { sleep 60 ; PID=$(ps gx -o pid,comm | grep my_prog | sed "s/^ *//" | cut -d " " -f 1) ; USAGE=$(ps h -o "%cpu" $PID) ; kill -9 $PID ; sleep 1 ; echo "usage: $USAGE" >&2 ; } & cat ; } | ./my_prog -N $NCPU >/dev/null 2>/dev/null ; sleep 2 ; } 2>&1 | grep -v Killed ; done
NCPU=1
[8.27MB/s]
usage: 98.4
NCPU=2
[12.5MB/s]
usage: 196
NCPU=3
[18.4MB/s]
usage: 294
NCPU=4
[23.6MB/s]
usage: 393
NCPU=5
[28.9MB/s]
usage: 491
NCPU=6
[33.7MB/s]
usage: 589
NCPU=7
[37.4MB/s]
usage: 688
NCPU=8
[40.3MB/s]
usage: 785
NCPU=9
[41.9MB/s]
usage: 884
NCPU=10
[41.3MB/s]
usage: 979
NCPU=11
[41.5MB/s]
usage: 1077
NCPU=12
[42.5MB/s]
usage: 1176
NCPU=13
[41.6MB/s]
usage: 1272
NCPU=14
[42.6MB/s]
usage: 1370
NCPU=15
[41.8MB/s]
usage: 1493
NCPU=16
[40.7MB/s]
usage: 1593
NCPU=17
[40.8MB/s]
usage: 1662
NCPU=18
[39.3MB/s]
usage: 1763
NCPU=19
[38.9MB/s]
usage: 1857
NCPU=20
[37.7MB/s]
usage: 1957
My problem is that I can achieve 40MB/s with 785 CPU usage, but also with 1662 CPU usage. Where do those extra cycles go??
EDIT2: Thanks to Lirik and John Dibling, I now understand that the reason I find the timings above puzzling has nothing to do with I/O, but rather, with the way OpenMP implements critical sections. My intuition is that if you have 1 thread inside a CS and 10 threads waiting to get in, the moment the 1st thread exits the CS, the kernel should wake up one other thread and let it in. The timings suggest otherwise: can it be that the threads wake up many times on their own and find the CS occupied? Is this an issue with the threading library or with the kernel?
"I want to read an input file (in C/C++) and process each line independently as fast as possible."
Reading from file makes your application I/O bound, so the maximum performance you would be able to achieve for the reading portion alone is to read at the maximum disk speed (on my machine that's less than 10% CPU time). What this means is that if you were able to completely free the reading thread from any processing, it would require that the processing takes less than the remaining CPU time (90% with my computer). If the line processing threads take up more than the remaining CPU time, then you will not be able to keep up with the hard drive.
There are several options in that case:
Queue up the input and let the processing threads dequeue "work" until they've caught up with the input that was presented (given that you have enough RAM to do so).
Open enough threads and just max out your processor until all the data is read, which is your best effort scenario.
Throttle reading/processing so that you don't take up all of the system resources (in case you're worried about UI responsiveness and/or user experience).
"...the processing takes a few ticks itself, so I decided to use OpenMP threads"
This is a good sign, but it also means that your CPU utilization will not be very high. This is the part where you can optimize your performance, and it's probably best to do it by hand, as John Dibling mentioned. In general, it's best to queue up each line and let the processing threads pull work from the queue until there is nothing more to process. The latter is also known as a Producer/Consumer design pattern, a very common pattern in concurrent computing.
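As a rough sketch of that pattern (my own illustration, not the asker's code; the file name, the process_line stub, and the thread count are placeholders): one producer reads lines and pushes them onto a shared queue, while a pool of consumer threads pops lines and processes them outside the lock.

#include <algorithm>
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> work;      // lines waiting to be processed
std::mutex mtx;
std::condition_variable cv;
bool done = false;                 // set by the producer when the input is exhausted

void process_line(const std::string& line) { /* stand-in for the real per-line work */ }

void consumer() {
    for (;;) {
        std::string line;
        {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait(lock, [] { return !work.empty() || done; });
            if (work.empty()) return;          // done and nothing left to do
            line = std::move(work.front());
            work.pop();
        }
        process_line(line);                    // heavy work happens outside the lock
    }
}

int main() {
    std::ifstream f("input.txt");              // hypothetical input file
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; i++) workers.emplace_back(consumer);

    std::string line;
    while (std::getline(f, line)) {            // producer: read and enqueue
        std::lock_guard<std::mutex> lock(mtx);
        work.push(std::move(line));
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(mtx); done = true; }
    cv.notify_all();
    for (auto& t : workers) t.join();
}

Whether this beats the simple critical-section loop depends on how expensive process_line is relative to the queue locking; the lock-free single-producer/single-consumer queue mentioned further down pushes that overhead down even more.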
Update
Why would there be a difference between
(i) each process get lock, pull data, release lock, process data; and
(ii) one process: pull data, get lock, enqueue chunk, release lock,
others: get lock, dequeue chunk, release lock, process data?
There is very little difference: in a way, both represent a consumer/producer pattern. In the first case (i) you don't have an actual queue, but you could consider the file stream to be your Producer (queue) and the Consumer is the thread that reads from the stream. In the second case (ii) you're explicitly implementing the consumer/producer pattern, which is more robust, reusable and provides better abstraction for the Producer. If you ever decide to use more than one "input channel," then the latter case is better.
Finally (and probably most importantly), you can use a lock-free queue with a single producer and a single consumer, which will make (ii) a lot faster than (i) in terms of getting you I/O bound. With a lock-free queue you can pull data, enqueue a chunk, and dequeue a chunk without locking.
The best you can hope to do is tune it yourself, by hand, through repeated measure-adjust-compare cycles.
The optimal number of threads to use to process a dataset is highly dependent on many factors, not the least of which are:
The data itself
The algorithm you use to process it
The CPU(s) the threads are running on
The operating system
You could try to design some kind of heuristic that measures the throughput of your processors and adjusts it on the fly, but this kind of thing tends to be way more trouble than it's worth.
As a rule, for tasks that are I/O bound I generally start with about 12 threads per core and tune from there. For tasks that are CPU bound, I'd start with about 4 threads per core and go from there. The key is the "going from there" part if you really want to optimize your processor use.
Also keep in mind that you should make this setting configurable if you really want to optimize, because every system on which this is deployed will have different characteristics.
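For example, one simple way to keep the setting configurable (a sketch; the environment variable name MYAPP_NUM_THREADS and the per-core multiplier are made up for illustration):

#include <cstdlib>
#include <thread>

// Take the thread count from an environment variable if it is set, otherwise
// fall back to a multiplier on the core count (e.g. ~4 per core for CPU-bound
// work, more for I/O-bound work), so the value can be tuned per deployment.
unsigned thread_count(unsigned per_core_default) {
    if (const char* s = std::getenv("MYAPP_NUM_THREADS")) {
        int n = std::atoi(s);
        if (n > 0) return static_cast<unsigned>(n);
    }
    unsigned cores = std::thread::hardware_concurrency();
    return (cores ? cores : 1) * per_core_default;
}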