While working on a competitive programming problem I discovered an interesting issue that drastically reduced the performance of some of my code. After much experimentation I have managed to reduce the issue to the following minimal example:
module Main where
main = interact handle
handle :: String -> String
-- handle s = show $ sum l
-- handle s = show $ length l
-- handle s = show $ seq (length l) (sum l)
where
l = [0..10^8] :: [Int]
If you uncomment each commented line individually, compile with ghc -O2 test.hs and run with time ./test > /dev/null, you should get something like the following:
For sum l:
0.02user 0.00system 0:00.03elapsed 93%CPU (0avgtext+0avgdata 3380maxresident)k
0inputs+0outputs (0major+165minor)pagefaults 0swaps
For length l:
0.02user 0.00system 0:00.02elapsed 100%CPU (0avgtext+0avgdata 3256maxresident)k
0inputs+0outputs (0major+161minor)pagefaults 0swaps
For seq (length l) (sum l):
5.47user 1.15system 0:06.63elapsed 99%CPU (0avgtext+0avgdata 7949048maxresident)k
0inputs+0outputs (0major+1986697minor)pagefaults 0swaps
Look at that huge increase in peak memory usage. This makes some amount of sense, because of course both sum and length can lazily consume the list as a stream, while the seq will be triggering the evaluation of the whole list, which must then be stored. But the seq version of the code is using just shy of 8 GB of memory to handle a list that contains just 400 MB of actual data. The purely functional nature of Haskell lists could explain some small constant factor, but a 20 fold increase in memory seems unintended.
This behaviour can be triggered by a number of things. Perhaps the easiest way is using force from Control.DeepSeq, but the way in which I originally encountered this was while using Data.Array.IArray (I can only use the standard library) and trying to construct an array from a list. The implementation of Array is monadic, and so was forcing the evaluation of the list from which it was being constructed.
If anyone has any insight into the underlying cause of this behaviour, I would be very interested to learn why this happens. I would of course also appreciate any suggestions as to how to avoid this issue, bearing in mind that I have to use just the standard library in this case, and that every Array constructor takes and eventually forces a list.
I hope you find this issue as interesting as I did, but hopefully less baffling.
EDIT: user2407038's comment made me realize I had forgotten to post profiling results. I have tried profiling this code and the profiler simply states that 100% of allocations are performed in handle.l, so it seems that simply anything that forces the evaluation of the list uses huge amounts of memory. As I mentioned above, using the force function from Control.DeepSeq, constructing an Array, or anything else that forces the list causes this behaviour. I am confused as to why it would ever require 8 GB of memory to compute a list containing 400 MB of data. Even if every element in the list required two 64-bit pointers, that is still only a factor of 5, and I would think GHC would be able to do something more efficient than that. If not this is an obvious bottleneck for the Array package, as constructing any array inherently requires us to allocate far more memory than the array itself.
So, ultimately: Does anyone have any idea why forcing a list requires such huge amounts of memory, which has such a high cost on performance?
EDIT: user2407038 provided a link to the very helpful GHC Memory Footprint reference. This explains exactly the data sizes of everything, and almost entirely explains the huge overhead: An [Int] is specified as requiring 5N+1 words of memory, which at 8 bytes per word gives 40 bytes per element. In this example that would suggest 4 GB, which accounts for half the total peak usage. It is easy to then believe that the evaluation of sum would then add a similar factor, so this answers my question.
Thanks to all commenters for your help.
EDIT: As I mentioned above, I originally encountered this behaviour why trying to construct an Array. Having had a bit of a dig into GHC.Arr I have found what I think is the root cause of this behaviour when constructing an array: The constructor folds over the list to compose a program in the ST monad that it then runs. Obviously the ST can't be executed until it is completely composed, and in this case the ST construct will be large and linear in the size of the input. To avoid this behaviour we would have to somehow modify the constructor to stream elements from the list as it adds them in ST.
There are multiple factors that come to play here. The first one is that GHC will lazily lift l out of handle. This would enable handle to reuse l, so that you don't have to recalculate it every time, but in this case it creates a space leak. You can check this if you -ddump-simplified core:
Main.handle_l :: [Int]
[GblId,
Str=DmdType,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 40 0}]
Main.handle_l =
case Main.handle3 of _ [Occ=Dead] { GHC.Types.I# y_a1HY ->
GHC.Enum.eftInt 0 y_a1HY
}
The functionality to calculate the [0..10^7] 1 is hidden away in other functions, but essentially, handle_l = [0..10^7], at top-level (TopLvl=True). It won't get reclaimed, since you may or may not use handle again. If we use handle s = show $ length l, l itself will be inlined. You will not find any TopLvl=True function that has type [Int].
So GHC detects that you use l twice and creates a top-level CAF. How big is that CAF? An Int takes two words:
data Int = I# Int#
One for I#, one for Int#. How much for [Int]?
data [a] = [] | (:) a ([a]) -- pseudo, but similar
That's one word for [], and three words for (:) a ([a]). A list of [Int] with size N will therefore have a total size of (3N + 1) + 2N words, in your case 5N+1 words. Given your memory, I assume a word is 8byte on your plattform, so we end up with
5 * 10^8 * 8 bytes = 4 000 000 000 bytes
So how do we get rid of that list? The first option we have is to get rid of l:
handle _ = show $ seq (length [0..10^8]) (sum [0..10^8])
This will now run in constant memory due to foldr/buildr rules. While we have [0..10^8] there twice, they don't share the same name. If we check the -stats, we will see that it runs in constant memory:
> SO.exe +RTS -s
5000000050000000 4,800,066,848 bytes allocated in the heap
159,312 bytes copied during GC
43,832 bytes maximum residency (2 sample(s))
20,576 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9154 colls, 0 par 0.031s 0.013s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 4.188s ( 4.232s elapsed)
GC time 0.031s ( 0.013s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 4.219s ( 4.247s elapsed)
%GC time 0.7% (0.3% elapsed)
Alloc rate 1,146,284,620 bytes per MUT second
Productivity 99.3% of total user, 98.6% of total elapsed
But that's not really nice, since we now have to track all the uses of [0..10^8]. What if we create a function instead?
handle :: String -> String
handle _ = show $ seq (length $ l ()) (sum $ l ())
where
{-# INLINE l #-}
l _ = [0..10^7] :: [Int]
This works, but we must inline l, otherwise we get the same problem as before if we use optimizations. -O1 (and -O2) enable -ffull-laziness, which—together with common subexpression elimination—would lift l () to the top. So we either need to inline it or use -O2 -fno-full-laziness to prevent that behaviour.
1 Had to decrease the list size, otherwise I would have started swapping.
Related
I found a strange thing about GHCi and lists.
This command takes some time to execute and just returns the right answer.
ghci> length [1..10^8]
100000000
However, binding this to a variable and executing causes GHC to consume about 5 GiB of RAM without releasing until the GHCi session is over. Typing :quit after it consumes 3 GiB more before actually exiting.
ghci> len = length [1..10^8]
ghci> len
-- Consumes 5 GiB
100000000
ghci> :quit
-- Consumes 3 GiB
-- Exits
Is this normal? What is the difference between the commands?
GHC version is 8.2.2.
Update: The optimization performed by -O0 is a little different than I first understood. Also, added a note about filing a new Trac bug.
I can reproduce this in GHC 8.2.2. Directly evaluating the expression (or using let to bind it to a variable and then evaluating it) both complete quickly:
Prelude> length [1..10^8]
10000000 -- pretty fast
Prelude> let len = length [1..10^8]
Prelude> len
10000000 -- pretty fast
Prelude>
However, using the let-free syntax:
Prelude> len = length [1..10^8]
Prelude> len
10000000
Prelude>
takes longer and allocates a lot of memory which isn't freed until the session ends.
Note that this is specific to GHCi and interactive mode -- in a real, compiled Haskell program, there would be no issue. Compiling the following will run quickly and not consume excess memory:
len = length [1..10^8]
main = print len
To understand what's going on, you should understand that Haskell is capable of performing two potential optimizations of this code:
It can explicitly create a lazy list and start computing its length but determine that once the beginning of the list has been counted, those elements will not be needed again allowing them to be immediately garbage collected.
It can determine that no list needs to be created at all, and via a process known as "list fusion", create compiled code that counts directly from 1 up to 10^8 without trying to put those numbers in any kind of data structure.
When this code is compiled with optimizations (-O1 or -O2), GHC will perform optimization #2. The compiled version will run quickly in a small amount of constant memory (a couple of megabytes resident for the runtime). If you run this with:
$ time ./Length +RTS -s
to collect statistics, you'll find that GHC is still allocating about 1.6 gigabytes of heap, but this is actually to store the individual Integer values as they are being incremented. (Since values in Haskell are immutable, a new Integer must be allocated for every increment.) If you force the type to be Int:
len = length [(1::Int)..10^8]
then the program will allocate only a few kilobytes of heap, and you can see that truly there's no list being allocated.
It turns out that when this code is compiled without optimizations (-O0), GHC only performs optimization #1 (as pointed out by #Carl), but it manages to do a really good job of it, so much so that even though the GHC statistics show lots of heap allocation, the program still runs quite quickly with a very small memory footprint.
However, when this code is compiled to byte-code in GHCi, not only is just optimization #1 used, but GHC doesn't do quite so good a job of garbage collecting the list. A huge multi-gigabyte list is generated, and the beginning is garbage collected nearly as fast as it's being generated. Memory use ends up being quite large, but at least it's relatively constant.
You can see this by turning on timing/memory stats:
> :set +s
> length [1..10^8]
100000000
(1.54 secs, 7,200,156,128 bytes)
>
This means that this code actually allocates a 7.2 gigabyte list; fortunately, it can be thrown away almost as fast as it's generated, so the memory in use by the GHCi process after this computation will still be reasonably modest.
You'll see that:
> let len = length [1..10^8]
> len
and:
> len = length [1..10^8]
> len
chew through exactly the same enormous amount of memory (about 7.2 gigs).
The difference is that, for some reason, the let version allows the list to be garbage collected as it's counted, and the non-let version doesn't.
In the end, this is almost certainly a GHCi bug. It might be related to one of the existing space-leak bugs that have been reported (e.g., Trac #12848 or #14336), or maybe it's a new one. I decided to file it as #14789, so maybe someone will take a look at it.
So, I'm working on an assignment for my intro to computer science class. The assignment is as follows.
There is an organism whose population can be determined according to
the following rules:
The organism requires at least one other organism to propagate. Thus,
if the population goes to 1, then the organism will become extinct in
one time cycle (e.g. one breeding season). In an unusual turn of
events, an even number of organisms is not a good thing. The
organisms will form pairs and from each pair, only one organism will
survive If there are an odd number of organisms and this number is
greater than 1 (e.g., 3,5,7,9,…), then this is good for population
growth. The organisms cannot pair up and in one time cycle, each
organism will produce 2 other organisms. In addition, one other
organism will be created. (As an example, let us say there are 3
organisms. Since 3 is an odd number greater than 1, in one time
cycle, each of the 3 organisms will produce 2 others. This yields 6
additional organisms. Furthermore, there is one more organism
produced so the total will be 10 organisms, 3 originals, 6 produced by
the 3, and then 1 more.)
A: Write a program that tests initial populations from 1 to 100,000.
Find all populations that do not eventually become extinct.
Write your answer here:
B: Find the value of the initial population that eventually goes
extinct but that has the largest number of time cycles before it does.
Write your answer here:
The general idea of what I have so far is (lacking sytanx) is this with P representing the population
int generations = 0;
{
if (P is odd) //I'll use a modulus modifier to divide by two and if the result is not 0 then I'll know it's odd
P = 3P + 1
else
P = 1/2 P
generations = generations + 1
}
The problem for me is that I'm uncertain how to tell what numbers will not go extinct or how to figure out which population takes the longest time to go extinct. Any suggestions would be helpful.
Basically what you want to do is this: wrap your code into a while-loop that exits if either P==1 or generations > someMaxValue.
Wrap this construct into a for-loop that counts from 1 to 100,000 and uses this count to set the initial P.
If you always store the generations after your while-loop (e.g. into an array) you can then search for the greatest element in the array.
This problem can actually be harder than it looks at the first sight. First, you should use memorization to speed things up - for example, with 3 you get 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 -> 0, so you know the answer for all those numbers as well (note that every power of 2 will extinct).
But as pointed out by #Jerry, the problem is with the generations which eventually do not extinct - it will be difficult to say when to actually stop. The only chance is that there will (always) be a recurrence (number of organisms you already passed once when examining the current number of organisms), then you can say for sure that the organisms will not extinct.
Edit: I hacked a solution quickly and if it is correct, you are lucky - every population between 1-100,000 seems to eventually extinct (as my program terminated so I didn't actually need to check for recurrences). Will not give you the solution for now so that you can try by yourself and learn, but according to my program the largest number of cycles is 351 (and the number is close to 3/4 of the range). According to the google search for Collatz conjecture, that is a correct number (they say 350 to go to population of 1, where I'm adding one extra cycle to 0), also the initial population number agrees.
One additional hint: Check for integer overflow, and use 64-bit integer (unsigned __int64, unsigned long long) to calculate the population growth, as with 32-bit unsignet int, there is already an overflow in the range of 1-100,000 (the population can indeed grow much higher intermediately) - that was a problem in my initial solution, although it did not change the result. With 64-bit ints I was able to calculate up to 100,000,000 in relatively decent time (didn't try more; optimized release MSVC build), for that I had to limit the memo table to first 80,000,000 items to not go out of memory (compiled in 32-bit with LARGEADDRESSAWARE to be able to use up to 4 GB of memory - when compiled 64-bit the table could of course be larger).
I am trying to create concurrent examples of Lwt and came up with this little sample
let () =
Lwt_main.run (
let start = Unix.time () in
Lwt_io.open_file Lwt_io.Input "/dev/urandom" >>= fun data_source ->
Lwt_unix.mkdir "serial" 0o777 >>= fun () ->
Lwt_list.iter_p
(fun count ->
let count = string_of_int count in
Lwt_io.open_file
~flags:[Unix.O_RDWR; Unix.O_CREAT]
~perm:0o777
~mode:Lwt_io.Output ("serial/file"^ count ^ ".txt") >>= fun h ->
Lwt_io.read ~count:52428800
data_source >>= Lwt_io.write_line h)
[0;1;2;3;4;5;6;7;8;9] >>= fun () ->
let finished = Unix.time () in
Lwt_io.printlf "Execution time took %f seconds" (finished -. start))
EDIT: With asking for 50GB it was:
"However this is incredibly slow and basically useless.
Does the inner bind need to be forced somehow?"
EDIT: I originally had written asking for 50 GB and it never finished, now I have a different problem with asking for 50MB, The execution is nearly instantaneously and du -sh reports only a directory size of 80k.
EDIT: I have also tried the code with explicitly closing the file handles with the same bad result.
I am on OS X latest version and compile with
ocamlfind ocamlopt -package lwt.unix main.ml -linkpkg -o Test
(I have also tried /dev/random, yes I'm using wall-clock time.)
So, your code has some issues.
Issue 1
The main issue is that you understood the Lwt_io.read function incorrectly (and nobody can blame you!).
val read : ?count : int -> input_channel -> string Lwt.t
(** [read ?count ic] reads at most [len] characters from [ic]. It
returns [""] if the end of input is reached. If [count] is not
specified, it reads all bytes until the end of input. *)
When ~count:len is specified it will read at most len characters. At most, means, that it can read less. But if the count option is omitted, then it will read all data. I, personally, find this behavior unintuitive, if not weird. So, this at most means up to len or less, i.e., no guarantee is provided that it will read exactly len bytes. And indeed, if you add a check into your program:
Lwt_io.read ~count:52428800 data_source >>= fun data ->
Lwt_io.printlf "Read %d bytes" (String.length data) >>= fun () ->
Lwt_io.write h data >>= fun () ->
You will see, that it will read only 4096 bytes, per try:
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Why 4096? Because this is the default buffer size. But it actually doesn't matter.
Issue 2
Lwt_io module implements a buffered IO. That means that all your writes and reads are not going directly to a file, but are buffered in the memory. That means, that you should remember to flush and close. Your code doesn't close descriptors on finish, so you can end up with a situation when some buffers are left unflushed after a program is terminated. Lwt_io in particular, flushes all buffers before program exit. But you shouldn't rely on this undocumented feature (it may hit you in future, when you will try any other buffered io, like fstreams from standard C library). So, always close your files (another problem is that today file descriptors are the most precious resource, and their leaking is very hard to find).
Issue 3
Don't use /dev/urandom or /dev/random to measure io. For the former you will measure the performance of random number generator, for the latter you will measure the flow of entropy in your machine. Both are quite slow. Depending on the speed of your CPU, you will rarely get more than 16 Mb/s, and it is much less, then Lwt can throughput. Reading from /dev/zero and writing to /dev/null will actually perform all transfers in memory and will show the actual speed, that can be achieved by your program. A well-written program will be still bounded by the kernel speed. In the example program, provided below, this will show an average speed of 700 MB/s.
Issue 4
Don't use the buffered io, if you're really striving for a performance. You will never get the maximum. For example, Lwt_io.read will read first at buffer, then it will create a string and copy data to that string. If you really need some performance, then you should provide your own buffering. In most cases, there is no need for this, as Lwt_io is quite performant. But if you need to process dozens of megabytes per second, or need some special buffering policy (something non-linear), you may need to think about providing your own buffering. The good news is that Lwt_io allows you to do this. You can take a look at an example program, that will measure the performance of Lwt input/output. It mimics a well-known pv program.
Issue 5
You're expecting to get some performance by running threads in parallel. The problem is that in your test there is no place for the concurrency. /dev/random (as well as /dev/zero) is one device that is bounded only by CPU. This is the same, as just calling a random function. It will always be available, so no system call will block on it. Writing to a regular file is also not a good place for concurrency. First of all, usually there is only one hard-drive, with one writing head in it. Even if system call will block and yield control to another thread, this will result in a performance digression, as two threads will now compete for the header position. If you have SSD, there will not be any competition for the header, but the performance will be still worse, as you will spoil your caches. But fortunately, usually writing on regular files doesn't block. So your threads will run consequently, i.e., they will be serialized.
If you look at your files, you'll see they're each 4097K – that's 4096K that was read from /dev/urandom, plus one byte for the newline. You're reaching a buffer maximum with Lwt_io.read, so even though you say ~count:awholelot, it only gives you ~count:4096.
I don't know what the canonical Lwt way to do this is, but here's one alternative:
open Lwt
let stream_a_little source n =
let left = ref n in
Lwt_stream.from (fun () ->
if !left <= 0 then return None
else Lwt_io.read ~count:!left source >>= (fun s ->
left:=!left - (Bytes.length s);
return (Some s)
))
let main () =
Lwt_io.open_file ~buffer_size:(4096*8) ~mode:Lwt_io.Input "/dev/urandom" >>= fun data_source ->
Lwt_unix.mkdir "serial" 0o777 >>= fun () ->
Lwt_list.iter_p
(fun count ->
let count = string_of_int count in
Lwt_io.open_file
~flags:[Unix.O_RDWR; Unix.O_CREAT]
~perm:0o777
~mode:Lwt_io.Output ("serial/file"^ count ^ ".txt") >>= (fun h ->
Lwt_stream.iter_s (Lwt_io.write h)
(stream_a_little data_source 52428800)))
[0;1;2;3;4;5;6;7;8;9]
let timeit f =
let start = Unix.time () in
f () >>= fun () ->
let finished = Unix.time () in
Lwt_io.printlf "Execution time took %f seconds" (finished -. start)
let () =
Lwt_main.run (timeit main)
EDIT: Note that lwt is a cooperative threading library; when you have two threads going "at the same time", they don't actually do stuff in your OCaml process at the same time. OCaml is (as of yet) single-core, so when one thread is moving, the others wait nicely until that thread says "OK, I've done some work, you others go". So when you try to stream to 8 files at the same time, you're basically doling out a little randomness to file1, then a little to file2, … a little to file8, then (if there's still work left to do) a little to file1, then a little to file2 etc. This makes sense if you're waiting on lots of input anyway (say your input is coming over the network), and your main process has a lot of time to go through each thread and check "is there any input?", but when all your threads are just reading from /dev/random, it'd be much faster to just fill up one file first, then the second, etc. And assuming it's possible for several CPU's to read /dev/(u)random in parallell (and your drive can keep up), it'd of course be much faster to load ncpu reads at the same time, but then you need multicore (or just do this in shell script).
EDIT2: showed how to increase buffer size on the reader, ups the speed a bit ;) Note that you can also simply set the buffer_size as high as you want on your old example, which will read it all in one go, but you can't get more than your buffer_size unless you read several times.
I am reading MSD at following link.
http://www.cs.princeton.edu/~rs/AlgsDS07/18RadixSort.pdf
Here it is mentioned that in MSD may not have to examine all the keys on page 20. How is this related to program on page 18. When I try to put the example in code walk through I am not able to understand how we reduced to examine all the keys.
Thanks!
Let's say you have: (0's can really be anything)
600000
100000
300000
500000
400000
200000
If you only sort on the most significant digit, you get:
100000
200000
300000
400000
500000
600000
After this the array is already sorted and we can stop - there's no need to check the other digits.
While it may not be so ideal in practice, there are certainly cases where we don't have to evaluate all the digits of all the elements.
The if (hi <= lo + 1) return; statement should cause it to return if there's only one element in the subarray, preventing checking unnecessary data.
I'm working my way through Real World Haskell (I'm in chapter 4) and to practice a bit off-book I've created the following program to calculate the nth prime.
import System.Environment
isPrime primes test = loop primes test
where
loop (p:primes) test
| test `mod` p == 0 = False
| p * p > test = True
| otherwise = loop primes test
primes = [2, 3] ++ loop [2, 3] 5
where
loop primes test
| isPrime primes test = test:(loop primes' test')
| otherwise = test' `seq` (loop primes test')
where
test' = test + 2
primes' = primes ++ [test]
main :: IO()
main = do
args <- getArgs
print(last (take (read (head args) :: Int) primes))
Obviously since I'm saving a list of primes this is not a constant space solution. The problem is when I try to get a very large prime say ./primes 1000000 I receive the error:
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase
I'm fairly sure that I got the tail recursion right; reading http://www.haskell.org/haskellwiki/Stack_overflow and the various responses here lead me to believe that it's a byproduct of lazy evaluation, and thunks are building up until it overflows, but so far I've been unsuccessful in fixing it. I've tried using seq in various places to force evaluation, but it hasn't had an effect. Am I on the right track? Is there something else I'm not getting?
As I said in my comment, you shouldn't be building a list by appending a single element list to the end of a really long list (your line primes' = primes ++ [test]). It is better to just define the infinite list, primes, and let lazy evaluation do it's thing. Something like the below code:
primes = [2, 3] ++ loop 5
where.
loop test
| isPrime primes test = test:(loop test')
| otherwise = test' `seq` (loop test')
where
test' = test + 2
Obviously you don't need to parameterize the isPrime function by primes either, but that's just a nit. Also, when you know all the numbers are positive you should use rem instead of mod - this results in a 30% performance increase on my machine (when finding the millionth prime).
First, you don't have tail recursion here, but guarded recursion, a.k.a. tail recursion modulo cons.
The reason you're getting a stack overflow is, as others commented, a thunk pile-up. But where? One suggested culprit is your use of (++). While not optimal, the use of (++) not necessarily leads to a thunk pileup and stack overflow. For instance, calling
take 2 $ filter (isPrime primes) [15485860..]
should produce [15485863,15485867] in no time, and without any stack overflow. But it is still the same code which uses (++), right?
The problem is, you have two lists you call primes. One (at the top level) is infinite, co-recursively produced through guarded (not tail) recursion. Another (an argument to loop) is a finite list, built by adding each newly found prime to its end, used for testing.
But when it is used for testing, it is not forced through to its end. If that happened there wouldn't be an SO problem. It is only forced through to the sqrt of a test number. So (++) thunks do pile up past that point.
When isPrime primes 15485863 is called, it forces the top-level primes up to 3935, which is 547 primes. The internal testing-primes list too consists of 547 primes, of which only first 19 are forced.
But when you call primes !! 1000000, out of the 1,000,000 primes in the duplicate internal list only 547 are forced. The rest are all in thunks.
If you were adding new primes to the end of testing-primes list only when their square was seen among the candidates, the testing-primes list would be always forced through completely, or nearly to its end, and there wouldn't be a thunk pileup causing the SO. And appending with (++) to the end of a forced list is not that bad when next access forces that list to its end and leaves no thunks behind. (It still copies the list though.)
Of course the top-level primes list can be used directly, as Thomas M. DuBuisson shows in his answer.
But the internal list has its uses. When correctly implemented, adding new primes to it only when their square is seen among the candidates, it may allow your program to run in O(sqrt(n)) space, when compiled with optimizations.
You should probably check these two questions:
How can I increase the stack size with runhaskell?
How to avoid stack space overflows?