Stack space overflow when computing primes


I'm working my way through Real World Haskell (I'm in chapter 4) and to practice a bit off-book I've created the following program to calculate the nth prime.
import System.Environment

isPrime primes test = loop primes test
  where
    loop (p:primes) test
      | test `mod` p == 0 = False
      | p * p > test = True
      | otherwise = loop primes test

primes = [2, 3] ++ loop [2, 3] 5
  where
    loop primes test
      | isPrime primes test = test:(loop primes' test')
      | otherwise = test' `seq` (loop primes test')
      where
        test' = test + 2
        primes' = primes ++ [test]

main :: IO()
main = do
  args <- getArgs
  print(last (take (read (head args) :: Int) primes))
Obviously, since I'm saving a list of primes, this is not a constant-space solution. The problem is that when I try to get a very large prime, say ./primes 1000000, I receive the error:
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase
I'm fairly sure that I got the tail recursion right; reading http://www.haskell.org/haskellwiki/Stack_overflow and the various responses here leads me to believe that it's a byproduct of lazy evaluation, with thunks building up until the stack overflows, but so far I've been unsuccessful in fixing it. I've tried using seq in various places to force evaluation, but it hasn't had an effect. Am I on the right track? Is there something else I'm not getting?

As I said in my comment, you shouldn't be building a list by appending a single-element list to the end of a really long list (your line primes' = primes ++ [test]). It is better to just define the infinite list, primes, and let lazy evaluation do its thing. Something like the code below:
primes = [2, 3] ++ loop 5
  where
    loop test
      | isPrime primes test = test:(loop test')
      | otherwise = test' `seq` (loop test')
      where
        test' = test + 2
Obviously you don't need to parameterize the isPrime function by primes either, but that's just a nit. Also, when you know all the numbers are positive you should use rem instead of mod - this results in a 30% performance increase on my machine (when finding the millionth prime).

First, you don't have tail recursion here, but guarded recursion, a.k.a. tail recursion modulo cons.
The reason you're getting a stack overflow is, as others commented, a thunk pile-up. But where? One suggested culprit is your use of (++). While not optimal, the use of (++) does not necessarily lead to a thunk pile-up and stack overflow. For instance, calling
take 2 $ filter (isPrime primes) [15485860..]
should produce [15485863,15485867] in no time, and without any stack overflow. But it is still the same code which uses (++), right?
The problem is, you have two lists you call primes. One (at the top level) is infinite, co-recursively produced through guarded (not tail) recursion. Another (an argument to loop) is a finite list, built by adding each newly found prime to its end, used for testing.
But when it is used for testing, it is not forced through to its end. If that happened there wouldn't be an SO problem. It is only forced through to the sqrt of a test number. So (++) thunks do pile up past that point.
When isPrime primes 15485863 is called, it forces the top-level primes up to 3935, which is 547 primes. The internal testing-primes list also consists of 547 primes, of which only the first 19 are forced.
But when you call primes !! 1000000, out of the 1,000,000 primes in the duplicate internal list only 547 are forced. The rest are all in thunks.
If you were adding new primes to the end of the testing-primes list only when their square was seen among the candidates, the testing-primes list would always be forced through completely, or nearly to its end, and there wouldn't be a thunk pile-up causing the SO. And appending with (++) to the end of a forced list is not that bad when the next access forces that list to its end and leaves no thunks behind. (It still copies the list though.)
Of course the top-level primes list can be used directly, as Thomas M. DuBuisson shows in his answer.
But the internal list has its uses. When correctly implemented, adding new primes to it only when their square is seen among the candidates, it may allow your program to run in O(sqrt(n)) space, when compiled with optimizations.
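For comparison, here is a minimal sketch of that bookkeeping in C++. C++ is strict, so it cannot reproduce the thunk build-up; it only illustrates when a prime joins the set used for testing, namely once its square has been reached among the candidates. The names nthPrime and active are made up for this illustration, and unlike the O(sqrt(n))-space variant mentioned above, this sketch keeps every prime it finds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the n-th prime (1-based) by trial division.
// Only the first `active` primes are used as divisors; a prime joins that
// prefix once its square has been reached among the candidates.
std::uint64_t nthPrime(std::size_t n) {
    assert(n >= 1);
    std::vector<std::uint64_t> primes{2, 3};
    if (n <= 2) return primes[n - 1];

    std::size_t active = 1;  // primes[0..active) are the current testing primes
    for (std::uint64_t c = 5; primes.size() < n; c += 2) {
        // Extend the testing prefix while the next prime's square is <= c.
        while (primes[active] * primes[active] <= c) ++active;

        bool isPrime = true;
        for (std::size_t i = 0; i < active; ++i) {
            if (c % primes[i] == 0) { isPrime = false; break; }
        }
        if (isPrime) primes.push_back(c);
    }
    return primes.back();
}
```

Calling nthPrime(1000000) corresponds to the ./primes 1000000 run in the question; the testing prefix stays at a few hundred primes the whole time.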

You should probably check these two questions:
How can I increase the stack size with runhaskell?
How to avoid stack space overflows?

Related

Efficient way for generating coprime pairs

I need to print the number of coprime pairs (a,b), 0 < a <= b <= n and for n = 10^8 the program should run in less than 10 seconds. I have used this method : http://mathworld.wolfram.com/CarefreeCouple.html
But the program isn't as fast as I expected.
I have heard about an efficient way of solving this problem by using something called a 'Farey sequence', but the code was written in PHP and I can only understand C.
So which method can help me solve the problem? thanks for the time.

Your stated interest is in co-prime pairs (a, b). The carefree couple adds an additional restriction that a is square-free. Therefore it is not the same problem, though some of the math is similar. As I understand your problem it is equivalent to summing the Euler totient function from 1 to n, the so-called Totient Summatory Function.
I do not know of any tricks that give a simple closed-form solution. However, I think modifying a straightforward Sieve of Eratosthenes (SoE) should get you an answer in much less than 10 seconds in most programming languages.
Normally running the SoE simply yields a list of the primes <= n. Our goal changes, however, to computing the complete prime-power factorization of each integer between 1 and n inclusive. To do that we must store more than a single bit of information for each sieve entry: we must store a list. As we sieve a prime p through the array, we add (p, 1) to the list already stored at that entry. Then we sieve by p^2 and change the (p,1) entries in each location we hit to (p,2), and so on for each power of p <= n and every p <= n. When it finishes, you can compute the totient function quickly for every value 1 <= x <= n and sum them up.
EDIT:
I see that there is already a much better set of answers to the question on math.stackexchange.com here. I'll leave this answer up for awhile until the disposition of the question is settled.
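As a rough sketch of the sieve idea, here is a simplified variant in C++. Instead of storing a factor list per entry, it keeps one integer per entry and multiplies in the factor (1 - 1/p) for each prime p that divides it, which yields the same totient values. This simplification and the function name countCoprimePairs are assumptions of this sketch, not part of the answer above. For n = 10^8 the table needs about 400 MB, and whether it meets the 10-second limit would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Number of pairs (a, b) with 0 < a <= b <= n and gcd(a, b) = 1,
// computed as phi(1) + phi(2) + ... + phi(n) with a totient sieve.
std::uint64_t countCoprimePairs(std::uint32_t n) {
    assert(n >= 1);
    std::vector<std::uint32_t> phi(static_cast<std::size_t>(n) + 1);
    std::iota(phi.begin(), phi.end(), 0u);  // start with phi[i] = i

    for (std::uint64_t p = 2; p <= n; ++p) {
        if (phi[p] != p) continue;  // already reduced by a smaller prime, so p is composite
        for (std::uint64_t m = p; m <= n; m += p) {
            phi[m] -= phi[m] / p;   // multiply phi[m] by (1 - 1/p)
        }
    }

    std::uint64_t total = 0;
    for (std::uint32_t i = 1; i <= n; ++i) total += phi[i];
    return total;
}
```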

Haskell List Generator High Memory Usage

While working on a competitive programming problem I discovered an interesting issue that drastically reduced the performance of some of my code. After much experimentation I have managed to reduce the issue to the following minimal example:
module Main where

main = interact handle

handle :: String -> String
-- handle s = show $ sum l
-- handle s = show $ length l
-- handle s = show $ seq (length l) (sum l)
  where
    l = [0..10^8] :: [Int]
If you uncomment each commented line individually, compile with ghc -O2 test.hs and run with time ./test > /dev/null, you should get something like the following:
For sum l:
0.02user 0.00system 0:00.03elapsed 93%CPU (0avgtext+0avgdata 3380maxresident)k
0inputs+0outputs (0major+165minor)pagefaults 0swaps
For length l:
0.02user 0.00system 0:00.02elapsed 100%CPU (0avgtext+0avgdata 3256maxresident)k
0inputs+0outputs (0major+161minor)pagefaults 0swaps
For seq (length l) (sum l):
5.47user 1.15system 0:06.63elapsed 99%CPU (0avgtext+0avgdata 7949048maxresident)k
0inputs+0outputs (0major+1986697minor)pagefaults 0swaps
Look at that huge increase in peak memory usage. This makes some amount of sense, because of course both sum and length can lazily consume the list as a stream, while the seq will be triggering the evaluation of the whole list, which must then be stored. But the seq version of the code is using just shy of 8 GB of memory to handle a list that contains just 400 MB of actual data. The purely functional nature of Haskell lists could explain some small constant factor, but a 20 fold increase in memory seems unintended.
This behaviour can be triggered by a number of things. Perhaps the easiest way is using force from Control.DeepSeq, but the way in which I originally encountered this was while using Data.Array.IArray (I can only use the standard library) and trying to construct an array from a list. The implementation of Array is monadic, and so was forcing the evaluation of the list from which it was being constructed.
If anyone has any insight into the underlying cause of this behaviour, I would be very interested to learn why this happens. I would of course also appreciate any suggestions as to how to avoid this issue, bearing in mind that I have to use just the standard library in this case, and that every Array constructor takes and eventually forces a list.
I hope you find this issue as interesting as I did, but hopefully less baffling.
EDIT: user2407038's comment made me realize I had forgotten to post profiling results. I have tried profiling this code and the profiler simply states that 100% of allocations are performed in handle.l, so it seems that simply anything that forces the evaluation of the list uses huge amounts of memory. As I mentioned above, using the force function from Control.DeepSeq, constructing an Array, or anything else that forces the list causes this behaviour. I am confused as to why it would ever require 8 GB of memory to compute a list containing 400 MB of data. Even if every element in the list required two 64-bit pointers, that is still only a factor of 5, and I would think GHC would be able to do something more efficient than that. If not this is an obvious bottleneck for the Array package, as constructing any array inherently requires us to allocate far more memory than the array itself.
So, ultimately: Does anyone have any idea why forcing a list requires such huge amounts of memory, which has such a high cost on performance?
EDIT: user2407038 provided a link to the very helpful GHC Memory Footprint reference. This explains exactly the data sizes of everything, and almost entirely explains the huge overhead: An [Int] is specified as requiring 5N+1 words of memory, which at 8 bytes per word gives 40 bytes per element. In this example that would suggest 4 GB, which accounts for half the total peak usage. It is easy to then believe that the evaluation of sum would then add a similar factor, so this answers my question.
Thanks to all commenters for your help.
EDIT: As I mentioned above, I originally encountered this behaviour while trying to construct an Array. Having had a bit of a dig into GHC.Arr I have found what I think is the root cause of this behaviour when constructing an array: The constructor folds over the list to compose a program in the ST monad that it then runs. Obviously the ST can't be executed until it is completely composed, and in this case the ST construct will be large and linear in the size of the input. To avoid this behaviour we would have to somehow modify the constructor to stream elements from the list as it adds them in ST.

There are multiple factors that come into play here. The first one is that GHC will lazily lift l out of handle. This would enable handle to reuse l, so that you don't have to recalculate it every time, but in this case it creates a space leak. You can check this if you dump the simplified core with -ddump-simpl:
Main.handle_l :: [Int]
[GblId,
Str=DmdType,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 40 0}]
Main.handle_l =
case Main.handle3 of _ [Occ=Dead] { GHC.Types.I# y_a1HY ->
GHC.Enum.eftInt 0 y_a1HY
}
The functionality to calculate the [0..10^7] (see footnote 1) is hidden away in other functions, but essentially, handle_l = [0..10^7], at top level (TopLvl=True). It won't get reclaimed, since you may or may not use handle again. If we use handle s = show $ length l, l itself will be inlined. You will not find any TopLvl=True function that has type [Int].
So GHC detects that you use l twice and creates a top-level CAF. How big is that CAF? An Int takes two words:
data Int = I# Int#
One for I#, one for Int#. How much for [Int]?
data [a] = [] | (:) a ([a]) -- pseudo, but similar
That's one word for [], and three words for (:) a ([a]). A list of [Int] with size N will therefore have a total size of (3N + 1) + 2N words, in your case 5N+1 words. Given your memory, I assume a word is 8 bytes on your platform, so we end up with
5 * 10^8 * 8 bytes = 4 000 000 000 bytes
So how do we get rid of that list? The first option we have is to get rid of l:
handle _ = show $ seq (length [0..10^8]) (sum [0..10^8])
This will now run in constant memory due to foldr/build rules. While we have [0..10^8] there twice, they don't share the same name. If we check the runtime statistics (+RTS -s), we will see that it runs in constant memory:
> SO.exe +RTS -s
5000000050000000
4,800,066,848 bytes allocated in the heap
159,312 bytes copied during GC
43,832 bytes maximum residency (2 sample(s))
20,576 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9154 colls, 0 par 0.031s 0.013s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 4.188s ( 4.232s elapsed)
GC time 0.031s ( 0.013s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 4.219s ( 4.247s elapsed)
%GC time 0.7% (0.3% elapsed)
Alloc rate 1,146,284,620 bytes per MUT second
Productivity 99.3% of total user, 98.6% of total elapsed
But that's not really nice, since we now have to track all the uses of [0..10^8]. What if we create a function instead?
handle :: String -> String
handle _ = show $ seq (length $ l ()) (sum $ l ())
  where
    {-# INLINE l #-}
    l _ = [0..10^7] :: [Int]
This works, but we must inline l, otherwise we get the same problem as before if we use optimizations. -O1 (and -O2) enable -ffull-laziness, which—together with common subexpression elimination—would lift l () to the top. So we either need to inline it or use -O2 -fno-full-laziness to prevent that behaviour.
1 Had to decrease the list size, otherwise I would have started swapping.

Can't figure out how to program in outputs for the given situations

So, I'm working on an assignment for my intro to computer science class. The assignment is as follows.
There is an organism whose population can be determined according to the following rules:
The organism requires at least one other organism to propagate. Thus, if the population goes to 1, then the organism will become extinct in one time cycle (e.g. one breeding season).
In an unusual turn of events, an even number of organisms is not a good thing. The organisms will form pairs and from each pair, only one organism will survive.
If there are an odd number of organisms and this number is greater than 1 (e.g., 3,5,7,9,…), then this is good for population growth. The organisms cannot pair up and in one time cycle, each organism will produce 2 other organisms. In addition, one other organism will be created. (As an example, let us say there are 3 organisms. Since 3 is an odd number greater than 1, in one time cycle, each of the 3 organisms will produce 2 others. This yields 6 additional organisms. Furthermore, there is one more organism produced so the total will be 10 organisms, 3 originals, 6 produced by the 3, and then 1 more.)
A: Write a program that tests initial populations from 1 to 100,000. Find all populations that do not eventually become extinct.
Write your answer here:
B: Find the value of the initial population that eventually goes extinct but that has the largest number of time cycles before it does.
Write your answer here:
The general idea of what I have so far (lacking proper syntax) is this, with P representing the population:
int generations = 0;
{
    if (P % 2 != 0) // odd: the remainder of dividing by two is not 0
        P = 3 * P + 1;
    else
        P = P / 2;
    generations = generations + 1;
}
The problem for me is that I'm uncertain how to tell what numbers will not go extinct or how to figure out which population takes the longest time to go extinct. Any suggestions would be helpful.

Basically what you want to do is this: wrap your code into a while-loop that exits if either P==1 or generations > someMaxValue.
Wrap this construct into a for-loop that counts from 1 to 100,000 and uses this count to set the initial P.
If you always store the generations after your while-loop (e.g. into an array) you can then search for the greatest element in the array.

This problem can actually be harder than it looks at first sight. First, you should use memoization to speed things up - for example, with 3 you get 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 -> 0, so you know the answer for all those numbers as well (note that every power of 2 will go extinct).
But as pointed out by @Jerry, the problem is with the populations which never go extinct - it will be difficult to say when to actually stop. The only chance is that there will (always) be a recurrence (a number of organisms you already passed once when examining the current number of organisms); then you can say for sure that the organisms will not go extinct.
Edit: I hacked a solution quickly and if it is correct, you are lucky - every population between 1 and 100,000 seems to eventually go extinct (my program terminated, so I didn't actually need to check for recurrences). I will not give you the solution for now so that you can try by yourself and learn, but according to my program the largest number of cycles is 351 (and the number is close to 3/4 of the range). According to a Google search for the Collatz conjecture, that is the correct number (they say 350 to go to a population of 1, where I'm adding one extra cycle to 0), and the initial population number agrees as well.
One additional hint: Check for integer overflow, and use a 64-bit integer (unsigned __int64, unsigned long long) to calculate the population growth, as with a 32-bit unsigned int there is already an overflow in the range of 1-100,000 (the population can indeed grow much higher intermediately) - that was a problem in my initial solution, although it did not change the result. With 64-bit ints I was able to calculate up to 100,000,000 in relatively decent time (didn't try more; optimized release MSVC build); for that I had to limit the memo table to the first 80,000,000 items to not run out of memory (compiled in 32-bit with LARGEADDRESSAWARE to be able to use up to 4 GB of memory - when compiled 64-bit the table could of course be larger).
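The memoization and overflow hints can be sketched roughly as follows in C++. The names Longest and longestLived are invented for this illustration. The sketch assumes that every starting population eventually drops below its own starting value, which the answer above reports to hold for 1 to 100,000; without that assumption the inner loop would need a step cap or a recurrence check.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Longest {
    std::uint32_t start;   // initial population
    std::uint32_t cycles;  // time cycles until the population reaches 0
};

// Finds the initial population in [1, limit] that survives the most cycles.
// cycles[s] is filled in increasing order of s, so a walk can stop as soon
// as the population drops below its starting value and reuse the stored count.
Longest longestLived(std::uint32_t limit) {
    assert(limit >= 1);
    std::vector<std::uint32_t> cycles(limit + 1, 0);  // cycles[0] = 0: already extinct
    Longest best{0, 0};
    for (std::uint32_t start = 1; start <= limit; ++start) {
        std::uint64_t p = start;  // 64-bit: intermediate populations exceed 32 bits
        std::uint32_t steps = 0;
        while (p >= start) {
            if (p == 1) {
                p = 0;                                // a lone organism dies out
            } else if (p % 2 == 0) {
                p /= 2;                               // pairs: one survivor each
            } else {
                assert(p <= (UINT64_MAX - 1) / 3);    // guard against overflow
                p = 3 * p + 1;                        // odd and > 1: growth
            }
            ++steps;
        }
        cycles[start] = steps + cycles[p];
        if (cycles[start] > best.cycles) best = {start, cycles[start]};
    }
    return best;
}
```

Storing only the count for each starting value keeps the table at limit + 1 entries; a fuller memo table, as described above, trades memory for fewer repeated steps.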

Algo: find max Xor in array for various interval limits, given N inputs, and p,q where 0<=p<=i<=q<=N

the problem statement is the following:
Xorq has invented an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, … xn as key. To implement this algorithm efficiently, Xorq needs to find maximum value for (a xor xj) for given integers a,p and q such that p<=j<=q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow.
First line of each test case contains two integers N and Q separated by a single space (1<= N<=100,000; 1<=Q<= 50,000). Next line contains N integers x1, x2, … xn separated by a single space (0<=xi< 2^15). Each of next Q lines describe a query which consists of three integers ai,pi and qi (0<=ai< 2^15, 1<=pi<=qi<= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi<=j<=qi in a single line.
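#include <iostream>
using namespace std;

int xArray[100000];

int main()
{
    int t, n, q;
    cin >> t;
    for (int j = 0; j < t; j++)
    {
        cin >> n >> q;
        //int* xArray = (int*)malloc(n*sizeof(int));
        int i, a, pi, qi;
        for (i = 0; i < n; i++)
        {
            cin >> xArray[i];
        }
        for (i = 0; i < q; i++)
        {
            cin >> a >> pi >> qi;
            int max = 0;
            // brute force: scan every element of the query range
            for (int it = pi - 1; it < qi; it++)
            {
                int x = xArray[it] ^ a;
                if (x > max)
                    max = x;
            }
            cout << max << "\n";
        }
    }
    return 0;
}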
No other assumptions may be made except for those stated in the text of the problem (numbers are not sorted).
The code is functional but not fast enough; is reading from stdin really that slow or is there anything else I'm missing?

XOR flips bits. For 8-bit values, the max result of XOR is 0b11111111.
To get the best result:
if 'a' has 1 in the ith place, then you have to XOR it with a key whose ith bit = 0
if 'a' has 0 in the ith place, then you have to XOR it with a key whose ith bit = 1
Simply put, for bit B you need !B.
Another obvious thing is that higher-order bits are more important than lower-order bits.
That is:
if 'a' has B in the highest place and you have found a key with highest bit = !B
then ALL keys that have highest bit = B are worse than this one
This cuts your amount of numbers by half "on average".
How about building a huge binary tree from all the keys and ordering them in the tree by their bits, from MSB to LSB. Then, cutting the A bit-by-bit from MSB to LSB would tell you which left-right branch to take next to get the best result. Of course, that ignores PI/QI limits, but surely would give you the best result since you always pick the best available bit on i-th level.
Now if you annotate the tree nodes with the low/high index ranges of their subelements (done only once when building the tree), then later when querying against a case A-PI-QI you could use that to filter out branches that do not fall in the index range.
The point is that if you order the tree levels like the MSB->LSB bit order, then the decision performed at the "upper nodes" could guarantee you that currently you are in the best possible branch, and it would hold even if all the subbranches were the worst:
Being at level 3, the result of
0b111?????
can be then expanded into
0b11100000
0b11100001
0b11100010
and so on, but even if the ????? are expanded poorly, the overall result is still greater than
0b11011111
which would be the best possible result if you had picked the other branch at level 3.
I have absolutely no idea how long preparing the tree would take, but querying it for an A-PI-QI that has 32 bits seems to be something like 32 times N-comparisons and jumps, certainly faster than iterating randomly 0-100000 times and xor/maxing. And since you have up to 50000 queries, building such a tree can actually be a good investment, since the tree would be built once per keyset.
Now, the best part is that you actually don't need the whole tree. You may build it from, e.g., the first two or four or eight bits only, and use the index ranges from the nodes to limit your xor-max loop to a smaller part. At worst, you'd end up with the same range as PI-QI. At best, it'd be down to one element.
But, looking at the max N keys, I think the whole tree might actually fit in the memory pool and you may get away without any xor-maxing loop.
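A rough C++ sketch of such a tree follows. One deviation from the description above, which is an assumption of this sketch: a low/high pair per node can only rule branches out, so each node here stores the ascending list of key positions that pass through it, and a binary search decides exactly whether a branch contains a key inside the queried range. The class name XorRangeTrie and its members are invented for illustration; bounds are 0-based and inclusive.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Binary trie over the keys, most significant bit first.
class XorRangeTrie {
public:
    static constexpr int kBits = 15;  // keys and a are below 2^15 in the problem

    explicit XorRangeTrie(const std::vector<int>& keys) : nodes_(1) {
        for (int pos = 0; pos < static_cast<int>(keys.size()); ++pos) {
            int cur = 0;
            for (int b = kBits - 1; b >= 0; --b) {
                const int bit = (keys[pos] >> b) & 1;
                if (nodes_[cur].child[bit] < 0) {
                    nodes_[cur].child[bit] = static_cast<int>(nodes_.size());
                    nodes_.emplace_back();
                }
                cur = nodes_[cur].child[bit];
                nodes_[cur].positions.push_back(pos);  // stays sorted: pos increases
            }
        }
    }

    // Maximum of (a xor keys[j]) over lo <= j <= hi; the range must be valid.
    int maxXor(int a, int lo, int hi) const {
        assert(lo <= hi);
        int cur = 0, best = 0;
        for (int b = kBits - 1; b >= 0; --b) {
            const int want = ((a >> b) & 1) ^ 1;  // the opposite bit maximises xor
            const int preferred = nodes_[cur].child[want];
            if (preferred >= 0 && hasPositionIn(preferred, lo, hi)) {
                best |= 1 << b;
                cur = preferred;
            } else {
                cur = nodes_[cur].child[want ^ 1];
                assert(cur >= 0);  // some key in range must continue this way
            }
        }
        return best;
    }

private:
    struct Node {
        int child[2] = {-1, -1};
        std::vector<int> positions;  // ascending positions of keys below this node
    };

    bool hasPositionIn(int node, int lo, int hi) const {
        const std::vector<int>& p = nodes_[node].positions;
        auto it = std::lower_bound(p.begin(), p.end(), lo);
        return it != p.end() && *it <= hi;
    }

    std::vector<Node> nodes_;
};
```

With this layout a query costs one binary search per bit level, and a query (a, p, q) from the problem statement maps to maxXor(a, p - 1, q - 1).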

I've spent some time googling this problem and it seems that you can find it in the context of various programming competitions. While the brute-force approach is intuitive, it does not really solve the challenge as it is too slow.
There are a few constraints in the problem which you need to exploit in order to write a faster algorithm:
the input consists of max 100k numbers, but there are only 32768 (2^15) possible numbers
for each input array there are Q, max 50k, test cases; each test case consists of 3 values, a,pi,and qi. Since 0<=a<2^15 and there are 50k cases, there is a chance the same value will come up again.
I've found 2 ideas for solving the problem: splitting the input in sqrt(N) intervals and building a segment tree ( a nice explanation for these approaches can be found here )
The biggest problem is the fact that for each test case you can have different values for a, and that would make previous results useless, since you need to compute max(a^x[i]), for a small number of test cases. However when Q is large enough and the value a repeats, using previous results can be possible.
I will come back with the actual results once I finish implementing both methods

Efficient convergence check

I have a grid with thousands of double precision reals.
It's iterating through, and I need it to stop when it's reached convergence to 3 decimal places.
The target is to have it run as fast as possible, but it needs to give the same result (to 3 dp) every time.
At the minute I'm doing something like this
REAL(KIND=DP) :: TOL = 0.001_DP
DO WHILE(.NOT. CONVERGED)
  CONVERGED = .TRUE.
  DO I = 1, NUM_POINTS
    NEW_POTENTIAL = !blah blah blah
    IF (CONVERGED) THEN
      IF (NEW_POTENTIAL < OLD_POTENTIAL - TOL .OR. NEW_POTENTIAL > OLD_POTENTIAL + TOL) THEN
        CONVERGED = .FALSE.
      END IF
    END IF
    OLD_POTENTIAL = NEW_POTENTIAL
  END DO
END DO
I'm thinking that many IF statements can't be too great for performance. I thought about checking for convergence at the end; finding the average value (summing the whole grid, divide by num_points), and checking if that has converged in the same way as above, but I'm not convinced this will always be accurate.
What is the best way of doing this?

If I understand correctly you've got some kind of time-stepping going on, where you create the values in new_potential by calculations on old_potential. Then make old equal to new and carry on.
You could replace your existing convergence tests with the single statement
converged = all(abs(new_potential - old_potential)<tol)
which might be faster. If the speed of the test is a major concern you could test only every other (or every third or fourth ...) iteration.
A few comments:
1) If you used a potential array with 2 planes, instead of an old_ and new_potential, you could transfer new_ into old_ by swapping indices at the end of each iteration. As your code stands there's a lot of data movement going on.
2) While semantically you are right to have a while loop, I'd always use a do loop with a maximum number of iterations, just in case the convergence criterion is never met.
3) In your declaration REAL(KIND=DP) :: TOL = 0.001_DP the specification of DP on the numerical value of TOL is redundant, REAL(KIND=DP) :: TOL = 0.001 is adequate. I'd also make this a parameter, the compiler may be able to optimise its use if it knows that it is immutable.
4) You don't really need to execute CONVERGED = .TRUE. inside the outermost loop, set it before the first iteration -- this will save you a nanosecond or two.
Finally, if your convergence criterion is that every element in the potential array has converged to 3dp then that is what you should test for. It would be relatively easy to construct counterexamples for your suggested averages. However, my concern would be that your system will never converge on every element and that you should be using some matrix norm computation to determine convergence. SO is not the place for a lesson in that topic.
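To make the element-wise test, the iteration cap and the buffer swap concrete, here is a hedged sketch in C++ rather than Fortran. The update rule in relaxStep (a 1D Jacobi relaxation with fixed end points) is purely a stand-in for the "blah blah blah" in the question, and all function names are assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when every element changed by less than tol; stops at the first failure.
bool allWithinTol(const std::vector<double>& a, const std::vector<double>& b, double tol) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(a[i] - b[i]) >= tol) return false;
    }
    return true;
}

// Hypothetical update: each interior point becomes the mean of its neighbours.
void relaxStep(const std::vector<double>& oldP, std::vector<double>& newP) {
    newP.front() = oldP.front();
    newP.back() = oldP.back();
    for (std::size_t i = 1; i + 1 < oldP.size(); ++i) {
        newP[i] = 0.5 * (oldP[i - 1] + oldP[i + 1]);
    }
}

// Iterates until every point has converged or maxIter is reached.
// Returns the number of iterations used, or -1 if it did not converge.
int iterateUntilConverged(std::vector<double>& p, double tol, int maxIter) {
    assert(p.size() >= 2);
    std::vector<double> next(p.size());
    for (int it = 1; it <= maxIter; ++it) {
        relaxStep(p, next);
        const bool done = allWithinTol(next, p, tol);
        p.swap(next);  // exchange buffers instead of copying new into old
        if (done) return it;
    }
    return -1;
}
```

The swap exchanges the two buffers in constant time, which corresponds to the index-swapping suggestion in comment 1, and the bounded loop corresponds to comment 2.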

What are the calculations for the convergence criteria? Unless they are more expensive than the calculations to advance the potential, it is probably better to have the IF statement terminate the loop as soon as possible rather than guess a very large number of iterations to be sure of obtaining a good solution.
Re High Performance Mark's suggestion #1, if the copying operation is a significant portion of the run time, you could also use pointers.
The only way to be sure about this stuff is to measure the run time ... Fortran provides intrinsic functions to measure both CPU and clock time. Otherwise you may modify some portion of your code to make it faster, perhaps making it harder to understand and possibly introducing a bug, possibly without much improvement in runtime ... if that portion was taking a small amount of the total runtime, no amount of cleverness can make much difference.
As High Performance Mark says, though the current semantics are elegant, you probably want to guard against an infinite loop. One approach:
PotentialLoop: do i=1, MaxIter
  blah
  Converged = test...
  if (Converged) exit PotentialLoop
  blah
end do PotentialLoop
if (.NOT. Converged) write (*, *) "error, did not converge"

I = 1
DO
  NEWPOT = !bla bla bla
  IF (ABS(NEWPOT-OLDPOT).LT.TOL) EXIT
  OLDPOT = NEWPOT
  I = MOD(I,NUMPOINTS) + 1
END DO
Maybe better
I = 1
DO
  NEWPOT = !bla bla bla
  IF (ABS(NEWPOT-OLDPOT).LT.TOL) EXIT
  OLDPOT = NEWPOT
  IF (I.EQ.NUMPOINTS) THEN
    I = 1
  ELSE
    I = I + 1
  END IF
END DO
