Yet another synthetic benchmark: Sieve of Eratosthenes
C++
#include <vector>
#include <cmath>
void find_primes(int n, std::vector<int>& out)
{
std::vector<bool> is_prime(n + 1, true);
int last = sqrt(n);
for (int i = 2; i <= last; ++i)
{
if (is_prime[i])
{
for (int j = i * i; j <= n; j += i)
{
is_prime[j] = false;
}
}
}
for (unsigned i = 2; i < is_prime.size(); ++i)
{
if (is_prime[i])
{
out.push_back(i);
}
}
}
OCaml (using Jane Street's Core and Res libraries)
open Core.Std
module Bits = Res.Bits
module Vect = Res.Array
let find_primes n =
let is_prime = Bits.make (n + 1) true in
let last = float n |! sqrt |! Float.iround_exn ~dir:`Zero in
for i = 2 to last do
if not (Bits.get is_prime i) then () else begin
let j = ref (i * i) in
while !j <= n; do
Bits.set is_prime !j false;
j := !j + i;
done;
end;
done;
let ar = Vect.empty () in
for i = 2 to n do
if Bits.get is_prime i then Vect.add_one ar i else ()
done;
ar
I was surprised that the OCaml version (native) is about 13 times slower than the C++ one. I replaced Res.Bits with Core_extended.Bitarray, but it became ~18 times slower. Why is it so slow? Doesn't OCaml provide fast operations for bit manipulation? Is there an alternative fast implementation of bit arrays?
To be clear: I come from the C++ world and am considering OCaml as a possible alternative for writing performance-critical code. Frankly, these results scare me a bit.
EDIT:
Profiling results
Each sample counts as 0.01 seconds.
  %   cumulative   self                self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
 50.81      1.26     1.26                              camlRes__pos_1113
  9.72      1.50     0.24                              camlRes__unsafe_get_1117
  6.68      1.66     0.17                              camlRes__unsafe_set_1122
  6.28      1.82     0.16                              camlNopres_impl__set_1054
  6.07      1.97     0.15                              camlNopres_impl__get_1051
  5.47      2.10     0.14  47786824     0.00     0.00  caml_apply3
  3.64      2.19     0.09  22106943     0.00     0.00  caml_apply2
  2.43      2.25     0.06    817003     0.00     0.00  caml_oldify_one
  2.02      2.30     0.05         1    50.00   265.14  camlPrimes__find_primes_64139
  1.21      2.33     0.03                              camlRes__unsafe_get_1041
...
Did you try using a simple data structure first, before jumping to the sophisticated ones?
On my machine, the following code is only 4x slower than your C++ version (note that I made the minimal changes to use an Array as the cache and a list to accumulate results; you could use the array get/set syntactic sugar):
let find_primes n =
let is_prime = Array.make (n + 1) true in
let last = int_of_float (sqrt (float n)) in
for i = 2 to last do
if not (Array.get is_prime i) then () else begin
let j = ref (i * i) in
while !j <= n; do
Array.set is_prime !j false;
j := !j + i;
done;
end;
done;
let ar = ref [] in
for i = 2 to n do
if Array.get is_prime i then ar := i :: !ar else ()
done;
!ar (* dereference so the function returns the list itself, not the ref *)
(4x slower: it takes 4s to compute the 10_000_000 first primes, vs. 1s
for g++ -O1 or -O2 on your code)
Realizing that the efficiency of your bitvector solution probably
comes from the economic memory layout, I changed the code to use
strings instead of arrays:
let find_primes n =
let is_prime = String.make (n + 1) '0' in
let last = int_of_float (sqrt (float n)) in
for i = 2 to last do
if not (String.get is_prime i = '0') then () else begin
let j = ref (i * i) in
while !j <= n; do
String.set is_prime !j '1';
j := !j + i;
done;
end;
done;
let ar = ref [] in
for i = 2 to n do
if String.get is_prime i = '0' then ar := i :: !ar else ()
done;
!ar (* dereference so the function returns the list itself, not the ref *)
This now takes only 2s, which makes it 2x slower than your C++
solution.
It seems Jeffrey Scofield is right: this terrible performance degradation is due to the div and mod operations.
I prototyped a small Bitarray module:
module Bitarray = struct
type t = { len : int; buf : string }
let create len x =
let init = (if x = true then '\255' else '\000') in
let buf = String.make (len / 8 + 1) init in
{ len = len; buf = buf }
let get t i =
let ch = int_of_char (t.buf.[i lsr 3]) in
let mask = 1 lsl (i land 7) in
(ch land mask) <> 0
let set t i b =
let index = i lsr 3 in
let ch = int_of_char (t.buf.[index]) in
let mask = 1 lsl (i land 7) in
let new_ch = if b then (ch lor mask) else (ch land lnot mask) in
t.buf.[index] <- char_of_int new_ch
end
It uses a string as a byte array (8 bits per char). Initially I used x / 8 and x mod 8 for bit extraction, and it was 10x slower than the C++ code. Then I replaced them with x lsr 3 and x land 7. Now it is only 4x slower than C++.
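For readers who want to see the index arithmetic in isolation, here is a minimal Python sketch of the same packing scheme. The names BitArray and sieve are made up for illustration and are not part of Res or Core; a bytearray stands in for the OCaml string buffer. It says nothing about OCaml performance and only shows the shift-and-mask addressing.

```python
class BitArray:
    """Fixed-size bit array packed 8 bits per byte (illustrative sketch)."""

    def __init__(self, length, value=False):
        self.length = length
        fill = 0xFF if value else 0x00
        # One spare byte, mirroring len / 8 + 1 in the OCaml prototype.
        self.buf = bytearray([fill]) * (length // 8 + 1)

    def get(self, i):
        # i >> 3 selects the byte, i & 7 selects the bit inside that byte.
        return (self.buf[i >> 3] >> (i & 7)) & 1 == 1

    def set(self, i, bit):
        mask = 1 << (i & 7)
        if bit:
            self.buf[i >> 3] |= mask
        else:
            self.buf[i >> 3] &= ~mask & 0xFF


def sieve(n):
    """Return all primes <= n, using the packed bit array as the flag store."""
    if n < 2:
        return []
    is_prime = BitArray(n + 1, True)
    i = 2
    while i * i <= n:
        if is_prime.get(i):
            for j in range(i * i, n + 1, i):
                is_prime.set(j, False)
        i += 1
    return [k for k in range(2, n + 1) if is_prime.get(k)]
```

Because 8 is a power of two, i >> 3 and i & 7 give the same results as i // 8 and i % 8 for non-negative i, which is the substitution described above.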
It's not often useful to compare micro-benchmarks like this, but the basic conclusion is probably correct. This is a case where OCaml is at a distinct disadvantage. C++ can access a more or less ideal representation (vector of machine integers). OCaml can make a vector, but can't get at the machine integers directly. So OCaml has to use div and mod where C++ can use shift and mask.
I reproduced this test (using a different bit vector library) and found that appreciable time in OCaml was spent constructing the result, which isn't a bit array. So the test might not be measuring exactly what you think.
Update
I tried some quick tests packing 32 booleans into a 63-bit int. It does seem to make things go faster, but only a little bit. It's not a perfect test, but it suggests gasche is right that the non-power-of-2 effect is minor.
Please make sure that you install Core including the .cmx file (.cmxa is not enough!), otherwise cross-module inlining will not work. Your profile suggests that certain calls may not have been inlined, which would explain a dramatic loss of efficiency.
Sadly, the Oasis packaging tool, which a lot of OCaml projects use, currently has a bug that prevents it from installing the .cmx file. The Core package is also affected by this problem, probably irrespective of which package manager (Opam, Godi) you use.
Related
I've been experimenting with various prime sieves in Julia with a view to finding the fastest. This is my simplest, if not my fastest, and it runs in around 5-6 ms on my 1.80 GHz processor for n = 1 million. However, when I add a simple 'if' statement to take care of the cases where n <= 1 or s (the start number) > n, the run-time increases by a factor of 15 to around 80-90 ms.
using BenchmarkTools
function get_primes_1(n::Int64, s::Int64=2)::Vector{Int64}
#=if n <= 1 || s > n
return []
end=#
sieve = fill(true, n)
for i = 3:2:isqrt(n) + 1
if sieve[i]
for j = i ^ 2:i:n
sieve[j]= false
end
end
end
pl = [i for i in s - s % 2 + 1:2:n if sieve[i]]
return s == 2 ? unshift!(pl, 2) : pl
end
@btime get_primes_1(1_000_000)
Output with the 'if' statement commented out, as above, is:
5.752 ms (25 allocations: 2.95 MiB)
Output with the 'if' statement included is:
86.496 ms (2121646 allocations: 35.55 MiB)
I'm probably embarrassingly ignorant or being terminally stupid, but if someone could point out what I'm doing wrong it would be very much appreciated.
The problem with this function is that the Julia compiler has trouble with type inference when closures appear in a function. In this case the closure is the comprehension, and the problem is that the if statement makes sieve only conditionally defined.
You can see this by moving sieve up:
function get_primes_1(n::Int64, s::Int64=2)::Vector{Int64}
sieve = fill(true, n)
if n <= 1 || s > n
return Int[]
end
for i = 3:2:isqrt(n) + 1
if sieve[i]
for j = i ^ 2:i:n
sieve[j]= false
end
end
end
pl = [i for i in s - s % 2 + 1:2:n if sieve[i]]
return s == 2 ? unshift!(pl, 2) : pl
end
However, this creates sieve even when n <= 1, which I guess you want to avoid :).
You can solve this problem by wrapping sieve in a let block like this:
function get_primes_1(n::Int64, s::Int64=2)::Vector{Int64}
if n <= 1 || s > n
return Int[]
end
sieve = fill(true, n)
for i = 3:2:isqrt(n) + 1
if sieve[i]
for j = i ^ 2:i:n
sieve[j]= false
end
end
end
let sieve = sieve
pl = [i for i in s - s % 2 + 1:2:n if sieve[i]]
return s == 2 ? unshift!(pl, 2) : pl
end
end
or avoiding an inner closure for example like this:
function get_primes_1(n::Int64, s::Int64=2)::Vector{Int64}
if n <= 1 || s > n
return Int[]
end
sieve = fill(true, n)
for i = 3:2:isqrt(n) + 1
if sieve[i]
for j = i ^ 2:i:n
sieve[j]= false
end
end
end
pl = Int[]
for i in s - s %2 +1:2:n
sieve[i] && push!(pl, i)
end
s == 2 ? unshift!(pl, 2) : pl
end
Now you might ask how you can detect such problems and make sure that a given solution fixes them. The answer is to use @code_warntype on the function. In your original function you will notice that sieve is a Core.Box, which is an indication of the problem.
See https://github.com/JuliaLang/julia/issues/15276 for details. In my perception this is the most important performance issue in Julia code that is easy to miss. Hopefully the compiler will be smarter about this in the future.
Edit: My suggestion actually doesn't seem to help. I missed your output annotation, so the return type appears to be correctly inferred after all. I am stumped, for the moment.
Original answer:
The problem isn't that there is an if statement, but that you introduce a type instability inside that if statement. You can read about type instabilities in the performance section of the Julia manual here.
An empty array defined like this: [], has a different type than a vector of integers:
> typeof([1,2,3])
Array{Int64,1}
> typeof([])
Array{Any,1}
The compiler cannot predict what the output type of the function will be, and therefore produces defensive, slow code.
Try to change
return []
to
return Int[]
I am converting a cpp prog (from another author) to a Fortran prog, my C is not too strong. I came across for-loop constructs starting with
for (int n = 1; 1; ++n) {
...
I would have expected this to convert to a Fortran Do as per
Do n=1, 1, 2
...
... at least that is my guess based on my understanding of what ++n will do.
Is my translation correct? If so, the loop will cycle at most once, so what am I missing ???
I understand that in some ways c for-loops have a "do-while" aspect, and hence wrinkles porting to Fortran Do's.
Anyway ... a clarification would be much appreciated.
EDITED: after some prompt responses, and I think I see where this is going
First, the exact C code copy/paste but "trimming" a little, is
for (int n = 1; 1; ++n) {
const double coef = exp(-a2*(n*n)) * expx2 / (a2*(n*n) + y*y);
prod2ax *= exp2ax;
prodm2ax *= expm2ax;
sum1 += coef;
sum2 += coef * prodm2ax;
sum4 += (coef * prodm2ax) * (a*n);
sum3 += coef * prod2ax;
sum5 += (coef * prod2ax) * (a*n);
// test convergence via sum5, since this sum has the slowest decay
if ((coef * prod2ax) * (a*n) < relerr * sum5) break;
}
So yes, there is a "break" in the loop, which on the Fortran side is replaced with an "Exit".
I think the key seems to be from the answers below that the original code's author created the
for (int n=1; 1 ; ++n )
precisely to create an infinite loop, and I had not guessed that this for construct would create an infinite loop.
Anyway, I can certainly create an infinite loop with an "Exit" in Fortran (though I expect I might "do" it a bit more judiciously)
Many thanks to all.
It seems Mr Gregory's response was the one that immediately led to a solution for me, so I will mark his as correct. As for the Fortran side, there are a number of alternatives such as:
Do While (.True.)
:
If( something ) Exit
End Do
but being old fashioned I would probably use a construct with a "limit" such as
Do i=1, MaxIter
:
If( something ) Exit
End Do
For slightly fancier applications I might include a return flag in case it did not converge in MaxIter's etc.
It's difficult to be definitive without seeing how the C++ program breaks out of that loop, but a straightforward Fortran equivalent would be
n = 1
do
! code, including an exit under some condition, presumably on the value of n
n = n+1
end do
If the loop is terminated when n reaches a critical value then the equivalent might be
do n = 1, critical_value ! no need to indicate step size if it is 1
! code
end do
Are you sure you wrote the C code correctly? Typically loops in C/C++ are done like this:
for (int n = 1; n < 10; ++n) {
// ...
}
Note the "n < 10" test condition. Your code's test condition is simply 1, which will always evaluate to Boolean "true". This means the code will loop infinitely, unless there's a break inside the loop, which you haven't shown.
++n means "increment n".
So if the code you've shown is indeed correct, the FORTRAN equivalent would be:
n = 1
do
[Body of the loop, which you haven't shown]
n = n + 1
enddo
Here's what
for (int n = 1; 1; ++n)
does:
It sets n to 1, then loops infinitely, incrementing n by 1 at the end of each loop iteration. The loop will never terminate unless something inside the loop breaks out.
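For comparison, the same control flow can be sketched in Python with an unbounded counter plus a break. The series and the tolerance relerr below are made up for illustration; they are not the sums from the question.

```python
import itertools


def sum_until_converged(relerr=1e-10):
    """Sum 1/2**n for n = 1, 2, ... until the latest term is negligible.

    Mirrors `for (int n = 1; 1; ++n) { ...; if (...) break; }`:
    n starts at 1, grows by 1 on every pass, and only the break ends the loop.
    """
    total = 0.0
    for n in itertools.count(1):   # never stops on its own
        term = 0.5 ** n
        total += term
        if term < relerr * total:  # convergence test, placed like the C break
            break
    return n, total
```

Calling sum_until_converged() returns the last value of n together with the accumulated sum, which is close to 1 for this particular series.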
It's been a long time since I wrote Fortran but as I recall the do loop you translated it to is not correct.
I don't think you can translate
for (int n = 1; 1; ++n)
to a FORTRAN DO loop. From what I recall, the notion of the generic conditional in C/C++ cannot be emulated in a FORTRAN DO loop.
The equivalent of
Do n=1, 1, 2
in C/C++ is
for ( int n = 1; n <= 1; n += 2 )
A few notes in addition to CareyGregory’s answer.
++n means ‘increment n by one (before n is evaluated)’
In C and C++, a for loop has three clauses, much like in FORTRAN:
for (init; condition; increment)
The difference is that each of the clauses must be a complete expression, whereas in FORTRAN the clauses are just values. It is just a ‘short’ way of writing an equivalent while loop:
int n = 1; │ for (int n = 1; 1; ++n) │ n = 1
while (1) │ { │ do
{ │ ... │ ...
... │ } │ n = n + 1
++n; │ │ enddo
} │ │
I am making a program for nth Fibonacci number. I made the following program using recursion and memoization.
The main problem is that the value of n can go up to 10000 which means that the Fibonacci number of 10000 would be more than 2000 digit long.
With a little bit of googling, I found that I could use arrays and store every digit of the solution in an element of the array, but I am still not able to figure out how to implement this approach in my program.
#include<iostream>
using namespace std;
long long int memo[101000];
long long int n;
long long int fib(long long int n)
{
if(n==1 || n==2)
return 1;
if(memo[n]!=0)
return memo[n];
return memo[n] = fib(n-1) + fib(n-2);
}
int main()
{
cin>>n;
long long int ans = fib(n);
cout<<ans;
}
How do I implement that approach or if there is another method that can be used to achieve such large values?
One thing that I think should be pointed out is there's other ways to implement fib that are much easier for something like C++ to compute
consider the following pseudo code
function fib (n) {
let a = 0, b = 1, _;
while (n > 0) {
_ = a;
a = b;
b = b + _;
n = n - 1;
}
return a;
}
This doesn't require memoisation and you don't have to be concerned about blowing up your stack with too many recursive calls. Recursion is a really powerful looping construct but it's one of those fubu things that's best left to langs like Lisp, Scheme, Kotlin, Lua (and a few others) that support it so elegantly.
That's not to say tail call elimination is impossible in C++, but unless you're doing something to optimise/compile for it explicitly, I'm doubtful that whatever compiler you're using would support it by default.
As for computing the exceptionally large numbers, you'll have to either get creative doing adding The Hard Way or rely upon an arbitrary precision arithmetic library like GMP. I'm sure there's other libs for this too.
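As an aside, since the question is about C++: in a language whose integers are arbitrary precision by default, the iterative loop needs nothing more. The Python sketch below is only a transcription of the pseudocode above (fib is an illustrative name), shown to make the size of the target number concrete.

```python
def fib(n):
    """Iterative Fibonacci; Python integers grow as large as needed."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


digits = str(fib(10000))
# Print the digit count plus the leading and trailing six digits.
print(len(digits), digits[:6], digits[-6:])
```

In C++ the same loop works once a and b are replaced by a big-integer type, whether from a library or hand-written.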
Adding The Hard Way™
Remember how you used to add big numbers when you were a little tater tot, fresh off the aluminum foil?
5-year-old math
1259601512351095520986368
+ 50695640938240596831104
---------------------------
?
Well you gotta add each column, right to left. And when a column overflows into the double digits, remember to carry that 1 over to the next column.
... <-001
1259601512351095520986368
+ 50695640938240596831104
---------------------------
... <-472
The 10,000th fibonacci number is thousands of digits long, so there's no way that's going to fit in any integer C++ provides out of the box. So without relying upon a library, you could use a string or an array of single-digit numbers. To output the final number, you'll have to convert it to a string tho.
(Wolfram Alpha: fibonacci 10000)
Doing it this way, you'll perform a couple million single-digit additions; it might take a while, but it should be a breeze for any modern computer to handle. Time to get to work !
Here's an example of a Bignum module in JavaScript
const Bignum =
{ fromInt: (n = 0) =>
n < 10
? [ n ]
: [ n % 10, ...Bignum.fromInt (n / 10 >> 0) ]
, fromString: (s = "0") =>
Array.from (s, Number) .reverse ()
, toString: (b) =>
[...b] .reverse () .join ("") // copy first: reverse () mutates its array
, add: (b1, b2) =>
{
const len = Math.max (b1.length, b2.length)
let answer = []
let carry = 0
for (let i = 0; i < len; i = i + 1) {
const x = b1[i] || 0
const y = b2[i] || 0
const sum = x + y + carry
answer.push (sum % 10)
carry = sum / 10 >> 0
}
if (carry > 0) answer.push (carry)
return answer
}
}
We can verify that the Wolfram Alpha answer above is correct
const { fromInt, toString, add } =
Bignum
const bigfib = (n = 0) =>
{
let a = fromInt (0)
let b = fromInt (1)
let _
while (n > 0) {
_ = a
a = b
b = add (b, _)
n = n - 1
}
return toString (a)
}
bigfib (10000)
// "336447 ... 366875"
Try not to use recursion for a simple problem like fibonacci. And if you'll only use it once, don't use an array to store all results. An array of 2 elements containing the 2 previous fibonacci numbers will be enough. In each step, you then only have to sum up those 2 numbers. How can you save 2 consecutive fibonacci numbers? Well, you know that when you have 2 consecutive integers one is even and one is odd. So you can use that property to know where to get/place a fibonacci number: for fib(i), if i is even (i%2 is 0) place it in the first element of the array (index 0), else (i%2 is then 1) place it in the second element(index 1). Why can you just place it there? Well when you're calculating fib(i), the value that is on the place fib(i) should go is fib(i-2) (because (i-2)%2 is the same as i%2). But you won't need fib(i-2) any more: fib(i+1) only needs fib(i-1)(that's still in the array) and fib(i)(that just got inserted in the array).
So you could replace the recursion calls with a for loop like this:
int fibonacci(int n){
if( n <= 0){
return 0;
}
int previous[] = {0, 1}; // start with fib(0) and fib(1)
for(int i = 2; i <= n; ++i){
// modulo can be implemented with bit operations(much faster): i % 2 = i & 1
previous[i&1] += previous[(i-1)&1]; //shorter way to say: previous[i&1] = previous[i&1] + previous[(i-1)&1]
}
//Result is in previous[n&1]
return previous[n&1];
}
Recursion is often discouraged because of the time (function calls) and resources (stack) it consumes. So whenever you use recursion, try to replace it with a loop, plus a stack with simple push/pop operations if needed to save the "current position" (in C++ one can use a vector). In the case of Fibonacci the stack isn't even needed, but if you are iterating over a tree data structure, for example, you'll need a stack (depending on the implementation). While I was working on my solution, I saw that @naomik provided a solution with a while loop. That one is fine too, but I prefer the array with the modulo operation (a bit shorter).
Now concerning the problem of the size long long int has, it can be solved by using external libraries that implement operations for big numbers (like the GMP library or Boost.multiprecision). But you could also create your own version of a BigInteger-like class from Java and implement the basic operations like the one I have. I've only implemented the addition in my example (try to implement the others they are quite similar).
The main idea is simple, a BigInt represents a big decimal number by cutting its little endian representation into pieces (I'll explain why little endian at the end). The length of those pieces depends on the base you choose. If you want to work with decimal representations, it will only work if your base is a power of 10: if you choose 10 as base each piece will represent one digit, if you choose 100 (= 10^2) as base each piece will represent two consecutive digits starting from the end(see little endian), if you choose 1000 as base (10^3) each piece will represent three consecutive digits, ... and so on. Let's say that you have base 100, 12765 will then be [65, 27, 1], 1789 will be [89, 17], 505 will be [5, 5] (= [05,5]), ... with base 1000: 12765 would be [765, 12], 1789 would be [789, 1], 505 would be [505]. It's not the most efficient, but it is the most intuitive (I think ...)
The addition is then a bit like the addition on paper we learned at school:
1. begin with the lowest piece of the BigInt
2. add it with the corresponding piece of the other one
3. the lowest piece of that sum(= the sum modulus the base) becomes the corresponding piece of the final result
4. the "bigger" pieces of that sum will be added ("carried") to the sum of the following pieces
5. go to step 2 with next piece
6. if no piece left, add the carry and the remaining bigger pieces of the other BigInt (if it has pieces left)
For example:
9542 + 1097855 = [42, 95] + [55, 78, 09, 1]
lowest piece = 42 and 55 --> 42 + 55 = 97 = [97]
---> lowest piece of result = 97 (no carry, carry = 0)
2nd piece = 95 and 78 --> (95+78) + 0 = 173 = [73, 1]
---> 2nd piece of final result = 73
---> remaining: [1] = 1 = carry (will be added to sum of following pieces)
no piece left in first `BigInt`!
--> add carry ( [1] ) and remaining pieces from second `BigInt`( [9, 1] ) to final result
--> first additional piece: 9 + 1 = 10 = [10] (no carry)
--> second additional piece: 1 + 0 = 1 = [1] (no carry)
==> 9542 + 1 097 855 = [42, 95] + [55, 78, 09, 1] = [97, 73, 10, 1] = 1 107 397
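The worked example above can be sketched in a few lines of Python. This is not the class from the answer (which is not reproduced here); the names BASE, to_pieces, add_pieces and to_string are hypothetical, and base 100 is chosen only to match the example.

```python
BASE = 100  # each piece holds two decimal digits


def to_pieces(value):
    """Split a non-negative int into little-endian base-100 pieces."""
    pieces = [value % BASE]
    value //= BASE
    while value > 0:
        pieces.append(value % BASE)
        value //= BASE
    return pieces


def add_pieces(a, b):
    """Add two little-endian piece lists, carrying between pieces."""
    result = []
    carry = 0
    for i in range(max(len(a), len(b))):
        total = carry
        if i < len(a):
            total += a[i]
        if i < len(b):
            total += b[i]
        result.append(total % BASE)  # lowest piece of this partial sum
        carry = total // BASE        # the rest is carried to the next piece
    if carry:
        result.append(carry)
    return result


def to_string(pieces):
    """Render pieces as decimal text, zero-padding all but the top piece."""
    head = str(pieces[-1])
    tail = "".join(f"{p:02d}" for p in reversed(pieces[:-1]))
    return head + tail
```

With these helpers, add_pieces(to_pieces(9542), to_pieces(1097855)) reproduces the pieces [97, 73, 10, 1] from the example, and to_string turns them back into 1107397. The zero-padding in to_string matters: a piece such as 9 must be printed as 09 unless it is the leading one.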
Here is a demo where I used the class above to calculate the fibonacci of 10000 (result is too big to copy here)
Good luck!
PS: Why little endian? For the ease of the implementation: it allows to use push_back when adding digits and iteration while implementing the operations will start from the first piece instead of the last piece in the array.
I was trying to write a solution for Problem 12 (Project Euler) in Python. The solution was just too slow, so I tried checking up other people's solution on the internet. I found this code written in C++ which does virtually the same exact thing as my python code, with just a few insignificant differences.
Python:
def find_number_of_divisiors(n):
    if n == 1:
        return 1
    div = 2 # 1 and the number itself
    for i in range(2, n/2 + 1):
        if (n % i) == 0:
            div += 1
    return div

def tri_nums():
    n = 1
    t = 1
    while 1:
        yield t
        n += 1
        t += n

t = tri_nums()
m = 0
for n in t:
    d = find_number_of_divisiors(n)
    if m < d:
        print n, ' has ', d, ' divisors.'
        m = d
    if m == 320:
        exit(0)
C++:
#include <iostream>
int main(int argc, char *argv[])
{
unsigned int iteration = 1;
unsigned int triangle_number = 0;
unsigned int divisor_count = 0;
unsigned int current_max_divisor_count = 0;
while (true) {
triangle_number += iteration;
divisor_count = 0;
for (int x = 2; x <= triangle_number / 2; x ++) {
if (triangle_number % x == 0) {
divisor_count++;
}
}
if (divisor_count > current_max_divisor_count) {
current_max_divisor_count = divisor_count;
std::cout << triangle_number << " has " << divisor_count
<< " divisors." << std::endl;
}
if (divisor_count == 318) {
exit(0);
}
iteration++;
}
return 0;
}
The Python code takes 1 minute and 25.83 seconds to execute on my machine, while the C++ code takes around 4.628 seconds. It's about 18x faster. I had expected the C++ code to be faster, but not by this great a margin, and for a simple solution that consists of just 2 loops and a bunch of increments and mods.
Although I would appreciate answers on how to solve this problem, the main question I want to ask is Why is C++ code so much faster? Am I using/doing something wrongly in python?
Replacing range with xrange:
After replacing range with xrange the python code takes around 1 minute 11.48 seconds to execute. (Around 1.2x faster)
This is exactly the kind of code where C++ is going to shine compared to Python: a single fairly tight loop doing arithmetic ops. (I'm going to ignore algorithmic speedups here, because your C++ code uses the same algorithm, and it seems you're explicitly not asking for that...)
C++ compiles this kind of code down to relatively few processor instructions (and everything it does probably fits in the super-fast levels of CPU cache), while Python goes through many levels of indirection for each operation. For example, every time you increase a number, it checks whether the number just overflowed and needs to be moved into a bigger data type.
That said, all is not necessarily lost! This is also the kind of code that a just-in-time compiler system like PyPy will do well at, since once it's gone through the loop a few times it compiles the code to something similar to what the C++ code starts at. On my laptop:
$ time python2.7 euler.py >/dev/null
python euler.py 72.23s user 0.10s system 97% cpu 1:13.86 total
$ time pypy euler.py >/dev/null
pypy euler.py > /dev/null 13.21s user 0.03s system 99% cpu 13.251 total
$ clang++ -o euler euler.cpp && time ./euler >/dev/null
./euler > /dev/null 2.71s user 0.00s system 99% cpu 2.717 total
using the version of the Python code with xrange instead of range. Optimization levels don't make a difference for me with the C++ code, and neither does using GCC instead of Clang.
While we're at it, this is also a case where Cython can do very well, which compiles almost-Python code to C code that uses the Python APIs, but uses raw C when possible. If we change your code just a little bit by adding some type declarations, and removing the iterator since I don't know how to handle those efficiently in Cython, getting
cdef int find_number_of_divisiors(int n):
    cdef int i, div
    if n == 1:
        return 1
    div = 2 # 1 and the number itself
    for i in xrange(2, n/2 + 1):
        if (n % i) == 0:
            div += 1
    return div

cdef int m, n, t, d
m = 0
n = 1
t = 1
while True:
    n += 1
    t += n
    d = find_number_of_divisiors(t)
    if m < d:
        print t, ' has ', d, ' divisors.'  # report the triangle number t, not its index n
        m = d
    if m == 320:
        exit(0)
then on my laptop I get
$ time python -c 'import euler_cy' >/dev/null
python -c 'import euler_cy' > /dev/null 4.82s user 0.02s system 98% cpu 4.941 total
(within a factor of 2 of the C++ code).
Rewriting the divisor counting algorithm to use the divisor function reduces the run time to less than 1 second. It is still possible to make it faster, but not really necessary.
This goes to show that before you try any optimization tricks with language features and compilers, you should check whether your algorithm is the bottleneck. The compiler/interpreter tricks are indeed quite powerful, as shown in Dougal's answer, where the gap between Python and C++ is largely closed for equivalent code. However, as you can see, the change in algorithm immediately gives a huge performance boost and lowers the run time to around the level of the algorithmically inefficient C++ code (I didn't test the C++ version, but on my 6-year-old computer the code below finishes running in ~0.6s).
The code below is written and tested with Python 3.2.3.
import math

def find_number_of_divisiors(n):
    if n == 1:
        return 1
    num = 1
    count = 1
    div = 2
    while (n % div == 0):
        n //= div
        count += 1
    num *= count
    div = 3
    while (div <= pow(n, 0.5)):
        count = 1
        while n % div == 0:
            n //= div
            count += 1
        num *= count
        div += 2
    if n > 1:
        num *= 2
    return num
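The divisor function used here rests on one fact: if n factors into primes as p1^a1 * p2^a2 * ... * pk^ak, then n has (a1 + 1)(a2 + 1)...(ak + 1) divisors, because each divisor picks an exponent between 0 and ai for every prime. For example, 28 = 2^2 * 7 has (2 + 1)(1 + 1) = 6 divisors: 1, 2, 4, 7, 14, 28. Below is a small Python 3 sketch that checks the formula against brute force; count_divisors and count_divisors_naive are illustrative names and a simplified variant, not the answer's code.

```python
def count_divisors(n):
    """Number of divisors of n, computed from its prime factorization."""
    count = 1
    p = 2
    while p * p <= n:
        exponent = 0
        while n % p == 0:
            n //= p
            exponent += 1
        count *= exponent + 1  # a prime with exponent e offers e + 1 choices
        p += 1
    if n > 1:                  # one prime factor above sqrt(n) is left over
        count *= 2
    return count


def count_divisors_naive(n):
    """Reference implementation: try every candidate divisor."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)
```

The factorization loop only runs up to the square root of what remains of n, which is why it is so much cheaper than testing every candidate up to n/2.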
Here's my own variant built on nhahtdh's factor-counting optimization plus my own prime factorization code:
def prime_factors(x):
    def factor_this(x, factor):
        factors = []
        while x % factor == 0:
            x /= factor
            factors.append(factor)
        return x, factors
    x, factors = factor_this(x, 2)
    x, f = factor_this(x, 3)
    factors += f
    i = 5
    while i * i <= x:
        for j in (2, 4):
            x, f = factor_this(x, i)
            factors += f
            i += j
    if x > 1:
        factors.append(x)
    return factors

def product(series):
    from operator import mul
    return reduce(mul, series, 1)

def factor_count(n):
    from collections import Counter
    c = Counter(prime_factors(n))
    return product([cc + 1 for cc in c.values()])

def tri_nums():
    n, t = 1, 1
    while 1:
        yield t
        n += 1
        t += n

if __name__ == '__main__':
    m = 0
    for n in tri_nums():
        d = factor_count(n)
        if m < d:
            print n, ' has ', d, ' divisors.'
            m = d
        if m == 320:
            break
To check my C++ code, I would like to be able to let Boost::Random and Matlab produce the same random numbers.
So for Boost I use the code:
boost::mt19937 var(static_cast<unsigned> (std::time(0)));
boost::uniform_int<> dist(1, 6);
boost::variate_generator<boost::mt19937&, boost::uniform_int<> > die(var, dist);
die.engine().seed(0);
for(int i = 0; i < 10; ++i) {
std::cout << die() << " ";
}
std::cout << std::endl;
Which produces (every run of the program):
4 4 5 6 4 6 4 6 3 4
And for matlab I use:
RandStream.setDefaultStream(RandStream('mt19937ar','seed',0));
randi(6,1,10)
Which produces (every run of the program):
5 6 1 6 4 1 2 4 6 6
Which is bizarre, since both use the same algorithm and the same seed.
What am I missing?
It seems that Python (using numpy) and Matlab are comparable in the uniform random numbers they produce:
Matlab
RandStream.setDefaultStream(RandStream('mt19937ar','seed',203));rand(1,10)
0.8479 0.1889 0.4506 0.6253 0.9697 0.2078 0.5944 0.9115 0.2457 0.7743
Python:
random.seed(203);random.random(10)
array([ 0.84790006, 0.18893843, 0.45060688, 0.62534723, 0.96974765,
0.20780668, 0.59444858, 0.91145688, 0.24568615, 0.77430378])
C++Boost
0.8479 0.667228 0.188938 0.715892 0.450607 0.0790326 0.625347 0.972369 0.969748 0.858771
Every other value in the Boost output is identical to a Python and Matlab value...
I have to agree with the other answers, stating that these generators are not "absolute". They may produce different results according to the implementation. I think the simplest solution would be to implement your own generator. It might look daunting (Mersenne twister sure is by the way) but take a look at Xorshift, an extremely simple though powerful one. I copy the C implementation given in the Wikipedia link :
uint32_t xor128(void) {
static uint32_t x = 123456789;
static uint32_t y = 362436069;
static uint32_t z = 521288629;
static uint32_t w = 88675123;
uint32_t t;
t = x ^ (x << 11);
x = y; y = z; z = w;
return w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}
To have the same seed, just put any values you want into x, y, z, w (except (0,0,0,0), I believe). You just need to be sure that Matlab and C++ both use 32 bits for these unsigned ints.
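As a cross-check in a third language, here is a small Python port of the same xor128 step. It is a sketch only: the state is passed in and returned explicitly instead of living in static variables, and MASK32 is an illustrative name. Python integers are unbounded, so the left shift has to be masked to 32 bits by hand, which is exactly the detail that must agree between implementations.

```python
MASK32 = 0xFFFFFFFF


def xor128(state):
    """One xorshift128 step; state is a list [x, y, z, w] of 32-bit ints."""
    x, y, z, w = state
    t = (x ^ (x << 11)) & MASK32              # keep the left shift within 32 bits
    w_new = (w ^ (w >> 19)) ^ (t ^ (t >> 8))  # right shifts cannot overflow
    return w_new, [y, z, w, w_new]


state = [123456789, 362436069, 521288629, 88675123]
value, state = xor128(state)
```

With the seed values from the C code, the first value produced is 3701687786, which gives a quick way to check that another port uses the same word size.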
Using the interface like
randi(6,1,10)
will apply some kind of transformation on the raw result of the random generator. This transformation is not trivial in general and Matlab will almost certainly do a different selection step than Boost.
Try comparing raw data streams from the RNGs - chances are they are the same
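To make the "raw stream versus transformed output" distinction concrete, here is a sketch using CPython's standard random module, which is also based on MT19937. It relies on an assumption about CPython's implementation rather than anything stated in this thread: random() builds each double from two consecutive 32-bit outputs, the same recipe as genrand_res53 in mt19937ar.c. The helper name res53 is made up. The numbers themselves need not match Matlab's, since the standard library may seed the generator differently.

```python
import random


def res53(rng):
    """Build one double from two 32-bit outputs, as genrand_res53 does."""
    a = rng.getrandbits(32) >> 5   # top 27 bits of the first raw output
    b = rng.getrandbits(32) >> 6   # top 26 bits of the second raw output
    return (a * 67108864 + b) / 9007199254740992  # (a * 2**26 + b) / 2**53


rng_float = random.Random(203)
rng_raw = random.Random(203)

floats = [rng_float.random() for _ in range(5)]
rebuilt = [res53(rng_raw) for _ in range(5)]
print(floats == rebuilt)
```

On CPython the two lists should agree. A library that instead turns each single 32-bit output into one double walks through the raw stream half as fast per value, which is one plausible way for two MT19937-based generators to agree only on every other number.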
In case this helps anyone interested in the question:
In order to get the same behavior for the Twister algorithm:
Download the file
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.c
Try the following:
#include <stdint.h>
#include <stdio.h>
// mt19937ar.c content..
int main(void)
{
int i;
uint32_t seed = 100;
init_genrand(seed);
for (i = 0; i < 100; ++i)
printf("%.20f\n",genrand_res53());
return 0;
}
Make sure the same values are generated within matlab:
RandStream.setGlobalStream( RandStream.create('mt19937ar','seed',100) );
rand(100,1)
randi() seems to be simply ceil( rand()*maxval )
Thanks to Fezvez's answer I've written xor128 in matlab:
function [ w, state ] = xor128( state )
%XOR128 implementation of Xorshift
% https://en.wikipedia.org/wiki/Xorshift
% A starting state might be [123456789, 362436069, 521288629, 88675123]
x = state(1);
y = state(2);
z = state(3);
w = state(4);
% t1 = (x << 11)
t1 = bitand(bitshift(x,11),hex2dec('ffffffff'));
% t = x ^ (x << 11)
t = bitxor(x,t1);
x = y;
y = z;
z = w;
% t2 = (t ^ (t >> 8))
t2 = bitxor(t, bitshift(t,-8));
% t3 = w ^ (w >> 19)
t3 = bitxor(w, bitshift(w,-19));
% w = w ^ (w >> 19) ^ (t ^ (t >> 8))
w = bitxor(t3, t2);
state = [x y z w];
end
You need to pass state in to xor128 every time you use it. I've written a "tester" function which simply returns a vector with random numbers. I tested 1000 numbers output by this function against values output by cpp with gcc and it is perfect.
function [ v ] = txor( iterations )
%TXOR test xor128, returns vector v of length iterations with random number
% output from xor128
% output
v = zeros(iterations,1);
state = [123456789, 362436069, 521288629, 88675123];
i = 1;
while i <= iterations
disp(i);
[t,state] = xor128(state);
v(i) = t;
i = i + 1;
end
I would be very careful about assuming that two different implementations of a pseudo-random generator (even when based on the same algorithm) produce the same result. It could be that one of the implementations uses some sort of tweak and hence produces different results. If you need two equal "random" sequences, I suggest you either precalculate a sequence, store it, and access it from both C++ and Matlab, or create your own generator. It should be fairly easy to implement MT19937 if you use the pseudocode on Wikipedia.
Take care to ensure that both your Matlab and C++ code run on the same architecture (that is, both run on either 32-bit or 64-bit): using a 64-bit integer in one implementation and a 32-bit integer in the other will lead to different results.