I'm learning Haskell. My interest is to use it for personal computer experimentation. Right now, I'm trying to see how fast Haskell can get. Many claim parity with C(++), and if that is true, I would be very happy (I should note that I will be using Haskell whether or not it's fast, but fast is still a good thing).
My test program implements π(x) with a very simple algorithm: Primes numbers add 1 to the result. Prime numbers have no integer divisors between 1 and √x. This is not an algorithm battle, this is purely for compiler performance.
Haskell seems to be about 6x slower on my computer, which is fine (still 100x faster than pure Python), but that could be just because I'm a Haskell newbie.
Now, my question: How, without changing the algorithm, can I optimize the Haskell implementation? Is Haskell really on performance parity with C?
Here is my Haskell code:
import System.Environment
-- a simple integer square root
isqrt :: Int -> Int
isqrt = floor . sqrt . fromIntegral
-- primality test
prime :: Int -> Bool
prime x = null [x | q <- [3, 5..isqrt x], rem x q == 0]
main = do
n <- fmap (read . head) getArgs
print $ length $ filter prime (2:[3, 5..n])
Here is my C++ code:
#include <iostream>
#include <cmath>
#include <cstdlib>
using namespace std;
bool isPrime(int);
int main(int argc, char* argv[]) {
int primes = 10000, count = 0;
if (argc > 1) {
primes = atoi(argv[1]);
}
if (isPrime(2)) {
count++;
}
for (int i = 3; i <= primes; i+=2) {
if (isPrime(i)){
count++;
}
}
cout << count << endl;
return 0;
}
bool isPrime(int x){
for (int i = 2; i <= floor(sqrt(x)); i++) {
if (x % i == 0) {
return false;
}
}
return true;
}
Your Haskell version is constructing a lazy list in prime only to test if it is null. This seems to indeed be a bottle neck. The following version runs just as fast as the C++ version on my machine:
prime :: Int -> Bool
prime x = go 3
where
go q | q <= isqrt x = if rem x q == 0 then False else go (q+2)
go _ = True
3.31s when compiled with -O2 vs. 3.18s for C++ with gcc 4.8 and -O3 for n=5000000.
Of course, 'guessing' where the program is slow to optimize it is not a very good approach. Fortunately, Haskell has good profiling tools on board.
Compiling and running with
$ ghc --make primes.hs -O2 -prof -auto-all -fforce-recomp && ./primes 5000000 +RTS -p
gives
# primes.prof
Thu Feb 20 00:49 2014 Time and Allocation Profiling Report (Final)
primes +RTS -p -RTS 5000000
total time = 5.71 secs (5710 ticks # 1000 us, 1 processor)
total alloc = 259,580,976 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
prime.go Main 96.4 0.0
main Main 2.0 84.6
isqrt Main 0.9 15.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 45 0 0.0 0.0 100.0 100.0
main Main 91 0 2.0 84.6 100.0 100.0
prime Main 92 2500000 0.7 0.0 98.0 15.4
prime.go Main 93 326103491 96.4 0.0 97.3 15.4
isqrt Main 95 0 0.9 15.4 0.9 15.4
--- >8 ---
which clearly shows that prime is where things get hot. For more information on profiling, I'll refer you to Real World Haskell, Chap 25.
To really understand what is going on, you can look at (one of) GHC's intermediate languages Core, which will show you how the code looks like after optimization. Some good info is at the Haskell wiki. I would not recommend to do that unless necessary, but it is good to know that the possibility exists.
To your other questions:
1) How, without changing the algorithm, can I optimize the Haskell implementation?
Profile, and try to write inner loops so that they don't do any memory allocations and can be made strict by the compiler. Doing so can take some practice and experience.
2) Is Haskell really on performance parity with C?
That depends. GHC is amazing and can often optimize your program very well. If you know what you're doing you can usually get close to the performance of optimized C (100% - 200% of C's speed). That said, these optimizations are not always easy or pretty to the eye and high level Haskell can be slower. But don't forget that you're gaining amazing expressiveness and high level abstractions when using Haskell. It will usually be fast enough for all but the most performance critical applications and even then you can often get pretty close to C with some profiling and performance optimizations.
I dont think that the Haskell version (original and improved by first answer) are equivalent with the C++ version.
The reason is this:
Both only consider every second element (in the prime function), while the C++ version scans every element (only i++ in the isPrime() function.
When i fix this (change i++ to i+=2 in the isPrime() function for C++) i get down to almost 1/3 of the runtime of the optimized Haskell version (2.1s C++ vs 6s Haskell).
The output remains the same for both (of course).
Note that this is no specific opimization of the C++ version ,just an adaptation of the trick already applied in the Haskell version.
Related
#include<iostream>
using namespace std;
double log(double x,int n)
{
static double p = x ;
double s;
if(n==1)
return x;
else
{
s=log(x,n-1);
p*=x;
if(n%2==0)
return s - (p/n);
else
return s + (p/n);
}
}
int main()
{
double r = log(1,15);
cout << r;
return 0;
}
I tried writing the above function for evaluating the log(1+x) function using its taylor series with recursion. But it didn't gave the result as I expected.
Eg :
ln(2) = 0.693 whereas my code gave 0.725. In the above code, n represents the number of terms.
Also I am new to this platform, so can I say that the above question is complete or does it need some additional information for further explanation?
There is nothing wrong with that piece of code: this has obviously got to do with the rate of convergence of the Taylor series.
If you take n = 200 instead of n = 15 in your code, the approximation error will be low enough that the first two decimals of the exact solution ln(2) = 0.693147... will be the correct ones.
The more you increase the n parameter, the better approximation you will get of ln(2).
Your program does converge to the right number, just very slowly...
log(1,15) returns 0.725, as you noticed, log(1,50) is 0.683, log(1,100) is 0.688, and log(1,200) is 0.691. That's getting close to the number you expected, but still a long way to go...
So there is no C++ or recursion bug in your code - you just need to find a better Taylor series to calculate log(X). Don't look for a Taylor series for log(1+x) - these will typically assume x is small, and converge quickly for small x, not for x=1.
This question already has answers here:
C++11 vector<bool> performance issue (with code example)
(3 answers)
Poor performance of vector<bool> in 64-bit target with VS2012
(3 answers)
Closed 3 years ago.
I would like to ask a question that can be summarized in one sentence:
What could be explanation for the performance difference between using vector<bool> (slow) and using vector<uint8_t> (fast)? (compiled with g++ 8.2 -O1 with C++17 standard)
Here is the full story.
It starts with LeetCode's Count Primes problem. Count the number of primes less than n.
One solution is to apply the Sieve of Eratosthenes, whose C++ code I will list below.
However, I interestedly noticed that with only one difference -- the vector type, there is a consistent gap in performance. Namely,
| type | runtime | memory |
|-----------------|---------|---------|
| vector<bool> | 68 ms | 8.6 MB |
| vector<uint8_t> | 24 ms | 11.5 MB |
The full code is
class Solution
{
public:
int countPrimes(int n)
{
// vector<uint8_t> isprime(n, true);
vector<bool> isprime(n, true);
int res = 0, root = std::sqrt(n);
for (int i = 2; i < n; ++i) {
if (isprime[i]) {
++res;
if (i <= root)
for (int k = i * i; k < n; k += i)
isprime[k] = false;
}
}
return res;
}
};
For your reference, LeetCode has the following note for C++ compilation
Compiled with g++ 8.2 using the latest C++ 17 standard.
Your code is compiled with level one optimization (-O1). AddressSanitizer is also enabled to help detect out-of-bounds and use-after-free bugs.
Most standard library headers are already included automatically for your convenience.
My question is: what could be the explanation for the observation that using vector<bool> is much slower (68 ms, 60% percentile) than using vector<uint8_t> (24 ms, 95% percentile)?
I repeated the experiment for more than 10 times. The difference in memory usage is also consistent. If you have any explanation for that (vector<uint8_t> uses more memory than vector<bool>), I also would like to hear.
Thank you very much!
I've been experimenting with various prime sieves in Julia with a view to finding the fastest. This is my simplest, if not my fastest, and it runs in around 5-6 ms on my 1.80 GHz processor for n = 1 million. However, when I add a simple 'if' statement to take care of the cases where n <= 1 or s (the start number) > n, the run-time increases by a factor of 15 to around 80-90 ms.
using BenchmarkTools
function get_primes_1(n::Int64, s::Int64=2)::Vector{Int64}
#=if n <= 1 || s > n
return []
end=#
sieve = fill(true, n)
for i = 3:2:isqrt(n) + 1
if sieve[i]
for j = i ^ 2:i:n
sieve[j]= false
end
end
end
pl = [i for i in s - s % 2 + 1:2:n if sieve[i]]
return s == 2 ? unshift!(pl, 2) : pl
end
#btime get_primes_1(1_000_000)
Output with the 'if' statement commented out, as above, is:
5.752 ms (25 allocations: 2.95 MiB)
Output with the 'if' statement included is:
86.496 ms (2121646 allocations: 35.55 MiB)
I'm probably embarrassingly ignorant or being terminally stupid, but if someone could point out what I'm doing wrong it would be very much appreciated.
The problem of this function is with Julia compiler having problems with type inference when closures appear in your function. In this case the closure is a comprehension and the problem is that if statement makes sieve to be only conditionally defined.
You can see this by moving sieve up:
function get_primes_1(n::Int64, s::Int64=2)::Vector{Int64}
sieve = fill(true, n)
if n <= 1 || s > n
return Int[]
end
for i = 3:2:isqrt(n) + 1
if sieve[i]
for j = i ^ 2:i:n
sieve[j]= false
end
end
end
pl = [i for i in s - s % 2 + 1:2:n if sieve[i]]
return s == 2 ? unshift!(pl, 2) : pl
end
However, this makes sieve to be created also when n<1 which you want to avoid I guess :).
You can solve this problem by wrapping sieve in let block like this:
function get_primes_1(n::Int64, s::Int64=2)::Vector{Int64}
if n <= 1 || s > n
return Int[]
end
sieve = fill(true, n)
for i = 3:2:isqrt(n) + 1
if sieve[i]
for j = i ^ 2:i:n
sieve[j]= false
end
end
end
let sieve = sieve
pl = [i for i in s - s % 2 + 1:2:n if sieve[i]]
return s == 2 ? unshift!(pl, 2) : pl
end
end
or avoiding an inner closure for example like this:
function get_primes_1(n::Int64, s::Int64=2)::Vector{Int64}
if n <= 1 || s > n
return Int[]
end
sieve = fill(true, n)
for i = 3:2:isqrt(n) + 1
if sieve[i]
for j = i ^ 2:i:n
sieve[j]= false
end
end
end
pl = Int[]
for i in s - s %2 +1:2:n
sieve[i] && push!(pl, i)
end
s == 2 ? unshift!(pl, 2) : pl
end
Now you might ask how can you detect such problems and make sure that some solution solves them? The answer is to use #code_warntype on a function. In your original function you will notice that sieve is Core.Box which is an indication of the problem.
See https://github.com/JuliaLang/julia/issues/15276 for details. In general this is in my perception the most important issue with performance of Julia code which is easy to miss. Hopefully in the future the compiler will be smarter with this.
Edit: My suggestion actually doesn't seem to help. I missed your output annotation, so the return type appears to be correctly inferred after all. I am stumped, for the moment.
Original answer:
The problem isn't that there is an if statement, but that you introduce a type instability inside that if statement. You can read about type instabilities in the performance section of the Julia manual here.
An empty array defined like this: [], has a different type than a vector of integers:
> typeof([1,2,3])
Array{Int64,1}
> typeof([])
Array{Any,1}
The compiler cannot predict what the output type of the function will be, and therefore produces defensive, slow code.
Try to change
return []
to
return Int[]
In C++, I should write a program where the app detects which numbers are divisible by 3 from 1 till 10 and then multiply all of them and print the result. That means that I should multiply 3,6,9 and print only the result, which is 162, but I should do it by using a "While" loop, not just multiplying the 3 numbers with each other. How should I write the code of this? I attached my attempt to code the problem below. Thanks
#include <iostream>
using namespace std;
int main() {
int x, r;
int l;
x = 1;
r = 0;
while (x < 10 && x%3==0) {
r = (3 * x) + 3;
cout << r;
}
cin >> l;
}
Firstly your checking the condition x%3 == 0 brings you out of your while - loop right in the first iteration where x is 1. You need to check the condition inside the loop.
Since you wish to store your answer in variable r you must initialize it to 1 since the product of anything with 0 would give you 0.
Another important thing is you need to increment the value of x at each iteration i.e. to check if each number in the range of 1 to 10 is divisible by 3 or not .
int main()
{
int x, r;
int l;
x = 1;
r = 1;
while (x < 10)
{
if(x%3 == 0)
r = r*x ;
x = x + 1; //incrementing the value of x
}
cout<<r;
}
Lastly I have no idea why you have written the last cin>>l statement . Omit it if not required.
Ok so here are a few hints that hopefully help you solving this:
Your approach with two variables (x and r) outside the loop is a good starting point for this.
Like I wrote in the comments you should use *= instead of your formula (I still don't understand how it is related to the problem)
Don't check if x is dividable by 3 inside the while-check because it would lead to an too early breaking of the loop
You can delete your l variable because it has no affect at the moment ;)
Your output should also happen outside the loop, else it is done everytime the loop runs (in your case this would be 10 times)
I hope I can help ;)
EDIT: Forget about No.4. I didn't saw your comment about the non-closing console.
int main()
{
int result = 1; // "result" is better than "r"
for (int x=1; x < 10; ++x)
{
if (x%3 == 0)
result = result * x;
}
cout << result;
}
or the loop in short with some additional knowledge:
for (int x=3; x < 10; x += 3) // i know that 3 is dividable
result *= x;
or, as it is c++, and for learning purposes, you could do:
vector<int> values; // a container holding integers that will get the multiples of 3
for (int x=1; x < 10; ++x) // as usual
if ( ! x%3 ) // same as x%3 == 0
values.push_back(x); // put the newly found number in the container
// now use a function that multiplies all numbers of the container (1 is start value)
result = std::accumulate(values.begin(), values.end(), 1, multiplies<int>());
// so much fun, also get the sum (0 is the start value, no function needed as add is standard)
int sum = std::accumulate(values.begin(), values.end(), 0);
It's important to remember the difference between = and ==. = sets something to a value while == compares something to a value. You're on the right track with incrementing x and using x as a condition to check your range of numbers. When writing code I usually try and write a "pseudocode" in English to organize my steps and get my logic down. It's also wise to consider using variables that tell you what they are as opposed to just random letters. Imagine if you were coding a game and you just had letters as variables; it would be impossible to remember what is what. When you are first learning to code this really helps a lot. So with that in mind:
/*
- While x is less than 10
- check value to see if it's mod 3
- if it's mod 3 add it to a sum
- if not's mod 3 bump a counter
- After my condition is met
- print to screen pause screen
*/
Now if we flesh out that pseudocode a little more we'll get a skeletal structure.
int main()
{
int x=1//value we'll use as a counter
int sum=0//value we'll use as a sum to print out at the end
while(x<10)//condition we'll check against
{
if (x mod 3 is zero)
{
sum=x*1;
increment x
}
else
{
increment x
}
}
//screen output the sum the sum
//system pause or cin.get() use whatever your teacher gave you.
I've given you a lot to work with here you should be able to figure out what you need from this. Computer Science and programming is hard and will require a lot of work. It's important to develop good coding habits and form now as it will help you in the future. Coding is a skill like welding; the more you do it the better you'll get. I often refer to it as the "Blue Collar Science" because it's really a skillset and not just raw knowledge. It's not like studying history or Biology (minus Biology labs) because those require you to learn things and loosely apply them whereas programming requires you to actually build something. It's like welding or plumbing in my opinion.
Additionally when you come to sites like these try and read up how things should be posted and try and seek the "logic" behind the answer and come up with it on your own as opposed to asking for the answer. People will be more inclined to help you if they think you're working for something instead of asking for a handout (not saying you are, just some advice). Additionally take the attitude these guys give you with a grain of salt, Computer Scientists aren't known to be the worlds most personable people. =) Good luck.
I've recently been writing some code for a research project that I'm working on, where efficiency is very important. I've been considering scraping some of the regular methods I do things in and using bitwise XORs instead. What I'm wondering is if this will make if a difference (if I'm performing this operation say several million times) or if it's the same after I use 03 in g++.
The two examples that come to mind:
I had an instance where (I'm working with purely positive ints) I needed to change n to n-1 if n was odd or n to (n+1) if n was even. I figured I had a few options:
if(n%2) // or (n%2==0) and flip the order
n=n-1
else
n=n+1
or
n=n+2*n%2-1; //This of course was silly, but was the only non-bitwise 1 line I could come up with
Finally:
n=n^1;
All of the methods clearly do the same thing, but my feeling was that the third one would be the most efficient.
The next example is on a more general note. Say I'm comparing two positive integers, will one of these perform better than the others. Or will the difference really not be noticeable, even if I perform this operation several million times:
if(n_1==n_2)
if(! (n_1 ^ n_2) )
if( n_1 ^ n_2) else \do work here
Will the compiler just do the same operation in all of these instances? I'm just curious if there is an instance when I should use bitwise operations and not trust the compiler to do the work for me.
Fixed:In correct statement of problem.
It's easy enough to check, just fire up your disassembler. Take a look:
f.c:
unsigned int f1(unsigned int n)
{
n ^= 1;
return n;
}
unsigned int f2(unsigned int n)
{
if (n % 2)
n=n-1;
else
n=n+1;
return n;
}
Build and disassemble:
$ cc -O3 -c f.c
$ otool -tV f.o
f.o:
(__TEXT,__text) section
_f1:
00 pushq %rbp
01 movq %rsp,%rbp
04 xorl $0x01,%edi
07 movl %edi,%eax
09 leave
0a ret
0b nopl _f1(%rax,%rax)
_f2:
10 pushq %rbp
11 movq %rsp,%rbp
14 leal 0xff(%rdi),%eax
17 leal 0x01(%rdi),%edx
1a andl $0x01,%edi
1d cmovel %edx,%eax
20 leave
21 ret
It looks like f1() is a bit shorter, whether or not that matters in reality is up to some benchmarking.
I needed to change n to n-1 if n was even or n to (n+1) if n was odd.
In that case, regardless of efficiency, n = n ^ 1 is wrong.
For your second case, == will be just as efficient (if not more so) than any of the others.
In general, when it comes to optimization, you should benchmark it yourself. If a potential optimization is not worth benchmarking, it's not really worth making.
I kind-of disagree with most of the answers here, which is why I still see myself replying to a question of 2010 :-)
XOR is practically speaking the fastest operation a CPU can possibly do, and the good part is that all CPU's support it. The reason for this is quite easy: a XOR gate can be created with only 4 NAND gates or 5 NOR gates -- which means it's easy to create using the fabric of your silicon. Unsurprising, all CPU's that I know of can execute your XOR operation in 1 clock tick (or even less).
If you need to do a XOR on multiple items in an array, modern x64 CPU's also support XOR's on multiple items at once like f.ex. the SIMD instructions on Intel.
The alternative solution you opt uses the if-then-else. True, most compilers are able to figure this easy thing out... but why take any chances and what's the consequence?
The consequence of your compiler not figuring it out are branch prediction errors. A single branch prediction failure will easily take 17 clock ticks. If you take one look at the execution speeds of processor instructions, you'll find that branches are quite bad for your performance, especially when dealing with random data.
Note that this also means that if you construct your test incorrectly, the data will mess up your performance measurements.
So to conclude: first think, then program, then profile - not the other way around. And use XOR.
About the only way to know for sure is to test. I'd have to agree that it would take a fairly clever compiler to produce as efficient of output for:
if(n%2) // or (n%2==0) and flip the order
n=n-1
else
n=n+1
as it could for n ^= 1;, but I haven't checked anything similar recently enough to say with any certainty.
As for your second question, I doubt it makes any difference -- an equality comparison is going to end up fast for any of these methods. If you want speed, the main thing to do is avoid having a branch involved at all -- e.g. something like:
if (a == b)
c += d;
can be written as: c += d * (a==b);. Looking at the assembly language, the second will often look a bit messy (with ugly cruft to get the result of the comparison from the flags into a normal register) but still often perform better by avoiding any branches.
Edit: At least the compilers I have handy (gcc & MSVC), do not generate a cmov for the if, but they do generate a sete for the * (a==b). I expanded the code to something testable.
Edit2: Since Potatoswatter brought up another possibility using bit-wise and instead of multiplication, I decided to test that along with the others. Here's the code with that added:
#include <time.h>
#include <iostream>
#include <stdlib.h>
int addif1(int a, int b, int c, int d) {
if (a==b)
c+=d;
return c;
}
int addif2(int a, int b, int c, int d) {
return c += d * (a == b);
}
int addif3(int a, int b, int c, int d) {
return c += d & -(a == b);
}
int main() {
const int iterations = 50000;
int x = rand();
unsigned tot1 = 0;
unsigned tot2 = 0;
unsigned tot3 = 0;
clock_t start1 = clock();
for (int i=0; i<iterations; i++) {
for (int j=0; j<iterations; j++)
tot1 +=addif1(i, j, i, x);
}
clock_t stop1 = clock();
clock_t start2 = clock();
for (int i=0; i<iterations; i++) {
for (int j=0; j<iterations; j++)
tot2 +=addif2(i, j, i, x);
}
clock_t stop2 = clock();
clock_t start3 = clock();
for (int i=0; i<iterations; i++) {
for (int j=0; j<iterations; j++)
tot3 +=addif3(i, j, i, x);
}
clock_t stop3 = clock();
std::cout << "Ignore: " << tot1 << "\n";
std::cout << "Ignore: " << tot2 << "\n";
std::cout << "Ignore: " << tot3 << "\n";
std::cout << "addif1: " << stop1-start1 << "\n";
std::cout << "addif2: " << stop2-start2 << "\n";
std::cout << "addif3: " << stop3-start3 << "\n";
return 0;
}
Now the really interesting part: the results for the third version are quite interesting. For MS VC++, we get roughly what most of us would probably expect:
Ignore: 2682925904
Ignore: 2682925904
Ignore: 2682925904
addif1: 4814
addif2: 3504
addif3: 3021
Using the & instead of the *, gives a definite improvement -- almost as much of an improvement as * gives over if. With gcc the result is quite a bit different though:
Ignore: 2680875904
Ignore: 2680875904
Ignore: 2680875904
addif1: 2901
addif2: 2886
addif3: 7675
In this case, the code using if is much closer to the speed of the code using *, but the code using & is slower than either one -- a lot slower! In case anybody cares, I found this surprising enough that I re-compiled a couple of times with different flags, re-ran a few times with each, and so on and the result was entirely consistent -- the code using & was consistently considerably slower.
The poor result with the third version of the code compiled with gcc gets us back to what I said to start with [and ends this edit]:
As I said to start with, "the only way to know for sure is to test" -- but at least in this limited testing, the multiplication consistently beats the if. There may be some combination of compiler, compiler flags, CPU, data pattern, iteration count, etc., that favors the if over the multiplication -- there's no question that the difference is small enough that a test going the other direction is entirely believable. Nonetheless, I believe that it's a technique worth knowing; for mainstream compilers and CPUs, it seems reasonably effective (though it's certainly more helpful with MSVC than with gcc).
[resumption of edit2:] the result with gcc using & demonstrates the degree to which 1) micro-optimizations can be/are compiler specific, and 2) how much different real-life results can be from expectations.
Is n^=1 faster than if ( n%2 ) --n; else ++n;? Yes. I wouldn't expect a compiler to optimize that. Since the bitwise operation is so much more succinct, it might be worthwhile to familiarize yourself with XOR and maybe add a comment on that line of code.
If it's really critical to the functionality of your program, it could also be considered a portability issue: if you test on your compiler and it's fast, you would likely be in for a surprise when trying on another compiler. Usually this isn't an issue for algebraic optimizations.
Is x^y faster than x==y? No. Doing things in roundabout ways is generally not good.
A good compiler will optimize n%2 but you can always check the assembly produced to see. If you see divides, start optimizing it yourself because divide is about as slow as it gets.
You should trust your compiler. gcc/++ is the product of years of development and it's capable of doing any optimizations you're probably thinking of doing. And, it's likely that if you start playing around, you'll tamper with it's efforts to optimize your code.
n ^= 1 and n1==n2 are probably the best you can do, but really, if you are after maximum efficiency, don't eyeball the code looking for little things like that.
Here's an example of how to really tune for performance.
Don't expect low level optimizations to help much until sampling has proven they are where you should focus.