I'm writing code that takes a number from a user and prints it back in letters, as a string. I want to know which is better performance-wise: to have if statements, like
if (n < 100) {
// code for 2-digit numbers
} else if (n < 1000) {
// code for 3-digit numbers
} // etc..
or to put the number in a string and get its length, then work on it as a string.
The code is written in C++.
Of course if-else will be faster.
Comparing two integers is a single, very fast machine operation (there are different ways the compiler can do it, but it is always cheap).
To get the length of the string you first need to build the string, put the data into it, and then compute the length somehow (there are different ways of doing that too, the simplest being counting all the characters). Of course that takes much more time.
On a simple example, though, you will not notice any difference. It often amazes me that people get concerned with such things (no offense). It really makes no difference to you whether the code executes in 0.003 seconds or 0.001 seconds. You should make such low-level optimizations only after you know that this exact place is a bottleneck of your application, and when you are sure that you can increase the performance by a decent amount.
Until you measure and this really is a bottleneck, don't worry about performance.
That said, the following should be even faster (for readability, let's assume you use a type that ranges between 0 and 99999999):
if (n < 10000) {
    // code for less than or equal to 4 digits
    if (n < 100) {
        // code for less than or equal to 2 digits
        if (n < 10)
            return 1;
        else
            return 2;
    } else {
        // code for over 2 digits, but under or equal to 4
        if (n >= 1000)
            return 4;
        else
            return 3;
    }
} else {
    // similar
} // etc.
Basically, it's a variation of binary search. Worst case, this will take O(log(n)) as opposed to O(n) - n being the maximum number of digits.
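For completeness, a minimal sketch of the full branch tree for values up to 99999999 (the function name is made up, and the plain digit counts stand in for whatever per-length code you actually need):

// Sketch: binary-search digit count for 0..99999999 (hypothetical helper name).
int num_digits_bsearch(unsigned n) {
    if (n < 10000) {
        if (n < 100)
            return (n < 10) ? 1 : 2;
        else
            return (n < 1000) ? 3 : 4;
    } else {
        if (n < 1000000)
            return (n < 100000) ? 5 : 6;
        else
            return (n < 10000000) ? 7 : 8;
    }
}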
The string variant will be slower:
std::stringstream ss; // allocation, initialization ...
ss << 4711; // parsing, setting internal flags, ...
std::string str = ss.str(); // allocations, array copies ...
// cleaning up (compiler does it for you) ...
str.~string();
ss.~stringstream(); // destruction ...
The ... indicate there's more stuff happening.
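For comparison, the string route in modern C++ can skip the stringstream machinery via std::to_string (C++11), but it still allocates and formats on every call; the helper name here is made up:

#include <string>

int num_digits_via_string(int n) {
    // allocation + formatting on every call; a leading '-' is counted for negative values
    return static_cast<int>(std::to_string(n).size());
}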
A compact (good for cache) loop (good for branch prediction) might be what you want:
int num_digits (int value, int base=10) {
int num = 0;
while (value) {
value /= base;
++num;
}
return num;
}
int num_zeros (int value, int base=10) {
    return num_digits(value, base) - 1;  // note: yields -1 for value == 0, since num_digits(0) == 0
}
Depending on circumstances, because it is cache and prediction friendly, this may be faster than solutions based on relational operators.
The templated variant enables the compiler to do some micro optimizations for your division:
template <int base=10>
int num_digits (int value) {
int num = 0;
while (value) {
value /= base;
++num;
}
return num;
}
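Usage would look something like this (building on the template above); with the base fixed at compile time, the compiler can typically replace the division by a multiply-and-shift:

int d = num_digits<10>(4711); // 4
int b = num_digits<2>(4711);  // 13 binary digits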
The answers are good, but think a bit about relative times.
Even by the slowest method you can think of, the program can do it in some tiny fraction of a second, like maybe 100 microseconds.
Balance that against the fastest user you can imagine, who could type in the number in maybe 500 milliseconds, and who could read the output in another 500 milliseconds, before doing whatever comes next.
OK, the machine does essentially nothing for 1000 milliseconds, and in the middle it has to crunch like crazy for 100 microseconds because, after all, we don't want the user to think the program is slow ;-)
The problem statement is to find the primes below 2 billion in a timeframe of less than 20 sec.
I followed the two approaches below.
Divide each number n by every number k (k < sqrt(n)) - took 20 sec.
Divide each number n only by the primes below sqrt(n); in this scenario I stored the primes in a std::list - took more than 180 sec.
Can someone help me understand why the 2nd approach took so long even though we reduced the number of divisions by roughly 50%? Or did I choose the wrong data structure?
Approach 1:
#include <iostream>
#include <list>
#include <cmath>   // for sqrt
#include <ctime>
using namespace std;
list<long long> primeno;
void ListPrimeNumber();
int main()
{
    clock_t time_req = clock();
    ListPrimeNumber();
    time_req = clock() - time_req;
    cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
    return 0;
}
void check_prime(int i);
void ListPrimeNumber()
{
    primeno.push_back(2);
    primeno.push_back(3);
    primeno.push_back(5);
    for (long long i = 6; i <= 20000000; i++)
    {
        check_prime(i);
    }
}
void check_prime(int i)
{
    try
    {
        int j = 0;
        int limit = sqrt(i);
        for (j = 2; j <= limit; j++)
        {
            if (i % j == 0)
            {
                break;
            }
        }
        if (j > limit)
        {
            primeno.push_back(i);
        }
    }
    catch (exception ex)
    {
        std::cout << "Message";
    }
}
Approach 2 :
#include <iostream>
#include <list>
#include <cmath>   // for sqrt
#include <ctime>
using namespace std;
list<long long> primeno;
int noofdiv = 0;
void ListPrimeNumber();
int main()
{
    clock_t time_req = clock();
    ListPrimeNumber();
    time_req = clock() - time_req;
    cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
    cout << "No of divisions : " << noofdiv;
    return 0;
}
void check_prime(int i);
void ListPrimeNumber()
{
    primeno.push_back(2);
    primeno.push_back(3);
    primeno.push_back(5);
    for (long long i = 6; i <= 10000; i++)
    {
        check_prime(i);
    }
}
void check_prime(int i)
{
    try
    {
        int limit = sqrt(i);
        for (int iter : primeno)
        {
            noofdiv++;
            if (iter <= limit && (i % iter) == 0)
            {
                break;
            }
            else if (iter > limit)
            {
                primeno.push_back(i);
                break;
            }
        }
    }
    catch (exception ex)
    {
        std::cout << "Message";
    }
}
The reason your second example takes longer is that you're iterating a std::list.
A std::list in C++ is a linked list, which means it doesn't use contiguous memory. This is bad because to iterate the list you must jump from node to node in a way that is unpredictable to the CPU's prefetcher. Also, you're most likely only "using" a few bytes of each cache line. RAM is slow: fetching a byte from RAM takes a lot longer than fetching it from L1. CPUs are fast these days, so your program spends most of its time doing nothing, waiting for memory to arrive.
Use a std::vector instead. It stores all values one after the other and iterating is very cheap. Since you're iterating forward in memory without jumping, you're using the full cacheline and your prefetcher will be able to fetch further pages before you need them because your access of memory is predictable.
It has been proven by numerous people, including Bjarne Stroustrup, that std::vector is in a lot of cases faster than std::list, even in cases where the std::list has "theoretically" better complexity (random insert, delete, ...) just because caching helps a lot. So always use std::vector as your default. And if you think a linked list would be faster in your case, measure it and be surprised that - most of the time - std::vector dominates.
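As a rough sketch of what that swap could look like for the check_prime from the question (assuming, as there, a pre-seeded global container and candidates tested in increasing order, so the stored primes always cover sqrt(i)):

#include <cmath>
#include <cstddef>
#include <vector>

std::vector<long long> primes = {2, 3, 5};  // seeded as in the question

void check_prime(long long i)
{
    long long limit = static_cast<long long>(std::sqrt(i));
    for (std::size_t k = 0; k < primes.size(); ++k) // contiguous storage: cache/prefetcher friendly
    {
        if (primes[k] > limit) break;     // no prime divisor up to sqrt(i): i is prime
        if (i % primes[k] == 0) return;   // divisor found: i is composite
    }
    primes.push_back(i);
}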
Edit: as others have noted, your method of finding primes isn't very efficient. I just played around a bit and implemented a Sieve of Eratosthenes using a bitset.
constexpr int max_prime = 1000000000;
std::bitset<max_prime> *bitset = new std::bitset<max_prime>{};
// Note: Bit SET means NO prime
bitset->set(0);
bitset->set(1);
for (int i = 4; i < max_prime; i += 2)
    bitset->set(i); // set all even numbers
int max = sqrt(max_prime);
for (int i = 3; i < max; i += 2) { // No point testing even numbers as they can't be prime
    if (!bitset->test(i)) { // If i is prime
        for (int j = i * 2; j < max_prime; j += i)
            bitset->set(j); // set all multiples of i to non-prime
    }
}
This takes about 30 seconds to find all primes below one billion (1,000,000,000) on my machine (it used to take between 4.2 and 4.5 seconds; not sure why it changed that much after slight modifications... must be an optimization I'm not hitting anymore). Your approach took way too long even for 100 million. I cancelled the 1 billion search after about two minutes.
Comparison for 100 million:
time taken: 63.515 seconds
time taken bitset: 1.874 seconds
No of divisions : 1975961174
No of primes found: 5761455
No of primes found bitset: 5761455
I'm not a mathematician, so I'm pretty sure there are still ways to optimize it further; so far I only special-case the even numbers.
The first thing to do is make sure you are compiling with optimisations enabled. The C++ standard library template classes tend to perform very poorly with unoptimised code, as they generate lots of function calls; the optimiser inlines most of those calls, which makes them much cheaper.
std::list is a linked list. It is mostly useful where you want to insert or remove elements randomly (i.e. not from the end).
For the case where you are only appending to the end of a list std::list has the following issues:
Iterating through the list is relatively expensive as the code has to follow node pointers and then retrieve the data
The list uses quite a lot more memory: each element needs a pointer to the previous and next nodes in addition to the actual data. On a 64-bit system this equates to at least 20 bytes per element, rather than the 4 bytes the int data itself needs
As the elements in the list are not contiguous in memory the compiler can't perform as many SIMD optimisations and you will suffer more from CPU cache misses
A std::vector would solve all of the above as its memory is contiguous and iterating through it is basically just a case of incrementing an array index. You do need to make sure that you call reserve on your vector at the beginning with a sufficiently large value so that appending to the vector doesn't cause the whole array to be copied to a new larger array.
A bigger optimisation than the above would be to use the Sieve of Eratosthenes to calculate your primes. As generating this might require random deletions (depending on your exact implementation), std::list might perform better than std::vector, though even in this case the benefits of std::list might not outweigh its overheads.
A test at Ideone (the OP's code with a few superficial alterations) completely contradicts the claims made in this question:
/* check_prime__list:
time taken No of divisions No of primes
10M: 0.873 seconds 286144936 664579
20M: 2.169 seconds 721544444 1270607 */
2B: projected time: at least 16 minutes but likely much more (*)
/* check_prime__nums:
time taken No of divisions No of primes
10M: 4.650 seconds 1746210131 664579
20M: 12.585 seconds 4677014576 1270607 */
2B: projected time: at least 3 hours but likely much more (*)
I also changed the type of the number-of-divisions counter to long int, because it was wrapping around the data type's limit, so the OP could have been misinterpreting that figure.
The run time wasn't affected by it, though: a wall clock is a wall clock.
The most likely explanation seems to be sloppy testing by the OP, with different values used in each test case by mistake.
(*) The time projection was made by the empirical orders of growth analysis:
100**1.32 * 2.169 / 60 = 15.8
100**1.45 * 12.585 / 3600 = 2.8
Empirical orders of growth, as measured on the given range of sizes, were noticeably better for the list algorithm, n^1.32 vs. the n^1.45 for the testing by all numbers. This is entirely expected from theoretical complexity, since there are fewer primes than all numbers up to n, by a factor of log n, for a total complexity of O(n^1.5 / log n) vs. O(n^1.5). It is also highly unlikely for any implementational discrepancy to beat an actual algorithmic advantage.
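For reference, here is the same arithmetic as a tiny self-contained program (the constants are the measured timings above):

#include <cmath>
#include <cstdio>

int main()
{
    // empirical exponent = log(time ratio) / log(size ratio), for sizes 10M and 20M
    double k_list = std::log(2.169 / 0.873)  / std::log(2.0); // list-of-primes variant
    double k_all  = std::log(12.585 / 4.650) / std::log(2.0); // all-numbers variant

    // scale the 20M timings by 100x to project 2B
    std::printf("list variant : n^%.2f, ~%.0f minutes for 2B\n",
                k_list, std::pow(100.0, k_list) * 2.169 / 60.0);
    std::printf("naive variant: n^%.2f, ~%.1f hours for 2B\n",
                k_all, std::pow(100.0, k_all) * 12.585 / 3600.0);
}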
I have a function here that lets a program count, wait, etc. with a least count of 1 millisecond. But I was wondering whether I can do the same with a smaller least count. I have read other answers, but they are mostly about switching to Linux or say that sleep is a guesstimate; what's more, those answers are around a decade old, so maybe there is a newer function for this by now.
Here's the function:
void sleep(unsigned int mseconds)
{
    clock_t goal = mseconds + clock(); // assumes CLOCKS_PER_SEC == 1000 (as on Windows)
    while (goal > clock());            // busy-wait until the goal is reached
}
Actually, I was trying to make a function similar to secure_compare, but I don't think it is a wise idea to waste 1 millisecond (the current least count) on just comparing two strings.
Here is the function I made for that:
bool secure_compare(string a, string b){
    clock_t limit = wait + clock(); // limit on the time the comparison may take
    bool x = (a == b);
    if(clock() > limit){ // if the comparison took longer, increase wait so the new maximum applies to future comparisons too
        wait = clock() - limit;
        cout << "Error";
        secure_compare(a, b);
    }
    while(clock() < limit); // burn the remaining time, to make it a constant-time function
    return x;
}
You're trying to make a comparison function time-independent. There are basically two ways to do this:
Measure the time taken for the call and sleep the appropriate amount
This might only swap out one side channel (timing) with another (power consumption, since sleeping and computation might have different power usage characteristics).
Make the control flow more data-independent:
Instead of using the normal string comparison, you could implement your own comparison that compares all characters and not just up until the first mismatch, like this:
bool match = true;
size_t min_length = min(a.size(), b.size());
for (size_t i = 0; i < min_length; ++i) {
match &= (a[i] == b[i]);
}
return match;
Here, no branching (conditional operations) takes place, so every call of this method with strings of the same length should take roughly the same time. So the only side-channel information you leak is the length of the strings you compare, but that would be difficult to hide anyways, if they are of arbitrary length.
EDIT: Incorporating Passer By's comment:
If we want to reduce the size leakage, we could try to round the size up and clamp the index values.
bool match = true;
size_t min_length = min(a.size(), b.size());
size_t rounded_length = (min_length + 1023) / 1024 * 1024;
for (size_t i = 0; i < rounded_length; ++i) {
size_t clamped_i = min(i, min_length - 1);
match &= (a[clamped_i] == b[clamped_i]);
}
return match;
There might be a tiny cache timing sidechannel present (because we don't get any more cache misses if i > clamped_i), but since a and b should be in the cache hierarchy anyways, I doubt the difference is usable in any way.
I want to implement a very long boolean array (as a binary genome), access some intervals to check whether an interval is all true or not, and in addition change the values in some intervals.
For example, I can create 4 representations:
bool binaryGenome1[10000000] = {false};
vector<bool> binaryGenome2; binaryGenome2.resize(10000000);
vector<char> binaryGenome3; binaryGenome3.resize(10000000);
bitset<10000000> binaryGenome4;
and access this way:
inline bool checkBinGenome(long long start, long long end){
    for(long long i = start; i < end + 1; i++)
        if(binaryGenome[i] == false)
            return false;
    return true;
}
inline void changeBinGenome(long long start, long long end){
    for(long long i = start; i < end + 1; i++)
        binaryGenome[i] = true;
}
vector<char> and a plain bool array (which stores every boolean in a byte) both seem to be a poor choice, as I need to be space-efficient. But what are the differences between vector<bool> and bitset?
Somewhere else I read that vector has some overhead because you can choose its size at run time rather than at compile time - "overhead" for what, accessing? And how much is that overhead?
As I want to access array elements many times using checkBinGenome() and changeBinGenome(), what is the fastest implementation?
Use std::bitset. It's the best.
If the length of the data is known at compile time, consider std::array<bool, N> or std::bitset<N>. The latter is likely to be more space-efficient (you'll have to measure whether the associated extra work in access times outweighs the speed gain from reducing cache pressure - that will depend on your workload).
If your array's length is not fixed, then you'll need a std::vector<bool> or std::vector<char>; there's also boost::dynamic_bitset but I've never used that.
If you will be changing large regions at once, as your sample implies, it may well be worth constructing your own representation and manipulating the underlying storage directly, rather than one bit at a time through the iterators. For example, if you use an array of char as the underlying representation, then setting a large range to 0 or 1 is mostly a memset() or std::fill() call, with computation only for the values at the start and end of the range. I'd start with a simple implementation and a good set of unit tests before trying anything like that.
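As a hedged illustration of that idea (a hypothetical bit-packed buffer, not one of the standard containers), setting a whole interval touches at most two partial bytes plus one memset for everything in between:

#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical bit-packed genome: bit (i % 8) of data[i / 8] holds element i.
struct PackedGenome {
    std::vector<unsigned char> data;
    explicit PackedGenome(std::size_t nbits) : data((nbits + 7) / 8, 0) {}

    // Set every element in [start, end] to true.
    void set_range(std::size_t start, std::size_t end) {
        std::size_t first_byte = start / 8, last_byte = end / 8;
        if (first_byte == last_byte) {                    // interval lies within a single byte
            for (std::size_t i = start; i <= end; ++i)
                data[i / 8] |= static_cast<unsigned char>(1u << (i % 8));
            return;
        }
        data[first_byte] |= static_cast<unsigned char>(0xFFu << (start % 8));   // partial first byte
        data[last_byte]  |= static_cast<unsigned char>(0xFFu >> (7 - end % 8)); // partial last byte
        if (last_byte > first_byte + 1)                   // full bytes in the middle: one memset
            std::memset(&data[first_byte + 1], 0xFF, last_byte - first_byte - 1);
    }
};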
It is (at least theoretically) possible that your Standard Library has specialized versions of algorithms for the iterators of std::vector<bool>, std::array<bool> and/or std::bitset that do exactly the above, or you may be able to write and contribute such specializations. That's a better path if possible - the world may thank you, and you'll have shared some of the maintenance responsibility.
Important note
If using std::vector<bool>, you do need to be aware that, unlike other std::vector<> instantiations, it does not implement the standard container semantics: operator[] returns a proxy object rather than a bool&. That's not to say it shouldn't be used, but make sure you understand its foibles - e.g. when checking whether all the elements are true.
I am really NOT sure whether this will give us more overhead than speedup. Actually, I think that nowadays a CPU can do this kind of scan quite fast; are you really seeing poor performance, or is this just a skeleton of your real problem?
#include <omp.h>
#include <iostream>
#include <cstring>
using namespace std;
#define N 10000000
bool binaryGenome[N];
int main() {
memset(binaryGenome, true, sizeof(bool) * N);
int shouldBreak = 0;
bool result = true;
cout << result << endl;
binaryGenome[9999995] = false;
bool go = true;
unsigned give = 0;
#pragma omp parallel
{
unsigned start, stop;
#pragma omp critical
{
start = give;
give += N / omp_get_num_threads();
stop = give;
if (omp_get_thread_num() == omp_get_num_threads() - 1)
stop = N;
}
while (start < stop && go) {
if (!binaryGenome[start]) {
cout << start << endl;
go = false;
result = false;
}
++start;
}
}
cout << result << endl;
}
I'm trying to find some primes with the Sieve of the greek guy algorithm. I have some efficiency concerns. Here's the code:
void check_if_prime(unsigned number)
{
unsigned index = 0;
while (primes[index] <= std::sqrt(number))
{
if (number % primes[index] == 0) return;
++index;
}
primes.push_back(number);
}
And, because I coded a huge 2/3/5/7/11/13 prime wheel, the code is 5795 lines long.
for (unsigned i = 0; i < selection; ++i)
{
unsigned multiple = i * 30030;
if (i!=0) check_if_prime( multiple+1 );
check_if_prime ( multiple+17 );
check_if_prime ( multiple+19 );
check_if_prime ( multiple+23 );
// ...so on until 30029
}
Optimization flags: -O3, -fexpensive-optimizations, -march=pentium2
25 million primes in 20 minutes, with the CPU stuck at 50% (no idea why; I tried real-time priority but it didn't change much). The size of the output text file is 256 MB (going to change to binary later on).
Compilation takes ages! Is it okay? How can I make it faster without compromising efficiency?
Is that if statement at the start of the for loop OK? I've read that if statements take the longest.
Anything else concerning the code, not the algorithm? Anything to make it faster? What statements are faster than others?
Would even a bigger wheel (up to 510510, not just 30030; a hell of a lot more lines) compile within a day?
I want to find all primes up to 2^32 and little optimizations would save some hours and electricity. Thank you in advance!
EDIT: I'm not seeking a different algorithm, just code improvements, if any can be made!
Here is what I can say about the performance of your program:
Likely your main problem is the call to std::sqrt(). This is a floating point function that's designed for full precision of the result, and it definitely takes quite a few cycles. I bet you'll be much faster if you use this check instead:
while (primes[index]*primes[index] <= number)
That way you are using an integer multiplication which is trivial for modern CPUs.
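Applied to the function from the question, a minimal sketch might look like this (assuming, as there, a global primes vector pre-seeded with the wheel primes and candidates arriving in increasing order):

#include <cstddef>
#include <vector>

std::vector<unsigned> primes = {2, 3, 5, 7, 11, 13};

void check_if_prime(unsigned number)
{
    // With ascending candidates, the stored primes always reach past sqrt(number),
    // so the condition stops the loop before it runs off the end of the vector.
    for (std::size_t index = 0;
         1ULL * primes[index] * primes[index] <= number; // 64-bit product: no overflow near 2^32
         ++index)
    {
        if (number % primes[index] == 0) return;         // divisor found: composite
    }
    primes.push_back(number);                            // no prime divisor up to sqrt(number)
}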
The if statement at the start of your for() loop is irrelevant to performance. It's not executed nearly enough times. Your inner loop is the while loop within check_if_prime(). That's the one you need to optimize.
I can't see how you are doing output. There are ways to do output that can severely slow you down, but I don't think that's the main issue (if it is an issue at all).
Code size can be an issue: your CPU has an instruction cache with limited capacity. If your 6k lines don't fit into the first level instruction cache, the penalty can be severe. If I were you, I'd reimplement the wheel using data instead of code, i. e.:
unsigned const wheel[] = {1, 17, 19, 23, ...}; // add all 5760 residues coprime to 30030 here
for (unsigned i = 0; i < selection; ++i)
{
unsigned multiple = i * 30030;
for(unsigned j = 0; j < sizeof(wheel)/sizeof(*wheel); j++) {
check_if_prime(multiple + wheel[j]);
}
}
Get it running under a debugger and single-step it, instruction by instruction, and at each point understand what it is doing, and why. This makes you walk in the shoes of the CPU, and you will see all the silliness that nutty programmer is making you do, and you will see what you could do better. That's one way to make your code go as fast as possible.
Program size, by itself, only affects speed if you've got it so fast that caching becomes an issue.
Here's a stab at some techniques for checking if a number is a prime:
#include <set>

bool is_prime(unsigned int number) // negative numbers are not prime.
{
// A data store for primes already calculated.
static std::set<unsigned int> calculated_primes;
// Simple checks first:
// Primes must be >= 2.
// Primes greater than 2 are odd.
if ( (number < 2)
    || ((number > 2) && ((number & 1) == 0)) )
{
return false;
}
// Initialize the set with a few prime numbers, if necessary.
if (calculated_primes.empty())
{
static const unsigned int primes[] =
{ 2, 3, 5, 7, 11, 13, 17, 19, 23, 29};
static const unsigned int known_primes_quantity =
sizeof(primes) / sizeof(primes[0]);
calculated_primes.insert(&primes[0], &primes[known_primes_quantity]);
}
// Check if the number is a prime that is already calculated:
if (calculated_primes.find(number) != calculated_primes.end())
{
return true;
}
// Find the first calculated prime that is not less than the number;
// if there is none, fall back to the largest prime calculated so far.
std::set<unsigned int>::iterator prime_iter =
    calculated_primes.lower_bound(number);
// Use this value as the start for the sieve.
unsigned int prime_candidate = (prime_iter != calculated_primes.end())
                                   ? *prime_iter
                                   : *calculated_primes.rbegin();
const unsigned int iteration_limit = number * number;
while (prime_candidate < iteration_limit)
{
prime_candidate += 2;
bool is_prime = true;
for (prime_iter = calculated_primes.begin();
prime_iter != calculated_primes.end();
++prime_iter)
{
if ((prime_candidate % (*prime_iter)) == 0)
{
is_prime = false;
break;
}
}
if (is_prime)
{
calculated_primes.insert(prime_candidate);
if (prime_candidate == number)
{
return true;
}
}
}
return false;
}
Note: This is untested code but demonstrates some techniques for checking if a number is prime.
I have to print 2^20 lines of integers to the screen in under 1 second. printf is not quick enough for that; are there any other easy-to-use alternatives for fast output?
Each line contains only 1 integer.
I require it for a competitive programming problem whose source code I have to submit to the judge.
There are putchar and puts that you can try out.
If timing the speed of the program is all that is required, you can print to /dev/null (Unix).
That's 4 MB of binary integer data. 5 MB if you count the newlines. If you like the data in binary, just write it out to wherever as binary values.
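For the binary route, a rough sketch is a single bulk fwrite of the raw int values (obviously not human-readable, so only useful if whatever consumes the output accepts binary):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> values(1 << 20);
    for (int i = 0; i < (1 << 20); ++i)
        values[i] = i;                  // stand-in for the real data

    std::fwrite(values.data(), sizeof(int), values.size(), stdout); // one 4 MB write
}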
I'll assume you need formatting as well. The best way to do this then is to allocate a "huge" string which is big enough to handle everything, which in this case is 10+1 chars per integer. This means 11 MB. That is a reasonable memory requirement and definitely allocatable on a normal desktop system. Then, use sprintf to write the integer values out to the string:
#include <cstdio>
#include <iostream>
#include <string>
int main()
{
std::string buffer(11534336, '\0');
for (int i = 0; i < 1048576; ++i)
{
std::sprintf(&buffer[i * (10 + 1)], // take into account the newline
"%010d\n", i); // note: sprintf also writes a terminating '\0' one char past each entry
}
std::cout << buffer;
}
Note the effective formatting operation is very fast.
The physical output to the console window will take some time on Windows, this is inherent to the Windows console and cannot be remedied. As an example, Coliru times out after 17872 entries, which I believe is 5 seconds. So unfortunately, printing to the screen at this speed is impossible using Standard C(++). You might be able to do it faster when you do everything on the GPU directly and display a surface/texture/image you create, but that can hardly be the point of the exercise.
There are three major bottlenecks in printf:
parsing algorithm (must handle all kind of inputs/outputs)
base conversions (typically not optimized for your particular purpose)
I/O
The cure is
process multiple entries at time
process file i/o in blocks
finetune the base conversion for your specific problem
If your numbers are in order, you can get a considerable increase in speed by processing multiple integers at a time, e.g.:
char strings[10*6];
memcpy(strings, "10000\n10001\n10002\n10003\n10004\n", 30);
memcpy(strings + 30, "10005\n10006\n10007\n10008\n10009\n", 30);
fwrite(strings, 60, 1, stdout);
After each block of 10 integers is printed, one has to update the common part of the string, which can be done with as little as 1 sprintf + 9 memcpy calls, as sketched below.
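A hypothetical format_block helper along those lines, for one block of ten 5-digit numbers as in the snippet above (n must be a multiple of 10):

#include <cstdio>
#include <cstring>

// One sprintf for the first entry, nine memcpy for the rest, then patch the last digit.
void format_block(char *out, int n) // writes 60 chars: 10 entries of "ddddd\n"
{
    std::sprintf(out, "%05d\n", n);                  // e.g. "10000\n"
    for (int k = 1; k < 10; ++k)
    {
        std::memcpy(out + 6 * k, out, 6);            // copy the shared prefix + newline
        out[6 * k + 4] = static_cast<char>('0' + k); // fix up the final digit
    }
}

The caller can then push each 60-byte block out in one go, e.g. format_block(buf, 10000); fwrite(buf, 60, 1, stdout);.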
Expanding on what stefaanv mentioned about using putchar, this is a somewhat ugly C-style hack that should do the job fairly quickly. It makes use of the fact that ASCII decimal digits are 0x30 to 0x39:
inline void print_int(int val)
{
char chars[10]; // Max int = 2147483647
int digits = 0;
if (val < 0)
{
putchar('-');
val = -val; // note: this overflows for INT_MIN
}
do
{
chars[digits++] = ((val % 10) + 0x30);
val /= 10;
}while (val && digits < 10);
while (digits>0)
{
putchar(chars[--digits]);
}
putchar('\n');
}
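A hypothetical driver for the 2^20-line task would then just loop over print_int; whether this beats one big buffered write depends on how your platform buffers putchar, so measure:

#include <cstdio> // for putchar used by print_int above

int main()
{
    for (int i = 0; i < (1 << 20); ++i)
        print_int(i);
}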