C++ unordered set out of memory

This program goes through every combination of 4 numbers from 1 to 400 and counts how many unique numbers can be made from their product.
I believe the unordered_set used to hold the already-checked numbers is getting too large, and thus the program quits; Task Manager tells me it's at 1.5 GB.
Is there any way I can make this code run? I'm thinking maybe of splitting up the set, or somehow finding a more efficient formula.
Note: based on the comments, I would like to say again that I am not storing all 25 billion numbers; I'm only storing ~100,000,000 numbers. The question has a RANGE of 400, but I'm looking for a comprehensive solution that can handle 500, or even 1000, without the memory problem.
#include <iostream>
#include <unordered_set>
using namespace std;

const int RANGE = 400;

int main() {
    unordered_set<long long> nums;
    for (long long a = 1; a <= RANGE; a++)
    {
        for (long long b = a; b <= RANGE; b++)
        {
            for (long long c = b; c <= RANGE; c++)
            {
                for (long long d = c; d <= RANGE; d++)
                {
                    unordered_set<long long>::const_iterator got = nums.find(a*b*c*d);
                    if (got == nums.end())
                    {
                        nums.insert(a*b*c*d);
                    }
                }
            }
        }
        cout << a << endl;
    }
    cout << nums.size() << endl;
    return 0;
}

You will need to compile the code with a 64-bit compiler to allow allocating more than around 2 GB of memory. You will also need at least 4 GB of RAM on the machine you run this on, or at best it will take almost forever to finish.
A "bitset", using a single bit per entry, will take up about 3 GB of memory.
Using the code as it stands, it uses around 4 GB of memory on a Linux machine with a 64-bit g++ compiler, and it takes around 220 seconds to complete, giving the answer:
86102802
1152921504606846975
According to /usr/bin/time -v:
Command being timed: "./a.out"
User time (seconds): 219.15
System time (seconds): 2.01
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:42.53
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4069336
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 679924
Voluntary context switches: 1
Involuntary context switches: 23250
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Your solution works by taking every unique combination of four numbers between 1 and 400, multiplying them, storing the results, and then counting them. At a minimum, this would take 400×400×400×400 bits ≈ 3 GB, which is apparently more than your hardware/compiler/OS can handle. (Probably the compiler, which is easy to fix.)
So what if we try to solve the problem one step at a time? Can we count how many sets of these numbers have products between 1000 and 2000?
long long count = 0;
for (long long a = 1; a <= 400; ++a)
{
    long long bmin = max(1000 / a / 400 / 400, a); // a*b*400*400 is at least 1000
    long long bmax = min(2000 / a, 400LL);         // a*b*1*1 is at most 2000
    for (long long b = bmin; b <= bmax; b++)
    {
        long long cmin = max(1000 / a / b / 400, b); // a*b*c*400 is at least 1000
        long long cmax = min(2000 / a / b, 400LL);   // a*b*c*1 is at most 2000
        for (long long c = cmin; c <= cmax; c++)
        {
            long long dmin = max(1000 / a / b / c, c);     // a*b*c*d is at least 1000
            long long dmax = min(2000 / a / b / c, 400LL); // a*b*c*d is at most 2000
            for (long long d = dmin; d <= dmax; d++) // this will usually be zero, one, or two numbers
            {
                long long res = a * b * c * d;
                if (res >= 1000 && res < 2000) // a rare few WILL be outside this range
                    count++; // YES: this set's product lies in the batch
            }
        }
    }
}
We can simply count how many products in 0-1000 are reachable, then 1000-2000, then 2000-3000, and so on, up to 400*400*400*400. This is a significantly slower algorithm, but since it takes very little memory, the hope is that the increased cache coherence will make up for some of the difference.
In fact, speaking of very little memory: since the target numbers in each batch always fall in a sequential range of 1000, you can use a bool nums[1000] = {} instead of an unordered_set, which should give a significant performance boost.
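Here is a minimal sketch of that batched idea (my simplification, not the exact linked code; the window bounds are deliberately loose and re-checked inside the innermost loop):
#include <algorithm>
#include <iostream>
using namespace std;

const long long RANGE = 400;

int main() {
    const long long top = RANGE * RANGE * RANGE * RANGE;
    long long total = 0;
    for (long long lo = 0; lo < top; lo += 1000) {   // one window at a time
        const long long hi = lo + 1000;
        bool seen[1000] = {};                        // products in [lo, hi)
        for (long long a = 1; a <= RANGE; a++) {
            long long bmin = max(lo / (a * RANGE * RANGE), a);
            long long bmax = min((hi - 1) / a, RANGE);
            for (long long b = bmin; b <= bmax; b++) {
                long long cmin = max(lo / (a * b * RANGE), b);
                long long cmax = min((hi - 1) / (a * b), RANGE);
                for (long long c = cmin; c <= cmax; c++) {
                    long long dmin = max(lo / (a * b * c), c);
                    long long dmax = min((hi - 1) / (a * b * c), RANGE);
                    for (long long d = dmin; d <= dmax; d++) {
                        long long p = a * b * c * d;
                        if (p >= lo && p < hi && !seen[p - lo]) {
                            seen[p - lo] = true;
                            total++;                 // new unique product
                        }
                    }
                }
            }
        }
    }
    cout << total << endl;
    return 0;
}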
My full code is here: http://coliru.stacked-crooked.com/a/bc1739e972cb40f0, and I have confirmed it gives the same results as yours. After fixing several bugs, your algorithm still vastly outperforms mine for small RANGE with MSVC 2013. (For anyone else testing this with MSVC, be sure to test with the debugger NOT attached; it makes a HUGE difference in the timing of the original code.)
original( 4) found 25 in 0s
duck( 4) found 25 in 0s
original( 6) found 75 in 0s
duck( 6) found 75 in 0s
original( 9) found 225 in 0s
duck( 9) found 225 in 0s
original( 13) found 770 in 0s
duck( 13) found 770 in 0s
original( 17) found 1626 in 0.001s
duck( 17) found 1626 in 0s
original( 25) found 5135 in 0.004s
duck( 25) found 5135 in 0.002s
original( 35) found 14345 in 0.011s
duck( 35) found 14345 in 0.015s
original( 50) found 49076 in 0.042s
duck( 50) found 49075 in 0.076s
original( 71) found 168909 in 0.178s
duck( 71) found 168909 in 0.738s
original( 100) found 520841 in 0.839s
duck( 100) found 520840 in 7.206s
original( 141) found 1889918 in 5.072s
duck( 141) found 1889918 in 76.028s
When I studied the issue, what finally occurred to me is that my algorithm requires a large number of 64-bit divisions, which seem to be slow even on 64-bit operating systems.

Related

chrono give different measures at same function

I am trying to measure execution time.
I'm on Windows 10 and use the gcc compiler.
start_t = chrono::system_clock::now();
tree->insert();
end_t = chrono::system_clock::now();
rslt_period = chrono::duration_cast<chrono::nanoseconds>(end_t - start_t);
This is my code to measure the time of bp_w->insert().
The insert function works internally as follows (just pseudocode):
insert() {
    _load_node(node);
    // do something //
    _save_node(node, addr);
}

_save_node(n) {
    ofstream file(name);
    file.write(n);
    file.close();
}

_load_node(n, addr) {
    ifstream file(name);
    file.read_from(n, addr);
    file.close();
}
The actual results are below:
read is the number of _load_node executions,
write is the number of _save_node executions,
time is in nanoseconds.
read write time
1 1 1000000
1 1 0
2 1 0
1 1 0
1 1 0
1 1 0
2 1 0
1 1 1004000
1 1 1005000
1 1 0
1 1 0
1 1 15621000
I don't have any idea why these results come out this way and want to understand them.
What you are trying to measure is ill-defined.
"How long did this code take to run?" can seem simple. In practice, though, do you mean "how many CPU cycles did my code take"? Or how many cycles elapsed, shared between my program and the other running programs? Do you account for the time to load/unload it on the CPU? Do you account for the CPU being throttled down when on battery? Do you want to account for the time to access the main clock located on the motherboard (which, in terms of computation, is extremely far away)?
So, in practice, timing will be affected by a lot of factors, and the simple fact of measuring it will slow everything down. Don't expect nanosecond accuracy. Microseconds, maybe. Milliseconds, certainly.
That leaves you in a position where any single measurement will fluctuate a lot. The sane way is to average it over multiple measurements. Or, even better, do the same operation (on different data) a million times and divide the result by a million.
Then you'll get a significant improvement in accuracy.
In code:
start_t = chrono::system_clock::now();
for (int i = 0; i < 1000000; i++)
    tree->insert();
end_t = chrono::system_clock::now();
rslt_period = chrono::duration_cast<chrono::nanoseconds>(end_t - start_t);
// average nanoseconds per insert:
auto avg_ns = rslt_period.count() / 1000000;
You are also using the wrong clock. system_clock is not useful for timing intervals due to its low resolution and its non-monotonic nature (it can jump when the system time is adjusted).
Use steady_clock instead. It is guaranteed to be monotonic and to have a high enough resolution to be useful.
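A minimal sketch putting both pieces of advice together (work() is a hypothetical stand-in for your tree->insert() call):
#include <chrono>
#include <iostream>

void work() {
    // stand-in for tree->insert()
}

int main() {
    const long long iterations = 1000000;
    auto start = std::chrono::steady_clock::now();   // monotonic clock
    for (long long i = 0; i < iterations; i++)
        work();
    auto end = std::chrono::steady_clock::now();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << total.count() / iterations << " ns per call (average)\n";
    return 0;
}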

Python 2.7 - Count number of items in Output

I need to count the number of items in my output.
So, for example, I created this:
a = 1000000
while a >= 10:
    print a
    a = a / 2
How would I count how many halving steps were carried out?
Thanks
You have two ways: the empirical way and the predictive way.
import math

a = 1000000
# predictive count: Python 2.7 has no math.log2, so use math.log(x, 2)
print("theoretical iterations {}".format(int(math.log(a // 10, 2) + 0.5)))

counter = 0
while a >= 10:
    counter += 1
    a //= 2
print("real iterations {}".format(counter))
I get:
theoretical iterations 17
real iterations 17
The empirical method just counts the iterations, whereas the predictive method relies on the rounded-up log2 of a//10 (which matches the complexity of the algorithm).
(It's rounded toward the upper bound because if the exact value is a bit more than 16, you need 17 iterations.)
c = 0
a = 1000000
while a >= 10:
    print a
    a = a / 2
    c = c + 1

How to check if a value within a range is multiple of a value from another range?

Let's say I have a system that distributes 8820 values into 96 values, rounded using banker's rounding (call these pulses). The formula is:
pulse = BankerRound(8820 * i / 96), with i ∈ [0, 96[
Thus, this is the list of pulses (a sketch reproducing it follows the list):
0
92
184
276
368
459
551
643
735
827
919
1011
1102
1194
1286
1378
1470
1562
1654
1746
1838
1929
2021
2113
2205
2297
2389
2481
2572
2664
2756
2848
2940
3032
3124
3216
3308
3399
3491
3583
3675
3767
3859
3951
4042
4134
4226
4318
4410
4502
4594
4686
4778
4869
4961
5053
5145
5237
5329
5421
5512
5604
5696
5788
5880
5972
6064
6156
6248
6339
6431
6523
6615
6707
6799
6891
6982
7074
7166
7258
7350
7442
7534
7626
7718
7809
7901
7993
8085
8177
8269
8361
8452
8544
8636
8728
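(For reference, a minimal sketch that reproduces this list; it assumes the default FE_TONEAREST floating-point rounding mode, in which std::nearbyint rounds halves to even, i.e. banker's rounding.)
#include <cmath>
#include <iostream>

int main() {
    for (int i = 0; i < 96; i++) {
        // 8820.0 * i / 96.0 lands exactly on .5 for some i (e.g. i = 4 gives
        // 367.5); round-half-to-even then yields 368, matching the list.
        std::cout << (long long)std::nearbyint(8820.0 * i / 96.0) << "\n";
    }
    return 0;
}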
Now, suppose the system doesn't send me these pulses directly. Instead, it sends these pulses in 8820ths (call these ticks):
tick = value * 1/8820
The list of ticks I get becomes:
0
0.010430839
0.020861678
0.031292517
0.041723356
0.052040816
0.062471655
0.072902494
0.083333333
0.093764172
0.104195011
0.11462585
0.124943311
0.13537415
0.145804989
0.156235828
0.166666667
0.177097506
0.187528345
0.197959184
0.208390023
0.218707483
0.229138322
0.239569161
0.25
0.260430839
0.270861678
0.281292517
0.291609977
0.302040816
0.312471655
0.322902494
0.333333333
0.343764172
0.354195011
0.36462585
0.375056689
0.38537415
0.395804989
0.406235828
0.416666667
0.427097506
0.437528345
0.447959184
0.458276644
0.468707483
0.479138322
0.489569161
0.5
0.510430839
0.520861678
0.531292517
0.541723356
0.552040816
0.562471655
0.572902494
0.583333333
0.593764172
0.604195011
0.61462585
0.624943311
0.63537415
0.645804989
0.656235828
0.666666667
0.677097506
0.687528345
0.697959184
0.708390023
0.718707483
0.729138322
0.739569161
0.75
0.760430839
0.770861678
0.781292517
0.791609977
0.802040816
0.812471655
0.822902494
0.833333333
0.843764172
0.854195011
0.86462585
0.875056689
0.88537415
0.895804989
0.906235828
0.916666667
0.927097506
0.937528345
0.947959184
0.958276644
0.968707483
0.979138322
0.989569161
Unfortunately, between these ticks it also sends me fake ticks that aren't multiples of the original pulses, such as 0.029024943, which corresponds to 256, a value that isn't in the pulse list.
How can I tell from this list which ticks are valid and which are fake?
I don't have the pulse list to compare against during the process, since the 8820 will change over time, so I can't compare step by step; I need to deduce validity from the ticks at each iteration.
What's the best mathematical approach to this? Maybe reasoning only in ticks and not pulses.
I've thought of finding the smallest error between the nearest integer pulse and the previous/next tick. Here it is in C++:
double pulse = tick * 96.;
double prevpulse = (tick - 1/8820.) * 96.;
double nextpulse = (tick + 1/8820.) * 96.;
int pulseRounded = round(pulse);
int buffer = lrint(tick * 8820.);
double pulseABS = abs(pulse - pulseRounded);
double prevpulseABS = abs(prevpulse - pulseRounded);
double nextpulseABS = abs(nextpulse - pulseRounded);
if (nextpulseABS > pulseABS && prevpulseABS > pulseABS) {
    // is pulse
}
But, for example, tick 0.0417234 (pulse 368) fails, since the previous tick's error seems to be closer: the prevpulseABS error (0.00543795) is smaller than the pulseABS error (0.0054464).
That's because this comparison doesn't take the rounding into account, I guess.
NEW POST:
Alright. Based on what I now understand, here's my revised answer.
You have the information you need to build a list of good values. Each time you switch to a new track:
vector<double> good_list;
good_list.reserve(96);
for (int i = 0; i < 96; i++)
    good_list.push_back(BankerRound(8820.0 * i / 96.0) / 8820.0);
Then, each time you want to validate the input:
auto iter = find(good_list.begin(), good_list.end(), input);
if (iter != good_list.end()) // It's a match!
    cout << "Happy days! It's a match!" << endl;
else
    cout << "Oh bother. It's not a match." << endl;
The problem with mathematically determining the correct pulses is the BankerRound() function, which introduces an error that grows as the input values get larger. You would then need a formula for a formula, and that's getting out of my wheelhouse. Alternatively, you could keep track of the differences between successive values: most of them would be the same, so you'd only have to check between two possible deltas. But that falls apart if you can jump tracks or jump around within one track.
OLD POST:
If I understand the question correctly, the only information you're getting comes in the form p/v = y, where you know y (each element in the list of ticks you get from the device) and you know that p is the pulse and v is the values-per-beat, but you don't know what either of them is. So, pulling one data point from your post, you might have an equation like this:
p/v = 0.010430839
v, in all the examples you've used so far, is 8820, but from what I understand, that value is not a guaranteed constant. The next question, then, is: do you have a way of determining v before you start getting all these decimal values? If you do, you can work out mathematically what the smallest error can be (1/v), then take your decimal value, multiply it by v, round it to the nearest whole number, and check whether the difference between the rounded and non-rounded forms falls within your calculated error, like so:
double input; // let input be an element of your list of doubles, such as 0.010430839
double allowed_error = 1.0 / values_per_beat;
double proposed = input * values_per_beat;
double rounded = std::round(proposed);
if (abs(rounded - proposed) < allowed_error) { cout << "It's good!" << endl; }
If, however, you are not able to ascertain the values_per_beat ahead of time, then this becomes a statistical question. You must accumulate enough data samples, remove the outliers (the few that vary from the norm) and use that data. But that approach will not be realtime, which, given the terms you've been using (values per beat, bpm, the value 44100) it sounds like realtime might be what you're after.
Playing around with Excel, I think you want to multiply up to (what should be) whole numbers rather than looking for closest pulses.
Tick          Pulse          i                Error                       OK
(given)       (Tick*8820)    (Pulse*96/8820)  ABS( i - INT( i+0.05 ) )    (Error < 0.01)
------------  ------------   --------------   ------------------------    ------------
0.029024943 255.9999973 2.786394528 0.786394528 FALSE
0.0417234 368.000388 4.0054464 0.0054464 TRUE
0 0 0 0 TRUE
0.010430839 91.99999998 1.001360544 0.001360544 TRUE
0.020861678 184 2.002721088 0.002721088 TRUE
0.031292517 275.9999999 3.004081632 0.004081632 TRUE
0.041723356 367.9999999 4.005442176 0.005442176 TRUE
0.052040816 458.9999971 4.995918336 0.004081664 TRUE
0.062471655 550.9999971 5.99727888 0.00272112 TRUE
0.072902494 642.9999971 6.998639424 0.001360576 TRUE
0.083333333 734.9999971 7.999999968 3.2E-08 TRUE
The table shows your two "problem" cases (the real wrong value, 256, and the one your code gets wrong, 368) followed by the first few "good" values.
If both 8820s vary at the same time, then obviously they will cancel out, and i will just be Tick*96.
The Error term is the difference between the calculated i and the nearest integer; if this is less than 0.01, then it is a "good" value.
NOTE: the 0.05 and 0.01 values were chosen somewhat arbitrarily (i.e., an inspired first guess based on the numbers): adjust if needed. Although I've only shown the first few rows, all 96 of the "good" values you gave show as TRUE.
The code (completely untested) would be something like:
double pulse = tick * 8820.0;
double i = pulse * 96.0 / 8820.0;
double error = abs( i - floor( i + 0.05 ) );
if (error < 0.05) {
    // is pulse
}
I assume you're initializing your pulses in a for loop, using int i as the loop variable; the problem is then this line:
BankerRound(8820 * i/96);
8820 * i / 96 is an all-integer operation and the result is an integer again, cutting off the remainder (so in effect it always rounds toward zero already), and BankerRound has nothing left to round. Try this instead:
BankerRound(8820 * i / 96.0);
The same problem applies if you try to calculate the previous and next pulses, as you actually subtract and add 0 (again, 1/8820 is all-integer and results in 0).
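A tiny self-contained demonstration of the pitfall:
#include <iostream>

int main() {
    int i = 1;
    std::cout << 8820 * i / 96   << "\n"; // 91     (all-integer, remainder cut off)
    std::cout << 8820 * i / 96.0 << "\n"; // 91.875 (double division, can be rounded)
    std::cout << 1 / 8820        << "\n"; // 0      (so prev/next tick offsets vanish)
    return 0;
}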
Edit:
From what I read in the comments, the 'system' is not – as I assumed previously – modifiable. It actually calculates ticks in the form n / 96.0, n ∈ [0, 96) in ℕ,
however with some kind of internal rounding that is apparently independent of the sample frequency, so there is some difference from the true value of n/96.0, and the ticks multiplied by 96 do not deliver exactly the integral values in [0, 96) (thanks KarstenKoop). And some of the delivered samples are simply invalid...
So the task is to detect whether tick * 96 is close enough to an integral value to be accepted as valid.
So we need to check:
double value = tick * 96.0;
bool isValid = value - floor(value) < threshold
            || ceil(value) - value < threshold;
with some appropriately defined threshold. Assuming the values really are calculated as
double tick = round(8820*i/96.0)/8820.0;
then the maximal deviation would be slightly greater than 0.00544 (see below for a more exact value); thresholds somewhere in the region of 0.006, 0.0055, 0.00545, ... might be a choice.
The rounding might be a matter of the number of bits used internally for the sensor value (if 13 bits are available, ticks might actually be calculated as floor(8192 * i / 96.0) / 8192.0, with 8192 being 1 << 13 and floor accounting for the integer division; just a guess...).
The exact value of the maximal deviation, using 8820 as the factor, as exactly as representable by a double, was:
0.00544217687075132516838493756949901580810546875
The multiplication by 96 is actually not necessary; you can compare the tick directly against the threshold divided by 96, which would be:
0.0000566893424036596371706764330156147480010986328125
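Putting it together, a minimal sketch of the whole test (isValidTick is a hypothetical helper name, and the default threshold is one of the choices discussed above):
#include <cmath>
#include <iostream>

// Accept a tick if tick * 96 is within `threshold` of an integer.
bool isValidTick(double tick, double threshold = 0.0055) {
    double value = tick * 96.0;
    double nearest = std::round(value);
    return std::fabs(value - nearest) < threshold;
}

int main() {
    std::cout << isValidTick(0.010430839) << "\n"; // 1: pulse 92
    std::cout << isValidTick(0.041723356) << "\n"; // 1: pulse 368 (the tricky case)
    std::cout << isValidTick(0.029024943) << "\n"; // 0: the fake tick (256)
    return 0;
}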

Simple π(x) in Haskell vs C++

I'm learning Haskell. My interest is in using it for personal computer experimentation. Right now, I'm trying to see how fast Haskell can get. Many claim parity with C(++), and if that is true, I would be very happy (I should note that I will be using Haskell whether or not it's fast, but fast is still a good thing).
My test program implements π(x) with a very simple algorithm: prime numbers add 1 to the result, and a number x is prime if it has no integer divisors between 1 and √x. This is not an algorithm battle; this is purely about compiler performance.
Haskell seems to be about 6x slower on my computer, which is fine (still 100x faster than pure Python), but that could just be because I'm a Haskell newbie.
Now, my questions: how, without changing the algorithm, can I optimize the Haskell implementation? Is Haskell really on performance parity with C?
Here is my Haskell code:
import System.Environment

-- a simple integer square root
isqrt :: Int -> Int
isqrt = floor . sqrt . fromIntegral

-- primality test
prime :: Int -> Bool
prime x = null [x | q <- [3, 5..isqrt x], rem x q == 0]

main = do
    n <- fmap (read . head) getArgs
    print $ length $ filter prime (2:[3, 5..n])
Here is my C++ code:
#include <iostream>
#include <cmath>
#include <cstdlib>
using namespace std;

bool isPrime(int);

int main(int argc, char* argv[]) {
    int primes = 10000, count = 0;
    if (argc > 1) {
        primes = atoi(argv[1]);
    }
    if (isPrime(2)) {
        count++;
    }
    for (int i = 3; i <= primes; i += 2) {
        if (isPrime(i)) {
            count++;
        }
    }
    cout << count << endl;
    return 0;
}

bool isPrime(int x) {
    for (int i = 2; i <= floor(sqrt(x)); i++) {
        if (x % i == 0) {
            return false;
        }
    }
    return true;
}
Your Haskell version constructs a lazy list in prime only to test whether it is null. This does indeed seem to be the bottleneck. The following version runs just as fast as the C++ version on my machine:
prime :: Int -> Bool
prime x = go 3
  where
    go q | q <= isqrt x = if rem x q == 0 then False else go (q+2)
    go _ = True
3.31 s when compiled with -O2 vs. 3.18 s for C++ with gcc 4.8 and -O3, for n = 5000000.
Of course, 'guessing' where the program is slow in order to optimize it is not a very good approach. Fortunately, Haskell has good profiling tools on board.
Compiling and running with
$ ghc --make primes.hs -O2 -prof -auto-all -fforce-recomp && ./primes 5000000 +RTS -p
gives
# primes.prof
Thu Feb 20 00:49 2014 Time and Allocation Profiling Report (Final)
primes +RTS -p -RTS 5000000
total time = 5.71 secs (5710 ticks @ 1000 us, 1 processor)
total alloc = 259,580,976 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
prime.go Main 96.4 0.0
main Main 2.0 84.6
isqrt Main 0.9 15.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 45 0 0.0 0.0 100.0 100.0
main Main 91 0 2.0 84.6 100.0 100.0
prime Main 92 2500000 0.7 0.0 98.0 15.4
prime.go Main 93 326103491 96.4 0.0 97.3 15.4
isqrt Main 95 0 0.9 15.4 0.9 15.4
--- >8 ---
which clearly shows that prime is where things get hot. For more information on profiling, I'll refer you to Real World Haskell, Chap 25.
To really understand what is going on, you can look at (one of) GHC's intermediate languages, Core, which shows you what the code looks like after optimization. Some good info is on the Haskell wiki. I would not recommend doing that unless necessary, but it is good to know the possibility exists.
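For example, a common invocation to dump the optimized Core (the -dsuppress-all flag just makes the dump more readable):
$ ghc -O2 -ddump-simpl -dsuppress-all primes.hs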
To your other questions:
1) How, without changing the algorithm, can I optimize the Haskell implementation?
Profile, and try to write inner loops so that they don't do any memory allocations and can be made strict by the compiler. Doing so can take some practice and experience.
2) Is Haskell really on performance parity with C?
That depends. GHC is amazing and can often optimize your program very well. If you know what you're doing you can usually get close to the performance of optimized C (100% - 200% of C's speed). That said, these optimizations are not always easy or pretty to the eye and high level Haskell can be slower. But don't forget that you're gaining amazing expressiveness and high level abstractions when using Haskell. It will usually be fast enough for all but the most performance critical applications and even then you can often get pretty close to C with some profiling and performance optimizations.
I don't think the Haskell versions (the original and the one improved in the first answer) are equivalent to the C++ version.
The reason is this: both Haskell versions only consider every second number as a candidate divisor (in the prime function), while the C++ version scans every one (only i++ in the isPrime() function).
When I fix this (changing i++ to i += 2 in the C++ isPrime() function), I get down to almost 1/3 of the runtime of the optimized Haskell version (2.1 s C++ vs. 6 s Haskell).
The output remains the same for both (of course).
Note that this is no specific optimization of the C++ version, just an adaptation of the trick already applied in the Haskell version.
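A sketch of that adaptation; note that the loop must then also start at 3 rather than 2 (stepping by 2 from 2 would test only even divisors and wrongly accept every odd number), which matches the Haskell version's [3, 5..isqrt x]:
#include <cmath>

bool isPrime(int x) {
    // only odd divisors, like the Haskell version; hoisting sqrt out of the
    // loop condition is a further common micro-optimization
    int limit = (int)std::floor(std::sqrt((double)x));
    for (int i = 3; i <= limit; i += 2) {
        if (x % i == 0) {
            return false;
        }
    }
    return true; // the caller only passes 2 and odd numbers
}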

Percent case statements

I was just writing an if statement to "spawn" objects in a map, and I was playing with percentages, but I'm not sure if I'm doing it right. This is what I have:
int chance = rng_.nextInt(0, 100);
if (chance <= 20) // 20%
{
    // Spawn a chest
}
else if ((chance > 20) && (chance <= 50)) // 30%
{
    // Spawn a monster
}
// Otherwise don't spawn anything
Is this a correct approach, or am I just wrong?
Edit: OK, I have fixed the code, and now I think the question is solved.
No, because (referring to the original, pre-edit code):
30% + 70% + 30% is more than 100%;
in that code the chance of "Spawn a monster" was 40%, not 70%.
Assuming rng_.nextInt generates a number between 0 and 99 inclusive, with a uniform distribution (any number between 0 and 99 is just as likely as any other), then 0-19 would be a chest (20 percent chance), 20-49 would be a monster (30 percent chance), and anything between 50 and 99 (50 percent chance) would spawn nothing. So the code would look like:
int chance = rng_.nextInt(0, 100);
if (chance < 20)
{
    // spawn a chest
}
else if (chance < 50)
{
    // spawn a monster
}
else
{
    // Do other items if required.
}
So 20 + 30 + 50 = 100, which equates to 0-99 (100 integer values) in your random number generation.
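For completeness, a self-contained sketch of the same idea using <random> (the engine choice and the 20/30/50 split are illustrative; your rng_ class is assumed to behave like the distribution below):
#include <iostream>
#include <random>

int main() {
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, 99); // 100 equally likely values
    int chance = dist(gen);
    if (chance < 20)
        std::cout << "spawn a chest (20%)\n";
    else if (chance < 50)
        std::cout << "spawn a monster (30%)\n";
    else
        std::cout << "spawn nothing (50%)\n";
    return 0;
}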