Write a program to get CPU cache sizes and levels - c++

I want to write a program to measure my cache sizes (L1, L2, L3). I know the general idea:
Allocate a big array.
Access a different-sized part of it each time.
So I wrote a little program. Here's my code:
#include <cstdio>
#include <time.h>
#include <sys/mman.h>

const int KB = 1024;
const int MB = 1024 * KB;
const int data_size = 32 * MB;
const int repeats = 64 * MB;
const int steps = 8 * MB;
const int times = 8;

long long clock_time() {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
}

int main() {
    // allocate memory and lock
    void* map = mmap(NULL, (size_t)data_size, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
    if (map == MAP_FAILED) {
        return 0;
    }
    int* data = (int*)map;

    // write all to avoid paging on demand
    for (int i = 0; i < data_size / sizeof(int); i++) {
        data[i]++;
    }

    int steps[] = { 1*KB, 4*KB, 8*KB, 16*KB, 24*KB, 32*KB, 64*KB, 128*KB,
                    128*KB*2, 128*KB*3, 512*KB, 1*MB, 2*MB, 3*MB, 4*MB,
                    5*MB, 6*MB, 7*MB, 8*MB, 9*MB };

    for (int i = 0; i <= sizeof(steps) / sizeof(int) - 1; i++) {
        double totalTime = 0;
        for (int k = 0; k < times; k++) {
            int size_mask = steps[i] / sizeof(int) - 1;
            long long start = clock_time();
            for (int j = 0; j < repeats; j++) {
                ++data[(j * 16) & size_mask];
            }
            long long end = clock_time();
            totalTime += (end - start) / 1000000000.0;
        }
        printf("%d time: %lf\n", steps[i] / KB, totalTime);
    }

    munmap(map, (size_t)data_size);
    return 0;
}
However, the result is so weird:
1 time: 1.989998
4 time: 1.992945
8 time: 1.997071
16 time: 1.993442
24 time: 1.994212
32 time: 2.002103
64 time: 1.959601
128 time: 1.957994
256 time: 1.975517
384 time: 1.975143
512 time: 2.209696
1024 time: 2.437783
2048 time: 7.006168
3072 time: 5.306975
4096 time: 5.943510
5120 time: 2.396078
6144 time: 4.404022
7168 time: 4.900366
8192 time: 8.998624
9216 time: 6.574195
My CPU is Intel(R) Core(TM) i3-2350M. L1 Cache: 32K (for data), L2 Cache 256K, L3 Cache 3072K.
The timings don't seem to follow any pattern; I can't read the cache sizes or cache levels from them.
Could anybody give some help? Thanks in advance.
Update:
Following Leeor's advice, I used j*64 instead of j*16. New results:
1 time: 1.996282
4 time: 2.002579
8 time: 2.002240
16 time: 1.993198
24 time: 1.995733
32 time: 2.000463
64 time: 1.968637
128 time: 1.956138
256 time: 1.978266
384 time: 1.991912
512 time: 2.192371
1024 time: 2.262387
2048 time: 3.019435
3072 time: 2.359423
4096 time: 5.874426
5120 time: 2.324901
6144 time: 4.135550
7168 time: 3.851972
8192 time: 7.417762
9216 time: 2.272929
10240 time: 3.441985
11264 time: 3.094753
Two peaks, 4096K and 8192K. Still weird.

I'm not sure if this is the only problem here, but it's definitely the biggest one - your code would very quickly trigger the HW stream prefetchers, making you almost always hit in L1 or L2 latencies.
More details can be found here - http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers
For your benchmark you should either disable them (through BIOS or any other means), or at least make your strides longer by replacing j*16 (* 4 bytes per int = 64 B, one cache line - a classic unit stride for the stream detector) with j*64 (4 cache lines). The reason: the prefetcher can issue 2 prefetches per stream request, so it runs ahead of your code when you do unit strides, may still get a bit ahead of you when your code jumps over 2 lines, but becomes mostly useless with longer jumps (3 isn't good because of your modulo - you need a divisor of the step size).
Update the questions with the new results and we can figure out if there's anything else here.
EDIT1:
Ok, I ran the fixed code and got -
1 time: 1.321001
4 time: 1.321998
8 time: 1.336288
16 time: 1.324994
24 time: 1.319742
32 time: 1.330685
64 time: 1.536644
128 time: 1.536933
256 time: 1.669329
384 time: 1.592145
512 time: 2.036315
1024 time: 2.214269
2048 time: 2.407584
3072 time: 2.259108
4096 time: 2.584872
5120 time: 2.203696
6144 time: 2.335194
7168 time: 2.322517
8192 time: 5.554941
9216 time: 2.230817
It makes much more sense if you ignore a few columns - you jump after 32k (the L1 size), but instead of jumping after 256k (the L2 size), we get a too-good result for 384 and jump only at 512k. The last jump is at 8M (my LLC size), but 9216k is off again.
This lets us spot the next error - ANDing with the size mask only makes sense when the size is a power of 2; otherwise you don't wrap around, but instead repeat some of the last addresses again (which ends up giving optimistic results since they're still fresh in the cache).
Try replacing the ... & size_mask with % (steps[i]/sizeof(int)); the modulo is more expensive, but if you want to have these sizes you need it (or alternatively, a running index that gets zeroed whenever it exceeds the current size).
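A minimal sketch (untested) of both alternatives; len here is just a shorthand I'm introducing for the number of ints in the current working set, and the j*64 stride follows the earlier advice:
int len = steps[i] / sizeof(int);  // ints in the current working set

// Variant 1: modulo - works for non-power-of-2 sizes, but costs a division per access.
for (int j = 0; j < repeats; j++) {
    ++data[(j * 64) % len];
}

// Variant 2: running index that wraps manually - no division in the loop.
for (int j = 0, idx = 0; j < repeats; j++) {
    ++data[idx];
    idx += 64;
    if (idx >= len) idx = 0;
}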

I think you'd be better off looking at the CPUID instruction. It's not trivial, but there should be information on the web.
Also, if you're on Windows, you can use the GetLogicalProcessorInformation function. Mind you, it's only present in Windows XP SP3 and above. I know nothing about Linux/Unix.
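For the Windows route, a rough sketch (untested) of what a GetLogicalProcessorInformation query looks like, filtering for the RelationCache entries:
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    DWORD len = 0;
    // First call fails with ERROR_INSUFFICIENT_BUFFER and reports the required size.
    GetLogicalProcessorInformation(nullptr, &len);
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &len)) return 1;

    for (const auto& entry : info) {
        if (entry.Relationship == RelationCache) {
            const CACHE_DESCRIPTOR& c = entry.Cache;
            printf("L%u cache, line size %u B, total size %lu KB\n",
                   (unsigned)c.Level, (unsigned)c.LineSize,
                   (unsigned long)(c.Size / 1024));
        }
    }
    return 0;
}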

If you're using GNU/Linux you can just read the contents of the file /proc/cpuinfo, and for further details /sys/devices/system/cpu/*. It is common under UNIX not to define an API where a plain file can do the job anyway.
I would also take a look at the source of util-linux; it contains a program named lscpu. This should give you an example of how to retrieve the required information.
// update
http://git.kernel.org/cgit/utils/util-linux/util-linux.git/tree/sys-utils/lscpu.c
I just took a look at the source. It basically reads from the files mentioned above, that's all. It is therefore absolutely valid to read from those files as well; they are provided by the kernel.
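For illustration, a small sketch (untested) that reads those sysfs cache directories directly; the index0, index1, ... layout and the level/type/size files are the standard sysfs names, but the number of index directories varies per CPU:
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Each cache of cpu0 is described by a directory index0, index1, ...
    for (int i = 0; ; ++i) {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(i) + "/";
        std::ifstream level(base + "level"), type(base + "type"), size(base + "size");
        if (!level) break;  // no more cache descriptions

        std::string lvl, ty, sz;
        std::getline(level, lvl);
        std::getline(type, ty);
        std::getline(size, sz);
        std::cout << "L" << lvl << " " << ty << ": " << sz << "\n";
    }
    return 0;
}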

Related

Explain this CPU Cache processor effect: Number of operations decreased exponentially but average time does not

For context, this question relates to the blog post on Cache Processor Effects, specifically Example 1-2.
In the code snippet below, I'm doubling the step size each time, i.e. the number of operations I'm performing decreases by a factor of 2 each time. From the blog post, I expect that for step sizes 1-16 the average time to complete the loop remains roughly the same. The main intuitions discussed by the author were: 1) the majority of the time is spent on memory access (i.e. we fetch, then multiply) rather than on the arithmetic itself, and 2) each access fetches an entire cache line (i.e. 64 bytes, or 16 ints).
I've tried to replicate the experiment on my local machine with the following code. Note that I allocate a new int array for every step size so that the runs do not take advantage of previously cached data. For a similar reason, I've also only run the inner for loop once for each step size (instead of repeating the experiment multiple times).
constexpr long long size = 64 * 1024 * 1024; // 64 MB
for (int step = 1; step <= 1 << 15; step <<= 1) {
    auto* arr = new int[size];
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < size; i += step) {
        arr[i] *= 3;
    }
    auto finish = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start);
    std::cout << step << " : " << elapsed.count() << "ms\n";
    // delete[] arr; (updated - see Paul's comment)
}
Result, however, was very different from what was described in the blog post.
Without optimization:
clang++ -g -std=c++2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp
1 : 222ms
2 : 176ms
4 : 152ms
8 : 140ms
16 : 135ms
32 : 128ms
64 : 130ms
128 : 125ms
256 : 123ms
512 : 118ms
1024 : 121ms
2048 : 62ms
4096 : 32ms
8192 : 16ms
16384 : 8ms
32768 : 4ms
With -O3 optimization
clang++ -g -std=c++2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp -O3
1 : 155ms
2 : 145ms
4 : 134ms
8 : 131ms
16 : 125ms
32 : 130ms
64 : 130ms
128 : 121ms
256 : 123ms
512 : 127ms
1024 : 123ms
2048 : 62ms
4096 : 31ms
8192 : 15ms
16384 : 8ms
32768 : 4ms
Note that I'm running on a MacBook Pro 2019 and my page size is 4096 bytes. From the observations above, it seems that up to a step size of 1024 the time taken stays roughly the same. Since each int is 4 bytes, this seems related to the size of a page (i.e. 1024*4 = 4096), which makes me think this might be some kind of prefetching/page-related optimization even when no optimization flags are specified?
Does anyone have any ideas or an explanation for why these numbers occur?
In your code, you called new int[size], which is essentially a wrapper around malloc. The kernel does not immediately allocate physical pages/memory for it, due to Linux's optimistic memory allocation strategy (see man malloc).
What happens when you execute arr[i] *= 3 is that a page fault occurs if the page is not yet mapped (there is no translation for it in the page tables / Translation Lookaside Buffer (TLB)). The kernel checks that your requested virtual page is valid, but an associated physical page has not been allocated yet, so it assigns a physical page to the requested virtual page.
For step = 1024, you are touching every page associated with arr. For step = 2048, you are touching every other page associated with arr.
This act of assigning physical pages is your bottleneck (based on your data, it takes ~120 ms to fault in 64 MB of pages one by one). When you increase step from 1024 to 2048, the kernel no longer needs to allocate a physical page for every virtual page associated with arr, hence the runtime halves.
As linked by @Daniel Langr, you'll need to "touch" each element of arr beforehand, or zero-initialize it with new int[size]{}. This forces the kernel to assign physical pages to each virtual page of arr before the timed loop runs.
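A minimal sketch (untested) of that fix applied to the loop from the question (size and step as in the snippet above) - value-initializing the array so every page is faulted in before the clock starts:
auto* arr = new int[size]{};  // value-initialization: pages are faulted in here,
                              // before the timed section starts
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < size; i += step) {
    arr[i] *= 3;              // now measures memory access, not page-fault handling
}
auto finish = std::chrono::high_resolution_clock::now();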

Would Google Benchmark library call testing blocks several times?

I have this minimal example of Google Benchmark usage.
The weird thing is that "42" is printed a number of times (4), not just once.
I understand that the library has to run things several times to gain statistics, but I thought that this was handled by the state loop itself.
This is a minimal example of something more complicated where I wanted to print (outside the loop) the result to verify that different implementations of the same function would give the same result.
#include <benchmark/benchmark.h>
#include <iostream>
#include <thread>  // sleep_for

int SomeFunction() {
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(10ms);
    return 42;
}

static void BM_SomeFunction(benchmark::State& state) {
    // Perform setup here
    int result = -1;
    for (auto _ : state) {
        // This code gets timed
        result = SomeFunction();
        benchmark::DoNotOptimize(result);
    }
    std::cout << result << std::endl;
}
// Register the function as a benchmark
BENCHMARK(BM_SomeFunction);
// Run the benchmark
BENCHMARK_MAIN();
Output (42 is printed 4 times - why more than once, and why 4?):
Running ./a.out
Run on (12 X 4600 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.30, 0.65, 0.79
42
42
42
42
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
BM_SomeFunction   10243011 ns        11051 ns         1000
How else could I test (at least visually) that different benchmarking blocks give the same answer?

CPLEX - Error in accessing solution C++

I have a problem accessing the solution of an LP problem.
This is the output of CPLEX after calling cplex.solve():
CPXPARAM_MIP_Strategy_CallbackReducedLP 0
Found incumbent of value 0.000000 after 0.00 sec. (0.70 ticks)
Tried aggregator 1 time.
MIP Presolve eliminated 570 rows and 3 columns.
MIP Presolve modified 88 coefficients.
Reduced MIP has 390 rows, 29291 columns, and 76482 nonzeros.
Reduced MIP has 29291 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.06 sec. (49.60 ticks)
Tried aggregator 1 time.
Reduced MIP has 390 rows, 29291 columns, and 76482 nonzeros.
Reduced MIP has 29291 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.04 sec. (31.47 ticks)
Probing time = 0.02 sec. (1.36 ticks)
MIP emphasis: balance optimality and feasibility.
MIP search method: dynamic search.
Parallel mode: deterministic, using up to 8 threads.
Root relaxation solution time = 0.03 sec. (17.59 ticks)
        Nodes                                         Cuts/
   Node  Left     Objective  IInf  Best Integer    Best Bound    ItCnt     Gap
*     0+    0                            0.0000     -395.1814            ---
*     0+    0                         -291.2283     -395.1814           35.69%
*     0     0      integral     0     -372.2283     -372.2283      201    0.00%
Elapsed time = 0.21 sec. (131.64 ticks, tree = 0.00 MB, solutions = 3)
Root node processing (before b&c):
Real time = 0.21 sec. (133.18 ticks)
Parallel b&c, 8 threads:
Real time = 0.00 sec. (0.00 ticks)
Sync time (average) = 0.00 sec.
Wait time (average) = 0.00 sec.
------------
Total (root+branch&cut) = 0.21 sec. (133.18 ticks)
However, when I call cplex.getValues(values, variables); the program aborts with SIGABRT, throwing the following exception:
libc++abi.dylib: terminating with uncaught exception of type IloAlgorithm::NotExtractedException
This is my code. What am I doing wrong?
std::vector<links_t> links(pointsA.size() * pointsB.size());
std::unordered_map<int, std::vector<std::size_t> > point2DToLinks;
for (std::size_t i = 0; i < pointsA.size(); ++i) {
    for (std::size_t j = 0; j < pointsB.size(); ++j) {
        std::size_t index = (i * pointsA.size()) + j;
        links[index].from = i;
        links[index].to = j;
        links[index].value = cv::norm(pointsA[i] - pointsB[j]);
        point2DToLinks[pointsA[i].point2D[0]->id].push_back(index);
        point2DToLinks[pointsA[i].point2D[1]->id].push_back(index);
        point2DToLinks[pointsA[i].point2D[2]->id].push_back(index);
        point2DToLinks[pointsB[j].point2D[0]->id].push_back(index);
        point2DToLinks[pointsB[j].point2D[1]->id].push_back(index);
        point2DToLinks[pointsB[j].point2D[2]->id].push_back(index);
    }
}

std::size_t size = links.size() + point2DToLinks.size();

IloEnv environment;
IloNumArray coefficients(environment, size);
for (std::size_t i = 0; i < links.size(); ++i) coefficients[i] = links[i].value;
for (std::size_t i = links.size(); i < size; ++i) coefficients[i] = -lambda;

IloNumVarArray variables(environment, size, 0, 1, IloNumVar::Bool);
IloObjective objective(environment, 0.0, IloObjective::Minimize);
objective.setLinearCoefs(variables, coefficients);

IloRangeArray constrains = IloRangeArray(environment);
std::size_t counter = 0;
for (auto point = point2DToLinks.begin(); point != point2DToLinks.end(); point++) {
    IloExpr expression(environment);
    const std::vector<std::size_t>& inLinks = point->second;
    for (std::size_t j = 0; j < inLinks.size(); j++) expression += variables[inLinks[j]];
    expression -= variables[links.size() + counter];
    constrains.add(IloRange(environment, 0, expression));
    expression.end();
    ++counter;
}

IloModel model(environment);
model.add(objective);
model.add(constrains);

IloCplex cplex(model);
cplex.solve();
if (cplex.getStatus() != IloAlgorithm::Optimal) {
    fprintf(stderr, "error: cplex terminate with an error.\n");
    abort();
}

IloNumArray values(environment, size);
cplex.getValues(values, variables);
for (std::size_t i = 0; i < links.size(); ++i)
    if (values[i] > 0) pairs.push_back(links[i]);

environment.end();
This is an error that happens if you ask CPLEX for the value of a variable that CPLEX does not have in its model. When you build the model, it is not enough to just declare and define a variable for it to be included in the model; it also has to be part of one of the constraints or the objective. Any variable that you declare/define that is NOT included in one of the constraints or the objective will therefore not be in the set of variables that gets extracted into the inner workings of CPLEX. There are two obvious things you can do to resolve this.
First, you can try to get the variable values inside a loop over the variables and test whether each one is actually in the CPLEX model - I think it is something like cplex.isExtracted(var). Do something simple like printing a message when you come across a variable that is not extracted, telling you which variable is causing the problem.
Secondly you can export the model from CPLEX as an LP format file and check it manually. This is a very useful way to see what is actually in your model rather than what you think is in your model.
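A rough sketch (untested) combining both suggestions, reusing the variables/links/pairs names from the question and assuming the usual Concert calls isExtracted, exportModel, and getValue:
// Dump the extracted model to a file so it can be inspected by hand.
cplex.exportModel("model.lp");

// Query values one variable at a time, skipping anything that was never extracted.
for (IloInt i = 0; i < variables.getSize(); ++i) {
    if (!cplex.isExtracted(variables[i])) {
        fprintf(stderr, "variable %ld was not extracted into the model\n", (long)i);
        continue;
    }
    IloNum v = cplex.getValue(variables[i]);
    if (i < (IloInt)links.size() && v > 0)
        pairs.push_back(links[i]);  // mirrors the post-processing in the question
}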

Run loop every millisecond with sleep_for

I want to run a loop inside a thread that calculates some data every millisecond. But I am having trouble with the sleep function. It is sleeping much too long.
I created a basic console application in visual studio:
#include <windows.h>
#include <tchar.h>
#include <iostream>
#include <chrono>
#include <thread>

using namespace std;

typedef std::chrono::high_resolution_clock Clock;

int _tmain(int argc, _TCHAR* argv[])
{
    int iIdx = 0;
    bool bRun = true;
    auto aTimeStart = Clock::now();
    while (bRun) {
        iIdx++;
        if (iIdx >= 500) bRun = false;
        //Sleep(1);
        this_thread::sleep_for(chrono::microseconds(10));
    }
    printf("Duration: %lld ms\n",
           (long long)chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - aTimeStart).count());
    cin.get();
    return 0;
}
This prints out: Duration: 5000 ms
The same result is printed when I use Sleep(1).
I would expect the duration to be 500 ms, not 5000 ms. What am I doing wrong here?
Update:
I was using Visual Studio 2013. Now I have installed Visual Studio 2015, and it's fine - it prints out: Duration: 500 ms (sometimes it's 527 ms).
However, this sleep_for still isn't very accurate, so I will look out for other solutions.
The typical time slice used by popular OSs is much longer than 1 ms (say 20 ms or so); the sleep sets a minimum for how long your thread will be suspended, not a maximum. Once your thread becomes runnable, it is up to the OS when to next schedule it.
If you need this level of accuracy you either need a real time OS, or set a very high priority on your thread (so it can pre-empt almost anything else), or write your code in the kernel, or use a busy wait.
But do you really need to do the calculation every ms? That sort of timing requirement normally comes from hardware. What goes wrong if you bunch up the calculations a bit later?
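For the busy-wait option, a minimal sketch (untested); it is accurate to within a few microseconds but keeps one core fully busy while waiting:
#include <chrono>

// Spin until the given deadline; trades CPU usage for precision.
void busy_wait_until(std::chrono::steady_clock::time_point deadline) {
    while (std::chrono::steady_clock::now() < deadline) {
        // intentionally empty: just keep polling the clock
    }
}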
On Windows, try timeBeginPeriod: https://msdn.microsoft.com/en-us/library/windows/desktop/dd757624(v=vs.85).aspx
It increases timer resolution.
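A short sketch (untested) of the usual pattern - raise the resolution around the timing-sensitive section and restore it afterwards; it needs winmm.lib and affects the whole system while active:
#include <windows.h>
#pragma comment(lib, "winmm.lib")  // timeBeginPeriod/timeEndPeriod live in winmm

int main() {
    timeBeginPeriod(1);   // request ~1 ms scheduler/timer granularity
    // ... run the loop that relies on Sleep(1) or sleep_for here ...
    timeEndPeriod(1);     // always pair with timeEndPeriod to restore the default
    return 0;
}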
What am I doing wrong here?
Attempting to use sleep for precise timing.
sleep(n) does not pause your thread for precisely n time and then immediately continue.
sleep(n) yields control of the thread back to the scheduler, and indicates that you do not want control back until at least n time has passed.
Now, the scheduler already divvies up thread processing time into time slices, and these are typically on the order of 25 milliseconds or so. That's the bare minimum you can expect your sleep to last.
sleep is simply the wrong tool for this job. Never use it for precise scheduling.
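If sleeping is still the approach, one common alternative (a sketch, untested) is to sleep until an absolute deadline that advances by a fixed 1 ms period, so scheduling delay doesn't accumulate across iterations:
#include <chrono>
#include <thread>

int main() {
    using namespace std::chrono;
    auto next = steady_clock::now();
    for (int i = 0; i < 500; ++i) {
        // ... per-millisecond work goes here ...
        next += milliseconds(1);
        std::this_thread::sleep_until(next);  // still limited by OS timer resolution
    }
    return 0;
}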
This thread is fairly old, but perhaps someone can still use this code.
It's written for C++11 and I've tested it on Ubuntu 15.04.
#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

class MillisecondPerLoop
{
public:
    void do_loop(uint32_t loops)
    {
        int32_t time_to_wait = 0;
        // Round the current time down to a whole millisecond boundary.
        next_clock = (get_current_clock_ns() / one_ms_in_ns) * one_ms_in_ns;
        for (uint32_t loop = 0; loop < loops; ++loop)
        {
            on_tick();
            // Assume on_tick takes less than 1 ms to run:
            // calculate the next tick time and the time to wait from now until then.
            time_to_wait = calc_time_to_wait();
            // Check whether we're already past the 1 ms time interval.
            if (time_to_wait > 0)
            {
                // wait that many ns
                std::this_thread::sleep_for(std::chrono::nanoseconds(time_to_wait));
            }
            ++m_tick;
        }
    }

private:
    void on_tick()
    {
        // TEST only: simulate the work done in every tick
        // by waiting a random amount of time.
        std::this_thread::sleep_for(std::chrono::microseconds(distribution(generator)));
    }

    uint32_t get_current_clock_ns()
    {
        // Truncation to 32 bits is harmless here: only differences of nearby
        // timestamps are ever used, and those survive the modulo-2^32 wrap.
        return static_cast<uint32_t>(std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count());
    }

    int32_t calc_time_to_wait()
    {
        next_clock += one_ms_in_ns;
        return static_cast<int32_t>(next_clock - get_current_clock_ns());
    }

    static constexpr uint32_t one_ms_in_ns = 1000000L;

    uint32_t m_tick = 0;
    uint32_t next_clock = 0;

    // NOTE: these two members were missing from the snippet as posted; the
    // 100-500 microsecond range of simulated work is an assumption.
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution{100, 500};
};
A typical run shows a pretty accurate 1 ms loop with a 1-3 microsecond error. Your PC may be more accurate than this if it has a faster CPU.
Here's typical output:
One Second Loops:
Avg (ns) ms err(ms)
[ 0] 999703 0.9997 0.0003
[ 1] 999888 0.9999 0.0001
[ 2] 999781 0.9998 0.0002
[ 3] 999896 0.9999 0.0001
[ 4] 999772 0.9998 0.0002
[ 5] 999759 0.9998 0.0002
[ 6] 999879 0.9999 0.0001
[ 7] 999915 0.9999 0.0001
[ 8] 1000043 1.0000 -0.0000
[ 9] 999675 0.9997 0.0003
[10] 1000120 1.0001 -0.0001
[11] 999606 0.9996 0.0004
[12] 999714 0.9997 0.0003
[13] 1000171 1.0002 -0.0002
[14] 999670 0.9997 0.0003
[15] 999832 0.9998 0.0002
[16] 999812 0.9998 0.0002
[17] 999868 0.9999 0.0001
[18] 1000096 1.0001 -0.0001
[19] 999665 0.9997 0.0003
Expected total time: 20.0000ms
Actual total time : 19.9969ms
I have a more detailed write up here:
https://arrizza.org/wiki/index.php/One_Millisecond_Loop

C++ unordered set out of memory

This program goes through every combination of 4 numbers from 1 - 400 and sees how many unique numbers can be made from the product.
I believe the unordered_set used to hold the already checked numbers is getting too large and thus it quits; task manager tells me it's at 1.5 GB.
Is there any way I can make this code run? I think maybe splitting up the set or somehow finding a more efficient formula.
Note: based on the comments, I would like to say again I am not storing all 25 billion numbers. I'm only storing ~100,000,000 numbers. The question has a RANGE of 400, but I'm looking for a comprehensive solution that can handle 500, or even 1000 without the memory problem.
#include <iostream>
#include <unordered_set>
using namespace std;

const int RANGE = 400;

int main() {
    unordered_set<long long> nums;
    for (long long a = 1; a <= RANGE; a++)
    {
        for (long long b = a; b <= RANGE; b++)
        {
            for (long long c = b; c <= RANGE; c++)
            {
                for (long long d = c; d <= RANGE; d++)
                {
                    unordered_set<long long>::const_iterator got = nums.find(a*b*c*d);
                    if (got == nums.end())
                    {
                        nums.insert(a*b*c*d);
                    }
                }
            }
        }
        cout << a << endl;
    }
    cout << nums.size() << endl;
    return 0;
}
You will need to compile the code with a 64-bit compiler to allow allocating more than around 2 GB of memory. You will also need at least 4 GB of RAM on the machine you are running this on, or at best it will take almost forever to finish.
A "bitset", using a single bit per entry, would take up about 3 GB of memory.
Using the code as it stands, it uses around 4 GB of memory on a Linux machine with a 64-bit g++ compiler, and it takes around 220 seconds to complete, giving the answer:
86102802
1152921504606846975
According to /usr/bin/time -v:
Command being timed: "./a.out"
User time (seconds): 219.15
System time (seconds): 2.01
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:42.53
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4069336
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 679924
Voluntary context switches: 1
Involuntary context switches: 23250
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
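For reference, a sketch (untested) of the one-bit-per-product idea mentioned above, here using std::vector<bool> as the bit container; it still needs a 64-bit build and roughly 3 GB of RAM, but avoids the hash-table overhead:
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const int64_t RANGE = 400;
    const int64_t max_product = RANGE * RANGE * RANGE * RANGE;  // 25.6 billion
    std::vector<bool> seen(max_product + 1, false);             // ~3.2 GB, one bit per value

    for (int64_t a = 1; a <= RANGE; ++a)
        for (int64_t b = a; b <= RANGE; ++b)
            for (int64_t c = b; c <= RANGE; ++c)
                for (int64_t d = c; d <= RANGE; ++d)
                    seen[a * b * c * d] = true;

    // Count the distinct products that were marked.
    int64_t count = 0;
    for (int64_t i = 1; i <= max_product; ++i)
        if (seen[i]) ++count;
    std::cout << count << '\n';
    return 0;
}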
Your solution works by taking every unique combination of four numbers between 1 and 400, multiplying them, storing the results, and then counting them. At a minimum, this would take 400*400*400*400 bits ≈ 3 GB, which is apparently more than your hardware/compiler/OS can handle (probably the compiler, which is easy to fix).
So what if we try to solve the problem one step at a time? Can we count how many sets of these numbers have products between 1000 and 2000?
for (int a = 1; a <= 400; ++a)
{
    int bmin = max(1000 / a / 400 / 400, a); // a*b*400*400 is at least 1000
    int bmax = min(2000 / a, 400);           // a*b*1*1 is at most 2000
    for (int b = bmin; b <= bmax; b++)
    {
        int cmin = max(1000 / a / b / 400, b); // a*b*c*400 is at least 1000
        int cmax = min(2000 / a / b, 400);     // a*b*c*1 is at most 2000
        for (int c = cmin; c <= cmax; c++)
        {
            int dmin = max(1000 / a / b / c, c);   // a*b*c*d is at least 1000
            int dmax = min(2000 / a / b / c, 400); // a*b*c*d is at most 2000
            for (int d = dmin; d <= dmax; d++)     // this will usually be zero, one, or two numbers
            {
                int res = a * b * c * d;
                if (res >= 1000 && res < 2000)     // a rare few WILL be outside this range
                    ;                              // YES - count/mark this product for the current batch
            }
        }
    }
}
We can simply count how many products between 0-1000 are accessible, then 1000-2000, then 2000-3000, etc, up to 400*400*400*400. This is a significantly slower algorithm, but since it takes very little memory, the hope is that the increased cache coherence will make up for some of the difference.
In fact, speaking of very little memory: since the target numbers in each batch always fall in a sequential range of 1000, you can use a bool nums[1000] = {} instead of an unordered_set, which should give a significant performance boost.
My full code is here: http://coliru.stacked-crooked.com/a/bc1739e972cb40f0, and I have confirmed my code has the same results as yours. After fixing several bugs, your algorithm still vastly outperforms mine for small RANGE with MSVC2013. (For anyone else testing this with MSVC, be sure to test with the debugger NOT attached, it makes a HUGE difference in the timing of the original code)
origional( 4) found 25 in 0s
duck( 4) found 25 in 0s
origional( 6) found 75 in 0s
duck( 6) found 75 in 0s
origional( 9) found 225 in 0s
duck( 9) found 225 in 0s
origional( 13) found 770 in 0s
duck( 13) found 770 in 0s
origional( 17) found 1626 in 0.001s
duck( 17) found 1626 in 0s
origional( 25) found 5135 in 0.004s
duck( 25) found 5135 in 0.002s
origional( 35) found 14345 in 0.011s
duck( 35) found 14345 in 0.015s
origional( 50) found 49076 in 0.042s
duck( 50) found 49075 in 0.076s
origional( 71) found 168909 in 0.178s
duck( 71) found 168909 in 0.738s
origional( 100) found 520841 in 0.839s
duck( 100) found 520840 in 7.206s
origional( 141) found 1889918 in 5.072s
duck( 141) found 1889918 in 76.028s
When I studied the issue, what finally occurred to me is that my algorithm requires a large number of 64-bit divisions, which seem to be slow even on 64-bit operating systems.