CPLEX - Error in accessing solution C++

I have a problem accessing the solution of an LP problem.
This is the output of CPLEX after calling cplex.solve();
CPXPARAM_MIP_Strategy_CallbackReducedLP 0
Found incumbent of value 0.000000 after 0.00 sec. (0.70 ticks)
Tried aggregator 1 time.
MIP Presolve eliminated 570 rows and 3 columns.
MIP Presolve modified 88 coefficients.
Reduced MIP has 390 rows, 29291 columns, and 76482 nonzeros.
Reduced MIP has 29291 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.06 sec. (49.60 ticks)
Tried aggregator 1 time.
Reduced MIP has 390 rows, 29291 columns, and 76482 nonzeros.
Reduced MIP has 29291 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.04 sec. (31.47 ticks)
Probing time = 0.02 sec. (1.36 ticks)
MIP emphasis: balance optimality and feasibility.
MIP search method: dynamic search.
Parallel mode: deterministic, using up to 8 threads.
Root relaxation solution time = 0.03 sec. (17.59 ticks)
Nodes Cuts/
Node Left Objective IInf Best Integer Best Bound ItCnt Gap
* 0+ 0 0.0000 -395.1814 ---
* 0+ 0 -291.2283 -395.1814 35.69%
* 0 0 integral 0 -372.2283 -372.2283 201 0.00%
Elapsed time = 0.21 sec. (131.64 ticks, tree = 0.00 MB, solutions = 3)
Root node processing (before b&c):
Real time = 0.21 sec. (133.18 ticks)
Parallel b&c, 8 threads:
Real time = 0.00 sec. (0.00 ticks)
Sync time (average) = 0.00 sec.
Wait time (average) = 0.00 sec.
------------
Total (root+branch&cut) = 0.21 sec. (133.18 ticks)
However, when I call cplex.getValues(values, variables); the program receives a SIGABRT signal, throwing the following exception:
libc++abi.dylib: terminating with uncaught exception of type IloAlgorithm::NotExtractedException
This is my code. What am I doing wrong?
std::vector<links_t> links(pointsA.size()*pointsB.size());
std::unordered_map<int, std::vector<std::size_t> > point2DToLinks;
for(std::size_t i=0; i<pointsA.size(); ++i){
    for(std::size_t j=0; j<pointsB.size(); ++j){
        std::size_t index = (i*pointsA.size()) + j;
        links[index].from = i;
        links[index].to = j;
        links[index].value = cv::norm(pointsA[i] - pointsB[j]);
        point2DToLinks[pointsA[i].point2D[0]->id].push_back(index);
        point2DToLinks[pointsA[i].point2D[1]->id].push_back(index);
        point2DToLinks[pointsA[i].point2D[2]->id].push_back(index);
        point2DToLinks[pointsB[j].point2D[0]->id].push_back(index);
        point2DToLinks[pointsB[j].point2D[1]->id].push_back(index);
        point2DToLinks[pointsB[j].point2D[2]->id].push_back(index);
    }
}
std::size_t size = links.size() + point2DToLinks.size();
IloEnv environment;
IloNumArray coefficients(environment, size);
for(std::size_t i=0; i<links.size(); ++i) coefficients[i] = links[i].value;
for(std::size_t i=links.size(); i<size; ++i) coefficients[i] = -lambda;
IloNumVarArray variables(environment, size, 0, 1, IloNumVar::Bool);
IloObjective objective(environment, 0.0, IloObjective::Minimize);
objective.setLinearCoefs(variables, coefficients);
IloRangeArray constrains = IloRangeArray(environment);
std::size_t counter = 0;
for(auto point=point2DToLinks.begin(); point!=point2DToLinks.end(); point++){
    IloExpr expression(environment);
    const std::vector<std::size_t> & inLinks = point->second;
    for(std::size_t j=0; j<inLinks.size(); j++) expression += variables[inLinks[j]];
    expression -= variables[links.size() + counter];
    constrains.add(IloRange(environment, 0, expression));
    expression.end();
    ++counter;
}
IloModel model(environment);
model.add(objective);
model.add(constrains);
IloCplex cplex(model);
cplex.solve();
if(cplex.getStatus() != IloAlgorithm::Optimal){
    fprintf(stderr, "error: cplex terminated with an error.\n");
    abort();
}
IloNumArray values(environment, size);
cplex.getValues(values, variables);
for(std::size_t i=0; i<links.size(); ++i)
    if(values[i] > 0) pairs.push_back(links[i]);
environment.end();

This is an error that happens if you ask CPLEX for the value of a variable that CPLEX does not have in its model. When you build the model, it is not enough to just declare and define the variable for it to be included in the model. It also has to be part of one of the constraints or the objective in the model. Any variable that you declare/define that is NOT included in one of the constraints or the objective will therefore not be in the set of variables that gets extracted into the inner workings of CPLEX. There are two obvious things that you can do to resolve this.
First, you can try to get the variable values inside a loop over the variables, testing whether each is actually in the CPLEX model - I think it is something like cplex.isExtracted(var). Do something simple, like printing a message when you come across a variable that is not extracted, telling you which variable is causing the problem.
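A minimal sketch of that diagnostic loop, assuming the variables array from the question (isExtracted() is inherited by IloCplex from IloAlgorithm; verify the exact spelling against your Concert version):
for (IloInt i = 0; i < variables.getSize(); ++i) {
    // Flag any variable that never made it into the extracted model.
    if (!cplex.isExtracted(variables[i]))
        std::cerr << "variable " << i << " was not extracted" << std::endl;
}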
Secondly, you can export the model from CPLEX as an LP format file and check it manually. This is a very useful way to see what is actually in your model rather than what you think is in your model.
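Exporting is a one-liner; the .lp extension selects the LP format (the file name here is just an example):
cplex.exportModel("model.lp"); // then inspect model.lp in any text editor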

Related

How to change the increment or step or scale of a Google Benchmark 'Range()' function

I have a very simple Google Benchmark program that benchmarks a function taking two integer arguments. I'm trying to use the benchmark to see exactly how the time the function takes increases as the second argument's value grows from 1 to 100, with the first argument staying at the same value of 999999.
The way to achieve this that looked the most logical to me was to use Google Benchmark's Ranges() function like this:
// registering function 'largestDivisorOdd' as a benchmark
BENCHMARK(largestDivisorOdd) ->Ranges( { {999999, 999999}, {1,100} } );
The resulting output was this:
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
largestDivisorOdd/999999/1 116 ns 116 ns 6018141
largestDivisorOdd/999999/8 205 ns 205 ns 3485489
largestDivisorOdd/999999/64 2715 ns 2715 ns 260611
largestDivisorOdd/999999/100 2710 ns 2710 ns 256160
My problem is that Google Benchmark seems to have built the range from exponentially increasing values between 1 and 100, resulting in only 4 benchmarks with 4 different values for the second parameter, instead of the 100 benchmarks with 100 values from 1 to 100 that I expected and wanted.
One of the following should work:
BENCHMARK(largestDivisorOdd)
    ->ArgsProduct({
        benchmark::CreateRange(999999, 999999, /*multi=*/2), // This is probably not what you want
        benchmark::CreateDenseRange(1, 100, /*step=*/1)      // This creates a DenseRange from 1 to 100
    });
Or create your own custom arguments:
static void CustomArguments(benchmark::internal::Benchmark* b) {
    for (int i = 999999; i <= 999999; ++i)
        for (int j = 1; j <= 100; j++)
            b->Args({i, j});
}
BENCHMARK(largestDivisorOdd)->Apply(CustomArguments);
Both examples are adapted from the User Guide
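For context, here is a hedged sketch of what the registered function itself might look like; the question never shows largestDivisorOdd, so the body below is a placeholder. With the usual Google Benchmark signature, the two registered arguments arrive through state.range(0) and state.range(1):
#include <benchmark/benchmark.h>

static void largestDivisorOdd(benchmark::State& state) {
    const int64_t first = state.range(0);   // stays at 999999
    const int64_t second = state.range(1);  // swept densely from 1 to 100
    for (auto _ : state) {
        // Placeholder for the real work; DoNotOptimize keeps it from being elided.
        auto result = first % second;
        benchmark::DoNotOptimize(result);
    }
}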

How is myArray.size() implemented in Salesforce Apex? Does the method retrieve a value stored on the List.class object, or is it calculated at the time of the call?

Often when I'm writing a loop in Apex, I wonder if it's inefficient to call myArray.size() inside the loop. My question is really: should I use myArray.size() inside the loop, or store myArray.size() in an Integer variable before starting the loop and reuse that instead (assuming the size of myArray remains constant)?
Under the hood, the method either counts the elements of the array by iterating through it and incrementing a size variable which it then returns, or the List/Set/Map etc. class stores the size variable internally and updates the value any time the array is changed. Different languages handle this in different ways, so how does Apex work?
I went searching but couldn't find an answer. The answer could change the way I code, since calling myArray.size() inside a loop could drastically increase the number of operations performed.
I tried running a benchmark of the two scenarios as laid out below and found that myArray.size() does take longer, but that didn't really answer my question, especially since the code was too simple to really make much of a difference.
In both samples, the list has ten thousand elements, and I only started timing once the list was fully created. The loop itself doesn't do anything interesting; it just counts up. That was the least resource-heavy thing I could think of.
Sample 1 - store size of list in an integer before entering loop:
List<Account> myArray = [SELECT Id FROM Account LIMIT 10000];
Long startTime = System.now().getTime();
Integer size = myArray.size();
Integer count = 0;
for (Integer i = 0; i < 100000; i++) {
    if (count < size) {
        count++;
    }
}
Long finishTime = System.now().getTime();
Long benchmark = finishTime - startTime;
System.debug('benchmark: ' + benchmark);
Results after running 5 times: 497, 474, 561, 445, 474
Sample 2 - use myArray.size() inside loop:
List<Account> myArray = [SELECT Id FROM Account LIMIT 10000];
Long startTime = System.now().getTime();
Integer count = 0;
for (Integer i = 0; i < 100000; i++) {
    if (count < myArray.size()) {
        count++;
    }
}
Long finishTime = System.now().getTime();
Long benchmark = finishTime - startTime;
System.debug('benchmark: ' + benchmark);
Results after running 5 times: 582, 590, 667, 742, 730
Sample 3 - just for good measure (control), here's the loop without the if condition:
Long startTime = System.now().getTime();
Long count = 0;
for (Integer i = 0; i < 100000; i++) {
    count++;
}
Long finishTime = System.now().getTime();
Long benchmark = finishTime - startTime;
System.debug('benchmark: ' + benchmark);
Results after running 5 times: 349, 348, 486, 475, 531

Why is grouped summation slower with sorted groups than unsorted groups?

I have 2 columns of tab-delimited integers, the first of which is a random integer and the second an integer identifying the group; they can be generated by this program (generate_groups.cc).
#include <cstdlib>
#include <iostream>
#include <ctime>
int main(int argc, char* argv[]) {
    int num_values = atoi(argv[1]);
    int num_groups = atoi(argv[2]);
    int group_size = num_values / num_groups;
    int group = -1;
    std::srand(42);
    for (int i = 0; i < num_values; ++i) {
        if (i % group_size == 0) {
            ++group;
        }
        std::cout << std::rand() << '\t' << group << '\n';
    }
    return 0;
}
I then use a second program (sum_groups.cc) to calculate the sums per group.
#include <iostream>
#include <chrono>
#include <vector>
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
    for (size_t i = 0; i < n; ++i) {
        p_out[p_g[i]] += p_x[i];
    }
}
int main() {
    std::vector<int> values;
    std::vector<int> groups;
    std::vector<int> sums;
    int n_groups = 0;
    // Read in the values and calculate the max number of groups
    while(std::cin) {
        int value, group;
        std::cin >> value >> group;
        values.push_back(value);
        groups.push_back(group);
        if (group > n_groups) {
            n_groups = group;
        }
    }
    sums.resize(n_groups);
    // Time grouped sums
    std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
    for (int i = 0; i < 10; ++i) {
        grouped_sum(values.data(), groups.data(), values.size(), sums.data());
    }
    std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
    std::cout << (end - start).count() << std::endl;
    return 0;
}
If I run these programs on a dataset of a given size and then shuffle the order of the rows of the same dataset, the shuffled data computes the sums ~2x or more faster than the ordered data.
g++ -O3 generate_groups.cc -o generate_groups
g++ -O3 sum_groups.cc -o sum_groups
generate_groups 1000000 100 > groups
shuf groups > groups2
sum_groups < groups
sum_groups < groups2
sum_groups < groups2
sum_groups < groups
20784
8854
8220
21006
I would have expected the original data, which is sorted by group, to have better data locality and be faster, but I observe the opposite behavior. Can anyone hypothesize the reason?
Setup / make it slow
First of all, the program runs in about the same time regardless:
sumspeed$ time ./sum_groups < groups_shuffled
11558358
real 0m0.705s
user 0m0.692s
sys 0m0.013s
sumspeed$ time ./sum_groups < groups_sorted
24986825
real 0m0.722s
user 0m0.711s
sys 0m0.012s
Most of the time is spent in the input loop. But since we're interested in the grouped_sum(), let's ignore that.
After changing the benchmark loop from 10 to 1000 iterations, grouped_sum() starts to dominate the run time:
sumspeed$ time ./sum_groups < groups_shuffled
1131838420
real 0m1.828s
user 0m1.811s
sys 0m0.016s
sumspeed$ time ./sum_groups < groups_sorted
2494032110
real 0m3.189s
user 0m3.169s
sys 0m0.016s
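For reference, the only change to the question's code was the iteration count of the timing loop:
for (int i = 0; i < 1000; ++i) {  // was 10 in the question
    grouped_sum(values.data(), groups.data(), values.size(), sums.data());
}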
perf diff
Now we can use perf to find the hottest spots in our program.
sumspeed$ perf record ./sum_groups < groups_shuffled
1166805982
[ perf record: Woken up 1 times to write data ]
[kernel.kallsyms] with build id 3a2171019937a2070663f3b6419330223bd64e96 not found, continuing without symbols
Warning:
Processed 4636 samples and lost 6.95% samples!
[ perf record: Captured and wrote 0.176 MB perf.data (4314 samples) ]
sumspeed$ perf record ./sum_groups < groups_sorted
2571547832
[ perf record: Woken up 2 times to write data ]
[kernel.kallsyms] with build id 3a2171019937a2070663f3b6419330223bd64e96 not found, continuing without symbols
[ perf record: Captured and wrote 0.420 MB perf.data (10775 samples) ]
And the difference between them:
sumspeed$ perf diff
[...]
# Event 'cycles:uppp'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ................... ........................................................................
#
57.99% +26.33% sum_groups [.] main
12.10% -7.41% libc-2.23.so [.] _IO_getc
9.82% -6.40% libstdc++.so.6.0.21 [.] std::num_get<char, std::istreambuf_iterator<char, std::char_traits<c
6.45% -4.00% libc-2.23.so [.] _IO_ungetc
2.40% -1.32% libc-2.23.so [.] _IO_sputbackc
1.65% -1.21% libstdc++.so.6.0.21 [.] 0x00000000000dc4a4
1.57% -1.20% libc-2.23.so [.] _IO_fflush
1.71% -1.07% libstdc++.so.6.0.21 [.] std::istream::sentry::sentry
1.22% -0.77% libstdc++.so.6.0.21 [.] std::istream::operator>>
0.79% -0.47% libstdc++.so.6.0.21 [.] __gnu_cxx::stdio_sync_filebuf<char, std::char_traits<char> >::uflow
[...]
More time in main(), which probably has grouped_sum() inlined. Great, thanks a lot, perf.
perf annotate
Is there a difference in where the time is spent inside main()?
Shuffled:
sumspeed$ perf annotate -i perf.data.old
[...]
│ // This is the function whose performance I am interested in
│ void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
│ for (size_t i = 0; i < n; ++i) {
│180: xor %eax,%eax
│ test %rdi,%rdi
│ ↓ je 1a4
│ nop
│ p_out[p_g[i]] += p_x[i];
6,88 │190: movslq (%r9,%rax,4),%rdx
58,54 │ mov (%r8,%rax,4),%esi
│ #include <chrono>
│ #include <vector>
│
│ // This is the function whose performance I am interested in
│ void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
│ for (size_t i = 0; i < n; ++i) {
3,86 │ add $0x1,%rax
│ p_out[p_g[i]] += p_x[i];
29,61 │ add %esi,(%rcx,%rdx,4)
[...]
Sorted:
sumspeed$ perf annotate -i perf.data
[...]
│ // This is the function whose performance I am interested in
│ void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
│ for (size_t i = 0; i < n; ++i) {
│180: xor %eax,%eax
│ test %rdi,%rdi
│ ↓ je 1a4
│ nop
│ p_out[p_g[i]] += p_x[i];
1,00 │190: movslq (%r9,%rax,4),%rdx
55,12 │ mov (%r8,%rax,4),%esi
│ #include <chrono>
│ #include <vector>
│
│ // This is the function whose performance I am interested in
│ void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
│ for (size_t i = 0; i < n; ++i) {
0,07 │ add $0x1,%rax
│ p_out[p_g[i]] += p_x[i];
43,28 │ add %esi,(%rcx,%rdx,4)
[...]
No, it's the same two instructions dominating. So they take a long time in both cases, but are even worse when the data is sorted.
perf stat
Okay. But we should be running them the same number of times, so each instruction must be getting slower for some reason. Let's see what perf stat says.
sumspeed$ perf stat ./sum_groups < groups_shuffled
1138880176
Performance counter stats for './sum_groups':
1826,232278 task-clock (msec) # 0,999 CPUs utilized
72 context-switches # 0,039 K/sec
1 cpu-migrations # 0,001 K/sec
4 076 page-faults # 0,002 M/sec
5 403 949 695 cycles # 2,959 GHz
930 473 671 stalled-cycles-frontend # 17,22% frontend cycles idle
9 827 685 690 instructions # 1,82 insn per cycle
# 0,09 stalled cycles per insn
2 086 725 079 branches # 1142,639 M/sec
2 069 655 branch-misses # 0,10% of all branches
1,828334373 seconds time elapsed
sumspeed$ perf stat ./sum_groups < groups_sorted
2496546045
Performance counter stats for './sum_groups':
3186,100661 task-clock (msec) # 1,000 CPUs utilized
5 context-switches # 0,002 K/sec
0 cpu-migrations # 0,000 K/sec
4 079 page-faults # 0,001 M/sec
9 424 565 623 cycles # 2,958 GHz
4 955 937 177 stalled-cycles-frontend # 52,59% frontend cycles idle
9 829 009 511 instructions # 1,04 insn per cycle
# 0,50 stalled cycles per insn
2 086 942 109 branches # 655,014 M/sec
2 078 204 branch-misses # 0,10% of all branches
3,186768174 seconds time elapsed
Only one thing stands out: stalled-cycles-frontend.
Okay, the instruction pipeline is stalling. In the frontend. Exactly what that means probably varies between microarchitectures.
I have a guess, though. If you're generous, you might even call it a hypothesis.
Hypothesis
By sorting the input, you're increasing locality of the writes. In fact, they will be very local; almost all additions you do will write to the same location as the previous one.
That's great for the cache, but not great for the pipeline. You're introducing data dependencies, preventing the next addition instruction from proceeding until the previous addition has completed (or has otherwise made the result available to succeeding instructions).
That's your problem.
I think.
Fixing it
Multiple sum vectors
Actually, let's try something. What if we used multiple sum vectors, switching between them for each addition, and then summed those up at the end? It costs us a little locality, but should remove the data dependencies.
(the code is not pretty; don't judge me, internet!!)
#include <iostream>
#include <chrono>
#include <vector>
#ifndef NSUMS
#define NSUMS (4) // must be power of 2 (for masking to work)
#endif
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
    for (size_t i = 0; i < n; ++i) {
        p_out[i & (NSUMS-1)][p_g[i]] += p_x[i];
    }
}
int main() {
    std::vector<int> values;
    std::vector<int> groups;
    std::vector<int> sums[NSUMS];
    int n_groups = 0;
    // Read in the values and calculate the max number of groups
    while(std::cin) {
        int value, group;
        std::cin >> value >> group;
        values.push_back(value);
        groups.push_back(group);
        if (group >= n_groups) {
            n_groups = group+1;
        }
    }
    for (int i=0; i<NSUMS; ++i) {
        sums[i].resize(n_groups);
    }
    // Time grouped sums
    std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
    int* sumdata[NSUMS];
    for (int i = 0; i < NSUMS; ++i) {
        sumdata[i] = sums[i].data();
    }
    for (int i = 0; i < 1000; ++i) {
        grouped_sum(values.data(), groups.data(), values.size(), sumdata);
    }
    for (int i = 1; i < NSUMS; ++i) {
        for (int j = 0; j < n_groups; ++j) {
            sumdata[0][j] += sumdata[i][j];
        }
    }
    std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
    std::cout << (end - start).count() << " with NSUMS=" << NSUMS << std::endl;
    return 0;
}
(oh, and I also fixed the n_groups calculation; it was off by one.)
Results
After configuring my makefile to give an -DNSUMS=... arg to the compiler, I could do this:
sumspeed$ for n in 1 2 4 8 128; do make -s clean && make -s NSUMS=$n && (perf stat ./sum_groups < groups_shuffled && perf stat ./sum_groups < groups_sorted) 2>&1 | egrep '^[0-9]|frontend'; done
1134557008 with NSUMS=1
924 611 882 stalled-cycles-frontend # 17,13% frontend cycles idle
2513696351 with NSUMS=1
4 998 203 130 stalled-cycles-frontend # 52,79% frontend cycles idle
1116188582 with NSUMS=2
899 339 154 stalled-cycles-frontend # 16,83% frontend cycles idle
1365673326 with NSUMS=2
1 845 914 269 stalled-cycles-frontend # 29,97% frontend cycles idle
1127172852 with NSUMS=4
902 964 410 stalled-cycles-frontend # 16,79% frontend cycles idle
1171849032 with NSUMS=4
1 007 807 580 stalled-cycles-frontend # 18,29% frontend cycles idle
1118732934 with NSUMS=8
881 371 176 stalled-cycles-frontend # 16,46% frontend cycles idle
1129842892 with NSUMS=8
905 473 182 stalled-cycles-frontend # 16,80% frontend cycles idle
1497803734 with NSUMS=128
1 982 652 954 stalled-cycles-frontend # 30,63% frontend cycles idle
1180742299 with NSUMS=128
1 075 507 514 stalled-cycles-frontend # 19,39% frontend cycles idle
The optimal number of sum vectors will probably depend on your CPU's pipeline depth. My 7-year-old ultrabook CPU can probably max out the pipeline with fewer vectors than a new fancy desktop CPU would need.
Clearly, more is not necessarily better; when I went crazy with 128 sum vectors, we started suffering more from cache misses -- as evidenced by the shuffled input becoming slower than sorted, like you had originally expected. We've come full circle! :)
Per-group sum in register
(this was added in an edit)
Agh, nerd sniped! If you know your input will be sorted and are looking for even more performance, the following rewrite of the function (with no extra sum arrays) is even faster, at least on my computer.
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
    int i = n-1;
    while (i >= 0) {
        int g = p_g[i];
        int gsum = 0;
        do {
            gsum += p_x[i--];
        } while (i >= 0 && p_g[i] == g);
        p_out[g] += gsum;
    }
}
The trick in this one is that it allows the compiler to keep the gsum variable, the sum of the group, in a register. I'm guessing (but may be very wrong) that this is faster because the feedback loop in the pipeline can be shorter here, and/or because there are fewer memory accesses. A good branch predictor will make the extra check for group equality cheap.
Results
It's terrible for shuffled input...
sumspeed$ time ./sum_groups < groups_shuffled
2236354315
real 0m2.932s
user 0m2.923s
sys 0m0.009s
...but is around 40% faster than my "many sums" solution for sorted input.
sumspeed$ time ./sum_groups < groups_sorted
809694018
real 0m1.501s
user 0m1.496s
sys 0m0.005s
Lots of small groups will be slower than a few big ones, so whether or not this is the faster implementation will really depend on your data here. And, as always, on your CPU model.
Multiple sums vectors, with offset instead of bit masking
Sopel suggested four unrolled additions as an alternative to my bit masking approach. I've implemented a generalized version of their suggestion, which can handle different NSUMS. I'm counting on the compiler unrolling the inner loop for us (which it did, at least for NSUMS=4).
#include <iostream>
#include <chrono>
#include <vector>
#ifndef NSUMS
#define NSUMS (4) // must be power of 2 (for masking to work)
#endif
#ifndef INNER
#define INNER (0)
#endif
#if INNER
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
    size_t i = 0;
    int quadend = n & ~(NSUMS-1);
    for (; i < quadend; i += NSUMS) {
        for (int k=0; k<NSUMS; ++k) {
            p_out[k][p_g[i+k]] += p_x[i+k];
        }
    }
    for (; i < n; ++i) {
        p_out[0][p_g[i]] += p_x[i];
    }
}
#else
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
    for (size_t i = 0; i < n; ++i) {
        p_out[i & (NSUMS-1)][p_g[i]] += p_x[i];
    }
}
#endif
int main() {
    std::vector<int> values;
    std::vector<int> groups;
    std::vector<int> sums[NSUMS];
    int n_groups = 0;
    // Read in the values and calculate the max number of groups
    while(std::cin) {
        int value, group;
        std::cin >> value >> group;
        values.push_back(value);
        groups.push_back(group);
        if (group >= n_groups) {
            n_groups = group+1;
        }
    }
    for (int i=0; i<NSUMS; ++i) {
        sums[i].resize(n_groups);
    }
    // Time grouped sums
    std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
    int* sumdata[NSUMS];
    for (int i = 0; i < NSUMS; ++i) {
        sumdata[i] = sums[i].data();
    }
    for (int i = 0; i < 1000; ++i) {
        grouped_sum(values.data(), groups.data(), values.size(), sumdata);
    }
    for (int i = 1; i < NSUMS; ++i) {
        for (int j = 0; j < n_groups; ++j) {
            sumdata[0][j] += sumdata[i][j];
        }
    }
    std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
    std::cout << (end - start).count() << " with NSUMS=" << NSUMS << ", INNER=" << INNER << std::endl;
    return 0;
}
Results
Time to measure. Note that since I was working in /tmp yesterday, I don't have the exact same input data. Hence, these results aren't directly comparable to the previous ones (but probably close enough).
sumspeed$ for n in 2 4 8 16; do for inner in 0 1; do make -s clean && make -s NSUMS=$n INNER=$inner && (perf stat ./sum_groups < groups_shuffled && perf stat ./sum_groups < groups_sorted) 2>&1 | egrep '^[0-9]|frontend'; done; done
1130558787 with NSUMS=2, INNER=0
915 158 411 stalled-cycles-frontend # 16,96% frontend cycles idle
1351420957 with NSUMS=2, INNER=0
1 589 408 901 stalled-cycles-frontend # 26,21% frontend cycles idle
840071512 with NSUMS=2, INNER=1
1 053 982 259 stalled-cycles-frontend # 23,26% frontend cycles idle
1391591981 with NSUMS=2, INNER=1
2 830 348 854 stalled-cycles-frontend # 45,35% frontend cycles idle
1110302654 with NSUMS=4, INNER=0
890 869 892 stalled-cycles-frontend # 16,68% frontend cycles idle
1145175062 with NSUMS=4, INNER=0
948 879 882 stalled-cycles-frontend # 17,40% frontend cycles idle
822954895 with NSUMS=4, INNER=1
1 253 110 503 stalled-cycles-frontend # 28,01% frontend cycles idle
929548505 with NSUMS=4, INNER=1
1 422 753 793 stalled-cycles-frontend # 30,32% frontend cycles idle
1128735412 with NSUMS=8, INNER=0
921 158 397 stalled-cycles-frontend # 17,13% frontend cycles idle
1120606464 with NSUMS=8, INNER=0
891 960 711 stalled-cycles-frontend # 16,59% frontend cycles idle
800789776 with NSUMS=8, INNER=1
1 204 516 303 stalled-cycles-frontend # 27,25% frontend cycles idle
805223528 with NSUMS=8, INNER=1
1 222 383 317 stalled-cycles-frontend # 27,52% frontend cycles idle
1121644613 with NSUMS=16, INNER=0
886 781 824 stalled-cycles-frontend # 16,54% frontend cycles idle
1108977946 with NSUMS=16, INNER=0
860 600 975 stalled-cycles-frontend # 16,13% frontend cycles idle
911365998 with NSUMS=16, INNER=1
1 494 671 476 stalled-cycles-frontend # 31,54% frontend cycles idle
898729229 with NSUMS=16, INNER=1
1 474 745 548 stalled-cycles-frontend # 31,24% frontend cycles idle
Yup, the inner loop with NSUMS=8 is the fastest on my computer. Compared to my "local gsum" approach, it also has the added benefit of not becoming terrible for the shuffled input.
Interesting to note: NSUMS=16 becomes worse than NSUMS=8. This could be because we're starting to see more cache misses, or because we don't have enough registers to unroll the inner loop properly.
Here is why sorted groups are slower than unsorted groups.
First, here is the assembly code for the summing loop:
008512C3 mov ecx,dword ptr [eax+ebx]
008512C6 lea eax,[eax+4]
008512C9 lea edx,[esi+ecx*4] // &sums[groups[i]]
008512CC mov ecx,dword ptr [eax-4] // values[i]
008512CF add dword ptr [edx],ecx // sums[groups[i]]+=values[i]
008512D1 sub edi,1
008512D4 jne main+163h (08512C3h)
Let's look at the add instruction, which is the main reason for this issue:
008512CF add dword ptr [edx],ecx // sums[groups[i]]+=values[i]
When the processor executes this instruction, it first issues a memory read (load) request for the address in edx, then adds the value of ecx, and then issues a write (store) request for the same address.
There is a feature in the processor called memory reordering:
To allow performance optimization of instruction execution, the IA-32 architecture allows departures from the strong-ordering model, called processor ordering, in Pentium 4, Intel Xeon, and P6 family processors. These processor-ordering variations (called here the memory-ordering model) allow performance-enhancing operations such as allowing reads to go ahead of buffered writes. The goal of any of these variations is to increase instruction execution speeds, while maintaining memory coherency, even in multiple-processor systems.
And there is a rule:
Reads may be reordered with older writes to different locations but
not with older writes to the same location.
So if the next iteration reaches the add instruction before the write request completes, it will not wait if the edx address differs from the previous one: the read request is issued, reordered ahead of the older write request, and the add instruction continues. But if the address is the same, the add instruction waits until the old write is done.
Note that the loop is short, and the processor can execute it faster than the memory controller can complete the write requests.
So for sorted groups you read and write the same address many times consecutively, which forfeits the performance enhancement from memory reordering; meanwhile, with random groups, each iteration will probably hit a different address, so the read does not have to wait for the older write and can be reordered ahead of it, and the add instruction does not have to wait for the previous one to finish.
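To see the effect in isolation, here is a minimal sketch (mine, not from either answer) that adds into one location repeatedly versus alternating between two locations. On typical hardware the single-location loop is noticeably slower, because each add has to wait for the previous store to the same address; exact numbers will vary by compiler and CPU.
#include <chrono>
#include <cstdio>

int main() {
    const int kIters = 100000000;
    volatile int same[1] = {0};   // volatile keeps the loads/stores in the binary
    volatile int two[2] = {0, 0};

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i)
        same[0] = same[0] + i;        // every add depends on the previous store
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i)
        two[i & 1] = two[i & 1] + i;  // adjacent adds hit different addresses
    auto t2 = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::printf("same address:  %lld ns\n",
                (long long)std::chrono::duration_cast<ns>(t1 - t0).count());
    std::printf("two addresses: %lld ns\n",
                (long long)std::chrono::duration_cast<ns>(t2 - t1).count());
    return 0;
}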

Discrete-event Simulation Algorithm 1.2.1 in C++

I'm currently trying to work with and extend the algorithm given in the "Discrete-Event Simulation" textbook, p. 15. My C++ knowledge is limited, and it's not a homework problem; I just want to understand how to approach this problem in C++ and what is going on.
I want to be able to compute 12 delays in a single-server FIFO service node.
Algorithm in the book is as follow:
Co = 0.0; // assumes that a0=0.0
i = 0;
while (more jobs to process) {
    i++;
    a_i = GetArrival();
    if (a_i < c_{i-1})
        d_i = c_{i-1} - a_i;   // calculate delay for job i
    else
        d_i = 0.0;             // job i has no delay
    s_i = GetService();
    c_i = a_i + d_i + s_i;     // calculate departure time for job i
}
n = i;
return d_1, d_2, ..., d_n
The GetArrival and GetService procedures read the next arrival and service time from a file.
Just looking at the pseudo-code, it seems that you need just one a (the value of a at step i), one c (the value of c at step i-1), and an array of ds to store the delays. I'm assuming the first line in your pseudo-code is c_0 = 0 and not Co = 0; otherwise the code doesn't make a lot of sense.
Now here is a C++-ized version of the pseudo-code:
std::vector<double> d;
double c = 0;  // departure time of the previous job (c_{i-1})
double a, s;
// Testing the extractions themselves (rather than eof()) stops the loop
// as soon as either file runs out, avoiding a spurious final iteration.
while (arrivalFile >> a && serviceFile >> s)
{
    double delay = 0;
    if (a < c)
        delay = c - a;
    d.push_back(delay);
    c = a + delay + s;
}
return d;
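Putting it all together, here is a self-contained sketch wrapping that fragment in a function; the file names are hypothetical, and the book's GetArrival/GetService procedures simply become stream reads:
#include <fstream>
#include <iostream>
#include <vector>

std::vector<double> computeDelays(std::istream& arrivalFile, std::istream& serviceFile) {
    std::vector<double> d;
    double c = 0.0;                            // c_{i-1}: previous departure time
    double a, s;
    while (arrivalFile >> a && serviceFile >> s) {
        double delay = (a < c) ? c - a : 0.0;  // d_i: wait for previous departure
        d.push_back(delay);
        c = a + delay + s;                     // c_i = a_i + d_i + s_i
    }
    return d;
}

int main() {
    std::ifstream arrivals("arrivals.txt");    // hypothetical input files
    std::ifstream services("services.txt");
    for (double delay : computeDelays(arrivals, services))
        std::cout << delay << '\n';            // 12 jobs in -> 12 delays out
    return 0;
}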
If I understand the code right, d_1, d_2, ..., d_n are the delays you have; the number of delays depends on the number of jobs to process (while (more jobs to process)), thus if you have 12 jobs you will have 12 delays.
In general, if the arrival time is less than the previous departure time, then the delay is the previous departure time minus the current arrival time:
if (a_i < c_{i-1})
    d_i = c_{i-1} - a_i;
The first departure time is set to zero.
If something is not clear, let me know.

Filter strange C++ multimap values

I have this multimap in my code:
multimap<long, Note> noteList;
// notes are added with this method. measureNumber is minimum `1` and doesn't go very high
void Track::addNote(Note &note) {
    long key = note.measureNumber * 1000000 + note.startTime;
    this->noteList.insert(make_pair(key, note));
}
I'm encountering problems when I try to read the notes from the last measure. In this case the song has only 8 measures, and it's measure number 8 that causes problems. If I go up to 16 measures, it's measure 16 that causes the problem, and so on.
// (when adding notes I use as key the measureNumber * 1000000. This searches for notes within the same measure)
for(noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000); noteIT->first < (this->curMsr + 1) * 1000000; noteIT++){
    if(this->curMsr == 8){
        cout << "_______________________________________________________" << endl;
        cout << "ID:" << noteIT->first << endl;
        noteIT->second.toString();
        int blah = 0;
    }
    // code left out here that processes the notes
}
I have only added one note to the 8th measure, and yet this is the result I'm getting in the console:
_______________________________________________________
ID:8000001
note toString()
Duration: 8
Start Time: 1
Frequency: 880
_______________________________________________________
ID:1
note toString()
Duration: 112103488
Start Time: 44
Frequency: 0
_______________________________________________________
ID:8000001
note toString()
Duration: 8
Start Time: 1
Frequency: 880
_______________________________________________________
ID:1
note toString()
Duration: 112103488
Start Time: 44
Frequency: 0
This keeps repeating. The first result is a correct note which I've added myself but I have no idea where the note with ID: 1 is coming from.
Any ideas how to avoid this? This loop gets stuck repeating the same two results and I can't get out of it. Even if there are several notes within measure 8 (that is, several values within the multimap whose keys start with 8xxxxxx), it only repeats the first note and the non-existent one.
You aren't checking for the end of your loop correctly. Specifically, there is no guarantee that noteIT does not equal trackIT->noteList.end(). Try this instead:
for (noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000);
     noteIT != trackIT->noteList.end() &&
         noteIT->first < (this->curMsr + 1) * 1000000;
     ++noteIT)
{
By the look of it, it might be better to use some call to upper_bound as the limit of your loop. That would handle the end case automatically.
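A sketch of that variant: compute the end of the measure's key range once with upper_bound, and the end-of-map case is handled for free (all keys for measure curMsr lie in [curMsr * 1000000, (curMsr + 1) * 1000000)):
auto last = trackIT->noteList.upper_bound((this->curMsr + 1) * 1000000 - 1);
for (noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000);
     noteIT != last; ++noteIT) {
    // process noteIT->second ...
}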