Optimizing this code block - c++

for (int i = 0; i < 5000; i++)
for (int j = 0; j < 5000; j++)
{
for (int ii = 0; ii < 20; ii++)
for (int jj = 0; jj < 20; jj++)
{
int num = matBigger[i+ii][j+jj];
// Extract range from this.
int low = num & 0xff;
int high = num >> 8;
if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
// match found
}
}
The machine is x86_64, 32kb L1 cahce, 256 Kb L2 cache.
Any pointers on how can I possibly optimize this code?
EDIT Some background to the original problem : Fastest way to Find a m x n submatrix in M X N matrix

First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++)
for (int jj = 0; jj < 20; jj++)
int smaller = matSmaller[ii][jj];
for (int i = 0; i < 5000; i++)
for (int j = 0; j < 5000; j++) {
int num = matBigger[i+ii][j+jj];
int low = num & 0xff;
if (low < smaller && smaller > (num >> 8)) {
// match found
}
}
This might be faster (thanks to less access to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops, than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
There are some higher-level things you could consider with some early analysis of smaller: for example if it's 0 then you needn't examine matBigger for that value of ii and jj, because num & 0xff < 0 is always false.
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.

Some basic advice:
Profile it, so you can learn where the hot-spots are.
Think about cache locality, and the addresses resulting from your loop order.
Use more const in the innermost scope, to hint more to the compiler.
Try breaking it up so you don't compute high if the low test is failing.
Try maintaining the offset into matBigger and matSmaller explicitly, to the innermost stepping into a simple increment.

Best thing ist to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
if you are just interested if a matching entry exists, make sure to break out of all 3 loops at the position of // match found.
make sure the data is stored in an optimal way. It all depends on your problem, but i.e. it could be more efficient to have just one array of size 5000*5000*20 and overload operator()(int,int,int) for accessing elements.

What are matSmaller and matBigger?
Try changing them to matBigger[i+ii * COL_COUNT + j+jj]

I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
for (int jj = 0; jj < 20; jj++)
{
int smaller = matSmaller[ii][jj];
for (int i = 0; i < 5000; i++)
{
int *pI = &matBigger[i+ii][jj];
for (int j = 0; j < 5000; j++)
{
int num = *pI++;
int low = num & 0xff;
if (low < smaller && smaller > (num >> 8)) {
// match found
} // for j
} // for i
} // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in register. By changing the array access to be a simple pointer increment, you'll make the compiler's job easier to produce efficient code.
Edit: I just noticed #unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test is first.

Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see there are locations that are accessed many times.
Also, multiplying the iteration counts indicate that the inner content is accessed: 20 * 20 * 5000 * 5000, or 10000000000 (10E+9) times. That's a lot!
So rather than trying to speed up the execution of 10E9 instructions (such as execution (pipeline) cache or data cache optimization), try reducing the number of iterations.
The code is searcing the matrix for a number that is within a range: larger than a minimal value and less than the maximum range value.
Based on this, try a different approach:
Find and remember all coordinates where the search value is greater
than the low value. Let us call these anchor points.
For each anchor point, find the coordinates of the first value after
the anchor point that is outside the range.
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one pass scan and allow other decisions such as finding a range or determining an MxN matrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of access to the matrices.

Related

Fastest way to find smallest missing integer from list of integers

I have a list of 100 random integers. Each random integer has a value from 0 to 99. Duplicates are allowed, so the list could be something like
56, 1, 1, 1, 1, 0, 2, 6, 99...
I need to find the smallest integer (>= 0) is that is not contained in the list.
My initial solution is this:
vector<int> integerList(100); //list of random integers
...
vector<bool> listedIntegers(101, false);
for (int theInt : integerList)
{
listedIntegers[theInt] = true;
}
int smallestInt;
for (int j = 0; j < 101; j++)
{
if (!listedIntegers[j])
{
smallestInt = j;
break;
}
}
But that requires a secondary array for book-keeping and a second (potentially full) list iteration. I need to perform this task millions of times (the actual application is in a greedy graph coloring algorithm, where I need to find the smallest unused color value with a vertex adjacency list), so I'm wondering if there's a clever way to get the same result without so much overhead?
It's been a year, but ...
One idea that comes to mind is to keep track of the interval(s) of unused values as you iterate the list. To allow efficient lookup, you could keep intervals as tuples in a binary search tree, for example.
So, using your sample data:
56, 1, 1, 1, 1, 0, 2, 6, 99...
You would initially have the unused interval [0..99], and then, as each input value is processed:
56: [0..55][57..99]
1: [0..0][2..55][57..99]
1: no change
1: no change
1: no change
0: [2..55][57..99]
2: [3..55][57..99]
6: [3..5][7..55][57..99]
99: [3..5][7..55][57..98]
Result (lowest value in lowest remaining interval): 3
I believe there is no faster way to do it. What you can do in your case is to reuse vector<bool>, you need to have just one such vector per thread.
Though the better approach might be to reconsider the whole algorithm to eliminate this step at all. Maybe you can update least unused color on every step of the algorithm?
Since you have to scan the whole list no matter what, the algorithm you have is already pretty good. The only improvement I can suggest without measuring (that will surely speed things up) is to get rid of your vector<bool>, and replace it with a stack-allocated array of 4 32-bit integers or 2 64-bit integers.
Then you won't have to pay the cost of allocating an array on the heap every time, and you can get the first unused number (the position of the first 0 bit) much faster. To find the word that contains the first 0 bit, you only need to find the first one that isn't the maximum value, and there are bit twiddling hacks you can use to get the first 0 bit in that word very quickly.
You program is already very efficient, in O(n). Only marginal gain can be found.
One possibility is to divide the number of possible values in blocks of size block, and to register
not in an array of bool but in an array of int, in this case memorizing the value modulo block.
In practice, we replace a loop of size N by a loop of size N/block plus a loop of size block.
Theoretically, we could select block = sqrt(N) = 12 in order to minimize the quantity N/block + block.
In the program hereafter, block of size 8 are selected, assuming that dividing integers by 8 and calculating values modulo 8 should be fast.
However, it is clear that a gain, if any, can be obtained only for a minimum value rather large!
constexpr int N = 100;
int find_min1 (const std::vector<int> &IntegerList) {
constexpr int Size = 13; //N / block
constexpr int block = 8;
constexpr int Vmax = 255; // 2^block - 1
int listedBlocks[Size] = {0};
for (int theInt : IntegerList) {
listedBlocks[theInt / block] |= 1 << (theInt % block);
}
for (int j = 0; j < Size; j++) {
if (listedBlocks[j] == Vmax) continue;
int &k = listedBlocks[j];
for (int b = 0; b < block; b++) {
if ((k%2) == 0) return block * j + b;
k /= 2;
}
}
return -1;
}
Potentially you can reduce the last step to O(1) by using some bit manipulation, in your case __int128, set the corresponding bits in loop one and call something like __builtin_clz or use the appropriate bit hack
The best solution I could find for finding smallest integer from a set is https://codereview.stackexchange.com/a/179042/31480
Here are c++ version.
int solution(std::vector<int>& A)
{
for (std::vector<int>::size_type i = 0; i != A.size(); i++)
{
while (0 < A[i] && A[i] - 1 < A.size()
&& A[i] != i + 1
&& A[i] != A[A[i] - 1])
{
int j = A[i] - 1;
auto tmp = A[i];
A[i] = A[j];
A[j] = tmp;
}
}
for (std::vector<int>::size_type i = 0; i != A.size(); i++)
{
if (A[i] != i+1)
{
return i + 1;
}
}
return A.size() + 1;
}

Performance optimization nested loops

I am implementing a rather complicated code and in one of the critical sections I need to basically consider all the possible strings of numbers following a certain rule. The naive implementation to explain what I do would be such a nested loop implementation:
std::array<int,3> max = { 3, 4, 6};
for(int i = 0; i <= max.at(0); ++i){
for(int j = 0; j <= max.at(1); ++j){
for(int k = 0; k <= max.at(2); ++k){
DoSomething(i, j, k);
}
}
}
Obviously I actually need more nested for and the "max" rule is more complicated but the idea is clear I think.
I implemented this idea using a recursive function approach:
std::array<int,3> max = { 3, 4, 6};
std::array<int,3> index = {0, 0, 0};
int total_depth = 3;
recursive_nested_for(0, index, max, total_depth);
where
void recursive_nested_for(int depth, std::array<int,3>& index,
std::array<int,3>& max, int total_depth)
{
if(depth != total_depth){
for(int i = 0; i <= max.at(depth); ++i){
index.at(depth) = i;
recursive_nested_for(depth+1, index, max, total_depth);
}
}
else
DoSomething(index);
}
In order to save as much as possible I declare all the variable I use global in the actual code.
Since this part of the code takes really long is it possible to do anything to speed it up?
I would also be open to write 24 nested for if necessary to avoid the overhead at least!
I thought that maybe an approach like expressions templates to actually generate at compile time these nested for could be more elegant. But is it possible?
Any suggestion would be greatly appreciated.
Thanks to all.
The recursive_nested_for() is a nice idea. It's a bit inflexible as it is currently written. However, you could use std::vector<int> for the array dimensions and indices, or make it a template to handle any size std::array<>. The compiler might be able to inline all recursive calls if it knows how deep the recursion is, and then it will probably be just as efficient as the three nested for-loops.
Another option is to use a single for loop for incrementing the indices that need incrementing:
void nested_for(std::array<int,3>& index, std::array<int,3>& max)
{
while (index.at(2) < max.at(2)) {
DoSomething(index);
// Increment indices
for (int i = 0; i < 3; ++i) {
if (++index.at(i) >= max.at(i))
index.at(i) = 0;
else
break;
}
}
}
However, you can also consider creating a linear sequence that visits all possible combinations of the iterators i, j, k and so on. For example, with array dimensions {3, 4, 6}, there are 3 * 4 * 6 = 72 possible combinations. So you can have a single counter going from 0 to 72, and then "split" that counter into the three iterator values you need, like so:
for (int c = 0; c < 72; c++) {
int k = c % 6;
int j = (c / 6) % 4;
int i = c / 6 / 4;
DoSomething(i, j, k);
}
You can generalize this to as many dimensions as you want. Of course, the more dimensions you have, the higher the cost of splitting the linear iterator. But if your array dimensions are powers of two, it might be very cheap to do so. Also, it might be that you don't need to split it at all; for example if you are calculating the sum of all elements of a multidimensional array, you don't care about the actual indices i, j, k and so on, you just want to visit all elements once. If the array is layed out linearly in memory, then you just need a linear iterator.
Of course, if you have 24 nested for loops, you'll notice that the product of all the dimension's sizes will become a very large number. If it doesn't fit in a 32 bit integer, your code is going to be very slow. If it doesn't fit into a 64 bit integer anymore, it will never finish.

Trying to understand this solution

I was trying to solve a question and I got into a few obstacles that I failed to solve, starting off here is the question: Codeforces - 817D
Now I tried to brute force it, using a basic get min and max for each segment of the array I could generate and then keeping track of them I subtract them and add them together to get the final imbalance, this looked good but it gave me a time limit exceeded cause brute forcing n*(n+1)/2 subsegments of the array given n is 10^6 , so I just failed to go around it and after like a couple of hours of not getting any new ideas I decided to see a solution that I could not understand anything in to be honest :/ , here is the solution:
#include <bits/stdc++.h>
#define ll long long
int a[1000000], l[1000000], r[1000000];
int main(void) {
int i, j, n;
scanf("%d",&n);
for(i = 0; i < n; i++) scanf("%d",&a[i]);
ll ans = 0;
for(j = 0; j < 2; j++) {
vector<pair<int,int>> v;
v.push_back({-1,INF});
for(i = 0; i < n; i++) {
while (v.back().second <= a[i]) v.pop_back();
l[i] = v.back().first;
v.push_back({i,a[i]});
}
v.clear();
v.push_back({n,INF});
for(i = n-1; i >= 0; i--) {
while (v.back().second < a[i]) v.pop_back();
r[i] = v.back().first;
v.push_back({i,a[i]});
}
for(i = 0; i < n; i++) ans += (ll) a[i] * (i-l[i]) * (r[i]-i);
for(i = 0; i < n; i++) a[i] *= -1;
}
cout << ans;
}
I tried tracing it but I keep wondering why was the vector used , the only idea I got is he wanted to use the vector as a stack since they both act the same(Almost) but then the fact that I don't even know why we needed a stack here and this equation ans += (ll) a[i] * (i-l[i]) * (r[i]-i); is really confusing me because I don't get where did it come from.
Well thats a beast of a calculation. I must confess, that i don't understand it completely either. The problem with the brute force solution is, that you have to calculate values or all over again.
In a slightly modified example, you calculate the following values for an input of 2, 4, 1 (i reordered it by "distance")
[2, *, *] (from index 0 to index 0), imbalance value is 0; i_min = 0, i_max = 0
[*, 4, *] (from index 1 to index 1), imbalance value is 0; i_min = 1, i_max = 1
[*, *, 1] (from index 2 to index 2), imbalance value is 0; i_min = 2, i_max = 2
[2, 4, *] (from index 0 to index 1), imbalance value is 2; i_min = 0, i_max = 1
[*, 4, 1] (from index 1 to index 2), imbalance value is 3; i_min = 2, i_max = 1
[2, 4, 1] (from index 0 to index 2), imbalance value is 3; i_min = 2, i_max = 1
where i_min and i_max are the indices of the element with the minimum and maximum value.
For a better visual understanding, i wrote the complete array, but hid the unused values with *
So in the last case [2, 4, 1], brute-force looks for the minimum value over all values, which is not necessary, because you already calculated the values for a sub-space of the problem, by calculating [2,4] and [4,1]. But comparing only the values is not enough, you also need to keep track of the indices of the minimum and maximum element, because those can be reused in the next step, when calculating [2, 4, 1].
The idead behind this is a concept called dynamic programming, where results from a calculation are stored to be used again. As often, you have to choose between speed and memory consumption.
So to come back to your question, here is what i understood :
the arrays l and r are used to store the indices of the greatest number left or right of the current one
vector v is used to find the last number (and it's index) that is greater than the current one (a[i]). It keeps track of rising number series, e.g. for the input 5,3,4 at first the 5 is stored, then the 3 and when the 4 comes, the 3 is popped but the index of 5 is needed (to be stored in l[2])
then there is this fancy calculation (ans += (ll) a[i] * (i-l[i]) * (r[i]-i)). The stored indices of the maximum (and in the second run the minimum) elements are calculated together with the value a[i] which does not make much sense for me by now, but seems to work (sorry).
at last, all values in the array a are multiplied by -1, which means, the old maximums are now the minimums, and the calculation is done again (2nd run of the outer for-loop over j)
This last step (multiply a by -1) and the outer for-loop over j are not necessary but it's an elegant way to reuse the code.
Hope this helps a bit.

Treats for the cows - bottom up dynamic programming

The full problem statement is here. Suppose we have a double ended queue of known values. Each turn, we can take a value out of one or the other end and the values still in the queue increase as value*turns. The goal is to find maximum possible total value.
My first approach was to use straightforward top-down DP with memoization. Let i,j denote starting, ending indexes of "subarray" of array of values A[].
A[i]*age if i == j
f(i,j,age) =
max(f(i+1,j,age+1) + A[i]*age , f(i,j-1,age+1) + A[j]*age)
This works, however, proves to be too slow, as there are superfluous stack calls. Iterative bottom-up should be faster.
Let m[i][j] be the maximum reachable value of the "subarray" of A[] with begin/end indexes i,j. Because i <= j, we care only about the lower triangular part.
This matrix can be built iteratively using the fact that m[i][j] = max(m[i-1][j] + A[i]*age, m[i][j-1] + A[j]*age), where age is maximum on the diagonal (size of A[] and linearly decreases as A.size()-(i-j).
My attempt at implementation meets with bus error.
Is the described algorithm correct? What is the cause for the bus error?
Here is the only part of the code where the bus error might occur:
for(T j = 0; j < num_of_treats; j++) {
max_profit[j][j] = treats[j]*num_of_treats;
for(T i = j+1; i < num_of_treats; i++)
max_profit[i][j] = max( max_profit[i-1][j] + treats[i]*(num_of_treats-i+j),
max_profit[i][j-1] + treats[j]*(num_of_treats-i+j));
}
for(T j = 0; j < num_of_treats; j++) {
Inside this loop, j is clearly a valid index into the array max_profit. But you're not using just j.
The bus error is caused by trying to access array via negative index when j=0 and i=1 as I should have noticed during the debugging. The algorithm is wrong as well. First, the relationship used to construct the max_profit[][] array should is
max_profit[i][j] = max( max_profit[i+1][j] + treats[i]*(num_of_treats-i+j),
max_profit[i][j-1] + treats[j]*(num_of_treats-i+j));
Second, the array must by filled diagonally, so that max_profit[i+1][j] and max_profit[i][j-1] is already computed with exception of the main diagonal.
Third, the data structure chosen is extremely inefficient. I am using only half of the space allocated for max_profit[][]. Plus, at each iteration, I only need the last computed diagonal. An array of size num_of_treats should suffice.
Here is a working code using this improved algorithm. I really like it. I even used bit operators for the first time.

Long array performance issue

I have an array of char pointers of length 175,000. Each pointer points to a c-string array of length 100, each character is either 1 or 0. I need to compare the difference between the strings.
char* arr[175000];
So far, I have two for loops where I compare every string with every other string. The comparison functions basically take two c-strings and returns an integer which is the number of differences of the arrays.
This is taking really long on my 4-core machine. Last time I left it to run for 45min and it never finished executing. Please advise of a faster solution or some optimizations.
Example:
000010
000001
have a difference of 2 since the last two bits do not match.
After i calculate the difference i store the value in another array
int holder;
for(int x = 0;x < UsedTableSpace; x++){
int min = 10000000;
for(int y = 0; y < UsedTableSpace; y++){
if(x != y){
//compr calculates difference between two c-string arrays
int tempDiff =compr(similarity[x]->matrix, similarity[y]->matrix);
if(tempDiff < min){
min = tempDiff;
holder = y;
}
}
}
similarity[holder]->inbound++;
}
With more information, we could probably give you better advice, but based on what I understand of the question, here are some ideas:
Since you're using each character to represent a 1 or a 0, you're using several times more memory than you need to use, which creates a big performance impact when it comes to caching and such. Instead, represent your data using numeric values that you can think of in terms of a series of bits.
Once you've implemented #1, you can grab an entire integer or long at a time and do a bitwise XOR operation to end up with a number that has a 1 in every place where the two numbers didn't have the same values. Then you can use some of the tricks mentioned here to count these bits speedily.
Work on "unrolling" your loops somewhat to avoid the number of jumps necessary. For example, the following code:
total = total + array[i];
total = total + array[i + 1];
total = total + array[i + 2];
... will work faster than just looping over total = total + array[i] three times. Jumps are expensive, and interfere with the processor's pipelining. Update: I should mention that your compiler may be doing some of this for you already--you can check the compiled code to see.
Break your overall data set into chunks that will allow you to take full advantage of caching. Think of your problem as a "square" with the i index on one axis and the j axis on the other. If you start with one i and iterate across all 175000 j values, the first j values you visit will be gone from the cache by the time you get to the end of the line. On the other hand, if you take the top left corner and go from j=0 to 256, most of the values on the j axis will still be in a low-level cache as you loop around to compare them with i=0, 1, 2, etc.
Lastly, although this should go without saying, I guess it's worth mentioning: Make sure your compiler is set to optimize!
One simple optimization is to compare the strings only once. If the difference between A and B is 12, the difference between B and A is also 12. Your running time is going to drop almost half.
In code:
int compr(const char* a, const char* b) {
int d = 0, i;
for (i=0; i < 100; ++i)
if (a[i] != b[i]) ++d;
return d;
}
void main_function(...) {
for(int x = 0;x < UsedTableSpace; x++){
int min = 10000000;
for(int y = x + 1; y < UsedTableSpace; y++){
//compr calculates difference between two c-string arrays
int tempDiff = compr(similarity[x]->matrix, similarity[y]->matrix);
if(tempDiff < min){
min = tempDiff;
holder = y;
}
}
similarity[holder]->inbound++;
}
}
Notice the second-th for loop, I've changed the start index.
Some other optimizations is running the run method on separate threads to take advantage of your 4 cores.
What is your goal, i.e. what do you want to do with the Hamming Distances (which is what they are) after you've got them? For example, if you are looking for the closest pair, or most distant pair, you probably can get an O(n ln n) algorithm instead of the O(n^2) methods suggested so far. (At n=175000, n^2 is 15000 times larger than n ln n.)
For example, you could characterize each 100-bit number m by 8 4-bit numbers, being the number of bits set in 8 segments of m, and sort the resulting 32-bit signatures into ascending order. Signatures of the closest pair are likely to be nearby in the sorted list. It is easy to lower-bound the distance between two numbers if their signatures differ, giving an effective branch-and-bound process as less-distant numbers are found.