Problem 1: suppose you have an array of n floats and you want to calculate an array of n running averages over three elements. The middle part would be straightforward:
for (int i=0; i<n; i++)
b[i] = (a[i-1] + a[i] + a[i+1])/3.;
But you need to have separate code to handle the cases i==0 and i==(n-1). This is often done with extra code before the loop, extra code after the loop, and adjusting the loop range, e.g.
b[0] = (a[0] + a[1])/2.;
for (int i=1; i<n-1; i++)
b[i] = (a[i-1] + a[i] + a[i+1])/3.;
b[n-1] = (a[n-1] + a[n-2])/2.;
Even that is not enough, because the cases of n<3 need to be handled separately.
Problem 2. You are reading a variable-length code from an array (say implementing a UTF-8 to UTF-32 converter). The code reads a byte, and accordingly may read one or more bytes to determine the output. However, before each such step, it also needs to check if the end of the input array has been reached, and if so, perhaps load more data into a buffer, or terminate with an error.
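For concreteness, here is roughly the shape that loop takes; this is only a sketch (it skips validation of overlong encodings and assumes the whole input is in one buffer), but it shows how the end-of-input check creeps in before every extra read:
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: simplified UTF-8 -> UTF-32 conversion. Overlong encodings,
// surrogates and resynchronization after errors are deliberately ignored.
std::vector<uint32_t> utf8_to_utf32(const unsigned char *src, std::size_t len)
{
    std::vector<uint32_t> out;
    std::size_t i = 0;
    while (i < len) {                                  // bounds check for the lead byte
        unsigned char lead = src[i++];
        uint32_t cp;
        int extra;                                     // continuation bytes still to read
        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else break;                                    // invalid lead byte: bail out
        if (i + extra > len) break;                    // truncated sequence at the buffer end
        for (int k = 0; k < extra; ++k)
            cp = (cp << 6) | (src[i++] & 0x3F);        // accumulate continuation bits
        out.push_back(cp);
    }
    return out;
}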
Both of these problems are cases of loops where the interior of the loop can be expressed neatly, but the edges need special handling. I find these sorts of problems the most prone to error and to messy programming. So here's my question:
Are there any C++ idioms which generalize wrapping such loop patterns in a clean way?
Efficiently and elegantly handling boundary conditions is troublesome in any programming language -- C++ has no magic hammer for this. This is a common problem with applying convolution filters to signals / images -- what do you do at the image boundaries where your kernel goes outside the image support?
There are generally two things you are trying to avoid:
out of bounds array indexing (which you must avoid), and
special computation (which is inelegant and results in slower code due to extra branching).
There are usually three approaches:
Avoid the boundaries -- this is the simplest approach and is often sufficient since the boundary cases make up a tiny slice of the problem and can be ignored.
Extend the bounds of your buffer -- add extra columns/rows of padding to the array so the same code used in the general case can be used at the edges. Of course this raises the problem of what values to place in the padding -- this often depends on the problem you are solving and is considered in the next approach.
Special computation at the boundary -- this is what you do in your example. Of course how you do this is problem dependent and raises a similar issue as the previous approach -- what is the correct thing to do when my filter (in your case an averaging filter) extends beyond the array support? What should I consider the values to be outside the array support? Most image filter libraries provide some form of extrapolation options -- for example:
assume a value zero or some other constant (define a[i] = 0 if i < 0 || i >= n),
replicate the boundary value (e.g. a[i] = a[0] if i < 0 and a[i] = a[n-1] if i >= n)
wrap the value (define a[i] = a[(i + n) % n] -- makes sense in some cases -- e.g., texture filters)
mirror the border (e.g. a[i] = a[abs(i+1)] if i < 0 and a[i] = a[2n - i - 1] if i >= n)
other special case (what you do) -- a small sketch of the replicate and mirror options follows this list
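For illustration, the replicate and mirror options might be wrapped like this (the helper names are made up for the sketch):
// Sketch only: bounds-handling subscripts for a length-n array.
inline int clamp_index(int i, int n)   // replicate the border value
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

inline int mirror_index(int i, int n)  // mirror about the borders (valid for -n <= i < 2n)
{
    if (i < 0)  return -i - 1;         // a[-1] -> a[0], a[-2] -> a[1], ...
    if (i >= n) return 2 * n - i - 1;  // a[n] -> a[n-1], a[n+1] -> a[n-2], ...
    return i;
}

double replicate_at(const double *a, int n, int i) { return a[clamp_index(i, n)]; }
double mirror_at(const double *a, int n, int i)    { return a[mirror_index(i, n)]; }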
When reasonable, it's best to separate the special case from the general case (like you do) to avoid inelegant and slow general cases. One could always wrap/hide the special case and general case in a function or operator (e.g., overload operator[]), but this only sugar-coats the problem like any contrived C++ idiom would. In a multi-threaded environment (e.g. CUDA / SIMD) you can do some other tricks by preloading out-of-bounds values, but you are still stuck with the same problem.
This is why programmers use the phrase "edge case" when referring to any kind of special-case programming; it is often a time sink and a source of annoying errors. Some languages that efficiently support exception handling for out-of-bounds array indexing (e.g. Ada) can make for prettier code, but still cause the same pain.
Unfortunately the answer is NO.
There are no C++ idioms which generalize wrapping such loop patterns in a clean way!
You can get close with something like the following, but since out-of-range elements are simply treated as zero, the edge sums are still divided by 3 -- you still need to adjust the effective window size yourself.
template <typename T, int N>
T subscript(T (&data)[N], int index) {
    if (index < 0 || index >= N) {
        return 0;  // out-of-range reads act as zero padding
    }
    return data[index];
}

for (int i = 0; i < n; ++i) {
    b[i] = (subscript(a, i - 1) + subscript(a, i) + subscript(a, i + 1)) / 3.;
}
Most of the for loops I have read or written start from 0. To be fair, most of the code I have read was for embedded systems and written in C/C++, where in some cases readability is not as important as code efficiency. Therefore, I am not sure which of the following cases would be a better choice:
version 1
for(i = 0; i < allowedNumberOfIteration; i++)
{
//something that may take from 1 iteration to allowedNumberOfIteration before it happens
if(somethingHappened)
{
if(i + 1 > maxIteration)
{
maxIteration = i + 1;
}
}
}
Version 2
for(i = 1; i <= allowedNumberOfIteration; i++)
{
//something that may take from 1 iteration to allowedNumberOfIteration before it happens
if(somethingHappened)
{
if(i > maxIteration)
{
maxIteration = i;
}
}
}
Why the first version is better in my opinion:
1. Most loops start at 0, so experienced programmers may find it more natural when this loop starts from 0 too.
Why the second version is better in my opinion:
To be fair, if there were an array in the function, starting from 0 would be great because array indices start at zero. But no arrays are used in this part of the code.
Besides, the second version looks simpler because you do not have to think about the '+1'.
Things I do not know
1) Is there any performance difference?
2) Which version is better?
3) Are there any other aspect that should be considered in deciding the starting point?
4) Am I worrying too much?
1) No
2) Neither
3) Arrays in C and C++ are zero-based.
4) Yes.
Arrays of all forms in C++ are zero-based, i.e. their indexes start at zero and go up to the size of the array minus one. For example, an array of five elements will have the indexes 0 to 4 (inclusive).
That is why most loops in C++ are starting at zero.
As for your specific list of questions, for 1 there might be a performance difference. If you start a loop at 1 then you might need to subtract 1 in each iteration if you use the value as an array index. Or if you instead increase the size of the arrays by one, then you use more memory.
For 2 it really depends on what you're iterating over. Is it over array indexes, then the loop starting at zero is clearly better. But you might need to start a loop at any value, it really depends on what you're doing and the problem you're trying to solve.
For 3, what you need to consider is what you're using the loop for.
And 4, maybe a little. ;)
This argument comes from a small, 3-page note by the famous computer scientist Dijkstra (the one from Dijkstra's algorithm). In it, he lays out the reasons we might index starting at zero, and the story begins with trying to iterate over a sequence of natural numbers (meaning a sequence on the number line 0, 1, 2, 3, ...).
There are 4 possibilities for denoting the sequence 2, 3, ..., 12:
a.) 2 <= i < 13
b.) 1 < i <= 12
c.) 2 <= i <= 12
d.) 1 < i < 13
He mentions that a.) and b.) have the advantage that the difference of the two bounds equals the number of elements in the sequence. He also mentions if two sequences are adjacent, the upper bound of one equals the lower bound of the other. He says this doesn't help decide between a.) or b.) so he will start afresh.
He immediately removes b.) and d.) from the list since, if we were to start a natural sequence with zero, they would have bounds outside the natural numbers (-1), which is "ugly". He completes the observation by saying we prefer <= for the lower bound -- leaving us with a.) and c.).
For an empty sequence starting at zero, he notes that b.) and c.) would need -1 as the upper bound, which is also "ugly".
All of these observations lead to the convention of representing a sequence of natural numbers with a.), and that indeed is how most people write a for that goes over an array: for(int i = 0; i < size; ++i). We include the lower bound (0 <= i), and we exclude the upper bound (i < size).
If you were to use something like for(int i = 0; i <= iterations - 1; ++i) to do i iterations, you can see the ugliness he refers to in the case of the empty set. iterations - 1 would be -1 for zero iterations.
So by convention we use a.), and because arrays are indexed from zero, we start a huge number of for loops with i = 0. Then we reason by parsimony: we might as well do different things the exact same way when there is no good reason to do them differently.
Now, if we were to use a.) with 1-based indexing into an array instead of 0-based indexing, we would get for(int i = 1; i < size + 1; ++i). The + 1 is "ugly", so we prefer to start our range with i = 0.
In conclusion, you should do a for iterations times with for(int i = 0; i < iterations; ++i). Something like for(int i = 1; i <= iterations; ++i) is fairly understandable and works, but is there any good reason to add a different way to loop iterations times? Just use the same pattern as when indexing an array. In other words, use 0 <= i < size. Worse, the loop based on 1 <= i <= iterations doesn't have all the reasons Dijkstra came up with to support using 0 <= i < iterations as a convention.
You're not worrying too much. In fact, Dijkstra himself wondered the exact same question as has pretty much any serious programmer. Tuning your style like a craftsman who loves their trade is the ground a great programmer stands on. Pursuing parsimony and writing code the way others tend to write code (including yourself - the looping of an array!) are both sane, great things to pursue.
Due to this convention, when I see for(i = 1, I notice a departure from a convention. I am then more cautious around that code, thinking the logic within the for might depend on starting at 1 instead of 0. This is slight, but there's no reason to add that possibility when a convention is so widely used. If you happen to have a large for body, this complaint becomes less slight.
To understand why starting at one makes no sense, consider taking the argument to its natural conclusion - the argument of "but it makes sense to me!": You can start i at anything! If we free ourselves from convention, why not loop for(int i = 5; i <= iterations + 4; ++i)? Or for(int i = -5; i > -iterations - 5; --i)? Just do it the way a majority of programmers do in the majority of cases, and save being different for when there's a good reason - the difference signals to the programmer reading your code that the body of the for contains something unusual. With the standard way, we know the for is either indexing/ordering/doing arithmetic with a sequence starting at 0 or executing some logic iterations times in a row.
Note how prevalent this convention is too. In C++, every standard container iterates between [start, end), which corresponds to a.) above. There, they do it so that the end condition can be iter != end, but the fact that we already do the logic one way and that that one way has no immediate drawbacks flows naturally into the argument of "Why do it two different ways when we already do it this way in this context?" In his little paper, Dijkstra also notes a language called Mesa that can do a.), b.), c.), or d.) with particular syntax. He claims that there, a.) has won out in practice, and the others are associated with the cause of bugs. He then laments how FORTRAN indexes at 1 and how PASCAL took on c.) by convention.
I am looking for a way to optimize an algorithm that I am working on. Its most repetitive, and thus compute-intensive, part is the comparison of two sorted arrays of arbitrary size containing unique unsigned integer (uint32_t) values, in order to obtain the size of their symmetric difference (the number of elements that exist in only one of the arrays). The target machine on which the algorithm will be deployed uses Intel processors supporting AVX2, therefore I am looking for a way to perform it in-place using SIMD.
Is there a way to exploit the AVX2 instructions to obtain the size of symmetric difference of two sorted arrays of unsigned integers?
Since both arrays are sorted it should be fairly easy to implement this algorithm using SIMD (AVX2). You would just need to iterate through the two arrays concurrently, and then when you get a mismatch when comparing two vectors of 8 ints you would need to resolve the mismatch, i.e. count the differences, and get the two array indices back in phase, and continue until you get to the end of one of the arrays. Then just add the number of remaining elements in the other array, if any, to get the final count.
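For reference, the scalar version of this merge is very short; the sketch below just counts the differences (it does not store them) and is the baseline any AVX2 version would have to beat:
#include <cstddef>
#include <cstdint>

// Sketch only: two-pointer count of the symmetric difference of two sorted
// arrays of unique uint32_t values.
std::size_t sym_diff_count(const uint32_t *a, std::size_t na,
                           const uint32_t *b, std::size_t nb)
{
    std::size_t i = 0, j = 0, count = 0;
    while (i < na && j < nb) {
        if      (a[i] < b[j]) { ++count; ++i; }   // element only in a
        else if (a[i] > b[j]) { ++count; ++j; }   // element only in b
        else                  { ++i; ++j; }       // element in both: not counted
    }
    return count + (na - i) + (nb - j);           // remaining tail is all unique
}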
Unless your arrays are tiny (like <= 16 elements), you need to perform a merge of the two sorted arrays, with additional code for dumping non-equal elements.
If the size of symmetric difference is expected to be very small, then use the method described by PaulR.
If the size is expected to be high (like 10% of total number of elements), then you will have real trouble with vectorizing it. It is much easier to optimize scalar solution.
After writing several versions of code, the fastest one for me is:
int Merge3(const int *aArr, int aCnt, const int *bArr, int bCnt, int *dst) {
int i = 0, j = 0, k = 0;
while (i < aCnt - 32 && j < bCnt - 32) {
for (int t = 0; t < 32; t++) {
int aX = aArr[i], bX = bArr[j];
dst[k] = (aX < bX ? aX : bX);
k += (aX != bX);
i += (aX <= bX);
j += (aX >= bX);
}
}
while (i < aCnt && j < bCnt) {
... //use simple code to merge tails
The main optimizations here are:
Perform merging iterations in blocks (32 iterations per block). This makes it possible to simplify the stop criterion from (i < aCnt && j < bCnt) to t < 32. This can be done for most of the elements, and the tails can be processed with slow code.
Perform iterations in branchless fashion. Note that the ternary operator is compiled into a cmov instruction, and the comparisons are compiled into setXX instructions, so there are no branches in the loop body. The output data is stored with the well-known trick: write every element, but increase the pointer only for the valid ones.
What else I have tried:
(no vectorization) perform (4 + 4) bitonic merge, check consecutive elements for duplicates, move pointers so that 4 min elements (in total) are skipped:
gets 4.95ns vs 4.65ns --- slightly worse.
(fully vectorized) compare 4 x 4 elements pairwise, extract comparison results into 16-bit mask, pass it through perfect hash function, use _mm256_permutevar8x32_epi32 with 128-entry LUT to get sorted 8 elements, check consecutive elements for duplicates, use _mm_movemask_ps + 16-entry LUT + _mm_shuffle_epi8 to store only unique elements among minimal 4 elements: gets 4.00ns vs 4.65ns --- slightly better.
Here is the file with solutions and file with perfect hash + LUT generator.
P.S. Note that similar problem for intersection of sets is solved here. The solution is somewhat similar to what I outlined as point 2 above.
Let A[1, …, n] be an array storing a bit (1 or 0) at each location, and let f(m) be a function whose time complexity is θ(m). Consider the following program fragment written in a C-like language:
Case 1 :-
counter = 0;
for (i = 1; i <= n; i++)
{
if (A[i] == 1)
counter++;
else {
f(counter);
counter = 0;
}
}
Case 2 :-
counter = 0;
for (i = 1; i <= n; i++)
{
if (A[i] == 1)
counter++;
else {
counter = 0;
f(counter);
}
}
The complexity of this program fragment is
(A) Ω(n^2)
(B) Ω(n log n) and O(n^2)
(C) θ(n)
(D) O(n)
The question is: how do I know when the if branch or the else branch is taken, and hence when the f(m) function is called? How do I approach this? I can consider the cases where only the if branch is executed, or only the else branch, but what about the general case where the if branch is executed sometimes and the else branch sometimes?
We can start by looking at the easy case, case 2. Clearly, each time we go through the loop in case 2, one of either 2 things happens:
count is incremented (which takes O(1) [except not really, but we just say it does for computers that operate on fixed-length numbers (that is, 32-bit ints)])
count is set to 0 (which takes O(1) [again, debatable]) and f(count) is evaluated (which definitely takes constant time, since count has just been reset to 0)
we go through the loop n times, each time takes practically O(1) time, bada-bing, bada-boom, it takes O(n) (or O(n * lg(n)) if you're being pedantic and using variable-length integers).
Case 1, on the other hand, requires a little bit of mathematical thinking.
The bit strings that take the shortest amount of time in Case 1 are obviously 11111....11111, 000....000, 000...0111...111, or similar. All of these take θ(n) time to complete, establishing a lower bound for case 1. Now, we need to establish a worst-case scenario.
Without going into the rigor of a proper proof, it's pretty straightforward to assert that the worst-case bit strings look like this:
111....1110
A bit string of the above form with length 100 would have 99 1's, and therefore would need 99 + 99 time units to complete. A string of length n clearly needs 2(n - 1) time units to complete.
This is clearly still linear in n, so case 1, even in the worst case scenario, is θ(n).
Because both case 1 and case 2 are θ(n), the problem is θ(n).
If you still need to be convinced that 11.....110 is the worst case bit string, consider this:
A bit string of the form
|--------------n bits------------|
1....101....101....10......1....10
|-L1-| |-L2-| |-L3-| |-Lm-|
where L1 through Lm are arbitrary run lengths, will require time
t = (L1) + (L2) + (L3) + ... + (Lm) + (n - m)
= sum(L1 to Lm) - m + n
The more "runs" of ones there are, the larger m is and the more the -m term reduces t. If we have just one big "run" of ones (m = 1), we have
t = n - 1 + n - 1 = 2(n - 1)
As a matter of principle, I don't answer poorly asked homework questions on stackoverflow.
After talking to coder101 in chat, however, he/she showed me that this is NOT a homework problem, but is instead a problem that was retrieved from an online database here, which is meant to provide "mock tests for geeks". This looks like a challenge that coder101 bestowed upon themselves, and while it could be a better question, I don't think it's that bad.
In a digital filtering C++ application, I use std::inner_product (with std::vector<double> and std::deque<double>) to compute the dot product between the filter coefficients and the input data, for each data sample. After profiling my application, I figured out that no less than 85% of the execution time is spent in std::inner_product!
To what extent is std::inner_product optimized, in GCC for example?
Does it use SIMD instructions? Does it perform loop unrolling? How can I make sure of that?
Based on this, would it be worth it to implement custom dot product function(s), especially if the number of coefficients is low? (But I would like to keep the function as generic as possible.)
More specifically, this is the piece of code I use to apply a filter:
std::deque<double> in(filterNum.size(), 0.0);
std::deque<double> out(filterDenom.size() - 1, 0.0);
const double gain = filterDenom.back();
for (unsigned int s = 0, size = data.size(); s < size; ++s) {
in.pop_front();
in.push_back(data[s] / gain);
data[s] = inner_product(in.begin(), in.end(), filterNum.begin(),
-inner_product(out.begin(), out.end(), filterDenom.begin(), 0.0));
out.pop_front();
out.push_back(data[s]);
}
Typically, I use second order bandpass IIR filters, which means that the size of filterNum and filterDenom (numerator and denominator coefficients of the filter) is 5. data is the vector containing the input samples.
Getting an additional factor of 2 out of this shouldn't be hard if you just write the code directly. Part of it might come from removing some of the generality of inner_product, but some would also come from such things as eliminating the use of deques - if you just keep a pointer into your input array you can index off it and off the filter array in the inner loop, and increment the pointer to the input array in the outer loop.
Each of those inner_products has to use iterators through deques.
Most of the (coding) effort then becomes handling the edge conditions.
And take that division out of there - it should be a multiplication by a constant calculated outside the loop.
Inner product itself is pretty efficient (there's not much to do there), but it needs to increment two iterators on each pass through the inner loop. There is no explicit loop unrolling, but a good compiler can unroll a loop that simple. And a compiler is more likely to know how far to unroll a loop before running into instruction cache issues.
Deque iterators are not nearly as efficient as ++ on a pure pointer. There is at least a test on every ++, and there may be more than one assignment.
This is what a simple (FIR) filter can look like, without including the code for the edge conditions (which goes outside of the loop):
double norm = 1.0/sum;
double *p = data.values();   // start of input data
double *q = output.values(); // start of output buffer
int width = data.size() - filter.size();
for( int i = 0; i < width; ++i, ++p )  // advance the input pointer in the outer loop
{
    double *f = filter.values();
    double accumulator = ( f[0] * p[0] );
    for( int j = 1; j < filter.size(); ++j )
    {
        accumulator += ( f[j] * p[j] );  // index off the filter and input pointers
    }
    *q++ = accumulator * norm;
}
Note that there are messy details left out, and this is not the same as your filter, but it gives the idea. What's inside the outer loop easily fits in a modern instruction cache. The inner loop may be unrolled by the compiler. Most modern architectures can do the add and multiply in parallel.
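For the IIR case in the question, a hand-rolled filter with a small fixed-size state array removes the deques and both inner_product calls entirely. The sketch below is a transposed direct form II implementation; it assumes the numerator and denominator have the same length (at least 2) and that both have already been divided through by the leading denominator coefficient (the gain in the question's code), which is not exactly the layout the question uses:
#include <cstddef>
#include <vector>

// Sketch only: in-place IIR filtering, transposed direct form II.
// b = numerator, a = denominator, same size (>= 2), a[0] == 1.
void iir_filter(std::vector<double> &data,
                const std::vector<double> &b,
                const std::vector<double> &a)
{
    std::vector<double> z(b.size() - 1, 0.0);            // filter state
    for (double &sample : data) {
        const double x = sample;
        const double y = b[0] * x + z[0];                // current output
        for (std::size_t k = 1; k + 1 < b.size(); ++k)
            z[k - 1] = b[k] * x - a[k] * y + z[k];       // shift the state
        z.back() = b.back() * x - a.back() * y;          // last state element
        sample = y;                                      // filter in place
    }
}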
You can ask GCC to compute most of the algorithms in <algorithm> and <numeric> in parallel mode; it may give a performance boost if your data set is very large (I think it really only uses OpenMP inside).
However, on small datasets it may give a performance hit.
A comparison with the other solution would be more than welcome!
http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
I have a homework assignment that requires having to shift a circular array -n places in memory.
I have accomplished the task with the following C++ Syntax:
while ( r < 0 ) // rotate negatively.
{
if ( i == top+1 )
{
current->n = items[top+1].n;
items[top+1].n = items[back-1].n;
}
midPtr->n = items[++i].n;
items[i].n = current->n;
if ( i == back-1 )
{
items[back-1].n = current->n;
i = 0;
r++; continue;
}
else
{
current->n = items[++i].n;
items[i].n = midPtr->n;
if ( i == back-1 )
{
i = 0;
r++; continue;
}
}
}
I was wondering if anyone has a better, more efficient method of shifting a circular array by -n units, because I seem to be performing unnecessary transfers between pointer variables.
Without giving you code (this is after all a homework assignment) consider this ...
a circular array is just an allocation of n units in memory and a pointer to the "start" of the array. The first item in the array need not be at the lowest address in the allocated memory; it is just a pointer/index to the item in the array that represents the logical first element. Shifting the array is simply shifting the index of the first element. It can be done without loops: simply calculate how far the index needs to move, taking into account the circular nature of the data structure.
Finicky, but, I guess, OK (there are just too many places that MIGHT have off-by-1 errors to make sure about them all -- just make sure you have a ton of unit tests;-). Personally, whenever faced with such problems (in a real-work context -- alas, the time of homework is far in the past for me;-) I tend to focus on something I learned from Gries a long time ago...: any "swap two blocks" task inside an array can be performed quite effectively (linear time, 0 extra memory) with a single primitive -- "invert a block". Visualizing the array as a normal compact piece of memory, say you have...:
start
.
.
beg1
.
.
end1
.
.
beg2
.
.
end2
.
.
finis
and your task is to swap the block (beg1..end1) with the block (beg2..end2). The Gries solution is (in each case a block's given (begin..end) with extremes included):
1. invert (beg1..end2)
2. invert (end2..beg2)
3. invert (beg2-1..end1+1)
4. invert (end1..beg1)
...that's all!-) Since invert(X..Y) takes (Y-X+1) elementary moves, the Gries solution takes 2*(end2-beg1+1) such moves -- the relative overhead compared to a "minimal possible elementary moves" solution can be high in some special cases (beg2 much larger than end1 AND the two blocks exactly the same length, for example), but the generality and lack of finicky worries about off-by-one issues makes it worthwhile to me more often than not.
"Rotation" is a special case of "swap two blocks", of course. So my instinct would be to ensure I have the "invert" primitive (much easier to program anyway;-) and then use that.
But then, I DO have to worry only about my code being clear, correct, and fast, NOT on how a TA will evaluate it, admittedly;-).