Using intrinsics to find next non-zero in an array

I have an int array[10000] and I want to iterate from a certain position to find the next non-zero index. Currently I use a basic while loop:
while (array[pos] == 0) {
    pos++;
}
I know with intrinsics I could test 4 integers for zero at a time, but is there a way to return something indicating the vector index of the "first" non-zero?

It's fairly simple to do this, but throughput improvement may not be great, since you will probably be limited by memory bandwidth (unless your array is already cached):
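int index = -1;
int i, j;
for (i = 0; i < n; i += 4)
{
    __m128i v = _mm_load_si128((const __m128i *)&A[i]);     // aligned load of 4 ints
    __m128i vcmp = _mm_cmpeq_epi32(v, _mm_setzero_si128()); // all-ones in each lane that is zero
    int mask = _mm_movemask_epi8(vcmp);                     // 16 bits, 4 per lane
    if (mask != 0xffff)                                     // at least one lane is non-zero
    {
        break;
    }
}
if (i < n)
{
    for (j = i; j < i + 4; ++j)                             // locate it within the vector
    {
        if (A[j] != 0)
        {
            index = j;
            break;
        }
    }
}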
This assumes that the array A is 16 byte aligned, its size, n, is a multiple of 4, and that the ints are 32 bits.
Loop unrolling by a factor of 2 may help, particularly if your input data is large and/or sparse, e.g.
int index = -1;
int i, j;
for (i = 0; i < n; i += 8)
{
    __m128i v0 = _mm_load_si128((const __m128i *)&A[i]);
    __m128i v1 = _mm_load_si128((const __m128i *)&A[i + 4]);
    __m128i vcmp0 = _mm_cmpeq_epi32(v0, _mm_setzero_si128());
    __m128i vcmp1 = _mm_cmpeq_epi32(v1, _mm_setzero_si128());
    int mask0 = _mm_movemask_epi8(vcmp0);
    int mask1 = _mm_movemask_epi8(vcmp1);
    if ((mask0 & mask1) != 0xffff) // both masks are 0xffff only when all 8 ints are zero
    {
        break;
    }
}
if (i < n)
{
    for (j = i; j < i + 8; ++j)
    {
        if (A[j] != 0)
        {
            index = j;
            break;
        }
    }
}
If you have AVX2 (Haswell and later) then you can process 8 ints at a time rather than 4.
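On the question of getting the position of the first non-zero lane directly from the compare result, one option is to reduce the mask to one bit per lane and count trailing zeros. The following is a minimal sketch rather than the answer's own code: the name next_nonzero is made up for illustration, it assumes 32-bit int, SSE2 and a GCC/Clang-style __builtin_ctz, and it uses unaligned loads plus a scalar tail so the alignment and multiple-of-4 assumptions are not needed.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2

// Hypothetical helper: index of the first non-zero element at or after
// `start`, or -1 if there is none. Assumes 32-bit int and GCC/Clang.
long next_nonzero(const int *a, std::size_t n, std::size_t start)
{
    assert(start <= n);
    const __m128i zero = _mm_setzero_si128();
    std::size_t i = start;
    for (; i + 4 <= n; i += 4)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vcmp = _mm_cmpeq_epi32(v, zero);
        // One bit per 32-bit lane, set where the lane equals zero.
        int zero_lanes = _mm_movemask_ps(_mm_castsi128_ps(vcmp));
        int nonzero_lanes = ~zero_lanes & 0xF;
        if (nonzero_lanes != 0)
        {
            // The lowest set bit is the first non-zero lane.
            return static_cast<long>(i) + __builtin_ctz(nonzero_lanes);
        }
    }
    for (; i < n; ++i) // scalar tail for the last n % 4 elements
    {
        if (a[i] != 0) return static_cast<long>(i);
    }
    return -1;
}
```

Whether this beats the inner scalar loop in practice probably depends on how sparse the data is, so it is worth measuring.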

Related

Max value of sub-list items

I'm looking for the optimal method to find the maximum value of sub-list items in a list.
Here is my O(n.m) implementation:
vector<int> movMax(const vector<int>& v, int span)
{
span /= 2;
vector<int> ret = v;
for (int i = 0; i < (int)v.size(); ++i)
{
for (int j = std::max(0, i - span); j < std::min((int)v.size(), i + span + 1); j++)
{
ret[i] = std::max(ret[i], v[j]);
}
}
return ret;
}
int main()
{
vector<int> v = { 4, 3, 3, 7, 2, 5, 1, 2 };
v = movMax(v, 3);
for (int x : v) cout << x << ' '; // 4 4 7 7 7 5 5 2
}

Here's a version that's O(N log M), where N is the size of the input, and M is the window.
We keep track of the values within the window in sorted order, so adding or removing one is O(log M), and we do this for each element of v.
std::vector<int> movMax(const std::vector<int>& v, int window)
{
int mid = window / 2;
int size = v.size();
std::vector<int> result;
result.reserve(size);
std::multiset<int> working_set;
for (int i = -mid; i < size + mid; ++i)
{
if (i + mid < size) working_set.insert(v.at(i + mid));
if (i >= 0 && i < size) result.push_back(*working_set.rbegin());
if (i - mid >= 0) working_set.erase(working_set.find(v.at(i - mid)));
}
return result;
}
If window is allowed to be even, you need to account for which side it prefers.
Instead of defining one mid, you have wide and narrow:
int wide = window / 2;
int narrow = window - wide - 1;
Assuming window should span to front more:
for (int i = -narrow; i < size + wide; ++i)
{
if (i + narrow < size) working_set.insert(v.at(i + narrow));
if (i >= 0 && i < size) result.push_back(*working_set.rbegin());
if (i - wide >= 0) working_set.erase(working_set.find(v.at(i - wide)));
}
Assuming window should span to back more:
for (int i = -wide; i < size + narrow; ++i)
{
if (i + wide < size) working_set.insert(v.at(i + wide));
if (i >= 0 && i < size) result.push_back(*working_set.rbegin());
if (i - narrow >= 0) working_set.erase(working_set.find(v.at(i - narrow)));
}
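If O(N log M) is still too slow, a commonly used alternative is a monotonic deque, which is O(N) overall because each index is pushed and popped at most once. This is a different technique from the multiset version above, sketched under assumptions: the name movMaxDeque is made up, and the window is assumed to be odd and centred on each element, as in the question.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical O(N) sliding-window maximum using a deque of indices.
// Assumes an odd window centred on each element.
std::vector<int> movMaxDeque(const std::vector<int>& v, int window)
{
    assert(window > 0);
    const int mid = window / 2;
    const int size = static_cast<int>(v.size());
    std::vector<int> result;
    result.reserve(v.size());
    std::deque<int> candidates; // indices; their values strictly decrease front to back
    int next = 0;               // next index not yet pushed
    for (int i = 0; i < size; ++i)
    {
        // Bring in every element up to the right edge i + mid.
        for (; next < size && next <= i + mid; ++next)
        {
            while (!candidates.empty() && v[candidates.back()] <= v[next])
                candidates.pop_back();
            candidates.push_back(next);
        }
        // Drop indices that fell off the left edge i - mid.
        while (candidates.front() < i - mid)
            candidates.pop_front();
        result.push_back(v[candidates.front()]);
    }
    return result;
}
```

The front of the deque is always the index of the current window maximum, so no search is needed when producing each output element.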

Codility MinAbsSum

I tried this Codility test: MinAbsSum.
https://codility.com/programmers/lessons/17-dynamic_programming/min_abs_sum/
I solved the problem by searching the whole tree of possibilities. The results were correct, but my solution timed out on large inputs, so its time complexity is not good enough. The search tries both signs for every element, so the tree has 2^n leaves. This coding test is in the "Dynamic Programming" section, so there must be some way to improve it. I tried summing the whole set first and then using this information, but something is always missing in my solution. Does anybody have an idea of how to improve my solution using DP?
#include <algorithm> // std::min
#include <cstdlib>   // std::abs
#include <vector>
using namespace std;
int sum(vector<int>& A, size_t i, int s)
{
if (i == A.size())
return s;
int tmpl = s + A[i];
int tmpr = s - A[i];
return min (abs(sum(A, i+1, tmpl)), abs(sum(A, i+1, tmpr)));
}
int solution(vector<int> &A) {
return sum(A, 0, 0);
}

I could not solve it. But here's the official answer.
Quoting it:
Notice that the range of numbers is quite small (maximum 100). Hence,
there must be a lot of duplicated numbers. Let count[i] denote the
number of occurrences of the value i. We can process all occurrences
of the same value at once. First we calculate values count[i]. Then we
create array dp such that:
dp[j] = −1 if we cannot get the sum j,
dp[j] >=  0 if we can get sum j.
Initially, dp[j] = -1 for all of j (except dp[0] = 0). Then we scan
through all the values a appearing in A; we consider all a such
that count[a]>0. For every such a we update dp that dp[j] denotes
how many values a remain (maximally) after achieving sum j. Note
that if the previous value at dp[j] >= 0 then we can set dp[j] =
count[a] as no value a is needed to obtain the sum j. Otherwise we
must obtain sum j-a first and then use a number a to get sum j. In
such a situation dp[j] = dp[j-a]-1. Using this algorithm, we can
mark all the sum values and choose the best one (closest to half of S,
the sum of abs of A).
def MinAbsSum(A):
    N = len(A)
    M = 0
    for i in range(N):
        A[i] = abs(A[i])
        M = max(A[i], M)
    S = sum(A)
    count = [0] * (M + 1)
    for i in range(N):
        count[A[i]] += 1
    dp = [-1] * (S + 1)
    dp[0] = 0
    for a in range(1, M + 1):
        if count[a] > 0:
            for j in range(S):
                if dp[j] >= 0:
                    dp[j] = count[a]
                elif (j >= a and dp[j - a] > 0):
                    dp[j] = dp[j - a] - 1
    result = S
    for i in range(S // 2 + 1):
        if dp[i] >= 0:
            result = min(result, S - 2 * i)
    return result
(note that since the final iteration only considers sums up until S // 2 + 1, we can save some space and time by only creating a DP Cache up until that value as well)
The Java answer provided by fladam returns wrong result for input [2, 3, 2, 2, 3], although it gets 100% score.
Java Solution
import java.util.Arrays;
public class MinAbsSum{
static int[] dp;
public static void main(String args[]) {
int[] array = {1, 5, 2, -2};
System.out.println(findMinAbsSum(array));
}
public static int findMinAbsSum(int[] A) {
int arrayLength = A.length;
int M = 0;
for (int i = 0; i < arrayLength; i++) {
A[i] = Math.abs(A[i]);
M = Math.max(A[i], M);
}
int S = sum(A);
dp = new int[S + 1];
int[] count = new int[M + 1];
for (int i = 0; i < arrayLength; i++) {
count[A[i]] += 1;
}
Arrays.fill(dp, -1);
dp[0] = 0;
for (int i = 1; i < M + 1; i++) {
if (count[i] > 0) {
for(int j = 0; j < S; j++) {
if (dp[j] >= 0) {
dp[j] = count[i];
} else if (j >= i && dp[j - i] > 0) {
dp[j] = dp[j - i] - 1;
}
}
}
}
int result = S;
for (int i = 0; i < Math.floor(S / 2) + 1; i++) {
if (dp[i] >= 0) {
result = Math.min(result, S - 2 * i);
}
}
return result;
}
public static int sum(int[] array) {
int sum = 0;
for(int i : array) {
sum += i;
}
return sum;
}
}

I invented another solution, better than the previous one. I do not use recursion any more.
This solution works OK (all logical tests passed), and also passed some of the performance tests, but not all. How else can I improve it?
#include <cstdlib> // std::abs
#include <vector>
#include <set>
using namespace std;
int solution(vector<int> &A) {
if (A.size() == 0) return 0;
set<int> sums, tmpSums;
sums.insert(abs(A[0]));
for (auto it = begin(A) + 1; it != end(A); ++it)
{
for (auto s : sums)
{
tmpSums.insert(abs(s + abs(*it)));
tmpSums.insert(abs(s - abs(*it)));
}
sums = tmpSums;
tmpSums.clear();
}
return *sums.begin();
}

This solution (in Java) scored 100% for both (correctness and performance)
public int solution(int[] a){
if (a.length == 0) return 0;
if (a.length == 1) return a[0];
int sum = 0;
for (int i=0;i<a.length;i++){
sum += Math.abs(a[i]);
}
int[] indices = new int[a.length];
indices[0] = 0;
int half = sum/2;
int localSum = Math.abs(a[0]);
int minLocalSum = Integer.MAX_VALUE;
int placeIndex = 1;
for (int i=1;i<a.length;i++){
if (localSum<half){
if (Math.abs(2*minLocalSum-sum) > Math.abs(2*localSum - sum))
minLocalSum = localSum;
localSum += Math.abs(a[i]);
indices[placeIndex++] = i;
}else{
if (localSum == half)
return Math.abs(2*half - sum);
if (Math.abs(2*minLocalSum-sum) > Math.abs(2*localSum - sum))
minLocalSum = localSum;
if (placeIndex > 1) {
localSum -= Math.abs(a[indices[placeIndex--]]);
i = indices[placeIndex];
}
}
}
return (Math.abs(2*minLocalSum - sum));
}
This solution treats all elements as positive numbers and tries to get as close as it can to half of the sum of all elements (in that case the sum of the remaining elements is the same distance from the half, so the absolute sum is the minimum possible).
It does so by starting with the first element and successively adding others to the "local" sum (recording the indices of the elements in the sum) until it reaches a sum x >= sumAll/2. If x equals sumAll/2 we have an optimal solution. If not, we step back in the indices array and continue picking other elements from where the last iteration at that position ended. The result is a "local" sum for which abs((sumAll - sum) - sum) is closest to 0.
Fixed solution:
public static int solution(int[] a){
if (a.length == 0) return 0;
if (a.length == 1) return Math.abs(a[0]);
int sum = 0;
for (int i=0;i<a.length;i++) {
a[i] = Math.abs(a[i]);
sum += a[i];
}
Arrays.sort(a);
int[] arr = a;
int[] arrRev = new int[arr.length];
int minRes = Integer.MAX_VALUE;
for (int t=0;t<=4;t++) {
arr = fold(arr);
int res1 = findSum(arr, sum);
if (res1 < minRes) minRes = res1;
rev(arr, arrRev);
int res2 = findSum(arrRev, sum);
if (res2 < minRes) minRes = res2;
arrRev = fold(arrRev);
int res3 = findSum(arrRev, sum);
if (res3 < minRes) minRes = res3;
}
return minRes;
}
private static void rev(int[] arr, int[] arrRev){
for (int i = 0; i < arrRev.length; i++) {
arrRev[i] = arr[arr.length - 1 - i];
}
}
private static int[] fold(int[] a){
int[] arr = new int[a.length];
for (int i=0;a.length/2+i/2 < a.length && a.length/2-i/2-1 >= 0;i+=2){
arr[i] = a[a.length/2+i/2];
arr[i+1] = a[a.length/2-i/2-1];
}
if (a.length % 2 > 0) arr[a.length-1] = a[a.length-1];
else{
arr[a.length-2] = a[0];
arr[a.length-1] = a[a.length-1];
}
return arr;
}
private static int findSum(int[] arr, int sum){
int[] indices = new int[arr.length];
indices[0] = 0;
double half = Double.valueOf(sum)/2;
int localSum = Math.abs(arr[0]);
int minLocalSum = Integer.MAX_VALUE;
int placeIndex = 1;
for (int i=1;i<arr.length;i++){
if (localSum == half)
return 2*localSum - sum;
if (Math.abs(2*minLocalSum-sum) > Math.abs(2*localSum - sum))
minLocalSum = localSum;
if (localSum<half){
localSum += Math.abs(arr[i]);
indices[placeIndex++] = i;
}else{
if (placeIndex > 1) {
localSum -= Math.abs(arr[indices[--placeIndex]]);
i = indices[placeIndex];
}
}
}
return Math.abs(2*minLocalSum - sum);
}

The following is a rendering of the official answer in C++ (scoring 100% in task, correctness, and performance):
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <numeric>
#include <vector>
using namespace std;
int solution(vector<int> &A) {
// write your code in C++14 (g++ 6.2.0)
const int N = A.size();
int M = 0;
for (int i=0; i<N; i++) {
A[i] = abs(A[i]);
M = max(M, A[i]);
}
int S = accumulate(A.begin(), A.end(), 0);
vector<int> counts(M+1, 0);
for (int i=0; i<N; i++) {
counts[A[i]]++;
}
vector<int> dp(S+1, -1);
dp[0] = 0;
for (int a=1; a<M+1; a++) {
if (counts[a] > 0) {
for (int j=0; j<S; j++) {
if (dp[j] >= 0) {
dp[j] = counts[a];
} else if ((j >= a) && (dp[j-a] > 0)) {
dp[j] = dp[j-a]-1;
}
}
}
}
int result = S;
for (int i =0; i<(S/2+1); i++) {
if (dp[i] >= 0) {
result = min(result, S-2*i);
}
}
return result;
}

You are about 90% of the way to the actual solution. It seems you understand recursion very well. Now you should apply dynamic programming to your program.
Dynamic programming here is memoization applied to the recursion, so that we do not calculate the same subproblems again and again. If the same subproblem is encountered, we return the previously calculated and memoized value. Memoization can be done with a 2D array, say dp[][], where the first dimension is the current index in the array and the second is the running sum.
For this specific problem, instead of making both calls from each state, you can sometimes greedily decide to skip one call.
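To make the memoization idea concrete, here is a minimal sketch of the recursion from the question with a 2D cache indexed by position and running sum. The names best and solutionMemo are made up for illustration, and the running sum is shifted by S (the sum of absolute values) so it can be used as an array index. It removes the exponential blow-up, but at O(N * S) time and memory it is probably still too heavy for the largest inputs, which is what the counting trick in the official answer addresses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// best(i, s): smallest |final sum| reachable from index i with running sum s.
// memo[i][s + S] caches the answer; -1 means "not computed yet".
static int best(const std::vector<int>& A, std::size_t i, int s, int S,
                std::vector<std::vector<int>>& memo)
{
    if (i == A.size()) return std::abs(s);
    assert(s + S >= 0 && s + S <= 2 * S); // s always stays within [-S, S]
    int& cached = memo[i][s + S];
    if (cached >= 0) return cached;
    cached = std::min(best(A, i + 1, s + A[i], S, memo),
                      best(A, i + 1, s - A[i], S, memo));
    return cached;
}

// Hypothetical entry point mirroring the question's solution().
int solutionMemo(const std::vector<int>& A)
{
    int S = 0;
    for (int x : A) S += std::abs(x);
    std::vector<std::vector<int>> memo(A.size(), std::vector<int>(2 * S + 1, -1));
    return best(A, 0, 0, S, memo);
}
```

Each (index, sum) state is computed once, so the number of recursive evaluations is bounded by N * (2S + 1) instead of 2^N.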

I would like to provide the algorithm and then my implementation in C++. The idea is more or less the same as the official Codility solution, with a constant-factor optimisation added.
Calculate the maximum absolute element of the inputs.
Calculate the absolute sum of the inputs.
Count the number of occurrences of each absolute value in the inputs. Store the results in a vector indexed by value.
Go through each distinct value that occurs.
For each value, go through all possible sums of any number of inputs. It is a slight constant optimisation to go only up to half of the total sum.
For each sum that has been made before, set the remaining count to the occurrence count of the current value.
For each potential sum equal to or greater than the current value, check whether the sum minus the value is reachable with at least one copy of the value still unused, and update the entry at the current sum accordingly. We do not need to check potential sums less than the current value in this iteration, since the value cannot have been used to reach them.
The above nested loop will fill in each possible sum with a value greater than -1.
Go through the possible sums again to look for the reachable sum closest to half of the total. The min abs sum is then twice the distance of that sum from the half, because both groups differ from the half by the same amount.
The runtime complexity of this algorithm is O(N * max(abs(A)) ^ 2), or simply O(N * M ^ 2). That is because the outer loop iterates M times and the inner loop iterates up to S times, where S is at most N * M. Therefore, it is O(M * N * M).
The space complexity of this solution is O(N * M), because we allocate a vector of M + 1 items for the counts and a vector of S + 1 items for the sums, and S is at most N * M.
#include <algorithm>
#include <cstdlib>
#include <vector>
using namespace std;
int solution(vector<int> &A)
{
int M = 0, S = 0;
for (const int e : A) { M = max(abs(e), M); S += abs(e); }
vector<int> counts(M + 1, 0);
for (const int e : A) { ++counts[abs(e)]; }
vector<int> sums(S + 1, -1);
sums[0] = 0;
for (int ci = 1; ci < counts.size(); ++ci) {
if (!counts[ci]) continue;
for (int si = 0; si < S / 2 + 1; ++si) {
if (sums[si] >= 0) sums[si] = counts[ci];
else if (si >= ci and sums[si - ci] > 0) sums[si] = sums[si - ci] - 1;
}
}
int min_abs_sum = S;
for (int i = S / 2; i >= 0; --i) if (sums[i] >= 0) return S - 2 * i;
return min_abs_sum;
}

Let me add my two cents on how to arrive at the 100% solution.
I found it hard to understand the final solution proposed earlier in this thread,
so I started with a warm-up solution that scores 63%, because it is O(N x N x M)
and does not use the fact that M is quite small, so big arrays contain many duplicates.
The key part is to understand how the array isSumPossible is filled and interpreted.
How to fill isSumPossible using the numbers in the input array:
If isSumPossible[sum] >= 0, i.e. the sum is already possible even without the current number, set its value to 1: the count of the current number that is left unused for this sum. It goes to our "reserve", so we can use it later for greater sums.
if (isSumPossible[sum] >= 0) {
isSumPossible[sum] = 1;
}
If isSumPossible[sum] < 0, i.e. the sum is not yet possible with the input numbers considered so far, check whether the
smaller sum, sum - number, is already possible and still has the current number in "reserve" (isSumPossible[sum - number] == 1). If so, do the following:
else if (sum >= number && isSumPossible[sum - number] == 1) {
isSumPossible[sum] = 0;
}
Here isSumPossible[sum] = 0 means that we have used number to compose sum, so it is now possible (>= 0), but we have no number left in "reserve" because we have used it (= 0).
How to interpret the filled array isSumPossible after considering all numbers in the input array:
If isSumPossible[sum] >= 0 then the sum is possible, i.e. it can be reached by adding up some numbers in the given array.
If isSumPossible[sum] < 0 then the sum cannot be reached by adding up any numbers in the given array.
It is simpler to understand why we search for sums only in the interval [0, maxSum/2]:
we want a possible sum that is as close as we can get to maxSum/2.
The ideal case is a possible sum equal to maxSum/2.
Then the remaining numbers in the input array add up to another maxSum/2, which we take with a negative sign, so the two halves cancel and the solution is 0, because maxSum/2 + (-1) * maxSum/2 = 0.
But 0 is the best-case solution and is not always reachable.
In general we look for the minimal delta = (maxSum - sum) - sum,
i.e. the delta closest to 0, which is why we have this:
int result = Integer.MAX_VALUE;
for (int sum = 0; sum < maxSum / 2 + 1; sum++) {
if (isSumPossible[sum] >= 0) {
result = Math.min(result, (maxSum - sum) - sum);
}
}
warm-up solution
public int solution(int[] A) {
if (A == null || A.length == 0) {
return 0;
}
if (A.length == 1) {
return Math.abs(A[0]);
}
int maxSum = 0;
for (int i = 0; i < A.length; i++) {
A[i] = Math.abs(A[i]);
maxSum += A[i];
}
int[] isSumPossible = new int[maxSum + 1];
Arrays.fill(isSumPossible, -1);
isSumPossible[0] = 0;
for (int number : A) {
for (int sum = 0; sum < maxSum / 2 + 1; sum++) {
if (isSumPossible[sum] >= 0) {
isSumPossible[sum] = 1;
} else if (sum >= number && isSumPossible[sum - number] == 1) {
isSumPossible[sum] = 0;
}
}
}
int result = Integer.MAX_VALUE;
for (int sum = 0; sum < maxSum / 2 + 1; sum++) {
if (isSumPossible[sum] >= 0) {
result = Math.min(result, maxSum - 2 * sum);
}
}
return result;
}
After this we can optimize it, using the fact that big arrays contain many duplicate numbers, and arrive at the solution with a 100% score. It is O(M x (N x M)), because maxSum = N x M in the worst case.
public int solution(int[] A) {
if (A == null || A.length == 0) {
return 0;
}
if (A.length == 1) {
return Math.abs(A[0]);
}
int maxNumber = 0;
int maxSum = 0;
for (int i = 0; i < A.length; i++) {
A[i] = Math.abs(A[i]);
maxNumber = Math.max(maxNumber, A[i]);
maxSum += A[i];
}
int[] count = new int[maxNumber + 1];
for (int i = 0; i < A.length; i++) {
count[A[i]]++;
}
int[] isSumPossible = new int[maxSum + 1];
Arrays.fill(isSumPossible, -1);
isSumPossible[0] = 0;
for (int number = 0; number < maxNumber + 1; number++) {
if (count[number] > 0) {
for (int sum = 0; sum < maxSum / 2 + 1; sum++) {
if (isSumPossible[sum] >= 0) {
isSumPossible[sum] = count[number];
} else if (sum >= number && isSumPossible[sum - number] > 0) {
isSumPossible[sum] = isSumPossible[sum - number] - 1;
}
}
}
}
int result = Integer.MAX_VALUE;
for (int sum = 0; sum < maxSum / 2 + 1; sum++) {
if (isSumPossible[sum] >= 0) {
result = Math.min(result, maxSum - 2 * sum);
}
}
return result;
}
I hope I've made it at least a little clear

Kotlin solution
Time complexity: O(N * max(abs(A))**2)
Score: 100%
import kotlin.math.*
fun solution(A: IntArray): Int {
val N = A.size
var M = 0
for (i in 0 until N) {
A[i] = abs(A[i])
M = max(M, A[i])
}
val S = A.sum()
val counts = MutableList(M + 1) { 0 }
for (i in 0 until N) {
counts[A[i]]++
}
val dp = MutableList(S + 1) { -1 }
dp[0] = 0
for (a in 1 until M + 1) {
if (counts[a] > 0) {
for (j in 0 until S) {
if (dp[j] >= 0) {
dp[j] = counts[a]
} else if (j >= a && dp[j - a] > 0) {
dp[j] = dp[j - a] - 1
}
}
}
}
var result = S
for (i in 0 until (S / 2 + 1)) {
if (dp[i] >= 0) {
result = minOf(result, S - 2 * i)
}
}
return result
}

General formula for pairing members of array?

Hello, I have the following problem:
I have an array with a length that is a multiple of 4, e.g.:
{1,2,3,4,5,6,7,8}
I want to know how I can get the numbers in the following pairs: {1,4},{2,3},{5,8},{6,7}, and so on.
Suppose I loop through them and want to get the index of the pair member from my current index:
int myarr[8]={1,2,3,4,5,6,7,8};
for(int i=0;i<8;i++)
    j = func(i);
I have thought of something like this:
f(1)=4
f(4)=1
and I would take f(i) = a * i + b (I think a linear function is enough). This gives f(i) = j = -i + 5. How can I generalise this to more than 4 members? What do you do in cases where you need a general formula for pairing elements?

Basically, if i is odd j would be i+3, otherwise j = i+1;
int func(int i) {
if(i%2 != 0)
return i+3;
else
return i+1;
}
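This will generate
func(1) = 4, func(2) = 3, func(5) = 8, func(6) = 7 // {1,4},{2,3},{5,8},{6,7}.
Note that this only maps the first member of each pair to the second; it is not symmetric (func(4) returns 5, not 1).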

You could keep the incremental iteration and map each i to an index that depends on the current block and the remainder, so that consecutive values of j visit the two members of each pair one after the other:
int myarr[8]={1,2,3,4,5,6,7,8};
int Successor(int i)
{
    int BlockStart = (i / 4) * 4; // first index of the block of 4 containing i
    int Remainder = i % 4;
    int j = 0;
    if ( Remainder == 0 )
        j = 0;
    else if ( Remainder == 1 )
        j = 3;
    else if ( Remainder == 2 )
        j = 1;
    else if ( Remainder == 3 )
        j = 2;
    return BlockStart + j;
}
for(int i = 0; i < 8; i++)
{
    int j = Successor(i); // visits indices 0,3,1,2,4,7,5,6
    // usage of the index
}

About the generalization, this should do it:
auto pairs(const vector<int>& in, int groupLength = 4) {
vector<pair<int, int>> result;
int groups = in.size() / groupLength;
for (int group = 0; group < groups; ++group) {
int i = group * groupLength;
int j = i + groupLength - 1;
while (i < j) {
result.emplace_back(in[i++], in[j--]);
}
}
return result;
}
If you are just looking for a formula to calculate the indices, then in general case it's:
int f(int i, int k = 4) {
return i + k - 2 * (i % k) - 1;
}
It turns out your special case (size 4) is sequence A004444 in OEIS.
In general you have "nimsum n + (size-1)", i.e. n XOR (size - 1), which matches the formula above when size is a power of two.
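The two forms can be put side by side to see that they agree. This is a small sketch with made-up names (partner, partnerXor) using 0-based indices; the XOR version is only meant for block sizes that are powers of two, which the assertion checks.

```cpp
#include <cassert>

// Partner index within blocks of k consecutive 0-based indices:
// the first element of a block pairs with the last, the second with the
// second to last, and so on.
int partner(int i, int k = 4)
{
    return i + k - 2 * (i % k) - 1;
}

// The same mapping written as a nim-sum (XOR). Only valid when k is a
// power of two, because it works by flipping the low bits of i.
int partnerXor(int i, int k = 4)
{
    assert(k > 0 && (k & (k - 1)) == 0);
    return i ^ (k - 1);
}
```

Both functions are their own inverse, so applying either one twice gives back the original index.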

Comparing two vector<bool> with SSE

I have two vector<bool> A and B.
I want to compare them and count the number of elements that are equal:
For example:
A = {0,1,0,1}
B = {0,0,1,1}
Result will be equal to 2.
I can use _mm_cmpeq_epi8, but it only compares 16 elements at a time (i.e. I would have to convert the 0s and 1s to char and then do the comparison).
Is it possible to compare 128 elements each time with SSE (or SIMD instructions)?

If you can either assume that vector<bool> is using contiguous byte-sized elements for storage, or if you can consider using something like vector<uint8_t> instead, then this example should give you a good starting point:
static size_t count_equal(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size()); // vectors must be same size
const size_t n = vec1.size();
const size_t max_block_size = 255 * 16; // max block size before possible overflow
__m128i vcount = _mm_setzero_si128();
size_t i, count = 0;
for (i = 0; i + 16 <= n; ) // for each block
{
size_t m = std::min(n, i + max_block_size);
for ( ; i + 16 <= m; i += 16) // for each vector in block
{
__m128i v1 = _mm_loadu_si128((const __m128i *)&vec1[i]);
__m128i v2 = _mm_loadu_si128((const __m128i *)&vec2[i]);
__m128i vcmp = _mm_cmpeq_epi8(v1, v2); // 0xff in each byte lane that matches
vcount = _mm_sub_epi8(vcount, vcmp); // subtracting -1 adds 1 per match
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128()); // horizontal sum of the 16 byte counters
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4); // update count from current block
vcount = _mm_setzero_si128(); // reset the counters for the next block
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
for ( ; i < n; ++i) // deal with any remaining partial vector
{
count += (vec1[i] == vec2[i]);
}
return count;
}
Note that this is using vector<uint8_t>. If you really have to use vector<bool> and can guarantee that the elements will always be contiguous and byte-sized then you'll just need to coerce the vector<bool> into a const uint8_t * or similar somehow.
Test harness:
#include <algorithm> // std::min
#include <cassert>
#include <cstdint> // uint8_t
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
#include <emmintrin.h> // SSE2
using std::vector;
static size_t count_equal_ref(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size());
const size_t n = vec1.size();
size_t i, count = 0;
for (i = 0 ; i < n; ++i)
{
count += (vec1[i] == vec2[i]);
}
return count;
}
static size_t count_equal(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size()); // vectors must be same size
const size_t n = vec1.size();
const size_t max_block_size = 255 * 16; // max block size before possible overflow
__m128i vcount = _mm_setzero_si128();
size_t i, count = 0;
for (i = 0; i + 16 <= n; ) // for each block
{
size_t m = std::min(n, i + max_block_size);
for ( ; i + 16 <= m; i += 16) // for each vector in block
{
__m128i v1 = _mm_loadu_si128((const __m128i *)&vec1[i]);
__m128i v2 = _mm_loadu_si128((const __m128i *)&vec2[i]);
__m128i vcmp = _mm_cmpeq_epi8(v1, v2); // 0xff in each byte lane that matches
vcount = _mm_sub_epi8(vcount, vcmp); // subtracting -1 adds 1 per match
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128()); // horizontal sum of the 16 byte counters
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4); // update count from current block
vcount = _mm_setzero_si128(); // reset the counters for the next block
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
for ( ; i < n; ++i) // deal with any remaining partial vector
{
count += (vec1[i] == vec2[i]);
}
return count;
}
int main(int argc, char * argv[])
{
size_t n = 100;
if (argc > 1)
{
n = atoi(argv[1]);
}
vector<uint8_t> vec1(n);
vector<uint8_t> vec2(n);
srand((unsigned int)time(NULL));
for (size_t i = 0; i < n; ++i)
{
vec1[i] = rand() & 1;
vec2[i] = rand() & 1;
}
size_t n_ref = count_equal_ref(vec1, vec2);
size_t n_test = count_equal(vec1, vec2);
if (n_ref == n_test)
{
std::cout << "PASS" << std::endl;
}
else
{
std::cout << "FAIL: n_ref = " << n_ref << ", n_test = " << n_test << std::endl;
}
return 0;
}
Compile and run:
$ g++ -Wall -msse3 -O3 test.cpp && ./a.out
PASS
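
On the original wish to compare 128 elements per step: if the flags are stored one per bit rather than one per byte, an XOR followed by a population count handles 64 positions per word. This is a sketch under assumptions and not part of the answer above: the function name and the bit layout (bit i in word i / 64 at position i % 64) are hypothetical, and std::bitset<64>::count is used as a portable popcount.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts positions where the first n bits of two packed bit arrays agree.
// Assumed layout: bit i lives in word i / 64 at bit position i % 64.
std::size_t countEqualBits(const std::vector<std::uint64_t>& a,
                           const std::vector<std::uint64_t>& b, std::size_t n)
{
    assert(a.size() == b.size() && n <= a.size() * 64);
    std::size_t differing = 0;
    const std::size_t fullWords = n / 64;
    for (std::size_t w = 0; w < fullWords; ++w)
        differing += std::bitset<64>(a[w] ^ b[w]).count(); // set bits mark mismatches
    const std::size_t rest = n % 64;
    if (rest != 0)
    {
        // Keep only the low `rest` bits of the final, partially used word.
        const std::uint64_t mask = (std::uint64_t{1} << rest) - 1;
        differing += std::bitset<64>((a[fullWords] ^ b[fullWords]) & mask).count();
    }
    return n - differing;
}
```

With the example from the question, A = {0,1,0,1} packs to 0xA and B = {0,0,1,1} packs to 0xC under this layout, and the function returns 2 for n = 4.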

std::vector<bool> is a specialization of std::vector for the type bool. Although not required by the C++ standard, in most implementations std::vector<bool> is made space efficient, such that each of its elements is a single bit instead of a bool.
The behaviour of std::vector<bool> is similar to its primary template counterpart, except that:
std::vector<bool> does not necessarily store its elements contiguously.
In order to expose its elements (i.e. the individual bits), std::vector<bool> uses a proxy class (i.e. std::vector<bool>::reference). Objects of class std::vector<bool>::reference are returned by value from the std::vector<bool> subscript operator (i.e. operator[]).
Accordingly, I don't think it's portable to use functions like _mm_cmpeq_epi8, since the storage of a std::vector<bool> is implementation defined (i.e. not guaranteed to be contiguous).
An alternative, portable way is to use regular STL facilities, as in the example below:
std::vector<bool> A = {0,1,0,1};
std::vector<bool> B = {0,0,1,1};
std::vector<bool> C(A.size());
std::transform(A.begin(), A.end(), B.begin(), C.begin(), [](bool const &a, bool const &b) { return a == b;});
std::cout << std::count(C.begin(), C.end(), true) << std::endl;
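If the temporary vector C is not needed, the same count can be accumulated directly. The following is a minimal sketch using std::inner_product with an equality test as the "multiply" step; the name countEqual is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: counts positions where two vector<bool> agree,
// without building an intermediate vector.
std::size_t countEqual(const std::vector<bool>& a, const std::vector<bool>& b)
{
    assert(a.size() == b.size());
    return std::inner_product(
        a.begin(), a.end(), b.begin(), std::size_t{0},
        std::plus<std::size_t>(), // accumulate the per-element results
        [](bool x, bool y) { return x == y ? std::size_t{1} : std::size_t{0}; });
}
```

Taking the lambda parameters as plain bool sidesteps the proxy reference type, since each element is converted to bool before the comparison.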

Why is this Segfault Occuring?

So I've traced the segfault to the line, but I don't understand why this is a segfault. Can someone elaborate on the error of my ways?
Here are the variable declarations.
size_t i, j, n, m, chunk_size, pixel_size;
i = j = n = m = 0;
chunk_size = 256;
pixel_size = 4;
Here are the array declarations.
uint8_t** values = new uint8_t*[chunk_size];
for (i = 0; i < chunk_size; ++ i)
values[i] = new uint8_t[chunk_size];
float** a1 = new float*[chunk_size];
for (i = 0; i < chunk_size; ++i)
a1[i] = new float[chunk_size];
And here is where the segfault occurs.
float delta, d;
for (i = 0; i < 256; ++i) {
for (j = n = m = d = 0; j < 256; j = m) {
while (i == 0 || d != 0) {
d = a1[i][m]; // <------ SEGFAULT per GDB
++m;
}
delta = (d - a1[i][j]) / m;
n = j + 1;
while (n < j + m) {
a1[i][n] = a1[i][n - 1] + delta;
++n;
}
}
}
I'm fairly new to C++ and can't figure out why this would be a segfault. Is this not the proper way to assign the value of an array element to a variable? Is that the source of my segfault?
Note: The point of this whole thing is to expand a 4x4 array to a 256x256 array with my simpleton interpolation formula.

while (i == 0 || d != 0) {
    d = a1[i][m]; // <------ SEGFAULT per GDB
    ++m;
}
This is an endless while loop in some cases (e.g. in the first iteration of the outer loop, where i == 0 keeps the condition true regardless of d).

Your outer loop starts out with i = 0 and the inner loop starts with d = 0, and the logic controlling the while loop is not sufficient (see code comment).
for (i = 0; i < 256; ++i) {
    for (j = n = m = d = 0; j < 256; j = m) {
        // Here i == 0 is ALWAYS true (so d != 0 is ignored due to
        // short-circuit evaluation) and then 'm' is continuously incremented
        // until it goes out of bounds
        while (i == 0 || d != 0) {
            d = a1[i][m]; // <------ SEGFAULT per GDB
            ++m;
        }
        delta = (d - a1[i][j]) / m;
        n = j + 1;
        while (n < j + m) {
            a1[i][n] = a1[i][n - 1] + delta;
            ++n;
        }
    }
}

The problem is in the following lines:
while (i == 0 || d != 0) {
    d = a1[i][m]; // <------ SEGFAULT per GDB
    ++m;
}
Your while loop keeps going for as long as i equals 0. Since i never changes inside the while loop, m keeps incrementing until it runs out of bounds, which causes the segfault you are seeing.
Make sure i and m stay within the allocated range before indexing, and the out-of-bounds access will go away.
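As an illustration of keeping the index in range, here is a minimal sketch of a bounded scan. It assumes the intent of the inner loop is to move forward to the next non-zero sample in a row, which the question does not state explicitly, and the name nextNonZero is made up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the first index in [from, len) whose value
// is non-zero, or len if there is none. The `m < len` test is the bound
// check that the loop in the question is missing.
std::size_t nextNonZero(const float* row, std::size_t len, std::size_t from)
{
    assert(from <= len);
    std::size_t m = from;
    while (m < len && row[m] == 0.0f)
        ++m;
    return m;
}
```

A caller would compare the result against len before reading row[result], so a row with no further non-zero samples is handled instead of running off the end.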
