Add uchar values in ushort array with SSE2 or SSE3 - c++

I have an unsigned short dst[16][16] matrix and a larger unsigned char src[m][n] matrix.
I need to read a 16x16 submatrix from src and add it to dst, using SSE2 or SSE3.
In an older implementation, I was sure that my summed values always fit in 8 bits (at most 255), so I could do this:
for (int row = 0; row < 16; ++row) {
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    dst[row] = _mm_add_epi8(dst[row], subMat);
    src += W; // Step to the next row I need to add
}
where W is the offset to reach the desired rows. This code works, but now the values in src are larger and the sums can exceed 255, so I need to store them as ushort.
I've tried the following, but it doesn't work.
for (int row = 0; row < 16; ++row) {
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    dst[row] = _mm_add_epi16(dst[row], subMat);
    src += W; // Step to the next row I need to add
}
How can I solve this problem?
Thank you paul, but I think your offsets are wrong. I've tried your solution and seems that submatrix's rows are added to the wrong dst's rows. I hope the right solution is this:
for (int row = 0; row < 32; row += 2) {
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i subMatLo = _mm_unpacklo_epi8(subMat, _mm_set1_epi8(0));
    __m128i subMatHi = _mm_unpackhi_epi8(subMat, _mm_set1_epi8(0));
    dst[row] = _mm_add_epi16(dst[row], subMatLo);
    dst[row + 1] = _mm_add_epi16(dst[row + 1], subMatHi);
    src += W;
}

You need to unpack your vector of 16 x 8 bit values into two vectors of 8 x 16 bit values and then add both these vectors to your destination:
for (int row = 0; row < 16; ++row) {
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i subMatLo = _mm_unpacklo_epi8(subMat, _mm_set1_epi8(0));
    __m128i subMatHi = _mm_unpackhi_epi8(subMat, _mm_set1_epi8(0));
    dst[row] = _mm_add_epi16(dst[row], subMatLo);
    dst[row + 1] = _mm_add_epi16(dst[row + 1], subMatHi);
    src += W;
}
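For reference, here is a minimal self-contained sketch of the unpack-and-add idea, written directly against an unsigned short dst[16][16] so that each destination row maps to exactly two 8-lane vectors. The function name add_block_u8_to_u16 and the stride parameter (playing the role of W) are assumptions for illustration, and unaligned loads and stores are used so no alignment is required.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Adds a 16x16 block of bytes, whose rows are `stride` bytes apart, into dst.
// Hypothetical helper; `stride` corresponds to W in the question.
void add_block_u8_to_u16(unsigned short dst[16][16],
                         const unsigned char* src,
                         std::size_t stride) {
    assert(src != nullptr);
    const __m128i zero = _mm_setzero_si128();
    for (int row = 0; row < 16; ++row) {
        // Load 16 bytes of the current source row.
        const __m128i bytes =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
        // Zero-extend to 16 bits: low 8 bytes and high 8 bytes.
        const __m128i lo = _mm_unpacklo_epi8(bytes, zero);
        const __m128i hi = _mm_unpackhi_epi8(bytes, zero);
        // One destination row is 16 ushorts, i.e. two __m128i vectors.
        __m128i* d = reinterpret_cast<__m128i*>(dst[row]);
        _mm_storeu_si128(d,     _mm_add_epi16(_mm_loadu_si128(d),     lo));
        _mm_storeu_si128(d + 1, _mm_add_epi16(_mm_loadu_si128(d + 1), hi));
        src += stride;
    }
}
```

The sums wrap around at 65535, so this is only safe while the accumulated values fit in 16 bits.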


Add values serially in the same SIMD register

I'm trying to convert this to AVX2:
// parallel arrays
int16_t* Nums = ...
int16_t* Capacities = ...
int** Data = ...
int* freePointer = ...
for (int i = 0; i < n; i++) {
    if (Nums[i] == 0) {
        Capacities[i] = 0;
    } else {
        Data[i] = freePointer;
        freePointer += Capacities[i];
    }
}
But didn't get too far:
for (int i = 0; i < n; i += 4) { // 4 as Data is 64 bits
    const __m256i nums = _mm256_loadu_si256((__m256i*)&Nums[i]);
    const __m256i bZeroes = _mm256_cmpeq_epi16(nums, ZEROES256);
    const __m256i capacities = _mm256_loadu_si256((__m256i*)&Capacities[i]);
    const __m256i zeroedCapacities = _mm256_andnot_si256(bZeroes, capacities);
    _mm256_storeu_si256((__m256i*)&Capacities[i], zeroedCapacities);
}
Stuck at the else branch, not sure how to add (prefix sum?...) Capacities into freePointer and assign the "serial" results to Data in the same 256-bit SIMD register.
My terminology is probably off, I hope the code gets across what I'm trying to accomplish.
lane0: freePointer
lane1: freePointer + Capacities[i + 0]
lane2: freePointer + Capacities[i + 0] + Capacities[i + 1]
lane3: freePointer + Capacities[i + 0] + Capacities[i + 1] + Capacities[i + 2]
Basically this is what I want to do in as few instructions as possible, if at all possible. Target is AVX2.
You can find a lot of details here:
Here you can plug in any type instead of T and U see the resulting asm for x86 and arm
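One possible way to get the lane layout described in the question is a shift-and-add prefix sum inside a single register. The sketch below is only an illustration under stated assumptions: it works on four 32-bit lanes with SSE2 (which every AVX2 target supports), the names exclusive_prefix_sum4 and block_offsets4 are hypothetical, and it assumes the capacities of entries with Nums[i] == 0 have already been zeroed. It produces element offsets rather than pointers; the caller would then set Data[i + k] = freePointer + offsets[k] and advance freePointer by the returned total, which also writes Data for the zeroed entries, unlike the scalar loop.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// For input lanes [v0, v1, v2, v3] returns [0, v0, v0+v1, v0+v1+v2].
inline __m128i exclusive_prefix_sum4(__m128i v) {
    // After this step: [v0, v0+v1, v1+v2, v2+v3]
    __m128i x = _mm_add_epi32(v, _mm_slli_si128(v, 4));
    // After this step: inclusive scan [v0, v0+v1, v0+v1+v2, v0+v1+v2+v3]
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));
    // Shift up by one lane to turn the inclusive scan into an exclusive one.
    return _mm_slli_si128(x, 4);
}

// Hypothetical helper: writes the offsets of four consecutive capacities
// and returns their total, i.e. how far the free pointer must advance.
inline std::int32_t block_offsets4(const std::int16_t* caps,
                                   std::int32_t offsets[4]) {
    assert(caps != nullptr && offsets != nullptr);
    const __m128i c = _mm_setr_epi32(caps[0], caps[1], caps[2], caps[3]);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(offsets),
                     exclusive_prefix_sum4(c));
    return offsets[3] + caps[3];
}
```

The carry between blocks (the running freePointer) stays serial; only the sums within a block are computed in the register.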

Tensorflow cpp with Eigen element wise multiply

I am trying to do an element wise multiply for my own op in tensorflow + Eigen. This is a simple version of what I am currently using:
// eg) temp_res_shape = [3, 8], temp_b_shape = [1, 8]
// allocate Tensorflow tensors
Tensor temp_res;
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<complex64>::v(),
temp_res_shape, &temp_res));
Tensor temp_a;
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<complex64>::v(),
temp_res_shape, &temp_a));
Tensor temp_b;
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<complex64>::v(),
temp_b_shape, &temp_b));
// These actually come from different places but the type/shape is right
// ie) I want to do this within Eigen::TensorMap if possible
auto mult_res = Tensor(temp_res).flat_inner_dims<complex64, 2>();
auto a_in = Tensor(temp_a).flat_inner_dims<complex64, 2>();
auto b_in = Tensor(temp_b).flat_inner_dims<complex64, 2>();
// convert to an array
auto a_data =;
auto b_data =;
auto res =;
for ( int i = 0; i < 3; i++ ) {
  for ( int j = 0; j < 8; j++ ) {
    res[ i*8 + j ] = a_data[ i*8 + j ] * b_data[j];
  }
}
This is obviously the wrong way to do it but I couldn't get anything else working. I feel like it should be something of the form:
mult_res.device( device ) = a_in * b_in;
But that does the matrix multiply. I couldn't figure out how to convert b_in to a diagonal matrix to multiply that way either :/
I feel like this should be trivial but I can't work it out (my cpp is not great). Thanks in advance!
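What the loop above is reaching for is a row-wise broadcast: every row of the [3, 8] input is multiplied element by element with the single [1, 8] row. The sketch below shows that operation in plain C++ on row-major buffers. It is not TensorFlow or Eigen code; multiply_rows is a hypothetical name and std::complex<float> stands in for complex64. In Eigen's Tensor module the same effect is usually expressed with the broadcast operation (something along the lines of a_in * b_in.broadcast(bcast) with bcast = {3, 1}), which is not exercised here.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using complex64 = std::complex<float>;  // stand-in for TensorFlow's complex64

// Computes res[i][j] = a[i][j] * b[j] on row-major data.
// `a` has rows*cols elements, `b` has cols elements (a single row).
std::vector<complex64> multiply_rows(const std::vector<complex64>& a,
                                     const std::vector<complex64>& b,
                                     std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols && b.size() == cols);
    std::vector<complex64> res(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // Same column index j for both operands: b is reused for every row.
            res[i * cols + j] = a[i * cols + j] * b[j];
        }
    }
    return res;
}
```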

Computing Rand error efficiently

I'm trying to compare two image segmentations to one another.
In order to do so, I transform each image into a vector of unsigned short values, and calculate the rand error,
according to the following formula:
Here is my code (the rand error calculation part):
cv::Mat im1,im2;
//code for acquiring data for im1, im2
//code for copying im1(:)->v1, im2(:)->v2
int N = v1.size();
double a = 0;
double b = 0;
for (int i = 0; i < N; i++) {
    for (int j = 0; j < i; j++) {
        unsigned short l1 = v1[i];
        unsigned short l2 = v1[j];
        unsigned short gt1 = v2[i];
        unsigned short gt2 = v2[j];
        if (l1 == l2 && gt1 == gt2)
            a++;    // pair grouped together in both segmentations
        else if (l1 != l2 && gt1 != gt2)
            b++;    // pair separated in both segmentations
    }
}
double NPairs = (double)N * N / 2;  // cast first: N*N overflows int
double res = (a + b) / NPairs;
My problem is that length of each vector is 307,200.
Therefore the total number of iterations is 47,185,920,000.
This makes the whole process very slow (a few minutes to compute).
Do you have any idea how I can improve it?
Let's assume that we have P distinct labels in the first image and Q distinct labels in the second image. The key observation for efficient computation of Rand error, also called Rand index, is that the number of distinct labels is usually much smaller than the number of pixels (i.e. P, Q << n).
Step 1
First, pre-compute the following auxiliary data:
the vector s1, with size P, such that s1[p] is the number of pixel positions i with v1[i] = p.
the vector s2, with size Q, such that s2[q] is the number of pixel positions i with v2[i] = q.
the matrix M, with size P x Q, such that M[p][q] is the number of pixel positions i with v1[i] = p and v2[i] = q.
The vectors s1, s2 and the matrix M can be computed by passing once through the input images, i.e. in O(n).
Step 2
Once s1, s2 and M are available, a and b can be computed efficiently:
$$a = \sum_{p=1}^{P} \sum_{q=1}^{Q} \frac{M[p][q]\,(M[p][q] - 1)}{2}$$
This holds because each pair of pixels (i, j) that we are interested in has the property that both its pixels have the same label in image 1, i.e. v1[i] = v1[j] = p, and the same label in image 2, i.e. v2[i] = v2[j] = q. Since v1[i] = p and v2[i] = q, pixel i contributes to the bin M[p][q], and so does pixel j. Therefore, for each combination of labels p and q we need to count the pairs of pixels that fall into the M[p][q] bin, and then sum these counts over all possible labels p and q.
Similarly, for b we have:
$$b = \frac{1}{2} \sum_{p=1}^{P} \sum_{q=1}^{Q} M[p][q]\,\bigl(n - s1[p] - s2[q] + M[p][q]\bigr)$$
Here, we are counting how many pairs are formed with one of the pixels falling into the bin M[p][q]. Such a pixel can form a good pair with each pixel that is falling into a bin M[p'][q'], with the condition that p != p' and q != q'. Summing over all such M[p'][q'] is equivalent to subtracting from the sum over the entire matrix M (this sum is n) the sum on row p (i.e. s1[p]) and the sum on the column q (i.e. s2[q]). However, after subtracting the row and column sums, we have subtracted M[p][q] twice, and this is why it is added at the end of the expression above. Finally, this is divided by 2 because each pair was counted twice (once for each of its two constituent pixels as being part of a bin M[p][q] in the argument above).
The Rand error (Rand index) can now be computed as:
$$R = \frac{a + b}{\binom{n}{2}} = \frac{2\,(a + b)}{n\,(n - 1)}$$
The overall complexity of this method is O(n) + O(PQ), with the first term usually being the dominant one.
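The two steps can be sketched in C++ as follows. This is an illustration rather than a reference implementation: the names RandCounts and rand_index are hypothetical, the matrix M is stored sparsely in a std::map (an assumption made so that labels need not be contiguous; empty bins contribute nothing to either sum), and the input is assumed to contain at least two pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct RandCounts {
    double a;      // pairs with equal labels in both images
    double b;      // pairs with different labels in both images
    double index;  // (a + b) / (n choose 2)
};

RandCounts rand_index(const std::vector<unsigned short>& v1,
                      const std::vector<unsigned short>& v2) {
    assert(v1.size() == v2.size() && v1.size() >= 2);
    const double n = static_cast<double>(v1.size());

    // Step 1: one pass over the pixels builds s1, s2 and the non-empty bins of M.
    std::map<unsigned short, double> s1, s2;
    std::map<std::pair<unsigned short, unsigned short>, double> M;
    for (std::size_t i = 0; i < v1.size(); ++i) {
        s1[v1[i]] += 1;
        s2[v2[i]] += 1;
        M[std::make_pair(v1[i], v2[i])] += 1;
    }

    // Step 2: accumulate a and b over the bins.
    double a = 0, b = 0;
    for (const auto& bin : M) {
        const double m = bin.second;
        a += m * (m - 1) / 2;
        b += m * (n - s1[bin.first.first] - s2[bin.first.second] + m) / 2;
    }
    return RandCounts{a, b, (a + b) / (n * (n - 1) / 2)};
}
```

For v1 = [1, 1, 2] and v2 = [1, 2, 1] this gives a = 0, b = 1 and an index of 1/3.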
After reading your comments, I tried the following approach:
calculate the intersections for each possible pair of values.
use the intersection results to calculate the error.
I performed the calculation straight on the cv::Mat objects, without converting them into std::vector objects. That gave me the ability to use opencv functions and achieve a faster runtime.
double a = 0, b = 0; //init variables
//unique function finds all the unique values of a matrix, with an optional input mask
std::set<unsigned short> m1Vals = unique(mat1);
for (unsigned short s1 : m1Vals) {
    cv::Mat mask1 = (mat1 == s1);
    std::set<unsigned short> m2ValsInRoi = unique(mat2, mat1==s1);
    for (unsigned short s2 : m2ValsInRoi) {
        cv::Mat mask2 = mat2 == s2;
        cv::Mat andMask = mask1 & mask2;
        double andVal = cv::countNonZero(andMask);
        a += (andVal*(andVal - 1)) / 2;
        b += ((double)cv::countNonZero(andMask) * (double)cv::countNonZero(~mask1 & ~mask2)) / 2;
    }
}
double NPairs = (double)N * (N - 1) / 2;  // cast first to avoid int overflow
double res = (a + b) / NPairs;
The runtime is now reasonable (only a few milliseconds vs a few minutes), and the output is the same as the code above.
I ran the code on the following matrices:
//mat1 = [1 1 2]
cv::Mat mat1 = cv::Mat::ones(cv::Size(3, 1), CV_16U);
mat1.at<ushort>(cv::Point(2, 0)) = 2;
//mat2 = [1 2 1]
cv::Mat mat2 = cv::Mat::ones(cv::Size(3, 1), CV_16U);
mat2.at<ushort>(cv::Point(1, 0)) = 2;
In this case a = 0 (no pair shares a label in both matrices) and b = 1 (one pair, i=2, j=3, has different labels in both). The algorithm's result:
a = 0
b = 1
NPairs = 3
result = 0.3333333
Thank you all for your help!

Trying to compute my own Histogram without opencv calcHist()

I'm trying to write a function that calculates the histogram of a greyscale image for a given number of bins (anzBin), into which the histogram's range is divided. I then run through the image pixels, compare each value against the different bins and, when a value fits, increase that bin's count by 1.
vector<int> calcuHisto(const IplImage *src_pic, int anzBin) {
    CvSize size = cvGetSize(src_pic);
    int binSize = (size.width / 256)*anzBin;
    vector<int> histogram(anzBin,0);
    for (int y = 0; y<size.height; y++) {
        const uchar *src_pic_point =
            (uchar *)(src_pic->imageData + y*src_pic->widthStep);
        for (int x = 0; x<size.width; x++) {
            for (int z = 0; z < anzBin; z++) {
                if (src_pic_point[x] <= z*binSize)
                    histogram[src_pic_point[x]]++;
            }
        }
    }
    return histogram;
}
But unfortunately it's not working...
What is wrong here?
Please help
There are a few issues I can see
Your binSize calculation is wrong
Your binning algorithm is one sided, and should be two sided
You aren't incrementing the proper bin when you find a match
1. binsize calculation
bin size = your range / number of bins
2. two sided binning
if (src_pic_point[x] <= z*binSize)
you need a two sided range of values, not a one sided inequality. Imagine you have 4 bins and values from 0 to 255. Your bins should have the following ranges
bin   low      high
0     0        63.75
1     63.75    127.5
2     127.5    191.25
3     191.25   255
For example, a value of 57 should go in bin 0. Your code instead counts it in every bin whose z*binSize is at least 57, because the only test is value <= z*binSize. You need a check with both a lower and an upper bound.
3. Incrementing the appropriate bin
You are using z to loop over each bin, so when you find a match you should increment bin z. The actual pixel value is used only to determine which bin it belongs to.
Indexing the histogram with the pixel value would likely cause a buffer overrun. Imagine again that you have 4 bins and the current pixel has a value of 57: the code would increment bin 57, but you only have 4 bins (0-3).
You want to increment only the bin the pixel value falls into.
With that in mind here is revised code (it is untested, but should work)
vector<int> calcuHisto(const IplImage *src_pic, int anzBin) {
    CvSize size = cvGetSize(src_pic);
    double binSize = 256.0 / anzBin;    //new definition
    vector<int> histogram(anzBin,0);    //I don't know if this works, so I will leave it
    //goes through all rows
    for (int y = 0; y<size.height; y++) {
        //grabs an entire row of the imageData
        const uchar *src_pic_point = (uchar *)(src_pic->imageData + y*src_pic->widthStep);
        //goes through each column
        for (int x = 0; x<size.width; x++) {
            //for each bin
            for (int z = 0; z < anzBin; z++) {
                //check both upper and lower limits
                if (src_pic_point[x] >= z*binSize && src_pic_point[x] < (z+1)*binSize) {
                    //increment the index that contains the point
                    histogram[z]++;
                }
            }
        }
    }
    return histogram;
}
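Since the bins are equally wide, the inner loop over z can be avoided by computing the bin index directly from the pixel value. The sketch below illustrates this on a plain byte buffer instead of an IplImage. The function name histogram_u8 and the std::vector input are assumptions for illustration, and the bin width is 256 / numBins as in the revised code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts 8-bit pixel values into numBins equally wide bins covering 0..255.
std::vector<int> histogram_u8(const std::vector<unsigned char>& pixels,
                              int numBins) {
    assert(numBins > 0 && numBins <= 256);
    std::vector<int> histogram(numBins, 0);
    for (unsigned char p : pixels) {
        // Integer form of floor(p / (256.0 / numBins)); always in [0, numBins).
        const int bin = (static_cast<int>(p) * numBins) / 256;
        ++histogram[bin];
    }
    return histogram;
}
```

With 4 bins, the values 0, 57 and 63 fall into bin 0, 64 into bin 1, 128 into bin 2 and 255 into bin 3.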

Square root of a OpenCV's grey image using SSE

Given a grey cv::Mat (CV_8UC1), I want to return another cv::Mat (CV_32FC1) containing the square root of each element, and I want to do it with SSE2 intrinsics. I am having some problems converting the 8-bit values to 32-bit float values before taking the square root. I would really appreciate any help. This is my code for now (it does not give correct values):
uchar *source = (uchar *)cv::alignPtr(, 16);
float *sqDataPtr = cv::alignPtr((float *), 16);
for (x = 0; x < (pixels - 16); x += 16) {
    __m128i a0 = _mm_load_si128((__m128i *)(source + x));
    __m128i first8 = _mm_unpacklo_epi8(a0, _mm_set1_epi8(0));
    __m128i last8 = _mm_unpackhi_epi8(a0, _mm_set1_epi8(0));
    __m128i first4i = _mm_unpacklo_epi16(first8, _mm_set1_epi16(0));
    __m128i second4i = _mm_unpackhi_epi16(first8, _mm_set1_epi16(0));
    __m128 first4 = _mm_cvtepi32_ps(first4i);
    __m128 second4 = _mm_cvtepi32_ps(second4i);
    __m128i third4i = _mm_unpacklo_epi16(last8, _mm_set1_epi16(0));
    __m128i fourth4i = _mm_unpackhi_epi16(last8, _mm_set1_epi16(0));
    __m128 third4 = _mm_cvtepi32_ps(third4i);
    __m128 fourth4 = _mm_cvtepi32_ps(fourth4i);
    // Store
    _mm_store_ps(sqDataPtr + x, _mm_sqrt_ps(first4));
    _mm_store_ps(sqDataPtr + x + 4, _mm_sqrt_ps(second4));
    _mm_store_ps(sqDataPtr + x + 8, _mm_sqrt_ps(third4));
    _mm_store_ps(sqDataPtr + x + 12, _mm_sqrt_ps(fourth4));
}
The SSE code looks OK, except that you're not processing the last 16 pixels:
for (x = 0; x < (pixels - 16); x += 16)
should be:
for (x = 0; x <= (pixels - 16); x += 16)
Note that if your image width is not a multiple of 16 then you will need to take care of any remaining pixels after the last full vector.
Also note that you are taking the sqrt of values in the range 0..255. It may be that you want normalised values in the range 0..1.0, in which case you'll want to scale the values accordingly.
I have no experience with SSE2, but I think that if performance is the issue you should use a look-up table. Creating the look-up table is fast since there are only 256 possible values, and copying 4 bytes from the look-up table into the destination matrix should be a very efficient operation.
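Putting the loop-bound fix and the remainder handling together, a self-contained version might look like the sketch below. The function name sqrt_u8_to_f32 and its raw-pointer interface are assumptions for illustration (no cv::Mat is involved), and unaligned loads and stores are used so the buffers do not need 16-byte alignment.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Writes dst[i] = sqrt(src[i]) for i in [0, n).
void sqrt_u8_to_f32(const std::uint8_t* src, float* dst, std::size_t n) {
    assert(src != nullptr && dst != nullptr);
    const __m128i zero = _mm_setzero_si128();
    std::size_t x = 0;
    // Full vectors: the condition x + 16 <= n includes the last complete block.
    for (; x + 16 <= n; x += 16) {
        const __m128i bytes =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + x));
        // Zero-extend 8 -> 16 bits, then 16 -> 32 bits.
        const __m128i lo16 = _mm_unpacklo_epi8(bytes, zero);
        const __m128i hi16 = _mm_unpackhi_epi8(bytes, zero);
        const __m128i w[4] = {
            _mm_unpacklo_epi16(lo16, zero), _mm_unpackhi_epi16(lo16, zero),
            _mm_unpacklo_epi16(hi16, zero), _mm_unpackhi_epi16(hi16, zero)};
        for (int k = 0; k < 4; ++k) {
            // Convert four 32-bit integers to float and take the square root.
            _mm_storeu_ps(dst + x + 4 * k, _mm_sqrt_ps(_mm_cvtepi32_ps(w[k])));
        }
    }
    // Scalar tail for the pixels left over when n is not a multiple of 16.
    for (; x < n; ++x) {
        dst[x] = std::sqrt(static_cast<float>(src[x]));
    }
}
```

Whether this beats a 256-entry look-up table depends on the hardware and on memory behaviour, so it is worth measuring both.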