Explaining ARM Neon Image Sampling

Explaining ARM Neon Image Sampling - c++

I'm trying to write a better version of cv::resize() of the OpenCV, and I came a cross a code that is here: https://github.com/rmaz/NEON-Image-Downscaling/blob/master/ImageResize/BDPViewController.m
The code is for downsampling an image by 2 but I can not get the algorithm. I would like first to convert that algorithm to C then try to modify it for Learning purposes. Is it easy also to convert it to downsample by any size ?
The function is:
static void inline resizeRow(uint32_t *dst, uint32_t *src, uint32_t pixelsPerRow)
{
const uint32_t * rowB = src + pixelsPerRow;
// force the number of pixels per row to a multiple of 8
pixelsPerRow = 8 * (pixelsPerRow / 8);
__asm__ volatile("Lresizeloop: \n" // start loop
"vld1.32 {d0-d3}, [%1]! \n" // load 8 pixels from the top row
"vld1.32 {d4-d7}, [%2]! \n" // load 8 pixels from the bottom row
"vhadd.u8 q0, q0, q2 \n" // average the pixels vertically
"vhadd.u8 q1, q1, q3 \n"
"vtrn.32 q0, q2 \n" // transpose to put the horizontally adjacent pixels in different registers
"vtrn.32 q1, q3 \n"
"vhadd.u8 q0, q0, q2 \n" // average the pixels horizontally
"vhadd.u8 q1, q1, q3 \n"
"vtrn.32 d0, d1 \n" // fill the registers with pixels
"vtrn.32 d2, d3 \n"
"vswp d1, d2 \n"
"vst1.64 {d0-d1}, [%0]! \n" // store the result
"subs %3, %3, #8 \n" // subtract 8 from the pixel count
"bne Lresizeloop \n" // repeat until the row is complete
: "=r"(dst), "=r"(src), "=r"(rowB), "=r"(pixelsPerRow)
: "0"(dst), "1"(src), "2"(rowB), "3"(pixelsPerRow)
: "q0", "q1", "q2", "q3", "cc"
);
}
To call it:
// downscale the image in place
for (size_t rowIndex = 0; rowIndex < height; rowIndex+=2)
{
void *sourceRow = (uint8_t *)buffer + rowIndex * bytesPerRow;
void *destRow = (uint8_t *)buffer + (rowIndex / 2) * bytesPerRow;
resizeRow(destRow, sourceRow, width);
}

The algorithm is pretty straightforward. It reads 8 pixels from the current line and 8 from the line below. It then uses the vhadd (halving-add) instruction to average the 8 pixels vertically. It then transposes the position of the pixels so that the horizontally adjacent pixel pairs are now in separate registers (arranged vertically). It then does another set of halving-adds to average those together. The result is then transformed again to put them in their original positions and written to the destination. This algorithm could be rewritten to handle different integral sizes of scaling, but as written it can only do 2x2 to 1 reduction with averaging. Here's the C code equivalent:
static void inline resizeRow(uint32_t *dst, uint32_t *src, uint32_t pixelsPerRow)
{
uint8_t * pSrc8 = (uint8_t *)src;
uint8_t * pDest8 = (uint8_t *)dst;
int stride = pixelsPerRow * sizeof(uint32_t);
int x;
int r, g, b, a;
for (x=0; x<pixelsPerRow; x++)
{
r = pSrc8[0] + pSrc8[4] + pSrc8[stride+0] + pSrc8[stride+4];
g = pSrc8[1] + pSrc8[5] + pSrc8[stride+1] + pSrc8[stride+5];
b = pSrc8[2] + pSrc8[6] + pSrc8[stride+2] + pSrc8[stride+6];
a = pSrc8[3] + pSrc8[7] + pSrc8[stride+3] + pSrc8[stride+7];
pDest8[0] = (uint8_t)((r + 2)/4); // average with rounding
pDest8[1] = (uint8_t)((g + 2)/4);
pDest8[2] = (uint8_t)((b + 2)/4);
pDest8[3] = (uint8_t)((a + 2)/4);
pSrc8 += 8; // skip forward 2 source pixels
pDest8 += 4; // skip forward 1 destination pixel
}

Related

C++ performance optimization for linear combination of large matrices?

I have a large tensor of floating point data with the dimensions 35k(rows) x 45(cols) x 150(slices) which I have stored in an armadillo cube container. I need to linearly combine all the 150 slices together in under 35 ms (a must for my application). The linear combination floating point weights are also stored in an armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.
I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.
Please note that Armadillo stores data in column major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then third and so forth.
typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;
int rows = 35000;
int cols = 45;
int slices = 150;
arma::fcube tensor(rows, cols, slices, arma::fill::randu);
arma::fvec w(slices, arma::fill::randu);
double overallTime = 0;
int window = 30;
for (int n = 0; n < window; n++) {
Timer::time_point start = Timer::now();
arma::fmat result(rows, cols, arma::fill::zeros);
for (int i = 0; i < slices; i++)
result += tensor.slice(i) * w(i);
Timer::time_point end = Timer::now();
Duration span = end - start;
double t = span.count();
overallTime += t;
cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
}
cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;
I need to optimize this code by at least 2x and I would very much appreciate any suggestions.

First at all I need to admit, I'm not familiar with the arma framework or the memory layout; the least if the syntax result += slice(i) * weight evaluates lazily.
Two primary problem and its solution anyway lies in the memory layout and the memory-to-arithmetic computation ratio.
To say a+=b*c is problematic because it needs to read the b and a, write a and uses up to two arithmetic operations (two, if the architecture does not combine multiplication and accumulation).
If the memory layout is of form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels and should be expressed as such.
If it's float tensor[c][h][w], it's better to unroll the loop to result+= slice(i) + slice(i+1)+.... Reading four slices at a time reduces the memory transfers by 50%.
It might even be better to process the results in chunks of 4*N results (reading from all the 150 channels/slices) where N<16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.
There's a possibility of a minor improvement by padding the slice count to multiples of 4 or 8, by compiling with -ffast-math to enable fused multiply accumulate (if available) and with multithreading.
The constraints indicate the need to perform 13.5GFlops, which is a reasonable number in terms of arithmetic (for many modern architectures) but also it means at least 54 Gb/s memory bandwidth, which could be relaxed with fp16 or 16-bit fixed point arithmetic.
EDIT
Knowing the memory order to be float tensor[150][45][35000] or float tensor[kSlices][kRows * kCols == kCols * kRows] suggests to me to try first unrolling the outer loop by 4 (or maybe even 5, as 150 is not divisible by 4 requiring special case for the excess) streams.
void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
// ensure that the cols*rows is a multiple of 4 (pad if necessary)
// - allows the auto vectorizer to skip handling the 'excess' code where the data
// length mod simd width != 0
// one could try even SIMD width of 16*4, as clang 14
// can further unroll the inner loop to 4 ymm registers
auto const stride = (kCols * kRows + 3) & ~3;
// try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
for (int s = 0; s < 150; s+=5) {
auto src0 = tensor + s * stride;
auto src1 = src0 + stride;
auto src2 = src1 + stride;
auto src3 = src2 + stride;
auto src4 = src3 + stride;
auto dst = result;
for (int x = 0; x < stride; x++) {
// clang should be able to optimize caching the weights
// to registers outside the innerloop
auto add = src0[x] * w[s] +
src1[x] * w[s+1] +
src2[x] * w[s+2] +
src3[x] * w[s+3] +
src4[x] * w[s+4];
// clang should be able to optimize this comparison
// out of the loop, generating two inner kernels
if (s == 0) {
dst[x] = add;
} else {
dst[x] += add;
}
}
}
}
EDIT 2
Another starting point (before adding multithreading) would be consider changing the layout to
float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional
The downside now is that kSlices = 150 can't anymore fit all the weights in registers (and secondly kSlices is not a multiple of 4 or 8). Furthermore the final reduction needs to be horizontal.
The upside is that reduction no longer needs to go through memory, which is a big thing with the added multithreading.
void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
// each thread will read from 4 positions in order
// to share the weights -- finding the best distance
// might need some iterations
auto src0 = tensor;
auto src1 = src0 + c;
auto src2 = src1 + c;
auto src3 = src2 + c;
for (int i = 0; i < n/4; i++) {
vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
// #pragma unroll?
for (auto j = 0; j < c / 8; c++) {
vec8 w(w + j);
acc0 += w * vec8(src0 + j);
acc1 += w * vec8(src1 + j);
acc2 += w * vec8(src2 + j);
acc3 += w * vec8(src3 + j);
}
vec4 sum = horizontal_reduct(acc0,acc1,acc2,acc3);
sum.store(dst); dst+=4;
}
}
These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to do compile using vec4 = float __attribute__ __attribute__((vector_size(16))); to efficient SIMD code.

As #hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved by almost 65%. The code now runs at 45 ms as opposed to the initial 70 ms.
int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
result += tensor.slice(i) * w_id(i) + tensor.slice(i + 1) * w_id(i + 1) + tensor.slice(i + 2) * w_id(i + 2) + tensor.slice(i + 3) * w_id(i + 3);
i += 4;
}
while (i < slices) {
result += tensor.slice(i) * w_id(i);
i++;
}

Without having the actual code, I'm guessing that
+= tensor.slice(i) * w_id(i)
creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function
addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )
which translates to pure loops over the elements:
for (i=....)
for (j=...)
lhs[i][j] += slice1[i][j]*w1[j] + slice2[i][j] &c
It would surprise me if that doesn't buy you an extra factor.

AVX2 barycentric interpolation of a vertex component

I'm just starting out on the path of using simd intrinsics. My profiler has shown that a significant amount of time is being spent on vertex interpolation. I am targeting AVX2 and am trying to find an optimization for the following - given that I have 3 vector2s that need interpolation I imagine I should be able to load them into a single __m256 and do the multiply and add efficiently. Here is the code I am trying to convert - is it worth doing it as a 256bit operation? The vectors are unaligned.
Vector2 Interpolate( Vector3 uvw, Vector2 v0, Vector2 v1, Vector2 v2 )
{
Vector2 out;
out = v0 * uvw.x;
out += v1 * uvw.y;
out += v2 * uvw.z;
return out;
}
struct Vector2 { float x; float y; } ;
struct Vector3 { float x; float y; float z; } ;
My question is this - how do I load three unaligned vector2 into the single 256bit register so I can do the multiply and add?
I am using VS2013.

I was bored so I wrote it, not tested (but compiled, both Clang and GCC make reasonable code from this)
void interpolateAll(int n, float* scales, float* vin, float* vout)
{
// preconditions:
// (n & 7 == 0) (not really, but vout must be padded)
// scales & 31 == 0
// vin & 31 == 0
// vout & 31 == 0
// vin format:
// float v0x[8]
// float v0y[8]
// float v1x[8]
// float v1y[8]
// float v2x[8]
// float v2y[8]
// scales format:
// float scale0[8]
// float scale1[8]
// float scale2[8]
// vout format:
// float vx[8]
// float vy[8]
for (int i = 0; i < n; i += 8) {
__m256 scale_0 = _mm256_load_ps(scales + i * 3);
__m256 scale_1 = _mm256_load_ps(scales + i * 3 + 8);
__m256 scale_2 = _mm256_load_ps(scales + i * 3 + 16);
__m256 v0x = _mm256_load_ps(vin + i * 6);
__m256 v0y = _mm256_load_ps(vin + i * 6 + 8);
__m256 v1x = _mm256_load_ps(vin + i * 6 + 16);
__m256 v1y = _mm256_load_ps(vin + i * 6 + 24);
__m256 v2x = _mm256_load_ps(vin + i * 6 + 32);
__m256 v2y = _mm256_load_ps(vin + i * 6 + 40);
__m256 x = _mm256_mul_ps(scale_0, v0x);
__m256 y = _mm256_mul_ps(scale_0, v0y);
x = _mm256_fmadd_ps(scale_1, v1x, x);
y = _mm256_fmadd_ps(scale_1, v1y, y);
x = _mm256_fmadd_ps(scale_2, v2x, x);
y = _mm256_fmadd_ps(scale_2, v2y, y);
_mm256_store_ps(vout + i * 2, x);
_mm256_store_ps(vout + i * 2 + 8, y);
}
}
Uses Z boson's format, if I understood him correctly. In any case it's a nice format, from a SIMD perspective. Slightly inconvenient from a C++ perspective.
The FMAs do serialize the multiplies unnecessarily but that shouldn't matter since it's not part of a loop-carried dependency.
The predicted throughput of this (assuming a small enough array) is 2 iterations per 9 cycles, bottlenecked by the loads. In practice probably slightly worse, there was some talk about simple stores stealing p2 or p3 occasionally, that sort of thing, I'm not really sure. Anyway, that's enough time for 18 "FMAs" but there are only 12 (8 and 4 mulps), so it may be useful to move some extra computation in here if there is any.

Algorithm for adjustment of image levels

I need to implement in C++ algorithm for adjusting image levels that works similar to Levels function in Photoshop or GIMP. I.e. inputs are: color RGB image to be adjusted adjust, while point, black point, midtone point, output from/to values. But I didn't find yet any info on how to perform this adjustment. Probably someone recommend me algorithm description or materials to study.
To the moment I've came up with following code myself, but it doesn't give expected result, similar to what I can see, for example in the GIMP, image becomes too lightened. Below is my current fragment of the code:
const int normalBlackPoint = 0;
const int normalMidtonePoint = 127;
const int normalWhitePoint = 255;
const double normalLowRange = normalMidtonePoint - normalBlackPoint + 1;
const double normalHighRange = normalWhitePoint - normalMidtonePoint;
int blackPoint = 53;
int midtonePoint = 110;
int whitePoint = 168;
int outputFrom = 0;
int outputTo = 255;
double outputRange = outputTo - outputFrom + 1;
double lowRange = midtonePoint - blackPoint + 1;
double highRange = whitePoint - midtonePoint;
double fullRange = whitePoint - blackPoint + 1;
double lowPart = lowRange / fullRange;
double highPart = highRange / fullRange;
int dim(256);
cv::Mat lut(1, &dim, CV_8U);
for(int i = 0; i < 256; ++i)
{
double p = i > normalMidtonePoint
? (static_cast<double>(i - normalMidtonePoint) / normalHighRange) * highRange * highPart + lowPart
: (static_cast<double>(i + 1) / normalLowRange) * lowRange * lowPart;
int v = static_cast<int>(outputRange * p ) + outputFrom - 1;
if(v < 0) v = 0;
else if(v > 255) v = 255;
lut.at<uchar>(i) = v;
}
....
cv::Mat sourceImage = cv::imread(inputFileName, CV_LOAD_IMAGE_COLOR);
if(!sourceImage.data)
{
std::cerr << "Error: couldn't load image " << inputFileName << "." << std::endl;
continue;
}
#if 0
const int forwardConversion = CV_BGR2YUV;
const int reverseConversion = CV_YUV2BGR;
#else
const int forwardConversion = CV_BGR2Lab;
const int reverseConversion = CV_Lab2BGR;
#endif
cv::Mat convertedImage;
cv::cvtColor(sourceImage, convertedImage, forwardConversion);
// Extract the L channel
std::vector<cv::Mat> convertedPlanes(3);
cv::split(convertedImage, convertedPlanes);
cv::LUT(convertedPlanes[0], lut, convertedPlanes[0]);
//dst.copyTo(convertedPlanes[0]);
cv::merge(convertedPlanes, convertedImage);
cv::Mat resImage;
cv::cvtColor(convertedImage, resImage, reverseConversion);
cv::imwrite(outputFileName, resImage);

Pseudocode for Photoshop's Levels Adjustment
First, calculate the gamma correction value to use for the midtone adjustment (if desired). The following roughly simulates Photoshop's technique, which applies gamma 9.99-1.00 for midtone values 0-128, and 1.00-0.01 for 128-255.
Apply gamma correction:
Gamma = 1
MidtoneNormal = Midtones / 255
If Midtones < 128 Then
MidtoneNormal = MidtoneNormal * 2
Gamma = 1 + ( 9 * ( 1 - MidtoneNormal ) )
Gamma = Min( Gamma, 9.99 )
Else If Midtones > 128 Then
MidtoneNormal = ( MidtoneNormal * 2 ) - 1
Gamma = 1 - MidtoneNormal
Gamma = Max( Gamma, 0.01 )
End If
GammaCorrection = 1 / Gamma
Then, for each channel value R, G, B (0-255) for each pixel, do the following in order.
Apply the input levels:
ChannelValue = 255 * ( ( ChannelValue - ShadowValue ) /
( HighlightValue - ShadowValue ) )
Apply the midtones:
If Midtones <> 128 Then
ChannelValue = 255 * ( Pow( ( ChannelValue / 255 ), GammaCorrection ) )
End If
Apply the output levels:
ChannelValue = ( ChannelValue / 255 ) *
( OutHighlightValue - OutShadowValue ) + OutShadowValue
Where:
All channel and adjustment parameter values are integers, 0-255 inclusive
Shadow/Midtone/HighlightValue are the input adjustment values (defaults 0, 128, 255)
OutShadow/HighlightValue are the output adjustment values (defaults 0, 255)
You should optimize things and make sure values are kept in bounds (like 0-255 for each channel)
For a more accurate simulation of Photoshop, you can use a non-linear interpolation curve if Midtones < 128. Photoshop also chops off the darkest and lightest 0.1% of the values by default.

Ignoring the midtone/Gamma, the Levels function is a simple linear scaling.
All input values are first linearly scaled so that all values less or equal to the "black point" are set to 0, and all values greater than or equal white point are set to 255.
Then all values are linearly scaled from 0/255 to the output range.
For the mid-point—it depends what you actually mean by that.
In GIMP, there is a Gamma value. The Gamma value is a simple exponent of the input values (after restricting to the black/white points).
For Gamma == 1, the values are not changed.
For gamma < 1, the values are darkened.

How to speed up bilinear interpolation of image?

I'm trying to rotate image with interpolation, but it's too slow for real time for big images.
the code something like:
for(int y=0;y<dst_h;++y)
{
for(int x=0;x<dst_w;++x)
{
//do inverse transform
fPoint pt(Transform(Point(x, y)));
//in coor of src
int x1= (int)floor(pt.x);
int y1= (int)floor(pt.y);
int x2= x1+1;
int y2= y1+1;
if((x1>=0&&x1<src_w&&y1>=0&&y1<src_h)&&(x2>=0&&x2<src_w&&y2>=0&&y2<src_h))
{
Mask[y][x]= 1; //show pixel
float dx1= pt.x-x1;
float dx2= 1-dx1;
float dy1= pt.y-y1;
float dy2= 1-dy1;
//bilinear
pd[x].blue= (dy2*(ps[y1*src_w+x1].blue*dx2+ps[y1*src_w+x2].blue*dx1)+
dy1*(ps[y2*src_w+x1].blue*dx2+ps[y2*src_w+x2].blue*dx1));
pd[x].green= (dy2*(ps[y1*src_w+x1].green*dx2+ps[y1*src_w+x2].green*dx1)+
dy1*(ps[y2*src_w+x1].green*dx2+ps[y2*src_w+x2].green*dx1));
pd[x].red= (dy2*(ps[y1*src_w+x1].red*dx2+ps[y1*src_w+x2].red*dx1)+
dy1*(ps[y2*src_w+x1].red*dx2+ps[y2*src_w+x2].red*dx1));
//nearest neighbour
//pd[x]= ps[((int)pt.y)*src_w+(int)pt.x];
}
else
Mask[y][x]= 0; //transparent pixel
}
pd+= dst_w;
}
How I can speed up this code, I try to parallelize this code but it seems there is no speed up because of memory access pattern (?).

The key is to do most of your computations as ints. The only thing that is necessary to do as a float is the weighting. See here for a good resource.
From that same resource:
int px = (int)x; // floor of x
int py = (int)y; // floor of y
const int stride = img->width;
const Pixel* p0 = img->data + px + py * stride; // pointer to first pixel
// load the four neighboring pixels
const Pixel& p1 = p0[0 + 0 * stride];
const Pixel& p2 = p0[1 + 0 * stride];
const Pixel& p3 = p0[0 + 1 * stride];
const Pixel& p4 = p0[1 + 1 * stride];
// Calculate the weights for each pixel
float fx = x - px;
float fy = y - py;
float fx1 = 1.0f - fx;
float fy1 = 1.0f - fy;
int w1 = fx1 * fy1 * 256.0f;
int w2 = fx * fy1 * 256.0f;
int w3 = fx1 * fy * 256.0f;
int w4 = fx * fy * 256.0f;
// Calculate the weighted sum of pixels (for each color channel)
int outr = p1.r * w1 + p2.r * w2 + p3.r * w3 + p4.r * w4;
int outg = p1.g * w1 + p2.g * w2 + p3.g * w3 + p4.g * w4;
int outb = p1.b * w1 + p2.b * w2 + p3.b * w3 + p4.b * w4;
int outa = p1.a * w1 + p2.a * w2 + p3.a * w3 + p4.a * w4;

wow you are doing a lot inside most inner loop like:
1.float to int conversions
can do all on floats ...
they are these days pretty fast
the conversion is what is killing you
also you are mixing float and ints together (if i see it right) which is the same ...
2.transform(x,y)
any unnecessary call makes heap trashing and slow things down
instead add 2 variables xx,yy and interpolate them insde your for loops
3.if ....
why to heck are you adding if ?
limit the for ranges before loop and not inside ...
the background can be filled with other fors before or later

How to alpha blend RGBA unsigned byte color fast?

I am using c++ , I want to do alpha blend using the following code.
#define CLAMPTOBYTE(color) \
if ((color) & (~255)) { \
color = (BYTE)((-(color)) >> 31); \
} else { \
color = (BYTE)(color); \
}
#define GET_BYTE(accessPixel, x, y, scanline, bpp) \
((BYTE*)((accessPixel) + (y) * (scanline) + (x) * (bpp)))
for (int y = top ; y < bottom; ++y)
{
BYTE* resultByte = GET_BYTE(resultBits, left, y, stride, bytepp);
BYTE* srcByte = GET_BYTE(srcBits, left, y, stride, bytepp);
BYTE* srcByteTop = GET_BYTE(srcBitsTop, left, y, stride, bytepp);
BYTE* maskCurrent = GET_GREY(maskSrc, left, y, width);
int alpha = 0;
int red = 0;
int green = 0;
int blue = 0;
for (int x = left; x < right; ++x)
{
alpha = *maskCurrent;
red = (srcByteTop[R] * alpha + srcByte[R] * (255 - alpha)) / 255;
green = (srcByteTop[G] * alpha + srcByte[G] * (255 - alpha)) / 255;
blue = (srcByteTop[B] * alpha + srcByte[B] * (255 - alpha)) / 255;
CLAMPTOBYTE(red);
CLAMPTOBYTE(green);
CLAMPTOBYTE(blue);
resultByte[R] = red;
resultByte[G] = green;
resultByte[B] = blue;
srcByte += bytepp;
srcByteTop += bytepp;
resultByte += bytepp;
++maskCurrent;
}
}
however I find it is still slow, it takes about 40 - 60 ms when compose two 600 * 600 image.
Is there any method to improve the speed to less then 16ms?
Can any body help me to speed this code? Many thanks!

Use SSE - start around page 131.
The basic workflow
Load 4 pixels from src (16 1 byte numbers) RGBA RGBA RGBA RGBA (streaming load)
Load 4 more which you want to blend with srcbytetop RGBx RGBx RGBx RGBx
Do some swizzling so that the A term in 1 fills every slot I.e
xxxA xxxB xxxC xxxD -> AAAA BBBB CCCC DDDD
In my solution below I opted instead to re-use your existing "maskcurrent" array but having alpha integrated into the "A" field of 1 will require less loads from memory and thus be faster. Swizzling in this case would probably be: And with mask to select A, B, C, D. Shift right 8, Or with origional, shift right 16, or again.
Add the above to a vector that is all -255 in every slot
Multiply 1 * 4 (source with 255-alpha) and 2 * 3 (result with alpha).
You should be able to use the "multiply and discard bottom 8 bits" SSE2 instruction for this.
add those two (4 and 5) together
Store those somewhere else (if possible) or on top of your destination (if you must)
Here is a starting point for you:
//Define your image with __declspec(align(16)) i.e char __declspec(align(16)) image[640*480]
// so the first byte is aligned correctly for SIMD.
// Stride must be a multiple of 16.
for (int y = top ; y < bottom; ++y)
{
BYTE* resultByte = GET_BYTE(resultBits, left, y, stride, bytepp);
BYTE* srcByte = GET_BYTE(srcBits, left, y, stride, bytepp);
BYTE* srcByteTop = GET_BYTE(srcBitsTop, left, y, stride, bytepp);
BYTE* maskCurrent = GET_GREY(maskSrc, left, y, width);
for (int x = left; x < right; x += 4)
{
//If you can't align, use _mm_loadu_si128()
// Step 1
__mm128i src = _mm_load_si128(reinterpret_cast<__mm128i*>(srcByte))
// Step 2
__mm128i srcTop = _mm_load_si128(reinterpret_cast<__mm128i*>(srcByteTop))
// Step 3
// Fill the 4 positions for the first pixel with maskCurrent[0], etc
// Could do better with shifts and so on, but this is clear
__mm128i mask = _mm_set_epi8(maskCurrent[0],maskCurrent[0],maskCurrent[0],maskCurrent[0],
maskCurrent[1],maskCurrent[1],maskCurrent[1],maskCurrent[1],
maskCurrent[2],maskCurrent[2],maskCurrent[2],maskCurrent[2],
maskCurrent[3],maskCurrent[3],maskCurrent[3],maskCurrent[3],
)
// step 4
__mm128i maskInv = _mm_subs_epu8(_mm_set1_epu8(255), mask)
//Todo : Multiply, with saturate - find correct instructions for 4..6
//note you can use Multiply and add _mm_madd_epi16
alpha = *maskCurrent;
red = (srcByteTop[R] * alpha + srcByte[R] * (255 - alpha)) / 255;
green = (srcByteTop[G] * alpha + srcByte[G] * (255 - alpha)) / 255;
blue = (srcByteTop[B] * alpha + srcByte[B] * (255 - alpha)) / 255;
CLAMPTOBYTE(red);
CLAMPTOBYTE(green);
CLAMPTOBYTE(blue);
resultByte[R] = red;
resultByte[G] = green;
resultByte[B] = blue;
//----
// Step 7 - store result.
//Store aligned if output is aligned on 16 byte boundrary
_mm_store_si128(reinterpret_cast<__mm128i*>(resultByte), result)
//Slow version if you can't guarantee alignment
//_mm_storeu_si128(reinterpret_cast<__mm128i*>(resultByte), result)
//Move pointers forward 4 places
srcByte += bytepp * 4;
srcByteTop += bytepp * 4;
resultByte += bytepp * 4;
maskCurrent += 4;
}
}
To find out which AMD processors will run this code (currently it is using SSE2 instructions) see Wikipedia's List of AMD Turion microprocessors. You could also look at other lists of processors on Wikipedia but my research shows that AMD cpus from around 4 years ago all support at least SSE2.
You should expect a good SSE2 implimentation to run around 8-16 times faster than your current code. That is because we eliminate branches in the loop, process 4 pixels (or 12 channels) at once and improve cache performance by using streaming instructions. As an alternative to SSE, you could probably make your existing code run much faster by eliminating the if checks you are using for saturation. Beyond that I would need to run a profiler on your workload.
Of course, the best solution is to use hardware support (i.e code your problem up in DirectX) and have it done on the video card.

You can always calculate the alpha of red and blue at the same time. You can also use this trick with the SIMD implementation mentioned before.
unsigned int blendPreMulAlpha(unsigned int colora, unsigned int colorb, unsigned int alpha)
{
unsigned int rb = (colora & 0xFF00FF) + ( (alpha * (colorb & 0xFF00FF)) >> 8 );
unsigned int g = (colora & 0x00FF00) + ( (alpha * (colorb & 0x00FF00)) >> 8 );
return (rb & 0xFF00FF) + (g & 0x00FF00);
}
unsigned int blendAlpha(unsigned int colora, unsigned int colorb, unsigned int alpha)
{
unsigned int rb1 = ((0x100 - alpha) * (colora & 0xFF00FF)) >> 8;
unsigned int rb2 = (alpha * (colorb & 0xFF00FF)) >> 8;
unsigned int g1 = ((0x100 - alpha) * (colora & 0x00FF00)) >> 8;
unsigned int g2 = (alpha * (colorb & 0x00FF00)) >> 8;
return ((rb1 | rb2) & 0xFF00FF) + ((g1 | g2) & 0x00FF00);
}
0 <= alpha <= 0x100

For people that want to divide by 255, i found a perfect formula:
pt->r = (r+1 + (r >> 8)) >> 8; // fast way to divide by 255

Here's some pointers.
Consider using pre-multiplied foreground images as described by Porter and Duff. As well as potentially being faster, you avoid a lot of potential colour-fringing effects.
The compositing equation changes from
r = kA + (1-k)B
... to ...
r = A + (1-k)B
Alternatively, you can rework the standard equation to remove one multiply.
r = kA + (1-k)B
== kA + B - kB
== k(A-B) + B
I may be wrong, but I think you shouldn't need the clamping either...

I can't comment because I don't have enough reputation, but I want to say that Jasper's version will not overflow for valid input.
Masking the multiplication result is necessary because otherwise the red+blue multiplication would leave bits in the green channel (this would also be true if you multiplied red and blue separately, you'd still need to mask out bits in the blue channel) and the green multiplication would leave bits in the blue channel.
These are bits that are lost to right shift if you separate the components out, as is often the case with alpha blending.
So they're not overflow, or underflow. They're just useless bits that need to be masked out to achieve expected results.
That said, Jasper's version is incorrect. It should be 0xFF-alpha (255-alpha), not 0x100-alpha (256-alpha). This would probably not produce a visible error.
I've found an adaptation of Jasper's code to be be faster than my old alpha blending code, which was already decent, and am currently using it in my software renderer project. I work with 32-bit ARGB pixels:
Pixel AlphaBlendPixels(Pixel p1, Pixel p2)
{
static const int AMASK = 0xFF000000;
static const int RBMASK = 0x00FF00FF;
static const int GMASK = 0x0000FF00;
static const int AGMASK = AMASK | GMASK;
static const int ONEALPHA = 0x01000000;
unsigned int a = (p2 & AMASK) >> 24;
unsigned int na = 255 - a;
unsigned int rb = ((na * (p1 & RBMASK)) + (a * (p2 & RBMASK))) >> 8;
unsigned int ag = (na * ((p1 & AGMASK) >> 8)) + (a * (ONEALPHA | ((p2 & GMASK) >> 8)));
return ((rb & RBMASK) | (ag & AGMASK));
}

No exactly answering the question but...
One thing is to do it fast, the other thing is to do it right.
Alpha compositing is a dangerous beast, it looks straight forward and intuitive but common errors have been widespread for decades without anybody noticing it (almost)!
The most famous and common mistake is about NOT using premultiplied alpha. I highly recommend this: Alpha Blending for Leaves

You can use 4 bytes per pixel in both images (for memory alignment), and then use SSE instructions to process all channels together. Search "visual studio sse intrinsics".

First of all lets use the proper formula for each color component
You start with this:
v = ( 1-t ) * v0 + t * v1
where
t=interpolation parameter [0..1]
v0=source color value
v1=transfer color value
v=output value
Reshuffling the terms, we can reduce the number of operations:
v = v0 + t * (v1 - v0)
You would need to perform this calculation once per color channel (3 times for RGB).
For 8-bit unsigned color components, you need to use correct fixed point math:
i = i0 + t * ( ( i1 - i0 ) + 127 ) / 255
where
t = interpolation parameter [0..255]
i0= source color value [0..255]
i1= transfer color value [0..255]
i = output color
If you leave out the +127 then your colors will be biased towards the darker end. Very often, people use /256 or >> 8 for speed. This is not correct! If you divide by 256, you will never be able to reach pure white (255,255,255) because 255/256 is slightly less than one.
I hope this helps.

I think hardware support will help you. try to move the logic from software to hardware if feasible

I've done similar code in unsafe C#. Is there any reason you aren't looping through each pixel directly? Why use all the BYTE* and GET_BYTE() calls? That is probably part of the speed issue.
What does GET_GRAY look like?
More importantly, are you sure your platform doesn't expose alpha blending capabilities? What platform are you targeting? Wiki informs me that the following support it out of the box:
Mac OS X
Windows 2000, XP, Server 2003, Windows CE, Vista and Windows 7
The XRender extension to the X Window System (this includes modern Linux systems)
RISC OS Adjust
QNX Neutrino
Plan 9
Inferno
AmigaOS 4.1
BeOS, Zeta and Haiku
Syllable
MorphOS

The main problem will be the poor loop construct, possibly made worse by a compiler failing to eliminate CSE's. Move the real common bits outside the loops. int red isn't common, thouigh - that should be inside the inner loop.
Furthermore, red, green and blue are independent. If you calculate them in turn, you don't need to keep interim red results in registers when you are calculating green results. This is especially important on CPUs with limited registers like x86.
There will be only a limited number of values allowed for bytepp. Make it a template parameter, and then call the right instantiation from a switch. This will produce multiple copies of your function, but each can be optimized a lot better.
As noted, clamping is not needed. In alphablending, you're creating a linear combination of two images a[x][y] and b[x][y]. Since 0<=alpha<=255, you know that each output is bound by max(255*a[x][y], 255*b[x][y]). And since your output range is the same as both input ranges (0-255), this is OK.
With a small loss of precision, you could calculate (a[x][y]*alpha * b[x][y]*(256-alpha))>>8. Bitshifts are often faster than division.

Depending on the target architecture, you could try either vectorize or parallellize the function.
Other than that, try to linearize the whole method (i.e. no loop-in-loop) and work with a quadruple of bytes at once, that would lose the overhead of working with single bytes plus make it easier for the compiler to optimize the code.

Move it to the GPU.

I am assuming that you want to do this in a completely portable way, without the help of a GPU, the use of a proprietry intel SIMD library (which may not work as efficiently on AMD processors).
Put the following inplace of your calculation for RGB
R = TopR + (SourceR * alpha) >> 8;
G = TopG + (SourceG * alpha) >> 8;
B = TopB + (SourceB * alpha) >> 8;
It is a more efficient calculation.
Also use shift left instruction on your get pixel macro instead of multiplying by the BPP.

This one works when the first color, (colora, the destination) has also alpha channel (blending two transparent ARGB colors)
The alpha is in the second color's alpha (colorb, the source)
This adds the two alphas (0 = transparent, 255 = fully opaque)
It is a modified version of Jasper Bekkers' answer.
I use it to blend transparent pixel art on to a transparent screen.
Uint32 alphaBlend(unsigned int colora, unsigned int colorb) {
unsigned int a2 = (colorb & 0xFF000000) >> 24;
unsigned int alpha = a2;
if (alpha == 0) return colora;
if (alpha == 255) return colorb;
unsigned int a1 = (colora & 0xFF000000) >> 24;
unsigned int nalpha = 0x100 - alpha;
unsigned int rb1 = (nalpha * (colora & 0xFF00FF)) >> 8;
unsigned int rb2 = (alpha * (colorb & 0xFF00FF)) >> 8;
unsigned int g1 = (nalpha * (colora & 0x00FF00)) >> 8;
unsigned int g2 = (alpha * (colorb & 0x00FF00)) >> 8;
unsigned int anew = a1 + a2;
if (anew > 255) {anew = 255;}
return ((rb1 + rb2) & 0xFF00FF) + ((g1 + g2) & 0x00FF00) + (anew << 24);
}

Here's my adaption of a software alpha blend that works well for 2 unsigned integers.
My code differs a bit as the code above is basically always assuming the destination alpha is 255.
With a decent optimizing compiler most calculations should be in registers as the scope of most variables is very short. I also opted to progressively shift the result << 8 incrementally to avoid << 24, << 16 when putting the ARGB back together. I know it's a long time ago... but I remember on the 286 cycles for a shift was (1 + 1*each bit shifted) so assume there is still some sort of penalty for larger shifts.
Also... instead of "/ 255" I opted for ">> 8" which can be changed as desired.
/*
alpha blend source and destination, either may have an alpha!!!!
Src AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB
Dest AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB
res AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB
NOTE - α = αsrc + αdest(1.0-αsrc) where α = 0.0 - 1.0
ALSO - DWORD is unsigned int so (F8000000 >> 24) = F8 not FFFFFFF8 as it would with int (signed)
*/
inline DWORD raw_blend(const DWORD src, const DWORD dest)
{
// setup and calculate α
DWORD src_a = src >> 24;
DWORD src_a_neg = 255 - src_a;
DWORD dest_a = dest >> 24;
DWORD res = src_a + ((dest_a * src_a_neg) >> 8);
// setup and calculate R
DWORD src_r = (src >> 16) & 255;
DWORD dest_r = (dest >> 16) & 255;
res = (res << 8) | (((src_r * src_a) + (dest_r * src_a_neg)) >> 8);
// setup and calculate G
DWORD src_g = (src >> 8) & 255;
DWORD dest_g = (dest >> 8) & 255;
res = (res << 8) | (((src_g * src_a) + (dest_g * src_a_neg)) >> 8);
// setup and calculate B
DWORD src_b = src & 255;
DWORD dest_b = dest & 255;
return (res << 8) | (((src_b * src_a) + (dest_b * src_a_neg)) >> 8);
}

; In\ EAX = background color (ZRBG) 32bit (Z mean zero, always is zero)
; In\ EDX = foreground color (RBGA) 32bit
; Out\ EAX = new color
; free registers (R10, RDI, RSI, RSP, RBP)
abg2:
mov r15b, dl ; av
movzx ecx, dl
not ecx ; faster than 255 - dl
mov r14b, cl ; rem
shr edx, 8
and edx, 0x00FFFFFF
mov r12d, edx
mov r13d, eax ; RBGA ---> ZRGB
; s: eax
; d: edx
;=============================red = ((s >> 16) * rem + (d >> 16) * av) >> 8;
mov edx, r12d
shr edx, 0x10
movzx eax, r14b
imul edx, eax
mov ecx, r13d
shr ecx, 0x10
movzx eax, r15b
imul eax, ecx
lea eax, [eax + edx] ; faster than add eax, edx
shr eax, 0x8
mov r9b, al
shl r9d, 8
;=============================green = (((s >> 8) & 0x0000ff) * rem + ((d >> 8) & 0x0000ff) * av) >> 8;
mov eax, r12d
shr eax, 0x8
movzx edx, al
movzx eax, r14b
imul edx, eax
mov eax, r13d
shr eax, 0x8
movzx ecx, al
movzx eax, r15b
imul eax, ecx
lea eax, [eax, + edx] ; faster than add eax, edx
shr eax, 0x8
mov r9b, al
shl r9d, 8
;=============================blue = ((s & 0x0000ff) * rem + (d & 0x0000ff) * av) >> 8;
movzx edx, r12b
movzx eax, r14b
imul edx, eax
movzx ecx, r13b
movzx eax, r15b
imul eax, ecx
lea eax, [eax + edx] ; faster than add eax, edx
shr eax, 0x8
mov r9b, al
mov eax, r9d
ret

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Explaining ARM Neon Image Sampling - c++

Related

C++ performance optimization for linear combination of large matrices?

AVX2 barycentric interpolation of a vertex component

Algorithm for adjustment of image levels

How to speed up bilinear interpolation of image?

How to alpha blend RGBA unsigned byte color fast?

Categories

Resources