Split Multiplication of integers - c++

I need an algorithm that uses two 32-bit integers as parameters, and returns the multiplication of these parameters split into two other 32-bit integers: 32-highest-bits part and 32-lowest-bits part.
I would try:
uint32_t p1, p2; // globals to hold the result
void mult(uint32_t x, uint32_t y){
uint64_t r = (x * y);
p1 = r >> 32;
p2 = r & 0xFFFFFFFF;
}
Although it works1, it's not guaranteed the existence of 64-bit integers in the machine, neither is the use of them by the compiler.
So, how is the best way to solve it?
Note1: Actually, it didn't work because my compiler does not support 64-bit integers.
Obs: Please, avoid using boost.

Just use 16 bits digits.
void multiply(uint32_t a, uint32_t b, uint32_t* h, uint32_t* l) {
uint32_t const base = 0x10000;
uint32_t al = a%base, ah = a/base, bl = b%base, bh = b/base;
*l = al*bl;
*h = ah*bh;
uint32_t rlh = *l/base + al*bh;
*h += rlh/base;
rlh = rlh%base + ah*bl;
*h += rlh/base;
*l = (rlh%base)*base + *l%base;
}

As I commented, you can treat each number as a binary string of length 32.
Just multiply these numbers using school arithmetic. You will get a 64 character long string.
Then just partition it.
If you want fast multiplication, then you can look into Karatsuba multiplication algorithm.

This is the explanation and an implementation of the Karatsubas-Algorithm.
I have downloaded the code and ran it several times. It seems that it's doing well. You can modify the code according to your need.

If the unsigned long type are supported, this should work:
void umult32(uint32 a, uint32 b, uint32* c, uint32* d)
{
unsigned long long x = ((unsigned long long)a)* ((unsigned long long)b); //Thanks to #Толя
*c = x&0xffffffff;
*d = (x >> 32) & 0xffffffff;
}
Logic borrowed from here.

Related

access violation _mm_store_si128 SSE Intrinsics

I want to create a histogram of vertical gradients in an 8 bit gray image.
The vertical distance to calculate the gradient can be specified.
I already managed to speed up another part of my code using Intrinsics, but it does not work here.
The code runs without exception if the _mm_store_si128 is commented out.
When it is not commented, I get an access violation.
What is going wrong here?
#define _mm_absdiff_epu8(a,b) _mm_adds_epu8(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a)) //from opencv
void CreateAbsDiffHistogramUnmanaged(void* source, unsigned int sourcestride, unsigned int height, unsigned int verticalDistance, unsigned int histogram[])
{
unsigned int xcount = sourcestride / 16;
__m128i absdiffData;
unsigned char* bytes = (unsigned char*) _aligned_malloc(16, 16);
__m128i* absdiffresult = (__m128i*) bytes;
__m128i* sourceM = (__m128i*) source;
__m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride;
for (unsigned int y = 0; y < (height - verticalDistance); y++)
{
for (unsigned int x = 0; x < xcount; x++, ++sourceM, ++sourceVOffset)
{
absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
_mm_store_si128(absdiffresult, absdiffData);
//unroll loop
histogram[bytes[0]]++;
histogram[bytes[1]]++;
histogram[bytes[2]]++;
histogram[bytes[3]]++;
histogram[bytes[4]]++;
histogram[bytes[5]]++;
histogram[bytes[6]]++;
histogram[bytes[7]]++;
histogram[bytes[8]]++;
histogram[bytes[9]]++;
histogram[bytes[10]]++;
histogram[bytes[11]]++;
histogram[bytes[12]]++;
histogram[bytes[13]]++;
histogram[bytes[14]]++;
histogram[bytes[15]]++;
}
}
_aligned_free(bytes);
}
Your function crashed while loading because the input data was not aligned properly. In order to solve this problem you have to change your code from:
absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
to:
absdiffData = _mm_absdiff_epu8(_mm_loadu_si128(sourceM), _mm_loadu_si128(sourceVOffset));
Here I use unaligned loading.
P.S. I have implemented a similar function (SimdAbsSecondDerivativeHistogram) in Simd Library. It has SSE2, AVX2, NEON and Altivec implementations. I hope that it will help you.
P.P.S. Also I would strongly recommended to check this line:
__m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride);
It may result in a crash (access to memory outside of the input array bounds). Maybe you had in mind this:
__m128i* sourceVOffset = (__m128i*)((char*)source + verticalDistance * sourcestride);

Convert __m128d to double

I just try to use SSE extensions, and I started with a simple vector dot multiplication.
So I wrote the following code:
void SSE_vectormult(double * A, double * B)
{
__m128d a;
__m128d b;
a = _mm_load_pd(A);
b = _mm_load_pd(B);
const int mask = 0xf1;
__m128d res = _mm_dp_pd(a,b,mask);
A = res;
}
with A and B vectors of the same length. Now, I have to convert the result in __m128d back into double. Is there a easy way to do this (or a conversion function)?
Thank you!
You should use double dot = _mm_cvtsd_f64(res). This extracts the lower 64bit double from the 128 bit register.
The counterpart of load would be store[ms, intel]. So in your case I'd guess (double precision, aligned pointer, regular store):
_mm_store_pd(A, res); //A = res;

Why is this slower than memcmp

I am trying to compare two rows of pixels.
A pixel is defined as a struct containing 4 float values (RGBA).
The reason I am not using memcmp is because I need to return the position of the 1st different pixel, which memcmp does not do.
My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128 x = _mm_load_ps((float*)(a + i));
__m128 y = _mm_load_ps((float*)(b + i));
__m128 cmp = _mm_cmpeq_ps(x, y);
if (_mm_movemask_ps(cmp) != 15) return i;
}
return -1;
}
I then found that treating the values as integers instead of floats sped things up a bit, and is now only ~20% slower than memcmp.
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128i x = _mm_load_si128((__m128i*)(a + i));
__m128i y = _mm_load_si128((__m128i*)(b + i));
__m128i cmp = _mm_cmpeq_epi32(x, y);
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
}
return -1;
}
From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is what other tricks does the MS implementation have up it's sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?
Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16 byte boundary?
I am compiling with /o2 and all the optimization flags.
I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".
If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.
The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.
You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.
Edit: The last paragraph may need an example. This code assumes an unrolled loop of at least two:
__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));
for(int i = 0; i < count; i+=2)
{
__m128i cmp = _mm_cmpeq_epi32(x, y);
__m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
__m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
cmp = _mm_cmpeq_epi32(x1, y1);
__m128i x = _mm_load_si128((__m128i*)(a + i + 2));
__m128i y = _mm_load_si128((__m128i*)(b + i + 2));
if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
Roughly something like that.
You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function, it starts with some sanity checks and then checks if the pointers are aligned or not:
aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );
If they are not aligned it compares the pointers byte by byte until the start of an aligned address:
while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
if(*a++ != *b++) return -1;
--len;
}
And then compares the remaining memory with SSE instructions similar to your code:
if(!len) return 0;
while( len && !(len & 7 ) )
{
__m128i x = _mm_load_si128( (__m128i*)&a[i]);
__m128i y = _mm_load_si128( (__m128i*)&b[i]);
....
I cannot help you directly because I'm using Mac, but there's an easy way to figure out what happens:
You just step into memcpy in the debug mode and switch to Disassembly view. As the memcpy is a simple little function, you will easily figure out all the implementation tricks.

How can you convert a std::bitset<64> to a double?

Is there a way to convert a std::bitset<64> to a double without using any external library (Boost, etc.)? I am using a bitset to represent a genome in a genetic algorithm and I need a way to convert a set of bits to a double.
The C++11 road:
union Converter { uint64_t i; double d; };
double convert(std::bitset<64> const& bs) {
Converter c;
c.i = bs.to_ullong();
return c.d;
}
EDIT: As noted in the comments, we can use char* aliasing as it is unspecified instead of being undefined.
double convert(std::bitset<64> const& bs) {
static_assert(sizeof(uint64_t) == sizeof(double), "Cannot use this!");
uint64_t const u = bs.to_ullong();
double d;
// Aliases to `char*` are explicitly allowed in the Standard (and only them)
char const* cu = reinterpret_cast<char const*>(&u);
char* cd = reinterpret_cast<char*>(&d);
// Copy the bitwise representation from u to d
memcpy(cd, cu, sizeof(u));
return d;
}
C++11 is still required for to_ullong.
Most people are trying to provide answers that let you treat the bit-vector as though it directly contained an encoded int or double.
I would advise you completely avoid that approach. While it does "work" for some definition of working, it introduces hamming cliffs all over the place. You usually want your encoding to arrange things so that if two decoded values are near to one another, then their encoded values are near to one another as well. It also forces you to use 64-bits of precision.
I would manage the conversion manually. Say you have three variables to encode, x, y, and z. Your domain expertise can be used to say, for example, that -5 <= x < 5, 0 <= y < 100, and 0 <= z < 1, where you need 8 bits of precision for x, 12 bits for y, and 10 bits for z. This gives you a total search space of only 30 bits. You can have a 30 bit string, treat the first 8 as encoding x, the next 12 as y, and the last 10 as z. You are also free to gray code each one to remove the hamming cliffs.
I've personally done the following in the past:
inline void binary_encoding::encode(const vector<double>& params)
{
unsigned int start=0;
for(unsigned int param=0; param<params.size(); ++param) {
// m_bpp[i] = number of bits in encoding of parameter i
unsigned int num_bits = m_bpp[param];
// map the double onto the appropriate integer range
// m_range[i] is a pair of (min, max) values for ith parameter
pair<double,double> prange=m_range[param];
double range=prange.second-prange.first;
double max_bit_val=pow(2.0,static_cast<double>(num_bits))-1;
int int_val=static_cast<int>((params[param]-prange.first)*max_bit_val/range+0.5);
// convert the integer to binary
vector<int> result(m_bpp[param]);
for(unsigned int b=0; b<num_bits; ++b) {
result[b]=int_val%2;
int_val/=2;
}
if(m_gray) {
for(unsigned int b=0; b<num_bits-1; ++b) {
result[b]=!(result[b]==result[b+1]);
}
}
// insert the bits into the correct spot in the encoding
copy(result.begin(),result.end(),m_genotype.begin()+start);
start+=num_bits;
}
}
inline void binary_encoding::decode()
{
unsigned int start = 0;
// for each parameter
for(unsigned int param=0; param<m_bpp.size(); param++) {
unsigned int num_bits = m_bpp[param];
unsigned int intval = 0;
if(m_gray) {
// convert from gray to binary
vector<int> binary(num_bits);
binary[num_bits-1] = m_genotype[start+num_bits-1];
intval = binary[num_bits-1];
for(int i=num_bits-2; i>=0; i--) {
binary[i] = !(binary[i+1] == m_genotype[start+i]);
intval += intval + binary[i];
}
}
else {
// convert from binary encoding to integer
for(int i=num_bits-1; i>=0; i--) {
intval += intval + m_genotype[start+i];
}
}
// convert from integer to double in the appropriate range
pair<double,double> prange = m_range[param];
double range = prange.second - prange.first;
double m = range / (pow(2.0,double(num_bits)) - 1.0);
// m_phenotype is a vector<double> containing all the decoded parameters
m_phenotype[param] = m * double(intval) + prange.first;
start += num_bits;
}
}
Note that for reasons that probably don't matter to you, I wasn't using bit vectors -- just ordinary vector<int> to encoding things. And of course, there's a bunch of stuff tied into this code that isn't shown here, but you can probably get the basic idea.
One other note, if you're doing GPU calculations or if you have a particular problem such that 64 bits are the appropriate size anyway, it may be worth the extra overhead to stuff everything into native words. Otherwise, I would guess that the overhead you add to the search process will probably overwhelm whatever benefits you get by faster encoding and decoding.
Edit:: I've decided that I was being a bit silly with this. While you do end up with a double it assumes that the bitset holds an integer... which is a big assumption to make. You will end up with a predictable and repeatable value per bitset but still I don't think that this is what the author intended.
Well if you iterate over the bit values and do
output_double += pow( 2, 64-(bit_position+1) ) * bit_value;
That would work. As long as it is big-endian

Any way to make this relatively simple (nested for memory copy) C++ code more efficient?

I realize this is kind of a goofy question, for lack of a better term. I'm just kind of looking for any outside idea on increasing the efficiency of this code, as it's bogging down the system very badly (it has to perform this function a lot) and I'm running low on ideas.
What it's doing it loading two image containers (imgRGB for a full color img and imgBW for a b&w image) pixel-by-individual-pixel of an image that's stored in "unsigned char *pImage".
Both imgRGB and imgBW are containers for accessing individual pixels as necessary.
// input is in the form of an unsigned char
// unsigned char *pImage
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
imgRGB[y][x].blue = *pImage;
pImage++;
imgRGB[y][x].green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
imgRGB[y][x].red = *pImage;
pImage++;
}
}
Like I said, I was just kind of looking for fresh input and ideas on better memory management and/or copy than this. Sometimes I look at my own code so much I get tunnel vision... a bit of a mental block. If anyone wants/needs more information, by all means let me know.
The obvious question is, do you need to copy the data in the first place? Can't you just define accessor functions to extract the R, G and B values for any given pixel from the original input array?
If the image data is transient so you have to keep a copy of it, you could just make a raw copy of it without any reformatting, and again define accessors to index into each pixel/channel on that.
Assuming the copy you outlined is necessary, unrolling the loop a few times may prove to help.
I think the best approach will be to unroll the loop enough times to ensure that each iteration processes a chunk of data divisible by 4 bytes (so in each iteration, the loop can simply read a small number of ints, rather than a large number of chars)
Of course this requires you to mask out bits of these ints when writing, but that's a fast operation, and most importantly, it is done in registers, without burdening the memory subsystem or the CPU cache:
// First, we need to treat the input image as an array of ints. This is a bit nasty and technically unportable, but you get the idea)
unsigned int* img = reinterpret_cast<unsigned int*>(pImage);
for (int y = 0; y < 640; ++y)
{
for (int x = 0; x < 480; x += 4)
{
// At the start of each iteration, read 3 ints. That's 12 bytes, enough to write exactly 4 pixels.
unsigned int i0 = *img;
unsigned int i1 = *(img+1);
unsigned int i2 = *(img+2);
img += 3;
// This probably won't make a difference, but keeping a reference to the found pixel saves some typing, and it may assist the compiler in avoiding aliasing.
ImgRGB& pix0 = imgRGB[y][x];
pix0.blue = i0 & 0xff;
pix0.green = (i0 >> 8) & 0xff;
pix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
ImgRGB& pix1 = imgRGB[y][x+1];
pix1.blue = (i0 >> 24) & 0xff;
pix1.green = i1 & 0xff;
pix1.red = (i0 >> 8) & 0xff;
imgBW[y][x+1] = i1 & 0xff;
ImgRGB& pix2 = imgRGB[y][x+2];
pix2.blue = (i1 >> 16) & 0xff;
pix2.green = (i1 >> 24) & 0xff;
pix2.red = i2 & 0xff;
imgBW[y][x+2] = (i1 >> 24) & 0xff;
ImgRGB& pix3 = imgRGB[y][x+3];
pix3.blue = (i2 >> 8) & 0xff;
pix3.green = (i2 >> 16) & 0xff;
pix3.red = (i2 >> 24) & 0xff;
imgBW[y][x+3] = (i2 >> 16) & 0xff;
}
}
it is also very likely that you're better off filling a temporary ImgRGB value, and then writing that entire struct to memory at once, meaning that the first block would look like this instead: (the following blocks would be similar, of course)
ImgRGB& pix0 = imgRGB[y][x];
ImgRGB tmpPix0;
tmpPix0.blue = i0 & 0xff;
tmpPix0.green = (i0 >> 8) & 0xff;
tmpPix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
pix0 = tmpPix0;
Depending on how clever the compiler is, this may cut down dramatically on the required number of reads.
Assuming the original code is naively compiled (which is probably unlikely, but will serve as an example), this will get you from 3 reads and 4 writes per pixel (read RGB channel, and write RGB + BW) to 3/4 reads per pixel and 2 writes. (one write for the RGB struct, and one for the BW value)
You could also accumulate the 4 writes to the BW image in a single int, and then write that in one go too, something like this:
bw |= (i0 >> 8) & 0xff;
bw |= (i1 & 0xff) << 8;
bw |= ((i1 >> 24) & 0xff) << 16;
bw |= ((i2 >> 16) & 0xff) << 24;
*(imgBW + y*480+x/4) = bw; // Assuming you can treat imgBW as an array of integers
This would cut down on the number of writes to 1.25 per pixel (1 per RGB struct, and 1 for every 4 BW values)
Again, the benefit will probably be a lot smaller (or even nonexistent), but it may be worth a shot.
Taking this a step further, the same could be done without too much trouble using the SSE instructions, allowing you to process 4 times as many values per iteration. (Assuming you're running on x86)
Of course, an important disclaimer here is that the above is nonportable. The reinterpret_cast is probably an academic point (it'll most likely work no matter what, especially if you can ensure that the original array is aligned on a 32-bit boundary, which will typically be the case for large allocations on all platforms)
A bigger issue is that the bit-twiddling depends on the CPU's endianness.
But in practice, this should work on x86. and with small changes, it should work on big-endian machines too. (modulo any bugs in my code, of course. I haven't tested or even compiled any of it ;))
But no matter how you solve it, you're going to see the biggest speed improvements from minimizing the number of reads and writes, and trying to accumulate as much data in the CPU's registers as possible. Read all you can in large chunks, like ints, reorder it in the registers (accumulate it into a number of ints, or write it into temporary instances of the RGB struct), and then write those combined value out to memory.
Depending on how much you know about low-level optimizations, it may be surprising to you, but temporary variables are fine, while direct memory to memory access can be slow (for example your pointer dereferencing assigned directly into the array). The problem with this is that you may get more memory accesses than necessary, and it's harder for the compiler to guarantee that no aliasing will occur, and so it may be unable to reorder or combine the memory accesses. You're generally better off writing as much as you can early on (top of the loop), doing as much as possible in temporaries (because the compiler can keep everything in registers), and then write everything out at the end. That also gives the compiler as much leeway as possible to wait for the initially slow reads.
Finally, adding a 4th dummy value to the RGB struct (so it has a total size of 32bit) will most likely help a lot too (because then writing such a struct is a single 32-bit write, which is simpler and more efficient than the current 24-bit)
When deciding how much to unroll the loop (you could do the above twice or more in each iteration), keep in mind how many registers your CPU has. Spilling out into the cache will probably hurt you as there are plenty of memory accesses already, but on the other hand, unroll as much as you can afford given the number of registers available (the above uses 3 registers for keeping the input data, and one to accumulate the BW values. It may need one or two more to compute the necessary addresses, so on x86, doubling the above might be pushing it a bit (you have 8 registers total, and some of them have special meanings). On the other hand, modern CPU's do a lot to compensate for register pressure, by using a much larger number of registers behind the scenes, so further unrolling might still be a total performance win.
As always, measure measure measure. It's impossible to say what's fast and what isn't until you've tested it.
Another general point to keep in mind is that data dependencies are bad. This won't be a big deal as long as you're only dealing with integral values, but it still inhibits instruction reordering, and superscalar execution.
In the above, I've tried to keep dependency chains as short as possible. Rather than continually incrementing the same pointer (which means that each increment is dependant on the previous one), adding a different offset to the same base address means that every address can be computed independently, again giving more freedom to the compiler to reorder and reschedule instructions.
I think the array accesses (are they real array accesses or operator []?) are going to kill you. Each one represents a multiply.
Basically, you want something like this:
for (int y=0; y < height; y++) {
unsigned char *destBgr = imgRgb.GetScanline(y); // inline methods are better
unsigned char *destBW = imgBW.GetScanline(y);
for (int x=0; x < width; x++) {
*destBgr++ = *pImage++;
*destBW++ = *destBgr++ = *pImage++; // do this in one shot - don't double deref
*destBgr++ = *pImage++;
}
}
This will do two multiplies per scanline. You code was doing 4 multiplies per PIXEL.
What I like to do in situations like this is go into the debugger and step through the disassembly to see what it is really doing (or have the compiler generate an assembly listing). This can give you a lot of clues about where inefficencies are. They are often not where you think!
By implementing the changes suggested by Assaf and David Lee above, you can get a before and after instruction count. This really helps me in optimizing tight inner loops.
You could optimize away some of the pointer arithmetic you're doing over and over with the subscript operators [][] and use an iterator instead (that is, advance a pointer).
Memory bandwidth is your bottleneck here. There is a theoretical minimum time required to transfer all the data to and from system memory. I wrote a little test to compare the OP's version with some simple assembler to see how good the compiler was. I'm using VS2005 with default release mode settings. Here's the code:
#include <windows.h>
#include <iostream>
using namespace std;
const int
c_width = 640,
c_height = 480;
typedef struct _RGBData
{
unsigned char
r,
g,
b;
// I'm assuming there's no padding byte here
} RGBData;
// similar to the code given
void SimpleTest
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
for (int y = 0 ; y < c_height ; ++y)
{
for (int x = 0 ; x < c_width ; ++x)
{
rgb [x + y * c_width].b = *src;
src++;
rgb [x + y * c_width].g = *src;
bw [x + y * c_width] = *src;
src++;
rgb [x + y * c_width].r = *src;
src++;
}
}
}
// the assembler version
void ASM
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
const int
count = 3 * c_width * c_height / 12;
_asm
{
push ebp
mov esi,src
mov edi,bw
mov ecx,count
mov ebp,rgb
l1:
mov eax,[esi]
mov ebx,[esi+4]
mov edx,[esi+8]
mov [ebp],eax
shl eax,16
mov [ebp+4],ebx
rol ebx,16
mov [ebp+8],edx
shr edx,24
and eax,0xff000000
and ebx,0x00ffff00
and edx,0x000000ff
or eax,ebx
or eax,edx
add esi,12
bswap eax
add ebp,12
stosd
loop l1
pop ebp
}
}
// timing framework
LONGLONG TimeFunction
(
void (*function) (unsigned char *src, RGBData *rgb, unsigned char *bw),
char *description,
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
LARGE_INTEGER
start,
end;
cout << "Testing '" << description << "'...";
memset (rgb, 0, sizeof *rgb * c_width * c_height);
memset (bw, 0, c_width * c_height);
QueryPerformanceCounter (&start);
function (src, rgb, bw);
QueryPerformanceCounter (&end);
bool
ok = true;
unsigned char
*bw_check = bw,
i = 0;
RGBData
*rgb_check = rgb;
for (int count = 0 ; count < c_width * c_height ; ++count)
{
if (bw_check [count] != i || rgb_check [count].r != i || rgb_check [count].g != i || rgb_check [count].b != i)
{
ok = false;
break;
}
++i;
}
cout << (end.QuadPart - start.QuadPart) << (ok ? " OK" : " Failed") << endl;
return end.QuadPart - start.QuadPart;
}
int main
(
int argc,
char *argv []
)
{
unsigned char
*source_data = new unsigned char [c_width * c_height * 3];
RGBData
*rgb = new RGBData [c_width * c_height];
unsigned char
*bw = new unsigned char [c_width * c_height];
int
v = 0;
for (unsigned char *dest = source_data ; dest < &source_data [c_width * c_height * 3] ; ++dest)
{
*dest = v++ / 3;
}
LONGLONG
totals [2] = {0, 0};
for (int i = 0 ; i < 10 ; ++i)
{
cout << "Iteration: " << i << endl;
totals [0] += TimeFunction (SimpleTest, "Initial Copy", source_data, rgb, bw);
totals [1] += TimeFunction ( ASM, " ASM Copy", source_data, rgb, bw);
}
LARGE_INTEGER
freq;
QueryPerformanceFrequency (&freq);
freq.QuadPart /= 100000;
cout << totals [0] / freq.QuadPart << "ns" << endl;
cout << totals [1] / freq.QuadPart << "ns" << endl;
delete [] bw;
delete [] rgb;
delete [] source_data;
return 0;
}
And the ratio between C and assembler I was getting was about 2.5:1, i.e. C was 2.5 times the time of the assembler version.
I've just noticed the original data was in BGR order. If the copy swapped the B and R components then it does make the assembler code a bit more complex. But it would also make the C code more complex too.
Ideally, you need to work out what the theoretical minimum time is and compare it to what you're actually getting. To do that, you need to know the memory frequency and the type of memory and the workings of the CPU's MMU.
You might try using a simple cast to get your RGB data, and just recompute the grayscale data:
#pragma pack(1)
typedef unsigned char bw_t;
typedef struct {
unsigned char blue;
unsigned char green;
unsigned char red;
} rgb_t;
#pragma pack(pop)
rgb_t *imageRGB = (rgb_t*)pImage;
bw_t *imageBW = (bw_t*)calloc(640*480, sizeof(bw_t));
// RGB(X,Y) = imageRGB[Y*480 + X]
// BW(X,Y) = imageBW[Y*480 + X]
for (int y = 0; y < 640; ++y)
{
// try and pull some larger number of bytes from pImage (24 is arbitrary)
// 24 / sizeof(rgb_t) = 8
for (int x = 0; x < 480; x += 24)
{
imageBW[y*480 + x ] = GRAYSCALE(imageRGB[y*480 + x ]);
imageBW[y*480 + x + 1] = GRAYSCALE(imageRGB[y*480 + x + 1]);
imageBW[y*480 + x + 2] = GRAYSCALE(imageRGB[y*480 + x + 2]);
imageBW[y*480 + x + 3] = GRAYSCALE(imageRGB[y*480 + x + 3]);
imageBW[y*480 + x + 4] = GRAYSCALE(imageRGB[y*480 + x + 4]);
imageBW[y*480 + x + 5] = GRAYSCALE(imageRGB[y*480 + x + 5]);
imageBW[y*480 + x + 6] = GRAYSCALE(imageRGB[y*480 + x + 6]);
imageBW[y*480 + x + 7] = GRAYSCALE(imageRGB[y*480 + x + 7]);
}
}
Several steps you can take. Result at the end of this answer.
First, use pointers.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int y=0; y < 640; ++y) {
for (int x=0; x < 480; ++x) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
}
If imgRGB and imgBW are declared as:
unsigned char imgBW[480][640];
RGB imgRGB[480][640];
You can combine the two loops:
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int i=0; i < 640 * 480; ++i) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
You can exploit the fact that word reads are faster than four char reads. We will use a helper macro for this. Note this example assumes a little-endian target system.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = pImage;
for (int i=0; i < 640 * 480; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->blue = pixels; \
pixels >>= 8; \
\
rgbOut->green = pixels; \
*bwOut = pixels; \
pixels >>= 8; \
\
rgbOut->red = pixels; \
pixels >>= 8; \
\
++rgbOut; \
++bwOut;
#define READ_PIXEL(shift) \
pixels |= (*curPixelGroup++) << (shift * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
READ_PIXEL(3); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef COPY_PIXELS
}
(Your compiler will probably optimize away the redundant or operation in the first READ_PIXEL. It will also optimize shifts, removing the redundant << 0, too.)
If the structure of RGB is thus:
struct RGB {
unsigned char blue, green, red;
};
You can optimize even further, copy to the struct directly, instead of through its members (red, green, blue). This can be done using anonymous structs (or casting, but that makes the code a bit more messy and probably more prone to error). (Again, this is dependant on little-endian systems, etc. etc.):
union RGB {
struct {
unsigned char blue, green, red;
};
uint32_t rgb:24; // Make sure it's a bitfield, otherwise the union will strech and ruin the ++ operator.
};
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = pImage;
for (int i=0; i < 640 * 480; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->rgb = pixels; \
pixels >>= 8; \
\
*bwOut = pixels; \
pixels >>= 16; \
\
++rgbOut; \
++bwOut;
#define READ_PIXEL(shift) \
pixels |= (*curPixelGroup++) << (shift * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
READ_PIXEL(3); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef COPY_PIXELS
}
You can optimize writing the pixel similarly as we did with reading (writing in words rather than 24-bits). In fact, that'd be a pretty good idea, and will be a great next step in optimization. Too tired to code it, though. =]
Of course, you can write the routine in assembly language. This makes it less portable than it already is, however.
I'm assuming the following at the moment, so please let me know if my assumptions are wrong:
a) imgRGB is a structure of the type
struct ImgRGB
{
unsigned char blue;
unsigned char green;
unsigned char red;
};
or at least something similar.
b) imgBW looks something like this:
struct ImgBW
{
unsigned char BW;
};
c) The code is single threaded
Assuming the above, I see several problems with your code:
You put the assignment to the BW part right in the middle of the assignments to the other containers. If you're working on a modern CPU, chances are that with the size of your data your L1 cache gets invalidated every time you're switching containers and you're looking at reloading or switching a cache line. Caches are optimised for linear access these days so hopping to and fro doesn't help. Accessing main memory is a lot slower, so that would be a noticeable performance hit. To verify if this is a problem, temporarily I'd remove the assignment to imgBW and measure if there is a noticeable speedup.
The array access doesn't help and it'll potentially slow down the code a little, although a decent optimiser should take care of that. I'd probably write the loop along these lines instead, but would not expect a big performance gain. Maybe a couple percent.
for (int y=0; y blue = *pImage;
...
}
}
For consistency I would change from using postfix to prefix increment but I would not expect to see a big gain.
If you can waste a little storage (well, 25%) you might gain from adding a fourth dummy unsigned char to the structure ImgRGB provided that this would increase the size of the structure to the size of an int. Native ints are usually fastest to access and if you're looking at a structure of chars that are not filling up an int completely, you're potentially running into all sorts of interesting access issues that can slow your code down noticeably because the compiler might have to generate additional instructions to extract the unsigned chars. Again, try this and measure the result - it might make a noticeable difference or none at all. In the same vein, upping the size of the structure members from unsigned char to unsigned int might waste lots of space but potentially can speed up the code. Nevertheless as long as pImage is a pointer to an unsigned char, you would only eliminate half the problem.
All in all you are down to making your loop fit to your underlying hardware, so for specific optimisation techniques you might have to read up on what your hardware does well and what it does badly.
Make sure pImage, imgRGB, and imgBW are marked __restrict.
Use SSE and do it sixteen bytes at a time.
Actually from what you're doing there it looks like you could use a simple memcpy() to copy pImage into imgRGB (since imgRGB is in row-major format and apparently in the same order as pImage). You could fill out imgBW by using a series of SSE swizzle and store ops to pack down the green values but it might be cumbersome since you'd need to work on ( 3*16 =) 48 bytes at a time.
Are you sure pImage and your output arrays are all in dcache when you start this? Try using a prefetch hint to fetch 128 bytes ahead and measure to see if that improves things.
Edit If you're not on x86, replace "SSE" with the appropriate SIMD instruction set for your hardware, of course. (That'd be VMX, Altivec, SPU, VLIW, HLSL, etc.)
If possible, fix this at a higher level then bit or instruction twiddling!
You could specialize the the B&W image class to one that references the green channel of the color image class (thus saving a copy per pixel). If you always create them in pair, you might not even need the naive imgBW class at all.
By taking care about how your store the data in imgRGB, you could copy a triplet at a time from the input data. Better, you might copy the whole thing, or even just store a reference (which makes the previous suggestion easy as well).
If you don't control the implementation of everything here, you might be stuck, then:
Last resort: unroll the loop (cue someone mentioning Duff's device, or just ask the compiler to do it for you...), though I don't think you'll see much improvement...
It seems that you defined each pixel as some kind of structure or object. Using a primitive type (say, int) could be faster. As others have mentioned, the compiler is likely to optimize the array access using pointer increments. If the compile doesn't do that for you, you can do that yourself to avoid multiplications when you use array[][].
Since you only need 3 bytes per pixel, you could pack one pixel into one int. By doing that, you could copy 3 bytes a time instead of byte-by-byte. The only tricky thing is when you want to read individual color components of a pixel, you will need some bit masking and shifting. This could give you more overhead than that saved by using an int.
Or you can use 3 int arrays for 3 color components respectively. You will need a lot more storage, though.
Here is one very tiny, very simple optimization:
You are referring to imageRGB[y][x] repeatedly, and that likely needs to be re-calculated at each step.
Instead, calculate it once, and see if that makes some improvement:
Pixel* apixel;
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
apixel = &imgRGB[y][x];
apixel->blue = *pImage;
pImage++;
apixel->green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
apixel->red = *pImage;
pImage++;
}
}
If pImage is already entirely in memory, why do you need to massage the data? I mean if it is already in pseudo-RGB format, why can't you just write some inline routines/macros that can spit out the values on demand instead of copying it around?
If rearranging the pixel data is important for later operations, consider block operations and/or cache line optimization.