I'm attempting to use vector intrinsics to speed up a trivial piece of code (as a test), and I'm not getting a speedup - in fact, it sometimes runs a bit slower. I'm wondering two things:
Do vectorized instructions speed up simple load-from-one-region / store-to-another operations in any way?
Division intrinsics aren't yielding anything faster either, and in fact, I started getting segfaults when I introduced the _mm256_div_pd intrinsic. Is my usage correct?
constexpr size_t VECTORSIZE{ (size_t)1024 * 1024 * 64 }; // large array to force main memory accesses

void normal_copy(const fftw_complex* in, fftw_complex* copyto, size_t copynum)
{
    for (size_t i = 0; i < copynum; i++)
    {
        copyto[i][0] = in[i][0] / 128.0;
        copyto[i][1] = in[i][1] / 128.0;
    }
}
#if defined(_WIN32) || defined(_WIN64)
void avx2_copy(const fftw_complex* __restrict in, fftw_complex* __restrict copyto, size_t copynum)
#else
void avx2_copy(const fftw_complex* __restrict__ in, fftw_complex* __restrict__ copyto, size_t copynum)
#endif
{   // AVX2 supports 256-bit vectorized instructions
    constexpr double zero = 0.0;
    constexpr double dnum = 128.0;
    __m256d tmp = _mm256_broadcast_sd(&zero);
    __m256d div = _mm256_broadcast_sd(&dnum);
    for (size_t i = 0; i < copynum; i += 2)
    {
        tmp = _mm256_load_pd(&in[i][0]);
        tmp = _mm256_div_pd(tmp, div);
        _mm256_store_pd(&copyto[i][0], tmp);
    }
}
int main()
{
    fftw_complex* invec   = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    fftw_complex* outvec1 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    fftw_complex* outvec3 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));

    // some initialization stuff for invec

    // some timing stuff (wall clock)
    normal_copy(invec, outvec1, VECTORSIZE);

    // some timing stuff (wall clock)
    avx2_copy(invec, outvec3, VECTORSIZE);

    return 0;
}
fftw_complex is a datatype equivalent to std::complex<double>. I've tested with both g++ (with -O3 and -ftree-vectorize) on Linux and Visual Studio on Windows, with the same results: the AVX2 copy-and-divide is slower and segfaults for certain array sizes. Tested array sizes are always powers of 2, so reading past the end of the array (from _mm256_load_pd) doesn't seem to be the issue. Any thoughts?
To put it shortly: using SIMD instructions does not help much here, except for the use of non-temporal stores.
Do vectorized instructions speed up simple load-from-one-region / store-to-another operations in any way?
This depends on the type of data being copied, the target processor, and the target RAM. That being said, in your case, a modern x86-64 processor should nearly saturate the memory hierarchy with scalar code, because modern cores can load 8 bytes and store 8 bytes in parallel per cycle, and most run at 2.5 GHz or more. That gives 2 x 8 B/cycle x 2.5 GHz = 40 GB/s, i.e. about 37.2 GiB/s for a core at this minimum frequency. While this is generally not enough to saturate the L1 or L2 cache, it is enough to saturate the RAM of most PCs.
In practice, this is significantly more complex and the required bandwidth is underestimated above. Indeed, Intel x86-64 processors and AMD Zen ones use a write-allocate cache policy, which causes written cache lines to be read from memory before being written back. This means that the actually required throughput would be 37.2 x 1.5, roughly 56 GiB/s. This is a problem: even if the RAM could sustain such a throughput, a single core often cannot, because of the very high latency of the RAM compared to the size of the caches and the capability of the hardware prefetchers (see this related post for more information). To avoid wasting memory throughput and so increase the effective throughput, you can use non-temporal streaming stores (aka NT stores) like _mm256_stream_pd. Note that such an instruction requires the data pointer to be aligned.
Note that NT stores are only useful for data that is not reused soon or that is too big to fit in caches. Note also that memcpy should already use NT stores on x86-64 processors for relatively big inputs. Finally, note that working in-place does not suffer from the write-allocate overhead, since the destination lines are read anyway.
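For illustration, here is a minimal sketch of a plain copy loop with NT stores. The function name is made up, and it assumes copyto is 32-byte aligned (worth checking: fftw_malloc usually gives 16- or 32-byte alignment) and copynum is even; otherwise add a scalar prologue/epilogue.

#include <immintrin.h>
#include <fftw3.h>

// Sketch: plain copy using non-temporal stores. Assumes `copyto` is 32-byte
// aligned and `copynum` is even.
void nt_copy(const fftw_complex* __restrict in, fftw_complex* __restrict copyto,
             size_t copynum)
{
    for (size_t i = 0; i < copynum; i += 2) {            // 2 complex = 4 doubles
        __m256d v = _mm256_loadu_pd(&in[i][0]);          // unaligned load is fine
        _mm256_stream_pd(&copyto[i][0], v);              // NT store: no read-for-ownership
    }
    _mm_sfence();  // make the NT stores ordered before the data is reused
}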
Division intrinsics aren't yielding anything faster either, and in fact, I started getting segfaults when I introduced the _mm256_div_pd intrinsic. Is my usage correct?
Because of the possible address misalignment (mentioned in the comments), you need a scalar loop to process a few items until the address is aligned. As also mentioned in the comments, using a multiplication (_mm256_mul_pd) by 1./128. is much more efficient than a division. The multiplication adds some latency, but does not impact the throughput.
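A minimal sketch of the peel-then-vectorize idea, under the assumption that in and copyto are at least 16-byte aligned and share the same misalignment relative to 32 bytes (typical for fftw_malloc'd buffers); otherwise keep unaligned loads and only align the stores. The function name is made up for illustration.

#include <immintrin.h>
#include <cstdint>
#include <fftw3.h>

// Sketch: scale by 1/128 with a scalar prologue that runs until the output
// pointer is 32-byte aligned, then an AVX main loop with aligned stores.
void scaled_copy(const fftw_complex* in, fftw_complex* copyto, size_t copynum)
{
    const double k = 1.0 / 128.0;
    size_t i = 0;

    // Scalar prologue: one fftw_complex is 16 bytes, so at most one iteration
    // is needed to reach 32-byte alignment of the output.
    while (i < copynum && (reinterpret_cast<std::uintptr_t>(&copyto[i]) & 31) != 0) {
        copyto[i][0] = in[i][0] * k;
        copyto[i][1] = in[i][1] * k;
        ++i;
    }

    const __m256d scale = _mm256_set1_pd(k);
    for (; i + 2 <= copynum; i += 2) {                   // 2 complex = 4 doubles
        __m256d v = _mm256_loadu_pd(&in[i][0]);          // load may still be unaligned
        _mm256_store_pd(&copyto[i][0], _mm256_mul_pd(v, scale));  // or _mm256_stream_pd
    }

    for (; i < copynum; ++i) {                           // scalar epilogue
        copyto[i][0] = in[i][0] * k;
        copyto[i][1] = in[i][1] * k;
    }
}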
PS: do not forget to free the allocated memory.
Isn't it common knowledge that math operations on 64-bit systems run faster on 32/64-bit datatypes than on smaller datatypes like short, due to implicit promotion? Yet while testing my bitset implementation (where most of the time is spent on bitwise operations), I found I got a ~40% improvement using uint8_t over uint32_t. I'm especially surprised because there is hardly any copying going on that would justify the difference. The same thing occurred regardless of the clang optimisation level.
8bit:
#define mod8(x) x&7
#define div8(x) x>>3

template<unsigned long bits>
struct bitset{
private:
    uint8_t fill[8] = {};
    uint8_t clear[8];
    uint8_t band[(bits/8)+1] = {};
public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div8(ind)]&fill[mod8(ind)];
    }
    template<typename T>
    inline void store_high(const T ind){
        band[div8(ind)] |= fill[mod8(ind)];
    }
    template<typename T>
    inline void store_low(const T ind){
        band[div8(ind)] &= clear[mod8(ind)];
    }
    bitset(){
        for(uint8_t ii = 0, val = 1; ii < 8; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val *= 2;
        }
    }
};
32bit:
#define mod32(x) x&31
#define div32(x) x>>5

template<unsigned long bits>
struct bitset{
private:
    uint32_t fill[32] = {};
    uint32_t clear[32];
    uint32_t band[(bits/32)+1] = {};
public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div32(ind)]&fill[mod32(ind)];
    }
    template<typename T>
    inline void store_high(const T ind){
        band[div32(ind)] |= fill[mod32(ind)];
    }
    template<typename T>
    inline void store_low(const T ind){
        band[div32(ind)] &= clear[mod32(ind)];
    }
    bitset(){
        for(uint32_t ii = 0, val = 1; ii < 32; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val *= 2;
        }
    }
};
And here is the benchmark I used (it just moves a single 1 from position 0 to the end iteratively):
const int len = 1000000;
bitset<len> bs;
{
    auto start = std::chrono::high_resolution_clock::now();
    bs.store_high(0);
    for (int ii = 1; ii < len; ++ii) {
        bs.store_high(ii);
        bs.store_low(ii-1);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>((stop-start)).count() << std::endl;
}
TL:DR: large "buckets" for a bitset mean you access the same one repeatedly when you iterate linearly, creating longer dependency chains that out-of-order exec can't overlap as effectively.
Smaller buckets give instruction-level parallelism, making operations on bits in separate bytes independent of each other.
One possible reason is that you iterate linearly over the bits, so all the operations within the same band[] element form one long dependency chain of &= and |= operations, plus a store and reload (if the compiler doesn't manage to optimize that away with loop unrolling).
For uint32_t band[], that's a chain of 2x 32 operations, since ii>>5 will give the same index for that long.
Out-of-order exec can only partially overlap execution of these long chains if their latency and instruction-count is too large for the ROB (ReOrder Buffer) and RS (Reservation Station, aka Scheduler). With 64 operations probably including store/reload latency (4 or 5 cycles on modern x86), that's a dep chain length of probably 6 x 64 = 384 cycles, composed of probably at least 128 uops, with some parallelism for loading (or better calculating) 1U<<(n&31) or rotl(-1U, n&31) masks that can "use up" some of the wasted execution slots in the pipeline.
But for uint8_t band[], you're moving to a new element 4x as frequently, after only 2x 8 = 16 operations, so the dep chains are 1/4 the length.
See also Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another case of a modern x86 CPU overlapping two long dependency chains (a simple chain of imul with no other instruction-level parallelism), especially the part about a single dep chain becoming longer than the RS (scheduler for un-executed uops) being the point at which we start to lose some of the overlap of execution of the independent work. (For the case without lfence to artificially block overlap.)
See also Modern Microprocessors: A 90-Minute Guide! and https://www.realworldtech.com/sandy-bridge/ for some background on how modern OoO exec CPUs decode and look at instructions.
Small vs. large buckets
Large buckets are only useful when scanning through for the first non-zero bit, or filling the whole thing or something. Of course, really you'd want to vectorize that with SIMD, checking 16 or 32 bytes at once to see if there's a non-zero element anywhere in that. Current compilers will vectorize for you in loops that fill the whole array, but not search loops (or anything with a trip-count that can't be calculated ahead of the first iteration), except for ICC which can handle that. Re: using fast operations over bit-vectors, see Howard Hinnant's article (in the context of vector<bool>, which is an unfortunate name for a sometimes-useful data structure.)
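For example, here is a hedged sketch of such a search loop with AVX2. The function name is made up, __builtin_ctz is a GCC/Clang builtin, and a real version needs a scalar tail for sizes that aren't a multiple of 32.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sketch: find the index of the first non-zero byte in buf[0..n), checking
// 32 bytes per iteration with AVX2. Assumes n is a multiple of 32.
std::size_t first_nonzero_byte(const std::uint8_t* buf, std::size_t n)
{
    const __m256i zero = _mm256_setzero_si256();
    for (std::size_t i = 0; i < n; i += 32) {
        __m256i v   = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(buf + i));
        __m256i eqz = _mm256_cmpeq_epi8(v, zero);            // 0xFF where byte == 0
        unsigned mask = ~static_cast<unsigned>(_mm256_movemask_epi8(eqz));
        if (mask != 0)                                       // some byte was non-zero
            return i + __builtin_ctz(mask);                  // position of first set bit
    }
    return n;                                                // all zero
}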
C++ unfortunately doesn't make it easy in general to use different sized accesses to the same data, unless you compile with g++ -O3 -fno-strict-aliasing or something like that.
Although unsigned char can always alias anything else, so you could use that for your single-bit accesses, only using uintptr_t (which is likely to be as wide as a register, except on ILP32-on-64bit ISAs) for init or whatever. Or in this case, uint_fast32_t being a 64-bit type on many x86-64 C++ implementations would make it useful for this, unlike usual when that sucks, wasting cache footprint when you're only using the value-range of a 32-bit number and being slower for non-constant division on some CPUs.
On x86 CPUs, a byte store is naturally fully efficient, but even on an ARM or something, coalescing in the store buffer could still make adjacent byte RMWs fully efficient. (Are there any modern CPUs where a cached byte store is actually slower than a word store?). And you'd still gain ILP; a slower commit to cache is still not as bad as coupling loads to stores that could have been independent if narrower. Especially important on lower-end CPUs with smaller out-of-order scheduler buffers.
(x86 byte loads need to use movzx to zero-extend to avoid false dependencies, but most compilers know that. Clang is reckless about it which can occasionally hurt.)
(Different sized accesses close to each other can lead to store-forwarding stalls, e.g. a byte store and an unsigned long reload that overlaps that byte will have extra latency: What are the costs of failed store-to-load forwarding on x86?)
Code review:
Storing an array of masks is probably worse than just computing 1u << (n&31) as needed, on most CPUs. If you're really lucky, a smart compiler might manage constant propagation from the constructor into the benchmark loop, and realize that it can rotate or shift inside the loop to generate the bitmask, instead of indexing memory in a loop that already does other memory operations.
(Some non-x86 ISAs have better bit-manipulation instructions and can materialize 1<<n cheaply, although x86 can do that in 2 instructions as well if compilers are smart. xor eax,eax / bts eax, esi, with the BTS implicitly masking the shift count by the operand-size. But that only works so well for 32-bit operand-size, not 8-bit. Without BMI2 shlx, x86 variable-count shifts run as 3-uops on Intel CPUs, vs. 1 on AMD.)
Almost certainly not worth it to store both fill[] and clear[] constants. Some ISAs even have an andn instruction that can NOT one of the operands on the fly, i.e. implements (~x) & y in one instruction. For example, x86 with BMI1 extensions has andn. (gcc -march=haswell).
Also, your macros are unsafe: wrap the expression in () so operator precedence doesn't bite you if you write foo[div8(x) - 1].
As in #define div8(x) (x>>3)
But really, you shouldn't be using CPP macros for stuff like this anyway. Even in modern C, just define static const int shift = 3; shift counts and masks. In C++, do that inside the struct/class scope, and use band[idx >> shift] or something. (When I was typing ind, my fingers wanted to type int; idx is probably a better name.)
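Putting those review points together, here is a hedged sketch of what the accessors could look like without the mask arrays (uint32_t buckets, constants in class scope; the constructor is no longer needed). Treat it as an illustration, not a drop-in replacement.

#include <cstdint>

// Sketch: bitset accessors computing masks on the fly instead of loading
// them from fill[]/clear[] arrays.
template <unsigned long bits>
struct bitset {
private:
    static const unsigned shift = 5;                 // log2(32)
    static const unsigned mask  = 31;                // 32 - 1
    std::uint32_t band[(bits / 32) + 1] = {};
public:
    template <typename T>
    bool operator[](T idx) const {
        return (band[idx >> shift] >> (idx & mask)) & 1u;
    }
    template <typename T>
    void store_high(T idx) {
        band[idx >> shift] |= std::uint32_t(1) << (idx & mask);
    }
    template <typename T>
    void store_low(T idx) {
        // Compilers targeting BMI1 can turn the ~mask & value pattern into andn.
        band[idx >> shift] &= ~(std::uint32_t(1) << (idx & mask));
    }
};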
Isn't it common knowledge that math operations on 64-bit systems run faster on 32/64-bit datatypes than on smaller datatypes like short, due to implicit promotion?
This isn't a universal truth. As always, it depends on the details.
Why does this piece of code written using uint_8 run faster than analogous code written with uint_32 or uint_64 on a 64bit machine?
The title doesn't match the question. There are no such types as uint_X, and you aren't using uintX_t either. You are using uint_fastX_t. uint_fastX_t is an alias for an integer type that is at least X bits wide and that the language implementers deem to provide the fastest operations.
If we were to take your earlier-mentioned assumption for granted, then it should logically follow that the language implementers would have chosen a 32/64-bit type as uint_fast8_t. That said, you cannot assume that they have done so, and whatever generic measurement (if any) was used to make that choice doesn't necessarily apply to your case.
That said, regardless of which type uint_fast8_t is an alias of, your test isn't fair for comparing the relative speeds of calculation of potentially different integer types:
uint_fast8_t fill[8] = {};
uint_fast8_t clear[8];
uint_fast8_t band[(bits/8)+1] = {};

uint_fast32_t fill[32] = {};
uint_fast32_t clear[32];
uint_fast32_t band[(bits/32)+1] = {};
Not only are the types (potentially) different, but the sizes of the arrays are too. This can certainly have an effect on the efficiency.
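If you want to see what you're actually comparing, it's easy to check what the fast types resolve to on your implementation. A trivial sketch; the sizes mentioned in the comment are just what glibc on x86-64 typically picks, which may differ on your platform.

#include <cstdint>
#include <iostream>

// Prints the sizes of the "fast" types on this implementation. On glibc
// x86-64, uint_fast8_t is typically 1 byte while uint_fast32_t is 8 bytes.
int main() {
    std::cout << "uint_fast8_t:  " << sizeof(std::uint_fast8_t)  << " bytes\n"
              << "uint_fast32_t: " << sizeof(std::uint_fast32_t) << " bytes\n"
              << "uint8_t:       " << sizeof(std::uint8_t)       << " bytes\n"
              << "uint32_t:      " << sizeof(std::uint32_t)      << " bytes\n";
}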
The short question: I have a function that takes two vectors, one input and one output (no aliasing). I can only align one of them; which should I choose?
The longer version: consider this function,
void func(size_t n, void *in, void *out)
{
    __m256i *in256  = reinterpret_cast<__m256i *>(in);
    __m256i *out256 = reinterpret_cast<__m256i *>(out);
    while (n >= 32) {
        __m256i data = _mm256_loadu_si256(in256++);
        // process data
        _mm256_storeu_si256(out256++, data);
        n -= 32;
    }
    // process the remaining n % 32 bytes
}
If in and out are both 32-bytes aligned, then there's no penalty of using vmovdqu instead of vmovdqa. The worst case scenario is that both are unaligned, and one in four load/store will cross the cache-line boundary.
In this case, I can align one of them to the cache line boundary by processing a few elements first before entering the loop. However, the question is which should I choose? Between unaligned load and store, which one is worse?
At the risk of stating the obvious: there is no "right answer" except "you need to benchmark both with actual code and actual data". Whichever variant is faster depends strongly on the CPU you are using, the amount of computation you do on each packet, and many other things.
As noted in the comments, you should also try non-temporal stores. What also sometimes can help is to load the input of the following data packet inside the current loop, i.e.:
__m256i next = _mm256_loadu_si256(in256++);
for (...) {
    __m256i data = next;    // usually 0 cost
    next = _mm256_loadu_si256(in256++);
    // do computations and store data
}
If the calculations you are doing have unavoidable data latencies, you should also consider processing two packets interleaved (this uses twice as many registers, though).
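A minimal sketch of that two-way interleaving, with a placeholder process() standing in for the real per-packet work (the structure is the point, not the body; func2 and the multiple-of-64 assumption are mine):

#include <immintrin.h>
#include <cstddef>

static inline __m256i process(__m256i v) { return v; }   // placeholder computation

// Sketch: process two 32-byte packets per iteration so their independent
// computation latencies can overlap. Assumes n is a multiple of 64 for brevity.
void func2(std::size_t n, const void* in, void* out)
{
    const __m256i* in256  = static_cast<const __m256i*>(in);
    __m256i*       out256 = static_cast<__m256i*>(out);
    while (n >= 64) {
        __m256i a = _mm256_loadu_si256(in256 + 0);
        __m256i b = _mm256_loadu_si256(in256 + 1);        // independent of `a`
        a = process(a);
        b = process(b);                                   // can overlap with `a`'s latency
        _mm256_storeu_si256(out256 + 0, a);
        _mm256_storeu_si256(out256 + 1, b);
        in256  += 2;
        out256 += 2;
        n -= 64;
    }
    // handle the remaining n % 64 bytes as in the original
}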
I have a 4x4 matrix class:
class matrix {
public:
    matrix() {}
    matrix(float m11,float m21,float m31,float m41,
           float m12,float m22,float m32,float m42,
           float m13,float m23,float m33,float m43,
           float m14,float m24,float m34,float m44);
    matrix(const float*);
    matrix(const matrix&);
    matrix operator *(const matrix& other) const;
    static const matrix identity;
private:
    union {
        float m[16];
        struct {
            float m11,m21,m31,m41;
            float m12,m22,m32,m42;
            float m13,m23,m33,m43;
            float m14,m24,m34,m44;
        };
        struct {
            float element[4][4];
        };
    };
};
Below is the first implementation of the multiplication operator,
matrix matrix::operator*(const matrix &other) const {
    return matrix(
        m11*other.m11+m12*other.m21+m13*other.m31+m14*other.m41,
        m21*other.m11+m22*other.m21+m23*other.m31+m24*other.m41,
        m31*other.m11+m32*other.m21+m33*other.m31+m34*other.m41,
        m41*other.m11+m42*other.m21+m43*other.m31+m44*other.m41,
        m11*other.m12+m12*other.m22+m13*other.m32+m14*other.m42,
        m21*other.m12+m22*other.m22+m23*other.m32+m24*other.m42,
        m31*other.m12+m32*other.m22+m33*other.m32+m34*other.m42,
        m41*other.m12+m42*other.m22+m43*other.m32+m44*other.m42,
        m11*other.m13+m12*other.m23+m13*other.m33+m14*other.m43,
        m21*other.m13+m22*other.m23+m23*other.m33+m24*other.m43,
        m31*other.m13+m32*other.m23+m33*other.m33+m34*other.m43,
        m41*other.m13+m42*other.m23+m43*other.m33+m44*other.m43,
        m11*other.m14+m12*other.m24+m13*other.m34+m14*other.m44,
        m21*other.m14+m22*other.m24+m23*other.m34+m24*other.m44,
        m31*other.m14+m32*other.m24+m33*other.m34+m34*other.m44,
        m41*other.m14+m42*other.m24+m43*other.m34+m44*other.m44
    );
}
and I tried to use SSE instructions to accelerate it with the version below,
matrix matrix::operator*(const matrix &other) const {
    float r[4][4];
    __m128 c1 = _mm_loadu_ps(&m11);
    __m128 c2 = _mm_loadu_ps(&m12);
    __m128 c3 = _mm_loadu_ps(&m13);
    __m128 c4 = _mm_loadu_ps(&m14);
    for (int i = 0; i < 4; ++i) {
        __m128 v1 = _mm_set1_ps(other.element[i][0]);
        __m128 v2 = _mm_set1_ps(other.element[i][1]);
        __m128 v3 = _mm_set1_ps(other.element[i][2]);
        __m128 v4 = _mm_set1_ps(other.element[i][3]);
        __m128 col = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(v1,c1), _mm_mul_ps(v2,c2)),
            _mm_add_ps(_mm_mul_ps(v3,c3), _mm_mul_ps(v4,c4))
        );
        _mm_storeu_ps(r[i], col);
    }
    return matrix(&r[0][0]);
}
But on my MacBook Pro, doing 100000 matrix multiplications costs about 6 ms for the first version and about 8 ms for the second version.
I want to know why this happens.
Is it perhaps because the CPU pipeline lets the first version run its computations concurrently, while the loads/stores hold back the second version?
You benefit from massive instruction parallelism in the first (scalar) case, when you allow the compiler to optimize the code as it sees best. By arranging the code so as to minimize data dependencies, even though that may result in more total instructions being required, each instruction can be run simultaneously on different execution units. There are lots of registers available, so most of the values can be kept enregistered, minimizing the need for costly memory reads, and even when memory reads are necessary, they can be done nearly for free while other operations are completing, thanks to out-of-order execution scheduling. I would further speculate that you are benefitting from μ-op caching here, the benefit of which is compensating for the increased code size.
In the second (parallel) case, you're creating significant data dependencies. Even when the compiler emits optimal object code (and this isn't necessarily going to be the case when you use intrinsics), there is a cost involved in forcing this parallelism. You can see that if you ask the compiler to show you an assembly listing. There are tons of shufps instructions required to pack and reorder the floating-point operands within the SSE registers between operations. That only takes a single cycle on modern Intel architectures*, but the subsequent addps and mulps operations cannot execute in parallel. They have to wait for it to complete. Chances are very good that this code is hitting up against a hard μ-op throughput bottleneck. (You may also be paying an unaligned data penalty in this code, but that is minimal on modern architectures.)
In other words, you've traded parallelism (at the expense of larger code) for increased data dependencies (albeit with smaller code). At least, that would be my semi-educated guess, looking at the disassembly for your example code. In this case, your benchmark tells you very clearly that it did not work out in your favor.
Things might change if you instructed the compiler to assume AVX support. If the target architecture does not support AVX, the compiler has no choice but to transform your _mm_set1_ps intrinsic into a pair of movss, shufps instructions. If you enable AVX support, you'll get a single vbroadcastss instruction instead, which may be faster, especially with AVX2 support, where you can broadcast from register-to-register (instead of only from memory-to-register). With AVX support, you also get the benefit of VEX-encoded instructions.
* Although on certain older architectures like Core 2, shufps was an integer-based instruction, and therefore resulted in a delay when it was followed by a floating-point instruction like addps or mulps. I can't remember when exactly this was fixed, but certainly it is not a problem on Sandy Bridge and later.
I am modifying RNNLM, a neural net for studying language models. However, given the size of my corpus, it's running really slow. I tried to optimize the matrix*vector routine (which accounts for 63% of the total time on a small data set; I would expect it to be worse on larger sets). Right now I am stuck with intrinsics.
for (b = 0; b < (to-from)/8; b++)
{
    val = _mm256_setzero_ps();
    for (a = from2; a < to2; a++)
    {
        t1 = _mm256_set1_ps(srcvec.ac[a]);
        t2 = _mm256_load_ps(&(srcmatrix[a+(b*8+from+0)*matrix_width].weight));
        //val = _mm256_fmadd_ps(t1, t2, t3)
        t3 = _mm256_mul_ps(t1, t2);
        val = _mm256_add_ps(val, t3);
    }
    t4 = _mm256_load_ps(&(dest.ac[b*8+from+0]));
    t4 = _mm256_add_ps(t4, val);
    _mm256_store_ps(&(dest.ac[b*8+from+0]), t4);
}
This example crashes on:
_mm256_store_ps (&(dest.ac[b*8+from+0]), t4);
However, if I change it to
_mm256_storeu_ps (&(dest.ac[b*8+from+0]), t4);
(with u for unaligned, I suppose) everything works as intended. My question is: why does the load work (when it isn't supposed to, if the data is unaligned) while the store doesn't? (Furthermore, both operate on the same address.)
dest.ac has been allocated using
void *_aligned_calloc(size_t nelem, size_t elsize, size_t alignment=64)
{
    size_t max_size = (size_t)-1;
    // Watch out for overflow
    if(elsize == 0 || nelem >= max_size/elsize)
        return NULL;
    size_t size = nelem * elsize;
    void *memory = _mm_malloc(size+64, alignment);
    if(memory != NULL)
        memset(memory, 0, size);
    return memory;
}
and it's at least 50 elements long.
(BTW, with VS2012 I got an illegal instruction on some random assignment, so I use Linux.)
Thank you in advance,
Arkantus.
TL:DR: in optimized code, loads will fold into memory operands for other operations, which don't have alignment requirements in AVX. Stores won't.
Your sample code doesn't compile by itself, so I can't easily check what instruction _mm256_load_ps compiles to.
I tried a small experiment with gcc 4.9, and it doesn't generate a vmovaps at all for _mm256_load_ps, since I only used the result of the load as an input to one other instruction. It generates that instruction with a memory operand. AVX instructions have no alignment requirements for their memory operands. (There is a performance hit for crossing a cache line, and a bigger hit for crossing a page boundary, but your code still works.)
The store, on the other hand, does generate a vmov... instruction. Since you used the alignment-required version, it faults on unaligned addresses. Simply use the unaligned version; it'll be just as fast when the address is aligned, and still work when it isn't.
I didn't check your code carefully to see whether all the accesses SHOULD be aligned. I assume not, from the way you phrased it, just asking why you weren't also getting faults for the unaligned loads. Like I said, your code probably just didn't compile to any vmovaps load instructions; the loads were folded into memory operands, which don't enforce alignment.
Are you running AVX (without AVX2 or FMA?) on a Sandy/Ivybridge CPU? I assume that's why your FMA intrinsics are commented out.
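For completeness, here is a hedged sketch of the kernel with alignment-safe stores and the FMA line enabled. I've simplified srcvec.ac / srcmatrix[...].weight / dest.ac to plain float arrays, so the function name, parameters and layout are assumptions rather than your actual types; compile with -mavx2 -mfma (or replace the fmadd with mul+add for plain AVX).

#include <immintrin.h>
#include <cstddef>

// Sketch of the inner kernel with unaligned loads/stores and FMA.
void matvec_block(const float* srcvec, const float* srcmatrix, float* dest,
                  std::size_t from, std::size_t to,
                  std::size_t from2, std::size_t to2,
                  std::size_t matrix_width)
{
    for (std::size_t b = 0; b < (to - from) / 8; ++b) {
        __m256 val = _mm256_setzero_ps();
        for (std::size_t a = from2; a < to2; ++a) {
            __m256 t1 = _mm256_set1_ps(srcvec[a]);
            __m256 t2 = _mm256_loadu_ps(&srcmatrix[a + (b * 8 + from) * matrix_width]);
            val = _mm256_fmadd_ps(t1, t2, val);          // fused multiply-add
        }
        __m256 t4 = _mm256_loadu_ps(&dest[b * 8 + from]);
        _mm256_storeu_ps(&dest[b * 8 + from], _mm256_add_ps(t4, val));  // no alignment requirement
    }
}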
I have image buffers of an arbitrary size that I copy into equal-sized or larger buffers at an x,y offset. The colorspace is BGRA. My current copy method is:
void render(guint8* src, guint8* dest, uint src_width, uint src_height,
            uint dest_x, uint dest_y, uint dest_buffer_width) {
    bool use_single_memcpy = (dest_x == 0) && (dest_y == 0) && (dest_buffer_width == src_width);

    if(use_single_memcpy) {
        memcpy(dest, src, src_width * src_height * 4);
    }
    else {
        dest += (dest_y * dest_buffer_width * 4);
        for(uint i = 0; i < src_height; i++) {
            memcpy(dest + (dest_x * 4), src, src_width * 4);
            dest += dest_buffer_width * 4;
            src += src_width * 4;
        }
    }
}
It runs fast but I was curious if there was anything I could do to improve it and gain a few extra milliseconds. If it involves going to assembly code I'd prefer to avoid that, but I'm willing to add additional libraries.
One popular answer on StackOverflow that does use x86-64 assembly and SSE can be found here: Very fast memcpy for image processing?. If you do use this code, you'll need to make sure your buffers are 128-bit (16-byte) aligned. A basic explanation for that code is that:
Non-temporal stores are used, so unnecessary cache writes can be bypassed and writes to main memory can be combined.
Reads and writes are interleaved in only very large chunks (doing many reads and then many writes). Performing many reads back-to-back typically has better performance than single read-write-read-write patterns.
Much larger registers are used (128 bit SSE registers).
Prefetch instructions are included as hints to the CPU's pipelining.
I found this document - Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540 - which seems to be the inspiration of the above code, but for older processor generations; however, it does contain a significant amount of discussion on how it works.
For instance, consider this discussion on write-combining / non-temporal stores:
The Pentium II and III CPU caches operate on 32-byte cache-line sized blocks. When data is written to or read from (cached) memory, entire cache lines are read or written. While this generally enhances CPU-memory performance, under some conditions it can lead to unnecessary data fetches. In particular, consider a case where the CPU will do an 8-byte MMX register store: movq. Since this is only one quarter of a cache line, it will be treated as a read-modify-write operation from the cache's perspective; the target cache line will be fetched into cache, then the 8-byte write will occur. In the case of a memory copy, this fetched data is unnecessary; subsequent stores will overwrite the remainder of the cache line. The read-modify-write behavior can be avoided by having the CPU gather all writes to a cache line then doing a single write to memory. Coalescing individual writes into a single cache-line write is referred to as write combining.

Write combining takes place when the memory being written to is explicitly marked as write combining (as opposed to cached or uncached), or when the MMX non-temporal store instruction is used. Memory is generally marked write combining only when it is used in frame buffers; memory allocated by VirtualAlloc is either uncached or cached (but not write combining). The MMX movntps and movntq non-temporal store instructions instruct the CPU to write the data directly to memory, bypassing the L1 and L2 caches. As a side effect, it also enables write combining if the target memory is cached.
If you'd prefer to stick with memcpy, consider investigating the source code for the memcpy implementation you're using. Some memcpy implementations look for native-word-aligned buffers to improve performance by using the full register size; others will automatically copy as much as possible using native-word-aligned and then mop-up the remainders. Making sure your buffers are 8-byte aligned will facilitate these mechanisms.
Some memcpy implementations contain a ton of up-front conditionals to make it efficient for small buffers (<512) - you may want to consider a copy-paste of the code with those chunks ripped out since you're presumably not working with small buffers.
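If you control the allocations, here is a minimal sketch of getting 16-byte-aligned buffers with C++17's std::aligned_alloc (MSVC doesn't ship it; _aligned_malloc is the usual substitute there). The function name and rounding policy are mine.

#include <cstdlib>
#include <cstdint>

// Sketch: allocate a BGRA buffer with 16-byte alignment so memcpy (or SSE
// code) can use full-width aligned accesses. std::aligned_alloc requires the
// size to be a multiple of the alignment, hence the round-up.
std::uint8_t* alloc_bgra(std::size_t width, std::size_t height)
{
    std::size_t bytes = width * height * 4;
    bytes = (bytes + 15) & ~std::size_t(15);             // round up to 16
    return static_cast<std::uint8_t*>(std::aligned_alloc(16, bytes));
    // release with std::free
}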
Your use_single_memcpy test is too restrictive. A slight rearrangement allows you to remove the dest_y == 0 requirement.
void render(guint8* src, guint8* dest,
            uint src_width, uint src_height,
            uint dest_x, uint dest_y,
            uint dest_buffer_width)
{
    bool use_single_memcpy = (dest_x == 0) && (dest_buffer_width == src_width);

    dest_buffer_width <<= 2;
    src_width <<= 2;
    dest += (dest_y * dest_buffer_width);

    if(use_single_memcpy) {
        memcpy(dest, src, src_width * src_height);
    }
    else {
        dest += (dest_x << 2);
        while (src_height--) {
            memcpy(dest, src, src_width);
            dest += dest_buffer_width;
            src += src_width;
        }
    }
}
I've also changed the loop to a countdown (which may be more efficient), removed a useless temporary variable, and lifted repeated calculations out of the loop.
It's likely that you can do even better using SSE intrinsics to copy 16 bytes at a time instead of 4, but then you'll need to worry about alignment and multiples of 4 pixels. A good memcpy implementation should already do these things.
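For example, here is a hedged sketch of a 16-bytes-at-a-time row copy with SSE2 (copy_row is a made-up helper; it uses unaligned loads/stores so it works for any dest_x, and whether it actually beats your platform's memcpy is something to benchmark):

#include <emmintrin.h>   // SSE2
#include <cstdint>
#include <cstring>

// Sketch: copy one row of `bytes` bytes, 16 at a time, falling back to
// memcpy for the tail. _mm_stream_si128 could replace the store if dst is
// 16-byte aligned and the destination won't be read again soon.
static void copy_row(std::uint8_t* dst, const std::uint8_t* src, std::size_t bytes)
{
    std::size_t i = 0;
    for (; i + 16 <= bytes; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    if (i < bytes)
        std::memcpy(dst + i, src + i, bytes - i);        // remaining < 16 bytes
}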