The short question: I have a function that takes two vectors, one input and one output (no aliasing). If I can only align one of them, which one should I choose?
The longer version: consider a function,
void func(size_t n, void *in, void *out)
{
    __m256i *in256  = reinterpret_cast<__m256i *>(in);
    __m256i *out256 = reinterpret_cast<__m256i *>(out);
    while (n >= 32) {
        __m256i data = _mm256_loadu_si256(in256++);
        // process data
        _mm256_storeu_si256(out256++, data);
        n -= 32;
    }
    // process the remaining n % 32 bytes
}
If in and out are both 32-byte aligned, there is no penalty for using vmovdqu instead of vmovdqa. The worst case is that both are unaligned, so that a large fraction of the loads and stores cross a cache-line boundary.
In that case I can align one of them to a cache-line boundary by processing a few bytes before entering the loop (see the sketch below). The question, however, is which one I should choose: between an unaligned load and an unaligned store, which is worse?
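For concreteness, here is a sketch of the kind of prologue I have in mind, assuming I choose to align out (the process_scalar helper is hypothetical and would handle the peeled-off head one byte at a time):

size_t misalign = reinterpret_cast<uintptr_t>(out) & 63;
size_t head = misalign ? 64 - misalign : 0;     // bytes until the next cache-line boundary
if (head > n) head = n;
process_scalar(in, out, head);                  // hypothetical: handle the head without SIMD
in  = static_cast<char *>(in)  + head;
out = static_cast<char *>(out) + head;
n  -= head;
// ... main 32-byte loop as above, with out now 64-byte aligned ...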
At the risk of stating the obvious: there is no "right answer" other than "you need to benchmark both with actual code and actual data". Which variant is faster depends strongly on the CPU you are using, the amount of computation you do per packet, and many other things.
As noted in the comments, you should also try non-temporal stores. What can also sometimes help is loading the input of the following packet inside the current loop, i.e.:
__m256i next = _mm256_loadu_si256(in256++);
for (...) {
    __m256i data = next;    // usually 0 cost
    next = _mm256_loadu_si256(in256++);
    // do computations and store data
}
If the calculations you are doing have unavoidable data latencies, you should also consider calculating two packages interleaved (this uses twice as many registers though).
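As a rough sketch of what I mean by interleaving (purely illustrative; whether it helps depends on the latency of your computation, and the tail still needs separate handling):

while (n >= 64) {
    __m256i a = _mm256_loadu_si256(in256++);
    __m256i b = _mm256_loadu_si256(in256++);
    // ... compute on a and b independently, so their latencies can overlap ...
    _mm256_storeu_si256(out256++, a);
    _mm256_storeu_si256(out256++, b);
    n -= 64;
}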
Related
Isn't it common knowledge that math operations on 64-bit systems run faster on 32/64-bit datatypes than on smaller datatypes like short, due to implicit promotion? Yet while testing my bitset implementation (where the majority of the time is spent on bitwise operations), I found I got a ~40% improvement using uint8_t over uint32_t. I'm especially surprised because there is hardly any copying going on that would justify the difference. The same thing occurred regardless of the clang optimisation level.
8bit:
#define mod8(x) x&7
#define div8(x) x>>3

template<unsigned long bits>
struct bitset{
private:
    uint8_t fill[8] = {};
    uint8_t clear[8];
    uint8_t band[(bits/8)+1] = {};
public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div8(ind)]&fill[mod8(ind)];
    }
    template<typename T>
    inline void store_high(const T ind){
        band[div8(ind)] |= fill[mod8(ind)];
    }
    template<typename T>
    inline void store_low(const T ind){
        band[div8(ind)] &= clear[mod8(ind)];
    }
    bitset(){
        for(uint8_t ii = 0, val = 1; ii < 8; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val*=2;
        }
    }
};
32bit:
#define mod32(x) x&31
#define div32(x) x>>5

template<unsigned long bits>
struct bitset{
private:
    uint32_t fill[32] = {};
    uint32_t clear[32];
    uint32_t band[(bits/32)+1] = {};
public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div32(ind)]&fill[mod32(ind)];
    }
    template<typename T>
    inline void store_high(const T ind){
        band[div32(ind)] |= fill[mod32(ind)];
    }
    template<typename T>
    inline void store_low(const T ind){
        band[div32(ind)] &= clear[mod32(ind)];
    }
    bitset(){
        for(uint32_t ii = 0, val = 1; ii < 32; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val*=2;
        }
    }
};
And here is the benchmark I used (it just moves a single 1 from position 0 to the end, one step per iteration):
const int len = 1000000;
bitset<len> bs;
{
    auto start = std::chrono::high_resolution_clock::now();
    bs.store_high(0);
    for (int ii = 1; ii < len; ++ii) {
        bs.store_high(ii);
        bs.store_low(ii-1);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>((stop-start)).count() << std::endl;
}
TL:DR: large "buckets" for a bitset mean you access the same one repeatedly when you iterate linearly, creating longer dependency chains that out-of-order exec can't overlap as effectively.
Smaller buckets give instruction-level parallelism, making operations on bits in separate bytes independent of each other.
One possible reason is that you iterate linearly over the bits, so all the operations within the same band[] element form one long dependency chain of &= and |= operations, plus store and reload (if the compiler doesn't manage to optimize that away with loop unrolling).
For uint32_t band[], that's a chain of 2x 32 operations, since ii>>5 will give the same index for that long.
Out-of-order exec can only partially overlap execution of these long chains if their latency and instruction count is too large for the ROB (ReOrder Buffer) and RS (Reservation Station, aka Scheduler). With 64 operations probably including store/reload latency (4 or 5 cycles on modern x86), that's a dep-chain length of probably 6 x 64 = 384 cycles, composed of probably at least 128 uops, with some parallelism for loading (or better, calculating) the 1U<<(n&31) or rotl(-1U, n&31) masks, which can "use up" some of the wasted execution slots in the pipeline.
But for uint8_t band[], you're moving to a new element 4x as frequently, after only 2x 8 = 16 operations, so the dep chains are 1/4 the length.
See also Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another case of a modern x86 CPU overlapping two long dependency chains (a simple chain of imul with no other instruction-level parallelism), especially the part about a single dep chain becoming longer than the RS (scheduler for un-executed uops) being the point at which we start to lose some of the overlap of execution of the independent work. (For the case without lfence to artificially block overlap.)
See also Modern Microprocessors: A 90-Minute Guide! and https://www.realworldtech.com/sandy-bridge/ for some background on how modern OoO exec CPUs decode and look at instructions.
Small vs. large buckets
Large buckets are only useful when scanning through for the first non-zero bit, or filling the whole thing or something. Of course, really you'd want to vectorize that with SIMD, checking 16 or 32 bytes at once to see if there's a non-zero element anywhere in that. Current compilers will vectorize for you in loops that fill the whole array, but not search loops (or anything with a trip-count that can't be calculated ahead of the first iteration), except for ICC which can handle that. Re: using fast operations over bit-vectors, see Howard Hinnant's article (in the context of vector<bool>, which is an unfortunate name for a sometimes-useful data structure.)
C++ unfortunately doesn't make it easy in general to use different sized accesses to the same data, unless you compile with g++ -O3 -fno-strict-aliasing or something like that.
Although unsigned char can always alias anything else, so you could use that for your single-bit accesses, only using uintptr_t (which is likely to be as wide as a register, except on ILP32-on-64-bit ISAs) for init or whatever. Or in this case, uint_fast32_t is a 64-bit type on many x86-64 C++ implementations, which would make it useful here, unlike the usual situation where that's a drawback: wasting cache footprint when you only need the value-range of a 32-bit number, and being slower for non-constant division on some CPUs.
On x86 CPUs, a byte store is naturally fully efficient, but even on an ARM or something, coalescing in the store buffer could still make adjacent byte RMWs fully efficient. (Are there any modern CPUs where a cached byte store is actually slower than a word store?). And you'd still gain ILP; a slower commit to cache is still not as bad as coupling loads to stores that could have been independent if narrower. This is especially important on lower-end CPUs with smaller out-of-order scheduler buffers.
(x86 byte loads need to use movzx to zero-extend to avoid false dependencies, but most compilers know that. Clang is reckless about it which can occasionally hurt.)
(Different sized accesses close to each other can lead to store-forwarding stalls, e.g. a byte store and an unsigned long reload that overlaps that byte will have extra latency: What are the costs of failed store-to-load forwarding on x86?)
Code review:
Storing an array of masks is probably worse than just computing 1u << (n & 31) as needed, on most CPUs. If you're really lucky, a smart compiler might manage constant propagation from the constructor into the benchmark loop, and realize that it can rotate or shift inside the loop to generate the bitmask, instead of indexing memory in a loop that already does other memory operations.
(Some non-x86 ISAs have better bit-manipulation instructions and can materialize 1<<n cheaply, although x86 can do that in 2 instructions as well if compilers are smart. xor eax,eax / bts eax, esi, with the BTS implicitly masking the shift count by the operand-size. But that only works so well for 32-bit operand-size, not 8-bit. Without BMI2 shlx, x86 variable-count shifts run as 3-uops on Intel CPUs, vs. 1 on AMD.)
Almost certainly not worth it to store both fill[] and clear[] constants. Some ISAs even have an andn instruction that can NOT one of the operands on the fly, i.e. implements (~x) & y in one instruction. For example, x86 with BMI1 extensions has andn. (gcc -march=haswell).
Also, your macros are unsafe: wrap the expression in () so operator precedence doesn't bite you if you write foo[div8(x) - 1].
As in #define div8(x) ((x)>>3)
But really, you shouldn't be using CPP macros for stuff like this anyway. Even in modern C, just define shift counts and masks as constants, e.g. static const int shift = 3;. In C++, do that inside the struct/class scope, and use band[idx >> shift] or something. (When I was typing ind, my fingers wanted to type int; idx is probably a better name.)
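Putting those points together, here is a sketch of what I'd write instead (untested, keeping your operator[]/store_high/store_low interface, computing the masks on the fly instead of loading them from fill[]/clear[]):

#include <cstdint>

template<unsigned long bits>
struct bitset {
private:
    static const unsigned shift = 3;               // log2 of bits per bucket
    static const unsigned mask  = 7;               // bits per bucket - 1
    uint8_t band[(bits >> shift) + 1] = {};
public:
    template<typename T>
    bool operator[](T idx) const {
        return (band[idx >> shift] >> (idx & mask)) & 1;
    }
    template<typename T>
    void store_high(T idx) {
        band[idx >> shift] |= uint8_t(1u << (idx & mask));     // mask computed, not loaded
    }
    template<typename T>
    void store_low(T idx) {
        band[idx >> shift] &= uint8_t(~(1u << (idx & mask)));  // ~mask computed on the fly
    }
};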
Isn't it common knowledge that math operations on 64-bit systems run faster on 32/64-bit datatypes than on smaller datatypes like short, due to implicit promotion?
This isn't a universal truth. As always, it depends on the details.
Why does this piece of code written using uint_8 run faster than analogous code written with uint_32 or uint_64 on a 64bit machine?
The title doesn't match the question. There are no such types as uint_X, and you aren't using uintX_t. You are using uint_fastX_t. uint_fastX_t is an alias for an integer type that is at least X bits wide and that is deemed by the language implementers to provide the fastest operations.
If we were to take your earlier-mentioned assumption for granted, it should logically follow that the language implementers would have chosen a 32/64-bit type as uint_fast8_t. That said, you cannot assume that they have done so, and whatever generic measurement (if any) was used to make that choice doesn't necessarily apply to your case.
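If you want to know what you actually got on your implementation, it's easy to check; for example (on x86-64 glibc this typically prints 1 and 8, but that's a library choice, not a guarantee):

#include <cstdint>
#include <cstdio>

int main() {
    // How wide are the "fast" types on this implementation?
    std::printf("uint_fast8_t : %zu bytes\n", sizeof(uint_fast8_t));
    std::printf("uint_fast32_t: %zu bytes\n", sizeof(uint_fast32_t));
}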
That said, regardless of which type uint_fast8_t is an alias of, your test isn't fair for comparing the relative speeds of calculation of potentially different integer types:
uint_fast8_t fill[8] = {};
uint_fast8_t clear[8];
uint_fast8_t band[(bits/8)+1] = {};

uint_fast32_t fill[32] = {};
uint_fast32_t clear[32];
uint_fast32_t band[(bits/32)+1] = {};
Not only are the types (potentially) different, but the sizes of the arrays are too. This can certainly have an effect on the efficiency.
If we have an array of int pointers that all point to the same int, and we loop over it doing a ++ operation on the pointed-to value, it'll be 100% slower than when the pointers point to two different ints. Here is a concrete example:
int* data[2];
int a, b;
a = b = 0;

for (auto i = 0ul; i < 2; ++i) {
    // Case 3: 2.5 sec
    data[i] = &a;

    // Case 2: 1.25 sec
    // if (i & 1)
    //     data[i] = &a;
    // else
    //     data[i] = &b;
}

for (auto i = 0ul; i < 1000000000; ++i) {
    // Case 1: 0.5 sec
    // asm volatile("" : "+g"(i)); // deoptimize
    // ++*data[0];
    ++*data[i & 1];
}
In summary, the observed timings are (the comments describe the loop body):
case 1 (fast): ++*pointer[0]
case 2 (medium): ++*pointer[i], with half of the pointers pointing to one int and the other half pointing to another int.
case 3 (slow): ++*pointer[i], with all pointers pointing to the same int.
Here are my current thoughts. Case 1 is fast because the modern CPU knows we read/write the same memory location, and thus buffers the operation, while in Case 2 and Case 3 we need to write the result out in each iteration. The reason Case 3 is slower than Case 2 is that when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.
Do I understand it correctly? Is there any way to make Case 3 faster without changing the pointer array? (perhaps adding some CPU hints?)
The question is extracted from the real problem https://github.com/ClickHouse/ClickHouse/pull/7550
You've discovered one of the effects that causes bottlenecks in histograms. A workaround for that problem is to keep multiple arrays of counters and rotate through them, so repeated runs of the same index are distributed over 2 or 4 different counters in memory.
(Then loop over the arrays of counters to sum them down into one final set of counts. This part can benefit from SIMD.)
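A sketch of that idea, for a plain byte histogram (the 4-way split is arbitrary; the right number depends on typical run lengths and on how many counter sets fit in cache):

#include <cstdint>
#include <cstddef>

// Four interleaved sets of counters: consecutive hits on the same bucket go to
// different memory locations, breaking the store/reload dependency chain.
void histogram(const uint8_t *data, size_t n, uint32_t counts[256]) {
    uint32_t tmp[4][256] = {};
    for (size_t i = 0; i < n; ++i)
        tmp[i & 3][data[i]]++;                 // rotate through the 4 counter sets
    for (size_t b = 0; b < 256; ++b)           // sum them down into the final counts
        counts[b] = tmp[0][b] + tmp[1][b] + tmp[2][b] + tmp[3][b];
}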
Case 1 is fast because the modern CPU knows we read/write the same memory location, and thus buffers the operation
No, it's not the CPU, it's a compile-time optimization.
++*pointer[0] is fast because the compiler can hoist the store/reload out of the loop and actually just increment a register. (If you don't use the result, it might optimize away even that.)
Assumption of no data-race UB lets the compiler assume that nothing else is modifying pointer[0] so it's definitely the same object being incremented every time. And the as-if rule lets it keep *pointer[0] in a register instead of actually doing a memory-destination increment.
So that means 1 cycle latency for the increment, and of course it can combine multiple increments into one and do *pointer[0] += n if it fully unrolls and optimizes away the loop.
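In other words, the compiler is allowed to transform the Case-1 loop into something like this (a sketch of the effective result, not actual compiler output):

int tmp = *data[0];                    // load once, before the loop
for (auto i = 0ul; i < 1000000000; ++i)
    ++tmp;                             // 1-cycle register increment, no memory access
*data[0] = tmp;                        // store once, after the loop
// ... and with nothing stopping it, the whole loop can fold into *data[0] += 1000000000;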
when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.
Yes, the data dependency through that memory location is the problem. Without knowing at compile time that the pointers all point to the same place, the compiler will make asm that does actually increment the pointed-to memory location.
"wait for the write to finish" isn't strictly accurate, though. The CPU has a store buffer to decouple store execution from cache misses, and out-of-order speculative exec from stores actually committing to L1d and being visible to other cores. A reload of recently-stored data doesn't have to wait for it to commit to cache; store forwarding from the store-buffer to a reload is a thing once the CPU detects it.
On modern Intel CPUs, store-forwarding latency is about 5 cycles, so a memory-destination add has 6-cycle latency. (1 for the add, 5 for the store/reload if it's on the critical path.)
And yes, out-of-order execution lets two of these 6-cycle-latency dependency chains run in parallel. And the loop overhead is hidden under that latency, again by OoO exec.
Related:
Store-to-Load Forwarding and Memory Disambiguation in x86 Processors on stuffedcow.net
Store forwarding Address vs Data: What the difference between STD and STA in the Intel Optimization guide?
How does store to load forwarding happens in case of unaligned memory access?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
Why is execution time of a process shorter when another process shares the same HT core (On Sandybridge-family, store-forwarding latency can be reduced if you don't try to reload right away.)
Is there any way to make Case 3 faster without changing the pointer array?
Yes, if that case is expected, maybe branch on it:
int *current_pointer = pointer[0];
int repeats = 1;
...
for (size_t i = 1; i < n; ++i) {    // n = number of entries in pointer[]
    if (pointer[i] == current_pointer) {
        repeats++;
    } else {
        *current_pointer += repeats;
        current_pointer = pointer[i];
        repeats = 1;
    }
}
*current_pointer += repeats;        // flush the final run
We optimize by counting a run-length of repeating the same pointer.
This is totally defeated by Case 2 and will perform poorly if long runs are not common.
Short runs can be hidden by out-of-order exec; only when the dep chain becomes long enough to fill the ROB (reorder buffer) do we actually stall.
Assume that we have 2^10 CUDA cores and 2^20 data points. I want a kernel that will process these points and will provide true/false for each of them. So I will have 2^20 bits. Example:
bool f(x) { return x % 2 ? true : false; }

void kernel(int* input, byte* output)
{
    tidx = thread.x ...
    output[tidx] = f(input[tidx]);

    ...or...

    sharedarr[tidx] = f(input[tidx]);
    sync()
    output[blockidx] = reduce(sharedarr);

    ...or...

    atomic_result |= f(input[tidx]) << tidx;
    sync(..)
    output[blockidx] = atomic_result;
}
Thrust/CUDA provides algorithms such as "partitioning" and "transformation" that offer similar alternatives.
My question is: when I write the corresponding CUDA kernel with a predicate that produces the bool result,
should I use one byte per result and store it directly into the output array, doing one pass for the calculation and another pass for the reduction/partitioning later?
should I compact the output in shared memory, using one byte for 8 threads, and then at the end write the result from shared memory to the output array?
should I use atomic variables?
What's the best way to write such a kernel, and what is the most logical data structure for keeping the results? Is it better to use more memory and simply do more writes to main memory, instead of trying to compact the result before writing it back to the result memory area?
There is no tradeoff between speed and data size when using the __ballot() warp-voting intrinsic to efficiently pack the results.
Assuming that you can redefine output to be of uint32_t type, and that your block size is a multiple of the warp size (32), you can simply store the packed output using
output[tidx / warpSize] = __ballot(f(input[tidx]));
Note this makes all threads of the warp try to store the result of __ballot(). Only one thread of the warp will succeed, but as their results are all identical, it does not matter which one will.
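Conceptually, __ballot() computes the equivalent of the following scalar loop across the 32 lanes of a warp (illustrative host-side C++, not device code):

#include <cstdint>

// Each lane contributes one bit; every lane of the warp receives the same packed word.
uint32_t ballot_emulated(const bool pred[32]) {
    uint32_t packed = 0;
    for (int lane = 0; lane < 32; ++lane)
        if (pred[lane])
            packed |= 1u << lane;
    return packed;
}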
I'm having an issue with the following code, and I fail to understand where the problem is. The problem occurs, however, only with a v2 Intel processor and not with v3.
Consider the following code in C++:
struct Tuple {
    size_t _a;
    size_t _b;
    size_t _c;
    size_t _d;
    size_t _e;
    size_t _f;
    size_t _g;
    size_t _h;
};

void
deref_A(Tuple& aTuple, const size_t& aIdx) {
    aTuple._a = A[aIdx];
}

void
deref_AB(Tuple& aTuple, const size_t& aIdx) {
    aTuple._a = A[aIdx];
    aTuple._b = B[aIdx];
}

void
deref_ABC(Tuple& aTuple, const size_t& aIdx) {
    aTuple._a = A[aIdx];
    aTuple._b = B[aIdx];
    aTuple._c = C[aIdx];
}

....

void
deref_ABCDEFG(Tuple& aTuple, const size_t& aIdx) {
    aTuple._a = A[aIdx];
    aTuple._b = B[aIdx];
    aTuple._c = C[aIdx];
    aTuple._d = D[aIdx];
    aTuple._e = E[aIdx];
    aTuple._f = F[aIdx];
    aTuple._g = G[aIdx];
}
Note that A, B, C, ..., G are simple global arrays filled with integers.
The deref_* methods simply assign values from the arrays (accessed via the index aIdx) to the fields of the given struct parameter aTuple. I start by assigning to a single field of the struct and continue all the way up to all fields; that is, each method assigns one more field than the previous one. The deref_* methods are called with the index aIdx running from 0 to the maximum size of the arrays (all arrays have the same size). The index is used to access the array elements, as shown in the code -- pretty simple.
Now consider the graph (http://docdro.id/AUSil1f), which depicts the performance for array sizes starting at 20 million size_t (8-byte) integers, up to 24 million (the x-axis denotes the array size).
For arrays with 21 million integers, the performance degrades for the methods touching at least 5 different arrays (i.e., deref_ABCDE through deref_ABCDEFG), hence the peaks in the graph. The performance then improves again for arrays of 22 million integers and onwards. I'm wondering why this happens only for an array size of 21 million? It happens only when I test on a server with an Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, but not with Haswell, i.e., v3. Apparently this is a known issue to Intel that has been resolved, but I don't know what it is or how to improve the code for v2.
I would highly appreciate any hint from your side.
I suspect you might be seeing cache-bank conflicts. Sandybridge/Ivybridge (Xeon Exxxx v1/v2) have them, Haswell (v3) doesn't.
Update from the OP: it turned out to be DTLB misses, not cache-bank conflicts. Cache bank conflicts will usually only be an issue when your working set fits in cache: being limited to one 8B read per clock instead of 2 shouldn't stop a CPU from keeping up with main memory, even single-threaded. (8B * 3GHz = 24GB/s, which is about equal to main-memory sequential-read bandwidth.)
I think there's a perf counter for that, which you can check with perf or other tools.
Quoting Agner Fog's microarchitecture doc (section 9.13):
Cache bank conflicts
Each consecutive 128 bytes, or two cache lines, in the data cache is divided into 8 banks of 16 bytes each. It is not possible to do two memory reads in the same clock cycle if the two memory addresses have the same bank number, i.e. if bit 4 - 6 in the two addresses are the same.
; Example 9.5. Sandy bridge cache
mov eax, [rsi] ; Use bank 0, assuming rsi is divisible by 40H
mov ebx, [rsi+100H] ; Use bank 0. Cache bank conflict
mov ecx, [rsi+110H] ; Use bank 1. No cache bank conflict
Changing the total size of your arrays changes the distance between two elements with the same index, if they're laid out more or less head to tail.
If you have each array aligned to a different 16B offset (modulo 128), this will help some for SnB/IvB. Access to the same index in each array will be in a different cache bank, and thus can happen in parallel. Achieving this can be as simple as allocating 128B-aligned arrays, with 16*n extra bytes at the start of each one. (Keeping track of the pointer to eventually free separately from the pointer to dereference will be an annoyance.)
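A sketch of that allocation scheme (assuming arrays of size_t as in the question; the names are mine, and the base/arr bookkeeping is the annoyance mentioned above):

#include <cstdint>
#include <cstddef>

// Give array k a 128-byte-aligned base plus a 16*k byte stagger, so the same
// index in different arrays lands in a different cache bank on SnB/IvB.
struct StaggeredArray {
    char   *base;   // pointer to pass to delete[] later
    size_t *arr;    // staggered, usable pointer
};

StaggeredArray make_staggered(size_t nelems, unsigned k) {
    size_t bytes = nelems * sizeof(size_t) + 128 + 16 * k;
    char *base = new char[bytes];
    uintptr_t p = reinterpret_cast<uintptr_t>(base);
    p = (p + 127) & ~uintptr_t(127);   // round up to a 128-byte boundary
    p += 16 * k;                       // per-array stagger of 16*k bytes
    return { base, reinterpret_cast<size_t *>(p) };
}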
If the tuple where you're writing the results has the same address as a read, modulo 4096, you also get a false dependence. (i.e. a read from one of the arrays might have to wait for a store to the tuple.) See Agner Fog's doc for the details on that. I didn't quote that part because I think cache-bank conflicts are the more likely explanation. Haswell still has the false-dependence issue, but the cache-bank conflict issue is completely gone.
I have image buffers of an arbitrary size that I copy into equal-sized or larger buffers at an x,y offset. The colorspace is BGRA. My current copy method is:
void render(guint8* src, guint8* dest, uint src_width, uint src_height,
            uint dest_x, uint dest_y, uint dest_buffer_width) {
    bool use_single_memcpy = (dest_x == 0) && (dest_y == 0) && (dest_buffer_width == src_width);

    if (use_single_memcpy) {
        memcpy(dest, src, src_width * src_height * 4);
    }
    else {
        dest += (dest_y * dest_buffer_width * 4);
        for (uint i = 0; i < src_height; i++) {
            memcpy(dest + (dest_x * 4), src, src_width * 4);
            dest += dest_buffer_width * 4;
            src += src_width * 4;
        }
    }
}
It runs fast but I was curious if there was anything I could do to improve it and gain a few extra milliseconds. If it involves going to assembly code I'd prefer to avoid that, but I'm willing to add additional libraries.
One popular answer on Stack Overflow that does use x86-64 assembly and SSE can be found here: Very fast memcpy for image processing?. If you do use this code, you'll need to make sure your buffers are 128-bit (16-byte) aligned. A basic explanation of that code:
Non-temporal stores are used, so unnecessary cache writes can be bypassed and writes to main memory can be combined.
Reads and writes are interleaved in only very large chunks (doing many reads and then many writes). Performing many reads back-to-back typically has better performance than single read-write-read-write patterns.
Much larger registers are used (128 bit SSE registers).
Prefetch instructions are included as hints to the CPU's pipelining.
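As a rough idea of what the core of such a copy looks like with SSE intrinsics, here is a simplified sketch (assuming both pointers are 16-byte aligned and the size is a multiple of 64 bytes; the linked answer adds prefetching and works in much larger blocks):

#include <emmintrin.h>   // SSE2
#include <cstddef>
#include <cstdint>

// Non-temporal copy: read 64 bytes, then write them with streaming stores
// that bypass the cache. A real implementation also handles unaligned tails.
void copy_nt(const uint8_t *src, uint8_t *dst, size_t bytes) {
    for (size_t i = 0; i < bytes; i += 64) {
        __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(src + i));
        __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(src + i + 16));
        __m128i c = _mm_load_si128(reinterpret_cast<const __m128i *>(src + i + 32));
        __m128i d = _mm_load_si128(reinterpret_cast<const __m128i *>(src + i + 48));
        _mm_stream_si128(reinterpret_cast<__m128i *>(dst + i),      a);
        _mm_stream_si128(reinterpret_cast<__m128i *>(dst + i + 16), b);
        _mm_stream_si128(reinterpret_cast<__m128i *>(dst + i + 32), c);
        _mm_stream_si128(reinterpret_cast<__m128i *>(dst + i + 48), d);
    }
    _mm_sfence();   // make the streaming stores globally visible before returning
}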
I found this document - Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540 - which seems to be the inspiration for the code in that linked answer, albeit targeting older processor generations; it does contain a significant amount of discussion of how the technique works.
For instance, consider this discussion on write-combining / non-temporal stores:
The Pentium II and III CPU caches operate on 32-byte cache-line sized blocks. When data is written to or read from (cached) memory, entire cache lines are read or written. While this generally enhances CPU-memory performance, under some conditions it can lead to unnecessary data fetches. In particular, consider a case where the CPU will do an 8-byte MMX register store: movq. Since this is only one quarter of a cache line, it will be treated as a read-modify-write operation from the cache's perspective; the target cache line will be fetched into cache, then the 8-byte write will occur. In the case of a memory copy, this fetched data is unnecessary; subsequent stores will overwrite the remainder of the cache line. The read-modify-write behavior can be avoided by having the CPU gather all writes to a cache line then doing a single write to memory. Coalescing individual writes into a single cache-line write is referred to as write combining.
Write combining takes place when the memory being written to is explicitly marked as write combining (as opposed to cached or uncached), or when the MMX non-temporal store instruction is used. Memory is generally marked write combining only when it is used in frame buffers; memory allocated by VirtualAlloc is either uncached or cached (but not write combining). The MMX movntps and movntq non-temporal store instructions instruct the CPU to write the data directly to memory, bypassing the L1 and L2 caches. As a side effect, it also enables write combining if the target memory is cached.
If you'd prefer to stick with memcpy, consider investigating the source code of the memcpy implementation you're using. Some memcpy implementations look for native-word-aligned buffers to improve performance by using the full register size; others will automatically copy as much as possible using native-word-aligned accesses and then mop up the remainder. Making sure your buffers are 8-byte aligned will facilitate these mechanisms.
Some memcpy implementations contain a ton of up-front conditionals to make them efficient for small buffers (< 512 bytes) - you may want to consider copy-pasting the code with those chunks ripped out, since you're presumably not working with small buffers.
Your use_single_memcpy test is too restrictive. A slight rearrangement allows you to remove the dest_y == 0 requirement.
void render(guint8* src, guint8* dest,
            uint src_width, uint src_height,
            uint dest_x, uint dest_y,
            uint dest_buffer_width)
{
    bool use_single_memcpy = (dest_x == 0) && (dest_buffer_width == src_width);
    dest_buffer_width <<= 2;
    src_width <<= 2;
    dest += (dest_y * dest_buffer_width);

    if (use_single_memcpy) {
        memcpy(dest, src, src_width * src_height);
    }
    else {
        dest += (dest_x << 2);
        while (src_height--) {
            memcpy(dest, src, src_width);
            dest += dest_buffer_width;
            src += src_width;
        }
    }
}
I've also changed the loop to a countdown (which may be more efficient), removed a useless temporary variable, and lifted repeated calculations out of the loop.
It's likely that you can do even better using SSE intrinsics to copy 16 bytes at a time instead of 4, but then you'll need to worry about alignment and multiples of 4 pixels. A good memcpy implementation should already do these things.