Objective difference between register and pointer in AVX instructions - c++

Scenario: You are writing a complex algorithm using SIMD. A handful of constants and/or infrequently changing values are used. Ultimately, the algorithm ends up needing more than the 16 available ymm registers, so some values become memory operands (e.g. the generated code contains vaddps ymm0,ymm1,ymmword ptr [...] instead of vaddps ymm0,ymm1,ymm7).
In order to make the algorithm fit into the available registers, the constants can be "inlined". For example:
const auto pi256{ _mm256_set1_ps(PI) };
for (outer condition)
{
    ...
    const auto radius_squared{ _mm256_mul_ps(radius, radius) };
    ...
    for (inner condition)
    {
        ...
        const auto area{ _mm256_mul_ps(radius_squared, pi256) };
        ...
    }
}
... becomes ...
for (outer condition)
{
    ...
    for (inner condition)
    {
        ...
        const auto area{ _mm256_mul_ps(_mm256_mul_ps(radius, radius), _mm256_set1_ps(PI)) };
        ...
    }
}
Whether the disposable variable in question is a constant, or is infrequently calculated (calculated in the outer loop), how can one determine which approach achieves the best throughput? Is it a matter of some concept like "ptr adds 2 extra latency"? Or is it nondeterministic such that it differs on a case-by-case basis and can only be fully optimized through trial-and-error + profiling?

A good optimizing compiler should generate the same machine code for both versions. Just define your vector constants as locals, or use them anonymously for maximum readability; let the compiler worry about register allocation and pick the least expensive way to deal with running out of registers if that happens.
Your best bet for helping the compiler is to use fewer different constants if possible. e.g. instead of _mm_and_si128 with both set1_epi16(0x00FF) and set1_epi16(0xFF00), use _mm_andnot_si128 to mask the other way. You usually can't do anything to influence which things it chooses to keep in registers vs. not, but fortunately compilers are pretty good at this because it's also essential for scalar code.
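For example, a minimal sketch of that trick (assuming SSE2; the helper name is made up for illustration):

#include <emmintrin.h>

// Split 16-bit elements into their low and high bytes with a single constant:
// _mm_andnot_si128(mask, v) computes (~mask) & v, so one mask serves both sides.
static inline __m128i split_bytes(__m128i v, __m128i* hi_out) {
    const __m128i low_mask = _mm_set1_epi16(0x00FF);
    *hi_out = _mm_andnot_si128(low_mask, v);   // high bytes (left in place)
    return _mm_and_si128(v, low_mask);         // low bytes
}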
A compiler will hoist constants out of the loop (even inlining a helper function containing constants), or if only used in one side of a branch, bring the setup into that side of the branch.
The source code computes exactly the same thing with no difference in visible side-effects, so the as-if rule allows the compiler the freedom to do this.
I think compilers normally do register allocation and choose what to spill/reload (or just use a read-only vector constant) after doing CSE (common subexpression elimination) and identifying loop invariants and constants that can be hoisted.
When it finds it doesn't have enough registers to keep all variables and constants in regs inside the loop, the first choice for something to not keep in a register would normally be a loop-invariant vector, either a compile-time constant or something computed before the loop.
An extra load that hits in L1d cache is cheaper than storing (aka spilling) / reloading a variable inside the loop. Thus, compilers will choose to load constants from memory regardless of where you put the definition in the source code.
Part of the point of writing in C++ is that you have a compiler to make this decision for you. Since it's allowed to do the same thing for both sources, doing different things would be a missed-optimization for at least one of the cases. (The best thing to do in any particular case depends on surrounding code, but normally using vector constants as memory source operands is fine when the compiler runs low on regs.)
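For instance, the following two functions would normally compile to the same code at -O2 (a minimal sketch, assuming AVX and a hypothetical PI constant; if the compiler runs low on registers, it can use the constant as a memory source operand in either version):

#include <immintrin.h>

constexpr float PI = 3.14159265f;

__m256 area_named(__m256 radius) {
    const __m256 pi256 = _mm256_set1_ps(PI);   // named local constant
    return _mm256_mul_ps(_mm256_mul_ps(radius, radius), pi256);
}

__m256 area_anonymous(__m256 radius) {
    // same computation with the constant used anonymously
    return _mm256_mul_ps(_mm256_mul_ps(radius, radius), _mm256_set1_ps(PI));
}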
Is it a matter of some concept like "ptr adds 2 extra latency"?
Micro-fusion of a memory source operand doesn't lengthen the critical path from the non-constant input to the output. The load uop can start as soon as the address is ready, and for vector constants it's usually either a RIP-relative or [rsp+constant] addressing mode. So usually the load is ready to execute as soon as it's issued into the out-of-order part of the core. Assuming an L1d cache hit (since it will stay hot in cache if loaded every loop iteration), this is only ~5 cycles, so it will easily be ready in time if there's a dependency-chain bottleneck on the vector register input.
It doesn't even hurt front-end throughput. Unless you're bottlenecked on load-port throughput (2 loads per clock on modern x86 CPUs), it typically makes no difference. (Even with highly accurate measurement techniques.)

Related

Loading an entire cache line at once to avoid contention for multiple elements of it

Assuming that there are three pieces of data that I need from a heavily contended cache line, is there a way to load all three things "atomically" so as to avoid more than one roundtrip to any other core?
I don't actually need a correctness guarantee of atomicity for a snapshot of all 3 members, just in the normal case that all three items are read in the same clock cycle. I want to avoid the case where the cache line arrives, but then an invalidate request comes in before all 3 objects are read. That would result in the 3rd access needing to send another request to share the line, making contention even worse.
For example,
class alignas(std::hardware_destructive_interference_size) Something {
    std::atomic<uint64_t> one;
    std::uint64_t two;
    std::uint64_t three;
};

void bar(std::uint64_t, std::uint64_t, std::uint64_t);

void f1(Something& something) {
    auto one = something.one.load(std::memory_order_relaxed);
    auto two = something.two;
    if (one == 0) {
        bar(one, two, something.three);
    } else {
        bar(one, two, 0);
    }
}

void f2(Something& something) {
    while (true) {
        baz(something.one.exchange(...));
    }
}
Can I somehow ensure that one, two and three all get loaded together without multiple RFOs under heavy contention (assume f1 and f2 are running concurrently)?
The target architecture / platform for the purposes of this question is Intel x86 Broadwell, but if there is a technique or compiler intrinsic that allows doing something best-effort like this somewhat portably, that would be great as well.
Terminology: a load won't generate an RFO, because it doesn't need ownership. It only sends a request to share the data. Multiple cores can be reading from the same physical address in parallel, each with a copy of it hot in their L1d cache.
Other cores writing the line will send RFOs which invalidate the shared copy in our cache, though, and yes that could come in after reading one or two elements of a cache line before all have been read. (I updated your question with a description of the problem in those terms.)
Hadi's SIMD load is a good idea to grab all the data with one instruction.
As far as we know, _mm_load_si128() is in practice atomic for its 8-byte chunks, so it can safely replace the .load(mo_relaxed) of the atomic. But see Per-element atomicity of vector load/store and gather/scatter? - there's no clear written guarantee of this.
If you used _mm256_loadu_si256(), beware of GCC's default tuning -mavx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? So that's another good reason to use an aligned load, besides needing to avoid a cache-line split.
But we're writing in C, not asm, so we need to worry about some of the other things that std::atomic with mo_relaxed does: specifically that repeated loads from the same address might not give the same value. You probably need to dereference a volatile __m256i* to roughly simulate what load(mo_relaxed) does.
You can use atomic_thread_fence() if you want stronger ordering; I think in practice C++11 compilers that support Intel intrinsics will order volatile dereferences wrt. fences the same way as std::atomic loads/stores. In ISO C++, volatile objects are still subject to data-race UB, but in real implementations that can for example compile a Linux kernel, volatile can be used for multi-threading. (Linux rolls its own atomics with volatile and inline asm, and this is I think considered supported behaviour by gcc/clang.) Given what volatile actually does (object in memory matches the C++ abstract machine), it basically just automatically works, despite any rules-lawyer concerns that it's technically UB. It's UB that compilers can't know or care about because that's the whole point of volatile.
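A rough sketch of that volatile-dereference idea (assuming GCC/clang, where __m256i is a built-in vector type; MSVC may need a different approach):

#include <immintrin.h>

static inline __m256i volatile_load_256(const __m256i* p) {
    // roughly imitates a relaxed atomic load: the compiler can't cache or
    // hoist this access away, but no ordering stronger than that is implied
    return *reinterpret_cast<const volatile __m256i*>(p);
}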
In practice there's good reason to believe that entire aligned 32-byte loads/stores on Haswell and later are atomic. Certainly for reading from L1d into the out-of-order backend, but also even for transferring cache lines between cores. (e.g. multi-socket K10 can tear on 8-byte boundaries with HyperTransport, so this really is a separate issue). The only problem for taking advantage of it is the lack of any written guarantee or CPU-vendor-approved way to detect this "feature".
Other than that, for portable code it could help to hoist auto three = something.three; out of the branch; a branch mispredict gives the core much more time to invalidate the line before the 3rd load.
Compilers will probably not respect that source change, though, and will only load it in the case that needs it. Branchless code, however, would always load it, so maybe we should encourage that with
bar(one, two, one == 0 ? something.three : 0);
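Putting the hoisting and the branchless selection together, a hedged sketch of f1 using the question's declarations (the compiler may still rearrange this, as noted above):

void f1(Something& something) {
    auto one   = something.one.load(std::memory_order_relaxed);
    auto two   = something.two;
    auto three = something.three;          // loaded unconditionally, next to the others
    bar(one, two, one == 0 ? three : 0);   // branchless selection of the third argument
}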
Broadwell can run 2 loads per clock cycle (like all mainstream x86 since Sandybridge and K8); uops typically execute in oldest-ready-first order so it's likely (if this load did have to wait for data from another core) that our 2 load uops will execute in the first cycle possible after the data arrives.
The 3rd load uop will hopefully run in the cycle after that, leaving a very small window for an invalidate to cause a problem.
Or on CPUs with only 1-per clock loads, still having all 3 loads adjacent in the asm reduces the window for invalidations.
But if one == 0 is rare, then three often isn't needed at all, so unconditional loading brings a risk of unnecessary requests for it. So you have to consider that tradeoff when tuning, if you can't cover all the data with one SIMD load.
As discussed in comments, software prefetch could potentially help to hide some of the inter-core latency.
But you have to prefetch much later than you would for a normal array, so finding places in your code that are often running ~50 to ~100 cycles before f1() is called is a hard problem and can "infect" a lot of other code with details unrelated to its normal operation. And you need a pointer to the right cache line.
You need the PF to be late enough that the demand load happens a few (tens of) cycles before the prefetched data actually arrives. This is the opposite of the normal use-case, where L1d is a buffer to prefetch into and hold data from completed prefetches before demand-loads get to them. But you want load_hit_pre.sw_pf perf events (load hit prefetch), because that means the demand load happened while the data was still in flight, before there's any chance of it being invalidated.
That means tuning is even more brittle and difficult than usual, because instead of a nearly-flat sweet spot for prefetch distance where earlier or later doesn't hurt, earlier hides more latency right up until the point where it allows invalidations, so it's a slope all the way up to a cliff. (And any too-early prefetches just make overall contention even worse.)
As long as the size of std::atomic<uint64_t> is at most 16 bytes (which is the case in all major compilers), the total size of one, two, and three does not exceed 32 bytes. Therefore, you can define a union of __m256i and Something where the Something field is aligned to 32 bytes to ensure that it is fully contained within a single 64-byte cache line. To load all three values at the same time, you can use a single 32-byte AVX load. The corresponding compiler intrinsic is _mm256_load_si256, which causes the compiler to emit the VMOVDQA ymm1, m256 instruction. This instruction decodes to a single load uop on Intel Haswell and later.
The 32-byte alignment is really only needed to ensure that all of the fields are contained within a 64-byte cache line. However, _mm256_load_si256 requires the specified memory address to be 32-byte aligned. Alternatively, _mm256_loadu_si256 could be used instead in case the address is not 32-byte aligned.
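A minimal sketch of that union-plus-single-load idea (names are illustrative; assumes AVX, that sizeof(std::atomic<uint64_t>) == 8, and that one, two and three occupy the first 24 bytes of the struct; the per-8-byte-element atomicity of the 32-byte load is an in-practice observation, not a documented guarantee):

#include <immintrin.h>
#include <atomic>
#include <cstdint>
#include <new>

struct alignas(std::hardware_destructive_interference_size) Something {
    std::atomic<std::uint64_t> one;
    std::uint64_t two;
    std::uint64_t three;
};

union SharedLine {                 // hypothetical wrapper: &line.vec is 32-byte aligned
    Something something;
    __m256i   vec;
};

void bar(std::uint64_t, std::uint64_t, std::uint64_t);

void f1(SharedLine& line) {
    // one aligned 32-byte load grabs one, two, three (plus 8 bytes of padding);
    // combine with the volatile trick above if repeated fresh loads are needed
    __m256i v = _mm256_load_si256(&line.vec);

    alignas(32) std::uint64_t tmp[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(tmp), v);
    bar(tmp[0], tmp[1], tmp[0] == 0 ? tmp[2] : 0);
}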

Understanding how the intrinsic functions for SSE use memory

Before I ask my question, just a little background information.
In C-family languages, when you assign to a variable, you can conceptually assume you just modified a little piece of memory in RAM.
int a = rand(); // conceptually, you created and assigned variable a in RAM
In assembly language, to do the same thing, you essentially need the result of rand() stored in a register, and a pointer to "a". You would then do a store instruction to get the register content into RAM.
When you program in C++, for example, and you assign and manipulate value-type objects, you usually don't even have to think about their addresses or how or when they will be stored in registers.
Using SSE intrinsics is strange because they appear to sit somewhere in between coding in C and assembly, in terms of the conceptual memory model.
You can call load/store functions, and they return objects. A math operation like _mm_add will return an object, yet it's unclear to me whether the result will actually be stored in the object unless you call _mm_store.
Consider the following example:
inline void block(float* y, const float* x) const {
    // load 4 data elements at a time
    __m128 X = _mm_loadu_ps(x);
    __m128 Y = _mm_loadu_ps(y);
    // do the computations
    __m128 result = _mm_add_ps(Y, _mm_mul_ps(X, _mm_set1_ps(a)));
    // store the results
    _mm_storeu_ps(y, result);
}
There are a lot of temporary objects here. Do the temporary objects actually not exist? Is it all just syntax sugar for calling assembly instructions in a C-like way? What happens if, instead of doing the store command at the end, you just kept the result: would the result then be more than syntax sugar, and actually hold data?
TL:DR: How am I supposed to think about memory when using SSE intrinsics?
An __m128 variable may be in a register and/or memory. It's much the same as with simple float or int variables - the compiler will decide which variables belong in registers and which must be stored in memory. In general the compiler will try to keep the "hottest" variables in registers and the rest in memory. It will also analyse the lifetimes of variables so that a register may be used for more than one variable within a block. As a programmer you don't need to worry about this too much, but you should be aware of how many registers you have, i.e. 8 XMM registers in 32-bit mode and 16 in 64-bit mode. Keeping your variable usage below these numbers will help to keep everything in registers as far as possible. Having said that, the penalty for accessing an operand in L1 cache is not that much greater than accessing a register operand, so you shouldn't get too hung up on keeping everything in registers if it proves difficult to do so.
Footnote: this vagueness about whether SSE variables are in registers or memory when using intrinsics is actually quite helpful, and makes it much easier to write optimised code than doing it with raw assembler - the compiler does the grunt work of keeping track of register allocation and other optimisations, allowing you to concentrate on making the code work correctly.
Vector variables aren't special. They will be spilled to memory and re-loaded when needed later, if the compiler runs out of registers when optimizing a loop (or across a function call to a function the compiler can't "see" to know that it doesn't touch the vector regs).
gcc -O0 actually does tend to store to RAM when you set them, instead of keeping __m128i variables only in registers, IIRC.
You could write all your intrinsic-using code without ever using any load or store intrinsics, but then you'd be at the mercy of the compiler to decide how and when to move data around. (You actually still are, to some degree these days, thanks to compilers being good at optimizing intrinsics, and not just literally spitting out a load wherever you use a load intrinsic.)
Compilers will fold loads into memory operands for following instructions, if the value isn't needed as an input to something else as well. However, this is only safe if the data is at a known-aligned address, or an aligned-load intrinsic was used.
The way I currently think about load intrinsics is as a way of communicating alignment guarantees (or lack thereof) to the compiler. The "regular" SSE (non-AVX / non-VEX-encoded) versions of vector instructions fault if used with an unaligned 128b memory operand. (Even on CPUs supporting AVX, FWIW.) For example, note that even punpckl* lists its memory operand as a m128, and thus has alignment requirements, even if it only actually reads the low 64b. pmovzx lists its operand as a m128.
Anyway, using load instead of loadu tells the compiler that it can fold the load into being a memory operand for another instruction, even if it can't otherwise prove that it comes from an aligned address.
Compiling for an AVX target machine will allow the compiler to fold even unaligned loads into other operations, to take advantage of uop micro-fusion.
This came up in comments on How to specify alignment with _mm_mul_ps.
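For illustration, a small sketch of the folding point (assuming SSE2; the comments describe what a typical compiler is allowed to do, not guaranteed output):

#include <emmintrin.h>

__m128i add_from_mem(const __m128i* aligned_p, __m128i x) {
    __m128i v = _mm_load_si128(aligned_p);   // alignment promise to the compiler
    return _mm_add_epi32(x, v);              // load may be folded, e.g. paddd xmm0, [rdi]
}

__m128i add_from_mem_unaligned(const void* maybe_unaligned, __m128i x) {
    __m128i v = _mm_loadu_si128(static_cast<const __m128i*>(maybe_unaligned));
    return _mm_add_epi32(x, v);              // foldable only when targeting AVX, where
                                             // memory operands may be unaligned
}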
The store intrinsics apparently have two purposes:
To tell the compiler whether it should use the aligned or unaligned asm instruction.
To remove the need for a cast from __m128d to double * (doesn't apply to the integer case).
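A short hedged example of both points (assuming SSE2):

#include <emmintrin.h>

void store_both_ways(double* p_aligned16, double* p_maybe_unaligned, __m128d v) {
    _mm_store_pd(p_aligned16, v);         // promises 16-byte alignment -> movapd
    _mm_storeu_pd(p_maybe_unaligned, v);  // no alignment promise -> movupd
    // the pd/ps versions take plain double*/float*, so no cast is needed;
    // the integer _mm_store_si128 takes __m128i*, hence the cast there
}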
Just to confuse things, AVX2 introduced things like _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a), which stores the high/low halves to different addresses. It probably compiles to a vmovdqu / vextracti128 ..., 1 sequence. Incidentally, I guess they made vextracti128 with AVX512 in mind, since using it with 0 as the immediate is the same as vmovdqu, but slower and longer-to-encode.

Access speed: local variable vs. array

Given this example code:
struct myStruct1 { int one, two; } first;
struct myStruct2 { myStruct1 n; int c; } second[255];

// this is a member of a structure in an array of structures!
register int i = second[2].n.one;
const int u = second[3].n.one;

while (1)
{
    // do something with second[1].n.one
    // do something with i
    // do something with u
}
Which one is faster?
Is it correct that a local copy of an array index can be copied into a register?
Will it be even faster if the copy is done inside the loop?
Which one is faster?
The only way to know is to measure or profile. You can look at the assembly code to get a hint at which one is faster, but the truth is in the profiling.
Is it correct that a local copy of an array index can be copied into a register?
A register can hold many things. The use of registers is controlled by the compiler and the quantity of registers that the processor has available.
The compiler may put values into registers or place them on the stack. Eventually, values need to go into registers. Some processors have the ability to copy memory from one location to another without using registers. Whether or not the compiler uses these features depends on the compiler and the optimization level.
Will it be even faster if the copy is done inside the loop?
Unnecessary code in a loop slows down the loop. Compilers may factor out code that isn't changing inside the loop.
Some processors can contain all of the instructions for a loop in their instruction cache; others not. Again, all this depends on the processor and the compiler optimization settings.
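As a hedged illustration, reusing the question's myStruct2 second[255] (the exact code depends on the compiler and flags):

int sum_invariant() {
    const int u = second[3].n.one;    // loop-invariant, hoisted once before the loop
    int total = 0;
    for (int k = 0; k < 1000; ++k)
        total += u;                   // with -O2, usually identical code to using
                                      // second[3].n.one directly inside the loop
    return total;
}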
Micro-optimizations
Your questions fall under the category of micro-optimizations. In general, this group of optimizations usually gains speed in terms of individual processor instructions. Unless you iterate over 1.0E+09 times, the optimizations won't gain you significant savings. With today's processors, we're talking an average gain of 100 nanoseconds per instruction (or, worst case, 1 millisecond). Unless you have profiled, you don't want to waste your development effort on these optimizations.
Design & coding optimizations
Here is a list of optimizations that will gain more performance benefits than micro-optimizations:
Removing unwanted requirements.
Removing unused modules.
Sharing common modules.
Using efficient data structures.
Removing unnecessary work.
Performing tasks in the background.
Double buffering.
Input (blocks), Process (blocks), Output (blocks).
Reducing function calls, and comparisons.
Reducing code by simplifying it using algebra or Karnaugh maps.

Does the static keyword play a role in C/C++ and the storage level?

This question has been bugging me for a while.
From what I understand that are various levels of storage. They are
CPU Registers
Lower Level Cache
Memory (RAM/ROM)
Hard Disk Space
With "fastest access time / fewest number" of at the top and "slowest access time / most number of" towards the bottom?
In C/C++ how do you control whether variables are put into (and stay in) Lower Level Cache? I'm assuming there is not a way to control which variables say in CPU registers since there are a very limited number.
I want to say that the C/C++ static keyword plays some part in it, but wanted to get clarification on this.
I understand how the static works in theory. Namely that
#include <stdio.h>

void increment(){
    static int iSum = 0;
    printf(" iSum = %d\n", ++iSum);
    return;
}

int main(int argc, char* argv[]){
    int iInc = 0;
    for(iInc = 0; iInc < 5; iInc++)
        increment();
    return 0;
}
Would print
iSum = 1
iSum = 2
iSum = 3
iSum = 4
iSum = 5
But I am not certain how the different levels of storage play a part. Does where a variable lies depend more on the optimization level, such as through invoking the -O2 and -O3 flags on GCC?
Any insight would be greatly appreciated.
Thanks,
Jeff
The static keyword has nothing to do with cache hinting and the compiler is free to allocate registers as it thinks suits better. You might have thought of that because of the storage class specifiers list with a deprecated register specifier.
There's no way to precisely control, via standard-conformant C++ (or C) language features, how caching and/or register allocation work, because you would have to interface deeply with your underlying hardware (and write your own register allocator or hint at how to store/spill/cache stuff). Register allocation is usually a compiler back-end duty, while caching stuff is the processor's work (along with instruction pipelining, branch prediction and other low-level tasks).
It is true that changing the compiler's optimization level might deeply affect how variables are accessed/loaded into registers. Ideally you would keep everything in registers (they're fast), but since you can't (their size and number are limited) the compiler has to make some predictions and guess what should be spilled (i.e. taken out of a register and reloaded later) and what shouldn't (or even optimized out). Register allocation is an NP-complete problem. In CUDA C you usually can't deal with such issues either, but you do have a chance of specifying the caching mechanism you intend to use, by using different types of memory. However, this is not standard C++, as extensions are in place.
Caches are intermediate storage areas between main memory and registers.
They are used because accessing memory today is very expensive, measured in clock ticks, compared to how things used to be (memory access hasn't increased in speed anywhere near what's happened to CPUs).
So they are a way to "simulate" faster memory access while letting you write exactly the same code as without them.
Variables are never "stored" in the cache as such — their values are only held there temporarily in case the CPU needs them. Once modified, they are written out to their proper place in main memory (if they reside there and not in a register).
And static has nothing to do with any of this.
If a program is small enough, the compiler can decide to use a register for that, too, or inline it to make it disappear completely.
Essentially you need to start looking at writing applications and code that are cache coherent. This is a quick intro to cache coherence:
http://supercomputingblog.com/optimization/taking-advantage-of-cache-coherence-in-your-programs/
It's a long and complicated subject and essentially boils down to the actual implementation of algorithms along with the platform that they are targeting. There is a similar discussion in the following thread:
Can I force cache coherency on a multicore x86 CPU?
A function variable declared as static makes its lifetime that of the duration of the program. That's all C/C++ says about it; nothing about storage/memory.
To answer this question:
In C/C++ how do you control whether variables are put into (and stay in) Lower Level Cache?
You can't. You can do some stuff to help the data stay in cache, but you can't pin anything in cache.
That's not what those caches are for; they are mainly fed from main memory, to speed up access, or to allow for some advanced techniques like branch prediction and pipelining.
I think there may be a few things that need clarification. CPU cache (L1, L2, L3, etc...) is a mechanism the CPU uses to avoid having to read and write directly to memory for values that will be accessed more frequently. It isn't distinct from RAM; it could be thought of as a narrow window of it.
Using cache effectively is extremely complex, and it requires nuanced knowledge of code memory access patterns, as well as the underlying architecture. You generally don't have any direct control over the cache mechanism, and an enormous amount of research has gone into compilers and CPUs to use CPU cache effectively. There are storage class specifiers, but these aren't meant to perform cache preload or support streaming.
Maybe it should be noted that simply because something takes fewer cycles to use (register, L1, L2, etc...) doesn't mean using it will necessarily make code faster. For example, if something is only written to memory once, loading it into L1 may cause a cache eviction, which could move data needed for a tight loop into a slower memory. Since the data that's accessed more frequently now takes more cycles to access, the cumulative impact would be lower (not higher) performance.

Ring buffer: Disadvantages of moving through memory backwards?

This is probably language agnostic, but I'm asking from a C++ background.
I am hacking together a ring buffer for an embedded system (AVR, 8-bit). Let's assume:
const uint8_t size = /* something > 0 */;
uint8_t buffer[size];
uint8_t write_pointer;
There's this neat trick of &ing the write and read pointers with size-1 to do an efficient, branchless rollover if the buffer's size is a power of two, like so:
// value = buffer[write_pointer];
write_pointer = (write_pointer+1) & (size-1);
If, however, the size is not a power of two, the fallback would probably be to compare the pointer (i.e. index) to the size and do a conditional reset:
// value = buffer[write_pointer];
if (++write_pointer == size) write_pointer ^= write_pointer;
Since the reset occurs rather rarely, this should be easy for any branch prediction.
This assumes though that the pointers need to be advancing forward in memory. While this is intuitive, it requires a load of size in every iteration. I assume that reversing the order (advancing backwards) would yield better CPU instructions (i.e. jump if not zero) in the regular case, since size is only required during the reset.
// value = buffer[--write_pointer];
if (write_pointer == 0) write_pointer = size;
so
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?
You have an 8 bit avr with a cache? And branch prediction?
How does forward or backwards matter as far as caches are concerned? A hit or miss on a cache can happen anywhere within the cache line: beginning, middle, end, random, sequential, it doesn't matter. You can work from the back to the front or the front to the back of a cache line; it is the same cost (assuming all other things held constant). The first miss causes a fill, then that line is in cache and you can access any of the items in any pattern at a lower latency until it is evicted.
On a microcontroller like that you want to make the effort, even at the cost of throwing away some memory, to align a circular buffer such that you can mask, as sketched below. There is no cache; the instruction fetches are painful because they likely come from a flash that may be slower than the processor clock rate, so you do want to reduce the instructions executed, or make the execution a little more deterministic (the same number of instructions every loop until that task is done). There might be a pipeline that would appreciate the masking rather than an if-then-else.
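A minimal sketch of that suggestion (the size and names are made up; assumes rounding the buffer up to a power of two is acceptable):

#include <stdint.h>

#define RING_SIZE 32u                 /* power of two, possibly larger than strictly needed */

static uint8_t ring[RING_SIZE];
static uint8_t write_pointer;

static inline void ring_put(uint8_t value) {
    ring[write_pointer] = value;
    write_pointer = (write_pointer + 1u) & (RING_SIZE - 1u);   /* branchless wrap */
}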
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?
The cache doesn't care. A miss on any item in the line causes a fill; once the line is in the cache, any pattern of access (random, sequential forward or back, or just pounding on the same address) takes less time, being in faster memory, until the line is evicted. Evictions won't come from neighboring cache lines; they will come from cache lines larger powers of two away, so whether the next cache line you pull is at a higher or lower address, the cost is the same.
Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward)?
Why do you think that you will have a cache miss? You will have a cache miss if you try to access outside the cache (forward or backward).
There are a number of points which need clarification:
That size needs to be loaded each and every time (it's const, and therefore immutable, so the compiler is free to keep it in a register or fold it into the code).
That your code is correct. For example, with a 0-based index (as used in C/C++ for array access) the value 0 is a valid index into the buffer, and the value size is not. Similarly, there is no need to XOR when you could simply assign 0; equally, a modulo operator will work (write_pointer = (write_pointer + 1) % size).
What happens in the general case with virtual memory (i.e. the logically adjacent addresses might be all over the place in the real memory map), paging (stuff may well be cached on a page-by-page basis) and other factors (pressure from external processes, interrupts)
In short: this is the kind of optimisation that leads to more foot-related injuries than genuine performance improvements. Additionally, it is almost certainly the case that you would get much, much better gains using vectorised code (SIMD).
EDIT: And in interpreted languages or JIT'ed languages it might be a tad optimistic to assume you can rely on the use of JNZ and others at all. At which point the question is, how much of a difference is there really between loading size and comparing versus comparing with 0.
As usual, when performing any form of manual code optimization, you must have extensive in-depth knowledge of the specific hardware. If you don't have that, then you should not attempt manual optimizations, end of story.
Thus, your question is filled with various strange assumptions:
First, you assume that write_pointer = (write_pointer+1) & (size-1) is more efficient than something else, such as the XOR example you posted. You are just guessing here; you will have to disassemble the code and see which yields fewer CPU instructions.
Because, when writing code for a tiny, primitive 8-bit MCU, there is not much going on in the core to speed up your code. I don't know AVR8, but it seems likely that you have a small instruction pipe and that's it. It seems quite unlikely that you have much in the way of branch prediction. It seems very unlikely that you have a data and/or instruction cache. Read the friendly CPU core manual.
As for marching backwards through memory, it will unlikely have any impact at all on your program's performance. On old, crappy compilers you would get slightly more efficient code if the loop condition was a comparison vs zero instead of a value. On modern compilers, this shouldn't be an issue. As for cache memory concerns, I doubt you have any cache memory to worry about.
The best way to write efficient code on 8-bit MCUs is to stick to 8-bit arithmetic whenever possible and to avoid 32-bit arithmetic like the plague. And forget you ever heard about something called floating point. This is what will make your program efficient, you are unlikely to find any better way to manually optimize your code.
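A small hedged example of that advice (assuming avr-gcc, where int is 16 bits):

#include <stdint.h>

uint16_t sum_bytes(const uint8_t* p) {
    uint16_t sum = 0;
    for (uint8_t i = 0; i < 100; ++i)   /* 8-bit counter: one register, cheap compare */
        sum += p[i];
    return sum;
}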