Why are these 8 byte-writes not optimized into a MOV? - c++

My colleague and I cannot explain why GCC, ICC and Clang do not optimize this function:
void f(std::uint64_t a, void * p) {
std::uint8_t *x = reinterpret_cast<std::uint8_t *>(p);
x[7] = a >> 56;
x[6] = a >> 48;
x[5] = a >> 40;
x[4] = a >> 32;
x[3] = a >> 24;
x[2] = a >> 16;
x[1] = a >> 8;
x[0] = a;
}
Into this
mov QWORD PTR [rsi], rdi
If we formulate f in terms of memcpy, it emits just that mov. Why does it not happen if we do the seemingly trivial sequence of byte writes?
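For reference, the memcpy formulation mentioned above might look like the following sketch. The name f_memcpy is hypothetical and not taken from the question. Note that it stores the value in the host's native byte order, so it matches the byte-wise version only on a little-endian target.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical name: copies the 8 bytes of `a` to `p` in host byte order.
// Compilers typically lower a fixed-size memcpy like this to a single store.
void f_memcpy(std::uint64_t a, void* p) {
    std::memcpy(p, &a, sizeof a);
}
```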

I'm not an expert, but gcc only gained the ability to merge adjacent stores for immediate constants in gcc 7:
Closed bug for immediate constants: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=23684
Open bug for assignment of small structs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78821
Store-merging pass code: https://github.com/gcc-mirror/gcc/blob/master/gcc/gimple-ssa-store-merging.c
If I had to guess, by the second bug, it might not be too long a wait.

Related

Convert uint64_t to byte array portably and optimally in Clang

Suppose you want to convert a uint64_t to a uint8_t[8] (little endian). On a little endian architecture you can just do an ugly reinterpret_cast<> or memcpy(), e.g.:
void from_memcpy(const std::uint64_t &x, uint8_t* bytes) {
std::memcpy(bytes, &x, sizeof(x));
}
This generates efficient assembly:
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
However, it is not portable. It will behave differently on a big endian machine.
For converting uint8_t[8] to uint64_t there is a great solution - just do this:
void to(const std::uint8_t* bytes, std::uint64_t &x) {
x = (std::uint64_t(bytes[0]) << 8*0) |
(std::uint64_t(bytes[1]) << 8*1) |
(std::uint64_t(bytes[2]) << 8*2) |
(std::uint64_t(bytes[3]) << 8*3) |
(std::uint64_t(bytes[4]) << 8*4) |
(std::uint64_t(bytes[5]) << 8*5) |
(std::uint64_t(bytes[6]) << 8*6) |
(std::uint64_t(bytes[7]) << 8*7);
}
This looks inefficient but actually with Clang -O2 it generates exactly the same assembly as before, and if you compile on a big endian machine it will be smart enough to use a native byte swap instruction. E.g. this code:
void to(const std::uint8_t* bytes, std::uint64_t &x) {
x = (std::uint64_t(bytes[7]) << 8*0) |
(std::uint64_t(bytes[6]) << 8*1) |
(std::uint64_t(bytes[5]) << 8*2) |
(std::uint64_t(bytes[4]) << 8*3) |
(std::uint64_t(bytes[3]) << 8*4) |
(std::uint64_t(bytes[2]) << 8*5) |
(std::uint64_t(bytes[1]) << 8*6) |
(std::uint64_t(bytes[0]) << 8*7);
}
Compiles to:
mov rax, qword ptr [rdi]
bswap rax
mov qword ptr [rsi], rax
ret
My question is: is there an equivalent reliably-optimised construct for converting in the opposite direction? I've tried this, but it gets compiled naively:
void from(const std::uint64_t &x, uint8_t* bytes) {
bytes[0] = x >> 8*0;
bytes[1] = x >> 8*1;
bytes[2] = x >> 8*2;
bytes[3] = x >> 8*3;
bytes[4] = x >> 8*4;
bytes[5] = x >> 8*5;
bytes[6] = x >> 8*6;
bytes[7] = x >> 8*7;
}
Edit: After some experimentation, this code does get compiled optimally with GCC 8.1 and later as long as you use uint8_t* __restrict__ bytes. However I still haven't managed to find a form that Clang will optimise.
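A minimal sketch of the __restrict__ variant described in the edit, assuming a GCC- or Clang-compatible compiler (__restrict__ is a compiler extension, not standard C++). The name from_restrict is hypothetical.

```cpp
#include <cstdint>

// Hypothetical name. __restrict__ promises the compiler that `bytes`
// does not overlap `x`, so x does not have to be re-read after each store.
void from_restrict(const std::uint64_t& x, std::uint8_t* __restrict__ bytes) {
    bytes[0] = static_cast<std::uint8_t>(x >> 8*0);
    bytes[1] = static_cast<std::uint8_t>(x >> 8*1);
    bytes[2] = static_cast<std::uint8_t>(x >> 8*2);
    bytes[3] = static_cast<std::uint8_t>(x >> 8*3);
    bytes[4] = static_cast<std::uint8_t>(x >> 8*4);
    bytes[5] = static_cast<std::uint8_t>(x >> 8*5);
    bytes[6] = static_cast<std::uint8_t>(x >> 8*6);
    bytes[7] = static_cast<std::uint8_t>(x >> 8*7);
}
```

Whether this collapses to a single store depends on the compiler and version, so it is worth checking the generated assembly.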
Here's what I could test based on the discussion in OP's comments:
void from_optimized(const std::uint64_t &x, std::uint8_t* bytes) {
std::uint64_t big;
std::uint8_t* temp = (std::uint8_t*)&big;
temp[0] = x >> 8*0;
temp[1] = x >> 8*1;
temp[2] = x >> 8*2;
temp[3] = x >> 8*3;
temp[4] = x >> 8*4;
temp[5] = x >> 8*5;
temp[6] = x >> 8*6;
temp[7] = x >> 8*7;
std::uint64_t* dest = (std::uint64_t*)bytes;
*dest = big;
}
Looks like this will make things clearer for the compiler and let it assume the necessary parameters to optimize it (both on GCC and Clang with -O2).
Compiling to x86-64 (little endian) on Clang 8.0.0 (test on Godbolt):
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
Compiling to aarch64_be (big endian) on Clang 8.0.0 (test on Godbolt):
ldr x8, [x0]
rev x8, x8
str x8, [x1]
ret
What about returning a value?
Easy to reason about and small assembly:
#include <cstdint>
#include <array>
auto to_bytes(std::uint64_t x)
{
std::array<std::uint8_t, 8> b;
b[0] = x >> 8*0;
b[1] = x >> 8*1;
b[2] = x >> 8*2;
b[3] = x >> 8*3;
b[4] = x >> 8*4;
b[5] = x >> 8*5;
b[6] = x >> 8*6;
b[7] = x >> 8*7;
return b;
}
https://godbolt.org/z/FCroX5
and big endian:
#include <stdint.h>
struct mybytearray
{
uint8_t bytes[8];
};
auto to_bytes(uint64_t x)
{
mybytearray b;
b.bytes[0] = x >> 8*0;
b.bytes[1] = x >> 8*1;
b.bytes[2] = x >> 8*2;
b.bytes[3] = x >> 8*3;
b.bytes[4] = x >> 8*4;
b.bytes[5] = x >> 8*5;
b.bytes[6] = x >> 8*6;
b.bytes[7] = x >> 8*7;
return b;
}
https://godbolt.org/z/WARCqN
(std::array not available for -target aarch64_be? )
First of all, the reason why your original from implementation cannot be optimized is that you are passing the arguments by reference and pointer. So, the compiler has to consider the possibility that both of them point to the very same address (or at least that they overlap). As you have 8 consecutive read and write operations to the (potentially) same address, the as-if rule cannot be applied here.
Note that just by removing the & from the function signature, apparently GCC already considers this as proof that bytes does not point into x and thus this can safely be optimized. However, for Clang this is not good enough.
Technically, of course bytes can point to from's stack memory (aka. to x), but I think that would be undefined behavior and thus Clang just misses this optimization.
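To illustrate why the compiler has to be conservative, here is a small sketch in which the destination overlaps the source. All names are hypothetical. It assumes std::uint8_t is an alias for unsigned char, as on mainstream platforms, so the byte stores are allowed to modify x. Re-reading x before every store then gives a different result than reading it once up front, which is why the two forms cannot be treated as equivalent.

```cpp
#include <cstdint>
#include <cstring>

// Same shape as the question's `from`: x is read again before each store.
void from_naive(const std::uint64_t& x, std::uint8_t* bytes) {
    for (int i = 0; i < 8; ++i)
        bytes[i] = static_cast<std::uint8_t>(x >> (8 * i));
}

// Reads x once, then performs all the stores.
void from_read_first(const std::uint64_t& x, std::uint8_t* bytes) {
    const std::uint64_t copy = x;
    for (int i = 0; i < 8; ++i)
        bytes[i] = static_cast<std::uint8_t>(copy >> (8 * i));
}

// Calls both versions with a destination that starts one byte into the
// source object and reports whether the resulting memory differs.
bool aliasing_changes_result() {
    std::uint64_t a[2] = {0x0807060504030201ULL, 0};
    std::uint64_t b[2] = {0x0807060504030201ULL, 0};
    from_naive(a[0], reinterpret_cast<std::uint8_t*>(a) + 1);
    from_read_first(b[0], reinterpret_cast<std::uint8_t*>(b) + 1);
    return std::memcmp(a, b, sizeof a) != 0;
}
```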
Your implementation of to doesn't suffer from this issue because you have implemented it in such a way that first you read all the values of bytes and then you make one big assignment to x. So even if x and bytes point to the same address, as you do all the reading first and all the writing afterwards (instead of mixing reads and writes as you do in from), this can be optimized.
Flávio Toribio's answer works because it does precisely this: It reads all the values first and only then writes to the destination.
However, there are less complicated ways to achieve this:
void from(uint64_t x, uint8_t* dest) {
uint8_t bytes[8];
bytes[7] = uint8_t(x >> 8*7);
bytes[6] = uint8_t(x >> 8*6);
bytes[5] = uint8_t(x >> 8*5);
bytes[4] = uint8_t(x >> 8*4);
bytes[3] = uint8_t(x >> 8*3);
bytes[2] = uint8_t(x >> 8*2);
bytes[1] = uint8_t(x >> 8*1);
bytes[0] = uint8_t(x >> 8*0);
*(uint64_t*)dest = *(uint64_t*)bytes;
}
gets compiled to
mov qword ptr [rsi], rdi
ret
on little endian and to
rev x8, x0
str x8, [x1]
ret
on big endian.
Note, that even if you passed x by reference, Clang would be able to optimize this. However, that would result in one more instruction each:
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
and
ldr x8, [x0]
rev x8, x8
str x8, [x1]
ret
respectively.
Also note, that you can improve your implementation of to with a similar trick: Instead of passing the result by non-const reference, take the "more natural" approach and just return it from the function:
uint64_t to(const uint8_t* bytes) {
return
(uint64_t(bytes[7]) << 8*7) |
(uint64_t(bytes[6]) << 8*6) |
(uint64_t(bytes[5]) << 8*5) |
(uint64_t(bytes[4]) << 8*4) |
(uint64_t(bytes[3]) << 8*3) |
(uint64_t(bytes[2]) << 8*2) |
(uint64_t(bytes[1]) << 8*1) |
(uint64_t(bytes[0]) << 8*0);
}
Summary:
Don't pass arguments by reference.
Do all the reading first, then all the writing.
Here are the best solutions I could get to for both, little endian and big endian. Note, how to and from are truly inverse operations that can be optimized to a no-op if executed one after another.
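As a rough sketch of such an inverse pair: the names from_le and to_le are hypothetical, and std::memcpy is swapped in for the pointer cast used above so that the final copy does not depend on alignment or aliasing assumptions. Both functions use little-endian byte order regardless of the host.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical name: writes x to dest in little-endian order.
// All bytes are computed into a local buffer first, then copied in one go.
void from_le(std::uint64_t x, std::uint8_t* dest) {
    std::uint8_t bytes[8];
    for (int i = 0; i < 8; ++i)
        bytes[i] = static_cast<std::uint8_t>(x >> (8 * i));
    std::memcpy(dest, bytes, sizeof bytes);
}

// Hypothetical name: reads a little-endian value and returns it.
std::uint64_t to_le(const std::uint8_t* bytes) {
    std::uint64_t x = 0;
    for (int i = 0; i < 8; ++i)
        x |= std::uint64_t(bytes[i]) << (8 * i);
    return x;
}
```

Whether the loops are recognized and reduced to a single load or store depends on the compiler, so the assembly should be checked for the target in question.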
The code you've given is way overcomplicated. You can replace it with:
void from(uint64_t x, uint8_t* dest) {
x = htole64(x);
std::memcpy(dest, &x, sizeof(x));
}
Yes, this uses the Linux-ism htole64(), but if you're on another platform you can easily reimplement that.
Clang and GCC optimize this perfectly, on both little- and big-endian platforms.
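On a platform without htole64(), one possible reimplementation is sketched below. The names portable_htole64 and from_portable are hypothetical. The idea is to build the little-endian byte sequence with shifts and reinterpret it as an integer via memcpy, which is a no-op on little-endian hosts and a byte swap on big-endian ones.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for htole64(): returns a value whose in-memory
// representation is x in little-endian byte order.
std::uint64_t portable_htole64(std::uint64_t x) {
    unsigned char le[8];
    for (int i = 0; i < 8; ++i)
        le[i] = static_cast<unsigned char>(x >> (8 * i));
    std::uint64_t out;
    std::memcpy(&out, le, sizeof out);
    return out;
}

// Same shape as the answer's `from`, using the stand-in above.
void from_portable(std::uint64_t x, std::uint8_t* dest) {
    x = portable_htole64(x);
    std::memcpy(dest, &x, sizeof x);
}
```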

Convert uint64_t to uint8_t[8]

How can I convert uint64_t to uint8_t[8] without losing information in C++?
I tried the following:
uint64_t number = 23425432542254234532;
uint8_t result[8];
for(int i = 0; i < 8; i++) {
std::memcpy(result[i], number, 1);
}
You are almost there. Firstly, the literal 23425432542254234532 is too big to fit in uint64_t.
Secondly, as you can see from the documentation, std::memcpy has the following declaration:
void * memcpy ( void * destination, const void * source, size_t num );
As you can see, it takes pointers (addresses) as arguments. Not uint64_t, nor uint8_t. You can easily get the address of the integer using the address-of operator.
Thirdly, you are only copying the first byte of the integer into each array element. You would need to increment the input pointer in every iteration. But the loop is unnecessary. You can copy all bytes in one go like this:
std::memcpy(result, &number, sizeof number);
Do realize that the order of the bytes depends on the endianness of the CPU.
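To make the endianness dependence concrete, here is a small sketch that inspects which byte memcpy puts first. The function name and the test value are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true if the least significant byte of a
// uint64_t is stored at the lowest address (little-endian host).
bool host_is_little_endian() {
    const std::uint64_t number = 0x0102030405060708ULL;
    std::uint8_t result[sizeof number];
    std::memcpy(result, &number, sizeof number);
    return result[0] == 0x08;  // a big-endian host would store 0x01 first
}
```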
First, do you want the conversion to be big-endian, or little-endian? Most of the previous answers are going to start giving you the bytes in the opposite order, and break your program, as soon as you switch architectures.
If you need to get consistent results, you would want to convert your 64-bit input into big-endian (network) byte order, or perhaps to little-endian. For example, on GNU glib, the function is GUINT64_TO_BE(), but there is an equivalent built-in function for most compilers.
Having done that, there are several alternatives:
Copy with memcpy() or memmove()
This is the method that the language standard guarantees will work, although here I use one function from a third-party library (to convert the argument to big-endian byte order on all platforms). For example:
#include <stdint.h>
#include <stdlib.h>
#include <string.h> /* memcpy */
#include <glib.h>
union eight_bytes {
uint64_t u64;
uint8_t b8[sizeof(uint64_t)];
};
eight_bytes u64_to_eight_bytes( const uint64_t input )
{
eight_bytes result;
const uint64_t big_endian = (uint64_t)GUINT64_TO_BE((guint64)input);
memcpy( &result.b8, &big_endian, sizeof(big_endian) );
return result;
}
On Linux x86_64 with clang++ -std=c++17 -O, this compiles to essentially the instructions:
bswapq %rdi
movq %rdi, %rax
retq
If you wanted the results in little-endian order on all platforms, you could replace GUINT64_TO_BE() with GUINT64_TO_LE() and remove the first instruction, then declare the function inline to remove the third instruction. (Or, if you’re certain that cross-platform compatibility does not matter, you might risk just omitting the normalization.)
So, on a modern, 64-bit compiler, this code is just as efficient as anything else. On another target, it might not be.
Type-Punning
The common way to write this in C would be to declare the union as before, set its uint64_t member, and then read its uint8_t[8] member. This is legal in C.
I personally like it because it allows me to express the entire operation as static single assignments.
However, in C++, it is formally undefined behavior. In practice, all C++ compilers I’m aware of support it for Plain Old Data (the formal term in the language standard), of the same size, with no padding bits, but not for more complicated classes that have virtual function tables and the like. It seems more likely to me that a future version of the Standard will officially support type-punning on POD than that any important compiler will ever break it silently.
The C++ Guidelines Way
Bjarne Stroustrup recommended that, if you are going to type-pun instead of copying, you use reinterpret_cast, such as
uint8_t (&array_of_bytes)[sizeof(uint64_t)] =
*reinterpret_cast<uint8_t(*)[sizeof(uint64_t)]>(
&proper_endian_uint64);
His reasoning was that both an explicit cast and type-punning through a union are undefined behavior, but the cast makes it blatant and unmistakable that you are shooting yourself in the foot on purpose, whereas reading a different union member than the active one can be a very subtle bug.
If I understand correctly you can do this that way for instance:
uint64_t number = 23425432542254234532;
uint8_t *p = (uint8_t *)&number;
//if you need a copy
uint8_t result[8];
for(int i = 0; i < 8; i++) {
result[i] = p[i];
}
When copying memory around between incompatible types, the first thing to be aware of is strict aliasing - you don't want to alias pointers incorrectly. Alignment is also to be considered.
You were almost there, the for is not needed.
uint64_t number = 0x2342543254225423; // trimmed to fit
uint8_t result[sizeof(number)];
std::memcpy(result, &number, sizeof(number));
Note: be aware of the endianness of the platform as well.
Either use a union, or do it with bitwise operations- memcpy is for blocks of memory and might not be the best option here.
uint64_t number = 23425432542254234532;
uint8_t result[8];
for(int i = 0; i < 8; i++) {
result[i] = uint8_t((number >> 8*(7 - i)) & 0xFF);
}
Or, although I'm told this breaks the rules, it works on my compiler:
union
{
uint64_t a;
uint8_t b[8];
};
a = 23425432542254234532;
//Can now read off the value of b
uint8_t copy[8];
for(int i = 0; i < 8; i++)
{
copy[i]= b[i];
}
The packing and unpacking can be done with masks. One more thing to worry about is the byte order: packing and unpacking must use the same byte order. Beware: this is not a very efficient implementation, and it comes with no guarantees on small CPUs that are not natively 64-bit.
void unpack_uint64(uint64_t number, uint8_t *result) {
result[0] = number & 0x00000000000000FF ; number = number >> 8 ;
result[1] = number & 0x00000000000000FF ; number = number >> 8 ;
result[2] = number & 0x00000000000000FF ; number = number >> 8 ;
result[3] = number & 0x00000000000000FF ; number = number >> 8 ;
result[4] = number & 0x00000000000000FF ; number = number >> 8 ;
result[5] = number & 0x00000000000000FF ; number = number >> 8 ;
result[6] = number & 0x00000000000000FF ; number = number >> 8 ;
result[7] = number & 0x00000000000000FF ;
}
uint64_t pack_uint64(uint8_t *buffer) {
uint64_t value ;
value = buffer[7] ;
value = (value << 8 ) + buffer[6] ;
value = (value << 8 ) + buffer[5] ;
value = (value << 8 ) + buffer[4] ;
value = (value << 8 ) + buffer[3] ;
value = (value << 8 ) + buffer[2] ;
value = (value << 8 ) + buffer[1] ;
value = (value << 8 ) + buffer[0] ;
return value ;
}
#include<cstdint>
#include<cstdlib> // system
#include<iostream>
struct ByteArray
{
uint8_t b[8] = { 0,0,0,0,0,0,0,0 };
};
ByteArray split(uint64_t x)
{
ByteArray pack;
const uint8_t MASK = 0xFF;
for (auto i = 0; i < 8; ++i) // all 8 bytes
{
pack.b[i] = x & MASK;
x = x >> 8;
}
return pack;
}
int main()
{
uint64_t val_64 = UINT64_MAX;
auto pack = split(val_64);
for (auto i = 0; i < 8; ++i) // all 8 bytes
{
std::cout << (uint32_t)pack.b[i] << std::endl;
}
system("Pause");
return 0;
}
That said, the union approach mentioned by Straw1239 is better and cleaner. Do take care of compiler/platform differences in endianness.

What is the real definition of the xorshift128+ algorithm?

I need a good pseudo random number generator (PRNG), and it seems like the current state of the art is the xorshift128+ algorithm. Unfortunately, I have discovered 2 different versions. The one in the Wikipedia article on Xorshift is shown as:
uint64_t s[2];
uint64_t xorshift128plus(void) {
uint64_t x = s[0];
uint64_t const y = s[1];
s[0] = y;
x ^= x << 23; // a
s[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
return s[1] + y;
}
Which seems straight forward enough. What's more, the edit logs appear to show that this code snippet was added by a user named "Vigna", which is presumably "Sebastiano Vigna" who is the author of the paper on xorshift128+: Further scramblings of Marsaglia’s xorshift generators. Unfortunately, the implementation in that paper is slightly different:
uint64_t next(void) {
uint64_t s1 = s[0];
const uint64_t s0 = s[1];
s[0] = s0;
s1 ^= s1 << 23; // a
s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
return s[1] + s0;
}
Apart from some different names, these two snippets are identical except for the final two shifts. In the Wikipedia version those shifts are by 17 and 26, while the shifts in the paper are by 18 and 5.
Does anyone know which is the "right" algorithm? Does it make a difference? This is apparently a fairly widely used algorithm - but which version is used?
Thanks to @Blastfurnace, it appears that the answer is that the most recent set of constants according to the author of the algorithm are: 23, 18, and 5. Apparently it doesn't matter too much, but those are theoretically better than the initial set of numbers he used. Sebastiano Vigna made these comments in response to the news that the V8 Javascript engine is shifting to using this algorithm.
The implementation that I am using is:
uint64_t a = s[0];
uint64_t b = s[1];
s[0] = b;
a ^= a << 23;
a ^= a >> 18;
a ^= b;
a ^= b >> 5;
s[1] = a;
return a + b;
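The fragment above can be wrapped into a self-contained generator roughly as follows. The struct name and layout are hypothetical; the shift constants are the 23, 18, 5 set discussed above. As with any xorshift generator, the state must not be all zero, otherwise every output is zero.

```cpp
#include <cstdint>

// Hypothetical wrapper around the update step shown above.
struct Xorshift128Plus {
    std::uint64_t s[2];  // state; must be seeded with at least one non-zero word

    std::uint64_t next() {
        std::uint64_t a = s[0];
        const std::uint64_t b = s[1];
        s[0] = b;
        a ^= a << 23;
        a ^= a >> 18;
        a ^= b;
        a ^= b >> 5;
        s[1] = a;
        return a + b;
    }
};
```

Seeding is left to the caller here, for example `Xorshift128Plus g{{seed0, seed1}};` with values taken from a separate source of randomness.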

how to tie variables in c++

Well my problem is as follows:
I'm trying to translate an x86 assembly source code to c++ source code.
Explanation as to what registers are.
skip this if you know what they are and how they work.
As you may or may not know, assembly language makes use of "general purpose registers".
In 32-bit x86 assembly these registers can be thought of as 4-byte variables ( int var in c++ ); their names are eax, ebx, ecx and edx.
Each of these registers contains ax, bx, cx and dx respectively, which represent the 2 least significant bytes of each register.
ax, bx, cx and dx are in turn broken down into ah, bh, ch and dh ( most significant byte ) and al, bl, cl and dl ( least significant byte ).
So, for example:
If I set eax:
EAX = 0xAB12CDEF
that would automatically change ax, al and ah
AX would become 0xCDEF
AH would become 0xCD
AL would become 0xEF
My question is: How do I make that possible in C++ ?
int eax, ax, ah, al;
eax = 0xAB12CDEF;
How can I make, ax, ah and al, change at the same time?
Or is it possible to make them pointers to different portions eax, if so, how?
Thanks!
P.S. Also, how could I make another variable a char?
How could I make a new variable "char chAL" refer to al, which in turn refers to eax,
so that when I change chAL, the change automatically propagates to eax, ah and al?
If your goal is to emulate X86 assembly code, then indeed you need to support the behaviour of X86 registers.
Here's a simple implementation using a union:
#include <iostream>
#include <cstdint>
using namespace std;
union reg_t {
uint64_t rx;
uint32_t ex;
uint16_t x;
struct {
uint8_t l;
uint8_t h;
};
};
int main(){
reg_t a;
a.rx = 0xdeadbeefcafebabe;
cout << "rax = " << hex << a.rx << endl;
cout << "eax = " << hex << a.ex << endl;
cout << "ax = " << hex << a.x << endl;
cout << "al = " << hex << (uint16_t)a.l << endl;
cout << "ah = " << hex << (uint16_t)a.h << endl;
cout << "ax & 0xFF = " << hex << (a.x & 0xFF) << endl;
cout << "(ah << 8) + al = " << hex << (a.h << 8) + a.l << endl;
}
output:
rax = deadbeefcafebabe
eax = cafebabe
ax = babe
al = be
ah = ba
ax & 0xFF = be
(ah << 8) + al = babe
You'll get the correct result on the right platform (little-endian). You'll have to swap
bytes, and/or add padding for other platforms.
That's the basic, down to earth solution, which will certainly work on many x86 platforms (at least X86/linux/g++ works fine), but the behaviour this very approach relies on seems undefined in C++.
Here's another approach using a byte array to store register content:
class x86register {
uint8_t bytes[8];
public:
x86register &operator =(const uint64_t &v){
for (int i = 0; i < 8; i++)
bytes[i] = (v >> (i * 8)) & 0xff;
return *this;
}
x86register &operator =(const uint32_t &v){
for (int i = 0; i < 4; i++)
bytes[i] = (v >> (i * 8)) & 0xff;
return *this;
}
x86register &operator =(const uint16_t &v){
for (int i = 0; i < 2; i++)
bytes[i] = (v >> (i * 8)) & 0xff;
return *this;
}
x86register &operator =(const uint8_t &v){
bytes[0] = v;
return *this;
}
operator uint64_t(){
uint64_t res = 0;
for (int i = 7; i >= 0; i--)
res = (res << 8) + bytes[i];
return res;
}
operator uint32_t(){
uint32_t res = 0;
for (int i = 3; i >= 0; i--) // bytes 3..0 only
res = (res << 8) + bytes[i];
return res;
}
operator uint16_t(){
uint16_t res = 0;
for (int i = 1; i >= 0; i--) // bytes 1..0 only
res = (res << 8) + bytes[i];
return res;
}
operator uint8_t(){
return bytes[0];
}
};
This simple class should work regardless of endianness on the running platform. Also, you probably want to add a few other accessors/mutators to handle the high byte (AH, BH, etc.) of word registers.
You can extract parts of eax using bitwise operations, like this:
#include <cstdio>
int main()
{
unsigned int eax, ax, ah, al;
eax = 0xAB12CDEF;
ax = eax & 0x0000FFFF;
ah = (eax & 0x0000FF00) >> 8;
al = eax & 0x000000FF;
printf("ax = eax & 0x0000FFFF = 0x%X\n", ax);
printf("ah = (eax & 0x0000FF00) >> 8 = 0x%X\n", ah);
printf("al = eax & 0x000000FF = 0x%X\n", al);
}
Output
ax = eax & 0x0000FFFF = 0xCDEF
ah = (eax & 0x0000FF00) >> 8 = 0xCD
al = eax & 0x000000FF = 0xEF
You could also define macros like this:
#define AX(dw) ((dw) & 0x0000FFFF)
#define AH(dw) (((dw) & 0x0000FF00) >> 8)
#define AL(dw) ((dw) & 0x000000FF)
int main()
{
int eax = 0xAB12CDEF;
cout << "ax = " << hex << AX(eax) << endl; // prints ax = cdef
}
If you want it to work as simply as you've put the example ints, you can get away with it through reinterpret casts, though this violates pointer aliasing rules, so the behavior is undefined.
std::uint32_t eax = 0xAB12CDEF;
std::uint16_t& ax = reinterpret_cast<std::uint16_t*>(&eax)[1];
std::uint8_t& ah = reinterpret_cast<std::uint8_t&>(ax);
std::uint8_t& al = (&ah)[1];
The second line casts the address of eax to a std::uint16_t* and applies [1] to it, which yields the second half of the 32 bits.
The third line is just a cast to uint8_t, which works because ah is the first byte of ax.
Indexing into the address of ah by 1 gives the following byte, which is al. Note that these indices assume a big-endian host; on a little-endian host such as x86 itself, ax is at index [0], al is its first byte and ah is the byte after it.
What you're trying to do seems pretty unsafe and strange, though. To get the most similar behavior in the sanest way, you could just use a custom type. The results of the class below are consistent from machine to machine, whereas the casts above depend on the host's endianness.
class Reg {
private:
std::uint32_t data_;
public:
Reg(std::uint32_t in) : data_{in} { }
std::uint32_t ex() const {
return data_;
}
std::uint16_t x() const {
return static_cast<std::uint16_t>(data_ & 0xFFFF);
}
std::uint8_t h() const {
return static_cast<std::uint8_t>((data_ & 0xFF00) >> 8);
}
std::uint8_t l() const {
return static_cast<std::uint8_t>(data_ & 0xFF);
}
};
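The class above is read-only. Since the question also asks for writes to al or ah to show up in eax, one way to extend the idea is sketched below. The class name MutableReg and the set_* member names are hypothetical. Each setter masks out the affected bits of the single 32-bit value and ORs in the new ones, so all views stay in sync without relying on the host's byte order.

```cpp
#include <cstdint>

// Hypothetical extension of the Reg idea with setters.
class MutableReg {
    std::uint32_t data_;
public:
    explicit MutableReg(std::uint32_t in = 0) : data_{in} {}

    std::uint32_t ex() const { return data_; }
    std::uint16_t x() const { return static_cast<std::uint16_t>(data_ & 0xFFFF); }
    std::uint8_t h() const { return static_cast<std::uint8_t>((data_ >> 8) & 0xFF); }
    std::uint8_t l() const { return static_cast<std::uint8_t>(data_ & 0xFF); }

    void set_ex(std::uint32_t v) { data_ = v; }
    // Replace only the low 16 bits (ax).
    void set_x(std::uint16_t v) { data_ = (data_ & 0xFFFF0000u) | v; }
    // Replace only bits 8..15 (ah).
    void set_h(std::uint8_t v) { data_ = (data_ & 0xFFFF00FFu) | (std::uint32_t(v) << 8); }
    // Replace only bits 0..7 (al).
    void set_l(std::uint8_t v) { data_ = (data_ & 0xFFFFFF00u) | v; }
};
```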

Optimisation of IIR filter

Quick question related to IIR filter coefficients. Here is a very typical implementation of a direct form II biquad IIR processor that I found online.
// b0, b1, b2, a1, a2 are filter coefficients
// m1, m2 are the memory locations
// dn is the de-denormal coeff (=1.0e-20f)
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
register float w = in[i] - a1*m1 - a2*m2 + dn;
out[i] = b1*m1 + b2*m2 + b0*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
I understand that the "register" is somewhat unnecessary given how smart modern compilers are about this kind of thing. My question is, are there any potential performance benefits to storing the filter coefficients in individual variables rather than using arrays and dereferencing the values? Would the answer to this question depend on the target platform?
i.e.
out[i] = b[1]*m[1] + b[2]*m[2] + b[0]*w;
versus
out[i] = b1*m1 + b2*m2 + b0*w;
It really depends on your compiler and the optimization options. Here is my take:
Any modern compiler would just ignore register. It is just a hint to the compiler and modern ones just don't use it.
Accessing constant indexes in a loop is usually optimized away when compiling with optimization on. In a sense, using variables or an array as you showed makes no difference.
Always, always run benchmarks and look at the generated code for performance critical sections of the code.
EDIT: OK, just out of curiosity I wrote a small program and got "identical" code generated when using full optimization with VS2010. Here is what I get inside the loop for the expression in question (exactly identical for both cases):
0128138D fmul dword ptr [eax+0Ch]
01281390 faddp st(1),st
01281392 fld dword ptr [eax+10h]
01281395 fld dword ptr [w]
01281398 fld st(0)
0128139A fmulp st(2),st
0128139C fxch st(2)
0128139E faddp st(1),st
012813A0 fstp dword ptr [ecx+8]
Notice that I added a few lines to output the results so that I make sure compiler does not just optimize away everything. Here is the code:
#include <iostream>
#include <iterator>
#include <algorithm>
class test1
{
float a1, a2, b0, b1, b2;
float dn;
float m1, m2;
public:
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
float w = in[i] - a1*m1 - a2*m2 + dn;
out[i] = b1*m1 + b2*m2 + b0*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
};
class test2
{
float a[2], b[3];
float dn;
float m1, m2;
public:
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
float w = in[i] - a[0]*m1 - a[1]*m2 + dn;
out[i] = b[0]*m1 + b[1]*m2 + b[2]*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
test1 t1;
test2 t2;
float a[1000];
float b[1000];
t1.processBiquad(a, b, 1000);
t2.processBiquad(a, b, 1000);
std::copy(b, b+1000, std::ostream_iterator<float>(std::cout, " "));
return 0;
}
I am not sure, but this:
out[i] = b[1]*m[1] + b[2]*m[2] + b[0]*w;
might be worse, because it would compile to indirect access, which is slower than direct access.
The only way to actually see, is to check the compiled assembler and profile the code.
You will likely get a benefit if you can declare the coefficients b0, b1, b2 as const. Code will be more efficient if any of your operands are known and fixed at compile time.
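One common way to give the compiler that kind of guarantee, sketched here under assumptions, is to copy the coefficients and state into const or local variables before the loop. Without this, a store to out[i] could in principle modify a float member, so the members may be reloaded on every iteration. The struct name and member layout are hypothetical, and the de-denormal term dn from the question is left out for brevity.

```cpp
#include <cstddef>

// Hypothetical biquad wrapper (direct form II, as in the question).
struct Biquad {
    float b0, b1, b2, a1, a2;  // filter coefficients
    float m1, m2;              // filter memory

    Biquad(float b0_, float b1_, float b2_, float a1_, float a2_)
        : b0(b0_), b1(b1_), b2(b2_), a1(a1_), a2(a2_), m1(0.0f), m2(0.0f) {}

    void process(const float* in, float* out, std::size_t length) {
        // Local const copies cannot be changed by stores through `out`.
        const float cb0 = b0, cb1 = b1, cb2 = b2, ca1 = a1, ca2 = a2;
        float s1 = m1, s2 = m2;
        for (std::size_t i = 0; i < length; ++i) {
            const float w = in[i] - ca1 * s1 - ca2 * s2;
            out[i] = cb1 * s1 + cb2 * s2 + cb0 * w;
            s2 = s1;
            s1 = w;
        }
        m1 = s1;  // write the state back once, after the loop
        m2 = s2;
    }
};
```

As the other answers say, whether this makes a measurable difference depends on the compiler and target, so it should be confirmed by profiling and by reading the generated code.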