Convert uint64_t to byte array portably and optimally in Clang - c++

I want to convert a uint64_t to a uint8_t[8] (little endian). On a little-endian architecture you can just do an ugly reinterpret_cast<> or memcpy(), e.g.:
void from_memcpy(const std::uint64_t &x, uint8_t* bytes) {
std::memcpy(bytes, &x, sizeof(x));
}
This generates efficient assembly:
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
However, it is not portable: it will behave differently on a big-endian machine.
For converting uint8_t[8] to uint64_t there is a great solution - just do this:
void to(const std::uint8_t* bytes, std::uint64_t &x) {
x = (std::uint64_t(bytes[0]) << 8*0) |
(std::uint64_t(bytes[1]) << 8*1) |
(std::uint64_t(bytes[2]) << 8*2) |
(std::uint64_t(bytes[3]) << 8*3) |
(std::uint64_t(bytes[4]) << 8*4) |
(std::uint64_t(bytes[5]) << 8*5) |
(std::uint64_t(bytes[6]) << 8*6) |
(std::uint64_t(bytes[7]) << 8*7);
}
This looks inefficient but actually with Clang -O2 it generates exactly the same assembly as before, and if you compile on a big endian machine it will be smart enough to use a native byte swap instruction. E.g. this code:
void to(const std::uint8_t* bytes, std::uint64_t &x) {
x = (std::uint64_t(bytes[7]) << 8*0) |
(std::uint64_t(bytes[6]) << 8*1) |
(std::uint64_t(bytes[5]) << 8*2) |
(std::uint64_t(bytes[4]) << 8*3) |
(std::uint64_t(bytes[3]) << 8*4) |
(std::uint64_t(bytes[2]) << 8*5) |
(std::uint64_t(bytes[1]) << 8*6) |
(std::uint64_t(bytes[0]) << 8*7);
}
Compiles to:
mov rax, qword ptr [rdi]
bswap rax
mov qword ptr [rsi], rax
ret
My question is: is there an equivalent reliably-optimised construct for converting in the opposite direction? I've tried this, but it gets compiled naively:
void from(const std::uint64_t &x, uint8_t* bytes) {
bytes[0] = x >> 8*0;
bytes[1] = x >> 8*1;
bytes[2] = x >> 8*2;
bytes[3] = x >> 8*3;
bytes[4] = x >> 8*4;
bytes[5] = x >> 8*5;
bytes[6] = x >> 8*6;
bytes[7] = x >> 8*7;
}
Edit: After some experimentation, this code does get compiled optimally with GCC 8.1 and later as long as you use uint8_t* __restrict__ bytes. However I still haven't managed to find a form that Clang will optimise.
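For reference, a sketch of that GCC-friendly variant (the function name is my own; the body is the same as from above, only the pointer parameter gains __restrict__):
void from_restrict(const std::uint64_t &x, std::uint8_t* __restrict__ bytes) {
    // __restrict__ promises the compiler that bytes cannot alias x,
    // which is what lets GCC 8.1+ merge the eight stores into one.
    bytes[0] = std::uint8_t(x >> 8*0);
    bytes[1] = std::uint8_t(x >> 8*1);
    bytes[2] = std::uint8_t(x >> 8*2);
    bytes[3] = std::uint8_t(x >> 8*3);
    bytes[4] = std::uint8_t(x >> 8*4);
    bytes[5] = std::uint8_t(x >> 8*5);
    bytes[6] = std::uint8_t(x >> 8*6);
    bytes[7] = std::uint8_t(x >> 8*7);
}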

Here's what I could test based on the discussion in OP's comments:
void from_optimized(const std::uint64_t &x, std::uint8_t* bytes) {
std::uint64_t big;
std::uint8_t* temp = (std::uint8_t*)&big;
temp[0] = x >> 8*0;
temp[1] = x >> 8*1;
temp[2] = x >> 8*2;
temp[3] = x >> 8*3;
temp[4] = x >> 8*4;
temp[5] = x >> 8*5;
temp[6] = x >> 8*6;
temp[7] = x >> 8*7;
std::uint64_t* dest = (std::uint64_t*)bytes;
*dest = big;
}
It looks like this makes things clearer for the compiler and lets it make the assumptions it needs in order to optimize (both on GCC and Clang with -O2).
Compiling to x86-64 (little endian) on Clang 8.0.0 (test on Godbolt):
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
Compiling to aarch64_be (big endian) on Clang 8.0.0 (test on Godbolt):
ldr x8, [x0]
rev x8, x8
str x8, [x1]
ret

What about returning a value?
Easy to reason about and small assembly:
#include <cstdint>
#include <array>
auto to_bytes(std::uint64_t x)
{
std::array<std::uint8_t, 8> b;
b[0] = x >> 8*0;
b[1] = x >> 8*1;
b[2] = x >> 8*2;
b[3] = x >> 8*3;
b[4] = x >> 8*4;
b[5] = x >> 8*5;
b[6] = x >> 8*6;
b[7] = x >> 8*7;
return b;
}
https://godbolt.org/z/FCroX5
and big endian:
#include <stdint.h>
struct mybytearray
{
uint8_t bytes[8];
};
auto to_bytes(uint64_t x)
{
mybytearray b;
b.bytes[0] = x >> 8*0;
b.bytes[1] = x >> 8*1;
b.bytes[2] = x >> 8*2;
b.bytes[3] = x >> 8*3;
b.bytes[4] = x >> 8*4;
b.bytes[5] = x >> 8*5;
b.bytes[6] = x >> 8*6;
b.bytes[7] = x >> 8*7;
return b;
}
https://godbolt.org/z/WARCqN
(std::array not available for -target aarch64_be? )

First of all, the reason your original from implementation cannot be optimized is that you are passing the arguments by reference and by pointer. So the compiler has to consider the possibility that both of them point to the very same address (or at least that they overlap). As you have 8 consecutive read and write operations to the (potentially) same address, the as-if rule cannot be applied here.
Note that just by removing the & from the function signature, GCC apparently already considers this proof that bytes does not point into x, and thus the function can safely be optimized. However, for Clang this is not good enough.
Technically, of course, bytes could point into from's stack memory (i.e., at x itself), but I think that would be undefined behavior, so Clang simply misses this optimization.
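For illustration, the by-value variant under discussion would look like this (a sketch; the name is my own, the body is unchanged from the original from):
// x is taken by value, so it cannot alias the output buffer; GCC merges the
// eight stores into one, while Clang (as described above) still emits them
// one by one.
void from_by_value(std::uint64_t x, std::uint8_t* bytes) {
    bytes[0] = std::uint8_t(x >> 8*0);
    bytes[1] = std::uint8_t(x >> 8*1);
    bytes[2] = std::uint8_t(x >> 8*2);
    bytes[3] = std::uint8_t(x >> 8*3);
    bytes[4] = std::uint8_t(x >> 8*4);
    bytes[5] = std::uint8_t(x >> 8*5);
    bytes[6] = std::uint8_t(x >> 8*6);
    bytes[7] = std::uint8_t(x >> 8*7);
}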
Your implementation of to doesn't suffer from this issue because you have implemented it in such a way that first you read all the values of bytes and then you make one big assignment to x. So even if x and bytes point to the same address, as you do all the reading first and all the writing afterwards (instead of mixing reads and writes as you do in from), this can be optimized.
Flávio Toribio's answer works because it does precisely this: It reads all the values first and only then writes to the destination.
However, there are less complicated ways to achieve this:
void from(uint64_t x, uint8_t* dest) {
uint8_t bytes[8];
bytes[7] = uint8_t(x >> 8*7);
bytes[6] = uint8_t(x >> 8*6);
bytes[5] = uint8_t(x >> 8*5);
bytes[4] = uint8_t(x >> 8*4);
bytes[3] = uint8_t(x >> 8*3);
bytes[2] = uint8_t(x >> 8*2);
bytes[1] = uint8_t(x >> 8*1);
bytes[0] = uint8_t(x >> 8*0);
*(uint64_t*)dest = *(uint64_t*)bytes;
}
gets compiled to
mov qword ptr [rsi], rdi
ret
on little endian and to
rev x8, x0
str x8, [x1]
ret
on big endian.
Note that even if you passed x by reference, Clang would still be able to optimize this. However, that would result in one more instruction in each case:
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
and
ldr x8, [x0]
rev x8, x8
str x8, [x1]
ret
respectively.
Also note that you can improve your implementation of to with a similar trick: instead of passing the result by non-const reference, take the "more natural" approach and just return it from the function:
uint64_t to(const uint8_t* bytes) {
return
(uint64_t(bytes[7]) << 8*7) |
(uint64_t(bytes[6]) << 8*6) |
(uint64_t(bytes[5]) << 8*5) |
(uint64_t(bytes[4]) << 8*4) |
(uint64_t(bytes[3]) << 8*3) |
(uint64_t(bytes[2]) << 8*2) |
(uint64_t(bytes[1]) << 8*1) |
(uint64_t(bytes[0]) << 8*0);
}
Summary:
Don't pass arguments by reference.
Do all the reading first, then all the writing.
Here are the best solutions I could get to for both little endian and big endian. Note how to and from are truly inverse operations that can be optimized to a no-op if executed one after the other.

The code you've given is way overcomplicated. You can replace it with:
void from(uint64_t x, uint8_t* dest) {
x = htole64(x);
std::memcpy(dest, &x, sizeof(x));
}
Yes, this uses the Linux-ism htole64(), but if you're on another platform you can easily reimplement that.
Clang and GCC optimize this perfectly, on both little- and big-endian platforms.
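For platforms that lack htole64(), a minimal reimplementation sketch could look like this (my own example; it assumes C++20's std::endian for the endianness check, and the name my_htole64 is made up):
#include <bit>
#include <cstdint>

// No-op on a little-endian host, a plain byte swap otherwise.
constexpr std::uint64_t my_htole64(std::uint64_t x) {
    if constexpr (std::endian::native == std::endian::little) {
        return x;
    } else {
        return ((x & 0x00000000000000FFull) << 56) |
               ((x & 0x000000000000FF00ull) << 40) |
               ((x & 0x0000000000FF0000ull) << 24) |
               ((x & 0x00000000FF000000ull) <<  8) |
               ((x & 0x000000FF00000000ull) >>  8) |
               ((x & 0x0000FF0000000000ull) >> 24) |
               ((x & 0x00FF000000000000ull) >> 40) |
               ((x & 0xFF00000000000000ull) >> 56);
    }
}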

Related

C++: get int from any place of vector<byte>

I have a big enough std::vector<byte> source and I need to get four bytes from an arbitrary offset in the vector (for example, bytes 10-13) and convert them to an integer.
int ByteVector2Int(std::vector<byte> &source, int offset)
{
    return (source[offset] | source[offset + 1] << 8 | source[offset + 2] << 16 | source[offset + 3] << 24);
}
This method is called very often; how can I do this with maximum performance?
Use memcpy. You might be tempted to use reinterpret_cast, but then you can easily end up with undefined behavior (for instance due to alignment issues). Also, pass a vector by a const reference:
int f(const std::vector<std::byte>& v, size_t n)
{
int temp;
memcpy(&temp, v.data() + n, sizeof(int));
return temp;
}
Note that compilers are very good at optimizing. In my case, GCC with -O2 resulted in:
mov rax, qword ptr [rdi]
mov eax, dword ptr [rax + rsi]
ret
So, there is no memcpy invoked and the assembly is minimal. Live demo: https://godbolt.org/z/oWGqej
UPDATE (based on question update)
After the edit, you may also notice that the generated assembly is (in my case) exactly the same as for your approach:
int f2(const std::vector<std::byte>& v, size_t n)
{
return (int)(
(unsigned int)v[n]
+ ((unsigned int)v[n + 1] << 8)
+ ((unsigned int)v[n + 2] << 16)
+ ((unsigned int)v[n + 3] << 24) );
}
Live demo: https://godbolt.org/z/c9dE9W
Note that your code is not correct. First, the bitwise shifts are performed on std::byte values, so shifting left by 8 or more simply discards the bits, and second, there is no implicit conversion from std::byte to int.
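For reference, a sketch of how the shift-and-or version could be written so it actually compiles with std::byte (my own example): each byte is widened with std::to_integer before shifting, and the result is converted to int explicitly.
#include <cstddef>
#include <vector>

int bytes_to_int(const std::vector<std::byte>& v, std::size_t offset)
{
    // widen each std::byte to unsigned int before shifting, then assemble
    return static_cast<int>(
        std::to_integer<unsigned int>(v[offset])              |
        (std::to_integer<unsigned int>(v[offset + 1]) <<  8)  |
        (std::to_integer<unsigned int>(v[offset + 2]) << 16)  |
        (std::to_integer<unsigned int>(v[offset + 3]) << 24));
}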

Why are these 8 byte-writes not optimized into a MOV?

My colleague and I have been unable to explain why GCC, ICC and Clang do not optimize this function
void f(std::uint64_t a, void * p) {
std::uint8_t *x = reinterpret_cast<std::uint8_t *>(p);
x[7] = a >> 56;
x[6] = a >> 48;
x[5] = a >> 40;
x[4] = a >> 32;
x[3] = a >> 24;
x[2] = a >> 16;
x[1] = a >> 8;
x[0] = a;
}
Into this
mov QWORD PTR [rsi], rdi
If we formulate f in terms of memcpy, it emits just that mov. Why does it not happen if we do the seemingly trivial sequence of byte writes?
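For comparison, the memcpy formulation mentioned here would be a sketch along these lines (the function name is my own):
#include <cstdint>
#include <cstring>

// With a passed by value in a register, this becomes the single 8-byte store.
void f_memcpy(std::uint64_t a, void *p) {
    std::memcpy(p, &a, sizeof a);
}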
I'm not an expert, but gcc only gained the ability to merge adjacent stores for immediate constants in gcc 7:
Closed bug for immediate constants: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=23684
Open bug for assignment of small structs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78821
Store-merging pass code: https://github.com/gcc-mirror/gcc/blob/master/gcc/gimple-ssa-store-merging.c
If I had to guess, judging by the second bug, it might not be too long a wait.

Why AVX dot product slower than native C++ code

I have the following AVX and Native codes:
__forceinline double dotProduct_2(const double* u, const double* v)
{
_mm256_zeroupper();
__m256d xy = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m256d temp = _mm256_hadd_pd(xy, xy);
__m128d dotproduct = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1));
return dotproduct.m128d_f64[0];
}
__forceinline double dotProduct_1(const D3& a, const D3& b)
{
return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
And respective test scripts:
std::cout << res_1 << " " << res_2 << " " << res_3 << '\n';
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_1 += dotProduct_1(aVx[i % 10000], aVx[(i + 1) % 10000]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "NAIVE : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_2 += dotProduct_2(&aVx[i % 10000][0], &aVx[(i + 1) % 10000][0]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "AVX : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
std::cout << math::min2(zx_1, zx_2) << " " << zx_1 << " " << zx_2;
Well, all of the data are aligned by 32. (D3 with __declspec... and aVx arr with _mm_malloc()..)
And, as I can see, the native variant is equal to or faster than the AVX variant. I can't understand whether this is normal behaviour, because I thought AVX was 'super fast' ... If not, how can I optimize it? I compile it with MSVC 2015 (x64), with /arch:AVX. Also, my hardware is an Intel i7-4750HQ (Haswell).
Simple profiling with basic loops isn't a great idea - it usually just means you are memory bandwidth limited, so the tests end up coming out at about the same speed (memory is typically slower than the CPU, and that's basically all you are testing here).
As others have said, your code example isn't great, because you are constantly going across the lanes (which I assume is just to find the fastest dot product, and not specifically because a sum of all the dot products is the desired result?). To be honest, if you really need a fast dot product (for AOS data as presented here), I think I would prefer to replace the VHADDPD with a VADDPD + VPERMILPD (trading an additional instruction for twice the throughput, and a lower latency)
double dotProduct_3(const double* u, const double* v)
{
__m256d dp = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m128d a = _mm256_extractf128_pd(dp, 0);
__m128d b = _mm256_extractf128_pd(dp, 1);
__m128d c = _mm_add_pd(a, b);
__m128d yy = _mm_unpackhi_pd(c, c);
__m128d dotproduct = _mm_add_pd(c, yy);
return _mm_cvtsd_f64(dotproduct);
}
asm:
dotProduct_3(double const*, double const*):
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vextractf128 xmm1,ymm0,0x1
vaddpd xmm0,xmm1,xmm0
vpermilpd xmm1,xmm0,0x3
vaddpd xmm0,xmm1,xmm0
vzeroupper
ret
Generally speaking, if you are using horizontal adds, you're doing it wrong! Whilst a 256bit register may seem ideal for a Vector4d, it's not actually a particularly great representation (especially if you consider that AVX512 is now available!). A very similar question to this came up recently: For C++ Vector3 utility class implementations, is array faster than struct and class?
If you want performance, then structure-of-arrays is the best way to go.
struct HybridVec4SOA
{
__m256d x;
__m256d y;
__m256d z;
__m256d w;
};
__m256d dot(const HybridVec4SOA& a, const HybridVec4SOA& b)
{
return _mm256_fmadd_pd(a.w, b.w,
_mm256_fmadd_pd(a.z, b.z,
_mm256_fmadd_pd(a.y, b.y,
_mm256_mul_pd(a.x, b.x))));
}
asm:
dot(HybridVec4SOA const&, HybridVec4SOA const&):
vmovapd ymm1,YMMWORD PTR [rdi+0x20]
vmovapd ymm2,YMMWORD PTR [rdi+0x40]
vmovapd ymm3,YMMWORD PTR [rdi+0x60]
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vfmadd231pd ymm0,ymm1,YMMWORD PTR [rsi+0x20]
vfmadd231pd ymm0,ymm2,YMMWORD PTR [rsi+0x40]
vfmadd231pd ymm0,ymm3,YMMWORD PTR [rsi+0x60]
ret
If you compare the latencies (and more importantly throughput) of load/mul/fmadd compared to hadd and extract, and then consider that the SOA version is computing 4 dot products at a time (instead of 1), you'll start to understand why it's the way to go...
You add too much overhead with the vzeroupper and hadd instructions. A good way to write it is to do all the multiplies in a loop and aggregate the result just once at the end. Imagine you unroll the original loop 4 times and use 4 accumulators:
for(i=0; i < (1<<30); i+=4) {
s0 += a[i+0] * b[i+0];
s1 += a[i+1] * b[i+1];
s2 += a[i+2] * b[i+2];
s3 += a[i+3] * b[i+3];
}
return s0+s1+s2+s3;
And now just replace the unrolled loop with SIMD multiplies and adds (or even an FMA intrinsic if available).
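A sketch of that idea with AVX intrinsics (my own illustration, not the original code): a single __m256d accumulator holds four running sums, the loop does nothing but multiply-accumulate, and the horizontal reduction happens exactly once after the loop. It assumes n is a multiple of 4, 32-byte-aligned inputs, and FMA support.
#include <cstddef>
#include <immintrin.h>

double dot(const double* a, const double* b, std::size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_load_pd(a + i), _mm256_load_pd(b + i), acc);
    // reduce the four partial sums once, outside the hot loop
    __m128d lo  = _mm256_castpd256_pd128(acc);
    __m128d hi  = _mm256_extractf128_pd(acc, 1);
    __m128d sum = _mm_add_pd(lo, hi);
    sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
    return _mm_cvtsd_f64(sum);
}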

SSE2 shift by vector

I've been trying to implement shift-by-vector with SSE2 intrinsics, but from experimentation and the Intel intrinsics guide, it appears to use only the least-significant part of the count vector.
To reword my question, given a vector {v1, v2, ..., vn} and a set of shifts {s1, s2, ..., sn}, how do I calculate a result {r1, r2, ..., rn} such that:
r1 = v1 << s1
r2 = v2 << s2
...
rn = vn << sn
since it appears that _mm_sll_epi* performs this:
r1 = v1 << s1
r2 = v2 << s1
...
rn = vn << s1
Thanks in advance.
EDIT:
Here's the code I have:
#include <iostream>
#include <cstdint>
#include <mmintrin.h>
#include <emmintrin.h>
namespace SIMD {
using namespace std;
class SSE2 {
public:
// flipped operands due to function arguments
SSE2(uint64_t a, uint64_t b, uint64_t c, uint64_t d) { low = _mm_set_epi64x(b, a); high = _mm_set_epi64x(d, c); }
uint64_t& operator[](int idx)
{
switch (idx) {
case 0:
_mm_storel_epi64((__m128i*)result, low);
return result[0];
case 1:
_mm_store_si128((__m128i*)result, low);
return result[1];
case 2:
_mm_storel_epi64((__m128i*)result, high);
return result[0];
case 3:
_mm_store_si128((__m128i*)result, high);
return result[1];
}
/* Unreachable for a valid idx; return a real lvalue so this compiles */
return result[0];
}
SSE2& operator<<=(const SSE2& rhs)
{
low = _mm_sll_epi64(low, rhs.getlow());
high = _mm_sll_epi64(high, rhs.gethigh());
return *this;
}
void print()
{
uint64_t a[2];
_mm_store_si128((__m128i*)a, low);
cout << hex;
cout << a[0] << ' ' << a[1] << ' ';
_mm_store_si128((__m128i*)a, high);
cout << a[0] << ' ' << a[1] << ' ';
cout << dec;
}
__m128i getlow() const
{
return low;
}
__m128i gethigh() const
{
return high;
}
private:
__m128i low, high;
uint64_t result[2];
};
}
int main()
{
cout << "operator<<= test: vector << vector: ";
{
auto x = SIMD::SSE2(7, 8, 15, 10);
auto y = SIMD::SSE2(4, 5, 6, 7);
x.print();
y.print();
x <<= y;
if (x[0] != 112 || x[1] != 256 || x[2] != 960 || x[3] != 1280) {
cout << "FAILED: ";
x.print();
cout << endl;
} else {
cout << "PASSED" << endl;
}
}
return 0;
}
What should happen would give results of {7 << 4 = 112, 8 << 5 = 256, 15 << 6 = 960, 10 << 7 = 1280}. The actual results seem to be {7 << 4 = 112, 8 << 4 = 128, 15 << 6 = 960, 10 << 6 = 640}, which isn't what I want.
Hope this helps, Jens.
If AVX2 is available, and your elements are 32 or 64 bits, your operation takes one variable-shift instruction: vpsrlvq (__m128i _mm_srlv_epi64(__m128i a, __m128i count)).
For 32bit elements with SSE4.1, see Shifting 4 integers right by different values SIMD. Depending on latency vs. throughput requirements, you can do separate shifts and then blend, or use a multiply (by a specially-constructed vector of powers of 2) to get variable-count left shifts and then do a same-count-for-all-elements right shift.
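For illustration, a sketch of that multiply trick for 32-bit left shifts under SSE4.1 (my own example, not taken from the linked answer): the per-element multiplier 2^count is built by writing the biased count into the exponent field of a float, which is valid for shift counts 0..30.
#include <smmintrin.h>  // SSE4.1 for _mm_mullo_epi32

static __m128i sse41_sllv_epi32(__m128i v, __m128i count)
{
    __m128i exp  = _mm_add_epi32(_mm_set1_epi32(127), count); // biased exponents
    __m128  pow2 = _mm_castsi128_ps(_mm_slli_epi32(exp, 23)); // floats equal to 2^count
    __m128i mult = _mm_cvttps_epi32(pow2);                    // exact integer 2^count
    return _mm_mullo_epi32(v, mult);                          // v << count, per element
}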
For your case, 64bit elements with runtime-variable shift counts:
There are only two elements per SSE vector, so we just need two shifts and then combine the results (which we can do with a pblendw, or with a floating-point movsd (which may cause extra bypass-delay latency on some CPUs), or we can use two shuffles, or we can do two ANDs and an OR).
__m128i SSE2_emulated_srlv_epi64(__m128i a, __m128i count)
{
__m128i shift_low = _mm_srl_epi64(a, count); // high 64 is garbage
__m128i count_high = _mm_unpackhi_epi64(count,count); // broadcast the high element
__m128i shift_high = _mm_srl_epi64(a, count_high); // low 64 is garbage
// SSE4.1:
// return _mm_blend_epi16(shift_low, shift_high, 0x0F);
#if 1 // use movsd to blend
__m128d blended = _mm_move_sd( _mm_castsi128_pd(shift_high), _mm_castsi128_pd(shift_low) ); // use movsd as a blend. Faster than multiple instructions on most CPUs, but probably bad on Nehalem.
return _mm_castpd_si128(blended);
#else // SSE2 without using FP instructions:
// if we're going to do it this way, we could have shuffled the input before shifting. Probably not helpful though.
shift_high = _mm_unpackhi_epi64(shift_high, shift_high); // broadcast the high64
return _mm_unpacklo_epi64(shift_high, shift_low); // combine
#endif
}
Other shuffles like pshufd or psrldq would work, but punpckhqdq gets the job done without needing an immediate byte, so it's one byte shorter. SSSE3 palignr could get the high element from one register and the low element from another register into one vector, but they'd be reversed (so we'd need a pshufd to swap high and low halves). shufpd would work to blend, but has no advantage over movsd.
See Agner Fog's microarch guide for the details of the potential bypass-delay latency from using an FP instruction between two integer instructions. It's probably fine on Intel SnB-family CPUs, because other FP shuffles are. (And yes, movsd xmm1, xmm0 runs on the shuffle unit in port5. Use movaps or movapd for reg-reg moves even of scalars if you don't need the merging behaviour).
This compiles (on Godbolt with gcc5.3 -O3) to
movdqa xmm2, xmm0 # tmp97, a
psrlq xmm2, xmm1 # tmp97, count
punpckhqdq xmm1, xmm1 # tmp99, count
psrlq xmm0, xmm1 # tmp100, tmp99
movsd xmm0, xmm2 # tmp102, tmp97
ret

how to tie variables in c++

Well my problem is as follows:
I'm trying to translate some x86 assembly source code to C++ source code.
Explanation of what registers are (skip this if you know what they are and how they work):
As you may or may not know, assembly language makes use of "general purpose registers".
In x86 assembly these registers are 4 bytes in length (like an int variable in C++); their names are eax, ebx, ecx and edx.
Now, each of these registers is broken down into ax, bx, cx and dx respectively, which represent the 2 least significant bytes of each register.
ax, bx, cx and dx are in turn broken down into ah, bh, ch and dh (most significant byte) and al, bl, cl and dl (least significant byte).
So, for example:
If I set eax:
EAX = 0xAB12CDEF
that would automatically change ax, al and ah
AX would become 0xCDEF
AH would become 0xCD
AL would become 0xEF
My question is: How do I make that possible in C++ ?
int eax, ax, ah, al;
eax = 0xAB12CDEF
How can I make ax, ah and al change at the same time?
Or is it possible to make them pointers to different portions of eax? If so, how?
Thanks!
P.S. Also, how could I make another variable be a char?
How could I make a new variable "char chAL" refer to al, which in turn refers to part of eax?
So that when I make a change to chAL, the change would automatically propagate to eax, ah and al.
If your goal is to emulate X86 assembly code, then indeed you need to support the behaviour of X86 registers.
Here's a simple implementation using a union:
#include <iostream>
#include <cstdint>
using namespace std;
union reg_t {
uint64_t rx;
uint32_t ex;
uint16_t x;
struct {
uint8_t l;
uint8_t h;
};
};
int main(){
reg_t a;
a.rx = 0xdeadbeefcafebabe;
cout << "rax = " << hex << a.rx << endl;
cout << "eax = " << hex << a.ex << endl;
cout << "ax = " << hex << a.x << endl;
cout << "al = " << hex << (uint16_t)a.l << endl;
cout << "ah = " << hex << (uint16_t)a.h << endl;
cout << "ax & 0xFF = " << hex << (a.x & 0xFF) << endl;
cout << "(ah << 8) + al = " << hex << (a.h << 8) + a.l << endl;
}
output:
rax = deadbeefcafebabe
eax = cafebabe
ax = babe
al = be
ah = ba
ax & 0xFF = be
(ah << 8) + al = babe
You'll get the correct result on the right platform (little-endian). You'll have to swap bytes and/or add padding for other platforms.
That's the basic, down to earth solution, which will certainly work on many x86 platforms (at least X86/linux/g++ works fine), but the behaviour this very approach relies on seems undefined in C++.
Here's another approach using a byte array to store register content:
class x86register {
uint8_t bytes[8];
public:
x86register &operator =(const uint64_t &v){
for (int i = 0; i < 8; i++)
bytes[i] = (v >> (i * 8)) & 0xff;
return *this;
}
x86register &operator =(const uint32_t &v){
for (int i = 0; i < 4; i++)
bytes[i] = (v >> (i * 8)) & 0xff;
return *this;
}
x86register &operator =(const uint16_t &v){
for (int i = 0; i < 2; i++)
bytes[i] = (v >> (i * 8)) & 0xff;
return *this;
}
x86register &operator =(const uint8_t &v){
bytes[0] = v;
return *this;
}
operator uint64_t(){
uint64_t res = 0;
for (int i = 7; i >= 0; i--)
res = (res << 8) + bytes[i];
return res;
}
operator uint32_t(){
uint32_t res = 0;
for (int i = 3; i >= 0; i--)
res = (res << 8) + bytes[i];
return res;
}
operator uint16_t(){
uint16_t res = 0;
for (int i = 1; i >= 0; i--)
res = (res << 8) + bytes[i];
return res;
}
operator uint8_t(){
return bytes[0];
}
};
This simple class should work regardless of the endianness of the running platform. Also, you probably want to add a few other accessors/mutators to handle the high byte (AH, BH, etc.) of the word registers.
You can extract parts of eax using bitwise operations, like this:
#include <cstdio>

int main()
{
    int eax, ax, ah, al;
    eax = 0xAB12CDEF;
    ax = eax & 0x0000FFFF;
    ah = (eax & 0x0000FF00) >> 8;
    al = eax & 0x000000FF;
    printf("ax = eax & 0x0000FFFF = 0x%X\n", ax);
    printf("ah = (eax & 0x0000FF00) >> 8 = 0x%X\n", ah);
    printf("al = eax & 0x000000FF = 0x%X\n", al);
}
Output
ax = eax & 0x0000FFFF = 0xCDEF
ah = (eax & 0x0000FF00) >> 8 = 0xCD
al = eax & 0x000000FF = 0xEF
You could also define macros like that:
#define AX(dw) ((dw) & 0x0000FFFF)
#define AH(dw) (((dw) & 0x0000FF00) >> 8)
#define AL(dw) ((dw) & 0x000000FF)

int main()
{
    int eax = 0xAB12CDEF;
    cout << "ax = " << hex << AX(eax) << endl; // prints ax = cdef
}
If you want it to work as simply as you've put the example ints, you can get away with it through reinterpret casts, though this violates pointer aliasing rules, so the behavior is undefined.
std::uint32_t eax = 0xAB12CDEF;
std::uint16_t& ax = reinterpret_cast<std::uint16_t*>(&eax)[1];
std::uint8_t& ah = reinterpret_cast<std::uint8_t&>(ax);
std::uint8_t& al = (&ah)[1];
The second line casts the address of eax to a std::uint16_t*; applying [1] to that gives you the second half of the 32 bits.
The third line is just a cast to uint8_t, which works because ah will be the same as the front of ax.
Indexing into the address of ah by 1 gives the following byte, which is al.
What you're trying to do seems pretty unsafe and strange, though. So to get the most similar behavior in the sanest way, you could just use a custom type. The results of the code below will be consistent from machine to machine, whereas the results above won't be, because of different endian schemes.
class Reg {
private:
std::uint32_t data_;
public:
Reg(std::uint32_t in) : data_{in} { }
std::uint32_t ex() const {
return data_;
}
std::uint16_t x() const {
return static_cast<std::uint16_t>(data_ & 0xFFFF);
}
std::uint8_t h() const {
return static_cast<std::uint8_t>((data_ & 0xFF00) >> 8);
}
std::uint8_t l() const {
return static_cast<std::uint8_t>(data_ & 0xFF);
}
};