How to convert between uint64_t and poly64_t on ARM? - c++

I'd like to perform polynomial multiplication of two uint64_t values (where the least significant bit, the one obtained by w&1, is the least significant coefficient a0 in w(x) = Σ_i a_i·x^i) on ARM and get the least significant 64 coefficients (a0...a63) of the result as a uint64_t (so (result>>i)&1 is a_i).
It's not clear to me, however, what is the standard-compliant way to convert uint64_t to poly64_t and (least significant part of) poly128_t to uint64_t.
poly8_t, poly16_t, poly64_t and poly128_t are defined as unsigned integer types. It is unspecified whether these are the same type as uint8_t, uint16_t, uint64_t and uint128_t for overloading and mangling purposes.
ACLE does not define whether int64x1_t is the same type as int64_t, or whether uint64x1_t is the same type as uint64_t, or whether poly64x1_t is the same as poly64_t for example for C++ overloading purposes.
source: https://developer.arm.com/documentation/101028/0009/Advanced-SIMD--Neon--intrinsics
The above quotes open some scary possibilities in my head, like perhaps the bit order is flipped, or there's some padding, or, who knows, maybe these are some structs.
So far I've come up with these two:
poly64_t uint64_t_to_poly64_t(uint64_t x) {
    return vget_lane_p64(vcreate_p64(x), 0);
}
uint64_t less_sinificant_half_of_poly128_t_to_uint64_t(poly128_t big) {
    return vgetq_lane_u64(vreinterpretq_u64_p128(big), 0);
}
But they seem cumbersome (as they go through some intermediary stuff like poly64x1_t), and still make some assumptions (like that poly128_t can be treated as a vector of two uint64_t, that the 0-th uint64_t will contain the "less significant coefficients", and that the least significant polynomial coefficient will be at the uint64_t's least significant bit).
OTOH it seems that I can simply "ignore" the whole issue, and just pretend that integers are polynomials as the two functions produce the same assembly:
__attribute__((target("+crypto")))
uint64_t polynomial_mul_low(uint64_t v, uint64_t w) {
    const poly128_t big = vmull_p64(uint64_t_to_poly64_t(v),
                                    uint64_t_to_poly64_t(w));
    return less_sinificant_half_of_poly128_t_to_uint64_t(big);
}
__attribute__((target("+crypto")))
uint64_t polynomial_mul_low_naive(uint64_t v, uint64_t w) {
    return vmull_p64(v, w);
}
that is:
fmov d0, x0
fmov d1, x1
pmull v0.1q, v0.1d, v1.1d
fmov x0, d0
ret
Also, the assembly for uint64_t_to_poly64_t and less_sinificant_half_of_poly128_t_to_uint64_t seems to be a no-op, which supports the hypothesis that there are no steps involved in the conversion, really.
(See above in action: https://godbolt.org/z/o6bYsn4E4)
Also:
__attribute__((target("+crypto")))
uint64_t polynomial_mul_low_naive(uint64_t v,uint64_t w) {
return (uint64_t)vmull_p64(poly64_t{v},poly64_t{w});
}
seems to compile, and while the {..}s give me the soothing confidence that no narrowing occurred, I'm still unsure whether the order of the bits and the order of the coefficients are guaranteed to be consistent, and thus I have some worries about the final (uint64_t) cast.
I want my code to be correct w.r.t. the standards, as opposed to just working by accident, as it has to be written once and run on many ARM64 platforms, hence my question:
How does one perform a proper conversion between polyXXX_t and uintXXX_t, and how does one extract "lower half of coefficients" from polyXXX_t?

The ARM-NEON intrinsic set provides many types, but fundamentally they just map to the same set of registers. The types are there to help you, the programmer, organize your code and the hardware really doesn't care.
Many implementations of the ARM-NEON intrinsics just map all those types to the same internal type, so type-safety is largely lost in those cases: Visual C++ and clang/LLVM are both fairly "loose" with respect to ARM-NEON type-safety.
GCC is the one compiler I've used that generates these type warnings, although you can use -flax-vector-conversions to relax them.
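For illustration (a hedged sketch, not from the original answer): GCC typically rejects an implicit conversion between distinct NEON vector types such as poly8x8_t and uint8x8_t unless -flax-vector-conversions is passed, while the explicit vreinterpret spelling is accepted everywhere:
#include <arm_neon.h>

uint8x8_t poly_bits_as_u8(poly8x8_t p) {
    // return p;                    // GCC: usually an error without -flax-vector-conversions
    return vreinterpret_u8_p8(p);   // portable, explicit reinterpretation
}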
The ARM-NEON intrinsic set defines a number of vreinterpret_X_Y and vreinterpretq_X_Y intrinsics. These are for doing the 'type-casts' between the various types when you need to force them for the particular mix of instructions you are using.
// Convert poly to unsigned int (the reverse is also defined)
vreinterpret_u8_p8
vreinterpret_u8_p16
vreinterpret_u16_p8
vreinterpret_u16_p16
vreinterpret_u32_p8
vreinterpret_u32_p16
vreinterpret_u64_p8
vreinterpret_u64_p16
vreinterpretq_u8_p8
vreinterpretq_u8_p16
vreinterpretq_u16_p8
vreinterpretq_u16_p16
vreinterpretq_u32_p8
vreinterpretq_u32_p16
vreinterpretq_u64_p8
vreinterpretq_u64_p16
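Applied to the original question, the same pattern looks roughly like the sketch below (hedged; it assumes vcreate_p64, vget_lane_p64 and vreinterpretq_u64_p128 are available, as in the question's own code):
#include <arm_neon.h>

__attribute__((target("+crypto")))
uint64_t pmull_low(uint64_t v, uint64_t w) {
    // Build poly64 scalars from the integer bit patterns...
    poly64_t pv = vget_lane_p64(vcreate_p64(v), 0);
    poly64_t pw = vget_lane_p64(vcreate_p64(w), 0);
    // ...multiply, then reinterpret the 128-bit product as two uint64 lanes.
    poly128_t prod = vmull_p64(pv, pw);
    return vgetq_lane_u64(vreinterpretq_u64_p128(prod), 0); // low 64 coefficients
}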

My proposal is not to use the poly128_t or poly64_t types at all, as they lead to very bad code generation that mixes NEON and GPR registers. For example:
poly128_t mul_lo_p64(poly128_t a, poly128_t b) {
    return vmull_p64(a, b);
}
fmov d0, x0
fmov d1, x2
pmull v0.1q, v0.1d, v1.1d
mov x1, v0.d[1]
fmov x0, d0
ret
The same is seen in more complex scenarios too.
To fix this, one should stay completely in the NEON register domain, which needs just two primitives, namely:
inline poly64x2_t mul_lo_p64(poly64x2_t a, poly64x2_t b) {
    poly64x2_t res;
    asm("pmull %0.1q, %1.1d, %2.1d" : "=w"(res) : "w"(a), "w"(b));
    return res;
}
inline poly64x2_t mul_hi_p64(poly64x2_t a, poly64x2_t b) {
    poly64x2_t res;
    asm("pmull2 %0.1q, %1.2d, %2.2d" : "=w"(res) : "w"(a), "w"(b));
    return res;
}
Then, e.g., the two other often-used intrinsics poly64x2_t vaddq_p64(poly64x2_t a, poly64x2_t b); and vextq_p64(poly64x2_t, poly64x2_t, 1); work as expected; a short sketch combining these primitives follows.
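A hedged sketch of how the primitives combine while staying in NEON registers (it assumes vdupq_n_p64, vaddq_p64 and vgetq_lane_p64 from arm_neon.h, which recent GCC/Clang provide when the crypto/AES extension is enabled; check your toolchain):
#include <arm_neon.h>

// (a*b) XOR (c*d) over GF(2)[x], truncated to the low 64 coefficients.
// Both pmulls and the polynomial add happen without touching GPRs in between.
__attribute__((target("+crypto")))
uint64_t mul2_xor_low(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
    poly64x2_t va = vdupq_n_p64((poly64_t)a);
    poly64x2_t vb = vdupq_n_p64((poly64_t)b);
    poly64x2_t vc = vdupq_n_p64((poly64_t)c);
    poly64x2_t vd = vdupq_n_p64((poly64_t)d);
    poly64x2_t acc = vaddq_p64(mul_lo_p64(va, vb), mul_lo_p64(vc, vd));
    return (uint64_t)vgetq_lane_p64(acc, 0);
}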

Related

Why can't GCC generate an optimal operator== for a struct of two int32s?

A colleague showed me code that I thought wouldn't be necessary, but sure enough, it was. I would expect most compilers would see all three of these attempts at equality tests as equivalent:
#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

[[nodiscard]]
bool naiveEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

[[nodiscard]]
bool optimizedEqual(const Point &a, const Point &b) {
    // Why can't the compiler produce the same assembly in naiveEqual as it does here?
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai));
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;
}

[[nodiscard]]
bool optimizedEqual2(const Point &a, const Point &b) {
    return std::memcmp(&a, &b, sizeof(a)) == 0;
}

[[nodiscard]]
bool naiveEqual1(const Point &a, const Point &b) {
    // Let's try avoiding any jumps by using bitwise and:
    return (a.x == b.x) & (a.y == b.y);
}
But to my surprise, only the ones with memcpy or memcmp get turned into a single 64-bit compare by GCC. Why? (https://godbolt.org/z/aP1ocs)
Isn't it obvious to the optimizer that checking equality on two contiguous four-byte pairs is the same as comparing all eight bytes?
An attempt to avoid separately booleanizing the two parts compiles somewhat more efficiently (one fewer instruction and no false dependency on EDX), but still two separate 32-bit operations.
bool bithackEqual(const Point &a, const Point &b) {
    // a^b == 0 only if they're equal
    return ((a.x ^ b.x) | (a.y ^ b.y)) == 0;
}
GCC and Clang both have the same missed optimizations when passing the structs by value (so a is in RDI and b is in RSI because that's how x86-64 System V's calling convention packs structs into registers): https://godbolt.org/z/v88a6s. The memcpy / memcmp versions both compile to cmp rdi, rsi / sete al, but the others do separate 32-bit operations.
struct alignas(uint64_t) Point surprisingly still helps in the by-value case where arguments are in registers, optimizing both naiveEqual versions for GCC, but not the bithack XOR/OR. (https://godbolt.org/z/ofGa1f). Does this give us any hints about GCC's internals? Clang isn't helped by alignment.
If you "fix" the alignment, all give the same assembly language output (with GCC):
struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};
Demo
As a note, one of the correct/legal ways to do certain things (such as type punning) is to use memcpy, so having a specific optimization (or being more aggressive) when that function is used seems logical.
There's a performance cliff you risk falling off of when implementing this as a single 64-bit comparison:
You break store to load forwarding.
If the 32-bit numbers in the structs are written to memory by separate store instructions, and then quickly loaded back with 64-bit load instructions (before the stores hit L1$), your execution will stall until the stores commit to globally visible, cache-coherent L1$. If the loads are 32-bit loads that match the previous 32-bit stores, modern CPUs avoid the store-load stall by forwarding the stored value to the load instruction before the store reaches cache.
This violates sequential consistency if multiple CPUs access the memory (a CPU sees its own stores in a different order than other CPUs do), but is allowed by most modern CPU architectures, even x86. The forwarding also allows much more code to be executed completely speculatively, because if the execution has to be rolled back, no other CPU can have seen the store that the speculatively executed code on this CPU consumed.
If you want this to use 64-bit operations and you don't want this perf cliff, you may want to ensure the struct is also always written as a single 64-bit number.
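A hedged sketch of that suggestion (assuming Point keeps its 8-byte, padding-free layout from the question): route all writes through a helper that stores the whole struct as one 64-bit value, so a later 64-bit load can be store-forwarded.
#include <cstdint>
#include <cstring>

struct Point { std::int32_t x, y; };

void storePoint(Point &dst, std::int32_t x, std::int32_t y) {
    Point tmp{x, y};
    std::uint64_t packed;
    static_assert(sizeof tmp == sizeof packed, "Point must be exactly 8 bytes");
    std::memcpy(&packed, &tmp, sizeof packed); // pack both fields into one word
    std::memcpy(&dst, &packed, sizeof packed); // intended to become a single 64-bit store
}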
Why can't the compiler generate [same assembly as memcpy version]?
The compiler "could" in the sense that it would be allowed to.
The compiler simply doesn't. Why it doesn't is beyond my knowledge, as that requires deep knowledge of how the optimiser has been implemented. But the answer may range from "there is no logic covering such a transformation" to "the rules aren't tuned to assume one output is faster than the other" on all target CPUs.
If you use Clang instead of GCC, you'll notice that it produces the same output for naiveEqual and naiveEqual1, and that the assembly has no jump. It is the same as for the "optimised" version, except for using two 32-bit instructions in place of one 64-bit instruction. Furthermore, restricting the alignment of Point as shown in Jarod42's answer has no effect on the optimiser.
MSVC behaves like Clang in the sense that it is unaffected by the alignment, but differently in the sense that it doesn't get rid of the jump in naiveEqual.
For what it's worth, the compilers (I checked GCC and Clang) produce essentially the same output for the C++20 defaulted comparison as they do for naiveEqual. For whatever reason, GCC opted to use jne instead of je for the jump.
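For reference, this is the C++20 defaulted comparison being referred to (a sketch; the struct shape is taken from the question):
#include <cstdint>

struct Point {
    std::int32_t x, y;
    // C++20: member-wise equality, observed to compile much like naiveEqual.
    bool operator==(const Point &) const = default;
};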
is this a missing compiler optimization
With the assumption that one is always faster than the other on the target CPUs, that would be a fair conclusion.

The best way in C++ to cast types of different signedness to each other?

There is a uint64_t data field sent by the communication peer; it carries an order ID that I need to store into a PostgreSQL 11 DB that does NOT support unsigned integer types. Although a real value may exceed 2^63, I think an INT8 field in PostgreSQL 11 can hold it, if I do some casting carefully.
Let's say we have:
uint64_t order_id = 123; // received
int64_t to_db; // to be written into the db
I plan to use one of the following methods to cast a uint64_t value into an int64_t value:
to_db = order_id; // directly assigning;
to_db = (int64_t)order_id; //c-style casting;
to_db = static_cast<int64_t>(order_id);
to_db = *reinterpret_cast<const int64_t*>( &order_id );
and when I need to load it from the db, I can do a reversed casting.
I know they all work; I'm just interested in which one conforms to the C++ standard most cleanly.
In other words, which method will always work on any 64-bit platform with any compiler?
It depends on where it would be compiled and run... none of those is fully portable without C++20 support.
The safest way without that would be doing the conversion yourself by shifting the range of values, something like this:
int64_t to_db = (order_id > (uint64_t)LLONG_MAX)
    ? int64_t(order_id - (uint64_t)LLONG_MAX - 1)
    : int64_t(order_id) + LLONG_MIN;
uint64_t from_db = (to_db < 0)
    ? uint64_t(to_db - LLONG_MIN)
    : uint64_t(to_db) + (uint64_t)LLONG_MAX + 1;
If order_id is greater than (2^63 - 1), then order_id - (uint64_t)LLONG_MAX - 1 yields a non-negative value. If not, then the cast to signed is well defined and adding LLONG_MIN shifts the value into the negative range.
During the reverse conversion, to_db - LLONG_MIN maps the negative branch back into the [0, LLONG_MAX] range, and the other branch lands in (LLONG_MAX, ULLONG_MAX], so the whole uint64_t range is covered.
and do the opposite on reading. The database platform or compiler you use may do something awful with the binary representation of unsigned values when converting them to signed, not to mention that different signed formats exist.
For the same reason, inter-platform protocols often involve the use of string formatting, or of a "least bit's value" for representing floating-point values as integers, i.e. as encoded fixed point.
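A hedged sketch of the string-formatting route mentioned above (the helper name is illustrative, not from the original answer): send the order ID as decimal text and let the database store it in a NUMERIC or TEXT column, which can hold the full uint64_t range.
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

std::string order_id_to_text(std::uint64_t order_id) {
    char buf[21];                                      // up to 20 digits for 2^64-1, plus NUL
    std::snprintf(buf, sizeof buf, "%" PRIu64, order_id);
    return std::string(buf);
}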
I would go with memcpy. It avoids (? see comments) undefined behavior and typically compilers optimize any byte copying away:
int64_t uint64_t_to_int64_t(uint64_t u)
{
    int64_t i;
    memcpy(&i, &u, sizeof(int64_t));
    return i;
}
to_db = uint64_t_to_int64_t(order_id);
GCC with -O2 generated the optimal assembly for uint64_t_to_int64_t:
mov rax, rdi
ret
Live demo: https://godbolt.org/z/Gbvhzh
All four methods will always work, as long as the value is within range. The first will generate warnings on many compilers, so should probably not be used. The second is more a C idiom than a C++ idiom, but is widely used in C++. The last one is ugly and relies on subtle details from the standard, and should not be used.
This function seems UB-free:
int64_t fromUnsignedTwosComplement(uint64_t u)
{
    if (u <= std::numeric_limits<int64_t>::max()) return static_cast<int64_t>(u);
    else return -static_cast<int64_t>(-u);
}
It reduces to a no-op under optimisations.
Conversion in the other direction is a straight cast to uint64_t. It is always well-defined.
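A quick round-trip sketch of that pair of conversions (the function is repeated so the snippet stands alone; the test value is arbitrary):
#include <cassert>
#include <cstdint>
#include <limits>

int64_t fromUnsignedTwosComplement(uint64_t u)
{
    if (u <= std::numeric_limits<int64_t>::max()) return static_cast<int64_t>(u);
    else return -static_cast<int64_t>(-u);
}

int main()
{
    uint64_t order_id = 0xFFFFFFFFFFFFFFFFull;        // larger than 2^63 - 1
    int64_t  to_db    = fromUnsignedTwosComplement(order_id);
    uint64_t back     = static_cast<uint64_t>(to_db); // modulo 2^64, always well-defined
    assert(back == order_id);
    return 0;
}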

ULP comparison code

The following code snippet is scattered all over the web and seems to be used in multiple different projects with very little changes:
union Float_t {
    Float_t(float num = 0.0f) : f(num) {}
    // Portable extraction of components.
    bool Negative() const { return (i >> 31) != 0; }
    int RawMantissa() const { return i & ((1 << 23) - 1); }
    int RawExponent() const { return (i >> 23) & 0xFF; }
    int i;
    float f;
};
inline bool AlmostEqualUlpsAndAbs(float A, float B, float maxDiff, int maxUlpsDiff)
{
    // Check if the numbers are really close -- needed
    // when comparing numbers near zero.
    float absDiff = std::fabs(A - B);
    if (absDiff <= maxDiff)
        return true;
    Float_t uA(A);
    Float_t uB(B);
    // Different signs means they do not match.
    if (uA.Negative() != uB.Negative())
        return false;
    // Find the difference in ULPs.
    return (std::abs(uA.i - uB.i) <= maxUlpsDiff);
}
See, for example here or here or here.
However, I don't understand what is going on here. To my (maybe naive) understanding, the floating-point member variable f is initialized in the constructor, but the integer member i is not.
I'm not terribly familiar with the binary operators that are used here, but I fail to understand how accesses of uA.i and uB.i produce anything but random numbers, given that no line in the code actually connects the values of f and i in any meaningful way.
If somebody could enlighten my on why (and how) exactly this code produces the desired result, I would be very delighted!
A lot of Undefined Behaviour is being exploited here. The first assumption is that the fields of a union can be accessed in place of each other, which is, in itself, UB. Furthermore, the coder assumes that sizeof(int) == sizeof(float), that floats have a given length of mantissa and exponent, that all union members start at offset zero, and that the binary representation of float coincides with the binary representation of int in a very specific way. In short, this will work as long as you're on x86, have specific int and float types, and you say a prayer at every sunrise and sunset.
What you probably didn't note is that this is a union, so int i and float f are usually laid out in a specific manner in a common piece of memory by most compilers. This is, in general, still UB, and you can't even safely assume that the same physical bits of memory will be used without restricting yourself to a specific compiler and a specific architecture. All that's guaranteed is that the address of both members will be the same (but there might be alignment and/or type-punning issues).
Assuming that your compiler uses the same physical bits (which is by no means guaranteed by the standard) and that both members start at offset 0 and have the same size, then i will represent the binary storage format of f, as long as nothing changes in your architecture. Word of advice? Do not use it unless you have to. Stick to floating-point operations for AlmostEquals(); you can implement it that way. This kind of speciality belongs in the very final pass of optimization, usually done in a separate branch; you shouldn't plan your code around it.
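If you do need the ULP test, a hedged alternative sketch without the union (memcpy of the bit pattern is well-defined; std::bit_cast from C++20 would also work) might look like this. It still assumes IEEE-754 binary32 floats of the same size as int32_t, so the architecture caveats above don't fully disappear.
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

inline std::int32_t floatBits(float f) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "assumes 32-bit float");
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);   // copy the representation, no type punning
    return i;
}

inline bool almostEqualUlpsAndAbs(float A, float B, float maxDiff, int maxUlpsDiff) {
    float absDiff = std::fabs(A - B);
    if (absDiff <= maxDiff)
        return true;                          // handles values near zero
    std::int32_t iA = floatBits(A);
    std::int32_t iB = floatBits(B);
    if ((iA < 0) != (iB < 0))
        return false;                         // different signs never match
    return std::abs(iA - iB) <= maxUlpsDiff;  // distance in representation units
}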

Why does gcc/clang use two 128bit xmm registers to pass a single value?

So I stumbled upon something which I'd like to understand, as it's causing me headaches. I have the following code:
#include <stdio.h>
#include <smmintrin.h>

typedef union {
    struct { float x, y, z, w; } v;
    __m128 m;
} vec;

vec __attribute__((noinline)) square(vec a)
{
    vec x = { .m = _mm_mul_ps(a.m, a.m) };
    return x;
}

int main(int argc, char *argv[])
{
    float f = 4.9;
    vec a = (vec){f, f, f, f};
    vec res = square(a); // ?
    printf("%f %f %f %f\n", res.v.x, res.v.y, res.v.z, res.v.w);
    return 0;
}
Now, in my mind, the call to square in main should put the value of a in xmm0 so that the square function can do mulps xmm0, xmm0 and be done with it.
This is not what happens when I compile with clang or gcc. Instead, the first 8 bytes of a are put in xmm0 and the next 8 bytes in xmm1, making the square function a lot more complicated as it needs to patch things back up.
Any idea why?
NOTE: This is with -O3 optimization.
After further research, it seems like it has to do with the union type. If the function takes a straight __m128, the generated code will expect the value in a single register (xmm0). But given that they should both fit in xmm0, I don't see why the value is being split into two half-used registers when the vec type is used.
The compiler is just trying to follow the calling convention as specified by the System V Application Binary Interface AMD64 Architecture Processor Supplement, section 3.2.3 Parameter Passing.
The relevant points are:
We first define a number of classes to classify arguments. The
classes are corresponding to AMD64 register classes and defined as:
SSE The class consists of types that fit into a vector register.
SSEUP The class consists of types that fit into a vector register and can
be passed and returned in the upper bytes of it.
The size of each argument gets rounded up to eightbytes.
The basic types are assigned their natural classes:
Arguments of types float, double, _Decimal32, _Decimal64 and __m64 are
in class SSE.
The classification of aggregate (structures and arrays) and union types
works as follows:
If the size of the aggregate exceeds a single eightbyte, each is
classified separately.
Applying the above rules means that the x, y and z, w pairs of the embedded struct get separately classified as SSE class, which in turn means they must be passed in two separate registers. The presence of the m member in this case doesn't have any effect, you can even delete it.
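As a hedged illustration of the classification difference (this function is a minimal variant, not from the question): when the parameter is a plain __m128, the whole 16 bytes are classified SSE + SSEUP and arrive in a single register.
#include <smmintrin.h>

__m128 square_m128(__m128 a)       // a arrives whole in xmm0
{
    return _mm_mul_ps(a, a);       // typically just: mulps xmm0, xmm0 ; ret
}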
EDIT: on a second read through, I'm less certain why this is happening, but I'm more certain that this is where it is happening. I don't think this answer is right, but I'll leave it up as it may be helpful.
Speaking only for clang:
It seems like this is an issue that is just an unfortunate side effect of a compiler heuristic.
From a brief look at clang (file CGRecordLayoutBuilder.cpp, function CGRecordLowering::lowerUnion) it looks like llvm doesn't internally represent union types as such, and the types of a function don't get changed depending on the uses within the function.
clang looks at your function and sees that it needs 16 bytes worth of arguments for the type signature, then uses a heuristic to pick which type it thinks is best. It favors a { double, double } interpretation over a <4 x float> (which would give it the most efficiency in your case) because doubles are more lenient with respect to alignment.
I'm no expert on clang internals, so I could be very wrong, but it doesn't look like there's a particularly nice way around this one. If you want the optimized version you may have to use pointer casting instead of unions to get it.
The code I suspect is causing the problem:
void CGRecordLowering::lowerUnion() {
    ...
    // Conditionally update our storage type if we've got a new "better" one.
    if (!StorageType ||
        getAlignment(FieldType) > getAlignment(StorageType) ||
        (getAlignment(FieldType) == getAlignment(StorageType) &&
         getSize(FieldType) > getSize(StorageType)))
        StorageType = FieldType;
    ...
}

Verifying that C / C++ signed right shift is arithmetic for a particular compiler?

According to the C / C++ standard (see this link), the >> operator in C and C++ is not necessarily an arithmetic shift for signed numbers. It is up to the compiler implementation whether 0's (logical) or the sign bit (arithmetic) are shifted in as bits are shifted to the right.
Will this code function to ASSERT (fail) at compile time for compilers that implement a logical right shift for signed integers?
#define COMPILE_TIME_ASSERT(EXP) \
typedef int CompileTimeAssertType##__LINE__[(EXP) ? 1 : -1]
#define RIGHT_SHIFT_IS_ARITHMETIC \
( (((signed int)-1)>>1) == ((signed int)-1) )
// SHR must be arithmetic to use this code
COMPILE_TIME_ASSERT( RIGHT_SHIFT_IS_ARITHMETIC );
Looks good to me! You can also set the compiler to emit an assembly file (or load the compiled program in the debugger) and look at which opcode it emits for signed int i; i >> 1;, but that's not automatic like your solution.
If you ever find a compiler that does not implement arithmetic right shift of a signed number, I'd like to hear about it.
Why assert? If your compiler's shift operator doesn't suit your needs, you could gracefully remedy the situation by sign-extending the result. Also, sometimes run-time is good enough. After all, the compiler's optimizer can make compile-time out of run-time:
template <typename Number>
inline Number shift_logical_right(Number value, size_t bits)
{
    static const bool shift_is_arithmetic = (Number(-1) >> 1) == Number(-1);
    const bool negative = value < 0;
    value >>= bits;
    if (!shift_is_arithmetic && negative) // sign extend
        value |= -(Number(1) << (sizeof(Number) * 8 - bits));
    return value;
}
The static const bool can be evaluated at compile time, so if shift_is_arithmetic is guaranteed to be true, every compiler worth its salt will eliminate the whole if clause and the construction of const bool negative as dead code.
Note: code is adapted from Mono's encode_sleb128 function: here.
Update
If you really want to abort compilation on machines without arithmetic shift, you're still better off not relying on the preprocessor. You can use static_assert (or BOOST_STATIC_ASSERT):
static_assert((Number(-1) >> 1) == Number(-1), "Arithmetic shift unsupported.");
From your various comments, you talk about using this cross-platform. Make sure that your compilers guarantee that when they compile for a platform, their compile-time operators will behave the same as run-time ones.
An example of differing behavior can be found with floating point numbers. Is your compiler doing its constant-expression math in single, double, or extended precision if you're casting back to int? Such as
constexpr int a = 41;
constexpr int b = (a / 7.5);
What I am saying is, you should make sure your compilers guarantee the same behavior during run-time as compile-time when you're working across so many different architectures.
It is entirely possible that a compiler might sign-extend internally but not generate the intended opcode(s) on the target. The only way to be sure is to test at run-time or look at the assembly output.
It's not the end of the world to look at assembly output... How many different platforms are there? Since this is so performance-critical, just do the "work" of looking at 1-3 lines of assembler output for 5 different architectures. It isn't as if you have to dive through an entire assembly listing (usually!) to find your line. It's very, very easy to do.
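A hedged run-time probe along the lines suggested above (the volatile is only there to keep the check from being folded away at compile time):
#include <cassert>

bool right_shift_is_arithmetic() {
    volatile int minus_one = -1;     // volatile forces an actual run-time shift
    return (minus_one >> 1) == -1;   // a logical shift would yield INT_MAX instead
}

int main() {
    assert(right_shift_is_arithmetic());
    return 0;
}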