Distinguishing SSE datatypes under Visual Studio - C++

Under gcc I would write:
typedef __v16qi vUInt8;
or even:
typedef unsigned char vUInt8 __attribute__ ((__vector_size__ (16)));
This makes it possible to distinguish between e.g. a vector of 16 x uint8 and a vector of 4 x uint32. It also makes possible function overloading, e.g.
vUInt8 vAdd(vUInt8 a, vUInt8 b) { return _mm_add_epi8(a, b); }
vUInt16 vAdd(vUInt16 a, vUInt16 b) { return _mm_add_epi16(a, b); }
or even just:
template<class T> T vAdd(T a, T b) { return a + b; }
I have ended up with quite a lot of cross-platform code where the SIMD details are encapsulated like this(*), and along with alternative implementations for Neon and Altivec it makes for very convenient cross-platform compilation under gcc.
The problem is that this just doesn't seem to be an option under MS Visual C++. I am not that familiar with Windows but, as far as I can see, MSVC only understands the generic __m128i type, which makes it impossible to distinguish between different element sizes. That completely scuppers my setup, where I encapsulate all the vector code in cross-platform functions such as vAdd().
If I try:
typedef __m128i vUInt8;
typedef __m128i vUInt16;
then that doesn't help me because vUInt8 and vUInt16 are treated as completely synonymous and interchangeable by the compiler, so I still cannot overload vAdd for different element sizes.
Can anyone suggest a solution that would enable me to write code of the sort I am describing here, but that would compile under Visual C++? I am hoping to find a way to do that rather than e.g. having to set up MinGW, since this is for a C extension for Python (and my understanding is that I can only really assume that MSVC is available, not anything gcc-like).
(*) clearly this isn't practical for all vector instructions, but it works well for me with the relatively straightforward arithmetic operations I am doing.
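One commonly suggested workaround (a minimal sketch, not part of the original post) is to wrap __m128i in distinct struct types, which restores overloading at the cost of going through a .v member:
#include <emmintrin.h>

// Distinct wrapper types around the same underlying __m128i register.
struct vUInt8  { __m128i v; };
struct vUInt16 { __m128i v; };

// Overloads can now be told apart by the compiler.
inline vUInt8  vAdd(vUInt8 a, vUInt8 b)   { vUInt8 r;  r.v = _mm_add_epi8(a.v, b.v);  return r; }
inline vUInt16 vAdd(vUInt16 a, vUInt16 b) { vUInt16 r; r.v = _mm_add_epi16(a.v, b.v); return r; }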

Related

How to convert between uint64_t and poly64_t on ARM?

I'd like to perform polynomial multiplication of two uint64_t values (where the least significant bit, the one given by w & 1, is the least significant coefficient, i.e. the a_0 in w(x) = Σ_i a_i·x^i) on ARM, and get the least significant 64 coefficients (a_0 ... a_63) of the result as a uint64_t (so (result >> i) & 1 is a_i).
It's not clear to me, however, what is the standard-compliant way to convert uint64_t to poly64_t and (least significant part of) poly128_t to uint64_t.
poly8_t, poly16_t, poly64_t and poly128_t are defined as unsigned integer types. It is unspecified whether these are the same type as uint8_t, uint16_t, uint64_t and uint128_t for overloading and mangling purposes.
ACLE does not define whether int64x1_t is the same type as int64_t, or whether uint64x1_t is the same type as uint64_t, or whether poly64x1_t is the same as poly64_t for example for C++ overloading purposes.
source: https://developer.arm.com/documentation/101028/0009/Advanced-SIMD--Neon--intrinsics
The above quotes open up some scary possibilities in my head, like perhaps the bit order is flipped, or there's some padding, or, who knows, maybe these are some structs.
So far I've come out with these two:
poly64_t uint64_t_to_poly64_t(uint64_t x) {
    return vget_lane_p64(vcreate_p64(x), 0);
}

uint64_t less_sinificant_half_of_poly128_t_to_uint64_t(poly128_t big) {
    return vgetq_lane_u64(vreinterpretq_u64_p128(big), 0);
}
But they seem cumbersome (as they go through intermediary types like poly64x1_t), and they still make some assumptions (e.g. that poly128_t can be treated as a vector of two uint64_t, that the 0-th uint64_t will contain the "less significant coefficients", and that the least significant polynomial coefficient will be at the least significant bit of that uint64_t).
OTOH it seems that I can simply "ignore" the whole issue and just pretend that integers are polynomials, as the two functions below produce the same assembly:
__attribute__((target("+crypto")))
uint64_t polynomial_mul_low(uint64_t v, uint64_t w) {
    const poly128_t big = vmull_p64(uint64_t_to_poly64_t(v),
                                    uint64_t_to_poly64_t(w));
    return less_sinificant_half_of_poly128_t_to_uint64_t(big);
}

__attribute__((target("+crypto")))
uint64_t polynomial_mul_low_naive(uint64_t v, uint64_t w) {
    return vmull_p64(v, w);
}
that is:
fmov d0, x0
fmov d1, x1
pmull v0.1q, v0.1d, v1.1d
fmov x0, d0
ret
Also, the assembly for uint64_t_to_poly64_t and less_sinificant_half_of_poly128_t_to_uint64_t is a no-op, which supports the hypothesis that there are really no conversion steps involved.
(See above in action: https://godbolt.org/z/o6bYsn4E4)
Also:
__attribute__((target("+crypto")))
uint64_t polynomial_mul_low_naive(uint64_t v, uint64_t w) {
    return (uint64_t)vmull_p64(poly64_t{v}, poly64_t{w});
}
seems to compile, and while the {..}s give me the soothing confidence that no narrowing occurred, I'm still unsure whether the order of the bits and the order of the coefficients are guaranteed to be consistent, and thus I have some worries about the final (uint64_t) cast.
I want my code to be correct with respect to the standards, as opposed to just working by accident, as it has to be written once and run on many ARM64 platforms. Hence my question:
How does one perform a proper conversion between polyXXX_t and uintXXX_t, and how does one extract "lower half of coefficients" from polyXXX_t?
The ARM-NEON intrinsic set provides many types, but fundamentally they just map to the same set of registers. The types are there to help you, the programmer, organize your code and the hardware really doesn't care.
Many implementations of ARM-NEON intrinsics just map all of those types onto the same internal type, so type-safety is largely lost in those cases: Visual C++ and clang/LLVM are both fairly "loose" with respect to ARM-NEON type-safety.
GNU C (gcc) seems to be the one compiler I've used that generates these type warnings, although you can use -flax-vector-conversions to relax them.
The ARM-NEON intrinsic set defines a number of vreinterpret_X_Y and vreinterpretq_X_Y intrinsics. These perform the 'type-casts' between the various vector types when you need to force them for the particular mix of instructions you are using (a short usage sketch follows the list below).
// Convert poly to unsigned int (the reverse is also defined)
vreinterpret_u8_p8
vreinterpret_u8_p16
vreinterpret_u16_p8
vreinterpret_u16_p16
vreinterpret_u32_p8
vreinterpret_u32_p16
vreinterpret_u64_p8
vreinterpret_u64_p16
vreinterpretq_u8_p8
vreinterpretq_u8_p16
vreinterpretq_u16_p8
vreinterpretq_u16_p16
vreinterpretq_u32_p8
vreinterpretq_u32_p16
vreinterpretq_u64_p8
vreinterpretq_u64_p16
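For example (an illustrative sketch, not from the answer; the function name is made up), a per-byte polynomial multiply whose result is reinterpreted as unsigned bytes so it can take part in ordinary integer arithmetic:
#include <arm_neon.h>

// Illustrative only: multiply two polynomial byte vectors, then reinterpret
// the product as unsigned bytes and add a third vector to it.
uint8x8_t poly_mul_then_add(poly8x8_t a, poly8x8_t b, uint8x8_t c)
{
    poly8x8_t prod = vmul_p8(a, b);             // carry-less multiply, per byte
    uint8x8_t as_u8 = vreinterpret_u8_p8(prod); // pure type-cast, no instruction emitted
    return vadd_u8(as_u8, c);
}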
My proposal is not to use the poly128_t or poly64_t scalar types at all, as passing them around leads to very bad code generation that mixes NEON and GPR registers:
poly128_t mul_lo_p64(poly128_t a, poly128_t b) {
    return vmull_p64(a, b);
}
fmov d0, x0
fmov d1, x2
pmull v0.1q, v0.1d, v1.1d
mov x1, v0.d[1]
fmov x0, d0
ret
This is seen also in more complex scenarios.
To fix this, one should stay completely in the NEON register domain; only two primitives are needed, namely:
inline poly64x2_t mul_lo_p64(poly64x2_t a, poly64x2_t b) {
    poly64x2_t res;
    asm("pmull %0.1q, %1.1d, %2.1d": "=w"(res): "w"(a), "w"(b));
    return res;
}

inline poly64x2_t mul_hi_p64(poly64x2_t a, poly64x2_t b) {
    poly64x2_t res;
    asm("pmull2 %0.1q, %1.2d, %2.2d": "=w"(res): "w"(a), "w"(b));
    return res;
}
Then e.g. the two other often-used intrinsics, poly64x2_t vaddq_p64(poly64x2_t a, poly64x2_t b); and vextq_p64(poly64x2_t, poly64x2_t, 1);, work as expected.
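A minimal usage sketch of the two primitives above (the folding step is illustrative, not from the answer): XOR-combining the low and high 128-bit partial products, as one might in a GHASH- or CRC-style fold, while staying entirely in NEON registers:
// Illustrative use of mul_lo_p64 / mul_hi_p64; vaddq_p64 is polynomial
// addition, i.e. a plain XOR of the two partial products.
inline poly64x2_t clmul_fold(poly64x2_t x, poly64x2_t k) {
    poly64x2_t lo = mul_lo_p64(x, k); // x[0] * k[0], 128-bit product
    poly64x2_t hi = mul_hi_p64(x, k); // x[1] * k[1], 128-bit product
    return vaddq_p64(lo, hi);         // XOR the partial products
}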

Why don't compilers optimize trivial wrapper function pointers?

Consider the following code snippet
#include <vector>
#include <cstdlib>
void __attribute__ ((noinline)) calculate1(double& a, int x) { a += x; };
void __attribute__ ((noinline)) calculate2(double& a, int x) { a *= x; };
void wrapper1(double& a, int x) { calculate1(a, x); }
void wrapper2(double& a, int x) { calculate2(a, x); }
typedef void (*Func)(double&, int);
int main()
{
    std::vector<std::pair<double, Func>> pairs = {
        std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
        std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
    };
    for (auto& [a, wrapper] : pairs)
        (*wrapper)(a, 5);
    return pairs[0].first + pairs[1].first;
}
With -O3 optimization the latest gcc and clang versions do not optimize the pointers to wrappers to pointers to underlying functions. See assembly here at line 22:
mov ebp, OFFSET FLAT:wrapper2(double&, int) # tmp118,
which later results in call + jmp instead of just a call, as we would have had if the compiler had stored a pointer to calculate1 instead.
Note that I specifically asked for non-inlined calculate functions to illustrate the point; doing it without noinline results in another flavour of non-optimization, where the compiler generates two identical functions to be called by pointer (so it still doesn't optimize, just in a different fashion).
What am I missing here? Is there any way to guide the compiler short of manually plugging in the correct functions (without wrappers)?
Edit 1. Following suggestions in the comments, here is a disassembly with all functions declared static, with exactly the same result (call + jmp instead of call).
Edit 2. Much simpler example of the same pattern:
#include <vector>
#include <cstdlib>
typedef void (*Func)(double&, int);
static void __attribute__ ((noinline)) calculate(double& a, int x) { a += x; };
static void wrapper(double& a, int x) { calculate(a, x); }
int main() {
    double a = 5.0;
    Func f;
    if (rand() % 2)
        f = &wrapper; // f = &calculate;
    else
        f = &wrapper;
    f(a, 0);
    return 0;
}
gcc 8.2 successfully optimizes this code by throwing the pointer to wrapper away and storing &calculate directly in its place (https://gcc.godbolt.org/z/nMIBeo). However, changing the line as per the comment (that is, performing part of the same optimization manually) breaks the magic and results in a pointless jmp.
You seem to be suggesting that &calculate1 should be stored in the vector instead of &wrapper1. In general this is not possible: later code might try to compare the stored pointer against &calculate1 and that must compare false.
I further assume that your suggestion is that the compiler might try to do some static analysis and determine that the function pointer values in the vector are never compared for equality with other function pointers, and in fact that none of the other operations done on the vector elements would produce a change in observable behaviour; and therefore in this exact program it could store &calculate1 instead.
Usually the answer to "why does the compiler not perform some particular optimization" is that nobody has conceived of and implemented that idea. Another common reason is that the static analysis involved is, in the general case, quite difficult and might lead to a slowdown in compilation with no benefit in real programs where the analysis could not be guaranteed to succeed.
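To make the first point concrete, here is a minimal sketch (hypothetical code, not from the question) of the kind of later use that rules out the substitution:
#include <utility>
typedef void (*Func)(double&, int);
void calculate1(double& a, int x);
void wrapper1(double& a, int x);

// The pointer stored in the pair must still compare equal to &wrapper1 and
// unequal to &calculate1, so the compiler may not silently store &calculate1.
bool picked_wrapper1(const std::pair<double, Func>& p)
{
    return p.second == &wrapper1 && p.second != &calculate1;
}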
You are making a lot of assumptions here. The first is about your syntax. The second is that compilers are perfect in the eye of the beholder and catch everything. The reality is that it is easy to find and hand-optimize compiler output; it is not difficult to write small functions that trip up a compiler you are well in tune with, or to write a decent-sized application where there will be places you can hand-tune. This is all known and expected. Then opinion comes in, where on my machine my blah is faster than blah, so it should have emitted these instructions instead.
gcc is not a great compiler for performance; on some targets it has been getting worse over a number of major revisions. It is pretty good at what it does, better than pretty good: it deals with a number of preprocessors/languages, has a common middle end, and has a number of backends. Some backends get better optimization applied front to back; others are just along for the ride. There were a number of other compilers that could produce code that easily outperformed gcc.
These were mostly pay-for compilers, costing more than an individual would pay out of pocket: used-car prices, sometimes recurring annually.
There are things gcc can optimize that are simply amazing, and times when it goes totally in the wrong direction. The same goes for clang; they often do similar jobs with similar output, sometimes doing impressive things and sometimes just going off into the weeds. I now find it more fun to manipulate the optimizer to make it do good or bad things rather than worry about why it didn't do what I "think" it should have done on a particular occasion. If I need the code to be faster I take the compiled output, hand-fix it, and use it as an assembly function.
You get what you pay for with gcc; if you were to look deep into its bowels you would find it is barely held together with duct tape and baling wire (llvm is catching up). But for a free tool it does a simply amazing job, and it is so widely used that you can get free support just about anywhere. We are sadly well into a time where folks think that because gcc interprets the language in a certain way, that is how the language is defined, and sadly that is not remotely true. But so many folks don't try other compilers to find out what "implementation defined" really means.
Last and most important: it's open source, so if you want to "fix" an optimization, just do it. Keep the fix for yourself, post it, or try to push it upstream.

When I target 32-wide warp CUDA architectures, should I use warpSize?

This is a follow-up question to this one.
Suppose I have a CUDA kernel
template<unsigned ThreadsPerWarp>
__global__ void foo(bar_t* a, const baz_t* b);
and I'm implementing a specialization of it for the case of ThreadsPerWarp being 32 (this circumvents the valid criticism of Talonmies' answer to my previous question.)
In the body of this function (or of other __device__ functions called from it) - should I prefer using the constant value of ThreadsPerWarp? Or is it better to use warpSize? Or - will it be all the same to the compiler in terms of the PTX it generates?
No, don't use warpSize.
It seems that, other than potential future-proofness (which in practice is questionable), there is no advantage in using it. Instead, you can very well use something like:
enum : unsigned { warp_size = 32 };
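A minimal sketch of how such a compile-time constant might be used (the reduction below is illustrative, not from the answer): because warp_size is a constant expression, the loop can be fully unrolled, which the runtime warpSize value would not permit.
enum : unsigned { warp_size = 32 };

// Illustrative warp-level sum reduction using the compile-time constant.
__device__ float warp_reduce_sum(float v)
{
    #pragma unroll
    for (unsigned offset = warp_size / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}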

WinAPI _Interlocked* intrinsic functions for char, short

I need to use an _Interlocked*** function on a char or short, but those functions take a long pointer as input. There seems to be a function _InterlockedExchange8, but I don't see any documentation for it; this looks like an undocumented feature. The compiler also wasn't able to find an _InterlockedAdd8 function.
I would appreciate any information on these functions, recommendations on whether or not to use them, and other solutions as well.
update 1
I'll try to simplify the question.
How can I make this work?
struct X
{
    char data;
};

X atomic_exchange(X another)
{
    return _InterlockedExchange( ??? );
}
I see two possible solutions:
Use _InterlockedExchange8
Cast another to long, do the exchange, and cast the result back to X
The first one is obviously a bad solution.
The second one looks better, but how do I implement it?
update 2
What do you think about something like this?
template <typename T, typename U>
class padded_variable
{
public:
    padded_variable(T v): var(v) {}
    padded_variable(U v): var(*static_cast<T*>(static_cast<void*>(&v))) {}
    U& cast()
    {
        return *static_cast<U*>(static_cast<void*>(&var));
    }
    T& get()
    {
        return var;
    }
private:
    T var;
    char padding[sizeof(U) - sizeof(T)];
};

struct X
{
    char data;
};

template <typename T, int S = sizeof(T)> class var;

template <typename T> class var<T, 1>
{
public:
    var(): data(T()) {}
    T atomic_exchange(T another)
    {
        padded_variable<T, long> xch(another);
        padded_variable<T, long> res(_InterlockedExchange(&data.cast(), xch.cast()));
        return res.get();
    }
private:
    padded_variable<T, long> data;
};
Thanks.
It's pretty easy to make 8-bit and 16-bit interlocked functions, but the reason they're not included in the WinAPI is IA64 portability. If you want to support Win64, the assembler cannot be inline, as MSVC no longer supports inline assembly there. As external function units built with MASM64 they will not be as fast as inline code or intrinsics, so you are wiser to investigate promoting your algorithms to use 32-bit and 64-bit atomic operations instead.
Example interlocked API implementation: intrin.asm
Why do you want to use smaller data types? So you can fit a bunch of them in a small memory space? That's just going to lead to false sharing and cache line contention.
Whether you use locking or lockless algorithms, it's ideal to have your data in blocks of at least 128 bytes (or whatever the cache line size is on your CPU) that are only used by a single thread at a time.
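A minimal sketch of that point (64 bytes is assumed here; the actual cache-line size is target-dependent):
// Each per-thread slot is padded out to its own (assumed) 64-byte cache line,
// so concurrent updates from different threads never contend for one line.
struct alignas(64) PerThreadCounter {
    long value;
};

PerThreadCounter counters[8]; // one slot per thread, no false sharing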
Well, you have to make do with the functions available. _InterlockedIncrement and _InterlockedCompareExchange are available in 16-bit and 32-bit variants (the latter in a 64-bit variant as well), and maybe a few other interlocked intrinsics are available in 16-bit versions too, but InterlockedAdd doesn't seem to be, and there seem to be no byte-sized Interlocked intrinsics/functions at all.
So... you need to take a step back and figure out how to solve your problem without an _InterlockedAdd8.
Why are you working with individual bytes in any case? Stick to int-sized objects unless you have a really good reason to use something smaller.
Creating a new answer because your edit changed things a bit:
Use _InterlockedExchange8
Cast another to long, do exchange and cast result back to X
The first simply won't work. Even if the function existed, it would only allow you to atomically update one byte at a time, which means the object as a whole would be updated in a series of steps that, taken together, wouldn't be atomic.
The second doesn't work either, unless X is a POD type with the same size as a long and aligned on a sizeof(long) boundary.
In order to solve this problem you need to narrow down what types X might be. First, of course, is it guaranteed to be a POD type? If not, you have an entirely different problem, as you can't safely treat non-POD types as raw memory bytes.
Second, what sizes may X have? The Interlocked functions can handle 16, 32 and, depending on circumstances, maybe 64 or even 128 bit widths.
Does that cover all the cases you can encounter?
If not, you may have to abandon these atomic operations, and settle for plain old locks. Lock a Mutex to ensure that only one thread touches these objects at a time.
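For completeness, a minimal sketch (not from the answers above, and subject to the sharing caveats already mentioned) of how a byte-sized exchange can be emulated on top of the 32-bit _InterlockedCompareExchange by CAS-ing the aligned word that contains the byte:
#include <intrin.h>
#include <cstdint>

// Atomically exchange one byte by compare-and-swapping the aligned 32-bit
// word that contains it; the neighbouring bytes of that word are preserved.
char interlocked_exchange_char(volatile char* p, char desired)
{
    volatile long* word = reinterpret_cast<volatile long*>(
        reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t(3));
    const unsigned shift = (reinterpret_cast<std::uintptr_t>(p) & 3u) * 8u;
    const long mask = static_cast<long>(0xFFul << shift);
    for (;;)
    {
        const long old = *word;
        const long desired_word = (old & ~mask) |
            static_cast<long>(static_cast<unsigned long>(
                static_cast<unsigned char>(desired)) << shift);
        if (_InterlockedCompareExchange(word, desired_word, old) == old)
            return static_cast<char>((static_cast<unsigned long>(old) >> shift) & 0xFFu);
    }
}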

Why or why not should I use 'UL' to specify unsigned long?

ulong foo = 0;
ulong bar = 0UL; // this seems redundant and unnecessary, but I see it a lot.
I also see this a good amount when referencing the first element of arrays:
blah = arr[0UL]; // this seems silly since I don't expect the compiler to magically
                 // turn '0' into a signed value
Can someone provide some insight into why I need 'UL' throughout to specify that this is specifically an unsigned long?
void f(unsigned int x)
{
//
}
void f(int x)
{
//
}
...
f(3); // f(int x)
f(3u); // f(unsigned int x)
It is just another tool in C++; if you don't need it don't use it!
In the examples you provide it isn't needed. But suffixes are often used in expressions to prevent loss of precision. For example:
unsigned long x = 5UL * ...
You may get a different answer if you left off the UL suffix, say if your system had 16-bit ints and 32-bit longs.
Here is another example inspired by Richard Corden's comments:
unsigned long x = 1UL << 17;
Again, you'd get a different answer if you had 16-bit ints and left the suffix off.
The same type of problem applies when mixing 32- and 64-bit ints, or long and long long, in expressions.
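For a concrete illustration, on a typical platform with 32-bit int and 64-bit long long:
// 1 << 33 shifts past the width of a 32-bit int, which is undefined
// behaviour; the ULL suffix widens the left operand to 64 bits first.
unsigned long long bad  = 1 << 33;    // undefined behaviour with 32-bit int
unsigned long long good = 1ULL << 33; // well-defined: 8589934592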
Some compilers may emit a warning, I suppose.
The author could be doing this to make sure the code compiles without warnings.
Sorry, I realize this is a rather old question, but I use this a lot in C++11 code...
ul and f are useful for initialising auto variables to your intended type, e.g.
auto my_u_long = 0ul;
auto my_float = 0.0f;
auto my_double = 0.0; // no suffix needed: a floating-point literal is double by default
Check out the reference on numeric literals: http://www.cplusplus.com/doc/tutorial/constants/
You don't normally need it, and any tolerable editor will have enough assistance to keep things straight. However, the places I use it in C# are (and you'll see these in C++):
Calling a generic method (template in C++), where the parameter types are implied and you want to make sure to call the one with an unsigned long type parameter. This happens reasonably often, including this one recently:
Tuple<ulong, ulong> t = Tuple.Create(someUlongVariable, 0UL);
where without the UL it returns Tuple<ulong, int> and won't compile.
Implicit variable declarations using the var keyword in C# or the auto keyword coming to C++. This is less common for me because I only use var to shorten very long declarations, and ulong is the opposite.
When you feel obligated to write down the type of constant (even when not absolutely necessary) you make sure:
That you always consider how the compiler will translate this constant into bits
That whoever reads your code will always know how you intended the constant to look, and that you took it into consideration (even you, when you rescan the code)
That you don't spend time wondering whether you need to write the 'U'/'UL' or not
Also, several software development standards, such as MISRA, require you to state the type of the constant no matter what (at least write 'U' if it is unsigned).
In other words, some consider it good practice to write out the type of the constant: at worst you just ignore it, at best you avoid bugs, avoid the chance that different compilers will treat your code differently, and improve code readability.