AMD HCC Swizzle Intrinsic - c++

I've just recently discovered AMD's equivalent to CUDA's __byte_perm intrinsic: amdgcn_ds_swizzle (or at least I think it's the equivalent of a byte permutation function). My problem is this: CUDA's __byte_perm takes two unsigned 32-bit integers and permutes their bytes based on the value of the selector argument (supplied as a hex value). However, AMD's swizzle function takes only a single unsigned 32-bit integer plus one int named "pattern". How do I use AMD's swizzle intrinsic?

ds_swizzle and __byte_perm are a little different: one moves whole 32-bit values across lanes, while the latter picks any four bytes out of two 32-bit registers.
AMD's ds_swizzle_b32 GCN instruction actually swaps values with other lanes. You specify the 32-bit register you want to read from another lane and the 32-bit register you want to place it in, along with a hard-coded pattern that specifies how the lanes are swapped. A great explanation of ds_swizzle_b32 is here, as user3528438 pointed out.
The __byte_perm does not swap data with other lanes. It just gathers any 4 bytes from two 32-bit registers in its own lane and stores it to a register. There is no cross-lane traffic.
I'm guessing the next question would be how to do a "byte permute" on AMD GCN hardware. The instruction for that is v_perm_b32. (see page 12-152 here) It basically selects any four bytes from two specified 32-bit registers.
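To make the semantic difference concrete, here is a hedged scalar sketch of what a byte permute does, loosely following CUDA's __byte_perm selector convention for nibble values 0-7 (the MSB-replication modes for values 8-15 are ignored); the function name and shape are illustrative only, not an AMD or CUDA API:

#include <cstdint>

// Illustrative scalar model of a byte permute: each of the four result bytes
// is chosen by a 4-bit selector nibble from the 8 bytes of {hi:lo}.
uint32_t byte_perm_scalar(uint32_t lo, uint32_t hi, uint32_t selector) {
    uint64_t bytes = (static_cast<uint64_t>(hi) << 32) | lo;   // bytes 0-3 from lo, 4-7 from hi
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t sel  = (selector >> (4 * i)) & 0x7;           // which of the 8 source bytes
        uint32_t byte = static_cast<uint32_t>((bytes >> (8 * sel)) & 0xFF);
        result |= byte << (8 * i);                             // place it in result byte i
    }
    return result;
}

Note there is no cross-lane traffic in this operation at all, which is exactly why ds_swizzle (a lane shuffle) is not the tool for it.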

Related

Are there any performance differences between representing a number using a (4 byte) `int` and a 4 element unsigned char array?

Assuming an int in C++ is represented by 4 bytes and an unsigned char is represented by 1 byte, you could represent an int with an array of unsigned char with 4 elements, right?
My question is, are there any performance downsides to representing a number with an array of unsigned char? Like if you wanted to add two numbers together would it be just as fast to do int + int compared to adding each element in the array and dealing with carries manually?
This is just me trying to experiment and to practice working with bytes rather than some practical application.
There will be many performance downsides on any kind of manipulation using the 4-byte array. For example, take simple addition: almost any CPU these days will have a single instruction that adds two 32-bit integers, in one (maybe two) CPU cycle(s). To emulate that with your 4-byte array, you would need at least 4 separate CPU instructions.
Further, many CPUs actually work faster with 32- or 64-bit data than they do with 8-bit data - because their internal registers are optimized for 32- and 64-bit operands.
Let's scale your question up: is there any performance difference between a single addition of two 16-byte variables and four separate additions of 4-byte variables? Here the concept of vector registers and vector instructions (MMX, SSE, AVX) comes in. It's much the same story: SIMD is faster because there are literally fewer instructions to execute and the whole operation is done by dedicated hardware. On top of that, in your case you also have to take into account that modern CPUs don't work with 1-byte variables; they still process 32 or 64 bits at once anyway. So effectively you would do 4 individual additions using 4-byte registers, only to use the single lowest byte each time and then handle the carry bit manually. Yes, that will be very slow.
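To make that concrete, here is a hedged sketch (names illustrative only) of the byte-array addition with manual carry handling next to the single int addition it emulates:

// Adding two numbers stored as 4-byte little-endian arrays: at least four
// add/carry steps instead of the one add the CPU could do on an int.
void add_bytes(const unsigned char a[4], const unsigned char b[4], unsigned char out[4]) {
    unsigned carry = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned sum = a[i] + b[i] + carry;
        out[i] = static_cast<unsigned char>(sum & 0xFF);   // keep the low byte
        carry  = sum >> 8;                                 // propagate the carry manually
    }
}

int add_ints(int a, int b) {
    return a + b;   // a single add instruction on essentially any modern CPU
}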

Encode additional information in pointer

My problem:
I need to encode additional information about an object in a pointer to the object.
What I thought I could do is use part of the pointer to do so. That is, use a few bits to encode bool flags. As far as I know, the same thing is done with certain types of handles in the Windows kernel.
Background:
I'm writing a small memory management system that can garbage-collect unused objects. To reduce the memory consumption of object references and speed up copying, I want to use pointers with additional encoded data, e.g. the state of the object (alive or ready to be collected), a lock bit, and similar things that can be represented by a single bit.
My question:
How can I encode such information into a 64-bit pointer without actually overwriting the important bits of the pointer?
Since x64 Windows has a limited address space, I believe not all 64 bits of the pointer are used, so it should be possible. However, I wasn't able to find out which bits Windows actually uses for the pointer and which it does not. To clarify, this question is about user mode on 64-bit Windows.
Thanks in advance.
This is heavily dependent on the architecture, OS, and compiler used, but if you know those things, you can do some things with it.
x86_64 defines a 48-bit¹ byte-oriented virtual address space in the hardware, which means essentially all OSes and compilers will use that. What that means is:
the top 17 bits of all valid addresses must be all the same (all 0s or all 1s)
the bottom k bits of any 2^k-byte aligned address must be all 0s
in addition, pretty much all OSes (Windows, Linux, and OSX at least) reserve the addresses with the upper bits set as kernel addresses -- all user addresses must have the upper 17 bits all 0s
So this gives you a variety of ways of packing a valid pointer into less than 64 bits, and then later reconstructing the original pointer with shift and/or mask instructions.
If you only need 3 bits and always use 8-byte aligned pointers, you can use the bottom 3 bits to encode extra info, and mask them off before using the pointer.
If you need more bits, you can shift the pointer up (left) by 16 bits, and use those lower 16 bits for information. To reconstruct the pointer, just right shift by 16.
To do shifting and masking operations on pointers, you need to cast them to intptr_t or int64_t (those will be the same type on any 64-bit implementation of C or C++)
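As a minimal sketch of the low-bit scheme, assuming the pointed-to objects are at least 8-byte aligned so the bottom 3 bits are free (uintptr_t is used here for the mask arithmetic; the names are illustrative only):

#include <cassert>
#include <cstdint>

template <typename T>
T* tag_pointer(T* p, uintptr_t flags) {
    assert(flags < 8);                                       // only 3 spare bits available
    assert((reinterpret_cast<uintptr_t>(p) & 0x7u) == 0);    // relies on 8-byte alignment
    return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(p) | flags);
}

template <typename T>
T* untag_pointer(T* p, uintptr_t* flags_out = nullptr) {
    uintptr_t bits = reinterpret_cast<uintptr_t>(p);
    if (flags_out) *flags_out = bits & 0x7u;                 // recover the encoded flags
    return reinterpret_cast<T*>(bits & ~uintptr_t{0x7});     // mask off before dereferencing
}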
¹ There are some hints that there may soon be hardware that extends this to 56 bits, so only the top 9 bits would need to be all 0s or all 1s, but it will be a while before any OS supports this.

What is the Lower and the higher part of multiplication in assembly instructions

I was reading this link. In short, can someone explain the problem with current C++ compilers to someone who started learning x86 and 64-bit assembly a week ago?
Unfortunately current compilers don't optimize #craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)
For clarification: I was researching the benefits of assembly when I came across people in multiple posts saying that current compilers are not optimized when it comes to 64-bit multiplication, because they use only the lowest part and so do not perform a full 64-bit multiplication. What does this mean, and what is the meaning of getting the higher part? I also read in a book that on the 64-bit architecture only the lowest 32 bits of RFLAGS are used. Are these related? I am confused.
Most CPUs will allow you to start with two operands, each the size of a register, and multiply them together to get a result that fills two registers.
For example, on x86 if you multiply two 32-bit numbers, you'll get the upper 32 bits of the result in EDX and the lower 32 bits of the result in EAX. If you multiply two 64-bit numbers, you get the results in RDX and RAX instead.
On other processors, other registers are used, but the same basic idea applies: one register times one register gives a result that fills two registers.
C and C++ don't provide an easy way of taking advantage of that capability. When you operate on types smaller than int, the input operands are converted to int, the ints are multiplied, and the result is an int. If the inputs are at least as large as int, they're multiplied as that same type, and the result is the same type. Nothing is done to take into account that the full result is twice as big as the input types, even though virtually every processor on earth can produce a result twice as big as each individual input.
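To make the widening concrete, here is a hedged sketch of what you can express directly: the 32x32-to-64 case is just a cast, while the 64x64-to-128 case relies on a compiler extension (GCC/Clang's unsigned __int128), which is exactly the "128-bit type or an intrinsic" the quote above refers to. Names are illustrative only:

#include <cstdint>

// Portable: widen one operand before the multiply; compilers typically emit
// a single widening multiply instruction for this.
uint64_t full_mul_32(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(a) * b;
}

// Non-standard: 64x64 -> 128 via the __int128 extension (GCC/Clang only).
#if defined(__SIZEOF_INT128__)
void full_mul_64(uint64_t a, uint64_t b, uint64_t& hi, uint64_t& lo) {
    unsigned __int128 p = static_cast<unsigned __int128>(a) * b;
    lo = static_cast<uint64_t>(p);        // low 64 bits (RAX on x86-64)
    hi = static_cast<uint64_t>(p >> 64);  // high 64 bits (RDX on x86-64)
}
#endif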
There are, of course, ways of dealing with that. The simplest is the basic factoring we learned in grade school: take each number and break it up into upper and lower halves. We can then multiply those pieces together individually: (a+b) * (c+d) = ac + ad + bc + bd. Since each of those multiplications has only half as many non-zero bits, we can do each piece of arithmetic as a half-size operation producing a full-sized result (plus a single bit carried out from the addition). For example, if we wanted to do 64-bit multiplication on a 64-bit processor to get a 128-bit result, we'd break each 64-bit input up into 32-bit pieces. Each multiplication would then produce a 64-bit result, and we'd add the pieces together (with suitable bit-shifts) to get our final 128-bit result.
But, as Peter pointed out, when we do that, compilers are not smart enough to realize what we're trying to accomplish and turn that sequence of multiplications and additions back into a single multiplication producing a result twice as large as each input. Instead, they translate the expression fairly directly into a series of multiplications and additions, so it takes somewhere around four times longer than the single multiplication would have.
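For reference, a hedged sketch of the schoolbook split described above, computing a full 64x64-to-128 product from four 32x32-to-64 partial products; the names are illustrative only, and as noted, compilers generally will not fuse this back into a single widening multiply:

#include <cstdint>

struct U128 { uint64_t hi, lo; };   // result as a high/low pair of 64-bit halves

U128 mul_64x64_to_128(uint64_t a, uint64_t b) {
    const uint64_t mask32 = 0xFFFFFFFFu;
    uint64_t a_lo = a & mask32, a_hi = a >> 32;
    uint64_t b_lo = b & mask32, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;   // contributes to bits 0..63
    uint64_t lo_hi = a_lo * b_hi;   // contributes to bits 32..95
    uint64_t hi_lo = a_hi * b_lo;   // contributes to bits 32..95
    uint64_t hi_hi = a_hi * b_hi;   // contributes to bits 64..127

    // Sum the middle terms into bits 32..63, keeping the carry into bit 64.
    uint64_t mid = (lo_lo >> 32) + (lo_hi & mask32) + (hi_lo & mask32);
    uint64_t lo  = (lo_lo & mask32) | (mid << 32);
    uint64_t hi  = hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + (mid >> 32);
    return {hi, lo};
}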

Howto vblend for 32-bit integer? or: Why is there no _mm256_blendv_epi32?

I'm using the AVX2 x86 256-bit SIMD extensions. I want to do a 32-bit integer component-wise if-then-else instruction. In the Intel documentation such an instruction is called vblend.
The Intel intrinsic guide contains the function _mm256_blendv_epi8. This function does nearly what I need. The only problem is that it works with 8-bit integers. Unfortunately there is no _mm256_blendv_epi32 in the docs. My first question is: why does this function not exist? My second question is: how do I emulate it?
After some searching I found _mm256_blendv_ps which does what I want for 32-bit floating points. Further I found cast functions _mm256_castsi256_ps and _mm256_castps_si256 which cast from integers to 32-bit floats and back. Putting these together gives:
inline __m256i _mm256_blendv_epi32 (__m256i a, __m256i b, __m256i mask){
    return _mm256_castps_si256(
        _mm256_blendv_ps(
            _mm256_castsi256_ps(a),
            _mm256_castsi256_ps(b),
            _mm256_castsi256_ps(mask)
        )
    );
}
While this looks like 5 functions, 4 of them are only glorified casts and one maps directly onto a processor instruction. The whole function therefore boils down to one processor instruction.
The real awkward part therefore is that there seems to be a 32-bit blendv, except that the corresponding intrinsic is missing.
Is there some corner case where this will fail miserably? For example, what happens when the integer bit pattern happens to represent a floating-point NaN? Does blendv simply ignore this, or will it raise some signal?
In case this works: Am I correct that there is a 8-bit, a 32-bit and a 64-bit blendv but a 16-bit blendv is missing?
If your mask is already all-zero / all-one for the whole 32-bit element (like a vpcmpgtd result), use _mm256_blendv_epi8 directly.
My code relies on blendv only checking the highest bit.
Then you have two good options:
Broadcast the high bit within each element using an arithmetic right shift by 31 to set up for VPBLENDVB (_mm256_blendv_epi8). i.e. VPSRAD: mask=_mm256_srai_epi32(mask, 31).
VPSRAD is 1-uop on Intel Haswell, for port0. (More throughput on Skylake: p01). If your algorithm bottlenecks on port 0 (e.g. integer multiply and shift), this is not great.
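A sketch of this first option, assuming an AVX2 target (the wrapper name is illustrative only):

#include <immintrin.h>

// Broadcast bit 31 of each element across the whole element, then use the
// byte-granularity blend; the widened mask makes VPBLENDVB element-safe.
static inline __m256i blendv_epi32_via_epi8(__m256i a, __m256i b, __m256i mask) {
    __m256i wide_mask = _mm256_srai_epi32(mask, 31);   // VPSRAD: sign bit -> all 32 bits
    return _mm256_blendv_epi8(a, b, wide_mask);        // VPBLENDVB
}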
Use VBLENDVPS for throughput over latency. You're correct that all the casts are just to keep the compiler happy, and that VBLENDVPS will do exactly what you want in one instruction.
static inline
__m256i blendvps_si256(__m256i a, __m256i b, __m256i mask) {
    __m256 res = _mm256_blendv_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), _mm256_castsi256_ps(mask));
    return _mm256_castps_si256(res);
}
However, Intel SnB-family CPUs have a bypass-delay latency of 1 cycle when forwarding integer results to the FP blend unit, and another 1c latency when forwarding the blend results to other integer instructions. If this isn't part of a long dependency chain (across iterations), then probably saving uops will be better, letting OoO exec hide the extra latency.
For more about bypass-delay latency, see Agner Fog's microarch guide. It's the reason they don't make __m256i intrinsics for FP instructions, and vice versa. Note that since Sandybridge, FP shuffles don't have extra latency to forward from/to instructions like PADDD. So SHUFPS is a great way to combine data from two integer vectors if PUNPCK* or PALIGNR don't do exactly what you want. (SHUFPS on integers can be worth it even on Nehalem, where it does have a 2c penalty both ways, if throughput is your bottleneck.)
Try both ways and benchmark. Either way could be better, depending on surrounding code.
Latency might not matter compared to uop throughput / instruction count. Also note that if you're just storing the result to memory, store instructions don't care which domain the data was coming from.
But if you are using this as part of a long dependency chain, then it might be worth the extra instruction to avoid the extra 2 cycles of latency for the data being blended, if the critical path goes through the data being blended, not the mask.
Note that if the mask-generation is on the critical path, then VPSRAD's 1 cycle latency is equivalent to the bypass-delay latency, so using an FP blend is only 1 extra cycle of latency for the mask->result chain, vs. 2 extra cycles for the data->result chain. And if you're consuming the blend result with an instruction that can forward efficiently from either an FP or integer blend, then it's pure win to use an FP blend, saving an instruction (and its uop) for the same latency.
For example, what happens when the integer bit pattern happens to represent a floating-point NaN?
BLENDVPS doesn't care. Intel's insn ref manual fully documents everything an instruction can/can't do, and SIMD Floating-Point Exceptions: None means that this isn't a problem. See also the x86 tag wiki for links to docs.
FP blend/shuffle/bitwise-boolean/load/store instructions don't care about NaNs. Only instructions that do actual FP math (including CMPPS, MINPS, and stuff like that) raise FP exceptions or can possibly slow down with denormals.
Am I correct that there is a 8-bit, a 32-bit and a 64-bit blendv but a 16-bit blendv is missing?
Yes. But there are 32 and 16-bit arithmetic shifts, so it costs at most one extra instruction to use the 8-bit granularity blend. (There is no PSRAQ, so blendv of 64-bit integers is often best done with BLENDVPD, unless maybe the mask-generation is off the critical path and/or the same mask will be reused many times on the critical path.)
The most common use-case is for compare-masks where each element is all-ones or all-zeros already, so you could blend with PAND/PANDN => POR. Of course, clever tricks that leave just the sign-bit of your mask with the truth value can save instructions and latency, especially since variable-blends are somewhat faster than three boolean bitwise instructions. (e.g. ORPS two float vectors to see if they're both non-negative, instead of 2x CMPPS and ORing the masks. This can work great if you don't care about negative zero, or you're happy to treat underflow to -0.0 as negative).
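For comparison, a hedged sketch of that three-instruction boolean blend (valid only when the mask is already all-ones / all-zeros per element; the wrapper name is illustrative only):

#include <immintrin.h>

// PAND/PANDN/POR select: take b where the mask is set, a where it is clear.
static inline __m256i blend_bool_si256(__m256i a, __m256i b, __m256i mask) {
    return _mm256_or_si256(_mm256_and_si256(mask, b),       // b & mask
                           _mm256_andnot_si256(mask, a));   // a & ~mask
}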

Is a letter compare slower than number compare?

Bear with me here.
A couple months ago I remember my algorithms teacher discussing the implementation of bucket sort with us (named Distribution sort in my algorithms book) and how it works. Basically, instead of taking a number at face value, we start comparing by the binary representation like so:
// 32 bit integers.
Input: 9 4
4: 00000000 00000000 00000000 00000100
9: 00000000 00000000 00000000 00001001
// Etc.
and start comparing from right to left.
// First step.
4: 0
9: 1
Output: 9 4
// Second step
4: 0
9: 0
Output: 9 4 // Both bits are equal here; the stable algorithm keeps the previous order.
// Third step
4: 1
9: 0
Output: 4 9
// Fourth step
4: 0
9: 1
Output: 9 4
And that's it; the other 28 iterations are all zeroes, so the output won't change anymore. Now, comparing a whole bunch of strings like this would go
// strings
Input: "Christian" "Denis"
Christian: C h r i s t i a n
Denis: D e n i s
// First step.
Christian: n
Denis: s
Output: Christian, Denis
// Second step
Christian: a
Denis: i
Output: Denis, Christian
// ...
and so forth.
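For reference, here is a hedged sketch of the bit-at-a-time distribution sort the integer example walks through, written with the usual ascending convention (0-bits placed before 1-bits, unlike the walkthrough above, which places 1-bits first); every pass is stable:

#include <cstdint>
#include <vector>

void radix_sort_bits(std::vector<uint32_t>& v) {
    std::vector<uint32_t> zeros, ones;
    for (int bit = 0; bit < 32; ++bit) {
        zeros.clear();
        ones.clear();
        for (uint32_t x : v)                                   // distribute by the current bit,
            (((x >> bit) & 1u) ? ones : zeros).push_back(x);   // preserving relative order
        v.assign(zeros.begin(), zeros.end());
        v.insert(v.end(), ones.begin(), ones.end());
    }
}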
My question is, is comparing a signed char, a one-byte figure, faster than comparing ints?
If I had to assume, a 1 byte char is compared faster than a 4-byte integer. Is this correct? Can I make the same assumption with wchar_t, or UTF-16/32 formats?
In C or C++, a char is simply a one-byte integer (though "one byte" may or may not be 8 bits). That means that in a typical case, the only difference you have to deal with is whether a single-byte comparison is faster than a multi-byte comparison.
At least in most cases, the answer is no. Many RISC processors don't have instructions to deal with single bytes at all, so an operation on a single byte is carried out by sign-extending the byte to a word, operating on the word, and then (if necessary) masking all the bits outside of the single byte back to zeros -- i.e., operating on a whole word can often be around triple the speed of operating on a single byte.
Even on something like an x86 that supports single-byte operations directly, they're still often slower (on a modern processor). There are a couple of things that contribute to this. First of all, the instructions using registers of the size "natural" to the current mode have a simpler encoding than instructions using other sizes. Second, a fair number of x86 processors have what's called a "partial register stall" -- even though it's all implicit, internally they do something like the RISC does, carrying out an operation on a full-sized register, then merging it with the other bytes of the original value. For example, if you produce a result in AL then refer to EAX, the sequence will take longer to execute than if you produced the result in EAX to start with.
OTOH, if you look at old enough processors the reverse could be (and often was) true. For an extreme example, consider the Intel 8080 or Zilog Z80. Both had some 16-bit instructions, but the paths through the ALU were only 8 bits wide -- a 16-bit addition, for example, was actually carried out as two consecutive 8-bit additions. If you could get by with only an 8-bit operation, it was about twice as fast. Although 8-bit processors are a (distant) memory on desktop machines, they're still used in some embedded applications, so this isn't entirely obsolete either.
One-byte chars are compared as numbers in C++. The exact speed depends on the host CPU platform; usually it's the same as the speed of comparing 4-byte integers.
You cannot assume anything about which type of comparison is faster; it depends on your particular platform.
Typically, int is the most "comfortable" size for the CPU, and so comparing these will usually be fastest. Anything larger could well be slower, as it may need to be broken down into multiple ints. Anything smaller may be as fast as an int, but depending on the memory architecture, mis-aligned reads may take longer.
On top of all this, there is the memory-bandwidth factor. The larger the type, the higher the required bandwidth. And then there's caching effects on top of that. If the bottleneck is the CPU speed, then this doesn't matter. Otherwise, it does.
My question is, is comparing a signed char, a one-byte figure, faster than comparing ints?
No. In C++, these operations will certainly be identical in speed. Modern CPUs do most operations on bytes in groups of 4 anyway¹, so 1 byte vs. 4 bytes will not shave off any computation time.
Please assume that conversion to binary with the integer example is irrelevant
No conversion happens; numbers are represented in binary inside the PC anyway.
¹ Gross simplification. But for the sake of argument we can state that an int in C++ will always be the “native” unit of measure on a given CPU.
If I had to assume, a 1 byte char is compared faster than a 4-byte integer. Is this correct?
I very much doubt it. If I were to guess, my bet would be the other way around, if either is slower than the other. Reason? Most of today's processors are built to work directly with 4-byte types.
Can I make the same assumption with wchar_t, or UTF-16/32 formats?
No. UTF formats are much more involved and cannot be compared directly, byte for byte, unless you're strictly checking for equality.
You really shouldn't be worrying about this kind of speed issue. If your instructor is teaching you to be concerned about the speed of comparing a 1 byte type vs. a 4 byte type then you really need to take everything they say with a LOT of salt. Write efficient algorithms, don't try optimizing at this level of detail.
The answer is "alignment". Comparing chars that are not aligned on a natural word boundary will always be slower than comparing aligned data. Other than that, the processor does multiple operations per cycle in its pipeline, and many other conditions have an effect on performance.
As Al Kepp said, this depends on your platform. However, most CPUs have a built-in instruction to compare words, which, being a single CPU instruction, always takes the same time as long as the data you are comparing fits in a single word.
CMP x86 Assembly