ARM Cortex-M0, shift buffer, bit-level - cortex-m

I need a fast bitwise shift of a buffer on a Cortex-M0.
Is it possible, using inline asm, to get the address of a buffer
static uint8_t tmp[30];
and rotate the whole thing right by 1 position (through carry)?
I can't find proper guidance for GCC inline asm on the M0. On a Microchip PIC16/PIC18,
I'd simply rotate through carry (the default), one instruction after another for each byte of the buffer (memory address):
rlf buff+0,F
rlf buff+1,F
rlf buff+2,F
etc
Is this possible on the M0?
Thanks in advance,

In GCC inline asm, to get the address of your buffer, use:
ldr r0, =tmp
and then you can rotate with the assembly instruction:
RORS {Rd,} Rm, Rs
See the Cortex-M0 programming manual from ARM (or your microcontroller vendor's reference) for details of this instruction.
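If hand-written asm isn't strictly required, a plain C version the compiler can optimize for Thumb-1 may be enough. A minimal sketch (the function name is mine; it assumes tmp[0] holds the most-significant byte, so the bit shifted out of one byte carries into the next):
#include <stdint.h>

static uint8_t tmp[30];

// Shift the whole buffer right by one bit; the bit shifted out of tmp[i]
// becomes the top bit of tmp[i+1] (a multi-byte shift "through carry").
static void shift_buffer_right_1(void)
{
    uint8_t carry = 0;
    for (unsigned i = 0; i < sizeof tmp; ++i) {
        uint8_t next_carry = tmp[i] & 1u;                  // bit falling out of this byte
        tmp[i] = (uint8_t)((tmp[i] >> 1) | (carry << 7));  // shift in the previous carry
        carry = next_carry;
    }
}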

Related

Instruction/intrinsic for taking higher half of uint64_t in C++?

Imagine the following code:
Try it online!
uint64_t x = 0x81C6E3292A71F955ULL;
uint32_t y = (uint32_t) (x >> 32);
y receives the high 32-bit part of the 64-bit integer. My question is whether there exists any intrinsic function or CPU instruction that does this in a single operation, without doing a move and a shift?
At least Clang (linked in Try-it-online above) creates two instructions, mov rax, rdi and shr rax, 32, for this, so either Clang doesn't do such an optimization, or no such special instruction exists.
It would be great if there were a single instruction like movhi dst_reg, src_reg.
If there was a better way to do this bitfield-extraction for an arbitrary uint64_t, compilers would already use it. (At least in theory; compilers do have missed optimizations, and their choices sometimes favour latency even if it costs more uops.)
You only need intrinsics for things that you can't express efficiently in pure C, in ways the compiler can already easily understand. (Or if your compiler is dumb and can't spot the obvious.)
You could maybe imagine cases where the input value comes from the multiply of two 32-bit values, then it might be worthwhile on some CPUs for the compiler to use widening mul r32 to already generate the result in two separate 32-bit registers, instead of imul r64, r64 + shr reg,32, if it can easily use EAX/EDX. But other than gcc -mtune=silvermont or other tuning options, you can't make the compiler do it that way.
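For illustration, a minimal sketch of that widening-multiply pattern (the function name is mine): the high half of a full 32x32-bit product, which a compiler could in principle produce directly in a second register via a widening multiply such as x86's mul r32:
#include <stdint.h>

uint32_t mulhi_u32(uint32_t a, uint32_t b)
{
    uint64_t prod = (uint64_t)a * b;   // full 64-bit product of two 32-bit inputs
    return (uint32_t)(prod >> 32);     // high half
}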
shr reg, 32 has 1 cycle latency, and can run on more than 1 execution port on most modern x86 microarchitectures (https://uops.info/). The only thing one might wish for is that it could put the result in a different register, without overwriting the input.
Most modern non-x86 ISAs are RISC-like with 3-operand instructions, so a shift instruction can copy-and-shift, unlike x86 shifts where the compiler needs a mov in addition to shr if it also needs the original 64-bit value later, or (in the case of a tiny function) needs the return value in a different register.
And some ISAs have bitfield-extract instructions. PowerPC even has a fun rotate-and-mask instruction (rlwinm) (with the mask being a bit-range specified by immediates), and it's a different instruction from a normal shift. Compilers will use it as appropriate - no need for an intrinsic. https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99465
x86 with BMI2 has rorx rax, rdi, 32 to copy-and-rotate, instead of being stuck shifting within the same register. A function returning uint32_t could/should use that instead of mov+shr, in the stand-alone version that doesn't inline because the caller already has to ignore high garbage in RAX. (Both x86-64 System V and Windows x64 define the return value as only the register width matching the C type of the arg; e.g. returning uint32_t means that the high 32 bits of RAX are not part of the return value, and can hold anything. Usually they're zero because writing a 32-bit register implicitly zero-extends to 64, but something like return bar() where bar returns uint64_t can just leave RAX untouched without having to truncate it; in fact an optimized tailcall is possible.)
There's no intrinsic for rorx; compilers are just supposed to know when to use it. (But gcc/clang -O3 -march=haswell miss this optimization.) https://godbolt.org/z/ozjhcc8Te
If a compiler was doing this in a loop, it could have 32 in a register for shrx reg,reg,reg as a copy-and-shift. Or more silly, it could use pext with 0xffffffffULL << 32 as the mask. But that's strictly worse than shrx because of the higher latency.
AMD TBM (Bulldozer-family only, not Zen) had an immediate form of bextr (bitfield-extract), and it ran efficiently as 1 uop (https://agner.org/optimize/). https://godbolt.org/z/bn3rfxzch shows gcc11 -O3 -march=bdver4 (Excavator) uses bextr rax, rdi, 0x2020, while clang misses that optimization. gcc -march=znver1 uses mov + shr because Zen dropped Trailing Bit Manipulation along with the XOP extension.
Standard BMI1 bextr needs position/len in a register, and on Intel CPUs is 2 uops so it's garbage for this. It does have an intrinsic, but I recommend not using it. mov+shr is faster on Intel CPUs.

Does Intel have a separate instruction set for its GPU

Assume I'm using my Intel x64-based laptop with no dedicated GPU.
I must have some GPU onboard, otherwise my screen wouldn't work, right?
Are onboard GPUs typically embedded into the CPU?
Does Intel have a separate instruction set for its GPU? If so, is there a doc?
Do GPU instructions greatly differ from CPU instructions? For example, do GPUs have
shift, add, load, store instructions as well? What other instructions do they have
that regular CPUs don't have?
Is there a difference between the instruction set/pipeline of an onboard GPU vs. a dedicated one, or
is the difference just about the number of extra cores and dedicated RAM?
On a machine with a dedicated GPU, how do the instructions generated from C++ OpenGL code get executed on the GPU rather than ending up on the regular CPU?
Full hardware reference
Full documentation of Intel's graphics controllers can be found at 01.org:
Hardware Specification - PRMs (published by Paul Parenteau, last modified Jun 15, 2020)
Answering question 2: yes, there are separate assembly instructions, as detailed below (from "Introduction to GEN assembly").
General form of Intel GPU assembly
Typically, all instructions have the following form:
[(pred)] opcode (exec-size|exec-offset) dst src0 [src1] [src2]
(pred) is the optional predicate. We are going to skip it for now.
opcode is the symbol of the instruction, like add or mov (a full table of opcodes is given in the linked document).
exec-size is the SIMD width of the instruction, which on this architecture can be 1, 2, 4, 8, or 16. In SIMD32 compilation, typically two instructions of execution size 8 or 16 are grouped into one.
exec-offset is the part telling the EU which part of the ARF registers to read or write, e.g. (8|M24) consults bits 24-31 of the execution mask. When emitting SIMD16 or SIMD32 code like the following:
mov (8|M0) r11.0<1>:q r5.0<8;8,1>:d // id:1
mov (8|M8) r13.0<1>:q r6.0<8;8,1>:d // id:1
mov (8|M16) r15.0<1>:q r9.0<8;8,1>:d // id:1
mov (8|M24) r17.0<1>:q r10.0<8;8,1>:d // id:1
(mov instructions of SIMD32 assembly)
the compiler has to emit four 8-wide operations due to a limitation of how many bytes can be accessed per operand in the GRF.
dst is a destination register
src0 is a source register
src1 is an optional source register. Note that it could also be an immediate value, like 0x3F000000:f (0.5) or 0x2A:ud (42).
src2 is an optional source register.
General Register File (GRF) Registers
Each thread has a dedicated space of 128 registers, r0 through r127. Each register is 256 bits or 32 bytes.
Architecture Register File (ARF) Registers
In the assembly code above, we only saw one of these special registers, the null register, which is typically used as a destination for send instructions that write and indicate end of thread. The full tables of the other architecture registers and of the available GEN assembly instructions can be found in the linked "Introduction to GEN assembly" document.

GCC w/ inline assembly & -Ofast generating extra code for memory operand

I am passing the address of an indexed table entry into an extended inline assembly operand, but GCC is producing an extra lea instruction when it is not necessary, even when using -Ofast -fomit-frame-pointer or -Os -f.... GCC is using RIP-relative addresses.
I was creating a function for converting two consecutive bits into a two-part XMM mask (1 quadword mask per bit). To do this, I am using _mm_cvtepi8_epi64 (internally vpmovsxbq) with a memory operand from an 8-byte table, with the bits as index.
When I use the intrinsic, GCC produces exactly the same code as when using the extended inline assembly.
I can directly embed the memory operation into the ASM template, but that would force RIP-relative addressing always (and I don't like forcing myself into workarounds).
typedef uint64_t xmm2q __attribute__ ((vector_size (16)));
// Used for converting 2 consecutive bits (as index) into a 2-elem XMM mask (pmovsxbq)
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };
xmm2q mask2b(uint64_t mask) {
assert(mask < 4);
#ifdef USE_ASM
xmm2q result;
asm("vpmovsxbq %1, %0" : "=x" (result) : "m" (MASK_TABLE[mask]));
return result;
#else
// bad cast (UB?), but input should be `uint16_t*` anyways
return (xmm2q) _mm_cvtepi8_epi64(*((__m128i*) &MASK_TABLE[mask]));
#endif
}
Output assembly with -S (with USE_ASM and without):
__Z6mask2by: ## #_Z6mask2by
.cfi_startproc
## %bb.0:
leaq __ZL10MASK_TABLE(%rip), %rax
vpmovsxbq (%rax,%rdi,2), %xmm0
retq
.cfi_endproc
What I was expecting (I've removed all the extra stuff):
__Z6mask2by:
vpmovsxbq __ZL10MASK_TABLE(%rip,%rdi,2), %xmm0
retq
The only RIP-relative addressing mode is RIP + rel32. RIP + reg is not available.
(In machine code, 32-bit code used to have 2 redundant ways to encode [disp32]. x86-64 uses the shorter (no SIB) form as RIP relative, the longer SIB form as [sign_extended_disp32]).
If you compile for Linux with -fno-pie -no-pie, GCC will be able to access static data with a 32-bit absolute address, so it can use a mode like __ZL10MASK_TABLE(,%rdi,2). This isn't possible for MacOS, where the base address is always above 2^32; 32-bit absolute addressing is completely unsupported on x86-64 MacOS.
In a PIE executable (or PIC code in general like a library), you need a RIP-relative LEA to set up for indexing a static array. Or any other case where the static address won't fit in 32 bits and/or isn't a link-time constant.
Intrinsics
Yes, intrinsics make it very inconvenient to express a pmovzx/sx load from a narrow source because pointer-source versions of the intrinsics are missing.
*((__m128i*) &MASK_TABLE[mask]) isn't safe: if you disable optimization, you might well get a movdqa 16-byte load, but the address will be misaligned. It's only safe when the compiler folds the load into a memory operand for pmovzxbq, which has a 2-byte memory operand and therefore doesn't require alignment.
In fact current GCC does compile your code with a movdqa 16-byte load like movdqa xmm0, XMMWORD PTR [rax+rdi*2] before a reg-reg pmovzx. This is obviously a missed optimization. :( clang/LLVM (which MacOS installs as gcc) does fold the load into pmovzx.
The safe way is _mm_cvtepi8_epi64( _mm_cvtsi32_si128(MASK_TABLE[mask]) ) or something, and then hoping the compiler optimizes away the zero-extend from 2 to 4 bytes and folds the movd into a load when you enable optimization. Or maybe try _mm_loadu_si32 for a 32-bit load even though you really want 16. But last time I tried, compilers sucked at folding a 64-bit load intrinsic into a memory operand for pmovzxbw for example. GCC and clang still fail at it, but ICC19 succeeds. https://godbolt.org/z/IdgoKV
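For concreteness, a sketch of that safe version applied to the question's table (whether the compiler folds the movd into a memory operand for vpmovsxbq is up to it):
#include <immintrin.h>
#include <stdint.h>

static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };

// 32-bit scalar load of the 2-byte table entry into the low vector element,
// then sign-extend the two low bytes to quadwords (vpmovsxbq).
__m128i mask2b_safe(uint64_t mask)
{
    return _mm_cvtepi8_epi64(_mm_cvtsi32_si128(MASK_TABLE[mask]));
}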
I've written about this before:
Loading 8 chars from memory into an __m256 variable as packed single precision floats
How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
Your integer -> vector strategy
Your choice of pmovsx seems odd. You don't need sign-extension, so I would have picked pmovzx (_mm_cvtepu8_epi64). It's not actually more efficient on any CPUs, though.
A lookup table does work here with only a small amount of static data needed. If your mask range was any bigger, you'd maybe want to look into
is there an inverse instruction to the movemask instruction in intel avx2? for alternative strategies like broadcast + AND + (shift or compare).
If you do this often, using a whole cache line of 4x 16-byte vector constants might be best so you don't need a pmovzx instruction, just index into an aligned table of xmm2q or __m128i vectors which can be a memory source for any other SSE instruction. Use alignas(64) to get all the constants in the same cache line.
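A sketch of that layout, assuming full all-ones / all-zeros quadword masks are what you ultimately want (the names are mine; note this differs slightly from what pmovsx of the 0x80 bytes produces, so check what the consumer of the mask actually needs):
#include <immintrin.h>
#include <stdint.h>

// One cache line holding the 4 possible pre-expanded masks:
// element i of entry m is all-ones iff bit i of m is set.
alignas(64) static const uint64_t MASK_LUT[4][2] = {
    {     0,     0 },
    { ~0ULL,     0 },
    {     0, ~0ULL },
    { ~0ULL, ~0ULL },
};

__m128i mask2b_lut(uint64_t mask)
{
    return _mm_load_si128((const __m128i *)MASK_LUT[mask]);  // aligned 16-byte load
}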
You could also consider (intrinsics for) pdep + movd xmm0, eax + pmovzxbq reg-reg if you're targeting Intel CPUs with BMI2. (pdep is slow on AMD, though).
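A sketch of that, keeping the question's pmovsx so the result matches mask2b (requires SSE4.1 and BMI2, e.g. compile with -msse4.1 -mbmi2):
#include <immintrin.h>
#include <stdint.h>

__m128i mask2b_pdep(uint64_t mask)
{
    // Spread bit 0 of mask into byte 0 (0x80) and bit 1 into byte 1 (0x8000),
    // then movd into a vector and sign-extend the two low bytes to quadwords.
    uint32_t bytes = _pdep_u32((uint32_t)mask, 0x8080u);
    return _mm_cvtepi8_epi64(_mm_cvtsi32_si128((int)bytes));
}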

Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]

I am using SSE intrinsics to determine if a rectangle (defined by four int32 values) has changed:
__m128i oldRect; // contains old left, top, right, bottom packed to 128 bits
__m128i newRect; // contains new left, top, right, bottom packed to 128 bits
__m128i xor = _mm_xor_si128(oldRect, newRect);
At this point, the resulting xor value will be all zeros if the rectangle hasn't changed. What is then the most efficient way of determining that?
Currently I am doing so:
if (xor.m128i_u64[0] | xor.m128i_u64[1])
{
// rectangle changed
}
But I assume there's a smarter way (possibly using some SSE instruction that I haven't found yet).
I am targeting SSE4.1 on x64 and I am coding C++ in Visual Studio 2013.
Edit: The question is not quite the same as Is an __m128i variable zero?, as that specifies "on SSE-2-and-earlier processors" (although Antonio did add an answer "for completeness" that addresses 4.1 some time after this question was posted and answered).
You can use the PTEST instruction via the _mm_testz_si128 intrinsic (SSE4.1), like this:
#include "smmintrin.h" // SSE4.1 header
if (!_mm_testz_si128(xor, xor))
{
// rectangle has changed
}
Note that _mm_testz_si128 returns 1 if the bitwise AND of the two arguments is zero.
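Wrapped up as a small helper (a sketch; the function name is mine):
#include <smmintrin.h>  // SSE4.1

static inline bool rect_changed(__m128i oldRect, __m128i newRect)
{
    __m128i diff = _mm_xor_si128(oldRect, newRect);
    return !_mm_testz_si128(diff, diff);  // _mm_testz_si128 returns 1 when diff is all zero
}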
Ironically, the ptest instruction from SSE4.1 may be slower than pmovmskb from SSE2 in some cases. I suggest simply using:
__m128i cmp = _mm_cmpeq_epi32(oldRect, newRect);
if (_mm_movemask_epi8(cmp) != 0xFFFF)
//registers are different
Note that if you really need that xor value, you'll have to compute it separately.
For Intel processors like Ivy Bridge, the version by PaulR with xor and _mm_testz_si128 translates into 4 uops, while the suggested version without computing xor translates into 3 uops (see also this thread). This may result in better throughput for my version.
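As a complete sketch of this variant (the function name is mine; SSE2 is enough):
#include <emmintrin.h>  // SSE2

static inline bool rect_changed_sse2(__m128i oldRect, __m128i newRect)
{
    __m128i cmp = _mm_cmpeq_epi32(oldRect, newRect);
    return _mm_movemask_epi8(cmp) != 0xFFFF;  // any differing dword clears bits of the mask
}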

Fast way to get a valid imm num for arm mov?

The ARM mov instruction has a limitation that the immediate must be an 8-bit value rotated right by a multiple of 2; we can write:
mov ip, #0x5000
But we cannot write that:
mov ip, #0x5001
The 0x5001 can be split as 0x5000 + 1, I mean, as the sum of a valid immediate and a small number.
So for a given 32-bit number, how can I quickly find the closest valid immediate? Like this:
uint32 find_imm(uint32 src, bool less_than_src) {
...
}
// x is 0x5000
uint32 x = find_imm(0x5001, true);
It is quite simple: look at the distance between the ones. 0x5001 = 0b101000000000001 has 15 significant digits, so it will take you two instructions at 8 bits of immediate each. Also remember to allow for rotation in your test: with enough zeros, a value such as 0x80000001 (which wraps around), 0x88000000, or 0x00000003 has only two significant digits by a distance-between-the-ones measurement. So take the immediate, perform a distance-between-the-ones test, rotate one step, perform the test again, and repeat until all the possible (counter-)rotations have been tried; then go with one of the candidates that needs the smallest number of instructions/immediates.
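For reference, a brute-force validity test in C along these lines (this only checks whether a value is encodable as a single classic ARM data-processing immediate — an 8-bit value rotated right by an even amount — it is not the full closest-value search):
#include <stdint.h>
#include <stdbool.h>

static bool is_arm_imm(uint32_t v)
{
    for (unsigned r = 0; r < 32; r += 2) {
        // Rotate left by r to undo a rotate-right-by-r encoding.
        uint32_t undone = (r == 0) ? v : ((v << r) | (v >> (32u - r)));
        if (undone <= 0xFFu)
            return true;   // fits in 8 bits after undoing the rotation
    }
    return false;
}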
GNU as already does this, and gas is open source, so you can just go get their code if you prefer. When you use the load-address trick:
ldr rd,=const
If that const can be resolved with a single move-immediate instruction then it encodes it as a
mov rd,#const
if it can't, then it tries to find a location to put the word and encodes it as a PC-relative load:
ldr rd,[pc,#offset]
...
.word const
There is not a straightforward rule or function for finding ways to construct values. Once a value exceeds what can be loaded easily from immediate values, you usually load it by defining it in the data section and loading it from memory, rather than constructing it from immediate values.
If you do want to construct a value from two immediate values, you must consider a variety of operations, including:
Adding two immediates.
Subtracting two immediates.
Multiplying two immediates.
More esoteric instructions, such as some of the “SIMD” instructions that split 32-bit registers into multiple lanes.
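For instance, the question's 0x5001 splits as 0x5000 + 1, so two instructions with valid immediates suffice:
mov ip, #0x5000      @ 0x50 rotated into position: a valid immediate
add ip, ip, #1       @ ip = 0x5001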
If you must go to three immediate values, there are more combinations. One can find some patterns in the possibilities that reduce the search, but some portion of it remains a “brute force” search. Generally, there is no point in using complicated instructions sequences, since you can simply load the data from a prepared location in memory.
The ARM assembler has an instruction form to assist this:
LDR Rd, =const
When the assembler sees this, it places the const value in the literal pool and generates an instruction to load the value from the pool. If you are using a different assembler, it might not have the same instruction form, but you can write the necessary code manually.