Is a memory barrier meaningful only in SMP?

I understand why memory barriers are needed, but I don't get the uniprocessor (UP) case.
Do I have to deal with barriers even on UP? Every document explains them for SMP but not for UP.
In the following code, is there any possibility that r2 == 0 at point a?
// the location 0xdeadbeef has a zero initial value
ldr r0, =0xdeadbeef   @ r0 = address of the location
ldr r1, =0xdeadbeef   @ r1 = the same address
ldr r2, =1
str r2, [r0]          @ store 1 through r0
ldr r2, [r1]          @ load it back through the aliasing r1
// point a

There are memory barriers and compiler barriers.
Memory barriers are not required on a single processor (I'm not sure whether hyperthreading counts as multiple processors), but compiler barriers are: the compiler could reorder the code each thread executes in a way that makes you fail.
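For example, with GCC a pure compiler barrier can be written as an empty asm statement with a "memory" clobber (a minimal sketch; the macro name is my own):
// Tells the compiler not to reorder or cache memory accesses across this
// point. It emits no CPU instruction, so it is not a hardware barrier.
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")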

Memory barriers are only needed for shared ("global") data, because locals (on the stack) and register contents are saved and restored automatically when threads are switched.
Maybe writing for the general case is better than assuming you will always be dealing with UP.

How does the following C++ function pass parameters to the following ARM assembly function?

It's been a while since I last coded ARM assembler and I'm a little rusty on the details. If I call a C function from ARM assembly, I only have to worry about saving r0-r3 and lr, right?
If the C function uses any other registers, is it responsible for saving those on the stack and restoring them? In other words, the compiler would generate code to do this for C functions.
For example if I use r10 in an assembler function, I don't have to push its value on the stack, or to memory, and pop/restore it after a C call, do I?
This is for arm-eabi-gcc 4.3.0.
It depends on the ABI for the platform you are compiling for. On Linux, there are two ARM ABIs: the old one and the new one. AFAIK, the new one (EABI) is in fact ARM's AAPCS. The complete EABI definitions currently live on ARM's Infocenter.
From the AAPCS, §5.1.1:
r0-r3 are the argument and scratch registers; r0-r1 are also the result registers
r4-r8 are callee-save registers
r9 might be a callee-save register or not (on some variants of AAPCS it is a special register)
r10-r11 are callee-save registers
r12-r15 are special registers
A callee-save register must be saved by the callee (as opposed to a caller-save register, where the caller saves it); so, if this is the ABI you are using, you do not have to save r10 before calling another function (the other function is responsible for saving it).
Edit: Which compiler you are using makes no difference; gcc in particular can be configured for several different ABIs, and it can even be changed on the command line. Looking at the prologue/epilogue code it generates is not that useful, since it is tailored for each function and the compiler can use other ways of saving a register (for instance, saving it in the middle of a function).
Terminology: "callee-save" is a synonym for "non-volatile" or "call-preserved": What are callee and caller saved registers?
When making a function call, you can assume that the values in r4-r11 (except maybe r9) are still there after (call-preserved), but not for r0-r3 (call-clobbered / volatile).
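As a small illustration (a sketch; ext() and keep_across_call() are hypothetical names), a value that must live across a call will typically be kept in one of the call-preserved registers, and the prologue/epilogue pays for saving it:
extern void ext(void);          // may clobber r0-r3 and r12 (call-clobbered)
int keep_across_call(int counter) {
    counter += 1;               // live across the call below
    ext();
    return counter;             // still valid: the compiler kept it in r4-r11
}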
32-bit ARM calling conventions are specified by the AAPCS. From the AAPCS, §5.1.1 Core registers:
r0-r3 are the argument and scratch registers; r0-r1 are also the result registers
r4-r8 are callee-save registers
r9 might be a callee-save register or not (on some variants of AAPCS it is a special register)
r10-r11 are callee-save registers
r12-r15 are special registers
From the AAPCS, §5.1.2.1 VFP register usage conventions:
s16–s31 (d8–d15, q4–q7) must be preserved
s0–s15 (d0–d7, q0–q3) and d16–d31 (q8–q15) do not need to be preserved
Original post:
arm-to-c-calling-convention-neon-registers-to-save
64-bit ARM calling conventions are specified by the AAPCS64. Its General-purpose Registers section specifies which registers must be preserved.
r0-r7 are parameter/result registers
r9-r15 are temporary registers
r19-r28 are callee-saved registers.
All others (r8, r16-r18, r29, r30, SP) have special meaning and some might be treated as temporary registers.
SIMD and Floating-Point Registers specifies Neon and floating point registers.
For 64-bit ARM, A64 (from the Procedure Call Standard for the ARM 64-bit Architecture):
There are thirty-one, 64-bit, general-purpose (integer) registers visible to the A64 instruction set; these are labeled r0-r30. In a 64-bit context these registers are normally referred to using the names x0-x30; in a 32-bit context the registers are specified by using w0-w30. Additionally, a stack-pointer register, SP, can be used with a restricted number of instructions.
SP The Stack Pointer
r30 LR The Link Register
r29 FP The Frame Pointer
r19…r28 Callee-saved registers
r18 The Platform Register, if needed; otherwise a temporary register.
r17 IP1 The second intra-procedure-call temporary register (can be used by call veneers and PLT code); at other times may be used as a temporary register.
r16 IP0 The first intra-procedure-call scratch register (can be used by call veneers and PLT code); at other times may be used as a temporary register.
r9…r15 Temporary registers
r8 Indirect result location register
r0…r7 Parameter/result registers
The first eight registers, r0-r7, are used to pass argument values into a subroutine and to return result values from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).
Registers r16 (IP0) and r17 (IP1) may be used by a linker as a scratch register between a routine and any subroutine it calls. They can also be used within a routine to hold intermediate values between subroutine calls.
The role of register r18 is platform specific. If a platform ABI has need of a dedicated general purpose register to carry inter-procedural state (for example, the thread context) then it should use this register for that purpose. If the platform ABI has no such requirements, then it should use r18 as an additional temporary register. The platform ABI specification must document the usage for this register.
SIMD
The ARM 64-bit architecture also has a further thirty-two registers, v0-v31, which can be used by SIMD and Floating-Point operations. The precise name of the register will change indicating the size of the access.
Note: Unlike in AArch32, in AArch64 the 128-bit and 64-bit views of a SIMD and Floating-Point register do not overlap multiple registers in a narrower view, so q1, d1 and s1 all refer to the same entry in the register bank.
The first eight registers, v0-v7, are used to pass argument values into a subroutine and to return result values from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).
Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64-bits of each value stored in v8-v15 need to be preserved; it is the responsibility of the caller to preserve larger values.
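As a sketch of what that means in practice (callee() and across_call() are hypothetical names): if a full 128-bit vector is live across a call, the compiler cannot simply leave it in v8-v15, since only the low 64 bits of those registers are preserved, so it must spill/reload the whole q register:
#include <arm_neon.h>
extern void callee(void);
uint32x4_t across_call(uint32x4_t v) {
    callee();                   // only the low halves of v8-v15 survive this
    return vaddq_u32(v, v);     // the full 128-bit v had to be preserved by us
}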
The answers of CesarB and Pavel provided quotes from the AAPCS, but open issues remain. Does the callee save r9? What about r12? What about r14? Furthermore, the answers were very general, and not specific to the arm-eabi toolchain as requested. Here's a practical approach to find out which registers are callee-saved and which are not.
The following C code contains an inline assembly block that claims to modify registers r0-r12 and r14. The compiler will generate the code to save the registers required by the ABI.
void foo() {
    asm volatile ("nop" : : : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12", "r14");
}
Use the command line arm-eabi-gcc-4.7 -O2 -S -o - foo.c, adding the switches for your platform (e.g. -mcpu=arm7tdmi).
The command will print the generated assembly code on STDOUT. It may look something like this:
foo:
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
nop
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
Note that the compiler-generated code saves and restores r4-r11 (sl and fp are r10 and r11). The compiler does not save r0-r3 or r12. That it restores r14 (alias lr) is purely incidental: I know from experience that the exit code may also load the saved lr into r0 and then do a "bx r0" instead of "bx lr". By adding -mcpu=arm7tdmi -mno-thumb-interwork, or by using -mcpu=cortex-m4 -mthumb, we obtain slightly different assembly code that looks like this:
foo:
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
nop
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}
Again, r4-r11 are saved and restored. But r14 (alias lr) is not restored.
To summarize:
r0-r3 are not callee-saved
r4-r11 are callee-saved
r12 (alias ip) is not callee-saved
r13 (alias sp) is callee-saved
r14 (alias lr) is not callee-saved
r15 (alias pc) is the program counter; returning from the function sets it to the value lr held at the time of the call
This holds at least for arm-eabi-gcc's defaults. There are command-line switches (in particular the -mabi switch) that may influence the results.
According to ARM's aapcs32 and aapcs64, the rules can be summarized in a single table (available as an online view).
There is also a difference, at least on the Cortex-M3 architecture, between a function call and an interrupt.
If an interrupt occurs, the hardware automatically pushes R0-R3, R12, LR, and PC onto the stack, and automatically pops them when returning from the IRQ. If you use other registers in your IRQ routine, you have to push/pop them onto the stack manually.
This automatic push and pop does not happen for a function call (a jump instruction). The convention says R0-R3 may only be used as argument, result, or scratch registers, so the caller has no need to store them across a call, since no value in them should be used after the function returns. But, just as in an interrupt, you have to save all the other CPU registers if you use them in your function.
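For example, on Cortex-M this hardware stacking is what lets a plain C function act as an interrupt handler (a sketch; SysTick_Handler follows the common CMSIS naming convention and tick_count is a hypothetical shared counter):
extern volatile unsigned int tick_count;
void SysTick_Handler(void) {
    // r0-r3, r12, lr, pc (and xPSR) were stacked by hardware on entry;
    // any callee-saved registers used here are saved by the normal
    // compiler-generated prologue/epilogue, as in any other function.
    tick_count++;
}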

gcc doesn't merge consecutive fences

For this simple piece of code
std::atomic_int i;
void foo() {
    i.store(1);
    i.store(2);
}
gcc generates the following assembly for ARM:
movw r3, #:lower16:.LANCHOR0
movt r3, #:upper16:.LANCHOR0
dmb ish
mov r1, #1
mov r2, #2
str r1, [r3]
dmb ish
dmb ish ; why is this not eliminated?
str r2, [r3]
dmb ish
bx lr
You may notice the repeated fence generated in the middle, which seems superfluous. Is this an issue of gcc's optimizer not being able to catch and eliminate extra fences, or am I missing something?
BTW, clang seems to handle adjacent fences.
Yes, gcc does not merge them, and I have been debating this for a while with various people. To an external observer like myself, the effect is that gcc treats atomics like volatile, even though the standard doesn't require that; I was not able to find such a requirement in the standard.
However, it might also be a simple case of missing optimization.
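For what it's worth, the redundancy is easier to see with explicit fences; here is a rough source-level sketch (not an exact equivalent of the two seq_cst stores, but it lowers to the same dmb/str pattern shown above):
#include <atomic>
std::atomic_int j;
void foo_fences() {
    j.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // adjacent fences:
    std::atomic_thread_fence(std::memory_order_seq_cst); // one is redundant
    j.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}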

Why does ARM use two instructions to mask a value?

For the following function...
uint16_t swap(const uint16_t value)
{
    return value << 8 | value >> 8;
}
...why does ARM gcc 6.3.0 with -O2 yield the following assembly?
swap(unsigned short):
lsr r3, r0, #8
orr r0, r3, r0, lsl #8
lsl r0, r0, #16 # shift left
lsr r0, r0, #16 # shift right
bx lr
It appears the compiler is using two shifts to mask off the unwanted bytes, instead of using a logical AND. Could the compiler instead use and r0, r0, #0xFFFF (a single mask of the low 16 bits)?
Older ARM instruction sets cannot create arbitrary constants easily. Instead, constants are placed in literal pools and read in via a memory load. The and you suggest can, I believe, only take an 8-bit immediate rotated by an even amount. A 16-bit mask like 0xFFFF has too many significant bits to encode in one instruction.
So, we can load from memory and do an and (slow),
Take 2 instructions to create the value and 1 to and (longer),
or just shift twice cheaply and call it good.
The compiler chose the shifts and honestly, it is plenty fast.
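To make the encoding constraint concrete, here is a small sketch that tests whether a value fits the classic ARM data-processing immediate (an 8-bit value rotated right by an even amount):
#include <cstdint>
bool is_arm_immediate(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        // rotating left by 'rot' undoes an imm8 ROR rot encoding
        uint32_t r = rot ? ((v << rot) | (v >> (32 - rot))) : v;
        if (r <= 0xFF) return true;
    }
    return false;
}
// is_arm_immediate(0xFF) and is_arm_immediate(0xFF000000) are true;
// is_arm_immediate(0xFFFF) is false, hence no single and instruction.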
Now for a reality check:
Worrying about a single shift, unless it is a 100%-for-sure bottleneck, is a waste of time. Even if the compiler were sub-optimal, you would almost never feel it. Worry about "hot" loops instead for micro-optimizations like this. Looking at this out of curiosity is awesome. Worrying about this exact code for performance in your app, not so much.
Edit:
It has been noted by others here that newer versions of the ARM architecture allow this sort of thing to be done more efficiently. This shows that it is important, when talking at this level, to specify the chip, or at least the exact ARM architecture version, we are dealing with. I was assuming ancient ARM from the lack of "newer" instructions in your output. If we are tracking compiler bugs, then this assumption may not hold, and knowing the architecture version is even more important. For a swap like this, there are indeed simpler instructions to handle it in later versions.
Edit 2
One thing that could be done to possibly make this faster is to inline it. In that case, the compiler could interleave these operations with other work. Depending on the CPU, this could double the throughput, as many ARM CPUs have two integer instruction pipelines. Spread out the instructions enough so that there are no hazards, and away it goes. This has to be weighed against I-cache usage, but in a case where it mattered, you could see something better.
There is a missed-optimization here, but and isn't the missing piece. Generating a 16-bit constant isn't cheap. For a loop, yes it would be a win to generate a constant outside the loop and use just and inside the loop. (TODO: call swap in a loop over an array and see what kind of code we get.)
For an out-of-order CPU, it could also be worth using multiple instructions off the critical path to build a constant, then you only have one AND on the critical path instead of two shifts. But that's probably rare, and not what gcc chooses.
AFAICT (from looking at compiler output for simple functions), the ARM calling convention guarantees there's no high garbage in input registers, and doesn't allow leaving high garbage in return values. i.e. on input, it can assume that the upper 16 bits of r0 are all zero, but must leave them zero on return. The value << 8 left shift is thus a problem, but the value >> 8 isn't (it doesn't have to worry about shifting garbage down into the low 16).
(Note that x86 calling conventions aren't like this: return values are allowed to have high garbage. (Maybe because the caller can simply use the 16-bit or 8-bit partial register). So are input values, except as an undocumented part of the x86-64 System V ABI: clang depends on input values being sign/zero extended to 32-bit. GCC provides this when calling, but doesn't assume as a callee.)
ARMv6 has a rev16 instruction which byte-swaps the two 16-bit halves of a register. If the upper 16 bits are already zeroed, they don't need to be re-zeroed, so gcc -march=armv6 should compile the function to just rev16. But in fact it emits a uxth to extract and zero-extend the low half-word. (i.e. exactly the same thing as and with 0x0000FFFF, but without needing a large constant). I believe this is pure missed optimization; presumably gcc's rotate idiom, or its internal definition for using rev16 that way, doesn't include enough info to let it realize the top half stays zeroed.
swap: ## gcc6.3 -O3 -march=armv6 -marm
rev16 r0, r0
uxth r0, r0 # not needed
bx lr
For ARM pre v6, a shorter sequence is possible. GCC only finds it if we hand-hold it towards the asm we want:
// better on pre-v6, worse on ARMv6 (defeats rev16 optimization)
uint16_t swap_prev6(const uint16_t value)
{
    uint32_t high = value;
    high <<= 24; // knock off the high bits
    high >>= 16; // and place the low8 where we want it
    uint8_t low = value >> 8;
    return high | low;
    //return value << 8 | value >> 8;
}
swap_prev6: # gcc6.3 -O3 -marm. (Or armv7 -mthumb for thumb2)
lsl r3, r0, #24
lsr r3, r3, #16
orr r0, r3, r0, lsr #8
bx lr
But this defeats gcc's rotate-idiom recognition, so it compiles to this same code even with -march=armv6, whereas the simple version compiles to rev16 / uxth.
All source + asm on the Godbolt compiler explorer
ARM is a RISC machine (Advanced RISC Machine), and thus all instructions are encoded in the same size, capped at 32 bits.
Immediate values in instructions get only a certain number of bits, and the AND instruction simply doesn't have enough immediate bits to express an arbitrary 16-bit value.
That's the reason the compiler resorts to two shift instructions instead.
However, if your target CPU is ARMv6 (ARM11) or higher, the compiler takes advantage of the new REV16 instruction, and then masks the lower 16 bits with a UXTH instruction, which is unnecessary and stupid, but there is simply no conventional way to persuade the compiler not to do this.
If you think that you would be served well by the GCC intrinsic __builtin_bswap16, you are dead wrong.
uint16_t swap(const uint16_t value)
{
    return __builtin_bswap16(value);
}
The function above generates exactly the same machine code as your original C code did.
Even using inline assembly doesn't help:
uint16_t swap(const uint16_t value)
{
    uint16_t result;
    __asm__ __volatile__ ("rev16 %[out], %[in]" : [out] "=r" (result) : [in] "r" (value));
    return result;
}
Again, exactly the same. You cannot get rid of the pesky UXTH as long as you use GCC; it simply cannot see from the context that the upper 16 bits are all zeros to start with, which is what makes the UXTH unnecessary.
Write the whole function in assembly; that's the only option.
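A sketch of that option using GCC's naked attribute (assumes ARM mode on ARMv6+, and that the caller passes value zero-extended in r0, per the calling-convention observation above; swap_naked is a hypothetical name):
#include <cstdint>
__attribute__((naked)) uint16_t swap_naked(uint16_t value) {
    // swap the bytes of each halfword; the zeroed top half stays zero,
    // so no uxth is needed
    __asm__("rev16 r0, r0\n\t"
            "bx lr");
}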
This is the optimal solution; the AND would require at least two more instructions, and possibly a stall waiting for the mask value to load from memory. So it is worse in a couple of ways.
00000000 <swap>:
0: e1a03420 lsr r3, r0, #8
4: e1830400 orr r0, r3, r0, lsl #8
8: e1a00800 lsl r0, r0, #16
c: e1a00820 lsr r0, r0, #16
10: e12fff1e bx lr
00000000 <swap>:
0: ba40 rev16 r0, r0
2: b280 uxth r0, r0
4: 4770 bx lr
The latter is ARMv7, but only because instructions were added to support this kind of work.
Fixed-length RISC instructions by definition have a problem with constants. MIPS chose one way, ARM chose another. Constants are a problem on CISC as well, just a different problem. It is not difficult to create something that takes advantage of ARM's barrel shifter and shows a disadvantage of MIPS's solution, and vice versa.
The solution actually has a bit of elegance to it.
Part of this as well is the overall design of the target.
unsigned short fun ( unsigned short x )
{
    return(x+1);
}
0000000000000010 <fun>:
10: 8d 47 01 lea 0x1(%rdi),%eax
13: c3 retq
gcc chooses not to return the 16-bit value you asked for; it returns a 32-bit one, so strictly speaking it doesn't implement the function as written. But that is okay so long as the user of the result masks it at the point of use, or, on this architecture, uses ax instead of eax. For example:
unsigned short fun ( unsigned short x )
{
    return(x+1);
}
unsigned int fun2 ( unsigned short x )
{
    return(fun(x));
}
0000000000000010 <fun>:
10: 8d 47 01 lea 0x1(%rdi),%eax
13: c3 retq
0000000000000020 <fun2>:
20: 8d 47 01 lea 0x1(%rdi),%eax
23: 0f b7 c0 movzwl %ax,%eax
26: c3 retq
A compiler design choice (likely based on the architecture), not an implementation bug.
Note that for a sufficiently sized project, it is easy to find missed optimization opportunities. There is no reason to expect an optimizer to be perfect (it isn't and can't be). It just needs to be more efficient than a human doing it by hand, on average, for a project of that size.
This is why it is commonly said that for performance tuning you don't pre-optimize or jump straight to asm: you use the high-level language and the compiler, profile your way to the performance problems, and then hand-code those. Why hand-code them? Because we know we can at times outperform the compiler, which implies the compiler's output can be improved upon.
This isn't a missed optimization opportunity; it is instead a very elegant solution for the instruction set. Masking a byte is simpler:
unsigned char fun ( unsigned char x )
{
    return((x<<4)|(x>>4));
}
00000000 <fun>:
0: e1a03220 lsr r3, r0, #4
4: e1830200 orr r0, r3, r0, lsl #4
8: e20000ff and r0, r0, #255 ; 0xff
c: e12fff1e bx lr
00000000 <fun>:
0: e1a03220 lsr r3, r0, #4
4: e1830200 orr r0, r3, r0, lsl #4
8: e6ef0070 uxtb r0, r0
c: e12fff1e bx lr
The latter is ARMv7; with ARMv7 they recognized and solved these issues. You can't expect the programmer to always use natural-sized variables; some feel the need to use less optimally sized ones, and sometimes you still have to mask to a certain size.

Memory ordering behavior of std::atomic::load

Am I wrong to assume that the atomic::load should also act as a memory barrier ensuring that all previous non-atomic writes will become visible by other threads?
To illustrate:
volatile bool arm1 = false;
std::atomic_bool arm2{false};
bool triggered = false;
Thread1:
arm1 = true;
//std::atomic_thread_fence(std::memory_order_seq_cst); // this would do the trick
if (arm2.load())
triggered = true;
Thread2:
arm2.store(true);
if (arm1)
triggered = true;
I expected that after both threads had executed, 'triggered' would be true. Please don't suggest making arm1 atomic; the point is to explore the behavior of atomic::load.
While I have to admit I don't fully understand the formal definitions of the different relaxed semantics of memory order, I thought that sequentially consistent ordering was pretty straightforward in that it guarantees that "a single total order exists in which all threads observe all modifications in the same order." To me this implies that std::atomic::load with the default memory order of std::memory_order_seq_cst will also act as a memory fence. This is further corroborated by the following statement under "Sequentially-consistent ordering":
Total sequential ordering requires a full memory fence CPU instruction on all multi-core systems.
Yet, my simple example below demonstrates this is not the case with MSVC 2013, gcc 4.9 (x86) and clang 3.5.1 (x86), where the atomic load simply translates to a load instruction.
#include <atomic>
std::atomic_long al;
#ifdef _WIN32
__declspec(noinline)
#else
__attribute__((noinline))
#endif
long load() {
    return al.load(std::memory_order_seq_cst);
}
int main(int argc, char* argv[]) {
    long r = load();
}
With gcc this looks like:
load():
mov rax, QWORD PTR al[rip] ; <--- plain load here, no fence or xchg
ret
main:
call load()
xor eax, eax
ret
I'll omit the MSVC and clang output, which is essentially identical. Now on gcc for ARM we get what I expected:
load():
dmb sy ; <---- data memory barrier here
movw r3, #:lower16:.LANCHOR0
movt r3, #:upper16:.LANCHOR0
ldr r0, [r3]
dmb sy ; <----- and here
bx lr
main:
push {r3, lr}
bl load()
movs r0, #0
pop {r3, pc}
This is not an academic question, it results in a subtle race condition in our code which called into question my understanding of the behavior of std::atomic.
Sigh, this was too long for a comment:
Isn't the meaning of atomic "to appear to occur instantaneously to the rest of the system"?
I'd say yes and no to that one, depending on how you think of it. For writes with SEQ_CST, yes. But as far as how atomic loads are handled, check out 29.3 of the C++11 standard. Specifically, 29.3.3 is really good reading, and 29.3.4 might be specifically what you're looking for:
For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.
Basically, SEQ_CST forces a global order just like the standard says, but reads can return an old value without violating the 'atomic' constraint.
To accomplish 'getting the absolute latest value' you'll need to perform an operation that forces the hardware coherency protocol to lock (the lock prefix on x86_64). This is what the atomic compare-and-exchange operations do, if you look at the assembly output.
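As a sketch (using the atomic al from the example above; load_rmw is a hypothetical name), even a no-op read-modify-write has this effect:
long load_rmw() {
    // compiles to a lock-prefixed instruction (e.g. lock xadd) on x86-64
    // rather than a plain mov, forcing exclusive ownership of the line
    return al.fetch_add(0, std::memory_order_seq_cst);
}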
Am I wrong to assume that the atomic::load should also act as a memory barrier ensuring that all previous non-atomic writes will become visible by other threads?
Yes. atomic::load(SEQ_CST) just enforces that the read cannot load an 'invalid' value, and that neither writes nor loads may be reordered by the compiler or the CPU around that statement. It does not mean you'll always get the most up-to-date value.
I would expect your code to have a data race because, again, barriers do not ensure that the most up-to-date value is seen at a given time; they just prevent reordering.
It's perfectly valid for Thread1 to not see the write by Thread2 and therefore not set triggered, and for Thread2 to not see the write by Thread1 (again, not setting triggered), because you only write 'atomically' from one thread.
With two threads writing and reading shared values, you'll need a barrier in each thread to maintain consistency. It looks like you knew this already based on your code comments, so I'll just leave it at "the C++ standard is somewhat misleading when it comes to accurately describing the meaning of atomic / multithreaded operations".
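Concretely, the barrier-in-each-thread fix would look something like this (a sketch; arm1 being non-atomic still makes this a formal data race, it only illustrates the fence placement):
// Thread1:
arm1 = true;
std::atomic_thread_fence(std::memory_order_seq_cst); // order store before load
if (arm2.load(std::memory_order_relaxed))
    triggered = true;
// Thread2:
arm2.store(true, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst); // order store before load
if (arm1)
    triggered = true;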
Even though you're writing C++, it's still best, in my opinion, to think about what you're doing on the underlying architecture.
Not sure I explained that well, but I'd be happy to go into more detail if you'd like.

Strange behaviour of ldr [pc, #value]

I was debugging some C++ code (WinCE 6 on an ARM platform), and I found some behavior strange:
4277220C mov r3, #0x93, 30
42772210 str r3, [sp]
42772214 ldr r3, [pc, #0x69C]
42772218 ldr r2, [pc, #0x694]
4277221C mov r1, #0
42772220 ldr r0, [pc, #0x688]
Line 42772214 ldr r3, [pc, #0x69C] is used to get some constant from the .data section, at least I think so.
What is strange is that, according to the code, r3 should be loaded from address pc = 0x42772214 + 0x69C = 0x427728B0, but according to the memory contents it's loaded from 0x427728B8 (8 bytes further); this happens for other ldr usages too.
Is this a fault of the debugger, or of my understanding of ldr/pc?
Another issue I don't get: why is access to the .data section relative to the executed code? I find it a little strange.
And one more issue: I cannot find the syntax of the first mov instruction (could anyone point me to an optype specification for the Thumb (1C2)?)
Sorry for the layman's description, but I'm just getting familiar with assembly.
This is correct. When the pc is read, there is an 8-byte offset in ARM mode and a 4-byte offset in Thumb mode.
From the ARM-ARM:
When an instruction reads the PC, the value read depends on which instruction set it comes from:
For an ARM instruction, the value read is the address of the instruction plus 8 bytes. Bits [1:0] of this value are always zero, because ARM instructions are always word-aligned.
For a Thumb instruction, the value read is the address of the instruction plus 4 bytes. Bit [0] of this value is always zero, because Thumb instructions are always halfword-aligned.
This way of reading the PC is primarily used for quick, position-independent addressing of nearby instructions and data, including position-independent branching within a program.
There are two reasons for pc-relative addressing.
Position-independent code, which is your case here.
Getting complicated constants nearby which cannot be written in one simple instruction; e.g. mov r3, #0x12345678 is impossible to do in one instruction, so the compiler may put this constant at the end of the function and use e.g. ldr r3, [pc, #0x50] to load it instead.
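For example (a sketch; get_magic is a hypothetical name, compiled for an older core such as -mcpu=arm7tdmi where movw/movt are unavailable):
#include <cstdint>
uint32_t get_magic(void) {
    // 0x12345678 is not an 8-bit value rotated by an even amount, so the
    // compiler emits something like "ldr r0, [pc, #N]" and places the
    // constant in a literal pool after the function.
    return 0x12345678;
}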
I don't know what mov r3, #0x93, 30 means. Probably it is mov r3, #0x93, rol 30 (0x93 rotated left by 30 bits, i.e. rotated right by 2), which gives 0xC0000024?