The implementation difference between the interpreter and the JIT compiler for AtomicInteger.lazySet() in the HotSpot JVM - concurrency

Refer to the existing discussion at AtomicInteger lazySet vs. set for the background of AtomicInteger.lazySet().
So according to the semantics of AtomicInteger.lazySet(), on an x86 CPU AtomicInteger.lazySet() should be equivalent to a normal write to the value field of the AtomicInteger, because the x86 memory model already guarantees that stores are not reordered with other stores.
However, the runtime behavior of AtomicInteger.lazySet() differs between the interpreter and the JIT compiler (the C2 compiler specifically) in the JDK 8 HotSpot JVM, which confuses me.
First, create a simple Java application for demo.
import java.util.concurrent.atomic.AtomicInteger;

public class App {
    public static void main(String[] args) throws Exception {
        AtomicInteger i = new AtomicInteger(0);
        i.lazySet(1);
        System.out.println(i.get());
    }
}
Then, dump the instructions for AtomicInteger.lazySet() that come from the intrinsic implementation provided by the C2 compiler:
$ java -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation -XX:CompileCommand=print,*AtomicInteger.lazySet App
...
0x00007f1bd927214c: mov %edx,0xc(%rsi) ;*invokevirtual putOrderedInt
; - java.util.concurrent.atomic.AtomicInteger::lazySet#8 (line 110)
As you can see, the operation is, as expected, a plain store.
Then, use GDB to trace the runtime behavior of the interpreter for AtomicInteger.lazySet().
$ gdb --args java App
(gdb) b Unsafe_SetOrderedInt
0x7ffff69ae836 callq 0x7ffff69b6642 <OrderAccess::release_store_fence(int volatile*, int)>
0x7ffff69b6642:
push %rbp
mov %rsp,%rbp
mov %rdi,-0x8(%rbp)
mov %esi,-0xc(%rbp)
mov -0xc(%rbp),%eax
mov -0x8(%rbp),%rdx
xchg %eax,(%rdx)           // the write operation
mov %eax,-0xc(%rbp)
nop
pop %rbp
retq
As you can see, the operation is actually an XCHG instruction, which has implicit lock semantics and therefore carries exactly the performance overhead that AtomicInteger.lazySet() is intended to eliminate.
Does anyone know why there is such a difference? Thanks.

There is not much sense in optimizing a rarely used operation in the interpreter; it would increase development and maintenance costs for no visible benefit.
It's a common practice in HotSpot to implement optimizations only in a JIT compiler (either C2 alone, or both C1 and C2). The interpreter implementation just needs to work correctly; it does not need to be fast, because if the code is performance sensitive, it will be JIT-compiled anyway.
So, in the interpreter (i.e. on the slow path), Unsafe_SetOrderedInt is exactly the same as a volatile write.
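To see the cost difference in isolation, here is a small C++11 sketch (not HotSpot code; just an illustration of the same two store flavors, assuming a typical x86 compiler): a release store normally compiles to a plain MOV, while a sequentially consistent store normally compiles to an XCHG, the same implicitly locked pattern as the interpreter's release_store_fence.

#include <atomic>

std::atomic<int> value{0};

// Normally compiles to a plain `mov` on x86 -- the behavior the C2 intrinsic
// gives lazySet().
void release_store(int v) {
    value.store(v, std::memory_order_release);
}

// Normally compiles to `xchg` (implicitly locked) on x86 -- the same kind of
// full-fence store the interpreter path performs.
void seq_cst_store(int v) {
    value.store(v, std::memory_order_seq_cst);
}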

Related

Why does `monotonic_buffer_resource` appear in the assembly when it doesn't seem to be used?

This is a follow-up from another question.
I think the following code should not use monotonic_buffer_resource, but in the generated assembly there are references to it.
void default_pmr_alloc(std::pmr::polymorphic_allocator<int>& alloc) {
    (void)alloc.allocate(1);
}
godbolt
I looked into the source code of the header files and libstdc++, but could not find how monotonic_buffer_resource was selected to be used by the default pmr allocator.
The assembly tells the story. In particular, this:
cmp rax, OFFSET FLAT:_ZNSt3pmr25monotonic_buffer_resource11do_allocateEmm
jne .L11
This appears to be a test to see if the memory resource is a monotonic_buffer_resource. This seems to be done by checking the do_allocate member of the vtable. If it is not such a resource (i.e. if do_allocate in the memory resource is not the monotonic one), then it jumps down to this:
.L11:
mov rdi, rbx
mov edx, 4
mov esi, 4
pop rbx
jmp rax
This appears to be a vtable call.
The rest of the assembly appears to be an inlined version of monotonic_buffer_resource::do_allocate, which is why it conditionally calls std::pmr::monotonic_buffer_resource::_M_new_buffer.
So overall, this implementation of polymorphic_allocator::allocate seems to have some built-in inlining of monotonic_buffer_resource::do_allocate if the resource is appropriate for that. That is, it won't do a vtable call if it can determine that it should call monotonic_buffer_resource::do_allocate.
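A rough C++ analogue of that pattern might look like the sketch below (hedged: the real libstdc++ code compares the vtable's do_allocate slot directly rather than using typeid, and allocate_like is a made-up name):

#include <memory_resource>
#include <typeinfo>

void* allocate_like(std::pmr::memory_resource* r, std::size_t n, std::size_t a) {
    // Speculative devirtualization: if the dynamic type is exactly
    // monotonic_buffer_resource, the compiler can inline its allocation body.
    if (typeid(*r) == typeid(std::pmr::monotonic_buffer_resource)) {
        auto* mono = static_cast<std::pmr::monotonic_buffer_resource*>(r);
        return mono->allocate(n, a);   // fast path, candidate for inlining
    }
    // Otherwise an ordinary virtual call through the vtable (the .L11 path).
    return r->allocate(n, a);
}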

Are functions in a C++/CLI native class compiled to MSIL or native x64 machine code?

This question is related to another question of mine, titled Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems. I didn't receive any comments or answers, but eventually I found out myself that the problem is caused by function thunks that are inserted by the compiler whenever a managed function calls an unmanaged one, and vice versa. I won't go into the details again, because today I want to focus on another consequence of this thunking mechanism.
To provide some context for the question, my problem was the replacement of a C++ function for 64-to-128-bit unsigned integer multiplication in an unmanaged C++/CLI class by a function in an MASM64 file for the sake of performance. The ASM replacement is as simple as can be:
AsmMul1 proc ; ?AsmMul1##$$FYAX_K0AEA_K1#Z
; rcx : Factor1
; rdx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov rax, rcx ; rax = Factor1
mul rdx ; rdx:rax = Factor1 * Factor2
mov qword ptr [r8], rax ; [r8] = ProductL
mov qword ptr [r9], rdx ; [r9] = ProductH
ret
AsmMul1 endp
I expected a big performance boost by replacing a compiled function with four 32-to-64-bit multiplications with a simple CPU MUL instruction. The big surprise was that the ASM version was about four times slower (!) than the C++ version. After a lot of research and testing, I found out that some function calls in C++/CLI involve thunking, which obviously is such a complex thing that it takes much more time than the thunked function itself.
After reading more about this thunking, it turned out that whenever you are using the compiler option /clr, the calling convention of all functions is silently changed to __clrcall, which means that they become managed functions. Exceptions are functions that use compiler intrinsics, inline ASM, and calls to other DLLs via dllimport - and as my tests revealed, this seems to include functions that call external ASM functions.
As long as all interacting functions use the __clrcall convention (i.e. are managed), no thunking is involved, and everything runs smoothly. As soon as the managed/unmanaged boundary is crossed in either direction, thunking kicks in, and performance is seriously degraded.
Now, after this long prologue, let's get to the core of my question. As far as I understand the __clrcall convention, and the /clr compiler switch, marking a function in an unmanaged C++ class this way causes the compiler to emit MSIL code. I've found this sentence in the documentation of __clrcall:
When marking a function as __clrcall, you indicate the function implementation must be MSIL and that the native entry point function will not be generated.
Frankly, this is scaring me! After all, I'm going through the hassles of writing C++/CLI code in order to get real native code, i.e. super-fast x64 machine code. However, this doesn't seem to be the default for mixed assemblies. Please correct me if I'm getting it wrong: If I'm using the project defaults given by VC2017, my assembly contains MSIL, which will be JIT-compiled. True?
There is a #pragma managed that seems to inhibit the generation of MSIL in favor of native code on a per-function basis. I've tested it, and it works, but then the problem is that thunking gets in the way again as soon as the native code calls a managed function, and vice versa. In my C++/CLI project, I found no way to configure the thunking and code generation without getting a performance hit at some place.
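For reference, here is roughly the kind of per-function sketch I mean (hedged: the function name is made up, _umul128 is the MSVC x64 intrinsic for the 64-to-128-bit multiply, and whether this avoids the thunk depends on who calls it):

#include <intrin.h>

#pragma managed(push, off)   // the function below compiles to native x64, not MSIL
void Mul64To128(unsigned __int64 factor1, unsigned __int64 factor2,
                unsigned __int64* productL, unsigned __int64* productH)
{
    // One MUL instruction via the compiler intrinsic; no separate ASM file,
    // and no managed/unmanaged thunk as long as the callers are native too.
    *productL = _umul128(factor1, factor2, productH);
}
#pragma managed(pop)         // back to managed (__clrcall) compilation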
So what I'm asking myself now: what's the point of using C++/CLI in the first place? Does it give me any performance advantage when everything is still compiled to MSIL? Maybe it's better to write everything in pure C++ and use P/Invoke to call those functions? I don't know, I'm kind of stuck here.
Maybe someone can shed some light on this terribly poorly documented topic...

How to force GCC to use jmp instruction instead of ret?

I am using stackful co-routines for network programming, but I am being penalized by invalidation of the return stack buffer (see http://www.agner.org/optimize/microarchitecture.pdf, p. 36) during the context switch (because we manually change the SP register).
After testing in assembly language, I found that a jmp instruction works better than ret here. However, I have some more functions, written in C++ and compiled by GCC, that indirectly call the context-switch function. How can I force these functions to return using jmp instead of ret in GCC's assembly output?
A common but imperfect approach:
Use inline assembly to manually set the SP register to __builtin_frame_address + 2*sizeof(void*) and jmp to the return address before the ret?
This is an unsafe solution. In C++, local variables and temporaries are destroyed before the ret instruction, and jumping away skips those instructions. Worse, even in plain C, callee-saved registers need to be restored before the ret instruction, and those restores would be skipped as well.
So what can we do to force GCC to use jmp instead of ret while avoiding the problems listed above?
Use an assembler macro:
.macro ret
pop %ecx
jmp *%ecx
.endm
Put that in inline assembler at the top of the file or elsewhere.
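As a hedged sketch of that (assuming x86-64, so it pops into %rcx, a call-clobbered register; the 32-bit version above uses %ecx), a file-scope asm statement is enough, since the macro shadows the ret mnemonic for everything GCC emits after it in the translation unit:

asm(
    ".macro ret\n"
    "    pop %rcx\n"      // return address into a call-clobbered register
    "    jmp *%rcx\n"     // indirect jump instead of ret
    ".endm\n"
);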

Why would inline assembly crash in release only? [duplicate]

I have a scenario in GCC causing me problems. The behaviour I get is not the behaviour I expect. To summarise the situation, I am proposing several new instructions for x86-64 which are implemented in a hardware simulator. In order to test these instructions, I am taking existing C source code and hand-coding the new instructions in hexadecimal. Because these instructions interact with the existing x86-64 registers, I use the input/output/clobber lists to declare dependencies for GCC.
What's happening is that if I call a function e.g. printf, the dependent registers aren't saved and restored.
For example
register unsigned long r9 asm ("r9") = 101;
printf("foo %s\n", "bar");
asm volatile (".byte 0x00, 0x00, 0x00, 0x00" : /* no output */ : "q" (r9) );
101 is assigned to r9, and the inline assembly (fake in this example) depends on r9. This runs correctly in the absence of the printf, but when the printf is there, GCC does not save and restore r9, and another value is in the register by the time my custom instruction is called.
I thought perhaps that GCC might have secretly changed the assignment to the variable r9, but when I do this
asm volatile (".byte %0" : /* no output */ : "q" (r9) );
and look at the assembly output, it is indeed using %r9.
I am using gcc 4.4.5. What do you think might be happening? I thought GCC will always save and restore registers on function calls. Is there some way I can enforce it?
Thanks!
EDIT: By the way, I'm compiling the program like this
gcc -static -m64 -mmmx -msse -msse2 -O0 test.c -o test
The ABI, section 3.2.1 says:
Registers %rbp, %rbx and %r12 through %r15 “belong” to the calling function and the called function is required to preserve their values. In other words, a called function must preserve these registers’ values for its caller. Remaining registers “belong” to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame.
so you shouldn't expect registers other than %rbp, %rbx and %r12 through %r15 to be preserved by a function call.
gcc will not make explicit-register variables like this callee-saved. Basically this register notation you're using makes the variable a direct alias for the register, with the assumption you want to be able to read back the value a callee leaves in the register. If you used a callee-saved register instead of a call-clobbered (caller-saved) register, the problem would go away.
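A minimal sketch of that fix (assuming the code is compiled as C with gcc, as in the question's command line, and with a harmless multi-byte NOP standing in for the fake instruction):

#include <stdio.h>

int main(void)
{
    /* r12 is callee-saved under the x86-64 SysV ABI, so printf must preserve it. */
    register unsigned long val asm ("r12") = 101;
    printf("foo %s\n", "bar");
    /* Stand-in for the custom instruction; it still sees 101 in r12. */
    asm volatile (".byte 0x0f, 0x1f, 0x00" : /* no output */ : "q" (val));
    return 0;
}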

Can a C/C++ compiler inline builtin functions like malloc()?

While inspecting the disassembly of the function below,
void * malloc_float_align(size_t n, unsigned int a, float *& dizi)
{
    void * adres = NULL;
    void * adres2 = NULL;
    adres = malloc(n * sizeof(float) + a);
    size_t adr = (size_t)adres;
    size_t adr2 = adr + a - (adr & (a - 1u));
    adres2 = (void *)adr2;
    dizi = (float *)adres2;
    return adres;
}
I noticed that builtin functions are not inlined, even with the inline optimization flag set.
; Line 26
$LN4:
push rbx
sub rsp, 32 ; 00000020H
; Line 29
mov ecx, 160 ; 000000a0H
mov rbx, r8
call QWORD PTR __imp_malloc <------this is not inlined
; Line 31
mov rcx, rax
; Line 33
mov rdx, rax
and ecx, 31
sub rdx, rcx
add rdx, 32 ; 00000020H
mov QWORD PTR [rbx], rdx
; Line 35
add rsp, 32 ; 00000020H
pop rbx
ret 0
Question: is this a must-have property of functions like malloc? Can we inline it some way to inspect it (or any other function like strcmp/new/free/delete)? Is this forbidden?
Typically the compiler will inline functions when it has the source code available during compilation, in other words when the function is defined in a header file rather than just declared with a prototype.
However, in this case the function (malloc) is in a DLL, so clearly the source code is not available to the compiler while it compiles your code. It has nothing to do with what malloc does. It's also likely that malloc wouldn't be inlined anyway, since it is a fairly large function [at least it often is], which prevents it from being inlined even when the source code is available.
If you are using Visual Studio, you can almost certainly find the source code for your runtime library, as it is supplied with the Visual Studio package.
(The C runtime functions are in a DLL because many different programs in the system use the same functions, so putting them in a DLL that is loaded once for all "users" of the functionality will give a good saving on the size of all the code in the system. Although malloc is perhaps only a few hundred bytes, a function like printf can easily add some 5-25KB to the size of an executable. Multiply that by the number of "users" of printf, and there is likely several hundred kilobytes just from that one function "saved" - and of course, all other functions such as fopen, fclose, malloc, calloc, free, and so on all add a little bit each to the overall size)
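To make the "definition visible vs. prototype only" point concrete, here is a small hedged sketch (align_up and alloc_aligned are made-up names, not from the question):

#include <cstdlib>

// Defined right here, so the compiler is free to inline it.
static inline std::size_t align_up(std::size_t adr, std::size_t a) {
    return adr + a - (adr & (a - 1u));   // same rounding as the question's code
}

void* alloc_aligned(std::size_t n, std::size_t a, float*& dizi) {
    // malloc is only declared in the headers (its body lives in the CRT DLL),
    // so this stays an out-of-line call, as in the disassembly above.
    void* adres = std::malloc(n * sizeof(float) + a);
    dizi = reinterpret_cast<float*>(align_up(reinterpret_cast<std::size_t>(adres), a));
    return adres;
}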
A C compiler is allowed to inline malloc (or, as you see in your example, part of it), but it is not required to inline anything. The heuristics it uses need not be documented, and they're usually quite complex, but normally only short functions will be inlined, since otherwise code-bloat is likely.
malloc and friends are implemented in the runtime library, so they're not available for inlining. They would need to have their implementation in their header files for that to happen.
If you want to see their disassembly, you could step into them with a debugger. Or, depending on the compiler and runtime you're using, the source code might be available. It is available for both gcc and msvc, for example.
The main thing stopping the inlining of malloc() et al is their complexity — and the obvious fact that no inline definition of the function is provided. Besides, you may need different versions of the function at different times; it would be harder (messier) for tools like valgrind to work, and you could not arrange to use a debugging version of the functions if their code is expanded inline.