Work around lack of Yz machine constraint under Clang? - c++

We use inline assembly to make SHA instructions available if __SHA__ is not defined. Under GCC we use:
GCC_INLINE __m128i GCC_INLINE_ATTRIB
MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "Yz" (c));
return a;
}
Clang does not consume GCC's Yz constraint (see Clang 3.2 Issue 13199 and Clang 3.9 Issue 32727), which is required by the sha256rnds2 instruction:
Yz
First SSE register (%xmm0).
We added a mov for Clang:
asm ("mov %2, %%xmm0; sha256rnds2 %%xmm0, %1, %0" : "+x"(a) : "xm"(b), "x" (c) : "xmm0");
Performance is off by about 3 cycles per byte. On my 2.2 GHz Celeron J3455 test machine (Goldmont with SHA extensions), that's about 230 MiB/s. It's non-trivial.
Looking at the disassembly, Clang is not optimizing around SHA's k when two rounds are performed:
Breakpoint 2, SHA256_SSE_SHA_HashBlocks (state=0xaaa3a0,
data=0xaaa340, length=0x40) at sha.cpp:1101
1101 STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
(gdb) disass
Dump of assembler code for function SHA256_SSE_SHA_HashBlocks(unsigned int*, unsigned int const*, unsigned long):
0x000000000068cdd0 <+0>: sub $0x308,%rsp
0x000000000068cdd7 <+7>: movdqu (%rdi),%xmm0
0x000000000068cddb <+11>: movdqu 0x10(%rdi),%xmm1
...
0x000000000068ce49 <+121>: movq %xmm2,%xmm0
0x000000000068ce4d <+125>: sha256rnds2 %xmm0,0x2f0(%rsp),%xmm1
0x000000000068ce56 <+134>: pshufd $0xe,%xmm2,%xmm3
0x000000000068ce5b <+139>: movdqa %xmm13,%xmm2
0x000000000068ce60 <+144>: movaps %xmm1,0x2e0(%rsp)
0x000000000068ce68 <+152>: movq %xmm3,%xmm0
0x000000000068ce6c <+156>: sha256rnds2 %xmm0,0x2e0(%rsp),%xmm2
0x000000000068ce75 <+165>: movdqu 0x10(%rsi),%xmm3
0x000000000068ce7a <+170>: pshufb %xmm8,%xmm3
0x000000000068ce80 <+176>: movaps %xmm2,0x2d0(%rsp)
0x000000000068ce88 <+184>: movdqa %xmm3,%xmm4
0x000000000068ce8c <+188>: paddd 0x6729c(%rip),%xmm4 # 0x6f4130
0x000000000068ce94 <+196>: movq %xmm4,%xmm0
0x000000000068ce98 <+200>: sha256rnds2 %xmm0,0x2d0(%rsp),%xmm1
...
For example, 0068ce8c through 0068ce98 should have been:
paddd 0x6729c(%rip),%xmm0 # 0x6f4130
sha256rnds2 %xmm0,0x2d0(%rsp),%xmm1
I'm guessing our choice of inline asm instructions is a bit off.
How do we work around the lack of Yz machine constraint under Clang? What pattern avoids the intermediate move in optimized code?
Attempting to use Explicit Register Variable:
const __m128i k asm("xmm0") = c;
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
return a;
Results in:
In file included from sha.cpp:24:
./cpu.h:831:22: warning: ignored asm label 'xmm0' on automatic variable
const __m128i k asm("xmm0") = c;
^
./cpu.h:833:7: error: invalid operand for instruction
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
^
<inline asm>:1:21: note: instantiated into assembly here
sha256rnds2 %xmm1, 752(%rsp), %xmm0
^~~~~~~~~~
In file included from sha.cpp:24:
./cpu.h:833:7: error: invalid operand for instruction
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
^
<inline asm>:1:21: note: instantiated into assembly here
sha256rnds2 %xmm3, 736(%rsp), %xmm1
^~~~~~~~~~
...

I created this answer based on the tag inline assembly with no specific language mentioned. Extended assembly templates already assume use of extensions to the languages.
If the Yz constraint isn't available, you can create a temporary variable that tells Clang which register to use, instead of relying on a constraint. You can do this through what is called an Explicit Register Variable:
You can define a local register variable and associate it with a specified register like this:
register int *foo asm ("r12");
Here r12 is the name of the register that should be used. Note that this is the same syntax used for defining global register variables, but for a local variable the declaration appears within a function. The register keyword is required, and cannot be combined with static. The register name must be a valid register name for the target platform.
In your case you want to force use of the xmm0 register. You can assign the c parameter to a temporary variable bound to an explicit register, then use that temporary as an operand of the Extended Inline Assembly. This is the primary purpose of explicit registers in GCC/Clang.
GCC_INLINE __m128i GCC_INLINE_ATTRIB
MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
register const __m128i tmpc asm("xmm0") = c;
__asm__("sha256rnds2 %2, %1, %0" : "+x"(a) : "x"(b), "x" (tmpc));
return a;
}
The compiler should be able to provide some optimizations now since it has more knowledge as to how the xmm0 register is to be used.
When you placed mov %2, %%xmm0; into the template, Clang (and GCC) did not do any optimization on those instructions. Basic Assembly and Extended Assembly templates are a black box to the compiler: it only knows how to do basic substitution based on the constraints.
Here's a disassembly using the method above. It was compiled with clang++ and -std=c++03. The extra moves are no longer present:
Breakpoint 1, SHA256_SSE_SHA_HashBlocks (state=0x7fffffffae60,
data=0x7fffffffae00, length=0x40) at sha.cpp:1101
1101 STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
(gdb) disass
Dump of assembler code for function SHA256_SSE_SHA_HashBlocks(unsigned int*, unsigned int const*, unsigned long):
0x000000000068cf60 <+0>: sub $0x308,%rsp
0x000000000068cf67 <+7>: movdqu (%rdi),%xmm0
0x000000000068cf6b <+11>: movdqu 0x10(%rdi),%xmm1
...
0x000000000068cfe6 <+134>: paddd 0x670e2(%rip),%xmm0 # 0x6f40d0
0x000000000068cfee <+142>: sha256rnds2 %xmm0,0x2f0(%rsp),%xmm2
0x000000000068cff7 <+151>: pshufd $0xe,%xmm0,%xmm1
0x000000000068cffc <+156>: movdqa %xmm1,%xmm0
0x000000000068d000 <+160>: movaps %xmm2,0x2e0(%rsp)
0x000000000068d008 <+168>: sha256rnds2 %xmm0,0x2e0(%rsp),%xmm3
0x000000000068d011 <+177>: movdqu 0x10(%rsi),%xmm5
0x000000000068d016 <+182>: pshufb %xmm9,%xmm5
0x000000000068d01c <+188>: movaps %xmm3,0x2d0(%rsp)
0x000000000068d024 <+196>: movdqa %xmm5,%xmm0
0x000000000068d028 <+200>: paddd 0x670b0(%rip),%xmm0 # 0x6f40e0
0x000000000068d030 <+208>: sha256rnds2 %xmm0,0x2d0(%rsp),%xmm2
...
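The same pattern can be tried on a machine without SHA extensions. The sketch below is only an illustration under stated assumptions, not code from the question: it pins a shift count to ecx with an explicit register variable, because shl takes its variable count implicitly in cl, much as sha256rnds2 takes its third operand implicitly in xmm0. The name shift_left_pinned is hypothetical, and on x86 the "c" constraint would normally do this job; the register variable is used here purely to show the technique.

```cpp
#include <cstdint>

// Hypothetical helper: shift `value` left by `n` (expected range 0..31).
inline std::uint32_t shift_left_pinned(std::uint32_t value, std::uint32_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // Explicit register variable: the `register` keyword is required,
    // and the asm label names the register the operand must live in.
    register std::uint32_t count asm("ecx") = n;

    // AT&T syntax. The template hard-codes %cl, so operand 1 is listed
    // only to tell the compiler that ecx must hold `count` at this point.
    __asm__("shll %%cl, %0" : "+r"(value) : "r"(count) : "cc");
    return value;
#else
    // Fallback for other targets so the sketch still builds.
    return value << (n & 31u);
#endif
}
```

Because the compiler sees the register requirement instead of a mov hidden inside the template, it is free to compute the count directly in ecx when that is cheaper.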

Related

Why can't Clang get __m128's data by index in constexpr function

#include <cstddef>
#include <immintrin.h>
constexpr float get_data(__m128 a, std::size_t pos) {
return a[pos];
}
It works on GCC. I wonder whether there is a workaround that makes it possible with Clang.
Regardless of constexpr, a[pos] is only valid as a GNU C extension, not portable to MSVC. Storing to an array, or C++20 std::bit_cast to a struct might work. bit_cast is constexpr-compatible, unlike other type-punning methods. Although I'd be worried about how efficiently that would compile across compilers for runtime-variable pos
bit_cast does compile ok with clang, and works in a constexpr function. But compiles inefficiently for GCC.
Correction: clang compiles this, but rejects it if called in a context that requires it to be constant-evaluated. note: constexpr bit_cast involving type '__attribute__((__vector_size__(4 * sizeof(float)))) float const' (vector of 4 'float' values) is not yet supported.
Other failed attempts with current clang in a constexpr context:
_mm_store_ps - not supported. Nor is *(__m128*)f = a; because it's a reinterpret_cast.
f[0] = vec[0] etc. initializers: no, even literal constant indexing of a GNU C native vector isn't supported in clang in constexpr.
union type punning: reading an inactive member not allowed in a constexpr context
_mm_cvtss_f32(vec) - non-constexpr function unusable, so no chance of using if constexpr for separate shuffles and returns.
Not-working answer, may work at some point in the future but not with clang trunk pre 15.0
#include <cstddef>
#include <immintrin.h>
#include <bit>
// portable, but inefficient with GCC
constexpr float get_data(__m128 a, std::size_t pos) {
struct foo { float f[4]; } s;
s = std::bit_cast<foo>(a);
return s.f[pos];
}
float test_idx2(__m128 a){
return get_data(a, 2);
}
float test_idxvar(__m128 a, std::size_t pos){
return get_data(a, pos);
}
These compile to decent asm on Godbolt, the same you'd get from clang with a[pos]. I used -O3 -march=haswell -std=gnu++20
# clang 14 -O3 -march=haswell -std=gnu++20
# get_data has no asm output; constexpr is like inline in that respect
test_idx2(float __vector(4)):
vpermilpd xmm0, xmm0, 1 # xmm0 = xmm0[1,0]
ret
test_idxvar(float __vector(4), unsigned long):
vmovups xmmword ptr [rsp - 16], xmm0
vmovss xmm0, dword ptr [rsp + 4*rdi - 16] # xmm0 = mem[0],zero,zero,zero
ret
Store/reload is a sensible strategy for a runtime-variable index, although vmovd / vpermilps would be an option since AVX introduced a variable-control shuffle that uses dword indices. An out-of-range index is UB so the compiler doesn't have any requirement to return any specific data in that case.
Using vpermilpd for the constant index 2 is a waste of code-size vs. vmovhlps xmm0, xmm0, xmm0 or vunpckhpd. It costs a longer VEX prefix and an immediate, so 2 bytes of machine-code size, but otherwise same performance on most CPUs.
Unfortunately GCC doesn't do such a good job
We get a store/reload even for the fixed index of 2, and even worse, reload by bouncing through a GP-integer register. This is a missed optimization, but IDK how quickly it would get fixed if reported. So if you're going to do this, perhaps #ifdef __clang__ or #ifdef __llvm__ for bit_cast, and #ifdef __GNUC__ for a[pos]. (Clang defines __GNUC__ so check for that after special-casing clang.)
# gcc12 -O3 -march=haswell -std=gnu++20
test_idx2(float __vector(4)):
vmovaps XMMWORD PTR [rsp-24], xmm0
mov rax, QWORD PTR [rsp-16]
vmovd xmm0, eax # slow: should have loaded directly from mem
ret
test_idxvar(float __vector(4), unsigned long):
vmovdqa XMMWORD PTR [rsp-24], xmm0
vmovss xmm0, DWORD PTR [rsp-24+rdi*4] # this is fine, same as clang
ret
Interestingly the runtime-variable version didn't have the same anti-optimization for GCC.
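The compiler dispatch suggested above could be sketched roughly as follows. This is a runtime-only illustration, not a constexpr solution: std::memcpy stands in for std::bit_cast so the sketch also builds before C++20, and the name get_lane is hypothetical. Clang is tested first because it also defines __GNUC__.

```cpp
#include <cstddef>
#include <cstring>
#include <xmmintrin.h>

// Hypothetical helper: read lane `pos` (0..3) of `a` at run time.
inline float get_lane(__m128 a, std::size_t pos)
{
#if defined(__clang__) || !defined(__GNUC__)
    // Clang (and non-GNU compilers): copy the vector's bytes into a
    // float array. memcpy is a stand-in for std::bit_cast here.
    float lanes[4];
    std::memcpy(lanes, &a, sizeof lanes);
    return lanes[pos];
#else
    // GCC: GNU C native-vector indexing, which avoids the
    // store/reload seen in the GCC output above.
    return a[pos];
#endif
}
```

An out-of-range pos is still undefined behavior on either path.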

Referencing memory operands in .intel_syntax GNU C inline assembly

I'm catching a link error when compiling and linking a source file with inline assembly.
Here are the test files:
via:$ cat test.cxx
extern int libtest();
int main(int argc, char* argv[])
{
return libtest();
}
$ cat lib.cxx
#include <stdint.h>
int libtest()
{
uint32_t rnds_00_15;
__asm__ __volatile__
(
".intel_syntax noprefix ;\n\t"
"mov DWORD PTR [rnds_00_15], 1 ;\n\t"
"cmp DWORD PTR [rnds_00_15], 1 ;\n\t"
"je done ;\n\t"
"done: ;\n\t"
".att_syntax noprefix ;\n\t"
:
: [rnds_00_15] "m" (rnds_00_15)
: "memory", "cc"
);
return 0;
}
Compiling and linking the program results in:
via:$ g++ -fPIC test.cxx lib.cxx -c
via:$ g++ -fPIC lib.o test.o -o test.exe
lib.o: In function `libtest()':
lib.cxx:(.text+0x1d): undefined reference to `rnds_00_15'
lib.cxx:(.text+0x27): undefined reference to `rnds_00_15'
collect2: error: ld returned 1 exit status
The real program is more complex. The routine is out of registers so the flag rnds_00_15 must be a memory operand. Use of rnds_00_15 is local to the asm block. It is declared in the C code to ensure the memory is allocated on the stack and nothing more. We don't read from it or write to it as far as the C code is concerned. We list it as a memory input so GCC knows we use it and wire up the "C variable name" in the extended ASM.
Why am I receiving a link error, and how do I fix it?
Compile with gcc -masm=intel and don't try to switch modes inside the asm template string. AFAIK there's no equivalent before clang14 (Note: MacOS installs clang as gcc / g++ by default.)
Also, of course you need to use valid GNU C inline asm, using operands to tell the compiler which C objects you want to read and write.
Can I use Intel syntax of x86 assembly with GCC? clang14 supports -masm=intel like GCC
How to set gcc to use intel syntax permanently? clang13 and earlier didn't.
I don't believe Intel syntax uses the percent sign. Perhaps I am missing something?
You're getting mixed up between %operand substitutions into the Extended-Asm template (which use a single %), vs. the final asm that the assembler sees.
You need %% to use a literal % in the final asm. You wouldn't use "mov %%eax, 1" in Intel-syntax inline asm, but you do still use "mov %0, 1" or %[named_operand].
See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. In Basic asm (no operands), there is no substitution and % isn't special in the template, so you'd write mov $1, %eax in Basic asm vs. mov $1, %%eax in Extended, if for some reason you weren't using an operand like mov $1, %[tmp] or mov $1, %0.
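To make the % versus %% distinction concrete, here is a minimal sketch in AT&T syntax (the default when -masm=intel is not given). The function name load_one is made up for illustration, and the non-x86 branch exists only so the snippet builds elsewhere.

```cpp
#include <cstdint>

// Hypothetical helper: returns 1 by way of a hard-coded register.
inline std::uint32_t load_one()
{
    std::uint32_t out;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // Extended asm: %%eax becomes the literal text %eax in the final asm,
    // while %0 is replaced by whatever register the compiler picks for `out`.
    __asm__("movl $1, %%eax\n\t"
            "movl %%eax, %0"
            : "=r"(out)
            :              // no inputs
            : "eax");      // eax is overwritten, so declare it clobbered
#else
    out = 1;
#endif
    return out;
}
```

In practice you would let the compiler choose the register and write only "movl $1, %0"; the hard-coded eax is there just to show where %% is needed.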
uint32_t rnds_00_15; is a local with automatic storage. Of course there's no asm symbol with that name.
Use %[rnds_00_15] and compile with -masm=intel (And remove the .att_syntax at the end; that would break the compiler-generated asm that comes after.)
You also need to remove the DWORD PTR, because the operand-expansion already includes that, e.g. DWORD PTR [rsp - 4], and clang errors on DWORD PTR DWORD PTR [rsp - 4]. (GAS accepts it just fine, but the 2nd one takes precedence so it's pointless and potentially misleading.)
And you'll want a "=m" output operand if you want the compiler to reserve you some scratch space on the stack. You must not modify input-only operands, even if it's unused in the C. Maybe the compiler decides it can overlap something else because it's not written and not initialized (i.e. UB). (I'm not sure if your "memory" clobber makes it safe, but there's no reason not to use an early-clobber output operand here.)
And you'll want to avoid label name conflicts by using %= to get a unique number.
Working example (GCC and ICC, but not clang unfortunately), on the Godbolt compiler explorer (which uses -masm=intel depending on options in the dropdown). You can use "binary mode" (the 11010 button) to prove that it actually assembles after compiling to asm without warnings.
int libtest_intel()
{
uint32_t rnds_00_15;
// Intel syntax operand-size can only be overridden with operand modifiers
// because the expansion includes an explicit DWORD PTR
__asm__ __volatile__
( // ".intel_syntax noprefix \n\t"
"mov %[rnds_00_15], 1 \n\t"
"cmp %[rnds_00_15], 1 \n\t"
"je .Ldone%= \n\t"
".Ldone%=: \n\t"
: [rnds_00_15] "=&m" (rnds_00_15)
:
: // no clobbers
);
return 0;
}
Compiles (with gcc -O3 -masm=intel) to this asm. Also works with gcc -m32 -masm=intel of course:
libtest_intel:
mov DWORD PTR [rsp-4], 1
cmp DWORD PTR [rsp-4], 1
je .Ldone8
.Ldone8:
xor eax, eax
ret
I couldn't get this to work with clang: It choked on .intel_syntax noprefix when I left that in explicitly.
Operand-size overrides:
You have to use %b[tmp] to get the compiler to substitute in BYTE PTR [rsp-4] to only access the low byte of a dword input operand. I'd recommend AT&T syntax if you want to do much of this.
Using %[rnds_00_15] results in Error: junk '(%ebp)' after expression.
That's because you switched to Intel syntax without telling the compiler. If you want it to use Intel addressing modes, compile with -masm=intel so the compiler can substitute into the template with the correct syntax.
This is why I avoid that crappy GCC inline assembly at nearly all costs. Man I despise this crappy tool.
You're just using it wrong. It's a bit cumbersome, but makes sense and mostly works well if you understand how it's designed.
Repeat after me: The compiler doesn't parse the asm string at all, except to do text substitutions of %operand. This is why it doesn't notice your .intel_syntax noprefix and keeps substituting AT&T syntax.
It does work better and more easily with AT&T syntax though, e.g. for overriding the operand-size of a memory operand, or adding an offset. (e.g. 4 + %[mem] works in AT&T syntax).
Dialect alternatives:
If you want to write inline asm that doesn't depend on -masm=intel or not, use Dialect alternatives (which makes your code super-ugly; not recommended for anything other than wrapping one or two instructions):
Also demonstrates operand-size overrides
#include <stdint.h>
int libtest_override_operand_size()
{
uint32_t rnds_00_15;
// Intel syntax operand-size can only be overridden with operand modifiers
// because the expansion includes an explicit DWORD PTR
__asm__ __volatile__
(
"{movl $1, %[rnds_00_15] | mov %[rnds_00_15], 1} \n\t"
"{cmpl $1, %[rnds_00_15] | cmp %k[rnds_00_15], 1} \n\t"
"{cmpw $1, %[rnds_00_15] | cmp %w[rnds_00_15], 1} \n\t"
"{cmpb $1, %[rnds_00_15] | cmp %b[rnds_00_15], 1} \n\t"
"je .Ldone%= \n\t"
".Ldone%=: \n\t"
: [rnds_00_15] "=&m" (rnds_00_15)
);
return 0;
}
With Intel syntax, gcc compiles it to:
mov DWORD PTR [rsp-4], 1
cmp DWORD PTR [rsp-4], 1
cmp WORD PTR [rsp-4], 1
cmp BYTE PTR [rsp-4], 1
je .Ldone38
.Ldone38:
xor eax, eax
ret
With AT&T syntax, compiles to:
movl $1, -4(%rsp)
cmpl $1, -4(%rsp)
cmpw $1, -4(%rsp)
cmpb $1, -4(%rsp)
je .Ldone38
.Ldone38:
xorl %eax, %eax
ret
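For the "wrapping one or two instructions" case, a dialect-alternative wrapper might look like the sketch below. The name add_one is hypothetical; the point is only that the same source should assemble whether or not -masm=intel is in effect. The non-x86 branch is a plain C++ fallback.

```cpp
#include <cstdint>

// Hypothetical helper: add 1 to x using a single instruction.
inline std::uint32_t add_one(std::uint32_t x)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // {AT&T | Intel}: the compiler keeps the alternative matching the
    // dialect it is emitting and discards the other one.
    __asm__("{addl $1, %0 | add %0, 1}"
            : "+r"(x)      // read-write register operand
            :
            : "cc");       // add modifies the flags
#else
    x += 1;
#endif
    return x;
}
```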

MSVC inline assembly to GCC (with parameter and return)

inline float sqrt2(float sqr)
{
float root = 0;
__asm
{
sqrtss xmm0, sqr
movss root, xmm0
}
return root;
}
Here is MSVC inline assembly that I want to compile with GCC for x86. I know that GCC inline assembly is written as asm("asm here");, but I don't know how to pass a parameter into it. All I know is that the result is obtained with "=r".
It should end up as something like this:
asm("sqrtss xmm0, %1\n\t"
"movss %0, xmm0"
: "=r" (root)
: "r" (sqr));
The r constraint is for general purpose registers. x is for xmm. Consult the manual for more details. Also, if you are using a mov in inline asm, you are likely doing it wrong.
inline float sqrt2(float sqr)
{
float root = 0;
__asm__("sqrtss %1, %0" : "=x" (root) : "x" (sqr));
return root;
}
Note that gcc is entirely capable of generating sqrtss instruction from sqrtf library function call. You may use -fno-math-errno to get rid of some minor error checking overhead.
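A version with no inline asm at all could be as simple as the sketch below; it reuses the sqrt2 name from the question and relies on the compiler choosing the instruction, typically sqrtss on x86 with SSE enabled.

```cpp
#include <cmath>

// Same interface as the question's sqrt2, but the compiler picks the
// instruction. With -fno-math-errno GCC can skip the errno-setting path.
inline float sqrt2(float sqr)
{
    return std::sqrt(sqr);  // float overload of std::sqrt
}
```

This also lets the optimizer constant-fold or vectorize the call, which an asm statement would prevent.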

why object code generated for noexcept and throw() is same in c++11?

Code using noexcept .
//hello.cpp
class A{
public:
A(){}
~A(){}
};
void fun() noexcept{ //c++11 style
A a[10];
}
int main()
{
fun();
}
Code using throw() .
//hello1.cpp
class A{
public:
A(){}
~A(){}
};
void fun() throw(){//c++98 style
A a[10];
}
int main()
{
fun();
}
According to various online links and Scott Meyers's book: "If, at runtime, an exception leaves fun, fun’s exception specification is violated. With the C++98 exception specification, the call stack is unwound to f’s caller, and, after some actions not relevant here, program execution is terminated. With the C++11 exception specification, runtime behavior is slightly different: the stack is only possibly unwound before program execution is terminated." He said code using noexcept is more optimized than code using throw().
But when I generated machine code for the above program, I found that the code generated for both cases is exactly the same.
$ g++ --std=c++11 hello1.cpp -O0 -S -o throw1.s
$ g++ --std=c++11 hello.cpp -O0 -S -o throw.s
The diff is below.
$ diff throw.s throw1.s
1c1
< .file "hello.cpp"
---
> .file "hello1.cpp"
The machine code generated for function "fun" is as below in both cases.
.LFB1202:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %r12
pushq %rbx
subq $16, %rsp
.cfi_offset 12, -24
.cfi_offset 3, -32
leaq -32(%rbp), %rax
movl $9, %ebx
movq %rax, %r12
jmp .L5
.L6:
movq %r12, %rdi
call _ZN1AC1Ev
addq $1, %r12
subq $1, %rbx
.L5:
cmpq $-1, %rbx
jne .L6
leaq -32(%rbp), %rbx
addq $10, %rbx
.L8:
leaq -32(%rbp), %rax
cmpq %rax, %rbx
je .L4
subq $1, %rbx
movq %rbx, %rdi
call _ZN1AD1Ev
jmp .L8
.L4:
addq $16, %rsp
popq %rbx
popq %r12
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1202:
.size _Z3funv, .-_Z3funv
.globl main
.type main, @function
What is the advantage of using noexcept when noexcept and throw() generate the same code?
They are generating the same code because you are not throwing anything. Your test program is so simple that the compiler can trivially analyze it, determine that it is not throwing an exception, and in fact not doing anything at all! With optimizations enabled (-O1 and higher), the object code:
fun():
rep ret
main:
xor eax, eax
ret
shows that your test code is being optimized simply to the most trivial valid C++ application:
int main()
{
return 0;
}
If you want to really test the difference in object code generation for the two types of exception specifiers, you need to use a real (i.e., non-trivial) test program. Something that actually throws an exception, and where that throw cannot be analyzed out by a bit of compile-time analysis:
void fun(int args) throw() // C++98 style
{
if (args == 0)
{
throw "Not enough arguments!";
}
else
{
// do something
}
}
int main(int argc, char** argv)
{
fun(argc);
return 0;
}
In this code, an exception is conditionally thrown depending on the value of an input parameter (argc) passed to the main function. It is impossible for the compiler to know, at compile-time, what the value of this argument will be, so it cannot optimize out either this conditional check or the throwing of the exception. That forces it to generate exception-throwing and stack-unwinding code.
Now we can compare the resulting object code. Using GCC 5.3, with -O3 and -std=c++11, I get the following:
C++98 style (throw())
.LC0:
.string "Not enough arguments!"
fun(int):
test edi, edi
je .L9
rep ret
.L9:
push rax
mov edi, 8
call __cxa_allocate_exception
xor edx, edx
mov QWORD PTR [rax], OFFSET FLAT:.LC0
mov esi, OFFSET FLAT:typeinfo for char const*
mov rdi, rax
call __cxa_throw
add rdx, 1
mov rdi, rax
je .L4
call _Unwind_Resume
.L4:
call __cxa_call_unexpected
main:
sub rsp, 8
call fun(int)
xor eax, eax
add rsp, 8
ret
C++11 style (noexcept)
.LC0:
.string "Not enough arguments!"
fun(int) [clone .part.0]:
push rax
mov edi, 8
call __cxa_allocate_exception
xor edx, edx
mov QWORD PTR [rax], OFFSET FLAT:.LC0
mov esi, OFFSET FLAT:typeinfo for char const*
mov rdi, rax
call __cxa_throw
fun(int):
test edi, edi
je .L8
rep ret
.L8:
push rax
call fun(int) [clone .part.0]
main:
test edi, edi
je .L12
xor eax, eax
ret
.L12:
push rax
call fun(int) [clone .part.0]
Note that they are clearly different. Just as Meyers et al. have claimed, the C++98 style throw() specification, which indicates that a function does not throw, causes a standards-compliant compiler to emit code to unwind the stack and call std::unexpected when an exception is thrown from inside of that function. That is exactly what happens here. Because fun is marked throw() but in fact does throw, the object code shows the compiler emitting a call to __cxa_call_unexpected.
Clang is also standards-compliant here and does the same thing. I won't reproduce the object code, because it's longer and harder to follow (you can see it on Matt Godbolt's excellent site), but putting the C++98 style exception specification on the function causes the compiler to explicitly call std::terminate if the function throws in violation of its specification, whereas the C++11 style exception specification does not result in a call to std::terminate.
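Separately from code generation, a non-throwing specification is visible at compile time through the noexcept operator. The sketch below is an illustration with made-up function names (may_throw, never_throws, call_safely). It uses only noexcept, because throw() was removed in C++20; in C++11 a function declared throw() would report true to the noexcept operator as well.

```cpp
#include <stdexcept>

// Hypothetical functions for illustration.
void may_throw(int args)
{
    if (args == 0)
        throw std::runtime_error("Not enough arguments!");
}

void never_throws(int) noexcept {}

// The noexcept operator inspects the declaration, not the run-time value.
static_assert(!noexcept(may_throw(1)), "may_throw is potentially throwing");
static_assert(noexcept(never_throws(1)), "never_throws is declared noexcept");

// A caller that still wants to handle the exception needs a function
// without a non-throwing specification, as here.
bool call_safely(int args)
{
    try {
        may_throw(args);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```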

GCC inline assembly error: asm-specifier for variable '%al' conflicts with asm clobber list

Sorry for so many questions, but I've encountered yet another cryptic error trying to compile the following inline assembly (with -fasm-blocks) which works in MSVC, but apparently not in GCC and wasn't able to deal with it:
unsigned char testData = 128;
__asm
{
// ...
mov al, testData
mov ah, al // error: asm-specifier for variable '%al' conflicts with asm clobber list
shl eax, 16
// ...
};
What is this clobber list and what is wrong with it?
I also tried to change optimization level, but it had no effect.
This has to be some bug in gcc (maybe __asm blocks have implicit clobbering). Anyway there are many workarounds:
__asm
{
// ...
mov ah, testData
mov al, ah
shl eax, 16
// ...
};
or
__asm
{
// ...
mov al, testData
mov ah, testData
shl eax, 16
// ...
};
or
__asm
{
// ...
movzx eax, testData
imul eax, 0x0101
shl eax, 16
// ...
};
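The arithmetic behind the movzx / imul / shl variant can be checked in plain C++. The function name below is hypothetical; multiplying a zero-extended byte by 0x0101 copies it into both of the low two bytes (al and ah), and the shift then moves that pair into the upper half of the 32-bit value.

```cpp
#include <cstdint>

// Hypothetical helper mirroring: movzx eax, b / imul eax, 0x0101 / shl eax, 16
constexpr std::uint32_t replicate_and_shift(std::uint8_t b)
{
    return (static_cast<std::uint32_t>(b) * 0x0101u) << 16;
}

// testData = 128 (0x80): 0x80 * 0x0101 = 0x8080, shifted left 16 = 0x80800000
static_assert(replicate_and_shift(128) == 0x80800000u, "byte replicated into the top half");
```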
The clobber list is explained here: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html, but not in the context of your __asm syntax, with which I'm not familiar. Trying to compile your snippet I get
jcomeau@intrepid:/tmp$ make test
cc test.c -o test
test.c:4: error: expected ‘(’ before ‘{’ token