This does not look too friendly:
__asm("command 1"
"command 2"
"command 3");
Do I really have to put double quotes around every line?
Also... since multiline string literals do not work in GCC, I could not cheat with that either.
I often find examples on the Internet where the author manually inserts a literal tab and newline instead of \t and \n, but that doesn't work for me. I'm not even sure your example compiles. Here is how I do it:
asm volatile( // note the backslash line-continuation
    "xor %eax,%eax \n\t\
    mov $0x7c802446, %ebx \n\t\
    mov $1000, %ax \n\t\
    push %eax \n\t\
    call *%ebx \n\t\
    add $4, %esp \n\t\
    "
    : "=a"(retval) // output in EAX: function return value
    :
    : "ecx", "edx", "ebx" // tell the compiler about clobbers
      // Also x87 and XMM regs should be listed.
);
Or put double quotes around each line, instead of using \ line-continuation. C string literals separated only by whitespace (including a newline) are concatenated into one long string literal. (Which is why you need the \n inside the string, so the assembler sees separate lines.)
This is less ugly and makes it possible to put C comments on each line.
asm volatile(
    "xor %eax,%eax \n\t"
    "mov $0x7c802446, %ebx \n\t"
    "mov $1000, %ax \n\t"
    "push %eax \n\t" // function arg
    "call *%ebx \n\t"
    "add $4, %esp \n\t" // rebalance the stack: necessary for asm statements
    : "=a"(retval)
    :
    : "ecx", "edx", "ebx" // clobbers. Function calls themselves kill EAX,ECX,EDX
      // function calls also clobber all x87 and all XMM registers, omitted here
);
C++ multiline string literals
Interesting how this question pointed me to the answer:
main.cpp
#include <cassert>
#include <cinttypes>
int main() {
    uint64_t io = 0;
    __asm__ (
        R"(
incq %0
incq %0
)"
        : "+m" (io)
        :
        :
    );
    assert(io == 2);
}
Compile and run:
g++ -o main -pedantic -std=c++11 -Wall -Wextra main.cpp
./main
See also: C++ multiline string literal
GCC also supports the same syntax as a C extension; you just have to use -std=gnu99 instead of -std=c99:
main.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
    uint64_t io = 0;
    __asm__ (
        R"(
incq %0
incq %0
)"
        : "+m" (io)
        :
        :
    );
    assert(io == 2);
}
Compile and run:
gcc -o main -pedantic -std=gnu99 -Wall -Wextra main.c
./main
See also: How to split a string literal across multiple lines in C / Objective-C?
One downside of this method is that I don't see how to use C preprocessor macros in the assembly, since they are not expanded inside string literals, see also: Multi line inline assembly macro with strings
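One workaround is to fall back to ordinary (concatenated) string literals for the lines that need a macro and splice the value in with the usual two-level stringification trick. A minimal sketch (my own example, not from the original; INC_COUNT is a hypothetical macro):

#include <assert.h>
#include <inttypes.h>

#define INC_COUNT 2    /* hypothetical macro we want inside the assembly */
#define STR_(x) #x
#define STR(x) STR_(x) /* two levels so INC_COUNT expands before # applies */

int main(void) {
    uint64_t io = 0;
    __asm__ (
        ".rept " STR(INC_COUNT) "\n\t" /* repeat the increment INC_COUNT times */
        "incq %0\n\t"
        ".endr"
        : "+m" (io)
    );
    assert(io == INC_COUNT);
}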
Tested on Ubuntu 16.04, GCC 6.4.0, binutils 2.26.1.
.incbin
This GNU GAS directive is another thing that should be on your radar if you are going to use large chunks of assembly: Embedding resources in executable using GCC
The assembly will be in a separate file, so it is not a direct answer, but it is still worth knowing about.
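A minimal sketch of the idea (file and symbol names here are my own, hypothetical):

blob.s

.section .rodata
.global blob_start
.global blob_end
blob_start:
.incbin "data.bin"
blob_end:

main.c

#include <stdio.h>

extern const char blob_start[], blob_end[];

int main(void) {
    printf("embedded %td bytes\n", blob_end - blob_start);
}

Compile and run:

gcc -o main main.c blob.s
./main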
I want to use inline assembly to execute a syscall on a PowerPC 32-bit architecture. After performing the syscall, I also want to return the syscall's return values by taking the values of r3 and r4 and packing them into a long. My function looks as follows:
constexpr auto maximum_syscall_parameter_count = 8;
long execute_system_call_with_arguments(short value, const int parameters_array[maximum_syscall_parameter_count]) {
    char return_value_buffer[sizeof(long)];
    // syscall value
    asm volatile("mr 0, %0" : : "r" (value));
    // Pass the parameters
    asm volatile("mr 3, %0" : : "r" (parameters_array[0]));
    asm volatile("mr 4, %0" : : "r" (parameters_array[1]));
    asm volatile("mr 5, %0" : : "r" (parameters_array[2]));
    asm volatile("mr 6, %0" : : "r" (parameters_array[3]));
    asm volatile("mr 7, %0" : : "r" (parameters_array[4]));
    asm volatile("mr 8, %0" : : "r" (parameters_array[5]));
    asm volatile("mr 9, %0" : : "r" (parameters_array[6]));
    asm volatile("mr 10, %0" : : "r" (parameters_array[7]));
    // Execute the syscall
    asm volatile ("sc");
    // Retrieve the return value
    asm volatile ("mr %0, 3" : "=r" (*(int *) &return_value_buffer));
    asm volatile ("mr %0, 4" : "=r" (*(int *) &return_value_buffer[sizeof(int)]));
    return *(long *) &return_value_buffer;
}
This seems to generate correct code, but it feels hacky, and two redundant instructions are generated:
mr r0, r30
lwz r9, 0(r31)
mr r3, r9
lwz r9, 4(r31)
mr r4, r9
lwz r9, 8(r31)
mr r5, r9
lwz r9, 0xC(r31)
mr r6, r9
lwz r9, 0x10(r31)
mr r7, r9
lwz r9, 0x14(r31)
mr r8, r9
lwz r9, 0x18(r31)
mr r9, r9
lwz r9, 0x1C(r31)
mr r10, r9
sc
mr r3, r3 # Redundant
mr r9, r4 # Redundant
blr
My goal is simply to return with r3 and r4 as set by the sc instruction, but removing the return value or the last two inline assembly statements from the source corrupts the function so that it either crashes on return or returns 0.
Let me start by reiterating what I said above: I don't speak PPC asm, and I don't have a PPC to run this code on. So while I believe this is generally the direction you should proceed in, don't take this code as gospel.
Next, the reason both Jester and I suggested using local register variables is that it results in better (and arguably more readable/maintainable) code. The reason for that is this line in the gcc docs:
GCC does not parse the assembler instructions themselves and does not know what they mean or even whether they are valid assembler input.
With that in mind, what happens when you use code like yours above and call the routine with something like:
int parameters_array[maximum_syscall_parameter_count] = {1, 2, 3, 4, 5, 6, 7};
long a = execute_system_call_with_arguments(9, parameters_array);
Since the compiler doesn't know what's going to happen inside that asm block, it must write everything to memory, which the asm block then reads back from memory into registers. With code like the version below, the compiler can be smart enough to skip allocating the memory at all and load the registers directly. This can be even more useful if you are calling execute_system_call_with_arguments more than once with (essentially) the same parameters.
constexpr auto maximum_syscall_parameter_count = 7;

long execute_system_call_with_arguments(const int value, const int parameters_array[maximum_syscall_parameter_count]) {
    int return_value_buffer[2];
    register int foo0 asm("0") = value;
    register int foo1 asm("3") = parameters_array[0];
    register int foo2 asm("4") = parameters_array[1];
    register int foo3 asm("5") = parameters_array[2];
    register int foo4 asm("6") = parameters_array[3];
    register int foo5 asm("7") = parameters_array[4];
    register int foo6 asm("8") = parameters_array[5];
    register int foo7 asm("9") = parameters_array[6];
    // Execute the syscall
    asm volatile ("sc"
        : "+r"(foo3), "+r"(foo4)
        : "r"(foo0), "r"(foo1), "r"(foo2), "r"(foo5), "r"(foo6), "r"(foo7)
    );
    return_value_buffer[0] = foo3;
    return_value_buffer[1] = foo4;
    return *(long *) &return_value_buffer;
}
When called with the example above, this produces:
.L.main:
li 0,9
li 3,1
li 4,2
li 5,3
li 6,4
li 7,5
li 8,6
li 9,7
sc
extsw 3,6
blr
Keeping as much code as possible outside the asm template (constraints are considered "outside") allows gcc's optimizers to do all sorts of useful things.
A few other points:
If any of the items in parameters_array are (or might be) pointers, you're going to need to add the memory clobber. This ensures that any values that might be held in registers get flushed to memory before executing the asm instruction. Adding the memory clobber when it's not needed might slow execution by a couple of instructions; omitting it when it is needed could result in reading incorrect data.
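Applied to the sc statement above, that would look something like this (a sketch only; whether it's required depends on what the kernel does with those pointer arguments):

asm volatile ("sc"
    : "+r"(foo3), "+r"(foo4)
    : "r"(foo0), "r"(foo1), "r"(foo2), "r"(foo5), "r"(foo6), "r"(foo7)
    : "memory" // force pending register-held values out to memory first
);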
If sc modifies any registers that aren't listed here, you must list them as clobbers. And if any registers that ARE listed here (other than foo3 & foo4) change, you must make them input+outputs as well (does sc put a return code in foo0?). Even if you "don't use them" after the asm call, if they change, you HAVE to inform the compiler. As the gcc docs explicitly warn:
Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement.
Failure to heed this warning can result in code that seems to work fine one day, then suddenly causes bizarre failures at some point after (sometimes well after) the asm block. This "works and then suddenly doesn't" is one of the reasons I suggest that you don't use inline asm, but if you must (which you kinda do if you need to call sc directly), keep it as tiny as you can.
I cheated a bit by changing maximum_syscall_parameter_count to 7. Apparently godbolt's gcc doesn't optimize this code as well with more parameters. There might be ways around this if that's necessary, but you'll want a better PPC expert than me to define it.
I'm building Botan on Solaris 11.3 with the SunCC compiler that comes with Developer Studio 12.5. I'm not too familiar with the library or Solaris, and it takes me some effort to track down issues.
The compile is dying on a relatively benign file called divide.cpp. I've got it reduced to the following test case. According to Oracle's GCC-style asm inlining support in Sun Studio 12 compilers, the ASM is well formed. Clang, GCC and ICC happily consume the code.
$ /opt/developerstudio12.5/bin/CC -m64 -std=c++11 test.cxx -c
"test.cxx", [main]:ube: error: Invalid reference to argument '1' in GASM Inlining
CC: ube failed for test.cxx
$ cat test.cxx
#include <iostream>
#include <stdint.h>
typedef uint64_t word;
inline word multadd(word a, word b, word* c)
{
    asm(
        "mulq %[b] \n\t"
        "addq %[c],%[a] \n\t"
        "adcq $0,%[carry] \n\t"
        : [a]"=a"(a), [b]"=rm"(b), [carry]"=&d"(*c)
        : "0"(a), "1"(b), [c]"g"(*c) : "cc");
    return a;
}

int main(int argc, char* argv[])
{
    word a, b, c, d;
    std::cin >> a >> b >> c;
    d = multadd(a, b, &c);
    return 0;
}
I can't find useful information on the error string Invalid reference to argument 'N' in GASM Inlining. I found sunCC chokes on inline assembler on the Oracle boards, but the answer there amounts to: UBE is buggy; buy a support contract to learn more.
I have three questions:
What does the error message indicate?
How can I get SunCC to provide a source file and line number?
How can I work around the issue?
If I change the b parameter to just =m, then the same error is produced. If I change the b parameter to just =r, then a different error is generated:
asm(
    "mulq %[b] \n\t"
    "addq %[c],%[a] \n\t"
    "adcq $0,%[carry] \n\t"
    : [a]"=a"(a), [b]"=r"(b), [carry]"=&d"(*c)
    : "0"(a), "1"(b), [c]"g"(*c) : "cc");
And the result:
$ /opt/developerstudio12.5/bin/CC -m64 -std=c++11 test.cxx -c
Assembler: test.cxx
"<null>", line 205 : Invalid instruction argument
Near line: "mulq %rcx "
"<null>", line 206 : Invalid instruction argument
Near line: " addq %rbx,%rax "
"<null>", line 207 : Invalid instruction argument
Near line: " adcq $0,%rdx "
CC: ube failed for test.cxx
What does the error message indicate?
Unfortunately, no idea.
If someone buys a support contract and has the time, then please solicit Oracle for an answer.
How can I get SunCC to provide a source file and line number?
Unfortunately, no idea.
How can I work around the issue?
David Wohlferd suspected the [b]"=rm"(b) output operand. It looks like the single asm block needs to be split into two blocks. It's an awful hack, but we have not figured out another way to do it.
inline word multadd(word a, word b, word* c)
{
    asm(
        "mulq %[b] \n\t"
        : [a]"+a"(a), [b]"=&d"(b)
        : "0"(a), "1"(b));
    asm(
        "addq %[c],%[a] \n\t"
        "adcq $0,%[carry] \n\t"
        : [a]"=a"(a), [carry]"=&d"(*c)
        : "a"(a), "d"(b), [c]"g"(*c) : "cc");
    return a;
}
I'm trying to produce code (currently using clang++-3.8) that adds two numbers consisting of multiple machine words. To simplify things, for the moment I'm only adding 128-bit numbers, but I'd like to be able to generalise this.
First some typedefs:
typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;
And a "result" type:
struct Result
{
    unsigned_word lo;
    unsigned_word hi;
};
The first function, f, takes two pairs of unsigned words and returns a Result by, as an intermediate step, putting both of these 64-bit words into a 128-bit word before adding them, like so:
Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
    Result x;
    unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
    unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
    unsigned_128 r1 = n1 + n2;
    x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
    x.hi = r1 >> 64;
    return x;
}
This actually gets inlined quite nicely like so:
movq 8(%rsp), %rsi
movq (%rsp), %rbx
addq 24(%rsp), %rsi
adcq 16(%rsp), %rbx
Now, instead, I've written a simpler function using the clang multi-precision primitives, as below:
static Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
    Result x;
    unsigned_word carryout;
    x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
    x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
    return x;
}
This produces the following assembly:
movq 24(%rsp), %rsi
movq (%rsp), %rbx
addq 16(%rsp), %rbx
addq 8(%rsp), %rsi
adcq $0, %rbx
In this case, there's an extra add. Instead of doing an ordinary add on the lo-words, then an adc on the hi-words, it just adds the hi-words, then adds the lo-words, then does an adc on the hi-word again with an argument of zero.
This may not look too bad, but when you try this with larger words (say 192-bit, 256-bit) you soon get a mess of ors and other instructions dealing with the carries up the chain, instead of a simple chain of add, adc, adc, ... adc.
The multi-precision primitives seem to be doing a terrible job at exactly what they're intended to do.
So what I'm looking for is code that I could generalise to any length (no need to actually do it, just enough so I can work out how), for which clang produces additions as efficient as what it does with its built-in 128-bit type (which unfortunately I can't easily generalise). I presume this should just be a chain of adcs, but I'm open to arguments and code saying it should be something else.
There is an intrinsic to do this: _addcarry_u64. However, only Visual Studio and ICC (at least VS 2013 and 2015 and ICC 13 and ICC 15) do this efficiently. Clang 3.7 and GCC 5.2 still don't produce efficient code with this intrinsic.
Clang in addition has a built-in which one would think does this, __builtin_addcll, but it does not produce efficient code either.
The reason Visual Studio does this is that it does not allow inline assembly in 64-bit mode, so the compiler has to provide a way to do this with an intrinsic (though Microsoft took their time implementing it).
Therefore, with Visual Studio use _addcarry_u64. With ICC use _addcarry_u64 or inline assembly. With Clang and GCC use inline assembly.
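For reference, a minimal 128-bit sketch using _addcarry_u64 (my own example, not from the original answer; MSVC declares the intrinsic in <intrin.h>, GCC/Clang/ICC in <x86intrin.h> or <immintrin.h>):

#include <x86intrin.h>

// (hi:lo) += (hi2:lo2); the carry out of the low halves feeds the high halves
void add128(unsigned long long *lo, unsigned long long *hi,
            unsigned long long lo2, unsigned long long hi2)
{
    unsigned char c = _addcarry_u64(0, *lo, lo2, lo);
    _addcarry_u64(c, *hi, hi2, hi);
}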
Note that since the Broadwell microarchitecture there are two new instructions, adcx and adox, which you can access with the _addcarryx_u64 intrinsic. Intel's documentation for these intrinsics used to differ from the assembly the compiler produced, but it appears their documentation is correct now. However, Visual Studio still only appears to produce adcx with _addcarryx_u64, whereas ICC produces both adcx and adox with this intrinsic. But even though ICC produces both instructions, it does not produce the most optimal code (ICC 15), and so inline assembly is still necessary.
Personally, I think the fact that a non-standard feature of C/C++, such as inline assembly or intrinsics, is required to do this is a weakness of C/C++ but others might disagree. The adc instruction has been in the x86 instruction set since 1979. I would not hold my breath on C/C++ compilers being able to optimally figure out when you want adc. Sure they can have built-in types such as __int128 but the moment you want a larger type that's not built-in you have to use some non-standard C/C++ feature such as inline assembly or intrinsics.
In terms of inline assembly code to do this, I already posted a solution for 256-bit addition for eight 64-bit integers in registers at multi-word addition using the carry flag.
Here is that code reposted.
#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
    __asm__ __volatile__ ( \
        "addq %[v1], %[u1] \n" \
        "adcq %[v2], %[u2] \n" \
        "adcq %[v3], %[u3] \n" \
        "adcq %[v4], %[u4] \n" \
        : [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
        : [v1] "r" (Y1), [v2] "r" (Y2), [v3] "r" (Y3), [v4] "r" (Y4))
If you want to explicitly load the values from memory, you can do something like this:
//uint64_t dst[4] = {1,1,1,1};
//uint64_t src[4] = {1,2,3,4};
asm (
    "movq (%[in]), %%rax\n"
    "addq %%rax, %[out]\n"
    "movq 8(%[in]), %%rax\n"
    "adcq %%rax, 8%[out]\n"
    "movq 16(%[in]), %%rax\n"
    "adcq %%rax, 16%[out]\n"
    "movq 24(%[in]), %%rax\n"
    "adcq %%rax, 24%[out]\n"
    : [out] "=m" (dst)
    : [in]"r" (src)
    : "%rax"
);
That produces nearly identical assembly to the following function compiled with ICC:
void add256(uint256 *x, uint256 *y) {
    unsigned char c = 0;
    c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
    c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
    c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
    _addcarry_u64(c, x->x4, y->x4, &x->x4);
}
I have limited experience with GCC inline assembly (or inline assembly in general - I usually use an assembler such as NASM) so maybe there are better inline assembly solutions.
So what I'm looking for is code that I could generalize to any length
To answer this question, here is another solution using template metaprogramming. I used this same trick for loop unrolling. This produces optimal code with ICC. If Clang or GCC ever implement _addcarry_u64 efficiently, this would be a good general solution.
#include <x86intrin.h>
#include <inttypes.h>

#define LEN 4 // N = N*64-bit add, e.g. 4=256-bit add, 3=192-bit add, ...

static unsigned char c = 0;

template<int START, int N>
struct Repeat {
    static void add (uint64_t *x, uint64_t *y) {
        c = _addcarry_u64(c, x[START], y[START], &x[START]);
        Repeat<START+1, N>::add(x,y);
    }
};

template<int N>
struct Repeat<LEN, N> {
    static void add (uint64_t *x, uint64_t *y) {}
};

void sum_unroll(uint64_t *x, uint64_t *y) {
    Repeat<0,LEN>::add(x,y);
}
Assembly from ICC
xorl %r10d, %r10d #12.13
movzbl c(%rip), %eax #12.13
cmpl %eax, %r10d #12.13
movq (%rsi), %rdx #12.13
adcq %rdx, (%rdi) #12.13
movq 8(%rsi), %rcx #12.13
adcq %rcx, 8(%rdi) #12.13
movq 16(%rsi), %r8 #12.13
adcq %r8, 16(%rdi) #12.13
movq 24(%rsi), %r9 #12.13
adcq %r9, 24(%rdi) #12.13
setb %r10b
Metaprogramming is a basic feature of assemblers, so it's too bad C and C++ (except through template metaprogramming hacks) have no solution for this either (the D language does).
The inline assembly I used above, which referenced memory, was causing some problems in a function. Here is a new version which seems to work better:
void foo(uint64_t *dst, uint64_t *src)
{
    __asm (
        "movq (%[in]), %%rax\n"
        "addq %%rax, (%[out])\n"
        "movq 8(%[in]), %%rax\n"
        "adcq %%rax, 8(%[out])\n"
        "movq 16(%[in]), %%rax\n"
        "adcq %%rax, 16(%[out])\n"
        "movq 24(%[in]), %%rax\n"
        "adcq %%rax, 24(%[out])\n"
        :
        : [in] "r" (src), [out] "r" (dst)
        : "%rax"
    );
}
On Clang 6, both __builtin_addcll and __builtin_add_overflow produce the same, optimal assembly.
Result g(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
    Result x;
    unsigned_word carryout;
    x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
    x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
    return x;
}

Result h(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
    Result x;
    unsigned_word carryout;
    carryout = __builtin_add_overflow(lo1, lo2, &x.lo);
    carryout = __builtin_add_overflow(hi1, carryout, &hi1);
    __builtin_add_overflow(hi1, hi2, &x.hi);
    return x;
}
Assembly for both:
add rdi, rdx
adc rsi, rcx
mov rax, rdi
mov rdx, rsi
ret
Starting with clang 5.0 it is possible to get good results using __uint128_t addition and extracting the carry bit by shifting:
inline uint64_t add_with_carry(uint64_t &a, const uint64_t &b, const uint64_t &c)
{
    __uint128_t s = __uint128_t(a) + b + c;
    a = s;
    return s >> 64;
}
In many situations clang still does strange operations (I assume because of possible aliasing?), but usually copying one variable into a temporary helps.
Usage examples, with the following struct:
template<int size> struct LongInt
{
    uint64_t data[size];
};
Manual usage:
void test(LongInt<3> &a, const LongInt<3> &b_)
{
    const LongInt<3> b = b_; // need to copy b_ into a local temporary
    uint64_t c0 = add_with_carry(a.data[0], b.data[0], 0);
    uint64_t c1 = add_with_carry(a.data[1], b.data[1], c0);
    uint64_t c2 = add_with_carry(a.data[2], b.data[2], c1);
}
Generic solution:
template<int size>
void addTo(LongInt<size> &a, const LongInt<size> b)
{
    __uint128_t c = __uint128_t(a.data[0]) + b.data[0];
    for(int i=1; i<size; ++i)
    {
        c = __uint128_t(a.data[i]) + b.data[i] + (c >> 64);
        a.data[i] = c;
    }
}
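For example (my own sketch):

LongInt<4> a = {{1, 2, 3, 4}}; // least-significant word first
LongInt<4> b = {{5, 6, 7, 8}};
addTo(a, b); // a now holds the 256-bit sum; b was passed by value, so no aliasing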
Godbolt Link: All examples above are compiled to only mov, add and adc instructions (starting with clang 5.0, and at least -O2).
The examples don't produce good code with gcc (up to 8.1, which at the moment is the highest version on godbolt).
And I did not yet manage to get anything usable with __builtin_addcll ...
The code using __builtin_addcll is fully optimized by Clang since version 10, for chains of at least 3 (which require an adc with variable carry-in that also produces a carry-out). Godbolt shows clang 9 making a mess of setc/movzx for that case.
Clang 6 and later handle it well for the much easier case of chains of 2, as shown in #zneak's answer, where no carry-out from an adc is needed.
The idiomatic code without builtins is good too. Moreover, it works in every compiler and is also fully optimized by GCC 5+ for chains of 2 (add/adc, without using the carry-out from the adc). It's tricky to write correct C that generates carry-out when there's carry-in, so this doesn't extend easily.
Result h (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
    unsigned_word lo = lo1 + lo2;
    bool carry = lo < lo1;
    unsigned_word hi = hi1 + hi2 + carry;
    return Result{lo, hi};
}
https://godbolt.org/z/ThxGj1WGK
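To illustrate why the carry-in case is tricky, here is the way it is usually written in plain C (a sketch of the standard idiom using the question's unsigned_word typedef, not from the original answer; how reliably compilers turn it into a single adc varies):

static unsigned_word add_word_carry(unsigned_word a, unsigned_word b,
                                    bool carry_in, bool *carry_out)
{
    unsigned_word s = a + b + carry_in;
    // with a carry-in, wraparound leaves s <= a; without one, s < a signals the carry
    *carry_out = carry_in ? (s <= a) : (s < a);
    return s;
}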
I am attempting to use an integer template argument in some asm macros, but I'm running into a substitution problem. Below is a (very) simplified version of what I'm doing:
template<int X> void delay() {
    asm __volatile__(
        " .macro fl_delay dtime, reg=r0 \n\t"
        " .if (\\dtime > 0) \n\t"
        " .set dcycle, (\\dtime / 3) \n\t"
        " .set dwork, (dcycle * 3) \n\t"
        " .set drem, (\\dtime - dwork) \n\t"
        " .rept (drem) \n\t"
        " nop \n\t"
        " .endr \n\t"
        " .if dcycle >= 0 \n\t"
        " mov \\reg, #dcycle \n\t"
        " loop_\\#: \n\t"
        " sub \\reg, #1 \n\t"
        " bne loop_\\# \n\t"
        " .endif \n\t"
        " .endif \n\t"
        " .endm \n\t"
        " fl_delay %[X], r0 \n\t"
        : [X] "I" (X)
        :
        : "r0");
}
The macro basically allows me to do cycle-count-accurate delays. So, for example, if in my asm code I write:
" fl_delay 13, r0"
it will delay for exactly 13 cycles (using register r0 for the loop counter). This part works fantastically well in a bunch of testing. The problem comes when I want to make the argument to my macro depend on a template argument (the X above). If I write delay<13>(); elsewhere, the asm for the generated function looks like:
; fl_delay macro definition here
fl_delay #13, r0
The problem is that gnu as's expression parser doesn't like the #. I suspect the problem is the bit in the macro where, say, I do .set dcycle, (\\dtime / 3). The \\dtime will get replaced with #13, making that line .set dcycle, (#13 / 3) (which I have been able to verify by putting that exact line in some test code; the as error message is the same).
So, I have one of two questions that I need to find an answer to:
Is there a way for me to get the int value from a template parameter into the asm block without the leading #? (I've tried a bunch of asm constraints but they all put the leading # in there)
If I can't do the above, is there a way for me to strip the leading character of an as .macro argument value? I've done a bunch of digging but haven't been able to find anything along those lines.
I don't believe there is a way for me to abuse the C preprocessor to do this, as that runs and does its substitutions before any template/C++ parsing is done.
(What I'm writing is some code that has very strict timing requirements, depending on the piece of hardware being talked to - the protocol is pretty much the same for different types of the hardware, only the timing differs. Because I'm running on hardware with pretty low clock speeds (16-48 MHz), sometimes the delay will be 0-3 clocks, which is short enough to mean that I can't do it in asm, which is why I'm using .macros - because then when there needs to be a delay of 0 (because of the combination of clock speed and hardware timings) nothing gets emitted, if I need a delay of 1, a single nop gets emitted, etc...)
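A possible alternative to explore for question 1: GCC documents a generic %c operand modifier that prints a constant operand without the immediate punctuation, which might avoid the leading # entirely. A sketch (untested on this target):

// with %c the substituted line is ".set dcycle, (13 / 3)";
// with plain %[X] it would be ".set dcycle, (#13 / 3)" and fail to assemble
template<int X> void probe() {
    asm __volatile__ (" .set dcycle, (%c[X] / 3) \n\t" : : [X] "I" (X));
}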
EDIT: I have a potential solution; it is ugly, so I'm going to leave the question up here in case folks have another option. There's a limit to how large my delays can be, so I can basically do something along the following lines:
template<int X> void delay() {
    switch(X) {
        case 0: asm __volatile__ (".set X, 0\n\t"); break;
        case 1: asm __volatile__ (".set X, 1\n\t"); break;
        case 2: asm __volatile__ (".set X, 2\n\t"); break;
        case 3: asm __volatile__ (".set X, 3\n\t"); break;
        case 4: asm __volatile__ (".set X, 4\n\t"); break;
        case 5: asm __volatile__ (".set X, 5\n\t"); break;
        case 6: asm __volatile__ (".set X, 6\n\t"); break;
    }
    asm __volatile__(
        " .macro fl_delay dtime, reg=r0 \n\t"
        " .if (\\dtime > 0) \n\t"
        " .set dcycle, (\\dtime / 3) \n\t"
        " .set dwork, (dcycle * 3) \n\t"
        " .set drem, (\\dtime - dwork) \n\t"
        " .rept (drem) \n\t"
        " nop \n\t"
        " .endr \n\t"
        " .if dcycle >= 0 \n\t"
        " mov \\reg, #dcycle \n\t"
        " loop_\\#: \n\t"
        " sub \\reg, #1 \n\t"
        " bne loop_\\# \n\t"
        " .endif \n\t"
        " .endif \n\t"
        " .endm \n\t"
        " fl_delay X, r0 \n\t"
        :
        :
        : "r0");
}
void loop() { delay<5>(); }
Which means that for the delay<5>(); instantiation, there's a .set X, 5 that gets emitted just before the rest of my asm block. It'll be ugly (my delays could be up to 120 cycles) but it will work.