Invalid reference to argument 'N' in GASM Inlining - c++

I'm building Botan on Solaris 11.3 with the SunCC compiler that comes with Developer Studio 12.5. I'm not too familiar with the library or Solaris, and it takes me some effort to track down issues.
The compile dies on a relatively benign file called divide.cpp. I've reduced it to the following test case. According to Oracle's "GCC-style asm inlining support in Sun Studio 12 compilers", the ASM is well formed. Clang, GCC and ICC happily consume the code.
$ /opt/developerstudio12.5/bin/CC -m64 -std=c++11 test.cxx -c
"test.cxx", [main]:ube: error: Invalid reference to argument '1' in GASM Inlining
CC: ube failed for test.cxx
$ cat test.cxx
#include <iostream>
#include <stdint.h>
typedef uint64_t word;
inline word multadd(word a, word b, word* c)
{
asm(
"mulq %[b] \n\t"
"addq %[c],%[a] \n\t"
"adcq $0,%[carry] \n\t"
: [a]"=a"(a), [b]"=rm"(b), [carry]"=&d"(*c)
: "0"(a), "1"(b), [c]"g"(*c) : "cc");
return a;
}
int main(int argc, char* argv[])
{
word a, b, c, d;
std::cin >> a >> b >> c;
d = multadd(a, b, &c);
return 0;
}
I can't find useful information on the error string "Invalid reference to argument 'N' in GASM Inlining". I found sunCC chokes on inline assembler on the Oracle boards, but the answer there amounts to "UBE is buggy; buy a support contract to learn more."
I have three questions:
What does the error message indicate?
How can I get SunCC to provide a source file and line number?
How can I work around the issue?
If I change the b parameter to just =m, then the same error is produced. If I change the b parameter to just =r, then a different error is generated:
asm(
"mulq %[b] \n\t"
"addq %[c],%[a] \n\t"
"adcq $0,%[carry] \n\t"
: [a]"=a"(a), [b]"=r"(b), [carry]"=&d"(*c)
: "0"(a), "1"(b), [c]"g"(*c) : "cc");
And the result:
$ /opt/developerstudio12.5/bin/CC -m64 -std=c++11 test.cxx -c
Assembler: test.cxx
"<null>", line 205 : Invalid instruction argument
Near line: "mulq %rcx "
"<null>", line 206 : Invalid instruction argument
Near line: " addq %rbx,%rax "
"<null>", line 207 : Invalid instruction argument
Near line: " adcq $0,%rdx "
CC: ube failed for test.cxx

What does the error message indicate?
Unfortunately, no idea.
If someone buys a support contract and has the time, then please solicit Oracle for an answer.
How can I get SunCC to provide a source file and line number?
Unfortunately, no idea.
How can I work around the issue?
David Wohlferd suspected the [b]"=rm"(b) output operand. It looks like the single ASM block needs to be split into two blocks. It's an awful hack, but we have not figured out another way to do it.
inline word multadd(word a, word b, word* c)
{
asm(
"mulq %[b] \n\t"
: [a]"+a"(a), [b]"=&d"(b)
: "0"(a), "1"(b));
asm(
"addq %[c],%[a]" \n\t"
"adcq $0,%[carry] \n\t"
: [a]"=a"(a), [carry]"=&d"(*c)
: "a"(a), "d"(b), [c]"g"(*c) : "cc");
return a;
}
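For what it's worth, here is the quick sanity check I would run against the compiler's native 128-bit arithmetic (this harness is my own, not part of the original post; it assumes the word typedef and multadd above are in scope):

#include <iostream>
#include <stdint.h>

int main()
{
    word a = 0xFFFFFFFFFFFFFFFFULL, b = 2, c = 5;
    unsigned __int128 ref = (unsigned __int128)a * b + c; // reference: a*b + c
    word lo = multadd(a, b, &c); // returns the low word; c now holds the high word
    std::cout << ((lo == (word)ref && c == (word)(ref >> 64)) ? "ok" : "mismatch") << "\n";
    return 0;
}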

Related

GCC inline assembly - assign number to register [duplicate]

This does not look too friendly:
__asm("command 1"
"command 2"
"command 3");
Do I really have to put double quotes around every line?
Also... since multiline string literals do not work in GCC, I could not cheat with that either.
I often find examples on the Internet where the author manually inserts a tab and newline instead of \t and \n, but that doesn't work for me. I'm not sure your example even compiles, but this is how I do it:
asm volatile( // note the backslash line-continuation
"xor %eax,%eax \n\t\
mov $0x7c802446, %ebx \n\t\
mov $1000, %ax \n\t\
push %eax \n\t\
call *%ebx \n\t\
add $4, %esp \n\t\
"
: "=a"(retval) // output in EAX: function return value
:
: "ecx", "edx", "ebx" // tell compiler about clobbers
// Also x87 and XMM regs should be listed.
);
Or put double quotes around each line, instead of using \ line-continuation. C string literals separated only by whitespace (including a newline) are concatenated into one long string literal. (Which is why you need the \n inside each one, so they are separate lines when seen by the assembler.)
This is less ugly and makes it possible to put C comments on each line.
asm volatile(
"xor %eax,%eax \n\t"
"mov $0x7c802446, %ebx \n\t"
"mov $1000, %ax \n\t"
"push %eax \n\t" // function arg
"call *%ebx \n\t"
"add $4, %esp \n\t" // rebalance the stack: necessary for asm statements
: "=a"(retval)
:
: "ecx", "edx", "ebx" // clobbers. Function calls themselves kill EAX,ECX,EDX
// function calls also clobber all x87 and all XMM registers, omitted here
);
C++ multiline string literals
Interesting how this question pointed me to the answer:
main.cpp
#include <cassert>
#include <cinttypes>
int main() {
uint64_t io = 0;
__asm__ (
R"(
incq %0
incq %0
)"
: "+m" (io)
:
:
);
assert(io == 2);
}
Compile and run:
g++ -o main -pedantic -std=c++11 -Wall -Wextra main.cpp
./main
See also: C++ multiline string literal
GCC also adds the same syntax as a C extension; you just have to use -std=gnu99 instead of -std=c99:
main.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
uint64_t io = 0;
__asm__ (
R"(
incq %0
incq %0
)"
: "+m" (io)
:
:
);
assert(io == 2);
}
Compile and run:
gcc -o main -pedantic -std=gnu99 -Wall -Wextra main.c
./main
See also: How to split a string literal across multiple lines in C / Objective-C?
One downside of this method is that I don't see how to add C preprocessor macros in the assembly, since they are not expanded inside of strings, see also: Multi line inline assembly macro with strings
Tested on Ubuntu 16.04, GCC 6.4.0, binutils 2.26.1.
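For what it's worth, with ordinary (non-raw) string literals you can still splice a macro's value into the assembly text via the usual two-level stringize trick. A minimal sketch (the INC_COUNT macro is my own illustration, not something from the question):

#include <assert.h>
#include <inttypes.h>

#define STR(s) #s
#define XSTR(s) STR(s) /* expand the macro first, then stringize it */
#define INC_COUNT 2 /* hypothetical value we want inside the asm */

int main(void) {
    uint64_t io = 0;
    __asm__ (
        ".rept " XSTR(INC_COUNT) "\n\t" /* literals concatenate to ".rept 2" */
        "incq %0\n\t"
        ".endr\n\t"
        : "+m" (io)
    );
    assert(io == 2);
    return 0;
}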
.incbin
This GNU GAS directive is another thing that should be on your radar if you are going to use large chunks of assembly: Embedding resources in executable using GCC
The assembly will be in a separate file, so it is not a direct answer, but it is still worth knowing about.
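For illustration, a file-scope basic asm statement is one common way to use it; a sketch of my own, assuming a typical ELF toolchain and a file named data.bin sitting next to the source:

#include <stddef.h>

__asm__(
    ".section .rodata\n"
    "blob_start: .incbin \"data.bin\"\n" /* paste the file's bytes verbatim */
    "blob_end:\n"
    ".previous\n"
);
extern const char blob_start[], blob_end[]; /* symbols delimiting the blob */

size_t blob_size(void) { return (size_t)(blob_end - blob_start); }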

Producing good add with carry code from clang

I'm trying to produce code (currently using clang++-3.8) that adds two numbers consisting of multiple machine words. To simplify things for the moment I'm only adding 128-bit numbers, but I'd like to be able to generalise this.
First some typedefs:
typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;
And a "result" type:
struct Result
{
unsigned_word lo;
unsigned_word hi;
};
The first function, f, takes two pairs of unsigned words and returns a result; as an intermediate step it packs each pair of 64-bit words into a 128-bit word before adding them, like so:
Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
unsigned_128 r1 = n1 + n2;
x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
x.hi = r1 >> 64;
return x;
}
This actually gets inlined quite nicely like so:
movq 8(%rsp), %rsi
movq (%rsp), %rbx
addq 24(%rsp), %rsi
adcq 16(%rsp), %rbx
Now, instead I've written a simpler function using the clang multi-precision primitives, as below:
static Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
return x;
}
This produces the following assembly:
movq 24(%rsp), %rsi
movq (%rsp), %rbx
addq 16(%rsp), %rbx
addq 8(%rsp), %rsi
adcq $0, %rbx
In this case, there's an extra add. Instead of doing an ordinary add on the lo-words, then an adc on the hi-words, it just adds the hi-words, then adds the lo-words, then does an adc on the hi-word again with an argument of zero.
This may not look too bad, but when you try this with larger words (say 192-bit, 256-bit) you soon get a mess of ors and other instructions dealing with the carries up the chain, instead of a simple chain of add, adc, adc, ..., adc.
The multi-precision primitives seem to be doing a terrible job at exactly what they're intended to do.
So what I'm looking for is code that I could generalise to any length (no need to actually do it, just enough so I can work out how), for which clang produces additions as efficiently as it does with its built-in 128-bit type (which unfortunately I can't easily generalise). I presume this should just be a chain of adcs, but I'm welcome to arguments and code that it should be something else.
There is an intrinsic to do this: _addcarry_u64. However, only Visual Studio and ICC (at least VS 2013 and 2015 and ICC 13 and ICC 15) do this efficiently. Clang 3.7 and GCC 5.2 still don't produce efficient code with this intrinsic.
Clang in addition has a built-in which one would think does this, __builtin_addcll, but it does not produce efficient code either.
The reason Visual Studio does this is that it does not allow inline assembly in 64-bit mode, so the compiler has to provide a way to do this with an intrinsic (though Microsoft took their time implementing it).
Therefore, with Visual Studio use _addcarry_u64. With ICC use _addcarry_u64 or inline assembly. With Clang and GCC use inline assembly.
Note that since the Broadwell microarchitecture there are two new instructions, adcx and adox, which you can access with the _addcarryx_u64 intrinsic. Intel's documentation for these intrinsics used to be different from the assembly produced by the compiler, but it appears their documentation is correct now. However, Visual Studio still only appears to produce adcx with _addcarryx_u64, whereas ICC produces both adcx and adox with this intrinsic. But even though ICC produces both instructions it does not produce the most optimal code (ICC 15), and so inline assembly is still necessary.
Personally, I think the fact that a non-standard feature of C/C++, such as inline assembly or intrinsics, is required to do this is a weakness of C/C++ but others might disagree. The adc instruction has been in the x86 instruction set since 1979. I would not hold my breath on C/C++ compilers being able to optimally figure out when you want adc. Sure they can have built-in types such as __int128 but the moment you want a larger type that's not built-in you have to use some non-standard C/C++ feature such as inline assembly or intrinsics.
In terms of inline assembly code to do this, I already posted a solution for 256-bit addition of eight 64-bit integers in registers at multi-word addition using the carry flag. Here is that code reposted.
#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
__asm__ __volatile__ ( \
"addq %[v1], %[u1] \n" \
"adcq %[v2], %[u2] \n" \
"adcq %[v3], %[u3] \n" \
"adcq %[v4], %[u4] \n" \
: [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
: [v1] "r" (Y1), [v2] "r" (Y2), [v3] "r" (Y3), [v4] "r" (Y4))
If you want to explicitly load the values from memory you can do something like this
//uint64_t dst[4] = {1,1,1,1};
//uint64_t src[4] = {1,2,3,4};
asm (
"movq (%[in]), %%rax\n"
"addq %%rax, %[out]\n"
"movq 8(%[in]), %%rax\n"
"adcq %%rax, 8%[out]\n"
"movq 16(%[in]), %%rax\n"
"adcq %%rax, 16%[out]\n"
"movq 24(%[in]), %%rax\n"
"adcq %%rax, 24%[out]\n"
: [out] "=m" (dst)
: [in]"r" (src)
: "%rax"
);
That produces nearly identical assembly to what ICC generates from the following function:
void add256(uint256 *x, uint256 *y) {
unsigned char c = 0;
c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
_addcarry_u64(c, x->x4, y->x4, &x->x4);
}
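(The uint256 type is not defined in the answer; for the snippet above to compile I assume a simple four-limb struct along these lines, least-significant limb first:)

typedef struct {
    unsigned long long x1, x2, x3, x4; /* x1 holds the least-significant 64 bits */
} uint256;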
I have limited experience with GCC inline assembly (or inline assembly in general - I usually use an assembler such as NASM) so maybe there are better inline assembly solutions.
So what I'm looking for is code that I could generalize to any length
To answer this question here is another solution using template meta programming. I used this same trick for loop unrolling. This produces optimal code with ICC. If Clang or GCC ever implement _addcarry_u64 efficiently this would be a good general solution.
#include <x86intrin.h>
#include <inttypes.h>
#define LEN 4 // N = N*64-bit add e.g. 4=256-bit add, 3=192-bit add, ...
static unsigned char c = 0;
template<int START, int N>
struct Repeat {
static void add (uint64_t *x, uint64_t *y) {
c = _addcarry_u64(c, x[START], y[START], &x[START]);
Repeat<START+1, N>::add(x,y);
}
};
template<int N>
struct Repeat<LEN, N> {
static void add (uint64_t *x, uint64_t *y) {}
};
void sum_unroll(uint64_t *x, uint64_t *y) {
Repeat<0,LEN>::add(x,y);
}
Assembly from ICC
xorl %r10d, %r10d #12.13
movzbl c(%rip), %eax #12.13
cmpl %eax, %r10d #12.13
movq (%rsi), %rdx #12.13
adcq %rdx, (%rdi) #12.13
movq 8(%rsi), %rcx #12.13
adcq %rcx, 8(%rdi) #12.13
movq 16(%rsi), %r8 #12.13
adcq %r8, 16(%rdi) #12.13
movq 24(%rsi), %r9 #12.13
adcq %r9, 24(%rdi) #12.13
setb %r10b
Meta programming is a basic feature of assemblers so it's too bad C and C++ (except through template meta programming hacks) have no solution for this either (the D language does).
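As an aside, on a C++17 compiler the same unrolling can be written without the helper struct by using if constexpr; this is my own reformulation, and the _addcarry_u64 code-generation caveats above still apply:

#include <x86intrin.h>

// Recursion ends when I reaches N; each step chains the carry flag.
template<int I, int N>
inline void add_unrolled(unsigned char c, unsigned long long *x,
                         const unsigned long long *y) {
    if constexpr (I < N) {
        c = _addcarry_u64(c, x[I], y[I], &x[I]);
        add_unrolled<I + 1, N>(c, x, y);
    }
}

void sum256(unsigned long long *x, const unsigned long long *y) {
    add_unrolled<0, 4>(0, x, y); // four 64-bit limbs = one 256-bit add
}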
The inline assembly I used above which referenced memory was causing some problems in a function. Here is a new version which seems to work better
void foo(uint64_t *dst, uint64_t *src)
{
__asm (
"movq (%[in]), %%rax\n"
"addq %%rax, (%[out])\n"
"movq 8(%[in]), %%rax\n"
"adcq %%rax, 8(%[out])\n"
"movq 16(%[in]), %%rax\n"
"addq %%rax, 16(%[out])\n"
"movq 24(%[in]), %%rax\n"
"adcq %%rax, 24(%[out])\n"
:
: [in] "r" (src), [out] "r" (dst)
: "%rax"
);
}
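A quick usage example of my own to show the carry propagating:

uint64_t dst[4] = {1, 1, 1, 1};
uint64_t src[4] = {~0ULL, 0, 0, 0};
foo(dst, src);
// dst is now {0, 2, 1, 1}: the low limb wrapped around and carried upward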
On Clang 6, both __builtin_addcll and __builtin_add_overflow produce the same, optimal assembly.
Result g(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
return x;
}
Result h(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
carryout = __builtin_add_overflow(lo1, lo2, &x.lo);
carryout = __builtin_add_overflow(hi1, carryout, &hi1);
__builtin_add_overflow(hi1, hi2, &x.hi);
return x;
}
Assembly for both:
add rdi, rdx
adc rsi, rcx
mov rax, rdi
mov rdx, rsi
ret
Starting with clang 5.0 it is possible to get good results using __uint128_t addition and getting the carry bit by shifting:
inline uint64_t add_with_carry(uint64_t &a, const uint64_t &b, const uint64_t &c)
{
__uint128_t s = __uint128_t(a) + b + c;
a = s;
return s >> 64;
}
In many situations clang still does strange operations (I assume because of possible aliasing?), but usually copying one variable into a temporary helps.
Usage examples with
template<int size> struct LongInt
{
uint64_t data[size];
};
Manual usage:
void test(LongInt<3> &a, const LongInt<3> &b_)
{
const LongInt<3> b = b_; // need to copy b_ into local temporary
uint64_t c0 = add_with_carry(a.data[0], b.data[0], 0);
uint64_t c1 = add_with_carry(a.data[1], b.data[1], c0);
uint64_t c2 = add_with_carry(a.data[2], b.data[2], c1);
}
Generic solution:
template<int size>
void addTo(LongInt<size> &a, const LongInt<size> b)
{
__uint128_t c = __uint128_t(a.data[0]) + b.data[0];
for(int i=1; i<size; ++i)
{
c = __uint128_t(a.data[i]) + b.data[i] + (c >> 64);
a.data[i] = c;
}
}
Godbolt Link: All examples above are compiled to only mov, add and adc instructions (starting with clang 5.0, and at least -O2).
The examples don't produce good code with gcc (up to 8.1, which at the moment is the highest version on godbolt).
And I did not yet manage to get anything usable with __builtin_addcll ...
The code using __builtin_addcll is fully optimized by Clang since version 10, for chains of at least 3 (which require an adc with variable carry-in that also produces a carry-out). Godbolt shows clang 9 making a mess of setc/movzx for that case.
Clang 6 and later handle it well for the much easier case of chains of 2, as shown in @zneak's answer, where no carry-out from an adc is needed.
The idiomatic code without builtins is good too. Moreover, it works in every compiler and is also fully optimized by GCC 5+ for chains of 2 (add/adc, without using the carry-out from the adc). It's tricky to write correct C that generates carry-out when there's carry-in, so this doesn't extend easily.
Result h (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
unsigned_word lo = lo1 + lo2;
bool carry = lo < lo1;
unsigned_word hi = hi1 + hi2 + carry;
return Result{lo, hi};
}
https://godbolt.org/z/ThxGj1WGK
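For reference, extending the idiomatic version so it also produces a carry-out in the presence of a carry-in needs one extra comparison, because either the sum or the added carry-in can overflow (never both). A sketch of my own, not from the answer above:

#include <stdint.h>

// One full add-with-carry limb step: *sum = a + b + cin, returns the carry-out (0 or 1).
static inline uint64_t adc64(uint64_t a, uint64_t b, uint64_t cin, uint64_t *sum)
{
    uint64_t s = a + b;
    uint64_t c = (s < a); // carry out of a + b
    s += cin;
    c += (s < cin); // carry out of adding the carry-in
    *sum = s;
    return c;
}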

C++ template parameters as arguments to assembler macros in an asm block (gcc/gas)

I am attempting to use an integer template argument in some asm macros, but I'm running into a substitution problem. Below is a (very) simplified version of what I'm doing:
template<int X> void delay() {
asm __volatile__(
" .macro fl_delay dtime, reg=r0 \n\t"
" .if (\\dtime > 0) \n\t"
" .set dcycle, (\\dtime / 3) \n\t"
" .set dwork, (dcycle * 3) \n\t"
" .set drem, (\\dtime - dwork) \n\t"
" .rept (drem) \n\t"
" nop \n\t"
" .endr \n\t"
" .if dcycle >= 0 \n\t"
" mov \\reg, #dcycle \n\t"
" loop_\\#: \n\t"
" sub \\reg, #1 \n\t"
" bne loop_\\# \n\t"
" .endif \n\t"
" .endif \n\t"
" .endm \n\t"
" fl_delay %[X], r0 \n\t"
: [X] "I" (X)
:
: "r0");
}
The macro basically allows me to do cycle-count-accurate delays. So, for example, if in my asm code I write:
" fl_delay 13, r0"
it will delay for exactly 13 cycles (using register r0 for the loop counter). This part works fantastically well in a bunch of testing. The problem comes when I want to make the argument to my macro depend on a template argument (The X, above). If I (elsewhere) write delay<13>(); the asm for the generated function looks like:
; fl_delay macro definition here
fl_delay #13, r0
The problem is, gnu as's expression parser doesn't like the #. I suspect the problem is the bit in the macro where, say, I do .set dcycle, (\\dtime / 3). The \\dtime will get replaced with #13 making that line .set dcycle, (#13 / 3) (which I have been able to verify by putting that exact line in some test code, the as error message is the same).
So, I have one of two questions that I need to find an answer to:
Is there a way for me to get the int value from a template parameter into the asm block without the leading #? (I've tried a bunch of asm constraints but they all put the leading # in there)
If I can't do the above, is there a way for me to strip the leading character of an as .macro argument value? I've done a bunch of digging but haven't been able to find anything along those lines.
I don't believe there is a way for me to abuse the C preprocessor to do this, as that runs and does its substitutions before any template/C++ parsing is done.
(What I'm writing is some code that has very strict timing requirements, depending on the piece of hardware being talked to - the protocol is pretty much the same for different types of the hardware, the timing is just different. Because I'm running on hardware with pretty low clock speeds (16-48MHz), sometimes the delay will be 0-3 clocks, which is too short for a runtime loop, which is why I'm using .macros - because then when there needs to be a delay of 0 (because of the combination of clock speed and hardware timings) nothing gets emitted, if I need a delay of 1, then a single nop gets emitted, etc...)
EDIT: I have a potential solution. It is ugly, so I'm going to leave the question up here in case folks have another option. There's a limit to how large my delays can be, so I can basically do something like the following:
template<int X> void delay() {
switch(X) {
case 0: asm __volatile__ (".set X, 0\n\t"); break;
case 1: asm __volatile__ (".set X, 1\n\t"); break;
case 2: asm __volatile__ (".set X, 2\n\t"); break;
case 3: asm __volatile__ (".set X, 3\n\t"); break;
case 4: asm __volatile__ (".set X, 4\n\t"); break;
case 5: asm __volatile__ (".set X, 5\n\t"); break;
case 6: asm __volatile__ (".set X, 6\n\t"); break;
}
asm __volatile__(
" .macro fl_delay dtime, reg=r0 \n\t"
" .if (\\dtime > 0) \n\t"
" .set dcycle, (\\dtime / 3) \n\t"
" .set dwork, (dcycle * 3) \n\t"
" .set drem, (\\dtime - dwork) \n\t"
" .rept (drem) \n\t"
" nop \n\t"
" .endr \n\t"
" .if dcycle >= 0 \n\t"
" mov \\reg, #dcycle \n\t"
" loop_\\#: \n\t"
" sub \\reg, #1 \n\t"
" bne loop_\\# \n\t"
" .endif \n\t"
" .endif \n\t"
" .endm \n\t"
" fl_delay X, r0 \n\t"
:
:
: "r0");
}
void loop() { delay<5>(); }
Which means that for the delay<5>(); instantiation, there's a .set X, 5 that gets emitted just before the rest of my asm block. It'll be ugly (my delays could be up to 120 cycles) but it will work.
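For what it's worth, GCC also has a generic %c operand modifier that prints a constant operand without the target's immediate-prefix punctuation, which might avoid the switch hack entirely; a sketch of the idea (untested on this target, so treat it as an assumption):

template<int X> void delay() {
    asm __volatile__(
        // ... same .macro definition as above ...
        " fl_delay %c[X], r0 \n\t" // %c makes gcc emit a bare 13 instead of #13
        :
        : [X] "I" (X)
        : "r0");
}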

GCC: Error: junk `:0x+57f120' after expression

Following my previous question here, I now get a "junk after expression" error when the compiler tries to compile the following code:
u32 jmpAdd = BW::BWFXN_SpendRepairReturnAddress;
//BW::BWFXN_SpendRepairReturnAddress has the following value: 0x0046700D
__asm__ __volatile__
(
"movl ds:0x+57f120(, %eax, 4), %ecx\n\t"
"jmp %0":"=m"(jmpAdd)
);
GCC gives me the following errors:
Error: junk ':0x+57f120' after expression
Error: invalid instruction suffix for 'jmp'
How can I correct those errors, please?
EDIT: the original code was the following (I converted it using ta2as v0.8.2):
__asm
{
mov ecx, dword ptr ds:[eax*4+0x57f120]
jmp BW::BWFXN_SpendRepairReturnAddress
}
Change it to the following and it should compile:
__asm__ __volatile__
(
"movl %%ds:0x57f120(, %%eax, 4), %%ecx\n\t"
"jmp *%0" : : "m"(jmpAdd)
);
Unfortunately, after looking at the source you're probably trying to convert, it won't actually work: GCC doesn't support naked functions on x86 targets.

efficient way to divide ignoring rest

There are two ways I found to get a whole number from a division in C++.
The question is: which way is more efficient (faster)?
First way:
Quotient = value1 / value2; // normal division; the result may have a fractional part
floor(Quotient); // rounding the number down to the nearest integer
Second way:
Rest = value1 % value2; // getting the Rest with the modulus % operator
Quotient = (value1-Rest) / value2; // subtracting the Rest so the division divides evenly
Also, please demonstrate how to find out which method is faster.
If you're dealing with integers, then the usual way is
Quotient = value1 / value2;
That's it. The result is already an integer. No need to use the floor(Quotient); statement. It has no effect anyway. You would want to use Quotient = floor(Quotient); if it was needed.
If you have floating point numbers, then the second method won't work at all, as % is only defined for integers. But what does it mean to get a whole number from a division of real numbers? What integer do you get when you divide 8.5 by 3.2? Does it ever make sense to ask this question?
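(For completeness: the floating-point counterpart of % is std::fmod, and truncation toward zero is std::trunc; a small sketch of my own using the numbers above:)

#include <cmath>
#include <iostream>

int main()
{
    double v1 = 8.5, v2 = 3.2;
    double rest = std::fmod(v1, v2); // 2.1: the floating-point "Rest"
    double quot = std::trunc(v1 / v2); // 2.0: the whole-number quotient
    std::cout << quot << " " << rest << "\n";
    return 0;
}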
As a side note, the thing you call 'Rest' is normally called the 'remainder'.
Use this program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef DIV_BY_DIV
#define DIV(a, b) ((a) / (b))
#else
#define DIV(a, b) (((a) - ((a) % (b))) / (b))
#endif
#ifndef ITERS
#define ITERS 1000
#endif
int main()
{
int i, a, b;
srand(time(NULL));
a = rand();
b = rand();
for (i = 0; i < ITERS; i++)
a = DIV(a, b);
return 0;
}
You can time the execution:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.010s
user 0m0.012s
sys 0m0.000s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c && time ./a.out
real 0m0.019s
user 0m0.020s
sys 0m0.000s
Or you can look at the assembly output:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c -S; mv 1.s 1_div.s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s 1_modulus.s
mihai@keldon:/tmp$ diff 1_div.s 1_modulus.s
24a25,32
> movl %edx, %eax
> movl 24(%esp), %edx
> movl %edx, %ecx
> subl %eax, %ecx
> movl %ecx, %eax
> movl %eax, %edx
> sarl $31, %edx
> idivl 20(%esp)
As you see, doing only the division is faster.
Edited to fix error in code, formatting and wrong diff.
Further edit (explaining the assembly diff): In the second case, when doing the modulus first, the assembly shows that two idivl operations are needed: one to get the result of % and one for the actual division. The above diff shows the subtraction and the second division, as the first one is exactly the same in both versions.
Edit: more relevant timing information:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.384s
user 0m0.360s
sys 0m0.004s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 1.c && time ./a.out
real 0m0.706s
user 0m0.696s
sys 0m0.004s
Hope it helps.
Edit: diff between assembly with -O0 and without.
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S -O0; mv 1.s O0.s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s noO.s
mihai@keldon:/tmp$ diff noO.s O0.s
Since the default optimization level of gcc is -O0 (see this article explaining optimization levels in gcc), the result was expected.
Edit: if you compile with -O3, as one of the comments suggested, you'll get the same assembly; at that level of optimization, both alternatives are the same.