Is the GNU compiler's -fexpensive-optimizations flag connected to -ffp-contract?

In the application I am working on, a colleague noticed that some results change slightly when compiling with -march=native (on a Skylake CPU) versus the less aggressive -march=x86-64. I therefore tried to come up with an MWE to clarify the compiler behaviour for myself (of course the real case is far more complex, hence the MWE looks contrived, but that's not the point). Before going into details, let me state my question, which can be split in two parts:
Part 1: How is -fexpensive-optimizations related to -ffp-contract? Does the former in some sense imply the latter?
Part 2: Since the default for floating-point expression contraction is -ffp-contract=fast, independently of the optimisation level, why does changing it to off with -O2 fix the discrepancy in the MWE below? This is a way of rephrasing my comment about the manual description of this flag (see below).
Bonus question: Why, in my MWE, does reducing the std::vector to one entry change the decimal representation, while leaving the std::vector out makes the discrepancy go away entirely?
Consider the following MWE:
#include <iostream>
#include <iomanip>
#include <array>
#include <vector>
inline double sqr(std::array<double, 4> x) {
    return x[0] * x[0] - x[1] * x[1] - x[2] * x[2] - x[3] * x[3];
}

int main() {
    //std::vector<std::array<double, 4>> v = {{0.6, -0.3, -0.5, -0.5}}; // <-- for bonus question
    std::vector<std::array<double, 4>> v = {{0.6, -0.3, -0.5, -0.5}, {0.6, -0.3, -0.5, -0.5}};
    for (const auto& a : v)
        std::cout << std::setprecision(18) << std::fixed << std::showpos
                  << sqr(a) << " -> " << std::hexfloat << sqr(a) << std::defaultfloat << '\n';
}
If compiled with optimisations up to -O1, using g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, the results are the same for both target architectures, while they differ starting at -O2:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O1 -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
1,2c1,2
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
---
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
I then went through all the -O2 options listed in the manual, and it turned out that -fexpensive-optimizations is what triggers the difference:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O1 -fexpensive-optimizations -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
1,2c1,2
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
---
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -fno-expensive-optimizations -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
$
I then tried to understand the meaning of the option and found this SO answer, which did not shed much light on the situation. After more research it turned out that the -ffp-contract option also makes the discrepancy go away if explicitly turned off. In particular, FMA operations seem to be responsible for the differences:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -mfma -std=c++17 -o mwe_${ARCH} mwe.cpp; done; sdiff <(./mwe_native) <(./mwe_x86-64)
-0.230000000000000038 -> -0x1.d70a3d70a3d72p-3 -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
-0.230000000000000038 -> -0x1.d70a3d70a3d72p-3 -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -ffp-contract=off -std=c++17 -o mwe_${ARCH} mwe.cpp; done; sdiff <(./mwe_native) <(./mwe_x86-64)
-0.229999999999999982 -> -0x1.d70a3d70a3d7p-3 -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
-0.229999999999999982 -> -0x1.d70a3d70a3d7p-3 -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
In the previous output, note how -mfma makes the x86-64 result identical to the native one, while -ffp-contract=off makes the native result identical to the x86-64 one (the difference is in the last digit of the hex representation).
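To see what contraction changes in isolation, here is a minimal sketch of mine (not from the original post) that evaluates the same expression twice: once with every operation rounded separately, and once with each multiply-subtract pair fused via std::fma. Which association GCC actually picks is not guaranteed; the point is only that fusing skips intermediate roundings and can flip the last bit:
#include <cmath>
#include <cstdio>

int main() {
    const double x0 = 0.6, x1 = -0.3, x2 = -0.5, x3 = -0.5;
    // Plain evaluation: every product and every difference is rounded to double.
    double plain = x0 * x0 - x1 * x1 - x2 * x2 - x3 * x3;
    // One plausible contracted evaluation: each multiply-subtract becomes a
    // single fused multiply-add, so the products are never rounded on their own.
    double fused = std::fma(-x3, x3, std::fma(-x2, x2, std::fma(-x1, x1, x0 * x0)));
    std::printf("plain = %a\nfused = %a\n", plain, fused);
}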
The manual for -ffp-contract reads:
-ffp-contract=style
-ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as
forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression
contraction if allowed by the language standard. This is currently not implemented and treated equal to -ffp-contract=off.
The default is -ffp-contract=fast.
I assume that with -march=x86-64 the manual's "if the target has native support for them" clause does not apply, so FMA instructions are not used. Please correct me if this is wrong.
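One way to check that assumption (a suggestion of mine, not part of the original post) is to ask GCC which target options each -march value enables; -mfma should be reported as [disabled] for plain x86-64 and [enabled] for a native Skylake build:
$ g++ -march=x86-64 -Q --help=target | grep -i fma
$ g++ -march=native -Q --help=target | grep -i fma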
NOTE: SO suggested having a look at this question, but I doubt there is any UB in the code above. I should also say that I reproduced the discrepancy with clang-14 (which changed the default to -ffp-contract=on), while with clang-13 (whose default is -ffp-contract=off) no discrepancy appears.

Related

How to display AVX registers as doubles with GDB?

I was trying to use AVX in a Mandelbrot program and it's not working right.
I tried debugging it, but GDB refuses to show me the floating-point values in the YMM registers. Here's a minimal example.
t.c
#include <stdio.h>

extern void loadnum(void);
extern double input[4];
extern double output[4];

int main(void)
{
    /*
    input[0] = 1.1;
    input[1] = 2.2;
    input[2] = 3.3;
    input[3] = 3.14159;
    */
    printf("%f %f %f %f\n", input[0], input[1], input[2], input[3]);
    loadnum();
    printf("%f %f %f %f\n", output[0], output[1], output[2], output[3]);
    return 0;
}
l.asm
section .data
global input
global output
align 64
input  dq 1.1, 2.2, 3.3, 3.14159
output dq 0, 0, 0, 0

section .text
global loadnum
loadnum:
    vmovapd ymm0, [input]
    vmovapd [output], ymm0
    ret
how it's compiled
OBJECTS = t.o l.o
CFLAGS = -c -O2 -g -no-pie -mavx -Wall

t: $(OBJECTS)
	gcc -g -no-pie $(OBJECTS) -o t

t.o: t.c
	gcc $(CFLAGS) t.c

l.o: l.asm
	nasm -felf64 -gdwarf l.asm
The output is
> 1.100000 2.200000 3.300000 3.141590
> 1.100000 2.200000 3.300000 3.141590
which shows it's loading and storing these doubles as expected, but in gdb it shows
> gdb t (followed by some boilerplate)
> Reading symbols from t...
> (gdb) b loadnum
> Breakpoint 1 at 0x4011b0: file l.asm, line 15.
> (gdb) run
> Starting program: /somedir/t
> 1.100000 2.200000 3.300000 3.141590
> Breakpoint 1, loadnum () at l.asm:15
> 15 vmovapd ymm0, [input]
> (gdb) n
> 16 vmovapd [output],ymm0
> (gdb)
then I say
> (gdb) info all-registers
and this shows up.
> ymm0 (blah blah) v4_double = {0x1, 0x2, 0x3, 0x3}
when I expected it to show
> ymm0 (blah blah) v4_double = {1.100000 2.200000 3.300000 3.141590}
None of the other fields show anything like that, unless you want to parse the floating point bits
> v4_int64 = {0x3ff199999999999a, 0x400199999999999a, 0x400a666666666666, 0x400921f9f01b866e}
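(Aside, my addition rather than part of the original question: decoding one of those bit patterns by hand is short.)
// Reinterpret an IEEE-754 bit pattern from the gdb dump as a double.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    std::uint64_t bits = 0x3ff199999999999aULL; // first lane of v4_int64
    double d;
    std::memcpy(&d, &bits, sizeof d);           // bit-for-bit reinterpretation
    std::printf("%f\n", d);                     // prints 1.100000
}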
How can I fix this?
p $ymm0.v4_double (the print command) defaults to decimal formatting.
Use p /whatever for other formats, like p /x $ymm0.v4_int64 to see hex for the bit-patterns. help p for more.
display $ymm0.v4_double can work as a stand-in for layout reg + tui reg vec being buggy/broken in some versions, and always an unusable mess of different formats for registers as wide and numerous as ymm0-15. It takes the same options as print, and prints before every prompt. (undisplay 1 removes one of the expressions you've set up; plain undisplay removes them all.)
It can get cluttered in TUI mode (layout asm, or layout reg + layout next to see integer regs and disassembly) if you want to track more than a couple of registers, so you might prefer non-TUI mode: either don't use layout in the first place, or leave it with tui disable.
(When debugging hand-written asm, I almost always want to look at disassembly, not source; but maybe for a complicated algorithm I'd sometimes want to see source with comments as a reminder of what the values should be/mean at a certain point.)
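A minimal session sketch (my own, using only standard gdb commands):
(gdb) b loadnum
(gdb) run
(gdb) display $ymm0.v4_double
(gdb) display /x $ymm0.v4_int64
(gdb) ni
After each ni, both display expressions print automatically, giving the decoded doubles and the raw bit patterns side by side.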

understand the performance diff of std::optional on different platforms

I was trying to understand the overhead of using std::optional in interface design: whether it's worth using std::optional in a single interface, or a combination of "isNull and getX". The main concern is the overhead of std::optional, so I crafted this simple program to test it out.
The code snippet and test results are summarized in this gist https://gist.github.com/shawncao/fa801d300c4a0a2826240fd9f5961db2
Surprisingly, there are two things I don't understand, and I would like to hear from experts:
1. On Linux with gcc (8 and 9 both tested), I consistently saw std::optional do better than the direct condition, which I don't understand.
2. On my Mac laptop, the program executes far faster than on Linux (12 ms vs. >60 ms), and I don't see why (is it the Apple LLVM implementation?). Judging from the CPU info, I don't think the Mac is faster; maybe I missed some other information to check.
Any pointers will be appreciated.
Basically, I want to understand the runtime difference between these two platforms; it may be processor-related or compiler-related, I'm not sure. A repro app is attached in the gist. Also, I don't quite understand why std::optional is faster on Linux with GCC.
Code snippet (copied from the gist, as suggested):
inline bool f1ni(int i) {
    return i % 2 == 0;
}

TEST(OptionalTest, TestOptionalPerf) {
    auto f1 = [](int i) -> int {
        return i + 1;
    };
    auto f2 = [](int i) -> std::optional<int> {
        if (i % 2 == 0) {
            return {};
        }
        return i + 1;
    };

    // run 100M cycles
    constexpr auto cycles = 100000000;

    nebula::common::Evidence::Duration duration;
    long sum2 = 0;
    for (int i = 0; i < cycles; ++i) {
        auto x = f2(i);
        if (x) {
            sum2 += x.value();
        }
    }
    LOG(INFO) << fmt::format("optional approach: sum={0}, time={1}", sum2, duration.elapsedMs());

    duration.reset();
    long sum1 = 0;
    for (int i = 0; i < cycles; ++i) {
        if (i % 2 != 0) {
            sum1 += f1(i);
        }
    }
    LOG(INFO) << fmt::format("special value approach: sum={0}, time={1}", sum1, duration.elapsedMs());
    EXPECT_EQ(sum1, sum2);
}
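For readers without the gist's dependencies, here is a self-contained repro sketch of mine: nebula::common::Evidence::Duration, LOG, and the test macro are replaced by std::chrono and printf (an assumption; the original helpers live in the gist's codebase):
#include <chrono>
#include <cstdio>
#include <optional>

int main() {
    constexpr int cycles = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    long sum2 = 0;
    for (int i = 0; i < cycles; ++i) {
        std::optional<int> x;
        if (i % 2 != 0) x = i + 1;      // same logic as the gist's f2
        if (x) sum2 += *x;
    }
    auto t1 = std::chrono::steady_clock::now();

    long sum1 = 0;
    for (int i = 0; i < cycles; ++i) {
        if (i % 2 != 0) sum1 += i + 1;  // same logic as f1 plus the guard
    }
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("optional approach:      sum=%ld time=%lld ms\n", sum2, ms(t1 - t0));
    std::printf("special value approach: sum=%ld time=%lld ms\n", sum1, ms(t2 - t1));
}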
Laptop: Mac, macOS Mojave
Processor: 2.6 GHz Intel Core i7
/usr/bin/g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
[Test Output]
I0812 09:36:20.678683 348009920 TestCommon.cpp:354] optional approach: sum=2500000050000000, time=13
I0812 09:36:20.692001 348009920 TestCommon.cpp:363] special value approach: sum=2500000050000000, time=12
Linux: Ubuntu 18.04 LTS
36 cores; one of them:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping : 4
microcode : 0x100014a
cpu MHz : 1685.329
cache size : 25344 KB
physical id : 0
siblings : 36
core id : 0
cpu cores : 18
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
/usr/bin/g++ --version
g++ (Ubuntu 9.1.0-2ubuntu2~18.04) 9.1.0
Copyright (C) 2019 Free Software Foundation, Inc.
[Test Output]
I0812 16:42:07.391607 15080 TestCommon.cpp:354] optional approach: sum=2500000050000000, time=59
I0812 16:42:07.473335 15080 TestCommon.cpp:363] special value approach: sum=2500000050000000, time=81
===== build flags =====
[build commands capture for Mac]
/usr/bin/g++ -I/Users/shawncao/nebula/include -I/Users/shawncao/nebula/src -I/usr/local/Cellar/boost/1.70.0/include -I/usr/local/Cellar/folly/2019.08.05.00/include -I/Users/shawncao/nebula/build/cuckoofilter/src/cuckoofilter/src -I/Users/shawncao/nebula/build/fmtproj-prefix/src/fmtproj/include -I/Users/shawncao/nebula/src/service/gen/nebula -isystem /usr/local/include -isystem /Users/shawncao/nebula/build/gflagsp-prefix/src/gflagsp-build/include -isystem /Users/shawncao/nebula/build/glogp-prefix/src/glogp-build -isystem /Users/shawncao/nebula/build/roaringproj-prefix/src/roaringproj/include -isystem /Users/shawncao/nebula/build/roaringproj-prefix/src/roaringproj/cpp -isystem /Users/shawncao/nebula/build/yomm2-prefix/src/yomm2/include -isystem /Users/shawncao/nebula/build/xxhash-prefix/src/xxhash -isystem /Users/shawncao/nebula/build/bloom/src/bloom -isystem /usr/local/Cellar/openssl/1.0.2s/include -Wall -Wextra -Werror -Wno-error=nullability-completeness -Wno-error=sign-compare -Wno-error=unknown-warning-option -O3 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -std=gnu++1z -o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -c /Users/shawncao/nebula/src/common/test/TestExts.cpp
[100%] Linking CXX executable CommonTests
/Applications/CMake.app/Contents/bin/cmake -E cmake_link_script CMakeFiles/CommonTests.dir/link.txt --verbose=1
/usr/bin/g++ -Wall -Wextra -Werror -Wno-error=nullability-completeness -Wno-error=sign-compare -Wno-error=unknown-warning-option -O3 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names CMakeFiles/CommonTests.dir/src/common/test/TestCommon.cpp.o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -o CommonTests -Wl,-rpath,/Users/shawncao/nebula/build/bloom/src/bloom-build/lib libNCommon.a yomm2-prefix/src/yomm2-build/src/libyomm2.a googletest-prefix/src/googletest-build/lib/libgtest.a googletest-prefix/src/googletest-build/lib/libgtest_main.a xxhash-prefix/src/xxhash/libxxhash.a roaringproj-prefix/src/roaringproj-build/src/libroaring.a glogp-prefix/src/glogp-build/libglog.a gflagsp-prefix/src/gflagsp-build/lib/libgflags.a bloom/src/bloom-build/lib/libbf.dylib libcflib.a /usr/local/Cellar/openssl/1.0.2s/lib/libssl.a fmtproj-prefix/src/fmtproj-build/libfmt.a
[build commands capture for Linux]
/usr/bin/g++ -Wall -Wextra -Werror -lstdc++ -Wl,--no-as-needed -no-pie -ldl -lunwind -I/usr/include/ -I/usr/local/include -L/usr/local/lib -L/usr/lib -O3 -s CMakeFiles/CommonTests.dir/src/common/test/TestCommon.cpp.o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -o CommonTests -Wl,-rpath,/home/shawncao/nebula/build/bloom/src/bloom-build/lib libNCommon.a yomm2-prefix/src/yomm2-build/src/libyomm2.a /usr/local/lib/libgtest.a /usr/local/lib/libgtest_main.a xxhash-prefix/src/xxhash/libxxhash.a roaringproj-prefix/src/roaringproj-build/src/libroaring.a /usr/local/lib/libglog.a /usr/local/lib/libgflags.a bloom/src/bloom-build/lib/libbf.so libcflib.a /usr/local/lib/libssl.a /usr/local/lib/libcrypto.a fmtproj-prefix/src/fmtproj-build/libfmt.a -lpthread

long double increment operator not working on large numbers

I'm converting a C++ system from Solaris (Sun box and Sun compiler) to Linux (Intel box and gcc compiler). I'm running into several problems when dealing with large "long double" values. (We use "long double" because of some very, very large integers, not for any decimal precision.) It manifests itself in several weird ways, but I've simplified it to the following program. It tries to increment a number but doesn't. I don't get any compile or runtime errors; it just doesn't increment the number.
I've also tried a few compiler switches at random (-malign-double and -m128bit-long-double, in various combinations of on and off), but it made no difference.
I've run this in gdb too and gdb's "print" command shows the same value as the cout statement.
Anyone seen this behavior?
Thanks
compile commands
$ /usr/bin/c++ --version
c++ (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)
$ /usr/bin/c++ -g -Wall -fPIC -c SimpleLongDoubleTest.C -o SimpleLongDoubleTest.o
$ /usr/bin/c++ -g SimpleLongDoubleTest.o -o SimpleLongDoubleTest
$ ./SimpleLongDoubleTest
Maximum value for long double: 1.18973e+4932
digits 10 = 18
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
SimpleLongDoubleTest.C
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <limits>
#include <iomanip>

int main( int argc, char* argv[] )
{
    std::cout << "Maximum value for long double: "
              << std::numeric_limits<long double>::max() << '\n';
    std::cout << "digits 10 = " << std::numeric_limits<long double>::digits10
              << std::endl;

    // this doesn't work (there might be smaller numbers that also don't work...
    // That is, I'm not sure of the exact number between this and the number defined
    // below where things break)
    long double ld = 1268035319515045691392.0L;

    // but this or any smaller number works (there might be larger numbers that
    // work... That is, I'm not sure of the exact number between this and the number
    // defined above where things break)
    //long double ld = 268035319515045691392.0L;

    for ( int i = 0 ; i < 5 ; i++ )
    {
        ld++;
        std::cout << std::setiosflags( std::ios::fixed )
                  << std::setprecision( 0 )
                  << "ld = " << ld
                  << std::endl;
    }
}
This is expected behavior. float, double, long double etc. are internally represented as 1.xxxxx * 2^(exp - bias), where xxxxx is an N-digit binary fraction: N = 23 for float, 52 for double, and possibly 64 for long double. Once a number grows larger than 2^N, it is no longer possible to add 1 to it; you can only add multiples of 2^(e - N), where e is the number's binary exponent.
It's also possible that your architecture treats long double the same as double (even though x86 can use 80-bit extended precision internally).
See also the Wikipedia article; a 128-bit long double is the exception rather than the norm (SPARC supports it).
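A minimal demonstration of that cliff (my sketch, assuming x87 80-bit long double with a 64-bit significand):
#include <cmath>
#include <iostream>
#include <limits>

int main() {
    long double ld = 1268035319515045691392.0L;  // ~1.27e21, well above 2^64
    std::cout << "significand bits: "
              << std::numeric_limits<long double>::digits << '\n';  // 64 on x87
    std::cout << std::boolalpha
              << "ld + 1 == ld ? " << (ld + 1.0L == ld) << '\n';    // true: the +1 is rounded away
    // Gap between adjacent representable long doubles at this magnitude:
    std::cout << "ulp = "
              << std::nextafter(ld, std::numeric_limits<long double>::infinity()) - ld
              << '\n';
}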

a C++ program returns different results in two IDEs

I wrote the following C++ program in Code::Blocks, and the result was 9183. Then I wrote it in Eclipse, and after running it returned 9220. Both use MinGW. The correct result is 9183. What's wrong with this code?
Thanks.
source code:
#include <iostream>
#include <set>
#include <cmath>

int main()
{
    using namespace std;
    set<double> set_1;
    for (int a = 2; a <= 100; a++)
    {
        for (int b = 2; b <= 100; b++)
        {
            set_1.insert(pow(double(a), b));
        }
    }
    cout << set_1.size();
    return 0;
}
You are probably seeing precision errors due to CodeBlocks compiling in 32-bit mode and Eclipse compiling in 64-bit mode:
$ g++ -m32 test.cpp
$ ./a.out
9183
$ g++ -m64 test.cpp
$ ./a.out
9220
If I cast both arguments to double I get what you would expect:
pow(static_cast<double>(a), static_cast<double>(b))
The difference appears to be due to whether the floating-point operations use 53-bit or 64-bit precision. If you add the following two lines in front of the loop (assuming Intel architecture), it will use 53-bit precision and give the 9220 result when compiled as a 32-bit application:
uint16_t precision = 0x27f;
asm("fldcw %0" : : "m" (*&precision));
It is bits 8 and 9 of the x87 FPU control word that select this precision. The above sets those two bits to 10 (53-bit). Setting them to 11 (value 0x37f) gives 64-bit precision. And, just for completeness, if you set the bits to 00 (value 0x7f), the size is printed as 9230.
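A self-contained sketch of that experiment (mine, not the answerer's; it assumes x86 with GCC inline asm, and the control word only matters for builds that actually use x87 math, e.g. -m32 without SSE):
#include <cmath>
#include <cstdint>
#include <iostream>
#include <set>

// Load a new x87 control word; bits 8-9 are the precision-control field.
static void set_x87_control_word(std::uint16_t cw) {
    asm volatile("fldcw %0" : : "m"(cw));
}

int main() {
    // 0x27f selects 53-bit (double) precision, 0x37f selects 64-bit (extended).
    for (std::uint16_t cw : {std::uint16_t(0x27f), std::uint16_t(0x37f)}) {
        set_x87_control_word(cw);
        std::set<double> s;
        for (int a = 2; a <= 100; ++a)
            for (int b = 2; b <= 100; ++b)
                s.insert(std::pow(double(a), b));
        std::cout << std::hex << "0x" << cw << std::dec << " -> " << s.size() << '\n';
    }
}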
Actually, you're not really supposed to rely on == (or, technically, x <= y && y <= x) for doubles anyway. So this code produces implementation-dependent results (not strictly speaking UB, per the comments, but that's what I meant :)).

efficient way to divide ignoring rest

There are two ways I found to get a whole number from a division in C++.
The question is: which way is more efficient (speedier)?
First way:
Quotient = value1 / value2; // normal division, giving a fractional number
floor(Quotient); // rounding the number down to the first integer
Second way:
Rest = value1 % value2; // getting the rest with the modulus (%) operator
Quotient = (value1 - Rest) / value2; // subtracting the rest so the division comes out even
Also, please demonstrate how to find out which method is faster.
If you're dealing with integers, then the usual way is
Quotient = value1 / value2;
That's it. The result is already an integer, so there is no need for the floor(Quotient); statement; it has no effect anyway. (You would want Quotient = floor(Quotient); if it were needed.)
If you have floating point numbers, then the second method won't work at all, as % is only defined for integers. But what does it mean to get a whole number from a division of real numbers? What integer do you get when you divide 8.5 by 3.2? Does it ever make sense to ask this question?
As a side note, the thing you call 'Rest' is normally called the 'remainder'.
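One caveat worth a quick demo (my addition): for negative operands, integer / truncates toward zero, which is not the same as floor():
#include <cmath>
#include <iostream>

int main() {
    std::cout << (-7) / 2 << '\n';               // -3: integer division truncates toward zero
    std::cout << std::floor(-7.0 / 2.0) << '\n'; // -4: floor rounds toward negative infinity
}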
Use this program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef DIV_BY_DIV
#define DIV(a, b) ((a) / (b))
#else
#define DIV(a, b) (((a) - ((a) % (b))) / (b))
#endif

#ifndef ITERS
#define ITERS 1000
#endif

int main()
{
    int i, a, b;

    srand(time(NULL));
    a = rand();
    b = rand();

    for (i = 0; i < ITERS; i++)
        a = DIV(a, b);

    return 0;
}
You can time the execution:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.010s
user 0m0.012s
sys 0m0.000s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c && time ./a.out
real 0m0.019s
user 0m0.020s
sys 0m0.000s
Or, you can look at the assembly output:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c -S; mv 1.s 1_div.s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s 1_modulus.s
mihai@keldon:/tmp$ diff 1_div.s 1_modulus.s
24a25,32
> movl %edx, %eax
> movl 24(%esp), %edx
> movl %edx, %ecx
> subl %eax, %ecx
> movl %ecx, %eax
> movl %eax, %edx
> sarl $31, %edx
> idivl 20(%esp)
As you can see, doing only the division is faster.
Edited to fix error in code, formatting and wrong diff.
More edit (explaining the assembly diff): in the second case, when doing the modulus first, the assembly shows that two idivl instructions are needed: one to get the result of % and one for the actual division. The diff above shows the subtraction and the second division, as the first one is exactly the same in both versions.
Edit: more relevant timing information:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.384s
user 0m0.360s
sys 0m0.004s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 1.c && time ./a.out
real 0m0.706s
user 0m0.696s
sys 0m0.004s
Hope it helps.
Edit: diff between assembly with -O0 and without.
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S -O0; mv 1.s O0.s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s noO.s
mihai@keldon:/tmp$ diff noO.s O0.s
Since the default optimization level of gcc is -O0 (see this article explaining optimization levels in gcc), the result was expected.
Edit: if you compile with -O3, as one of the comments suggested, you'll get the same assembly; at that level of optimization, both alternatives are identical.
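To check that claim yourself, the same recipe as above works with -O3 (the diff should come back empty):
$ gcc -Wall -Wextra -DITERS=1000000 1.c -S -O3; mv 1.s O3_modulus.s
$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c -S -O3; mv 1.s O3_div.s
$ diff O3_div.s O3_modulus.s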