32 Bit vs 64 Bit: Massive Runtime Difference - C++

I am considering the following C++ program:
#include <iostream>
#include <limits>

int main(int argc, char **argv) {
    unsigned int sum = 0;
    for (unsigned int i = 1; i < std::numeric_limits<unsigned int>::max(); ++i) {
        double f = static_cast<double>(i);
        unsigned int t = static_cast<unsigned int>(f);
        sum += (t % 2);
    }
    std::cout << sum << std::endl;
    return 0;
}
I use the gcc/g++ compiler; g++ -v gives gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux).
I am running openSUSE 12.3 (x86_64) on an Intel(R) Core(TM) i7-3520M CPU.
Running
g++ -O3 test.C -o test_64_opt
g++ -O0 test.C -o test_64_no_opt
g++ -m32 -O3 test.C -o test_32_opt
g++ -m32 -O0 test.C -o test_32_no_opt
time ./test_64_opt
time ./test_64_no_opt
time ./test_32_opt
time ./test_32_no_opt
yields
2147483647
real 0m4.920s
user 0m4.904s
sys 0m0.001s
2147483647
real 0m16.918s
user 0m16.851s
sys 0m0.019s
2147483647
real 0m37.422s
user 0m37.308s
sys 0m0.000s
2147483647
real 0m57.973s
user 0m57.790s
sys 0m0.011s
Using float instead of double, the optimized 64-bit variant even finishes in 2.4 seconds, while the other running times stay roughly the same. However, with float I get different outputs depending on optimization, probably due to the higher processor-internal precision.
I know 64-bit code may have faster math, but we are seeing a factor of 7 (and nearly 15 with float) here.
I would appreciate an explanation of these running time discrepancies.

The problem isn't 32-bit vs 64-bit; it's the lack of SSE and SSE2. When compiling for 64-bit, gcc assumes it can use SSE and SSE2, since all x86_64 processors have them.
Compile your 32-bit version with -msse -msse2 and the runtime difference nearly disappears.
My benchmark results for completeness:
-O3 -m32 -msse -msse2 4.678s
-O3 (64bit) 4.524s
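For what it's worth, the expensive step in that loop is the double-to-unsigned truncation. Compiling a minimal kernel to assembly shows the difference (a sketch of mine, not part of the original answer; the function name is just illustrative):
// trunc_to_uint.cpp -- hypothetical micro-kernel for inspecting codegen.
//   g++ -m32 -O2 -S trunc_to_uint.cpp
//       x87 path: switches the FPU control word (fnstcw/fldcw) around fistp
//   g++ -m32 -O2 -msse -msse2 -mfpmath=sse -S trunc_to_uint.cpp
//       SSE2 path: cvttsd2si-based truncation, no control-word switching
unsigned int trunc_to_uint(double f) {
    return static_cast<unsigned int>(f);
}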

Related

inconsistent dead code elimination of mathematical functions

In the following example, the elimination of unused code is performed for sin() but not for pow(). I was wondering why. I tried gcc and clang.
Here are some more details about this example, which is otherwise mostly code.
The code contains a loop over an integer from which a floating-point number is computed. The number is passed to a mathematical function, either pow() or sin(), depending on which macros are defined. If the macro USE is defined, the sum of all returned values is accumulated in another variable, which is then copied to a volatile variable to prevent the optimizer from removing the code entirely.
// main.cpp
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    std::chrono::steady_clock clock;
    auto start = clock.now();
    double s = 0;
    const size_t count = 1 << 27;
    for (size_t i = 0; i < count; ++i) {
        const double x = double(i) / count;
        double a = 0;
#ifdef POW
        a = std::pow(x, 0.5);
#endif
#ifdef SIN
        a = std::sin(x);
#endif
#ifdef USE
        s += a;
#endif
    }
    auto stop = clock.now();
    printf(
        "%.0f ms\n", std::chrono::duration<double>(stop - start).count() * 1e3);
    volatile double a = s;
    (void)a;
}
As seen from the output below, the computation of sin() is completely eliminated if the result is unused. This is not the case for pow(), since the execution time does not decrease.
I normally observe this when the call may return a NaN (log(-x) but not log(+x)).
# g++ 10.2.0
g++ -std=c++14 -O3 -DPOW main.cpp -o main && ./main
3064 ms
g++ -std=c++14 -O3 -DPOW -DUSE main.cpp -o main && ./main
3172 ms
g++ -std=c++14 -O3 -DSIN main.cpp -o main && ./main
0 ms
g++ -std=c++14 -O3 -DSIN -DUSE main.cpp -o main && ./main
1391 ms
# clang++ 11.0.1
clang++ -std=c++14 -O3 -DPOW main.cpp -o main && ./main
3288 ms
clang++ -std=c++14 -O3 -DPOW -DUSE main.cpp -o main && ./main
3351 ms
clang++ -std=c++14 -O3 -DSIN main.cpp -o main && ./main
177 ms
clang++ -std=c++14 -O3 -DSIN -DUSE main.cpp -o main && ./main
1524 ms
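A plausible explanation (my assumption, not confirmed in this excerpt) is that pow() may set errno for out-of-domain arguments, which counts as an observable side effect and can block dead-code elimination. If so, compiling with -fno-math-errno (a real GCC/Clang flag; no timing is claimed here) should let the unused pow() loop be eliminated as well:
g++ -std=c++14 -O3 -fno-math-errno -DPOW main.cpp -o main && ./main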

How to measure the execution time of C math.h library functions?

By using the time.h header, I'm getting the execution time of sqrt() as 2 nanoseconds (with the gcc command in a Linux terminal) and 44 nanoseconds (with the g++ command in an Ubuntu terminal). Can anyone tell me any other method to measure the execution time of the math.h library functions?
Below is the code:
#include <time.h>
#include <stdio.h>
#include <math.h>

int main()
{
    time_t begin, end;       // time_t is a datatype to store time values.
    time(&begin);            // note time before execution
    for (int i = 0; i < 1000000000; i++)  // loop 10^9 times so the execution time comes out in nanoseconds
    {
        cbrt(9999999);       // calling the cube root function from the math library
    }
    time(&end);              // note time after execution
    double difference = difftime(end, begin);
    printf("time taken for function() %.2lf in Nanoseconds.\n", difference);
    printf(" cube root is :%f \t", cbrt(9999999));
    return 0;
}
OUTPUT:
by using gcc: time taken for function() 2.00 seconds.
cube root is :215.443462
by using g++: time taken for function() 44.00 in Nanoseconds.
cube root is :215.443462
Linux terminal result
Give or take the length of the prompt:
$ g++ t1.c
$ ./a.out
time taken for function() 44.00 in Nanoseconds.
cube root is :215.443462
$ gcc t1.c
$ ./a.out
time taken for function() 2.00 in Nanoseconds.
cube root is :215.443462
$
How to measure the execution time of C math.h library functions?
C compilers are often allowed to analyze well-known standard library functions and replace a fixed call like cbrt(9999999); with its constant result 215.443462.... Further, since dropping the call from the loop does not affect what the code does, the whole loop may be optimized out.
Use of volatile prevents much of this, as the compiler can no longer assume that replacing or removing the call has no effect.
for (int i = 0; i < 1000000000; i++) {
    // cbrt(9999999);
    volatile double x = 9999999.0;
    volatile double y = cbrt(x);
}
The granularity of time() is often only 1 second, and if the billion iterations complete in only a few seconds, consider more loops.
Code could use the following to factor out the loop overhead.
time_t begin, middle, end;
time(&begin);
for (int i = 0; i < 1000000000; i++) {
    volatile double x = 9999999.0;
    volatile double y = x;
}
time(&middle);
for (int i = 0; i < 1000000000; i++) {
    volatile double x = 9999999.0;
    volatile double y = cbrt(x);
}
time(&end);
double difference = difftime(end, middle) - difftime(middle, begin);
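If one-second granularity is still too coarse even with that subtraction, a monotonic sub-microsecond clock can be dropped in instead. A sketch of mine, assuming C++11 <chrono> (the answer above is plain C):
// Sketch: the same volatile-protected loop timed with std::chrono.
#include <chrono>
#include <cmath>
#include <cstdio>

int main(void) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000000; i++) {
        volatile double x = 9999999.0;
        volatile double y = std::cbrt(x);
        (void)y;
    }
    auto stop = std::chrono::steady_clock::now();
    // total nanoseconds / 10^9 calls = nanoseconds per call
    double ns_per_call =
        std::chrono::duration<double, std::nano>(stop - start).count() / 1e9;
    printf("cbrt: about %.2f ns per call\n", ns_per_call);
    return 0;
}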
Timing code is an art, and one part of the art is making sure that the compiler doesn't optimize your code away. For standard library functions, the compiler may well know what the function does and be able to evaluate it to a constant at compile time. In your example, the call cbrt(9999999); gives two opportunities for optimization. First, the value from cbrt() can be evaluated at compile time because the argument is a constant. Second, the return value is not used, and the standard function has no side effects, so the compiler can drop the call altogether. You can avoid those problems by capturing the result (for example, by evaluating the sum of the cube roots from 0 to one billion minus one, and printing that value after the timing code).
tm97.c
When I compiled your code, shorn of comments, I got:
$ cat tm97.c
#include <time.h>
#include <stdio.h>
#include <math.h>

int main(void)
{
    time_t begin, end;
    time(&begin);
    for (int i = 0; i < 1000000000; i++)
    {
        cbrt(9999999);
    }
    time(&end);
    double difference = difftime(end, begin);
    printf("time taken for function() %.2lf in Nanoseconds.\n", difference);
    printf(" cube root is :%f \t", cbrt(9999999));
    return 0;
}
$ make tm97
gcc -O3 -g -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm97.c -o tm97 -L../lib -lsoq
tm97.c: In function ‘main’:
tm97.c:11:9: error: statement with no effect [-Werror=unused-value]
   11 |         cbrt(9999999);
      |         ^~~~
cc1: all warnings being treated as errors
rmk: error code 1
$
I'm using GCC 9.3.0 on a 2017 MacBook Pro running macOS Mojave 10.14.6 with XCode 11.3.1 (11C504). (XCode 11.4 requires Catalina 10.15.2, but work hasn't got around to organizing support for that yet.) Interestingly, when the same code is compiled by g++, it compiles without warnings (errors):
$ ln -s tm97.c tm89.cpp
make tm89 SXXFLAGS=-std=c++17 CXX=g++
g++ -O3 -g -I../inc -std=c++17 -Wall -Wextra -Werror -L../lib tm89.cpp -lsoq -o tm89
$
I routinely use some timing code that is available in my SOQ (Stack Overflow Questions) repository on GitHub as files timer.c and timer.h in the src/libsoq sub-directory. The code is only compiled as C code in my library, so I created a simple wrapper header, timer2.h, so that the programs below could use #include "timer2.h" and it would work OK with both C and C++ compilations:
#ifndef TIMER2_H_INCLUDED
#define TIMER2_H_INCLUDED
#ifdef __cplusplus
extern "C" {
#endif
#include "timer.h"
#ifdef __cplusplus
}
#endif
#endif /* TIMER2_H_INCLUDED */
tm29.cpp and tm31.c
This code uses the sqrt() function for testing. It accumulates the sum of the square roots. It uses the timing code from timer.h/timer.c around your timing code — type Clock and functions clk_init(), clk_start(), clk_stop(), and clk_elapsed_us() to evaluate the elapsed time in microseconds between when the clock was started and last stopped.
The source code can be compiled by either a C compiler or a C++ compiler.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"

int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += sqrt(i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for sqrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of square roots from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm41.c and tm43.cpp
This code is almost identical to the previous code, but the tested function is the cbrt() (cube root) function.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"

int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += cbrt(i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for cbrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of cube roots from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm59.c and tm61.c
This code uses fabs() instead of either sqrt() or cbrt(). It's still a function call, but it might be inlined. It invokes the conversion from int to double explicitly; without that cast, GCC complains that it should be using the integer abs() function instead.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"

int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += fabs((double)i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for fabs() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of absolute values from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm73.cpp
This file uses the original code with my timing wrapper code too. The C version doesn't compile — the C++ version does:
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"

int main(void)
{
    time_t begin, end;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (int i = 0; i < 1000000000; i++)
    {
        cbrt(9999999);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for cbrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Cube root is: %f\n", cbrt(9999999));
    return 0;
}
Timing
Using a command timecmd, which reports the start and stop times and the PID of each program, as well as the timing code built into the various programs (it's a variant on the theme of the time command), I got the following results. (rmk is just an alternative implementation of make.)
$ for prog in tm29 tm31 tm41 tm43 tm59 tm61 tm73
> do rmk $prog && timecmd -ur -- $prog
> done
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm29.cpp -o tm29 -L../lib -lsoq
2020-03-28 08:47:50.040227 [PID 19076] tm29
Time taken for sqrt() is 1.00 nanoseconds (1.700296 ns).
Sum of square roots from 0 to 1000000000 is: 21081851051977.781250
2020-03-28 08:47:51.747494 [PID 19076; status 0x0000] - 1.707267s - tm29
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm31.c -o tm31 -L../lib -lsoq
2020-03-28 08:47:52.056021 [PID 19088] tm31
Time taken for sqrt() is 1.00 nanoseconds (1.679867 ns).
Sum of square roots from 0 to 1000000000 is: 21081851051977.781250
2020-03-28 08:47:53.742383 [PID 19088; status 0x0000] - 1.686362s - tm31
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm41.c -o tm41 -L../lib -lsoq
2020-03-28 08:47:53.908285 [PID 19099] tm41
Time taken for cbrt() is 7.00 nanoseconds (6.697999 ns).
Sum of cube roots from 0 to 1000000000 is: 749999999499.628418
2020-03-28 08:48:00.613357 [PID 19099; status 0x0000] - 6.705072s - tm41
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm43.cpp -o tm43 -L../lib -lsoq
2020-03-28 08:48:00.817975 [PID 19110] tm43
Time taken for cbrt() is 7.00 nanoseconds (6.614539 ns).
Sum of cube roots from 0 to 1000000000 is: 749999999499.628418
2020-03-28 08:48:07.438298 [PID 19110; status 0x0000] - 6.620323s - tm43
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm59.c -o tm59 -L../lib -lsoq
2020-03-28 08:48:07.598344 [PID 19121] tm59
Time taken for fabs() is 1.00 nanoseconds (1.114822 ns).
Sum of absolute values from 0 to 1000000000 is: 499999999067108992.000000
2020-03-28 08:48:08.718672 [PID 19121; status 0x0000] - 1.120328s - tm59
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm61.cpp -o tm61 -L../lib -lsoq
2020-03-28 08:48:08.918745 [PID 19132] tm61
Time taken for fabs() is 2.00 nanoseconds (1.117780 ns).
Sum of absolute values from 0 to 1000000000 is: 499999999067108992.000000
2020-03-28 08:48:10.042134 [PID 19132; status 0x0000] - 1.123389s - tm61
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm73.cpp -o tm73 -L../lib -lsoq
2020-03-28 08:48:10.236899 [PID 19143] tm73
Time taken for cbrt() is 0.00 nanoseconds (0.000004 ns).
Cube root is: 215.443462
2020-03-28 08:48:10.242322 [PID 19143; status 0x0000] - 0.005423s - tm73
$
I've run the programs many times; the times above are representative of what I got each time. There are a number of conclusions that can be drawn:
sqrt() (1.7 ns) is quicker than cbrt() (6.7 ns).
fabs() (1.1 ns) is quicker than sqrt() (1.7 ns).
However, fabs() gives a moderate approximation to the time taken with loop overhead and conversion from int to double.
When the result of cbrt() is not used, the compiler eliminates the loop.
When compiled with the C++ compiler, the code from the question removes the loop altogether, leaving only the calls to time() to be measured. The result printed by clk_elapsed_us() is the time taken to execute the code between clk_start() and clk_stop() in seconds with microsecond resolution; 0.000004 is 4 microseconds of elapsed time. The value is marked in ns because when the loop executes one billion times, the elapsed time in seconds is numerically the time in nanoseconds for one iteration, there being a billion nanoseconds in a second (e.g. 1.700296 s / 10^9 iterations = 1.700296 ns per iteration).
The times reported by timecmd are consistent with the times reported by the programs. There is the overhead of starting the process (fork() and exec()) and the I/O in the process that is included in the times reported by timecmd.
Although not shown, the timings with clang and clang++ (instead of GCC 9.3.0) are very comparable, though the cbrt() code takes about 7.5 ns per iteration instead of 6.7 ns. The timing differences for the others are basically noise.
The number suffixes are all 2-digit primes. They have no other significance except to keep the different programs separate.
As @Jonathan Leffler commented, the compiler can optimize your C/C++ code. If the code just loops from 0 to 1000 without doing anything with the counter i (that is, without printing it or using the intermediate values in any other operation, as indexes, etc.), the compiler may not even generate the assembly code that corresponds to that loop. Arithmetic operations may even be pre-computed. For the code below,
int foo(int x) {
    return x * 5;
}

int main() {
    int x = 3;
    int y = foo(x);
    ...
    ...
}
it is not surprising for the compiler to generate just two lines of assembly code for the function foo (the compiler may even bypass calling foo altogether and emit the result inline):
mov $15, %eax
; compiler will not bother multiplying 5 by 3
; but just move the pre-computed '15' to register
ret
; and then return

Why is __builtin_popcount slower than my own bit counting function?

I stumbled on __builtin_popcount for gcc after I had written my own bit-count routines. But when I switched to __builtin_popcount, my software actually ran slower. I'm on Ubuntu on an Intel Core i3-4130T CPU @ 2.90GHz. I built a performance test to see what gives. It looks like this:
#include <iostream>
#include <sys/time.h>
#include <stdint.h>

using namespace std;

const int bitCount[256] = {
    0,1,1,2,1,2,2,3, 1,2,2,3,2,3,3,4, 1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7, 4,5,5,6,5,6,6,7, 5,6,6,7,6,7,7,8
};

const uint32_t m32_0001 = 0x000000ffu;
const uint32_t m32_0010 = 0x0000ff00u;
const uint32_t m32_0100 = 0x00ff0000u;
const uint32_t m32_1000 = 0xff000000u;

inline int countBits(uint32_t bitField)
{
    return
        bitCount[(bitField & m32_0001)      ] +
        bitCount[(bitField & m32_0010) >>  8] +
        bitCount[(bitField & m32_0100) >> 16] +
        bitCount[(bitField & m32_1000) >> 24];
}

inline long long currentTime() {
    struct timeval ct;
    gettimeofday(&ct, NULL);
    return ct.tv_sec * 1000000LL + ct.tv_usec;
}

int main() {
    long long start, delta, sum;

    start = currentTime();
    sum = 0;
    for (unsigned i = 0; i < 100000000; ++i)
        sum += countBits(i);
    delta = currentTime() - start;
    cout << "countBits : sum=" << sum << ": time (usec)=" << delta << endl;

    start = currentTime();
    sum = 0;
    for (unsigned i = 0; i < 100000000; ++i)
        sum += __builtin_popcount(i);
    delta = currentTime() - start;
    cout << "__builtin_popcount: sum=" << sum << ": time (usec)=" << delta << endl;

    start = currentTime();
    sum = 0;
    for (unsigned i = 0; i < 100000000; ++i) {
        int count;
        asm("popcnt %1,%0" : "=r"(count) : "rm"(i) : "cc");
        sum += count;
    }
    delta = currentTime() - start;
    cout << "assembler : sum=" << sum << ": time (usec)=" << delta << endl;
    return 0;
}
At first I ran this with an older compiler:
> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=148506
__builtin_popcount: sum=1314447104: time (usec)=345122
assembler : sum=1314447104: time (usec)=138036
As you can see, the table-based countBits is almost as fast as the assembler and far faster than __builtin_popcount. Then I tried a newer compiler on a different machine type (same processor, and I think the motherboard is the same too):
> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=164247
__builtin_popcount: sum=1314447104: time (usec)=345167
assembler : sum=1314447104: time (usec)=138028
Curiously, the older compiler optimized my countBits function better than the newer compiler, but it still compares favorably with the assembler. Clearly since the assembler line compiles and runs, my processor supports popcount, but why then is __builtin_popcount more than two times slower? And how can my own routine possibly compete with the silicon-based popcount? I'm having the same experience with other routines for finding the first set bit, etc. My routines are all significantly faster than the GNU "builtin" equivalents.
(BTW, I have no clue how to write assembler. I just found that line on some web page and it miraculously seemed to work.)
Without specifying an appropriate -march on the command line, gcc generates a call to the __popcountdi2 library function rather than the popcnt instruction. See: https://godbolt.org/z/z1BihM
POPCNT has been supported by Intel since Nehalem and by AMD since Barcelona, according to Wikipedia: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
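If you would rather not bake in the whole -march=native feature set, enabling just this one instruction should suffice (a sketch; -mpopcnt is a real GCC flag, but I have not benchmarked this exact line):
g++ -O3 -mpopcnt popcountTest.cpp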
I thought it might be useful to share the new performance results after adding -march=native to the compile line (as suggested by Mat and Alan Birtles), which enables use of the popcount machine instruction. The results differ depending on compiler version. Here's the older compiler:
> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -march=native -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=163947
__builtin_popcount: sum=1314447104: time (usec)=138046
assembler : sum=1314447104: time (usec)=138036
And here's the newer compiler:
> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -march=native -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=163133
__builtin_popcount: sum=1314447104: time (usec)=73987
assembler : sum=1314447104: time (usec)=138036
Observations:
Adding -march=native to the command line of the older g++ compiler improved the performance of __builtin_popcount to equal that of the assembler, and SLOWED my countBits routine by about 15%.
Adding -march=native to the command line of the newer g++ compiler caused the performance of __builtin_popcount to surpass that of the assembler. I presume this has something to do with the stack variable I used with the assembler, though I'm not sure. There was no effect on my countBits performance (which, as stated in my question, was already slower with this newer compiler).
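One possible explanation for the builtin beating the hand-written asm (a guess of mine, not verified here): on a number of Intel cores, popcnt has a false dependency on its destination register, and compilers break it by zeroing the destination first. The same trick can be retrofitted onto the loop body of the inline-asm test, as a sketch:
int count;
asm("xor %0, %0\n\t"      // zero the output first to break the (assumed) false dependency
    "popcnt %1, %0"
    : "=&r"(count)        // early-clobber: must not share a register with the input
    : "rm"(i)
    : "cc");
sum += count;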
I stumbled upon this, and I thought I might share some more modern, and surprising, results.
On macOS 12.2 with an Intel i7-7920HQ, compiling with clang++ 13 using -O3 -march=native, the results are as follows:
countBits : sum=1314447104: time (usec)=93142
__builtin_popcount: sum=1314447104: time (usec)=59412
assembler : sum=1314447104: time (usec)=111535
So with modern CPUs and modern compilers, it always makes sense to use __builtin_popcount.

LLVM Clang produces extremely slow input/output on OS X

Q: Is it possible to improve the I/O of this code with LLVM Clang under OS X?
test_io.cpp:
#include <iostream>
#include <string>

constexpr int SIZE = 1000*1000;

int main(int argc, const char * argv[]) {
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);
    std::string command(argv[1]);
    if (command == "gen") {
        for (int i = 0; i < SIZE; ++i) {
            std::cout << 1000*1000*1000 << " ";
        }
    } else if (command == "read") {
        int x;
        for (int i = 0; i < SIZE; ++i) {
            std::cin >> x;
        }
    }
}
Compile:
clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
Benchmark:
> time ./test_io gen | ./test_io read
real 0m2.961s
user 0m3.675s
sys 0m0.012s
Apart from the sad fact that reading a 10 MB file costs 3 seconds, it's also much slower than g++ (installed via homebrew):
> gcc-6 -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
> time ./test_io gen | ./test_io read
real 0m0.149s
user 0m0.167s
sys 0m0.040s
My clang version is Apple LLVM version 7.0.0 (clang-700.0.72). Clang installed from homebrew (3.7 and 3.8) also produces slow I/O. Clang installed on Ubuntu (3.8) generates fast I/O. Apple LLVM version 8.0.0 also generates slow I/O (confirmed by two people).
I also dtrussed it a bit (sudo dtruss -c "./test_io gen | ./test_io read") and found that the clang version makes 2686 write_nocancel syscalls, while the gcc version makes 2079 writev syscalls. That probably points to the root of the problem.
The issue is in libc++, which does not implement sync_with_stdio.
Your command line clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io does not actually use libstdc++; it uses libc++. To force libstdc++ you need -stdlib=libstdc++.
Minimal example if you have the input file ready:
#include <iostream>

constexpr int SIZE = 1000*1000;

int main() {
    std::ios_base::sync_with_stdio(false);
    int x;
    for (int i = 0; i < SIZE; ++i) {
        std::cin >> x;
    }
}
Timings:
$ clang++ test_io.cpp -o test -O2 -std=c++11
$ time ./test read < input
real 0m2.802s
user 0m2.780s
sys 0m0.015s
$ clang++ test_io.cpp -o test -O2 -std=c++11 -stdlib=libstdc++
clang: warning: libstdc++ is deprecated; move to libc++
$ time ./test read < input
real 0m0.185s
user 0m0.169s
sys 0m0.012s
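If switching standard libraries is not attractive (note the deprecation warning above), a workaround sketch of mine is to bypass iostreams for the hot path entirely; plain C stdio does not depend on sync_with_stdio at all:
// read_stdio.cpp -- hypothetical replacement for the "read" branch.
#include <cstdio>

constexpr int SIZE = 1000*1000;

int main() {
    int x;
    for (int i = 0; i < SIZE; ++i) {
        if (std::scanf("%d", &x) != 1)  // stop early on malformed input
            break;
    }
    return 0;
}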

g++ 4.8.1 on Ubuntu can't compile large bitsets

My source:
#include <iostream>
#include <bitset>

using std::cout;
using std::endl;

typedef unsigned long long U64;
const U64 MAX = 8000000000L;

struct Bitmap
{
    void insert(U64 N) { this->s.set(N % MAX); }
    bool find(U64 N) const { return this->s.test(N % MAX); }
private:
    std::bitset<MAX> s;
};

int main()
{
    cout << "Bitmap size: " << sizeof(Bitmap) << endl;
    Bitmap* s = new Bitmap();
    // ...
}
Compilation command and its output:
g++ -g -std=c++11 -O4 tc002.cpp -o latest
g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.8/README.Bugs> for instructions.
A bug report and its fix will take a long time... Has anybody had this problem already? Can I manipulate some compiler flags or something else (probably in the source) to work around this problem?
I'm compiling on Ubuntu, which is actually a VMware virtual machine with 12GB of memory and 80GB of disk space; the host machine is a MacBook Pro:
uname -a
Linux ubuntu 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
On my machine, g++ 4.8.1 needs a maximum of about 17 gigabytes of RAM to compile this file, as observed with top.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18287 nm 20 0 17.880g 0.014t 808 D 16.6 95.7 0:17.72 cc1plus
't' in the RES column stands for terabytes ;)
The time taken is
real 1m25.283s
user 0m31.279s
sys 0m5.819s
In C++03 mode, g++ compiles the same file using just a few megabytes. The time taken is
real 0m0.107s
user 0m0.074s
sys 0m0.011s
I would say this is definitely a bug. A workaround is to give the machine more RAM or enable swap, or to use clang++.
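Another workaround sketch (my suggestion, assuming a runtime allocation is acceptable): replace the compile-time-sized std::bitset with a heap-backed std::vector<bool>, which moves the cost from the compiler to program startup:
#include <iostream>
#include <vector>

typedef unsigned long long U64;
const U64 MAX = 8000000000ULL;

struct Bitmap
{
    Bitmap() : s(MAX) {}                           // ~1 GB, allocated at run time
    void insert(U64 N) { s[N % MAX] = true; }
    bool find(U64 N) const { return s[N % MAX]; }
private:
    std::vector<bool> s;                           // bit-packed, like bitset
};

int main()
{
    Bitmap* s = new Bitmap();
    s->insert(123456789ULL);
    std::cout << "found: " << s->find(123456789ULL) << std::endl;
    delete s;
}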
[Comment]
This little thing:
#include <bitset>

int main() {
    std::bitset<8000000000UL> b;
}
results in 'virtual memory exhausted: Cannot allocate memory' when compiled with
g++ (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2