Q: Is it possible to improve the IO performance of this code with LLVM Clang under OS X:
test_io.cpp:
#include <iostream>
#include <string>
constexpr int SIZE = 1000*1000;
int main(int argc, const char * argv[]) {
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);
    std::string command(argv[1]);
    if (command == "gen") {
        for (int i = 0; i < SIZE; ++i) {
            std::cout << 1000*1000*1000 << " ";
        }
    } else if (command == "read") {
        int x;
        for (int i = 0; i < SIZE; ++i) {
            std::cin >> x;
        }
    }
}
Compile:
clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
Benchmark:
> time ./test_io gen | ./test_io read
real 0m2.961s
user 0m3.675s
sys 0m0.012s
Apart from the sad fact that reading a ~10 MB stream takes 3 seconds, the Clang build is much slower than one from g++ (installed via Homebrew):
> gcc-6 -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
> time ./test_io gen | ./test_io read
real 0m0.149s
user 0m0.167s
sys 0m0.040s
My clang version is Apple LLVM version 7.0.0 (clang-700.0.72). Clang builds installed from Homebrew (3.7 and 3.8) also produce slow IO. Clang installed on Ubuntu (3.8) generates fast IO. Apple LLVM version 8.0.0 also generates slow IO (confirmed by two other people).
I also ran it under dtruss (sudo dtruss -c "./test_io gen | ./test_io read") and found that the clang version makes 2686 write_nocancel syscalls, while the gcc version makes 2079 writev syscalls, which probably points to the root of the problem.
A: The issue is in libc++, which does not implement sync_with_stdio.
Your command line clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io does not use libstdc++; it uses libc++. To force the use of libstdc++ you need -stdlib=libstdc++.
Minimal example, if you have the input file ready:
#include <iostream>

constexpr int SIZE = 1000*1000;

int main(int argc, const char * argv[]) {
    std::ios_base::sync_with_stdio(false);
    int x;
    for (int i = 0; i < SIZE; ++i) {
        std::cin >> x;
    }
}
Timings:
$ clang++ test_io.cpp -o test -O2 -std=c++11
$ time ./test read < input
real 0m2.802s
user 0m2.780s
sys 0m0.015s
$ clang++ test_io.cpp -o test -O2 -std=c++11 -stdlib=libstdc++
clang: warning: libstdc++ is deprecated; move to libc++
$ time ./test read < input
real 0m0.185s
user 0m0.169s
sys 0m0.012s
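If you need to stay on libc++, a partial workaround on the output side is to hand the stream a large private buffer so it flushes in big chunks instead of issuing one small write per item. This is just a sketch, not a fix for the missing sync_with_stdio; note that pubsetbuf must be called before the first output, and whether it takes effect on std::cout at all is implementation-defined:
#include <iostream>

constexpr int SIZE = 1000*1000;

int main() {
    static char buf[1 << 20];  // 1 MB buffer
    // Must run before anything is written to std::cout; the effect on the
    // standard streams is implementation-defined.
    std::cout.rdbuf()->pubsetbuf(buf, sizeof(buf));
    for (int i = 0; i < SIZE; ++i) {
        std::cout << 1000*1000*1000 << ' ';
    }
}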
Related
In the following example, the elimination of unused code is performed for sin() but not for pow(), and I am wondering why. I tried both gcc and clang.
Here are some more details about this example, which is otherwise mostly code.
The code contains a loop over an integer from which a floating-point number is computed. The number is passed to a mathematical function, either pow() or sin(), depending on which macros are defined. If the macro USE is defined, the sum of all returned values is accumulated in another variable, which is then copied to a volatile variable to prevent the optimizer from removing the code entirely.
// main.cpp
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    std::chrono::steady_clock clock;
    auto start = clock.now();
    double s = 0;
    const size_t count = 1 << 27;
    for (size_t i = 0; i < count; ++i) {
        const double x = double(i) / count;
        double a = 0;
#ifdef POW
        a = std::pow(x, 0.5);
#endif
#ifdef SIN
        a = std::sin(x);
#endif
#ifdef USE
        s += a;
#endif
    }
    auto stop = clock.now();
    printf("%.0f ms\n", std::chrono::duration<double>(stop - start).count() * 1e3);
    volatile double a = s;
    (void)a;
}
As seen from the timings below, the computation of sin() is completely eliminated when its result is unused. This is not the case for pow(), since the execution time does not decrease.
I normally observe this when the call may return a NaN (log(-x) but not log(+x)).
# g++ 10.2.0
g++ -std=c++14 -O3 -DPOW main.cpp -o main && ./main
3064 ms
g++ -std=c++14 -O3 -DPOW -DUSE main.cpp -o main && ./main
3172 ms
g++ -std=c++14 -O3 -DSIN main.cpp -o main && ./main
0 ms
g++ -std=c++14 -O3 -DSIN -DUSE main.cpp -o main && ./main
1391 ms
# clang++ 11.0.1
clang++ -std=c++14 -O3 -DPOW main.cpp -o main && ./main
3288 ms
clang++ -std=c++14 -O3 -DPOW -DUSE main.cpp -o main && ./main
3351 ms
clang++ -std=c++14 -O3 -DSIN main.cpp -o main && ./main
177 ms
clang++ -std=c++14 -O3 -DSIN -DUSE main.cpp -o main && ./main
1524 ms
I have a program that does independent computations on a bunch of images. This seems like a good case for using OpenMP:
//file: WoodhamData.cpp
#include <omp.h>
...
void WoodhamData::GenerateLightingDirection() {
    int imageWidth = (this->normalMap)->width();
    int imageHeight = (this->normalMap)->height();
    #pragma omp paralell for num_threads(2)
    for (int r = 0; r < RadianceMaps.size(); r++) {
        if (omp_get_thread_num() == 0) {
            std::cout << "threads=" << omp_get_num_threads() << std::endl;
        }
        ...
    }
}
In order to use OpenMP, I add -fopenmp to my makefile, so it outputs:
g++ -g -o test.exe src/test.cpp src/WoodhamData.cpp -pthread -L/usr/X11R6/lib -fopenmp --std=c++0x -lm -lX11 -Ilib/eigen/ -Ilib/CImg
However, I am sad to say that my program reports threads=1 (run from the terminal as ./test.exe ...).
Does anyone know what might be wrong? This is the slowest part of my program, and it would be great to speed it up a bit.
A: Your OpenMP directive is wrong: it is "parallel", not "paralell". A misspelled pragma is not a compile error; the compiler simply ignores pragmas it does not recognize, so the loop ran serially on one thread.
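For reference, here is a minimal standalone sketch with the directive spelled correctly (the loop body is hypothetical, not the asker's image code). Compiled with g++ -fopenmp, it prints threads=2, possibly several times:
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel for num_threads(2)
    for (int r = 0; r < 8; ++r) {
        if (omp_get_thread_num() == 0) {
            // Correctly spelled, the pragma is honored and two threads run.
            std::printf("threads=%d\n", omp_get_num_threads());
        }
    }
}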
This question already has answers here: What's the difference between -O3 and (-O2 + flags that man gcc says -O3 adds to -O2)?
Here's the function I'm looking at:
template <uint8_t Size>
inline uint64_t parseUnsigned( const char (&buf)[Size] )
{
    uint64_t val = 0;
    for (uint8_t i = 0; i < Size; ++i)
        if (buf[i] != ' ')
            val = (val * 10) + (buf[i] - '0');
    return val;
}
I have a test harness which passes in all possible numbers with Size=5, left-padded with spaces. I'm using GCC 4.7.2. When I run the program under callgrind after compiling with -O3 I get:
I refs: 7,154,919
When I compile with -O2 I get:
I refs: 9,001,570
OK, so -O3 improves the performance (and I confirmed that some of the improvement comes from the above function, not just the test harness). But I don't want to completely switch from -O2 to -O3, I want to find out which specific option(s) to add. So I consult man g++ to get the list of options it says are added by -O3:
-fgcse-after-reload [enabled]
-finline-functions [enabled]
-fipa-cp-clone [enabled]
-fpredictive-commoning [enabled]
-ftree-loop-distribute-patterns [enabled]
-ftree-vectorize [enabled]
-funswitch-loops [enabled]
So I compile again with -O2 followed by all of the above options. But this gives me even worse performance than plain -O2:
I refs: 9,546,017
I discovered that adding -ftree-vectorize to -O2 is responsible for this performance degradation. But I can't figure out how to match the -O3 performance with any combination of options. How can I do this?
In case you want to try it yourself, here's the test harness (put the above parseUnsigned() definition under the #includes):
#include <cmath>
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

template <uint8_t Size>
inline void increment( char (&buf)[Size] )
{
    for (uint8_t i = Size - 1; i < 255; --i)
    {
        if (buf[i] == ' ')
        {
            buf[i] = '1';
            break;
        }
        ++buf[i];
        if (buf[i] > '9')
            buf[i] -= 10;
        else
            break;
    }
}

int main()
{
    char str[5];
    memset(str, ' ', sizeof(str));
    unsigned max = std::pow(10, sizeof(str));
    for (unsigned ii = 0; ii < max; ++ii)
    {
        uint64_t result = parseUnsigned(str);
        if (result != ii)
        {
            // %.*s bounds the read, since str is not NUL-terminated.
            printf("parseUnsigned(%.*s) from %u: %llu\n",
                   (int)sizeof(str), str, ii, (unsigned long long)result);
            abort();
        }
        increment(str);
    }
}
A: A very similar question was already answered here: https://stackoverflow.com/a/6454659/483486
I've copied the relevant text underneath.
UPDATE: There are questions about it in the GCC wiki:
"Is -O1 (-O2,-O3 or -Os) equivalent to individual -foptimization options?"
No. First, individual optimization options (-f*) do not enable optimization, an option -Os or -Ox with x > 0 is required. Second, the -Ox flags enable many optimizations that are not controlled by any individual -f* option. There are no plans to add individual options for controlling all these optimizations.
"What specific flags are enabled by -O1 (-O2, -O3 or -Os)?"
Varies by platform and GCC version. You can get GCC to tell you what flags it enables by doing this:
touch empty.c
gcc -O1 -S -fverbose-asm empty.c
cat empty.s
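Another way to see exactly which optimizations differ between two levels (this works on reasonably recent GCC releases) is to diff the -Q --help=optimizers output:
gcc -O2 -Q --help=optimizers > o2.txt
gcc -O3 -Q --help=optimizers > o3.txt
diff o2.txt o3.txt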
My source:
#include <iostream>
#include <bitset>

using std::cout;
using std::endl;

typedef unsigned long long U64;
const U64 MAX = 8000000000L;

struct Bitmap
{
    void insert(U64 N) { this->s.set(N % MAX); }
    bool find(U64 N) const { return this->s.test(N % MAX); }
private:
    std::bitset<MAX> s;
};

int main()
{
    cout << "Bitmap size: " << sizeof(Bitmap) << endl;
    Bitmap* s = new Bitmap();
    // ...
}
Compilation command and its output:
g++ -g -std=c++11 -O4 tc002.cpp -o latest
g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.8/README.Bugs> for instructions.
A bug report and its fix will take a long time... Has anybody hit this problem already? Can I tweak some compiler flags or something else (in the source, probably) to work around it?
I'm compiling on Ubuntu, which is actually a VMware virtual machine with 12 GB of memory and 80 GB of disk space; the host machine is a MacBook Pro:
uname -a
Linux ubuntu 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
A: On my machine, g++ 4.8.1 peaks at about 17 gigabytes of RAM while compiling this file, as observed with top.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18287 nm 20 0 17.880g 0.014t 808 D 16.6 95.7 0:17.72 cc1plus
The 't' in the RES column stands for terabytes ;)
The time taken is
real 1m25.283s
user 0m31.279s
sys 0m5.819s
In C++03 mode, g++ compiles the same file using just a few megabytes. The time taken is
real 0m0.107s
user 0m0.074s
sys 0m0.011s
I would say this is definitely a bug. A workaround is to give the machine more RAM or enable swap, or to use clang++.
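If changing the container is an option, another workaround (my suggestion, not part of the original answer) is to move the size out of the type system, so the compiler never has to materialize a multi-gigabyte type at compile time. A sketch with std::vector<bool>, which keeps the bit-packing but sizes the storage at runtime:
#include <vector>

typedef unsigned long long U64;
const U64 MAX = 8000000000ULL;

struct Bitmap
{
    Bitmap() : s(MAX) {}                          // size is now a runtime value
    void insert(U64 N) { s[N % MAX] = true; }
    bool find(U64 N) const { return s[N % MAX]; }
private:
    std::vector<bool> s;                          // bit-packed, ~1 GB at runtime
};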
[Comment]
This little thing:
#include <bitset>

int main() {
    std::bitset<8000000000UL> b;
}
results in 'virtual memory exhausted: Cannot allocate memory' when compiled with
g++ (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2
I am considering the following C++ program:
#include <iostream>
#include <limits>

int main(int argc, char **argv) {
    unsigned int sum = 0;
    for (unsigned int i = 1; i < std::numeric_limits<unsigned int>::max(); ++i) {
        double f = static_cast<double>(i);
        unsigned int t = static_cast<unsigned int>(f);
        sum += (t % 2);
    }
    std::cout << sum << std::endl;
    return 0;
}
I use the gcc/g++ compiler; g++ -v reports gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux).
I am running openSUSE 12.3 (x86_64) and have an Intel(R) Core(TM) i7-3520M CPU.
Running
g++ -O3 test.C -o test_64_opt
g++ -O0 test.C -o test_64_no_opt
g++ -m32 -O3 test.C -o test_32_opt
g++ -m32 -O0 test.C -o test_32_no_opt
time ./test_64_opt
time ./test_64_no_opt
time ./test_32_opt
time ./test_32_no_opt
yields
2147483647
real 0m4.920s
user 0m4.904s
sys 0m0.001s
2147483647
real 0m16.918s
user 0m16.851s
sys 0m0.019s
2147483647
real 0m37.422s
user 0m37.308s
sys 0m0.000s
2147483647
real 0m57.973s
user 0m57.790s
sys 0m0.011s
Using float instead of double, the optimized 64-bit variant even finishes in 2.4 seconds, while the other running times stay roughly the same. However, with float I get different outputs depending on optimization, probably due to the higher processor-internal precision.
I know 64-bit code may have faster math, but here we have a factor of 7 (and nearly 15 with floats).
I would appreciate an explanation of these running-time discrepancies.
A: The problem isn't 32-bit vs 64-bit; it's the lack of SSE and SSE2. When compiling for 64-bit, gcc assumes it can use SSE and SSE2, since all x86_64 processors have them.
Compile your 32-bit version with -msse -msse2 and the runtime difference nearly disappears.
My benchmark results, for completeness:
-O3 -m32 -msse -msse2  4.678s
-O3 (64-bit)           4.524s
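For background (my explanation, not part of the original answer): without SSE2, the 32-bit build has to do each double-to-integer conversion through the x87 FPU, which involves storing the value to memory and temporarily switching the FPU control word into truncating mode for every conversion, while SSE2 offers dedicated truncating conversion instructions. It may also be worth forcing SSE arithmetic throughout (test.C is the asker's file name from above):
g++ -m32 -O3 -msse -msse2 -mfpmath=sse test.C -o test_32_sse_opt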