g++ 4.8.1 on Ubuntu can't compile large bitsets - c++

My source:
#include <iostream>
#include <bitset>

using std::cout;
using std::endl;

typedef unsigned long long U64;
const U64 MAX = 8000000000L;

struct Bitmap
{
    void insert(U64 N) { this->s.set(N % MAX); }
    bool find(U64 N) const { return this->s.test(N % MAX); }
private:
    std::bitset<MAX> s;
};

int main()
{
    cout << "Bitmap size: " << sizeof(Bitmap) << endl;
    Bitmap* s = new Bitmap();
    // ...
}
Compilation command and its output:
g++ -g -std=c++11 -O4 tc002.cpp -o latest
g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.8/README.Bugs> for instructions.
A bug report and its fix will take a long time... Has anybody run into this problem already? Can I tweak some compiler flags, or change something (probably in the source), to work around it?
I'm compiling on Ubuntu, which is actually a VMware virtual machine with 12 GB of memory and 80 GB of disk space; the host machine is a MacBook Pro:
uname -a
Linux ubuntu 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

On my machine, g++ 4.8.1 needs a maximum of about 17 gigabytes of RAM to compile this file, as observed with top.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18287 nm 20 0 17.880g 0.014t 808 D 16.6 95.7 0:17.72 cc1plus
't' in the RES column stands for terabytes ;)
The time taken is
real 1m25.283s
user 0m31.279s
sys 0m5.819s
In C++03 mode, g++ compiles the same file using just a few megabytes of memory. The time taken is
real 0m0.107s
user 0m0.074s
sys 0m0.011s
I would say this is definitely a bug. A workaround is to give the machine more RAM, or enable swap. Or use clang++.
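If more RAM, swap, or a different compiler is not an option, one possible workaround (a sketch, not part of the original answer) is to drop the compile-time-sized std::bitset and build an equivalent bitmap on the heap at run time, e.g. on top of std::vector<std::uint64_t>, so the compiler never has to instantiate std::bitset<8000000000>:

#include <cstdint>
#include <iostream>
#include <vector>

typedef unsigned long long U64;
const U64 MAX = 8000000000ULL;

// Run-time bitmap with the same interface as the Bitmap in the question;
// the size is no longer a template argument, so there is nothing huge for
// the compiler to instantiate.
struct Bitmap
{
    Bitmap() : s(MAX / 64 + 1, 0) {}
    void insert(U64 N) { U64 b = N % MAX; s[b / 64] |= 1ULL << (b % 64); }
    bool find(U64 N) const { U64 b = N % MAX; return (s[b / 64] >> (b % 64)) & 1ULL; }
private:
    std::vector<std::uint64_t> s;   // ~1 GB allocated at run time
};

int main()
{
    Bitmap* s = new Bitmap();
    s->insert(123456789ULL);
    std::cout << s->find(123456789ULL) << std::endl;   // prints 1
    delete s;
}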

[Comment]
This little thing:
#include <bitset>

int main() {
    std::bitset<8000000000UL> b;
}
results in 'virtual memory exhausted: Cannot allocate memory' when compiled with
g++ (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2

Related

MPI does not run with requested number of threads

I am trying to run the following example MPI code, which launches 20 threads and keeps them busy for a while. However, when I check the CPU utilization with a tool like nmon or top, I see that only a single thread is being used.
#include <iostream>
#include <thread>
#include <mpi.h>

using namespace std;

int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided != MPI_THREAD_FUNNELED)
        exit(1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    auto f = [](float x) {
        float result = 0;
        for (float i = 0; i < x; i++) { result += 10 * i + x; }
        cout << "Result: " << result << endl;
    };

    thread threads[20];
    for (int i = 0; i < 20; ++i)
        threads[i] = thread(f, 100000000.f); // do some work
    for (auto& th : threads)
        th.join();

    MPI_Finalize();
    return 0;
}
I compile this code using mpicxx: mpicxx -std=c++11 -pthread example.cpp -o example and run it using mpirun: mpirun -np 1 example.
I am using Open MPI version 4.1.4, which is compiled with POSIX thread support (following the explanation from this question).
$ mpicxx --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ mpirun --version
mpirun (Open MPI) 4.1.4
$ ompi_info | grep -i thread
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
FT Checkpoint support: no (checkpoint thread: no)
$ mpicxx -std=c++11 -pthread example.cpp -o example
$ ./example
My CPU has 10 cores and 20 threads, and it runs the example code above without MPI on all 20 threads. So why does the code not run on all threads when launched with MPI?
I suspect I might need to do something with MPI process binding, which I see mentioned in some answers on the same topic (1, 2), but other answers rule these options out entirely, so I'm unsure whether this is the correct approach.
mpirun -np 1 ./example assigns a single core to your program (so the 20 threads end up time-sharing it): this is the default behavior for Open MPI (e.g. 1 core per MPI process when running with -np 1 or -np 2).
./example (i.e. singleton mode) should use all the available cores, unless you are already running on a subset.
If you want to use all the available cores with mpirun, you can
mpirun --bind-to none -np 1 ./example
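To verify the binding from inside the program, here is a small Linux-only sketch (not part of the original answer) that prints how many CPUs the process is allowed to run on, using the glibc extension sched_getaffinity:

#include <cstdio>     // perror
#include <iostream>
#include <sched.h>    // sched_getaffinity, cpu_set_t (Linux/glibc extension)

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    // Query the affinity mask of the calling process (pid 0 = self)
    if (sched_getaffinity(0, sizeof(set), &set) == 0)
        std::cout << "CPUs in affinity mask: " << CPU_COUNT(&set) << std::endl;
    else
        std::perror("sched_getaffinity");
    return 0;
}

Launched as mpirun -np 1 ./affinity it should report only the core(s) the process was bound to, while mpirun --bind-to none -np 1 ./affinity should report all hardware threads.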

Armadillo random number generator only generating zeros (Windows, MSYS2)

The following test program is supposed to generate a vector with 5 random elements, but the output contains only zeroes when I compile and run it on my machine.
#include <iostream>
#include <armadillo>

using std::cout;
using std::endl;
using arma::vec;

int main()
{
    arma::arma_rng::set_seed(1);
    vec v = arma::randu<vec>(5);
    cout << v << endl;
    cout << v(0) << endl;
    return 0;
}
Compilation/output
$ g++ main.cpp -o example -std=c++11 -O2 -larmadillo
$ ./example.exe
0
0
0
0
0
0
I'm on Windows 10, using gcc ((Rev1, Built by MSYS2 project) 8.2.1 20181214) and Armadillo (9.200.6) from MSYS2.
Packages (pacman in mingw64 subsystem):
mingw64/mingw-w64-x86_64-armadillo 9.200.6-1
mingw64/mingw-w64-x86_64-gcc-8.3.0-1
Any idea what could cause this?
I have an inkling that this might be because I'm using the MSYS2 version of Armadillo, but I'm not sure, and haven't tested compiling the library myself (yet).
EDIT: I have an inkling that this is related to MSYS somehow, so I opened an issue over here: https://github.com/msys2/MINGW-packages/issues/5019
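A diagnostic sketch (not from the original post) that may help narrow this down: generate values with Armadillo and with plain <random> in the same program, to see whether only Armadillo's generator is affected on this toolchain.

#include <iostream>
#include <random>
#include <armadillo>

int main()
{
    // Armadillo path, as in the question
    arma::arma_rng::set_seed(1);
    arma::vec v = arma::randu<arma::vec>(5);
    std::cout << "arma::randu:\n" << v;

    // Plain standard-library generator for comparison
    std::mt19937 gen(1);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::cout << "std::mt19937: ";
    for (int i = 0; i < 5; ++i)
        std::cout << dist(gen) << " ";
    std::cout << std::endl;
    return 0;
}

If the <random> values look reasonable while arma::randu still prints zeroes, the problem is more likely in the Armadillo package than in the compiler or the standard library.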

LLVM Clang produces extremely slow input/output on OS X

Q: Is it possible to improve IO of this code with LLVM Clang under OS X:
test_io.cpp:
#include <iostream>
#include <string>

constexpr int SIZE = 1000*1000;

int main(int argc, const char * argv[]) {
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);
    std::string command(argv[1]);
    if (command == "gen") {
        for (int i = 0; i < SIZE; ++i) {
            std::cout << 1000*1000*1000 << " ";
        }
    } else if (command == "read") {
        int x;
        for (int i = 0; i < SIZE; ++i) {
            std::cin >> x;
        }
    }
}
Compile:
clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
Benchmark:
> time ./test_io gen | ./test_io read
real 0m2.961s
user 0m3.675s
sys 0m0.012s
Apart from the sad fact that reading a 10 MB file takes 3 seconds, it's much slower than g++ (installed via Homebrew):
> gcc-6 -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
> time ./test_io gen | ./test_io read
real 0m0.149s
user 0m0.167s
sys 0m0.040s
My clang version is Apple LLVM version 7.0.0 (clang-700.0.72). Clang builds installed from Homebrew (3.7 and 3.8) also produce slow I/O. Clang installed on Ubuntu (3.8) produces fast I/O. Apple LLVM version 8.0.0 also produces slow I/O (reported by two people).
I also ran it under dtruss (sudo dtruss -c "./test_io gen | ./test_io read") and found that the clang version makes 2686 write_nocancel syscalls, while the gcc version makes 2079 writev syscalls, which probably points to the root of the problem.
The issue is in libc++, which does not implement sync_with_stdio.
Your command line clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io does not actually use libstdc++; it uses libc++. To force libstdc++, you need -stdlib=libstdc++.
Minimal example if you have the input file ready:
#include <iostream>

constexpr int SIZE = 1000*1000;

int main(int argc, const char * argv[]) {
    std::ios_base::sync_with_stdio(false);
    int x;
    for (int i = 0; i < SIZE; ++i) {
        std::cin >> x;
    }
}
Timings:
$ clang++ test_io.cpp -o test -O2 -std=c++11
$ time ./test read < input
real 0m2.802s
user 0m2.780s
sys 0m0.015s
$ clang++ test_io.cpp -o test -O2 -std=c++11 -stdlib=libstdc++
clang: warning: libstdc++ is deprecated; move to libc++
$ time ./test read < input
real 0m0.185s
user 0m0.169s
sys 0m0.012s
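If switching to libstdc++ is undesirable (Apple marks it deprecated, as the warning above shows), a possible alternative not mentioned in the original answer is to bypass iostreams for the hot path and read through C stdio, which has its own buffering and is not affected by the libc++ sync_with_stdio issue:

#include <cstdio>

int main() {
    int x;
    long long count = 0;
    // scanf uses stdio's own buffering, so it avoids the slow libc++ iostream path
    while (std::scanf("%d", &x) == 1)
        ++count;
    std::printf("read %lld integers\n", count);
    return 0;
}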

32 Bit vs 64 Bit: Massive Runtime Difference

I am considering the following C++ program:
#include <iostream>
#include <limits>

int main(int argc, char **argv) {
    unsigned int sum = 0;
    for (unsigned int i = 1; i < std::numeric_limits<unsigned int>::max(); ++i) {
        double f = static_cast<double>(i);
        unsigned int t = static_cast<unsigned int>(f);
        sum += (t % 2);
    }
    std::cout << sum << std::endl;
    return 0;
}
I use the gcc/g++ compiler; g++ -v gives gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux).
I am running openSUSE 12.3 (x86_64) and have an Intel(R) Core(TM) i7-3520M CPU.
Running
g++ -O3 test.C -o test_64_opt
g++ -O0 test.C -o test_64_no_opt
g++ -m32 -O3 test.C -o test_32_opt
g++ -m32 -O0 test.C -o test_32_no_opt
time ./test_64_opt
time ./test_64_no_opt
time ./test_32_opt
time ./test_32_no_opt
yields
2147483647
real 0m4.920s
user 0m4.904s
sys 0m0.001s
2147483647
real 0m16.918s
user 0m16.851s
sys 0m0.019s
2147483647
real 0m37.422s
user 0m37.308s
sys 0m0.000s
2147483647
real 0m57.973s
user 0m57.790s
sys 0m0.011s
Using float instead of double, the optimized 64 bit variant even finishes in 2.4 seconds, while the other running times stay roughly the same. However, with float I get different outputs depending on optimization, probably due to the higher processor-internal precision.
I know 64 bit may have faster math, but we have a factor of 7 (and nearly 15 with floats) here.
I would appreciate an explanation of these running time discrepancies.
The problem isn't 32-bit vs. 64-bit; it's the lack of SSE and SSE2. When compiling for 64-bit, gcc assumes it can use SSE and SSE2, since all x86_64 processors have them.
Compile your 32-bit version with -msse -msse2 and the runtime difference nearly disappears.
My benchmark results for completeness:
-O3 -m32 -msse -msse2 4.678s
-O3 (64bit) 4.524s
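One way to double-check which instruction set a particular set of flags enables is to query GCC's predefined macros; a small sketch (not part of the original answer), using GCC's __SSE2__ macro:

#include <cstdio>

int main() {
#ifdef __SSE2__
    std::puts("SSE2 enabled: float/double <-> int conversions can use SSE2 instructions");
#else
    std::puts("SSE2 not enabled: conversions fall back to the x87 FPU");
#endif
    return 0;
}

Built with g++ -m32 it should print the second line, and with g++ -m32 -msse -msse2 (or any 64-bit build) the first.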

Cygwin g++ x86_64 segmentation fault (core dumped) when using > 2GB memory

I've written a prime sieve program in C++ which uses ~12 GB of RAM to calculate all primes below 100,000,000,000 (100 billion).
The program works fine when compiled with Visual Studio 2012 (in a project set up for x64) as well as with g++ on 64-bit Linux. However, when compiled with g++ in cygwin64 on Windows 7 Home Premium 64-bit, a segmentation fault occurs when attempting to use more than ~2 GB of RAM (running the sieve for > ~17,000,000,000).
I'm fairly sure it's running as a 64-bit process, as there's no *32 next to the process name in Task Manager.
The code:
#include <iostream>
#include <vector>
#include <cmath>
#include <cstdlib>

using namespace std;

long long sieve(long long n);

int main(int argc, char** argv) {
    const long long ONE_BILLION = 1000*1000*1000;
    if (argc == 2)
        cout << sieve(atol(argv[1])) << endl;
    else
        cout << sieve(ONE_BILLION * 100) << endl;
}

long long sieve(long long n) {
    vector<bool> bools(n+1);
    for (long long i = 0; i <= n; i++)
        bools[i] = true;

    double csqrtn = sqrt(n);
    for (long long i = 2; i < csqrtn; ++i)
        if (bools[i])
            for (long long j = i * i; j < n; j += i)
                bools[j] = false;

    long long primes2 = 0;
    for (long long i = 2; i < n; i++)
        if (bools[i])
            primes2++;

    return primes2;
}
It works fine in Visual Studio and on x64 Linux (screenshots omitted).
Compiled with the command:
$ g++ -O3 sieve.cpp -o sieve.exe
Running for 18 billion fails:
$ ./sieve.exe 18000000000
Segmentation fault (core dumped)
Running for 17 billion works fine, using 2,079,968 K of memory according to Task Manager (screenshot omitted):
$ ./sieve.exe 17000000000
755305935
g++ version:
$ g++ --version
g++ (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Note: if you are going to try and run this yourself, it can take quite a long time. On a 3570K @ 4.2 GHz, running 100 billion in Visual Studio takes around 30 minutes, and 1 billion around 10 seconds. However, you might be able to duplicate the error with just the vector allocation (see the sketch below).
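A minimal reproduction sketch along those lines (not code from the original question) just allocates the vector<bool> the sieve would use for 18 billion and touches its last element:

#include <iostream>
#include <vector>

int main() {
    const long long n = 18000000000LL;      // the failing case from the question
    std::vector<bool> bools(n + 1, true);   // ~n/8 bytes of backing storage, roughly 2.1 GiB
    bools[n] = false;                       // touch the end of the allocation
    std::cout << "allocated about " << (n + 1) / 8 / (1024 * 1024) << " MiB" << std::endl;
    return 0;
}

If this alone crashes under cygwin64, the sieve logic itself can be ruled out.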
Edit: since I didn't explicitly ask a question: why does this happen? Is it a limitation of the cygwin64 DLL (cygwin64 was only fully released about a month ago)?
Try increasing the Cygwin memory limit. This Cygwin documentation suggests that the default maximum application heap size on 64-bit platforms is 4 GB... although this may be referring to 32-bit executables on 64-bit platforms; I'm not sure what restrictions cygwin64 64-bit applications would have regarding their maximum heap size.