I was trying to play around with the new parallel library features proposed in the C++17 standard, but I couldn't get it to work. I tried compiling with the up-to-date versions of g++ 8.1.1 and clang++-6.0 and -std=c++17, but neither seemed to support #include <execution>, std::execution::par or anything similar.
When looking at the cppreference for parallel algorithms there is a long list of algorithms, claiming
Technical specification provides parallelized versions of the following 69 algorithms from algorithm, numeric and memory: ( ... long list ...)
which sounds like the algorithms are ready 'on paper', but not ready to use yet?
In this SO question from over a year ago the answers claim these features hadn't been implemented yet. But by now I would have expected to see some kind of implementation. Is there anything we can use already?
GCC 9 has them but you have to install TBB separately
In Ubuntu 19.10, all components have finally aligned:
GCC 9 is the default one, and the minimum required version for TBB
TBB (Intel Thread Building Blocks) is at 2019~U8-1, so it meets the minimum 2018 requirement
so you can simply do:
sudo apt install gcc libtbb-dev
g++ -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -o main.out main.cpp -ltbb
./main.out
and use as:
#include <execution>
#include <algorithm>
std::sort(std::execution::par_unseq, input.begin(), input.end());
see also the full runnable benchmark below.
GCC 9 and TBB 2018 are the first ones to work as mentioned in the release notes: https://gcc.gnu.org/gcc-9/changes.html
Parallel algorithms and <execution> (requires Thread Building Blocks 2018 or newer).
Related threads:
How to install TBB from source on Linux and make it work
trouble linking INTEL tbb library
Ubuntu 18.04 installation
Ubuntu 18.04 is a bit more involved:
GCC 9 can be obtained from a trustworthy PPA, so it is not so bad
TBB is at version 2017, which does not work, and I could not find a trustworthy PPA for it. Compiling from source is easy, but there is no install target which is annoying...
Here are fully automated tested commands for Ubuntu 18.04:
# Install GCC 9
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-9 g++-9
# Compile libtbb from source.
sudo apt-get build-dep libtbb-dev
git clone https://github.com/intel/tbb
cd tbb
git checkout 2019_U9
make -j `nproc`
TBB="$(pwd)"
TBB_RELEASE="${TBB}/build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.15.0_release"
# Use them to compile our test program.
g++-9 -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -I "${TBB}/include" -L
"${TBB_RELEASE}" -Wl,-rpath,"${TBB_RELEASE}" -o main.out main.cpp -ltbb
./main.out
Test program analysis
I have tested with this program that compares the parallel and serial sorting speed.
main.cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <execution>
#include <random>
#include <iostream>
#include <vector>
int main(int argc, char **argv) {
using clk = std::chrono::high_resolution_clock;
decltype(clk::now()) start, end;
std::vector<unsigned long long> input_parallel, input_serial;
unsigned int seed;
unsigned long long n;
// CLI arguments;
std::uniform_int_distribution<uint64_t> zero_ull_max(0);
if (argc > 1) {
n = std::strtoll(argv[1], NULL, 0);
} else {
n = 10;
}
if (argc > 2) {
seed = std::stoi(argv[2]);
} else {
seed = std::random_device()();
}
std::mt19937 prng(seed);
for (unsigned long long i = 0; i < n; ++i) {
input_parallel.push_back(zero_ull_max(prng));
}
input_serial = input_parallel;
// Sort and time parallel.
start = clk::now();
std::sort(std::execution::par_unseq, input_parallel.begin(), input_parallel.end());
end = clk::now();
std::cout << "parallel " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;
// Sort and time serial.
start = clk::now();
std::sort(std::execution::seq, input_serial.begin(), input_serial.end());
end = clk::now();
std::cout << "serial " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;
assert(input_parallel == input_serial);
}
On Ubuntu 19.10, Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads, 2.90 GHz base, 8 MB cache), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB, 2400 Mbps) a typical output for an input with 100 million numbers to be sorted:
./main.out 100000000
was:
parallel 2.00886 s
serial 9.37583 s
so the parallel version was about 4.5 times faster! See also: What do the terms "CPU bound" and "I/O bound" mean?
We can confirm that the process is spawning threads with strace:
strace -f -s999 -v ./main.out 100000000 |& grep -E 'clone'
which shows several lines of type:
[pid 25774] clone(strace: Process 25788 attached
[pid 25774] <... clone resumed> child_stack=0x7fd8c57f4fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd8c57f59d0, tls=0x7fd8c57f5700, child_tidptr=0x7fd8c57f59d0) = 25788
Also, if I comment out the serial version and run with:
time ./main.out 100000000
I get:
real 0m5.135s
user 0m17.824s
sys 0m0.902s
which confirms again that the algorithm was parallelized since real < user, and gives an idea of how effectively it can be parallelized in my system (about 3.5x for 8 cores).
Error messages
Hey, Google, index this please.
If you don't have tbb installed, the error is:
In file included from /usr/include/c++/9/pstl/parallel_backend.h:14,
from /usr/include/c++/9/pstl/algorithm_impl.h:25,
from /usr/include/c++/9/pstl/glue_execution_defs.h:52,
from /usr/include/c++/9/execution:32,
from parallel_sort.cpp:4:
/usr/include/c++/9/pstl/parallel_backend_tbb.h:19:10: fatal error: tbb/blocked_range.h: No such file or directory
19 | #include <tbb/blocked_range.h>
| ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
so we see that <execution> depends on an uninstalled TBB component.
If TBB is too old, e.g. the default Ubuntu 18.04 one, it fails with:
#error Intel(R) Threading Building Blocks 2018 is required; older versions are not supported.
You can refer https://en.cppreference.com/w/cpp/compiler_support to check all C++ feature implementation status. For your case, just search "Standardization of Parallelism TS", and you will find only MSVC and Intel C++ compilers support this feature now.
Intel has released a Parallel STL library which follows the C++17 standard:
https://github.com/intel/parallelstl
It is being merged into GCC.
Gcc does not yet implement the Parallelism TS (see https://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2017)
However libstdc++ (with gcc) has an experimental mode for some equivalent parallel algorithms. See https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
Getting it to work:
Any use of parallel functionality requires additional compiler and
runtime support, in particular support for OpenMP. Adding this support
is not difficult: just compile your application with the compiler flag
-fopenmp. This will link in libgomp, the GNU Offloading and Multi Processing Runtime Library, whose presence is mandatory.
Code example
#include <vector>
#include <parallel/algorithm>
int main()
{
std::vector<int> v(100);
// ...
// Explicitly force a call to parallel sort.
__gnu_parallel::sort(v.begin(), v.end());
return 0;
}
Gcc now support execution header, but not standard clang build from https://apt.llvm.org
Related
I wrote a multi-thread program and tested OK in Linux:g++12.2.0,clang++15.0.2-1 and Windows:Visual Studio 2022 17.4.2, but caused a deadlock in Windows:MingW-w64.
After a lot of debugging, I found a simple kind of code would cause a deadlock at sometimes when compiled by MingW-w64, no matter with or without optimize options, usually less than 10000 loops would enough to block the progress.
I'm not sure if there is something unsafe in this code or just MingW-w64 has a bug with semaphore.
While in Linux or compiling by Visual Studio would run forever.
And, if replace acquire() to try_acquire_for(chrono::milliseconds(1000)), the program would also run forever under MingW-w64 (without pause).
This is the code:
#include <iostream>
#include <thread>
#include <semaphore>
using namespace std;
std::counting_semaphore<3> cs1(0), cs2(0);
int main(int argc, char const *argv[])
{
thread th(
[]()
{
for (int j = 0;; j--)
{
cs1.release();
printf("%d\n", j);
cs2.acquire();
}
});
for (int i = 0;; i++)
{
cs2.release();
printf("%d\n", i);
cs1.acquire();
}
th.join();
return 0;
}
this is last rows of output in one run:
...
-804
805
806
-805
-806
-807
-808
807
808
809
810
-809
-810
-811
811
812
(blocked)
Seems like the release operation before print(i=812) had not wake the thread waiting at acquire after print(j=-811).
Both MingW-w64-g++ and MingW-w64-clang++ have this problem, this is the info of MingW-w64 (the thread and exception model should be POSIX-seh):
winlibs personal build version gcc-12.2.0-llvm-14.0.6-mingw-w64ucrt-10.0.0-r2
This is the winlibs 64-bit standalone build of:
- GCC 12.2.0
- GDB 12.1
- LLVM/Clang/LLD/LLDB 14.0.6
- MinGW-w64 10.0.0 (linked with ucrt)
- GNU Binutils 2.39
- GNU Make 4.3
- PExports 0.47
- dos2unix 7.4.3
- Yasm 1.3.0
- NASM 2.15.05
- JWasm 2.12pre
- ninja 1.11.0
- doxygen 1.9.5
This build was compiled with GCC 12.2.0 and packaged on 2022-08-28.
Please check out http://winlibs.com/ for the latest personal build.
I'm trying to compile C/C++ code from my Debian partition to generate some executable files for Windows.
Running $ uname -a on the command line gives Linux machine 5.14.0-2-amd64 #1 SMP Debian 5.14.9-2 (2021-10-03) x86_64 GNU/Linux. My processor is an Intel® Core™ i5-1035G4 CPU # 1.10GHz × 8, with a Mesa Intel® Iris(R) Plus Graphics (ICL GT1.5) integrated GPU.
A minimal example to show my current situation includes the following code (called code.cpp):
#include <iostream>
#include <CL/opencl.hpp>
int main()
{
std::vector <cl::Platform> all_platforms; //Get all platforms
cl::Platform::get(&all_platforms);
if (all_platforms.size() == 0)
{
std::cout << "No platforms found. Check OpenCL installation." << std::endl;
exit(1);
}
int pz = all_platforms.size();
std::cout << "Platforms size: " << pz << std::endl;
for (int i = 0; i < pz; i++)
{
cl::Platform default_platform = all_platforms[i];
std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << std::endl;
}
return(0);
}
which uses OpenCL to print all recognized devices. I compile my code writing g++ code.cpp -o code.out -lOpenCL. The executable file code.out works fine, doing what you would expect it to do. I have another program which uses GSL (GNU Scientific Library) written in C which also works well, linking with -lgsl (therefore I think there's not a problem with my code or the regular compilation process). Both OpenCL and GSL were installed from the official repositories (~# apt install ...) with no problem at all. When I execute code.out the output is
Platforms size: 2
Using platform: Intel(R) OpenCL HD Graphics
Using platform: Portable Computing Language
I installed mingw (via ~# apt install mingw-w64) to create executable files to be run on Windows, and for basic programs (i.e. without "external" libraries) it works well (replacing gcc by x86_64-w64-mingw32-gcc or i686-w64-mingw32-gcc). However for the code written above (and for the one using GSL) it doesn't work. Most of the error outputs are very similar for both examples, and I will show the command line outputs for the code using OpenCL.
When I try x86_64-w64-mingw32-g++ code.cpp -o code.out -lOpenCL the output is
code.cpp:2:10: fatal error: CL/opencl.hpp: No such file or directory
2 | #include <CL/opencl.hpp>
| ^~~~~~~~~~~~~~~
compilation terminated.
I thought this meant that I needed to be more specific when linking and including, so I gave the explicit path where the headers are located (found them via dpkg -S opencl.hpp or dpkg -S gsl*.h), and the .so file for OpenCL was found via dpkg -S *OpenCL.so, while the one for GSL was found using dpkg -S *gsl.so. When I try x86_64-w64-mingw32-g++ code.cpp -o code.out -I/usr/include/ -L/usr/lib/x86_64-linux-gnu/libOpenCL.so the output is
In file included from /usr/lib/gcc/x86_64-w64-mingw32/10-win32/include/c++/cwchar:44,
from /usr/lib/gcc/x86_64-w64-mingw32/10-win32/include/c++/bits/postypes.h:40,
from /usr/lib/gcc/x86_64-w64-mingw32/10-win32/include/c++/iosfwd:40,
from /usr/lib/gcc/x86_64-w64-mingw32/10-win32/include/c++/ios:38,
from /usr/lib/gcc/x86_64-w64-mingw32/10-win32/include/c++/ostream:38,
from /usr/lib/gcc/x86_64-w64-mingw32/10-win32/include/c++/iostream:39,
from code.cpp:1:
/usr/include/wchar.h:27:10: fatal error: bits/libc-header-start.h: No such file or directory
27 | #include <bits/libc-header-start.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Therefore it seems that MinGW needs additional instructions to properly find, include and/or link the libraries. I don't know how to solve this problem. Those are my attempts based on some answers I've found, and the documentation provided by MinGW says nothing about this. The exact same problem occurs no matter if I use x86_64-w64-mingw32-g++ or i686-w64-mingw32-g++, or their gcc counterparts.
When cross-compiling make sure you are only linking things targeting the same platform together. In other words, your dependencies (and their dependencies) must be for the same target platform. You can't link with those libraries for your build platform.
So if you have a Windows 64-bit application that depends on OpenCL, you will need to link it against a Windows 64-bit build of OpenCL.
The OpenCL the sources can be found here:
https://github.com/KhronosGroup/OpenCL-Headers
https://github.com/KhronosGroup/OpenCL-ICD-Loader
so you would need to build those first.
How can I speed up MinGW-w64's extremely slow C++ compilation/linking?
Compiling a trivial "Hello World" program:
#include <iostream>
int main()
{
std::cout << "hello world" << std::endl;
}
...takes 3 minutes(!) on this otherwise-unloaded Windows 10 box (i7-6700, 32GB of RAM, decent SATA SSD):
> ptime.exe g++ main.cpp
ptime 1.0 for Win32, Freeware - http://www.pc-tools.net/
Copyright(C) 2002, Jem Berkes <jberkes#pc-tools.net>
=== g++ main.cpp ===
Execution time: 180.488 s
Process Explorer shows the g++ process tree bottoming out in ld.exe which doesn't use any appreciable CPU or I/O for the duration.
Running the g++ process tree through API Monitor shows there are three unusually long syscalls in ld.exe: two NtCreateFile()s and a NtOpenFile(), each operating on a.exe and taking 60 seconds apiece.
The slowness only happens when using the default a.exe output; g++ -o foo.exe main.cpp takes 2 seconds, tops.
"Well don't use a.exe as an output name then!" isn't really a solution since this behavior causes CMake to take ages doing compiler feature detection.
GCC toolchain versions:
>g++ --version
g++ (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
>ld --version
GNU ld (GNU Binutils) 2.30
Given that I couldn't repro the problem in a clean Windows 10 VM and the dependence on the output filename led me down the path of anti-virus/anti-malware interference.
fltmc instances listed several possible filesystem filter drivers; guess-n-check narrowed it down to two of Carbon Black's: carbonblackk & ParityDriver.
Using Regedit to disable them via setting Start to 0x4 ("Disabled", 0x2 == Automatic, 0x3 == Manual) in these two registry keys followed by a reboot fixed the slowness:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\carbonblackk
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ParityDriver
I am trying to profile a simple C program with perf-events on debian 8 jessie. I can see symbols but I am unable to get stacktraces. The same procedure generates good stacktraces on ubuntu 16.04.
I have installed linux-image-amd64-dbg and libc6-dbg.
I have confirmed that kernel config parameters include CONFIG_KALLSYMS=y
I have compiled the program with gcc -g3 -O0 hello.c to enable debug symbols.
I start profiling with the following command.
sudo perf record -g ./a.out
I generate a flame graph Flame Graph with the following command
sudo perf script | ~/code/FlameGraph/stackcollapse-perf.pl | \
~/code/FlameGraph/flamegraph.pl > perf-kernel.svg
This is the listing for hello.c which I am trying to profile
#include <stdio.h>
#include <unistd.h>
void do2() {
FILE* f = fopen("/dev/zero", "r");
int fd = fileno(f);
char buf[100];
while(1) {
read(fd, buf, sizeof(buf)/sizeof(buf[0]));
}
}
int main(void)
{
do2();
return 0;
}
This is the flame graph with debian jessie
This is the flame graph with ubuntu
Why are stack traces missing in debian jessie?
Thanks
Sharath
Managed to find the issue.
I had to enable CONFIG_FRAME_POINTER=y and recompile the kernel as per Brendan Gregg's perf site
It is unfortunate that the kernel shipping with Debian 8 does not have this enabled, which breaks perf
My application is crashing when built on a new Apple laptop and then launched on a much older Apple laptop.
The application is built using Xcode 6.4, on OSX 10.9 and 10.10, when using llvm 6.1 and C++11. The SDK is 10.10, the target OSX is 10.7. Optimizations are off.
The crash is very very early on when the C runtime is loading my application binary and initializing the modules.
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 com.MyCompany.MyApplication 0x000000010cd10e7a _GLOBAL__I_a + 10
1 dyld 0x00007fff61fd3ceb ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 265
2 dyld 0x00007fff61fd3e78 ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
3 dyld 0x00007fff61fd0871 ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int,
This is before any of my application code. The crash does not occur on the build machine (i7 CPU). Crashes occur on i5 and Core 2 Duo machines. I suspect that an extended (CPU specific) instruction is creating the crash on load.
When I use the same Xcode, same llvm, etc to build the application on the Core 2 Duo machine there is no crash.
I am also using homebrew: libmtp, libusb, libusb-compat, cryptopp, curl (with c-ares, openssl, nghttp2), boost. I have specified C++11 where necessary, and have specified --build-bottle. I am statically linking to these libraries.
I have tried to use otool -tV on all libraries, the final binary, etc to find SSE instructions.
I have tried to set the Xcode LLVM build setting "Enable Additional Vector Extensions" to "platform" and "SSE3" to no avail. This is probably because homebrew isn't passing the --universal flag from curl to the building of openssl and it's cryptlib.
I have taken static libraries libcurl.a (CURL), libssl.a (OpenSSL), libcrypto.a (OpenSSL), libz.a (zlib) from the older machine and added them to my repository. Using Xcode to link them into my application solves the problem.
Are there other tools I can should use to narrow down the offending instruction?
Are there other explanations for the crash?
Addendum:
In addition to building the libraries on an older machine, I have also created a proof of concept, minimal, instant crash program that reports a slightly different crash location, but demonstrates the issue:
On an i7 (new Apple computer with new Intel CPU), use homebrew to install:
brew install curl --with-c-ares --with-openssl
Then copy this source into file sse.cpp:
#define CURL_STATICLIB
#include <curl/curl.h>
int main(int argc, const char * argv[]) {
curl_global_init(CURL_GLOBAL_ALL);
return 0;
}
Compile it:
clang++ sse.cpp -c -arch x86_64 -I/usr/local/opt/curl/include
clang++ -o a.out sse.o /usr/local/opt/openssl/lib/libssl.a /usr/local/opt/openssl/lib/libcrypto.a /usr/local/opt/zlib/lib/libz.a /usr/local/opt/curl/lib/libcurl.a /usr/local/opt/c-ares/lib/libcares.a -stdlib=libc++ -framework LDAP
Now move to an older Apple computer with older Intel CPU, and crash it:
./a.out
Crash Report (compressed):
Process: a.out [569]
...
Code Type: X86-64 (Native)
Parent Process: bash [448]
Responsible: Terminal [339]
...
OS Version: Mac OS X 10.10.5 (14F27)
...
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 a.out 0x000000010dbdce3f ENGINE_new + 36
1 a.out 0x000000010dbe05e3 ENGINE_load_dynamic + 11
2 a.out 0x000000010dbdf04a ENGINE_load_builtin_engines + 24
3 a.out 0x000000010dc76b36 Curl_ossl_init + 14
4 a.out 0x000000010dc5c2a5 curl_global_init + 114
5 a.out 0x000000010db51d95 main + 37
6 libdyld.dylib 0x00007fff88b735c9 start + 1
Does your code work when you disable compiler optimizations? If not, how about trying an older version of Xcode? It could just be a compiler bug, though I'd hope not! If you can find a working compiler or set of compiler options to check against, you could use LLVM's bugpoint tool to isolate which file is being miscompiled.
The solution appears to involve using:
export HOMEBREW_BUILD_BOTTLE=1
export HOMEBREW_BOTTLE_ARCH=core2
When building the homebrew libraries. Using Intel XED I was able to check the emitted machine code for unsupported instructions:
xed_cmd="/usr/local/bin/xed"
ar -x libcurl.a
parts=(*.o)
for j in "${parts[#]}"; do
chipcheck=$(${xed_cmd} -i ${j} -chip-check ${chipToCheck})
chiperrors=$(echo "${chipcheck}" | grep "# Total Chip Check Errors")
if [[ "$chiperrors" != "# Total Chip Check Errors: 0" ]] ; then
echo ERROR ${libname} ${j} $chiperrors
fi
done