I'm using pthread_cond_timedwait in a thread loop so that it executes every X ms (unless it is woken first).
When I debug it with gdb, the function sometimes never returns.
This forum post describes the same problem, but offers no solution.
Here's some code that reproduces the problem:
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h> /* clock_gettime */
#include <unistd.h>
static pthread_cond_t s_cond = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t s_mutex = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char **argv)
{
int rc = 0;
struct timespec curts = { 0 }; /* transformed timeout value */
clock_gettime(CLOCK_REALTIME, &curts);
curts.tv_sec += 10; /* Add 10 seconds to current time*/
pthread_mutex_lock(&s_mutex);
printf("pthread_cond_timedwait\n");
rc = pthread_cond_timedwait(&s_cond, &s_mutex, &curts);
if (rc == ETIMEDOUT)
{
printf("Timer expired \n");
}
pthread_mutex_unlock(&s_mutex);
return 1;
}
If I just run it, it works fine, and if I start it under gdb it also works fine.
I've narrowed the problem down to these steps (I've named the program timedTest):
Run the program;
While it runs, attach gdb to it;
Execute continue in gdb;
The timedTest program never returns.
Then, if I hit Ctrl+C in the terminal running gdb and run continue again, the program returns.
I can probably use some other method to achieve what I want in this case, but I assume there should be a solution to this problem.
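For reference, the "wake every X ms unless signalled first" loop can be sketched with an explicit wake flag and a condition variable bound to CLOCK_MONOTONIC. This is only a minimal sketch, not a confirmed workaround for the hang described above. PeriodicWaiter and the waiter_* functions are hypothetical names, and the code assumes a Linux/glibc system where pthread_condattr_setclock is available.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <time.h>

// Hypothetical helper: a wake flag guarded by a mutex, plus a condition
// variable that measures its timeout against CLOCK_MONOTONIC.
struct PeriodicWaiter {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool woken;
};

void waiter_init(PeriodicWaiter *w) {
    assert(w != nullptr);
    pthread_condattr_t attr;
    pthread_condattr_init(&attr);
    // Assumption: the platform supports selecting the clock for timed waits.
    pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    pthread_mutex_init(&w->mutex, nullptr);
    pthread_cond_init(&w->cond, &attr);
    pthread_condattr_destroy(&attr);
    w->woken = false;
}

// Waits up to `ms` milliseconds. Returns true if woken, false on timeout.
bool waiter_wait_ms(PeriodicWaiter *w, long ms) {
    timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += ms / 1000;
    deadline.tv_nsec += (ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&w->mutex);
    int rc = 0;
    // Re-check the flag after every return to tolerate spurious wakeups;
    // stop on timeout or on any other error code.
    while (!w->woken && rc == 0) {
        rc = pthread_cond_timedwait(&w->cond, &w->mutex, &deadline);
    }
    bool was_woken = w->woken;
    w->woken = false;
    pthread_mutex_unlock(&w->mutex);
    return was_woken;
}

void waiter_wake(PeriodicWaiter *w) {
    pthread_mutex_lock(&w->mutex);
    w->woken = true;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}
```

A worker thread would call waiter_wait_ms in its loop and do its periodic work whenever the call returns, while another thread calls waiter_wake to trigger an early iteration.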
EDIT:
Looks like this only happens on some machines, so maybe it has something to do with the gcc / glibc / gdb / kernel versions...
Versions where this happens almost always:
$ ldd --version
ldd (Ubuntu EGLIBC 2.13-0ubuntu13) 2.13
$ gcc --version
gcc (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2
$ gdb --version
GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2
$ uname -a
Linux geovani 2.6.38-8-generic-pae #42-Ubuntu SMP Mon Apr 11 05:17:09 UTC 2011 i686 i686 i386 GNU/Linux
According to this forum post, this is a bug in the 2.6.38 kernel. I ran some tests with a 2.6.39 kernel and the problem does not occur. After rolling back to 2.6.38, it appears again.
I wrote a multi-threaded program that tested OK on Linux (g++ 12.2.0, clang++ 15.0.2-1) and on Windows with Visual Studio 2022 17.4.2, but it deadlocks on Windows when built with MinGW-w64.
After a lot of debugging, I found that a simple piece of code sometimes deadlocks when compiled with MinGW-w64, with or without optimization options; usually fewer than 10000 loop iterations are enough to block the program.
I'm not sure whether there is something unsafe in this code or whether MinGW-w64 has a bug in its semaphore implementation.
On Linux, or when compiled with Visual Studio, the program runs forever.
Also, if I replace acquire() with try_acquire_for(chrono::milliseconds(1000)), the program also runs forever under MinGW-w64 (without pausing).
This is the code:
#include <cstdio> // printf
#include <iostream>
#include <thread>
#include <semaphore>
using namespace std;
std::counting_semaphore<3> cs1(0), cs2(0);
int main(int argc, char const *argv[])
{
thread th(
[]()
{
for (int j = 0;; j--)
{
cs1.release();
printf("%d\n", j);
cs2.acquire();
}
});
for (int i = 0;; i++)
{
cs2.release();
printf("%d\n", i);
cs1.acquire();
}
th.join();
return 0;
}
These are the last lines of output from one run:
...
-804
805
806
-805
-806
-807
-808
807
808
809
810
-809
-810
-811
811
812
(blocked)
It seems that the release operation before printing i=812 did not wake the thread waiting in acquire after printing j=-811.
Both MinGW-w64 g++ and MinGW-w64 clang++ have this problem. This is the MinGW-w64 build info (the thread and exception model should be POSIX/SEH):
winlibs personal build version gcc-12.2.0-llvm-14.0.6-mingw-w64ucrt-10.0.0-r2
This is the winlibs 64-bit standalone build of:
- GCC 12.2.0
- GDB 12.1
- LLVM/Clang/LLD/LLDB 14.0.6
- MinGW-w64 10.0.0 (linked with ucrt)
- GNU Binutils 2.39
- GNU Make 4.3
- PExports 0.47
- dos2unix 7.4.3
- Yasm 1.3.0
- NASM 2.15.05
- JWasm 2.12pre
- ninja 1.11.0
- doxygen 1.9.5
This build was compiled with GCC 12.2.0 and packaged on 2022-08-28.
Please check out http://winlibs.com/ for the latest personal build.
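One way to narrow down whether the stall comes from the toolchain's std::counting_semaphore or from the ping-pong logic itself is to run the same pattern on a semaphore built from a mutex and a condition variable. The sketch below does that with a bounded loop so it terminates. SimpleSemaphore and ping_pong are hypothetical names, and this is a swapped-in technique for comparison, not the library's implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for std::counting_semaphore, built on C++11 primitives.
class SimpleSemaphore {
public:
    explicit SimpleSemaphore(int initial) : count_(initial) {
        assert(initial >= 0);
    }

    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate is re-checked on every wakeup, so a spurious or
        // early notification cannot make the count go negative.
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_;
};

// Runs the same release/acquire ping-pong as the code above, but for a
// fixed number of rounds. Returns how many rounds the worker completed.
int ping_pong(int rounds) {
    SimpleSemaphore s1(0), s2(0);
    int worker_rounds = 0;

    std::thread worker([&] {
        for (int j = 0; j < rounds; ++j) {
            s1.release();
            ++worker_rounds;
            s2.acquire();
        }
    });

    for (int i = 0; i < rounds; ++i) {
        s2.release();
        s1.acquire();
    }

    worker.join();  // also makes worker_rounds safe to read
    return worker_rounds;
}
```

If a call such as ping_pong(1000000) completes reliably on the affected toolchain while the std::counting_semaphore version blocks, that would point towards the semaphore implementation, not the pattern.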
How can I speed up MinGW-w64's extremely slow C++ compilation/linking?
Compiling a trivial "Hello World" program:
#include <iostream>
int main()
{
std::cout << "hello world" << std::endl;
}
...takes 3 minutes(!) on this otherwise-unloaded Windows 10 box (i7-6700, 32GB of RAM, decent SATA SSD):
> ptime.exe g++ main.cpp
ptime 1.0 for Win32, Freeware - http://www.pc-tools.net/
Copyright(C) 2002, Jem Berkes <jberkes@pc-tools.net>
=== g++ main.cpp ===
Execution time: 180.488 s
Process Explorer shows the g++ process tree bottoming out in ld.exe which doesn't use any appreciable CPU or I/O for the duration.
Running the g++ process tree through API Monitor shows there are three unusually long syscalls in ld.exe: two NtCreateFile()s and a NtOpenFile(), each operating on a.exe and taking 60 seconds apiece.
The slowness only happens when using the default a.exe output; g++ -o foo.exe main.cpp takes 2 seconds, tops.
"Well don't use a.exe as an output name then!" isn't really a solution since this behavior causes CMake to take ages doing compiler feature detection.
GCC toolchain versions:
>g++ --version
g++ (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
>ld --version
GNU ld (GNU Binutils) 2.30
I couldn't reproduce the problem in a clean Windows 10 VM; that, together with the dependence on the output filename, led me down the path of anti-virus/anti-malware interference.
fltmc instances listed several possible filesystem filter drivers; guess-n-check narrowed it down to two of Carbon Black's: carbonblackk & ParityDriver.
Using Regedit to disable them by setting Start to 0x4 ("Disabled"; 0x2 == Automatic, 0x3 == Manual) in these two registry keys, followed by a reboot, fixed the slowness:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\carbonblackk
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ParityDriver
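To check whether file creation is still being delayed for a particular output name, a small probe can time how long it takes to create and delete a file with that name. This is a minimal sketch under the assumption that the delay shows up on ordinary file creation as it did for ld.exe; time_file_create is a hypothetical helper, not part of any toolchain.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical probe: returns the seconds needed to create, write, close
// and delete a file called `name` in the current directory.
double time_file_create(const std::string &name) {
    assert(!name.empty());
    auto start = std::chrono::steady_clock::now();
    {
        std::ofstream out(name, std::ios::binary | std::ios::trunc);
        out << "probe";
    }  // the stream is closed here, before the file is removed
    std::remove(name.c_str());
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

Comparing time_file_create("a.exe") with time_file_create("foo.exe") in a scratch directory should show whether the name alone makes a difference.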
I was trying to play around with the new parallel library features proposed in the C++17 standard, but I couldn't get it to work. I tried compiling with the up-to-date versions of g++ 8.1.1 and clang++-6.0 and -std=c++17, but neither seemed to support #include <execution>, std::execution::par or anything similar.
When looking at the cppreference for parallel algorithms there is a long list of algorithms, claiming
Technical specification provides parallelized versions of the following 69 algorithms from algorithm, numeric and memory: ( ... long list ...)
which sounds like the algorithms are ready 'on paper', but not ready to use yet?
In this SO question from over a year ago the answers claim these features hadn't been implemented yet. But by now I would have expected to see some kind of implementation. Is there anything we can use already?
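A quick way to see what a given toolchain offers is to ask it at compile time. The sketch below checks for the execution header with __has_include and for the standard feature-test macro __cpp_lib_parallel_algorithm. The two helper functions are hypothetical names, and the check only reports what the standard library claims; it does not verify that a parallel backend is installed or linked.

```cpp
#include <algorithm>
#include <cassert>

// Detect the <execution> header without including it, so this compiles
// even on toolchains where the header is missing.
#if defined(__has_include)
#  if __has_include(<execution>)
#    define HAVE_EXECUTION_HEADER 1
#  endif
#endif
#ifndef HAVE_EXECUTION_HEADER
#  define HAVE_EXECUTION_HEADER 0
#endif

// Hypothetical helper: true if the compiler can find <execution>.
constexpr bool have_execution_header() {
    return HAVE_EXECUTION_HEADER != 0;
}

// Hypothetical helper: true if the standard library advertises the C++17
// parallel algorithms through the feature-test macro.
constexpr bool library_reports_parallel_algorithms() {
#if defined(__cpp_lib_parallel_algorithm)
    return __cpp_lib_parallel_algorithm >= 201603L;
#else
    return false;
#endif
}
```

Both functions are constexpr, so they can also be used in static_assert or if constexpr to select a serial fallback.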
GCC 9 has them but you have to install TBB separately
In Ubuntu 19.10, all components have finally aligned:
GCC 9 is the default one, and the minimum required version for TBB
TBB (Intel Threading Building Blocks) is at 2019~U8-1, so it meets the minimum 2018 requirement
so you can simply do:
sudo apt install gcc libtbb-dev
g++ -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -o main.out main.cpp -ltbb
./main.out
and use as:
#include <execution>
#include <algorithm>
std::sort(std::execution::par_unseq, input.begin(), input.end());
see also the full runnable benchmark below.
GCC 9 and TBB 2018 are the first ones to work as mentioned in the release notes: https://gcc.gnu.org/gcc-9/changes.html
Parallel algorithms and <execution> (requires Thread Building Blocks 2018 or newer).
Related threads:
How to install TBB from source on Linux and make it work
trouble linking INTEL tbb library
Ubuntu 18.04 installation
Ubuntu 18.04 is a bit more involved:
GCC 9 can be obtained from a trustworthy PPA, so it is not so bad
TBB is at version 2017, which does not work, and I could not find a trustworthy PPA for it. Compiling from source is easy, but there is no install target which is annoying...
Here are fully automated tested commands for Ubuntu 18.04:
# Install GCC 9
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-9 g++-9
# Compile libtbb from source.
sudo apt-get build-dep libtbb-dev
git clone https://github.com/intel/tbb
cd tbb
git checkout 2019_U9
make -j `nproc`
TBB="$(pwd)"
TBB_RELEASE="${TBB}/build/linux_intel64_gcc_cc7.4.0_libc2.27_kernel4.15.0_release"
# Use them to compile our test program.
g++-9 -ggdb3 -O3 -std=c++17 -Wall -Wextra -pedantic -I "${TBB}/include" -L "${TBB_RELEASE}" -Wl,-rpath,"${TBB_RELEASE}" -o main.out main.cpp -ltbb
./main.out
Test program analysis
I have tested with this program that compares the parallel and serial sorting speed.
main.cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <execution>
#include <iostream>
#include <random>
#include <string>
#include <vector>
int main(int argc, char **argv) {
using clk = std::chrono::high_resolution_clock;
decltype(clk::now()) start, end;
std::vector<unsigned long long> input_parallel, input_serial;
unsigned int seed;
unsigned long long n;
std::uniform_int_distribution<uint64_t> zero_ull_max(0);
// CLI arguments.
if (argc > 1) {
n = std::strtoll(argv[1], NULL, 0);
} else {
n = 10;
}
if (argc > 2) {
seed = std::stoi(argv[2]);
} else {
seed = std::random_device()();
}
std::mt19937 prng(seed);
for (unsigned long long i = 0; i < n; ++i) {
input_parallel.push_back(zero_ull_max(prng));
}
input_serial = input_parallel;
// Sort and time parallel.
start = clk::now();
std::sort(std::execution::par_unseq, input_parallel.begin(), input_parallel.end());
end = clk::now();
std::cout << "parallel " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;
// Sort and time serial.
start = clk::now();
std::sort(std::execution::seq, input_serial.begin(), input_serial.end());
end = clk::now();
std::cout << "serial " << std::chrono::duration<float>(end - start).count() << " s" << std::endl;
assert(input_parallel == input_serial);
}
On Ubuntu 19.10, Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads, 2.90 GHz base, 8 MB cache), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB, 2400 Mbps) a typical output for an input with 100 million numbers to be sorted:
./main.out 100000000
was:
parallel 2.00886 s
serial 9.37583 s
so the parallel version was about 4.7 times faster! See also: What do the terms "CPU bound" and "I/O bound" mean?
We can confirm that the process is spawning threads with strace:
strace -f -s999 -v ./main.out 100000000 |& grep -E 'clone'
which shows several lines of type:
[pid 25774] clone(strace: Process 25788 attached
[pid 25774] <... clone resumed> child_stack=0x7fd8c57f4fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd8c57f59d0, tls=0x7fd8c57f5700, child_tidptr=0x7fd8c57f59d0) = 25788
Also, if I comment out the serial version and run with:
time ./main.out 100000000
I get:
real 0m5.135s
user 0m17.824s
sys 0m0.902s
which confirms again that the algorithm was parallelized since real < user, and gives an idea of how effectively it can be parallelized on my system (about 3.5x on 4 cores / 8 threads).
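The same real-versus-user comparison can be made from inside a program. The sketch below runs a busy loop on several threads and reports wall-clock time next to process CPU time. It assumes a POSIX system where std::clock returns CPU time summed over all threads (on Windows it measures wall time instead). Timing, spin and measure are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

struct Timing {
    double wall_s;  // elapsed real time
    double cpu_s;   // CPU time consumed by the whole process
};

// Busy work that the compiler cannot optimize away.
static void spin(unsigned long iterations) {
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < iterations; ++i) {
        sink = sink + i;
    }
}

// Runs `spin` on `threads` threads and measures both clocks around it.
Timing measure(unsigned threads, unsigned long iterations) {
    assert(threads > 0);
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back(spin, iterations);
    }
    for (auto &th : pool) {
        th.join();
    }

    std::clock_t cpu_end = std::clock();
    auto wall_end = std::chrono::steady_clock::now();

    Timing result;
    result.wall_s = std::chrono::duration<double>(wall_end - wall_start).count();
    result.cpu_s = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return result;
}
```

On a multi-core machine, cpu_s / wall_s noticeably above 1 indicates that the threads really ran in parallel, which is the same signal as real < user in the output of time.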
Error messages
Hey, Google, index this please.
If you don't have tbb installed, the error is:
In file included from /usr/include/c++/9/pstl/parallel_backend.h:14,
from /usr/include/c++/9/pstl/algorithm_impl.h:25,
from /usr/include/c++/9/pstl/glue_execution_defs.h:52,
from /usr/include/c++/9/execution:32,
from parallel_sort.cpp:4:
/usr/include/c++/9/pstl/parallel_backend_tbb.h:19:10: fatal error: tbb/blocked_range.h: No such file or directory
19 | #include <tbb/blocked_range.h>
| ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
so we see that <execution> depends on an uninstalled TBB component.
If TBB is too old, e.g. the default Ubuntu 18.04 one, it fails with:
#error Intel(R) Threading Building Blocks 2018 is required; older versions are not supported.
You can refer to https://en.cppreference.com/w/cpp/compiler_support to check the implementation status of every C++ feature. In your case, just search for "Standardization of Parallelism TS", and you will find that only the MSVC and Intel C++ compilers support this feature now.
Intel has released a Parallel STL library which follows the C++17 standard:
https://github.com/intel/parallelstl
It is being merged into GCC.
GCC does not yet implement the Parallelism TS (see https://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2017).
However, libstdc++ (with GCC) has an experimental mode for some equivalent parallel algorithms. See https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
Getting it to work:
Any use of parallel functionality requires additional compiler and
runtime support, in particular support for OpenMP. Adding this support
is not difficult: just compile your application with the compiler flag
-fopenmp. This will link in libgomp, the GNU Offloading and Multi Processing Runtime Library, whose presence is mandatory.
Code example
#include <vector>
#include <parallel/algorithm>
int main()
{
std::vector<int> v(100);
// ...
// Explicitly force a call to parallel sort.
__gnu_parallel::sort(v.begin(), v.end());
return 0;
}
GCC now supports the <execution> header, but the standard clang build from https://apt.llvm.org does not.
Eigen is a popular C++ library, but icpc seems to have a problem generating debugging info from code that uses Eigen. I'm using the compiler icpc version 13.1.1. I checked with both Eigen 3.2.8 and 3.1.3. It's going to be hard to recompile all the libraries I need with another compiler, so does anyone see a good solution to get Eigen to work with a debugger?
The problem is that variable values don't always get updated in the debugger. Here is main.cpp
#include "stdio.h"
#include "/home/mylogin/include/Eigen/Core"
using namespace std;
int main(int argc, char* argv[])
{
printf("Starting main\n");
double mytest = 3.0;
// If the next line is commented out, the debugger works
Eigen::Vector3d v(1,2,3);
printf("This is mytest %f \n",mytest);
return 0;
}
I compile with
icpc -O0 -debug -I/home/mylogin/include/ main.cpp
Then you can run the debugger
idbc ./a.out
Intel(R) Debugger for applications running on Intel(R) 64, Version 13.0, Build [80.215.23]
------------------
object file name: ./a.out
Reading symbols from /mnt/io1/home/mylogin/a.out...done.
(idb) break main
Breakpoint 1 at 0x4005fb: file /mnt/io1/home/mylogin/main.cpp, line 142.
(idb) run
Starting program: /mnt/io1/home/mylogin/a.out
[New Thread 18379 (LWP 18379)]
Breakpoint 1, main (argc=1, argv=0x7fff8b2e89b8) at /mnt/io1/home/mylogin/main.cpp:8
8 printf("Starting main\n");
(idb) next
Starting main
11 Eigen::Vector3d v(1,2,3);
(idb) next
12 printf("This is mytest %f \n",mytest);
(idb) next
This is mytest 3.000000
13 return 0;
(idb) print mytest
$1 = 5.9415882155426741e-313
You see in the last few lines that the executable prints "3.0" correctly. You also see that the variable is not printed correctly by the debugger.
Both gdb and idbc show the problem. It doesn't seem to be because it's near the start or end of the function main(). The CPU is
Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
Linux version is
Description: Scientific Linux release 6.4 (Carbon)
Thanks for ideas!
The process running the following code crashes with a Segmentation fault:
#include <stdlib.h>
#include <iostream>
#include <pthread.h>
void* f( void* )
{
while( true )
{
// It crashes inside this call (with cerr, too).
std::cout << 0;
}
return NULL;
}
int main()
{
pthread_t t;
pthread_create( &t, NULL, &f, NULL );
while( true )
{
// It crashes with any script/app; true is just simple.
system( "true" );
}
return 0;
}
It crashes about every other execution within a few seconds (output has anywhere from thousands to millions of '0's). It crashes a few functions deep in the cout << 0 call with the above code. Depending on extra functions called or data put on the stack in f(), it crashes in different places. In gdb, sometimes the stack doesn't make sense with regard to the order of the function calls. From this I deduce the stack is corrupted.
I found there are some problems with multi-threaded applications calling fork() (see also two of the comments mentioning stack corruption). Forking/cloning a process copies the file descriptors if they aren't set to FD_CLOEXEC. However, there are no explicitly created file descriptors. (I tried setting FD_CLOEXEC on fileno( stdout ) and fileno( stderr ) with no positive change.)
Even without explicit file descriptors can I not mix threads and fork()? Do I simply need to replace the system() call with equivalent functionality? Or is there a bug in the kernel that causes this crash and has been fixed after 2.6.30?
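As an illustration of what "replacing system() with equivalent functionality" could look like, here is a minimal sketch based on posix_spawnp and waitpid. It is not known whether this avoids the crash described above, since the child process is still created by the kernel; run_program is a hypothetical helper, and unlike system() it does not go through a shell.

```cpp
#include <cassert>
#include <cerrno>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

// Hypothetical helper: runs `program` (looked up in PATH) with no arguments
// and returns its exit status, or -1 if it could not be run or did not
// exit normally.
int run_program(const char *program) {
    assert(program != nullptr);
    char *const argv[] = { const_cast<char *>(program), nullptr };

    pid_t pid;
    if (posix_spawnp(&pid, program, nullptr, nullptr, argv, environ) != 0) {
        return -1;
    }

    int status = 0;
    // Retry if the wait is interrupted by a signal.
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the test program above, the call system( "true" ) would become run_program( "true" ).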
Other Details
I am running it on an ARM AT91 processor (armv5tejl) with Linux 2.6.30 (with some overlays and patches for my specific set of peripherals) compiled with GCC 4.3.2.
Linux 2.6.30 #1 Thu May 29 15:43:04 CDT 2014 armv5tejl GNU/Linux
I had been [cross] compiling it with -g and -O0, but without those it still crashes:
arm-atmel-linux-gnueabi-g++ -o system_thread system_thread.cpp -lpthread
I've also tried the -fstack-protector-all flag: Sometimes it crashes in __stack_chk_fail(), but sometimes other function pointers or data get corrupted and it crashes earlier.
The libraries it loads (from strace):
libpthread.so.0
libstdc++.so.6
libm.so.6
libgcc_s.so.1
libc.so.6
Note: Since it sometimes does not crash and is not really responsive to ^C, I typically run it in the background:
$ killall -9 system_thread; rm -f log; system_thread >log &
I have compiled this program for a few different architectures and Linux kernel versions, but I have not seen it crash anywhere else:
Linux 3.10.29 #1 Wed Feb 12 17:12:39 CST 2014 armv5tejl GNU/Linux
Linux 3.6.0-dirty #3 Wed May 28 13:53:56 CDT 2014 microblaze GNU/Linux
Linux 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 GNU/Linux
Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
EDIT: Note that on the same architecture (armv5tejl) it does not crash with Linux 3.10.29. Also, it does not crash when running on an earlier version of my "appliance" (older server and client applications), having the same version of Linux - 2.6.30. So the environment of the OS has some effect.
BusyBox v1.20.1 provides sh that system() calls.
This is reproducible on an ARM processor using the 2.6.30 kernel that you mentioned, but not in master. We can use git bisect to find where this bug was fixed (it took about 16 iterations). Note that git bisect is meant to find regressions; in this case master is "good" and a past version is "bad", so we need to reverse the meanings of "good" and "bad".
The culprit found by the bisection is this commit, which fixes "an instance of userspace data corruption" involving fork(). That symptom is very similar to the one you describe, and the bug could also corrupt memory outside of the stack. After backporting this commit and its required parent to the 2.6.30 kernel, the code you posted no longer crashes.