Why is the C++ thread/future overhead so big?

I have a worker routine (code below) which runs slower when I run it in a separate thread. As far as I can tell, the worker code and data are completely independent of other threads. All the worker does is append nodes to a tree. The goal is to have multiple workers growing trees in parallel.
Can someone help me understand why there is (significant) overhead when running the worker in a separate thread?
Edit:
Initially I was testing WorkerFuture twice. I corrected that, and I now get the same (better) performance in the no-thread and deferred-async cases, and considerable overhead when an extra thread is involved.
The command to compile (linux): g++ -std=c++11 main.cpp -o main -O3 -pthread
Here is the output (time in milliseconds):
Thread : 4000001 size in 1861 ms
Async : 4000001 size in 1836 ms
Defer async: 4000001 size in 1423 ms
No thread : 4000001 size in 1455 ms
Code:
#include <iostream>
#include <vector>
#include <random>
#include <chrono>
#include <thread>
#include <future>

struct Data
{
    int data;
};

struct Tree
{
    Data data;
    long long total;
    std::vector<Tree *> children;

    long long Size()
    {
        long long size = 1;
        for (auto c : children)
            size += c->Size();
        return size;
    }

    ~Tree()
    {
        for (auto c : children)
            delete c;
    }
};

int
GetRandom(long long size)
{
    static long long counter = 0;
    return counter++ % size;
}

void
Worker_(Tree *root)
{
    std::vector<Tree *> nodes = {root};
    Tree *it = root;
    while (!it->children.empty())
    {
        it = it->children[GetRandom(it->children.size())];
        nodes.push_back(it);
    }
    for (int i = 0; i < 100; ++i)
        nodes.back()->children.push_back(new Tree{{10}, 1, {}});
    for (auto t : nodes)
        ++t->total;
}

long long
Worker(long long iterations)
{
    Tree root = {};
    for (long long i = 0; i < iterations; ++i)
        Worker_(&root);
    return root.Size();
}

void ThreadFn(long long iterations, long long &result)
{
    result = Worker(iterations);
}

long long
WorkerThread(long long iterations)
{
    long long result = 0;
    std::thread t(ThreadFn, iterations, std::ref(result));
    t.join();
    return result;
}

long long
WorkerFuture(long long iterations)
{
    std::future<long long> f = std::async(std::launch::async, [iterations] {
        return Worker(iterations);
    });
    return f.get();
}

long long
WorkerFutureSameThread(long long iterations)
{
    std::future<long long> f = std::async(std::launch::deferred, [iterations] {
        return Worker(iterations);
    });
    return f.get();
}

int main()
{
    long long iterations = 40000;
    auto t1 = std::chrono::high_resolution_clock::now();
    auto total = WorkerThread(iterations);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "Thread : " << total << " size in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
    t1 = std::chrono::high_resolution_clock::now();
    total = WorkerFuture(iterations);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "Async : " << total << " size in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
    t1 = std::chrono::high_resolution_clock::now();
    total = WorkerFutureSameThread(iterations);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "Defer async: " << total << " size in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
    t1 = std::chrono::high_resolution_clock::now();
    total = Worker(iterations);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "No thread : " << total << " size in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
}

It seems that the problem is caused by dynamic memory management. When multiple threads are involved (even if the main thread does nothing), the C++ runtime must synchronize access to dynamic memory (the heap), which generates some overhead. I did some experiments with GCC, and the solution to your problem is to use a scalable memory allocator library. For instance, when I used tbbmalloc, e.g.,
export LD_LIBRARY_PATH=$TBB_ROOT/lib/intel64/gcc4.7:$LD_LIBRARY_PATH
export LD_PRELOAD=libtbbmalloc_proxy.so.2
the whole problem disappeared.
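If LD_PRELOAD is inconvenient, TBB's malloc proxy can also be linked in at build time; something like the following should work, though the exact library path depends on your TBB installation:

g++ -std=c++11 main.cpp -o main -O3 -pthread -L$TBB_ROOT/lib/intel64/gcc4.7 -ltbbmalloc_proxy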

The reason is simple. You do not do anything in parallel.
While the extra thread is doing something, the main thread does nothing (it just waits for the thread's job to complete).
In the thread case you also have extra work to do (thread handling and synchronization), so you have a trade-off.
To see any gain you have to do at least two things at the same time, as sketched below.
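For instance, a minimal sketch against the question's code (my illustration, not part of the original answer): split the iterations between two async workers so both trees grow at once. Note that GetRandom's static counter would be shared between the threads, which is itself a data race; a correct parallel version would make that counter thread_local.

// Sketch: two workers growing independent trees at the same time.
// Assumes Worker() from the question; GetRandom would need a thread_local
// counter for this to be race-free.
long long WorkerParallel(long long iterations)
{
    auto f1 = std::async(std::launch::async,
                         [iterations] { return Worker(iterations / 2); });
    auto f2 = std::async(std::launch::async,
                         [iterations] { return Worker(iterations - iterations / 2); });
    return f1.get() + f2.get(); // combined node count of both trees
}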

Related

Total time in different parts of recursive function

I am new to C++ and I need to measure the total time spent in different parts of a recursive function. A simple example showing how far I have got:
#include <iostream>
#include <unistd.h>
#include <chrono>

using namespace std;
using namespace std::chrono;

int recursive(int);
void foo();
void bar();

int main() {
    int n = 5; // this value is known only at runtime
    int result = recursive(n);
    return 0;
}

int recursive(int n) {
    auto start = high_resolution_clock::now();
    if (n > 1) { recursive(n - 1); n = n - 1; }
    auto stop = high_resolution_clock::now();
    auto duration_recursive = duration_cast<microseconds>(stop - start);
    cout << "time in recursive: " << duration_recursive.count() << endl;
    //
    // .. calls to other functions and computation parts I don't want to time
    //
    start = high_resolution_clock::now();
    foo();
    stop = high_resolution_clock::now();
    auto duration_foo = duration_cast<seconds>(stop - start);
    cout << "time in foo: " << duration_foo.count() << endl;
    //
    // .. calls to other functions and computation parts I don't want to time
    //
    start = high_resolution_clock::now();
    bar();
    stop = high_resolution_clock::now();
    auto duration_bar = duration_cast<seconds>(stop - start);
    cout << "time in bar: " << duration_bar.count() << endl;
    return 0;
}

void foo() { // a complex function
    sleep(1);
}

void bar() { // another complex function
    sleep(2);
}
I want the total time for each of the functions; for instance, for foo() it should be 5 seconds, while now I always get 1 second. The number of iterations is known only at runtime (n = 5 here is fixed just for simplicity).
To compute the total time for each of the functions I tried using static variables and accumulating the results, but that didn't work.
You can use some container to store the times, pass it by reference and accumulate the times. For example, a std::map<std::string, unsigned> gives you labels (the sketch below also fixes the base case and makes the accumulation explicit):
void recursive(int n, std::map<std::string, unsigned>& times) {
    if (n <= 0) return;                      // base case: stop recursing
    // measure time of foo and accumulate
    auto start = high_resolution_clock::now();
    foo();
    auto stop = high_resolution_clock::now();
    times["foo"] += duration_cast<seconds>(stop - start).count();
    // measure time of bar and accumulate
    start = high_resolution_clock::now();
    bar();
    stop = high_resolution_clock::now();
    times["bar"] += duration_cast<seconds>(stop - start).count();
    // recurse, passing the same map down
    recursive(n - 1, times);
}
Then:
std::map<std::string, unsigned> times;
recursive(200, times);
for (const auto& t : times) {
    std::cout << t.first << " took total : " << t.second << "\n";
}
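If many call sites need timing, a small RAII helper (my sketch, not from the original answer) can accumulate into the same map automatically when it goes out of scope:

#include <chrono>
#include <map>
#include <string>

// Adds the elapsed seconds to times[label] when the timer is destroyed.
struct ScopedTimer {
    ScopedTimer(std::map<std::string, unsigned>& m, std::string l)
        : times(m), label(std::move(l)),
          start(std::chrono::high_resolution_clock::now()) {}
    ~ScopedTimer() {
        auto stop = std::chrono::high_resolution_clock::now();
        times[label] += std::chrono::duration_cast<std::chrono::seconds>(stop - start).count();
    }
    std::map<std::string, unsigned>& times;
    std::string label;
    std::chrono::high_resolution_clock::time_point start;
};

// usage inside recursive():
// { ScopedTimer t(times, "foo"); foo(); }
// { ScopedTimer t(times, "bar"); bar(); }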

Chrono C++ timings not correct

I'm just comparing the speed of a couple of Fibonacci functions. One gives an output almost immediately and reports it got done in 500 nanoseconds, while the other, depending on the depth, may sit there loading for many seconds, yet when it is done it reports that it took only 100 nanoseconds, after I just sat there and waited about 20 seconds for it.
It's not a big deal, as I can prove the other is slower just with raw human perception, but why would chrono not be working? Something to do with recursion?
PS: I know that fibonacci2() doesn't give the correct output on odd-numbered depths; I'm just testing some things, and the output is really just there so the compiler doesn't optimize it away. Go ahead and copy this code and you'll see fibonacci2() output immediately, but you'll have to wait about 5 seconds for fibonacci(). Thank you.
#include <iostream>
#include <chrono>

int fibonacci2(int depth) {
    static int a = 0;
    static int b = 1;
    if (b > a) {
        a += b; //std::cout << a << '\n';
    }
    else {
        b += a; //std::cout << b << '\n';
    }
    if (depth > 1) {
        fibonacci2(depth - 1);
    }
    return a;
}

int fibonacci(int n) {
    if (n <= 1) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main() {
    int f = 0;
    auto start2 = std::chrono::steady_clock::now();
    f = fibonacci2(44);
    auto stop2 = std::chrono::steady_clock::now();
    std::cout << f << '\n';
    auto duration2 = std::chrono::duration_cast<std::chrono::nanoseconds>(stop2 - start2);
    std::cout << "faster function time: " << duration2.count() << '\n';

    auto start = std::chrono::steady_clock::now();
    f = fibonacci(44);
    auto stop = std::chrono::steady_clock::now();
    std::cout << f << '\n';
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    std::cout << "way slower function with incorrect time: " << duration.count() << '\n';
}
I don't know what compiler you are using, and with which options, but I tested x64 MSVC v19.28 with /O2 on godbolt. There the compiled instructions are reordered such that the performance counter is queried twice before fibonacci(int) is invoked, which in source form would look like
auto start = ...;
auto stop = ...;
f = fibonacci(44);
A solution to disallow this reordering might be to use an atomic_thread_fence just before and after the fibonacci call, as sketched below.
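A minimal sketch of that placement (my illustration; the exact effect on code motion is compiler-dependent):

#include <atomic>

auto start = std::chrono::steady_clock::now();
std::atomic_thread_fence(std::memory_order_seq_cst); // discourage hoisting the clock read
f = fibonacci(44);
std::atomic_thread_fence(std::memory_order_seq_cst); // ...and sinking it
auto stop = std::chrono::steady_clock::now();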
As Mestkon answered, the compiler can reorder your code.
For examples of how to prevent the compiler from reordering, see Memory Ordering - Compile Time Memory Barrier.
It would be helpful in the future if you provided information on which compiler you were using.
gcc 7.5 with -O2, for example, does not reorder the timer instructions in this given scenario.
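Following the compile-time-barrier idea from the linked article, a GCC/Clang-specific sketch would look like this; the empty asm statement with a "memory" clobber tells the compiler not to move memory accesses across it:

auto start = std::chrono::steady_clock::now();
asm volatile("" ::: "memory"); // compile-time barrier (GCC/Clang extension)
f = fibonacci(44);
asm volatile("" ::: "memory");
auto stop = std::chrono::steady_clock::now();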

Boost: Creating objects and populating a vector with threads

Using this boost asio based thread pool (here the class is named ThreadPool), I want to parallelize the population of a vector of type std::vector<boost::shared_ptr<T>>, where T is a struct containing a vector of type std::vector<int> whose content and size are determined dynamically after the struct is initialized.
Unfortunately, I am a newbie at both C++ and multithreading, so my attempts at solving this problem have failed spectacularly. Here's an overly simplified sample program that times the non-threaded and threaded versions of the task. The threaded version's performance is horrendous...
#include "thread_pool.hpp"
#include <ctime>
#include <iostream>
#include <vector>
using namespace boost;
using namespace std;
struct T {
vector<int> nums = {};
};
typedef boost::shared_ptr<T> Tptr;
typedef vector<Tptr> TptrVector;
void create_T(const int i, TptrVector& v) {
v[i] = Tptr(new T());
T& t = *v[i].get();
for (int i = 0; i < 100; i++) {
t.nums.push_back(i);
}
}
int main(int argc, char* argv[]) {
clock_t begin, end;
double elapsed;
// define and parse program options
if (argc != 3) {
cout << argv[0] << " <num iterations> <num threads>" << endl;
return 1;
}
int iterations = stoi(argv[1]),
threads = stoi(argv[2]);
// create thread pool
ThreadPool tp(threads);
// non-threaded
cout << "non-thread" << endl;
begin = clock();
TptrVector v(iterations);
for (int i = 0; i < iterations; i++) {
create_T(i, v);
}
end = clock();
elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << elapsed << " seconds" << endl;
// threaded
cout << "threaded" << endl;
begin = clock();
TptrVector v2(iterations);
for (int i = 0; i < iterations; i++) {
tp.submit(boost::bind(create_T, i, v2));
}
tp.stop();
end = clock();
elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << elapsed << " seconds" << endl;
return 0;
}
After doing some digging, I think the poor performance may be due to the threads vying for memory access, but my newbie status is keeping me from exploiting this insight. Can you efficiently populate the pointer vector using multiple threads, ideally in a thread pool?
You haven't provided enough details or a Minimal, Complete, and Verifiable example, so expect lots of guessing.
create_T is a "cheap" function; scheduling a task and the overhead of executing it are much more expensive. That is why your performance is bad. To get a boost from parallelism you need proper work granularity as well as a sufficient total amount of work. Granularity means that each task (in your case, one call to create_T) should be big enough to pay for the multithreading overhead. The simplest approach would be to group the create_T calls into bigger tasks, as sketched below.
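For instance, a sketch of that grouping against the question's code (assuming the same ThreadPool interface; the chunk size is a tunable guess):

// Each task now builds a whole range of elements, so the scheduling cost
// is paid once per chunk instead of once per element.
void create_T_range(const int begin, const int end, TptrVector& v) {
    for (int i = begin; i < end; i++)
        create_T(i, v);
}

// in main(), replacing the per-element submit loop (needs <algorithm> for std::min):
const int chunk = 1024; // tune: larger chunks amortize scheduling overhead further
for (int i = 0; i < iterations; i += chunk)
    tp.submit(boost::bind(create_T_range, i,
                          std::min(i + chunk, iterations), boost::ref(v2)));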

How to measure a time of switching process context in Linux using C++?

I need to measure the time of a context switch using C++. I know that I can simply call C functions from C++ code, but the task is to avoid C where possible. I have searched the Internet but found only ways to do this using C. Are there any ways to work with the OS in C++? Any analogs of pipe(...) from unistd.h, sched_setaffinity(...) from sched.h and others?
update 2017-06-30: example code added
Are there any ways to work with OS in C++?
All of the C functions you referenced are simply accessed by direct include.
example:
#include "pthread.h"
and in a C++ compile, auto-magically get extern "C"'d.
Your link will need -lrt and -pthread on Linux
Any analogs of pipe(...) from unistd.h, sched_setaffinity(...)
Not analogs, the build links to the real "C" Linux functions.
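For instance, a minimal sketch (mine, not from the original answer) calling both directly from C++; note that g++ on Linux defines _GNU_SOURCE by default, which the CPU_* macros require:

#include <sched.h>   // sched_setaffinity(), CPU_ZERO, CPU_SET
#include <unistd.h>  // pipe()

int main() {
    int fds[2];
    if (pipe(fds) != 0)       // fds[0] = read end, fds[1] = write end
        return 1;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);         // restrict to CPU 0
    return sched_setaffinity(0, sizeof(set), &set); // 0 = calling process
}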
I need to measure the time of context switching using C++ means.
I measure durations by repeating some action for 1 to 10 seconds, and counting how many times the loop completes.
In my latest minor benchmark, completely written in C++ (but not using C++11 features), I:

- build a linked list of nodes
- each node has its own thread
- each thread owns 2 pointers to pthread_mutex semaphores (input and output)
- each thread body waits for its input semaphore to be signaled (semTake())
- upon awakening, the thread body signals (semGive()) its output semaphore and does almost nothing more
- the N threads' semaphores are handed out to the node threads and the loop is closed at the end of the list (i.e. the end-list-node output semaphore handle points to the begin-list-node input semaphore handle)
- the main task starts the chain reaction with a semGive(), waits 10 seconds (using usleep), then sets a flag that every thread can see
Example Run on 6 yr old Dell.
Compilation started at Wed Jan 15 22:31:33
./lmbm101
lmbm101: context-switch duration .. wait up to 10 seconds while measuring.
switch enforced using pthread_mutex semaphores
C5 bogomips: 5210.77 5210.77
686.56 kilo m_thread_switch invocations in 10.88 sec (10000088 us)
68.6554 kilo m_thread_switch events per second
14.5655 u seconds per m_thread_switch event
pid = 12188
now (52d760af): 22:31:43
bdtod 2014/01/15 22:31:43 minod=1351 iod=91 secod=81103 soi=104
I did this minor benchmark prior to C++11 release. This code was compiled with C++11, but does not use the C++11 tasking ... a future effort for me.
update 2017-06-30 - overdue update ...
I wrote this example code 2017-04. I now tend to use std::vector for various things. Previous measurements did not. Similar techniques, but simplified result reporting.
#include <chrono>
#include <iomanip>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// see EngFormat-Cpp-master.zip
#ifndef ENG_FORMAT_H_INCLUDED
#include "../../bag/src/eng_format.hpp" // to_engineering_string(), from_engineering_string()
#endif

#include <cassert>
#include <clocale>     // setlocale() - added, was missing
#include <ctime>       // std::time(), std::asctime(), std::localtime() - added, was missing
#include <pthread.h>   // pthread_attr_* - added, was missing
#include <sys/types.h> // uint - added, was missing
#include <semaphore.h> // Note 1 - Ubuntu / Posix feature access, see PPLSem_t

namespace DTB // doug's test box
{
    // Note 2 - typedefs to simplify chrono access
    // 'compressed' chrono access --------------vvvvvvv
    typedef std::chrono::high_resolution_clock  HRClk_t; // std-chrono-hi-res-clk
    typedef HRClk_t::time_point                 Time_t;  // std-chrono-hi-res-clk-time-point
    typedef std::chrono::nanoseconds            NS_t;    // std-chrono-nanoseconds
    typedef std::chrono::microseconds           US_t;    // std-chrono-microseconds
    typedef std::chrono::milliseconds           MS_t;    // std-chrono-milliseconds
    using namespace std::chrono_literals; // support suffixes like 100ms, 2s, 30us
    // examples:
    //   Time_t testStart_us = HRClk_t::now();
    //   auto   testDuration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - testStart_us);
    //   auto   count_us = testDuration_us.count();
    // or
    //   std::cout << " complete " << testDuration_us.count() << " us" << std::endl;

    // C++ access to Linux semaphore via Posix
    // Posix Process Semaphore, set to Local mode (unnamed, unshared)
    class PPLSem_t
    {
    public: // shared-between-threads--v  v--initial-value is unlocked
        PPLSem_t() { assert(0 == ::sem_init(&m_sem, 0, 1)); }  // ctor
        ~PPLSem_t() { assert(0 == ::sem_destroy(&m_sem)); }    // dtor
        int lock() { return (::sem_wait(&m_sem)); }   // returns 0 when success, else -1
        int unlock() { return (::sem_post(&m_sem)); } // returns 0 when success, else -1
        void wait() { assert(0 == lock()); }
        void post() { assert(0 == unlock()); }
    private:
        ::sem_t m_sem;
    };
    // POSIX is an api, this C++ class simplifies use
    // sem_wait and sem_post are possibly assembly for best performance

    // Note 3 - locale what now?
    // insert commas from right to left -- change 1234567890 to 1,234,567,890
    // input 's' is the digits-to-the-left-of-the-decimal-point
    // returns s contents with inserted comma's
    std::string digiComma(std::string s)
    {   //vvvvv--sSize must be signed int of sufficient size
        int32_t sSize = static_cast<int32_t>(s.size());
        if (sSize > 3)
            for (int32_t indx = (sSize - 3); indx > 0; indx -= 3)
                s.insert(static_cast<size_t>(indx), 1, ',');
        return (s);
    }

    const std::string dashLine(" --------------------------------------------------------------\n");

    // Note 5 - thread sync to wall clock
    // action: pauses a thread, resumes thread action at next wall-clock-start-of-second
    void sleepToWallClockStartOfSec(std::time_t t0 = 0)
    {
        if (0 == t0) { t0 = std::time(nullptr); }
        while (t0 == std::time(nullptr)) {
            std::this_thread::sleep_for(100ms); // good-neighbor-thread
        }
    }
    // a good-neighbor-thread delay does not 'hog' a processor

    // Note 4 - typedef examples to simplify
    // create new types based on vector ... suffix '_t' reminds that this is a type
    typedef std::vector<uint>         UintVec_t;
    typedef std::vector<uint>         TIDSeqVec_t;
    typedef std::vector<std::thread*> Thread_pVec_t;

    // measure -std=C++14 std::thread average context switch duration
    // enforced with one PPLSem_t
    class Q6_t
    {
        // private data
        const uint        MaxThreads; // thread count
        const uint        MaxSecs;    // seconds of test
        const std::string m_TIDSeqPFN; // capture tid seq to ram (write to file later)
        //
        uint              m_thrdSwtchCount; // count incremented by all threads
        //
        bool              m_done; // main to threads: cease and desist
        uint              m_rdy;  // threads to main: thread is ready! (running)
        PPLSem_t          m_sem;  // one semaphore shared by all threads
        //
        UintVec_t         m_thrdRunCountVec; // counts incremented per thread
        TIDSeqVec_t       m_TIDSeq_Vec;      // sequence (order) of thread execution
        Thread_pVec_t     m_thread_pVec;     // vector of thread pointers

    public:
        Q6_t() // default ctor
            : MaxThreads(10)          // total threads
            , MaxSecs(10)             // controlled seconds of test
            , m_TIDSeqPFN("./Q6.txt") // where put data file
            //
            , m_thrdSwtchCount(0)
            //
            , m_done(false) // main() to threads: cease and desist
            , m_rdy(0)      // threads to main(): thread is ready!
            // m_sem             // default ctor ok
            //
            // m_thrdRunCountVec // default ctor ok
            // m_TIDSeq_Vec      // default ctor ok
            // m_thread_pVec     // default ctor ok
        {
            for (size_t i = 0; i < MaxThreads; ++i) {
                m_thrdRunCountVec.push_back(0); // 0 each per-thread counter
            }
            // your results -----vvvvvvvv----will vary
            m_TIDSeq_Vec.reserve(45000000); // observed as many as 42,000,000 on my old Dell
            m_thread_pVec.reserve(MaxThreads);
            // DO NOT start threads (m_thread_pVec) yet
        } // Q6_t()

        ~Q6_t()
        {
            // m_TIDSeq_Vec,
            while (m_thread_pVec.size()) {             // more to pop and delete
                std::thread* t = m_thread_pVec.back(); // return last element
                m_thread_pVec.pop_back();              // remove last element
                delete t;                              // delete thread
            }
            // m_thrdRunCountVec;
            // m_TIDSeqPFN, m_sem, m_rdy; m_done;
            // m_thrdSwtchCount; MaxSecs; MaxThreads;
        } // ~Q6_t()

        // Q6_t::main(..) runs in context thread 'main()', invoked in function main()
        int main(std::string label)
        {
            std::cout << dashLine << " " << MaxSecs << " second measure of "
                      << MaxThreads << " threads, 1 PPLSem_t " << label << "\n"
                      << " output: " << m_TIDSeqPFN << '\n' << std::endl;
            assert(0 == m_sem.lock()); // take possession of m_sem
            // now all threads will block at critical section entry (in onceThru())
            std::cout << "\n block threads at crit sect " << std::endl;
            createAndActivateThreads();
            long int durationUS = 0;
            releaseThreadsAndWait(durationUS); // run threads run
            std::cout << "\n" << std::endl
                      << report(" 'thread context switch' ",
                                m_thrdSwtchCount, durationUS);
            reportThreadActionCounts();
            writeTIDSeqToQ6_txt();
            reportMainStackSize();
            measure_LockUnlock(); // with no context switch, no collision
            return (0);
        } // int main() // in 'main' context

    private:
        void onceThru(uint id) // a crit section
        {
            assert(0 == m_sem.lock()); // critical section entry
            {
                m_thrdSwtchCount += 1;      // 'work'
                m_thrdRunCountVec[id] += 1; // diagnostic - thread work-balance
                m_TIDSeq_Vec.push_back(id); // thread sequence capture
            }
            assert(0 == m_sem.unlock()); // critical section exit
        }

        // thread entry point
        void threadRun(uint id)
        {
            std::cout << '.' << id << std::flush; // ".0.1.2.3.4.5.6.7.8.9"
            m_rdy |= (1 << id); // thread to main: i am ready
            do {
                onceThru(id);
                if (m_done) break; // exit when done  tbr - FIXME -- rare hang
            } while (true);
        }

        // main() context: create and activate std::thread's with new
        void createAndActivateThreads() // main() context
        {
            std::cout << " createAndActivateThreads() ";
            Time_t start_us = HRClk_t::now();
            for (uint id = 0; id < MaxThreads; ++id)
            {
                // std::thread activates when instance created
                std::thread* thrd = new
                    std::thread(&Q6_t::threadRun, this, id);
                // method-------^^^^^^^^^^^^^^^        ^^--single param for method
                // instance*---------------------^^^^
                assert(nullptr != thrd);
                // create handshake mask for unique 'id' bit of m_rdy
                uint mask = (1 << id);
                // wait for bit set in m_rdy by thread
                while (!(mask & m_rdy)) {
                    std::this_thread::sleep_for(100ms); // not a poll
                }
                // thread has confirmed to main() that it is running
                // capture pointer to invoke join's
                m_thread_pVec.push_back(thrd);
            }
            auto duration_us =
                std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);
            std::cout << " (" << digiComma(std::to_string(duration_us.count()))
                      << " us)" << std::endl;
            sleepToWallClockStartOfSec(); // start-of-second
        } // void createAndActivateThreads()

        // main() context: measure average context switch duration
        // by releasing threads to run
        void releaseThreadsAndWait(long int& count_us)
        {
            Time_t testStart_us = HRClk_t::now();
            // thread 'main()' is current owner of this semaphore - see "Q6_t::main()"
            assert(0 == m_sem.unlock()); // release the hounds
            std::cout << " releaseThreadsAndWait " << std::flush;
            // progress indicator to user
            for (size_t i = 0; i < MaxSecs; ++i) // let threads switch for 10 seconds
            {
                sleepToWallClockStartOfSec(); // 'main()' sync's to wall clock
                std::cout << (MaxSecs - i - 1) << ' ' << std::flush; // "9 8 7 6 5 4 3 2 1 0"
            }
            // tbr - dedicated mutex for this single-write / multiple read ? or std::atomic ?
            m_done = true; // command threads to exit - all threads can see m_done
            auto testDuration_us =
                std::chrono::duration_cast<US_t>(HRClk_t::now() - testStart_us);
            count_us = testDuration_us.count();
            // tbr - main() shall confirm all threads complete
            // tbr - measure how long to detect m_done
            Time_t joinStart_us = HRClk_t::now();
            std::cout << "\n join threads ";
            for (size_t i = 0; i < MaxThreads; ++i)
            {
                m_thread_pVec[i]->join(); // main() waits here for thread[i] completion
                std::cout << ". " << std::flush;
            }
            auto joinDuration_us =
                std::chrono::duration_cast<US_t>(HRClk_t::now() - joinStart_us);
            std::cout << " (" << digiComma(std::to_string(joinDuration_us.count()))
                      << " us)" << std::endl;
        } // void releaseThreadsAndWait(long int& count_us)

        void reportThreadActionCounts()
        {
            std::cout << "\n each thread run count: \n ";
            uint sum = 0;
            for (auto it : m_thrdRunCountVec)
            {
                std::cout << std::setw(11) << digiComma(std::to_string(it));
                sum += it;
            }
            std::cout << std::endl;
            uint diff = (sum - m_thrdSwtchCount);
            std::cout << ' ';
            double maxPC = 0.0;
            double minPC = 100.0;
            for (auto it : m_thrdRunCountVec)
            {
                double percent = static_cast<double>(it) / static_cast<double>(sum);
                if (percent > maxPC) maxPC = percent;
                if (percent < minPC) minPC = percent;
                std::cout << std::setw(11) << (percent * 100);
            }
            std::cout << " (% of total)\n\n total : " << digiComma(std::to_string(sum));
            if (diff) std::cout << " (diff: " << diff << ")";
            std::cout << " note variability -- min : " << (minPC * 100)
                      << "% max : " << (maxPC * 100) << "%" << std::endl;
        } // void reportThreadActionCounts()

        void writeTIDSeqToQ6_txt() // m_TIDSeq_Vec - record sequence of thread access to critsect
        {
            size_t sz = m_TIDSeq_Vec.size();
            std::cout << '\n' << dashLine << " writing Thread ID sequence of "
                      << digiComma(std::to_string(sz)) << " values to "
                      << m_TIDSeqPFN << std::endl;
            Time_t writeStart_us = HRClk_t::now();
            do {
                std::ofstream Q6cout(m_TIDSeqPFN);
                if (!Q6cout.good())
                {
                    std::cerr << "not able to open for write: " << m_TIDSeqPFN << std::endl;
                    break;
                }
                size_t lnSz = 0;
                for (auto it : m_TIDSeq_Vec)
                {
                    // encode Thread ID uints: 0 1 2 3 4 5 6 7 8 9
                    // to letters 'A' thru 'J': vvvvvv 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J'
                    Q6cout << static_cast<char>(it + 'A');
                    // whitespace not needed
                    if (++lnSz > 100) { Q6cout << std::endl; lnSz = 0; } // 100 chars per line
                }
                Q6cout << '\n' << std::endl;
                Q6cout.close();
            } while (0);
            auto wDuration_us = std::chrono::duration_cast<US_t>
                (HRClk_t::now() - writeStart_us);
            std::cout << " complete: "
                      << digiComma(std::to_string(wDuration_us.count()))
                      << " us" << std::endl;
        } // writeTIDSeqToQ6_txt

        std::string report(std::string lbl, uint64_t eventCount, uint64_t duration_us)
        {
            std::stringstream ss;
            ss << " " << to_engineering_string(static_cast<double>(eventCount), 9, eng_prefixed)
               << lbl << " events in " << digiComma(std::to_string(duration_us)) << " us" << std::endl;
            double eventsPerSec = (1000000.0 * (static_cast<double>(eventCount)) /
                                   static_cast<double>(duration_us));
            ss << " " << to_engineering_string(eventsPerSec, 9, eng_prefixed)
               << lbl << " events per second\n "
               << to_engineering_string((1.0 / eventsPerSec), 9, eng_prefixed)
               << " sec per " << lbl << " event " << std::endl;
            return (ss.str());
        } // std::string report(std::string lbl, uint64_t eventCount, uint64_t duration_us)

        // Note 6 - stack size -> use POSIX 'pthread_attr_...' API
        void reportMainStackSize()
        {
            pthread_attr_t tattr;
            int stat = pthread_attr_init(&tattr);
            assert(0 == stat);
            size_t size;
            stat = pthread_attr_getstacksize(&tattr, &size);
            assert(0 == stat);
            std::cout << '\n' << dashLine << " Stack Size: "
                      << digiComma(std::to_string(size))
                      << " [of 'main()' by pthread_attr_getstacksize]\n"
                      << std::endl;
            stat = pthread_attr_destroy(&tattr);
            assert(0 == stat);
        } // void reportMainStackSize()

        // Note 7 - semaphore API performance
        // measure duration when no context switch (i.e. no thread 'collision')
        void measure_LockUnlock()
        {
            //PPLSem_t* sem1 = new PPLSem_t;
            //assert(nullptr != sem1);
            PPLSem_t sem1;
            size_t count1 = 0;
            size_t count2 = 0;
            std::cout << dashLine << " 3 second measure of lock()/unlock()"
                      << " (no collision) " << std::endl;
            time_t t0 = time(0) + 3;
            Time_t start_us = HRClk_t::now();
            do {
                assert(0 == sem1.lock());   count1 += 1;
                assert(0 == sem1.unlock()); count2 += 1;
                if (time(0) > t0) break;
            } while (1);
            auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);
            assert(count1 == count2);
            std::cout << report(" 'sem lock()+unlock()' ", count1, duration_us.count());
            std::cout << "\n";
        } // void measure_LockUnlock()
    }; // class Q6_t
} // namespace DTB

int main(int argc, char* argv[])
{
    std::cout << "\nargc: " << argc << '\n' << std::endl;
    for (int i = 0; i < argc; i += 1) std::cout << argv[i] << " ";
    std::cout << "\n" << std::endl;
    setlocale(LC_ALL, "");
    std::ios::sync_with_stdio(false);
    {
        std::time_t t0 = std::time(nullptr);
        std::cout << " " << std::asctime(std::localtime(&t0)) << std::endl;
        DTB::sleepToWallClockStartOfSec(t0);
    }
    DTB::Time_t main_start_us = DTB::HRClk_t::now();
    int retVal = 0;
    {
        DTB::Q6_t q6;
        retVal = q6.main(" Q6::main() ");
    }
    auto duration_us = std::chrono::duration_cast<DTB::US_t>
        (DTB::HRClk_t::now() - main_start_us);
    std::cout << " FINI "
              << DTB::digiComma(std::to_string(duration_us.count()))
              << " us" << std::endl;
    return (retVal);
}
Typical Output on my Old Dell.
Fri Jun 30 15:30:13 2017
--------------------------------------------------------------
10 second measure of 10 threads, 1 PPLSem_t Q6::main()
output: ./Q6.txt
block threads at crit sect
createAndActivateThreads() .0.1.2.3.4.5.6.7.8.9 (1,002,120 us)
releaseThreadsAndWait 9 8 7 6 5 4 3 2 1 0
join threads . . . . . . . . . . (2,971 us)
31.07730700 M 'thread context switch' events in 10,021,447 us
3.101079814 M 'thread context switch' events per second
322.4683207 n sec per 'thread context switch' event
each thread run count:
3,182,496 3,252,929 3,245,473 3,150,344 3,411,918 2,936,982 2,978,690 3,029,319 3,004,926 2,884,230
10.2406 10.4672 10.4432 10.1371 10.9788 9.45057 9.58478 9.74769 9.6692 9.28082 (% of total)
total : 31,077,307 note variability -- min : 9.28082% max : 10.9788%
--------------------------------------------------------------
writing Thread ID sequence of 31,077,307 values to ./Q6.txt
complete: 3,025,289 us
--------------------------------------------------------------
Stack Size: 8,720,384 [of 'main()' by pthread_attr_getstacksize]
--------------------------------------------------------------
3 second measure of lock()/unlock() (no collision)
173.2359360 M 'sem lock()+unlock()' events in 3,902,491 us
44.39111737 M 'sem lock()+unlock()' events per second
22.52702926 n sec per 'sem lock()+unlock()' event
FINI 18,957,304 us
Sample of Q6.txt lines are 100 chars long.
AABABABABAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
last few lines
BJBJBJBJBJBJBJBJBBHHHHHHHHHHHHHHHHHBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAA
AAAAAAAAAAAAAAAAAAAAAABABABABAABABBAAAAAAAAAAAAAAAAAAAAAAAAAAAAABABABBGGGGGGGGGGGGGGGBGBGBGBGBGBGBGBG
BGBGBGBGBGBGBGBGBGBBGBGBGBGBGBGBGBBHHHHHHHHHHHHHHHHBHBHBHBHBHBHBHBHBHBHBHBHBHBHBHBBHBHBHBHBBJJJJJJJJJ
JJJJJJJJJBBJBBBJBJBJBJBJBJBBJBJBJBJBJBJBJBJBJBBEEEEEEEEEEEEEEEEEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBBEBEBEBE
BEBEBEBEBEBEBEBEBEBEBEBEBEBEBBEBEBEBBBBEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEBEBBEBEBEBEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEBEBEBEBEEBEBEBEBEBBIIIIIIIIIIIIIIIBBIIIBIBBFFFFFFFFFFFFFFFBBFFBBFBFBFBFBFFBBGGGGGGGGGGGGGGGGG
BBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGBGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGBCIHFJDAE
All of the C functions you referenced are simply accessed by direct include.

Thanks, I know that fact, but the task is in avoiding C and using C++ where it's possible.

My C++ code has no C code in it; it simply invokes the extern "C" functions that Linux provides. There is no separate set of C++ Linux function calls. The Linux API (to the OS services) is defined by the C library and header files. I know of no way to avoid or work around the Linux API, so perhaps I do not know what you are suggesting / asking.

I measure durations by repeating some action for 1 to 10 seconds, and counting how many times the loop completes.

Could you explain this?
Consider the snippet:
{
    uint64_t microsecStart = getSystemMicroSecond();
    // convert linear time to broken-down/calendar time
    local_tm = *localtime_r(&linear_time, &local_tm);
    uint64_t microsecDuration = getSystemMicroSecond() - microsecStart;
}
This operation is typically too fast to measure in this simple manner; it is essentially a delta of a few microseconds, and the conversion would be over before the microsecond counter changed.
To measure something this quick, we spin around the action of interest, count the loops, and kick out after, say, 3 seconds.
uint64_t microsecStart = getSystemMicroSecond();
uint32_t loopCount = 0;
time_t t0 = time(0) + 3; // loop for about 3 seconds
do
{
    // convert linear time to broken-down/calendar time
    local_tm = *localtime_r(&linear_time, &local_tm);
    time_t t1 = time(0);
    if (t1 > t0) break; // kick out once the 3 seconds are up
    loopCount += 1;
} while (1);
uint64_t microsecDuration = getSystemMicroSecond() - microsecStart;
In this loop, the time(0) function is surprisingly quick: time(0) takes ~75 nanoseconds (on my Dell desktop), so it does not significantly extend the measurement, yet it is quick enough to make an accurate measurement of the localtime_r duration (localtime_r takes ~335 nanoseconds).
When these spins complete, the test has produced a loopCount, and the duration measurement outside the loop provides a more consistent measurement of the elapsed time; from those two numbers we can then compute an 'average' duration for each event.
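The same loop-count technique can be written with only standard <chrono> and <ctime> facilities. A sketch of mine that mirrors the snippet above (getSystemMicroSecond() replaced by a high-resolution clock, localtime_r as the measured action):

#include <chrono>
#include <cstdint>
#include <ctime>
#include <iostream>

int main() {
    std::tm local_tm {};
    std::time_t linear_time = std::time(nullptr);
    std::uint64_t loopCount = 0;
    std::time_t t0 = std::time(nullptr) + 3; // spin for roughly 3 seconds
    auto start = std::chrono::high_resolution_clock::now();
    do {
        localtime_r(&linear_time, &local_tm); // the action being measured
        loopCount += 1;
    } while (std::time(nullptr) <= t0);
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::high_resolution_clock::now() - start).count();
    std::cout << loopCount << " calls in " << us << " us, i.e. "
              << (1000.0 * us) / loopCount << " ns per localtime_r call\n";
}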
Are you ignoring the time of process running?
Yes. Because I know that a context switch is 2 orders of magnitude slower than a function call, it is not difficult to minimize the thread activity so that it has no or minimal influence on the measurement.
Is it minor in contrast with the time of switching?
In this test, the threads increment a number, test a flag, and act as good neighbors (i.e. these threads surrender the processor as soon as possible). These minor actions are insignificant compared to the cost of a context switch.
The numbers for my 6 yr old Dell show the difference in orders of magnitude:

- simple function call, i.e. time(0): < 75 e-9 seconds
- thread context switch (enforced with semaphore): < 15 e-6 seconds
Other activities can influence the results, but I think in an
insignificant way. My result of 14 us per "thread switch and
semaphore send" is longer than the best possible result, but not
enough longer to influence my design decisions. It is possible to
improve this measurement, but I can't afford the hardware.
Linux provides some thread or task priority ideas, but I have not
explored them. When I'm serious about finding a 'better' measurement,
I guess I would disconnect the ethernet, close any busy work... but
I'm not running compiles nor copying files nor running a backup nor
any obvious cpu cycle consumer when I'm measuring. The machine is
substantially idle. Just clock ticks, timers expiring, memory
refresh, and a few other things that must continue.
For fun or interest, you might pull up the System Monitor utility,
click on the %CPU tag 1 or 2 times, and bring the busiest task to the
top ... you should find that the busiest task is, ta-da: the system
monitor at maybe 3 % of one of the 2 cpu's. All other tasks are
essentially waiting for something, and trigger 0% of load.
Finally, you might think of it this way: are you writing a program to run on an atypical machine, or is your target similar to your development machine? Do you plan to shut off interrupts, I/O channels, Ethernet, or to control priority? Or is your target going to be a useful, normal system?
IMHO, the running tasks in my useful (linux) system, when the system is
not doing anything but waiting for my next keystroke, are generally
doing nothing for most of a 10 second test.
I think the most important takeaway from these efforts is that:
function calls are more than 100x faster than context switches.

Forcing race between threads using C++11 threads

Just got started with the C++11 threading library (and multithreading in general), and wrote this small snippet of code.
#include <iostream>
#include <thread>

int x = 5; // variable to be affected by the race

// This function will be called from a thread
void call_from_thread1() {
    for (int i = 0; i < 5; i++) {
        x++;
        std::cout << "In Thread 1 :" << x << std::endl;
    }
}

int main() {
    // Launch a thread
    std::thread t1(call_from_thread1);
    for (int j = 0; j < 5; j++) {
        x--;
        std::cout << "In Thread 0 :" << x << std::endl;
    }
    // Join the thread with the main thread
    t1.join();
    std::cout << x << std::endl;
    return 0;
}
I was expecting to get different results every time (or nearly every time) I ran this program, due to the race between the two threads. However, the output is always 0, i.e. the two threads run as if they ran sequentially. Why am I getting the same results, and is there any way to simulate or force a race between the two threads?
Your sample size is rather small, and somewhat self-stalls on the continuous stdout flushes. In short, you need a bigger hammer.
If you want to see a real race condition in action, consider the following. I purposely added an atomic and non-atomic counter, sending both to the threads of the sample. Some test-run results are posted after the code:
#include <algorithm>  // std::generate_n, std::for_each - added, was missing
#include <atomic>
#include <functional> // std::ref - added, was missing
#include <iostream>
#include <iterator>   // std::back_inserter - added, was missing
#include <thread>
#include <vector>

void racer(std::atomic_int& cnt, int& val)
{
    for (int i = 0; i < 1000000; ++i)
    {
        ++val;
        ++cnt;
    }
}

int main(int argc, char *argv[])
{
    unsigned int N = std::thread::hardware_concurrency();
    std::atomic_int cnt = ATOMIC_VAR_INIT(0);
    int val = 0;
    std::vector<std::thread> thrds;
    std::generate_n(std::back_inserter(thrds), N,
        [&cnt, &val]() { return std::thread(racer, std::ref(cnt), std::ref(val)); });
    std::for_each(thrds.begin(), thrds.end(),
        [](std::thread& thrd) { thrd.join(); });
    std::cout << "cnt = " << cnt << std::endl;
    std::cout << "val = " << val << std::endl;
    return 0;
}
Some sample runs from the above code:
cnt = 4000000
val = 1871016
cnt = 4000000
val = 1914659
cnt = 4000000
val = 2197354
Note that the atomic counter is accurate (I'm running on a dual-core i7 MacBook Air with hyperthreading, so 4 threads, thus 4 million). The same cannot be said for the non-atomic counter.
There will be significant startup overhead to get the second thread going, so its execution will almost always begin after the first thread has already finished its for loop, which by comparison takes almost no time at all. To see a race condition you will need to run a computation that takes much longer, or that includes I/O or other operations that take significant time, so that the executions of the two computations actually overlap.
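For example (a sketch of mine, not from the original answer), lengthening the loops in the question's program and removing the per-iteration printing makes the overlap, and therefore the lost updates, easy to observe:

#include <iostream>
#include <thread>

int x = 0;

void adder() {
    for (int i = 0; i < 1000000; i++) x++; // unsynchronized increments
}

int main() {
    std::thread t1(adder);
    for (int j = 0; j < 1000000; j++) x--; // races with t1's increments
    t1.join();
    std::cout << x << std::endl; // rarely 0: updates are lost in the data race
    return 0;
}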