How to benchmark my system using multiple threads in c++? - c++

I have simplified my code, and it compiles, but it doesn't do anything. It doesn't report an error either. In this example I am trying to get 7 threads (on my 8-core processor) to write to a variable in order to benchmark my system, and I would like to use multiple threads to see whether that is faster. It's based on other code that worked before I added multithreading. When I run it, it just terminates. It should show progress each second: the total number of iterations all the threads have done together. Some of the includes are there from other code I am working on.
I would also like to gracefully terminate all 7 threads when Ctrl-C is pressed. Help would be appreciated. Thanks!
//Compiled using: g++ ./test.cpp -lpthread -o ./test
#include <stdio.h>
#include <string>
#include <iostream>
#include <time.h>
#include <ctime>
#include <ratio>
#include <chrono>
#include <iomanip>
#include <locale.h>
#include <cstdlib>
#include <pthread.h>
using namespace std;
using namespace std::chrono;
const int NUM_THREADS = 7;
const std::string VALUE_TO_WRITE = "TEST";
unsigned long long int total_iterations = 0;
void * RunBenchmark(void * threadid);
class comma_numpunct: public std::numpunct < char > {
protected: virtual char do_thousands_sep() const {
return ',';
}
virtual std::string do_grouping() const {
return "\03";
}
};
void * RunBenchmark(void * threadid) {
unsigned long long int iterations = 0;
std::string benchmark;
int seconds = 0;
std::locale comma_locale(std::locale(), new comma_numpunct());
std::cout.imbue(comma_locale);
auto start = std::chrono::system_clock::now();
auto end = std::chrono::system_clock::now();
do {
start = std::chrono::system_clock::now();
while ((std::chrono::duration_cast < std::chrono::seconds > (end - start).count() != 1)) {
benchmark = VALUE_TO_WRITE;
iterations += 1;
}
total_iterations += iterations;
iterations = 0;
cout << "Total Iterations: " << std::setprecision(0) << std::fixed << total_iterations << "\r";
} while (1);
}
int main(int argc, char ** argv) {
unsigned long long int iterations = 0;
int tc, tn;
pthread_t threads[NUM_THREADS];
for (tn = 0; tn < NUM_THREADS; tn++) {
tc = pthread_create( & threads[tn], NULL, & RunBenchmark, NULL);
}
return 0;
}
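Reading the code above, three things look likely to explain the behavior: main() returns right after creating the threads, which ends the whole process; `end` is never refreshed inside the inner loop, so the one-second condition can never become true; and `total_iterations` is updated by several threads without synchronization. The following is a minimal sketch of one way to address these, not a definitive fix. It swaps pthreads for std::thread, and the names `run_benchmark`, `worker`, `stop_requested`, `on_sigint` and the `max_duration` safety limit are assumptions made for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <csignal>
#include <cstdint>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Set by the SIGINT handler. std::atomic<bool> is assumed to be lock-free here,
// which is what makes it usable from a signal handler.
std::atomic<bool> stop_requested{false};
// Shared counter; atomic so concurrent updates from the workers are not a data race.
std::atomic<std::uint64_t> total_iterations{0};

extern "C" void on_sigint(int) { stop_requested.store(true); }

// Worker: repeatedly assigns a string and publishes its count in batches of 1000.
void worker() {
    const std::string value_to_write = "TEST";
    std::string benchmark;
    while (!stop_requested.load(std::memory_order_relaxed)) {
        for (int i = 0; i < 1000; ++i) benchmark = value_to_write;
        total_iterations.fetch_add(1000, std::memory_order_relaxed);
    }
}

// Runs num_threads workers until Ctrl-C is pressed or max_duration has elapsed
// (max_duration is a hypothetical safety limit, not part of the question).
std::uint64_t run_benchmark(unsigned num_threads, std::chrono::milliseconds max_duration) {
    assert(num_threads > 0);
    stop_requested = false;
    total_iterations = 0;
    std::signal(SIGINT, on_sigint);

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < num_threads; ++i) threads.emplace_back(worker);

    using clock = std::chrono::steady_clock;
    const clock::time_point deadline = clock::now() + max_duration;
    while (!stop_requested.load() && clock::now() < deadline) {
        // Wake up once per second, or at the deadline if that comes first.
        const clock::time_point next = clock::now() + std::chrono::seconds(1);
        std::this_thread::sleep_until(std::min(deadline, next));
        std::cout << "Total Iterations: " << total_iterations.load() << '\r' << std::flush;
    }

    stop_requested = true;               // ask the workers to leave their loops
    for (auto& t : threads) t.join();    // wait for them before returning
    std::cout << '\n';
    return total_iterations.load();
}
```

Calling something like `run_benchmark(7, std::chrono::hours(1))` from main() would keep the process alive while the workers run. In this sketch Ctrl-C is only noticed at the next one-second wake-up, so shutdown can lag by up to a second.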

Related

Chrono library multithreading time derivation limitations?

I am trying to solve a problem with time derivation in a multithreaded setup. I have 3 threads, all pinned to different cores. The first two threads (reader_threads.cc) run an infinite while loop inside the run() function. On each iteration they finish their work and send the time window they are currently in to the third thread.
The current time window is calculated as the chrono time divided by Ti.
The third thread runs at its own pace and only checks a request when its flag has been raised; the flag is also sent to the third thread via Message.
I was able to get the desired behavior, with all three threads in the same epoch, if one epoch is at least 20000us. See the results below for more information.
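To make the window arithmetic concrete, here is a small sketch of the epoch calculation described above. The helper names `epoch_index` and `time_to_next_epoch` are hypothetical and do not appear in the code that follows. It also illustrates one possible source of mismatches: two clock readings taken a couple of microseconds apart can fall on opposite sides of a window boundary.

```cpp
#include <cassert>
#include <chrono>

using us = std::chrono::microseconds;

// Epoch index: how many whole windows of length ti fit into t.
// Dividing one duration by another yields a plain integer count.
constexpr long long epoch_index(us t, us ti) {
    return t / ti;
}

// Time remaining until the next window boundary.
constexpr us time_to_next_epoch(us t, us ti) {
    return ti - (t % ti);
}

// Two readings 2us apart land in different epochs when they straddle a boundary.
static_assert(epoch_index(us(14999), us(15000)) == 0, "just before the boundary");
static_assert(epoch_index(us(15001), us(15000)) == 1, "just after the boundary");
```

With Ti = 15000us, a reading of 31000us is in epoch 2 and has 14000us left until epoch 3 begins.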
Reader threads
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <chrono>
#include <atomic>
#include <mutex>
#include "control_thread.h"
#define INTERNAL_THREAD
#if defined INTERNAL_THREAD
#include <thread>
#include <pthread.h>
#else
#endif
using namespace std;
atomic<bool> thread_active[2];
atomic<bool> go;
pthread_barrier_t barrier;
template <typename T>
void send(Message volatile * m, unsigned int epoch, bool flag) {
for (int i = 0 ; i < sizeof(T); i++){
m->epoch = epoch;
m->flag = flag;
}
}
ControlThread * ct;
// Main run for threads
void run(unsigned int threadID){
// Put message into incoming buffer
Message volatile * m1 = &(ct->incoming_requests[threadID - 1]);
thread_active[threadID] = true;
std::atomic<bool> flag;
// this thread is done initializing stuff
thread_active[threadID] = true;
while (!go);
while(true){
using namespace std::chrono;
// Get current time with precision of microseconds
auto now = time_point_cast<microseconds>(steady_clock::now());
// sys_microseconds is type time_point<steady_clock, microseconds>
using sys_microseconds = decltype(now);
// Convert time_point to a duration since the clock's epoch
auto duration = now.time_since_epoch();
// Convert the duration back to a time_point
sys_microseconds dt{microseconds{duration}};
// test
if (dt != now){
std::cout << "Failure." << std::endl;
}else{
// std::cout << "Success." << std::endl;
}
auto epoch = duration / Ti;
pthread_barrier_wait(&barrier);
flag = true;
// send current time to the control thread
send<int>(m1, epoch, flag);
auto current_position = duration % Ti;
std::chrono::duration<double, micro> multi_thread_sleep = chrono::microseconds(Ti) - chrono::microseconds(current_position);
if(multi_thread_sleep > chrono::microseconds::zero()){
this_thread::sleep_for(multi_thread_sleep);
}
}
}
int threads_num = 3;
void server() {
// Don't start control thread until reader threads finish init
for (int i=1; i < threads_num; i++){
while (!thread_active[i]);
}
go = true;
while (go) {
for (int i = 0; i < threads_num; i++) {
ct->current_requests(i);
}
// Arbitrary sleep to ensure that locking is accurate
std::this_thread::sleep_for(50us);
}
}
class Thread {
public:
#if defined INTERNAL_THREAD
thread execution_handle;
#endif
unsigned int id;
Thread(unsigned int i) : id(i) {}
};
void init(){
ct = new ControlThread();
}
int main (int argc, char * argv[]){
Thread * r[4];
pthread_barrier_init(&barrier, NULL, 2);
init();
/* start threads
*================*/
for (unsigned int i = 0; i < threads_num; i++) {
r[i] = new Thread(i);
#if defined INTERNAL_THREAD
if(i==0){
r[0]->execution_handle = std::thread([] {server();});
}else if(i == 1){
r[i]->execution_handle = std::thread([i] {run(i);});
}else if(i == 2){
r[i]->execution_handle = std::thread([i] {run(i);});
}
/* pin to core i */
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(i, &cpuset);
int rc = pthread_setaffinity_np(r[i]->execution_handle.native_handle(), sizeof(cpuset), &cpuset);
#endif
}
// wait for threads to end
for (unsigned int i = 0; i < threads_num; i++) { // only r[0..threads_num-1] were created
#if defined INTERNAL_THREAD
r[i]->execution_handle.join();
#endif
}
pthread_barrier_destroy(&barrier);
return 0;
}
Control Thread
#ifndef __CONTROL_THEAD_H__
#define __CONTROL_THEAD_H__
// Global vars
const auto Ti = std::chrono::microseconds(15000);
std::mutex m;
int count;
class Message{
public:
std::atomic<bool> flag;
unsigned long epoch;
};
class ControlThread {
public:
/* rw individual threads */
Message volatile incoming_requests[4];
void current_requests(unsigned long current_thread) {
using namespace std::chrono;
auto now = time_point_cast<microseconds>(steady_clock::now());
// sys_microseconds is type time_point<steady_clock, microseconds>
using sys_microseconds = decltype(now);
// Convert time_point to a duration since the clock's epoch
auto time = now.time_since_epoch();
// Convert the duration back to a time_point
sys_microseconds dt{microseconds{time}};
// test
if (dt != now){
std::cout << "Failure." << std::endl;
}else{
// std::cout << "Success." << std::endl;
}
long contol_thread_epoch = time / Ti;
// Only check request when flag is raised
if(incoming_requests[current_thread].flag){
m.lock();
incoming_requests[current_thread].flag = false;
m.unlock();
// If reader thread epoch and control thread matches
if(incoming_requests[current_thread].epoch == contol_thread_epoch){
// printf("Successful desired behaviour\n");
}else{
count++;
if(count > 0){
printf("Missed %d\n", count);
}
}
}
}
};
#endif
RUN
g++ -std=c++2a -pthread -lrt -lm -lcrypt reader_threads.cc -o run
sudo ./run
Results
The missed epochs below are for one loop iteration (a single Ti) equal to 1000us. As Ti increases, fewer epochs are skipped, and with Ti set to 20000us no skipped epochs are detected. Does anyone have an idea whether I am making a mistake in the casting or in the communication between threads? Why are the threads not in sync if the epoch is, e.g., 5000us?
Missed 1
Missed 2
Missed 3
Missed 4
Missed 5
Missed 6
Missed 7
Missed 8
Missed 9
Missed 10
Missed 11
Missed 12
Missed 13
Missed 14
Missed 15
Missed 16

Multithreading requests with queue cpp

I'm trying to make a multithreaded proxy checker in C++. When I start the threads and take the lock, all threads wait until the request is finished. I tried removing the locks, but that doesn't help either. I'm using the cpr library to make the requests; the documentation can be found here: https://whoshuu.github.io/cpr/advanced-usage.html.
Reproducible example:
#include <stdio.h>
#include <pthread.h>
#include <iostream>
#include <queue>
#include <mutex>
#include <cpr/cpr.h>
#include <fmt/format.h>
#define NUMT 10
using namespace std;
using namespace fmt;
std::mutex mut;
std::queue<std::string> q;
void* Checker(void* arg) {
while (!q.empty()) {
mut.lock();
//get a webhook at https://webhook.site
string protocol = "socks4";
string proxyformatted = format("{0}://{1}", protocol, q.front());
auto r = cpr::Get(cpr::Url{ "<webhook url>" },
cpr::Proxies{ {"http", proxyformatted}, {"https", proxyformatted} });
q.pop();
mut.unlock();
}
return NULL;
}
int main(int argc, char** argv) {
q.push("138.201.134.206:5678");
q.push("185.113.7.87:5678");
q.push("5.9.16.126:5678");
q.push("88.146.196.181:4153");
pthread_t tid[NUMT]; int i;
int thread_args[NUMT];
for (i = 0; i < NUMT; i++) {
thread_args[i] = i;
pthread_create(&tid[i], NULL, Checker, (void*) &thread_args[i]);
}
for (i = 0; i < NUMT; i++) {
pthread_join(tid[i], NULL);
fprintf(stderr, "Thread %d terminated\n", i);
}
return 0;
}
Thanks in advance.
I suggest implementing a wrapper class for your queue that hides the mutex.
That class can provide push(std::string s) and bool pop(std::string& s), which returns true and populates s if the queue wasn't empty, or false otherwise. Then your worker threads can simply loop:
std::string s;
while(q.pop(s)) {
...
}
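A minimal sketch of such a wrapper might look like the following. The class name `SafeQueue` is an assumption made for illustration; only the `push` and `pop` signatures come from the suggestion above. The lock is held only while the queue itself is touched, so a slow request made after `pop` returns would not block the other workers.

```cpp
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical thread-safe wrapper around std::queue<std::string>.
class SafeQueue {
public:
    // Adds an item; the mutex is held only for the duration of the push.
    void push(std::string s) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(s));
    }

    // Removes the front item into s and returns true, or returns false
    // if the queue was empty. The emptiness check and the removal happen
    // under the same lock, so two threads can never pop the same item.
    bool pop(std::string& s) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return false;
        s = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<std::string> queue_;
};
```

With a global `SafeQueue q`, each worker would run the loop shown above and perform its request inside the loop body, outside any lock.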

Linux: Is writeback related to number of files?

I am running some experiments on the page cache and writeback mechanism on Linux 5.4.81. I have set dirty_ratio to 50, dirty_background_ratio to 45, dirty_expire_centisecs to 500000000, and dirty_writeback_centisecs to 500000000. When I write one big 250GB file using write(), writeback begins when dirty pages reach 45% of total memory, which is in line with expectations.
But when I write 125 smaller files of 2GB each, writeback begins when dirty pages reach only about 25%. Why does this happen?
The total memory of my platform is 256G.
#include <stdio.h>
#include <string>
#include <sys/time.h>
#include <sstream>
#include <fstream>
#include <iostream>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/stat.h>
using namespace std;
int main()
{
struct timeval start, end;
long long total_time = 0;
ofstream outfile("write_time.txt",ios::trunc);
int loop=1;
while(loop--){
for(int i=0; i<125; i++){
stringstream ss;
string num;
ss<<i;
ss>>num;
string a = "test";
string b = ".dat";
string fileName = a+num+b;
const char*tmp=fileName.c_str();
int fp = open(tmp,O_CREAT|O_RDWR,S_IRUSR|S_IWUSR);
// int fp = open(tmp,O_WRONLY|O_TRUNC);
int pos = 0;
char data[1024] = "ab\n";
gettimeofday( &start, NULL );
while (1)
{
write(fp, data, 1024);
pos++;
if (pos >= 2*1024*1024)
break;
}
gettimeofday( &end, NULL );
int timeuse = 1000000 * ( end.tv_sec - start.tv_sec ) + end.tv_usec - start.tv_usec;
printf("write_time: %d us\n", timeuse);
total_time +=timeuse;
outfile<<timeuse<<endl;
}
}
outfile<<total_time;
outfile.close();
return 0;
}
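To observe when writeback starts while a test like the one above runs, one option is to sample the Dirty and Writeback counters that Linux exposes in /proc/meminfo. The sketch below is only an illustration of that idea; the function names `meminfo_kb` and `read_meminfo_kb` are hypothetical, and it assumes the usual "Name:   value kB" line layout.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns the value (in kB) of a /proc/meminfo-style field such as "Dirty",
// or -1 if the field is not present in the stream.
long long meminfo_kb(std::istream& in, const std::string& field) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        std::string key;
        long long value = 0;
        // Each line is expected to look like "Dirty:      1024 kB".
        if (ls >> key >> value && key == field + ":") return value;
    }
    return -1;
}

// Convenience wrapper that reads the live counters (Linux only).
long long read_meminfo_kb(const std::string& field) {
    std::ifstream in("/proc/meminfo");
    return meminfo_kb(in, field);
}
```

Polling `read_meminfo_kb("Dirty")` and `read_meminfo_kb("Writeback")` from a second process once per second would show how many dirty pages have accumulated at the moment writeback kicks in.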

Why OpenCV project's multi thread are slower than single one?

I have written some code that loads 1025 images with OpenCV in order to process them. The code comes in two versions: single-threaded and multi-threaded. The results confuse me, because the single-threaded version is faster than the multi-threaded one.
What do you think is wrong?
My code is below.
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>
#include "ctpl.h"
using namespace std;
using namespace cv;
#define threaded
void loadImage(int id, int param0) {
stringstream stream;
stream << "/home/me/Desktop/Pics/pic (" << param0 << ").jpg";
Mat x = imread(stream.str(), IMREAD_REDUCED_COLOR_8);
}
int main() {
#ifdef threaded
ctpl::thread_pool p(8);
for (int i = 1; i <= 1025; i++) {
p.push(loadImage,i);
}
// for (int i = 0; i < 1025; ++i) {
// pthread_join(threads[i], NULL);
// }
#else
for (int i = 1; i <= 1025; i++) {
stringstream stream;
stream << "/home/me/Desktop/Pics/pic (" << i << ").jpg";
Mat x = imread(stream.str(), IMREAD_REDUCED_COLOR_8);
}
#endif
return 0;
}

std::thread to std::async makes HUGE performance gain. How it can be possible?

I've written test code to compare std::thread and std::async.
#include <iostream>
#include <mutex>
#include <fstream>
#include <string>
#include <memory>
#include <thread>
#include <future>
#include <functional>
#include <boost/noncopyable.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/filesystem.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/asio.hpp>
namespace fs = boost::filesystem;
namespace pt = boost::posix_time;
namespace as = boost::asio;
class Log : private boost::noncopyable
{
public:
void LogPath(const fs::path& filePath) {
boost::system::error_code ec;
if(fs::exists(filePath, ec)) {
fs::remove(filePath);
}
this->ofStreamPtr_.reset(new fs::ofstream(filePath));
};
void WriteLog(std::size_t i) {
assert(*this->ofStreamPtr_);
std::lock_guard<std::mutex> lock(this->logMutex_);
*this->ofStreamPtr_ << "Hello, World! " << i << "\n";
};
private:
std::mutex logMutex_;
std::unique_ptr<fs::ofstream> ofStreamPtr_;
};
int main(int argc, char *argv[]) {
if(argc != 2) {
std::cout << "Wrong argument" << std::endl;
exit(1);
}
std::size_t iter_count = boost::lexical_cast<std::size_t>(argv[1]);
Log log;
log.LogPath("log.txt");
std::function<void(std::size_t)> func = std::bind(&Log::WriteLog, &log, std::placeholders::_1);
auto start_time = pt::microsec_clock::local_time();
////// Version 1: use std::thread //////
// {
// std::vector<std::shared_ptr<std::thread> > threadList;
// threadList.reserve(iter_count);
// for(std::size_t i = 0; i < iter_count; i++) {
// threadList.push_back(
// std::make_shared<std::thread>(func, i));
// }
//
// for(auto it: threadList) {
// it->join();
// }
// }
// pt::time_duration duration = pt::microsec_clock::local_time() - start_time;
// std::cout << "Version 1: " << duration << std::endl;
////// Version 2: use std::async //////
start_time = pt::microsec_clock::local_time();
{
for(std::size_t i = 0; i < iter_count; i++) {
auto result = std::async(func, i);
}
}
pt::time_duration duration = pt::microsec_clock::local_time() - start_time; // declared here because Version 1's declaration is commented out
std::cout << "Version 2: " << duration << std::endl;
////// Version 3: use boost::asio::io_service //////
// start_time = pt::microsec_clock::local_time();
// {
// as::io_service ioService;
// as::io_service::strand strand{ioService};
// {
// for(std::size_t i = 0; i < iter_count; i++) {
// strand.post(std::bind(func, i));
// }
// }
// ioService.run();
// }
// duration = pt::microsec_clock::local_time() - start_time;
// std::cout << "Version 3: " << duration << std::endl;
}
On a 4-core CentOS 7 box (gcc 4.8.5), Version 1 (using std::thread) is about 100x slower than the other implementations.
Iteration  Version1  Version2   Version3
100        0.0034s   0.000051s  0.000066s
1000       0.038s    0.00029s   0.00058s
10000      0.41s     0.0042s    0.0059s
100000     throw     0.026s     0.061s
Why is the threaded version so slow? I thought each thread wouldn't take long to complete the Log::WriteLog function.
The function may never be called. You are not passing an std::launch policy in Version 2, so you are relying on the default behavior of std::async (emphasis mine):
Behaves the same as async(std::launch::async | std::launch::deferred, f, args...). In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.
Try re-running your benchmark with this minor change:
auto result = std::async(std::launch::async, func, i);
Alternatively, you could call result.wait() on each std::future in a second loop, similar to how you call join() on all of the threads in Version 1. This forces evaluation of the std::future.
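A minimal sketch combining both suggestions might look like this: the tasks are launched with an explicit std::launch::async policy, the futures are kept in a container, and a second loop waits on each one. The function name `run_async_tasks` and the counting lambda are assumptions made for illustration; in the benchmark the task would be `func`.

```cpp
#include <atomic>
#include <cstddef>
#include <future>
#include <vector>

// Launches `count` tasks with an explicit std::launch::async policy, keeps the
// futures alive, then waits on each one (the analogue of join() in Version 1).
// Returns how many tasks actually ran.
std::size_t run_async_tasks(std::size_t count) {
    std::atomic<std::size_t> executed{0};
    std::vector<std::future<void>> futures;
    futures.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Stand-in for func(i): just record that the task executed.
        futures.push_back(std::async(std::launch::async, [&executed] { ++executed; }));
    }
    for (auto& f : futures) f.wait();  // forces every task to complete
    return executed.load();
}
```

Timing this version against Version 1 compares like with like, since in both cases every task is guaranteed to have run by the time the clock is stopped.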
Note that there is a major, unrelated, problem with this benchmark. func immediately acquires a lock for the full duration of the function call, which makes parallelism impossible. There is no advantage to using threads here - I suspect that it will be significantly slower (due to thread creation and locking overhead) than a serial implementation.