We have a qthreads-based workflow engine where worker threads pick up bundles of input as they are placed on a queue, then place their output on another queue for other worker threads to run the next stage; and so on until all the input has been consumed and all the output has been generated.
Typically, several threads will be running the same task and others will be running other tasks at the same time. We want to benchmark performance of these threaded tasks in order to target optimization efforts.
It's easy to get the real (elapsed) time that a given thread, running a given task, has taken. We just look at the difference between the return values of the POSIX times() function at the start and end of the thread's run() procedure. However, I cannot figure out how to get the corresponding user and system time. Getting these from the struct tms that you pass to times() doesn't work, because this structure gives total user and system times of all threads running while the thread in question is active.
Assuming this is on Linux, how about getrusage() with RUSAGE_THREAD? Solaris offers the similar RUSAGE_LWP, and there are probably equivalents on other POSIX-like systems.
Crude example:
#define _GNU_SOURCE
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <pthread.h>
#include <assert.h>
#include <time.h>   /* for time() */
#include <unistd.h>

struct tinfo {
    pthread_t thread;
    int id;
    struct rusage start;
    struct rusage end;
};

static void *
thread_start(void *arg)
{
    struct tinfo *inf = arg;
    getrusage(RUSAGE_THREAD, &inf->start);
    if (inf->id) {
        sleep(10); // Accumulate no CPU time
    }
    else {
        const time_t start = time(NULL);
        while (time(NULL) - start < 10); // Waste CPU time!
    }
    getrusage(RUSAGE_THREAD, &inf->end);
    return 0;
}

int main() {
    static const int nrthr = 2;
    struct tinfo status[nrthr];
    for (int i = 0; i < nrthr; ++i) {
        status[i].id = i;
        const int s = pthread_create(&status[i].thread,
                                     NULL, &thread_start,
                                     &status[i]);
        assert(!s);
    }
    for (int i = 0; i < nrthr; ++i) {
        const int s = pthread_join(status[i].thread, NULL);
        assert(!s);
        // Sub-second timing is available too
        printf("Thread %d done: %ld (s) user, %ld (s) system\n", status[i].id,
               status[i].end.ru_utime.tv_sec - status[i].start.ru_utime.tv_sec,
               status[i].end.ru_stime.tv_sec - status[i].start.ru_stime.tv_sec);
    }
}
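This should build on Linux with something like gcc -pthread thread_times.c (file name arbitrary). RUSAGE_THREAD is Linux-specific, which is why the example defines _GNU_SOURCE at the top.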
I think something similar is possible on Windows: GetThreadTimes() gives per-thread kernel and user times (GetProcessTimes() is the process-wide equivalent).
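A minimal sketch of the Windows variant using GetThreadTimes() (my own addition, untested here; FILETIME values are in 100-nanosecond units):

#include <windows.h>
#include <stdio.h>

/* Convert a FILETIME (100-ns ticks) to seconds. */
static double filetime_to_seconds(FILETIME ft)
{
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart / 1e7;
}

int main(void)
{
    /* Burn some user-mode CPU so there is something to measure. */
    volatile long long sink = 0;
    for (long long i = 0; i < 100000000LL; ++i)
        sink += i;

    FILETIME creation, exit_t, kernel, user;
    if (GetThreadTimes(GetCurrentThread(), &creation, &exit_t, &kernel, &user)) {
        printf("user: %.3f s, kernel: %.3f s\n",
               filetime_to_seconds(user), filetime_to_seconds(kernel));
    }
    return 0;
}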
The Problem: I have two threads in a Windows 10 application I'm working on, a UI thread (called the render thread in the code) and a worker thread in the background (called the simulate thread in the code). Every couple of seconds or so, the background thread has to perform a very expensive operation that involves allocating a large amount of memory. For some reason, when this operation happens, the UI thread lags for a split second and becomes unresponsive (this shows up in the application as the camera not moving for a moment while camera movement input is being given).
Maybe I'm misunderstanding something about how threads work on Windows, but I wasn't aware that this was something that should happen. I was under the impression that you use a separate UI thread for this very reason: to keep it responsive while other threads do more time intensive operations.
Things I've tried: I've removed all communication between the two threads, so there are no mutexes or anything of that sort (unless there's something implicit that Windows does that I'm not aware of). I have also tried setting the UI thread to be a higher priority than the background thread. Neither of these helped.
Some things I've noted: While the UI thread lags for a moment, other applications running on my machine are just as responsive as ever. The heavy operation seems to only affect this one process. Also, if I decrease the amount of memory being allocated, it alleviates the issue (however, for the application to work as I want it to, it needs to be able to do this allocation).
The question: My question is two-fold. First, I'd like to understand why this is happening, as it seems to go against my understanding of how multi-threading should work. Second, do you have any recommendations or ideas on how to fix this so the UI doesn't lag?
Abbreviated code: Note the comment about epochs in timeline.h
main.cpp
#include "Renderer/Headers/Renderer.h"
#include "Shared/Headers/Timeline.h"
#include "Simulator/Simulator.h"
#include <iostream>
#include <Windows.h>
unsigned int __stdcall renderThread(void* timelinePtr);
unsigned int __stdcall simulateThread(void* timelinePtr);
int main() {
Timeline timeline;
HANDLE renderHandle = (HANDLE)_beginthreadex(0, 0, &renderThread, &timeline, 0, 0);
if (renderHandle == 0) {
std::cerr << "There was an error creating the render thread" << std::endl;
return -1;
}
SetThreadPriority(renderHandle, THREAD_PRIORITY_HIGHEST);
HANDLE simulateHandle = (HANDLE)_beginthreadex(0, 0, &simulateThread, &timeline, 0, 0);
if (simulateHandle == 0) {
std::cerr << "There was an error creating the simulate thread" << std::endl;
return -1;
}
SetThreadPriority(simulateHandle, THREAD_PRIORITY_IDLE);
WaitForSingleObject(renderHandle, INFINITE);
WaitForSingleObject(simulateHandle, INFINITE);
return 0;
}
unsigned int __stdcall renderThread(void* timelinePtr) {
Timeline& timeline = *((Timeline*)timelinePtr);
Renderer renderer = Renderer(timeline);
renderer.run();
return 0;
}
unsigned int __stdcall simulateThread(void* timelinePtr) {
Timeline& timeline = *((Timeline*)timelinePtr);
Simulator simulator(timeline);
simulator.run();
return 0;
}
simulator.cpp
// abbreviated
void Simulator::run() {
    while (true) {
        // abbreviated
        timeline->push(latestState);
    }
}
// abbreviated
timeline.h
#ifndef TIMELINE_H
#define TIMELINE_H
#include "WorldState.h"
#include <mutex>
#include <vector>

class Timeline {
public:
    Timeline();
    bool tryGetStateAtFrame(int frame, WorldState*& worldState);
    void push(WorldState* worldState);
private:
    // The concept of an Epoch was introduced to help reduce mutex conflicts,
    // but right now, since the threads are disconnected, there should be no
    // mutex locks at all on the UI thread. However, every 1024 pushes onto
    // the timeline, a new Epoch must be created. The amount of slowdown
    // largely depends on how much memory the WorldState class takes. If I
    // make WorldState small, there isn't a noticeable hiccup, but when it is
    // large, it becomes noticeable.
    class Epoch {
    public:
        static const int MAX_SIZE = 1024;
        void push(WorldState* worldstate);
        int getSize();
        WorldState* getAt(int index);
    private:
        int size = 0;
        WorldState states[MAX_SIZE];
    };
    Epoch* pushEpoch;
    std::mutex lock;
    std::vector<Epoch*> epochs;
};
#endif // !TIMELINE_H
timeline.cpp
#include "../Headers/Timeline.h"
#include <iostream>
Timeline::Timeline() {
pushEpoch = new Epoch();
}
bool Timeline::tryGetStateAtFrame(int frame, WorldState*& worldState) {
if (!lock.try_lock()) {
return false;
}
if (frame >= epochs.size() * Epoch::MAX_SIZE) {
lock.unlock();
return false;
}
worldState = epochs.at(frame / Epoch::MAX_SIZE)->getAt(frame % Epoch::MAX_SIZE);
lock.unlock();
return true;
}
void Timeline::push(WorldState* worldState) {
pushEpoch->push(worldState);
if (pushEpoch->getSize() == Epoch::MAX_SIZE) {
lock.lock();
epochs.push_back(pushEpoch);
lock.unlock();
pushEpoch = new Epoch();
}
}
void Timeline::Epoch::push(WorldState* worldState) {
if (this->size == this->MAX_SIZE) {
throw std::out_of_range("Pushed too many items to Epoch without clearing");
}
this->states[this->size] = *worldState;
this->size++;
}
int Timeline::Epoch::getSize() {
return this->size;
}
WorldState* Timeline::Epoch::getAt(int index) {
if (index >= this->size) {
throw std::out_of_range("Tried accessing nonexistent element of epoch");
}
return &(this->states[index]);
}
Renderer.cpp: loops to call Presenter::update() and some OpenGL rendering tasks.
Presenter.cpp
// abbreviated
void Presenter::update() {
    camera->update();
    // timeline->tryGetStateAtFrame(Time::getFrames(), worldState);
    // Normally this would cause a potential mutex conflict, but for now I
    // have it commented out. This is the only place anything on the UI
    // thread accesses timeline.
}
// abbreviated
Any help/suggestions?
I ended up figuring this out!
So as it turns out, a new expression in C++ runs to completion before it returns, and the heap it allocates from is shared (and internally synchronized) across the whole process. Why was that a problem in my case? Well, when an Epoch was being initialized, it had to initialize an array of 1024 WorldStates, each of which has 10,000 CellStates that need to be initialized, and each of those had an array of 16 items that needed to be initialized, so we ended up with over 100,000,000 objects needing to be initialized before the new operator could return. Constructing that much memory in one go was taking long enough, and putting enough pressure on the process, that it caused the UI to hiccup while it was happening.
The solution was to create a factory function that builds the pieces of the Epoch piecemeal, one constructor at a time, then combines them and returns a pointer to the new Epoch.
timeline.h
#ifndef TIMELINE_H
#define TIMELINE_H
#include "WorldState.h"
#include <mutex>
#include <vector>

class Timeline {
public:
    Timeline();
    bool tryGetStateAtFrame(int frame, WorldState*& worldState);
    void push(WorldState* worldState);
private:
    class Epoch {
    public:
        static const int MAX_SIZE = 1024;
        static Epoch* createNew();
        void push(WorldState* worldstate);
        int getSize();
        WorldState* getAt(int index);
    private:
        Epoch();
        int size = 0;
        WorldState* states[MAX_SIZE];
    };
    Epoch* pushEpoch;
    std::mutex lock;
    std::vector<Epoch*> epochs;
};
#endif // !TIMELINE_H
timeline.cpp
Timeline::Epoch* Timeline::Epoch::createNew() {
    Epoch* epoch = new Epoch();
    for (unsigned int i = 0; i < MAX_SIZE; i++) {
        epoch->states[i] = new WorldState();
    }
    return epoch;
}
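One caveat with this version (my addition, not in the original post): states now holds owning raw pointers, so the Epoch should release them when it is destroyed, along the lines of the sketch below, assuming a ~Epoch() declaration is added to the class.

// Assumed addition: pairs with createNew(), which fills every slot,
// so all MAX_SIZE pointers are valid by the time an Epoch is deleted.
Timeline::Epoch::~Epoch() {
    for (int i = 0; i < MAX_SIZE; i++) {
        delete states[i];
    }
}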
I need to port a multiprocess application that uses the Windows API functions SetEvent, CreateEvent and WaitForMultipleObjects to Linux. I have found many threads concerning this issue, but none of them provided a reasonable solution for my problem.
I have an application that forks into three processes and manages a worker pool of threads in one of the processes via these events.
I have considered multiple solutions to this issue. One was to create FIFO special files on Linux using mkfifo and use a select statement to wake the threads. The problem is that this solution would behave differently from WaitForMultipleObjects: if 10 threads of the worker pool are waiting for the event and I call SetEvent five times, exactly five worker threads wake up and do the work, whereas with the FIFO variant on Linux it would wake every thread that is blocked in the select statement waiting for data to be put in the FIFO. The best way to describe this is that the Windows API kind of works like a global semaphore with a count of one.
I also thought about using pthreads and condition variables to recreate this, sharing the variables via shared memory (shm_open and mmap), but I run into the same issue there!
What would be a reasonable way to recreate this behaviour on Linux? I found some solutions that do this inside a single process, but what about doing it between multiple processes?
Any ideas are appreciated (Note: I do not expect a full implementation, I just need some more ideas to get myself started with this problem).
You could use a semaphore (sem_init); they work in shared memory. There are also named semaphores (sem_open) if you want to initialize them from different processes (a small semaphore sketch follows the message-queue example below). If you need to exchange messages with the workers, e.g. to pass the actual tasks to them, one way to solve this is POSIX message queues. They are named and work inter-process. Here's a short example. Note that only the first worker thread actually initializes the message queue; the others use the attributes of the existing one. Also, the queue remains persistent until explicitly removed using mq_unlink, which I skipped here for simplicity.
Receiver with worker threads:
// Link with -lrt -pthread
#include <fcntl.h>
#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void *receiver_thread(void *param) {
    // Fields: mq_flags, mq_maxmsg, mq_msgsize, mq_curmsgs
    struct mq_attr mq_attrs = { 0, 10, 254, 0 };
    mqd_t mq = mq_open("/myqueue", O_RDONLY | O_CREAT, 00644, &mq_attrs);
    if (mq == (mqd_t)-1) {
        perror("mq_open");
        return NULL;
    }
    char msg_buf[255];
    unsigned prio;
    while (1) {
        ssize_t msg_len = mq_receive(mq, msg_buf, sizeof(msg_buf), &prio);
        if (msg_len < 0) {
            perror("mq_receive");
            break;
        }
        msg_buf[msg_len] = 0;
        printf("[%lu] Received: %s\n", (unsigned long)pthread_self(), msg_buf);
        sleep(2);
    }
    return NULL;
}

int main() {
    pthread_t workers[5];
    for (int i = 0; i < 5; i++) {
        pthread_create(&workers[i], NULL, &receiver_thread, NULL);
    }
    getchar();
    return 0;
}
Sender:
#include <fcntl.h>
#include <stdio.h>
#include <mqueue.h>
#include <unistd.h>
int main() {
mqd_t mq = mq_open("/myqueue", O_WRONLY);
if(mq < 0) {
perror("mq_open");
}
char msg_buf[255];
unsigned prio;
for(int i=0; i<255; i++) {
int msg_len = sprintf(msg_buf, "Message #%d", i);
mq_send(mq, msg_buf, msg_len, 0);
sleep(1);
}
}
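And here is the promised semaphore sketch. A named POSIX semaphore gives exactly the "SetEvent five times wakes exactly five waiters" behaviour across processes. This is my own minimal illustration; the name /myevent is an arbitrary choice:

// Cross-process "event" via a named POSIX semaphore. Link with -pthread.
// Run with an argument to act as the sender, without one to act as a worker.
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* "/myevent" is an arbitrary name chosen for this sketch. */
    sem_t *sem = sem_open("/myevent", O_CREAT, 0644, 0);
    if (sem == SEM_FAILED) {
        perror("sem_open");
        return 1;
    }
    if (argc > 1) {
        sem_post(sem);   /* like SetEvent: wakes exactly one waiter */
    } else {
        sem_wait(sem);   /* like waiting on the event */
        printf("worker woke up\n");
    }
    sem_close(sem);
    return 0;
}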
Let's say I have some function func() in my program, and I need it to be called after some specific delay. So far I have googled it and ended up with the following code:
#include <stdio.h>
#include <sys/time.h> /* for setitimer */
#include <unistd.h> /* for pause */
#include <signal.h> /* for signal */
void func()
{
printf("func() called\n");
}
bool startTimer(double seconds)
{
itimerval it_val;
double integer, fractional;
integer = (int)seconds;
fractional = seconds - integer;
it_val.it_value.tv_sec = integer;
it_val.it_value.tv_usec = fractional * 1000000;
it_val.it_interval = it_val.it_value;
if (setitimer(ITIMER_REAL, &it_val, NULL) == -1)
return false;
return true;
}
int main()
{
if (signal(SIGALRM, (void(*)(int))func) == SIG_ERR)
{
perror("Unable to catch SIGALRM");
exit(1);
}
startTimer(1.5);
while(1)
pause();
return 0;
}
And it works, but the problem is that setitimer() causes func() to be called repeatedly at an interval of 1.5 sec, and what I need is for func() to be called just once.
Can someone tell me how to do this? Maybe I need some additional parameters to setitimer()?
Note: time interval should be precise, because this program will play midi music later.
Unless you need the program to be doing other things, you can simply sleep for the allotted time.
If you do need to use the timer, you can arm it to fire only once.
From the man page:
struct timeval it_interval
This is the period between successive timer interrupts. If zero, the alarm will only be sent once.
Instead of your code:
it_val.it_interval = it_val.it_value;
I'd set:
it_val.it_interval.tv_sec = 0;
it_val.it_interval.tv_usec = 0;
in addition to it_val.it_value, which you already set. What you've done is use the same values for both fields, and that is why the timer repeats.
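Putting that together, a minimal one-shot version might look like this (my sketch; I've also given the handler the int parameter the signal machinery actually passes, rather than casting):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static void func(int signo)
{
    (void)signo;
    /* printf is not strictly async-signal-safe; fine for a demo. */
    printf("func() called\n");
}

int main(void)
{
    struct itimerval it_val;
    if (signal(SIGALRM, func) == SIG_ERR)
    {
        perror("Unable to catch SIGALRM");
        exit(1);
    }
    it_val.it_value.tv_sec = 1;
    it_val.it_value.tv_usec = 500000; /* fire once, 1.5 s from now */
    it_val.it_interval.tv_sec = 0;    /* zero interval: no repeat */
    it_val.it_interval.tv_usec = 0;
    if (setitimer(ITIMER_REAL, &it_val, NULL) == -1)
    {
        perror("setitimer");
        exit(1);
    }
    pause(); /* wait for the single SIGALRM */
    return 0;
}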
I'm writing a simple thread pool for my application, which I'm testing on a dual-core processor. Usually it works well, but I noticed that when other processes are using more than 50% of the processor, my application almost halts. This made me curious, so I decided to reproduce this situation and created an auxiliary application that simply runs an infinite loop (without multithreading), taking 50% of the processor. While the auxiliary one is running, the multithreaded application almost halts, as before (processing speed falls from 300-400 tasks per second to 5-10 tasks per second). But when I changed the processor affinity of my multithreaded program to use only one core (the auxiliary still uses both), it started working, of course using at most the 50% of the processor left. When I disabled multithreading in my application (still processing the same tasks, but without the thread pool), it worked like a charm, without any slowdown from the auxiliary, which was still running (and that's how two applications should behave when running on two cores). But when I enable multithreading, the problem comes back.
I've made special code for testing this particular ThreadPool:
header
#ifndef THREADPOOL_H_
#define THREADPOOL_H_

typedef double FloatingPoint;

#include <queue>
#include <vector>
#include <mutex>
#include <atomic>
#include <condition_variable>
#include <thread>

using namespace std;

struct ThreadTask
{
    int size;
    ThreadTask(int s)
    {
        size = s;
    }
    ~ThreadTask()
    {
    }
};

class ThreadPool
{
protected:
    queue<ThreadTask*> tasks;
    vector<std::thread> threads;
    std::condition_variable task_ready;
    std::mutex variable_mutex;
    std::mutex max_mutex;
    std::atomic<FloatingPoint> max;
    std::atomic<int> sleeping;
    std::atomic<bool> running;
    int threads_count;

    ThreadTask * getTask();
    void runWorker();
    void processTask(ThreadTask*);
    bool isQueueEmpty();
    bool isTaskAvailable();
    void threadMethod();
    void createThreads();
    void waitForThreadsToSleep();
public:
    ThreadPool(int);
    virtual ~ThreadPool();
    void addTask(int);
    void start();
    FloatingPoint getValue();
    void reset();
    void clearTasks();
};
#endif /* THREADPOOL_H_ */
and .cpp
#include "stdafx.h"
#include <climits>
#include <float.h>
#include "ThreadPool.h"
ThreadPool::ThreadPool(int t)
{
running = true;
threads_count = t;
max = FLT_MIN;
sleeping = 0;
if(threads_count < 2) //one worker thread has no sense
{
threads_count = (int)thread::hardware_concurrency(); //default value
if(threads_count == 0) //in case it fails ('If this value is not computable or well defined, the function returns 0')
threads_count = 2;
}
printf("%d worker threads\n", threads_count);
}
ThreadPool::~ThreadPool()
{
running = false;
reset(); //it will make sure that all worker threads are sleeping on condition variable
task_ready.notify_all(); //let them finish in natural way
for (auto& th : threads)
th.join();
}
void ThreadPool::start()
{
createThreads();
}
FloatingPoint ThreadPool::getValue()
{
waitForThreadsToSleep();
return max;
}
void ThreadPool::createThreads()
{
threads.clear();
for(int i = 0; i < threads_count; ++i)
threads.push_back(std::thread(&ThreadPool::threadMethod, this));
}
void ThreadPool::threadMethod()
{
while(running)
runWorker();
}
void ThreadPool::runWorker()
{
ThreadTask * task = getTask();
processTask(task);
}
void ThreadPool::processTask(ThreadTask * task)
{
if(task == NULL)
return;
//do something to simulate processing
vector<int> v;
for(int i = 0; i < task->size; ++i)
v.push_back(i);
delete task;
}
void ThreadPool::addTask(int s)
{
ThreadTask * task = new ThreadTask(s);
std::lock_guard<std::mutex> lock(variable_mutex);
tasks.push(task);
task_ready.notify_one();
}
ThreadTask * ThreadPool::getTask()
{
std::unique_lock<std::mutex> lck(variable_mutex);
if(tasks.empty())
{
++sleeping;
task_ready.wait(lck);
--sleeping;
if(tasks.empty()) //in case of ThreadPool being deleted (destructor calls notify_all), or spurious notifications
return NULL; //return to main loop and repeat it
}
ThreadTask * task = tasks.front();
tasks.pop();
return task;
}
bool ThreadPool::isQueueEmpty()
{
std::lock_guard<std::mutex> lock(variable_mutex);
return tasks.empty();
}
bool ThreadPool::isTaskAvailable()
{
return !isQueueEmpty();
}
void ThreadPool::waitForThreadsToSleep()
{
while(isTaskAvailable())
std::this_thread::yield(); //wait for all tasks to be taken
while(true) //wait for all threads to finish they last tasks
{
if(sleeping == threads_count)
break;
std::this_thread::yield();
}
}
void ThreadPool::clearTasks()
{
std::unique_lock<std::mutex> lock(variable_mutex);
while(!tasks.empty()) tasks.pop();
}
void ThreadPool::reset() //don't call this when var_mutex is already locked by this thread!
{
clearTasks();
waitForThreadsToSleep();
max = FLT_MIN;
}
how it's tested:
int main()
{
    ThreadPool tp(2);
    tp.start();
    int iterations = 1000;
    int task_size = 1000;
    for(int j = 0; j < iterations; ++j)
    {
        printf("\r%d left", iterations - j);
        tp.reset();
        for(int i = 0; i < 1000; ++i)
            tp.addTask(task_size);
        tp.getValue();
    }
    return 0;
}
I've built this code with MinGW with gcc 4.8.1 (from here) and Visual Studio 2012 (VC11) on Windows 7 64-bit, both in the debug configuration.
The two programs built with these compilers behave totally differently.
a) The program built with MinGW works much faster than the one built with VS when it can take the whole processor (the system shows almost 100% CPU usage by this process, so I don't think MinGW is secretly setting affinity to one core). But when I run the auxiliary program (using 50% of the CPU), it slows down greatly (by a factor of several dozen). CPU usage in this case is about 50%-50% between the main program and the auxiliary one.
b) The program built with VS 2012, when using the whole CPU, is even slower than a) is with its slowdown (when I set task_size = 1, their speeds were similar). But when the auxiliary is running, the main program even takes most of the CPU (usage is about 66% main - 33% aux) and the resulting slowdown is barely noticeable.
When set to use only one core, both programs speed up noticeably (about 1.5-2 times), and the MinGW one stops being vulnerable to the competition.
Well, now I don't know what to do. My program behaves differently when built by two different toolsets. Is this a flaw in my code (which I suppose is true), or something to do with the compilers having problems with C++11?
I'm developing an application for openSUSE 12.1.
This application has a main thread and two other threads running instances of the same functions. I'm trying to use pthread_barrier to synchronize all the threads, but I'm having some problems:
When I put the spawned threads to sleep, they never wake up for some reason.
(In the case where I remove the sleep from the other threads, CPU usage shoots up.) At some point all the threads reach pthread_barrier_wait(), but none of them continues execution after that.
Here's some pseudo code trying to illustrate what I'm doing.
pthread_barrier_t barrier;

int main(void)
{
    pthread_barrier_init(&barrier, NULL, 3);
    pthread_create(&thread_id1, NULL, &thread_func, (void*) &params1);
    pthread_create(&thread_id2, NULL, &thread_func, (void*) &params2);
    while(1)
    {
        doSomeWork();
        nanosleep(&t1, &t2);
        pthread_barrier_wait(&barrier);
        doSomeMoreWork();
    }
}

void *thread_func(void *params)
{
    init_thread(params);
    while(1)
    {
        nanosleep(&t1, &t2);
        doAnotherWork();
        pthread_barrier_wait(&barrier);
    }
}
I don't think it has to do with the barrier as you've presented it in the pseudocode. I'm assuming your glibc is approximately the same as on my machine. I compiled roughly your pseudocode and it runs as I expect: the threads do some work, the main thread does some work, they all reach the barrier and then loop.
Can you comment more about any other synchronization methods or what the work functions are?
This is the example program I'm using:
#include <pthread.h>
#include <stdio.h>
#include <time.h>
struct timespec req = {1,0}; //{.tv_sec = 1, .tv_nsec = 0};
struct timespec rem = {0,0}; //{.tv_sec = 0, .tv_nsec = 0};
pthread_barrier_t barrier;
void *thread_func(void *params) {
long int name;
name = (long int)params;
while(1) {
printf("This is thread %ld\n", name);
nanosleep(&req, &rem);
pthread_barrier_wait(&barrier);
printf("More work from %ld\n", name);
}
}
int main(void)
{
pthread_t th1, th2;
pthread_barrier_init(&barrier, NULL , 3);
pthread_create(&th1, NULL, &thread_func, (void*)1);
pthread_create(&th2, NULL, &thread_func, (void*)2);
while(1) {
nanosleep(&req, &rem);
printf("This is the parent\n\n");
pthread_barrier_wait(&barrier);
}
return 0;
}
I would suggest using condition variables to synchronize the threads.
Here is a website about how to do it; I hope it helps.
http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html
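To make the suggestion concrete, here is a minimal barrier-like rendezvous built from a mutex and a condition variable (my own sketch, not taken from the tutorial). A generation counter makes it reusable across rounds and guards against spurious wakeups:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 3

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int arrived = 0;
static unsigned long generation = 0;

/* Each participant calls this instead of pthread_barrier_wait(). */
static void rendezvous(void)
{
    pthread_mutex_lock(&m);
    unsigned long my_gen = generation;
    if (++arrived == NTHREADS) {
        arrived = 0;  /* reset for the next round */
        generation++; /* release everyone waiting on this round */
        pthread_cond_broadcast(&cv);
    } else {
        while (my_gen == generation) /* handles spurious wakeups */
            pthread_cond_wait(&cv, &m);
    }
    pthread_mutex_unlock(&m);
}

static void *worker(void *arg)
{
    long id = (long)arg;
    for (int round = 0; round < 3; ++round) {
        printf("thread %ld working in round %d\n", id, round);
        rendezvous();
        printf("thread %ld past the barrier in round %d\n", id, round);
    }
    return NULL;
}

int main(void)
{
    pthread_t th1, th2;
    pthread_create(&th1, NULL, &worker, (void *)1);
    pthread_create(&th2, NULL, &worker, (void *)2);
    worker((void *)0); /* the main thread is the third participant */
    pthread_join(th1, NULL);
    pthread_join(th2, NULL);
    return 0;
}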