How to solve busy waiting for many running threads - C++

I have written a multithreaded program:
#include <Windows.h>
#include <process.h>
#include <stdio.h>
#include <fstream>
#include <iostream>
using namespace std;

ofstream myfile;
BYTE lockmem = 0x0;

unsigned int __stdcall mythreadA(void* data)
{
    __asm
    {
        mov DL, 0xFF
    mutex:
        mov AL, 0x0
        LOCK CMPXCHG lockmem, DL   ; if lockmem == AL (0), set lockmem = 0xFF
        jnz mutex                  ; otherwise spin and retry
    }
    // Enter Critical Section
    for (int i = 0; i < 100000; i++)
    {
        myfile << "." << i << endl;
    }
    // Exit Critical Section
    __asm
    {
        lock not lockmem           ; 0xFF -> 0x00, releasing the lock
    }
    return 0;
}

int main(int argc, char* argv[])
{
    myfile.open("report.txt");
    HANDLE myhandleA[100];         // must match the loop count below
    for (int index = 0; index < 100; index++)
    {
        myhandleA[index] = (HANDLE)_beginthreadex(0, 0, &mythreadA, 0, 0, 0);
    }
    getchar();
    myfile.close();
    return 0;
}
In the critical section I use inline assembly to ensure that only one thread is inside at a time (in this program I deliberately avoid APIs and library functions for the mutual exclusion, which is why I use inline assembly). Now I have a busy-waiting problem: after one thread enters the critical section, the other threads spin in the loop before it, so the CPU usage climbs and climbs. I am looking for ways to solve the busy waiting. (I prefer to use assembly instructions rather than APIs or other functions, but I would also like to know about those.)

What you are doing is basically called a spinlock, and it should not be used for long operations. Draining CPU time, as you describe, is the expected result.
You may however build a mutex, or a futex (fast user-mode mutex), on top of a spinlock plus a condition variable or event.
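For illustration, here is a minimal sketch of that idea on Windows (this assumes Windows 8 or later, where WaitOnAddress/WakeByAddressSingle provide a futex-style wait; link with Synchronization.lib; lockAcquire/lockRelease are names made up for this sketch). A thread that loses the compare-exchange race blocks in the kernel instead of spinning:

#include <Windows.h>
// WaitOnAddress / WakeByAddressSingle come from synchapi.h via Windows.h;
// link with Synchronization.lib (Windows 8+).

volatile LONG lockmem = 0;

void lockAcquire()
{
    // Atomic compare-exchange replaces the inline-assembly CMPXCHG loop.
    while (InterlockedCompareExchange(&lockmem, 1, 0) != 0)
    {
        LONG locked = 1;
        // Block until lockmem is no longer 1, instead of burning CPU.
        WaitOnAddress(&lockmem, &locked, sizeof(lockmem), INFINITE);
    }
}

void lockRelease()
{
    InterlockedExchange(&lockmem, 0);
    WakeByAddressSingle((PVOID)&lockmem); // wake one waiter, if any
}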

You can use a kernel API call that blocks the threads that must wait, or you can waste CPU cycles and memory bandwidth, keeping your fans at full speed and your office warm.
There isn't any other choice.

If I understand your question and program correctly, there are a couple of problems in your program, some of which others have already mentioned above.
Your actual critical section is terribly slow, because each thread writes a number on a new line of your file 100,000 times, and file I/O is very slow; that is all your thread function does. I am not very familiar with the assembly instructions you used, but it looks like this locking code makes all the remaining threads busy-wait until they get their turn. As mentioned above, you should use the EnterCriticalSection() and LeaveCriticalSection() APIs provided on Microsoft systems (see the sketch below). These APIs internally take care of the waiting (in a way that is hard to reproduce with your own logic), so you would not see CPU spikes while threads wait.
If you still want to keep your current logic, you should insert some form of sleep() call into the spin loop. That would avoid the CPU being kept busy by the continuous checking of the flag.
Calvin has also mentioned that you should try to distribute the locks instead of using one central lock. And remember: if a task is simple and easily done with a conventional single-threaded approach, we should not reach for a multi-threaded solution.
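As a rough illustration of the EnterCriticalSection/LeaveCriticalSection suggestion (a sketch, not the asker's code; InitializeCriticalSection(&cs) must be called once in main before the threads start, and DeleteCriticalSection(&cs) once at the end):

#include <Windows.h>
#include <process.h>
#include <fstream>

std::ofstream myfile;
CRITICAL_SECTION cs;   // initialize once with InitializeCriticalSection(&cs)

unsigned int __stdcall mythreadA(void* data)
{
    EnterCriticalSection(&cs);     // waiting threads block; no CPU spin
    for (int i = 0; i < 100000; i++)
        myfile << "." << i << std::endl;
    LeaveCriticalSection(&cs);
    return 0;
}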

Related

OpenMP task directive slower multithreaded than singlethreaded

I've encountered a problem where the task directive seems to slow down the execution of the code the more threads I have. I have removed everything from my code that isn't related to the problem, since it still occurs in this slimmed-down piece of code that doesn't really do anything. The general idea is that the master thread generates tasks for all the other worker threads to execute.
#ifndef _REENTRANT
#define _REENTRANT
#endif
#include <vector>
#include <iostream>
#include <random>
#include <sched.h>
#include <semaphore.h>
#include <time.h>
#include <bits/stdc++.h>
#include <sys/times.h>
#include <stdio.h>
#include <stdbool.h>
#include <omp.h>
#include <chrono>
#define MAXWORKERS 16
using namespace std;

int nbrThreads = MAXWORKERS; // number of threads

void busyWait() {
    for (int i = 0; i < 999; i++) {}
}

void generatePlacements() {
    #pragma omp parallel
    {
        #pragma omp master
        {
            int j = 0;
            while (j < 8*7*6*5*4*3*2) {
                #pragma omp task
                {
                    busyWait();
                }
                j++;
            }
        }
    }
}

int main(int argc, char const *argv[])
{
    for (int i = 1; i <= MAXWORKERS; i++) {
        int nbrThreads = i;
        omp_set_num_threads(nbrThreads);
        auto begin = omp_get_wtime();
        generatePlacements();
        double elapsed;
        auto end = omp_get_wtime();
        auto diff = end - begin;
        cout << "Time taken for " << nbrThreads << " threads to execute was " << diff << endl;
    }
    return 0;
}
And I get the following output from running the program:
Time taken for 1 threads to execute was 0.0707005
Time taken for 2 threads to execute was 0.0375168
Time taken for 3 threads to execute was 0.0257982
Time taken for 4 threads to execute was 0.0234329
Time taken for 5 threads to execute was 0.0208451
Time taken for 6 threads to execute was 0.0288127
Time taken for 7 threads to execute was 0.0380352
Time taken for 8 threads to execute was 0.0403016
Time taken for 9 threads to execute was 0.0470985
Time taken for 10 threads to execute was 0.0539719
Time taken for 11 threads to execute was 0.0582986
Time taken for 12 threads to execute was 0.051923
Time taken for 13 threads to execute was 0.571846
Time taken for 14 threads to execute was 0.569011
Time taken for 15 threads to execute was 0.562491
Time taken for 16 threads to execute was 0.562118
Most notable is that from 6 threads onward the time gets slower, and going from 12 threads to 13 threads has the biggest performance hit, becoming a whopping 10 times slower. I know that this issue revolves around the OpenMP task directive, since if I remove the busyWait() call the performance stays the same as shown above. But if I also remove the #pragma omp task header along with the busyWait() call, I get no slowdown whatsoever, so the slowdown can't come from thread creation. I have no clue what the problem is.
First of all, the for (int i=0; i < 999; i++){} loop can be optimized away by the compiler when optimization flags like -O2 or -O3 are enabled. In fact, mainstream compilers like Clang and GCC optimize it away at -O2. Profiling a non-optimized build is a waste of time and should never be done unless you have a very good reason to do so.
Assuming you enabled optimizations, the created tasks will be empty, which means you are measuring the time it takes to create many tasks. The thing is, creating tasks is slow, and creating many tasks that do nothing causes contention, making the creation even slower. Task granularity should be tuned carefully so as not to put too much pressure on the OpenMP runtime (one way to coarsen it is sketched below). Assuming you did not enable optimizations, even a loop of 999 iterations is not enough to keep the runtime from being under pressure (it should last less than 1 µs on mainstream machines). Tasks should last at least a few microseconds for the overhead not to be the main bottleneck; on mainstream servers with a lot of cores, at least dozens of microseconds. For the overhead to be negligible, tasks should last even longer. Task scheduling is powerful but expensive.
Due to the shared data structures protected with atomics and locks inside OpenMP runtimes, contention tends to grow with the number of cores. On NUMA systems it can be significantly higher when multiple NUMA nodes are used, due to NUMA effects; AMD processors with 16 cores are typically processors with multiple NUMA nodes. Using SMT (multiple hardware threads per physical core) does not significantly speed this up and adds more pressure on the OpenMP scheduler and the OS scheduler, so it is generally not a good idea to use more threads than cores in this case (it can be worth it when the tasks' computational work benefits from SMT, e.g. latency-bound tasks, and when the overhead is small).
For more information about the overhead of mainstream OpenMP runtimes, please consider reading On the Impact of OpenMP Task Granularity.
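As a rough illustration of that coarsening (the chunk size of 1024 is an arbitrary starting point to tune, not a recommendation), the task-creation loop from the question could hand each task a block of busyWait() calls instead of a single one:

#include <algorithm>
#include <omp.h>

void busyWait(); // as in the question

void generatePlacementsChunked() {
    const int total = 8*7*6*5*4*3*2;
    const int chunk = 1024; // tune so each task lasts at least a few microseconds
    #pragma omp parallel
    {
        #pragma omp master
        {
            for (int j = 0; j < total; j += chunk) {
                #pragma omp task firstprivate(j)
                {
                    const int end = std::min(j + chunk, total);
                    for (int k = j; k < end; k++)
                        busyWait();
                }
            }
        }
    }
}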

Is there a better way to use C++ concurrency than below?

I am trying to learn the basics of threading, mutexes, etc., following the documentation and examples from here. In the code below I get the output I expected. Questions:
Is there any pitfall that I am missing? How can the code below be improved?
On which line does my thread try to take the mutex, or start waiting for it? (Is it after line 11 or on line 11?)
Is it OK to include cv.notify_all(); in the thread code?
#include <mutex>
#include <iostream>
#include <thread>
#include <condition_variable>

std::mutex mtx;
std::condition_variable cv;
int currentlyRequired = 1;

void work(int id) {
    std::unique_lock<std::mutex> lock(mtx); // Line 11
    while (currentlyRequired != id) {
        cv.wait(lock);
    }
    std::cout << "Thread # " << id << "\n";
    currentlyRequired++;
    cv.notify_all();
}

int main() {
    std::thread threads[10];
    for (int i = 0; i < 10; i++) {
        threads[i] = std::thread(work, i + 1);
    }
    for (int i = 0; i < 10; i++) {
        threads[i].join();
    }
    return 0;
}
/* Current Output:
Thread # 1
Thread # 2
Thread # 3
Thread # 4
Thread # 5
Thread # 6
Thread # 7
Thread # 8
Thread # 9
Thread # 10
Program ended with exit code: 0
*/
First and foremost I recommend you read the documentation here: cppreference.com
It's a little more discursive about the key points you need to conform to in order to use a condition variable for inter-thread waiting and notification.
I think your code measures up.
Your code obtains the lock on line 11, as you think. Other constructors of unique_lock will adopt a mutex previously locked by the current thread, or take one not yet locked and lock it only when requested (that would be lock.lock(); here).
What you have is right. You check the relevant data holding the lock.
Wait then unlocks it (through the unique_lock) and awaits notification.
When notified it stops waiting, locks it again and loops to check the condition.
Eventually the condition is true and (still holding the lock) each thread continues on to do its 'work'.
The 'waiting' side looks correct. The 'notifying' side also looks correct.
The data for the condition must be modified while holding the mutex, to ensure correct synchronisation between checking the data and entering the waiting state that the condition variable manages.
You correctly notify_all(). Even though logically (in this example) only one thread needs to wake up there's no way of picking it out to be the target of notify_one().
So all the threads wake up (if suspended), check their condition and exactly one of them identifies its turn and runs.
Common wisdom (and my experience) is that it's better (and valid) to notify_all() not holding the lock because the waiting threads wake up to block (on the lock). But I'm told some platforms run better notifying under the lock. Welcome to the world of platform dependence...
So in terms of implementing a condition-variable I think that's valid and pretty much textbook.
It's good to see the join() as well. I have a bugbear about coders not joining threads.
It's one of those things you get away with at small scale and light load, but when the application scales and experiences high load it can start to cause problems and confusion.
The problem I have with what you've done is that it has no parallelism.
What you've achieved is a daisy-chain. The very intention is to ensure only one thread is 'doing work' at once and they do so in strict order.
To take advantage of concurrency we want to maximise parallelism - the number of threads running 'the work' in parallel (i.e. not the housekeeping of inter-thread communication). Your code is literally (because you did it right!) guaranteed never to have more than one thread running, and because of the housekeeping it is guaranteed to be slightly slower than a single-threaded program (which would be a for loop!).
So it gets top-marks for program correctness but no marks for being useful!
Both the examples, on cplusplus.com and cppreference.com, are little better in my view.
The best introductory example is some kind of producer consumer model.
That's closer to an actually useful pattern.
Try something as simple as this: one producer thread counts up through the integers and multiple consumer threads square them and output them (a sketch appears at the end of this answer).
A key point will be that if you're doing it right the squares won't in general come out in sequential order.
These examples are like teaching recursion with factorial. Sure it's recursive but recursion is a terrible way to calculate factorial!
Sure, your multi-threading (like the other examples) is valid, but it's outright contrived to do nothing useful in parallel!
Please don't take that as a criticism. As a 'tooth-cutting' exercise what you've got is top-class. The next task is to do something that uses concurrency usefully!
The problem is we don't want condition variables! By definition they make threads wait for each other, and that reduces parallelism. We need them and they do their job, but a design that reduces the amount of mutual waiting (either on locks or conditions) is usually better, because the enemy here is waiting, blocking or (worst of all) spinning instead of waiting suspended.
Here's a better design for your task. Better because it entirely avoids the condition-variable!!
#include <mutex>
#include <iostream>
#include <thread>

std::mutex mtx;
int currentlyRequired = 1;

void work(int id) {
    std::lock_guard<decltype(mtx)> guard(mtx);
    const auto old{currentlyRequired++};
    std::cout << "Thread # " << id << " " << old << "-> " << currentlyRequired << "\n";
}

int main() {
    std::thread threads[10];
    for (int i = 0; i < 10; i++) {
        threads[i] = std::thread(work, i + 1);
    }
    for (int i = 0; i < 10; i++) {
        threads[i].join();
    }
    std::cout << "Final result: " << currentlyRequired << std::endl;
    return 0;
}
Specimen output:
Thread # 7 1-> 2
Thread # 8 2-> 3
Thread # 9 3-> 4
Thread # 10 4-> 5
Thread # 6 5-> 6
Thread # 5 6-> 7
Thread # 4 7-> 8
Thread # 3 8-> 9
Thread # 2 9-> 10
Thread # 1 10-> 11
Final result: 11
Which thread does which increment will vary, but the final result is always 11.
It's still no good because there's still no parallelism because the work is all done under a single lock. That's why we need a more interesting task.
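For reference, here is a minimal sketch of the producer/consumer exercise suggested above (the job count and consumer count are arbitrary choices for illustration). One producer pushes integers into a queue, several consumers pop, square and print them, and the squares will in general not come out in sequential order:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable ready;
std::queue<int> jobs;
bool done = false;

void producer() {
    for (int i = 1; i <= 100; i++) {
        { std::lock_guard<std::mutex> lock(m); jobs.push(i); }
        ready.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    ready.notify_all();
}

void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        ready.wait(lock, []{ return !jobs.empty() || done; });
        if (jobs.empty()) return;        // producer finished and queue drained
        int n = jobs.front(); jobs.pop();
        lock.unlock();                   // do the 'work' outside the lock
        std::cout << n * n << "\n";
    }
}

int main() {
    std::vector<std::thread> consumers;
    for (int i = 0; i < 4; i++) consumers.emplace_back(consumer);
    std::thread prod(producer);
    prod.join();
    for (auto& c : consumers) c.join();
}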
Have fun.

C++ multiple threads and processes in vector

While going through a C++ tutorial book (it's in Spanish, so I apologize if my translation to English is not as proper as it should be) I have come across a code snippet that I do not fully understand in terms of the processes happening in the background. For example, in terms of multiple address spaces, how would I determine whether these all live within the context of a single process (given that multiple threads are being added over each push to the vector)? How would I tell the threads apart if they all perform exactly the same computation?
#include <iostream>
#include <vector>
#include <thread>
using namespace std;

int addthreads = 0;

void squarenum(int x) {
    addthreads += x * x * x;   // cubes x and adds it to a shared global
}

int main() {
    vector<thread> septhread;
    for (int i = 1; i <= 9; i++) {
        septhread.push_back(thread(&squarenum, i));
    }
    for (auto& th : septhread) {
        th.join();
    }
    cout << "Your answer = " << addthreads << endl;
    system("pause");
    return 0;
}
Every run gives 2025, that much I understand. My basic issue is understanding the first part of my question.
By the way, the compile command required (if you are on Linux):
g++ -std=gnu++ -pthread threadExample.cpp -o threadExample
A thread is a "thread of execution" within a process, sharing the same address space, resources, etc. Depending on the operating system, hardware, etc, they may or may not run on the same CPU or CPU Thread.
A major issue with thread programming, as a result, is managing access to resources. If two threads access the same resource at the same time, undefined behaviour can occur. If both are reading, it may be fine, but if one is writing at the moment the other is reading, numerous outcomes ensue. The simplest is that the two threads run on separate CPUs or cores and the reader does not see the change made by the writer, due to caching. Another is that the reader sees only a portion of the write (if it's a 64-bit value, it might see only 32 bits of it changed).
Your code performs a read-modify-write operation: the first thread to come along sees the value 0, calculates the result of x*x*x, adds it to 0 and stores the result.
Meanwhile the next thread comes along and does the same thing; it may also see 0 before performing its calculation, so it writes 0 + x*x*x to the variable, overwriting the first thread's result.
These threads will not necessarily run in the order you launched them; it's possible for thread #9 to get the first execution cycle rather than thread #1.
You may need to consider looking at std::atomic or std::mutex.
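For example, a minimal fix using std::atomic (one of the two options just mentioned) turns the read-modify-write into a single indivisible operation:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;

atomic<int> addthreads{0};

void squarenum(int x) {
    addthreads += x * x * x;   // atomic fetch-add: no lost updates
}

int main() {
    vector<thread> septhread;
    for (int i = 1; i <= 9; i++)
        septhread.push_back(thread(&squarenum, i));
    for (auto& th : septhread)
        th.join();
    cout << "Your answer = " << addthreads << endl;   // now reliably 2025
    return 0;
}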

Simulating CPU Load In C++

I am currently writing an application in Windows using C++ and I would like to simulate CPU load.
I have the following code:
void task1(void *param) {
    unsigned elapsed = 0;
    unsigned t0;
    while (1) {
        if ((t0 = clock()) >= 50 + elapsed) { // if time elapsed is 50ms
            elapsed = t0;
            Sleep(50);
        }
    }
}

int main() {
    int ThreadNr;
    for (int i = 0; i < 4; i++) { // for each core (i.e. 4 cores)
        _beginthread(task1, 0, &ThreadNr); // create a new thread running "task1"
    }
    while (1) {}
}
I wrote this code using the same methodology as in the answers given in this thread: Simulate steady CPU load and spikes
My questions are:
Have I translated the C# code from the other post correctly over to C++?
Will this code generate an average CPU load of 50% on a quad-core processor?
How can I, within reasonable accuracy, find out the load percentage of the CPU? (Is Task Manager my only option?)
EDIT: The reason I ask is that I eventually want to be able to generate CPU loads of 10, 20, 30, ..., 90% within a reasonable tolerance. This code seems to work well for generating loads above 70%, but seems very inaccurate at any load below that (as measured by Task Manager's CPU load readings).
Would anyone have any ideas as to how I could generate said loads but still be able to use my program on different computers (i.e. with different CPUs)?
At first sight, this looks like not-pretty-but-correct C++ or C (an easy way to be sure is to compile it). Includes are missing (<windows.h>, <process.h>, and <time.h>), but otherwise it compiles fine.
Note that clock and Sleep are not terribly accurate, and Sleep is not terribly reliable either. On average, though, the thread function should work roughly as intended (give or take a few percent of variation).
However, regarding question 2, you should replace the final while(1){} with something that blocks rather than spins (e.g. WaitForSingleObject, or Sleep if you will). Otherwise the program as a whole will not have 50% load on a quad-core: you will have 100% load on one core from the main thread, plus the 4x 50% from your four workers. This obviously sums to more than 50% per core (and will cause threads to bounce from one core to another, resulting in nasty side effects).
Using Task Manager or a similar utility to verify whether you get the load you want is a good option (and since it's the easiest solution, it's also the best one).
Also do note that simulating load in such a way will probably kind of work, but is not 100% reliable.
There might be effects (memory, execution units) that are hard to predict. Assume for example that you're using 100% of the CPU's integer execution units with this loop (a reasonable assumption) but none of its floating-point or SSE units. Modern CPUs may share resources between real or logical cores, and you might not be able to predict exactly what effects you get. Or another thread may be memory-bound or taking significant page faults, so taking away CPU time won't affect it nearly as much as you think (it might in fact give it enough time to make prefetching work better). Or it might block on AGP transfers. Or something else you can't tell.
EDIT:
Improved version, shorter code that fixes a few issues and also works as intended:
It uses clock_t for the value returned by clock (which is technically "more correct" than using an integer that isn't specially typedef'd). Incidentally, that is probably the very reason why the original code does not work as intended: clock_t is a signed integer under Win32, the condition in the if() always evaluates true, and so the workers sleep almost all the time, consuming no CPU.
It has less code and less complicated math when spinning: it computes a wakeup time 50 ticks in the future and spins until that time is reached.
It uses getchar to block the program at the end. This does not burn CPU time, and it lets you end the program by pressing Enter. The threads are not ended properly as one would normally do, but in this simple case it's probably OK to just let the OS terminate them as the process exits.
Like the original code, this assumes that clock and Sleep use the same ticks. That is admittedly a bold assumption, but it holds true under Win32, which you used in the original code (both "ticks" are milliseconds). C++ doesn't have anything like Sleep (without boost::thread, or C++11 std::thread), so if non-Windows portability is intended, you'd have to rethink anyway.
Like the original code, it relies on functions (clock and Sleep) which are imprecise and unreliable. Sleep(50) equals Sleep(63) on my system unless I use timeBeginPeriod. Nevertheless, the program works "almost perfectly", resulting in a 50% +/- 0.5% load on my machine.
Like the original code, this does not take thread priorities into account. A process that has a higher than normal priority class will be entirely unimpressed by this throttling code, because that is how the Windows scheduler works.
#include <windows.h>
#include <process.h>
#include <time.h>
#include <stdio.h>

void task1(void *)
{
    while (1)
    {
        clock_t wakeup = clock() + 50;
        while (clock() < wakeup) {}
        Sleep(50);
    }
}

int main(int, char**)
{
    int ThreadNr;
    for (int i = 0; i < 4; i++) _beginthread(task1, 0, &ThreadNr);
    (void) getchar();
    return 0;
}
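To reach the other load targets mentioned in the edit (10%, 20%, ..., 90%), the same spin/sleep pattern can be parameterized by duty cycle. This is only a sketch under the same assumptions as the code above (clock ticks are milliseconds; accuracy is still limited by Sleep and the scheduler), and loadTask/busyMs are names made up for the example:

#include <windows.h>
#include <process.h>
#include <time.h>
#include <stdio.h>

// Spin for busyMs, then sleep for the rest of a 100 ms period,
// giving roughly busyMs percent load on one core.
void loadTask(void *arg)
{
    const int busyMs = *(int *)arg;
    while (1)
    {
        clock_t wakeup = clock() + busyMs;
        while (clock() < wakeup) {}
        Sleep(100 - busyMs);
    }
}

int main()
{
    static int busyMs = 30;   // aim for ~30% load per worker
    for (int i = 0; i < 4; i++) _beginthread(loadTask, 0, &busyMs);
    (void) getchar();
    return 0;
}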
Here is a code sample which loaded my CPU to 100% on Windows.
#include "windows.h"
DWORD WINAPI thread_function(void* data)
{
float number = 1.5;
while(true)
{
number*=number;
}
return 0;
}
void main()
{
while (true)
{
CreateThread(NULL, 0, &thread_function, NULL, 0, NULL);
}
}
When you build and run the app, press Ctrl-C to kill it.
You can use the Windows performance counter API to get the CPU load, either for the entire system or for your process.
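As a rough sketch of that approach, assuming the PDH (Performance Data Helper) flavour of the performance counter API: link with pdh.lib, and note that a "% Processor Time" counter needs two samples before it yields a value.

#include <windows.h>
#include <pdh.h>
#include <stdio.h>
// link with pdh.lib

int main()
{
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    PdhOpenQuery(NULL, 0, &query);
    PdhAddEnglishCounterW(query, L"\\Processor(_Total)\\% Processor Time", 0, &counter);
    PdhCollectQueryData(query);   // first sample; % counters need two samples

    for (int i = 0; i < 10; i++)
    {
        Sleep(1000);
        PdhCollectQueryData(query);
        PDH_FMT_COUNTERVALUE value;
        PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
        printf("Total CPU load: %.1f%%\n", value.doubleValue);
    }
    PdhCloseQuery(query);
    return 0;
}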

Reporting a thread progress to main thread in C++

In C or C++, how can I make threads (POSIX pthreads or Windows threads) give me a safe way to report progress back to the main thread on the work I've decided to perform in the thread?
Is it possible to report the progress in terms of percentage ?
I'm going to assume a very simple case of a main thread and one thread function. What I'd recommend is passing a pointer to an atomic (as suggested by Kirill above) into each thread you launch. I'm assuming C++11 here.
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>
using namespace std;

void threadedFunction(atomic<int>* progress)
{
    for (int i = 0; i < 100; i++)
    {
        progress->store(i); // updates the variable safely
        chrono::milliseconds dura(2000);
        this_thread::sleep_for(dura); // sleeps for a bit
    }
    return;
}

int main(int argc, char** argv)
{
    // Make and launch 10 threads.
    // Size the vector up front: atomic<int> is neither copyable nor movable,
    // so the vector must never reallocate (and the pointers must stay valid).
    vector<atomic<int>> atomics(10);
    vector<thread> threads;
    for (int i = 0; i < 10; i++)
    {
        threads.emplace_back(threadedFunction, &atomics[i]);
    }
    // Monitor the threads down here
    // use atomics[n].load() to get the value from the atomics
    for (auto& t : threads)
        t.join(); // without joining, ~thread() would call std::terminate
    return 0;
}
I think that'll do what you want. I omitted the polling of the threads, but you get the idea. I'm passing in an object that both the main thread and the child thread know about (in this case the atomic<int> variable) that both can update and/or poll for results. If you're not on a compiler with full C++11 thread/atomic support, use whatever your platform provides, but there's always a way to pass a variable (at the very least a void*) into the thread function. And that's how you pass information back and forth without statics.
The best way to solve this is to use C++ atomics. Declare one in some sufficiently visible place:
std::atomic<int> my_thread_progress(0);
In a simple case this could be a static variable; in a more complex design it should be a data field of some object that manages threads, or something similar.
On many platforms this may look slightly paranoid, because almost everywhere reads and writes of integers are atomic. But using atomics still makes sense because:
You get a guarantee that this will work fine on any platform, even a 16-bit CPU or other unusual hardware;
Your code will be easier to read. The reader immediately sees that this is a shared variable, without any comments needed. And once it is updated via load/store methods, it is easier to follow what is going on.
EDIT
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B and 3C (http://download.intel.com/products/processor/manual/325462.pdf)
Volume 3A: 8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary
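To address the percentage part of the question: here is a minimal C++11 sketch, assuming the worker knows the total number of items up front (worker, totalItems and the 0..100 convention are choices made for this example):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<int> my_thread_progress(0); // 0..100; written by the worker, read by main

void worker(int totalItems)
{
    for (int done = 1; done <= totalItems; ++done)
    {
        // ... process one item here ...
        my_thread_progress.store(done * 100 / totalItems);
    }
}

int main()
{
    std::thread t(worker, 1000);
    while (my_thread_progress.load() < 100)
    {
        std::cout << "Progress: " << my_thread_progress.load() << "%\n";
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    t.join();
    std::cout << "Progress: 100%\n";
    return 0;
}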