I have some very resource-intensive code that I wrote so I could split the workload over multiple pthreads. Everything works and the computation is done faster, but my guess at what happens is that other processes on the machine get so slow that they crash after a few seconds of runtime.
I have already managed to kill random processes, like Chrome tabs, the Cinnamon DE, or even the entire OS (kernel?).
Code: (It's late, and I'm too tired to write pseudocode or even comments...)
It's brute-force code, not so much for cracking as for testing passwords and/or CPU IPS (instructions per second).
Any ideas how to fix this while keeping as much performance as possible?
#include <thread>
#include <mutex>
#include <string>

static unsigned int NTHREADS = std::thread::hardware_concurrency();
static int THREAD_COMPLETE = -1;   // index of the thread that found the password
static std::string PASSWORD = "";  // the string we are searching for
static std::string CHARS;          // the candidate alphabet
static std::mutex MUTEX;
void *find_seq(void *arg_0)
{
    unsigned int _arg_0 = *((unsigned int *) arg_0);  // this thread's index
    std::string *str_CURRENT = new std::string(" ");
    while (true)
    {
        // Each thread tries every NTHREADS-th character in the last position.
        for (unsigned int loop_0 = _arg_0; loop_0 < CHARS.length() - 1; loop_0 += NTHREADS)
        {
            str_CURRENT->back() = CHARS[loop_0];
            if (*str_CURRENT == PASSWORD)
            {
                THREAD_COMPLETE = _arg_0;
                return (void *) str_CURRENT;
            }
        }
        // Carry over: advance the preceding positions, growing the string
        // when every position already holds the last character of the alphabet.
        str_CURRENT->back() = CHARS.back();
        for (int loop_1 = (str_CURRENT->length() - 1); loop_1 >= 0; loop_1--)
        {
            if (str_CURRENT->at(loop_1) == CHARS.back())
            {
                if (loop_1 == 0)
                    str_CURRENT->assign(str_CURRENT->length() + 1, CHARS.front());
                else
                {
                    str_CURRENT->at(loop_1) = CHARS.front();
                    str_CURRENT->at(loop_1 - 1) = CHARS[CHARS.find(str_CURRENT->at(loop_1 - 1)) + 1];
                }
            }
        }
    }
}
Areuz,
Can you post the full code? I suspect the issue is the NTHREADS value. On my Ubuntu box, the value is 8, which is the number of cores listed in /proc/cpuinfo. Kicking off 8 'hot' threads on my box hogs 100% of the CPU. The kernel will time-slice for its own critical processes, but in general all other processes will starve for CPU.
Check out the max processor value in /proc/cpuinfo and go at least one lower than that. The CPUs are numbered 0-7 on my box, so 7 would be the max for me. The actual max might be 3, since 4 of my cores are hyper-threads. For completely CPU-bound processes, hyper-threading generally doesn't help.
Bottom line: don't hog all the CPU, it will destabilize the system.
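For example, a minimal sketch of that (NTHREADS mirrors the variable in your code; hardware_concurrency() may return 0 when it cannot be determined, hence the guard):
#include <thread>

// Use all hardware threads but one, so the OS and other processes
// keep a core to themselves.
static const unsigned int HW = std::thread::hardware_concurrency();
static const unsigned int NTHREADS = (HW > 1) ? HW - 1 : 1;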
--Matt
Thank you for your answers, and especially Matthew Fisher for his suggestion to try it on another system.
After some trial and error, I decided to pull back my CPU overclock, which I had thought was stable (I had run it for over a year), and that solved this weird behaviour. I guess I had simply never run a script this CPU-intensive and (I'm guessing) this efficient (in the sense that it never yields, so it loads the CPU fully), so I never saw this happen before.
As Matthew suggested, I need to come up with a better way than constantly checking the THREAD_COMPLETE variable in a while-true loop, but I hope to resolve that in the comments.
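(For future visitors, one hedged sketch of what that could look like: make the flag a std::atomic<int>, so workers can publish and observe completion without a data race, and the main thread can simply join the workers instead of spinning:)
#include <atomic>

// Sketch only: an atomic completion flag instead of a plain int.
static std::atomic<int> THREAD_COMPLETE{-1};

// Inside find_seq, once per candidate:
//     if (THREAD_COMPLETE.load(std::memory_order_relaxed) != -1)
//         return nullptr;            // another thread already found it; stop
// On success:
//     THREAD_COMPLETE.store(_arg_0);
// The main thread then just pthread_join()s all workers.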
Full and updated code for future visitors is here: pastebin.com/jbiYyKBu
Before I start, let me say that I've only used threads once, when we were taught about them at university. Therefore, I have almost zero experience using them, and I don't know if what I'm trying to do is a good idea.
I'm working on a project of my own, and I'm trying to make a for-loop run fast because I need the calculations in the loop for a real-time application. After "optimizing" the calculations in the loop, I've gotten closer to the desired speed. However, it still needs improvement.
Then I remembered threading. I thought I could make the loop run even faster if I split it into 4 parts, one for each core of my machine. So this is what I tried to do:
void doYourThing(int size, int threadNumber, int numOfThreads) {
    int start = (threadNumber - 1) * size / numOfThreads;
    int end = threadNumber * size / numOfThreads;
    for (int i = start; i < end; i++) {
        //Calculations...
    }
}

int main(void) {
    int size = 100000;
    int numOfThreads = 4;
    int start = 0;
    int end = size / numOfThreads;
    std::thread coreB(doYourThing, size, 2, numOfThreads);
    std::thread coreC(doYourThing, size, 3, numOfThreads);
    std::thread coreD(doYourThing, size, 4, numOfThreads);
    // The main thread handles the first quarter itself.
    for (int i = start; i < end; i++) {
        //Calculations...
    }
    coreB.join();
    coreC.join();
    coreD.join();
}
With this, computation time changed from 60ms to 40ms.
Questions:
1) Do my threads really run on different cores? If so, I would expect a greater increase in speed; more specifically, I assumed it would take close to 1/4 of the initial time.
2) If they don't, should I use even more threads to split the work? Will that make my loop faster or slower?
(1) The question @François Andrieux asked is a good one: the original code has a well-structured for-loop, and if you compile with -O3 the compiler might be able to vectorize the computation. That vectorization alone gives a speedup.
Also, it depends on what the critical path of your computation is. According to Amdahl's law, the possible speedup is limited by the unparallelisable part (see the formula spelled out after point (3)). You might also check whether the computation touches shared variables that are protected by locks; if so, time may also be spent spinning on those locks.
(2) To find out the total number of cores and threads on your computer, you can use the lscpu command, which shows the core and thread configuration of your computer/server.
(3) It is not necessarily true that more threads yield better performance.
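To make the Amdahl's law point in (1) concrete: if a fraction p of the work parallelizes perfectly over n threads, the best possible speedup is
speedup(n) = 1 / ((1 - p) + p / n)
so even with p = 0.9 (90% of the work parallel), four threads give at most 1 / (0.1 + 0.9/4) ≈ 3.1x rather than 4x.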
There is a header-only library on GitHub which may be just what you need. Presumably your doYourThing processes an input vector (of size 100000 in your code) and stores the results in another vector. In that case, all you need to say is
auto vectorOut = Lazy::runForAll(vectorIn, myFancyFunction);
The library will decide how many threads to use based on how many cores you have.
On the other hand, if the compiler is able to vectorize your algorithm and it still looks like a good idea to split the work into 4 chunks as in your example code, you could do it like this, for example:
#include "Lazy.h"

void doYourThing(const MyVector& vecIn, int from, int to, MyVector& vecOut)
{
    for (int i = from; i < to; ++i) {
        // Calculate vecOut[i]
    }
}

int main(void) {
    int size = 100000;
    MyVector vecIn(size), vecOut(size);
    // Load vecIn vector with input data...
    Lazy::runForAll({std::pair{0, size/4}, {size/4, size/2}, {size/2, 3*size/4}, {3*size/4, size}},
        [&](auto indexPair) {
            doYourThing(vecIn, indexPair.first, indexPair.second, vecOut);
        });
    // Now the results are in vecOut
}
README.md gives further examples of parallel execution which you might find useful.
I'm having issues using multiple threads in my Mandelbrot program.
Here is one of the approaches I tried, following a tutorial:
int sliceSize = 800 / threads;
double start = 0, end = 0;
for (int i = 0; i < threads; i++)
{
    start = i * sliceSize;
    end = ((1 + i) * sliceSize);
    thrd.push_back(thread(compute_mandelbrot, left, right, top, bottom, start, end));
}
for (int i = 0; i < threads; i++)
{
    thrd[i].join();
}
thrd.clear();
but the code only takes half the time to compute, even when using 8 threads.
I also tried something more complicated, but it doesn't work at all:
void slicer(double left, double right, double top, double bottom)
{
    /*promise<int> prom;
    future<int> fut = prom.get_future();*/
    int test = -1;
    double start = 0, end = 0;
    const size_t nthreads = std::thread::hardware_concurrency(); // detect how many threads the CPU has
    {
        int sliceSize = 800 / nthreads;
        std::cout << "CPU has " << nthreads << " threads" << std::endl;
        std::vector<std::thread> threads(nthreads);
        for (int t = 0; t < nthreads; t++)
        {
            threads[t] = std::thread(std::bind(
                [&]()
                {
                    mutex2.lock();
                    test++;
                    start = (test) * sliceSize;
                    end = ((test + 1) * sliceSize);
                    mutex2.unlock();
                    compute_mandelbrot(left, right, top, bottom, start, end);
                }));
        }
        std::for_each(threads.begin(), threads.end(), [](std::thread& x) { x.join(); }); // join threads
    }
}
but while it does seem to compute 8 slices at once, they tend to overlap even though I'm using a mutex, and it's not any faster.
This has given me a headache for the last 7 hours and it's driving me up the wall. Help.
There's a lot at play when you're trying to speed up a workload by multi-threading, and even in a perfect world it's pretty much impossible to get an Nx speed-up from N threads. Some things to bear in mind:
If you're making use of hyperthreading (so using 1 thread per logical core on the system, not just per physical core), then you don't get the equivalent performance of 2 real cores; you'll get some fraction of it (probably around 1.2x or so).
The operating system (Windows) is going to be doing stuff while your workloads are executing. It's fairly random what these OS tasks are and when they cut into your app's time, but they make a difference. Always expect some percentage of your CPU time to be stolen by Windows.
Any kind of synchronization is going to heavily impact performance. In your second example, mutexes are pretty hefty and are likely hurting performance.
Memory accesses, cache behaviour, etc. also come into play. Multiple threads accessing memory all over the place put pressure on the cache, which has a (potential) impact.
I'm curious - what sort of times are you looking at here? And how many iterations are you passing to each thread? To dig in and see what's happening timing-wise, you could try recording the start/end time of each thread using QueryPerformanceCounter to see how long each runs, when they start, etc. Posting the times here for 1, 2, 4 and 8 threads would maybe shed a little light.
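For example, a rough sketch of that per-thread timing (QueryPerformanceCounter/QueryPerformanceFrequency are the Win32 calls; compute_slice stands in for your compute_mandelbrot invocation):
#include <windows.h>
#include <cstdio>

// Record how long one worker thread actually runs.
void timed_worker(int id)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   // ticks per second
    QueryPerformanceCounter(&t0);

    // compute_slice(id);               // the per-thread work goes here

    QueryPerformanceCounter(&t1);
    double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    std::printf("thread %d ran for %.3f ms\n", id, ms);
}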
Hopefully this at least helps a little...
#include <math.h>
#include <sstream>
#include <iostream>
#include <mutex>
#include <stdlib.h>
#include <chrono>
#include <thread>

bool isPrime(int number) {
    if (number < 2) return false;  // 0 and 1 are not prime
    for (int i = 2; i < number; i++) {
        if (number % i == 0) {
            return false;
        }
    }
    return true;
}

std::mutex myMutex;
int pCnt = 0;      // number of primes found
int icounter = 0;  // next candidate to test
int limit = 0;

int getNext() {
    std::lock_guard<std::mutex> guard(myMutex);
    icounter++;
    return icounter;
}

void primeCnt() {
    std::lock_guard<std::mutex> guard(myMutex);
    pCnt++;
}

void primes() {
    while (getNext() <= limit)
        if (isPrime(icounter))
            primeCnt();
}

int main(int argc, char *argv[]) {
    std::stringstream ss(argv[2]);
    int tCount;
    ss >> tCount;
    std::stringstream ss1(argv[4]);
    int lim;
    ss1 >> lim;
    limit = lim;
    auto t1 = std::chrono::high_resolution_clock::now();
    std::thread *arr = new std::thread[tCount];
    for (int i = 0; i < tCount; i++)
        arr[i] = std::thread(primes);
    for (int i = 0; i < tCount; i++)
        arr[i].join();
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "Primes: " << pCnt << std::endl;
    std::cout << "Program took: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " milliseconds" << std::endl;
    return 0;
}
Hello, I'm trying to find the number of primes in a user-specified range, e.g. 1-1000000, with a user-specified number of threads to speed up the process. However, it seems to take the same amount of time for any number of threads compared to one thread. I'm not sure if it's supposed to be that way or if there's a mistake in my code. Thank you in advance!
You don't see a performance gain because the time spent in isPrime() is much smaller than the time the threads spend fighting over the mutex.
One possible solution is to use atomic operations, as @The Badger suggested. The other way is to partition the task into smaller ones and distribute them over a thread pool.
For example, if you have n threads, then each thread should test the numbers from i*(limit/n) to (i+1)*(limit/n), where i is the thread number. This way you wouldn't need to do any synchronization at all, and your program would (theoretically) scale linearly. A sketch follows below.
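A minimal sketch of that partitioning (assuming the isPrime function from the question; any remainder of limit % n is ignored here for brevity):
#include <thread>
#include <vector>

bool isPrime(int number);  // the function from the question

// Each thread tests its own contiguous range [from, to) with no
// synchronization in the hot loop; results are combined after join().
void countRange(int from, int to, int* result)
{
    int local = 0;  // thread-private tally
    for (int n = from; n < to; ++n)
        if (isPrime(n))
            ++local;
    *result = local;
}

int countPrimes(int limit, int nthreads)
{
    std::vector<std::thread> pool;
    std::vector<int> results(nthreads, 0);
    for (int i = 0; i < nthreads; ++i)
        pool.emplace_back(countRange,
                          i * (limit / nthreads),
                          (i + 1) * (limit / nthreads),
                          &results[i]);
    int total = 0;
    for (int i = 0; i < nthreads; ++i) {
        pool[i].join();
        total += results[i];
    }
    return total;
}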
Multithreaded algorithms work best when threads can do a lot of work on their own.
Imagine doing this in real life: you have a group of 20 humans that will do work for you, and you want them to test whether each number up to 1000 is prime. How will you do this?
Would you hand each person a single number at a time, and have them come back to you to report whether it's prime and to receive another number?
Surely not; you would give each person a bunch of numbers to work on at once, and have them come back, tell you how many were prime, and receive another bunch of numbers.
Maybe you'd even divide the entire set of numbers into 20 groups and tell each person to work on one group. (But then you run the risk of one person being slow, leaving everyone else sitting idle while you wait for them to finish... there are so-called "work stealing" algorithms for that, but they're complicated.)
The same thing applies here; you want each thread to do a lot of work on its own and keep its own tally, and only have to check back with the centralized information once in a while.
A better solution would be to use the Sieve of Atkin to find the primes (even the Sieve of Eratosthenes, which is easier to understand, is better); your basic algorithm is very poor to start with. For every number n in your interval it does up to n divisibility checks, and it does this limit times, so you're doing about limit*limit/2 checks in total, which is what we call O(n^2) complexity. The Sieve of Atkin, on the other hand, only has to do O(n) operations to find all primes. If n is large, it is hard to beat an algorithm that has fewer steps by performing the steps faster. Trying to fix a poor algorithm by throwing more resources at it is a bad strategy.
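For reference, here is a sketch of the Sieve of Eratosthenes variant mentioned above; even single-threaded it will beat trial division by a wide margin for large limits:
#include <vector>

// Count primes <= limit by marking composites.
int sieveCount(int limit)
{
    std::vector<bool> composite(limit + 1, false);
    int count = 0;
    for (int p = 2; p <= limit; ++p) {
        if (composite[p])
            continue;
        ++count;                                        // p is prime
        for (long long m = (long long)p * p; m <= limit; m += p)
            composite[m] = true;                        // mark its multiples
    }
    return count;
}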
Another problem with your implementation is that it has race conditions and is therefore broken to start with. There is often little use in optimizing something before making sure it works correctly. The problem is in the primes function:
void primes() {
    while (getNext() <= limit)
        if (isPrime(icounter))
            primeCnt();
}
Between the getNext() call and the isPrime(icounter) call, another thread may have increased icounter, causing the program to skip candidates (and test others twice). This is why the program gives a different result each time. In addition, icounter is read in primes() without holding the mutex, which is a data race; the value returned by getNext() should be used instead, since there is no guarantee that a bare read of the global sees an up-to-date value.
Since the problem is CPU-intensive, that is, almost all of the time is spent executing CPU instructions, multithreading won't help unless you have multiple CPUs (or cores) on which the OS can schedule the threads of the same process. This means there is a limit to the number of threads (which can be as low as 1; for example, I see an improvement only for two threads, and beyond that there is none) up to which you can expect improved performance. If you have more threads than cores, the OS will just let one thread run for a while on a core, then switch to the next thread and let it execute for a while.
A further problem that may arise when scheduling threads on different cores is that each core may have a separate cache (which is faster than the shared cache). In effect, if two threads access the same memory, the separate caches have to be flushed as part of synchronizing the data involved, and this can be time-consuming.
That is, you have to strive to keep the data that different threads work on separate and minimize frequent access to shared variables. In your example, it would mean avoiding the global data as much as possible. The counter, for example, only needs to be accessed when a thread has finished counting (to add that thread's contribution to the total). Also, you could reduce the use of icounter by not reading it for every candidate, but fetching a whole batch of candidates in one go. Something like:
void primes() {
    int next;
    int count = 0;
    while ((next = getNext(1000)) <= limit) {
        for (int j = next; j < next + 1000 && j <= limit; j++) {
            if (isPrime(j))
                count++;
        }
    }
    primeCnt(count);
}
where getNext is the same, except that it reserves a batch of candidates (by increasing icounter by the supplied count), and primeCnt adds count to pCnt.
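A sketch of those two modified helpers, using the same mutex and globals as in the question:
int getNext(int count) {
    std::lock_guard<std::mutex> guard(myMutex);
    int first = icounter + 1;  // first candidate of this batch
    icounter += count;         // reserve [first, first + count) for this thread
    return first;
}

void primeCnt(int count) {
    std::lock_guard<std::mutex> guard(myMutex);
    pCnt += count;             // add the thread's whole tally at once
}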
With more threads than cores, you may consequently end up in a situation where a core runs one thread, then after a while switches to another thread, and so on. The result is that you have to run all the code for your problem plus the code for switching between threads. Add to that the extra cache misses this causes, and it will probably even be slower.
Perhaps instead of a mutex, try using an atomic integer for the counter. It might speed things up a bit, though I'm not sure by how much.
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> pCnt;      // made uint64_t for a bigger range, as @IgnisErus mentioned
std::atomic<uint64_t> icounter;

int getNext() {
    return ++icounter;           // pre-increment is faster
}

void primeCnt() {
    ++pCnt;
}
When benchmarking, the processor usually needs to warm up to reach its best performance, so timing a single run is not always a good representation of the actual performance. Try running the code many times and taking an average. You can also do some heavy work before the measurement (a long for-loop calculating powers of some counter, say).
Getting accurate benchmark results is also a topic of interest for me since I do not yet know how to do it.
I am trying to fix a problem with a legacy Visual Studio Win32 unmanaged C++ app which is not keeping up with its input. As part of my solution, I am exploring bumping up the priority class and thread priorities.
My PC has 4 Xeon processors running 64-bit XP. I wrote a short Win32 test app which creates 4 background looping threads, each one running on its own processor. Some code samples are shown below. The problem is that even when I bump the priorities to the extreme, CPU utilization is still less than 1%.
My test app is 32-bit, running under WOW64. The same test app also shows less than 1% CPU utilization on a 32-bit XP machine. I am an administrator on both machines. What else do I need to do to get this to work?
DWORD __stdcall ThreadProc4 (LPVOID)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    while (true)
    {
        for (int i = 0; i < 1000; i++)
        {
            int p = i;
            int red = p * 5;
            theClassPrior4 = GetPriorityClass(theProcessHandle);
        }
        Sleep(1);
    }
}

int APIENTRY _tWinMain(...)
{
    ...
    theProcessHandle = GetCurrentProcess();
    BOOL theAffinity = GetProcessAffinityMask(
        theProcessHandle, &theProcessMask, &theSystemMask);
    SetPriorityClass(theProcessHandle, REALTIME_PRIORITY_CLASS);
    DWORD threadid4 = 0;
    HANDLE thread4 = CreateThread((LPSECURITY_ATTRIBUTES)NULL,
                                  0,
                                  (LPTHREAD_START_ROUTINE)ThreadProc4,
                                  NULL,
                                  0,
                                  &threadid4);
    DWORD_PTR theAff4 = 8;
    DWORD_PTR theAf4 = SetThreadAffinityMask(thread4, theAff4);
    SetThreadPriority(thread4, THREAD_PRIORITY_TIME_CRITICAL);
    ResumeThread(thread4);
Well, if you want it to actually eat CPU time, you'll want to remove that Sleep call; your 'processing' takes no significant amount of time, so the thread spends most of its time sleeping.
You'll also want to look at what the optimizer is doing to your code. I wouldn't be totally surprised if it completely removed 'p' and 'red' (and the multiply) in your loop, because the results are never used. You could try marking 'red' as volatile; that should force the calculation not to be removed.
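A sketch of both suggestions combined (no Sleep, and a volatile result so the optimizer cannot discard the loop):
#include <windows.h>

DWORD __stdcall ThreadProc4(LPVOID)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    volatile int red = 0;        // volatile: every store must actually happen
    while (true)
    {
        for (int i = 0; i < 1000000; i++)
            red = i * 5;         // busy work the optimizer cannot remove
        // no Sleep(1) here: the thread now spins at 100% on its core
    }
}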
On my laptop with an Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test under Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but every time I got the same result. I am using pthreads. I also tried the same with only two threads (instead of the 3 threads in the code), but the performance was similar.
What could be the reason? The faster version took reasonably long, about 20 secs, so it does not seem to be an issue of startup overhead.
NOTE: This code is quite buggy (indeed, it does not make much sense, as the outputs of the serial and parallel versions would differ). The intention was just to get a speedup comparison for the same number of instructions.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
#include <iostream>

class Thread{
private:
    pthread_t thread;
    static void *thread_func(void *d){((Thread *)d)->run(); return NULL;}
public:
    Thread(){}
    virtual ~Thread(){}
    virtual void run(){}
    int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
    int wait(){return pthread_join(thread, NULL);}
};

const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];

int main(void)
{
    class Thread_a:public Thread{
    public:
        Thread_a(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    class Thread_b:public Thread{
    public:
        Thread_b(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    class Thread_c:public Thread{
    public:
        Thread_c(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    {
        Thread *a = new Thread_a(arr);
        Thread *b = new Thread_b(arr);
        Thread *c = new Thread_c(arr);
        clock_t start = clock();
        if (a->start() != 0) { return 1; }
        if (b->start() != 0) { return 1; }
        if (c->start() != 0) { return 1; }
        if (a->wait() != 0) { return 1; }
        if (b->wait() != 0) { return 1; }
        if (c->wait() != 0) { return 1; }
        clock_t end = clock();
        double duration = (double)(end - start) / CLOCKS_PER_SEC;
        std::cout << duration << " seconds\n";
        delete a;
        delete b;
        delete c;
    }

    {
        clock_t start = clock();
        for(int n = 0; n<N; n++)
            for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
        clock_t end = clock();
        double duration = (double)(end - start) / CLOCKS_PER_SEC;
        std::cout << "serial: " << duration << " seconds\n";
    }
    return 0;
}
See also: What can make a program run slower when using more threads?
The times you are reporting are measured using the clock function:
The clock() function returns an approximation of processor time used by the program.
$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds
real 0m5.247s
user 0m9.025s
sys 0m0.304s
The real time will be less for multiprocessor tasks, but the processor time will typically be greater.
When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.
It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.
Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:
$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds
real 0m5.326s
user 0m9.057s
sys 0m0.344s
which still is less of a speed-up than one might hope for, but at least the numbers make some sense.
100000000*20/2.5s = 8e8 per second, i.e. about 800 MHz; the bus frequency is 1600 MHz. So I suspect that with a read and a write for each iteration (assuming some caching), you're memory-bandwidth limited, as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (Does anyone know whether clock() time includes such stalls?)
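For anyone repeating the measurement, a minimal sketch of wall-clock timing with clock_gettime (CLOCK_MONOTONIC is unaffected by clock adjustments; older glibc needs -lrt at link time):
#include <time.h>
#include <stdio.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... run the threaded workload here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.3f seconds\n", secs);
    return 0;
}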
The only thing your threads do is add some elements, so your application should be memory-bound. When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster; instead, you'll have cache misses etc.
I believe that your algorithm essentially makes your cache memory useless.
Probably what you are seeing is the effect of (non-)locality of reference between the three threads. Essentially, because each thread operates on a section of data that is widely separated from the others, you cause cache misses as the data section for one thread evicts that of another thread from the cache. If your program were constructed so that the threads operated on sections of data that were smaller (so that they could all be kept in cache) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is, I suspect your slowdown is because a lot of memory references have to be satisfied from main memory instead of from the cache.
Not related to your threading issues, but there is a bounds error in your code.
You have:
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
When i is zero you will be doing
arr[0] += arr[-1];
Also see Herb Sutter's article on how multiple CPUs and cache lines interfere in multithreaded code, especially the section "All Sharing Is Bad -- Even of 'Unshared' Objects...".
As others have pointed out, threads don't necessarily provide a speed improvement. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.
tstenner has got it mostly right.
This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array allocation reserves 400MB of virtual memory (100 million 4-byte ints); the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.
Your benchmark also stresses the memory bus (minimum 800MB transferred; on OSs that zero memory just before they give it to you, the worst case is 800MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.
You have 3 threads that are trampling all over the same memory. Cache lines are being read and written by different threads, so they will ping-pong between the L1 caches of the two CPU cores. (A cache line that is to be written to can only be in one L1 cache, and that must be the L1 cache attached to the CPU core doing the write.) This is not very efficient. The CPU cores are probably spending most of their time waiting for cache lines to be transferred, which is why this is slower with threads than if you ran it single-threaded.
Incidentally, the code is also buggy because the same array is read and written by different CPUs without locking. Proper locking would have a performance impact.
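Where per-thread partial results are needed, one common remedy (a sketch, not from the original code) is to keep each thread's data on its own cache line so that writes don't ping-pong a shared line between cores:
// Each slot is aligned to a typical 64-byte cache line, so two threads
// writing their own slots never contend for the same line.
struct alignas(64) PerThread {
    long long sum;
};
PerThread partial[3] = {};   // one slot per thread; combine after joining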
Threads take you to the promised land of speed boosts(TM) only when you have a proper parallel (vectorized) implementation. That means you need to have:
a proper parallelization of your algorithm
a compiler that knows and can spread your algorithm out on the hardware as a parallel procedure
hardware support for parallelization
It is difficult to come up with the first: you need to be able to tolerate redundancy while making sure it doesn't eat into your performance, to merge data properly for processing the next batch, and so on...
But that is only the theoretical standpoint.
Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember: there is only one processor, so your threads have to wait for a time slice, and essentially you are doing sequential processing.