I am studying pthreads, but I am confused about how to use them to synchronize functions.
For example, I have some simple code that performs a few operations on an array, as follows:
float add(int numbers[5]){
float sum;
for(int i = 0; i < 5; i++){
sum = sum + numbers[i] +5;
}
return sum/5;
}
float subtract(int numbers[5]){
float sum;
for(int i = 0; i < 5; i++){
sum = sum + numbers[i] -10;
}
return sum/5;
}
float mul(int numbers[5]){
float sum;
for(int i = 0; i < 5; i++){
sum = sum + (float)numbers[i] * 1.5 ;
}
return sum/5;
}
float div(int numbers[5]){
float sum;
for(int i = 0; i < 5; i++){
sum = sum + (float)numbers[i]/ 2;
}
return sum/5;
}
int main(){
int numbers [5] = { 34, 2, 77, 40, 12 };
float addition = add(numbers);
float subtraction = subtract(numbers);
float multiplication = mul(numbers);
float division = div(numbers);
cout << addition + subtraction + multiplication + division << endl;
return -1;
}
Since all four functions are independent of each other and use the same input, how can I put each operation into its own thread and let the functions (or threads) run at the same time?
I think that if one day I have a very large array and run the program as above, it will take a long time, but if I can make the functions run simultaneously, it will save a lot of time.
First of all, I suspect you are not clear on how arrays are passed into functions. float subtract(int numbers[5]) does not say anything about the size of the passed array. It is equivalent to float subtract(int numbers[]) (no size), which, in turn, is equivalent to float subtract(int* numbers) (pointer to int). You also have a bug in your function (and in the other three): sum is not initialized before its first use, so the result is undefined.
With this in mind, the whole subtract function is better written like this:
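float subtract(int* numbers, const size_t size) {
    float sum = 0; // initialize the accumulator before use
    for(size_t i = 0; i < size; i++) {
        sum = sum + numbers[i] -10;
    }
    return sum/size; // divide by the actual element count, not a hard-coded 5
}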
Now that we have cleaned up the function itself, we can tackle multithreading. I really suggest ditching pthreads and using the C++11 threading facilities instead. That is especially true when you need to get the result back as the return value of the function; doing that with pthreads requires a lot more boilerplate. The relevant C++ code would look similar to this:
int numbers[] = {34, 2, 77, 40, 12}; // No need to provide the array size when initialized
auto sub_result = std::async(std::launch::async, &subtract, numbers, sizeof(numbers) / sizeof(*numbers));
auto div_result = ....
// rest of functions
std::cout << "Result of subtraction: " << sub_result.get();
Now, this is a lot to grasp :) std::async is a way to run a function asynchronously without worrying about thread management at all. The threading work is delegated to the standard library. It is a much cleaner way than using pthreads: see, the invocation is not much different from a normal function call! The only thing to keep in mind is that it returns a so-called std::future, a special object on which you can wait until the function has finished executing. It also has a get function, which waits until the function has completed and returns its result. Nice, eh?
My friend and I wrote a program which calculates the polynomial:
P = a*b^n + a^2*b^(n-1) + a^3*b^(n-2) + ... + a^n*b
He used the C++ std::pow(int, int) function, whereas I used two for loops, as I've been taught that for loops with integers are faster than std::pow(int, int), or in the worst case about the same speed.
His code:
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
int a = 1,b = 1,n = 100000;
long long P = 0;
for(int i = 1, j = n; i <= n; i++, j--)
{
P += pow(a,i) * pow(b,j);
}
cout << P;
return 0;
}
The program returns 100000 as expected, with an execution time of 0.048 s on the first run and 0.049 s on the second.
However when I run my code:
#include <iostream>
using namespace std;
int main()
{
int a = 1,b = 1,n = 100000;
long long P = 0;
for(int i = 1, j = n; i <= n; i++, j--)
{
int c = a, d = b;
for (int k = 1; k < i; ++k) {
c *= a;
}
for (int k = 1; k < j; ++k) {
d *= b;
}
P += c * d;
}
cout << P;
return 0;
}
The program again returns the expected 100000, but with an execution time of 17.039 s on the first run and 17.117 s on the second.
I have always been taught, and have read in articles and tutorials, that std::pow(int, int) is either slower than a for loop with integers or about equally fast. So why is there such a big difference in execution time? If we increase n to 1000000, my code takes a couple of minutes to finish, whereas the pow(int, int) version finishes within seconds.
It depends entirely on the quality of the std::pow implementation and the optimization capability of your compiler.
For example, some standard libraries calculate pow(x, y) as exp(log(x)*y) even for integer exponents, so it may be inaccurate and slow, resulting in issues like these:
Why does pow(5,2) become 24?
Why pow(10,5) = 9,999 in C++
But some better pow implementations check whether the exponent is an integer in order to use a better algorithm. They also use exponentiation by squaring, so they'll be orders of magnitude faster than linear multiplication like your for loop. For example, for b^100000 they need only 17 loop iterations instead of 100000.
If a compiler is smart enough to recognize that your loop calculates a power, it may replace it with std::pow completely. But that is probably not happening in your case, so std::pow is still faster. If you write your own pow(int, int) function that uses exponentiation by squaring, it may well be faster than std::pow.
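As an illustration of exponentiation by squaring, here is a minimal sketch of an integer power function. The name ipow and its signature are assumptions made for this example; it does no overflow checking.

```cpp
#include <cstdint>

// Hypothetical helper: computes base^exp with O(log exp) multiplications by
// repeatedly squaring the base and multiplying it into the result whenever
// the current lowest bit of the exponent is set. Overflow is not detected.
std::int64_t ipow(std::int64_t base, unsigned exp) {
    std::int64_t result = 1;
    while (exp > 0) {
        if (exp & 1u) {
            result *= base;   // this bit of the exponent contributes
        }
        exp >>= 1;
        if (exp > 0) {
            base *= base;     // square only if another bit remains
        }
    }
    return result;
}
```

In the inner loops of the question, ipow(a, i) and ipow(b, j) would replace the two linear multiplication loops, cutting each power from up to n multiplications down to roughly log2(n).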
So I want to optimize the sum of a really big array, and to do that I have written some multi-threaded code. The problem is that with this code I'm getting better timing results using only one thread than with 2, 3, or 4 threads...
Can someone explain to me why this happens?
(Also, I only started coding in C++ this semester; until then I only knew C, so I'm sorry for any dumb mistakes.)
This is the thread code:
*localSum = 0.0;
for (size_t i = 0; i < stop; i++)
*localSum += v[i];
Main process code
int numThreads = atoi(argv[1]);
int N = 100000000;
// create the input vector v and put some values in v
vector<double> v(N);
for (int i = 0; i < N; i++)
v[i] = i;
// this vector will contain the partial sum for each thread
vector<double> localSum(numThreads, 0);
// create threads. Each thread will compute part of the sum and store
// its result in localSum[threadID] (threadID = 0, 1, ... numThread-1)
startChrono();
vector<thread> myThreads(numThreads);
for (int i = 0; i < numThreads; i++){
int start = i * v.size() / numThreads;
myThreads[i] = thread(threadsum, i, numThreads, &v[start], &localSum[i],v.size()/numThreads);
}
for_each(myThreads.begin(), myThreads.end(), mem_fn(&thread::join));
// calculate global sum
double globalSum = 0.0;
for (int i = 0; i < numThreads; i++)
globalSum += localSum[i];
cout.precision(12);
cout << "Sum = " << globalSum << endl;
cout << "Runtime: " << stopChrono() << endl;
exit(EXIT_SUCCESS);
}
There are a few things:
1- The array just isn't big enough for threading to pay off. A vectorized streaming add is really hard to beat. You need a more complex function than add, or a much larger array, to really see results.
2- Relatedly, the overhead of creating and joining the threads is going to swamp any performance gain from the threading. Adding is really fast, and a single thread can easily saturate the CPU's functional units. For a second thread to help, it can't even be a hyperthread on the same core; it would need to run on a different core entirely (hyperthreads on the same core compete for the floating-point units).
To test this, you can create all the threads before you start the timer and stop them all after you stop the timer (have them set a done flag instead of waiting on the join).
3- All your localSum entries share the same cache line. It would be better to accumulate into a local variable on the stack and store the result into the array once at the end, instead of adding directly into the array: https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
If, for some reason, you need to keep the running sum observable to others in that array, pad the localSum vector entries like this so they don't share a cache line:
struct localsumentry {
double sum;
char pad[56];
};
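The suggestion in point 3 (accumulate on the stack, write to the shared array once) could look like the following sketch. The function names partial_sum_worker and parallel_sum are assumptions made for illustration; the question's own threadsum signature is not shown in full, so this is not a drop-in replacement.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical worker: sums count elements starting at v into a stack-local
// accumulator and writes the shared output slot exactly once at the end, so
// threads do not repeatedly write to neighbouring entries of one cache line.
void partial_sum_worker(const double* v, std::size_t count, double* out) {
    double local = 0.0;
    for (std::size_t i = 0; i < count; i++) {
        local += v[i];
    }
    *out = local;
}

// Splits v into num_threads contiguous chunks, sums each chunk in its own
// thread, then combines the partial sums on the calling thread.
double parallel_sum(const std::vector<double>& v, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> threads;
    const std::size_t chunk = v.size() / num_threads;
    for (unsigned t = 0; t < num_threads; t++) {
        const std::size_t begin = t * chunk;
        // The last thread also takes any leftover elements.
        const std::size_t count = (t + 1 == num_threads) ? v.size() - begin : chunk;
        threads.emplace_back(partial_sum_worker, v.data() + begin, count, &partial[t]);
    }
    for (auto& th : threads) th.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

This removes the false sharing, but as points 1 and 2 note, a plain sum may still not speed up with more threads; measuring is the only way to know on a given machine.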
Hello, I'm having issues calculating the mean in my function. The program compiles, but instead of printing the intended answer of 64.2, it prints a random set of integers and characters.
This is not the entirety of the code, only the relevant variables and functions.
// main function and prototyping would be here
int size=0;
float values[]={10.1, 9.2, 7.9, 9.2, 13.0, 12.7, 11.3};
float mean(float values[], int size)
{
float sum = 0;
float mean = 0;
for (size = 0; size > 7; size++)
{
sum += values[size];
mean = sum / 7;
}
return mean;
}
Change your loop like so:
for (size = 0; size < 7; size++)
{
sum += values[size];
}
mean = sum / 7;
The terminating condition of your for loop isn't right.
Also, move the mean calculation out of the for loop.
for (size = 0; size > 7; size++)
Since size is initialized to 0, the test size > 7 is already false before the first iteration, so the loop body never executes.
Secondly, you calculate mean inside the loop when you should calculate it after the loop is complete. In principle you would still end up with the correct value, since the last iteration recomputes the mean from the full sum, but it is wasted work. You also overwrite the size parameter by reusing it as the loop counter.
float mean(float values[], int size)
{
float sum = 0;
float mymean = 0;
for (int i = 0; i < size; i++)
{
sum += values[i];
}
mymean = sum / size;
return mymean;
}
Why is the test size > 7 there? Zero is never going to be greater than 7. You likely mean size < 7, though hard-coding magic numbers like that invites trouble.
What you probably want is:
float mean(float* values, int size)
{
float sum = 0;
for (int i = 0; i < size; ++i)
sum += values[i];
return sum / size;
}
To be more idiomatic C++, you'd want that signature to be:
float mean(const float* values, const size_t size)
That way the compiler would catch any accidental modification of those values.
Suppose you had a function that takes a vector and a set of vectors, and finds which vector in the set is closest to the original vector. It may be useful if I include some code:
int findBMU(float * inputVector, float * weights){
int count = 0;
float currentDistance = 0;
int winner = 0;
float leastDistance = 99999;
for(int i = 0; i<10; i++){
for(int j = 0;j<10; j++){
for(int k = 0; k<10; k++){
int offset = (i*100+j*10+k)*644;
for(int i = offset; i<offset+644; i++){
currentDistance += abs((inputVector[count]-weights[i]))*abs((inputVector[count]-weights[i]));
count++;
}
currentDistance = sqrt(currentDistance);
count = 0;
if(currentDistance<leastDistance){
winner = offset;
leastDistance = currentDistance;
}
currentDistance = 0;
}
}
}
return winner;
}
In this example, weights is a single dimensional array, with a block of 644 elements corresponding to one vector. inputVector is the vector that's being compared, and it also has 644 elements.
To speed up my program, I decided to take a look at the CUDA framework provided by NVIDIA. This is what my code looked like once I changed it to fit CUDA's specifications.
__global__ void findBMU(float * inputVector, float * weights, int * winner, float * leastDistance){
int i = threadIdx.x+(blockIdx.x*blockDim.x);
if(i<1000){
int offset = i*644;
int count = 0;
float currentDistance = 0;
for(int w = offset; w<offset+644; w++){
currentDistance += abs((inputVector[count]-weights[w]))*abs((inputVector[count]-weights[w]));
count++;
}
currentDistance = sqrt(currentDistance);
count = 0;
if(currentDistance<*leastDistance){
*winner = offset;
*leastDistance = currentDistance;
}
currentDistance = 0;
}
}
To call the function, I used : findBMU<<<20, 50>>>(d_data, d_weights, d_winner, d_least);
But, when I would call the function, sometimes it would give me the right answer, and sometimes it wouldn't. After doing some research, I found that CUDA has some issues with reduction problems like these, but I couldn't find how to fix it. How can I modify my program to make it work with CUDA?
The issue is that threads that run concurrently will see the same leastDistance and overwrite each other's results. There are two values that are shared between threads: leastDistance and winner. You have two basic options. You can write out the results from all the threads and then do a second pass over the data with a parallel reduction to determine which vector had the best match, or you can implement this with a custom atomic operation using atomicCAS().
The first method is the easiest. My guess is that it will also give you the best performance, though it does add a dependency on the free Thrust library. You would use thrust::min_element().
The second method relies on the fact that atomicCAS() has a 64-bit mode, in which you can assign any semantics you wish to a 64-bit value. In your case, you would use 32 bits to store leastDistance and 32 bits to store winner. To use this method, adapt this example from the CUDA C Programming Guide, which implements a double-precision floating-point atomicAdd():
__device__ double atomicAdd(double* address, double val)
{
unsigned long long int* address_as_ull =
(unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
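To show the packing idea behind the atomicCAS() approach, here is a host-side C++ analog using std::atomic instead of CUDA intrinsics. It is only a sketch: all function names are assumptions, it assumes float is a 32-bit IEEE 754 type, and it relies on distances being non-negative (true for a Euclidean distance), because only then does comparing the raw bit patterns as unsigned integers match comparing the floats.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Packs a non-negative distance and an index into one 64-bit key. The float's
// bit pattern goes in the high 32 bits, so comparing two keys as unsigned
// integers orders them by distance first.
std::uint64_t pack(float distance, std::uint32_t index) {
    std::uint32_t bits;
    std::memcpy(&bits, &distance, sizeof bits);
    return (static_cast<std::uint64_t>(bits) << 32) | index;
}

// Recovers the index stored in the low 32 bits of a key.
std::uint32_t unpack_index(std::uint64_t key) {
    return static_cast<std::uint32_t>(key);
}

// Recovers the distance stored in the high 32 bits of a key.
float unpack_distance(std::uint64_t key) {
    const std::uint32_t bits = static_cast<std::uint32_t>(key >> 32);
    float distance;
    std::memcpy(&distance, &bits, sizeof distance);
    return distance;
}

// Compare-and-swap loop: installs the candidate only while it is smaller than
// the current best, so distance and index are always updated together.
void atomic_min_packed(std::atomic<std::uint64_t>& best, float distance, std::uint32_t index) {
    const std::uint64_t candidate = pack(distance, index);
    std::uint64_t current = best.load();
    while (candidate < current && !best.compare_exchange_weak(current, candidate)) {
        // On failure, compare_exchange_weak reloads current; the loop
        // condition then re-checks whether the candidate is still smaller.
    }
}
```

The shared value would start at the largest possible key (all bits set) so that the first candidate always wins. In a kernel, the same loop shape would be built on atomicCAS() over an unsigned long long, as in the atomicAdd() example above.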
I have the following code to compute a prefix sum:
#include <iostream>
#include<math.h>
using namespace std;
int Log(int n){
int count=1;
while (n!=0){
n>>=1;
count++;
}
return count;
}
int main(){
int x[16]={39,21,20,50,13,18,2,33,49,39,47,15,30,47,24,1};
int n=sizeof(x)/sizeof(int );
for (int i=0;i<=(Log(n)-1);i++){
for (int j=0;j<=n-1;j++){
if (j>=(std::powf(2,i))){
int t=powf(2,i);
x[j]=x[j]+x[j-t];
}
}
}
for (int i=0;i<n;i++)
cout<<x[i]<< " ";
return 0;
}
It is based on this Wikipedia page,
but I get the wrong result. What is wrong? Please help.
I’m not sure what your code is supposed to do but implementing a prefix sum is actually pretty easy. For example, this calculates the (exclusive) prefix sum of an iterator range using an arbitrary operation:
template <typename It, typename F, typename T>
inline void prescan(It front, It back, F op, T const& id) {
if (front == back) return;
typename iterator_traits<It>::value_type accu = *front;
*front++ = id;
for (; front != back; ++front) {
swap(*front, accu);
accu = op(accu, *front);
}
}
This follows the interface style of the C++ standard library algorithms.
To use it from your code, you could write the following:
prescan(x, x + n, std::plus<int>(), 0); // 0 is the identity element for addition
Are you trying to implement a parallel prefix sum? This only makes sense when you actually parallelize your code. As it stands, your code is executed sequentially and doesn’t gain anything from the more complex divide and conquer logic that you seem to employ.
Furthermore, there are indeed errors in your code. The most important one:
for(int i=0;i<=(Log(n)-1);i++)
Here, the number of iterations depends on your Log() function, which returns neither floor(log2(n)) nor ceil(log2(n)): for n = 16 it returns 6. The pseudo-code states that you in fact need to iterate until ceil(log2(n)) - 1.
Furthermore, consider this:
for (int j=0;j<=n-1;j++)
This isn’t very usual. Normally, you’d write such code as follows:
for (int j = 0; j < n; ++j)
Notice that I used < instead of <= and adjusted the bound from n - 1 to n. The same would hold for the outer loop, so you'd get:
for (int i = 0; i < std::log2(n); ++i)
Finally, instead of std::powf, you can use the fact that pow(2, x) is the same as 1 << x (i.e. taking advantage of the fact that operations base 2 are efficiently implemented as bit operations). This means that you can simply write:
if (j >= 1 << i)
x[j] += x[j - (1 << i)];
I think that std::partial_sum does what you want
http://www.sgi.com/tech/stl/partial_sum.html
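For reference, a minimal sketch of how std::partial_sum could be used here. The wrapper name inclusive_prefix_sum is an assumption made for this example; std::partial_sum itself lives in the <numeric> header.

```cpp
#include <numeric>
#include <vector>

// Returns the inclusive prefix sum of in: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}
```

For the array in the question you could also call it in place, with the input range and the output iterator pointing into the same array: std::partial_sum(x, x + n, x).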
The quickest way to get your algorithm working: Drop the outer for(i...) loop, instead setting i to 0, and use only the inner for (j...) loop.
int main(){
...
int i=0;
for (int j=0;j<=n-1;j++){
if (j>=(powf(2,i))){
int t=powf(2,i);
x[j]=x[j]+x[j-t];
}
}
...
}
Or equivalently:
for (int j=0; j<=n-1; j++) {
if (j>=1)
x[j] = x[j] + x[j-1];
}
...which is the intuitive way to do a prefix sum, and also probably the fastest non-parallel algorithm.
Wikipedia's algorithm is designed to be run in parallel, such that all of the additions are completely independent of each other. It reads all the values in, adds to them, then writes them back into the array, all in parallel. In your version, when you execute x[j]=x[j]+x[j-t], you're using the x[j-t] that you just added to, t iterations ago.
If you really want to reproduce Wikipedia's algorithm, here's one way, but be warned it will be much slower than the intuitive way above, unless you are using a parallelizing compiler and a computer with a whole bunch of processors.
int main() {
int x[16]={39,21,20,50,13,18,2,33,49,39,47,15,30,47,24,1};
int y[16];
int n=sizeof(x)/sizeof(int);
for (int i=0;i<=(Log(n)-1);i++){
for (int j=0;j<=n-1;j++){
y[j] = x[j];
if (j>=(powf(2,i))){
int t=powf(2,i);
y[j] += x[j-t];
}
}
for (int j=0;j<=n-1;j++){
x[j] = y[j];
}
}
for (int i=0;i<n;i++)
cout<<x[i]<< " ";
cout<<endl;
}
Side notes: You can use 1<<i instead of powf(2,i), for speed. And as ergosys mentioned, your Log() function needs work; the values it returns are too high, which won't affect the partial sum's result in this case, but will make it take longer.
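Putting the side notes together, the double-buffered version could be sketched as follows, with the stride doubled by a shift instead of powf and the loop ending as soon as the stride reaches n, so no separate Log() function is needed. The function name scan_by_doubling is an assumption made for this example, and the code still runs sequentially.

```cpp
#include <cstddef>
#include <vector>

// Inclusive prefix sum by stride doubling. Each pass reads only from x and
// writes only to y, so within one pass every addition is independent of the
// others, which is what would allow the passes to be parallelized.
std::vector<int> scan_by_doubling(std::vector<int> x) {
    const std::size_t n = x.size();
    std::vector<int> y(n);
    for (std::size_t stride = 1; stride < n; stride <<= 1) {
        for (std::size_t j = 0; j < n; j++) {
            y[j] = (j >= stride) ? x[j] + x[j - stride] : x[j];
        }
        x.swap(y);  // the freshly written buffer becomes the next input
    }
    return x;
}
```

Swapping the two buffers avoids the element-by-element copy back into x after every pass.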