I'm trying to write a piece of code that runs a loop of 8^12 iterations; in each iteration, when some conditions are met, I push_back to a vector (each thread has its own vector to push_back to, and I combine them after the loop). But it seems that the execution takes more time the more threads are active. Here's the function (a method of an object) passed to each thread:
void HamiltonianKH::mapping_kernel(ull_int start, ull_int stop, std::vector<ull_int>* map_threaded, int _id) {
int n = 1;
out << "A new thread joined tha party! from " << start << " to " << stop << endl;
for (ull_int j = start; j < stop; j++) {
int bSz = 0, fSz = 0, N_e = 0;
std::tie(bSz, fSz, N_e) = calculateSpinElements(this->L, j);
if ((bSz + fSz == this->Sz) && N_e == this->num_of_electrons)
map_threaded->push_back(j);
if (show_system_size_parameters == true && (j - start) % ull_int((stop - start) * n / 4) == 0 && j > 0) {
out << n << "-th quarter of " << _id << endl;
n++;
}
}
}
Here is the calculateSpinElements function:
std::tuple<int, int, int> calculateSpinElements(int L, ull_int& j) {
int bSz = 0; //bosonic total spin - spin of upper orbital locked to n=1 filling
int fSz = 0; //fermionic total spin
int N_e = 0; // numer of electrons in given state
std::vector<int> temp = int_to_binary(j, L);
for (int k = 0; k < L; k++) {
if (temp[k] < 4) bSz += 1;
else bSz -= 1;
if (temp[k] % 4 == 1) {
fSz += 1;
N_e += 1;
}
else if (temp[k] % 4 == 2) {
fSz -= 1;
N_e += 1;
}
else if (temp[k] % 4 == 3)
N_e += 2;
}
return std::make_tuple(bSz, fSz, N_e);
}
And here is the separation into threads:
void HamiltonianKH::generate_mapping() {
ull_int start = 0, stop = std::pow(8, L);
//mapping_kernel(start, stop, mapping, L, Sz, num_of_electrons);
//Threaded
std::vector<std::vector<ull_int>*> map_threaded(num_of_threads);
std::vector<std::thread> threads;
threads.reserve(num_of_threads);
for (int t = 0; t < num_of_threads; t++) {
start = t * (ull_int)std::pow(8, L) / num_of_threads;
stop = ((t + 1) == num_of_threads ? (ull_int)std::pow(8, L) : (ull_int)std::pow(8, L) * (t + 1) / num_of_threads);
map_threaded[t] = new std::vector<ull_int>();
threads.emplace_back(&HamiltonianKH::mapping_kernel, this, start, stop, map_threaded[t], t);
}
for (auto& t : threads) t.join();
// No explicit destructor call here: the std::thread objects are destroyed together with the vector.
ull_int size = 0;
for (auto& t : map_threaded) {
size += t->size();
}
out << "size = " << size << endl;
for (auto & t : map_threaded)
mapping->insert(mapping->end(), t->begin(), t->end());
//sort(mapping->begin(), mapping->end());
if (show_system_size_parameters == true) {
out << "Mapping generated with " << mapping->size() << " elements" << endl;
out << "Last element = " << mapping->at(mapping->size() - 1) << endl;
}
//out << mapping[0] << " " << mapping[mapping.size() - 1] << endl;
assert(mapping->size() > 0 && "Not possible number of electrons - no. of states < 1");
}
The variables: mapping, L, num_of_electrons and Sz are public fields in the object. The whole code has over 2000 lines, but the execution after the generate_mapping() call is irrelevant to the problem.
Does anyone have an idea why this piece of code takes longer to execute on more threads?
Thank you very much in advance.
My goal is to figure out whether each element of an array is a prime or not.
Example:
Input: int A[5]={1,2,3,4,5}
Output: bool P[5]={0,1,1,0,1}
The problem is that the array size is up to 10^6. I tried the most efficient prime-checking algorithm
(code: http://cpp.sh/9ewxa), but just the "cin" and "prime_checking" steps take a really long time. How should I solve this problem? Thanks.
Your "most efficient" prime test is actually horribly inefficient. Something like the Miller-Rabin primality test is much faster on a one-by-one basis. If your inputs are below 4.3 billion (i.e. they fit in uint32_t), then you only need to do 3 tests: a = 2, 7, and 61. For numbers in the uint64_t range it's 12 tests.
If you have a large array of integers then computing all primes up to some maximum might be faster than repeated tests. See Sieve of Eratosthenes for a good way to compute all primes fast. But it's impractical if your input numbers can be larger than 4 billion due to the memory required.
Here is some code that computes a Sieve up to UINT32_MAX+1 and then checks Miller-Rabin has the same results as the sieve: https://gist.github.com/mrvn/137fb0c8a5c78dbf92108b696ff82d92
#include <iostream>
#include <cstdint>
#include <array>
#include <ranges>
#include <cassert>
#include <bitset>
uint32_t pow_n(uint32_t a, uint32_t d, uint32_t n) {
if (d == 0) return 1;
if (d == 1) return a;
uint32_t t = pow_n(a, d / 2, n);
t = ((uint64_t)t * t) % n;
if (d % 2 == 0) {
return t;
} else {
return ((uint64_t)t * a) % n;
}
};
bool test(uint32_t n, unsigned s, uint32_t d, uint32_t a) {
//std::cout << "test(n = " << n << ", s = " << s << ", d = " << d << ", a = " << a << ")\n";
uint32_t x = pow_n(a ,d ,n);
//std::cout << " x = " << x << std::endl;
if (x == 1 || x == n - 1) return true;
for (unsigned i = 1; i < s; ++i) {
x = ((uint64_t)x * x) % n;
if (x == n - 1) return true;
}
return false;
}
bool is_prime(uint32_t n) {
static const std::array witnesses{2u, 3u, 5u, 7u, 11u, 13u, 17u, 19u, 23u, 29u, 31u, 37u};
static const std::array bounds{
2'047llu, 1'373'653llu, 25'326'001llu, 3'215'031'751llu,
2'152'302'898'747llu, 3'474'749'660'383llu,
341'550'071'728'321llu, 341'550'071'728'321llu /* no bounds for 19 */,
3'825'123'056'546'413'051llu,
3'825'123'056'546'413'051llu /* no bound for 29 */,
3'825'123'056'546'413'051llu /* no bound for 31 */,
(unsigned long long)UINT64_MAX /* off by a bit but it's the last bounds */,
};
static_assert(witnesses.size() == bounds.size());
if (n == 2) return true; // 2 is prime
if (n % 2 == 0) return false; // other even numbers are not
if (n <= witnesses.back()) { // I know the first few primes
return (std::ranges::find(witnesses, n) != std::end(witnesses));
}
// write n = 2^s * d + 1 with d odd
unsigned s = 0;
uint32_t d = n - 1;
while (d % 2 == 0) {
++s;
d /= 2;
}
// test witnesses until the bounds say it's a sure thing
auto it = bounds.cbegin();
for (auto a : witnesses) {
//std::cout << a << " ";
if (!test(n, s, d, a)) return false;
if (n < *it++) return true;
}
return true;
}
template<std::size_t N>
auto composite() {
std::bitset<N / 2 + 1> is_composite;
for (uint32_t i = 3; (uint64_t)i * i < N; i += 2) {
if (is_composite[i / 2]) continue;
for (uint64_t j = i * i; j < N; j += 2 * i) is_composite[j / 2] = true;
}
return is_composite;
}
bool slow_prime(uint32_t n) {
static const auto is_composite = composite<UINT32_MAX + 1llu>();
if (n < 2) return false;
if (n == 2) return true;
if (n % 2 == 0) return false;
return !is_composite.test(n / 2);
}
int main() {
/*
std::cout << "2047: ";
bool fast = is_prime(2047);
bool slow = slow_prime(2047);
std::cout << (fast ? "fast prime" : "");
std::cout << (slow ? "slow prime" : "");
std::cout << std::endl;
*/
//std::cout << "2: prime\n";
for (uint64_t i = 0; i <= UINT32_MAX; ++i) {
if (i % 1000000 == 1) { std::cout << "\r" << i << " "; std::cout.flush(); }
bool fast = is_prime(i);
bool slow = slow_prime(i);
if (fast != slow) std::cout << i << std::endl;
assert(fast == slow);
//std::cout << i << ": " << (is_prime(i) ? "prime" : "") << std::endl;
}
}
The sieve takes ~15 s to compute and uses 256 MB of memory; verifying Miller-Rabin over the same range takes ~12m45s (765 s), about 51 times as long as the sieve. This tells me that if you are testing more than about 85 million 32-bit numbers for primality, you should just compute them all with a sieve. Since the sieve is O(n log log n), the trade-off only gets better if your maximum input is smaller.
I wrote a function that computes a matrix inverse, and then a multithreaded version of it, because I need to invert arrays larger than 2000 x 2000.
A 1000x1000 array takes 2.5 seconds unthreaded (on an i5-4460, 4 cores, 2.9 GHz)
and 7.25 seconds multithreaded.
I placed the threads in the part that consumes the most time. What is wrong?
Is it because vectors are used instead of two-dimensional arrays?
This is the minimal code to test both versions:
#include <iostream>
#include <cmath>   // fabs
#include <cstdio>  // printf, getchar
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <thread>
const int NUCLEOS = 8;
#ifdef __linux__
#include <unistd.h> //usleep()
typedef std::chrono::system_clock t_clock; //try to use high_resolution_clock on new linux x64 computer!
#else
typedef std::chrono::high_resolution_clock t_clock;
#pragma warning(disable:4996)
#endif
using namespace std;
std::chrono::time_point<t_clock> start_time, stop_time = start_time; char null_char = '\0';
void timer(char *title = 0, int data_size = 1) { stop_time = t_clock::now(); double us = (double)chrono::duration_cast<chrono::microseconds>(stop_time - start_time).count(); if (title) printf("%s time = %7lgms = %7lg MOPs\n", title, (double)us*1e-3, (double)data_size / us); start_time = t_clock::now(); }
//makes columns 0
void colum_zero(vector< vector<double> > &x, vector< vector<double> > &y, int pos0, int pos1,int dim, int ord);
//returns inverse of x, x is not modified, not threaded
vector< vector<double> > inverse(vector< vector<double> > x)
{
if (x.size() != x[0].size())
{
cout << "ERROR on inverse() not square array" << endl; getchar(); return{};//returns a null
}
size_t dim = x.size();
int i, j, ord;
vector< vector<double> > y(dim,vector<double>(dim,0));//initializes output = 0
//init_2Dvector(y, dim, dim);
//1. Unity array y:
for (i = 0; i < dim; i++)
{
y[i][i] = 1.0;
}
double diagon, coef;
double *ptrx, *ptry, *ptrx2, *ptry2;
for (ord = 0; ord<dim; ord++)
{
//2. Make the diagonal of x equal to 1
int i2;
if (fabs(x[ord][ord])<1e-15) //If that element is 0, a line that contains a non zero is added
{
for (i2 = ord + 1; i2<dim; i2++)
{
if (fabs(x[i2][ord])>1e-15) break;
}
if (i2 >= dim)
return{};//error, returns null
for (i = 0; i<dim; i++)//added a line without 0
{
x[ord][i] += x[i2][i];
y[ord][i] += y[i2][i];
}
}
diagon = 1.0/x[ord][ord];
ptry = &y[ord][0];
ptrx = &x[ord][0];
for (i = 0; i < dim; i++)
{
*ptry++ *= diagon;
*ptrx++ *= diagon;
}
//uses the same function but not threaded:
colum_zero(x,y,0,dim,dim,ord);
}//end ord
return y;
}
//threaded version
vector< vector<double> > inverse_th(vector< vector<double> > x)
{
if (x.size() != x[0].size())
{
cout << "ERROR on inverse() not square array" << endl; getchar(); return{};//returns a null
}
int dim = (int) x.size();
int i, ord;
vector< vector<double> > y(dim, vector<double>(dim, 0));//initializes output = 0
//init_2Dvector(y, dim, dim);
//1. Unity array y:
for (i = 0; i < dim; i++)
{
y[i][i] = 1.0;
}
std::thread tarea[NUCLEOS];
double diagon;
double *ptrx, *ptry;// , *ptrx2, *ptry2;
for (ord = 0; ord<dim; ord++)
{
//2. Make the diagonal of x equal to 1
int i2;
if (fabs(x[ord][ord])<1e-15) //If a diagonal element=0 it is added a column that is not 0 the diagonal element
{
for (i2 = ord + 1; i2<dim; i2++)
{
if (fabs(x[i2][ord])>1e-15) break;
}
if (i2 >= dim)
return{};//error, returns null
for (i = 0; i<dim; i++)//It is looked for a line without zero to be added to make the number a non zero one to avoid later divide by 0
{
x[ord][i] += x[i2][i];
y[ord][i] += y[i2][i];
}
}
diagon = 1.0 / x[ord][ord];
ptry = &y[ord][0];
ptrx = &x[ord][0];
for (i = 0; i < dim; i++)
{
*ptry++ *= diagon;
*ptrx++ *= diagon;
}
int pos0 = 0, N1 = dim;//initial array position
if ((N1<1) || (N1>5000))
{
cout << "It is detected out than 1-5000 simulations points=" << N1 << " ABORT or press enter to continue" << endl; getchar();
}
//cout << "Initiation of " << NUCLEOS << " threads" << endl;
for (int thread = 0; thread<NUCLEOS; thread++)
{
int pos1 = (int)((thread + 1)*N1 / NUCLEOS);//next position
tarea[thread] = std::thread(colum_zero, std::ref(x), std::ref(y), pos0, pos1, dim, ord);//careful, coil current=1!
pos0 = pos1;//next thread will work at next point
}
for (int thread = 0; thread<NUCLEOS; thread++)
{
tarea[thread].join();
//cout << "Thread num: " << thread << " end\n";
}
}//end ord
return y;
}
//makes columns 0
void colum_zero(vector< vector<double> > &x, vector< vector<double> > &y, int pos0, int pos1,int dim, int ord)
{
double coef;
double *ptrx, *ptry, *ptrx2, *ptry2;
//Zero out column ord except for the diagonal element:
for (int i = pos0; i<pos1; i++)//Begin to end for every thread
{
if (i == ord) continue;
coef = x[i][ord];//element to make 0
if (fabs(coef)<1e-15) continue; //If already zero, it is avoided
ptry = &y[i][0];
ptry2 = &y[ord][0];
ptrx = &x[i][0];
ptrx2 = &x[ord][0];
for (int j = 0; j < dim; j++)
{
*ptry++ -= coef * (*ptry2++);//first matrix; compound assignment avoids reading and incrementing ptry in the same expression
*ptrx++ -= coef * (*ptrx2++);//second matrix
}
}
}
void test_6_inverse(int dim)
{
vector< vector<double> > vec1(dim, vector<double>(dim));
for (int i=0;i<dim;i++)
for (int j = 0; j < dim; j++)
{
vec1[i][j] = (-1.0 + 2.0*rand() / RAND_MAX) * 10000;
}
vector< vector<double> > vec2,vec3;
double ini, end;
ini = (double)clock();
vec2 = inverse(vec1);
end = (double)clock();
cout << "=== Time inverse unthreaded=" << (end - ini) / CLOCKS_PER_SEC << endl;
ini=end;
vec3 = inverse_th(vec1);
end = (double)clock();
cout << "=== Time inverse threaded=" << (end - ini) / CLOCKS_PER_SEC << endl;
cout<<vec2[2][2]<<" "<<vec3[2][2]<<endl;//to make the sw to do de inverse
cout << endl;
}
int main()
{
test_6_inverse(1000);
cout << endl << "=== END ===" << endl; getchar();
return 1;
}
After looking deeper into the code of the colum_zero() function, I have seen that one thread rewrites data that is used by other threads, so the threads are not INDEPENDENT of each other. Fortunately the compiler detects this and avoids it.
Conclusions:
It is not recommended to multithread the Gauss-Jordan method on its own
If somebody finds that the multithreaded version is slower even though the work is split correctly between threads, it may be because one thread's results are used by another
The main function inverse() works and can be used by other programmers, so this question should not be deleted
Unanswered question:
Which matrix inverse method can be split into many independent threads so that it can run on a GPU?
I wrote a C++ routine to find the nearest double element in a sorted array. Is there a way to speed it up?
There are two branches based on the value of the boolean reversed; if reversed is true, the array is sorted in decreasing order.
void findNearestNeighbourIndex_new(real_T value, real_T* x, int_T x_size, int_T& l_idx)
{
l_idx = -1;
bool reversed= (x[1] - x[0] < 0);
if ((!reversed&& value <= x[0])
|| (reversed&& value >= x[0])){
// Value is before first position in x
l_idx = 0;
}
else if ((!reversed&& value >= x[x_size - 1])
|| (reversed&& value <= x[x_size - 1])){
// Value is after last position in x
l_idx = x_size - 2;
}
else // All other cases
{
if (reversed)
{
for (int i = 0; i < x_size - 1; ++i)
{
if (value <= x[i] && value > x[i + 1])
{
l_idx = i;
break;
}
}
}
else{
for (int i = 0; i < x_size - 1; ++i)
{
if (value >= x[i] && value < x[i + 1])
{
l_idx = i;
break;
}
}
}
}
}
In this case, where the array is sorted, I do not see a way to do better. With profiling, I see that the comparison in if (value <= x[i] && value > x[i + 1]) is expensive.
EDIT
Tried with upper_bound():
std::vector<real_T> x_vec(x, x + x_size);
l_idx = std::upper_bound(x_vec.begin(), x_vec.end(), value) - x_vec.begin() - 1;
You can use std::lower_bound to find an element equal to or greater than the requested value, then move the iterator backwards and check the preceding value too. This uses binary search and costs O(log n); it also lets you use the standard STL comparators and so on.
If you don't actually have a vector to use with upper_bound(), you don't need to construct one, as that is going to be an O(n) operation. upper_bound() will work with the array that you have. You can use:
l_idx = std::upper_bound(x, x + size, value) - x - 1;
Test case:
#include <iostream>
#include <functional>
#include <algorithm>
int main()
{
const int size = 9;
int x[9] = {1,2,3,4,5,6,7,8,9};
auto pos = std::upper_bound(x, x + size, 5) - x;
std::cout << "position: " << pos;
return 0;
}
Output:
position: 5
This is because upper_bound() returns an iterator to 6, the first element greater than 5.
The way to do it is to subtract 1 from size (so that it works above the highest value) and 0.5 from the target value to make it accurate:
#include <iostream>
#include <functional>
#include <algorithm>
using namespace std;
int main()
{
float x[10] = { 2,3,4,5,6,7,8,9,10,11 },y;
int size = sizeof(x) / sizeof(x[0]),pos;
y = 4.1; pos = std::upper_bound(x, x + size - 1, y - 0.5) - x;
std::cout << "position: " << pos << " target value=" << y << " upper_bound=" << x[pos] << endl;
y = 4.9; pos = std::upper_bound(x, x + size - 1, y - 0.5) - x;
std::cout << "position: " << pos << " target value=" << y << " upper_bound=" << x[pos] << endl;
y = -0.5; pos = std::upper_bound(x, x + size - 1, y - 0.5) - x;
std::cout << "position: " << pos << " target value=" << y << " upper_bound=" << x[pos] << endl;
y = 100; pos = std::upper_bound(x, x + size - 1, y - 0.5) - x;
std::cout << "position: " << pos << " target value=" << y << " upper_bound=" << x[pos] << endl;
getchar();
return 0;
}
I implemented this helper routine:
void findNearestNeighbourIndex_bin_search_new(real_T value, real_T* x,
int_T start, int_T stop, int_T& l_idx)
{
// Midpoint of [start, stop]; the offset by start is needed so the range actually shrinks.
int_T mid = start + ( stop - start ) / 2;
if (value >= x[mid+1])
{
findNearestNeighbourIndex_bin_search_new(value, x, mid + 1, stop, l_idx);
}
else if (value < x[mid])
{
findNearestNeighbourIndex_bin_search_new(value, x, start, mid, l_idx);
}
else
{
l_idx = mid;
return;
}
}
I'm afraid this is a fairly long piece of code. I'm programming a parallel, recursive, task-based version of Euler's partition formula with Intel TBB and C++. I don't think there's much of a problem with the program's logic, but I have a feeling the variables are being accessed wrongly, and I might have declared them in the wrong place or something. I say this because inputting a number n should always give the same result, and it does below n = 11, but above that it gives different answers. Even stranger, adding lines of output to try to troubleshoot the program results in slightly more accurate answers (as if padding the time each part of the calculation takes somehow helps). I have no idea how to avoid this problem or which variable exactly is causing it, as the answer is usually fairly close; it's not just a random number. So this is a bit of a tricky one, I apologise, but if someone could help me I would be very thankful. I've spent a number of hours on this problem.
Here's the parallel task:
class ParallelFormula : public task {
public:
int n;
int* pTot;
//Task constructor
ParallelFormula(int n_, int* pTot_) : n(n_), pTot(pTot_) {}
//Task definition
task* execute() {
//Iterating for formula to work
for (int k = 1; k > 0; k++) {
//Add fixed values to pTot for any case where 2 >= n >= 0
switch (n) {
case 0:
if (k % 2 != 0)
*pTot += 1;
else
*pTot -= 1;
return NULL;
case 1:
if (k % 2 != 0)
*pTot += 1;
else
*pTot -= 1;
return NULL;
case 2:
if (k % 2 != 0)
*pTot += 2;
else
*pTot -= 2;
return NULL;
}
//Calculate p numbers using section of Euler's formula (relies on iteration number)
p1 = (k*((3 * k) - 1)) / 2;
p2 = (k*((3 * k) + 1)) / 2;
if (n >= p2) {
//If n is more than p2, must call recursive tasks to break down problem to smaller n's, and adds result to total result pTot (i.e. p(n))
int x = 0;
int y = 0;
ParallelFormula& a = *new(allocate_child()) ParallelFormula(n - p1, &x);
ParallelFormula& b = *new(allocate_child()) ParallelFormula(n - p2, &y);
//Set ref_count to two children plus one for the wait
set_ref_count(3);
//Start b running
spawn(b);
//Start a running and wait for all children (a and b)
spawn_and_wait_for_all(a);
//Sum the total
if (k % 2 != 0)
*pTot += (x + y);
else
*pTot -= (x + y);
}
else if (n >= p1) {
//If n is more than p1, problem is small and therefore need not be parallelised, result added to pTot
if (k % 2 != 0)
*pTot += serialLoop(n - p1);
else
*pTot -= serialLoop(n - p1);
return NULL;
}
else
return NULL;
}
}
};
The method that calls the parallel task:
int parallelLoop(int n) {
int pTot = 0;
ParallelFormula& a = *new(task::allocate_root()) ParallelFormula(n, &pTot);
task::spawn_root_and_wait(a);
return pTot;
}
In case you want to look at the full code for all the context:
// Assignment2.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "iostream"
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_reduce.h"
#include "tbb/partitioner.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"
#include "math.h"
using namespace tbb;
using namespace std;
int p, p1, p2;
int serialLoop(int n);
int n;
int m;
int serialFormula(int pTemp) {
switch (pTemp) {
case 0:
return 1;
case 1:
return 1;
case 2:
return 2;
}
//If p is any other value it is less than 0 and therefore has nothing to calculate - the current calculation is complete
return 0;
}
int serialLoop(int n) {
int pTot = 0;
for (int k = 1; k > 0; k++) {
//Checking whether k is even or odd to determine if adding or substracting value of p(x) to make p(n)
if (n == 0)
return pTot += 1;
else if (k % 2 != 0) {
//Calculate p number using section of Euler's formula
p = n - ((k*((3 * k) - 1)) / 2);
//If p is more than 2, must call recursive function to break down problem to smaller n's, and adds result to total result P (i.e. p(n))
if (p > 2) {
pTot += serialLoop(p);
}
else if (p >= 0) {
pTot += serialFormula(p);
}
else return pTot;
p = n - ((k*((3 * k) + 1)) / 2);
if (p > 2) {
pTot += serialLoop(p);
}
else if (p >= 0) {
pTot += serialFormula(p);
}
else return pTot;
}
else {
p = n - ((k*((3 * k) - 1)) / 2);
if (p > 2) {
pTot -= serialLoop(p);
}
else if (p >= 0) {
pTot -= serialFormula(p);
}
else return pTot;
p = n - ((k*((3 * k) + 1)) / 2);
if (p > 2) {
pTot -= serialLoop(p);
}
else if (p >= 0) {
pTot -= serialFormula(p);
}
else return pTot;
}
}
}
class ParallelFormula : public task {
public:
int n;
int* pTot;
//Task constructor
ParallelFormula(int n_, int* pTot_) : n(n_), pTot(pTot_) {}
//Task definition
task* execute() {
//Checking task is called
for (int k = 1; k > 0; k++) {
//Calculate p number using section of Euler's formula
switch (n) {
case 0:
if (k % 2 != 0)
*pTot += 1;
else
*pTot -= 1;
cout << "Case 0" << endl;
cout << *pTot << endl;
return NULL;
case 1:
if (k % 2 != 0)
*pTot += 1;
else
*pTot -= 1;
cout << "Case 1" << endl;
cout << *pTot << endl;
return NULL;
case 2:
if (k % 2 != 0)
*pTot += 2;
else
*pTot -= 2;
cout << "Case 2" << endl;
cout << *pTot << endl;
return NULL;
}
p1 = (k*((3 * k) - 1)) / 2;
p2 = (k*((3 * k) + 1)) / 2;
if (n >= p2) {
//If p is more than 2, must call recursive function to break down problem to smaller n's, and adds result to total result P (i.e. p(n))
int x = 0;
int y = 0;
ParallelFormula& a = *new(allocate_child()) ParallelFormula(n - p1, &x);
ParallelFormula& b = *new(allocate_child()) ParallelFormula(n - p2, &y);
//Set ref_count to two children plus one for the wait
set_ref_count(3);
//Start b running
spawn(b);
//Start a running and wait for all children (a and b)
spawn_and_wait_for_all(a);
//Sum the total
if (k % 2 != 0)
*pTot += (x + y);
else
*pTot -= (x + y);
cout << "Double p" << endl;
cout << *pTot << endl;
}
else if (n >= p1) {
if (k % 2 != 0)
*pTot += serialLoop(n - p1);
else
*pTot -= serialLoop(n - p1);
cout << "Single p" << endl;
cout << *pTot << endl;
return NULL;
}
else
return NULL;
}
}
};
int parallelLoop(int n) {
int pTot = 0;
ParallelFormula& a = *new(task::allocate_root()) ParallelFormula(n, &pTot);
task::spawn_root_and_wait(a);
return pTot;
}
int main()
{
//Take inputs n and m.
cout << "Enter partition number n:" << endl;
cin >> n;
cout << "Enter modulo m:" << endl;
cin >> m;
//Start timer for serial method
tick_count serial_start = tick_count::now();
//Serial method for computing partition function modulo m.
int sP = serialLoop(n);
int serialMod = sP % m;
//Finish timer for serial method
tick_count serial_end = tick_count::now();
//Output serial results
cout << "Serial result for p(n) is: " << sP << endl;
cout << "Serial result for p(n) mod m is: " << serialMod << endl;
cout << "Serial time (s): " << (serial_end - serial_start).seconds() << endl;
//Start timer for parallel method
tick_count parallel_start = tick_count::now();
//Parallel method for computing partition function
int pP = parallelLoop(n);
int parallelMod = pP % m;
//Finish timer for parallel method
tick_count parallel_end = tick_count::now();
//Output parallel results
cout << "Parallel result for p(n) is: " << pP << endl;
cout << "Parallel result for p(n) mod m is: " << parallelMod << endl;
cout << "Parallel time (s): " << (parallel_end - parallel_start).seconds() << endl;
//Acceleration achieved
cout << "Acceleration achieved was: " << (serial_end - serial_start).seconds() / (parallel_end - parallel_start).seconds() << endl;
return 0;
};
P.S. This was partly based off of the Fibonacci sequence example in the Intel TBB documentation, so if I've done something seriously dumb by following that example then I apologise for that too XD.
The variables p1 and p2 are global but you write to them in ParallelFormula::execute concurrently. Try to declare them inside the ParallelFormula::execute method, e.g.
int p1 = (k*((3 * k) - 1)) / 2;
int p2 = (k*((3 * k) + 1)) / 2;
Also do not forget about the p variable in int serialLoop(int n) since you call this function from ParallelFormula::execute.
I need to generate 4 random numbers, each in the range [-45, +45] degrees: I draw an angle, and if rand() % 2 == 0 I want the result to be equal to -angle. Once the 4 random numbers are generated, it is required to scan these angles and find a lock (the point at which the angles meet). The offsets -3, -2, -1, ..., +3 in the if statement inside the loop indicate that the lock takes place within a 6-degree beamwidth. The code works, but can it be simplified? The objective is to establish a lock between 2 points by scanning the elevation and azimuth angles at both points.
#include <iostream>
#include <conio.h>
#include <time.h>
using namespace std;
class Cscan
{
public:
int gran, lockaz, lockel;
};
int main()
{
srand (time(NULL));
int az1, az2, el1, el2, j, k;
Cscan BS1, BS2; // the two stations used below; the declaration was missing from the snippet
BS1.lockaz = rand() % 46;
BS1.lockel = rand() % 46;
BS2.lockaz = rand() % 46;
BS2.lockel = rand() % 46;
k = rand() % 2;
if(k == 0)
k = -1;
BS1.lockaz = k*BS1.lockaz;
k = rand() % 2;
if(k == 0)
k = -1;
BS1.lockel = k*BS1.lockel;
k = rand() % 2;
if(k == 0)
k = -1;
BS2.lockaz = k*BS2.lockaz;
k = rand() % 2;
if(k == 0)
k = -1;
BS2.lockel = k*BS2.lockel;
for(az1=-45; az1<=45; az1=az1+4)
{
for(el1=-45; el1<=45; el1=el1+4)
{
for(az2=-45; az2<=45; az2=az2+4)
{
for(el2=-45; el2<=45; el2=el2+4)
{
if((az1==BS1.lockaz-3||az1==BS1.lockaz-2||az1==BS1.lockaz-1||az1==BS1.lockaz||az1==BS1.lockaz+1||az1==BS1.lockaz+2||az1==BS1.lockaz+3)&&
(az2==BS2.lockaz-3||az2==BS2.lockaz-2||az2==BS2.lockaz-1||az2==BS2.lockaz||az2==BS2.lockaz+1||az2==BS2.lockaz+2||az2==BS2.lockaz+3)&&
(el1==BS1.lockel-3||el1==BS1.lockel-2||el1==BS1.lockel-1||el1==BS1.lockel||el1==BS1.lockel+1||el1==BS1.lockel+2||el1==BS1.lockel+3)&&
(el2==BS2.lockel-3||el2==BS2.lockel-2||el2==BS2.lockel-1||el2==BS2.lockel||el2==BS2.lockel+1||el2==BS2.lockel+2||el2==BS2.lockel+3))
{
cout << "locked \n" << BS1.lockaz << " " << BS1.lockel << " " << BS2.lockaz << " " << BS2.lockel <<endl
<< az1 << " " << el1 << " " << az2 << " " << el2 << endl;
k = 1;
break;
}
if(k==1)
break;
}
if(k==1)
break;
}
if(k==1)
break;
}
if(k==1)
break;
}
_getch();
}
BS1.lockaz = rand() % 91 - 45;
BS1.lockel = rand() % 91 - 45;
BS2.lockaz = rand() % 91 - 45;
BS2.lockel = rand() % 91 - 45;
Integer angles in degrees? Very questionable. Something "physical" like an angle is normally best expressed as a floating-point number, so I'd first change
typedef double angle;
struct Cscan { // why class? This is clearly POD
int gran; //I don't know what gran is. Perhaps this should also be floating-point.
angle lockaz, lockel;
};
That seems to make it more difficult at first sight because neither the random-range-selection with % works anymore nor is it much use to compare floats for equality. Which is, however, a good thing, because all of this is in fact very bad practise.
If you want to keep using rand() as the random number generator (though I'd suggest std::uniform_real_distribution), write a function to do this:
const double pi = 3.141592653589793; // Let's use radians internally, not degrees.
const angle rightangle = pi/2.; // It's much handier for real calculations.
inline angle deg2rad(angle dg) {return dg * rightangle / 90.;}
angle random_in_sym_rightangle() {
return rightangle * ( ((double) rand()) / ((double) RAND_MAX) - .5 );
}
Now you'd just do
BS1.lockaz = random_in_sym_rightangle();
BS1.lockel = random_in_sym_rightangle();
BS2.lockaz = random_in_sym_rightangle();
BS2.lockel = random_in_sym_rightangle();
Then you need to do this range-checking. That's again something to put in a dedicated function
bool equal_in_margin(angle theta, angle phi, angle margin) {
return (theta > phi-margin && theta < phi+margin);
}
Then you do this exhaustive search for locks. This could definitely be done more efficiently, but that's an algorithm issue and has nothing to do with the language. Sticking to the for loops, you can still make them look much nicer by avoiding this explicit break checking. One way is good old goto, I'd propose here you just stick it in an extra function and return when you're done
#define TRAVERSE_SYM_RIGHTANGLE(phi) \
for ( angle phi = -pi/4.; phi < pi/4.; phi += deg2rad(4) )
int lock_k // better give this a more descriptive name
( const Cscan& BS1, const Cscan& BS2, int k ) {
TRAVERSE_SYM_RIGHTANGLE(az1) {
TRAVERSE_SYM_RIGHTANGLE(el1) {
TRAVERSE_SYM_RIGHTANGLE(az2) {
TRAVERSE_SYM_RIGHTANGLE(el2) {
if( equal_in_margin( az1, BS1.lockaz, deg2rad(6.) )
&& equal_in_margin( el1, BS1.lockel, deg2rad(6.) )
&& equal_in_margin( az2, BS2.lockaz, deg2rad(6.) )
&& equal_in_margin( el2, BS2.lockel, deg2rad(6.) ) ) {
std::cout << "locked \n" << BS1.lockaz << " " << BS1.lockel << " " << BS2.lockaz << " " << BS2.lockel << '\n'
<< az1 << " " << el1 << " " << az2 << " " << el2 << std::endl;
return 1;
}
}
}
}
}
return k;
}