general tbb issue for calculating fibonacci numbers - c++

I came across the TBB template below as an example of task-based programming for calculating the sum of Fibonacci numbers in C++. But when I run it I get a value of 1717986912, which can't be right. The output should be 3. What am I doing wrong?
class FibTask: public task
{
public:
const long n;
long * const sum;
FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
task* execute( )
{
// Overrides virtual function task::execute
if( n < 0)
{
return 0;
}
else
{
long x, y;
FibTask& a = *new( allocate_child( ) ) FibTask(n-1,&x);
FibTask& b = *new( allocate_child( ) ) FibTask(n-2,&y);
// Set ref_count to "two children plus one for the wait".
set_ref_count(3);
// Start b running.
spawn( b );
// Start a running and wait for all children (a and b).
spawn_and_wait_for_all( a );
// Do the sum
*sum = x+y;
}
return NULL;
}
long ParallelFib( long n )
{
long sum;
FibTask& a = *new(task::allocate_root( )) FibTask(n,&sum);
task::spawn_root_and_wait(a);
return sum;
}
};
long main(int argc, char** argv)
{
FibTask * obj = new FibTask(3,0);
long b = obj->ParallelFib(3);
std::cout << b;
return 0;
}

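The cutoff is wrong here: it must be at least 2, and the base case has to store a result through sum. With n < 0 the leaf tasks return without ever writing *sum, so their parents add up uninitialized x and y. E.g.: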
if( n<2 ) {
*sum = n;
return NULL;
}
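To see why the base case matters without bringing in TBB, here is a minimal serial sketch that keeps the same out-parameter pattern as FibTask::execute. The name fibInto is made up for illustration, and the plain recursive calls only stand in for the spawned child tasks.

```cpp
#include <cassert>

// Serial stand-in for FibTask::execute (hypothetical name, no TBB involved).
// Every path must store a result through `sum`, including the base case.
void fibInto(long n, long* sum)
{
    assert(sum != nullptr);
    if (n < 2) {
        *sum = n;  // base case writes the result; with "n < 0" nothing was written
        return;
    }
    long x, y;
    fibInto(n - 1, &x);  // plays the role of child task a
    fibInto(n - 2, &y);  // plays the role of child task b
    *sum = x + y;        // safe: both x and y were assigned by the calls above
}
```

With this base case, fibInto(3, &result) stores 2 in result, since the sequence starts 0, 1, 1, 2.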
The original example also uses SerialFib, as shown here: http://www.threadingbuildingblocks.org/docs/help/tbb_userguide/Simple_Example_Fibonacci_Numbers.htm
This recursive method for calculating Fibonacci numbers is inefficient to begin with, and the blocking-style technique adds further overhead. Without the call to SerialFib() below the cutoff it becomes even more inefficient, because a task is created for every tiny subproblem.
WARNING: Please note that this example is intended just to demonstrate this particular low-level TBB API and this particular way of using it. It is not intended for reuse unless you are really sure why you are doing this.
A version using the modern high-level API (though still for the inefficient Fibonacci algorithm) would look like this:
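#include "tbb/parallel_invoke.h"
// CUTOFF and fibSerial were not defined in the original snippet;
// the definitions below are illustrative.
const int CUTOFF = 16; // 2 is minimum
int fibSerial(int n) { return n < 2 ? n : fibSerial(n-1) + fibSerial(n-2); }
int Fib(int n) {
    if( n<CUTOFF ) {
        return fibSerial(n);
    } else {
        int x, y;
        // Run both halves as parallel tasks; returns when both are done.
        tbb::parallel_invoke([&]{x=Fib(n-1);}, [&]{y=Fib(n-2);});
        return x+y;
    }
}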

Related

How to get tasks working with TBB

The code below compiles but appears to get stuck somewhere in the tasks I make with Intel TBB. It simply runs and displays nothing and I have to kill the program to end it. Basically, I modelled this after an example in a book and I probably did it incorrectly. What am I doing incorrectly with these tasks? I am using g++ 4.8.4 and think I am using TBB 3.9.
/*
g++ test0.cpp -o test0.out -std=c++11 -ltbb
*/
#include <iostream>
#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"
using namespace tbb;
long serial_fibo(long n) {
if(n < 2) {
return n;
} else {
return serial_fibo(n - 1) + serial_fibo(n - 2);
}
}
class Fibo_Task: public task {
public:
const long n;
long* const sum;
Fibo_Task(long _n_, long* _sum_) :
n(_n_), sum(_sum_) {}
// override virtual function task::execute
task *execute() {
if(n < 4) {
*sum = serial_fibo(n);
} else {
long x = 0, y = 0;
// references x
Fibo_Task& a =
*new(task::allocate_root())
Fibo_Task(n - 1, &x);
// references y
Fibo_Task& b =
*new(task::allocate_root())
Fibo_Task(n - 2, &y);
// two children and another to wait
set_ref_count(3);
spawn(a);
spawn_and_wait_for_all(b);
*sum = x + y;
}
return NULL;
}
};
long parallel_fibo(long n) {
long sum;
Fibo_Task& a =
*new(task::allocate_root())
Fibo_Task(n, &sum);
task::spawn_root_and_wait(a);
return sum;
}
int main() {
task_scheduler_init init;
long number = 8;
long first = serial_fibo(number);
long second = parallel_fibo(number);
std::cout << "first: " << first << "\n";
std::cout << "second: " << second << "\n";
return 0;
}
You allocated 'root' tasks instead of 'child' tasks. The difference is that allocate_root() creates an independent task that does not point to anything as its successor. As a result, wait_for_all() never receives the signals that the tasks have completed, and it hangs.
You can find the correct original example in the TBB documentation here.
Or you can fix yours by adding a... and b.set_parent(this), which effectively removes the difference between allocate_root() and allocate_child(), as I implemented here.

can an std::promise be made from a non-POD object?

One of the things my app does is listen for and receive payloads from a socket. I never want to block. On each payload received, I want to create an object and pass it to a worker thread and forget about it until later, which is how the prototype code works. But for the production code I want to reduce complexity (my app is large) by using the convenient async method. async takes a future made from a promise. For that to work I need to create a promise on my non-POD object, represented below by the Xxx class. I don't see any way to do that (see the error in my sample code below). Is it appropriate to use async here? If so, how can I construct a promise/future object that is more complex than int (all code examples I've seen use either int or void):
#include <future>
class Xxx //non-POD object
{
int i;
public:
Xxx( int i ) : i( i ) {}
int GetSquare() { return i * i; }
};
int factorial( std::future< Xxx > f )
{
int res = 1;
auto xxx = f.get();
for( int i = xxx.GetSquare(); i > 1; i-- )
{
res *= i;
}
return res;
}
int _tmain( int argc, _TCHAR* argv[] )
{
Xxx xxx( 2 ); // 2 represents one payload from the socket
std::promise< Xxx > p; // error: no appropriate default constructor available
std::future< Xxx > f = p.get_future();
std::future< int > fu = std::async( factorial, std::move( f ) );
p.set_value( xxx );
fu.wait();
return 0;
}
As Mike already answered, it's definitely a bug in the Visual C++ implementation of std::promise; what you're doing should work.
But I'm curious why you need to do it anyway. Maybe there's some other requirement that you've not shown to keep the example simple, but this would be the obvious way to write that code:
#include <future>
class Xxx //non-POD object
{
int i;
public:
Xxx( int i ) : i( i ) {}
int GetSquare() { return i * i; }
};
int factorial( Xxx xxx )
{
int res = 1;
for( int i = xxx.GetSquare(); i > 1; i-- )
{
res *= i;
}
return res;
}
int main()
{
Xxx xxx( 2 ); // 2 represents one payload from the socket
std::future< int > fu = std::async( factorial, std::move( xxx ) );
int fact = fu.get();
}
It sounds like your implementation is defective. There should be no need for a default constructor (per the general library requirements of [utility.arg.requirements]), and GCC accepts your code (after changing the weird Microsoftish _tmain to a standard main).
I'd switch to a different compiler and operating system. That might not be an option for you, so maybe you could give the class a default constructor to keep it happy.
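If switching compilers is not possible, the default-constructor workaround might look roughly like the sketch below. The helper squareViaPromise is a made-up name for illustration, and the value 0 chosen for the default-constructed state is an assumption. Both sides of the promise/future pair run on the same thread here just to keep the example small.

```cpp
#include <cassert>
#include <future>

class Xxx  // non-POD object
{
    int i;
public:
    Xxx() : i(0) {}        // workaround only: added to satisfy the defective library
    Xxx(int i) : i(i) {}
    int GetSquare() const { return i * i; }
};

// Hypothetical helper: passes one payload through a promise/future pair.
int squareViaPromise(int payload)
{
    std::promise<Xxx> p;
    std::future<Xxx> f = p.get_future();
    assert(f.valid());           // the future shares state with the promise
    p.set_value(Xxx(payload));   // producer side
    return f.get().GetSquare();  // consumer side; the value is already ready
}
```

On a conforming implementation the default constructor can simply be deleted again, since std::promise does not require it.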

Condensing a do-while loop to a #define macro

Consider the following sample code (I actually work with longer binary strings but this is enough to explain the problem):
void enumerateAllSubsets(unsigned char d) {
unsigned char n = 0;
do {
cout<<binaryPrint(n)<<",";
} while ( n = (n - d) & d );
}
The function (due to Knuth) effectively loops through all subsets of a binary string.
For example:
33 = '00100001' in binary, and enumerateAllSubsets(33) would produce:
00000000, 00100000, 00000001, 00100001.
I need to write a #define which would make
macroEnumerate(n,33)
cout<<binaryPrint(n)<<",";
behave in a way equivalent to enumerateAllSubsets(33). (well, the order might be rearranged)
Basically, I need the ability to perform various operations on subsets of a set.
Doing something similar with for-loops is trivial:
for(int i=0;i < a.size();i++)
foo(a[i]);
can be replaced with:
#define foreach(index,container) for(int index=0;index < container.size();index++)
...
foreach(i,a)
foo(a[i]);
The problem with enumerateAllSubsets() is that the loop body needs to be executed once unconditionally and as a result the do-while cannot be rewritten as for.
I know that the problem can be solved by STL-style templated function and a lambda passed to it (similar to STL for_each function), but some badass #define macro seems like a cleaner solution.
Assuming C++11, define a range object:
#include <iostream>
#include <iterator>
#include <cstdlib>
template <typename T>
class Subsets {
public:
Subsets(T d, T n = 0) : d_(d), n_(n) { }
Subsets begin() const { return *this; }
Subsets end() const { return {0, 0}; }
bool operator!=(Subsets const & i) const { return d_ != i.d_ || n_ != i.n_; }
Subsets & operator++() {
if (!(n_ = (n_ - d_) & d_)) d_ = 0;
return *this;
}
T operator*() const { return n_; }
private:
T d_, n_;
};
template <typename T>
inline Subsets<T> make_subsets(T t) { return Subsets<T>(t); }
int main(int /*argc*/, char * argv[]) {
int d = atoi(argv[1]);
for (auto i : make_subsets(d))
std::cout << i << "\n";
}
I've made it quite general in case you want to work with, e.g., uint64_t.
One option would be to use a for loop that always runs at least once, such as this:
for (bool once = true; once? (once = false, true) : (n = (n - d) & d); )
// loop body
On the first iteration, the once variable gets cleared and the expression evaluates to true, so the loop executes. From that point forward, the actual test-and-step logic controls the loop.
From here, rewriting this to a macro should be a lot easier.
Hope this helps!
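One possible way to turn the run-at-least-once for loop into the requested macro is sketched below. It declares the loop variable inside the for-init, so the call site looks like macroEnumerate(n, 33) followed by a body. The helper flag built with token pasting (n##_first) and the function collectSubsets are assumptions made for this example, and the loop variable is fixed to unsigned here.

```cpp
#include <cassert>
#include <vector>

// Declares `n` inside the for-init so callers need no separate variable.
// `n##_first` is a helper flag, named after the loop variable via token pasting.
// First pass: the flag is cleared and the body runs with n == 0.
// Later passes: Knuth's step computes the next subset; 0 ends the loop.
#define macroEnumerate(n, d)                          \
    for (unsigned n = 0, n##_first = 1;               \
         n##_first ? (n##_first = 0, true)            \
                   : ((n = (n - (d)) & (d)) != 0);)

// Hypothetical helper that collects the subsets so the order can be inspected.
std::vector<unsigned> collectSubsets(unsigned d)
{
    std::vector<unsigned> out;
    macroEnumerate(n, d)
        out.push_back(n);
    assert(!out.empty());  // the empty subset is always produced
    return out;
}
```

For d = 33 this visits 0, 1, 32, 33 in that order. Because `d` is evaluated on every iteration, it should be an expression without side effects.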
You can do a multiline macro that uses an expression, like this:
#define macroenum(n, d, expr) \
    n = 0; \
    do { \
        (expr); \
    } while ((n = (n - (d)) & (d)))
int main(int argc, const char* argv[])
{
enumerateAllSubsets(33);
int n;
macroenum(n, 33, cout << n << ",");
}
As others have mentioned this will not be considered very clean by many - amongst other things, it relies on the variable 'n' existing in scope. You may need to wrap expr in another set of parens, but I tested it with g++ and got the same output as enumerateAllSubsets.
It seems like your goal is to be able to do something like enumerateAllSubsets but change the action performed for each iteration.
In C++ you can do this with a function template in a header file:
template<typename Func>
inline void enumerateAllSubsets(unsigned char d, Func f)
{
unsigned char n = 0;
do { f(n); } while ( n = (n - d) & d );
}
Sample usage (the generic lambda with an auto parameter requires C++14):
enumerateAllSubsets(33, [](auto n) { cout << binaryPrint(n) << ','; } );

convert from recursive to iterative function cuda c++

I'm working on a genetic program in which I am porting some of the heavy lifting into CUDA. (Previously just OpenMP).
It's not running very fast, and I'm getting an error related to the recursion:
Stack size for entry function '_Z9KScoreOnePdPiS_S_P9CPPGPNode' cannot be statically determined
I've added a lump of the logic which runs on CUDA. I believe it's enough to show how it's working. I'd be happy to hear about other optimizations I could add, but I would really like to take the recursion out if it will speed things up.
Examples on how this could be achieved are very welcome.
__device__ double Fadd(double a, double b) {
return a + b;
};
__device__ double Fsubtract(double a, double b) {
return a - b;
};
__device__ double action (int fNo, double aa , double bb, double cc, double dd) {
switch (fNo) {
case 0 :
return Fadd(aa,bb);
case 1 :
return Fsubtract(aa,bb);
case 2 :
return Fmultiply(aa,bb);
case 3 :
return Fdivide(aa,bb);
default:
return 0.0;
}
}
__device__ double solve(int node,CPPGPNode * dev_m_Items,double * var_set) {
if (dev_m_Items[node].is_terminal) {
return var_set[dev_m_Items[node].tNo];
} else {
double values[4];
for (unsigned int x = 0; x < 4; x++ ) {
if (x < dev_m_Items[node].fInputs) {
values[x] = solve(dev_m_Items[node].children[x],dev_m_Items,var_set);
} else {
values[x] = 0.0;
}
}
return action(dev_m_Items[node].fNo,values[0],values[1],values[2],values[3]);
}
}
__global__ void KScoreOne(double *scores,int * root_nodes,double * targets,double * cases,CPPGPNode * dev_m_Items) {
int pid = blockIdx.x;
// We only work if this node needs to be calculated
if (root_nodes[pid] != -1) {
for (unsigned int case_no = 0; case_no < FITNESS_CASES; case_no ++) {
double result = solve(root_nodes[pid],dev_m_Items,&cases[case_no]);
double target = targets[case_no];
scores[pid] += abs(result - target);
}
}
}
I'm having trouble making any stack examples work for a large tree structure, which is what this solves.
I've solved this issue now. It was not quite a case of placing the recursive arguments into a stack but it was a very similar system.
As part of the creation of the node tree, I append each node to a vector. I now solve the problem in reverse using http://en.wikipedia.org/wiki/Reverse_Polish_notation, which fits very nicely as each node contains either a value or a function to perform.
It's also ~20% faster than the recursive version, so I'm pleased!
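The poster's actual code is not shown, so the following is only a host-side C++ sketch of what such a Reverse Polish evaluation could look like. It assumes the nodes are stored in post-order (children before their parent), that every function takes at most two operands, and that the operand stack never grows past a fixed bound. FlatNode, applyOp, solveRpn and kMaxDepth are hypothetical names; the fields mirror those used in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flattened node; field names mirror the question's CPPGPNode.
struct FlatNode {
    bool is_terminal;  // true: push var_set[tNo]; false: apply function fNo
    int tNo;           // index into var_set (terminals only)
    int fNo;           // 0 add, 1 subtract, 2 multiply, 3 divide
    int fInputs;       // number of operands the function pops (0..2 here)
};

double applyOp(int fNo, double a, double b)
{
    switch (fNo) {
        case 0: return a + b;
        case 1: return a - b;
        case 2: return a * b;
        case 3: return a / b;
        default: return 0.0;
    }
}

// Evaluates nodes stored in post-order with an explicit stack, no recursion.
double solveRpn(const std::vector<FlatNode>& nodes, const double* var_set)
{
    const std::size_t kMaxDepth = 64;  // assumed bound on the operand stack
    double stack[kMaxDepth];
    std::size_t top = 0;
    for (const FlatNode& node : nodes) {
        if (node.is_terminal) {
            assert(top < kMaxDepth);
            stack[top++] = var_set[node.tNo];
        } else {
            assert(node.fInputs >= 0 && node.fInputs <= 2);
            assert(top >= static_cast<std::size_t>(node.fInputs));
            double args[2] = {0.0, 0.0};  // missing operands default to 0.0
            // Pop in reverse so args[0] is the left operand.
            for (int i = node.fInputs - 1; i >= 0; --i) args[i] = stack[--top];
            assert(top < kMaxDepth);
            stack[top++] = applyOp(node.fNo, args[0], args[1]);
        }
    }
    return top == 1 ? stack[0] : 0.0;  // a well-formed list leaves one value
}
```

A fixed-size local array is used instead of a growable container because that shape carries over more directly to a __device__ function, where the loop replaces the recursive calls to solve.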

All possible combinations(with repetition) as values in array using recursion

I'm trying to solve a problem in which I need to insert math operations(+/- in this case) between digits or merge them to get a requested number.
For ex.: 123456789 => 123+4-5+6-7+8-9 = 120
My concept is basically generating different combinations of operation codes in array and calculating the expression until it equals some number.
The problem is I can't think of a way to generate every possible combination of math operations using recursion.
Here's the code:
#include <iostream>
#include <algorithm>
using namespace std;
enum {noop,opplus,opminus};//opcodes: 0,1,2
int applyOp(int opcode,int x, int y);
int calculate(int *digits,int *opcodes, int length);
void nextCombination();
int main()
{
int digits[9] = {1,2,3,4,5,6,7,8,9};
int wantedNumber = 100;
int length = sizeof(digits)/sizeof(digits[0]);
int opcodes[length-1];//math symbols
fill_n(opcodes,length-1,0);//init
while(calculate(digits,opcodes,length) != wantedNumber)
{
//recursive combination function here
}
return 0;
}
int applyOp(int opcode,int x, int y)
{
int result = x;
switch(opcode)
{
case noop://merge 2 digits together
result = x*10 + y;
break;
case opminus:
result -= y;
break;
case opplus:
default:
result += y;
break;
}
return result;
}
int calculate(int *digits,int *opcodes, int length)
{
int result = digits[0];
for(int i = 0;i < length-1; ++i)//elem count
{
result = applyOp(opcodes[i],result,digits[i+1]);//left to right, no priority
}
return result;
}
The key is backtracking. Each level of recursion handles
a single digit; in addition, you'll want to stop the recursion
once you've found a solution.
The simplest way to do this is to define a Solver class, which
keeps track of the global information, like the generated string
so far and the running total, and make the recursive function
a member. Basically something like:
class Solver
{
std::string const input;
int const target;
std::string solution;
int total;
bool isSolved;
void doSolve( std::string::const_iterator pos );
public:
Solver( std::string const& input, int target )
: input( input )
, target( target )
{
}
std::string solve()
{
total = 0;
isSolved = false;
doSolve( input.begin() );
return isSolved
? solution
: "no solution found";
}
};
In doSolve, you'll have to first check whether you've finished
(pos == input.end()): if so, set isSolved = total == target
and return immediately; otherwise, try the three possibilities,
(total = 10 * total + toDigit(*pos), total += toDigit(*pos),
and total -= toDigit(*pos)), each time saving the original
total and solution, adding the necessary text to
solution, and calling doSolve with the incremented pos.
On returning from the recursive call, if ! isSolved, restore
the previous values of total and solution, and try the next
possibility. Return as soon as you see isSolved, or when all
three possibilities have been tried.
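A compact sketch of that recursion is shown below, written as free functions rather than as the Solver member, with made-up names (solveFrom, solveDigits). It deviates from the description in one respect: it tracks the term currently being built separately from the running total, so that a merged digit attaches to its own term (1+23 evaluates to 24) instead of multiplying the whole total by 10. Values are passed by value, so only the expression text needs to be restored when backtracking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// total: sum of completed terms; sign and mag: the term still being built.
// Returns true as soon as one combination hits the target; expr then holds it.
bool solveFrom(const std::string& digits, std::size_t pos, long total,
               long sign, long mag, long target, std::string& expr)
{
    if (pos == digits.size()) return total + sign * mag == target;
    const long d = digits[pos] - '0';
    const std::size_t mark = expr.size();  // remembered for undoing the text

    // Possibility 1: merge the digit into the current term.
    expr += digits[pos];
    if (solveFrom(digits, pos + 1, total, sign, mag * 10 + d, target, expr)) return true;
    expr.resize(mark);  // backtrack

    // Possibility 2: close the current term and start a new positive one.
    expr += '+';
    expr += digits[pos];
    if (solveFrom(digits, pos + 1, total + sign * mag, +1, d, target, expr)) return true;
    expr.resize(mark);

    // Possibility 3: close the current term and start a new negative one.
    expr += '-';
    expr += digits[pos];
    if (solveFrom(digits, pos + 1, total + sign * mag, -1, d, target, expr)) return true;
    expr.resize(mark);

    return false;  // all three possibilities tried at this position
}

std::string solveDigits(const std::string& digits, long target)
{
    assert(!digits.empty());
    std::string expr(1, digits[0]);  // the first digit always starts a term
    if (solveFrom(digits, 1, 0, +1, digits[0] - '0', target, expr)) return expr;
    return "no solution found";
}
```

For example, solveDigits("123", 24) returns "1+23", and solveDigits("12", 7) returns "no solution found".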