Collect cout or output as vector in recursive function - c++

I have a recursive function that prints some of the nodes in a tree as integer ids. After exporting the function to R, I cannot use the cout output for anything (or so it seems). Ideally I could either (1) return the output as a vector or (2) parse the cout output inside R without losing too much speed.
I would insert some code here but my function is particularly generic. Essentially I'm trying to return, say, the Fibonacci sequence as a vector instead of a sum but through a recursive function without using global or static variables.
For example, fib(6) would return inside R as:
[1] 0 1 1 2 3 5
So one could,
y <- fib(6)
y[4] and y[4:5] would return respectively,
[1] 2
[1] 2 3
Thanks in advance for insights and ideas in problem solving. Using a static variable was as far as I got on my own.

I discuss this problem at length with different hashing and memoization implementation in both R and C++ in chapter one of the Rcpp book.

You should read this online book http://adv-r.had.co.nz/, and mostly the memoization part where your question is partly answered http://adv-r.had.co.nz/Function-operators.html:
Just add the function fib3 such as:
library(memoise)
fib2 <- memoise(function(n) {
if (n < 2) return(1)
fib2(n - 2) + fib2(n - 1)
})
fib3 <- memoise(function(n) sapply(1:n, fib2))
#> fib3(6)
#[1] 1 2 3 5 8 13

Just for fun, a slightly more involved approach that uses std::generate_n and a function object (fseq) in lieu of sapply:
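#include <Rcpp.h>
#include <algorithm>  // std::generate_n
#include <iterator>   // std::back_inserter
#include <stdexcept>  // std::invalid_argument
#include <vector>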
struct fseq {
public:
fseq() {
current = 0;
}
int operator()() {
int val = fib(current);
current++;
return val;
}
int fib(int n) {
if (n==0) return 0;
if (n==1) return 1;
return fib(n-2) + fib(n-1);
}
private:
int current;
};
// [[Rcpp::export(".fib")]]
int fib(int n) {
if (n==0) return 0;
if (n==1) return 1;
return fib(n-2) + fib(n-1);
}
// [[Rcpp::export]]
std::vector<int> fib_seq(const int n) {
if (n < 1) throw std::invalid_argument("n must be >= 1");
std::vector<int> seq;
seq.reserve(n);
std::generate_n(std::back_inserter(seq), n, fseq());
return seq;
}
library(microbenchmark)
##
.fib_seq <- function(n) sapply(0:(n-1), .fib)
##
R> fib_seq(6)
[1] 0 1 1 2 3 5
R> all.equal(fib_seq(6),.fib_seq(6))
[1] TRUE
##
R> microbenchmark(
fib_seq(6),.fib_seq(6),
times=1000L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(6) 1.561 1.9015 3.287824 2.108 2.3430 1046.021 1000
.fib_seq(6) 27.239 29.0615 35.538355 30.290 32.8065 1108.266 1000
R> microbenchmark(
fib_seq(15),.fib_seq(15),
times=100L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(15) 6.108 6.5875 7.46431 7.0795 7.7590 20.391 100
.fib_seq(15) 57.243 60.7195 72.97281 63.8120 73.4045 231.707 100
R> microbenchmark(
fib_seq(28),.fib_seq(28),
times=100L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(28) 2134.861 2143.489 2222.018 2167.364 2219.400 2650.854 100
.fib_seq(28) 3705.492 3721.586 3871.314 3745.956 3852.516 5040.827 100
Note that these functions were parametrized to reflect your statement
For example, fib(6) would return inside R as:
[1] 0 1 1 2 3 5
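If the goal is only to get the values back as a vector without globals or statics, one option is to thread an output vector through the recursion by reference. A minimal sketch in plain C++ follows; the names fib_collect and fib_vector are made up for illustration. In an Rcpp setting the wrapper would presumably carry the // [[Rcpp::export]] attribute, since a std::vector<int> return value is converted to an R integer vector, as in fib_seq above.

```cpp
#include <vector>

// Hypothetical helper: appends one Fibonacci number per recursive call
// until `out` holds n values. `a` is the value to append, `b` the next one.
void fib_collect(int n, int a, int b, std::vector<int>& out) {
    if (static_cast<int>(out.size()) >= n) return;  // base case: enough values
    out.push_back(a);
    fib_collect(n, b, a + b, out);                  // one call per element
}

// Hypothetical wrapper: owns the vector, so no global or static state is needed.
std::vector<int> fib_vector(int n) {
    std::vector<int> out;
    if (n > 0) out.reserve(n);
    fib_collect(n, 0, 1, out);
    return out;
}
```

The same pattern should carry over to a tree traversal: the recursive function takes the vector by reference and pushes each node id instead of printing it.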

Related

Rcpp function for subsetting strings

I was wondering if there was an Rcpp function which takes an Rcpp::String data type as input and returns a given character (by index) of the string. For example, extracting the character at index 0 of the string. This would be equivalent to the string::at method from the string header in c++. I have written this:
#include <vector>
#include <string>
#include <Rcpp.h>
using namespace Rcpp;
typedef std::vector<std::string> stringList;
int SplitGenotypesA(std::string s) {
char a = s.at(0);
int b = a - '0';
return b;
}
But would prefer not to have to convert between Rcpp::String and std::string types.
You can feed an R vector of strings directly to C++ using Rcpp::StringVector. This will obviously handle single elements too.
Getting the nth character of the ith element of your vector is as simple as vector[i][n].
So, without using std::string you can do this:
#include<Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector SplitGenotypesA(Rcpp::StringVector R_character_vector)
{
int number_of_strings = R_character_vector.size();
Rcpp::NumericVector result(number_of_strings);
for(int i = 0; i < number_of_strings; ++i)
{
char a = R_character_vector[i][0];
result[i] = a - '0';
}
return result;
}
Now in R you can do:
SplitGenotypesA("9C")
# [1] 9
or better yet,
SplitGenotypesA(c("1A", "2B", "9C"))
# [1] 1 2 9
Which is even a little faster than the native R method of doing the same thing:
microbenchmark::microbenchmark(
R_method = as.numeric(substr(c("1A", "2B", "9C"), 1, 1)),
Rcpp_method = SplitGenotypesA(c("1A", "2B", "9C")),
times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# R_method 3.422 3.765 4.076722 4.107 4.108 46.881 1000
# Rcpp_method 3.080 3.423 3.718779 3.765 3.765 32.509 1000
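The a - '0' conversion above only gives a meaningful result when the string is non-empty and its first character is a decimal digit. A guarded variant, sketched in plain C++ rather than with Rcpp types (leading_digits and the -1 sentinel are assumptions for illustration, not part of Rcpp):

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: leading digit of each string, or -1 when the string
// is empty or does not start with a decimal digit.
std::vector<int> leading_digits(const std::vector<std::string>& strings) {
    std::vector<int> result;
    result.reserve(strings.size());
    for (const std::string& s : strings) {
        if (!s.empty() && std::isdigit(static_cast<unsigned char>(s[0]))) {
            result.push_back(s[0] - '0');  // '0'..'9' are contiguous, so this yields 0..9
        } else {
            result.push_back(-1);          // sentinel chosen for this sketch
        }
    }
    return result;
}
```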

Educational - understanding variable performance of recursive functions with Rcpp

The problem is not of a practical nature; I'm only looking for a sound explanation of the observed occurrence. I'm reading Seamless R and C++ Integration with Rcpp (Use R!) by Dirk Eddelbuettel. Following the introduction, I'm looking at two simple "Fibonacci functions".
In RStudio I have a cpp file of the following structure
fib_fun.cpp
#include <Rcpp.h>
// [[Rcpp::export]]
int fibonacci(const int x) {
if (x < 2)
return x;
else
return (fibonacci(x -1)) + fibonacci(x-2);
}
/*** R
# Call the fib function defined in R
fibonacci(10)
*/
I also have an inline implementation of the same function:
inline_fib.R
# Inline fib implementation
incltxt <- "int fibonacci(const int x) {
if (x == 0) return(0);
if (x == 1) return(1);
return fibonacci(x - 1) + fibonacci(x - 2);
}"
# Inline call
require(inline)
fibRcpp <- cxxfunction(signature(xs = "int"), plugin = "Rcpp",
includes = incltxt,
body = "int x = Rcpp::as<int>(xs);
return Rcpp::wrap(fibonacci(x));")
When I benchmark the functions I get the following results:
> microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
fibonacci(10) 3.121 3.198 5.5192 3.447 3.886 23.491 10
fibRcpp(10) 1.176 1.398 3.9520 1.558 1.709 25.721 10
Questions
I would like to understand why there is a significant difference in performance between the two functions?
With respect to the practicalities surrounding the use of Rcpp, what is generally considered to be good practice? In my naivety, my first hunch would be to write a function and source it via sourceCpp, but this solution appears to be much slower.
Benchmarking code
require(microbenchmark); require(Rcpp); require(inline)
sourceCpp("fib_fun.cpp"); source("inline_fib.R")
microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Comment replies
I tried the functions with the unsigned int instead of the int, results:
Unit: microseconds
expr min lq mean median uq max neval
fibonacci(10) 2.908 2.992 5.0369 3.267 3.598 20.291 10
fibRcpp(10) 1.201 1.263 6.3523 1.424 1.639 50.536 10
All good comments above.
The function is way too lightweight at x=10 and you need to call way more often than times=10 to find anything meaningful. You are measuring noise.
As for style, most of us prefer fibonacci() via Rcpp Attributes...

max defined in #define not working properly

I wrote the program as follows :
#include<cstdio>
#define max(a,b) a>b?a:b
using namespace std;
int main()
{
int sum=0,i,k;
for(i=0;i<5;i++)
{
sum=sum+max(i,3);
}
printf("%d\n",sum);
return 0;
}
I got the output : 4
But when I stored max(i,3) in a variable k and then added to sum, I got the correct output:
#include<cstdio>
#define max(a,b) a>b?a:b
using namespace std;
int main()
{
int sum=0,i,k;
for(i=0;i<5;i++)
{
k=max(i,3);
sum=sum+k;
}
printf("%d\n",sum);
return 0;
}
Output : 16
Can somebody please explain why is it happening?
hash-define macros are a string expansion, not a "language" thing.
sum=sum+max(i,3);
expands to:
sum=sum+i>3?i:3;
And if you are writing that with no () round it you deserve to get the wrong answer. Try this:
#define max(a,b) (a>b?a:b)
but there are still many situations where it will fail. As others point out an even better macro is:
#define max(a,b) ((a)>(b)?(a):(b))
but it will still fail in too many situations, such as arguments with side effects getting evaluated twice. You are much much better off avoiding macros where possible and doing something like this:
template <typename T> T max(T a, T b) { return a>b?a:b; }
or, in fact, using std::max and std::min which have already been written for you!
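To see the difference side by side, here is a small sketch that runs the loop from the question once with an unparenthesized macro and once with std::max. The macro is renamed BAD_MAX here (an illustrative name) so it cannot clash with std::max.

```cpp
#include <algorithm>

// Deliberately unparenthesized, mirroring the macro from the question.
#define BAD_MAX(a, b) a > b ? a : b

int sum_with_macro() {
    int sum = 0;
    for (int i = 0; i < 5; ++i) {
        sum = sum + BAD_MAX(i, 3);  // parses as (sum + i) > 3 ? i : 3
    }
    return sum;  // 4, as observed in the question
}

int sum_with_std_max() {
    int sum = 0;
    for (int i = 0; i < 5; ++i) {
        sum = sum + std::max(i, 3);  // a real function call: 3 + 3 + 3 + 3 + 4
    }
    return sum;  // 16
}
```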
This line:
sum=sum+max(i,3);
expands to:
sum = sum + i > 3 ? i : 3;
Which, when set up with parens to make it clearer is:
sum = (sum + i) > 3 ? i : 3;
So on the 5-passes through the loop, the expressions are:
sum = (0 + 0) > 3 ? 0 : 3; // Result: sum = 3
sum = (3 + 1) > 3 ? 1 : 3; // Result: sum = 1
sum = (1 + 2) > 3 ? 2 : 3; // Result: sum = 3
sum = (3 + 3) > 3 ? 3 : 3; // Result: sum = 3
sum = (3 + 4) > 3 ? 4 : 3; // Result: sum = 4
And that's where your answer comes from.
The conventional way to solve this is to change the #define to:
#define max(a,b) (((a)>(b))?(a):(b))
But even this has some pitfalls.
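One such pitfall is that a macro argument with a side effect can be evaluated twice. A minimal sketch, using the illustrative name PAREN_MAX for the fully parenthesized macro:

```cpp
#include <algorithm>
#include <utility>

#define PAREN_MAX(a, b) (((a) > (b)) ? (a) : (b))

// Each function returns {result, final value of i}.
std::pair<int, int> with_macro() {
    int i = 5;
    int m = PAREN_MAX(i++, 3);  // expands to (((i++) > (3)) ? (i++) : (3))
    return {m, i};              // i++ ran twice: m == 6, i == 7
}

std::pair<int, int> with_std_max() {
    int i = 5;
    int m = std::max(i++, 3);   // the argument is evaluated exactly once
    return {m, i};              // m == 5, i == 6
}
```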
I think you are having operator precedence issues, you have to remember that define will lead to a textual replacement in your source code. You should change your define to
#define max(a,b) ((a) > (b) ? (a) : (b))
The output of the preprocessor (view it with the -E flag) will be:
sum = sum+i>3?i:3;
which is the same as
sum = (sum+i)>3?i:3;
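which is not what you meant because + has a higher precedence than >. At a minimum, wrap the whole expansion in parentheses:
#define max(a,b) (a>b?a:b)
but prefer the fully parenthesized form above, which also protects the arguments.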
Replacing your macro in the line sum=sum+max(i,3); gives the following form :
sum=sum+i>3?i:3 ;
which asks: if sum + i is greater than 3, assign i to sum; otherwise assign 3. Hence you get 4, because sum is overwritten on each pass through the loop rather than accumulated. Use the template method suggested by Andrew.
(The loop evaluates the condition (sum + i) > 3 ? i : 3 every time. There is no cumulative addition here.)

Summarize with rcpp

Suppose, I've a data.frame as follows:
set.seed(45)
DF <- data.frame(x=1:10, strata2013=sample(letters[1:3], 10, TRUE))
x strata2013
1 1 b
2 2 a
3 3 a
4 4 b
5 5 b
6 6 a
7 7 a
8 8 b
9 9 a
10 10 a
And I'd like to get the counts for each unique value in the column strata2013, then, using data.table (for speed), one could do it in this manner:
DT <- as.data.table(DF)
DT[, .N, by=strata2013]
strata2013 N
1: b 4
2: a 6
Now, I'd like to try and accomplish this in Rcpp, as a learning exercise. I've written and tried out the code shown below which is supposed to provide the same output, but instead it gives me an error. Here's the code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector LengthStrata (CharacterVector uniqueStrata, DataFrame dataset ) {
int n = uniqueStrata.size();
NumericVector Nh(n);
Rcpp::CharacterVector strata=dataset["strate2013"];
for (int i = 0; i < n; ++i) {
Nh[i]=strata(uniqueStrata(i)).size();
}
return Nh;
}
Here is the error message:
conversion from 'Rcpp::Vector<16>::Proxy {aka Rcpp::internal::string_proxy<16>}'
to 'const size_t { aka const long long unsigned int}' is ambiguous
What am I doing wrong? Thank you very much for your help.
If I understand correctly, you're hoping that strata( uniqueStrata(i) ) will subset the vector, similar to how R's subsetting operates. This is unfortunately not the case; you would have to perform the subsetting 'by hand'. Rcpp doesn't have 'generic' subsetting operators available yet.
When it comes to using Rcpp, you really want to leverage the C++ standard library where possible. The de-facto C++ way of generating these counts would be to use a std::map (or std::unordered_map, if you can assume C++11), with something like the following. I include a benchmark for interest.
Note from Dirk: unordered_map is actually available from tr1 for pre-C++11, so one can include it using e.g. #include <tr1/unordered_map>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
Rcpp::CharacterVector strata = dataset["strata2013"];
int n = strata.size();
std::map<SEXP, int> counts;
for (int i = 0; i < n; ++i) {
++counts[ strata[i] ];
}
return wrap(counts);
}
/*** R
library(data.table)
library(microbenchmark)
set.seed(45)
DF <- data.frame(strata2013=sample(letters, 1E5, TRUE))
DT <- data.table(DF)
LengthStrata(DF)
DT[, .N, by=strata2013]
microbenchmark(
LengthStrata(DF),
DT[, .N, by=strata2013]
)
*/
gives me
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 3.267131 3.831563 3.934992 4.101050 11.491939 100
DT[, .N, by = strata2013] 1.980896 2.360590 2.480884 2.687771 3.052583 100
The Rcpp solution is slower in this case likely due to the time it takes to move R objects to and from the C++ containers, but hopefully this is instructive.
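For reference, the std::unordered_map variant mentioned above might look like this when written against plain standard-library types. This is a sketch only: count_strata is a made-up name, and std::string keys replace the SEXP keys used in the Rcpp version.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: counts how often each label occurs.
std::unordered_map<std::string, int> count_strata(
        const std::vector<std::string>& strata) {
    std::unordered_map<std::string, int> counts;
    for (const std::string& s : strata) {
        ++counts[s];  // operator[] value-initializes a missing key to 0
    }
    return counts;
}
```

Unlike std::map, the result is not ordered by key, so sort the keys first if the output order matters.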
Aside: This is, in fact, already included in Rcpp as the sugar table function, so if you want to skip the learning experience, you can use a pre-baked solution as
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
Rcpp::CharacterVector strata = dataset["strata2013"];
return table(strata);
}
Sugar improves the speed of the Rcpp function:
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 5.548094 5.870184 6.014002 6.448235 6.922062 100
DT[, .N, by = strate2013] 6.526993 7.136290 7.462661 7.949543 81.233216 100
I am not sure I understand what you are trying to do. And when strata is a vector
Rcpp::CharacterVector strata=df["strate2013"];
then I am not sure what
strata(uniqueStrata(i)).size()
is supposed to do. Maybe you could describe in words (or in R with some example code and data) what you are trying to do here.

Fibonacci Function Question

I was calculating the Fibonacci sequence, and stumbled across this code, which I saw a lot:
int Fibonacci (int x)
{
if (x<=1) {
return 1;
}
return Fibonacci (x-1)+Fibonacci (x-2);
}
What I don't understand is how it works, especially the return part at the end: Does it call the Fibonacci function again? Could someone step me through this function?
Yes, the function calls itself. For example,
Fibonacci(4)
= Fibonacci(3) + Fibonacci(2)
= (Fibonacci(2) + Fibonacci(1)) + (Fibonacci(1) + Fibonacci(0))
= ((Fibonacci(1) + Fibonacci(0)) + 1) + (1 + 1)
= ((1 + 1) + 1) + 2
= (2 + 1) + 2
= 3 + 2
= 5
Note that the Fibonacci function is called 9 times here. In general, the naïve recursive fibonacci function has exponential running time, which is usually a Bad Thing.
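The call count can be checked by threading a counter through the recursion. A small sketch (fibonacci_counted and count_calls are illustrative names; the base case matches the function in the question):

```cpp
// Same recursion as in the question, plus a counter passed by reference.
int fibonacci_counted(int x, long& calls) {
    ++calls;  // one increment per invocation
    if (x <= 1) {
        return 1;
    }
    return fibonacci_counted(x - 1, calls) + fibonacci_counted(x - 2, calls);
}

// Hypothetical helper: number of invocations needed to compute Fibonacci(x).
long count_calls(int x) {
    long calls = 0;
    fibonacci_counted(x, calls);
    return calls;
}
```

count_calls(4) gives 9, matching the expansion above, and count_calls(10) already gives 177.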
This is a classical example of a recursive function, a function that calls itself.
If you read it carefully, you'll see that it will call itself, or, recurse, over and over again, until it reaches the so called base case, when x <= 1 at which point it will start to "back track" and sum up the computed values.
The following code clearly prints out the trace of the algorithm:
public class Test {
static String indent = "";
public static int fibonacci(int x) {
indent += "    ";
System.out.println(indent + "invoked with " + x);
if (x <= 1) {
System.out.println(indent + "x = " + x + ", base case reached.");
indent = indent.substring(4);
return 1;
}
System.out.println(indent + "Recursing on " + (x-1) + " and " + (x-2));
int retVal = fibonacci(x-1) + fibonacci(x-2);
System.out.println(indent + "returning " + retVal);
indent = indent.substring(4);
return retVal;
}
public static void main(String... args) {
System.out.println("Fibonacci of 3: " + fibonacci(3));
}
}
The output is the following:
    invoked with 3
    Recursing on 2 and 1
        invoked with 2
        Recursing on 1 and 0
            invoked with 1
            x = 1, base case reached.
            invoked with 0
            x = 0, base case reached.
        returning 2
        invoked with 1
        x = 1, base case reached.
    returning 3
Fibonacci of 3: 3
A tree depiction of the trace would look something like
                     fib 4
          fib 3        +        fib 2
    fib 2 + fib 1          fib 1 + fib 0
fib 1 + fib 0   1            1       1
  1       1
The important parts to think about when writing recursive functions are:
1. Take care of the base case
What would have happened if we had forgotten if (x<=1) return 1; in the example above?
2. Make sure the recursive calls somehow decrease towards the base case
What would have happened if we accidentally modified the algorithm to return fibonacci(x)+fibonacci(x-1);
return Fibonacci (x-1)+Fibonacci (x-2);
This is terribly inefficient. I suggest the following linear alternative:
unsigned fibonacci(unsigned n, unsigned a, unsigned b, unsigned c)
{
return (n == 2) ? c : fibonacci(n - 1, b, c, b + c);
}
unsigned fibonacci(unsigned n)
{
return (n < 2) ? n : fibonacci(n, 0, 1, 1);
}
The fibonacci sequence can be expressed more succinctly in functional languages.
fibonacci = 0 : 1 : zipWith (+) fibonacci (tail fibonacci)
> take 12 fibonacci
[0,1,1,2,3,5,8,13,21,34,55,89]
This is classic function recursion. http://en.wikipedia.org/wiki/Recursive_function should get you started. Essentially, if x is less than or equal to 1 it returns 1. Otherwise it calls Fibonacci again with smaller values of x at each step.
As your question is marked C++, I feel compelled to point out that this function can also be achieved at compile-time as a template, should you have a compile-time variable to use it with.
template<int N> struct Fibonacci {
const static int value = Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};
// Explicit specializations act as the base cases; each struct needs a closing semicolon.
template<> struct Fibonacci<1> {
const static int value = 1;
};
template<> struct Fibonacci<0> {
const static int value = 1;
};
Been a while since I wrote such, so it could be a little out, but that should be it.
Yes, the Fibonacci function is called again, this is called recursion.
Just like you can call another function, you can call the same function again. Since function context is stacked, you can call the same function without disturbing the currently executed function.
Note that recursion is hard, since you might call the same function again infinitely and fill the call stack. This error is called a "Stack Overflow" (here it is!)
In C and most other languages, a function is allowed to call itself just like any other function. This is called recursion.
If it looks strange because it's different from the loop that you would write, you're right. This is not a very good application of recursion, because finding the nth Fibonacci number this way takes roughly 1.6 times as long as finding the (n-1)th, leading to running time exponential in n.
Iterating over the Fibonacci sequence, remembering the previous Fibonacci number before moving on to the next improves the runtime to linear in n, the way it should be.
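A minimal sketch of that loop, using the same convention as the function in the question (Fibonacci(0) and Fibonacci(1) are both 1); the name fibonacci_iterative is made up for illustration:

```cpp
// Linear-time loop: keeps only the current and the previous value.
int fibonacci_iterative(int x) {
    int a = 1;  // current value, starts at Fibonacci(0)
    int p = 0;  // previous value
    for (int i = 0; i < x; ++i) {
        int next = a + p;
        p = a;
        a = next;
    }
    return a;
}
```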
Recursion itself isn't terrible. In fact, the loop I just described (and any loop) can be implemented as a recursive function:
int Fibonacci (int x, int a = 1, int p = 0) {
if ( x == 0 ) return a;
return Fibonacci( x-1, a+p, a );
} // recursive, but with ideal computational properties
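Or, if you want faster lookups at the cost of more memory, compute the whole table once:
int *fib;
void fibonaci(int n) // fill fib[1..n] with the first n Fibonacci numbers; assumes n >= 1
{
fib = new int[n + 2]; // n + 2 slots so that fib[2] is valid even when n == 1
fib[1] = fib[2] = 1;
for (int i = 3; i <= n; i++) // run up to n; the bound n-2 would leave the last entries unset
fib[i] = fib[i-1] + fib[i-2];
}
and for n = 10, for example, you will have:
fib[1] fib[2] fib[3] fib[4] fib[5] fib[6] fib[7] fib[8] fib[9] fib[10]
1      1      2      3      5      8      13     21     34     55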