The problem is not of a practical nature; I'm only looking for a sound explanation of the observed occurrence. I'm reading Seamless R and C++ Integration with Rcpp (Use R!) by Dirk Eddelbuettel. Following the introduction, I'm looking at two simple "Fibonacci functions".
In RStudio I have a .cpp file with the following structure:
fib_fun.cpp
#include <Rcpp.h>

// [[Rcpp::export]]
int fibonacci(const int x) {
    if (x < 2)
        return x;
    else
        return fibonacci(x - 1) + fibonacci(x - 2);
}

/*** R
# Call the fibonacci function defined above in C++
fibonacci(10)
*/
I also have an inline implementation of the same function:
inline_fib.R
# Inline fib implementation
incltxt <- "int fibonacci(const int x) {
                if (x == 0) return(0);
                if (x == 1) return(1);
                return fibonacci(x - 1) + fibonacci(x - 2);
            }"

# Inline call
require(inline)
fibRcpp <- cxxfunction(signature(xs = "int"),
                       plugin = "Rcpp",
                       includes = incltxt,
                       body = "int x = Rcpp::as<int>(xs);
                               return Rcpp::wrap(fibonacci(x));")
When I benchmark the functions I get the following results:
> microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Unit: microseconds
          expr   min    lq   mean median    uq    max neval
 fibonacci(10) 3.121 3.198 5.5192  3.447 3.886 23.491    10
   fibRcpp(10) 1.176 1.398 3.9520  1.558 1.709 25.721    10
Questions
Why is there such a significant difference in performance between the two functions?
With respect to the practicalities of using Rcpp, what is generally considered good practice? In my naivety, my first hunch would be to write a function and source it via sourceCpp(), but this solution appears to be much slower.
Benchmarking code
require(microbenchmark); require(Rcpp); require(inline)
sourceCpp("fib_fun.cpp"); source("inline_fib.R")
microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Comment replies
I tried the functions with unsigned int instead of int; results:
Unit: microseconds
          expr   min    lq   mean median    uq    max neval
 fibonacci(10) 2.908 2.992 5.0369  3.267 3.598 20.291    10
   fibRcpp(10) 1.201 1.263 6.3523  1.424 1.639 50.536    10
All good comments above.
The function is way too lightweight at x=10 and you need to call way more often than times=10 to find anything meaningful. You are measuring noise.
As for style, most of us prefer fibonacci() via Rcpp Attributes...
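To make that point concrete, here is a quick sketch (my addition; the choices x = 25 and times = 1000 are assumptions, and it presumes fibonacci() and fibRcpp() have been compiled as in the benchmarking code above). A heavier argument and many more repetitions let the actual work dominate the timer noise.

# Heavier workload and more repetitions; at x = 25 the recursive C++ work
# dominates, so per-call dispatch overhead and timer noise matter far less.
library(microbenchmark)
microbenchmark(fibonacci(25), fibRcpp(25), times = 1000)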
Related
#include <iostream>
using namespace std;

double log(double x, int n)
{
    static double p = x;
    double s;
    if (n == 1)
        return x;
    else
    {
        s = log(x, n - 1);
        p *= x;
        if (n % 2 == 0)
            return s - (p / n);
        else
            return s + (p / n);
    }
}

int main()
{
    double r = log(1, 15);
    cout << r;
    return 0;
}
I tried writing the above function to evaluate log(1+x) using its Taylor series with recursion, but it didn't give the result I expected.
E.g. ln(2) = 0.693, whereas my code gave 0.725. In the above code, n represents the number of terms.
Also, I am new to this platform, so is the above question complete, or does it need some additional information for further explanation?
There is nothing wrong with that piece of code: this has obviously got to do with the rate of convergence of the Taylor series.
If you take n = 200 instead of n = 15 in your code, the approximation error will be low enough that the first two decimals of the exact solution ln(2) = 0.693147... will be the correct ones.
The more you increase the n parameter, the better approximation you will get of ln(2).
Your program does converge to the right number, just very slowly...
log(1,15) returns 0.725, as you noticed, log(1,50) is 0.683, log(1,100) is 0.688, and log(1,200) is 0.691. That's getting close to the number you expected, but still a long way to go...
So there is no C++ or recursion bug in your code - you just need to find a better series to calculate log(x). Don't look for a Taylor series for log(1+x) - these typically assume x is small and converge quickly only for small x, not for x = 1.
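One illustration of that last point (my own sketch, not part of the original answer, shown in R for brevity): the artanh-based series log(x) = 2*(y + y^3/3 + y^5/5 + ...), with y = (x-1)/(x+1), converges far faster at x = 2 than the log(1+x) series does at x = 1.

# Sketch of the artanh-based series; 15 terms already give log(2) to roughly
# 1e-15, versus 0.725 from 15 terms of the alternating log(1 + x) series.
log_series <- function(x, n = 15) {
    y <- (x - 1) / (x + 1)
    k <- 2 * seq_len(n) - 1    # odd exponents 1, 3, 5, ...
    2 * sum(y^k / k)
}
log_series(2)    # 0.6931472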
I was inspired by the fst package to try to write a C++ function to quickly serialize some data structures I have in R to disk.
But I am having trouble achieving the same write speed even on very simple objects. The code below is a simple example of writing a large 1 GB vector to disk.
Using custom C++ code, I achieve a write speed of 135 MB/s, which is the limit of my disk according to CrystalBench.
On the same data, write_fst achieves a write speed of 223 MB/s, which seems impossible since my disk can't write that fast. (Note, I am using fst::threads_fst(1) and compress=0 settings, and the files have the same data size.)
What am I missing?
How can I get the C++ function to write to disk faster?
C++ Code:
#include <Rcpp.h>
#include <fstream>
#include <cstring>
#include <iostream>

// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;

// [[Rcpp::export]]
void test(SEXP x) {
    char* d = reinterpret_cast<char*>(REAL(x));
    long dl = Rf_xlength(x) * 8;
    std::ofstream OutFile;
    OutFile.open("/tmp/test.raw", std::ios::out | std::ios::binary);
    OutFile.write(d, dl);
    OutFile.close();
}
R Code:
library(microbenchmark)
library(Rcpp)
library(dplyr)
library(fst)
fst::threads_fst(1)
sourceCpp("test.cpp")
x <- runif(134217728) # 1 gigabyte
df <- data.frame(x)
microbenchmark(test(x), write_fst(df, "/tmp/test.fst", compress=0), times=3)
Unit: seconds
                                         expr      min       lq     mean   median       uq      max neval
                                      test(x) 6.549581 7.262408 7.559021 7.975235 8.063740 8.152246     3
 write_fst(df, "/tmp/test.fst", compress = 0) 4.548579 4.570346 4.592398 4.592114 4.614307 4.636501     3
file.info("/tmp/test.fst")$size/1e6
# [1] 1073.742
file.info("/tmp/test.raw")$size/1e6
# [1] 1073.742
Benchmarking SSD write and read performance is a tricky business and hard to do right. There are many effects to take into account.
For example, many SSDs use techniques to accelerate data speeds, such as intelligent DRAM caching. Those techniques can increase your write speed, especially in cases where an identical dataset is written to disk multiple times, as in your example. To avoid this effect, each iteration of the benchmark should write a unique dataset to disk.
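A minimal sketch of that idea (my addition, reusing the test() function from the question): pre-generate a few distinct vectors so that each write pushes data the drive has not seen before.

# Three distinct 1 GB vectors (about 3 GB of RAM in total); each call to the
# OP's test() then writes fresh data, limiting the benefit of the SSD's DRAM cache.
xs <- lapply(1:3, function(i) runif(134217728))
for (v in xs) {
    print(system.time(test(v)))    # test() writes /tmp/test.raw, as defined above
}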
The block sizes of write and read operations are also important: the default physical sector size of SSDs is 4 kB. Writing smaller blocks hampers performance, but with fst I found that writing blocks of data larger than a few MB also lowers performance, due to CPU cache effects. Because fst writes its data to disk in relatively small chunks, it's usually faster than alternatives that write data in a single large block.
To facilitate this block-wise writing to SSD, you could modify your code:
Rcpp::cppFunction('
  #include <fstream>
  #include <cstring>
  #include <iostream>

  #define BLOCKSIZE 262144 // 2^18 bytes per block

  long test_blocks(SEXP x, Rcpp::String path) {
    char* d = reinterpret_cast<char*>(REAL(x));

    std::ofstream outfile;
    outfile.open(path.get_cstring(), std::ios::out | std::ios::binary);

    long dl = Rf_xlength(x) * 8;
    long nr_of_blocks = dl / BLOCKSIZE;

    for (long block_nr = 0; block_nr < nr_of_blocks; block_nr++) {
      outfile.write(&d[block_nr * BLOCKSIZE], BLOCKSIZE);
    }

    long remaining_bytes = dl % BLOCKSIZE;
    outfile.write(&d[nr_of_blocks * BLOCKSIZE], remaining_bytes);

    outfile.close();

    return dl;
  }
')
Now we can compare methods test, test_blocks and fst::write_fst in a single benchmark:
x <- runif(134217728) # 1 gigabyte
df <- data.frame(X = x)
fst::threads_fst(1) # use fst in single threaded mode
microbenchmark::microbenchmark(
test(x, "test.bin"),
test_blocks(x, "test.bin"),
fst::write_fst(df, "test.fst", compress = 0),
times = 10)
#> Unit: seconds
#>                                          expr      min       lq     mean   median       uq      max neval
#>                           test(x, "test.bin") 1.473615 1.506019 1.590430 1.600055 1.635883 1.765512    10
#>                    test_blocks(x, "test.bin") 1.018082 1.062673 1.134956 1.131631 1.204373 1.264220    10
#>  fst::write_fst(df, "test.fst", compress = 0) 1.127446 1.144039 1.249864 1.261269 1.327304 1.343248    10
As you can see, the modified method test_blocks is about 40 percent faster than the original method and even slightly faster than the fst package. This is expected, because fst has some overhead in storing column and table information, (possible) attributes, hashes and compression information.
Please note that the difference between fst and your initial test method is much less pronounced on my system, showing again the challenges in using benchmarks to optimize a system.
There are small differences in the results of R's sum() function and RcppArmadillo's accu() function when given the same input. For example, the following code:
R:
vec <- runif(100, 0, 0.00001)
accu(vec)
sum(vec)
C++:
// [[Rcpp::depends("RcppArmadillo")]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double accu(arma::vec& obj)
{
    return arma::accu(obj);
}
Gives the results:
0.00047941851844312633 (C++)
0.00047941851844312628 (R)
According to http://keisan.casio.com/calculator the true answer is:
4.79418518443126270948E-4
These small differences add up in my algorithm and significantly affect the way it executes. Is there a way to more accurately sum up vectors in C++? Or at least to get the same results that R does without having to call R code?
update: based on what others have found in the source, I was wrong about this - sum() does not sort. The patterns of consistency I found below stem from the fact that sorting (as done in some cases below) and using extended-precision intermediate values (as done in sum()) can have similar effects on precision ...
@user2357112 comments below:
src/main/summary.c ... doesn't do any sorting. (That'd be a lot of expense to add to a summation operation.) It's not even using pairwise or compensated summation; it just naively adds everything up left to right in an LDOUBLE (either long double or double, depending on HAVE_LONG_DOUBLE).
I have exhausted myself looking for this in the R source code (without success - sum is hard to search for), but I can show by experiment that when executing sum(), R sorts the input vector from smallest to largest in order to maximize accuracy; the difference between sum() and Reduce() results below is due to use of extended precision. I don't know what accu does ...
set.seed(101)
vec <- runif(100, 0, 0.00001)
options(digits=20)
(s1 <- sum(vec))
## [1] 0.00052502325481269514554
Using Reduce("+",...) just adds the elements in order.
(s2 <- Reduce("+",sort(vec)))
## [1] 0.00052502325481269514554
(s3 <- Reduce("+",vec))
## [1] 0.00052502325481269503712
identical(s1,s2) ## TRUE
?sum() also says
Where possible extended-precision accumulators are used, but this is platform-dependent.
Doing this in RcppArmadillo on the sorted vector gives the same answer as in R; doing it on the vector in the original order gives yet a different answer (I don't know why; my guess would be the aforementioned extended-precision accumulators, which would affect the numerical outcome more when the data are unsorted).
suppressMessages(require(inline))
code <- '
arma::vec ax = Rcpp::as<arma::vec>(x);
return Rcpp::wrap(arma::accu(ax));
'
## create the compiled function
armasum <- cxxfunction(signature(x="numeric"),
code,plugin="RcppArmadillo")
(s4 <- armasum(vec))
## [1] 0.00052502325481269525396
(s5 <- armasum(sort(vec)))
## [1] 0.00052502325481269514554
identical(s1,s5) ## TRUE
But as pointed out in the comments, this doesn't work for all seeds: in this case the Reduce() result is closer to the result of sum()
set.seed(123)
vec2 <- runif(50000,0,0.000001)
s4 <- sum(vec2); s5 <- Reduce("+",sort(vec2))
s6 <- Reduce("+",vec2); s7 <- armasum(sort(vec2))
rbind(s4,s5,s6,s7)
## [,1]
## s4 0.024869900535651481843
## s5 0.024869900535651658785
## s6 0.024869900535651523477
## s7 0.024869900535651343065
I'm stumped here. I would have expected at least s6 and s7 to be identical ...
I will point out that, in general, when your algorithm depends on these kinds of tiny numeric differences you're likely to get very frustrated, as the results are likely to differ on the basis of many small and possibly out-of-your-control factors such as the particular operating system, compiler, etc. that you work with.
What I have found:
I successfully managed to write a function that mimics R's sum() function. It appears R uses a higher-precision variable to store the result of each addition operation.
What I wrote:
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::plugins(cpp11)]]   // needed for 'auto' on older toolchains
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double accu2(arma::vec& obj)
{
    long double result = 0;
    for (auto iter = obj.begin(); iter != obj.end(); ++iter)
    {
        result += *iter;
    }
    return result;
}
How it compares in speed:
set.seed(123)
vec <- runif(50000, 0, 0.000001)
microbenchmark(
sum(vec),
accu(vec),
accu2(vec)
)
       expr    min     lq     mean  median      uq     max neval
   sum(vec) 72.155 72.351 72.61018 72.6755 72.7485  75.068   100
  accu(vec) 48.275 48.545 48.84046 48.7675 48.9975  52.128   100
 accu2(vec) 69.087 69.409 70.80095 69.6275 69.8275 182.955   100
So my C++ solution is still faster than R's sum(); however, it is significantly slower than Armadillo's accu().
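Another option, sketched here as my own addition (not from the original answers): compensated (Kahan) summation, which the comment quoted earlier notes that R's sum() does not use. It improves accuracy over naive double accumulation without relying on long double support.

Rcpp::cppFunction('
double accu_kahan(Rcpp::NumericVector x) {
    double sum = 0.0, c = 0.0;           // c carries the lost low-order bits
    for (R_xlen_t i = 0; i < x.size(); ++i) {
        double y = x[i] - c;
        double t = sum + y;
        c = (t - sum) - y;               // recover what was rounded away
        sum = t;
    }
    return sum;
}')
accu_kahan(vec)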
You could use the Rmpfr package (Multiple Precision Floating-Point Reliable) and specify the precision in bits:
library("Rmpfr")
set.seed(1)
vec <- runif(100, 0, 0.00001)
# [1] 2.655087e-06 3.721239e-06 5.728534e-06 9.082078e-06 2.016819e-06 8.983897e-06 9.446753e-06 6.607978e-06 6.291140e-06 6.178627e-07 2.059746e-06
# [12] 1.765568e-06 6.870228e-06 3.841037e-06 7.698414e-06 4.976992e-06 7.176185e-06 9.919061e-06 3.800352e-06 7.774452e-06 9.347052e-06 2.121425e-06
# [23] 6.516738e-06 1.255551e-06 2.672207e-06 3.861141e-06 1.339033e-07 3.823880e-06 8.696908e-06 3.403490e-06 4.820801e-06 5.995658e-06 4.935413e-06
# [34] 1.862176e-06 8.273733e-06 6.684667e-06 7.942399e-06 1.079436e-06 7.237109e-06 4.112744e-06 8.209463e-06 6.470602e-06 7.829328e-06 5.530363e-06
# [45] 5.297196e-06 7.893562e-06 2.333120e-07 4.772301e-06 7.323137e-06 6.927316e-06 4.776196e-06 8.612095e-06 4.380971e-06 2.447973e-06 7.067905e-07
# [56] 9.946616e-07 3.162717e-06 5.186343e-06 6.620051e-06 4.068302e-06 9.128759e-06 2.936034e-06 4.590657e-06 3.323947e-06 6.508705e-06 2.580168e-06
# [67] 4.785452e-06 7.663107e-06 8.424691e-07 8.753213e-06 3.390729e-06 8.394404e-06 3.466835e-06 3.337749e-06 4.763512e-06 8.921983e-06 8.643395e-06
# [78] 3.899895e-06 7.773207e-06 9.606180e-06 4.346595e-06 7.125147e-06 3.999944e-06 3.253522e-06 7.570871e-06 2.026923e-06 7.111212e-06 1.216919e-06
# [89] 2.454885e-06 1.433044e-06 2.396294e-06 5.893438e-07 6.422883e-06 8.762692e-06 7.789147e-06 7.973088e-06 4.552745e-06 4.100841e-06 8.108702e-06
# [100] 6.049333e-06
sum(mpfr(vec,10))
# 1 'mpfr' number of precision 53 bits
# [1] 0.00051783234812319279
I have a recursive function that prints some nodes in a tree as integer ids. After exporting the function to R, I cannot use the cout output for anything (or so it seems). What would be ideal is if (1) I could return the output as a vector, or (2) parse the cout output inside R without losing too much speed.
I would insert some code here but my function is particularly generic. Essentially I'm trying to return, say, the Fibonacci sequence as a vector instead of a sum but through a recursive function without using global or static variables.
For example, fib(6) would return inside R as:
[1] 0 1 1 2 3 5
So one could,
y <- fib(6)
y[4] and y[4:5] would return respectively,
[1] 2
[1] 2 3
Thanks in advance for insights and ideas in problem solving. Using a static variable was as far as I got on my own.
I discuss this problem at length, with different hashing and memoization implementations in both R and C++, in chapter one of the Rcpp book.
You should read the online book http://adv-r.had.co.nz/, and especially the memoisation part, where your question is partly answered: http://adv-r.had.co.nz/Function-operators.html
Just add a function fib3 such as:
library(memoise)
fib2 <- memoise(function(n) {
if (n < 2) return(1)
fib2(n - 2) + fib2(n - 1)
})
fib3 <- memoise(function(n) sapply(1:n, fib2))
#> fib3(6)
#[1] 1 2 3 5 8 13
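If you want the sequence to start at fib(0) = 0, as shown in the question, a small variation (my sketch, with hypothetical names) simply shifts the indexing:

library(memoise)
fib0 <- memoise(function(n) {
    if (n < 2) return(n)          # fib(0) = 0, fib(1) = 1
    fib0(n - 2) + fib0(n - 1)
})
fib0_seq <- function(n) sapply(0:(n - 1), fib0)
fib0_seq(6)
# [1] 0 1 1 2 3 5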
Just for fun, a slightly more involved approach that uses std::generate_n and a function object (fseq) in lieu of sapply:
#include <Rcpp.h>

struct fseq {
public:
    fseq() {
        current = 0;
    }

    int operator()() {
        int val = fib(current);
        current++;
        return val;
    }

    int fib(int n) {
        if (n == 0) return 0;
        if (n == 1) return 1;
        return fib(n - 2) + fib(n - 1);
    }

private:
    int current;
};

// [[Rcpp::export(".fib")]]
int fib(int n) {
    if (n == 0) return 0;
    if (n == 1) return 1;
    return fib(n - 2) + fib(n - 1);
}

// [[Rcpp::export]]
std::vector<int> fib_seq(const int n) {
    if (n < 1) throw std::invalid_argument("n must be >= 1");
    std::vector<int> seq;
    seq.reserve(n);
    std::generate_n(std::back_inserter(seq), n, fseq());
    return seq;
}
library(microbenchmark)
.fib_seq <- function(n) sapply(0:(n - 1), .fib)
##
R> fib_seq(6)
[1] 0 1 1 2 3 5
R> all.equal(fib_seq(6), .fib_seq(6))
[1] TRUE
##
R> microbenchmark(
     fib_seq(6), .fib_seq(6),
     times = 1000L, unit = "us")
Unit: microseconds
        expr    min      lq      mean median      uq      max neval
  fib_seq(6)  1.561  1.9015  3.287824  2.108  2.3430 1046.021  1000
 .fib_seq(6) 27.239 29.0615 35.538355 30.290 32.8065 1108.266  1000
R> microbenchmark(
fib_seq(15),.fib_seq(15),
times=100L,unit="us")
Unit: microseconds
         expr    min      lq     mean  median      uq     max neval
  fib_seq(15)  6.108  6.5875  7.46431  7.0795  7.7590  20.391   100
 .fib_seq(15) 57.243 60.7195 72.97281 63.8120 73.4045 231.707   100
R> microbenchmark(
fib_seq(28),.fib_seq(28),
times=100L,unit="us")
Unit: microseconds
         expr      min       lq     mean   median       uq      max neval
  fib_seq(28) 2134.861 2143.489 2222.018 2167.364 2219.400 2650.854   100
 .fib_seq(28) 3705.492 3721.586 3871.314 3745.956 3852.516 5040.827   100
Note that these functions were parametrized to reflect your statement
For example, fib(6) would return inside R as:
[1] 0 1 1 2 3 5
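And if the recursion itself is not essential, an iterative sketch (my addition, with hypothetical names) returns the same vector in linear time:

Rcpp::cppFunction('
Rcpp::IntegerVector fib_seq_iter(const int n) {
    if (n < 1) Rcpp::stop("n must be >= 1");
    Rcpp::IntegerVector out(n);
    for (int i = 0; i < n; ++i) {
        out[i] = (i < 2) ? i : out[i - 1] + out[i - 2];   // fib(0) = 0, fib(1) = 1
    }
    return out;
}')
fib_seq_iter(6)
# [1] 0 1 1 2 3 5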
Suppose I have a data.frame as follows:
set.seed(45)
DF <- data.frame(x=1:10, strata2013=sample(letters[1:3], 10, TRUE))
x strata2013
1 1 b
2 2 a
3 3 a
4 4 b
5 5 b
6 6 a
7 7 a
8 8 b
9 9 a
10 10 a
I'd like to get the counts for each unique value in the column strata2013. Using data.table (for speed), one could do it in this manner:
DT <- as.data.table(DF)
DT[, .N, by=strata2013]
strata2013 N
1: b 4
2: a 6
Now I'd like to try to accomplish this in Rcpp, as a learning exercise. I've written and tried out the code shown below, which is supposed to provide the same output, but instead it gives me an error. Here's the code:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector LengthStrata(CharacterVector uniqueStrata, DataFrame dataset) {
    int n = uniqueStrata.size();
    NumericVector Nh(n);
    Rcpp::CharacterVector strata = dataset["strate2013"];

    for (int i = 0; i < n; ++i) {
        Nh[i] = strata(uniqueStrata(i)).size();
    }
    return Nh;
}
Here is the error message:
conversion from 'Rcpp::Vector<16>::Proxy {aka Rcpp::internal::string_proxy<16>}'
to 'const size_t { aka const long long unsigned int}' is ambiguous
What am I doing wrong? Thank you very much for your help.
If I understand correctly, you're hoping that strata(uniqueStrata(i)) will subset the vector, similar to how R's subsetting operates. This is unfortunately not the case; you would have to perform the subsetting 'by hand'. Rcpp doesn't have 'generic' subsetting operations available yet.
When it comes to using Rcpp, you really want to leverage the C++ standard library where possible. The de facto C++ way of generating these counts would be to use a std::map (or std::unordered_map, if you can assume C++11), with something like the following. I include a benchmark for interest.
Note from Dirk: unordered_map is actually available from tr1 for pre-C++11, so one can include it using e.g. #include <tr1/unordered_map>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector LengthStrata(DataFrame dataset) {

    Rcpp::CharacterVector strata = dataset["strata2013"];
    int n = strata.size();

    std::map<SEXP, int> counts;
    for (int i = 0; i < n; ++i) {
        ++counts[ strata[i] ];
    }

    return wrap(counts);
}

/*** R
library(data.table)
library(microbenchmark)

set.seed(45)
DF <- data.frame(strata2013 = sample(letters, 1E5, TRUE))
DT <- data.table(DF)

LengthStrata(DF)
DT[, .N, by = strata2013]

microbenchmark(
    LengthStrata(DF),
    DT[, .N, by = strata2013]
)
*/
gives me
Unit: milliseconds
                      expr      min       lq   median       uq       max neval
          LengthStrata(DF) 3.267131 3.831563 3.934992 4.101050 11.491939   100
 DT[, .N, by = strata2013] 1.980896 2.360590 2.480884 2.687771  3.052583   100
The Rcpp solution is slower in this case likely due to the time it takes to move R objects to and from the C++ containers, but hopefully this is instructive.
Aside: this is, in fact, already included in Rcpp as the sugar table() function, so if you want to skip the learning experience, you can use a pre-baked solution such as:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector LengthStrata(DataFrame dataset) {
    Rcpp::CharacterVector strata = dataset["strata2013"];
    return table(strata);
}
Sugar improves the speed of the Rcpp function:
Unit: milliseconds
                      expr      min       lq   median       uq       max neval
          LengthStrata(DF) 5.548094 5.870184 6.014002 6.448235  6.922062   100
 DT[, .N, by = strate2013] 6.526993 7.136290 7.462661 7.949543 81.233216   100
I am not sure I understand what you are trying to do. And when strata is a vector
Rcpp::CharacterVector strata=df["strate2013"];
then I am not sure what
strata(uniqueStrata(i)).size()
is supposed to do. Maybe you could describe in words (or in R with some example code and data) what you are trying to do here.