Difference between R's sum() and Armadillo's accu() - c++

There are small differences in the results of R's sum() function and RcppArmadillo's accu() function when given the same input. For example, the following code:
R:
vec <- runif(100, 0, 0.00001)
accu(vec)
sum(vec)
C++:
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
double accu(arma::vec& obj)
{
    return arma::accu(obj);
}
Gives the results:
0.00047941851844312633 (C++)
0.00047941851844312628 (R)
According to http://keisan.casio.com/calculator the true answer is:
4.79418518443126270948E-4
These small differences add up in my algorithm and significantly affect the way it executes. Is there a way to more accurately sum up vectors in C++? Or at least to get the same results that R does without having to call R code?

Update: based on what others have found in the source, I was wrong about this: sum() does not sort. The patterns of consistency I found below stem from the fact that sorting (as done in some cases below) and using extended-precision intermediate values (as done by sum()) can have similar effects on precision ...
@user2357112 comments below:
src/main/summary.c ... doesn't do any sorting. (That'd be a lot of expense to add to a summation operation.) It's not even using pairwise or compensated summation; it just naively adds everything up left to right in an LDOUBLE (either long double or double, depending on HAVE_LONG_DOUBLE).
I exhausted myself looking for this in the R source code (without success; sum is hard to search for), and the experiments below originally convinced me that when executing sum(), R sorts the input vector from smallest to largest to maximize accuracy; in fact, the difference between the sum() and Reduce() results below is due to the use of extended precision. I don't know what accu() does ...
set.seed(101)
vec <- runif(100, 0, 0.00001)
options(digits=20)
(s1 <- sum(vec))
## [1] 0.00052502325481269514554
Using Reduce("+",...) just adds the elements in order.
(s2 <- Reduce("+",sort(vec)))
## [1] 0.00052502325481269514554
(s3 <- Reduce("+",vec))
## [1] 0.00052502325481269503712
identical(s1,s2) ## TRUE
?sum() also says
Where possible extended-precision accumulators are used, but this is platform-dependent.
Doing this in RcppArmadillo on the sorted vector gives the same answer as in R; doing it on the vector in the original order gives yet a different answer (I don't know why; my guess would be the aforementioned extended-precision accumulators, which would affect the numerical outcome more when the data are unsorted).
suppressMessages(require(inline))
code <- '
arma::vec ax = Rcpp::as<arma::vec>(x);
return Rcpp::wrap(arma::accu(ax));
'
## create the compiled function
armasum <- cxxfunction(signature(x="numeric"),
code,plugin="RcppArmadillo")
(s4 <- armasum(vec))
## [1] 0.00052502325481269525396
(s5 <- armasum(sort(vec)))
## [1] 0.00052502325481269514554
identical(s1,s5) ## TRUE
But as pointed out in the comments, this doesn't work for all seeds: in the following case the Reduce() result is closer to the result of sum():
set.seed(123)
vec2 <- runif(50000,0,0.000001)
s4 <- sum(vec2); s5 <- Reduce("+",sort(vec2))
s6 <- Reduce("+",vec2); s7 <- armasum(sort(vec2))
rbind(s4,s5,s6,s7)
## [,1]
## s4 0.024869900535651481843
## s5 0.024869900535651658785
## s6 0.024869900535651523477
## s7 0.024869900535651343065
I'm stumped here. I would have expected at least s6 and s7 to be identical ...
I will point out that in general when your algorithm depends on these kinds of tiny numeric differences you're likely to be getting very frustrated, as the results are likely to differ on the basis of many small and possibly-out-of-your-control factors like particular operating system, compiler, etc. you work with.

What I have found:
I successfully managed to write a function which is able to mimic R's sum function. It appears R uses a higher precision variable to store the results of each addition operation.
What I wrote:
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
double accu2(arma::vec& obj)
{
    long double result = 0;
    for (auto iter = obj.begin(); iter != obj.end(); ++iter)
    {
        result += *iter;
    }
    return result;
}
How it compares in speed:
set.seed(123)
vec <- runif(50000, 0, 0.000001)
microbenchmark(
sum(vec),
accu(vec),
accu2(vec)
)
expr min lq mean median uq max neval
sum(vec) 72.155 72.351 72.61018 72.6755 72.7485 75.068 100
accu(vec) 48.275 48.545 48.84046 48.7675 48.9975 52.128 100
accu2(vec) 69.087 69.409 70.80095 69.6275 69.8275 182.955 100
So, my C++ solution is still faster than R's sum(), but it is significantly slower than Armadillo's accu().

You could use the Rmpfr package (Multiple Precision Floating-Point Reliable) and specify the precision, in bits, to use for the computation:
library("Rmpfr")
set.seed(1)
vec <- runif(100, 0, 0.00001)
# [1] 2.655087e-06 3.721239e-06 5.728534e-06 9.082078e-06 2.016819e-06 8.983897e-06 9.446753e-06 6.607978e-06 6.291140e-06 6.178627e-07 2.059746e-06
# [12] 1.765568e-06 6.870228e-06 3.841037e-06 7.698414e-06 4.976992e-06 7.176185e-06 9.919061e-06 3.800352e-06 7.774452e-06 9.347052e-06 2.121425e-06
# [23] 6.516738e-06 1.255551e-06 2.672207e-06 3.861141e-06 1.339033e-07 3.823880e-06 8.696908e-06 3.403490e-06 4.820801e-06 5.995658e-06 4.935413e-06
# [34] 1.862176e-06 8.273733e-06 6.684667e-06 7.942399e-06 1.079436e-06 7.237109e-06 4.112744e-06 8.209463e-06 6.470602e-06 7.829328e-06 5.530363e-06
# [45] 5.297196e-06 7.893562e-06 2.333120e-07 4.772301e-06 7.323137e-06 6.927316e-06 4.776196e-06 8.612095e-06 4.380971e-06 2.447973e-06 7.067905e-07
# [56] 9.946616e-07 3.162717e-06 5.186343e-06 6.620051e-06 4.068302e-06 9.128759e-06 2.936034e-06 4.590657e-06 3.323947e-06 6.508705e-06 2.580168e-06
# [67] 4.785452e-06 7.663107e-06 8.424691e-07 8.753213e-06 3.390729e-06 8.394404e-06 3.466835e-06 3.337749e-06 4.763512e-06 8.921983e-06 8.643395e-06
# [78] 3.899895e-06 7.773207e-06 9.606180e-06 4.346595e-06 7.125147e-06 3.999944e-06 3.253522e-06 7.570871e-06 2.026923e-06 7.111212e-06 1.216919e-06
# [89] 2.454885e-06 1.433044e-06 2.396294e-06 5.893438e-07 6.422883e-06 8.762692e-06 7.789147e-06 7.973088e-06 4.552745e-06 4.100841e-06 8.108702e-06
# [100] 6.049333e-06
sum(mpfr(vec,10))
# 1 'mpfr' number of precision 53 bits
# [1] 0.00051783234812319279

Related

Rcpp Function filling matrix with different values

I'm building a process which will instantiate a NumericMatrix and fill it with Sorenson-Dice similarity coefficients, i.e. a similarity matrix. The matrix itself is of variable dimensions and depends on the number of elements being processed. Generally there are more than 100 individual elements that are compared at any time (so the matrix dimensions will typically be 100+ by 100+).

What I've built so far will create the matrix, calculate the coefficients, then fill the matrix with those calculated values. However, when I run the function repeatedly, I notice that values within the matrix change between each run, which is not expected behavior, since the data being compared is not changing or re-sorting between runs. I also get similarities greater than 1, which should definitely not be happening.

I have four functions: one to find the numerator of the coefficient, one to find the denominator, one to use the numerator and denominator functions to calculate the coefficient, and a fourth to put the coefficients in the matrix.
Here's the C++ code:
// function to calculate the denominator of the dice coefficient
int diceDenomcpp(NumericVector val1, NumericVector val2){
    int val1Len = na_omit(val1).size();
    int val2Len = na_omit(val2).size();
    int bands = 0;
    bands = val1Len + val2Len;
    // return the computed total data points within both arrays
    return bands;
}
//######################################################################
//######################################################################
//######################################################################
// function to calculate the numerator for the dice coefficient
int diceNumcpp(NumericVector iso1, NumericVector iso2){
    // declare and initialize vectors with the element band data
    // remove any NA values within each vector
    NumericVector is1 = na_omit(iso1);
    NumericVector is2 = na_omit(iso2);
    // declare and initialize some counter variables
    int n = 0;
    int m = 0;
    int match = 0;
    // loop through the first element's first datum and check for matching datum
    // with the second element then continue to loop through each datum within each element
    while (n<=is1.size()){
        if (m>=is2.size()){
            n++;
            m=0;
        }
        // if a suitable match is found, increment the match variable
        if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
            match++;
        }
        m++;
    }
    return match;
}
//########################################################################
//########################################################################
//########################################################################
// function to put the coefficient together
double diceCoefcpp(NumericVector val1, NumericVector val2){
    NumericVector is1 = clone(val1);
    NumericVector is2 = clone(val2);
    double dVal;
    double num = 2*diceNumcpp(is1, is2);
    double denom = diceDenomcpp(is1, is2);
    dVal = num/denom;
    return dVal;
}
//#######################################################################
//#######################################################################
//#######################################################################
// function to build the similarity matrix with the coefficients
NumericMatrix simMatGencpp(NumericMatrix df){
    // clone the input data frame
    NumericMatrix rapdDat = clone(df);
    // create a data frame for the output
    NumericMatrix simMat(rapdDat.nrow(), rapdDat.nrow());
    std::fill(simMat.begin(), simMat.end(), NumericVector::get_na());
    // declare and initialize the iterator
    int i = 0;
    // declare and initialize the column counter
    int col = 0;
    // declare and initialize the isolate counter
    int iso = 0;
    //simMat(_,0)=rapdDat(_,0);
    while (iso < rapdDat.nrow()){
        if (iso+i > rapdDat.nrow()){
            col++;
            i=0;
            iso++;
        }
        if (iso+i < rapdDat.nrow()){
            simMat(iso+i, col) = diceCoefcpp(rapdDat(iso,_), rapdDat(iso+i,_));
        }
        i++;
    }
    //Rcout << "SimMatrix:" << simMat << "\n";
    return simMat;
}
Here's a sample of what the input data should look like . . .
sampleData
band1 band2 band3 band4 band5 band6
1 593.05 578.04 439.01 NA NA NA
2 589.07 567.03 NA NA NA NA
3 591.04 575.10 438.12 NA NA NA
4 591.04 NA NA NA NA NA
5 588.08 573.18 NA NA NA NA
6 591.04 576.09 552.10 NA NA NA
7 1805.00 949.00 639.19 589.07 576.09 440.06
8 952.00 588.08 574.14 550.04 NA NA
9 1718.00 576.09 425.01 NA NA NA
10 1708.00 577.05 425.01 NA NA NA
With a small enough data set the simMatGencpp() function will produce the same results each time, but when the data set gets larger, the values start to change from run to run.
I've tried running the diceNumcpp(), diceDenomcpp(), and diceCoefcpp() functions independently on individual elements, and was getting the expected output consistently each time. Once I use simMatGencpp(), however, the output gets screwy again. So I tried to loop each individual function, like below.
Example:
for(i in 1:100){
    print(diceNumcpp(sampleData[7,], sampleData[3,]))
}
The expected output from the call above should be 3, but sometimes it's 4, and which iteration produces the 4 varies from run to run: sometimes the second iteration, sometimes the 14th, sometimes none at all, or sometimes three times in a row.
My first thought was that maybe, since garbage collection doesn't work the same way in C++, the previously run function call was leaving the old vector in memory, since the name of the output object isn't changing from run to run. But then this post says that when the function exits, any object created within the scope of the function call is destroyed as well.
When I code the same solution in R-code only, the runtime sucks, but it will consistently return a matrix or the example vector with the same values each time.
I'm at a loss. Any help or light anyone could shed on this subject would be greatly received!
Thanks for your help.
Update 2020-08-19
I'm hoping that this will help provide some insight for the more well-versed C++ people out there, so that you may have some additional ideas about what may be happening. I have some sample data, similar to what is shown above, that is 187 rows long, meaning that a similarity matrix of these data would have 17578 elements. I've been running comparisons between the R version of this solution and the C++ version of this solution, using code like the following, with the sample data:
# create the similarity matrix with the R-solution to compare iteratively
# with another R-solution similarity matrix
simMat1 <- simMatGen(isoMat)
resultsR <- c()
for(i in 1:100){
    simMat2 <- simMatGen(isoMat)
    # check for any mis-matched elements in each matrix
    resultsR[[i]] <- length(which(simMat1 == simMat2)==TRUE)
    #######################################################################
    # every time this runs I get the expected number of TRUE values, 17578,
    # and I check this by subtracting mean(resultsR) from the expected
    # number of TRUE values, 17578
}
mean(resultsR)
Now when I do this same process with the C++ version, things change drastically and quickly. I tried this with both 64-bit and 32-bit R 3.6.0, just because.
simMat1 <- simMatGen(isoMat)
isoMat <- as.matrix(isoMat)
resultscpp <- c()
for(i in 1:10000){
    simMat2 <- simMatGencpp(isoMat)
    resultscpp[[i]] <- length(which(simMat1 == simMat2)==TRUE)
    ############ 64 bit R ##############
    # first iteration: length(which(simMat1 == simMat2)==TRUE)-17578 equals 2
    # second iteration: 740 elements differ
    # third iteration: 1142 elements differ
    # after 100 iterations the average difference is 2487.7 elements
    # after 10000 iterations the average difference is 2625.91 elements
    ############ 32 bit R ##############
    # first iteration: difference = 1
    # second iteration: difference = 694
    # 100 iterations: difference = 2520.94
    # 10000 iterations: difference = 2665.04
}
mean(resultscpp)
Here's sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 rstudioapi_0.10 magrittr_1.5 usethis_1.5.0 devtools_2.1.0 pkgload_1.0.2 R6_2.4.0 rlang_0.4.4
[9] tools_3.6.0 pkgbuild_1.0.3 sessioninfo_1.1.1 cli_1.1.0 withr_2.1.2 remotes_2.1.0 assertthat_0.2.1 digest_0.6.20
[17] rprojroot_1.3-2 crayon_1.3.4 processx_3.3.1 callr_3.2.0 fs_1.3.1 ps_1.3.0 testthat_2.3.1 memoise_1.1.0
[25] glue_1.3.1 compiler_3.6.0 desc_1.2.0 backports_1.1.5 prettyunits_1.0.2
I made a rookie C++ mistake here.
In diceNumcpp() I didn't put any checks in place to make sure I don't accidentally reference an out-of-bounds element of the array.
// if a suitable match is found, increment the match variable
if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
    match++;
}
was changed to:
// if a suitable match is found, increment the match variable
if(n<=(is1.size()-1) && (m<=is2.size()-1)){ // <- here need to make sure it stays in bounds
    if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
        match++;
    }
}
and after running it 1000 times I was able to get correct results every time.
Learn something new everyday.
Cheers.

Educational - understanding variable performance of recursive functions with Rcpp

The problem is not of practical nature and I'm only looking for a sound explanation of the observed occurence. I'm reading Seamless R and C++ Integration with Rcpp (Use R!) by Dirk Eddelbuettel. Following the introduction, I'm looking at two simple "Fibonacci functions".
In RStudio I have a cpp file of the following structure
fib_fun.cpp
#include <Rcpp.h>
// [[Rcpp::export]]
int fibonacci(const int x) {
    if (x < 2)
        return x;
    else
        return (fibonacci(x - 1)) + fibonacci(x - 2);
}
/*** R
# Call the fib function defined in R
fibonacci(10)
*/
I also have an inline implementation of the same function:
inline_fib.R
# Inline fib implementation
incltxt <- "int fibonacci(const int x) {
    if (x == 0) return(0);
    if (x == 1) return(1);
    return fibonacci(x - 1) + fibonacci(x - 2);
}"
# Inline call
require(inline)
fibRcpp <- cxxfunction(signature(xs = "int"), plugin = "Rcpp",
                       includes = incltxt,
                       body = "int x = Rcpp::as<int>(xs);
                               return Rcpp::wrap(fibonacci(x));")
When I benchmark the functions I get the following results:
> microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
fibonacci(10) 3.121 3.198 5.5192 3.447 3.886 23.491 10
fibRcpp(10) 1.176 1.398 3.9520 1.558 1.709 25.721 10
Questions
Why is there a significant difference in performance between the two functions?
With respect to the practicalities surrounding the use of Rcpp, what is generally considered good practice? In my naivety, my first hunch would be to write a function and source it via sourceCpp(), but this solution appears to be much slower.
Benchmarking code
require(microbenchmark); require(Rcpp); require(inline)
sourceCpp("fib_fun.cpp"); source("inline_fib.R")
microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Comment replies
I tried the functions with unsigned int instead of int; the results:
Unit: microseconds
expr min lq mean median uq max neval
fibonacci(10) 2.908 2.992 5.0369 3.267 3.598 20.291 10
fibRcpp(10) 1.201 1.263 6.3523 1.424 1.639 50.536 10
All good comments above.
The function is way too lightweight at x=10 and you need to call way more often than times=10 to find anything meaningful. You are measuring noise.
As for style, most of us prefer fibonacci() via Rcpp Attributes...

Reconciling exponential function results in C++ (Rcpp) and R

I am working on speeding up software from my dissertation by utilizing Rcpp and RcppEigen. I have been very impressed with Rcpp and RcppEigen, as the speed of my software has increased by upwards of 100 times. This is quite exciting to me, because my R code had been parallelized using snow/doSNOW and the foreach package, so the actual speed gain is probably somewhere around 400x. However, the last time I attempted to run my program in its entirety to assess overall speed gains after translating some gradient/hessian calculations into C++, I saw that the new Hessian matrix calculated using my C++ code differs from the old, much slower version which was calculated strictly in R. I had been very careful to check my results line by line, slowly increasing the complexity of my calculations while assuring the results were identical in R and C++. I realize now that I was only checking the first 11 or so digits.
The code for optimization has been very robust in R, but was dreadfully slow. All of the calculations in C++ have been checked and were virtually identical to previous versions in R (this was checked to 11 digits via specifying options(digits = 11) at the beginning of each session). However, deviations in long vectors or matrices representing particular quantities begin at 15 or so digits past the decimal point in some cells/elements. These differences become problematic when using matrix multiplication and summing over risk sets, as a small difference can lead to a large error (is it an error?) in the overall precision of the final estimate.
After looking back over my code and finding the first point of deviation in results between R and C++, I observed that this first occurs after taking the exponential of a matrix or vector in my Rcpp code. This led me to work out the examples below, which I hope illustrates the issue I am seeing. Has anyone observed this before, and is there a way to utilize the R exponential function within C++ or change the routine used within C++?
## A small example to illustrate issues with Rcppsugar exponentiate function
library(RcppEigen)
library(inline)
RcppsugarexpC <-
"
using Eigen::MatrixXd;
typedef Eigen::Map<Eigen::MatrixXd> MapMatd;
MapMatd A(as<MapMatd>(AA));
MatrixXd B = exp(A.array());
return wrap(B);
"
RcppexpC <-
"
using Eigen::MatrixXd;
using Eigen::VectorXd;
typedef Eigen::Map<Eigen::MatrixXd> MapMatd;
MapMatd A(as<MapMatd>(AA));
MatrixXd B = A.array().exp().matrix();
return wrap(B);
"
Rcppsugarexp <- cxxfunction(signature(AA = "NumericMatrix"), RcppsugarexpC, plugin = "RcppEigen")
Rcppexp <- cxxfunction(signature(AA = "NumericMatrix"), RcppexpC, plugin = "RcppEigen")
mat <- matrix(seq(-5.25, 10.25, by = 1), ncol = 4, nrow = 4)
RcppsugarC <- Rcppsugarexp(mat)
RcppexpC <- Rcppexp(mat)
exp <- exp(mat)
I then tested whether these exponentiated matrices were actually equal beyond the print standard (default is 7) that R uses via:
exp == RcppexpC ## inequalities in 3 cells
exp == RcppsugarC ## inequalities in 3 cells
RcppsugarC == RcppexpC ## these are equal!
sprintf("%.22f", exp)
Please forgive me if this is a dense question - my computer science skills are not as strong as they should be, but I am eager to learn how to do better. I appreciate any and all help or advice that can be given me. Special thanks to the creators of Rcpp, and all of the wonderful moderators/contributors at this site - your previous answers have saved me from posting questions on here well over a hundred times!
Edit:
It turns out that I didn't know what I was doing. I wanted to apply exp() element-wise to the MatrixXd or VectorXd, which I was attempting by using the .array() method; however, calling exp(A.array()) or A.exp() computes what is referred to as the matrix exponential, rather than computing exp(A_ij) element by element. My friend pointed this out to me when he worked out a simple example using std::exp() on each element in a nested for loop and found that his result was identical to what was reported in R. I thus needed to use the .unaryExpr functionality of Eigen, which meant changing the compiler settings to -std=c++0x. I was able to do this by specifying the following in R:
settings$env$PKG_CXXFLAGS='-std=c++0x'
I then made a file called Rcpptesting.cpp which is below:
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
using Eigen::Map; // 'maps' rather than copies
using Eigen::MatrixXd; // variable size matrix, double precision
using Eigen::VectorXd; // variable size vector, double precision
// [[Rcpp::export]]
MatrixXd expCorrect(Map<MatrixXd> M) {
    MatrixXd M2 = M.unaryExpr([](double e){ return std::exp(e); });
    return M2;
}
After this, I was able to call the function with sourceCpp() in R as follows (note that I used verbose = TRUE and rebuild = TRUE because this seems to print information about the settings being used; I was trying to make sure that -std=c++0x was actually being applied):
sourceCpp("~/testingRcpp.cpp", verbose = TRUE, rebuild = TRUE)
Then the following R code worked like a charm:
mat <- matrix(seq(-5.25, 10.25, by = 1), ncol = 4, nrow = 4)
exp(mat) == expCorrect(mat)
Pretty cool!

Summarize with rcpp

Suppose, I've a data.frame as follows:
set.seed(45)
DF <- data.frame(x=1:10, strata2013=sample(letters[1:3], 10, TRUE))
x strata2013
1 1 b
2 2 a
3 3 a
4 4 b
5 5 b
6 6 a
7 7 a
8 8 b
9 9 a
10 10 a
And I'd like to get the counts for each unique value in the column strata2013, then, using data.table (for speed), one could do it in this manner:
DT <- as.data.table(DF)
DT[, .N, by=strata2013]
strata2013 N
1: b 4
2: a 6
Now, I'd like to try and accomplish this in Rcpp, as a learning exercise. I've written and tried out the code shown below which is supposed to provide the same output, but instead it gives me an error. Here's the code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector LengthStrata (CharacterVector uniqueStrata, DataFrame dataset) {
    int n = uniqueStrata.size();
    NumericVector Nh(n);
    Rcpp::CharacterVector strata = dataset["strate2013"];
    for (int i = 0; i < n; ++i) {
        Nh[i] = strata(uniqueStrata(i)).size();
    }
    return Nh;
}
Here is the error message:
conversion from 'Rcpp::Vector<16>::Proxy {aka Rcpp::internal::string_proxy<16>}'
to 'const size_t { aka const long long unsigned int}' is ambiguous
What am I doing wrong? Thank you very much for your help.
If I understand correctly, you're hoping that strata(uniqueStrata(i)) will subset the vector, similar to how R's subsetting operates. This is unfortunately not the case; you would have to perform the subsetting 'by hand'. Rcpp doesn't have 'generic' subsetting operations available yet.
When it comes to using Rcpp, you really want to leverage the C++ standard library where possible. The de-facto C++ way of generating these counts would be to use a std::map (or std::unordered_map, if you can assume C++11), with something like the following. I include a benchmark for interest.
Note from Dirk: unordered_map is actually available from tr1 for pre-C++11, so one can include it using e.g. #include <tr1/unordered_map>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
    Rcpp::CharacterVector strata = dataset["strata2013"];
    int n = strata.size();
    std::map<SEXP, int> counts;
    for (int i = 0; i < n; ++i) {
        ++counts[ strata[i] ];
    }
    return wrap(counts);
}
/*** R
library(data.table)
library(microbenchmark)
set.seed(45)
DF <- data.frame(strata2013=sample(letters, 1E5, TRUE))
DT <- data.table(DF)
LengthStrata(DF)
DT[, .N, by=strata2013]
microbenchmark(
LengthStrata(DF),
DT[, .N, by=strata2013]
)
*/
gives me
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 3.267131 3.831563 3.934992 4.101050 11.491939 100
DT[, .N, by = strata2013] 1.980896 2.360590 2.480884 2.687771 3.052583 100
The Rcpp solution is slower in this case likely due to the time it takes to move R objects to and from the C++ containers, but hopefully this is instructive.
Aside: This is, in fact, already included in Rcpp as the sugar table function, so if you want to skip the learning experience, you can use a pre-baked solution as
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
    Rcpp::CharacterVector strata = dataset["strata2013"];
    return table(strata);
}
Sugar improves the speed of the Rcpp function:
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 5.548094 5.870184 6.014002 6.448235 6.922062 100
DT[, .N, by = strate2013] 6.526993 7.136290 7.462661 7.949543 81.233216 100
I am not sure I understand what you are trying to do. And when strata is a vector
Rcpp::CharacterVector strata=df["strate2013"];
then I am not sure what
strata(uniqueStrata(i)).size()
is supposed to do. Maybe you could describe in words (or in R with some example code and data) what you are trying to do here.

Rewriting slow R function in C++ & Rcpp

I have this line of R code:
croppedDNA <- completeDNA[,apply(completeDNA,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))]
What it does is identify the sites (columns) in a matrix of DNA sequences (1 row = one sequence) that are not universal (i.e. that are informative) and subset them from the matrix to make a new 'cropped' matrix; in other words, it gets rid of all the columns in which the values are all the same. For a big dataset this takes about 6 seconds. I don't know if I can do it faster in C++ (I'm still a beginner in C++), but it will be good for me to try. My idea is to use Rcpp, loop through the columns of the CharacterMatrix, pull out each column (site) as a CharacterVector, and check whether its values are all the same. If they are, record that column number/index, and continue for all columns. Then at the end make a new CharacterMatrix that only includes the recorded columns. It is important that I keep the row names and column names as they are in the R version of the matrix, i.e. if a column goes, so should its colname.
I've been writing for about two minutes, so far what I have is (not finished):
#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterMatrix reduce_sequences(CharacterMatrix completeDNA)
{
    std::vector<bool> informativeSites;
    for(int i = 0; i < completeDNA.ncol(); i++)
    {
        CharacterVector bpsite = completeDNA(,i);
        if(all(bpsite == bpsite[1])
        {
            informativeSites.push_back(i);
        }
    }
    CharacterMatrix cutDNA = completeDNA(,informativeSites);
    return cutDNA;
}
Am I going the right way about this? Is there an easier way? My understanding is that I need std::vector because it's easy to grow one (since I don't know in advance how many columns I am going to want to keep). With the indexing, will I need to add 1 to the informativeSites values at the end (because R indexes from 1 and C++ from 0)?
Thanks,
Ben W.
Sample data:
set.seed(123)
z <- matrix(sample(c("a", "t", "c", "g", "N", "-"), 3*398508, TRUE), 3, 398508)
OP's solution:
system.time(y1 <- z[,apply(z,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))])
# user system elapsed
# 4.929 0.043 4.976
A faster version using base R:
system.time(y2 <- (z[, colSums(z[-1,] != z[-nrow(z), ]) > 0]))
# user system elapsed
# 0.087 0.011 0.098
The results are identical:
identical(y1, y2)
# [1] TRUE
It's very possible C++ will beat it, but is it really necessary?