I have a std::vector; whose elements need to be summed up after checking if there is any Na values (and obviously removing the Na values if it has any) in it. I have to do it in Rcpp. Now, for a numeric vector in Rcpp (NumericVector); it is very easy as the code says below:
cppFunction("
double res ( NumericVector x){
NumericVector v = x[! is_na(x)];
return sum(v);
}
")
. So for a vector "x", it easily gives the sum as follows:
x<- c(NaN,1,2)
res(x)
[1] 3
Now for a std::vector x; how can I do the same?
You should be able to use RcppHoney (also on CRAN here) which brings the vectorised idioms of Rcpp Sugar (which has vectorised NA tests just like R has) to any iterable container -- hence also STL ones.
See eg the into vignette for this example of combining different vector types and classes into a single scalar exppression:
// [[Rcpp::export]]
Rcpp::NumericVector example_manually_hooked() {
// We manually hooked std::list in to RcppHoney so we'll create one
std::list< int > l;
l.push_back(1); l.push_back(2); l.push_back(3); l.push_back(4); l.push_back(5);
// std::vector is already hooked in to RcppHoney in default_hooks.hpp so
// we'll create one of those too
std::vector< int > v(l.begin(), l.end());
// And for good measure, let's create an Rcpp::NumericVector which is
// also hooked by default
Rcpp::NumericVector v2(v.begin(), v.end());
// Now do some weird operations incorporating std::vector, std::list,
// Rcpp::NumericVector and some RcppHoney functions and return it. The
// return value will be equal to the following R snippet:
// v <- 1:5
// result <- 42 + v + v + log(v) - v - v + sqrt(v) + -v + 42
// We can store our result in any of RcppHoney::LogicalVector,
// RcppHoney::IntegerVector, or RcppHoney::NumericVector and simply return
// it to R. These classes inherit from their Rcpp counterparts and add a
// new constructor. The only copy of the data, in this case, is when we
// assign our expression to retval. Since it is then a "native" R type,
// returning it is a shallow copy. Alternatively we could write this as:
// return Rcpp::wrap(1 + v + RcppHoney::log(v) - v - 1
// + RcppHoney::sqrt(v) + -v2);
RcppHoney::NumericVector retval
= 42 + l + v + RcppHoney::log(v) - v - l + RcppHoney::sqrt(v) + -v2
+ 42;
return retval;
}
Related
I'm trying to get vector recycling to work in Rcpp.
> recycle_and_add <- Rcpp::cppFunction("
+ NumericVector recycle_and_add(NumericVector x, NumericVector y) {
+ return x + y;
+ }")
> recycle_and_add(42, 1:5)
[1] 43
I'm expecting it to return something like
> 42 + 1:5
[1] 43 44 45 46 47
After some analysis, I found out that x.size() is 1 and y.size() is 5 within the Rcpp function, so clearly vector recycling doesn't work out-of-the-box.
While I can manually find the longest of x and y and recycle the shorter one, in the actual application there are 3 or 4 arguments requiring recycling, so I can imagine manual unrolling would result in a lot of variables pointing to different vectors and turn the code into a pile of spaghetti.
Does Rcpp have any built-in support for vector recycling, like, with some sugar?
Strategy-wise, it's almost always easier to recycle in R and then move into C++.
If it must be done in C++, then the following design pattern should work:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector recycle_vector(Rcpp::NumericVector x,
Rcpp::NumericVector y) {
// Obtain vector sizes
int n_x = x.size();
int n_y = y.size();
// Check both vectors have elements
if(n_x <= 0 || n_y <= 0) {
Rcpp::stop("Both `x` and `y` vectors must have at least 1 element.");
}
// Compare the three cases that lead to recycling...
if(n_x == n_y) {
return x + y;
} else if (n_x > n_y) {
return Rcpp::rep_len(y, n_x) + x;
}
return Rcpp::rep_len(x, n_y) + y;
}
Test Cases:
recycle_vector(1:3, 1:3)
# [1] 2 4 6
recycle_vector(4, 1:3)
# [1] 5 6 7
recycle_vector(10:12, -2:-1)
# [1] 8 10 10
I'm trying to build a super fast mode function for R to use for aggregating large categorical datasets. The function should take vector input of all supported R types and return the mode. I have read This post, This Help-page and others, but I was not able to make the function take in all R data types. My code now works for numeric vectors, I am relying on Rcpp sugar wrapper functions:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int Mode(NumericVector x, bool narm = false)
{
if (narm) x = x[!is_na(x)];
NumericVector ux = unique(x);
int y = ux[which_max(table(match(x, ux)))];
return y;
}
In addition I was wondering if the 'narm' argument can be renamed 'na.rm' without giving errors, and of course if there is a faster way to code a mode function in C++, I would be grateful to know about it.
In order to make the function work for any vector input, you could implement #JosephWood's algorithm for any data type you want to support and call it from a switch(TYPEOF(x)). But that would be lots of code duplication. Instead, it is better to make a generic function that can work on any Vector<RTYPE> argument. If we follow R's paradigm that everything is a vector and let the function also return a Vector<RTYPE>, then we can make use of RCPP_RETURN_VECTOR. Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR. One tricky thing is that you need the storage type for Vector<RTYPE> in order to create a suitable std::unordered_map. Here Rcpp::traits::storage_type<RTYPE>::type comes to the rescue. However, std::unordered_map does not know how to deal with complex numbers from R. For simplicity, I am disabling this special case.
Putting it all together:
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>
template <int RTYPE>
Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
if (narm) x = x[!is_na(x)];
int myMax = 1;
Vector<RTYPE> myMode(1);
// special case for factors == INTSXP with "class" and "levels" attribute
if (x.hasAttribute("levels")){
myMode.attr("class") = x.attr("class");
myMode.attr("levels") = x.attr("levels");
}
std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
modeMap.reserve(x.size());
for (std::size_t i = 0, len = x.size(); i < len; ++i) {
auto it = modeMap.find(x[i]);
if (it != modeMap.end()) {
++(it->second);
if (it->second > myMax) {
myMax = it->second;
myMode[0] = x[i];
}
} else {
modeMap.insert({x[i], 1});
}
}
return myMode;
}
template <>
Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
stop("Not supported SEXP type!");
}
// [[Rcpp::export]]
SEXP fastMode( SEXP x, bool narm = false ){
RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
}
/*** R
set.seed(1234)
s <- sample(1e5, replace = TRUE)
fastMode(s)
fastMode(s + 0.1)
l <- sample(c(TRUE, FALSE), 11, replace = TRUE)
fastMode(l)
c <- sample(letters, 1e5, replace = TRUE)
fastMode(c)
f <- as.factor(c)
fastMode(f)
*/
Output:
> set.seed(1234)
> s <- sample(1e5, replace = TRUE)
> fastMode(s)
[1] 85433
> fastMode(s + 0.1)
[1] 85433.1
> l <- sample(c(TRUE, FALSE), 11, replace = TRUE)
> fastMode(l)
[1] TRUE
> c <- sample(letters, 1e5, replace = TRUE)
> fastMode(c)
[1] "z"
> f <- as.factor(c)
> fastMode(f)
[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
As noted above, the used algorithm comes from Joseph Wood's answer, which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.
In your Mode function, since you are mostly calling sugar wrapper functions, you won't see that much improvement over base R. In fact, simply writing a faithful base R translation, we have:
baseMode <- function(x, narm = FALSE) {
if (narm) x <- x[!is.na(x)]
ux <- unique(x)
ux[which.max(table(match(x, ux)))]
}
And benchmarking, we have:
set.seed(1234)
s <- sample(1e5, replace = TRUE)
library(microbenchmark)
microbenchmark(Mode(s), baseMode(s), times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Mode(s) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
baseMode(s) 1.490765 1.645367 1.571132 1.616061 1.637181 1.448306 10
Typically, when we undertake the effort of writing our own compiled code, we would expect bigger gains. Simply wrapping these already efficient compiled functions in Rcpp isn't going to magically get you the gains you expect. In fact, on larger examples the base solution is faster. Observe:
set.seed(1234)
sBig <- sample(1e6, replace = TRUE)
system.time(Mode(sBig))
user system elapsed
1.410 0.036 1.450
system.time(baseMode(sBig))
user system elapsed
0.915 0.025 0.943
To address your question of writing a faster mode function, we can make use of std::unordered_map, which is very similar to table underneath the hood (i.e. they are both hash tables at their heart). Additionally, since you are returning a single integer, we can safely assume that we can replace NumericVector with IntegerVector and also that you are not concerned with returning every value that occurs the most.
The algorithm below can be modified to return the true mode, but I will leave that as an exercise (hint: you will need std::vector along with taking some sort of action when it->second == myMax). N.B. you will also need to add // [[Rcpp::plugins(cpp11)]] at the top of your cpp file for std::unordered_map and auto.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>
// [[Rcpp::export]]
int fastIntMode(IntegerVector x, bool narm = false) {
if (narm) x = x[!is_na(x)];
int myMax = 1;
int myMode = 0;
std::unordered_map<int, int> modeMap;
modeMap.reserve(x.size());
for (std::size_t i = 0, len = x.size(); i < len; ++i) {
auto it = modeMap.find(x[i]);
if (it != modeMap.end()) {
++(it->second);
if (it->second > myMax) {
myMax = it->second;
myMode = x[i];
}
} else {
modeMap.insert({x[i], 1});
}
}
return myMode;
}
And the benchmarks:
microbenchmark(Mode(s), baseMode(s), fastIntMode(s), times = 15, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Mode(s) 6.428343 6.268131 6.622914 6.134388 6.881746 7.78522 15
baseMode(s) 9.757491 9.404101 9.454857 9.169315 9.018938 10.16640 15
fastIntMode(s) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 15
Now we are talking... about 6x faster than the original and 9x faster than base. They all return the same value:
fastIntMode(s)
##[1] 85433
baseMode(s)
##[1] 85433
Mode(s)
##[1] 85433
And for our larger example:
## base R returned in 0.943s
system.time(fastIntMode(s))
user system elapsed
0.217 0.006 0.224
In addition to the implicit CC-BY-SA license I hereby license the code in this answer under the GPL >= 2.
To follow up with some shameless self-promotion, I have now published a package collapse on CRAN which includes a full set of Fast Statistical Functions, amonst them the generic function fmode. The implementation is based on index hashing and even faster than the solution above. fmode can be used to perform simple, grouped and/or weighted mode calculations on vectors, matrices, data.frames and dplyr grouped tibbles. Syntax:
fmode(x, g = NULL, w = NULL, ...)
where x is a vector, matrix, data.frame or grouped_df, g is a grouping vector or list of grouping vectors, and w is a vector of weights. A compact solution to categorical and mixed aggregation problems is further provided by the function collap. The code
collap(data, ~ id1 + id2, FUN = fmean, catFUN = fmode)
aggregates the mixed type data.frame data applying fmean to numeric and fmode to categorical columns. More customized calls are also possible. Together with the Fast Statistical Functions, collap is just as fast as data.table on large numeric data, and categorical and weighted aggregations are significantly faster than anything that can presently be done with data.table.
I am trying to write a Sequential Monte Carlo function in Rcpp, and I having the following problem:
I have created a vector the following way:
NumericVector R_t(Part*Ttau);
and I want to fill ONLY Part blocks of the vector. It should be like:
for (int i=0;i<Part;i++){
R_t[i]=runif(1,0,2);
}
and the second time I'd like to have
for (int i=Part+1;i<2*Part;i++){
R_t[i]=runif(1,0,2);
}
But it does not seem to work. I could replace the old values with the new ones in each iteration, but I need the old ones for each iteration. When I try to compile, I get the following error:
cannot convert 'Rcpp::NUmericVector {aka Rcpp::Vector<14, Rcpp::PrserveStorage>}' to 'Rcpp::traits::storage_type<14>:: type {aka double}' in assignment
Would it be easier to replace the vector with a 2-d matrix with dimensions Part and Ttau? I would like to avoid this last option.
Sorry if this has been answered, but I did not find anything close to this for rcpp
You are trying to assign a length-one vector to a location that expects a double, so use [0] to access the first element: runif(1,0,2)[0]. However, you can also just replace your loop with Rcpp sugar constructs to avoid repeatedly generating one random value at a time:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector fill_vector(R_xlen_t n, R_xlen_t m) {
Rcpp::NumericVector res(n);
for (R_xlen_t i = 0; i < m; i++) {
res[i] = Rcpp::runif(1, 0, 2)[0];
}
return res;
}
// [[Rcpp::export]]
Rcpp::NumericVector fill_vector2(R_xlen_t n, R_xlen_t m) {
Rcpp::NumericVector res(n);
res[Rcpp::seq(0, m - 1)] = Rcpp::runif(m, 0, 2);
return res;
}
/***R
set.seed(123)
fill_vector(7, 4)
#[1] 0.5751550 1.5766103 0.8179538 1.7660348 0.0000000 0.0000000 0.0000000
set.seed(123)
fill_vector2(7, 4)
#[1] 0.5751550 1.5766103 0.8179538 1.7660348 0.0000000 0.0000000 0.0000000
set.seed(123)
c(runif(4, 0, 2), rep(0, 3))
#[1] 0.5751550 1.5766103 0.8179538 1.7660348 0.0000000 0.0000000 0.0000000
*/
You have two options when it comes to RNGs:
Use Rcpp sugar to match runif(n,a,b) in R via Rcpp::runif(n,a,b) (returns NumericVector or
Create your own loop to mimic runif(n,a,b) by drawing each time from R::runif(a,b)
#nrussell demoed how to use 1 by subsetting the vector via Rcpp::runif(n,a,b)[0] but left out approach 2.
Below is how to go about approach 2:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector draw_vector(int n, int m) {
Rcpp::NumericVector res(n);
for (int i = 0; i < m; i++) {
res[i] = R::runif(0.0, 2.0); // Draw a single element that is a double
}
return res;
}
/***R
set.seed(123)
draw_vector(7, 4)
*/
This gives:
[1] 0.5751550 1.5766103 0.8179538 1.7660348 0.0000000 0.0000000 0.0000000
Say I have a vector<int> positions that represents positions that I would like to subset from, and two Rcpp::NumericVector vectors A and B that I want to subset (both can be treated also as vector<double>).
What would be the best way to calculate what in R I would write as
sum(A[positions]) (a double), or A[positions] / B[positions] (a vector[double])?
Basically, I would like to access the elements of the vectors at certain positions without making copies (or a for loop) if I do not have to.
Example in R:
positions = c(2,4,5) # just a vector with positions
A = rnorm(100) # a vector with 100 random numbers
B = rnorm(100)
mysum <- sum(A[positions])
mysmallvector <- A[positions] / B[positions] # or (A/B)[positions]
Right now I just loop through all the values of positions and and subset the vectors by position one by one, but I can't help thinking there is a more elegant solution.
So, replicating R's functionality in Rcpp is not necessarily ideal. For one, you should definitely check out the caveats to subsetting in Rcpp using Rcpp sugar expressions. Secondly, you are using a for loop even within R due to the vectorization structure R has.
You may wish to consider using RcppArmadillo instead of Rcpp data types. The downside to this is you will incur a copy hit when the data is ported into C++ and then back to R. With Rcpp data types, you will avoid that but you will have to define your own operations (see divide_subset() below).
With this being said, we can replicate the functionality requested via Rcpp:
#include <Rcpp.h>
using namespace Rcpp;
// Uses sugar index subsets
// [[Rcpp::export]]
NumericVector subset(NumericVector x, IntegerVector idx) {
return x[idx];
}
// Uses sugar summation function (e.g. a nice for loop)
// [[Rcpp::export]]
double sum_subset(NumericVector x, IntegerVector idx) {
return sum(subset(x,idx));
}
// No sugar for element-wise division
// [[Rcpp::export]]
NumericVector divide_subset(NumericVector x, NumericVector y, IntegerVector idx) {
unsigned int n = idx.size();
NumericVector a(n);
for(unsigned int i = 0; i < idx.size(); i++){
a[i] = x[idx[i]]/y[idx[i]];
}
return a;
}
/*** R
set.seed(1334)
positions = c(2,4,5)
# Subtract one from indexes for C++
pos_cpp = positions - 1
A = rnorm(100) # a vector with 100 random numbers
B = rnorm(100)
mysum = sum(A[positions])
cppsum = sum_subset(A, pos_cpp)
all.equal(cppsum, mysum)
mysmallvector = A[positions] / B[positions] # or (A/B)[positions]
cppdivide = divide_subset(A,B, pos_cpp)
all.equal(cppdivide, mysmallvector)
*/
I'm trying to write a sort of analogue of R's setdiff() function in C++ using RcppArmadillo. My rather crude approach:
// [[Rcpp::export]]
arma::uvec my_setdiff(arma::uvec x, arma::uvec y){
// Coefficientes of unsigned integer vector y form a subset of the coefficients of unsigned integer vector x.
// Returns set difference between the coefficients of x and those of y
int n2 = y.n_elem;
uword q1;
for (int j=0 ; j<n2 ; j++){
q1 = find(x==y[j]);
x.shed_row(q1);
}
return x;
}
fails at compilation time. The error reads:
fnsauxarma.cpp:622:29: error: no matching function for call to ‘arma::Col<double>::shed_row(const arma::mtOp<unsigned int, arma::mtOp<unsigned int, arma::Col<double>, arma::op_rel_eq>, arma::op_find>)’
I really have no idea what's going on, any help or comments would be greatly appreciated.
The problem is that arma::find returns a uvec, and doesn't know how to make the implicit conversion to arma::uword, as pointed out by #mtall. You can help the compiler out by using the templated arma::conv_to<T>::from() function. Also, I included another version of my_setdiff that returns an Rcpp::NumericVector because although the first version returns the correct values, it's technically a matrix (i.e. it has dimensions), and I assume you would want this to be as compatible with R's setdiff as possible. This is accomplished by setting the dim attribute of the return vector to NULL, using R_NilValue and the Rcpp::attr member function.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec my_setdiff(arma::uvec& x, const arma::uvec& y){
for (size_t j = 0; j < y.n_elem; j++) {
arma::uword q1 = arma::conv_to<arma::uword>::from(arma::find(x == y[j]));
x.shed_row(q1);
}
return x;
}
// [[Rcpp::export]]
Rcpp::NumericVector my_setdiff2(arma::uvec& x, const arma::uvec& y){
for (size_t j = 0; j < y.n_elem; j++) {
arma::uword q1 = arma::conv_to<arma::uword>::from(arma::find(x == y[j]));
x.shed_row(q1);
}
Rcpp::NumericVector x2 = Rcpp::wrap(x);
x2.attr("dim") = R_NilValue;
return x2;
}
/*** R
x <- 1:8
y <- 2:6
R> all.equal(setdiff(x,y), my_setdiff(x,y))
#[1] "Attributes: < target is NULL, current is list >" "target is numeric, current is matrix"
R> all.equal(setdiff(x,y), my_setdiff2(x,y))
#[1] TRUE
R> setdiff(x,y)
#[1] 1 7 8
R> my_setdiff(x,y)
# [,1]
# [1,] 1
# [2,] 7
# [3,] 8
R> my_setdiff2(x,y)
#[1] 1 7 8
*/
Edit:
For the sake of completeness, here is a more robust version of setdiff than the two implementations presented above:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::export]]
Rcpp::NumericVector arma_setdiff(arma::uvec& x, arma::uvec& y){
x = arma::unique(x);
y = arma::unique(y);
for (size_t j = 0; j < y.n_elem; j++) {
arma::uvec q1 = arma::find(x == y[j]);
if (!q1.empty()) {
x.shed_row(q1(0));
}
}
Rcpp::NumericVector x2 = Rcpp::wrap(x);
x2.attr("dim") = R_NilValue;
return x2;
}
/*** R
x <- 1:10
y <- 2:8
R> all.equal(setdiff(x,y), arma_setdiff(x,y))
#[1] TRUE
X <- 1:6
Y <- c(2,2,3)
R> all.equal(setdiff(X,Y), arma_setdiff(X,Y))
#[1] TRUE
*/
The previous versions would throw an error if you passed them vectors with non-unique elements, e.g.
R> my_setdiff2(X,Y)
error: conv_to(): given object doesn't have exactly one element
To solve the problem and more closely mirror R's setdiff, we just make x and y unique. Additionally, I switched out the arma::conv_to<>::from with q1(0) (where q1 is now a uvec instead of a uword), because uvec's are just a vector of uwords, and the explicit cast seemed a little inelegant.
I've used std::set_difference from the STL instead, converting back and forth from arma::uvec.
#include <RcppArmadillo.h>
#include <algorithm>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec std_setdiff(arma::uvec& x, arma::uvec& y) {
std::vector<int> a = arma::conv_to< std::vector<int> >::from(arma::sort(x));
std::vector<int> b = arma::conv_to< std::vector<int> >::from(arma::sort(y));
std::vector<int> out;
std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
std::inserter(out, out.end()));
return arma::conv_to<arma::uvec>::from(out);
}
Edit: I thought a performance comparison might be in order. The difference becomes smaller when the relative sizes of the sets are in the opposite order.
a <- sample.int(350)
b <- sample.int(150)
microbenchmark::microbenchmark(std_setdiff(a, b), arma_setdiff(a, b))
> Unit: microseconds
> expr min lq mean median uq max neval cld
> std_setdiff(a, b) 11.548 14.7545 17.29930 17.107 19.245 36.779 100 a
> arma_setdiff(a, b) 60.727 65.0040 71.77804 66.714 72.702 138.133 100 b
The Questioner might have already got the answer. However, the following template version may be more general. This is equivalent to setdiff function in Matlab
If P and Q are two sets, then their difference is given by P - Q or Q - P. If P = {1, 2, 3, 4} and Q = {4, 5, 6}, P - Q means elements of P which are not in Q. i.e., in the above example P - Q = {1, 2, 3}.
/* setdiff(t1, t2) is similar to setdiff() function in MATLAB. It removes the common elements and
gives the uncommon elements in the vectors t1 and t2. */
template <typename T>
T setdiff(T t1, T t2)
{
int size_of_t1 = size(t1);
int size_of_t2 = size(t2);
T Intersection_Elements;
uvec iA, iB;
intersect(Intersection_Elements, iA, iB, t1, t2);
for (int i = 0; i < size(iA); i++)
{
t1(iA(i)) = 0;
}
for (int i = 0; i < size(iB); i++)
{
t2(iB(i)) = 0;
}
T t1_t2_vec(size_of_t1 + size_of_t2);
t1_t2_vec = join_vert(t1, t2);
T DiffVec = nonzeros(t1_t2_vec);
return DiffVec;
}
Any suggestions for improving the performance of the algorithm are welcome.