Suppose I have a data.frame as follows:
set.seed(45)
DF <- data.frame(x=1:10, strata2013=sample(letters[1:3], 10, TRUE))
x strata2013
1 1 b
2 2 a
3 3 a
4 4 b
5 5 b
6 6 a
7 7 a
8 8 b
9 9 a
10 10 a
I'd like to get the counts for each unique value in the column strata2013. Using data.table (for speed), one could do it in this manner:
DT <- as.data.table(DF)
DT[, .N, by=strata2013]
strata2013 N
1: b 4
2: a 6
Now, I'd like to try and accomplish this in Rcpp, as a learning exercise. I've written and tried out the code shown below which is supposed to provide the same output, but instead it gives me an error. Here's the code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector LengthStrata (CharacterVector uniqueStrata, DataFrame dataset ) {
int n = uniqueStrata.size();
NumericVector Nh(n);
Rcpp::CharacterVector strata=dataset["strate2013"];
for (int i = 0; i < n; ++i) {
Nh[i]=strata(uniqueStrata(i)).size();
}
return Nh;
}
Here is the error message:
conversion from 'Rcpp::Vector<16>::Proxy {aka Rcpp::internal::string_proxy<16>}'
to 'const size_t { aka const long long unsigned int}' is ambiguous
What am I doing wrong? Thank you very much for your help.
If I understand correctly, you're hoping that strata( uniqueStrata(i) ) will subset the vector, similar to how R's subsetting operates. This is unfortunately not the case; you would have to perform the subsetting 'by hand'. Rcpp doesn't have 'generic' subsetting operators available yet.
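To make the 'by hand' subsetting concrete, here is a minimal sketch in plain C++ (std::vector stands in for the Rcpp types, and the name length_strata and the two-vector interface are just for illustration):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Count how many times each value in uniqueStrata occurs in strata --
// the "by hand" equivalent of R's length(strata[strata == uniqueStrata[i]]).
std::vector<int> length_strata(const std::vector<std::string>& uniqueStrata,
                               const std::vector<std::string>& strata) {
    std::vector<int> counts(uniqueStrata.size(), 0);
    for (std::size_t i = 0; i < uniqueStrata.size(); ++i) {
        for (std::size_t j = 0; j < strata.size(); ++j) {
            if (strata[j] == uniqueStrata[i]) ++counts[i];
        }
    }
    return counts;
}
```

This is O(n * k) for k unique values; the std::map approach below avoids the inner loop entirely.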
When it comes to using Rcpp, you really want to leverage the C++ standard library where possible. The de-facto C++ way of generating these counts would be to use a std::map (or std::unordered_map, if you can assume C++11), with something like the following. I include a benchmark for interest.
Note from Dirk: unordered_map is actually available from tr1 for pre-C++11, so one can include it using e.g. #include <tr1/unordered_map>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
Rcpp::CharacterVector strata = dataset["strata2013"];
int n = strata.size();
std::map<SEXP, int> counts;
for (int i = 0; i < n; ++i) {
++counts[ strata[i] ];
}
return wrap(counts);
}
/*** R
library(data.table)
library(microbenchmark)
set.seed(45)
DF <- data.frame(strata2013=sample(letters, 1E5, TRUE))
DT <- data.table(DF)
LengthStrata(DF)
DT[, .N, by=strata2013]
microbenchmark(
LengthStrata(DF),
DT[, .N, by=strata2013]
)
*/
gives me
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 3.267131 3.831563 3.934992 4.101050 11.491939 100
DT[, .N, by = strata2013] 1.980896 2.360590 2.480884 2.687771 3.052583 100
The Rcpp solution is slower in this case likely due to the time it takes to move R objects to and from the C++ containers, but hopefully this is instructive.
Aside: This is, in fact, already included in Rcpp as the sugar table function, so if you want to skip the learning experience, you can use a pre-baked solution as
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
Rcpp::CharacterVector strata = dataset["strata2013"];
return table(strata);
}
Sugar improves the speed of the Rcpp function relative to data.table (these timings are from a separate run):
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 5.548094 5.870184 6.014002 6.448235 6.922062 100
DT[, .N, by = strata2013] 6.526993 7.136290 7.462661 7.949543 81.233216 100
I am not sure I understand what you are trying to do. And when strata is a vector
Rcpp::CharacterVector strata=df["strate2013"];
then I am not sure what
strata(uniqueStrata(i)).size()
is supposed to do. Maybe you could describe in words (or in R with some example code and data) what you are trying to do here.
I was wondering if there was an Rcpp function which takes an Rcpp::String data type as input and returns a given character (by index) of the string. For example, extracting the character at index 0 of the string. This would be equivalent to the string::at method from the string header in C++. I have written this:
#include <vector>
#include <string>
#include <Rcpp.h>
using namespace Rcpp;
typedef std::vector<std::string> stringList;
int SplitGenotypesA(std::string s) {
char a = s.at(0);
int b = a - '0';
return b;
}
But would prefer not to have to convert between Rcpp::String and std::string types.
You can feed an R vector of strings directly to C++ using Rcpp::StringVector. This will obviously handle single elements too.
Getting the nth character of the ith element of your vector is as simple as vector[i][n].
So, without using std::string you can do this:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector SplitGenotypesA(Rcpp::StringVector R_character_vector)
{
int number_of_strings = R_character_vector.size();
Rcpp::NumericVector result(number_of_strings);
for(int i = 0; i < number_of_strings; ++i)
{
char a = R_character_vector[i][0];
result[i] = a - '0';
}
return result;
}
Now in R you can do:
SplitGenotypesA("9C")
# [1] 9
or better yet,
SplitGenotypesA(c("1A", "2B", "9C"))
# [1] 1 2 9
Which is even a little faster than the native R method of doing the same thing:
microbenchmark::microbenchmark(
R_method = as.numeric(substr(c("1A", "2B", "9C"), 1, 1)),
Rcpp_method = SplitGenotypesA(c("1A", "2B", "9C")),
times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# R_method 3.422 3.765 4.076722 4.107 4.108 46.881 1000
# Rcpp_method 3.080 3.423 3.718779 3.765 3.765 32.509 1000
I was inspired by the fst package to try to write a C++ function to quickly serialize some data structures I have in R to disk.
But I am having trouble achieving the same write speed even on very simple objects. The code below is a simple example of writing a large 1 GB vector to disk.
Using custom C++ code, I achieve a write speed of 135 MB/s, which is the limit of my disk according to CrystalBench.
On the same data, write_fst achieves a write speed of 223 MB/s, which seems impossible since my disk can't write that fast. (Note, I am using fst::threads_fst(1) and compress=0 settings, and the files have the same data size.)
What am I missing?
How can I get the C++ function to write to disk faster?
C++ Code:
#include <Rcpp.h>
#include <fstream>
#include <cstring>
#include <iostream>
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;
// [[Rcpp::export]]
void test(SEXP x) {
char* d = reinterpret_cast<char*>(REAL(x));
long dl = Rf_xlength(x) * 8;
std::ofstream OutFile;
OutFile.open("/tmp/test.raw", std::ios::out | std::ios::binary);
OutFile.write(d, dl);
OutFile.close();
}
R Code:
library(microbenchmark)
library(Rcpp)
library(dplyr)
library(fst)
fst::threads_fst(1)
sourceCpp("test.cpp")
x <- runif(134217728) # 1 gigabyte
df <- data.frame(x)
microbenchmark(test(x), write_fst(df, "/tmp/test.fst", compress=0), times=3)
Unit: seconds
expr min lq mean median uq max neval
test(x) 6.549581 7.262408 7.559021 7.975235 8.063740 8.152246 3
write_fst(df, "/tmp/test.fst", compress = 0) 4.548579 4.570346 4.592398 4.592114 4.614307 4.636501 3
file.info("/tmp/test.fst")$size/1e6
# [1] 1073.742
file.info("/tmp/test.raw")$size/1e6
# [1] 1073.742
Benchmarking SSD write and read performance is a tricky business and hard to do right. There are many effects to take into account.
For example, many SSDs use techniques to accelerate data speeds (intelligently), such as DRAM caching. Those techniques can increase your write speed, especially in cases where an identical dataset is written to disk multiple times, as in your example. To avoid this effect, each iteration of the benchmark should write a unique dataset to disk.
The block sizes of write and read operations are also important: the default physical sector size of SSDs is 4KB. Writing smaller blocks hampers performance, but with fst I found that writing blocks of data larger than a few MB also lowers performance, due to CPU cache effects. Because fst writes its data to disk in relatively small chunks, it's usually faster than alternatives that write data in a single large block.
To facilitate this block-wise writing to SSD, you could modify your code:
Rcpp::cppFunction('
#include <fstream>
#include <cstring>
#include <iostream>
#define BLOCKSIZE 262144 // 2^18 bytes per block
long test_blocks(SEXP x, Rcpp::String path) {
char* d = reinterpret_cast<char*>(REAL(x));
std::ofstream outfile;
outfile.open(path.get_cstring(), std::ios::out | std::ios::binary);
long dl = Rf_xlength(x) * 8;
long nr_of_blocks = dl / BLOCKSIZE;
for (long block_nr = 0; block_nr < nr_of_blocks; block_nr++) {
outfile.write(&d[block_nr * BLOCKSIZE], BLOCKSIZE);
}
long remaining_bytes = dl % BLOCKSIZE;
outfile.write(&d[nr_of_blocks * BLOCKSIZE], remaining_bytes);
outfile.close();
return dl;
}
')
Now we can compare the methods test (modified here to take a file path argument), test_blocks and fst::write_fst in a single benchmark:
x <- runif(134217728) # 1 gigabyte
df <- data.frame(X = x)
fst::threads_fst(1) # use fst in single threaded mode
microbenchmark::microbenchmark(
test(x, "test.bin"),
test_blocks(x, "test.bin"),
fst::write_fst(df, "test.fst", compress = 0),
times = 10)
#> Unit: seconds
#> expr min lq mean
#> test(x, "test.bin") 1.473615 1.506019 1.590430
#> test_blocks(x, "test.bin") 1.018082 1.062673 1.134956
#> fst::write_fst(df, "test.fst", compress = 0) 1.127446 1.144039 1.249864
#> median uq max neval
#> 1.600055 1.635883 1.765512 10
#> 1.131631 1.204373 1.264220 10
#> 1.261269 1.327304 1.343248 10
As you can see, the modified method test_blocks is about 40 percent faster than the original method and even slightly faster than the fst package. This is expected, because fst has some overhead in storing column and table information, (possible) attributes, hashes and compression information.
Please note that the difference between fst and your initial test method is much less pronounced on my system, showing again the challenges in using benchmarks to optimize a system.
The problem is not of a practical nature and I'm only looking for a sound explanation of the observed occurrence. I'm reading Seamless R and C++ Integration with Rcpp (Use R!) by Dirk Eddelbuettel. Following the introduction, I'm looking at two simple "Fibonacci functions".
In RStudio I have a cpp file of the following structure
fib_fun.cpp
#include <Rcpp.h>
// [[Rcpp::export]]
int fibonacci(const int x) {
if (x < 2)
return x;
else
return fibonacci(x - 1) + fibonacci(x - 2);
}
/*** R
# Call the fib function defined in R
fibonacci(10)
*/
I also have an inline implementation of the same function:
inline_fib.R
# Inline fib implementation
incltxt <- "int fibonacci(const int x) {
if (x == 0) return(0);
if (x == 1) return(1);
return fibonacci(x - 1) + fibonacci(x - 2);
}"
# Inline call
require(inline)
fibRcpp <- cxxfunction(signature(xs = "int"), plugin = "Rcpp",
includes = incltxt,
body = "int x = Rcpp::as<int>(xs);
return Rcpp::wrap(fibonacci(x));")
When I benchmark the functions I get the following results:
> microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
fibonacci(10) 3.121 3.198 5.5192 3.447 3.886 23.491 10
fibRcpp(10) 1.176 1.398 3.9520 1.558 1.709 25.721 10
Questions
I would like to understand why there is such a significant difference in performance between the two functions.
With respect to the practicalities surrounding the use of Rcpp, what is generally considered to be good practice? In my naivety, my first hunch would be to write a function and source it via sourceCpp, but this solution appears to be much slower.
Benchmarking code
require(microbenchmark); require(Rcpp); require(inline)
sourceCpp("fib_fun.cpp"); source("inline_fib.R")
microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Comment replies
I tried the functions with the unsigned int instead of the int, results:
Unit: microseconds
expr min lq mean median uq max neval
fibonacci(10) 2.908 2.992 5.0369 3.267 3.598 20.291 10
fibRcpp(10) 1.201 1.263 6.3523 1.424 1.639 50.536 10
All good comments above.
The function is way too lightweight at x=10 and you need to call way more often than times=10 to find anything meaningful. You are measuring noise.
As for style, most of us prefer fibonacci() via Rcpp Attributes...
I have a recursive function that prints some nodes in a tree as integer ids. After exporting the function to R, I cannot use the cout output for anything (or so it seems). What would be ideal is if (1) I could return the output as a vector, or (2) parse the cout output inside R without losing too much speed.
I would insert some code here but my function is particularly generic. Essentially I'm trying to return, say, the Fibonacci sequence as a vector instead of a sum but through a recursive function without using global or static variables.
For example, fib(6) would return inside R as:
[1] 0 1 1 2 3 5
So one could,
y <- fib(6)
y[4] and y[4:5] would return respectively,
[1] 2
[1] 2 3
Thanks in advance for insights and ideas in problem solving. Using a static variable was as far as I got on my own.
I discuss this problem at length, with different hashing and memoization implementations in both R and C++, in chapter one of the Rcpp book.
You should read the online book http://adv-r.had.co.nz/, especially the memoization part, where your question is partly answered: http://adv-r.had.co.nz/Function-operators.html:
Just add the function fib3 such as:
library(memoise)
fib2 <- memoise(function(n) {
if (n < 2) return(1)
fib2(n - 2) + fib2(n - 1)
})
fib3 <- memoise(function(n) sapply(1:n, fib2))
#> fib3(6)
#[1] 1 2 3 5 8 13
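The same memoization idea carries over to C++ directly. Below is a minimal sketch using a static std::unordered_map as the cache (this is one illustrative variant, not a quote of the Rcpp book's implementation); note it uses the 0-indexed convention fib(0)=0, fib(1)=1, unlike the memoise example above:

```cpp
#include <cassert>
#include <unordered_map>

// Memoized Fibonacci: previously computed values are cached, so each n
// is computed only once, turning exponential recursion into O(n).
long long fib_memo(int n) {
    static std::unordered_map<int, long long> cache;
    if (n < 2) return n;
    auto it = cache.find(n);
    if (it != cache.end()) return it->second;
    long long val = fib_memo(n - 1) + fib_memo(n - 2);
    cache[n] = val;
    return val;
}
```

Without the cache, fib_memo(40) would take billions of calls; with it, the whole sequence up to 40 is filled in linear time.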
Just for fun, a slightly more involved approach that uses std::generate_n and a function object (fseq) in lieu of sapply:
#include <Rcpp.h>
struct fseq {
public:
fseq() {
current = 0;
}
int operator()() {
int val = fib(current);
current++;
return val;
}
int fib(int n) {
if (n==0) return 0;
if (n==1) return 1;
return fib(n-2) + fib(n-1);
}
private:
int current;
};
// [[Rcpp::export(".fib")]]
int fib(int n) {
if (n==0) return 0;
if (n==1) return 1;
return fib(n-2) + fib(n-1);
}
// [[Rcpp::export]]
std::vector<int> fib_seq(const int n) {
if (n < 1) throw std::invalid_argument("n must be >= 1");
std::vector<int> seq;
seq.reserve(n);
std::generate_n(std::back_inserter(seq), n, fseq());
return seq;
}
library(microbenchmark)
##
.fib_seq <- function(n) sapply(0:(n-1), .fib)
##
R> fib_seq(6)
[1] 0 1 1 2 3 5
R> all.equal(fib_seq(6), .fib_seq(6))
[1] TRUE
##
R> microbenchmark(
fib_seq(6),.fib_seq(6),
times=1000L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(6) 1.561 1.9015 3.287824 2.108 2.3430 1046.021 1000
.fib_seq(6) 27.239 29.0615 35.538355 30.290 32.8065 1108.266 1000
R> microbenchmark(
fib_seq(15),.fib_seq(15),
times=100L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(15) 6.108 6.5875 7.46431 7.0795 7.7590 20.391 100
.fib_seq(15) 57.243 60.7195 72.97281 63.8120 73.4045 231.707 100
R> microbenchmark(
fib_seq(28),.fib_seq(28),
times=100L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(28) 2134.861 2143.489 2222.018 2167.364 2219.400 2650.854 100
.fib_seq(28) 3705.492 3721.586 3871.314 3745.956 3852.516 5040.827 100
Note that these functions were parametrized to reflect your statement
For example, fib(6) would return inside R as:
[1] 0 1 1 2 3 5
I have this line of R code:
croppedDNA <- completeDNA[,apply(completeDNA,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))]
What it does is identify the sites (columns) in a matrix of DNA sequences (1 row = one sequence) that are not universal, i.e. the informative sites, and subset them from the matrix to make a new 'cropped' matrix: in other words, get rid of all the columns in which every value is the same. For a big dataset this takes about 6 seconds. I don't know if I can do it faster in C++ (I'm still a beginner in C++), but it will be good for me to try. My idea is to use Rcpp, loop through the columns of the CharacterMatrix, pull out each column (site) as a CharacterVector, and check whether its values are all the same. If they are, record that column number/index, and continue for all columns. Then at the end make a new CharacterMatrix that only includes those columns. It is important that I keep the rownames and column names as they are in the "R version" of the matrix, i.e. if a column goes, so should its colname.
I've been writing for about two minutes, so far what I have is (not finished):
#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterMatrix reduce_sequences(CharacterMatrix completeDNA)
{
std::vector<bool> informativeSites;
for(int i = 0; i < completeDNA.ncol(); i++)
{
CharacterVector bpsite = completeDNA(,i);
if(all(bpsite == bpsite[1])
{
informativeSites.push_back(i);
}
}
CharacterMatrix cutDNA = completeDNA(,informativeSites);
return cutDNA;
}
Am I going about this the right way? Is there an easier way? My understanding is that I need std::vector because it's easy to grow (since I don't know in advance how many columns I am going to want to keep). With the indexing, will I need to add 1 to the informativeSites vector at the end (because R indexes from 1 and C++ from 0)?
Thanks,
Ben W.
Sample data:
set.seed(123)
z <- matrix(sample(c("a", "t", "c", "g", "N", "-"), 3*398508, TRUE), 3, 398508)
OP's solution:
system.time(y1 <- z[,apply(z,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))])
# user system elapsed
# 4.929 0.043 4.976
A faster version using base R:
system.time(y2 <- (z[, colSums(z[-1,] != z[-nrow(z), ]) > 0]))
# user system elapsed
# 0.087 0.011 0.098
The results are identical:
identical(y1, y2)
# [1] TRUE
It's very possible C++ will beat it, but is it really necessary?
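If one does want to try C++, the column scan is straightforward even without Rcpp. Here is a hedged sketch with nested std::vector<std::string> standing in for CharacterMatrix (row-major layout; no dimnames handling, which the real Rcpp version would need to carry over):

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef std::vector<std::vector<std::string> > StrMatrix;  // rows of columns

// Keep only the columns in which not all entries are identical
// (the "informative" sites); drop the universal ones.
StrMatrix reduce_sequences(const StrMatrix& m) {
    std::size_t nrow = m.size(), ncol = m[0].size();
    std::vector<std::size_t> keep;
    for (std::size_t j = 0; j < ncol; ++j) {
        for (std::size_t i = 1; i < nrow; ++i) {
            if (m[i][j] != m[0][j]) {   // column varies: keep it
                keep.push_back(j);
                break;                  // early exit, like any()
            }
        }
    }
    StrMatrix out(nrow);
    for (std::size_t i = 0; i < nrow; ++i)
        for (std::size_t k = 0; k < keep.size(); ++k)
            out[i].push_back(m[i][keep[k]]);
    return out;
}
```

Note the early break once a single mismatch is found; the colSums approach above always compares every pair of adjacent rows, so for data with many informative sites the C++ scan can come out ahead.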