I would like to collapse the rows of a transposed NumericMatrix using Rcpp. For instance:
library("data.table")
library("Rcpp")
dt1 <- data.table(V1=c(1, 0, 2),
V2=c(1, 1, 0),
V3=c(1, 0, 1),
V4=c(0, 1, 2),
V5=c(1, 1, 1))
cppFunction('NumericMatrix transpose(DataFrame data) {
NumericMatrix genotypes = internal::convert_using_rfunction(data, "as.matrix");
NumericMatrix tgeno(data.ncol(), data.nrow());
int number_samples = data.ncol();
int number_snps = data.nrow();
for (int i = 0; i < number_snps; i++) {
for (int j = 0; j < number_samples; j++) {
tgeno(j,i) = genotypes(i,j);
}
}
return tgeno;
}')
dt1
transpose(dt1)
Original Matrix
V1 V2 V3 V4 V5
1: 1 1 1 0 1
2: 0 1 0 1 1
3: 2 0 1 2 1
Transposed Matrix
[,1] [,2] [,3]
[1,] 1 0 2
[2,] 1 1 0
[3,] 1 0 1
[4,] 0 1 2
[5,] 1 1 1
I would like to have the following matrix:
[,1]
[1,] 102
[2,] 110
[3,] 101
[4,] 012
[5,] 111
Could anyone suggest a way to do this?
Maybe as a starting point, assuming that the numbers you concatenate consist only of a single digit:
//' @export
// [[Rcpp::export]]
std::vector<std::string> string_collapse(const Rcpp::DataFrame& data)
{
R_xlen_t nrow = data.nrow();
R_xlen_t ncol = data.ncol();
std::vector<std::string> ret(ncol);
for (R_xlen_t j = 0; j < ncol; ++j) {
const auto& col = Rcpp::as<Rcpp::NumericVector>(data[j]);
std::string ccstr;
ccstr.reserve(nrow);
for (const auto& chr: col) {
ccstr += std::to_string(chr)[0];
}
ret[j] = ccstr;
}
return ret;
}
It gives
dat <- data.frame(V1=c(1, 0, 2),
V2=c(1, 1, 0),
V3=c(1, 0, 1),
V4=c(0, 1, 2),
V5=c(1, 1, 1))
string_collapse(dat)
[1] "102" "110" "101" "012" "111"
But a quick benchmark comparing it to a pure R-solution suggests that you should not expect miracles. Probably there is still room for optimization.
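The single-digit assumption can also be checked explicitly instead of silently taking the first character. Below is a minimal sketch in plain C++ (no Rcpp), where `collapse_columns` and its nested-vector input are hypothetical stand-ins for the data frame columns, not part of the answer above:

```cpp
#include <cmath>
#include <stdexcept>
#include <string>
#include <vector>

// Collapse each column of single-digit values into one string.
// `columns` is a hypothetical stand-in for the columns of the data frame.
std::vector<std::string> collapse_columns(
    const std::vector<std::vector<double>>& columns) {
  std::vector<std::string> out;
  out.reserve(columns.size());
  for (const auto& col : columns) {
    std::string s;
    s.reserve(col.size());
    for (double v : col) {
      // Reject anything that is not an integer in 0..9 (this also rejects NaN).
      if (!(v >= 0.0 && v <= 9.0) || v != std::floor(v)) {
        throw std::invalid_argument("expected a single digit 0-9");
      }
      s.push_back(static_cast<char>('0' + static_cast<int>(v)));
    }
    out.push_back(s);
  }
  return out;
}
```

Appending one character per value avoids building a temporary string with std::to_string for every cell, which may or may not matter for your data sizes.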
Once you have transposed the matrix (for example with the transpose() function from the question), you can collapse its rows as follows:
matrix(apply(transpose(dt1), 1, paste0, collapse = ""), ncol = 1)
I can select all the rows of a matrix and a range of columns of a matrix as follows:
library(Rcpp)
cppFunction('
NumericMatrix subset(NumericMatrix x){
return x(_, Range(0, 1));
}
')
However, I would like to select columns based on a NumericVector y which, for instance, could be something like c(0, 1, 0, 0, 1). I tried this:
library(Rcpp)
cppFunction('
NumericMatrix subset(NumericMatrix x, NumericVector y){
return x(_, y);
}
')
but it doesn't compile. How do I do it?
Alas, Rcpp doesn't have great support for non-contiguous views, such as selecting only columns 1 and 4 in a single statement. As you saw, a contiguous range of columns is accessible with Rcpp::Range(), and all rows or columns with Rcpp::_. You'll likely want to upgrade to RcppArmadillo for better control over matrix subsets.
RcppArmadillo subset examples
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat matrix_subset_idx(const arma::mat& x,
const arma::uvec& y) {
// Each element of y must be an integer index between 0 and (number of columns - 1).
// Allows for repeated draws from same columns.
return x.cols( y );
}
// [[Rcpp::export]]
arma::mat matrix_subset_logical(const arma::mat& x,
const arma::vec& y) {
// Assumes that y is 0/1 coded.
// find() returns the integer indices at which y equals 1.
return x.cols( arma::find(y == 1) );
}
Test
# Sample data
x = matrix(1:15, ncol = 5)
x
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 4 7 10 13
# [2,] 2 5 8 11 14
# [3,] 3 6 9 12 15
# Subset only when 1 (TRUE) is found:
matrix_subset_logical(x, c(0, 1, 0, 0, 1))
# [,1] [,2]
# [1,] 4 13
# [2,] 5 14
# [3,] 6 15
# Subset with an index representing the location
# Note: C++ indices start at 0 not 1!
matrix_subset_idx(x, c(1, 3))
# [,1] [,2]
# [1,] 4 10
# [2,] 5 11
# [3,] 6 12
Pure Rcpp logic
If you do not want to take on the dependency of armadillo, then the equivalent for the matrix subset in Rcpp is:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericMatrix matrix_subset_idx_rcpp(
Rcpp::NumericMatrix x, Rcpp::IntegerVector y) {
// Number of columns to select
int n_cols_out = y.size();
// Create an output matrix
Rcpp::NumericMatrix out = Rcpp::no_init(x.nrow(), n_cols_out);
// Loop through each requested column and copy the data.
for(int z = 0; z < n_cols_out; ++z) {
out(Rcpp::_, z) = x(Rcpp::_, y[z]);
}
return out;
}
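The pure Rcpp version takes 0-based indices, while the question starts from a 0/1 vector such as c(0, 1, 0, 0, 1). A minimal sketch of that conversion in plain C++ follows; `mask_to_indices` is a hypothetical helper name, and it plays roughly the role that arma::find(y == 1) plays in the Armadillo version:

```cpp
#include <cstddef>
#include <vector>

// Turn a 0/1 mask such as c(0, 1, 0, 0, 1) into 0-based column indices.
// Hypothetical helper, written without Rcpp so it compiles on its own.
std::vector<int> mask_to_indices(const std::vector<double>& mask) {
  std::vector<int> idx;
  for (std::size_t j = 0; j < mask.size(); ++j) {
    // Keep position j only where the mask is exactly 1.
    if (mask[j] == 1.0) idx.push_back(static_cast<int>(j));
  }
  return idx;
}
```

Inside an Rcpp function the same loop could run over a NumericVector and fill an IntegerVector that is then passed to matrix_subset_idx_rcpp.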
I made a first stab at an Rcpp function via inline and it solved my speed problem (thanks Dirk!):
Replace negative values by zero
The initial version looked like this:
library(inline)
cpp_if_src <- '
Rcpp::NumericVector xa(a);
int n_xa = xa.size();
for(int i=0; i < n_xa; i++) {
if(xa[i]<0) xa[i] = 0;
}
return xa;
'
cpp_if <- cxxfunction(signature(a="numeric"), cpp_if_src, plugin="Rcpp")
But when called as cpp_if(p), it overwrote p with the output, which was not intended. So I assumed it was passing by reference.
So I fixed it with the following version:
library(inline)
cpp_if_src <- '
Rcpp::NumericVector xa(a);
int n_xa = xa.size();
Rcpp::NumericVector xr(a);
for(int i=0; i < n_xa; i++) {
if(xr[i]<0) xr[i] = 0;
}
return xr;
'
cpp_if <- cxxfunction(signature(a="numeric"), cpp_if_src, plugin="Rcpp")
Which seemed to work. But now the original version doesn't overwrite its input anymore when I re-load it into R (i.e. the same exact code now doesn't overwrite its input):
> cpp_if_src <- '
+ Rcpp::NumericVector xa(a);
+ int n_xa = xa.size();
+ for(int i=0; i < n_xa; i++) {
+ if(xa[i]<0) xa[i] = 0;
+ }
+ return xa;
+ '
> cpp_if <- cxxfunction(signature(a="numeric"), cpp_if_src, plugin="Rcpp")
>
> p
[1] -5 -4 -3 -2 -1 0 1 2 3 4 5
> cpp_if(p)
[1] 0 0 0 0 0 0 1 2 3 4 5
> p
[1] -5 -4 -3 -2 -1 0 1 2 3 4 5
I'm not the only one who has tried to replicate this behavior and found inconsistent results:
https://chat.stackoverflow.com/transcript/message/4357344#4357344
What's going on here?
The key is the 'proxy model' -- your xa really is the same memory location as your original object, so you end up changing your original.
If you don't want that, you should do one thing: make a (deep) copy using the clone() method, or explicitly create a new object into which the altered values get written. Your second version does neither; it simply uses two differently named variables which are both "pointers" (in the proxy model sense) to the original variable.
An additional complication, though, is in implicit cast and copy when you pass an int vector (from R) to a NumericVector type: that creates a copy, and then the original no longer gets altered.
Here is a more explicit example, similar to one I use in the tutorials or workshops:
library(inline)
f1 <- cxxfunction(signature(a="numeric"), plugin="Rcpp", body='
Rcpp::NumericVector xa(a);
int n = xa.size();
for(int i=0; i < n; i++) {
if(xa[i]<0) xa[i] = 0;
}
return xa;
')
f2 <- cxxfunction(signature(a="numeric"), plugin="Rcpp", body='
Rcpp::NumericVector xa(a);
int n = xa.size();
Rcpp::NumericVector xr(a); // still points to a
for(int i=0; i < n; i++) {
if(xr[i]<0) xr[i] = 0;
}
return xr;
')
p <- seq(-2,2)
print(class(p))
print(cbind(f1(p), p))
print(cbind(f2(p), p))
p <- as.numeric(seq(-2,2))
print(class(p))
print(cbind(f1(p), p))
print(cbind(f2(p), p))
and this is what I see:
edd@max:~/svn/rcpp/pkg$ r /tmp/ari.r
Loading required package: methods
[1] "integer"
p
[1,] 0 -2
[2,] 0 -1
[3,] 0 0
[4,] 1 1
[5,] 2 2
p
[1,] 0 -2
[2,] 0 -1
[3,] 0 0
[4,] 1 1
[5,] 2 2
[1] "numeric"
p
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 1
[5,] 2 2
p
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 1
[5,] 2 2
edd@max:~/svn/rcpp/pkg$
So it really matters whether you pass int-to-float or float-to-float.
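The difference between a second handle and a deep copy can be illustrated outside of Rcpp. The sketch below is only an analogy in plain C++: `Handle`, `deep_copy` and `zero_negatives` are hypothetical names, with `Handle` standing in for a proxy object such as NumericVector and `deep_copy` playing the role of clone():

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for a proxy type: copying the handle shares the data.
struct Handle {
  std::shared_ptr<std::vector<double>> data;
};

// Deep copy: allocates new storage, playing the role of clone().
Handle deep_copy(const Handle& h) {
  return Handle{std::make_shared<std::vector<double>>(*h.data)};
}

// Replace negative values by zero in whatever storage the handle refers to.
// The handle is passed by value, yet the shared storage is still modified.
void zero_negatives(Handle h) {
  for (double& v : *h.data) {
    if (v < 0) v = 0;
  }
}
```

Calling zero_negatives(p) changes the data behind p even though the handle itself is copied, whereas zero_negatives(deep_copy(p)) leaves p untouched.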
I am a beginner in RStudio, so hopefully someone can help me with this problem. I want to use an if/else inside a loop. I wrote the following code for an l times m matrix:
for (i in 1:l){
for (j in 1:m){
if (is.na(quantilereturns[i,j]) < quantile(quantilereturns[,j], c(.1), na.rm=TRUE)) {
quantilereturns[i,j]
} else { (0) }
}
}
Summary: I want to make a matrix with values that are smaller than the quantile of a certain vector in the matrix quantilereturns. So when they are smaller than the 10% quantile they get their original value otherwise it will be a zero.
The code doesn't give any errors, but it doesn't change the values in the matrix either.
Can someone help me?
You need to assign the result to a cell of the matrix. I will take the matrix of a recent other thread as an example:
a <- c(4, -9, 2)
b <- c(-1, 3, -8)
c <- c(5, 2, 6)
d <- c(7, 9, -2)
matrix <- cbind(a,b,c,d)
d <- dim(matrix)
rows <- d[1]
columns <- d[2]
print("Before")
print(matrix)
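# Compute each column's 10% quantile once, before any cell is modified
q <- apply(matrix, 2, quantile, probs = 0.1, na.rm = TRUE)
for (i in 1:rows) {
for (j in 1:columns) {
# Compare the value itself (not is.na(...)) against the quantile
if (!is.na(matrix[i,j]) && matrix[i,j] >= q[j]) {
matrix[i,j] <- 0
}
}
}
print("After")
print(matrix)
this gives
[1] "Before"
a b c d
[1,] 4 -1 5 7
[2,] -9 3 2 9
[3,] 2 -8 6 -2
[1] "After"
a b c d
[1,] 0 0 0 0
[2,] -9 0 2 0
[3,] 0 -8 0 -2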
So the essential line you are looking for is matrix[i,j] <- 0
Suppose I have a matrix whose entries are only 0 and 1, e.g.
set.seed(123)
m <- matrix( sample(0:1, 10, TRUE), nrow=5 )
with sample output:
[,1] [,2]
[1,] 0 0
[2,] 1 1
[3,] 0 1
[4,] 1 1
[5,] 1 0
The matrix will have at most 20 columns, and will have many rows.
I want a function, let's call it rowCounts, that returns:
The number of times a particular row appears in the matrix, and
The index of the first occurrence of that row.
How might I solve this problem?
Building on Kevin's answer, here is a C++11 version using a slightly different approach:
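#include <Rcpp.h>
#include <algorithm>
#include <unordered_map>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
List rowCounts_2(IntegerMatrix x) {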
int n = x.nrow() ;
int nc = x.ncol() ;
std::vector<int> hashes(n) ;
for( int k=0, pow=1; k<nc; k++, pow*=2){
IntegerMatrix::Column column = x.column(k) ;
std::transform( column.begin(), column.end(), hashes.begin(), hashes.begin(), [=]( int v, int h ){
return h + pow*v ;
}) ;
}
using Pair = std::pair<int,int> ;
std::unordered_map<int, Pair> map_counts ;
for( int i=0; i<n; i++){
Pair& p = map_counts[ hashes[i] ] ;
if( p.first == 0){
p.first = i+1 ; // using directly 1-based index
}
p.second++ ;
}
int nres = map_counts.size() ;
IntegerVector idx(nres), counts(nres) ;
auto it=map_counts.begin() ;
for( int i=0; i<nres; i++, ++it){
idx[i] = it->second.first ;
counts[i] = it->second.second ;
}
return List::create( _["counts"] = counts, _["idx"] = idx );
}
The idea is to trade memory for speed. The first change is that I'm allocating and filling a std::vector<int> to host the hashes. Doing this allows me to traverse the input matrix column by column, which is more efficient because R stores matrices in column-major order.
Once this is done, I'm building a hash map, std::unordered_map<int, std::pair<int,int>>. The key of the map is the hash, the value is a pair (index, count).
Then I just have to traverse the hash map and collect the results. The results don't appear in ascending order of idx (it is easy to do it if we really want that).
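If ascending order of idx is wanted, the collected pairs can be sorted at the end. Here is a minimal sketch in plain C++ (no Rcpp), assuming the matrix is given as a column-major std::vector<int> of 0/1 values with at most 20 columns; `RowCount` and `row_counts_sorted` are hypothetical names, not part of the code above:

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

struct RowCount {
  int idx;    // 1-based index of the first occurrence
  int count;  // number of occurrences
};

// `x` holds an nrow x ncol matrix of 0/1 values in column-major order.
std::vector<RowCount> row_counts_sorted(const std::vector<int>& x,
                                        int nrow, int ncol) {
  // Accumulate the hashes column by column, as in rowCounts_2.
  std::vector<int> hashes(nrow, 0);
  for (int k = 0, weight = 1; k < ncol; ++k, weight *= 2) {
    for (int i = 0; i < nrow; ++i) hashes[i] += weight * x[k * nrow + i];
  }
  // Map each hash to (first index, count); new entries start at {0, 0}.
  std::unordered_map<int, RowCount> seen;
  for (int i = 0; i < nrow; ++i) {
    RowCount& rc = seen[hashes[i]];
    if (rc.idx == 0) rc.idx = i + 1;
    ++rc.count;
  }
  // Collect the entries and sort them by first occurrence.
  std::vector<RowCount> out;
  out.reserve(seen.size());
  for (const auto& kv : seen) out.push_back(kv.second);
  std::sort(out.begin(), out.end(),
            [](const RowCount& a, const RowCount& b) { return a.idx < b.idx; });
  return out;
}
```

The sort only touches the unique rows, so its cost should be small next to the pass over the matrix.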
I get these results for n=1e5 and n=1e7.
> m <- matrix(sample(0:1, 1e+05, TRUE), ncol = 10)
> microbenchmark(rowCounts(m), rowCountsR(m), rowCounts_2(m))
Unit: microseconds
expr min lq median uq max neval
rowCounts(m) 1194.536 1201.273 1213.1450 1231.7295 1286.458 100
rowCountsR(m) 575.004 933.637 962.8720 981.6015 23678.451 100
rowCounts_2(m) 421.744 429.118 442.5095 455.2510 530.261 100
> m <- matrix(sample(0:1, 1e+07, TRUE), ncol = 10)
> microbenchmark(rowCounts(m), rowCountsR(m), rowCounts_2(m))
Unit: milliseconds
expr min lq median uq max neval
rowCounts(m) 97.22727 98.02716 98.56641 100.42262 102.07661 100
rowCountsR(m) 57.44635 59.46188 69.34481 73.89541 100.43032 100
rowCounts_2(m) 22.95741 23.38186 23.78068 24.16814 27.44125 100
Taking advantage of threading helps further. Below is how the time is split between 4 threads on my machine. See the code in this gist.
Here are benchmarks with the last version too:
> microbenchmark(rowCountsR(m), rowCounts_1(m), rowCounts_2(m), rowCounts_3(m,4))
Unit: milliseconds
expr min lq median uq max neval
rowCountsR(m) 93.67895 127.58762 127.81847 128.03472 151.54455 100
rowCounts_1(m) 120.47675 120.89169 121.31227 122.86422 137.86543 100
rowCounts_2(m) 28.88102 29.68101 29.83790 29.97112 38.14453 100
rowCounts_3(m, 4) 12.50059 12.68981 12.87712 13.10425 17.21966 100
We can take advantage of the structure of our matrix to count the number of unique rows in a nice way. Because the values are all 0 and 1, we can define a 'hash' function that maps each row to a unique integer value, and then count those hashes.
The hash function we will implement is identical to the following R code:
hash <- function(x) sum(x * 2^(0:(length(x)-1)))
where x is an integer vector of 0s and 1s, representing a row of a matrix.
In my solution, because I'm using C++ and there is no associative container that maintains insertion order (in the standard library), I use both a std::map<int, int> to count hashes of each row, and a std::vector<int> to track the order in which hashes are inserted.
Because the number of columns is restricted to at most 20, we can compute the hashed values and store them in an int; to be safe for wider matrices, one should store the hashes in a double, because an int would overflow with more than 31 columns.
With that in mind, we can write a solution:
#include <Rcpp.h>
using namespace Rcpp;
inline int hash(IntegerMatrix::Row x) {
int n = x.size();
int hash = 0;
for (int j=0; j < n; ++j) {
hash += x[j] << j;
}
return hash;
}
// [[Rcpp::export]]
List rowCounts(IntegerMatrix x) {
int nrow = x.nrow();
typedef std::map<int, int> map_t;
map_t counts;
// keep track of insertion order with a separate vector
std::vector<int> ordered_hashes;
std::vector<int> insertion_order;
ordered_hashes.reserve(nrow);
insertion_order.reserve(nrow);
for (int i=0; i < nrow; ++i) {
IntegerMatrix::Row row = x(i, _);
int hashed_row = hash(row);
if (!counts[hashed_row]) {
ordered_hashes.push_back(hashed_row);
insertion_order.push_back(i);
}
++counts[hashed_row];
}
// fill the 'counts' portion of the output
int n = counts.size();
IntegerVector output = no_init(n);
for (int i=0; i < n; ++i) {
output[i] = counts[ ordered_hashes[i] ];
}
// fill the 'idx' portion of the output
IntegerVector idx = no_init(n);
for (int i=0; i < n; ++i) {
idx[i] = insertion_order[i] + 1; // 0 to 1-based indexing
}
return List::create(
_["counts"] = output,
_["idx"] = idx
);
}
/*** R
set.seed(123)
m <- matrix( sample(0:1, 10, TRUE), nrow=5 )
rowCounts(m)
m <- matrix( sample(0:1, 1E5, TRUE), ncol=5 )
str(rowCounts(m))
## Compare it to a close-ish R solution
microbenchmark( times=5,
rowCounts(m),
table(do.call(paste, as.data.frame(m)))
)
*/
Calling sourceCpp on this gives me:
> Rcpp::sourceCpp('rowCounts.cpp')
> set.seed(123)
> m <- matrix( sample(0:1, 10, TRUE), nrow=5 )
> m
[,1] [,2]
[1,] 0 0
[2,] 1 1
[3,] 0 1
[4,] 1 1
[5,] 1 0
> rowCounts(m)
$counts
[1] 1 2 1 1
$idx
[1] 1 2 3 5
> m <- matrix( sample(0:1, 1E5, TRUE), ncol=5 )
> str(rowCounts(m))
List of 2
$ counts: int [1:32] 602 640 635 624 638 621 622 615 633 592 ...
$ idx : int [1:32] 1 2 3 4 5 6 7 8 9 10 ...
> microbenchmark( times=5,
+ rowCounts(m),
+ table(do.call(paste, as.data.frame(m)))
+ )
Unit: milliseconds
expr min lq median uq max neval
rowCounts(m) 1.14732 1.150512 1.172886 1.183854 1.184235 5
table(do.call(paste, as.data.frame(m))) 22.95222 23.146423 23.607649 24.455728 24.953177 5
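For matrices wider than the 20 columns in the question, another option (not from the answer above, which suggests a double) is a fixed-width 64-bit integer, since a double only represents integers exactly up to 2^53. A minimal sketch in plain C++, with `hash_row64` as a hypothetical name, treating any non-zero entry as 1:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hash a row of 0/1 values into an unsigned 64-bit integer by setting
// bit j when column j is non-zero. Supports at most 64 columns.
std::uint64_t hash_row64(const std::vector<int>& row) {
  if (row.size() > 64) throw std::length_error("at most 64 columns supported");
  std::uint64_t h = 0;
  for (std::size_t j = 0; j < row.size(); ++j) {
    if (row[j] != 0) h |= (std::uint64_t{1} << j);
  }
  return h;
}
```

Such a hash could serve as the key of the std::map in place of int; it would stay on the C++ side, because R has no native 64-bit integer type to return it in.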
I was curious how a pure R solution would perform:
set.seed(123)
m <- matrix( sample(0:1, 1E5, TRUE), ncol=5 )
rowCountsR <- function(x) {
## calculate hash
h <- x %*% matrix(2^(0:(ncol(x)-1)), ncol=1)
i <- which(!duplicated(h))
counts <- tabulate(h+1)
counts[order(h[i])] <- counts
list(counts=counts, idx=i)
}
library("rbenchmark")
benchmark(rowCounts(m), rowCountsR(m))
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 rowCounts(m) 100 0.189 1.000 0.188 0 0 0
# 2 rowCountsR(m) 100 0.258 1.365 0.256 0 0 0
Edit: more columns, thanks @Arun for pointing this out.
set.seed(123)
m <- matrix( sample(0:1, 1e7, TRUE), ncol=10)
benchmark(rowCounts(m), rowCountsR(m), replications=100)
# test replications elapsed relative user.self sys.self user.child sys.child
#1 rowCounts(m) 100 20.659 1.077 20.533 0.024 0 0
#2 rowCountsR(m) 100 19.183 1.000 15.641 3.408 0 0