Having trouble building an R package using R/C++ functions

I have a C++ function that is called inside an R function using the Rcpp package. The R function accepts an inputDataFrame and uses the C++ function (which also accepts a DataFrame) to calculate drug amounts (A1) as a function of time. R then returns the inputDataFrame with an added column for the calculated amounts A1.
I am having trouble making an R package for this function. I followed the RStudio instructions, but I ran into an error when building the package. The error is in the RcppExports.cpp file and states that 'OneCompIVbolusCpp' was not declared in this scope.
Here is the code for the C++ and R functions. They work perfectly fine in R when I process an example data frame.
R function OneCompIVbolus_Rfunction.R:
library(Rcpp)
sourceCpp("OneCompIVbolusCppfunction.cpp")
OneCompIVbolusRCpp <- function(inputDataFrame){
inputDataFrame$A1[inputDataFrame$TIME==0] <- inputDataFrame$AMT[inputDataFrame$TIME==0]
OneCompIVbolusCpp( inputDataFrame )
inputDataFrame
}
C++ function OneCompIVbolusCppfunction.cpp:
#include <Rcpp.h>
#include <math.h>
#include <iostream>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
// input Dataframe from R
DataFrame OneCompIVbolusCpp(DataFrame inputFrame){
// Create vectors of each element used in function and for constructing output dataframe
Rcpp::DoubleVector TIME = inputFrame["TIME"];
Rcpp::DoubleVector AMT = inputFrame["AMT"];
Rcpp::DoubleVector k10 = inputFrame["k10"];
Rcpp::DoubleVector A1 = inputFrame["A1"];
double currentk10, currentTime, previousA1, currentA1;
// in C++ arrays start at index 0, so to start at 2nd row need to set counter to 1
// for counter from 1 to the number of rows in input data frame
for(int counter = 1; counter < inputFrame.nrows(); counter++){
// pull out all the variables that will be used for calculation
currentk10 = k10[ counter ];
currentTime = TIME[ counter ] - TIME[ counter - 1];
previousA1 = A1[ counter - 1 ];
// Calculate currentA1
currentA1 = previousA1*exp(-currentTime*currentk10);
// Fill in Amounts and check for other doses
A1[ counter ] = currentA1 + AMT[ counter ];
} // end for loop
return(0);
}
Any hints on what I am doing wrong here? How can I solve this issue?
Edit:
Here is an example of running the composite function OneCompIVbolusRCpp in R:
library(plyr)
library(Rcpp)
source("OneCompIVbolus_Rfunction.R")
#-------------
# Generate df
#-------------
#Set dose records:
dosetimes <- c(0,12)
#set number of subjects
ID <- 1:2
#Make dataframe
df <- expand.grid("ID"=ID,"TIME"=sort(unique(c(seq(0,24,1),dosetimes))),"AMT"=0,"MDV"=0,"CL"=2,"V"=10)
doserows <- subset(df, TIME%in%dosetimes)
#Dose = 100 mg, Dose 1 at time 0
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
#Dose 2 at 12
doserows$AMT[doserows$TIME==dosetimes[2]] <- 50
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$ID,df$TIME,df$AMT),] # arrange df by ID, then TIME, then AMT (all ascending)
df <- subset(df, (TIME==0 & AMT==0)==F) # remove the row that has a TIME=0 and AMT=0
df$k10 <- df$CL/df$V
#-------------
# Apply the function
#-------------
simdf <- ddply(df, .(ID), OneCompIVbolusRCpp)

You may simply have the wrong ordering. Instead of
// [[Rcpp::export]]
// input Dataframe from R
DataFrame OneCompIVbolusCpp(DataFrame inputFrame){
// ...
do
// input Dataframe from R
// [[Rcpp::export]]
DataFrame OneCompIVbolusCpp(DataFrame inputFrame){
// ...
as the [[Rcpp::export]] tag must come directly before the function it exports.
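Separately from where the attribute is placed, the loop in the question is a simple recurrence: each amount is the previous amount decayed over the elapsed time, plus any dose given at that row. A minimal sketch of that recurrence in standard C++ (no Rcpp, so it can be compiled and checked outside a package) might look like this. The function name oneCompIVBolus and the use of std::vector are assumptions for illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone version of the loop in OneCompIVbolusCpp.
// time, amt and k10 play the role of the TIME, AMT and k10 columns.
std::vector<double> oneCompIVBolus(const std::vector<double>& time,
                                   const std::vector<double>& amt,
                                   const std::vector<double>& k10) {
    assert(time.size() == amt.size() && time.size() == k10.size());
    std::vector<double> a1(time.size(), 0.0);
    if (a1.empty()) return a1;

    // Assumption: the first row is the TIME == 0 record, which the R
    // wrapper seeds with the dose given at that time.
    a1[0] = amt[0];

    for (std::size_t i = 1; i < time.size(); ++i) {
        const double dt = time[i] - time[i - 1];
        // Decay the previous amount over dt, then add any dose at this row.
        a1[i] = a1[i - 1] * std::exp(-dt * k10[i]) + amt[i];
    }
    return a1;
}
```

Checking the arithmetic in isolation like this can help separate a build problem (such as the export attribute) from a problem in the calculation itself.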

Related

Save results for each file of a list of files looping through a factor variable in R. Vector does not update

I am using a list of files, and I am trying to create a data frame that contains: for each sample, the percentage of two particular "GT" types by the levels of another factor variable called "chr" (with 1 to 24 levels).
It would have to look like this:
The problem I keep getting is that the vector never gets updated for the ith sample, it only keeps the first vector created. And then I am not sure how to save that updated vector on my data frame (df).
vector_chr <- vector();
for (i in seq_along(list_files)) {
GT <- list_files[[i]][,9]
chr <- list_files[[i]][,3]
GT$chr <- chr$chr # creating one df with both GT and chr
for (j in unique(GT$chr)){
dat_list = split(GT, GT$chr) # split data frames by chr (1 to 24)
table <- table(dat_list[[j]][,1]) # take GT and make a table
sum <- sum(table[3:4]) # sum GTs 3 and 4
perc <- sum/nrow(GT)
vector_chr <- c(vector_chr,perc) # assign the 24 percentages to a vector
}
df <- data.frame(matrix(ncol = 25, nrow = length(files)))
x <- c("Sample", "chr1", "chr2", "chr3",
"chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12","chr13", "chr14", "chr15", "chr16",
"chr17", "chr18", "chr19", "chr20", "chr21", "chr22",
"chrX", "chrXY")
colnames(df) <- x
df$Sample <- names(list_files)
df[i,2:25] <- vector_chr # assign the 24 percentages for EACH sample
}

k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross validation without using SSC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
xtreg c.DepVar /// // training set
c.IndVar1 ///
c.IndVar2 ///
if randint !=`b' ///
, fe vce(cluster uuid)
xtreg c.DepVar /// // test set, needs to be performed with model above, not a
c.IndVar1 /// // new model...
c.IndVar2 ///
if randint ==`b' ///
, fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately, the filtering issue came from using a scalar in quotes to define the bounds. I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it OK to use the version with the minus one and without the equals sign?
The answer below is still extremely helpful and I learned much from it.
As a matter of fact, two different things are going on here that are not necessarily directly related: 1) how to filter data with a randomly generated integer value, and 2) the k-fold cross-validation procedure.
For the first one, I will leave an example below that could help you work things out in Stata, using some tools that transfer easily to other problems (such as matrix generation and manipulation to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because both fit a separate model on the testing data as well as on the training data. Strictly speaking, the model should be fitted on the training data only, and those fitted parameters should then be used to assess the model's performance on the testing data.
For further reference on the procedure, Scikit-learn has done brilliant work explaining it, with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
randint | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 17.00 17.00
2 | 18 18.00 35.00
3 | 21 21.00 56.00
4 | 19 19.00 75.00
5 | 25 25.00 100.00
------------+-----------------------------------
Total | 100 100.00
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show formatted empty matrix
matrix li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 . . . .
2 . . . .
3 . . . .
4 . . . .
5 . . . .
*/
// loop over different samples
forvalues b = 1/5 {
// run the model using fold == `b'
qui reg y x1 x2 if randint ==`b'
// save R squared training
matrix res[`b', 1] = e(r2)
// save rmse training (note: e(rmse) is the root MSE, although the column is named MSE_fold)
matrix res[`b', 2] = e(rmse)
// run the model using fold != `b'
qui reg y x1 x2 if randint !=`b'
// save R squared training (?)
matrix res[`b', 3] = e(r2)
// save rmse testing (?)
matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 .50949187 1.2877728 .74155365 1.0070531
2 .89942838 .71776458 .66401888 1.089422
3 .75542004 1.0870525 .68884359 1.0517139
4 .68140328 1.1103964 .71990589 1.0329239
5 .68816084 1.0017175 .71229925 1.0596865
*/
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
/*
mean_res[1,4]
R2_fold MSE_fold R2_hold MSE_hold
c1 .70678088 1.0409408 .70532425 1.0481599
*/
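The stricter procedure described above (fit on the training folds, score on the held-out fold, then average the test errors) can be sketched outside Stata as follows. This is a minimal illustration in standard C++, assuming a plain least-squares line with a single predictor rather than the fixed-effects model from the question. The function name kfoldMeanTestMSE is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: mean test MSE over k folds for y = a + b * x.
// fold[i] is the fold label (1..k) of observation i, like randint above.
double kfoldMeanTestMSE(const std::vector<double>& x,
                        const std::vector<double>& y,
                        const std::vector<int>& fold, int k) {
    assert(x.size() == y.size() && x.size() == fold.size() && k > 0);
    std::vector<double> testMSE;  // one entry per fold, like a Python list

    for (int b = 1; b <= k; ++b) {
        // 1) Fit by ordinary least squares on the training rows (fold != b).
        double n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            if (fold[i] == b) continue;
            n += 1;
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        const double denom = n * sxx - sx * sx;
        assert(denom != 0.0);  // sketch only: needs two distinct training x values
        const double slope = (n * sxy - sx * sy) / denom;
        const double intercept = (sy - slope * sx) / n;

        // 2) Score those fitted coefficients on the held-out rows (fold == b).
        double sse = 0;
        std::size_t m = 0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            if (fold[i] != b) continue;
            const double err = y[i] - (intercept + slope * x[i]);
            sse += err * err;
            ++m;
        }
        if (m > 0) testMSE.push_back(sse / static_cast<double>(m));
    }

    // 3) Average the per-fold test MSEs.
    double total = 0;
    for (double v : testMSE) total += v;
    return testMSE.empty() ? 0.0 : total / static_cast<double>(testMSE.size());
}
```

The key difference from the Stata loop above is that no second model is fitted on the held-out rows: the coefficients from step 1 are reused in step 2.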

Applying Rcpp on a dataframe

I'm new to C++ and exploring faster computation possibilities in R through the Rcpp package. The actual data frame contains over 2 million rows, and processing it is quite slow.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z, which is a categorical variable with levels a, b and c. I had done a merge operation from another data frame to bring the columns a, b and c into df with their specific numbers.
First step would be to match each row in z with the column names (a,b or c) and create a new column called 'type' and copy the corresponding number.
So the first row would read,
df$z[1] = "a"
df$type[1]= 303
Now it must match df$type with column names in another dataframe called 'cost' and create df$cost. The cost dataframe contains column names as numbers e.g. "103", "203" etc.
For our example, df$cost[1] = 6. It matches df$type[1] = 303 with cost$303[1]=6
The final data frame should look like this (a sample output I created):
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))
A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost,melt(df)) # create a unique data frame
row.idx <- which(tmp$z==tmp$variable) # row index of matching values
col.val <- match(as.character(tmp$value[row.idx]), names(tmp) ) # find corresponding values in the column names
# now put all together
df2 <- data.frame('z'=unique(df$z),
'type' = tmp$value[row.idx],
'cost' = as.numeric(tmp[1,col.val]) )
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
see if it works
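Since the question asks about doing this in C++, the two lookups (pick the value from the column named by z, then look that value up in the cost table) can be sketched with standard containers. This is only an illustration of the logic, not Rcpp code: the TypeCost struct, the matchTypeAndCost name and the map-based representation of the two data frames are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct TypeCost {
    std::vector<int> type;
    std::vector<double> cost;
};

// Hypothetical stand-alone version of the two lookups.
// z:       the z column, one label per row ("a", "b", "c")
// columns: the remaining columns of df, keyed by column name
// cost:    the single-row cost table, keyed by the numeric code
TypeCost matchTypeAndCost(
    const std::vector<std::string>& z,
    const std::unordered_map<std::string, std::vector<int>>& columns,
    const std::unordered_map<int, double>& cost) {
    // Every column must have one value per row of z.
    for (const auto& kv : columns) assert(kv.second.size() == z.size());

    TypeCost out;
    out.type.reserve(z.size());
    out.cost.reserve(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) {
        // Step 1: pick the value from the column whose name equals z[i].
        const int type = columns.at(z[i]).at(i);
        // Step 2: look that code up in the cost table.
        out.type.push_back(type);
        out.cost.push_back(cost.at(type));
    }
    return out;
}
```

Both lookups are hash-map accesses, so the work grows linearly with the number of rows. Note that at() throws std::out_of_range when a label or code is missing, which a real implementation would probably want to handle.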

Why RcppArmadillo's fastLmPure produces NA's in output but fastLm doesn't?

I use rolling regression in R quite a lot and my initial setup is something like:
dolm <- function(x) coef(lm(x[,1] ~ x[,2] + 0, data = as.data.frame(x)))
rollingCoef = rollapply(someData, 100, dolm)
The above example works perfectly, except that it is slow if you have a lot of iterations.
To speed it up I've decided to experiment with the Rcpp package.
First I substituted lm with fastLm; the result is a bit faster but still slow. That pushed me to write the entire rolling-regression coefficient function in C++ as a for loop and then integrate it into R with Rcpp's help.
So I've changed the original RcppArmadillo fastLm function to this:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::export]]
List rollCoef(const arma::mat& X, const arma::colvec& y, double window ) {
double cppWindow = window - 1;
double matRows = X.n_rows;
double matCols = X.n_cols - 1;
arma::mat coef( matRows - cppWindow, X.n_cols); // matrix for estimated coefficients
//for loop for rolling regression.
for( double i = 0 ; i < matRows - cppWindow ; i++ )
{
coef.row(i) = arma::trans(arma::solve(X( arma::span(i,i + cppWindow), arma::span(0,matCols)) , y.rows(i,i + cppWindow)));
}
return List::create(_["coefficients"] = coef);
}
and then load it into R with sourceCpp(file=".../rollCoef.cpp")
So it's much faster than rollapply and it worked fine on small examples, but when I applied it to ~200000 observations of data, about half of the output was NA, whereas the rollapply/fastLm combination didn't produce any.
So here I need some help. What is wrong with my function? Why are there NA's in my function's output and none with rollapply/fastLm, when, if I understand correctly, both are based on arma::solve? Any help is highly appreciated.
UPDATE
Here is reproducible code:
require(Rcpp)
require(RcppArmadillo)
require(zoo)
require(repmis)
myData <- source_DropboxData(file = "example.csv",
key = "cbrmkkbssu5bn96", sep = ",", header = TRUE)
## To use my custom function "rollCoef" you first need to load it into R.
## The C++ code is presented above in the main question.
## Save it wherever you want as "rollCoef.cpp" and then load it into R with:
sourceCpp(file=".../rollCoef.cpp") # replace with your actual path
myCoef = rollCoef(as.matrix(myData[,2]),myData[,1],260)
summary(unlist(myCoef)) # 80923 NA's
dolm = function(x) coef(fastLmPure(as.matrix(x[,2]), x[,1]))
myCoef2 = rollapply(myData, 260, dolm, by.column = FALSE)
summary(myCoef2) # 80923 NA's
dolm2 = function(x) coef(fastLm(x[,1] ~ x[,2] + 0, data = as.data.frame(x)))
myCoef3 = rollapply(myData, 260, dolm2, by.column = FALSE)
summary(myCoef3) # !!! No NA's !!!
head(unlist(myCoef)) ; head(unlist(myCoef2)) ; head(myCoef3)
So the output of my function is identical to the output of RcppArmadillo's fastLmPure combined with rollapply, and both produce NA's, but rollapply with fastLm does not. As I understand it, for example from HERE and HERE, fastLm basically calls fastLmPure, so why are there no NA's with the third method? Are there some additional capabilities in fastLm that prevent NA's that I didn't spot?
There is an entire package RcppRoll to do just that custom rolling -- and you should be able to extend it and its rollit() function to do rolling lm() as well.
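For the specific model in the question (one regressor, no intercept), the rolling coefficient has a closed form, so the window loop can also be sketched without a linear solver. The following is a minimal illustration in standard C++. The function name rollSlopeNoIntercept is hypothetical, and the sketch does not by itself explain the NA difference between fastLm and fastLmPure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: slope of y ~ x + 0 over each window of `window` rows.
std::vector<double> rollSlopeNoIntercept(const std::vector<double>& x,
                                         const std::vector<double>& y,
                                         std::size_t window) {
    assert(x.size() == y.size() && window > 0);
    std::vector<double> coef;
    if (x.size() < window) return coef;
    coef.reserve(x.size() - window + 1);

    for (std::size_t start = 0; start + window <= x.size(); ++start) {
        double sxx = 0.0, sxy = 0.0;
        for (std::size_t i = start; i < start + window; ++i) {
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        // With one regressor and no intercept, least squares reduces to
        // sum(x*y) / sum(x*x). If sxx is 0 the division yields NaN or inf
        // rather than a usable coefficient.
        coef.push_back(sxy / sxx);
    }
    return coef;
}
```

Each window is recomputed from scratch here for clarity. Updating the two sums incrementally as the window slides would be faster, at the cost of accumulating floating-point error over long series.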

Rcpp Create DataFrame with Variable Number of Columns

I am interested in using Rcpp to create a data frame with a variable number of columns. By that, I mean that the number of columns will be known only at runtime. Some of the columns will be standard, but others will be repeated n times where n is the number of features I am considering in a particular run.
I am aware that I can create a data frame as follows:
IntegerVector i1(3); i1[0]=4;i1[1]=2134;i1[2]=3453;
IntegerVector i2(3); i2[0]=4123;i2[1]=343;i2[2]=99123;
DataFrame df = DataFrame::create(Named("V1")=i1,Named("V2")=i2);
but in this case it is assumed that the number of columns is 2.
To simplify the explanation of what I need, assume that I would like pass a SEXP variable specifying the number of columns to create in the variable part. Something like:
RcppExport SEXP myFunc(SEXP n, SEXP <other stuff>)
IntegerVector i1(3); <compute i1>
IntegerVector i2(3); <compute i2>
for(int i=0;i<n;i++){compute vi}
DataFrame df = DataFrame::create(Named("Num")=i1,Named("ID")=i2,...,other columns v1 to vn);
where n is passed as an argument. The final data frame in R would look like
Num ID V1 ... Vn
1 2 5 'aasda'
...
(In reality, the column names will not be of the form "Vx", but they will be known at runtime.) In other words, I cannot use a static list of
Named()=...
since the number will change.
I have tried skipping the "Named()" part of the constructor and then naming the columns at the end, but the results are junk.
Can this be done?
If I understand your question correctly, it seems like it would be easiest to take advantage of the DataFrame constructor that takes a List as an argument (since the size of a List can be specified directly), and set the names of your columns via .attr("names") and a CharacterVector:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::DataFrame myFunc(int n, Rcpp::List lst,
Rcpp::CharacterVector Names = Rcpp::CharacterVector::create()) {
Rcpp::List tmp(n + 2);
tmp[0] = Rcpp::IntegerVector(3);
tmp[1] = Rcpp::IntegerVector(3);
Rcpp::CharacterVector lnames = Names.size() < lst.size() ?
lst.attr("names") : Names;
Rcpp::CharacterVector names(n + 2);
names[0] = "Num";
names[1] = "ID";
for (int i = 0; i < n; i++) {
// tmp[i + 2] = do_something(lst[i]);
tmp[i + 2] = lst[i];
if (std::string(lnames[i]).compare("") != 0) {
names[i + 2] = lnames[i];
} else {
names[i + 2] = "V" + std::to_string(i);
}
}
Rcpp::DataFrame result(tmp);
result.attr("names") = names;
return result;
}
There's a little extra going on there to allow the Names vector to be optional - e.g. if you just use a named list you can omit the third argument.
lst1 <- list(1L:3L, 1:3 + .25, letters[1:3])
##
> myFunc(length(lst1), lst1, c("V1", "V2", "V3"))
# Num ID V1 V2 V3
#1 0 0 1 1.25 a
#2 0 0 2 2.25 b
#3 0 0 3 3.25 c
lst2 <- list(
Column1 = 1L:3L,
Column2 = 1:3 + .25,
Column3 = letters[1:3],
Column4 = LETTERS[1:3])
##
> myFunc(length(lst2), lst2)
# Num ID Column1 Column2 Column3 Column4
#1 0 0 1 1.25 a A
#2 0 0 2 2.25 b B
#3 0 0 3 3.25 c C
Just be aware of the 20-length limit for this signature of the DataFrame constructor, as pointed out by @hrbrmstr.
It's an old question, but I think more people are struggling with this, like me. Starting from the other answers here, I arrived at a solution that isn't limited by the 20 column limit of the DataFrame constructor:
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <string>
#include <sstream>
using namespace Rcpp;
// [[Rcpp::export]]
List variableColumnList(int numColumns=30) {
List retval;
for (int i=0; i<numColumns; i++) {
std::ostringstream colName;
colName << "V" << i+1;
retval.push_back( IntegerVector::create(100*i, 100*i + 1),colName.str());
}
return retval;
}
// [[Rcpp::export]]
DataFrame variableColumnListAsDF(int numColumns=30) {
Function asDF("as.data.frame");
return asDF(variableColumnList(numColumns));
}
// [[Rcpp::export]]
DataFrame variableColumnListAsTibble(int numColumns=30) {
Function asTibble("tbl_df");
return asTibble(variableColumnList(numColumns));
}
So build a C++ List first by pushing columns onto an empty List. (I generate the values and the column names on the fly here.) Then, either return that as an R list, or use one of two helper functions to convert them into a data.frame or tbl_df. One could do the latter from R, but I find this cleaner.