Rcpp Create DataFrame with Variable Number of Columns


I am interested in using Rcpp to create a data frame with a variable number of columns. By that, I mean that the number of columns will be known only at runtime. Some of the columns will be standard, but others will be repeated n times where n is the number of features I am considering in a particular run.
I am aware that I can create a data frame as follows:
IntegerVector i1(3); i1[0]=4;i1[1]=2134;i1[2]=3453;
IntegerVector i2(3); i2[0]=4123;i2[1]=343;i2[2]=99123;
DataFrame df = DataFrame::create(Named("V1")=i1,Named("V2")=i2);
but in this case it is assumed that the number of columns is 2.
To simplify the explanation of what I need, assume that I would like to pass a SEXP variable specifying the number of columns to create in the variable part. Something like:
RcppExport SEXP myFunc(SEXP n, SEXP <other stuff>)
IntegerVector i1(3); <compute i1>
IntegerVector i2(3); <compute i2>
for(int i=0;i<n;i++){compute vi}
DataFrame df = DataFrame::create(Named("Num")=i1,Named("ID")=i2,...,other columns v1 to vn);
where n is passed as an argument. The final data frame in R would look like
Num ID V1 ... Vn
1 2 5 'aasda'
...
(In reality, the column names will not be of the form "Vx", but they will be known at runtime.) In other words, I cannot use a static list of
Named()=...
since the number will change.
I have tried skipping the "Named()" part of the constructor and then naming the columns at the end, but the results are junk.
Can this be done?

If I understand your question correctly, it seems like it would be easiest to take advantage of the DataFrame constructor that takes a List as an argument (since the size of a List can be specified directly), and set the names of your columns via .attr("names") and a CharacterVector:
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <string>
// [[Rcpp::export]]
Rcpp::DataFrame myFunc(int n, Rcpp::List lst,
                       Rcpp::CharacterVector Names = Rcpp::CharacterVector::create()) {
    // Two fixed columns ("Num", "ID") plus n variable columns
    Rcpp::List tmp(n + 2);
    tmp[0] = Rcpp::IntegerVector(3);
    tmp[1] = Rcpp::IntegerVector(3);
    // Use the list's own names when Names is missing or too short
    Rcpp::CharacterVector lnames = Names.size() < lst.size() ?
        lst.attr("names") : Names;
    Rcpp::CharacterVector names(n + 2);
    names[0] = "Num";
    names[1] = "ID";
    for (int i = 0; i < n; i++) {
        // tmp[i + 2] = do_something(lst[i]);
        tmp[i + 2] = lst[i];
        // Fall back to a generated name when none was supplied for column i
        if (i < lnames.size() && std::string(lnames[i]).compare("") != 0) {
            names[i + 2] = lnames[i];
        } else {
            names[i + 2] = "V" + std::to_string(i);
        }
    }
    Rcpp::DataFrame result(tmp);
    result.attr("names") = names;
    return result;
}
There's a little extra going on there to allow the Names vector to be optional - e.g. if you just use a named list you can omit the third argument.
lst1 <- list(1L:3L, 1:3 + .25, letters[1:3])
##
> myFunc(length(lst1), lst1, c("V1", "V2", "V3"))
# Num ID V1 V2 V3
#1 0 0 1 1.25 a
#2 0 0 2 2.25 b
#3 0 0 3 3.25 c
lst2 <- list(
Column1 = 1L:3L,
Column2 = 1:3 + .25,
Column3 = letters[1:3],
Column4 = LETTERS[1:3])
##
> myFunc(length(lst2), lst2)
# Num ID Column1 Column2 Column3 Column4
#1 0 0 1 1.25 a A
#2 0 0 2 2.25 b B
#3 0 0 3 3.25 c C
Just be aware of the 20-length limit for this signature of the DataFrame constructor, as pointed out by @hrbrmstr.

It's an old question, but I think more people are struggling with this, like me. Starting from the other answers here, I arrived at a solution that isn't limited by the 20 column limit of the DataFrame constructor:
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <string>
#include <sstream>
using namespace Rcpp;
// [[Rcpp::export]]
List variableColumnList(int numColumns=30) {
List retval;
for (int i=0; i<numColumns; i++) {
std::ostringstream colName;
colName << "V" << i+1;
retval.push_back( IntegerVector::create(100*i, 100*i + 1),colName.str());
}
return retval;
}
// [[Rcpp::export]]
DataFrame variableColumnListAsDF(int numColumns=30) {
Function asDF("as.data.frame");
return asDF(variableColumnList(numColumns));
}
// [[Rcpp::export]]
DataFrame variableColumnListAsTibble(int numColumns=30) {
Function asTibble("tbl_df");
return asTibble(variableColumnList(numColumns));
}
So build a C++ List first by pushing columns onto an empty List. (I generate the values and the column names on the fly here.) Then, either return that as an R list, or use one of two helper functions to convert them into a data.frame or tbl_df. One could do the latter from R, but I find this cleaner.

Related

Merging Tables in Apache Arrow

I have two arrow:Tables where table 1 is:
colA colB
1 2
3 4
and table 2 is,
colC colD
i j
k l
where both table 1 and 2 have the same number of rows. I would like to join them side-by-side as
colA colB colC colD
1 2 i j
3 4 k l
I'm trying to use arrow::ConcatenateTables as follows, but I'm getting a bunch of nulls in my output (not shown)
t1 = ... // std::shared_ptr<arrow::Table>
t2 = ... // std::shared_ptr<arrow::Table>
arrow::ConcatenateTablesOptions options;
options.unify_schemas = true;
options.field_merge_options.promote_nullability = true;
auto merged = arrow::ConcatenateTables({t1, t2}, options);
How do I obtain the expected output?

arrow::ConcatenateTables only does row-wise concatenation. There is no builtin helper method for column-wise concatenation but it is easy enough to create one yourself (apologies if this is not quite right, I'm not in front of a compiler at the moment):
#include <arrow/api.h>
#include <memory>
#include <vector>
// Joins two tables side by side; assumes both have the same number of rows.
std::shared_ptr<arrow::Table> CombineTables(const arrow::Table& left, const arrow::Table& right) {
    // Start from the left table's columns and append the right table's
    std::vector<std::shared_ptr<arrow::ChunkedArray>> columns = left.columns();
    const std::vector<std::shared_ptr<arrow::ChunkedArray>>& right_columns = right.columns();
    columns.insert(columns.end(), right_columns.begin(), right_columns.end());
    // Do the same for the schema fields so names and types stay aligned
    std::vector<std::shared_ptr<arrow::Field>> fields = left.fields();
    const std::vector<std::shared_ptr<arrow::Field>>& right_fields = right.fields();
    fields.insert(fields.end(), right_fields.begin(), right_fields.end());
    return arrow::Table::Make(arrow::schema(std::move(fields)), std::move(columns));
}

Applying Rcpp on a dataframe

I'm new to C++ and exploring faster computation possibilities in R through the Rcpp package. The actual dataframe contains over 2 million rows, and processing it is quite slow.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z which is a categorical variable with levels a, b and c. I had done a merge operation from another dataframe to bring in a,b,c into df with the specific nos.
First step would be to match each row in z with the column names (a,b or c) and create a new column called 'type' and copy the corresponding number.
So the first row would read,
df$z[1] = "a"
df$type[1]= 303
Now it must match df$type with column names in another dataframe called 'cost' and create df$cost. The cost dataframe contains column names as numbers e.g. "103", "203" etc.
For our example, df$cost[1] = 6. It matches df$type[1] = 303 with cost$303[1]=6
Final Dataframe should look like this - Created a sample output
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))

A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost,melt(df)) # create a unique data frame
row.idx <- which(tmp$z==tmp$variable) # row index of matching values
col.val <- match(as.character(tmp$value[row.idx]), names(tmp) ) # find corresponding values in the column names
# now put all together
df2 <- data.frame('z'=unique(df$z),
'type' = tmp$value[row.idx],
'cost' = as.numeric(tmp[1,col.val]) )
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
see if it works
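Since the question asked about C++, here is a minimal standalone sketch of the same two-step lookup, assuming the data has already been pulled out of the data frames into standard containers. The names Row, lookupCosts and sampleResult are hypothetical, and no Rcpp types are used, so this only illustrates the logic (pick the column named by z, then look up the cost for that number) rather than a drop-in Rcpp function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One output row: the category, the number picked from the matching column,
// and the cost looked up for that number. (Hypothetical type for illustration.)
struct Row {
    std::string z;
    int type;
    double cost;
};

// columns: column name ("a", "b", "c") -> that column's values, one per row
// cost:    type number -> cost
// Throws std::out_of_range (from .at) if a category or type is missing.
std::vector<Row> lookupCosts(
    const std::vector<std::string>& z,
    const std::unordered_map<std::string, std::vector<int>>& columns,
    const std::unordered_map<int, double>& cost) {
    std::vector<Row> out;
    out.reserve(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) {
        // Step 1: select the column whose name equals z[i]
        const std::vector<int>& col = columns.at(z[i]);
        assert(i < col.size());  // every column needs one value per row
        const int type = col[i];
        // Step 2: look up the cost for that type
        out.push_back(Row{z[i], type, cost.at(type)});
    }
    return out;
}

// The sample data from the question, run through lookupCosts.
std::vector<Row> sampleResult() {
    const std::vector<std::string> z = {"a", "b", "c"};
    const std::unordered_map<std::string, std::vector<int>> columns = {
        {"a", {303, 403, 503}}, {"b", {203, 103, 803}}, {"c", {903, 803, 703}}};
    const std::unordered_map<int, double> cost = {
        {103, 4.0}, {203, 5.0}, {303, 6.0}, {403, 7.0}, {503, 8.0},
        {603, 9.0}, {703, 10.0}, {803, 11.0}, {903, 12.0}};
    return lookupCosts(z, columns, cost);
}
```

Each row costs two hash lookups, so the work grows linearly with the number of rows. Wrapping this in Rcpp would mean copying the data frame columns into these containers first, which is left out here.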

Having trouble in building Rpackage using R/C++ functions

I have a C++ function that is called inside an R function using the Rcpp package. The R function accepts an inputDataFrame and uses the C++ function (which also accepts a DataFrame) to calculate drug amounts (A1) as a function of time. R then returns the inputDataFrame with an added column for the calculated amounts A1.
I have trouble making an R package for this function. I followed the RStudio instructions but I ran into an error when building the package. The error is in the RcppExports.cpp file and states that 'OneCompIVbolusCpp' was not declared in this scope.
Here are the codes for the C++ and R functions. They work perfectly fine in R when I process an example dataframe.
Rfunction OneCompIVbolus_Rfunction.R:
library(Rcpp)
sourceCpp("OneCompIVbolusCppfunction.cpp")
OneCompIVbolusRCpp <- function(inputDataFrame){
inputDataFrame$A1[inputDataFrame$TIME==0] <- inputDataFrame$AMT[inputDataFrame$TIME==0]
OneCompIVbolusCpp( inputDataFrame )
inputDataFrame
}
C++ function OneCompIVbolusCppfunction.cpp:
#include <Rcpp.h>
#include <math.h>
#include <iostream>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
// input Dataframe from R
DataFrame OneCompIVbolusCpp(DataFrame inputFrame){
// Create vectors of each element used in function and for constructing output dataframe
Rcpp::DoubleVector TIME = inputFrame["TIME"];
Rcpp::DoubleVector AMT = inputFrame["AMT"];
Rcpp::DoubleVector k10 = inputFrame["k10"];
Rcpp::DoubleVector A1 = inputFrame["A1"];
double currentk10, currentTime, previousA1, currentA1;
// in C++ arrays start at index 0, so to start at 2nd row need to set counter to 1
// for counter from 1 to the number of rows in input data frame
for(int counter = 1; counter < inputFrame.nrows(); counter++){
// pull out all the variables that will be used for calculation
currentk10 = k10[ counter ];
currentTime = TIME[ counter ] - TIME[ counter - 1];
previousA1 = A1[ counter - 1 ];
// Calculate currentA1
currentA1 = previousA1*exp(-currentTime*currentk10);
// Fill in Amounts and check for other doses
A1[ counter ] = currentA1 + AMT[ counter ];
} // end for loop
return inputFrame; // A1 was updated in place; return the frame to match the declared return type
}
Any hints on what I am doing wrong here? How can I solve this issue?
Edit:
Here is an example of running the composite function OneCompIVbolusRCpp in R:
library(plyr)
library(Rcpp)
source("OneCompIVbolus_Rfunction.R")
#-------------
# Generate df
#-------------
#Set dose records:
dosetimes <- c(0,12)
#set number of subjects
ID <- 1:2
#Make dataframe
df <- expand.grid("ID"=ID,"TIME"=sort(unique(c(seq(0,24,1),dosetimes))),"AMT"=0,"MDV"=0,"CL"=2,"V"=10)
doserows <- subset(df, TIME%in%dosetimes)
#Dose = 100 mg, Dose 1 at time 0
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
#Dose 2 at 12
doserows$AMT[doserows$TIME==dosetimes[2]] <- 50
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$ID,df$TIME,df$AMT),] # arrange df by TIME (ascending) and by AMT (descending)
df <- subset(df, (TIME==0 & AMT==0)==F) # remove the row that has a TIME=0 and AMT=0
df$k10 <- df$CL/df$V
#-------------
# Apply the function
#-------------
simdf <- ddply(df, .(ID), OneCompIVbolusRCpp)

You may simply have the wrong ordering. Instead of
// [[Rcpp::export]]
// input Dataframe from R
DataFrame OneCompIVbolusCpp(DataFrame inputFrame){
// ...
do
// input Dataframe from R
// [[Rcpp::export]]
DataFrame OneCompIVbolusCpp(DataFrame inputFrame){
// ...
as the [[Rcpp::export]] tag must come directly before the function it exports.
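Separately from the packaging problem, the loop in the question implements the recurrence A1[i] = A1[i-1] * exp(-k10[i] * (TIME[i] - TIME[i-1])) + AMT[i]. A minimal standalone sketch of that arithmetic, using std::vector instead of Rcpp vectors, can be handy for checking the numbers outside of R. The function name oneCompIVbolus is hypothetical, and this is not a replacement for the exported Rcpp function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fills a1[1..] in place from a1[0] using
//   a1[i] = a1[i-1] * exp(-k10[i] * (time[i] - time[i-1])) + amt[i]
// a1[0] must already hold the initial amount (the dose at TIME == 0).
void oneCompIVbolus(const std::vector<double>& time,
                    const std::vector<double>& amt,
                    const std::vector<double>& k10,
                    std::vector<double>& a1) {
    // All four columns must have one entry per row
    assert(time.size() == amt.size() && amt.size() == k10.size() &&
           k10.size() == a1.size());
    for (std::size_t i = 1; i < a1.size(); ++i) {
        const double dt = time[i] - time[i - 1];
        // Decay the previous amount over dt, then add any dose given at this row
        a1[i] = a1[i - 1] * std::exp(-dt * k10[i]) + amt[i];
    }
}
```

For example, with time = {0, 1, 2}, amt = {100, 0, 50}, a constant k10 of 0.2 and a1[0] = 100, the second entry becomes 100 * exp(-0.2) and the third becomes 100 * exp(-0.4) + 50.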

Computing all values or stopping and returning just the best value if found

I have a list of items and for each item I am computing a value. Computing this value is a bit computationally intensive so I want to minimise it as much as possible.
The algorithm I need to implement is this:
1. I have a value X.
2. For each item:
   a. compute the value for it; if it is < 0, ignore it completely
   b. if (value > 0) && (value < X), return the pair (item, value)
3. Return all (item, value) pairs in a List (those that have value > 0), ideally sorted by value.
To make it a bit clearer, step 3 only happens if none of the items have a value less than X. In step 2, when we encounter the first item that is less than X we should not compute the rest and just return that item (we can obviously return it in a Set() by itself to match the return type).
The code I have at the moment is as follows:
val itemValMap = items.foldLeft(Map[Item, Int]()) {
(map : Map[Item, Int], key : Item) =>
val value = computeValue(key)
if ( value >= 0 ) //we filter out negative ones
map + (key -> value)
else
map
}
val bestItem = itemValMap.minBy(_._2)
if (bestItem._2 < bestX)
{
List(bestItem)
}
else
{
itemValMap.toList.sortBy(_._2)
}
However, what this code is doing is computing all the values in the list and choosing the best one, rather than stopping as a 'better' one is found. I suspect I have to use Streams in some way to achieve this?

OK, I'm not sure what your whole setup looks like, but I tried to prepare a minimal example that would mirror your situation.
Here it is then:
object StreamTest {
case class Item(value : Int)
def createItems() = List(Item(0),Item(3),Item(30),Item(8),Item(8),Item(4),Item(54),Item(-1),Item(23),Item(131))
def computeValue(i : Item) = { Thread.sleep(3000); i.value * 2 - 2 }
def process(minValue : Int)(items : Seq[Item]) = {
val stream = Stream(items: _*).map(item => item -> computeValue(item)).filter(tuple => tuple._2 >= 0)
stream.find(tuple => tuple._2 < minValue).map(List(_)).getOrElse(stream.sortBy(_._2).toList)
}
}
Each calculation takes 3 seconds. Now let's see how it works:
val items = StreamTest.createItems()
val result = StreamTest.process(2)(items)
result.foreach(r => println("Original: " + r._1 + " , calculated: " + r._2))
Gives:
[info] Running Main
Original: Item(3) , calculated: 4
Original: Item(4) , calculated: 6
Original: Item(8) , calculated: 14
Original: Item(8) , calculated: 14
Original: Item(23) , calculated: 44
Original: Item(30) , calculated: 58
Original: Item(54) , calculated: 106
Original: Item(131) , calculated: 260
[success] Total time: 31 s, completed 2013-11-21 15:57:54
Since there's no value smaller than 2, we got a list ordered by the calculated value. Notice that two pairs are missing, because calculated values are smaller than 0 and got filtered out.
OK, now let's try with a different minimum cut-off point:
val result = StreamTest.process(5)(items)
Which gives:
[info] Running Main
Original: Item(3) , calculated: 4
[success] Total time: 7 s, completed 2013-11-21 15:55:20
Good, it returned a list with only one item, the first value (second item in the original list) that was smaller than 'minimal' value and was not smaller than 0.
I hope that the example above is easily adaptable to your needs...

A simple way to avoid the computation of unneeded values is to make your collection lazy by using the view method:
val weigthedItems = items.view.map{ i => i -> computeValue(i) }.filter(_._2 >= 0 )
weigthedItems.find(_._2 < X).map(List(_)).getOrElse(weigthedItems.sortBy(_._2))
For example, here is a test in the REPL:
scala> :paste
// Entering paste mode (ctrl-D to finish)
type Item = String
def computeValue( item: Item ): Int = {
println("Computing " + item)
item.toInt
}
val items = List[Item]("13", "1", "5", "-7", "12", "3", "-1", "15")
val X = 10
val weigthedItems = items.view.map{ i => i -> computeValue(i) }.filter(_._2 >= 0 )
weigthedItems.find(_._2 < X).map(List(_)).getOrElse(weigthedItems.sortBy(_._2))
// Exiting paste mode, now interpreting.
Computing 13
Computing 1
defined type alias Item
computeValue: (item: Item)Int
items: List[String] = List(13, 1, 5, -7, 12, 3, -1, 15)
X: Int = 10
weigthedItems: scala.collection.SeqView[(String, Int),Seq[_]] = SeqViewM(...)
res27: Seq[(String, Int)] = List((1,1))
As you can see computeValue was only called up to the first value < X (that is, up to 1)
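The same early-exit behaviour does not strictly need a lazy collection. As a rough sketch in C++ (the language of the other questions on this page), a plain loop can stop at the first acceptable value and only sort when nothing qualified. The function name firstGoodOrAllSorted and the use of int for both items and values are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Returns a single (item, value) pair as soon as 0 <= value < x is seen;
// otherwise returns every non-negative pair, sorted by value.
std::vector<std::pair<int, int>> firstGoodOrAllSorted(
    const std::vector<int>& items, int x,
    const std::function<int(int)>& computeValue) {
    std::vector<std::pair<int, int>> kept;
    for (int item : items) {
        const int value = computeValue(item);  // the expensive call
        if (value < 0) continue;               // negative values are dropped
        if (value < x) {
            // Early exit: the remaining items are never computed
            return {std::make_pair(item, value)};
        }
        kept.emplace_back(item, value);
    }
    // No value below x was found: sort what was kept by value
    std::stable_sort(kept.begin(), kept.end(),
                     [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                         return a.second < b.second;
                     });
    return kept;
}
```

With the items 13, 1, 5, -7, 12, 3, -1, 15, an identity computeValue and x = 10, this calls computeValue twice and returns only (1, 1), matching the REPL session above.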

Stata- Is there a way to store data like Python's dictionary or a hash map?

Is there a way to store information in Stata similar to a dictionary in Python or a hash map in other languages?
I am iterating through variable lists that are appended with _1, _2, _3, _4, _5, _6, _7 ... _18 to delineate sections, and I want to sum the number of times the letters "DK" appear in each variable in each section. Right now I have 18 for loops, with each loop iterating through a different section, saving the 'sum' of the total number of DK's in a new variable called DK_1sum, DK_2sum, and then I later produce graphs of that data.
I'm wondering if there is a way to turn all this into a large For loop, and just append the data to a dictionary/array such that the data looks like:
{s1Sum, 25
s2Sum, 56 ...
s18Sum, 101}
Is this possible?

This could be stored in a Stata matrix, a Mata matrix or just ordinary Stata variables.
gen count = .
gen which = _n
qui forval j = 1/18 {
scalar found = 0
foreach v of var *_`j' {
count if strpos(`v', "DK")
scalar found = scalar(found) + r(N)
}
replace count = scalar(found) in `j'
}
list which count in 1/18
For variation, here is a Stata matrix approach.
matrix count = J(18,1,.)
qui forval j = 1/18 {
scalar found = 0
foreach v of var *_`j' {
count if strpos(`v', "DK")
scalar found = scalar(found) + r(N)
}
matrix count[`j', 1] = scalar(found)
}
matrix list count

If you are concerned about efficiency you could consider the associative array capabilities of Mata.
* associate Y with X
local yvalue "Y"
mata : H = asarray_create()
mata : asarray(H, "X", st_local("yvalue"))
* available in Mata
mata : asarray(H, "X")
* available in Stata
mata : st_local("xvalue", asarray(H, "X"))
di "`xvalue'"
