Suppose I have a C++ program that has a vector of objects that I want to write out to an Rdata data.frame file, one observation per element of the vector. How can I do that? Here is an example. Suppose I have
vector<Student> myStudents;
And Student is a class which has two data members: name, which is of type std::string, and grade, which is of type int.
Is my only option to write a csv file?
Note that Rdata is a binary format so I guess I would need to use a library.
A search for Rdata [r] [C++] came up empty.
I think nobody has bothered to extract a binary file writer from the R sources to be used independently of R.
Almost twenty years ago I did the same for Octave files, as their format is simply two integers for 'n' and 'k' followed by the 'n * k' data values -- so you could read / write with two function calls each.
I fear that for R you would have to pull in too many of R's headers -- so the easiest (?) route may be to give the data to R, maybe via Rserve (a 'loose' connection over TCP/IP) or RInside (a tighter connection via embedding), and have R write it.
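For the RInside route suggested above, a rough sketch (my own, with made-up names such as students.RData) of handing the vector<Student> from the question to an embedded R session and letting R's save() write the file could look like this:

#include <RInside.h>   // embedded R; also pulls in Rcpp
#include <string>
#include <vector>

struct Student { std::string name; int grade; };

int main(int argc, char *argv[]) {
    std::vector<Student> myStudents = { {"Ada", 1}, {"Bob", 2} };

    // split the vector of structs into parallel columns
    std::vector<std::string> names;
    std::vector<int> grades;
    for (const Student &s : myStudents) {
        names.push_back(s.name);
        grades.push_back(s.grade);
    }

    RInside R(argc, argv);                      // start the embedded R session
    R["students"] = Rcpp::DataFrame::create(    // build a data.frame, one row per Student
        Rcpp::Named("name")  = names,
        Rcpp::Named("grade") = grades);
    R.parseEvalQ("save(students, file = 'students.RData')");  // let R write the binary file
    return 0;
}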
Edit: In the years since the original answer was written, one such library has been created: librdata.
Here is an example of a function that saves a list in an RData file. This example is based on the previous answer:
void save_List_RData(const List &list_Data, const CharacterVector &file_Name)
{
    Environment base("package:base");
    Environment env = new_env();
    env["list_Data"] = list_Data;
    Function save = base["save"];
    CharacterVector all(1);
    all[0] = "list_Data";
    save(Named("list", all), Named("envir", env), Named("file", file_Name));
    Rcout << "File " << file_Name << " has been saved!\n";
}
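For context, a hedged usage sketch (the wrapper name and file name are mine) showing how the helper above might be called from Rcpp code:

#include <Rcpp.h>
using namespace Rcpp;

// declaration of the helper shown above
void save_List_RData(const List &list_Data, const CharacterVector &file_Name);

// [[Rcpp::export]]
void save_students_example() {
    // build a small list and hand it to the save helper;
    // element names and the file name are illustrative only
    List students = List::create(
        Named("name")  = CharacterVector::create("Ada", "Bob"),
        Named("grade") = IntegerVector::create(1, 2));
    save_List_RData(students, CharacterVector::create("students.RData"));
}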
I don't know if this will fit everyone's needs (of those who are googling this question), but this way you can save individual or multiple variables:
using namespace std;
using namespace Rcpp;
using Eigen::Map;
using Eigen::MatrixXi;
using Eigen::MatrixXd;
Environment base("package:base");
Function save = base["save"];
Function saveRDS = base["saveRDS"];
MatrixXd M = MatrixXd::Identity(3,3);
NumericMatrix xx(wrap(M));
NumericMatrix xx1(wrap(M));
NumericMatrix xx2(wrap(M));
base["xx"] = xx;
base["xx1"] = xx1;
base["xx2"] = xx2;
vector<string> lst;
lst.push_back("xx");
lst.push_back("xx1");
lst.push_back("xx2");
CharacterVector all = wrap(lst);
save(Named("list", all), Named("envir", base) , Named("file","Identities.RData"));
saveRDS(xx,Named("file","Identity.RDs"));
return wrap(M);
library(inline)
library(Rcpp)
library(RcppEigen)
src <- '
#put here cpp code shown above
'
saveworkspace <- cxxfunction(signature(), src, plugin = "RcppEigen")
saveworkspace()
list.files(pattern="*.RD*")
[1] "Identity.RDs"
[2] "Identities.RData"
I'm not 100% sure if this C++ code will work in standalone library/executable.
NB: Initially I missed the comment that the solution should be independent of R, but for those who are searching for exactly the same question and are OK with a dependency on R, this could be helpful.
Let's suppose I have an S4 class A which contains a slot @S which is a data.frame. The data.frame has a column X. I want to process such an object in C++ using Rcpp. Here is a toy example of how I did that:
SEXP f(S4 A)
{
    DataFrame S = A.slot("S");
    NumericVector X = S["X"];
    // do something with X
    return X;   // return added so the toy example compiles cleanly
}
My questions are the following.
Is X still a reference to the original R data, or a deep copy? Considering how Rcpp works it should not be a copy. But how can I be sure?
This code compiles and works well, but the IDE (RStudio, not the compiler) raises a warning: conversion from 'Rcpp::SlotProxyPolicy<Rcpp::S4_Impl<PreserveStorage>>::SlotProxy' to 'DataFrame' (aka 'DataFrame_Impl<PreserveStorage>') is ambiguous. What does it mean? Is it serious?
Thanks
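Not an answer from the thread, but a minimal sketch of one way to convince yourself whether X aliases the original data: modify it from C++ and inspect the data.frame afterwards in R (the function name is mine; the explicit as<DataFrame>() also happens to silence the ambiguous-conversion warning in some IDEs):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void poke_first_element(S4 A) {
    // make the slot -> DataFrame conversion explicit
    DataFrame S = as<DataFrame>(A.slot("S"));
    NumericVector X = S["X"];
    X[0] = -999;   // if X is a reference, A@S$X[1] changes on the R side as well
}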
I am new to interfaces with databases through C++ and was wondering what is the best approach to do the following:
I have an object with member variables that I define ahead of time, and member variables that I need to pull from a database given the known variables. For example:
class DataObject
{
public:
    int input1;
    string input2;
    double output1;
    DataObject(int Input1, string Input2) :
        input1(Input1), input2(Input2)
    {
        output1 = Initializer(input1, input2);
    }
private:
    double Initializer(int, string);
    static RecordSet rs; // I am just guessing the object would be called RecordSet
};
Now, I can do something like:
std::vector<DataObject> v;
for (int n = 0; n <= 10; ++n)
    for (char w = 'a'; w <= 'z'; ++w)
        v.push_back(DataObject{n, std::string(1, w)});  // convert the loop char into a one-character string for the constructor
And get an initialized vector of DataObjects. Behind the scenes, Initializer will check if rs already has data. If not, it will connect to the database and run a query like: select input1, input2, output1 from ... where input1 between 1 and 10 and input2 between 'a' and 'z', and then start initializing each DataObject with output1 given each pair of input1 and input2.
This would be utterly simple in C#, but from code samples I have found online it looks utterly ugly in C++. I am stuck on two things. As stated earlier, I am completely new to database interfaces in C++, and there are so many methods from which to choose, but I would like to home in on a specific method that truly fits my purpose. Furthermore - and this is the purpose - I am trying to make use of a static data set to pull data in a single query, rather than run a new query for each input1/input2 combination. Even better: is there a way to have database results written directly into the newly created DataObjects, rather than making a pit stop in some temporary RecordSet object?
To summarize and clarify: I have data in a relational database, and I am trying to pull the data and store it in a collection of objects. How do I do this? Any tips/direction - I am much obliged.
EDIT 8/16/17: After some research and trials I have come up with the below
So I've had progress by using an ADORecordset with the put_CursorLocation set to adUseServer:
rs->put_CursorLocation(adUseServer);
My understanding is that by using this setting the query result is stored on the server, and the client side only gets the current row pointed to by rs.
So I get my data from the row and create the DataObject on the spot, emplace_back it into the vector, and finally call rs->MoveNext() to get the next row and repeat until I reach the end. Partial example as follows:
std::vector<DataObject> v;
DataObject::rs.Open(connString, Sql); // Connection for wrapper class
for (int n = 0; n <= 10; ++n)
    for (char w = 'a'; w <= 'z'; ++w)
        v.emplace_back(DataObject{n, std::string(1, w)});

// Somewhere else...
double DataObject::Initializer(int a, string b) {
    int ra; string rb; double rc;
    // For simplicity's sake, let's assume the result set is ordered
    // in the same way as the for-loop, and that no data is missing.
    // So the below sanity-check would be unnecessary, but included.
    while (!rs.IsEOF())
    {
        // Let's assume I defined these 'Get' functions
        ra = rs.Get<int>("Input1");
        rb = rs.Get<string>("Input2");
        rc = rs.Get<double>("Output1");
        rs.MoveNext();
        if (ra == a && rb == b) break;
    }
    return rc;
}
// Constructor for RecordSet:
RecordSet::RecordSet()
{
HRESULT hr = rs_.CoCreateInstance(CLSID_CADORecordset);
ATLENSURE_SUCCEEDED(hr);
rs_->put_CursorLocation(adUseServer);
}
Now I'm hoping that I interpreted how this works correctly; otherwise, this would be a whole lot of fuss over nothing. I am not an ADO or .NET expert - clearly - but I'm hoping someone can chime in to confirm that this is indeed how this works, and perhaps shed some more light on the topic. On my end, I tested the memory usage using VS2015's diagnostic tool, and the heap seems to be significantly larger when using adUseClient. If my conjecture is correct, then why would anyone opt to use adUseClient, or any of the other choices, over adUseServer?
I can think of two options: one column per member, or a BLOB.
For classes, I recommend one row per class instance with one column per member. Check which data types your database supports; there are some common types.
Another method is to use the BLOB (Binary Large OBject) data type. This is a "binary" data type used for storing data-as-is.
You can use the BLOB type for members that are of unsupported data types.
You can get more complicated by researching "Database Normalization" or "Database normal forms".
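As an illustration of the one-row-per-instance approach combined with the single-query idea from the question, here is a rough sketch using SQLite's C API rather than ADO (the table name, column names and schema are made up; a different driver would change the calls but not the shape of the loop):

#include <sqlite3.h>
#include <string>
#include <vector>

struct DataObject {
    int input1;
    std::string input2;
    double output1;
};

// run one query and materialize every row directly into DataObjects,
// instead of issuing a query per input1/input2 combination
std::vector<DataObject> loadObjects(sqlite3 *db) {
    std::vector<DataObject> v;
    const char *sql =
        "SELECT input1, input2, output1 FROM results "
        "WHERE input1 BETWEEN 1 AND 10 AND input2 BETWEEN 'a' AND 'z'";
    sqlite3_stmt *stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK)
        return v;                               // real code would report the error
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        DataObject d;
        d.input1 = sqlite3_column_int(stmt, 0);
        const unsigned char *txt = sqlite3_column_text(stmt, 1);
        d.input2 = txt ? reinterpret_cast<const char *>(txt) : "";
        d.output1 = sqlite3_column_double(stmt, 2);
        v.push_back(d);                         // no intermediate RecordSet needed
    }
    sqlite3_finalize(stmt);
    return v;
}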
I have two multi-class data sets with 5 labels, one for training, and the other for cross validation. These data sets are stored as .csv files, so they act as a control in this experiment.
I have a C++ wrapper for libsvm, and the MATLAB functions for libsvm.
For both C++ and MATLAB:
Using a C-type SVM with an RBF kernel, I iterate over 2 lists of C and Gamma values. For each parameter combination, I train on the training data set and then predict the cross validation data set. I store the accuracy of the prediction in a 2D map which correlates to the C and Gamma value which yielded the accuracy.
I've recreated different training and cross validation data sets many, many times. Each time, the C++ and MATLAB accuracies are different; sometimes by a lot! Mostly MATLAB produces higher accuracies, but sometimes the C++ implementation is better.
What could be accounting for these differences? The C/Gamma values I'm trying are the same, as are the remaining SVM parameters (default).
There should be no significant differences, as both the C++ and MATLAB codes use the same underlying svm.c file. So what can be the reason?
- an implementation error in your code(s); this is unfortunately the most probable one
- the wrapper you use has some bug and/or uses a different version of libsvm than your MATLAB code (libsvm is written in pure C and comes with Python, MATLAB and Java wrappers, so your C++ wrapper is "not official"), or your wrapper assumes some additional default values which are not defaults in the C/MATLAB/Python/Java implementations
- you perform cross validation in a somewhat randomized form (shuffling the data and then folding, which is completely correct and reasonable, but will lead to different results in two different runs)
- there is some rounding/conversion performed during loading data from .csv in one (or both) of your codes which leads to inconsistencies (really not likely to happen, yet still possible)
I trained an SVC using scikit-learn (sklearn.svm.SVC) within a Python Jupyter Notebook. I wanted to use the trained classifier in MATLAB v. 2022a and C++. I needed to verify that all three versions' predictions matched for each implementation of the kernel, decision, and prediction functions. I found some useful guidance from bcorso's implementation of the original libsvm C++ code.
Exporting the structure that represents the trained model is explained in bcorso's post and is required to call his prediction function implementation:
predict(params, sv, nv, a, b, cs, X)
so that it matches sklearn's prediction for a trained classifier instance, clf:
clf.predict(X)
Once I established this match, I created MATLAB versions of bcorso's kernel,
function [k] = kernel_svm(params, sv, X)
k = zeros(1,length(sv));
if strcmp(params.kernel,'linear')
for i = 1:length(sv)
k(i) = dot(sv(i,:),X);
end
elseif strcmp(params.kernel,'rbf')
for i = 1:length(sv)
k(i) = exp(-params.gamma*dot(sv(i,:)-X, sv(i,:)-X));
end
else
uiwait(msgbox('kernel not defined','Error','modal'));
end
k = k';
end
decision,
function [d] = decision_svm(params, sv, nv, a, b, X)
%% calculate the kernels
kvalue = kernel_svm(params, sv, X);
%% define the start and end index for support vectors for each class
nr_class = length(nv);
start = zeros(1,nr_class);
start(1) = 1;
%% First Class Loop
for i = 1:(nr_class-1)
start(i+1) = start(i)+ nv(i)-1;
end
%% Other Classes Nested Loops
for i = 1:nr_class
for j = i+1:nr_class
sum = 0;
si = start(i); %first class start
sj = start(j); %first class end
ci = nv(i)+1; %next class start
cj = ci+ nv(j)-1; %next class end
for k = si:sj
sum = sum + a(k) * kvalue(k);
end
sum1=sum;
sum = 0;
for k = ci:cj
sum = sum + a(k) * kvalue(k);
end
sum2=sum;
end
end
%% Add class sums and the intercept
sumd = sum1 + sum2;
d = -(sumd +b);
end
and predict functions.
function [class, classIndex] = predict_svm(params, sv, nv, a, b, cs, X)
dec_value = decision_svm(params, sv, nv, a, b, X);
if dec_value <= 0
class = cs(1);
classIndex = 1;
else
class = cs(2);
classIndex = 0;
end
end
Translating the Python comprehension syntax into a MATLAB/C++ equivalent of the summations required nested for loops in the decision function.
It is also required to account for MATLAB indexing (base 1) vs. Python/C++ indexing (base 0).
The trained classifier model is conveyed by params, sv, nv, a, b, cs, which can be gathered within a structure after having exported the sv and a matrices as .csv files from the Python notebook. I simply created a wrapper MATLAB function svcInfo that builds the structure:
svcStruct = svcInfo();
params = svcStruct.params;
sv= svcStruct.sv;
nv = svcStruct.nv;
a = svcStruct.a;
b = svcStruct.b;
cs = svcStruct.cs;
Or one can save the structure contents as a MATLAB workspace in a .mat file.
The new case for prediction is provided as a vector X,
%Classifier input feature vector
X=[x1 x2...xn];
A simplified C++ implementation that follows bcorso's Python version is fairly similar to this MATLAB implementation in that it uses the nested "for" loops within the decision function, but it uses zero-based indexing.
Once tested, I may expand this post with the C++ version on the MATLAB code shared above.
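To give a flavor of what such a translation could look like, here is a rough C++ sketch of just the kernel computation (my own illustration, not the author's promised version); it mirrors the MATLAB kernel_svm above, with zero-based indexing and the support vectors stored row-wise:

#include <cmath>
#include <string>
#include <vector>

// params.kernel and params.gamma mirror the fields used in the MATLAB struct
struct KernelParams {
    std::string kernel;   // "linear" or "rbf"
    double gamma = 1.0;
};

// one kernel value per support vector, as in the MATLAB kernel_svm()
std::vector<double> kernel_svm(const KernelParams &params,
                               const std::vector<std::vector<double> > &sv,
                               const std::vector<double> &X) {
    std::vector<double> k(sv.size(), 0.0);
    for (std::size_t i = 0; i < sv.size(); ++i) {
        if (params.kernel == "linear") {
            for (std::size_t j = 0; j < X.size(); ++j)
                k[i] += sv[i][j] * X[j];                    // dot(sv(i,:), X)
        } else if (params.kernel == "rbf") {
            double d2 = 0.0;
            for (std::size_t j = 0; j < X.size(); ++j) {
                double diff = sv[i][j] - X[j];
                d2 += diff * diff;                          // squared distance
            }
            k[i] = std::exp(-params.gamma * d2);
        }
    }
    return k;
}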
I mainly use R, but eventually would like to use Rcpp to interface with some C++ functions that take in and return 2d numeric arrays. So to start out playing around with C++ and Rcpp, I thought I'd just make a little function that converts my R list of variable-length numeric vectors to the C++ equivalent and back again.
require(inline)
require(Rcpp)
test1 = cxxfunction(signature(x='List'), body =
'
using namespace std;
List xlist(x);
int xlen = xlist.size();
vector< vector<int> > xx;
for(int i=0; i<xlen; i++) {
vector<int> test = as<vector<int> > (xlist[i]);
xx.push_back(test);
}
return(wrap(xx));
'
, plugin='Rcpp')
This works like I expect:
> test1(list(1:2, 4:6))
[[1]]
[1] 1 2
[[2]]
[1] 4 5 6
Admittedly I am only part way through the very thorough documentation, but is there a nicer (i.e. more Rcpp-like) way to do the R -> C++ conversion than with the for loop? I am thinking possibly not, since the documentation mentions that (at least with the built-in methods) as<>() "offers less flexibility and currently handles conversion of R objects into primitive types", but I wanted to check because I'm very much a novice in this area.
I will give you bonus points for a reproducible example, and of course for using Rcpp :) And then I will take those away for not asking on the rcpp-devel list...
As for converting STL types: you don't have to, but when you decide to do it, the as<>() idiom is correct. The only 'better way' I can think of is to do name lookup as you would in R itself:
require(inline)
require(Rcpp)
set.seed(42)
xl <- list(U=runif(4), N=rnorm(4), T2df=rt(4,2))
fun <- cxxfunction(signature(x="list"), plugin="Rcpp", body = '
Rcpp::List xl(x);
std::vector<double> u = Rcpp::as<std::vector<double> >(xl["U"]);
std::vector<double> n = Rcpp::as<std::vector<double> >(xl["N"]);
std::vector<double> t2 = Rcpp::as<std::vector<double> >(xl["T2df"]);
// do something clever here
return(R_NilValue);
')
Hope that helps. Otherwise, the list is always open...
PS As for the two-dim array, that is trickier as there is no native C++ two-dim array. If you actually want to do linear algebra, look at RcppArmadillo and RcppEigen.
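For completeness, a minimal sketch (assuming the RcppArmadillo package is installed and using the newer Rcpp attributes interface rather than inline) of passing an R matrix into an Armadillo matrix and back; the function name is mine:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::mat scale_matrix(const arma::mat &M, double factor) {
    // the as<arma::mat>/wrap conversions happen automatically at the boundary,
    // so the R matrix arrives as an arma::mat and the result goes back as an R matrix
    return factor * M;
}

Compiled with Rcpp::sourceCpp(), calling scale_matrix(m, 2) from R then behaves like 2 * m.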
Since last night I have been trying out Rcpp and inline, and so far I am really enjoying it. But I am kinda new to C in general and can only do basic stuff yet, and I am having a hard time finding help online on things like functions.
Something I was working on was a function that finds the minimum of a vector in the global environment. I came up with:
library("inline")
library("Rcpp")
foo <- rnorm(100)
bar <- cxxfunction( signature(),
'
Environment e = Environment::global_env();
NumericVector foo = e["foo"];
int min = 0;
for (int i = 0; i < foo.size(); i++)
{
if ( foo[i] < foo[min] ) min = i;
}
return wrap(min+1);
', plugin = "Rcpp")
bar()
But it seems like there should be an easier way to do this, and it is quite a bit slower than which.min():
system.time(replicate(100000,bar()))
user system elapsed
0.27 0.00 0.26
system.time(replicate(100000,which.min(foo)))
user system elapsed
0.2 0.0 0.2
Am I overlooking a basic C++ or Rcpp function that does this? And if so, where could I find a list of such functions?
I guess this question is related to:
Where can I learn how to write C code to speed up slow R functions?
but different in that I am not really interested in how to incorporate C++ in R, but more in how and where to learn basic C++ code that is usable in R.
Glad you are finding Rcpp useful.
The first comment by Billy is quite correct. There is overhead in the function lookup and there is overhead in the [] lookup for each element etc.
Also, a much more common approach is to take a vector you have in R, pass it to a compiled function you create via inline and Rcpp, and have it return the result. Try that. There are plenty of examples in the package and scattered over the rcpp-devel mailing list archives.
Edit: I could not resist trying to set up a very C++ / STL style answer.
R> src <- '
+ Rcpp::NumericVector x(xs);
+ Rcpp::NumericVector::iterator it = // iterator type
+ std::min_element(x.begin(), x.end()); // STL algo
+ return Rcpp::wrap(it - x.begin()); '
R> minfun <- cxxfunction(signature(xs="numeric"), body=src, plugin="Rcpp")
R> minfun(c(7:20, 3:5))
[1] 14
R>
That is not exactly the easiest answer, but it shows how, by using what C++ offers, you can find a minimum element without an (explicit) loop even at the C++ level. But the built-in which.min() function is still faster.
*Edit 2: Corrected as per Romain's comment below.
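For reference, more recent Rcpp releases also ship a sugar function which_min(), so the whole lookup can be a one-liner; a minimal sketch using the newer attributes interface (compile with Rcpp::sourceCpp()):

#include <Rcpp.h>

// [[Rcpp::export]]
int which_min_cpp(Rcpp::NumericVector x) {
    // Rcpp::which_min() returns a 0-based index; add 1 to match R's which.min()
    return Rcpp::which_min(x) + 1;
}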