I'm trying to get up to speed on using C++ to quickly build some sparse matrices for use in R. However, I cannot seem to use the insert method to change single elements of a sparse matrix in Eigen and get a correct R object of class dgCMatrix. A simple example is below.
The C++ code is:
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
using Eigen::SparseMatrix; // sparse matrix
// [[Rcpp::export]]
SparseMatrix<double> SimpleSparseMatrix(int n) {
    SparseMatrix<double> new_mat(n, n);
    new_mat.insert(0, 0) = 2;
    Rcpp::Rcout << new_mat << std::endl;
    return new_mat;
}
And the resulting R output is:
> SimpleSparseMatrix(2)
2 0
0 0
2 x 2 sparse Matrix of class "dgCMatrix"
Error in validObject(x) :
invalid class “dgCMatrix” object: last element of slot p must match length of slots i and x
As you can see from stdout, Eigen is doing the right thing. However, the resulting sparse matrix object is malformed. Indeed, looking at its slots shows invalid values for p:
> foo <- SimpleSparseMatrix(2)
2 0
0 0
> str(foo)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int 0
..@ p : int [1:3] 0 2 4
..@ Dim : int [1:2] 2 2
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..@ x : num 2
..@ factors : list()
Any ideas what might be going wrong?
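As far as I know, insert() leaves the matrix in Eigen's uncompressed mode, whereas dgCMatrix expects the compressed column form. After the insert statement, add this statement: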
new_mat.makeCompressed();
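The validObject() error reflects an invariant of the compressed-column layout: the last entry of p must equal the number of stored entries in i and x. Below is a minimal standalone sketch of that check in plain C++ (no Eigen or Rcpp involved; the helper name isValidCsc is hypothetical and this is not how the Matrix package implements it).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: checks the basic invariants of a compressed sparse
// column (CSC) layout with column pointers p, row indices i and values x.
bool isValidCsc(const std::vector<int>& p,
                const std::vector<int>& i,
                const std::vector<double>& x,
                int ncol) {
    if (ncol < 0) return false;
    // One pointer per column plus a final end marker.
    if (p.size() != static_cast<std::size_t>(ncol) + 1) return false;
    if (p.front() != 0) return false;
    // Every stored value needs exactly one row index.
    if (i.size() != x.size()) return false;
    // Column pointers must never decrease.
    for (std::size_t k = 1; k < p.size(); ++k) {
        if (p[k] < p[k - 1]) return false;
    }
    // The invariant from the error message: last p equals length of i and x.
    return static_cast<std::size_t>(p.back()) == i.size();
}
```

With the slots from the question (p = 0 2 4, one entry in i and x) the check fails, while p = 0 1 1, the compressed layout of a 2 x 2 matrix with a single entry at (0, 0), passes.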
I'm building a process that instantiates a NumericMatrix and fills it with Sørensen-Dice similarity coefficients, i.e. a similarity matrix. The matrix itself is of variable dimensions, depending on the number of elements being processed. Generally, more than 100 individual elements are compared at any time (so the matrix dimensions will typically be 100+ by 100+). What I've built so far creates the matrix, calculates the coefficients, and fills the matrix with the calculated values. However, when I run the function repeatedly, I notice that values within the matrix change between runs, which is not expected behavior, since the data being compared is not changing or being re-sorted between runs. I also get similarities greater than 1, which should definitely not be happening. I have four functions: one to find the numerator of the coefficient, one to find the denominator, one that uses the numerator and denominator functions to calculate the coefficient, and a fourth to put the coefficients in the matrix.
Here's the C++ code:
// function to calculate the denominator of the dice coefficient
int diceDenomcpp(NumericVector val1, NumericVector val2){
int val1Len = na_omit(val1).size();
int val2Len = na_omit(val2).size();
int bands = 0;
bands = val1Len + val2Len;
// return the computed total data points within both arrays
return bands;
}
//######################################################################
//######################################################################
//######################################################################
// function to calculate the numerator for the dice coefficient
int diceNumcpp(NumericVector iso1, NumericVector iso2){
// declare and initialize vectors with the element band data
// remove any NA values within each vector
NumericVector is1 = na_omit(iso1);
NumericVector is2 = na_omit(iso2);
// declare and initialize some counter variables
int n = 0;
int m = 0;
int match = 0;
// loop through the first element's first datum and check for matching datum
// with the second element then continue to loop through each datum within each element
while (n<=is1.size()){
    if (m>=is2.size()){
        n++;
        m=0;
    }
    // if a suitable match is found, increment the match variable
    if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
        match++;
    }
    m++;
}
return match;
}
//########################################################################
//########################################################################
//########################################################################
// function to put the coefficient together
double diceCoefcpp(NumericVector val1, NumericVector val2){
NumericVector is1 = clone(val1);
NumericVector is2 = clone(val2);
double dVal;
double num = 2*diceNumcpp(is1, is2);
double denom = diceDenomcpp(is1, is2);
dVal = num/denom;
return dVal;
}
//#######################################################################
//#######################################################################
//#######################################################################
// function to build the similarity matrix with the coefficients
NumericMatrix simMatGencpp(NumericMatrix df){
// clone the input data frame
NumericMatrix rapdDat = clone(df);
// create a data frame for the output
NumericMatrix simMat(rapdDat.nrow(),rapdDat.nrow());
std::fill(simMat.begin(), simMat.end(), NumericVector::get_na());
// declare and initialize the iterator
int i = 0;
// declare and initialize the column counter
int col = 0;
// declare and initialize the isolate counter
int iso = 0;
//simMat(_,0)=rapdDat(_,0);
while (iso < rapdDat.nrow()){
    if (iso+i > rapdDat.nrow()){
        col++;
        i=0;
        iso++;
    }
    if (iso+i < rapdDat.nrow()){
        simMat(iso+i, col) = diceCoefcpp(rapdDat(iso,_), rapdDat(iso+i,_));
    }
    i++;
}
//Rcout << "SimMatrix:" << simMat << "\n";
return simMat;
}
Here's a sample of what the input data should look like . . .
sampleData
band1 band2 band3 band4 band5 band6
1 593.05 578.04 439.01 NA NA NA
2 589.07 567.03 NA NA NA NA
3 591.04 575.10 438.12 NA NA NA
4 591.04 NA NA NA NA NA
5 588.08 573.18 NA NA NA NA
6 591.04 576.09 552.10 NA NA NA
7 1805.00 949.00 639.19 589.07 576.09 440.06
8 952.00 588.08 574.14 550.04 NA NA
9 1718.00 576.09 425.01 NA NA NA
10 1708.00 577.05 425.01 NA NA NA
With a small enough data set, the simMatGencpp() function produces the same results each time; however, when the data set gets larger, the values start to change from run to run.
I've tried running the diceNumcpp(), diceDenomcpp(), and diceCoefcpp() functions independently on individual elements, and was getting the expected output consistently each time. Once I use simMatGencpp() however then the output gets screwy again. So I tried to loop each individual function like below.
Example:
for(i in 1:100){
print(diceNumcpp(sampleData[7,], sampleData[3,]))
}
The expected output from above is 3, but sometimes it's 4. Which iteration returns 4 varies from run to run: sometimes the second, sometimes the 14th, sometimes none at all, and sometimes three in a row.
My first thought was that, since C++ has no garbage collection as such, the previous function call might be leaving the old vector in memory, because the name of the output object doesn't change from run to run. But then this post says that when the function exits, any object created within the scope of the function call is destroyed as well.
When I code the same solution in R-code only, the runtime sucks, but it will consistently return a matrix or the example vector with the same values each time.
I'm at a loss. Any help or light anyone could shed on this subject would be greatly received!
Thanks for your help.
Update 2020-08-19
I'm hoping this will give the more well-versed C++ people out there some additional ideas about what may be happening. I have some sample data, similar to what is shown above, that is 187 rows long, meaning that the filled part of the similarity matrix (the lower triangle including the diagonal) has 187 * 188 / 2 = 17578 elements. I've been running comparisons between the R version of this solution and the C++ version, using code like this, with the sample data:
# create the similarity matrix with the R-solution to compare iteratively
# with another R-solution similarity matrix
simMat1 <- simMatGen(isoMat)
resultsR <- c()
for(i in 1:100){
simMat2 <- simMatGen(isoMat)
# check for any mis-matched elements in each matrix
resultsR[[i]]<-length(which(simMat1 == simMat2)==TRUE)
#######################################################################
# everytime this runs I get the expected number of true values 17578
# and check this by subtracting the mean(resultsR) from the expected
# number of true values of 17578
}
mean(resultsR)
Now when I do this same process with the C++ version, things change drastically and quickly. I tried this with both 64- and 32-bit R-3.6.0, just because.
simMat1 <- simMatGen(isoMat)
isoMat <- as.matrix(isoMat)
resultscpp <- c()
for(i in 1:10000){
simMat2 <- simMatGencpp(isoMat)
resultscpp[[i]]<-length(which(simMat1 == simMat2)==TRUE)
############ 64 bit R ##############
# first iteration length(which(simMat1 == simMat2)==TRUE)-17578 equals 2
# second iteration 740 elements differ: length(which(simMat1 == simMat2)==TRUE)-17578 equals 740
# third iteration 1142 elements differ
# after 100 iterations the average difference is 2487.7 elements
# after 10000 iterations the average difference is 2625.91 elements
############ 32 bit R ##############
# first iteration difference = 1
# second iteration difference = 694
# 100 iterations difference = 2520.94
# 10000 iterations difference = 2665.04
}
mean(resultscpp)
Here's sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 rstudioapi_0.10 magrittr_1.5 usethis_1.5.0 devtools_2.1.0 pkgload_1.0.2 R6_2.4.0 rlang_0.4.4
[9] tools_3.6.0 pkgbuild_1.0.3 sessioninfo_1.1.1 cli_1.1.0 withr_2.1.2 remotes_2.1.0 assertthat_0.2.1 digest_0.6.20
[17] rprojroot_1.3-2 crayon_1.3.4 processx_3.3.1 callr_3.2.0 fs_1.3.1 ps_1.3.0 testthat_2.3.1 memoise_1.1.0
[25] glue_1.3.1 compiler_3.6.0 desc_1.2.0 backports_1.1.5 prettyunits_1.0.2
Made a rookie C++ mistake here.
In diceNumcpp() I didn't put any checks in place to keep me from accidentally referencing an out-of-bounds element in the array.
// if a suitable match is found, increment the match variable
if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
match++;
}
was changed to:
// if a suitable match is found, increment the match variable
if(n<=(is1.size()-1) && (m<=is2.size()-1)){ // <- here need to make sure it stays inbounds
    if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
        match++;
    }
}
and after running it 1000 times I was able to get correct results every time.
Learn something new every day.
Cheers.
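For comparison, here is a minimal sketch of the same matching rule written with plain nested for loops, so that the indices cannot leave the vectors in the first place. It uses std::vector instead of Rcpp types and assumes NA values have already been removed and all values are positive; the names countMatches and diceCoefficient are hypothetical, and returning 0 for two empty inputs is my own assumption, not something from the original code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Counts pairs (is1[n], is2[m]) that agree within 1% relative to both values,
// which is the criterion used in the question's diceNumcpp().
int countMatches(const std::vector<double>& is1, const std::vector<double>& is2) {
    int match = 0;
    for (std::size_t n = 0; n < is1.size(); ++n) {
        for (std::size_t m = 0; m < is2.size(); ++m) {
            const double diff = std::fabs(is1[n] - is2[m]);
            if (diff / is1[n] < 0.01 && diff / is2[m] < 0.01) {
                ++match;
            }
        }
    }
    return match;
}

// Dice coefficient: 2 * matches / (total number of values in both vectors).
double diceCoefficient(const std::vector<double>& is1, const std::vector<double>& is2) {
    const std::size_t total = is1.size() + is2.size();
    if (total == 0) return 0.0;  // assumption: define the empty case as 0
    return 2.0 * countMatches(is1, is2) / static_cast<double>(total);
}
```

Called on rows 7 and 3 of the sample data (with the NA entries dropped), countMatches returns 3, the value the question expects.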
I want to program a function in C++ which does the same thing as the function histcounts in Matlab, but I don't get the right edges.
[V, edges]=histcounts(Vector,10);
I found this post : implementing matlab hist() in c++ and I created a function which I thought should work. How do I get the same edges in my function as in histcounts?
My Code :
int n =10;
double histogrammdistance = (ceil(*max_element(std::begin(Vector), std::end(Vector)))-floor(*min_element(std::begin(Vector), std::end(Vector))))/n;
vector<double> edges;
double minY = *min_element(std::begin(Vector), std::end(Vector));
for (int i = 0; i <= n; i++)
{
    edges.push_back(floor(minY) + histogrammdistance * i);
}
So my problem right now is:
I have this Vector = [-37.0329218106996 -26.9722222222222 -34.0823045267490 -33.0987654320988 -39 -35.0658436213992 -30.8061224489796 -36.0493827160494 -38.0164609053498 -12]
Matlab creates these edges:
edges = [-40 -37.2000000000000 -34.4000000000000 -31.6000000000000 -28.8000000000000 -26 -23.2000000000000 -20.4000000000000 -17.6000000000000 -14.8000000000000 -12.0000000000000]
But my program doesn't use -40 to calculate the bin width, because the minimum number is -39 and not -40. Also, if I change -39 to -38.5, Matlab still picks -40.
Question 1: Why is Matlab taking -40, and does somebody have an idea how this could be implemented?
I have now created a very simple Vector, [1 2 3 4 5]. If I take 15 as the bin number, I get [1 1.27 1.54 1.81 ...] as the solution, but if I take 13 as the bin number, it doesn't start with the minimum 1; it starts with [0.90 1.2 1.54 1.8 2.18 ...].
Question 2: Does somebody know why it took 0.90?
How do we interpret the cost matrix in WEKA? If I have 2 classes to predict (class 0 and class 1) and want to penalize classification of class 0 as class 1 more (say double the penalty), what exactly is the matrix format?
Is it :
0 10
20 0
or is it
0 20
10 0
The source of confusion are the following two references:
1) The JavaDoc for Weka CostMatrix says:
The element at position i,j in the matrix is the penalty for classifying an instance of class j as class i.
2) However, the answer in this post seems to indicate otherwise.
http://weka.8497.n7.nabble.com/cost-matrix-td5821.html
Given the first cost matrix, the post says "Misclassifying an instance of class 0 incurs a cost of 10. Misclassifying an instance of class 1 is twice as costly."
Thanks.
I know my answer is coming very late, but it might help somebody so here it is:
To boost the cost of classifying an item of class 0 as class 1, the correct format is the second one.
The evidence:
Cost Matrix I used:
0 1.0
1000.0 0
Confusion matrix (from cross-validation):
a b <-- classified as
565 20 | a = ignored
54 204 | b = not_ignored
Cross-validation output:
...
Total Cost 54020
...
That's a cost of 54 * 1000 + 20 * 1 = 54020, which matches the confusion matrix above.
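The arithmetic behind that check can be sketched as follows, assuming (as the evidence above suggests) that rows index the actual class and columns the predicted class. This is only an illustration in C++; the names Matrix and totalCost are hypothetical and have nothing to do with WEKA's own API.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Sums cost[actual][predicted] * count[actual][predicted] over all cells.
// Both matrices are assumed to have the same shape.
double totalCost(const Matrix& cost, const Matrix& confusion) {
    double total = 0.0;
    for (std::size_t actual = 0; actual < cost.size(); ++actual) {
        for (std::size_t predicted = 0; predicted < cost[actual].size(); ++predicted) {
            total += cost[actual][predicted] * confusion[actual][predicted];
        }
    }
    return total;
}
```

With cost = {{0, 1}, {1000, 0}} and confusion = {{565, 20}, {54, 204}}, this gives 54020, the reported Total Cost. Transposing the cost matrix would give 20054 instead, so the output is only consistent with the row = actual, column = predicted reading.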
I have the following recursive function:
typedef unsigned long long ull;
ull calc(ull b, ull e)
{
    if (!b) return e;
    if (!e) return b;
    return calc(b - 1, e - 1) + calc(b - 1, e) - calc(b, e - 1);
}
I want to implement it with dynamic programming (i.e. using storage). I tried a map<pair<ull, ull>, ull>, but it is still too slow. I also couldn't work out how to implement it with arrays, which would give O(1) lookups.
I want to find a solution so that this function runs quickly for large values of b and e.
Make a table b/e and fill it cell by cell. This is DP with space and time complexity O(MaxB*MaxE).
Space complexity may be reduced with Ante's proposal in comment - store only two needed rows or columns.
0 1 2 3 4 5
1 0 3 . . .
2 . . . . .
3 . . . . .
4 . . . . .
If a bottom-up representation is what you want, then this would do fine.
Fill up the table as MBo has shown.
This can be done as:
for e from 0 to n:
    DP[0][e] = e
for b from 0 to n:
    DP[b][0] = b
for i from 1 to n:
    for j from 1 to n:
        DP[i][j] = DP[i-1][j-1] + DP[i-1][j] - DP[i][j-1]
Now your answer for any b, e is simply DP[b][e].
You might want to take a look at this recent blog posting on general purpose automatic memoization. The author discusses various data structures, such as std::map, std::unordered_map etc. Warning: uses template-heavy code.
You can implement in O(n^2) (assuming n as max number of values for b and e ) by using a 2 dimensional array. Each current value for i,j would depend on the value at i-1,j and i-1,j-1 and i,j-1. Make sure you handle cases for i=0, j=0.
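A minimal C++ sketch of the bottom-up table described in the answers above, keeping only two rows at a time as suggested. The name calcTable is hypothetical; calcRecursive is the question's function under a different name, included only for cross-checking. Since the values are unsigned, subtraction wraps modulo 2^64 in both versions, so they agree even where the mathematical result would be negative.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using ull = unsigned long long;

// The recursion from the question, kept as a reference for small inputs.
ull calcRecursive(ull b, ull e) {
    if (!b) return e;
    if (!e) return b;
    return calcRecursive(b - 1, e - 1) + calcRecursive(b - 1, e) - calcRecursive(b, e - 1);
}

// Bottom-up version: O(b * e) time, O(e) memory (two rows of the table).
ull calcTable(ull b, ull e) {
    if (b == 0) return e;
    if (e == 0) return b;
    const std::size_t width = static_cast<std::size_t>(e) + 1;
    std::vector<ull> prev(width), cur(width);
    for (std::size_t j = 0; j < width; ++j) prev[j] = j;  // row b = 0: DP[0][e] = e
    for (ull i = 1; i <= b; ++i) {
        cur[0] = i;                                       // column e = 0: DP[b][0] = b
        for (std::size_t j = 1; j < width; ++j) {
            cur[j] = prev[j - 1] + prev[j] - cur[j - 1];
        }
        std::swap(prev, cur);  // the row just computed becomes the previous row
    }
    return prev[static_cast<std::size_t>(e)];
}
```

For example, calcTable(1, 2) returns 3 and calcTable(1, 1) returns 0, matching the partially filled table shown earlier.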
I got the code below from a C++ book, and I cannot figure out how the initialization works.
From what I can see, there is an outer for loop cycling through the rows, and the inner loop
cycling through the columns. But it is the assignment of the values into the array that I do not understand.
#include <iostream>
using namespace std;
int main()
{
    int t,i, nums[3][4];
    for(t=0; t < 3; ++t) {
        for(i=0; i < 4; ++i) {
            nums[t][i] = (t*4)+i+1; //I don't understand this part/line
            cout << nums[t][i] << ' ';
        }
        cout << '\n';
    }
    return 0;
}
So here are some questions:
1. I cannot understand the initialization of the 2D int array nums[3][4]. What separates the (t*4)+i+1, so that the compiler knows what to assign where?
2. How do I know what values will be stored in the rows and columns, based on what values have been assigned?
3. Why is there an asterisk?
4. What are the parentheses around t*4 for?
I understand that initializing a two-dimensional array looks like the following example.
#include <iostream>
using namespace std;
int main() {
char str[3][20] = {{"white" "rabbit"}, {"force"}, {"toad"}}; //initialize 2D character array
cout << str[0][0] << endl; //first letter of white
cout << str[0][5] << endl; //first letter of rabbit
cout << str[1][0] << endl; //first letter of force
cout << str[2][0] << endl; //first letter of toad
return 0;
}
And from what I know, it looks like this in memory.
    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
0   w  h  i  t  e  r  a  b  b  i  t  0
1   f  o  r  c  e  0
2   t  o  a  d  0
Thank you.
(t*4)+i+1
Is an arithmetic expression. t and i are ints, the * means multiply. So for row 1, column 2, t = 1, i = 2, and nums[1][2] = 1x4+2+1 = 7.
Oh, forgot a couple things. First, the () is to specify the order of operations. So the t*4 is done first. Note that in this case the () is unnecessary, since the multiply operator takes precedence over the plus operator anyway.
Also, I couldn't tell from your question if you knew this already or not, but nums[t][i] is array notation for accessing the element of nums at row t and column i.
For the first part, isn't it just assigning a value equal to the row number * 4 plus the column number plus 1? I.e. the end result of the assignment should be:
1 2 3 4
5 6 7 8
9 10 11 12
So the expression (t*4)+i+1 means "4 multiplied by the row number plus the column number plus 1". Note that the row number and column numbers in this case start from 0.
1. nums[t][i] is the one spot in the array that is assigned the value of (t*4)+i+1. So if t = 1 and i = 1, then the spot nums[1][1] will equal (1*4)+1+1, which is 6.
2. See above.
3. The asterisk is for multiplying.
4. You do what's in the ( ) first, just like in any mathematical equation.
Let's see, you have
int t,i, nums[3][4];
where we reserve space for the 2D array. The elements of the array are uninitialized at this point (their values are indeterminate), since you only reserved space.
The line :
nums[t][i] = (t*4)+i+1; //I don't understand this part/line
Assigns the values to the array. You have t and i, which are loop counters, and the expression (t*4)+i+1 means: take the value of t, multiply it by 4, then add i and add 1.
So for t=0, i=0, you get that nums[0][0] has the value (0*4) + 0 + 1, which is 1, and so on for everything else.
And of course the () are just basic math parentheses.
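To check the values discussed in the answers, here is a small sketch that fills the same 3 x 4 grid with the book's expression and returns it. It uses std::array instead of the raw array so the grid can be returned from a function; the names Grid and fillNums are hypothetical.

```cpp
#include <array>

using Grid = std::array<std::array<int, 4>, 3>;

// Fills a 3 x 4 grid so that element (t, i) holds (t*4)+i+1,
// i.e. the numbers 1..12 in row-major order.
Grid fillNums() {
    Grid nums{};  // zero-initialized, unlike the raw array in the book's code
    for (int t = 0; t < 3; ++t) {
        for (int i = 0; i < 4; ++i) {
            nums[t][i] = (t * 4) + i + 1;
        }
    }
    return nums;
}
```

After `Grid g = fillNums();`, g[0][0] is 1, g[1][1] is 6, g[1][2] is 7 and g[2][3] is 12, which agrees with the worked examples above.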