Rcpp function filling matrix with different values - C++

I'm building a process that instantiates a NumericMatrix and fills it with Sørensen-Dice similarity coefficients, i.e. a similarity matrix. The matrix has variable dimensions that depend on the number of elements being processed; generally more than 100 individual elements are compared at any time, so the matrix will typically be 100+ by 100+. What I've built so far creates the matrix, calculates the coefficients, and fills the matrix with those values. However, when I run the function repeatedly, I notice that values within the matrix change between runs, which is not expected behavior, since the data being compared is not changing or being re-sorted between runs. I also get similarities greater than 1, which should definitely not be happening. I have four functions: one to find the numerator of the coefficient, one to find the denominator, one to use the numerator and denominator functions to calculate the coefficient, and a fourth to put the coefficients in the matrix.
Here's the C++ code:
#include <Rcpp.h>
using namespace Rcpp;

// function to calculate the denominator of the dice coefficient
// [[Rcpp::export]]
int diceDenomcpp(NumericVector val1, NumericVector val2){
    int val1Len = na_omit(val1).size();
    int val2Len = na_omit(val2).size();
    // return the total number of non-NA data points in both arrays
    int bands = val1Len + val2Len;
    return bands;
}
//######################################################################
//######################################################################
//######################################################################
// function to calculate the numerator for the dice coefficient
// [[Rcpp::export]]
int diceNumcpp(NumericVector iso1, NumericVector iso2){
    // declare and initialize vectors with the element band data,
    // removing any NA values within each vector
    NumericVector is1 = na_omit(iso1);
    NumericVector is2 = na_omit(iso2);
    // declare and initialize some counter variables
    int n = 0;
    int m = 0;
    int match = 0;
    // loop through the first element's data and check for a matching datum
    // in the second element, continuing through each datum of each element
    while (n<=is1.size()){
        if (m>=is2.size()){
            n++;
            m=0;
        }
        // if a suitable match is found, increment the match variable
        if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
            match++;
        }
        m++;
    }
    return match;
}
//########################################################################
//########################################################################
//########################################################################
// function to put the coefficient together
// [[Rcpp::export]]
double diceCoefcpp(NumericVector val1, NumericVector val2){
    NumericVector is1 = clone(val1);
    NumericVector is2 = clone(val2);
    double num = 2*diceNumcpp(is1, is2);
    double denom = diceDenomcpp(is1, is2);
    double dVal = num/denom;
    return dVal;
}
//#######################################################################
//#######################################################################
//#######################################################################
// function to build the similarity matrix with the coefficients
// [[Rcpp::export]]
NumericMatrix simMatGencpp(NumericMatrix df){
    // clone the input data
    NumericMatrix rapdDat = clone(df);
    // create a matrix for the output, pre-filled with NA
    NumericMatrix simMat(rapdDat.nrow(), rapdDat.nrow());
    std::fill(simMat.begin(), simMat.end(), NumericVector::get_na());
    // declare and initialize the iterator
    int i = 0;
    // declare and initialize the column counter
    int col = 0;
    // declare and initialize the isolate counter
    int iso = 0;
    //simMat(_,0)=rapdDat(_,0);
    while (iso < rapdDat.nrow()){
        if (iso+i > rapdDat.nrow()){
            col++;
            i=0;
            iso++;
        }
        if (iso+i < rapdDat.nrow()){
            simMat(iso+i, col) = diceCoefcpp(rapdDat(iso,_), rapdDat(iso+i,_));
        }
        i++;
    }
    //Rcout << "SimMatrix:" << simMat << "\n";
    return simMat;
}
Here's a sample of what the input data should look like...
sampleData
band1 band2 band3 band4 band5 band6
1 593.05 578.04 439.01 NA NA NA
2 589.07 567.03 NA NA NA NA
3 591.04 575.10 438.12 NA NA NA
4 591.04 NA NA NA NA NA
5 588.08 573.18 NA NA NA NA
6 591.04 576.09 552.10 NA NA NA
7 1805.00 949.00 639.19 589.07 576.09 440.06
8 952.00 588.08 574.14 550.04 NA NA
9 1718.00 576.09 425.01 NA NA NA
10 1708.00 577.05 425.01 NA NA NA
With a small enough data set the simMatGencpp() function will produce the same results each time; when the data set gets larger, however, the values start to change from run to run.
I've tried running the diceNumcpp(), diceDenomcpp(), and diceCoefcpp() functions independently on individual elements and was getting the expected output consistently each time. Once I use simMatGencpp(), however, the output gets screwy again. So I tried looping each individual function, like below.
Example:
for(i in 1:100){
    print(diceNumcpp(sampleData[7,], sampleData[3,]))
}
The expected output from above should be 3, but sometimes it's 4. The iteration on which 4 shows up varies from run to run: sometimes it's the second iteration, sometimes the 14th, sometimes none at all, or three times in a row.
My first thought was that, since garbage collection doesn't exactly occur in C++, perhaps the previously run function call was leaving the old vector in memory, since the name of the output object isn't changing from run to run. But then this post says that when the function exits, any object created within the scope of the function call is destroyed as well.
When I code the same solution in R only, the runtime sucks, but it consistently returns a matrix, or the example vector, with the same values each time.
I'm at a loss. Any help or light anyone could shed on this subject would be greatly received!
Thanks for your help.
Update 2020-08-19
I'm hoping that this will help provide some insight for the more well-versed C++ people out there, so that you may have some additional ideas about what may be happening. I have some sample data, similar to what is shown above, that is 187 rows long, meaning that a similarity matrix of these data has 17578 elements. I've been running comparisons between the R version of this solution and the C++ version, using code like this with the sample data:
# create the similarity matrix with the R solution to compare iteratively
# with another R-solution similarity matrix
simMat1 <- simMatGen(isoMat)
resultsR <- c()
for(i in 1:100){
    simMat2 <- simMatGen(isoMat)
    # check for any mismatched elements between the two matrices
    resultsR[[i]] <- length(which(simMat1 == simMat2) == TRUE)
    # every time this runs I get the expected number of TRUE values, 17578,
    # which I check by subtracting the expected count of 17578 from mean(resultsR)
}
mean(resultsR)
Now when I do the same process with the C++ version, things change drastically and quickly. I tried this with both 64-bit and 32-bit R 3.6.0, just because.
simMat1 <- simMatGen(isoMat)
isoMat <- as.matrix(isoMat)
resultscpp <- c()
for(i in 1:10000){
    simMat2 <- simMatGencpp(isoMat)
    resultscpp[[i]] <- length(which(simMat1 == simMat2) == TRUE)
    ############ 64 bit R ##############
    # first iteration: length(which(simMat1 == simMat2)==TRUE)-17578 equals 2
    # second iteration: 740 elements differ
    # third iteration: 1142 elements differ
    # after 100 iterations the average difference is 2487.7 elements
    # after 10000 iterations the average difference is 2625.91 elements
    ############ 32 bit R ##############
    # first iteration: difference = 1
    # second iteration: difference = 694
    # 100 iterations: difference = 2520.94
    # 10000 iterations: difference = 2665.04
}
mean(resultscpp)
Here's sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 rstudioapi_0.10 magrittr_1.5 usethis_1.5.0 devtools_2.1.0 pkgload_1.0.2 R6_2.4.0 rlang_0.4.4
[9] tools_3.6.0 pkgbuild_1.0.3 sessioninfo_1.1.1 cli_1.1.0 withr_2.1.2 remotes_2.1.0 assertthat_0.2.1 digest_0.6.20
[17] rprojroot_1.3-2 crayon_1.3.4 processx_3.3.1 callr_3.2.0 fs_1.3.1 ps_1.3.0 testthat_2.3.1 memoise_1.1.0
[25] glue_1.3.1 compiler_3.6.0 desc_1.2.0 backports_1.1.5 prettyunits_1.0.2

Made a rookie C++ mistake here.
In diceNumcpp() I didn't put any checks in place to avoid accidentally referencing an out-of-bounds element of the array.
// if a suitable match is found, increment the match variable
if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
    match++;
}
was changed to:
// if a suitable match is found, increment the match variable
if(n<=(is1.size()-1) && (m<=is2.size()-1)){ // <- here need to make sure it stays in bounds
    if((fabs(is1[n]-is2[m])/is1[n])<0.01 && (fabs(is1[n]-is2[m])/is2[m])<0.01){
        match++;
    }
}
and after running it 1000 times I was able to get correct results every time.
Learn something new every day.
Cheers.
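For reference, the root cause here was that while (n<=is1.size()) let n run one past the end of is1, so is1[n] read whatever happened to sit in memory beyond the vector, which is why the match counts drifted between runs. A rewrite with nested for loops makes the extra guard unnecessary because the loop conditions themselves enforce the bounds. The following is a sketch along those lines, assuming the same includes as the code above; the name diceNumSafecpp is illustrative, not from the original post:
// bounds-safe variant: the for-loop conditions keep n and m strictly
// inside the vectors, so no out-of-bounds guard is needed
// [[Rcpp::export]]
int diceNumSafecpp(NumericVector iso1, NumericVector iso2){
    NumericVector is1 = na_omit(iso1);
    NumericVector is2 = na_omit(iso2);
    int match = 0;
    for (int n = 0; n < is1.size(); n++){
        for (int m = 0; m < is2.size(); m++){
            // same 1% relative-difference match rule as the original
            if ((fabs(is1[n]-is2[m])/is1[n]) < 0.01 &&
                (fabs(is1[n]-is2[m])/is2[m]) < 0.01){
                match++;
            }
        }
    }
    return match;
}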

Related

Difference between R's sum() and Armadillo's accu()

There are small differences in the results of R's sum() function and RcppArmadillo's accu() function when given the same input. For example, the following code:
R:
vec <- runif(100, 0, 0.00001)
accu(vec)
sum(vec)
C++:
// [[Rcpp::depends("RcppArmadillo")]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double accu(arma::vec& obj)
{
    return arma::accu(obj);
}
Gives the results:
0.00047941851844312633 (C++)
0.00047941851844312628 (R)
According to http://keisan.casio.com/calculator the true answer is:
4.79418518443126270948E-4
These small differences add up in my algorithm and significantly affect the way it executes. Is there a way to more accurately sum up vectors in C++? Or at least to get the same results that R does without having to call R code?
Update: based on what others have found in the source, I was wrong about this - sum() does not sort. The patterns of consistency I found below stem from the fact that sorting (as done in some cases below) and using extended-precision intermediate values (as done in sum()) can have similar effects on precision ...
@user2357112 comments below:
src/main/summary.c ... doesn't do any sorting. (That'd be a lot of expense to add to a summation operation.) It's not even using pairwise or compensated summation; it just naively adds everything up left to right in an LDOUBLE (either long double or double, depending on HAVE_LONG_DOUBLE).
I have exhausted myself looking for this in the R source code (without success - sum is hard to search for), but I can show by experiment that when executing sum(), R sorts the input vector from smallest to largest in order to maximize accuracy; the difference between sum() and Reduce() results below is due to use of extended precision. I don't know what accu does ...
set.seed(101)
vec <- runif(100, 0, 0.00001)
options(digits=20)
(s1 <- sum(vec))
## [1] 0.00052502325481269514554
Using Reduce("+",...) just adds the elements in order.
(s2 <- Reduce("+",sort(vec)))
## [1] 0.00052502325481269514554
(s3 <- Reduce("+",vec))
## [1] 0.00052502325481269503712
identical(s1,s2) ## TRUE
?sum() also says
Where possible extended-precision accumulators are used, but this is platform-dependent.
Doing this in RcppArmadillo on the sorted vector gives the same answer as in R; doing it on the vector in the original order gives yet a different answer (I don't know why; my guess would be the aforementioned extended-precision accumulators, which would affect the numerical outcome more when the data are unsorted).
suppressMessages(require(inline))
code <- '
arma::vec ax = Rcpp::as<arma::vec>(x);
return Rcpp::wrap(arma::accu(ax));
'
## create the compiled function
armasum <- cxxfunction(signature(x="numeric"),
                       code, plugin="RcppArmadillo")
(s4 <- armasum(vec))
## [1] 0.00052502325481269525396
(s5 <- armasum(sort(vec)))
## [1] 0.00052502325481269514554
identical(s1,s5) ## TRUE
But as pointed out in comments this doesn't work for all seeds: in this case the Reduce() result is closer to the results of sum()
set.seed(123)
vec2 <- runif(50000,0,0.000001)
s4 <- sum(vec2); s5 <- Reduce("+",sort(vec2))
s6 <- Reduce("+",vec2); s7 <- armasum(sort(vec2))
rbind(s4,s5,s6,s7)
## [,1]
## s4 0.024869900535651481843
## s5 0.024869900535651658785
## s6 0.024869900535651523477
## s7 0.024869900535651343065
I'm stumped here. I would have expected at least s6 and s7 to be identical ...
I will point out that in general when your algorithm depends on these kinds of tiny numeric differences you're likely to be getting very frustrated, as the results are likely to differ on the basis of many small and possibly-out-of-your-control factors like particular operating system, compiler, etc. you work with.
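As an aside, the "compensated summation" mentioned in the quoted comment refers to schemes like Kahan summation, which carry the rounding error of each addition along in a separate correction term. A minimal sketch for illustration (the function name kahan_sum is mine; this is not what R's sum() actually does):
#include <Rcpp.h>

// [[Rcpp::export]]
double kahan_sum(Rcpp::NumericVector x){
    double sum = 0.0;  // running total
    double c = 0.0;    // running compensation for lost low-order bits
    for (double v : x){
        double y = v - c;    // apply the correction from the previous step
        double t = sum + y;  // low-order bits of y may be lost in this add
        c = (t - sum) - y;   // algebraically zero; captures the lost bits
        sum = t;
    }
    return sum;
}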
What I have found:
I successfully managed to write a function which is able to mimic R's sum function. It appears R uses a higher precision variable to store the results of each addition operation.
What I wrote:
// [[Rcpp::depends("RcppArmadillo")]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double accu2(arma::vec& obj)
{
    long double result = 0;
    for (auto iter = obj.begin(); iter != obj.end(); ++iter)
    {
        result += *iter;
    }
    return result;
}
How it compares in speed:
library(microbenchmark)
set.seed(123)
vec <- runif(50000, 0, 0.000001)
microbenchmark(
    sum(vec),
    accu(vec),
    accu2(vec)
)
expr min lq mean median uq max neval
sum(vec) 72.155 72.351 72.61018 72.6755 72.7485 75.068 100
accu(vec) 48.275 48.545 48.84046 48.7675 48.9975 52.128 100
accu2(vec) 69.087 69.409 70.80095 69.6275 69.8275 182.955 100
So my C++ solution is still faster than R's sum(), but it is significantly slower than Armadillo's accu().
You could use the Rmpfr package (Multiple Precision Floating-Point Reliable) and specify the precision:
library("Rmpfr")
set.seed(1)
vec <- runif(100, 0, 0.00001)
# [1] 2.655087e-06 3.721239e-06 5.728534e-06 9.082078e-06 2.016819e-06 8.983897e-06 9.446753e-06 6.607978e-06 6.291140e-06 6.178627e-07 2.059746e-06
# [12] 1.765568e-06 6.870228e-06 3.841037e-06 7.698414e-06 4.976992e-06 7.176185e-06 9.919061e-06 3.800352e-06 7.774452e-06 9.347052e-06 2.121425e-06
# [23] 6.516738e-06 1.255551e-06 2.672207e-06 3.861141e-06 1.339033e-07 3.823880e-06 8.696908e-06 3.403490e-06 4.820801e-06 5.995658e-06 4.935413e-06
# [34] 1.862176e-06 8.273733e-06 6.684667e-06 7.942399e-06 1.079436e-06 7.237109e-06 4.112744e-06 8.209463e-06 6.470602e-06 7.829328e-06 5.530363e-06
# [45] 5.297196e-06 7.893562e-06 2.333120e-07 4.772301e-06 7.323137e-06 6.927316e-06 4.776196e-06 8.612095e-06 4.380971e-06 2.447973e-06 7.067905e-07
# [56] 9.946616e-07 3.162717e-06 5.186343e-06 6.620051e-06 4.068302e-06 9.128759e-06 2.936034e-06 4.590657e-06 3.323947e-06 6.508705e-06 2.580168e-06
# [67] 4.785452e-06 7.663107e-06 8.424691e-07 8.753213e-06 3.390729e-06 8.394404e-06 3.466835e-06 3.337749e-06 4.763512e-06 8.921983e-06 8.643395e-06
# [78] 3.899895e-06 7.773207e-06 9.606180e-06 4.346595e-06 7.125147e-06 3.999944e-06 3.253522e-06 7.570871e-06 2.026923e-06 7.111212e-06 1.216919e-06
# [89] 2.454885e-06 1.433044e-06 2.396294e-06 5.893438e-07 6.422883e-06 8.762692e-06 7.789147e-06 7.973088e-06 4.552745e-06 4.100841e-06 8.108702e-06
# [100] 6.049333e-06
sum(mpfr(vec,10))
# 1 'mpfr' number of precision 53 bits
# [1] 0.00051783234812319279

SAS/IML: how to use individual variance components in RANDNORMAL

This is a programming question, but I'll give you a little of the stats background first. This question refers to part of a data sim for a mixed-effects location scale model (i.e., heterogeneous variances). I'm trying to simulate two MVN variance components using the RANDNORMAL function in IML. Because both variance components are heterogeneous, the variances used by RANDNORMAL will differ across people. Thus, I need IML to select the specific row (e.g., row 1 = person 1) and use the RANDNORMAL function before moving on to the next row, and so on.
My example code below is for 2 people. I use DO to loop through each person's specific variance components (VC1 and VC2). I get the error: "Module RANDNORMAL called again before exit from prior call." I am assuming I need some kind of BREAK or EXIT function in the DO loop, but none I have tried work.
PROC IML;
ColNames = {"ID" "VC1" "VC2"};
A = {1 2 3,
     2 8 9};
PRINT A[COLNAME=ColNames];
/*Set mean of each variance component to 0*/
MeanVector = {0, 0};
/*Loop through each person's data using THEIR OWN variances*/
DO i = 1 TO 2;
    VC1 = A[i,2];
    VC2 = A[i,3];
    CovMatrix = {VC1 0,
                 0 VC2};
    CALL RANDSEED(1);
    U = RANDNORMAL(2, MeanVector, CovMatrix);
END;
QUIT;
Any help is appreciated. Oh, and I'm using SAS 9.4.
You want to move some things around, but mostly you don't want to rewrite U twice: you need to write U's 1st row, then U's 2nd row, if I understand what you're trying to do. The below is also a bit more efficient, since I j() the U and _cv matrices rather than constructing them de novo every time through the loop (which is slow).
proc iml;
a = {1 2 3, 2 8 9};
print(a);
_mv = {0,0};
U = J(2,2);
_cv = J(2,2,0);
CALL RANDSEED(1);
do i = 1 to 2;
    _cv[1,1] = a[i,2];
    _cv[2,2] = a[i,3];
    U[i,] = randnormal(1, _mv, _cv);
end;
print(u);
quit;
Your mistake is the line
CovMatrix = {VC1 0, 0 VC2}; /* wrong */
which is not valid SAS/IML syntax. Instead, use @Joe's approach or use
CovMatrix = (VC1 || 0) // (0 || VC2);
For details, see the article "How to build matrices from expressions."
You might also be interested in this article that describes how to carry out this simulation with a block-diagonal matrix: "Constructing block matrices with applications to mixed models."

Calculating W coefficients for an IIR filter

I am trying to implement an IIR filter I have designed in Matlab in a C++ program, to filter an unwanted signal out of a wave file. The fdatool in Matlab generated this C header to use (it is a bandstop filter):
#include "tmwtypes.h"
/*
* Expected path to tmwtypes.h
* C:\Program Files (x86)\MATLAB\R2013a Student\extern\include\tmwtypes.h
*/
const int al = 7;
const real64_T a[7] = {
0.9915141178644, -5.910578456199, 14.71918523779, -19.60023964796,
14.71918523779, -5.910578456199, 0.9915141178644
};
const int bl = 7;
const real64_T b[7] = {
1, -5.944230431733, 14.76096188047, -19.60009655976,
14.67733658492, -5.877069568864, 0.9831002459245
};
After hours of exhausting research, I still can't figure out the proper way to use these values to determine the W values and then how to use those W values to properly calculate my Y outputs. If anyone has any insight into the ordering these values should be used to do all these conversions, it would be a major help.
All the methods I've developed and tried to this point do not generate a valid wave file, the header values all translate correctly, but everything beyond cannot be evaluated by a media player.
Thanks.
IIR filters work this way:
Assuming an array of samples A and an array of coefficients named c, the result array B will be:
B[i] = (A[i] * c[0]) + (B[i-1] * c[1]) + ... + (B[i-n] * c[n])
Note that only the newest element is taken from A.
This is easier to do in-place, just update A as you move along.
These filter coefs are very violent, are you sure you got them right?
The first one is also symmetrical which probably indicates it's an FIR filter.
It appears to me that you have a 3 pole IIR filter with the coefficients given for an Nth order implementation (as opposed to a series of 2nd order sections). Since this is a band reject (or band pass) the polynomial order is twice the pole count.
I am not sure what you mean by W values, unless you are trying to evaluate the frequency response of this filter.
To calculate the Y values, as you put it, see this link for code on implementing IIR filters. See the Nth order implementation code in particular.
http://www.iowahills.com/A7ExampleCodePage.html
BTW: I assumed these were Nth order coefficients and simulated them. I got a 10 dB notch at 0.05 Pi. Sound about right?
where
b6 = 0.9915141178644
...
b0 = 0.9915141178644
and
a6 = 0.9831002459245
...
a0 = 1
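To make the recursion concrete, here is a minimal Direct Form I sketch in C++, not taken from the linked page. Per the mapping above it assumes the header's a[] array is passed as the numerator and its b[] array as the denominator (whose first entry is 1); the function name iir_filter and its interface are illustrative only:
#include <vector>
#include <cstddef>

// Direct Form I:
// y[n] = ( sum_k num[k]*x[n-k] - sum_{k>=1} den[k]*y[n-k] ) / den[0]
std::vector<double> iir_filter(const std::vector<double>& num,  // feed-forward coefficients
                               const std::vector<double>& den,  // feedback coefficients, den[0] == 1
                               const std::vector<double>& x)    // input samples
{
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        double acc = 0.0;
        for (std::size_t k = 0; k < num.size() && k <= n; ++k)
            acc += num[k] * x[n - k];   // feed-forward part
        for (std::size_t k = 1; k < den.size() && k <= n; ++k)
            acc -= den[k] * y[n - k];   // feedback part
        y[n] = acc / den[0];
    }
    return y;
}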
Also, you may want to post a question like this on:
https://dsp.stackexchange.com/

Rewriting slow R function in C++ & Rcpp

I have this line of R code:
croppedDNA <- completeDNA[,apply(completeDNA,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))]
What it does is identify the sites (columns) in a matrix of DNA sequences (1 row = one sequence) that are not universal (i.e. are informative) and subset them from the matrix to make a new 'cropped matrix', i.e. get rid of all the columns in which the values are the same. For a big dataset this takes about 6 seconds. I don't know if I can do it faster in C++ (still a beginner in C++) but it will be good for me to try. My idea is to use Rcpp, loop through the columns of the CharacterMatrix, pull out each column (site) as a CharacterVector and check whether its values are all the same. If they are the same, record that column number/index, and continue for all columns. Then at the end make a new CharacterMatrix that only includes those columns. It is important that I keep the rownames and column names as they are in the "R version" of the matrix, i.e. if a column goes, so should its colname.
I've been writing for about two minutes, so far what I have is (not finished):
#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterMatrix reduce_sequences(CharacterMatrix completeDNA)
{
    std::vector<int> informativeSites;  // was std::vector<bool>, which can't hold column indices
    for(int i = 0; i < completeDNA.ncol(); i++)
    {
        CharacterVector bpsite = completeDNA(_, i);  // was completeDNA(,i), which isn't valid C++
        if(is_true(all(bpsite == bpsite[1])))        // original had an unbalanced '(' here
        {
            informativeSites.push_back(i);
        }
    }
    // still unfinished: an Rcpp matrix can't be column-subset by an index vector like this
    CharacterMatrix cutDNA = completeDNA(_, informativeSites);
    return cutDNA;
}
Am I going the right way about this? Is there an easier way. My understanding is I need std::vector because it's easy to grow them (since I don't know in advance how many cols I am going to want to keep). With the indexing will I need to +1 to the informativeSites vector at the end (because R indexes from 1 and C++ from 0)?
Thanks,
Ben W.
Sample data:
set.seed(123)
z <- matrix(sample(c("a", "t", "c", "g", "N", "-"), 3*398508, TRUE), 3, 398508)
OP's solution:
system.time(y1 <- z[,apply(z,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))])
# user system elapsed
# 4.929 0.043 4.976
A faster version using base R:
system.time(y2 <- (z[, colSums(z[-1,] != z[-nrow(z), ]) > 0]))
# user system elapsed
# 0.087 0.011 0.098
The results are identical:
identical(y1, y2)
# [1] TRUE
It's very possible C++ will beat it, but is it really necessary?
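For completeness, an Rcpp version along the lines the question sketches might look like the following. This is a hedged sketch: the name reduce_sequences_cpp and the dimnames handling are mine, and it assumes both row and column names are actually set on the input (the simulated z above has none). Keeping the indices 0-based throughout also sidesteps the +1 question, since all the subsetting happens on the C++ side.
#include <Rcpp.h>
#include <vector>
using namespace Rcpp;

// Keep only the columns where at least one row differs from the first row,
// carrying the dimnames over to the cropped matrix.
// [[Rcpp::export]]
CharacterMatrix reduce_sequences_cpp(CharacterMatrix completeDNA)
{
    std::vector<int> keep;  // 0-based indices of the informative columns
    for (int i = 0; i < completeDNA.ncol(); i++) {
        String first = completeDNA(0, i);
        for (int j = 1; j < completeDNA.nrow(); j++) {
            String cur = completeDNA(j, i);
            if (cur != first) {          // site varies, so keep it
                keep.push_back(i);
                break;
            }
        }
    }
    CharacterMatrix cutDNA(completeDNA.nrow(), keep.size());
    for (int k = 0; k < (int)keep.size(); k++)
        cutDNA(_, k) = completeDNA(_, keep[k]);
    // subset the column names to the kept sites; row names carry over unchanged
    // (assumes dimnames are present on the input)
    List dn = completeDNA.attr("dimnames");
    CharacterVector oldCols = dn[1];
    CharacterVector newCols(keep.size());
    for (int k = 0; k < (int)keep.size(); k++)
        newCols[k] = oldCols[keep[k]];
    cutDNA.attr("dimnames") = List::create(dn[0], newCols);
    return cutDNA;
}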

Dynamically Delete Elements Within an R Loop

Ok guys, as requested, I will add more info so that you understand why a simple vector operation is not possible. It's not easy to explain in few words, but let's see. I have a huge number of points over a 2D space.
I divide my space into a grid with a given resolution, say, 100 m. The main loop, which I am not sure is mandatory or not (any alternative is welcome), goes through EACH cell/pixel that contains at least 2 points (right now I am using the quadratcount method within the spatstat package).
Inside this loop, for each one of these non-empty cells, I have to find and keep only a maximum of 10 Male-Female pairs that are within 3 meters of each other. The 3-meter buffer can be done using the "disc" function within spatstat. To select points falling inside a buffer you can use the pnt.in.poly method within the SDMTools package. All this because pixels have a maximum capacity that cannot be exceeded. Since each cell can contain hundreds or thousands of points, I am trying to find a smart way to use another loop/similar method to:
1) go through each point, one at a time
2) create a buffer and select points of the opposite sex
3) save the closest Male-Female (0-1) pair in another data frame (called new_colonies)
4) remove those points from the data frame so that it shrinks and I don't have to consider them anymore
5) as soon as that new data frame reaches 10 rows, stop everything and go to the next cell (thus skipping all remaining points)
Here is the code that I developed to be run within each cell (right now it takes too long):
head(df,20):
X Y Sex ID
2 583058.2 2882774 1 1
3 582915.6 2883378 0 2
4 582592.8 2883297 1 3
5 582793.0 2883410 1 4
6 582925.7 2883397 1 5
7 582934.2 2883277 0 6
8 582874.7 2883336 0 7
9 583135.9 2882773 1 8
10 582955.5 2883306 1 9
11 583090.2 2883331 0 10
12 582855.3 2883358 1 11
13 582908.9 2883035 1 12
14 582608.8 2883715 0 13
15 582946.7 2883488 1 14
16 582749.8 2883062 0 15
17 582906.4 2883317 0 16
18 582598.9 2883390 0 17
19 582890.2 2883413 0 18
20 582752.8 2883361 0 19
21 582953.1 2883230 1 20
Inside each cell I must run something along the lines of what I explained above.
for(i in 1:dim(df)[1]){
    new_colonies <- data.frame(ID1=0, ID2=0, X=0, Y=0)
    discbuff <- disc(radius, centre=c(df$X[i], df$Y[i]))
    # define the points and polygon
    pnts = cbind(df$X[-i], df$Y[-i])
    polypnts = cbind(x = discbuff$bdry[[1]]$x, y = discbuff$bdry[[1]]$y)
    out = pnt.in.poly(pnts, polypnts)
    out$ID <- df$ID[-i]
    if (any(out$pip == 1)) {
        pnt.inBuffID <- out$ID[which(out$pip == 1)]
        cond <- df$Sex[i] != df$Sex[pnt.inBuffID]
        if (any(cond)){
            eucdist <- sqrt((df$X[i] - df$X[pnt.inBuffID][cond])^2 + (df$Y[i] - df$Y[pnt.inBuffID][cond])^2)
            IDvect <- pnt.inBuffID[cond]
            new_colonies_temp <- data.frame(ID1=df$ID[i], ID2=IDvect[which(eucdist==min(eucdist))],
                                            X=(df$X[i] + df$X[pnt.inBuffID][cond][which(eucdist==min(eucdist))]) / 2,
                                            Y=(df$Y[i] + df$Y[pnt.inBuffID][cond][which(eucdist==min(eucdist))]) / 2)
            new_colonies <- rbind(new_colonies, new_colonies_temp)
            if (dim(new_colonies)[1] == maxdensity) break
        }
    }
}
new_colonies <- new_colonies[-1,]
Any help appreciated!
Thanks
Francesco
In your case I wouldn't worry about deleting the points as you go; skipping is the critical thing. I also wouldn't build up a new data.frame piece by piece like you seem to be doing. Both of those things slow you down a lot. A selection vector is much more efficient (perhaps part of the data.frame, set to FALSE beforehand):
df$sel <- FALSE
Now, when you go through you set df$sel to TRUE for each item you want to keep. Just skip to the next cell when you find your 10. Deleting values as you go will be time consuming and memory intensive, as will slowly growing a new data.frame. When you're all done going through them then you can just select your data based on the selection column.
df <- df[ df$sel, ]
(or maybe make a copy of the data.frame at that point)
You also might want to use the dist function to calculate a matrix of distances.
from ?dist
"This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix."
I'm assuming you are doing something sufficiently complicated that the for-loop is actually required...
So here's one rather simple approach: first just gather the rows to delete (or keep), and then delete the rows afterwards. Typically this will be much faster too since you don't modify the data.frame on each loop iteration.
df <- generateTheDataFrame()
keepRows <- rep(TRUE, nrow(df))
for(i in seq_len(nrow(df))) {
    rows <- findRowsToDelete(df, df[i,])
    keepRows[rows] <- FALSE
}
# Delete afterwards
df <- df[keepRows, ]
...and if you really need to work on the shrunk data in each iteration, just change the for-loop part to:
for(i in seq_len(nrow(df))) {
    if (keepRows[i]) {
        rows <- findRowsToDelete(df[keepRows, ], df[i,])
        keepRows[rows] <- FALSE
    }
}
I'm not exactly clear on why you're looping. If you could describe what kind of conditions you're checking there might be a nice vectorized way of doing it.
However, as a very simple fix, have you considered looping through the data frame backwards?