How do we interpret the cost matrix in WEKA? If I have 2 classes to predict (class 0 and class 1) and want to penalize classification of class 0 as class 1 more heavily (say double the penalty), what exactly is the matrix format?
Is it :
0 10
20 0
or is it
0 20
10 0
The source of confusion are the following two references:
1) The JavaDoc for Weka CostMatrix says:
The element at position i,j in the matrix is the penalty for classifying an instance of class j as class i.
2) However, the answer in this post seems to indicate otherwise.
http://weka.8497.n7.nabble.com/cost-matrix-td5821.html
Given the first cost matrix, the post says: "Misclassifying an instance of class 0 incurs a cost of 10. Misclassifying an instance of class 1 is twice as costly."
Thanks.
I know my answer is coming very late, but it might help somebody so here it is:
To boost the cost of classifying an item of class 0 as class 1, the correct format is the second one.
The evidence:
Cost Matrix I used:
0 1.0
1000.0 0
Confusion matrix (from cross-validation):
a b <-- classified as
565 20 | a = ignored
54 204 | b = not_ignored
Cross-validation output:
...
Total Cost 54020
...
That's a cost of 54 * 1000 + 20 * 1 = 54020, which matches the confusion matrix and the reported total cost above.
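As a sanity check, here is a minimal sketch in plain Python (not WEKA code) that recomputes the total cost, assuming the cost matrix rows are indexed by the true class and the columns by the predicted class, i.e. the same layout as the confusion matrix printed above:
cost = [[0.0,    1.0],    # true class 0 (a = ignored)
        [1000.0, 0.0]]    # true class 1 (b = not_ignored)
confusion = [[565,  20],  # true class 0: 565 correct, 20 predicted as class 1
             [ 54, 204]]  # true class 1: 54 predicted as class 0, 204 correct
total_cost = sum(confusion[i][j] * cost[i][j] for i in range(2) for j in range(2))
print(total_cost)  # 54020.0, matching the reported Total Cost
Only the "row = true class" convention reproduces 54020, which is why the second matrix format is the right one for penalizing class-0-as-class-1 errors.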
I'm trying to solve my linear program with Microsoft Solver Foundation, but it doesn't return a solution. It doesn't give a clear message about what is wrong, so I'm not sure what is going on. I have checked the constraints and I believe they are coded correctly, but maybe the LP model itself is wrong? I would be glad if you could take a look at it and see what's wrong :)
===Solver Foundation Service Report===
Date: 15/10/2021 16:00:21
Version: Microsoft Solver Foundation 3.0.2.10889 Express Edition
Model Name: DefaultModel
Capabilities Applied: MILP
Solve Time (ms): 51
Total Time (ms): 103
Solve Completion Status: Unknown
Solver Selected: Microsoft.SolverFoundation.Solvers.SimplexSolver
Directives:
Simplex(TimeLimit = -1, MaximumGoalCount = -1, Arithmetic = Default, Pricing = Default, IterationLimit = -1, Algorithm = Default, Basis = Default, GetSensitivity = False)
Algorithm: Primal
Arithmetic: Exact
Variables: 133 -> 133 + 40
Rows: 40 -> 40
Nonzeros: 522
Eliminated Slack Variables: 0
Basis: Slack
Pivot Count: 0
Phase 1 Pivots: 0 + 0
Phase 2 Pivots: 0 + 0
Factorings: 0 + 0
Degenerate Pivots: 0 (0,00 %)
Branches: 0
I'm making this for a practical assignment, so I prefer not to share my code. For information about the assignment: it's a machine assignment problem, where you have to plan two appointments for all patients. There are global parameters:
p1: the duration of the first appointment
p2: the duration of the second appointment
g: the gap between the first and second appointment
Each patient needs two appointments t1 and t2 that need to be planned. Each patient also has personal parameters:
interval I1=[r1, d1], the time interval in which the first appointment can be planned
x: (personal) extra gap between the first and second appointment
length l, the length of the second time interval. I2=[t1 + p1 + g + x, t1 + p1 + g + x + l - 1]
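Written out as constraints (this is just a restatement of the intervals above, assuming t1 and t2 denote the start times of the two appointments and that d1 bounds the start of the first one), each patient's appointments presumably need to satisfy:
r1 <= t1 <= d1
t1 + p1 + g + x <= t2 <= t1 + p1 + g + x + l - 1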
I am trying to write a BA (bundle adjustment) optimizer using Ceres and want to compute the covariances for my optimized results. But the program gets stuck at covariance.Compute(covariance_blocks, &problem); it seems to never finish and runs forever. I debugged deep inside the covariance.Compute() function and noticed that it gets stuck in the Eigen::SparseQR solver. The optimization steps work fine. Here is my full report from Ceres.
Solver Summary (v 1.14.0-eigen-(3.3.5)-no_lapack-eigensparse-openmp-no_tbb)
Original Reduced
Parameter blocks 1335 1335
Parameters 20025 20025
Residual blocks 1780 1780
Residuals 22677 22677
Minimizer TRUST_REGION
Sparse linear algebra library EIGEN_SPARSE
Trust region strategy LEVENBERG_MARQUARDT
Given Used
Linear solver SPARSE_NORMAL_CHOLESKY SPARSE_NORMAL_CHOLESKY
Threads 1 1
Linear solver ordering AUTOMATIC 1335
Cost:
Initial 2.021075e+09
Final 5.734662e+03
Change 2.021070e+09
Minimizer iterations 25
Successful steps 25
Unsuccessful steps 0
Time (in seconds):
Preprocessor 0.000846
Residual only evaluation 0.021398 (25)
Jacobian & residual evaluation 0.699060 (25)
Linear solver 0.557998 (25)
Minimizer 1.327035
Postprocessor 0.000104
Total 1.327986
Termination: CONVERGENCE (Function tolerance reached. |cost_change|/cost: 2.763070e-10 <= 1.000000e-06)
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 3.179329e+07 0.00e+00 3.91e+10 0.00e+00 0.00e+00 1.00e+04 0 1.08e-01 1.20e-01
1 1.517378e+06 3.03e+07 7.50e+06 9.20e+01 1.00e+00 3.00e+04 1 8.24e-01 9.44e-01
2 1.578857e+05 1.36e+06 1.26e+06 3.50e+01 1.00e+00 9.00e+04 1 7.31e-01 1.68e+00
3 7.079412e+04 8.71e+04 2.09e+05 1.59e+01 1.00e+00 2.70e+05 1 7.24e-01 2.40e+00
4 6.395899e+04 6.84e+03 7.87e+04 9.33e+00 1.02e+00 8.10e+05 1 7.21e-01 3.12e+00
5 5.746863e+04 6.49e+03 7.13e+04 4.92e+00 1.02e+00 2.43e+06 1 7.25e-01 3.84e+00
6 4.865750e+04 8.81e+03 8.72e+04 3.41e+00 1.01e+00 7.29e+06 1 7.23e-01 4.57e+00
7 4.089894e+04 7.76e+03 9.71e+04 6.98e+00 1.02e+00 2.19e+07 1 7.22e-01 5.29e+00
8 3.531157e+04 5.59e+03 1.07e+05 1.63e+01 1.05e+00 6.56e+07 1 7.25e-01 6.02e+00
9 2.937695e+04 5.93e+03 1.88e+05 2.85e+01 1.04e+00 1.97e+08 1 7.28e-01 6.74e+00
10 2.435229e+04 5.02e+03 3.88e+05 3.19e+01 9.59e-01 5.90e+08 1 7.23e-01 7.47e+00
11 2.065070e+04 3.70e+03 2.95e+05 2.39e+01 1.04e+00 1.77e+09 1 7.22e-01 8.19e+00
12 1.886882e+04 1.78e+03 9.54e+04 1.43e+01 1.13e+00 5.31e+09 1 7.23e-01 8.91e+00
13 1.828538e+04 5.83e+02 1.20e+05 1.16e+01 1.08e+00 1.59e+10 1 7.24e-01 9.64e+00
14 1.790181e+04 3.84e+02 7.20e+04 1.79e+01 1.04e+00 4.78e+10 1 7.19e-01 1.04e+01
15 1.759101e+04 3.11e+02 1.18e+05 2.76e+01 1.03e+00 1.43e+11 1 7.20e-01 1.11e+01
16 1.739361e+04 1.97e+02 2.49e+05 3.36e+01 1.03e+00 4.30e+11 1 7.21e-01 1.18e+01
17 1.733176e+04 6.19e+01 9.70e+04 2.12e+01 1.10e+00 1.29e+12 1 7.22e-01 1.25e+01
18 1.732284e+04 8.92e+00 1.19e+04 6.60e+00 1.15e+00 3.87e+12 1 7.21e-01 1.32e+01
19 1.732184e+04 9.95e-01 3.75e+03 1.57e+00 1.26e+00 1.16e+13 1 7.29e-01 1.40e+01
20 1.732163e+04 2.10e-01 2.29e+03 9.22e-01 1.62e+00 3.49e+13 1 7.26e-01 1.47e+01
21 1.732153e+04 1.04e-01 1.51e+03 4.64e-01 1.75e+00 1.05e+14 1 7.23e-01 1.54e+01
22 1.732147e+04 6.23e-02 1.06e+03 9.74e-02 1.80e+00 3.14e+14 1 7.27e-01 1.61e+01
23 1.732143e+04 4.05e-02 7.78e+02 2.55e-02 1.82e+00 9.41e+14 1 7.24e-01 1.69e+01
24 1.732140e+04 2.78e-02 5.92e+02 1.87e-02 1.84e+00 2.82e+15 1 7.19e-01 1.76e+01
25 1.732138e+04 1.97e-02 4.56e+02 1.48e-02 1.85e+00 8.47e+15 1 7.17e-01 1.83e+01
Solver Summary (v 1.14.0-eigen-(3.3.5)-no_lapack-eigensparse-openmp-no_tbb)
Original Reduced
Parameter blocks 2552 2552
Parameters 23679 23679
Residual blocks 26098 26098
Residuals 71313 71313
Minimizer TRUST_REGION
Sparse linear algebra library EIGEN_SPARSE
Trust region strategy LEVENBERG_MARQUARDT
Given Used
Linear solver SPARSE_NORMAL_CHOLESKY SPARSE_NORMAL_CHOLESKY
Threads 1 1
Linear solver ordering AUTOMATIC 2552
Cost:
Initial 3.179329e+07
Final 1.732138e+04
Change 3.177597e+07
Minimizer iterations 26
Successful steps 26
Unsuccessful steps 0
Time (in seconds):
Preprocessor 0.012063
Residual only evaluation 0.135580 (26)
Jacobian & residual evaluation 3.066419 (26)
Linear solver 15.525091 (26)
Minimizer 18.891099
Postprocessor 0.001107
Total 18.904269
Termination: CONVERGENCE (Function tolerance reached. |cost_change|/cost: 7.930327e-07 <= 1.000000e-06)
The code for setting up the covariance computation is:
ceres::Covariance::Options options_cov;
ceres::Covariance covariance(options_cov);
std::vector<std::pair<const double*, const double*> > covariance_blocks;
covariance_blocks.push_back(std::make_pair(cameraIntrinsic, cameraIntrinsic));
covariance_blocks.push_back(std::make_pair(delta_theta_ci.data(), delta_theta_ci.data()));
covariance_blocks.push_back(std::make_pair(cameraIntrinsic, delta_theta_ci.data()));
cameraIntrinsic and delta_theta_ci are the two parameter arrays for which I want to compute the covariance.
Can anyone help me?
Alright, I found the problem myself. The issue is indeed in the Eigen sparse QR solver; there is nothing wrong with the optimization problem I built. If I switch to another sparse linear algebra library, e.g.
options_cov.sparse_linear_algebra_library_type = ceres::SparseLinearAlgebraLibraryType::SUITE_SPARSE;
then the covariance can be computed.
Although there are plenty of algorithms and functions available online for generating unique combinations of any size from a list of unique items, none of them handles a list of non-unique items (i.e. a list containing repetitions of the same value).
The question is: how can a generator function produce, ON THE FLY, all the unique combinations from a non-unique list, without the computationally expensive need of filtering out duplicates?
I consider combination comboA to be unique if there is no other combination comboB for which sorted lists for both combinations are the same. Let's give an example of code checking for such uniqueness:
comboA = [1,2,2]
comboB = [2,1,2]
print("B is a duplicate of A" if sorted(comboA)==sorted(comboB) else "A is unique compared to B")
In the above given example B is a duplicate of A and the print() prints B is a duplicate of A.
The problem of getting a generator function capable of providing unique combinations on the fly for a non-unique list is solved here: Getting unique combinations from a non-unique list of items, FASTER?, but the generator function provided there needs lookups and requires memory, which causes problems for a huge number of combinations.
The function provided in the current version of the answer does the job without any lookups and appears to be the right answer here, BUT ...
The goal behind getting rid of lookups is to speed up the generation of unique combinations for a list with duplicates.
When writing the first version of this question I wrongly assumed that code which doesn't need to create a lookup set to ensure uniqueness would automatically have an advantage over code that needs lookups. That is not the case, at least not always: the code in the answer provided so far does not use lookups, but takes much more time to generate all the combinations when the list has no duplicates, or only a few.
Here are some timings to illustrate the current situation:
-----------------
k: 6 len(ls): 48
Combos Used Code Time
---------------------------------------------------------
12271512 len(list(combinations(ls,k))) : 2.036 seconds
12271512 len(list(subbags(ls,k))) : 50.540 seconds
12271512 len(list(uniqueCombinations(ls,k))) : 8.174 seconds
12271512 len(set(combinations(sorted(ls),k))): 7.233 seconds
---------------------------------------------------------
12271512 len(list(combinations(ls,k))) : 2.030 seconds
1 len(list(subbags(ls,k))) : 0.001 seconds
1 len(list(uniqueCombinations(ls,k))) : 3.619 seconds
1 len(set(combinations(sorted(ls),k))): 2.592 seconds
The timings above illustrate the two extremes: no duplicates at all, and nothing but duplicates. All other timings fall between these two.
My interpretation of the results is that a pure Python function (not using any C-compiled modules) can be much faster, but it can also be much slower, depending on how many duplicates are in the list. So there is probably no way around writing C/C++ code for a Python .so extension module that provides the required functionality.
Instead of post-processing/filtering your output, you can pre-process your input list. This way, you avoid generating duplicates in the first place. Pre-processing involves sorting the input (or using a collections.Counter on it). One possible recursive realization is:
def subbags(bag, k):
    a = sorted(bag)
    n = len(a)
    sub = []

    def index_of_next_unique_item(i):
        # Skip over all duplicates of a[i] so each distinct value is tried only once.
        j = i + 1
        while j < n and a[j] == a[i]:
            j += 1
        return j

    def combinate(i):
        if len(sub) == k:
            yield tuple(sub)
        elif n - i >= k - len(sub):
            # Branch 1: include a[i] in the current combination.
            sub.append(a[i])
            yield from combinate(i + 1)
            sub.pop()
            # Branch 2: exclude a[i] (and all its duplicates).
            yield from combinate(index_of_next_unique_item(i))

    yield from combinate(0)
bag = [1, 2, 3, 1, 2, 1]
k = 3
i = -1
print(sorted(bag), k)
print('---')
for i, subbag in enumerate(subbags(bag, k)):
    print(subbag)
print('---')
print(i + 1)
Output:
[1, 1, 1, 2, 2, 3] 3
---
(1, 1, 1)
(1, 1, 2)
(1, 1, 3)
(1, 2, 2)
(1, 2, 3)
(2, 2, 3)
---
6
Requires some stack space for the recursion, but this + sorting the input should use substantially less time + memory than generating and discarding repeats.
The current state of the art, inspired initially by a 50 and then by a 100 reputation bounty, is at the moment (instead of a Python extension module written entirely in C):
An efficient algorithm and implementation that is better than the obvious (set + combinations) approach in the best (and average) case, and is competitive with it in the worst case.
It seems possible to fulfill this requirement using a kind of "fake it before you make it" approach. Currently there are two generator-function algorithms available for producing unique combinations from a non-unique list. The algorithm below combines both of them, which becomes possible because there appears to be a threshold value for the percentage of unique items in the list that can be used to switch appropriately between the two algorithms. Calculating the percentage of uniqueness takes so little computation time that it doesn't even show up clearly in the final results, given the usual variation between timing runs.
def iterFastUniqueCombos(lstList, comboSize, percUniqueThresh=60):
    lstListSorted = sorted(lstList)
    lenListSorted = len(lstListSorted)
    percUnique = 100.0 - 100.0*(lenListSorted-len(set(lstListSorted)))/lenListSorted
    lstComboCandidate = []
    setUniqueCombos = set()

    def idxNextUnique(idxItemOfList):
        # Index of the next item whose value differs from lstListSorted[idxItemOfList].
        idxNextUniqueCandidate = idxItemOfList + 1
        while (
            idxNextUniqueCandidate < lenListSorted
            and
            lstListSorted[idxNextUniqueCandidate] == lstListSorted[idxItemOfList]
        ): # while
            idxNextUniqueCandidate += 1
        return idxNextUniqueCandidate

    def combinate(idxItemOfList):
        if len(lstComboCandidate) == comboSize:
            yield tuple(lstComboCandidate)
        elif lenListSorted - idxItemOfList >= comboSize - len(lstComboCandidate):
            # Branch 1: take the current item.
            lstComboCandidate.append(lstListSorted[idxItemOfList])
            yield from combinate(idxItemOfList + 1)
            lstComboCandidate.pop()
            # Branch 2: skip the current item together with all its duplicates.
            yield from combinate(idxNextUnique(idxItemOfList))

    if percUnique > percUniqueThresh:
        # Mostly unique items: the (set + combinations) approach is faster.
        from itertools import combinations
        allCombos = combinations(lstListSorted, comboSize)
        for comboCandidate in allCombos:
            if comboCandidate in setUniqueCombos:
                continue
            yield comboCandidate
            setUniqueCombos.add(comboCandidate)
    else:
        # Many duplicates: the recursive, lookup-free approach is faster.
        yield from combinate(0)
    #:if/else
#:def iterFastUniqueCombos()
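A quick usage sketch with a small, made-up list (the values and the combination size here are purely illustrative):
lst = [1, 2, 2, 3, 3, 3]   # 3 unique values out of 6 -> percUnique = 50, below the 60% threshold
print(list(iterFastUniqueCombos(lst, 2)))
# [(1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]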
The timings below show that the above iterFastUniqueCombos() generator function provides a clear advantage over the uniqueCombinations() variant when the list has less than 60 percent unique elements, and is no worse than the (set + combinations) based uniqueCombinations() generator function in the opposite case, where it becomes much faster than the iterUniqueCombos() one (due to switching between the (set + combinations) and the no-lookup variant at the 60% threshold for the share of unique elements in the list):
=========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 1 percUnique 2
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.04968 seconds.
Combos: 1 print(len(list( iterUniqueCombos(lst,k)))) : 0.00011 seconds.
Combos: 1 print(len(list( iterFastUniqueCombos(lst,k)))) : 0.00008 seconds.
Combos: 1 print(len(list( uniqueCombinations(lst,k)))) : 3.61812 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 48 percUnique 100
Combos: 12271512 print(len(list(combinations(lst,k)))) : 1.99383 seconds.
Combos: 12271512 print(len(list( iterUniqueCombos(lst,k)))) : 49.72461 seconds.
Combos: 12271512 print(len(list( iterFastUniqueCombos(lst,k)))) : 8.07997 seconds.
Combos: 12271512 print(len(list( uniqueCombinations(lst,k)))) : 8.11974 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 27 percUnique 56
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.02774 seconds.
Combos: 534704 print(len(list( iterUniqueCombos(lst,k)))) : 1.60052 seconds.
Combos: 534704 print(len(list( iterFastUniqueCombos(lst,k)))) : 1.62002 seconds.
Combos: 534704 print(len(list( uniqueCombinations(lst,k)))) : 3.41156 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 31 percUnique 64
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.03539 seconds.
Combos: 1114062 print(len(list( iterUniqueCombos(lst,k)))) : 3.49330 seconds.
Combos: 1114062 print(len(list( iterFastUniqueCombos(lst,k)))) : 3.64474 seconds.
Combos: 1114062 print(len(list( uniqueCombinations(lst,k)))) : 3.61857 seconds.
I am working through an edX course on computer programming. I have come to this problem and don't know how to work through it. I'm not looking for an answer, but more a point in the right direction.
The question gives you a 2D array with two columns and N rows, where N is the number of students. The first column is the grade of the first test and the second column is the grade of the second test. I am asked to find the root mean square of the marks of two separate students, compare them, and return a number based on the comparison. The question gives you this formula:
RMS = (0.5 × (midsem_marks^2 + endsem_marks^2))^0.5
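For example, plugging in the marks 60 and 20 from the first row of the sample input further below gives RMS = (0.5 × (60^2 + 20^2))^0.5 = (0.5 × 4000)^0.5 = 2000^0.5 ≈ 44.7.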
I know how to get the appropriate marks using marksarray[index1][0] (first test) and so on, and how to compare them. However, I am clueless on how to write that formula. Any help would be great. Thanks in advance.
The code I have:
float RMSi1 = sqrt(.5*((marksarray[index1][0]*marksarray[index1][0])+(marksarray[index1][1])*(marksarray[index1][1])));
float RMSi2 = sqrt(.5*((marksarray[index2][0]*marksarray[index2][0])+(marksarray[index2][1])*(marksarray[index2][1])));
if(RSMi1>RSMi2){
    return -1;
}
if(RSMi1<RSMi2){
    return 1;
}
if(RSMi1==RSMi2){
    return 0;
}
I'm getting an error that RSMi1 and RSMi2 are not declared in the if statements.
Input marksarray (one row per student; the columns are test 1 and test 2):
     1    2
1   60   20
2   60   20
3   30   40
4   10   90
5   90   30
6    0  100
7   60   20
Ok guys, as requested, I will add more info so that you understand why a simple vector operation is not possible. It's not easy to explain in a few words, but let's see. I have a huge number of points over a 2D space.
I divide my space into a grid with a given resolution, say 100 m. The main loop, which I am not sure is mandatory (any alternative is welcome), goes through EACH cell/pixel that contains at least 2 points (right now I am using quadratcount from the spatstat package).
Inside this loop, i.e. for each of these non-empty cells, I have to find and keep only a maximum of 10 Male-Female pairs that are within 3 meters of each other. The 3-meter buffer can be built with the disc function from spatstat, and points falling inside a buffer can be selected with pnt.in.poly from the SDMTools package. All of this is because pixels have a maximum capacity that cannot be exceeded. Since each cell can contain hundreds or thousands of points, I am trying to find a smart way to use another loop (or a similar method) to:
1) go through each point, one at a time;
2) create a buffer around it and select the points of the opposite sex inside it;
3) save the closest Male-Female (0-1) pair in another data frame (called new_colonies);
4) remove those points from the data frame so that it shrinks and I don't have to consider them anymore;
5) as soon as that new data frame reaches 10 rows, stop everything and go to the next cell (thus skipping all remaining points).
Here is the code that I developed to be run within each cell (right now it takes too long):
head(df,20):
X Y Sex ID
2 583058.2 2882774 1 1
3 582915.6 2883378 0 2
4 582592.8 2883297 1 3
5 582793.0 2883410 1 4
6 582925.7 2883397 1 5
7 582934.2 2883277 0 6
8 582874.7 2883336 0 7
9 583135.9 2882773 1 8
10 582955.5 2883306 1 9
11 583090.2 2883331 0 10
12 582855.3 2883358 1 11
13 582908.9 2883035 1 12
14 582608.8 2883715 0 13
15 582946.7 2883488 1 14
16 582749.8 2883062 0 15
17 582906.4 2883317 0 16
18 582598.9 2883390 0 17
19 582890.2 2883413 0 18
20 582752.8 2883361 0 19
21 582953.1 2883230 1 20
Inside each cell I must run something along the lines of what I explained above:
for(i in 1:dim(df)[1]){
  new_colonies <- data.frame(ID1=0, ID2=0, X=0, Y=0)
  discbuff <- disc(radius, centre=c(df$X[i], df$Y[i]))
  # define the points and polygon
  pnts = cbind(df$X[-i], df$Y[-i])
  polypnts = cbind(x = discbuff$bdry[[1]]$x, y = discbuff$bdry[[1]]$y)
  out = pnt.in.poly(pnts, polypnts)
  out$ID <- df$ID[-i]
  if (any(out$pip == 1)) {
    pnt.inBuffID <- out$ID[which(out$pip == 1)]
    cond <- df$Sex[i] != df$Sex[pnt.inBuffID]
    if (any(cond)){
      eucdist <- sqrt((df$X[i] - df$X[pnt.inBuffID][cond])^2 + (df$Y[i] - df$Y[pnt.inBuffID][cond])^2)
      IDvect <- pnt.inBuffID[cond]
      new_colonies_temp <- data.frame(ID1=df$ID[i],
                                      ID2=IDvect[which(eucdist==min(eucdist))],
                                      X=(df$X[i] + df$X[pnt.inBuffID][cond][which(eucdist==min(eucdist))]) / 2,
                                      Y=(df$Y[i] + df$Y[pnt.inBuffID][cond][which(eucdist==min(eucdist))]) / 2)
      new_colonies <- rbind(new_colonies, new_colonies_temp)
      if (dim(new_colonies)[1] == maxdensity) break
    }
  }
}
new_colonies <- new_colonies[-1,]
Any help appreciated!
Thanks
Francesco
In your case I wouldn't worry about deleting the points as you go; skipping is the critical thing. I also wouldn't build up a new data.frame piece by piece like you seem to be doing. Both of those things slow you down a lot. Having a selection vector is much more efficient (perhaps as part of the data.frame, which you set to FALSE beforehand).
df$sel <- FALSE
Now, as you go through, you set df$sel to TRUE for each item you want to keep, and just skip to the next cell when you find your 10. Deleting values as you go will be time-consuming and memory-intensive, as will slowly growing a new data.frame. When you're all done going through them, you can just select your data based on the selection column.
df <- df[ df$sel, ]
(or maybe make a copy of the data.frame at that point)
You also might want to use the dist function to calculate a matrix of distances.
from ?dist
"This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix."
I'm assuming you are doing something sufficiently complicated that the for-loop is actually required...
So here's one rather simple approach: first just gather the rows to delete (or keep), and then delete the rows afterwards. Typically this will be much faster too since you don't modify the data.frame on each loop iteration.
df <- generateTheDataFrame()
keepRows <- rep(TRUE, nrow(df))
for(i in seq_len(nrow(df))) {
  rows <- findRowsToDelete(df, df[i,])
  keepRows[rows] <- FALSE
}
# Delete afterwards
df <- df[keepRows, ]
...and if you really need to work on the shrunk data in each iteration, just change the for-loop part to:
for(i in seq_len(nrow(df))) {
  if (keepRows[i]) {
    rows <- findRowsToDelete(df[keepRows, ], df[i,])
    keepRows[rows] <- FALSE
  }
}
I'm not exactly clear on why you're looping. If you could describe what kind of conditions you're checking, there might be a nice vectorized way of doing it.
However, as a very simple fix, have you considered looping through the data frame backwards?