Dynamically Delete Elements Within an R Loop - list

Ok guys, as requested, I will add more info so that you understand why a simple vector operation is not possible. It's not easy to explain in a few words, but let's see. I have a huge number of points over a 2D space.
I divide my space into a grid with a given resolution, say 100 m. The main loop, which I am not sure is mandatory (any alternative is welcome), goes through EACH cell/pixel that contains at least 2 points (right now I am using quadratcount from the spatstat package).
Inside this loop, i.e. for each of these non-empty cells, I have to find and keep at most 10 Male-Female pairs that are within 3 meters of each other. The 3-meter buffer can be built with the disc function in spatstat, and the points falling inside a buffer can be selected with pnt.in.poly from the SDMTools package. All this because pixels have a maximum capacity that cannot be exceeded. Since each cell can contain hundreds or thousands of points, I am trying to find a smart way to use another loop (or similar method) to:
1) go through each point at a time;
2) create a buffer and select points of the opposite sex;
3) save the closest Male-Female (0-1) pair in another data frame (called new_colonies);
4) remove those points from the data frame so that it shrinks and I don't have to consider them anymore;
5) as soon as new_colonies reaches 10 rows, stop everything and go to the next cell (thus skipping all remaining points).
Here is the code that I developed to be run within each cell (right now it takes too long):
head(df,20):
X Y Sex ID
2 583058.2 2882774 1 1
3 582915.6 2883378 0 2
4 582592.8 2883297 1 3
5 582793.0 2883410 1 4
6 582925.7 2883397 1 5
7 582934.2 2883277 0 6
8 582874.7 2883336 0 7
9 583135.9 2882773 1 8
10 582955.5 2883306 1 9
11 583090.2 2883331 0 10
12 582855.3 2883358 1 11
13 582908.9 2883035 1 12
14 582608.8 2883715 0 13
15 582946.7 2883488 1 14
16 582749.8 2883062 0 15
17 582906.4 2883317 0 16
18 582598.9 2883390 0 17
19 582890.2 2883413 0 18
20 582752.8 2883361 0 19
21 582953.1 2883230 1 20
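For reference, the outer loop over the non-empty cells can be sketched with spatstat's quadrats and tileindex; win stands for the observation window, and the grid dimensions are just illustrative:
library(spatstat)
pp <- ppp(df$X, df$Y, window = win)        # win: the observation window (owin)
grid <- quadrats(win, nx = 100, ny = 100)  # grid dimensions are illustrative
cellof <- tileindex(pp$x, pp$y, grid)      # which cell each point falls into
for (cl in levels(cellof)) {
  idx <- which(cellof == cl)
  if (length(idx) < 2) next                # skip cells with fewer than 2 points
  sub <- df[idx, ]
  # ... the pairing code below runs on 'sub' ...
}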
Inside each cell I must run something like the following, according to what I explained above:
# radius (buffer size in metres) and maxdensity (max pairs per cell)
# are assumed to be defined beforehand, e.g. radius <- 3; maxdensity <- 10
new_colonies <- data.frame(ID1 = 0, ID2 = 0, X = 0, Y = 0)  # dummy first row, dropped below
for (i in 1:nrow(df)) {
  discbuff <- disc(radius, centre = c(df$X[i], df$Y[i]))
  # define the points and polygon
  pnts <- cbind(df$X[-i], df$Y[-i])
  polypnts <- cbind(x = discbuff$bdry[[1]]$x, y = discbuff$bdry[[1]]$y)
  out <- pnt.in.poly(pnts, polypnts)
  out$ID <- df$ID[-i]
  if (any(out$pip == 1)) {
    pnt.inBuffID <- out$ID[which(out$pip == 1)]
    cond <- df$Sex[i] != df$Sex[pnt.inBuffID]
    if (any(cond)) {
      eucdist <- sqrt((df$X[i] - df$X[pnt.inBuffID][cond])^2 +
                      (df$Y[i] - df$Y[pnt.inBuffID][cond])^2)
      IDvect <- pnt.inBuffID[cond]
      closest <- which.min(eucdist)  # single index, even with ties
      new_colonies_temp <- data.frame(
        ID1 = df$ID[i],
        ID2 = IDvect[closest],
        X = (df$X[i] + df$X[pnt.inBuffID][cond][closest]) / 2,
        Y = (df$Y[i] + df$Y[pnt.inBuffID][cond][closest]) / 2)
      new_colonies <- rbind(new_colonies, new_colonies_temp)
      if (nrow(new_colonies) == maxdensity + 1) break  # +1 accounts for the dummy row
    }
  }
}
new_colonies <- new_colonies[-1, ]  # drop the dummy row
Any help appreciated!
Thanks
Francesco

In your case I wouldn't worry about deleting the points as you go; skipping is the critical thing. I also wouldn't build up a new data.frame piece by piece like you seem to be doing. Both of those things slow you down a lot. Having a selection vector is much more efficient (perhaps as part of the data.frame, set to FALSE beforehand).
df$sel <- FALSE
Now, as you go through, you set df$sel to TRUE for each item you want to keep, and you just skip to the next cell when you find your 10. Deleting values as you go is time-consuming and memory-intensive, as is slowly growing a new data.frame. When you're all done going through them, you can just select your data based on the selection column.
df <- df[ df$sel, ]
(or maybe make a copy of the data.frame at that point)
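A minimal sketch of that pattern; cells, rowsInCell() and pairOK() are hypothetical placeholders for your grid structure and pairing test:
df$sel <- FALSE
for (cell in cells) {               # cells: your list of non-empty grid cells
  found <- 0
  for (i in rowsInCell(cell)) {     # rowsInCell(): hypothetical row lookup per cell
    if (pairOK(df, i)) {            # pairOK(): your Male-Female-within-3m test
      df$sel[i] <- TRUE
      found <- found + 1
      if (found == 10) break        # cap reached: skip the rest of this cell
    }
  }
}
df <- df[df$sel, ]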
You also might want to use the dist function to calculate a matrix of distances.
from ?dist
"This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix."

I'm assuming you are doing something sufficiently complicated that the for-loop is actually required...
So here's one rather simple approach: first just gather the rows to delete (or keep), and then delete the rows afterwards. Typically this will be much faster too since you don't modify the data.frame on each loop iteration.
df <- generateTheDataFrame()
keepRows <- rep(TRUE, nrow(df))
for (i in seq_len(nrow(df))) {
  rows <- findRowsToDelete(df, df[i, ])
  keepRows[rows] <- FALSE
}
# Delete afterwards
df <- df[keepRows, ]
...and if you really need to work on the shrunk data in each iteration, just change the for-loop part to:
for (i in seq_len(nrow(df))) {
  if (keepRows[i]) {
    live <- which(keepRows)   # map positions in the shrunk frame back to full rows
    rows <- findRowsToDelete(df[keepRows, ], df[i, ])
    keepRows[live[rows]] <- FALSE
  }
}
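To make the skeleton runnable, here is a toy stand-in for findRowsToDelete (purely illustrative; your real condition would be the buffer-and-sex test). It returns row positions within the frame it is given:
# toy rule: flag other rows within 3 units of the current point
findRowsToDelete <- function(d, row) {
  which(sqrt((d$X - row$X)^2 + (d$Y - row$Y)^2) <= 3 & d$ID != row$ID)
}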

I'm not exactly clear on why you're looping. If you could describe what kind of conditions you're checking there might be a nice vectorized way of doing it.
However, as a very simple fix, have you considered looping through the data frame backwards?
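The point of going backwards is that deleting row i only shifts the indices of rows you have already visited. A sketch, with shouldDrop() as a hypothetical stand-in for your condition:
for (i in nrow(df):1) {       # iterate over rows in reverse
  if (shouldDrop(df, i)) {    # shouldDrop(): hypothetical condition
    df <- df[-i, ]            # safe: only already-visited indices shift
  }
}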

Related

How to perform rolling window calculations without SSC packages

Goal: perform rolling window calculations on panel data in Stata with variables PanelVar, TimeVar, and Var1, where the window can change within a loop over different window sizes.
Problem: no access to SSC for the packages that would take care of this (like rangestat)
I know that
by PanelVar: gen Var1_1 = Var1[_n]
produces a copy of Var1 in Var1_1. So I thought it would make sense to try
by PanelVar: gen Var1SumLag = sum(Var1[(_n-3)/_n])
to produce a rolling window calculation for _n-3 to _n for the whole variable. But it fails to produce the results I want; it just produces zeros.
You could use sum(Var1) - sum(Var1[_n-3]), but I also want to be able to make the rolling window left-justified (summing future observations) as well as right-justified (summing past observations).
Essentially I would like to replicate Python's ".rolling().agg()" functionality.
In Stata, _n is the index of the current observation, and the expression (_n - 3) / _n yields -2 when _n is 1, then increases slowly with _n but is always less than 1. Stata rounds down expressions supplied as subscripts, so the subscript always reduces to -2, -1 or 0, and in each case it yields missing values. Experiment will show you that, given any numeric variable, say numvar, the references numvar[-2], numvar[-1] and numvar[0] all yield missing values.
Otherwise put, you seem to be hoping that the / yields a set of subscripts defining a sequence you can sum over, but that is a long way from what Stata will do in that context: the / is just interpreted as division. (The running sum of missings is returned as 0, which is an expression of missings being ignored in that calculation: just as 2 + 3 + . + 4 is returned as 9, so also . + . + . + . is returned as 0.)
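You can confirm this at the keyboard with a throwaway variable:
gen numvar = _n
display numvar[0]     // prints . (missing)
display numvar[-2]    // prints . (missing)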
A fairly general way to do what you want is to use time-series operators, and these are strongly preferable to subscripts as they (1) do the right thing with gaps and (2) automatically work for panels too. Thus, after a tsset or xtset,
L0.numvar + L1.numvar + L2.numvar + L3.numvar
yields the sum of the current value and the three previous and
L0.numvar + F1.numvar + F2.numvar + F3.numvar
yields the sum of the current value and the three next. If any of these terms is missing, the sum will be too; a work-around for that is to return say
cond(missing(L3.numvar), 0, L3.numvar)
More general code will require some kind of loop.
Given a desire to loop over lags (negative) and leads (positive), some code might look like this, given a range of subscripts as local macros i <= j:
* example i and j
local i = -3
local j = 0
gen double wanted = 0
forval k = `i'/`j' {
    if `k' < 0 {
        local k1 = -(`k')
        replace wanted = wanted + L`k1'.numvar
    }
    else replace wanted = wanted + F`k'.numvar
}
Alternatively, use Mata.
EDIT There's a simpler method: use tssmooth ma to get moving averages and then multiply up by the number of terms.
tssmooth ma wanted1=numvar, w(3 1)
tssmooth ma wanted2=numvar, w(0 1 3)
replace wanted1 = 4 * wanted1
replace wanted2 = 4 * wanted2
Note that in contrast to the method above tssmooth ma uses whatever is available at the beginning and end of each panel. So, the first moving average, the average of the first value and the three previous, is returned as just the first value at the beginning of each panel (when the three previous values are unknown).
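Putting the pieces together, a minimal end-to-end sketch using the question's variable names (PanelVar, TimeVar, Var1):
xtset PanelVar TimeVar
* right-justified: current value plus the three previous
tssmooth ma right_ma = Var1, w(3 1)
gen double right_sum = 4 * right_ma
* left-justified: current value plus the three next
tssmooth ma left_ma = Var1, w(0 1 3)
gen double left_sum = 4 * left_ma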

Subtracting value over multiple cells

I am trying to build a spreadsheet that keeps track of my inventory. I want to use the First In, First Out (FIFO) approach and need a formula to solve the following problem: subtract the value 16 from the stock quantities over multiple rows.
Value = 16
Column A --> Column B
10           0
5            0
2            1
3            3
12           12
Delete everything in column B and use an ArrayFormula like this:
=ARRAYFORMULA(
IF(IF(A4:A="", ,{B1; (SUMIF(ROW(A4:A), "<="&ROW(A4:A), A4:A)-B1)*-1})>A4:A, 0,
IF(IF(A4:A="", ,{B1; (SUMIF(ROW(A4:A), "<="&ROW(A4:A), A4:A)-B1)*-1})>0, A4:A-
IF(A4:A="", ,{B1; (SUMIF(ROW(A4:A), "<="&ROW(A4:A), A4:A)-B1)*-1}), A4:A)))
Alternatively, with helper columns, a sample layout:
Subtract: B2 = the number to subtract [16]
Subtract: B3 = formula =B2-A2. Copy down.
Out: C2 = formula =IF(B2>A2,0,IF(B2>0,A2-B2,A2)). Copy down.
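Worked through on the sample data, the helper column B and the output column C come out as:
A    B (remaining)   C (out)
10   16              0
5    6               0
2    1               1
3    -1              3
12   -4              12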

Why am I getting Time Limit Exceeded? [closed]

I am getting Time Limit Exceeded when submitting my solution to this problem.
Question:
Let's consider a triangle of numbers in which a number appears in the first line, two numbers appear in the second line, three in the third line, etc. Develop a program which will compute the largest of the sums of numbers that appear on the paths starting from the top towards the base, so that:
on each path the next number is located on the row below, more precisely either directly below or below and one place to the right;
the number of rows is strictly positive, but less than 100;
all numbers are positive integers between 0 and 99.
My Code:
#include<stdio.h>
#include<iostream>
#include<algorithm>
using namespace std;
int trian(int i,int j);
long long int n,a[100][100];
int main()
{
long long int t,i,j,v,k;
scanf("%lld",&t);
for(i=0;i<t;i++)
{
scanf("%lld",&n);
for(j=0;j<n;j++)
{
for(k=0;k<j+1;k++)
{
scanf("%lld",&a[j][k]);
}
}
v=trian(0,0);
printf("%lld\n",v);
}
}
int trian(int i,int j)
{
if(i>=n)
return 0;
else
return (a[i][j]+(std::max(trian(i+1,j),trian(i+1,j+1))));
}
Why am I getting Time Limit Exceeded?
Consider this triangle (ignore the numbers):
1
2 3
There are 2 possible paths to take here. Let's add a row:
1
2 3
4 5 6
The 4 can only be reached via a path ending directly above it, the 5 can be reached by two paths, and the 6 can only be reached from the path previously ending above and to its left. We now have 4 possible paths. Another row:
1
2 3
4 5 6
7 8 9 0
That's 8 possible paths. Do you see a pattern? Let's describe the path straight down to 7, starting from 1:
D = DOWN
R = DOWN AND RIGHT
DDD
The (single) path to 0:
RRR
Since in each step you go down one row, you can only choose between the two possibilities (number of rows - 1) times, thus giving you:
2^(number of rows - 1) possible paths
With 100 rows, that's 2^99, about 6.3 * 10^29 paths. Your code tries to compute each of these paths separately. Assuming computing 1 path takes 1 nanosecond (which would be blazing fast), computing them all would take more than 2 * 10^13 years. Well, ...
Time Limit Exceeded
So you now know that you cannot just compute every possible path and take the maximum. The main issue is the following:
1
2 3
4 5 6
One path to the 5 is 1 + 3 + 5, and the path to the 6 is 1 + 3 + 6. Your code computes each path separately, so 1 + 3 is computed twice. If you save that result, you get rid of most of the unnecessary computation.
How could you store such results? Well, 1 + 3 is the computation of the path arriving at 3, so store it there. What if a number (say 5) can be reached by multiple paths?
5 could be reached via 1 + 2 + 5 or 1 + 3 + 5.
Any path going through the 5 will give a higher result if it went through the 3 first, so only remember this path (and ignore the path through the 2; it's useless now).
So, as an algorithm:
For each row, starting at row 1 (not the first, but the second): for each entry, calculate the maximum of the entry above-left (if available) and the entry directly above (if available), and store that maximum plus the entry's own value as the entry's new value. Then, find the maximum of the last row.
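Here is a sketch of that bottom-up pass, reusing the n and a globals from the question (for a single triangle rather than t test cases; treat it as illustrative, not a drop-in submission):
#include <algorithm>
#include <cstdio>
using std::max;

long long n, a[100][100];

int main()
{
    scanf("%lld", &n);
    for (long long i = 0; i < n; i++)
        for (long long j = 0; j <= i; j++)
            scanf("%lld", &a[i][j]);
    // each entry absorbs the better of its two possible parents
    for (long long i = 1; i < n; i++)
    {
        for (long long j = 0; j <= i; j++)
        {
            long long above     = (j < i) ? a[i - 1][j]     : -1; // directly above
            long long aboveLeft = (j > 0) ? a[i - 1][j - 1] : -1; // above and left
            a[i][j] += max(above, aboveLeft);
        }
    }
    long long best = 0;
    for (long long j = 0; j < n; j++)
        best = max(best, a[n - 1][j]);
    printf("%lld\n", best);
    return 0;
}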

Nearest Neighbor Matching in Stata

I need to program a nearest-neighbor algorithm in Stata from scratch because my dataset does not allow me to use any of the available solutions (as far as I can tell).
To be precise, I have a dataset with a structure similar to the following (the original has around 14k observations):
input id value treatment match
1 0.14 0 .
2 0.32 0 .
3 0.465 1 2
4 0.878 1 2
5 0.912 1 2
6 0.001 1 1
end
I want to generate a variable called match (already included in the example above). For each observation with treatment == 1 the variable match should store the id of another observation from within treatment == 0 whose value is closest to value of the considered observation (treatment == 1).
I am new to Stata programming, so I am not yet familiar with the syntax. My first shot is the following; however, it does not produce any changes to the match variable. I am sure this is a novice question, but I am hoping for some advice on how to get the code running.
EDIT: I have changed the code slightly and now it seems to work. Do you see any problems that may arise if I run it on a bigger dataset?
set more off
clear all
input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end
gen match = .
forval i = 1/`= _N' {
    if treatment[`i'] == 1 {
        local dist 1
        forvalues j = 1/`= _N' {
            if (treatment[`j'] == 0) {
                local current_dist (pscore[`i'] - pscore[`j'])^2
                if `dist' > `current_dist' {
                    local dist `current_dist' // update smallest distance
                    replace match = id[`j'] in `i' // write match
                }
            }
        }
    }
}
Consider some simulated data: 1,000 observations, 200 of them untreated (treat == 0) and the rest treated (treat == 1). The code included below will then be much more efficient than the one originally posted. (Ties, as in your code, are not explicitly handled.)
clear
set more off
*----- example data -----
set obs 1000
set seed 32956
gen id = _n
gen pscore = runiform()
gen treat = cond(_n <= 200, 0, 1)
*----- new method -----
timer clear
timer on 1
// get id of last non-treated and first treated
// (data is sorted by treat and ids are consecutive)
bysort treat (id): gen firsttreat = id[1]
local firstt = firsttreat[_N]
local lastnt = `firstt' - 1
// start loop
gen match = .
gen dif = .
quietly forvalues i = `firstt'/`=_N' {
    // compute distances
    replace dif = (pscore[`i'] - pscore)^2
    summarize dif in 1/`lastnt', meanonly
    // identify id of minimum-distance observation
    replace match = . in 1/`lastnt'
    replace match = id in 1/`lastnt' if dif == r(min)
    summarize match in 1/`lastnt', meanonly
    // save the minimum-distance id
    replace match = r(max) in `i'
}
// clean variable and drop
replace match = . in 1/`lastnt'
drop dif firsttreat
timer off 1
tempfile first
save `first'
*----- your method -----
drop match
timer on 2
gen match = .
quietly forval i = 1/`= _N' {
    if treat[`i'] == 1 {
        local dist 1
        forvalues j = 1/`= _N' {
            if (treat[`j'] == 0) {
                local current_dist (pscore[`i'] - pscore[`j'])^2
                if `dist' > `current_dist' {
                    local dist `current_dist' // update smallest distance
                    replace match = id[`j'] in `i' // write match
                }
            }
        }
    }
}
timer off 2
tempfile second
save `second'
// check for equality of results
cf _all using `first'
// check times
timer list
The results in seconds to finish execution:
. timer list
1: 0.19 / 1 = 0.1930
2: 10.79 / 1 = 10.7900
The difference is huge, especially considering this data set has only 1,000 observations.
An interesting thing to notice is that as the number of non-treated cases increases relative to the number of treated, the original method improves, but it never reaches the efficiency of the new method. As an example, invert the number of cases, so there are now 800 untreated and 200 treated (change the data setup to gen treat = cond(_n <= 800, 0, 1)). The result is
. timer list
1: 0.07 / 1 = 0.0720
2: 4.45 / 1 = 4.4470
You can see that the new method also improves and is still much faster. In fact, the relative difference is still the same.
Another way to do this is using joinby or cross. The problem is they temporarily expand (a lot) the size of your data base. In many cases, they are not feasible due to the hard limit Stata has on the number of possible observations (see help limits). You can find an example of joinby here: https://stackoverflow.com/a/19784222/2077064.
Edit
If there's a large number of treated relative to untreated, your code suffers because you go through the whole first loop many more times (due to the first if). Furthermore, going through that whole loop once implies going through another loop, which itself has two if conditions, _N more times. The opposite case, in which there are few treated observations, means that you go through the whole first loop only on a small number of occasions, speeding up your code substantially.
The reason my code can maintain its efficiency is the use of in. This always offers speed gains over if: Stata goes directly to those observations, with no logical checking needed. Your problem provides an opportunity for that replacement, and it's wise to seize it.
If my code used if where it now uses in, the results would be different. Your code would be faster for the case in which there's a large number of untreated relative to treated, and again, that is because in your code there would not be the need to go through the complete loop, requiring very little work; the first loop is short-circuited by the first if. For the opposite case, my code would still dominate.
The key is to "separate" treated from untreated and work on each group using in.
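As a minimal illustration of the in versus if point (x here is any numeric variable):
* in goes straight to the stated observations; no condition is evaluated
summarize x in 1/200, meanonly
* if visits all _N observations and tests the condition on each
summarize x if _n <= 200, meanonly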

Not Equals Constraint in PROC OPTMODEL

I have an optimization problem that I need to solve. It's a binary linear programming problem, so all of the decision variables are equal to 0 or 1. I need certain combinations of these decision variables to add up to either 0 or to 2 or more; they cannot sum to exactly 1. I'm struggling with how to accomplish this in PROC OPTMODEL.
Something like this is what I need:
con sum_con: x+y+z~=1;
Unfortunately, this just throws a syntax error... Is there any way to accomplish this?
See below for a linear reformulation. However, you may not need it. In SAS 9.4m2 (SAS/OR 13.2), your expression works as written. You just need to invoke the (experimental) CLP solver:
proc optmodel;
   /* In SAS/OR 13.2 you can use your code directly.
      Just invoke the experimental CLP solver. */
   var x binary, y binary, z binary;
   con sum_con: x+y+z~=1;
   solve with clp / findall;
   print {i in 1 .. _NSOL_} x.sol[i]
         {i in 1 .. _NSOL_} y.sol[i]
         {i in 1 .. _NSOL_} z.sol[i];
produces immediately:
[1] x.SOL y.SOL z.SOL
1 0 0 0
2 0 1 1
3 1 0 1
4 1 1 0
5 1 1 1
In older versions of SAS/OR, you can still call PROC CLP directly, which is not experimental. The syntax for your example will be very similar to PROC OPTMODEL's.
I am sure, however, that your model has other variables and constraints. In that case, remember that no matter how you formulate this, it is still a search space with a hole in the middle, so it can potentially make the solver perform poorly. How poorly is hard to predict; it depends on other features of your model.
If MILP is a better fit for the rest of your model, you can reformulate your constraint as a valid MILP in two steps. First, add a binary variable that is zero only when the expression is zero:
/* If solve with CLP is not available, you can linearize the disjunction: */
var IsGTZero binary; /* 1 if any variable in the expression is 1 */
con IsGTZeroBoundsExpression: 3 * IsGTZero >= x + y + z;
Then add another constraint that forces the expression to be at least the constant you want (in this case 2) when it is nonzero:
num atLeast init 2;
con ZeroOrAtLeast: x + y + z >= atLeast * IsGTZero;
min f=0; /* Explicit objectives are unnecessary in 13.2 */
solve;
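As a sanity check on the linearization: if x + y + z = 1, the first constraint forces IsGTZero = 1, and the second then requires x + y + z >= 2, a contradiction. A sum of 0 (with IsGTZero = 0) and sums of 2 or more (with IsGTZero = 1) both remain feasible.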
The following constraint should also work (note that it is nonlinear):
(x+y-z)*z + (y+z-x)*x + (x+z-y)*y > -1
It can be generalized to more than three variables, and if you have a large number of them you should be able to use index expansions to make it easier.
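To see why it works: for binary variables the left-hand side simplifies to 2(xy + yz + zx) - (x + y + z), which evaluates to 0 when the sum is 0, -1 when the sum is 1, 0 when the sum is 2, and 3 when the sum is 3. Only a sum of exactly 1 violates > -1.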