Stata: Uniquely sorting points within groups

I'm conducting a household survey with a random sample of 200 villages. Using QGIS, I picked a random point 5-10 km from each of my original villages. I then obtained, from the national statistical office, the village codes for those 200 "neighbor" villages, as well as a buffer of 10 additional neighbor villages. So my total sample is:
200 original villages + 210 neighbor villages = 410 villages, total
We're going to begin fieldwork soon, and I want to give each survey team a map for 1 original village + the nearest neighbor village. Because I'm surveying in some dense urban areas as well, sometimes a neighbor village is actually quite close to more than one original village.
My problem is this: if I run Distance Matrix in QGIS, matching an old village to its nearest neighbor village, I get duplicates in the latter. To get around this, I've matched each old village to the nearest 5 neighbor villages. My main idea/goal is to pick the nearest neighbor that hasn't already been picked.
I end up with a .csv like so:
As you can see, picking the five nearest villages, I'm getting repeats - neighbor village 79 is showing up as nearby to original villages 1, 2, 3, and 4. This is fine, as long as I can assign neighbor village 79 to one (and only one) original village, and then have the rest uniquely match as well.
What I want to do, then, is to uniquely match each original village to one neighbor village. I've tried a number of approaches, none of which has worked. My sense is that I need to loop over the original village groups, flag one of the neighbor villages (e.g. taken==1), and then - somehow - have that flag apply to every other instance of, say, neighbor village 79.
Here's some sample code of what I was thinking. Note: this uniquely matches 163 of my neighbors.
gen taken = 0
sort ea distance
by ea: replace taken = 1 if _n == 1
keep if taken == 1
codebook FID ea
This also doesn't work; it just sets taken to 1 for all obs:
foreach i in 5 4 3 2 1 {
    by ea: replace taken = 1 if _n == `i' & taken == 0
}
What I need to do, I think, is loop over both _N and _n, and maybe use an if/else. But I'm not sure how to put it all together.
(Tangentially, is there a better way to loop over decreasing values in Stata? Similar to i-- in other programming languages?)

This should work, but the setup is a little different from what you say you need. By comparing with only five neighbors, you have an ill-posed problem: imagine the geography is such that six (or more) original villages end up with exactly the same list of five neighbors. What do you assign to the sixth original village?
Given this, I compare each original village with all neighbor villages, not only five. The strategy is then to assign original village 1 its closest neighbor; original village 2 its closest neighbor after discarding the one previously assigned; and so on. This assumes an equal number of original and neighbor villages, but you have ten additional ones, so you need to give that some thought.
clear
set more off
*----- example data -----
local numvilla = 4 // change to test
local numobs = `numvilla'^2
set obs `numobs'
egen origv = seq(), from(1) to(`numvilla') block(`numvilla')
bysort origv: gen neigh = _n
set seed 1956
gen dist = runiform()*10
*----- what you want ? -----
sort origv dist
list, sepby(origv)
quietly forvalues villa = 1/`numvilla' {
    drop if origv == `villa' & _n > `villa'
    drop if neigh == neigh[`villa'] & _n > `villa'
}
list
The other issue is that results will depend on which original village is processed first, second, and so on, because the order of assignments will change accordingly. That is, the order in which available options are discarded changes with the order in which you set up the original villages. You may want to randomize the order of the original villages before you start the assignments.
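A minimal sketch of one way to do that randomization, assuming the origv and dist variables from the example above (the seed and variable names u and rank are arbitrary choices, not part of the original answer); the assignment loop would then run over rank instead of origv:

```stata
* sketch: give each original village a random processing rank
set seed 2468
gen double u = runiform()
bysort origv : replace u = u[1]    // one random draw per original village
egen rank = group(u)               // random 1..numvilla ordering
sort rank dist                     // process original villages in random order
```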
You can increase efficiency by replacing & _n > `villa' with in `=`villa'+1'/L, but you won't notice much difference with your sample size.
I'm not qualified to say anything about your sample design, so take this answer to address only the programming issue you pose.
By the way, to loop over decreasing values:
forvalues obs = 5(-1)1 {
    display "`obs'"
}
See help numlist.

Related

Reordering panels by another variable in twoway, by() graphs

Suppose I make the following chart showing the weight of 9 pigs over time:
webuse pig
tw line weight week if inrange(id,1,9), by(id) subtitle(, nospan)
Is it possible to reorder the panels by another variable while retaining the original label? I can imagine defining another variable that is sorted the right way and then labeling it with the right id, but curious if there is a less clunky way of achieving that.
I think you are right: you need a new ordering variable. On the positive side, you can order on any criterion you choose. Watch out for ties on the variable used to order, which can always be broken by referring to the original identifier. Here we sort on final weights, by default smallest first. (For largest first, negate the weight variable.)
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
egen newid = group(last id)
bysort newid : gen toshow = strofreal(id) + " (" + strofreal(last, "%2.1f") + ")"
* search labmask for download links
labmask newid , values(toshow)
set scheme s1color
line weight week, by(newid, note("")) sort xla(1/9)
Short papers discussing the principles here are already in train for publication in the Stata Journal in 2021.

Stata - Generating rolling average variable

I'd like to generate a rolling average variable from a basketball dataset. So if the first observation is 25 points on January 1, the generated variable will show 25. If the second observation is 30 points on January 2, the variable generated will show 27.5. If the third observation is 35 points, the variable generated will show 30, etc.
For variable y ordered by some time t at its simplest the average of values to date is
gen yave = sum(y) / _n
which is the cumulative sum divided by the number of observations. If there are occasional missing values, they are ignored by sum() but the denominator needs to be fixed, say
gen yave = sum(y) / sum(y < .)
This generalises easily to panel structure
bysort id (t) : gen yave = sum(y) / sum(y < .)
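A quick check on made-up data (the id/t/y names and values are illustrative only) shows how a missing value is skipped in both the numerator and the denominator:

```stata
clear
input id t y
1 1 10
1 2 20
1 3 .
1 4 30
2 1 5
2 2 15
end
bysort id (t) : gen yave = sum(y) / sum(y < .)
list, sepby(id)
```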
Here is the solution I came up with. I created three variables: a cumulative point total (the numerator), a running game count (the denominator), and their ratio, player points per game:
* assumes the data are sorted by player and date
gen player_pts = points if player != player[_n-1]
replace player_pts = points + player_pts[_n-1] if player == player[_n-1] & _n != 1
by player: gen player_games = _n
gen ppg = player_pts / player_games

Creating indicator variable in Stata

In a panel data set I have 3 variables: name, week, and income.
I would like to make an indicator variable that indicates initial weeks where income is 0. So say a person X has 0 income in the first 13 weeks, the indicator takes the value 1 the first 13 weeks, and is otherwise 0. The same procedure for person Y and so on.
I have tried using by groups, but I can't get it to work.
Any suggestions?
One solution is
bysort name (week) : gen no_income = sum(income) == 0
The function sum() yields the cumulative or running sum. So, as long as income is 0, its cumulative sum remains 0 too. As soon as a person earns something, the cumulative sum becomes positive. The code presumes that the cumulative sum cannot return to zero later, which could happen if income is negative in some week. To exclude that possibility, use an appropriate extra condition, such as
bysort name (week) : gen no_income = sum(income) == 0 & income == 0
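As a quick sanity check on toy data (names and values invented), note that a zero-income week after the first earnings is correctly not flagged:

```stata
clear
input str1 name week income
"X" 1 0
"X" 2 0
"X" 3 5
"X" 4 0
end
bysort name (week) : gen no_income = sum(income) == 0
list
```

Here no_income is 1 in weeks 1 and 2 only; week 4 has zero income but a positive cumulative sum, so it is not an initial no-income week.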
For a problem with very similar flavour, see this FAQ. A meta-lesson is to look at the StataCorp FAQs as one of several resources.

Stata Nearest neighbor of percentile

This has probably already been answered, but I must just be searching for the wrong terms.
Suppose I am using the built in Stata data set auto:
sysuse auto, clear
and say, for example, I am working with 1 independent and 1 dependent variable and I want to essentially compress down to the five-number summary: min, p(25), median, p(75), max...
so I use command,
keep weight mpg
sum weight, detail
return list
local min=r(min)
local lqr=r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)
keep if weight==`min' | weight==`max' | weight==`med' | weight==`lqr' | weight==`uqr'
Hence, I want to compress the data set down to only those 5 observations. In this situation, for example, the median is not actually an element of the weight vector; there is an observation above and one below (given the definition of the median, this is no surprise). Is there a way I can tell Stata to look for the nearest neighbor above the percentile, i.e. if r(p50) is not an element of weight, then search above that value for the next observation?
The end result is I am trying to get the data down to 2 vectors, say weight and mpg such that for each of the 5 elements of weight in the IQR have their matching response in mpg.
Any thoughts?
I think you want something like:
clear all
set more off
sysuse auto
keep weight mpg
summarize weight, detail
local min = r(min)
local lqr = r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)
* differences between weights and its median
gen diff = abs(weight - `med')
* put the smallest difference in observation 1 (there can be several, watch out!)
isid diff weight mpg, sort
* replace the original median with the weight "closest" to the median
local med = weight[1]
keep if inlist(weight, `min', `lqr', `med', `uqr', `max')
drop diff
* pretty print
order weight mpg
sort weight mpg
list, sep(0)
Notice the median does not appear because we kept its "closest" neighbor instead (weight == 3,180).
Also, percentile 75 has two associated mpg values.
You could probably work something out with collapse and merge (and many more), but I'll leave it at this.
Use help <command> for whatever is not clear.
Thanks to everyone for the suggestions; here is what I came up with. The idea is that I was pulling these 5 numbers so I could send them to Mata for a cubic spline routine that I am attempting to write.
For whatever reason trying to generalize this was giving me a headache.
My final solution:
sysuse auto, clear
preserve
sort weight
count if weight<.
keep if _n==1 | _n==ceil(r(N)/4) | _n==ceil(r(N)/2) | _n==ceil(3*r(N)/4) | _n==_N
gen X = weight
gen Y = mpg
list X Y
/* at this point I will send X and Y to mata for the cubic spline
routine that I am in the process of writing. It was this little step that
was bugging me. */
restore

Count number of living firms efficiently

I have a list of companies with start and end dates for each. I want to count the number of companies alive over time. I have the following code but it runs slowly on my large dataset. Is there a more efficient way to do this in Stata?
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        *display date("01-`m'-`y'","DMY")
        count if start_dt <= date("01-`m'-`y'","DMY") & date("01-`m'-`y'","DMY") <= end_dt
    }
}
One way is to use the inrange() function. In Stata, date variables are just integers, so you can easily operate on them.
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        local d = date("01-`m'-`y'","DMY")
        count if inrange(`d', start_dt, end_dt)
    }
}
This alone will save you a huge amount of time. For 50,000 observations (and made-up data):
. timer list 1
1: 3.40 / 1 = 3.3980
. timer list 2
2: 18.61 / 1 = 18.6130
timer 1 is with inrange, timer 2 is your original code. Results are in seconds. Run help inrange and help timer for details.
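For reference, timings like those above come from a pattern along these lines (a sketch; the body of the timed block is whatever code you want to compare):

```stata
timer clear 1
timer on 1
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        local d = date("01-`m'-`y'","DMY")
        count if inrange(`d', start_dt, end_dt)
    }
}
timer off 1
timer list 1
```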
That said, maybe someone can suggest an overall better strategy.
Assuming a firm identifier firmid, this is another way to think about the problem, but with a different data structure. Make sure you have a saved copy of your dataset before you do this.
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]
So,

1. We expand each observation to 2 and put both dates in a new variable: the start date in one observation and the end date in the other.
2. We assign a score that is 1 when a firm starts and -1 when it ends.
3. The number of firms increases by 1 every time a firm starts and decreases by 1 every time one ends. We just need to sort by date, and the number of firms is the cumulative sum of those scores. (EDIT: There is a fix for changes on the same date.)
This new data structure could be useful for other purposes.
There is a write-up at http://www.stata-journal.com/article.html?article=dm0068
EDIT:
Notes in response to @Roberto Ferrer (and anyone else who read this):
I fixed a bad bug, which made this too difficult to understand. Sorry about that.
The dates used here are just the dates at which firms start and end. There is no evident point in evaluating the number of firms at any other date as it would just be the same number as the previous date used. If you needed, however, to interpolate to a grid of dates, copying the previous count would be sufficient.
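If you did need such a grid, one possible sketch (assuming the eitherdate/living data set built above, reduced to one record per date; daily dates assumed) is to fill the gaps and carry the last count forward:

```stata
* sketch: keep one count per date, fill a daily grid, carry forward
bysort eitherdate : keep if _n == _N
keep eitherdate living
tsset eitherdate
tsfill
replace living = living[_n-1] if missing(living)
```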
It is important not to confuse the Stata function sum(), which returns the cumulative sum, with any egen function. The impression that egen's total() is an alternative here was a side effect of my bug.