Stata: Nearest neighbor of a percentile

This has probably already been answered, but I must just be searching for the wrong terms.
Suppose I am using the built in Stata data set auto:
sysuse auto, clear
Say, for example, I am working with one independent and one dependent variable, and I want to compress the data down to the five-number summary: min, p(25), median, p(75), max.
So I use the commands:
keep weight mpg
sum weight, detail
return list
local min=r(min)
local lqr=r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)
keep if weight==`min' | weight==`max' | weight==`med' | weight==`lqr' | weight==`uqr'
Hence, I want to compress the data set down to only those 5 observations. In this situation, for example, the median is not actually an element of the weight vector: there is an observation above it and an observation below it (given the definition of the median, this is no surprise). Is there a way to tell Stata to look for the nearest neighbor above the percentile, i.e., if r(p50) is not an element of weight, then search above that value for the next observation?
The end result I am after is two vectors, say weight and mpg, such that each of the 5 summary elements of weight has its matching response in mpg.
Any thoughts?

I think you want something like:
clear all
set more off
sysuse auto
keep weight mpg
summarize weight, detail
local min = r(min)
local lqr = r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)
* differences between weights and its median
gen diff = abs(weight - `med')
* put the smallest difference in observation 1 (there can be several, watch out!)
isid diff weight mpg, sort
* replace the original median with the weight "closest" to the median
local med = weight[1]
keep if inlist(weight, `min', `lqr', `med', `uqr', `max')
drop diff
* pretty print
order weight mpg
sort weight mpg
list, sep(0)
Notice the median does not appear because we kept its "closest" neighbor instead (weight == 3,180).
Also, percentile 75 has two associated mpg values.
You could probably work something out with collapse and merge (and many more), but I'll leave it at this.
Use help <command> for whatever is not clear.
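The question asked specifically for the nearest observed value at or above each percentile, rather than the closest value in absolute terms. A minimal sketch of that variant follows, assuming that ties on weight should all be kept; the names targets, keepme, and near are hypothetical helpers, not part of the original answer.

```stata
sysuse auto, clear
keep weight mpg
summarize weight, detail
* collect the five target values (min, quartiles, max)
local targets `r(min)' `r(p25)' `r(p50)' `r(p75)' `r(max)'
generate byte keepme = 0
foreach t of local targets {
    * smallest observed weight at or above the target value
    summarize weight if weight >= `t', meanonly
    local near = r(min)
    replace keepme = 1 if weight == `near'
}
keep if keepme
drop keepme
sort weight mpg
list, sep(0)
```

If a target is itself an observed weight, it is kept unchanged; otherwise the next larger observed weight is kept. As in the answer above, a weight shared by several cars yields several rows.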

Thank you for all the suggestions; here is what I came up with. The idea is that I was pulling these 5 numbers so I could send them to Mata for a cubic spline that I am attempting to write.
For whatever reason trying to generalize this was giving me a headache.
My final solution:
sysuse auto, clear
preserve
sort weight
count if weight<.
keep if _n==1 | _n==ceil(r(N)/4) | _n==ceil(r(N)/2) | _n==ceil(3*r(N)/4) | _n==_N
gen X = weight
gen Y = mpg
list X Y
/* at this point I will send X and Y to mata for the cubic spline
routine that I am in the process of writing. It was this little step that
was bugging me. */
restore
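For the hand-off to Mata mentioned in the comment, one possible sketch uses st_data() to copy the two variables into a Mata matrix. The matrix name XY is hypothetical, and the spline routine itself is not shown because the post does not include it.

```stata
sysuse auto, clear
sort weight
count if weight < .
keep if _n==1 | _n==ceil(r(N)/4) | _n==ceil(r(N)/2) | _n==ceil(3*r(N)/4) | _n==_N
mata:
// copy weight and mpg into an N x 2 matrix: column 1 is X, column 2 is Y
XY = st_data(., ("weight", "mpg"))
XY
end
```

Wrap the block in preserve and restore, as in the post, if the full data set is needed afterwards.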

Related

Reordering panels by another variable in twoway, by() graphs

Suppose I make the following chart showing the weight of 9 pigs over time:
webuse pig
tw line weight week if inrange(id,1,9), by(id) subtitle(, nospan)
Is it possible to reorder the panels by another variable while retaining the original label? I can imagine defining another variable that is sorted the right way and then labeling it with the right id, but curious if there is a less clunky way of achieving that.
I think you are right: you need a new ordering variable. On the positive side, you can order on any criterion of your choice. Watch out for ties on the variable used to order, which can always be broken by referring to the original identifier. Here we sort on final weights, by default smallest first. (For largest first, negate the weight variable.)
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
egen newid = group(last id)
bysort newid : gen toshow = strofreal(id) + " (" + strofreal(last, "%2.1f") + ")"
* search labmask for download links
labmask newid , values(toshow)
set scheme s1color
line weight week, by(newid, note("")) sort xla(1/9)
Short papers discussing the principles here are already in train for publication in the Stata Journal in 2021.
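The remark about negating the ordering variable for largest-first panels can be sketched as follows. The variable neglast is a hypothetical name, and labmask is assumed to be installed already (it is community-contributed; search labmask for download links).

```stata
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
* negate so that the pig with the largest final weight gets newid == 1
gen neglast = -last
egen newid = group(neglast id)
gen toshow = strofreal(id) + " (" + strofreal(last, "%2.1f") + ")"
labmask newid, values(toshow)
line weight week, by(newid, note("")) sort xla(1/9)
```

Including id in group() still breaks any ties on the final weight.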

Unable to display statistics in the graph note() parameter

I want to display the total count of the data in the graph note().
I tried the following:
note(count)
However, this just displays the literal word "count".
I also tried to create a local variable but I am having difficulty just initializing it.
While I can do the following:
. local N = 100
. di `N'
100
I can't seem to do:
. local N = count
count not found
The total number of observations is stored in _N.
sysuse auto, clear
display _N
74
So the following works for me:
local N = _N
twoway scatter mpg price, note(Total no of observations: `N')
The total number of observations is kept in _N but it is not necessarily the number of observations used in a graph.
The command count displays a result and leaves a saved result, the number counted, in its wake as r(N). This is documented both in the help for count and in the manual entry.
Hence you can verify that this sequence leaves the note "74 observations" in the resulting graph.
. sysuse auto, clear
(1978 Automobile Data)
. count if mpg < .
74
. histogram mpg, note(`r(N)' observations)
(bin=8, start=12, width=3.625)
Note that no r-class command should intervene here between count and your use of its result. r-class saved results, like any other saved results, are overwritten easily. In many circumstances you are well advised, as you did, to store the result in a local macro, say by
. local N = r(N)
immediately after the count command and then refer to that later in the note().
This is the more general method: count with an if qualifier counts only the observations that satisfy the condition, while count by itself returns the total number of observations, so it also works when the total is directly what you want.
Combining the other answers, I ultimately did:
count
local N = r(N)
count if male
local N_male = r(N)
count if !male
local N_female = r(N)
...
note("N = `N'" " `N_male' (Male)" " `N_female' (Female)")
But I still can't get commas to render at the thousands and millions places.
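One possible way to get the separators is to convert each count to a string with a comma display format before putting it in the note. This is a sketch under assumptions: it uses the auto data with foreign as a stand-in for the male indicator, and the macro names are hypothetical.

```stata
sysuse auto, clear
count
* %12.0fc is a fixed format with comma separators; trim() drops any padding
local N = trim(string(r(N), "%12.0fc"))
count if foreign
local N_foreign = trim(string(r(N), "%12.0fc"))
* a large number shows the effect more clearly than the 74 cars here
local big = trim(string(1234567, "%12.0fc"))
display "`big'"
twoway scatter mpg price, note("N = `N'" "`N_foreign' (Foreign)")
```

Each r(N) is stored immediately after its count, since the next r-class command would overwrite it.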

How do I generate predicted counts from a negative binomial regression with a logged independent variable in Stata?

I have a set of data with a dependent variable that is a count, and several independent variables. My primary independent variable consists of large dollar values. If I divide the dollar values by 10,000 (to keep the coefficients manageable), the models (negative binomial and zero-inflated negative binomial) run in Stata and I can generate predicted counts with confidence intervals. However, theoretically it is more logical to take the natural log of this variable. When I do that, the models still run, but now the predicted counts only range between about 0.22 and 0.77. How do I fix this so the predicted counts are generated correctly?
Your question does not show any code or data. It's nearly impossible to know what is going wrong without these two ingredients. Your question reads as "I did some stuff to this other stuff with surprising results." In order to ask a good question, you should replicate your coding approach with a dataset that everyone has access to, like rod93.
Here's my attempt at that, which shows reasonably similar predictions with nbreg from both models:
webuse rod93, clear
replace exposure = exposure/10000
nbreg deaths exposure age_mos, nolog
margins
predictnl d1 =predict(n), ci(lb1 ub1)
/* Compare the prediction for the first obs by hand */
di exp(_b[_cons]+_b[age_mos]*age_mos[1]+_b[exposure]*exposure[1])
di d1[1]
gen ln_exp = ln(exposure)
nbreg deaths ln_e age_mos, nolog
margins
predictnl d2 =predict(n), ci(lb2 ub2)
/* Compare the prediction for the first obs by hand */
di exp(_b[_cons]+_b[age_mos]*age_mos[1]+_b[ln_e]*ln(exposure[1]))
di d2[1]
sum d? lb* ub*, sep(2)
This produces very similar predictions and confidence intervals:
. sum d? lb* ub*, sep(2)
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          d1 |         21    84.82903    25.44322   12.95853   104.1868
          d2 |         21     85.0432    25.24095   32.87827   105.1733
-------------+---------------------------------------------------------
         lb1 |         21    64.17752    23.19418   1.895858   80.72885
         lb2 |         21    59.80346    22.01917    10.9009   79.71531
-------------+---------------------------------------------------------
         ub1 |         21    105.4805    29.39726   24.02121   152.7676
         ub2 |         21    110.2829    29.16468   51.76427    143.856

Computing and plotting difference in group means

In what follows I plot the mean of an outcome of interest (price) by a grouping variable (foreign) for each possible value taken by the fake variable time:
sysuse auto, clear
gen time = rep78 - 3
bysort foreign time: egen avg_p = mean(price)
scatter avg_p time if (foreign==0 & time>=0) || ///
scatter avg_p time if (foreign==1 & time>=0), ///
legend(order(1 "Domestic" 2 "Foreign")) ///
ytitle("Average price") xlab(#3)
What I would like to do is to plot the difference in the two group means over time, not the two separate means.
I am surely missing something, but to me it looks complicated because the information about the averages is stored "vertically" (in avg_p).
Arguably the easiest way to do this is to use linear regression to estimate the differences:
/* Regression Way */
drop if time < 0 | missing(time)
reg price i.foreign##i.time
margins, dydx(foreign) at(time =(0(1)2))
marginsplot, noci title("Foreign vs Domestic Difference in Price")
If regression is hard to wrap your mind around, the other approach involves mangling the data with a reshape:
/* Transform the Data */
keep price time foreign
collapse (mean) price, by(time foreign)
reshape wide price, i(time) j(foreign)
gen diff = price1-price0
tw connected diff time
Here is another approach. graph dot will happily plot means.
sysuse auto, clear
set scheme s1color
collapse price if inrange(rep78, 3, 5), by(foreign rep78)
reshape wide price, i(rep78) j(foreign)
rename price0 Domestic
label var Domestic
rename price1 Foreign
label var Foreign
graph dot (asis) Domestic Foreign, over(rep78) vertical ///
marker(1, ms(Oh)) marker(2, ms(+))

Stata: How to top-code a variable

Using the auto.dta data, I want to top-code the PRICE variable at the median of top X%. For example, X% could be 3%, 4%, etc. How can I do this in Stata?
In answering your question, I am assuming that you want to replace all values in the top X% (say, the top 10%) with a single value, which in the following code is the 90th percentile.
Here is the sample code:
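capture program drop topcode
program topcode
    sysuse auto, clear
    * cutpoints for deciles: r(r9) is the 90th percentile
    pctile pct = price, nq(10)
    local cut = r(r9)
    dis `cut'
    gen newprice = price
    * top-code at the 90th percentile, leaving missing values alone
    replace newprice = `cut' if newprice > `cut' & !missing(newprice)
end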
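The question as literally posed asks for top-coding at the median of the top X%, which differs from top-coding at the cutoff percentile itself. A minimal sketch of that reading follows, assuming "top X%" means prices strictly above the (100 - X)th percentile; X = 10 and the macro names cutpct, cut, and topmed are illustrative choices.

```stata
sysuse auto, clear
local X = 10                      // hypothetical choice of X
local cutpct = 100 - `X'
* cutoff: the (100 - X)th percentile of price
_pctile price, percentiles(`cutpct')
local cut = r(r1)
* median of the prices that lie above the cutoff
summarize price if price > `cut' & !missing(price), detail
local topmed = r(p50)
generate newprice = price
* replace every price in the top X% by that median
replace newprice = `topmed' if price > `cut' & !missing(price)
```

Prices at or below the cutoff are left unchanged; changing the local X to 3 or 4 gives the other cases mentioned in the question.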