Stata: How to top-code a variable

Using the auto.dta data, I want to top-code the PRICE variable at the median of top X%. For example, X% could be 3%, 4%, etc. How can I do this in Stata?

In answering your question, I am assuming that you want to replace all values above a cutoff, say the top 10%, with the cutoff value itself (the 90th percentile in the following code).
Here is the sample code:
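program topcode
    sysuse auto, clear
    * deciles of price; the 9th cutpoint (90th percentile) is left in r(r9)
    pctile pct = price, nq(10)
    local p90 = r(r9)
    display `p90'
    generate newprice = price
    * cap values above the 90th percentile, leaving missing values alone
    replace newprice = `p90' if newprice > `p90' & !missing(newprice)
end
* run the program once it is defined
topcode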
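The question asks for the median of the top X% rather than the cutoff itself. A minimal sketch of that reading, assuming "top X%" means the observations strictly above the (100 - X)th percentile. The local names X, cutpct, cut, and topmed and the variable price_tc are illustrative and do not come from the answer above:

```stata
sysuse auto, clear
local X = 10                                  // assumed share to top-code, in percent
local cutpct = 100 - `X'
_pctile price, percentiles(`cutpct')          // cutoff percentile is left in r(r1)
local cut = r(r1)
* median of the observations above the cutoff
summarize price if price > `cut' & !missing(price), detail
local topmed = r(p50)
* replace values above the cutoff with that median; missing values stay missing
generate double price_tc = price
replace price_tc = `topmed' if price > `cut' & !missing(price)
```

Changing the local X to 3 or 4 gives the other cases mentioned in the question. With a small data set such as auto, only a handful of observations fall above the cutoff.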

Related

Reordering panels by another variable in twoway, by() graphs

Suppose I make the following chart showing the weight of 9 pigs over time:
webuse pig
tw line weight week if inrange(id,1,9), by(id) subtitle(, nospan)
Is it possible to reorder the panels by another variable while retaining the original label? I can imagine defining another variable that is sorted the right way and then labeling it with the right id, but curious if there is a less clunky way of achieving that.
I think you are right: you need a new ordering variable. On the positive side, you can order on any criterion you choose. Watch out for ties on the variable used to order, which can always be broken by referring to the original identifier. Here we sort on final weights, by default smallest first. (For largest first, negate the weight variable.)
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
egen newid = group(last id)
bysort newid : gen toshow = strofreal(id) + " (" + strofreal(last, "%2.1f") + ")"
* search labmask for download links
labmask newid , values(toshow)
set scheme s1color
line weight week, by(newid, note("")) sort xla(1/9)
Short papers discussing the principles here are already in train for publication in the Stata Journal in 2021.
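For largest first, a minimal sketch of the negation mentioned above. The name neglast is illustrative, and the labmask and graph steps that follow are unchanged:

```stata
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
* negate so that group() numbers the heaviest pig first; id still breaks ties
gen neglast = -last
egen newid = group(neglast id)
```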

Combine two plots in one graph using ciplot

I would like to plot the means and confidence intervals of two variables into one graph. I used ciplot to do this for only one variable, but for two this code is not working.
On the internet I found that you could combine the plots as follows:
ciplot relative_ambition12 relative_ambition22, by(quota)
However, if I run this I get the error:
no observations found
At the same time both of the following do produce graphs:
ciplot relative_ambition12, by(quota)
ciplot relative_ambition22, by(quota)
Does anyone know how I can combine these two graphs into one?
The community-contributed command ciplot expects to work on the same set of observations for all variables specified in varlist.
For example, the following works:
. sysuse auto, clear
. generate price2 = price + 500
. ciplot price price2, by(foreign)
However, the following does not:
. replace price2 = . if foreign == 1
. ciplot price price2, by(foreign)
no observations
r(2000);
Both plots can be graphed separately (i.e. if one variable at a time is specified).
When you have different sets of observations, you can use the inclusive option to produce the desired output to the extent possible:
. ciplot price price2, by(foreign) inclusive
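One way to see which groups are affected before plotting is to count complete cases by group. This is only a diagnostic sketch that reuses the auto example above; the variable name complete is illustrative. The missing() function with several arguments is true when any of them is missing:

```stata
sysuse auto, clear
generate price2 = price + 500
replace price2 = . if foreign == 1
* flag observations where both variables are non-missing
generate byte complete = !missing(price, price2)
* a group with no complete cases shows up as an empty cell
tabulate foreign complete
```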

Computing and plotting difference in group means

In what follows I plot the mean of an outcome of interest (price) by a grouping variable (foreign) for each possible value taken by the fake variable time:
sysuse auto, clear
gen time = rep78 - 3
bysort foreign time: egen avg_p = mean(price)
scatter avg_p time if (foreign==0 & time>=0) || ///
scatter avg_p time if (foreign==1 & time>=0), ///
legend(order(1 "Domestic" 2 "Foreign")) ///
ytitle("Average price") xlab(#3)
What I would like to do is to plot the difference in the two group means over time, not the two separate means.
I am surely missing something, but to me it looks complicated because the information about the averages is stored "vertically" (in avg_p).
Arguably the easiest way to do this is to use linear regression to estimate the differences:
/* Regression Way */
drop if time < 0 | missing(time)
reg price i.foreign##i.time
margins, dydx(foreign) at(time =(0(1)2))
marginsplot, noci title("Foreign vs Domestic Difference in Price")
If regression is hard to wrap your mind around, the other way involves mangling the data with a reshape:
/* Transform the Data */
keep price time foreign
collapse (mean) price, by(time foreign)
reshape wide price, i(time) j(foreign)
gen diff = price1-price0
tw connected diff time
Here is another approach. graph dot will happily plot means.
sysuse auto, clear
set scheme s1color
collapse price if inrange(rep78, 3, 5), by(foreign rep78)
reshape wide price, i(rep78) j(foreign)
rename price0 Domestic
label var Domestic
rename price1 Foreign
label var Foreign
graph dot (asis) Domestic Foreign, over(rep78) vertical ///
marker(1, ms(Oh)) marker(2, ms(+))

Stata Nearest neighbor of percentile

This has probably already been answered, but I must just be searching for the wrong terms.
Suppose I am using the built in Stata data set auto:
sysuse auto, clear
and say for example I am working with 1 independent and 1 dependent variable and I want to essentially compress down to the IQR elements, min, p(25), median, p(75), max...
so I use command,
keep weight mpg
sum weight, detail
return list
local min=r(min)
local lqr=r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)
keep if weight==`min' | weight==`max' | weight==`med' | weight==`lqr' | weight==`uqr'
Hence, I want to compress the data set down to only those 5 observations. In this situation, for example, the median is not actually an element of the weight vector: there is an observation above and an observation below (given the definition of the median, this is no surprise). Is there a way to tell Stata to look for the nearest neighbor above the percentile, i.e., if r(p50) is not an element of weight, to search above that value for the next observation?
The end result is I am trying to get the data down to 2 vectors, say weight and mpg such that for each of the 5 elements of weight in the IQR have their matching response in mpg.
Any thoughts?
I think you want something like:
clear all
set more off
sysuse auto
keep weight mpg
summarize weight, detail
local min = r(min)
local lqr = r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)
* differences between weights and its median
gen diff = abs(weight - `med')
* put the smallest difference in observation 1 (there can be several, watch out!)
isid diff weight mpg, sort
* replace the original median with the weight "closest" to the median
local med = weight[1]
keep if inlist(weight, `min', `lqr', `med', `uqr', `max')
drop diff
* pretty print
order weight mpg
sort weight mpg
list, sep(0)
Notice the median does not appear because we kept its "closest" neighbor instead (weight == 3,180).
Also, percentile 75 has two associated mpg values.
You could probably work something out with collapse and merge (and many more), but I'll leave it at this.
Use help <command> for whatever is not clear.
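The code above keeps the closest weight on either side of the median. If the nearest neighbor strictly at or above the percentile is wanted, as the question asks, a minimal sketch could look like this (med_up is an illustrative name):

```stata
sysuse auto, clear
summarize weight, detail
local med = r(p50)
* smallest observed weight at or above the median
summarize weight if weight >= `med' & !missing(weight), meanonly
local med_up = r(min)
list weight mpg if weight == `med_up'
```

The same pattern works for r(p25) and r(p75). If the percentile is itself an observed value, the condition weight >= `med' returns it unchanged.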
Thank you for all the suggestions; here is what I came up with. The idea is that I was pulling these 5 numbers so I could send them to Mata for a cubic spline that I am attempting to write.
For whatever reason trying to generalize this was giving me a headache.
My final solution:
sysuse auto, clear
preserve
sort weight
count if weight<.
keep if _n==1 | _n==ceil(r(N)/4) | _n==ceil(r(N)/2) | _n==ceil(3*r(N)/4) | _n==_N
gen X = weight
gen Y = mpg
list X Y
/* at this point I will send X and Y to mata for the cubic spline
routine that I am in the process of writing. It was this little step that
was bugging me. */
restore

Stata: Combine table command with ttest and output latex

For regression output, I usually use a combination of eststo to store estimations, estadd to add the R2 and additional tests, then esttab to output the lot.
I need to do the same with the table command. I need the mean, median and N for a variable across three by variables and would like to add stars for the result of a ttest==1 on the mean and signtest==1 on the median. I have three by variables, so I've been using table to collate the mean, median and N, which I'm calling like the following pseudo-code:
sysuse auto,clear
table foreign rep78 , ///
contents(mean price median price n price) format(%9.2f)
ttest price==1, by(foreign rep78)
signtest price=1, by(foreign rep78)
I've tried esttab and estpost to no avail. I've also looked at tabstat, tablemat and summarize as alternatives to table, but they don't allow three by variables.
How can I create this table, add the stars for the ttest and signtest p-values and output the full table?
The main point in your question seems to be producing a LaTeX table. However, the "pseudo-code" you show looks very much like Stata code, except that it is not legal syntax.
In particular, for ttest you can only have one variable in the by() option. But notice that ttest also allows the by: prefix (you can use both, in fact). The two serve different purposes. On the other hand, signtest does not allow a by() option but it does allow the by: prefix. So you should probably clarify what you want to do before creating the table.
If you are trying to use the by: prefix in both cases and afterwards produce a table, you can create a grouping variable, and put the commands in a loop. In this way, you can try tabulating the saved results for each group using the ESTOUT module (by Ben Jann in SSC). Something like:
clear all
set more off
sysuse auto
keep price foreign rep78
* create group variable
egen grou = group(foreign rep78)
* tests by group
forvalues i = 1/8 {
ttest price == 1 if grou == `i'
signtest price = 1 if grou == `i'
*<complete with estout syntax>
}
See help by, help egen (the group function), help estout and help saved results.
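As a rough sketch of how the per-group results might be collected before any table is built, the loop below stores the one-sample ttest results in a matrix. This is not the estout step elided above. The matrix name results and the local G are illustrative, and r(N_1), r(mu_1), and r(p) are the results ttest leaves behind after a one-sample test:

```stata
sysuse auto, clear
keep price foreign rep78
egen grou = group(foreign rep78)
* number of groups, instead of hardcoding 8
summarize grou, meanonly
local G = r(max)
* one row per group: N, mean, two-sided p-value of the one-sample t test
matrix results = J(`G', 3, .)
matrix colnames results = N mean p
forvalues i = 1/`G' {
    quietly ttest price == 1 if grou == `i'
    matrix results[`i', 1] = r(N_1)
    matrix results[`i', 2] = r(mu_1)
    matrix results[`i', 3] = r(p)
}
matrix list results
```

Taking the number of groups from the data keeps the loop correct if the grouping variables change.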