Trimming data in Stata - stata

I have a data set and want to drop 1% of data at one end. For example, I have 3000 observations and I want to drop the 30 highest ones. Is there a command for this kind of trimming? Btw, I am new to Stata.

You can use _pctile in Stata for that.
sysuse auto, clear
_pctile weight, nq(100)
* optional: list the stored percentiles
return list
* drop the top 1 percent (note: missing values also satisfy this condition)
drop if weight>r(r99)

If you know what the cutoff is for your drop you can use:
drop if var1>300
which drops all rows with var1 over 300.
You can use summarize var1, detail to get the key percentiles: it will give you 1% and 99% percentiles along with other standard percentiles.
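A minimal sketch of the summarize approach, assuming the built-in auto dataset and its weight variable stand in for your own data. With the detail option, summarize stores the 99th percentile in r(p99), which can be saved in a local macro (named cutoff here for illustration) and used as the threshold:

```stata
sysuse auto, clear
* summarize, detail stores percentiles in r(p1), r(p5), ..., r(p99)
summarize weight, detail
local cutoff = r(p99)
* Drop values above the 99th percentile; the missing() check keeps
* missing values, which Stata otherwise treats as larger than any number
drop if weight > `cutoff' & !missing(weight)
```

Whether missing values should be kept or dropped depends on your analysis; remove the missing() check if you want them gone as well.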

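To keep the top 30 observations in Stata, first sort the data in descending order of the variable of interest (for example, gsort -var1), since _n only refers to the current sort order. Then use:
keep if (_n<=30 )
To drop the top 30 observations instead, use the following command after the same sort:
keep if (_n>30)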
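Put together, dropping a fixed number of the highest observations might look like the sketch below. It assumes the built-in auto dataset, with weight standing in for your variable and k = 5 as an illustrative count (the question would use k = 30):

```stata
sysuse auto, clear
* Number of top observations to drop (illustrative value)
local k = 5
* Sort descending; gsort places missing values last unless mfirst is specified
gsort -weight
* Drop exactly k observations, even if there are ties at the boundary
drop in 1/`k'
```

Unlike the percentile approach, this always removes exactly k rows, so tied values at the cutoff are split arbitrarily.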

Related

How does one find the percentiles for a variable by a descriptor variable in Stata?

I'm trying to find the 10th and 90th percentiles for income by state in my dataset. I know the basic code to find the percentiles for the entire dataset would be as follows:
centile medhhinc_2019, centile(10 90)
How can I get these by each state instead of the entire dataset?
Actually, this ended up being quite simple.
tabstat medhhinc_2019, statistics(p10 p90) by(state)
Maybe you can use the user-written command astile: astile ffd = medhhinc_2019, nq(10) divides medhhinc_2019 into ten groups according to percentiles.
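If you need the group percentiles as variables rather than as a printed table, one possible sketch uses egen with bysort. The built-in auto dataset is assumed here, with foreign standing in for state and price for medhhinc_2019; the new variable names are illustrative:

```stata
sysuse auto, clear
* Attach each group's 10th and 90th percentile of price to every observation
bysort foreign: egen p10_price = pctile(price), p(10)
bysort foreign: egen p90_price = pctile(price), p(90)
* Show one row per group
tabstat p10_price p90_price, statistics(mean) by(foreign)
```

Having the percentiles in the data makes it easy to flag or drop observations outside the 10-90 range within each group.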

Stata estpost esttab: Generate table with mean of variable split by year and group

I want to create a table in Stata with the estout package to show the mean of a variable split by 2 groups (year and binary indicator) in an efficient way.
I found a solution, which is to split the main variable cash_at into 2 groups by hand through the generation of new variables, e.g. cash_at1 and cash_at2. Then, I can generate summary statistics with tabstat and get output with esttab.
estpost tabstat cash_at1 cash_at2, stat(mean) by(year)
esttab, cells("cash_at1 cash_at2")
Link to current result: http://imgur.com/2QytUz0
However, I'd prefer a horizontal table (e.g. year on the x axis) and a way to do it without splitting the groups by hand - is there a way to do so?
My preference in these cases is for year to be in rows and the statistic (e.g. mean) in the columns, but if you want to do it the other way around, there should be no problem.
For a table like the one you want it suffices to have the binary variable you already mention (which I name flag) and appropriate labeling. You can use the built-in table command:
clear all
set more off
* Create example data
set seed 8642
set obs 40
egen year = seq(), from(1985) to (2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
list, sepby(year)
* Define labels
label define lflag 0 "cash0" 1 "cash1"
label values flag lflag
* Table
table flag year, contents(mean cash)
In general, for tables, apart from the estout module you may want to consider also the user-written command tabout. Run ssc describe tabout for more information.
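A note on versions: as far as I know, the table command was redesigned in Stata 17 and contents() is no longer part of its main syntax. A hedged sketch of the equivalent call, assuming Stata 17 or later and the same example data as above:

```stata
clear all
set seed 8642
set obs 40
egen year = seq(), from(1985) to(2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
* Stata 17+: row variable in the first parentheses, column variable in the second
table (flag) (year), statistic(mean cash) nototals
* On Stata 17+, the older syntax should still run under version control:
* version 16: table flag year, contents(mean cash)
```

The nototals option suppresses the margin totals that the newer table command adds by default.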
On the other hand, it's not clear what you mean by "splitting groups by hand". You show no code for this operation, but as long as it's general enough for your purposes (and practical) I think you should allow for it. The code might not be as elegant as you wish but if it's doing what it's supposed to, I think it's alright. For example:
clear all
set more off
set seed 8642
set obs 40
* Create example data
egen year = seq(), from(1985) to (2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
* Data management
gen cash0 = cash if flag == 0
gen cash1 = cash if flag == 1
* Table
estpost tabstat cash*, stat(mean) by(year)
esttab, cells("cash0 cash1")
can be used for a table like the one you give in your original post. It's true you have two extra lines and variables, but they may be harmless. I agree with the idea that in general, efficiency is something you worry about once your program is behaving appropriately; unless of course, the lack of it prevents you from reaching that state.

New SAS variable conditional on observations

(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site IDnumb Date
a 1 6/12
b 2 3/4
c 4 5/1
a 3 .
d 5 .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
N %
Complete 3 75%
Last Month 1 25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
proc sort data=yourdata;
by site descending date;
run;
data yourdata_want;
set yourdata;
by site descending date;
* first.site is the row with the latest date (missing dates sort last);
if first.site then do;
comp = ifn(date>0,1,0);
output;
end;
run;
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort by SITE DESCENDING DATE, then by SITE with NODUPKEY. That way the latest date is up top. You could also format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).

SAS: Keep ONLY the highest-valued groups, for Box Plots

Presently I have so many "groups" when doing Box Plots that the result is SEVEN panels of box plots.
I'd like to have ONE panel, with about 20 box plots (or "groups").
So, that would require cutting out a bunch of groups.
Is there a way to automatically do this?
What I have in mind is: In a data step, only keep the TOP 20 groups, using Q3 value for each group as the criterion for keeping or removing.
Any coding assistance greatly appreciated.
Nicholas Kormanik
You are not providing much information, but you could sort the data by that value in descending order and then save the top 20 like this:
proc sort data=myGroups;
by descending q3;
run;
data myGroupsTop;
set myGroups(obs=20);
run;

How to export tabulations

I have a small project where I need to tabulate a dataset with frequencies in various ways and export those tables in a large Excel sheet. Unfortunately, copy and paste truncates text-labels and causes lots of other issues for us.
Is there a way to save/export the result into a CSV or Excel format?
That is, something similar to the write.table command in R, which I can't install at work.
Update 1:
The Stata FAQ provided three solutions which would work for us: http://www.stata.com/support/faqs/data-management/copying-tables/. Shortly after pointing us to the FAQ, Stata support sent a follow-up mail with a link to tabout, and the tutorial displayed some truly beautiful tabulations.
We've made some progress with tabout, but we are not yet sure whether it will do everything we need. So far, creating tabulations with tabout D7 test.xls works nicely, although without the proper alignment of labels that you would get when generating LaTeX.
Update 2:
OK, so lots of tables weren't as straightforward as with tabulate and the by command in combination - some programming was required (not done at current Stata skill-level). The lack of native support for just exporting any result out is a real pain!
outreg is not going to work, as it only works with estimation (regression-like) results. xml_tab can probably produce anything you like (findit xml_tab to install). Obviously, you can export excel your data, although if you need frequency tables, you probably would want to collapse (count) ..., by(varlist) your data first. (I hate collapse though, as I think it is a poor idea that you need to destroy and reload your data; this is one example where R's concept of objects comes handier than Stata's idea of having only one data set in memory at a time.)
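A rough sketch of the built-in route for a one-way frequency table, using contract instead of collapse, with preserve and restore so the original data come back afterwards. The auto dataset, the variable rep78, the new variable names n and pct, and the file name rep78_freq.csv are all placeholders:

```stata
sysuse auto, clear
* preserve/restore brings the original data back after the table is written
preserve
* One row per value of rep78, with its frequency and percentage
contract rep78, freq(n) percent(pct)
* "rep78_freq.csv" is a placeholder file name
export delimited using "rep78_freq.csv", replace
restore
```

Swapping export delimited for export excel should give an Excel file instead of a CSV.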
Whenever I want the tabulated output of any command, whether tabulate, regress, or clogit, I close the current log file and begin a new one in plain-text format (with a .log suffix rather than .smcl). This is handy because I usually want to keep a lot of the values that clogit returns.
Something along the lines of:
* close any open log (capture suppresses the error if none is open)
capture log close
log using NAMEOFOUTPUT.log
* run something like tab, reg, or clogit here
log close
Your tabulated results from whichever command will then be in that .log file.
Could outreg be a solution?
http://www.kellogg.northwestern.edu/rc/stata-outreg.htm
Since the above will only do regression tables, estout is a good alternative. And the estpost command, I believe, creates tables for tabulations:
http://repec.org/bocode/e/estout/estpost.html
For one-way frequency tables, the fre module can be quite handy too. Output can be written to a tab-delimited table or to LaTeX.
sysuse auto, clear
fre rep78
rep78 -- Repair Record 1978
-----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
--------------+--------------------------------------------
Valid   1     |          2       2.70       2.90       2.90
        2     |          8      10.81      11.59      14.49
        3     |         30      40.54      43.48      57.97
        4     |         18      24.32      26.09      84.06
        5     |         11      14.86      15.94     100.00
        Total |         69      93.24     100.00
Missing .     |          5       6.76
Total         |         74     100.00
-----------------------------------------------------------
Download and more info on SSC:
http://ideas.repec.org/c/boc/bocode/s456835.html