Collapse only a subset of the dataset with "if" - stata

I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.
With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.
For example, replace does not alter the data for which foreign != 1:
. sysuse auto, clear
(1978 Automobile Data)
. replace mpg = 16 if foreign == 1
(22 real changes made)
However, it appears that collapse applies to the data that meets the if criteria and drops the rest:
. count if mpg > -1
74
. * all the data has mpg > -1
. count if foreign == 1
22
. collapse (mean) mpg if foreign == 1
. count if mpg > -1
1
There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 data intact, while collapsing all the foreign == 1 data to one observation.
That is in fact what I want to do with my data, so what should I do differently?
@NickCox helpfully suggested something like this:
. save "temp/whatever"
file temp/whatever.dta saved
. sysuse auto, clear
(1978 Automobile Data)
. drop if foreign == 1
(22 observations deleted)
. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)
That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.
Any other suggestions would be appreciated.

collapse with if works this way:
Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.
Those observations not selected disappear.
It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
(The community has often asked for save with if: savesome from SSC is one work-around.)
If you want to collapse some of the observations but leave the others unchanged, then you can try
A. this strategy
A1. use your dataset
A2. keep only the observations you want unchanged and save them
A3. use your dataset again
A4. collapse to taste
A5. append the dataset from A2
sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic
or B. this one:
B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed
B2. collapse feeding that identifier to by().
sysuse auto, clear
replace make = "All foreign" if foreign
collapse mpg, by(make)
Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.
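A variant of strategy A can be sketched with preserve, restore and a temporary file, so that no permanent file name is hard-coded and the dataset is not read from its original location a second time. This is only a sketch on the auto data; the macro name domestic is chosen here for illustration, and preserve itself has a cost on very large datasets, so it has not been shown to be faster.

```stata
sysuse auto, clear

* Set aside the observations that should stay unchanged
preserve
keep if !foreign
tempfile domestic
save `domestic'
restore

* Collapse the selected observations and label the single result
collapse (mean) mpg if foreign
gen make = "All foreign"

* Put the untouched observations back
append using `domestic'
```

The result holds the 52 domestic cars plus one observation for all foreign cars; variables other than mpg and make are missing for the collapsed observation.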


How can I do =INDEX(_, MATCH(_, _,0)) of Excel In Stata?

I would like to use the same concept as =INDEX(_,MATCH(_,_,0)) of Excel in Stata 12, exclusively using Stata programming.
Is there a way to match one value with a column (say variable A), and then give another column (say variable B) as the output?
It is not a good idea to rely on Stata users knowing what MS Excel functions do: many knowledgeable Stata users don't use MS Excel. Conversely, it's a good idea to put forward your failed attempts. See https://stackoverflow.com/help/asking on asking good questions.
Can the following be what you want?
clear
set more off
*----- example data -----
sysuse auto
keep make foreign
bysort foreign (make) : keep if _n == 1
list, nolabel
*----- what you want ? -----
// two cases
list make if foreign == 1
list make if foreign == 0
Run findit vlookup to find a user-written command that does just that in Stata.
I believe you are looking for the merge command. Type help merge for an explanation.
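A minimal sketch of the merge approach, assuming a lookup table with one row per value of A that holds the wanted value B. The data, the values and the macro name lookup are made up for illustration.

```stata
* Hypothetical lookup table: one row per key A with its value B
clear
input A B
1 10
2 20
3 30
end
tempfile lookup
save `lookup'

* Hypothetical main data holding the keys to look up
clear
input A
2
3
3
5
end

* Bring B in for every matching key; unmatched keys keep B missing
merge m:1 A using `lookup', keep(master match)
list
```

Keys with no match in the lookup table (here A == 5) are kept with B missing, roughly what a failed MATCH would signal in a spreadsheet.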

Stopping at the variable before a specified variable in a varlist

I'm stuck on a tricky data management question, which I need to do in Stata. I'm using version 13.1.
I have more than 40 datasets I need to work on using a subset of variables that is different in each dataset. I can't include the data or specific analysis I'm doing for proprietary reasons but will try to include examples and code.
I have a set of datasets, A-Z. Each has a set of questions, Q1 through Q200. I need to do an analysis that includes a varlist entry on each dataset that excludes the last few questions (which deal with background info). I know this background info starts with a certain question (e.g. "MALE / FEMALE") although the actual number for that question varies by dataset.
Here's what I have done so far:
foreach X in A B C D E F {
use `X'_YEAR.dta, clear
lookfor "MALE/FEMALE"
local torename = r(varlist)
rename `torename' MF
ANALYSIS Q1 - MF
}
That works but the problem is I'm including the variable that's actually the beginning of where I should start excluding. I know that I can save the varlist as a macro and then use the placement in the macro to exclude, for example, the seventh variable.
However, I'm stuck on taking that a step further - using this as an entry in the varlist to stop at the variable MF. Something like ANALYSIS Q1 - (MF - 1).
Does anyone know if something like that is possible?
I've searched for this issue on this site and Google and haven't found a good solution.
Apologies if this is a simple issue I've missed.
Here's one approach building on your code.
. sysuse auto.dta, clear
(1978 Automobile Data)
. quiet describe, varlist
. local vars `r(varlist)'
. display "vars - `vars'"
vars - make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign
. lookfor "Circle"
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
turn            int     %8.0g                 Turn Circle (ft.)
. local stopvar `r(varlist)'
. display "stopvar - `stopvar'"
stopvar - turn
. local myvars
. foreach var in `vars' {
2. if "`var'" == "`stopvar'" continue, break
3. local myvars `myvars' `var'
4. }
. display "myvars - `myvars'"
myvars - make price mpg rep78 headroom trunk weight length
And then just use `myvars' wherever you need the list of analysis variables. Alternatively, if your variable list always starts with Q1, you can change the local within the loop to
local lastvar `var'
and use
Q1-`lastvar'
for the list of analysis variables.
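The second variant can be sketched in full as follows. It uses the auto data, with make standing in for Q1; the stop variable is set directly here as an assumption, whereas in practice it would come from r(varlist) after lookfor.

```stata
sysuse auto, clear
quietly describe, varlist
local vars `r(varlist)'

* Assumed here for illustration; normally taken from lookfor's r(varlist)
local stopvar turn

* Remember the last variable seen before reaching the stop variable
local lastvar
foreach var of local vars {
    if "`var'" == "`stopvar'" continue, break
    local lastvar `var'
}

* Use the range as a varlist; make plays the role of Q1
describe make-`lastvar'
```

If the stop variable happens to be the first variable in the dataset, lastvar stays empty and the range is invalid, so that case would need its own check.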

Stata estpost esttab: Generate table with mean of variable split by year and group

I want to create a table in Stata with the estout package to show the mean of a variable split by 2 groups (year and binary indicator) in an efficient way.
I found a solution, which is to split the main variable cash_at into 2 groups by hand through the generation of new variables, e.g. cash_at1 and cash_at2. Then, I can generate summary statistics with tabstat and get output with esttab.
estpost tabstat cash_at1 cash_at2, stat(mean) by(year)
esttab, cells("cash_at1 cash_at2")
Link to current result: http://imgur.com/2QytUz0
However, I'd prefer a horizontal table (e.g. year on the x axis) and a way to do it without splitting the groups by hand - is there a way to do so?
My preference in these cases is for year to be in rows and the statistic (e.g. mean) in the columns, but if you want to do it the other way around, there should be no problem.
For a table like the one you want it suffices to have the binary variable you already mention (which I name flag) and appropriate labeling. You can use the built-in table command:
clear all
set more off
* Create example data
set seed 8642
set obs 40
egen year = seq(), from(1985) to(2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
list, sepby(year)
* Define labels
label define lflag 0 "cash0" 1 "cash1"
label values flag lflag
* Table
table flag year, contents(mean cash)
In general, for tables, apart from the estout module you may want to consider also the user-written command tabout. Run ssc describe tabout for more information.
On the other hand, it's not clear what you mean by "splitting groups by hand". You show no code for this operation, but as long as it's general enough for your purposes (and practical) I think you should allow for it. The code might not be as elegant as you wish but if it's doing what it's supposed to, I think it's alright. For example:
clear all
set more off
set seed 8642
set obs 40
* Create example data
egen year = seq(), from(1985) to(2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
* Data management
gen cash0 = cash if flag == 0
gen cash1 = cash if flag == 1
* Table
estpost tabstat cash*, stat(mean) by(year)
esttab, cells("cash0 cash1")
can be used for a table like the one you give in your original post. It's true you have two extra lines and variables, but they may be harmless. I agree with the idea that in general, efficiency is something you worry about once your program is behaving appropriately; unless of course, the lack of it prevents you from reaching that state.

How to import dates from Excel into Stata

I'm using Stata 12.0.
I have a CSV file of exposures for days of the year e.g. 01/11/2002 (DMY).
I want these imported into Stata and it to recognise that it is a date variable. I've been using:
insheet using "FILENAME", comma
But by doing this I am only getting the dates as labels rather than names of the variables. I guess this is because Stata doesn't allow variable names to start with numbers. I have tried to reformat the cells as Dates in Excel and import but then Stata thinks the whole column is a Date and changes the exposure data into dates.
Any advice on the best course of action is appreciated...
As commented elsewhere, I too think you probably have a dataset that is best formatted as panel data. However, I address first the specific problem I think you have according to your question. Then I show some code in case you are interested in switching to a panel structure.
Here is an example CSV file open as a spreadsheet:
And here the same file, open in a text editor. Imagine the ; are ,. This is related to my system's language settings.
Running this (in your case, use comma in place of delimiter(";")):
clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")
results in
which I think is the problem you describe: dates as variable labels. You would like to have the dates as variable names. One solution is to use a loop and strtoname() to rename the variables based on the variable labels. The following goes after importing with insheet:
foreach var of varlist * {
local j = "`: variable l `var''"
local newname = strtoname("`j'", 1)
rename `var' `newname'
}
The result is
The function strtoname() will substitute out the illegal characters for _'s. See help strtoname.
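As a small illustration of what strtoname() does to a date-like label (the date string is just the example from the question):

```stata
* Slashes are not legal in names, so they become underscores;
* the second argument 1 asks for a leading underscore when the
* string starts with a digit
local newname = strtoname("01/11/2002", 1)
display "`newname'"
```

This should display _01_11_2002, which is why the reshape later in this answer uses _ as the stub.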
Now, if you want to work with a panel structure, one way would be:
clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")
* Rename variables
foreach var of varlist * {
local j = "`: variable l `var''"
local newname = strtoname("`j'", 1)
rename `var' `newname'
}
* Generate ID
generate id = _n
* Change to long format
reshape long _, i(id) j(dat) string
* Sensible name
rename _ metric
* Generate new date variable
gen dat2 = date(dat,"DMY", 2050)
format dat2 %d
list, sepby(id)
As you can see, there's no need to do anything beforehand in Excel or in an editor. Stata seems to be enough in this case.
Note: I've reused code from http://www.stata.com/statalist/archive/2008-09/msg01316.html.
A further note on performance: a CSV file with 122 variables or days (columns) and 10,000 observations or subjects (rows), plus 1 header row, will produce 1,220,000 observations after the reshape. I tested this on an old machine with a 1.79 GHz AMD processor and 640 MB RAM, and the reshape took approximately 8 minutes. Stata 12 has a hard limit of 2,147,483,647 observations (although available RAM determines whether you can actually reach it), and Stata SE has a limit of 32,767 variables.
There seems to be some confusion here between the names that variables may have, the values that variables may have and the types that they may have.
Thus, the statement "Stata doesn't allow variables to start with numbers" appears to be a reference to Stata's rules for variable names; if it were true, numeric variables would be impossible.
Stata has no variable (i.e. storage) type that is a date. Strictly, it has no concept of a date variable, but dates may be held as strings or numbers. Dates may be held as strings insofar as any text indicating a date is likely to be a string that Stata can hold. This is flexible, but not especially useful. For almost all useful work, dates need to be converted to integers and then assigned a display format that matches their content to be readable by people. Stata has various conventions here, e.g. that daily dates are held as integers with 0 meaning 1 January 1960.
It seems likely in your case that daily dates are being imported as strings: if so, the function date() (also known as daily()) may be used to convert to an integer date. The example here just uses the minimal default display format for daily dates: friendlier formats exist.
. set obs 1
obs was 0, now 1
. gen sdate = "12/03/12"
. gen ndate = daily(sdate, "DMY", 2050)
. format ndate %td
. l
     +----------------------+
     |    sdate       ndate |
     |----------------------|
  1. | 12/03/12   12mar2012 |
     +----------------------+
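As one example of a friendlier display format, the same conversion can be sketched with a longer %td format. The particular format string is just one choice among many; see help datetime display formats.

```stata
clear
set obs 1
gen sdate = "12/03/12"

* Convert the string to a daily date; 2050 is the latest year
* a two-digit year may be mapped to
gen ndate = daily(sdate, "DMY", 2050)

* Day, full month name and four-digit year, separated by spaces
format ndate %tdDD_Month_CCYY
list
```

The stored value of ndate is unchanged by format; only the way it is displayed differs.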
If your variable names are being misread, as guessed by @ChrisP, you may need to tell us more. A short and concrete example is worth more than a longer verbal description.

Trimming data in Stata

I have a data set and want to drop 1% of data at one end. For example, I have 3000 observations and I want to drop the 30 highest ones. Is there a command for this kind of trimming? Btw, I am new to Stata.
You can use _pctile in Stata for that.
sysuse auto, clear
_pctile weight, nq(100)
return list // this is optional: shows the stored percentiles
drop if weight>r(r99) // drops values above the 99th percentile
If you know what the cutoff is for your drop you can use:
drop if var1>300
which drops all rows with var1 over 300.
You can use summarize var1, detail to get the key percentiles: it will give you 1% and 99% percentiles along with other standard percentiles.
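To keep the first 30 observations in the current sort order, use:
keep if (_n<=30)
To drop the first 30 observations instead, use:
keep if (_n>30)
Note that _n is the position of an observation in the current sort order, so these pick out the 30 highest values only if the data have first been sorted in descending order of the variable of interest (for example with gsort -var1).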
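Dropping exactly the k highest values, rather than everything above a percentile, can be sketched by sorting in descending order first. This uses the auto data and weight as stand-ins, with k set to 5 only for illustration (the question would use 30).

```stata
sysuse auto, clear

* Number of highest observations to drop (30 in the question)
local k 5

* Sort from highest to lowest; by default gsort puts missing
* values last in a descending sort
gsort -weight

* Drop the first k observations, i.e. the k highest values
drop in 1/`k'
```

If several observations tie at the cutoff, which of them is dropped is arbitrary; the percentile approach with _pctile treats ties consistently but may not remove exactly k observations.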