Stopping at the variable before a specified variable in a varlist - stata

I'm stuck on a tricky data management question, which I need to do in Stata. I'm using version 13.1.
I have more than 40 datasets I need to work on using a subset of variables that is different in each dataset. I can't include the data or specific analysis I'm doing for proprietary reasons but will try to include examples and code.
I have a set of datasets, A-Z. Each has a set of questions, Q1 through Q200. I need to do an analysis that includes a varlist entry on each dataset that excludes the last few questions (which deal with background info). I know this background info starts with a certain question (e.g. "MALE / FEMALE") although the actual number for that question varies by dataset.
Here's what I have done so far:
foreach X in A B C D E F {
    use `X'_YEAR.dta, clear
    lookfor "MALE/FEMALE"
    local torename = r(varlist)
    rename `torename' MF
    ANALYSIS Q1 - MF
}
That works but the problem is I'm including the variable that's actually the beginning of where I should start excluding. I know that I can save the varlist as a macro and then use the placement in the macro to exclude, for example, the seventh variable.
However, I'm stuck on taking that a step further - using this as an entry in the varlist to stop at the variable MF. Something like ANALYSIS Q1 - (MF - 1).
Does anyone know if something like that is possible?
I've searched for this issue on this site and Google and haven't found a good solution.
Apologies if this is a simple issue I've missed.

Here's one approach building on your code.
. sysuse auto.dta, clear
(1978 Automobile Data)
. quiet describe, varlist
. local vars `r(varlist)'
. display "vars - `vars'"
vars - make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign
. lookfor "Circle"
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
turn            int     %8.0g                 Turn Circle (ft.)
. local stopvar `r(varlist)'
. display "stopvar - `stopvar'"
stopvar - turn
. local myvars
. foreach var in `vars' {
  2.     if "`var'" == "`stopvar'" continue, break
  3.     local myvars `myvars' `var'
  4. }
. display "myvars - `myvars'"
myvars - make price mpg rep78 headroom trunk weight length
And then just use `myvars' wherever you need the list of analysis variables. Alternatively, if your variable list always starts with Q1, you can change the local within the loop to
local lastvar `var'
and use
Q1-`lastvar'
for the list of analysis variables.
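The same idea can also be sketched without a loop, using macro list functions to find the position of the stop variable and pick the name just before it. This is only a sketch on the auto data: the macro names allvars, pos and lastvar are made up for illustration, and it assumes lookfor finds at least one match and that the match is not the first variable in the dataset (otherwise the position arithmetic fails).

```stata
sysuse auto, clear

* Find the first variable of the block to exclude (here, via its label)
quietly lookfor "Circle"
local stopvar : word 1 of `r(varlist)'

* All variable names, in dataset order
unab allvars : _all

* Position of the stop variable, then the name just before it
local pos : list posof "`stopvar'" in allvars
local lastvar : word `=`pos' - 1' of `allvars'

* make-`lastvar' now stops one variable short of `stopvar'
describe make-`lastvar'
```

In the original loop over datasets, Q1-`lastvar' would take the place of make-`lastvar'.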


Is there a code for including a p-value to test for normality of variables?

I'd like to include a statistic in my summary statistics table within Stata, with the summarize command. Is there any possibility or other convenient way to include p-values (of the normality test) of the variables included? Would be very helpful. I'm using Stata 17.
There are many tests for normality, and several are included in Stata. The Shapiro-Wilk test is a modern classic, and quite popular. The more recent Doornik-Hansen test has been calibrated over a wide range of situations. Here's a token example:
. sysuse auto, clear
(1978 automobile data)
. swilk mpg
                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         mpg |         74    0.94821      3.335     2.627    0.00430
. mvtest normality mpg
Test for multivariate normality
Doornik-Hansen chi2(2) = 12.366 Prob>chi2 = 0.0021
The P-value is typically returned as an r-class value, so read the documentation for each command, and try return list after running each.
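As a rough illustration of picking up the stored result, the sketch below runs swilk quietly on a few variables and displays r(p) for each. The choice of variables and the display layout are arbitrary; check return list after swilk in your own version to confirm the names of the stored results.

```stata
sysuse auto, clear

* Show the Shapiro-Wilk P-value for each variable in a short list
foreach v of varlist price mpg weight {
    quietly swilk `v'
    display "`v'" _col(15) %6.4f r(p)
}
```

The values could equally be stored in a matrix or posted to a results file for use in a table.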

Collapse only a subset of the dataset with "if"

I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.
With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.
For example, replace does not alter the data for which foreign != 1:
. sysuse auto, clear
(1978 Automobile Data)
. replace mpg = 16 if foreign == 1
(22 real changes made)
However, it appears that collapse applies to the data that meets the if criteria and drops the rest:
. count if mpg > -1
74
. * all the data has mpg > -1
. count if foreign == 1
22
. collapse (mean) mpg if foreign == 1
. count if mpg > -1
1
There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.
That is in fact what I want to do with my data, so what should I do differently?
@NickCox helpfully suggested something like this:
. save "temp/whatever"
file temp/whatever.dta saved
. sysuse auto, clear
(1978 Automobile Data)
. drop if foreign == 1
(22 observations deleted)
. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)
That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.
Any other suggestions would be appreciated.
collapse with if works this way:
Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.
Those observations not selected disappear.
It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears.
(The community has often asked for save with if: savesome from SSC is one work-around.)
If you want to collapse some of the observations but leave the others unchanged, then you can try
A. this strategy
A1. use your dataset
A2. keep if what you want unchanged and save those observations
A3. use your dataset again
A4. collapse to taste
A5. append the dataset from A2
sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic
or B. this one:
B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed
B2. collapse feeding that identifier to by().
sysuse auto, clear
replace make = "All foreign" if foreign
collapse mpg, by(make)
Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.

Quantile Regression with Quantiles based on independent variable

I am attempting to run a quantile regression on monthly observations (of mutual fund characteristics). What I would like to do is distribute my observations in quintiles for each month (my dataset comprises 99 months). I want to base the quintiles on a variable (lagged fund size i.e. Total Net Assets) that will be later employed as an independent variable to explain fund performance.
What I already tried to do is use the qreg command, but that uses quantiles based on the dependent variable not the independent variable that is needed.
Moreover I tried to use the xtile command to create the quintiles; however, the by: command is not supported.
. by Date: xtile QLagTNA= LagTNA, nq(5)
xtile may not be combined with by
r(190);
Is there a (combination of) command(s) which saves me from creating quintiles manually on a month-by-month basis?
Statistical comments first before getting to your question, which has two Stata answers at least.
Quantile regression is defined by prediction of quantiles of the response (what you call the dependent variable). You may or may not want to do that, but using quantile-based groups for predictors does not itself make a regression a quantile regression.
Quantiles (here quintiles) are values that divide a variable into bands of defined frequency. Here you want the 0, 20, 40, 60, 80, 100% points. The bands, intervals or groups themselves are not best called quantiles, although many statistically-minded people would know what you mean.
What you propose seems common in economics and business, but it is still degrading the information in the data.
All that said, you could always write a loop using forval, something like this
egen group = group(Date)
su group, meanonly
gen QLagTNA = .
quietly forval d = 1/`r(max)' {
    xtile work = LagTNA if group == `d', nq(5)
    replace QLagTNA = work if group == `d'
    drop work
}
For more, see this link
But you will probably prefer to download a user-written egen function [correct term here] to do this
ssc inst egenmore
h egenmore
The function you want is xtile().
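A minimal sketch of that call, assuming egenmore has been installed from SSC and using the variable names from the question (Date, LagTNA, QLagTNA); see h egenmore for the exact options available in your copy.

```stata
* Quintile groups of LagTNA, computed separately within each month
egen QLagTNA = xtile(LagTNA), by(Date) nq(5)

* Check the group sizes
tabulate QLagTNA
```

This should give the same grouping as a month-by-month loop over xtile, in one line.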

Stata estpost esttab: Generate table with mean of variable split by year and group

I want to create a table in Stata with the estout package to show the mean of a variable split by 2 groups (year and binary indicator) in an efficient way.
I found a solution, which is to split the main variable cash_at into 2 groups by hand through the generation of new variables, e.g. cash_at1 and cash_at2. Then, I can generate summary statistics with tabstat and get output with esttab.
estpost tabstat cash_at1 cash_at2, stat(mean) by(year)
esttab, cells("cash_at1 cash_at2")
Link to current result: http://imgur.com/2QytUz0
However, I'd prefer a horizontal table (e.g. year on the x axis) and a way to do it without splitting the groups by hand - is there a way to do so?
My preference in these cases is for year to be in rows and the statistic (e.g. mean) in the columns, but if you want to do it the other way around, there should be no problem.
For a table like the one you want it suffices to have the binary variable you already mention (which I name flag) and appropriate labeling. You can use the built-in table command:
clear all
set more off
* Create example data
set seed 8642
set obs 40
egen year = seq(), from(1985) to(2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
list, sepby(year)
* Define labels
label define lflag 0 "cash0" 1 "cash1"
label values flag lflag
* Table
table flag year, contents(mean cash)
In general, for tables, apart from the estout module you may want to consider also the user-written command tabout. Run ssc describe tabout for more information.
On the other hand, it's not clear what you mean by "splitting groups by hand". You show no code for this operation, but as long as it's general enough for your purposes (and practical) I think you should allow for it. The code might not be as elegant as you wish but if it's doing what it's supposed to, I think it's alright. For example:
clear all
set more off
set seed 8642
set obs 40
* Create example data
egen year = seq(), from(1985) to(2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
* Data management
gen cash0 = cash if flag == 0
gen cash1 = cash if flag == 1
* Table
estpost tabstat cash*, stat(mean) by(year)
esttab, cells("cash0 cash1")
can be used for a table like the one you give in your original post. It's true you have two extra lines and variables, but they may be harmless. I agree with the idea that in general, efficiency is something you worry about once your program is behaving appropriately; unless of course, the lack of it prevents you from reaching that state.

How to import dates from Excel into Stata

I'm using Stata 12.0.
I have a CSV file of exposures for days of the year e.g. 01/11/2002 (DMY).
I want these imported into Stata and it to recognise that it is a date variable. I've been using:
insheet using "FILENAME", comma
But by doing this I am only getting the dates as labels rather than names of the variables. I guess this is because Stata doesn't allow variable names to start with numbers. I have tried to reformat the cells as Dates in Excel and import but then Stata thinks the whole column is a Date and changes the exposure data into dates.
Any advice on the best course of action is appreciated...
As commented elsewhere, I too think you probably have a dataset that is best formatted as panel data. However, I address first the specific problem I think you have according to your question. Then I show some code in case you are interested in switching to a panel structure.
Here is an example CSV file open as a spreadsheet:
And here the same file, open in a text editor. Imagine the ; are ,. This is related to my system's language settings.
Running this (in your case, use comma in place of delimiter(";")):
clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")
results in
which I think is the problem you describe: dates as variable labels. You would like to have the dates as variable names. One solution is to use a loop and strtoname() to rename the variables based on the variable labels. The following goes after importing with insheet:
foreach var of varlist * {
    local j = "`: variable l `var''"
    local newname = strtoname("`j'", 1)
    rename `var' `newname'
}
The result is
The function strtoname() will replace the illegal characters with underscores (_). See help strtoname.
Now, if you want to work with a panel structure, one way would be:
clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")
* Rename variables
foreach var of varlist * {
    local j = "`: variable l `var''"
    local newname = strtoname("`j'", 1)
    rename `var' `newname'
}
* Generate ID
generate id = _n
* Change to long format
reshape long _, i(id) j(dat) string
* Sensible name
rename _ metric
* Generate new date variable
gen dat2 = date(dat,"DMY", 2050)
format dat2 %d
list, sepby(id)
As you can see, there's no need to do anything beforehand in Excel or in an editor. Stata seems to be enough in this case.
Note: I've reused code from http://www.stata.com/statalist/archive/2008-09/msg01316.html.
A further note on performance: a CSV file with 122 variables or days (columns) and 10,000 observations or subjects (rows), plus 1 header row, will produce 1,220,000 observations after the reshape. I have tested this on an old machine with a 1.79 GHz AMD processor and 640 MB RAM, and the reshape takes approximately 8 minutes. Stata 12 has a hard limit of 2,147,483,647 observations (although available RAM determines whether you can actually reach it), and Stata SE has a limit of 32,767 variables.
There seems to be some confusion here between the names that variables may have, the values that variables may have and the types that they may have.
Thus, the statement "Stata doesn't allow variables to start with numbers" appears to be a reference to Stata's rules for variable names; if it were true, numeric variables would be impossible.
Stata has no variable (i.e. storage) type that is a date. Strictly, it has no concept of a date variable, but dates may be held as strings or numbers. Dates may be held as strings insofar as any text indicating a date is likely to be a string that Stata can hold. This is flexible, but not especially useful. For almost all useful work, dates need to be converted to integers and then assigned a display format that matches their content to be readable by people. Stata has various conventions here, e.g. that daily dates are held as integers with 0 meaning 1 January 1960.
It seems likely in your case that daily dates are being imported as strings: if so, the function date() (also known as daily()) may be used to convert to an integer date. The example here just uses the minimal default display format for daily dates: friendlier formats exist.
. set obs 1
obs was 0, now 1
. gen sdate = "12/03/12"
. gen ndate = daily(sdate, "DMY", 2050)
. format ndate %td
. l
     +----------------------+
     |    sdate       ndate |
     |----------------------|
  1. | 12/03/12   12mar2012 |
     +----------------------+
If your variable names are being misread, as guessed by @ChrisP, you may need to tell us more. A short and concrete example is worth more than a longer verbal description.