Why does Stata complain with a cryptic error when I use string variables in the table command?
Consider the following toy example:
sysuse auto, clear
decode foreign, g(foreign_str)
table foreign, contents(n foreign_str mean mpg)
This raises an r(111) variable __000002 not found error in Stata 13.1.
Tracing the error tells me that it is trying to run format __000002 %9.0gc and crashing when it does not find the variable. If I switch the order of the variables in the clist, that is, I run table foreign, contents(mean mpg n foreign_str), I get the same error but with __000003 instead of __000002.
So it appears that Stata crashes when it finds the string variable. If I replace the string variable with a numeric variable, the error doesn't occur.
I know it is not meaningful to compute summary statistics on string variables, but counting the number of observations of a string variable (in each group specified by the rowvar) makes perfect sense.
Stata complains because variable __000002 (or __000003 if you change the order) is not created by the collapse command (which is used internally by table) due to the following error:
collapse (count) foreign_str
type mismatch
r(109);
What really happens is not visible to the user because capture is used in combination with collapse; the output from trace confirms this:
- capture collapse `clist' `wgt', by(`varlist' `by') fast `cw'
= capture collapse (count) __000002=foreign_str (mean) __000003=mpg , by(foreign ) fast
There are only provisions for error codes 111 and 135, so the table command continues to run until it hits a wall when it cannot find the aforementioned variables.
Stata 14 and later versions check the variable(s) provided by the user in the contents() option and only accept numeric types, issuing a more informative error message if this is not the case.
It is also worth pointing out that collapse treats strings differently in more recent Stata versions.
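If the goal is only to count non-empty strings per group, one possible workaround is to count through a numeric indicator instead. The following is a minimal sketch, assuming the older table syntax with contents() shown above; has_str is an illustrative variable name, not something from the original example:

```stata
sysuse auto, clear
decode foreign, generate(foreign_str)

* has_str (hypothetical name): 1 when the string is non-empty, 0 otherwise
generate byte has_str = (foreign_str != "")

* summing the indicator gives the number of non-empty strings per group
table foreign, contents(sum has_str mean mpg)
```

Because has_str is numeric, collapse has no type mismatch to trip over and table can format the temporary variables as usual.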
Using this excellent advice from Statalist, I am running a loop to read in a 60GB Stata dataset and save it in chunks (after some data preprocessing).
Unfortunately, I do not know the total number of observations and so the use command does not execute when asking to read in more data than is available:
use `usevars' in 210000001/220000000 using "a_large_dta_file.dta", clear
The dataset appears to contain fewer than 220000000 observations, but I do not know how many. I am looking for an end-of-file operator or something in that spirit to circumvent this problem. I verified manually that at least 210001001 observations exist, but this will not help much.
Consider the following reproducible example using Stata's auto toy dataset:
sysuse auto, clear
display _N
74
Using the describe command will get you what you want:
findfile auto.dta
local fn "`r(fn)'" // store the path, because describe overwrites r()
describe using "`fn'" // or ask for only one variable, e.g. describe rep78 using "`fn'"
display r(N)
74
Stata datasets are always rectangular, so you can also manually load a single variable and count:
use rep78 using "`fn'", clear // load a variable which also contains missing data
display _N
74
Alternatively, use a loop to load smaller chunks and the capture command to see where it fails.
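The chunked loop can also be bounded by the observation count that describe reports, so that the last chunk stops exactly at the end of the file. This is only a sketch: the chunk size and the chunk_ output file names are assumptions for illustration, while `usevars' and the file name are taken from the question.

```stata
* read the observation count from the file header without loading the data
describe using "a_large_dta_file.dta", short
local N = r(N)

* chunk size is an assumption; adjust it to the available memory
local chunk = 10000000

forvalues start = 1(`chunk')`N' {
    * never ask for more observations than the file contains
    local stop = min(`start' + `chunk' - 1, `N')
    use `usevars' in `start'/`stop' using "a_large_dta_file.dta", clear

    * ... preprocessing goes here ...

    * hypothetical output name, one file per chunk
    save "chunk_`start'.dta", replace
}
```

Prefixing the use line with capture and checking _rc afterwards would additionally let the loop stop cleanly if a read fails for some other reason.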
Is there a way to check whether three variables (month, day, year) can actually build a valid SAS date format before handing those variables over to MDY() (maybe except checking all possible cases)?
Right now I am dealing with a couple of thousand input variables and let SAS put them together. Many of the date values cannot work, such as month=0, day=33, year=10, and I'd like to catch them. Otherwise I will get far too many notes like
NOTE: Invalid argument to function MDY(13,12,2014)
which then eventually culminate in Warnings like
WARNING: Limit set by ERRORS= option reached. Further errors of this type will not be printed.
I would really like to prevent those warnings, and I thought the best way would be to actually check the validity of the date. Any recommendations?
Use an INFORMAT instead; then you can use the ?? modifier to suppress the errors. Invalid dates simply come back as missing.
month=0;
day=33;
year=10;
date = input(cats(put(year,z4.),put(month,z2.),put(day,z2.)),??yymmdd8.);
SAS documentation: ? or ?? (Format Modifiers for Error Reporting)
I have been using Stata for several years now, along with other languages such as R.
Stata is great, but there is one thing that annoys me: the generate/replace behaviour, and especially the "... already defined" error.
It means that if we want to run a piece of code twice, and this piece of code contains the definition of a variable, the definition needs two lines:
capture drop foo
generate foo = ...
It takes just one line in other languages such as R.
So is there another way to define variables that combines "generate" and "replace" in one command?
I am unaware of any way to do this directly. Further, as @Roberto's comment implies, there are reasons why simply issuing a generate command will not overwrite (see: replace) the contents of a variable.
To be able to do this while maintaining data integrity, you would need to issue two separate commands, as your question points out (explicitly dropping the existing variable before generating the new one). I see this as a way in which Stata forces the user to be clear about his/her intentions.
It might be noted that Stata is not alone in this regard. SQL Server, for example, requires the user to drop an existing table before creating a table with the same name (in the same database), does not allow multiple columns with the same name in a table, etc., and all for good reason.
However, if you are really set on being able to issue a one-liner in Stata to do what you desire, you could write a very simple program. The following should get you started:
program mkvar
    version 13
    syntax anything=exp [if] [in]
    * exact: do not let an abbreviation match (and drop) another variable
    capture confirm variable `anything', exact
    if !_rc {
        drop `anything'
    }
    generate `anything' `exp' `if' `in'
end
You would then save the program to mkvar.ado in a directory that Stata searches (e.g., C:\ado\personal\ on Windows; if you are unsure, type sysdir), and call it using:
mkvar newvar=expression [if] [in]
Now, I haven't tested the above code much, so you may have to do a bit of debugging, but it has worked fine in the examples I've tried.
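For instance, assuming mkvar has been defined as above, a quick check on the auto dataset might look like the following sketch (gpm is only an illustrative variable name):

```stata
sysuse auto, clear

* first call: gpm does not exist yet, so it is simply generated
mkvar gpm = 1/mpg

* second call: the existing gpm is dropped and then regenerated
mkvar gpm = 100/mpg if foreign == 1

summarize gpm
```

The second call should run without the "already defined" error, and gpm is then missing wherever the if condition does not hold, exactly as it would be after a plain drop followed by generate.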
On a closing note, I'd advise you to exercise caution when doing this - certainly you will want to be vigilant with regard to altering your data, retain a copy of your raw data while a do file manipulates the data in memory, etc.
I am trying to follow along with a simple linear regression example provided by Stata: https://www.youtube.com/watch?v=HafqFSB9x70
It is done using Stata/SE 12 and works perfectly.
I am using Stata/MP 13.
And I am getting the following error:
. predict Predicted Wage, xb
too many variables specified
r(103);
I tried to look this up but couldn't figure it out.
How can I fix this? Is it related to the version?
predict takes one new variable name, and you gave it two: Predicted and Wage. Try deleting the space between them, making PredictedWage one word.
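As a small sketch of the same point on the auto dataset (the regression and the name PredictedPrice are illustrative assumptions, not taken from the video):

```stata
sysuse auto, clear
regress price mpg

* one new variable name, written without a space
predict PredictedPrice, xb
```

This behaviour does not depend on the Stata version: a space always separates two names, so predict sees two new variables where only one is allowed.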
I am using Stata 12 and I have to run an ordered probit (oprobit) on a panel dataset. I know that the oprobit command is meant for cross-sectional analysis. The newer version (Stata 13) has the xtoprobit command for random-effects ordered probit. I need a similar command for Stata 12. I have checked the reoprob command, but when I use it with my panel dataset I get the following error:
"factor variables and time-series operators not allowed"
That means you need to create your own dummy variables instead of using the factor variable notation i.dummyvar. Try this:
tab dummyvar, gen(D)
reg y D*
This will create a set of dummy variables (D1, D2, ...) reflecting the observed values of the tabulated variable.
Some of the older user-written commands do not know what to do with the factor variable notation, which is convenient, but fairly new.
You can also explore xi for more complicated tasks.
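A minimal sketch of the xi route, using the auto dataset and regress purely for illustration (the same prefix can be put in front of a user-written command that rejects factor variables):

```stata
sysuse auto, clear

* xi expands i.rep78 into dummy variables named _Irep78_2, _Irep78_3, ...
* and then runs the command with those dummies in place of i.rep78
xi: regress price mpg i.rep78

* the generated dummies stay in the dataset for later use
describe _Irep78_*
```

Unlike tab with gen(), xi omits one category as the base, so the dummies can go into a model with a constant as they are.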