Using this excellent advice from Statalist, I am running a loop to read in a 60GB Stata dataset and save it in chunks (after some data preprocessing).
Unfortunately, I do not know the total number of observations, and so the use command fails when asked to read in more data than is available:
use `usevars' in 210000001/220000000 using "a_large_dta_file.dta", clear
The dataset appears to contain fewer than 220000000 observations, but I do not know how many. I am looking for an end-of-file operator or something in that spirit to circumvent this problem. I have manually verified that at least 210001001 observations exist, but this will not help much.
Consider the following reproducible example using Stata's auto toy dataset:
sysuse auto, clear
display _N
74
Using the describe command will get you what you want:
findfile auto.dta
describe using "`r(fn)'" // or ask for only one variable, e.g. describe rep78 using "`r(fn)'"
display r(N)
74
Stata datasets are always rectangular so you can also manually load a single variable and count:
findfile auto.dta
use rep78 using "`r(fn)'", clear // load a variable which also contains missing data
display _N
74
Alternatively, use a loop to load smaller chunks and the capture command to see where it fails.
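A minimal sketch of such a chunked loop, here using the r(N) returned by describe using instead of capture so that the last partial chunk is handled cleanly (the file name and `usevars' come from the question; the chunk size and output file names are arbitrary placeholders):
describe using "a_large_dta_file.dta"
local N = r(N)                       // total number of observations
local chunk 10000000                 // chunk size: an arbitrary choice
forvalues start = 1(`chunk')`N' {
    local end = min(`start' + `chunk' - 1, `N')
    use `usevars' in `start'/`end' using "a_large_dta_file.dta", clear
    * ... preprocessing goes here ...
    save "chunk_`start'.dta", replace
}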
I have a list of circumstances and a list of effort variables:
I want to generate a matrix, betas, containing the values of the betas. I am going to run the loop 10 times, because I am in fact going to bootstrap my observations.
So far I have tried:
local circumstances height weight
local effort training diet
foreach i in 1 10 {
reg outcome `circumstances' `effort'
* store in column i the values of betas of circumstances
* store in column i the values of betas of effort
}
Does anyone know what the code should look like in order to store those values?
Thank you
The pseudocode would first store in "column 1" the first lot of betas and then overwrite them (column 1) with the second lot of betas. Then it would do the same again for column 10 with the first lot of betas and the second lot of betas. That is a long way from anything that makes sense. Nothing in your pseudocode takes bootstrap samples from the dataset, although perhaps you are intending to add code for that later.
Stata doesn't really work with any idea of column numbers, although the idea makes sense to Mata.
Unless there are very specific reasons -- which you would need to spell out -- there is no need to write your own code ab initio for bootstrapping, as the whole point of bootstrap is to do that for you.
Here is complete code for a reproducible example of bootstrapping a silly regression:
sysuse auto, clear
bootstrap b_weight=_b[weight] b_price=_b[price] , reps(1000) seed(2803) : regress mpg weight price
See also the help for bootstrap to learn about its other options, including saving().
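For instance, a sketch of how saving() could keep the coefficients from every replicate in their own dataset (the file name myboot is an arbitrary choice):
sysuse auto, clear
bootstrap b_weight=_b[weight] b_price=_b[price], reps(1000) seed(2803) saving(myboot, replace) : regress mpg weight price
use myboot, clear              // one observation per replicate, with b_weight and b_price
summarize b_weight b_price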
10 repetitions would be regarded as absurdly small for the number of bootstrap samples.
I'd like to save in a macro the storage type of variables in a .dta dataset (without opening it).
As an example I'll first create a dataset temp.dta
drop _all
set obs 100
gen a = runiform()
save temp, replace
In an interactive session, I can display the storage types of all variables using the command describe using
However, the command only saves the dimensions of the dataset in its stored results, without any information related to storage types.
Is there a way to do it?
You can start with this example:
clear
set more off
sysuse auto
foreach v of varlist _all {
local allt `allt' `v' `: type `v''
}
display "`allt'"
I set up the information so that each variable name is followed by its type, but you can modify that to suit your needs; maybe two locals, one with the variable names and the other with the corresponding types, would be best for you.
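A minimal sketch of that two-local variation:
foreach v of varlist _all {
    local names `names' `v'
    local types `types' `: type `v''
}
display "`names'"
display "`types'"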
The key is the extended macro function `: type varname'. See help extended_fcn for details.
For this to work, the dataset needs to be opened at some point. I don't know a way of doing this without the latter requirement.
Edit
@SteveSamuels proposes use <somedata> in 1, and I present the benchmarking:
clear
*----- example data -----
sysuse auto
expand 50000
tempfile myauto
save "`myauto'"
*----- tests -----
clear
timer on 1
describe using "`myauto'"
timer off 1
clear
timer on 2
use "`myauto'" in 1
describe
timer off 2
clear
timer on 3
use "`myauto'"
describe
timer off 3
count
timer list
timer clear
clear
Resulting in
. timer list
1: 0.00 / 1 = 0.0000
2: 0.22 / 1 = 0.2190
3: 0.33 / 1 = 0.3260
So, it is faster than a simple use, as expected, but describe using ... still wins the race. The latter must use optimized code; additionally, there must be some reason why use <somedata> in 1 is unexpectedly slow, despite loading only one observation.
This doesn't include, of course, looping through variables and using extended macro functions, nor parsing a log file; but I don't think results would be modified by much.
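For completeness, a sketch of combining use ... in 1 with the extended macro function, so that the storage types are retrieved after loading only one observation (using the temp.dta example from the question):
use in 1 using temp, clear
foreach v of varlist _all {
    local typelist `typelist' `v' `: type `v''
}
display "`typelist'"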
I have data with IDs which may or may not have all values present. I want to delete ONLY the observations with no data in them; if there are observations with even one value, I want to retain them. E.g., if my dataset is:
ID val1 val2 val3 val4
1 23 . 24 75
2 . . . .
3 45 45 70 9
I want to drop only ID 2 as it is the only one with no data -- just an ID.
I have tried Statalist and Google but couldn't find anything relevant.
This will also work with strings as long as they are empty:
ds ID*, not
egen num_nonmiss = rownonmiss(`r(varlist)'), strok
drop if num_nonmiss == 0
This gets a list of the variables other than the ID and drops any observations that have data only in the ID.
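A quick check on the question's example data (typed in by hand here):
clear
input ID val1 val2 val3 val4
1 23 . 24 75
2 . . . .
3 45 45 70 9
end
ds ID*, not
egen num_nonmiss = rownonmiss(`r(varlist)'), strok
drop if num_nonmiss == 0
list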
Brian Albert Monroe is quite correct that anyone using dropmiss (SJ) needs to install it first. As there is interest in varying ways of solving this problem, I will add another.
foreach v of var val* {
qui count if missing(`v')
if r(N) == _N local todrop `todrop' `v'
}
if "`todrop'" != "" drop `todrop'
Although it should be a comment under Brian's answer, I will add a comment here as (a) this format is more suited for showing code and (b) the comment follows from my code above. I agree that unab is a useful command and have often commended it in public. Here, however, it is unnecessary as Brian's loops could easily start something like
foreach v of var * {
UPDATE September 2015: See http://www.statalist.org/forums/forum/general-stata-discussion/general/1308777-missings-now-available-from-ssc-new-program-for-managing-missings for information on missings, considered by the author of both to be an improvement on dropmiss. The syntax to drop observations if and only if all values are missing is missings dropobs.
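For the example data above, that would be something like the following, restricting the varlist so that the never-missing ID does not stop a row from counting as empty (see help missings for the exact syntax and options):
ssc install missings
missings dropobs val1 val2 val3 val4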
Just another way to do it, which helps you discover how flexible local macros are, without installing anything extra in Stata. I rarely see code using locals to store commands or logical conditions, though it is often very useful.
// Loop through all variables to build a useful local
foreach vname of varlist _all {
// We don't want to include ID in our drop condition, so don't execute the remaining code if our loop is currently on ID
if "`vname'" == "ID" continue
// This local stores all the variable names except 'ID' and a logical condition that checks if it is missing
local dropper "`dropper' `vname' ==. &"
}
// Let's see all the observations which have missing data for all variables except for ID
// The '1==1' bit is a condition to deal with the last '&' in the `dropper' local; it is of course true.
list if `dropper' 1==1
// Now let's drop those variables
drop if `dropper' 1==1
// Now check they're all gone
list if `dropper' 1==1
// They are.
Now dropmiss may be convenient once you've downloaded and installed it, but if you are writing a do file to be used by someone else, unless they also have dropmiss installed, your code won't work on their machine.
With this approach, if you remove the lines of comments and the two unnecessary list commands, this is a fairly sparse 5 lines of code which will run with Stata out of the box.
I need to get a Spearman and Pearson correlation table using Stata. Here is what I did to get the results in a table format.
estpost correlate sp_rating srating mrating split split_neg split_ord split_neg_ord ///
    tier1_risk tier1_leverage st1 sl mt1 ml adt1 adl dt1 dl offering_amt maturity2 ///
    security enhance timeliness validity disc loan_at cash_dep trading_at real_est ///
    intangible other_at sec_sum assets_sold all_residual secinc_ta, matrix quietly
esttab . using "root4.rtf", replace notype unstack compress noobs nogaps nostar
Then, I get this error message:
varlist not allowed
When I used just a few variables, I didn't get the error, but I do when I put in many variables. I don't know how to fix this. Please help me.
I was able to reproduce your error and ran a trace on it. I believe this is a bug at line 946 of estout.ado, perhaps caused by the fact that a very long variable list with RTF tags exceeds the size of the local macro created at that line.
You should send a bug report to Ben Jann (email at the end of help estout). In the meantime, you can try saving to DOC or TXT; both of them might work (you have over 30 variables; I tested both .txt and .doc successfully with something like 20 variables).
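For example, keeping all other options from the original command, the plain-text attempt would be:
esttab . using "root4.txt", replace notype unstack compress noobs nogaps nostar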
Alternatively, try the mkcorr command (ssc install mkcorr) to see if it works with your data.
I just had the same problem after I tried a lot of different esttab outputs and had stored a lot in estimates.
So, maybe estimates clear helps if you type it before running your command. At least for me it worked.
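In other words, before the estpost and esttab calls:
estimates clear
* ... then re-run the estpost correlate and esttab commands from above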
I need to store the value of the cluster-robust standard error in order to use it to create a new variable.
I am able to get the cluster-robust standard error with the mean command, but Stata does not store this value.
Do you have any suggestions about how to calculate the cluster-robust standard error for an estimate and then store this value in order to use it to create a new variable?
I think this might almost do the trick. There might be a more elegant way to do this. Toy data, nonsensical example:
/* Get some data */
webuse nhanes2f, clear
svyset psuid [pweight=finalwgt], strata(stratid)
/* get the standard error of the constant, which is the mean */
svy: reg zinc
display _se[_cons]
generate se = _se[_cons]
/* Verify that this is correct */
svy: mean zinc
However, you also want to cluster, which complicates things. I think if you only have survey weights (aka first stage clusters), you can do:
reg zinc [pweight=finalwgt], cluster(region)
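If that model is acceptable, the same _se[_cons] trick as above should then put the clustered standard error into a new variable (the name se_cluster is arbitrary):
display _se[_cons]
generate se_cluster = _se[_cons]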
There might be a way to do what you want with -gllamm-, which is a user-written command. You should ask this question on Statalist if you don't get much of a response here.