Stata: save name of variable with max value as a string

I have several variables in the same row: x1 x2 x3 x4
With egen and the rowmax function, I create a new variable containing the highest value among the x* variables:
egen max_x = rowmax(x1 x2 x3 x4)
However, instead of saving the maximum value, I would like to save the name of the variable which contains the maximum value as a string. How can I do that?

There might be a single command for this, but here is one approach...
// generate some test data
set obs 10
forvalues i=1/4 {
gen float x`i' = runiform()
}
tempvar valmax argmax
gen `valmax' = x1
gen `argmax' = "x1"
foreach v of varlist x2-x4 {
// does value beat the current highest value?
replace `argmax' = "`v'" if `v' > `valmax' & !mi(`v')
replace `valmax' = max(`valmax', `v')
}
list
You should also consider how ties and missing values are handled.
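One way to make those two choices explicit is sketched below. It is only an illustration under assumptions that are not in the question: ties go to the first variable in the list, rows in which every x is missing get an empty string, and the names max_x and argmax are placeholders. Missing values are planted in x1 just to exercise that case.

```stata
clear
set obs 10
set seed 2024
forvalues i = 1/4 {
    gen float x`i' = runiform()
}
// plant some missing values in x1 (illustration only)
replace x1 = . in 1/2

// rowmax() ignores missings; it returns missing only if all arguments are missing
egen max_x = rowmax(x1 x2 x3 x4)

// record the first variable whose value equals the row maximum
gen argmax = ""
foreach v of varlist x1-x4 {
    replace argmax = "`v'" if `v' == max_x & argmax == "" & !missing(`v')
}
list
```

The equality test is exact here because the x variables and max_x are all floats. If the x variables were doubles, max_x would have to be created as a double too (egen double max_x = ...).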

Related

Using local in a forvalues loop reports a syntax error

I am using two-level loops to create a set of variables. But Stata reports a syntax error.
forvalues i = 1/5 {
local to `i'+1
dis `to'
forvalues j = `to'/6{
dis `j'
gen e_`i'_`j' = .
}
}
I could not figure out where I made the syntax error.
And a follow-up question. I would like to change how the loop limits are set in the example above. Right now they are hard-coded as 5 and 6, but I want to base them on the data. For instance, I am coding as below:
sum x
scalar x_max_1 = `r(max)'-1
scalar x_max_2 = `r(max)'
forvalues i = 1/x_max_1 {
local to = `i'+1
dis `to'
forvalues j = `to'/x_max_2{
dis `j'
gen e_`i'_`j' = .
}
}
However, Stata reports a syntax error in this case. I am not sure why. The scalar is a numeric variable. Why would the code above not work?
Your code would be better as
forvalues i = 1/5 {
local to = `i' + 1
forvalues j = `to'/6 {
gen e_`i'_`j' = .
}
}
With your code you went
local to `i' + 1
so first time around the loop to becomes the string or text 1 + 1 which is then illegal as an argument to forvalues. That is, a local definition without an = sign will result in copying of text, not evaluation of the expression.
The way you used display could not show you this error because display used that way will evaluate expressions to the extent possible. If you had insisted that the macro was a string with
di "`to'"
then you would have seen its contents.
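The difference between copying and evaluating can be seen in a few lines. This is a small sketch run outside any loop, with the macro i set by hand to stand in for the loop index:

```stata
local i 1

// no equals sign: the text is copied as is
local to `i'+1
display "`to'"

// with an equals sign: the expression is evaluated first
local to = `i'+1
display "`to'"
```

The first display shows 1+1 and the second shows 2. Only the second form leaves a number in the macro that forvalues can use as a range limit.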
Another way to do it is
forvalues i = 1/5 {
forvalues j = `= `i' + 1'/6 {
gen e_`i'_`j' = .
}
}
EDIT
You asked further about
sum x
scalar x_max_1 = `r(max)'-1
scalar x_max_2 = `r(max)'
forvalues i = 1/x_max_1 {
and quite a lot can be said about that. Let's work backwards from one of various better solutions:
sum x, meanonly
forvalues i = 1/`= r(max) - 1' {
or another, perhaps a little more transparent:
sum x, meanonly
local max = r(max) - 1
forvalues i = 1/`max' {
What are the messages here?
If you only want the maximum, specify meanonly. Agreed: the option name alone does not imply this. See https://www.stata-journal.com/sjpdf.html?articlenum=st0135 for more.
What is the point of pushing the r-class result r(max) into a scalar? You already have what you need in r(max). Educate yourself out of this with the following analogy.
I have what I want. Now I put it into a box. Now I take it out of the box. Now I have what I want again. Come to think of it, the box business can be cut.
The box is the scalar, two scalars in this case.
forvalues won't evaluate scalars to give you the number you want. That will happen in many languages, but not here.
More subtly, forvalues doesn't even evaluate local references or similar constructs. What happens is that Stata's generic syntax parser does that for you before what you typed is passed to forvalues.
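If a scalar has to be used anyway, the points above suggest evaluating it inside the macro-expansion operator, so that forvalues only ever sees a number. The sketch below assumes the auto dataset shipped with Stata and a hypothetical scalar name, top; neither comes from the question.

```stata
sysuse auto, clear
summarize rep78, meanonly
scalar top = r(max)

// writing 1/top as the range is rejected: forvalues wants numbers, not a scalar name
// the inline expression below is evaluated before forvalues sees the line
forvalues i = 1/`= scalar(top) - 1' {
    display "i = `i'"
}
```

This is still the box from the analogy above: using r(max) directly inside the inline expression does the same job with one step fewer.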

Multiple local in foreach command macro

I have a dataset with multiple subgroups (variable economist) and dates (variable temps99).
I want to run a tabsplit command that does not accept bysort or by prefixes. So I created a macro to apply my tabsplit command to each of my subgroups within my data.
For example:
levelsof economist, local(liste)
foreach gars of local liste {
display "`gars'"
tabsplit SubjectCategory if economist=="`gars'", p(;) sort
return list
replace nbcateco = r(r) if economist == "`gars'"
}
For each subgroup, Stata runs the tabsplit command and I use the variable nbcateco to store count results.
I did the same for the date so I can have the evolution of r(r) over time:
levelsof temps99, local(liste23)
foreach time of local liste23 {
display "`time'"
tabsplit SubjectCategory if temps99 == "`time'", p(;) sort
return list
replace nbcattime = r(r) if temps99 == "`time'"
}
Now I want to do it on each subgroups economist by date temps99. I tried multiple combination but I am not very good with macros (yet?).
What I want is to be able to have my r(r) for each of my subgroups over time.
Here's a solution that shows how to calculate the number of distinct publication categories within each by-group. This uses runby (from SSC). runby loops over each by-group, each time replacing the data in memory with the data from the current by-group. For each by-group, the commands contained in the user's program are executed. Whatever is left in memory when the user's program terminates is considered results and accumulates. Once all the groups have been processed, these results replace the data in memory.
I used the verbose option because I wanted to present the results for each by-group using nice formatting. The derivation of the list of distinct categories is done by splitting each list, converting to a long layout, and reducing to one observation per distinct value. The distinct_categories program generates one variable that contains the final count of distinct categories for the by-group.
* create a demonstration dataset
* ------------------------------------------------------------------------------
clear all
set seed 12345
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 economist
"Carmen M. Reinhart"
"Janet Currie"
"Asli Demirguc-Kunt"
"Esther Duflo"
"Marianne Bertrand"
"Claudia Goldin"
"Bronwyn Hughes Hall"
"Serena Ng"
"Anne Case"
"Valerie Ann Ramey"
end
expand 20
bysort economist: gen temps99 = 1998 + _n
gen pubs = runiformint(1,10)
expand pubs
sort economist temps99
gen pubid = _n
local nep NEP-AGR NEP-CBA NEP-COM NEP-DEV NEP-DGE NEP-ECM NEP-EEC NEP-ENE ///
NEP-ENV NEP-HIS NEP-INO NEP-INT NEP-LAB NEP-MAC NEP-MIC NEP-MON ///
NEP-PBE NEP-TRA NEP-URE
gen SubjectCategory = ""
forvalues i=1/19 {
replace SubjectCategory = SubjectCategory + " " + word("`nep'",`i') ///
if runiform() < .1
}
replace SubjectCategory = subinstr(trim(SubjectCategory)," ",";",.)
leftalign // from SSC
* ------------------------------------------------------------------------------
program distinct_categories
dis _n _n _dup(80) "-"
dis as txt "fille = " as res economist[1] as txt _col(68) " temps = " as res temps99[1]
// if there are no subjects for the group, exit now to avoid a no obs error
qui count if !mi(trim(SubjectCategory))
if r(N) == 0 exit
// split categories, reshape to a long layout, and reduce to unique values
preserve
keep pubid SubjectCategory
quietly {
split SubjectCategory, parse(;) gen(cat)
reshape long cat, i(pubid)
bysort cat: keep if _n == 1
drop if mi(cat)
}
// show results and generate the wanted variable
list cat
local distinct = _N
dis _n as txt "distinct = " as res `distinct'
restore
gen wanted = `distinct'
end
runby distinct_categories, by(economist temps99) verbose
This is an example of the XY problem, I think. See http://xyproblem.info/
tabsplit is a command in the package tab_chi from SSC. I have no negative feelings about it, as I wrote it, but it seems quite unnecessary here.
You want to count categories in a string variable: semi-colons are your separators. So count semi-colons and add 1.
local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .))
Then (e.g.) table or tabstat will let you explore further by groups of interest.
To see the counting idea, consider 3 categories with 2 semi-colons.
. display length("frog;toad;newt")
14
. display length(subinstr("frog;toad;newt", ";", "", .))
12
If we replace each semi-colon with an empty string, the change in length is the number of semi-colons deleted. Note that we don't have to change the variable to do this. Then add 1. See also this paper.
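The same idea applied to a variable, as a minimal sketch on made-up data. The last line is an added assumption, not part of the approach above: an empty string would otherwise be counted as one category, so it is reset to zero.

```stata
clear
input str20 SubjectCategory
"frog;toad;newt"
"frog"
""
end

// number of separators plus one
gen NCategory = 1 + length(SubjectCategory) - length(subinstr(SubjectCategory, ";", "", .))

// assumption: an empty list holds no categories at all
replace NCategory = 0 if missing(SubjectCategory)
list
```

The three observations end up with counts of 3, 1 and 0.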
That said, a way to extend your approach might be
egen class = group(economist temps99), label
su class, meanonly
local nclass = r(max)
gen result = .
forval i = 1/`nclass' {
di "`: label (class) `i''"
tabsplit SubjectCategory if class == `i', p(;) sort
return list
replace result = r(r) if class == `i'
}
Using statsby would be even better. See also this FAQ.

Error:'no variables defined' in stata when using monte carlo simulation

I have written the program below and keep getting the error message that my variables are not defined.
Can somebody please see where the error is and how I should adapt the code? Really nothing seems to work.
program define myreg, rclass
drop all
set obs 200
gen x= 2*uniform()
gen z = rnormal(0,1)
gen e = (invnorm(uniform()))^2
e=e-r(mean)
replace e=e-r(mean)
more
gen y = 1 + 1*x +1*z + 1*e
reg y x z
e=e-r(mean)
replace e=e-r(mean)
more
gen y = 1 + 1*x +1*z + 1*e
reg y x z
more
return scalar b0 =_[_cons]
return scalar b1=_[x]
return scalar b2 =_[z]
more
end
simulate b_0 = r(b0) b_1 = r(b1) b_2 = r(b2), rep(1000): myreg
*A possible solution with eclass
capture program drop myreg
program define myreg, eclass
* create an empty data by dropping all variables
drop _all
set obs 200
gen x= 2*uniform()
gen z = rnormal(0,1)
gen e = (invnorm(uniform()))^2
qui sum e /*to get r(mean) you need to run sum first*/
replace e=e-r(mean)
gen y = 1 + 1*x +1*z + 1*e
reg y x z
end
* gather the coefficients (_b) and standard errors (_se) from the regression each time
simulate _b _se, reps(1000) seed (123): myreg
* show the final result
mat list r(table)
* A possible solution with rclass
* To understand the difference between rclass and eclass, see the Stata manual(http://www.stata.com/manuals13/rstoredresults.pdf)
capture program drop myreg
program define myreg, rclass
drop _all
set obs 200
gen x= 2*uniform()
gen z = rnormal(0,1)
gen e = (invnorm(uniform()))^2
qui sum e
replace e=e-r(mean)
gen y = 1 + 1*x +1*z + 1*e
reg y x z
mat output=e(b)
return scalar b0=output[1,3]
return scalar b1=output[1,1]
return scalar b2=output[1,2]
end
simulate b_0=r(b0) b_1=r(b1) b_2=r(b2), rep(1000) seed (123): myreg
return list
*P.S. You should read all the comments as suggested by @Nick to fully understand what I did here.
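Comparing the question's program with the versions above, the changes are: drop _all instead of drop all, a summarize call before r(mean) is used, replace in front of every assignment to an existing variable, and no more pauses or repeated gen y lines. The coefficients can also be returned by name with _b[] rather than by position in e(b). The sketch below is one hedged variant of the rclass solution; the program name myreg2, the use of runiform() and rnormal() in place of the older functions, and the 100 replications are choices made here for illustration.

```stata
capture program drop myreg2
program define myreg2, rclass
    drop _all
    set obs 200
    gen x = 2*runiform()
    gen z = rnormal(0,1)
    gen e = rnormal()^2
    // summarize must run first so that r(mean) exists
    quietly summarize e, meanonly
    quietly replace e = e - r(mean)
    gen y = 1 + x + z + e
    quietly regress y x z
    // coefficients are accessed by name with _b[]
    return scalar b0 = _b[_cons]
    return scalar b1 = _b[x]
    return scalar b2 = _b[z]
end

simulate b_0=r(b0) b_1=r(b1) b_2=r(b2), reps(100) seed(123): myreg2
summarize
```

After simulate finishes, the data in memory hold one observation per replication, so summarize gives the mean and spread of each estimated coefficient.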

Stata: Subsetting data using criteria stored in other data set

I have a large data set. I have to subset the data set (Big_data) by using values stored in other dta file (Criteria_data). I will show you the problem first:
Big_data (over 1 million obs)
lon      lat
-76.22   44.27
-67.55   33.19
.......

Criteria_data (271 observations)
4_digit_id   minlon   maxlon   minlat   maxlat
0765         -78.44   -77.22   34.324   35.011
6161         -66.11   -65.93   40.32    41.88
........
I have to subset the big data as follows:
use Big_data
preserve
keep if (-78.44<lon<-77.22) & (34.324<lat<35.011)
save data_0765, replace
restore
preserve
keep if (-66.11<lon<-65.93) & (40.32<lat<41.88)
save data_6161, replace
restore
....
(1) What would be an efficient way to program this subsetting in Stata? (2) Are the inequality expressions correctly written?
1) Subsetting data
With 400,000 observations in the main file and 300 in the reference file, it takes about 1.5 minutes. I can't test this with double the observations in the main file because the lack of RAM slows my computer to a crawl.
The strategy involves creating as many variables as needed to hold the reference latitudes and longitudes (271*4 = 1084 in the OP's case; Stata IC and up can handle this. See help limits). This requires some reshaping and appending. Then we check for those observations of the big data file that meet the conditions.
clear all
set more off
*----- create example databases -----
tempfile bigdata reference
input ///
lon lat
-76.22 44.27
-66.0 40.85 // meets conditions
-77.10 34.8 // meets conditions
-66.00 42.0
end
expand 100000
save "`bigdata'"
*list
clear all
input ///
str4 id minlon maxlon minlat maxlat
"0765" -78.44 -75.22 34.324 35.011
"6161" -66.11 -65.93 40.32 41.88
end
drop id
expand 150
gen id = _n
save "`reference'"
*list
*----- reshape original reference file -----
use "`reference'", clear
tempfile reference2
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
gen lat = .
gen lon = .
save "`reference2'"
*----- create working database -----
use "`bigdata'"
timer on 1
quietly {
forvalues num = 1/300 {
gen minlon`num' = .
gen maxlon`num' = .
gen minlat`num' = .
gen maxlat`num' = .
}
}
timer off 1
timer on 2
append using "`reference2'"
drop i
timer off 2
*----- flag observations for which conditions are met -----
timer on 3
gen byte flag = 0
foreach le of local lev {
quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3
*keep if flag
*keep lon lat
*list
timer list
The inrange() function implies that the minimums and maximums must be adjusted beforehand to satisfy the OP's strict inequalities (the function tests <=, >=).
Some expansion using expand, together with sequence numbers and by (so the data are in long form), could probably speed things up, but the details are not totally clear to me right now. I'm sure there are better ways in plain Stata. Mata may be even better.
(joinby was also tested but again RAM was a problem.)
Edit
Doing the computations in chunks rather than on the complete database significantly eases the RAM issue. Using a main file with 1.2 million observations and a reference file with 300 observations, the following code does all the work in about 1.5 minutes:
set more off
*----- create example big data -----
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
*----- reshape original reference file -----
use "`reference'", clear
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
drop i
tempfile reference2
save "`reference2'"
*----- create file to save results -----
tempfile results
clear all
set obs 0
gen lon = .
gen lat = .
save "`results'"
*----- start computations -----
clear all
* local that controls # of observations in intermediate files
local step = 5000 // can't be larger than sizebd; sizebd should be a multiple of step, or the last partial chunk is skipped
timer clear
timer on 99
forvalues en = `step'(`step')`sizebd' {
* load observations and join with references
timer on 1
local start = `en' - (`step' - 1)
use in `start'/`en' using "`bigdata'", clear
timer off 1
timer on 2
append using "`reference2'"
timer off 2
* flag observations that meet conditions
timer on 3
gen byte flag = 0
foreach le of local lev {
quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3
* append to result database
timer on 4
quietly {
keep if flag
keep lon lat
append using "`results'"
save "`results'", replace
}
timer off 4
}
timer off 99
timer list
display "total time is " `r(t99)'/60 " minutes"
use "`results'"
browse
2) Inequalities
You ask if your inequalities are correct. They are in fact legal, meaning that Stata will not complain, but the result is probably unexpected.
The following result may seem surprising:
. display (66.11 < 100 < 67.93)
1
How is it the case that the expression evaluates to true (i.e. 1) ? Stata first evaluates 66.11 < 100 which is true, and then sees 1 < 67.93 which is also true, of course.
The intended expression was (and Stata will now do what you want):
. display (66.11 < 100) & (100 < 67.93)
0
You can also rely on the function inrange().
The following example is consistent with the previous explanation:
. display (66.11 < 100 < 0)
0
Stata sees 66.11 < 100 which is true (i.e. 1) and follows up with 1 < 0, which is false (i.e. 0).
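Applied to the question's first box, the same logic means the chained condition can never be true: the first comparison yields 0 or 1, and neither is less than -77.22. A small sketch on three made-up points (only the second lies inside the box):

```stata
clear
input lon lat
-76.22 44.27
-77.50 34.80
-66.00 40.85
end

// chained form from the question: legal, but selects nothing
count if (-78.44 < lon < -77.22) & (34.324 < lat < 35.011)

// each comparison written out: selects the one point inside the box
count if lon > -78.44 & lon < -77.22 & lat > 34.324 & lat < 35.011
```

The first count reports 0 and the second reports 1, so a keep if with the chained condition would have dropped every observation.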
This uses Roberto's data setup:
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
timer on 1
levelsof id, local(id_list)
foreach id of local id_list {
sum minlat if id==`id', meanonly
local minlat = r(min)
sum maxlat if id==`id', meanonly
local maxlat = r(max)
sum minlon if id==`id', meanonly
local minlon = r(min)
sum maxlon if id==`id', meanonly
local maxlon = r(max)
preserve
use if (inrange(lon,`minlon',`maxlon') & inrange(lat,`minlat',`maxlat')) using "`bigdata'", clear
qui save data_`id', replace
restore
}
timer off 1
I would try to avoid preserving and restoring the "big" file, and doing so is possible, but at the expense of losing Stata format.
Using the same set up as Roberto and Dimitriy did,
set more off
use `bigdata', clear
merge 1:1 _n using `reference'
* check for data consistency:
* minlat, maxlat, minlon, maxlon are either all defined or all missing
assert inlist( mi(minlat) + mi(maxlat) + mi(minlon) + mi(maxlon), 0, 4)
* this will come in handy later
gen byte touse = 0
* set up and cycle over the reference data
count if !missing(minlat)
forvalues n=1/`=r(N)' {
replace touse = inrange(lat,minlat[`n'],maxlat[`n']) & inrange(lon,minlon[`n'],maxlon[`n'])
local thisid = id[`n']
outfile lat lon if touse using data_`thisid'.csv, replace comma
}
Time it on your machine. You could avoid touse and thisid and only have the single outfile within the cycle, but it would be less readable.
You can then infile lat lon using data_###.csv, clear later. If you really need the Stata files proper, you can convert that swarm of CSV files with
clear
local allcsv : dir . files "*.csv"
foreach f of local allcsv {
* change the filename
local dtaname = subinstr(`"`f'"',".csv",".dta",.)
infile lat lon using `"`f'"', clear
if _N>0 save `"`dtaname'"', replace
}
Time it, too. I protected the save as some of the simulated data sets were empty. I think this was faster than 1.5 min on my machine, including the conversion.

Stata- Is there a way to store data like Python's dictionary or a hash map?

Is there a way to store information in Stata similar to a dictionary in Python or a hash map in other languages?
I am iterating through variable lists that are appended with _1, _2, _3, _4, _5, _6, _7 ... _18 to delineate sections, and I want to sum the number of times the letters "DK" appear in each variable in each section. Right now I have 18 for loops, with each loop iterating through a different section, saving the 'sum' of the total number of DK's in a new variable called DK_1sum, DK_2sum, and then I later produce graphs of that data.
I'm wondering if there is a way to turn all this into a large For loop, and just append the data to a dictionary/array such that the data looks like:
{s1Sum, 25
s2Sum, 56 ...
s18Sum, 101}
Is this possible?
This could be stored in a Stata matrix, a Mata matrix or just ordinary Stata variables.
gen count = .
gen which = _n
qui forval j = 1/18 {
scalar found = 0
foreach v of var *_`j' {
count if strpos(`v', "DK")
scalar found = scalar(found) + r(N)
}
replace count = scalar(found) in `j'
}
list which count in 1/18
For variation, here is a Stata matrix approach.
matrix count = J(18,1,.)
qui forval j = 1/18 {
scalar found = 0
foreach v of var *_`j' {
count if strpos(`v', "DK")
scalar found = scalar(found) + r(N)
}
matrix count[`j', 1] = scalar(found)
}
matrix list count
If you are concerned about efficiency you could consider the associative array capabilities of Mata.
* associate Y with X
local yvalue "Y"
mata : H = asarray_create()
mata : asarray(H, "X", st_local("yvalue"))
* available in Mata
mata : asarray(H, "X")
* available in Stata
mata : st_local("xvalue", asarray(H, "X"))
di "`xvalue'"
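For the layout sketched in the question, with a section label mapped to a count, the same Mata functions can hold numeric values under string keys. This is a minimal sketch; the keys and counts are the illustrative ones from the question, and the scalar name s2 is a placeholder.

```stata
mata:
H = asarray_create()            // keys are string scalars by default
asarray(H, "s1Sum", 25)
asarray(H, "s2Sum", 56)
asarray(H, "s18Sum", 101)
asarray(H, "s2Sum")             // look up a stored value
asarray_contains(H, "s3Sum")    // check whether a key exists
st_numscalar("s2", asarray(H, "s2Sum"))
end
display scalar(s2)
```

Since the counts are wanted for graphs later, the variable or matrix approaches above may still be more convenient; the associative array mainly helps when keys are not simple consecutive integers.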