Stata factor value from label

I would like to look up a value/code associated with a label, and store that value in a scalar or local macro. While the information I want is stored in the definition of the label vector, apparently I need to go through some contortions to get it.
Extending Roberto Ferrer's answer to my last question, I came up with this approach:
// sample data
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)
// get code for "b"
gen tmp = 0
replace tmp = myfactor if myfactor == "b":myfactor
sort tmp
scalar bcode = tmp[_N]
This seems woefully inefficient in terms of data manipulation and code maintenance, especially considering how the information I want is already saved (and viewable with label list).

This uses labellist, from SSC. Download using ssc install labellist.
clear
set more off
*----- example data -----
input str7 mystr
"good"
"bad"
"bad"
"regular"
end
encode mystr, gen(myfactor)
*----- what you want -----
labellist
local faclab = r(myfactor_labels)
local facval = r(myfactor_values)
// get # for "good"
local i : list posof "good" in faclab
local j : word `i' of `facval'
display "`j'"
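If installing a user-written command is not an option, the same lookup can probably be done with built-in tools. The sketch below assumes the value label is named myfactor (the default when encode creates the variable myfactor). It uses the "text":labelname expression that already appears in the question, and Mata's st_vlsearch() as an alternative. The name goodcode is made up for illustration.

```stata
clear
input str7 mystr
"good"
"bad"
"bad"
"regular"
end
encode mystr, gen(myfactor)

* "text":labelname is an expression that evaluates to the code for that text
local goodcode = "good":myfactor
display `goodcode'

* the same lookup through Mata, stored in a scalar
mata: st_numscalar("goodcode", st_vlsearch("myfactor", "good"))
display scalar(goodcode)
```

Because encode numbers the strings alphabetically starting at 1, both display commands should show 2 here.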

Related

Repeating code in an if qualifier in Stata

In Stata I am trying to repeat code inside an if qualifier using perhaps a forvalues loop. My code looks something like this:
gen y=0
replace y=1 if x_1==1 & x_2==1 & x_3==1 & x_4==1
Instead of writing the & x_i==1 statement every time for each variable, I want to do it using a loop, something like this:
gen y=0
replace y=1 if forvalues i=1/4{x_`i'==1 &}
LATER EDIT:
Would it be possible to build a local along these lines, with the elements concatenated:
forvalues i=1/4{
local text_`i' "x_`i'==1 &"
display "`text_`i''"
}
And then use it in the if qualifier?
Although you use the term "if statement" all your code is phrased in terms of if qualifiers, which aren't commands or statements. (Your use of the term "statement" is looser than customary, but that doesn't affect an answer directly.)
You can't insert loops in if qualifiers.
For the differences, see
help if
help ifcmd
The entire example
gen y = 0
replace y = 1 if x==1 | x==2 | x==3 | x==4
would be better as
gen y = inlist(x, 1, 2, 3, 4)
or (dependent possibly on whatever values are allowed)
gen y = inrange(x, 1, 4)
A loop solution could be
gen y = 0
quietly forval i = 1/4 {
replace y = 1 if x == `i'
}
We can't discuss whether inlist() or inrange() would or would not be a solution for your real problem if you don't show it to us.
I usually don't like - in Nick's terms - to write code to write code. I see an immediate, though not elegant nor 'heterodox', solution to your issue. The whole thing amounts to generating a single indicator variable that combines all your indicators, and using it in your if qualifier.
Implicit assumptions, which make this a bad, non-generalizable solution, are: 1) all variables are dummies, and you need them to be == 1, and 2) variable names are conveniently ordered 1 to N (although, if that is not the case, you can easily change the forv into a 'foreach var of varlist etc.')
g touse = 1
forv i =1/30{
replace touse = touse * x_`i'
}
<your action> if touse == 1
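The question's later edit asks whether the condition can be accumulated in a local and then used in the if qualifier. That should work, because a macro is substituted before the command is parsed. Here is a minimal sketch, assuming four dummies named x_1 to x_4 as in the question; the macro name cond and the toy data are made up for illustration.

```stata
clear
set obs 5
forvalues i = 1/4 {
    gen byte x_`i' = 1
}
replace x_3 = 0 in 2

* start from a condition that is always true, then append one term per variable
local cond 1
forvalues i = 1/4 {
    local cond `cond' & x_`i' == 1
}
display "`cond'"

* the macro expands to the full expression before generate is parsed
gen byte y = `cond'
list
```

Starting the macro at 1 avoids having to strip a leading or trailing &. The same macro could follow if in any other command, for example replace y = 1 if `cond'.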

Multiple local in foreach command macro

I have a dataset with multiple subgroups (variable economist) and dates (variable temps99).
I want to run a tabsplit command that does not accept bysort or by prefixes. So I created a macro to apply my tabsplit command to each of my subgroups within my data.
For example:
levelsof economist, local(liste)
foreach gars of local liste {
display "`gars'"
tabsplit SubjectCategory if economist=="`gars'", p(;) sort
return list
replace nbcateco = r(r) if economist == "`gars'"
}
For each subgroup, Stata runs the tabsplit command and I use the variable nbcateco to store count results.
I did the same for the date so I can have the evolution of r(r) over time:
levelsof temps99, local(liste23)
foreach time of local liste23 {
display "`time'"
tabsplit SubjectCategory if temps99 == "`time'", p(;) sort
return list
replace nbcattime = r(r) if temps99 == "`time'"
}
Now I want to do it for each subgroup (economist) by date (temps99). I tried multiple combinations but I am not very good with macros (yet?).
What I want is to be able to have my r(r) for each of my subgroups over time.
Here's a solution that shows how to calculate the number of distinct publication categories within each by-group. This uses runby (from SSC). runby loops over each by-group, each time replacing the data in memory with the data from the current by-group. For each by-group, the commands contained in the user's program are executed. Whatever is left in memory when the user's program terminates is considered results and accumulates. Once all the groups have been processed, these results replace the data in memory.
I used the verbose option because I wanted to present the results for each by-group using nice formatting. The derivation of the list of distinct categories is done by splitting each list, converting to a long layout, and reducing to one observation per distinct value. The distinct_categories program generates one variable that contains the final count of distinct categories for the by-group.
* create a demonstration dataset
* ------------------------------------------------------------------------------
clear all
set seed 12345
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 economist
"Carmen M. Reinhart"
"Janet Currie"
"Asli Demirguc-Kunt"
"Esther Duflo"
"Marianne Bertrand"
"Claudia Goldin"
"Bronwyn Hughes Hall"
"Serena Ng"
"Anne Case"
"Valerie Ann Ramey"
end
expand 20
bysort economist: gen temps99 = 1998 + _n
gen pubs = runiformint(1,10)
expand pubs
sort economist temps99
gen pubid = _n
local nep NEP-AGR NEP-CBA NEP-COM NEP-DEV NEP-DGE NEP-ECM NEP-EEC NEP-ENE ///
NEP-ENV NEP-HIS NEP-INO NEP-INT NEP-LAB NEP-MAC NEP-MIC NEP-MON ///
NEP-PBE NEP-TRA NEP-URE
gen SubjectCategory = ""
forvalues i=1/19 {
replace SubjectCategory = SubjectCategory + " " + word("`nep'",`i') ///
if runiform() < .1
}
replace SubjectCategory = subinstr(trim(SubjectCategory)," ",";",.)
leftalign // from SSC
* ------------------------------------------------------------------------------
program distinct_categories
dis _n _n _dup(80) "-"
dis as txt "fille = " as res economist[1] as txt _col(68) " temps = " as res temps99[1]
// if there are no subjects for the group, exit now to avoid a no obs error
qui count if !mi(trim(SubjectCategory))
if r(N) == 0 exit
// split categories, reshape to a long layout, and reduce to unique values
preserve
keep pubid SubjectCategory
quietly {
split SubjectCategory, parse(;) gen(cat)
reshape long cat, i(pubid)
bysort cat: keep if _n == 1
drop if mi(cat)
}
// show results and generate the wanted variable
list cat
local distinct = _N
dis _n as txt "distinct = " as res `distinct'
restore
gen wanted = `distinct'
end
runby distinct_categories, by(economist temps99) verbose
This is an example of the XY problem, I think. See http://xyproblem.info/
tabsplit is a command in the package tab_chi from SSC. I have no negative feelings about it, as I wrote it, but it seems quite unnecessary here.
You want to count categories in a string variable: semi-colons are your separators. So count semi-colons and add 1.
local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .))
Then (e.g.) table or tabstat will let you explore further by groups of interest.
To see the counting idea, consider 3 categories with 2 semi-colons.
. display length("frog;toad;newt")
14
. display length(subinstr("frog;toad;newt", ";", "", .))
12
If we replace each semi-colon with an empty string, the change in length is the number of semi-colons deleted. Note that we don't have to change the variable to do this. Then add 1. See also this paper.
That said, a way to extend your approach might be
egen class = group(economist temps99), label
su class, meanonly
local nclass = r(max)
gen result = .
forval i = 1/`nclass' {
di "`: label (class) `i''"
tabsplit SubjectCategory if class == `i', p(;) sort
return list
replace result = r(r) if class == `i'
}
Using statsby would be even better. See also this FAQ.

How can I sort variables based on part of a string variable?

I have a dataset with string variables and I am trying to generate a new binary variable based on the first two characters. All strings are 5 characters long, but I'm only concerned with the first two in order to sort.
For example, I could have 22001 and 22005. Since both are of the form 22XXX, I want to assign value 1 for both in the variable type_A. And if I have 25001 and 25005, since both are not of the form 22XXX, I want to assign value 0 for both in the variable type_A.
This should do the job:
clear
set obs 4
generate str5 var1 = "22001" in 1
replace var1 = "22005" in 2
replace var1 = "25001" in 3
replace var1 = "25005" in 4
gen type_A = substr(var1, 1, 2) == "22"
Please note that, as you explain your problem, it looks like you are storing 22005 as text, which may not necessarily be the best idea.
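If the codes were held as numbers instead, the same indicator could be built with integer arithmetic. The following is only a sketch under the assumption that every code is a five-digit number; the variable name num1 is made up for illustration.

```stata
clear
input str5 var1
"22001"
"22005"
"25001"
"25005"
end

* convert the string codes to a numeric variable
destring var1, generate(num1)

* dropping the last three digits leaves the two-digit prefix
gen byte type_A = floor(num1/1000) == 22
list
```

Note that leading zeros would be lost on conversion, so a code such as "02201" would need the string approach or extra care.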

Use of local macro

I want to write six temp data files from my original data keeping the following variables:
temp1: v1-v18
temp2: v1-v5 v19-v31
temp3: v1-v5 v32-v44
temp4: v1-v5 v45-v57
temp5: v1-v5 v58-v70
temp6: v1-v5 v71-v84
I have tried the following:
forvalues i =1(1)6 {
preserve
local j = 6 + (`i'-1)*13
local k = `j'+12
keep v1-v18 if `j'==6
keep v1-v5 v`i'-v`k' if `i'>6 & `j'<71
keep v1-v5 v71-v84 if `j'==71
export delimited using temp`i'.csv, delimiter(";") novarnames replace
restore
}
I get an invalid syntax error. The problem lies with the keep statements. Specifically the if condition with a local macro seems to be against syntax rules.
I think part of your confusion is due to misunderstanding the if qualifier vs the if command.
The if command evaluates an expression: if that expression is true, it executes what follows. The if command should be used to evaluate a single expression, in this case, the value of a macro.
You might use an if qualifier, for example, when you want to regress y x if x > 2 or replace x = . if x <= 2 etc. See here for a short description.
Your syntax has other issues too. You cannot have code following on the same line as the open brace in your forvalues loop, or again on the same line as your closing brace. You also use the local i to condition your keep. I think you mean to use j here, as i simply serves to iterate the loop, not identify a variable suffix.
Further, the logic here seems to work, but doesn't seem very general or efficient. I imagine there is a better way to do this but I don't have time to play around with it at the moment - perhaps an update later.
In any case, I think the correct syntax most analogous to what you have tried is something like the following.
clear *
set more off
set obs 5
forvalues i = 1/84 {
gen v`i' = runiform()
}
forvalues i =1/6 {
preserve
local j = 6 + (`i'-1)*13
local k = `j'+12
if `j' == 6 {
keep v1-v18
}
else if `j' > 6 & `j' < 71 {
keep v1-v5 v`j'-v`k'
}
else keep v1-v5 v71-v84
ds
di
restore
}
I use ds here to simply list the variables in the data, followed by di to display a blank line as a separator, but you could simply plug your export back in and it should work just fine.
Another thing to consider, if you truly want temporary data files, is tempfile, so that the files are erased automatically and nothing is left behind on disk. You might use
forvalues i = 1/6 {
tempfile temp`i'
// other commands
save `temp`i''
}
This will create six Stata data files, referenced by the locals temp1 - temp6, that are stored in Stata's temporary directory and erased automatically when the program or do-file terminates.
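To make the placeholder concrete, here is a minimal sketch that combines the keep logic above with tempfile. It assumes the same simulated v1-v84 data, and it tests i rather than j, which selects the same three cases. Everything has to run within one do-file, because the temporary files are gone once it ends.

```stata
clear
set obs 5
forvalues i = 1/84 {
    gen v`i' = runiform()
}

forvalues i = 1/6 {
    preserve
    local j = 6 + (`i' - 1)*13
    local k = `j' + 12
    if `i' == 1 {
        keep v1-v18
    }
    else if `i' < 6 {
        keep v1-v5 v`j'-v`k'
    }
    else {
        keep v1-v5 v71-v84
    }
    * each pass defines a new local holding a temporary filename
    tempfile temp`i'
    save "`temp`i''"
    restore
}

* read one of the pieces back in: v1-v5 plus v19-v31
use "`temp2'", clear
describe, simple
```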

How to return a value label by indexing label position

Suppose I have a variable named MyVar with value labels defined like this:
0 Something
1 Something else
2 Yet another thing
How do I obtain the second value label (i.e. "Something else")? Edit: Assume that I do not know a priori what the factor values are (i.e. I do not know the minimum value label, and the factor values may increment by numbers other than 1, and may increment unevenly).
I know I can obtain the label corresponding to the value of 2:
. local LABEL: label (MyVar) 2, strict
. di "`LABEL'"
Yet another thing
But I want to obtain the label corresponding to the position of 2 in the value label list:
. <Some amazing Stata-fu using (labeled) variable MyVar and the position 2>
. di "`LABEL'"
Something else
You want to nest a couple of extended macro functions like matryoshkas:
clear
set obs 3
gen x=_n-1
label define xlab 0 "Something" 1 "Something else" 2 "Yet another thing"
lab val x xlab
levelsof x, local(xnumbers)
di "`:label xlab `:word 2 of `xnumbers'''"
Work from the end of the last line to the front. The local xnumbers produced by levelsof contains the distinct levels of x from smallest to largest: 0 1 2. Then you figure out what the second word of that local is, which is 1. Finally, you get the label corresponding to that numeric value, which is "Something else".
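The same idea should carry over to the case raised in the question's edit, where the values are unevenly spaced. Here is a sketch with made-up values 0, 5 and 20, with the nesting unrolled into two steps for readability; the macro names second and LABEL are illustrative.

```stata
clear
input byte x
0
5
20
end
label define xlab 0 "Something" 5 "Something else" 20 "Yet another thing"
label values x xlab

* distinct values actually present in x, sorted: 0 5 20
levelsof x, local(xnumbers)

* the value in position 2, then the label attached to that value
local second : word 2 of `xnumbers'
local LABEL : label xlab `second'
display "`LABEL'"
```

One caveat: levelsof only sees values that occur in the data, so a labelled value with no observations is skipped and the positions shift accordingly.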
You can get the labels into a vector in Mata.
. sysuse auto, clear
(1978 Automobile Data)
. mata
------------------------------------------------- mata (type end to exit) --
: st_vlload("origin", values = ., text = "")
: values
1
+-----+
1 | 0 |
2 | 1 |
+-----+
: text
1
+------------+
1 | Domestic |
2 | Foreign |
+------------+
: text[2,1]
Foreign
: end
That could be the hard core of a program to do something with them. Dependent on what you want to do, the answer could be expanded. It's also up for grabs whether you start with a variable name or a value label name.
EDIT: Here is a quick hack at a program to return the j th value label. You present a name which by default is taken to be a variable name; with the labelname option it is taken to be a value label name. Not much tested.
*! 1.0.0 NJC 7 Oct 2014
program jthvaluelabel, rclass
version 9
syntax name , j(numlist int >0 min=1 max=1) [labelname]
if "`labelname'" == "" {
confirm var `namelist'
local labelname : value label `namelist'
if "`labelname'" == "" {
di as err "no value label attached to `namelist'"
exit 111
}
}
else {
local labelname `namelist'
capture label list `labelname'
if _rc {
di as err "no such value label defined"
exit 111
}
}
mata: lookitup("`labelname'", `j')
di as text `"`valuelabel'"'
return local valuelabel `"`valuelabel'"'
end
mata:
void lookitup (string scalar lblname, real scalar j) {
real colvector values
string colvector text
real scalar nlbl
string scalar labels
st_vlload(lblname, values = ., text = "")
nlbl = length(text)
if (nlbl == 1) labels = "label"
else if (nlbl > 1) labels = "labels"
if (nlbl < j) {
errprintf("no such label; %1.0f %s, but #%1.0f requested\n",
nlbl, labels, j)
exit(498)
}
else {
st_local("valuelabel", text[j])
}
}
end
Some examples:
. sysuse auto, clear
(1978 Automobile Data)
. jthvaluelabel foreign, j(1)
Domestic
. jthvaluelabel foreign, j(2)
Foreign
. jthvaluelabel foreign, j(3)
no such label; 2 labels, but #3 requested
r(498);
. jthvaluelabel make, j(1)
no value label attached to make
r(111);
. jthvaluelabel origin, j(1) labelname
Domestic
Posting code here is occasionally a little difficult. The code delimiters aren't always respected. The real program on my machine is indented more systematically than is evident from the version above.
I cobbled together a nice solution from Nick's and Dimitriy's answers and comments (the application is for a function outputting a line of a table, in a section and the user has specified that they want labels for groupvar for the position index):
local labelname : value label `groupvar'
mata: st_vlload("`labelname'", values = ., text = "")
mata: st_local("vallab", text[`index'])
local vallab = substr("`vallab'",1,8)
Then the program carries on using the local vallab.