How to return a value label by indexing label position - stata

Suppose I have a variable named MyVar with value labels defined like this:
0 Something
1 Something else
2 Yet another thing
How do I obtain the second value label (i.e. "Something else")? Edit: Assume that I do not know a priori what the factor values are (i.e. I do not know the minimum value label, and the factor values may increment by numbers other than 1, and may increment unevenly).
I know I can obtain the label corresponding to the value of 2:
. local LABEL: label (MyVar) 2, strict
. di "`LABEL'"
Yet another thing
But I want to obtain the label corresponding to the position of 2 in the value label list:
. <Some amazing Stata-fu using (labeled) variable MyVar and the position 2>
. di "`LABEL'"
Something else

You want to nest a couple of extended macro functions like matryoshkas:
clear
set obs 3
gen x=_n-1
label define xlab 0 "Something" 1 "Something else" 2 "Yet another thing"
lab val x xlab
levelsof x, local(xnumbers)
di "`:label xlab `:word 2 of `xnumbers'''"
Work from the end of the last line to the front. The local xnumbers produced by levelsof contains the distinct levels of x from smallest to largest: 0 1 2. Then you figure out what the second word of that local is, which is 1. Finally, you get the label corresponding to that numeric value, which is "Something else".
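To check that this also covers the edit in the question (unknown, unevenly spaced values), here is a minimal sketch with made-up values 10, 25 and 70; the variable y and the label ylab are hypothetical names used only for illustration. One caveat: levelsof only sees values that actually occur in the data, so a label defined for a value that never occurs is skipped when counting positions.

```stata
* hypothetical data: labeled values that do not start at 0 and are unevenly spaced
clear
input byte y
10
25
70
end
label define ylab 10 "Low" 25 "Medium" 70 "High"
label values y ylab

* distinct values present in the data, sorted ascending
levelsof y, local(ynumbers)

* the second value in that list, then the label attached to it
local second : word 2 of `ynumbers'
local lab2 : label (y) `second'
display "`lab2'"
```

The two intermediate locals do the same thing as the nested one-liner; they just make each step visible.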

You can get the labels into a vector in Mata.
. sysuse auto, clear
(1978 Automobile Data)
. mata
------------------------------------------------- mata (type end to exit) --
: st_vlload("origin", values = ., text = "")
: values
1
+-----+
1 | 0 |
2 | 1 |
+-----+
: text
1
+------------+
1 | Domestic |
2 | Foreign |
+------------+
: text[2,1]
Foreign
: end
That could be the hard core of a program to do something with them. Depending on what you want to do, the answer could be expanded. It's also up for grabs whether you start with a variable name or a value label name.
EDIT: Here is a quick hack at a program to return the jth value label. You present a name which by default is taken to be a variable name; with the labelname option it is taken to be a value label name. Not much tested.
*! 1.0.0 NJC 7 Oct 2014
program jthvaluelabel, rclass
version 9
syntax name , j(numlist int >0 min=1 max=1) [labelname]
if "`labelname'" == "" {
confirm var `namelist'
local labelname : value label `namelist'
if "`labelname'" == "" {
di as err "no value label attached to `namelist'"
exit 111
}
}
else {
local labelname `namelist'
capture label list `labelname'
if _rc {
di as err "no such value label defined"
exit 111
}
}
mata: lookitup("`labelname'", `j')
di as text `"`valuelabel'"'
return local valuelabel `"`valuelabel'"'
end
mata:
void lookitup (string scalar lblname, real scalar j) {
real colvector values
string colvector text
real scalar nlbl
string scalar labels
st_vlload(lblname, values = ., text = "")
nlbl = length(text)
if (nlbl == 1) labels = "label"
else if (nlbl > 1) labels = "labels"
if (nlbl < j) {
errprintf("no such label; %1.0f %s, but #%1.0f requested\n",
nlbl, labels, j)
exit(498)
}
else {
st_local("valuelabel", text[j])
}
}
end
Some examples:
. sysuse auto, clear
(1978 Automobile Data)
. jthvaluelabel foreign, j(1)
Domestic
. jthvaluelabel foreign, j(2)
Foreign
. jthvaluelabel foreign, j(3)
no such label; 2 labels, but #3 requested
r(498);
. jthvaluelabel make, j(1)
no value label attached to make
r(111);
. jthvaluelabel origin, j(1) labelname
Domestic
Posting code here is occasionally a little difficult. The code delimiters aren't always respected. The real program on my machine is indented more systematically than is evident from the version above.

I cobbled together a nice solution from Nick's and Dimitriy's answers and comments (the application is a function that outputs a line of a table; in one section the user has specified that they want the labels of groupvar for the position index):
local labelname : value label `groupvar'
mata: st_vlload("`labelname'", values = ., text = "")
mata: st_local("vallab", text[`index'])
local vallab = substr("`vallab'",1,8)
Then the program carries on using the local vallab.

Related

Multiple local in foreach command macro

I have a dataset with multiple subgroups (variable economist) and dates (variable temps99).
I want to run a tabsplit command that does not accept bysort or by prefixes. So I created a macro to apply my tabsplit command to each of my subgroups within my data.
For example:
levelsof economist, local(liste)
foreach gars of local liste {
display "`gars'"
tabsplit SubjectCategory if economist=="`gars'", p(;) sort
return list
replace nbcateco = r(r) if economist == "`gars'"
}
For each subgroup, Stata runs the tabsplit command and I use the variable nbcateco to store count results.
I did the same for the date so I can have the evolution of r(r) over time:
levelsof temps99, local(liste23)
foreach time of local liste23 {
display "`time'"
tabsplit SubjectCategory if temps99 == "`time'", p(;) sort
return list
replace nbcattime = r(r) if temps99 == "`time'"
}
Now I want to do it for each economist subgroup by date temps99. I tried multiple combinations, but I am not very good with macros (yet?).
What I want is to be able to have my r(r) for each of my subgroups over time.
Here's a solution that shows how to calculate the number of distinct publication categories within each by-group. This uses runby (from SSC). runby loops over each by-group, each time replacing the data in memory with the data from the current by-group. For each by-group, the commands contained in the user's program are executed. Whatever is left in memory when the user's program terminates is considered results and accumulates. Once all the groups have been processed, these results replace the data in memory.
I used the verbose option because I wanted to present the results for each by-group using nice formatting. The derivation of the list of distinct categories is done by splitting each list, converting to a long layout, and reducing to one observation per distinct value. The distinct_categories program generates one variable that contains the final count of distinct categories for the by-group.
* create a demonstration dataset
* ------------------------------------------------------------------------------
clear all
set seed 12345
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 economist
"Carmen M. Reinhart"
"Janet Currie"
"Asli Demirguc-Kunt"
"Esther Duflo"
"Marianne Bertrand"
"Claudia Goldin"
"Bronwyn Hughes Hall"
"Serena Ng"
"Anne Case"
"Valerie Ann Ramey"
end
expand 20
bysort economist: gen temps99 = 1998 + _n
gen pubs = runiformint(1,10)
expand pubs
sort economist temps99
gen pubid = _n
local nep NEP-AGR NEP-CBA NEP-COM NEP-DEV NEP-DGE NEP-ECM NEP-EEC NEP-ENE ///
NEP-ENV NEP-HIS NEP-INO NEP-INT NEP-LAB NEP-MAC NEP-MIC NEP-MON ///
NEP-PBE NEP-TRA NEP-URE
gen SubjectCategory = ""
forvalues i=1/19 {
replace SubjectCategory = SubjectCategory + " " + word("`nep'",`i') ///
if runiform() < .1
}
replace SubjectCategory = subinstr(trim(SubjectCategory)," ",";",.)
leftalign // from SSC
* ------------------------------------------------------------------------------
program distinct_categories
dis _n _n _dup(80) "-"
dis as txt "fille = " as res economist[1] as txt _col(68) " temps = " as res temps99[1]
// if there are no subjects for the group, exit now to avoid a no obs error
qui count if !mi(trim(SubjectCategory))
if r(N) == 0 exit
// split categories, reshape to a long layout, and reduce to unique values
preserve
keep pubid SubjectCategory
quietly {
split SubjectCategory, parse(;) gen(cat)
reshape long cat, i(pubid)
bysort cat: keep if _n == 1
drop if mi(cat)
}
// show results and generate the wanted variable
list cat
local distinct = _N
dis _n as txt "distinct = " as res `distinct'
restore
gen wanted = `distinct'
end
runby distinct_categories, by(economist temps99) verbose
This is an example of the XY problem, I think. See http://xyproblem.info/
tabsplit is a command in the package tab_chi from SSC. I have no negative feelings about it, as I wrote it, but it seems quite unnecessary here.
You want to count categories in a string variable: semi-colons are your separators. So count semi-colons and add 1.
local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .))
Then (e.g.) table or tabstat will let you explore further by groups of interest.
To see the counting idea, consider 3 categories with 2 semi-colons.
. display length("frog;toad;newt")
14
. display length(subinstr("frog;toad;newt", ";", "", .))
12
If we replace each semi-colon with an empty string, the change in length is the number of semi-colons deleted. Note that we don't have to change the variable to do this. Then add 1. See also this paper.
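As a minimal sketch of the counting idea on a toy dataset (the three observations are made up for illustration): note that this counts the categories listed in each observation, and that an empty string would otherwise be counted as 1 category, so it is set to 0 explicitly here.

```stata
* toy data: three categories, one category, and no category at all
clear
input str20 SubjectCategory
"frog;toad;newt"
"frog"
""
end

* number of semi-colons plus 1
local SC SubjectCategory
generate NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .))

* an empty string contains no category, not one
replace NCategory = 0 if missing(`SC')
list
```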
That said, a way to extend your approach might be
egen class = group(economist temps99), label
su class, meanonly
local nclass = r(max)
gen result = .
forval i = 1/`nclass' {
di "`: label (class) `i''"
tabsplit SubjectCategory if class == `i', p(;) sort
return list
replace result = r(r) if class == `i'
}
Using statsby would be even better. See also this FAQ.

How to populate missing values for string variable in a column based on fixed criteria

To populate missing data with a fixed range of values
I would like to check how to populate column aktype with a range of values (the range of values for the same pidlink are always fixed at 11 types of values listed below) for those cells with missing values. I have about 17,000+ observations that are missing.
The range of values are as follows:
A
B
C
D
E
G
H
I
J
K
L
I have tried the following command but it does not work:-
foreach x of varlist aktype=1/11 {
replace aktype = "A" in 1 if aktype==""
replace aktype = "B" in 2 if aktype==""
replace aktype = "C" in 3 if aktype==""
replace aktype = "D" in 4 if aktype==""
replace aktype = "E" in 5 if aktype==""
replace aktype = "G" in 6 if aktype==""
replace aktype = "H" in 7 if aktype==""
replace aktype = "I" in 8 if aktype==""
replace aktype = "J" in 9 if aktype==""
replace aktype = "K" in 10 if aktype==""
replace aktype = "L" in 11 if aktype==""
}
Would appreciate it if you could advise on the right command to use. Many thanks!
I would generate a variable AK that has letters A-K in positions 1-11 (and 12-22, and 23-33, and so on). Then replace missing values with the value of this variable AK.
* generate data
clear
set obs 20
generate aktype = ""
replace aktype = "foo" in 1/1
replace aktype = "bar" in 10/12
* generate variable with letters A-K
generate AK = char(65 + mod(_n - 1, 11))
* fill missing values
replace aktype = AK if missing(aktype)
list
This yields the following.
. list
+-------------+
| aktype AK |
|-------------|
1. | foo A |
2. | B B |
3. | C C |
4. | D D |
5. | E E |
|-------------|
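Note that char(65 + mod(_n - 1, 11)) cycles through A to K, including F, whereas the list in the question skips F and ends at L. A minimal sketch of a variant that cycles through exactly the 11 values listed in the question, assuming (as above) that position in the dataset determines the value; the variable name fill is hypothetical. If the cycle should restart within each pidlink, the generate line could presumably be prefixed with bysort pidlink: instead.

```stata
* generate data
clear
set obs 24
generate aktype = ""
replace aktype = "foo" in 1

* the 11 values from the question (no F)
local types A B C D E G H I J K L

* word() picks the 1st to 11th value, cycling every 11 observations
generate fill = word("`types'", 1 + mod(_n - 1, 11))

* fill missing values
replace aktype = fill if missing(aktype)
list in 1/12
```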
This first addresses the comment "it does not work".
Generally, in this kind of forum you should always be specific and say exactly what happens, namely where the code breaks down and what the result is (e.g. what error message you get). If necessary, add why that is not what is wanted.
Specifically, in this case Stata would get no further than
foreach x of varlist aktype=1/11
which is illegal (as well as unclear to Stata programmers).
You can loop over a varlist. In this case looping over a single variable aktype is legal. (It is usually pointless, but that's style, not syntax.) So this is legal:
foreach x of varlist aktype
By the way, you define x as the loop argument, but never refer to it inside the loop. That isn't illegal, but it is unusual.
You can also loop over a numlist, e.g.
foreach x of numlist 1/11
although
forval x = 1/11
is a more direct way of doing that. All this follows from the syntax diagrams for the commands concerned, where whatever is not explicitly allowed is forbidden.
On occasions when you need to loop over a varlist and a numlist you will need to use different syntax, but what is best depends on the precise problem.
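As one illustration of looping over a varlist and a numlist in parallel, here is a minimal sketch that loops over the numbers and picks the matching variable name by position. It uses the auto dataset shipped with Stata; the choice of variables is arbitrary.

```stata
sysuse auto, clear

* loop over positions 1 to 3 and pick the matching variable name each time
local vars price mpg weight
forvalues i = 1/3 {
    local v : word `i' of `vars'
    summarize `v', meanonly
    display "`i': `v' has mean " r(mean)
}
```

Whether this or some other construct is best depends on the problem, as said above.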
Now second to the question: I can't see any kind of rule in the question for which values get assigned A through L, so can't advise positively.

Stata factor value from label

I would like to look up a value/code associated with a label, and store that value in a scalar or local macro. While the information I want is stored in the definition of the label vector, apparently I need to go through some contortions to get it.
Extending Roberto Ferrer's answer to my last question, I came up with this approach:
// sample data
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)
// get code for "b"
gen tmp = 0
replace tmp = myfactor if myfactor == "b":myfactor
sort tmp
scalar bcode = tmp[_N]
This seems woefully inefficient in terms of data manipulation and code maintenance, especially considering how the information I want is already saved (and viewable with label list).
This uses labellist, from SSC. Download using ssc install labellist.
clear
set more off
*----- example data -----
input str5 mystr
"good"
"bad"
"bad"
"regular"
end
encode mystr, gen(myfactor)
*----- what you want -----
labellist
local faclab = r(myfactor_labels)
local facval = r(myfactor_values)
// get # for "good"
local i : list posof "good" in faclab
local j : word `i' of `facval'
display "`j'"
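For comparison, a minimal sketch that needs no extra package: Stata expressions accept the form "text":labelname, which evaluates to the numeric value that the value label maps to that text. This is the same construct used in the question's replace statement, just applied directly. The data mirror the question's example; encode names the value label after the new variable, here myfactor.

```stata
* sample data
clear
input str5 mystr
"a"
"b"
"b"
"c"
end
encode mystr, gen(myfactor)

* look up the code for "b" directly from the value label
scalar bcode = "b":myfactor
display bcode
```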

Stata - assign different variables depending on the value within a variable

Sorry that title is confusing. Hopefully it's clear below.
I'm using Stata and I'd like to assign the value 1 to a variable that depends on the value within a different variable. I have 20 order variables and also 20 corresponding variables. For example if order1 = 3, I'd like to assign variable3 = 1. Below is a snippet of what the final dataset would look like if I had only 3 of each variable.
Right now I'm doing this with two loops, but I have to wrap another loop around this that goes through it 9 more times, plus I'm doing this for a couple hundred data files. I'd like to make it more efficient.
forvalues i = 1/20 {
    forvalues j = 1/20 {
        replace variable`j' = 1 if order`i'==`j'
    }
}
Is it possible to use the value of order'i' to assign the variable[order`i'VALUE] directly? Then I can get rid of the j loop above. Something like this.
forvalues i = 1/20 {
replace variable[`order`i'value] = 1
}
Thanks for your help!
***** CLARIFICATION ADDED Feb 2nd.**
I simplified my problem and the dataset too much, because the suggested solutions work for what I presented but do not get at what I'm really attempting to do. Thank you three for your solutions, though. I was not clear enough in my post.
To clarify, my data doesn't have a one to one correspondence of each order# assigning variable# a 1 if it's not missing. For example, the first observation for order1=3, variable1 isn't supposed to get a 1, variable3 should get a 1. What I didn't include in my original post is that I'm actually checking for other conditions to set it equal to 1.
For more background, I'm counting up births of women by birth order(1st child, 2nd child, etc) that occurred at different ages of mothers. So in the data, each row is a woman, each order# is the number birth (order1=3, it's her third child). The corresponding variable#s are the counts (variable# means the woman has a child of birth order #). I mentioned in the post, that I do this 9 times bc I'm doing it for 5 year age groups (15-19; 20-24; etc). So the first set of variable# would be counts of birth by order when women were ages 15-19; the second set of variable# would be counts of births by order when women were 20-24. etc etc. After this, I sum up the counts in different ways (by woman's education, geography, etc).
So with the additional loop what I do is something more like this
forvalues k = 1/9 {
    forvalues i = 1/20 {
        forvalues j = 1/20 {
            replace variable`k'_`j' = 1 if order`i'==`j' & age`i'==`k' & birth_age`i'<36
        }
    }
}
Not sure if it's possible, but I wanted to simplify so I only need to cycle through each child once, without cycling through the birth orders and directly use the value in order# to assign a 1 to the correct variable. So if order1=3 and the woman had the child at the specific age group, assign variable[agegroup][3]=1; if order1=2, then variable[agegroup][2] should get a 1.
forvalues k=1/9{
forvalues i = 1/20 {
replace variable`k'_[`order`i'value] = 1 if age`i'==`k' & birth_age`i'<36
}
}
I would reshape twice. First reshape to long, then condition variable on !missing(order), then reshape back to wide.
* generate your data
clear
set obs 3
forvalues i = 1/3 {
generate order`i' = .
local k = (3 - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
list
*. list
*
* +--------------------------+
* | order1 order2 order3 |
* |--------------------------|
* 1. | 3 2 1 |
* 2. | 2 1 . |
* 3. | 1 . . |
* +--------------------------+
* I would reshape to long, then back to wide
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
order order* variable*
drop id
list
*. list
*
* +-----------------------------------------------------------+
* | order1 order2 order3 variab~1 variab~2 variab~3 |
* |-----------------------------------------------------------|
* 1. | 3 2 1 1 1 1 |
* 2. | 2 1 . 1 1 0 |
* 3. | 1 . . 1 0 0 |
* +-----------------------------------------------------------+
Using a simple forvalues loop with generate and missing() is orders of magnitude faster than the other solutions proposed so far. For this problem you need only one loop to traverse the complete list of variables, not two, as in the original post. Below is some code that shows both points.
*----------------- generate some data ----------------------
clear all
set more off
local numobs 60
set obs `numobs'
quietly {
forvalues i = 1/`numobs' {
generate order`i' = .
local k = (`numobs' - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
}
timer clear
*------------- method 1 (gen + missing()) ------------------
timer on 1
quietly {
forvalues i = 1/`numobs' {
generate variable`i' = !missing(order`i')
}
}
timer off 1
* ----------- method 2 (reshape + missing()) ---------------
drop variable*
timer on 2
quietly {
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
}
timer off 2
*--------------- method 3 (egen, rowmax()) -----------------
drop variable*
timer on 3
quietly {
// loop over the order variables creating dummies
forvalues v=1/`numobs' {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables
// (may need to change)
forvalues l=1/`numobs' {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}
}
timer off 3
*----------------- method 4 (original post) ----------------
drop variable*
timer on 4
quietly {
forvalues i = 1/`numobs' {
gen variable`i' = 0
forvalues j = 1/`numobs' {
replace variable`i' = 1 if order`i'==`j'
}
}
}
timer off 4
*-----------------------------------------------------------
timer list
The timed procedures give
. timer list
1: 0.00 / 1 = 0.0010
2: 0.30 / 1 = 0.3000
3: 0.34 / 1 = 0.3390
4: 0.07 / 1 = 0.0700
where timer 1 is the simple gen, timer 2 the reshape, timer 3 the egen, rowmax(), and timer 4 the original post.
The reason you need only one loop is that Stata executes a command for all observations in the dataset, from top (first observation) to bottom (last observation). For example, variable1 is generated according to whether order1 is missing or not; this is done for all observations of both variables, without an explicit loop.
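A minimal sketch of that point on three made-up observations: a single generate evaluates the expression once per observation, so no loop over observations is written.

```stata
clear
input order1
3
2
.
end

* evaluated for every observation: 1 if order1 is nonmissing, 0 otherwise
generate variable1 = !missing(order1)
list
```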
I wonder if you actually need to do this. For future questions, if you have a further goal in mind, I think a good strategy is to mention it in your post.
Note: I've reused code from other posters' answers.
Here's a simpler way to do it (that still requires 2 loops):
// loop over the order variables creating dummies
forvalues v=1/20 {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables (may need to change)
forvalues l=1/3 {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}
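For the clarified version of the question (age group, birth order, and the birth_age < 36 condition), here is a rough sketch of a reshape-based approach that avoids the loop over birth orders. It is only a sketch under stated assumptions: the two-woman dataset is made up, id is a hypothetical identifier, only two children per woman are shown, and only those variablek_j columns that actually occur in the data are created. Women with no qualifying birth drop out and would have to be merged back in.

```stata
* made-up data: one row per woman, two children each
clear
input id order1 age1 birth_age1 order2 age2 birth_age2
1 3 2 27 2 1 22
2 1 1 19 . . .
end

* one row per woman and child
reshape long order age birth_age, i(id) j(child)

* keep only births that satisfy the conditions
keep if !missing(order, age) & birth_age < 36

* the key combines age group and birth order, e.g. "2_3"
generate byte variable = 1
generate key = string(age) + "_" + string(order)
keep id key variable
duplicates drop id key, force

* back to one row per woman: variable2_3, variable1_2, ...
reshape wide variable, i(id) j(key) string
mvencode variable*, mv(0)
list
```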

Capitalizing value labels in Stata

Some datasets come with full-lowercase value labels, and I end up with graphs and tables showing results for "egypt", "jordan" and "saudi arabia" instead of the capitalized country names.
I guess that the proper() string function can do something for me, but I am not finding the right way to write the code for Stata 11 that will capitalize all value labels for a given variable.
I basically need to run the proper() function on all value labels on the variable, and then assign them to the variable. Is that possible using a foreach loop and macros in Stata?
Yes. First let's create some sample data with labels for testing:
clear
drawnorm x, n(10)
gen byte v = int(4+x)
drop x
label define types 0 "zero" 1 "one" 2 "two" 3 "three" 4 "four" 5 "five" 6 "six"
label list types
label values v types
Here's some code that uses local macros to capitalize the value labels attached to the variable "v":
local varname v
local sLabelName: value label `varname'
di "`sLabelName'"
levelsof `varname', local(xValues)
foreach x of local xValues {
local sLabel: label (`varname') `x', strict
local sLabelNew =proper("`sLabel'")
noi di "`x': `sLabel' ==> `sLabelNew'"
label define `sLabelName' `x' "`sLabelNew'", modify
}
After running it, check the results:
label list types
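Because levelsof only returns values that occur in the data, the loop above leaves untouched any label whose value happens not to appear in v. A minimal sketch of an alternative that works on the label definition itself, using Mata's st_vlload() and st_vlmodify() together with strproper(); availability of these functions in Stata 11 is an assumption here, not something checked.

```stata
* the same label definition as in the test data above
clear
label define types 0 "zero" 1 "one" 2 "two" 3 "three" 4 "four" 5 "five" 6 "six"

mata:
// load all values and texts of the value label, capitalize, write back
st_vlload("types", values = ., text = "")
st_vlmodify("types", values, strproper(text))
end

label list types
```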