Syntax issues for making a new variable from integer data

I am currently writing a do-file for Australian data. The survey collected each participant's postcode in a free-text box, and I would like to create a new variable that assigns participants to states. Stata has stored the postcode as type "int", but when I try to make the new variable I get a syntax error. I have included the variations on the value range that I have tried.
*Make postcode to states
generate famsur_state = "ACT/NSW" if famsur_postcode=="2000/2999"
replace famsur_state = "SA" if famsur_postcode==(5000/5999)
replace famsur_state = "QLD" if famsur_postcode==4000/4999
replace famsur_state = "NT" if famsur_postcode==">=0000 & <=0999"
replace famsur_state = "WA" if famsur_postcode==>=6000 & <=6999
replace famsur_state = "TAS" if famsur_postcode==>=7000 & <=7999
replace famsur_state = "VIC" if famsur_postcode==>=3000 & <=3999
label var famsur_state "Which state is the participant from?"
label define state 1 "ACT/NSW" ///
2 "SA" ///
3 "QLD" ///
4 "NT" ///
5 "WA" ///
6 "TAS" ///
7 "VIC"
label values famsur_state state

I'm not really sure I understand what you want to do here. If I understand you correctly, you're dealing with data on families within states, perhaps at one point in time, perhaps at many points in time. So that we have a reference point, maybe tell me a bit more about the context the data are situated in.
Anyway, if this is true, and all you want to do is make a string variable based on the values of a numeric variable, your issue is the way you delineate the range of codes you're interested in. As the code above is written, you ask for the string ACT/NSW if the postcode equals "2000/2999". Stata interprets anything in double quotes as a string, so this asks it to assign "ACT/NSW" only if the cell literally contained the text "2000/2999", and because famsur_postcode is numeric, comparing it with a string is a type mismatch.
One way of tackling this problem is the inrange() function: inrange(z, a, b) is true when z lies between a and b inclusive. Here's my code so you can follow along. To watch it work step by step, uncomment "set tr on" (short for set trace on).
clear
cls
set obs 4
g famsur_postcode = .
replace famsur_postcode = 2000 in 1
replace famsur_postcode = 2999 in 2
replace famsur_postcode = 3000 in 3
replace famsur_postcode = 3999 in 4
loc states ACT/NSW VIC
loc ranges `""2000,2999" "3000,3999""'
loc n: word count `states'
as `n' ==2
g famsur_state = ""
*set tr on
forv i = 1/`n' {
loc a: word `i' of `states' // Where 1 = "ACT/NSW" and 2 = "VIC"
loc b: word `i' of `ranges' // Where 1 = "2000,2999" and 2 = "3000,3999"
replace famsur_state = "`a'" if inrange(famsur_postcode,`b')
}
br
What I do above is make one macro for the state abbreviations and one for the code ranges. I count how many words there are in the states macro (in this case 2, but it can be any number you like), then loop over the two macros as parallel lists (see the good documentation on this on the Stata website). As you can see, this handles the first part correctly.
You also seem to want to assign value labels to these abbreviations. What I would do is use the sencode command (ssc inst sencode, replace) to accomplish this, and then recode the remaining values by hand.
sencode famsur_state, replace
recode famsur_state (2=7)

You have here three different guesses at the syntax, all wrong. P1 to P3 are identifiable problems with your syntax.
generate famsur_state = "ACT/NSW" if famsur_postcode=="2000/2999"
P1. This syntax is illegal, and so wrong, if the postcode variable is integer, as the double quotes imply that it is string.
replace famsur_state = "SA" if famsur_postcode==(5000/5999)
replace famsur_state = "QLD" if famsur_postcode==4000/4999
P2. This syntax is legal if the postcode is integer, but wrong because the slash indicates division, and the result of 5000/5999 or 4000/4999 is in each case not an integer, so no integer postcode can ever equal it.
replace famsur_state = "NT" if famsur_postcode==">=0000 & <=0999"
P1. Again, the implication of string is illegal.
replace famsur_state = "WA" if famsur_postcode==>=6000 & <=6999
P3. You can say in Stata >= 6000 to mean "greater than or equal to 6000", but & does not work like that: each side of & must be a complete comparison that names the variable. ==>= will not do what you want.
replace famsur_state = "TAS" if famsur_postcode==>=7000 & <=7999
P3. Similar comment.
label var famsur_state "Which state is the participant from?"
label define state 1 "ACT/NSW" ///
2 "SA" ///
3 "QLD" ///
4 "NT" ///
5 "WA" ///
6 "TAS" ///
7 "VIC"
This is legal and possibly even correct, meaning what you want:
generate famsur_state = "ACT/NSW" if inrange(famsur_postcode, 2000, 2999)
This would be legal too but Stata will ignore the leading zeros:
replace famsur_state = "NT" if inrange(famsur_postcode, 0000, 0999)
If what you are seeing is the result of value labels, this could be quite wrong. It is possible that you have integers presenting as e.g. 0000 if the display format is %04.0f, but there is not enough hard information in the question to rule out other possibilities.
The function inrange() will help you, but otherwise Stata's syntax here is what is documented at help operators.
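Putting these points together, here is a minimal sketch of one way to do the whole task. It assumes famsur_postcode is a numeric (int) variable and uses the simplified ranges from the question, which may not match every real Australian postcode. It also makes famsur_state numeric, because value labels can only be attached to numeric variables. The toy postcodes are made up for illustration.

```stata
clear
input int famsur_postcode
2000
2999
3000
4500
5999
800
6999
7000
end

* Numeric codes first; the value labels are attached afterwards.
generate byte famsur_state = .
replace famsur_state = 1 if inrange(famsur_postcode, 2000, 2999)
replace famsur_state = 2 if inrange(famsur_postcode, 5000, 5999)
replace famsur_state = 3 if inrange(famsur_postcode, 4000, 4999)
replace famsur_state = 4 if inrange(famsur_postcode, 0, 999)
replace famsur_state = 5 if inrange(famsur_postcode, 6000, 6999)
replace famsur_state = 6 if inrange(famsur_postcode, 7000, 7999)
replace famsur_state = 7 if inrange(famsur_postcode, 3000, 3999)

label define state 1 "ACT/NSW" 2 "SA" 3 "QLD" 4 "NT" 5 "WA" 6 "TAS" 7 "VIC"
label values famsur_state state
label variable famsur_state "Which state is the participant from?"

list
```

If a string version is wanted as well, decode famsur_state, generate(famsur_state_str) would create one from the labels (famsur_state_str is a hypothetical name).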

Related

giving a string variable values conditional on another variable

I am using Stata 14. I have US states and corresponding regions as integer.
I want to create a string variable that represents the region for each observation.
Currently my code is
gen div_name = "A"
replace div_name = "New England" if div_no == 1
replace div_name = "Middle Atlantic" if div_no == 2
.
.
replace div_name = "Pacific" if div_no == 9
...so the code is really long.
I was wondering if there is a shorter way to do this where I can automate assigning values rather than manually hard coding them.

You can define value labels in one line with label define and then use decode to create the string variable. See the help for those commands.
If the correspondence was defined in a separate dataset you could use merge. See e.g. this FAQ
There can't be a short-cut here other than typing all the names at some point or exploiting the fact that someone else typed them earlier into a file.
With nine or so labels, typing them yourself is quickest.
Note that you type one statement more than you need, even doing it the long way, as you could start
gen div_name = "New England" if div_no == 1
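The label define and decode route can be sketched as follows. Only the three division names quoted in the question are used, so the remaining names would need to be filled in; the label name division and the toy data are assumptions for illustration.

```stata
clear
input byte div_no
1
2
9
end

* One label define statement holds the whole correspondence.
label define division 1 "New England" 2 "Middle Atlantic" 9 "Pacific"
label values div_no division

* decode creates a string variable from the value labels.
decode div_no, generate(div_name)
list
```

Once the labels are attached, div_no itself displays the names in most output, so the string copy may not even be needed.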

Is it possible to invoke a global macro inside a function in Stata?

I have a set of variables the list of which I have saved in a global macro so that I can use them in a function
global inlist_cond "amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss"
The reason why they are saved in a macro is because the list will be in a loop and its content will change depending on the year.
What I need to do is to generate a dummy variable so that water_dummy == 1 if any of the variables in the macro list has the WATER classification. In Stata, I need to write
gen water_dummy = inlist("WATER", "$inlist_cond")
, which--ideally--should translate to
gen water_dummy = inlist("WATER", amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss)
But this did not work---the code executed without any errors but the dummy variable only contained 0s. I know that it is possible to invoke macros inside functions in Stata, but I have never tried it when the macro contains a whole list of conditions. Any thoughts?

The double quotes in the generate statement insist on a literal string, so you are comparing text with text and the comparison is not with the data at all.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen a = "water"
. gen b = "wine"
. gen c = "beer"
. global myvars "a,b,c"
. gen found1 = inlist("water", "$myvars")
. gen found2 = inlist("water", $myvars)
. list
     +---------------------------------------+
     |     a      b      c   found1   found2 |
     |---------------------------------------|
  1. | water   wine   beer        0        1 |
     +---------------------------------------+
The first comparison is equivalent to
. di inlist("water", "a,b,c")
0
which finds no match, as "water" is not matched by the (single!) other argument.
Macro references are certainly allowed within function or command calls: as each macro name is replaced by its contents before the syntax is checked, the function or command never even knows that a macro reference was ever used.
As @Aspen Chen concisely points out, omitting the double quotes gives what you want so long as the inlist() syntax remains legal.
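Since the question says the list changes with the year, here is a minimal sketch of building the argument list in a loop and passing it to inlist() without quotes. The three variables and their contents are toy data, and the local names vars and arglist are hypothetical. As far as I recall from help inlist(), string comparisons accept at most 10 arguments in total, so a longer list of variables would need to be split or handled in a loop instead; check the help file for your version.

```stata
clear
input str8(amz2002ras_clss amz2003ras_clss amz2004ras_clss)
"FOREST" "WATER"  "WATER"
"FOREST" "FOREST" "URBAN"
end

* Collect the variable names for the years of interest.
local vars
forvalues y = 2002/2004 {
    local vars `vars' amz`y'ras_clss
}

* Turn "a b c" into "a, b, c" so it can serve as an argument list.
local arglist : subinstr local vars " " ", ", all

* No double quotes around the macro: inlist() sees variable names.
generate byte water_dummy = inlist("WATER", `arglist')
list
```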

If your data structure is something like in the following example, you can try the egen function incss, from egenmore (ssc install egenmore):
clear
set more off
input ///
str15(amz2009 amz2010)
"water" "juice"
"milk" "water"
"lemonade" "wine"
"water & beer" "tea"
end
list
egen watindic = incss(amz*), sub(water)
list
Be aware it searches for substrings (see the result for the last example observation).
A solution with a loop, which tests for exact matches and so gives a different result for the last observation, is:
gen watindic2 = 0
forvalues i = 2009/2010 {
replace watindic2 = 1 if amz`i' == "water"
}
list
Another solution involves reshape, but I'll leave it at that.

Extract the mean from svy mean result in Stata

I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat village
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then I try to turn the macro names into a variable:
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.

It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
Some notes on your original post
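Your loop doesn't do what you intend (although again, it is not crystal clear to me what that is) because foreach ... in takes a literal list, and the list you supply has one element: the text names, not the contents of the local macro names. You can see the difference by running and comparing: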
local names 1 2 3
foreach v in names {
display "`v'"
}
foreach v in `names' {
display "`v'"
}
foreach v of local names {
display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, if you have not read the initial chapters of the Stata User's Guide, then you should do so.
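To illustrate what : rownames does, here is a small sketch on a made-up matrix (the matrix A and its row names are hypothetical). The extended macro function copies the row names of the matrix into a local macro, which can then be looped over with foreach ... of local.

```stata
* A 3 x 1 matrix with named rows.
matrix A = (1 \ 2 \ 3)
matrix rownames A = north south east

* Put the row names into a local macro.
local names : rownames A
display "`names'"

* Loop over the contents of the macro, one name at a time.
foreach v of local names {
    display "`v'"
}
```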

Use value label in if command

I am working with a set of dta files representing surveys from different years.
Conveniently, each year uses different values for the country variable, so I am trying to set the country value labels for each year to match. I am having trouble comparing value labels though.
So far, I have come up with the following code:
replace country=1 if countryO=="Japan"
replace country=2 if countryO=="South Korea" | countryO=="Korea"
replace country=3 if countryO=="China"
replace country=4 if countryO=="Malaysia"
However, this doesn't work because "Japan" is the value label, not the actual value.
How do I tell Stata that I am comparing the value label?

Try
replace country=1 if countryO=="Japan":country0valuelabel
replace country=2 if inlist(countryO,"South Korea":country0valuelabel,"Korea":country0valuelabel)
You will have to replace country0valuelabel with the name of the corresponding value label in your data. You can find out its name by looking at the penultimate column in the output of describe countryO.
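The same "label":labelname syntax can be tried on the auto dataset shipped with Stata, where the variable foreign carries the value label origin. This is only a sketch; is_foreign is a hypothetical variable name.

```stata
sysuse auto, clear

* The value label column of describe shows that foreign uses the label origin.
describe foreign

* "Foreign":origin is replaced by the number that origin maps to "Foreign".
count if foreign == "Foreign":origin
generate byte is_foreign = foreign == "Foreign":origin
tabulate foreign is_foreign
```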

To complement @Dimitriy's answer:
clear all
set more off
sysuse auto
keep foreign weight
describe foreign
label list origin
replace weight = . if foreign == 0
list in 1/15
list in 1/15, nolabel
describe displays the value label associated with a variable. label list can show the content of a particular value label.

I know I'm responding to this post years later, but I wanted to provide a solution that will work for multiple variables in case anybody comes across this.
My task was similar, except that I had to recode every variable that had a "Refused" response stored as a numerical value (8, 9, 99, etc.) to an extended missing value (., .r, .b, etc.). The code used for "Refused" depended on the value label: some variables had it coded as 9, while others had it as 99 or 8.
Version Information
Stata 15.1
Code
foreach v of varlist * {
if `"`: val label `v''"' == "yndkr" {
recode `v' (9 = .r)
}
else if `"`: val label `v''"' == "bw3" {
recode `v' (9 = .r)
}
else if `"`: val label `v''"' == "def_some" {
recode `v' (9 = .r)
}
else if `"`: val label `v''"' == "difficulty5" {
recode `v' (9 = .r)
}
}
You can keep adding as many else if commands as needed. I only showed a chunk of my entire loop, but I hope this demonstrates what needs to be done. If you need to find the name of your value labels, use the command labelbook and it will print them all for you.
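When several value labels share the same "Refused" code, as in the chunk shown, the branches could be collapsed with inlist(). This is only a sketch on made-up data: the variables q1 to q3 and the contents of the labels are hypothetical, and only the label names yndkr and bw3 come from the code above.

```stata
clear
input byte(q1 q2 q3)
1 9 9
9 2 3
end

label define yndkr 1 "Yes" 2 "No" 9 "Refused"
label define bw3 1 "Low" 2 "High" 9 "Refused"
label values q1 yndkr
label values q2 bw3
* q3 has no value label, so the loop leaves it alone.

foreach v of varlist * {
    local lbl : value label `v'
    if inlist("`lbl'", "yndkr", "bw3", "def_some", "difficulty5") {
        recode `v' (9 = .r)
    }
}
list, nolabel
```

Labels that use a different code for "Refused" would still need their own branch.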

Handling string variables inside collapse command

Edit: I should have generated better data. It isn't necessarily the case that the string variable is destringable. I'm just being lazy here (I don't know how to generate random letters).
I have a data set with a lot of strings that I want to collapse, but it seems that in general collapse doesn't play nicely with strings, particularly (firstnm) and (count). Here are some similar data.
clear
set obs 9
generate mark = .
replace mark = 1 in 1
replace mark = 2 in 6
generate name = ""
generate random = ""
local i = 0
foreach first in Tom Dick Harry {
foreach last in Smith Jones Jackson {
local ++i
replace name = "`first' `last'" in `i'
replace random = string(runiform())
}
}
I want to collapse on "mark", which is simple enough with replace and subscripts.
replace mark = mark[_n - 1] if missing(mark)
But my collapses fail with type mismatch errors.
collapse (firstnm) name (count) random, by(mark)
If I use (first), then the first error clears, but (count) still fails. Is there a solution that avoids an additional by operation?
It seems that the following works, but would also be a lot more time-consuming for my data.
generate nonmissing_random = !missing(random)
egen nonmissing_random_count = count(nonmissing_random), by(mark)
collapse (first) name nonmissing_random_count, by(mark)
Or is any solution that facilitates using collapse the same?

You can use destring random,replace and then the following works:
collapse (first) name (count) random, by(mark)
mark   name           random
   1   Tom Smith           5
   2   Dick Jackson        4
But collapse (firstnm) name (count) random, by(mark) still generates a type mismatch error.

Thinking on this some more, my egen count with by operation isn't necessary. I can generate a 1/0 variable for nonmissing/missing string variables then use (sum) in collapse.
generate nonmissing_random = !missing(random)
collapse (first) name (sum) nonmissing_random, by(mark)
