Handling string variables inside collapse command - stata

Edit: I should have generated better data. It isn't necessarily the case that the string variable is destringable. I'm just being lazy here (I don't know how to generate random letters).
I have a data set with a lot of strings that I want to collapse, but it seems that in general collapse doesn't play nicely with strings, particularly (firstnm) and (count). Here are some similar data.
clear
set obs 9
generate mark = .
replace mark = 1 in 1
replace mark = 2 in 6
generate name = ""
generate random = ""
local i = 0
foreach first in Tom Dick Harry {
foreach last in Smith Jones Jackson {
local ++i
replace name = "`first' `last'" in `i'
replace random = string(runiform()) in `i'
}
}
I want to collapse on "mark", which is simple enough with replace and subscripts.
replace mark = mark[_n - 1] if missing(mark)
But my collapses fail with type mismatch errors.
collapse (firstnm) name (count) random, by(mark)
If I use (first), then the first error clears, but (count) still fails. Is there a solution that avoids an additional by operation?
It seems that the following works, but would also be a lot more time-consuming for my data.
generate nonmissing_random = !missing(random)
egen nonmissing_random_count = count(nonmissing_random), by(mark)
collapse (first) name nonmissing_random_count, by(mark)
Or is any solution that facilitates using collapse the same?

You can use destring random,replace and then the following works:
collapse (first) name (count) random, by(mark)
mark  name          random
1     Tom Smith     5
2     Dick Jackson  4
But collapse (firstnm) name (count) random, by(mark) still generates a type mismatch error.

Thinking on this some more, my egen count with by operation isn't necessary. I can generate a 1/0 variable for nonmissing/missing string variables then use (sum) in collapse.
generate nonmissing_random = !missing(random)
collapse (first) name (sum) nonmissing_random, by(mark)
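To make the (sum) trick concrete, here is a minimal sketch on a small made-up data set in which random is a genuine string that cannot be destringed. The example values are my own invention, not the asker's data.

```stata
clear
input mark str12 name str8 random
1 "Tom Smith"   "a"
1 "Tom Jones"   ""
1 "Tom Jackson" "b"
2 "Dick Smith"  "c"
2 "Dick Jones"  "d"
end

* 1 if the string is nonmissing (not empty), 0 otherwise
generate byte nonmissing_random = !missing(random)

* summing the indicator within each group counts the nonmissing strings
collapse (first) name (sum) nonmissing_random, by(mark)
list
```

The indicator is numeric, so collapse never has to apply a numeric statistic to a string variable.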

Related

Syntax issues for making a new variable from integer data

I am currently trying to create my do-file for Australian data. The data input asked for a free-cell textbox participant postcode and I would like to create a new variable that assigns them to states. Stata has recognised the free-cell text as an "int" type, but when I try and make a new variable I get a syntax error. I have included the variations on the value range I have tried.
*Make postcode to states
generate famsur_state = "ACT/NSW" if famsur_postcode=="2000/2999"
replace famsur_state = "SA" if famsur_postcode==(5000/5999)
replace famsur_state = "QLD" if famsur_postcode==4000/4999
replace famsur_state = "NT" if famsur_postcode==">=0000 & <=0999"
replace famsur_state = "WA" if famsur_postcode==>=6000 & <=6999
replace famsur_state = "TAS" if famsur_postcode==>=7000 & <=7999
replace famsur_state = "VIC" if famsur_postcode==>=3000 & <=3999
label var famsur_state "Which state is the participant from?"
label define state 1 "ACT/NSW" ///
2 "SA" ///
3 "QLD" ///
4 "NT" ///
5 "WA" ///
6 "TAS" ///
7 "VIC"
label values famsur_state state
I'm not really sure I understand what you want to do here. If I understand you correctly, you're dealing with panel data of families within states, perhaps at one point in time, perhaps at many points in time. So we have a reference point, maybe tell me a bit more about the context the data are situated in.
Anyways, if this is true, and all you want to do is make a string variable based on the values of your numeric variables... Your issue here is that you've misspecified the variable type of the post code variable. More precisely, you've misspecified the way to delineate the range of codes you're interested in. At present, the way the code above is written, you want to generate the string ACT/NSW if the code is "2000/2999". Stata here is interpreting this as a string. This is to say that Stata would replace the variable with "ACT/NSW" if the contents of the cell were "2000/2999".
One way of tackling this problem is the inrange() function, which returns true only when a value falls within a given range. Here's my code so you can follow along. To see how it works, uncomment the "set tr on" line (short for set trace on) and you'll see each step as it runs.
clear
cls
set obs 4
g famsur_postcode = .
replace famsur_postcode = 2000 in 1
replace famsur_postcode = 2999 in 2
replace famsur_postcode = 3000 in 3
replace famsur_postcode = 3999 in 4
loc states ACT/NSW VIC
loc ranges `""2000,2999" "3000,3999""'
loc n: word count `states'
as `n' ==2
g famsur_state = ""
*set tr on
forv i = 1/`n' {
loc a: word `i' of `states' // Where 1 = "ACT/NSW" and 2 = "VIC"
loc b: word `i' of `ranges' // Where 1 = "2000,2999" and 2 = "3000,3999"
replace famsur_state = "`a'" if inrange(famsur_postcode,`b')
}
br
What I do above is make one macro for the state abbreviations and another for the code ranges. I count how many words there are in the first macro (in this case 2, but it can be any number you like). I then loop over the two macros as parallel lists (see the good documentation on this on the Stata website). This, as you can see, seems to handle the first part correctly.
You also seem to want to assign value labels to these abbreviations. What I would do is use the sencode command to accomplish this (ssc inst sencode, replace), and then recode the remaining values by hand.
sencode famsur_state, replace
recode famsur_state (2=7)
You have here three different guesses at the syntax, all wrong. P1 to P3 are identifiable problems with your syntax.
generate famsur_state = "ACT/NSW" if famsur_postcode=="2000/2999"
P1. This syntax is illegal and so wrong if the postcode variable is integer, as the double quotes imply that it is string.
replace famsur_state = "SA" if famsur_postcode==(5000/5999)
replace famsur_state = "QLD" if famsur_postcode==4000/4999
P2. This syntax is legal if the postcode is integer, but wrong because the slash indicates division, and the result of 5000/5999 and 4000/4999 is in each case not an integer.
replace famsur_state = "NT" if famsur_postcode==">=0000 & <=0999"
P1. Again, the implication of string is illegal.
replace famsur_state = "WA" if famsur_postcode==>=6000 & <=6999
P3. You can say in Stata >= 6000 to mean "greater than or equal to 6000", but & does not work like that. ==>= will not do what you want.
replace famsur_state = "TAS" if famsur_postcode==>=7000 & <=7999
P3. Similar comment.
label var famsur_state "Which state is the participant from?"
label define state 1 "ACT/NSW" ///
2 "SA" ///
3 "QLD" ///
4 "NT" ///
5 "WA" ///
6 "TAS" ///
7 "VIC"
This is legal and possibly even correct, meaning what you want:
generate famsur_state = "ACT/NSW" if inrange(famsur_postcode, 2000, 2999)
This would be legal too but Stata will ignore the leading zeros:
replace famsur_state = "NT" if inrange(famsur_postcode, 0000, 0999)
If what you are seeing is the result of value labels, this could be quite wrong. It is possible that you have integers presenting as e.g. 0000 if the display format is %04.0f, but there is not enough hard information in the question to rule out other possibilities.
The function inrange() will help you but otherwise Stata's syntax here is what is documented at help operators.
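Putting the pieces together, here is a minimal sketch, assuming famsur_postcode really is numeric and taking the postcode ranges exactly as written in the question (I have not checked them against the real postcode allocation). The variable famsur_state_num is a name I made up for the labelled numeric version.

```stata
clear
input int famsur_postcode
2000
5999
4500
800
6000
7999
3000
end

* string variable holding the state, assigned by numeric range
generate str7 famsur_state = ""
replace famsur_state = "ACT/NSW" if inrange(famsur_postcode, 2000, 2999)
replace famsur_state = "VIC"     if inrange(famsur_postcode, 3000, 3999)
replace famsur_state = "QLD"     if inrange(famsur_postcode, 4000, 4999)
replace famsur_state = "SA"      if inrange(famsur_postcode, 5000, 5999)
replace famsur_state = "WA"      if inrange(famsur_postcode, 6000, 6999)
replace famsur_state = "TAS"     if inrange(famsur_postcode, 7000, 7999)
replace famsur_state = "NT"      if inrange(famsur_postcode, 0, 999)

* value labels attach to numeric variables, so encode the string
* using the numbering from the question
label define state 1 "ACT/NSW" 2 "SA" 3 "QLD" 4 "NT" 5 "WA" 6 "TAS" 7 "VIC"
encode famsur_state, generate(famsur_state_num) label(state)
label var famsur_state_num "Which state is the participant from?"
list
```

Because the value label state already exists, encode uses its numbering rather than assigning codes alphabetically.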

Looping through every value

I was trying to run a loop through a variable and was unsure how to code up my thoughts. So, I have variable called newid that goes as
newid
1
1
2
2
3
3
and so on.
foreach x in newid2 {
replace switchers = 1 if doc[_n] != doc[_n+1]
}
I want to modify this code so that this code will run for each two values (in this case run for 1 and 1, 2 and 2). What would be the best way to modify this? Please help me
Something like this can be done with levelsof:
clear
input id str1 doc
1 "A"
1 "B"
2 "A"
3 "C"
3 "A"
end
gen switcher1 = 0
levelsof id
foreach i in `r(levels)' {
quietly tab doc if id==`i'
replace switcher1 = 1 if r(r)>1 & id==`i'
}
However, there are certainly more efficient ways to accomplish your goal. Here's one example that tags ids that switch doctors:
ssc install egenmore
bysort id: egen num_docs = nvals(doc)
generate switcher2 = cond(num_docs>1,1,0)
The underlying idea is the same. You count the number of distinct values of doc for each id. If that number exceeds one, the id is tagged as a switcher. The second version is arguably more efficient since it does not involve looping over each value of id.
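If you would rather not install anything, a by-group comparison gives the same tag with built-in commands only. This is a sketch of a different technique from the two above: sort doc within id, and if the first and last values differ there must be more than one distinct doctor. The name switcher3 is mine.

```stata
clear
input id str1 doc
1 "A"
1 "B"
2 "A"
3 "C"
3 "A"
end

* within each id, sorted by doc: first != last means at least two doctors
bysort id (doc): generate byte switcher3 = doc[1] != doc[_N]
list, sepby(id)
```

Note that a missing doc sorts like any other value here, so decide beforehand whether an empty string should count as a different doctor.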

Is it possible to invoke a global macro inside a function in Stata?

I have a set of variables the list of which I have saved in a global macro so that I can use them in a function
global inlist_cond "amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss"
The reason why they are saved in a macro is because the list will be in a loop and its content will change depending on the year.
What I need to do is to generate a dummy variable so that water_dummy == 1 if any of the variables in the macro list has the WATER classification. In Stata, I need to write
gen water_dummy = inlist("WATER", "$inlist_cond")
, which--ideally--should translate to
gen water_dummy = inlist("WATER", amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss)
But this did not work---the code executed without any errors but the dummy variable only contained 0s. I know that it is possible to invoke macros inside functions in Stata, but I have never tried it when the macro contains a whole list of conditions. Any thoughts?
With a literal string specified, which the double quotes in the generate statement insist on, you are comparing text with text and the comparison is not with the data at all.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen a = "water"
. gen b = "wine"
. gen c = "beer"
. global myvars "a,b,c"
. gen found1 = inlist("water", "$myvars")
. gen found2 = inlist("water", $myvars)
. list
     +---------------------------------------+
     |     a      b      c   found1   found2 |
     |---------------------------------------|
  1. | water   wine   beer        0        1 |
     +---------------------------------------+
The first comparison is equivalent to
. di inlist("water", "a,b,c")
0
which finds no match, as "water" is not matched by the (single!) other argument.
Macro references are certainly allowed within function or command calls: as each macro name is replaced by its contents before the syntax is checked, the function or command never even knows that a macro reference was ever used.
As @Aspen Chen concisely points out, omitting the double quotes gives what you want so long as the inlist() syntax remains legal.
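Applied to variable names like those in the question, the unquoted version looks like the sketch below. The three example observations are made up. One caveat on "legal": as far as I recall, inlist() accepts only a limited number of arguments when they are strings (check help inlist()), so a list as long as the one in the question may need to be split into two calls joined by |.

```stata
clear
input str6(amz2002ras_clss amz2003ras_clss amz2004ras_clss)
"FOREST" "WATER"  "FOREST"
"FOREST" "FOREST" "URBAN"
"WATER"  "WATER"  "WATER"
end

global inlist_cond "amz2002ras_clss, amz2003ras_clss, amz2004ras_clss"

* no quotes around the macro: it expands to a list of variable names
generate byte water_dummy = inlist("WATER", $inlist_cond)
list
```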
If your data structure is something like in the following example, you can try the egen function incss, from egenmore (ssc install egenmore):
clear
set more off
input ///
str15(amz2009 amz2010)
"water" "juice"
"milk" "water"
"lemonade" "wine"
"water & beer" "tea"
end
list
egen watindic = incss(amz*), sub(water)
list
Be aware it searches for substrings (see the result for the last example observation).
A solution with a loop achieving different results is:
gen watindic2 = 0
forvalues i = 2009/2010 {
replace watindic2 = 1 if amz`i' == "water"
}
list
Another solution involves reshape, but I'll leave it at that.

Extract the mean from svy mean result in Stata

I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat villagemean
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then try to turn this macro names into variable
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.
It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
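A stripped-down sketch of just those two extended macro functions, using a made-up list in place of what levelsof would return:

```stata
* stand-in for the list that levelsof would put in the local
local levsex 1 2

* number of words in the list
local wc : word count `levsex'
display "number of levels: `wc'"

* pick out each word in turn by position
forvalues i = 1/`wc' {
    display "level `i' is `: word `i' of `levsex''"
}
```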
Some notes on your original post
Your loop doesn't do what you intend (although again, not crystal clear to me) because you are supplying a list with one element: the literal text names. You can see this by running and comparing:
local names 1 2 3
foreach v in names {
display "`v'"
}
foreach v in `names' {
display "`v'"
}
foreach v of local names {
display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
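To see what : rownames gives you, here is a small sketch on a toy matrix. The matrix A and its row names are invented for illustration; with svy: mean the names would come from the estimation results instead.

```stata
* a 3 x 1 matrix with named rows
matrix A = (10 \ 20 \ 30)
matrix rownames A = north south east

* copy the row names into a local macro as a space-separated list
local names : rownames A
display "`names'"

* loop over the contents of the macro, not over the word "names"
foreach v of local names {
    display "`v'"
}
```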
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, if you have not read the initial chapters of the Stata User's Guide, then you must do so.

How do I take data from one observation and apply it to one other observation within a group?

An unmarried couple is living together in a house with other people. To isolate how much that couple makes I need to add the two incomes together. I am using variables that act as pointers that give the partners_id. Using the partners_id, id , and individual_income how do I apply partner's income to his/her partner?
This was my attempt below:
summarize id, meanonly
capture gen partners_income = 0
forvalue ln = 1/`r(max)' {
bys household (id): ///
egen link_`ln' = total(individual_income) if partners_location==`ln')
replace partners_income = link_`ln' if link_`ln' > 0 & id == `ln'
drop link_*
}
There is general advice in this FAQ.
It can take longer to write a smart way to do this than to use a quick-and-dirty approach.
However, there is a smarter way.
Brute solution
Quick here means relatively quick to code; this isn't guaranteed quick for a very large dataset.
gen partners_income = .
gen problem = 0
The proper initialisation of the partner's income variable is to missing, not zero. Not knowing an income and the income being zero are different conditions. For example, if someone doesn't have a partner, the income will certainly be missing. (If at a later stage, you want to treat missings as zeros, that's up to you, but you should keep them distinct at this stage.)
The reason for the problem variable will become apparent.
I can't see a reason for your capture.
Now we can loop:
quietly forval i = 1/`=_N' {
su individual_income if id == partners_id[`i'], meanonly
replace partners_income = r(max) in `i'
if r(N) > 1 replace problem = r(N) in `i'
}
So, the logic is
foreach observation
find the partner's identifier
find that income: summarize, meanonly is fast
that should be one value, so it should be immaterial whether we pick it up from the results of summarize as the maximum, minimum, or mean
but if summarize finds more than one value, something is not as assumed (mistakes over identifiers, or multiple partners); later we edit if problem and look at those observations.
Notes:
We can make comparison safer by restricting computations to the same household by modifying
if id == partners_id[`i']
to
if id == partners_id[`i'] & household == household[`i']
In one place you have the variable partners_location which looks like a typo for partners_id.
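Here is the brute loop run end to end on a small made-up data set, with the household restriction from the note above included. The five observations are my own invention: one couple, one person without a partner, and a second couple in another household.

```stata
clear
input household id partners_id individual_income
1 1 2 100
1 2 1 250
1 3 . 40
2 4 5 300
2 5 4 0
end

gen partners_income = .
gen problem = 0

quietly forval i = 1/`=_N' {
    * look up the income of whoever observation i names as partner
    su individual_income if id == partners_id[`i'] & household == household[`i'], meanonly
    replace partners_income = r(max) in `i'
    * more than one match means the identifiers are not as assumed
    if r(N) > 1 replace problem = r(N) in `i'
}

list, sepby(household)
```

The person without a partner keeps a missing partners_income, which is distinct from the partner in household 2 whose income really is zero.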
Cute solution
Assuming that partners name each other as partner (and this is not the forum to explore exceptions), then couples have a joint identity which we obtain by sorting "John Joanna" and "Joanna John" to "Joanna John" or the equivalent with numeric identifiers:
gen first = cond(id < partners_id, id, partners_id)
gen second = cond(id < partners_id, partners_id, id)
egen joint = concat(first second), p(" ")
first and second just mean in numeric or alphanumeric order; this works for numeric and string identifiers. You may need to slap on an exclusion clause such as
if !missing(partners_id)
Now
bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2
Get it? Each distinct combination of household and joint should be precisely 2 observations for us to be interested (hence the qualifier if _N == 2). If that's true then 3 - _n gives us the subscript of the other partner as if _n is 1 then 3 - _n is 2 and vice versa. Under by: subscripts are always applied within groups, so that _n runs 1, 2, and so forth in each distinct group.
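A runnable sketch of the dyad approach on a small made-up data set, assuming the variable names from the question (id, partners_id, individual_income, household). The observations are invented for illustration.

```stata
clear
input household id partners_id individual_income
1 1 2 100
1 2 1 250
1 3 . 40
2 4 5 300
2 5 4 0
end

* order each pair of identifiers so both partners get the same key
gen first = cond(id < partners_id, id, partners_id)
gen second = cond(id < partners_id, partners_id, id)
egen joint = concat(first second), p(" ")

* in a group of exactly two, 3 - _n points at the other observation
bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2

sort household id
list household id partners_id individual_income joint partners_income, sepby(household)
```

The person with a missing partners_id ends up alone in a group, so the _N == 2 qualifier leaves that partners_income missing.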
If this seems cryptic, it is all spelled out in Cox, N.J. 2008. The problem of split identity, or how to group dyads. Stata Journal 8(4): 588-591 which is accessible as a .pdf.