Stata xtline overlaid plot for multiple groups

I am attempting to produce an overlaid -xtline- plot that distinguishes between males and females (or any number of groups) by displaying a different plot style for each group. I chose to recast the xtline plot as "connected" and show males with circle markers and females with triangle markers. Taking cues from this question on Statalist, I produced code similar to what is below. When I try this solution, Stata produces the "too many options" error, which is perhaps predictable given the large number of unique persons. I am aware of this solution, which employs combined graphs, but that is also not practical given the large number of unique individuals in my data.
Does a simpler solution to this problem exist? Can Stata overlay multiple -xtline- plots the way it can -twoway- plots?
The code below, using publicly available data from UCLA's excellent Stata guide, shows my basic code and reproduces the error:
use http://www.ats.ucla.edu/stat/stata/examples/alda/data/alcohol1_pp, clear
xtset id age
gsort -male id
qui levelsof id if !male, loc(fidlevs)
qui levelsof id if male, loc(midlevs)
qui levelsof id, loc(alllevs)
tokenize `alllevs'
loc len_f : word count `fidlevs'
loc len_m : word count `midlevs'
loc len_all : word count `alllevs'
loc start_f = `len_all' - `len_f'
forval i = 1/`len_all' {
    if `i' < `start_f' {
        loc m_plot_opt "`m_plot_opt' plot`i'opts(recast(connected) mcolor(black) msize(medsmall) msymbol(circle) lcolor(black) lwidth(medthin) lpattern(solid))"
    }
    else if `i' >= `start_f' {
        loc f_plot_opt "`f_plot_opt' plot`i'opts(recast(connected) mcolor(black) msize(medsmall) msymbol(triangle) lcolor(black) lwidth(medthin) lpattern(solid))"
    }
}
di "xtline alcuse, legend(off) scheme(s1mono) overlay `m_plot_opt' `f_plot_opt'"
xtline alcuse, legend(off) scheme(s1mono) overlay `m_plot_opt' `f_plot_opt'

It is difficult (for me) to separate the programming issue here from statistical or graphical views on what kind of graph works well, or at all. Even with this modest dataset there are 82 distinct identifiers, so any attempt to show them distinctly fails to be useful, if only because the resulting legend takes up most of the real estate.
There is considerable ingenuity in the question code in working through all the identifiers, but a broad-brush approach seems to work as well. Try this:
use http://www.ats.ucla.edu/stat/stata/examples/alda/data/alcohol1_pp, clear
xtset id age
separate alcuse, by(male) veryshortlabel
label var alcuse1 "male"
label var alcuse0 "female"
line alcuse? age, legend(off) sort connect(L)
Key points:
There is nothing very special about xtline. It's just a convenience wrapper. When frustrated by its wired-in choices, people often just reach for line.
To get distinct colours, distinct variables suffice, which is where separate has a role. See also this Tip.
Although the example dataset is well behaved, the extra options sort connect(L) will help in some cases to remove spurious connections between individuals or panels. (In extreme cases, reach for linkplot (SSC).)
This could be fine too:
line alcuse age if male || line alcuse age if !male, legend(order(1 "male" 2 "female")) sort connect(L)
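If the markers matter (the question asked for circles and triangles), the same separate idea should carry over to twoway connected. What follows is only a sketch on a made-up toy panel: the data values are invented for illustration, and sort(id age) is an assumption of this example, chosen so that connect(L) breaks the line whenever age restarts for a new person.

```stata
* Toy panel (made-up values): two males (ids 1, 2) and one female (id 3).
clear
input id age male alcuse
1 14 1 1.0
1 15 1 1.5
1 16 1 2.0
2 14 1 0.0
2 15 1 1.0
2 16 1 1.5
3 14 0 0.5
3 15 0 1.0
3 16 0 2.5
end

* One y variable per group: alcuse0 (female) and alcuse1 (male).
separate alcuse, by(male) veryshortlabel

* Sort within person so connect(L) starts a new line whenever age restarts.
twoway connected alcuse1 alcuse0 age, sort(id age) connect(L L) ///
    msymbol(circle triangle) mcolor(black black) lcolor(black black) ///
    legend(order(1 "male" 2 "female")) scheme(s1mono)
```

Because each group is a single variable, only two sets of plot options are needed however many individuals there are, which sidesteps the "too many options" error.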

Related

Add custom column based on string in another column

Source data:
Market  Platform  Web sales $  Mobile sales $  Insured
FR      iPhone           1323            8709  Y
IT      iPad            12434            7657  N
FR      android           234         2352355  N
IT      android         12323           23434  Y
Is there a way to evaluate the sales of devices that are insured?
if List.Contains({"iPhone","iPad","iPod"},[Platform]) and ([Insured]="Y") then [Mobile sales] else "error"
Something to that extent, just not sure how to approach it
A direct answer to your question is
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    SumUpSales = Table.AddColumn(
        Source,
        "Sales of insured devices",
        each if List.Contains({"iPhone","iPad","iPod"}, _[Platform]) and Text.Upper(_[Insured]) = "Y"
            then _[#"Mobile sales $"]
            else null,
        type number)
in
    SumUpSales
However, I would like to stress a few things.
First, it's better to convert the values in the [Insured] column to boolean first. That way you can catch errors before they corrupt your data without you noticing. My example doesn't do that; all it does is normalize letter case in [Insured], since Power Query M is a case-sensitive language.
Second, you'd better use null rather than the text value "error". Then you can set the column type and do some math with its values, such as summing them up. With mixed text and number values you will get an error in this and many other cases.
Last, it is probably better to use a pivot table for visualizing data like this. You just need to add a column which groups all Apple (and/or other) devices together based on the same logic, but excluding [Insured]. Pivot tables are more flexible, and I personally like them very much.

Counting first and last names appearing more than once

I have the following names:
clear
input str25 names
"Trenton Mercer"
"Carissa Moyer"
"Timothy Delgado"
"Kaylynn Payne"
"Harry Patton"
"Charlie Dudley"
"Harry Schmitt"
"Wyatt Hammond"
"Kasen Delgado"
"Katherine Noble"
"Julius Jarvis"
"Harry Carney"
"Wyatt Holden"
"Megan Wilson"
"Priscilla Shaffer"
"Savanah Marshall"
"Harry Delgado"
"Harper Ballard"
"Harry Mcmahon"
"Alejandro Jarvis"
end
How can I identify which first and last names (separately) come up more than once?
I would also like to count how many times these appear.
Pearly's solution (with split as definitely the best choice for this problem) appears reasonable, but it takes some unnecessary detours. For example, generating the tag, b1, and b2 variables does not seem to be needed.
More important, the final output is not fully consistent: the counts are listed in a seemingly random order, which also differs from the original order, without clear explanation.
So I offer a solution (which surely has defects of its own) that avoids those issues while still producing the output you are looking for.
split names
foreach v in `r(varlist)' {
    egen TotalAppear_`v' = total(`v' != ""), by(`v')
    egen LastAppear_`v' = max(_n), by(`v')
    replace LastAppear_`v' = LastAppear_`v'==_n
    list `v' TotalAppear_`v' if LastAppear_`v' == 1 & TotalAppear_`v' >1
}
Note that your description leads to two assumptions, made both in my code and in Pearly's solution:
Every name has only 2 parts, i.e. first name and last name, so not including any middle name(s).
You just want to compare within each group (each first name among first names, last name among last names), not comparing any one with those from the other group.
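Under the same two assumptions, the counting can also be sketched with by: and _N alone. This is an alternative illustration rather than either answerer's code; the stub part and the variables n_first, n_last, show_first, and show_last are names invented for this example, and only five of the names from the question are used to keep it short.

```stata
clear
input str25 names
"Harry Patton"
"Timothy Delgado"
"Harry Schmitt"
"Wyatt Hammond"
"Kasen Delgado"
end

* part1 = first name, part2 = last name (assumes exactly two words per name).
split names, generate(part)

* Under by:, _N is the number of observations sharing that name.
bysort part1 : gen n_first = _N
bysort part2 : gen n_last = _N

* Flag one observation per repeated name so each is listed once.
bysort part1 : gen byte show_first = (_n == 1 & _N > 1)
bysort part2 : gen byte show_last = (_n == 1 & _N > 1)

list part1 n_first if show_first
list part2 n_last if show_last
```

The lists come out in alphabetical order of the name, which is at least a predictable order.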

How to transfer covariance-like table into one-to-one pairs cleverly in stata?

I have run into a practical data management issue in Stata. I plan to create a variable holding the spherical distances between the 30 provincial capitals of China (so there are roughly 870 pairwise values). There are user-written commands that handle this (through Google Maps), but my problem is that, for confidentiality reasons, the data are stored on an isolated computer that is disconnected from the internet, so I have to define all of the one-to-one distance values in a do-file and then merge them into the data. Given the nontrivial workload (though it is not really infeasible), I wonder if there is some clever way to do the job. I have an Excel worksheet in which the distances are laid out like a covariance matrix, with the provincial capitals' names appearing in both the first row and the first column. It is a lower-triangular matrix, where ' stands for a value:
   A           B          C          D          E                  AD
               capital_1  capital_2  capital_3  capital_4  ······  capital_30
1  capital_1
2  capital_2   '
3  capital_3   '          '
4  capital_4   '          '          '
········
   capital_30  '          '          '          '
I know how to import such a matrix, but can I generate the desired one-to-one pairs? Thank you.
Thanks to @Roberto's helpful suggestion, I have partly solved the problem. I am posting my solution for the benefit of any newcomer who meets a similar problem, and in the hope that the real experts can help me improve my code.
// import data from the Excel file
import excel using distance.xls
*********************************
* rename the variables
* (to organize them so that the -reshape- command is easy to use)
*********************************
* rename the first column
ren A id
* rename the following variables
local i=0
foreach var of varlist B-AF {
    local j=`i'+1
    local i=`i'+1
    ren `var' distance`j'
}
*********************************
* reshape the data
*********************************
reshape long distance,i(id) j(city)
tostring city,replace
*** the following part is really ugly, because I don't know how
* to generate a one-to-one mapping between the original province names and
* the newly generated variable names like distance`j'. I have to recover
* them by hand. I hope someone can help me improve this part **
replace city="北京" if city=="1"
replace city="天津" if city=="2"
replace city="河北" if city=="3"
replace city="山西" if city=="4"
replace city="内蒙古" if city=="5"
replace city="辽宁" if city=="6"
replace city="吉林" if city=="7"
·····
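One way the manual replace statements might be avoided is to remember each row's name before reshaping and then look it up by column number. The sketch below uses a made-up 3 x 3 table: the names capital_1 to capital_3, the distances, and the variables row, col, and city are placeholders for illustration. It assumes the columns appear in the same order as the rows.

```stata
clear
input str10 id distance1 distance2 distance3
"capital_1"   .   .  .
"capital_2" 120   .  .
"capital_3" 300 450  .
end

* Remember the name held in each row before reshaping.
local n = _N
forvalues k = 1/`n' {
    local name`k' = id[`k']
}

gen long row = _n
reshape long distance, i(row) j(col)

* Column k of the matrix refers to the capital named in row k.
gen str10 city = ""
forvalues k = 1/`n' {
    replace city = "`name`k''" if col == `k'
}

* Keep only the filled lower-triangle cells: one observation per pair.
drop if missing(distance)
list id city distance
```

With 30 capitals the loops run 30 times each, so nothing has to be typed by hand as long as the row and column orders agree.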

Extract the mean from svy mean result in Stata

I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat village
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then I try to turn this macro names into a variable:
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from the Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.
It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
    replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
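A small stand-alone illustration of the two functions, written in the unabbreviated form that assigns each result to a local macro first. The macro names fruits, wc, and w and the word list are hypothetical, chosen only for this example.

```stata
* Hypothetical list of words held in a local macro.
local fruits apple pear plum

* word count: how many space-separated words the list holds.
local wc : word count `fruits'

* word # of: pick out the i-th word.
forvalues i = 1/`wc' {
    local w : word `i' of `fruits'
    display "word `i' of `wc' is `w'"
}
```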
Some notes on your original post
Your loop doesn't do what you intend (although, again, that is not crystal clear to me) because you are supplying a list with one element: the literal text names. You can see this by running and comparing:
local names 1 2 3
foreach v in names {
    display "`v'"
}
foreach v in `names' {
    display "`v'"
}
foreach v of local names {
    display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
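To see what : rownames returns, it may help to try it on a tiny matrix. In this sketch the matrix A and its row names north and south are made up for illustration.

```stata
matrix A = (1, 2 \ 3, 4)
matrix rownames A = north south

* Copy the row names of A into the local macro names.
local names : rownames A
display "`names'"

* Loop over the contents of the macro, not over the literal word names.
foreach v of local names {
    display "`v'"
}
```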
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, if you have not read the initial chapters of the Stata User's Guide, then you must do so.

How do I take data from one observation and apply it to one other observation within a group?

An unmarried couple is living together in a house with other people. To isolate how much that couple makes, I need to add the two incomes together. I am using variables that act as pointers and give the partners_id. Using partners_id, id, and individual_income, how do I assign each person's income to his or her partner?
This was my attempt below:
summarize id, meanonly
capture gen partners_income = 0
forvalue ln = 1/`r(max)' {
bys household (id): ///
egen link_`ln' = total(individual_income) if partners_location==`ln')
replace partners_income = link_`ln' if link_`ln' > 0 & id == `ln'
drop link_*
}
There is general advice in this FAQ.
It can take longer to write a smart way to do this than to use a quick-and-dirty approach.
However, there is a smarter way.
Brute solution
Quick here means relatively quick to code; this isn't guaranteed quick for a very large dataset.
gen partners_income = .
gen problem = 0
The proper initialisation of the partner's income variable is to missing, not zero. Not knowing an income and the income being zero are different conditions. For example, if someone doesn't have a partner, the income will certainly be missing. (If at a later stage, you want to treat missings as zeros, that's up to you, but you should keep them distinct at this stage.)
The reason for the problem variable will become apparent.
I can't see a reason for your capture.
Now we can loop:
quietly forval i = 1/`=_N' {
    su individual_income if id == partners_id[`i'], meanonly
    replace partners_income = r(max) in `i'
    if r(N) > 1 replace problem = r(N) in `i'
}
So, the logic is
foreach observation
find the partner's identifier
find that income: summarize, meanonly is fast
that should be one value, so it should be immaterial whether we pick it up from the results of summarize as the maximum, minimum, or mean
but if summarize finds more than one value, something is not as assumed (mistakes over identifiers, or multiple partners); later we edit if problem and look at those observations.
Notes:
We can make comparison safer by restricting computations to the same household by modifying
if id == partners_id[`i']
to
if id == partners_id[`i'] & household == household[`i']
In one place you have the variable partners_location which looks like a typo for partners_id.
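For anyone who wants to try the loop before running it on real data, here is a sketch on a made-up dataset of two households. The values are invented for illustration, and the household restriction from the note above is included. The person with id 3 has no partner and so should end up with a missing partner's income.

```stata
* Made-up data: ids 1 and 2 are partners, as are ids 4 and 5; id 3 has no partner.
clear
input household id partners_id individual_income
1 1 2 100
1 2 1 250
1 3 .  80
2 4 5  60
2 5 4  90
end

gen partners_income = .
gen problem = 0

quietly forval i = 1/`=_N' {
    * Look up the partner's income within the same household.
    su individual_income if id == partners_id[`i'] & household == household[`i'], meanonly
    replace partners_income = r(max) in `i'
    * More than one match means the identifiers are not as assumed.
    if r(N) > 1 replace problem = r(N) in `i'
}

list
```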
Cute solution
Assuming that partners name each other as partner (and this is not the forum to explore exceptions), then couples have a joint identity which we obtain by sorting "John Joanna" and "Joanna John" to "Joanna John" or the equivalent with numeric identifiers:
gen first = cond(id < partners_id, id, partners_id)
gen second = cond(id < partners_id, partners_id, id)
egen joint = concat(first second), p(" ")
first and second just mean first and second in numeric or alphanumeric order; this works for numeric and string identifiers. You may need to slap on an exclusion clause such as
if !missing(partners_id)
Now
bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2
Get it? Each distinct combination of household and joint should be precisely 2 observations for us to be interested (hence the qualifier if _N == 2). If that's true then 3 - _n gives us the subscript of the other partner as if _n is 1 then 3 - _n is 2 and vice versa. Under by: subscripts are always applied within groups, so that _n runs 1, 2, and so forth in each distinct group.
If this seems cryptic, it is all spelled out in Cox, N.J. 2008. The problem of split identity, or how to group dyads. Stata Journal 8(4): 588-591 which is accessible as a .pdf.
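The dyad approach can be checked on a small made-up dataset. The sketch below uses the question's variable names (partners_id, individual_income) and invented values. The person with id 3 has no partner, so the exclusion clause leaves that income missing.

```stata
* Made-up data: ids 1 and 2 are partners, as are ids 4 and 5; id 3 has no partner.
clear
input household id partners_id individual_income
1 1 2 100
1 2 1 250
1 3 .  80
2 4 5  60
2 5 4  90
end

* Order each pair of identifiers so both partners get the same joint key.
gen first = cond(id < partners_id, id, partners_id)
gen second = cond(id < partners_id, partners_id, id)
egen joint = concat(first second), punct(" ")

* Within a two-person group, 3 - _n points at the other observation.
bysort household joint : gen partners_income = individual_income[3 - _n] ///
    if _N == 2 & !missing(partners_id)

list household id partners_id individual_income partners_income
```

The couple's combined income would then be individual_income + partners_income wherever partners_income is not missing.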