Scatter plot color by variable - stata

I want to make an scatter plot in Stata with points colored according to a categorical variable.
The only way I've found to do this, is to code colors in layers of a twoway plot.
However, this seems a rather convoluted solution for such a simple operation:
twoway (scatter latitud longitud if nougrups4 ==1, mcolor(black)) ///
(scatter latitud longitud if nougrups4 ==2, mcolor(blue)) ///
(scatter latitud longitud if nougrups4 ==3, mcolor(red)) ///
(scatter latitud longitud if nougrups4 ==4, mcolor(green))
Is there a simpler and automatic way to do this?
In this case, the categorical variable nougrups4 came from a cluster analysis. A general solution would be fine, but also a specific solution to draw clusters.

This is how I would do this by hand:
sysuse auto, clear
separate price, by(rep78)
tw scatter price? mpg
drop price?
Or in one line using Nick Cox's sepscatter command from SSC:
sepscatter price mpg, separate(rep78)
The latter command can also output other type of plots with the recast() option.

There isn't a 'simpler' built-in solution for what you want to do.
However, here's a simple wrapper command, which you can extend to meet your needs:
capture program drop foo
program define foo
syntax varlist(min=1 max=3)
quietly {
tokenize `varlist'
levelsof `3', local(foolevels)
local i = 0
local foocolors red green blue
foreach x of local foolevels {
local ++i
local extra `extra' || scatter `1' `2' if `3' == `x', mcolor("`: word `i' of `foocolors''")
}
twoway `extra'
}
end
And a toy example:
clear
set obs 10
generate A = runiform()
generate B = runiform()
generate C = .
replace C = 1 in 1/3
replace C = 2 in 4/7
replace C = 3 in 8/10
foo A B C

Related

Stata coefplot: plot coefficients and corresponding confidence intervals on 2nd axis

When trying to depict two coefficients from one regression on separate axes with Ben Jann's superb coefplot (ssc install coefplot) command, the coefficient to be shown on the 2nd axis is correctly displayed, but its confidence interval is depicted on the 1st scale.
Can anyone explain how I get the CI displayed on the same (2nd) axis as the coefficient it belongs to? I couldn't find any option to change this - and imagine it should be the default, if not the only, option to plot the CI around the point estimate it belongs to.
I use the latest coefplot version with Stata 16.
Here is a minimum example to illustrate the problem:
results plot
webuse union, clear
eststo results: reg idcode i.union grade
coefplot (results, keep(1.union)) (results, keep(grade) xaxis(2))
In the line
coefplot (results, keep(1.union)) (results, keep(grade) xaxis(2))
you specify the option xaxis(2), but this is not a documented option of coefplot, although it is a valid option of twoway rspike which is called by coefplot. Apparently, if you use xaxis(2) something goes wrong with the communication between coefplot and rspike.
This works for me:
coefplot (results, keep(1.union)) (results, keep(grade) axis(2))
I'm trying to create something similar. Since this option is not built-in we need to write a program to tweak how coefplot works. I'm sharing the code from the user manual here: http://repec.sowi.unibe.ch/stata/coefplot/markers.html
capt program drop coefplot_mlbl
*! version 1.0.0 10jun2021 Ben Jann
program coefplot_mlbl, sclass
_parse comma plots 0 : 0
syntax [, MLabel(passthru) * ]
if `"`mlabel'"'=="" local mlabel mlabel(string(#b, "%5.2f") + " (" + string(#ll, "%5.2f") + "; " + string(#ul, "%5.2f") + ")")
preserve
qui coefplot `plots', `options' `mlabel' generate replace nodraw
sreturn clear
tempvar touse
qui gen byte `touse' = __at<.
mata: st_global("s(mlbl)", ///
invtokens((strofreal(st_data(.,"__at","`touse'")) :+ " " :+ ///
"`" :+ `"""' :+ st_sdata(.,"__mlbl","`touse'") :+ `"""' :+ "'")'))
sreturn local plots `"`plots'"'
sreturn local options `"`options'"'
end
capt program drop coefplot_ymlbl
*! version 1.0.0 10jun2021 Ben Jann
program coefplot_ymlbl
_parse comma plots 0 : 0
syntax [, MLabel(str asis) * ]
_parse comma mlspec mlopts : mlabel
local mlopts = substr(`"`mlopts'"', 2, .) // remove leading comma
if `"`mlspec'"'!="" local mlabel mlabel(`mlspec')
else local mlabel
coefplot_mlbl `plots', `options' `mlabel'
coefplot `plots', ///
yaxis(1 2) yscale(alt) yscale(axis(2) alt noline) ///
ylabel(none, axis(2)) yti("", axis(2)) ///
ymlabel(`s(mlbl)', axis(2) notick angle(0) `mlopts') `options'
end
coefplot_ymlbl D F, drop(_cons) xline(0)
However, the above program does not allow for the option 'bylabel'. I get a stata error saying "bylabel not allowed". I wanted to ask if there is a way to edit this code and include the bylabel option which is used to label subplots?

insert variable value label for graph title

I can't quite grasp how to insert a variable's value labels for titles to a graph.
For example, in sysuse auto, the variable foreign takes the value of 0 or 1 where 0 is labeled "Domestic" and 1 is labeled "Foreign".
In the following snippet, I want to plot the average price for each category of the variable foreign using a loop:
sysuse auto, clear
forvalues i=0/1{
local t = foreign[`i']
graph bar (mean) price if foreign == `i', ///
over(rep78, sort(price) descending) asyvars ///
title("`t'") name(p_`i', replace) nodraw
local graphs `graphs' p_`i'
}
gr combine `graphs'
but it does not even display the category value correctly in the title.
What am I doing wrong?
Your code
local t = foreign[`i']
sets the local macro t to the value of the variable foreign first in observation 0 and then in obseration 1: these will be missing and 0, respectively.
What you want is the value label corresponding to the values 0 and 1, which you can obtain with
local t : label (foreign) `i'
Swap this into your code and your graphs will be labelled Domestic and Foreign, respectively.
The syntax of the replacement command may be unfamiliar; macro "extended functions" are described in help extended_fcn.
Note that either of these graph commands
sysuse auto, clear
graph bar (mean) price , ///
over(rep78, sort(price) descending) asyvars over(foreign)
graph bar (mean) price , ///
over(rep78, sort(price) descending) asyvars by(foreign)
uses value labels automatically and produces a combined graph directly. This may not be the main question, but the original code is not a good solution on those grounds.

Graph evolution of quantile non-linear coefficient: can it be done with grqreg? Other options?

I have the following model:
Y_{it} = alpha_i + B1*weight_{it} + B2*Dummy_Foreign_{i} + B3*(weight*Dummy_Foreign)_ {it} + e_{it}
and I am interested on the effect on Y of weight for foreign cars and to graph the evolution of the relevant coefficient across quantiles, with the respective standard errors. That is, I need to see the evolution of the coefficients (B1+ B3). I know this is a non-linear effect, and would require some sort of delta method to obtain the variance-covariance matrix to obtain the standard error of (B1+B3).
Before I delve into writing a program that attempts to do this, I thought I would try and ask if there is a way of doing it with grqreg. If this is not possible with grqreg, would someone please guide me into how they would start writing a code that computes the proper standard errors, and graphs the quantile coefficient.
For a cross section example of what I am trying to do, please see code below.
I use grqred to generate the evolution of the separate coefficients (but I need the joint one)-- One graph for the evolution of (B1+B3) with it's respective standard errors.
Thanks.
(I am using Stata 14.1 on Windows 10):
clear
sysuse auto
set scheme s1color
gen gptm = 1000/mpg
label var gptm "gallons / 1000 miles"
gen weight_foreign= weight*foreign
label var weight_foreign "Interaction weight and foreign car"
qreg gptm weight foreign weight_foreign , q(.5)
grqreg weight weight_foreign , ci ols olsci reps(40)
*** Question 1: How to constuct the plot of the coefficient of interest?
Your second question is off-topic here since it is statistical. Try the CV SE site or Statalist.
Here's how you might do (1) in a cross section, using margins and marginsplot:
clear
set more off
sysuse auto
set scheme s1color
gen gptm = 1000/mpg
label var gptm "gallons / 1000 miles"
sqreg gptm c.weight##i.foreign, q(10 25 50 75 95) reps(500) coefl
margins, dydx(weight) predict(outcome(q10)) predict(outcome(q25)) predict(outcome(q50)) predict(outcome(q75)) predict(outcome(q95)) at(foreign=(0 1))
marginsplot, xdimension(_predict) xtitle("Quantile") ///
legend(label(1 "Domestic") label(2 "Foreign")) ///
xlabel(none) xlabel(1 "Q10" 2 "Q25" 3 "Q50" 4 "Q75" 5 "Q95", add) ///
title("Marginal Effect of Weight By Origin") ///
ytitle("GPTM")
This produces a graph like this:
I didn't recast the CI here since it would look cluttered, but that would make it look more like your graph. Just add recastci(rarea) to the options.
Unfortunately, none of the panel quantile regression commands play nice with factor variables and margins. But we can hack something together. First, you can calculate the sums of coefficients with nlcom (instead of more natural lincom, which the lacks the post option), store them, and use Ben Jann's coefplot to graph them. Here's a toy example to give you the main idea where we will look at the effect of tenure for union members:
set more off
estimates clear
webuse nlswork, clear
gen tXu = tenure*union
local quantiles 1 5 10 25 50 75 90 95 99 // K quantiles that you care about
local models "" // names of K quantile models for coefplot to graph
local xlabel "" // for x-axis labels
local j=1 // counter for quantiles
foreach q of numlist `quantiles' {
qregpd ln_wage tenure union tXu, id(idcode) fix(year) quantile(`q')
nlcom (me_tu:_b[tenure]+_b[tXu]), post
estimates store me_tu`q'
local models `"`models' me_tu`q' || "'
local xlabel `"`xlabel' `j++' "Q{sub:`q'}""'
}
di "`models'
di `"`xlabel'"'
coefplot `models' ///
, vertical bycoefs rescale(100) ///
xlab(none) xlabel(`xlabel', add) ///
title("Marginal Effect of Tenure for Union Members On Each Conditional Quantile Q{sub:{&tau}}", size(medsmall)) ///
ytitle("Wage Change in Percent" "") yline(0) ciopts(recast(rcap))
This makes a dromedary curve, which suggests that the effect of tenure is larger in the middle of the wage distribution than at the tails:

How to create bar charts with multiple bar labels in Stata

I'm trying to create a bar chart in which the frequency is outside the bar and the percentage inside, is it possible? Would post a picture but the system doesn't allow for it yet.
As others pointed out, this is a poor question without code.
It is possible to guess that you are using graph bar. That makes you choose at most one kind and position of bar labels. Much more is possible with twoway bar so long as you do a little work.
sysuse auto, clear
contract rep78 if rep78 < .
su _freq
gen _pc = 100 * _freq / r(sum)
gen s_pc = string(_pc, "%2.1f") + "%"
gen one = 1
twoway bar _freq rep78, barw(0.9) xla(1/5, notick) bfcolor(none) ///
|| scatter one _freq rep78, ms(none ..) mla(s_pc _freq) mlabcolor(black ..) ///
mlabpos(0 12) scheme(s1color) ysc(r(0 32)) yla(, ang(h)) legend(off)
In short:
contract collapses to a dataset of frequencies.
Calculation of percents is trivial, but you need a formatted version in a string variable if the labels are not to look silly. Precise format is at choice.
The frequency scale on the axis is arguably redundant given the bar labels, and could be omitted.
The example puts labels within the bar just above its base at the level of frequency equal to 1. That's a choice for this example and would be too close to the axis if the typical frequencies were much higher.

Stata output files in surveys

I have some survey data which I'm using Stata to analyze. I want to compute means of one variable by group and save those means to a Stata file. My code looks like this:
svyset [iw=wtsupp], sdrweight(repwtp1-repwtp160) vce(sdr)
svy: mean x
I tried
svy: by grp: mean x
but that did not work. I could save each mean to a separate file by simply saying
svy: mean x if grp==1
but that's inefficient. Is there a better way?
Saving results to a file like one can use SAS ODS to capture results is also a need. I am not talking about the log here. I need the means and the associated group. I'm thinking
estimates save [path],replace
but I'm not sure if that will give me a Stata file or the group if I can figure out how to use by processing.
Here's a simpler approach that creates a data set of the displayed estimation results: estimated means, standard errors, confidence limits, z statistics, and p-values. svy: mean is called with the over() option, which does away with the need for the foreach loop and computes standard errors appropriate for subpopulation analysis. The estimation results are contained in the returned matrix r(table), which is converted by the svmat command to a Stata data set. While svmat maintains column names, it does not preserve row (group) names, so it is necessary to merge these in to the created data set.
set more off
use http://www.stata-press.com/data/r13/ss07ptx, clear
svyset _n [pw= pwgtp], sdrweight(pwgtp*) vce(sdr)
************************************************ *
* Set name of grouping variable in double quotes *
* in the next line. *
* ************************************************
local gpname "sex"
tempvar gp
egen `gp' = group(`gpname')
preserve
tempfile t1
bys `gp': keep if _n==1
keep `gp' `gpname'
save `t1'
restore
svy: mean agep , over(`gp')
matrix a = r(table)'
clear
qui svmat double a, names(col)
gen `gp'=_n
merge 1:1 `gp' using `t1'
keep `gpname' b se z pvalue ll ul
order `gpname'
save results, replace
list
Edited 10/28
This version contains legibility improvements and the outcome variable and saved datasets are specified in a local macro. Therefore the analyst need not touch the foreach block. Easier to write and read matrix subscript expressions are used instead of the el matrix function: thus m[1,1] instead of el("m",1,1).
sysuse auto, clear
svyset _n
************************************************ *
* Set names of grouping variable and results data *
* set in double quotes in the next line. *
* ************************************************
local yvar mpg // variable for mean
local gpname "foreign"
local d_results "results"
tempvar gp
gen `gp' = `gpname'
tempname memhold
postfile `memhold' ///
`gpname' n mean se sd using `d_results', replace
levelsof `gp', local(lg)
foreach x of local lg{
svy, subpop(if `gp'==`x'): mean `yvar'
matrix m = e(b)
matrix v = e(V)
matrix a = e(V_srssub)
matrix b = e(_N_subp)
matrix c = e(_N)
scalar gx = `x'
scalar mean = m[1,1]
scalar sem = sqrt(v[1,1])
scalar sd = sqrt(b[1,1]*a[1,1])
scalar n = c[1,1]
post `memhold' (gx) (n) (mean) (sem) (sd)
}
postclose `memhold'
use results, clear
list