Is there a way to create a bar graph with confidence intervals in Stata?

I am trying to produce a nice bar graph with confidence intervals. It looks pretty tedious, but I found this method here. I have one y-variable (info_avoiding_ba) and one dummy (T_Meat). When I run the following code, everything works up to the last line, where Stata reports "too few variables specified". Can anyone help? Cheers!
collapse (mean) meaninfo_avoiding_ba= info_avoiding_ba (sd) sdinfo_avoiding_ba=info_avoiding_ba (count) n=info_avoiding_ba, by(T_Meat)
generate hiwrite = meaninfo_avoiding_ba + invttail(n-1,0.025)*(sdinfo_avoiding_ba / sqrt(n))
generate lowrite = meaninfo_avoiding_ba - invttail(n-1,0.025)*(sdinfo_avoiding_ba / sqrt(n))
graph twoway (bar meaninfo_avoiding_ba) (rcap hiwrite lowrite), by(T_Meat)
Also, does anyone know how to add the significance value to the figure (above the bars, for example)?
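A likely cause of the error: twoway bar expects a y variable and an x variable, and twoway rcap expects two y variables and an x variable, but the last line gives neither plot an x variable. In addition, by(T_Meat) would draw one panel per group rather than bars side by side. The following is a minimal sketch of one way to do it, not a definitive solution. It uses the auto dataset shipped with Stata, with mpg and foreign standing in for info_avoiding_ba and T_Meat; the short names m, s, n, hi and lo, and the placement of the p-value label, are assumptions made for illustration.

```stata
sysuse auto, clear

* two-sample t test; r(p) holds the two-sided p-value
ttest mpg, by(foreign)
local p : display %5.3f r(p)

* one row per group: mean, sd and count, then a 95% confidence interval
collapse (mean) m = mpg (sd) s = mpg (count) n = mpg, by(foreign)
generate hi = m + invttail(n - 1, 0.025) * (s / sqrt(n))
generate lo = m - invttail(n - 1, 0.025) * (s / sqrt(n))

* put the label slightly above the tallest interval
summarize hi, meanonly
local ytext = r(max) + 1

* both plots get the group dummy as their x variable
twoway (bar m foreign, barwidth(0.6)) (rcap hi lo foreign), ///
    xlabel(0 "Domestic" 1 "Foreign") yscale(range(0 `ytext')) ///
    ytitle("Mean mpg") legend(off) text(`ytext' 0.5 "p = `p'")
```

The /// line continuations only work in a do-file. With the original variables, the last line would become twoway (bar meaninfo_avoiding_ba T_Meat) (rcap hiwrite lowrite T_Meat).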

Related

How to get y axis range in Stata

Suppose I am using some twoway graph command in Stata. Without any action on my part, Stata will choose some reasonable values for the ranges of both y and x axes, based both upon the minimum and maximum y and x values in my data, and upon some algorithm that decides when it would be prettier for the range to extend to a number like '0' instead of '0.0139'. Wonderful! Great.
Now suppose that after (or while) I draw my graph, I want to slap some very important text onto it, and I want to be choosy about precisely where the text appears. Having the minimum and maximum values of the displayed axes would be useful: how can I get these min and max numbers? (Either before or while calling the graph command.)
NB: I am not asking how to set the y or x axis ranges.
This issue has been a bit of a headache for me for quite some time, and I believe there is no good solution out there yet, so I wanted to write up two ways in which I was able to solve a problem similar to the one described in the post. Specifically, I used them to shade part of a graph in gray.
Define a global macro in the code generating the axis labels. This is the less elegant way to do it, but it works well. Locate the tickset_g.class file in your ado path. The graph twoway command uses this to draw the axes of any graph. There, I defined a global macro in the draw program that takes the value of the omin and omax locals after they have been set to the minimum of the axis range and the data range (the command that does this is local omin = min(.scale.min,omin), and analogously for the max), since the latter sometimes exceeds the former. You could also define the global further up in that code block to get only the axis extent. You can then access the axis range using the globals after the graph command (and use something like addplot to add to the previously drawn graph). Two caveats for this approach. First, using global macros is, as far as I understand, bad practice and can be dangerous; I used names with the prefix userwritten, which I was sure no other program would use. Second, depending on your organization's policies, you may not have the administrator privileges needed to alter this file. However, it is the simpler way. If you prefer a more elegant approach along the lines of what Nick Cox suggested, then you can:
Use the undocumented gdi natscale command to define your own axis labels. The gdi commands are the internal commands that are used to generate what you see as graph output (cf. https://www.stata.com/meeting/dcconf09/dc09_radyakin.pdf). The tickset_g.class uses the gdi natscale command to generate the nice numbers of the axes. Basic documentation is available with help _natscale. Basically, you enter the minimum and maximum (e.g. from a summarize return) and a suggested number of steps, and the command returns a min, max, and delta to be used in the x|ylabel option (there are several possible ways, all rather straightforward once you have those numbers, so I won't spell them out for brevity). You would have to adjust this approach if you use some scale transformation.
Hope this helps!
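A rough sketch of the second approach, for readers who want to see it spelled out. It assumes that _natscale takes three numbers (minimum, maximum, suggested number of steps) and leaves its results in r(min), r(max) and r(delta), as help _natscale describes; since the command is undocumented, check this on your installation. The auto dataset, the local macro names and the x position 3000 are illustrative choices only.

```stata
sysuse auto, clear

* data range of the y variable
summarize mpg, meanonly
local dmin = r(min)
local dmax = r(max)

* ask for "nice" label values covering roughly five steps
_natscale `dmin' `dmax' 5
local lmin  = r(min)
local lmax  = r(max)
local lstep = r(delta)

* the data can fall outside the label range, so take the wider of the two
local ylo = min(`lmin', `dmin')
local yhi = max(`lmax', `dmax')

* use the numbers both for the labels and for placing text
twoway scatter mpg weight, ylabel(`lmin'(`lstep')`lmax') ///
    text(`yhi' 3000 "placed at the top of the y range", placement(s))
```

Setting the labels explicitly means the displayed range is known before the graph is drawn, which is what makes the text placement predictable.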
I like Nick's suggestion, but if you're really determined, it seems that you can find these values by inspecting the output after you set trace on. Here's some inefficient code that seems to do exactly what you want. Three notes:
when I import the log file I get this message:
Note: Unmatched quote while processing row XXXX; this can be due to a formatting problem in the file or because a quoted data element spans multiple lines. You should carefully inspect your data after importing. Consider using option bindquote(strict) if quoted data spans multiple lines or option bindquote(nobind) if quotes are not used for binding data.
Sometimes the data fall outside of the min and max range values that are chosen for the graph's axis labels (but you can easily test for this).
The log linesize is actually important to my code below because the key values must fall on the same line as the strings that I use to identify the helpful rows.
* start a log (critical step for my solution)
cap log close _all
set linesize 255
log using "log", replace text
* make up some data:
clear
set obs 3
gen xvar = rnormal(0,10)
gen yvar = rnormal(0,.01)
* turn trace on, run the -twoway- call, and then turn trace off
set trace on
twoway scatter yvar xvar
set trace off
cap log close _all
* now read the log file in and find the desired info
import delimited "log.log", clear
egen my_string = concat(v*)
keep if regexm(my_string,"forvalues yf") | regexm(my_string,"forvalues xf")
drop if regexm(my_string,"delta")
split my_string, parse("=") gen(new)
gen axis = "vertical" if regexm(my_string,"yf")
replace axis = "horizontal" if regexm(my_string,"xf")
keep axis new*
duplicates drop
loc my_regex = "(.*[0-9]+)\((.*[0-9]+)\)(.*[0-9]+)"
gen min = regexs(1) if regexm(new3,"`my_regex'")
gen delta = regexs(2) if regexm(new3,"`my_regex'")
gen max_temp= regexs(3) if regexm(new3,"`my_regex'")
destring min max_temp delta, replace
gen max = min + delta* int((max_temp-min)/delta)
*here is the info you want:
list axis min delta max

Recoding 0-3 values

I have a speech data set. Here is how it is coded now:
Hypernasality (0-3)
Speech understandability (0-3)
Speech Acceptability (0-3)
where 0 is good and 3 is a severe deviation from normal speech.
Hyponasality (0 and 1)
Audible Air Emission (0 and 1)
Where 0 is none and 1 is yes
I recoded my data this way:
foreach j in speechunderstandibility speechacceptability hypernasality {
recode `j' (0 = 3) (3 = 0) (1 = 2) (2 = 1), gen (`j'_1)
}
foreach j in hyponasality audibleemission {
recode `j' (0 = 1) (1 = 0), gen (`j'_1)
}
However, when I run my regression it gives me counter-intuitive results.
My dependent variable is speech outcome and beta of interest is cleft severity.
Results after recoding would say: "Cleft severity improves speech but cleft surgery decreases it".
If I leave the data the way they are coded, the five outcomes mentioned above do not all point in the same direction.
I need them to go in one direction so I can build a summary index.
It may be a raw data issue. I would check that all the data points were entered correctly, because the 0-3 values may have gotten mixed up at some stage of data entry.
Secondly, if you're really sure of the data entry (this sounds like a data entry issue to me, or like Nick Cox says, a data interpretation issue), then perhaps try using "gen" and "replace" commands to recode your variables inside or outside of a loop.
When I have trouble with a looped command, I dissect each part and raw code it.
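As an illustration of writing the recode out by hand, here is a small sketch on made-up data (the four rows below are invented purely for the example). Reversing a 0-3 scale is the same as subtracting from 3, and reversing a 0/1 dummy is the same as subtracting from 1, so each step can be checked with assert and a cross-tabulation.

```stata
clear
* toy data, invented for illustration only
input hypernasality hyponasality
0 0
1 1
2 0
3 1
end

* reverse the 0-3 scale and the 0/1 dummy without a loop
generate hypernasality_1 = 3 - hypernasality if !missing(hypernasality)
generate hyponasality_1  = 1 - hyponasality  if !missing(hyponasality)

* every old/new pair must sum to the scale maximum
assert hypernasality + hypernasality_1 == 3 if !missing(hypernasality)
assert hyponasality + hyponasality_1 == 1 if !missing(hyponasality)

* eyeball the mapping: only the anti-diagonal should be filled
tabulate hypernasality hypernasality_1
```

If the asserts pass on the real variables and the regression signs are still counter-intuitive, the recoding itself is probably not the culprit.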

SPSS- how to make the histogram template refer to the y axis as percentage

I have an odd issue regarding the use of Chart Templates in SPSS (version 20), and any help will be appreciated.
I used the GUI to manually define a chart template for histograms. These are simple definitions:
1) set the x axis between 0 to 100.
2) set the y axis as percent and not as actual number of examples within each bin.
3) set the bin sizes to 5.
4) set the maximal value of the y axis to 20.
I saved the template using the File->Save ChartTemplate option after changing the definitions of one histogram.
Oddly, when I apply the template to a new histogram, only definitions 1, 3 and 4 are applied while 2 is omitted. I searched for a solution and did not find any. This is extremely frustrating, since I need to waste time and effort manually resetting the axis to the right definition for every new histogram I make (which is a lot :/ ).
There might be a way to hack the template code using notepad but I did not see any mention of the Y axis there.
Any help or comments would be much appreciated.
I can't say offhand how to set up a template to do any of those aspects, but here is an example using syntax to specify those four options.
SET SEED 10.
INPUT PROGRAM.
LOOP #i = 1 TO 500.
COMPUTE Var = RV.UNIFORM(0,90).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Sim.
FORMATS Var (F3.0).
EXECUTE.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Var MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Var=col(source(s), name("Var"))
GUIDE: axis(dim(1), label("Var"), delta(5))
GUIDE: axis(dim(2), label("Percent in Bin"))
SCALE: linear(dim(1), min(0), max(100))
SCALE: linear(dim(2), max(20))
ELEMENT: interval(position(summary.percent.count(bin.rect(Var, binWidth(5)), base.all(acrossPanels()))))
END GPL.
And this is what the graph looks like for me (with my default chart template) in V25.

Calculating p-value by hand from Stata table

I want to ask how to compute the p-value without a t-stat table, just by looking at the regression table, like the one on the first page of the pdf at the following link: http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . For example, if I don't know the value 0.062, how can I find it from the other information in the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less number of parameters), while the second, 1.92, is the absolute value of the coefficient of interest divided by its standard error, or the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored DoF with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
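If only the printed table is available (no estimates in memory), the same calculation can be sketched from the two numbers read off the table. The scalar names below are made up for the example, and the second line of arithmetic assumes a Stata version that provides the cumulative t() function. If the table shows a coefficient and standard error rather than the t-stat, divide the former by the latter first.

```stata
* numbers read off the regression table in the example above
scalar tstat = 1.92
scalar dof   = 38

* two-sided p-value from the upper tail
scalar p_tail = 2 * ttail(dof, abs(tstat))

* the same quantity as 1 minus the area between -|t| and |t|
scalar p_cdf = 1 - (t(dof, abs(tstat)) - t(dof, -abs(tstat)))

display "two-sided p-value via ttail(): " %9.7f p_tail
display "two-sided p-value via t():     " %9.7f p_cdf
```

Both lines should agree with each other and, up to rounding of the t-stat, with the 0.062 reported in the table.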
If you look at many statistics textbooks, you will find a table called the Z-table which will give you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate columns of numbers to find the p-value. In your example, you would see the test statistic between .06 and .07 and those fingers would tap out that it was closer to .06 and do a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

How to plot the different graphs by stcurve in one chart in Stata?

I am using stcurve in Stata to plot survival probability. I need to plot the graph for all data and then for specific variables. I can generate the graphs in two different charts, but I need to have all three lines together in one chart.
I have tried the addplot() option but I get the error that stcurve is not a twoway graph. Do you have any idea how to do this?
This is the code that I have used which generates the graphs in two different charts separately:
stcurve, survival graphregion(lcolor(white) ilcolor(white) ifcolor(white) ) plotregion( lcolor(black)) title("Survival Function", size(vlarge)) ytitle("Survival probabilities", size(large)) xtitle("Time", size(large)) xlabel(,labsize(medium)) ylabel(,labsize(medium))
stcurve, survival at1( def=0) at2( def=1) graphregion(lcolor(white) ilcolor(white) ifcolor(white) ) plotregion( lcolor(black)) legend(label(1 "X Firms") label(2 "Y Firms")) legend(size(large)) lwidth(thin thick) title("Survival Function", size(vlarge)) ytitle("Survival probabilities", size(large)) xtitle("Time", size(large)) xlabel(,labsize(medium)) ylabel(,labsize(medium))
I am not sure if I understood correctly what you want. It would have been useful if you had added the stset and stcox code necessary before running stcurve.
If the Kaplan-Meier hazard graph is identical to your first stcurve, survival, you can try a dirty fix: after running stset, generate a variable, e.g.
sts gen s2=s
then plot it as a line against your time variable, i.e. add this to the end of the second graph:
addplot(line s2 your_timevar, sort c(J) title("Survival probabilities"))
The equality of KM hazard and Cox hazard only holds if the first graph does not have any more predictors than failvar in the stset. So if you ran stcox, estimate after stset timevar, failure(failvar) id(idvar) it works, but if you have more variables in the stcox call this will not give you the correct plot.
edit:
As the above quick solution does not work, there is another dirty workaround: save the results from stcurve in a file (option outfile), then plot the "new" data as twoway graphs. Something like this:
stcurve, survival name("surv1") outfile(stcurve1.dta, replace)
stcurve, survival name("surv2") at1( def=0) at2( def=1) outfile(stcurve2.dta, replace)
use stcurve1.dta, clear
rename surv1 surv1_A
rename _t _tA
append using stcurve2.dta
twoway line surv1 _t, sort || line surv1_A _tA, sort
I do not know if this will work with your data: it may be that you need to manipulate the new variables in the outfiles in some way to get the desired results, and you need to add the options you want to the twoway graphs. There surely are many better and easier ways of plotting this when you have the data for the graphs in separate datafiles, but this is the first solution that sprang to mind.
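For comparison, here is a sketch of the "generate the curves as variables, then overlay them with twoway" idea using Kaplan-Meier estimates instead of stcurve. This is a swapped-in technique, not the Cox-based curves from the question, so it only matches stcurve when the model has no further covariates. It uses the cancer dataset shipped with Stata; the variable names km_all and km_grp and the choice of drug groups 1 and 2 are assumptions for illustration.

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Kaplan-Meier survivor function, pooled and by group
sts generate km_all = s
sts generate km_grp = s, by(drug)

* three step functions in one chart
twoway (line km_all _t, sort connect(J)) ///
       (line km_grp _t if drug == 1, sort connect(J)) ///
       (line km_grp _t if drug == 2, sort connect(J)), ///
       legend(order(1 "All" 2 "drug == 1" 3 "drug == 2")) ///
       ytitle("Survival probabilities") xtitle("Time")
```

Because everything is an ordinary twoway line plot, the usual title, axis and legend options from the question can be added to the final command.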