Using PROC TTEST, analyze the stat1.german data set - sas

Elli Sagerman, a Masters of Education candidate in German Education at the University of North Carolina at Chapel Hill in 2000, collected data for a study. She looked at the effectiveness of a new type of foreign language teaching technique on grammar skills. She selected 30 students to receive tutoring. Fifteen received the new type of training during the tutorials and 15 received standard tutoring. Two students moved from the district before completing the study. Scores on a standardized German grammar test were recorded immediately before the 12-week tutorials and again 12 weeks later at the end of the trial. Sagerman wanted to see the effect of the new technique on grammar skills.
Using PROC TTEST, analyze the stat1.german data set. Assess whether the treatment group improved more than the control group.

ods graphics;
proc ttest data=stat1.german plots(shownull)=interval;
class Group;
var Change;
title "German Grammar Training, Comparing Treatment to Control";
run;
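Since the task asks whether the treatment group improved more than the control group, a one-sided test may be closer to the question than the default two-sided one. A minimal sketch, assuming the Group levels are Control and Treatment (the post does not confirm this), so that PROC TTEST forms the difference as Control minus Treatment and "treatment improved more" corresponds to a negative difference:

```sas
ods graphics on;

/* Assumption: Group sorts as Control, Treatment, so the difference is Control - Treatment */
proc ttest data=stat1.german plots(shownull)=interval h0=0 sides=L;
    class Group;
    var Change;
    title "German Grammar Training, One-Sided Test";
run;
title;
```

If the levels sort the other way round, sides=U would be the matching choice. The "Diff (1-2)" row in the output shows which level was subtracted from which.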

Related

Clarification on tabstat use after bysort in Stata

I have a rather simple question regarding the output of tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousand observations over a 9-year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (or sum of all expenses by all id's in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show in a table format the means of expenses. Please do note that ids are different from countries.
In this case, does tabstat calculate the means over all 9 years and all industries for a particular country, or is it just the mean of one year and one industry for each country in my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat create means of means? This is a little bit confusing.
I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assign them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands: (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16 or later, frames may also be useful to you.

Linear Regression: Finding Significant Class Variables Using SAS

I'm attempting to use SAS to do a pretty basic regression problem but I'm having trouble getting the full set of results.
I'm using a data set that includes professors' overall quality (the dependent variable) and has the following independent variables: gender, numYears, pepper, discipline, easiness, and raterInterest.
I'm using the code below to generate the analysis of the data set:
proc glm data=WORK.IMPORT;
class gender pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest;
run;
I get the following results, which is mostly what I need, EXCEPT that I would like to see exactly which responses from the class variables (gender, pepper, discipline) are significant.
From these results, I can see that easiness, raterInterest, pepper, and discipline are significant; however, I'd like to see which specific values of pepper and discipline are significant. For example, pepper was answered as a 'yes' or 'no' by the student. I'd like to see whether quality is associated specifically with pepper = yes or pepper = no. Can anyone give me some advice about how to alter my code to return a breakdown of the class variables?
Here is also a link to the dataset, in case it's needed for reference:
Rateprof: https://drive.google.com/file/d/1Kc9cb_n-l7qwWRNfzXtZi5OsiY-gsYZC/view?usp=sharing
I really, truly appreciate any assistance!
Add the solution option to your model statement to break out statistics for each level of each class variable. However, proc glm does not support reference parameterization: it uses an overparameterized (GLM) coding, so the estimates are flagged with a "B" as not uniquely estimable. There are ways around this to continue using proc glm, but the simplest solution is to use proc glmselect instead. proc glmselect allows you to specify reference parameterization. Use the selection=none option to disable variable selection.
proc glmselect data=WORK.IMPORT;
class gender pepper discipline / param=reference;
model quality = gender numYears pepper discipline easiness raterInterest / selection=none;
run;
The interpretation of this would be:
All other variables held constant, being female changes the quality rating by -0.046782 units compared to being male. This variable is not statistically significant.
The breakdown of each class level is a comparison to a reference value. By default, the reference value selected is the last level after all class values are internally sorted. You can specify a reference using the ref= option after each class variable. For example, if you wanted to use females as a reference value instead of males:
proc glmselect data=WORK.IMPORT;
class gender(ref='female') pepper discipline / param=reference;
model quality = gender numYears pepper discipline easiness raterInterest / selection=none;
run;
Note that you can also do this with proc mixed. For this specific purpose, the choice is up to you, based on the output style that you like. proc mixed is a more flexible way to run regressions, but would be a bit overkill here.
proc mixed data=import;
class gender pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest / solution;
run;
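The solution option mentioned at the top of this answer can also be tried directly on the original proc glm step. A sketch, using the same data set and variable names as the question; the lsmeans line is an optional addition that is not part of the answer above, and it compares every pair of discipline levels rather than each level against a single reference:

```sas
proc glm data=WORK.IMPORT;
    class gender pepper discipline;
    /* solution prints one estimate per class level; the last sorted level
       of each class variable is fixed at 0 and serves as the baseline */
    model quality = gender numYears pepper discipline easiness raterInterest / solution;
    /* optional: adjusted means per discipline with Tukey-adjusted pairwise p-values */
    lsmeans discipline / pdiff adjust=tukey;
run;
quit;
```

The estimates for the non-baseline levels should match what a reference parameterization reports for the same baseline; only the presentation differs.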

Season Identification in SAS

I have three years' worth of monthly data showing concentrations of X chemical in a sample. The data show seasonality, as predicted. However, the seasons are not your regular summer/winter/etc. I am trying to find out how I can delineate the seasons in SAS. I am trying to break the year down into 2 main seasons (high vs low concentrations). So I need SAS to be able to identify where the break is between the two seasons (i.e., which months are in the high concentration season and which months are in the low concentration season). Is there any way to do that?
Yes. You will want to use proc timeseries to decompose the series and estimate the season. You can alternatively use proc spectra, but proc timeseries is much more comprehensive.
ods graphics on;
proc timeseries data=sashelp.air plots=(decomp sc sa cycles sic periodogram);
id date interval=month;
var air;
run;
The results clearly indicate a seasonal period of 12 months in our example.
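To go from the decomposition to an actual high/low split, one simple option is to average the series by calendar month and compare each monthly mean with the overall mean of those monthly means. This is only a sketch of one possible rule, not something the procedure above produces; it uses sashelp.air as a stand-in for the concentration data, and the table names month_means and month_season are made up for illustration:

```sas
proc sql;
    /* mean level for each calendar month, pooled over all years */
    create table month_means as
    select month(date) as month,
           mean(air)   as mean_air
    from sashelp.air
    group by calculated month;

    /* label a month High when its mean exceeds the average monthly mean */
    create table month_season as
    select month,
           mean_air,
           case
               when mean_air > (select mean(mean_air) from month_means) then 'High'
               else 'Low'
           end as season
    from month_means
    order by month;
quit;

proc print data=month_season noobs;
run;
```

This rule does not force the two seasons to be contiguous blocks of months, so the labels should be checked against the seasonal plots. With a strong trend or incomplete years, the seasonal component written by the outdecomp= option of proc timeseries would likely be a better input than the raw series.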

How do I perform spatial logistic regression in SAS?

I am trying to develop a spatiotemporal logistic regression model to predict the presence/absence of a disease in U.S. counties (contiguous U.S.) based on climatologic variables, with data points for each year between 2007 and 2014; ideally, I would like a model with functionality to score additional datasets, e.g., use the model developed for 2006-2014 to predict disease probability in future climate scenarios. The model needs to account for spatial autocorrelation, and (again, ideally) repeated measures (each county has one data point per year). Unfortunately, my SAS abilities are not up to the task. Would anyone have suggestions for developing the model? The data, in csv format, take the form of:
countyFIPS year outcome predictor1 predictor2 predictor3 latitude longitude
where
countyFIPS = unique 5-digit identifier for U.S. counties
outcome = at least one case in the county for the given year, coded 0/1
latitude and longitude denote the centroid of the county
I'm really bad at this, so please be gentle and use small words...

What can TPL Tables do that Proc Tabulate cannot?

In particular, what, if any, are the substantial changes or extensions in the programming language that gives it functionality beyond PROC TABULATE?
Or is it the case that the programming languages in Proc Tabulate and TPL Tables (from QQQ Software) are pretty close to the same?
I was really surprised to hear about TPL Tables, and its predecessor, the Table Producing Language from the US Department of Labor in the 1970s. After all these years, I had never heard of it. It turns out that two commercial descendants of the Table Producing Language are SAS PROC TABULATE and TPL Tables.
Has anyone worked with both? Why is TPL Tables so unknown?
Robert
You are correct, both TABULATE and QQQ TPL Tables are descendants of the US Bureau of Labor Statistics TPL. According to this thread, the developers of TPL/PCL at the Bureau of Labor Statistics eventually left BLS and started QQQ.
This SAS article is a good read regarding TABULATE. According to the article, TABULATE, which was introduced in the 80s, originally borrowed much of its syntax and features from BLS TPL while addressing some of its shortcomings, though the specific shortcomings addressed are not mentioned.
What, if any, are the substantial changes or extensions in the programming language that give it functionality beyond PROC TABULATE?
The features of QQQ TPL Tables have evolved over time, as have the features of TABULATE. I've found no information to suggest that ongoing TABULATE development kept abreast of QQQ TPL features, so the two systems are now likely too different to compare effectively. As a SAS product, TABULATE is intended to integrate with other SAS technologies, such as ODS. TPL probably integrates with other QQQ technologies.
That said, based on the documentation alone, one thing that TPL (v7+) can do that TABULATE (as of v9.4) cannot is perform statistical hypothesis tests, e.g. t-tests, chi-squared tests, and ANOVA. But in SAS you have other, likely more flexible, options to get these.
If you're looking to integrate one or the other into your development cycle, I recommend choosing the one that best fits your current system. If you're already using SAS, stick with TABULATE.
Why is TPL Tables so unknown?
Who knows. It's still in use by the BLS and a few others, apparently. But SAS is such a giant in the field that it tends to overshadow its competition.