SAS Newbie - Proc Power Anova with multiple standard deviations

I'm looking at a continuous variable across four categorically coded groups.
I'm trying to run proc power with a onewayanova test, but I can't get it to account for a different standard deviation in each group.
Is this possible?
Title "Find Power for ANOVA"
proc power;
onewayanova test = overall
groupmeans = 1814120 | 1344300 | 953580 | 1352900
stddev = 1879922.09 | 969317.15 | 441433.68 | 970670.65
npergroup = 3 | 4 |5 | 4
power = .;
run;
This gives me:
180
ERROR 180-322: Statement is not valid or it is used out of proper order.

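stddev and npergroup take number lists, whereas groupmeans takes a grouped number list, and the two have different syntax: values in a number list are separated by spaces, while | separates the groups in a grouped number list. Note that each value in a plain number list defines a separate scenario, so the code below runs the analysis for every combination of standard deviation and per-group sample size, each time with one value common to all groups; the onewayanova analysis assumes a common standard deviation. Unequal group sizes are specified with groupns, which takes a grouped number list.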
proc power;
onewayanova
test = overall
groupmeans = 1814120 | 1344300 | 953580 | 1352900
stddev = 1879922.09 969317.15 441433.68 970670.65
npergroup = 3 4 5 4
power = .
;
run;

Related

Optimal lag selection in Granger Causality tests

I use [TS] varsoc to obtain the optimum lag length for the Granger causality test in Stata. This command reports the optimal number of lags based on different criteria such as Akaike's information criterion (AIC).
Is there any way to store the optimal lag number (obtained based on AIC) in a variable and use it in the next command to estimate causality? Something like this:
Lag= varsoc X Y
tvgc X Y, p(Lag) d(Lag) trend window(30) prefix(_) graph

Here I adapt the first example in the help for varsoc. You can sort the matrix of statistics so that minimum AIC is in the first row, and read off the lag concerned.
. webuse lutkepohl2, clear
(Quarterly SA West German macro data, Bil DM, from Lutkepohl 1993 Table E.1)
. varsoc dln_inv dln_inc dln_consump
Lag-order selection criteria
Sample: 1961q2 thru 1982q4 Number of obs = 87
+---------------------------------------------------------------------------+
| Lag | LL LR df p FPE AIC HQIC SBIC |
|-----+---------------------------------------------------------------------|
| 0 | 696.398 2.4e-11 -15.9402 -15.9059 -15.8552* |
| 1 | 711.682 30.568 9 0.000 2.1e-11 -16.0846 -15.9477* -15.7445 |
| 2 | 724.696 26.028 9 0.002 1.9e-11* -16.1769* -15.9372 -15.5817 |
| 3 | 729.124 8.8557 9 0.451 2.1e-11 -16.0718 -15.7294 -15.2215 |
| 4 | 738.353 18.458* 9 0.030 2.1e-11 -16.0771 -15.632 -14.9717 |
+---------------------------------------------------------------------------+
* optimal lag
Endogenous: dln_inv dln_inc dln_consump
Exogenous: _cons
.
. mata
------------------------------------------------- mata (type end to exit) ---------------
: stats = st_matrix("r(stats)")
: _sort(stats, 7)
: st_numscalar("opt_lag_AIC", stats[1,1])
: end
-----------------------------------------------------------------------------------------
.
. di opt_lag_AIC
2
To plug into a later command automatically, use expressions like
`=opt_lag_AIC'
as arguments to options.
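For readers who want to check the selection logic outside Stata, here is a minimal Python sketch of the same idea: order the rows by AIC and read off the lag. The AIC values are copied from the varsoc output above; the names aic_by_lag and opt_lag_aic are made up for this illustration and are not part of Stata.

```python
# AIC values copied from the varsoc table above (lag -> AIC).
aic_by_lag = {0: -15.9402, 1: -16.0846, 2: -16.1769, 3: -16.0718, 4: -16.0771}

# The optimal lag under this criterion is the one with the smallest AIC.
opt_lag_aic = min(aic_by_lag, key=aic_by_lag.get)
print(opt_lag_aic)
```

This picks lag 2, the row marked with * in the AIC column.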

How to divide all the observations based on a sum of a column

I'm trying to do simple calculations but I'm new and SAS is not intuitive to me.
Suppose I have this table.
data money;
infile datalines delimiter=",";
input name $ return $ invested;
datalines;
Joe,10,100
Bob,7,50
Mary,80,50
;
Which creates this
/* name | return | invested */
/* _________________________ */
/* Joe | 10 | 100 */
/* Bob | 7 | 50 */
/* Mary | 80 | 50 */
I have three things I would like to do for my job that just switched over to SAS.
I need to make sure columns return and invested are numeric. When I run the code above, return column ends up being a CHAR column and I don't know why.
Now I want to create a new column and calculate the share of the total return they each got. In this case, the sum of return=97. This is the result I want.
/* name | return | invested | share_of_return */
/* ____________________________________________ */
/* Joe | 10 | 100 | 10.30% */
/* Bob | 7 | 50 | 7.22% */
/* Mary | 80 | 50 | 82.47% */
Next I want to find their ROI. Which is (return-investment) / investment * 100. This is the result I am looking for
/* Find ROI */
/* name | return | invested | share_of_return | ROI */
/* ___________________________________________________ */
/* Joe | 10 | 100 | 10.30% | -90% */
/* Bob | 7 | 50 | 7.22% | -86% */
/* Mary | 80 | 50 | 82.47% | 60% */
I appreciate your explanations and guidance in advance. This is for a work project and we just switched over to SAS.

1 & 3 are easy, 2 is slightly more difficult.
1. Remove the $ in the INPUT statement; $ indicates a character variable. In your actual data you may need to convert the variable using the input function instead.
Fix for the example:
input name $ return invested;
Fix for actual data using the input function. Note that you cannot convert a variable's type in a data step while keeping the same name, so I rename it while reading it in using the rename data set option.
data money2;
set money (rename=(return=return_char));
return = input(return_char, best.);
drop return_char;
run;
2. Add the total value to each row. SQL is fastest here:
proc sql;
create table money3 as
select *, sum(return) as return_total, return/calculated return_total as return_percentage format=percent12.1
from money2;
quit;
I outline two different methods of doing this here
3. Within a data step, add your calculation. It's probably most efficient if it can be done in the first step.
Since a data step loops over the rows automatically, you write the formula pretty much as shown. In this case I've also applied a format so it shows as a percentage, but that requires you to not multiply it by 100. Depending on what you're doing next it may be best to leave it as a plain numeric value.
data money2;
set money (rename=(return=return_char));
return = input(return_char, best.);
/* the variable is named invested in the input data */
ROI = (return - invested)/invested;
format ROI percent12.1;
drop return_char;
run;
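As a cross-check of the arithmetic only (not a replacement for the SAS steps), here is a small Python sketch that computes the share of total return and the ROI. It uses the values from the asker's expected-output tables, where Mary's invested amount is 50; the names rows, total_return and results are made up for this illustration.

```python
# Rows as (name, return, invested), taken from the expected-output tables in the question.
rows = [("Joe", 10, 100), ("Bob", 7, 50), ("Mary", 80, 50)]

# Total of the return column across all rows.
total_return = sum(ret for _, ret, _ in rows)

results = []
for name, ret, invested in rows:
    share = ret / total_return          # fraction of the total return
    roi = (ret - invested) / invested   # fraction, not multiplied by 100
    results.append((name, share, roi))
    # The % format multiplies by 100 for display, like a SAS percent format.
    print(f"{name}: share={share:.2%}, ROI={roi:.0%}")
```

Both quantities are kept as fractions and only formatted as percentages for display, which mirrors the advice above about not multiplying by 100 when a percent format is applied.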

Group rows in PROC TABULATE

I have the following (fake) crime data of offenders:
/* Some fake-data */
DATA offenders;
INFILE DATALINES DSD;
INFORMAT id $12.;
INPUT id :$12. crime :4. offenderSex :$1. count :3.;
DATALINES;
1,110,f,3
2,32,f,1
3,31,m,1
4,113,m,1
5,110,m,1
6,31,m,1
7,31,m,1
8,110,f,2
9,113,m,1
10,31,m,1
11,113,m,1
12,110,f,1
13,32,m,1
14,31,m,1
15,31,m,1
16,31,m,1
17,110,f,2
18,113,m,2
19,31,m,1
20,31,m,1
21,110,m,4
22,32,f,1
23,31,m,1
24,31,m,1
25,110,f,4
26,110,m,1
27,110,m,1
28,110,m,2
29,32,m,1
30,113,f,1
31,32,m,1
32,31,f,1
33,110,m,1
34,32,f,1
35,113,m,2
36,31,m,1
37,113,m,1
38,110,f,1
39,113,u,2
;
RUN;
proc format;
value crimes 110 = 'Theft'
113 = 'Robbery'
32 = 'Assault'
31 = 'Minor assault';
run;
I want to create a cross table using PROC TABULATE:
proc tabulate;
format crime crimes.;
freq count;
class crime offenderSex;
table crime="Type of crime", offenderSex="Sex of the offender" /misstext="0";
run;
This gives me a table like this:
m f
------------------------------------
Minor assault |
Assault |
Theft |
Robbery |
Now, I'd like to group the different types of crimes:
'Assault' and 'minor assault' should be in a category "Violent crimes" and 'theft' and 'robbery' should be in a category "Crimes against property":
m f
------------------------------------
Minor assault |
Assault |
*Total violent crimes* |
Theft |
Robbery |
*Total property crimes* |
Can anyone explain to me how to do this? I tried to use another format for the 'crime' variable and use "category * crime" within PROC TABULATE, but then it turned out like this, which is not exactly what I want:
m f
-------------------------------------------------------
Violent crimes Minor assault |
Assault |
Property crimes Theft |
Robbery |

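Use the ALL keyword within a table dimension (this assumes group is a class variable holding the crime category, so it must also be listed in the class statement):
table group='Category' * (crime="Type of crime" All='Total'), offenderSex="Sex of the offender" /misstext="0";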

SAS: Rename variables in merge according to original dataset

I have two datasets, one for male and one for female, which contain identical variables. I need to find the percent difference between the sexes on each variable by group.
The datasets look something like this, but with more variables and groups,
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | F | 8 | 5 |
| 2 | F | 6 | 3 |
| 3 | F | 7 | 0 |
|-------+-----+------+------|
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | M | 9 | 7 |
| 2 | M | 8 | 5 |
| 3 | M | 6 | 3 |
|-------+-----+------+------|
The result I need is this:
| Group | percent_diffA | percent_diffB |
|-------+---------------+---------------|
| 1 | -0.117647059 | -0.333333333 |
| 2 | -0.285714286 | -0.5 |
| 3 | 0.153846154 | -2 |
|-------+---------------+---------------|
I could solve this via a merge by renaming each variable.
data difference;
merge
females (rename = (VarA = VarA_F VarB = VarB_F))
males (rename = (VarA = VarA_M VarB = VarB_M))
;
by group;
percent_diffA = (VarA_F - VarA_M) / ( (VarA_F + VarA_M) / 2 );
percent_diffB = (VarB_F - VarB_M) / ( (VarB_F + VarB_M) / 2 );
drop sex;
run;
However, this approach requires me to rename everything manually. With several variables, the rename statement becomes cumbersome. Unfortunately, this calculation is being interjected into some old code, so renaming the original datasets is not practical.
I'm wondering if there is another way to solve this problem which is less cumbersome.
EDIT: I have updated the variable names because that appears to have caused people confusion. They were originally called Var1 and Var2. They are now VarA and VarB. The real variable names are descriptive, for instance body_weight_g or gonadal_somatic_index. The variables are not simply listed with sequential numbers.

For a data set that contains variables that are sequentially numbered there is variable list syntax for renaming the whole range of variables:
This example creates two sample data sets, each with 100 variables.
data have1 have2;
do group = 1 to 100;
sex = 'M';
array var(100);
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 42 then output have1;
sex = 'F';
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 100-42 then output have2;
end;
run;
The rename option works on all 100 variables.
data want;
merge
have1(rename=(var1-var100=mvar1-mvar100) in=_M)
have2(rename=(var1-var100=fvar1-fvar100) in=_F)
;
by group;
if _M & _F & first.group & last.group then do;
array one mvar1-mvar100;
array two fvar1-fvar100;
array results result1-result100;
do i = 1 to dim(results);
diff = one(i) - two(i);
mean = mean (one(i), two(i));
results(i) = diff / mean * 100;
end;
end;
keep group result:;
run;

Shenglin's answer is a nice and concise use of SQL.
An alternative method is constructing a macro variable specifying the renames to be used in the rename DSO (data set option). This can be done with an SQL query to the dictionary table containing the column names.
* This macro creates the macro variable rename_suffix, to be used in a rename statement or data set option ;
* It will be of form: var1 = var1_suffix var2 = var2_suffix ... ;
* &inset is the input set. &suffix is the suffix to be added to all variables except for the variables specified in &keys. ;
* &keys variables should be given each in quotation marks, and separated by spaces. ;
%macro rename_list(inset, suffix, keys) ;
%global rename_&inset ; * So that this macro variable is accessible outside the macro ;
proc sql ;
select strip(name) || ' = ' || strip(name) || "_&suffix"
into :rename_&inset separated by ' '
from sashelp.vcolumn /* dictionary.columns can be used in place of sashelp.vcolumn */
where libname = 'WORK' & memname = "%sysfunc(upcase(&inset))"
& upcase(strip(name)) not in (' ' %sysfunc(upcase(&keys))); * The ' ' is included, so there is no error if no keys are given ;
quit ;
%mend rename_list ;
%rename_list(females, F, 'GROUP' 'SEX')
%rename_list(males , M, 'GROUP' 'SEX')
%put &rename_females ; * Check that the macro variables are correct ;
%put &rename_males ;
%macro pct_diff(num) ;
percent_diff&num = (Var&num._F - Var&num._M) / ( (Var&num._F + Var&num._M) / 2 ) ;
%mend pct_diff ;
data difference ;
merge females(rename = (&rename_females) drop = sex)
males (rename = (&rename_males ) drop = sex) ;
by group ;
%pct_diff(1)
%pct_diff(2)
run ;
dm 'vt difference';
The percent_diff variable creation can also be shortened with a macro (as shown). If you had a large and/or variable number of variables to compare, then you could further shorten it by automatically detecting the number of comparisons, by running the same SQL query with the select into part modified to be
select count(name) into :varct trimmed
to count the number of variables, and then use a macro %do loop in the data step. It has to be a macro loop rather than a data step do loop, because &num must resolve to a literal number when the statement text is generated, and %do is only valid inside a macro definition:
%do i = 1 %to &varct ;
%pct_diff(&i)
%end ;

Use table aliases in proc sql to avoid renaming:
proc sql;
select a.group,(a.VarA-b.VarA)/((a.VarA+b.VarA)/2) as percent_diffA,
(a.VarB-b.VarB)/((a.VarB+b.VarB)/2) as percent_diffB
from females as a,males as b
where a.group=b.group;
quit;
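To sanity-check the formula against the expected results in the question, here is a short Python sketch of the same join-and-compute logic. The sample values are copied from the question's tables; the dictionaries females and males and the helper percent_diff are names invented for this illustration.

```python
# Sample values from the question: group -> (VarA, VarB).
females = {1: (8, 5), 2: (6, 3), 3: (7, 0)}
males = {1: (9, 7), 2: (8, 5), 3: (6, 3)}


def percent_diff(f, m):
    """Difference between f and m relative to the mean of the two values."""
    return (f - m) / ((f + m) / 2)


# Only groups present in both tables are compared, like the join on group.
diffs = {
    g: tuple(percent_diff(f, m) for f, m in zip(females[g], males[g]))
    for g in females.keys() & males.keys()
}

for g in sorted(diffs):
    print(g, diffs[g])
```

The values agree with the result table in the question, for example -0.5 for VarB in group 2 and -2 for VarB in group 3.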

Compare Value of Current Observation with First Observation

I have a set of multiple choice responses from a survey with 45 questions, and I've placed the correct responses as my first observation in the dataset.
In my DATA step I would like to set values to 0 or 1 depending on whether the variable in each observation matches the same variable in the first observation. I want to replace the response letter (A-D) with the 0 or 1 in the dataset. How do I go about doing that comparison?
I'm not doing any grouping, so I believe I can access the first row using First.x, but I'm not sure how to compare that across each variable(answer1-answer45).
| Id | answer1 | answer2 | ...through answer 45
|:-------------|---------:|
| KEY | A | B |
| 2 | A | C |
| 3 | C | D |
| 4 | A | B |
| 5 | D | C |
| 6 | B | B |
Should become:
| Id | answer1 | answer2 | ...through answer 45
|:-------------|---------:|
| KEY | A | B |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
Current code for reading in the data:
DATA TEST(drop=name fill answer0);
INFILE SCORES DSD firstobs=2;
length id $4;
length answer1-answer150 $1;
INPUT name $ fill id $ (answer0-answer150) ($);
RUN;
Thanks in advance!

Here's how I might do it. Create a data set to PROC COMPARE the KEY to the observed. Then you have X for not matching key and missing for matched. You can then use PROC TRANSREG to score the 'X.' to 01. PROC TRANSREG also creates macro variables which contain the names of the new variables and the number.
From log NOTE: _TRGINDN=2 _TRGIND=answer1D answer2D
data questions;
input id:$3. (answer1-answer2)(:$1.);
cards;
KEY A B
2 A C
3 C D
4 A B
5 D C
6 B B
;;;;
run;
data key;
if _n_ eq 1 then set questions(obs=1);
set questions(keep=id firstobs=2);
run;
proc compare base=key compare=questions(firstobs=2) out=comp outdiff noprint;
id id;
run;
options validvarname=v7;
proc transreg design data=comp(drop=_type_ type=data);
id id;
model class(answer:) / noint;
output out=scored(drop=intercept _:);
run;
%put NOTE: &=_TRGINDN &=_TRGIND;

I don't have my SAS license here at home, so I can't actually test this code. I'll give it my best shot, though ...
First, I'd keep my correct answers in a separate table, and then merge it with the answers from the respondents. That also makes the solution scalable, should you have more multiple choice solutions and answers in the same table, since you'd be joining on the assignment ID as well.
Now, import all your correct answers to a table answers_correct with column names answer_correct1-answer_correct45.
Then, merge the two tables and determine the outcome for each question.
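DATA outcome;
* Read the single row of correct answers once. Variables read with SET are retained,;
* so the key stays available for every respondent. A MERGE without BY would set the;
* key variables to missing after the first observation.;
IF _N_ = 1 THEN SET answers_correct;
SET answers;
* If you later add more questionnaires, merge BY the questionnaire ID instead;
ARRAY answer(*) answer1-answer45;
ARRAY answer_correct(*) answer_correct1-answer_correct45;
LENGTH result1-result45 $1;
ARRAY result(*) result1-result45;
DROP i;
* SAS loops use DO, not FOR;
DO i = 1 TO DIM(answer);
IF answer(i) = answer_correct(i) THEN result(i) = '1';
ELSE result(i) = '0';
END;
RUN;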
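The comparison itself is simple enough to check by hand. As an illustration only, here is a Python sketch that scores the six example rows from the question against the key row; the names rows, key and scored are made up for this sketch, and only the two answer columns shown in the question are used.

```python
# Rows from the question's example; the first row is the answer key.
rows = [
    ("KEY", ["A", "B"]),
    ("2", ["A", "C"]),
    ("3", ["C", "D"]),
    ("4", ["A", "B"]),
    ("5", ["D", "C"]),
    ("6", ["B", "B"]),
]

# Split off the key, then compare every other row to it position by position.
_, key = rows[0]
scored = {
    rid: [1 if given == correct else 0 for given, correct in zip(answers, key)]
    for rid, answers in rows[1:]
}

for rid, marks in scored.items():
    print(rid, marks)
```

The 0/1 values match the "Should become" table in the question.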
