SAS Proc Ttest finding differences and where statement - sas

DATA OZONE;
INPUT MONTH $ STMF YKRS ##;
CARDS;
A 80 66 A 68 82 A 24 47 A 24 28 A 82 44 A 100 55
A 55 34 A 91 60 A 87 70 A 64 41 A . 67 A . 127 A 170 96 A . 56
JN 215 93 JN 230 106 JN . 49 JN 69 64 JN 98 83 JN 125 97
JN 72 51 JN 125 75 JN 143 104 JN 192 107 JN . 56 JN 122 68
JN 32 20 JN 23 35 JN 71 30 JN 38 31 JN 136 81 JN 169 119
JL 152 76 JL 201 108 JL 134 85 JL 206 96 JL 92 48 JL 101 60
JL 133 . JL 83 50 JL . 27 JL 60 37 JL 124 47 JL 142 71
JL 75 49 JL 103 59 JL . 53 JL 46 25 JL 68 45 JL . 78
S 38 23 S 80 50 S 80 34 S 99 58 S 71 35 S 42 24 S 52 27 S 33 17
;
run;
Proc Ttest data=Ozone PLOT=NONE ALPHA=0.01;
Where MONTH='JN';
Paired STMF*YKRS;
Run;
Question 2
Data Baseball;
Input ba league;
Datalines;
276 National League
288 National League
281 National League
290 National League
303 National League
257 American League
254 American League
263 American League
261 American League
Run;
Proc Ttest data=Baseball ALPHA=0.02 ;
Question 3
Proc Ttest data=ozone ALPHA=0.01 Plot=NONE;
Where Month='A'-'S';
Paired STMF*YKRS;
Run;
Question 2 Test to see if both leagues have different batting averages. Use a alpha = 0.02 in your conclusions and compute a 98% confidence interval for the means.
Question 3 From the first question test to see the differences the A and S average ozone values. Use alpha = 0.01 in your conclusions. Include a 99% confidence interval for the difference.
So my questions are for Question 2 kind of a stupid question but for whatever reason I am confused as to what you are suppose to do.
For question 3 (my main question) how do I use one proc ttest to check and see the differences between months A and S? I tried using a Where statement as you can see above, but of course that does not work I’m abit stomped on where to go from here. Also I omited a good bit of the month data in the ozone portion as I couldn't properly format all of the data without it looking extremely confusing.
Thanks for your help in advance!

Related

Is there a way to make a dummy variable in SAS for a Country in my SAS Data Set?

I am looking to create a dummy variable for Latin American Countries in my data set which I need to make for a log-log model. I know how to log all of them for my later regression. Any suggestion or help on how to make a dummy variable for the Latin American countries with my data would be appreciated.
data HW6;
input country : $25. midyear sancts lprots lfrac ineql pop;
cards;
CHILE 1955 58 44 65 57 6.743
CHILE 1960 19 34 65 57 7.785
CHILE 1965 27 24 65 57 8.510
CHILE 1970 36 29 65 57 9.369
CHILE 1975 38 58 65 57 10.214
COSTA_RICA 1955 16 7 54 60 1.024
COSTA_RICA 1960 6 1 54 60 1.236
COSTA_RICA 1965 2 1 54 60 1.482
COSTA_RICA 1970 3 1 54 60 1.732
COSTA_RICA 1975 2 3 54 60 1.965
INDIA 1955 81 134 47 52 404.478
INDIA 1960 101 190 47 52 445.857
INDIA 1965 189 845 47 52 494.882
INDIA 1970 133 915 47 52 553.619
INDIA 1975 132 127 47 52 616.551
JAMICA 1955 11 12 47 62 1.542
JAMICA 1960 9 2 47 62 1.629
JAMICA 1965 8 6 47 62 1.749
JAMICA 1970 1 1 47 62 1.877
JAMICA 1975 7 1 47 62 2.043
PHILIPPINES 1955 26 123 48 56 24.0
PHILIPPINES 1960 20 38 48 56 27.898
PHILIPPINES 1965 9 5 48 56 32.415
PHILIPPINES 1970 79 25 48 56 37.540
SRI_LANKA 1955 29 2 73 52 8.679
SRI_LANKA 1960 75 35 73 52 9.879
SRI_LANKA 1965 25 63 73 52 11.202
SRI_LANKA 1970 34 14 73 52 12.532
TURKEY 1955 79 1 67 61 24.145
TURKEY 1960 138 19 67 61 28.217
TURKEY 1965 36 51 67 61 31.951
TURKEY 1970 51 22 67 61 35.743
URUGUAY 1955 8 4 57 48 2.372
URUGUAY 1960 12 1 57 48 2.538
URUGUAY 1965 16 14 57 48 2.693
URUGUAY 1970 21 19 57 48 2.808
URUGUAY 1975 24 45 57 48 2.829
VENEZUELA 1955 38 14 76 65 6.110
VENEZUELA 1960 209 23 76 65 7.632
VENEZUELA 1965 100 162 76 65 9.119
VENEZUELA 1970 9 27 76 65 10.709
VENEZUELA 1975 4 12 76 65 12.722
;
data newData;
set HW6;
sancts = log (sancts);
lprots = log (lprots);
lfrac = log (lfrac);
ineql = log (ineql);
pop = log (pop);
run;
The GLMSELECT procedure is one simple way of creating dummy variables.
There is a nice article about how to use it to generate dummy variables
data newData;
set HW6;
sancts = log (sancts);
lprots = log (lprots);
lfrac = log (lfrac);
ineql = log (ineql);
pop = log (pop);
Y = 0; *-- Create a fake response variable --*
run;
proc glmselect data=newData noprint outdesign(addinputvars)=want(drop=Y);
class country;
model Y = country / noint selection=none;
run;
If needed in further step, use the macro-variable &_GLSMOD created by the procedure that contains the names of the dummy variables.
The real question here is not related to SAS, it is related on how to get the region of a country by its name.
I would give a try to the ISO 3166 which lists all countries and their geographical location.
Getting that list is straight forward, then import that list in SAS, use a merge by country and finally identify the countries in Latin America

SAS warning: complete separation of data points. The maximum likelihood estimate does not exist

I did loggistic regression in SAS using the database shown below but I got several warnings. I tried to identify the outliers and exclude them then test for multicolinearity but still I am getting warnings.
Any advice will be greatly appreciated.
**********************************;
************** database **********;
***********************************;
data D_BP;
input BP Age Weight BSA Dur Pulse Stress;
datalines;
0 47 85.4 1.75 5.1 63 33
0 51 89.4 1.89 7 72 95
0 47 90.9 1.9 6.2 66 8
0 49 89.2 1.83 7.1 69 62
0 48 92.7 2.07 5.6 64 35
0 47 94.4 2.07 5.3 74 90
0 50 95 2.05 10.2 68 47
0 45 87.1 1.92 5.6 67 80
0 46 94.5 1.98 7.4 69 95
0 46 87 1.87 3.6 62 18
0 46 94.5 1.9 4.3 70 12
0 48 90.5 1.88 9 71 99
1 49 94.2 2.1 3.8 70 14
1 49 95.3 1.98 8.2 72 10
1 50 94.7 2.01 5.8 73 99
1 48 99.5 2.25 9.3 71 10
1 49 99.8 2.25 2.5 69 42
1 49 94.1 1.98 5.6 71 21
1 52 101.3 2.19 10 76 98
1 56 95.7 2.09 7 75 99
;
run;
****** do logistic regression **********;
Proc logistic data=work.D_bp;
Model BP=Age Weight BSA Dur Pulse Stress;
Run;
**** identify outlier *********;
proc reg data=work.D_bp plots(only
label)=(RStudentByLeverage CooksD);
model BP=Age Weight BSA Dur Pulse Stress ;
run;
**** After removing outliers ==> assess multicollinearity*********;
**** assessing multicollinearity by 2 ways *********;
proc corr data=work.D_bp ;
Var Age Weight BSA Dur Pulse Stress;
run;
proc reg data=work.D_bp plots;
Model BP=Age Weight BSA Dur Pulse Stress/Collin vif tol;
run;
****** repeat logistic regression after excluding weight **********;
Proc logistic data=work.D_bp;
Model BP=Age BSA Dur Pulse Stress;
Run;
WARNING: There is a complete separation of data points. The maximum likelihood estimate does
not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are
based on the last maximum likelihood iteration. Validity of the model fit is
questionable.

SAS Restructure Data

I need help restructuring the data. My Table looks like this
NameHead Department Per_test Per_Delta Per_DB Per_Vul
Nancy Health 55 33.2 33 63
Jim Air 25 22.8 23 11
Shu Water 26 88.3 44 12
Dick Electricity 77 55.9 66 10
Elena General 88 22 67 9
Nancy Internet 66 12 44 79
And I want my table to look like this
NameHead Nancy Jim Shu Dick Elena Nancy
Department Health Air Water Electricity General Internet
Per_test 55 25 26 77 88 66
Per_Delta 33.2 22.8 88.3 55.9 22 12
PerDB 33 23 44 66 67 44
Per_Vul 63 11 12 10 9 79
I tried proc transpose but couldnt get the desired result. Please help!
Thanks!
PROC TRANSPOSE does exactly what you want. You must include a VAR statement if you want to include the character variables.
proc transpose data=have out=want;
var _all_;
run;
Note that you cannot have variables that do not have names. Here is what the dataset looks like.
Obs _NAME_ COL1 COL2 COL3 COL4 COL5 COL6
1 NameHead Nancy Jim Shu Dick Elena Nancy
2 Department Health Air Water Electricity General Internet
3 Percent_test 55 25 26 77 88 66
4 Percent_Delta 33.2 22.8 88.3 55.9 22 12
5 Percent_DB 33 23 44 66 67 44
6 Percent_Vul 63 11 12 10 9 79

How do you Resample daily with a conditional statement in pandas

I have a pandas data frame below: (it does have other columns but these are the important ones) Date column is the Index
Number_QA_VeryGood Number_Valid_Cells Time
Date
2015-01-01 91 92 18:55
2015-01-02 6 6 18:00
2015-01-02 13 13 19:40
2015-01-03 106 106 18:45
2015-01-05 68 68 18:30
2015-01-06 111 117 19:15
2015-01-07 89 97 18:20
2015-01-08 86 96 19:00
2015-01-10 9 16 18:50
I need to resample daily the first two columns will be resampled with sum.
The last column needs to look at the highest daily value for the Number_Valid_Cells column and use that time for the value.
example output should be: (1/2/02 is line which changed)
Number_QA_VeryGood Number_Valid_Cells Time
Date
2015-01-01 91 92 18:55
2015-01-02 19 19 19:40
2015-01-03 106 106 18:45
2015-01-05 68 68 18:30
2015-01-06 111 117 19:15
2015-01-07 89 97 18:20
2015-01-08 86 96 19:00
2015-01-10 9 16 18:50
What is the best way to get this to work.
Or you can try
df.groupby(df.index).agg({'Number_QA_VeryGood':'sum','Number_Valid_Cells':'sum','Time':'last'})
Out[276]:
Time Number_QA_VeryGood Number_Valid_Cells
Date
2015-01-01 18:55 91 92
2015-01-02 19:40 19 19
2015-01-03 18:45 106 106
2015-01-05 18:30 68 68
2015-01-06 19:15 111 117
2015-01-07 18:20 89 97
2015-01-08 19:00 86 96
2015-01-10 18:50 9 16
Update: sort_values first
df.sort_values('Number_Valid_Cells').groupby(df.sort_values('Number_Valid_Cells').index)\
.agg({'Number_QA_VeryGood':'sum','Number_Valid_Cells':'sum','Time':'last'})
Out[314]:
Time Number_QA_VeryGood Number_Valid_Cells
Date
1/1/2015 18:55 91 92
1/10/2015 18:50 9 16
1/2/2015 16:40#here.changed 19 19
1/3/2015 18:45 106 106
1/5/2015 18:30 68 68
1/6/2015 19:15 111 117
1/7/2015 18:20 89 97
1/8/2015 19:00 86 96
Data input :
Number_QA_VeryGood Number_Valid_Cells Time
Date
1/1/2015 91 92 18:55
1/2/2015 6 6 18:00
1/2/2015 13 13 16:40#I change here
1/3/2015 106 106 18:45
1/5/2015 68 68 18:30
1/6/2015 111 117 19:15
1/7/2015 89 97 18:20
1/8/2015 86 96 19:00
1/10/2015 9 16 18:50
You can use groupby sum for first two columns, if you have the values of Number_Valid_Cells sorted then
ndf = df.reset_index().groupby('Date').sum()
ndf['Time'] = df.reset_index().drop_duplicates(subset='Date',keep='last').set_index('Date')['Time']
Number_QA_VeryGood Number_Valid_Cells Time
Date
2015-01-01 91 92 18:55
2015-01-02 19 19 19:40
2015-01-03 106 106 18:45
2015-01-05 68 68 18:30
2015-01-06 111 117 19:15
2015-01-07 89 97 18:20
2015-01-08 86 96 19:00
2015-01-10 9 16 18:50

Looping in SAS to bring the latest value

I am trying to find days matching to a reference number of days given or else to find the number of days close to the reference days.
I coded till here, however not sure how to go forward.
ID Date ref_days lags total_days
1 2017-02-02 224 . 0
1 2017-02-02 224 84 84
1 2017-02-02 224 84 168
2 2015-01-21 213 300 388
3 2016-02-12 560 95 .
3 2016-02-12 560 86 181
3 2016-02-12 560 82 263
3 2016-02-12 560 69 332
3 2016-02-12 560 77 409
So now I want to bring out the last value close to the reference days.
and the next total_days should start from ZERO again to find the next window. How can I do this?
Here is a code that I wrote
data want;
do until (totaldays <= ref_days);
set have;
by ID ref_days notsorted;
if first.id then totaldays=0;
else totaldays+lags;
end;
run;
Required Output:
ID Date ref_days lags total_days
1 2017-02-02 224 . 0
1 2017-02-02 224 84 84
1 2017-02-02 224 84 168
2 2015-01-21 213 300 388
3 2016-02-12 560 95 .
3 2016-02-12 300 86 181
3 2016-02-12 300 82 263
3 2016-02-12 300 69 .
3 2016-02-12 300 77 146
A while ago I did similar to this via Proc sql. It calculates all the distances and takes the closest one. It works with moderate size dataset. Hopefully it is of some use.
proc sql;
select * from
(
select *,
abs(t1.link-t2.link) as dist /*In your case these would be dateVars*/
from test1 t1
left join test2 t2
on 1=1) group by system1 having dist=min(dist);
;
quit;
There was some talk that the left join on 1=1 is a bit silly (as full outter join would suffice, or something.) However this worked for the problem in question.