I have two datasets with similar variable lists, and each has a procedure variable. I want to compare the frequency of the procedure variable across the two datasets. I created a flag in both datasets to identify the source dataset and was going to merge them, but they don't have a common identifier. How do I combine the datasets without deleting any observations? This isn't just a simple MERGE without a BY statement, right?
Currently have:
Data.a Data.b
pproc proc1_numb
70 9
71 15
77 24
80 80
81 42
83 71
86 66
87 125
121 159
125 242
Want Output:
pproc freq
9 1
15 1
24 1
42 1
66 1
70 1
71 2
77 1
80 2
81 1
83 1
86 1
87 1
121 1
125 2
159 1
242 1
If I understand your question properly, you should just concatenate the two datasets into one and rename the variable. Then you can use PROC MEANS to get the frequencies. Something like this:
data all;
set a
b(rename=(proc1_numb=pproc));
run;
proc means nway data=all noprint;
class pproc;
output out=want(drop=_type_ rename=(_freq_=freq));
run;
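As a hedged alternative to the PROC MEANS step above, the same counts can be produced with PROC FREQ. This is only a sketch: it assumes the two datasets are named a and b as in the answer, and that only the procedure variable is needed. A MERGE without a BY statement would pair the rows side by side instead of stacking them, so it would not give combined counts.

```sas
/* Stack the two datasets; the names a and b are taken from the answer above. */
data all;
  set a(keep=pproc)
      b(keep=proc1_numb rename=(proc1_numb=pproc));
run;

/* Count how often each procedure code occurs across both sources. */
proc freq data=all noprint;
  tables pproc / out=want(drop=percent rename=(count=freq));
run;
```

The OUT= dataset from PROC FREQ carries COUNT and PERCENT, so the sketch drops PERCENT and renames COUNT to match the requested freq column.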
I ran a logistic regression in SAS using the dataset shown below, but I got several warnings. I tried to identify the outliers and exclude them, then tested for multicollinearity, but I am still getting warnings.
Any advice will be greatly appreciated.
**********************************;
************** database **********;
***********************************;
data D_BP;
input BP Age Weight BSA Dur Pulse Stress;
datalines;
0 47 85.4 1.75 5.1 63 33
0 51 89.4 1.89 7 72 95
0 47 90.9 1.9 6.2 66 8
0 49 89.2 1.83 7.1 69 62
0 48 92.7 2.07 5.6 64 35
0 47 94.4 2.07 5.3 74 90
0 50 95 2.05 10.2 68 47
0 45 87.1 1.92 5.6 67 80
0 46 94.5 1.98 7.4 69 95
0 46 87 1.87 3.6 62 18
0 46 94.5 1.9 4.3 70 12
0 48 90.5 1.88 9 71 99
1 49 94.2 2.1 3.8 70 14
1 49 95.3 1.98 8.2 72 10
1 50 94.7 2.01 5.8 73 99
1 48 99.5 2.25 9.3 71 10
1 49 99.8 2.25 2.5 69 42
1 49 94.1 1.98 5.6 71 21
1 52 101.3 2.19 10 76 98
1 56 95.7 2.09 7 75 99
;
run;
****** do logistic regression **********;
Proc logistic data=work.D_bp;
Model BP=Age Weight BSA Dur Pulse Stress;
Run;
**** identify outlier *********;
proc reg data=work.D_bp plots(only label)=(RStudentByLeverage CooksD);
model BP=Age Weight BSA Dur Pulse Stress ;
run;
**** After removing outliers ==> assess multicollinearity*********;
**** assessing multicollinearity by 2 ways *********;
proc corr data=work.D_bp ;
Var Age Weight BSA Dur Pulse Stress;
run;
proc reg data=work.D_bp plots;
Model BP=Age Weight BSA Dur Pulse Stress/Collin vif tol;
run;
****** repeat logistic regression after excluding weight **********;
Proc logistic data=work.D_bp;
Model BP=Age BSA Dur Pulse Stress;
Run;
WARNING: There is a complete separation of data points. The maximum likelihood estimate does
not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are
based on the last maximum likelihood iteration. Validity of the model fit is
questionable.
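The separation warning says that some combination of the predictors splits BP=0 from BP=1 perfectly in these 20 rows, so removing outliers or dropping a collinear predictor will not necessarily make it go away. One commonly used option is Firth's penalized likelihood, which PROC LOGISTIC offers through the FIRTH option on the MODEL statement. The sketch below is an assumption about what might help here, not a verified fix for this dataset; event='1' assumes BP=1 is the outcome of interest.

```sas
/* Firth penalized likelihood; a sketch, not a verified fix for this data. */
proc logistic data=work.D_bp;
  /* event='1' is an assumption: it models the probability that BP = 1. */
  /* clparm=pl requests profile-likelihood confidence limits.           */
  model BP(event='1') = Age Weight BSA Dur Pulse Stress / firth clparm=pl;
run;
```

With only 20 observations and six predictors, reducing the number of predictors may also be worth considering.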
I need help restructuring the data. My Table looks like this
NameHead Department Per_test Per_Delta Per_DB Per_Vul
Nancy Health 55 33.2 33 63
Jim Air 25 22.8 23 11
Shu Water 26 88.3 44 12
Dick Electricity 77 55.9 66 10
Elena General 88 22 67 9
Nancy Internet 66 12 44 79
And I want my table to look like this
NameHead Nancy Jim Shu Dick Elena Nancy
Department Health Air Water Electricity General Internet
Per_test 55 25 26 77 88 66
Per_Delta 33.2 22.8 88.3 55.9 22 12
Per_DB 33 23 44 66 67 44
Per_Vul 63 11 12 10 9 79
I tried PROC TRANSPOSE but couldn't get the desired result. Please help!
Thanks!
PROC TRANSPOSE does exactly what you want. You must include a VAR statement if you want to include the character variables.
proc transpose data=have out=want;
var _all_;
run;
Note that you cannot have variables that do not have names. Here is what the dataset looks like.
Obs _NAME_ COL1 COL2 COL3 COL4 COL5 COL6
1 NameHead Nancy Jim Shu Dick Elena Nancy
2 Department Health Air Water Electricity General Internet
3 Percent_test 55 25 26 77 88 66
4 Percent_Delta 33.2 22.8 88.3 55.9 22 12
5 Percent_DB 33 23 44 66 67 44
6 Percent_Vul 63 11 12 10 9 79
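For anyone who wants to reproduce this, here is a minimal sketch that builds the table from the question and transposes it. The dataset name have and the informat lengths are assumptions. Because character and numeric variables are transposed together, the numeric values end up as character strings in COL1 to COL6.

```sas
/* Build the example table; the $12. length for Department is an assumption. */
data have;
  input NameHead $ Department :$12. Per_test Per_Delta Per_DB Per_Vul;
  datalines;
Nancy Health 55 33.2 33 63
Jim Air 25 22.8 23 11
Shu Water 26 88.3 44 12
Dick Electricity 77 55.9 66 10
Elena General 88 22 67 9
Nancy Internet 66 12 44 79
;
run;

/* VAR _ALL_ includes the character variables as well as the numeric ones. */
proc transpose data=have out=want;
  var _all_;
run;
```

An ID statement on NameHead would not work here, since Nancy appears twice and ID values must be unique within a BY group.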
I am trying to find the days that match a given reference number of days, or else the number of days closest to the reference days.
I have coded this far, but I am not sure how to go forward.
ID Date ref_days lags total_days
1 2017-02-02 224 . 0
1 2017-02-02 224 84 84
1 2017-02-02 224 84 168
2 2015-01-21 213 300 388
3 2016-02-12 560 95 .
3 2016-02-12 560 86 181
3 2016-02-12 560 82 263
3 2016-02-12 560 69 332
3 2016-02-12 560 77 409
So now I want to bring out the last value close to the reference days,
and the next total_days should start from zero again to find the next window. How can I do this?
Here is the code that I wrote:
data want;
do until (totaldays <= ref_days);
set have;
by ID ref_days notsorted;
if first.id then totaldays=0;
else totaldays+lags;
end;
run;
Required Output:
ID Date ref_days lags total_days
1 2017-02-02 224 . 0
1 2017-02-02 224 84 84
1 2017-02-02 224 84 168
2 2015-01-21 213 300 388
3 2016-02-12 560 95 .
3 2016-02-12 300 86 181
3 2016-02-12 300 82 263
3 2016-02-12 300 69 .
3 2016-02-12 300 77 146
A while ago I did something similar with PROC SQL. It calculates all the pairwise distances and keeps the closest one. It works with moderately sized datasets. Hopefully it is of some use.
proc sql;
select * from
(
select *,
abs(t1.link-t2.link) as dist /* in your case these would be the date variables */
from test1 t1
left join test2 t2
on 1=1
)
group by system1
having dist=min(dist);
quit;
There was some talk that the left join on 1=1 is a bit silly (as a full outer join would suffice, or something). However, this worked for the problem in question.
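Pairing every row of one table with every row of the other is what a cross join does, so the same idea can be written without the 1=1 condition. This is a sketch under assumptions: it reuses the table and column names from the query above, assumes system1 comes from test1, and the output names closest, link1 and link2 are made up for illustration.

```sas
proc sql;
  /* closest, link1 and link2 are hypothetical names chosen for this sketch. */
  create table closest as
  select t1.system1,
         t1.link as link1,
         t2.link as link2,
         abs(t1.link - t2.link) as dist
  from test1 as t1 cross join test2 as t2
  group by t1.system1
  /* Keep, for each system1, the pairing(s) with the smallest distance. */
  having abs(t1.link - t2.link) = min(abs(t1.link - t2.link));
quit;
```

The cross join still produces every pair before filtering, so the size limits mentioned above apply here too. SAS will note in the log that it remerges summary statistics with the original data, which is expected for this kind of HAVING clause.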
DATA OZONE;
INPUT MONTH $ STMF YKRS @@;
CARDS;
A 80 66 A 68 82 A 24 47 A 24 28 A 82 44 A 100 55
A 55 34 A 91 60 A 87 70 A 64 41 A . 67 A . 127 A 170 96 A . 56
JN 215 93 JN 230 106 JN . 49 JN 69 64 JN 98 83 JN 125 97
JN 72 51 JN 125 75 JN 143 104 JN 192 107 JN . 56 JN 122 68
JN 32 20 JN 23 35 JN 71 30 JN 38 31 JN 136 81 JN 169 119
JL 152 76 JL 201 108 JL 134 85 JL 206 96 JL 92 48 JL 101 60
JL 133 . JL 83 50 JL . 27 JL 60 37 JL 124 47 JL 142 71
JL 75 49 JL 103 59 JL . 53 JL 46 25 JL 68 45 JL . 78
S 38 23 S 80 50 S 80 34 S 99 58 S 71 35 S 42 24 S 52 27 S 33 17
;
run;
Proc Ttest data=Ozone PLOT=NONE ALPHA=0.01;
Where MONTH='JN';
Paired STMF*YKRS;
Run;
Question 2
Data Baseball;
Input ba league;
Datalines;
276 National League
288 National League
281 National League
290 National League
303 National League
257 American League
254 American League
263 American League
261 American League
Run;
Proc Ttest data=Baseball ALPHA=0.02 ;
Question 3
Proc Ttest data=ozone ALPHA=0.01 Plot=NONE;
Where Month='A'-'S';
Paired STMF*YKRS;
Run;
Question 2: Test to see if both leagues have different batting averages. Use an alpha = 0.02 in your conclusions and compute a 98% confidence interval for the means.
Question 3: From the first question, test to see the differences between the A and S average ozone values. Use alpha = 0.01 in your conclusions. Include a 99% confidence interval for the difference.
So my question about Question 2 is kind of a stupid one, but for whatever reason I am confused as to what you are supposed to do.
For Question 3 (my main question), how do I use one PROC TTEST to check the differences between months A and S? I tried using a WHERE statement, as you can see above, but of course that does not work, and I'm a bit stumped on where to go from here. Also, I omitted a good bit of the month data in the ozone portion, as I couldn't properly format all of the data without it looking extremely confusing.
Thanks for your help in advance!
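One possible reading of both questions is a two-sample t test, which PROC TTEST runs when given a CLASS variable with exactly two levels. The sketch below is an assumption about what the assignment wants, not a confirmed answer. In particular, the question does not say which ozone variable to compare, so both are listed. The & modifier lets league contain single blanks; the $20. length is an assumption.

```sas
/* Question 2: read league as text so it can serve as the CLASS variable. */
data Baseball;
  input ba league & $20.;
  datalines;
276 National League
288 National League
281 National League
290 National League
303 National League
257 American League
254 American League
263 American League
261 American League
;
run;

/* ALPHA=0.02 gives 98% confidence limits. */
proc ttest data=Baseball alpha=0.02;
  class league;
  var ba;
run;

/* Question 3: keep only months A and S, then compare the two groups. */
proc ttest data=Ozone alpha=0.01 plots=none;
  where MONTH in ('A', 'S');
  class MONTH;
  var STMF YKRS;
run;
```

The WHERE statement with the IN operator keeps both months, which the expression 'A'-'S' cannot do. PAIRED compares two variables within the same rows, so it does not compare months.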
I am working through The Little SAS Book. Below is code from the book, along with the raw data. When I run it, the final dataset keeps missing the value at the end of each line: it misses 75 and 56 and reads them as missing ("."). Could anyone point out where the problem might be? When I add spaces after 75 and 56 at the line ends, the problem goes away.
DATA class;
INFILE 'c:\MyRawData\Scores.dat';
INPUT Score @@;
RUN;
PROC UNIVARIATE DATA = class;
VAR Score;
TITLE;
RUN;
Data in that file:
56 78 84 73 90 44 76 87 92 75
85 67 90 84 74 64 73 78 69 56
87 73 100 54 81 78 69 64 73 65
After running it, the output looks more like:
56 78 84 73 90 44 76 87 92 .
85 67 90 84 74 64 73 78 69 .
87 73 100 54 81 78 69 64 73 65
My suspicion is that you have something wrong with your ends of lines; either you have a spurious character, or your end of line isn't correct in some fashion. Most likely you are using a Windows file and you are running on Unix, so you have
75CRLF85
and since Unix uses only LF for line terminator, it sees "75CR" endofline "85", not "75" endofline "85" like it should.
In that case you can either do what you did - add a space, though that likely will still leave some 'blank' records in there - or use TERMSTR in your infile statement to tell SAS how to properly read the file in.
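As a sketch of the TERMSTR route, assuming the file really does have Windows (CRLF) line endings and the path from the question:

```sas
data class;
  /* TERMSTR=CRLF tells SAS that each line ends with carriage return + line feed. */
  infile 'c:\MyRawData\Scores.dat' termstr=crlf;
  input Score @@;
run;
```

If the hex dump shows something other than 0D at the ends of the lines, this option will not help and the stray character has to be handled another way.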
Otherwise, you may have some spurious end characters - for example, if you pasted this from the web, it's possible you have a non-breaking space that is not converted to a regular space.
You can find out by doing this:
data _null_;
infile 'c:\rawdata\myfile.dat';
input @;
put _infile_ $HEX60.;
run;
The 60 is 2x the length of the line. That tells you what SAS is seeing. What you should see:
3536203738203834203733203930203434203736203837203932203735
3835203637203930203834203734203634203733203738203639203536
383720373320313030203534203831203738203639203634203733203635
Digits in ASCII are 30+digit, so 35 is a 5, 36 is a 6, etc. Space is 20. The first line:
35|36|20|37|38|20|38|34|20|37|33|20| ...
so 5 6 space 7 8 space 8 4 space 7 3 space. If you see something else after the 37 35, then you know there is a problem. You might see any of the following:
0A = Line feed.
0D = Carriage return.
A0 = Nonbreaking (web) space.
There are lots of other things you could see, but those are the most likely to trip you up. Pasting from the web is often a problem.
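Rather than reading the hex dump by eye, the check can be automated. This is a minimal sketch, assuming the path from the question, that writes a note to the log for each line containing a carriage return or a nonbreaking space byte:

```sas
data _null_;
  infile 'c:\MyRawData\Scores.dat';
  input;
  /* _N_ matches the line number here because each iteration reads one line. */
  if index(_infile_, '0D'x) > 0 then put 'Carriage return on line ' _n_;
  if index(_infile_, 'A0'x) > 0 then put 'Nonbreaking space on line ' _n_;
run;
```

If nothing is written to the log, the problem is probably one of the other characters, and the hex dump is the way to find it.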