Reading Messy SAS Data - sas

How can I read this -
C 303 102 140 B 293 C 399 B 450 233 456
450 A 289 282 555
like this -
Group Score
C 303
C 102
C 140
B 293
C 399
B 450
B 233
B 456
B 450
A 289
A 282
A 555
in SAS? I have tried the @'character' column pointer, which I can't seem to get right. This is the code so far :(
data OUTCOMES;
infile 'testscores.txt';
input @'C' SCORES;
run;

Coo:
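The double trailing at sign (@@) for held input looks good [pun intended] for scanning input across line boundaries: it holds the input line across DATA step iterations and moves on to the next line only when the current one is exhausted.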
Construct an example external data file:
filename haveFile temp;
data _null_;
file haveFile;
put " C 303 102 140 B 293 C 399 B 450 233 456";
put "450 A 289 282 555";
run;
Read from the file, one token at a time. A token that cannot be read as a number becomes the current group; every numeric token is output as a score for that group.
data have ;
  attrib
    token group length=$10
    score length=8
  ;
  retain group;
  infile haveFile ;
  input token @@;
  /* check if token can be interpreted as a number;
     the ?? modifier prevents errors and notes in the log */
  score = input (token, ?? 12.);
  if missing (score) and token ne '.' then
    group = token;
  else
    output;
run;

Related

SAS: Unable to add variable to data set

I have a data set and am trying to add four new variables using the existing ones. I keep getting an error that says the code is incomplete. I'm having trouble seeing where it is incomplete. How do I fix this?
data dataset;
input ID $
Height
Weight
SBP
DBP
WtKg = Weight/2.2;
HtCm = Height/2.4;
AveBP = DBP + (SBP - DBP)/3;
HtPolynomial = (2*Height)**2 + (1.5*Height)**3;
datalines;
001 68 150 110 70
002 73 240 150 90
003 62 101 120 80
run;
You did not end your input statement with a semicolon. input reads variables from external data (in this case, in-line data with the datalines statement). New variables are not created within input in the way you've specified.
Use input to read in the five variables of your data. After that, create new variables based on those five read-in variables:
data dataset;
input ID $
Height
Weight
SBP
DBP
;
WtKg = Weight/2.2;
HtCm = Height/2.4;
AveBP = DBP + (SBP - DBP)/3;
HtPolynomial = (2*Height)**2 + (1.5*Height)**3;
datalines;
001 68 150 110 70
002 73 240 150 90
003 62 101 120 80
;
run;
Correcting 2 errors should fix this:
Add a semicolon after the last field being read in from the datalines, which is DBP.
(A previous version of this question used the ^ symbol for exponents.) Instead of ^ to raise to the power of something, use **
For reference, the SAS arithmetic operators are described in the SAS documentation.
After making the 2 corrections above I ran the revised code below without any errors.
data dataset;
input ID $
Height
Weight
SBP
DBP;
WtKg = Weight/2.2;
HtCm = Height/2.4;
AveBP = DBP + (SBP - DBP)/3;
HtPolynomial = (2*Height)**2 + (1.5*Height)**3;
datalines;
001 68 150 110 70
002 73 240 150 90
003 62 101 120 80
run;

How to create a running 3 observation average in SAS?

I have a dataset with some volumes in a column and I want to create a second column that contains the average of the previous three observations. Is this possible?
e.g.
data have;
input Vol Avg_pre_4;
datalines;
228 .
141 .
125 .
101 164.67
116 122.33
107 114
74 108
118 99
127 99.67
123 106.33
;
run;
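The LAG function is an automatic built-in queue: LAGn(Vol) returns the value Vol had n calls earlier. Take the mean of the three lagged values, then blank out the first three rows, where fewer than three prior observations exist:
VOL_AVG_OF_PRIOR3 = MEAN ( lag(Vol), lag2(Vol), lag3(Vol) );
if _n_ < 4 then VOL_AVG_OF_PRIOR3 = .;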
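Put into a complete step, the two statements above might look like the following. This is a minimal sketch, assuming the input data set is the HAVE data set from the question; the output name WANT and the helper variables prev1-prev3 are made up for illustration. The LAG calls are kept outside any IF condition so their queues are updated on every row.

```sas
data want;
    set have;
    /* Call each LAG function on every row: a LAG queue only
       advances when the function is actually executed.       */
    prev1 = lag(Vol);    /* value of Vol one row back    */
    prev2 = lag2(Vol);   /* value of Vol two rows back   */
    prev3 = lag3(Vol);   /* value of Vol three rows back */

    /* Only rows 4 and later have three prior observations;
       earlier rows are left missing.                         */
    if _n_ > 3 then Vol_Avg_Of_Prior3 = mean(prev1, prev2, prev3);

    drop prev1 prev2 prev3;   /* hypothetical helper variables */
run;
```

Note that MEAN ignores missing arguments, so without the `_n_` check the first rows would get an average of fewer than three values rather than a missing value.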

Problems aggregating data by variable in SAS

I have data that looks like this:
ID FileSource Age MamUlt ProcDate Name
223 Facility 35 M 19591 SWEDISH
223 Facility 35 M 19592 SWEDISH
223 Facility 35 U 19592 SWEDISH
223 Facility 35 U 19593 SWEDISH
223 Non-Facility 35 M 19594 RADIA
223 Non-Facility 35 U 19594 RADIA
What I am trying to do is to combine that data (for each ID in the data set) to look like this:
ID Age MAMs ULTs SameDate
223 35 3 3 2
So, for each ID, I need the total times "M" and "U" show up and how many times they show up on the same date; twice in this sample.
Here is what I have so far:
data ImageTotals;
set ImageClaims;
by ID;
retain ID MAMs ULTs SameDate;
if first.ID then do;
MAMs = 0;
ULTs = 0;
MamDate = .;
UltDate = .;
SameDate = 0;
end;
if MamUlt = "M" then do; MAMs = MAMs + 1; MamDate = ProcDate; end;
if MamUlt = "U" then do; ULTs = ULTs + 1; UltDate = ProcDate; end;
if MamDate = UltDate and MamDate ^= . then do; SameDate = SameDate+1; end;
if last.ID;
keep ID MAMs ULTs SameDate;
run;
Any advice? This solves the count problems but not the SameDate problem (still coming up as zero for this instance).
You can use a DOW loop to do the aggregation in a data step. The data must be sorted by ID and PROCDATE. The inner loop counts how many times M or U appears within one date. The outer loop then uses those per-date counts to aggregate at the ID level and to test whether both appeared on the same date. The AGE variable is simply kept, so it will have the value from the last record for that ID. (The code calls the M/U variable PROC; in the question's data it is named MamUlt.)
data counts ;
  do until (last.id);
    m=0;
    u=0;
    do until (last.procdate);
      set imageclaims;
      by id procdate;
      m= sum(m,proc='M');
      u= sum(u,proc='U');
    end;
    MAMs=sum(mams,m);
    ULTs=sum(ults,u);
    SameDate=sum(samedate,m and u);
  end;
  keep id age mams ults samedate ;
run;
I think this is probably a SQL problem (not my specialty), but since you started on a DATA step solution I took a stab at both. I also added more test data.
data ImageClaims;
input id age Proc $1. ProcDate;
cards;
223 35 M 19591
223 35 M 19592
223 35 U 19592
223 35 U 19593
223 35 M 19594
223 35 U 19594
224 35 M 19591
224 35 M 19592
224 35 M 19593
224 35 M 19593
224 35 M 19594
224 35 U 19595
225 35 M 19592
225 35 U 19592
225 35 U 19593
225 35 M 19593
225 35 M 19594
225 35 U 19594
;
run;
For the DATA step approach, create counters for MAMs, ULTs, and MAMULTs (Mam and Ult on the same day). Note that because I use the sum statement for these counters (MAMs++1), they are implicitly retained.
data ImageTotals (keep=id Age MAMs ULTs MAMULTs);
  set ImageClaims;
  by ID ProcDate;
  retain HaveMam HaveUlt; *Count vars are implicitly retained by sum statement;
  if first.ID then do;
    MAMs=0; *count of mammograms;
    ULTs=0; *count of ultrasounds;
    MAMULTs=0; *count of mammograms and ultrasounds on same date;
  end;
  if first.ProcDate then do;
    HaveMam=0; *indicator for have a mammogram or not on that date;
    HaveUlt=0; *indicator for have an ultrasound or not on that date;
  end;
  if Proc='M' then do;
    HaveMam=1; *set mammogram indicator (for that date);
    MAMs++1; *increment counter;
  end;
  else if Proc='U' then do;
    HaveUlt=1; *set ultrasound indicator (for that date);
    ULTs++1; *increment counter;
  end;
  if last.ProcDate then do;
    MAMULTs++(HaveMam=1 and HaveUlt=1); *increment MamUlts counter if had both on same date;
  end;
  if last.id;
run;
For the SQL solution I use a subquery that counts MAMs, ULTs, and MAMULTs by ID and ProcDate, and an outer query that then sums these by ID. There is probably a better SQL solution, but I think this works.
proc sql;
  create table ImageTotals as
  select id
        ,max(age) as age /*arbitrary use of max; age is constant within id*/
        ,sum(MAMs) as MAMs
        ,sum(ULTs) as ULTs
        ,sum(MAMULTs) as MAMULTs
  from (
    select id
          ,procdate
          ,max(age) as age
          ,sum(Proc='M') as MAMs
          ,sum(Proc='U') as ULTs
          ,count(distinct(Proc))=2 as MAMULTs
    from ImageClaims
    group by id,ProcDate
  )
  group by id
  ;
quit;
proc print;
run;
Work.ImageTotals I get from both steps is:
Obs    id    age    MAMs    ULTs    MAMULTs
  1    223    35      3       3        2
  2    224    35      5       1        0
  3    225    35      3       3        3
Thinking this could be solved with proc sql (count/group by) once you take Q's suggestion, unless I am misinterpreting the complexity here...was going to post some code, but will let you take a crack at it first...

SAS Proc IML rolling window (loop)

I wanted to modify the working module (the second one below) into the first one below, so that instead of using the whole sample of p from 1 to m, the module uses only the 18 values before and the 18 values after the time point x, i.e. p[x-18 ... x+18]. But I end up with an error and can't work out where the problem is. The error message, with the full log, is at the end of the post.
start mhatx2(m,p,h,pi,e);
t5=j(m,1); /*mhatx omit x=t*/
upb=m-18;
do x=19 to upb;
lo=x-18;
up=x+18;
i=T(lo:up);
temp1=x-i;
ue=Kmod(temp1,h,pi,e)#p[i];
le=Kmod(temp1,h,pi,e);
t5[x]=(sum(ue)-ue[x])/(sum(le)-le[x]);
end;
return (t5);
finish;
start mhatx2(m,p,h,pi,e);
t5=j(m,1); /*mhatx omit x=t*/
do x=1 to nrow(p);
i=T(1:m);
temp1=x-i;
ue=Kmod(temp1,h,pi,e)#p[i];
le=Kmod(temp1,h,pi,e);
t5[x]=(sum(ue)-ue[x])/(sum(le)-le[x]);
end;
return (t5);
finish;
Error message:
430 proc iml;
NOTE: IML Ready
431
432
433 EDIT kirjasto.basfraaka var "open";
434
435 read all var "open" into p;
436
437
438 m=nrow(p);
439 x=T(1:m);
440 pi=constant("pi");
441 e=constant("e");
442
443 h=0.75;
444
445 start Kmod(x,h,pi,e);
446 k=1/(h#(2#pi)##(1/2))#e##(-x##2/(2#h##2));
447 return (k);
448 finish;
NOTE: Module KMOD defined.
449 start mhatx2(m,p,h,pi,e);
450 t5=j(m,1);
450! /*mhatx omit x=t*/
451 upb=m-18;
452 do x=19 to upb;
453 lo=x-18;
454 up=x+18;
455 i=T(lo:up);
456 temp1=x-i;
457 ue=Kmod(temp1,h,pi,e)#p[i];
458 le=Kmod(temp1,h,pi,e);
459 t5[x]=(sum(ue)-ue[x])/(sum(le)-le[x]);
460 end;
461 return (t5);
462 finish;
NOTE: Module MHATX2 defined.
463
464 ptz=j(m,1);
465 ptz=mhatx2(m,p,h,pi,e);
ERROR: (execution) Invalid subscript or subscript out of range.
operation : [ at line 459 column 18
operands : ue, x
ue 37 rows 1 col (numeric)
x 1 row 1 col (numeric)
38
statement : ASSIGN at line 459 column 1
traceback : module MHATX2 at line 459 column 1
NOTE: Paused in module MHATX2.
466 print ptz;
ERROR: Matrix ptz has not been set to a value.
statement : PRINT at line 466 column 1
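It looks like this line:
t5[x]=(sum(ue)-ue[x])/(sum(le)-le[x]);
is incorrectly referencing the ue and le members. Both vectors now hold only the 37 values of the window, while x is still a row index into the full series, so ue[x] goes out of range as soon as x exceeds 37 (the log shows the failure at x = 38), and for smaller x it picks the wrong element. If you're trying to subtract out the 'current iteration' piece, then you want
t5[x]=(sum(ue)-ue[19])/(sum(le)-le[19]);
since element 19 is the 'middle' of the range (which corresponds to the current x value).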
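The fix above can be sketched as a complete module. This is only an illustration under stated assumptions: the names w and mid and the test series p = T(1:60) are made up here, and Kmod is rewritten with explicit parentheses and SQRT as what appears to be the same Gaussian kernel as in the question's log. Writing the window half-width as w means the middle index does not have to be hard-coded as 19.

```sas
proc iml;

/* Gaussian kernel with bandwidth h; explicit parentheses
   avoid relying on operator precedence.                   */
start Kmod(x, h, pi, e);
    k = (1/(h#sqrt(2#pi))) # e##(-(x##2)/(2#(h##2)));
    return (k);
finish;

start mhatx2(m, p, h, pi, e);
    w   = 18;        /* window half-width (from the question)        */
    mid = w + 1;     /* position of x inside the 2*w+1 row window     */
    t5  = j(m, 1);   /* as in the question: rows outside the loop stay 1 */
    do x = w+1 to m-w;
        i     = T((x-w):(x+w));        /* row indices of the window  */
        temp1 = x - i;
        le    = Kmod(temp1, h, pi, e); /* kernel weights, 2*w+1 rows */
        ue    = le # p[i];             /* weighted observations      */
        /* leave out the current point: it sits at index mid, not x */
        t5[x] = (sum(ue) - ue[mid]) / (sum(le) - le[mid]);
    end;
    return (t5);
finish;

/* Hypothetical test series, only to show the call */
p  = T(1:60);
m  = nrow(p);
pi = constant("pi");
e  = constant("e");
ptz = mhatx2(m, p, 0.75, pi, e);
print ptz;

quit;
```

The first and last 18 rows of the result are never assigned inside the loop, so they keep the initial value from J(m,1); whether those should be missing instead is a separate design decision.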

In the following SAS statement, what do the parameters "noobs" and "label" stand for?

In the following SAS statement, what do the parameters "noobs" and "label" stand for?
proc print data=sasuser.schedule noobs label;
Per the SAS 9.2 documentation on PROC PRINT:
"NOOBS - Suppress the column in the output that identifies each observation by number"
"LABEL - Use variables' labels as column headings"
NOOBS hides the column of observation numbers (1, 2, 3, 4, 5, ...).
Results without noobs:
my first title
Obs    name      sex    group    height    weight
1      mike      m      a        21        150
2      henry     m      b        30        140
3      norian    f      b        18        130
4      nadine    f      b        32        135
5      dianne    f      a        23        135
Results with noobs:
my first title
name      sex    group    height    weight
mike      m      a        21        150
henry     m      b        30        140
norian    f      b        18        130
nadine    f      b        32        135
dianne    f      a        23        135
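To see both options at work, a small self-contained run can help. This is a sketch only: the data set name work.demo and the label texts are made up for illustration, and the rows are the first two from the listing above.

```sas
data work.demo;                      /* hypothetical data set name */
    input name $ sex $ group $ height weight;
    /* Labels are what the LABEL option prints as column headings;
       the texts here are invented for the example.                */
    label name   = 'First Name'
          height = 'Height'
          weight = 'Weight';
    datalines;
mike m a 21 150
henry m b 30 140
;
run;

title 'my first title';

/* Default: Obs column shown, variable names as headings */
proc print data=work.demo;
run;

/* NOOBS drops the Obs column; LABEL uses the labels as headings */
proc print data=work.demo noobs label;
run;
```

Variables without a label (sex and group here) keep their names as headings even when LABEL is specified.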