Split string in SAS by keeping the 0 lead values

My dataset has a single Subject Code column, and I want to split it into site and subj so that it looks like this:
Subject Code site subj
0156 00062 156 62
0156 00062 156 62
0047 00032 47 32
0034 00066 34 66
0032 00029 32 29
.
.
My Code:
if "Subject Code"n ^="" then site=input(scan("Subject Code"n,1,' '),z9.);
put site=;
if "Subject Code"n ^="" then subj=input(strip(substr((scan("Subject Code"n,-1)),1,4)),$4.);
put subj=;
The output I get:
site=15600062
subj=1560
As you can see SAS takes out the leading 0 values and the space " ", because of which it's difficult to split.

You might be overcomplicating this. Try:
length site subj 8; * declare the variables as numeric;
site = input (scan ('Subject Code'n,1), 8.);
subj = input (scan ('Subject Code'n,2), 8.);
The variables will need a z format if you want the values to be displayed with leading zeros when rendered in viewers or proc output.
format site z4.;
format subj z5.;
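As a self-contained sketch of the approach above, the following step reads the combined value and splits it into two numeric variables. The two-row test data set and the VALIDVARNAME=ANY option are assumptions added for illustration, since the original data set was not posted.

```sas
/* Assumption: VALIDVARNAME=ANY is needed so that the        */
/* 'Subject Code'n name literal (with a blank) is accepted.  */
options validvarname=any;

/* Hypothetical test data: one 10-character column */
data have;
  infile datalines truncover;
  input 'Subject Code'n $10.;
  datalines;
0156 00062
0047 00032
;

data want;
  set have;
  length site subj 8;                               /* numeric variables */
  site = input(scan('Subject Code'n, 1, ' '), 8.);  /* first word  */
  subj = input(scan('Subject Code'n, 2, ' '), 8.);  /* second word */
  put site= subj=;                                  /* write both values to the log */
run;
```

Because site and subj are numeric, the leading zeros are not stored. They only reappear if a z format such as the ones above is attached.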

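data have;
input subjectcode & $10.; /* & lets the value contain a single embedded blank */
datalines;
0156 00062
0156 00062
0047 00032
0034 00066
0032 00029
;
data want;
set have;
/* \d+ rather than [1-9]+ so that values with inner zeros (e.g. 0100) also match */
/* prxchange returns character values, so wrap it in input(..., 8.) if numerics are needed */
site=prxchange('s/0*(\d+) 0*(\d+)/$1/', -1, subjectcode);
subj=prxchange('s/0*(\d+) 0*(\d+)/$2/', -1, subjectcode);
run;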

Related

SAS: Unable to add variable to data set

I have a data set and am trying to add four new variables using the existing ones. I keep getting an error that says the code is incomplete. I'm having trouble seeing where it is incomplete. How do I fix this?
data dataset;
input ID $
Height
Weight
SBP
DBP
WtKg = Weight/2.2;
HtCm = Height/2.4;
AveBP = DBP + (SBP - DBP)/3;
HtPolynomial = (2*Height)**2 + (1.5*Height)**3;
datalines;
001 68 150 110 70
002 73 240 150 90
003 62 101 120 80
run;
You did not end your input statement with a semicolon. input reads variables from external data (in this case, in-line data with the datalines statement). New variables are not created within input in the way you've specified.
Use input to read in the five variables of your data. After that, create new variables based on those five read-in variables:
data dataset;
input ID $
Height
Weight
SBP
DBP
;
WtKg = Weight/2.2;
HtCm = Height/2.4;
AveBP = DBP + (SBP - DBP)/3;
HtPolynomial = (2*Height)**2 + (1.5*Height)**3;
datalines;
001 68 150 110 70
002 73 240 150 90
003 62 101 120 80
;
run;
Correcting 2 errors should fix this:
Add a semicolon after the last field being read in from the datalines, which is DBP.
(A previous version of this question used the ^ symbol for exponents.) To raise a value to a power, use ** instead of ^.
For reference, see the SAS documentation on arithmetic operators.
After making the 2 corrections above I ran the revised code below without any errors.
data dataset;
input ID $
Height
Weight
SBP
DBP;
WtKg = Weight/2.2;
HtCm = Height/2.4;
AveBP = DBP + (SBP - DBP)/3;
HtPolynomial = (2*Height)**2 + (1.5*Height)**3;
datalines;
001 68 150 110 70
002 73 240 150 90
003 62 101 120 80
run;

SAS: Avoid end-of-line problem and LOST CARD

I'm working through a SAS exercise, which has data in the following format:
3496 Jerry Nelson 13960 Wilson Dr. San Diego CA 92191 40 4
3498 Scott Mason 9226 College Dr. Oak View CA 93022 95 2
3498 CA 35 3
3498 CA 35 11
3500 Michele Stone 8393 West Ct. Emeryville CA 94608 55 5
3500 CA 70 5
For each person, the data continues until the next person's name. The following code is very close to what I need, I think:
libname Ch4data '\\Client\C$\Users\m210028\Google Drive\Adrian\Self-Study\SAS\Chapter4_data';
Data Ch4data.my_donations;
Infile '\\Client\C$\Users\m210028\Google Drive\Adrian\Self-Study\SAS\Chapter4_data\Donations.dat' MISSOVER;
Array amounts(10);
Array months(10);
Input first_name $ 6 - 19
last_name $ 20 - 33
street_address $ 34 - 58
city $ 59 - 88
state_code $ 89 - 93
zip_code $ 94 - 100
amounts(1) 101 - 105 @ 106
months(1);
end = end1;
If ~(end1) Then
Do;
Input test_char $ 6-6 @;
i = 2;
Do While (0 = ANYALPHA(test_char));
Input amounts(i) 101 - 105 @ 106
months(i);
end = end1;
If ~(end1) Then Input test_char $ 6-6 @;
Else test_char = '';
i = i+1;
End;
End;
Run;
Proc Print Data = Ch4data.my_donations;
Title 'Donations to Coastal Humane Society';
Run;
The problem is that I'm getting a LOST CARD note in the log, and the last name in the file, Michele Stone, doesn't make it into the data set. I suspect my code for detecting the end-of-file is incorrect. Could someone please show me how to detect the end-of-file? The SAS documentation is not helpful.
Many thanks for your time!
[UPDATE]: Thanks to Tom's comment, I can now get the last line with the following code:
libname Ch4data '\\Client\C$\Users\m210028\Google Drive\Adrian\Self-Study\SAS\Chapter4_data';
Data Ch4data.my_donations;
Infile '\\Client\C$\Users\m210028\Google Drive\Adrian\Self-Study\SAS\Chapter4_data\Donations.dat' MISSOVER END=end1;
Array amounts(10);
Array months(10);
Input first_name $ 6 - 19
last_name $ 20 - 33
street_address $ 34 - 58
city $ 59 - 88
state_code $ 89 - 93
zip_code $ 94 - 100
amounts(1) 101 - 105 @ 106
months(1);
If ~(end1) Then
Do;
Input test_char $ 6-6 @;
i = 2;
Do While (0 = ANYALPHA(test_char));
Input amounts(i) 101 - 105 @ 106
months(i);
If ~(end1) Then Input test_char $ 6-6 @;
Else test_char = '';
i = i+1;
End;
End;
Run;
Proc Print Data = Ch4data.my_donations;
Title 'Donations to Coastal Humane Society';
Run;
Unfortunately, it's not getting the second-to-last line. For that matter, it's skipping a lot of first lines of records. Thoughts?
You are trying to combine reading and transposing. It is probably easier to read first and then transpose. In fact, you can just read each line as its own observation:
data step1;
Infile example truncover ;
Input first_name $ 6 - 19
last_name $ 20 - 33
street_address $ 34 - 58
city $ 59 - 88
state_code $ 89 - 93
zip_code $ 94 - 100
amount 101 - 105
month 106 - 110
;
if not missing(first_name) then case+1;
run;
and then apply the carry-forward of the names etc.
data step2;
update step1(obs=0) step1;
by case;
output;
run;
and then transpose.
data want;
do row=1 by 1 until(last.case);
set step2;
by case;
array months [10];
array amounts [10];
months[row]=month;
amounts[row]=amount;
end;
drop row amount month;
run;
You will need to use the line holding specifier @@ to hold the line when your name check detects the first line of the next group.
filename exercise 'c:\temp\exercise.txt';
* create file to read in;
data _null_;
file exercise;
input;
put _infile_;
datalines;
3496 Jerry Nelson 13960 Wilson Dr. San Diego CA 92191 40 4
3498 Scott Mason 9226 College Dr. Oak View CA 93022 95 2
3498 CA 35 3
3498 CA 35 11
3500 Michele Stone 8393 West Ct. Emeryville CA 94608 55 5
3500 CA 70 5
run;
* read-in the data;
* error will occur if data file has a group with more than 10 months of data;
data want;
infile exercise end=end_of_data ;
array amounts(10);
array months(10);
input first_name $ 6 - 19
last_name $ 20 - 33
street_address $ 34 - 58
city $ 59 - 88
state_code $ 89 - 93
zip_code $ 94 - 100
amounts(1) 101 - 105
@ 106 months(1);
do i = 2 by 1 while (not end_of_data);
input name_check $ 6-6 @@;
if name_check = ' ' then
input amounts(i) 101-105 @106 months(i);
else
leave; /* jump out of loop
* when control returns to top the input will be of the held line
*/
end;
run;
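To illustrate the line-hold idea on its own, here is a minimal sketch using made-up records (the rec_type layout and variable names are hypothetical, not the exercise file). A single trailing @ holds the current line for the next INPUT statement within the same iteration of the DATA step, while @@ keeps it held across iterations, which is what the answer above relies on when it leaves the loop.

```sas
/* Hypothetical layout: column 1 is a record type, H = header, D = detail */
data demo;
  infile datalines truncover;
  input rec_type $1. @;              /* trailing @: hold this line for the next INPUT */
  if rec_type = 'H' then
    input @3 name $10.;              /* re-read the same line as a header record */
  else if rec_type = 'D' then
    input @3 amount 5.;              /* re-read the same line as a detail record */
  datalines;
H Jerry
D 40
H Scott
D 95
;
run;

proc print data=demo;
run;
```

Each input line still produces one observation here. The point is only that the second INPUT statement reads from the line that the first one held, instead of moving on to the next line.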

Doing Principal Components in SAS Using a Holdout and to Score New Data

I am performing Principal Components Analysis in SAS Enterprise Guide and wish to compute factor/component scores on some holdout.
KeepCombinedLR is my primary source of truth. I have another dataset, with the exact same variables, that I would like to be scored without including it in the actual factor analyses.
proc factor data = KeepCombinedLR
simple
method = prin
priors = one
rotate = varimax reorder
mineigen = 1
nfactors = 25
out = FactorScores;
var var1--var40;
run;
data Fitness;
input Age Weight Oxygen RunTime RestPulse RunPulse @@;
datalines;
44 89.47 44.609 11.37 62 178 40 75.07 45.313 10.07 62 185
44 85.84 54.297 8.65 45 156 42 68.15 59.571 8.17 40 166
38 89.02 49.874 9.22 55 178 47 77.45 44.811 11.63 58 176
40 75.98 45.681 11.95 70 176 43 81.19 49.091 10.85 64 162
44 81.42 39.442 13.08 63 174 38 81.87 60.055 8.63 48 170
44 73.03 50.541 10.13 45 168 45 87.66 37.388 14.03 56 186
;
proc factor data=Fitness outstat=FactOut
method=prin rotate=varimax score;
var Age Weight RunTime RunPulse RestPulse;
title 'Factor Scoring Example';
run;
proc print data=FactOut;
title2 'Data Set from PROC FACTOR';
run;
proc score data=Fitness score=FactOut out=FScore;
var Age Weight RunTime RunPulse RestPulse;
run;
proc print data=FScore;
title2 'Data Set from PROC SCORE';
run;
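PROC SCORE will score your data for you. Add the SCORE option and OUTSTAT= to PROC FACTOR, then pass that OUTSTAT= data set to PROC SCORE as SCORE= and your 'holdout' data set as DATA=.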
https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_score_examples01.htm&docsetVersion=14.3&locale=en
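Applied to the question's own data, the same pattern might look like the sketch below. The names Holdout, FactOut, and HoldoutScores are hypothetical placeholders, and the sketch assumes the holdout data set contains var1--var40 in the same order as KeepCombinedLR.

```sas
/* Fit the components on the primary data only.                       */
/* SCORE + OUTSTAT= save the scoring coefficients for later use.      */
proc factor data=KeepCombinedLR
    simple
    method=prin
    priors=one
    rotate=varimax reorder
    mineigen=1
    nfactors=25
    score
    outstat=FactOut            /* hypothetical name */
    out=FactorScores;
  var var1--var40;
run;

/* Apply the saved coefficients to the holdout data.                  */
/* Holdout and HoldoutScores are assumed data set names.              */
proc score data=Holdout score=FactOut out=HoldoutScores;
  var var1--var40;
run;
```

The holdout observations never enter the factor analysis; they are only multiplied by the coefficients estimated from KeepCombinedLR. As far as I understand the defaults, PROC SCORE also standardizes the holdout variables using the means and standard deviations stored in the OUTSTAT= data set, so it is worth checking the NOSTD option in the documentation if that is not what you want.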

Create table with frequency buckets in Base SAS

Below is a sample of my dataset:
City Days
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
I would like to break down "Days" into buckets of 0-30, 30-90, 90-180, 180+, such that the "buckets" are along the x-axis of the table, and the cities are along the y-axis.
I tried using PROC FREQ, but I don't have SAS/STAT. Is there any way to do this in base SAS?
I believe this is what you want. This is most certainly a "brute force" approach, but I think it outlines the concept correctly.
data have;
length city $9;
input city dayscount;
cards;
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
;
run;
options validvarname=any; /* required for name literals such as '0-30'n */
data want;
set have;
/* else-if chain so that each value falls into exactly one bucket */
if 0 <= dayscount <= 30 then '0-30'n = dayscount;
else if 30 < dayscount <= 90 then '30-90'n = dayscount;
else if 90 < dayscount <= 180 then '90-180'n = dayscount;
else if dayscount > 180 then '180+'n = dayscount;
drop dayscount;
run;
One way to solve this problem is to use PROC FORMAT to assign each value to a bucket and then PROC TRANSPOSE to produce the desired layout:
data city_day_split;
length city $12.;
input city dayscount;
cards;
atlanta 10
tampa 95
atlanta 100
charlotte 20
charlotte 31
tampa 185
;
run;
/****Assigning the buckets****/
proc format;
value buckets
0 - <30 = '0-30'
30 - <90 = '30-90'
90 - <180 = '90-180'
180 - high = 'gte180'
;
run;
data city_day_split;
set city_day_split;
day_bucket = put(dayscount,buckets.);
run;
proc sort data=city_day_split out=city_day_split;
by city;
run;
/****Making the Buckets as columns, City as rows and daycount as Value****/
proc transpose data=city_day_split out=city_day_split_1(drop=_name_);
by city;
id day_bucket;
var dayscount;
run;
My Output:
city      | 0-30 | 90-180 | 30-90 | GTE180
Atlanta   | 10   | 100    | .     | .
Charlotte | 20   | .      | 31    | .
Tampa     | .    | 95     | .     | 185
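If the goal is a table of counts per bucket rather than the raw values, a format combined with PROC TABULATE (a Base SAS procedure) may be enough on its own. The sketch below is one possible variant, not the method used above; the bucket labels and the half-open boundaries are assumptions, so adjust them to whichever edges you actually want.

```sas
/* Assumed bucket boundaries: lower edge included, upper edge excluded */
proc format;
  value buckets
    0   -< 30  = '0-30'
    30  -< 90  = '30-90'
    90  -< 180 = '90-180'
    180 - high = '180+';
run;

data have;
  length city $9;
  input city dayscount;
  datalines;
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
;

/* Cities down the side, formatted buckets across the top, counts in the cells */
proc tabulate data=have;
  class city dayscount;
  format dayscount buckets.;   /* class levels are grouped by formatted value */
  table city, dayscount='Days' * n='' * f=8. / misstext='0';
run;
```

Unlike the PROC TRANSPOSE approach, this does not fail when a city has several values in the same bucket, because the cells hold counts instead of the individual values.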

Problems aggregating data by variable in SAS

I have data that looks like this:
ID FileSource Age MamUlt ProcDate Name
223 Facility 35 M 19591 SWEDISH
223 Facility 35 M 19592 SWEDISH
223 Facility 35 U 19592 SWEDISH
223 Facility 35 U 19593 SWEDISH
223 Non-Facility 35 M 19594 RADIA
223 Non-Facility 35 U 19594 RADIA
What I am trying to do is to combine that data (for each ID in the data set) to look like this:
ID Age MAMs ULTs SameDate
223 35 3 3 2
So, for each ID, I need the total times "M" and "U" show up and how many times they show up on the same date; twice in this sample.
Here is what I have so far:
data ImageTotals;
set ImageClaims;
by ID;
retain ID MAMs ULTs SameDate;
if first.ID then do;
MAMs = 0;
ULTs = 0;
MamDate = .;
UltDate = .;
SameDate = 0;
end;
if MamUlt = "M" then do; MAMs = MAMs + 1; MamDate = ProcDate; end;
if MamUlt = "U" then do; ULTs = ULTs + 1; UltDate = ProcDate; end;
if MamDate = UltDate and MamDate ^= . then do; SameDate = SameDate+1; end;
if last.ID;
keep ID MAMs ULTs SameDate;
run;
Any advice? This solves the count problems but not the SameDate problem (still coming up as zero for this instance).
You can use a DOW loop to do the aggregation in a data step. The data must be sorted by ID and PROCDATE. Within each date, count how many times M or U appears. Then use those per-date counts to aggregate at the ID level and to test whether both appeared on the same date. The AGE variable is simply kept, so it will have the value from the last record for that ID.
data counts ;
do until (last.id);
m=0;
u=0;
do until (last.procdate);
set imageclaims;
by id procdate;
m= sum(m,proc='M');
u= sum(u,proc='U');
end;
MAMs=sum(mams,m);
ULTs=sum(ults,u);
SameDate=sum(samedate,m and u);
end;
keep id age mams ults samedate ;
run;
I think this is probably a SQL problem (not my specialty), but since you started on a DATA step solution I took a stab at both. I also added more test data.
data ImageClaims;
input id age Proc $1. ProcDate;
cards;
223 35 M 19591
223 35 M 19592
223 35 U 19592
223 35 U 19593
223 35 M 19594
223 35 U 19594
224 35 M 19591
224 35 M 19592
224 35 M 19593
224 35 M 19593
224 35 M 19594
224 35 U 19595
225 35 M 19592
225 35 U 19592
225 35 U 19593
225 35 M 19593
225 35 M 19594
225 35 U 19594
;
run;
For the DATA step approach, create counters for MAMs, ULTs, and MAMULTs (Mam and Ult on the same day). Note that because I use a sum statement for these counters (MAMs++1), they are implicitly retained.
data ImageTotals (keep=id Age MAMs ULTs MAMULTs);
set ImageClaims;
by ID ProcDate;
retain HaveMam HaveUlt; *Count vars are implicitly retained by sum statement;
if first.ID then do;
MAMs=0; *count of mammograms;
ULTs=0; *count of ultrasounds;
MAMULTs=0; *count of mammograms and ultrasounds on same date;
end;
if first.ProcDate then do;
HaveMam=0; *indicator for have a mammogram or not on that date;
HaveUlt=0; *indicator for have an ultrasound or not on that date;
end;
if Proc='M' then do;
HaveMam=1; *set mammogram indicator (for that date);
MAMs++1; *increment counter;
end;
else if Proc='U' then do;
HaveUlt=1; *set ultrasound indicator (for that date);
ULTs++1; *increment counter;
end;
if last.ProcDate then do;
MAMULTs++(HaveMam=1 and HaveUlt=1); *increment MamUlts counter if had both on same date;
end;
if last.id;
run;
For the SQL solution, I use a subquery that counts MAMs, ULTs, and MAMULTs by ID and ProcDate, and an outer query that then sums these by ID. There is probably a better SQL solution, but I think this works.
proc sql;
create table ImageTotals as
select id
,max(age) as age /*arbitrary use of max; age is constant within id*/
,sum(MAMs) as MAMs
,sum(ULTs) as ULTs
,sum(MAMULTs) as MAMULTs
from (
select id
,procdate
,max(age) as age
,sum(Proc='M') as MAMs
,sum(Proc='U') as ULTs
,count(distinct(Proc))=2 as MAMULTs
from ImageClaims
group by id,ProcDate
)
group by id
;
quit;
proc print;
run;
The Work.ImageTotals data set I get from both steps is:
Obs id age MAMs ULTs MAMULTs
1 223 35 3 3 2
2 224 35 5 1 0
3 225 35 3 3 3
Thinking this could be solved with proc sql (count/group by) once you take Q's suggestion, unless I am misinterpreting the complexity here...was going to post some code, but will let you take a crack at it first...