Transposing dataset by creating variables from substr with inconsistent pattern - sas

I have a dataset that I need to transpose according to a specific design; it involves extracting substrings from columns that follow an inconsistent pattern.
The original dataset (note: this is just a mock-up dataset; in reality, there are 75 NT variables):
data have;
input ID $ 1 NT2 NT3 NT4 NT5 NT6 ;
cards;
1NOTES 12:13:44 03-16-2018 CODE: ABC AML NOTES 09:13:11 03-12-2018 CODE: OPI TEST
2NOTES 04:25:09 01-04-2018 CODE: FDS IMD NOTES 03:25:10 01-09-2018 CODE: FGH TEST
3NOTES 12:22:49 11-12-2018 CODE: DGH TESTNOTES 08:02:49 11-11-2018 CODE: LKO AML
4NOTES 22:02:21 01-14-2018 CODE: MKL TESTNOTES 07:02:21 01-10-2018 CODE: LOP IMD
5NOTES 09:01:36 01-23-2018 CODE: HJK TESTNOTES 09:01:56 01-23-2018 CODE: UIY TEST
6NOTES 11:01:06 01-20-2018 CODE: LPO IMD TEST NOTES 10:01:30 01-24-2018 CODE: KLO AML
;
run;
Desired output, transposed by ID to split out time, date, code and notes:
ID time date code notes
1 12:13:44 03-16-2018 ABC AML
1 09:13:11 03-12-2018 OPI TEST
2 04:25:09 01-04-2018 FDS IMD
2 03:25:10 01-09-2018 FGH TEST
3 12:22:49 11-12-2018 DGH TEST
3 08:02:49 11-11-2018 LKO AML
4 22:02:21 01-14-2018 MKL TEST
4 07:02:21 01-10-2018 LOP IMD
5 09:01:36 01-23-2018 HJK TEST
5 09:01:56 01-23-2018 UIY TEST
6 11:01:06 01-20-2018 LPO IMD/TEST
6 10:01:30 01-24-2018 KLO AML
The following is my code. It runs indefinitely at the DO WHILE loop:
data want;
set have;
attrib notes length=$50;
array _nt{*} nt:;
do i = 1 to dim(_nt) ;
if not missing(_nt(i)) and index(left(_nt[i]), ' NOTES') then do;
timestamp=input(scan(_nt(i), 3, " "), time8.0);
date=input(scan(_nt(i), 4, " "), mmddyy10.);
code = substr(_nt[i], index(left(_nt[i]), 'CODE:')+9);
end;
/*the while loop is used to concatenate notes that immediately follow the other, but it is running indefinitely*/
do while(index(left(_nt[i+1]), ' NOTES')=0);
notes = catx('/',notes, _nt(i+1));
end;
output;
end;
drop i nt:;
format timestamp time8. date mmddyy10.;
run;
Note: This is an add-on question to my previous post:
Transposing dataset by creating variables from substr

First, let's make a data step that actually works to create some sample data.
While we are at it, let's modify the data to cover more complex cases: ID=2 now has only one observation, and the first observation for ID=5 has no strings before the second "NOTES ..." string.
data have;
infile cards dsd truncover;
input ID $ (NT2-NT6) (:$100.);
cards;
1,NOTES 12:13:44 03-16-2018 CODE: ABC,AML,NOTES 09:13:11 03-12-2018 CODE: OPI,TEST
2,NOTES 04:25:09 01-04-2018 CODE: FDS,IMD
3,NOTES 12:22:49 11-12-2018 CODE: DGH,TEST,NOTES 08:02:49 11-11-2018 CODE: LKO,AML
4,NOTES 22:02:21 01-14-2018 CODE: MKL,TEST,NOTES 07:02:21 01-10-2018 CODE: LOP,IMD
5,NOTES 09:01:36 01-23-2018 CODE: HJK,NOTES 09:01:56 01-23-2018 CODE: UIY,TEST
6,NOTES 11:01:06 01-20-2018 CODE: LPO,IMD,TEST,NOTES 10:01:30 01-24-2018 CODE: KLO,AML
;
Now let's introduce a second DO loop that increments the same counter to look for the following strings to be concatenated. When that inner loop stops, index already points at the next "NOTES ..." string, and the outer loop's own increment would skip it, so we subtract one from index after the OUTPUT.
data want ;
set have;
array nt nt: ;
length time date 8 code $10 notes $100 ;
format time tod8. date yymmdd10. ;
do index=1 to dim(nt);
time=input(scan(nt[index],2,' '),time8.);
date=input(scan(nt[index],3,' '),mmddyy10.);
code=scan(nt[index],-1,' ');
do index=index+1 to dim(nt) while((nt[index] ^=:'NOTES'));
notes=catx('/',notes,nt[index]);
end;
output;
index=index-1;
notes=' ';
end;
drop index nt: ;
run;
Results:
Obs ID time date code notes
1 1 12:13:44 2018-03-16 ABC AML
2 1 09:13:11 2018-03-12 OPI TEST
3 2 04:25:09 2018-01-04 FDS IMD
4 3 12:22:49 2018-11-12 DGH TEST
5 3 08:02:49 2018-11-11 LKO AML
6 4 22:02:21 2018-01-14 MKL TEST
7 4 07:02:21 2018-01-10 LOP IMD
8 5 09:01:36 2018-01-23 HJK
9 5 09:01:56 2018-01-23 UIY TEST
10 6 11:01:06 2018-01-20 LPO IMD/TEST
11 6 10:01:30 2018-01-24 KLO AML
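The WHILE condition in the step above relies on the colon modifier: with =: or ^=: the longer operand is truncated to the length of the shorter one before the comparison, so only the leading characters are tested. A minimal sketch of that prefix test in isolation (the names s and is_header are made up for illustration):

```sas
data _null_;
  length s $40;
  /* one header-style string, one plain note, and one blank value */
  do s = 'NOTES 12:13:44 03-16-2018 CODE: ABC', 'AML', ' ';
    is_header = (s =: 'NOTES');  /* 1 when s begins with NOTES, otherwise 0 */
    put s= is_header=;
  end;
run;
```

The blank value does not count as a header, which is why the inner loop keeps running across empty trailing NT variables until it reaches dim(nt).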

Related

Needing to retain Lab category tests based on individual positive test result

Hello, this is a sample of my data (there is an additional column, LBCAT = URINALYSIS, for that panel of tests).
I've been asked to include only the panels of tests where LBNRIND is populated for at least one of the tests, and to remove the rest. Some subjects have multiple test results at different visit time points and others have only one. I can't use a simple where LBNRIND ne '' in the data step, because I need the entire panel of urinalysis tests and not just that particular test result. What would be the best approach here? I think transposing the data would be too messy, but maybe I could put the variables in an array or macro and use a DO loop over the panel of tests?
Update: I've tried this code, but it doesn't keep the corresponding tests for the panels where lb_nrind is populated. Applying sum(lb_nrind > '') > 0 gives the same result as applying lb_nrind > '' directly in the HAVING clause.
*proc sql;
*create table want as
select * from labUA
group by ptno and day and lb_cat
having sum(lb_nrind > '') > 0 ;
data want2;
do _n_ = 1 by 1 until (last.ptno);
set labUA;
by ptno period day hour ;
if not flag_group then flag_group = (lb_nrind > '');
end;
do _n_ = 1 to _n_;
set want;
if flag_group then output;
end;
drop flag_group; run;*
You can use a SQL HAVING clause to retain the rows of a group that meets some aggregate condition. In your case, the group might be identified by a patient ID and a panel ID, and the condition is that at least one LBNRIND in the group is not null.
Example:
Consider this example, where a group of rows is kept only if at least one row in the group meets the criterion result7=77.
Both code blocks use the SAS feature that a logical evaluation is 1 for true and 0 for false.
SQL
data have;
infile datalines missover;
input id test $ parm $ result1-result10;
datalines;
1 A P 1 2 . 9 8 7 . . . .
1 B Q 1 2 3
1 C R 4 5 6
1 D S 8 9 . . . 6 77
1 E T 1 1 1
1 F U 1 1 1
1 G V 2
2 A Z 3
2 B K 1 2 3 4 5 6 78
2 C L 4
2 D M 9
3 G N 8
4 B Q 7
4 D S 6
4 C 1 1 1 . . 5 0 77
;
proc sql;
create table want as
select * from have
group by id
having sum(result7=77) > 0
;
quit;
DOW Loop
data want;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if not flag_group then flag_group = (result7=77);
end;
do _n_ = 1 to _n_;
set have;
if flag_group then output;
end;
drop flag_group;
run;
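Applied to the table in the question, the same idea might look like the sketch below. It assumes the names from the asker's attempt (labUA, ptno, day, lb_cat, lb_nrind) and that one panel is identified by patient, day and category. Note that GROUP BY items are separated by commas; joining them with AND forms a single logical expression instead of three grouping columns.

```sas
proc sql;
  create table want as
  select *
  from labUA
  /* assumption: one panel = one patient, one day, one lab category */
  group by ptno, day, lb_cat
  /* not missing() is 1 or 0, so the sum counts populated rows per panel */
  having sum(not missing(lb_nrind)) > 0;
quit;
```

Because the query selects non-aggregated columns alongside a summary function, SAS writes a note to the log that it is remerging summary statistics with the original data; that is expected here.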

List frequency of presence of each variable using loop in SAS

I tried some solutions already here and I am still unable to get a desired output.
The data I have is given below (ID is unique):
data have;
input id code_1 code_2 code_3 code_4 randa randb randc$;
datalines;
19736 1 0 1 0 5.5 10 11
19737 0 0 0 1 2 4.8 19
19738 1 0 1 1 6 9 2.6
19739 1 1 0 1 1.6 7 8.5
;;;;;
run;
I need to get the frequency of only the presence (value 1) of each code (code_1, code_2, etc.).
The desired output:
Variable Frequency
code_1 3
code_2 1
code_3 2
code_4 3
I tried the solution in this and the code is given below:
ods output onewayfreqs=preds;
proc freq data=have;
tables _all_;
run;
ods output close;
proc tabulate data=preds;
class table frequency;
tables table,frequency;
run;
Output:
                  Frequency
                  1    2    3
                  N    N    N
Table
Table code_1      1    .    1
Table code_2      1    .    1
Table code_3      .    2    .
Table code_4      1    .    1
Table id          4    .    .
Table randa       4    .    .
Table randb       4    .    .
Table randc       4    .    .
I also tried the code below:
proc freq data=have order=freq;
array codes code_:;
do _n_ = 1 to dim(codes);
table codes(_n_)/list missing out=var1_freq;
end;
run;
But I do not know how to write the code properly.
I am getting output for the code below (only for one code at a time):
proc freq data=have order=freq ;
tables code_1/list missing out=var1_freq;
run;
But how to get for multiple codes? Many thanks for your help..!
The out= option for the tables statement will only produce output for the last variable listed, so you won't get all 4 codes.
You can count the 1 valued code_* variables after transposition.
data have;
input id code_1 code_2 code_3 code_4 randa randb randc $ ;
datalines;
19736 1 0 1 0 5.5 10 11
19737 0 0 0 1 2 4.8 19
19738 1 0 1 1 6 9 2.6
19739 1 1 0 1 1.6 7 8.5
;
data idcodes / view=idcodes;
set have;
array codes code_1-code_4;
do _n_ = 1 to dim (codes);
variable = vname(codes(_n_));
flag = codes(_n_);
output;
end;
keep id variable flag;
run;
proc freq data=idcodes;
where flag;
table variable / out=freqs(keep=variable count);
run;
Presuming codes are only 0/1, you could also sum the codes and transpose the result.
proc means noprint data=have;
var code_:;
output out=flagsum sum=;
run;
proc transpose data=flagsum out=want(rename=(_name_=variable col1=frequency));
var code_:;
run;

How to sum value from next row by group using SAS?

I want to create a column in my dataset that calculates the sum of the current row and next row for another field. There are several groups within the data, and I only want to take the sum of the next row if the next row is part of the current group. If a row is the last record for that group I want to fill with a null value.
I'm referencing "Reading next observation's value in current observation", but I still can't figure out how to obtain the solution I need.
For example:
data have;
input Group ID Salary;
cards;
10 1 1
10 2 2
10 3 2
10 4 1
11 1 2
11 2 2
11 3 1
11 4 1
;
run;
The result I want to obtain here is this:
data want;
input Group ID Salary Sum;
cards;
10 1 1 3
10 2 2 4
10 3 2 3
10 4 1 .
11 1 2 4
11 2 2 3
11 3 1 2
11 4 1 .
;
run;
Similar to Tom's answer, but using a 'look-ahead' merge (without a by statement, and firstobs=2) :
data want ;
merge have
have (firstobs=2
keep=Group Salary
rename=(Group=NextGroup Salary=NextSalary)) ;
if Group = NextGroup then sum = sum(Salary,NextSalary) ;
drop Next: ;
run ;
Use BY group processing and a second SET statement that skips the first observation.
data want ;
set have end=eof;
by group ;
if not eof then set have (keep=Salary rename=(Salary=Sum) firstobs=2);
if last.group then Sum=.;
else sum=sum(sum,salary);
run;
I found a solution using proc expand (part of SAS/ETS) that produced what I needed:
proc sort data = have;
by Group ID;
run;
proc expand data=have out=want method=none;
by Group;
convert Salary = Next_Sal / transformout=(lead 1);
run;
data want(keep=Group ID Salary Sum);
set want;
Sum = Salary + Next_Sal;
run;
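If neither a look-ahead read nor SAS/ETS is an option, a different technique also works: sort each group in reverse order so that the "next" row becomes the previous one, then use LAG. This is only a sketch under the assumption that ID defines the order within Group; the names rev and NextSalary are made up for illustration.

```sas
/* reverse the order within each group */
proc sort data=have out=rev;
  by Group descending ID;
run;

data want;
  set rev;
  by Group;
  /* LAG is called on every row so its queue stays in step with the data */
  NextSalary = lag(Salary);
  /* the first row in reverse order is the last row of the group */
  if first.Group then Sum = .;
  else Sum = Salary + NextSalary;
  drop NextSalary;
run;

/* restore the original order */
proc sort data=want;
  by Group ID;
run;
```

The two extra sorts make this slower than the look-ahead approaches on large data, so it is mainly useful when the logic needs to stay in plain BY-group terms.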

Transposing data in SAS

I have a dataset of laboratory results. Each row corresponds to one time point of a subject (for example, row 1 is subject #1 at his first visit, row 2 is subject #1 at his second visit, ...). In each row I have the values of 5 tests (test1, test2, ...), and for each test I have, in addition to the result, two columns of reference values (normal low and high levels). I wish to transpose the data so that each row is unique for subject + visit + test, with two columns: the numerical result and the status (normal or not). I failed to transpose the data: I managed to get all tests into a long format, but I couldn't keep the reference values. How should I do it? My alternative is a set of IF statements, which is going to be very long!
This question was also posted on communities.sas.com.
The two-step process extracts the PARAMCD (lab test code) and the variable type (value and normal range limits) from the variable names. PARAMCD becomes a new row ID variable, and V, L and H are used to create new variable names when the data are transposed again into a format more or less like CDISC SDTM.
data A;
input ID Visit Group Test1 Test2 Test3 Test1_L Test1_H Test2_L Test2_H Test3_L Test3_H;
datalines;
1 1 0 5 3 6.7 1 10 2 7 3 9
1 2 0 5.5 3.8 8.7 1 10 2 7 3 6
1 3 0 4.5 2.8 5.7 1 10 3 7 3 6
2 1 1 5 3 6.7 1 10 2 7 3 9
2 2 1 5.5 3.8 8.7 1 10 2 7 3 9
2 3 1 4.5 2.8 5.7 1 10 2 7 3 9
;;;;
run;
proc print;
run;
proc transpose data=a out=b;
by id visit group;
run;
data b;
set b;
length paramcd $8 namecd $1;
call scan(_name_,1,p,l,'_');
paramcd = substrn(_name_,p,l);
namecd = coalesceC(substrn(_name_,p+l+1),'V');
drop p l _name_;
run;
proc sort data=b;
by id visit group paramcd;
run;
proc format;
value $namecd 'V'='Value' 'H'='High' 'L'='Low';
run;
proc transpose data=b out=c(drop=_name_);
by id visit group paramcd;
id namecd;
format namecd $namecd.;
var col1;
run;
data c;
set c;
length RangeFL $1;
if n(low,value) eq 2 and value lt low then RangeFL='L';
else if n(high,value) eq 2 and value gt high then RangeFL='H';
else RangeFL='N';
run;
proc print;
run;
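The same long layout can also be reached in a single data step with parallel arrays, which is a different technique from the double transpose above. The sketch below assumes the three tests and the naming pattern of dataset A (TestN, TestN_L, TestN_H); the dataset name c_alt and the array names are made up for illustration.

```sas
data c_alt;
  set A;
  /* parallel arrays: element i of each array belongs to the same test */
  array v[3]  Test1-Test3;
  array lo[3] Test1_L Test2_L Test3_L;
  array hi[3] Test1_H Test2_H Test3_H;
  length paramcd $8 RangeFL $1;
  do i = 1 to dim(v);
    paramcd = vname(v[i]);   /* Test1, Test2, Test3 */
    Value   = v[i];
    Low     = lo[i];
    High    = hi[i];
    /* same range flag logic as in the transpose-based answer */
    if n(Low, Value) = 2 and Value < Low then RangeFL = 'L';
    else if n(High, Value) = 2 and Value > High then RangeFL = 'H';
    else RangeFL = 'N';
    output;
  end;
  keep ID Visit Group paramcd Value Low High RangeFL;
run;
```

The trade-off is that the array definitions have to be edited when tests are added, whereas the double transpose picks up new variables from their names.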

Averaging Panel Data in SAS

I have a panel data set that looks like this:
ID Usage month
1234 2 -2
1234 4 -1
1234 3 1
1234 2 2
2345 5 -2
2345 6 -1
2345 3 1
2345 6 2
Obviously there are more ID variables and usage data, but this is the general form. I want to average the usage data when the month column is negative, and when it is positive for each ID. In other words for each unique ID, average the usage for negative months and for positive months. My goal is to get something like this.
ID avg_usage_neg avg_usage_pos
1234 3 2.5
2345 5.5 4.5
Here are a few options for you.
First create the test data:
data sample;
input ID
Usage
month;
datalines;
1234 2 -2
1234 4 -1
1234 3 1
1234 2 2
2345 5 -2
2345 6 -1
2345 3 1
2345 6 2
;
run;
Here's an SQL solution:
proc sql noprint;
create table result as
select id,
avg(ifn(month < 0, usage, .)) as avg_usage_neg,
avg(ifn(month > 0, usage, .)) as avg_usage_pos
from sample
group by 1
;
quit;
Here's a datastep / proc means solution:
data sample2;
set sample;
usage_neg = ifn(month < 0, usage, .);
usage_pos = ifn(month > 0, usage, .);
run;
proc means data=sample2 noprint missing nway;
class id;
var usage_neg usage_pos;
output out=result2 mean=;
run;