Pivot long by group in SAS

I have a table in the following structure:
month filter1 filter2 account payment users
Jan 1 0 4535 564 5000
Jan 1 1 7387 787 8000
Feb 1 0 567 5 600
In SAS I am looking to convert it into the following format:
month filter1 filter2 value amount
Jan 1 0 account 4535
Jan 1 0 payment 564
Jan 1 0 users 5000
Jan 1 1 account 7387
I am looking at proc transpose but I do not have same prefixes in the values, and I will need to confirm amounts in another column. Would anyone have any suggestions? Thank you very much!

PROC TRANSPOSE will handle your example:
proc transpose data=have out=want(rename=(col1=amount)) name=value ;
by month filter1 filter2 NOTSORTED ;
var account payment users ;
run;
In this example, since you want to transpose all of the numeric variables other than the BY variables, you can just skip the VAR statement.
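As a minimal sketch of that shortcut, assuming the sample rows from the question are loaded into a dataset named have (the dataset names are only placeholders), the same step without a VAR statement would look roughly like this:

```sas
/* Sample rows typed in from the question; have and want are assumed names. */
data have;
  input month $ filter1 filter2 account payment users;
  datalines;
Jan 1 0 4535 564 5000
Jan 1 1 7387 787 8000
Feb 1 0 567 5 600
;
run;

/* No VAR statement: PROC TRANSPOSE transposes every numeric variable
   that is not named in another statement (here, the BY variables). */
proc transpose data=have out=want(rename=(col1=amount)) name=value;
  by month filter1 filter2 notsorted;
run;

proc print data=want noobs;
run;
```

If a character variable is ever added to the table, it would be left out of the default list, so an explicit VAR statement is the safer choice in that case.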

Another option is a data step transpose, but this requires all of the transposed variables to be the same type (or to be stored as character), so I can see this being problematic as well. You may want to store the type to make your life easier later on.
data want;
set have;
array _vars(*) account payment users;
do i=1 to dim(_vars);
value = vname (_vars(i));
amount = _vars(i);
output;
end;
*drop account payment users;
run;
UCLA has good tutorials on both the proc transpose option and the data step option.

Related

PROC Transpose by YearMonth and ID for duplicate values

I'm trying to transpose this table:
YearMonth ID Purchase Purchase Value
201912 1 Laptop 1000
202012 1 Computer 2000
202112 1 Phone 1000
201912 2 Stereo 500
202012 2 Headset 200
To look like this using PROC Transpose:
ID Purchase_201912 Purchase_202012 Purchase_202112 PV_201912 PV_202012 PV_202112
1 Laptop Computer Phone 1000 2000 1000
2 Stereo Headset - 500 200 -
I think I'll have to transpose multiple times to achieve this. The first transpose I've tried doing is this:
proc transpose data=query_final out=transpose_1 let;
by yearmonth agent_number;
run;
but I keep getting the error
ERROR: Data set WORK.QUERY_FINAL is not sorted in ascending sequence. The current BY group has YearMonth = 202112
and the next BY group has YearMonth = 201912.
I've checked that the data in the table I'm pulling from is indeed sorted in ascending order by YearMonth, then grouped by agent number, so I'm not sure what this error is referring to. Could it be that not all IDs have the same YearMonths associated with them (i.e. in the example above, ID 2 did not purchase anything in 2021)?
Assuming agent_number is equivalent to id in your example, I reproduced the data:
data have;
infile datalines4 delimiter="|";
input yearmonth id purchase :$8. PV;
datalines4;
201912|1|Laptop|1000
202012|1|Computer|2000
202112|1|Phone|1000
201912|2|Stereo|500
202012|2|Headset|200
;;;;
run;
You can use two proc transpose steps and then merge the results:
proc transpose data=have out=stage1(drop=_name_) prefix=Purchase_;
by id;
var purchase;
id yearmonth;
run;
proc transpose data=have out=stage2(drop=_name_) prefix=PV_;
by id;
var PV;
id yearmonth;
run;
data want;
merge stage1 stage2;
by id;
run;
Resulting in the desired output
id purchase_201912 purchase_202012 purchase_202112 PV_201912 PV_202012 PV_202112
1 Laptop Computer Phone 1000 2000 1000
2 Stereo Headset 500 200 .
PS: To avoid the error you report, sort the data first by the same variables, in the same order, as the BY statement in proc transpose. It is not needed here because the data is already sorted by id.
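A minimal sketch of that sort-then-transpose pattern, assuming the same rows arrive ordered by yearmonth instead of id (have_unsorted and have_sorted are hypothetical names):

```sas
/* Rows from the question, deliberately ordered by yearmonth rather than id. */
data have_unsorted;
  infile datalines delimiter='|';
  input yearmonth id purchase :$8. PV;
  datalines;
201912|1|Laptop|1000
201912|2|Stereo|500
202012|1|Computer|2000
202012|2|Headset|200
202112|1|Phone|1000
;
run;

/* Sort so that the leading sort keys match the BY statement used below. */
proc sort data=have_unsorted out=have_sorted;
  by id yearmonth;
run;

proc transpose data=have_sorted out=stage1(drop=_name_) prefix=Purchase_;
  by id;
  var purchase;
  id yearmonth;
run;
```

Running the transpose directly on have_unsorted with by id should raise the same "not sorted in ascending sequence" error as in the question.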

How to use MS SQL window function in SAS proc SQL

Hi, I am trying to calculate how much the customer paid in a month by subtracting their balance from the next month's balance.
I want to calculate PaidAmount for A111 in Jun-20 as Balance in Jul-20 minus Balance in Jun-20. Can anyone help, please? Thank you
For this situation there is no need to look ahead as you can create the output you want just by looking back.
data have;
input id date balance ;
informat date yymmdd10.;
format date yymmdd10.;
cards;
1 2020-06-01 10000
1 2020-07-01 8000
1 2020-08-01 5000
2 2020-06-01 10000
2 2020-07-01 8000
3 2020-08-01 5000
;
data want;
set have ;
by id date;
lag_date=lag(date);
format lag_date yymmdd10.;
lag_balance=lag(balance);
payment = lag_balance - balance ;
if not first.id then output;
if last.id then do;
payment=.;
lag_balance=balance;
lag_date=date;
output;
end;
drop date balance;
rename lag_date = date lag_balance=balance;
run;
proc print;
run;
Result:
Obs id date balance payment
1 1 2020-06-01 10000 2000
2 1 2020-07-01 8000 3000
3 1 2020-08-01 5000 .
4 2 2020-06-01 10000 2000
5 2 2020-07-01 8000 .
6 3 2020-08-01 5000 .
This is looking for a LEAD calculation, which is typically done via PROC EXPAND, but that's under the SAS/ETS license, which not many users have. Another option is to merge the data with itself, offsetting the records by one so that the next month's record is on the same line.
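data want;
* look-ahead merge: no BY statement, so each row is paired with the row after it;
merge have
have(firstobs=2 keep=clientID balance rename=(clientID=next_clientID balance=next_balance));
* do not carry a balance across a client boundary;
if clientID ne next_clientID then next_balance = .;
PaidAmount = Balance - next_balance;
drop next_clientID;
run;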
If months can be missing from your series, this is not a good approach. In that case you want to do an explicit join using SQL instead. This assumes month is stored as a SAS date.
proc sql;
create table want as
select t1.*, t1.balance - t2.balance as paidAmount
from have as t1
left join have as t2
on t1.clientID = t2.ClientID
/* joins the current month (t1) with the next month (t2) */
and intnx('month', t1.month, 1, 'b') = intnx('month', t2.month, 0, 'b');
quit;
Code is untested as no test data was provided (I won't type out your data to test code).
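For what it's worth, here is a small sketch of the SQL approach with made-up test data. The values and the gap in A111's series are invented for illustration; the column names clientID, month and balance follow the code above and are assumptions about the real table.

```sas
/* Hypothetical test data; A111 has no August row on purpose. */
data have;
  input clientID $ month :yymmdd10. balance;
  format month yymmdd10.;
  datalines;
A111 2020-06-01 10000
A111 2020-07-01 8000
A111 2020-09-01 5000
B222 2020-06-01 4000
B222 2020-07-01 3500
;
run;

proc sql;
  create table want as
  select t1.*, t1.balance - t2.balance as paidAmount
  from have as t1
  left join have as t2
    on  t1.clientID = t2.clientID
    /* t2 is the row dated exactly one calendar month after t1 */
    and intnx('month', t1.month, 1, 'b') = intnx('month', t2.month, 0, 'b')
  order by t1.clientID, t1.month;
quit;
```

With this data, paidAmount should be 2000 for A111 in June and 500 for B222 in June, and missing on every other row, including A111 in July, because there is no August record to join to.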

How can I compare many datasets update several columns based on the max value of a single column in SAS?

I have test scores from many students in 8 different years. I want to retain only the max total score of each student, but then also retain all the student-year related information to that test score (that is, all the columns from the same year in which the student got the highest total score).
An example of the datasets I have:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
output;
end;
%end;
run;
%mend;
%score;
In my expected output, I would like to retain the max of total_score for each student, and also have the other columns related to that total score. If possible, I would also like to have the information about the year in which the student got the max of total_score. An example of the expected output would be:
DATA want;
INPUT id total_score english math sciences history year;
CARDS;
1 75.4 15.4 20 20 20 2017
2 63.8 20 13.8 10 20 2016
3 48 10 10 18 10 2018
4 52 12 10 10 20 2016
5 69.5 20 19.5 20 10 2013
6 85 20.5 20.5 21 23 2011
7 41 5 12 14 10 2010
8 55.3 15 20.3 10 10 2012
9 51.5 10 20 10 11.5 2013
10 48.9 12.9 16 10 10 2015
;
RUN;
I have been trying to work with the SAS UPDATE statement, but it just gets the most recent value for each student. I want the max total score. Also, within the update framework, I can only update two tables at a time, and I would like to compare all tables at the same time. So the strategy I am trying does not work:
data want;
update score_2010 score_2011;
by id;
Thanks to anyone who can provide insights.
It is easier to obtain what you want if you have only one longitudinal dataset with all the original information of your students. It also makes more sense, since you are comparing students across different years.
To build a longitudinal dataset, you will first need to insert a variable informing the year of each of your original datasets. For example with:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
year=&year.;
output;
end;
%end;
run;
%mend;
%score;
After including the year, you can get a longitudinal dataset with:
data student_allyears;
set student_201:;
run;
Finally, you can get what you want with a proc sql, in which you select the max of "total_score" grouped by "id":
proc sql;
create table want as
select distinct *
from student_allyears
group by id
having total_score=max(total_score);
quit;
Create a view that stacks the individual data sets and perform your processing on that.
Example (SQL select, group by, and having)
data scores / view=scores;
length year $4;
set work.student_2010-work.student_2018 indsname=dsname;
year = scan(dsname,-1,'_');
run;
proc sql;
create table want as
select * from scores
group by id
having total_score=max(total_score)
;
quit;
Example DOW loop processing
Stack the data so the view can be processed BY ID. The first DOW loop computes which record has the max total score in the group, and the second selects that record for OUTPUT.
data scores_by_id / view=scores_by_id;
set work.student_2010-work.student_2018 indsname=dsname;
by id;
year = scan(dsname,-1,'_');
run;
data want;
* compute which record in group has max measure;
do _n_ = 1 by 1 until (last.id);
set scores_by_id;
by id;
if total_score > _max then do;
_max = total_score;
_max_at_n = _n_;
end;
end;
* output entire record having the max measure;
do _n_ = 1 to _n_;
set scores_by_id;
if _n_ = _max_at_n then OUTPUT;
end;
drop _max:;
run;
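Note that the HAVING approach returns every tied row when a student reaches the same max in more than one year. If exactly one row per student is wanted, a different technique (sort, then keep the first row per id) is one option. This is only a sketch on hypothetical stacked data, and the tie-break rule (earliest year wins) is an assumption:

```sas
/* Hypothetical stacked data: one row per student per year. */
data student_allyears;
  input id year total_score;
  datalines;
1 2010 60
1 2011 75
1 2012 75
2 2010 40
2 2011 35
;
run;

/* Highest score first within each id; earliest year breaks ties. */
proc sort data=student_allyears out=sorted;
  by id descending total_score year;
run;

/* Keep only the first (best) row per student. */
data want;
  set sorted;
  by id;
  if first.id;
run;
```

Here student 1 should keep the 2011 row and student 2 the 2010 row.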

Aggregating using proc means SAS

For a project, I have a large dataset of 1.5m entries, I am looking to aggregate some car loan data by some constraint variables such as:
Country, Currency, ID, Fixed or floating , performing , Initial Loan Value , Car Type , Car Make
I am wondering if it is possible to aggregate data by summing the initial loan value for the numeric and then condensing the similar variables into one row with the same observation such that I turn the first dataset into the second
Country Currency ID Fixed_or_Floating Performing Initial_Value Current_Value
data have;
input country $ currency $ ID Fixed $ performing :$14. initial current;
datalines;
UK GBP 1 Fixed Performing 100 50
UK GBP 1 Fixed Performing 150 30
UK GBP 1 Fixed Performing 160 70
UK GBP 1 Floating Performing 150 30
UK GBP 1 Floating Performing 115 80
UK GBP 1 Floating Performing 110 60
UK GBP 1 Fixed Non-Performing 100 50
UK GBP 1 Fixed Non-Performing 120 30
;
run;
data want;
input country $ currency $ ID Fixed $ performing :$14. initial current;
datalines;
UK GBP 1 Fixed Performing 410 150
UK GBP 1 Floating Performing 275 170
UK GBP 1 Fixed Non-performing 220 80
;
run;
Essentially looking for a way to sum the numeric values while concatenating the character variables.
I've tried this code
proc means data=have sum;
var initial current;
by country currency id fixed performing;
run;
Unsure If i'll have to use a proc sql (would be too slow for such a large dataset) or possibly a data step.
any help in concatenating would be appreciated.
Create an output data set from Proc MEANS and concatenate the variables in the result. MEANS with a BY statement requires sorted data, and your have is not sorted.
Concatenation of the aggregation keys (those lovely categorical variables) into a single space-separated key (not sure why you need to do that) can be done with the CATX function.
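A quick sketch of what CATX does, using literal values instead of the real data (the variable name key and its length are arbitrary choices):

```sas
/* Minimal sketch with literal values, not the asker's data. */
data _null_;
  length key $50;
  /* CATX strips leading and trailing blanks from each argument and joins
     the arguments with the delimiter; the numeric 1 is converted to text. */
  key = catx(' ', 'UK', 'GBP', 1, '  Fixed ', 'Performing');
  put key=;
run;
```

The log should show key=UK GBP 1 Fixed Performing.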
data have_unsorted;
length country $2 currency $3 id 8 type $8 evaluation $20 initial current 8;
input country currency ID type evaluation initial current;
datalines;
UK GBP 1 Fixed Performing 100 50
UK GBP 1 Fixed Performing 150 30
UK GBP 1 Fixed Performing 160 70
UK GBP 1 Floating Performing 150 30
UK GBP 1 Floating Performing 115 80
UK GBP 1 Floating Performing 110 60
UK GBP 1 Fixed Non-Performing 100 50
UK GBP 1 Fixed Non-Performing 120 30
;
run;
Way 1 - MEANS with CLASS/WAYS/OUTPUT, post process with data step
The cardinality of the class variables may cause problems.
proc means data=have_unsorted noprint;
class country currency ID type evaluation ;
ways 5;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 2 - SORT followed by MEANS with BY/OUTPUT, post process with data step
BY statement requires sorted data.
proc sort data=have_unsorted out=have;
by country currency ID type evaluation ;
run;
proc means data=have noprint;
by country currency ID type evaluation ;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 3 - MEANS, given data that is grouped but unsorted, with BY NOTSORTED/OUTPUT, post process with data step
The have rows will be processed in clumps of the BY variables. A clump is a sequence of contiguous rows that have the same by group.
proc means data=have_unsorted noprint;
by country currency ID type evaluation NOTSORTED;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 4 - DATA Step, DOW loop, BY NOTSORTED and key construction
The have rows will be processed in clumps of the BY variables. A clump is a sequence of contiguous rows that have the same by group.
data want_way4;
do until (last.evaluation);
set have;
by country currency ID type evaluation NOTSORTED;
initial_sum = SUM(initial_sum, initial);
current_sum = SUM(current_sum, current);
end;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 5 - Data Step hash
Data can be processed without a presort or clumping. In other words, the data can be totally disordered.
data _null_;
length key $50 initial_sum current_sum 8;
if _n_ = 1 then do;
call missing (key, initial_sum, current_sum);
declare hash sums();
sums.defineKey('key');
sums.defineData('key','initial_sum','current_sum');
sums.defineDone();
end;
set have_unsorted end=end;
key = catx(' ',country,currency,ID,type,evaluation);
rc = sums.find();
initial_sum = SUM(initial_sum, initial);
current_sum = SUM(current_sum, current);
sums.replace();
if end then
sums.output(dataset:'have_way5');
run;
1.5m entries is not a very big dataset. Sort the dataset first.
proc sort data=have;
by country currency id fixed performing;
run;
proc means data=have sum;
var initial current;
by country currency id fixed performing;
output out=sum(drop=_:) sum(initial)=Initial sum(current)=Current;
run;
Props to paige miller
proc summary data=testa nway;
var net_balance;
class ID fixed_or_floating performing_status initial country currency ;
output out=sumtest sum=sum_initial;
run;

SAS: Exclude patients based on diagnoses on multiple lines and calculate incidence rates

I have large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.
For each patient, their diagnoses are listed on multiple lines. I need to exclude patients who have a certain diagnosis (282.1) and calculate incidence rates of other diseases in the year 2014.
IF diagnosis NE 282.1;
This in the data step does not work, because it does not take into account the other diagnoses on the other lines.
If possible, I would also like to calculate the incidence rates by disease.
This is an example of what the data looks like. There are multiple lines with multiple diagnoses.
PatientID Diagnosis Date Gender Age
1 282.1 1/2/10 F 25
1 232.1 1/2/10 F 87
1 250.02 1/2/10 F 41
1 125.1 1/2/10 F 46
1 90.1 1/2/10 F 58
2 140 12/15/13 M 57
2 132.3 12/15/13 M 41
2 149.1 12/15/13 M 66
3 601.1 11/19/13 F 58
3 231.1 11/19/13 F 76
3 123.1 11/19/13 F 29
4 282.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
5 250.02 1/22/14 M 53
Dual reading solution
Straightforward version
You said you sorted the data first, probably like this
proc sort data=MYLIB.DIAGNOSES;
by PatientID;
run;
Assuming your data is ordered by PatientID, you can process each patient with the diagnosis to exclude read first.
data WORK.NOT_HAVING_282_1;
set MYLIB.DIAGNOSES (where=(diagnosis EQ 282.1))
MYLIB.DIAGNOSES (where=(diagnosis NE 282.1));
by PatientID;
As we need to report by year, not by date:
year = year(Date);
The next step is to exclude those you don't need, so you need to remember whether the unwanted diagnosis occurred:
retain has_282_1;
if first.PatientID then has_282_1 = 0;
if diagnosis EQ 282.1 then has_282_1 = 1;
and then keep the other diagnoses in 2014 for patients that do not have 282.1
else if not has_282_1 then output;
run;
Next you could use SQL to count what you need
proc sql;
create table MYLIB.STATISTICS as
select year, Diagnosis, count(distinct PatientID) as incidence
from WORK.NOT_HAVING_282_1
group by year, Diagnosis;
quit;
Improvements
The above solution would take more processing power than needed:
you read DIAGNOSES from disk, then write NOT_HAVING_282_1 to disk, just to read it back in again
you can keep multiple observations of the same diagnosis at different dates in the same year for the same patient, so you need count(distinct PatientID), which is a costly operation.
About diagnosis 282.1, we only need to know who was ever diagnosed:
proc sort noduplicates
data=MYLIB.DIAGNOSES (where=(diagnosis EQ 282.1))
out=WORK.HAVING_282_1 (keep=PatientID);
by PatientID;
run;
For the other diagnoses, we also need the year, which is computed here:
data WORK.VIEW_OTHER / view=WORK.VIEW_OTHER;
set MYLIB.DIAGNOSES (where=(diagnosis NE 282.1));
year = year(Date);
keep PatientID year Diagnosis;
run;
but as we use a view, we do not really read and calculate anything before the view is used in this sort:
proc sort noduplicates
data=WORK.VIEW_OTHER
out=WORK.OTHER_DIAGNOSES;
by PatientID year Diagnosis;
run;
Now things become simpler. We use the temporary variables exclude and other to indicate where the data came from
data WORK.NOT_HAVING_282_1;
set WORK.HAVING_282_1 (in=exclude)
WORK.OTHER_DIAGNOSES (in=other);
by PatientID;
retain has_282_1;
if first.PatientID then has_282_1 = exclude;
if other and not has_282_1 then output;
run;
proc sql;
create table MYLIB.STATISTICS as
select year, Diagnosis, count(*) as incidence
from WORK.NOT_HAVING_282_1
group by year, Diagnosis;
quit;
Remark: this code is not tested
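For a quick check on a small sample, a different and simpler technique is a PROC SQL subquery that drops every patient who ever had 282.1. This is only a sketch: the rows are a subset typed in from the question, the dataset names diagnoses and patient_counts are assumptions, and it is likely slower than the sort-based version on millions of rows.

```sas
/* Subset of the question's sample rows; dataset and column names are assumed. */
data diagnoses;
  input PatientID Diagnosis Date :mmddyy10. Gender $ Age;
  format Date mmddyy10.;
  datalines;
1 282.1 1/2/10 F 25
1 232.1 1/2/10 F 87
2 140 12/15/13 M 57
2 132.3 12/15/13 M 41
3 601.1 11/19/13 F 58
4 282.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
;
run;

proc sql;
  create table patient_counts as
  select year(Date) as year, Diagnosis,
         count(distinct PatientID) as patients
  from diagnoses
  /* drop every row of any patient who was ever diagnosed with 282.1 */
  where PatientID not in
        (select PatientID from diagnoses where Diagnosis = 282.1)
  group by calculated year, Diagnosis;
quit;
```

In this sample only patients 2 and 3 remain, so the result should hold three 2013 rows with a count of 1 each. Turning the counts into incidence rates still needs a denominator (the population at risk per year), which is not in the data shown.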