PROC SQL, SAS version 9.4. No window functions are available.
The data has a client id, a time period (month), an amount and the corresponding class.
client_id data_period amount class
1 200801 30000 2
2 200801 17000 1
3 200801 9000 1
1 200802 30000 2
2 200802 55555 2
3 200802 11000 2
Threshold amount = 20 000.
amount > 20k gives class = 2, amount <= 20k makes class = 1
client_id = 1: amount and class are the same for 200801 and 200802.
client_id = 2: amount goes up from 17k to 55.5k, and the class change from 1 to 2 is correct.
client_id = 3: amount changed but stayed within class 1 (<= 20k), yet the class changed, which is incorrect.
Desired result is
client_id oldDate newDate AmtOld AmtNew ClassOld ClassNew Good Bad
2 200801 200802 17000 55555 1 2 1 0
3 200801 200802 9000 11000 1 1 0 1
I tried to apply a self join to get all the differences between data periods, but there are too many rows in the output. The data below is not from the example above; these are real numbers.
client_id oldDate newDate AmtOld AmtNew ClassOld ClassNew
A001687463 200808 200802 -5613 1690386 I03 I04
A001687463 200807 200802 -5613 1690386 I03 I04
A001687463 200806 200802 -5613 1690386 I03 I04
A001687463 200805 200802 -5613 1690386 I03 I04
PROC SQL;
CREATE TABLE WORK.'Q'n AS
SELECT distinct
t1.client_id, t1.data_period as oldDate, t2.data_period as newDate, t1.amount as expAmtOld, t2.amount as expAmtNew, t1.class as classOld, t2.class as classNew
FROM WORK.'E'n t1, WORK.'E'n t2
where
t1.client_id = t2.client_id and
t1.amount <> t2.amount
order by t1.client_id;
Do not attempt to do sequential processing using SQL. It is not built for that.
It should be easy to do in a data step. For example let's convert your printout into an actual SAS dataset so we have something to code with.
data have ;
input client_id data_period amount class ;
cards;
1 200801 30000 2
2 200801 17000 1
3 200801 9000 1
1 200802 30000 2
2 200802 55555 2
3 200802 11000 2
;
And let's sort it by client and period.
proc sort data=have ;
by client_id data_period ;
run;
Now just set the data and use the LAG() function to get the previous values.
I am not sure what your definitions of GOOD and BAD are, so I just created new class variables based on your 20K rule.
data want ;
set have ;
by client_id;
old_period = lag(data_period);
old_class = lag(class);
newclass = 1 + (amount > 20000) ;
old_newclass = lag(newclass);
if first.client_id then call missing(of old_:);
bad = (class ne newclass) or (old_newclass ne old_class) ;
run;
So here are the results.
client_id data_period amount class old_period old_class newclass old_newclass bad
1 200801 30000 2 . . 2 . 0
1 200802 30000 2 200801 2 2 2 0
2 200801 17000 1 . . 1 . 0
2 200802 55555 2 200801 1 2 1 0
3 200801 9000 1 . . 1 . 0
3 200802 11000 2 200801 1 1 1 1
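If the goal is a table shaped like the desired result in the question, the same LAG() idea can be pushed a little further. This is only a sketch: it assumes HAVE is already sorted by client_id and data_period as above, and it assumes that "Bad" means the recorded class disagrees with the 20K rule and "Good" means it agrees. Those definitions, the output name CHANGES and the helper variable EXPECTED are my own guesses, not something stated in the question.

```sas
data changes;
  set have;
  by client_id;

  /* LAG() must run on every row, so call it before any subsetting */
  oldDate  = lag(data_period);
  AmtOld   = lag(amount);
  ClassOld = lag(class);
  if first.client_id then call missing(oldDate, AmtOld, ClassOld);

  /* keep only rows where the amount differs from the previous period */
  if not first.client_id and amount ne AmtOld;

  /* assumed rule: amount > 20000 gives class 2, otherwise class 1 */
  expected = 1 + (amount > 20000);
  Bad  = (class ne expected);
  Good = not Bad;

  rename data_period=newDate amount=AmtNew class=ClassNew;
  drop expected;
run;
```

With the sample data this keeps one row each for clients 2 and 3 and drops client 1, whose amount did not change. ClassNew holds the class as recorded in the data.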
I am dealing with a repeated measures dataset in a wide format. Each observation represents one measurement for one subject and each subject is measured six times. The data contains mainly dummy variables.
I am looking to do a count of unique dummy variable values across all six observations for each subject.
Have:
MeasurementNum SubjectID Dummy0 Dummy1 Dummy2 Dummy3 Dummy4
-----------------------------------------------------------------------------
1 1 1 1 0 0 0
2 1 0 1 0 1 0
3 1 - - - - -
4 1 0 0 1 1 0
5 1 - - - - -
6 1 0 0 0 1 0
1 2 1 0 0 1 0
2 2 0 0 0 0 0
3 2 0 1 0 0 0
4 2 1 1 0 1 0
5 2 - - - - -
6 2 1 1 1 0 0
Want:
Total for Overall
MeasurementNum SubjectID ... MeasurementNUM Total
--------------------------------...-----------------------------
1 1 ... 2 4
2 1 ... 2 4
3 1 ... - 4
4 1 ... 2 4
5 1 ... - 4
6 1 ... 1 4
1 2 ... 2 4
2 2 ... 0 4
3 2 ... 1 4
4 2 ... 3 4
5 2 ... - 4
6 2 ... 3 4
My current approach is to consolidate all six rows within each subject into one row, retaining the value 1, using PROC MEANS with BY and OUTPUT statements, as described in this related question. I then use PROC SUMMARY to get the values listed under the variable 'Total' in the want table.
proc summary
data=have;
By SubjectID
class Dummy1-4;
output out=want sum=sum;
Is there a way to get the distinct/unique counts across observations without consolidating rows first?
I prefer PROC SQL as it will also allow me to do conditional counts according to subject covariates present in my working dataset. I.e. producing the want descriptives on condition of a covariate specific to the subject.
I suspect that using PROC SUMMARY (aka PROC MEANS) will be the easiest way. Sounds like you want to find the MAX for each SUBJECT and then SUM those to get the subject totals.
proc summary data=have nway ;
class SubjectID ;
var Dummy0-Dummy999;
output out=any(drop=_type_ _freq_) n=n_reps max= ;
run;
data want ;
set any ;
total = sum(of Dummy0-Dummy999) ;
run;
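To get both totals from the question back onto every measurement row, the subject-level total can be merged with the original data. The following is a sketch only: it uses the five dummy variables from the example, reads the "-" cells as numeric missing values, and the names TOTALS, WANT_ROWS, MeasTotal and OverallTotal are made up for illustration.

```sas
data have;
  input MeasurementNum SubjectID Dummy0-Dummy4;
  datalines;
1 1 1 1 0 0 0
2 1 0 1 0 1 0
3 1 . . . . .
4 1 0 0 1 1 0
5 1 . . . . .
6 1 0 0 0 1 0
1 2 1 0 0 1 0
2 2 0 0 0 0 0
3 2 0 1 0 0 0
4 2 1 1 0 1 0
5 2 . . . . .
6 2 1 1 1 0 0
;

/* one row per subject holding the MAX of each dummy */
proc summary data=have nway;
  class SubjectID;
  var Dummy0-Dummy4;
  output out=any(drop=_type_ _freq_) max=;
run;

/* number of dummies that were ever 1 for the subject */
data totals;
  set any;
  OverallTotal = sum(of Dummy0-Dummy4);
  keep SubjectID OverallTotal;
run;

/* HAVE is already in SubjectID order, so it can be merged directly */
data want_rows;
  merge have totals;
  by SubjectID;
  MeasTotal = sum(of Dummy0-Dummy4);  /* missing when the whole row is missing */
run;
```

Rows where every dummy is missing get a missing MeasTotal (SAS writes a note to the log about this) but still carry the subject's OverallTotal.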
I'm not sure how SQL helps with conditional counts. You could generate the counts and the total in one step with PROC SQL, but it would require wallpaper code like this:
proc sql ;
create table want as
select SubjectID
, count(*) as n_reps
, max(dummy0) as dummy0
, max(dummy1) as dummy1
...
, max(dummy999) as dummy999
, sum
( max(dummy0)
, max(dummy1)
...
, max(dummy999)
) as Total
from have
group by 1
;
quit;
You could probably define a macro (or some other tool) to generate that wallpaper code for you from a list of variable names.
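As a rough sketch of such a macro, assuming the variables share a common prefix and a numeric suffix (here dummy0 to dummy4, as in the example data), something like the following could generate both lists. The macro names MAXLIST and MAXARGS and their parameters are hypothetical.

```sas
/* emits: , max(dummy0) as dummy0 , max(dummy1) as dummy1 ... */
%macro maxlist(prefix, first, last);
  %local i;
  %do i = &first %to &last;
    , max(&prefix&i) as &prefix&i
  %end;
%mend maxlist;

/* emits: max(dummy0) , max(dummy1) ... for use inside SUM() */
%macro maxargs(prefix, first, last);
  %local i;
  %do i = &first %to &last;
    %if &i > &first %then ,;
    max(&prefix&i)
  %end;
%mend maxargs;

proc sql;
  create table want as
  select SubjectID
       , count(*) as n_reps
       %maxlist(dummy, 0, 4)
       , sum(%maxargs(dummy, 0, 4)) as Total
  from have
  group by SubjectID
  ;
quit;
```

Turning on OPTIONS MPRINT while testing shows the generated SQL in the log, which makes it easy to compare against the hand-written version.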
I have a data structure that looks like this:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;
So, we have family id, individual id, implicate number and imputed income for each implicate.
What I need is to replicate the results of the first individual in each family (all five implicates) for the remaining individuals within each family, replacing whatever values were previously in those cells, like this:
DATA want ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;
In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.
So far, I came up with this solution:
%let implist_1=imp_inc;
%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&implist_1);
This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.
I'm pretty sure there is a faster way to do this with some form of PROC SQL or first.var etc., but I'm relatively new to SAS and so far I couldn't come up with a better solution.
Thank you very much for your support.
Best regards
Yes, this can be done in DATA step using a first. reference made available via the by statement.
data want;
set have (keep=famid indid implicate imp_inc /* other vars */);
by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */
if first.famid then if indid ne 1 then abort;
array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */
if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;
The DATA step could be significantly faster because it makes a single pass through the data. However, there is an internal processing cost in calculating all those array addresses at run time, and that cost could become noticeable at some <n>.
SQL has simpler syntax, is easier to understand, and works even if the have data set is unsorted or has some peculiar sequencing in the by group.
This is fairly straightforward with a bit of SQL:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.
It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.
My data look like this:
rep model x Reject
1 1 1.36 1
1 2 -0.76 0
1 3 3.74 1
1 4 -0.42 0
2 1 -0.56 0
2 2 -5.78 0
2 3 -2.00 0
2 4 -3.67 0
and I want the output to look like this:
rep model x Reject
1 1 1.36 1
2 1 -0.56 0
For each rep I want just one of the four models where Reject=1; if no such model exists, any observation will do.
Thanks!
Sort your data by REP and REJECT and take first record per REP.
Proc sort data=have;
By rep descending reject model;
Run;
Data select;
Set have;
By rep descending reject model;
If first.rep;
Run;
I would like to create a table that has three variables where var2 is a percentage of var1 and var3 is a percentage of var 2, broken down by class variables that have missing values.
To explain, imagine I have data showing who applied, was interviewed, and was hired for a job, e.g.
data job;
input applied interviewed hired;
datalines;
1 1 1
1 1 1
1 1 1
1 1 0
1 1 0
1 1 0
1 0 .
1 0 .
1 0 .
1 0 .
;
run;
It's very easy to create a table that shows the count of those who applied, then the percentage of those who were interviewed, and then, of those people, the percentage who were hired.
proc tabulate data = job;
var applied interviewed hired;
tables applied * n (interviewed hired) * mean * f=percent6.;
run;
which gives:
applied interviewed hired
10 60% 50%
Now I would like to break that down by several class variables with missing values.
data have;
input sex degree exp applied interviewed hired;
datalines;
0 1 1 1 1 1
1 . 0 1 1 1
. 0 1 1 1 1
0 1 0 1 1 0
1 0 1 1 1 0
0 1 0 1 1 0
1 . 1 1 0 .
0 1 . 1 0 .
. 0 0 1 0 .
1 0 0 1 0 .
;
run;
If I do one class variable at a time it will give me the correct percentages:
proc tabulate data = have format = 6.;
class sex;
var applied interviewed hired;
tables sex, applied * sum (interviewed hired) * mean * f=percent6.;
run;
Is there a way to do all three class variables in the table at once and get the right percentage for each category, so that the table looks like this:
applied interviewed hired
sex
0 4 75% 33%
1 4 50% 50%
degree
0 4 50% 50%
1 4 75% 33%
exp
0 5 60% 33%
1 4 75% 67%
This is something I must do many, many times and I need to populate tables in a report with the numbers, so I'm looking for a solution where the table can be printed all in one step.
How would you solve this problem?
The problem you're running into is that of missing data. When a case is missing for any class variable, it is eliminated from the entire table, unless you specify MISSING in the proc call. So, for example, your 4th sex=0 who did not interview was missing EXP; so they didn't show up at all in the table, though you would want them showing up in SEX.
You can get the correct numbers, mostly:
proc tabulate data = have format = 6. missing;
class sex degree exp;
var applied interviewed hired;
tables (sex degree exp), applied * sum (interviewed hired) * mean * f=percent6.;
run;
However, you have an extra row that includes those with missing data. You cannot eliminate those rows from the printed output while also including them in the other class calculations; this is just one of those limitations of SAS tabulation. Other PROCs have a similar problem; PROC FREQ is the only one that doesn't do this if you have multiple tables generated, but even then within one table (combined with asterisks) you will have the same issue.
The only way I've found around this is to output the table to a dataset and then filter out those rows, and PROC REPORT or PRINT or TABULATE the data back out.
I think this is close to what you want. You will have to fix the row labels, but it is one PROC TABULATE step.
title;
data have;
input sex degree exp applied interviewed hired;
datalines;
0 1 1 1 1 1
1 . 0 1 1 1
. 0 1 1 1 1
0 1 0 1 1 0
1 0 1 1 1 0
0 1 0 1 1 0
1 . 1 1 0 .
0 1 . 1 0 .
. 0 0 1 0 .
1 0 0 1 0 .
;
run;
proc print;
run;
proc summary data=have missing ;
class sex degree exp;
ways 1;
output out=stats sum(applied)= mean(interviewed hired)= / levels;
run;
data stats2;
set stats;
if n(of sex degree exp) eq 0 then delete;
run;
proc print;
run;
proc tabulate data=stats2;
class _type_ / descend;
class _level_;
var applied interviewed hired;
tables (_type_*_level_),applied*sum='N'*f=8. (interviewed hired)*sum='Percent'*f=percent6.;
run;
/**/
/* applied interviewed hired*/
/*sex */
/* 0 4 75% 33%*/
/* 1 4 50% 50%*/
/*degree */
/* 0 4 50% 50%*/
/* 1 4 75% 33%*/
/*exp */
/* 0 5 60% 33%*/
/* 1 4 75% 67%*/
Hi, I am trying to subset a dataset which has the following data:
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to take out only those IDs who are recorded 4 times.
I wrote like
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
set BU;
by ID;
retain keepMe;
If first.ID THEN DO;
IF count = 4 THEN keepMe=1;
ELSE keepMe=0;
END;
if keepMe=0 THEN DELETE;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
If I understand correctly, you are trying to extract all observations that are repeated 4 times or more. If so, your use of last.count and first.count is wrong. last.var is a boolean that indicates which observation is the last in its BY group. Have a look at Tim's suggestion.
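To see what the first. and last. flags actually contain, it can help to copy them into ordinary variables, since the automatic ones are not written to the output data set. A small sketch, assuming the BU data set from the question; the names BU_SORTED, FLAGS, first_id and last_id are made up here.

```sas
proc sort data=BU out=BU_sorted;
  by ID count;
run;

data flags;
  set BU_sorted;
  by ID;
  first_id = first.ID;  /* 1 on the first row of each ID, otherwise 0 */
  last_id  = last.ID;   /* 1 on the last row of each ID, otherwise 0 */
run;

proc print data=flags;
run;
```

Both flags only ever hold 0 or 1, so comparing them with 4 can never identify the IDs you are after.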
In order to extract all observations that are repeated four times or more, I would suggest to use the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
A slight variation on Tim's answer, assuming you don't necessarily have the count variable.
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;