I'm trying to count the changes in the Default variable by ID and Date. I'm only a week into using SAS, so please forgive me if I need detailed explanations.
I currently have
data test111;
input Date $ Acc $ Default $;
datalines;
jan-10 A N
feb-10 A D
mar-10 A D
apr-10 A D
may-10 A D
jan-10 B N
feb-10 B N
mar-10 B D
apr-10 B D
may-10 B D
jan-10 C N
feb-10 C N
mar-10 C N
apr-10 C D
may-10 C D
jan-10 D N
feb-10 D D
mar-10 D N
apr-10 D D
may-10 D D
jan-10 E D
feb-10 E D
mar-10 E D
apr-10 E N
may-10 E D
I want an output (Table 1 below) that counts when Default changes from N to D for each unique account, broken down by Date. I only know how to produce the desired output in Excel (counting manually). In case that is not clear, Table 2 shows which accounts are counted in each cell.
Table 1
        month+1  month+2  month+3  month+4
Jan-10  2        1        1        0
Feb-10  1        1        0
Mar-10  2        0
Apr-10  1
Table 2 (FYR)
        month+1  month+2  month+3  month+4
Jan-10  A,D      B        C        -
Feb-10  B        C        -
Mar-10  C,D      -
Apr-10  E
I've tried something like creating a new column that tags when N changes to D so I can sum when tag =1
by first.Acc
if first.Acc then tag = 0;
if default = 'D' then do;
tag = 1;
tag+1;
Not sure if this will get me the correct result for my first line.
But if this works it can only give me the first row of my desired output. I have over 100 months and is it possible to loop or array it?
In case my SAS data input skills fail I've included the Excel screenshot.
The most general solution will involve a full outer join within each group.
SQL can perform the join and compute the months apart for the case of a month with N default to the next earliest month with D default. Tabulate can present a grid of the counts.
Data
The month strings are converted to first-of-month SAS date values, formatted with yymon. for display purposes.
data have;
input Date $ Acc $ Default $;
month = input ('01-'||date, date9.);
format month yymon.;
datalines;
jan-10 A N
feb-10 A D
mar-10 A D
apr-10 A D
may-10 A D
jan-10 B N
feb-10 B N
mar-10 B D
apr-10 B D
may-10 B D
jan-10 C N
feb-10 C N
mar-10 C N
apr-10 C D
may-10 C D
jan-10 D N
feb-10 D D
mar-10 D N
apr-10 D D
may-10 D D
jan-10 E D
feb-10 E D
mar-10 E D
apr-10 E N
may-10 E D
run;
Sample SQL
proc sql;
create view have_v as
select
left.acc
, left.month as from_month
, right.month as to_month
, intck ('MONTH', left.month, right.month) as months_apart
from
have as left
join have as right on left.acc = right.acc
where
left.month < right.month
& left.default = 'N'
& right.default = 'D'
group by left.acc, left.month
having right.month = min(right.month)
order by
left.month, right.month
;
quit;
SQL Explained
A self join is a join of a table against itself.
have as left join have as right
The join constraint restricts the from/to month combinations to be only those in the same account
on left.acc = right.acc
The where constraint further restricts the from/to combinations to be only those with to in the future and having the desired default transition
where
left.month < right.month
& left.default = 'N'
& right.default = 'D'
The group by clause, combined with the automatic remerge feature of the SAS Proc SQL having clause, allows a simple statement to select the earliest to month for a given from month
group by left.acc, left.month
having right.month = min(right.month)
For the selected rows of the join, the months apart can be computed using the SAS date function INTCK (I think of the function name as an acronym for the phrase "INTerval Count of a period Kind")
, intck ('MONTH', left.month, right.month) as months_apart
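As a small illustration of how INTCK counts, the sketch below uses literal dates chosen only for this example. With its default method, INTCK counts interval boundaries crossed rather than elapsed days, which is why first-of-month dates give clean month gaps.

```sas
data _null_;
  /* one month boundary is crossed even though only one day passes */
  a = intck('MONTH', '31JAN2010'd, '01FEB2010'd);
  /* no month boundary is crossed within January */
  b = intck('MONTH', '01JAN2010'd, '31JAN2010'd);
  /* first-of-month to first-of-month: Jan to May */
  c = intck('MONTH', '01JAN2010'd, '01MAY2010'd);
  put a= b= c=;   /* log shows a=1 b=0 c=4 */
run;
```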
A more specific solution could use an array if the largest group size is known a priori.
Tabulate for output
Some months may not be present due to no transitions. Likewise, some gaps may also not be present. These cases will not have any rows in have_v. In order to get a full coverage for the report, all possible crossings (or combinations) are generated for use in Proc TABULATE
proc sql;
create table grid_bounds as
select
min(from_month) as min_from
, max(from_month) as max_from
, 1 as min_apart
, max(months_apart) as max_apart
from have_v
;
quit;
data grid (label="All crossings to be shown in the output");
set grid_bounds;
do _i = 0 to intck('MONTH', min_from, max_from);
from_month = intnx('MONTH', min_from, _i); * step by month rather than by day;
do months_apart = 1 to max(12,max_apart);
OUTPUT;
end;
end;
keep from_month months_apart;
format from_month yymon.;
run;
Outputs a grid report with gap frequencies for each from_month
options missing = '0' nocenter;
title "Account frequency";
title2 "Gap of month default changing from N to D";
proc tabulate data=have_v classdata=grid;
class from_month months_apart;
table
from_month=''
, months_apart * N=''
;
run;
options missing = '.';
Related
Please help friends
data have;
input (v_202002 v_202003 v_202001) (:$1.);
datalines;
a . b
. . b
a b b
. b a
b b a
;
What I am looking for - First time the value became 'b'
want dataset:
v_202002 v_202003 v_202001 output
a . b 202001
. . b 202001
a b b 202001
. b a 202003
b b a 202002
You can use the WHICHC() function to find the index into an array where the value appears. Then use the VNAME() function to get the name.
data want;
set have;
array vlist v: ;
index=whichc('b',of vlist[*]);
if index then output = substr(vname(vlist[index]),3);
run;
Results
Obs  v_202002  v_202003  v_202001  index  output
1    a                   b         3      202001
2                        b         3      202001
3    a         b         b         2      202003
4              b         a         2      202003
5    b         b         a         1      202002
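Note that WHICHC returns the first match in array order, and `v:` follows the variable order in the dataset (v_202002, v_202003, v_202001). That is why observation 3 above gives 202003 while the question asked for 202001. If "first time" is meant chronologically, one possible variant is to list the array members by hand in date order. This is only a sketch; the names want_chrono and first_b are made up for illustration.

```sas
data want_chrono;
  set have;
  /* assumption: members are listed manually in chronological order */
  array vlist v_202001 v_202002 v_202003;
  length first_b $6;
  /* position of the first 'b' in chronological order, 0 if none */
  index = whichc('b', of vlist[*]);
  /* strip the leading "v_" from the variable name */
  if index then first_b = substr(vname(vlist[index]), 3);
run;
```

With many monthly columns, the ordered list would have to be generated rather than typed, for example from a sorted list of variable names.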
I would like to add a new column to a dataset but I am not sure how to do so. My dataset has a variable called KEYVAR (a character variable) with three different values. A participant can appear multiple times in my dataset, with each row containing the same or a different value for KEYVAR. What I want to do is create a new variable called NEWVAR that counts how many times a participant has a specific value for KEYVAR; when a participant has no observation with that specific value, I want NEWVAR to be zero.
Here's an example of the dataset I would like (in this example, I want to count every instance of "Y" per participant as NEWVAR):
have
PARTICIPANT KEYVAR
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
want
PARTICIPANT KEYVAR NEWVAR
A Y 1
A N 1
B Y 3
B Y 3
B Y 3
C W 0
C N 0
C W 0
D Y 2
D N 2
D N 2
D Y 2
D W 2
You can use Proc SQL to compute an aggregate result over a group for rows meeting a criterion, and have that aggregate value automatically remerged into the result set.
-OR-
Use a MEANS, TRANSPOSE, MERGE approach
Sample Code (SQL)
data have;
input ID $ value $; datalines;
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
E X
;
proc sql;
create table want as
select ID, value
, sum(value='Y') as Y_COUNT /* relies on logic eval 'math' 0 false, 1 true */
, sum(value='N') as N_COUNT
, sum(value='W') as W_COUNT
from have
group by ID
;
quit;
Sample Code (PROC and MERGE)
* format for PRELOADFMT and COMPLETETYPES;
proc format;
value $eachvalue
'Y' = 'Y'
'N' = 'N'
'W' = 'W'
other = '-'
;
run;
* Count how many per combination ID/VALUE;
proc means noprint data=have nway completetypes;
class ID ;
class value / preloadfmt;
format value $eachvalue.;
output out=freqs(keep=id value _freq_);
run;
* TRANSPOSE reshapes to wide (across) data layout, one row per ID;
proc transpose data=freqs suffix=_count out=counts_across(drop=_name_);
by id;
id value;
var _freq_;
where put(value,$eachvalue.) ne '-';
run;
* MERGE;
data want_way_2;
merge have counts_across;
by id;
run;
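A third option, not part of the two approaches above, is a DATA step with two DO UNTIL loops per group (often called a double DOW loop). The sketch below assumes `have` is already sorted by ID and counts only 'Y', as in the question; the names want_dow and newvar are chosen for illustration.

```sas
data want_dow;
  /* first pass over the group: count the rows where value is Y */
  do until (last.id);
    set have;
    by id;
    newvar = sum(newvar, value = 'Y');  /* sum() treats the initial missing as 0 */
  end;
  /* second pass over the same group: write every row with the group total */
  do until (last.id);
    set have;
    by id;
    output;
  end;
run;
```

Because newvar is reset to missing at the top of each DATA step iteration, each ID starts from zero, and the original row order is kept.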
For the following data I am trying to filter rows, of each group ID, based on these conditions:
After every row with type='B' and value='Y', do the following:
Remove the rows until the next row having type='F' and value='Y'.
If there is no row with type='B' and value='Y', then keep all of them (e.g. id=002).
Can we create the flag variable as shown in my want dataset, so that I can filter on Flag='Y'?
Have
ID   Type  Date      Value
001  F     1/2/2018  Y
001  B     1/3/2018
001  B     1/4/2018  Y
001  B     1/5/2018
001  B     1/6/2018
001  F     1/6/2018  Y
001  B     1/6/2018
001  B     1/7/2018
001  B     1/8/2018  Y
001  B     1/8/2018
001  B     1/9/2018
002  F     1/2/2018  Y
002  B     1/3/2018
002  B     1/4/2018
Want
ID   Type  Date      Value  Flag
001  F     1/2/2018  Y      Y
001  B     1/3/2018         Y
001  B     1/4/2018  Y      Y
001  B     1/5/2018
001  B     1/6/2018
001  F     1/6/2018  Y      Y
001  B     1/6/2018         Y
001  B     1/7/2018         Y
001  B     1/8/2018  Y      Y
001  B     1/8/2018
001  B     1/9/2018
002  F     1/2/2018  Y      Y
002  B     1/3/2018         Y
002  B     1/4/2018         Y
I tried to do the following
data F;
set have;
where Type='F';run;
data B;
set have;
where Type='B';run;
proc sql;
create table all as select
a.* from B as b
inner join F as f
on a.id=b.id
and b.date >= a.date;
quit;
This includes all the rows from my have dataset. Any help is much appreciated.
The criteria for computing the state of a row as part of a contiguous sub-group (call it a 'run' of rows) within group ID are relatively simple, but the result may be compromised if any of these irregular cases occur in the data:
two or more B Y before a F Y (extra 'run ending')
two or more F Y before a B Y ('run starting' within a run)
first row in group not F Y ('run starting' not first in group)
data want(drop=run_:);
SET have;
BY id;
run_first = (type='F' and value='Y');
run_final = (type='B' and value='Y');
* set flag state at criteria for start of contiguous sub-group criteria;
run_flag + run_first;
if first.id and NOT run_flag then
put 'WARNING: first row in group ' id= ' is not F Y, this may be incorrect';
if run_flag > 1 and run_first then
put 'WARNING: an additional F Y before a B Y at row ' _n_;
if run_flag then
OUTPUT;
if run_flag = 0 and run_final then
put 'WARNING: an additional B Y before a F Y at row ' _n_;
* reset flag at criteria for contiguous sub-group;
if last.id or run_final then
run_flag = 0;
run;
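The step above filters rows rather than creating the Flag column the question asked about. A minimal sketch of the same run logic that keeps every row and sets Flag instead is shown here, assuming `have` has the variables id, type and value and is grouped by id; the names want_flagged and in_run are hypothetical, and the warning checks are left out.

```sas
data want_flagged(drop=in_run);
  set have;
  by id;
  retain in_run 0;
  /* a run never carries over from one id to the next */
  if first.id then in_run = 0;
  /* an F Y row starts a run */
  if type = 'F' and value = 'Y' then in_run = 1;
  /* rows inside a run, including the closing B Y row, are flagged */
  if in_run then Flag = 'Y';
  else Flag = ' ';
  /* a B Y row ends the run after it has been flagged */
  if type = 'B' and value = 'Y' then in_run = 0;
run;
```

Filtering with `where Flag = 'Y'` on the result should then give the same rows as the filtering step.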
Same as Richard, I don't quite understand what the filtering criteria are.
I can see one problem with your join: you used a.* in your select statement, but "b" and "f" as your dataset aliases. This will not work because no dataset has been assigned the alias "a".
The proper way would be as follows:
proc sql;
create table all as
select b.* from B as b
inner join F as f
on b.id=f.id
and b.date >= f.date;
quit;
However, even then, I don't believe an inner join is the proper way to solve your problem. Could you let us know your filtering conditions, please?
I have a solution, but it is not the most elegant (and might not cover corner cases). If anyone else has a better solution, please share.
First, to create the dataset in case anyone else wants to try it out:
Data work.have;
/* @ is the column pointer; # would jump to a different input line */
input @01 ID 3.
@05 Type $1.
@07 Date date7.
@15 Value $1.;
format ID 3.
Type $1.
Date date11.
Value $1.;
datalines;
001 F 02Jan18 Y
001 B 03Jan18
001 B 04Jan18 Y
001 B 05Jan18
001 B 06Jan18
001 F 06Jan18 Y
001 B 06Jan18
001 B 07Jan18
001 B 08Jan18 Y
001 B 08Jan18
001 B 09Jan18
002 F 02Jan18 Y
002 B 03Jan18
002 B 04Jan18
;
run;
Solution:
I based this on your edited suggestion of creating a flag variable.
Data Flag;
set work.have;
if Type = 'B' and Value = 'Y' then
flag + 1;
if Type = 'F' then
flag = 0;
if Value ne 'Y' and flag = 1 then delete;
run;
The flag variable is 0 by default.
The first IF-THEN condition identifies the rows with Type='B' and Value='Y' and flags them as 1; the sum statement retains this flag for the subsequent rows.
The second IF-THEN condition identifies the Type='F' rows and resets the flag to 0.
The last IF-THEN condition drops all rows with flag=1 except the first occurrence, which is the row with Type='B' and Value='Y'.
I hope this applies to your problem.
I am attempting to group by a variable that is not unique with a discrete variable to get the unique combinations per non-unique variable. For example:
A B
1 a
1 b
2 a
2 a
3 a
4 b
4 d
5 c
5 e
I want:
A Unique_combos
1 a, b
2 a
3 a
4 b, d
5 c, e
My current attempt is something along the lines of:
proc sql outobs=50;
title 'Unique Combinations of b per a';
select a, b
from mylib.mydata
group by distinct a;
run;
If you are happy to use a data step instead of proc sql you can use the retain keyword combined with first/last processing:
Example data:
data have;
attrib b length=$1 format=$1. informat=$1.;
input a
b $
;
datalines;
1 a
1 b
2 a
2 a
3 a
4 b
4 d
5 c
5 e
;
run;
Eliminate duplicates and make sure the data is sorted for first/last processing:
proc sql noprint;
create table tmp as select distinct a,b from have order by a,b;
quit;
Iterate over the distinct list and concatenate the values of b together:
data want;
length combinations $200; * ADJUST TO BE BIG ENOUGH TO STORE ALL THE COMBINATIONS;
set tmp;
by a;
retain combinations '';
if first.a then do;
combinations = '';
end;
combinations = catx(', ',combinations, b);
if last.a then do;
output;
end;
drop b;
run;
Result:
combinations a
a, b 1
a 2
a 3
b, d 4
c, e 5
You just need to put a distinct keyword in the select clause, eg:
title 'Unique Combinations of b per a';
proc sql outobs=50;
select distinct a, b
from mylib.mydata;
The run statement is unnecessary, the sql procedure is normally ended with a quit - although I personally never use it, as the statement will execute upon hitting the semicolon and the procedure quits anyway upon hitting the next step boundary.
I have 2 tables as follows:
Table 1
data table1;
input id $ value;
datalines;
A 1
A 2
B 1
B 2
C 1
D 1
;
Table 2
data table2;
input id $ value;
datalines;
A 1
B 2
C 1
D 1
E 1
;
As you may observe, the unique ids in table1 are A, B, C, D.
I would like to delete the observations in table2 whose id does not appear in table1.
Therefore the last observation of table2 should be deleted, as E is not in {A, B, C, D}.
Desired output:
A 1
B 2
C 1
D 1
You can do this with proc sql:
proc sql;
delete from table2
where not exists (select 1 from table1 where table1.id = table2.id);
quit;
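An alternative spelling of the same delete uses NOT IN with a subquery instead of NOT EXISTS. This is only a sketch against the table1 and table2 datasets from the question, and it should remove the same rows for this data.

```sas
proc sql;
  /* remove table2 rows whose id has no match among the ids of table1 */
  delete from table2
    where id not in (select id from table1);
quit;
```

Which form reads better is largely a matter of taste; the correlated NOT EXISTS form extends more naturally when the match involves more than one column.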