For the following data I am trying to filter rows, of each group ID, based on these conditions:
After every row with type='B' and value='Y' do the following
Remove the rows until the next row having type='F' and value='Y'.
If there is no B='Y then keep all of them (e.g. id=002)
Can we create the flag variable as shown in my want dataset? so that I can filter on Flag='Y'?
Have
ID Type Date Value
001 F 1/2/2018 Y
001 B 1/3/2018
001 B 1/4/2018 Y
001 B 1/5/2018
001 B 1/6/2018
001 F 1/6/2018 Y
001 B 1/6/2018
001 B 1/7/2018
001 B 1/8/2018 Y
001 B 1/8/2018
001 B 1/9/2018
002 F 1/2/2018 Y
002 B 1/3/2018
002 B 1/4/2018
Want
ID Type Date Value Flag
001 F 1/2/2018 Y Y
001 B 1/3/2018 Y
001 B 1/4/2018 Y Y
001 B 1/5/2018
001 B 1/6/2018
001 F 1/6/2018 Y Y
001 B 1/6/2018 Y
001 B 1/7/2018 Y
001 B 1/8/2018 Y Y
001 B 1/8/2018
001 B 1/9/2018
002 F 1/2/2018 Y Y
002 B 1/3/2018 Y
002 B 1/4/2018 Y
I tried to do the following
data F;
set have;
where Type='F';run;
data B;
set have;
where Type='B';run;
proc sql;
create table all as select
a.* from B as b
inner join F as f
on a.id=b.id
and b.date >= a.date;
quit;
This includes all the rows from my have dataset. Any help is much appreciated.
The criteria for computing the state of a row as part of a contiguous sub-group (call it a 'run' of rows) within group ID are relatively simple, but a compromised state might occur or be indicated if some funny cases of data occur:
two or more B Y before a F Y (extra 'run ending')
two or more F Y before a B Y ('run starting' within a run)
first row in group not F Y ('run starting' not first in group)
data want(drop=run_:);
SET have;
BY id;
run_first = (type='F' and value='Y');
run_final = (type='B' and value='Y');
* set flag state at criteria for start of contiguous sub-group criteria;
run_flag + run_first;
if first.id and NOT run_flag then
put 'WARNING: first row in group ' id= ' is not F Y, this may be incorrect';
if run_flag > 1 and run_first then
put 'WARNING: an additional F Y before a B Y at row ' _n_;
if run_flag then
OUTPUT;
if run_flag = 0 and run_final then
put 'WARNING: an additional B Y before a F Y at row ' _n_;
* reset flag at criteria for contiguous sub-group;
if last.id or run_final then
run_flag = 0;
run;
Same as Richard, I don't quite understand what the filtering criteria are.
I could see one problem with your join. you used a.* in your select statement, but "b" and "f" as your dataset aliases. this would not work as no dataset have been assigned to alias "a".
Proper way would be as follow:
proc sql;
create table all as
select b.* from B as b
inner join F as f
on b.id=f.id
and b.date >= f.date;
quit;
However, even then, I don't believe inner join is the proper way to solve your problem. Do let us your filtering condition please?
I have a solution but it is not the most elegant (and might not cover corner cases.) If anyone else has a better solution please share.
First, to create the dataset in-case anyone else want to try it out:
Data work.have;
input #01 ID 3.
#05 Type $1.
#07 Date date7.
#18 Value $1.;
format ID 3.
Type $1.
Date date11.
Value $1.;
datalines;
001 F '02Jan18'n Y
001 B '03Jan18'n
001 B '04Jan18'n Y
001 B '05Jan18'n
001 B '06Jan18'n
001 F '06Jan18'n Y
001 B '06Jan18'n
001 B '07Jan18'n
001 B '08Jan18'n Y
001 B '08Jan18'n
001 B '09Jan18'n
002 F '02Jan18'n Y
002 B '03Jan18'n
002 B '04Jan18'n
;
run;
Solution:
I based on your edited suggestion of creating a flag variable.
Data Flag;
set work.have;
if Type = 'B' and Value = 'Y' then
flag + 1;
if Type = 'F' then
flag = 0;
if Value ne 'Y' and flag = 1 then delete;
run;
The flag variable is 0 by default.
The first IF-Then condition identifies the Type B ='Y' rows and flag them as 1, as well as retaining this flag for the subsequent rows.
The second IF-Then condition identifies the type='F' row and resets the Flag to 0
The Last If-Then condition drops all rows with Flag=1 except the first occurrence which are the Type B ='Y' rows.
I hope this applies to your problem.
Related
I have a question about the conditional group dataset:
Here is my dataset:
data temp;
input id x1 x2 $;
cards;
1 25 A
1 35 C
1 20 B
3 33 D
;
run;
I want to group them by id and create a new variable Y. the output should like:
id x1 x2 y
1 25 A C
1 35 C C
1 20 B C
3 33 D D
For example, if id is the same, I need to compare x2 C trumps B, and B trumps A, etc.
how should I code? Thank you very much!
Not sure if this will generalize but a SQL solution, where the max of X2 becomes your Y value for each ID.
proc sql;
create table want as
select *, max(x2) as y
from have
group by ID;
quit;
Hi I am having trouble with this question related to a school project. Converting values to a character missing value.
Run the following program to create the Dataset NOTAPPLY.
DATA NOTAPPLY;
LENGTH A B C D E $ 2;
INPUT ID A $ B $ C $ D $ E $ X Y Z;
DATALINES;
001 Y N N Y Y 1 2 3
002 na NA Y Y Y 3 4 5
003 NA NA NA na na 8 9 10
;
In the SAS data set NOTAPPLY, a value of either NA or na was used in place of a
missing value for all character variables. Create a new SAS data set NEW where
these values are converted to a character missing value.
There are many ways of converting values in SAS. One of them is using an INFORMAT by importing data. So you code might look like:
proc format;
invalue $MYMISS
'NA'=' '
'na'=' '
;
run;
DATA NEW;
LENGTH A B C D E $ 2;
INPUT ID A:$MYMISS2. B:$MYMISS2. C:$MYMISS2. D:$MYMISS2. E:$MYMISS2. X Y Z;
DATALINES;
001 Y N N Y Y 1 2 3
002 na NA Y Y Y 3 4 5
003 NA NA NA na na 8 9 10
;
run;
I have the following SAS dataset.
correlation
policynum
risknum
A
X
Y
A
X
Y
A
X
Y
B
X
Y
B
X
Y
B
X
Y
B
X
L
B
X
L
B
X
L
C
Z
M
C
Z
M
C
Z
M
D
Z
M
D
Z
M
D
Z
M
In SAS, I want to filter the above dataset so I get my final output as:
correlation
policynum
risknum
B
X
Y
B
X
Y
B
X
Y
B
X
L
B
X
L
B
X
L
D
Z
M
D
Z
M
D
Z
M
i.e. for each group of policynum and risknum, if multiple values exist for correlation, I want to keep the second value and get rid of the first value.
If only a single value of correlation exists for a group of policynum and risknum, I want to retain that group in my final output too.
What would be the best way to do this? It might be something simple as I am relatively new to SAS.
Thanks in advance!
If the order of the correlation values, in sort order, is the same ordering as they appear row-wise in the data set you can use SQL. Otherwise, SQL, being based on set theory, which does not have implicit row numbers, can not be used. A DATA step with DOW loop can be used.
Example:
FYI, one common situation in which SAS coders use the phrase 'DOW loop' is when SET & BY statements occur inside a DO loop.
data have;
input correlation $ policynum $ risknum $;
datalines;
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
;
/* keep last group of a nested group */
* SQL can be used only if correlation wanted is ALWAYS highest valued correlation;
proc sql;
create table want as
select * from have
group by policynum, risknum
having correlation = max(correlation)
;
* DATA Step DOW loops can be used when correlation wanted is last occurring correlation within by group;
data want;
do _n_ = 1 by 1 until (last.policynum);
set have;
by policynum risknum notsorted; /* presume at least contiguous */
end;
_want_correlation = correlation;
do _n_ = 1 to _n_;
set have;
if _want_correlation = correlation then OUTPUT;
end;
run;
I would like to add a new column to a dataset but I am not sure how to do so. My dataset has a variable called KEYVAR (character variable) with three different values. A participant can appear multiple times in my dataset, with each row containing a similar or different value for KEYVAR. What I want to do is create a new variable call NEWVAR that counts how many times a participant has a specific value for KEYVAR; when a participant does not have an observation for that specific value, I want NEWVAR to have a result of zero.
Here's an example of the dataset I would like (in this example, I want to count every instance of "Y" per participants as newvar):
have
PARTICIPANT KEYVAR
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
want
PARTICIPANT KEYVAR NEWVAR
A Y 1
A N 1
B Y 3
B Y 3
B Y 3
C W 0
C N 0
C W 0
D Y 2
D N 2
D N 2
D Y 2
D W 2
You can use Proc SQL to compute an aggregate result over a group meeting a criteria, and have that aggregate value automatically merged into the result set.
-OR-
Use a MEANS, TRANSPOSE, MERGE approach
Sample Code (SQL)
data have;
input ID $ value $; datalines;
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
E X
;
proc sql;
create table want as
select ID, value
, sum(value='Y') as Y_COUNT /* relies on logic eval 'math' 0 false, 1 true */
, sum(value='N') as N_COUNT
, sum(value='W') as W_COUNT
from have
group by ID
;
Sample Code (PROC and MERGE)
* format for PRELOADFMT and COMPLETETYPES;
proc format;
value $eachvalue
'Y' = 'Y'
'N' = 'N'
'W' = 'W'
other = '-';
;
run;
* Count how many per combination ID/VALUE;
proc means noprint data=have nway completetypes;
class ID ;
class value / preloadfmt;
format value $eachvalue.;
output out=freqs(keep=id value _freq_);
run;
* TRANSPOSE reshapes to wide (across) data layout, one row per ID;
proc transpose data=freqs suffix=_count out=counts_across(drop=_name_);
by id;
id value;
var _freq_;
where put(value,$eachvalue.) ne '-';
run;
* MERGE;
data want_way_2;
merge have counts_across;
by id;
run;
I have one table displaying 3 obs and 4 fields (ID lastname, firstname and telephone number) for each of ID and I prefer to transpose lastname to 3 fields and for firstname field, I want to only keep the one associated with 1st last name and for telephone number, I want to only keep the one associated with 3rd (last) last name.
Table:
ID Lastname FirstName TelephoneNumber
001 Y A 123
001 W B 345
001 Z C 567
002 M D 789
002 N E 912
002 L F 934
Table want:
ID LastName_1 LastName_2 LastName_3 FirstName TelephoneNumber
001 Y W Z A 567
002 M N L D 934
Can anyone help out?
You can do this with PROC SUMMARY IDGROUP. I will leave it to you to research the syntax.
data id;
input (ID Lastname FirstName)(:$3. 2*:$1.) TelephoneNumber;
cards;
001 Y A 123
001 W B 345
001 Z C 567
002 M D 789
002 N E 912
002 L F 934
;;;;
run;
proc print;
run;
proc summary nway;
class id;
output
out=id2
idgroup(out(firstname)=)
idgroup(last out(telephonenumber)=)
idgroup(out[3](lastname)=)
;
run;
proc print;
run;