Hi, I am having trouble with this question from a school project: converting values to a character missing value.
Run the following program to create the Dataset NOTAPPLY.
DATA NOTAPPLY;
LENGTH A B C D E $ 2;
INPUT ID A $ B $ C $ D $ E $ X Y Z;
DATALINES;
001 Y N N Y Y 1 2 3
002 na NA Y Y Y 3 4 5
003 NA NA NA na na 8 9 10
;
In the SAS data set NOTAPPLY, a value of either NA or na was used in place of a
missing value for all character variables. Create a new SAS data set NEW where
these values are converted to a character missing value.
There are many ways of converting values in SAS. One of them is to use an INFORMAT when reading the data, so your code might look like:
proc format;
invalue $MYMISS
'NA'=' '
'na'=' '
;
run;
DATA NEW;
LENGTH A B C D E $ 2;
INPUT ID A:$MYMISS2. B:$MYMISS2. C:$MYMISS2. D:$MYMISS2. E:$MYMISS2. X Y Z;
DATALINES;
001 Y N N Y Y 1 2 3
002 na NA Y Y Y 3 4 5
003 NA NA NA na na 8 9 10
;
run;
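If the goal is to build NEW from the existing NOTAPPLY data set rather than re-reading the raw lines, an array over the character variables is one possible approach. This is only a sketch: the loop variable _i is a name made up for illustration, and the UPCASE comparison is an assumption that also treats mixed-case spellings such as Na as missing.

```sas
data new;
  set notapply;
  /* _character_ picks up every character variable read so far (A B C D E) */
  array chars{*} _character_;
  do _i = 1 to dim(chars);
    /* upcase() lets a single test cover both NA and na */
    if upcase(chars{_i}) = 'NA' then call missing(chars{_i});
  end;
  drop _i;
run;
```

The numeric variables ID, X, Y and Z are not part of the array, so they pass through unchanged.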
I have a question about the conditional group dataset:
Here is my dataset:
data temp;
input id x1 x2 $;
cards;
1 25 A
1 35 C
1 20 B
3 33 D
;
run;
I want to group them by id and create a new variable Y. The output should look like:
id x1 x2 y
1 25 A C
1 35 C C
1 20 B C
3 33 D D
For example, if id is the same, I need to compare x2: C trumps B, B trumps A, etc.
How should I code this? Thank you very much!
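Not sure if this will generalize, but here is a SQL solution where the maximum of X2 within each ID becomes your Y value. SAS remerges the group maximum back onto every row, so the row order of the result is not guaranteed.
proc sql;
create table want as
select *, max(x2) as y
from temp
group by ID;
quit;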
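A DATA step alternative is a double DoW loop, which keeps the original row order. This is a sketch under two assumptions: the data set temp is already sorted (or at least grouped) by id, and alphabetical order of x2 matches the "trumps" ranking. The $ 8 length for y is chosen to match the default length of x2.

```sas
data want;
  length y $ 8;
  /* first pass over the group: find the largest x2 */
  do until (last.id);
    set temp;
    by id;
    if x2 > y then y = x2;
  end;
  /* second pass over the same group: write each row with the group's y */
  do until (last.id);
    set temp;
    by id;
    output;
  end;
run;
```

Because y is not retained, it is reset to missing at the start of each DATA step iteration, that is, once per id group.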
For the following data I am trying to filter rows, of each group ID, based on these conditions:
After every row with type='B' and value='Y' do the following
Remove the rows until the next row having type='F' and value='Y'.
If there is no row with type='B' and value='Y', then keep all of them (e.g. id=002).
Can we create the flag variable as shown in my want dataset, so that I can filter on Flag='Y'?
Have
ID Type Date Value
001 F 1/2/2018 Y
001 B 1/3/2018
001 B 1/4/2018 Y
001 B 1/5/2018
001 B 1/6/2018
001 F 1/6/2018 Y
001 B 1/6/2018
001 B 1/7/2018
001 B 1/8/2018 Y
001 B 1/8/2018
001 B 1/9/2018
002 F 1/2/2018 Y
002 B 1/3/2018
002 B 1/4/2018
Want
ID Type Date Value Flag
001 F 1/2/2018 Y Y
001 B 1/3/2018 Y
001 B 1/4/2018 Y Y
001 B 1/5/2018
001 B 1/6/2018
001 F 1/6/2018 Y Y
001 B 1/6/2018 Y
001 B 1/7/2018 Y
001 B 1/8/2018 Y Y
001 B 1/8/2018
001 B 1/9/2018
002 F 1/2/2018 Y Y
002 B 1/3/2018 Y
002 B 1/4/2018 Y
I tried to do the following
data F;
set have;
where Type='F';run;
data B;
set have;
where Type='B';run;
proc sql;
create table all as select
a.* from B as b
inner join F as f
on a.id=b.id
and b.date >= a.date;
quit;
This includes all the rows from my have dataset. Any help is much appreciated.
The criteria for deciding whether a row belongs to a contiguous sub-group (call it a 'run' of rows) within an ID group are relatively simple, but the state can be compromised by some odd cases in the data:
two or more B Y before a F Y (extra 'run ending')
two or more F Y before a B Y ('run starting' within a run)
first row in group not F Y ('run starting' not first in group)
data want(drop=run_:);
SET have;
BY id;
run_first = (type='F' and value='Y');
run_final = (type='B' and value='Y');
* raise the flag when a row meets the criteria for starting a contiguous sub-group;
run_flag + run_first;
if first.id and NOT run_flag then
put 'WARNING: first row in group ' id= ' is not F Y, this may be incorrect';
if run_flag > 1 and run_first then
put 'WARNING: an additional F Y before a B Y at row ' _n_;
if run_flag then
OUTPUT;
if run_flag = 0 and run_final then
put 'WARNING: an additional B Y before a F Y at row ' _n_;
* reset the flag when a row ends the contiguous sub-group or the ID group ends;
if last.id or run_final then
run_flag = 0;
run;
Like Richard, I don't quite understand what the filtering criteria are.
I can see one problem with your join: you used a.* in your select statement, but "b" and "f" as your dataset aliases. This will not work because no dataset has been assigned to the alias "a".
The proper way would be as follows:
proc sql;
create table all as
select b.* from B as b
inner join F as f
on b.id=f.id
and b.date >= f.date;
quit;
However, even then, I don't believe an inner join is the proper way to solve your problem. Could you please let us know your filtering condition?
I have a solution, but it is not the most elegant (and might not cover corner cases). If anyone else has a better solution, please share.
First, to create the dataset in case anyone else wants to try it out:
Data work.have;
* @n moves to column n of the line (#n would move to line n of the record);
infile datalines truncover;
input @1 ID 3.
@5 Type $1.
@7 Date date7.
@15 Value $1.;
format ID 3.
Type $1.
Date date11.
Value $1.;
datalines;
001 F 02Jan18 Y
001 B 03Jan18
001 B 04Jan18 Y
001 B 05Jan18
001 B 06Jan18
001 F 06Jan18 Y
001 B 06Jan18
001 B 07Jan18
001 B 08Jan18 Y
001 B 08Jan18
001 B 09Jan18
002 F 02Jan18 Y
002 B 03Jan18
002 B 04Jan18
;
run;
Solution:
I based this on your edited suggestion of creating a flag variable.
Data Flag;
set work.have;
if Type = 'B' and Value = 'Y' then
flag + 1;
if Type = 'F' then
flag = 0;
if Value ne 'Y' and flag = 1 then delete;
run;
The flag variable starts at 0 because it is created by a sum statement, which also retains its value across rows.
The first IF-THEN statement finds the Type='B', Value='Y' rows and raises the flag to 1, which then carries over to the rows that follow.
The second IF-THEN statement finds the Type='F' rows and resets the flag to 0.
The last IF-THEN statement deletes every row with flag=1 except the Type='B', Value='Y' row that raised it.
I hope this applies to your problem.
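If the Flag column from the Want table is needed rather than a filtered data set, a retained switch can produce it. This is a sketch, not a tested general solution: the data set name flagged and the helper variable skip are made up for illustration, and the step assumes work.have is sorted by ID.

```sas
data flagged;
  set work.have;
  by id;
  retain skip 0;
  /* never carry a skip state over from the previous ID */
  if first.id then skip = 0;
  /* an F Y row re-opens the kept region */
  if type = 'F' and value = 'Y' then skip = 0;
  if skip then flag = ' ';
  else flag = 'Y';
  /* a B Y row is itself kept, but the rows after it are skipped */
  if type = 'B' and value = 'Y' then skip = 1;
  drop skip;
run;
```

Filtering is then a matter of adding where flag = 'Y' in a later step. Unlike the counter version, a second B Y row inside an already skipped region leaves the switch at 1 instead of pushing it to 2.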
I have a series of string values with missing observations. I would like to use flat substitution. For instance variable x has 3 available values. There should be a 33.333% chance that a missing value will be assigned to the available values for x under this substitution method. How would I do this?
DATA have;
INPUT id a $ b $ c $ x;
CARDS;
1 Y Male . 5
2 Y Female . 4
3 . Female Tall 4
4 Y . Short 2
5 N Male Tall 1
;
Run;
You could use temporary arrays to store the possible values. Then generate a random index into the array.
DATA have;
INPUT id a $ b $ c $ x;
CARDS;
1 Y Male . 5
2 Y Female . 4
3 . Female Tall 4
4 Y . Short 2
5 N Male Tall 1
;
data want ;
set have ;
array possible_b (2) $8 _temporary_ ('Male','Female') ;
if missing(b) then b=possible_b(1+int(rand('uniform')*dim(possible_b)));
run;
I did this by generating random numbers and hard-coding the limits. There should be an easier way to do this, but for the purposes of the question this should work.
option missing='';
data begin;
input a $;
cards;
a
.
b
c
.
e
.
f
g
h
.
.
j
.
;
run;
data intermediate;
set begin;
if a EQ '' then help= rand("uniform");
else help=.;
run;
data wanted;
set intermediate;
if a EQ '' then do;
* cut points at 1/3 and 2/3 give each value an equal chance;
if 0<=help<1/3 then a='V1';
else if 1/3<=help<2/3 then a='V2';
else if 2/3<=help then a='V3';
end;
drop help;
run;
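The hard-coded limits can be avoided by letting RAND('TABLE', ...) draw the index directly. This sketch assumes the same three illustrative replacement values V1 to V3 and the begin data set from above; the seed passed to CALL STREAMINIT and the data set name wanted_table are arbitrary choices for illustration.

```sas
data wanted_table;
  set begin;
  /* candidate values live only in memory, not in the output data set */
  array vals{3} $ 2 _temporary_ ('V1', 'V2', 'V3');
  /* fix the random number stream so reruns give the same draws */
  if _n_ = 1 then call streaminit(12345);
  /* rand('table', p1, p2, p3) returns 1, 2 or 3 with the given probabilities */
  if missing(a) then a = vals{rand('table', 1/3, 1/3, 1/3)};
run;
```

This does the draw and the assignment in one step, so the intermediate data set and the help variable are not needed.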
I'm trying to pull only the last 4 working days of data in SAS. I tried the following code, but I'm not getting what I intended:
data input;
Input id $ id1 $ id2 $ num date date9.;
Format Date Date9.;
datalines;
x y z 3 19JUL2015
x y z 2 18JUL2015
x y z 3 17JUL2015
x y z 2 16JUL2015
x y z 3 15JUL2015
x y z 2 14JUL2015
x y z 3 13JUL2015
a b c 1 12JUL2015
a b c 1 11JUL2015
a b c 1 10JUL2015
a b c 1 09JUL2015
a b c 1 08JUL2015
a b c 2 07JUL2015
x y z 1 06JUL2015
;
Run;
Data test;
Set input;
Weekday=Weekday(Date);
intck=intck('weekday',Date,today());
*if intck('weekday',Date,today()) >4;
if 1<Weekday(Date)<7 and Date>=today()-4;
Run;
I think you need to reverse the > in your code, and add a qualification that you only want weekdays:
Data test;
Set input;
Weekday=Weekday(Date);
intck=intck('weekday',Date,today());
if intck('weekday',Date,'20JUL2015'd) le 4 and 1<weekday(Date)<7;
*if 1<Weekday(Date)<7 and Date>='20JUL2015'd-5;
Run;
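For data that is current, the hard-coded date literal can be swapped for the run date. This is a sketch only: the macro variable asof and the data set name recent are hypothetical, and the added lower bound of 0 is an assumption to keep future-dated rows out.

```sas
/* asof holds today's date as a SAS date value (a plain number) */
%let asof = %sysfunc(today());

data recent;
  set input;
  /* keep Monday to Friday rows no more than 4 weekday intervals before asof */
  if 1 < weekday(date) < 7
     and 0 <= intck('weekday', date, &asof) <= 4;
run;
```

With the July 2015 sample rows, &asof would need to be set to a date literal such as '20JUL2015'd, as in the answer above, for any rows to be selected.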
I have the following SAS data set:
data mydata;
LENGTH id 4.0 num 4.0 A $ 4. B $ 8. C $ 20.;
input id num A $ B $ C $;
datalines;
1 1 x yy zzzzz
2 1 x yy zzzzz
3 2 xq yyqq zzzzzqqqqq
4 1 x yy zzzzz
5 3 xqw yyqqww zzzzzqqqqqwwwww
6 1 x yy zzzzz
7 4 xqwe yyqqwwee zzzzzqqqqqwwwwweeeee
;
which looks like
mydata
-------------------
id num A B C
1 1 x yy zzzzz
2 1 x yy zzzzz
3 2 xq yyqq zzzzzqqqqq
4 1 x yy zzzzz
5 3 xqw yyqqww zzzzzqqqqqwwwww
6 1 x yy zzzzz
7 4 xqwe yyqqwwee zzzzzqqqqqwwwwweeeee
The problem is that each of the observations where num > 1 actually contains data for multiple "observations" and I would like to split it up using some logic in SAS. Here's an example for what I want to get:
mydatawanted
-------------------
id num A B C
1 1 x yy zzzzz
2 1 x yy zzzzz
3 1 x yy zzzzz
3 1 q qq qqqqq
4 1 x yy zzzzz
5 1 x yy zzzzz
5 1 q qq qqqqq
5 1 w ww wwwww
6 1 x yy zzzzz
7 1 x yy zzzzz
7 1 q qq qqqqq
7 1 w ww wwwww
7 1 e ee eeeee
Basically, if num > 1 I want to take the substring of each variable depending on its length, for each item, and then output those as new observations with num = 1. Here is what I have tried to code so far:
data mydata2(drop=i _:);
set mydata; /*use the data from the original data set */
_temp_id = id; /*create temp variables from the currently read observation */
_temp_num = num;
_temp_A = A;
_temp_B = B;
_temp_C = C;
if (_temp_num > 1) THEN /* if num in current record > 1 then split them up */
do i = 1 to _temp_num;
id = _temp_id; /* keep id the same */
num = 1; /* set num to 1 for each new observation */
A = substr(_temp_A,i,i); /*split the string by 1s */
B = substr(_temp_B,1 + 2 * (i - 1),i * 2); /*split the string by 2s */
C = substr(_temp_C,1 + 5 * (i - 1),i * 5); /*split the string by 5s */
OUTPUT; /* output this new observation with the changes */
end;
else OUTPUT; /* if num == 1 then output without any changes */
run;
However it doesn't work as I wanted it to (I put in some comments to show what I thought was happening at each step). It actually produces the following result:
mydata2
-------------------
id num A B C
1 1 x yy zzzzz
2 1 x yy zzzzz
3 1 x yy zzzzz
3 1 q qq qqqqq
4 1 x yy zzzzz
5 1 x yy zzzzz
5 1 qw qqww qqqqqwwwww
5 1 w ww wwwww
6 1 x yy zzzzz
7 1 x yy zzzzz
7 1 qw qqww qqqqqwwwww
7 1 we wwee wwwwweeeee
7 1 e ee eeeee
This mydata2 result isn't the same as mydatawanted. The lines where num = 1 are fine but when num > 1 the output records are much different from what I want. The total number of records are correct though. I'm not really sure what is happening, since this is the first time I tried any complicated SAS logic like this, but I would appreciate any help in either fixing my code or accomplishing what I want to do using any alternate methods. Thank you!
edit: I fixed a problem with my original input mydata data statement and updated the question.
Your substrings are incorrect. Substr takes the arguments (original string, start, length), not (original string, start, ending position). So length should be 1,2,5 not i,i*2,i*5.
A = substr(_temp_A,i,1); /*split the string by 1s */
B = substr(_temp_B,1 + 2 * (i - 1),2); /*split the string by 2s */
C = substr(_temp_C,1 + 5 * (i - 1),5); /*split the string by 5s */
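Put together, the corrected step might look like the sketch below. It assumes the fixed widths of 1, 2 and 5 characters per item from the question; the shorter temporary names (_n, _a, _b, _c) are my own choice, and the separate branch for num = 1 is dropped because the loop then simply runs once.

```sas
data mydata2(drop=i _:);
  set mydata;
  /* keep copies of the packed values before they are overwritten */
  _n = num;
  _a = A;
  _b = B;
  _c = C;
  num = 1;
  do i = 1 to _n;
    /* substr(string, start, length): the length stays fixed per variable */
    A = substr(_a, i, 1);
    B = substr(_b, 1 + 2 * (i - 1), 2);
    C = substr(_c, 1 + 5 * (i - 1), 5);
    output;
  end;
run;
```

For a row with num = 1 the three SUBSTR calls return the first 1, 2 and 5 characters, which is the whole value in the sample data.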