I have the following SAS data set:
data mydata;
LENGTH id 4.0 num 4.0 A $ 4. B $ 8. C $ 20.;
input id num A $ B $ C $;
datalines;
1 1 x yy zzzzz
2 1 x yy zzzzz
3 2 xq yyqq zzzzzqqqqq
4 1 x yy zzzzz
5 3 xqw yyqqww zzzzzqqqqqwwwww
6 1 x yy zzzzz
7 4 xqwe yyqqwwee zzzzzqqqqqwwwwweeeee
;
which looks like
mydata
-------------------
id num A B C
1 1 x yy zzzzz
2 1 x yy zzzzz
3 2 xq yyqq zzzzzqqqqq
4 1 x yy zzzzz
5 3 xqw yyqqww zzzzzqqqqqwwwww
6 1 x yy zzzzz
7 4 xqwe yyqqwwee zzzzzqqqqqwwwwweeeee
The problem is that each of the observations where num > 1 actually contains data for multiple "observations" and I would like to split it up using some logic in SAS. Here's an example for what I want to get:
mydatawanted
-------------------
id num A B C
1 1 x yy zzzzz
2 1 x yy zzzzz
3 1 x yy zzzzz
3 1 q qq qqqqq
4 1 x yy zzzzz
5 1 x yy zzzzz
5 1 q qq qqqqq
5 1 w ww wwwww
6 1 x yy zzzzz
7 1 x yy zzzzz
7 1 q qq qqqqq
7 1 w ww wwwww
7 1 e ee eeeee
Basically, if num > 1 I want to take the substring of each variable depending on its length, for each item, and then output those as new observations with num = 1. Here is what I have tried to code so far:
data mydata2(drop=i _:);
set mydata; /*use the data from the original data set */
_temp_id = id; /*create temp variables from the currently read observation */
_temp_num = num;
_temp_A = A;
_temp_B = B;
_temp_C = C;
if (_temp_num > 1) THEN /* if num in current record > 1 then split them up */
do i = 1 to _temp_num;
id = _temp_id; /* keep id the same */
num = 1; /* set num to 1 for each new observation */
A = substr(_temp_A,i,i); /*split the string by 1s */
B = substr(_temp_B,1 + 2 * (i - 1),i * 2); /*split the string by 2s */
C = substr(_temp_C,1 + 5 * (i - 1),i * 5); /*split the string by 5s */
OUTPUT; /* output this new observation with the changes */
end;
else OUTPUT; /* if num == 1 then output without any changes */
run;
However it doesn't work as I wanted it to (I put in some comments to show what I thought was happening at each step). It actually produces the following result:
mydata2
-------------------
id num A B C
1 1 x yy zzzzz
2 1 x yy zzzzz
3 1 x yy zzzzz
3 1 q qq qqqqq
4 1 x yy zzzzz
5 1 x yy zzzzz
5 1 qw qqww qqqqqwwwww
5 1 w ww wwwww
6 1 x yy zzzzz
7 1 x yy zzzzz
7 1 qw qqww qqqqqwwwww
7 1 we wwee wwwwweeeee
7 1 e ee eeeee
This mydata2 result isn't the same as mydatawanted. The lines where num = 1 are fine but when num > 1 the output records are much different from what I want. The total number of records are correct though. I'm not really sure what is happening, since this is the first time I tried any complicated SAS logic like this, but I would appreciate any help in either fixing my code or accomplishing what I want to do using any alternate methods. Thank you!
edit: I fixed a problem with my original input mydata data statement and updated the question.
Your substrings are incorrect. Substr takes the arguments (original string, start, length), not (original string, start, ending position). So length should be 1,2,5 not i,i*2,i*5.
A = substr(_temp_A,i,1); /*split the string by 1s */
B = substr(_temp_B,1 + 2 * (i - 1),2); /*split the string by 2s */
C = substr(_temp_C,1 + 5 * (i - 1),5); /*split the string by 5s */
Related
Im trying to set any5 = 'Yes' if there is a number 5 in any of the columns Q1 to Q5. However my code below only shows for the last column.
data survey;
infile datalines firstobs=2;
input ID 3. Q1-Q5;
array score{5} _temporary_ (5,5,5,5,5);
array Ques{5} Q1-Q5;
do i =1 to 5;
if Ques{i} = score{i} then any5='Yes';
else any5='No';
end;
drop i;
datalines;
ID Q1 Q2 Q3 Q4 Q5
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
run;
Keep it simple :-)
data survey;
infile datalines;
input ID 3. Q1-Q5;
array Ques{*} Q1 - Q5;
any5 = ifc(5 in Ques, 'Yes', 'No');
datalines;
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
Use the COUNTC function to compute the number of times 5 is repeated in your Q 1-Q5 columns then use the IFC function to return a character value based on whether the expression is true, false, or missing.
data survey;
infile datalines firstobs=2;
input ID 3. Q1-Q5;
any5 = ifc(countc(cats(of Q:),'5')>0,'Yes','No');
datalines;
ID Q1 Q2 Q3 Q4 Q5
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
run;
Result:
535 1 3 5 4 2 Yes
12 5 5 4 4 3 Yes
723 2 1 2 1 1 No
7 3 5 1 4 2 Yes
Use the WHICHN function to determine the index of the target value in a list of values.
In your case assign the test for any index matching
any5 = whichn (5, of ques(*)) > 0;
From the documentation:
WHICHN Function
Searches for a numeric value that is equal to the first argument, and
returns the index of the first matching value.
Syntax
WHICHN(argument, value-1 <, value-2, ...>)
It is a simple mistake in your logic. You are setting ANY5 to YES or NO on every time through the loop. Since you continue going through the loop even after the match is found you overwrite the results from the previous times through the loop, so only the results of the last test survive.
Here is one way. Set the answer to NO before the loop and remove the ELSE clause.
any5='No ';
do i =1 to 5;
if Ques{i} = 5 then any5='Yes';
end;
Or stop when you have your answer.
do i =1 to 5 until(any5='Yes');
if Ques{i} = score{i} then any5='Yes';
else any5='No';
end;
Or skip the looping altogether.
if whichn(5, of Q1-Q5) then any5='Yes';
else any5='No';
Or even easier create any5 as numeric instead of character. SAS will return 1 for TRUE and 0 for FALSE as the result of a boolean expression.
any5 = ( 0 < whichn(5, of Q1-Q5) );
I have this data
data have;
input cust_id pmt months;
datalines;
AA 100 0
AA 50 1
AA 200 2
AA 350 3
AA 150 4
AA 700 5
BB 500 0
BB 300 1
BB 1000 2
BB 800 3
run;
and I'd like to generate an output that looks like this
data want;
input cust_id pmt months i;
datalines;
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 50 1 0
AA 200 1 1
AA 350 1 2
AA 150 1 3
AA 700 1 4
AA 200 2 0
AA 350 2 1
AA 150 2 2
AA 700 2 3
AA 350 3 0
AA 150 3 1
AA 700 3 2
AA 150 4 0
AA 700 4 1
AA 700 5 0
BB 500 0 0
BB 300 0 1
BB 1000 0 2
BB 800 0 3
BB 300 1 0
BB 1000 1 1
BB 800 1 2
BB 1000 2 0
BB 800 2 1
BB 800 3 0
run;
There are few thousand rows with different cust_ID and different months length. I tried joining tables but it couldn't get me the sequence of 100 50 200 350 150 700 (for cust_ID AA). I could only replicated 100 if my months are 0, 50 if months are 1 & so on. I created a maxval which is the maximum month value. My code is something like this
data temp1;
set have;
do i = 0 to maxval;
if (months <=maxval) then output;
end;
i thought of creating a uniquekey to join my have data and temp1 data but it could only give me
AA 100 0 0
AA 50 0 1
AA 200 0 2
AA 350 0 3
AA 150 0 4
AA 700 0 5
AA 100 1 0
AA 50 1 1
AA 200 1 2
AA 350 1 3
AA 150 1 4
AA 100 2 0
AA 50 2 1
AA 200 2 2
AA 350 2 3
AA 100 3 0
AA 50 3 1
AA 200 3 2
AA 100 4 0
AA 50 4 1
AA 100 5 0
Any thoughts or different approach on how to generate my want table? Thank you!
This problem is a little tricky because you have things going in three directions
The number of group repetitions descends from group count. Within each repetition:
The payments item start index ascends and terminates at group count
The months (as I) item start index is 1 and termination descends from group count
SQL
One SQL approach is a three-way reflexive join with-in group. The months values act as a within group index and must be monotonic by 1 from 0 for this to work.
proc sql;
create table want as
select X.cust_id, Z.pmt, X.months, Y.months as i
from have as X
join have as Y on X.cust_id = Y.cust_id
join have as Z on Y.cust_id = Z.cust_id
where
X.months + Y.months = Z.months
order by
X.cust_id, X.months, Z.months
;
quit;
DATA Step
A DOW loop is used to count the group size. 2-deep looping crosses the combinations and three point= values are computed (finagled) to retrieve the relevant values.
data want2;
if 0 then set have; * prep pdv to match have;
retain point_end ;
point_start = sum(point_end,0);
do group_count = 1 by 1 until (last.cust_id);
set have(keep=cust_id);
by cust_id;
end;
do index1 = 1 to group_count;
point1 = point_start + index1;
set have (keep=months) point = point1;
do index2 = 0 to group_count - index1 ;
point2 = point_start + index1 + index2;
set have (keep=pmt) point=point2;
point3 = point_start + index2 + 1;
set have (keep=months rename=months=i) point=point3;
output;
end;
end;
point_end = point1;
keep cust_id pmt months i;
run;
Try the following:
data want(drop = start_obs limit j);
retain start_obs 1;
/* read by cust_id group */
do until(last.cust_id);
set have end = last_obs;
by cust_id;
end;
limit = months;
do j = 0 to limit;
i = 0;
do obs_num = start_obs + j to start_obs + limit;
/* read specific observations using direct access */
set have point = obs_num;
months = j;
output;
i = i + 1;
end;
end;
/* prepare for next direct access read */
start_obs = limit + 2;
if last_obs then
stop;
run;
I have data set,
CustID Rating
1 A
1 A
1 B
2 A
2 B
2 C
2 D
3 X
3 X
3 Z
4 Y
4 Y
5 M
6 N
7 O
8 U
8 T
8 U
And expecting Output
CustID Rating ID
1 A 1
1 A 1
1 B 1
2 A 1
2 B 2
2 C 3
2 D 4
3 X 1
3 X 1
3 Z 2
4 Y 1
4 Y 1
5 M 1
6 N 1
7 O 1
8 U 1
8 T 2
8 U 1
In the solution below, I selected the distinct possible ratings into a macro variable to be used in an array statement. These distinct values are then searched in the ratings tolumn to return the number assigned at each successful find.
You can avoid the macro statement in this case by replacing the %sysfunc by 3 (the number of distinct ratings, if you know it before hand). But the %sysfunc statement helps resolve this in case you don't know.
data have;
input CustomerID Rating $;
cards;
1 A
1 A
1 B
2 A
2 A
3 A
3 A
3 B
3 C
;
run;
proc sql noprint;
select distinct quote(strip(rating)) into :list separated by ' '
from have
order by 1;
%put &list.;
quit;
If you know the number before hand:
data want;
set have;
array num(3) $ _temporary_ (&list.);
do i = 1 to dim(num);
if findw(rating,num(i),'tips')>0 then id = i;
end;
drop i;
run;
Otherwise:
%macro Y;
data want;
set have;
array num(%sysfunc(countw(&list., %str( )))) $ _temporary_ (&list.);
do i = 1 to dim(num);
if findw(rating,num(i),'tips')>0 then id = i;
end;
drop i;
run;
%mend;
%Y;
The output:
Obs CustomerID Rating id
1 1 A 1
2 1 A 1
3 1 B 2
4 2 A 1
5 2 A 1
6 3 A 1
7 3 A 1
8 3 B 2
9 3 C 3
Assuming data is sorted by customerid and rating (as in the original unedited question). Is the following what you want:
data want;
set have;
by customerid rating;
if first.customerid then
id = 0;
if first.rating then
id + 1;
run;
I have the following two sas datasets:
data have ;
input a b;
cards;
1 15
2 10
3 40
4 200
1 25
2 15
3 10
4 75
1 1
2 99
3 30
4 100
;
data ref ;
input x y;
cards;
1 10
2 20
3 30
4 100
;
I would like to have the following dataset:
data want ;
input a b outcome ;
cards;
1 15 0
2 10 1
3 40 0
4 200 0
1 25 0
2 15 1
3 10 1
4 75 1
1 1 1
2 99 0
3 30 1
4 100 1
;
I would like to create a variable 'outcome' which is produced by an if statement upon conditions of variables a, b, x and y. As in reality the 'have' dataset is extremely large I would like to avoid a sort and merging the two datasets together (where a = x).
I am trying to use macro variables with the following code:
data _null_ ;
set ref ;
call symput('listx', x) ;
call symput('listy', y) ;
run ;
data want ;
set have ;
if a=&listx and b le &listy then outcome = 1 ; else outcome = 0 ;
run ;
which does not however produce the desired result:
data want ;
input a b outcome ;
cards;
1 15 0
2 10 1
3 40 0
4 200 0
1 25 0
2 15 1
3 10 1
4 75 1
1 1 1
2 99 0
3 30 1
4 100 1
;
redone my solution using hash tables. Below my approach
data ref2(rename=(x=a));
set ref ;
run;
data want;
declare Hash Plan ();
rc = plan.DefineKey ('a'); /*x originally*/
rc = plan.DefineData('a', 'y');
rc = plan.DefineDone();
do until (eof1);
set ref2 end=eof1;
rc = plan.add(); /*add each record from ref2 to plan (hash table)*/
end;
do until (eof2);
set have end=eof2;
call missing(y);
rc = plan.find();
outcome = (rc =0 and b<y);
output;
end;
stop;
run;
hope it helps
I have a string 28,16OB4N7L8O4L using two arrays I had split into separate variables.
hrs1 hrs2 hrs3 hrs4 hrs5 hrs6 hrs7
28 16 1 4 7 8 4
cd1 cd2 cd3 cd4 cd5 cd6 cd7
, O B N L O L
Now I want to summarize across variables, if same value repeats in character variable in the above example 'O' and'L' are repeated, in that case I want to merge as one and add the respective hrs.
Output should be:
, O B N L -COLUMN
28 24 1 4 11 -VALUES
Here's an example of transposing to normalized (long skinny format). I added a second sample record.
data have;
input id hrs1-hrs7 (cd1-cd7) ($1.);
cards;
1 28 16 1 4 7 8 4 ,OBNLOL
2 1 2 3 4 5 6 7 AAAABBB
;
run;
data tran (keep=id hr cd) / view=tran ;
set have ;
array hrs{*} hrs1-hrs7 ;
array cds{*} cd1-cd7 ;
do i=1 to dim(hrs) ;
hr=hrs{i} ;
cd=cds{i} ;
output ;
end ;
run ;
proc sql ;
select id, cd, sum(hr)
from tran
group by id, cd
;
quit ;
Returns:
id cd
________________
1 , 28
1 B 1
1 L 11
1 N 4
1 O 24
2 A 10
2 B 18