How to write a foreach loop statement in SAS? - sas

I'm working in SAS as a novice. I have two datasets:
Dataset1
Unique ID
ColumnA
1
15
1
39
2
20
3
10
Dataset2
Unique ID
ColumnB
1
40
2
55
2
10
For each UniqueID, I want to subtract all values of ColumnB by each value of ColumnA. And I would like to create a NewColumn that is 1 anytime 1>ColumnB-Column >30. For the first row of Dataset 1, where UniqueID= 1, I would want SAS to go through all the rows in Dataset 2 that also have a UniqueID = 1 and determine if there is any rows in Dataset 2 where the difference between ColumnB and ColumnA is greater than 1 or less than 30. For the first row of Dataset 1 the NewColumn should be assigned a value of 1 because 40 - 15 = 25. For the second row of Dataset 1 the NewColumn should be assigned a value of 0 because 40 - 39 = 1 (which is not greater than 1). For the third row of Dataset 1, I again want SAS to go through every row of ColumnB in Dataset 2 that has the same UniqueID as in Dataset1, so 55 - 20 = 35 (which is greater than 30) but NewColumn would still be assigned a value of 1 because (moving to row 3 of Datatset 2 which has UniqueID =2) 20 - 10 = 10 which satisfies the if statement.
So I want my output to be:
Unique ID
ColumnA
NewColumn
1
15
1
1
30
0
2
20
1
I have tried concatenating Dataset1 and Dataset2 into a FullDataset. Then I tried using a do loop statement but I can't figure out how to do the loop for each value of UniqueID. I tried using BY but that of course produces an error because that is only used for increments.
DATA FullDataset;
set Dataset1 Dataset2; /*Concatenate datasets*/
do i=ColumnB-ColumnA by UniqueID;
if 1<ColumnB-ColumnA<30 then NewColumn=1;
output;
end;
RUN;
I know I'm probably way off but any help would be appreciated. Thank you!

So, the way that answers your question most directly is the keyed set. This isn't necessarily how I'd do this, but it is fairly simple to understand (as opposed to a hash table, which is what I'd use, or a SQL join, probably what most people would use). This does exactly what you say: grabs a row of A, says for each matching row of B check a condition. It requires having an index on the datasets (well, at least on the B dataset).
data colA(index=(id));
input ID ColumnA;
datalines;
1 15
1 39
2 20
3 10
;;;;
data colB(index=(id));
input ID ColumnB;
datalines;
1 40
2 55
2 30
;;;;
run;
data want;
*base: the colA dataset - you want to iterate through that once per row;
set colA;
*now, loop while the check variable shows 0 (match found);
do while (_iorc_ = 0);
*bring in other dataset using ID as key;
set colB key=ID ;
* check to see if it matches your requirement, and also only check when _IORC_ is 0;
if _IORC_ eq 0 and 1 lt ColumnB-ColumnA lt 30 then result=1;
* This is just to show you what is going on, can remove;
put _all_;
end;
*reset things for next pass;
_ERROR_=0;
_IORC_=0;
run;

Related

Using a counter to find multiple occurences on same day from ID & date

I am trying find when a person has multiple occurences on the same day & when they do not.
My data looks something like this
data have;
input id date ;
datalines ;
1 nov10
1 nov15
2 nov11
2 nov11
2 nov14
3 nov12
4 nov17
4 nov19
4 nov19
etc...;
I want to create a new variable to show when an occurence happens on the same day or not. I want my end rseult to look like
data want;
input id date occ;
1 nov10 1
1 nov15 1
2 nov11 2
2 nov11 2
2 nov14 1
3 nov12 1
4 nov17 1
4 nov19 2
4 nov19 2
etc...;
THis is what I tried but it is not working for each date instead only doing it if the date repeats on the first. Here is my code
data want ;
set have ;
by id date;
if first.date then occ = 1;
else occ = 2;
run;
Your IF/THEN logic is just a complicated way to do
occ = 1 + not first.date;
Which is just a test of whether or not it is the first observation for this date.
Looks like you want to instead test whether or not there are multiple observations per date.
occ = 1 + not (first.date and last.date) ;

SAS-How to count the number of observation over the 10 years prior to certain month

I have a sample that include two variables: ID and ym. ID id refer to the specific ID for each trader and ym refer to the year-month variable. And I want to create a variable that show the number of years over the 10 years period prior month t as shown in the following figure.
ID ym Want
1 200101 0
1 200301 1
1 200401 2
1 200501 3
1 200601 4
1 200801 5
1 201201 5
1 201501 4
2 200001 0
2 200203 1
2 200401 2
2 200506 3
I attempt to use by function and fisrt.id to count the number.
data want;
set have;
want+1;
by id;
if first.id then want=1;
run;
However, the year in ym is not continuous. When the time gap is higher than 10 years, this method is not working. Although I assume I need to count the number of year in a rolling window (10 years), I am not sure how to achieve it. Please give me some suggestions. Thanks.
Just do a self join in SQL. With your coding of YM it is easy to do interval that is a multiple of a year, but harder to do other intervals.
proc sql;
create table want as
select a.id,a.ym,count(b.ym) as want
from have a
left join have b
on a.id = b.id
and (a.ym - 1000) <= b.ym < a.ym
group by a.id,a.ym
order by a.id,a.ym
;
quit;
This method retains the previous values for each ID and directly checks to see how many are within 120 months of the current value. It is not optimized but it works. You can set the array m() to the maximum number of values you have per ID if you care about efficiency.
The variable d is a quick shorthand I often use which converts years/months into an integer value - so
200012 -> (2000*12) + 12 = 24012
200101 -> (2001*12) + 1 = 24013
time from 200012 to 200101 = 24013 - 24012 = 1 month
data have;
input id ym;
datalines;
1 200101
1 200301
1 200401
1 200501
1 200601
1 200801
1 201201
1 201501
2 200001
2 200203
2 200401
2 200506
;
proc sort data=have;
by id ym;
data want (keep=id ym want);
set have;
by id;
retain seq m1-m100;
array m(100) m1-m100;
** Convert date to comparable value **;
d = 12 * floor(ym/100) + mod(ym,10);
** Initialize number of previous records **;
want = 0;
** If first record, set retained values to missing and leave want=0 **;
if first.id then call missing(seq,of m1-m100);
** Otherwise loop through previous months and count how many were within 120 months **;
else do;
do i = 1 to seq;
if d <= (m(i) + 120) then want = want + 1;
end;
end;
** Increment variables for next iteration **;
seq + 1;
m(seq) = d;
run;
proc print data=want noobs;

SAS and do loop

I'm writing a program in SAS.
Here's the dataset I have:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
For each ID, I want to delete the record if variable huuse ne 1, until I get to the first huuse=1. Then I want to keep that record and all subsequent records for that id, no matter what value huuse is. So for id=1, I want to delete the first two records than keep all records for id=1 starting with the 3rd record. For id=2, the first record has huuse=1, so I want to keep all records for id=2.
The data set I want should look like this:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
I tried this code, but it removes all records that have huuse ne 1.
data want;
set have;
by id;
do until (huuse=1);
if huuse = 1 then LEAVE;
if huuse ne 1 then DELETE;
END;
run;
I've tried several variations of do loops, but they all do the same thing.
The DATA step is a program with an implicit loop that reads every record of the data set specified in the SET statement. Any program data vector (pdv) variables not coming from the data set are, by default, reset to missing at the top of the implicit loop. You change that behavior using a RETAIN statement to name variables that should not get reset.
So, in your problem you have two situations when a tracking variable is needed. The variable will track the state of the condition Have I seen huuse=1 yet in this group ?. Call this variable one_flag
RETAIN one_flag; so you control when it's value changes
At the start of a BY group one_flag needs to be reset to false (0)
When huuse is first seen as 1 set the flag to true (1)
Example:
data want(drop=one_flag);
set have;
by id;
retain one_flag 0;
if first.id then one_flag = 0;
if not one_flag and huuse = 1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
run;
You can place the SET and BY statement inside an explicit DO and that changes the operating behavior of the program, especially if the explicit loop is terminated according to a LAST.<var> automatic variable. Such a loop is commonly called a DOW loop by SAS programmers. There is no phrase DOW loop in the SAS documentation.
Example:
data want;
do until (last.id);
set have;
by id;
if not one_flag and huuse=1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
end;
run;
Because the looping is explicit and never reaches the TOP of the program with in the loop, there is no need to RETAIN the flag variable, nor reset it. Program variables that are not retained are reset automatically at the top of the program, and the top of the program is only reached at the start of the BY group. Learn more about this programming construct in the SGF 2013 paper "The Magnificent DO", Paul M. Dorfman
Your source and result are same :-)
But if I understood your question correctly the solution is quite simple with a retain solution. I add 2 lines to the example to make it clear that I understood correctly.
The code with example table:
data test;
id=1;huuse=0;days=4;output;
id=1;huuse=0;days=3;output;
id=1;huuse=1;days=12;output;
id=1;huuse=1;days=1;output;
id=1;huuse=2;days=15;output;
id=2;huuse=1;days=13;output;
id=2;huuse=0;days=16;output;
id=2;huuse=1;days=18;output;
id=2;huuse=0;days=44;output;
id=3;huuse=0;days=1;output;
id=3;huuse=1;days=2;output;
run;
data test_output;
set test;
retain keep_id -1;
if (keep_id ne id and huuse ne 0) then keep_id=id;
if keep_id = id then output;
run;
/* the results:
id huuse days
1 1 12 1
1 1 1 1
1 2 15 1
2 1 13 2
2 0 16 2
2 1 18 2
2 0 44 2
3 1 2 3
*/

SAS Comparing values across multiple columns of the same observation?

I have observations with column ID, a, b, c, and d. I want to count the number of unique values in columns a, b, c, and d. So:
I want:
I can't figure out how to count distinct within each row, I can do it among multiple rows but within the row by the columns, I don't know.
Any help would be appreciated. Thank you
********************************************UPDATE*******************************************************
Thank you to everyone that has replied!!
I used a different method (that is less efficient) that I felt I understood more. I am still going to look into the ways listed below however to learn the correct method. Here is what I did in case anyone was wondering:
I created four tables where in each table I created a variable named for example ‘abcd’ and placed a variable under that name.
So it was something like this:
PROC SQL;
CREATE TABLE table1_a AS
SELECT
*
a as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table2_b AS
SELECT
*
b as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table3_c AS
SELECT
*
c as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table4_d AS
SELECT
*
d as abcd
FROM table_I_have_with_all_columns
;
QUIT;
Then I stacked them (this means I have duplicate rows but that ok because I just want all of the variables in 1 column and I can do distinct count.
data ALL_STACK;
set
table1_a
table1_b
table1_c
table1_d
;
run;
Then I counted all unique values in ‘abcd’ grouped by ID
PROC SQL ;
CREATE TABLE count_unique AS
SELECT
My_id,
COUNT(DISTINCT abcd) as Count_customers
FROM ALL_STACK
GROUP BY my_id
;
RUN;
Obviously, it’s not efficient to replicate a table 4 times just to put a variables under the same name and then stack them. But my tables were somewhat small enough that I could do it and then immediately delete them after the stack. If you have a very large dataset this method would most certainly be troublesome. I used this method over the others because I was trying to use Procs more than loops, etc.
A linear search for duplicates in an array is O(n2) and perfectly fine for small n. The n for a b c d is four.
The search evaluates every pair in the array and has a flow very similar to a bubble sort.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
run;
The linear search for duplicates will occur on every row, and the count_distinct will be initialized automatically in each row to a missing (.) value. The sum function is used to increment the count when a non-missing value is not found in any prior array indices.
* linear search O(N**2);
data want;
set have;
array x a b c d;
do i = 1 to dim(x) while (missing(x(i)));
end;
if i <= dim(x) then count_distinct = 1;
do j = i+1 to dim(x);
if missing(x(j)) then continue;
do k = i to j-1 ;
if x(k) = x(j) then leave;
end;
if k = j then count_distinct = sum(count_distinct,1);
end;
drop i j k;
run;
Try to transpose dataset, each ID becomes one column, frequency each ID column by option nlevels, which count frequency of value, then merge back with original dataset.
Proc transpose data=have prefix=ID out=temp;
id ID;
run;
Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;
data count;
set count;
ID=compress(TableVar,,'kd');
drop TableVar;
run;
data want;
merge have count;
by id;
run;
one more way using sortn and using conditions.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
77 . 3 . 4
88 . 9 5 .
99 . . 2 2
76 . . . 2
58 1 1 . .
50 2 . 2 .
66 2 . 7 .
89 1 1 1 .
75 1 2 3 .
76 . 5 6 7
88 . 1 1 1
43 1 . . 1
31 1 . . 2
;
data want;
set have;
_a=a; _b=b; _c=c; _d=d;
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0 then count+ 0;
else if hello(i)-hello(i+1) ne 0 then count+ 1;
end;
drop i _:;
run;
You could just put the unique values into a temporary array. Let's convert your photograph into data.
data have;
input id a b c d;
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;
So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save the unique values. Finally count how many unique values there are.
data want ;
set have ;
array unique (4) _temporary_;
array values a b c d ;
call missing(of unique(*));
do _n_=1 to dim(values);
if not missing(values(_n_)) then
if not whichn(values(_n_),of unique(*)) then
unique(_n_)=values(_n_)
;
end;
count=n(of unique(*));
run;
Output:
Obs id a b c d count
1 11 2 3 4 4 3
2 22 1 8 1 1 2
3 33 6 . 1 2 3
4 44 . 1 1 . 1

How to sum value from next row by group using SAS?

I want to create a column in my dataset that calculates the sum of the current row and next row for another field. There are several groups within the data, and I only want to take the sum of the next row if the next row is part of the current group. If a row is the last record for that group I want to fill with a null value.
I'm referencing reading next observation's value in current observation, but still can't figure out how to obtain the solution I need.
For example:
data have;
input Group ID Salary;
cards;
10 1 1
10 2 2
10 3 2
10 4 1
11 1 2
11 2 2
11 3 1
11 4 1
;
run;
The result I want to obtain here is this:
data want;
input Group ID Salary Sum;
cards;
10 1 1 3
10 2 2 4
10 3 2 3
10 4 1 .
11 1 2 4
11 2 2 3
11 3 1 2
11 4 1 .
;
run;
Similar to Tom's answer, but using a 'look-ahead' merge (without a by statement, and firstobs=2) :
data want ;
merge have
have (firstobs=2
keep=Group Salary
rename=(Group=NextGroup Salary=NextSalary)) ;
if Group = NextGroup then sum = sum(Salary,NextSalary) ;
drop Next: ;
run ;
Use BY group processing and a second SET statement that skips the first observation.
data want ;
set have end=eof;
by group ;
if not eof then set have (keep=Salary rename=(Salary=Sum) firstobs=2);
if last.group then Sum=.;
else sum=sum(sum,salary);
run;
I found a solution using proc expand that produced what I needed:
proc sort data = have;
by Group ID;
run;
proc expand data=have out=want method=none;
by Group;
convert Salary = Next_Sal / transformout=(lead 1);
run;
data want(keep=Group ID Salary Sum);
set want;
Sum = Salary + Next_Sal;
run;