SAS for subsetting data - sas

Consider the following example SAS dataset:
Price Num_items
100 10
120 15
130 20
140 25
150 30
I want to group them into 4 categories by defining a new variable called cat such that the new dataset looks as follows:
Price Num_items Cat
100 10 1
120 15 1
130 20 2
140 25 3
150 30 4
I also want to group them so that each group has about an equal number of items (for example, in the grouping above Group 1 has 25, Group 2 has 20, Group 3 has 25 and Group 4 has 30 observations). Note that the price column is sorted in ascending order (that is required).
I am struggling to get started with this in SAS, so any help would be appreciated. I am not looking for a complete solution, but pointers towards preparing one would help.

Cool problem, subtly complex. I agree with @J_Lard that a data step with some retained variables would likely be the quickest way to accomplish this. If I understand your problem correctly, the code below should give you some ideas on how to solve it. Note that your mileage will vary depending on num_items and group_target.
Generate a similar, but larger, data set.
data have;
do price=50 to 250 by 10;
/*Seed is `_N_` so we'll see the same random item count.*/
num_items = ceil(ranuni(_N_)*10)*5;
output;
end;
run;
Categorize.
/*Desired group size specification.*/
%let group_target = 50;
data want;
set have;
/*On the first record, initialize `cat` to 1 and `cat_num_items` to that record's item count; the sum statements retain both implicitly.*/
if _N_=1 then do;
cat + 1;
cat_num_items + num_items;
end;
else do;
/*If the item count for a new price puts the category count above the target, apply logic.*/
if cat_num_items + num_items > &group_target. then do;
/*If placing the item into a new category puts the current cat count closer to the `group_target` than would keeping it, then put into new category.*/
if abs(&group_target. - cat_num_items) < abs(&group_target. - (cat_num_items+num_items)) then do;
cat+1;
cat_num_items = num_items;
end;
/*Otherwise keep it in the current category and increment the category's item count.*/
else cat_num_items + num_items;
end;
/*Still at or below the target: keep the item in the current category and increment the category's item count.*/
else cat_num_items + num_items;
end;
drop cat_num_items;
run;
Check.
proc sql;
create table check_want as
select cat,
sum(num_items) as cat_count
from want
group by cat;
quit;
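As a quick sanity check, here is a minimal sketch that applies the same decision rule to the five rows from the question. The names have_small, want_small and small_target are hypothetical, and the target of 25 is an assumption chosen to match the group sizes the question mentions. Tracing the rule by hand on these rows gives cat = 1, 1, 2, 3, 4, which is the layout the question asks for.

```sas
/* Hypothetical copy of the five rows from the question */
data have_small;
  input price num_items;
  datalines;
100 10
120 15
130 20
140 25
150 30
;
run;

/* Assumed target group size for this small example */
%let small_target = 25;

data want_small;
  set have_small;
  retain cat 0 cat_num_items 0;
  if _n_ = 1 then do;
    /* First row always opens category 1 */
    cat = 1;
    cat_num_items = num_items;
  end;
  /* Open a new category only when adding this row would overshoot the target
     AND the current total is closer to the target without it */
  else if cat_num_items + num_items > &small_target.
      and abs(&small_target. - cat_num_items)
        < abs(&small_target. - (cat_num_items + num_items)) then do;
    cat = cat + 1;
    cat_num_items = num_items;
  end;
  /* Otherwise the row stays in the current category */
  else cat_num_items = cat_num_items + num_items;
  drop cat_num_items;
run;

proc print data=want_small;
run;
```

With other data the groups will only be approximately equal, since each price row has to stay whole.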

Related

Multiple do loops in SAS

I have a dataset of money earned as a % every week in 2017 to 2018. Some don't have data at the start of 2017 as they didn't start earning until later on. The weeks are numbered as 201701, 201702 - 201752 and 201801 - 201852.
What I'd like to do is have 104 new variables called WEEK0 - WEEK103, where WEEK0 will have the first non empty column value of the money earned columns. Here is an example of the data:
MON_EARN_201701 MON_EARN_201702 MON_EARN_201703 MON_EARN_201704
30 21 50 65
. . 30 100
. 102 95 85
Then I want my data to have the following columns (example)
WEEK0 WEEK1 WEEK2 WEEK3
30 21 50 65
30 100 . .
102 95 85 .
These are just small examples of a very large dataset.
I was thinking I'd need to try and do some sort of do loops so what I've tried so far is:
DATA want;
SET have;
ARRAY mon_earn{104} mon_earn_201701 - mon_earn_201752 mon_earn_201801 -mon_earn_201852;
ARRAY WEEK {104} WEEK0 - WEEK103;
DO i = 1 to 104;
IF mon_earn{i} NE . THEN;
WEEK{i} = mon_earn{i};
END;
END;
RUN;
This doesn't work as it doesn't fill the WEEK0 when the first value is empty.
If any more information is needed, please comment and I will add it.
Sounds like you just need to find the starting point for copying.
First look through the list of earnings by calendar week until you find the first non-missing value. Then copy the values starting from there into your new array of earnings by relative week.
data want;
set have;
array mon_earn mon_earn_201701 -- mon_earn_201852;
array week (104);
do i = 1 to dim(mon_earn) until(found);
if mon_earn{i} ne . then found=1;
end;
do j=1 to dim(week) while (i+j-1 <= dim(mon_earn));
week(j) = mon_earn(i+j-1);
end;
run;
NOTE: I simplified the ARRAY definitions. For the input array I assumed that the variables are defined in order so that you could use positional array list. For the WEEK array SAS and I both like to start counting from one, not zero.
This is easier if the data is in a long format, and once it is long there's a chance you don't need the wide layout at all.
proc sort data=have;
by ID week;
run;
data want;
set have;
by id; *for each group/id counter;
retain counter;
if first.id then counter=0;
if counter=0 and not missing(value) then do;
counter=1; new_week=0; end;
if counter = 1 then new_week+1;
run;
If you really need it wide:
Find first not missing value and store index in i
Loop from i to end of week dimension
Assign week to mon_earned from i to end of week.
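data want;
set have;
array mon_earned(*) .... ;
array week(*) ... ;
/* Find the index of the first non-missing value (array indexes start at 1) */
found=0; i=1;
do while(found=0 and i<=dim(mon_earned));
if not missing(mon_earned(i)) then found=1;
else i+1;
end;
/* Copy from that index onward into week, starting at the first week element */
z=1;
do j=i to dim(mon_earned);
week(z) = mon_earned(j);
z+1;
end;
run;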
You need a second index variable, call it j, to target the proper week assignment. j is only incremented when a week's earning is not missing.
This example code will 'squeeze out' all missing earnings, even those that occur after some earning has occurred. For example
earning: . . . 10 . 120 . 25 … will squeeze to
week: 10 120 25 …
data have;
array earn earn_201701-earn_201752 earn_201801-earn_201852;
do _n_ = 1 to 1000;
call missing (of earn(*));
do _i_ = 1 + 25 * ranuni(123) to dim(earn);
if ranuni(123) < 0.95 then
earn(_i_) = round(10 + 125 * ranuni(123));
end;
output;
end;
run;
data want;
set have;
array earn earn_201701-earn_201752 earn_201801-earn_201852;
array week(0:103);
j = -1;
do i = 1 to dim(earn);
if not missing(earn(i)) then do;
j+1;
week(j) = earn(i);
end;
end;
drop i j;
run;
If you want to maintain interior missing earnings the logic would be
if not missing(earn(i)) or j >=0 then do;
j+1;
week(j) = earn(i);
end;
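To see the interior-missing variant on the question's own example rows, here is a minimal sketch cut down to four weeks. The data set names have_small and want_small are hypothetical, and the week variables are named explicitly as week0-week3 so the sketch does not rely on how SAS auto-names array elements.

```sas
/* Hypothetical four-week version of the example rows from the question */
data have_small;
  input mon_earn_201701-mon_earn_201704;
  datalines;
30 21 50 65
. . 30 100
. 102 95 85
;
run;

data want_small;
  set have_small;
  array earn{4} mon_earn_201701-mon_earn_201704;
  array week{0:3} week0-week3;
  /* j is the last week slot filled; -1 means nothing copied yet */
  j = -1;
  do i = 1 to dim(earn);
    /* Start copying at the first non-missing earning, then copy everything
       after it, including any later missing values */
    if not missing(earn{i}) or j >= 0 then do;
      j + 1;
      week{j} = earn{i};
    end;
  end;
  drop i j;
run;

proc print data=want_small;
  var week0-week3;
run;
```

For these three rows the leading missing values are shifted out and the trailing week columns are left missing, as in the desired output shown in the question.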

SAS duplicate maximum values across rows

I have a data set that looks like this:
company Assets Liabilities strategy1 strategy2 strategy3.....strategy22
1 500 500 0 50 50 50
2 200 300 33 30 33 0
My goal is to find the maximum value across the row for all strategies (strategy1 - strategy22), and basically bucket the company by the strategy they use. The problem comes when some companies have the same maximum value under multiple strategies. In this case I would want to place the firm into multiple buckets. The final dataset would be something like this:
company Assets Liab. strategy1 strategy2 strategy3.....strategy22 Strategy
1 500 500 0 50 50 50 Strategy2
1 500 500 0 50 50 50 Strategy3
1 500 500 0 50 50 50 Strategy22
Etc.
The end goal is to be able to run a proc means on the company's assets, liabilities, etc. by strategy. So far I have been able to achieve a dataset close to what I would like, but I can't stop SAS from always putting the first strategy with the maximum value in the "Strategy" column.
Data want;
set have;
MAX = max(of strategy1-strategy22);
array nums {22} strategy1-strategy22;
do _n_=1 to 21;
count=1;
do _i_ = _n_+1 to 22;
if nums{_n_} = nums{_i_} AND nums{_i_} ne 0 then count + 1;
end;
if count > maxcount then do;
mode = nums{_n_};
maxcount = count;
end;
end;
Run;
Data want2;
set want (where=( maxcount > 1 AND Mode = Max));
by company;
strat=1;
do until (strat gt maxcount);
output;
strat = strat +1;
end;
Run;
Basically, I computed the mode and the count of identical maximum values and if maxcount > 1 and mode = max then I output identical observations. However, I am stuck regarding how to get SAS to output different strategies if there are multiple maximum values that are the same.
That seems more complicated than you need.
data want;
set have;
array strategies[22] strategy1-strategy22;
do strategy = 1 to dim(strategies);
if strategies[strategy] = max(of strategies[*]) then output;
end;
run;
Why not just output the row if the strategy column matches the MAX?
My array syntax may be off, but here is a sketch of what I'm thinking...
If the column you're on has a value EQ MAX, then output that row with the strategy column set to the strategy that you're looking at:
Data want;
set have;
/* Long enough to hold "Strategy22" */
length strategy $10;
MAX = max(of strategy1-strategy22);
array nums {22} strategy1-strategy22;
do i = 1 to 22;
if nums{i} eq MAX then do;
/* Build the label from the column index */
strategy = cats("Strategy", i);
output;
end;
end;
drop i;
Run;
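Since the stated end goal is a proc means by strategy, here is a minimal sketch of the whole flow on a cut-down data set with three strategy columns. The names have_small, want_small and max_val are hypothetical, and the choice of statistics (n, mean, sum) is only an assumption about what is wanted.

```sas
/* Hypothetical cut-down data with three strategy columns */
data have_small;
  input company Assets Liabilities strategy1-strategy3;
  datalines;
1 500 500 0 50 50
2 200 300 33 30 33
;
run;

/* One output row per strategy that ties for the row maximum */
data want_small;
  set have_small;
  array strategies[3] strategy1-strategy3;
  max_val = max(of strategies[*]);
  do strategy = 1 to dim(strategies);
    if strategies[strategy] = max_val then output;
  end;
  drop max_val;
run;

/* Summarize assets and liabilities within each strategy bucket */
proc means data=want_small n mean sum;
  class strategy;
  var Assets Liabilities;
run;
```

A company that ties across several strategies is counted once in each of those buckets, which is what the question asks for but does mean the bucket totals add up to more than the overall total.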

Delete the group that none of its observation contain the certain value in SAS

I want to delete every group in which no observation has NUM=14.
Something like this:
Original DATA
ID NUM
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
Since none of the ID=2 observations contain NUM=14, I delete group 2.
And it should look like this:
ID NUM
1 14
1 12
1 10
3 14
3 10
This is what I have so far, but it doesn't seem to work.
data originaldat;
set newdat;
by ID;
If first.ID then do;
IF NUM EQ 14 then Score = 100;
Else Score = 10;
end;
else SCORE+1;
run;
data newdat;
set newdat;
If score LT 50 then delete;
run;
An approach using proc sql would be:
proc sql;
create table newdat as
select *
from originaldat
where ID in (
select ID
from originaldat
where NUM = 14
);
quit;
The sub query selects the IDs for groups that contain an observation where NUM = 14. The where clause then limits the selected data to only these groups.
The equivalent data step approach would be:
/* Get all the groups that contain an observation where NUM = 14 */
data keepGroups;
set originaldat;
if NUM = 14;
keep ID;
run;
/* Sort both data sets to ensure the data step merge works as expected */
proc sort data = originaldat;
by ID;
run;
/* Make sure there are no duplicates values in the groups to be kept */
proc sort data = keepGroups nodupkey;
by ID;
run;
/*
Merge the original data with the groups to keep and only keep records
where an observation exists in the groups to keep dataset
*/
data newdat;
merge
originaldat
keepGroups (in = k);
by ID;
if k;
run;
In both data steps, the subsetting if statement is used to output observations only when the condition is met. In the second case, k is a temporary variable with value 1 (true) when a value is read from keepGroups and 0 (false) otherwise.
You're sort of getting at a DoW loop here, but not quite doing it right. The problem (Assuming the DATA/SET names are mistyped and not actually wrong in your program) is the first data step doesn't append that 100 to every row - only to the 14 row. What you need is one 'line' per ID value with a keep/no keep decision.
You can do this by keeping your first data step but RETAINing score and outputting only one row per ID. Your code would actually work if you just fixed your data/set typo, but only when 14 is the first row of its ID.
data originaldat;
input ID NUM ;
datalines;
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
;;;;
run;
data has_fourteen;
set originaldat;
by ID;
retain keep;
If first.ID then keep=0;
if num=14 then keep=1;
if last.id then output;
/* Keep only ID and the flag so NUM is not overwritten in the merge below */
keep ID keep;
run;
data newdata;
merge originaldat has_fourteen;
by id;
if keep=1;
run;
That works by merging the value from a 1-per-ID to the whole dataset.
A double DoW also works.
data newdata;
keep=0;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if num=14 then keep=1;
end;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if keep=1 then output;
end;
run;
This works because it iterates over the dataset twice: for each ID, it iterates once through all records looking for a 14, and sets keep to 1 if it finds one. Then it reads all records for that ID again and outputs them if keep=1. Then it goes on to the next ID's records.
data in;
input id num;
cards;
1 14
1 12
1 10
2 16
2 13
3 14
3 67
;
/* To find out the list of groups which contains num=14, use below SQL */
proc sql;
select distinct id into :lst separated by ','
from in
where num = 14;
quit;
/* If you want to create a new data set with only groups containing num=14 then use following data step */
data out;
set in;
where id in (&lst.);
run;

Calculate maximum difference between grouped rows

I have the following data where people in households are sorted by age (oldest to youngest):
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
I would like to calculate for each household the maximum age difference between consecutively aged people. So this example would give values of 5 (=25-20), 16 (=32-16) and 32 (=42-10) for households 1, 2 and 3 respectively.
I could do this using lots of merges (i.e. extract person 1, merge with extract of person 2, and so on), but as there can be up to 20+ people in a household I'm looking for a much more direct method.
Here's a two pass solution. Same first step as the two solutions above, sort by age. In the second step keep track of max_diff per row, at the last record of HouseID output the results. This results in only two passes through the data.
proc sort data=houses; by houseid descending age; run;
data want;
set houses;
by houseID;
retain max_diff 0;
diff = dif1(age)*-1;
if first.HouseID then do;
diff = .; max_diff=.;
end;
if diff>max_diff then max_diff=diff;
if last.houseID then output;
keep houseID max_diff;
run;
proc sort data=houses; by houseid personid age;run;
data _t1;
set houses;
diff = dif1(age) * (-1);
if personid = 1 then diff = .;
run;
proc sql;
create table want as
select houseid, max(diff) as Max_Diff
from _t1
group by houseid;
quit;
proc sort data = houses;
by houseid descending age;
run;
data houses;
set houses;
by houseid;
lag_age = lag1(age);
/* The first person in a house has no older housemate to compare with */
if first.houseid then age_diff = 0;
else age_diff = lag_age - age;
run;
proc sql;
select houseid,max(age_diff) as max_age_diff
from houses
group by houseid;
quit;
Working:
First sort the data set using houseid and descending Age.
Second data step will calculate difference between current age value (in PDV) and previous age value in PDV. Then, using sql procedure, we can get the max age difference for each houseid.
Just throwing one more into the mix. This one is a condensed version of Reeza's response.
/* No need to sort by PersonID as age is the only concern */
proc sort data = houses;
by HouseID Age;
run;
data want;
set houses;
by HouseID;
/* Keep the diff when a new row is loaded */
retain diff;
/* Only replace the diff if it is larger than previous */
diff = max(diff, abs(dif(Age)));
/* Reset diff for each new house */
if first.HouseID then diff = 0;
/* Only output the final diff for each house */
if last.HouseID;
keep HouseID diff;
run;
Here is an example using FIRST. and LAST. with one pass (after sort) through the data.
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
Proc sort data=HOUSES;
by houseid descending age ;
run;
Data WANT(keep=houseid max_diff);
format houseid max_diff;
retain max_diff age1 age2;
Set HOUSES;
by houseid descending age ;
if first.houseid and last.houseid then do;
max_diff=0;
output;
end;
else if first.houseid then do;
call missing(max_diff,age1,age2);
age1=age;
end;
else if not(first.houseid or last.houseid) then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
age1=age;
end;
else if last.houseid then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
output;
end;
Run;

How do i perform calculation about the last n observations

How can I perform a calculation on the last n observations in a data set?
For example, if I have 10 observations, I would like to create a variable that sums the last 5 values of another variable. Please do not suggest that I lag 5 times or use module ( N ). I need a slightly more elegant solution than that.
In the code below, alpha is the data set that I have and bravo is the one I need.
data alpha;
input lima @@ ;
cards ;
3 1 4 21 3 3 2 4 2 5
;
run ;
data bravo;
input lima juliet;
cards;
3 .
1 .
4 .
21 .
3 32
3 32
2 33
4 33
2 14
5 16
;
run;
thank you in advance!
You can do this in the data step or using PROC EXPAND from SAS/ETS if available.
For the data step the idea is that you start with a cumulative sum (summ), but keep track of the number of values that were added so far (ninsum). Once that reaches 5, you start outputting the cumulative sum to the target variable (juliet), and from the next step you start subtracting the lagged-5 value to only store the sum of the last five values.
data beta;
set alpha;
retain summ ninsum 0;
summ + lima;
ninsum + 1;
l5 = lag5(lima);
if ninsum = 6 then do;
summ = summ - l5;
ninsum = ninsum - 1;
end;
if ninsum = 5 then do;
juliet = summ;
end;
run;
proc print data=beta;
run;
However, there is a procedure that can do all kinds of cumulative, moving-window, etc. calculations: PROC EXPAND, in which this is really just one line. We just tell it to calculate the backward moving sum in a window of width 5 and set the first 4 observations to missing (by default it will expand your series by 0's on the left).
proc expand data=alpha out=gamma;
convert lima = juliet / transformout=(movsum 5 trimleft 4);
run;
proc print data=gamma;
run;
Edit
If you want to do more complicated calculations, you need to carry the previous values in retained variables. I thought you wanted to avoid that, but here it is:
data epsilon;
set alpha;
array lags {5};
retain lags1 - lags5;
/* shift over lagged values, and add self at the beginning */
do i=5 to 2 by -1;
lags{i} = lags{i-1};
end;
lags{1} = lima;
/* do whatever calculation is needed; the window now includes the current row */
juliet = 0;
do i=1 to 5;
juliet = juliet + lags{i};
end;
output;
drop i;
run;
proc print data=epsilon;
run;
I can offer a rather ugly solution:
1. Run a data step and add an increasing number to each row within its group.
2. Run a SQL step and add a column holding the max of that number per group.
3. Run another data step and check whether the value from (2) minus (1) is less than 5. If so, assign the value that you want to sum to a _num_to_sum_ variable (for example); otherwise leave it blank or assign 0.
4. Finally, run a SQL step with sum(_num_to_sum_) and group the results by the grouping variable from (1).
EDIT: I have added a live example of the concept in a bit more compacted way.
data sourcetable;
input var1 $ var2;
cards;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;
data step1;
set sourcetable;
by var1;
retain obs 0;
if first.var1 then obs = 0;
else obs = obs+1;
if obs >=5 then to_sum = var2;
run;
proc sql;
create table rezults as
select distinct var1, sum(to_sum) as needed_summs
from step1
group by var1;
quit;
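As a rough sketch of the four-step concept described in this answer (row counter, per-group maximum, keep the rows within 5 of the maximum, then sum), something like the following could work. The names sourcetable_demo, numbered, with_max, last5_sums, max_obs and needed_sum are all hypothetical, and the sketch assumes the data is already grouped by var1. Working the demo data by hand, the last five values sum to 33 for aaa, and bbb has only four rows, so all of them are summed to 15.

```sas
/* Hypothetical demo data: same shape as the example in this answer */
data sourcetable_demo;
  input var1 $ var2;
  datalines;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;

/* Step 1: number the rows within each var1 group */
data numbered;
  set sourcetable_demo;
  by var1;
  if first.var1 then obs = 0;
  obs + 1;
run;

proc sql;
  /* Step 2: attach the per-group maximum row number to every row
     (PROC SQL remerges the summary value onto the detail rows) */
  create table with_max as
  select *, max(obs) as max_obs
  from numbered
  group by var1;

  /* Steps 3 and 4: keep the last 5 rows of each group and sum them */
  create table last5_sums as
  select var1, sum(var2) as needed_sum
  from with_max
  where max_obs - obs < 5
  group by var1;
quit;
```

This yields one total per group rather than a rolling sum on every row, so it answers a slightly different question from the moving-window approaches earlier in the thread.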
In case anyone reads this :)
I solved it the way I needed it to be solved. Although now I am more curious which of the two (the retain and my solution) is more optimal in terms of computing/processing time.
Here is my solution:
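data bravo(keep = var1 summ);
set alpha;
/* Only rows 5 onward have five observations to look back over */
if _n_ >= 5 then do i=_n_ to _n_-4 by -1;
/* Re-read row i directly; RENAME= needs its pairs in parentheses */
set alpha(rename=(var1=var2)) point=i;
summ=sum(summ,var2);
end;
run;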