Building a table with sub-summary-rows in SAS - sas

Assume I have two clusters of clients, and for each client I have a value: how much the client is worth to me. I now want to create a summary table with extra rows that sum all clients' values within each cluster.
Example input table clients:
client cluster value
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
Expected output wanted:
cluster client value comment
1 1 3 kEuro
1 3 4 kEuro
1 4 1 kEuro
1 . 8 sum for cluster 1
2 2 1 kEuro
2 5 5 kEuro
2 . 6 sum for cluster 2
I am now looking for the most efficient way of achieving this. What I have come up with so far is the code below, which does not fully solve it: instead of "sum for cluster x" I only see "sum f".
proc sql;
create table part_1 as
select cluster, client, value
, "kEuro" as comment
from clients
where cluster = 1;
quit;
proc sql;
create table part_2 as
select cluster, client, value
, "kEuro" as comment
from clients
where cluster = 2;
quit;
/* The summation over the clusters */
proc means noprint data=part_1;
output out=psum_1 (drop=_type_ _freq_) sum=;
run;
proc means noprint data=part_2;
output out=psum_2 (drop=_type_ _freq_) sum=;
run;
/* The concatenation stage 1 */
data puzzle_1;
set part_1 psum_1 (in=in2);
if in2 then do;
client = .;
comment = 'sum for cluster 1';
end;
run;
data puzzle_2;
set part_2 psum_2 (in=in2);
if in2 then do;
client = .;
comment = 'sum for cluster 2';
end;
run;
/* The concatenation stage 2 */
data wanted;
set puzzle_1 puzzle_2;
run;
Is there a better way to approach this problem, ideally something which I could loop over, if I have more than two clusters?

You can use PROC MEANS/SUMMARY to generate the summary rows. Then use a SET with a BY to add the summary rows to the detail rows.
data have;
input client cluster value ;
cards;
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
;
proc sort data=have; by cluster client; run;
proc means data=have noprint nway ;
by cluster;
var value;
output out=sum sum= ;
run;
data want;
set have sum ;
by cluster;
length comment $20 ;
if last.cluster then comment=catx(' ','sum for cluster',cluster);
else comment='kEuro';
run;
proc print ;
var cluster client value comment;
run;
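The "sum f" truncation in the question comes from the character length: a character variable takes its length from where it is first defined, and "kEuro" is only 5 characters long. A minimal sketch of the effect of declaring the length up front, as the LENGTH statement above does (the dataset name `demo` is only an assumption for illustration):

```sas
data demo;
  /* Declare the length before the first assignment; without this line */
  /* COMMENT would be created as $5 and the second value cut to 'sum f'. */
  length comment $20;
  comment = 'kEuro';
  output;
  comment = 'sum for cluster 1';
  output;
run;

proc print data=demo;
run;
```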

Typically you would not add summary rows to the data set containing the details being aggregated. Instead, consider using a reporting tool such as Proc REPORT, Proc TABULATE or Proc SUMMARY.
Example:
Show details and summary row using Proc REPORT
data have; input
client cluster value; datalines;
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
;
ods html file='report.html';
proc report data=have;
columns cluster client value;
define cluster / order;
define client / display;
define value / sum;
break after cluster / summarize style=[fontstyle=italic background=lightgray];
run;
ods html close;
If you absolutely feel the need to make your original data more difficult to use further downstream:
add a CLASS cluster statement to your Proc MEANS
SORT original data by cluster
stack original data with MEANS output using SET original means_output; by cluster;
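The three steps above could be sketched roughly as follows. This is only an illustration under assumptions: `means_output` and `stacked` are made-up dataset names, and `have` repeats the example input from the question.

```sas
data have;
  input client cluster value;
  datalines;
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
;
run;

/* Sort the detail rows so they can be interleaved by cluster */
proc sort data=have;
  by cluster client;
run;

/* One summary row per cluster; NWAY keeps only the per-cluster level */
proc means data=have noprint nway;
  class cluster;
  var value;
  output out=means_output (drop=_type_ _freq_) sum=;
run;

/* Interleave: within each cluster the detail rows come first, */
/* then the summary row, which has a missing CLIENT            */
data stacked;
  set have means_output;
  by cluster;
run;
```

Because `means_output` is listed second in the SET statement, its row lands after the detail rows of the same cluster.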

Related

How to create 2x2 table in sas for fisher exact test

I just performed the fisher test in R and in Excel on a 2x2 table with the contents 1 6 and 7 2. I can't manage to do this in sas.
data my_table;
input var1 var2 @@;
datalines;
1 6 7 2
;
proc freq data=my_table;
tables var1*var2 / fisher;
run;
The test somehow ignores that the table consists of 4 values, but when I print the table it looks fine. In the test the contents of the table are 0, 1, 1, 0. I guess I need to change something when creating the data, but what?
You do NOT have two variables that each have two categories.
Read the data in this way instead.
data have ;
do var1=1,2 ; do var2=1,2;
input count @@;
output;
end; end;
datalines;
1 6 7 2
;
Now VAR1 and VAR2 both have two possible values and COUNT has the number of cases for the particular combination. Use the WEIGHT statement to tell PROC FREQ to use COUNT as the number of cases.
proc freq data=have ;
weight count ;
tables var1*var2 / fisher ;
run;

How Can I show all orders of mode by using proc univariate

I am trying to show all values ranked by how often they occur, not only the mode.
For example, I import an Excel sheet like this:
A
1
1
2
3
3
3
and my code is:
ods select Modes;
proc univariate data=Want modes;
var A;
run;
The result looks like this:
Mode Count
3 3
I want it to show:
Mode Count
3 3
1 2
2 1
how can I do that???
Your desired output is actually not modes. The Modes table returns the most frequent value or values (if there is more than one with the same frequency) with the corresponding count. In your example, there is only one mode (3), as it is the value with the highest frequency. And that's what the result shows.
You may be interested in showing frequencies of every value present in variable A. In that case, you want to use this code:
ods select Frequencies;
proc univariate data=Want freq;
var A;
run;
That is a frequency table.
data have ;
input A @@;
cards;
1 1 2 3 3 3
;
proc freq data=have order=freq ;
tables a / out=counts;
run;
proc print data=counts;
run;
Result:
Obs A COUNT PERCENT
1 3 3 50.0000
2 1 2 33.3333
3 2 1 16.6667

Multiple transactions lines to base table SAS

I am new to SAS and am trying to handle some customer data, and I'm not really sure how to do this.
What I have:
data transactions;
input ID $ Week Segment $ Average Freq;
datalines;
1 1 Sports 500 2
1 1 PC 400 3
1 2 Sports 350 3
1 2 PC 550 3
2 1 Sports 650 2
2 1 PC 700 3
2 2 Sports 720 3
2 2 PC 250 3
;
run;
What I want:
data transactions2;
input ID Week1_Sports_Average Week1_PC_Average Week1_Sports_Freq
Week1_PC_Freq
Week2_Sports_Average Week2_PC_Average Week2_Sports_Freq Week2_PC_Freq;
datalines;
1 500 400 2 3 350 550 3 3
2 650 700 2 3 720 250 3 3
;
run;
The only thing I got so far is this:
Data transactions3;
SET transactions;
if week=1 and Segment="Sports" then DO;
Week1_Sports_Freq=Freq;
Week1_Sports_Average=Average;
END;
else DO;
Week1_Sports_Freq=0;
Week1_Sports_Average=0;
END;
run;
This will be way too much work as I have a lot of weeks and more variables than just freq/avg.
I'm really hoping for some tips, as I'm stuck.
You can use PROC TRANSPOSE to create that structure. But you need to use it twice since your original dataset is not fully normalized.
The first PROC TRANSPOSE will get the AVERAGE and FREQ readings onto separate rows.
proc transpose data=transactions out=tall ;
by id week segment notsorted;
var average freq ;
run;
If you don't mind having the variables named slightly differently than in your proposed solution, you can just use another PROC TRANSPOSE to create one observation per ID.
proc transpose data=tall out=want delim=_;
by id;
id segment _name_ week ;
var col1 ;
run;
If you want the exact names you had before, you could add a data step to first create a variable you could use in the ID statement of the PROC TRANSPOSE.
data tall ;
set tall ;
length new_name $32 ;
new_name = catx('_',cats('WEEK',week),segment,_name_);
run;
proc transpose data=tall out=want ;
by id;
id new_name;
var col1 ;
run;
Note that it is easier in SAS when you have a numbered series of variables if the number appears at the end of the name. Then you can use a variable list. So instead of WEEK1_AVERAGE, WEEK2_AVERAGE, ... you would use WEEK_AVERAGE_1, WEEK_AVERAGE_2, ... so that you could use a variable list like WEEK_AVERAGE_1 - WEEK_AVERAGE_5 in your SAS code.
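As a small illustration of why the trailing number helps, here is a sketch that uses a numbered range list with the OF keyword. The dataset `wide`, its three weekly columns, and the values in it are made up for this example and are not taken from the question.

```sas
/* Hypothetical wide table with the week number at the end of each name */
data wide;
  input id week_average_1 week_average_2 week_average_3;
  datalines;
1 500 350 420
2 650 720 610
;
run;

data totals;
  set wide;
  /* The range list expands to week_average_1, week_average_2, week_average_3 */
  total_average = sum(of week_average_1-week_average_3);
  mean_average  = mean(of week_average_1-week_average_3);
run;
```

With names such as WEEK1_AVERAGE the same range shorthand is not available, so every variable would have to be listed by hand.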

SAS-use of lead function

Suppose the dataset has three columns
Date Region Price
01-03 A 1
01-03 A 2
01-03 B 3
01-03 B 4
01-03 A 5
01-04 B 4
01-04 B 6
01-04 B 7
I am trying to get the lead price by date and region with the following code.
data want;
set have;
by _ric date_l_;
do until (eof);
set have(firstobs=2 keep=price rename=(price=lagprice)) end=eof;
end;
if last.date_l_ then call missing(lagprice);
run;
However, WANT ends up with only one observation. I then created new_date=date and tried another piece of code:
data want;
set have nobs=nobs;
do _i = _n_ to nobs until (new_date ne Date);
if eof1=0 then
set have (firstobs=2 keep=price rename=(price=leadprice)) end=eof1;
else leadprice=.;
end;
run;
With this code, SAS is working slowly. So I think this code is also not appropriate. Could anyone give some suggestions? Thanks
Try sorting by the variables you want the lead price for, then read the dataset twice in the same data step:
data test;
length Date Region $12 Price 8 ;
input Date $ Region $ Price ;
datalines;
01-03 A 1
01-03 A 2
01-03 B 3
01-03 B 4
01-03 A 5
01-04 B 4
01-04 B 6
01-04 B 7
;
run;
** sort by vars you want lead price for **;
proc sort data = test;
by DATE REGION;
run;
** set together twice -- once for lead price and once for all variables **;
data lead_price;
set test;
by DATE REGION;
set test (firstobs = 2 keep = PRICE rename = (PRICE = LEAD_PRICE))
test (obs = 1 drop = _ALL_);
if last.DATE or last.REGION then do;
LEAD_PRICE = .;
end;
run;
You can use proc expand to generate leads on numeric variables by group. Try the following method instead:
Step 1: Sort by Region, Date
proc sort data=have;
by Region Date;
run;
Step 2: Create a new ID variable to denote observation numbers
Because you have multiple values per date per region, we need to generate a new ID variable so that proc expand uses lead by observation number rather than by date.
data have2;
set have;
_ID_ = _N_;
run;
Step 3: Run proc expand by region with the lead transformation
lead will do exactly as it sounds. You can lead by as many values as you'd like, as long as the data supports it. In this case, we are leading by one observation.
proc expand data=have2
out=want;
by Region;
id _ID_;
convert Price = Lead_Price / transform=(lead 1) ;
run;

Filling in gaps in data with a merge in SAS

I have data that looks like this:
id t x
1 1 3.7
1 3 1.2
1 4 2.4
2 2 6.0
2 4 6.1
2 5 6.2
For each id I want to add observations as necessary so there are values for all 1<=t<=5.
So my desired result is:
id t x
1 1 3.7
1 2 .
1 3 1.2
1 4 2.4
1 5 .
2 1 .
2 2 6.0
2 3 .
2 4 6.1
2 5 6.2
My real setting involves massive amounts of data, so I'm looking for the most efficient way to do this.
Here's probably the simplest way, using the COMPLETETYPES option in PROC SUMMARY. I'm making the assumption that the combinations of id and t are unique in the data.
The only thing I'm not sure of is whether you'll run into memory issues when running against a very large dataset; I have had problems with PROC SUMMARY in this respect in the past.
data have;
input id t x;
cards;
1 1 3.7
1 3 1.2
1 4 2.4
2 2 6.0
2 4 6.1
2 5 6.2
;
run;
proc summary data=have nway completetypes;
class id t;
var x;
output out=want (drop=_:) max=;
run;
One option is to use PROC EXPAND, if you have ETS. I'm not sure if it'll do 100% of what you want, but it might be a good start. It seems like so far the main problem is it won't do records at the start or the end, but I think that's surmountable; just not sure how.
proc expand data=have out=want from=daily method=none extrapolate;
by id;
id t;
run;
That fills in 2 for id 1 and 3 for id 2, but does not fill in 5 for id 1 or 1 for id 2.
To do it in base SAS, you have a few options. PROC FREQ with the SPARSE option might be a good option.
proc freq data=have noprint;
tables id*t/sparse out=want2(keep=id t);
run;
data want_fin;
merge have want2;
by id t;
run;
You could also do this via PROC SQL, with a join to a table with the possible t values, but that seems slower to me (even though the FREQ method requires two passes, FREQ will be pretty fast and the merge is using already sorted data so that's also not too slow).
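For completeness, the PROC SQL variant mentioned above might look roughly like the sketch below. It assumes t always runs from 1 to 5, as in the question; the helper table `t_values`, the alias `grid`, and the output name `want_sql` are made-up names for illustration.

```sas
data have;
  input id t x;
  datalines;
1 1 3.7
1 3 1.2
1 4 2.4
2 2 6.0
2 4 6.1
2 5 6.2
;
run;

/* Helper table holding every possible value of t (assumed range 1 to 5) */
data t_values;
  do t = 1 to 5;
    output;
  end;
run;

proc sql;
  create table want_sql as
  select grid.id, grid.t, h.x
  from (select ids.id, tv.t
        from (select distinct id from have) as ids,
             t_values as tv) as grid      /* every id paired with every t */
       left join have as h
         on h.id = grid.id and h.t = grid.t
  order by grid.id, grid.t;
quit;
```

The left join keeps every (id, t) pair from the grid and leaves x missing where the original data has no matching row.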
Here's another approach, provided that you already know the minimum/maximum values for T. It creates a template that contains all values of ID and T, then merges with the original data set so that you keep the values of X.
proc sort data=original_dataset out=template(keep=id) nodupkey;
by id;
run;
data template;
set template;
do t = 1 to 5; /* you could make these macro variables */
output;
end;
run;
proc sort data=original_dataset;
by id t;
run;
data complete_dataset;
merge template(in=in_template) original_dataset(in=in_original);
by id t;
if in_template then output;
run;