Filling in gaps in data with a merge in SAS

I have data that looks like this:
id t x
1 1 3.7
1 3 1.2
1 4 2.4
2 2 6.0
2 4 6.1
2 5 6.2
For each id I want to add observations as necessary so there are values for all 1<=t<=5.
So my desired result is:
id t x
1 1 3.7
1 2 .
1 3 1.2
1 4 2.4
1 5 .
2 1 .
2 2 6.0
2 3 .
2 4 6.1
2 5 6.2
My real setting involves massive amounts of data, so I'm looking for the most efficient way to do this.

Here's probably the simplest way, using the COMPLETETYPES option in PROC SUMMARY. I'm making the assumption that the combinations of id and t are unique in the data.
The only thing I'm not sure of is whether you'll run into memory issues when running against a very large dataset; I have had problems with PROC SUMMARY in this respect in the past.
data have;
input id t x;
cards;
1 1 3.7
1 3 1.2
1 4 2.4
2 2 6.0
2 4 6.1
2 5 6.2
;
run;
proc summary data=have nway completetypes;
class id t;
var x;
output out=want (drop=_:) max=;
run;

One option is to use PROC EXPAND, if you have SAS/ETS. I'm not sure it will do 100% of what you want, but it might be a good start. So far the main problem is that it won't add records at the start or the end of each id; I think that's surmountable, but I'm not sure how.
proc expand data=have out=want from=daily method=none extrapolate;
by id;
id t;
run;
That fills in 2 for id 1 and 3 for id 2, but does not fill in 5 for id 1 or 1 for id 2.
To do it in base SAS, you have a few options. PROC FREQ with the SPARSE option is a good candidate.
proc freq data=have noprint;
tables id*t/sparse out=want2(keep=id t);
run;
data want_fin;
merge have want2;
by id t;
run;
You could also do this via PROC SQL, with a join to a table with the possible t values, but that seems slower to me (even though the FREQ method requires two passes, FREQ will be pretty fast and the merge is using already sorted data so that's also not too slow).
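The answer does not show the PROC SQL alternative it mentions; a minimal sketch of it might look like the following. The helper table t_values and the output name want_sql are made up for illustration, and the range 1 to 5 is taken from the question.

```sas
/* Hypothetical helper table: one row per t value that must appear */
data t_values;
  do t = 1 to 5;
    output;
  end;
run;

proc sql;
  create table want_sql as
  select grid.id, grid.t, h.x
  from (select ids.id, tv.t
        from (select distinct id from have) as ids,
             t_values as tv) as grid      /* every id paired with every t */
       left join have as h
         on grid.id = h.id and grid.t = h.t
  order by grid.id, grid.t;
quit;
```

The inner query builds the full id-by-t grid, and the left join keeps every grid row while leaving x missing where the original data has no observation.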

Here's another approach, provided that you already know the minimum/maximum values for T. It creates a template that contains all values of ID and T, then merges with the original data set so that you keep the values of X.
proc sort data=original_dataset out=template(keep=id) nodupkey;
by id;
run;
data template;
set template;
do t = 1 to 5; /* you could make these macro variables */
output;
end;
run;
proc sort data=original_dataset;
by id t;
run;
data complete_dataset;
merge template(in=in_template) original_dataset(in=in_original);
by id t;
if in_template then output;
run;
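The comment in the template step suggests macro variables for the bounds. One possible way to set them, assuming the range should be taken from the data itself (the answer only assumes the bounds are known), is sketched below as a replacement for the DATA step that builds template. The macro variable names t_min and t_max are made up, and the TRIMMED keyword needs SAS 9.3 or later.

```sas
/* Assumption: the t range is read from the data rather than hard-coded */
proc sql noprint;
  select min(t), max(t)
    into :t_min trimmed, :t_max trimmed
    from original_dataset;
quit;

/* Same template step as above, with the bounds taken from the macro variables */
data template;
  set template;
  do t = &t_min to &t_max;
    output;
  end;
run;
```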

Related

How to create 2x2 table in sas for fisher exact test

I just performed Fisher's exact test in R and in Excel on a 2x2 table with the contents 1 6 and 7 2. I can't manage to do this in SAS.
data my_table;
input var1 var2 @@;
datalines;
1 6 7 2
;
proc freq data=my_table;
tables var1*var2 / fisher;
run;
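The test somehow ignores that the table consists of four cell counts, although the table looks fine when I print it. In the test the contents of the table are 0, 1, 1, 0. I guess I need to change something when creating the data, but what?
You do NOT have two variables that each have two categories. As read, the dataset has two observations, (1, 6) and (7, 2), so PROC FREQ cross-tabulates the values 1 and 7 against 6 and 2 instead of treating the four numbers as cell counts.
Read the data in this way instead.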
data have ;
do var1=1,2 ; do var2=1,2;
input count @@;
output;
end; end;
datalines;
1 6 7 2
;
Now VAR1 and VAR2 both have two possible values and COUNT has the number of cases for the particular combination. Use the WEIGHT statement to tell PROC FREQ to use COUNT as the number of cases.
proc freq data=have ;
weight count ;
tables var1*var2 / fisher ;
run;

Building a table with sub-summary-rows in SAS

Assume I have two clusters of clients, and for each client I have a value: how much the client is worth to me. I now want to create a summary table with sub-total rows that sum all clients' values for each cluster.
Example input table clients:
client cluster value
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
Expected output wanted:
cluster client value comment
1 1 3 kEuro
1 3 4 kEuro
1 4 1 kEuro
1 . 8 sum for cluster 1
2 2 1 kEuro
2 5 5 kEuro
2 . 6 sum for cluster 2
I am now looking for the most efficient way of achieving this. What I have come up with so far is the code below, which does not fully solve it: instead of "sum for cluster x" I only see "sum f".
proc sql;
create table part_1 as
select cluster, client, value
, "kEuro" as comment
from clients
where cluster = 1;
quit;
proc sql;
create table part_2 as
select cluster, client, value
, "kEuro" as comment
from clients
where cluster = 2;
quit;
/* The summation over the clusters */
proc means noprint data=part_1;
output out=psum_1 (drop=_type_ _freq_) sum=;
run;
proc means noprint data=part_2;
output out=psum_2 (drop=_type_ _freq_) sum=;
run;
/* The concatenation stage 1 */
data puzzle_1;
set part_1 psum_1 (in=in2);
if in2 then do;
client = .;
comment = 'sum for cluster 1';
end;
run;
data puzzle_2;
set part_2 psum_2 (in=in2);
if in2 then do;
client = .;
comment = 'sum for cluster 2';
end;
run;
/* The concatenation stage 2 */
data wanted;
set puzzle_1 puzzle_2;
run;
Is there a better way to approach this problem, ideally something which I could loop over, if I have more than two clusters?
You can use PROC MEANS/SUMMARY to generate the summary rows. Then use a SET with a BY to add the summary rows to the detail rows.
data have;
input client cluster value ;
cards;
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
;
proc sort data=have; by cluster client; run;
proc means data=have noprint nway ;
by cluster;
var value;
output out=sum sum= ;
run;
data want;
set have sum ;
by cluster;
length comment $20 ;
if last.cluster then comment=catx(' ','sum for cluster',cluster);
else comment='kEuro';
run;
proc print ;
var cluster client value comment;
run;
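The LENGTH statement is what avoids the "sum f" truncation from the question. There, comment was created by PROC SQL from the literal "kEuro", so it was only 5 characters long and any longer text assigned later was cut off. A small sketch of the same effect in a DATA step, using the made-up dataset names demo_short and demo_long:

```sas
/* comment takes its length (5) from the first value assigned to it */
data demo_short;
  comment = 'kEuro';
  output;
  comment = 'sum for cluster 1';   /* stored as 'sum f' */
  output;
run;

/* declaring the length first leaves room for the longer text */
data demo_long;
  length comment $20;
  comment = 'kEuro';
  output;
  comment = 'sum for cluster 1';   /* stored in full */
  output;
run;
```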
Typically you would not add summary rows to the data set containing the details being aggregated. Instead, consider using a reporting tool such as PROC REPORT, PROC TABULATE or PROC SUMMARY.
Example:
Show details and summary row using Proc REPORT
data have;
input client cluster value;
datalines;
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
;
ods html file='report.html';
proc report data=have;
columns cluster client value;
define cluster / order;
define client / display;
define value / sum;
break after cluster / summarize style=[fontstyle=italic background=lightgray];
run;
ods html close;
If you absolutely feel the need to make your original data more difficult to use further downstream:
add a CLASS cluster statement to your Proc MEANS
SORT original data by cluster
stack original data with MEANS output using SET original means_output; by cluster;
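The three steps listed above could be sketched roughly as follows, reusing the have dataset from the PROC REPORT example. The dataset names original, means_output and stacked are placeholders chosen here, and dropping _type_ and _freq_ is an optional tidy-up.

```sas
/* 1. Summary rows per cluster via a CLASS statement */
proc means data=have noprint nway;
  class cluster;
  var value;
  output out=means_output(drop=_type_ _freq_) sum=;
run;

/* 2. Sort the detail rows by cluster */
proc sort data=have out=original;
  by cluster client;
run;

/* 3. Interleave: each cluster's summary row follows its detail rows */
data stacked;
  set original means_output;
  by cluster;
run;
```

In the summary rows client is missing, because means_output does not contain that variable.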

How Can I show all orders of mode by using proc univariate

I am trying to show all orders of the mode.
For example, I import an Excel sheet like this:
A
1
1
2
3
3
3
and the code is:
ods select Modes;
proc univariate data=Want modes;
var A;
run;
The result looks like this:
Mode Count
3 3
I want it to look like this:
Mode Count
3 3
1 2
2 1
How can I do that?
Your desired output is actually not a list of modes. The MODES option returns the most frequent value, or values if several share the highest frequency, with the corresponding count. In your example there is only one mode (3), as it is the value with the highest frequency, and that is what the result shows.
You may be interested in showing the frequency of every value present in variable A. In that case, you want to use this code:
ods select Frequencies;
proc univariate data=Want freq;
var A;
run;
That is a frequency table.
data have ;
input A @@;
cards;
1 1 2 3 3 3
;
proc freq data=have order=freq ;
tables a / out=counts;
run;
proc print data=counts;
run;
Result:
Obs A COUNT PERCENT
1 3 3 50.0000
2 1 2 33.3333
3 2 1 16.6667

Multiple transactions lines to base table SAS

I am new to SAS and am trying to handle some customer data, and I'm not really sure how to do this.
What I have:
data transactions;
input ID $ Week Segment $ Average Freq;
datalines;
1 1 Sports 500 2
1 1 PC 400 3
1 2 Sports 350 3
1 2 PC 550 3
2 1 Sports 650 2
2 1 PC 700 3
2 2 Sports 720 3
2 2 PC 250 3
;
run;
What I want:
data transactions2;
input ID Week1_Sports_Average Week1_PC_Average Week1_Sports_Freq
Week1_PC_Freq
Week2_Sports_Average Week2_PC_Average Week2_Sports_Freq Week2_PC_Freq;
datalines;
1 500 400 2 3 350 550 3 3
2 650 700 2 3 720 250 3 3
;
run;
The only thing I got so far is this:
Data transactions3;
SET transactions;
if week=1 and Segment="Sports" then DO;
Week1_Sports_Freq=Freq;
Week1_Sports_Average=Average;
END;
else DO;
Week1_Sports_Freq=0;
Week1_Sports_Average=0;
END;
run;
This will be way too much work, as I have a lot of weeks and more variables than just freq/avg.
I'm really hoping for some tips, as I'm stuck.
You can use PROC TRANSPOSE to create that structure, but you need to use it twice since your original dataset is not fully normalized.
The first PROC TRANSPOSE will get the AVERAGE and FREQ readings onto separate rows.
proc transpose data=transactions out=tall ;
by id week segment notsorted;
var average freq ;
run;
If you don't mind the variables being named slightly differently than in your proposed solution, you can just use another PROC TRANSPOSE to create one observation per ID.
proc transpose data=tall out=want delim=_;
by id;
id segment _name_ week ;
var col1 ;
run;
If you want the exact names you had before, you could add a data step that first creates a variable to use in the ID statement of PROC TRANSPOSE.
data tall ;
set tall ;
length new_name $32 ;
new_name = catx('_',cats('WEEK',week),segment,_name_);
run;
proc transpose data=tall out=want ;
by id;
id new_name;
var col1 ;
run;
Note that a numbered series of variables is easier to work with in SAS if the number appears at the end of the name, because you can then use a variable list. So instead of WEEK1_AVERAGE, WEEK2_AVERAGE, ... you would use WEEK_AVERAGE_1, WEEK_AVERAGE_2, ..., and refer to them all with a variable list like WEEK_AVERAGE_1 - WEEK_AVERAGE_5 in your SAS code.
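As a small illustration of such a variable list, here is a sketch with made-up data (one customer, five weekly averages). The dataset name weekly and the variable total_average are hypothetical.

```sas
/* Made-up example data: five weekly averages for one customer */
data weekly;
  /* the numbered range list creates and reads all five variables */
  input id week_average_1-week_average_5;
  /* the same list can be passed to a function with OF */
  total_average = sum(of week_average_1-week_average_5);
datalines;
1 500 350 420 610 580
;
run;
```

With names such as WEEK1_AVERAGE the numbered range syntax does not apply, which is why the number is better placed at the end.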

replicating a sql function in sas datastep

Hi, another quick question.
In PROC SQL we have ON, which is used for conditional joins. Is there something similar for the SAS data step?
For example:
proc sql;
....
data1 left join data2
on first<value<last
quit;
Can we replicate this in a SAS data step,
like this:
data work.combined;
set data1(in=a) data2(in=b);
if a then output;
run;
You can also reproduce the SQL join in a single DATA step using hash objects. It can be really fast, but it depends on how much RAM your machine has, since this method loads one table into memory: the more RAM, the larger the dataset you can load into a hash. This method is particularly effective for look-ups in a relatively small reference table.
data have1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data have2;
input value;
datalines;
2
5
6
7
;
run;
data want;
if _N_=1 then do;
if 0 then set have2;
declare hash h(dataset:'have2');
h.defineKey('value');
h.defineData('value');
h.defineDone();
declare hiter hi('h');
end;
set have1;
rc=hi.first();
do while(rc=0);
if first<value<last then output;
rc=hi.next();
end;
drop rc;
run;
The result:
value first last
2 1 3
5 4 7
6 4 7
7 6 9
Yes, there is a simple (but subtle) way in just 7 lines of code.
What you intend to achieve is intrinsically a conditional Cartesian join, which can be done with a SET statement inside a DO loop. The following code uses the test datasets from Dmitry's answer and a modified version of the code in the appendix of SUGI Paper 249-30.
data data1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data data2;
input value;
datalines;
2
5
6
7
;
run;
/***** by data step looped SET *****/
DATA CART_data;
SET data1;
DO i=1 TO NN; /*NN can be referenced before set*/
SET data2 point=i nobs=NN; /*point=i - random access*/
if first<value<last then OUTPUT; /*conditional output*/
END;
RUN;
/***** by SQL *****/
proc sql;
create table cart_SQL as
select * from data1
left join data2
on first<value<last;
quit;
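One can easily see that the results coincide for this test data. Note that the DATA step writes a row only when the condition holds, so a data1 row with no match in data2 would appear in the SQL left-join result but not in CART_data.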
Also note that from SAS 9.2 documentation: "At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you CAN refer to the NOBS= variable BEFORE the SET statement. The variable is available in the DATA step but is not added to any output data set."
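If unmatched data1 rows must be kept, as the SQL LEFT JOIN does, one possible variation of the looped SET is sketched below. The flag variable matched and the output name cart_left are made up for illustration.

```sas
data cart_left;
  set data1;
  matched = 0;                      /* reset for every data1 row */
  do i = 1 to nn;
    set data2 point=i nobs=nn;      /* direct access to row i of data2 */
    if first < value < last then do;
      matched = 1;
      output;
    end;
  end;
  if not matched then do;
    call missing(value);            /* value still holds the last data2 row read */
    output;
  end;
  drop matched;
run;
```

With the test data above every data1 row has at least one match, so this produces the same rows as CART_data.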
There isn't a direct way to do this with a MERGE. This is one example where the SQL method is clearly superior to any SAS data step methods, as anything you do will take much more code and possibly more time.
However, depending on the data, it's possible a few approaches may make sense. In particular, the format merge.
If data1 is fairly small (even, say, millions of records), you can make a format out of it. Like so:
data fmt_set;
set data1;
format label $8.;
start=first; *set up the names correctly;
end=last;
label='MATCH';
fmtname='DATA1F';
output;
if _n_=1 then do; *put out a hlo='o' line which is for unmatched lines;
start=.; *both unnecessary but nice for clarity;
end=.;
label='NOMATCH';
hlo='o';
output;
end;
run;
proc format cntlin=fmt_set; *import the dataset;
quit;
data want;
set data2;
if put(value,DATA1F.)="MATCH";
run;
This is very fast to run, unless data1 is extremely large (hundreds of millions of rows, on my system) - faster than a data step merge, if you include sort time, since this doesn't require a sort. One major limitation is that this will only give you one row per data2 row; if that is what is desired, then this will work. If you want repeats of data2 then you can't do it this way.
If data1 may have overlapping rows (i.e., two rows whose start/end ranges overlap each other), you will also need to address this, since ranges in a format normally aren't allowed to overlap. You can set hlo="m" for every row, and "om" for the non-match row, or you can resolve the overlaps.
I'd still do the sql join, however, since it's much shorter to code and much easier to read, unless you have performance issues, or it doesn't work the way you want it to.
Here's another solution, using a temporary array to hold the lookup dataset. Performance is probably similar to Dmitry's hash-based solution, but this should also work for people still using versions of SAS prior to 9.1 (the release in which hash objects were first introduced).
I've reused Dmitry's sample datasets:
data have1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data have2;
input value;
datalines;
2
5
6
7
;
run;
/*We need a macro var with the number of obs in the lookup dataset*/
/*This is so we can specify the dimension for the array to hold it*/
data _null_;
if 0 then set have2 nobs = nobs;
call symput('have2_nobs',put(nobs,8.));
stop;
run;
data want_temparray;
array v{&have2_nobs} _temporary_;
do _n_ = 1 to &have2_nobs;
set have2 (rename=(value=value_array));
v{_n_}=value_array;
end;
do _n_ = 1 by 1 until (eof_have1);
set have1 end = eof_have1;
value=.;
do i=1 to &have2_nobs;
if first < v{i} < last then do;
value=v{i};
output;
end;
end;
if missing(value) then output;
end;
drop i value_array;
run;
Output:
value first last
2 1 3
5 4 7
6 4 7
7 6 9
This matches the output from the equivalent SQL:
proc sql;
create table want_sql as
select * from
have1 left join have2
on first<value<last
;
quit;
run;