Why many to many merge doesn't do cartesian product


data jul11.merge11;
input month sales ;
datalines ;
1 3123
1 1234
2 7482
2 8912
3 1284
;
run;
data jul11.merge22;
input month goal ;
datalines;
1 4444
1 5555
1 8989
2 9099
2 8888
3 8989
;
run;
data jul11.merge1;
merge jul11.merge11 jul11.merge22 ;
by month;
difference =goal - sales ;
run;
proc print data=jul11.merge1 noobs;
run;
output:
month sales goal difference
1 3123 4444 1321
1 1234 5555 4321
1 1234 8989 7755
2 7482 9099 1617
2 8912 8888 -24
3 1284 8989 7705
Why didn't the merge match every observation in table 1 with every observation in table 2 for the common months?
As I understand it, the PDV retains the values of an observation while it checks whether any more observations are left in that BY group, and only then reinitialises them; in that case it should have produced a Cartesian product.
I get a perfect Cartesian product with a one-to-many merge, but not with a many-to-many merge.

This is because of how SAS processes the data step. A merge is never a true Cartesian product (i.e., every record searched and matched against every other record, as a SQL comma join might do). With two datasets, SAS reads down the left dataset until it reaches the next BY-group value, then reads down the right dataset until it reaches that same BY-group value. If a BY value appears in only one dataset, those records are processed on their own; if it appears in both, the records are matched up.
Then it checks whether the left dataset has any more records in that BY group and, if so, advances to the next one. It does the same on the right. If only one side has another record, only that side's values are read in and the other side's values are retained; hence with 1 record on the left and 5 on the right you get 1x5, or 5 rows. However, with 2 on the left and 3 on the right you don't get 2x3=6 rows; you get three rows pairing 1:1, 2:2, and 2:3, because SAS advances the record pointers sequentially.
The following example is a good way to see how this works. If you really want to see it in action, run it under the data step debugger and play around with it interactively.
data test1;
input x row1;
datalines;
1 1
1 2
1 3
1 4
2 1
2 2
2 3
3 1
;;;;
run;
data test2;
input x row2;
datalines;
1 1
1 2
1 3
2 1
3 1
3 2
3 3
;;;;
run;
data test_merge;
merge test1 test2;
by x;
put x= row1= row2=;
run;
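For readers who want to check the pointer logic outside SAS, here is a minimal Python sketch of the behaviour described above. It is only a model, not SAS's implementation: the function name sas_style_merge is made up for illustration, both inputs are assumed to be sorted by the key, and the data is the sales/goal data from the question.

```python
from itertools import groupby

def sas_style_merge(left, right, key):
    """Mimic MERGE ... BY for two lists of dicts already sorted by `key`."""
    def by_groups(rows):
        return {k: list(g) for k, g in groupby(rows, key=lambda r: r[key])}

    left_groups, right_groups = by_groups(left), by_groups(right)
    merged = []
    for value in sorted(set(left_groups) | set(right_groups)):
        l_rows = left_groups.get(value, [])
        r_rows = right_groups.get(value, [])
        current = {}  # stands in for the PDV, reset at the start of each BY group
        for i in range(max(len(l_rows), len(r_rows))):
            if i < len(l_rows):   # left pointer advances only while rows remain
                current.update(l_rows[i])
            if i < len(r_rows):   # same for the right pointer
                current.update(r_rows[i])
            merged.append(dict(current))  # an exhausted side keeps its last values
    return merged

sales = [{"month": m, "sales": s} for m, s in
         [(1, 3123), (1, 1234), (2, 7482), (2, 8912), (3, 1284)]]
goals = [{"month": m, "goal": g} for m, g in
         [(1, 4444), (1, 5555), (1, 8989), (2, 9099), (2, 8888), (3, 8989)]]

for row in sas_style_merge(sales, goals, "month"):
    print(row["month"], row["sales"], row["goal"], row["goal"] - row["sales"])
```

The printed rows reproduce the six-row output in the question: month 1 yields three rows (not six) because the second sales value is simply carried forward once the left side runs out.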
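If you do want a Cartesian join in a SAS data step, you have to nest SET statements.
data want;
set test1;
/* For each row of test1, read every row of test2 by observation number. */
/* NOBS= is filled in at compile time, so nobs_2 is already set when the loop starts. */
do _n_ = 1 to nobs_2;
/* x is renamed so the test2 value does not overwrite the test1 value. */
set test2(rename=(x=x2)) point=_n_ nobs=nobs_2;
output;
end;
run;
That is a true Cartesian product; you can then test for BY-group equality (here, x = x2), but that's messy, really. You could also use a hash table lookup, which works better with BY groups. There are a few different options discussed here.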

SAS doesn't handle many-to-many merges very well within the data step. Use PROC SQL if you want a many-to-many merge.
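To illustrate what a SQL join gives on the question's data, here is a rough sketch using Python's built-in sqlite3 module as a stand-in for PROC SQL. The SQLite dialect is not identical to PROC SQL, and the table names are borrowed from the question, so treat this as an illustration of the join semantics only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merge11 (month INTEGER, sales INTEGER)")
conn.execute("CREATE TABLE merge22 (month INTEGER, goal INTEGER)")
conn.executemany("INSERT INTO merge11 VALUES (?, ?)",
                 [(1, 3123), (1, 1234), (2, 7482), (2, 8912), (3, 1284)])
conn.executemany("INSERT INTO merge22 VALUES (?, ?)",
                 [(1, 4444), (1, 5555), (1, 8989), (2, 9099), (2, 8888), (3, 8989)])

# Every sales row is paired with every goal row that has the same month.
joined = conn.execute(
    "SELECT a.month, a.sales, b.goal, b.goal - a.sales AS difference "
    "FROM merge11 AS a JOIN merge22 AS b ON a.month = b.month "
    "ORDER BY a.month, a.sales, b.goal"
).fetchall()

for row in joined:
    print(row)
print(len(joined), "rows")
```

Here month 1 contributes 2x3=6 rows, month 2 contributes 2x2=4, and month 3 contributes 1, for 11 rows in total, which is the within-month Cartesian product the question expected.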

Related

apply proc means to quickly calculate multiple observations?

I have a dataset with a varying number of observations per ID, and the participants are in different treatment groups (Group). Can I use proc means to quickly calculate the number of participants and the number of clinic visits per group? Ideally, I could use the proc means sum function to pick out those with 0 and 1 by group status and get the totals. However, I am stuck on how to proceed.
ID Visit Group
1 1 0
1 2 0
2 1 1
2 2 1
2 3 1
3 1 0
4 1 1
4 2 1
5 1 0
5 2 0
6 1 1
6 2 1
6 3 1
6 4 1
Specifically, I am interested in 1) the total number of participants in each group. In this case we have 3 participants (IDs 1, 3, and 5) in the control group (0) and another 3 participants (IDs 2, 4, and 6) in the treatment group (1).
2) the total number of visits per group. In this case, the total visits in the control group (0) will be 5 (2+1+2=5) and the total visits in the treatment group (1) will be 9 (3+2+4=9).
Can proc means quickly calculate such values? Thanks.

Yes, you can use proc means to get counts.
data have;
input ID$ Visit Group;
cards;
1 1 0
1 2 0
2 1 1
2 2 1
2 3 1
3 1 0
4 1 1
4 2 1
5 1 0
5 2 0
;
run;
proc means data=have n;
class group id;
var visit;
types group id group*id;
run;
If you want the sum of visit as well, add sum after n in the PROC MEANS statement: proc means data=have n sum;

It looks like GROUP is assigned at the ID level and not the ID/VISIT level. In that case, if you want to count the number of IDs in each group, you first need to get down to one observation per ID.
proc sort data=have nodupkey out=unique_ids ;
by id;
run;
Now you can count how many IDs are in each group. The normal way is to use PROC FREQ.
proc freq data=unique_ids;
tables group;
run;
But you can count with PROC MEANS/SUMMARY also.
proc summary data=unique_ids nway;
class group;
output out=counts N=N_ids ;
run;
proc print data=counts;
var group n_ids;
run;

PROC MEANS doesn't do a distinct count easily, so SQL may be the simpler option here.
proc sql;
create table want as
select group, count(*) as num_visits, count(distinct ID) as num_participants
from have
group by group
order by 1;
quit;
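As a quick check of what this query returns, the same SELECT can be run against the question's 14 rows with Python's built-in sqlite3 module. This is only a stand-in for PROC SQL (the dialects differ), and the column is named grp here because GROUP is a reserved word in SQLite.

```python
import sqlite3

visits = [(1, 1, 0), (1, 2, 0), (2, 1, 1), (2, 2, 1), (2, 3, 1), (3, 1, 0),
          (4, 1, 1), (4, 2, 1), (5, 1, 0), (5, 2, 0),
          (6, 1, 1), (6, 2, 1), (6, 3, 1), (6, 4, 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE have (id INTEGER, visit INTEGER, grp INTEGER)")
conn.executemany("INSERT INTO have VALUES (?, ?, ?)", visits)

# One row per group: total visits and number of distinct participants.
counts = conn.execute(
    "SELECT grp, COUNT(*) AS num_visits, COUNT(DISTINCT id) AS num_participants "
    "FROM have GROUP BY grp ORDER BY grp"
).fetchall()

for grp, num_visits, num_participants in counts:
    print(grp, num_visits, num_participants)
```

The result agrees with the totals worked out by hand in the question: 5 visits and 3 participants in group 0, 9 visits and 3 participants in group 1.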

Interpolate values in unbalanced panel data using SAS

Say we are confined to using SAS and have a panel/longitudinal dataset. We have indicators for cohort and time, as well as some measured variable y.
data in;
input cohort time y;
datalines;
1 1 100
1 2 101
1 3 102
1 4 103
1 5 104
1 6 105
2 2 .
2 3 .
2 4 .
2 5 .
2 6 .
3 3 .
3 4 .
3 5 .
3 6 .
4 4 108
4 5 110
4 6 112
;
run;
Note that cohort and time are measured in the same units, so if the dataset runs out to time 6, each successive panel unit is one period shorter than the one before it.
There is a gap of two panel units between the units with actual data. The goal is to linearly interpolate the two missing panel units (the values for cohorts 2 and 3) from the two that "sandwich" them. For cohort 2 at time 5 the interpolated value should be 0.67*104 + 0.33*110, while for cohort 3 at time 5 it would be 0.33*104 + 0.67*110. Basically you weight the closer populated panel unit by 2/3 and the further one by 1/3. Some values will of course remain missing, but for this toy example that's not a problem.
I imagine the solution involves lagging, the first. variable, and loops, but my SAS is so poor that I hesitate to provide even my broken code example.

I've got a solution, though it is tortured. There must be a better way to do it; this takes one line in Stata.
First we use PROC SQL to make a table of the two populated panel units, the "bread" of the sandwich (here startingData holds the data shown above).
proc sql;
create table haveY as
select time, cohort, y
from startingData
where y is not missing
order by time, cohort;
quit;
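Next we loop over the rows of this reduced dataset to produce interpolated values. I don't completely follow the operations here; I modified a related example I found.
data wantY;
set haveY(rename=(y=thisY cohort=thisCohort));
by time;
retain lastCohort lastY;
/* LAG is called on every row, so it returns the previous row's cohort and y. */
lastcohort = lag(thisCohort);
lastY = lag(thisY);
/* Within a time period, any row after the first closes a gap that began at the previous populated cohort. */
if not first.time then do;
do cohort = lastCohort +1 to thisCohort-1;
/* Linear interpolation: each populated cohort is weighted by its closeness to the missing one. */
y = ((thisCohort-cohort)*lastY + (cohort-lastCohort)*thisY)/(thisCohort-lastCohort);
output;
end;
end;
/* Because of the explicit OUTPUT above, only the interpolated rows are written. */
cohort=thisCohort;
y=thisY;
drop this: last:;
run;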
proc sort data=work.wantY;
by cohort time;
run;
This does produce what is needed; it can be joined back into the starting table, startingData, using PROC SQL. It is not a completely satisfying solution because of the verbosity, but it does work.
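The arithmetic of the interpolation step can be checked independently of SAS. The following Python sketch applies the same weighting formula to the question's data; the function name interpolate_cohorts and the (cohort, time, y) tuple layout are assumptions made for this illustration.

```python
def interpolate_cohorts(rows):
    """rows: (cohort, time, y) tuples with y=None where missing.

    Returns {(cohort, time): y} for cohorts that lie between two
    populated cohorts at the same time.
    """
    known = {}
    for cohort, time, y in rows:
        if y is not None:
            known.setdefault(time, []).append((cohort, y))

    filled = {}
    for time, points in known.items():
        points.sort()
        # Walk over neighbouring pairs of populated cohorts at this time.
        for (c0, y0), (c1, y1) in zip(points, points[1:]):
            for c in range(c0 + 1, c1):
                filled[(c, time)] = ((c1 - c) * y0 + (c - c0) * y1) / (c1 - c0)
    return filled

rows = ([(1, t, 99 + t) for t in range(1, 7)]          # cohort 1: 100..105
        + [(2, t, None) for t in range(2, 7)]
        + [(3, t, None) for t in range(3, 7)]
        + [(4, 4, 108), (4, 5, 110), (4, 6, 112)])

filled = interpolate_cohorts(rows)
print(filled[(2, 5)], filled[(3, 5)])
```

For time 5 this gives (2*104 + 1*110)/3 for cohort 2 and (1*104 + 2*110)/3 for cohort 3, matching the 2/3 and 1/3 weights described in the question. Times 1 to 3 produce nothing because cohort 4 has no data there.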

SAS: PROC FREQ with multiple ID variables

I have data tracking a certain eye phenomenon. Some patients have it in both eyes, and some patients have it in a single eye. This is what some of the data looks like:
EyeID PatientID STATUS Gender
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
As you can see from the data above, there are 9 patients in total and all of them have the phenomenon in one eye.
I need to count the number of patients with this eye phenomenon.
To get the total number of patients in the dataset, I used:
PROC FREQ data=new nlevels;
tables PatientID;
run;
To count the number of patients with this eye phenomenon, I used:
PROC SORT data=new out=new1 nodupkey;
by Patientid Status;
run;
proc freq data=new1 nlevels;
tables Status;
run;
This gave the correct number of patients with the phenomenon (9), but not the correct number without it (0).
I now need to calculate the gender distribution of this phenomenon. I used:
proc freq data=new1;
tables gender*Status/chisq;
run;
However, the cross table has the correct number of patients who have the phenomenon (9), but not the correct number without it (0). Does anyone have any thoughts on how to do this chi-square, where a patient who has the phenomenon in at least one eye counts as positive?
Thanks!

PROC FREQ is doing what you told it to: counting the status=0 cases.
In general, you are using rather blunt tools here when you should probably use a more precise one. PROC SORT NODUPKEY is overkill, for example, and it doesn't really do what you want anyway.
To set up a dataset of has/doesn't have, let's do a few things. First I add one more row (a patient who actually doesn't have the condition) so we can see that case working.
data have;
input eyeID patientID status gender $;
datalines;
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
15 10 0 M
;;;;
run;
Now we use the data step. We have eye-level data but want a patient-level dataset at the end, so we create a new patient-level status.
data patient_level;
set have;
by patientID;
retain patient_status;
if first.patientID then patient_status =0;
patient_status = (patient_Status or status);
if last.patientID then output;
keep patientID patient_Status gender;
run;
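The collapse from eye level to patient level is just an OR over each patient's eyes. As a cross-check of the counts, here is a small Python sketch over the same 15 rows; the variable names are illustrative and, as in the data step, the gender is taken from the patient's last row.

```python
from collections import Counter

# (eyeID, patientID, status, gender), same rows as the HAVE dataset
eyes = [(1, 1, 1, "M"), (2, 1, 0, "M"), (3, 2, 1, "M"), (4, 3, 0, "M"),
        (5, 3, 1, "M"), (6, 4, 1, "M"), (7, 4, 0, "M"), (8, 5, 1, "F"),
        (9, 6, 1, "F"), (10, 6, 0, "F"), (11, 7, 1, "F"), (12, 8, 1, "F"),
        (13, 8, 0, "F"), (14, 9, 1, "F"), (15, 10, 0, "M")]

patient_status = {}
patient_gender = {}
for eye_id, patient_id, status, gender in eyes:
    # A patient is positive if any of their eyes is positive.
    patient_status[patient_id] = patient_status.get(patient_id, 0) or status
    patient_gender[patient_id] = gender

status_counts = Counter(patient_status.values())
crosstab = Counter((patient_gender[p], s) for p, s in patient_status.items())

print(status_counts)
print(crosstab)
```

With the extra row for patient 10, this gives 9 positive patients and 1 negative one, and the gender-by-status table has 4 positive and 1 negative male patients and 5 positive female patients.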
Now, we can run your second proc freq. Also note you have a nice dataset of patients.
title "Patients with/without condition in any eye";
proc freq data=patient_level;
tables patient_status;
run;
title;
You also may be able to do your chi-square analysis, though I'm not a statistician and won't dip my toe into whether this is an appropriate analysis. It's likely better than your first, anyway - as it correctly identifies has/doesn't have status in at least one eye. You may need a different indicator, if you need to know number of eyes.
title "Crosstab of gender by patient having/not having condition";
proc freq data=patient_level;
tables gender*patient_Status/chisq;
run;
title;
If your actual data has every single patient having the condition, of course, it's unlikely a chi-square analysis is appropriate.

Setting most variables to missing, while preserving the contents of a select few

I have a dataset like this (but with several hundred vars):
id q1 g7 q3 b2 zz gl az tre
1 1 2 1 1 1 2 1 1
2 2 3 3 2 2 2 1 1
3 1 2 3 3 2 1 3 3
4 3 1 2 2 3 2 1 1
5 2 1 2 2 1 2 3 3
6 3 1 1 2 2 1 3 3
I'd like to keep id, b2, and tre, but set everything else to missing. In a dataset this small, I can easily use call missing (q1, g7, q3, zz, gl, az) - but in a set with many more variables, I would effectively like to say call missing (of _ALL_ *except ID, b2, tre*).
Obviously, SAS can't read my mind. I've considered workarounds that involve another data step or proc sql where I copy the original variables to a new ds and merge them back on post, but I'm trying to find a more elegant solution.

This technique uses an unexecuted SET statement (its effect is compile-time only) to define all the variables from the original dataset, which keeps their order and all their attributes (type, label, format, etc.). Because that SET never executes, those variables are effectively all missing. The second SET statement, which does execute, brings in only the variables that are NOT to be set to missing. It doesn't explicitly set variables to missing, but it achieves the same result.
data nomiss;
input id q1 g7 q3 b2 zz gl az tre;
cards;
1 1 2 1 1 1 2 1 1
2 2 3 3 2 2 2 1 1
3 1 2 3 3 2 1 3 3
4 3 1 2 2 3 2 1 1
5 2 1 2 2 1 2 3 3
6 3 1 1 2 2 1 3 3
;;;;
run;
proc print;
run;
data manymiss;
if 0 then set nomiss;
set nomiss(keep=id b2 tre:);
run;
proc print;
run;

Another fairly simple option is to set them missing using a macro, and basic code writing techniques.
For example, let's say we have a macro:
%macro call_missing(var=);
call missing(&var.);
%mend call_missing;
Now we can write a query that uses dictionary.columns to identify the variables we want set to missing:
proc sql;
select name
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not (name in ('ID','B2','TRE')); *note UPCASE for all these;
quit;
Now, we can combine these two things to get a macro variable containing code we want, and use that:
proc sql;
select cats('%call_missing(var=',name ,')')
into :misslist separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not (name in ('ID','B2','TRE')); *note UPCASE for all these;
quit;
data want;
set have;
&misslist.;
run;
This has the advantage that it doesn't care about the variable types, nor the order. It has the disadvantage that it's somewhat more code, but it shouldn't be particularly long.

If the variables are all of the same type (numeric or character) then you could use an array.
data want ;
set have;
array _all_ _numeric_ ;
do over _all_;
if upcase(vname(_all_)) not in ('ID','B2','TRE') then _all_=.;
end;
run;
If you don't care about the order then just drop the variables and add them back on with 0 observations.
data want;
set have (keep=ID B2 TRE:) have (obs=0 drop=ID B2 TRE:);
run;

how to solve the problem of selecting multiple rows

I have data in this format (it is just an example, for n=2):
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
The data above is sorted (7 rows in total). I need to form groups of 2, 3, or 4 rows separately and generate a graph. In the data above, I formed groups of 2 rows. The third row is left alone because no other row shares its X value. A group can be formed only from rows with the same X value, NOT across different X values.
Next, I check whether both rows have “good” in the info column. If both rows have “good”, the group is good; otherwise it is bad. In the example above, the third (last) group is a good group; the rest are all bad. Once I'm done with all the rows, I calculate the total number of good groups divided by the total number of groups.
In the example above, the output will be: total no. of good groups / total no. of groups => 1/3.
This is the case n=2 (the size of the group).
For n=3 we form groups of 3 rows, and for n=4 groups of 4 rows, and find the good/bad groups in the same way. If all the rows in a group have “good”, the group is good; otherwise it is bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left out the 4th row and the last 2 rows, as I can't make a group of 3 rows with them. The first group is “bad” and the last group is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I form groups of 4 rows and find the result. The 5th, 6th, 7th, and 8th rows are left out. I made 2 groups of 4 rows and both are “good”.
Output: 2/2
So, after getting the 3 output values for n=2, n=3, and n=4, I will plot a graph of these values.

Below is code that I think does what you are looking for. It assumes that the data you described is stored in three separate datasets named data_2, data_3, and data_4. Each dataset is processed by the %FIND_GOOD_GROUPS macro, which determines which groups of X have all "GOOD" values in INFO; this summary is then appended as a new row to the BASE dataset. I didn't add the code, but you could calculate the ratio of GOOD_COUNT to _FREQ_ in a separate data step, then use a procedure to plot the ratio against N. Hope this gets close to what you're trying to accomplish.
%******************************************************************************;
%macro main;
%find_good_groups(dsn=data_2, n=2);
%find_good_groups(dsn=data_3, n=3);
%find_good_groups(dsn=data_4, n=4);
proc print data=base uniform noobs;
run;
%mend main;
%******************************************************************************;
%******************************************************************************;
%macro find_good_groups(dsn=,n=);
%***************************************************************************;
%* Sort data by X and Y so that you can use FIRST.X variable in Data step. *;
%***************************************************************************;
proc sort data=&dsn;
by x y;
run;
%***************************************************************************;
%* TEMP dataset uses the FIRST.X variable to reset COUNT and GOOD_COUNT to *;
%* initial values for each row where X changes. Each row in the X groups *;
%* adds 1 to COUNT and sets GOOD_COUNT to 0 (zero) if INFO is ever "BAD". *;
%* A record is output if COUNT is equal to the macro parameter &N. *;
%***************************************************************************;
data temp;
keep good_count n;
retain count 0 good_count 1 n &n;
set &dsn;
by x y;
if first.x then do;
count = 0;
good_count = 1;
end;
count = count + 1;
if good_count eq 1 then do;
if trim(left(upcase(info))) eq "BAD" then do;
good_count = 0;
end;
end;
if count eq &n then output;
run;
%***************************************************************************;
%* Summarize the TEMP data to find the number of times that all of the *;
%* rows had "GOOD" in the INFO column for each value of X. *;
%***************************************************************************;
proc summary data=temp;
id n;
var good_count;
output out=n_&n (drop=_type_) sum=;
run;
%***************************************************************************;
%* Append to BASE dataset to retain the sums and frequencies from all of *;
%* the datasets. BASE can be used to plot the N / number of Good records. *;
%***************************************************************************;
proc append data=n_&n base=base force; run;
%mend find_good_groups;
%******************************************************************************;
%main
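To sanity-check the expected ratios from the question without running SAS, the same counting rule can be sketched in Python. This is only a model of the macro's logic as I read it: a group of X counts when it has at least n rows, and it is good when none of its first n rows is "bad". The function name good_group_ratio is made up for the illustration.

```python
from itertools import groupby

def good_group_ratio(rows, n):
    """rows: (x, y, info) tuples sorted by x. Returns (good_groups, total_groups)."""
    good = total = 0
    for _, group in groupby(rows, key=lambda r: r[0]):
        infos = [info for _, _, info in group]
        if len(infos) >= n:                 # groups with fewer than n rows are ignored
            total += 1
            if all(i.strip().upper() != "BAD" for i in infos[:n]):
                good += 1
    return good, total

data_2 = [(2, 1, "good"), (2, 4, "bad"), (3, 2, "good"), (4, 1, "bad"),
          (4, 4, "good"), (6, 2, "good"), (6, 3, "good")]

good, total = good_group_ratio(data_2, 2)
print(f"{good}/{total}")
```

For the n=2 example this prints 1/3, the ratio given in the question; the n=3 and n=4 examples can be checked the same way by passing their rows and the matching n.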
