How to compare 2 columns in 2 tables using SAS

There are 2 similar columns (x1 and x2) in 2 tables, a1 and a2. How can I match them and line them up?
I tried the COMPARE function, but it failed.
data a1;
input x1 $15.;
cards;
abcd
go shopping
DUT univarsity
he is driving
;
run;
data a2;
input x2 $15.;
cards;
golf shopping
she is driving
abcdf
DUT univars
;
run;
and I want the final table with matched values:
abcd abcdf
go shopping golf shopping
DUT univarsity DUT univars
he is driving she is driving

This solution will not scale well, because it compares every row of a1 with every row of a2, but you get the idea. Use COMPGED to compute the generalized edit distance for each pair and, for each x1, keep the pair with the minimum distance. This doesn't deal with cases where you don't have a good match.
You're essentially doing fuzzy matching, which is a computationally intensive process.
proc sql;
create table want as
select t1.x1, t2.x2
from a1 as t1, a2 as t2
group by t1.x1
having compged(x1, x2) = min(compged(x1, x2));
quit;
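One possible way to handle the "no good match" case mentioned above is to add a cutoff on the distance. This is only a sketch: the table name want_cutoff and the threshold of 500 are assumptions made for illustration, and a sensible threshold depends on your data.

```sas
proc sql;
  create table want_cutoff as
  select t1.x1, t2.x2,
         compged(t1.x1, t2.x2) as ged
  from a1 as t1, a2 as t2
  group by t1.x1
  /* keep the closest x2 for each x1, but only if it is close enough */
  having compged(t1.x1, t2.x2) = min(compged(t1.x1, t2.x2))
     and compged(t1.x1, t2.x2) <= 500   /* hypothetical cutoff */
  ;
quit;
```

Values of x1 whose best candidate is still further away than the cutoff simply drop out of the result, so you can review them separately.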

Related

How to make a table (with proc report or data step) of a grouped variable where in different columns are counts of different variables?

Could you please give some advice on how to calculate different counts in different columns when grouping by a certain variable with proc report (if that is possible)?
I have copied an example and the desired result here to make it clearer what I want to achieve. I can build this table in SQL by grouping each count individually (with where clauses, for example where Building_code = 'A') and then joining the results into one table, but that gets rather long, especially when I want to add more columns. Is there a way to define it in proc report or in a shorter data step? If yes, could you give a short example please?
Example:
Solution:
Thank you for your time.
This should work. There is absolutely no need to do this by joining multiple tables.
data have;
input Person_id Country :$15. Building_code $ Flat_code $ age_category $;
datalines;
1000 England A G 0-14
1001 England A G 15-64
1002 England A H 15-64
1003 England B H 15-64
1004 England B J 15-64
1005 Norway A G 15-64
1006 Norway A H 65+
1007 Slovakia A G 65+
1008 Slovakia B H 65+
;
run;
This is a solution in proc sql. It's not really long or complicated. I don't think you could do it any shorter using data step.
proc sql;
create table want as
select distinct country,
       sum(Building_code = 'A') as A_buildings,
       sum(Flat_code = 'G') as G_flats,
       sum(age_category = '15-64') as adults
from have
group by country
;
quit;
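For comparison, here is a rough sketch of the same counts done with a data step and BY-group processing. The data set names have_sorted and want_ds are made up for this example. It relies on the same trick as the SQL above: a comparison such as Building_code = 'A' evaluates to 1 or 0 in SAS.

```sas
proc sort data=have out=have_sorted;
  by country;
run;

data want_ds;
  set have_sorted;
  by country;
  /* reset the counters at the start of each country */
  if first.country then do;
    A_buildings = 0;
    G_flats = 0;
    adults = 0;
  end;
  /* sum statements: the counters are retained across rows */
  A_buildings + (Building_code = 'A');
  G_flats + (Flat_code = 'G');
  adults + (age_category = '15-64');
  /* write one row per country */
  if last.country;
  keep country A_buildings G_flats adults;
run;
```

As you can see, it is longer than the proc sql version, so the SQL is probably the more convenient choice when you keep adding columns.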

SAS vlookup - view all data not only joined

I would like to see all the data from the "one" dataset. Where there is no match between the tables, the looked-up values should be set to 0. The current code only returns rows where there is a match. This is the table I need:
data one;
input lastname: $15. typeofcar: $15. mileage;
datalines;
Jones Toyota 3000
Smith Toyota 13001
Jones2 Ford 3433
Smith2 Toyota 15032
Shepherd Nissan 4300
Shepherd2 Honda 5582
Williams Ford 10532
;
data two;
input startrange endrange typeofservice & $35.;
datalines;
3000 5000 oil change
5001 6000 overdue oil change
6001 8000 oil change and tire rotation
8001 9000 overdue oil change
9001 11000 oil change
11001 12000 overdue oil change
12001 14000 oil change and tire rotation
15032 14999 overdue oil change
13001 15999 15000 mile check
;
data combine;
do until (mileage<15000);
set one;
do i=1 to nobs;
set two point=i nobs=nobs;
if startrange = mileage then
output;
end;
end;
run;
proc print;
run;
Description of the code from the SAS support site:
Read the first observation from the SAS data set outside the DO loop. Assign the FOUND variable to 0. Start the DO loop reading observations from the SAS data set inside the DO loop. Process the IF condition; if the IF condition is true, OUTPUT the observation and set the FOUND variable to 1. Assigning the FOUND variable to 1 will cause the DO loop to stop processing because of the UNTIL (FOUND) that is coded on the DO loop. Go back to the top of the DATA step and read the next observation from the data set outside the DO loop and process through the DATA step again until all observations from the data set outside the DO loop have been read.
You could do that with a LEFT JOIN in PROC SQL:
take all variables from one,
then use two CASE expressions to set startrange and endrange to 0 when they are missing.
proc sql noprint;
create table want as
select t1.*
, case when t2.startrange=. then 0 else t2.startrange end as startrange
, case when t2.endrange=. then 0 else t2.endrange end as endrange
, t2.typeofservice
from one t1 left join two t2
on (t1.mileage = t2.startrange)
;
quit;
Or do it in 2 steps (I personally find the if of the data step cleaner than the case when of the proc sql.)
proc sql noprint;
create table want as select *
from one t1 left join two t2 on (t1.mileage = t2.startrange)
;
quit;
data want; set want;
if startrange=. then do; startrange=0; endrange=0; end;
run;
I can't use PROC SQL because I need the lookup inside a DO UNTIL loop. I need another solution.
Data step is not the best way to code this. It is much easier to code fuzzy matches using SQL code.
Not sure why you need to have zeros instead of missing values, but coalesce() should make it easy to provide them.
proc sql ;
create table combine as
select a.*
, coalesce(b.startrange,0) as startrange
, coalesce(b.endrange,0) as endrange
, b.typeofservice
from one a left join two b
on a.mileage between b.startrange and b.endrange
;
quit;
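If the lookup really has to stay in a data step, as the comment above asks, something along these lines might work. It is a sketch of the FOUND-flag approach described in the question, not tested against your real data. The variables found and i are helpers introduced here, and the equality test is copied from the original code; replace it with startrange <= mileage <= endrange if you want a range lookup instead.

```sas
data combine;
  set one;                        /* one pass per row of ONE */
  found = 0;
  do i = 1 to nobs until (found);
    set two point=i nobs=nobs;    /* direct access into TWO */
    if startrange = mileage then found = 1;
  end;
  /* no match: overwrite the values left over from the last row read */
  if not found then do;
    startrange = 0;
    endrange = 0;
    typeofservice = ' ';
  end;
  drop i found;
run;
```

Because there is no explicit OUTPUT statement, every row of one is written exactly once, matched or not.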

How to work across two datasets in SAS

I have two datasets described below
data1:
$restaurant $reviewers
A Tom
B Jack.Mary.Joan
C Tom.Joan
D Rose
data2 (sorted by the friends numbers):
$user $friends
Tom Joan.Mary.Jack
Jack Tom.Rose
Mary Tom
Joan Tom
The question is to calculate the overlap in the reviews of these users with the reviews of their friends.
Take Tom as an example: the restaurants Tom's friends reviewed are B and C, of which C was also reviewed by Tom. So the overlap is 1 shared restaurant (C) out of the 2 reviewed by his friends (B and C), i.e. 1/2 = 50%.
I think I need a loop to work across two datasets, but with very basic knowledge of SAS, I don't know how. Has anybody an idea?
Thank you very much.
You should try something like this.
data reviews;
infile datalines dsd dlm=",";
input restaurant $ reviewer $;
datalines;
A,Tom
B,Jack
B,Mary
B,Joan
C,Tom
C,Joan
D,Rose
;
run;
data users;
infile datalines dsd dlm=",";
input user $ friend $;
datalines;
Tom,Joan
Tom,Mary
Tom,Jack
Jack,Tom
Jack,Rose
Mary,Tom
Joan,Tom
;
run;
proc sql;
create table want as
select t1.user
,sum(case when t3.restaurant=t2.restaurant then 1 else 0 end)/count(*) as percentage
from users t1
inner join reviews t2
on t1.user=t2.reviewer
inner join reviews t3
on t1.friend=t3.reviewer
group by t1.user
;
quit;
I didn't get your 0.5 value for Tom, but maybe you made a mistake.
You can adapt the code as needed.
I followed the logic from here :
How to check percentage overlap in SAS
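If the overlap is meant to be counted over distinct restaurants, as in the Tom example from the question, a variation like the following might get closer. It is only a sketch that reuses the reviews and users tables above; the output names overlap, friend_restaurants and shared_restaurants are invented for illustration.

```sas
proc sql;
  create table overlap as
  select f.user,
         /* distinct restaurants reviewed by any friend */
         count(distinct r.restaurant) as friend_restaurants,
         /* of those, the ones the user reviewed as well */
         count(distinct case when u.reviewer is not null
                             then r.restaurant end) as shared_restaurants,
         calculated shared_restaurants
           / calculated friend_restaurants as percentage
  from users as f
       inner join reviews as r
         on f.friend = r.reviewer
       left join reviews as u
         on u.reviewer = f.user
        and u.restaurant = r.restaurant
  group by f.user
  ;
quit;
```

For Tom, the friends' restaurants are B and C and only C is shared, so this should give the 1/2 the question expects.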

SAS- Calculate Top Percent of Population

I am trying to seek some validation, this may be trivial for most but I am by no means an expert at statistics. I am trying to select patients in the top 1% based on a score within each drug and location. The data would look something like this (on a much larger scale):
Patient drug place score
John a TX 12
Steven a TX 10
Jim B TX 9
Sara B TX 4
Tony B TX 2
Megan a OK 20
Tom a OK 10
Phil B OK 9
Karen B OK 2
The code snippet I have written to calculate those top 1% patients is as follows:
proc sql;
create table example as
select *,
score/avg(score) as test_measure
from prior_table
group by drug, place
having test_measure>.99;
quit;
Does this achieve what I am trying to do, or am I going about it all wrong? Sorry if this is really trivial to most.
Thanks
There are multiple ways to calculate and estimate a percentile. A simple way is to use PROC SUMMARY
proc summary data=have;
var score;
output out=pct p99=p99;
run;
This will create a data set named pct with a variable p99 containing the 99th percentile.
Then filter your table for values >=p99
proc sql noprint;
create table want as
select a.*
from have as a
where a.score >= (select p99 from pct);
quit;
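The steps above compute one percentile over the whole table. Since the question asks for the top 1% within each drug and location, a by-group variation could look like this. It is a sketch that assumes the table is called have and has the drug, place and score columns from the question; pct_by and want_by are names made up here.

```sas
proc summary data=have nway;
  class drug place;            /* one percentile per drug/place group */
  var score;
  output out=pct_by p99=p99;
run;

proc sql;
  create table want_by as
  select a.*
  from have as a
       inner join pct_by as b
         on a.drug = b.drug
        and a.place = b.place
  where a.score >= b.p99;
quit;
```

With very small groups, like the sample rows in the question, the 99th percentile will just be the top score in each group, so this only becomes meaningful on the larger data.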

sas how to do frequencies for only certain values

I have some survey data with possible responses, an example would be:
Q1
Person1 Yes
Person2 No
Person3 Missing
Person4 Multiple Marks
Person5 Yes
I need to calculate the frequencies by question, so that only the Yes/No (other questions have varied responses such as frequently, very frequently, etc) are counted in the totals - not the ones with Multiple Marks. Is there a way to exclude these using proc freq or another method?
Outcome:
Yes: 2
No: 1
Total: 3
Using proc freq, I'd do something like this:
proc freq data=have (where=(q1 in ("Yes", "No")));
tables q1 / out=want;
run;
Output:
Q1 Count Percent
No 1 33.333333333
Yes 2 66.666666667
Proc sql:
proc sql;
select
sum(case when q1 eq "Yes" then 1 else 0 end) as Yes
,sum(case when q1 eq "No" then 1 else 0 end) as No
,count(q1) as Total
from have
where q1 in ("Yes", "No");
quit;
Output:
Yes No Total
2 1 3
The best way to do this is using formats.
Rather than storing your data as character strings, you should be storing it as numeric variables. This allows you to use numeric missing values to code those values you don't consider proper responses; using formats allows you to have your cake and eat it too (i.e., you still get those nice pretty response labels).
Here's an example. To understand this, you need to understand SAS special missings. Note the missing statement tells SAS to consider a single "M" in the input as .M (and similar for D and R). I then show two PROC FREQ results, one with the missings excluded, one with them included, to show the difference.
proc format;
value YNQF
1 = 'Yes'
2 = 'No'
. = 'Missing'
.M= 'Multiple Marks'
.D= "Don't Know"
.R= "Refused"
;
run;
missing M R D;
data have;
input Q1 Q2 Q3;
format q1 q2 q3 YNQF.;
datalines;
1 1 2
2 1 R
. . 1
M 1 1
1 . D
;;;;
run;
proc freq data=have;
tables (q1 q2 q3);
tables (q1 q2 q3)/missing;
run;