Dataset HAVE includes two variables with misspelled names in them: names and friends.
Name Age Friend
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
I have a small dataset CORRECTIONS that includes wrong_names and resolved_names.
current_names resolved_names
Jon John
Ann Anne
Jimb Jim
I need any name in names or friends in HAVE that matches a name in the wrong_names column of CORRECTIONS to get recoded to the corresponding string in resolved_name. The resulting dataset WANT should look like this:
Name Age Friend
John 11 Anne
John 11 Tom
Jim 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Anne
In R, I could simply invoke each dataframe and vector using if_else(), but the DATA step in SAS doesn't play nicely with multiple datasets. How can I make these replacements using CORRECTIONS as a look-up table?
There are many ways to do a lookup in SAS.
First of all, however, I would suggest to de-duplicate your look-up table (for example, using PROC SORT and Data Step/Set/By) - deciding which duplicate to keep (if any exist).
As for the lookup task itself, for simplicity and learning I would suggest the following:
The "OLD SCHOOL" way - good for auditing inputs and outputs (it is easier to validate the results of a join when input tables are in the required order):
*** data to validate;
data have;
length name $10. age 4. friend $10.;
input name age friend;
datalines;
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
run;
*** lookup table;
data corrections;
length current_names $10. resolved_names $10.;
input current_names resolved_names;
datalines;
Jon John
Ann Anne
Jimb Jim
run;
*** de-duplicate lookup table;
proc sort data=corrections nodupkey; by current_names; run;
proc sort data=have; by name; run;
data have_corrected;
merge have(in=a)
corrections(in=b rename=(current_names=name))
;
by name;
if a;
if b then do;
name=resolved_names;
end;
run;
The SQL way - which avoids sorting the have table:
proc sql;
create table have_corrected_sql as
select
coalesce(b.resolved_names, a.name) as name,
a.age,
a.friend
from work.have as a left join work.corrections as b
on a.name eq b.current_names
order by name;
quit;
NB the Coalesce() is used to replace missing resolved_names values (ie when there is no correction) with names from the have table
EDIT: To reflect Quentin's (CORRECT) comment that I'd missed the update to both name and friend fields.
Based on correcting the 2 fields, again many approaches but the essence is one of updating a value only IF it exists in the lookup (corrections) table. The hash object is pretty good at this, once you've understood it's declaration.
NB: any key fields in the Hash object need to be specified on a Length statement BEFOREHAND.
EDIT: as per ChrisJ's alternative to the Length statement declaration, and my reply (see below) - it would be better to state that key variables need to be defined BEFORE you declare the hash table.
data have_corrected;
keep name age friend;
length current_names $10.;
*** load valid names into hash lookup table;
if _n_=1 then do;
declare hash h(dataset: 'work.corrections');
rc = h.defineKey('current_names');
rc = h.defineData('resolved_names');
rc = h.defineDone();
end;
do until(eof);
set have(in=a) end=eof;
*** validate both name fields;
if h.find(key:name) eq 0 then
name = resolved_names;
if h.find(key:friend) eq 0 then
friend = resolved_names;
output;
end;
run;
EDIT: to answer the comments re ChrisJ's SQL/Update alternative
Basically, you need to restrict each UPDATE statement to ONLY those rows that have name values or friend values in the corrections table - this is done by adding another where clause AFTER you've specified the set var = (clause). See below.
NB. AFAIK, an SQL solution to your requirement will require MORE than 1 pass of both the base table & the lookup table.
The lookup/hash table, however, requires a single pass of the base table, a load of the lookup table and then the lookup actions themselves. You can see the performance difference in the log...
proc sql;
*** create copy of have table;
create table work.have_sql as select * from work.have;
*** correct name field;
update work.have_sql as u
set name = (select resolved_names
from work.corrections as n
where u.name=n.current_names)
where u.name in (select current_names from work.corrections)
;
*** correct friend field;
update work.have_sql as u
set friend = (select resolved_names
from work.corrections as n
where u.friend=n.current_names)
where u.friend in (select current_names from work.corrections)
;
quit;
Given data
*** data to validate;
data have;
length name $10. age 4. friend $10.;
input name age friend;
datalines;
Jon 11 Ann
Jon 11 Tom
Jimb 12 Egg
Joe 11 Egg
Joe 11 Anne
Joe 11 Tom
Jed 10 Ann
run;
*** lookup table;
data corrections;
length from_name $10. to_name $10.;
input from_name to_name;
datalines;
Jon John
Ann Anne
Jimb Jim
run;
One SQL alternative is to perform a existent mapping select look up on each field to be mapped. This would be counter to joining the corrections table one time for each field to be mapped.
proc sql;
create table want1 as
select
case when exists (select * from corrections where from_name=name)
then (select to_name from corrections where from_name=name)
else name
end as name
, age
, case when exists (select * from corrections where from_name=friend)
then (select to_name from corrections where from_name=friend)
else friend
end as friend
from
have
;
Another, SAS only way, to perform inline left joins is to use a custom format.
data cntlin;
set corrections;
retain fmtname '$cohen'; /* the fixer */
rename from_name=start to_name=label;
run;
proc format cntlin=cntlin;
run;
data want2;
set have;
name = put(name,$cohen.);
friend = put(friend,$cohen.);
run;
You can use an UPDATE in proc sql :
proc sql ;
update have a
set name = (select resolved_names b from corrections where a.name = b.current_names)
where name in(select current_names from corrections)
;
update have a
set friend = (select resolved_names b from corrections where a.friend = b.current_names)
where friend in(select current_names from corrections)
;
quit ;
Or, you could use a format :
/* Create format */
data current_fmt ;
retain fmtname 'NAMEFIX' type 'C' ;
set resolved_names ;
start = current_names ;
label = resolved_names ;
run ;
proc format cntlin=current_fmt ; run ;
/* Apply format */
data want ;
set have ;
name = put(name ,$NAMEFIX.) ;
friend = put(friend,$NAMEFIX.) ;
run ;
Try this:
proc sql;
create table want as
select p.name,p.age,
case
when q.current_names is null then p.friend
else q.resolved_names
end
as friend1
from
(
select
case
when b.current_names is null then a.name
else b.resolved_names
end
as name,
a.age,a.friend
from
have a
left join
corrections b
on upcase(a.name) = upcase(b.current_names)
) p
left join
corrections q
on upcase(p.friend) = upcase(q.current_names);
quit;
Output:
name age friend
John 11 Anne
Jed 10 Anne
Joe 11 Anne
Jim 12 Egg
Joe 11 Egg
Joe 11 Tom
John 11 Tom
Let me know in case of any clarifications.
Related
I have a table which has dates in it and the member changes over time. I want to know when the member started and ended. If the member starts and ends and then restarts that needs to be a different indicator.
Sample of what I have (sorry I don't know how to make a table here):
member yyyymm
Jim 201603
Jim 201606
Jim 201609
Bob 201709
Bob 201712
Jim 201806
Jef 201806
Jef 201809
I tried a proc sql statement which finds min and max date but then the max date is wrong if the member restarts (code A below). I also tried a data step and that said it wasn't properly sorted (code B below)
code A
proc sql;
create table tst as
select
member,
max(yyyymm) as effective_until,
min(yyyymm) as effective_from
from tbl
group by 1,2;
quit;
code B
data tst;
count + 1;
by member;
if first.member then count = 1;
run;
What I'm hoping for:
member yyyymm id
Jim 201603 1
Jim 201606 1
Jim 201609 1
Bob 201709 2
Bob 201712 2
Jim 201803 3
Jef 201806 4
Jef 201809 4
proc sort data=have;
by yyyymm member;
data want;
set have;
by yyyymm member;
if first.member then id+1;
run;
So try the lag function that return parameter from previous call. So here it return the value from last observation (but handle with care). When the member is different from last observation simply change you id. For example by adding 1.
data have;
length member $3 yyyymm $6;
input member yyyymm;
cards;
Jim 201603
Jim 201606
Jim 201609
Bob 201709
Bob 201712
Jim 201806
Jef 201806
Jef 201809
run;
data want;
set have;
if lag(member)^=member then id+1;
run;
I want to sum over a specific variable in my dataset, without loosing all the other columns. I have tried the following code:
proc summary data=work.test nway missing;
class var_1 var_2 ; *groups;
var salary;
id _character_ _numeric_; * keeps all variables;
output out=test2(drop=_:) sum= ;
run;
But it does not seem to sum properly, and for the "salary" column I'm just left with the value of the last value in each group (var_1 and var_2). If I remove
id _character_ _numeric_;
it works fine, but I loose all other columns.
Example:
data:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
desired output:
John Sales 66 M
Mary Acctng 21 F
I think this does what you want. You still get warnings about name conflicts and variables being dropped but at least the ones you want are kept. The ID statement is depreciated in favor in the new and better IDGROUP output statement option.
You could add the AUTONAME option to the output statement if you wanted PROC SUMMARY to automatically rename the conflicting variables.
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;;;;
run;
proc print;
run;
proc summary nway missing;
class name dept;
var salary;
output out=test2(drop=_:) sum= idgroup(out(_all_)=);
run;
proc print;
run;
Try this:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql;
create table salary2 as
select *,
monotonic() as n,
sum(salary) as sum_salary
from salary
group by name
having max(n)=n;
quit;
I wasn't aware that SAS did this, but the problem appears to lie in the fact that the id statement takes preference over the var statement. By including all variables in the id statement, all the output is showing is the maximum value for each variable, including Salary.
One option is to pull a list of the variables not included in the class or var statements from dictionary.columns, then use that list in the id statement. Just be aware that proc summary runs in memory and I have come across out of memory problems in the past when many variables have been included in the id statement
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql noprint;
select name into :cols separated by ' '
from dictionary.columns
where libname='WORK'
and
memname='SALARY'
and
name not in ('name','Salary');
quit;
%put &cols.;
proc summary data=salary nway missing;
class name;
var salary;
id &cols.;
output out=want (drop=_:) sum=;
run;
I am using the below code but in the final output I am not able to get the name in the first entry where income is 234234. How do I get name entry here.
data names;
input name $ age;
datalines;
John 10
Mary 12
Sally 12
Fred 1
Paul 2
;
run;
data check;
input name $ income;
datalines;
Mary 121212
Fred 334343
Ben 234234
;
Proc sql;
title 'Inner Join';
create table common_names as
select * from names as n right join check as c on
n.name = c.name;
run;
Proc print data = common_names;
run;
Output
Inner Join
Obs name age income
1 . 234234
2 Fred 1 334343
3 Mary 12 121212
You cannot create two variables with the same name, in this case the variable NAME. So either create two variables
select n.name as name1, c.name as name2, ....
or use the COALESCE() function to create a single variable.
select coalesce(n.name,c.name) as name, ....
You might also what to look at SAS's NATURAL join. That will link tables on variables with the same name and automatically coalesce the key variable values.
create table common_names as
select *
from names as n
natural right join check as c
;
NAME DATE
---- ----------
BOB 24/05/2013
BOB 12/06/2012
BOB 19/10/2011
BOB 05/02/2010
BOB 05/01/2009
CARL 15/05/2011
LOUI 15/01/2014
LOUI 15/05/2013
LOUI 15/05/2012
DATA newdata;
SET mydata;
count + 1;
IF FIRST.name THEN count=1;
BY name DESCENDING date;
run;
here i got count group wise 1,2,3 so on..I want the output of name(all obs of bob) if count> 3. please help me..
The simplest way to do that is to output the last row for each ID if it is > 3, then merge that dataset back to your master dataset, keeping only matches. You could also use PROC FREQ to generate the dataset of counts and merge to that.
You can do it in a single datastep using a DoW loop, but that's more complicated, so I wouldn't recommend a new user do that.
I think this shows the power of SQL - though some would say since this generates a NOTE in the log it isn't good practice. Use the GROUP & HAVING clause in SQL to create a count of the names that you then limit to 3.
proc sql;
create table want as
select *
from have
group by name
having count(name)>3;
quit;
Here are a couple different ways to do this using SUBQUERIES in PROC SQL
Data HAVE;
Length NAME $50;
Input Name $ Date: ddmmyy10.;
Format date ddmmyy10.;
datalines;
BOB 24/05/2013
BOB 12/06/2012
BOB 19/10/2011
BOB 05/02/2010
BOB 05/01/2009
CARL 15/05/2011
LOUI 15/01/2014
LOUI 15/05/2013
LOUI 15/05/2012
;
Run;
Using a multiple-value subquery in the Where statement
Proc sql;
Create table WANT1 as
Select *
From Have
Where Name in (Select name from have b group by b.name having count(b.name)>3);
Quit;
Using a subquery in the From clause
Proc sql;
Create table WANT2 as
Select a.name, a.date
From Have a Inner Join (select name, count(name) as Count from have b group by b.name having Count>3)
On a.name=b.name
;
Quit;
proc sort data=sas.mincome;
by F3 F4;
run;
Proc sort doesn't sort the dataset by formatted values, only internal values. I need to sort by two variables prior to a merge. Is there anyway to do this with proc sort?
I don't think you can sort by formatted values in proc sort, but you can definitely use a simple proc SQL procedure to sort a dataset by formatted values. proc SQL is similar to the data step and proc sort, but is more powerful.
The general syntax of proc sql for sorting by formatted values will be:
proc sql;
create table NewDataSet as
select variable(s)
from OriginalDataSet
order by put(variable1, format1.), put(variable2, format2.);
quit;
For example, we have a sample data set containing the names, sex and ages of some people and we want to sort them:
proc format;
value gender 1='Male'
2='Female';
value age 10-15='Young'
16-24='Old';
run;
data work.original;
input name $ sex age;
datalines;
John 1 12
Zack 1 15
Mary 2 18
Peter 1 11
Angela 2 24
Jack 1 16
Lucy 2 17
Sharon 2 12
Isaac 1 22
;
run;
proc sql;
create table work.new as
select name, sex format=gender., age format=age.
from work.original
order by put(sex, gender.), put(age, age.);
quit;
Output of work.new will be:
Obs name sex age
1 Mary Female Old
2 Angela Female Old
3 Lucy Female Old
4 Sharon Female Young
5 Jack Male Old
6 Isaac Male Old
7 John Male Young
8 Zack Male Young
9 Peter Male Young
If we had used proc sort by sex, then Males would have been ranked first because we had used 1 to represent Males and 2 to represent Females which is not what we want. So, we can clearly see that proc sql did in fact sort them according to the formatted values (Females first, Males second).
Hope this helps.
Because of the nature of formats, SAS only uses the underlying values for the sort. To my knowledge, you cannot change that (unless you want to build your own translation table via PROC TRANTAB).
What you can do is create a new column that contains the formatted value. Then you can sort on that column.
proc format library=work;
value $test 'z' = 'a'
'y' = 'b'
'x' = 'c';
run;
data test;
format val $test.;
informat val $1.;
input val $;
val_fmt = put(val,$test.);
datalines;
x
y
z
;
run;
proc print data=test(drop=val_fmt);
run;
proc sort data=test;
by val_fmt;
run;
proc print data=test(drop=val_fmt);
run;
Produces
Obs val
1 c
2 b
3 a
Obs val
1 a
2 b
3 c