Finding a matching date and title among many variables in SAS

I have a data set that is a list of employees with their job titles and month and year that job title started for their entire career here.
It looks something like this: employeeID JobTitle1 MonthYearofTitle1 Department1 Jobtitle2 MonthYearofTitle2 Department2 etc.
I have another list of employees that are not in the first data set and only have one job title and date of title. My goal is to match employees in the 2nd data set with employees in the 1st based on their job title and month/year, but I am completely unsure of how to do this match because it involves information being present among multiple variables.
Put another way, if MarySue became an admin in Jan2017, I want to match her with JohnDoe, who also became an admin in Jan2017, and flag them as a match for further analysis.
Unfortunately, I am not sure where to even begin with my code so I don't have things I've tried. The data would look like this
Data set 1
employeeID JobTitle1 MonthYearofTitle1 Jobtitle2 MonthYearofTitle2
JohnDoe Intern Jan2016 Admin Jan2017
JakeSo VP Jul2017
JulieDo Manager April2017
Data set 2
employeeID JobTitle1 MonthYearofTitle1
MarySue Admin Jan2017
JaneDoe Admin Jan2017
Greg VP Jul2017
Desired Outcome / Data set:
Employee1 Employee2 Title Date Flag
JohnDoe MarySue Admin Jan2017 Match
JakeSo Greg VP Jul2017 Match
JulieDo Admin Jan2017 No Match
Can anyone help?

You can do a FULL JOIN or a LEFT JOIN and use a CASE expression to create a calculated field that flags the matching records.
The code below will do a Full Join and create a flag field:
Creating Table1 & Table2: Only 1 record will match
data table1;
input employeeID $ JobTitle1 $ MonthYearofTitle1 Jobtitle2 $ MonthYearofTitle2 ;
informat MonthYearofTitle1 monyy7. MonthYearofTitle2 monyy7.;
format MonthYearofTitle1 monyy7. MonthYearofTitle2 monyy7.;
datalines;
JohnDoe Intern Jan2016 Admin Jan2017
TomJones Junior Jul2016 Admin Jul2017
;
run;
data table2;
input employeeID $ JobTitle1 $ MonthYearofTitle1 ;
informat MonthYearofTitle1 monyy7.;
format MonthYearofTitle1 monyy7.;
datalines;
MarySue Admin Jan2017
JackieC Admin Jul2013
;
run;
Full Join: To get all data
proc sql;
create table want as
select
t1.employeeID as t1_employeeID , t2.employeeID as t2_employeeID,
t2.JobTitle1 as t2_JobTitle,
t2.MonthYearofTitle1 as t2_MonthYearofTitle1,
case when
((t1.JobTitle1=t2.JobTitle1 and t1.MonthYearofTitle1=t2.MonthYearofTitle1) or (t1.JobTitle2=t2.JobTitle1 and t1.MonthYearofTitle2=t2.MonthYearofTitle1)) then "Match"
else "No-Match" end as flag
from table1 as t1 full join table2 as t2
on (t1.JobTitle1=t2.JobTitle1 and t1.MonthYearofTitle1=t2.MonthYearofTitle1) or (t1.JobTitle2=t2.JobTitle1 and t1.MonthYearofTitle2=t2.MonthYearofTitle1)
;
quit;
Results:
t1_employeeID=JohnDoe t2_employeeID=MarySue t2_JobTitle=Admin t2_MonthYearofTitle1=JAN2017 flag=Match
t1_employeeID= t2_employeeID=JackieC t2_JobTitle=Admin t2_MonthYearofTitle1=JUL2013 flag=No-Match
t1_employeeID=TomJones t2_employeeID= t2_JobTitle= t2_MonthYearofTitle1=. flag=No-Match
Update:
Left Join: To get only records from table 1
proc sql;
create table want as
select
t1.employeeID as Employee1 , t2.employeeID as Employee2,
coalescec(t2.JobTitle1,t1.JobTitle2,t1.JobTitle1) as Title,
coalesce(t2.MonthYearofTitle1,t1.MonthYearofTitle2,t1.MonthYearofTitle1) as Date format monyy7.,
case when
((t1.JobTitle1=t2.JobTitle1 and t1.MonthYearofTitle1=t2.MonthYearofTitle1) or (t1.JobTitle2=t2.JobTitle1 and t1.MonthYearofTitle2=t2.MonthYearofTitle1)) then "Match"
else "No-Match" end as Flag
from table1 as t1 left join table2 as t2
on (t1.JobTitle1=t2.JobTitle1 and t1.MonthYearofTitle1=t2.MonthYearofTitle1) or (t1.JobTitle2=t2.JobTitle1 and t1.MonthYearofTitle2=t2.MonthYearofTitle1)
;
quit;

Here's what I would do. First, reshape both datasets so they have just the following columns:
employeeID, JobTitle, MonthYear
Then join them with PROC SQL. The employees in the second dataset are not in the first, so match on title and date only:
proc sql;
select a.employeeID as Employee1, b.employeeID as Employee2, a.JobTitle, a.MonthYear
from firstdataset as a
inner join seconddataset as b
on a.JobTitle = b.JobTitle
and a.MonthYear = b.MonthYear;
quit;
Give that a go and let me know what you get.
Also, depending on your data, you can create the initial tables with:
data b;
keep employeeId title monthyear;
set a;
array x [*] _CHARACTER_;
y= dim(x);
do i = 2 to y;
if (mod(i,2) = 0)then do;
Title = x[i];
monthyear = x[i+1];
output;
end;
end ;
run;
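If the dates are stored as numeric SAS dates (as in table1 and table2 from the first answer), the _CHARACTER_ array will not pick them up. A minimal sketch of the same reshape-then-join idea using two explicit arrays is shown here; it assumes table1 has exactly two title/date pairs, and table1_long and matches are names made up for illustration:

```sas
/* Reshape table1 from one row per employee to one row per employee and title. */
data table1_long;
    set table1;
    /* Parallel arrays over the existing title and date columns (assumed: two pairs). */
    array titles[2] JobTitle1 JobTitle2;
    array dates[2] MonthYearofTitle1 MonthYearofTitle2;
    format MonthYear monyy7.;
    do i = 1 to dim(titles);
        /* Skip empty slots for employees with fewer titles. */
        if not missing(titles[i]) then do;
            JobTitle = titles[i];
            MonthYear = dates[i];
            output;
        end;
    end;
    keep employeeID JobTitle MonthYear;
run;

/* Match on title and date only; the employee IDs differ by design. */
proc sql;
    create table matches as
    select a.employeeID as Employee1,
           b.employeeID as Employee2,
           a.JobTitle as Title,
           a.MonthYear as Date format=monyy7.
    from table1_long as a
         inner join table2 as b
         on a.JobTitle = b.JobTitle1
        and a.MonthYear = b.MonthYearofTitle1;
quit;
```

An inner join returns only the matched pairs; switching to a left join would keep unmatched rows from table1_long as well.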

Related

Finding new versus repeated users in sas

Given the dataset below, I am trying to count new users versus repeated users.
DATE ID Unique_Event
20200901 a12345 1
20200902 a12345 1
20200903 b12345 1
20200903 a12345 1
20200904 c12345 1
In the above dataset, since a12345 appeared on multiple dates, should be counted as a "repeated" user whereas b12345 only appeared once, so he is a "new" user. Please note, this is only sample data as the actual data is quite large. I tried the below code, but I am not getting the correct count. Ideally, tot_num_users-num_new_users should be repeated users, but I am getting incorrect counts. Am I missing something?
Expected Output:
Month new_users repeated_users
9 2 1
Code:
data user_events;
set user_events;
new_date=input(date,yymmdd10.);
run;
proc sql;select month(new_date) as mm,
count(distinct vv.id) as total_num_users,
count(distinct case when v.new_date = vv.minva then v.id end) as num_new_users,
(count(distinct vv.id) - count(distinct case when v.new_date = vv.minva then id end)
) as num_repeated_users
from user_events v inner join
(select t.id, min(new_date) as minva
from user_events t
group by t.id
) vv
on v.id = vv.id
group by 1
order by 1;quit;
In a sub-select, count the number of distinct DATE values for each ID to determine its new / repeated status. The aggregate counts across all IDs are then computed from the sub-select.
proc sql;
create table freq as
select
count(*) as id_count
, sum (status='repeated') as id_repeated_count /* sum counts a logic eval state */
, sum (status='new') as id_new_count
from
( select
id
, case
when count(distinct date) > 1 then 'repeated'
else 'new'
end as status
from
user_events
group by
id
) as statuses
;
quit;
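The expected output in the question is broken down by month. A minimal sketch of a per-month variant follows, assuming the new_date variable created in the question's DATA step holds a valid SAS date. Here a user counts as repeated when the ID appears on more than one date within the same month, which is one possible reading of the question; freq_by_month and per_id are names made up for illustration:

```sas
proc sql;
    create table freq_by_month as
    select mm as Month,
           sum(status = 'new') as new_users,            /* a true comparison counts as 1 */
           sum(status = 'repeated') as repeated_users
    from (select month(new_date) as mm,
                 id,
                 case
                     when count(distinct new_date) > 1 then 'repeated'
                     else 'new'
                 end as status
          from user_events
          group by calculated mm, id) as per_id
    group by mm;
quit;
```

On the sample data, a12345 appears on three dates in month 9 and the other two IDs on one date each, so this should yield 2 new users and 1 repeated user.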
An alternative solution not using proc sql (though I'm aware you tagged this with "proc sql").
data final;
set user_events;
Month=month(new_date);
run;
proc sort data=final;
by Month ID;
run;
data final;
set final;
by Month ID;
if first.Month then do;
new_users=0;
repeated_users=0;
end;
if last.ID then do;
if first.ID then
new_users+1;
else
repeated_users+1;
end;
if last.Month then
output;
keep Month new_users repeated_users;
run;
Since you are using proc sql, this is a sql question, not a SAS question.
Try something like:
proc sql;
select ID,count(Unique_Event)
from <that table>
group by ID
order by ID;
quit;

SAS aggregate rows computation

I'm a beginner user of SAS especially when it comes to aggregate rows computation.
Here is a question which I believe some of you may have encountered before.
The data I have is related to insurance policies, here is an example dataset: columns from left to right are customer number, policy number, policy status, policy start date and policy cancel date (if the policy is not active, otherwise is a missing value).
data have;
informat cust_id 8. pol_num $10. status $10. start_date can_date DDMMYY10.;
input cust_id pol_num status start_date can_date;
format start_date can_date date9.;
datalines;
110 P110001 Cancelled 04/12/2004 10/10/2013
110 P110002 Active 01/03/2005 .
123 P123001 Cancelled 21/07/1998 23/04/2013
123 P123003 Cancelled 22/10/1987 01/11/2011
133 P133001 Active 19/02/2001 .
133 P133001 Active 20/02/2002 .
;
run;
Basically I want to roll this policy-level information up to the customer level: if a customer holds at least one active policy, then his status would be 'Active'; otherwise, if all his policies are Cancelled, his status becomes 'Inactive'. I also need a customer "start date", which picks up the earliest policy start date under that customer. If the customer is 'Inactive', then I need the customer's latest policy cancel date as the customer's exit date.
Below is what I need:
data want;
informat cust_id 8. status $10. start_date exit_date DDMMYY10.;
input cust_id status start_date exit_date;
format start_date exit_date date9.;
datalines;
110 Active 01/03/2005 .
123 Inactive 22/10/1987 23/04/2013
133 Active 19/02/2001 .
;
run;
Solution in any form would be much appreciated! Either DATA step or PROC SQL is fine.
Thank you so much.
You can do something like this:
proc sql;
create table want as
select cust_id,
case when count(case when status='Active' then 1 end) > 0
then 'Active'
else 'Inactive'
end as status,
min(start_date) as start_date format=date9.,
case when count(case when status='Active' then 1 end) = 0
then max(can_date)
end as exit_date format=date9.
from have
group by cust_id;
quit;
You could attack the question in a DATA step. Here's one simple way, assuming your data are sorted by cust_id and start_date. Note that it keeps only customers with an active policy, one row each, so inactive customers and exit dates are not handled:
data want;
set have (keep=cust_id status start_date can_date);
where upcase(status) contains 'ACTIVE';
by cust_id start_date;
if first.cust_id then output;
else delete;
run;
/*BEGINNER NOTES*/
*1] WHERE tells SAS to read only records that meet a certain
condition - the DS 'want' will never have any observations with
'CANCELLED' in the status variable;
*2] I use UPCASE() to standardize the contents of status, as CONTAINS
is a case-sensitive operator;
*3] FIRST.variable = 1 when the current observation is the first one
in its BY group;
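For the full roll-up described in the question (status, earliest start date, and exit date for inactive customers), a DATA step needs to carry values across the rows of each customer. A minimal sketch under that reading follows; have_sorted, want_ds, any_active, first_start and last_cancel are names made up for illustration:

```sas
proc sort data=have out=have_sorted;
    by cust_id;
run;

data want_ds;
    set have_sorted;
    by cust_id;
    /* Running values carried across the rows of one customer. */
    retain any_active first_start last_cancel;
    if first.cust_id then do;
        any_active = 0;
        first_start = .;
        last_cancel = .;
    end;
    if upcase(status) = 'ACTIVE' then any_active = 1;
    /* MIN and MAX ignore missing arguments. */
    first_start = min(first_start, start_date);
    last_cancel = max(last_cancel, can_date);
    /* Write one summary row per customer. */
    if last.cust_id then do;
        if any_active then status = 'Active';
        else status = 'Inactive';
        start_date = first_start;
        if any_active then exit_date = .;
        else exit_date = last_cancel;
        output;
    end;
    format start_date exit_date date9.;
    keep cust_id status start_date exit_date;
run;
```

This takes the earliest start date across all of a customer's policies, as the question's wording describes.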

Table is not splitting in the right way

I have the following dataset and code:
options nocenter;
DATA survey;
INPUT product_id department;
DATALINES;
1212 Sales
1213 Sales
1214 Marketing
;
PROC PRINT; RUN;
data sales marketing;
set survey;
if department = 'Sales' then output sales;
else if department = 'Marketing' then output marketing;
run;
title 'Sales employees';
proc print data= sales;
run;
title;
title 'Marketing employees';
proc print data= marketing;
run;
title;
This, however, gives me two tables with all the values, while I only want one table with the Marketing values and one with the Sales values. Also, the title appears above the second table but not above the first. Any thoughts on what goes wrong?
You're missing a '$' sign after your variable 'department', so it is read as numeric and you get '.' (missing) for every value. In addition, with a plain '$' the variable defaults to a length of 8, which truncates 'Marketing' to 'Marketin', so the comparison with 'Marketing' never succeeds. Your input statement should therefore be INPUT product_id department $10.; . The title statements work fine for me.
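A minimal sketch of the corrected input step is shown here. It uses the colon modifier (:$10.) as a variant of the $10. informat suggested above, so that the value is still read up to the next blank while the variable gets a length of 10:

```sas
data survey;
    /* :$10. reads a character value up to the next blank, stored with length 10. */
    input product_id department :$10.;
    datalines;
1212 Sales
1213 Sales
1214 Marketing
;
run;
```

With department stored as a 10-character variable, the comparisons against 'Sales' and 'Marketing' in the splitting step can succeed.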

create new data set by grouping SAS

NAME DATE
---- ----------
BOB 24/05/2013
BOB 12/06/2012
BOB 19/10/2011
BOB 05/02/2010
BOB 05/01/2009
CARL 15/05/2011
LOUI 15/01/2014
LOUI 15/05/2013
LOUI 15/05/2012
DATA newdata;
SET mydata;
count + 1;
IF FIRST.name THEN count=1;
BY name DESCENDING date;
run;
Here I get a count within each group (1, 2, 3, and so on). I want to output all observations for a name (e.g. all of BOB's) if its count is > 3. Please help.
The simplest way to do that is to output the last row for each ID if it is > 3, then merge that dataset back to your master dataset, keeping only matches. You could also use PROC FREQ to generate the dataset of counts and merge to that.
You can do it in a single datastep using a DoW loop, but that's more complicated, so I wouldn't recommend a new user do that.
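The PROC FREQ route mentioned above could be sketched roughly as follows. It assumes the input dataset is called have with a name variable; name_counts and have_sorted are names made up for illustration:

```sas
/* Count the rows per name; the OUT= dataset holds one row per name with COUNT. */
proc freq data=have noprint;
    tables name / out=name_counts (keep=name count);
run;

proc sort data=have out=have_sorted;
    by name;
run;

/* Keep only rows whose name occurs more than 3 times. */
data want;
    merge have_sorted (in=in_have)
          name_counts (in=in_counts where=(count > 3));
    by name;
    if in_have and in_counts;
    drop count;
run;
```

The WHERE= option filters the counts before the merge, so names with 3 or fewer rows never set in_counts.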
I think this shows the power of SQL, though some would say that since this generates a NOTE in the log it isn't good practice. Use the GROUP BY and HAVING clauses in SQL to count the rows per name and keep only the names with more than 3.
proc sql;
create table want as
select *
from have
group by name
having count(name)>3;
quit;
Here are a couple different ways to do this using SUBQUERIES in PROC SQL
Data HAVE;
Length NAME $50;
Input Name $ Date: ddmmyy10.;
Format date ddmmyy10.;
datalines;
BOB 24/05/2013
BOB 12/06/2012
BOB 19/10/2011
BOB 05/02/2010
BOB 05/01/2009
CARL 15/05/2011
LOUI 15/01/2014
LOUI 15/05/2013
LOUI 15/05/2012
;
Run;
Using a multiple-value subquery in the WHERE clause
Proc sql;
Create table WANT1 as
Select *
From Have
Where Name in (Select name from have b group by b.name having count(b.name)>3);
Quit;
Using a subquery in the From clause
Proc sql;
Create table WANT2 as
Select a.name, a.date
From Have a Inner Join (select name, count(name) as Count from have group by name having count(name)>3) b
On a.name=b.name
;
Quit;

create unique id variable based on existing id variable

I'm trying to make a simpler unique identifier from an already existing identifier. Starting with just an ID column, I want to make a new, simpler ID column so the final data looks like what follows. There are 1 million+ IDs, so a series of if-thens isn't an option. Maybe a do statement?
ID NEWid
1234 1
3456 2
1234 1
6789 3
1234 1
A trivial data step solution not using monotonic().
proc sort data=have;
by id;
run;
data want;
set have;
by id;
if first.id then newid+1;
run;
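If the original row order has to be preserved, a hash object is one alternative to sorting. This is a different technique from the answer above, sketched under the assumption that the input dataset is called have; it numbers IDs in order of first appearance, and seen and next_id are names made up for illustration:

```sas
data want;
    set have;
    length newid 8;
    /* Build the lookup table once, on the first iteration. */
    if _n_ = 1 then do;
        declare hash seen();
        seen.defineKey('id');
        seen.defineData('newid');
        seen.defineDone();
    end;
    retain next_id 0;
    /* find() returns 0 and loads newid when the id was seen before. */
    if seen.find() ne 0 then do;
        next_id + 1;
        newid = next_id;
        seen.add();
    end;
    drop next_id;
run;
```

The hash table holds one entry per distinct ID in memory, which is usually manageable for a few million IDs.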
Using proc sql:
(you can probably do this without the intermediate datasets using subqueries, but sometimes monotonic doesn't act the way you'd think in a subquery)
proc sql noprint;
create table uniq_id as
select distinct id
from original
order by id
;
create table uniq_id2 as
select id, monotonic() as newid
from uniq_id
;
create table final as
select a.id, b.newid
from original a, uniq_id2 b
where a.id = b.id
;
quit;