How to merge 2 datasets with different lengths? - sas

I would like to merge 2 datasets with 2 different dimensions.
TABLE1: people
gender name
M raa
F chico
M july
F sergio
TABLE2: serial_numbers
gender serial
M 4
F 5
I want the result to be
result
gender name serial
M raa 4
F chico 5
M july 4
F sergio 5

First, I create the datasets to illustrate how to merge them:
data people;
infile cards;
length gender $1
name $10;
input gender name;
cards;
M raa
F chico
M july
F sergio
;
run;
data serial_numbers;
length gender $1
serial 8;
infile cards;
input gender serial;
cards;
M 4
F 5
;
run;
Solution 1: use proc sql to perform the join.
proc sql;
create table result as
select a.gender, a.name, b.serial
from people a LEFT JOIN serial_numbers b
on a.gender=b.gender;
quit;
proc print data=result;
run;
Solution 2: use a data step to merge both datasets. This requires the datasets to be sorted:
proc sort data=people;
by gender;
run;
proc sort data=serial_numbers;
by gender;
run;
data result;
merge people serial_numbers;
by gender;
run;
proc print data=result;
run;
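Note that the data step merge returns the rows in sorted gender order, not in the original order of people. If the original order matters, one option is to carry a row counter through the merge and sort on it afterwards. The following is only a sketch: the names people_seq, serial_sorted, result_ordered, result_final and seq are made up for this illustration.

```sas
/* Remember the original row position of each person */
data people_seq;
  set people;
  seq = _n_;
run;

proc sort data=people_seq;
  by gender;
run;

proc sort data=serial_numbers out=serial_sorted;
  by gender;
run;

/* One-to-many merge: each person picks up the serial of their gender */
data result_ordered;
  merge people_seq(in=in_people) serial_sorted;
  by gender;
  if in_people;
run;

/* Restore the original order, then drop the helper column */
proc sort data=result_ordered out=result_final(drop=seq);
  by seq;
run;
```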

Related

SAS transpose columns to row and values to columns

I have a summary table which I want to transpose, but I can't get my head around it. The columns should become the rows, and the values should become the columns.
Some explanation about the table. Each column represents a year. People can be in 3 groups: A, B or C. In 2016, everyone (100) is in group A. In 2017, 35 are in group A (5 + 20 + 10), 15 in B and 50 in C.
DATA have;
INPUT year2016 $ year2017 $ year2018 $ count;
DATALINES;
A A A 5
A A B 20
A A C 10
A B C 15
A C A 50
;
RUN;
I want to be able to make a nice graph of the evolution of the groups through the different periods. So I want to end up with a table where the original columns become the rows (= period) and the values become the columns (= the 3 different groups). Please find an example of the table I want:
Image of table want
I have tried different approaches, but I can't get what I want.
There may be a more direct way, but this is probably how I would do it.
DATA have;
INPUT year2016 $ year2017 $ year2018 $ count;
id + 1;
DATALINES;
A A A 5
A A B 20
A A C 10
A B C 15
A C A 50
;
RUN;
proc print;
proc transpose data=have out=want1 name=period;
by id count notsorted;
var year:;
run;
proc print;
run;
proc summary data=want1 nway completetypes;
class period col1;
freq count;
output out=want2(drop=_type_);
run;
proc print;
run;
proc transpose data=want2 out=want(drop=_name_) prefix=Group_;
by period;
var _freq_;
id col1;
run;
proc print;
run;
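As for a more direct route, one possibility is to build the long table in a data step with an array instead of the first PROC TRANSPOSE. This is only a sketch under assumptions: the dataset names long, sums and want_alt and the variables period, group and total are invented here. Unlike the COMPLETETYPES version above, a group that never occurs in a period comes out as a missing value rather than 0.

```sas
/* One output row per (input row, year column) */
data long;
  set have;
  length period $32 group $8;
  array yr{*} year2016-year2018;
  do i = 1 to dim(yr);
    period = vname(yr{i});   /* name of the year column, e.g. year2016 */
    group  = yr{i};          /* group letter stored in that column */
    output;
  end;
  keep period group count;
run;

/* Total count per period and group; output is ordered by the class variables */
proc summary data=long nway;
  class period group;
  var count;
  output out=sums(drop=_type_ _freq_) sum=total;
run;

/* One row per period, one column per group */
proc transpose data=sums out=want_alt(drop=_name_) prefix=Group_;
  by period;
  var total;
  id group;
run;
```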

Create new row to data set based existing ones SAS

I have a dataset looking something like this:
var1 var2 count
cat1 no 1
cat1 yes 4
cat1 unkown 3
cat2 no 7
cat2 yes 3
cat2 unkown 5
cat3 no 2
cat3 yes 9
cat3 unkown 0
What I want to do is combine var1 and var2 into a new variable, where the first row of each group comes from var1 and the following rows come from var2. So it is supposed to look like:
comb count
cat1
no 1
yes 4
unkown 3
cat2
no 7
yes 3
unkown 5
cat3
no 2
yes 9
unkown 0
Any help would be highly appreciated!
It's quite simple.
Here is the solution:
1) Create the source dataset:
data testa;
infile datalines dsd dlm=',';
input var1 : $200. var2 : $200. count : 8. ;
datalines;
cat1,no,1,
cat1,yes,4,
cat1,unkown,3,
cat2,no,7,
cat2,yes,3,
cat2,unkown,5,
cat3,no,2,
cat3,yes,9,
cat3,unkown,0,
;
run;
2) Select the list of var1 values: cat1|cat2|cat3
proc sql;
select distinct var1 into :list_var separated by '|' from testa;
quit;
3) Process the var list one by one
%macro processListVar(list_var);
/* Start from an empty dataset; without STOP this step would write one blank row */
data want;
stop;
run;
%let k=1;
%do %while (%qscan(&list_var, &k,|) ne );
%let var = %scan(&list_var, &k,|);
data testb(drop=var1 rename=(var2=comb));
set testa;
N=_N_+1+&k;
where var1="&var";
run;
data testc;
N=1+&k;
comb="&var";
count=.;
run;
data tmp;
set testb testc;
run;
proc sort data=tmp out=teste;
by N;
run;
data want;
set want teste;
run;
%put var=&var;
%let k = %eval(&k + 1);
%end;
%mend processListVar;
%processListVar(&list_var);
4) At the end you get the result in the dataset want.
Finally, drop the N column like this:
data want_cleaned (drop=N);
set want;
run;
5) More explanation of the code.
a. The key problem is to keep the order of cat1, cat2, cat3.
b. So I split the problem into one sub-dataset per category (cat1, cat2, ...) and used a %do %while loop to go through the categories.
c. The column N counts the lines (like an index), so sorting on it keeps the order inside each sub-dataset.
d. For example, for the first value cat1: we select the rows where var1 is cat1, rename var2 to comb and drop var1. This creates the testb dataset, with N starting at 2+&k.
The testc dataset holds the header line of the sub-dataset (comb=cat1, N=1+&k), so it sorts before the testb rows. We concatenate testb and testc with a SET statement into tmp, which then contains everything needed for cat1, and sort it by N. Finally we append each sorted sub-dataset to the dataset want.
To summarize: we loop over the categories, sort each sub-dataset on the column N, and append the sub-datasets one after another, so the lines appear in the order you wanted.
Regards,
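For comparison, the same layout can probably be reached without a macro loop, using a single data step with BY-group processing. This is a different technique from the answer above, sketched under the assumption that the rows of testa are already grouped by var1; the names want_direct and cnt are made up for the illustration.

```sas
data want_direct(keep=comb count);
  length comb $200;
  set testa(rename=(count=cnt));
  by var1 notsorted;        /* rows only need to be grouped, not sorted */
  if first.var1 then do;
    comb  = var1;           /* header row for the category */
    count = .;
    output;
  end;
  comb  = var2;             /* detail row */
  count = cnt;
  output;
run;
```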

Setting names to idgroup

Follow up to
SAS - transpose multiple variables in rows to columns
I have the following code:
data have;
input CX_ID 1. TYPE $1. COUNT_RATE 1. SUM_RATE 2.;
datalines;
1A110
1B220
2A120
;
run;
proc summary data = have nway;
class cx_id;
output out=want (drop = _:)
idgroup(out[2] (count_rate sum_rate)= count sum);
run;
So this table:
CX_ID TYPE COUNT_RATE SUM_RATE
1 A 1 10
1 B 2 20
2 A 1 20
becomes
CX_ID COUNT_1 COUNT_2 SUM_1 SUM_2
1 1 2 10 20
2 1 . 20 .
Which is perfect, but how do I set the names to be
Count_A Count_B Sum_A Sum_B
Or in general whatever the value in the type field of the have table ?
Thank you
A double PROC TRANSPOSE is dynamic and you can add a data step to customize the names easily.
*sample data;
data have;
input CX_ID 1. TYPE $1. COUNT 1. SUM 2.;
datalines;
1A110
1B220
2A120
;
run;
*transpose to long;
proc transpose data=have out=long;
by cx_id type;
run;
*transpose to wide;
proc transpose data=long out=wide;
by cx_id;
var col1;
id _name_ type;
run;
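If the goal is names of the form Count_A and Sum_B, the second transpose can also insert the underscore itself. A minimal sketch, assuming a SAS release where the ID statement accepts several variables and PROC TRANSPOSE supports the DELIMITER= option; the output name wide_named is made up here.

```sas
/* Column names are built as <_name_ value>_<type value> */
proc transpose data=long out=wide_named(drop=_name_) delimiter=_;
  by cx_id;
  var col1;
  id _name_ type;
run;
```

The letter case of the new names follows the values stored in _name_, so a rename in a data step may still be needed to get an exact spelling.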

Extracting info by matching two datasets in SAS

I have two datasets. Both have a common column- ID. I would like to check if ID from df1 lies in df2 and extract all such rows from df1. I'm doing this in SAS.
It is easily done in one sql query.
proc sql;
create table extract_from_df1 as
select
*
from
df1
where
id in (select id from df2)
;
quit;
There are lots of ways to do this. For example:
proc sql;
create table compare as select distinct
a.id as id1, b.id as id2
from table1 as a
left join table2 as b
on a.id = b.id;
quit;
and then keep the matches. Or you can try:
proc sql;
delete from table2 where id in (select distinct id from table1);
quit;
data df1;
input id name $;
cards;
1 abc
2 cde
3 fgh
4 ijk
;
run;
data df2;
input id address $;
cards;
1 abc
2 cde
5 ggh
6 ihh
7 jjj
;
run;
/* Both datasets must be sorted by id for the BY-merge */
data c;
merge df1(in=x) df2(in=y);
by id;
if x and y;
keep id name;
run;
proc print data=c;
run;

Report using data _Null_

I'm looking for a report built with a SAS data step:
I have a data set:
Name Company Date
X A 199802
X A 199705
X D 199901
y B 200405
y F 200309
Z C 200503
Z C 200408
Z C 200404
Z C 200309
Z C 200210
Z M 200109
W G 200010
Report I'm looking for:
Name Company From To
X A 1997/05 1998/02
D 1998/02 1999/01
Y B 2003/09 2004/05
F 2003/09 2003/09
Z C 2002/10 2005/03
M 2001/09 2001/09
W G 2000/10 2000/10
THANK you,
Tried using proc print but it is not accurate. So looking for a data null solution.
data _null_;
set salesdata;
by name company date;
array x(*) from;
From=lag(date);
if first.name then count=1;
do i=count to dim(x);
x(i)=.;
end;
count+1;
If first.company then do;
from_date1=date;
end;
if last.company then To_date=date;
if from_date1 ="" and to_date="" then delete;
run;
data _null_;
set yourEvents;
by Name Company notsorted;
file print;
* @n moves to column n (#n would move to line n);
If _N_ EQ 1 then put
@01 'Name'
@06 'Company'
@14 'From'
@22 'To'
;
if first.Name then put
@01 Name
@; ** The trailing @ instructs sas to not start a new line for the next put instruction **;
retain From To;
if first.company then do;
From = 1E9;
To = 0;
end;
if Date LT From then From = Date;
if Date GT To then To = Date;
* Assumes Date holds a SAS date value, which yymms7. prints as 1997/05;
if last.Company then put
@06 Company
@14 From yymms7.
@22 To yymms7.
;
run;
I used a data step to calculate From_Date and To_date,
and then proc report to print the report by group.
proc sort data=have ;
by Name Company Date;
run;
data want(drop=prev_date date);
set have;
by Name Company date;
attrib From_Date To_date format=yymms10.;
retain prev_date;
if first.Company then prev_date=date;
if last.Company then do;
To_date=Date;
From_Date=prev_date;
end;
if not(last.company) then delete;
run;
proc sort data=want;
by descending name ;
run;
proc report data=want;
define Name/order order=data;
run;
IMHO, the simplest way is to use proc report and its analysis column type, as in the code below. Note that the name and company columns are automatically sorted in alphabetical order (as most summary functions and procedures do).
/* your data */
data have;
infile datalines;
input Name $ Company $ Date $;
cards;
X A 199802
X A 199705
X D 199901
y B 200405
y F 200309
Z C 200503
Z C 200408
Z C 200404
Z C 200309
Z C 200210
Z M 200109
W G 200010
;
run;
/* convert YYYYMM to date */
data have2(keep=name company date);
set have(rename=(date=date_txt));
name = upcase(name);
y = input(substr(date_txt, 1, 4), 4.);
m = input(substr(date_txt, 5, 2), 2.);
date = mdy(m,1,y);
format date yymms7.;
run;
/****** 1. proc report ******/
proc report data=have2;
columns name company date=date_from date=date_to;
define name / 'Name' group;
define company / 'Company' group;
define date_from / 'From' analysis min;
define date_to / 'To' analysis max;
run;
The html output:
(tested on SAS 9.4 win7 x64)
============================ OFFTOPIC ==============================
One may also consider using proc means or proc tabulate. The basic code forms are shown below. However, you can also see that further adjustments in output formats are required.
/***** 2. proc tabulate *****/
proc tabulate data=have2;
class name company;
var date;
table name*company, date=' '*(min='From' max='To')*format=yymms7.;
run;
proc tabulate output:
/***** 3. proc means (not quite there) *****/
* proc means + ODS -> cannot recognize date formats;
proc means data=have2 nonobs min max;
class name company;
format date yymms7.; * in vain;
var date;
run;
proc means output (the date format is not applied to the printed statistics):
You may leave comments on improving these alternative ways.
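For the proc means variant, one possible workaround is to write the statistics to a dataset with OUTPUT OUT= and print that dataset with the date format attached. This is a sketch only: the dataset name means_out and the variable names date_from and date_to are made up for this illustration.

```sas
/* Write min and max per name/company to a dataset instead of the listing */
proc means data=have2 noprint nway;
  class name company;
  var date;
  output out=means_out(drop=_type_ _freq_) min=date_from max=date_to;
run;

/* The format is applied when the dataset is printed */
proc print data=means_out noobs label;
  format date_from date_to yymms7.;
  label date_from='From' date_to='To';
run;
```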