How to remove duplicates based on multiple columns in SAS? - sas

I have a table in SAS like the one below.
Data types:
COL1 - datetime
COL2 - numeric
The data types of the remaining columns do not matter.
COL1 COL2 ... COLn
22AUG2022:15:46:38.000000 111 ... ...
22AUG2022:15:46:38.000000 111 ... ...
22AUG2022:15:46:38.000000 111 ... ...
22AUG2022:15:46:38.000000 222 ... ...
... ... ... ...
I have many columns in my table and I need to delete duplicates in COL2 based on values in COL1, so as a result I need something like below:
COL1 COL2 ... COLn
22AUG2022:15:46:38.000000 111 ... ...
22AUG2022:15:46:38.000000 222 ... ...
... ... ... ...
How can I do that using SAS Procedures or PROC SQL in SAS Enterprise Guide?

The question is: which observation do you want to keep for a given BY group? Or do the remaining columns not matter because they hold the same values across rows?
If so, use the noduprecs option in proc sort. It deletes observations that are duplicated in full, whereas nodupkey deletes observations that have duplicate BY values.
Taking the below dataset as an example
col1 col2 col3
22AUG22:15:46:38 111 ABC
22AUG22:15:46:38 111 DEF
22AUG22:15:46:38 111 GHI
22AUG22:15:46:38 222 JKL
Using proc sort
* delete observations that have duplicated BY values;
proc sort data=have out=want nodupkey equals;
by col1 col2;
run;
22AUG22:15:46:38 111 ABC --> Keep the first record within the by group
22AUG22:15:46:38 222 JKL
Note that with the equals option, observations with identical BY variable values retain the same relative positions in the output data set as in the input data set. See more in "Does NODUPKEY Select the First Record in a By Group?".
* delete observations that have duplicated values;
proc sort data=have out=want noduprecs equals;
by col1 col2;
run;
22AUG22:15:46:38 111 ABC
22AUG22:15:46:38 111 DEF
22AUG22:15:46:38 111 GHI
22AUG22:15:46:38 222 JKL
Using the noduprecs option will not remove any observation from our example dataset, because col3 has different values in every row. You are not showing the rest of the columns in your example, but should they share the same values across rows, you might want to use it.
You could alternatively do it using proc summary
proc summary data=have nway;
class col1 col2;
id col3;
output out=want(drop=_type_ _freq_);
run;
22AUG22:15:46:38 111 GHI --> the id statement keeps the maximum value of col3 within the group (the idmin option keeps the minimum instead)
22AUG22:15:46:38 222 JKL
Note that proc summary with a class statement can be a more efficient alternative to proc sort, because the input does not need to be sorted in advance.
Or even using proc sql
proc sql;
create table want as
select col1, col2, col3
from have
group by col1, col2
having col3 = max(col3);
quit;
22AUG22:15:46:38 111 GHI --> max value within the group by columns
22AUG22:15:46:38 222 JKL
proc sql;
create table want as
select col1, col2, col3
from have
group by col1, col2
having col3 = min(col3);
quit;
22AUG22:15:46:38 111 ABC --> min value within the group by columns
22AUG22:15:46:38 222 JKL

Related

How to create binary column base on values in other column and drop duplicates in SAS Enterprise Guide 4GL / PROC SQL?

I have a table in SAS Enterprise Guide like the one below:
ID - numeric
COL1 - numeric
ID COL1
123 1
123 0
123 0
444 1
444 0
778 0
And I need to aggregate above table and create column 'TARGET':
TARGET = 1 for these IDs which at least once have '1' in the column 'COL1'
TARGET = 0 for these IDs which have never had a '1' in the column 'COL1'
Moreover, I need to delete duplicates in ID, so as a result I need something like below:
ID TARGET
123 1 -> had '1' at least once in COL1
444 1 -> had '1' at least once in COL1
778 0 -> never had '1' in COL1
How can I do that in SAS Enterprise Guide in normal SAS code or in PROC SQL ?
Your logic is equivalent to taking the maximum value of the column for each ID.
proc sql;
create table want as
select ID, max(col1) as Target
from have
group by ID;
quit;
You can use a DOW loop over the group, computing a repeated OR result.
Example:
data want;
  /* have must be sorted by id for BY-group processing */
  do until (last.id);
    set have;
    by id;
    if not target then
      target = target or col1;
  end;
  keep id target;
run;

Need same output in sas

So I have the following data:
Data Cricket;
input match $;
cards;
IndVsPak
NezVsAus
PakVsInd
WesVsPak
WesVsAus
IndVsPak
AusVsNez
; run;
Need Output:
Match Count
IndVsPak 3
NezVsAus 2
WesVsPak 1
WesVsAus 1
Please help with the code. In how many ways can we get the above output?
Try this:
Data Cricket;
input match $;
cards;
IndVsPak
NezVsAus
PakVsInd
WesVsPak
WesVsAus
IndVsPak
AusVsNez
;
run;
/*standardise team order within each match - easier to do in data step*/
data temp /view = temp;
set cricket;
team1 = substr(match,1,3);
team2 = substr(match,6,3);
call sortc(of team:);
match_sorted = cats(team1,'Vs',team2);
run;
proc sql noprint;
create table want as
select match_sorted, count(match_sorted) as freq
from temp
group by match_sorted
order by freq descending
;
quit;
Output:
match_sorted freq
IndVsPak 3
AusVsNez 2
AusVsWes 1
PakVsWes 1
Here's my attempt at doing this entirely in proc sql:
proc sql noprint;
create table want as
select
ifc(
team1 < team2,
cats(team1, 'Vs', team2),
cats(team2, 'Vs', team1)
) as match_sorted length=8,
count(calculated match_sorted) as freq
from (
select
substr(match,1,3) as team1,
substr(match,6,3) as team2
from cricket
)
group by match_sorted
order by freq descending
;
quit;
N.B. this uses a calculated field - a bit of SAS-specific sql functionality. You could eliminate this by setting the whole thing up as a sub-query that produces match_sorted, or you could flatten the query and use calculated fields for everything.
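As a rough sketch of the sub-query variant mentioned above (untested here, and assuming the same cricket dataset with three-letter team codes at positions 1 and 6), the inner query builds match_sorted and the outer query only counts, so no calculated reference is needed:

```sas
proc sql noprint;
  create table want as
  select match_sorted, count(*) as freq
  from (
    /* Inner query: put the two team codes in alphabetical order */
    select
      ifc(
        substr(match,1,3) < substr(match,6,3),
        cats(substr(match,1,3), 'Vs', substr(match,6,3)),
        cats(substr(match,6,3), 'Vs', substr(match,1,3))
      ) as match_sorted length=8
    from cricket
  )
  group by match_sorted
  order by freq descending
  ;
quit;
```

The trade-off is that the substr calls are repeated inside ifc, which is the price of avoiding calculated fields.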
Good day. In SAS, (almost) everything is done via PROCs, which work somewhat like macros performing actions.
In this case I suggest using proc freq:
Data Cricket;
input match $10.;
cards;
IndVsPak
NezVsAus
PakVsInd
WesVsPak
WesVsAus
IndVsPak
AusVsNez
; run;
proc freq data=Cricket noprint;
table match / out= freqs ;
run;
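You can see the output by removing the noprint option. Note that this counts exact strings, so IndVsPak and PakVsInd are tallied separately; standardise the team order first if they should count as the same match.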
This will also work if you are more comfortable using SQL:
PROC SQL;
SELECT match, count(*) AS cnt FROM cricket GROUP BY match;
QUIT;

sas relative frequencies by group

I have a categorical variable, say SALARY_GROUP, and a group variable, say COUNTRY. I would like to get the relative frequency of SALARY_GROUP within COUNTRY in SAS. Is it possible to get it by proc SUMMARY or proc means?
Perhaps explore proc tabulate and a counter variable?
Yes, you can get the counts needed for the relative frequency of a categorical variable with either Proc Means or Proc Summary. For both procs you have to:
-Specify NWAY in the proc statement,
-Specify in the Class statement your categorical fields,
-Specify in the Var statement your response or numeric field.
Example below is for proc means:
Dummy Data:
/*Dummy Data*/
data work.have;
input Country $ Salary_Group $ Value;
datalines;
USA Group1 100
USA Group1 100
GBR Group1 100
GBR Group1 100
USA Group2 20
USA Group2 20
GBR Group2 20
GBR Group1 100
;
run;
Code:
/*Calculating frequency and saving output to table sg_means*/
proc means data=have n nway ;
class Country Salary_Group;
var Value;
output out=sg_means n=frequency;
run;
Output Table:
Country=GBR Salary_Group=Group1 _TYPE_=3 _FREQ_=3 frequency=3
Country=GBR Salary_Group=Group2 _TYPE_=3 _FREQ_=1 frequency=1
Country=USA Salary_Group=Group1 _TYPE_=3 _FREQ_=2 frequency=2
Country=USA Salary_Group=Group2 _TYPE_=3 _FREQ_=2 frequency=2
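The table above holds counts only. One possible way to turn them into relative frequencies within each country is to divide each count by the country total, for instance with proc sql. This is a sketch, not the only approach; the output table name sg_rel and the column name rel_freq are made up for illustration.

```sas
proc sql;
  create table sg_rel as
  select Country,
         Salary_Group,
         frequency,
         /* sum(frequency) is taken per Country because of the GROUP BY; */
         /* SAS remerges the group total back onto every detail row      */
         frequency / sum(frequency) as rel_freq format=percent8.1
  from sg_means
  group by Country;
quit;
```

With the dummy data above this should give 0.75 and 0.25 for GBR (Group1 and Group2) and 0.5 each for USA. The log will show a note about remerging summary statistics, which is expected here.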

how to split four columns into two tables in sas report

I have one table with 4 columns and I want to split it into 2 tables, with 2 columns in one table and 2 columns in the other. The two tables should appear one below the other. I want this in proc report format, so the code should use proc report.
id name age gender
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
I want id and name in one table and age and gender in the other table, one below the other, in ods excel.
I've split the main table into two tables using a data step and then appended them to each other. I added an extra column called "source" to differentiate between the tables. If you use proc report, you can group by "source".
Code:
/*Create input data*/
data have;
input id name $ age gender $ ;
datalines;
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
;;;;
run;
/*Split / create first table*/
data table1;
set have;
source="table1: id & name";
keep source id name ;
run;
/*Split / create second table*/
data table2;
set have;
source="table2: age & gender";
keep source age gender;
run;
/*create Empty table*/
data want;
length Source $30. column1 8. column2 $10.;
run;
proc sql; delete from want; quit;
/* Append both tables to each other*/
proc append base= want data=table1(rename=(id=column1 name=column2)) force ; run;
proc append base= want data=table2(rename=(age=column1 gender=column2)) force ; run;
/*Create Report*/
proc report data= want;
col source column1 column2 ;
define source / group;
run;
For data
data have;
input id name $ age gender $;
datalines;
1 abc 21 m
2 pqr 23 f
3 qwe 25 f
4 ert 54 m
;
run;
Being output as Excel, the splitting into two parts can be done via two Proc REPORT steps; each step responsible for a single set of columns. Options are used in the ODS EXCEL to control how sheet processing is handled.
The first step manages the common header through DEFINE; the subsequent steps use NOHEADER and don't need DEFINE statements for the header. Each step must define and compute the value of the new source column. There will be a gap of one Excel row between the tables.
ods _all_ close;
ods excel file='want.xlsx' options(sheet_interval='NONE');
proc report data=have;
column source id name;
define id / 'Column 1';
define name / 'Column 2';
define source / format=$20.;
compute source / character length=20; source='ID and NAME'; endcomp;
run;
proc report data=have noheader;
column source age gender;
define source / format=$20.;
compute source / character length=20; source='AGE and GENDER'; endcomp;
run;
ods excel close;
There is no reasonable single Proc REPORT step that would produce similar output from dataset have.

Combining SAS Data Sets with different no. of columns

I am having a problem combining two tables with different numbers of columns.
Say my first table is table1:
table1
t1_col_1 t1_col_2 t1_col_3 ... t1_col_13
and my second table is table2:
table2
t2_col_1 t2_col2 t2_col3 t2_col4
Now if I run:
data table3;
set table1 table2;
run;
What will the output table3 be?
The SAS documentation says this statement performs a concatenation:
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001107839.htm
Since the numbers of columns differ, I expect concatenation to cause a problem.
So how exactly does this statement work, and what will its output be in this case?
Appending (concatenating) two or more data sets basically stacks them, with values in variables of the same name stacked together. Variables that exist in only one data set become their own variables in the new combined data set. Here the data sets have different numbers of variables. This article explains how concatenation works between data sets with different variables: http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001312944.htm
For example, suppose we have:
data work.table1;
input col1 $ col2 col3 col4;
datalines;
George 10 10 10
Lucy 10 10 10
;
run;
data work.table2;
input col1 $ col2;
datalines;
Shane 3
Peter 3
;
run;
data work.table3;
set table1 table2;
run;
OUTPUT:
col1 col2 col3 col4
George 10 10 10
Lucy 10 10 10
Shane 3 . . <== These entries are
Peter 3 . . missing.
col1 and col2 are present in both sets, so their values are stacked in the order the data sets are listed on the set statement. col3 and col4 are only present in table1, so for the rows that come from table2 they are missing (shown as .).