Trying to make a more simple unique identifier from already existing identifier. Starting with just and ID column I want to make a new, more simple, id column so the final data looks like what follows. There are 1million + id's, so it isnt an option to do if thens, maybe a do statement?
ID NEWid
1234 1
3456 2
1234 1
6789 3
1234 1
A trivial data step solution not using monotonic().
proc sort data=have;
by id;
run;
data want;
set have;
by id;
if first.id then newid+1;
run;
using proc sql..
(you can probably do this without the intermediate datasets using subqueries, but sometimes monotonic doesn't act the way you'd think in a subquery)
proc sql noprint;
create table uniq_id as
select distinct id
from original
order by id
;
create table uniq_id2 as
select id, monotonic() as newid
from uniq_id
;
create table final as
select a.id, b.newid
from original_set a, uniq_id2 b
where a.id = b.id
;
quit;
Related
I have two codes one proc sql and another proc and datastep. Both are interlinked datasets.
Below is the proc sql lines.
create table new as select a.id,a.alid,b.pdate from tb a inner join
tb1 act on a.aid =act.aid left join tb2 as b on (r.alid=a.alid) where
a.did in (15,45); quit;
Below is the proc and datasteps created from above datatset new.
proc sort data = new uodupkey;
by alid;
data new1;
set new;
format ddate date9.
dat1=datepart(today);
datno=input(number,20.);
key=_n_;
rename alid blid;
run;
proc sort data=new1 nodupkey;
by datno dat1;
run;
I need to put everything into single proc sql step.
You mention two data steps but I only see one.
Anyway, your data step and proc sort can indeed be written in one sql query (which you can then insert in your proc sql):
proc sql;
create table new1 as
select id
,alid as blid
,pdate
,datepart(today) as dat1
,input(number,20.) as datno
,monotonic() as key
from new1
group by datno, dat1
having key=min(key)
;
quit;
One remark though. Your data step expects variables called ddate,today and number in your input dataset new. If that dataset is supposed to be the result of your first sql query, then those variables don't exist and their values along with those of dat1 and datno in new1 will always be missing.
Also I assume you misspelled nodupkey on your proc sort.
EDIT: or, to have it all in the same query (if that's what you meant with "the same proc sql"):
proc sql;
create table new1 as
select id
,alid as blid
,pdate
,datepart(today) as dat1
,input(number,20.) as datno
,monotonic() as key
from (
select a.id,a.alid,b.pdate
from tb a
inner join tb1 act
on a.aid =act.aid
left join tb2 as b
on (r.alid=a.alid)
where a.did in (15,45)
)
group by datno, dat1
having key=min(key)
;
quit;
I have a dataset with many columns like this:
ID Indicator Name C1 C2 C3....C90
A 0001 Black 0 1 1.....0
B 0001 Blue 1 0 0.....1
B 0002 Blue 1 0 0.....1
Some of the IDs are duplicates because the indicator is different, but they're essentially the same record. To find duplicates, I want to select distinct ID, Name and then C1 through C90 to check because some claims who have the same Id and indicator have different C1...C90 values.
Is there a way to select c1...c90 either through proc sql or a sas data step? It seems the only way I can think of is to set the dataset and then drop the non essential columns, but in the actual dataset, it's not only Indicator but at least 15 other columns.
It would be nice if PROC SQL used the : variable name wildcard like other Procs do. When no other alternative is reasonable, I usually use a macro to select bulk columns. This might work for you:
%macro sel_C(n);
%do i=1 %to %eval(&n.-1);
C&i.,
%end;
C&n.
%mend sel_C;
proc sql;
select ID,
Indicator,
Name,
%sel_C(90)
from have_data;
quit;
If I understand the question properly, the easiest way would be to concatenate the columns to one. RETAIN that value from row to row, and you can compare it across rows to see if it's the same or not.
data want;
set have;
by id indicator;
retain last_cols;
length last_cols $500;
cols = catx('|',of c1-c90);
if first.id then call missing(last_cols);
else do;
identical = (cols = last_cols); *or whatever check you need to perform;
end;
output;
last_cols = cols;
run;
There are a few different ways you can do this and it will be much easier if the actual column names are C1 - C90. If you're just looking to remove anything that you know is a duplicate you can use proc sort.
proc sort data=dups out=nodups nodupkey;
by ID Name C1-C90;
run;
The nodupkey option will automatically remove any duplicates in the by statement.
Alternatively, if you want to know which records contain duplicates, you could use proc summary.
proc summary data=dups nway missing;
class ID Name C1-C90;
output out=onlydups(where=(_freq_ > 1));
run;
proc summary creates two new variables, _type_ and _freq_. If you specify _freq_ > 1 you will only output the duplicate records. Also, note that this will remove the Indicator variable.
Here is the the structure of my data below:
I only need ID 2 since all patients are alive, therefore trying to delete ID 1:
ID sex status
1 2 A
1 2 A
1 2 A
1 2 D
2 1 A
2 1 A
2 1 A
If you truly want to delete the records from your source dataset, you can do so with:
PROC SQL;
DELETE FROM MyData WHERE ID = 1;
QUIT;
However, if you want to retain the source dataset as-is; maybe you will use it again, it would be best to create a new dataset from it, like so:
PROC SQL;
CREATE TABLE MyFilteredData AS
SELECT ID, sex, status
FROM MyData
WHERE ID = 2;
QUIT;
or
DATA MyFilteredData;
SET MyData;
IF ID = 2;
RUN;
proc sql;
delete from your_data where id ~= 2;
quit;
This PROC SQL will create a new dataset Want from the original dataset Have including only IDs that have no status="D":
proc sql;
create table Want as
select *
from Have
where ID not in
(select distinct ID
from Have
where status="D")
;
quit;
NAME DATE
---- ----------
BOB 24/05/2013
BOB 12/06/2012
BOB 19/10/2011
BOB 05/02/2010
BOB 05/01/2009
CARL 15/05/2011
LOUI 15/01/2014
LOUI 15/05/2013
LOUI 15/05/2012
DATA newdata;
SET mydata;
count + 1;
IF FIRST.name THEN count=1;
BY name DESCENDING date;
run;
here i got count group wise 1,2,3 so on..I want the output of name(all obs of bob) if count> 3. please help me..
The simplest way to do that is to output the last row for each ID if it is > 3, then merge that dataset back to your master dataset, keeping only matches. You could also use PROC FREQ to generate the dataset of counts and merge to that.
You can do it in a single datastep using a DoW loop, but that's more complicated, so I wouldn't recommend a new user do that.
I think this shows the power of SQL - though some would say since this generates a NOTE in the log it isn't good practice. Use the GROUP & HAVING clause in SQL to create a count of the names that you then limit to 3.
proc sql;
create table want as
select *
from have
group by name
having count(name)>3;
quit;
Here are a couple different ways to do this using SUBQUERIES in PROC SQL
Data HAVE;
Length NAME $50;
Input Name $ Date: ddmmyy10.;
Format date ddmmyy10.;
datalines;
BOB 24/05/2013
BOB 12/06/2012
BOB 19/10/2011
BOB 05/02/2010
BOB 05/01/2009
CARL 15/05/2011
LOUI 15/01/2014
LOUI 15/05/2013
LOUI 15/05/2012
;
Run;
Using a multiple-value subquery in the Where statement
Proc sql;
Create table WANT1 as
Select *
From Have
Where Name in (Select name from have b group by b.name having count(b.name)>3);
Quit;
Using a subquery in the From clause
Proc sql;
Create table WANT2 as
Select a.name, a.date
From Have a Inner Join (select name, count(name) as Count from have b group by b.name having Count>3)
On a.name=b.name
;
Quit;
I've the below dataset as input
ID
--
1
2
2
3
4
4
4
5
And need a new dataset as below
ID count of ID
-- -----------
1 1
2 2
3 1
4 3
5 1
Could you please tell how to do this in SAS wihtout using PROC SQL?
or how about Proc Freq or Proc Summary? These avoid having to presort the data.
proc freq data=have noprint;
table id / out=want1 (drop=percent);
run;
proc summary data=have nway;
class id;
output out=want2 (drop=_type_);
run;
proc sql noprint;
create table test as select distinct id, count(id)
from your_table
group by ID
order by ID
;
quit;
Try this:
DATA Have;
input id ;
datalines;
1
2
2
3
4
4
4
5
;
Proc Sort data=Have;
by ID;
run;
Data Want;
Set Have;
By ID;
If first.ID then Count=0;
Count+1;
If Last.ID then Output;
Run;
PROC SORT DATA=YOURS NOPRINT;
BY ID; RUN;
PROC MEANS DATA=YOURS;
VAR ID;
BY ID;
OUTPUT OUT=NEWDATASET N=; RUN;
You can also choose to keep only the Id and N variables in your newdataset.
We can use simple PROC SQL count to do this:
proc sql;
create table want as
select id, count(id) as count_of_id
from have
group by id;
quit;
Here is yet another possibility, often known as a DoW construction:
Data want;
do count=1 by 1 until(last.ID);
set have;
by id;
end;
run;
If the aggregation you want to do is complex then go with PROC SQL only as we are more familiar with Group by in SQL
proc sql ;
create table solution_1 as select distinct ID, count(ID)
from table_1
group by ID
order by ID
;
quit;
OR
If you are using SAS- EG Query builders are very useful in small
analyses .
It's just drag & drop the columns u want to aggregate and in summary option Select whatever operation you want to perform like Avg,Count,miss,NMiss etc .