SAS Updating records sequentially

I have hundreds of thousands of IDs in a large dataset.
Some records have the same ID but different data points. Some of these IDs need to be merged into a single ID. People registered for a system more than once should be just one person in the database.
I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because they registered three times. I had a macro that essentially was the following set of if-then statements:
if ID='1111111' then do; ID='2222222'; end;
if ID='2222222' then do; ID='3333333'; end;
I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run and the list just gets longer. Is there a faster method of updating these IDs?
Thanks
EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.
data one;
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;
%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;
data two; set one; %ID_Change; run;

A hash table will greatly speed up the process. Hash tables are one of the little-used, but highly effective, tools in SAS. They're a bit bizarre since the syntax is very different from standard SAS programming. For now, think of it as a way to merge data together in-memory (a big reason as to why it's so fast).
First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column, and New_ID as your data column.
dataset: translate

ID      New_ID
111111  222222
222222  333333
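For reference, a lookup dataset like this could be created as follows (a minimal sketch; the values simply mirror the table above):

data translate;
    input ID :$6. New_ID :$6.;
    datalines;
111111 222222
222222 333333
;
run;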
In a hash table, you need to consider two things:
The Key column(s)
The Data column(s)
The Data column is what will be replacing observations matched by the Key column. In other words, New_ID will be populated every time there's a match for ID.
Next, you'll want to do your hash merge. This is performed in the data step.
data want;
    set have;

    /* Only declare the hash object on the first iteration.
       Otherwise it will do this every record. */
    if(_N_ = 1) then do;
        declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
        id_h.defineKey('ID');                    *Define key for matching;
        id_h.defineData('New_ID');               *The new ID after matching;
        id_h.defineDone();                       *Done declaring this hash object;
        call missing(New_ID);                    *Prevents a note in the log;
    end;

    /* If a customer has changed multiple times, keep iterating until
       there is no longer a match between tables */
    do while(id_h.Find() = 0);
        _loop_count+1; *Tells us how long we have been in the loop;

        /* If the while loop gets to 500 iterations, there is likely a
           problem and you do not want the data step to get stuck */
        if(_loop_count > 500) then do;
            put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
            leave;
        end;

        /* If the ID of the hash table matches the ID of the dataset,
           set ID to New_ID from the hash object */
        ID = New_ID;
    end;

    _loop_count = 0;
    drop _loop_count;
run;
This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.

Use PROC SQL or a MERGE step against your separate file (after you have created a separate dataset from it, using infile or proc import) to append this unique id to all records. If your separate file contains only the duplicates, you will need to create a dummy unique id for the non-duplicates.
Do PROC SORT with BY unique id and timestamp of signup.
Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, do if first.timestamp then output; (or last, etc.)
Or you could do it all in one PROC SQL using a left join to the separate file, a coalesce step to return a dummy unique id if it is not contained in the separate file, a group by unique id, and a having max(timestamp) (or min). You can also coalesce any other variables you might want to try to preserve across signups -- for example, if the first signup contained a phone number and successive signups were missing that data point.
Without a reproducible example it's hard to be more specific, but a rough sketch of the all-in-one PROC SQL approach is shown below.
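For illustration only, here is a sketch of that single query, assuming hypothetical datasets SIGNUPS (one row per signup, with columns ID, TIMESTAMP, and PHONE) and XWALK (mapping each duplicate ID to its UNIQUE_ID):

proc sql;
    create table deduped as
    select coalesce(x.unique_id, s.id) as unique_id, /* dummy unique id for non-duplicates */
           s.*
    from signups as s
    left join xwalk as x
        on s.id = x.id
    group by calculated unique_id
    having s.timestamp = max(s.timestamp); /* keep the latest signup per person */
quit;

SAS will log a remerge note for this query, and a tie on timestamp within a unique_id would keep multiple rows, so adjust to taste.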

Related

SAS: Create missing numeric IDs as individual observations

I need to outline a series of ID numbers that are currently available, based on a dataset in which IDs are already assigned (if an ID is on the file then it's in use; if it's not on file, then it's available for use).
The issue is I don't know how to create a dataset that displays the ID numbers which fall between two IDs that are currently on file. Let's say I have the dataset below:
data have;
input id;
datalines;
1
5
6
10
;
run;
What I need is a new dataset with the following structure:
data need;
input id;
datalines;
2
3
4
7
8
9
;
run;
I am not sure how I would produce the observations for IDs 2, 3, and 4, as these would be the "available IDs"...
My initial attempt was going to be subtracting the ID value of one observation from the next in order to find the difference, but I was stuck from there on how to use that value and add 1 to the previous observation... it all became quite messy.
Any assistance would be appreciated.
As long as your set of possible IDs is known, this can be done by putting them all in a dataset and excluding the used ones.
e.g.
data id_set;
do id = 1 to 10;
output;
end;
run;
proc sql;
create table need as
select id
from id_set
where id not in (select id from have)
;
quit;
Create a temporary variable that stores the previous id, then just loop between that and the current id, outputting each iteration.
data have;
input id;
datalines;
1
5
6
10
;
run;
data need (rename=(newid=id));
set have;
retain _lastid; /* keep previous id value */
if _n_>1 then do newid=_lastid+1 to id-1; /* fill in numbers between previous and current ids */
output;
end;
_lastid=id;
keep newid;
run;
Building on Jetzler's answer: Another option is to use the MERGE statement. In this case:
note: before merge, sort both datasets by id (if not already sorted);
data want;
merge id_set (in=a)
have (in=b); /*specify datasets and vars to allow the conditional below*/
by id; /*merge key variable*/
if a and not b; /*on output keep only records in ID_SET that are not in HAVE*/
run;

SAS Data Step | Between 2 Dates

Probably a simple question. I have a simple dataset with scheduled payment dates in it.
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
11/16/2015 12/16/2015
12/17/2015 01/16/2016
01/17/2016 02/16/2016
;
What I'm trying to do is create a binary latest-row indicator. For example, if I wanted to know the latest row as of 1/31/2016, I'd want row 2 to be flagged as the latest row. What I had been doing before was finding out where 1/31/2016 falls between previous_pmt_date and scheduled_pmt_date, but that isn't correct for my purposes. I'd like to do this in a data step as opposed to SQL subqueries. Any ideas?
Want:
previous_pmt_date  scheduled_pmt_date  latest_row_ind
11/16/2015         12/16/2015          0
12/17/2015         01/16/2016          1
01/17/2016         02/16/2016          0
Here's a solution that does it all in a single data step without any additional sorting. First, I'm going to modify your data slightly to include an account variable, as the solution really should take that into account as well:
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT account previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
1 11/16/2015 12/16/2015
1 12/17/2015 01/16/2016
1 01/17/2016 02/16/2016
2 11/16/2015 12/16/2015
2 12/17/2015 01/16/2016
2 01/17/2016 02/16/2016
;
run;
Specify a cutoff date:
%let cutoff_date = %sysfunc(mdy(1,31,2016));
This solution uses the approach from this question to save the variables from the next row of data into the current row. You can drop the vars at the end if desired (I've commented that out for testing purposes).
data want;
set inform2 end=eof;
by account scheduled_pmt_date;
recno = _n_ + 1;
if not eof then do;
set inform2 (keep=account previous_pmt_date scheduled_pmt_date
rename=(account = next_account
previous_pmt_date = next_previous_pmt_date
scheduled_pmt_date = next_scheduled_pmt_date)
) point=recno;
end;
else do;
call missing(next_account, next_previous_pmt_date, next_scheduled_pmt_date);
end;
select;
when ( next_account eq account and next_scheduled_pmt_date gt &cutoff_date ) flag='a';
when ( next_account ne account ) flag='b';
otherwise flag = 'z';
end;
*drop next:;
run;
This approach works by taking the current observation number (obtained via _n_) and adding 1 to it to get the next observation. We then use a second set statement with the point= option to load in that next observation, renaming the variables at the same time so that they don't overwrite the current variables.
We then use some logic to flag the necessary records. I'm not 100% sure of the logic you require for your purposes, so I've provided some sample logic and used different flags to show which branch is being triggered.
Some notes...
The by statement isn't strictly necessary but I'm including it to (a) ensure that the data is sorted correctly, and (b) help future readers understand the intent of the datastep as some of the logic requires this sort order.
The call missing statement is simply there to clean up the log. SAS doesn't like it when you have variables that don't get assigned values, and this will happen on the very last observation so this is why we include this. Comment it out to see what happens.
The end=eof syntax basically creates a temporary variable called eof that has a value of 1 when we get to the last observation on that set statement. We simply use this to determine if we're at the last row or not.
Finally, and very importantly, make sure you keep only the variables required when you load in the second dataset; otherwise you will overwrite existing vars in the original data.

I want to add auto_increment column in a table in SAS

I want to add an auto-increment column to a table in SAS. The following code adds a column but does not increment the value.
Thanks in advance.
proc sql;
alter table pmt.W_cur_qtr_recoveries
add ID integer;
quit;
Wow, going to try for my second "SAS doesn't do that" answer this morning. Risky stuff.
A SAS dataset cannot define an auto-increment column. Whether you are creating a new dataset or inserting records into an existing dataset, you are responsible for creating any increment counters (i.e., they are just normal numeric vars where you set the values to what you want).
That said, there are DATA step statements such as the sum statement (e.g. MyCounter+1) that make it easier to implement counters. If you describe more details of your problem, people could provide some alternatives.
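For example, a minimal counter via the sum statement (the dataset names here are just placeholders):

data want;
    set have;
    MyCounter + 1; * sum statement: MyCounter is retained and incremented once per row;
run;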
The correct answer at this time is to create the ID yourself, BUT the discussion wouldn't be complete without mentioning that there is an unsupported SQL function Monotonic that can do what you want. It's not reliable, yet it persists.
The code pattern for its usage is
select monotonic() as ID, ....
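For example (keeping in mind that monotonic() is undocumented and unsupported, so treat the result with caution):

proc sql;
    create table with_id as
    select monotonic() as ID, t.*
    from pmt.W_cur_qtr_recoveries as t;
quit;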
Use the _N_ automatic variable in a data step like:
DATA TEMPLIB.my_dataset (label="my dataset with auto increment variables");
    SET TEMPREP.my_dataset;
    sas_incr_num = _N_; * add an auto increment 'sas_incr_num' variable;
    * build a string version with 5 leading zeros and text on either end;
    sas_incr_cat = cat("AB.",
                       cats(repeat("0", 5 - ceil(log10(sas_incr_num + 1))), sas_incr_num),
                       ".YZ");
    LABEL
        sas_incr_num = "auto number each row"
        sas_incr_cat = "auto number each row, leading zeros, and add strings along for fun"
    ...
There is no such thing as an auto increment column in a SAS dataset. You can use a data step to create a new dataset that has the new variable. You can use the same name to have it replace the old one when done.
data pmt.W_cur_qtr_recoveries;
set pmt.W_cur_qtr_recoveries;
ID+1;
run;
It really depends on what your intended outcome is, but I have thrown together an example of how you might want to tackle this. It is a little rough, but gives you something to work from.
/*JUST SETTING UP THE DAY ONE DATA WITH AN ID ATTACHED
YOU WOULD MAKE THE FIRST RUN EXECUTE DIFFERENTLY TO SUBSEQUENT RUNS BY USING THE EXISTS FUNCTION AND MACRO LANGUAGE,
BUT I WILL LET YOU INVESTIGATE THIS FURTHER AS IT MAY BE IRRELEVANT.*/
DATA DAY1;
SET SASHELP.CLASS;
ID+1;
RUN;
/*ON DAY 2 WE ARE APPENDING ADDITIONAL RECORDS TO THE EXISTING DATASET*/
DATA DAY2;
/*APPEND DATASETS*/
SET DAY1 SASHELP.CLASS;
/*HOLD VALUE IN PROGRAM DATA VECTOR (PDV) UNTIL EXPLICITLY CHANGED*/
RETAIN _ID;
/*ADD VARIABLE _ID AND POPULATE WITH ID. IN DOING THIS THE LAST INSTANCE OF THE ID WILL BE HELD IN THE PDV FOR THE
FIRST OF THE NEW RECORDS*/
IF ID ~= . THEN _ID = ID;
/*INCREMENT THE VALUE IN _ID BY 1 AND DO SO FOR EACH RECORD ADDED*/
ELSE DO;
_ID+1;
END;
/*DROP THE ORIGINAL ID;*/
DROP ID;
/*RENAME _ID TO ID*/
RENAME _ID = ID;
RUN;
where "W_prv_qtr_recoveries" is a table Name and "pmt" is a library name.
Thanks to user2337871.
DATA pmt.W_prv_qtr_recoveries;
SET pmt.W_prv_qtr_recoveries;
RETAIN _ID;
IF ID ~= . THEN _ID = ID;
ELSE DO;
_ID+1;
END;
DROP ID;
RENAME _ID = ID;
RUN;
Assuming that this auto-increment column will be used for every record that is inserted, we can accomplish it as follows.
We first check the latest key in the dataset:
PROC SQL;
SELECT MAX(KEY) INTO :MK FROM MYDATA;
QUIT;
%put KeyOld=&MK;
Then we increment this key:
Data _NULL_;
call symput('KeyNew',&MK+1);
run;
%put KeyNew=&KeyNew;
Here we hold the new record that we want to insert and add the corresponding key:
Data TEMP1;
set TEMP;
Key=&KeyNew;
run;
Finally, we load the new record into our dataset:
PROC APPEND BASE=MYDATA DATA=TEMP1 FORCE;
RUN;

Select only a few columns from a large table in SAS

I have to join 2 tables on a key (say ABC). I have to update a single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A: contains some 100 columns; key column ABC.
TABLE B: contains just 2 columns: key column ABC and status_cd.
TABLE A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column in the PROC SQL CREATE TABLE step?
Thanks in advance.
You can take advantage of dataset options so that you can still use a wildcard (a.*) in the select statement. Note that the order of the columns could change when doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
    update master a
    set status_cd = coalesce(status_cd,
                             (select status_cd
                              from transaction b
                              where a.ABC = b.ABC))
    where exists (select 1
                  from transaction b
                  where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
    update master
    set weight = coalesce(weight, input(name, key.))
    where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.

SAS: Compare values between two numbers to make separate buckets

I'm trying to compare multiple sets of data by putting them in separate groups between two numbers. Originally I had statements like,
if COLUMN1 gt 0 and COLUMN1 LE 1000 then PRICE_GROUP = 1000;
I had this going up by 1000 to 100,000. The only problem is that when I counted how many were in each price_group, some price_groups were missing (for example, the 57,000 group had no values, so it did not appear in count(price_group)). The solution, I think, is to make a table with the bounds for each group, and then compare the actual value against the upper and lower bounds.
proc iml;
mat = j(100,2,0);
total = 100000;
mat[1,1] = 0;
mat[1,2] = mat[1,1] + (total/100);
do i = 2 to nrow(mat);
mat[i,1] = mat[i-1,1] + (total/100);
mat[i,2] = mat[i,1] + (total/100);
end;
create dataset from mat;
append from mat;
quit;
This creates the table against which I can compare the values, but is there an easier way besides PROC IML? I was next going to loop through, comparing each value with the two bound columns, and create a new column on the table to hold the count in each bucket. That still seems like an intensive, inefficient process.
IML isn't a terrible solution, but there are a few others depending on what exactly you're doing.
The most common is proc format. Create a format that manages each bucket, like so:
proc format;
value buckets
0-1000 = 1000
1000<-2000 = 2000
...
other="NA";
quit;
Then you can either use the format (or informat) to create a new variable with the bucketed value, or, even better, use the format on the fly (i.e., in proc means or whatnot). That not only means you don't have to rewrite the dataset, but also lets you swap formats on and off depending on how many buckets you want (say, a buckets100 format versus a buckets20). An example follows.
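For instance, applying the format on the fly (column1 and a dataset called HAVE are assumed from the question):

proc means data=have n;
    class column1;            * group by the formatted value of column1;
    format column1 buckets.;  * attach the bucketing format just for this step;
run;

Here PROC MEANS groups by the formatted value of column1, so each bucket appears as a class level without any new variable being created.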
Second, your specific question looks like it's solvable just using math:
data want;
set have;
bucket = &total/100*(1+floor(column1/(&total/100)));
run;
although obviously that doesn't work for every example.
Third, you could use a hash lookup table if you are unable to use formats (such as when two or more elements determine the bucket); a rough sketch is below. That's the closest thing to the IML solution inside a regular data step, and hash objects are very commonly used for lookups in SAS.
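A rough sketch of that hash lookup, assuming a hypothetical lookup dataset BUCKET_MAP with key columns REGION and TIER and a numeric data column PRICE_GROUP:

data want;
    set have;
    if _n_ = 1 then do;
        declare hash h(dataset: 'bucket_map');
        h.defineKey('region', 'tier'); * composite key;
        h.defineData('price_group');   * value returned on a match;
        h.defineDone();
        call missing(price_group);
    end;
    if h.find() ne 0 then price_group = .; * no match: leave missing;
run;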
Create another table with groups:
data group_table;
do price_group=1000 to 100000 by 1000;
output;
end;
run;
Then left join this new table to your counts table, using price_group as the key:
proc sql;
    create table price_group_compare as
    select L.price_group, R.group1_count, R.group2_count
    from group_table as L
    left join group_counts as R
        on L.price_group = R.price_group;
quit;