Probably a simple question. I have a simple dataset with scheduled payment dates in it.
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
11/16/2015 12/16/2015
12/17/2015 01/16/2016
01/17/2016 02/16/2016
;
What I'm trying to do is to create a binary latest row indicator. For example, If I wanted to know the latest row as of 1/31/2016 I'd want row 2 to be flagged as the latest row. What I had been doing before is finding out where 1/31/2016 is between the previous_pmt_date and the scheduled_pmt_date, but that isn't correct for my purposes. I'd like to do this in an data step as opposed to SQL subqueries. Any ideas?
Want:
previous_pmt_date scheduled_pmt_date latest_row_ind
11/16/2015 12/16/2015 0
12/17/2015 01/16/2016 1
01/17/2016 02/16/2016 0
Here's a solution that does it all in the single existing datastep without any additional sorting. First I'm going to modify your data slightly to include account as the solution really should take that into account as well:
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT account previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
1 11/16/2015 12/16/2015
1 12/17/2015 01/16/2016
1 01/17/2016 02/16/2016
2 11/16/2015 12/16/2015
2 12/17/2015 01/16/2016
2 01/17/2016 02/16/2016
;
run;
Specify a cutoff date:
%let cutoff_date = %sysfunc(mdy(1,31,2016));
This solution uses the approach from this question to save the variables in the next row of data, into the current row. You can drop the vars at the end if desired (I've commented out for the purposes of testing).
data want;
set inform2 end=eof;
by account scheduled_pmt_date;
recno = _n_ + 1;
if not eof then do;
set inform2 (keep=account previous_pmt_date scheduled_pmt_date
rename=(account = next_account
previous_pmt_date = next_previous_pmt_date
scheduled_pmt_date = next_scheduled_pmt_date)
) point=recno;
end;
else do;
call missing(next_account, next_previous_pmt_date, next_scheduled_pmt_date);
end;
select;
when ( next_account eq account and next_scheduled_pmt_date gt &cutoff_date ) flag='a';
when ( next_account ne account ) flag='b';
otherwise flag = 'z';
end;
*drop next:;
run;
This approach works by using the current observation in the dataset (obtained via _n_) and adding 1 to it to get the next observation. We then use a second set statement with the point= option to load in that next observation and rename the variables at the same time so that they don't overwrite the current variables.
We then use some logic to flag the necessary records. I'm not 100% of the logic you require for your purposes, so I've provided some sample logic and used different flags to show which logic is being triggered.
Some notes...
The by statement isn't strictly necessary but I'm including it to (a) ensure that the data is sorted correctly, and (b) help future readers understand the intent of the datastep as some of the logic requires this sort order.
The call missing statement is simply there to clean up the log. SAS doesn't like it when you have variables that don't get assigned values, and this will happen on the very last observation so this is why we include this. Comment it out to see what happens.
The end=eof syntax basically creates a temporary variable called eof that has a value of 1 when we get to the last observation on that set statement. We simply use this to determine if we're at the last row or not.
Finally but very importantly, be sure to make sure you are keeping only the variables required when you load in the second dataset otherwise you will overwrite existing vars in the original data.
Related
This might sound awkward but I do have a requirement to be able to concatenate all the values of a char column from a dataset, into one single string. For example:
data person;
input attribute_name $ dept $;
datalines;
John Sales
Mary Acctng
skrill Bish
;
run;
Result : test_conct = "JohnMarySkrill"
The column could vary in number of rows in the input dataset.
So, I tried the code below but it errors out when the length of the combined string (samplkey) exceeds 32K in length.
DATA RECKEYS(KEEP=test_conct);
length samplkey $32767;
do until(eod);
SET person END=EOD;
if lengthn(attribute_name) > 0 then do;
test_conct = catt(test_conct, strip(attribute_name));
end;
end;
output; stop;
run;
Can anyone suggest a better way to do this, may be break down a column into chunks of 32k length macro vars?
Regards
It would very much help if you indicated what you're trying to do but a quick method is to use SQL
proc sql NOPRINT;
select name into :name_list separated by ""
from sashelp.class;
quit;
%put &name_list.;
As you've indicated macro variables do have a size limit (64k characters) in most installations now. Depending on what you're doing, a better method may be to build a macro that puts the entire list as needed into where ever it needs to go dynamically but you would need to explain the usage for anyone to suggest that option. This answers your question as posted.
Try this, using the VARCHAR() option. If you're on an older version of SAS this may not work.
data _null_;
set sashelp.class(keep = name) end=eof;
length long_var varchar(1000000);
length want $256.;
retain long_var;
long_var = catt(long_var, name);
if eof then do;
want = md5(long_var);
put want;
end;
run;
I need help reading the code below. I am not sure what specific parts in this code are doing. For example, what does ( firstobs = 2 keep = column3 rename = (column3 = column4) ) do?
Also, what does ( obs = 1 drop = _all_ ); do?
I have also not used column5 = ifn( first.column1, (.), lag(column3) ); before. What does this do?
I am reading someone else's code. I wish I could provide more detail. If I find a solution, I will post it. Thanks for your help.
data out.dataset1;
set out.dataset2;
by column1;
WHERE column2 = 'N';
set out.dataset1 ( firstobs = 2 keep = column3 rename = (column3 = column4) )
out.dataset1 ( obs = 1 drop = _all_ );
FORMAT column5 DATETIME20.;
FORMAT column4 DATETIME20.;
column5 = ifn( first.column1, (.), lag(column3) );
column4 = ifn( last.column1, (.), column4 );
IF first.column1 then DIF=intck('dtday',column4,column3);
ELSE DIF= intck('dtday',column5,column3);
format column6 $6.;
IF first.column1
OR intck('dtday',column5,column3) GT 20 THEN column6= 'HARM';
ELSE column6= 'REPEAT';
run;
Seems like you need to learn about SAS datastep languange!
This series of things happening in the parenthesis are datastep options
You can use those options when ever you are referencing a table, even in a proc sql
The options you have:
firstobs : This starts the datafeed on the record enter in your case 2 it means SAS will start on the table on the 2nd record.
keep : This will only use the fields in list rather than using all the field in the table
rename = rename will rename field, so it works like an alias in SQL
OBS = will limit the amount of record you pull out of a table like top or limit in SQL
DROP = would remove the fields selected from the table in your case all is used wich means he's dropping all the fields.
as for the functions:
LAG is keeping the value from the previous record for the field you put in parenthesis so DPD_CLOSE_OF_BUSINESS_DT
INF = Works like a case or if. Basically you create a condition in the 1st argument and then the 2nd argument is applied when your condition in the 1st argument is true, the 3rd argument get done in the event that your condition on the 1st argument is false.
So to answer that question if it's the first record for the variable SOR_LEASE_NBR then the field Prev_COB_DT will be . otherwise it will be a previous value of DPD_CLOSE_OF_BUSINESS_DT.
The best advise I can give you is to start googling SAS and the function name you are wondering what it does, then it's a matter of encapsulation!
Hope this helps!
Basically your data step is using the LAG() function to look back one observation and the extra SET statement to look ahead one observation.
The IFN() function calls are then being used to make sure that missing values are assigned when at the boundary of a group.
You then use these calculated PREV and NEXT dates to calculate the DIF variable.
Note for this to work you need to be referencing the same input dataset in the two different SET statements (the dataset used in the last with the obs=1 and drop=_all_ dataset options
doesn't really need to be the same since it is not reading any of the actual data, it just has to have at least one observation).
( firstobs = 2 keep = DPD_CLOSE_OF_BUSINESS_DT rename =
(DPD_CLOSE_OF_BUSINESS_DT = Next_COB_DT) ) do?
Here the code firstobs=2 says SAS to read the data from 2nd observation in the dataset.
and also by using rename option trying to change the name of the variable.
(obs = 1 drop = _all_);
obs=1 is reading only the 1st obs in the dataset. If you specify obs=2 then up to 2nd obs will be read.
drop = _all_ , is dropping all of your variables.
Firstobs:
Can read part of the data. If you specify Firstobs= 10, it starts reading the data from 10th observation.
Obs :
If specify obs=15, up to 15th obs the data will be readed.
If you run the below table, it gives you 3 observations ( from 2nd to 4th ) in the output result.
Example;
DATA INSURANCE;
INFILE CARDS FIRSTOBS=2 OBS=4;
INPUT NAME$ GENDER$ AGE INSURANCE $;
CARDS;
SOWMYA FEMALE 20 MEDICAL
SUNDAR MALE 25 MEDICAL
DIANA FEMALE 67 MEDICARE
NINA FEMALE 56 MEDICAL
RUN;
I am simply looking to create a new column retaining the value of a specific column within the previous record, so that I can compare the existing column against the new column. Further down the line I want to be able to output records whereby the values in both columns are different and lose the records where the values are the same
Essentially I want my dataset to look like this, where the first idr has the retained date set to null:
Idr Date1 Date2
1 20/01/2016 .
1 20/01/2016 20/01/2016
1 18/10/2016 20/01/2016
2 07/03/2016 .
2 18/05/2016 07/03/2016
2 21/10/2016 18/05/2016
3 29/01/2016 .
3 04/02/2016 29/01/2016
3 04/02/2016 04/02/2016
I have used code along the following lines in the past whereby I have created a temporary variable referencing the data I want to retain:
date_temp=date1;
data example2;
set example1;
by idr date1;
date_temp=date1;
retain date_temp ;
if first.idr then do;
date_temp=date1;
end;else do;
date2=date_temp;
end;
run;
I have searched the highs and lows of google - Any help would be greatly appreciated
The trick here is to set the value of the retained variable ready for the next row after you've already output the current row, rather than relying on the default implicit output at the end of the data step:
data example2;
set example1;
by idr;
retain date2;
if first.idr then call missing(date2);
output;
date2 = date1;
format date2 ddmmyy10.;
run;
Logic that executes after the output statement doesn't make any difference to the row that's just been output, but until the data step proceeds to the next iteration, everything is still in the PDV, including variables that are being retained across to the next row, so this is an opportunity to update them. More details on the PDV.
Another way of doing this, using the lag function:
data example3;
set example1;
date2 = lag(date1);
if idr ne lag(idr) then call missing(date2);
run;
Be careful when using lag - it returns the value from the last time that instance of the lag function was executed during your data step, not necessarily the value of that variable from the previous row, so weird things tend to happen if you do something like if condition then mylaggedvar=lag(var);
To achieve your final outcome (remove records where the idr and date are the same as the previous row), you can easily achieve this without creating the extra column. Providing the data is sorted by idr and date1, then just use first.date1 to keep the required records.
data have;
input Idr Date1 :ddmmyy10.;
format date1 ddmmyy10.;
datalines;
1 20/01/2016
1 20/01/2016
1 18/10/2016
2 07/03/2016
2 18/05/2016
2 21/10/2016
3 29/01/2016
3 04/02/2016
3 04/02/2016
;
run;
data want;
set have;
by idr date1;
if first.date1 then output;
run;
I have something similar to the code below, I want to create every 2 character combination within my strings and then count the occurrence of each and store in a table. I will be changing the substr statement to a do loop to iterate through the whole string. But for now I just want to get the first character pair to work;
data temp;
input cat $50.;
call symput ('regex', substr(cat,1,2));
®ex = count(cat,substr(cat,1,2));
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
Expected results;
cat bv dv cd ud kd
#### 6
#### 4
#### 8
#### 1
#### 3
#### 9
#### 1
I'd prefer not to use a proc transpose as I can't loop through the string to create all the character pairs. I'll have to manually create them and I have upto 500 characters per string, plus I would like to search for 3 and 4 string patterns.
You can't do what you're asking to directly. You will either have to use the macro language, or use PROC TRANSPOSE. SAS doesn't let you reference data in the way you're trying to, because it has to have already constructed the variable names and such before it reads anything in.
I'll post a different solution that uses the macro language, but I suspect TRANSPOSE is the ultimate solution here; there's no practical reason this shouldn't work with your actual problem, and if you're having trouble with that it should be possible to help - post the do loop and what you're wanting, and we can of course help. Likely you just need to put the OUTPUT in the do loop.
data temp;
input cat $50.;
cat_val = substr(cat,1,2);
_var_ = count(cat,substr(cat,1,2));
output;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
proc transpose data=temp out=temp_T(drop=_name_);
by cat notsorted; *or by some ID variable more likely;
id cat_val;
var _var_;
run;
Here's a solution that uses CALL EXECUTE rather than the macro language, as I decided that was actually a better solution. I wouldn't use this in production, but it hopefully shows the concept (in particular, I would not run a PROC DATASETS for each variable separately - I would concat all the renames into one string then run that at the end. I thought this better for showing how the process might work.)
This takes advantage of timing - namely, CALL EXECUTE happens after the data step terminates, so by that point you do know what variable maps to what data point. It does have to pass the data twice in order to drop the spurious variables, though if you either know the actual number of variables you want to have, or if you're okay with the excess variables hanging around, it would be okay to skip that, and PROC DATASETS doesn't actually open the whole dataset, so it would be quite fast (even the above with five calls is quite fast).
data temp;
input cat $50.;
array _catvars[50]; *arbitrary 50 chosen here - pick one big enough for your data;
array _catvarnames[50] $ _temporary_;
cat_val = substr(cat,1,2);
_iternum = whichc(cat_val, of _catvarnames[*]);
if _iternum=0 then do;
_iternum = whichc(' ',of _catvarnames[*]);
_catvarnames[_iternum]=cat_val;
call execute('proc datasets lib=work; modify temp; rename '||vname(_catvars[_iternum])||' = '||cat_val||'; quit;');
end;
_catvars[_iternum]= count(cat,substr(cat,1,2));
if _n_=7 then do; *this needs to actually be a test for end-of-file (so add `end=eof` to the set statement or infile), but you cannot do that in DATALINES so I hardcode the example.;
call execute('data temp; set temp; drop _catvars'||put(whichc(' ',of _catvarnames[*]),2. -l)||'-_catvars50;run;');
end;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
I have hundreds of thousands of IDs in a large dataset.
Some records have the same ID but different data points. Some of these IDs need to be merged into a single ID. People registered for a system more than once should be just one person in the database.
I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because they registered three times. I had a macro that essentially was the following set of if-then statements:
if ID='1111111' then do; ID='2222222'; end;
if ID='2222222' then do; ID='3333333'; end;
I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run and the list just gets longer. Is there a faster method of updating these IDs?
Thanks
EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.
data one;
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;
%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;
data two; set one; %ID_Change; run;
A hash table will greatly speed up the process. Hash tables are one of the little-used, but highly effective, tools in SAS. They're a bit bizarre since the syntax is very different from standard SAS programming. For now, think of it as a way to merge data together in-memory (a big reason as to why it's so fast).
First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column, and New_ID as your data column.
dataset: translate
ID New_ID
111111 222222
222222 333333
In a hash table, you need to consider two things:
The Key column(s)
The Data column(s)
The Data column is what will be replacing observations matched by the Key column. In other words, New_ID will be populated every time there's a match for ID.
Next, you'll want to do your hash merge. This is performed in the data step.
data want;
set have;
/* Only declare the hash object on the first iteration.
Otherwise it will do this every record. */
if(_N_ = 1) then do;
declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
id_h.defineKey('ID'); *Define key for matching;
id_h.defineData('New_ID'); *The new ID after matching;
id_h.defineDone(); *Done declaring this hash object;
call missing(New_ID); *Prevents a warning in the log;
end;
/* If a customer has changed multiple times, keep iterating until
there is no longer a match between tables */
do while(id_h.Find() = 0);
_loop_count+1; *Tells us how long we've been in the loop;
/* Just in case the while loop gets to 500 iterations, then
there's likely a problem and you don't want the data step to get stuck */
if(_loop_count > 500) then do;
put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
leave;
end;
/* If the ID of the hash table matches the ID of the dataset, then
we'll set ID to be New_ID from the hash object;
ID = New_ID;
end;
_loop_count = 0;
drop _loop_count;
run;
This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.
Use PROC SQL or a MERGE step against your separate file (after you have created a separate dataset from it, using infile or proc import) to append this unique id to all records. If your separate file contains only the duplicates, you will need to create a dummy unique id for the non-duplicates.
Do PROC SORT with BY unique id and timestamp of signup.
Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, do if first.timestamp then output; (or last, etc.)
Or you could do it all in one PROC SQL using a left join to the separate file, a coalesce step to return a dummy unique id if it is not contained in the separate file, a group by unique id, and a having max(timestamp) (or min). You can also coalesce any other variables you might want to try to preserve across signups -- for example, if the first signup contained a phone number and successive signups were missing that data point.
Without a reproducible example it's hard to be more specific.