Script the next row data for current - sas

What can I do if I want to copy the data from the next row?
For example customer A started his current trip on 01JAN2015 and next trip on 15JAN2015. Therefore, his end trip date for his current trip will be on 14JAN2015, which is a day before his next trip starts. What can I script for the end trip date?

As there is no lead() function in SAS, you can either sort your data into descending date order and use lag() then re-sort it back again, as per Vasilij's answer, or you can do a 'look-ahead merge'.
Example:
proc sort data=have ;
by customer date_start ;
run ;
data want ;
merge have
have (firstobs=2 rename=(date_start=next_date customer=next_customer)) ;
if customer = next_customer then do ;
date_end = next_date - 1 ; /* the day before the next trip starts */
end ;
format date_end date7. ;
drop next_: ;
run ;

Here is the code that would do what you are asking.
It sorts the data in descending order in order to use LAG() function. That way, any previous record is actually your future record and you can use it to work out the data points you need. Last PROC SORT sorts the data in the original order.
NOTE: I didn't take into account different customers. You might want to introduce some BY GROUP processing to make sure you don't take the next trip date for another customer.
data have;
input customer $ date_start date7.;
format date_start date7.;
datalines;
A 01JAN15
A 15JAN15
;
PROC SORT data=have;
by customer Descending date_start ;
RUN;
data want;
set have;
by customer Descending date_start ;
format date_end date7.;
date_end = lag(date_start)-1;
RUN;
PROC SORT data=want;
by customer date_start ;
RUN;
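To address the NOTE about different customers, one possible sketch is shown here. It assumes the same have dataset, sorted by customer and descending date_start as above. It calls lag() on every row but only uses the result when the row is not the first in its customer group. The prev_start variable is a name introduced purely for illustration.

```sas
/* Assumes HAVE is already sorted by customer, descending date_start */
data want;
  set have;
  by customer descending date_start;
  format date_end date7.;
  prev_start = lag(date_start);          /* execute LAG on every row */
  if first.customer then date_end = .;   /* latest trip: no next trip known */
  else date_end = prev_start - 1;        /* day before the next trip starts */
  drop prev_start;
run;

/* Restore the original order */
proc sort data=want;
  by customer date_start;
run;
```

Keeping the lag() call outside the IF is deliberate: the function only updates its internal queue when it actually executes.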

lag() is a horribly misnamed function that has nothing to do with the 'previous row' and should almost always be avoided. It returns the value stored the last time the function itself executed, so calling it conditionally creates buggy, very hard to spot mistakes. There are some rare cases where it makes sense to use it. This is not one of them. I really wish people would stop recommending its use. [/end rant].
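A small sketch of the kind of mistake meant here, using a made-up four-row dataset (demo, prev_conditional and prev_always are illustrative names, not from the question):

```sas
data demo;
  input x;
  datalines;
1
2
3
4
;

data lag_demo;
  set demo;
  /* LAG runs only on even rows here, so its queue holds the value of x
     from the last even row, not from the previous row */
  if mod(x, 2) = 0 then prev_conditional = lag(x);
  /* LAG runs on every row, so this one does track the previous row */
  prev_always = lag(x);
run;
```

On the row where x=4, prev_always is 3 but prev_conditional is 2, because the conditional call last ran on the row where x=2.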
Instead, consider using one of the below methods.
1) The point= method (not sure if there's a name for this). Some notes: be sure to keep only the variables you need on the second set statement, and rename them so they don't overwrite the existing variable values.
data want;
set sashelp.class end=last;
* GET THE NAME FROM THE NEXT ROW OF DATA;
if not last then do;
recno=_n_+1;
set sashelp.class(keep=name rename=(name=next_name)) point=recno;
end;
else do;
call missing(next_name);
end;
run;
2) The retain method:
* REVERSE THE ORDER OF THE DATA;
proc sort data=sashelp.class out=have;
by descending name;
run;
* KEEP TRACK OF THE PRIOR RECORDS NAME AS WE ITERATE ACROSS OBSERVATIONS;
data have2;
set have;
length next_name $8;
retain next_name '';
output;
next_name = name;
run;
* SORT THE DATA BACK TO ITS ORIGINAL ORDER;
proc sort data=have2 out=want;
by name;
run;
3) The look-ahead-merge method as suggested in Chris J's answer.
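As a rough illustration, the point= method could be applied to the trip question along these lines. This is only a sketch: it assumes a have dataset with customer and date_start, already sorted by customer and date_start, and the next_customer and next_date names are made up for the example.

```sas
data want;
  set have end=last;
  if not last then do;
    recno = _n_ + 1;
    /* read only the two variables needed from the following row */
    set have(keep=customer date_start
             rename=(customer=next_customer date_start=next_date))
        point=recno;
  end;
  else call missing(next_customer, next_date);
  /* only use the next row when it belongs to the same customer */
  if customer = next_customer then date_end = next_date - 1;
  format date_end date7.;
  drop next_: ;
run;
```

The recno variable does not need to be dropped because a POINT= variable is not written to the output dataset.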

Related

Flagging values based on subsequent occurrences using first. retain etc

Thank you to anyone who is able to help me. I've got a dataset as below:
data smp;
infile datalines dlm=',';
informat identifier $7. trx_date $9. transaction_id $13. product_description $50. ;
input identifier $ trx_date transaction_id $ product_description $ ;
datalines;
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT F/FREE STRAWBERRY
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT F/FREE STRAWBERRY
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT FULL STRAWB/GRAIN
Cust1,11Aug2016,20-0030417313,RACHELS YOG GREEK NAT F/F/ORG
Cust1,03Nov2016,23-0040737060,RACHELS YOG GREEK NAT F/F/ORG
Cust3,13Feb2016,39-0070595440,COLLECT YOG LEMON
Cust3,21Jun2016,34-0050769524,AF YOG FARMHOUSE STRAWB/REDCUR
Cust3,21Jun2016,34-0050769524,Y/VALLEY GREEK HONEY ORGANIC
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK LEMON CURD ORG
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK YOG FRUITY FAVS
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK YOG STRAWB ORG
Cust3,26Jun2016,39-0430106897,TOTAL GREEK YOGURT 0%
Cust3,14Aug2016,54-0040266755,M/BUNCH SQUASHUMS STRAW/RASP
Cust3,14Aug2016,54-0040266755,MULLER CORNER STRAWBERRY
Cust3,14Aug2016,54-0040266755,TOTAL GREEK YOGURT 0%
Cust3,22Aug2016,54-0050447336,M/BUNCH SQUASHUMS STRAW/RASP
;
For each customer (and each of their purchases, identified by transaction_id), I want to flag each product that will be repurchased during their next visit (only their next visit) on a rolling basis. So in the above dataset, the correct flags would be on rows 4, 12 and 13, because these products are bought on the customer's next visit (we only look at the next visit).
I'm trying to do it with the following program:
proc sort data = smp out = td;
by descending identifier transaction_id product_description;
run;
DATA TD2(DROP=tmp_product);
SET td;
BY identifier transaction_id product_description;
RETAIN tmp_product;
IF FIRST.product_description and first.transaction_id THEN DO;
tmp_product = product_description;
END;
ATTRIB repeat_flag FORMAT=$1.;
IF NOT FIRST.product_description THEN DO;
IF tmp_product EQ product_description THEN repeat_flag ='Y';
ELSE repeat_flag = 'N';
END;
RUN;
proc sort data = td2;
by descending identifier transaction_id product_description;
run;
But it's not working. If someone could please help it would be fab.
Best Wishes
Another method is to create a dummy group variable in both the original dataset and a temporary dataset. In the original dataset, group numbers each customer's visits in date order. The temporary dataset holds the rows from each customer's SECOND visit onwards, numbered so that each visit carries the same group number as the visit before it in the original dataset. With the dummy group in place, a hash table lookup on customer, group and product finds the products that were repurchased during the next visit. Note that dif() needs trx_date to be a numeric SAS date, not the character value read in the question.
proc sort data=smp;
by identifier trx_date;
run;
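data have(drop=_group) temp(drop=group rename=(_group=group));
set smp;
by identifier trx_date;
_d=dif(trx_date); * call dif() on every row so its queue stays in step;
if first.identifier then do;
group=1; _group=0;
end;
* ELSE stops a date jump between two customers from bumping the new customer's first visit;
else if _d>0 then do;
group+1; _group+1;
end;
if _group^=0 then output temp;
output have;
drop _d;
run;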
data want;
if 0 then set temp;
if _n_=1 then do;
declare hash h(dataset:'temp');
h.definekey('identifier','group','product_description');
h.definedata('product_description');
h.definedone();
end;
set have;
flag=(h.find()=0);
drop group;
run;
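The method below will "look ahead" to the next row (the opposite of LAG) after sorting, so you can bring the comparison onto the same row for simple logic. Note that, as written, it flags a product whose next purchase by the same customer falls on any later date, not strictly on the next visit: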
** convert character date to numeric **;
data smp1; set smp;
TRX_DATE_NUM = input(TRX_DATE,ANYDTDTE10.);
format TRX_DATE_NUM mmddyy10.;
run;
** sort **;
proc sort data = smp1;
by IDENTIFIER PRODUCT_DESCRIPTION TRX_DATE_NUM;
run;
** look ahead at the next observations and use logic to identify flags **;
data look_ahead;
set smp1;
by IDENTIFIER;
set smp1 (firstobs = 2
keep = IDENTIFIER PRODUCT_DESCRIPTION TRX_DATE_NUM
rename = (IDENTIFIER = NEXT_ID PRODUCT_DESCRIPTION = NEXT_PROD TRX_DATE_NUM = NEXT_DT))
smp1 (obs = 1 drop = _ALL_);
if last.IDENTIFIER then do;
NEXT_ID = "";
NEXT_PROD = "";
NEXT_DT = .;
end;
run;
** logic says if the next row is the same customer who bought the same product on a different date then flag **;
data look_ahead_final; set look_ahead;
if IDENTIFIER = NEXT_ID and NEXT_PROD = PRODUCT_DESCRIPTION and TRX_DATE_NUM ne NEXT_DT then FLAG = 1;
else FLAG = 0;
run;
There are a few ways to do this; I think the simplest to understand, while still having a reasonable level of performance, is to sort the data in descending date order and then use an array to store the product_descriptions of the last trx_date.
Here I use a 2 dimensional array where the first dimension is just a 1/2 value; each trx_date simultaneously loads one row of the array and checks against the other row of the array (using _array_switch to determine which is being loaded/checked).
You could do the same thing with a hash table, and it would be appreciably faster and perhaps a bit less complicated in some ways; if you are familiar with hash tables and want to see that solution, comment and I or someone else can provide it.
You also could use SQL to do this, and I suspect that is the most common solution overall, but I couldn't quite get it to work, as it has some complexity with subqueries within subqueries the way I was approaching it, and I'm apparently not good enough with those.
Here's the array solution. Set the second dimension of prods to a reasonable maximum for your data. It could even be thousands: this is a temporary array and does not use much memory, so setting it to 32000 or so would not be a big deal.
proc sort data=smp;
by identifier descending trx_date ;
run;
data want;
array prods[2,20] $255. _temporary_;
retain _array_switch 2;
do _n_ = 1 by 1 until (last.trx_date);
set smp;
by identifier descending trx_date;
/* for first row for an identifier, clear out the whole thing */
if first.identifier then do;
call missing(of prods[*]);
end;
/* for first row of a trx_date, clear out the array-row we were looking at last time, and switch _array_switch to the other value */
if first.trx_date then do;
do _i = 1 to dim(prods,2);
if missing(prods[_array_switch,_i]) then leave;
call missing(prods[_array_switch,_i]);
end;
_array_switch = 3-_array_switch;
end;
*now check the array to see if we should set next_trans_flag;
next_trans_flag='N';
do _i = 1 to dim(prods,2);
if missing(prods[_array_switch,_i]) then leave; *for speed;
if prods[_array_switch,_i] = product_description then next_trans_flag='Y';
end;
prods[3-_array_switch,_n_] = product_description; *set for next trx_date;
output;
end;
drop _:;
run;
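For the SQL route mentioned above, one possible sketch is given here. It is not the answerer's code. It assumes the smp1 dataset with a numeric TRX_DATE_NUM (as built in the look-ahead answer) and treats each distinct date as one visit; the table names visits, next_visit, visit_products and want_sql are invented for the example.

```sas
proc sql;
  /* one row per customer visit date */
  create table visits as
    select distinct identifier, trx_date_num
    from smp1;

  /* for each visit, the date of that customer's next visit (missing if none) */
  create table next_visit as
    select a.identifier, a.trx_date_num,
           min(b.trx_date_num) as next_dt format=mmddyy10.
    from visits as a
         left join visits as b
           on a.identifier = b.identifier
          and b.trx_date_num > a.trx_date_num
    group by a.identifier, a.trx_date_num;

  /* distinct products per visit, so the final join cannot duplicate rows */
  create table visit_products as
    select distinct identifier, trx_date_num, product_description
    from smp1;

  /* flag = 1 when the same product appears on the customer's next visit */
  create table want_sql as
    select s.*,
           (p.product_description is not null) as flag
    from smp1 as s
         inner join next_visit as v
           on s.identifier = v.identifier
          and s.trx_date_num = v.trx_date_num
         left join visit_products as p
           on p.identifier = s.identifier
          and p.trx_date_num = v.next_dt
          and p.product_description = s.product_description;
quit;
```

Splitting the work into small tables avoids the nested subqueries that made the single-query version awkward. If two transactions can share a date, the joins would need transaction_id instead.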
I think to really answer this you need two lists: the distinct products actually bought on each visit, and every combination of a customer's products with that customer's visits.
proc sql noprint ;
create table bought as
select distinct identifier, product_description, trx_date, transaction_id
from smp
order by 1,2,3,4
;
create table all_visits as
select a.identifier, product_description, trx_date, transaction_id
from (select distinct identifier,product_description from bought) a
natural join (select distinct identifier,transaction_id,trx_date from bought) b
order by 1,2,3,4
;
quit;
You can then combine them and make a flag for whether the product was bought on that visit.
data check ;
merge all_visits bought(in=in1) ;
by identifier product_description trx_date transaction_id ;
bought=in1;
run;
You can now use a lead technique to figure out if they also bought the product on the next visit.
data flag ;
set check ;
by identifier product_description trx_date transaction_id ;
set check(firstobs=2 keep=bought rename=(bought=bought_next)) check(drop=_all_ obs=1);
if last.product_description then bought_next=0;
run;
You can then combine back with the actual purchases and eliminate the extra dummy records.
proc sort data=smp;
by identifier product_description trx_date transaction_id ;
run;
data want ;
merge flag smp (in=in1);
by identifier product_description trx_date transaction_id ;
if in1 ;
run;
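Let's put the records back into the original order so we can check the results. This assumes smp carries a row variable holding the original observation number (for example row=_n_ in the step that creates it); the sample data above does not define one.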
proc sort; by row; run;
proc print; run;

Create new variables from format values

What I want to do: I need to create a new variable for each value label of a variable and do some recoding. I have all the value labels output from an SPSS file (see sample).
Sample:
proc format library = library ;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
value ... (many more with different amount of levels)
The new variable name would be the actual one without F and with underscore+level (example: FUMERT1F level 0 would become FUMERT1_0).
After that I need to recode the variables following this pattern:
data ds; set ds;
FUMERT1_0=0;
if FUMERT1=0 then FUMERT1_0=1;
FUMERT1_1=0;
if FUMERT1=1 then FUMERT1_1=1;
FUMERT1_2=0;
if FUMERT1=2 then FUMERT1_2=1;
FUMERT1_3=0;
if FUMERT1=3 then FUMERT1_3=1;
run;
Any help will be appreciated :)
EDIT: The answers from both Joe and data_null_ work, but Stack Overflow won't let me accept more than one answer.
Update to add an underscore to the end of each name. It looks like there is no option for PROC TRANSREG to put an underscore between the variable name and the value of the class variable, so we can just do a temporary rename. Create name=newname rename pairs to rename each class variable to end in an underscore, and a second set of pairs to rename them back. The pairs are built with CAT functions and PROC SQL into macro variables.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
%let class=sex fumert1;
proc transpose data=have(obs=0) out=vnames;
var &class;
run;
proc print;
run;
proc sql noprint;
select catx('=',_name_,cats(_name_,'_')), catx('=',cats(_name_,'_'),_name_), cats(_name_,'_')
into :rename1 separated by ' ', :rename2 separated by ' ', :class2 separated by ' '
from vnames;
quit;
%put NOTE: &=rename1;
%put NOTE: &=rename2;
%put NOTE: &=class2;
proc transreg data=have(rename=(&rename1));
model class(&class2 / zero=none);
id caseid;
output out=design(drop=_: inter: rename=(&rename2)) design;
run;
%put NOTE: _TRGIND(&_trgindn)=&_trgind;
First try:
Looking at the code you supplied and the output from Joe's I don't really understand the need for the formats. It looks to me like you just want to create dummies for a list of class variables. That can be done with TRANSREG.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
proc transreg data=have;
model class(sex fumert1 / zero=none);
id caseid;
output out=design(drop=_: inter:) design;
run;
proc contents;
run;
proc print data=design(obs=40);
run;
One good alternative to your code is to use proc transpose. It won't get you 0's in the non-1 cells, but those are easy enough to get. It does have the disadvantage that it makes it harder to get your variables in a particular order.
Basically, transpose once to vertical, then transpose back using the old variable name concatenated to the variable value as the new variable name. Hat tip to Data null for showing this feature in a recent SAS-L post. If your version of SAS doesn't support concatenation in PROC TRANSPOSE, do it in the data step beforehand.
I show using PROC EXPAND to then set the missings to 0, but you can do this in a data step as well if you don't have ETS or if PROC EXPAND is too slow. There are other ways to do this - including setting up the dataset with 0s pre-proc-transpose - and if you have a complicated scenario where that would be needed, this might make a good separate question.
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
proc transpose data=have out=want_pre;
by caseID;
var fumert1 sex;
copy fumert1 sex;
run;
data want_pre_t;
set want_pre;
x=1; *dummy variable;
run;
proc transpose data=want_pre_t out=want delim=_;
by caseID;
var x;
id _name_ col1;
copy fumert1 sex;
run;
proc expand data=want out=want_e method=none;
convert _numeric_ /transformin=(setmiss 0);
run;
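If PROC EXPAND is not available, the data step alternative mentioned above might look like this minimal sketch. It assumes the want dataset produced by the second PROC TRANSPOSE; the array name nums and the index _i are arbitrary.

```sas
data want_e;
  set want;
  /* every numeric variable that exists at this point, including the new dummies */
  array nums[*] _numeric_;
  do _i = 1 to dim(nums);
    if missing(nums[_i]) then nums[_i] = 0;
  end;
  drop _i;
run;
```

Because _i is first mentioned after the ARRAY statement, it is not part of the _numeric_ list the array picks up.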
For this method, you need to use two concepts: the cntlout dataset from proc format, and code generation. This method will likely be faster than the other option I presented (as it passes through the data only once), but it does rely on the variable name <-> format relationship being straightforward. If it's not, a slightly more complex variation will be required; you should post to that effect, and this can be modified.
First, the cntlout option in proc format makes a dataset of the contents of the format catalog. This is not the only way to do this, but it's a very easy one. Specify the appropriate libname as you would when you create a format, but instead of making one, it will dump the dataset out, and you can use it for other purposes.
Second, we create a macro that performs your action one time (creating a variable with the name_value name and then assigning it to the appropriate value) and then use proc sql to make a bunch of calls to that macro, once for each row in your cntlout dataset. Note - you may need a where clause here, or some other modifications, if your format library includes formats for variables that aren't in your dataset - or if it doesn't have the nice neat relationship your example does. Then we just make those calls in a data step.
*Set up formats and dataset;
proc format;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
quit;
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
*Dump formats into table;
proc format cntlout=formats;
quit;
*Macro that does the above assignment once;
%macro spread_var(var=, val=);
&var._&val.= (&var.=&val.); *result of boolean expression is 1 or 0 (T=1 F=0);
%mend spread_var;
*make the list. May want NOPRINT option here as it will make a lot of calls in your output window otherwise, but I like to see them as output.;
proc sql;
select cats('%spread_var(var=',substr(fmtname,1,length(Fmtname)-1),',val=',start,')')
into :spreadlist separated by ' '
from formats;
quit;
*Actually use the macro call list generated above;
data want;
set have;
&spreadlist.;
run;
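The where clause mentioned above could, for instance, restrict the generated calls to formats whose base name matches a variable that actually exists in the dataset. This is only a sketch, assuming the same formats and have datasets and the same "variable name plus F" naming convention:

```sas
proc sql noprint;
  select cats('%spread_var(var=', substr(f.fmtname, 1, length(f.fmtname) - 1),
              ',val=', f.start, ')')
    into :spreadlist separated by ' '
  from formats as f
  /* keep only formats whose name minus the trailing F is a variable in WORK.HAVE */
  where upcase(substr(f.fmtname, 1, length(f.fmtname) - 1)) in
        (select upcase(name)
         from dictionary.columns
         where libname = 'WORK' and memname = 'HAVE');
quit;
```

The upcase() calls are there because dictionary.columns stores variable names in the case they were created with.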

Frequency of a value across multiple variables?

I have a data set of patient information where I want to count how many patients (observations) have a given diagnostic code. I have 9 possible variables where it can be, in diag1, diag2... diag9. The code is V271. I cannot figure out how to do this with the "WHERE" clause or proc freq.
Any help would be appreciated!!
The basic strategy is to create a dataset that is not at patient level, but has one observation per patient-diagnostic code (so up to 9 observations per patient). Something like this:
data want;
set have;
array diag[9];
do _i = 1 to dim(diag);
if not missing(diag[_i]) then do;
diagnosis_Code = diag[_i];
output;
end;
end;
keep diagnosis_code patient_id /* plus any other variables you might want */;
run;
You could then run a proc freq on the resulting dataset. You could also change the criteria from not missing to if diag[_i] = 'V271' then do; to get only V271s in the data.
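If the goal is only to count patients (rather than individual codes), a patient-level flag may be enough. A minimal sketch, assuming diag1-diag9 are character variables in have and using the made-up names flagged and has_v271:

```sas
data flagged;
  set have;
  array diag[9] diag1-diag9;
  has_v271 = 0;
  do _i = 1 to dim(diag);
    /* a patient counts once, however many of the nine slots hold V271 */
    if diag[_i] = 'V271' then has_v271 = 1;
  end;
  drop _i;
run;

proc freq data=flagged;
  tables has_v271;
run;
```

The has_v271=1 row of the PROC FREQ output is then the number of patients with the code in any position.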
An alternate way to reshape the data that can match Joe's method is to use proc transpose like so.
proc transpose data=have out=want(keep=patient_id col1
rename=(col1=diag)
where=(diag is not missing));
by patient_id;
var diag1-diag9;
run;

SAS: Drop column in a if statement

I have a dataset called have with one entry with multiple variables that look like this:
message reference time qty price
x 101 35000 100 .
The above dataset changes on every pass of a loop, and message can also be "A". If message="X", this means removing 100 qty from the MASTER set on the row whose reference number equals the reference number in have. The price is missing (.) because it is already in the MASTER database under reference=101. The MASTER database aggregates all the available orders at each price with the quantity available. If in the next loop message="A", then the have dataset would look like this:
message reference time qty price
A 102 35010 150 500
then this means adding a new reference number to the MASTER database. In other words, appending the line to the MASTER.
I have the following code in my loop to update the quantity in my MASTER database when there is a message X:
data b.master;
modify b.master have(where=(message="X")) updatemode=nomissingcheck;
by order_reference_number;
if _iorc_ = %sysrc(_SOK) then do;
replace;
end;
else if _iorc_ = %sysrc(_DSENMR) then do;
output;
_error_ = 0;
end;
else if _iorc_ = %sysrc(_DSEMTR) then do;
_error_ = 0;
end;
else if _iorc_ = %sysrc(_DSENOM) then do;
_error_ = 0;
end;
run;
I use the replace to update the quantity. But since price=. when message is X, the above code also sets price to missing where reference=101 in the MASTER via the replace statement, which I don't want. Hence, I would prefer to delete the price column if message=X in the have dataset. But I don't want to delete the price column when message=A, since I use this code:
proc append base=MASTER data=have(where=(msg_type="A")) force;
run;
Hence, I run this code prior to my MODIFY step:
data have(drop=price_alt);
set have;
if message="X" then do;
output;
end;
else do; /*I WANT TO MAKE NO CHANGE*/
end;
run;
But it doesn't do what I want. If the message is not equal to X then I don't want to drop the column. If it is equal to X, I want to drop the column. How can I adapt the code above to make it work?
It's a bit of a strange request to be honest, such that it raises questions about whether what you're doing is the best way of doing it. However, in the spirit of answering the question...
The answer by DomPazz gives the option of splitting the data into two possible sets, but if you want code down the line to always refer to a specific data set, this creates its own complications.
You also can't, in one data step, tell SAS to output to the "same" data set where one instance has a column and one instance doesn't. So what you'd like, therefore, is for the code itself to be dynamic, so that the data step that runs is either one that drops the column or one that does not, depending on whether message=x. The answer to this, dynamic code, like many things in SAS, comes down to the creative use of macros. And it looks something like this:
/* Just making your input data set */
data have;
message='x';
time=35000;
qty=1000;
price=10.05;
price_alt=10.6;
run;
/* Writing the macro */
%macro solution;
%local id rc1 rc2;
%let id=%sysfunc(open(work.have));
%syscall set(id);
%let rc1=%sysfunc(fetchobs(&id, 1));
%let rc2=%sysfunc(close(&id));
%IF &message=x %THEN %DO;
data have(drop=price_alt);
set have;
run;
%END;
%ELSE %DO;
data have;
set have;
run;
%END;
%mend solution;
/* Running the macro */
%solution;
Try this:
data outX(drop=price_alt) outNoX;
set have;
if message = "X" then
output outX;
else
output outNoX;
run;
As @sasfrog says in the comments, a table either has a column or it does not. If you want to subset things where MESSAGE="X" then you can use something like the code above to create 2 data sets.
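Since the underlying worry is the missing price overwriting the MASTER value, it may be worth noting that this only happens because of updatemode=nomissingcheck. With the default, UPDATEMODE=MISSINGCHECK, missing values in the transaction dataset do not replace values in the master. A hedged sketch with a hypothetical one-row master in WORK, using reference as the BY variable (the question's own code uses order_reference_number):

```sas
/* Hypothetical master and transaction rows, using the question's column names */
data master;
  reference=101; qty=500; price=250;
run;

data have;
  message='X'; reference=101; time=35000; qty=100; price=.;
run;

data master;
  /* MISSINGCHECK is the default: the missing PRICE in HAVE should not
     overwrite the PRICE already stored in MASTER */
  modify master have(where=(message='X')) updatemode=missingcheck;
  by reference;
  if _iorc_ = %sysrc(_SOK) then replace;
  else _error_ = 0;   /* no matching reference: leave MASTER unchanged */
run;
```

With this approach the price column can stay in have, so the later PROC APPEND for message="A" rows is unaffected.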

Drop a range of variables in SAS

I currently have a dataset with 200 variables. From those variables, I created 100 new variables. Now I would like to drop the original 200 variables. How can I do that?
Slightly better would be, how I can drop variables 3-200 in the new dataset.
Sorry if I was vague in my question, but basically I figured out I need to use --.
If my first variable is called first and my last variable is called last, I can drop all the variables in between with (drop= first--last);
Thanks for all the responses.
As with most SAS tasks, there are several alternatives. The easiest and safest way to drop variables from a SAS data set is with PROC SQL. Just list the variables by name, separated by a comma:
proc sql;
alter table MYSASDATA
drop name, age, address;
quit;
Altering the table with PROC SQL removes the variables from the data set in place.
Another technique is to recreate the data set using a DROP option:
data have;
set have(drop=name age address);
run;
And yet another way is using a DROP statement:
data have;
set have;
drop name age address;
run;
Lots of options - some 'safer', some less safe but easier to code. Let's imagine you have a dataset with variables ID, PLNT, and x1-x200 to start with.
data have;
id=0;
plnt=0;
array x[200];
do _t = 1 to dim(x);
x[_t]=0;
end;
run;
data want;
set have;
*... create new 100 variables ... ;
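/* Alternatives: uncomment ONE option to drop in this step (and then skip the PROC DATASETS step below) */
/* option 1: works when x1-x200 are numerically consecutive */
/* drop x1-x200; */
/* option 2: works when they are physically in order on the dataset - only the first and last matter */
/* drop x1--x200; */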
run;
*Or, do it this way. This would also work with SQL ALTER TABLE. This is
the safest way to do it.;
proc sql;
select name into :droplist separated by ' ' from dictionary.columns
where libname='WORK' and memname='HAVE' and upcase(name) not in ('ID','PLNT');
quit;
proc datasets lib=work;
modify want;
drop &droplist.;
quit;
If all of the variables you want to drop are named so they all start the same (like old_var_1, old_var_2, ..., old_var_n), you could do this (note the colon in drop option):
data have;
set have(drop= old_var:);
run;
data want;
set have;
drop VAR1--VARx;
run;
Would love to know if you can do this by position.
Definitely works with variable names separated by double dash (--).
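On dropping by position: one way that might work is to build the list from dictionary.columns, which records each variable's position in varnum. A sketch, assuming the dataset is WORK.HAVE and that positions 3 to 200 are the ones to remove (droplist is an arbitrary macro variable name):

```sas
proc sql noprint;
  /* collect the names of the variables in positions 3 through 200 */
  select name
    into :droplist separated by ' '
  from dictionary.columns
  where libname = 'WORK' and memname = 'HAVE'
    and varnum between 3 and 200
  order by varnum;
quit;

data want;
  set have(drop=&droplist.);
run;
```

Unlike the double-dash list, this does not require knowing the names of the first and last variables in the range.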
I have some macros that would allow this here
You could run that whole set of macros, or just run list_vars(), is_blank(), num_words, find_word, remove_word, remove_words , nth_word().
Using these it would be:
%let keep_vars = keep_this and_this also_this;
%let drop_vars = %list_vars(old_dataset);
%let drop_vars = %remove_words(&drop_vars , &keep_vars);
data new_dataset (drop = &drop_vars );
set old_dataset;
/*stuff happens*/
run;
This will keep the three variables keep_this and_this also_this but drop everything else in the old dataset.