Repeat the value for duplicate records in SAS - sas

I have a dataset with duplicate Id’s, but the status message is repeated only for a couple of id’s. Example below
Id Status
101 Single entry and multiple items
101 Multiple items
101 Single items
101
101
501
502 Multiple
502
I would like to have the ‘Status’ messages repeated for the other two Id’s also (it can be either of the messages)
I thought of creating two datasets with and without ‘Status’ and then merge them on ‘Id’
Is there any simple solution to this?

I cannot understand where this would be a reasonable thing to do, but it is pretty easy. You could make a new variable and use RETAIN.
data want ;
set have ;
by id ;
if first.id then new=status ;
if not missing(status) then new=status ;
retain new ;
rename new=status status=old_status;
run;
Or you could do it using merge.
data want ;
merge have(drop=status) have(keep=id status where=(not missing(status)));
by id;
run;

Related

enter column in a dataset to an array

I have 33 different datasets with one column and all share the same column name/variable name;
net_worth
I want to load the values into arrays and use them in a datastep. But the array that I use should depend on the the by groups in the datastep (country by city). There are total of 33 datasets and 33 groups (country by city). each dataset correspond to exactly one by group.
here is an example what the by groups look like in the dataset: customers
UK 105 (other fields)
UK 102 (other fields)
US 291 (other fields)
US 292 (other fields)
Could I get some advice on how to go about and enter the columns in arrays and then use them in a datastep. or do you suggest to do it in another way?
%let var1 = uk105
%let var2 = uk102
.....
&let var33 = jk12
data want;
set customers;
by country city;
if _n_ = 1 then do;
*set datasets and create and populate arrays*;
* use array values in calculations with fields from dataset customers, depending on which by group. if the by group is uk and city is 105 then i need to use the created array corresponding to that by group;
It is a little hard to understand what you want.
It sounds like you have one dataset name CUSTOMERS that has all of the main variables and a bunch of single variable datasets that the values of NET_WORTH for a lot of different things (Countries?).
Assuming that the observations in all of the datasets are in the same order then I think you are asking for how to generate a data step like this:
data want;
set customers;
set uk105 (rename=(net_worth=uk105));
set uk103 (rename=(net_worth=uk103));
....
run;
Which might just be easiest to do using a data step.
filename code temp;
data _null_;
input name $32. ;
file code ;
put ' set ' name '(rename=(net_worth=' name '));' ;
cards;
uk105
uk102
;;;;
data want;
set customers;
%include code / source2;
run;

updating last not blank record SAS

I have the following dataset.
ID var1 var2 var3
1 100 200
1 150 300
2 120
2 100 150 200
3 200 150
3 250 300
I would like to have a new dataset with only the last not blank record for each group of variables.
id var1 var2 var3
1 150 200 300
2 100 150 200
3 250 300 150
last. select the last reord, but i need to selet the last not null record
Looks like you want the last non missing value for each non-key variable. So you can let the UPDATE statement do the work for you. Normally for update operation you apply transactions to a master dataset. But for your application you can use OBS=0 dataset option to make your current dataset work as both the master and the transactions.
data want ;
update have(obs=0) have ;
by id;
run;
Riccardo:
There are many ways to select, within each group, the last non-missing value of each column. Here are three ways. I would not say one is the best way, each approach has it's merits depending on the specific data set, coder comfort and long term maintainability.
Way 1 - UPDATE statement
Perhaps the simplest of the coding approaches goes like this:
make a data set that has one row per id and the same columns as the original data.
use the UPDATE statement to replace each like named variable with a non-missing value.
Example:
data want_base_table(label="One row per id, original columns");
set have;
by id;
if first.id;
run;
* use have as a transaction data set in the update statement;
data want_by_update;
update want_base_table have;
by id;
run;
Way 2 - DOW loop
Others will involve arrays and first. and last. flag variables of the BY group. This example shows a DOW loop that tracks the non-missing value and then uses them for the output of each ID:
data want_dow;
do until (last.id);
set have;
by id;
array myvar var1-var3 ;
array myhas has1-has3 ;
do _i = 1 to dim(myvar);
if not missing (myvar(_i)) then
myhas(_i) = myvar(_i);
end;
end;
do _i = 1 to dim(myhas);
myvar(_i) = myhas(_i);
end;
output;
drop _i has1-has3;
run;
A loop is most often called a DOW loop when there is a SET statement inside the DO; END; block and the loop termination is triggered by the last. flag variable. A similar non DOW approach (not shown) would use the implicit loop and first. to initialize the tracking array and last. for copying the values (tracked within group) into the columns for output.
Way 3 - Merging column slices
data want_by_column_slices;
merge
have (keep=id var1 where=(var1 ne .))
have (keep=id var2 where=(var2 ne .))
have (keep=id var3 where=(var3 ne .))
;
by id;
if last.id;
run;

SAS Updating records sequentially

I have hundreds of thousands of IDs in a large dataset.
Some records have the same ID but different data points. Some of these IDs need to be merged into a single ID. People registered for a system more than once should be just one person in the database.
I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because they registered three times. I had a macro that essentially was the following set of if-then statements:
if ID='1111111' then do; ID='2222222'; end;
if ID='2222222' then do; ID='3333333'; end;
I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run and the list just gets longer. Is there a faster method of updating these IDs?
Thanks
EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.
data one;
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;
%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;
data two; set one; %ID_Change; run;
A hash table will greatly speed up the process. Hash tables are one of the little-used, but highly effective, tools in SAS. They're a bit bizarre since the syntax is very different from standard SAS programming. For now, think of it as a way to merge data together in-memory (a big reason as to why it's so fast).
First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column, and New_ID as your data column.
dataset: translate
ID New_ID
111111 222222
222222 333333
In a hash table, you need to consider two things:
The Key column(s)
The Data column(s)
The Data column is what will be replacing observations matched by the Key column. In other words, New_ID will be populated every time there's a match for ID.
Next, you'll want to do your hash merge. This is performed in the data step.
data want;
set have;
/* Only declare the hash object on the first iteration.
Otherwise it will do this every record. */
if(_N_ = 1) then do;
declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
id_h.defineKey('ID'); *Define key for matching;
id_h.defineData('New_ID'); *The new ID after matching;
id_h.defineDone(); *Done declaring this hash object;
call missing(New_ID); *Prevents a warning in the log;
end;
/* If a customer has changed multiple times, keep iterating until
there is no longer a match between tables */
do while(id_h.Find() = 0);
_loop_count+1; *Tells us how long we've been in the loop;
/* Just in case the while loop gets to 500 iterations, then
there's likely a problem and you don't want the data step to get stuck */
if(_loop_count > 500) then do;
put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
leave;
end;
/* If the ID of the hash table matches the ID of the dataset, then
we'll set ID to be New_ID from the hash object;
ID = New_ID;
end;
_loop_count = 0;
drop _loop_count;
run;
This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.
Use PROC SQL or a MERGE step against your separate file (after you have created a separate dataset from it, using infile or proc import) to append this unique id to all records. If your separate file contains only the duplicates, you will need to create a dummy unique id for the non-duplicates.
Do PROC SORT with BY unique id and timestamp of signup.
Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, do if first.timestamp then output; (or last, etc.)
Or you could do it all in one PROC SQL using a left join to the separate file, a coalesce step to return a dummy unique id if it is not contained in the separate file, a group by unique id, and a having max(timestamp) (or min). You can also coalesce any other variables you might want to try to preserve across signups -- for example, if the first signup contained a phone number and successive signups were missing that data point.
Without a reproducible example it's hard to be more specific.

Rename nonsequential variable names to sequential names in sas

I am working with survey data where the variable names in our database are descriptive, and not sequentially numbered. They are sequential in the database (moving from left to right). I would like to work in my programs with numbered variables, and I have been unsuccessful in trying to rename them programmatically without having to write out every change by hand (there are 87 total variables).
I have tried to use array, but that has not worked since they are not named sequentially nor do they have a common structure (no common prefix or suffix).
Example data is below:
data svy;
input id relationship outburst checkwork goodideas ;
cards;
101 3 4 5 6
102 4 5 6 6
103 1 1 8 1
104 2 3 2 4
;
run;
***** does not work ;
data svy_1; set svy;
rename relationship--goodideas = var01--var04;
run;
quit;
The above code returns the following error in the log:
ERROR: Missing numeric suffix on a numbered variable list (relationship-goodideas).
I would like to rename the variables to something like: var01, var02, etc...
Any help is greatly appreciated.
A few things:
Your data step #2 isn't right - it doesn't have a set statement. Also, it doesn't require 'quit' - quit is only for certain PROCs that generally are 'programming environments', such as PROC SQL, PROC FORMAT, PROC DATASETS. It doesn't do any harm but it looks odd :)
Sequential-in-the-dataset variable lists are double dash. So, you could trivially create an array with these:
array myvars relationship--goodideas;
So if that's good enough for you (no rename), then, go for it. If you really want to rename them (a bit of a bad idea IMO since it takes away some meaning of the variable name, making code harder to read, though I understand the reasoning why you'd want to), you can't use this unfortunately - while it's correct, the RENAME statement does not support it.
82 ***** does not work ;
83 data svy_1;
84 rename relationship--goodideas = var01-var04;
------------
47
ERROR 47-185: Given form of variable list is not supported by RENAME. Statement is ignored.
85 run;
You cannot use an array to perform rename statements, unfortunately; so you'll have to do something else. Here's one answer.
proc contents data=svy out=svy_vars(keep=name varnum) noprint;
run;
proc sort data=svy_vars;
by varnum;
run;
data for_rename;
set svy_vars;
if name in ('relationship' 'outburst' 'checkwork' 'goodideas') then do;
namectr+1;
new_name=cats(name,'=','var',put(namectr,z2.));
output;
end;
run;
proc sql;
select new_name into :renlist separated by ' ' from for_rename;
quit;
proc datasets nolist;
modify svy;
rename &renlist;
quit;
You can do something similar in a shorter fashion using PROC SQL and the DICTIONARY.COLUMNS table, or a data step and SASHELP.VCOLUMN, but the proc contents method is somewhat more transparent as to what's happening. If you have more than four variables, you may want to change that IN statement into a negative statement (if name not in (list of things to not change)) if that's easier, or even use the VARNUM variable itself to determine which variables you want to change (if varnum in (2:5) would work there).
A colleague came up with the best approach:
***** does work ;
data svy_1;
set svy;
array old { 4 } relationship--goodideas;
array var { 4 } ;
do i = 1 to 4;
var[i] = old[i];
end;
drop i;
run;

Dynamic Where in SAS - Conditional Operation based on another set

To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for ever line of "transaction".
(The reason you got the error was because you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns already existing within the data set it's sub-setting. They keep two options because there where is faster when it's usable.)
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest below solution (would like #Jeff's SQL solution or even a hash better). But just for playing with data step logic, I think below approach would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
1003 data _null_;
1004 set transaction;
1005 by change_week;
1006
1007 do until(last.week and _found);
1008 set master;
1009 by week;
1010
1011 if week=change_week then do;
1012 sum = sum(value, sum);
1013 _found=1;
1014 end;
1015 end;
1016
1017 *file print;
1018 put week= sum= ;
1019 run;
week=1 sum=60
week=3 sum=75