Merging but keeping all observations? - sas

I have three data sets of inpatient, outpatient, and professional claims. I want to find the number of unique people who have a claim related to tobacco use (1=yes tobacco, 0=tobacco) in ANY of these three data sets.
Therefore, the data sets pretty much are all:
data inpatient;
input Patient_ID Tobacco;
datalines;
1 0
2 1
3 1
4 1
5 0
;
run;
I am trying to merge the inpatient, outpatient, and professional so that I am left with those patient ids that have a tobacco claim in any of the three data sets using:
data tobaccoall;
merge inpatient outpatient professional;
by rid;
run;
However, it is overwriting some of the 1's with 0's in the new data set. How do I better merge the data sets to find if the patient has a claim in ANY of the datasets?

When you merge data sets in SAS that share variable names, the values from the data set listed on the right in the merge statement overwrite the values from data set to its left. In order to keep each value, you'd want to rename the variables before merging. You can do this in the merge statement by adding a rename= option after each data set.
If you want a single variable that represents whether a tobacco claim exists in any of the three variables, you could create a new variable using the max function to combine the three different values.
data tobaccoall;
merge inpatient (rename=(tobacco=tobacco_in))
outpatient (rename=(tobacco=tobacco_out))
professional (rename=(tobacco=tobacco_pro));
by rid;
tobacco_any = max(tobacco_in,tobacco_out,tobacco_pro,0);
run;

If your data were 1=has .=doesn't have (missing), then you could use the UPDATE statement, which mostly works like Merge except it wouldn't overwrite nonmissing data with missing.
For example:
data inpatient;
input Patient_ID Tobacco;
datalines;
1 .
2 1
3 1
4 1
5 .
;
run;
data outpatient;
input Patient_ID Tobacco;
datalines;
1 1
2 1
3 .
4 .
5 .
;
run;
data want;
update inpatient outpatient;
by patient_id;
run;

Related

Collapsing a large dataset while conditionally preserving some missing values

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.
In other words, I need to turn HAVE:
id names
1
1
1 Matt, Lisa, Dan
1
2
2
2
3
3
3 Emily, Nate
3
4
4
4 Bob
5
into WANT:
id names
1 Matt, Lisa, Dan
2
3 Emily, Nate
4 Bob
5
I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):
data WANT_pre;
set HAVE;
if names = " " then delete;
run;
data WANT;
merge KEY
WANT_pre;
by id;
run;
This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?
You can use SQL here easily, MAX() applies to character variables within SQL.
proc sql;
create table want as
select id, max(names) as names
from have
group by ID;
quit;
Another option is to use an UPDATE statement instead.
data want;
update have (obs=0) have;
by ID;
run;
This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:
data want;
do until(last.id);
set have;
by id;
length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/
t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;
proc summary data=have nway missing;
class id;
output out=want(drop=_:) idgroup(max(names) out(names)=);
run;
Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.
data want;
update have(obs=0) have ;
by id;
run;

How to convert a SAS data set to a data step

How can I convert my SAS data set, into a data set that I can easily paste into the forum or hand over to someone to replicate my data. Ideally, I'd also like to be able to control the amount of records that are included.
Ie I have sashelp.class in the SASHELP library, but I want to provide it here so others can use it as the starting point for my question.
To do this, you can use a macro written by Mark Jordan at SAS, the code is stored in GitHub as well.
You need to provide the data set name, including library and the number of observations you want to output. It takes them in order. The code will then appear in your SAS log.
*data set you want to create demo data for;
%let dataSetName = sashelp.Class;
*number of observations you want to keep;
%let obsKeep = 5;
******************************************************
DO NOT CHANGE ANYTHING BELOW THIS LINE
******************************************************;
%let source_path = https://gist.githubusercontent.com/statgeek/bcc55940dd825a13b9c8ca40a904cba9/raw/865d2cf18f5150b8e887218dde0fc3951d0ff15b/data2datastep.sas;
filename reprex url "&source_path";
%include reprex;
filename reprex;
option linesize=max;
%data2datastep(dsn=&dataSetName, obs=&obsKeep);
This may not work if you do not have access to the github page, in that case, you can manually navigate to the page (same link) and copy/paste it into SAS. Then run the program and run only the last step, the %data2datastep(dsn=, obs=);
This topic came up recently on SAS Communities and I created a little more robust macro than the one Reeza linked. You can see it in Github: ds2post.sas
* Pull macro definition from GITHUB ;
filename ds2post url
'https://raw.githubusercontent.com/sasutils/macros/master/ds2post.sas'
;
%include ds2post ;
For example if you wanted to share the first 5 observations of SASHELP.CARS you would run this macro call:
%ds2post(sashelp.cars,obs=5)
Which would generate this code to the SAS log:
data work.cars (label='2004 Car Data');
infile datalines dsd dlm='|' truncover;
input Make :$13. Model :$40. Type :$8. Origin :$6. DriveTrain :$5.
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway
Weight Wheelbase Length
;
format MSRP dollar8. Invoice dollar8. ;
label EngineSize='Engine Size (L)' MPG_City='MPG (City)'
MPG_Highway='MPG (Highway)' Weight='Weight (LBS)'
Wheelbase='Wheelbase (IN)' Length='Length (IN)'
;
datalines4;
Acura|MDX|SUV|Asia|All|36945|33337|3.5|6|265|17|23|4451|106|189
Acura|RSX Type S 2dr|Sedan|Asia|Front|23820|21761|2|4|200|24|31|2778|101|172
Acura|TSX 4dr|Sedan|Asia|Front|26990|24647|2.4|4|200|22|29|3230|105|183
Acura|TL 4dr|Sedan|Asia|Front|33195|30299|3.2|6|270|20|28|3575|108|186
Acura|3.5 RL 4dr|Sedan|Asia|Front|43755|39014|3.5|6|225|18|24|3880|115|197
;;;;
Try this little test to compare the two macros.
First make a sample dataset with a couple of issues.
data testit;
set sashelp.class (obs=5);
if _n_=1 then name='Le Bron';
if _n_=2 then age=.;
if _n_=3 then wt=.;
if _n_=4 then name='12;34';
run;
Then run both macros to dump code to the SAS log.
%ds2post(testit);
%data2datastep(dsn=testit,obs=20);
Copy the code from the log. Changing the name in the DATA statements to not overwrite the original dataset or each other. Run them and compare the result to the original.
proc compare data=testit compare=testit1; run;
proc compare data=testit compare=testit2; run;
Result using %DS2POST:
The COMPARE Procedure
Comparison of WORK.TESTIT with WORK.TESTIT1
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TESTIT 02NOV18:17:09:40 02NOV18:17:09:40 6 5
WORK.TESTIT1 02NOV18:17:10:29 02NOV18:17:10:29 6 5
Variables Summary
Number of Variables in Common: 6.
Observation Summary
Observation Base Compare
First Obs 1 1
Last Obs 5 5
Number of Observations in Common: 5.
Total Number of Observations Read from WORK.TESTIT: 5.
Total Number of Observations Read from WORK.TESTIT1: 5.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 5.
Summary of results using %Data2DataStep:
Comparison of WORK.TESTIT with WORK.TESTIT2
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TESTIT 02NOV18:17:09:40 02NOV18:17:09:40 6 5
WORK.TESTIT2 02NOV18:17:10:29 02NOV18:17:10:29 6 3
Variables Summary
Number of Variables in Common: 6.
Observation Summary
Observation Base Compare
First Obs 1 1
First Unequal 1 1
Last Unequal 3 3
Last Match 3 3
Last Obs 5 .
Number of Observations in Common: 3.
Number of Observations in WORK.TESTIT but not in WORK.TESTIT2: 2.
Total Number of Observations Read from WORK.TESTIT: 5.
Total Number of Observations Read from WORK.TESTIT2: 3.
Number of Observations with Some Compared Variables Unequal: 3.
Number of Observations with All Compared Variables Equal: 0.
Variable Values Summary
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 5.
Number of Variables with Missing Value Differences: 4.
Total Number of Values which Compare Unequal: 12.
Maximum Difference: 0.
Variables with Unequal Values
Variable Type Len Ndif MaxDif MissDif
Name CHAR 8 1 0
Sex CHAR 1 3 3
Age NUM 8 2 0 2
Height NUM 8 3 0 3
Weight NUM 8 3 0 3
Note that I am sure there are values that will cause trouble for my macro also. But hopefully they are caused by data that is less likely to occur than spaces or semi-colons.

SAS: Replacing missing value with average of nearest neighbors

I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using the retain function, but I can only figure out how to retain last non-missing value.
I thinks what you are looking for might be more like interpolation. While this is not mean of two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called proc expand. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /*second is column name */
run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, allowing it to use RETAIN to carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
What it does is uses two set statements, one that gets the main (and previous) amounts, and one that sets until the next amount is found. Here I use the sequence of id variables to guarantee it will be the right record, but you could write this differently if needed (keeping track of what loop you're on) if the id variables aren't sequential or in an order of any sort.
I use the first.amount check to make sure we don't try to execute the second set statement more than we should (which would terminate early).
You need to do two things differently if you want first/last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume last_amount is missing, meaning the last row just gets the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose, I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
set have;
by amount notsorted; *so we can tell if we have consecutive missings;
retain prev_amount; *next_amount is auto-retained;
if not missing(amount ) then prev_amount=amount;
else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
else if first.amount then do;
do until ((next_id > id and not missing(next_amount)) or (eof));
set have(rename=(id=next_id amount=next_amount)) end=eof;
end;
amount = mean(prev_amount,next_amount);
end;
else amount = mean(prev_amount,next_amount);
run;

SAS Create missing numeric ids into individual observations.

I need to outline a series of ID numbers that are currently available based on a data set in which ID's are already assigned (if the ID is on the file then its in use...if its not on file, then its available for use).
The issue is I don't know how to create a data set that displays ID numbers which are between two ID #'s that are currently on file - Lets say I have the data set below -
data have;
input id;
datalines;
1
5
6
10
;
run;
What I need is for the new data set to be in the following structure of this data set -
data need;
input id;
datalines;
2
3
4
7
8
9
;
run;
I am not sure how I would produce the observations of ID #'s 2, 3 and 4 as these would be scenarios of "available ID's"...
My initial attempt was going to be subtracting the ID values from one observation to the next in order to find the difference, but I am stuck from there on how to use that value and add 1 to the observation before it...and it all became quite messy from there.
Any assistance would be appreciated.
As long as your set of possible IDs is know, this can be done by putting them all in a file and excluding the used ones.
e.g.
data id_set;
do id = 1 to 10;
output;
end;
run;
proc sql;
create table need as
select id
from id_set
where id not in (select id from have)
;
quit;
Create a temporary variable that stores the previous id, then just loop between that and the current id, outputting each iteration.
data have;
input id;
datalines;
1
5
6
10
;
run;
data need (rename=(newid=id));
set have;
retain _lastid; /* keep previous id value */
if _n_>1 then do newid=_lastid+1 to id-1; /* fill in numbers between previous and current ids */
output;
end;
_lastid=id;
keep newid;
run;
Building on Jetzler's answer: Another option is to use the MERGE statement. In this case:
note: before merge, sort both datasets by id (if not already sorted);
data want;
merge id_set (in=a)
have (in=b); /*specify datasets and vars to allow the conditional below*/
by id; /*merge key variable*/
if a and not b; /*on output keep only records in ID_SET that are not in HAVE*/
run;

Modifying data in SAS: copying part of the value of a cell, adding missing data and labeling it

I have three different questions about modifying a dataset in SAS. My data contains: the day and the specific number belonging to the tag which was registred by an antenna on a specific day.
I have three separate questions:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add numbers within this range which have not been registred on a specific day. So, if 160-280 is not registered for 23-May and 40-190 for 24-May to add these non-registered numbers only for that specific day? (The non registered numbers are much more scattered and for a dataset encompassing a few weeks to much to do by hand).
2) Furthermore, I want to make a new variable saying a tag has been registered (1) or not (0). Would it work to make this variable and set it to 1, then add the missing variables and (assuming the new variable is not set for the new number) set the missing values to 0.
3) the last question would be in regard to the format of the registered numbers which is along the line of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers I could make a new variable after the data has been sorted by date and the original transponder code but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here, I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
proc sql;
create table dates as
select distinct date, antenna
from registered;
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data you can query that to find if a bird has been through. What would you flag be doing - finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck