Collapsing a large dataset while conditionally preserving some missing values

Collapsing a large dataset while conditionally preserving some missing values - sas

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.
In other words, I need to turn HAVE:
id names
1
1
1 Matt, Lisa, Dan
1
2
2
2
3
3
3 Emily, Nate
3
4
4
4 Bob
5
into WANT:
id names
1 Matt, Lisa, Dan
2
3 Emily, Nate
4 Bob
5
I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):
data WANT_pre;
set HAVE;
if names = " " then delete;
run;
data WANT;
merge KEY
WANT_pre;
by id;
run;
This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?

You can use SQL here easily, MAX() applies to character variables within SQL.
proc sql;
create table want as
select id, max(names) as names
from have
group by ID;
quit;
Another option is to use an UPDATE statement instead.
data want;
update have (obs=0) have;
by ID;
run;

This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:
data want;
do until(last.id);
set have;
by id;
length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/
t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;

proc summary data=have nway missing;
class id;
output out=want(drop=_:) idgroup(max(names) out(names)=);
run;

Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.
data want;
update have(obs=0) have ;
by id;
run;

Related

Delete all observations, which are doubled on some variable

Suppose i have a table:
Name Age
Bob 4
Pop 5
Yoy 6
Bob 5
I want to delete all names, which are not unique in the table:
Name Age
Pop 5
Yoy 6
ATM, my solution is to make a new table with counts of unique names:
Name Count
Bob 2
Pop 1
Yoy 1
And then, leave all, which's Count > 1
I believe there are much more beautiful solutions.

If I understand you correctly there are two ways to do it:
The SQL Procedure
In SAS you may not need to use a summarisation function such as MIN() as I have here, but when there is only one of name then min(age) = age anyway, and when migrating this to another RDBMS (e.g. Oracle, SQL Server) it may be required:
proc sql;
create table want as
select name, min(age) as age
from have
group by name
having count(*) = 1;
quit;
Data Step
Requires the data to be pre-sorted:
proc sort data=have out=have_stg;
by name;
run;
When doing SAS data-step by group processing, the first. (first-dot) and last. (last-dot) variables are generated which denote whether the current observation is the first and/or last in the by-group. Using SAS conditional logic one can simply test if first.name = 1 and last.name = 1. Reducing this using logical shorthand becomes:
data want;
set have_stg;
by name;
if first.name and last.name;
/* Equivalent to:*/
*if first.name = 1 and last.name = 1;
run;
I left both versions in the code above, use whichever version you find more readable.

You can use proc sort with the nouniquekey option. Then use uniqueout= to output the unique values and out= to output the duplicates (the out= statement is necessary if you don't wan't to overwrite your original dataset).
proc sort data = have nouniquekey uniqueout = unique out = dups;
by name;
run;

SAS Create missing numeric ids into individual observations.

I need to outline a series of ID numbers that are currently available based on a data set in which ID's are already assigned (if the ID is on the file then its in use...if its not on file, then its available for use).
The issue is I don't know how to create a data set that displays ID numbers which are between two ID #'s that are currently on file - Lets say I have the data set below -
data have;
input id;
datalines;
1
5
6
10
;
run;
What I need is for the new data set to be in the following structure of this data set -
data need;
input id;
datalines;
2
3
4
7
8
9
;
run;
I am not sure how I would produce the observations of ID #'s 2, 3 and 4 as these would be scenarios of "available ID's"...
My initial attempt was going to be subtracting the ID values from one observation to the next in order to find the difference, but I am stuck from there on how to use that value and add 1 to the observation before it...and it all became quite messy from there.
Any assistance would be appreciated.

As long as your set of possible IDs is know, this can be done by putting them all in a file and excluding the used ones.
e.g.
data id_set;
do id = 1 to 10;
output;
end;
run;
proc sql;
create table need as
select id
from id_set
where id not in (select id from have)
;
quit;

Create a temporary variable that stores the previous id, then just loop between that and the current id, outputting each iteration.
data have;
input id;
datalines;
1
5
6
10
;
run;
data need (rename=(newid=id));
set have;
retain _lastid; /* keep previous id value */
if _n_>1 then do newid=_lastid+1 to id-1; /* fill in numbers between previous and current ids */
output;
end;
_lastid=id;
keep newid;
run;

Building on Jetzler's answer: Another option is to use the MERGE statement. In this case:
note: before merge, sort both datasets by id (if not already sorted);
data want;
merge id_set (in=a)
have (in=b); /*specify datasets and vars to allow the conditional below*/
by id; /*merge key variable*/
if a and not b; /*on output keep only records in ID_SET that are not in HAVE*/
run;

Merging but keeping all observations?

I have three data sets of inpatient, outpatient, and professional claims. I want to find the number of unique people who have a claim related to tobacco use (1=yes tobacco, 0=tobacco) in ANY of these three data sets.
Therefore, the data sets pretty much are all:
data inpatient;
input Patient_ID Tobacco;
datalines;
1 0
2 1
3 1
4 1
5 0
;
run;
I am trying to merge the inpatient, outpatient, and professional so that I am left with those patient ids that have a tobacco claim in any of the three data sets using:
data tobaccoall;
merge inpatient outpatient professional;
by rid;
run;
However, it is overwriting some of the 1's with 0's in the new data set. How do I better merge the data sets to find if the patient has a claim in ANY of the datasets?

When you merge data sets in SAS that share variable names, the values from the data set listed on the right in the merge statement overwrite the values from data set to its left. In order to keep each value, you'd want to rename the variables before merging. You can do this in the merge statement by adding a rename= option after each data set.
If you want a single variable that represents whether a tobacco claim exists in any of the three variables, you could create a new variable using the max function to combine the three different values.
data tobaccoall;
merge inpatient (rename=(tobacco=tobacco_in))
outpatient (rename=(tobacco=tobacco_out))
professional (rename=(tobacco=tobacco_pro));
by rid;
tobacco_any = max(tobacco_in,tobacco_out,tobacco_pro,0);
run;

If your data were 1=has .=doesn't have (missing), then you could use the UPDATE statement, which mostly works like Merge except it wouldn't overwrite nonmissing data with missing.
For example:
data inpatient;
input Patient_ID Tobacco;
datalines;
1 .
2 1
3 1
4 1
5 .
;
run;
data outpatient;
input Patient_ID Tobacco;
datalines;
1 1
2 1
3 .
4 .
5 .
;
run;
data want;
update inpatient outpatient;
by patient_id;
run;

How to create a new variable in SAS by extracting part of the value of an existing numeric variable?

I have two datasets in SAS that I would like to merge, but they have no common variables. One dataset has a "subject_id" variable, while the other has a "mom_subject_id" variable. Both of these variables are 9-digit codes that have just 3 digits in the middle of the code with common meaning, and that's what I need to match the two datasets on when I merge them.
What I'd like to do is create a new common variable in each dataset that is just the 3 digits from within the subject ID. Those 3 digits will always be in the same location within the 9-digit subject ID, so I'm wondering if there's a way to extract those 3 digits from the variable to make a new variable.
Thanks!

SQL(using sample data from Data Step code):
proc sql;
create table want2 as
select a.subject_id, a.other, b.mom_subject_id, b.misc
from have1 a JOIN have2 b
on(substr(a.subject_id,4,3)=substr(b.mom_subject_id,4,3));
quit;
Data Step:
data have1;
length subject_id $9;
input subject_id $ other $;
datalines;
abc001def other1
abc002def other2
abc003def other3
abc004def other4
abc005def other5
;
data have2;
length mom_subject_id $9;
input mom_subject_id $ misc $;
datalines;
ghi001jkl misc1
ghi003jkl misc3
ghi005jkl misc5
;
data have1;
length id $3;
set have1;
id=substr(subject_id,4,3);
run;
data have2;
length id $3;
set have2;
id=substr(mom_subject_id,4,3);
run;
Proc sort data=have1;
by id;
run;
Proc sort data=have2;
by id;
run;
data work.want;
merge have1(in=a) have2(in=b);
by id;
run;

an alternative would be to use
proc sql
and then use a join and the substr() just as explained above, if you are comfortable with sql

Assuming that your "subject_id" variable is a number then the substr function wont work as sas will try convert the number to a string. But by default it pads some paces on the left of the number.
You can use the modulus function mod(input, base) which returns the remainder when input is divided by base.
/*First get rid of the last 3 digits*/
temp_var = floor( subject_id / 1000);
/* then get the next three digits that we want*/
id = mod(temp_var ,1000);
Or in one line:
id = mod(floor(subject_id / 1000), 1000);
Then you can continue with sorting the new data sets by id and then merging.

SAS and Date operations

I've tried googling and I haven't turned up any luck to my current problem. Perhaps someone can help?
I have a dataset with the following variables:
ID, AccidentDate
It's in long format, and each participant can have more than 1 accident, with participants having not necessarily an equal number of accidents. Here is a sample:
Code:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
What I need to do is count the number of days between each individuals First and Last recorded accident date. I've been playing around with first.byvariable and last.byvariable commands, but I'm just not making any progress. Any tips? or Any links to a source?
Thank you,
Also. I posted this originally over at Talkstats.com (cross-posting etiquette)

Not sure what you mean by in long format
long format should be like this
id accident date
1 1 1JAN2001
1 2 1JAN2002
2 1 1JAN2001
2 2 1JAN2003
Then you can try proc sql like this
Proc Sql;
select id, max(date)-min(date) from table;
group by id;
run;

By long format I think you mean it is a "stacked" dataset with each person having multiple observations (instead of one row per person with multiple columns). In your situation, it is probably the correct way to have the data stored.
To do it with data steps, I think you are on the right track with first. and last.
I would do it like this:
proc sort data=accidents;
by id date;
run;
data accidents; set accidents;
by id accident; *this is important-it makes first. and last. available for use;
retain first last;
if first.date then first=date;
if last.date then last=date;
run;
Now you have a dataset with ID, Date, Date of First Accident, Date of Last Accident
You could calculate the time between with
data accidents; set accidents;
timebetween = last-first;
run;
You can't do this directly in the same data step since the "last" variable won't be accurate until it has parsed the last line and as such the data will be wrong for anything but the last accident observation.

Assuming the data looks like:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
You have the right idea. Retain the first accident date in order to have access to both the first and last dates. Then calculate the difference.
proc sort data=accidents;
by id accidentdate
run;
data accidents;
set accidents;
by id;
retain first_accidentdate;
if first.id then first_accidentdate = accidentdate;
if last.id then do;
daysbetween = date - first_accidentdate
output;
end;
run;

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Collapsing a large dataset while conditionally preserving some missing values - sas

You can use SQL here easily, MAX() applies to character variables within SQL. proc sql; create table want as select id, max(names) as names from have group by ID; quit; Another option is to use an UPDATE statement instead. data want; update have (obs=0) have; by ID; run;

proc summary data=have nway missing; class id; output out=want(drop=_:) idgroup(max(names) out(names)=); run;

Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both. data want; update have(obs=0) have ; by id; run;

Related

Delete all observations, which are doubled on some variable

SAS Create missing numeric ids into individual observations.

Merging but keeping all observations?

How to create a new variable in SAS by extracting part of the value of an existing numeric variable?

SAS and Date operations

Categories

Resources