I have a large SAS dataset that includes patient ID and race. This is a longitudinal dataset where each observation represents a visit to the hospital. There are many observations that are missing race information, but other visits for that same patient ID have race indicated. I used the code below to resolve any observations for a given patient ID that was missing race, as long as another visit had that information:
data need;
do until (last.id);
set have;
by id;
if not missing(race) then newrace=race;
if missing(race) then race=newrace;
output;
end;
run;
My question is: how do I record when a patient has multiple races indicated across observations? And how do I make one race dominant over the others? For example, Patient 342 has 3 obs with race=2 and 2 obs with race=4; we want any indication of race=4 to set newrace=4 for all obs of Patient 342.
Thanks!
The way I would do it is to create a format for the patient IDs. This not only solves your immediate problem, but might be useful in other steps as it can be used in procs.
data for_fmt;
set have;
by id;
retain label;
retain fmtname 'IDRACEF';
if _n_=1 then do; *write the OTHER row first so it does not disturb the retained label;
start=.;
label=.; *or 'MISSING' or something else indicating a nonmatch;
hlo='o';
output;
hlo=' ';
end;
if first.id then label=.; *reset the retained label so it does not carry over from the previous id;
start=id;
if race=4 then label=4; *or you could have label='Hispanic', also - could use this to convert to character strings;
else label=coalesce(label,race); *otherwise only change race if label is missing;
if last.id then output;
keep start label fmtname hlo;
run;
proc format cntlin=for_fmt;
quit;
Then you can use IDRACEF. as a format, either with a format statement on the ID column (such as in a proc means), or with the put function.
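As a rough sketch of the put-function route (assuming id is numeric in HAVE and that the format labels are the numeric race codes; newrace is just an illustrative variable name):

```sas
data need;
set have;
/* look up the race for this id through the format; put() returns text, */
/* so convert it back to a number (?? suppresses notes for nonmatches)  */
newrace = input(put(id, idracef.), ?? best12.);
run;
```

If the labels were character strings instead, the input() wrapper would not be needed.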
This answer assumes that 4 is the overriding race, and that, if an id has more than one race but one of them is 4, all race values are replaced with 4. Also if a given id has more than one race none of which is 4, this code essentially picks which one will replace missing values at random.
data races (drop=race);
do until (last.id);
set have;
by id;
if newrace ne 4 and not missing(race) then newrace = race; /* never overwrite with a missing race */
end;
output;
run;
data need (drop=newrace);
merge have races;
by id;
if missing(race) then race=newrace;
if newrace = 4 then race = 4;
run;
The first part creates a data set "races" for which race should replace missings for each id. The second merges that into the original set and replaces missings with the race from "races".
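Neither answer records which patients have more than one race on file, which was the first part of the question. One possible sketch uses PROC SQL; the table name race_flags and the columns n_races and multi_race are made up for illustration:

```sas
proc sql;
create table race_flags as
select id,
       count(distinct race) as n_races,         /* missing values are not counted */
       (calculated n_races > 1) as multi_race   /* 1 if the id has 2+ distinct races */
from have
group by id;
quit;
```

The result has one row per id and could be merged back onto the visit-level data by id.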
This might sound awkward but I do have a requirement to be able to concatenate all the values of a char column from a dataset, into one single string. For example:
data person;
input attribute_name $ dept $;
datalines;
John Sales
Mary Acctng
skrill Bish
;
run;
Result : test_conct = "JohnMaryskrill"
The column could vary in number of rows in the input dataset.
So, I tried the code below, but it errors out when the length of the combined string (test_conct) exceeds 32K.
DATA RECKEYS(KEEP=test_conct);
length test_conct $32767;
do until(eod);
SET person END=EOD;
if lengthn(attribute_name) > 0 then do;
test_conct = catt(test_conct, strip(attribute_name));
end;
end;
output; stop;
run;
Can anyone suggest a better way to do this, maybe by breaking the column down into macro variables of 32K each?
Regards
It would very much help if you indicated what you're trying to do, but a quick method is to use SQL:
proc sql NOPRINT;
select name into :name_list separated by ""
from sashelp.class;
quit;
%put &name_list.;
As you've indicated, macro variables do have a size limit (64K characters in most installations now). Depending on what you're doing, a better method may be to build a macro that puts the entire list wherever it needs to go dynamically, but you would need to explain the usage for anyone to suggest that option. This answers your question as posted.
Try this, using the VARCHAR() option. If you're on an older version of SAS this may not work.
data _null_;
set sashelp.class(keep = name) end=eof;
length long_var varchar(1000000);
length want $256.;
retain long_var;
long_var = catt(long_var, name);
if eof then do;
want = md5(long_var);
put want;
end;
run;
I need to list the ID numbers that are currently available, based on a data set in which IDs are already assigned (if the ID is on the file then it's in use; if it's not on file, then it's available for use).
The issue is I don't know how to create a data set that displays the ID numbers that fall between two IDs currently on file. Let's say I have the data set below:
data have;
input id;
datalines;
1
5
6
10
;
run;
What I need is for the new data set to be in the following structure of this data set -
data need;
input id;
datalines;
2
3
4
7
8
9
;
run;
I am not sure how I would produce the observations of ID #'s 2, 3 and 4 as these would be scenarios of "available ID's"...
My initial attempt was going to be subtracting the ID values from one observation to the next in order to find the difference, but I am stuck from there on how to use that value and add 1 to the observation before it...and it all became quite messy from there.
Any assistance would be appreciated.
As long as your set of possible IDs is known, this can be done by putting them all in a file and excluding the used ones.
e.g.
data id_set;
do id = 1 to 10;
output;
end;
run;
proc sql;
create table need as
select id
from id_set
where id not in (select id from have)
;
quit;
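If the range of possible IDs is not fixed in advance, one option is to take it from the data itself. This is only a sketch, assuming the available IDs are the gaps between the smallest and largest ID in HAVE; the macro variable names min_id and max_id are made up for illustration:

```sas
proc sql noprint;
/* store the lowest and highest ID on file in macro variables */
select min(id), max(id)
  into :min_id trimmed, :max_id trimmed
  from have;
quit;

data id_set;
/* one row for every candidate ID in that range */
do id = &min_id to &max_id;
  output;
end;
run;
```

The same not-in query then works unchanged against this id_set.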
Create a temporary variable that stores the previous id, then just loop between that and the current id, outputting each iteration.
data have;
input id;
datalines;
1
5
6
10
;
run;
data need (rename=(newid=id));
set have;
retain _lastid; /* keep previous id value */
if _n_>1 then do newid=_lastid+1 to id-1; /* fill in numbers between previous and current ids */
output;
end;
_lastid=id;
keep newid;
run;
Building on Jetzler's answer: Another option is to use the MERGE statement. In this case:
Note: before merging, sort both datasets by id (if they are not already sorted).
data want;
merge id_set (in=a)
have (in=b); /*specify datasets and vars to allow the conditional below*/
by id; /*merge key variable*/
if a and not b; /*on output keep only records in ID_SET that are not in HAVE*/
run;
I have 2 variables and 3 records in a sas data set, and based on the date field in that data set, I need to read different monthly data sets.
For example,
I have
item no. Date
1 30Jun2015
2 31Jul2015
3 31Aug2015
When I read the first record, then based on the date field (30jun2015) here, it should merge another dataset suffixed with 30jun2015 with this current dataset.
How can I achieve that?
I'll hazard a guess at what you're looking for, so I've left a bit of a gap where you'll have to specify the criteria for your own merge.
1) Read in base data
data MAIN_DATA;
infile cards;
input ITEM_NO DATE:date9.;
format DATE date9.;
cards;
1 30JUN2015
2 31JUL2015
3 31AUG2015
;
run;
2) Store all dates in macro variables date1, date2, and so on, and store the number of dates in daten. This assumes ddmmyy6. is a good format for your table names.
Data _null_;
Set Main_data;
Call symputx('date'||strip(_n_),put(DATE,ddmmyy6.));
Call symputx('daten', _n_);
Run;
3) Read in the variables and read the associated table - you haven't specified how to do the merge so I'll leave that up to you
%macro readin;
%do i = 1 %to &daten;
data NEW_TABLE_&&date&i..;
set TEST_&&date&i..; /*in this step you can merge on the original table however you intend to*/
run;
%end;
%mend readin;
%readin;
What can I do if I want to copy the data from the next row?
For example customer A started his current trip on 01JAN2015 and next trip on 15JAN2015. Therefore, his end trip date for his current trip will be on 14JAN2015, which is a day before his next trip starts. What can I script for the end trip date?
As there is no lead() function in SAS, you can either sort your data into descending date order and use lag() then re-sort it back again, as per Vasilij's answer, or you can do a 'look-ahead merge'.
Example:
proc sort data=have ;
by customer date_start ;
run ;
data want ;
merge have
have (firstobs=2 rename=(date_start=next_date customer=next_customer)) ;
if customer = next_customer then do ;
date_end = next_date - 1 ; /* the day before the next trip starts */
end ;
format date_end date7. ;
drop next_: ;
run ;
Here is the code that would do what you are asking.
It sorts the data in descending order in order to use LAG() function. That way, any previous record is actually your future record and you can use it to work out the data points you need. Last PROC SORT sorts the data in the original order.
NOTE: I didn't take into account different customers. You might want to introduce some BY GROUP processing to make sure you don't take the next trip date for another customer.
data have;
input customer $ date_start date7.;
format date_start date7.;
datalines;
A 01JAN15
A 15JAN15
;
PROC SORT data=have;
by customer Descending date_start ;
RUN;
data want;
set have;
by customer Descending date_start ;
format date_end date7.;
date_end = lag(date_start)-1;
RUN;
PROC SORT data=want;
by customer date_start ;
RUN;
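To cover the BY-group caveat in the note above, the middle data step could be adapted along these lines. This is a sketch rather than a tested drop-in; next_start is a made-up helper variable:

```sas
data want;
set have;
by customer descending date_start;
format date_end date7.;
/* call LAG on every row so its queue stays in step with the data */
next_start = lag(date_start);
/* the first row of each customer is their latest trip, so it has no next trip */
if not first.customer then date_end = next_start - 1;
drop next_start;
run;
```

LAG is called unconditionally here on purpose: calling it inside an IF block only updates its queue on the rows where the condition is true.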
lag() is a horribly misnamed function that has nothing to do with 'previous row' and should almost always be avoided. It often creates bugs that are very hard to spot. There are some rare cases where it makes sense to use it. This is not one of them. I really wish people would stop recommending its use. [/end rant].
Instead, consider using one of the below methods.
1) The point= method (not sure if there's a name for this). Some notes, be sure to keep just those variables you need on the second set statement and no more. Rename them so they don't overwrite the existing variable values.
data want;
set sashelp.class end=last;
* GET THE NAME FROM THE NEXT ROW OF DATA;
if not last then do;
recno=_n_+1;
set sashelp.class(keep=name rename=(name=next_name)) point=recno;
end;
else do;
call missing(next_name);
end;
run;
2) The retain method:
* REVERSE THE ORDER OF THE DATA;
proc sort data=sashelp.class out=have;
by descending name;
run;
* KEEP TRACK OF THE PRIOR RECORDS NAME AS WE ITERATE ACROSS OBSERVATIONS;
data have2;
set have;
length next_name $8;
retain next_name '';
output;
next_name = name;
run;
* SORT THE DATA BACK TO ITS ORIGINAL ORDER;
proc sort data=have2 out=want;
by name;
run;
3) The look-ahead-merge method as suggested in Chris J's answer.
I have a data set of patient information where I want to count how many patients (observations) have a given diagnostic code. I have 9 possible variables where it can be, in diag1, diag2... diag9. The code is V271. I cannot figure out how to do this with the "WHERE" clause or proc freq.
Any help would be appreciated!!
Your basic strategy here is to create a dataset that is not at the patient level; instead, each observation is one patient-diagnosis code pair (so up to 9 observations per patient). Something like this:
data want;
set have;
array diag[9];
do _i = 1 to dim(diag);
if not missing(diag[_i]) then do;
diagnosis_Code = diag[_i];
output;
end;
end;
keep diagnosis_code patient_id /* plus any other variables you might want */;
run;
You could then run a proc freq on the resulting dataset. You could also change the criteria from not missing to if diag[_i] = 'V271' then do; to get only V271s in the data.
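If all you need is the number of patients with V271 anywhere in diag1-diag9, a different route is to flag each patient without reshaping. This is a sketch that assumes the diag variables are character and HAVE has one row per patient; flagged and has_v271 are illustrative names:

```sas
data flagged;
set have;
/* WHICHC returns the position of the first match, or 0 if there is none */
has_v271 = (whichc('V271', of diag1-diag9) > 0);
run;

proc freq data=flagged;
tables has_v271;
run;
```

The has_v271=1 row of the frequency table is then the patient count, and a patient with the code in two slots is still counted once.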
An alternate way to reshape the data that can match Joe's method is to use proc transpose like so.
proc transpose data=have out=want(keep=patient_id col1
rename=(col1=diag)
where=(diag is not missing));
by patient_id;
var diag1-diag9;
run;