SAS - How to keep the earliest date considering a missing value

I need to create a new variable that repeats the earliest date for an ID visit. If the date is missing, the new variable should be missing too; after a missing value, it should carry the earliest date seen since that missing value (as in the example). I tried the LAG function and it didn't work; I also tried the keep function, but it just repeats 25NOV2015 for all records. The final result ("what I need") is in the last column.
Thanks
Example

You need to use the RETAIN statement. RETAIN means the variable's value is not reset to missing at the start of each observation, so in the next iteration of the data step the variable still remembers its value.
Sample data
data a;
input date;
format date ddmmyy10.;
datalines;
.
5
6
7
.
1
2
.
9
;
run;
Solution
data b;
set a;
retain new_date;
format new_date ddmmyy10.;
if date = . then
new_date = .;
if new_date = . then
new_date = date;
run;

Since you didn't post any data I will make up some. Also since the fact that your variable is a date doesn't really impact the answer I will just use some integers as they are easier to type.
data have ;
input id value @@ ;
cards;
1 . 1 2 1 3 1 . 1 5 1 6 1 . 1 8
2 1 2 2 2 3 2 . 2 5 2 6
;;;;
Basically your algorithm says that you want to store the value when either the current value is missing or the stored value is missing. With multiple BY groups you would also want to store it when you start a new group.
data want ;
set have ;
by id ;
retain new_value ;
if first.id or missing(new_value) or missing(value)
then new_value=value;
run;
Results:
Obs  id  value  new_value
  1   1      .          .
  2   1      2          2
  3   1      3          2
  4   1      .          .
  5   1      5          5
  6   1      6          5
  7   1      .          .
  8   1      8          8
  9   2      1          1
 10   2      2          1
 11   2      3          1
 12   2      .          .
 13   2      5          5
 14   2      6          5
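Because the question is about dates, here is a minimal sketch of the same logic applied to date values. The dataset and variable names (have_dates, visit_date, earliest_date) and every date other than 25NOV2015 are made up for illustration; they are not from the original post.

```sas
/* Hypothetical sample data: one ID with a missing visit date in the middle. */
data have_dates;
  input id visit_date :date9.;
  format visit_date date9.;
  datalines;
1 25NOV2015
1 30NOV2015
1 .
1 02DEC2015
1 05DEC2015
;
run;

/* Same rule as above: (re)store the date at the start of a BY group,   */
/* when the stored date is missing, or when the current date is missing. */
data want_dates;
  set have_dates;
  by id;
  retain earliest_date;
  format earliest_date date9.;
  if first.id or missing(earliest_date) or missing(visit_date)
    then earliest_date = visit_date;
run;
```

With this input, earliest_date should be 25NOV2015 on the first two rows, missing on the third, and 02DEC2015 on the last two.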

Related

Impute missing categories

I have randomly missing categories in a Stata dataset that look like the following
omb_control_number  agency   hours
1                   HHS-ACF
1                            10
2
2
2                   HHS-CDC  2
3
3                   HHS-ACF  3
3
4                   HHS-ACF  10
4
4
4
The omb_control_number variable is present throughout the data and is never missing. I am trying to impute the categories so that all observations with the same omb_control_number have the same agency and hours. I tried using the following:
by omb_control_number, sort : replace agency[_n-1] if missing(agency)
But it filled in only previous values. Is there a way to do this where it won't just fill in previous values? For reference, the final dataset should look like the following:
omb_control_number agency hours
1 HHS-ACF 10
1 HHS-ACF 10
2 HHS-CDC 2
2 HHS-CDC 2
2 HHS-CDC 2
3 HHS-ACF 3
3 HHS-ACF 3
3 HHS-ACF 3
4 HHS-ACF 10
4 HHS-ACF 10
4 HHS-ACF 10
4 HHS-ACF 10
If you do not care about maintaining original sort order, then you can do this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte omb_control_number str7 agency byte hours
1 "HHS-ACF" .
1 "" 10
2 "" .
2 "" .
2 "HHS-CDC" 2
3 "" .
3 "HHS-ACF" 3
3 "" .
4 "HHS-ACF" 10
4 "" .
4 "" .
4 "" .
end
gsort omb_control_number -agency
bys omb_control_number : replace agency = agency[_n-1] if missing(agency)
sort omb_control_number hours
bys omb_control_number : replace hours = hours[_n-1] if missing(hours)
If agency is a string variable, then
bysort omb (agency) : replace agency = agency[_N]
will copy the last value after sorting to all observations for the same group.
If agency is a numeric variable with value labels, keep reading.
As hours is presumably a numeric variable, it is the same idea with a twist:
bysort omb (hours) : replace hours = hours[1]
In neither case is there any check for two or more non-missing values for the same identifier.
For a numeric variable, whether with or without value labels, a check would be
bysort omb (hours) : gen byte OK = (hours == hours[1]) | missing(hours)
You should then check whether any observations have OK equal to 0; 1 means "OK".
String variables can be checked in the same way, except that you compare against the last observation (indexed by _N) rather than the first (indexed by 1).
This will get you the desired results:
bysort omb_control_number: gen nonmissing = sum(!missing(agency)) if !missing(agency)
bysort omb_control_number: gen nonmissing2 = sum(!missing(hours)) if !missing(hours)
bysort omb_control_number (nonmissing) : replace agency = agency[1]
bysort omb_control_number (nonmissing2) : replace hours = hours[1]
drop nonmissing*

sas change last value by group to first value

I want to change data of the form
id value
1 1
1 1
1 2
2 7
2 7
2 7
2 5
. .
. .
. .
to
id value
1 1
1 1
1 1
2 7
2 7
2 7
2 7
. .
. .
. .
That is, the last value by group should be the first value by group. I have tried the following code
data want;
set have;
by id;
last.value=first.value;
run;
But that didn't work. Could someone help me out?
You should save the value from the first observation of each id (where first.id is 1) in a variable and retain it.
data want(drop=tValue);
set have;
by id;
retain tValue;
if first.id then tValue=value;
if last.id then value=tValue;
run;
The problem here is that first.value and last.value:
- Do not hold the actual value; they just tell you whether an observation is the first or last in a BY-group
- Cannot be assigned; last.value = is not valid syntax
Secondly, first.value and last.value only get set if the value variable is stated in the by statement. You should use first.id and last.id instead.
What we need to do here is:
1. Check if we are looking at an observation that is the first in the BY-group based on id
2. Keep the value of the value variable until the last id value is reached
3. When we are looking at the last id value, set the value from step 1.
Alexey's answer covers the actual syntax required to do this. Here's what the first.id/last.id values look like. (You can always view them by adding put _all_; into your datastep):
id  value  first.id  last.id  tValue
 1      1         1        0       1
 1      1         0        0       1
 1      2         0        1       1
 2      7         1        0       7
 2      7         0        0       7
 2      7         0        0       7
 2      5         0        1       7
 .      .
 .      .
 .      .
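To see these flags in an output dataset as well as in the log, one option is to copy them into ordinary variables, since first.id and last.id themselves are not written to the output. A minimal sketch, assuming the input is named have as in the question and using the made-up names flags, first_id and last_id:

```sas
/* Sample data taken from the question (ids 1 and 2 only). */
data have;
  input id value;
  datalines;
1 1
1 1
1 2
2 7
2 7
2 7
2 5
;
run;

/* first.id and last.id are temporary 0/1 flags created by the BY statement. */
/* Copying them into regular variables keeps them in the output dataset.     */
data flags;
  set have;
  by id;
  first_id = first.id;
  last_id  = last.id;
  put id= value= first_id= last_id=;  /* also write them to the log */
run;
```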

Rollup function in SAS

I would like to add summary record after each group of records connected with specific shop. So, I have this:
Shop_id Trans_id Count
1 1 10
1 2 23
1 3 12
2 1 8
2 2 15
And want to have this:
Shop_id Trans_id Count
1 1 10
1 2 23
1 3 12
. . 45
2 1 8
2 2 15
. . 23
I have done this using PROC SQL but I would like to do this using PROC REPORT as I have read that PROC REPORT should handle such cases.
Try this:
data have;
input shop_id Trans_id Count;
cards;
1 1 10
1 2 23
1 3 12
2 1 8
2 2 15
;
proc report data=have out=want(drop=_:);
define shop_id/group;
define trans_id/order;
define count/sum;
break after shop_id/summarize;
compute after shop_id;
if _break_='shop_id' then shop_id=.;  /* shop_id is numeric, so blank it with a numeric missing */
endcomp;
run;

Reshaping dataset wide

I have a dataset from a small clinic which looks something like this:
What I am trying to do is make the top long form of the dataset look like the bottom wide form.
My code is the following:
reform date injury_code_1 .... , i(ID) j(VisitNum)
The error code I get is this:
There are variables other than a, b, ID, VisitNum in your data. They must be constant within ID because that is the only way they can fit into wide data without loss of information.
The variable or variables listed above are not constant within ID. Perhaps the values are in error. Type reshape error for a list of the problem observations.
Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables to be reshaped, or drop them.
Why is my code wrong?
Using the data as illustrated in the screenshot, the following works for me:
clear
input ID VisitNum str6 date Injury_1 Injury_2 Injury_3 gender
1 1 "12-Mar" 1 2 3 0
2 1 "2-Apr" 4 . . 1
1 2 "23-Jun" 1 2 . 0
3 1 "1-Feb" 5 6 . 1
1 3 "30-Aug" 8 9 10 0
end
reshape wide date Injury_1 Injury_2 Injury_3, i(ID) j(VisitNum)
order ID gender
list, abbreviate(15)
+----------------------------------------------------------------------------------------------------------------------------------------------------+
| ID gender date1 Injury_11 Injury_21 Injury_31 date2 Injury_12 Injury_22 Injury_32 date3 Injury_13 Injury_23 Injury_33 |
|----------------------------------------------------------------------------------------------------------------------------------------------------|
1. | 1 0 12-Mar 1 2 3 23-Jun 1 2 . 30-Aug 8 9 10 |
2. | 2 1 2-Apr 4 . . . . . . . . |
3. | 3 1 1-Feb 5 6 . . . . . . . |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
The command provided is not valid Stata syntax.

SAS Merge Issue

So, I'm familiar with merges in SAS, and haven't had issues before, but I noticed an issue today that has never been an issue before.
For some reason the actual merging of observations is working properly in more complex data sets, however it only lists the variable values from one of the data sets (e.g. it doesn't overwrite missing values).
For instance, I wrote up this simple program:
data dataset1;
input id var1 var2 var3 var4;
cards;
1 . . 2 2
2 . . 2 2
3 . . 2 2
4 . . 2 2
5 . . 2 2
6 . . 2 2
7 . . 2 2
8 . . 2 2
9 . 2 . 2
10 1 . . .
;
data dataset2;
input id var1 var2 var3 var4;
cards;
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
10 . 1 . .
;
data dataset3;
merge dataset1 dataset2;
by id;
run;
This should yield the following:
id var1 var2 var3 var4
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
4 2 2 2 2
5 2 2 2 2
6 2 2 2 2
7 2 2 2 2
8 2 2 2 2
9 . 2 . 2
10 1 1 . .
but instead, I get this:
id var1 var2 var3 var4
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
9 . 2 . 2
10 . 1 . .
So, it's as if the merge is merging the observations and then just displaying the second data set's values.
I've tried to figure out the issue (I have a feeling it's something very basic I've just looked over), but I've no idea what's happening, since I've never come across the issue before.
Anyone know what's going wrong?
Thanks for any help.
Your problem is that you are merging the datasets by ID but both datasets have the variables VAR1-VAR4. So when both datasets contribute to an observation, the one that is listed last in the MERGE statement will "win".
The reason you probably never saw this before is that normally when you are merging two datasets the only variables they have in common are the key variables. So the fact that the values read from the first dataset are overwritten by the values read from the second dataset didn't matter.
To get what you want you can use the UPDATE statement instead. Update will not replace a value with a missing value. Basically it is designed to apply transactions to a master dataset.
Since it looks like each ID only has one observation in DATASET1, you could just use DATASET1 as your master dataset.
data want ;
update dataset1 dataset2 ;
by id ;
run;
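If UPDATE does not fit your data, a possible alternative is to keep MERGE, rename the second dataset's variables so they no longer collide, and pick the first non-missing value with COALESCE. This is only a sketch of that idea and not part of the answer above; it assumes dataset1 and dataset2 from the question already exist, and the names want_coalesce, new1-new4, master_v, trans_v and i are made up for illustration.

```sas
/* Assumes dataset1 and dataset2 from the question have already been created. */
data want_coalesce;
  merge dataset1
        dataset2(rename=(var1=new1 var2=new2 var3=new3 var4=new4));
  by id;
  array master_v[4] var1-var4;   /* values read from dataset1 */
  array trans_v[4]  new1-new4;   /* values read from dataset2 */
  do i = 1 to 4;
    /* Prefer the non-missing value from dataset2, else keep dataset1's value. */
    master_v[i] = coalesce(trans_v[i], master_v[i]);
  end;
  drop i new1-new4;
run;
```

For the sample data this should give the same rows as the UPDATE step, for example id 10 ends up with var1=1 and var2=1.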