Linear Interpolation on missing values at the end of the period - sas

Here is a dataset example :
data data;
input group $ date value;
datalines;
A 2001 1.5
A 2002 2.6
A 2003 2.8
A 2004 2.9
A 2005 .
B 2001 0.1
B 2002 0.6
B 2003 0.7
B 2004 1.4
B 2005 .
C 2001 4.7
C 2002 4.6
C 2003 4.8
C 2004 5.0
C 2005 .
;
run;
I want to replace the missing values of the variable "value" for each group using linear interpolation.
I tried using proc expand :
proc expand data=data method = join out=want;
by group;
id date;
convert value;
run;
But it's not replacing any of the missing values in the output dataset.
Any idea what I'm doing wrong please?

Here are three ways to do it. Your missing data is at the end of the series. You are effectively doing a forecast with a few points. proc expand isn't good for that, but for the purposes of filling in missing values, these are some of the options available.
1. PROC EXPAND
You were close! Your missing data is at the end of the series, which means it has no values to join between. You need to use the extrapolate option in this case. If you have missing values between two data points then you do not need to use extrapolate.
proc expand data=data method = join
out=want
extrapolate;
by group;
id date;
convert value;
run;
2. PROC ESM
You can do interpolation with exponential smoothing models. I like this method since it can account for things like seasonality, trend, etc.
/* Convert Date to SAS date */
data to_sas_date;
set data;
year = mdy(1,1,date);
format year year4.;
run;
proc esm data=to_sas_date
out=want
lead=0;
by group;
id year interval=year;
forecast value / replacemissing;
run;
3. PROC TIMESERIES
This will fill in values using mean/median/first/last/etc. for a timeframe. First convert the year to a SAS date as shown above.
proc timeseries data=to_sas_date
out=want;
by group;
id year interval=year;
var value / setmissing=average;
run;

I don't know much about the expand procedure, but you can add extrapolate to the proc expand statement.
proc expand data=data method = join out=want extrapolate;
by group;
id date;
convert value;
run;
Results in:
Obs group date value
1 A 2001 1.5
2 A 2002 2.6
3 A 2003 2.8
4 A 2004 2.9
5 A 2005 3.0
6 B 2001 0.1
7 B 2002 0.6
8 B 2003 0.7
9 B 2004 1.4
10 B 2005 2.1
11 C 2001 4.7
12 C 2002 4.6
13 C 2003 4.8
14 C 2004 5.0
15 C 2005 5.2
Please take note of this statement from the PROC EXPAND documentation on the EXTRAPOLATE option:
"By default, PROC EXPAND avoids extrapolating values beyond the first or last input value for a series and only interpolates values within the range of the nonmissing input values. Note that the extrapolated values are often not very accurate and for the SPLINE method the EXTRAPOLATE option results may be very unreasonable. The EXTRAPOLATE option is rarely used."
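The extrapolated values in the listing above are consistent with extending the straight line through the last two nonmissing points of each group (for group A, 2.9 + (2.9 - 2.8) = 3.0). A rough DATA step sketch of that arithmetic, assuming evenly spaced years and data sorted by group and date; prev1, prev2 and the output name manual are helper names of my own, not part of PROC EXPAND:

```sas
data manual;
  set data;
  by group;
  retain prev1 prev2;                  /* last two values seen in this group */
  if first.group then call missing(prev1, prev2);
  /* trailing gap: continue the line through the previous two points */
  if missing(value) and not missing(prev1) and not missing(prev2) then
    value = prev1 + (prev1 - prev2);
  prev2 = prev1;
  prev1 = value;
  drop prev1 prev2;
run;
```

This is only meant to make the calculation visible; for real work the PROC EXPAND call with EXTRAPOLATE is the simpler route.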

Related

ERROR: The ID value "xxxxxxxxxxxx" occurs twice in the same BY group. when transposing a complex dataset

I have a strange data set and I am hoping you all can help me. It contains the levels of certain environmental contaminants in a group of research participants; each contaminant is measured in multiple ways and has a limit of detection. I need the data in wide format, but it is currently long and the naming conventions don't translate easily.
This is what it looks like now:
ID Class Name Weight Amount_lipids Amount_plasma LOD
1 AAA Lead 1.55 44.0 10.0 5.00
1 AAB Mercury 1.55 222.0 100.0 75.00
2 AAA Lead 1.25 25.5 12.0 5.00
I have tried various forms of Proc Transpose with no luck and this seems to be more complex than what specifying a prefix can handle.
I want it to look like this:
ID Weight Lead_lip Lead_plas Lead_LOD Mercury_lip Mercury_plas Mercury_LOD
1 1.55 44.0 10.0 5.0 222.0 100.0 75.0
2 1.25 25.5 12.0 5.0 . . .
I tried a two-step transpose but received the following error: ERROR: The ID value "xxxxxxxxxxxx" occurs twice in the same BY group.
by id weight name;
run;
proc transpose data=want_intermediate out=want;
by id weight;
id name _name_;
run;
You likely have more than one record with the same ID and weight, so the ID value is duplicated within a BY group.
You can add a counter for each ID record and use that. This is a double wide transpose, and it looks like your code was cut off. So to add an enumerator for each ID:
data temp;
set have;
by id;
if first.id then count=1;
else count+1;
run;
Then modify your PROC TRANSPOSE to use ID and count in the BY statement.
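Before adding a counter, it can help to confirm which key combinations actually repeat. A minimal sketch, assuming the input dataset is named have as in the step above and that the keys of interest are id, weight and name:

```sas
proc sql;
  /* list every id/weight/name combination that appears more than once */
  select id, weight, name, count(*) as n_records
  from have
  group by id, weight, name
  having count(*) > 1;
quit;
```

If this returns no rows, the duplication is coming from somewhere else, for example from the variables listed in the ID statement of the second transpose.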

Tracking ID in SAS

I have a SAS question. I have a dataset containing ID and year. I want to create the dummy variables "2011" and "2012", which should take the value 1 if the ID has an observation in the given year and 0 otherwise. E.g. ID 2 should have 2011=1 and 2012=0, since that ID only has an observation for 2011.
ID Year 2011 2012
1 2011 1 1
1 2012 1 1
2 2011 1 0
3 2012 0 1
Can anyone help? Thanks!
For one thing, 2011 or 2012 are not valid names for SAS variables. SAS variables must start with a letter or an underscore (e.g., _2011).
If you really need to, you can get around that limitation by setting the system option validvarname=any and surrounding your 'invalid' variable names with single quotes and appending an n.
This would do what you want:
data have;
infile datalines;
input ID year;
datalines;
1 2011
1 2012
2 2011
3 2012
;
run;
options validvarname=ANY;
proc sql;
create table want as
select ID
,year
,exists(select * from have b where year=2011 and a.id=b.id) as '2011'n
,exists(select * from have b where year=2012 and a.id=b.id) as '2012'n
from have a
;
quit;
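If ordinary variable names are acceptable, the same flags can be built without validvarname=ANY. A sketch under that assumption; the names y2011, y2012 and want_valid are my own choice, and the query relies on PROC SQL remerging the group maximum back onto every row (SAS writes a note about this to the log):

```sas
proc sql;
  create table want_valid as
  select ID
        ,year
        ,max(year=2011) as y2011   /* 1 if any row for this ID is 2011 */
        ,max(year=2012) as y2012   /* 1 if any row for this ID is 2012 */
  from have
  group by ID
  ;
quit;
```

Row order after a remerge is not guaranteed, so add an ORDER BY clause if the order matters.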

Defining a new field conditionally using put function with user-defined formats

I am trying to define a new value for an observation with a user defined format. However, my if/then/else statement seems to only work for observations with a year value of "2014". The put statements are not working for other values. In SAS, the put statement is blue in the first statement, and black in the other two. Here is a picture of what I mean:
Does anyone know what I am missing here? Here is my complete code:
data claims_t03_group;
set output.claims_t02_group;
if year = "2014" then test = put(compress(lookup,"_"),$G_14_PROD35.);
else if year = "2015" then test = put(compress(lookup,"_"),$G_15_PROD35.);
else test = put(compress(lookup,"_"),$G_16_PROD35.);
run;
Here is an example of what I mean when I say that the process seems to "work" for 2014:
As you can see, when the Year value is 2014, the format lookup works correctly, and the test field returns the value I am expecting. However, for years 2015 and 2016, the test field returns the lookup value without any formatting.
Your code utilises user-defined formats, $G_14_PROD.-$G_16_PROD.. My guess would be that there is a problem with one or more of these, but unless you can provide the format definitions it will be difficult to assist you further.
Try running the following and sharing the resulting output dataset work.prdfmts:
proc sql noprint;
select cats(libname,'.',memname) into :myfmtlib
from sashelp.vcatalg
where objname = 'G_14_PROD';
quit;
proc format cntlout = prdfmts library=&myfmtlib;
select $G_14_PROD $G_15_PROD $G_16_PROD; /* character formats need the $ prefix */
run;
N.B. this assumes that you only have one catalogue containing a format with that name, and that the format definitions for all 3 formats are contained in the same catalogue. If not, you will need to adapt this a bit and run it once for each format to find and export the definition.
Not that it solves your actual problem, but you could eliminate the IF/THEN by using the PUTC() function instead.
data have ;
do year=2014,2015,2016;
do lookup='00_01','00_02' ;
output;
end;
end;
run;
proc format ;
value $G_14_PROD '0001'='2014 - 1' '0002'='2014 - 2' ;
value $G_15_PROD '0001'='2015 - 1' '0002'='2015 - 2' ;
value $G_16_PROD '0001'='2016 - 1' '0002'='2016 - 2' ;
run;
data want ;
set have ;
length test $35 ;
if 2014 <= year <= 2016 then
test = putc(compress(lookup,'_'),cats('$G_',year-2000,'_PROD.'))
;
run;
Result
Obs year lookup test
1 2014 00_01 2014 - 1
2 2014 00_02 2014 - 2
3 2015 00_01 2015 - 1
4 2015 00_02 2015 - 2
5 2016 00_01 2016 - 1
6 2016 00_02 2016 - 2

SAS: Calculate standard deviation based on rolling window

I need to compute the standard deviation over a window of (-2,+2) days around each date, rather than (-2,+2) observations. My table is below:
date value standard_deviation
01/01/2015 18 ...
01/01/2015 15 ...
01/01/2015 5 ...
02/01/2015 66 ...
02/01/2015 7 ...
03/01/2015 7 ...
04/01/2015 19 ...
04/01/2015 7 ...
04/01/2015 11 ...
04/01/2015 17 ...
05/01/2015 3 ...
06/01/2015 7 ...
... ... ...
The tricky part is that there is a different number of observations on each day, so I cannot simply use the following code:
PROC EXPAND DATA=TESTTEST OUT=MOVINGAVERAGE;
CONVERT VAL=AVG / TRANSFORMOUT=(MOVSTD 5);
RUN;
Can anyone tell me how to compute a centered moving standard deviation that includes only (-2,+2) days in this case?
Thanks in advance !
Best
Sorry for the confusion. Here's how I'd do it using proc sql. Simply use a subquery with the std aggregate function and a where clause selecting the observations you want:
proc sql;
select
h.date,
h.value,
( select std(s.value)
from have s
where h.date between s.date-2 and s.date+2) as stddev
from
have h
;
quit;
If the period should refer to only the days existing in the table (as opposed to all calendar days), then create a variable counting the dates and use that instead of the date variable to select your observations.
data have2;
set have;
by date;
if first.date then sort+1;
run;
proc sql;
select
h.date,
h.value,
( select std(s.value)
from have2 s
where h.sort between s.sort-2 and s.sort+2) as stddev
from
have2 h
;
quit;
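Both queries assume that date is stored as a numeric SAS date, so that date-2 and date+2 mean two calendar days. A sketch for loading the sample rows from the question, assuming the dates are written day/month/year (the question does not say so explicitly):

```sas
data have;
  input date :ddmmyy10. value;   /* read dd/mm/yyyy into a numeric SAS date */
  format date ddmmyy10.;
  datalines;
01/01/2015 18
01/01/2015 15
01/01/2015 5
02/01/2015 66
02/01/2015 7
03/01/2015 7
04/01/2015 19
04/01/2015 7
04/01/2015 11
04/01/2015 17
05/01/2015 3
06/01/2015 7
;
run;
```

The rows are already in date order, which the BY date step for have2 requires.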

know the start date and the latest date on a table on SAS

I'm very new therefore very fresh in the world of SAS , although I've used SAS last year during my studies but theoretical knowledge is not the same as hands-on knowledge.
Here is my problem.
I have tables on SAS, that look like the example below:
table1
date var_1 var_2 var_3 var_4 var_5
1957M1 . . . . .
1957M2 . . . . 23.5
1957M3 . 1.2 . . 23.6
1957M4 . 1.3 . . 23.7
1957M5 . 1.4 . 0.123 23.8
1957M6 . 1.5 . 0.124 23.9
1957M7 . 1.6 3.0 0.125 23.10
1957M8 . 1.7 3.1 0.126 23.11
1957M9 . 1.8 3.2 0.127 23.12
1957M10 2.1 1.9 3.3 0.128 23.13
1957M11 2.2 1.10 3.4 0.129 23.14
1957M12 2.3 1.11 3.5 0.130 23.15
As you have guessed, each variable is a time series in its own right, and date is a time series as well. All columns are numeric except the date column, which is character.
My aim is to know for each of these variables , their respective start date and their latest date.
var_1 would start in October 1957 (M10) and its latest date would be December 1957 (M12).
var_4 would start in May 1957 (M5) and its latest date would be December 1957 (M12).
I've tried the following through SAS for one column for one table as a test but it is taking a hell of a long time , without results.
PROC SQL NOPRINT;
SELECT
MIN(input(substr(date,1,4),date4.)),
MAX(input(substr(date,1,4),date4.))
FROM
table1
WHERE
var_2 <> "."
quit;
For my query, the date column is text. The query tries to convert it to a date using only the year, although keeping the months as well would be great.
My boss told me about PROC FREQ to achieve the results I want but I have no idea how.
If you have any clues, I'm taking it.
Cheers.
Your problem is that your data structure isn't really right for your problem.
The right data structure is a more vertical structure, with DATE, VAR, VALUE. Then PROC MEANS is perfect for your needs.
data have;
input date $ var_1 var_2 var_3 var_4 var_5;
datalines;
1957M1 . . . . .
1957M2 . . . . 23.5
1957M3 . 1.2 . . 23.6
1957M4 . 1.3 . . 23.7
1957M5 . 1.4 . 0.123 23.8
1957M6 . 1.5 . 0.124 23.9
1957M7 . 1.6 3.0 0.125 23.10
1957M8 . 1.7 3.1 0.126 23.11
1957M9 . 1.8 3.2 0.127 23.12
1957M10 2.1 1.9 3.3 0.128 23.13
1957M11 2.2 1.10 3.4 0.129 23.14
1957M12 2.3 1.11 3.5 0.130 23.15
;;;;
run;
data want;
set have;
array var_[5];
date_num = mdy(input(substr(date,6),2.),1,input(substr(date,1,4),4.)); /* explicit character-to-numeric conversion */
do _iter= 1 to dim(var_);
if not missing(var_[_iter]) then do;
var = vname(var_[_iter]);
value = var_[_iter];
output;
end;
end;
format date_num MONYY.;
run;
proc means data=want;
class var;
var date_num;
output out=edge_dates min= max= /autoname;
run;
If performance is an issue, this is the fastest way I know, as it only needs to read the data once. Use the code from the previous answer to create the 'have' dataset:
data want;
format date_num start1-start5 end1-end5 monyy.;
set have end=eof;
retain start1-start5 end1-end5 .; * RETAIN THE VALUES WE WILL BE CALCULATING AS WE ITERATE ACROSS ROWS IN THE DATASET;
array arr_var [*] var_1-var_5 ; * ARRAY FOR EXISTING VARIABLES;
array arr_start[*] start1-start5; * ARRAY FOR NEW VARIABLES THAT WILL CONTAIN START DATE;
array arr_end [*] end1-end5 ; * ARRAY FOR NEW VARIABLES THAT WILL CONTAIN END DATE;
date_num = mdy(input(substr(date,6),best.),1,input(substr(date,1,4),best.));
do iter=1 to dim(arr_var); * LOOPING FOR THE NUMBER OF VARIABLES IN ARR_VAR;
if arr_var[iter] ne . then do; * ONLY GOING TO PERFORM CALCS WHEN THE VARIABLE IS NOT MISSING;
if arr_start[iter] eq . then do;
arr_start[iter] = date_num; * ONLY UPDATE THE START DATE IF IT HASNT ALREADY BEEN SET;
end;
arr_end[iter] = date_num; * IF ITS NOT MISSING, ALWAYS UPDATE THE END DATE;
end;
end;
if eof then do;
output; * ONLY OUTPUT THE CALCULATED VALUES ONCE WE HIT THE END OF THE DATASET;
end;
keep start: end:; * KEEP ONLY VARS STARTING WITH START OR END;
run;
Normally I wouldn't recommend calculating start and end dates this way unless performance is a consideration, or if this code ends up being simpler than the alternative.
Most of the time you're better off prepping the data structure differently, although there are certainly circumstances where having the data in the above format is advantageous as well.