I am trying to linearly interpolate values within a panel data set. To do so, I need to find the next non-missing value of a variable whenever its current value is ".".
For example, if X = {1, 2, ., ., ., 7}, I want to store 7 as a variable "Y" and subtract the lagged value of X from it to get the numerator of the slope. Can anyone help with this step?
If you cannot transpose your data, here is a way that will work for your given example:
data test;
input id $3. x best12.;
datalines;
AAA 1
BBB 2
CCC .
DDD .
EEE .
FFF 7
;
run;
data test2;
set test;
n = _n_;
if x ne .;
run;
data test3;
set test2;
lagx = lag(x);
lagn = lag(n);
if _n_ > 1 and n ne lagn + 1 then do;
positiondiff = n - lagn;
valuediff = x - lagx;
do i = (lagx + ((x-lagx)/(n-lagn))) to x by ((x-lagx)/(n-lagn));
x = i;
output;
end;
end;
else output;
keep x;
run;
data test4;
merge test test3 (rename = (x=newx));
run;
So we are basically rebuilding the variable with the interpolated values, then merging it back into the original dataset without a by variable, which lines up the new interpolated values with the missing points row by row.
Is there a way you could transpose all your data? Interpolating like that is much easier when all the data you need is in a single observation. Like this:
data test;
input x best12.;
datalines;
1
2
.
.
.
7
;
run;
proc transpose data = test
out = test2;
run;
data test3;
set test2;
array xvalues {*} COL1-COL6;
array interpol {4,10} begin1-begin10 end1-end10 begposition1-begposition10 endposition1-endposition10;
rangenum = 1;
* Find the endpoints of the missing ranges;
do i = 1 to dim(xvalues);
if xvalues{i} ne . then lastknownx = xvalues{i};
else do;
interpol{1,rangenum} = lastknownx;
if interpol{3,rangenum} = . then interpol{3,rangenum} = i - 1;
end;
if i > 1 and xvalues{i} ne . then do;
if xvalues{i-1} = . then do;
interpol{2,rangenum} = xvalues{i};
interpol{4,rangenum} = i;
rangenum = rangenum + 1;
end;
end;
end;
* Interpolate;
rangenum = 1;
do j = 1 to dim(xvalues);
if xvalues{j} = . then do;
xvalues{j} = interpol{1,rangenum} + (j-interpol{3,rangenum})*((interpol{2,rangenum}-interpol{1,rangenum})/(interpol{4,rangenum}-interpol{3,rangenum}));
end;
else if rangenum <= 10 then do;
* xvalues{j-1} has already been filled in above, so detect the end of a range by its position;
if j = interpol{4,rangenum} then rangenum = rangenum + 1;
end;
end;
keep col1-col6;
run;
That can handle up to 10 different missing ranges per observation, though you could tweak the code to handle many more than that by creating bigger arrays.
The SAS data step reads a dataset one record at a time from top to bottom. So at record i, it can't access i+1 because it hasn't read it yet; it can only access i-1. Assume you have a dataset with a variable x.
data intrpl;
retain _x;
set yourdata;
by x notsorted;
if not missing(x) then do;
_x = x;
if last.x then do;
slope = _x - lag(_x);
output;
end;
end;
run;
Transposing can get kind of messy if x takes on a lot of values, so I recommend this method. I hope it helps!
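The top-to-bottom limitation described in this answer can also be worked around by reading the data in reverse. The following is only a sketch under assumptions: the dataset name have, the helper variable n and the intermediate datasets rev and rev2 are made up for illustration. It carries the next non-missing x backwards into a variable y, which is the value the question asks for on the missing rows.

```sas
data have;
  input x;
  n = _n_;                 /* remember the original row order */
  datalines;
1
2
.
.
.
7
;
run;

/* Reverse the rows so that "next" becomes "previous" */
proc sort data=have out=rev;
  by descending n;
run;

data rev2;
  set rev;
  retain y;                          /* y keeps its value across rows */
  if not missing(x) then y = x;      /* update only on non-missing x  */
run;

/* Restore the original order */
proc sort data=rev2 out=want;
  by n;
run;
```

On the three missing rows y holds 7; on rows where x is present, y simply equals x. In panel data the retain would need to be reset at the start of each panel, which this sketch does not do.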
I'm trying to generate 20 lags for a variable.
To generate the first lag, I use the following statement:
data temp.data2;
set temp.data1;
by gvkey fyear;
lag1 = ifn(gvkey=lag(gvkey) and fyear=lag(fyear)+1,lag(mv),.);
lag2 = ifn(gvkey=lag(gvkey) and fyear=lag(fyear)+1,lag(lag1),.);
etc.
run;
I don't want to repeat this 20 times. Is there a way to do this through a loop?
Thanks a lot!
You would have to maintain your own array of mv values and assign the lag values from that. The array would be bubbled for each row processed and reset at the start of an fyear group.
Example:
data have;
do gvkey = 1 to 5;
do fyear = 1 to 5;
do day = 1 to ifn(fyear=3, 10, 30);
mv = 366-day;
output;
end;
end;
end;
run;
data want;
set have;
by gvkey fyear;
array mvs(20) _temporary_;
array lags(20) lag1-lag20;
if first.fyear then call missing(of mvs(*));
* assign lags;
do _n_ = 1 to dim(lags);
lags(_n_) = mvs(_n_);
end;
* bubble mvs;
do _n_ = dim(lags) to 2 by -1;
mvs(_n_) = mvs(_n_-1);
end;
mvs(1) = mv;
run;
Looking to automate some checks and print some warnings to a log file. I think I've gotten the general idea but I'm having problems generalising the checks.
For example, I have two datasets my_data1 and my_data2. I wish to print a warning if nobs_my_data2 < nobs_my_data1. Additionally, I wish to print a warning if the number of distinct values of the variable n in my_data2 is less than 11.
Some dummy data and an attempt of the first check:
%LET N = 1000;
DATA my_data1(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N - 100;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA my_data2(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA _NULL_;
FILE "\\filepath\log.txt" MOD;
SET my_data1 NOBS = NOBS1 my_data2 NOBS = NOBS2 END = END;
IF END = 1 THEN DO;
PUT "HERE'S A HEADER LINE";
END;
IF NOBS1 > NOBS2 AND END = 1 THEN DO;
PUT "WARNING!";
END;
IF END = 1 THEN DO;
PUT "HERE'S A FOOTER LINE";
END;
RUN;
How can I set up the check for the number of distinct values of n in my_data2?
A proc sql way to do it -
%macro nobsprint(tab1,tab2);
options nonotes; *suppresses all notes;
proc sql;
select count(*) into:nobs&tab1. from &tab1.;
select count(*) into:nobs&tab2. from &tab2.;
select count(distinct n) into:distn&tab2. from &tab2.;
quit;
%if &&nobs&tab2. < &&nobs&tab1. %then %put |WARNING! &tab2. has less recs than &tab1.|;
%if &&distn&tab2. < 11 %then %put |WARNING! distinct VAR n count in &tab2. less than 11|;
options notes; *overrides the previous option;
%mend nobsprint;
%nobsprint(my_data1,my_data2);
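This would break if you have to specify libnames with the datasets, because the . in a two-level name such as work.my_data1 would end up inside the macro variable name. You can also use proc printto log to print it to a file.
For the other part of your question, printing only the %put messages, wrap the macro call like this -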
filename mylog temp;
proc printto log=mylog; run;
options nomprint nomlogic;
%nobsprint(my_data1,my_data2);
proc printto; run;
This won't print any extraneous text to the SAS log other than your custom warnings.
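The two-level-name limitation mentioned in this answer comes from building macro variable names out of the table names. A possible variation, sketched here with hypothetical names (nobsprint2, n1, n2, d2) and two tiny stand-in datasets, uses fixed local macro variables instead, so libname.dataset arguments are fine:

```sas
/* Small stand-in datasets, for illustration only */
data work.my_data1; do n = 0 to 10; output; end; run;
data work.my_data2; do n = 0 to 4;  output; end; run;

%macro nobsprint2(tab1, tab2);
  %local n1 n2 d2;
  proc sql noprint;
    select count(*)          into :n1 trimmed from &tab1.;
    select count(*)          into :n2 trimmed from &tab2.;
    select count(distinct n) into :d2 trimmed from &tab2.;
  quit;
  %if &n2. < &n1. %then %put WARNING: &tab2. has fewer records than &tab1.;
  %if &d2. < 11 %then %put WARNING: fewer than 11 distinct values of n in &tab2.;
%mend nobsprint2;

%nobsprint2(work.my_data1, work.my_data2);
```

With the stand-in data both warnings are written, since my_data2 has 5 rows and 5 distinct values of n against 11 rows in my_data1.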
@samkart provided perhaps the most direct, easily understood way to compare the obs counts. Another consideration is performance: if your data sets have millions of obs, you can get the counts without reading the entire data set.
One method is to use nobs= option in the set statement like you did in your code, but you unnecessarily read the data sets. The following will get the counts and compare them without reading all of the observations.
data _null_;
if nobs1 ne nobs2 then putlog 'WARNING: Obs counts do not match.';
stop;
set sashelp.cars nobs=nobs1;
set sashelp.class nobs=nobs2;
run;
WARNING: Obs counts do not match.
Another option is to get the counts from sashelp.vtable or dictionary.tables. Note that you can only query dictionary.tables with proc sql.
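As a rough sketch of the dictionary.tables route, assuming the same two sashelp tables as above (the macro variable names n_cars and n_class are made up for illustration):

```sas
proc sql noprint;
  /* libname and memname are stored in upper case in dictionary.tables */
  select nobs into :n_cars trimmed
    from dictionary.tables
    where libname = 'SASHELP' and memname = 'CARS';
  select nobs into :n_class trimmed
    from dictionary.tables
    where libname = 'SASHELP' and memname = 'CLASS';
quit;

data _null_;
  if &n_cars. ne &n_class. then putlog 'WARNING: Obs counts do not match.';
run;
```

The counts come from the table metadata, so no observations are read. Note that nobs can include observations that have been marked for deletion; the nlobs column may be the better choice when that matters.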
Beginning with a table,
A B C D E
1 . 1 . 1
. . 1 . .
. 1 . 1 .
I am trying to get an output like this:
A B C D E X Y Z
1 . 1 . 1 1 1 1
. . 1 . . 1 . .
. 1 . 1 . 1 1 .
Here is my code:
data want;
set have;
array GG(5) A-E;
array BB(3) X Y Z;
do i=1 to 5;
do j=1 to 3;
if gg(i)=1 then BB(j)=1;
end;
end;
run;
I understand that the result that I get is wrong, as the dimensions of both the arrays are not co-operating. Is there another way to do this?
data want;
set have;
array v1 a--e;
array v2 x y z;
i=1;
do over v1;
if not missing(of v1) then do;
v2(i)=v1;
i+1;
end;
end;
drop i;
run;
Why not something like this, using a counter to track the position of the next non-missing value?
data try;
infile datalines delimiter=',';
input var1 var2 var3 var4 var5;
datalines;
1,.,.,1,1,
.,1,.,.,.,
.,.,.,.,.,
.,1,.,1,.,
;
data want;
set try;
array vars[5] var1 var2 var3 var4 var5;
array newvars[5] nvar1 nvar2 nvar3 nvar4 nvar5;
do i=1 to 5;
if i=1 then count=0;
if vars[i] ne . then do;
count=count+1;
newvars[count]=vars[i];
end;
end;
drop i count var:;
run;
My method is to copy the existing values to new variables X Y Z temp1 temp2, then sort the values using call sortn, which will put the 1's and missing values together. Because call sortn only sorts in ascending order, with missing values coming first, I've reversed the variables in the array statement (creating them first in the correct order with a retain statement).
The unwanted variables temp1 and temp2 can then be dropped.
data have;
input A B C D E;
datalines;
1 . 1 . 1
. . 1 . .
. 1 . 1 .
;
run;
data want;
set have;
retain X Y Z .; /* create new variables in required order */
array GG{5} A--E;
array BB{5} temp1 temp2 Z Y X; /* array in reverse order due to ascending sort later on */
do i = 1 to dim(GG);
BB{i} = GG{i};
end;
call sortn(of BB{*}); /* sort array (missing values come first, hence the reverse array order) */
drop temp: i; /* drop unwanted variables */
run;
Alternatively, here's a simpler solution as your criteria is pretty basic. As you're just dealing with 1's and missings, you can loop through the number of non-missing values in A-E and assign 1 to the new array.
data want;
set have;
array GG{5} A--E;
array BB{3} X Y Z;
do i = 1 to n(of GG{*});
BB{i}=1;
end;
drop i; /* drop unwanted variable*/
run;
Is there a more elegant way than the one presented below for the following task:
to create indicator variables (below "MAX_X1" and "MAX_X2") within each group (below "key1") of multiple observations (below "key2"), with value 1 if the observation holds the maximum value of the variable in its group and 0 otherwise
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc means data=have noprint;
by key1;
var x1 x2;
output out=max
max= / autoname;
run;
data want;
merge have max;
by key1;
drop _:;
run;
proc sql;
title "MAX";
select name into :MAXvars separated by ' '
from dictionary.columns
WHERE LIBNAME="WORK" AND MEMNAME="WANT" AND NAME like "%_Max"
order by name;
quit;
title;
data want; set want;
array MAX (*) &MAXvars;
array XVars (*) x1 x2;
array Indicators (*) MAX_X1 MAX_X2;
do i=1 to dim(MAX);
if XVars[i]=MAX[i] then Indicators[i]=1; else Indicators[i]=0;
end;
drop i;
run;
Thanks for any suggestion of optimization
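Proc sql can be used with a group by clause to apply summary functions across the values of a variable. SAS remerges the group-level max() back onto every row of the group (and writes a note to the log saying so), which is what allows each row to be compared with its group maximum.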
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc sql;
create table want
as select
key1,
key2,
x1,
x2,
case
when x1 = max(x1) then 1
else 0 end as max_x1,
case
when x2 = max(x2) then 1
else 0 end as max_x2
from have
group by key1
order by key1, key2;
quit;
It is also possible to do this in a single data step, provided that you read the input dataset twice - this is an example of a double DOW-loop.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
/*Sort by key1 (or generate index) if not already sorted*/
proc sort data = have;
by key1;
run;
data want;
if 0 then set have;
array xvars[3,2] x1 x2 x1_max_flag x2_max_flag t_x1_max t_x2_max;
/*1st DOW-loop*/
do _n_ = 1 by 1 until(last.key1);
set have;
by key1;
do i = 1 to 2;
xvars[3,i] = max(xvars[1,i],xvars[3,i]);
end;
end;
/*2nd DOW-loop*/
do _n_ = 1 to _n_;
set have;
do i = 1 to 2;
xvars[2,i] = (xvars[1,i] = xvars[3,i]);
end;
output;
end;
drop i t_:;
run;
This may be a bit complicated to understand, so here's a rough explanation of how it flows:
Read one by-group with the first DOW-loop, updating the rolling max variables as each row is read in. Don't output anything yet.
Now read the same by-group again using the second DOW-loop, checking to see whether each row is equal to the rolling max and outputting each row.
Go back to first DOW-loop, read the next by-group and repeat.
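The same flow may be easier to follow with a single variable and no arrays. This is only an illustrative sketch; the dataset scores and the variables grp, val, grp_max and is_max are made up for the example:

```sas
data scores;
  input grp $ val;
  datalines;
a 3
a 9
a 5
b 2
b 1
;
run;

data flagged;
  /* 1st DOW-loop: read one by-group and track its maximum */
  do _n_ = 1 by 1 until (last.grp);
    set scores;
    by grp;
    grp_max = max(grp_max, val);
  end;
  /* 2nd DOW-loop: re-read the same _n_ rows and flag the maximum */
  do _n_ = 1 to _n_;
    set scores;
    is_max = (val = grp_max);
    output;
  end;
run;
```

Because grp_max is not retained, it is reset to missing at the top of each data step iteration, that is, once per by-group. With this input the rows with val 9 (group a) and val 2 (group b) get is_max = 1.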
I have a dataset like this(sp is an indicator):
datetime sp
ddmmyy:10:30:00 N
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
ddmmyy:10:34:00 N
And I would like to extract observations with "Y" and also the previous and next one:
ID sp
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
I tried to use "lag" and successfully extracted the observations with "Y" and the next one, but I still have no idea how to extract the previous one.
Here is my try:
data surprise_6_step3; set surprise_6_step2;
length lag_sp $1;
lag_sp=lag(sp);
if sp='N' and lag(sp)='N' then delete;
run;
and the result is:
ID sp
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
Any methods to extract the previous observation also?
Thx for any help.
Try using the point option in set statement in data step.
Like this:
data extract;
set surprise_6_step2 nobs=nobs;
if sp = 'Y' then do;
current = _N_;
prev = current - 1;
next = current + 1;
if prev > 0 then do;
set surprise_6_step2 point = prev;
output;
end;
set surprise_6_step2 point = current;
output;
if next <= nobs then do;
set surprise_6_step2 point = next;
output;
output;
end;
end;
run;
There is an implicit loop through the dataset when you use it in a set statement.
_N_ is an automatic variable that counts the iterations of that implicit loop (starting from 1), so here it matches the observation number. When you find your value, you store _N_ in the variable current so you know on which row you found it. nobs is the total number of observations in the dataset.
Checking that prev is greater than 0 and that next is no greater than nobs avoids an error if your row is first in the dataset (then there is no previous row) or last in the dataset (then there is no next row).
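An alternative that avoids direct access, sketched here under assumptions (the small dataset have and the variables dt, prev_sp and next_sp are made up for illustration), is to attach the neighbouring values to each row and filter once. The look-ahead comes from merging the dataset with itself shifted by one row, the look-behind from lag:

```sas
data have;
  input dt sp $;
  datalines;
1 N
2 N
3 Y
4 N
5 N
;
run;

data extract;
  /* merge without BY pairs rows by position; firstobs=2 shifts the copy up by one */
  merge have
        have(firstobs=2 keep=sp rename=(sp=next_sp));
  prev_sp = lag(sp);      /* called on every row, so the lag queue stays in step */
  if sp = 'Y' or prev_sp = 'Y' or next_sp = 'Y';
  drop prev_sp next_sp;
run;
```

With this input, rows 2, 3 and 4 are kept. Each row is written at most once, so two "Y" rows close together do not produce duplicate observations.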
/* generate test data */
data test;
do dt = 1 to 100;
sp = ifc( rand("uniform") > 0.75, "Y", "N" );
output;
end;
run;
proc sql;
create table test2 as
select *,
monotonic() as _n
from test
;
create table test3 ( drop= _n ) as
select a.*
from test2 as a
left join test2 as b
on a._n = b._n + 1
left join test2 as c
on a._n = c._n - 1
where a.sp = "Y"
or b.sp = "Y"
or c.sp = "Y"
;
quit;