Checking for proper ordering of numeric, time, etc - sas

My data looks something like this:
data tmp ;
input id var1 - var5 ;
datalines ;
1 1 2 3 4 5
2 1 2 . . .
3 1 . . . 4
4 . 3 . . .
5 . . . . 5
6 1 3 2 2 3
7 5 3 7 8 9
8 1 . . . 2
9 1 . 2 3 4
;
run ;
I'm trying to determine if n variables are properly 'ordered.' By ordered, I mean numerically or sequential in time (or even alphabetic). So in this example, my desired output would be:
dummy = 1 1 1 1 1 0 0 1 1 since the ones where dummy = 1 are in correct order.
It would be trivial if I had complete data:
if var1 <= var2 <= ... <= varn then dummy = 1; else dummy = 0;
Unfortunately, I do not have complete data. I suspect the problem is that SAS treats . as a very small number(?) and that I cannot do arithmetic on ., since this also failed:
if 0 * (var1 = .) + var1 <=
var1 * (var2 = .) + var2 <=
var2 * (var3 = .) + var3 <= ... <=
var_n-1 * (varn = .) + varn
then dummy = 1;
else dummy = 0;
Basically this would check to see if a variable is . and if it is, then use the previous value in the inequality, but if it is not missing, proceed as normal. This works sometimes, but still requires most of the info to be nonmissing.
I have also tried something like:
if var2 = max(var1, var2) & var1 <= var2 &
var3 = max(var1 -- var3) & var2 <= var3 & ...
but this approach also needs complete data. And I have tried transposing the data into a long format so that I can just delete the missing columns (and only keep variables I am interested in knowing the order of) but a transposed data set of thousands of variables isn't useful to me (if you would convert back to wide, there would still be missing columns).
Clearly, I am not the best SASer, but I would ideally like to write a macro or something since this issue comes up for me a lot (basically just a data check to see if dates are in order and occur when they should be regarding their relative timeline).
Here is all the code:
data tmp ;
input id var1 - var5 ;
datalines ;
1 1 2 3 4 5
2 1 2 . . .
3 1 . . . 4
4 . 3 . . .
5 . . . . 5
6 1 3 2 2 3
7 5 3 7 8 9
8 1 . . . 2
9 1 . 2 3 4
;
run ;
data tmp1 ;
set tmp ;
if var1 <= var2 <= var3 <= var4 <= var5 then dummy1 = 1 ; else dummy1 = 0 ;
if 0 * (var1 = .) + var1 <=
var1 * (var2 = .) + var2 <=
var2 * (var3 = .) + var3 <=
var3 * (var4 = .) + var4 <=
var4 * (var5 = .) + var5
then dummy2 = 1 ;
else dummy2 = 0 ;
if var2 = max(var1,var2) & var1 ~= var2 &
var3 = max(var1, var2, var3) & var2 ~= var3 &
var4 = max(var1, var2, var3, var4) & var3 ~= var4 &
var5 = max(var1, var2, var3, var4, var5) & var4 ~= var5
then dummy3 = 1 ;
else dummy3 = 0 ;
* none of dummy1 - 3 pick up the observations that are in proper order ;
run ;
data tmp1_varsIwant ;
set tmp1 ;
keep id var1 -- var5 ;
run ;
proc transpose data = tmp1_varsIwant out = tmp1_long ;
by id ;
run ;
data tmp1_long ;
set tmp1_long ;
if col1 = . then delete ;
if _name_ in('var6', 'var999') then delete ;
run ;
proc sort data = tmp1_long ;
by id col1 ;
run ;

Maybe you could force all the logic into one conditional, but it's probably simpler to use a loop like this:
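data tmp1 ;
set tmp ;
array vars (*) var1-var5;
last_highest = .; /* most recent non-missing value seen so far */
dummy = 1; /* assume the row is ordered until a decrease is found */
do i = 1 to dim(vars);
/* only compare non-missing values against the last non-missing one */
if vars(i) > . and vars(i) < last_highest then do;
dummy = 0;
leave;
end;
last_highest = coalesce(vars(i),last_highest); /* missing values are skipped */
end;
drop i last_highest;
run ;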
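For readers who want to check the logic outside SAS, here is a minimal Python sketch of the same idea. It is not part of the original answer: the function name `is_ordered` is made up for illustration, and `None` is assumed to stand in for a SAS missing value.

```python
from typing import Optional, Sequence


def is_ordered(values: Sequence[Optional[float]]) -> int:
    """Return 1 if the non-missing values never decrease, else 0."""
    last_seen = None  # plays the role of last_highest in the SAS loop
    for value in values:
        if value is None:  # None stands in for a SAS missing value (.)
            continue
        if last_seen is not None and value < last_seen:
            return 0
        last_seen = value
    return 1


# The nine rows from the question (var1-var5), with . written as None.
rows = [
    [1, 2, 3, 4, 5],
    [1, 2, None, None, None],
    [1, None, None, None, 4],
    [None, 3, None, None, None],
    [None, None, None, None, 5],
    [1, 3, 2, 2, 3],
    [5, 3, 7, 8, 9],
    [1, None, None, None, 2],
    [1, None, 2, 3, 4],
]
dummy = [is_ordered(row) for row in rows]
print(dummy)
```

The printed list can be compared against the desired output stated in the question.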

Related

SAS: apply statement over multiple columns

I have a dataset with one id column and three variables:
data have;
input id var1 var2 var3;
datalines;
1 0 1 0
2 1 1 0
3 0 0 2
4 0 4 1
;
run;
I want to use some sort of data step or PROC SQL step over var1 to var3 that keeps a value as 0 if it is 0 and sets it to 1 if it is greater than 0. It should ideally use an array over var1 -- var3, as the actual dataset has many more variables.
Try this
data have;
input id var1 var2 var3;
datalines;
1 0 1 0
2 1 1 0
3 0 0 2
4 0 4 1
;
run;
data want;
set have;
array v var1 -- var3;
do over v;
v = v > 0;
end;
run;

Average/Count of a varying number of observations until a condition is met

I have got a data set with the following structure:
ID Date&Time var1 var2
1 1/11 1 yes
1 3/11 3 no
1 3/11 2 no
1 5/11 5 yes
1 10/11 2 no
2 3/11 0 yes
2 12/11 1 no
2 23/11 2 yes
2 24/11 0 yes
3 5/11 1 yes
3 6/11 2 no
3 8/11 5 yes
3 9/11 4 no
It is a log file of observations on which my analysis is based. Now I would like to get a moving average over, e.g., all observations of the last week (and month, year, etc.), i.e. I want a structure like the following:
ID Date&Time var1 var2 week_avg week_count
1 1/11 1 yes . .
1 3/11 3 no 1 1
1 3/11 2 no 2 1
1 5/11 5 yes 2 1
1 10/11 2 no 3.33 1
2 3/11 0 yes . .
2 12/11 1 no . .
2 23/11 2 yes . .
2 24/11 0 yes 2 1
3 5/11 1 yes . .
3 6/11 2 no 1 1
3 8/11 5 yes 1.5 1
3 9/11 4 no 2.66 2
Is there a way to use the lag-function in something like a do-until loop?
Or is PROC EXPAND capable of performing a moving average by specifying a time window instead of a number of observations?
You can do it via by-processing, first creating the corresponding period values :
proc sort data=have ; by id date ; run ;
data periods ;
set have ;
year = put(date,year4.) ;
month = put(date,yymmn6.) ;
week = put(date,weeku5.) ;
run ;
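data groups ;
set periods ;
/* The BY statement is required: it creates the first.year, first.month and first.week flags used below. */
/* Because the BY variables are nested, first.week is also set whenever id, year or month changes. */
by id year month week ;
retain week_tot week_cnt month_tot month_cnt year_tot year_cnt 0 ;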
/* For the first value in each period, set count & total values to . */
if first.year then call missing(of year_:) ;
if first.month then call missing(of month_:) ;
if first.week then call missing(of week_:) ;
/* Increment count by 1, total by var1, calculate average */
/* Add any conditional logic on which to increment the running values */
week_cnt + 1 ; week_tot + var1 ; week_avg = week_tot / week_cnt ;
month_cnt + 1 ; month_tot + var1 ; month_avg = month_tot / month_cnt ;
year_cnt + 1 ; year_tot + var1 ; year_avg = year_tot / year_cnt ;
run ;
You can then abstract the above into a macro if you so wish
%MACRO PERIOD_CALC(PD) ;
retain &PD._cnt &PD._tot ;
if first.&PD then call missing(of &PD._:) ;
&PD._cnt + 1 ;
&PD._tot + var1 ;
&PD._avg = &PD._tot / &PD._cnt ;
%MEND ;
data groups ;
set periods ;
by id year month week ; /* required so that the first. flags used in the macro exist */
%PERIOD_CALC(week) ;
%PERIOD_CALC(month) ;
%PERIOD_CALC(year) ;
run ;
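To make the reset-and-accumulate pattern concrete, here is a small Python sketch of the same by-group logic. It is an illustration only: the function name `running_stats`, the week labels and the sample records are hypothetical, and, like the SAS code above, it averages within calendar periods rather than over a trailing seven-day window.

```python
from itertools import groupby


def running_stats(records, key):
    """Yield (record, count, total, average) per record.

    The running count and total restart whenever key(record) changes,
    which mirrors the first.<period> reset in the SAS data step.
    Records must already be sorted so that equal keys are adjacent.
    """
    for _, group in groupby(records, key=key):
        count, total = 0, 0.0
        for rec in group:
            count += 1
            total += rec["var1"]
            yield rec, count, total, total / count


# Hypothetical sample data: id, a week label, and the value to average.
records = [
    {"id": 1, "week": "W44", "var1": 1},
    {"id": 1, "week": "W44", "var1": 3},
    {"id": 1, "week": "W45", "var1": 5},
    {"id": 2, "week": "W45", "var1": 2},
]
averages = [
    avg for _, _, _, avg in running_stats(records, key=lambda r: (r["id"], r["week"]))
]
print(averages)
```

Grouping on the pair (id, week) is the design choice that keeps one subject's values from leaking into another's running totals.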

Compare value to max of variable

I have a dataset with a lot of lines and I'm studying a group of variables.
For each line and each variable, I want to know if the value is equal to the max for this variable or more than or equal to 10.
Expected output (with input as all variables without _B) :
(you can replace T/F by TRUE/FALSE or 1/0 as you wish)
+----+------+--------+------+--------+------+--------+
| ID | Var1 | Var1_B | Var2 | Var2_B | Var3 | Var3_B |
+----+------+--------+------+--------+------+--------+
| A | 1 | F | 5 | F | 15 | T |
| B | 1 | F | 5 | F | 7 | F |
| C | 2 | T | 5 | F | 15 | T |
| D | 2 | T | 6 | T | 10 | T |
+----+------+--------+------+--------+------+--------+
Note that for Var3, the max is 15 but since 15>=10, any value >=10 will be counted as TRUE.
Here is what I've made up so far (doubt it will be any help but still) :
%macro pleaseWorkLittleMacro(table, var, suffix);
proc means NOPRINT data=&table;
var &var;
output out=Varmax(drop=_TYPE_ _FREQ_) max=;
run;
proc transpose data=Varmax out=Varmax(rename=(COL1=varmax));
run;
data Varmax;
set Varmax;
varmax = ifn(varmax<10, varmax, 10);
run; /* this outputs the max for every column, but how to use it afterward ? */
%mend;
%pleaseWorkLittleMacro(MY_TABLE, VAR1 VAR2 VAR3 VAR4, _B);
I have the code in R, works like a charm but I really have to translate it to SAS :
#in a for loop over variable names, db is my data.frame, x is the
#current variable name and x2 is the new variable name
x.max = max(db[[x]], na.rm=T)
x.max = ifelse(x.max<10, x.max, 10)
db[[x2]] = (db[[x]] >= x.max) %>% mean(na.rm=T) %>% percent(2)
An old-school solution would be to read the data twice in one data step:
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
run;
data my_input;
set expect;
keep ID Var1 Var2 Var3 ;
proc print;
run;
* It is a good habit to declare the most volatile things in your code as macro variables;
%let varList = Var1 Var2 Var3;
%let markList = Var1_B Var2_B Var3_B;
%let varCount = 3;
* Read the data twice;
data my_result;
set my_input (in=maximizing)
my_input (in=marking);
* Declare and initialize arrays;
format &markList $1.;
array _vars [&&varCount] &varList;
array _maxs [&&varCount] _temporary_;
array _B [&&varCount] &markList;
if _N_ eq 1 then do _varNr = 1 to &varCount;
_maxs(_varNr) = -1E15;
end;
* While reading the first time, calculate the maxima;
if maximizing then do _varNr = 1 to &varCount;
if _vars(_varNr) gt _maxs(_varNr) then _maxs(_varNr) = _vars(_varNr);
end;
* While reading the second time, mark the maxima and any value of 10 or more;
if marking then do _varNr = 1 to &varCount;
if _vars(_varNr) eq _maxs(_varNr) or _vars(_varNr) ge 10
then _B(_varNr) = 'T';
else _B(_varNr) = 'F';
end;
* Drop all variables starting with an underscore, i.e. all my working variables;
drop _:;
* Only keep results when reading for the second time;
if marking;
run;
* Check results;
proc print;
var ID Var1 Var1_B Var2 Var2_B Var3 Var3_B;
proc compare base=expect compare=my_result;
run;
This is quite simple to solve in sql
proc sql;
create table my_result as
select *
, (Var1 eq max_Var1) as Var1_B
, (Var2 eq max_Var2) as Var2_B
, (Var3 eq max_Var3) as Var3_B
from my_input
, (select max(Var1) as max_Var1
, max(Var2) as max_Var2
, max(Var3) as max_Var3
from my_input)
;
quit;
(Not tested, as our SAS server is currently down, which is the reason I pass my time on Stack Overflow)
If you need that for a lot of variables, consult the system view VCOLUMN of SAS:
proc sql noprint;
/* build the two generated column lists; INTO must come before FROM */
select '('|| trim(name) ||' eq max_'|| trim(name) ||') as '|| trim(name) ||'_B'
, 'max('|| trim(name) ||') as max_'|| trim(name)
into : create_B separated by ', '
, : select_max separated by ', '
from sasHelp.vcolumn
where libName eq 'WORK'
and memName eq 'MY_INPUT'
and type eq 'num'
and upcase(name) like 'VAR%'
;
/* use the generated lists instead of hard-coded columns */
create table my_result as
select *, &create_B
from my_input
, (select &select_max from my_input)
;
quit;
(Again not tested)
After Proc MEANS computes the maximum value for each column you can run a data step that combines the original data with the maximums.
data want;
length
ID $1 Var1 8 Var1_B $1. Var2 8 Var2_B $1. Var3 8 Var3_B $1. var4 8 var4_B $1;
input
ID Var1 Var1_B Var2 Var2_B Var3 Var3_B ; datalines;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
run;
data have;
set want;
drop var1_b var2_b var3_b var4_b;
run;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_) max= / autoname;
run;
The neat thing about the VAR statement is that you can easily list numerically suffixed variable names. The autoname option automatically appends _<statistic> (here _Max) to the names of the variables in the output.
Now combine the maxes with the original (have). The set varmax automatically retains the *_max variables, and they will not get overwritten by values from the original data because the varmax variable names are different.
Arrays are used to iterate over the values and apply the business logic of flagging a row as at max or above 10.
data want;
if _n_ = 1 then set varmax; * read maxes once from MEANS output;
set have;
array values var1-var4;
array maxes var1_max var2_max var3_max var4_max;
array flags $1 var1_b var2_b var3_b var4_b;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
The difficult part above is that the MEANS output creates variables that can not be listed using the var1 - varN syntax.
When you adjust the naming convention to have all your conceptually grouped variable names end in numeric suffixes the code is simpler.
* number suffixed variable names;
* no autoname, group rename on output;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_ rename=(var1-var4=max_var1-max_var4)) max= ;
run;
* all arrays simpler and use var1-varN;
data want;
if _n_ = 1 then set varmax;
set have;
array values var1-var4;
array maxes max_var1-max_var4;
array flags $1 flag_var1-flag_var4;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
You can use macro code or arrays, but it might just be easier to transform your data into a tall variable/value structure.
So let's input your test data as an actual SAS dataset.
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
First you can use PROC TRANSPOSE to make the tall structure.
proc transpose data=expect out=tall ;
by id ;
var var1-var3 ;
run;
Now your rules are easy to apply in PROC SQL step. You can derive a new name for the flag variable by appending a suffix to the original variable's name.
proc sql ;
create table want_tall as
select id
, cats(_name_,'_Flag') as new_name
, case when col1 >= min(max(col1),10) then 'T' else 'F' end as value
from tall
group by 2
order by 1,2
;
quit;
Then just flip it back to horizontal and merge with the original data.
proc transpose data=want_tall out=flags (drop=_name_);
by id;
id new_name ;
var value ;
run;
data want ;
merge expect flags;
by id;
run;
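The rule that the array and transpose answers above apply, namely flagging a value when it is at least min(column maximum, 10), can be cross-checked with a short Python sketch. The function name `flag_columns` and the `cap` parameter are made up for illustration, and the sketch assumes there are no missing values.

```python
def flag_columns(rows, columns, cap=10):
    """Add a <column>_B flag: 'T' when value >= min(column max, cap), else 'F'."""
    # First pass: one threshold per column.
    thresholds = {c: min(max(r[c] for r in rows), cap) for c in columns}
    # Second pass: compare every value with its column threshold.
    return [
        {**r, **{f"{c}_B": "T" if r[c] >= thresholds[c] else "F" for c in columns}}
        for r in rows
    ]


# Input values taken from the expected-output table in the question.
rows = [
    {"ID": "A", "Var1": 1, "Var2": 5, "Var3": 15},
    {"ID": "B", "Var1": 1, "Var2": 5, "Var3": 7},
    {"ID": "C", "Var1": 2, "Var2": 5, "Var3": 15},
    {"ID": "D", "Var1": 2, "Var2": 6, "Var3": 10},
]
flagged = flag_columns(rows, ["Var1", "Var2", "Var3"])
for r in flagged:
    print(r["ID"], r["Var1_B"], r["Var2_B"], r["Var3_B"])
```

As in the SAS answers, the data are read twice: once to find each maximum and once to set the flags.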

add value of another variable to variable in loop

I'd like to do something like that
gen var1 = 0
gen var2 = 0
forval x = 1/5 {
replace var1 = `x'
replace var2 = var2 + var1
}
Namely I want to replace var2 by its old value plus var1. In a programming language like Python this works but in Stata it doesn't.
My goal is not to create a lot of variables! That's why I want to update the variable var2 in every cycle of the loop. If my loop ran from 1 to 100, I wouldn't want to create 100 variables in order to get a nice solution.
In my example, in the first cycle of the loop, var1 becomes 1 and var2 also becomes 1. In the second cycle var1 should be 2 and var2 should become 3, since it adds the old value of var2 (which is 1) to the new value of var1, which is 2. In the third cycle var1 should become 3 and var2 should become 3 + 3, which is the old value of var2 plus the value of var1 in this cycle. And so on. That's what I want to have!
Could someone please help me?
no need for a loop:
clear all
set obs 100
gen id = _n
tsset id
gen var1 = _n - 1
gen var2 = 0
replace var2 = l.var2 + l.var1 if _n > 1
If you just want to know the "end-result", i.e. the values for var1 and var2 at the end of the loop, then you can use Mata:
mata
a = 0
b = 0
for (i = 1 ; i <= 100; i++) {
a = i
b = b + a
}
a
b
end
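Since the question mentions Python, here is the intended recurrence written out as a short Python sketch. It only illustrates the arithmetic described in the question; the `history` list is added purely to show the intermediate values.

```python
var1, var2 = 0, 0
history = []  # (var1, var2) after each cycle, kept only for inspection
for x in range(1, 6):
    var1 = x            # var1 takes the loop value
    var2 = var2 + var1  # var2 accumulates the old var2 plus the new var1
    history.append((var1, var2))
print(history)
```

After each cycle var2 holds the running sum 1 + 2 + ... + x, so no extra variables are needed.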

Read wide file with repeated variables in SAS

I have input data shaped like this:
var1 var2 var3 var2 var3 ...
where each row has one value of var1 followed by a varying number of var2-var3 pairs. After reading this input, I want the data set to have multiple records for each var1: one record for each pair of var2/var3.
So if the first two lines of the input file are
A 1 2 7 3 4 5
B 2 3
this would generate 4 records:
A 1 2
A 7 3
A 4 5
B 2 3
Is there a simple/elegant way to do this? I've tried reading each row as one long variable and splitting with scan, but it's getting messy and I'm betting there's a really easy way to do this.
I'm sure there are many ways to do this, but here is the first that comes to my mind:
data want(keep=var1 var2 var3);
infile 'path-to-your-file';
input;
var1 = input(scan(_infile_,1),$8.);
i = 1;
do while(i ne 0);
i + 1;
var2 = input(scan(_infile_,i),8.);
i + 1;
var3 = input(scan(_infile_,i),8.);
if var3 = . then i = 0;
else output;
end;
run;
_infile_ is an automatic SAS variable that contains the currently read record. Use an appropriate informat for each variable you read.
Like this (conditional input with jumping back):
data test;
infile datalines missover;
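input var1 $ var2 $ var3 $ temp $ @; /* trailing @ holds the line for the next INPUT */
output;
do while(not missing(temp));
input +(-2) var2 $ var3 $ temp $ @; /* step back to re-read temp as var2; assumes one-character values */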
output;
end;
drop temp;
datalines;
A 1 2 7 3 4 5
B 2 3
;
run;
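The reshaping both answers perform can be sketched in Python to check the expected records. This is an illustration, not a SAS replacement: the function name `expand_pairs` is made up, values are assumed to be whitespace-separated integers, and, as in the scan-based answer, a trailing value without a partner is dropped.

```python
def expand_pairs(line):
    """Turn 'key v2 v3 v2 v3 ...' into one (key, v2, v3) record per pair."""
    tokens = line.split()
    if not tokens:  # ignore blank lines
        return []
    key, rest = tokens[0], tokens[1:]
    # zip pairs up consecutive tokens; an unpaired trailing token is dropped.
    return [(key, int(a), int(b)) for a, b in zip(rest[0::2], rest[1::2])]


lines = ["A 1 2 7 3 4 5", "B 2 3"]
records = [rec for line in lines for rec in expand_pairs(line)]
for rec in records:
    print(*rec)
```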