I'm new to PROC OPTMODEL and would appreciate any help with the following problem.
My dataset looks like this:
data my_data;
input A B C;
cards;
0 240 3
3.4234 253 2
0 258 7
0 272 4
0 318 7
0 248 8
0 260 2
0.2555 305 5
0 314 5
1.7515 235 7
32 234 4
0 301 3
0 293 5
0 302 12
0 234 2
0 258 4
0 289 2
0 287 10
0 313 3
0.7725 240 7
0 268 3
1.4411 286 9
0 234 13
0.0474 318 2
0 315 4
0 292 5
0.4932 272 3
0 288 4
0 268 4
0 284 6
0 270 4
50.9188 293 3
0 272 3
0 284 2
0 307 3
;
run;
There are 3 variables (A, B, C), and I want to classify the observations into three classes (H, M, L) based on them.
For class H, I want to maximize A and minimize B and C.
For class M, I want middle (median) values of A, B and C.
For class L, I want to minimize A and maximize B and C.
The constraint is that fewer than 5% of all observations may be classified as H, and fewer than 7% as M.
The final goal is to find the cut-offs of A, B and C that separate the observations into the three classes.
Since the three classes are equally weighted, I scaled the variables first and created a risk variable, where risk = A+(1-B)+(1-C).
Thanks in advance for any help.
My SAS code:
proc stdize data=my_data out=my_data1 method=RANGE;
var A B C;
run;
data new;
set my_data1;
risk = A+(1-B)+(1-C);
run;
proc sort data=new out=range;
by risk;
run;
proc optmodel;
/* read data */
set CUTOFF;
/* str risk_level {CUTOFF}; */
num a {CUTOFF};
num b {CUTOFF};
num c {CUTOFF};
read data my_data1 into CUTOFF=[_n_] a=A b=B c=C;
impvar risk{p in CUTOFF} = a[p]+(1-b[p])+(1-c[p]);
var indh {CUTOFF} binary;
var indmh {CUTOFF} binary;
var indo {CUTOFF} binary;
con sum{p in CUTOFF} indh[p] le 10;
con sum{p in CUTOFF} indmh[p] le 6;
con sum{p in CUTOFF} indo[p] le 19;
con class{p in CUTOFF}:indh[p]+indmh[p]+indo[p] le 1;
max new = sum{p in CUTOFF}(10*indh[p]+4*indmh[p]+indo[p])*risk[p];
solve;
print a b c risk indh indmh indo new;
quit;
So now my problem is how to find the minimum risk value in each class. Thanks!
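One possible way to read off the minimum risk per class after the solve is to take a MIN over the observations whose indicator came out as 1. The sketch below is not the poster's model: the set OBS, the five risk scores, and the limit of two observations are made-up toy values, and only one class indicator (indh) is shown.

```sas
proc optmodel;
   /* hypothetical toy data: five observations with made-up risk scores */
   set OBS = 1..5;
   num risk {OBS} = [0.9 0.2 0.7 0.4 0.1];

   var indh {OBS} binary;                      /* 1 if the observation is put in class H */
   con limit_h: sum {p in OBS} indh[p] <= 2;   /* toy limit: at most two observations in H */
   max total = sum {p in OBS} risk[p]*indh[p];
   solve;

   /* smallest risk among the observations assigned to H;
      the binary solution value is compared with 0.5 to guard against rounding */
   print (min {p in OBS: indh[p].sol > 0.5} risk[p]);
quit;
```

In the model above, the same PRINT expression would be repeated over CUTOFF for indh, indmh and indo. Note that a class with no assigned observations gives a MIN over an empty set, so it is worth checking the indicator sums first.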
I would like to create a new column whose values equal the average of values in other columns. But the number of columns I am taking the average of is dictated by a variable. My data look like this, with 'length' dictating the number of columns x1-x5 that I want to average:
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
run;
I would like to end up with the below where 'avg' is the average of the specified columns.
data want;
input ID $ length avg;
datalines;
A 5 87
B 4 156.5
C 3 558.3
D 5 39.6
;
run;
Any suggestions? Thanks! Sorry about the awful title, I did my best.
You have to do a little more work, since mean(of x[1]-x[length]) is not valid syntax. Instead, save the values to a temporary array, take the mean of that array, and reset it at each row. For the sample data, the temporary array holds the following values row by row:
tmp1 tmp2 tmp3 tmp4 tmp5
8 234 79 36 78
8 26 589 3 .
19 892 764 . .
72 48 65 4 9
data want;
set have;
array x[*] x:;
array tmp[5] _temporary_;
/* Reset the temp array */
call missing(of tmp[*]);
/* Save each value of x to the temp array */
do i = 1 to length;
tmp[i] = x[i];
end;
/* Get the average of the non-missing values in the temp array */
avg = mean(of tmp[*]);
drop i;
run;
Use an array to average it by summing up the array for the length and then dividing by the length.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
set have;
array x(5) x1-x5;
sum=0;
do i=1 to length;
sum + x(i);
end;
avg = sum/length;
keep id length avg;
format avg 8.2;
run;
@Reeza's solution is good, but when x contains missing values it will not always produce the desired result. It is better to use the SUM function. The code is also slightly simplified:
data want (drop=i s);
set have;
array a{*} x:;
s=0; nm=0;
do i=1 to length;
if missing(a{i}) then nm+1;
s=sum(s,a{i});
end;
avg=s/(length-nm);
run;
Rather than writing your own code to calculate means you could just calculate all of the possible means and then just use an index into an array to select the one you need.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
set have;
array means[5] ;
means[1]=x1;
means[2]=mean(x1,x2);
means[3]=mean(of x1-x3);
means[4]=mean(of x1-x4);
means[5]=mean(of x1-x5);
want = means[length];
run;
From a cumulative episode count (if time intervals are less than 10 days, it is considered one episode), I want to calculate a “wide” and “long” version of running episode count based on class by ID.
This is what my data looks like right now.
id Class Date Obsvn Episode_Sum
9 Wide 3/10/2012 1 1
9 Wide 3/12/2012 2 1
9 Wide 7/1/2012 111 2
9 Wide 7/3/2012 2 2
108 Wide 3/31/2011 1 1
108 Long 3/31/2011 1 1
108 Wide 4/17/2011 17 2
108 Wide 6/24/2011 68 3
108 Wide 6/16/2012 358 4
108 Wide 7/20/2012 34 5
108 Wide 7/27/2012 7 5
I achieved the running count by this code:
data want (drop=lag); set have;
by id date;
format lag mmddyy10.;
lag=lag(date);
if first.id then obsvn=1;
else obsvn=max(intck("Day", Lag, date),1);
if first.id then episode_sum=1;
else if obsvn>10 then episode_sum+1;
run;
I want my data to look like this:
id Class Date Obsvn Sum Wide Long
9 Wide 3/10/2012 1 1 1 0
9 Wide 3/12/2012 2 1 1 0
9 Wide 7/1/2012 111 2 2 0
9 Wide 7/3/2012 2 2 2 0
108 Wide 3/31/2011 1 1 1 0
108 Long 3/31/2011 1 1 1 1
108 Wide 4/17/2011 17 2 2 1
108 Wide 6/24/2011 68 3 3 1
108 Wide 6/16/2012 358 4 4 1
108 Wide 7/20/2012 34 5 5 1
108 Wide 7/27/2012 7 5 5 1
But I am getting this:
id Class Date Obsvn Sum Wide Long
9 Wide 3/10/2012 1 1 1 0
9 Wide 3/12/2012 2 1 1 0
9 Wide 7/1/2012 111 2 2 0
9 Wide 7/3/2012 2 2 **1** 0
108 Wide 3/31/2011 1 1 1 **1**
108 Long 3/31/2011 1 1 1 1
108 Wide 4/17/2011 17 2 2 1
108 Wide 6/24/2011 68 3 3 1
108 Wide 6/16/2012 358 4 4 1
108 Wide 7/20/2012 34 5 5 1
108 Wide 7/27/2012 7 5 **1** 1
This is my code to create the episodes by wide and long. I am trying to account for when each ID switches class. How do I achieve this?
/*Calculating Long*/
if (first.id and class in ("Long")) then Episode_Long=1;
else if obsvn>10 and class in ("Long") then Episode_Long+1;
retain Episode_Long;
if (obsvn<10 and class in ("Long")) then Episode_Long=1;
if class not in ("Long") then do;
if first.id and class not in ("Long") then Episode_Long=0;
retain Episode_Long;
end;
/*Calculating Wide */
if (obsvn<10 and class in ("Wide")) then Episode_Wide=1 ;
if (first.id and class in ("Wide")) then Episode_Wide=1;
else if obsvn>10 and class in ("Wide") then Episode_Wide+1;
retain Episode_Wide;
The tricky part is that you have two records for the same DATE in the second ID group. So you want to keep track of that when calculating the change in days.
Here is one way. First let's enter your source data (and desired results).
data have ;
input id Class $ Date :mmddyy. EObsvn ESum EWide ELong ;
format date yymmdd10.;
cards;
9 Wide 3/10/2012 1 1 1 0
9 Wide 3/12/2012 2 1 1 0
9 Wide 7/1/2012 111 2 2 0
9 Wide 7/3/2012 2 2 2 0
108 Wide 3/31/2011 1 1 1 0
108 Long 3/31/2011 1 1 1 1
108 Wide 4/17/2011 17 2 2 1
108 Wide 6/24/2011 68 3 3 1
108 Wide 6/16/2012 358 4 4 1
108 Wide 7/20/2012 34 5 5 1
108 Wide 7/27/2012 7 5 5 1
;
You might want to find the dates where WIDE or LONG gaps exist first.
data long ;
set have ;
by id date;
where class='Long';
if first.date;
lag=lag(date);
if first.id then call missing(lag,obsvn);
else obsvn=max(intck("Day", Lag, date),1);
lflag = missing(lag) or obsvn > 10 ;
keep id date lflag ;
run;
data wide ;
set have ;
by id date;
where class='Wide';
if first.date;
lag=lag(date);
if first.id then call missing(lag,obsvn);
else obsvn=max(intck("Day", Lag, date),1);
wflag = missing(lag) or obsvn > 10 ;
keep id date wflag ;
run;
Then merge it back onto the source by date and calculate your counters.
data want ;
merge have wide long ;
by id date;
if first.date then do ;
lag=lag(date);
format lag yymmdd10.;
if first.id then call missing(lag,obsvn);
else obsvn=max(intck("Day", Lag, date),1);
retain lag obsvn;
end;
if first.id then call missing(sum,wide,long);
if missing(lag) or obsvn > 10 then sum+first.date ;
wide + (wflag and first.date);
long + (lflag and first.date);
run;
My dataset is like this
bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103
0 0 0 0 0 0 0 0
1 1 0 0 0 1 0 0
2 3 0 3 0 1 6 3
3 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0
5 4 0 4 0 4 8 1
6 8 0 8 0 8 10 8
7 0 0 0 0 0 0 0
8 7 0 7 0 0 7 3
what I want is this
bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103
0 23 4 22 0 14 31 15
1 23 4 22 0 14 31 15
2 22 4 22 0 13 31 15
3 19 4 19 0 12 25 12
4 19 4 19 0 12 25 12
5 19 0 19 0 12 25 12
6 15 0 15 0 8 17 11
7 7 0 7 0 0 7 3
8 7 0 7 0 0 7 3
Here the column total is the value for the bucket 0 and bucket 1 rows. For bucket 2, the value in column D_201009 is the total minus the original bucket 1 value (1); for bucket 3 it is the previous result (the lag value) minus the original bucket 2 value (3), and so on. Each result column should keep the original column name. I wrote the code below to do this for one column.
data test;
input bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103;
datalines;
0 0 0 0 0 0 0 0
1 1 0 0 0 1 0 0
2 3 0 3 0 1 6 3
3 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0
5 4 0 4 0 4 8 1
6 8 0 8 0 8 10 8
7 0 0 0 0 0 0 0
8 7 0 7 0 0 7 3
;
run;
Saving these column names in a macro variable
proc contents data = test
out = vars(keep = varnum name)
noprint;
run;
proc sql noprint;
select distinct name
into :orderedvars2 separated by ' '
from vars
where varnum >=2
order by varnum;
quit;
Finding sum of one column only
proc sql;
select sum(D_201009) into :total from test;
quit;
Using lag to perform
data result(drop= D_201009 lag_D_201009 rename=(sum=D_201009));
set test;
retain sum;
if bucket < 2 then sum = &total;
sum = sum(sum, -lag(D_201009));
run;
How do I change the code to work for all columns, whose names are stored in the macro variable &orderedvars2?
The way I'd approach it would be to transpose the data structure to a more useful data structure; then you don't have to use macro variables. You can use BY processing instead, and no lags.
The way I create the final output is to transpose the initial dataset so you have one row per bucket/D_var, then sort by the D_vars (_NAME_ holds that). Then use a Double DoW loop in order to first calculate the sum, and then to subtract the value. Note I don't have to use Retain or Lag here, I can just directly operate on the value since I'm in a DoW loop. I output before subtracting since that's what you seem to want. Then I retranspose back.
This might not be the fastest option if you have very large data, since it goes through several steps; if you do, you should be using a more efficient algorithm anyway. But it's likely the least fiddly if you don't always have the same columns.
proc transpose data=test out=test_t;
by bucket;
run;
proc sort data=test_t;
by _name_ bucket;
run;
data want_t;
do _n_ = 1 by 1 until (last._name_);
set test_t;
by _name_ bucket;
sum_var = sum(sum_var,col1);
end;
do _n_ = 1 by 1 until (last._name_);
set test_t;
by _name_ bucket;
output;
sum_var = sum_var - col1;
end;
run;
proc sort data=want_t;
by bucket _name_;
run;
proc transpose data=want_t out=want;
by bucket;
id _name_;
var sum_var;
run;
Use proc summary to get sum of each variable, then define multiple arrays.
proc summary data=test;
var D:;
output out=sum(drop=_:) sum=/autoname;
run;
data want;
set test;
if _n_=1 then set sum;
array var1 D_201009--D_201103;
array var2 D_201009_sum--D_201103_sum;
array var3 _D_201009 _D_201010 _D_201011 _D_201012 _D_201101 _D_201102 _D_201103;
array temp (7) _temporary_;
retain temp;
do i=1 to dim(var1);
lag=lag(var1(i));
if bucket<2 then var3(i)=var2(i);
else var3(i)=sum(temp(i),-lag);
temp(i)=var3(i);
end;
drop D: lag i;
run;
If I understand this right you want to sum the column and then subtract the value of each observation from the total?
Getting totals is easy, just use proc summary.
Then combine it with the original data. Here is a way that will work without having to worry about the actual variable names. In this program it will sum all variables that start with d_ but you could use any variable list you want. If you have more than 100 variables then change the dimension of the temporary array.
%let varlist=d_:;
* Get sums into variables with same names ;
proc summary data=have ;
var &varlist ;
output out=total sum= ;
run;
data want ;
set have(obs=0) /* Set variable order */
total(keep=&varlist) /* Get totals */
have(keep=&varlist) /* Get lagged variables */
;
array vars &varlist ;
array total (100) _temporary_;
set have (drop=&varlist); /* Get non-lagged variables */
do i=1 to dim(vars);
if _n_>1 then vars(i)=total(i)-vars(i);
total(i)=vars(i);
end;
drop i;
run;
If you have missing values you might want to add this line of code at beginning of the DO loop:
vars(i)=coalesce(vars(i),0);
I have a data set which essence is the following
data have;
input Name $ ab gh vz iz jh pq ch km eo lk;
datalines;
adam 7 8 7 0 0 0 0 0 0 0
bob 0 1 0 3 4 6 0 1 6 0
clint 0 0 0 5 4 3 1 0 0 2
;
run;
Now I would like to count how many times I have a number greater than zero in the variables iz, jh, ch and km. The result should look like this
/* want
Name ab gh vz iz jh pq ch km eo lk count_of_iz_jh_ch_km
adam 7 8 7 0 2 3 0 0 0 0 1
bob 0 1 0 3 0 6 0 1 6 0 2
clint 5 0 0 5 4 3 1 2 0 2 4
*/
I would greatly appreciate any help since I wasn't successful searching the internet for a solution.
Gerit
The below code will initialize the required variables from have into an array called vars, then for each row, count every time one of these variables is > 0.
data want;
set have;
array vars[*] iz jh ch km;
count_of_iz_jh_ch_km = 0;
do i = 1 to dim(vars);
if(vars[i] > 0) then count_of_iz_jh_ch_km+1;
end;
drop i;
run;
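As an alternative sketch, the same count can be written without a loop, since a SAS comparison evaluates to 1 when true and 0 when false. The output dataset name want2 is only illustrative; the input data are repeated from the question so the step runs on its own.

```sas
data have;
   input Name $ ab gh vz iz jh pq ch km eo lk;
   datalines;
adam 7 8 7 0 0 0 0 0 0 0
bob 0 1 0 3 4 6 0 1 6 0
clint 0 0 0 5 4 3 1 0 0 2
;
run;

data want2;
   set have;
   /* each comparison yields 1 (true) or 0 (false);
      a missing value compares as less than zero, so it adds 0 */
   count_of_iz_jh_ch_km = (iz > 0) + (jh > 0) + (ch > 0) + (km > 0);
run;
```

The array version scales better when the list of variables is long or changes often; the one-line form is mainly convenient for a handful of fixed variables.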
Our university is forcing us to perform the old school chi square test using PROC FREQ (I am aware of the options with proc univariate).
I have generated one theoretical exponential distribution with Beta=15 (and written down the values laboriously), and I've generated 10000 random variables which have an exponential distribution, with beta=15.
I try to first enter the frequencies of my random variables (in each interval) via the datalines command:
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
This seems to work.
I then try to compare these values to the theoretical values, using the chi square test in proc freq (the one we are supposed to use)
As follows:
proc freq data=expofaktiska;
weight count;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
I get the following error:
ERROR: The number of TESTP values does not equal the number of levels. For the table of number,
there are 24 levels and 26 TESTP values.
This may be because two intervals contain 0 observations. I don't really see a way around this.
Also, I don't get the chi-square test in the results viewer, nor the "test probability"; I only get the frequency and cumulative frequency of the random variables.
What am I doing wrong? Do the theoretical and actual distributions need to have the same form (probabilities or frequencies)?
We are using SAS 9.4
Thanks in advance!
/Magnus
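You need the ZEROS option on the WEIGHT statement. By default PROC FREQ ignores observations with a weight of zero, so intervals 22 and 25 are dropped and only 24 levels remain; with ZEROS they are kept as levels.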
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
proc freq data=expofaktiska;
weight count / zeros;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
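The TESTP values do not have to be written down by hand; they can be generated in a DATA step. The sketch below assumes that interval k covers [5*(k-1), 5*k) and that beta=15 is the mean of the exponential distribution. Both are inferred from the listed probabilities (0.28347 = 1 - exp(-5/15)), not stated in the question. The dataset name expoteori and the variables p and total are illustrative.

```sas
/* Assumption: interval k covers [5*(k-1), 5*k) and beta=15 is the mean (scale) */
data expoteori;
   do number = 1 to 26;
      /* P(lower <= X < upper) = exp(-lower/15) - exp(-upper/15) */
      p = exp(-5*(number-1)/15) - exp(-5*number/15);
      total + p;   /* running sum of the interval probabilities */
      output;
   end;
   format p total 8.5;
run;

proc print data=expoteori noobs;
run;
```

Under these assumptions the running total ends at 1 - exp(-130/15), about 0.99983, because the tail above 130 is not covered by any interval. That is worth keeping in mind, since TESTP= expects proportions that sum to 1.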