SAS: column position rearrangement

I would like to rearrange the column position of variables depending on the value of ft. For example, if ft=1, put o2 and o5 at @33 and @34; if ft=2, put o2 and o5 at @35 and @36, and so on. But I think I got the loop and array wrong below. Can someone point out what I did wrong?
data fttry1;
input ft m1 o2 m3 m4 o5;
datalines;
1 2 3 4 5 6
2 7 8 9 10 11
3 12 13 14 15 20
4 16 17 18 19 21
;
run;
data fttry2;
set fttry1;
file print notitles;
put @10 ft
@30 M1
@31 M3-M4;
do ft =1 to 4;
array ftposition[2] o2 o5;
do i=1 to 2;
do l=33 to 34 by 2;
put @l ftposition[i];
end;
end;
end;
run;

Does this work?
data fttry1;
input ft m1 o2 m3 m4 o5;
datalines;
1 2 3 4 5 6
2 7 8 9 10 11
3 12 13 14 15 20
4 16 17 18 19 21
;
run;
data fttry2;
set fttry1;
file print notitles;
cnt + 1;
put @10 ft
@30 M1
@31 M3-M4 @;
o2_loc=(ft+cnt) + 32;
o5_loc=(ft+cnt) + 33;
put @o2_loc o2 @;
put @o5_loc o5 ;
run;
EDIT
A trailing @ sign at the end of a PUT statement holds the current output line, so the next PUT statement continues on the same line instead of starting a new one.
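A minimal sketch of the line-hold behaviour, separate from the data in the question: in SAS, a trailing @ on a PUT statement keeps the output line open for the next PUT. The literal strings and column numbers here are only illustrative.

```sas
data _null_;
  file print notitles;
  /* The trailing @ holds the line, so no new line is started here */
  put @1 'left' @;
  /* This PUT continues on the same line, starting at column 20 */
  put @20 'right';
run;
```

Without the trailing @ on the first PUT, the two strings would be written on separate lines.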

As I mentioned in the comments, two-digit values are problematic, so I don't include m3 and m4 in the output. Use ft as the pointer offset, since it behaves like a serial number (does it?). Take the length of o2 to make sure o2 and o5 do not overlap.
data fttry1;
input ft m1 o2 m3 m4 o5;
datalines;
1 2 3 4 5 6
2 7 8 9 10 11
3 12 13 14 15 20
4 16 17 18 19 21
;
run;
data fttry2;
set fttry1;
file print notitles;
by ft;
put @10 ft
@30 M1
@(31+2*ft) o2
@(31+2*ft+length(cats(o2))) o5;
run;

Related

Average over number of variables where number of variables is dictated by separate column

I would like to create a new column whose values equal the average of values in other columns. But the number of columns I am taking the average of is dictated by a variable. My data look like this, with 'length' dictating the number of columns x1-x5 that I want to average:
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
run;
I would like to end up with the below where 'avg' is the average of the specified columns.
data want;
input ID $ length avg;
datalines;
A 5 87
B 4 156.5
C 3 558.3
D 5 39.6
;
run;
Any suggestions? Thanks! Sorry about the awful title, I did my best.
You have to do a little more work, since mean(of x[1]-x[length]) is not valid syntax. Instead, copy the first length values into a temporary array, take the mean of that array, and reset it on each row. For the example data, the temporary array holds these values on each row:
tmp1 tmp2 tmp3 tmp4 tmp5
8    234  79   36   78
8    26   589  3    .
19   892  764  .    .
72   48   65   4    9
The data step:
data want;
set have;
array x[*] x:;
array tmp[5] _temporary_;
/* Reset the temp array */
call missing(of tmp[*]);
/* Save each value of x to the temp array */
do i = 1 to length;
tmp[i] = x[i];
end;
/* Get the average of the non-missing values in the temp array */
avg = mean(of tmp[*]);
drop i;
run;
Use an array: sum the first length elements in a loop, then divide by length.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
set have;
array x(5) x1-x5;
sum=0;
do i=1 to length;
sum + x(i);
end;
avg = sum/length;
keep id length avg;
format avg 8.2;
run;
@Reeza's solution is good, but if x contains missing values it will not always produce the desired result, because it still divides by length. It is better to use the SUM function and count the missing values. The code is also slightly simplified:
data want (drop=i s);
set have;
array a{*} x:;
s=0; nm=0;
do i=1 to length;
if missing(a{i}) then nm+1;
s=sum(s,a{i});
end;
avg=s/(length-nm);
run;
Rather than writing your own code to calculate means you could just calculate all of the possible means and then just use an index into an array to select the one you need.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
set have;
array means[5] ;
means[1]=x1;
means[2]=mean(x1,x2);
means[3]=mean(of x1-x3);
means[4]=mean(of x1-x4);
means[5]=mean(of x1-x5);
want = means[length];
run;

How to select a percentage of values from a column in SAS?

I have 70 databases of different sizes (same number of columns, different numbers of rows).
I need to get the highest 25% and the lowest 25% of the values of a given column, VAR1.
I have:
id VAR1
1 10
2 -5
3 -12
4 7
5 12
6 7
7 -9
8 -24
9 0
10 6
11 -18
12 22
After sorting by VAR1, I need to select the rows (all columns) containing the 3 smallest and the 3 largest values (25% from each end). Sorted, the data look like this:
id VAR1
8 -24
11 -18
3 -12
7 -9
2 -5
9 0
10 6
4 7
6 7
1 10
5 12
12 22
I need to keep in the database the rows (all columns) that contain the VAR1 equal to -24, -18, -12, 10, 12 and 22.
id VAR1
8 -24
11 -18
3 -12
1 10
5 12
12 22
What I’ve been thinking:
Order column VAR1 in ascending order;
Create a numbered column from 1 to N (n=_N_) - in this case, N=12;
I do a=N*0.25 (to have the value that represents 25%);
I do b=N-a (to have the value that represents the "last" 25%).
So, I can use keep:
if N<a.... I will have the first 25% (the smallest).
if N>b.... I will have the last 25% (the largest).
I can calculate a and b, but I cannot get the maximum value of N (12 in this case).
I will repeat this for all 70 databases, and I would rather not have to enter this maximum value every time (it varies from one database to another).
I need help to "fix" the maximum value (N) without having to type it (even if it is repeated on every row of an auxiliary column).
Or if there’s some better way to get those 25% from each end.
My code:
proc sort data=have; by VAR1; run;
data want; set have;
seq=_N_;
N=max(seq); *N=max. value of lines. (I stopped here and don’t know if below is right);
a=N*0.25;
b=N-b;
if N<a;
if N>b;
run;
Thank you very much!
PROC RANK computes percent ranks that you can use to select the desired rows.
Example:
data have1 have2 have3 have4 have5;
do id = 1 to 100;
X = ceil(rand('normal', 0, 10));
if id < 60 then output have1;
if id < 70 then output have2;
if id < 80 then output have3;
if id < 90 then output have4;
if id < 100 then output have5;
end;
run;
proc rank data=have1 percent out=want1(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
proc rank data=have2 percent out=want2(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
proc rank data=have3 percent out=want3(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
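As an alternative along the lines of the question's own idea, here is a minimal sketch that gets the row count without typing it, using the NOBS= option of the SET statement. The names sorted, n and k are my own choices for illustration, and floor(n*0.25) is an assumption about how to round when the row count is not a multiple of 4.

```sas
/* Example data copied from the question */
data have;
  input id VAR1;
  datalines;
1 10
2 -5
3 -12
4 7
5 12
6 7
7 -9
8 -24
9 0
10 6
11 -18
12 22
;
run;

proc sort data=have out=sorted;
  by VAR1;
run;

data want;
  /* NOBS= stores the number of rows in n; n is not written to WANT */
  set sorted nobs=n;
  /* Number of rows to keep at each end (assumed rounding: floor) */
  k = floor(n * 0.25);
  /* Keep the first k and the last k rows of the sorted data */
  if _n_ <= k or _n_ > n - k;
  drop k;
run;
```

For the 12 example rows, k is 3, so the step keeps rows 1 to 3 and 10 to 12 of the sorted data, which are the VAR1 values -24, -18, -12, 10, 12 and 22. Note that BETWEEN in a WHERE expression is inclusive, so the PROC RANK filter drops any row whose percent rank is exactly 25 or 75.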

How to retain calculated values between rows when calculating running totals?

I have a tricky question about a conditional sum in SAS. It is complicated enough that I find it hard to explain in words, so I will show an example:
A B
5 3
7 2
8 6
6 4
9 5
8 2
3 1
4 3
As you can see, I have a data set with two columns. First, I calculated the conditional cumulative sum of column A (I can do this myself, so no help is needed for that step):
A B CA
5 3 5
7 2 12
8 6 18
6 4 8 ((12+8)-18)+6
9 5 17
8 2 18
3 1 10 ((17+8)-18)+3
4 3 14
So my condition value is 18. If the cumulative sum exceeds 18, it is capped at 18, and the next value is the next A plus the amount by which the sum exceeded 18. (As I said, I can do this myself.)
The tricky part is that I have to calculate the cumulative sum of column B according to column A:
A B CA CB
5 3 5 3
7 2 12 5
8 6 18 9.5 (5+(6*((18-12)/8)))
6 4 8 5.5 ((5+6)-9.5)+4
9 5 17 10.5 (5.5+5)
8 2 18 10.75 (10.5+(2*((18-17)/8)))
3 1 10 2.75 ((10.5+2)-10.75)+1
4 3 14 5.75 (2.75+3)
As you can see from the example, the cumulative sum of column B is very specific. When column CA equals the condition value (18), we compute the proportion of the last A value that was needed to reach 18, and then apply that proportion to B when computing its cumulative sum.
It looks like when the sum of A reaches 18 or more, you want to split the values of A and B between the current and the next record. One way is to remember the leftover values for A and B and carry them forward in your new cumulative variables. Just make sure to output the observation before resetting those variables.
data want ;
set have ;
ca+a;
cb+b;
if ca >= 18 then do;
extra_a=ca - 18;
extra_b=b - b*((a - extra_a)/a) ;
ca=18;
cb=cb-extra_b ;
end;
output;
if ca=18 then do;
ca=extra_a;
cb=extra_b;
end;
drop extra_a extra_b ;
run;
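The step above reads a data set named have that is not shown in the answer. A minimal sketch that builds it from the example rows in the question (run it before the step above):

```sas
/* Example rows copied from the question; columns A and B */
data have;
  input a b;
  datalines;
5 3
7 2
8 6
6 4
9 5
8 2
3 1
4 3
;
run;
```

Tracing the step by hand on these rows gives CA = 5, 12, 18, 8, 17, 18, 10, 14 and CB = 3, 5, 9.5, 5.5, 10.5, 10.75, 2.75, 5.75, which matches the table in the question.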

SAS Function to calculate percentage for row for two stratifications

I have a dataset that looks like this
data test;
input id1$ id2$ score1 score2 score3 total;
datalines;
A D 9 36 6 51
A D 9 8 6 23
A E 5 3 2 10
B D 5 3 3 11
B E 7 4 7 18
B E 5 3 3 11
C D 8 7 9 24
C E 8 52 6 66
C D 4 5 3 12
;
run;
I want to add a column that gives each row's total as a fraction of the sum of total within its id1 and id2 combination.
What I mean is this: take id1 = A. Within A there are two id2 values, D and E. There are two rows with D and one with E. The two total values for D are 51 and 23, which sum to 74. The single total value for E is 10, which sums to 10. The column I'd like to create would hold the values .68 (51/74), .31 (23/74), and 1 (10/10) in rows 1, 2, and 3 respectively.
I need to perform this calculation for the rest of the id1 values and their corresponding id2 values. When complete, I want a table that looks like this:
id1 id2 score1 score2 score3 total percent_of_total
A D 9 36 6 51 0.689189189
A D 9 8 6 23 0.310810811
A E 5 3 2 10 1
B D 5 3 3 11 1
B E 7 4 7 18 0.620689655
B E 5 3 3 11 0.379310345
C D 8 7 9 24 0.666666667
C E 8 52 6 66 1
C D 4 5 3 12 0.333333333
I realize a loop might be able to solve the problem I've given, but I'm dealing with EIGHT levels of stratification, with as many as 98 sublevels within those levels. A loop is not practical. I'm thinking of something along the lines of PROC SUMMARY, but I'm not too familiar with that procedure.
Thank you.
It is easy to do with a data step. Make sure the records are sorted.
You can find the grand total for the ID1*ID2 combination and then use it to calculate the percentage.
proc sort data=test;
by id1 id2;
run;
data want ;
do until (last.id2);
set test ;
by id1 id2 ;
grand = sum(grand,total);
end;
do until (last.id2);
set test ;
by id1 id2 ;
percent_of_total = total/grand ;
output;
end;
run;
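The same group total can also be obtained with PROC SQL, which remerges a summary statistic back onto the detail rows when a non-grouped column is selected alongside it. This is a sketch of an alternative, not the method used in the answer above; the output name want_sql is my own, and the row order of the result is not guaranteed unless an ORDER BY clause is added.

```sas
/* Same example data as in the question */
data test;
  input id1 $ id2 $ score1 score2 score3 total;
  datalines;
A D 9 36 6 51
A D 9 8 6 23
A E 5 3 2 10
B D 5 3 3 11
B E 7 4 7 18
B E 5 3 3 11
C D 8 7 9 24
C E 8 52 6 66
C D 4 5 3 12
;
run;

proc sql;
  create table want_sql as
  select *,
         /* sum(total) is computed per id1*id2 group and remerged */
         total / sum(total) as percent_of_total
  from test
  group by id1, id2;
quit;
```

With more levels of stratification, the additional variables would be listed in the GROUP BY clause. The log shows a note that the query required remerging summary statistics, which is expected here.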

performing chi squared test in SAS using PROC FREQ

Our university is forcing us to perform the old school chi square test using PROC FREQ (I am aware of the options with proc univariate).
I have generated one theoretical exponential distribution with Beta=15 (and written down the values laboriously), and I've generated 10000 random variables which have an exponential distribution, with beta=15.
I try to first enter the frequencies of my random variables (in each interval) via the datalines command:
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
This seems to work.
I then try to compare these values to the theoretical values, using the chi square test in proc freq (the one we are supposed to use)
As follows:
proc freq data=expofaktiska;
weight count;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
I get the following error:
ERROR: The number of TESTP values does not equal the number of levels. For the table of number,
there are 24 levels and 26 TESTP values.
This may be because two intervals contain 0 observations. I don't really see a way around this.
Also, I don't get the chi-square test in the results viewer, nor the "test probability"; I only get the frequency and cumulative frequency of the random variables.
What am I doing wrong? Do both the theoretical and the actual distributions need to have the same form (probabilities or frequencies)?
We are using SAS 9.4
Thanks in advance!
/Magnus
You need the ZEROS option on the WEIGHT statement, so that levels with a count of zero are still included in the table.
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
proc freq data=expofaktiska;
weight count / zeros;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
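The TESTP values in the question do not have to be written down by hand. They look like the interval probabilities of an exponential distribution with mean 15 over intervals of width 5. The width is my inference from the numbers (1 - exp(-5/15) is about 0.28347), not something stated in the question, and the data set name expected is hypothetical. A minimal sketch under that assumption:

```sas
data expected;
  beta  = 15;  /* mean of the exponential distribution, from the question */
  width = 5;   /* assumed interval width, inferred from the first TESTP value */
  do number = 1 to 26;
    /* P(lower <= X < upper) = exp(-lower/beta) - exp(-upper/beta) */
    p = exp(-(number - 1) * width / beta) - exp(-number * width / beta);
    output;
  end;
  drop beta width;
run;

proc print data=expected noobs;
  format p 7.5;
run;
```

The 26 probabilities sum to 1 - exp(-130/15), about 0.99983, because the tail beyond the last interval is left out.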