Transform Ordered Values to Paired - sas

I'm looking to transform a set of ordered values into a new dataset containing all ordered combinations.
For example, if I have a dataset that looks like this:
Code   Rank   Value   Pctile
1250   1      25      0
1250   2      32      0.25
1250   3      37      0.5
1250   4      51      0.75
1250   5      59      1
I'd like to transform it to something like this, with values for rank 1 and 2 in a single row, values for 2 and 3 in the next, and so forth:
Code   Min_value   Min_pctile   Max_value   Max_pctile
1250   25          0            32          0.25
1250   32          0.25         37          0.5
1250   37          0.5          51          0.75
1250   51          0.75         59          1
It's simple enough to do with a handful of values, but when the number of "Code" groups is large (as mine is), I'm looking for a more efficient approach. I imagine there's a straightforward way to do this with a data step, but it escapes me.

Looks like you just want the lag() function. Note that lag() executes on every observation, before the subsetting IF, so the lag queue stays in step within each CODE group.
data want ;
  set have ;
  by code rank ;
  min_value = lag(value) ;     /* value and pctile from the previous observation */
  min_pctile = lag(pctile) ;
  rename value=max_value pctile=max_pctile ;
  if not first.code ;          /* drop each group's first row; its lags belong to the prior group */
run ;
Results
                     max_    max_     min_    min_
Obs    Code   Rank   value   pctile   value   pctile
  1    1250      2      32     0.25      25     0.00
  2    1250      3      37     0.50      32     0.25
  3    1250      4      51     0.75      37     0.50
  4    1250      5      59     1.00      51     0.75

Related

Showing data in Power BI matrix as percentage

In the matrix, I have this representation -
        X    Y    Z   TOTAL
A       3    4    6    13
B       6   44   55   105
C       0    4    8    12
TOTAL   9   52   69   130
I want to show this as the following -
    X     Y     Z
A   23%   31%   46%
B    6%   42%   52%
C    0%   33%   67%
For example, for row A: (X/TOTAL)*100, (Y/TOTAL)*100, (Z/TOTAL)*100.
How do I do it?
Thanks in advance for your help!
Select the values field and set 'Show value as' to 'Percent of row total'.
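If you want to sanity-check the numbers outside Power BI, here is a minimal Python/pandas sketch of the same row-percentage computation (the DataFrame just recreates the example matrix above; nothing about it is Power BI specific):

import pandas as pd

# the example matrix from the question
df = pd.DataFrame({"X": [3, 6, 0], "Y": [4, 44, 4], "Z": [6, 55, 8]},
                  index=["A", "B", "C"])

# divide each row by its own total, then scale to percentages
row_pct = df.div(df.sum(axis=1), axis=0).mul(100).round()
print(row_pct)  # A: 23 31 46, B: 6 42 52, C: 0 33 67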

How to compare means (µ1 + µ2 + µ3)/3 = (µ3 + µ4)/2 in SAS: Use 'ESTIMATE' or 'CONTRAST'?

I am working with a dataset of 5 techniques for assembling parts, with random samples of workers taken from each technique to evaluate how long it takes workers to complete the task. I wish to compare means with a t-test, but I am struggling to get the correct code, as I am very new to SAS.
The dataset can be created with the following code:
data Ex1;
input technique time @@;
lines;
1 45.6
1 41
1 46.4
1 50.7
1 47.9
1 44.6
2 41
2 49.1
2 49.2
2 54.8
2 45
3 51.7
3 60.1
3 52.6
3 58.6
3 59.8
3 52.6
3 53.8
4 67.5
4 57.7
4 58.2
4 60.6
4 57.3
4 58.3
4 54.8
5 57.1
5 69.6
5 62.7
;
run;
I wish to use PROC GLM to test the null hypothesis (µ1 + µ2 + µ3)/3 = (µ3 + µ4)/2 against the alternative that these means are not equal. I have the following code for this operation, but I am getting errors when I run it:
proc glm data=Ex1;
class technique;
model time=technique/NOINT SOLUTION E;
CONTRAST 'M1+M2+M3=M3+M4' technique 1 1 1 0 0/DIVISOR=3, technique 0 0 1 1 0/DIVISOR=3;
run;
The following output error is produced:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc glm data=Ex1;
74 class technique;
75 model time=technique/NOINT SOLUTION E;
76
77 CONTRAST 'M1+M2+M3=M3+M4' technique 1 1 1 0 0/DIVISOR=3, technique 0 0 1 1 0/DIVISOR=3;
_______
22
76
NOTE: The previous statement has been deleted.
ERROR 22-322: Syntax error, expecting one of the following: ;, E, EST, ETYPE, SINGULAR.
ERROR 76-322: Syntax error, statement will be ignored.
78 run;
NOTE: Due to the presence of CLASS variables, an intercept is implicitly fitted. R-Square has been corrected for the mean.
79
80 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
92
Is this an ESTIMATE or a CONTRAST problem?
There is no DIVISOR option for the CONTRAST statement in the GLM procedure; DIVISOR= exists only on the ESTIMATE statement, which is why the log lists only ;, E, EST, ETYPE, and SINGULAR as valid.
Let's simplify your initial equation.
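Multiplying both sides by 6 clears the denominators and shows where the single row of coefficients comes from:

$$
\frac{\mu_1+\mu_2+\mu_3}{3} = \frac{\mu_3+\mu_4}{2}
\iff 2\mu_1 + 2\mu_2 + 2\mu_3 = 3\mu_3 + 3\mu_4
\iff 2\mu_1 + 2\mu_2 - \mu_3 - 3\mu_4 = 0
$$

So the hypothesis is a single contrast with coefficients (2, 2, -1, -3, 0); dividing by 6 gives (1/3, 1/3, -1/6, -1/2, 0), which is what the CONTRAST statement below spells out in decimals.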
Full PROC GLM call:
proc glm data=Ex1;
  class technique;
  model time=technique / NOINT SOLUTION E;
  ESTIMATE '(M1+M2+M3)/3-(M3+M4)/2' technique 2 2 -1 -3 0 / divisor=6;
  CONTRAST '(M1+M2+M3)/3=(M3+M4)/2' technique 0.33333 0.333333 -0.166667 -0.5 0 / e;
run;

How to create 1000 files all the same but change some specific parameter?

I need to make around 1000 input files that are almost identical but with some parameters changed. How can I create 1000 files that are all the same except for one specific parameter?
Is there a way to copy the file's contents and write them out as a new file after I change a variable's value?
========================================================================
=arp
ce16x16
5
3
0.2222222
0.2222222
0.2222222
60
60
60
1
1
1
0.71
ft33f001
end
#origens
0$$ a4 33 all 71 e t
ce16x16
3$$ 33 a3 1 27 a16 2 a33 18 e t
35$$ 0 t
56$$ 10 10 a6 3 a10 0 a13 4 a15 3 a18 1 e
95$$ 0 t
cycle 1 -fo3
1 MTU
58** 60 60 60 60 60 60 60 60 60 60
60 ** 0.02222222 0.04444444 0.06666667 0.08888889 0.1111111 0.1333333
0.1555556 0.1777778 0.2 0.2222222
66$$ a1 2 a5 2 a9 2 e
73$$ 922340 922350 922360 922380
74** 445 50000 230 949325
75$$ 2 2 2 2
t
====================================================================
This is a part of the file. I would like to make 1000 files similar to this, changing only the values of 60 each time.
The value 60 equals a value entered by the user divided by 0.2222222.
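A minimal Python sketch of one way to do this, assuming you first save the input above as template.inp with each 60 you want to vary replaced by the token {value} (both names are placeholders of mine; an explicit token avoids accidentally touching other numbers, such as the 60** card label):

N = 1000
with open("template.inp") as f:
    template = f.read()

for i in range(1, N + 1):
    user_input = i * 0.1                 # hypothetical: however you pick each value
    value = user_input / 0.2222222       # the rule described above
    with open(f"case_{i:04d}.inp", "w") as out:
        out.write(template.replace("{value}", f"{value:.7g}"))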

Density of fractions between 2 given numbers

I'm trying to do some analysis over a simple Fraction class and I want some data to compare that type with doubles.
The problem
Right now I'm looking for a good way to get the density of Fractions between 2 numbers. A Fraction is basically 2 integers (e.g. pair<long, long>), and the density between s and t is the number of representable values in that range. It needs to be exact, or a very good approximation, computed in O(1) or otherwise very fast.
To make it a bit simpler, let's say I want all the numbers (not fractions) a/b between s and t, where 0 <= s <= a/b < t <= M and 0 <= a, b <= M (b > 0; a and b are integers).
Example
If my fractions were of a data type which only counts to 6 (M = 6), and I wanted the density between 0 and 1, the answer would be 12. Those numbers are:
0, 1/6, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 5/6.
What I thought already
A very naive approach would be to cycle through all the possible fractions and count those which can't be simplified. Something like:
long fractionsIn(double s, double t){
    long density = 0;
    long M = LONG_MAX;
    for (long d = 1; d <= M; d++) {                  // every denominator
        // numerators n with s <= n/d < t and n <= M
        for (long n = (long)ceil(d * s); n < d * t && n <= M; n++) {
            if (gcd(n, d) == 1)                      // count lowest terms only
                density++;
        }
    }
    return density;
}
But gcd() is very slow, so it doesn't work. I also tried doing some math, but I couldn't get anything good.
Solution
Thanks to m69's answer, I made this code for Fraction = pair<Long, Long>:
//this should give the density of fractions between first and last, or less.
double fractionsIn(unsigned long long first, unsigned long long last){
    double pi = 3.141592653589793238462643383279502884;
    double max = LONG_MAX; // I can't use LONG_MAX directly
    double zeroToOne = max/pi * max/pi * 3; // approx. size of the Farey sequence of order LONG_MAX
    double res = 0;
    if(first == 0){
        res = zeroToOne;
        first++;
    }
    for(double i = first; i < last; i++){
        res += zeroToOne / (i * (i + 1)); // density in the interval i ~ i+1
        if(i == i + 1) // i+1 is no longer representable as a double
            i = nextafter(i, last); // I might skip some fractions, but I have no other choice
    }
    return floor(res);
}
The main change is nextafter, which is important with big numbers (around 1e17, where consecutive doubles are more than 1 apart).
The result
As I explained at the beginning, I was trying to compare Fractions with doubles. Here is the result for Fraction = pair<Long, Long> (and here is how I got the density of doubles):
Density between:  0,1                   1,2                1e6,1e6+1     1e14,1e14+1   1e15-1,1e15   1e17-10,1e17   1e19-10000,1e19   1e19-1000,1e19
Doubles:          4607182418800017408   4503599627370496   8589934592    64            8             1              5                 0
Fraction:         2.58584e+37           1.29292e+37        2.58584e+25   2.58584e+09   2.58584e+07   2585           1                 0
Density between 0 and 1
If the integers with which you express the fractions are in the range 0~M, then the density of fractions between the values 0 (inclusive) and 1 (exclusive) is:
M: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0~(1): 1 2 4 6 10 12 18 22 28 32 42 46 58 64 72 80 96 102 120 128 140 150 172 180 200 212 230 242 270 278 308 ...
This is sequence A002088 on OEIS. If you scroll down to the formula section, you'll find information about how to approximate it, e.g.:
Φ(n) = (3/π²) × n² + O(n × (ln n)^(2/3) × (ln ln n)^(4/3))
(Unfortunately, no more detail is given about the constants involved in the O[x] part. See discussion about the quality of the approximation below.)
Distribution across range
The interval from 0 to 1 contains half of the total number of unique fractions that can be expressed with numbers up to M; e.g. this is the distribution when M = 15 (i.e. 4-bit integers):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
72 36 12 6 4 2 2 2 1 1 1 1 1 1 1 1
for a total of 144 unique fractions. If you look at the sequence for different values of M, you'll see that the steps in this sequence converge:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1: 1 1
2: 2 1 1
3: 4 2 1 1
4: 6 3 1 1 1
5: 10 5 2 1 1 1
6: 12 6 2 1 1 1 1
7: 18 9 3 2 1 1 1 1
8: 22 11 4 2 1 1 1 1 1
9: 28 14 5 2 2 1 1 1 1 1
10: 32 16 5 3 2 1 1 1 1 1 1
11: 42 21 7 4 2 2 1 1 1 1 1 1
12: 46 23 8 4 2 2 1 1 1 1 1 1 1
13: 58 29 10 5 3 2 2 1 1 1 1 1 1 1
14: 64 32 11 5 4 2 2 1 1 1 1 1 1 1 1
15: 72 36 12 6 4 2 2 2 1 1 1 1 1 1 1 1
Not only is the density between 0 and 1 half of the total number of fractions, but the density between 1 and 2 is a quarter, and the density between 2 and 3 is close to a twelfth, and so on.
As the value of M increases, the distribution of fractions across the ranges 0-1, 1-2, 2-3 ... converges to:
1/2, 1/4, 1/12, 1/24, 1/40, 1/60, 1/84, 1/112, 1/144, 1/180, 1/220, 1/264 ...
This sequence can be calculated by starting with 1/2 and then:
0-1: 1/2 x 1/1 = 1/2
1-2: 1/2 x 1/2 = 1/4
2-3: 1/4 x 1/3 = 1/12
3-4: 1/12 x 2/4 = 1/24
4-5: 1/24 x 3/5 = 1/40
5-6: 1/40 x 4/6 = 1/60
6-7: 1/60 x 5/7 = 1/84
7-8: 1/84 x 6/8 = 1/112
8-9: 1/112 x 7/9 = 1/144 ...
You can of course calculate any of these values directly, without needing the steps in between:
0-1: 1/2
6-7: 1/2 x 1/6 x 1/7 = 1/84
(Also note that the second half of the distribution sequence consists of 1's; these are all the integers divided by 1.)
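Put differently, the share of all values that lands in the interval k ~ k+1 converges to 1/(2 × k × (k+1)) for k ≥ 1, with 1/2 for the interval 0 ~ 1. A tiny Python sketch (my own illustration, not part of the original answer) that reproduces the sequence above:

from fractions import Fraction

def interval_weight(k):
    # limiting share of all representable values that lies in [k, k+1)
    return Fraction(1, 2) if k == 0 else Fraction(1, 2 * k * (k + 1))

print([str(interval_weight(k)) for k in range(9)])
# ['1/2', '1/4', '1/12', '1/24', '1/40', '1/60', '1/84', '1/112', '1/144']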
Approximating the density in given interval
Using the formulas provided on the OEIS page, you can calculate or approximate the density in the interval 0-1; multiplied by 2, this is the total number of unique values that can be expressed as fractions.
Given two values s and t, you can then calculate and sum the densities in the intervals s ~ s+1, s+1 ~ s+2, ... t-1 ~ t, or use an interpolation to get a faster but less precise approximate value.
Example
Let's assume that we're using 10-bit integers, capable of expressing values from 0 to 1023. Using this table linked from the OEIS page, we find that the density between 0~1 is 318452, and the total number of fractions is 636904.
If we wanted to find the density in the interval s~t = 100~105:
100~101: 1/2 x 1/100 x 1/101 = 1/20200 ; 636904/20200 = 31.53
101~102: 1/2 x 1/101 x 1/102 = 1/20604 ; 636904/20604 = 30.91
102~103: 1/2 x 1/102 x 1/103 = 1/21012 ; 636904/21012 = 30.31
103~104: 1/2 x 1/103 x 1/104 = 1/21424 ; 636904/21424 = 29.73
104~105: 1/2 x 1/104 x 1/105 = 1/21840 ; 636904/21840 = 29.16
Rounding these values gives the sum:
32 + 31 + 30 + 30 + 29 = 152
A brute force algorithm gives this result:
32 + 32 + 30 + 28 + 28 = 150
So we're off by 1.33% for this low value of M and small interval with just 5 values. If we had used linear interpolation between the first and last value:
100~101: 31.53
104~105: 29.16
average: 30.345
total: 151.725 -> 152
we'd have arrived at the same value. For larger intervals, the sum of all the densities will probably be closer to the real value, because rounding errors will cancel each other out, but the results of linear interpolation will probably become less accurate. For ever larger values of M, the calculated densities should converge with the actual values.
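For checking results like these, a brute-force counter is handy (a small Python sketch of my own, assuming integer endpoints s and t; it counts reduced fractions n/d with s <= n/d < t and n, d <= M, so it is only practical for small M):

from math import gcd

def brute_force_density(M, s, t):
    # count reduced fractions n/d with s <= n/d < t, 0 <= n <= M, 1 <= d <= M
    count = 0
    for d in range(1, M + 1):
        for n in range(s * d, min(t * d, M + 1)):
            if gcd(n, d) == 1:
                count += 1
    return count

print(brute_force_density(1023, 100, 105))  # 150, matching the brute-force sum above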
Quality of approximation of Φ(n)
Using this simplified formula:
Φ(n) = (3/π²) × n²
the results are almost always smaller than the actual values, but they are within 1% for n ≥ 182, within 0.1% for n ≥ 1880 and within 0.01% for n ≥ 19494. I would suggest hard-coding the lower range (the first 50,000 values can be found here), and then using the simplified formula from the point where the approximation is good enough.
Here's a simple code example with the first 182 values of Φ(n) hard-coded. The approximation of the distribution sequence seems to add an error of a similar magnitude as the approximation of Φ(n), so it should be possible to get a decent approximation. The code simply iterates over every integer in the interval s~t and sums the fractions. To speed up the code and still get a good result, you should probably calculate the fractions at several points in the interval, and then use some sort of non-linear interpolation.
function fractions01(M) {
    var phi = [0,1,2,4,6,10,12,18,22,28,32,42,46,58,64,72,80,96,102,120,128,140,150,172,180,200,212,230,242,270,278,308,
               324,344,360,384,396,432,450,474,490,530,542,584,604,628,650,696,712,754,774,806,830,882,900,940,964,1000,
               1028,1086,1102,1162,1192,1228,1260,1308,1328,1394,1426,1470,1494,1564,1588,1660,1696,1736,1772,1832,1856,
               1934,1966,2020,2060,2142,2166,2230,2272,2328,2368,2456,2480,2552,2596,2656,2702,2774,2806,2902,2944,3004,
               3044,3144,3176,3278,3326,3374,3426,3532,3568,3676,3716,3788,3836,3948,3984,4072,4128,4200,4258,4354,4386,
               4496,4556,4636,4696,4796,4832,4958,5022,5106,5154,5284,5324,5432,5498,5570,5634,5770,5814,5952,6000,6092,
               6162,6282,6330,6442,6514,6598,6670,6818,6858,7008,7080,7176,7236,7356,7404,7560,7638,7742,7806,7938,7992,
               8154,8234,8314,8396,8562,8610,8766,8830,8938,9022,9194,9250,9370,9450,9566,9654,9832,9880,10060];
    if (M < 182) return phi[M];
    return Math.round(M * M * 0.30396355092701331433 + M / 4); // experimental; see below
}
function fractions(M, s, t) {
    var half = fractions01(M);                    // density in the interval 0 ~ 1
    var frac = (s == 0) ? half : 0;
    for (var i = (s == 0) ? 1 : s; i < t && i <= M; i++) {
        if (2 * i < M) {
            var f = Math.round(half / (i * (i + 1)));
            frac += (f < 2) ? 2 : f;              // lower half: at least 2 fractions per integer
        }
        else ++frac;                              // upper half: exactly 1 fraction per integer
    }
    return frac;
}
var M = 1023, s = 100, t = 105;
document.write(fractions(M, s, t));
Comparing the approximation of Φ(n) with the list of the 50,000 first values suggests that adding M÷4 is a workable substitute for the second part of the formula; I have not tested this for larger values of n, so use with caution.
(Chart omitted. Blue: simplified formula; red: improved simplified formula.)
Quality of approximation of distribution
Comparing the results for M=1023 with those of a brute-force algorithm, the errors are small in real terms, never more than -7 or +6, and above the interval 205~206 they are limited to -1 ~ +1. However, a large part of the range (57~1024) has fewer than 100 fractions per integer, and in the interval 171~1024 there are only 10 fractions or fewer per integer. This means that small errors and rounding errors of -1 or +1 can have a large impact on the result, e.g.:
interval: 241 ~ 250
fractions/integer: 6
approximation: 5
total: 50 (instead of 60)
To improve the results for intervals with few fractions per integer, I would suggest combining the method described above with a separate approach for the last part of the range:
Alternative method for last part of range
As already mentioned, and implemented in the code example, the second half of the range, M÷2 ~ M, has 1 fraction per integer. Also, the interval M÷3 ~ M÷2 has 2; the interval M÷4 ~ M÷3 has 4. This is of course the Φ(n) sequence again:
M/2 ~ M : 1
M/3 ~ M/2: 2
M/4 ~ M/3: 4
M/5 ~ M/4: 6
M/6 ~ M/5: 10
M/7 ~ M/6: 12
M/8 ~ M/7: 18
M/9 ~ M/8: 22
M/10 ~ M/9: 28
M/11 ~ M/10: 32
M/12 ~ M/11: 42
M/13 ~ M/12: 46
M/14 ~ M/13: 58
M/15 ~ M/14: 64
M/16 ~ M/15: 72
M/17 ~ M/16: 80
M/18 ~ M/17: 96
M/19 ~ M/18: 102 ...
Between these intervals, one integer can have a different number of fractions, depending on the exact value of M, e.g.:
interval fractions
202 ~ 203 10
203 ~ 204 10
204 ~ 205 9
205 ~ 206 6
206 ~ 207 6
The interval 204 ~ 205 lies on the edge between intervals, because M ÷ 5 = 204.6; it has 6 + 3 = 9 fractions because M modulo 5 is 3. If M had been 1022 or 1024 instead of 1023, it would have 8 or 10 fractions. (This example is straightforward because 5 is a prime; see below.)
Again, I would suggest using the hard-coded values for Φ(n) to calculate the number of fractions for the last part of the range. If you use the first 17 values as listed above, this covers the part of the range with fewer than 100 fractions per integer, so that would reduce the impact of rounding errors below 1%. The first 56 values would give you 0.1%, the first 182 values 0.01%.
Together with the values of Φ(n), you could hard-code the number of fractions of the edge intervals for each modulo value, e.g.:
modulo: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
M/ 2 1 2
M/ 3 2 3 4
M/ 4 4 5 5 6
M/ 5 6 7 8 9 10
M/ 6 10 11 11 11 11 12
M/ 7 12 13 14 15 16 17 18
M/ 8 18 19 19 20 20 21 21 22
M/ 9 22 23 24 24 25 26 26 27 28
M/10 28 29 29 30 30 30 30 31 31 32
M/11 32 33 34 35 36 37 38 39 40 41 42
M/12 42 43 43 43 43 44 44 45 45 45 45 46
M/13 46 47 48 49 50 51 52 53 54 55 56 57 58
M/14 58 59 59 60 60 61 61 61 61 62 62 63 63 64
M/15 64 65 66 66 67 67 67 68 69 69 69 70 70 71 72
M/16 72 73 73 74 74 75 75 76 76 77 77 78 78 79 79 80
M/17 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
M/18 96 97 97 97 97 98 98 99 99 99 99 100 100 101 101 101 101 102
This is exactly the same as the sum of phi(k) for 1 <= k <= M, where phi(k) is the Euler totient function, with phi(0) = 1 (as defined by the problem). There is no known closed form for this sum, but many optimizations are known, as mentioned in the wiki link. Wolfram calls this the Totient Summatory Function. The same website also links to the series A002088 and provides a few asymptotic approximations.
The reasoning is this: consider the values of the form {1/M, 2/M, ..., (M-1)/M, M/M}. Any fraction in that list that can be reduced to a smaller value is not counted in phi(M), because numerator and denominator are not relatively prime; it appears in the summation of another totient instead.
For example, the sum for M = 6 is 12: the 11 reduced fractions strictly between 0 and 1, plus 1, since you also count the 0.
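For completeness, here is a small Python sketch (mine, not from the answer) that tabulates the summatory function with a standard totient sieve, reproducing A002088:

def totient_sums(M):
    # sieve phi(0..M), then prefix sums: Phi(k) = phi(1) + ... + phi(k)
    phi = list(range(M + 1))
    for p in range(2, M + 1):
        if phi[p] == p:  # p is prime: phi[p] was never reduced
            for m in range(p, M + 1, p):
                phi[m] -= phi[m] // p
    sums, total = [0], 0
    for k in range(1, M + 1):
        total += phi[k]
        sums.append(total)
    return sums

Phi = totient_sums(31)
print(Phi[6], Phi[15], Phi[31])  # 12 72 308, matching the table near the top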

performing chi squared test in SAS using PROC FREQ

Our university is forcing us to perform the old-school chi-square test using PROC FREQ (I am aware of the options with PROC UNIVARIATE).
I have generated one theoretical exponential distribution with beta=15 (and laboriously written down the values), and I've generated 10000 random variates from an exponential distribution with beta=15.
I first try to enter the frequencies of my random variates (in each interval) via the DATALINES statement:
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
This seems to work.
I then try to compare these values to the theoretical values, using the chi-square test in PROC FREQ (the one we are supposed to use), as follows:
proc freq data=expofaktiska;
weight count;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
I get the following error:
ERROR: The number of TESTP values does not equal the number of levels. For the table of number,
there are 24 levels and 26 TESTP values.
This may be because two intervals contain 0 observations. I don't really see a way around this.
Also, I don't get the chi-square test in the results viewer, nor the test probabilities; I only get the frequency/cumulative frequency of the random variates.
What am I doing wrong? Do both theoretical/actual distributions need to have the same form (probabilities/frequencies)?
We are using SAS 9.4.
Thanks in advance!
/Magnus
You need the ZEROS option on the WEIGHT statement. It keeps the zero-weight levels (22 and 25) in the table; without it they are dropped, and the 24 remaining levels no longer match your 26 TESTP values.
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
proc freq data=expofaktiska;
weight count / zeros;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
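As an aside, the TESTP values above look like exponential bin probabilities P(5(k-1) < X <= 5k) for mean beta = 15. The width-5 bins are my inference from the numbers, not something stated in the question; a short Python sketch that reproduces them:

from math import exp

beta, width = 15, 5
# P(width*(k-1) < X <= width*k) for X ~ Exponential(mean = beta)
testp = [exp(-width * (k - 1) / beta) - exp(-width * k / beta)
         for k in range(1, 27)]
print([round(p, 5) for p in testp[:4]])  # [0.28347, 0.20311, 0.14554, 0.10428]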