Categorise and count the values in a list

I have the list [5,15,25,27,30,39,45,50,55]
How do I code the categorisation of the values in a list, where the categories are groups of roughly 10, such that I get the following result? I have absolutely no clue where to start, and I am only just learning to code.
0 - 9 =1
10 - 19 =1
20 - 29 =2
30 - 39 =2
40 - 49 =1
50 - 59 =2
Thank you.
(I did think of something like
if list[1] > 0 and < 10 make Group[1] == 1
list[2] > 0 and < 10 make Group[1] == 2
but this was going to generate LOADS of bulky code.)

In your example, the number to the left of the = is just the value floor-divided by 10. So accumulate the groups in a dictionary:
from collections import defaultdict

a = [5, 15, 25, 27, 30, 39, 45, 50, 55]
b = defaultdict(int)
for x in a:
    b[x // 10] += 1
for k, v in b.items():
    print("%d - %d =%d" % (k * 10, k * 10 + 9, v))
result:
0 - 9 =1
10 - 19 =1
20 - 29 =2
30 - 39 =2
40 - 49 =1
50 - 59 =2
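If you prefer, the same grouping can also be written with collections.Counter; this is just an alternative sketch, not part of the answer above:

from collections import Counter

a = [5, 15, 25, 27, 30, 39, 45, 50, 55]

# Counter maps each decade index (x // 10) to how many values fall in that decade
counts = Counter(x // 10 for x in a)

for k in sorted(counts):
    print("%d - %d =%d" % (k * 10, k * 10 + 9, counts[k]))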

Related

Sorting a 2-d vector but getting an unexpected output

I am trying to sort a 2-D vector, and I get the desired output for input n less than 15, but above that the rows are not arranged in the order that I want. If all the first-column values are 0, then the second column should be in increasing order.
#include <bits/stdc++.h>
using namespace std;

bool sortcol(const vector<long long int>& v1, const vector<long long int>& v2)
{
    return v1[0] < v2[0];
}

int main()
{
    int n;
    cin >> n;
    vector<vector<long long int>> arr(n, vector<long long int>(2));
    for (int i = 0; i < n; i++)
    {
        arr[i][0] = 0;
        arr[i][1] = i;
    }
    sort(arr.begin(), arr.end(), sortcol);
    for (int i = 0; i < n; i++) {
        cout << i << " - " << arr[i][0] << " , " << arr[i][1] << endl;
    }
}
The output I want to look like:
15 0
0 - 0 , 0
1 - 0 , 1
2 - 0 , 2
3 - 0 , 3
4 - 0 , 4
5 - 0 , 5
6 - 0 , 6
7 - 0 , 7
8 - 0 , 8
9 - 0 , 9
10 - 0 , 10
11 - 0 , 11
12 - 0 , 12
13 - 0 , 13
14 - 0 , 14
But what I am getting is:
50 0
0 - 0 , 38
1 - 0 , 26
2 - 0 , 27
3 - 0 , 28
4 - 0 , 29
5 - 0 , 30
6 - 0 , 31
7 - 0 , 32
8 - 0 , 33
9 - 0 , 34
10 - 0 , 35
11 - 0 , 36
12 - 0 , 37
13 - 0 , 25
14 - 0 , 39
15 - 0 , 40
16 - 0 , 41
17 - 0 , 42
18 - 0 , 43
19 - 0 , 44
20 - 0 , 45
21 - 0 , 46
22 - 0 , 47
23 - 0 , 48
24 - 0 , 49
25 - 0 , 13
26 - 0 , 1
27 - 0 , 2
28 - 0 , 3
29 - 0 , 4
30 - 0 , 5
31 - 0 , 6
32 - 0 , 7
33 - 0 , 8
34 - 0 , 9
35 - 0 , 10
36 - 0 , 11
37 - 0 , 12
38 - 0 , 0
39 - 0 , 14
40 - 0 , 15
41 - 0 , 16
42 - 0 , 17
43 - 0 , 18
44 - 0 , 19
45 - 0 , 20
46 - 0 , 21
47 - 0 , 22
48 - 0 , 23
49 - 0 , 24
I am running this code in VS Code.
As others have also noted in the comments, your sortcol() always returns false because v1[0] and v2[0] are always 0. Since the predicate sortcol() tells the sorting algorithm which elements are considered "smaller"/"less" than others, no element is considered smaller than any other. This means all elements are treated as equivalent: if a<b is false and b<a is false, then a and b are equivalent as far as the sort is concerned. In other words, the STL sorting algorithms assume a strict weak ordering; compare e.g. this post and this one.
So all your elements are considered to be equal by the sorting algorithm. The order of elements considered to be equal is implementation defined for std::sort(). Quote for std::sort:
The order of equal elements is not guaranteed to be preserved.
Hence, in your case the implementation is free to reorder all elements as it sees fit, since they are all considered equal. In practice, std::sort switches to a different algorithm once the input exceeds a certain size. For libstdc++ (the STL of gcc), this happens for n > 16 (see the constant _S_threshold in the source code). That is why you see a jump in behavior for n > 16 with std::sort().
Other implementations of the STL might use other thresholds (e.g., the Microsoft STL seems to use a value of 32).
On the other hand, std::stable_sort() guarantees that the order of equal elements remains the same. Quote for std::stable_sort:
The order of equivalent elements is guaranteed to be preserved.
Of course, preserving the order is not free, and hence std::stable_sort can be slower.
So, if your sortcol() is really the predicate you want (although, in the example, it does not really make much sense), using std::stable_sort() is the solution you are looking for.

Density of fractions between 2 given numbers

I'm trying to do some analysis over a simple Fraction class and I want some data to compare that type with doubles.
The problem
Right now I'm looking for a good way to get the density of Fractions between 2 numbers. A Fraction is basically 2 integers (e.g. pair<long, long>), and the density between s and t is the number of representable values in that range. It needs to be exact, or a very good approximation, computed in O(1) or at least very fast.
To make it a bit simpler, let's say I want all the distinct numbers (not fraction representations) a/b between s and t, where 0 <= s <= a/b < t <= M and 0 <= a, b <= M (b > 0; a and b are integers).
Example
If my fractions were of a data type which only counts to 6 (M = 6), and I want the density between 0 and 1, the answer would be 12. Those numbers are:
0, 1/6, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 5/6.
What I thought already
A very naive approach would be to cycle through all the possible fractions, and count those which can't be simplified. Something like:
long fractionsIn(double s, double t){
    long density = 0;
    long M = LONG_MAX;
    for(int d = 1; d < floor(M/t); d++){
        for(int n = ceil(d*s); n < M; n++){
            if( gcd(n,d) == 1 )
                density++;
        }
    }
    return density;
}
But gcd() is very slow, so this doesn't work. I also tried doing some math, but I couldn't get anywhere good.
Solution
Thanks to m69's answer, I made this code for Fraction = pair<Long,Long>:
//this should give the density of fractions between first and last, or less.
double fractionsIn(unsigned long long first, unsigned long long last){
    double pi = 3.141592653589793238462643383279502884;
    double max = LONG_MAX; //i can't use LONG_MAX directly
    double zeroToOne = max/pi * max/pi * 3; // = approx. amount of numbers in Farey's sequence of order LONG_MAX.
    double res = 0;
    if(first == 0){
        res = zeroToOne;
        first++;
    }
    for(double i = first; i < last; i++){
        res += zeroToOne/(i * (i+1));
        if(i == i+1)
            i = nextafter(i+1, last); //if this happens, i might not count some fractions, but i have no other choice
    }
    return floor(res);
}
The main change is nextafter, which is important with big numbers (1e17).
The result
As I explained at the beginning, I was trying to compare Fractions with doubles. Here is the result for Fraction = pair<Long,Long> (and here is how I got the density of doubles):
Interval             Doubles               Fraction
0 , 1                4607182418800017408   2.58584e+37
1 , 2                4503599627370496      1.29292e+37
1e6 , 1e6+1          8589934592            2.58584e+25
1e14 , 1e14+1        64                    2.58584e+09
1e15-1 , 1e15        8                     2.58584e+07
1e17-10 , 1e17       1                     2585
1e19-10000 , 1e19    5                     1
1e19-1000 , 1e19     0                     0
Density between 0 and 1
If the integers with which you express the fractions are in the range 0~M, then the density of fractions between the values 0 (inclusive) and 1 (exclusive) is:
M: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0~(1): 1 2 4 6 10 12 18 22 28 32 42 46 58 64 72 80 96 102 120 128 140 150 172 180 200 212 230 242 270 278 308 ...
This is sequence A002088 on OEIS. If you scroll down to the formula section, you'll find information about how to approximate it, e.g.:
Φ(n) = (3/π²) × n² + O[n × (ln n)^(2/3) × (ln ln n)^(4/3)]
(Unfortunately, no more detail is given about the constants involved in the O[x] part. See discussion about the quality of the approximation below.)
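As a rough sanity check of the leading term (ignoring the error term), here is a quick sketch of my own; the exact value 12 for M = 6 comes from the example above:

import math

# leading term of the approximation, evaluated for M = 6
approx = 3 / math.pi**2 * 6**2
print(approx)  # about 10.94, versus the exact value 12 from the example above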
Distribution across range
The interval from 0 to 1 contains half of the total number of unique fractions that can be expressed with numbers up to M; e.g. this is the distribution when M = 15 (i.e. 4-bit integers):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
72 36 12 6 4 2 2 2 1 1 1 1 1 1 1 1
for a total of 144 unique fractions. If you look at the sequence for different values of M, you'll see that the steps in this sequence converge:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1: 1 1
2: 2 1 1
3: 4 2 1 1
4: 6 3 1 1 1
5: 10 5 2 1 1 1
6: 12 6 2 1 1 1 1
7: 18 9 3 2 1 1 1 1
8: 22 11 4 2 1 1 1 1 1
9: 28 14 5 2 2 1 1 1 1 1
10: 32 16 5 3 2 1 1 1 1 1 1
11: 42 21 7 4 2 2 1 1 1 1 1 1
12: 46 23 8 4 2 2 1 1 1 1 1 1 1
13: 58 29 10 5 3 2 2 1 1 1 1 1 1 1
14: 64 32 11 5 4 2 2 1 1 1 1 1 1 1 1
15: 72 36 12 6 4 2 2 2 1 1 1 1 1 1 1 1
Not only is the density between 0 and 1 half of the total number of fractions, but the density between 1 and 2 is a quarter, and the density between 2 and 3 is close to a twelfth, and so on.
As the value of M increases, the distribution of fractions across the ranges 0-1, 1-2, 2-3 ... converges to:
1/2, 1/4, 1/12, 1/24, 1/40, 1/60, 1/84, 1/112, 1/144, 1/180, 1/220, 1/264 ...
This sequence can be calculated by starting with 1/2 and then:
0-1: 1/2 x 1/1 = 1/2
1-2: 1/2 x 1/2 = 1/4
2-3: 1/4 x 1/3 = 1/12
3-4: 1/12 x 2/4 = 1/24
4-5: 1/24 x 3/5 = 1/40
5-6: 1/40 x 4/6 = 1/60
6-7: 1/60 x 5/7 = 1/84
7-8: 1/84 x 6/8 = 1/112
8-9: 1/112 x 7/9 = 1/144 ...
You can of course calculate any of these values directly, without needing the steps in between:
0-1: 1/2
6-7: 1/2 x 1/6 x 1/7 = 1/84
(Also note that the second half of the distribution sequence consists of 1's; these are all the integers divided by 1.)
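A small sketch (not from the original answer) that reproduces this limiting sequence, assuming the share for the interval k ~ k+1 is 1/(2·k·(k+1)) for k ≥ 1 and 1/2 for 0 ~ 1:

from fractions import Fraction

def interval_share(k):
    # approximate share of all unique fractions that fall in the interval k ~ k+1
    if k == 0:
        return Fraction(1, 2)              # half of all fractions lie in 0 ~ 1
    return Fraction(1, 2 * k * (k + 1))    # e.g. k = 6 gives 1/84, as above

print([str(interval_share(k)) for k in range(9)])
# ['1/2', '1/4', '1/12', '1/24', '1/40', '1/60', '1/84', '1/112', '1/144']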
Approximating the density in given interval
Using the formulas provided on the OEIS page, you can calculate or approximate the density in the interval 0-1, and multiplied by 2 this is the total number of unique values that can be expressed as fractions.
Given two values s and t, you can then calculate and sum the densities in the intervals s ~ s+1, s+1 ~ s+2, ... t-1 ~ t, or use an interpolation to get a faster but less precise approximate value.
Example
Let's assume that we're using 10-bit integers, capable of expressing values from 0 to 1023. Using this table linked from the OEIS page, we find that the density between 0~1 is 318452, and the total number of fractions is 636904.
If we wanted to find the density in the interval s~t = 100~105:
100~101: 1/2 x 1/100 x 1/101 = 1/20200 ; 636904/20200 = 31.53
101~102: 1/2 x 1/101 x 1/102 = 1/20604 ; 636904/20604 = 30.91
102~103: 1/2 x 1/102 x 1/103 = 1/21012 ; 636904/21012 = 30.31
103~104: 1/2 x 1/103 x 1/104 = 1/21424 ; 636904/21424 = 29.73
104~105: 1/2 x 1/104 x 1/105 = 1/21840 ; 636904/21840 = 29.16
Rounding these values gives the sum:
32 + 31 + 30 + 30 + 29 = 152
A brute force algorithm gives this result:
32 + 32 + 30 + 28 + 28 = 150
So we're off by 1.33% for this low value of M and small interval with just 5 values. If we had used linear interpolation between the first and last value:
100~101: 31.53
104~105: 29.16
average: 30.345
total: 151.725 -> 152
we'd have arrived at the same value. For larger intervals, the sum of all the densities will probably be closer to the real value, because rounding errors will cancel each other out, but the results of linear interpolation will probably become less accurate. For ever larger values of M, the calculated densities should converge with the actual values.
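For what it's worth, here is a short sketch of the summation approach from the example above; the total of 636904 unique fractions for 10-bit integers is taken from the text, not computed here:

def density_between(total_fractions, s, t):
    # sum the per-integer shares 1/(2*i*(i+1)) of the total over [s, t); assumes s >= 1
    return sum(round(total_fractions / (2 * i * (i + 1))) for i in range(s, t))

print(density_between(636904, 100, 105))  # 152, versus the brute-force result of 150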
Quality of approximation of Φ(n)
Using this simplified formula:
Φ(n) = (3/π²) × n²
the results are almost always smaller than the actual values, but they are within 1% for n ≥ 182, within 0.1% for n ≥ 1880 and within 0.01% for n ≥ 19494. I would suggest hard-coding the lower range (the first 50,000 values can be found here), and then using the simplified formula from the point where the approximation is good enough.
Here's a simple code example with the first 182 values of Φ(n) hard-coded. The approximation of the distribution sequence seems to add an error of a similar magnitude as the approximation of Φ(n), so it should be possible to get a decent approximation. The code simply iterates over every integer in the interval s~t and sums the fractions. To speed up the code and still get a good result, you should probably calculate the fractions at several points in the interval, and then use some sort of non-linear interpolation.
function fractions01(M) {
var phi = [0,1,2,4,6,10,12,18,22,28,32,42,46,58,64,72,80,96,102,120,128,140,150,172,180,200,212,230,242,270,278,308,
324,344,360,384,396,432,450,474,490,530,542,584,604,628,650,696,712,754,774,806,830,882,900,940,964,1000,
1028,1086,1102,1162,1192,1228,1260,1308,1328,1394,1426,1470,1494,1564,1588,1660,1696,1736,1772,1832,1856,
1934,1966,2020,2060,2142,2166,2230,2272,2328,2368,2456,2480,2552,2596,2656,2702,2774,2806,2902,2944,3004,
3044,3144,3176,3278,3326,3374,3426,3532,3568,3676,3716,3788,3836,3948,3984,4072,4128,4200,4258,4354,4386,
4496,4556,4636,4696,4796,4832,4958,5022,5106,5154,5284,5324,5432,5498,5570,5634,5770,5814,5952,6000,6092,
6162,6282,6330,6442,6514,6598,6670,6818,6858,7008,7080,7176,7236,7356,7404,7560,7638,7742,7806,7938,7992,
8154,8234,8314,8396,8562,8610,8766,8830,8938,9022,9194,9250,9370,9450,9566,9654,9832,9880,10060];
    if (M < 182) return phi[M];
    return Math.round(M * M * 0.30396355092701331433 + M / 4); // experimental; see below
}

function fractions(M, s, t) {
    var half = fractions01(M);
    var frac = (s == 0) ? half : 0;
    for (var i = (s == 0) ? 1 : s; i < t && i <= M; i++) {
        if (2 * i < M) {
            var f = Math.round(half / (i * (i + 1)));
            frac += (f < 2) ? 2 : f;
        }
        else ++frac;
    }
    return frac;
}
var M = 1023, s = 100, t = 105;
document.write(fractions(M, s, t));
Comparing the approximation of Φ(n) with the list of the 50,000 first values suggests that adding M÷4 is a workable substitute for the second part of the formula; I have not tested this for larger values of n, so use with caution.
(Plot omitted; blue = simplified formula, red = improved simplified formula.)
Quality of approximation of distribution
Comparing the results for M=1023 with those of a brute-force algorithm, the errors are small in real terms, never more than -7 or +6, and above the interval 205~206 they are limited to -1 ~ +1. However, a large part of the range (57~1024) has fewer than 100 fractions per integer, and in the interval 171~1024 there are only 10 fractions or fewer per integer. This means that small errors and rounding errors of -1 or +1 can have a large impact on the result, e.g.:
interval: 241 ~ 250
fractions/integer: 6
approximation: 5
total: 50 (instead of 60)
To improve the results for intervals with few fractions per integer, I would suggest combining the method described above with a separate approach for the last part of the range:
Alternative method for last part of range
As already mentioned, and implemented in the code example, the second half of the range, M÷2 ~ M, has 1 fraction per integer. Also, the interval M÷3 ~ M÷2 has 2; the interval M÷4 ~ M÷3 has 4. This is of course the Φ(n) sequence again:
M/2 ~ M : 1
M/3 ~ M/2: 2
M/4 ~ M/3: 4
M/5 ~ M/4: 6
M/6 ~ M/5: 10
M/7 ~ M/6: 12
M/8 ~ M/7: 18
M/9 ~ M/8: 22
M/10 ~ M/9: 28
M/11 ~ M/10: 32
M/12 ~ M/11: 42
M/13 ~ M/12: 46
M/14 ~ M/13: 58
M/15 ~ M/14: 64
M/16 ~ M/15: 72
M/17 ~ M/16: 80
M/18 ~ M/17: 96
M/19 ~ M/18: 102 ...
Between these intervals, one integer can have a different number of fractions, depending on the exact value of M, e.g.:
interval fractions
202 ~ 203 10
203 ~ 204 10
204 ~ 205 9
205 ~ 206 6
206 ~ 207 6
The interval 204 ~ 205 lies on the edge between intervals, because M ÷ 5 = 204.6; it has 6 + 3 = 9 fractions because M modulo 5 is 3. If M had been 1022 or 1024 instead of 1023, it would have 8 or 10 fractions. (This example is straightforward because 5 is a prime; see below.)
Again, I would suggest using the hard-coded values for Φ(n) to calculate the number of fractions for the last part of the range. If you use the first 17 values as listed above, this covers the part of the range with fewer than 100 fractions per integer, so that would reduce the impact of rounding errors below 1%. The first 56 values would give you 0.1%, the first 182 values 0.01%.
Together with the values of Φ(n), you could hard-code the number of fractions of the edge intervals for each modulo value, e.g.:
modulo: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
M/ 2 1 2
M/ 3 2 3 4
M/ 4 4 5 5 6
M/ 5 6 7 8 9 10
M/ 6 10 11 11 11 11 12
M/ 7 12 13 14 15 16 17 18
M/ 8 18 19 19 20 20 21 21 22
M/ 9 22 23 24 24 25 26 26 27 28
M/10 28 29 29 30 30 30 30 31 31 32
M/11 32 33 34 35 36 37 38 39 40 41 42
M/12 42 43 43 43 43 44 44 45 45 45 45 46
M/13 46 47 48 49 50 51 52 53 54 55 56 57 58
M/14 58 59 59 60 60 61 61 61 61 62 62 63 63 64
M/15 64 65 66 66 67 67 67 68 69 69 69 70 70 71 72
M/16 72 73 73 74 74 75 75 76 76 77 77 78 78 79 79 80
M/17 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
M/18 96 97 97 97 97 98 98 99 99 99 99 100 100 101 101 101 101 102
This is exactly the same as the sum of phi(k) for 1 <= k <= M, where phi(k) is the Euler Totient Function, with phi(0) = 1 (as defined by the problem). There is no known closed form for this sum, but many optimizations are known, as mentioned in the wiki link. This is known as the Totient Summatory Function on Wolfram MathWorld, which also links to the sequence A002088 and provides a few asymptotic approximations.
The reasoning is this: consider the numbers of the form {1/M, 2/M, ..., (M-1)/M, M/M}. All those fractions that can be reduced will not be counted in phi(M), because numerator and denominator are not relatively prime; they appear in the summation for another totient instead.
For example, for M = 6 the sum is 1 + 1 + 2 + 2 + 4 + 2 = 12, which matches the example above once the 0 is counted in place of 1/1.
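For moderate M, the sum itself can be computed exactly with a standard totient sieve; a minimal sketch (my own, not from the answer) that can double as a brute-force check of the approximations above:

def totient_summatory(M):
    # sum of Euler's phi(k) for 1 <= k <= M, via a sieve over the multiples of each prime
    phi = list(range(M + 1))
    for p in range(2, M + 1):
        if phi[p] == p:  # p is prime: no smaller factor has adjusted it yet
            for multiple in range(p, M + 1, p):
                phi[multiple] -= phi[multiple] // p
    return sum(phi[1:])

print(totient_summatory(6))     # 12, the density between 0 and 1 for M = 6
print(totient_summatory(1023))  # 318452 according to the table cited above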

SAS: Add value for specific rows

I want to assign a value to some specific rows. I think showing it by example would be better. I have the following dataset:
Date Value
01/01/2001 10
02/01/2001 20
03/01/2001 35
04/01/2001 15
05/01/2001 25
06/01/2001 35
07/01/2001 20
08/01/2001 45
09/01/2001 35
My result should be:
Date Value Spec.Value
01/01/2001 10 1
02/01/2001 20 1
03/01/2001 35 1
04/01/2001 15 2
05/01/2001 25 2
06/01/2001 35 2
07/01/2001 20 3
08/01/2001 45 3
09/01/2001 35 3
As you can see, my condition value is 35, and I have three 35s. I need to group my data using this condition value.
data want;
    set have;
    retain specvalue 1;
    if lag(value) = 35 then do;
        specvalue + 1;
    end;
run;

Cumulative sum in multiple columns in SAS

I have been searching for a solution for a while, but I couldn't find any similar question in the SAS communities. So here is my question: I have a big SAS table, let's say with 2 class variables and 26 variables:
A B Var1 Var2 ... Var25 Var26
-----------------------------
1 1 10 20 ... 35 30
1 2 12 24 ... 32 45
1 3 20 23 ... 24 68
2 1 13 29 ... 22 57
2 2 32 43 ... 33 65
2 3 11 76 ... 32 45
...................
...................
I need to calculate the cumulative sum of all 26 variables along class B, which means that for A=1 it accumulates through B=1,2,3, and for A=2 it again accumulates through B=1,2,3. The resulting table will look like:
A B Cum1 Cum2 ... Cum25 Cum26
-----------------------------
1 1 10 20 ... 35 30
1 2 22 44 ... 67 75
1 3 40 67 ... 91 143
2 1 13 29 ... 22 57
2 2 45 72 ... 55 121
2 3 56 148 .. 87 166
...................
...................
I could do it the hard way, listing each of the 26 variables in a loop and then finding the cumulative sums through B. But I want a more practical solution that does not name every variable.
On one website a solution like this was suggested:
proc sort data= (drop=percent cum_pct rename=(count=demand cum_freq=cal));
    weight var1;
run;
I am not sure there is any option like WEIGHT in PROC SORT, but if it works, then I thought maybe I can modify it by putting _numeric_ instead of var1, so that the step would process all the numeric variables:
proc sort data= (drop=percent cum_pct rename=(count=demand cum_freq=cal));
    weight _numeric_;
run;
Any ideas?
One way to accomplish this is to use 2 'parallel' arrays, one for your input values and another for the cumulative values.
%LET N = 26 ;

data cum ;
    set have ;
    by A B ;
    array v{*} var1-var&N ;
    array c{*} cum1-cum&N ;
    retain c . ;
    if first.A then call missing(of c{*}) ; /* reset on new values of A */
    do i = 1 to &N ;
        c{i} + v{i} ;
    end ;
    drop i ;
run ;

Indicator for Top 3 and ranking across rows

What I am trying to do is the following: I want to find out whether an observation (A) is in the top 3 across the other columns.
For example,
A B C D E F G H TOP3-A
1 20 30 40 50 60 70 80 90 N
2 80 90 70 80 0 0 0 0 Y
3 70 0 0 80 90 0 0 0 Y
4 60 70 80 90 0 0 0 0 N
I am thinking transpose + rank + transpose + if < 4 then Y else N, however it seems too cumbersome, and to be honest, as a newbie I do not know how to code all these steps correctly...
Your method would work, but there's a much simpler way of doing it.
You could use an array, which reads across rows, however I'm using an even easier way of reading across rows.
The OF keyword can be used in conjunction with a summary function to calculate values across a row rather than down a column. The LARGEST function returns the nth largest value from a range, so you can compare field A to the 3rd largest value in the row.
I've given you the answer to produce Y, N plus an alternative that produces 1, 0 which is even simpler.
data have;
    input A B C D E F G H;
    datalines;
20 30 40 50 60 70 80 90
80 90 70 80 0 0 0 0
70 0 0 80 90 0 0 0
60 70 80 90 0 0 0 0
;
run;

data want;
    set have;
    if A >= largest(3, of A--H) then top3_A = 'Y'; /* A--H references all columns between A and H */
    else top3_A = 'N';
    /* or */
    top3_A2 = (A >= largest(3, of A--H)); /* returns 1 for true, 0 for false */
run;