Equal-depth binning: is it just grouping the data into k groups? - data-mining

A small confusion about equal-depth (equal-frequency) binning.
Equal-depth binning is described like this: it divides the range into N intervals, each containing approximately the same number of samples.
Let's take a small portion of the iris data:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
If I need to bin the first column, what will the result be?
Is it just grouping the data, or does it involve some calculation, as equal-width binning does?
What happens if the number of elements to be binned is odd? How do I bin them equally?

As @Anony-Mousse mentions, it is not always possible to get exactly the same number of samples in each bin; approximately the same is what is desired.
I will walk you through the case where length(unique(N))/bins > 1, where N represents the values in the array to be binned. Assume
N = [1, 1, 1, 1, 1, 1,
2, 3, 4, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
bins = 4
Here, length(N) = 20 and length(unique(N)) = 6, so length(unique(N))/bins = 1.5 > 1, which means every bin will hold approximately 1.5 unique values. So you put 1 in bin1 and carry the 0.5 residue over to the next bin, which raises the number of unique values in that bin to 1.5 + 0.5 = 2, so 2 and 3 go in bin2. Extrapolating this logic, the final bins have the following split: [1], [2,3], [4], [5,6]. Of course, 1 repeats 6 times and 6 repeats 10 times.
I would not want ties to sit in separate bins; keeping values that are close to one another together is usually the point of having bins.
For cases with length(unique(N))/bins < 1, the same logic can be applied. Hope this answers your question.
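The carry-over rule described above can be sketched in Python as follows. This is only an illustration under assumptions: the function name `bin_unique_values` is made up here, and using `Fraction` to keep the residue exact is a choice of this sketch, not something the answer prescribes.

```python
from fractions import Fraction

def bin_unique_values(values, bins):
    """Split the sorted unique values into `bins` groups, carrying the
    fractional residue of each bin's quota over to the next bin."""
    unique = sorted(set(values))
    target = Fraction(len(unique), bins)  # unique values per bin, e.g. 3/2
    carry = Fraction(0)
    result = []
    start = 0
    for _ in range(bins):
        quota = target + carry
        take = int(quota)       # whole number of unique values for this bin
        carry = quota - take    # residue passed on to the next bin
        result.append(unique[start:start + take])
        start += take
    return result

N = [1] * 6 + [2, 3, 4, 5] + [6] * 10
print(bin_unique_values(N, 4))  # [[1], [2, 3], [4], [5, 6]]
```

Because all duplicates of a value share one bin, the bins hold 6, 2, 1 and 11 samples here, so this variant keeps ties together at the cost of equal frequency.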

Sometimes you cannot make bins of exactly the same size.
For example, if your data is
1,1,1,2,99
and you want 4 bins, then the most intuitive result should be
[1,1,1], [2], [], [99]
Most tools will produce one of these answers:
[1,1,1], [], [2], [99]
[1,1], [1], [2], [99]
[1], [1], [1], [2,99]
None of them has exactly 1.25 elements in every bin. The last two solutions come closest, but they are also the least intuitive. That is why one only demands "approximately the same number". Sometimes there is no good solution that has exactly this frequency.
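As a rough illustration of how a purely position-based split behaves, the sketch below sorts the data and cuts it into k consecutive chunks whose sizes differ by at most one. The function name `equal_depth_bins` is hypothetical, and real tools may break ties differently.

```python
def equal_depth_bins(values, k):
    """Sort the values and cut them into k consecutive chunks whose sizes
    differ by at most one; ties may end up in different bins."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    bins = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` bins get one more
        bins.append(ordered[start:start + size])
        start += size
    return bins

print(equal_depth_bins([1, 1, 1, 2, 99], 4))           # [[1, 1], [1], [2], [99]]
print(equal_depth_bins([5.1, 4.9, 4.7, 4.6, 5.0], 2))  # [[4.6, 4.7, 4.9], [5.0, 5.1]]
```

The first call reproduces the second of the tool outputs listed above; the second call shows that an odd number of samples simply leaves one bin with one extra element.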

Related

Minimum number of iterations

We are given an array containing the numbers from 1 to n (no duplicates), where n is the size of the array.
We are allowed to do the following operation :
arr[i] = arr[arr[i]-1] , 0 <= i < n
One iteration consists of performing the above operation on the entire array (every element is updated from the values the array had before the iteration).
Our task is to find the number of iterations after which we encounter a previously encountered sequence.
Constraints :
a) Array has no duplicates
b) 1 <= arr[i] <= n , 0 <= i < n
c) 1 <= n <= 10^6
Ex 1:
n = 5
arr[] = {5, 4, 2, 1, 3}
After 1st iteration array becomes : {3, 1, 4, 5, 2}
After 2nd iteration array becomes : {4, 3, 5, 2, 1}
After 3rd iteration array becomes : {2, 5, 1, 3, 4}
After 4th iteration array becomes : {5, 4, 2, 1, 3}
In the 4th iteration, the sequence obtained has already been seen before.
So the expected output is 4.
This question was asked in a job hiring test, so I don't have a link to it.
Two sample test cases were given, of which I remember only the one above. I would really appreciate any help with this question.
P.S.
I was able to code the brute-force solution, in which I stored all the results in a Set and kept advancing to the next permutation. But it gave TLE.
First, note that an array of length n containing 1, 2, ..., n with no duplicates is a permutation.
Next, observe that arr[i] := arr[arr[i] - 1] is squaring the permutation.
That is, consider permutations as elements of the symmetric group S_n, where multiplication is composition of permutations.
Then the above operation is arr := arr * arr.
So, in terms of permutations and their composition, the question is as follows:
You are given a permutation p (= arr).
Consider permutations p, p^2, p^4, p^8, p^16, ...
What is the number of distinct elements among them?
Now, to solve it, consider the cycle notation of the permutation.
Every permutation is a product of disjoint cycles.
For example, 6 1 4 3 5 2 is the product of the following cycles: (1 6 2) (3 4) (5).
In other words, every application of this permutation:
moves elements at positions 1, 6, 2 along the cycle;
moves elements at positions 4, 3 along the cycle;
leaves element at position 5 in place.
So, when we consider p^k (take an identity permutation and apply the permutation p to it k times), we actually process three independent actions:
move elements at positions 1, 6, 2 along the cycle, k times;
move elements at positions 4, 3 along the cycle, k times;
leave element at position 5 in place, k times.
Now, take into account that, after d applications of a cycle of length d, it just returns all the respective elements to their initial places.
So, we can actually formulate p^k as:
move elements at positions 1, 6, 2 along the cycle, (k mod 3) times;
move elements at positions 4, 3 along the cycle, (k mod 2) times;
leave element at position 5 in place.
We can now prove (using Chinese Remainder Theorem, or just using general knowledge of group theory) that the permutations p, p^2, p^3, p^4, p^5, ... are all distinct up to p^m, where m is the least common multiple of all cycle lengths.
In our example with p = 6 1 4 3 5 2, we have p, p^2, p^3, p^4, p^5, and p^6 all distinct.
But p^6 is the identity permutation: moving six times along a cycle of length 2 or 3 results in the items at their initial places.
So p^7 is the same as p^1, p^8 is the same as p^2, and so on.
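To make the cycle argument concrete, here is a small Python sketch that decomposes the example permutation into cycles and checks that its powers repeat after lcm(3, 2, 1) = 6 steps. The helper names `cycles_of`, `order_of` and `power` are invented for this illustration.

```python
from math import gcd

def cycles_of(perm):
    """Return the cycles of a permutation given as a list of 1-based images."""
    seen = [False] * len(perm)
    cycles = []
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle, i = [], start
        while not seen[i]:
            seen[i] = True
            cycle.append(i + 1)
            i = perm[i] - 1
        cycles.append(tuple(cycle))
    return cycles

def order_of(perm):
    """Least common multiple of all cycle lengths."""
    order = 1
    for cycle in cycles_of(perm):
        order = order * len(cycle) // gcd(order, len(cycle))
    return order

def power(perm, k):
    """Apply perm k times, starting from the identity arrangement."""
    result = list(range(1, len(perm) + 1))
    for _ in range(k):
        result = [perm[x - 1] for x in result]
    return result

p = [6, 1, 4, 3, 5, 2]
print(cycles_of(p))  # [(1, 6, 2), (3, 4), (5,)]
print(order_of(p))   # 6
print(power(p, 6))   # [1, 2, 3, 4, 5, 6]
```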
Our question however is harder: we want to know the number of distinct permutations not among p, p^2, p^3, p^4, p^5, ..., but among p, p^2, p^4, p^8, p^16, ...: p to the power of a power of two.
To do that, consider all cycle lengths c_1, c_2, ..., c_r in our permutation.
For each c_i, find the pre-period and period of 2^k mod c_i:
For example, c_1 = 3, and 2^k mod 3 (for k = 0, 1, 2, ...) looks like 1, 2, 1, 2, 1, 2, ..., which is (1, 2) with pre-period 0 and period 2.
As another example, c_2 = 2, and 2^k mod 2 looks like 1, 0, 0, 0, ..., which is 1, (0) with pre-period 1 and period 1.
In this problem, this part can be done naively, by just marking visited numbers mod c_i in some array.
By Chinese Remainder Theorem again, after all pre-periods are considered, the period of the whole system of cycles will be the least common multiple of all individual periods.
What remains is to consider pre-periods.
These can be processed with your naive solution anyway, as the length of each pre-period here is at most log_2 n.
The answer is the least common multiple of all individual periods, calculated as above, plus the length of the longest pre-period.
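Putting the pieces together, a possible Python sketch of this approach is shown below. The function names are made up for this example, and the pre-period and period of 2^k mod c are found naively by remembering when each residue was first seen, as suggested above.

```python
from math import gcd

def cycle_lengths(arr):
    """Distinct cycle lengths of the permutation arr (values 1..n)."""
    seen = [False] * len(arr)
    lengths = set()
    for start in range(len(arr)):
        length, i = 0, start
        while not seen[i]:
            seen[i] = True
            i = arr[i] - 1
            length += 1
        if length:
            lengths.add(length)
    return lengths

def pre_period_and_period(c):
    """Pre-period and period of the sequence 2^k mod c for k = 0, 1, 2, ..."""
    first_seen = {}
    value, k = 1 % c, 0
    while value not in first_seen:
        first_seen[value] = k
        value = (value * 2) % c
        k += 1
    return first_seen[value], k - first_seen[value]

def iterations_until_repeat(arr):
    """Longest pre-period plus the lcm of all periods."""
    longest_pre, period = 0, 1
    for c in cycle_lengths(arr):
        pre, per = pre_period_and_period(c)
        longest_pre = max(longest_pre, pre)
        period = period * per // gcd(period, per)
    return longest_pre + period

print(iterations_until_repeat([5, 4, 2, 1, 3]))     # 4
print(iterations_until_repeat([6, 1, 4, 3, 5, 2]))  # 3
```

The distinct cycle lengths add up to at most n, so the naive residue marking costs O(n) steps in total. The lcm itself may become a very large integer, which Python handles natively; in a language with fixed-width integers that would need extra care.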

How to find smallest connected label in equivalency list

I have a list of numbers stored in a standard vector. Some of the numbers are children of other numbers. Here is an example
3, 4
3, 5
5, 6
7, 3
8, 9
8, 1
8, 2
9, 8
Or as a graph: [diagram linking 3-4, 3-5, 5-6 and 7-3 in one component, and 8-9, 8-1 and 8-2 in the other]
That is, there are two clusters: 3,4,5,6,7 and 1,2,8,9. The root number is the smallest number of a cluster, here 3 and 1. I would like to know which algorithms I can use to extract a list like this:
3, 4
3, 5
3, 6
3, 7
1, 2
1, 8
1, 9
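An algorithm similar to the disjoint-set union (union-find) algorithm can help you:
Initialize N disjoint subsets, each containing exactly one number, so that the root of number i, r(i), is i.
For each edge (u, v), find the current roots r(u) and r(v) by following the links up from u and v, then link both roots to the smaller one:
t = min(r(u), r(v))
root of r(u) = t
root of r(v) = t
Finally, for each i with i != r(i), write out the pair r(i), i.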
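A minimal union-find sketch of this idea in Python follows. The function name `smallest_label_pairs` is hypothetical; the larger root is always linked to the smaller one, so the root of every cluster ends up being its smallest number.

```python
def smallest_label_pairs(edges):
    """Return (root, member) pairs, where root is the smallest label of
    the member's connected cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Link the larger root to the smaller one.
            parent[max(ru, rv)] = min(ru, rv)

    return sorted((find(x), x) for x in list(parent) if find(x) != x)

edges = [(3, 4), (3, 5), (5, 6), (7, 3), (8, 9), (8, 1), (8, 2), (9, 8)]
print(smallest_label_pairs(edges))
# [(1, 2), (1, 8), (1, 9), (3, 4), (3, 5), (3, 6), (3, 7)]
```

The result is sorted by root, so the cluster rooted at 1 is listed before the one rooted at 3.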

how to handle reappearing values with std::next_permutation

I have a few vectors.
I want to find all permutations of each vector.
It works reasonably well, when the values are unique but if there are reappearing values it messes up.
I have the following vectors
vector<string> present = {"Schaukelpferd","Schaukelpferd","Puppe","Puppe"};
vector<string> children = {"Jan","Tim","Alex","Daniel"};
vector<int> houses = {4,5,5,5};
I am sorting them before using next_permutation():
sort(present.begin(), present.end());
sort(children.begin(), children.end());
sort(houses.begin(), houses.end());
do {
    present_perm.push_back(present);
} while (next_permutation(present.begin(), present.end()));
do {
    children_perm.push_back(children);
} while (next_permutation(children.begin(), children.end()));
do {
    houses_perm.push_back(houses);
} while (next_permutation(houses.begin(), houses.end()));
children works well, but present and houses don't work as expected.
children returns 24 permutations, as expected, but present returns only 6 and houses returns only 4. I would expect all of them to return 24, because all vectors have 4 elements (4! = 24).
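Consider the four integer values 4, 5, 5, 5. The only four possible permutations are 4, 5, 5, 5 and 5, 4, 5, 5 and 5, 5, 4, 5 and 5, 5, 5, 4. That's it. The three 5s have the same value, so they cannot be distinguished from each other. The algorithm doesn't keep track of which of those values originally came before the other; they're the same. The same thing applies to present: it holds only two distinct values, each appearing twice, so there are 4!/(2!·2!) = 6 distinct permutations, not 24. If you really need all 24 orderings, permute a vector of indices (0, 1, 2, 3) instead and use each index permutation to look up the elements.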
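The counts 24, 6 and 4 follow from the multinomial formula n! / (m_1! · m_2! · ...), where the m_i are the multiplicities of the repeated values. The question is about C++, but the arithmetic can be checked quickly with a small Python sketch (the function name `distinct_permutation_count` is made up for this example):

```python
from collections import Counter
from itertools import permutations
from math import factorial

def distinct_permutation_count(items):
    """n! divided by the factorial of each value's multiplicity."""
    count = factorial(len(items))
    for multiplicity in Counter(items).values():
        count //= factorial(multiplicity)
    return count

present = ["Schaukelpferd", "Schaukelpferd", "Puppe", "Puppe"]
children = ["Jan", "Tim", "Alex", "Daniel"]
houses = [4, 5, 5, 5]

print(distinct_permutation_count(children))  # 24
print(distinct_permutation_count(present))   # 6
print(distinct_permutation_count(houses))    # 4

# Permuting by position yields 24 tuples, but only 4 of them are distinct.
print(len(list(permutations(houses))), len(set(permutations(houses))))  # 24 4
```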

How can I find median with an even amount of numbers in a list?

This is what I have right now. It just finds the median with an odd amount of numbers.
def median(height):
    height.sort()
    x = len(height)
    x -= 1
    posn = x // 2
    return height[posn]
"The median is the numeric value separating the higher half of a sample data set from the lower half. The median of a data set can be found by arranging all the values from lowest to highest value and picking the one in the middle. If there is an odd number of data values then the median will be the value in the middle. If there is an even number of data values the median is the mean of the two data values in the middle." - Source
For the data set 1, 1, 2, 5, 6, 6, 9 the median is 5.
For the data set 1, 1, 2, 6, 6, 9 the median is 4. It is the mean of 2 and 6 or, (2+6)/2 = 4.
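Following that definition, one way to handle both cases is sketched below. It sorts a copy rather than the caller's list, which is a choice of this sketch and not part of the original function.

```python
def median(values):
    """Median of a non-empty list: the middle value for an odd count,
    the mean of the two middle values for an even count."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)  # sorted copy; the input list is left untouched
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 1, 2, 5, 6, 6, 9]))  # 5
print(median([1, 1, 2, 6, 6, 9]))     # 4.0
```

Python's standard library also provides `statistics.median`, which behaves the same way for these inputs.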

Ascending subsequences in permutation

Given a permutation of 1...n, for example 5 3 4 1 2,
how can I find all ascending subsequences of length 3 in linear time?
Is it possible to find other ascending subsequences of length X ? X
I have no idea how to solve it in linear time.
Do you need the actual ascending sequences? Or just the number of ascending subsequences?
It isn't possible to generate them all in less than the time it takes to list them, which, as has been pointed out, is O(N^X / (X-1)!). (There is a possibly unexpected factor of X because it takes time O(X) to list a data structure of size X.) The obvious recursive search for them scales not far from that.
However, counting them can be done in time O(X * N^2) if you use dynamic programming. Here is Python for that.
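perm = [3, 5, 1, 2, 4, 6]  # the permutation to examine
X = 3                      # length of the ascending subsequences to count
counts = []  # counts[i][k]: ascending subsequences of length k+1 ending at perm[i]
answer = 0
for i in range(len(perm)):
    inner_counts = [0 for k in range(X)]
    inner_counts[0] = 1  # perm[i] on its own has length 1
    for j in range(i):
        if perm[j] < perm[i]:
            # extend every subsequence ending at perm[j] by perm[i]
            for k in range(1, X):
                inner_counts[k] += counts[j][k-1]
    counts.append(inner_counts)
    answer += inner_counts[-1]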
For your example 3 5 1 2 4 6 and X = 3 you will wind up with:
counts = [
[1, 0, 0],
[1, 1, 0],
[1, 0, 0],
[1, 1, 0],
[1, 3, 1],
[1, 5, 5]
]
answer = 6
(You only found 5 above, the missing one is 2 4 6.)
It isn't hard to extend this answer to create a data structure that makes it easy to list them directly, to find a random one, etc.
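For small inputs, the count can be cross-checked by listing the subsequences directly. The brute-force sketch below is only a verification aid, not the extended data structure mentioned above; it relies on `itertools.combinations` keeping elements in their original order, and the function name is invented for this example.

```python
from itertools import combinations

def ascending_subsequences(perm, length):
    """List every subsequence of the given length whose values strictly increase."""
    return [
        combo
        for combo in combinations(perm, length)
        if all(a < b for a, b in zip(combo, combo[1:]))
    ]

found = ascending_subsequences([3, 5, 1, 2, 4, 6], 3)
print(len(found))  # 6
print(found)
# [(3, 5, 6), (3, 4, 6), (1, 2, 4), (1, 2, 6), (1, 4, 6), (2, 4, 6)]
```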
You can't find all ascending subsequences in linear time, because there may be many more subsequences than that.
For instance, in a sorted original sequence every subset is an increasing subsequence, so a sorted sequence of length N (1,2,...,N) has N choose k = N!/((N-k)! k!) increasing subsequences of length k.