Data mining: frequent itemsets

When an exam question asks you to find all frequent itemsets, is the answer just the last set that was worked out, or is it necessary to give all the sets found before it as well?
e.g. if the last result I get is the set (A,B,D), is that my frequent itemset, or do I also need to include everything found before it that also satisfies minSup, e.g. (A,B), (A,D), (B,D), etc.?

I understand you to be asking: if an itemset satisfies the minimum support threshold, do its subsets also satisfy it? The answer is yes, the subsets also satisfy the minimum support threshold.
The logic here is the bottom-up approach. The size-2 candidate itemsets are generated from the size-1 frequent itemsets, the size-3 candidates from the size-2 frequent itemsets, and so on.
For an example data set:
Row 1 : A B C D E
Row 2 : A C D
Row 3 : B C
Row 4 : A C D E
Row 5 : A D E
Row 6 : A B C D
Row 7 : A B C
Row 8 : A C
Row 9 : B C D
Row 10: B D E
First, the size-1 candidate itemsets are generated: A, B, C, D, E. Then the support of each candidate is counted: A=7, B=6, C=8, D=7, E=4. If the minSup value is 5, E is pruned; if minSup is 3, all size-1 candidates are evaluated as frequent.
Second, the size-2 candidate itemsets are generated by joining the size-1 frequent itemsets: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE. After this the support of each candidate is counted. The support of AB is 3, since Rows 1, 6 and 7 contain this pattern. But the candidate BE is present only in Rows 1 and 10, so with minSup = 3 it is pruned.
Because of this construction, if a larger itemset is frequent, its subsets must also be frequent; otherwise the larger itemset could never have been generated.
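To make that bottom-up generation concrete, here is a rough Python sketch of the same counting on the example rows above; the names (transactions, min_sup, support) are mine, and the simple pairwise union stands in for the textbook Apriori join/prune step:

transactions = [set("ABCDE"), set("ACD"), set("BC"), set("ACDE"), set("ADE"),
                set("ABCD"), set("ABC"), set("AC"), set("BCD"), set("BDE")]
min_sup = 3

def support(itemset):
    # Number of rows that contain every item of the itemset
    return sum(1 for row in transactions if itemset <= row)

# Level 1: frequent single items
frequent = [frozenset(i) for i in "ABCDE" if support(frozenset(i)) >= min_sup]

all_frequent = list(frequent)
k = 2
while frequent:
    # Size-k candidates are built only from size-(k-1) frequent itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_sup]   # prune by minSup
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(sorted(itemset), support(itemset))

Printing all_frequent also answers the exam question: every frequent itemset of every size is reported, not just the largest one.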
I hope I could explain myself.


Looking for hints to solve this dynamic programming problem

I am trying to improve my problem solving skills for programming interviews and am trying to solve this problem. I have a feeling it can be solved using dynamic programming but the recursive relationship is not obvious to me.
To select the first three choir singers I simply use brute force, since there are only 20 choose 3 = 1140 ways to pick them. At first I thought dp[a][b][c] could represent the shortest song with three choir singers whose remaining breath is a, b, c, computed as dp[a][b][c] = 1 + dp[a - 1][b - 1][c - 1], but what should be done when any of the indices reaches 0 - which choir singer should be substituted in? Additionally, we cannot reuse the dp array: say in one instance we start with choir singers with breath a, b, c and in a second instance with d, e, f. Once the first instance has been calculated and the dp array filled, the second instance may need to use dp[i][j][k] computed by the first instance. Since this value depends on the singers available in the first instance, and the available singers in the two instances are not the same, dp[i][j][k] may not be valid in the second instance: the shortest song length dp[i][j][k] may use choir singers that in the second instance are already singing.
I am out of ideas to tackle this problem and there is no solution anywhere. Could someone give me some hints to solve it?
Problem statement
We have N singers, each of whom can sing for a certain time and needs 1 second to recover once out of breath. What is the minimum length of song they can sing, with three singers singing at all times and all three finishing simultaneously?
Input:
3 < N <= 20
N integers Fi (1 <= Fi <= 10, for all 1 <= i <= N)
Here is the idea.
At each point in the singing, the current state can be represented by who the singers are, how long they have been singing, and which ones are currently out of breath. And from each state we need to transition to a new state, which is every singer out of breath is ready to sing again, every singer singing is good for one less turn, and new singers might be chosen.
Done naively, there are up to 20 choose 3 = 1140 choices of three singers, each of whom can be in 10 current states, plus up to 2 more singers who are out of breath. That comes to 175,560,000 combined states. That's too many; we need to be more clever to make this work.
Being more clever, we do not have 20 distinguishable singers; we have 10 buckets of singers based on how long they can sing. If a singer can sing for 7 turns, they cannot be in 10 states while singing, only 7. And we do not care whether the two singers who can sing for 7 turns have 4 and 3 turns of breath left or 3 and 4 - they are the same. This introduces a lot of symmetries. Once we take care of all of the symmetries, the number of possible states drops from hundreds of millions to (usually) tens of thousands.
And now we have a state transition for our DP which is dp[state1] to dp[state2]. The challenge being to produce a state representation that takes advantage of these symmetries that you can use as keys to your data structure.
UPDATE:
The main loop of the code would look like this Python:
while not finished:
    song_length += 1
    next_states = set()
    for state in current_states:
        for next_state in transitions(state):
            if is_finished(next_state):
                finished = True  # Could break out of loops here
            else:
                next_states.add(next_state)
    current_states = next_states
Most of the challenge is a good representation of a state, and your transitions function.
The state in terms of memoisation seems unrelated to the time elapsed since the start. Take any starting position,
a, b, c
where a, b, c are chosen magnitudes (how long each singer can hold their breath), and a is the smallest magnitude. We have
a, b, c
t = 0
and it's the same as:
0, b - a, c - a
t = a
So let's define the initial state with smallest magnitude a as:
b, c, ba, ca
where ba = b - a
ca = c - a
t = a
From here, every transition of the state is similar:
new_a <- x
where x is a magnitude in the list that can be available together with b and c. (We only need to try each such unique magnitude once during this iteration. We must also prevent a singer from repeating.)
let m = min(new_a, ba, ca)
then the new state is:
u, v, um, vm
t = t + m
where u and v are from the elements of [new_a, b, c] that aren't associated with m, and um and vm are their pairs from [new_a, ba, ca] that aren't m, subtracted by m.
The state for memoisation of visited combinations can be just:
[(b, ba), (c, ca)] sorted by the tuples' first element
with which we can prune a branch of the search if the t reached is equal to or higher than the minimal one seen for that state.
Example:
2 4 7 6 5
Solution (read top-down):
4 5 6
7 4 5
2
States:
u v um vm
5 6 1 2
t = 4
new_a = 7
m = min(7, 1, 2) = 1 (associated with 5)
7 6 6 1
t = 5
new_a = 4
m = min(4, 6, 1) = 1 (associated with 6)
4 7 3 5
t = 6
new_a = 5
m = min(5, 3, 5) = 3 (associated with 4)
5 7 2 2
t = 9
new_a = 2
m = min(2, 2, 2) = 2 (associated with 2)
5 7 0 0
t = 11
Python code:
import heapq
from itertools import combinations

def f(A):
    mag_counts = {}
    for x in A:
        if x in mag_counts:
            mag_counts[x] = mag_counts[x] + 1
        else:
            mag_counts[x] = 1
    q = []
    seen = set()
    # Initialise the queue with unique starting combinations
    for comb in combinations(A, 3):
        sorted_comb = tuple(sorted(comb))
        if sorted_comb not in seen:
            (a, b, c) = sorted_comb
            heapq.heappush(q, (a, (b-a, b), (c-a, c), a))
            seen.add(sorted_comb)
    while q:
        (t, (ba, b), (ca, c), prev) = heapq.heappop(q)
        if ba == 0 and ca == 0:
            return t
        for mag in mag_counts.keys():
            # Check that the magnitude is available
            # and the same singer is not repeating.
            [three, two] = [3, 2] if mag != prev else [4, 3]
            if mag == b == c and mag_counts[mag] < three:
                continue
            elif mag == b and mag_counts[mag] < two:
                continue
            elif mag == c and mag_counts[mag] < two:
                continue
            elif mag == prev and mag_counts[mag] < 2:
                continue
            m = min(mag, ba, ca)
            if m == mag:
                heapq.heappush(q, (t + m, (ba-m, b), (ca-m, c), m))
            elif m == ba:
                heapq.heappush(q, (t + m, (mag-m, mag), (ca-m, c), b))
            else:
                heapq.heappush(q, (t + m, (mag-m, mag), (ba-m, b), c))
    return float('inf')

As = [
    [3, 2, 3, 3],      # 3
    [1, 2, 3, 2, 4],   # 3
    [2, 4, 7, 6, 5]    # 11
]
for A in As:
    print(A, f(A))

searching for a dedicated binary linear programming algorithm

There are various linear programming solvers out there, such as lpSolve or GLPK. These are general LP solvers suitable for many purposes, but is there a dedicated solver for the binary (0-1) case?
To exemplify, my problem involves finding the minimum number of lines that cover all columns of a matrix, something like:
0 1 0 1
1 0 0 1
1 0 1 0
It can be seen that the first and third rows cover all four columns, the second row being irrelevant. Denoting the three rows as A, B and C, the problem boils down to solving a system of inequalities:
B + C >= 1
A >= 1
C >= 1
A + B >= 1
This gives the solution:
A = 1; B = 0; C = 1
(which means the minimum number of lines is 2).
Thanks in advance,
Adrian
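Not a dedicated solver, but for an instance this small the 0-1 problem can be checked directly by brute force over row subsets. A minimal Python sketch, assuming the matrix from the question (the function name is just illustrative):

from itertools import combinations

rows = [
    [0, 1, 0, 1],   # A
    [1, 0, 0, 1],   # B
    [1, 0, 1, 0],   # C
]

def min_cover(rows):
    n_cols = len(rows[0])
    # Try subsets of rows in increasing size; the first full cover is minimal
    for size in range(1, len(rows) + 1):
        for subset in combinations(range(len(rows)), size):
            if all(any(rows[r][c] for r in subset) for c in range(n_cols)):
                return subset
    return None

print(min_cover(rows))   # (0, 2) -> rows A and C, i.e. A = 1, B = 0, C = 1

For larger instances this grows exponentially, which is exactly why one would hand the 0-1 formulation to an (integer) LP solver instead.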

Finding numbers with given GCD values

I was looking at the Euclidean algorithm for finding the GCD of two numbers. It can be used to find the GCD of two given numbers. However, suppose I am given the GCDs of one number with several other numbers, for example the GCD of the first number with 3 other numbers (including itself), that is,
Given is: GCD of a with a, GCD of a with b, GCD of a with c, GCD of a with d.
and same goes for the other numbers, i.e. GCD of b with a, b with b, .....
Then, how can I find the individual numbers? I know that GCD(a,a) = a itself, but the problem here is that the given GCDs are in a random order, so I don't know which value is the GCD of which two numbers. In that case, how do I find the individual numbers?
Here is my GCD code:
int gcd(int a, int b)
{
    if (b == 0)
    {
        return a;
    }
    return gcd(b, a % b);
}
Example: Let's say the input given is,
3 1 3 1 4 2 2 3 6
3 //(total numbers we have to find in original array)
Then the output should be 3 4 6, because if you calculate the GCD of each pair of these numbers (9 pairs in total, and hence 9 numbers as input), you get the values above.
Explanation: 3 -> GCD of (3,3)
1 -> GCD of (3,4)
3 -> GCD of (3,6)
1 -> GCD of (4,3)
4 -> GCD of (4,4)
2 -> GCD of (4,6)
6 -> GCD of (6,6)
3 -> GCD of (6,3)
2 -> GCD of (6,4)
Therefore, I have to find the numbers whose GCD is given as input. Hence, (3,4,6) are those numbers.
I think you can do this by the following process:
Find and remove largest number, this is one of the original numbers
Compute the gcd of the number just found in step 1, with all numbers previously found in step 1.
Remove each of these computed gcds from the input array of gcds (Strictly speaking remove 2 copies of each gcd)
Repeat until all numbers are found
The point is that this only goes wrong if the largest number x found in step 1 is not one of the original numbers. However, this can only happen if x is a gcd of two other numbers. These other numbers must be at least as large as x, but all such gcds have been removed in step 3. Therefore x is always one of the original numbers.
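A rough Python sketch of that procedure (the Counter bookkeeping and the function name are mine, not part of the question):

from math import gcd
from collections import Counter

def recover_numbers(gcds, n):
    remaining = Counter(gcds)
    found = []
    for _ in range(n):
        # Step 1: the largest value still present is one of the original numbers
        x = max(v for v, cnt in remaining.items() if cnt > 0)
        remaining[x] -= 1                    # accounts for gcd(x, x)
        # Steps 2-3: remove gcd(x, y) twice for every number found earlier
        for y in found:
            remaining[gcd(x, y)] -= 2
        found.append(x)
    return found

print(recover_numbers([3, 1, 3, 1, 4, 2, 2, 3, 6], 3))   # [6, 4, 3]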
If the second line of the input is 1, then the first line of the input will have only one number, and due to your observation that gcd(a, a) = a, the value of a will be whatever value is on the first line of input.
If the value on the second line of input is greater than 1, then the problem can be reduced using the following observation. Given positive integers a and b, we know that gcd(a, b) <= a = gcd(a, a) and gcd(a, b) <= b = gcd(b, b) will always hold. Therefore, we can conclude that the largest two numbers on the first line of input must both be part of the basic set of numbers. The largest two numbers may be equal, but in your example, they are 4 and 6, and they are not equal.
If there are more than two numbers to find, let's call the largest two a and b. Since we now know the value of a and b, we can compute gcd(a, b) and remove two occurrences of that value from consideration as one of the input numbers. We remove two occurrences because gcd(a, b) = gcd(b, a) are both in the list of input numbers. Then using similar logic, we conclude that the largest number remaining after a, b, gcd(a, b), and gcd(b, a) are removed must be one of the input numbers.
Wash, rinse, repeat.
This is actually pretty easy:
count how many times each distinct number appears in the array
if that count is odd, then that is one of the numbers in your set
if that count is even, that is not one of the numbers in your set.
done.
This works because when x != y, gcd(x,y) = gcd(y,x) and that number will be in the array twice. Only values that come from gcd(x,x) will be in the array once, leading to an odd number of that specific value.
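A minimal sketch of that parity rule, assuming the original numbers are all distinct (with repeated originals the diagonal gcd(x,x) entries pair up and the parity argument breaks down):

from collections import Counter

def originals_from_gcd_table(gcds):
    counts = Counter(gcds)
    # A value occurs an odd number of times exactly when it is one of the originals
    return sorted(v for v, c in counts.items() if c % 2 == 1)

print(originals_from_gcd_table([3, 1, 3, 1, 4, 2, 2, 3, 6]))   # [3, 4, 6]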
As we see here, a GCD is calculated for each number with every other number and added to the list, so in the end we have n*n values in total for the original n numbers. The same approach can be reversed to find the original numbers:
Sort the list in descending order.
Collect the top n^(0.5) numbers; that would be our answer.
For your example
3 1 3 1 4 2 2 3 6
Sorted = 6 4 3 3 3 2 2 1 1
value of n = 9
value of n^(0.5) = 3
Choose the top 3 numbers: 6, 4 and 3. That's our answer.

Similar to subset sum [duplicate]

This problem was asked to me in an Amazon interview:
Given an array of positive integers, you have to find the smallest positive integer that cannot be formed as a sum of numbers from the array.
Example:
Array: [4 13 2 3 1]
result = 11 (since 11 is the smallest positive number that cannot be formed from the given array elements)
What I did was:
sorted the array
calculated the prefix sums
traversed the prefix-sum array and checked whether the next element is at most 1 greater than the sum so far, i.e. A[j] <= (sum + 1); if not, the answer is sum + 1
But this is an O(n log n) solution.
The interviewer was not satisfied with this and asked for a solution in less than O(n log n) time.
There's a beautiful algorithm for solving this problem in time O(n + Sort), where Sort is the amount of time required to sort the input array.
The idea behind the algorithm is to sort the array and then ask the following question: what is the smallest positive integer you cannot make using the first k elements of the array? You then scan forward through the array from left to right, updating your answer to this question, until you find the smallest number you can't make.
Here's how it works. Initially, the smallest number you can't make is 1. Then, going from left to right, do the following:
If the current number is bigger than the smallest number you can't make so far, then you know the smallest number you can't make - it's the one you've got recorded, and you're done.
Otherwise, the current number is less than or equal to the smallest number you can't make, and the claim is that you can indeed make it. Right now, you know the smallest number you can't make with the first k elements of the array (call it candidate) and are looking at the value A[k]. The number candidate - A[k] must then be some number you can make with the first k elements of the array, since otherwise candidate - A[k] would be a smaller number than the smallest number you allegedly can't make with the first k numbers in the array. Moreover, you can make any number in the range candidate to candidate + A[k] - 1, inclusive, because you can start with any number in the range from 1 to A[k], inclusive, and then add candidate - 1 to it. Therefore, set candidate to candidate + A[k] and increment k.
In pseudocode:
Sort(A)
candidate = 1
for i from 1 to length(A):
    if A[i] > candidate: return candidate
    else: candidate = candidate + A[i]
return candidate
Here's a test run on [4, 13, 2, 1, 3]. Sort the array to get [1, 2, 3, 4, 13]. Then, set candidate to 1. We then do the following:
A[1] = 1, candidate = 1:
A[1] ≤ candidate, so set candidate = candidate + A[1] = 2
A[2] = 2, candidate = 2:
A[2] ≤ candidate, so set candidate = candidate + A[2] = 4
A[3] = 3, candidate = 4:
A[3] ≤ candidate, so set candidate = candidate + A[3] = 7
A[4] = 4, candidate = 7:
A[4] ≤ candidate, so set candidate = candidate + A[4] = 11
A[5] = 13, candidate = 11:
A[5] > candidate, so return candidate (11).
So the answer is 11.
The runtime here is O(n + Sort) because outside of sorting, the runtime is O(n). You can clearly sort in O(n log n) time using heapsort, and if you know some upper bound on the numbers you can sort in time O(n log U) (where U is the maximum possible number) by using radix sort. If U is a fixed constant (say, 10^9), then radix sort runs in time O(n) and this entire algorithm then runs in time O(n) as well.
Hope this helps!
Use bitvectors to accomplish this in linear time.
Start with an empty bitvector b. Then for each element k in your array, do this:
b = b | b << k | 2^(k-1)
To be clear, the i-th bit being set to 1 represents the number i, and | 2^(k-1) sets the k-th bit to 1.
After you finish processing the array, the index of the first zero in b is your answer (counting from the right, starting at 1).
b=0
process 4: b = b | b<<4 | 1000 = 1000
process 13: b = b | b<<13 | 1000000000000 = 10001000000001000
process 2: b = b | b<<2 | 10 = 1010101000000101010
process 3: b = b | b<<3 | 100 = 1011111101000101111110
process 1: b = b | b<<1 | 1 = 11111111111001111111111
First zero: position 11.
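The same trick translates almost directly to Python, where an arbitrary-precision int serves as the bitvector; this sketch is mine, not the answerer's code:

def smallest_unreachable(A):
    b = 0
    for k in A:
        # Every existing sum s also yields s + k; k alone sets bit k (1-indexed)
        b = b | (b << k) | (1 << (k - 1))
    # The first zero bit, counting from 1 on the right, is the answer
    i = 1
    while b & (1 << (i - 1)):
        i += 1
    return i

print(smallest_unreachable([4, 13, 2, 3, 1]))   # 11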
Consider all integers in the interval [2^i .. 2^(i+1) - 1], and suppose all integers below 2^i can be formed as a sum of numbers from the given array. Also suppose that we already know C, which is the sum of all numbers below 2^i. If C >= 2^(i+1) - 1, every number in this interval may be represented as a sum of the given numbers. Otherwise we could check whether the interval [2^i .. C + 1] contains any number from the given array. If there is no such number, C + 1 is what we searched for.
Here is a sketch of an algorithm:
For each input number, determine to which interval it belongs, and update corresponding sum: S[int_log(x)] += x.
Compute prefix sum for array S: foreach i: C[i] = C[i-1] + S[i].
Filter array C to keep only entries with values lower than next power of 2.
Scan input array once more and notice which of the intervals [2^i .. C + 1] contain at least one input number: i = int_log(x) - 1; B[i] |= (x <= C[i] + 1).
Find first interval that is not filtered out on step #3 and corresponding element of B[] not set on step #4.
If it is not obvious why we can apply step 3, here is the proof. Choose any number between 2^i and C, then sequentially subtract from it all the numbers below 2^i in decreasing order. Eventually we get either some number less than the last subtracted number or zero. If the result is zero, just add together all the subtracted numbers and we have the representation of the chosen number. If the result is non-zero and less than the last subtracted number, this result is also less than 2^i, so it is "representable" and none of the subtracted numbers are used for its representation. When we add these subtracted numbers back, we have the representation of the chosen number. This also suggests that instead of filtering intervals one by one we could skip several intervals at once by jumping directly to the int_log of C.
Time complexity is determined by the function int_log(), which is the integer logarithm, i.e. the index of the highest set bit in the number. If our instruction set contains an integer logarithm or any of its equivalents (count leading zeros, or tricks with floating point numbers), then the complexity is O(n). Otherwise we could use some bit hacking to implement int_log() in O(log log U) and obtain O(n * log log U) time complexity. (Here U is the largest number in the array.)
If step 1 (in addition to updating the sum) will also update minimum value in given range, step 4 is not needed anymore. We could just compare C[i] to Min[i+1]. This means we need only single pass over input array. Or we could apply this algorithm not to array but to a stream of numbers.
Several examples:

              Example 1            Example 2        Example 3
Input:        [ 4 13  2  3  1]     [ 1  2  3  9]    [ 1  1  2  9]
int_log:        2  3  1  1  0        0  1  1  3       0  0  1  3

interval i:     0  1  2  3           0  1  2  3       0  1  2  3
S[i]:           1  5  4 13           1  5  0  9       2  2  0  9
C[i]:           1  6 10 23           1  6  6 15       2  4  4 13
filtered(C):    n  n  n  n           n  n  n  n       n  n  n  n
number in
[2^i..C+1]:        2  4  -              2  -  -          2  -  -
C+1:                     11                   7                5
For multi-precision input numbers this approach needs O(n * log M) time and O(log M) space, where M is the largest number in the array. The same time is needed just to read all the numbers (and in the worst case we need every bit of them).
Still, this result may be improved to O(n * log R), where R is the value found by this algorithm (actually, the output-sensitive variant of it). The only modification needed for this optimization is to process the numbers digit by digit instead of whole numbers at once: the first pass processes the low-order bits of each number (like bits 0..63), the second pass the next bits (like 64..127), and so on. We can ignore all higher-order bits once the result is found. This also decreases the space requirement to O(K) numbers, where K is the number of bits in a machine word.
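Here is my attempt at a compact Python version of the idea; int_log(x) is just x.bit_length() - 1, and per interval we keep both the bucket sum and the minimum element, as the single-pass variant suggested above:

def smallest_unreachable(A):
    max_log = max(x.bit_length() - 1 for x in A)     # int_log of the largest number
    S  = [0] * (max_log + 1)                         # sum per interval [2^i .. 2^(i+1)-1]
    Mn = [None] * (max_log + 1)                      # smallest input number per interval
    for x in A:
        i = x.bit_length() - 1                       # int_log(x)
        S[i] += x
        if Mn[i] is None or x < Mn[i]:
            Mn[i] = x

    C = 0                                            # sum of all numbers below 2^i
    for i in range(max_log + 1):
        if C >= 2 ** (i + 1) - 1:
            C += S[i]                                # interval already fully covered
        elif Mn[i] is not None and Mn[i] <= C + 1:
            C += S[i]                                # the bucket extends the reachable prefix
        else:
            return C + 1                             # gap found
    return C + 1

for A in [[4, 13, 2, 3, 1], [1, 2, 3, 9], [1, 1, 2, 9]]:
    print(A, smallest_unreachable(A))                # 11, 7 and 5, matching the table above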
If you sort the array, this approach will work for you. Counting sort could do it in O(n), but in a practically large scenario the range can be very high. Quicksort at O(n log n) will do the job:
def smallestPositiveInteger(array):
    candidate = 1
    n = len(array)
    array = sorted(array)
    for i in range(0, n):
        if array[i] <= candidate:
            candidate += array[i]
        else:
            break
    return candidate

How to list all the possible sums of combinations in an array for C++?

I have my homework assignment and I have no idea how to start writing the code for this kind of problem.
Let's say I have an integer array consisting of n elements,
[A][B][C][D][E] (we have 5 elements, for example)
I want to print out the sums of all possible combinations of elements
(ABCDE, ABCD, ABCE, ABDE, ACDE, BCDE, ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE, AB, AC, AD, AE, BC, BD, BE, CD, CE, DE, A, B, C, D and E).
Another example would be 4 elements in an array ([A][B][C][D]):
I want to list the sums of all combinations (ABCD, ABC, ABD, ACD, BCD, AB, AC, AD, BC, BD, CD, A, B, C and D).
Well, here's a simple rule to follow:
The set of all combinations of "ABCDE" is composed of those combinations which contain (and thus start with) "A" and those which don't contain "A". In both cases, all combinations of "BCDE" can occur. Of course, combinations of "BCDE" can be treated in the same way.
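A small Python sketch of that recursion (the question is about C++, but the idea carries over directly; here the array [1, 2, 3, 4, 5] plays the role of A..E):

def all_subset_sums(values):
    # Returns (subset, sum) for every combination, including the empty one
    if not values:
        return [((), 0)]
    first, rest = values[0], values[1:]
    without_first = all_subset_sums(rest)
    # Every combination either contains the first element or it doesn't
    with_first = [((first,) + subset, first + total) for subset, total in without_first]
    return with_first + without_first

for subset, total in all_subset_sums([1, 2, 3, 4, 5]):
    if subset:                       # skip the empty combination
        print(subset, total)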
When you say "list out all the sum of possibility" do you mean you want to know how many combinations are actually possible?
If so, then search on combinations of N items taken K at a time; there are pages on this very site addressing this problem. You would then simply add the number of combinations of (by 5) + (by 4) + (by 3) + (by 2) + (by 1) to get your total "sum of possibilities".
Or do you mean you have an array of values and you literally want to print out the different sums represented by different combinations of the elements? In that case you need to actually enumerate all the combinations and evaluate the sums.
So given an array of { 1, 2, 3, 4, 5 } you can encode this as "A", "B", "C", "D", "E". Examples of tuples would be:
ABCDE = 1+2+3+4+5
ABE = 1+2+5
BCE = 2+3+5
etc., where you use the encoded enumeration to select the addends for your sum. Note that deciding whether to allow or disallow duplicates (i.e., whether "DE" is different from "ED") will have a very large effect on your results; in most cases allowing them is probably not what you want.
If you have 3 elements, you may imagine each element placed at a certain position from 1 to 3 (or 0 to 2) and a boolean array representing whether the element is contained in a certain set.
ABC   remark
---   ---------------------
000   no element in the set
001   one element, C
010   one element, B
100   ...
011   two elements, B and C
...
111   all elements contained
Now calculate the number of combinations, which is 2³ in this case, and write a function that maps a binary representation to a set (for example, 011 to (B, C)). Then you can easily program a loop that iterates from 0 to max-1 and returns all the sets produced by your mapping function.
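A short Python sketch of that bitmask loop (again easy to translate to C++); bit i of mask says whether element i is in the set:

def print_all_sums(values):
    n = len(values)
    for mask in range(1, 2 ** n):    # skip 0, the empty set
        subset = [values[i] for i in range(n) if mask & (1 << i)]
        print(subset, sum(subset))

print_all_sums([1, 2, 3, 4, 5])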