The mean of the means of the r-combinations of X1...Xn is the mean of X1...Xn

I have X1...X6. I have taken all the combinations of two. For each of those sub-samples I have taken the mean, and then the mean of all of those means:
[(X1+X2)/2 + ... +(X5+X6)/2]/15, where 15 is the total number of combinations.
Now the mean of all of those sub-sample means is equal to the overall mean:
(X1+X2+X3+X4+X5+X6)/6.
I am asking for some help to either PROVE this (as a generalization) or to explain why it happens, because even if I increase the combination size, for example combinations of 6 taken 3 or 4 at a time, the result is the same.
Thank you

OK, here's a quick page of scribbles showing that, no matter how many items you have, if you take the mean of every combination of 2 items and then take the mean of those means, you will always get the mean of the original numbers.
Explanation...
I work out what the number of combinations is first. For later use.
Then it's just a matter of simplifying the calculation.
Each number is used n-1 times. For X1 this is obvious. X2 is used n-2 times in the later pairs, but also once in the pair with X1. (This bit is a bit harder with r > 2.)
At the end I substitute in the actual values for the number of combinations.
This then cancels out to give the sum of all the numbers over n, which is the mean.
The next step is to show this for all values r but that shouldn't be too hard.
Substituting r instead of 2, I found that each number is used (n-1) choose (r-1) times.
But then I was getting the wrong cancellation out of it.
I know where I went wrong... I made a mistake evaluating (n-1) choose (r-1).
With the correct formula the answer falls out to S/n.
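Here is a quick numeric check of the general claim in Python (the sample values are arbitrary), with the counting argument from the scribbles carried along in the comments:

    from itertools import combinations
    from statistics import mean

    xs = [3, 1, 4, 1, 5, 9]          # any sample values
    n = len(xs)

    for r in range(1, n + 1):
        # mean of the means of all r-combinations
        mm = mean(mean(c) for c in combinations(xs, r))
        # each x_i appears in C(n-1, r-1) of the C(n, r) combinations,
        # so the sum of the means is C(n-1, r-1) * S / r, and since
        # C(n-1, r-1) / (r * C(n, r)) = 1/n, the mean of means is S/n.
        assert abs(mm - mean(xs)) < 1e-9

    print("mean of means == overall mean for every r")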

Bit Manipulation: Harder Flipping Coins

Recently, I saw this problem from CodeChef titled 'Flipping Coins' (Link: FLIPCOINS).
In summary, there are N coins and we must write a program that supports two operations:
Flip every coin in the range [A, B].
Find the number of heads in the range [A, B].
Of course, we can quickly use a segment tree (range query, range updates using lazy propagation) to solve this.
However, I faced another, similar problem where, after a series of flips (operation 1), we are required to output the resulting permutation of the coins (e.g. 100101, where 0 represents heads and 1 represents tails).
More specifically, operation 2 changes from counting the number of heads to producing the resulting state of all N coins. Also, the new operation 2 is only called after all the flips have been done (i.e. operation 2 is the last operation and is called exactly once).
How does one solve this? It requires some form of bit manipulation, according to the problem tags.
Edit
I attempted brute-forcing through all the queries, and alas, it yielded Time Limit Exceeded.
Printing out the state of the coins can be done using a Binary-indexed tree:
Initially all values are 0.
When we need to flip coins [A, B], we increment the value at index A by 1 and decrement the value at index B + 1 by 1.
The state of coin i is then the prefix sum at i modulo 2.
This works because the prefix sum at i is always the number of flip operations done at i.
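Since the new operation 2 is issued exactly once at the very end, a plain difference array already gives the same prefix-sum trick without needing a Fenwick tree. A minimal sketch in Python (the function name and the 0-based inclusive ranges are my own choices):

    def final_coin_states(n, flips):
        # flips: list of (A, B) 0-based inclusive ranges to flip.
        # Returns the final coin string, 0 = heads, 1 = tails.
        diff = [0] * (n + 1)          # difference array, all coins start at 0
        for a, b in flips:
            diff[a] += 1              # a flip starts covering position a...
            diff[b + 1] -= 1          # ...and stops covering after position b
        out, acc = [], 0
        for i in range(n):
            acc += diff[i]            # prefix sum = number of flips covering i
            out.append(str(acc % 2))  # odd number of flips = coin is flipped
        return "".join(out)

    print(final_coin_states(6, [(0, 3), (2, 5)]))  # -> 110011

If count queries were interleaved with the flips, you would put the same +1/-1 updates into a Binary-indexed tree instead, so each prefix sum stays O(log n).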

Good way to detect identical expressions in C++

I am writing a program that solves this puzzle game: some numbers and a goal number are given, and you make the goal number using the n numbers and the operators +, -, *, / and (). For example, given 2,3,5,7 and the goal number 10, the solutions are (2+3)*(7-5)=10, 3*5-(7-2)=10, and so on.
The catch is, if I implement it naively, I will get a bunch of identical solutions, like (2+3)*(7-5)=10 and (3+2)*(7-5)=10, and 3*5-(7-2)=10 and 5*3-(7-2)=10 and 3*5-7+2=10 and 3*5+2-7=10 and so on. So I'd like to detect those identical solutions and prune them.
I'm currently using randomly generated double numbers to detect identical solutions. What I'm doing is basically substituting those random numbers into each solution and checking whether any pair of them evaluates to the same number. I have to perform the detection at every node of my search, so it has to be fast, and I use a hash set for it now.
Now the problem is the error that comes with the calculation. Because even identical solutions do not evaluate to exactly the same value, I currently round the calculated value to a fixed precision before storing it in the hash set. However, this does not work well enough, and it gives a different number of solutions every time for the same problem. Sometimes the random numbers are bad and prune some completely different solutions; sometimes the calculated value lies on the edge of the rounding function and it outputs two (or more) identical solutions. Is there a better way to do this?
EDIT:
By "identical" I mean two or more solutions (f(w,x,y,z,...) and g(w,x,y,z,...)) that evaluate to the same number whatever the original numbers (w,x,y,z,...) are. For more examples, 4/3*1/2 and 1*4/3/2 and (1/2)/(3/4) are identical, but 4/3/1/2 and 4/(3*1)/2 are not, because if you change 1 to some other number they will not produce the same result.
It will be easier if you "canonicalize" the expressions before comparing them. One way would be to sort the operands whenever an operation is commutative, so 3+2 becomes 2+3 whereas 2+3 remains as it was. Of course you will need to establish an ordering for parenthesized groups as well, like 3+(2*1)... does that become (1*2)+3 or 3+(1*2)? What the ordering is doesn't necessarily matter, so long as it is a total ordering.
Generate all possibilities of your expressions. Then...
When you create expressions, put them in a collection of parsed trees (this would also eliminate your parentheses). Then "push down" any division and subtraction into the leaf nodes so that all the non-leaf nodes have * and +. Apply a sorting of the branches (e.g. a regular string sort) and then compare the trees to see if they are identical.
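A minimal sketch of just the sorting step, in Python for brevity (the tuple representation and the canon name are mine; pushing - and / down to the leaves, as described above, would be the next step):

    # An expression is a leaf (number or variable name) or ('op', left, right).
    COMMUTATIVE = {'+', '*'}

    def canon(e):
        # Return a canonical string key: operands of commutative operators
        # are sorted, so ('+', 3, 2) and ('+', 2, 3) get the same key.
        if not isinstance(e, tuple):
            return str(e)                   # leaf
        op, l, r = e
        l, r = canon(l), canon(r)
        if op in COMMUTATIVE and r < l:
            l, r = r, l                     # any total order works; string order here
        return "(" + l + op + r + ")"

    assert canon(('+', 3, 2)) == canon(('+', 2, 3))
    assert canon(('*', ('+', 3, 2), 5)) == canon(('*', 5, ('+', 2, 3)))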
I like the idea of using doubles. The problem is in the rounding. Why not use a container SORTED by the value obtained with one random set of double inputs? When you find the place where you would insert into that container, you can look at the immediately preceding and following items. Use a different set of random doubles to recompute each candidate for a more robust comparison. Then you can have a reasonable cutoff for "close enough to be equal" without arbitrary rounding.
If a pair of expressions is close enough to equal on both the main set of random numbers and the second set, the expressions are safely "the same" and the newer one is discarded. If they are close enough to equal on the main set but not the second set, you have a rare problem that probably requires rekeying the entire container with a different random number set. If they are not close enough on either, then they are different.
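A sketch of that idea in Python (the class name, the eps cutoff and the callable-expression representation are all mine; the rare rekeying case is only flagged in a comment, not handled):

    import bisect, random

    class ExprDeduper:
        # Keep expressions keyed by their value on one random input set;
        # confirm near-collisions with a second, independent input set.
        def __init__(self, n_vars, eps=1e-6):
            self.key1 = [random.uniform(1, 2) for _ in range(n_vars)]
            self.key2 = [random.uniform(1, 2) for _ in range(n_vars)]
            self.items = []                 # sorted list of (value, expr)
            self.eps = eps

        def add(self, expr):                # expr: callable on a list of inputs
            v1 = expr(self.key1)
            i = bisect.bisect_left(self.items, (v1,))
            # in a sorted container, only the immediate neighbours can collide
            for j in (i - 1, i):
                if 0 <= j < len(self.items) and abs(self.items[j][0] - v1) < self.eps:
                    other = self.items[j][1]
                    if abs(other(self.key2) - expr(self.key2)) < self.eps:
                        return False        # equal on both keys: duplicate
                    # close on key1 but not key2: the rare case above, which
                    # would call for rekeying the whole container
            self.items.insert(i, (v1, expr))
            return True                     # genuinely new expression

    d = ExprDeduper(4)
    print(d.add(lambda x: (x[0] + x[1]) * (x[3] - x[2])))   # True
    print(d.add(lambda x: (x[1] + x[0]) * (x[3] - x[2])))   # False (identical)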
For the larger n suggested by one of your recent comments, I think you would need the better performance that should be possible with a canonical-by-construction method (or maybe an "almost" canonical-by-construction method), rather than a primarily comparison-based approach.
You don't want to construct an incredibly large number of expressions, then canonicalize and compare.
Define a doubly recursive function can(...) that takes as input:
A reference to a canonical expression tree.
A reference to one subexpression of that tree.
A count N of inputs to be injected.
A set of flags for prohibiting some injections.
A leaf function to call.
If N is zero, can just calls the leaf function. If N is nonzero, can patches the subtree in every possible way that produces a canonical tree with N injected variables, calling the leaf function for each, and then restores the tree, undoing each part of the patch as it finishes with it, so we never need massive copying.
Say X is the subtree and K is a leaf representing variable N-1. First, can would temporarily replace the subtree, one at a time, with subtrees representing some of (X)+K, (X)-K, (X)*K, (X)/K and K/(X); the flags and some other rules would cause some of those to be skipped. For each one not skipped, it recursively calls itself with the whole tree as both top and sub, with N-1, and with no flags.
Next it drills into the two children of X and recursively calls itself with each child as the subtree, with N, and with the appropriate flags.
The outer code just calls can with a single-node tree representing variable N-1 of the original N, passing N-1.
In discussion, it is easier to name the inputs forward, so A is input N-1 and B is input N-2 etc.
When we drill into X and see it is Y+Z or Y-Z, we don't want to add or subtract K from Y or Z, because those are redundant with X+K or X-K. So we pass a flag that suppresses direct add or subtract.
Similarly, when we drill into X and see it is Y*Z or Y/Z, we don't want to multiply or divide either Y or Z by K, because that is redundant with multiplying or dividing X by K.
Some cases for further clarification:
(A/C)/B and A/(B*C) are easily non-canonical, because we prefer (A/B)/C, and so when distributing C into (A/B) we forbid direct multiplication or division.
I think it takes just a bit more effort to allow C/(A*B) while rejecting C/(A/B), which was already covered by (B/A)*C.
It is easier if negation is inherently non-canonical, so level 1 is just A and does not include -A. Then, if the whole expression yields the negative of the target value, we negate the whole expression. Otherwise we never visit the negative of a canonical expression:
Given X, we might visit (X)+K, (X)-K, (X)*K, (X)/K and K/(X), and we might drill down into the parts of X, passing flags which suppress some of the above cases for the parts:
If X is a + or - suppress + or - in its direct parts. If X is a * or / suppress * or / in its direct parts.
But if X is a / we also suppress K/(X) before drilling into X.
Since you are dealing with integers, I'd focus on getting an exact result.
Claim: Suppose there is some f(a_1, ..., a_n) = x where the a_i and x are your integer input numbers and f(a_1, ..., a_n) represents any function of your desired form. Then clearly f(a_i) - x = 0. I claim we can construct a different function g with g(x, a_1, ..., a_n) = 0 for the exact same x, where g only uses ()s, +, - and * (no division).
I'll prove that below. Consequently, you could construct g and verify g(x, a_1, ..., a_n) = 0 using integer arithmetic only.
Example:
Suppose we have a_i = i for i = 1, ..., 4 and f(a_i) = a_4 / (a_2 - (a_3 / a_1)) (which so far contains divisions). This is how I would simplify:
0 = a_4 / (a_2 - (a_3 / a_1) ) - x | * (a_2 - (a_3 / a_1) )
0 = a_4 - x * (a_2 - (a_3 / a_1) ) | * a_1
0 = a_4 * a_1 - x * (a_2 * a_1 - (a_3) )
In this form, you can verify your equality for some given integer x using integer operations only.
Proof:
There is some g(x, a_i) := f(a_i) - x which is equivalent to f. Consider any equivalent g with as few divisions as possible. Assume there is at least one (otherwise we are done). Assume that within g we divide by h(x, a_i) (any function of your form, which may itself contain divisions). Then (g*h)(x, a_i) := g(x, a_i) * h(x, a_i) has the same roots as g (multiplying at a root, i.e. an (x, a_i) where g(x, a_i) = 0, preserves all roots). But on the other hand, g*h contains one division fewer. That is a contradiction (g was chosen with the minimum number of divisions), which is why g doesn't contain any division.
I've updated the example to visualize the strategy.
Update: This works well on rational input numbers (each of which represents a single division p/q), which should cover your case; other input can't really be provided by humans anyway.
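A quick check of the worked example using Python's exact rationals:

    from fractions import Fraction as Fr

    a1, a2, a3, a4 = 1, 2, 3, 4
    x = Fr(a4) / (a2 - Fr(a3) / a1)               # f(a_i) = 4 / (2 - 3/1) = -4

    assert Fr(a4) / (a2 - Fr(a3) / a1) - x == 0   # original form, with divisions
    assert a4 * a1 - x * (a2 * a1 - a3) == 0      # multiplied out: integers only
    # in the real search, x is the integer goal number, so the last
    # line involves no fractions at all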
What are you doing to find / test f's? I'd guess some form of dynamic programming will be fast in practice.

smith waterman algorithm choose more than one alignment

I want to align a small sequence S1 to another, larger nucleotide sequence S2, for example:
S1: acgtgt
S2: ttcgtgacagt...
In this example S1 hits in 2 places in S2: cgtg and acgt (the second with a gap in S2). I want to use the Smith-Waterman algorithm, but my question is: in case the 2 alignments have 2 different scores, i.e. one 4 and the other 3, how do I get the 2 alignments from the dynamic programming matrix? Is there any tool or library that does this already? I tried pairwise2 from Biopython and it only gives the alignments with the highest score in the matrix.
Pairwise alignment algorithms such as Smith-Waterman will only provide the one best alignment. A worse alignment will have a different traceback walk that will not be followed by the Dynamic Programming algorithm Smith-Waterman uses.
If there are multiple alignments with the same best score, S-W will choose only one of them (which one is implementation-specific; it doesn't really matter, since they all have the same score).
If you really, really want multiple alignments returned AND want to use something like Smith-Waterman, you will have to re-align the sequences multiple times, each time configuring the gap penalties differently. I do not recommend this, since it will be very expensive.
Instead of using Smith-Waterman, you may want to try something like BLAST, which will give you multiple hits.
See the section "Repeated matches" in Durbin et al., Biological Sequence Analysis:
Let us assume that we are only interested in matches scoring higher than some threshold T. This will be true in general, because there are always short local alignments with small positive scores even between entirely unrelated sequences. Let y be the sequence containing the domain or motif, and x be the sequence in which we are looking for multiple matches.

An example of the repeat algorithm is given in Figure 2.7. We again use the matrix F, but the recurrence is now different, as is the meaning of F(i, j). In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched. We will talk about the score of a completed match region as being its standard gapped alignment score minus the threshold T. All these match scores will be positive.

F(i, j) for j ≥ 1 is now the best sum of match scores to x1...i, assuming that xi is in a matched region, and the corresponding match ends in xi and yj (they may not actually be aligned, if this is a gapped section of the match). F(i, 0) is the best sum of completed match scores to the subsequence x1...i, i.e. assuming that xi is in an unmatched region.

To achieve the desired goal, we start by initialising F(0, 0) = 0 as usual, and then fill the matrix using the following recurrence relations:

F(i, 0) = max{ F(i-1, 0); F(i-1, j) - T, j = 1, ..., m }        (2.11)

F(i, j) = max{ F(i, 0); F(i-1, j-1) + s(x_i, y_j); F(i-1, j) - d; F(i, j-1) - d }        (2.12)
Equation (2.11) handles unmatched regions and ends of matches, only allowing matches to end when they have score at least T. Equation (2.12) handles starts of matches and extensions. The total score of all the matches is obtained by adding an extra cell to the matrix, F(n + 1, 0), using (2.11). This score will have T subtracted for each match; if there were no matches of score greater than T it will be 0, obtained by repeated application of the first option in (2.11).

The individual match alignments can be obtained by tracing back from cell (n + 1, 0) to (0, 0), at each point going back to the cell that was the source of the score in the current cell in the max() operation. This traceback procedure is a global procedure, showing what each residue in x will be aligned to. The resulting global alignment will contain sections of more conventional gapped local alignments of subsequences of x to subsequences of y.
Note that the algorithm obtains all the local matches in one pass. It finds the maximal scoring set of matches, in the sense of maximising the combined total of the excess of each match score above the threshold T. Changing the value of T changes what the algorithm finds. Increasing T may exclude matches. Decreasing it may split them, as well as finding new weaker ones. A locally optimal match in the sense of the preceding section will be split into pieces if it contains internal subalignments scoring less than −T. However, this may be what is wanted: given two similar high scoring sections significant in their own right, separated by a non-matching section with a strongly negative score, it is not clear whether it is preferable to report one match or two.
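Here is a compact Python sketch of those two recurrences as I read them (the scoring function, linear gap penalty d and threshold T are parameters; it returns only the total F(n + 1, 0), without the traceback):

    def repeated_matches(x, y, score, d, T):
        # Durbin et al., eqs (2.11)/(2.12): total score of all matches above T.
        n, m = len(x), len(y)
        NEG = float("-inf")
        F = [[NEG] * (m + 1) for _ in range(n + 2)]
        F[0][0] = 0.0
        for i in range(1, n + 2):               # extra row n+1 banks the last match
            F[i][0] = max(F[i - 1][0],
                          max((F[i - 1][j] - T for j in range(1, m + 1)),
                              default=NEG))     # (2.11)
            if i == n + 1:
                break                           # only F(n+1, 0) is needed
            for j in range(1, m + 1):
                F[i][j] = max(F[i][0],          # start a new match region
                              F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                              F[i - 1][j] - d,  # gap in y
                              F[i][j - 1] - d)  # gap in x   (2.12)
        return F[n + 1][0]

    match = lambda a, b: 2 if a == b else -1
    print(repeated_matches("ttcgtgacagt", "acgtgt", match, d=2, T=3))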
All possible alignments that conform to the scoring in the substitution matrix are represented in the traceback matrix T; it's just that some implementations might not give you access to T.
To extract multiple alignments, you'll first need to look at the scoring matrix H and choose which scores you want to trace back from; for example, you might look at the highest 10 scores. The values in the matrix T will tell you the route to trace back. Keep going until the corresponding score in H is zero.
Be careful though: the 10 highest scores might all be part of the same alignment, in which case you'd just get results that are subsequences of another result. To avoid this, it's probably best to trace back the highest-scoring alignment first, and then look for high values in cells that are not passed through by the first alignment.

Number contained in an odd number of sets

I have a homework problem which I can only solve with O(max(F)*N) complexity (N is about 10^5 and F is about 10^9), and I hope you can help me. I am given N sets of 4 integers (named S, F, a and b). Each quadruple describes a set of numbers in this way: the first a successive numbers, starting from S inclusive, are in the set; the next b successive numbers are not; then the next a numbers are, and so on, repeating until you reach the upper limit F. For example, for S=5, F=50, a=1, b=19 the set contains (5, 25, 45); for S=1, F=10, a=2, b=1 the set contains (1, 2, 4, 5, 7, 8, 10).
I need to find the integer which is contained in an odd number of sets. It is guaranteed that for the given test there is ONLY 1 number which satisfies this condition.
I tried to go through every number between min(S) and max(F) and check how many sets it is included in; if it is included in an odd number of sets, then it is the answer. As I said, this way I get O(F*N), which is too much, and I have no other idea how I could tell whether a number is in an odd number of sets.
If you could help me I would be really grateful. Thank you in advance, and sorry for my bad English and explanation!
Hint
I would be tempted to use bisection.
Choose a value x, then count how many numbers <= x are present, summed over all the sets.
If this count is odd then the answer is <= x, otherwise it is > x.
This should take time O(N log F).
Alternative explanation
Suppose we have sets
[S=1,F=8,a=2,b=1]->(1,2,4,5,7,8)
[S=1,F=7,a=1,b=0]->(1,2,3,4,5,6,7)
[S=6,F=8,a=1,b=1]->(6,8)
Then we can table:
N(y) = number of times y is included in a set,
C(z) = sum(N(y) for y in range(1, z + 1)) % 2
y  N(y)  C(y)
1   2     0
2   2     0
3   1     1
4   2     1
5   2     1
6   2     1
7   2     1
8   2     1
And then we use bisection to find the first place where C(z) becomes 1.
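In Python, the whole thing might look like this (names are mine; count_upto counts the members of one (S, F, a, b) set that are <= x in O(1), so the search is O(N log F) overall):

    def count_upto(S, F, a, b, x):
        # how many members of the set described by (S, F, a, b) are <= x
        x = min(x, F)
        if x < S:
            return 0
        full, rem = divmod(x - S + 1, a + b)    # whole (a on, b off) blocks
        return full * a + min(rem, a)           # plus a partial leading block

    def find_odd_number(quads, lo, hi):
        # bisect for the unique value occurring in an odd number of sets
        parity = lambda x: sum(count_upto(*q, x) for q in quads) % 2
        while lo < hi:
            mid = (lo + hi) // 2
            if parity(mid) == 1:
                hi = mid                        # answer is <= mid
            else:
                lo = mid + 1                    # answer is > mid
        return lo

    quads = [(1, 8, 2, 1), (1, 7, 1, 0), (6, 8, 1, 1)]
    print(find_odd_number(quads, 1, 8))         # -> 3, matching the table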
Seems like it'd be useful to find a way to perform set operations, particularly intersection, on these sets without having to generate the actual sets. If you could do that, the intersection of all these sets in the test should leave you with just one number. Leaving the a and b part aside, it's easy to see how you'd take the intersection of two sets that include all integers between S and F: the intersection is just the set with S=max(S1, S2) and F=min(F1, F2).
That gives you a starting point; now you have to figure out how to create the intersection of two sets considering a and b.
XOR to the rescue.
Take the numbers from each successive set and XOR them with the contents of the result set. I.e., if the number is currently marked as "present", change that to "not present", and vice versa.
At the end, you'll have one number marked as present in the result set, which will be the one that occurred an odd number of times. All of the others will have been XORed an even number of times, so they'll be back to the original state.
As for complexity, you're dealing with each input item exactly once, so it's basically linear on the total number of input items -- at least assuming your operations on the result set are constant complexity. At least if I understand how they're phrasing things, that seems to meet the requirement.
It sounds like S is assumed to be non-negative. Given your desired O(max(F)*N) time bound, you can use a sieve-like approach.
Have an array of integers with an entry for each candidate number (that is, every number between min(S) and max(F)). Go through all the quadruples and add 1 to every array location associated with a number included in that quadruple. At the end, look through the array to see which count is odd. The number it represents is the number that satisfies your conditions.
This works because you're going through N quadruples, and each one takes O(max(F)) or less time (assuming S is always non-negative) to mark its included numbers. That gives you O(max(F)*N).
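A direct Python rendering of that sieve (names mine), run on the example sets from the bisection answer above:

    def find_odd_number_sieve(quads):
        # O(max(F) * N) time, O(max(F)) memory
        lo = min(q[0] for q in quads)
        hi = max(q[1] for q in quads)
        count = [0] * (hi - lo + 1)
        for S, F, a, b in quads:
            v = S
            while v <= F:
                for u in range(v, min(v + a - 1, F) + 1):
                    count[u - lo] += 1          # mark the a included numbers
                v += a + b                      # then skip the b excluded ones
        for i, c in enumerate(count):
            if c % 2 == 1:
                return lo + i                   # the unique odd-count number

    print(find_odd_number_sieve([(1, 8, 2, 1), (1, 7, 1, 0), (6, 8, 1, 1)]))  # -> 3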

Fastest way to find sum of digits on big numbers

I have some big numbers (again) and I need to find whether the sum of the digits is an even number.
I tried this: finding the sum of the digits with a while loop and then checking if that sum % 2 equals 0. It works, but it's too slow for big numbers, because I am given intervals of numbers, and if the input is 1999999 19999999999 then my program cannot complete within the time limit, which is 0.1 sec.
What to do? Is there any other, faster way to do this?
EDIT: The input 1999999 19999999999 means it will start with 1999999 and check every number, as I wrote above, up to 19999999999, and because we are talking about big numbers (> 2^30) my program is not up to it.
You don't need to sum the digits. Think about it. The sum starts at zero, which is generally regarded as even (although you can special-case this if you want).
Each even digit changes nothing. If the sum was odd, it stays odd, if it was even it stays even.
Each odd digit changes the sum from even to odd, or odd to even.
So, just count the number of odd digits. If the number is even, then the sum of all the digits is even. If the number is odd, then the sum of all the digits is odd.
Now, you only need to do this for the FIRST number in your range. What you need to do next is figure out how the evenness or oddness of the numbers change as you keep adding one.
I leave this as an exercise for the reader. Homework has to involve some work!
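For the first half, counting odd digits is enough. A small Python sketch (the incremental-update half stays as the exercise):

    def digit_sum_is_even(n):
        # parity of the digit sum == parity of the count of odd digits
        odd_digits = 0
        while n:
            odd_digits += (n % 10) & 1
            n //= 10
        return odd_digits % 2 == 0

    assert digit_sum_is_even(48) is True         # 4 + 8 = 12
    assert digit_sum_is_even(1999999) is False   # 1 + 6*9 = 55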
Hint: if you find that the sum of the digits of a given number n is odd, will the sum of the digits of the number n + 1 be odd or even?
Update: as @Mark pointed out, it is not so simple... The anomalies appear only when n + 1 is a multiple of 10, i.e. (n + 1) % 10 == 0; then the oddity does not change. However, among those cases, every 10th is an exception where the oddity does change after all (e.g. 199 -> 200). And so on... Basically, depending on how many trailing 9s n has, one can decide whether or not the oddity changes between n and n + 1. I admit it is a bit tedious to work out, but I am sure it is still faster than adding up all those digits...
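Making that precise: if n ends in k trailing 9s, then digitsum(n + 1) = digitsum(n) - 9k + 1, so the parity flips exactly when k is even (k = 0 being the common case). A small check in Python:

    def parity_flips(n):
        # does the digit-sum parity change from n to n + 1?
        k = 0
        while n % 10 == 9:
            k += 1                # count trailing 9s
            n //= 10
        return k % 2 == 0         # flips iff k is even

    assert parity_flips(199) is True    # 199 -> 200: digit sums 19 -> 2
    assert parity_flips(19) is False    # 19 -> 20: digit sums 10 -> 2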
Here is a hint that may work: you don't need to sum the digits, you just need to know whether the result will be odd or even. If you start with the assumption that your total is even, even digits have no effect, and odd digits toggle it (i.e. an odd number of odd digits makes the sum odd).
Depending on the language there may be a faster way to perform the calculation without adding.
Also remember -- a number is odd or even based on its last binary digit.
Example:
In ASM you could XOR the low order bit to get the correct result
In FORTH this would not work so well...