Smith-Waterman algorithm: choose more than one alignment - python-2.7

I want to align a small sequence S1 to another, larger nucleotide sequence S2. For example:
S1: acgtgt
S2: ttcgtgacagt...
In this example S1 hits in two places in S2: cgtg, and acgt with a gap in S2 for the second. I want to use the Smith-Waterman algorithm, but my question is: if the two alignments have two different scores, e.g. one 4 and the other 3, how do I get both alignments from the dynamic programming matrix? Is there any tool or library that does this already? I tried pairwise2 from Biopython and it only gives the alignments with the highest score in the matrix.

Pairwise alignment algorithms such as Smith-Waterman will only provide the single best alignment. A suboptimal alignment has a different traceback path, which the dynamic programming procedure Smith-Waterman uses will not follow.
If there are multiple alignments with the same best score, S-W will report only one of them (which one is implementation-specific; it doesn't really matter, since they have the same score).
If you really want multiple alignments returned AND want to use something like Smith-Waterman, you would have to re-align the sequences multiple times, configuring the gap penalties differently each time. I do not recommend this, since it would be very expensive.
Instead of using Smith-Waterman, you may want to try something like BLAST, which will give you multiple hits.
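For what it's worth, Biopython's pairwise2 behaves the same way: it only returns the co-optimal local alignments, never lower-scoring ones. A quick illustration (the scoring values here are arbitrary example numbers):

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

s1 = "acgtgt"
s2 = "ttcgtgacagt"
# match=2, mismatch=-1, gap open=-2, gap extend=-1 (arbitrary example values)
alignments = pairwise2.align.localms(s1, s2, 2, -1, -2, -1)
for aln in alignments:
    print(format_alignment(*aln))   # every alignment printed shares the same (best) score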

See the section "Repeated matches" in Durbin et al., Biological Sequence Analysis:
Let us assume that we are only interested in matches scoring higher than some threshold T. This will be true in general, because there are always short local alignments with small positive scores even between entirely unrelated sequences. Let y be the sequence containing the domain or motif, and x be the sequence in which we are looking for multiple matches.
An example of the repeat algorithm is given in Figure 2.7. We again use the matrix F, but the recurrence is now different, as is the meaning of F(i, j). In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched. We will talk about the score of a completed match region as being its standard gapped alignment score minus the threshold T. All these match scores will be positive.
F(i, j) for j ≥ 1 is now the best sum of match scores to x1...i, assuming that xi is in a matched region, and the corresponding match ends in xi and yj (they may not actually be aligned, if this is a gapped section of the match). F(i, 0) is the best sum of completed match scores to the subsequence x1...i, i.e. assuming that xi is in an unmatched region.
To achieve the desired goal, we start by initialising F(0, 0) = 0 as usual, and then fill the matrix using the following recurrence relations:
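(The recurrence relations themselves did not survive the copy-paste; from memory, and worth double-checking against the book, they are roughly:

F(i, 0) = max { F(i-1, 0),   max over j = 1..m of [ F(i-1, j) - T ] }                        (2.11)

F(i, j) = max { F(i, 0),   F(i-1, j-1) + s(x_i, y_j),   F(i-1, j) - d,   F(i, j-1) - d }      (2.12)

where s(a, b) is the substitution score and d is the linear gap penalty.)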
Equation (2.11) handles unmatched regions and ends of matches, only allowing matches to end when they have score at least T. Equation (2.12) handles starts of matches and extensions. The total score of all the matches is obtained by adding an extra cell to the matrix, F(n+1, 0), using (2.11). This score will have T subtracted for each match; if there were no matches of score greater than T it will be 0, obtained by repeated application of the first option in (2.11).
The individual match alignments can be obtained by tracing back from cell (n+1, 0) to (0, 0), at each point going back to the cell that was the source of the score in the current cell in the max() operation. This traceback procedure is a global procedure, showing what each residue in x will be aligned to. The resulting global alignment will contain sections of more conventional gapped local alignments of subsequences of x to subsequences of y.
Note that the algorithm obtains all the local matches in one pass. It finds the maximal scoring set of matches, in the sense of maximising the combined total of the excess of each match score above the threshold T. Changing the value of T changes what the algorithm finds. Increasing T may exclude matches. Decreasing it may split them, as well as finding new weaker ones. A locally optimal match in the sense of the preceding section will be split into pieces if it contains internal subalignments scoring less than −T. However, this may be what is wanted: given two similar high scoring sections significant in their own right, separated by a non-matching section with a strongly negative score, it is not clear whether it is preferable to report one match or two.
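Here is a minimal Python sketch of that recurrence. The parameter names, the simple match/mismatch scoring and the traceback bookkeeping are mine, not from the book, so treat it as a starting point rather than a reference implementation.

# x is the long sequence being scanned, y is the motif, T is the threshold
def repeated_matches(x, y, T=1.0, match=2.0, mismatch=-1.0, gap=2.0):
    n, m = len(x), len(y)
    NEG = float("-inf")
    def s(a, b):
        return match if a == b else mismatch
    F = [[NEG] * (m + 1) for _ in range(n + 2)]   # rows 0..n+1, columns 0..m
    F[0][0] = 0.0
    ptr = {}                                      # (i, j) -> predecessor cell
    for i in range(1, n + 2):
        # (2.11): stay unmatched, or end a match scoring at least T
        best, arg = F[i - 1][0], (i - 1, 0)
        for j in range(1, m + 1):
            if F[i - 1][j] - T > best:
                best, arg = F[i - 1][j] - T, (i - 1, j)
        F[i][0], ptr[(i, 0)] = best, arg
        if i == n + 1:                            # the extra cell F(n+1, 0) only needs (2.11)
            break
        # (2.12): start a match, extend on the diagonal, or open/extend a gap
        for j in range(1, m + 1):
            options = [(F[i][0], (i, 0)),
                       (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), (i - 1, j - 1)),
                       (F[i - 1][j] - gap, (i - 1, j)),
                       (F[i][j - 1] - gap, (i, j - 1))]
            F[i][j], ptr[(i, j)] = max(options)
    # trace back from (n+1, 0) to (0, 0), cutting the path into match regions
    cell, matches, current = (n + 1, 0), [], []
    while cell != (0, 0):
        if cell[1] >= 1:
            current.append(cell)                  # a DP cell inside a match region
        elif current:
            matches.append(list(reversed(current)))
            current = []
        cell = ptr[cell]
    if current:
        matches.append(list(reversed(current)))
    return F[n + 1][0], list(reversed(matches))

print(repeated_matches("ttcgtgacagt", "acgtgt", T=1.0))

Each reported match is the list of (position in x, position in y) DP cells on its stretch of the traceback path, so both hit regions from the example above come out of a single pass.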

All possible alignments that conform to the scoring in the substitution matrix are represented in the traceback matrix T - it's just that some implementations might not give you access to T.
To extract multiple alignments, you'll first need to look at the scoring matrix H and choose which scores you want to trace back from - for example, you might look at the highest 10 scores. The values in the matrix T will tell you the route to trace back. Keep going until the corresponding score in H is zero.
Be careful though - the 10 highest scores might all be part of the same alignment, in which case you'd just get results that are subsequences of another result. To avoid this, it's probably best to trace back the highest-scoring alignment first, and then look for high values in cells that are not passed through by the first alignment.
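A rough sketch of that idea in Python (the scoring values and all names are mine, not from any particular library): build H and a traceback matrix T, then trace back from the highest-scoring cells that are not already covered by a previously reported alignment.

def smith_waterman_all(s1, s2, k=3, match=2, mismatch=-1, gap=2):
    n, m = len(s1), len(s2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[None] * (m + 1) for _ in range(n + 1)]   # 'D', 'U', 'L' or None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            up = H[i - 1][j] - gap
            left = H[i][j - 1] - gap
            best = max(diag, up, left, 0)
            H[i][j] = best
            if best == 0:
                T[i][j] = None
            elif best == diag:
                T[i][j] = 'D'
            elif best == up:
                T[i][j] = 'U'
            else:
                T[i][j] = 'L'
    # all cells, highest score first
    cells = sorted(((H[i][j], i, j) for i in range(1, n + 1)
                    for j in range(1, m + 1)), reverse=True)
    used, results = set(), []
    for score, i, j in cells:
        if score == 0 or len(results) >= k or (i, j) in used:
            continue
        a1, a2, path = [], [], []
        while H[i][j] > 0:
            path.append((i, j))
            if T[i][j] == 'D':
                a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
            elif T[i][j] == 'U':
                a1.append(s1[i - 1]); a2.append('-'); i -= 1
            else:
                a1.append('-'); a2.append(s2[j - 1]); j -= 1
        if used.isdisjoint(path):     # skip cells lying on an earlier alignment's path
            used.update(path)
            results.append((score, ''.join(reversed(a1)), ''.join(reversed(a2))))
    return results

for score, a, b in smith_waterman_all("acgtgt", "ttcgtgacagt"):
    print("%d  %s  %s" % (score, a, b))

The highest-scoring alignment is reported first; lower-scoring hits only get reported if their traceback path does not pass through cells already claimed by a better alignment, which avoids the "subsequence of another result" problem mentioned above.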

Related

An FA that accepts all binary strings with an even number of 0's and a number of 1's that is a multiple of 3

This question is directly from Chapter 1 exercises of Introducing the Theory of Computation by Wayne Goddard (Question 1.17).
Initially I thought of creating two separate DFAs: one ensuring the number of 0's in the input is even, and another ensuring the number of 1's in the input is divisible by 3. However, combining these two separate DFAs into one proved to be a more difficult task than I thought. I would highly appreciate it if anyone could point me in the right direction.
I'm having a difficult time constructing logical steps that ensure we retain the previously acquired information about the 0's and 1's after observing a new symbol.
Your idea of building two different DFAs and combining them together is a good one. You're starting with two DFAs and essentially want to build a single DFA that runs both of them at the same time. This is a great time to use the product construction, which does just that. The idea is to build a new DFA whose states correspond to pairs of states, one from the first DFA and one from the second. If you're familiar with the powerset construction, you'll find this construction pretty easy to pick up; the intuition is pretty similar.
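As a minimal sketch of the product construction for this specific pair of DFAs (the state encoding and the helper name are mine): each product state is the pair (number of 0's mod 2, number of 1's mod 3), and a string is accepted exactly when both components are back at 0.

def accepts(s):
    """Run the product DFA: track (count of 0's mod 2, count of 1's mod 3)."""
    state = (0, 0)                      # start state: both counters at zero
    for ch in s:
        zeros, ones = state
        if ch == '0':
            state = ((zeros + 1) % 2, ones)
        elif ch == '1':
            state = (zeros, (ones + 1) % 3)
        else:
            raise ValueError("alphabet is {0,1}")
    return state == (0, 0)              # accept iff both component DFAs accept

print(accepts(""))       # True  (zero 0's, zero 1's)
print(accepts("00111"))  # True
print(accepts("0001"))   # False (odd number of 0's)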
The language accepted by a finite automaton for an even number of 0's and a number of 1's divisible by 3 is
L = {ε, 00, 111, 0000, 00111, 111111, ...}
The language accepted for an even number of 0's is
L1 = {ε, 00, 0000, ...}
FA1
The language accepted for a number of 1's divisible by 3 is
L2 = {ε, 111, 111111, ...}
FA2
The finite automaton that accepts strings having an even number of 0's and a number of 1's divisible by 3 is shown below:
Transition diagram
The DFA accepts only strings with an even number of 0's and a number of 1's that is a multiple of 3.
Alphabet Σ = {0, 1}.
Language L = {ε, 00, 111, 00111, 11100, ...}
Here q0 is both the start state and the final state.
This DFA takes 2 × 3 = 6 states, one for each combination of (number of 0's mod 2, number of 1's mod 3).
Suppose that we take the input 0001. Then:
q0 reads 0 and goes to q1
q1 reads 0 and goes back to q0
q0 reads 0 again and goes to q1
q1 reads 1 and goes to q2
The run ends in q2, which is not a final state, so 0001 is rejected (it has an odd number of 0's). A string such as 00 ends back in q0 and is accepted.
Here is a helpful transition diagram to better explain this process:
Transition diagram
In this case the considered language is the union of the language accepted by the machine M1 and that accepted by the machine M2. The machines M1 and M2 are given in the solution to Question 1. Using the ε-transition, we propose the non-deterministic finite automaton M3 given in the diagram as the one that accepts L(M1) ∪ L(M2).
image 1

The mean of the means of the combinations of X1...Xn is the mean of the Xn

I have X1...X6. I have taken the combinations two at a time. For each of those sub-samples I have taken the mean, and then the mean of all of those means:
[(X1+X2)/2 + ... + (X5+X6)/2]/15, where 15 is the total number of combinations.
Now the mean of all of those sub-sample means is equal to the overall mean:
(X1+X2+X3+X4+X5+X6)/6.
I am asking for some help in order to either PROVE it (as a generalization), or explain why this happens. Because even if I increase the size of the combinations, for example combinations of 6 taken 3 or 4 at a time, the result is the same.
Thank you
OK, here's a quick page of scribbles that shows that no matter how many items you have, if you take the mean of all combinations of pairs and then take the mean of those means, you will always get the mean of the original items.
Explanation...
I work out the number of combinations first, for later use.
Then it's just a matter of simplifying the calculation.
Each number is used n-1 times. X1 is obvious. X2 is used n-2 times, but is also used once in the sum with X1. (This bit is a bit harder with r > 2.)
At the end I substitute in the actual values for the number of combinations.
This then cancels out to give the sum of all the numbers over n, which is the mean.
The next step is to show this for all values of r, but that shouldn't be too hard.
Substituting r instead of 2, I found that each number is used (n-1) choose (r-1) times.
But then I was getting the wrong cancellation out of it.
I know where I went wrong... I miscalculated (n-1) choose (r-1).
With the correct formula the answer falls out to S/n.
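For the record, here is the general-r calculation written out in full (my own derivation, not the original page of scribbles). Each X_i appears in C(n-1, r-1) of the C(n, r) combinations, so

sum over all combinations of (mean of that combination)
    = (1/r) * C(n-1, r-1) * (X_1 + ... + X_n) = (1/r) * C(n-1, r-1) * S.

Dividing by the number of combinations C(n, r), and using the identity r * C(n, r) = n * C(n-1, r-1):

mean of the means = C(n-1, r-1) * S / (r * C(n, r)) = S / n,

which is exactly the mean of the original numbers.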

Good way to detect identical expressions in C++

I am writing a program that solves this puzzle game: some numbers and a goal number is given, and you make the goal number using the n numbers and operators +, -, *, / and (). For example, given 2,3,5,7 and the goal number 10, the solutions are (2+3)*(7-5)=10, 3*5-(7-2)=10, and so on.
The catch is, if I implement it naively, I will get a bunch of identical solutions, like (2+3)*(7-5)=10 and (3+2)*(7-5)=10, and 3*5-(7-2)=10 and 5*3-(7-2)=10 and 3*5-7+2=10 and 3*5+2-7=10 and so on. So I'd like to detect those identical solutions and prune them.
I'm currently using randomly generated double numbers to detect identical solutions. What I'm doing is basically substituting those random numbers into the solutions and checking whether any pair of them evaluates to the same number. I have to perform the detection at every node of my search, so it has to be fast, and I use a hash set for it now.
Now the problem is the error that comes with the calculation. Because even identical solutions do not evaluate to exactly the same value, I currently round the calculated value to a given precision when storing it in the hash set. However this does not seem to work well enough, and it gives a different number of solutions each time for the same problem. Sometimes the random numbers are bad and prune some completely different solutions. Sometimes the calculated value lies on the edge of the rounding function and it outputs two (or more) identical solutions. Is there a better way to do this?
EDIT:
By "identical" I mean two or more solutions(f(w,x,y,z,...) and g(w,x,y,z,...)) that calculate to the same number whatever the original number(w,x,y,z...) is. For more examples, 4/3*1/2 and 1*4/3/2 and (1/2)/(3/4) are identical, but 4/3/1/2 and 4/(3*1)/2 are not because if you change 1 to some other number they will not produce the same result.
It will be easier if you "canonicalize" the expressions before comparing them. One way would be to sort the operands when an operation is commutative, so 3+2 becomes 2+3, whereas 2+3 remains as it was. Of course you will need to establish an ordering for parenthesized groups as well, like 3+(2*1)... does that become (1*2)+3 or 3+(1*2)? What the ordering is doesn't necessarily matter, so long as it is a total ordering.
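For example, a minimal sketch of that idea (the tuple representation and the function name are mine), treating an expression as a nested (op, left, right) tuple and sorting the operands of the commutative operators:

def canonical(expr):
    """Return a canonical form of a nested (op, left, right) tuple,
    sorting the operands of the commutative operators + and *."""
    if not isinstance(expr, tuple):                     # a leaf: a number or variable name
        return expr
    op, left, right = expr
    left, right = canonical(left), canonical(right)
    if op in ('+', '*') and repr(right) < repr(left):   # any total ordering works
        left, right = right, left
    return (op, left, right)

# (2+3)*(7-5) and (3+2)*(7-5) canonicalize to the same tuple
e1 = ('*', ('+', 2, 3), ('-', 7, 5))
e2 = ('*', ('+', 3, 2), ('-', 7, 5))
print(canonical(e1) == canonical(e2))   # True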
Generate all possibilities of your expressions. Then...
When you create expressions, put them in a collection of parsed trees (this would also eliminate your parentheses). Then "push down" any division and subtraction into the leaf nodes so that all the non-leaf nodes have * and +. Apply a sorting of the branches (e.g. a regular string sort) and then compare the trees to see if they are identical.
I like the idea of using doubles. The problem is in the rounding. Why not use a container SORTED by the value obtained with one random set of double inputs? When you find the place where you would insert into that container, you can look at the immediately preceding and following items. Use a different set of random doubles to recompute each for a more robust comparison. Then you can have a reasonable cutoff for "close enough to be equal" without arbitrary rounding.
If a pair of expressions are close enough to be equal in both the main set of random numbers and the second set, the expressions are safely "the same" and the newer one is discarded. If they are close enough in the main set but not the new set, you have a rare problem that probably requires rekeying the entire container with a different random number set. If they are not close enough in either, then they are different.
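A rough sketch of that scheme (the evaluation hook, the tolerance and the names are mine, and the rare rekeying case is left out): keep a list sorted by each expression's value at one fixed random point, and confirm duplicates at a second point.

import bisect
import random

# two fixed sets of random inputs, reused for every expression
POINT1 = [random.uniform(1, 2) for _ in range(4)]
POINT2 = [random.uniform(1, 2) for _ in range(4)]
TOL = 1e-9

seen = []   # list of (value at POINT1, value at POINT2, expression), kept sorted

def is_new(expr, evaluate):
    """evaluate(expr, inputs) -> float; returns True if expr is not a duplicate."""
    v1, v2 = evaluate(expr, POINT1), evaluate(expr, POINT2)
    i = bisect.bisect_left(seen, (v1 - TOL,))
    while i < len(seen) and seen[i][0] <= v1 + TOL:
        if abs(seen[i][1] - v2) <= TOL:      # agrees at both points: treat as a duplicate
            return False
        i += 1
    bisect.insort(seen, (v1, v2, expr))
    return True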
For the larger n suggested by one of your recent comments, I think you would need the better performance that should be possible from a canonical by construction method (or maybe "almost" canonical by construction) rather than a primarily comparison based approach.
You don't want to construct an incredibly large number of expressions, then canonicalize and compare.
Define a doubly recursive function can(...) that takes as input:
A reference to a canonical expression tree.
A reference to one subexpression of that tree.
A count N of inputs to be injected.
A set of flags for prohibiting some injections.
A leaf function to call.
If N is zero, can just calls the leaf function. If N is nonzero, can patches the subtree in every possible way that produces a canonical tree with N injected variables, calling the leaf function for each one and then restoring the tree; it undoes each part of the patch as it finishes with it, so we never need massive copying.
Say X is the subtree and K is a leaf representing variable N-1. First, can would temporarily replace the subtree, one at a time, with subtrees representing some of (X)+K, (X)-K, (X)*K, (X)/K and K/(X), but both the flags and some other rules would cause some of those to be skipped. For each one not skipped, it recursively calls itself with the whole tree as both top and sub, with N-1, and with no flags.
Next, it drills into the two children of X and calls itself recursively with each child as the subtree, with N, and with the appropriate flags.
The outer driver just calls can with a single-node tree representing variable N-1 of the original N, and passes N-1.
In discussion, it is easier to name the inputs forward, so A is input N-1 and B is input N-2 etc.
When we drill into X and see it is Y+Z or Y-Z we don't want to add or subtract K from Y or Z because those are redundant with X+K or X-K. So we pass a flag that suppresses direct add or subtract.
Similarly, when we drill into X and see it is Y*Z or Y/Z we don't want to multiply or divide either Y or Z by K because that is redundant with multiplying or dividing X by K.
Some cases for further clarification:
(A/C)/B and A/(B*C) are easily non canonical because we prefer (A/B)/C and so when distributing C into (A/B) we forbid direct multiplying or dividing.
I think it takes just a bit more effort to allow C/(A*B) while rejecting C/(A/B) which was covered by (B/A)*C.
It is easier if negation is inherently non-canonical, so level 1 is just A and does not include -A. Then, if the whole expression yields the negative of the target value, we negate the whole expression. Otherwise we never visit the negative of a canonical expression:
Given X, we might visit (X)+K, (X)-K, (X)*K, (X)/K and K/(X), and we might drill down into the parts of X, passing flags which suppress some of the above cases for the parts:
If X is a + or -, suppress + or - in its direct parts. If X is a * or /, suppress * or / in its direct parts.
But if X is a /, we also suppress K/(X) before drilling into X.
Since you are dealing with integers, I'd focus on getting an exact result.
Claim: Suppose there is some f(a_1, ..., a_n) = x, where the a_i and x are your integer input numbers and f(a_1, ..., a_n) represents any function of your desired form. Then clearly f(a_1, ..., a_n) - x = 0. I claim we can construct a different function g with g(x, a_1, ..., a_n) = 0 for the exact same x, where g only uses ()s, +, - and * (no division).
I'll prove that below. Consequently you could construct g and evaluate g(x, a_1, ..., a_n) = 0 on integers only.
Example:
Suppose we have a_i = i for i = 1, ..., 4 and f(a_1, ..., a_4) = a_4 / (a_2 - (a_3 / a_1)) (which contains divisions so far). This is how I would like to simplify:
0 = a_4 / (a_2 - (a_3 / a_1) ) - x | * (a_2 - (a_3 / a_1) )
0 = a_4 - x * (a_2 - (a_3 / a_1) ) | * a_1
0 = a_4 * a_1 - x * (a_2 * a_1 - (a_3) )
In this form, you can verify your equality for some given integer x using integer operations only.
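For this worked example, x = 4 / (2 - 3/1) = -4, so the divide-free check can be done with plain integers (a tiny illustration of the last line above, not part of the original answer):

a1, a2, a3, a4 = 1, 2, 3, 4
x = -4                                     # the value f evaluates to
print(a4 * a1 - x * (a2 * a1 - a3) == 0)   # True, verified without any division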
Proof:
There is some g(x, a_i) := f(a_i) - x which is equivalent to f. Consider an equivalent g with as few divisions as possible. Assume there is at least one (otherwise we are done). Assume that within g we divide by some h(x, a_i) (any function of your form; it may itself contain divisions). Then (g*h)(x, a_i) := g(x, a_i) * h(x, a_i) has the same roots as g (multiplying by h preserves every root, i.e. every (x, a_i) where g(x, a_i) = 0). But on the other hand, g*h contains one division fewer. That is a contradiction (g was chosen with the minimum number of divisions), which is why g doesn't contain any division.
I've updated the example to visualize the strategy.
Update: This works well on rational input numbers (each of which represents a single division p/q). This should help you; other kinds of input can't really be provided by humans anyway.
What are you doing to find / test f's? I'd guess some form of dynamic programming will be fast in practice.

Number contained in an odd number of sets

I have a homework problem which I can solve only in O(max(F)*N) complexity (N is about 10^5 and F is about 10^9), and I hope you can help me. I am given N sets of 4 integers (named S, F, a and b). Each quadruple describes a set of numbers in this way: the first a successive numbers, starting from S (included), are in the set. The next b successive numbers are not, and then the next a numbers are, repeating this until you reach the upper limit F. For example, for S=5; F=50; a=1; b=19 the set contains (5, 25, 45); for S=1; F=10; a=2; b=1 the set contains (1, 2, 4, 5, 7, 8, 10).
I need to find the integer which is contained in an odd number of sets. It is guaranteed that for the given test there is ONLY one number which respects this condition.
I tried to go through every number between min(S) and max(F) and check in how many sets that number is included; if it is included in an odd number of sets, then that is the answer. As I said, this way I get O(F*N), which is too much, and I have no other idea how I could tell whether a number is in an odd number of sets.
If you could help me I would be really grateful. Thank you in advance, and sorry for my bad English and explanation!
Hint
I would be tempted to use bisection.
Choose a value x, then count how many numbers <= x are present across all the sets (counted with multiplicity).
If this count is odd then the answer is <= x, otherwise it is > x.
This should take time O(N log(F)).
Alternative explanation
Suppose we have sets
[S=1,F=8,a=2,b=1]->(1,2,4,5,7,8)
[S=1,F=7,a=1,b=0]->(1,2,3,4,5,6,7)
[S=6,F=8,a=1,b=1]->(6,8)
Then we can tabulate:
N(y) = the number of sets in which y is included,
C(z) = sum(N(y) for y in range(1, z+1)) % 2
y N(y) C(z)
1 2 0
2 2 0
3 1 1
4 2 1
5 2 1
6 2 1
7 2 1
8 2 1
And then we use bisection to find the first place where C(z) becomes 1.
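A sketch of the whole approach in Python (the function names and the O(1) per-set counting formula are mine, so verify them against your own tests): count in O(1) how many members of one (S, F, a, b) set are <= x, sum that over all sets, and bisect on the parity.

def count_upto(x, S, F, a, b):
    """How many members of the set described by (S, F, a, b) are <= x."""
    hi = min(x, F)
    if hi < S:
        return 0
    length = hi - S + 1                  # size of the prefix [S, hi]
    full, rem = divmod(length, a + b)    # full (a included + b excluded) periods
    return full * a + min(rem, a)

def odd_one(quads):
    """Find the unique number contained in an odd number of sets."""
    def parity(x):
        return sum(count_upto(x, *q) for q in quads) % 2
    lo = min(q[0] for q in quads)
    hi = max(q[1] for q in quads)
    while lo < hi:                       # first x where the prefix parity is odd
        mid = (lo + hi) // 2
        if parity(mid) == 1:
            hi = mid
        else:
            lo = mid + 1
    return lo

# the three sets from the example above: the answer is 3
print(odd_one([(1, 8, 2, 1), (1, 7, 1, 0), (6, 8, 1, 1)]))

Each parity query touches every quadruple once, and the bisection needs about log(max(F)) queries, giving the O(N log(F)) bound from the hint.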
Seems like it'd be useful to find a way to perform set operations, particularly intersection, on these sets without having to generate the actual sets. If you could do that, the intersection of all these sets in the test should leave you with just one number. Leaving the a and b part aside, it's easy to see how you'd take the intersection of two sets that include all integers between S and F: the intersection is just the set with S=max(S1, S2) and F=min(F1, F2).
That gives you a starting point; now you have to figure out how to create the intersection of two sets considering a and b.
XOR to the rescue.
Take the numbers from each successive set and XOR them with the contents of the result set. I.e., if the number is currently marked as "present", change that to "not present", and vice versa.
At the end, you'll have one number marked as present in the result set, which will be the one that occurred an odd number of times. All of the others will have been XORed an even number of times, so they'll be back to the original state.
As for complexity, you're dealing with each input item exactly once, so it's basically linear on the total number of input items -- at least assuming your operations on the result set are constant complexity. At least if I understand how they're phrasing things, that seems to meet the requirement.
It sounds like S is assumed to be non-negative. Given your stated O(max(F)*N) time bound, you can use a sieve-like approach.
Have an array of integers with an entry for each candidate number (that is, every number between min(S) and max(F)). Go through all the quadruples and add 1 to every array location associated with a number included in that quadruple. At the end, look through the array to see which count is odd. The number it represents is the number that satisfies your conditions.
This works because you're going over N quadruples, and each one takes O(max(F)) or less time (assuming S is always non-negative) to enumerate the included numbers. That gives you O(max(F)*N).
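A short sketch of that counting-array idea (offsetting the indices by min(S) is my own detail):

def odd_one_sieve(quads):
    """O(max(F) * N) counting approach: one counter per candidate number."""
    lo = min(q[0] for q in quads)
    hi = max(q[1] for q in quads)
    counts = [0] * (hi - lo + 1)
    for S, F, a, b in quads:
        start = S
        while start <= F:
            for v in range(start, min(start + a - 1, F) + 1):   # the a included numbers
                counts[v - lo] += 1
            start += a + b                                      # skip the b excluded ones
    for i, c in enumerate(counts):
        if c % 2 == 1:
            return lo + i

print(odd_one_sieve([(1, 8, 2, 1), (1, 7, 1, 0), (6, 8, 1, 1)]))   # 3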

Better compression algorithm for vector data?

I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.
You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly used sequences of bits and replace them with shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form
XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd byte
xxyy
xxyy
...
Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry they occur in. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
Another approach might be, instead of storing these coordinates as absolute numbers, write them as deltas, relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.
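A small sketch of the delta idea (the point layout and names here are assumptions about your records, not your actual format): store each point as a signed delta from the previous one, and only then hand the result to zlib.

import struct
import zlib

def pack_points(points):
    """points: list of (x, y) fixed-point 32-bit ints, already grouped spatially."""
    out = []
    prev_x, prev_y = 0, 0
    for x, y in points:
        # store signed deltas instead of absolute values; nearby points
        # then produce many small, highly compressible numbers
        out.append(struct.pack("<ii", x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return zlib.compress(b"".join(out), 9)

# toy comparison against compressing the absolute values directly
points = [(0x12340000 + i * 3, 0x56780000 + i * 5) for i in range(100)]
raw = b"".join(struct.pack("<II", x, y) for x, y in points)
print("%d %d" % (len(zlib.compress(raw, 9)), len(pack_points(points))))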
Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
http://www.7-zip.org/
Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored in one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record in bytes.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would let much of the data pack better. The only real way to know, though, is to see what happens with your real data.
It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests I should combine other algorithms than zlib with BWT, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, universal codes are a family of compression techniques that work well when the probability distribution is skewed towards small numbers. They would have to be tweaked for signed numbers (e.g. encode x as abs(x)<<1 + (x < 0 ? 1 : 0)).
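As a concrete example of one universal code (my own illustration using Elias gamma coding; other universal codes work similarly), map each signed delta to a positive integer with the formula above and emit a self-delimiting bit string:

def to_unsigned(x):
    """Map a signed delta to a non-negative integer: 0,-1,1,-2,2 -> 0,3,2,5,4."""
    return (abs(x) << 1) + (1 if x < 0 else 0)

def elias_gamma(n):
    """Elias gamma code for n >= 1: (bit length - 1) zero bits, then n in binary."""
    bits = bin(n)[2:]
    return "0" * (len(bits) - 1) + bits

def encode_delta(x):
    # shift by 1 because Elias gamma cannot encode 0
    return elias_gamma(to_unsigned(x) + 1)

print("%s %s %s" % (encode_delta(0), encode_delta(1), encode_delta(-2)))  # 1 011 00110

Small deltas cost only a few bits, while the occasional large jump still encodes without any fixed field width.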
You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.
This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.