Python: Question on Longest Common Substring (LCS) algorithm - python-2.7

I'm pretty new to Python (it's my first programming language), and I've wanted to play around with some manual data structure manipulation.
I've recently been learning the basic algorithm for solving the LCS problem, and I understand how it works except for one line of code that, for some reason, I can't quite convince myself I'm grasping entirely.
This is the code I've been learning from after I couldn't quite get it right myself.
EDIT 2: Is there any way to make this work with an input of two lists of integers? I figured out that I was understanding my original question correctly, but would anyone know how I could make this work with a list of integers? I tried converting S and T to strings of comma-separated values, which matched some of the characters, but even then it rarely worked across test cases. I'm not sure why it wouldn't, as it is still just two strings being compared, only with commas.
def lcs(S, T):
    m = len(S)
    n = len(T)
    counter = [[0] * (n + 1) for x in range(m + 1)]
    longest = 0
    lcs_set = set()
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                c = counter[i][j] + 1
                counter[i + 1][j + 1] = c
                if c > longest:
                    lcs_set = set()
                    longest = c
                    lcs_set.add(S[i - c + 1:i + 1])
                elif c == longest:
                    lcs_set.add(S[i - c + 1:i + 1])
    return lcs_set
Now my issue is understanding the line: lcs_set.add(S[i-c+1:i+1])
I understand that the counter is incremented when a match is found, to give longest the length of the substring. So, to make it easy, if S = Crow and T = Crown, when you reach w, the last match, the counter is incremented to 4, and i is at index 3 of S.
Does this mean I am to read it as: i (index 3 of S, the w) - c (4), so 3 - 4 = -1, and 3 - 4 + 1 = 0 (at C); and for the right side of the slice, i (3) + 1 = 4 (the exclusive end of the slice), meaning we add S[0:4], Crow, to lcs_set?
If that is the case, I guess I am confused as to why we are adding the whole substring to the set, and not just the newest matched character?
If I understand right, it updates lcs_set with the entire slice of the currently matched substring. So on the second match, r, the counter would be at 2 and i would be at 1, giving S[1-2+1 : 1+1]: 1 - 2 + 1 = 0 (C), up to i + 1 = 2, leaving us with S[0:2], or Cr. So each time around, the set is updated with the entire substring matched so far, not just the current index.
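A quick way to sanity-check that reading is to plug the Crow/Crown numbers in directly (Python 2, to match the question's tag):
S = "Crow"
i, c = 3, 4                # last match: 'w' at S[3], run length 4
print S[i - c + 1:i + 1]   # Crow
i, c = 1, 2                # second match: 'r' at S[1], run length 2
print S[i - c + 1:i + 1]   # Cr
Both slices come out as the full run matched so far, which is exactly what the set receives.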
It's not really a problem, I just want to make sure I'm understanding this correctly.
I would really appreciate any input, or any tips on my current logic!
EDIT:
I just realized I was totally forgetting that the counter value at the current position is the length of the current matched run, so it obviously wouldn't make sense to update lcs_set with just the current max match number, and it can't be updated with just the current matched letter either; it has to take the slice of the substring in order to update lcs_set.
Thanks in advance!
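Regarding EDIT 2: the algorithm only needs indexable sequences, so the one real obstacle with lists is that a list slice is unhashable and cannot go into a set; converting each slice to a tuple fixes that. A minimal sketch (lcs_seq and the sample lists are invented for illustration, not from the original post):
def lcs_seq(S, T):
    m, n = len(S), len(T)
    counter = [[0] * (n + 1) for _ in range(m + 1)]
    longest = 0
    lcs_set = set()
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                c = counter[i][j] + 1
                counter[i + 1][j + 1] = c
                if c > longest:
                    lcs_set = set()
                    longest = c
                if c == longest:
                    # tuples are hashable, so they can go into the set
                    lcs_set.add(tuple(S[i - c + 1:i + 1]))
    return lcs_set

print lcs_seq([7, 1, 2, 3, 9], [1, 2, 3, 4])  # set([(1, 2, 3)])
This may also explain why the comma-separated-string trick misbehaved: multi-digit numbers become several characters, so a character-level match can cross number boundaries.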

Related

Kotlin Regex find undesired behavior with startIndex

I was trying to find words using regex in Kotlin. Here is a snippet of sample code
val possibleString = "#This is a comment"
val regex = "(?<=[ \t\n$PUNC])(\\w+)".toRegex() //PUNC is another char sequence of punctuation
val matcher1 = regex.find(possibleString)
val matcher2 = regex.find(possibleString,1)
println(matcher1?.value) // This
println(matcher2?.value) // This
The value of matcher1 makes sense to me: it yields This.
However, why does matcher2 also return This? If the start index is 1, don't we start from 'T' and get "is" instead?
I'm wondering why this is the case. Does the matcher still scan the part of the string before the start index?
If so, I know I could pass a substring starting from index 1 to get the desired output. However, considering the possibility of large chunks of text, generating multiple substrings seems like a waste of memory.
So, is there any efficient workaround?
Thanks!
If you begin your search at "T", you will find This, because the startIndex is inclusive. A match will be found unless it starts before the startIndex. If it starts on the startIndex, it will still be found.
I suspect that your misunderstanding might be thinking that find would ignore the first #, because it is before the startIndex. This is not true. startIndex only says where to start - lookbehinds don't suddenly break because you started at a later index.
Your desired behaviour isn't how lookbehinds work, so the workaround would be to use a group instead.
val regex = "[ \t\n$PUNC](\\w+)".toRegex()
val matcher1 = regex.find(possibleString)
val matcher2 = regex.find(possibleString,1)
println(matcher1?.groups?.get(1)?.value) // This
println(matcher2?.groups?.get(1)?.value) // is
You should think of the startIndex as the minimum index of the beginning of a match. If you want the match to be is, for example, you can start searching from anywhere between index 2 (inclusive) and 6 (inclusive), since is starts at index 6:
println(regex.find(possibleString, 2)?.value) // is

I need help understanding a piece of code

I would really appreciate it if someone could help me understand it. Thanks! P.S. I'm new to coding.
sentence = "I like my dog I buy my dog toys"
s = sentence.split()
positions = [s.index(x)+1 for x in s]
print(sentence)
print(positions)
Jean is correct. Have you done any online Python tutorials?
Here goes.
The first line assigns the string "I like my dog I buy my dog toys" to a variable named sentence.
the next line
s = sentence.split()
breaks up the string into an array of substrings and assigns that array to variable s
>>> print(s)
['I', 'like', 'my', 'dog', 'I', 'buy', 'my', 'dog', 'toys']
the next line
positions = [s.index(x)+1 for x in s]
looks up the position of each array value, adds 1, and records it in the array positions
>>> print(positions)
[1, 2, 3, 4, 1, 6, 3, 4, 9]
EDIT
Allow me to elaborate on some key points. First, the split function. Many languages have a split function; they all take a delimiter, the character on which the string will be split. In Python, split() can also be called with no delimiter, in which case it splits on any run of whitespace (spaces, tabs, newlines). Thus when we call sentence.split(), it takes the value of the sentence variable, breaks it apart at the whitespace, and returns an array of the various substrings, or pieces - in this case, the individual words.
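For example, consecutive whitespace is treated as a single break:
>>> "I  like\tmy   dog".split()
['I', 'like', 'my', 'dog']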
Next, let's look at the line
positions = [s.index(x)+1 for x in s]
Let's consider the following for a moment
for x in s:
    i = s.index(x)
This is a basic loop that takes each item in array s and places it in variable x. The first pass through this loop takes "I" and assigns it to x. Then we look for the position of "I" in the array s. Since s contains the words of the sentence in order, the first position, array item 0, contains the value "I", so the value of variable i becomes 0. Note that index() always returns the position of the first occurrence, which is why the second "I" also yields 0. The loop continues, matching each item in array s and finding the value's corresponding position within the array.
Taking this one step further, we instantiate another array, in this case positions. As the loop iterates over the array s, finding the corresponding index of each value, those positions are placed in the new array positions.
Now most people do not necessarily think in terms of zero based lists. Therefore, we take an extra step and add 1 to each position as it is found. So position 0 becomes position 1, and so on.
So what is different about the for loop I used to demonstrate above and the single line of code used in this question? Nothing, really. This line
positions = [s.index(x)+1 for x in s]
is simply a condensed form of the for loop. In Python, this is known as a list comprehension.
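Spelled out, that comprehension is exactly this loop:
positions = []
for x in s:
    # index() always returns the FIRST occurrence of x, which is why
    # the repeated words ("I", "my", "dog") map back to their first positions
    positions.append(s.index(x) + 1)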
At this point, this answer is becoming more of a small instructional on Python. I really need to suggest that you seek out some tutorials on Python, starting with the one on Python's documentation site. Others may be found on TutorialPoint or Learn Python, and there are great resources on Pluralsight and Coursera as well.
Good luck

SAS: Removing first and last characters and numbers of a string

I have been searching for hours, but nothing seems to work so far. I tried reverse, substr, and scan, but none of them do what I need. I am thankful for any answer.
I have a string of the following form (lengths vary within the dataset):
1CDF534R6
Now, I need 2 new variables:
a) 534, i.e. the middle numbers
Something like: Give me all numbers and then cut the first and last (that would work in my case).
b) 1CDF534
Just removing the last two characters
The first one especially is important; it would be great if that works somehow.
Best
In the first case, use the compress function with the 'kd' (keep digits) modifier so that only the digits remain.
data result;
    source = "1CDF534R6";
    a = compress(source, , 'kd');               /* keep digits only: 15346 */
    a = substr(a, 2, lengthn(a) - 2);           /* drop first and last digit: 534 */
    b = substr(source, 1, lengthn(source) - 2); /* drop last two characters: 1CDF534 */
run;

Python key sorting

I'm taking an online beginner course on Python 2 through Google, and I cannot figure out the answer to one of the questions. Here it is, and thanks in advance for your help!
# A. match_ends
# Given a list of strings, return the count of the number of
# strings where the string length is 2 or more and the first
# and last chars of the string are the same.
# Note: python does not have a ++ operator, but += works.
def match_ends(words):
    a = []
    for b in words:
        return
I tried a few different things; this is just where I left off on my last attempt before deciding to ask for help. I have spent more time thinking about this than I care to mention.
You should go through the course materials carefully. This can be solved easily if you have a beginner level understanding of Python. See the following code snippet:
def match_ends(words):
    count = 0
    for word in words:
        if len(word) >= 2 and word[0] == word[-1]:
            count += 1
    return count
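A quick check with an invented word list (not from the course):
print match_ends(["aba", "xyz", "aa", "x", "bbb"])  # 3 ("aba", "aa", "bbb")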

Which string-finding algorithm is appropriate for this?

I have a big string, say "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd" (but maybe longer), and I have a collection of lots of little strings. I want to count how many times the little strings are found in the big string (overlap is OK). I care only about speed. KMP seemed good, but it looked like Rabin-Karp could deal with multiple patterns, just slowly.
The problem with most string-searching algorithms is that they take at least O(k) time to return k matches, so if you have a string with, say, 1 million "a"s, and 1 million queries of the little string "a", then it will take around a million million (10^12) iterations to count all the matches!
An alternative linear time approach would be to:
Construct a suffix tree of the big string: O(n) where n is len(big string)
Precompute the number of suffixes below each node in the suffix tree: O(n)
For each small string, walk down from the root of the suffix tree, following the small string's characters, to find the node it leads to: O(m), where m is len(small string)
Add to the total count the number of suffixes below that node. (Each suffix corresponds to a different match of the small string in the big string)
This will take time O(n+p) where n is the length of the big string, and p is the total length of all the small strings.
Example Code
As requested, here is some small(ish) example code in Python that uses this approach:
class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T = ''
        self.E = {}
        self.nodes = [-1]  # 0th node is empty string

    def add(self, s):
        """Adds the input string to the suffix tree.
        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.
        Produces an edge graph keyed by (node, character) that gives (first, last, end).
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin, first, last = 0, len(self.T), len(self.T) - 1
        self.T += s
        nc = len(self.nodes)
        self.nodes += [-1] * (2 * len(s))
        T = self.T
        E = self.E
        nodes = self.nodes
        Lm1 = len(T) - 1
        for last_char_index in xrange(first, len(T)):
            c = T[last_char_index]
            last_parent_node = -1
            while 1:
                parent_node = origin
                if first > last:
                    if (origin, c) in E:
                        break
                else:
                    key = origin, T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first + span
                    m = T[A + 1]
                    if m == c:
                        break
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc, m] = (A + 1, edge_last, edge_end)
                    parent_node = nc
                    nc += 1
                E[parent_node, c] = (last_char_index, Lm1, nc)
                nc += 1
                if last_parent_node > 0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin == 0:
                    first += 1
                else:
                    origin = nodes[origin]
                if first <= last:
                    edge_first, edge_last, edge_end = E[origin, T[first]]
                    span = edge_last - edge_first
                    while span <= last - first:
                        first += span + 1
                        origin = edge_end
                        if first <= last:
                            edge_first, edge_last, edge_end = E[origin, T[first]]
                            span = edge_last - edge_first
            if last_parent_node > 0:
                nodes[last_parent_node] = parent_node
            last += 1
            if first <= last:
                edge_first, edge_last, edge_end = E[origin, T[first]]
                span = edge_last - edge_first
                while span <= last - first:
                    first += span + 1
                    origin = edge_end
                    if first <= last:
                        edge_first, edge_last, edge_end = E[origin, T[first]]
                        span = edge_last - edge_first
        return self

    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = [list() for n in xrange(len(self.nodes))]  # Contains set of choices for each node
        for (origin, c), edge in self.E.items():
            choices[origin].append(c)
        choices = [sorted(s) for s in choices]  # should not have any repeats by construction
        self.choices = choices
        return choices

    def count_suffixes(self, term):
        """Recurses through the tree finding how many suffixes are based at each node.
        Strings assumed to use term as the terminating character"""
        C = self.suffix_counts = [0] * len(self.nodes)
        choices = self.make_choices()
        def f(node=0):
            t = 0
            X = choices[node]
            if len(X) == 0:
                t += 1  # this node is a leaf node
            else:
                for c in X:
                    if c == term:
                        t += 1
                        continue
                    first, last, end = self.E[node, c]
                    t += f(end)
            C[node] = t
            return t
        return f()

    def count_matches(self, needle):
        """Return the count of matches for this needle in the suffix tree"""
        i = 0
        node = 0
        E = self.E
        T = self.T
        while i < len(needle):
            c = needle[i]
            key = node, c
            if key not in E:
                return 0
            first, last, node = E[key]
            while i < len(needle) and first <= last:
                if needle[i] != T[first]:
                    return 0
                i += 1
                first += 1
        return self.suffix_counts[node]

big = "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings = ["a", "ab", "abc"]
s = SuffixTree()
term = chr(0)
s.add(big + term)
s.count_suffixes(term)
for needle in small_strings:
    x = s.count_matches(needle)
    print needle, 'has', x, 'matches'
It prints:
a has 11 matches
ab has 1 matches
abc has 0 matches
However, in practice I would recommend you simply use a pre-existing Aho-Corasick implementation as I would expect this to be much faster in your particular case.
Matching against a large collection of strings sounds like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to me. It does find matches one at a time, so Peter de Rivaz's idea might be better if there are a huge number of matches. On the other hand, Aho-Corasick doesn't need to keep the big string in memory - you can just stream it through - and it is very practical to implement and tune; the Wikipedia link notes that the original fgrep used it.
Thinking about it, you can work around the mega-match problem. Aho-Corasick can be viewed as creating a deterministic finite state machine just capable of recognizing each of the strings it is searching for; the state of the machine corresponds to the last N characters seen. If you wish to match two strings and one is a suffix of the other, you need to be careful that when you are in the state that says you have just matched the longer string, you also recognize that you have just matched the shorter string. If you deliberately choose not to do this, the counts you accumulate for the shorter string will be too low - but you know exactly how much too low: the number of times the longer string was seen. So if you modify Aho-Corasick so that only the longest match at each point is recognized and counted, the cost of matching remains linear in the number of characters in the string you are searching, and you can fix up the counts at the end by going through the long strings and incrementing the counts for the shorter strings that are suffixes of the long ones. This will take time at most linear in the total size of the strings being searched for.
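To make the linear-time counting idea concrete, here is a small from-scratch Python sketch (an illustration, not a tuned library, and it uses a slightly different bookkeeping than the longest-match fix-up described above): during the scan it only records how many times each automaton state is visited, and afterwards adds each state's count to its failure state, deepest states first, so each pattern's total can be read off the state where that pattern ends. The run stays linear in len(big) plus the total pattern length no matter how many matches there are:
from collections import deque

def count_all(big, patterns):
    # --- build the pattern trie ---
    goto = [{}]                      # goto[state] maps char -> next state
    ends = []                        # ends[i] = state where patterns[i] ends
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        ends.append(s)
    # --- build failure links breadth-first ---
    fail = [0] * len(goto)
    order = list(goto[0].values())   # all states except the root, in BFS order
    q = deque(order)
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            order.append(t)
            q.append(t)
    # --- scan: record state visits only, never individual matches ---
    visits = [0] * len(goto)
    s = 0
    for ch in big:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        visits[s] += 1
    # --- fix-up pass: a visit also counts for every pattern that is a
    #     suffix, i.e. every state on the failure chain (children first) ---
    for s in reversed(order):
        visits[fail[s]] += visits[s]
    return dict((p, visits[e]) for p, e in zip(patterns, ends))

print count_all("aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd",
                ["a", "ab", "abc"])  # a: 11, ab: 1, abc: 0
This reproduces the counts from the suffix-tree example above.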
Building on another answer (and hopefully convincing you that this is the best type of answer): you can look up http://en.wikipedia.org/wiki/Suffix_tree and go through the references listed there to learn about suffix trees if you really want the fastest solution for your problem. Suffix trees also make it possible to get the number of matches without iterating over all the matches, and the running times and memory requirements you get are the absolute best possible for any substring-matching or match-counting algorithm. Once you understand how the suffix tree works and how to build and use it, the only additional tweak you need is to store, at each internal node, the number of distinct string starting positions being represented there - a minor modification you can do efficiently (in linear time, as already claimed) by recursively getting the counts from each node's children and adding them up. These counts then let you count substring matches without iterating over all of them.
If I understood correctly, your input string consists of many one-character blocks.
In this case, you can compress your text using the Run-length encoding.
For example:
s = aaabbbbcc
is encoded as:
encoded_s = (a3)(b4)(c2)
Now you may search for patterns in the encoded text.
If you want a concrete algorithm, just search the web for pattern matching in run-length-encoded strings. You can achieve time complexity O(N + M), where N and M are the lengths of the compressed text and the compressed pattern. Both are in general much smaller than the original lengths, so this beats any standard pattern-matching algorithm, e.g. KMP.
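The encoding step itself is tiny; a quick sketch with itertools.groupby (the search over the encoded form is the part to pull from the literature):
from itertools import groupby

def rle(s):
    # "aaabbbbcc" -> [('a', 3), ('b', 4), ('c', 2)]
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print rle("aaabbbbcc")  # [('a', 3), ('b', 4), ('c', 2)]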
1) I'd go with finite automata. I can't think of a specialized library right now, but the general-purpose PCRE can be used to construct an automaton that efficiently searches for the given substrings. For the substrings "foo" and "bar" one can construct the pattern /(foo)|(bar)/, scan a string, and get the "id" number of the matched substring by iterating over the ovector and checking which group matched.
RE2::FindAndConsume is good if you only need the total count, not grouping by substring.
P.S. Example using Boost.Xpressive and loading the strings from a map: http://ericniebler.com/2010/09/27/boost-xpressive-ftw/
P.P.S. Recently I've had a good time creating a Ragel machine for a similar task. For a small set of searched strings a "normal" DFA would work, but if you have a larger ruleset then using Ragel scanners shows good results (here is a related answer).
P.P.P.S. PCRE has the MARK keyword which is super useful for that kind of subpattern classification (cf).
2) Quite some time ago I wrote a Trie-based thingie in Scala for that kind of load: https://gist.github.com/ArtemGr/6150594; Trie.search goes over the string trying to match the current position to a number encoded in the Trie. The trie is encoded in a single cache-friendly array, I expect it to be as good as non-JIT DFAs.
3) I've been using boost::spirit for substring matching, but never got to measuring how it fares against other approaches. Spirit uses some efficient structure for the symbols matching, perhaps the structure can be used on its own without the overhead of Spirit.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;  // alias needed for the qi:: names below
using qi::lit; using qi::alnum; using qi::digit; using qi::_val; using qi::_1;
using boost::phoenix::ref;

// Symbol table mapping each searched-for string to an integer id.
static struct exact_t : qi::symbols<char, int> {
    exact_t() {
        add("foo", 1)("bar", 2);
    }
} exact;

int32_t match = -1;
qi::rule<const char*, int()> rule =
    +alnum >> exact [ref(match) = _1];
const char* it = haystack;  // Mutable iterator for Spirit; haystack/haystackLen assumed defined by the caller.
qi::parse(it, haystack + haystackLen, rule);