Suppose I have the string str = "aabaa".
Its distinct (non-repeated) substrings are:
a
b
aa
ab
ba
aab
aba
baa
aaba
abaa
aabaa
Compute the suffix array and its longest common prefix (LCP) array. The sorted suffixes are listed below, with the LCP of each adjacent pair between them:
a
1
aa
2
aabaa
1
abaa
0
baa
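Return n(n+1)/2, the number of (start, end) substring bounds, minus the sum of the longest common prefix array: each LCP value counts substrings that were already seen as prefixes of the previous suffix.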
(5+1)5/2 - (1+2+1+0) = 15 - 4 = 11.
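The computation above can be sketched in Python as follows. This is a minimal illustration, not an optimized implementation: the function name count_distinct_substrings is made up for this example, and the suffix array is built by naive sorting, which is only reasonable for short strings.

```python
def count_distinct_substrings(s: str) -> int:
    """Count distinct non-empty substrings via a suffix array and its LCP array."""
    n = len(s)
    # Naive suffix array: sort the suffix start positions by the suffix text.
    suffix_array = sorted(range(n), key=lambda i: s[i:])

    def lcp(i: int, j: int) -> int:
        # Length of the longest common prefix of s[i:] and s[j:].
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    # LCP of each pair of adjacent suffixes in sorted order.
    lcp_array = [lcp(a, b) for a, b in zip(suffix_array, suffix_array[1:])]
    return n * (n + 1) // 2 - sum(lcp_array)


print(count_distinct_substrings("aabaa"))  # 11
```

For large inputs you would swap the naive sort for a proper suffix array construction; the final formula stays the same.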
I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any run of digits that is not attached to a letter or to another non-whitespace character. For example, if the string contained t370 or 6-test, I would want those to remain. Only numbers that stand on their own, separated by whitespace, should be removed.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the white space " [0-9]+" which will only replace if the pattern occurs in the middle, not at the start of a string. If it's easier to do this without regex expressions that's fine, I was just trying to become more familiar.
Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
I have looked around this site and others for information on how to iterate through a string in Python, find a specific substring, reverse it, and check whether the two are equal in order to detect a palindrome. My problem is that some of the test cases are hard to handle, and I am confused about how to find their palindromes through indexing.
This is my code, which works for all but two test cases:
def countPalindromes(s):
    count = 0
    firstindex = 0
    lastindex = len(s)-1
    while firstindex != lastindex and firstindex <= lastindex:
        ch1 = s[firstindex:lastindex]
        ch2 = s[lastindex:firstindex:-1]
        if ch1 == ch2:
            count +=1
        firstindex +=1
        lastindex -=1
    return count
This code works for the following test strings: "racecar", " ", and "abqc".
It does not work for the test strings "aaaa" and "abacccaba".
For "aaaa" there are 6 palindromes and for "abacccaba" there are 8 palindromes. This is where my problem occurs, and I simply can't figure it out. For the 6 palindromes for "aaaa" I get aaaa, aaa, aa, twice for each. For "abacccaba" the 8 palindromes I have no idea as I get abacccaba, bacccab, accca, ccc, aba, aba.
I understand this is a confusing question, but I am lost how to approach the problem since I only get 2 for the "aaaa" and 4 for "abacccaba". Any ideas how I would cut out the substrings and get these values?
Thanks in advance!
while firstindex != lastindex and firstindex <= lastindex: misses the case of a single character palindrome.
You're also missing the case where aa contains three palindromes, 0:1, 0:2 and 1:2.
I think you're missing some palindromes for aaaa; there are 10:
aaaa
a
a
a
a
aa
aa
aa
aaa
aaa
If single-character palindromes do not count, then we have 6.
Either way, you need to consider all substrings as possible palindromes; not only the ones in the middle. Comparing a string against its reversed self is very easy to do in Python: s == s[::-1].
Getting all the substrings is easy too:
def get_all_substrings(input_string):
    length = len(input_string)
    return [input_string[i:j+1] for i in range(length) for j in range(i,length)]
and filtering out strings of length < 2 is also easy:
substrings = [a for a in get_all_substrings(string) if len(a) > 1]
Combining these should be fairly straightforward:
len([a for a in get_all_substrings(string) if len(a) > 1 and a == a[::-1]])
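Put together as one self-contained function, the idea might look like the sketch below. The name count_palindromes and the min_length parameter are introduced here only for illustration; min_length=2 excludes single characters, and min_length=1 includes them.

```python
def count_palindromes(s: str, min_length: int = 2) -> int:
    """Count palindromic substrings of s; repeated occurrences count separately."""
    n = len(s)
    count = 0
    for i in range(n):
        # j is the exclusive end index, so s[i:j] has length j - i >= min_length.
        for j in range(i + min_length, n + 1):
            sub = s[i:j]
            if sub == sub[::-1]:
                count += 1
    return count


print(count_palindromes("aaaa"))                # 6
print(count_palindromes("abacccaba"))           # 8
print(count_palindromes("aaaa", min_length=1))  # 10
```

This checks every substring, so it is cubic in the length of the string in the worst case, which is fine for short test inputs like the ones in the question.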
I think you should write a separate function f that checks whether a string is a palindrome.
Then write a function g that selects all substrings.
E.g., for the string abcd, g will select a, b, c, d, ab, bc, cd, abc, bcd, abcd. Then apply f to each of these strings individually to get the number of palindromes.
I'm trying to solve a regex where the given alphabet is Σ={a,b}
The first expression is:
L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
which means the corresponding regex is: aa(a)*b(bbb)*
What would be a regex for L2, complement of L1?
Is it right to assume L2 = "Any string except those matching aa(a)*b(bbb)*"?
First, in my opinion, the regex for L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
is NOT the one you gave but: aa(aa)*b(bbb)*. The reason is that a^2n with n >= 1 means there are at least two a's and an even number of a's.
Now, the regular expression for "Any string except for aa(aa)*b(bbb)*" is:
^(?!^aa(aa)*b(bbb)*$).*$
more details here: Regex101
Explanations
aa(aa)*b(bbb)* the regex you DON'T want to match
^ represents the beginning of the line
(?!) negative lookahead: should NOT match what's in this group
$ represents end of line
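As a quick sanity check, the lookahead idea can be tried in Python's re module. This is only a sketch: the sample strings are chosen for illustration, and non-capturing groups are used instead of plain groups, which does not change what is matched.

```python
import re

# L1 itself, and "any string that is not in L1" via a negative lookahead.
L1 = re.compile(r"aa(?:aa)*b(?:bbb)*")
NOT_L1 = re.compile(r"^(?!aa(?:aa)*b(?:bbb)*$).*$")

samples = ["", "a", "b", "ab", "aab", "aaab", "aabb", "aaaab", "aabbbb", "aabab"]
for s in samples:
    in_l1 = L1.fullmatch(s) is not None
    in_complement = NOT_L1.match(s) is not None
    # Every string must land in exactly one of the two languages.
    assert in_l1 != in_complement
    print(repr(s), "L1" if in_l1 else "complement")
```

Of these samples, only aab, aaaab and aabbbb are reported as members of L1.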
EDIT
Yes, a complement for aa(aa)*b(bbb)* is "Any string but the ones that match aa(aa)*b(bbb)*".
Now you need to find a regex that represents that with the syntax that you can use. I gave you a regex in this answer that is correct and matches "Any string but the ones that match aa(aa)*b(bbb)*", but if you want a mathematical representation following the pattern you gave for L1, you'll need to find something simpler.
Without any negative lookahead, that would be:
L2 = ^((b+.*)|((a(aa)*)?b*)|a*((bbb)*|bb(bbb)*)|(.*a+)|(.*ba.*))$
Test it here at Regex101
Good luck with the mathematical representation translation...
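A lookahead-free complement is easy to get subtly wrong, so it may be worth checking a candidate by brute force against the definition of L1. The sketch below does that for all strings over {a, b} up to a small length. The COMPLEMENT pattern is a candidate written for this example (an assumption of the sketch, not necessarily the shortest form); note that it includes an explicit branch for strings in which some b is followed by an a, such as aabab.

```python
import re
from itertools import product

L1 = re.compile(r"aa(?:aa)*b(?:bbb)*")

# Hypothetical lookahead-free candidate for the complement of L1 over {a, b}.
COMPLEMENT = re.compile(
    r"b[ab]*"          # starts with b
    r"|[ab]*a"         # ends with a
    r"|[ab]*ba[ab]*"   # some b is followed by an a
    r"|a(?:aa)*b*"     # odd number of a's, then only b's
    r"|a*(?:bbb)*"     # only a's, then a multiple of 3 b's (covers the empty string)
    r"|a*bb(?:bbb)*"   # only a's, then a number of b's that is 2 modulo 3
)


def mismatches(max_len=10):
    """Return every string up to max_len on which the two patterns disagree."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            in_l1 = L1.fullmatch(s) is not None
            in_complement = COMPLEMENT.fullmatch(s) is not None
            if in_l1 == in_complement:
                bad.append(s)
    return bad


print(mismatches())  # []
```

An empty list means that, up to the tested length, every string is matched by exactly one of the two patterns. This is a check, not a proof.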
The first expression is:
L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
Regex for L1 is:
^aa(?:aa)*b(?:bbb)*$
Regex demo
Input
a
b
ab
aab
abb
aaab
aabb
abbb
aaaab
aaabb
aabbb
abbbb
aaaaab
aaaabb
aaabbb
aabbbb
abbbbb
aaaaaab
aaaaabb
aaaabbb
aaabbbb
aabbbbb
abbbbbb
aaaabbbb
Matches
MATCH 1
1. [7-10] `aab`
MATCH 2
1. [30-35] `aaaab`
MATCH 3
1. [75-81] `aabbbb`
MATCH 4
1. [89-96] `aaaaaab`
MATCH 5
1. [137-145] `aaaabbbb`
Regex for L2, complement of L1
^aa(?:aa)*b(?:bbb)*$(*SKIP)(*FAIL)|^.*$
Explanation:
^aa(?:aa)*b(?:bbb)*$ matches L1
^aa(?:aa)*b(?:bbb)*$(*SKIP)(*FAIL) anything that matches L1 is skipped and fails (these are PCRE backtracking verbs)
|^.*$ matches everything else, i.e. whatever does not match L1
Regex demo
Matches
MATCH 1
1. [0-1] `a`
MATCH 2
1. [2-3] `b`
MATCH 3
1. [4-6] `ab`
MATCH 4
1. [11-14] `abb`
MATCH 5
1. [15-19] `aaab`
MATCH 6
1. [20-24] `aabb`
MATCH 7
1. [25-29] `abbb`
MATCH 8
1. [36-41] `aaabb`
MATCH 9
1. [42-47] `aabbb`
MATCH 10
1. [48-53] `abbbb`
MATCH 11
1. [54-60] `aaaaab`
MATCH 12
1. [61-67] `aaaabb`
MATCH 13
1. [68-74] `aaabbb`
MATCH 14
1. [82-88] `abbbbb`
MATCH 15
1. [97-104] `aaaaabb`
MATCH 16
1. [105-112] `aaaabbb`
MATCH 17
1. [113-120] `aaabbbb`
MATCH 18
1. [121-128] `aabbbbb`
MATCH 19
1. [129-136] `abbbbbb`
I'd like to know if it's possible to use a value inside the expression as a variable for a second part of the expression.
The goal is to extract some specific strings from a memory dump. One part of the string is based on a (more or less) fixed structure that can be described well using regular expressions. The problem is the second part of the string, which has a variable length and no "footer" or anything else that can be matched as an "END".
Instead, there is a length indicator at position 2 of the first part.
Here is a simplified example of a string that I'd like to find (along with all others like it) inside a large file:
00 24 AA BB AA DD EE FF GG HH II JJ ########### ( # being unwanted data)
Let's assume that the main structure is always 00 XX AA BB AA, but the last part (starting from DD) varies in length for each string, based on the value of XX.
I know that this can be done in code outside regex, but I am curious whether it's possible :)
Short answer: NO
Long answer:
You can achieve what you want in two steps:
Extract the length value from the string
Dynamically build a regexp for matching
PSEUDO CODE
s := '00 24 AA BB AA DD EE FF GG HH II JJ ###########'
re := /00 (\d{2}) AA BB AA/
if s::matches(re) then
    match := re::match(s)
    len := match(1)    // captured length indicator, here '24'
    dynamicRE := new Regexp(re::toString() + '(?: [A-Z]{2}){' + len + '}')
    // dynamicRE == /00 (\d{2}) AA BB AA(?: [A-Z]{2}){24}/
    if s::matches(dynamicRE) then
        // MATCH !!
    else
        // NO MATCH !!
    end if
end if
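For a concrete version of the two-step idea, here is a minimal Python sketch. It rests on assumptions that are not in the question: the length field is read as a decimal count of two-character payload tokens, payload tokens consist of digits and uppercase letters, and the function name extract_records and the sample dump are made up for illustration.

```python
import re

# Step 1 pattern: the fixed header with the two-digit length indicator captured.
HEADER = re.compile(r"00 (\d{2}) AA BB AA")


def extract_records(data):
    """Return every header plus exactly as many payload tokens as it announces."""
    records = []
    pos = 0
    while True:
        header = HEADER.search(data, pos)
        if header is None:
            break
        # Assumption: the indicator is a decimal count of payload tokens.
        count = int(header.group(1))
        # Step 2: build the payload pattern dynamically from the extracted count.
        payload = re.compile(r"(?: [0-9A-Z]{2}){%d}" % count)
        body = payload.match(data, header.end())
        if body is not None:
            records.append(data[header.start():body.end()])
            pos = body.end()
        else:
            # Header found but payload too short or malformed: skip this header.
            pos = header.end()
    return records


dump = "## 00 03 AA BB AA DD EE FF GG HH ## 00 01 AA BB AA JJ ##"
print(extract_records(dump))
# ['00 03 AA BB AA DD EE FF', '00 01 AA BB AA JJ']
```

If the indicator is actually hexadecimal or counts bytes rather than tokens, only the header pattern and the line that computes count need to change.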
I have something like this
AD ABCDEFG HIJKLMN
AB HIJKLMN
AC DJKEJKW SJKLAJL JSHELSJ
Rule: each line always begins with a 2-character code (AB|AC|AD), followed by any number of 7-character codes.
With this regex:
^(AB|AC|AD)|((\S{7})?)
in this groovy code sample:
def m= Pattern.compile(/^(AB|AC|AD)|((\S{7})?)/).matcher("AC DJKEJKW SJKLAJL JSHELSJ")
println m.getCount()
I always get 8 as the count, which means it counts the spaces.
How do I get 4 groups (as expected), without the spaces?
Thanks from a not-yet-regex-expert
Sven
Using this code:
def input = [ 'AD ABCDEFG HIJKLMN', 'AB HIJKLMN', 'AC DJKEJKW SJKLAJL JSHELSJ' ]
def regexp = /^(AB|AC|AD)|((\S{7})+)/
def result = input.collect {
    def matcher = ( it =~ regexp )
    println "got $matcher.count for $it"
    matcher.collect { it[0] }
}
println result
I get the output
got 3 for AD ABCDEFG HIJKLMN
got 2 for AB HIJKLMN
got 4 for AC DJKEJKW SJKLAJL JSHELSJ
[[AD, ABCDEFG, HIJKLMN], [AB, HIJKLMN], [AC, DJKEJKW, SJKLAJL, JSHELSJ]]
Is this more what you wanted?
This pattern will match your requirements
^A[BCD](?:\s\S{7})+
See it here online on Regexr
Meaning: start with A, then either a B, a C, or a D. This is followed by at least one group consisting of a whitespace character followed by 7 non-whitespace characters.
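The thread uses Groovy, but the same validate-then-split idea can be illustrated with a short Python sketch. The function name parse_line is hypothetical, and the pattern follows the one above in requiring at least one 7-character code.

```python
import re

# A 2-character prefix, then one or more groups of whitespace plus 7 non-whitespace chars.
LINE = re.compile(r"(AB|AC|AD)((?:\s\S{7})+)")


def parse_line(line):
    """Return [prefix, code, code, ...] for a valid line, or None otherwise."""
    m = LINE.fullmatch(line)
    if m is None:
        return None
    # The second group holds all codes with their separators; split() drops the spaces.
    return [m.group(1)] + m.group(2).split()


print(parse_line("AC DJKEJKW SJKLAJL JSHELSJ"))  # ['AC', 'DJKEJKW', 'SJKLAJL', 'JSHELSJ']
print(parse_line("AB HIJKLMN"))                  # ['AB', 'HIJKLMN']
print(parse_line("AB TOOLONGCODE"))              # None
```

Matching the whole line first and splitting afterwards keeps the spaces out of the result and also rejects lines whose codes do not have exactly 7 characters.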