Find overlapping matches and submatches using regular expressions in Python - regex

I have a string of characters (a DNA sequence) with a regular expression I designed to filter out possible matches, (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG). Later I apply two filter conditions:
The sequence must be divisible by 3, len(match) % 3 == 0, and
There must be no stop codon (['AGA', 'AGG', 'TAA', 'TAG']) before the end of the string, not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]).
However, when I apply these filters, I get no matches at all (before the filters I get around 200 matches). I search for substrings in the full sequence by running re.findall(regex, genome_fasta, overlapped=True), because matches could be submatches of other matches.
Is there something about regular expressions that I'm misunderstanding? To my knowledge, after the filters I should still have matches.
If there's anything else I need to add please let me know! (I'm using the regex package for Python 3.4, not the standard re package, because re has no overlap support.)
EDIT 1:
Per comment: I'm looking for ORFs in the mitochondrial genome, but only considering those at least 150 nucleotides (characters) in length. Considering overlap is important because a match could include the first start codon in the string and the last stop codon in the string, but there could be another start codon in the middle. For example:
Given "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG", matches should be "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG" but also "ATAAAGCCATTTACCGTACATAGCACATTATAA", since both "TAG" and "TAA" are stop codons.
EDIT 2:
I totally misunderstood the comment; the full code for the method is:
typical_regex = r"%s[ATCGN]{%s,%s}%s" % (proc_start_codons, str(minimum_orf_length - 6), str(maximum_orf_length - 6), proc_stop_codons)
typical_fwd_matches = []
if re.search(typical_regex, genome_fasta, overlapped=True):
    for match in re.findall(typical_regex, genome_fasta, overlapped=True):
        if len(match) % 3 == 0:
            if not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]):
                typical_fwd_matches.append(match)
print(typical_fwd_matches)
The typical_fwd_matches array is empty and the regex is rendered as (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG) when printed to console/file.

I think you can do it this way.
Each subset is an ever-shorter piece of the previous match, and that's about all you have to do, so the regex is fairly straightforward to design.
The regex will only match multiples of 3 chars.
The beginning and middle are captured in group 1.
Group 1 (the last match minus its last 3 chars) is then used as the new text value for the next search.
Regex explained:
(                              # (1 start), whole match minus last 3 chars
  (?: ATA | ATT )              # Starts with one of these 3-char sequences
  (?:                          # Cluster group
    [ATCGN]{3}                 # Any 3-char sequence consisting of these chars
  )+                           # End cluster, do 1 to many times
)                              # (1 end)
(?: AGA | AGG | TAA | TAG )    # Last 3-char sequence, one of these
Python code sample:
import re
r = re.compile(r"((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)")
text = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"
m = r.search(text)
while m:
    print("Found: " + m.group(0))
    text = m.group(1)
    m = r.search(text)
Output:
Found: ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
Found: ATAAAGCCATTTACCGTACATAGCACATTATAA
Found: ATTTACCGTACATAG
Using this method, the subsets being tested are these:
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAAC
ATAAAGCCATTTACCGTACATAGCACATTA
ATTTACCGTACA
We can benchmark the time the regex takes to match these.
Regex1: ((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 1.63 s, 1627.59 ms, 1627594 µs
Matches per sec: 92,160
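As a complement to the loop above, here is a minimal sketch of a different approach that folds the question's two filters into the pattern itself. It uses the standard re module, wraps the pattern in a lookahead so that every start codon is tried (an alternative to overlapped=True), and stops at the first in-frame stop codon. The function name find_orfs and the min_length parameter are hypothetical and only serve the illustration.

```python
import re

START = "ATA|ATT"
STOP = "AGA|AGG|TAA|TAG"

# Lookahead + capture: the match itself is zero-width, so the scan
# advances one character at a time and overlapping ORFs are all reported.
ORF_PATTERN = re.compile(
    r"(?=((?:%s)(?:(?!%s)[ATCGN]{3})*(?:%s)))" % (START, STOP, STOP)
)

def find_orfs(seq, min_length=0):
    """Return, for each start codon, the ORF ending at the first in-frame stop.

    Hypothetical helper: the length is a multiple of 3 by construction,
    and no stop codon can occur in frame before the final one.
    """
    return [m.group(1) for m in ORF_PATTERN.finditer(seq)
            if len(m.group(1)) >= min_length]

seq = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"
for orf in find_orfs(seq):
    print(len(orf), orf)
```

Under these assumptions each start codon yields at most one ORF, so the full 60-character example string is not reported: it contains an in-frame TAA, which is exactly what the question's second filter rejects.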

Related

Regex quantifier more than one group

I need a regex to get a sequence of number 1 followed by number 0 and the total numbers should be equal to a max length. Is there a way to do something like (([1]+)([0]+)){maxLength} ?
Ex.:
maxLength = 7
10 -> should not pass (total length < maxLength)
1111100 -> should match
1000000 -> should match
11110000000 -> should match 1111000.
111111111111 -> should match 1111111.
Plus: The sequence could be 0 followed by 1, and the greater the amount of 1 the better (I don't know if it's possible in only one regex).
000000001111 -> should get 0001111.
I'm focusing on 1 followed by 0.
I started with [1]+[0]+,
then I quantified the 0s with ([1]+)([0]{1,7}),
but it still gives more 0s than I want.
Then I thought of ([1]{7,}|[1]{6}[0]{1}|[1]{5}[0]{2}|[1]{4}[0]{3}|[1]{3}[0]{4}|[1]{2}[0]{5}|[1]{1}[0]{6}),
and OK, it works. BUT if maxLength = 100 the above solution is not viable.
Is there some way to count the length of the first matched group and then the second group to be the difference from the first one?
Or something like (([1]+)([0]+)){7} ?
My attempt using branch reset group:
0*(?|(1[10]{6})|([10]{6}1))
See an online demo. You can use the result from 1st capture group.
0* - Zero or more literal zeros (greedy), up to:
(?| - Open branch reset group:
(1[10]{6}) - 1st capture group: a literal 1 followed by 6 ones or zeros.
| - Or:
([10]{6}1) - 1st capture group: 6 ones or zeros followed by a literal 1.
) - Close branch reset group.
It seems you just want:
^(?:(?=1+0*$)|(?=0+1*$))[01]{7}
Here the {7} can be replaced with whatever the max length is.
I think the regex can be as simple as:
/0*([01]{7})/
example:
const result = `
10
1111100
1000000
11110000000
111111111111
000000001111
`.split("\n").reduce((acc, str) => {
  const m = str.match(/0*([01]{7})/);
  m && acc.push(m[1]);
  return acc;
}, []);
console.log(result);
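For reference, a rough Python port of the simple pattern above, assuming the max length is passed in as a parameter so that the same idea also covers maxLength = 100. The function name trim_to_length is hypothetical. Like the original pattern, it only skips leading zeros and takes the next max_length bits; it does not verify that the result has the 1s-then-0s shape.

```python
import re

def trim_to_length(bits, max_length=7):
    """Skip leading zeros greedily, then capture exactly max_length bits.

    Hypothetical helper; returns None when the input is too short.
    """
    m = re.search(r"0*([01]{%d})" % max_length, bits)
    return m.group(1) if m else None

for s in ["10", "1111100", "1000000", "11110000000",
          "111111111111", "000000001111"]:
    print(s, "->", trim_to_length(s))
```

Because 0* is greedy but can backtrack, the regex engine gives back just enough leading zeros to fill the window, which is how 000000001111 ends up as 0001111.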

Regex - Match n occurences of substring within any m-lettered window

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where an 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.
I tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.
Does anyone have an idea how to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
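Iterating over the string could look like the following minimal sketch. It records the positions of every 1 and slides a window over n consecutive positions, keeping the windows that span at most m characters. The function name windows_with_ones and its parameters are hypothetical, and the sample string is a shortened version of the input from the question.

```python
def windows_with_ones(s, n=7, m=20):
    """Return every substring that starts and ends on a '1', contains
    exactly n ones, and is at most m characters long (hypothetical helper)."""
    ones = [i for i, ch in enumerate(s) if ch == "1"]
    hits = []
    for k in range(len(ones) - n + 1):
        start, end = ones[k], ones[k + n - 1]
        if end - start + 1 <= m:
            hits.append(s[start:end + 1])
    return hits

sample = "0000000110000000111011110000"
print(windows_with_ones(sample))
```

This reports overlapping windows as well; merging or de-duplicating them is left out of the sketch.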
You could also iterate over the matches. Example in Python:
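import re
# string holds the input text; the lookahead lets the matches overlap
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
# keep only the runs of seven 1s that span at most 20 characters
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]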
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
r = r'''(?mx)               # inline flags placed first; newer Python versions reject them elsewhere
(                           # the 1st capturing group will contain the desired sequence
  1                         # this sequence should begin with 1
  (?=(?:[01]{6,19})         # let's see that there are enough 0s and 1s in a line
  (.*$))                    # the 2nd capturing group will contain all characters to the end of a line
  (?:0*1){6})               # there must be six more 1s in the sequence
(?=.{0,13}                  # complement the 1st capturing group to 20 characters
\2)                         # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's regex engine does not work the way you intended. You could split the string and find distinct rows using the general method from "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and also for strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN GROUP (ORDER BY n) AS n
FROM (
    SELECT DISTINCT TO_NUMBER(column_value) AS n
    FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Unfortunately, Oracle doesn't provide a token to match a word boundary position: neither the familiar \b token nor the ancient [[:<:]] or [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence
of digits would be to use:
a lookbehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote
the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either start of the string or a dot (group 1), instead of the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits which caught group 2.
)+ - End of group 3, it may occur multiple times.
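For comparison, the lookaround version described above can be sketched in Python, whose re module does support lookbehind and lookahead. This is only an illustration of the idea, not an Oracle solution; the helper name dedupe and the exact pattern are assumptions of this sketch. The first call reproduces the partial match from the question, the second anchors each number between delimiters.

```python
import re

raw = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"

# The question's pattern: the unescaped dot matches any character,
# so "1.1" is found inside "11.15".
broken = re.sub(r"([^.]+)(.[ ]*\1)+", r"\1", raw)

def dedupe(value):
    """Collapse runs of identical dot-separated tokens (hypothetical helper).

    (?<![^.]) means: preceded by a dot or the start of the string.
    (?![^.])  means: followed by a dot or the end of the string.
    """
    return re.sub(r"(?<![^.])([^.]+)(?:\.\1)+(?![^.])", r"\1", value)

print(broken)
print(dedupe(raw))
```

The trailing lookahead matters as much as the leading lookbehind: without it, an input such as 1.15 would still lose a digit.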
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match 11.15 as follows:
SELECT
    '11.15' val,
    regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
FROM
    dual;

VAL    DEDUPLICATED
11.15  115
Here is a similar approach to address those problems:
matching pattern composition
-Look for a non-period matching list of length 0 to N (subexpression is referenced by \1).
'19' which matches ([^.]*)
-Look for the repeats which form our second matching list associated with subexpression 2, referenced by \2.
'19.19.19' which matches ([^.]*)([.]\1)+
-Look for either a period or end of string. This is matching list referenced by \3. This fixes the match of '11.15' by '115'.
([.]|$)
replacement string
I replace the match pattern with a replacement string composed of the first instance of the non-period matching list.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
WITH tst AS (
    SELECT
        '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
    FROM
        dual
    UNION ALL
    SELECT
        '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
    FROM
        dual
    UNION ALL
    SELECT
        '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
    FROM
        dual
) SELECT
    val,
    regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
FROM
    tst;

VAL                                               DEDUPLICATE
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19                1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19  1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19          1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).

Convert a regex expression to erlang's re syntax?

I am having a hard time trying to convert the following regular expression into Erlang syntax.
What I have is a test string like this:
1,2 ==> 3 #SUP: 1 #CONF: 1.0
And the regex that I created with regex101 is this (see below):
([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)
But I am getting weird match results if I convert it to erlang - here is my attempt:
{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
Also, I get more than four matches. What am I doing wrong?
Here is the regex101 version:
https://regex101.com/r/xJ9fP2/1
I don't know much about erlang, but I will try to explain. With your regex
>{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
>re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
{match,[{0, 28},{0,3},{8,1},{16,1},{25,3}]}
         ^  ^^
         |  ||
         |  Total number of matched characters from starting index
         Starting index of match
Reason for more than four groups
The first element always describes the entire string matched by the complete regex; the rest are the four captured groups you want. So there are 5 entries in total.
([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)
<------->         <---->             <--->              <--------->
First group       Second group       Third group        Fourth group
<----------------------------------------------------------------->
This regex matches entire string and is first match you are getting
(Zero'th group)
How to find desired answer
Here we want anything except the first group (which is entire match by regex). So we can use all_but_first to avoid the first group
> re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M, [{capture, all_but_first, list}]).
{match,["1,2","3","1","1.0"]}
More info can be found here
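The same split between the whole match and the captured groups can be observed in Python, shown here only as a cross-check of the numbers above and not as Erlang code. The sketch keeps the regex from the question unchanged and converts Python's (start, end) spans into (start, length) pairs so they can be compared with the re:run output; the variable names are arbitrary.

```python
import re

pattern = re.compile(
    r"([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)"
)
m = pattern.search("1,2 ==> 3 #SUP: 1 #CONF: 1.0")

whole = m.group(0)      # group 0: everything the regex matched
captures = m.groups()   # groups 1..4 only, like all_but_first

# (start, length) pairs for group 0 and every capture group
spans = [(m.start(i), m.end(i) - m.start(i))
         for i in range(pattern.groups + 1)]

print(whole)
print(captures)
print(spans)
```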
If you are in doubt what is content of the string, you can print it and check out:
1> RE = "([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)".
"([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)"
2> io:format("RE: /~s/~n", [RE]).
RE: /([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)/
For the rest of the issue, there is a great answer by rock321987.

Regular expression for bit strings with even number of 1s

Let L= { w in (0+1)* | w has even number of 1s}, i.e. L is the set of all bit strings with even number of 1s. Which one of the regular expressions below represents L?
A) (0*10*1)*
B) 0*(10*10*)*
C) 0*(10*1)* 0*
D) 0*1(10*1)* 10*
In my view option D can never be correct because it does not represent the bit string with zero 1s. But what about the other options? We are only concerned with whether the number of 1s is even; the number of zeros doesn't matter.
Then which is the correct option and why?
A is false. It doesn't match 0110 (or any non-empty zeros-only string).
B is correct. I won't bother proving it here since the page margins are too small.
C doesn't match 010101010 (the zero in the middle is not matched).
D, as you said, doesn't match 00 or any other string with no ones.
So only B.
To solve such a problem you should
Supply counterexample patterns to all "incorrect" regexps. This will be either a string in L that is not matched, or a matched string out of L.
To prove the remaining "correct" pattern, you should answer two questions:
Does every string that matches the pattern belong to L? This can be done by devising properties each of matched strings should satisfy--for example, number of occurrences of some character...
Is every string in L matched by the regexp? This is done by dividing L into easily analyzable subclasses, and showing that each of them matches pattern in its own way.
(No concrete answers due to [homework]).
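The two questions above can also be explored empirically before attempting a proof. The following sketch enumerates every bit string up to a small length and reports the first one on which a pattern disagrees with the definition of L. It is evidence, not a proof; the function name first_counterexample and the max_len bound are assumptions of this sketch, and the formatting spaces in options C and D are left out.

```python
import re
from itertools import product

OPTIONS = {
    "A": "(0*10*1)*",
    "B": "0*(10*10*)*",
    "C": "0*(10*1)*0*",
    "D": "0*1(10*1)*10*",
}

def first_counterexample(pattern, max_len=8):
    """Return the first bit string (shortest first) that the pattern
    classifies differently from L, or None if there is none up to max_len."""
    regex = re.compile(pattern)
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            w = "".join(bits)
            in_language = w.count("1") % 2 == 0
            # fullmatch: the whole string has to match, not just a prefix
            if bool(regex.fullmatch(w)) != in_language:
                return w
    return None

for name, pattern in OPTIONS.items():
    print(name, repr(first_counterexample(pattern)))
```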
Examining the pattern B:
^0*(10*10*)*$
^ # match beginning of string
0* # match zero or more '0'
( # start group 1
10* # match '1' followed by zero or more '0'
10* # match '1' followed by zero or more '0'
)* # end group 1 - match zero or more times
$ # end of string
It's pretty obvious that this pattern will only match strings that have 0, 2, 4, ... 1s.
Look for examples that should match but don't. 0, 11011, and 1100 should all match, but each one fails for one of those four.
C is incorrect because it does not allow any 0s between the second 1 of one group and the first 1 of the next group.
This answer would be best for this language
(0*10*10*)
A quick Python script, using fullmatch so that the whole string has to match, eliminates every option except B:
import re

# The spaces in options C and D are only formatting, so they are left out here.
patterns = {
    'a': "(0*10*1)*",
    'b': "0*(10*10*)*",
    'c': "0*(10*1)*0*",
    'd': "0*1(10*1)*10*",
}
candidates = {name: re.compile(p) for name, p in patterns.items()}

# Strings with an even number of 1s: every correct candidate must match them.
tests = ['0', '0110', '1100', '0011', '11011']
for test in tests:
    for name in list(candidates):  # iterate over a copy while removing
        # fullmatch, because match() would also accept a matching prefix
        if not candidates[name].fullmatch(test):
            del candidates[name]
            print("removed %s because it failed on %s" % (name, test))

# Strings with an odd number of 1s: no correct candidate may match them.
ntests = ['1', '10', '01', '010', '10101']
for test in ntests:
    for name in list(candidates):
        if candidates[name].fullmatch(test):
            del candidates[name]
            print("removed %s because it matched on %s" % (name, test))

print("remaining:", sorted(candidates))
the output:
removed a because it failed on 0
removed d because it failed on 0
removed c because it failed on 11011
remaining: ['b']