R regular expression: find the last but one match - regex

I need a regular expression script (in R language) which finds the last but one match.
Here is an example:
input = c("(test1(test2(test3","(((((othertest1(othertest2(othertest3")
regexpr('the_right_regular_expression_here_which_can_finds_the_last_but_one_'(' ', input)
The result has to be: 7 and 16, because in the first case the last but one '(' is in a 7th position (from left), and in the second case the last but one '(' in in the 16th position (from left).
I've found a regular expression which can find the last match, but I could not transform it in the right way:
\\([^\\(]*$
Thanks for any help!

To match a chunk of text beginning with the last but one (, you may use
"(\\([^(]*){2}$"
Details:
(\\([^(]*){2} - 2 sequences of:
\( - a literal (
[^(]* - zero or more chars other than (
$ - end of string.
R test:
> input = c("(test1(test2(test3","(((((othertest1(othertest2(othertest3")
> regexpr("(\\([^(]*){2}$", input)
[1] 7 16
attr(,"match.length")
[1] 12 22
attr(,"useBytes")
[1] TRUE

Related

Regex - Match n occurences of substring within any m-lettered window

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where an 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.
It tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.
Does anyone have an idea gow to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
You could also iterate over the matches. Example in Python:
import re
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
r = r'''
(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's REGEX does not work the way you intended. You could split the string and find distinct rows using the general method Splitting string into multiple rows in Oracle. Another option is to use XMLTABLE , which works for numbers and also strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN
GROUP (
ORDER BY n
) AS n
FROM (
SELECT DISTINCT TO_NUMBER(column_value) AS n
FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Demo
Unfortunately Oracle doesn't provide a token to match a word boundary position. Neither familiar \b token nor ancient [[:<:]] or [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence
of digits would be to use:
a loobehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote
the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either start of the string or a dot (group 1), instead of
the loobehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits which caught group 2.
)+ - End of group 3, it may occur multiple times.
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match 11.15 as follow:
SCOTT#DB>SELECT
2 '11.15' val,
3 regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
4 FROM
5 dual;
VAL DEDUPLICATED
11.15 115
Here is a similar approach to address those problems:
matching pattern composition
-Look for a non-period matching list of length 0 to N (subexpression is referenced by \1).
'19' which matches ([^.]*)
-Look for the repeats which form our second matching list associated with subexression 2, referenced by \2.
'19.19.19' which matches ([^.]*)([.]\1)+
-Look for either a period or end of string. This is matching list referenced by \3. This fixes the match of '11.15' by '115'.
([.]|$)
replacement string
I replace the match pattern with a replacement string composed of the first instance of the non-period matching list.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
SCOTT#db>WITH tst AS (
2 SELECT
3 '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
4 FROM
5 dual
6 UNION ALL
7 SELECT
8 '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
9 FROM
10 dual
11 UNION ALL
12 SELECT
13 '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
14 FROM
15 dual
16 ) SELECT
17 val,
18 regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
19 FROM
20 tst;
VAL DEDUPLICATE
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19 1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).

Find overlapping matches and submatches using regular expressions in Python

I have a string of characters (a DNA sequence) with a regular expression I designed to filter out possible matches, (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG). Later I apply two filter conditions:
The sequence must be divisible by 3, len(match) % 3 == 0, and
There must be no stop codon (['AGA', 'AGG', 'TAA', 'TAG']) before the end of the string, not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]).
However, when I apply these filters, I get no matches at all (before the filters I get around ~200 matches. The way I'm searching for substrings in the full sequence is by running re.findall(regex, genome_fasta, overlapped=True), because matches could be submatches of other matches.
Is there something about regular expressions that I'm misunderstanding? To my knowledge, after the filters I should still have matches.
If there's anything else I need to add please let me know! (I'm using the regex package for Python 3.4, not the standard re package, because it has no overlap support).
EDIT 1:
Per comment: I'm looking for ORFs in the mitochondrial genome, but only considering those at least 150 nucleotides (characters) in length. Considering overlap is important because a match could include the first start codon in the string and the last stop codon in the string, but there could be another start codon in the middle. For example:
Given "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG", matches should be "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG" but also "ATAAAGCCATTTACCGTACATAGCACATTATAA", since both "TAG" and "TAA" are stop codons.
EDIT 2:
Totally, misunderstood comment, full code for method is:
typical_regex = r"%s[ATCGN]{%s,%s}%s" % (proc_start_codons, str(minimum_orf_length - 6), str(maximum_orf_length - 6), proc_stop_codons)
typical_fwd_matches = []
if re.search(typical_regex, genome_fasta, overlapped=True):
for match in re.findall(typical_regex, genome_fasta, overlapped=True):
if len(match) % 3 == 0:
if not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]):
typical_fwd_matches.append(match)
print(typical_fwd_matches)
The typical_fwd_matches array is empty and the regex is rendered as (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG) when printed to console/file.
I think you can do it this way.
The subsets will consist of ever decreasing size of the previous matches.
That's about all you have to do.
So, it's fairly straight forward to design the regex.
The regex will only match multiples of 3 chars.
The beginning and middle are captured in group 1.
This is used for the new text value which is just the last match
minus the last 3 chars.
Regex explained:
( # (1 start), Whole match minus last 3 chars
(?: ATA | ATT ) # Starts with one of these 3 char sequence
(?: # Cluster group
[ATCGN]{3} # Any 3 char sequence consisting of these chars
)+ # End cluster, do 1 to many times
) # (1 end)
(?: AGA | AGG | TAA | TAG ) # Last 3 char sequence, one of these
Python code sample:
Demo
import re
r = re.compile(r"((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)")
text = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"
m = r.search(text)
while m:
print("Found: " + m.group(0))
text = m.group(1)
m = r.search(text)
Output:
Found: ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
Found: ATAAAGCCATTTACCGTACATAGCACATTATAA
Found: ATTTACCGTACATAG
Using this method, the subsets being tested are these:
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAAC
ATAAAGCCATTTACCGTACATAGCACATTA
ATTTACCGTACA
We can benchmark the time the regex takes to match these.
Regex1: ((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 1.63 s, 1627.59 ms, 1627594 µs
Matches per sec: 92,160

Convert a regex expression to erlang's re syntax?

I am having hard time trying to convert the following regular expression into an erlang syntax.
What I have is a test string like this:
1,2 ==> 3 #SUP: 1 #CONF: 1.0
And the regex that I created with regex101 is this (see below):
([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)
:
But I am getting weird match results if I convert it to erlang - here is my attempt:
{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
Also, I get more than four matches. What am I doing wrong?
Here is the regex101 version:
https://regex101.com/r/xJ9fP2/1
I don't know much about erlang, but I will try to explain. With your regex
>{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
>re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
{match,[{0, 28},{0,3},{8,1},{16,1},{25,3}]}
^^ ^^
|| ||
|| Total number of matched characters from starting index
Starting index of match
Reason for more than four groups
First match always indicates the entire string that is matched by the complete regex and rest here are the four captured groups you want. So there are total 5 groups.
([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)
<-------> <----> <---> <--------->
First group Second group Third group Fourth group
<----------------------------------------------------------------->
This regex matches entire string and is first match you are getting
(Zero'th group)
How to find desired answer
Here we want anything except the first group (which is entire match by regex). So we can use all_but_first to avoid the first group
> re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M, [{capture, all_but_first, list}]).
{match,["1,2","3","1","1.0"]}
More info can be found here
If you are in doubt what is content of the string, you can print it and check out:
1> RE = "([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)".
"([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)"
2> io:format("RE: /~s/~n", [RE]).
RE: /([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)/
For the rest of issue, there is great answer by rock321987.

Reg-ex, Find x then N characters if N+1 == x

OK here is what I have:
(24(?:(?!24).)*)
its works in the fact it finds from 24 till the next 24 but not the 2nd 24... (wow some logic).
like this:
23252882240013152986400000006090000000787865670000004524232528822400513152986240013152986543530000452400
it finds from the 1st 24 till the next 24 but does not include it, so the strings it finds are:
23252882 - 2400131529864000000060900000007878656700000045 - 2423252882 - 2400513152986 - 24001315298654353000045 - 2400
that is half of what I want it to do, what I need it to find is this:
23252882 - 2400131529864000000060900000007878656700000045 - 2423252882240051315298624001315298654353000045 - 2400
lets say:
x = 24
n = 46
I need to:
find x then n characters if the n+1 character == x
so find the start take then next 46, and the 45th must be the start of the next string, including all 24's in that string.
hope this is clear.
Thanks in advance.
EDIT
answer = 24.{44}(?=24)
You're almost there.
First, find x (24):
24
Then, find n=46 characters, where the 46 includes the original 24 (hence 44 left):
.{44}
The following character must be x (24):
(?=24)
All together:
24.{44}(?=24)
You can play around with it here.
In terms of constructing such a regex from a given x, n, your regex consists of
x.{n-number_of_characters(x)}(?=x)
where you substitute in x as-is and calculate n-number_of_characters(x).
Try this:
(?(?=24)(.{46})|(.{25})(.{24}))
Explanation:
<!--
(?(?=24)(.{46})|(.{25})(.{24}))
Options: case insensitive; ^ and $ match at line breaks
Do a test and then proceed with one of two options depending on the result of the text «(?(?=24)(.{46})|(.{25})(.{24}))»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=24)»
Match the characters “24” literally «24»
If the test succeeded, match the regular expression below «(.{46})»
Match the regular expression below and capture its match into backreference number 1 «(.{46})»
Match any single character that is not a line break character «.{46}»
Exactly 46 times «{46}»
If the test failed, match the regular expression below if the test succeeded «(.{25})(.{24})»
Match the regular expression below and capture its match into backreference number 2 «(.{25})»
Match any single character that is not a line break character «.{25}»
Exactly 25 times «{25}»
Match the regular expression below and capture its match into backreference number 3 «(.{24})»
Match any single character that is not a line break character «.{24}»
Exactly 24 times «{24}»
-->