Find value of a string given a superstring regex - regex

How can I match for a string that is a substring of a given input string, preferable with regex?
Given a value: A789Lfu891MatchMe2ENOTSTH, construct a regex that would match a string where the string is a substring of the given value.
Expected matches:
MatchMe
ENOTST
891
Expected Non Match
foo
A789L<fu891MatchMe2ENOTSTH_extra
extra_A789L<fu891MatchMe2ENOTSTH
extra_A789L<fu891MatchMe2ENOTSTH_extra
It seems easier for me to do the reverse: /\w*MatchMe\w*/, but I can't wrap my head around this problem.
Something like how SQL would do it:
SELECT * FROM my_table mt WHERE 'A789Lfu891MatchMe2ENOTSTH' LIKE '%' || mt.foo || '%';

Prefix suffixes
You could alternate prefix suffixes, like turn the superstring abcd into a pattern like ^(a|(a)?b|((a)?b)?c|(((a)?b)?c)?d)$. For your example, the pattern has 1253 characters (exactly 2000 fewer than #tobias_k's).
Python code to produce the regex, can then be tested with tobias_k's code (try it online):
from itertools import accumulate
t = "A789Lfu891MatchMe2ENOTSTH"
p = '^(' + '|'.join(accumulate(t, '({})?{}'.format)) + ')$'
Suffix prefixes
Suffix prefixes look nicer and match faster: ^(a(b(c(d)?)?)?|b(c(d)?)?|c(d)?|d)$. Sadly the generating code is less elegant.
Divide and conquer
For a shorter regex, we can use divide and conquer. For example for the superstring abcdefg, every substring falls into one of three cases:
Contains the middle character (the d). Pattern for that: ((a?b)?c)?d(e(fg?)?)?
Is left of that middle character. So recursively build a regex for the superstring abc: a|a?bc?|c.
Is right of that middle character. So recursively build a regex for the superstring efg: e|e?fg?|g.
And then make an alternation of those three cases:
a|a?bc?|c|((a?b)?c)?d(e(fg?)?)?|e|e?fg?|g
Regex length will be Θ(n log n) instead of our previous Θ(n2).
The result for your superstring example of 25 characters is this regex with 301 characters:
^(A|A?78?|8|((A?7)?8)?9(Lf?)?|Lf?|f|(((((A?7)?8)?9)?L)?f)?u(8(9(1(Ma?)?)?)?)?|89?|9|(8?9)?1(Ma?)?|Ma?|a|(((((((((((A?7)?8)?9)?L)?f)?u)?8)?9)?1)?M)?a)?t(c(h(M(e(2(E(N(O(T(S(TH?)?)?)?)?)?)?)?)?)?)?)?|c|c?hM?|M|((c?h)?M)?e(2E?)?|2E?|E|(((((c?h)?M)?e)?2)?E)?N(O(T(S(TH?)?)?)?)?|OT?|T|(O?T)?S(TH?)?|TH?|H)$
Benchmark
Speed benchmarks don't really make that much sense, as in reality we'd just do a regular substring check (in Python s in t), but let's do one anyway.
Results for matching all substrings of your superstring, using Python 3.9.6 on my PC:
1.09 ms just_all_substrings
25.18 ms prefix_suffixes
3.47 ms suffix_prefixes
3.46 ms divide_and_conquer
And on TIO and their "Python 3.8 (pre-release)" with quite different results:
0.30 ms just_all_substrings
46.90 ms prefix_suffixes
2.24 ms suffix_prefixes
2.95 ms divide_and_conquer
Regex lengths (also printed by the below benchmark code):
3253 characters - just_all_substrings
1253 characters - prefix_suffixes
1253 characters - suffix_prefixes
301 characters - divide_and_conquer
Benchmark code (Try it online!):
from timeit import repeat
import re
from itertools import accumulate
def just_all_substrings(t):
return "^(" + '|'.join(t[i:k] for i in range(0, len(t))
for k in range(i+1, len(t)+1)) + ")$"
def prefix_suffixes(t):
return '^(' + '|'.join(accumulate(t, '({})?{}'.format)) + ')$'
def suffix_prefixes(t):
return '^(' + '|'.join(list(accumulate(t[::-1], '{1}({0})?'.format))[::-1]) + ')$'
def divide_and_conquer(t):
def suffixes(t):
# Example: abc => ((a?b)?c)?
regex = f'{t[0]}?'
for c in t[1:]:
regex = f'({regex}{c})?'
return regex
def prefixes(t):
# Example: efg => (e(fg?)?)?
regex = f'{t[-1]}?'
for c in t[-2::-1]:
regex = f'({c}{regex})?'
return regex
def superegex(t):
n = len(t)
if n == 1:
return t
if n == 2:
return f'{t}?|{t[1]}'
mid = n // 2
contain = suffixes(t[:mid]) + t[mid] + prefixes(t[mid+1:])
left = superegex(t[:mid])
right = superegex(t[mid+1:])
return '|'.join([left, contain, right])
return '^(' + superegex(t) + ')$'
creators = just_all_substrings, prefix_suffixes, suffix_prefixes, divide_and_conquer,
t = "A789Lfu891MatchMe2ENOTSTH"
substrings = [t[start:stop]
for start in range(len(t))
for stop in range(start+1, len(t)+1)]
def test(p):
match = re.compile(p).match
return all(map(re.compile(p).match, substrings))
for creator in creators:
print(test(creator(t)), creator.__name__)
print()
print('Regex lengths:')
for creator in creators:
print('%5d characters -' % len(creator(t)), creator.__name__)
print()
for _ in range(3):
for creator in creators:
p = creator(t)
number = 10
time = min(repeat(lambda: test(p), number=number)) / number
print('%5.2f ms ' % (time * 1e3), creator.__name__)
print()

One way to "construct" such a regex would be to build a disjunction of all possible substrings of the original value. Example in Python:
import re
t = "A789Lfu891MatchMe2ENOTSTH"
p = "^(" + '|'.join(t[i:k] for i in range(0, len(t))
for k in range(i+1, len(t)+1)) + ")$"
good = ["MatchMe", "ENOTST", "891"]
bad = ["foo", "A789L<fu891MatchMe2ENOTSTH_extra",
"extra_A789L<fu891MatchMe2ENOTSTH",
"extra_A789L<fu891MatchMe2ENOTSTH_extra"]
assert all(re.match(p, s) is not None for s in good)
assert all(re.match(p, s) is None for s in bad)
For the value "abcd", this would e.g. be "^(a|ab|abc|abcd|b|bc|bcd|c|cd|d)$"; for the given example it's a bit longer, with 3253 characters...

Related

matching two or more characters that are not the same

Is it possible to write a regex pattern to match abc where each letter is not literal but means that text like xyz (but not xxy) would be matched? I am able to get as far as (.)(?!\1) to match a in ab but then I am stumped.
After getting the answer below, I was able to write a routine to generate this pattern. Using raw re patterns is much faster than converting both the pattern and a text to canonical form and then comaring them.
def pat2re(p, know=None, wild=None):
"""return a compiled re pattern that will find pattern `p`
in which each different character should find a different
character in a string. Characters to be taken literally
or that can represent any character should be given as
`know` and `wild`, respectively.
EXAMPLES
========
Characters in the pattern denote different characters to
be matched; characters that are the same in the pattern
must be the same in the text:
>>> pat = pat2re('abba')
>>> assert pat.search('maccaw')
>>> assert not pat.search('busses')
The underlying pattern of the re object can be seen
with the pattern property:
>>> pat.pattern
'(.)(?!\\1)(.)\\2\\1'
If some characters are to be taken literally, list them
as known; do the same if some characters can stand for
any character (i.e. are wildcards):
>>> a_ = pat2re('ab', know='a')
>>> assert a_.search('ad') and not a_.search('bc')
>>> ab_ = pat2re('ab*', know='ab', wild='*')
>>> assert ab_.search('abc') and ab_.search('abd')
>>> assert not ab_.search('bad')
"""
import re
# make a canonical "hash" of the pattern
# with ints representing pattern elements that
# must be unique and strings for wild or known
# values
m = {}
j = 1
know = know or ''
wild = wild or ''
for c in p:
if c in know:
m[c] = '\.' if c == '.' else c
elif c in wild:
m[c] = '.'
elif c not in m:
m[c] = j
j += 1
assert j < 100
h = tuple(m[i] for i in p)
# build pattern
out = []
last = 0
for i in h:
if type(i) is int:
if i <= last:
out.append(r'\%s' % i)
else:
if last:
ors = '|'.join(r'\%s' % i for i in range(1, last + 1))
out.append('(?!%s)(.)' % ors)
else:
out.append('(.)')
last = i
else:
out.append(i)
return re.compile(''.join(out))
You may try:
^(.)(?!\1)(.)(?!\1|\2).$
Demo
Here is an explanation of the regex pattern:
^ from the start of the string
(.) match and capture any first character (no restrictions so far)
(?!\1) then assert that the second character is different from the first
(.) match and capture any (legitimate) second character
(?!\1|\2) then assert that the third character does not match first or second
. match any valid third character
$ end of string

Recursive regex to check if string length is a power of 3?

I'm attempting to create a regex pattern that will only match strings containing only periods (.) with a length that is a power of 3. Obviously I could manually repeat length checks against powers of 3 up until the length is no longer feasible, but I would prefer a short pattern.
I wrote this python method to help explain what I want to do:
#n = length
def check(n):
if n == 1:
return True
elif n / 3 != n / 3.0:
return False
else:
return check(n / 3)
To clarify, the regex should only match ., ..., ........., ..........................., (length 1, 3, 9, 27) etc.
I've read up on regex recursion, making use of (?R) but I haven't been able to put anything together that works correctly.
Is this possible?
It can be done using regex (handles any character, not just period):
def length_power_three(string):
regex = r"."
while True:
match = re.match(regex, string)
if not match:
return False
if match.group(0) == string:
return True
regex = r"(?:" + regex + r"){3}"
But I understand you really want to know if it can be done with a single regex.
UPDATE
I forgot that you asked for a recursive solution:
def length_power_three(string, regex = r"."):
match = re.match(regex, string)
if not match:
return False
if match.group(0) == string:
return True
return length_power_three(string, r"(?:" + regex + r"){3}")

REGEX - Matching any character which repeats n times

How to match any character which repeats n times?
Example:
for input: abcdbcdcdd
for n=1: ..........
for n=2: .........
for n=3: .. .....
for n=4: . . ..
for n=5: no matches
After several hours my best is this expression
(\w)(?=(?:.*\1){n-1,}) //where n is variable
which uses lookahead. However the problem with this expression is this:
for input: abcdbcdcdd
for n=1 ..........
for n=2 ... .. .
for n=3 .. .
for n=4 .
for n=5 no matches
As you can see, when lookahead matches for a character, let's look for n=4 line, d's lookahead assertion satisfied and first d matched by regex. But remaining d's are not matched because they don't have 3 more d's ahead of them.
I hope I stated the problem clearly. Hoping for your solutions, thanks in advance.
let's look for n=4 line, d's lookahead assertion satisfied
and first d matched by regex.
But remaining d's are not matched because they don't have 3 more d's
ahead of them.
And obviously, without regex, this is a very simple string manipulation
problem. I'm trying to do this with and only with regex.
As with any regex implementation, the answer depends on the regex flavour. You could create a solution with .net regex engine, because it allows variable width lookbehinds.
Also, I'll provide a more generalized solution below for perl-compatible/like regex flavours.
.net Solution
As #PetSerAl pointed out in his answer, with variable width lookbehinds, we can assert back to the beggining of the string, and check there are n occurrences.
ideone demo
regex module in Python
You can implement this solution in python, using the regex module by Matthew Barnett, which also allows variable-width lookbehinds.
>>> import regex
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){2})\A.*)', 'abcdbcdcdd')
['b', 'c', 'd', 'b', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){3})\A.*)', 'abcdbcdcdd')
['c', 'd', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){4})\A.*)', 'abcdbcdcdd')
['d', 'd', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){5})\A.*)', 'abcdbcdcdd')
[]
Generalized Solution
In pcre or any of the "perl-like" flavours, there is no solution that would actually return a match for every repeated character, but we could create one, and only one, capture for each character.
Strategy
For any given n, the logic involves:
Early matches: Match and capture every character followed by at least n more occurences.
Final captures:
Match and capture a character followed by exactly n-1 occurences, and
also capture every one of the following occurrences.
Example
for n = 3
input = abcdbcdcdd
The character c is Matched only once (as final), and the following 2 occurrences are also Captured in the same match:
abcdbcdcdd
M C C
and the character d is (early) Matched once:
abcdbcdcdd
M
and (finally) Matched one more time, Capturing the rest:
abcdbcdcdd
M CC
Regex
/(\w) # match 1 character
(?:
(?=(?:.*?\1){≪N≫}) # [1] followed by other ≪N≫ occurrences
| # OR
(?= # [2] followed by:
(?:(?!\1).)*(\1) # 2nd occurence <captured>
(?:(?!\1).)*(\1) # 3rd occurence <captured>
≪repeat previous≫ # repeat subpattern (n-1) times
# *exactly (n-1) times*
(?!.*?\1) # not followed by another occurence
)
)/xg
For n =
/(\w)(?:(?=(?:.*?\1){2})|(?=(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){3})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){4})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
... etc.
Pseudocode to generate the pattern
// Variables: N (int)
character = "(\w)"
early_match = "(?=(?:.*?\1){" + N + "})"
final_match = "(?="
for i = 1; i < N; i++
final_match += "(?:(?!\1).)*(\1)"
final_match += "(?!.*?\1))"
pattern = character + "(?:" + early_match + "|" + final_match + ")"
JavaScript Code
I'll show an implementation using javascript because we can check the result here (and if it works in javascript, it works in any perl-compatible regex flavour, including .net, java, python, ruby, perl, and all languages that implemented pcre, among others).
var str = 'abcdbcdcdd';
var pattern, re, match, N, i;
var output = "";
// We'll show the results for N = 2, 3 and 4
for (N = 2; N <= 4; N++) {
// Generate pattern
pattern = "(\\w)(?:(?=(?:.*?\\1){" + N + "})|(?=";
for (i = 1; i < N; i++) {
pattern += "(?:(?!\\1).)*(\\1)";
}
pattern += "(?!.*?\\1)))";
re = new RegExp(pattern, "g");
output += "<h3>N = " + N + "</h3><pre>Pattern: " + pattern + "\nText: " + str;
// Loop all matches
while ((match = re.exec(str)) !== null) {
output += "\nPos: " + match.index + "\tMatch:";
// Loop all captures
x = 1;
while (match[x] != null) {
output += " " + match[x];
x++;
}
}
output += "</pre>";
}
document.write(output);
Python3 code
As requested by the OP, I'm linking to a Python3 implementation in ideone.com
Regular expressions (and finite automata) are not able to count to arbitrary integers. They can only count to a predefined integer and fortunately this is your case.
Solving this problem is much easier if we first construct a nondeterministic finite automata (NFA) ad then convert it to regular expression.
So the following automata for n=2 and input alphabet = {a,b,c,d}
will match any string that has exactly 2 repetitions of any char. If no character has 2 repetitions (all chars appear less or more that two times) the string will not match.
Converting it to regex should look like
"^([^a]*a[^a]*a[^a]*)|([^b]*b[^b]*b[^b]*)|([^b]*c[^c]*c[^C]*)|([^d]*d[^d]*d[^d]*)$"
This can get problematic if the input alphabet is big, so that regex should be shortened somehow, but I can't think of it right now.
With .NET regular expressions you can do following:
(\w)(?<=(?=(?:.*\1){n})^.*) where n is variable
Where:
(\w) — any character, captured in first group.
(?<=^.*) — lookbehind assertion, which return us to the start of the string.
(?=(?:.*\1){n}) — lookahead assertion, to see if string have n instances of that character.
Demo
I would not use regular expressions for this. I would use a scripting language such as python. Try out this python function:
alpha = 'abcdefghijklmnopqrstuvwxyz'
def get_matched_chars(n, s):
s = s.lower()
return [char for char in alpha if s.count(char) == n]
The function will return a list of characters, all of which appear in the string s exactly n times. Keep in mind that I only included letters in my alphabet. You can change alpha to represent anything that you want to get matched.

Avoiding "double search" when using regexp matching groups

Is there a more efficient way of doing this:
if re.search("(?P<value>[0-9]*[.][0-9]*) (?P<units>KB|MB|GB|TB|PB)", line):
m = re.search("(?P<value>[0-9]*[.][0-9]*) (?P<units>KB|MB|GB|TB|PB)", line)
self.capacity = convert_to_bytes(m.group("units"), m.group("value"))
Like in C and other languages you can do this and avoid executing the search twice:
if m = re.search("(?P<value>[0-9]*[.][0-9]*) (?P<units>KB|MB|GB|TB|PB)", line):
self.capacity = convert_to_bytes(m.group("units"), m.group("value"))
Is this a better way of doing this?
m = re.search("(?P<value>[0-9]*[.][0-9]*) (?P<units>KB|MB|GB|TB|PB)", line)
if m:
self.capacity = convert_to_bytes(m.group("units"), m.group("value"))
As #Christian Aichinger says, this is correct way, but I'd trim the regex a bit:
m = re.search("(?i)(?P<value>[0-9]+(?:\\.[0-9]+)?) (?P<units>KB|MB|GB|TB|PB)", line)
if m:
self.capacity = convert_to_bytes(m.group("units"), m.group("value"))
Now, [0-9]+(?:\\.[0-9]+)? will match some digit(s) and optionally decimal fractions after it. Mind that in case you have other decimal separator, or if you want to include a thousand digit grouping symbol(s), you'd rather use a character class like [0-9]+(?:[., ][0-9]+)? (in Russian or Polish, a space is a valid thousand digit grouping symbol).
Also, making the regex pattern case-insensitive is also a good idea, so that it also matched 1,000 kb.
Sample code:
import re
line = "1.56 kb"
m = re.search("(?i)(?P<value>[0-9]+(?:\\.[0-9]+)?) (?P<units>KB|MB|GB|TB|PB)", line)
if m:
print m.group("units") + " - " + m.group("value")
Output:
kb - 1.56

Error with regex, match numbers

I have a string 00000001001300000708303939313833313932E2
so, I want to match everything between 708 & E2..
So I wrote:
(?<=708)(.*\n?)(?=E2) - tested in RegExr (it's working)
Now, from that result 303939313833313932 match to get result
(every second number):
099183192
How ?
To match everything between 708 and E2, use:
708(\d+)
if you are sure that there will be only digits. Otherwise try with:
708(.*?)E2
To match every second digit from 303939313833313932, use:
(?:\d(\d))+
use a global replace:
find: \d(\d)
replace: $1
Are you expecting a regular expression answer to this?
You are perhaps better off doing this using string operations in whatever programming language you're using. If you have text = "abcdefghi..." then do output = text[0] + text[2] + text[4]... in a loop, until you run out of characters.
You haven't specified a programming language, but in Python I would do something like:
>>> text = "abcdefghjiklmnop"
>>> for n, char in enumerate(text):
... if n % 2 == 0: #every second char
... print char
...
a
c
e
g
j
k
m
o