matching two or more characters that are not the same - regex

Is it possible to write a regex pattern to match abc where each letter is not literal but means that text like xyz (but not xxy) would be matched? I am able to get as far as (.)(?!\1) to match a in ab but then I am stumped.
After getting the answer below, I was able to write a routine to generate this pattern. Using raw re patterns is much faster than converting both the pattern and a text to canonical form and then comparing them.
def pat2re(p, know=None, wild=None):
    r"""return a compiled re pattern that will find pattern `p`
    in which each different character should find a different
    character in a string. Characters to be taken literally
    or that can represent any character should be given as
    `know` and `wild`, respectively.

    EXAMPLES
    ========

    Characters in the pattern denote different characters to
    be matched; characters that are the same in the pattern
    must be the same in the text:

    >>> pat = pat2re('abba')
    >>> assert pat.search('maccaw')
    >>> assert not pat.search('busses')

    The underlying pattern of the re object can be seen
    with the pattern property:

    >>> pat.pattern
    '(.)(?!\\1)(.)\\2\\1'

    If some characters are to be taken literally, list them
    as known; do the same if some characters can stand for
    any character (i.e. are wildcards):

    >>> a_ = pat2re('ab', know='a')
    >>> assert a_.search('ad') and not a_.search('bc')
    >>> ab_ = pat2re('ab*', know='ab', wild='*')
    >>> assert ab_.search('abc') and ab_.search('abd')
    >>> assert not ab_.search('bad')
    """
    import re
    # make a canonical "hash" of the pattern
    # with ints representing pattern elements that
    # must be unique and strings for wild or known
    # values
    m = {}
    j = 1
    know = know or ''
    wild = wild or ''
    for c in p:
        if c in know:
            # escape so that literal metacharacters (e.g. '.') stay literal
            m[c] = re.escape(c)
        elif c in wild:
            m[c] = '.'
        elif c not in m:
            m[c] = j
            j += 1
            assert j < 100
    h = tuple(m[i] for i in p)
    # build pattern
    out = []
    last = 0
    for i in h:
        if type(i) is int:
            if i <= last:
                # character seen before: backreference to its group
                out.append(r'\%s' % i)
            else:
                if last:
                    # new character: must differ from all earlier groups
                    ors = '|'.join(r'\%s' % i for i in range(1, last + 1))
                    out.append('(?!%s)(.)' % ors)
                else:
                    out.append('(.)')
                last = i
        else:
            out.append(i)
    return re.compile(''.join(out))

You may try:
^(.)(?!\1)(.)(?!\1|\2).$
Here is an explanation of the regex pattern:
^ - from the start of the string
(.) - match and capture any first character (no restrictions so far)
(?!\1) - then assert that the second character is different from the first
(.) - match and capture any (legitimate) second character
(?!\1|\2) - then assert that the third character does not match the first or second
. - match any valid third character
$ - end of string
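To sanity-check the pattern, here is a minimal Python sketch. The sample strings other than xyz and xxy are my own picks and not from the question:

```python
import re

# Three characters, all pairwise different, anchored to the whole string.
pattern = re.compile(r'^(.)(?!\1)(.)(?!\1|\2).$')

for text in ['xyz', 'xxy', 'xyx', 'xyy', 'abcd']:
    # Only 'xyz' should be reported as a match here.
    print(text, bool(pattern.match(text)))
```

The same idea extends to longer patterns by adding one more negative lookahead per new character, listing every earlier group in it.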

Related

Find value of a string given a superstring regex

How can I match a string that is a substring of a given input string, preferably with a regex?
Given a value: A789Lfu891MatchMe2ENOTSTH, construct a regex that would match a string where the string is a substring of the given value.
Expected matches:
MatchMe
ENOTST
891
Expected Non Match
foo
A789L<fu891MatchMe2ENOTSTH_extra
extra_A789L<fu891MatchMe2ENOTSTH
extra_A789L<fu891MatchMe2ENOTSTH_extra
It seems easier for me to do the reverse: /\w*MatchMe\w*/, but I can't wrap my head around this problem.
Something like how SQL would do it:
SELECT * FROM my_table mt WHERE 'A789Lfu891MatchMe2ENOTSTH' LIKE '%' || mt.foo || '%';
Prefix suffixes
You could alternate prefix suffixes, like turn the superstring abcd into a pattern like ^(a|(a)?b|((a)?b)?c|(((a)?b)?c)?d)$. For your example, the pattern has 1253 characters (exactly 2000 fewer than @tobias_k's).
Python code to produce the regex, can then be tested with tobias_k's code (try it online):
from itertools import accumulate
t = "A789Lfu891MatchMe2ENOTSTH"
p = '^(' + '|'.join(accumulate(t, '({})?{}'.format)) + ')$'
Suffix prefixes
Suffix prefixes look nicer and match faster: ^(a(b(c(d)?)?)?|b(c(d)?)?|c(d)?|d)$. Sadly the generating code is less elegant.
Divide and conquer
For a shorter regex, we can use divide and conquer. For example for the superstring abcdefg, every substring falls into one of three cases:
Contains the middle character (the d). Pattern for that: ((a?b)?c)?d(e(fg?)?)?
Is left of that middle character. So recursively build a regex for the superstring abc: a|a?bc?|c.
Is right of that middle character. So recursively build a regex for the superstring efg: e|e?fg?|g.
And then make an alternation of those three cases:
a|a?bc?|c|((a?b)?c)?d(e(fg?)?)?|e|e?fg?|g
Regex length will be Θ(n log n) instead of our previous Θ(n²).
The result for your superstring example of 25 characters is this regex with 301 characters:
^(A|A?78?|8|((A?7)?8)?9(Lf?)?|Lf?|f|(((((A?7)?8)?9)?L)?f)?u(8(9(1(Ma?)?)?)?)?|89?|9|(8?9)?1(Ma?)?|Ma?|a|(((((((((((A?7)?8)?9)?L)?f)?u)?8)?9)?1)?M)?a)?t(c(h(M(e(2(E(N(O(T(S(TH?)?)?)?)?)?)?)?)?)?)?)?|c|c?hM?|M|((c?h)?M)?e(2E?)?|2E?|E|(((((c?h)?M)?e)?2)?E)?N(O(T(S(TH?)?)?)?)?|OT?|T|(O?T)?S(TH?)?|TH?|H)$
Benchmark
Speed benchmarks don't really make that much sense, as in reality we'd just do a regular substring check (in Python s in t), but let's do one anyway.
Results for matching all substrings of your superstring, using Python 3.9.6 on my PC:
1.09 ms just_all_substrings
25.18 ms prefix_suffixes
3.47 ms suffix_prefixes
3.46 ms divide_and_conquer
And on TIO and their "Python 3.8 (pre-release)" with quite different results:
0.30 ms just_all_substrings
46.90 ms prefix_suffixes
2.24 ms suffix_prefixes
2.95 ms divide_and_conquer
Regex lengths (also printed by the below benchmark code):
3253 characters - just_all_substrings
1253 characters - prefix_suffixes
1253 characters - suffix_prefixes
301 characters - divide_and_conquer
Benchmark code (Try it online!):
from timeit import repeat
import re
from itertools import accumulate

def just_all_substrings(t):
    return "^(" + '|'.join(t[i:k] for i in range(0, len(t))
                           for k in range(i+1, len(t)+1)) + ")$"

def prefix_suffixes(t):
    return '^(' + '|'.join(accumulate(t, '({})?{}'.format)) + ')$'

def suffix_prefixes(t):
    return '^(' + '|'.join(list(accumulate(t[::-1], '{1}({0})?'.format))[::-1]) + ')$'

def divide_and_conquer(t):
    def suffixes(t):
        # Example: abc => ((a?b)?c)?
        regex = f'{t[0]}?'
        for c in t[1:]:
            regex = f'({regex}{c})?'
        return regex
    def prefixes(t):
        # Example: efg => (e(fg?)?)?
        regex = f'{t[-1]}?'
        for c in t[-2::-1]:
            regex = f'({c}{regex})?'
        return regex
    def superegex(t):
        n = len(t)
        if n == 1:
            return t
        if n == 2:
            return f'{t}?|{t[1]}'
        mid = n // 2
        contain = suffixes(t[:mid]) + t[mid] + prefixes(t[mid+1:])
        left = superegex(t[:mid])
        right = superegex(t[mid+1:])
        return '|'.join([left, contain, right])
    return '^(' + superegex(t) + ')$'

creators = just_all_substrings, prefix_suffixes, suffix_prefixes, divide_and_conquer,

t = "A789Lfu891MatchMe2ENOTSTH"
substrings = [t[start:stop]
              for start in range(len(t))
              for stop in range(start+1, len(t)+1)]

def test(p):
    # compile once and reuse the bound match method
    match = re.compile(p).match
    return all(map(match, substrings))

# correctness check: every substring must match
for creator in creators:
    print(test(creator(t)), creator.__name__)
print()

print('Regex lengths:')
for creator in creators:
    print('%5d characters -' % len(creator(t)), creator.__name__)
print()

# three benchmark rounds
for _ in range(3):
    for creator in creators:
        p = creator(t)
        number = 10
        time = min(repeat(lambda: test(p), number=number)) / number
        print('%5.2f ms ' % (time * 1e3), creator.__name__)
    print()
One way to "construct" such a regex would be to build a disjunction of all possible substrings of the original value. Example in Python:
import re
t = "A789Lfu891MatchMe2ENOTSTH"
p = "^(" + '|'.join(t[i:k] for i in range(0, len(t))
                    for k in range(i+1, len(t)+1)) + ")$"
good = ["MatchMe", "ENOTST", "891"]
bad = ["foo", "A789L<fu891MatchMe2ENOTSTH_extra",
       "extra_A789L<fu891MatchMe2ENOTSTH",
       "extra_A789L<fu891MatchMe2ENOTSTH_extra"]
assert all(re.match(p, s) is not None for s in good)
assert all(re.match(p, s) is None for s in bad)
For the value "abcd", this would e.g. be "^(a|ab|abc|abcd|b|bc|bcd|c|cd|d)$"; for the given example it's a bit longer, with 3253 characters...
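One caveat with joining raw substrings: if the value contains regex metacharacters (such as . or parentheses), each alternative needs escaping first. A minimal sketch of that, where the helper name substring_regex and the sample value are hypothetical and chosen only for illustration:

```python
import re

def substring_regex(value):
    """Compile a regex whose fullmatch() accepts the non-empty substrings of value."""
    n = len(value)
    # A set removes duplicate substrings; re.escape keeps metacharacters literal.
    alternatives = {re.escape(value[i:k])
                    for i in range(n)
                    for k in range(i + 1, n + 1)}
    # Sort for a deterministic pattern (longest alternatives first).
    ordered = sorted(alternatives, key=lambda a: (-len(a), a))
    return re.compile("|".join(ordered))

p = substring_regex("a.b(c")
print(bool(p.fullmatch(".b(")))  # a real substring of the value
print(bool(p.fullmatch("axb")))  # "." must not act as a wildcard
```

Using fullmatch() instead of ^...$ anchors is a design choice here: it avoids the case where $ also matches just before a trailing newline.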

Regex - match words that follow the rule "i before e except after c"

Hi, I'm wondering how to match words that follow the spelling rule “i before e except after c” (such as brief, receipt, receive, pier), but not words that don't follow that rule, such as science.
What I have so far is incorrect (science shouldn't match), but it's the best I've got.
I don't really know how to do this without using look behind (which I know isn't very well supported).
This seems simple enough:
[A-Za-z]*(cei|[^c]ie)[A-Za-z]*
You can try this:
\b\w*(cei|\bie|(?!c)\w(?=ie))\w*\b
I suggest using a regex with a negative lookahead:
/\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*/i
Details:
\b - a leading word boundary
(?![a-z]*cie) - a negative lookahead that fails the match if the word has cie after 0+ letters
[a-z]* - 0+ letters
(?:cei|ie) - cei or ie
[a-z]* - 0+ letters.
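The three patterns suggested above can be compared side by side with a small Python sketch. The dictionary keys are labels I made up, and the word lists are just the examples from the question:

```python
import re

patterns = {
    "cei_or_non_c_ie": re.compile(r'[A-Za-z]*(cei|[^c]ie)[A-Za-z]*'),
    "boundary_lookahead": re.compile(r'\b\w*(cei|\bie|(?!c)\w(?=ie))\w*\b'),
    "negative_lookahead": re.compile(
        r'\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*', re.IGNORECASE),
}

follows = ['brief', 'receipt', 'receive', 'pier']  # should match
breaks = ['science']                               # should not match

for name, pattern in patterns.items():
    ok = (all(pattern.search(w) for w in follows)
          and not any(pattern.search(w) for w in breaks))
    print(name, ok)
```

This only checks the handful of words from the question; the patterns can still disagree on other inputs (for example, the first one needs a character before "ie", so it cannot match a word that starts with "ie").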
I was analysing the text of a novel, looking for the most common words that obeyed and broke the i before e rule. It would appear you're more likely to use a word that breaks rather than one which obeys the rule.
import operator, re

ie = {}
cei = {}
ei = {}

with open('1342.txt', 'r') as file:
    for line in file:
        line = line.lower()
        # ditch anything that's not a word or a white-space character
        line = re.sub(r'[^\w\s]', '', line)
        # split line into words
        for word in line.split():
            # add the word to the appropriate dictionary or increment
            # frequency of occurrence if already in that dictionary
            if re.search(r'ie', word):
                ie[word] = ie.get(word, 0) + 1
            if re.search(r'cei', word):
                cei[word] = cei.get(word, 0) + 1
            # "ei" preceded by any letter other than c
            if re.search(r'[abd-z]ei', word):
                ei[word] = ei.get(word, 0) + 1

# sort each dictionary and display the 10 most common words from each
for counts in (ie, cei, ei):
    top = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    for item in top[:10]:
        print(item)
    print()

REGEX - Matching any character which repeats n times

How to match any character which repeats n times?
Example:
for input: abcdbcdcdd
for n=1:   ..........
for n=2:    .........
for n=3:     .. .....
for n=4:      .  . ..
for n=5:   no matches
After several hours my best is this expression
(\w)(?=(?:.*\1){n-1,}) //where n is variable
which uses lookahead. However the problem with this expression is this:
for input: abcdbcdcdd
for n=1:   ..........
for n=2:    ... .. .
for n=3:     ..  .
for n=4:      .
for n=5:   no matches
As you can see, the lookahead only matches a character that still has enough occurrences ahead of it. Take the n=4 line: the lookahead assertion for d is satisfied at the first d, so the regex matches it. But the remaining d's are not matched because they don't have 3 more d's ahead of them.
I hope I stated the problem clearly. Hoping for your solutions, thanks in advance.
And obviously, without regex, this is a very simple string manipulation problem. I'm trying to do this with regex only.
As with any regex implementation, the answer depends on the regex flavour. You could create a solution with the .NET regex engine, because it allows variable-width lookbehinds.
Also, I'll provide a more generalized solution below for Perl-compatible/like regex flavours.
.NET Solution
As @PetSerAl pointed out in his answer, with variable-width lookbehinds, we can assert back to the beginning of the string and check that there are n occurrences.
regex module in Python
You can implement this solution in python, using the regex module by Matthew Barnett, which also allows variable-width lookbehinds.
>>> import regex
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){2})\A.*)', 'abcdbcdcdd')
['b', 'c', 'd', 'b', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){3})\A.*)', 'abcdbcdcdd')
['c', 'd', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){4})\A.*)', 'abcdbcdcdd')
['d', 'd', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){5})\A.*)', 'abcdbcdcdd')
[]
Generalized Solution
In PCRE or any of the "Perl-like" flavours, there is no solution that would actually return a match for every repeated character, but we could create one, and only one, capture for each character.
Strategy
For any given n, the logic involves:
Early matches: Match and capture every character followed by at least n more occurrences.
Final captures:
Match and capture a character followed by exactly n-1 occurrences, and
also capture every one of the following occurrences.
Example
for n = 3
input = abcdbcdcdd
The character c is Matched only once (as final), and the following 2 occurrences are also Captured in the same match:
abcdbcdcdd
  M  C C
and the character d is (early) Matched once:
abcdbcdcdd
   M
and (finally) Matched one more time, Capturing the rest:
abcdbcdcdd
      M CC
Regex
/(\w)                      # match 1 character
(?:
    (?=(?:.*?\1){≪N≫})     # [1] followed by other ≪N≫ occurrences
  |                        #     OR
    (?=                    # [2] followed by:
        (?:(?!\1).)*(\1)   #     2nd occurrence <captured>
        (?:(?!\1).)*(\1)   #     3rd occurrence <captured>
        ≪repeat previous≫  #     repeat subpattern (n-1) times
                           #     *exactly (n-1) times*
        (?!.*?\1)          #     not followed by another occurrence
    )
)/xg
For n = 2:
/(\w)(?:(?=(?:.*?\1){2})|(?=(?:(?!\1).)*(\1)(?!.*?\1)))/g
For n = 3:
/(\w)(?:(?=(?:.*?\1){3})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
For n = 4:
/(\w)(?:(?=(?:.*?\1){4})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
... etc.
Pseudocode to generate the pattern
// Variables: N (int)
character = "(\w)"
early_match = "(?=(?:.*?\1){" + N + "})"
final_match = "(?="
for i = 1; i < N; i++
    final_match += "(?:(?!\1).)*(\1)"
final_match += "(?!.*?\1))"
pattern = character + "(?:" + early_match + "|" + final_match + ")"
JavaScript Code
I'll show an implementation using JavaScript because we can check the result here (and if it works in JavaScript, it works in any Perl-compatible regex flavour, including .NET, Java, Python, Ruby, Perl, and all languages that implemented PCRE, among others).
var str = 'abcdbcdcdd';
var pattern, re, match, N, i, x;
var output = "";
// We'll show the results for N = 2, 3 and 4
for (N = 2; N <= 4; N++) {
    // Generate pattern
    pattern = "(\\w)(?:(?=(?:.*?\\1){" + N + "})|(?=";
    for (i = 1; i < N; i++) {
        pattern += "(?:(?!\\1).)*(\\1)";
    }
    pattern += "(?!.*?\\1)))";
    re = new RegExp(pattern, "g");
    output += "<h3>N = " + N + "</h3><pre>Pattern: " + pattern + "\nText: " + str;
    // Loop all matches
    while ((match = re.exec(str)) !== null) {
        output += "\nPos: " + match.index + "\tMatch:";
        // Loop all captures
        x = 1;
        while (match[x] != null) {
            output += " " + match[x];
            x++;
        }
    }
    output += "</pre>";
}
document.write(output);
Python3 code
As requested by the OP, I'm linking to a Python3 implementation in ideone.com
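For reference, a minimal Python 3 sketch of the same pattern generator with the standard re module might look like this. The function names build_pattern and repeated_positions are my own, and the sketch assumes n >= 2 and input without newlines, as in the examples above:

```python
import re

def build_pattern(n):
    """Build the early-match / final-capture pattern for a given n (n >= 2)."""
    if n < 2:
        raise ValueError("this construction needs n >= 2")
    early = r"(?=(?:.*?\1){%d})" % n
    # (n - 1) captured later occurrences, then no further occurrence.
    final = r"(?=" + r"(?:(?!\1).)*(\1)" * (n - 1) + r"(?!.*?\1))"
    return re.compile(r"(\w)(?:" + early + "|" + final + ")")

def repeated_positions(text, n):
    """Indices of every character that occurs at least n times in text."""
    found = set()
    for m in build_pattern(n).finditer(text):
        found.add(m.start(1))
        # Groups 2..n are only set when the "final" branch matched.
        for g in range(2, n + 1):
            if m.group(g) is not None:
                found.add(m.start(g))
    return sorted(found)

for n in (2, 3, 4, 5):
    print(n, repeated_positions('abcdbcdcdd', n))
```

Collecting the start offsets of the captures (rather than their text) is what recovers one position per occurrence, since the regex itself only consumes the first character of each match.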
Regular expressions (and finite automata) are not able to count to arbitrary integers. They can only count to a predefined integer, and fortunately this is your case.
Solving this problem is much easier if we first construct a nondeterministic finite automaton (NFA) and then convert it to a regular expression.
So the following automaton for n=2 and input alphabet = {a,b,c,d}
will match any string that has exactly 2 repetitions of any char. If no character has 2 repetitions (all chars appear fewer or more than two times) the string will not match.
Converting it to regex should look like
"^(?:([^a]*a[^a]*a[^a]*)|([^b]*b[^b]*b[^b]*)|([^c]*c[^c]*c[^c]*)|([^d]*d[^d]*d[^d]*))$"
This can get problematic if the input alphabet is big, so that regex should be shortened somehow, but I can't think of it right now.
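Writing that regex by hand gets tedious, so here is a rough Python sketch that generates it for a given alphabet and n. The function name exactly_n_regex is hypothetical, and it uses a counted group instead of spelling out each repetition:

```python
import re

def exactly_n_regex(alphabet, n):
    """Regex matching strings in which some character of alphabet occurs exactly n times."""
    alternatives = []
    for ch in alphabet:
        c = re.escape(ch)
        # [^c]* followed by n copies of (c [^c]*): exactly n occurrences of c
        alternatives.append("[^{0}]*(?:{0}[^{0}]*){{{1}}}".format(c, n))
    return re.compile("^(?:" + "|".join(alternatives) + ")$")

pattern = exactly_n_regex("abcd", 2)
print(pattern.pattern)
print(bool(pattern.match("abcdbcdcdd")))  # b occurs exactly twice
print(bool(pattern.match("abcd")))        # nothing occurs twice
```

The pattern still grows linearly with the alphabet size, so this does not solve the size concern raised above; it only automates the construction.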
With .NET regular expressions you can do the following:
(\w)(?<=(?=(?:.*\1){n})^.*) where n is variable
Where:
(\w) - any character, captured in the first group.
(?<=^.*) - lookbehind assertion, which returns us to the start of the string.
(?=(?:.*\1){n}) - lookahead assertion, to check whether the string has n instances of that character.
I would not use regular expressions for this. I would use a scripting language such as Python. Try out this Python function:
alpha = 'abcdefghijklmnopqrstuvwxyz'
def get_matched_chars(n, s):
    s = s.lower()
    return [char for char in alpha if s.count(char) == n]
The function will return a list of characters, all of which appear in the string s exactly n times. Keep in mind that I only included letters in my alphabet. You can change alpha to represent anything that you want to get matched.
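If what you want is closer to the question (every position whose character occurs at least n times, for any character and not only letters), a small variation with collections.Counter could look like this. The function name positions_repeating is my own invention:

```python
from collections import Counter

def positions_repeating(s, n):
    """Indices of all characters in s that occur at least n times."""
    counts = Counter(s)  # one pass to count every character
    return [i for i, ch in enumerate(s) if counts[ch] >= n]

print(positions_repeating('abcdbcdcdd', 3))
print(positions_repeating('abcdbcdcdd', 5))
```

Counting once up front keeps this linear in the length of the string, whereas calling s.count() per alphabet character rescans the string each time.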

Python Replacement of Shortcodes using Regular Expressions

I have a string that looks like this:
my_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
And I need it to be converted into this:
new_str = "This sentence has a <b>bolded</b> word, and <b>another</b> one too!"
Is it possible to use Python's string.replace or re.sub method to do this intelligently?
Just capture all the characters before | inside [] into one group, and the part after | into another group. Then refer to the captured groups through backreferences in the replacement string to get the desired output.
Regex:
\[([^\[\]|]*)\|([^\[\]]*)\]
Replacement string:
<\1>\2</\1>
>>> import re
>>> s = "This sentence has a [b|bolded] word, and [b|another] one too!"
>>> m = re.sub(r'\[([^\[\]|]*)\|([^\[\]]*)\]', r'<\1>\2</\1>', s)
>>> m
'This sentence has a <b>bolded</b> word, and <b>another</b> one too!'
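Since that regex accepts any tag name before the |, you may want to restrict which shortcodes get converted. A minimal sketch using a replacement function, where ALLOWED_TAGS and replace_shortcodes are hypothetical names and the whitelist contents are only an example:

```python
import re

SHORTCODE = re.compile(r'\[([^\[\]|]*)\|([^\[\]]*)\]')
ALLOWED_TAGS = {"b", "i", "code"}  # assumption: adjust to the tags you support

def replace_shortcodes(text):
    def convert(match):
        tag, body = match.group(1), match.group(2)
        if tag not in ALLOWED_TAGS:
            # Leave unknown shortcodes exactly as they were written.
            return match.group(0)
        return "<{0}>{1}</{0}>".format(tag, body)
    return SHORTCODE.sub(convert, text)

print(replace_shortcodes("A [b|bolded] word and an [unknown|odd] one."))
```

Passing a function to re.sub is standard re behaviour; the whitelist itself is just one possible policy.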
Try this expression: [[]b[|](\w+)[]]. A shorter version is \[b\|(\w+)\]
The expression searches for anything that starts with [b| and captures what is between it and the closing ] using \w+, which means [a-zA-Z0-9_]. To include a wider range of characters you can use .*? instead of \w+, which gives \[b\|(.*?)\]
Sample Demo:
import re
p = re.compile(r'[[]b[|](\w+)[]]')
test_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
# Python's re.sub uses \1 (not $1) for backreferences in the replacement
subst = r"<bold>\1</bold>"
result = re.sub(p, subst, test_str)
Output:
This sentence has a <bold>bolded</bold> word, and <bold>another</bold> one too!
Just for reference, in case you don't want two problems:
Quick answer to your particular problem:
my_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
print(my_str.replace("[b|", "<b>").replace("]", "</b>"))
# output:
# This sentence has a <b>bolded</b> word, and <b>another</b> one too!
This has the flaw that it will replace every ] with </b> regardless of whether it is appropriate or not. So you might want to consider the following:
Generalize and wrap it in a function
def replace_stuff(s, char):
    begin = s.find("[{}|".format(char))
    while begin != -1:
        end = s.find("]", begin)
        if end == -1:
            # unterminated shortcode: leave the rest untouched
            break
        s = s[:begin] + s[begin:end+1].replace("[{}|".format(char),
            "<{}>".format(char)).replace("]", "</{}>".format(char)) + s[end+1:]
        begin = s.find("[{}|".format(char))
    return s
For example
s = "Don't forget to [b|initialize] [code|void toUpper(char const *s)]."
print(replace_stuff(s, "code"))
# output:
# "Don't forget to [b|initialize] <code>void toUpper(char const *s)</code>."

Using a Regex Back-reference In a Repetition Construct ({N})

I need to match a string that is prefixed with an acceptable length for that string.
For example, {3}abc would match, because the abc part is 3 characters long. {3}abcd would fail because abcd is not 3 characters long.
I would use ^\{(\d+)\}.{\1}$ (capture a number N inside curly braces, then any character N times) but it appears that the value in the repetition construct has to be a number (or at least, it won’t accept a backreference).
For example, in JavaScript this returns true:
/^\{(\d+)\}.{3}$/.test("{3}abc")
While this returns false:
/^\{(\d+)\}.{\1}$/.test("{3}abc")
Is this possible to do in a single regex, or would I need to resort to splitting it into two stages like:
/^\{(\d+)\}/.test("{3}abc") && RegExp("^\\{" + RegExp.$1 + "\\}.{" + RegExp.$1 + "}$").test("{3}abc")
Regular expressions can't calculate, so you can't do this with a regex only.
You could match the string to /^\{(\d+)\}(.*)$/, then check whether len($2)==int($1).
In Python, for example:
>>> import re
>>> t1 = "{3}abc"
>>> t2 = "{3}abcd"
>>> r = re.compile(r"^\{(\d+)\}(.*)$")
>>> m1 = r.match(t1)
>>> m2 = r.match(t2)
>>> len(m1.group(2)) == int(m1.group(1))
True
>>> len(m2.group(2)) == int(m2.group(1))
False
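If you would rather keep the second check as a regex too (the two-stage idea from the question), a rough Python equivalent is below. The function name matches_declared_length is hypothetical; it reads the number first and then builds a second pattern with that number as a literal quantifier:

```python
import re

def matches_declared_length(s):
    """True if s looks like {N} followed by exactly N characters."""
    head = re.match(r"\{(\d+)\}", s)
    if head is None:
        return False
    n = int(head.group(1))
    # Stage 2: the count is now a literal, so it can go inside {...}.
    stage2 = re.escape(head.group(0)) + ".{%d}" % n
    return re.fullmatch(stage2, s, flags=re.DOTALL) is not None

print(matches_declared_length("{3}abc"))
print(matches_declared_length("{3}abcd"))
```

Comparing len() against the captured number, as shown above, is simpler; this variant is only useful if the surrounding code expects a compiled pattern.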