I have a problem which needs to remove consecutive pattern in a string.
For example,
input: abcbcbcbcd
output: abcd
input: abcbcebcbcd
output: abcebcd
We may only consider the repeating pattern contains only 2 characters.
What is the best way to solve it?
Thanks
You may try this regex:
(..)\1+
Substitution
$1
syntax
note
(..)
any two characters, capture them in group 1
\1+
repeat group 1 at least 1 time
Check the test cases
Here is a simple way to do it in Python.
s = "abcbcebcbcd"
ans = ""
i = 0
while i < len(s):
if len(ans) >= 2 and i + 1 < len(s) and ans[-2:] == s[i:i + 2]:
i += 2
else:
ans += s[i]
i += 1
print(ans)
Related
I am trying to find a regex query, such that, for instance, the following strings match the same expression
"1116.67711..44."
"2224.43322..88."
"9993.35599..22."
"7779.91177..55."
I.e. formally "x1x1x1x2.x2x3x3x1x1..x4x4." where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
Or (another example), the following strings match the same expression, but not the same expression as before:
"94..44.773399.4"
"25..55.886622.5"
"73..33.992277.3"
I.e. formally "x1x2..x2x2.x3x3x4x4x1x1.x2" where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
That is two strings should be equal if they have the same form, but with the numbers internally permuted so that they are pairwise distinct.
The dots should mean a space in the sequence, this could be any value that is not a single digit number, and two "equal" strings, should have spaces the same places. If it helps, the strings all have the same length of 81 (above they all have a length of 15, as to not write too long strings).
That is, if I have some string as above, e.g. "3566.235.225..45" i want to have some reqular expression that i can apply to some database to find out if such a string already exists
Is it possible to do this?
The answer is fairly straightforward:
import re
pattern = re.compile(r'^(\d)\1{3}$')
print(pattern.match('1234'))
print(pattern.match('333'))
print(pattern.match('3333'))
print(pattern.match('33333'))
You capture what you need once, then tell the regex engine how often you need to repeat it. You can refer back to it as often as you like, for example for a pattern that would match 11.222.1 you'd use ^(\d)\1{1}\.(\d)\2{2}\.(\1){1}$.
Note that the {1} in there is superfluous, but it shows that the pattern can be very regular. So much so, that it's actually easy to write a function that solves the problem for you:
def make_pattern(grouping, separators='.'):
regex_chars = '.\\*+[](){}^$?!:'
groups = {}
i = 0
j = 0
last_group = 0
result = '^'
while i < len(grouping):
if grouping[i] in separators:
if grouping[i] in regex_chars:
result += '\\'
result += grouping[i]
i += 1
else:
while i < len(grouping) and grouping[i] == grouping[j]:
i += 1
if grouping[j] in groups:
group = groups[grouping[j]]
else:
last_group += 1
groups[grouping[j]] = last_group
group = last_group
result += '(.)'
j += 1
result += f'\\{group}{{{i-j}}}'
j = i
return re.compile(result+'$')
print(make_pattern('111.222.11').match('aaa.bbb.aa'))
So, you can give make_pattern a good example of the pattern and it will return the compiled regex for you. If you'd like other separators than '.', you can just pass those in as well:
my_pattern = make_pattern('11,222,11', separators=',')
print(my_pattern.match('aa,bbb,aa'))
I tried the following code. However, the result is not what I want.
$strLine = "100.11 Q9"
$sortString = StringRegExp ($strLine,'([0-9\.]{1,7})', $STR_REGEXPARRAYMATCH)
MsgBox(0, "", $sortString[0],2)
The output shows 100.11, but I want 100.11 9. How could I display it this way using a regular expression?
$sPattern = "([0-9\.]+)\sQ(\d+)"
$strLine = "100.11 Q9"
$sortString = StringRegExpReplace($strLine, $sPattern, '\1 \2')
MsgBox(0, "$sortString", $sortString, 2)
$strLine = "100.11 Q9"
$sortString = StringRegExp($strLine, $sPattern, 3); array of global matches.
For $i1 = 0 To UBound($sortString) -1
MsgBox(0, "$sortString[" & $i1 & "]", $sortString[$i1], 2)
Next
The pattern is to get the 2 groups being 100.11 and 9.
The pattern will 1st match the group with any digit and dot until it reach
/s which will match the space. It will then match the Q. The 2nd group
matches any remaining digits.
StringRegExpReplace replaces the whole string with 1st and 2nd groups
separated with a space.
StringRegExp get the 2 groups as 2 array elements.
Choose 1 from the 2 types regexp above of which you prefer.
I am trying to write a regex that is very specific. I want to find 3 digits in a list. The issue comes because I do not care about repeating digits (5, 555, and 55555555555555 are seen as 5). Also, within the 3 digits, they need to be 3 different digits (123 = good, 311 = bad).
Here is what I have so far to find 3 digits, ignoring repeats but it does not specify 3 unique digits.
^(?:([0]{1,}|[1]{1,}|[2]{1,}|[3]{1,}|[4]{1,}|[5]{1,}|[6]{1,}|[7]{1,}|[8]{1,}|[9]{1,}|[0]{1,})(?!.*\\1)){3}$<p>
Here is an example of the types of data I see.
Matching:
458
3333335555111
2222555111
222255558888
111147
9533333333
And not matching:
999999999
222252
888887
Right now my regex will find all of these. How can I ignore any that do not have 3 unique digits?
If your regex-tool of choice supports look-behinds, back-references and possesive matching you could use
^(\d)\1*+(?!.*\1)(\d)\2*+(\d)\3*+$
^ and $ are anchors to ensure, that we check the whole string
(\d) matches a digit into a first capturing group, with \1*+ we possesively match any following occurences of this digit and use the lookbehind (?!.*\1) to ensure, that it doesn't end with that number.
(\d)\2*+ then matches the next different digit, again matching any following occurences possesively (check 122 without the possesive matching to see, why I use it here)
(\d)\3*+ matches the last digit with any following occurences.
Without possesive matching you could make more use of look-behinds, like ^(\d)\1*(?!.*\1)(\d)\2*(?!.*\2)(\d)\3*+$
See https://regex101.com/r/pV2tB2/2 for a demo.
Site Note: Regex might not be the best for this, but as you specifically asked for it - here you are.
This can be done with regex, but it's not the best tool for your work.
Instead of a regex-only approach, you can easily achieve this using Python.
Example:
strings = ['458', '3333335555111', '2222555111', '222255558888', '111147', '9533333333', '955555555', '12222211']
for s in strings:
if len(set(list(s))) == 3:
print "Ok :", s
else:
print "Error :", s
Output:
>> Ok : 458
>> Ok : 3333335555111
>> Ok : 2222555111
>> Ok : 222255558888
>> Ok : 111147
>> Ok : 9533333333
>> Error : 955555555
>> Error : 12222211
I've used the following commands while iterating over the strings inside that list:
list()
set()
len()
Using negative lookahead, this should match any string of digits that contains at least 3 unique digits /^(\d)\1*(?!\1)(\d)(?:\2|\1)*(?!\2|\1)(\d)+$/
(\d) - Match a digit
\1* - Allow that digit to repeat
(?!\1) - Make sure that's followed by a digit that does not match the first match
(\d) - Match the new digit
(?:\2|\1)* - Allow repeats of either the first or second digit
(?!\2|\1) - Make sure that's followed by a digit that does not match the first or second match
(\d)+ - Capture the third unique digit, then allow any number of digits of any kind to follow
I'm not sure if an awk script will do it for you, but here it goes:
awk '
function match_func(num) {
if (match_array[num] == 0)
match_array[num] = 1;
}
{
for (i = 0; i < length($1); i++)
match_func(substr($1, i, 1));
for (i = 0; i < 10; i++)
if (match_array[i] == 1) match_sum++;
if (match_sum == 3)
print $1;
}'
How to match any character which repeats n times?
Example:
for input: abcdbcdcdd
for n=1: ..........
for n=2: .........
for n=3: .. .....
for n=4: . . ..
for n=5: no matches
After several hours my best is this expression
(\w)(?=(?:.*\1){n-1,}) //where n is variable
which uses lookahead. However the problem with this expression is this:
for input: abcdbcdcdd
for n=1 ..........
for n=2 ... .. .
for n=3 .. .
for n=4 .
for n=5 no matches
As you can see, when lookahead matches for a character, let's look for n=4 line, d's lookahead assertion satisfied and first d matched by regex. But remaining d's are not matched because they don't have 3 more d's ahead of them.
I hope I stated the problem clearly. Hoping for your solutions, thanks in advance.
let's look for n=4 line, d's lookahead assertion satisfied
and first d matched by regex.
But remaining d's are not matched because they don't have 3 more d's
ahead of them.
And obviously, without regex, this is a very simple string manipulation
problem. I'm trying to do this with and only with regex.
As with any regex implementation, the answer depends on the regex flavour. You could create a solution with .net regex engine, because it allows variable width lookbehinds.
Also, I'll provide a more generalized solution below for perl-compatible/like regex flavours.
.net Solution
As #PetSerAl pointed out in his answer, with variable width lookbehinds, we can assert back to the beggining of the string, and check there are n occurrences.
ideone demo
regex module in Python
You can implement this solution in python, using the regex module by Matthew Barnett, which also allows variable-width lookbehinds.
>>> import regex
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){2})\A.*)', 'abcdbcdcdd')
['b', 'c', 'd', 'b', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){3})\A.*)', 'abcdbcdcdd')
['c', 'd', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){4})\A.*)', 'abcdbcdcdd')
['d', 'd', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){5})\A.*)', 'abcdbcdcdd')
[]
Generalized Solution
In pcre or any of the "perl-like" flavours, there is no solution that would actually return a match for every repeated character, but we could create one, and only one, capture for each character.
Strategy
For any given n, the logic involves:
Early matches: Match and capture every character followed by at least n more occurences.
Final captures:
Match and capture a character followed by exactly n-1 occurences, and
also capture every one of the following occurrences.
Example
for n = 3
input = abcdbcdcdd
The character c is Matched only once (as final), and the following 2 occurrences are also Captured in the same match:
abcdbcdcdd
M C C
and the character d is (early) Matched once:
abcdbcdcdd
M
and (finally) Matched one more time, Capturing the rest:
abcdbcdcdd
M CC
Regex
/(\w) # match 1 character
(?:
(?=(?:.*?\1){≪N≫}) # [1] followed by other ≪N≫ occurrences
| # OR
(?= # [2] followed by:
(?:(?!\1).)*(\1) # 2nd occurence <captured>
(?:(?!\1).)*(\1) # 3rd occurence <captured>
≪repeat previous≫ # repeat subpattern (n-1) times
# *exactly (n-1) times*
(?!.*?\1) # not followed by another occurence
)
)/xg
For n =
/(\w)(?:(?=(?:.*?\1){2})|(?=(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){3})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){4})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
... etc.
Pseudocode to generate the pattern
// Variables: N (int)
character = "(\w)"
early_match = "(?=(?:.*?\1){" + N + "})"
final_match = "(?="
for i = 1; i < N; i++
final_match += "(?:(?!\1).)*(\1)"
final_match += "(?!.*?\1))"
pattern = character + "(?:" + early_match + "|" + final_match + ")"
JavaScript Code
I'll show an implementation using javascript because we can check the result here (and if it works in javascript, it works in any perl-compatible regex flavour, including .net, java, python, ruby, perl, and all languages that implemented pcre, among others).
var str = 'abcdbcdcdd';
var pattern, re, match, N, i;
var output = "";
// We'll show the results for N = 2, 3 and 4
for (N = 2; N <= 4; N++) {
// Generate pattern
pattern = "(\\w)(?:(?=(?:.*?\\1){" + N + "})|(?=";
for (i = 1; i < N; i++) {
pattern += "(?:(?!\\1).)*(\\1)";
}
pattern += "(?!.*?\\1)))";
re = new RegExp(pattern, "g");
output += "<h3>N = " + N + "</h3><pre>Pattern: " + pattern + "\nText: " + str;
// Loop all matches
while ((match = re.exec(str)) !== null) {
output += "\nPos: " + match.index + "\tMatch:";
// Loop all captures
x = 1;
while (match[x] != null) {
output += " " + match[x];
x++;
}
}
output += "</pre>";
}
document.write(output);
Python3 code
As requested by the OP, I'm linking to a Python3 implementation in ideone.com
Regular expressions (and finite automata) are not able to count to arbitrary integers. They can only count to a predefined integer and fortunately this is your case.
Solving this problem is much easier if we first construct a nondeterministic finite automata (NFA) ad then convert it to regular expression.
So the following automata for n=2 and input alphabet = {a,b,c,d}
will match any string that has exactly 2 repetitions of any char. If no character has 2 repetitions (all chars appear less or more that two times) the string will not match.
Converting it to regex should look like
"^([^a]*a[^a]*a[^a]*)|([^b]*b[^b]*b[^b]*)|([^b]*c[^c]*c[^C]*)|([^d]*d[^d]*d[^d]*)$"
This can get problematic if the input alphabet is big, so that regex should be shortened somehow, but I can't think of it right now.
With .NET regular expressions you can do following:
(\w)(?<=(?=(?:.*\1){n})^.*) where n is variable
Where:
(\w) — any character, captured in first group.
(?<=^.*) — lookbehind assertion, which return us to the start of the string.
(?=(?:.*\1){n}) — lookahead assertion, to see if string have n instances of that character.
Demo
I would not use regular expressions for this. I would use a scripting language such as python. Try out this python function:
alpha = 'abcdefghijklmnopqrstuvwxyz'
def get_matched_chars(n, s):
s = s.lower()
return [char for char in alpha if s.count(char) == n]
The function will return a list of characters, all of which appear in the string s exactly n times. Keep in mind that I only included letters in my alphabet. You can change alpha to represent anything that you want to get matched.
I am trying to code a program that will insert specific numbers before parts of an input, for example given the input "171819-202122-232425" I would like it to split up the number into pieces and use the dash as a delimiter. I have split up the number using list(str(input)) but have no idea how to insert the appropriate numbers. It has to work for any number Thanks for the help.
Output =
(number)17
(number)18
(number)19
(number+1)20
(number+1)21
(number+1)22
(number+2)23
(number+2)24
(number+2)25
You could use split and regexps to dig out lists of your numbers:
Code
import re
mynum = "171819-202122-232425"
start_number = 5
groups = mynum.split('-') # list of numbers separated by "-"
number_of_groups = xrange(start_number , start_number + len(groups))
for (i, number_group) in zip(number_of_groups, groups):
numbers = re.findall("\d{2}", number_group) # return list of two-digit numbers
for x in numbers:
print "(%s)%s" % (i, x)
Result
(5)17
(5)18
(5)19
(6)20
(6)21
(6)22
(7)23
(7)24
(7)25
Try this:
Code:
mInput = "171819-202122-232425"
number = 9 # Just an example
result = ""
i = 0
for n in mInput:
if n == '-': # To handle dash case
number += 1
continue
i += 1
if i % 2 == 1: # Each two digits
result += "\n(" + str(number) + ")"
result += n # Add current digit
print result
Output:
(9)17
(9)18
(9)19
(10)20
(10)21
(10)22
(11)23
(11)24
(11)25