I am trying to find a regex query, such that, for instance, the following strings match the same expression
"1116.67711..44."
"2224.43322..88."
"9993.35599..22."
"7779.91177..55."
I.e. formally "x1x1x1x2.x2x3x3x1x1..x4x4." where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
Or (another example), the following strings match the same expression, but not the same expression as before:
"94..44.773399.4"
"25..55.886622.5"
"73..33.992277.3"
I.e. formally "x1x2..x2x2.x3x3x4x4x1x1.x2" where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
That is, two strings should be considered equal if they have the same form, with the digits permuted among themselves so that distinct digits stay distinct.
The dots stand for gaps in the sequence; a gap can be any value that is not a single digit, and two "equal" strings should have their gaps in the same places. If it helps, the strings all have the same length of 81 (the examples above have a length of 15 to keep them short).
That is, if I have some string as above, e.g. "3566.235.225..45", I want a regular expression that I can run against a database to find out whether such a string already exists.
Is it possible to do this?
The answer is fairly straightforward:
import re
pattern = re.compile(r'^(\d)\1{3}$')
print(pattern.match('1234'))
print(pattern.match('333'))
print(pattern.match('3333'))
print(pattern.match('33333'))
You capture what you need once, then tell the regex engine how often you need to repeat it. You can refer back to it as often as you like, for example for a pattern that would match 11.222.1 you'd use ^(\d)\1{1}\.(\d)\2{2}\.(\1){1}$.
Note that the {1} in there is superfluous, but it shows that the pattern can be very regular. So much so, that it's actually easy to write a function that solves the problem for you:
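def make_pattern(grouping, separators='.'):
    # characters that must be escaped when they are used as separators
    regex_chars = '.\\*+[](){}^$?!:'
    groups = {}      # example symbol -> capture group number
    i = 0            # scan position
    j = 0            # start of the current run of identical symbols
    last_group = 0
    result = '^'
    while i < len(grouping):
        if grouping[i] in separators:
            if grouping[i] in regex_chars:
                result += '\\'
            result += grouping[i]
            i += 1
        else:
            # advance i to the end of the run that starts at j
            while i < len(grouping) and grouping[i] == grouping[j]:
                i += 1
            if grouping[j] in groups:
                group = groups[grouping[j]]
            else:
                # first time this symbol appears: capture it
                last_group += 1
                groups[grouping[j]] = last_group
                group = last_group
                result += '(.)'
                j += 1
            # the rest of the run repeats the captured character
            result += f'\\{group}{{{i-j}}}'
        j = i
    return re.compile(result+'$')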
print(make_pattern('111.222.11').match('aaa.bbb.aa'))
So, you can give make_pattern a good example of the pattern and it will return the compiled regex for you. If you'd like other separators than '.', you can just pass those in as well:
my_pattern = make_pattern('11,222,11', separators=',')
print(my_pattern.match('aa,bbb,aa'))
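One thing make_pattern does not do is enforce the question's condition that different symbols stand for different digits: make_pattern('12'), for example, also matches 'aa'. A minimal sketch of one way to add that condition is to put a negative lookahead in front of each new capture. The function name `make_distinct_pattern`, the restriction to digits 1-9, and the choice to let each separator match any single non-digit are assumptions taken from the question's wording, not part of the answer above. Whether a given database's regex dialect supports backreferences and lookaheads needs to be checked separately.

```python
import re

def make_distinct_pattern(template, separators='.'):
    """Compile a regex from an example string such as '1116.67711..44.'.

    Assumptions (not from the original answer): symbols stand for digits 1-9,
    different symbols must match different digits, and each separator matches
    any single non-digit character.
    """
    groups = {}      # template symbol -> capture group number
    parts = ['^']
    for ch in template:
        if ch in separators:
            parts.append(r'\D')
        elif ch in groups:
            # Symbol seen before: must repeat the digit captured earlier.
            parts.append(f'\\{groups[ch]}')
        else:
            # New symbol: reject any digit that an earlier group captured.
            guards = ''.join(f'(?!\\{n})' for n in groups.values())
            groups[ch] = len(groups) + 1
            parts.append(guards + '([1-9])')
    parts.append('$')
    return re.compile(''.join(parts))

pattern = make_distinct_pattern('1116.67711..44.')
print(bool(pattern.match('2224.43322..88.')))  # True
print(bool(pattern.match('1111.17711..44.')))  # False: x1 and x2 are equal
print(bool(pattern.match('94..44.773399.4')))  # False: different form
```

Each symbol is emitted one character at a time instead of with a {n} repeat count, which keeps the builder short at the cost of a longer pattern.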
I need to remove consecutively repeated patterns from a string.
For example,
input: abcbcbcbcd
output: abcd
input: abcbcebcbcd
output: abcebcd
We may assume that the repeating pattern is exactly 2 characters long.
What is the best way to solve it?
Thanks
You may try this regex:
(..)\1+
Substitution
$1
syntax    note
(..)      any two characters, captured in group 1
\1+       group 1 repeated at least 1 time
Check the test cases
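As a quick check of the regex above, here is a minimal sketch in Python. The helper name `collapse_repeated_pairs` is made up for illustration. Note that Python's `re.sub` writes the replacement as `\1` rather than `$1`.

```python
import re

def collapse_repeated_pairs(s):
    # (..) captures any two characters; \1+ requires at least one
    # immediate repetition of that pair. The whole run is replaced
    # by a single copy of the pair.
    return re.sub(r'(..)\1+', r'\1', s)

print(collapse_repeated_pairs('abcbcbcbcd'))   # abcd
print(collapse_repeated_pairs('abcbcebcbcd'))  # abcebcd
```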
Here is a simple way to do it in Python.
s = "abcbcebcbcd"
ans = ""
i = 0
while i < len(s):
    if len(ans) >= 2 and i + 1 < len(s) and ans[-2:] == s[i:i + 2]:
        i += 2
    else:
        ans += s[i]
        i += 1
print(ans)
for n=1:37
    for m=2:71
        rep1 = regexp(Cell1{n,m}, 'f[0-9]*', 'match')
        rep2 = regexp(rep1, '[0-9]*', 'match')
        rep2 = [rep2{:}]
        cln = str2double(rep2)
        Cell2{n,cln} = Cell1{n,m}
    end
end
Cell1 is a 37x71 cell array; Cell2 is an empty 37x71 cell array.
Example:
Cell1{1,2} = -(f32.*x1.*x6)./v1
If I run each part of the loop above individually, the function works as intended. However, it returns cln as a NaN when the whole loop is executed.
You are getting a NaN because your regex doesn't match one of the values of Cell1 and returns an empty string (which str2double converts to a NaN).
But let's take a step back for a second here. You can use regexp on cell arrays so there is no need to loop through all of your elements. Also, you can use a look behind assertion to look for that "f" that precedes your number therefore preventing the use of regexp twice.
stringNumber = regexp(Cell1, '(?<=f)[0-9]*', 'match', 'once');
numbers = str2double(stringNumber);
You can then check for NaNs (isnan(numbers)) and look closer at the elements of Cell1 to see why your regex isn't finding a number in a particular string.
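For readers more at home in Python, the same lookbehind idea can be sketched as follows. This is an illustrative analogue, not MATLAB code, and `extract_index` is a hypothetical helper name. It uses `[0-9]+` rather than `[0-9]*`, so that an "f" with no digits after it counts as no match, and it returns NaN in that case to mirror what str2double does with an empty result.

```python
import math
import re

# Digits that are immediately preceded by the letter "f".
NUM_AFTER_F = re.compile(r'(?<=f)[0-9]+')

def extract_index(expr):
    m = NUM_AFTER_F.search(expr)
    # No match -> NaN, so problem strings can be found afterwards.
    return float(m.group()) if m else math.nan

print(extract_index('-(f32.*x1.*x6)./v1'))  # 32.0
print(extract_index('x1./v1'))              # nan
```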
Once you get that sorted out, you can assign to Cell2 like you are doing
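Cell2 = cell(37, 71);
for k = 1:numel(numbers)
    % Linear indices run down the columns, so the row of element k
    % depends on the number of rows, size(Cell1, 1)
    row = mod(k - 1, size(Cell1, 1)) + 1;
    Cell2(row, numbers(k)) = Cell1(k);
end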
How to match any character which repeats n times?
Example:
for input: abcdbcdcdd
for n=1:   ..........
for n=2:    .........
for n=3:     .. .....
for n=4:      .  . ..
for n=5:   no matches
After several hours my best is this expression
(\w)(?=(?:.*\1){n-1,}) //where n is variable
which uses lookahead. However the problem with this expression is this:
for input: abcdbcdcdd
for n=1:   ..........
for n=2:    ... .. .
for n=3:     ..  .
for n=4:      .
for n=5:   no matches
As you can see on the n=4 line, the lookahead assertion is satisfied for the first d, so the regex matches it. The remaining d's are not matched, because they do not have 3 more d's ahead of them.
I hope I stated the problem clearly. Hoping for your solutions, thanks in advance.
And obviously, without regex, this is a very simple string manipulation
problem. I'm trying to do this with and only with regex.
As with any regex question, the answer depends on the regex flavour. You can build a solution with the .NET regex engine, because it allows variable-width lookbehinds.
I'll also provide a more generalized solution below for Perl-compatible (and Perl-like) regex flavours.
.NET Solution
As @PetSerAl pointed out in his answer, with variable-width lookbehinds we can assert back to the beginning of the string and check that there are n occurrences.
regex module in Python
You can implement this solution in python, using the regex module by Matthew Barnett, which also allows variable-width lookbehinds.
>>> import regex
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){2})\A.*)', 'abcdbcdcdd')
['b', 'c', 'd', 'b', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){3})\A.*)', 'abcdbcdcdd')
['c', 'd', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){4})\A.*)', 'abcdbcdcdd')
['d', 'd', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){5})\A.*)', 'abcdbcdcdd')
[]
Generalized Solution
In pcre or any of the "perl-like" flavours, there is no solution that would actually return a match for every repeated character, but we could create one, and only one, capture for each character.
Strategy
For any given n, the logic involves:
Early matches: match and capture every character followed by at least n more occurrences.
Final captures: match and capture a character followed by exactly n-1 occurrences, and also capture every one of the following occurrences.
Example
for n = 3
input = abcdbcdcdd
The character c is Matched only once (as final), and the following 2 occurrences are also Captured in the same match:
abcdbcdcdd
  M  C C
and the character d is (early) Matched once:
abcdbcdcdd
   M
and (finally) Matched one more time, Capturing the rest:
abcdbcdcdd
      M CC
Regex
/(\w)                        # match 1 character
(?:
    (?=(?:.*?\1){≪N≫})       # [1] followed by ≪N≫ other occurrences
  |                          #     OR
    (?=                      # [2] followed by:
        (?:(?!\1).)*(\1)     #     2nd occurrence <captured>
        (?:(?!\1).)*(\1)     #     3rd occurrence <captured>
        ≪repeat previous≫    #     repeat subpattern (n-1) times
                             #     *exactly (n-1) times*
        (?!.*?\1)            #     not followed by another occurrence
    )
)/xg
For n = 2:
/(\w)(?:(?=(?:.*?\1){2})|(?=(?:(?!\1).)*(\1)(?!.*?\1)))/g
For n = 3:
/(\w)(?:(?=(?:.*?\1){3})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
For n = 4:
/(\w)(?:(?=(?:.*?\1){4})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
... etc.
Pseudocode to generate the pattern
// Variables: N (int)
character = "(\w)"
early_match = "(?=(?:.*?\1){" + N + "})"
final_match = "(?="
for i = 1; i < N; i++
    final_match += "(?:(?!\1).)*(\1)"
final_match += "(?!.*?\1))"
pattern = character + "(?:" + early_match + "|" + final_match + ")"
JavaScript Code
I'll show an implementation using javascript because we can check the result here (and if it works in javascript, it works in any perl-compatible regex flavour, including .net, java, python, ruby, perl, and all languages that implemented pcre, among others).
var str = 'abcdbcdcdd';
var pattern, re, match, N, i, x;
var output = "";
// We'll show the results for N = 2, 3 and 4
for (N = 2; N <= 4; N++) {
    // Generate pattern
    pattern = "(\\w)(?:(?=(?:.*?\\1){" + N + "})|(?=";
    for (i = 1; i < N; i++) {
        pattern += "(?:(?!\\1).)*(\\1)";
    }
    pattern += "(?!.*?\\1)))";
    re = new RegExp(pattern, "g");
    output += "<h3>N = " + N + "</h3><pre>Pattern: " + pattern + "\nText: " + str;
    // Loop all matches
    while ((match = re.exec(str)) !== null) {
        output += "\nPos: " + match.index + "\tMatch:";
        // Loop all captures (stops at the first group that did not participate)
        x = 1;
        while (match[x] != null) {
            output += " " + match[x];
            x++;
        }
    }
    output += "</pre>";
}
document.write(output);
Python3 code
As requested by the OP, I'm linking to a Python3 implementation in ideone.com
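Since the linked code is not reproduced here, the following is a rough sketch of how the pattern generator from the pseudocode might look with Python's built-in `re` module. It is not the linked implementation. The function names `build_pattern` and `positions_repeated` are made up for illustration, and collecting positions with `m.start(g)` is one possible way to read the captures.

```python
import re

def build_pattern(n):
    # [1] early match: the character is followed by at least n more occurrences
    early = r'(?=(?:.*?\1){%d})' % n
    # [2] final match: followed by exactly n-1 occurrences, each one captured
    final = r'(?=' + r'(?:(?!\1).)*(\1)' * (n - 1) + r'(?!.*?\1))'
    return re.compile(r'(\w)(?:' + early + '|' + final + ')')

def positions_repeated(text, n):
    """Indices of every character that occurs at least n times in text."""
    positions = set()
    for m in build_pattern(n).finditer(text):
        # Group 1 is the matched character; groups 2..n are the captures
        # made by the "final" branch. start() is -1 for unused groups.
        for g in range(1, n + 1):
            if m.start(g) != -1:
                positions.add(m.start(g))
    return sorted(positions)

print(positions_repeated('abcdbcdcdd', 3))  # [2, 3, 5, 6, 7, 8, 9]
print(positions_repeated('abcdbcdcdd', 4))  # [3, 6, 8, 9]
print(positions_repeated('abcdbcdcdd', 5))  # []
```

The regex still produces only one match per "early" character and one per "final" group of characters; the union of match and capture positions is what covers every repeated character.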
Regular expressions (and finite automata) are not able to count to arbitrary integers. They can only count to a predefined integer and fortunately this is your case.
Solving this problem is much easier if we first construct a nondeterministic finite automaton (NFA) and then convert it to a regular expression.
So the automaton for n=2 and input alphabet {a,b,c,d}
will match any string in which some character occurs exactly 2 times. If no character occurs exactly twice (every character appears fewer or more than two times), the string will not match.
Converted to a regex, it should look like this (the alternation is wrapped in a group so that ^ and $ apply to every branch):
"^(?:([^a]*a[^a]*a[^a]*)|([^b]*b[^b]*b[^b]*)|([^c]*c[^c]*c[^c]*)|([^d]*d[^d]*d[^d]*))$"
This can get problematic if the input alphabet is big, so that regex should be shortened somehow, but I can't think of it right now.
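For a larger alphabet the regex can at least be generated rather than typed by hand. A minimal sketch in Python, with `exactly_n_regex` as a hypothetical helper name, that builds one branch per character in the same shape as the regex above:

```python
import re

def exactly_n_regex(alphabet, n):
    branches = []
    for ch in alphabet:
        c = re.escape(ch)
        other = f'[^{c}]*'            # any run of characters other than ch
        # other, then n times (ch followed by other): exactly n occurrences
        branches.append(other + (c + other) * n)
    # Group the alternation so the anchors apply to every branch.
    return re.compile('^(?:' + '|'.join(branches) + ')$')

rx = exactly_n_regex('abcd', 2)
print(rx.pattern)
print(bool(rx.match('abcdbcdcdd')))  # True: b occurs exactly twice
print(bool(rx.match('abcd')))        # False: no character occurs twice
```

This does not make the regex any shorter; it only moves the repetition into code.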
With .NET regular expressions you can do the following:
(\w)(?<=(?=(?:.*\1){n})^.*) where n is variable
Where:
(\w): any word character, captured in the first group.
(?<=^.*): lookbehind assertion, which takes us back to the start of the string.
(?=(?:.*\1){n}): lookahead assertion, which checks whether the string has n instances of that character.
I would not use regular expressions for this. I would use a scripting language such as python. Try out this python function:
alpha = 'abcdefghijklmnopqrstuvwxyz'
def get_matched_chars(n, s):
    s = s.lower()
    return [char for char in alpha if s.count(char) == n]
The function will return a list of characters, all of which appear in the string s exactly n times. Keep in mind that I only included letters in my alphabet. You can change alpha to represent anything that you want to get matched.
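Judging by the expected output in the question, what is wanted there is every occurrence of a character that appears at least n times, whereas the function above tests for exactly n. A small variant along those lines, using `collections.Counter` so that no fixed alphabet is needed; the name `chars_repeated_at_least` is made up for this sketch:

```python
from collections import Counter

def chars_repeated_at_least(n, s):
    counts = Counter(s)   # how often each character occurs in s
    # Keep (index, character) for every character that occurs >= n times.
    return [(i, ch) for i, ch in enumerate(s) if counts[ch] >= n]

print(chars_repeated_at_least(4, 'abcdbcdcdd'))
# [(3, 'd'), (6, 'd'), (8, 'd'), (9, 'd')]
```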
I have to process a comma-separated string which contains triplets of values and translate them to runtime types. The input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What have I tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case. The expression almost works, but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
And it works better than the first version, but it still matches the empty string; in addition, I would first have to tokenize the input and then pass each token to the expression, so that the begin (^) and end ($) anchors can match.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading about (and trying to understand) regex lookahead, and with the help of Casimir et Hippolyte's answer, I tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! It detects complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h", which weren't intended to be matched at all). Unfortunately, it captures an empty match every time an invalid token is found, so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688". So I modified the lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}, so the empty matches just before "adfank" and "12322134445688" are gone, but the ones just before "1s1m1h" and "1h1h1h" are still detected.
So the question is: is there any regular expression which matches a triplet of values in a given order, where every component is optional but at least one component must be present, and which doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the beginning to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (that is, without tokenizing first), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positives (since you are looking for very small strings that can be part of something else), you can add word boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
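To illustrate the anchored form with the (?=.) lookahead, here is a minimal sketch in Python rather than C++11; the regex constructs used (lookahead, non-capturing groups, optional groups) are the same ones shown above. It assumes the input is split on commas first, and the names `make_parser`, `parse_vector`, `parse_color` and `parse_time` are hypothetical. Missing components default to 0, as in the question.

```python
import re

def make_parser(letters):
    # (?=.) rejects the empty token; each component is optional but ordered.
    pattern = re.compile(
        r'(?=.)(?:(\d+)%s)?(?:(\d+)%s)?(?:(\d+)%s)?' % tuple(letters))

    def parse(token):
        m = pattern.fullmatch(token)   # fullmatch plays the role of ^...$
        if m is None:
            return None
        return tuple(int(g) if g else 0 for g in m.groups())

    return parse

parse_vector = make_parser('xyz')
parse_color = make_parser('rgb')
parse_time = make_parser('hms')

print(parse_vector('1x3z'))     # (1, 0, 3)
print(parse_color('255b'))      # (0, 0, 255)
print(parse_time('48h30m50s'))  # (48, 30, 50)
print(parse_time('1s1m1h'))     # None (wrong order)
print(parse_time(''))           # None (empty token)
```

Capturing each number in its own group means the components can be read straight from the match, without a second parsing pass.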
I think this might do the trick.
I am keying on either the beginning of the string (^) or the comma separator (,) to fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <string>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
    std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
    std::sregex_iterator iter(test.begin(), test.end(), r);
    std::sregex_iterator end_iter;
    // print capture group 1 (the token without the leading comma)
    for(; iter != end_iter; ++iter)
        std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");
I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
    word = trim(word)
    chars = strsplit(word, NULL)[[1]]
    all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW, I guess... is there a faster way to do this? Regex maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then I iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation: I was 'appending' instead of pre-allocating the final vector, so the code was taking 30 minutes to execute instead of 20 seconds!
There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The glaring issue here is strsplit. Comparing things character by character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized, so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
    word = trim(word)
    grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
    word = trim(word)
    grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
    grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX can be omitted but the comma must stay ({MIN,}); without the comma, {MIN} means exactly MIN repetitions. Likewise, if there is neither a maximum nor a minimum, everything within the {} including the braces themselves can be replaced with a *, which signifies that the preceding item will be matched zero or more times; this is equivalent to {0,}.
This ensures that the regex only matches strings where every character between any leading and trailing whitespace is from the set of
* a lower case letter
* a bang (exclamation point)
* a period
* a question mark
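To make the quantifier forms concrete, here is a small sketch in Python rather than R; the `{MIN,MAX}` and `{MIN,}` syntax is the same in both. The function name `make_validator` and its parameters are made up for illustration.

```python
import re

def make_validator(min_size=2, max_size=None):
    # With no maximum, the upper bound is left empty but the comma stays:
    # {2,} means "at least 2", whereas {2} would mean "exactly 2".
    upper = '' if max_size is None else str(max_size)
    return re.compile(r'^\s*[a-z!.?]{%d,%s}\s*$' % (min_size, upper))

valid = make_validator(2)
words = [" d ", "!hello ", "nA!", " asdf.!", " d d "]
print([bool(valid.match(w)) for w in words])
# [False, True, False, True, False]
```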
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead compared to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this algorithm always does O(n) work, as the split generates n strings. This means that even FAILING strings require at least n operations before they are rejected.
Hopefully this clarifies why you were having performance issues.