Regular expression circular matching - regex

Using regular expressions (in any language), is there a way to match a pattern that wraps around from the end to the beginning of a string? For example, if I want to match the pattern:
"street"
against the string:
m = "et stre"
it would match m[3:] + m[:2]

You can't do that directly in the regexp. What you can do is some arithmetic. Append the string to itself:
m = "et stre"
n = m + m //n = "et street stre"
Then count the matches in n. If the count is odd (in this case, 1), one of the matches crosses the join between the two copies, so the match is 'circular'. If the count is even, there are no circular matches, and n contains exactly twice as many matches as m.
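A minimal Python sketch of the doubling idea. The function name circular_search is made up for illustration. It searches m + m, keeps only matches that start inside the first copy and are no longer than m, and reports whether the match crosses the join. Because re.finditer returns non-overlapping matches, it may miss candidates in unusual overlapping cases.

```python
import re

def circular_search(pattern, text):
    """Find the first match of `pattern` in `text` treated as circular.

    Returns (start, matched_text, wraps) or None. `wraps` is True when the
    match runs past the end of `text` and continues at its beginning.
    """
    doubled = text + text
    for match in re.finditer(pattern, doubled):
        # Skip matches that begin in the second copy (they are duplicates)
        # and matches that are longer than the original string.
        if match.start() < len(text) and len(match.group()) <= len(text):
            return match.start(), match.group(), match.end() > len(text)
    return None

print(circular_search("street", "et stre"))  # (3, 'street', True)
```

The returned start index 3 corresponds to m[3:] + m[:2] in the question.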

Related

Recursive regex complexity

Consider a regex dialect that supports only +, ?, *, ., |, [..], [^..], ^, $ and (..), plus a recursion operator {<some-regex-name>} that matches a named regex. The regex has length m and the input string has length n.
(The dialect does not support positive or negative look-ahead or look-behind.)
What is the best time complexity such a matcher can achieve?
Examples:
Bracket Matching:
brackets = (\({brackets}\))*
// brackets can match:
// ((()())())()
Some random "double" recursion:
a = \(({a}|{b})?\)
b = \[{a}(,{b})*\]
// a can match:
// (([(),[(),[()]][(())]]))
JSON without whitespace:
string = "([^\\"]|\\.)*"
object = \{({string}:{value}(,{string}:{value})*)?\}
number = [0-9]+(\.[0-9]*)?
list = \[({value}(,{value})*)?\]
value = {object}|{string}|{number}|{list}|true|false
// value can match:
// {"key":false,"ab\"c":[false,12.3,[{}]]}
I have an idea for implementing this with complexity O(m^2*n^3), where n is the length of the string and m is the size of the regex.
I haven't implemented it yet, so I may have made a mistake.
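To make the recursion operator concrete, here is a hand-written Python recognizer for the single rule brackets = (\({brackets}\))* from the first example. It is only a sketch for this one rule, with a made-up function name. It is not a general matcher and not the O(m^2*n^3) algorithm mentioned above. It memoizes, for each start position, the set of positions where a match can end, so each start position is solved only once. It uses Python recursion, so very deeply nested input would hit the recursion limit.

```python
from functools import lru_cache

def matches_brackets(s):
    """Return True if the whole string s matches brackets = (\\({brackets}\\))*."""

    @lru_cache(maxsize=None)
    def ends(i):
        # Every position j such that s[i:j] matches `brackets`.
        result = {i}  # zero repetitions: the empty match
        if i < len(s) and s[i] == "(":
            for j in ends(i + 1):                # inner {brackets}
                if j < len(s) and s[j] == ")":
                    result |= ends(j + 1)        # further repetitions
        return frozenset(result)

    return len(s) in ends(0)

print(matches_brackets("((()())())()"))  # True
print(matches_brackets("(()"))           # False
```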

Regular expression to match n times in which n is not fixed

The pattern I want to match is a sequence of length n where n is right before the sequence.
For example, when the input is "1aaaaa", I want to match the single character "a", as the first number specifies only 1 character is matched.
Similarly, when the input is "2aaaaa", I want to match the first two characters "aa" but not the rest, since the number 2 specifies that two characters are matched.
I understand that a{1} and a{2} match "a" exactly once or exactly twice. But how can I match a{n} when n is not fixed?
Is it possible to do this type of match using regular expressions?
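This works for input that contains several count-prefixed runs, as long as each count is a single digit. Find each digit-plus-letters run with a regex, then slice off as many letters as the digit says:
import re
a = "1aaa2bbbbb1cccccccc4dddddddddddd"
for b in re.findall(r'\d[a-z]+', a):
    n = int(b[0])      # the leading digit is the count
    print(b[1:1 + n])  # the n characters right after the digit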
Output:
a
bb
c
dddd
Although I did this in Java, it should help you get going in your program.
Take the first character of the input string as a substring and use it as the repetition count in your regex.
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DynamicRegex {
    public static void main(String args[]) {
        Scanner scan = new Scanner(System.in);
        System.out.println("Enter a string: ");
        String str = scan.nextLine();
        String testStr = str.substring(0, 1); // First character of the input: the count.
        String pattern = "a{" + testStr + "}"; // Use the count as the quantifier in the regex.
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(str);
        if (m.find()) {
            System.out.println(m.group());
        }
    }
}
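The same two-step idea can be sketched in Python: read the count first, then build the second pattern from it. The function name counted_prefix and the choice of [a-z] for the counted characters are assumptions for illustration. Unlike the snippets above, this version accepts multi-digit counts.

```python
import re

def counted_prefix(text):
    """Return the n characters that follow a leading number n, or None."""
    head = re.match(r"\d+", text)
    if head is None:
        return None
    n = int(head.group())
    # Build the quantifier from the count, then match right after the number.
    body = re.compile(r"[a-z]{%d}" % n).match(text, head.end())
    return body.group() if body else None

print(counted_prefix("1aaaaa"))  # a
print(counted_prefix("2aaaaa"))  # aa
print(counted_prefix("3ab"))     # None (fewer than 3 letters follow)
```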

How to match regular expression exactly in R and pull out pattern

I want to extract a pattern from my vector of strings:
string <- c(
"P10000101 - Przychody netto ze sprzedazy produktów" ,
"P10000102_PL - Przychody nettozy uslug",
"P1000010201_PL - Handlowych, marketingowych, szkoleniowych",
"P100001020101 - - Handlowych,, szkoleniowych - refaktury",
"- Handlowych, marketingowych,P100001020102, - pozostale"
)
As a result, I want to get the exact match of the regular expression:
result <- c(
"P10000101",
"P10000102_PL",
"P1000010201_PL",
"P100001020101",
"P100001020102"
)
I tried with this pattern = "([PLA]\\d+)" and different combinations of value = T, fixed = T, perl = T.
grep(x = string, pattern = "([PLA]\\d+(_PL)?)", fixed = T)
We can try with str_extract
library(stringr)
str_extract(string, "P\\d+(_[A-Z]+)*")
#[1] "P10000101" "P10000102_PL" "P1000010201_PL" "P100001020101" "P100001020102"
grep only tells you whether the pattern is present in a given string. To extract the match, use sub, gregexpr/regmatches, or str_extract.
Using base R (regexpr/regmatches):
regmatches(string, regexpr("P\\d+(_[A-Z]+)*", string))
#[1] "P10000101" "P10000102_PL" "P1000010201_PL" "P100001020101" "P100001020102"
Basically, the pattern matches P followed by one or more digits (\\d+), followed by zero or more (*) repetitions of _ and one or more uppercase letters.
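For comparison, the same extraction can be sketched in Python with the re module. This is not part of the R answer. The helper name extract_code is made up, and the pattern is the one shown above with a non-capturing group.

```python
import re

strings = [
    "P10000101 - Przychody netto ze sprzedazy produktów",
    "P10000102_PL - Przychody nettozy uslug",
    "P1000010201_PL - Handlowych, marketingowych, szkoleniowych",
    "P100001020101 - - Handlowych,, szkoleniowych - refaktury",
    "- Handlowych, marketingowych,P100001020102, - pozostale",
]

# P, one or more digits, then zero or more "_UPPERCASE" suffixes.
CODE = re.compile(r"P\d+(?:_[A-Z]+)*")

def extract_code(text):
    """Return the first code found in text, or None if there is none."""
    match = CODE.search(text)
    return match.group() if match else None

codes = [extract_code(s) for s in strings]
print(codes)
```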

regex with all components optionals, how to avoid empty matches

I have to process a comma-separated string that contains triplets of values and translate them to runtime types. The input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
A, B and C are replaced with the correct letter for each case. The expression almost works, but it returns twice the expected number of results: one match for the string and another match for an empty string just after the first match. For example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now the expression matches strings with the wrong ordering or even repeated components:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
This works better than the first version, but it still matches the empty string. I would also have to tokenize the input first and pass each token to the expression, so that the begin (^) and end ($) anchors can match.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading about (and trying to understand) regex lookahead, and with the help of Casimir et Hippolyte's answer, I tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! It detects complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h", which weren't supposed to match at all). Unfortunately, it captures an empty match every time an invalid token is found, so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688". So I modified the lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string that doesn't match (?:\d+[ABC]){1,3}, so the empty matches just before "adfank" and "12322134445688" are gone, but the ones just before "1s1m1h" and "1h1h1h" are still detected.
So the question is: is there a regular expression that matches a triplet of values in a given order, where every component is optional but at least one must be present, and that doesn't match empty strings?
I'm using the C++11 regex library.
Yes, you can add a lookahead at the beginning to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (that is, without tokenizing first), you can remove the anchors and use a more explicit subpattern in the lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positives (since you are looking for very small strings that can be part of something else), you can add word boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma-delimited string, (?=\d+[ABC]) can be replaced by (?=[^,])
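A short Python sketch of the anchored variant, applied to tokens that were first split on commas. The helper names component_pattern and parse are made up. Python's fullmatch() plays the role of ^...$, and the (?=.) lookahead rejects the empty token. This illustrates the idea in Python's re module and has not been tested against C++11 std::regex.

```python
import re

def component_pattern(letters):
    """Build a pattern for optional, ordered components such as "xyz" or "hms"."""
    body = "".join(r"(?:(\d+)%s)?" % re.escape(letter) for letter in letters)
    # (?=.) demands at least one character, so the empty string never matches.
    return re.compile(r"(?=.)" + body)

TIME = component_pattern("hms")

def parse(pattern, token):
    """Return the component values (missing ones as 0), or None if invalid."""
    match = pattern.fullmatch(token)
    if match is None:
        return None
    return tuple(int(group) if group else 0 for group in match.groups())

data = "48h30m50s,1h,1s1m1h,1h1h1h,,adfank,12322134445688,1443s"
for token in data.split(","):
    print(repr(token), parse(TIME, token))
```

Only "48h30m50s", "1h" and "1443s" produce a tuple. The wrongly ordered, repeated, empty and non-matching tokens all give None.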
I think this might do the trick.
I am keying on either the beginning of the string (^) or the comma separator (,) to fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <string>
#include <iostream>

const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");

int main()
{
    std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
    std::sregex_iterator iter(test.begin(), test.end(), r);
    std::sregex_iterator end_iter;
    for(; iter != end_iter; ++iter)
        std::cout << iter->str(1) << '\n'; // group 1 is the token without the comma
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and leave empty strings unmatched, then as far as I can tell you have to spell out every ordered combination of components, like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");
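The seven alternatives can also be generated rather than typed by hand. Below is a Python sketch of the same idea, using itertools.combinations, which preserves the input order. It is an illustration in Python's re module rather than C++, and the variable names are arbitrary.

```python
import re
from itertools import combinations

A = r"(?:\d+[xrh])"
B = r"(?:\d+[ygm])"
C = r"(?:\d+[zbs])"

# combinations() keeps A before B before C inside every alternative.
# Longest alternatives come first: ABC, AB, AC, BC, A, B, C.
alternatives = [
    "".join(combo)
    for size in (3, 2, 1)
    for combo in combinations((A, B, C), size)
]
pattern = re.compile(r"(?:^|,)(" + "|".join(alternatives) + ")")

test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b"
print(pattern.findall(test))
# ['1x2y3z', '80r160g255b', '48h30m50s', '1x3z', '255b']
```

Because every alternative contains at least one component, the capture group can never be empty.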

Regular Expression - All words that begin and end in different letters

I'm having trouble with this regular expression:
Construct a regular expression defining the following language over alphabet
Σ = { a,b }
L6 = {All words that begin and end in different letters}
Here are some examples of regular expressions I was able to solve:
1. L1 = {all words of even length ending in ab}
(aa + ab + ba + bb)*(ab)
2. L2 = {all words that DO NOT have the substring ab}
b*a*
Would this work:
(a.*b)|(b.*a)
Or, in Kleene notation:
a(a+b)*b+b(a+b)*a
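One way to gain confidence in such an expression is to brute-force it against the definition of the language for all short words. The Python sketch below does this. The function names and the length bound are arbitrary choices, and [ab]* stands in for (a+b)*.

```python
import re
from itertools import product

# a(a+b)*b + b(a+b)*a written in Python regex syntax.
PATTERN = re.compile(r"a[ab]*b|b[ab]*a")

def words(max_len):
    """Yield every word over {a, b} of length 0..max_len."""
    for length in range(max_len + 1):
        for letters in product("ab", repeat=length):
            yield "".join(letters)

def mismatches(max_len):
    """Words on which the regex and the definition of L6 disagree."""
    return [
        w for w in words(max_len)
        if (PATTERN.fullmatch(w) is not None) != (len(w) >= 2 and w[0] != w[-1])
    ]

print(mismatches(10))  # [] means no disagreement up to length 10
```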
This should do it:
"^((a.*b)|(b.*a))$"
1- Write a regular expression for each of the following languages: (a) the language of all strings that end with the substring 'ab' and have odd length; (b) the language of all strings that do not contain the substring 'abb'.
2- Construct a deterministic FSA for each of the following languages: (a) the language of all strings in which the second-to-last symbol is 'b'; (b) the language of all strings whose length is odd but which contain an even number of b's.
For 1(a): (aa+ab+ba+bb)*(a+b)ab
This takes any even-length string over a and b, then one more character, and then ends with the string ab, so the total length is odd.