capture in a string everything that is not a token - regex

Context: I am dealing with a mix of boolean and arithmetic expressions that may look like in the following example:
b_1 /\ (0 <= x_1) /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))))
I want to match and extract any constraint of the shape A <= B contained in the formula which must be always true. In the above example, only 0 <= x_1 would satisfy such criterion.
Current Goal:
My idea is to build a simple parse tree of the input formula focusing only on the following tokens: and (/\), or (\/), left bracket (() and right bracket ()). Given the above formula, I would like to generate the following AST:
/\
|_ "b_1"
|_ /\
|_ "0 <= x_1"
|_ \/
|_ "x_2 <= 2"
|_ /\
|_ "b_3"
|_ "(/ 1 3) <= x_4"
Then, I can simply walk through the AST and discard any sub-tree rooted at \/.
My Attempt:
Looking at this documentation, I am defining the grammar for the lexer as follows:
import ply.lex as lex
tokens = (
"LPAREN",
"RPAREN",
"AND",
"OR",
"STRING",
)
t_AND = r'\/\\'
t_OR = r'\\\/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t\n'
def t_error(t):
print(t)
print("Illegal character '{}'".format(t.value[0]))
t.lexer.skip(1)
def t_STRING(t):
r'^(?!\)|\(| |\t|\n|\\\/|\/\\)'
t.value = t
return t
data = "b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))"
lexer = lex.lex()
lexer.input(data)
while True:
tok = lexer.token()
if not tok:
break
print(tok.type, tok.value, tok.lineno, tok.lexpos)
However, I get the following output:
~$ python3 lex.py
LexToken(error,'b_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,0)
Illegal character 'b'
LexToken(error,'_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,1)
Illegal character '_'
LexToken(error,'1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,2)
Illegal character '1'
AND /\ 1 4
LPAREN ( 1 7
LexToken(error,'x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,8)
Illegal character 'x'
LexToken(error,'_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,9)
Illegal character '_'
LexToken(error,'2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,10)
Illegal character '2'
LexToken(error,'<= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,12)
Illegal character '<'
LexToken(error,'= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,13)
Illegal character '='
LexToken(error,'2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,15)
Illegal character '2'
OR \/ 1 17
LPAREN ( 1 20
LexToken(error,'b_3 /\\ ((/ 1 3) <= x_4))',1,21)
Illegal character 'b'
LexToken(error,'_3 /\\ ((/ 1 3) <= x_4))',1,22)
Illegal character '_'
LexToken(error,'3 /\\ ((/ 1 3) <= x_4))',1,23)
Illegal character '3'
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
LexToken(error,'/ 1 3) <= x_4))',1,30)
Illegal character '/'
LexToken(error,'1 3) <= x_4))',1,32)
Illegal character '1'
LexToken(error,'3) <= x_4))',1,34)
Illegal character '3'
RPAREN ) 1 35
LexToken(error,'<= x_4))',1,37)
Illegal character '<'
LexToken(error,'= x_4))',1,38)
Illegal character '='
LexToken(error,'x_4))',1,40)
Illegal character 'x'
LexToken(error,'_4))',1,41)
Illegal character '_'
LexToken(error,'4))',1,42)
Illegal character '4'
RPAREN ) 1 43
RPAREN ) 1 44
The t_STRING token is not correctly recognized as it should.
Question: how to set the catch all regular expression for t_STRING so as to get a working tokenizer?

Your regular expression for T_STRING most certainly doesn't do what you want. What it does do is a little more difficult to answer.
In principle, it consists only of two zero-length assertions: ^, which is only true at the beginning of the string (unless you provide the re.MULTILINE flag, which you don't), and a long negative lookahead assertion.
A pattern which consists only of zero-length assertions can only match the empty string, if it matches anything at all. But lexer patterns cannot be allowed to match the empty string. Lexers divide the input into a series of tokens, so that every character in the input belongs to some token. Each match -- and they are all matches, not searches -- starts precisely at the end of the previous match. So if a pattern could match the empty string, the lexer would try the next match at the same place, with the same result, which would be an endless loop.
Some lexer generators solve this problem by forcing a minimum one-character match using a built-in catch-all error pattern, but Ply simply refuses to generate a lexer if a pattern matches the empty string. Yet Ply does not complain about this lexer specification. The only possible explanation is that the pattern cannot match anything.
The key is that Ply compiles all patterns using the re.VERBOSE flag, which allows you to separate items in regular expressions with whitespace, making the regexes slightly less unreadable. As the Python documentation indicates:
Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>.
Whitespace includes newlines and even comments (starting with a # character), so you can split patterns over several lines and insert comments about each piece.
We could do that, in fact, with your pattern:
def t_STRING(t):
r'''^ # Anchor this match at the beginning of the input
(?! # Don't match if the next characters match:
\) | # Close parenthesis
\( | # Open parenthesis
\ | # !!! HERE IS THE PROBLEM
\t | # Tab character
\n | # Newline character
\\\/ | # \/ token
\/\\ # /\ token
)
'''
t.value = t
return t
So as I added whitespace and comments to your pattern, I had to notice that the original pattern attempted to match a space character as an alternative with | |. But since the pattern is compiled as re.VERBOSE, that space character is ignored, leaving an empty alternative, which matches the empty string. That alternative is part of a negative lookahead assertion, which means that the assertion will fail if the string to match at that point starts with the empty string. Of course, every string starts with the empty string, so the negative lookahead assertion always fails, explaining why Ply didn't complain (and why the pattern never matches anything).
Regardless of that particular glitch, the pattern cannot be useful because, as mentioned already, a lexer pattern must match some characters, and so a pattern which only matches the empty string cannot be useful. What we want to do is match any character, providing that the negative lookahead (corrected, as below) allows it. So that means that the negative lookahead assertion show be followed with ., which will match the next character.
But you almost certainly don't want to match just one character. Presumably you wanted to match a string of characters which don't match any other token. So that means putting the negative lookahead assertion and the following . into a repetition. And remember that it needs to be a non-empty repetition (+, not *), because patterns must not have empty matches.
Finally, there is absolutely no point using an anchor assertion, because that would limit the pattern to matching only at the beginning of the input, and that is certainly not what you want. It's not at all clear what it is doing there. (I've seen recommendations which suggest using an anchor with a negative lookahead search, which I think are generally misguided, but that discussion is out of scope for this question.)
And before we write the pattern, let's make one more adjustment: in a Python regular expression, if you can replace a set of alternatives with a character class, you should do so because it is a lot more efficient. That's true even if only some of the alternatives can be replaced.
So that produces the following:
def t_STRING(t):
r'''(
(?! # Don't match if the next characters match:
[() \t\n] | # Parentheses or whitespace
\\\/ | # \/ token
\/\\ # /\ token
) . # If none of the above match, accept a character
)+ # and repeat as many times as possible (at least once)
'''
return t
I removed t.value = t. t is a token object, not a string, and the value should be the string it matched. If you overwrite the value with a circular reference, you won't be able to figure out which string was matched.
This works, but not quite in the way you intended. Since whitespace characters are excluded from T_STRING, you don't get a single token representing (/ 1 3) <= x_4. Instead, you get a series of tokens:
STRING b_1 1 0
AND /\ 1 4
LPAREN ( 1 7
STRING x_2 1 8
STRING <= 1 12
STRING 2 1 15
OR \/ 1 17
LPAREN ( 1 20
STRING b_3 1 21
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
STRING / 1 30
STRING 1 1 32
STRING 3 1 34
RPAREN ) 1 35
STRING <= 1 37
STRING x_4 1 40
RPAREN ) 1 43
RPAREN ) 1 44
But I think that's reasonable. How could the lexer be able to tell that the parentheses in (x_2 <= 2 and (b_3 are parenthesis tokens, while the parentheses in (/ 1 3) <= x_4 are part of T_STRING? That determination will need to be made in your parser.
In fact, my inclination would be to fully tokenise the input, even if you don't (yet) require a complete tokenisation. As this entire question and answer shows, attempting to recognised "everything but..." can actually be a lot more complicated than just recognising all tokens. Trying to get the tokeniser to figure out which tokens are useful and which ones aren't is often more difficult than tokenising everything and passing it through a parser.

Based on the excellent answer from #rici, pointing the problem with t_STRING, this is my final version of the example that introduces smaller changes to the one proposed by #rici.
Code
##############
# TOKENIZING #
##############
tokens = (
"LPAREN",
"RPAREN",
"AND",
"OR",
"STRING",
)
def t_AND(t):
r'[ ]*\/\\[ ]*'
t.value = "/\\"
return t
def t_OR(t):
r'[ ]*\\\/[ ]*'
t.value = "\\/"
return t
def t_LPAREN(t):
r'[ ]*\([ ]*'
t.value = "("
return t
def t_RPAREN(t):
r'[ ]*\)[ ]*'
t.value = ")"
return t
def t_STRING(t):
r'''(
(?! # Don't match if the next characters match:
[()\t\n] | # Parentheses or whitespace
\\\/ | # \/ token
\/\\ # /\ token
) . # If none of the above match, accept a character
)+ # and repeat as many times as possible (at least once)
'''
return t
def t_error(t):
print("error: " + str(t.value[0]))
t.lexer.skip(1)
import ply.lex as lex
lexer = lex.lex()
data = "b_b /\\ (ccc <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))"
lexer.input(data)
while True:
tok = lexer.token()
if not tok:
break
print("{0}: `{1}`".format(tok.type, tok.value))
Output
STRING: `b_b `
AND: `/\`
LPAREN: `(`
STRING: `ccc <= 2 `
OR: `\/`
LPAREN: `(`
STRING: `b_3 `
AND: `/\`
LPAREN: `(`
LPAREN: `(`
STRING: `/ 1 3`
RPAREN: `)`
STRING: `<= x_4`
RPAREN: `)`
RPAREN: `)`

Related

Regex optional group selection doesn't work

I want to extract the numbers from the following text:
Something_Time 10 min (Time in Class T>60�C Something Something )
Something_Time 899 min (Time in Class 35�C<T<=40�C Something Something )
Something_Time 0 min (Time in Class T<=-25�C Something Something )
So what I need is:
|---------------|---------------|---------------|
| Group 1 | Group 2 | Group 3 |
|---------------|---------------|---------------|
| 10 | 60 | |
|---------------|---------------|---------------|
| 899 | 35 | 40 |
|---------------|---------------|---------------|
| 0 | | -25 |
|---------------|---------------|---------------|
Group 2 as lower bound and group 3 as upper bound.
I tried the following regex expression:
^.* (\d{1,6}) min .*(?:[ \>](\-?\d{1,2}))?.*(?:[\=](\-?\d{1,2}))?.*$
This unfortunately does not match groups 2 and 3. It works for the second line as soon as the ? is removed from the end of both groups. Do you have any suggestions?
Try:
^Something_Time (\d{1,6}) min(?:.*?[ >](-?\d{1,2}))?(?:.*?[ =](-?\d{1,2}))?.*$
See Regex Demo
^ Matches start of string.
Something_Time Matches 'Something_Time '
(\d{1,6}) Group 1: 1 - 6 digits
min Matches ' min'
(?:.*?[ >](-?\d{1,2}))? Optional group that matches 0 or more non-newline characters followed by either a space or '>' followed by a number (optional '-' followed by up to 2 digits). The number is placed in Group 2.
(?:.*?[ =](-?\d{1,2}))? Optional group that matches 0 or more non-newline characters followed by either a space or '=' followed by a number (optional '-' followed by up to 2 digits). The number is placed in Group 3.
.* Matches 0 or more non-newline characters.
$ Matches the end of the string or a newline that precedes the end of the string.
In Python:
import re
tests = [
'Something_Time 10 min (Time in Class T>60�C Something Something )',
'Something_Time 899 min (Time in Class 35�C<T<=40�C Something Something )',
'Something_Time 0 min (Time in Class T<=-25�C Something Something )'
]
for test in tests:
m = re.match(r'^Something_Time (\d{1,6}) min(?:.*?[ >](-?\d{1,2}))?(?:.*?[ =](-?\d{1,2}))?.*$', test)
if m:
print(m.groups())
Prints:
('10', '60', None)
('899', '35', '40')
('0', None, '-25')

Reluctant matching in ANTLR 4.4

Just as the reluctant quantifiers work in Regular expressions I'm trying to parse two different tokens from my input i.e, for operand1 and operator. And my operator token should be reluctantly matched instead of greedily matching input tokens for operand1.
Example,
Input:
Active Indicator in ("A", "D", "S")
(To simplify I have removed the code relevant for operand2)
Expected operand1:
Active Indicator
Expected operator:
in
Actual output for operand1:
Active indicator in
and none for the operator rule.
Below is my grammar code:
grammar Test;
condition: leftOperand WHITESPACE* operator;
leftOperand: ALPHA_NUMERIC_WS ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
WORD: (LOWERCASE | UPPERCASE )+ ;
ALPHA_NUMERIC_WS: WORD ( WORD| DIGIT | WHITESPACE )* ( WORD | DIGIT)+ ;
WHITESPACE : (' ' | '\t')+;
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
One solution to this would be to not produce one token for several words but one token per word instead.
Your grammar would then look like this:
grammar Test;
condition: leftOperand operator;
leftOperand: ALPHA_NUMERIC+ ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
WORD: (LOWERCASE | UPPERCASE )+ ;
ALPHA_NUMERIC: WORD ( WORD| DIGIT)* ;
WHITESPACE : (' ' | '\t')+ -> skip; // ignoring WS completely
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
Like this the lexer will not match the whole input as ALPHA_NUMERIC_WS once the corresponding lexer rule has been entered because any occuring WS forces the lexer to leave the ALPHA_NUMERIC rule. Therefore any following input will be given a chance to be matched by other lexer-rules (in the order they are defined in the grammar).

limit string in preg_match on php

Thanks for visit or just see my question.
I try to catch some pattern in string on php.
pattern :
min char = 100
max char = 100
contains [a-z] and [A-Z] and [0-9] and "#" and "-" and "=" and "+"
It's return TRUE if all on pattern match and founds in string.
else it will return FALSE.
$string = "jhJH#KJNkj-HJV=+0ksdbscasdJNKajcajnacakjBKBKjbidcubiISABUIhsdbchdsiweucIBHbhbHBUJHBjhJHBIYBHBCwJHBdcd";
if (preg_match("/^.*(?=.{100,100})(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[#])(?=.*[-])(?=.*[=])(?=.*[+]).*$/", $string)){
$result1 = true;
} else {
$result1 = false;
}
This works if the 100 char, But this is return TRUE too if the string length more than 100 digits.
I want return TRUE if string length not below 100 digits and not more than 100 digits and it should be contains a pattern.
Maybe any suggestions, in the pattern?
Thanks in advances
You failed to actually set the min and max length as you allow 0+ any chars other than a newline before actually requiring 100 chars in the string. You need to use
'/^(?=\D*\d)(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=[^#]*#)(?=[^-]*-)(?=[^=]*=)(?=[^+]*[+]).{0,100}$/s'
^^^^^^^
Here, the pattern will match 0 to 100 any chars incl. a newline (as /s allows . to match a newline) and requiring at least 1 digit, 1 lowercase, 1 uppercase, 1 #, 1-, 1 = and one +.
Actually, there are 7 conditions with specific requirements, so in fact, the minimum will be 7 chars. You need to adjust this pattern to what you really need.

PCRE regex for multiple decimal coordinates using [lon,lat] format

I am trying to create a regex for [lon,lat] coordinates.
The code first checks if the input starts with '['.
If it does we check the validity of the coordinates via a regex
/([\[][-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)[\]][\;]?)+/gm
The regex tests for [lon,lat] with 15 decimals [+- 180degrees, +-90degrees]
it should match :
single coordinates :
[120,80];
[120,80]
multiple coordinates
[180,90];[180,67];
[180,90];[180,67]
with newlines
[123,34];[-32,21];
[12,-67]
it should not match:
semicolon separator missing - single
[25,67][76,23];
semicolon separator missing - multiple
[25,67]
[76,23][12,90];
I currently have problems with the ; between coordinates (see 4 & 5)
jsfiddle equivalent here : http://regex101.com/r/vQ4fE0/4
You can try with this (human readable) pattern:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<lon> [+-]?
(?:
180 (?:\.0{1,15})?
|
(?: 1(?:[0-7][0-9]?)? | [2-9][0-9]? | 0 )
(?:\.[0-9]{1,15})?
)
)
(?<lat> [+-]?
(?:
90 (?:\.0{1,15})?
|
(?: [1-8][0-9]? | 9)
(?:\.[0-9]{1,15})?
)
)
)
\A
\[ \g<lon> , \g<lat> ] (?: ; \n? \[ \g<lon> , \g<lat> ] )* ;?
\z
~x
EOD;
explanations:
When you have to deal with a long pattern inside which you have to repeat several time the same subpatterns, you can use several features to make it more readable.
The most well know is to use the free-spacing mode (the x modifier) that allows to indent has you want the pattern (all spaces are ignored) and eventually to add comments.
The second consists to define subpatterns in a definition section (?(DEFINE)...) in which you can define named subpatterns to be used later in the main pattern.
Since I don't want to repeat the large subpatterns that describes the longitude number and the latitude number, I have created in the definition section two named pattern "lon" and "lat". To use them in the main pattern, I only need to write \g<lon> and \g<lat>.
javascript version:
var lon_sp = '(?:[+-]?(?:180(?:\\.0{1,15})?|(?:1(?:[0-7][0-9]?)?|[2-9][0-9]?|0)(?:\\.[0-9]{1,15})?))';
var lat_sp = '(?:[+-]?(?:90(?:\\.0{1,15})?|(?:[1-8][0-9]?|9)(?:\\.[0-9]{1,15})?))';
var coo_sp = '\\[' + lon_sp + ',' + lat_sp + '\\]';
var regex = new RegExp('^' + coo_sp + '(?:;\\n?' + coo_sp + ')*;?$');
var coordinates = new Array('[120,80];',
'[120,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]');
for (var i = 0; i<coordinates.length; i++) {
console.log("\ntest "+(i+1)+": " + regex.test(coordinates[i]));
}
fiddle
Try this out:
^(\[([+-]?(?!(180\.|18[1-9]|19\d{1}))\d{1,3}(\.\d{1,15})?,[+-]?(?!(90\.|9[1-9]))\d{1,2}(\.\d{1,15})?(\];$|\]$|\];\[)){1,})
Demo: http://regex101.com/r/vQ4fE0/7
Explanation
^(\[
Must start with a bracket
[+-]?
May or may not contain +- in front of the number
(?!(180\.|18[1-9]|19\d{1}))
Should not contain 180., 181-189 nor 19x
\d{1,3}(\.\d{1,15})?
Otherwise, any number containing 1 or 3 digits, with or without decimals (up to 15) are allowed
(?!(90\.|9[1-9]))
The 90 check is similar put here we are not allowing 90. nor 91-99
\d{1,2}(\.\d{1,15})?
Otherwise, any number containing 1 or 2 digits, with or without decimals (up to 15) are allowed
(\];$|\]$|\];\[)
The ending of a bracket body must have a ; separating two bracket bodies, otherwise it must be the end of the line.
{1,}
The brackets can exist 1 or multiple times
Hope this was helpful.
This might work. Note that you have a lot of capture groups, none of which
will give you good information because of recursive quantifiers.
# /^(\[[-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)\](?:;\n?|$))+$/
^
( # (1 start)
\[
[-+]?
( # (2 start)
180
( \. 0{1,15} )? # (3)
|
( # (4 start)
( 1 [0-7] \d ) # (5)
|
( [1-9]? \d ) # (6)
) # (4 end)
( \. \d{1,15} )? # (7)
) # (2 end)
,
[-+]?
( # (8 start)
[1-8]? \d
( \. \d{1,15} )? # (9)
|
90
( \. 0{1,15} )? # (10)
) # (8 end)
\]
(?: ; \n? | $ )
)+ # (1 end)
$
Try a function approach, where the function can do some of the splitting for you, as well as delegating the number comparisons away from the regex. I tested it here: http://repl.it/YyG/3
//represents regex necessary to capture one coordinate, which
// looks like 123 or 123.13532
// the decimal part is a non-capture group ?:
var oneCoord = '(-?\\d+(?:\\.\\d+)?)';
//console.log("oneCoord is: "+oneCoord+"\n");
//one coordinate pair is represented by [x,x]
// check start/end with ^, $
var coordPair = '^\\['+oneCoord+','+oneCoord+'\\]$';
//console.log("coordPair is: "+coordPair+"\n");
//the full regex string consists of one or more coordinate pairs,
// but we'll do the splitting in the function
var myRegex = new RegExp(coordPair);
//console.log("my regex is: "+myRegex+"\n");
function isPlusMinus180(x)
{
return -180.0<=x && x<=180.0;
}
function isPlusMinus90(y)
{
return -90.0<=y && y<=90.0;
}
function isValid(s)
{
//if there's a trailing semicolon, remove it
if(s.slice(-1)==';')
{
s = s.slice(0,-1);
}
//remove all newlines and split by semicolon
var all = s.replace(/\n/g,'').split(';');
//console.log(all);
for(var k=0; k<all.length; ++k)
{
var match = myRegex.exec(all[k]);
if(match===null)
return false;
console.log(" match[1]: "+match[1]);
console.log(" match[2]: "+match[2]);
//break out if one pair is bad
if(! (isPlusMinus180(match[1]) && isPlusMinus90(match[2])) )
{
console.log(" one of matches out of bounds");
return false;
}
}
return true;
}
var coords = new Array('[120,80];',
'[120.33,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]',
'[190,33.33]',
'[180.33,33]',
'[179.87,90]',
'[179.87,91]');
var s;
for (var i = 0; i<coords.length; i++) {
s = coords[i];
console.log((i+1)+". ==== testing "+s+" ====");
console.log(" isValid? => " + isValid(s));
}

Non-greedy regular expression match for multicharacter delimiters in awk

Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?
I have tried the following:
awk '
BEGIN {
str="AB 1 BA 2 AB 3 BA"
regex="AB([^B][^A]|B[^A]|[^B]A)*BA"
if (match(str,regex))
print substr(str,RSTART,RLENGTH)
}'
with no output. I believe the reason for no match is that there is an odd number of characters between "AB" and "BA". If I replace str with "AB 11 BA 22 AB 33 BA" the regex seems to work..
Merge your two negated character classes and remove the [^A] from the second alternation:
regex = "AB([^AB]|B|[^B]A)*BA"
This regex fails on the string ABABA, though - not sure if that is a problem.
Explanation:
AB # Match AB
( # Group 1 (could also be non-capturing)
[^AB] # Match any character except A or B
| # or
B # Match B
| # or
[^B]A # Match any character except B, then A
)* # Repeat as needed
BA # Match BA
Since the only way to match an A in the alternation is by matching a character except B before it, we can safely use the simple B as one of the alternatives.
The other answer didn't really answer: how to match non-greedily?
Looks like it can't be done in (G)AWK. The manual says this:
awk (and POSIX) regular expressions always match the leftmost, longest
sequence of input characters that can match.
https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest
And the whole manual doesn't contain the words "greedy" nor "lazy". It mentions Extended Regular Expressions, but for greedy matching you'd need Perl-Compatible Regular Expressions. So… no, can't be done.
For general expressions, I'm using this as a non-greedy match:
function smatch(s, r) {
if (match(s, r)) {
m = RSTART
do {
n = RLENGTH
} while (match(substr(s, m, n - 1), r))
RSTART = m
RLENGTH = n
return RSTART
} else return 0
}
smatch behaves like match, returning:
the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.