Rails 6- RFC5322 compliant Email validation - regex

This is a PCRE regex (https://regex101.com/r/gJ7pU0/1) that can validate email addresses.
Is there an RFC5322 compliant regex for Ruby? Ruby has URI::MailTo::EMAIL_REGEXP, but I do not think it's RFC5322 compliant.
Another post mentioned this 'mail' gem, but I do not see a way to validate email addresses with it.
https://github.com/mikel/mail/tree/6b0ebb142c476bf7c00524effe513a4f151f59ab
PCRE RFC5322 Compliant
(?(DEFINE)
(?<addr_spec> (?&local_part) @ (?&domain) )
(?<local_part> (?&dot_atom) | (?&quoted_string) | (?&obs_local_part) )
(?<domain> (?&dot_atom) | (?&domain_literal) | (?&obs_domain) )
(?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dtext) )* (?&FWS)? \] (?&CFWS)? )
(?<dtext> [\x21-\x5a] | [\x5e-\x7e] | (?&obs_dtext) )
(?<quoted_pair> \\ (?: (?&VCHAR) | (?&WSP) ) | (?&obs_qp) )
(?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)? )
(?<dot_atom_text> (?&atext) (?: \. (?&atext) )* )
(?<atext> [a-zA-Z0-9!#$%&'*+\/=?^_`{|}~-]+ )
(?<atom> (?&CFWS)? (?&atext) (?&CFWS)? )
(?<word> (?&atom) | (?&quoted_string) )
(?<quoted_string> (?&CFWS)? " (?: (?&FWS)? (?&qcontent) )* (?&FWS)? " (?&CFWS)? )
(?<qcontent> (?&qtext) | (?&quoted_pair) )
(?<qtext> \x21 | [\x23-\x5b] | [\x5d-\x7e] | (?&obs_qtext) )
# comments and whitespace
(?<FWS> (?: (?&WSP)* \r\n )? (?&WSP)+ | (?&obs_FWS) )
(?<CFWS> (?: (?&FWS)? (?&comment) )+ (?&FWS)? | (?&FWS) )
(?<comment> \( (?: (?&FWS)? (?&ccontent) )* (?&FWS)? \) )
(?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment) )
(?<ctext> [\x21-\x27] | [\x2a-\x5b] | [\x5d-\x7e] | (?&obs_ctext) )
# obsolete tokens
(?<obs_domain> (?&atom) (?: \. (?&atom) )* )
(?<obs_local_part> (?&word) (?: \. (?&word) )* )
(?<obs_dtext> (?&obs_NO_WS_CTL) | (?&quoted_pair) )
(?<obs_qp> \\ (?: \x00 | (?&obs_NO_WS_CTL) | \n | \r ) )
(?<obs_FWS> (?&WSP)+ (?: \r\n (?&WSP)+ )* )
(?<obs_ctext> (?&obs_NO_WS_CTL) )
(?<obs_qtext> (?&obs_NO_WS_CTL) )
(?<obs_NO_WS_CTL> [\x01-\x08] | \x0b | \x0c | [\x0e-\x1f] | \x7f )
# character class definitions
(?<VCHAR> [\x21-\x7E] )
(?<WSP> [ \t] )
)
^(?&addr_spec)$

PCRE to Onigmo recursion/subroutine regex conversion is straightforward:
Remove the unsupported (?(DEFINE)...) construct.
Put all the named groups used to define the consuming pattern at the start of the regex and apply a {0} quantifier to each of them so that they match nothing where they are defined.
Replace the (?&...) syntax with \g<...> (I did it in Notepad++ by replacing \(\?&(\w+)\) with \\g<$1>).
The final expression that will work in Ruby looks like
re = /(?<addr_spec> \g<local_part> @ \g<domain> ){0}
(?<local_part> \g<dot_atom> | \g<quoted_string> | \g<obs_local_part> ){0}
(?<domain> \g<dot_atom> | \g<domain_literal> | \g<obs_domain> ){0}
(?<domain_literal> \g<CFWS>? \[ (?: \g<FWS>? \g<dtext> )* \g<FWS>? \] \g<CFWS>? ){0}
(?<dtext> [\x21-\x5a] | [\x5e-\x7e] | \g<obs_dtext> ){0}
(?<quoted_pair> \\ (?: \g<VCHAR> | \g<WSP> ) | \g<obs_qp> ){0}
(?<dot_atom> \g<CFWS>? \g<dot_atom_text> \g<CFWS>? ){0}
(?<dot_atom_text> \g<atext> (?: \. \g<atext> )* ){0}
(?<atext> [a-zA-Z0-9!#$%&'*+\/=?^_`{|}~-]+ ){0}
(?<atom> \g<CFWS>? \g<atext> \g<CFWS>? ){0}
(?<word> \g<atom> | \g<quoted_string> ){0}
(?<quoted_string> \g<CFWS>? " (?: \g<FWS>? \g<qcontent> )* \g<FWS>? " \g<CFWS>? ){0}
(?<qcontent> \g<qtext> | \g<quoted_pair> ){0}
(?<qtext> \x21 | [\x23-\x5b] | [\x5d-\x7e] | \g<obs_qtext> ){0}
# comments and whitespace
(?<FWS> (?: \g<WSP>* \r\n )? \g<WSP>+ | \g<obs_FWS> ){0}
(?<CFWS> (?: \g<FWS>? \g<comment> )+ \g<FWS>? | \g<FWS> ){0}
(?<comment> \( (?: \g<FWS>? \g<ccontent> )* \g<FWS>? \) ){0}
(?<ccontent> \g<ctext> | \g<quoted_pair> | \g<comment> ){0}
(?<ctext> [\x21-\x27] | [\x2a-\x5b] | [\x5d-\x7e] | \g<obs_ctext> ){0}
# obsolete tokens
(?<obs_domain> \g<atom> (?: \. \g<atom> )* ){0}
(?<obs_local_part> \g<word> (?: \. \g<word> )* ){0}
(?<obs_dtext> \g<obs_NO_WS_CTL> | \g<quoted_pair> ){0}
(?<obs_qp> \\ (?: \x00 | \g<obs_NO_WS_CTL> | \n | \r ) ){0}
(?<obs_FWS> \g<WSP>+ (?: \r\n \g<WSP>+ )* ){0}
(?<obs_ctext> \g<obs_NO_WS_CTL> ){0}
(?<obs_qtext> \g<obs_NO_WS_CTL> ){0}
(?<obs_NO_WS_CTL> [\x01-\x08] | \x0b | \x0c | [\x0e-\x1f] | \x7f ){0}
# character class definitions
(?<VCHAR> [\x21-\x7E] ){0}
(?<WSP> [ \t] ){0}
^\g<addr_spec>$/x
See a Ruby test:
p re.match?('+1~1+@iana.org') # => true
p re.match?('test@[123.123.123.123') # => false
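The three conversion steps described earlier (drop the DEFINE wrapper, add {0} to each definition, rewrite (?&name) as \g<name>) are mechanical, so they can be scripted. The following is only a minimal sketch in Python, assuming the pattern is laid out one definition per line as in the listing above and that a line consisting of a single ")" closes the DEFINE block. The sample pattern and the function name pcre_define_to_onigmo are made up for illustration.

```python
import re

# Hypothetical miniature pattern, laid out like the listing above:
# a DEFINE block with one named group per line, then the consuming pattern.
PCRE_LINES = [
    "(?(DEFINE)",
    r"(?<pair> \( (?&inner) \) )",
    r"(?<inner> [a-z]* )",
    ")",
    "^(?&pair)$",
]

def pcre_define_to_onigmo(lines):
    """Apply the three conversion steps line by line (layout is an assumption)."""
    converted = []
    for line in lines:
        text = line.strip()
        # Step 1: drop the (?(DEFINE) opener and its lone closing parenthesis.
        if text in ("(?(DEFINE)", ")"):
            continue
        # Step 3: rewrite subroutine calls (?&name) as \g<name>.
        text = re.sub(r"\(\?&(\w+)\)", lambda m: r"\g<" + m.group(1) + ">", text)
        # Step 2: a definition line gets {0} so it consumes nothing where it is defined.
        if text.startswith("(?<"):
            text += "{0}"
        converted.append(text)
    return converted

for converted_line in pcre_define_to_onigmo(PCRE_LINES):
    print(converted_line)
```

Comment lines such as "# obsolete tokens" pass through unchanged, which is what the x flag needs. A pattern that puts several definitions on one line would need a real parser rather than this line-based approach.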

Related

need return value for captured group from last captured string in perl

I have XML files from which I want to capture the init value ( tag) for each parameter. I am copying part of the XML for reference.
I have the port name and the parameter name ( tag (MNO)) available to me.
e.g. the port name is XYZ and the parameter name is MNO;
the port name is PQR and the parameter names are ABC and GHI.
There can be multiple tags under one container.
<R-PORT-PROTOTYPE UUID="Oac11eff016c6bb667f357a89xOac11f0ad174240e817fa858f00">
<SHORT-NAME>XYZ</SHORT-NAME>
<REQUIRED-COM-SPECS>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Init_Val</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>0.071</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/MNO</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
</REQUIRED-COM-SPECS>
</R-PORT-PROTOTYPE>
<R-PORT-PROTOTYPE UUID="Oac11eff016c6bb667f357a89xOac11f0ad174240e817f8f55900">
<SHORT-NAME>PQR</SHORT-NAME>
<REQUIRED-COM-SPECS>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Init_0</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>80</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/ABC</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Int_ghi</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>-80</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/GHI</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
</REQUIRED-COM-SPECS>
</R-PORT-PROTOTYPE>
regex :
if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>Port_name<\/short-name>.*?<V>(.*?)<\/.*?<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">.*?Parameter_name<\/PARAMETER-REF>/gis) {
print $2;
}
I need output 80 if parameter is ABC and -80 if parameter is GHI
I suggest using XML::LibXML.
Here I've combined two XPath queries to find V nodes where:
SHORT-NAME is XYZ and PARAMETER-REF (with DEST == PARAMETER-DATA-PROTOTYPE) contains MNO.
SHORT-NAME is PQR and PARAMETER-REF (with DEST == PARAMETER-DATA-PROTOTYPE) contains ABC or GHI.
Example:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_xml(location => 'doc.xml');
my $query = q{
//R-PORT-PROTOTYPE/SHORT-NAME[text()="XYZ"]/..
//PARAMETER-REF[@DEST="PARAMETER-DATA-PROTOTYPE"][
contains(text(),'MNO')
]/..//V
|
//R-PORT-PROTOTYPE/SHORT-NAME[text()="PQR"]/..
//PARAMETER-REF[@DEST="PARAMETER-DATA-PROTOTYPE"][
contains(text(),'ABC') or contains(text(),'GHI')
]/..//V
};
foreach my $vnode ($dom->findnodes($query)) {
print $vnode->to_literal() . "\n";
}
Output:
0.071
80
-80
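For comparison, the same lookup can be done without XPath unions by walking the tree and pairing each V with the PARAMETER-REF in the same PARAMETER-REQUIRE-COM-SPEC. This is only a sketch using Python's standard xml.etree.ElementTree. It assumes the elements carry no XML namespace, as in the excerpt. The PORTS wrapper element and the function name init_values are hypothetical, added so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's XML. <PORTS> is a hypothetical root
# element; the real file's enclosing elements are not shown in the question.
SAMPLE = """
<PORTS>
  <R-PORT-PROTOTYPE>
    <SHORT-NAME>PQR</SHORT-NAME>
    <REQUIRED-COM-SPECS>
      <PARAMETER-REQUIRE-COM-SPEC>
        <INIT-VALUE><V>80</V></INIT-VALUE>
        <PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/ABC</PARAMETER-REF>
      </PARAMETER-REQUIRE-COM-SPEC>
      <PARAMETER-REQUIRE-COM-SPEC>
        <INIT-VALUE><V>-80</V></INIT-VALUE>
        <PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/GHI</PARAMETER-REF>
      </PARAMETER-REQUIRE-COM-SPEC>
    </REQUIRED-COM-SPECS>
  </R-PORT-PROTOTYPE>
</PORTS>
"""

def init_values(root, port_name, param_name):
    """Return the text of every V that belongs to the given port and parameter."""
    values = []
    for port in root.iter("R-PORT-PROTOTYPE"):
        if port.findtext("SHORT-NAME") != port_name:
            continue
        for spec in port.iter("PARAMETER-REQUIRE-COM-SPEC"):
            ref = spec.findtext("PARAMETER-REF") or ""
            # Compare only the last path segment, e.g. ".../Interfaces/ABC" -> "ABC".
            if ref.rsplit("/", 1)[-1] == param_name:
                values.extend(v.text for v in spec.iter("V"))
    return values

root = ET.fromstring(SAMPLE)
print(init_values(root, "PQR", "ABC"))
print(init_values(root, "PQR", "GHI"))
```

Scoping the search to one PARAMETER-REQUIRE-COM-SPEC at a time is what keeps the value for ABC from being confused with the value for GHI, which is the failure mode of the single regex in the question.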
There are two ways to get either or both values:
1 - Linear https://regex101.com/r/NYbvI8/1
# https://regex101.com/r/NYbvI8/1
# if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>PQR<\/short-name>(?:.*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO1<\/PARAMETER-REF>)?(?:.*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO2<\/PARAMETER-REF>)?(?(1)|(?(2)|(?!)))/gis)
<R-PORT-PROTOTYPE .*? <short-name>PQR</short-name>
(?:
.*?
<V>
( .*? ) # (1)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO1</PARAMETER-REF>
)?
(?:
.*?
<V>
( .*? ) # (2)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO2</PARAMETER-REF>
)?
(?(1)
| (?(2)
| (?!)
)
)
2 - Out of order https://regex101.com/r/gQJ3cO/1
# https://regex101.com/r/t4M9UB/1
# if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>PQR<\/short-name>(?:(?:(?(1)(?!)).*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO1<\/PARAMETER-REF>)|(?:(?(2)(?!)).*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO2<\/PARAMETER-REF>)){1,2}/gis)
<R-PORT-PROTOTYPE .*? <short-name>PQR</short-name>
(?:
(?:
(?(1) (?!) )
.*?
<V>
( .*? ) # (1)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO1</PARAMETER-REF>
)
|
(?:
(?(2) (?!) )
.*?
<V>
( .*? ) # (2)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO2</PARAMETER-REF>
)
){1,2}

How to get specific error message to Data Validation using Regex with python 3

How do I get a specific output like the example below?
Example 1: if the user inputs Alberta, UN, I want the program to print I'm sorry, UN is an invalid province abbreviation.
I would like the program to display an exact error that relates to the user's input, instead of a generic error that gives no specific message to let the user know where the fault is.
I would really appreciate some help, because I have been brainstorming on how to make this work.
# Import
import re
# Program interface
print("====== POSTAL CODE CHECKER PROGRAM ====== ")
print("""
Select from below city/Province and enter it
--------------------------------------------
Alberta, AB,
British Columbia, BC
Manitoba, MB
New Brunswick, NB
Newfoundland, NL
Northwest Territories, NT
Nova Scotia, NS
Nunavut, NU
Ontario, ON
Prince Edward Island, PE
Quebec, QC
Saskatchewan, SK
Yukon, YT
""")
# User input 1
province_input = input("Please enter the city/province: ")
pattern = re.compile(r'[ABMNOPQSYabcdefhiklmnorstuvw]| [CBTSE]| [Idasln], [ABMNOPQSY]+[BCLTSUNEK]')
if pattern.match(province_input):
    print("You have successfully inputted {} the right province.".format(province_input))
elif not pattern.match(province_input):
    print("I'm sorry, {} is an invalid province abbreviation".format(province_input))
else:
    print("I'm sorry, your city or province is incorrectly formatted.")
I tried to generalize your question, so the code checks whether the first part of the input is a valid city and the second is a valid state abbreviation, where "valid" means that each of them appears in its list of valid inputs.
The core of the code is the regex ([A-Z][A-Za-z ]+), ([A-Z]{2}), which matches two groups: the first group contains the city and, after a comma and a space, the second group contains the state abbreviation (which must consist of two capital letters).
Please notice there are 5 possible outputs, according to the validity of each part.
import re
cities = ["Alberta", "British Columbia", "Manitoba"]
states = ["AB", "BC", "MB"]
province_input = input("Please enter the city/province: ")
regexp = r"([A-Z][A-Za-z ]+), ([A-Z]{2})"
if re.compile(regexp).match(province_input):
    m = re.search(regexp, province_input)
    city_input = m.group(1)
    state_input = m.group(2)
    if city_input not in cities and state_input not in states:
        # Neither part appears in its list, so both are invalid.
        print("Both '%s' and '%s' are invalid" % (city_input, state_input))
    elif city_input in cities:
        if state_input in states:
            print("Your input '%s, %s' was valid" % (city_input, state_input))
        else:
            print("'%s' is an invalid province abbreviation" % state_input)
    else:
        print("The city '%s' is invalid" % city_input)
else:
    print("Wrong input format")
I tried to make the code as clear as possible, but please do let me know if anything is unclear.
I'm not a Python programmer, but I think that if you import regex (the newer replacement for the re library), you get access to the branch reset construct (?|...).
Using that, it's trivial to divide City and Province into 3 groups.
So, just match with this regex:
[ \t]*(?|(Alberta)(?:[ \t]*,[ \t]*(?:(AB)|(\w+)))?|(British[ \t]+Columbia)(?:[ \t]*,[ \t]*(?:(BC)|(\w+)))?|(Manitoba)(?:[ \t]*,[ \t]*(?:(MB)|(\w+)))?|(New[ \t]+Brunswick)(?:[ \t]*,[ \t]*(?:(NB)|(\w+)))?|(Newfoundland)(?:[ \t]*,[ \t]*(?:(NL)|(\w+)))?|(Northwest[ \t]+Territories)(?:[ \t]*,[ \t]*(?:(NT)|(\w+)))?|(Nova[ \t]+Scotia)(?:[ \t]*,[ \t]*(?:(NS)|(\w+)))?|(Nunavut)(?:[ \t]*,[ \t]*(?:(NU)|(\w+)))?|(Ontario)(?:[ \t]*,[ \t]*(?:(ON)|(\w+)))?|(Prince[ \t]+Edward[ \t]+Island)(?:[ \t]*,[ \t]*(?:(PE)|(\w+)))?|(Quebec)(?:[ \t]*,[ \t]*(?:(QC)|(\w+)))?|(Saskatchewan)(?:[ \t]*,[ \t]*(?:(SK)|(\w+)))?|(Yukon)(?:[ \t]*,[ \t]*(?:(YT)|(\w+)))?|()(\w+)())
Check the groups in this order :
if NO match : Please enter 'City, Province' from the list
else if length $1 equals 0 : '$2' is not a valid City
else if length $3 > 0 : '$3' is not a valid Province
else if length $2 equals 0 : Please enter a Province
else : Thank you, your entry is valid '$1, $2'
Demo: https://regex101.com/r/MrlqEN/1
Expanded
[ \t]*
(?|
( Alberta ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( AB ) # (2)
| ( \w+ ) # (3)
)
)?
| ( British [ \t]+ Columbia ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( BC ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Manitoba ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( MB ) # (2)
| ( \w+ ) # (3)
)
)?
| ( New [ \t]+ Brunswick ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NB ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Newfoundland ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NL ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Northwest [ \t]+ Territories ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NT ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Nova [ \t]+ Scotia ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NS ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Nunavut ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( NU ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Ontario ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( ON ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Prince [ \t]+ Edward [ \t]+ Island ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( PE ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Quebec ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( QC ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Saskatchewan ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( SK ) # (2)
| ( \w+ ) # (3)
)
)?
| ( Yukon ) # (1)
(?:
[ \t]* , [ \t]*
(?:
( YT ) # (2)
| ( \w+ ) # (3)
)
)?
| ( ) # (1)
( \w+ ) # (2)
( ) # (3)
)
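The order of checks listed earlier (no match, unknown city, wrong province, missing province, valid) can also be sketched with the standard re module. Note that this swaps the technique: instead of a branch reset, which needs the third-party regex module, it uses one simple pattern plus a dictionary lookup. The names PROVINCES, ENTRY and check are hypothetical, and the message wording follows the list above.

```python
import re

# Province name -> abbreviation, taken from the menu in the question.
PROVINCES = {
    "Alberta": "AB", "British Columbia": "BC", "Manitoba": "MB",
    "New Brunswick": "NB", "Newfoundland": "NL",
    "Northwest Territories": "NT", "Nova Scotia": "NS", "Nunavut": "NU",
    "Ontario": "ON", "Prince Edward Island": "PE", "Quebec": "QC",
    "Saskatchewan": "SK", "Yukon": "YT",
}

# Group 1: one or more words; group 2 (optional): the token after the comma.
ENTRY = re.compile(r"[ \t]*([A-Za-z]+(?:[ \t]+[A-Za-z]+)*)(?:[ \t]*,[ \t]*(\w+))?")

def check(entry):
    """Return a message that says which part of the entry is wrong, if any."""
    m = ENTRY.match(entry)
    if m is None:
        return "Please enter 'City, Province' from the list"
    name = " ".join(m.group(1).split())  # collapse runs of spaces or tabs
    abbr = m.group(2)
    if name not in PROVINCES:
        return "'%s' is not a valid City" % name
    if abbr is None:
        return "Please enter a Province"
    if abbr != PROVINCES[name]:
        return "'%s' is not a valid Province" % abbr
    return "Thank you, your entry is valid '%s, %s'" % (name, abbr)

print(check("Alberta, UN"))
print(check("British Columbia, BC"))
```

Keeping the name-to-abbreviation pairs in a dictionary means a new province is one more entry rather than one more regex branch, at the cost of the single-regex property of the approach above.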

Using named captures with the branch reset pattern

$recepies =~ s/
(?|
([^\.\/]) ($dimentions)
|
(\(?) ($wholeNumberDecimal) #ex: 1.5
|
(\(?) ($wholeNumber) #ex: 1
|
(\(?) ($wholeNumberFraction)
)
(\s) ($unit)
/transformer($1,$2,$3,$4) /eixg; #the replacement
I was wondering how it would be possible to name my captures in this context.
For example, I would like to call the first one "prefix", the second "number", the third "space" and the fourth "unit".
It can probably be done like this:
$recepies =~ s/
(?|
(?<prefix> [^\.\/] ) # (1)
(?<number> $dimentions ) # (2)
|
(?<prefix> \(? ) # (1), ex: 1.5
(?<number> $wholeNumberDecimal ) # (2)
|
(?<prefix> \(? ) # (1), ex: 1
(?<number> $wholeNumber ) # (2)
|
(?<prefix> \(? ) # (1)
(?<number> $wholeNumberFraction ) # (2)
)
(?<space> \s ) # (3)
(?<unit> $unit ) # (4)
/transformer($+{prefix},$+{number},$+{space},$+{unit}) /eixg; #the replacement
Or like this
$recepies =~ s/
(?|
(?<prefix> [^\.\/] ) # (1)
(?<number> $dimentions ) # (2)
|
(?<prefix> \(? ) # (1), ex: 1.5
(?<number> $wholeNumberDecimal ) # (2)
|
(?<prefix> \(? ) # (1), ex: 1
(?<number> $wholeNumber ) # (2)
|
(?<prefix> \(? ) # (1)
(?<number> $wholeNumberFraction ) # (2)
)
\s
$unit
/transformer($+{prefix},$+{number},$unit) /eixg; #the replacement

How to find the next unbalanced brace?

The regex below captures everything up to the last balanced }.
Now, what regex would be able to capture everything up to the next unbalanced }? In other words, how can I get ... {three {four}} five} from $str instead of just ... {three {four}}?
my $str = "one two {three {four}} five} six";
if ( $str =~ /
(
.*?
{
(?> [^{}] | (?-1) )+
}
)
/sx
)
{
print "$1\n";
}
So you want to match
[noncurlies [block noncurlies [...]]] "}"
where a block is
"{" [noncurlies [block noncurlies [...]]] "}"
As a grammar:
start : text "}"
text : noncurly* ( block noncurly* )*
block : "{" text "}"
noncurly : /[^{}]/
As a regex (5.10+):
/
^
(
(
[^{}]*
(?:
\{ (?-1) \}
[^{}]*
)*
)
\}
)
/x
As a regex using named subpatterns (5.10+):
/
^ ( (?&TEXT) \} )
(?(DEFINE)
(?<TEXT> [^{}]* (?: (?&BLOCK) [^{}]* )* )
(?<BLOCK> \{ (?&TEXT) \} )
)
/x
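The grammar above can be checked against a non-regex implementation. The sketch below is in Python, whose standard re module has no recursion, so it swaps the recursive regex for a plain depth counter: every { opens a block, every } either closes one or, at depth zero, is the unbalanced brace being looked for. The function name up_to_unbalanced_close is hypothetical.

```python
def up_to_unbalanced_close(text):
    """Return the prefix of text ending at the first '}' with no matching '{'.

    Returns None when every closing brace is balanced.
    """
    depth = 0
    for index, char in enumerate(text):
        if char == "{":
            depth += 1
        elif char == "}":
            if depth == 0:
                # No block is open, so this brace is the unbalanced one.
                return text[: index + 1]
            depth -= 1
    return None

print(up_to_unbalanced_close("one two {three {four}} five} six"))
```

Unlike the anchored regexes, this does not fail on an unclosed { before the unbalanced }; it simply keeps scanning, so the two approaches differ on malformed input.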

perl stream file for regex token including scanned tokens

I am trying to stream a file in perl and tokenize the lines and include the tokens.
I have:
while( $line =~ /([\/][\d]*[%].*?[%][\d]*[\/]|[^\s]+|[\s]+)/g ) {
my $word = $1;
#...
}
But it doesn't work when there are no spaces between tokens.
For example, if my line is:
$line = '/15%one (1)(2)%15/ is a /%good (1)%/ +/%number(2)%/.'
I would like to split that line into:
$output =
[
'/15%one (1)(2)%15/',
' ',
'is',
' ',
'a',
'/%good (1)%/',
' ',
'+',
'/%number(2)%/',
'.'
]
What is the best way to do this?
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so
my @tokens;
push @tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
)
}sxg;
but that doesn't validate. If you want to validate, you could use
my @tokens;
push @tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
| \z (*COMMIT) (*FAIL)
| (?{ die "Syntax error" })
)
}sxg;
The following also validates, but it's a bit more readable and makes it easy to differentiate the token types:
my @tokens;
for ($line) {
m{\G ( \s+ ) }sxgc
&& do { push @tokens, $1; redo };
m{\G ( ([\/])([0-9]*)% (?: (?! %\3\2 ). )* %\3\2 ) }sxgc
&& do { push @tokens, $1; redo };
m{\G ( (?: (?! [\/][0-9]*% )\S )+ ) }sxgc
&& do { push @tokens, $1; redo };
m{\G \z }sxgc
&& last;
die "Syntax error";
}
pos will get you information about where the error occurred.
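To see the (?:(?!STRING).)* idiom in isolation, here is a rough Python port of the validating tokenizer. It is a sketch, not a drop-in replacement: Python's re has no \G, so the loop calls match at an explicit offset instead, and it uses named groups slash and num where the Perl version uses \2 and \3. The names TOKEN and tokenize are hypothetical.

```python
import re

TOKEN = re.compile(r"""
    \s+                                # a run of whitespace
  | (?P<slash>/)(?P<num>[0-9]*)%       # opener such as /15% or /%
    (?:(?!%(?P=num)(?P=slash)).)*      # any text that does not start the matching closer
    %(?P=num)(?P=slash)                # closer such as %15/ or %/
  | (?:(?!/[0-9]*%)\S)+                # other non-space text, stopping before an opener
""", re.VERBOSE | re.DOTALL)

def tokenize(line):
    """Split line into tokens, raising ValueError where no token can start."""
    tokens = []
    pos = 0
    while pos < len(line):
        m = TOKEN.match(line, pos)  # anchored at pos, which plays the role of \G
        if m is None:
            raise ValueError("Syntax error at offset %d" % pos)
        tokens.append(m.group(0))
        pos = m.end()
    return tokens

line = "/15%one (1)(2)%15/ is a /%good (1)%/ +/%number(2)%/."
for token in tokenize(line):
    print(repr(token))
```

Every alternative consumes at least one character, so the loop always advances. An opener with no matching closer matches none of the alternatives, and the offset in the error message is then the equivalent of Perl's pos.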