How to avoid warnings in Perl regex substitution with alternatives?

I have this regex.
$string =~ s/(?<!["\w])(\w+)(?=:)|(?<=:)([\w\d\\.+=\/]+)/"$1$2"/g;
The regex itself works fine.
But since I am substituting alternatives (and globally), I always get a warning that $1 or $2 is uninitialized. These warnings clutter my logfile.
What can I do better to avoid such warning?
Or is my best option to just turn the warning off? I doubt this.
Side question: Is there possibly some better way of doing this, e.g. not using regex at all?
What I am doing is fixing JSON where some key:value pairs do not have double quotes, which makes the JSON module fail when it tries to decode the data.

There are a couple of approaches to get around this.
If you intend to use capture groups:
If each branch of the alternation is captured in its entirety,
combine the capture groups into one and move that group outside the alternation.
( # (1 start)
(?<! ["\w] )
\w+
(?= : )
|
(?<= : )
[\w\d\\.+=/]+
) # (1 end)
s/((?<!["\w])\w+(?=:)|(?<=:)[\w\d\\.+=\/]+)/"$1"/g
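For comparison, here is a minimal Python sketch of the same single-group idea. It is not part of the Perl answer: the helper name quote_bare_tokens and the sample input are assumptions made for illustration, and \d is dropped from the character class because \w already covers digits.

```python
import json
import re

# Single capture group wrapped around the whole alternation, so group 1
# always participates in a match (hypothetical helper, for illustration).
UNQUOTED = re.compile(r'((?<!["\w])\w+(?=:)|(?<=:)[\w\\.+=/]+)')

def quote_bare_tokens(text):
    """Wrap every unquoted key or value in double quotes."""
    return UNQUOTED.sub(r'"\1"', text)

fixed = quote_bare_tokens('{foo:bar,baz:1.5}')
data = json.loads(fixed)
print(fixed)
```

Note that every bare value comes back as a string (1.5 becomes "1.5"), exactly as with the Perl substitution, so this is only a rough repair step.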
Use a Branch Reset construct (?| aaa ).
This will cause the capture groups in each alternation to start numbering from the same point.
from the same point.
(?|
(?<! ["\w] )
( \w+ ) # (1)
(?= : )
|
(?<= : )
( [\w\d\\.+=/]+ ) # (1)
)
s/(?|(?<!["\w])(\w+)(?=:)|(?<=:)([\w\d\\.+=\/]+))/"$1"/g
Use Named capture groups that are re-useable (Similar to a branch reset).
In each alternation, reuse the same names. Make the group that isn't relevant, the empty group.
This works by using the name in the substitution instead of the number.
(?<! ["\w] )
(?<V1> \w+ ) # (1)
(?<V2> ) # (2)
(?= : )
|
(?<= : )
(?<V1> ) # (3)
(?<V2> [\w\d\\.+=/]+ ) # (4)
s/(?<!["\w])(?<V1>\w+)(?<V2>)(?=:)|(?<=:)(?<V1>)(?<V2>[\w\d\\.+=\/]+)/"$+{V1}$+{V2}"/g
The two concepts of the named substitution and a branch reset can be combined
if an alternation contains more than 1 capture group.
The example below uses the capture group numbers.
The theory is that you put dummy capture groups in each alternation to
"pad" the branch to equal the largest number of groups in a single alternation.
Indeed, this must be done to avoid the bug in Perl regex that could cause a crash.
(?| # Branch Reset
# ------ Br 1 --------
( ) # (1)
( \d{4} ) # (2)
ABC294
( [a-f]+ ) # (3)
|
# ------ Br 2 --------
( :: ) # (1)
( \d+ ) # (2)
ABC555
( ) # (3)
|
# ------ Br 3 --------
( == ) # (1)
( ) # (2)
ABC18888
( ) # (3)
)
s/(?|()(\d{4})ABC294([a-f]+)|(::)(\d+)ABC555()|(==)()ABC18888())/"$1$2$3"/g

You can try using Cpanel::JSON::XS's relaxed mode, or JSONY, to parse the almost-JSON and then write out regular JSON using Cpanel::JSON::XS. Depending what exactly is wrong with your input data one or the other might understand it better.
use strict;
use warnings;
use Cpanel::JSON::XS 'encode_json';
# JSON is normally UTF-8 encoded; if you're reading it from a file, you will likely need to decode it from UTF-8
my $string = q<{foo: 1,bar:'baz',}>;
my $data = Cpanel::JSON::XS->new->relaxed->decode($string);
my $json = encode_json $data;
print "$json\n";
# Alternatively, parse the same $string with JSONY
use JSONY;
my $jsony_data = JSONY->new->load($string);
my $jsony_json = encode_json $jsony_data;
print "$jsony_json\n";

Related

Perl / Regex String Manipulation for multiple matches

I have the following string:
<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />
I need to strip everything in this string apart from:
Feed="EUREX-EMDI"
State="CLOSED"
IsTainted="0"
I have managed to get "Feed="EUREX-EMDI"" with the following code:
s/^[^Feed]*(?=Feed)//;
So it now looks like:
Feed="EUREX-EMDI" IPPort="224.0.50.0:59098" State="CLOSED" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="2191840" />
However I now don't know how to look for the next part "State="CLOSED"" in the string whilst ignoring my already found "Feed="EUREX-EMDI"" match
The perl idiom for this type of thing is a multiple assignment from regex capture groups. Assuming you can always count on the items of interest being in the same order and format (quoting):
($feed, $state, $istainted) = /.*(Feed="[^"]*").*(State="[^"]*").*(IsTainted="[^"]*")/;
Or if you only want to capture the (unquoted) values themselves, change the parentheses (capture groups):
($feed, $state, $istainted) = /.*Feed="([^"]*)".*State="([^"]*)".*IsTainted="([^"]*)"/;
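The same multiple-assignment idea can be sketched in Python, where the groups of one match are unpacked into several variables. The sample line is a shortened copy of the one in the question, and the variable names are only illustrative. Like the Perl version, it assumes the attributes always appear in this order.

```python
import re

# Shortened sample line (assumption: attribute order is fixed).
line = ('<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" '
        'State="CHECK" IsTainted="0" UncrossAfterGap="0" />')

match = re.search(r'Feed="([^"]*)".*State="([^"]*)".*IsTainted="([^"]*)"', line)
if match:
    # One capture group per variable, in pattern order.
    feed, state, is_tainted = match.groups()
    print(feed, state, is_tainted)
```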
Please, don't try and parse XML with a regex. It's brittle. XML is contextual, and regular expressions aren't. So at best, it's a dirty hack, and one that may one day break without warning for the most inane reasons.
See: RegEx match open tags except XHTML self-contained tags for more.
However, XML is structured, and it's actually quite easy to work with - provided you use something well suited to the job: A parser.
I like XML::Twig. XML::LibXML is also excellent, but has a bit of a steeper learning curve. (You also get XPath, which is like regular expressions but much better suited to XML.)
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#create a list of what we want to keep. This map just turns it
#into a hash.
my %keep = map { $_ => 1 } qw ( IsTainted State Feed );
#parse the XML. If it's a file, you may want "parsefile" instead.
my $twig = XML::Twig->parse( \*DATA );
#iterate the attributes.
foreach my $att ( keys %{ $twig->root->atts } ) {
#delete the attribute unless it's in our 'keep' list.
$twig->root->del_att($att) unless $keep{$att};
}
#print it. You may find set_pretty_print useful for formatting XML.
$twig->print;
__DATA__
<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" State="CHECK" IsTainted="0" UncrossAfterGap="0" ManualUncrosses="0" AutoUncrosses="0" ExpectedSeqNo="-" />
Outputs:
<Multicast Feed="EUREX-EMDI" IsTainted="0" State="CHECK"/>
That preserves the attributes, and gives you valid XML. But if you just want the values:
foreach my $att ( qw ( Feed State IsTainted ) ) {
print $att, "=", $twig->root->att($att),"\n";
}
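The parser-based approach is not specific to Perl. As a rough sketch, the same "keep only these attributes" step could look like this with Python's standard xml.etree.ElementTree module. The variable names are illustrative and the sample element is a shortened copy of the one above.

```python
import xml.etree.ElementTree as ET

# Shortened sample element, for illustration only.
xml_text = ('<Multicast ID="0/m1" Feed="EUREX-EMDI" IPPort="224.0.50.128:59098" '
            'State="CHECK" IsTainted="0" ExpectedSeqNo="-" />')

root = ET.fromstring(xml_text)

# Look the wanted attributes up by name; their order in the XML is irrelevant.
wanted = ("Feed", "State", "IsTainted")
kept = {name: root.get(name) for name in wanted}

for name, value in kept.items():
    print(f'{name}="{value}"')
```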
This will strip all but those strings.
$str =~ s/(?s)(?:(?!(?:Feed|State|IsTainted)\s*=\s*".*?").)*(?:((?:Feed|State|IsTainted)\s*=\s*".*?")|$)/$1/g;
If you want to include a space separator, make the replacement ' $1'.
Explained
(?s) # Dot - all
(?: # To be removed
(?!
(?: Feed | State | IsTainted )
\s* = \s* " .*? "
)
.
)*
(?: # To be saved
( # (1 start)
(?: Feed | State | IsTainted )
\s* = \s* " .*? "
) # (1 end)
| $
)

Why does this regular expression match in pcregrep but not within my c++ code?

I have a regex that works perfectly with pcregrep:
pcregrep -M '([a-zA-Z0-9_&*]+)(\(+)([a-zA-Z0-9_ &\*]+)(\)+)(\n)(\{)'
Now I tried to include this regex in my C++ code but it does not match (escapes included):
char const *regex = "([a-zA-Z0-9_&*]+)\\(+([a-zA-Z0-9_ &\\*]+)\\)+(?>\n+)\\{+";
re = pcre_compile(regex, PCRE_MULTILINE, &error, &erroffset, 0);
I'm trying to find function bodies like this (the paragraph is 0a in hex):
my_function(char *str)
{
Why does it work with pcregrep and not within the C++ code?
Your first regex:
( [a-zA-Z0-9_&*]+ ) # (1)
( \(+ ) # (2)
( [a-zA-Z0-9_ &\*]+ ) # (3)
( \)+ ) # (4)
( \n ) # (5)
( \{ ) # (6)
Your second regex:
( [a-zA-Z0-9_&*]+ ) # (1)
\(+
( [a-zA-Z0-9_ &\*]+ ) # (2)
\)+
(?> \n+ )
\{+
Other than different capture groups and an unnecessary atomic group (?>)
there is one thing that is obviously different:
The last newline and curly brace in the second regex have + quantifiers.
But that's 1 or more, so I think the first regex would be a subset of the second.
The un-obvious difference is that it is unknown if the files were opened in translated mode or not.
You can usually cover all cases with \r?\n in place of \n.
(or even (?:\r?\n|\r) ).
So, if you want to quantify the linebreak, it would be (?:\r?\n)+ or (?:\r?\n|\r)+.
The other option might be to try the linebreak construct (I think it's \R)
instead (available on the newest versions of pcre).
If that doesn't work, it's something else.
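To see the effect of the line-ending difference in isolation, here is a small Python sketch (Python's re rather than PCRE, so only an approximation of the original setup). The names HEAD, lf_only and any_eol are made up for the example.

```python
import re

# Shared prefix: function name, parentheses, argument list.
HEAD = r'([a-zA-Z0-9_&*]+)\(+([a-zA-Z0-9_ &*]+)\)+'

lf_only = re.compile(HEAD + r'\n\{')              # expects a bare LF
any_eol = re.compile(HEAD + r'(?:\r?\n|\r)+\{')   # LF, CRLF or CR

unix_text = 'my_function(char *str)\n{'
dos_text = 'my_function(char *str)\r\n{'

for label, text in (('LF', unix_text), ('CRLF', dos_text)):
    print(label, bool(lf_only.search(text)), bool(any_eol.search(text)))
```

If the file was read in binary mode on a system that writes CRLF, only the tolerant pattern matches.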

Prefixing multiple lines with a previously captured token

I'm looking for a search/replace regular expression which will capture tokens and apply them as a prefix to every subsequent line within a document.
So this..
Tokens always start with ##..
Nothing is prefixed until a token is encountered..
##CAT
furball
scratch
##DOG
egg
##MOUSE
wheel
on the stair
Becomes..
Tokens always start with ##..
Nothing is prefixed until a token is captured!
##CAT
CAT furball
CAT scratch
##DOG
DOG egg
##MOUSE
MOUSE wheel
MOUSE on the stair
You can use this pattern:
search: ((?:\A|\n)##([^\r\n]+)(?>\r?\n\2[^\r\n]+)*+\r?\n(?!##))
replace: $1$2 <= with a space at the end
But you must apply the search/replace several times, until there are no more matches.
As far as I know, this is impossible. The closest I can get is replacing
^##(.*)\r?\n(.*)
with
##\1\n\1 \2
Output:
Tokens always start with ##..
Nothing is prefixed until a token is encountered..
##CAT
CAT furball
scratch
##DOG
DOG egg
##MOUSE
MOUSE wheel
on the stair
You have the pcre tag and the Notepad++ tag.
I don't think you can actually do this without a callback mechanism.
That being said, you can do it without a callback, but you need to divide
up the functionality.
This is a php example that might give you some ideas.
Note - PHP concatenates strings with '.', and the callback needs access to the $token variable from the enclosing scope (for an anonymous function, via a by-reference use clause).
The usage is multi-line mode //m modifier.
^ # Begin of line
(?-s) # Modifier, No Dot-All
(?:
( # (1 start)
\#\# # Token indicator
( \w+ ) # (2), Token
.* # The rest of line
) # (1 end)
| # or,
( .* ) # (3), Just a non-token line
)
$ # End of line
# $token = "";
# $str = preg_replace_callback('/^(?-s)(?:(\#\#(\w+).*)|(.*))$/m',
# function( $matches ) use ( &$token ) {
# // Token line: remember the token, leave the line unchanged
# if ( $matches[1] != "" ) {
# $token = $matches[2];
# return $matches[1];
# }
# // Ordinary line: prefix it once a token has been seen
# if ( $token != "" ) {
# return $token . " " . $matches[3];
# }
# return $matches[3];
# },
# $text);
#
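The same callback idea can be sketched in Python, which may be easier to try out. This is an illustrative translation, not the answer's original code. The function name prefix_lines is an assumption, and empty lines are deliberately left unprefixed.

```python
import re

LINE = re.compile(r'^(?:(##(\w+).*)|(.*))$', re.MULTILINE)

def prefix_lines(text):
    """Prefix each non-token line with the most recently seen ## token."""
    token = ""

    def repl(m):
        nonlocal token
        if m.group(1) is not None:      # token line: remember it, keep as-is
            token = m.group(2)
            return m.group(1)
        line = m.group(3)
        if token and line:              # ordinary, non-empty line after a token
            return token + " " + line
        return line

    return LINE.sub(repl, text)

sample = "Tokens always start with ##..\n##CAT\nfurball\nscratch\n##DOG\negg"
result = prefix_lines(sample)
print(result)
```

The state (the current token) lives in the enclosing function, so one pass over the text is enough.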

regex validating a time value or a list of time values

I need a regex, which matches a single time value as well as lists of time values in the format hhmm[, hhmm] like for example:
"1245" or "0056, 1034,2355"
I am not so good with regex.. I thought this would do it:
(([0-1][0-9])|(2[0-3]))[0-5][0-9](,[ \t]*(([0-1][0-9])|(2[0-3]))[0-5][0-9])*
single time values are validated correctly, but if I try lists of times, every number behind the comma is accepted. It matches also "1235, 4711".
Can someone give me a hint what I am doing wrong?
Thanks in advance!
$pat = qr/(?:2[0-3]|[01][0-9])[0-5][0-9]/;
while (<DATA>) {
if (/^$pat(,\s*$pat)*$/) {
print;
}
}
__DATA__
1245
0056, 1034,2355
1034,2455
You should add a ^ to instruct the regular expression to match from the beginning of the line.
The following regex should work.
^([01][0-9]|2[0-3])[0-5][0-9](,\s*([01][0-9]|2[0-3])[0-5][0-9])*$
In my opinion this is more readable regexp and it should work.
while( <DATA> ) {
if( /
^(
(
((0|1)\d)|(2[0-3]) #regex for hour (the first number may be 0, 1, or 2
#if 0 or 1, the second number can be from 0 to 9
#if 2, the second number can be from 0 to 3
)
[0-5]\d #regex for minutes (the first number
#can be from 0 to 5, second from 0 to 9)
)
(
,\s* #comma required
#whitespace after the comma is optional
(
((0|1)\d)|(2[0-3])
)
[0-5]\d
)*$
/x ) {
print;
}
}
Your regular expression is basically fine except that it looks for the pattern anywhere inside the target string. That means any string that contains a single valid time will match. You must add beginning and end of string anchors ^ and $ to force the entire string to match the pattern.
You will find it clearer and easier to code regular expressions if you first write a common sub-expression and then use it like a subroutine. It also helps to use the /x modifier so that you can use whitespace to lay out the expression more clearly.
For instance, this matches a single time string
/ ( [0-1][0-9] | 2[0-3] ) [0-5][0-9] /x
and you can go on to substitute that twice in the main expression.
It is also better to use non-capturing parentheses like (?: ... ) unless you really want to capture the substring into $1, $2 etc.
Take a look at this program and see what you think
use strict;
use warnings;
my $time = qr/(?: (?: [0-1][0-9] | 2[0-3] ) [0-5][0-9] ) /x;
while (<DATA>) {
print if /^ $time (?: ,[ \t]* $time )* $/x;
}
__DATA__
1245
0056, 1034,2355
1235, 4711
0000,1111
output
1245
0056, 1034,2355
0000,1111
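The "write the sub-expression once and reuse it" idea carries over to other languages. Here is a minimal Python sketch under that assumption. The names TIME, TIME_LIST and is_valid are invented for the example, and fullmatch takes the place of the ^ and $ anchors.

```python
import re

# One time value: hour 00-23 followed by minute 00-59.
TIME = r'(?:[01][0-9]|2[0-3])[0-5][0-9]'

# A list is one time, then any number of ", time" repetitions.
TIME_LIST = re.compile(rf'{TIME}(?:,[ \t]*{TIME})*')

def is_valid(text):
    """True only if the whole string is a time or a comma-separated list of times."""
    return TIME_LIST.fullmatch(text) is not None

for candidate in ("1245", "0056, 1034,2355", "1235, 4711", "0000,1111"):
    print(candidate, is_valid(candidate))
```

Using fullmatch rather than search is what rejects "1235, 4711". Without it the leading "1235" alone would count as a match, which is the problem described above.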
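This regexp checks only the overall list shape (comma-separated groups of digits); it does not validate the hour and minute ranges: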
/^(\d+)(, ?\d+)*$/

Is it possible to check if two groups are equal?

If I have some HTML like this:
<b>1<i>2</i>3</b>
And the following regex:
\<[^\>\/]+\>(.*?)\<\/[^\>]+\>
Then it will match:
<b>1<i>2</i>
I want it to only match HTML where the start and end tags are the same. Is there a way to do this?
Thanks,
Joe
Is there a way to do this?
Yes, certainly. Ignore those flippant non-answers that tell you it can’t be done. It most certainly can. You just may not wish to do so, as I explain below.
Numbered Captures
Pretending for the nonce that HTML <i> and <b> tags are always devoid of attributes, and moreover, neither overlap nor nest, we have this simple solution:
#!/usr/bin/env perl
#
# solution A: numbered captures
#
use v5.10;
while (<>) {
say "$1: $2" while m{
< ( [ib] ) >
(
(?:
(?! < /? \1 > ) .
) *
)
</ \1 >
}gsix;
}
Which when run, produces this:
$ echo 'i got <i>foo</i> and <b>bar</b> bits go here' | perl solution-A
i: foo
b: bar
Named Captures
It would be better to use named captures, which leads to this equivalent solution:
#!/usr/bin/env perl
#
# Solution B: named captures
#
use v5.10;
while (<>) {
say "$+{name}: $+{contents}" while m{
< (?<name> [ib] ) >
(?<contents>
(?:
(?! < /? \k<name> > ) .
) *
)
</ \k<name> >
}gsix;
}
Recursive Captures
Of course, it is not reasonable to assume that such tags neither overlap nor nest. Since this is recursive data, it therefore requires a recursive pattern to solve. Remembering that the trivial pattern to parse nested parens recursively is simply:
( \( (?: [^()]++ | (?-1) )*+ \) )
I’ll build that sort of recursive matching into the previous solution, and I’ll further toss in a bit of iterative processing to unwrap the inner bits, too.
#!/usr/bin/perl
use v5.10;
# Solution C: recursive captures, plus bonus iteration
while (my $line = <>) {
my @input = ( $line );
while (@input) {
my $cur = shift @input;
while ($cur =~ m{
< (?<name> [ib] ) >
(?<contents>
(?:
[^<]++
| (?0)
| (?! </ \k<name> > )
.
) *+
)
</ \k<name> >
}gsix)
{
say "$+{name}: $+{contents}";
push @input, $+{contents};
}
}
}
Which when demo’d produces this:
$ echo 'i got <i>foo <i>nested</i> and <b>bar</b> bits</i> go here' | perl Solution-C
i: foo <i>nested</i> and <b>bar</b> bits
i: nested
b: bar
That’s still fairly simple, so if it works on your data, go for it.
Grammatical Patterns
However, it doesn’t actually know about proper HTML syntax, which admits tag attributes to things like <i> and <b>.
As explained in this answer, one can certainly use regexes to parse markup languages, provided one is careful about it.
For example, this knows the attributes germane to the <i> (or <b>) tag. Here we defined regex subroutines used to build up a grammatical regex. These are definitions only, just like defining regular subs but now for regexes:
(?(DEFINE) # begin regex subroutine defs for grammatical regex
(?<i_tag_end> < / i > )
(?<i_tag_start> < i (?&attributes) > )
(?<attributes> (?: \s* (?&one_attribute) ) *)
(?<one_attribute>
\b
(?&legal_attribute)
\s* = \s*
(?:
(?&quoted_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?&standard_attribute)
| (?&event_attribute)
)
(?<standard_attribute>
class
| dir
| ltr
| id
| lang
| style
| title
| xml:lang
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<event_attribute>
on click
| on dbl click
| on mouse down
| on mouse move
| on mouse out
| on mouse over
| on mouse up
| on key down
| on key press
| on key up
)
(?<nv_pair> (?&name) (?&equals) (?&value) )
(?<name> \b (?= \pL ) [\w\-] + (?<= \pL ) \b )
(?<equals> (?&might_white) = (?&might_white) )
(?<value> (?&quoted_value) | (?&unquoted_value) )
(?<unwhite_chunk> (?: (?! > ) \S ) + )
(?<unquoted_value> [\w\-] * )
(?<might_white> \s * )
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<start_tag> < (?&might_white) )
(?<end_tag>
(?&might_white)
(?: (?&html_end_tag)
| (?&xhtml_end_tag)
)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
)
Once you have the pieces of your grammar assembled, you could incorporate those definitions into the recursive solution already given to do a much better job.
However, there are still things that haven’t been considered, and which in the more general case must be. Those are demonstrated in the longer solution already provided.
SUMMARY
I can think of only three possible reasons why you might not care to use regexes for parsing general HTML:
You are using an impoverished regex language, not a modern one, and so you have no recourse to essential modern conveniences like recursive matching or grammatical patterns.
You might find such concepts as recursive and grammatical patterns too complicated for you to easily understand.
You prefer for someone else to do all the heavy lifting for you, including the heavy testing, and so you would rather use a separate HTML parsing module instead of rolling your own.
Any one or more of those might well apply. In which case, don’t do it this way.
For simple canned examples, this route is easy. The more robust you want this to work on things you’ve never seen before, the harder this route becomes.
Certainly you can’t do any of it if you are using the inferior, impoverished pattern matching bolted onto the side of languages like Python or even worse, Javascript. Those are barely any better than the Unix grep program, and in some ways, are even worse. No, you need a modern pattern matching engine such as found in Perl or PHP to even start down this road.
But honestly, it’s probably easier just to get somebody else to do it for you, by which I mean that you should probably use an already-written parsing module.
Still, understanding why not to bother with these regex-based approaches (at least, not more than once) requires that you first correctly implement proper HTML parsing using regexes. You need to understand what it is all about. Therefore, little exercises like this are useful for improving your overall understanding of the problem-space, and of modern pattern matching in general.
This forum isn’t really in the right format for explaining all these things about modern pattern-matching. There are books, though, that do so equitably well.
You probably don't want to use regular expressions with HTML.
But if you still want to do this you need to take a look at backreferences.
Basically it's a way to capture a group (such as "b" or "i") to use it later in the same regular expression.
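As a small illustration of a backreference, here is a Python sketch that only matches when the closing tag repeats the opening tag name. It assumes attribute-free <i> and <b> tags and handles neither nesting of the same tag nor general HTML. The name SAME_TAG is made up for the example.

```python
import re

# Group 1 captures the tag name; \1 later requires the very same text again.
# The lookahead stops the contents before any tag with that same name.
SAME_TAG = re.compile(r'<([ib])>((?:(?!</?\1>).)*)</\1>', re.I | re.S)

simple = SAME_TAG.findall('i got <i>foo</i> and <b>bar</b> bits go here')
nested = SAME_TAG.findall('<b>1<i>2</i>3</b>')
print(simple)
print(nested)
```

For the HTML from the question, the match now runs from <b> to </b> instead of stopping at </i>.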
Related issues:
RegEx match open tags except XHTML self-contained tags