Best way to deal with "Unescaped braces in regex" inside Perl regex - regex

I recently started learning Perl to automate some mindless data tasks. I work on Windows machines but prefer to use Cygwin. I wrote a Perl script that did everything I wanted in Cygwin, but when I tried to run it with Strawberry Perl on Windows via CMD, I got the error "Unescaped left brace in regex is illegal here in regex".
After some reading, I am guessing that my Cygwin has an earlier version of Perl and that the modern version Strawberry uses doesn't allow this. I am familiar with escaping characters in regex, but I am getting this error when using a capture group from a previous regex match to do a substitution.
open(my $fh, '<:encoding(UTF-8)', $file)
or die "Could not open file '$file' $!";
my $fileContents = do { local $/; <$fh> };
my $i = 0;
while ($fileContents =~ /(.*Part[^\}]*\})/) {
$defParts[$i] = $1;
$i = $i + 1;
$fileContents =~ s/$1//;
}
Basically I am searching through a file for matches that look like:
Part
{
Somedata
}
Then storing those matches in an array. Then purging the match from the $fileContents so I avoid repeats.
I am certain there are better and more efficient ways of doing any number of these things, but I am surprised that when using a capture group it's complaining about unescaped characters.
I can imagine storing the capture group, manually escaping the braces, then using that for the substitution, but is there a quicker or more efficient way to avoid this error without rewriting the whole block? (I'd like to avoid special packages if possible so that this script is easily portable.)
All of the answers I found related to this error were with specific cases where it was more straightforward or practical to edit the source with the curly braces.
Thank you!

I would just bypass the whole problem and at the same time simplify the code:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i = $i + 1;
}
Here we simply do the substitution first. If it succeeds, it will still set $1 and return true (just like plain /.../), so there's no need to mess around with s/$1// later.
Using $1 (or any variable) as the pattern would mean you have to escape all regex metacharacters (e.g. *, +, {, (, |, etc.) if you want it to match literally. You can do that pretty easily with quotemeta or inline (s/\Q$1//), but it's still an extra step and thus error prone.
Alternatively, you could keep your original code and not use s///. I mean, you already found the match. Why use s/// to search for it again?
while ($fileContents =~ /(.*Part[^\}]*\})/) {
...
substr($fileContents, $-[0], $+[0] - $-[0], "");
}
We already know where the match is in the string. $-[0] is the position of the start and $+[0] the position of the end of the last regex match (thus $+[0] - $-[0] is the length of the matched string). We can then use substr to replace that chunk by "".
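A minimal sketch of the substr approach, run on made-up sample data (the string below is hypothetical, standing in for the real file contents):

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real file contents.
my $fileContents = "header\nPart\n{\nA\n}\nPart\n{\nB\n}\nfooter\n";
my @defParts;

while ($fileContents =~ /(.*Part[^\}]*\})/) {
    push @defParts, $1;
    # $-[0] and $+[0] are the start and end offsets of the whole match,
    # so this removes exactly the text that was just matched.
    substr($fileContents, $-[0], $+[0] - $-[0], "");
}

print scalar(@defParts), " parts extracted\n";
```

Since nothing is interpolated into a pattern here, the braces inside the matched text never get a chance to be treated as regex syntax.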
But let's keep going with s///:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i++;
}
$i = $i + 1; can be reduced to $i++; ("increment $i").
my @defParts;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
push @defParts, $1;
}
The only reason we need $i is to add elements to the @defParts array. We can do that by using push, so there's no need for maintaining an extra variable. This saves us another line.
Now we probably don't need to destroy $fileContents. If the substitution exists only for the benefit of this loop (so it doesn't re-match already extracted content), we can do better:
my @defParts;
while ($fileContents =~ /(.*Part[^\}]*\})/g) {
push @defParts, $1;
}
Using /g in scalar context attaches a "current position" to $fileContents, so the next match attempt starts where the previous match left off. This is probably more efficient because it doesn't have to keep rewriting $fileContents.
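To see that "current position" at work, here is a small sketch that records pos() after each match. The sample string is made up for illustration:

```perl
use strict;
use warnings;

my $fileContents = "Part\n{\nA\n}\nPart\n{\nB\n}\n";  # hypothetical sample data
my (@defParts, @offsets);

while ($fileContents =~ /(.*Part[^\}]*\})/g) {
    push @defParts, $1;
    push @offsets, pos($fileContents);  # where the next attempt will start
}

print "matches ended at offsets: @offsets\n";
```

Once a match attempt fails, the position is reset, so pos($fileContents) is undefined again after the loop and $fileContents itself is left untouched.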
my @defParts = $fileContents =~ /(.*Part[^\}]*\})/g;
... Or we could just use //g in list context, where it returns a list of all captured groups of all matches, and assign that to @defParts.
my @defParts = $fileContents =~ /.*Part[^\}]*\}/g;
If there are no capture groups in the regex, //g in list context returns the list of all matched strings (as if there had been ( ) around the whole regex).
Feel free to choose any of these. :-)

As for the question of escaping, that's what quotemeta is for,
my $needs_escaping = q(some { data } here);
say quotemeta $needs_escaping;
which prints (on v5.16)
some\ \{\ data\ \}\ here
and works on $1 as well. See the quotemeta documentation for details. Also see \Q in perlre (search for \Q), which is how this is used inside a regex, as in s/\Q$1//;. The \E stops the escaping, which you don't need here since the pattern ends right after $1.
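As a quick illustration, here is a sketch that removes a literal chunk containing braces and a + from a string. The strings are invented for the example:

```perl
use strict;
use warnings;

my $text    = "keep Part { x+y } keep";
my $literal = "Part { x+y }";   # hypothetical captured text with metacharacters

# \Q...\E escapes the interpolated text so it is matched literally.
(my $copy = $text) =~ s/\Q$literal\E//;
print "$copy\n";

# quotemeta produces the same escaping as a plain string.
my $escaped = quotemeta $literal;
print "$escaped\n";
```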
Some comments.
Relying on deletion so that the regex keeps finding further such patterns may be a risky design. If it isn't, and you do use it, there is no need for indices, since we have push
my @defParts;
while ($fileContents =~ /($pattern)/) {
push @defParts, $1;
$fileContents =~ s/\Q$1//;
}
where \Q is added in the regex. Better yet, as explained in melpomene's answer, the substitution can be done in the while condition itself
push @defParts, $1 while $fileContents =~ s/($pattern)//;
where I used the statement modifier form (postfix syntax) for conciseness.
With the /g modifier in scalar context, as in while (/($pattern)/g) { .. }, the search continues from the position of the previous match in each iteration, and this is a usual way to iterate over all instances of a pattern in a string. Please read up on use of /g in scalar context as there are details in its behavior that one should be aware of.
However, this is tricky here (even as it works) as the string changes underneath the regex. If efficiency is not a concern, you can capture all matches with /g in list context and then remove them
my @all_matches = $fileContents =~ /$patt/g;
$fileContents =~ s/$patt//g;
While inefficient, as it makes two passes, this is much simpler and clearer.
I expect that Somedata cannot possibly, ever, contain }, for instance as nested { ... }, correct? If it does, you have a problem of balanced delimiters, which is far more involved. One approach is to use the core Text::Balanced module. Search for SO posts with examples.
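For a rough idea of what that looks like, here is a minimal sketch using extract_bracketed from Text::Balanced. The input string is hypothetical, and the text is assumed to start at the opening brace:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input whose block contains nested braces.
my $text = "{ Somedata { nested } more }\nrest of file\n";

# In list context: the balanced block first, then whatever follows it.
my ($block, $remainder) = extract_bracketed($text, '{}');
print "block: $block\n";
```

A pattern like [^\}]*\} would stop at the first closing brace, while this returns the whole outer block.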

Related

Remove certain characters from a regex group

I have a string that looks like this (key":["value","value","value"])
"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]
and I use the following regex to select from the string. (the regex is setup in a way where it wont select a string that looks like this "key":[{"key":"value","key":"value"}] )
(?<=:\[").*?(?="])
Resulting Selection:
google.co.uk","google.com","google.com","google.com","google.co.uk
I want to remove the " in that select string, and i was wondering if there was an easy way to do this using the replace command. Desired result...
"emailDomains":["google.co.uk, google.com, google.com, google.com, google.co.uk"]
How do I solve this problem?
If your string indeed has the form "key":["v1", "v2", ... "vN"], you can split off the part that needs to be changed, replace "," by a space in it, and re-assemble:
my @parts = split / (\["\s* | \s*\"]) /x, $string; #"
$parts[2] =~ s/",\s*"/ /g;
my $processed = join '', @parts;
The regex pattern for the separator in split is captured, since in that case the separators are also included in the returned list, which is helpful here for putting the string back together. Then, we need to change the third element of the array.
In this approach, we have to change a specific element in the array so if your format varies, even a little, this may not (or still may) be suitable.
This should of course be processed as JSON, using a module. If the format isn't sure, as indicated in a comment, it would be best to try to ensure that you have JSON. Picking bits and pieces like above (or below) is a road to madness once requirements slowly start evolving.
The same approach can be used in a regex, and this may in fact have the advantage of being able to scoop up and ignore everything preceding the : (with split, that part may end up as multiple elements if the format isn't exactly as shown, which then affects everything)
$string =~ s{ :\["\s*\K (.*?) ( "\] ) }{
my $e = $2;
my $n = $1 =~ s/",\s*"/ /gr;
$n.$e
}ex;
Here the /e modifier makes it so that the replacement side is evaluated as code, where we do the same as with the split above. Notes on the regex:
We have to save away $2 first, since it gets reset by the next regex.
The /r modifier†, which doesn't change its target but rather returns the changed string, is what allows us to use the substitution operator on the read-only $1.
If nothing gets captured for $2, and perhaps for $1, that means that there was no match and the outcome is simply that $string doesn't change, quietly. So if this substitution should always work, you may want to add handling of such unexpected data.
We don't need $n above; the code can return ($1 =~ s/",\s*"/ /gr) . $e directly.
Or, using lookarounds as attempted
$string =~ s{ (?<=:\[") (.+?) (?="\]) }{ $1 =~ s/",\s*"/ /gr }egx;
which does reduce the amount of code, but may be trickier to work with later.
While this is a direct answer to the question I think it's least maintainable.
† This useful modifier, for "non-destructive substitution," appeared in v5.14. In earlier Perl versions we would copy the string and run the regex on the copy, with the idiom
(my $n = $1) =~ s/",\s*"/ /g;
In the lookarounds example we then need a little more
$string =~ s{...}{ (my $n = $1) =~ s/",\s*"/ /g; $n }egx
since the s/// operator returns the number of substitutions made, while we need $n to be returned from that whole piece of code in {} (the replacement side), to be used as the replacement.
You can use this \G-based regex: it starts the match at :[", then captures each following value so that the replacement keeps only the comma and drops the double quotes.
(:\[")|(?!^)\G([^"]+)"(,)"
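One way this pattern might be applied in Perl is sketched below, using a shortened version of the sample string. The /e replacement is my own assumption about how to combine the captures; it is needed because only one branch of the alternation matches at a time.

```perl
use strict;
use warnings;

my $string = '"emailDomains":["google.co.uk","google.com","google.co.uk"]';

# Branch 1 matches the opening :[" and is put back unchanged.
# Branch 2 continues from the previous match (\G) and keeps value + comma.
$string =~ s{ (:\[") | (?!^)\G ([^"]+) " (,) " }
            { defined $1 ? $1 : "$2$3" }gex;

print "$string\n";
```

This joins the values with a bare comma; if a comma followed by a space is wanted, as in the desired result, the replacement can return "$2$3 " instead.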
Your text is almost proper JSON, so it's really easy to go the final inch and make it so, and then process that:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say postderef/;
no warnings qw/experimental::postderef/;
use JSON::XS; # Install through your OS package manager or a CPAN client
my $str = q/"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]/;
my $json = JSON::XS->new();
my $obj = $json->decode("{$str}");
my $fixed = $json->ascii->encode({emailDomains =>
join(', ', $obj->{'emailDomains'}->@*)});
$fixed =~ s/^\{|\}$//g;
say $fixed;
Try Regex: " *, *"
Replace with: ,

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my @matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parentheses around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing null_ with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my @text;
if ($text =~ /,/) {
@text = split /,(?:null_)?/, $text;
}
else {
@text = $text;
}
for (@text) {
print "1 is : $_\n";
push (@matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
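Putting the points above together, one possible way to also deal with the lone CMD_ID=null line is sketched here. The input is an abbreviated copy of the sample output (with one line given a leading space, as mentioned above), and treating a bare "null" as "no value" is my assumption about what is wanted:

```perl
use strict;
use warnings;

# Abbreviated copy of the sample output from the question.
my $output = <<'END';
_skv_version=1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
 CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
END

my @matching_lines;
while ($output =~ /^\s*(?:CMD_ID|CMD_USES_ASSET_ENV)=(.+)/gm) {
    my $value = $1;                  # copy before running further regexes
    for my $item (split /,/, $value) {
        $item =~ s/^null(?:_|\z)//;  # drop a "null_" prefix or a bare "null"
        push @matching_lines, $item if length $item;
    }
}

print "$_\n" for @matching_lines;
```

Splitting on the comma first and stripping the prefix per item means a single substitution handles both the comma-separated line and the bare null.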
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)

Perl - how to get values of tokens

I am searching how to get tokens values in properties file with Perl.
Given the source property:
my $source="application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER";
And given the target property:
my $target="application.1.hostname=test.test.com:8080/WHATEVER";
I would like to get the following result:
{{DNS_APP}}=test.test.com
{{PORT_APP}}=8080
I have no trouble getting the tokens with:
my @matches = ( $source =~ /({{.*?}})/g );
But then, how do I match them with their values?
Is there an easy way, with Perl regexps, to get these substitutions?
Another difficulty (these are exceptions, so it is not a big deal if this problem is not addressed) is that, sometimes, $target can be
my $target="application.1.hostname=test.test.com/WHATEVER";
Or
my $target="application.1.hostname=test.test.com:8080/SOMETHINGELSE";
Or even
my $target="application.1.hostname=test.test.com/SOMETHINGELSE";
How to deal with that ?
Thank you in advance for your answers.
Regards.
OK, at a basic level, you can turn your thing into a named capture for a regex. There's a caveat though - you might need to restrict character sets.
But something like this might work:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $source = "application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER";
my $target = "application.1.hostname=test.test.com:8080/WHATEVER";
$source =~ s|\Q{{\E(\w+)\Q}}\E|(?<$1>.*)|g;
$source = qr/$source/;
print "Using Regex:", $source,"\n";
$target =~ m/$source/;
#%+ is the special named-capture hash. You can access $+{DNS_APP} for example
print Dumper \%+;
Note though - that .* is a greedy match, which means that without delimiters/anchors between patterns, this will break. You could perhaps define a narrower character class - I would think \w normally, but you also have . so perhaps [\w.]+ - or maybe even .*? for non-greedy matching instead. This depends rather on what would 'fit' with the types of patterns you're trying to match. If you need to do so with arbitrary patterns, I think you're going to need ... something like a regex to define the match criteria in the first place.
If your 'targets' are purely that pattern - e.g. trailing static words - you can trim your initial pattern with s/\w+$// which will reduce it to:
application.1.hostname={{DNS_APP}}:{{PORT_APP}}/
Which you then regex transform to:
(?^:application.1.hostname=(?<DNS_APP>.*):(?<PORT_APP>.*)/)
And then get %+ of:
$VAR1 = {
'DNS_APP' => 'test.test.com',
'PORT_APP' => '8080'
};
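A variation on the same idea is sketched below, with two changes that are my own assumptions rather than part of the code above: the literal parts of the template are passed through quotemeta (so the dots in application.1.hostname only match dots), and each token uses the narrower [\w.]+ class. It needs Perl 5.10+ for named captures.

```perl
use strict;
use warnings;

my $source = 'application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER';
my $target = 'application.1.hostname=test.test.com:8080/WHATEVER';

# split with a capture group returns literal text at even indices and
# token names at odd indices.
my @pieces = split /\{\{(\w+)\}\}/, $source;

my $pattern = '';
for my $i (0 .. $#pieces) {
    $pattern .= $i % 2
        ? "(?<$pieces[$i]>[\\w.]+)"   # token: named capture, narrow class
        : quotemeta $pieces[$i];      # literal text: escape metacharacters
}

my %values;
%values = %+ if $target =~ /\A$pattern\z/;
printf "{{%s}}=%s\n", $_, $values{$_} for sort keys %values;
```

Because the pattern is anchored at both ends, a target with a missing port or a different trailing word simply fails to match and leaves %values empty, so those exceptional cases would need their own handling.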
As you're on 5.8.8 - my first advice is to upgrade, because it's 7-year-old software and long since end of life.
The %+ variable was added in Perl v5.10.0.
However you should be able to work around by:
my @match_names = $source =~ m|\Q{{\E(\w+)\Q}}\E|g; #capture 'names' of matches
$source =~ s|\Q{{\E(\w+)\Q}}\E|(.*)|g;
$source = qr/$source/;
print "Using Regex:", $source, "\n";
my %results;
my @matches = $target =~ m/$source/;
@results{@match_names} = @matches;
print Dumper \%results;
I'm pretty sure there's a way of capturing what matched from the s pattern replacement. If I figure out what it is, I'll update.
(As it stands:
my ( @match_names ) = $source =~ s|\Q{{\E(\w+)\Q}}\E|\(.*\)|g;
doesn't seem to work as I want - @match_names contains the number of replacements.)
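One possibility, offered as a sketch and not necessarily what the author had in mind, is the /e modifier: the replacement side is then code, so it can record each token name as a side effect before yielding the capture group. This also works on Perl 5.8.

```perl
use strict;
use warnings;

my $source = "application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER";
my $target = "application.1.hostname=test.test.com:8080/WHATEVER";

my @match_names;
# The replacement is code: remember the token name, then yield the group.
$source =~ s/\{\{(\w+)\}\}/push @match_names, $1; '(.*)'/ge;

my %results;
@results{@match_names} = $target =~ /$source/;

print "$_ => $results{$_}\n" for sort keys %results;
```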

Perl, match one pattern multiple times in the same line delimited by unknown characters

I've been able to find similar, but not identical questions to this one. How do I match one regex pattern multiple times in the same line delimited by unknown characters?
For example, say I want to match the pattern HEY. I'd want to recognize all of the following:
HEY
HEY HEY
HEYxjfkdsjfkajHEY
So I'd count 5 HEYs there. So here's my program, which works for everything but the last one:
open ( FH, $ARGV[0]);
while(<FH>)
{
foreach $w ( split )
{
if ($w =~ m/HEY/g)
{
$count++;
}
}
}
So my question is how do I replace that foreach loop so that I can recognize patterns delimited by weird characters in unknown configurations (like shown in the example above)?
EDIT:
Thanks for the great responses thus far. I just realized I need one other thing though, which I put in a comment below.
One question though: is there any way to save the matched term as well? So like in my case, is there any way to reference $w (say if the regex was more complicated, and I wanted to store it in a hash with the number of occurrences)
So if I was matching a real regex (say a sequence of alphanumeric characters) and wanted to save that in a hash.
One way is to capture all matches of the string and see how many you got. Like so:
open (FH, $ARGV[0]);
while(my $w = <FH>) {
my @matches = $w =~ m/(HEY)/g;
my $count = scalar(@matches);
print "$count\t$w\n";
}
EDIT:
Yes, there is! Just loop over all the matches, and use the capture variables to increment the count in a hash:
my %hash;
open (FH, $ARGV[0]);
while (my $w = <FH>) {
foreach ($w =~ /(HEY)/g) {
$hash{$1}++;
}
}
The problem is you really don't want to call split(). It splits things into words, and you'll note that your last line only has a single "word" (though you won't find it in the dictionary). A word is bounded by white-space and thus is just "everything but whitespace".
What you really want is to keep looking through each line, counting every HEY and starting where you left off each time. That requires the /g at the end, inside a loop that keeps looking:
while(<>)
{
while (/HEY/g)
{
$count++;
}
}
print "$count\n";
There is, of course, more than one way to do it but this sticks close to your example. Other people will post other wonderful examples too. Learn from them all!
None of the above answers worked for my similar problem. $1 does not seem to change (perl 5.16.3) so $hash{$1}++ will just count the first match n times.
To get each match, the foreach needs a local variable assigned, which will then contain the match variable. Here's a little script that will match and print each occurrence of (number).
#!/usr/bin/perl -w
use strict;
use warnings FATAL=>'all';
my (%procs);
while (<>) {
foreach my $proc ($_ =~ m/\((\d+)\)/g) {
$procs{$proc}++;
}
}
print join("\n",keys %procs) . "\n";
I'm using it like this:
pstree -p | perl extract_numbers.pl | xargs -n 1 echo
(except with some relevant filters in that pipeline). Any pattern capture ought to work as well.
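Another way to get a fresh $1 for every match is a while loop with /g in scalar context, which performs one match per iteration. A small sketch using the three lines from the question:

```perl
use strict;
use warnings;

my @lines = ("HEY", "HEY HEY", "HEYxjfkdsjfkajHEY");   # lines from the question
my %count;

for my $line (@lines) {
    # Each pass through the condition performs one match, so $1 is
    # set anew every time (unlike foreach over a list-context match).
    while ($line =~ /(HEY)/g) {
        $count{$1}++;
    }
}

print "$_: $count{$_}\n" for sort keys %count;
```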

How do I make an arbitrary Perl regex wholly non-capturing?

How can I remove capturing from arbitrarily nested sub-groups in a Perl regex string? I'd like to nest any regex into an enveloping expression that captures the sub-regex as a whole entity as well as statically known subsequent groups. Do I need to transform the regex string manually into using all non-capturing (?:) groups (and hope I don't mess up), or is there a Perl regex or library mechanism that provides this?
# How do I 'flatten' $regex to protect $2 and $3?
# Searching 'ABCfooDE' for 'foo' OK, but '((B|(C))fo(o)?(?:D|d)?)', etc., breaks.
# I.E., how would I turn it effectively into '(?:(?:B|(?:C))fo(?:o)?(?:D|d)?)'?
sub check {
my($line, $regex) = @_;
if ($line =~ /(^.*)($regex)(.*$)/) {
print "<", $1, "><", $2, "><", $3, ">\n";
}
}
Addendum: I am vaguely aware of $&, $`, and $' and have been advised to avoid them if possible, and I don't have access to ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} in my Perl 5.8 environment. The example above can be partitioned into 2/3 chunks using methods like these, and more complex real cases could manually iterate this, but I think I'd like a general solution if possible.
Accepted Answer: What I wish existed and surprisingly (to me at least) does not, is an encapsulating group that makes its contents opaque, such that subsequent positional backreferences see the contents as a single entity and named references are de-scoped. gbacon has a potentially useful workaround for Perl 5.10+, and FM shows a manual iterative mechanism for any version that can accomplish the same effect in specific cases, but j_random_hacker is right that there is no real language mechanism to encapsulate subexpressions.
In general, you can't.
Even if you could transform all (...)s into (?:...)s, this would not work in the general case because the pattern might require backreferences: e.g. /(.)X\1/, which matches any character, followed by an X, followed by the originally matched character.
So, absent a Perl mechanism for discarding captured results "after the fact", there is no way to solve your problem for all regexes. The best you can do (or could do if you had Perl 5.10) is to use gbacon's suggestion and hope to generate a unique name for the capture buffer.
One way to protect the subpatterns you care about is to use named capture buffers:
Additionally, as of Perl 5.10.0 you may use named capture buffers and named backreferences. The notation is (?<name>...) to declare and \k<name> to reference. You may also use apostrophes instead of angle brackets to delimit the name; and you may use the bracketed \g{name} backreference syntax. It's possible to refer to a named capture buffer by absolute and relative number as well. Outside the pattern, a named capture buffer is available via the %+ hash. When different buffers within the same pattern have the same name, $+{name} and \k<name> refer to the leftmost defined group.
In the context of your question, check becomes
sub check {
use 5.10.0;
my($line, $regex) = @_;
if ($line =~ /(^.*)($regex)(.*$)/) {
print "<", $+{one}, "><", $+{two}, "><", $+{three}, ">\n";
}
}
Then calling it with
my $pat = qr/(?<one>(?<two>B|(?<three>C))fo(o)?(?:D|d)?)/;
check "ABCfooDE", $pat;
outputs
<CfooD><C><C>
This does not address the general case, but your specific example can be handled with the /g option in scalar context, which would allow you to divide the problem into two matches, the second picking up where the first left off:
sub check {
my($line, $regex) = @_;
my ($left_side, $regex_match) = ($1, $2) if $line =~ /(^.*)($regex)/g;
my $right_side = $1 if $line =~ /(.*$)/g;
print "<$left_side> <$regex_match> <$right_side>\n"; # <AB> <CfooD> <E123>
}
check( 'ABCfooDE123', qr/((B|(C))fo(o)?(?:D|d)?)/ );
If all you need is the portion of the string before and after the match, you can use the @- and @+ arrays to get the offsets into the matched string:
sub check {
my ($line, $regex) = @_;
if ($line =~ /$regex/) {
my $pre = substr $line, 0, $-[0];
my $match = substr $line, $-[0], $+[0] - $-[0];
my $post = substr $line, $+[0];
print "<$pre><$match><$post>\n";
}
}
Perl v5.22 and later has the /n modifier, which turns off capturing for plain (...) groups; named groups still capture.
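Applied to the check sub from the question, that might look like the sketch below. It assumes the pattern arrives as a plain string (a qr// object carries its own flags and would switch n back off), and split_around is a made-up name for this example. As noted in another answer, a pattern that relies on numbered backreferences would still break.

```perl
use 5.022;
use strict;
use warnings;

# Hypothetical helper: returns the text before, inside, and after the match.
sub split_around {
    my ($line, $regex) = @_;
    # (?n:...) makes every plain (...) inside $regex non-capturing,
    # so the outer groups keep the numbers 1, 2 and 3.
    return $line =~ /(^.*)((?n:$regex))(.*$)/ ? ($1, $2, $3) : ();
}

my @parts = split_around('ABCfooDE', '((B|(C))fo(o)?(?:D|d)?)');
print "<$parts[0]><$parts[1]><$parts[2]>\n";
```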
This doesn't disable capturing, but might accomplish what you want:
$ perl -wle 'my $_ = "123abc"; /(\d+)/ && print "num: $1"; { /([a-z]+)/ && print "letter: $1"; } print "num: $1";'
num: 123
letter: abc
num: 123
You create a new scope and the $1 outside it will not be affected.