Opposite of (foo|bar|baz) - regex

I'd like a regex to match everything but a few specific options within a broader expression.
The following example will match test_foo.pl or test_bar.pl or test_baz.pl:
/test_(foo|bar|baz)\.pl/
But I'd like just the opposite:
match test_.*\.pl except for where .* = (foo|bar|baz)
My options are limited because this does not go directly into a Perl program; it is an argument to cloc, a program that counts lines of code (and happens to be written in Perl). So I'm looking for an answer that can be done in one regex, not several chained together.

You should be able to accomplish this by using a negative lookahead:
/test_(?!foo|bar|baz).*\.pl/
This will fail if foo, bar, or baz immediately follows test_.
Note that this could still match something like test_notfoo.pl, and would fail on test_fool.pl. If you do not want this behavior, please clarify by adding some examples of what exactly should and should not match.
If you want to accept something like test_fool.pl or test_bart.pl, then you could change it to the following:
/test_(?!(foo|bar|baz)\.pl).*\.pl/
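To see where the two patterns differ, it can help to run both against a few file names. This is only a minimal sketch: the sample names and the variable names ($prefix_only, $full_name, %result) are made up for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Pattern 1: reject any name where foo, bar or baz directly follows "test_".
my $prefix_only = qr/test_(?!foo|bar|baz).*\.pl/;
# Pattern 2: reject only when foo, bar or baz is immediately followed by ".pl".
my $full_name   = qr/test_(?!(?:foo|bar|baz)\.pl).*\.pl/;

# Hypothetical file names, chosen to show where the two patterns disagree.
my @names = qw(test_foo.pl test_fool.pl test_notfoo.pl test_me.pl);

my %result;
for my $name (@names) {
    # Store a 1/0 flag per pattern so the outcome is easy to compare.
    $result{$name} = [ $name =~ $prefix_only ? 1 : 0, $name =~ $full_name ? 1 : 0 ];
    printf "%-16s prefix-only: %d  full-name: %d\n", $name, @{ $result{$name} };
}
```

Only test_fool.pl should come out differently between the two columns: the first pattern rejects it because foo follows test_, while the second accepts it because foo is not followed directly by .pl.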

#!/usr/bin/env perl
use strict; use warnings;
my $pat = qr/\Atest_.+(?<!foo|bar|baz)[.]pl\z/;
while (my $line = <DATA>) {
chomp $line;
printf "%s %s\n", $line, $line =~ $pat ? 'matches' : "doesn't match";
}
__DATA__
test_bar.pl
test_foo.pl
test_baz.pl
test baz.pl
0test_bar.pl
test_me.pl
test_me_too.txt
Output:
test_bar.pl doesn't match
test_foo.pl doesn't match
test_baz.pl doesn't match
test baz.pl doesn't match
0test_bar.pl doesn't match
test_me.pl matches
test_me_too.txt doesn't match

(?:(?!STR).)*
is to
STR
as
[^CHAR]*
is to
CHAR
So you want
if (/^test_(?:(?!foo|bar|baz).)*\.pl\z/s)
More readable:
my %bad = map { $_ => 1 } qw( foo bar baz );
if (/^test_(.*)\.pl\z/s && !$bad{$1})
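The hash-based version can be wrapped in a small filter, sketched below. The sub name wanted and the sample file names are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Names that must not appear between "test_" and ".pl".
my %bad = map { $_ => 1 } qw( foo bar baz );

# Returns 1 for test_*.pl names whose middle part is not exactly foo, bar or baz.
sub wanted {
    my ($name) = @_;
    return $name =~ /^test_(.*)\.pl\z/s && !$bad{$1} ? 1 : 0;
}

# Hypothetical input list; other.pl is dropped because it lacks the test_ prefix.
my @kept = grep { wanted($_) } qw( test_foo.pl test_fool.pl test_me.pl other.pl );
print "$_\n" for @kept;
```

Note that this keeps test_fool.pl, whereas the single-regex form with (?:(?!foo|bar|baz).)* would reject it, since that pattern forbids foo, bar or baz from starting anywhere in the middle part. Which behavior is wanted depends on the file names you need to exclude.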

Hmm, I might have misunderstood your question. Anyway, maybe this is helpful ...
You would negate the match operator. For example:
perl -lwe "print for grep ! m/(lwp|archive).*\.pl/, glob q(*.pl)"
# Note you'd use single-quotes on Linux but double-quotes on Windows.
# Nothing to do with Perl, just different shells (bash vs cmd.exe).
The ! negates the match. The above is shorthand for:
perl -lwe "print for grep ! ($_ =~ m/(lwp|archive).*\.pl/), glob q(*.pl)"
Which can also be written using the negated match operator !~, as follows:
perl -lwe "print for grep $_ !~ m/(lwp|archive).*\.pl/, glob q(*.pl)"
In case you're wondering, the glob is simply used to get an input list of filenames as per your example. I just substituted another match pattern suitable for the files I had handy in a directory.

Related

Perl Replace Capture Group With Length of Capture Group

Given :
my $str = "foo95285734776bar";
$str =~ s/([0-9]{2,4})/_????_/g;
What single regex where '????' is the length of $1 can produce output "foo_4__4__3_bar" ?
That is, where "9528" is replaced with "_4_", "5734" with "_4_", and the remaining "776" with "_3_".
You can use the /e modifier to add Perl code into the substitution part that is then evaled.
my $str = "foo95285734776bar";
$str =~ s/([0-9]{2,4})/'_' . length($1) . '_'/ge;
print $str;
Will output
foo_4__4__3_bar
Note that you now need a full Perl expression there. That's why you have to actually quote and concatenate the underscores.
From perlop:
A /e will cause the replacement portion to be treated as a full-fledged Perl expression and evaluated right then and there. It is, however, syntax checked at compile-time. A second e modifier will cause the replacement portion to be evaled before being run as a Perl expression.
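Because the replacement is an ordinary Perl expression under /e, it can also call a named sub. The following is a small sketch of that idea; the helper name tag is hypothetical and not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps the length of the matched digits in underscores.
sub tag {
    my ($digits) = @_;
    return '_' . length($digits) . '_';
}

my $str = "foo95285734776bar";
# /e evaluates tag($1) for every match; /g repeats over the whole string.
$str =~ s/([0-9]{2,4})/tag($1)/ge;
print "$str\n";
```

Moving the logic into a sub keeps the substitution readable once the replacement grows beyond a one-liner.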

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my @matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parentheses around $cmd_pattern; otherwise, you're getting two alternatives, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing the null_ prefix with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my @text;
if ($text =~ /,/) {
@text = split /,(?:null_)?/, $text;
}
else {
@text = $text;
}
for (@text) {
print "1 is : $_\n";
push (@matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)
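For reference, here is one possible way to put the pieces together, including the lone CMD_ID=null line. It is a sketch under assumptions: the input is a shortened copy of the sample from the question, and treating a bare "null" as "no value" is a guess about what the asker wants.

```perl
use strict;
use warnings;

# Shortened copy of the sample data from the question.
my $output = <<'END';
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
END

my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my @matching_lines;

# /m makes ^ and $ match at every line; /g in a while loop walks all matches.
while ($output =~ /^\s*(?:$cmd_pattern)(.+)$/gm) {
    my $value = $1;                 # copy before running any other regex
    for my $item (split /,/, $value) {
        next if $item eq 'null';    # assumption: a bare "null" carries no value
        $item =~ s/^null_//;        # strip the optional null_ prefix
        push @matching_lines, $item;
    }
}
print "$_\n" for @matching_lines;
```

Splitting on the comma first and stripping null_ per item avoids the separate if/else branch, at the cost of one extra substitution per value.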

Why can't I store a regexp in a variable?

Given the following code,
my $string = "foo";
my $regex = s/foo/bar/;
$string =~ $regex;
print $string, "\n";
I would have expected the output to be bar, however it is foo. Why is that the case, and how can I solve that problem?
Note that in my actual case, the regex is more complicated, and I actually want to store several of them in a hash (so I can write something like $string =~ $rules{$key}).
You're looking for a substitution, not only the regex part, so I guess a compiled regex (qr//) is not what you're looking for:
use strict;
use warnings;
my $string = "foo";
my $regex = sub { $_[0] =~ s/foo/bar/ };
$regex->($string);
print $string, "\n";
Your statement
my $regex = s/foo/bar/
is equivalent to
my $regex = $_ =~ s/foo/bar/
s/// returns the number of substitutions made, or it returns false (specifically, the empty string). So $regex is now '' or 1 (it could be more if the /g modifier was in effect) and
$string =~ $regex
is doing 'foo' =~ // or 'foo' =~ /1/ depending on what $_ contained originally.
You can store a regex pattern in a variable but, in your example, the regex is just foo, and there is a lot more going on than just that pattern.
The statement s/foo/bar/ is more complex than it seems -- it is a fully-fledged statement that applies a regex pattern to a target string and substitutes a replacement string if the pattern is found. In this case the target string is the default variable $_ and the replacement string is bar. You could think of it as a call to a subroutine
substitute($_, 'foo', 'bar')
and the regex pattern is only the second parameter
What you can do is store a regex pattern. The regex part of that substitution is foo, and you can say
my $pattern = qr/foo/;
s/$pattern/bar/;
But you really should explain the problem that you're trying to solve so that we can help you better
In the assignment, you need to tell Perl not to evaluate the regular expression but just to keep it. This is what qr is for.
But you can't do this with whole substitutions, which is why Сухой27 suggests using a subroutine.
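Since the question mentions keeping several rules in a hash, the subroutine approach can be extended as sketched below. The rule names (trim, foo_to_bar) and the sample string are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rule table: each value is a code reference that edits its
# argument in place (elements of @_ are aliases to the caller's variables).
my %rules = (
    foo_to_bar => sub { $_[0] =~ s/foo/bar/ },
    trim       => sub { $_[0] =~ s/^\s+|\s+$//g },
);

my $string = "  foo  ";
# Apply the rules in an explicit order; hash key order is not guaranteed.
$rules{$_}->($string) for qw( trim foo_to_bar );
print "[$string]\n";
```

An alternative design would store pairs such as [ qr/foo/, 'bar' ] and run s/$pattern/$replacement/ in a loop, which works as long as the replacement is a plain string.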

In Perl, how can I correctly extract URLs that are enclosed in parentheses?

I've got two questions about Regexp::Common qw/URI/ and regexes in Perl.
I use Regexp::Common qw/URI/ to find URIs in strings and delete them. But I get an error when a URI is between parentheses.
For example: (http://www.example.com)
The error is caused by the ')': when it tries to parse the URI, the app crashes. So I've thought of two fixes:
Write a simple regex (or so I thought) that inserts a whitespace before the ) character
Use a function in Regexp::Common qw/URI/ that implements a fix, if there is one
In my code I've tried to implement the Regex but the app freezes. The code that I've tried is this:
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.example.com)";
while ($str =~ m/\)/){
$str =~ s/\)/ \)/;
}
my ($uri) = $str =~ /$RE{URI}{-keep}/;
print "$uri\n";
print $str;
The output that I want is: (http://www.example.com )
I'm not sure, but I think that the problem is in $str =~ s/\)/ \)/;
BTW, I've got a question about Regexp::Common qw/URI/. I've got two types of string:
ablalbalblalblalbal http://www.example.com
asfasdfasdf http://www.example.com aasdfasdfasdf
I want to remove the URI if it is the last component (and save it). And, if not, save it without removing it from the text.
You don't have to first test for a match to be able to use the s/// operator correctly: If the string does not match the search pattern, it will not do anything.
#!/usr/bin/perl
use strict; use warnings;
my $str = "Hello!!, I love (GOOGLE)";
$str =~ s/\)/ )/g;
print "$str\n";
The general problem of detecting URLs correctly in text is error-prone. See for example Jeff's thoughts on this.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\)/){
$str =~ s/\)/ )/;
}
Your program goes into an infinite loop at this point. To see why, try printing the value of $str each time round the loop.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\)/){
$str =~ s/\)/ )/;
print $str, "\n";
}
The first time it prints "Hello!!, I love (GOOGLE )". The while loop condition is then evaluated again. Your string still matches your regular expression (it still contains a closing parenthesis) so the replacement is run again and this time it prints out "Hello!!, I love (GOOGLE  )" with two spaces.
And so it goes on. Each time round the loop another space is added, but each time you still have a closing parenthesis, so another substitution is run.
The simplest solution I can see is to only match the closing parenthesis if it is preceded by a non-whitespace character (using \S).
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\S\)/){
$str =~ s/\)/ )/;
print $str, "\n";
}
In this case the loop is only executed once.
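The same effect can be had without a loop by putting the \S condition into the substitution itself. This is a sketch using a lookbehind, which is a variation on the answer's idea rather than its exact code; the variable $once exists only to show that a second run changes nothing.

```perl
use strict;
use warnings;

my $str = "Hello!!, I love (GOOGLE)";

# Insert a space before each ")" that directly follows a non-whitespace character.
$str =~ s/(?<=\S)\)/ )/g;
my $once = $str;

# Running it again changes nothing, because every ")" now follows a space.
$str =~ s/(?<=\S)\)/ )/g;
print "$str\n";
```

Because the substitution is safe to repeat, there is no risk of the ever-growing run of spaces described above.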
Why not just include the parentheses in the search? If the URLs will always be bracketed, then something like this:
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.google.com)";
my ($uri) = $str =~ / \( ( $RE{URI} ) \) /x;
print "$uri\n";
The regex from Regexp::Common can be used as part of a longer regex; it doesn't have to be used on its own. I've also used the 'x' modifier on the regex to allow whitespace so you can see more clearly what is going on: the brackets with backslashes are treated as characters to match, and those without define what is to be captured (presumably like {-keep}; I've not used that before).
You could also make the brackets optional, with something like:
/ (?: \( ( $RE{URI} ) \) | ( $RE{URI} ) ) /x
although that would result in two match variables, one undefined - so something like following would be needed:
my $uri = $1 || $2 || die "Didn't match a URL!";
There's probably a better way to do this, and also if you're not bothered about matching parentheses then you could simply make the brackets optional (via a '?') in the first regex...
To answer your second question about only matching URLs at the end of the line - have a look at Regex 'anchors' which can force a match against the beginning or end of a line: ^ and $ (or \A and \Z if you prefer). e.g. matching a URL at the end of a line only:
/$RE{URI}\Z/
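The "remove it only if it is last" requirement could be sketched as follows. To keep the example free of non-core modules it uses a deliberately simplified URL pattern as a stand-in for $RE{URI}; that pattern, the sub name extract_uri, and the return convention are all assumptions for illustration.

```perl
use strict;
use warnings;

# Assumption: a simplified stand-in for $RE{URI}; it is not a full URI parser.
my $url = qr{https?://\S+};

# Returns (text, uri). The URI is removed from the text only when it is
# the last component; otherwise the text is returned unchanged.
sub extract_uri {
    my ($text) = @_;
    if ($text =~ s/\s*($url)\s*\z//) {
        return ($text, $1);
    }
    my ($uri) = $text =~ /($url)/;
    return ($text, $uri);
}

for my $line ("ablalbalblalblalbal http://www.example.com",
              "asfasdfasdf http://www.example.com aasdfasdfasdf") {
    my ($text, $uri) = extract_uri($line);
    print "text=[$text] uri=[$uri]\n";
}
```

With Regexp::Common installed, $url could presumably be replaced by $RE{URI}, bearing in mind the parenthesis issue discussed above.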

What is the difference between using $1 vs \1 in Perl regex substitutions?

I'm debugging some code and wondered if there is any practical difference between $1 and \1 in Perl regex substitutions
For example:
my $package_name = "Some::Package::ButNotThis";
$package_name =~ s{^(\w+::\w+)}{$1};
print $package_name; # Some::Package
This following line seems functionally equivalent:
$package_name =~ s{^(\w+::\w+)}{\1};
Are there subtle differences between these two statements? Do they behave differently in different versions of Perl?
First, you should always use warnings when developing:
#!/usr/bin/perl
use strict; use warnings;
my $package_name = "Some::Package::ButNotThis";
$package_name =~ s{^(\w+::\w+)}{\1};
print $package_name, "\n";
Output:
\1 better written as $1 at C:\Temp\x.pl line 7.
When you get a warning you do not understand, add diagnostics:
C:\Temp> perl -Mdiagnostics x.pl
\1 better written as $1 at x.pl line 7 (#1)
(W syntax) Outside of patterns, backreferences live on as variables.
The use of backslashes is grandfathered on the right-hand side of a
substitution, but stylistically it's better to use the variable form
because other Perl programmers will expect it, and it works better if
there are more than 9 backreferences.
Why does it work better when there are more than 9 backreferences? Here is an example:
#!/usr/bin/perl
use strict; use warnings;
my $t = (my $s = '0123456789');
my $r = join '', map { "($_)" } split //, $s;
$s =~ s/^$r\z/\10/;
$t =~ s/^$r\z/$10/;
print "[$s]\n";
print "[$t]\n";
Output:
C:\Temp> x
]
[9]
If that does not clarify it, take a look at:
C:\Temp> x | xxd
0000000: 5b08 5d0d 0a5b 395d 0d0a [.]..[9]..
See also perlop:
The following escape sequences are available in constructs that interpolate and in transliterations …
\10 octal is 8 decimal. So, the replacement part contained the character code for BACKSPACE.
NB
Incidentally, your code does not do what you want: it will not print Some::Package, contrary to what your comment says, because all you are doing is replacing Some::Package with Some::Package without touching ::ButNotThis.
You can either do:
($package_name) = $package_name =~ m{^(\w+::\w+)};
or
$package_name =~ s{^(\w+::\w+)(?:::\w+)*\z}{$1};
From perldoc perlre:
The bracketing construct "( ... )" creates capture buffers. To refer to
the current contents of a buffer later on, within the same pattern, use
\1 for the first, \2 for the second, and so on. Outside the match use
"$" instead of "\".
The \<digit> notation works in certain circumstances outside the match. But it can clash with octal escapes, which happens when the backslash is followed by more than one digit.
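A short sketch of that clash, and of the ${1} form that avoids it when a capture group must be followed by a literal digit. The variable names $bs and $grp are made up for illustration.

```perl
use strict;
use warnings;

# "\10" in the replacement is read as an octal escape (character 8, backspace),
# not as "group 1 followed by 0".
my $bs = 'ab';
$bs =~ s/(a)/\10/;

# Braces delimit the variable name, so this is group 1 followed by a literal "0".
my $grp = 'ab';
$grp =~ s/(a)/${1}0/;

printf "octal form: first char code %d\n", ord($bs);
print  "brace form: $grp\n";
```

Writing ${1} is only needed when a digit follows directly; elsewhere plain $1 is the usual spelling.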