Perl Regex to exclude certain strings within the middle of a string - regex

I have this string:
The file FILENAME has not been received
I'm trying to get the regex to match this, but not if the string is this (for example):
The file FAILNAME has not been received
I've got this regex so far:
/^(?=.*?\bThe\sfile\b)((?!FAILNAME).)*$/
But I'm unsure how to continue the expected text after the exclusion.
I hope I've explained that correctly :)
Thanks in advance!

I'd do it in two steps:
if ($message =~ /\AThe file (\S+) has not been received\z/ && $1 ne 'FAILNAME') {
I.e. use a regex to validate the general format and extract the filename, then check the extracted name separately.
Why cram everything into a single regex?
Speaking of which, you can actually cram arbitrary conditions into a regex. I wouldn't recommend it in this case, but:
/\AThe file (\S+) has not been received\z(?(?{ $1 eq 'FAILNAME' })(*FAIL))/
This extended pattern essentially says "if $1 equals FAILNAME, fail the match".
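As a rough sketch of the two-step check described above (the helper name is_wanted_message and the sample messages are made up for illustration, and the exclusion list is an assumption that generalizes the single FAILNAME case):

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the message has the expected shape
# and the reported file name is not in the list of excluded names.
sub is_wanted_message {
    my ($message, @excluded) = @_;

    # Step 1: validate the overall format and capture the file name.
    return 0 unless $message =~ /\AThe file (\S+) has not been received\z/;
    my $name = $1;

    # Step 2: compare the captured name using plain string equality.
    my $is_excluded = grep { $_ eq $name } @excluded;
    return $is_excluded ? 0 : 1;
}

for my $msg ('The file FILENAME has not been received',
             'The file FAILNAME has not been received',
             'The file FILENAME has been received') {
    printf "%-42s %s\n", $msg,
        is_wanted_message($msg, 'FAILNAME') ? 'match' : 'no match';
}
```

Keeping the exclusion outside the regex means more excluded names can be added without touching the pattern.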

You could move the negative lookahead to after file followed by a whitespace character to assert what is directly on the right is not FAILNAME:
^The\sfile\s(?!\bFAILNAME\b).*$
Or, if it cannot occur anywhere in the string after The file, add a quantifier inside the lookahead:
^The\sfile\s(?!.*\bFAILNAME\b).*$
If FAILNAME should only count when no non-whitespace character sits directly before or after it, you could use lookarounds:
^The\sfile\s(?!.*(?<!\S)FAILNAME(?!\S)).*$
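To see how the three patterns above differ, here is a small sketch that runs each of them against a few sample strings. The labels, the matches helper and the samples are invented for this illustration.

```perl
use strict;
use warnings;

# The three patterns from above, keyed by labels chosen for this sketch.
my %pattern = (
    directly_after => qr/^The\sfile\s(?!\bFAILNAME\b).*$/,
    anywhere_after => qr/^The\sfile\s(?!.*\bFAILNAME\b).*$/,
    space_bounded  => qr/^The\sfile\s(?!.*(?<!\S)FAILNAME(?!\S)).*$/,
);

# Returns 1 if the labelled pattern matches the text, otherwise 0.
sub matches {
    my ($label, $text) = @_;
    return $text =~ $pattern{$label} ? 1 : 0;
}

my @samples = (
    'The file FILENAME has not been received',
    'The file FAILNAME has not been received',
    'The file FILENAME replaces FAILNAME',
    'The file FAILNAME.txt has not been received',
);

for my $text (@samples) {
    my @results = map { "$_=" . matches($_, $text) } sort keys %pattern;
    print join(' ', @results), "  $text\n";
}
```

The last sample is the interesting one: \b treats the dot as a boundary, so the first two patterns reject FAILNAME.txt, while the whitespace-based lookarounds accept it.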

My guess is that you want to reject the strings that contain FAILNAME. Your original expression already seems to do that, so it only needs a slight modification. This might work:
^(?=The\sfile\s)(?!.*\s\bFAILNAME\b\s.*).*$
Here the \s on each side of FAILNAME acts as an extra left and right boundary. If you do not want to require those spaces, simply leave them out of the pattern.
Example
use strict;
my $str = 'The file FILENAME has not been received
The file FILENAME has been received
The file FILENAME
The file AFAILNAME has not been received
FILENAME
The file FAILNAME has not been received
FAILNAME has not been received
The file FAILNAME
';
my $regex = qr/^(?=The\sfile\s)(?!.*\s\bFAILNAME\b\s.*).*$/mp;
while ( $str =~ /$regex/g ) {
print "Whole match is ${^MATCH} and its start/end positions can be obtained via \$-[0] and \$+[0]\n";
# print "Capture Group 1 is $1 and its start/end positions can be obtained via \$-[1] and \$+[1]\n";
# print "Capture Group 2 is $2 ... and so on\n";
}
# ${^POSTMATCH} and ${^PREMATCH} are also available with the use of '/p'
# Named capture groups can be called via $+{name}

Related

Perl: extract substring that matches a pattern between two XML tags

I need to parse an XML file without using a module.
In that XML file I need to extract all content between 2 tags (<mi>...</mi>) that match a pattern.
I have this:
$xmlstring = my xml string
$pattern = "G2_CPU";
my $regex = "<mi>(.*?" . $pattern . ".*?)<\\/mi>";
my ($data) = $xmlstring =~ /$regex/i;
But when I execute it, $data contains everything between the very first <mi> tag and the very last </mi> tag.
I also tried the regex without a variable: /(<mi>.*?G2_CPU.*?<\/mi>)/ and got the same result.
How can I do it?
Assuming this is still valid XML, i.e. < cannot appear between tag open and tag close, and there is no CDATA within those tags, you can just use:
my $re = qr{<mi>([^<]*? \Q$pattern\E [^<]*?)</mi>}ix;
That is, instead of allowing any character up to the substring of interest, allow just non-tag opening characters.
Also, my first instinct, if I ever thought I would try to go down the rabbit hole of parsing XML without a decent XML parser, would have been to first extract the text between <mi>...</mi> and then check if it contains what I am looking for.
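A minimal sketch of that "extract first, then check" idea, under the same assumptions as above (no nested tags and no CDATA inside <mi>). The helper name mi_containing and the sample XML are made up, and a real XML parser remains the safer choice:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the text of every <mi>...</mi> element,
# then keep only the entries that contain the literal $pattern.
sub mi_containing {
    my ($xml, $pattern) = @_;

    # [^<]* cannot cross a tag boundary, so each capture stays inside one element.
    my @contents = $xml =~ m{<mi>([^<]*)</mi>}gi;

    # \Q...\E makes any regex metacharacters in $pattern literal.
    return grep { /\Q$pattern\E/i } @contents;
}

my $xmlstring = '<r><mi>G1_MEM</mi><mi>node G2_CPU load</mi><mi>G2_DISK</mi></r>';
print "$_\n" for mi_containing($xmlstring, 'G2_CPU');
```

Splitting the work this way also returns every matching element, not just the first one.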
You just need to add a greedy match at the beginning of the pattern, so that it consumes as much of the string as possible before the <mi> you are after:
my $regex = "(?:.*)<mi>(.*?" . $compteur . ".*?)<\/mi>";
             ^^^^^^
From Shortest match issues:
The problem is that even with non-greedy matching, Perl is still
trying to find the match that starts at the leftmost possible point in
the string.
Test
File p.pl:
$xmlstring = "hello <mi>first mi</mi> and this is another <mi>second mi</mi> end." ;
$compteur="second";
my $regex = "(?:.*)<mi>(.*?" . $compteur . ".*?)<\/mi>";
my ($data) = $xmlstring =~ /$regex/i;
print "$data\n";
Execution:
$ perl p.pl
second mi
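For comparison, this sketch runs the same test string through the pattern with and without the leading greedy .*, to show the leftmost-start behavior quoted above. The capture_of helper is only there for the illustration.

```perl
use strict;
use warnings;

# Small helper for this sketch: return the first capture group, or undef.
sub capture_of {
    my ($text, $re) = @_;
    my ($captured) = $text =~ $re;
    return $captured;
}

my $xmlstring = "hello <mi>first mi</mi> and this is another <mi>second mi</mi> end.";

# Non-greedy quantifiers, yet the match still starts at the leftmost <mi>.
my $leftmost = capture_of($xmlstring, qr{<mi>(.*?second.*?)</mi>}i);

# A leading greedy .* moves the start to the last <mi> that still allows a match.
my $last = capture_of($xmlstring, qr{.*<mi>(.*?second.*?)</mi>}i);

print "without leading .*: $leftmost\n";
print "with leading .*:    $last\n";
```

Without the leading .*, the capture runs from "first mi" across the intermediate tags up to "second mi", which is the behavior the question describes.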

What regex should be used to avoid symbols before text in perl?

I have the following variable. I only want to print yes if the variable contains "imoport/canada/campingplaces/tobermory" with nothing such as # in front of it. What should I put in a regex for this kind of thing?
my $textfile = "# imoport/canada/campingplaces/tobermory
imoport/canada/campingplaces/tobermory
#imoport/canada/campingplaces/tobermory";
my $textNeeded = "imoport/canada/campingplaces/tobermory";
This is what I am using:
if ($textfile =~ m/$textNeeded/i) {
print "yes working"
}
Note: I am getting data from different text files, so some text files might only have "#imoport/canada/campingplaces/tobermory". I want to avoid those.
Despite the quite vague problem description, I think I have puzzled out what you mean. You mean you may have lines where the text is commented out with #, and you want to avoid matching those.
print "yes" if $textfile =~ /^\s*$textNeeded/im;
This will match if any line inside $textfile consists of optional whitespace followed by your string. The /m option makes the regex multiline, meaning that ^ and $ match at the start and end of every line inside a larger string, not only at the start and end of the whole string.
You may wish to be wary of regex meta characters in your search string. If for example your search string is foo[bar].txt, those brackets will be interpreted as a character class instead. In which case you would use
/^\s*\Q$textNeeded\E/im
instead. The \Q ... \E will make the text inside match only literal characters.
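Put together, the check might look like the sketch below. The helper name has_uncommented_line and the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when some line of $text starts, after
# optional whitespace, with the literal string $needle.
sub has_uncommented_line {
    my ($text, $needle) = @_;
    # /m: ^ matches at every line start; \Q...\E: treat $needle literally.
    return $text =~ /^\s*\Q$needle\E/im ? 1 : 0;
}

my $textNeeded = "imoport/canada/campingplaces/tobermory";
my $mixed      = "# $textNeeded\n$textNeeded\n#$textNeeded";
my $commented  = "# $textNeeded\n#$textNeeded";

print "mixed:     ", has_uncommented_line($mixed, $textNeeded)     ? "yes" : "no", "\n";
print "commented: ", has_uncommented_line($commented, $textNeeded) ? "yes" : "no", "\n";

# Without \Q...\E the class [bar] would let "foob.txt" match by accident.
print "literal:   ", has_uncommented_line("foob.txt", "foo[bar].txt") ? "yes" : "no", "\n";
```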
I think you need to create an anchor to say that you want a match only if your target string appears at the BEGINNING of the line. This uses the caret symbol (^):
if ($textfile =~ m/^$textNeeded/i) {
print "yes working"
}
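This won't report a match if you have spaces or tabs before your textNeeded string. Note that without the /m modifier, ^ matches only at the very start of $textfile, so for a multi-line string you need to add /m to test every line.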
To simply return the rows having no leading hash, something like this:
my $textfile = "# imoport/canada/campingplaces/tobermory
imoport/canada/campingplaces/tobermory
#imoport/canada/campingplaces/tobermory";
for (split /^/, $textfile) {
print $_ if(m/^\s*[a-zA-Z].*/);
}
Returns:
imoport/canada/campingplaces/tobermory

Remove characters and numbers from a string in perl

I'm trying to rename a bunch of files in my directory and I'm stuck at the regex part of it.
I want to remove certain characters from a filename which appear at the beginning.
Example1: _00-author--book_revision_
Expected: Author - Book (Revision)
So far, I am able to use regex to remove underscores and capitalize the first letter:
$newfile =~ s/_/ /g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^-//g;
$newfile = ucfirst($newfile);
This is not a good method. I need help in removing all characters until you hit the first letter, and when you hit the first '-' I want to add a space before and after '-'.
Also when I hit the second '-' I want to replace it with '('.
Any guidance, tips or even suggestions on taking the right approach is much appreciated.
So do you want to capitalize all the components of the new filename, or just the first one? Your question is inconsistent on that point.
Note that if you are on Linux, you probably have the rename command, which will take a perl expression and use it to rename files for you, something like this:
rename 'my ($a,$b,$r);$_ = "$a - $b ($r)"
if ($a, $b, $r) = map { ucfirst $_ } /^_\d+-(.*?)--(.*?)_(.*?)_$/' _*
Your instructions and your example don't match.
According to your instructions,
s/^[^\pL]+//; # Remove everything until first letter.
s/-/ - /; # Replace first "-" with " - "
s/-[^-]*\K-/(/; # Replace second "-" with "("
According to your example,
s/^[^\pL]+//;
s/--/ - /;
s/_/ (/;
s/_/)/;
s/(?<!\pL)(\pL)/\U$1/g;
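To make the difference between the two readings concrete, here is a sketch that wraps each sequence of substitutions in a function and applies both to the sample name from the question. The function names are invented for this illustration, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Variant 1: follow the written instructions literally.
sub by_instructions {
    my ($name) = @_;
    $name =~ s/^[^\pL]+//;      # drop everything before the first letter
    $name =~ s/-/ - /;          # first "-" becomes " - "
    $name =~ s/-[^-]*\K-/(/;    # second "-" becomes "("
    return $name;
}

# Variant 2: reproduce the expected output from the example.
sub by_example {
    my ($name) = @_;
    $name =~ s/^[^\pL]+//;             # drop everything before the first letter
    $name =~ s/--/ - /;                # "--" becomes " - "
    $name =~ s/_/ (/;                  # first "_" opens the parenthesis
    $name =~ s/_/)/;                   # next "_" closes it
    $name =~ s/(?<!\pL)(\pL)/\U$1/g;   # upper-case each letter that starts a word
    return $name;
}

my $input = '_00-author--book_revision_';
print by_instructions($input), "\n";
print by_example($input), "\n";
```

Only the second variant produces the "Author - Book (Revision)" form shown in the question; the first one leaves the underscores in place and never closes the parenthesis.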
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u\1 - \u\2 (\u\3),;
My Perl interpreter (using strict and warnings) says that this is better written as:
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u$1 - \u$2 (\u$3),;
The first one is probably too sed-like for its taste! (Of course both versions work just the same.)
Explanation (as requested by stema):
$filename =~ s/
    ^        # matches the start of the line
    _\d+-    # matches an underscore, one or more digits and a hyphen-minus
    (.*?)--  # matches (non-greedily) anything before two consecutive hyphen-minuses
             # and captures the entire match (as the first capture group)
    (.*?)_   # matches (non-greedily) anything before a single underscore and
             # captures the entire match (as the second capture group)
    (.*?)_   # does the same as the one before (but captures the match as the
             # third capture group, obviously)
    $        # matches the end of the line
/\u$1 - \u$2 (\u$3)/x;
The \u$1, \u$2 and \u$3 in the replacement specification simply tell Perl to insert capture groups 1 to 3 with their first character made upper-case. If you wanted to make an entire captured group upper-case, you would have to use \U instead.
The x flag turns on verbose mode, which tells the Perl interpreter that we want to use # comments, so it will ignore these (and any white space in the regular expression, so if you want to match a space you have to use either \s or an escaped space, "\ "). Unfortunately I couldn't figure out how to tell Perl to ignore white space in the replacement specification, which is why I've written that on a single line.
(Also note that I've changed my s terminator from , to /. Perl barked at me when I used , with verbose mode turned on. The likely cause is that the comments contain commas: Perl looks for the closing delimiter before it processes the pattern, so a delimiter character inside a comment ends the pattern early.)
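As a usage sketch, the substitution explained above can be wrapped in a small function. The name pretty_name is hypothetical, and the function assumes every file name follows the _NN-author--book_revision_ layout; anything else is returned unchanged.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the single substitution explained above.
sub pretty_name {
    my ($filename) = @_;
    # \u upper-cases only the first character of each inserted group.
    $filename =~ s/^_\d+-(.*?)--(.*?)_(.*?)_$/\u$1 - \u$2 (\u$3)/;
    return $filename;
}

print pretty_name('_00-author--book_revision_'), "\n";   # Author - Book (Revision)
print pretty_name('notes.txt'), "\n";                    # no match, printed as-is
```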
If they all follow that format then try:
my ($author, $book, $revision) = $newfile =~ /-(.*?)--(.*?)_(.*?)_/;
print ucfirst($author ) . " - $book ($revision)\n";

Extract text from a multiline string using Perl

I have a string that covers several lines. I need to extract the text between two strings. For example:
Start Here Some example
text covering a few
lines. End Here
I need to extract the string, Start Here Some example text covering a few lines.
How do I go about this?
Use the /s regex modifier to treat the string as a single line:
/s
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
$string =~ /(Start Here.*)End Here/s;
print $1;
This will capture up to the last End Here, in case it appears more than once in your text.
If this is not what you want, then you can use:
$string =~ /(Start Here.*?)End Here/s;
print $1;
This will stop matching at the very first occurrence of End Here.
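The following sketch compares the greedy and non-greedy forms, and shows what happens without /s. The sample string adds a second "End Here" so the difference is visible, and capture_of is just a helper for the illustration.

```perl
use strict;
use warnings;

# Small helper for this sketch: return the first capture group, or undef.
sub capture_of {
    my ($text, $re) = @_;
    my ($captured) = $text =~ $re;
    return $captured;
}

my $string = "Start Here Some example\ntext covering a few\nlines. End Here\n"
           . "trailing text End Here\n";

my $greedy = capture_of($string, qr/(Start Here.*)End Here/s);    # up to the last End Here
my $lazy   = capture_of($string, qr/(Start Here.*?)End Here/s);   # up to the first End Here
my $no_s   = capture_of($string, qr/(Start Here.*?)End Here/);    # "." stops at newlines

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
print "no /s:  ", (defined $no_s ? "[$no_s]" : "no match"), "\n";
```

Because "." cannot cross a newline without /s, the third pattern fails here: Start Here and End Here sit on different lines.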
print $1 if /(Start Here.*?)End Here/s;
Wouldn't the correct modifier to treat the string as a single line be (?s) rather than /s? I've been wrestling with a similar problem for quite a while now, and the RegExp Tester embedded in JMeter's View Results Tree listener shows that my regular expression extractor with the regex
(?s)<FMSFlightPlan>(.*?)</FMSFlightPlan>
matches
<FMSFlightPlan>
C87D
AN NTEST/GL
- FPN/FN/RP:DA:GCRR:AA:EIKN:F:SAMAR,N30540W014249.UN873.
BAROK,N35580W010014..PESUL,N40529W008069..RELVA,N41512W008359..
SIVIR,N46000W008450..EMPER,N49000W009000..CON,N53545W008492
</FMSFlightPlan>
while the regex
<FMSFlightPlan>(.*?)</FMSFlightPlan>
does not match. Other regex testers show the same result. However, when I try to execute the script I get the Beanshell Assertion error:
Assertion failure message: org.apache.jorphan.util.JMeterException: Error invoking bsh method: eval Sourced file: inline evaluation of: ``import java.io.*; //write out the data results to a file outfile = "/Users/Dani . . . '' Token Parsing Error: Lexical error at line 12, column 380. Encountered: "\n" (10),
So something else is definitely wrong with mine. Anyway, just a suggestion

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors, you can use the g global flag, in e.g. Java there are replaceFirst/All pair of methods instead).
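The same effect can be reproduced in Perl, which also keeps only the last capture of a repeated group. This is a sketch with made-up helper names. It uses \n where the Eclipse pattern uses \R, and a slightly reshaped pattern so that every repetition starts with #.

```perl
use strict;
use warnings;

# Repeating a capturing group: after the match, $2 holds only the text
# captured by the final repetition (the last commented line).
sub last_repeated_capture {
    my ($text) = @_;
    $text =~ /^((?:#(.*)\n)+)/ or return;
    return $2;
}

# Line-at-a-time alternative: strip the leading "#" (and following
# whitespace) from every line with /m and /g.
sub uncomment_lines {
    my ($text) = @_;
    $text =~ s/^#\s*//mg;
    return $text;
}

my $block = qq{# print "yes";\n# print "no";\n# print "blah";\n};

print "group 2 after the repeated match: [", last_repeated_capture($block), "]\n";
print uncomment_lines($block);
```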
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello\n#stack\n#overflow\n";
$data =~ s/^#//mg;   # /m lets ^ match at the start of every line
print $data;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg;   # /m lets ^ match at the start of every line
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the GLOBAL g switch.
s/^#(.+)/$1/g
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]+)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]+) and matches all leading whitespace (spaces and tabs).
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.
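For simple cases, the ideas from the answers above can be combined in Perl roughly as follows. This is only a sketch: the function name uncomment_block is made up, it uses [\t ]* rather than [\t ]+ so that unindented comments are handled too, and the (?!!) lookahead is an added assumption to leave a #! shebang line alone.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the first "#" of every commented line while
# keeping the indentation in front of it.
sub uncomment_block {
    my ($code) = @_;
    # ([\t ]*) captures the indentation; (?!!) skips "#!" shebang lines.
    $code =~ s/^([\t ]*)#(?!!)/$1/mg;
    return $code;
}

my $commented = "#!/usr/bin/perl\n"
              . "# print \"yes\";\n"
              . "    # print \"no\";\n";

print uncomment_block($commented);
```

As noted above, a regex cannot tell a real comment from a # inside a string or pattern, so this is only suitable for blocks you know are plain comments.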