Perl regex subsitute last occurrence - regex

I have this input:
AB2.HYNN.KABCDSEG.L000.G0001V00
AB2.HYNN.GABCDSEG.L000.G0005V00
I would like to remove all which finish by GXXXXVXX in the string.
When i use this code:
$result =~ s/\.G.*V.*$//g;
print "$result \n";
The result is :
AB2.HYNN.KABCDSEG.L000
AB2.HYNN
It seems each time the regex find ".G" it removes with blank .
I don't understand.
I would like to have this:
AB2.HYNN.KABCDSEG.L000
AB2.HYNN.GABCDSEG.L000
How i can do this in regex ?

Update:
After talking in the comments, the final solution was:
s/\.G\w+V\w+$//;
In your regex:
s/\.G.*V.*$//g;
those .* are greedy and will match as much as possible. The only requirement you have is that there must be a V after the .G somewhere, so it will truncate the string from the first .G it finds, as long as it is followed by a V. There is no need for the /g modifier here, because any match that occurs will delete the rest of the string. Unless you have newlines, because . does not match newlines without the /s modifier.

$result =~ s/\.G\d+V\d+//g;
Works on given input.

Related

Regular Expression required for matching a string that should not be followed by another specific string

I am using the below code for matching a string(EX: <jdgdt\s+mdy=.*?>\s*) which should not be followed by another specific string (<jdg>). But i am unable to get the desired output as per the below code. Can anyone help me regarding this ?
Input file :
<dckt>Docket No. 7677-12.</dckt>
<jdgdt mdy='02/25/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>
<dckt>Docket No. 7237-13.</dckt>
<jdgdt mdy='02/24/2014'>
</tcpar>
Desired Output:
<dckt>Docket No. 7677-12.</dckt>
<jdgdt mdy='02/25/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>
<dckt>Docket No. 7237-13.</dckt>
<jdgdt mdy='02/24/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>
Code:
#/usr/bin/perl
my $filename = $ARGV[0];
my $ext = $ARGV[1];
my $InputFile = "$filename" . "\." . "$ext";
my $document = do {
local $/ = undef;
open my $fh, "<", $InputFile or die "Error: Could Not Open File $InputFile: $!";
<$fh>;
};
$document =~ s/(<jdgdt\s+mdy=.*?>\s*)(?!<jdg>)/$1<jdg>Opinion by Marvel,<e>J.<\/e><\/jdg>\n<taxyr><\/taxyr>\n<disp><\/disp>/isg;
print $document;
I had to make two minor adjustments to your regex to get the desired output:
$document =~ s{(<jdgdt\s+mdy\=[^>]*>\s*)(?!\s*<jdg>)}{$1<jdg>Opinion by Marvel,<e>J.</e></jdg>\n<taxyr></taxyr>\n<disp></disp>}isg;
Also, to clean up the code, I switched from using / to using {} to delimit the regex; that way, you don't need to backslash all the slashes that you actually want there in your replacement.
Explanation of what I changed:
First off, negative lookahead is tricky. What you have to remember is that perl will try to match your expression the maximum amount of times possible. Because you had this initially:
/(<jdgdt\s+mdy\=.*?>\s*)(?!<jdg>)/
What would happen is that in that first clause you'd get this match:
<jdgdt mdy='02/25/2014'>\n<jdg>Opinion by Marvel, <e>J.</e></jdg>
^^^^^^^^^^^^^^^^^^^^^^^^
(this part matched by paren. Note the \n is not matched!)
Perl would consider this a match because after the first parenthesized expression, you have "\n<jdg>". Well, that doesn't match the expression "<jdg>" (because of the initial newline), so yay! found a match.
In other words, initially, perl would have the \s* that you end your parenthesized expression with match the empty string, and therefore it would find a match and you'd end up stuffing things into the first clause that you didn't want. Another way to put it is that because of the freedom to choose what went into \s*, perl would choose the amount that allowed the expression as a whole to match. (and would fill \s* with the empty string for the first docket record, and newline for the second docket record)
To get perl to never find a match on the first docket record, I repeated the \s* in the negative lookahead as well. That way, no choice of what to put in \s* could make the expression as a whole match on the initial docket record, and perl had to give up and move to the second docket record.
But then there was a second problem! Remember how I said perl was really aggressive about finding matches anywhere it could? Well, next perl would expand your mdy\=.*?> bit to still find a result in the first docket record. After I added \s* to the negative lookahead, the first docket was still matching (but in a different spot) with:
<jdgdt mdy='02/25/2014'>\n<jdg>Opinion by Marvel, <e>J.</e></jdg>
^^^^^^^^^^^???????????????????^
(Underlined part matched by paren. ? denotes the bit matched by .*?)
See how perl expanded your .*? way beyond what you had intended? You'd intended that bit to match only stuff up to the first > character, but perl will stretch your non-greedy matches as far as necessary so that the whole pattern matches. This time, it stretched your .*? to cover the > that closed the <jdg> tag so that it could find a spot where the negative lookahead didn't block the match.
To keep perl from stretching your .*? pattern that far, I replaced .*? with [^>]*, which is really what you meant.
After these two changes, we then only found a match in the second docket record, as initially desired.
Use positive lookahead. (?!<jdg>) or something similar, look it up.

Lookbehind does not work as expected

I am trying to understand the lookbehind.
This example I am trying doesn't work as I expected. I wanted to try to form a regex that would match John but not John.
The following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*/) {
print "Matches!\n";
}
'
Matches!
matches up to and including . of course. The problem is the following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*(?<![.])/) {
print "Matches!\n";
}
'
Matches!
For the latter I expected that the regex would match John. consuming >.< (the period)
Then at the next position it would look behind and realize that it consumed a period (.) and would reject the match.
Is my understanding wrong? What am I messing up here?
Update:
Same result also for my $var = "John. ";
Update 2:
My question is not about how to match only John and not John.
But to understand how lookbehind works and if it is not supposed to work in this case why.
The * is a quantification operator, not a placeholder. So A* means zero or more A characters. Without any further context, this always matches, e.g. "foo" =~ /J*/ is true.
What you intended to write was /J.*/ which does what you've actually described.
Now let's look what happens when we do "John." =~ /(J.*(?<![.]))/:
The regex engine sees J, which matches.
The next pattern is .*, which matches ohn..
Next the assertion (?<![.]) is tested, which fails.
The regex engine therefore backtracks.
We try .* again, but this time only match ohn.
Next the assertion (?<![.]) is tested, which suceeds.
In the above regex, I enclosed the pattern in a capture group, which we can now read out:
$ perl -E'"John." =~ /(J.*(?<![.]))/ and say "<$1>" or say "No match"'
<John>
It is often more efficient to use a character class instead of assertions and .* quantifications, so that we can avoid backtracking:
/J[^.]*/
However, this is not strictly equivalent to the above regexes.
This regexp:
/John(?![.])/
will match John but not John. It uses a negative look-ahead assertion (rather than look-behind).
If you want to match full names other than 'John', you'll need to be a bit more specific about what you do and don't want to allow in the match, as putting J* will match zero or more J's.
Edit: Obviously I misread the * per #amon's post. Look-ahead vs. look-behind still applies.

Problems with perl regex

I need a perl regex to match A.CC3 on a line begining with something followed by anything then, my 'A.CC3 " and then anything...
I am surprised this (text =~ /^\W+\CC.*\A\.CC\[3].*/) is not working
Thanks
\A is an escape sequence that denotes beginning of line, or ^ like in the beginning of your regex. Remove the backslash to make it match a literal A.
Edit: You also seem to have \C in there. You should only use backslash to escape meta characters such as period ., or to create escape sequences, such as \Q .. \E.
At its simplest, a regex to match A.CC3 would be
$text =~ /A\.CC3/
That's all you need. This will match any string with A.CC3 in it. In the comments you mention the string you are matching is this:
my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";
You might want to avoid partial matches, in which case you can use word boundary \b
$text =~ /\bA\.CC3\b/
You might require that a line begins with //%
$text =~ m#^//%.*\bA\.CC3\b#
Of course, only you know which parts of the string should be matched and in what way. "Something followed by anything followed by A.CC3 followed by anything" really just needs the first simple regex.
It doesn't seem like you're trying to capture anything. If that's the case, and all you need to do is find lines that contain A.CC3 then you can simply do
if ( index( $str, 'A.CC3' ) >= 0 ) # Found it...
No need for a regex.
Try to give this a shot:
^.*?A\.CC.*$
That will match anything until it reaches A, then a literal ., followed by CC, then anything until end of string.
It depends what you want to match. If you want to pull back the whole line in which the A.CC3 pattern occurs then something like this should work:
^.*A\.CC3.*$

Regex to get all character to the right of first space?

I am trying to craft a regular expression that will match all characters after (but not including) the first space in a string.
Input text:
foo bar bacon
Desired match:
bar bacon
The closest thing I've found so far is:
\s(.*)
However, this matches the first space in addition to "bar bacon", which is undesirable. Any help is appreciated.
You can use a positive lookbehind:
(?<=\s).*
(demo)
Although it looks like you've already put a capturing group around .* in your current regex, so you could just try grabbing that.
I'd prefer to use [[:blank:]] for it as it doesn't match newlines just in case we're targetting mutli's. And it's also compatible to those not supporting \s.
(?<=[[:blank:]]).*
You don't need look behind.
my $str = 'now is the time';
# Non-greedily match up to the first space, and then get everything after in a group.
$str =~ /^.*? +(.+)/;
my $right_of_space = $1; # Keep what is in the group in parens
print "[$right_of_space]\n";
You can also try this
(?s)(?<=\S*\s+).*
or
(?s)\S*\s+(.*)//group 1 has your match
With (?s) . would also match newlines

How does pattern matching work in Perl?

I want to know how pattern matching works in Perl.
My code is:
my $var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos";
if($var =~ m/(.*)\,(.*)/sgi)
{
print "$1\n$2";
}
I learnt that the first occurrence of comma should be matched. but here the last occurrence is being matched. The output I got is:
VP KDC T. 20, pgcet. 5, Ch. 415
Refs %50 Annos
Can someone please explain me how this matching works?
From docs:
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match
So, first (.*) will take as much as possible.
Simple workaround is using non-greedy quantifier: *?. Or match not every character, but all except comma: ([^,]*).
Greedy and Ungreedy Matching
Perl regular expressions normally match the longest string possible.
For instance:
my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";
Run the preceding code, and here's what you get:
ississ
It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:
my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";
Now look what the code produces:
is
Clearly, the use of the question mark makes the match ungreedy. But theres another problem in that regular expressions always try to match as early as possible.
Source: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
Use question mark in your regex:
if($var =~ m/(.*?)\,(.*)/sgi)
{
print "$1\n$2";
}
So:
(.*)\, means: "match as much characters as you can as long as there will be a comma after them"
(.*?)\, means: "match any characters until you stumble upon a comma"
(.*)\, -you might expect that it will match till the first comma.
But it is greedy enough to match all the xcharacters it came across untill last comma instead of the first comma.
so
it matches till the last command.
and the second match is the rest of the line.
to avoid greedy pattern match adda ? after *