Match from last occurrence using regex in perl - regex

I have a text like this:
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT rest string
The text is multilined and I need to extract from last occurrence of "*/" until "////RESULT". In this case, the result should be:
select this part on
ly
How to achieve this in perl?
I have attempted \\\*/(.|\n)*////RESULT but that will start from first "*/"

A useful trick in cases like this is to prefix the regexp with the greedy pattern .*, which will try to match as many characters as possible before the rest of the pattern matches. So:
my ($match) = ($string =~ m!^.*\*/(.*?)////RESULT!s);
Let's break this pattern into its components:
^.* starts at the beginning of the string and matches as many characters as it can. (The s modifier allows . to match even newlines.) The beginning-of-string anchor ^ is not strictly necessary, but it ensures that the regexp engine won't waste too much time backtracking if the match fails.
\*/ just matches the literal string */.
(.*?) matches and captures any number of characters; the ? makes it ungreedy, so it prefers to match as few characters as possible in case there's more than one position where the rest of the regexp can match.
Finally, ////RESULT just matches itself.
Since the pattern contains a lot of slashes, and since I wanted to avoid leaning toothpick syndrome, I decided to use alternative regexp delimiters. Exclamation points (!) are a popular choice, since they don't collide with any normal regexp syntax.
Edit: Per discussion with ikegami below, I guess I should note that, if you want to use this regexp as a sub-pattern in a longer regexp, and if you want to guarantee that the string matched by (.*?) will never contain ////RESULT, then you should wrap those parts of the regexp in an independent (?>) subexpression, like this:
my $regexp = qr!\*/(?>(.*?)////RESULT)!s;
...
my $match = ($string =~ /^.*$regexp$some_other_regexp/s);
The (?>) causes the pattern inside it to fail rather than accepting a suboptimal match (i.e. one that extends beyond the first substring matching ////RESULT) even if that means that the rest of the regexp will fail to match.

(?:(?!STRING).)*
matches any number of characters that don't contain STRING. It's like [^a], but for strings instead of characters.
You can take shortcuts if you know certain inputs won't be encountered (like Kenosis and Ilmari Karonen did), but this is what what matches what you specified:
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ ). )* )
////RESULT
(?: (?! \*/ ). )*
\z
}xs;
If you don't care if */ appears after ////RESULT, the following is the safest:
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ ). )* )
////RESULT
}xs;
You didn't specify what should happen if there are two ////RESULT that follow the last */. The above matches until the last one. If you wanted to match until the first one, you'd use
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ | ////RESULT ). )* )
////RESULT
}xs;

Here's one option:
use strict;
use warnings;
my $string = <<'END';
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT
END
my ($segment) = $string =~ m!\*/([^/]+)////RESULT$!s;
print $segment;
Output:
select this part on
ly

Related

Regex to find(/replace) multiple instances of character in string

I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
where the lone $v is needed so to return that and not the number of matches, what s/ operator itself returns. This can be improved by using the /r modifier, which returns the changed string and doesn't change the original (so it doesn't attempt to change $1, what isn't allowed)
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that all matches before it are "dropped" -- they are not consumed so we don't need to capture them in order to put them back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem with matches that we want to leave in the string can often be remedied by \K (used in all current answers), which makes it so that all matches before it are not consumed.

Non-Capturing and Capturing Groups - The right way

I'm trying to match an array of elements preceeded by a specific string in a line of text. For Example, match all pets in the text below:
fruits:apple,banana;pets:cat,dog,bird;colors:green,blue
/(?:pets:)(\w+[,|;])+/g**
Using the given regex I only could match the last word "bird"
Can anybody help me to understand the right way of using Non-Capturing and Capturing Groups?
Thanks!
First, let's talk about capturing and non-capturing group:
(?:...) non-capturing version, you're looking for this values, but don't need it
() capturing version, you want this values! You're searching for it
So:
(?:pets:) you searching for "pets" but don't want to capture it, after that point, you WANT to capture (if I've understood):
So try (?:pets:)([a-zA-Z,]+); ... You're searching for "pets:" (but don't want it !) and stop at the first ";" (and don't want it too).
Result is :
Match 1 : cat,dog,bird
A better solution exists with 1 match == 1 pet.
Since you want to have each pet in a separate match and you are using PCRE \G is, as suggested by Wiktor, a decent option:
(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)
Explanation:
1st Alternative (?:pets:) to find the start of the pattern
2nd Alternative \G(?!^)(\w+)(?:[,;]|$)
\G asserts position at the end of the previous match or the start of the string for the first match
Negative Lookahead (?!^) to assert that the Regex does not match at the start of the string
(\w+) to matches the pets
Non-capturing group (?:[,;]|$) used as a delimiter (matches a single character in the list ,; (case sensitive) or $ asserts position at the end of the string
Perl Code Sample:
use strict;
use Data::Dumper;
my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';
my $regex = qr/(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)/mp;
my #result = ();
while ( $str =~ /$regex/g ) {
if ($1 ne '') {
#print "$1\n";
push #result, $1;
}
}
print Dumper(\#result);

Extract delimited substrings from a string

I have a string that looks like this
my $source = "PayRate=[[sDate=05Jul2017,Rate=0.05,eDate=06Sep2017]],item1,item2,ReceiveRate=[[sDate=05Sep2017,Rate=0.06]],item3" ;
I want to use capture groups to extract only the PayRate values contained within the first [[...]] block.
$1 = "Date=05Jul2017,Rate=0.05,EDate=06Sep2017"
I tried this but it returns the entire string.
my $match =~ m/PayRate=\[\[(.*)\]\],/ ;
It is clear that I have to put specific patterns for the series of {(.*)=(.*)} blocks inside. Need expert advice.
You are using a greedy match .*, which consumes as much input as possible while still matching, you're matching the first [[ to the last ]].
Instead, use a reluctant match .*?, which matches as little as possible while still matching:
my ( $match) = $source =~ /PayRate=\[\[(.*?)\]\]/;
Use the /x modifier on match (and substitute) so you can use white space for easier reading and comments to tell what is going on. Limit the patterns by matching everything not in the pattern. [^\]]*? is better than .*?.
my ( $match ) = $line =~ m{
Payrate \= \[\[ # select which part
( [^\]]*? ) # capture anything between square brackets
}x;

Extract certain part of a string in Perl

I have the following Perl strings. The lengths and the patterns are different. The file is always named *log.999
my $file1 = '/user/mike/desktop/sys/syslog.1';
my $file2 = '/user/mike/desktop/movie/dnslog.2';
my $file3 = '/haselog.3';
my $file4 = '/user/mike/desktop/movie/dns-sys.log'
I need to extract the words before log. In this case, sys, dns, hase and dns-sys.
How can I write a regular expression to extract them?
\w+(?=log\b)
matches one or more alphanumeric characters that are followed by log (but not logging etc.)
If the filename format is fixed, you can make the regex more reliable by using
\w+(?=log\.\d+\/$)
The main property of shown strings is that the *log* phrase is last.
Then anchor the pattern, so we wouldn't match a log somewhere in the middle
my ($name) = $string =~ /(\w+)log\.[0-9]+$/;
while if .N extension is optional
my ($name) = $string =~ /(\w+)log(?:\.[0-9]+)?$/;
The above uses the \w+ pattern to capture the text preceding log. But that text may also contain non-word characters (-, ., etc), in which case we would use [^/]+ to capture everything after the last /, as pointed out in Abigail's answer. With .N optional, per question in the comments
my ($name) = $string =~ m{ ([^/]+) log (?: \.[0-9]+ )? $}x;
where I added the }x modifier, with which spaces inside are ignored, what can aid readibility.
I use a set of delimiters other than / to be able to use / inside without escaping it, and then the m is compulsory. The [^...] is a negated character class, matching any character not listed inside. So [^/]+log matches all successive characters which are not /, coming before log.
The non capturing group (?: ... ) groups patterns inside, so that ? applies to the whole group, but doesn't needlessly capture them.
The (?:\.[0-9]+)? pattern was written specifically so to disallow things like log. (nothing after dot) and log5. But if these are acceptable, change it to the simpler \.?[0-9]*
Update Corrected a typo in code: for optional .N there is +, not *

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match on and capture both with parenthesis. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely doing what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at start or end of string. Do note that \b also matches partially at non-word characters that can be part of words, such as:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions