Extract delimited substrings from a string

I have a string that looks like this
my $source = "PayRate=[[sDate=05Jul2017,Rate=0.05,eDate=06Sep2017]],item1,item2,ReceiveRate=[[sDate=05Sep2017,Rate=0.06]],item3" ;
I want to use capture groups to extract only the PayRate values contained within the first [[...]] block.
$1 = "sDate=05Jul2017,Rate=0.05,eDate=06Sep2017"
I tried this but it returns the entire string.
my $match =~ m/PayRate=\[\[(.*)\]\],/ ;
It is clear that I have to put specific patterns for the series of {(.*)=(.*)} blocks inside. Need expert advice.

You are using a greedy match .*, which consumes as much input as possible while still matching, so you're matching from the first [[ to the last ]].
Instead, use a reluctant match .*?, which matches as little as possible while still matching:
my ( $match) = $source =~ /PayRate=\[\[(.*?)\]\]/;
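To see the difference concretely, here is a minimal sketch that runs both quantifiers against the $source string from the question. The variable names $greedy and $lazy are only illustrative.

```perl
use strict;
use warnings;

my $source = "PayRate=[[sDate=05Jul2017,Rate=0.05,eDate=06Sep2017]],item1,item2,ReceiveRate=[[sDate=05Sep2017,Rate=0.06]],item3";

# Greedy: .* runs to the end of the string and backtracks only as far as the LAST ]]
my ($greedy) = $source =~ /PayRate=\[\[(.*)\]\]/;

# Reluctant: .*? grows one character at a time and stops at the FIRST ]]
my ($lazy) = $source =~ /PayRate=\[\[(.*?)\]\]/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture runs on into the ReceiveRate block, while the reluctant one stops at the end of the PayRate block.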

Use the /x modifier on match (and substitute) so you can use white space for easier reading and comments to tell what is going on. Limit the pattern by matching only characters that cannot end the block: a negated class such as [^\]]* is safer than .*?, because it can never run past a closing bracket.
my ( $match ) = $source =~ m{
    PayRate = \[\[   # select which part
    ( [^\]]* )       # capture everything up to the first closing bracket
    \]\]             # require the closing delimiter
}x;
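Since the question also asks about the individual key=value pairs, here is a rough sketch of one way to go on from the captured block. It assumes the values never contain a comma or an equals sign; the names $block and %pay are hypothetical.

```perl
use strict;
use warnings;

my $source = "PayRate=[[sDate=05Jul2017,Rate=0.05,eDate=06Sep2017]],item1,item2,ReceiveRate=[[sDate=05Sep2017,Rate=0.06]],item3";

# Negated class: the capture cannot run past the first ]
my ($block) = $source =~ m{ PayRate = \[\[ ( [^\]]* ) \]\] }x;
die "no PayRate block found\n" unless defined $block;

# Split "key=value,key=value" into a hash
# (assumption: no commas or equals signs inside the values)
my %pay = map { split /=/, $_, 2 } split /,/, $block;

print "$_ => $pay{$_}\n" for sort keys %pay;
```

Splitting after the match keeps the regex simple, at the cost of a second pass over the captured text.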

Related

Regex to find(/replace) multiple instances of character in string

I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?

The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
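As a quick check, both substitutions can be run on the sample string from the question. This is only a sketch: the surrounding "ah ... oh" text and the variable names are made up for illustration, and the /r flag needs Perl 5.14 or later.

```perl
use 5.014;    # the /r flag on tr/// and s/// needs Perl 5.14 or later
use strict;
use warnings;

my $text = q{ah [abc#def"ghi"jkl'123] oh};

# Delete every non-alphanumeric character between the # and the closing ]
(my $deleted = $text) =~
    s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;

# Replace each of those characters with X instead
(my $marked = $text) =~
    s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;

print "$deleted\n";
print "$marked\n";
```

If a line can contain several bracketed strings, adding /g to the outer s{}{} should process each of them in turn.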

So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K drops everything matched before it from the match, so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
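The same loop written as a small script, which makes it easier to see how many passes it takes. The $passes counter and the sample text are only there for illustration.

```perl
use strict;
use warnings;

my $text   = q{ah [abc#def"ghi"jkl'123] oh};
my $passes = 0;

# Each successful s/// removes exactly one character;
# keep going until the substitution finds nothing left to remove
$passes++ while $text =~ s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//;

print "$text (after $passes passes)\n";
```

Every pass rescans from the start of the string, which is where the inefficiency comes from.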

Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that strips the non-alphanumeric characters:
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
The lone $v is needed so that the block returns the modified string rather than the number of substitutions, which is what the s/// operator itself returns. This can be improved by using the /r modifier (Perl 5.14 and later), which returns the changed string and leaves the original untouched, so it doesn't attempt to modify $1, which is read-only.
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that everything matched before it is "dropped" from the match: it stays in the string untouched, so we don't need to capture it in order to put it back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails, so only the first occurrence is replaced.
The problem of matched text that we want to leave in the string can often be remedied by \K (used in all current answers), which excludes everything matched before it from the part that gets replaced.

Capture a substring between two characters?

I am trying to write a regex pattern which will capture a substring between two characters. The string is
default_checks/my_checks/VLG6.3: Unsupported system function call
I need to capture VLG6.3. It is between a slash / and a colon :.
I have tried these ideas
my $rule = $line =~ /\/(.*)\:/;
my $rule = $line =~ /\/(.+?)\:/ ;
my $rule = $line =~ /\/(\w+)\:/ ;
But none of them are working. In the best case I get my_checks/VLG6.3

Aside from the issue with assigning a list to a scalar, which ikegami has helpfully pointed out, the regex pattern can use some fixing.
The repeater * in regex is greedy. It gobbles up as many characters as it can as long as it matches. You need to let another repeater do the gobbling up front so that it only leaves just enough for the repeater you really want to match.
my ($rule) = $line =~ /.*\/(.*):/;
Alternatively, in this case you can just use an exclusion class instead of matching any characters.
my ($rule) = $line =~ /\/([^\/]*):/;
Both of the above will end up with $rule assigned with 'VLG6.3'.
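For comparison, a short sketch that runs the question's greedy pattern next to the two fixes. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';

# Original attempt: starts at the FIRST slash, so the capture is too long
my ($too_much) = $line =~ /\/(.*):/;

# A leading greedy .* pushes the match to the LAST slash before the colon
my ($last_part) = $line =~ /.*\/(.*):/;

# An exclusion class cannot cross a slash at all
my ($excluded) = $line =~ /\/([^\/]*):/;

print "$too_much\n$last_part\n$excluded\n";
```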

You are interested in a non-empty string, meeting the following conditions:
It is preceded by a /.
It is followed by a colon.
It contains neither / nor a colon.
So the intuitive regex, without any capturing group, is:
(?<=\/)[^\/:]+(?=:) (positive lookbehind, the actual content, and positive lookahead).
Using such a regex, you can:
Use the result of the =~ operator only to check whether something has been matched.
Print the matched text from the $& variable.
And the example script can look like below:
use strict;
use warnings;
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
print "Source: $line\n";
if ($line =~ /(?<=\/)[^\/:]+(?=:)/) {
print "Rule: $&\n";
} else {
print "No match.\n";
}

The reason you are getting 1 is because you are evaluating the match in scalar context. For the match to return the captures, it needs to be evaluated in list context.
You need to evaluate the match in list context by evaluating the =~ in list context. Unlike the scalar assignment operator you used, the list assignment operator evaluates its operands in list context. You can cause the list assignment operator to be used by replacing my $rule with my ($rule).
my ($rule) = $line =~ /\/(.*)\:/;
See "Why are there parentheses around scalar when assigning the return value of regex match in this Perl snippet?"
Furthermore, the match operator will grab more than desired. You can address that by replacing
/\/(.*)\:/
with
/\/([^\/]*)\:/
I would write that as follows:
m{/([^/]*):}
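A minimal sketch of the context difference, using the same pattern both times. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';

# Scalar assignment: the match is evaluated in scalar context
# and only reports success (1) or failure
my $scalar_result = $line =~ m{/([^/]*):};

# List assignment: the match is evaluated in list context
# and returns the captured groups
my ($list_result) = $line =~ m{/([^/]*):};

print "scalar context: $scalar_result\n";
print "list context:   $list_result\n";
```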

To capture a string between two characters, capture everything that is not the two characters.
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
my ( $rule ) = $line =~ /\/([^\/:]*):/;
print "$rule\n";
PS: Capturing content between two strings involves skipping any earlier occurrences of the starting string.
my $line = 'begin not this begin or this begin wanted end not this end or this end';
my ( $rule ) = $line =~ m{ (?: begin .* )? begin (.*?) end }msx;
print "$rule\n";

perl regex substring

$str="!bypass";
I need to return the string only if it starts with "!".
How can I return bypass?

To match strings that start with a ! you need this pattern. The ^ is the anchor at the beginning of the string.
/^!/
If you want to capture the stuff after the !, you need this pattern. The parentheses () are a capture group. They tell Perl to grab everything between them and keep it. The . means any character, and the + is a quantifier for as many as possible, at least one. So .+ means grab everything.
/^!(.+)/
To apply it, do this.
$str =~ m/^!(.+)/;
And to get the "bypass" out of that pattern, use the $1 match variable that was assigned automatically by Perl with the m// operation.
print $1; # will print bypass
To make that conditional, it would be:
print $1 if $str =~ m/^!(.+)/;
The if here is in post-fix notation, which lets you omit the block and the parentheses. It's the same as the following, but shorter and easier to read for single statements.
if ( $str =~ m/^!(.+)/ ) {
print $1;
}
If you want to permanently change $str to not have an exclamation mark at the beginning, you need to use a substitution instead.
$str =~ s/^!//;
The s/// is the substitution operator. It changes $str in place. The original value including the ! will be lost.
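If the original value should be kept, one option (assuming Perl 5.14 or later) is the /r flag, which returns the modified copy instead of changing $str. A small sketch, with $stripped as a made-up name:

```perl
use 5.014;    # s///r needs Perl 5.14 or later
use strict;
use warnings;

my $str = "!bypass";

# /r returns the result of the substitution and leaves $str alone
my $stripped = $str =~ s/^!//r;

print "original: $str\n";
print "stripped: $stripped\n";
```

Note that, unlike the capture version, this returns the string unchanged when there is no leading !, so it does not tell you whether the ! was there.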

Use ^!\K.+.
It works this way:
^! - Match initial ! (but this will soon change, see below).
\K - Keep - "forget" about what you have matched so far and set the starting point of the match here (after the !).
.+ - Match non-empty sequence of chars.
Due to \K, only the last part (.+) is actually matched.
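One way to apply this, sketched with the $& variable (the whole match), which after \K holds only the part following the !. The variable $rest is hypothetical, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $str = "!bypass";
my $rest;

# \K resets the start of the match, so $& contains only what .+ matched
if ($str =~ /^!\K.+/) {
    $rest = $&;
}

print defined $rest ? "$rest\n" : "no match\n";
```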

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?

You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match on and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
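A rough sketch of that crude approach run over a few sample lines. The input lines and the %vars hash are invented for illustration; real Python assignments will often need a looser pattern for the right-hand side.

```perl
use strict;
use warnings;

# Hypothetical sample input, standing in for lines read from the file
my @lines = ("x = 1\n", "  total=42  \n", "print(x)\n");
my %vars;

for my $line (@lines) {
    # Under /x the spaces in the pattern are ignored;
    # $1 is the name, $2 the value
    if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
        $vars{$1} = $2;
    }
}

print "$_ = $vars{$_}\n" for sort keys %vars;
```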

The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely to be doing what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at the start or end of the string. Do note that \b also matches next to non-word characters that may occur inside what you consider a single word, such as a hyphen:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions
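To compare the two final patterns from this answer, here is a small sketch that applies both with /g to a string containing a hyphenated word. The sample string and array names are only illustrative.

```perl
use strict;
use warnings;

my $text = 'aaa-xxx yyy';

# \b sees a boundary on both sides of the hyphen,
# so each half of aaa-xxx counts as a word
my @by_boundary = $text =~ /\b(\w+)\b/g;

# The whitespace lookarounds only accept words delimited by
# whitespace or by the start or end of the string
my @by_whitespace = $text =~ /(?<!\S)(\w+)(?!\S)/g;

print "boundary:   @by_boundary\n";
print "whitespace: @by_whitespace\n";
```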

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match ,
Can someone explain why?
I would like to match more than one Number or point, ended by comma.
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.

Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g

Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around a comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000

If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g

You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
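A side-by-side sketch of what a list-context //g match returns for each variant discussed here. The array names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "241.000,00";

# Parens around the comma: the list contains only the captured comma
my @comma_captured = $text =~ /[0-9.]+(,)/g;

# Parens around the number: the list contains the number
my @number_captured = $text =~ /([0-9.]+),/g;

# No capture groups: the list contains each whole match,
# and the lookahead keeps the comma out of it
my @with_lookahead = $text =~ /[0-9.]+(?=,)/g;

print "comma captured:  @comma_captured\n";
print "number captured: @number_captured\n";
print "with lookahead:  @with_lookahead\n";
```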

You can replace the comma with a lookahead, or simply leave it outside the parentheses, since it isn't part of what you want to capture. However, the pattern as written puts the comma, not the number, into capture group 1, and in list context the match returns the capture groups rather than the entire match, which is why you get the comma back.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}
