This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match on and capture both with parenthesis. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely doing what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at start or end of string. Do note that \b also matches partially at non-word characters that can be part of words, such as:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions
Related
I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
where the lone $v is needed so to return that and not the number of matches, what s/ operator itself returns. This can be improved by using the /r modifier, which returns the changed string and doesn't change the original (so it doesn't attempt to change $1, what isn't allowed)
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that all matches before it are "dropped" -- they are not consumed so we don't need to capture them in order to put them back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem with matches that we want to leave in the string can often be remedied by \K (used in all current answers), which makes it so that all matches before it are not consumed.
Hello can anyone help me with a regex to match a string between multiple whitespaces
My string may look like this :
This is just Nicolas-764 sdh and his sister
I want to match Nicolas-764 sdh
So far I wrote this but it matches all the string after the first whitespaces
if ($string =~ m/(just) {5,}(.*) {5,}/) {
print "$1\n";
print "$2\n";
}
I want to create a hash that will have as key just and as value Nicolas-764 sdh.
I don't want to just match a string between multiple spaces. I need to use just too
You're suffering from greedy matching .*.
You simply need to change to non-greedy matching using .*?.
use strict;
use warnings;
my $string = 'This is just Nicolas-764 sdh and his sister';
if ($string =~ m/just\s{5,}(.*?)\s{5,}/) {
print "$1\n";
}
Outputs:
Nicolas-764 sdh
Your code would be,
if ($string =~ m/^.*?just {5,}(\S+)\s+(\S+) {5,}.*$/) {
print "$1\n";
print "$2\n";
}
First group contains Nicolas-764 and the second group contains sdh
DEMO
Or
You could try the below regex also,
^.*?just {5,}(\S+(?:\s\S+)*?) {5,}.*$
Explanation:
^ Asserst that we are at the start of the line.
.*?just This would match upto the first just string. ? after * does a non-greedy match.
{5,} Matches 5 or more spaces.
() Capturing groups.
\S+ One or more non-space characters.
(?:) Non-capturing groups. It won't capture anything. Just matching would be done.
(?:\s\S+)*? Matches a space followed by one or more non-space characters. And the whole would occur zero or more times.
{5,} Matches 5 or more spaces.
.* Matches any character zero or more times.
$ Asserts that we are at the end of the line.
Why do the following two regexes behave differently?
$millisec = "1391613310.1";
$millisec =~ s/.*(\.\d+)?$/$1/;
vs.
$millisec =~ s/\d*(\.\d+)?$/$1/;
This code prints nothing:
perl -e 'my $mtime = "1391613310.1"; my $millisec = $mtime; $millisec =~ s/.*(\.\d+)?$/$1/; print "$millisec";'
While this prints the decimal portion of the string:
perl -e 'my $mtime = "1391613310.1"; my $millisec = $mtime; $millisec =~ s/\d*(\.\d+)?$/$1/; print "$millisec";'
In the first regex, the .* is taking up everything to the end of the string, so there's nothing the optional (.\d+)? can pick up. $1 will be empty, so the string is replaced by an empty string.
In the second regex, only digits are grabbed from the beginning so that \d* stops in front of the dot. (.\d+)? will pick the dot, including the trailing digits.
You're using .\d+ inside parentheses, which will match any character plus digits. If you want to match a dot explicitly, you have to use \..
To make the first regex behave similarly to the second one you would have to write
$millisec =~ s/.*?(\.\d+)?$/$1/;
so that the initial .* doesn't take up everything.
Greed.
Perl's regex engine will match as much as possible with each term before moving on to the next term. So for .*(.\d+)?$ the .* matches the entire string, then (.\d)? matches nothing as it is optional.
\d*(.\d+)?$ can match only up to the dot, so then has to match .1 against (.\d+)?
The following seems to match ,
Can someone explain why?
I would like to match more than one Number or point, ended by comma.
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
Your match in the foreach loop is in list context. In list context, a match returns what its captured. Parens indicate a capture, not the whole regex. You have parens around a comma. You want it the other way around, put the parens aroundt he bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or just exclude the comma altogether since it isn't part of what you want to capture, it won't make a difference in this case. However, the pattern as it is puts the comma instead of the number into capture group 1, and then doesn't even reference by capture group, returning the entire match instead.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}
I want to do matching in the following way for a large multiline text:
I have a few matching patterns:
$text =~ m#finance(.*?)end#s;
$text =~ m#<class>(.*?)</class>#s;
$text =~ m#/data(.*?)<end>#s;
If either one is matched, then print the result print $1, and then continue with the rest of the text to match again for the three patterns.
How can I get the printed results in the order they appear in the whole text?
Many thanks for your help!
while ($text =~ m#(?: finance (.*?) end
| <class> (.*?) </class>
| data (.*?) </end>
)
#sgx) {
print $+;
}
ought to do it.
$+ is the last capturing group that successfully matched.
The /g modifier is intended specifically for this kind of usage; it turns the regex into an iterator that, when resumed, continues the match where it left off instead of restarting at the beginning of $text.
(And /x lets you use arbitrary whitespace, meaning you can make your regexes readable. Or as readable as they get, at least.)
If you need to deal with multiple captures, it becomes a bit harder as you can't use $+. You can, however, test for capturing groups being defined:
while ($text =~ m#(?: a (.*?) b (.*?) c
| d (.*?) e (.*?) f
| data (.*?) </end>
)
#sgx) {
if (defined $1) {
# first set matched (don't need to check $2)
}
elsif (defined $3) {
# second set matched
}
else {
# final one matched
}
}