My regexp has anorexia - regex

I'm trying to get multiple key/value pairs from a string where the key is on the left of an = character and the value is on the right. So the following code
$line = <<END;
names='bob,jane, Alexander the Great' colors = "red,green" test= %results
END
my %hash = ($line =~ m/(\w+)\s*=\s*(.+?)/g);
for (keys %hash) { print "$_: $hash{$_}\n"; }
Should output
names: 'bob,jane, Alexander the Great'
colors: "red,green"
test: %results
But my regexp is just returning the first character of the value like
names: '
colors: "
and so on. If I change the second match to (.+) then it matches the whole line after the first =. Can someone fix this regexp?

Because .+? is non-greedy, it stops as soon as it has a match. Since nothing follows it in your pattern, it is satisfied after a single character. Give it something to stop at:
my %hash = ($line =~ m/(\w+)\s*=\s*(.+?)(?=\h+\w+\h*=|$)/gm);
(?=\h+\w+\h*=|$) is a positive lookahead, which asserts that the match must be followed by
\h+ one or more horizontal spaces.
\w+ one or more word characters.
\h* zero or more horizontal spaces.
= equal symbol.
| OR
$ End of the line anchor.
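As a rough check of the lookahead approach, here is a minimal self-contained sketch using the sample line from the question. The q{} quoting and the sorted output are my own choices, not part of the original code.

```perl
use strict;
use warnings;

# Same sample line as in the question (single-quoted, so nothing interpolates).
my $line = q{names='bob,jane, Alexander the Great' colors = "red,green" test= %results};

# (.+?) grows one character at a time until the lookahead sees
# the next "key =" or the end of the line.
my %hash = $line =~ m/(\w+)\s*=\s*(.+?)(?=\h+\w+\h*=|$)/gm;

print "$_: $hash{$_}\n" for sort keys %hash;
```

The lookahead consumes nothing, so the next iteration of /g can still match the following key.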

.+? says match one or more non-newline characters, preferring as few as possible.
You want .+ which matches one or more non-newline characters, preferring as many as possible.
Then it looks like you also need to stop at a matching quote, so
/(\w+)\s*=\s*('.+?'|".+?"|.+)/g
Though if spaces aren't allowed in unquoted values, you want \S+ instead of .+
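A small sketch of the quoted-alternatives pattern with the \S+ variant, assuming unquoted values never contain spaces. The sample line is the one from the question.

```perl
use strict;
use warnings;

my $line = q{names='bob,jane, Alexander the Great' colors = "red,green" test= %results};

# Try a single-quoted value, then a double-quoted one, then a bare run of non-space characters.
my %pairs = $line =~ /(\w+)\s*=\s*('.+?'|".+?"|\S+)/g;

print "$_: $pairs{$_}\n" for sort keys %pairs;
```

Here the lazy .+? is safe because the closing quote that follows gives it something to stop at.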

Related

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
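To see the captures being filled in, here is a minimal runnable sketch on the line x = 1 from the question. The variable names $var and $val follow the snippet above, and the trailing newline mimics a line read from a file.

```perl
use strict;
use warnings;

my $line = "x = 1\n";
my ($var, $val);

# Each pair of parentheses fills the next numbered capture variable.
if ($line =~ /^\s*(\w+)\s*=\s*(\w+)\s*$/) {
    ($var, $val) = ($1, $2);
    print "variable: $var, value: $val\n";
}
```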
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely to do what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
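A quick sketch comparing the two ways of getting at the matched text, using the same 'abcdef' string. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $string = 'abcdef';
my ($whole, $captured);

# $& holds the entire match, with or without parentheses.
$whole = $& if $string =~ /c.*$/;

# $1 holds only what the first pair of parentheses matched.
$captured = $1 if $string =~ /(c.*)$/;

print "whole match: $whole\n";
print "captured:    $captured\n";
```

Both lines print cdef here, since the parentheses wrap the whole pattern.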
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at the start or end of a string. Do note that \b also matches next to non-word characters that can appear inside words, such as a hyphen:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
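The difference between the two patterns shows up on hyphenated input. A minimal sketch, with a made-up test string:

```perl
use strict;
use warnings;

my $text = 'aaa-xxx foo';

# \b accepts the hyphen as a boundary, so both halves of aaa-xxx are returned.
my @by_boundary = $text =~ /\b(\w+)\b/g;

# The lookarounds demand whitespace (or a string edge) on both sides.
my @by_whitespace = $text =~ /(?<!\S)(\w+)(?!\S)/g;

print "word boundary: @by_boundary\n";
print "whitespace:    @by_whitespace\n";
```

The first list contains aaa, xxx and foo, while the second contains only foo.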
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions

Match a string between multiple whitespaces

Hello can anyone help me with a regex to match a string between multiple whitespaces
My string may look like this :
This is just          Nicolas-764 sdh          and his sister
I want to match Nicolas-764 sdh
So far I wrote this but it matches all the string after the first whitespaces
if ($string =~ m/(just) {5,}(.*) {5,}/) {
print "$1\n";
print "$2\n";
}
I want to create a hash that will have as key just and as value Nicolas-764 sdh.
I don't want to just match a string between multiple spaces. I need to use just too
You're suffering from greedy matching .*.
You simply need to change to non-greedy matching using .*?.
use strict;
use warnings;
my $string = 'This is just          Nicolas-764 sdh          and his sister';
if ($string =~ m/just\s{5,}(.*?)\s{5,}/) {
print "$1\n";
}
Outputs:
Nicolas-764 sdh
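Since the question asks for a hash keyed on just, here is one possible sketch that captures both the keyword and the value. The gap of six spaces is an assumption, built with the x operator so the run of spaces is explicit.

```perl
use strict;
use warnings;

# Build the sample string with explicit runs of six spaces.
my $gap    = ' ' x 6;
my $string = "This is just${gap}Nicolas-764 sdh${gap}and his sister";

my %hash;
# Capture the keyword as $1 and the lazily matched value as $2.
if ($string =~ /(just)\s{5,}(.*?)\s{5,}/) {
    $hash{$1} = $2;
}

print "$_ => $hash{$_}\n" for keys %hash;
```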
Your code would be,
if ($string =~ m/^.*?just {5,}(\S+)\s+(\S+) {5,}.*$/) {
print "$1\n";
print "$2\n";
}
First group contains Nicolas-764 and the second group contains sdh
Or
You could try the below regex also,
^.*?just {5,}(\S+(?:\s\S+)*?) {5,}.*$
Explanation:
^ Asserts that we are at the start of the line.
.*?just This would match up to the first just string. ? after * does a non-greedy match.
{5,} Matches 5 or more spaces.
() Capturing groups.
\S+ One or more non-space characters.
(?:) Non-capturing groups. It won't capture anything. Just matching would be done.
(?:\s\S+)*? Matches a space followed by one or more non-space characters. And the whole would occur zero or more times.
{5,} Matches 5 or more spaces.
.* Matches any character zero or more times.
$ Asserts that we are at the end of the line.
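A short sketch exercising the single-group regex explained above. The sample string with six-space gaps is an assumption.

```perl
use strict;
use warnings;

my $gap    = ' ' x 6;
my $string = "This is just${gap}Nicolas-764 sdh${gap}and his sister";

my $value;
# \S+(?:\s\S+)*? takes one word, then as few " word" repeats as needed
# before a run of five or more spaces.
if ($string =~ /^.*?just {5,}(\S+(?:\s\S+)*?) {5,}.*$/) {
    $value = $1;
    print "$value\n";
}
```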

Why do these two regexes behave differently?

Why do the following two regexes behave differently?
$millisec = "1391613310.1";
$millisec =~ s/.*(\.\d+)?$/$1/;
vs.
$millisec =~ s/\d*(\.\d+)?$/$1/;
This code prints nothing:
perl -e 'my $mtime = "1391613310.1"; my $millisec = $mtime; $millisec =~ s/.*(\.\d+)?$/$1/; print "$millisec";'
While this prints the decimal portion of the string:
perl -e 'my $mtime = "1391613310.1"; my $millisec = $mtime; $millisec =~ s/\d*(\.\d+)?$/$1/; print "$millisec";'
In the first regex, the .* is taking up everything to the end of the string, so there's nothing the optional (.\d+)? can pick up. $1 will be empty, so the string is replaced by an empty string.
In the second regex, only digits are grabbed from the beginning so that \d* stops in front of the dot. (.\d+)? will pick the dot, including the trailing digits.
You're using .\d+ inside parentheses, which will match any character plus digits. If you want to match a dot explicitly, you have to use \..
To make the first regex behave similarly to the second one you would have to write
$millisec =~ s/.*?(\.\d+)?$/$1/;
so that the initial .* doesn't take up everything.
Greed.
Perl's regex engine will match as much as possible with each term before moving on to the next term. So for .*(.\d+)?$ the .* matches the entire string, then (.\d+)? matches nothing as it is optional.
\d*(.\d+)?$ can match only up to the dot, so then has to match .1 against (.\d+)?
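To compare the three variants side by side, here is a small sketch. The helper keep_fraction is a hypothetical name, and it treats an unset $1 as an empty string so the script runs cleanly under warnings.

```perl
use strict;
use warnings;

my $mtime = "1391613310.1";

# Apply one substitution to a copy of the string;
# an unset $1 is replaced by the empty string.
sub keep_fraction {
    my ($regex, $string) = @_;
    $string =~ s{$regex}{ defined $1 ? $1 : '' }e;
    return $string;
}

my $greedy = keep_fraction(qr/.*(\.\d+)?$/,  $mtime);  # .* swallows everything
my $digits = keep_fraction(qr/\d*(\.\d+)?$/, $mtime);  # \d* stops at the dot
my $lazy   = keep_fraction(qr/.*?(\.\d+)?$/, $mtime);  # .*? gives the group a chance

print "greedy: '$greedy'\n";
print "digits: '$digits'\n";
print "lazy:   '$lazy'\n";
```

The greedy version leaves an empty string, while the other two leave .1.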

Regex to find the number of extra spaces, including trailing and leading spaces

I'm trying to count the number of extra spaces, including trailing and leading spaces in a string. There are a lot of suggestions out there, but none of them get the count exactly right.
Example ( _ indicates space)
__this is a string__with extra spaces__
should match 5 extra spaces.
Here's my code:
if (my @matches = $_[0] =~ m/(\s(?=\s)|(?<=\s)\s)|^\s|\s$/g){
push @errors, {
"error_count" => scalar @matches,
"error_type" => "extra spaces",
};
}
The problem with this regex is that it counts spaces in the middle twice.
However, if I take out one of the look-ahead/look-behind matches, like so:
$_[0] =~ m/\s(?=\s)|^\s|\s$/g
It won't count two extra spaces at the beginning of a string. (My test string would only match 4 spaces.)
Try
$_[0] =~ m/^\s|(?<=\s)\s|\s(?=\s*$)/g
This should match
the first space (if one exists),
each space that follows a space,
and that one trailing space that immediately follows the last non-space (the rest of the trailing spaces are already counted by the second case).
In other words, for your example, here's what each of the three cases would match:
__this is a string _with extra spaces__
12                 2                 32
This also works for the edge case of all spaces:
_____
12222
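A minimal sketch that wraps the three-case regex in a counting function. The name count_extra_spaces is made up for illustration, and the test strings are built with the x operator so the runs of spaces are explicit.

```perl
use strict;
use warnings;

# Count a leading space, every space that follows a space,
# and the first space of a trailing run.
sub count_extra_spaces {
    my ($string) = @_;
    my $count = () = $string =~ /^\s|(?<=\s)\s|\s(?=\s*$)/g;
    return $count;
}

my $two     = ' ' x 2;
my $example = "${two}this is a string${two}with extra spaces${two}";

print count_extra_spaces($example), "\n";   # the example from the question
print count_extra_spaces(' ' x 5), "\n";    # all spaces
print count_extra_spaces('no extras'), "\n";
```

Assigning the match to an empty list first forces list context, so $count receives the number of matches.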
This regex should match all unnecessary individual spaces
^( )+|( )(?= )|( )+$
or
$_[0] =~ m/^( )+|( )(?= )|( )+$/g
You could change the spaces to \s but then it'll count tabs as well.
Breakdown:
^( )+ Match any spaces connected to the start of the line
( )(?= ) Match any spaces that are immediately followed by another space
( )+$ Match any spaces connected to the end of the line
With three simple regular expressions (and replacing spaces with underscores for clarity) you could use:
use strict;
use warnings;
my $str = "__this_is_a_string__with_extra_underscores__";
my $temp = $str;
$temp =~ s/^_+//;
$temp =~ s/_+$//;
$temp =~ s/__+/_/g;
my $num_extra_underscores = (length $str) - (length $temp);
print "The string '$str' has $num_extra_underscores extra underscores\n";

Ignoring Whitespace with Regex(perl)

I am using Perl Regular expressions.
How would I go about ignoring whitespace and still testing whether a string matches?
For example:
$var = " hello "; #I want var to ignore whitespace and still match
if($var =~ m/hello/)
{
}
What you have there should match just fine. The regex will match any occurrence of the pattern hello, so as long as it sees "hello" somewhere in $var it will match.
On the other hand, if you want to be strict about what you ignore, you should anchor your string from start to end
if($var =~ m/^\s*hello\s*$/) {
}
and if you have multiple words in your pattern
if($var =~ m/^\s*hello\s+world\s*$/) {
}
\s* matches 0 or more whitespace, \s+ matches 1 or more white space. ^ matches the beginning of a line, and $ matches the end of a line.
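The difference between the unanchored and the anchored test can be seen in a small sketch. The input strings are made up for illustration.

```perl
use strict;
use warnings;

my @inputs = (" hello ", "hello", "say hello ", "hello world");
my (%loose, %strict);

for my $var (@inputs) {
    # Unanchored: true if "hello" appears anywhere in the string.
    $loose{$var}  = $var =~ /hello/          ? 1 : 0;
    # Anchored: only whitespace may surround "hello".
    $strict{$var} = $var =~ /^\s*hello\s*$/  ? 1 : 0;
    print "'$var': loose=$loose{$var} strict=$strict{$var}\n";
}
```

Only the first two inputs pass the anchored test, while all four pass the unanchored one.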
As others have said, Perl matches anywhere in the string, not the whole string. I found this confusing when I first started and I still get caught out. I try to teach myself to think about whether I need to look at the start of the line, the whole string, and so on.
Another useful tip is use \b. This looks for word breaks so /\bbook\b/ matches
"book. "
"book "
"-book"
but not
"booking"
"ebook"
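The lists above can be checked with a short sketch that filters the same five strings with grep.

```perl
use strict;
use warnings;

my @candidates = ("book. ", "book ", "-book", "booking", "ebook");

# Keep only the strings where "book" stands as a whole word.
my @matched   = grep {  /\bbook\b/ } @candidates;
my @unmatched = grep { !/\bbook\b/ } @candidates;

print "matched:   ", join(', ', map { "'$_'" } @matched), "\n";
print "unmatched: ", join(', ', map { "'$_'" } @unmatched), "\n";
```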
This is slightly tangential, but you could also collapse every run of whitespace in your string into a single space before passing it to the if:
s/[\h\v]+/ /g;
/^\shello\s$/
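Putting the two lines above together, a minimal sketch under the assumption that the input has tabs and a trailing newline around the word. The variable names are illustrative.

```perl
use strict;
use warnings;

my $var = "\t hello \n";

# Collapse every run of horizontal or vertical whitespace into one space.
(my $collapsed = $var) =~ s/[\h\v]+/ /g;

# After collapsing, exactly one whitespace character sits on each side.
my $matched = $collapsed =~ /^\shello\s$/ ? 1 : 0;

print "collapsed: '$collapsed'\n";
print "matched:   $matched\n";
```

Note that \h and \v require Perl 5.10 or later.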