Match a string between multiple whitespaces - regex

Hello, can anyone help me with a regex to match a string between runs of multiple spaces?
My string may look like this:
This is just     Nicolas-764 sdh     and his sister
I want to match Nicolas-764 sdh.
So far I wrote this, but it matches the whole string after the first run of spaces:
if ($string =~ m/(just) {5,}(.*) {5,}/) {
print "$1\n";
print "$2\n";
}
I want to create a hash that will have as key just and as value Nicolas-764 sdh.
I don't want to just match a string between multiple spaces. I need to use just too

You're suffering from greedy matching .*.
You simply need to change to non-greedy matching using .*?.
use strict;
use warnings;
my $string = 'This is just     Nicolas-764 sdh     and his sister';
if ($string =~ m/just\s{5,}(.*?)\s{5,}/) {
print "$1\n";
}
Outputs:
Nicolas-764 sdh
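Since the question also asks for a hash keyed on just, here is a minimal sketch of that step. It assumes the input really contains runs of at least five spaces on both sides of the value; the sample string and the %hash name are illustrative only.

```perl
use strict;
use warnings;

# Sample input; the runs of five spaces are an assumption about the real data.
my $string = 'This is just     Nicolas-764 sdh     and his sister';

my %hash;
# Capture the keyword and, non-greedily, everything up to the next run of 5+ spaces.
if ($string =~ m/(just) {5,}(.*?) {5,}/) {
    $hash{$1} = $2;
}

print "$_ => $hash{$_}\n" for keys %hash;   # just => Nicolas-764 sdh
```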

Your code would be,
if ($string =~ m/^.*?just {5,}(\S+)\s+(\S+) {5,}.*$/) {
print "$1\n";
print "$2\n";
}
First group contains Nicolas-764 and the second group contains sdh
Or
You could try the below regex also,
^.*?just {5,}(\S+(?:\s\S+)*?) {5,}.*$
Explanation:
^ Asserts that we are at the start of the line.
.*?just Matches up to the first occurrence of just. The ? after * makes the match non-greedy.
{5,} preceded by a space: matches 5 or more spaces.
() Capturing group.
\S+ One or more non-space characters.
(?:) Non-capturing group. It groups the pattern without capturing anything.
(?:\s\S+)*? Matches a whitespace character followed by one or more non-space characters, with the whole group repeated zero or more times (non-greedily).
{5,} preceded by a space: matches 5 or more spaces.
.* Matches any character (except a newline) zero or more times.
$ Asserts that we are at the end of the line.

Related

Non-Capturing and Capturing Groups - The right way

I'm trying to match an array of elements preceded by a specific string in a line of text. For example, match all pets in the text below:
fruits:apple,banana;pets:cat,dog,bird;colors:green,blue
/(?:pets:)(\w+[,|;])+/g
Using the given regex I only could match the last word "bird"
Can anybody help me to understand the right way of using Non-Capturing and Capturing Groups?
Thanks!
First, let's talk about capturing and non-capturing group:
(?:...) is the non-capturing version: you need this part to match, but you don't need its value.
() is the capturing version: you want this value, it's what you're searching for.
So:
(?:pets:) means you are searching for "pets:" but don't want to capture it. After that point, you WANT to capture (if I've understood correctly).
So try (?:pets:)([a-zA-Z,]+); ... This searches for "pets:" (without capturing it) and stops at the first ";" (which is not captured either).
Result is :
Match 1 : cat,dog,bird
A better solution exists with 1 match == 1 pet.
Since you want to have each pet in a separate match and you are using PCRE \G is, as suggested by Wiktor, a decent option:
(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)
Explanation:
1st Alternative (?:pets:) to find the start of the pattern
2nd Alternative \G(?!^)(\w+)(?:[,;]|$)
\G asserts position at the end of the previous match or the start of the string for the first match
Negative Lookahead (?!^) to assert that the Regex does not match at the start of the string
(\w+) matches each pet
Non-capturing group (?:[,;]|$) used as a delimiter: it matches a single character from the list ,; or, via $, asserts position at the end of the string
Perl Code Sample:
use strict;
use Data::Dumper;
my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';
my $regex = qr/(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)/mp;
my @result = ();
while ( $str =~ /$regex/g ) {
# $1 is undefined when only the "pets:" alternative matched
if (defined $1 && $1 ne '') {
#print "$1\n";
push @result, $1;
}
}
print Dumper(\@result);

My regexp has anorexia

I'm trying to get multiple key/value pairs from a string where the keys is on the left of an = character and the value on the right. So the following code
$line = <<END;
names='bob,jane, Alexander the Great' colors = "red,green" test= %results
END
my %hash = ($line =~ m/(\w+)\s*=\s*(.+?)/g);
for (keys %hash) { print "$_: $hash{$_}\n"; }
Should output
names: 'bob,jane, Alexander the Great'
colors: "red,green"
test: %results
But my regexp is just returning the first character of the value like
names: '
colors: "
and so on. If I change the second match to (.+) then it matches the whole line after the first =. Can someone fix this regexp?
That's because .+? is non-greedy: it stops as soon as it has a match, and since no pattern follows the non-greedy part, it stops after a single character. Give it something to stop at:
my %hash = ($line =~ m/(\w+)\s*=\s*(.+?)(?=\h+\w+\h*=|$)/gm);
(?=\h+\w+\h*=|$) is called a positive lookahead; it asserts that the match must be followed by
\h+ one or more horizontal spaces.
\w+ one or more word characters.
\h* zero or more horizontal spaces.
= equal symbol.
| OR
$ End of the line anchor.
.+? says match one or more non-newline characters, preferring as few as possible.
You want .+ which matches one or more non-newline characters, preferring as many as possible.
Then it looks like you also need to stop at a matching quote, so
/(\w+)\s*=\s*('.+?'|".+?"|.+)/g
Though if spaces aren't allowed in unquoted values, you want \S+ instead of .+
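As a rough check of the quoted-value approach above, here is a small sketch using the \S+ variant. It assumes unquoted values never contain spaces, and the qq{} string stands in for the heredoc from the question.

```perl
use strict;
use warnings;

# Input from the question; the trailing newline mimics the heredoc.
my $line = qq{names='bob,jane, Alexander the Great' colors = "red,green" test= %results\n};

# Try a single-quoted value, then a double-quoted one, then a bare run of
# non-space characters (assumes unquoted values contain no spaces).
my %hash = $line =~ m/(\w+)\s*=\s*('.+?'|".+?"|\S+)/g;

print "$_: $hash{$_}\n" for sort keys %hash;
```

With this input the hash ends up with the three keys names, colors and test, each mapped to its value with the quotes kept.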

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match on and capture both with parenthesis. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely to be doing what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at the start or end of the string. Do note that \b also matches next to non-word characters that can appear inside words, such as:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
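To see how the two patterns differ, here is a short sketch run against a made-up line (the input 'aaa-xxx = 1' is hypothetical, chosen because it contains a hyphenated token):

```perl
use strict;
use warnings;

my $line = 'aaa-xxx = 1';   # hypothetical input

# \b treats the hyphen as a word break, so both halves are captured.
my @boundary = $line =~ /\b(\w+)\b/g;

# The lookarounds require whitespace (or string start/end) on both sides.
my @whitespace = $line =~ /(?<!\S)(\w+)(?!\S)/g;

print "word boundary: @boundary\n";     # aaa xxx 1
print "lookarounds:   @whitespace\n";   # 1
```

The lookaround version rejects aaa and xxx because each has a non-whitespace hyphen next to it, which is the behaviour you have to decide on.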
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions

perl Regular expression matching repeating words

a regular expression that matches any line of input that has the same word repeated
two or more times consecutively in a row. Assume there is one space between consecutive
words
if ($line !~ m/(\b(\w+)\b\s){2,}/) { print "No match\n"; }
else { print "$`"; #print out first part of string
print "<$&>"; #highlight the matching part
print "$'"; #print out the rest
}
This is best i got so far,but there is something wrong
correct me if i am wrong
\b start with a word boundary
(\w+) followed by one word or more words
\b end with a word boundary
\s then a space
{2,} check if this thing repeat 2 or more times
what's wrong with my expression
This should be what you're looking for: (?:\b(\w+)\b) (?:\1(?: |$))+
Also, don't use \s when you're just looking for spaces as it's possible you'll match a newline or some other whitespace character. Simple spaces aren't delimiters or special characters in regex, so it's fine to just type the space. You can use [ ] if you want it to be more visually apparent.
I tried CAustin's answer in regexr.com and the results were not what I would expect. Also, no need for all the non-capturing groups.
My regex:
(\b(\w+))( \2)+
Word-boundary, followed by (1 or more word characters)[group 2], followed by one or more of: space, group 2.
This next one replaces the space with \s+, generalizing the separation between the words to be 1 or more of any kind of white-space:
(\b(\w+))(\s+\2)+
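One thing worth checking with these patterns is a word that is merely a prefix of the next one. The sketch below compares the regex above with a variant that adds a trailing \b after the backreference; that variant, the helper name first_match and the sample lines are my own assumptions, not part of the answer.

```perl
use strict;
use warnings;

my $loose  = qr/(\b(\w+))(\s+\2)+/;     # regex from the answer above
my $strict = qr/\b(\w+)(?:\s+\1\b)+/;   # assumed variant with a trailing \b

# Return the matched text, or 'no match' (hypothetical helper).
sub first_match {
    my ($line, $re) = @_;
    return $line =~ $re ? $& : 'no match';
}

for my $line ('foo foo bar', 'foo food') {
    printf "%-13s loose: %-8s strict: %s\n",
        "'$line'", first_match($line, $loose), first_match($line, $strict);
}
```

On 'foo food' the first pattern still reports "foo foo", because the backreference matches the start of "food"; the variant with \b does not.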
You aren't actually checking to see if it's the SAME word that's repeating. To do that, you need to use a captured backreference:
if ($line =~ m/\b(\w+)(?:\s\1){2,}\b/) {
print "matched '$1'\n";
}
Also, anytime you're testing a regular expression, it's helpful if you create a list of examples to work with. The following demonstrates one way of doing that using the __DATA__ block
use strict;
use warnings;
while (my $line = <DATA>) {
if ($line =~ m/\b(\w+)(?:\s\1){2,}/) {
print "matched '$1'\n";
} else {
print "no match\n";
}
}
__DATA__
foo foo
foo bar foo
foo foo foo
Outputs
no match
no match
matched 'foo'

Insertion with Regex to format a date (Perl)

Suppose I have a string 04032010.
I want it to be 04/03/2010. How would I insert the slashes with a regex?
To do this with a regex, try the following:
my $var = "04032010";
$var =~ s{ (\d{2}) (\d{2}) (\d{4}) }{$1/$2/$3}x;
print $var;
The \d means match a single digit, and {n} means match the preceding element n times. Combined, you get \d{2} to match two digits or \d{4} to match four digits. By surrounding each set in parentheses, the match will be stored in a variable: $1, $2, $3, etc.
Some of the prior answers used a . to match, this is not a good thing because it'll match any character. The one we've built here is much more strict in what it'll accept.
You'll notice I used extra spacing in the regex; the x modifier tells the engine to ignore whitespace in the pattern. It can be quite helpful for making the regex more readable.
Compare s{(\d{2})(\d{2})(\d{4})}{$1/$2/$3}x; vs s{ (\d{2}) (\d{2}) (\d{4}) }{$1/$2/$3}x;
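To illustrate the point about . versus \d, here is a small sketch that runs both substitutions on a valid date string and on a made-up non-numeric string ('abcdefgh' is purely illustrative, as is the %out hash):

```perl
use strict;
use warnings;

my %out;   # input => [result using dots, result using \d]
for my $input ('04032010', 'abcdefgh') {
    # Copy the input, then substitute on the copy.
    (my $dots   = $input) =~ s/(..)(..)(....)/$1\/$2\/$3/;
    (my $digits = $input) =~ s{(\d{2})(\d{2})(\d{4})}{$1/$2/$3};
    $out{$input} = [$dots, $digits];
    print "$input: dots=$dots digits=$digits\n";
}
```

The dot version happily turns abcdefgh into ab/cd/efgh, while the \d version leaves it untouched.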
Well, a regular expression just matches, but you can try something like this:
s/(..)(..)(....)/$1\/$2\/$3/
#!/usr/bin/perl
$var = "04032010";
$var =~ s/(..)(..)(....)/$1\/$2\/$3/;
print $var, "\n";
Works for me:
$ perl perltest
04/03/2010
I always prefer to use a different delimiter if / is involved so I would go for
s| (\d\d) (\d\d) |$1/$2/|x ;