perl Regular expression matching repeating words

I need a regular expression that matches any line of input that has the same word repeated two or more times in a row. Assume there is one space between consecutive words.
if ($line !~ m/(\b(\w+)\b\s){2,}/) {
    print "No match\n";
} else {
    print "$`";    # print out the first part of the string
    print "<$&>";  # highlight the matching part
    print "$'";    # print out the rest
}
This is the best I have so far, but there is something wrong with it. Correct me if I am wrong:
\b start at a word boundary
(\w+) followed by one or more word characters
\b end at a word boundary
\s then a space
{2,} check whether this whole group repeats 2 or more times
What's wrong with my expression?

This should be what you're looking for: (?:\b(\w+)\b) (?:\1(?: |$))+
Also, don't use \s when you're just looking for spaces as it's possible you'll match a newline or some other whitespace character. Simple spaces aren't delimiters or special characters in regex, so it's fine to just type the space. You can use [ ] if you want it to be more visually apparent.

I tried CAustin's answer in regexr.com and the results were not what I would expect. Also, no need for all the non-capturing groups.
My regex:
(\b(\w+))( \2)+
Word-boundary, followed by (1 or more word characters)[group 2], followed by one or more of: space, group 2.
This next one replaces the space with \s+, generalizing the separation between the words to be 1 or more of any kind of white-space:
(\b(\w+))(\s+\2)+

You aren't actually checking to see if it's the SAME word that's repeating. To do that, you need to use a captured backreference:
if ($line =~ m/\b(\w+)(?:\s\1){2,}\b/) {
print "matched '$1'\n";
}
Also, any time you're testing a regular expression, it helps to create a list of examples to work with. The following demonstrates one way of doing that using the __DATA__ block:
use strict;
use warnings;
while (my $line = <DATA>) {
    if ($line =~ m/\b(\w+)(?:\s\1){2,}/) {
        print "matched '$1'\n";
    } else {
        print "no match\n";
    }
}
__DATA__
foo foo
foo bar foo
foo foo foo
Outputs:
no match
no match
matched 'foo'
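As a rough sketch of the backreference idea from the answers above, the following wraps the check in a small helper. The name find_repeat and the sample lines are made up for illustration. It uses + rather than {2,}, so a word that appears twice in a row already counts; switch the quantifier if you want three or more occurrences, as in the __DATA__ example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): returns the repeated
# word and the full run it appears in, or an empty list if nothing repeats.
sub find_repeat {
    my ($line) = @_;
    # \b(\w+)      capture one whole word
    # (?: \1\b)+   then a single space and the SAME word, one or more times
    if ($line =~ /\b(\w+)((?: \1\b)+)/) {
        return ($1, "$1$2");
    }
    return;
}

for my $line ('this is is a test', 'no repeats here', 'go go go now') {
    my ($word, $run) = find_repeat($line);
    print defined $word ? "'$word' repeated in <$run>\n" : "no match\n";
}
```

The \b after \1 matters: without it, a line such as "a ab" would be reported as a repeat of "a".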

Related

Perl regular expression to retrieve the first digit

I have a string with the value Validation_File_2_3.45.2017.csv.
How do I extract the first digit which is 2 in this case using a regular expression?
I have tried the expression ($Filedigit) = ($Filename =~ m/^[0-9]/g), but it didn't work.

In a comment, you said you tried this:
($Filedigit)= ($Filename =~ m/^[0-9]/g);
A couple of things. You should always check that a match succeeded before continuing with your script, especially when you are trying to capture. Next, ^ anchors the match at the beginning of the string, so the pattern looks for a single digit 0-9 as the very first character. This won't match unless you have a filename such as 2_blah.csv. Also, the pattern has no capturing parentheses: with /g in list context a successful match returns the matched text itself, and without /g it returns just 1 (signifying that a match happened), so it is clearer to capture the part you want explicitly.
Here's an example that does what you want:
use warnings;
use strict;
my $str = 'Validation_File_2_3.45.2017.csv';
# confirm there's a match
if (my ($num) = $str =~ /^.*?(\d+)/){
print "$num\n";
}
else {
print "no match\n";
}
Explanation of the regex:
^ - start from beginning of string
.*? - anything, non-greedy
( - begin capture
\d+ - one or more contiguous digit characters
) - end capture
So, it starts from the beginning of the string, throws away anything before the first set of contiguous digits and captures them and puts that into the variable.
See perlretut and perlre.
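Since the goal is only the first digit, the leading ^.*? is not strictly needed: an unanchored match already finds the leftmost digit. A minimal sketch, assuming two illustrative helper names (first_digit and first_number) that do not appear in the answer:

```perl
use strict;
use warnings;

# Hypothetical helpers (names are illustrative, not from the answer).

# First single digit anywhere in the string, or undef if there is none.
sub first_digit {
    my ($str) = @_;
    my ($digit) = $str =~ /(\d)/;    # list context: capture group 1
    return $digit;
}

# First run of contiguous digits, or undef if there is none.
sub first_number {
    my ($str) = @_;
    my ($num) = $str =~ /(\d+)/;
    return $num;
}

my $name = 'Validation_File_2_3.45.2017.csv';
print first_digit($name), "\n";                 # 2
print first_number($name), "\n";                # 2
print first_number('report_2017.csv'), "\n";    # 2017
```

Both helpers return undef when there is no digit at all, so the caller can still check for a failed match.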

My regexp has anorexia

I'm trying to get multiple key/value pairs from a string where the key is on the left of an = character and the value is on the right. So the following code
$line = <<END;
names='bob,jane, Alexander the Great' colors = "red,green" test= %results
END
my %hash = ($line =~ m/(\w+)\s*=\s*(.+?)/g);
for (keys %hash) { print "$_: $hash{$_}\n"; }
Should output
names: 'bob,jane, Alexander the Great'
colors: "red,green"
test: %results
But my regexp is just returning the first character of the value like
names: '
colors: "
and so on. If I change the second match to (.+) then it matches the whole line after the first =. Can someone fix this regexp?

Because .+? is non-greedy, it stops as soon as it has matched a single character: nothing follows it in your pattern to force it to consume more.
my %hash = ($line =~ m/(\w+)\s*=\s*(.+?)(?=\h+\w+\h*=|$)/gm);
(?=\h+\w+\h*=|$) is a positive lookahead, which asserts that the match must be followed by
\h+ one or more horizontal spaces.
\w+ one or more word characters.
\h* zero or more horizontal spaces.
= equal symbol.
| OR
$ End of the line anchor.
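The lookahead approach above can be tried on the sample line from the question. This is only a sketch: parse_pairs is a made-up helper name, and \h needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: parses key=value pairs out of one line into a hash.
sub parse_pairs {
    my ($line) = @_;
    # The lookahead ends each value just before the next "key =" or at
    # the end of the line. \h (horizontal whitespace) needs Perl 5.10+.
    my %pairs = $line =~ /(\w+)\s*=\s*(.+?)(?=\h+\w+\h*=|$)/gm;
    return %pairs;
}

my %hash = parse_pairs(q{names='bob,jane, Alexander the Great' colors = "red,green" test= %results});
print "$_: $hash{$_}\n" for sort keys %hash;
```

With the keys sorted, this prints the colors, names and test lines with their full values, quotes included.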

.+? says match one or more non-newline characters, preferring as few as possible.
You want .+ which matches one or more non-newline characters, preferring as many as possible.
Then it looks like you also need to stop at a matching quote, so
/(\w+)\s*=\s*('.+?'|".+?"|.+)/g
Though if spaces aren't allowed in unquoted values, you want \S+ instead of .+
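A small sketch of the quote-aware alternation, using the \S+ variant for unquoted values. The helper name parse_quoted_pairs is hypothetical, and the pattern assumes a quoted value does not contain its own quote character.

```perl
use strict;
use warnings;

# Hypothetical helper: a value is a single-quoted string, a double-quoted
# string, or a run of non-whitespace characters.
sub parse_quoted_pairs {
    my ($line) = @_;
    my %pairs = $line =~ /(\w+)\s*=\s*('.+?'|".+?"|\S+)/g;
    return %pairs;
}

my %hash = parse_quoted_pairs(q{names='bob,jane, Alexander the Great' colors = "red,green" test= %results});
print "$_: $hash{$_}\n" for sort keys %hash;
```

Here the lazy .+? is safe because the closing quote that follows it forces it to keep going until the value is complete.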

Match a string between multiple whitespaces

Hello, can anyone help me with a regex to match a string between multiple whitespaces?
My string may look like this:
This is just      Nicolas-764 sdh      and his sister
I want to match Nicolas-764 sdh.
So far I wrote this, but it matches the whole string after the first run of whitespace:
if ($string =~ m/(just) {5,}(.*) {5,}/) {
print "$1\n";
print "$2\n";
}
I want to create a hash with just as the key and Nicolas-764 sdh as the value.
I don't simply want to match a string between multiple spaces; the pattern needs to use just too.

You're suffering from greedy matching .*.
You simply need to change to non-greedy matching using .*?.
use strict;
use warnings;
my $string = 'This is just      Nicolas-764 sdh      and his sister';  # runs of 5+ spaces around the target
if ($string =~ m/just\s{5,}(.*?)\s{5,}/) {
print "$1\n";
}
Outputs:
Nicolas-764 sdh
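To get the hash the question asks for, the same non-greedy match can capture the key as well. This is a sketch with a hypothetical helper name (extract_after_just); the sample string is built with the x operator so the runs of spaces are unambiguous.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (key, value) where the key is the literal
# word "just" and the value is the text between the two runs of spaces.
sub extract_after_just {
    my ($string) = @_;
    # (.*?) is lazy, so it stops at the FIRST following run of 5+ spaces.
    if ($string =~ /(just) {5,}(.*?) {5,}/) {
        return ($1 => $2);
    }
    return;
}

# Build the sample explicitly so the runs of spaces are unambiguous.
my $string = 'This is just' . ' ' x 6 . 'Nicolas-764 sdh' . ' ' x 6 . 'and his sister';

my %found = extract_after_just($string);
print "$_ => '$found{$_}'\n" for keys %found;   # just => 'Nicolas-764 sdh'
```

If the string has no run of five or more spaces after just, the helper returns an empty list and the hash stays empty.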

Your code would be,
if ($string =~ m/^.*?just {5,}(\S+)\s+(\S+) {5,}.*$/) {
print "$1\n";
print "$2\n";
}
The first group contains Nicolas-764 and the second group contains sdh.
Or you could also try the regex below:
^.*?just {5,}(\S+(?:\s\S+)*?) {5,}.*$
Explanation:
^ Asserts that we are at the start of the line.
.*?just Matches up to the first occurrence of just. The ? after * makes the match non-greedy.
 {5,} A literal space followed by {5,}: matches 5 or more spaces.
() Capturing group.
\S+ One or more non-whitespace characters.
(?:) Non-capturing group. It groups without capturing anything.
(?:\s\S+)*? Matches a whitespace character followed by one or more non-whitespace characters, repeated zero or more times, as few times as possible.
 {5,} A literal space followed by {5,}: matches 5 or more spaces.
.* Matches any character (except newline) zero or more times.
$ Asserts that we are at the end of the line.

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match the comma. Can someone explain why?
I would like to match one or more digits or points, terminated by a comma, for example:
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints the comma too:
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.

Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g

Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around the comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000

If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g

You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g

You can replace the comma with a lookahead, or just exclude the comma altogether, since it isn't part of what you want to capture; it won't make a difference in this case. However, the pattern as written puts the comma, not the number, into capture group 1, and the loop never refers to a capture group explicitly; it just prints whatever the list-context match returns.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}
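The variants discussed in the answers above can be compared side by side. This sketch runs each pattern in list context with /g against the string from the question; the array names are only illustrative.

```perl
use strict;
use warnings;

my $text = "241.000,00";

# List context with /g: captures if the pattern has any, otherwise whole matches.
my @no_groups  = $text =~ /[0-9.]+,/g;       # no parens: the whole match
my @comma_only = $text =~ /[0-9.]+(,)/g;     # parens around the comma
my @number     = $text =~ /([0-9.]+),/g;     # parens around the number
my @lookahead  = $text =~ /[0-9.]+(?=,)/g;   # comma asserted, not consumed

print "@no_groups\n";    # 241.000,
print "@comma_only\n";   # ,
print "@number\n";       # 241.000
print "@lookahead\n";    # 241.000
```

The dot needs no backslash inside a character class, so [0-9.] is equivalent to [0-9\.].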

Ignoring Whitespace with Regex(perl)

I am using Perl regular expressions.
How would I go about ignoring whitespace and still test whether a string matches?
For example:
$var = " hello "; # I want $var to ignore whitespace and still match
if($var =~ m/hello/)
{
}

What you have there should match just fine. The regex will match any occurrence of the pattern hello, so as long as it sees "hello" somewhere in $var it will match.
On the other hand, if you want to be strict about what you ignore, you should anchor your pattern from start to end:
if($var =~ m/^\s*hello\s*$/) {
}
And if you have multiple words in your pattern:
if($var =~ m/^\s*hello\s+world\s*$/) {
}
\s* matches 0 or more whitespace characters, and \s+ matches 1 or more. Without the /m modifier, ^ matches the beginning of the string, and $ matches the end of the string (or just before a trailing newline).
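A quick way to see the difference between the unanchored and the anchored pattern is to run both over a few inputs. The helper name is_hello and the sample strings are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string is exactly "hello" once
# leading and trailing whitespace is ignored.
sub is_hello {
    my ($var) = @_;
    return $var =~ /^\s*hello\s*$/ ? 1 : 0;
}

for my $var ("  hello  ", "hello", "say hello", "hello world") {
    my $loose  = $var =~ /hello/ ? "match" : "no match";   # unanchored
    my $strict = is_hello($var)  ? "match" : "no match";   # anchored
    print "'$var': unanchored=$loose, anchored=$strict\n";
}
```

All four strings match the unanchored pattern, but only the first two match the anchored one.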

As others have said, Perl matches anywhere in the string, not the whole string. I found this confusing when I first started and I still get caught out. I try to teach myself to think about whether I need to look at the start of the line, the whole string, and so on.
Another useful tip is to use \b. This looks for word boundaries, so /\bbook\b/ matches:
"book. "
"book "
"-book"
but not
"booking"
"ebook"

This regex is a little unrelated, but it is useful if you want to collapse each run of whitespace in your string into a single space before passing it through the if:
s/[\h\v]+/ /g;
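The word-boundary tip and the whitespace-collapsing substitution above can both be tried out quickly. This is a sketch: normalize_ws is a hypothetical helper that also trims the single space left at either end, which the one-line substitution on its own does not do. \h and \v need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of whitespace to one space,
# then trim the single space that may remain at either end.
sub normalize_ws {
    my ($s) = @_;
    $s =~ s/[\h\v]+/ /g;
    $s =~ s/^ | $//g;
    return $s;
}

# \b keeps "book" from matching inside longer words.
for my $s ("book. ", "book ", "-book", "booking", "ebook") {
    printf "%-10s %s\n", "'$s'", ($s =~ /\bbook\b/ ? "match" : "no match");
}

print "'", normalize_ws("  hello \t\n  world  "), "'\n";   # 'hello world'
```

The first three strings in the loop match and the last two do not, in line with the lists above.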

/^\shello\s$/
