Perl regex: match a keyword that is not at the start of the line

Example 1: "hello this is me. KEYWORD: blah"
Example 2: "KEYWORD: apple"
I only want to match KEYWORD in example 1, not in example 2, because in example 2 the line starts with KEYWORD.
if ($line =~/KEYWORD:/x) {
# do something
}
The above code matches both examples. How can I change the regex so that it only matches KEYWORD in example 1?
PS: Eventually I want to reduce example 1 to "KEYWORD: blah".

If you are just looking for a keyword, you should use index rather than a regex:
if (index($line, 'KEYWORD') > 0) {
# do something
}
See the documentation: index STR, SUBSTR returns -1 if SUBSTR isn't found in STR; otherwise it returns the index of SUBSTR in STR (starting at 0). Testing for > 0 therefore excludes both "not found" (-1) and "found at the very start" (0).
If you want to look for a more complex pattern than a simple keyword, then you should do as @Perl Dog said in his answer.
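A minimal sketch of how the three kinds of return value from index map onto the cases in the question. The sub name keyword_position is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: classify where (or whether) a keyword occurs in a line.
sub keyword_position {
    my ($line, $keyword) = @_;
    my $pos = index($line, $keyword);
    return 'absent'   if $pos < 0;    # index returns -1 when nothing is found
    return 'at start' if $pos == 0;   # offset 0 is the first character
    return 'later';                   # any positive offset
}

print keyword_position('hello this is me. KEYWORD: blah', 'KEYWORD'), "\n";  # later
print keyword_position('KEYWORD: apple', 'KEYWORD'), "\n";                   # at start
print keyword_position('nothing here', 'KEYWORD'), "\n";                     # absent
```

Note that index reports only the first occurrence, so a line that begins with KEYWORD and repeats it later still yields 0.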

You are looking for a negative lookbehind assertion, i.e. for 'KEYWORD' that is not preceded by a certain string (in your case the start-of-line marker ^):
if ($line =~/(?<!^)KEYWORD:/x) {
# found KEYWORD in '$line', but not at the beginning
print $line, "\n";
}
Output:
hello this is me. KEYWORD: blah
Update: As stated in the comments, the /x modifier isn't necessary in my first regex but can be used to make the pattern more readable. It allows for whitespace (including newlines) and/or comments in the pattern to improve readability. The downside is that every blank/space character in the actual pattern has to be escaped (to distinguish it from the comments) but we don't have these here. The pattern can thus be re-written as follows (the result is the same):
if ($line =~ / (?<! # huh? (?) ahh, look left (<) for something
# NOT (!) appearing on the left.
^) # oh, ok, I got it, there must be no '^' on the left
KEYWORD: # but the string 'KEYWORD:' should come then
/x ) {
# found KEYWORD in '$line', but not at the beginning
print $line, "\n";
}
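As a sketch of the "PS" in the question (reducing example 1 to "KEYWORD: blah"), the same lookbehind can be combined with a capture group. The sub name keyword_tail is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text from a non-initial "KEYWORD:" onwards,
# or undef when the keyword is missing or only appears at the very start.
sub keyword_tail {
    my ($line) = @_;
    return $line =~ /(?<!^)(KEYWORD:.*)/ ? $1 : undef;
}

for my $line ('hello this is me. KEYWORD: blah', 'KEYWORD: apple') {
    my $tail = keyword_tail($line);
    print defined $tail ? "$tail\n" : "no match\n";
}
```

For the two example lines this should print "KEYWORD: blah" and then "no match".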

The answer is actually quite simple!
/.KEYWORD/ # Not at the start of a line
/.KEYWORD/s # Not at the start of the string
By the way, you might want to add \b before KEYWORD to avoid matching NOTTHEKEYWORD.
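A small sketch of the difference between the two variants, using a made-up two-line string, plus the effect of \b on the NOTTHEKEYWORD case:

```perl
use strict;
use warnings;

my $text = "first line\nKEYWORD: on second line";

# Without /s, "." cannot match the newline before KEYWORD, so there is no match.
my $without_s = $text =~ /.\bKEYWORD/  ? 1 : 0;
# With /s, "." matches the newline, so KEYWORD counts as "not at the start of the string".
my $with_s    = $text =~ /.\bKEYWORD/s ? 1 : 0;
# \b rejects a keyword that is glued onto a longer word.
my $embedded  = 'xx NOTTHEKEYWORD' =~ /.\bKEYWORD/ ? 1 : 0;

print "without /s: $without_s, with /s: $with_s, embedded: $embedded\n";
# prints: without /s: 0, with /s: 1, embedded: 0
```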

I think you need to give better, real examples
On the face of it, all you need is
if ( /KEYWORD/ and not /^KEYWORD/ ) {
...
}

Another simple regex:
print if /^.+KEYWORD/;
This matches:
hello this is me. KEYWORD: blah

Related

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                       Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
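To make Note 1 concrete, here is a short sketch contrasting the two assignment forms on the sentence from the question. The variable name $matched is only for illustration.

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar assignment: the match runs in scalar context and yields true/false.
my $matched = ( $words =~ /anywhere\s+(\S+)/ );
# List assignment: the match runs in list context and yields the captures.
my ($w_after) = ( $words =~ /anywhere\s+(\S+)/ );

print "scalar context: $matched\n";   # 1
print "list context:   $w_after\n";   # except
```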
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
Here, .+?\b{wb} matches the shortest non-empty sequence of characters that doesn't have a word break in it. The first one matches the span of spaces in your sentence, and the second one matches "except". The second one is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash.
First, you have to put parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also put parentheses around the =~ binding expression to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the word character class \w:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note that I'm using /x so whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as the equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables, as in (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex matches it:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
The [^\s]+ part captures the word between the two runs of whitespace.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);
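The answers above can be folded into one small helper. This is only a sketch: the sub name word_after is hypothetical, and it captures with \w+ rather than \S+, so trailing punctuation is not included in the result.

```perl
use strict;
use warnings;

# Hypothetical helper: return the word that follows $word in $text, or undef.
sub word_after {
    my ($text, $word) = @_;
    # \Q...\E quotes any regex metacharacters in $word;
    # [[:punct:]\s]+ skips punctuation and whitespace between the two words.
    my ($next) = $text =~ /\b\Q$word\E[[:punct:]\s]+(\w+)/;
    return $next;
}

print word_after("Today i'm not going anywhere; except to office.", 'anywhere'), "\n";  # except
print word_after("Today i'm not going anywhere except to office.",  'anywhere'), "\n";  # except
```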

Perl regex forward reference

I would like to match a forward reference with regexp. The pattern I am looking for is
[snake-case prefix]_[snake-case words] [same snake-case prefix]_number
For example:
foo_bar_eighty_twelve foo_bar_8012
I cannot extract foo_bar and eighty_twelve without first looking at foo_bar_8012. Thus I need a forward reference, not a backward reference, which works only if my prefix is not itself a snake-case prefix.
my $prefix = "foo";
local $_ = "${prefix}_thirty_two = ${prefix}_32";
# Backward reference that works with a prefix with no underscores
{
/(\w+)_(\w+) \s+ = \s+ \1_(\d+)/ix;
print "Name: $2 \t Number: $3\n";
}
# Wanted Forward reference that do not work :(
{
/\2_(\w+) \s+ = \s+ (\w+)_\d+/ix;
print "Name: $1 \t Number: $2\n";
}
Unfortunately, my forward reference does not work and I do not know why. I've read that Perl supports this kind of pattern.
Any help?
The following assumption is false:
“I cannot extract foo_bar and eighty_twelve without looking first at foo_bar_8012.”
Yes, it is true that you can't definitively determine where the break between prefix and name occurs in the first group of characters until you look at the second group, but that is exactly where the power of regular expressions comes in. The engine greedily matches on the first pass, finds that the second string doesn't match, and then backtracks to try again with a shorter string for the prefix.
The following demonstrates how you would accomplish your goal using simple back references:
use strict;
use warnings;
while (<DATA>) {
if (m{\b(\w+)_(\w+)\s+\1_(\d+)\b}) {
print "Prefix = $1, Name = $2, Number = $3\n";
} else {
warn "Not found: $_"
}
}
__DATA__
foo_thirty_two foo_32
foo_bar_eighty_twelve foo_bar_8012
Outputs:
Prefix = foo, Name = thirty_two, Number = 32
Prefix = foo_bar, Name = eighty_twelve, Number = 8012
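The same back-reference approach can also be written with named captures, which some readers find easier to follow. This is a sketch only; the sub name parse_pair and the capture names are assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: split "prefix_name prefix_number" into its three parts.
# \k<prefix> is a back-reference to whatever the named group "prefix" captured.
sub parse_pair {
    my ($line) = @_;
    return unless $line =~ m{\b(?<prefix>\w+)_(?<name>\w+)\s+\k<prefix>_(?<number>\d+)\b};
    return ($+{prefix}, $+{name}, $+{number});
}

my ($prefix, $name, $number) = parse_pair('foo_bar_eighty_twelve foo_bar_8012');
print "Prefix = $prefix, Name = $name, Number = $number\n";
# prints: Prefix = foo_bar, Name = eighty_twelve, Number = 8012
```

The engine first tries the longest possible prefix and backtracks until the text after the whitespace starts with the same prefix.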
AFAIK, forward referencing is not a magic bullet that lets you swap the capture group and the reference.
I've looked at quite a few examples and I simply don't think you can do what you're trying to do using forward referencing.
I solved the issue by combining a back-reference with a look-ahead, like so:
/(?=.*=\s*([a-z]+))\1_(\w+) \s+ = \s+ \w+_\d+/ix
This works because the look-ahead initializes the first capture group ahead of the "actual" expression. For reference, this part is the look-ahead:
(?=.*=\s*([a-z]+))
and it's basically a kind of "sub-regex". The reason I use [a-z]+ is that \w+ includes the underscore, and I don't think that is what you wanted.

In Perl, what is the meaning of if (s/^\+//)?

In a Perl/Tk code I found a conditional statement as below
if (s/^\+//)
{
#do something
}
elsif (/^-/)
{
#do another thing
}
It seems some pattern matching is being done, but I cannot understand it. Can anyone help me understand this pattern matching?
They are both regular expressions. You can read up on them at perlre and perlretut. You can play around with them on http://www.rubular.com.
They both implicitly do something with $_. There probably is a while or foreach around your lines of code without a loop variable. In that case, $_ becomes that loop variable. It might for instance contain the current line of a file that is being read.
1. If the current value of $_ has a + (plus) sign as its first character, #do something.
2. Else if it starts with a - (minus) sign, #do another thing.
In case 1 it also replaces that + sign with nothing (i.e. removes it). It does not remove the - in case 2, however.
Let's look at an explanation with YAPE::Regex::Explain.
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/^\+/)->explain();
Here it is. Not really helpful in our case, but a nice tool nonetheless. Note that the (?-imsx and ) parts are the default things Perl implies. They are always there unless you change them.
The regular expression:
(?-imsx:^\+)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \+                       '+'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Update: As Mikko L in the comments pointed out, you should maybe refactor/change this piece of code. While it probably does what it is supposed to, I believe it would be a good idea to make it more readable. Whoever wrote it obviously didn't care about you as the later maintainer. I suggest you do. You could change it to:
# look at the content of $_ (current line?)
if ( s/^\+// )
{
# the line starts with a + sign,
# which we remove!
#do something
}
elsif ( m/^-/ )
{
# the line starts with a - sign
# we do NOT remove the - sign!
#do another thing
}
Those are regular expressions, used for pattern matching and substitution.
You should read up on the concept, but as for your question:
s/^\+//
If the string started with a plus, remove that plus (the "s" means "substitute"), and return true.
/^-/
True if the string starts with a minus.
This code is equivalent to
if ($_ =~ s/^\+//) { # s/// modifies $_ by default
#do something
}
elsif ($_ =~ m/^-/) { # m// searches $_ by default
#do another thing
}
s/// and m// are regexp quote-like operators. You can read about them in perlop.
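A minimal sketch that makes the side effect visible: s/// both tests and modifies $_, while a plain match only tests. The sub name classify and the sample strings are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the if/elsif chain on a private copy of the input.
sub classify {
    my ($text) = @_;
    for ($text) {                                    # alias $_ to our copy
        if    (s/^\+//) { return ('plus',  $text) }  # leading + was removed
        elsif (/^-/)    { return ('minus', $text) }  # leading - is still there
    }
    return ('plain', $text);
}

for my $item ('+verbose', '-quiet', 'file.txt') {
    my ($kind, $rest) = classify($item);
    print "$kind: $rest\n";
}
# prints:
# plus: verbose
# minus: -quiet
# plain: file.txt
```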
The other answers have given a summary of how the code works, but not much of why. Here is a simple example of why one might use such logic.
#!/usr/bin/env perl
use strict;
use warnings;
my $args = {};
for ( @ARGV ) {
if ( s/^no// ) {
$args->{$_} = 0;
} else {
$args->{$_} = 1;
}
}
use Data::Dumper;
print Dumper $args;
When you call the script like
./test.pl hi nobye
you get
$VAR1 = {
'hi' => 1,
'bye' => 0
};
The key is the string itself; however, if it is preceded by no, that prefix is removed (to get the key in question) and the value is set to 0 instead.
The OP's example is a little more involved, but follows the same logic.
if the key starts with a +, remove it and do something
if the key starts with a -, don't remove it and do something else

Can't get the Perl regex to work

My perl is getting rusty. It only prints "matched=" but $1 is blank!?!
EDIT 1: WHo the h#$! downvoted this? There are no wrong questions. If you dont like it, move on to next one!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([.\n\r]+)/gsi) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
EDIT 2: This is the code fragment with updated regex, works great!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([\s\S]+)/gsi) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
EDIT 3: Haha, i see perl police strikes yet again!!!
I don't know if this is your exact problem, but inside square brackets, '.' is just looking for a period. I didn't see a period in the input, so I wondered which you meant.
Aside from the period, the rest of the character class is looking for consecutive whitespace. And as you didn't use the multiline switch, you've got newlines being counted as whitespace (and any character), but no indication to scan beyond the first record separator. But because of the way that you print it out, it also gives some indication that you meant more than the literal period, as mentioned above.
Axeman is correct; your problem is that . in a character class doesn't do what you expect.
By default, . outside a character class (and not backslashed) matches any character but a newline. If you want to include newlines, you specify the /s flag (which you seem to already have) on your regex or put the . in a (?s:...) group:
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/((?s:.+))/) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
. in a character class is a literal period, not "match any character". What you really want is /(.+)/s. The /g flag says to match multiple times, but you are using the regex in scalar context, so it will only match the first item. The /i flag makes the regex case-insensitive, but there are no characters with case in your regex. The /s flag makes . match newlines, and . always matches "\r", so instead of [.\n\r] you can just use a plain . (dot).
However, /(.+)/s will match any string with one or more characters, so you would be better off with
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if (length $crazy) {
print "matched=$crazy\n";
} else {
print "not matched!\n";
}
It is possible you meant to do something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
while ($crazy =~ /(.+)[\r\n]+/g) {
print "matched=$1\n";
}
But that would probably be better phrased:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
for my $part (split /[\r\n]+/, $crazy) {
print "matched=$part\n";
}
$1 contains only whitespace; that's why you don't see it in a print like that. Just add something after it or quote it.
Example:
perl -E "qq'abcd\r\nallo\nXYZ\n\n\nQQQ'=~/([.\n\r]+)/gsi;say 'got(',length($1),qq') >$1<';"
got(2) >
<
Updated for your comments:
To match everything you can simply use /(.+)/s
[.] (dot inside a character class) does not mean "match any character", it just means match the literal . character. So in an input string without any dots,
m/([.\n\r]+)/gsi
will just match strings of \n and \r characters.
With the /s modifier, you are already asking the regex engine to include newlines with . (match any character), so you could just write
m/(.+)/gsi
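The following sketch puts the three variants side by side on the string from the question and prints how much each one captures. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";

my ($class) = $crazy =~ /([.\n\r]+)/;   # literal dot, LF or CR only
my ($dot)   = $crazy =~ /(.+)/;         # any character except "\n"
my ($dot_s) = $crazy =~ /(.+)/s;        # any character at all

printf "%-12s length %d\n", '[.\n\r]+',   length $class;   # 2  ("\r\n")
printf "%-12s length %d\n", '.+',         length $dot;     # 5  ("abcd\r")
printf "%-12s length %d\n", '.+ with /s', length $dot_s;   # 20 (the whole string)
```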

How can I find repeated letters with a Perl regex?

I am looking for a regex that will find repeating letters. So any letter twice or more, for example:
booooooot or abbott
I won't know the letter I am looking for ahead of time.
This is a question I was asked in interviews and have since asked in interviews myself. Not many people get it right.
You can find any letter, then use \1 to find that same letter a second time (or more). If you only need to know the letter, then $1 will contain it. Otherwise you can concatenate the second match onto the first.
my $str = "Foooooobar";
$str =~ /(\w)(\1+)/;
print $1;
# prints 'o'
print $1 . $2;
# prints 'oooooo'
I think you actually want this rather than \w, as \w also includes digits and the underscore:
([a-zA-Z])\1+
OK, OK, I can take a hint, Leon. Use this for Unicode text or POSIX-style classes:
([[:alpha:]])\1+
I think using a backreference would work:
(\w)\1+
\w is basically [a-zA-Z_0-9] so if you only want to match letters between A and Z (case insensitively), use [a-zA-Z] instead.
(EDIT: or, like Tanktalus mentioned in his comment (and as others have answered as well), [[:alpha:]], which is locale-sensitive)
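As a sketch, the back-reference idea can be wrapped in a helper that collects every repeated run in a string. The sub name repeated_runs is hypothetical, and the match is case-sensitive, so "Aa" does not count as a repeat.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of two or more identical letters.
sub repeated_runs {
    my ($str) = @_;
    my @runs;
    while ($str =~ /(([[:alpha:]])\2+)/g) {
        push @runs, $1;    # $1 is the whole run, $2 the single letter
    }
    return @runs;
}

print join(',', repeated_runs('abbott')), "\n";   # bb,tt
print join(',', repeated_runs('a11b')), "\n";     # empty line: digits are not letters
```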
Use \1, \2 and so on (a backslash followed by the group number) to refer to previous groups:
/(\w)\1+/g
You might want to take care as to what is considered to be a letter, and this depends on your locale. Using ISO Latin-1 will allow accented Western language characters to be matched as letters. In the following program, the default locale doesn't recognise é, and thus créé fails to match. Uncomment the locale setting code, and then it begins to match.
Also note that \w includes digits and the underscore character along with all the letters. To get just the letters, you need to take the complement of the non-alphanum, digits and underscore characters. This leaves only letters.
That might be easier to understand by framing it as the question:
"What regular expression matches any digit except 3?"
The answer is:
/[^\D3]/
#! /usr/local/bin/perl
use strict;
use warnings;
# uncomment the following three lines:
# use locale;
# use POSIX;
# setlocale(LC_CTYPE, 'fr_FR.ISO8859-1');
while (<DATA>) {
chomp;
if (/([^\W_0-9])\1+/) {
print "$_: dup [$1]\n";
}
else {
print "$_: nope\n";
}
}
__DATA__
100
food
créé
a::b
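The double-negative character class used above can be checked in isolation. This sketch filters two small made-up lists, first with the "any digit except 3" class and then with the letters-only class from the program:

```perl
use strict;
use warnings;

# [^\D3] reads as "not a non-digit and not 3", i.e. any digit except 3.
my @kept = grep { /^[^\D3]$/ } 0 .. 9, 'a', '_';
print "@kept\n";      # 0 1 2 4 5 6 7 8 9

# Same trick with \W: "not a non-word character, not _, not a digit" leaves letters.
my @letters = grep { /^[^\W_0-9]$/ } 'a', 'Z', '7', '_', '-';
print "@letters\n";   # a Z
```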
The following code will return all the characters that repeat two or more times:
my $str = "SSSannnkaaarsss";
print $str =~ /(\w)\1+/g;
Just for kicks, a completely different approach:
if ( ($str ^ substr($str,1) ) =~ /\0+/ ) {
print "found ", substr($str, $-[0], $+[0]-$-[0]+1), " at offset ", $-[0];
}
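To see why the XOR approach works, here is a sketch that wraps it in a sub. XOR-ing the string with itself shifted by one character yields a NUL byte wherever two neighbours are equal. The sub name first_run_xor is made up, and the sketch assumes plain ASCII input that contains no NUL bytes of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: first run of repeated characters, found via string XOR.
# Assumes a plain ASCII/byte string; string XOR works on the underlying bytes.
sub first_run_xor {
    my ($str) = @_;
    # Equal neighbours XOR to "\0"; the shorter operand is padded with NULs.
    my $diff = $str ^ substr($str, 1);
    return unless $diff =~ /\0+/;
    # A run of n NULs means n + 1 equal characters starting at the same offset.
    return substr($str, $-[0], $+[0] - $-[0] + 1);
}

print first_run_xor('abbott'), "\n";   # bb
print first_run_xor('xaaay'), "\n";    # aaa
```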
FYI, aside from RegExBuddy, a real handy free site for testing regular expressions is RegExr at gskinner.com. Handles ([[:alpha:]])(\1+) nicely.
How about:
(\w)\1+
The first part makes an unnamed group around a character, then the back-reference looks for that same character.
I think this should also work:
((\w)(?=\2))+\2
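/(.)\1{2,}+/u
The /u modifier enables Unicode matching rules. Note that {2,} after the back-reference requires at least three identical characters in a row; use \1+ to catch doubled characters as well.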