Regex to match certain words but not others - regex

I am working with regexes in perl, and I'm trying to make a regex that finds two words where one ends with d and the next word starts with p (but not ph). Here is my regex, which works:
d\s(p[^h])
However, I'd also like this to exclude the word "and" (but only within this pattern) so I have tried to use a negative lookahead, so my code looks like this:
if ($text =~ m{d\s(p[^h])} && $text =~ m{(?:(?!\sand\s))}) {
print "Yes\n";
} else {
}
However, this doesn't seem to work.
Here are some sample input/output:
sand pet -> yes
sand phone -> no
go and pet -> no
sand pet and -> yes
Any help with this is greatly appreciated!

You can accomplish what you need with a single regex:
/(?<!\ban)d\s(p[^h]\w+)/
Where:
\b is the word-boundary anchor. It consumes no characters, but ensures that the excluded word is and and not, say, sand. It matches between a \w (word character: [a-zA-Z0-9_]) and a \W (non-word character), and also at the start or end of the string when a word character is adjacent.
(?<!\ban)d matches a d that is not preceded by a standalone an. It is almost equivalent to (?<!\Wan), except that \b also matches at the start of the string.
If you don't need to extract the first and second word separately, you can drop the capturing group and add some tolerance (one or more whitespace characters between the words):
if ( $input =~ m/(?<!\ban)d\s+p(?!h)/ ) {
    print "Yes\n";
}
else {
    print "No\n";
}
Note: this regex actually searches for a d (not preceded by a standalone an), separated by one or more whitespace characters from a p that is not followed by an h. It says nothing about whole words. If you want to make sure both words are longer than one character, you can add a leading and a trailing \w+.
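As a quick check, the second pattern can be run against the sample inputs from the question. This is only a sketch: the helper name matches_pair is made up for illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains a "d" that is not
# preceded by a standalone "an", followed by whitespace and a "p"
# that is not followed by "h".
sub matches_pair {
    my ($text) = @_;
    return $text =~ /(?<!\ban)d\s+p(?!h)/ ? 1 : 0;
}

# Sample inputs taken from the question.
for my $text ('sand pet', 'sand phone', 'go and pet', 'sand pet and') {
    printf "%s -> %s\n", $text, matches_pair($text) ? 'yes' : 'no';
}
```

Wrapping the match in a small sub keeps the pattern in one place, which makes it easier to tighten later (for example by adding \w+ on either side).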

You're getting too complicated. That negative lookahead is applied to the string, and matches against any substring. So it will match any substring which doesn't contain \sand\s which is always going to work, because zero length substrings are 'ok'.
You can see this at work by turning on debugging:
#!/usr/bin/env perl
use strict;
use warnings;
use re 'debug';
while ( <DATA> ) {
print if m{(?:(?!\sand\s))};
}
__DATA__
sand pet
sand phone
go and pet
sand pet and
empty
That lookahead is used with another pattern to say 'match this, but only if this is (or isn't) next'.
So something like:
m{d\s(p[^h])} and not m{\sand\s};
Might do what you want - or alternatively, just break it down into phases:
#!/usr/bin/env perl
use strict;
use warnings;
#use re 'debug';
while (<DATA>) {
my ($capture) = m{d\s(p[^h])};
if ( $capture and not $capture =~ m/\sand\s/ ) {
print $capture, " => ", $_, "\n";
}
}
__DATA__
sand pet
sand phone
go and pet
sand pet and
empty

It's often inappropriate to try to get everything working in a single regex. This program has a subroutine ok_words that checks a pair of words to see if your criteria apply. The calling code takes every pair of adjacent words in the string and prints yes if the test is true for any pair, otherwise no.
These are your tests, together with the Perl code that checks for them:
The first ends in a d: /d\z/
...but isn't and: ne 'and'
The second starts with a p, but not ph: /\Ap(?!h)/
And this is the program that applies them
use strict;
use warnings 'all';
use List::MoreUtils qw/ any /;
while ( <DATA> ) {
chomp;
my @w = split;
if ( any { ok_words( $w[$_], $w[$_+1] ) } 0 .. $#w-1 ) {
print "$_ -> yes\n";
}
else {
print "$_ -> no\n";
}
}
sub ok_words {
my ($this, $next) = map lc, @_;
$this =~ /d\z/ and $this ne 'and' and $next =~ /\Ap(?!h)/;
}
__DATA__
sand pet
sand phone
go and pet
sand pet and
output
sand pet -> yes
sand phone -> no
go and pet -> no
sand pet and -> yes

Related

Regular expression to match exactly and only n times

If I have the lines:
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
'asfdcacttaskdfjcacttklasdjf'
'cksjdfcacttlkasdjf'
I want to match them by the number of times a repeating subunit (cactt) occurs. In other words, if I ask for n repeats, I want matches that contain n and ONLY n instances of the pattern.
My initial attempt was implemented in perl and looks like this:
sub MATCHER {
print "matches with $_ CACTT's\n";
my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
my @grep_matches = grep(/$pattern/, @matching);
print "$_\n" for @grep_matches;
my @copy = @grep_matches;
my $squashed = @copy;
print "number of rows total: $squashed\n";
}
for (2...6) {
MATCHER($_);
}
Notes:
@matching is an array containing the three strings above.
the for loop is set from integers 2-6 because I have a separate regex that works to forbid duplicate occurrences of the pattern.
This loop ALMOST works except that for n=2, matches containing 3 occurrences of the "cactt" pattern are returned. In fact, for any n>=2, lines with n+1 occurrences are also returned by the match. I thought the negative lookahead would prevent this behavior in Perl. If anyone could give me thoughts, I would be appreciative.
Also, I have thought of getting a count per line and separating them by count; I dislike the approach because it requires two steps when one should accomplish what I want.
I would be okay with a:
foreach (@matches) { $_ =~ /$pattern/; push(@selected_by_n, $1);}
The regex seems like it should be similar, but for whatever reason in practice the results differ dramatically.
Thanks in advance!
Your code is sort of strange. This regex
my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
...tries to match first the beginning of the string ^, then a minimal match of any characters .*?, followed by your sequence CACTT, followed by another minimal match .+? (which, unlike .*?, needs at least one character). And you want to match this group $_ times. You assume $_ will be correct when calling the sub (this is bad). Then you have a look-ahead assertion that wants to make sure there is no minimal match of any characters .*? followed by your sequence, and finally any characters of any length .* followed by end of line $.
First off, this is always redundant: ^.*. Beginning of line anchor followed by any character any number of times. This actually makes the anchor useless. Same goes for .*$. Why? Because any match that will occur, will occur anyway at the first possible time. And .*$ matches exactly the same thing that the empty string does: Anything.
For example: the regex /^.*?foo.*?$/ matches exactly the same thing as /foo/. (Excluding cases of multiline matching with strings that contain newlines).
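The equivalence claimed above is easy to check on a few single-line strings. The following is a small sketch; the sub name both_agree and the sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the padded, anchored pattern and the
# bare pattern give the same answer for a single-line string.
sub both_agree {
    my ($str) = @_;
    my $padded = $str =~ /^.*?foo.*?$/ ? 1 : 0;
    my $bare   = $str =~ /foo/         ? 1 : 0;
    return $padded == $bare ? 1 : 0;
}

for my $str ('foo', 'xxfooyy', 'fo', 'bar foo baz', '') {
    printf "'%s': %s\n", $str, both_agree($str) ? 'same result' : 'DIFFERENT';
}
```

As noted above, this only holds for strings without embedded newlines, since . does not match a newline by default.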
In your case, if you want to count the occurrences of a string inside a string, you can just match them like this:
my $count = () = $str =~ /CACTT/gi;
This code:
my @copy = @grep_matches;
my $squashed = @copy;
Is completely redundant. You can just do my $squashed = @grep_matches. It makes little to no sense to first copy the array.
This code:
MATCHER($_);
Does the same as this: MATCHER("foo") or MATCHER(3.1415926536). You are not using the subroutine argument, you are ignoring it, and relying on the fact that $_ is global and visible inside the sub. What you want to do is
sub MATCHER {
my $number = shift; # shift argument from @_
Now you have encapsulated the code and all is well.
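Putting these points together, a sub that takes its inputs as arguments could look roughly like the sketch below. The name lines_with_count is an assumption made for illustration, not the asker's code; it uses the counting idiom shown earlier rather than one large regex.

```perl
use strict;
use warnings;

# Hypothetical replacement for MATCHER: the wanted count and the lines
# are passed in explicitly instead of being read from $_ and a global.
sub lines_with_count {
    my ($n, @lines) = @_;
    # "() = /.../g" evaluated in scalar context gives the match count.
    return grep { ( () = /cactt/gi ) == $n } @lines;
}

my @matching = (
    'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf',
    'asfdcacttaskdfjcacttklasdjf',
    'cksjdfcacttlkasdjf',
);

for my $n (1 .. 3) {
    my @hits = lines_with_count($n, @matching);
    print "$n occurrence(s): @hits\n";
}
```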
What you want to do in your case, I assume, is to count the occurrences of the substring inside your strings, then report them. I would do something like this
use strict;
use warnings;
use Data::Dumper;
my %data;
while (<DATA>) {
chomp;
my $count = () = /cactt/gi; # count number of matches
push @{ $data{$count} }, $_; # store count and original
}
print Dumper \%data;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
This will print
$VAR1 = {
'2' => [
'asfdcacttaskdfjcacttklasdjf'
],
'3' => [
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
],
'1' => [
'cksjdfcacttlkasdjf'
]
};
This is just to demonstrate how to create the data structure. You can now access the strings in the order of matches. For example:
for (@{ $data{3} }) { # print strings with 3 matches
print;
}
Would you just do something like this:
use warnings;
use strict;
my $n=2;
my $match_line_cnt=0;
my $line_cnt=0;
while (<DATA>) {
my $m_cnt = () = /cactt/g;
if ($m_cnt>=$n){
print;
$match_line_cnt++;
}
$line_cnt++;
}
print "total lines: $line_cnt\n";
print "matched lines: $match_line_cnt\n";
print "squashed: ",$line_cnt-$match_line_cnt;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
prints:
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
total lines: 3
matched lines: 2
squashed: 1
I think you're unintentionally asking two separate questions.
If you want to directly capture the number of times a pattern matches in a string, this one liner is all you need.
$string = 'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf';
$pattern = qr/cactt/;
print $count = () = $string =~ m/$pattern/g;
-> 3
That last line is as if you had written $count = @junk = $string =~ m/$pattern/g; but without needing an intermediate array variable. () = is the null list assignment and it throws away whatever is assigned to it just like scalar undef = throws away its right hand side. But, the null list assignment still returns the number of things thrown away when its left hand side is in scalar context. It returns an empty list in list context.
If you want to match strings that only contain some number of pattern matches, then you want to stop matching once too many are found. If the string is large (like a document) then you would waste a lot of time counting past n.
Try this.
sub matcher {
my ($string, $pattern, $n) = @_;
my $c = 0;
while ($string =~ m/$pattern/g) {
$c++;
return if $c > $n;
}
return $c == $n ? 1 : ();
}
Now there is one more option but if you call it over and over again it gets inefficient. You can build a custom regex that matches only n times on the fly. If you only build this once however, it's just fine and speedy. I think this is what you originally had in mind.
$regex = qr/^(?:(?:(?!$pattern).)*$pattern){$n}(?:(?!$pattern).)*$/;
I'll leave the rest of that one to you. Check for n > 1 etc. The key is understanding how to use lookahead. You have to match all the NOT THINGS before you try to match THING.
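One way the remaining pieces might be filled in is sketched below. The helper name exactly_n_regex is hypothetical, and the sketch uses \A and \z instead of ^ and $ so that a trailing newline is not silently ignored; treat it as an illustration of the lookahead technique rather than a finished solution.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches a string only if it
# contains exactly $n non-overlapping occurrences of $pattern.
sub exactly_n_regex {
    my ($pattern, $n) = @_;
    # A run of characters at none of which $pattern starts.
    my $gap = qr/(?:(?!$pattern).)*/s;
    # n times (gap + pattern), then one last gap up to the very end.
    return qr/\A(?:$gap$pattern){$n}$gap\z/;
}

my $two = exactly_n_regex(qr/cactt/i, 2);

my @lines = (
    'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf',
    'asfdcacttaskdfjcacttklasdjf',
    'cksjdfcacttlkasdjf',
);
print "$_\n" for grep { $_ =~ $two } @lines;
```

Building the regex once and reusing it, as here, avoids the recompilation cost mentioned above.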
https://perldoc.perl.org/perlre

Perl regex exclude optional word from match

I have strings and need to extract only the icnnumber/number values from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract below data from above example.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but it only works for the first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I write an expression that matches all three lines?
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
m/number:([A-Z0-9]+)/;
print "$1\n";
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative to get only the values as a match using \K to reset the match buffer
\b(?:icn)?number:\K[^\W_]+
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
   ^^^^^^^^
Pattern details
\b - a word boundary (immediately to the right, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
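To see both groups in action, here is a minimal sketch that applies the last pattern to the sample lines from the question. The sub name parse_line is an assumption introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (key, value) on success and an empty
# list when the line does not match.
sub parse_line {
    my ($line) = @_;
    return $line =~ /\b((?:icn)?number):([^\W_]+)/;
}

for my $line ('icnnumber:9876AB54321_IN', 'number:987654321FR', 'icnnumber:987654321YQ') {
    # A list assignment in boolean context is true only if the match succeeded.
    if ( my ($key, $value) = parse_line($line) ) {
        print "$key => $value\n";
    }
}
```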
Just another suggestion: if your strings are always valid, you could simply split on a character class and take the second element of the resulting list:
my $string= "number:987654321FR";
my @part = (split /[:_]/, $string)[1];
print @part;
Or for the whole array of strings:
my @Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (@Array)
{
my $el = (split /[:_]/, $_)[1];
print "$el\n"
}
Results in:
9876AB54321
987654321FR
987654321YQ
The regular expression can treat 'icn' as optional; the part of interest is the 11 characters after the colon.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
You already have good answers here, but here is another way to approach it.
Read the whole input into one string:
my $str = do { local $/; <DATA> }; #print $str;
Capture everything after number: up to an _ or a word boundary:
@arrs = ($str=~m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
Alternatively, capture only characters that are neither non-word characters (\W) nor _, and collect the matches in the array:
@arrs = ($str=~m/number\:([^\W\_]+)(?:\_|\b)/ig);
Print the output:
print join "\n", @arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger")
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, there will inevitably be instances where a comma is immediately followed by a space and then another word, as that is the correct way to use commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
                                                     ^        ^
                                                     |        |
                                     note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have a tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
push(@matches, split(/,/, $&));
}
print join("\n",@matches),"\n";
Though you can probably construct a single regex, a combination of regexes, split, grep and map looks decent:
my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;
Going from right to left:
Split the line on spaces (split)
Keep only elements that have no comma at either end but have one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
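The chain above operates on $_; a self-contained sketch that takes the text as an argument might look like this. The sub name comma_lists is made up for the example, and the sample strings come from the question.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the split/grep/map chain described above.
sub comma_lists {
    my ($text) = @_;
    return map  { split /,/ }                   # 3. split each list into items
           grep { !/^,/ && !/,$/ && /,/ }       # 2. keep tokens with inner commas only
           split ' ', $text;                    # 1. split the text on whitespace
}

my @array = comma_lists("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig");
print join("|", @array), "\n";

# Ordinary prose, where every comma is followed by a space, yields nothing.
my @none = comma_lists("the only ringed planet, however, this is now known");
print scalar(@none), " items\n";
```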
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my @matches = map {
$_ =~ /$regex/;
if ($1) {
my $comma_sep_list = $1;
[split ',', $comma_sep_list];
}
else {
[]
}
} @strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push (@arr, $&);
}
The regular expression matches three cases:
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig

Perl regex - can I say 'if character/string matches, delete it and all to right of it'?

I have an array of strings, some of which contain the character '-'. I want to be able to search for it and for those strings that contain it I wish to delete all characters to the right of it.
So for example if I have:
$string1 = 'home - London';
$string2 = 'office';
$string3 = 'friend-Manchester';
or something as such, then the affected strings would become:
$string1 = 'home';
$string3 = 'friend';
I don't know if the white-space before the '-' would be included in the string afterwards (I don't want it as I will be comparing strings at a later point, although if it doesn't affect string comparisons then it doesn't matter).
I do know that I can search and replace specific strings/characters using something like:
$string1 =~ s/-//
or
$string1 =~ tr/-//
but I'm not very familiar with regular expressions in Perl so I'm not 100% sure of these. I've looked around and couldn't see anything to do with 'to the right of' in regex. Help appreciated!
You can delete anything after a hyphen - with this substitution:
s/-.*$//s
However, you will want to remove the whitespace prior to the hyphen and thus do
s/\s* - .* $//xs
The $ anchors the regex at the end of the string and the /s flag allows the dot to match newlines as well. While the $ is superfluous, it might add clarity.
Your substitution would just have removed the first -, and your transliteration would have removed all hyphens from the string.
Your regular expressions are just searching for the dash, so that's all they replace. You want to search for the dash, and anything after it.
$string =~ s/-.*//;
. represents any character, * means search for that character 0 or more times, and match as many as possible (i.e. to the end of the string if possible)
You can also search for an optional space before it.
$string =~ s/\s?-.*//;
(\s matches any whitespace character, not just a literal space, and is easier to read than a bare space in a pattern)
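Applied to the three strings from the question, the substitution can be sketched as follows. The sub name strip_after_dash is hypothetical, and the sketch uses \s* (any amount of whitespace before the hyphen) plus the /s flag, which is a slight variation on the patterns shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the first hyphen, any whitespace directly
# before it, and everything after it. Strings without a hyphen are
# returned unchanged.
sub strip_after_dash {
    my ($string) = @_;
    $string =~ s/\s*-.*//s;
    return $string;
}

for my $s ('home - London', 'office', 'friend-Manchester') {
    printf "'%s' -> '%s'\n", $s, strip_after_dash($s);
}
```

Working on a copy inside the sub leaves the caller's variable untouched, which is convenient if the original strings are still needed for display.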
Using plain substr() and index() is possible as well.
my @strings = ("we are - so cool",
"lonely",
"friend-Manchester",
"home - london",
"home-new york",
"home with-childeren-first episode");
local $/ = " ";
foreach (@strings) {
$_ = substr($_,0,index($_,'-')) if (index($_,'-') != -1);
chomp;
}
The other answers are good. However, in light of what you said:
...if it doesn't affect string comparisons then it doesn't matter
You don't need a separate step for this at all. Suppose you want to compare $string with another variable, $search_string. The following expression will check for an exact match, except that it ignores anything $string has after a dash (\Q...\E keeps any regex metacharacters in $search_string literal):
if ($string =~ /^\Q$search_string\E(\s*-|$)/) { print "Strings matched"; }
#Using Regex:
my @strings =
("we are - so cool",
"lonely",
"friend - Manchester",
"home - london",
"home - new york",
"home with-childeren-first episode"
);
foreach (@strings) {
$_ =~ s/-\s*[a-zA-Z ]+\s*//g;
print "NEW: ".$_."\n";
}

Regex to match vowels in order as they appear once

I am trying to figure out a correct regex to match vowels in a word as they appear.
There may be any number of consonants in the word; however, there should be no other vowels other than the 5 listed above. For example, the word “sacrilegious” should not match because, although it contains the five vowels in alphabetical order, there is an extra ‘i’ vowel located between the ‘a’ and the ‘e’. Hyphenated words are not allowed. In fact, your regular expression should not match any ‘word’ that contains any character other than the upper and lower case letters.
Here are some words that it should match
abstemious
facetious
arsenious
acheilous
anemious
caesious
This is what I have come up with so far, but when I run the program it doesn't seem to be doing what it should do.
#!/usr/bin/perl -w
use strict;
my $test = "abstemious";
if( $test =~ /a[^eiou]*e[^aiou]*i[^aeou]*o[^aeiu]u/ )
{
print "yes";
}
You have a small typo there, try this:
/a[^eiou]*e[^aiou]*i[^aeou]*o[^aeio]*u/
You're not too far off. Try this regex:
/\b[b-df-hj-np-tv-z]*a[b-df-hj-np-tv-z]*e[b-df-hj-np-tv-z]*i[b-df-hj-np-tv-z]*o[b-df-hj-np-tv-z]*u[b-df-hj-np-tv-z]*\b/i
[b-df-hj-np-tv-z]* matches any number of consonants, then it's just slotting in the vowels and the word boundary markers (\b).
Try this regexp:
\b(?=[a-zA-Z]+)[^aA]*?[aA][^aeAE]*?[eE][^aeiAEI]*?[iI][^aeioAEIO]*?[oO][^aeiouAEIOU]*?[uU][^aeiouAEIOU]*?\b
Is it okay for a word to contain duplicate vowels, as long as they're in the correct order? For example, would faaceetiioouus (if there were such a word) be acceptable? I ask because your current regex does indeed match it.
If you want to match only words that contain exactly one of each vowel, try this:
/^
(?=[a-z]+$)
[^aeiou]*a
[^aeiou]*e
[^aeiou]*i
[^aeiou]*o
[^aeiou]*u
[^aeiou]*
$
/ix
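A runnable sketch of this idea is shown below. The sub name has_vowels_once_in_order is invented for the example; it restricts words to letters only, as the question requires, and uses \z in place of $ so a trailing newline is not accepted.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the word consists only of letters and
# contains each of a, e, i, o, u exactly once, in that order.
sub has_vowels_once_in_order {
    my ($word) = @_;
    return $word =~ /^
        (?=[a-z]+\z)    # letters only, all the way to the end
        [^aeiou]*a      # consonants, then the single a
        [^aeiou]*e
        [^aeiou]*i
        [^aeiou]*o
        [^aeiou]*u
        [^aeiou]*       # trailing consonants
        \z
    /ix ? 1 : 0;
}

for my $word (qw(abstemious facetious sacrilegious faaceetiioouus)) {
    printf "%s: %s\n", $word, has_vowels_once_in_order($word) ? 'match' : 'no match';
}
```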
Here is another approach, if you don't mind making a temporary copy of the string:
use warnings;
use strict;
while (<DATA>) {
chomp;
my $test = $_;
my $cp = $test; # leave original string intact
$cp =~ tr/aeiou//cd;
print "$test\n" if $cp eq 'aeiou';
}
=for output
abstemious
facetious
arsenious
acheilous
anemious
caesious
=cut
__DATA__
abstemious
facetious
arsenious
acheilous
anemious
caesious
unabstemious
sacrilegious
intravenous
faaceetiioouus