How can I find repeated letters with a Perl regex?

I am looking for a regex that will find repeating letters. So any letter twice or more, for example:
booooooot or abbott
I won't know the letter I am looking for ahead of time.
This is a question I was asked in an interview and have since asked in interviews myself. Not many people get it right.

You can find any letter, then use \1 to find that same letter a second time (or more). If you only need to know the letter, then $1 will contain it. Otherwise you can concatenate the second match onto the first.
my $str = "Foooooobar";
$str =~ /(\w)(\1+)/;
print $1;
# prints 'o'
print $1 . $2;
# prints 'oooooo'

I think you actually want this rather than the "\w" as that includes numbers and the underscore.
([a-zA-Z])\1+
Ok, ok, I can take a hint Leon. Use this for the unicode-world or for posix stuff.
([[:alpha:]])\1+
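As a rough sketch of how the letter-only pattern might be used to collect every repeated run rather than just the first one: the helper name repeated_runs and the sample string below are made up for illustration, and an extra outer group is added so the whole run is captured.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of a repeated ASCII letter in $str.
sub repeated_runs {
    my ($str) = @_;
    my @runs;
    # Outer group $1 is the whole run; inner group \2 is the repeated letter.
    while ($str =~ /(([a-zA-Z])\2+)/g) {
        push @runs, $1;
    }
    return @runs;
}

print join(",", repeated_runs("boot abbott 1100")), "\n";   # prints oo,bb,tt
```

The digits in "1100" are skipped because the class is [a-zA-Z] rather than \w.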

I think using a backreference would work:
(\w)\1+
\w is basically [a-zA-Z_0-9] so if you only want to match letters between A and Z (case insensitively), use [a-zA-Z] instead.
(EDIT: or, like Tanktalus mentioned in his comment (and as others have answered as well), [[:alpha:]], which is locale-sensitive)

Use \N, where N is the group number (for example \1), to refer to previous groups:
/(\w)\1+/g

You might want to take care as to what is considered to be a letter, and this depends on your locale. Using ISO Latin-1 will allow accented Western language characters to be matched as letters. In the following program, the default locale doesn't recognise é, and thus créé fails to match. Uncomment the locale setting code, and then it begins to match.
Also note that \w includes digits and the underscore character along with all the letters. To get just the letters, you need to take the complement of the non-alphanum, digits and underscore characters. This leaves only letters.
That might be easier to understand by framing it as the question:
"What regular expression matches any digit except 3?"
The answer is:
/[^\D3]/
#! /usr/local/bin/perl
use strict;
use warnings;
# uncomment the following three lines:
# use locale;
# use POSIX;
# setlocale(LC_CTYPE, 'fr_FR.ISO8859-1');
while (<DATA>) {
    chomp;
    if (/([^\W_0-9])\1+/) {
        print "$_: dup [$1]\n";
    }
    else {
        print "$_: nope\n";
    }
}
__DATA__
100
food
créé
a::b
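The double-negative trick described in this answer can be checked on a few single characters. This is only a small sketch; the sample characters are arbitrary and ASCII-only, so locale handling is not exercised here.

```perl
use strict;
use warnings;

# [^\D3]: "not a non-digit and not 3", i.e. any digit except 3.
my @digits = grep { /^[^\D3]$/ } 0 .. 9;

# [^\W_0-9]: "not a non-word character, not underscore, not a digit", i.e. a letter.
my @letters = grep { /^[^\W_0-9]$/ } ('a', 'Z', '_', '7', '-', 'e');

print "@digits\n";    # prints 0 1 2 4 5 6 7 8 9
print "@letters\n";   # prints a Z e
```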

The following code prints each character that is repeated two or more times in a row:
my $str = "SSSannnkaaarsss";
print $str =~ /(\w)\1+/g;

Just for kicks, a completely different approach:
if ( ($str ^ substr($str,1) ) =~ /\0+/ ) {
print "found ", substr($str, $-[0], $+[0]-$-[0]+1), " at offset ", $-[0];
}
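To make the XOR idea easier to follow, here is the same approach wrapped in a hypothetical helper (first_repeat is an invented name) with comments on each step. It assumes a non-empty string of single-byte characters.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (run, offset) for the first repeated
# character in $str, or an empty list if nothing repeats.
sub first_repeat {
    my ($str) = @_;
    # XOR the string with itself shifted left by one character:
    # equal neighbours produce a NUL byte ("\0") at that position.
    my $xor = $str ^ substr($str, 1);
    return unless $xor =~ /\0+/;
    # A run of n NULs means n+1 equal characters in the original string.
    my $run = substr($str, $-[0], $+[0] - $-[0] + 1);
    return ($run, $-[0]);
}

my ($run, $offset) = first_repeat("abbbc");
print "found $run at offset $offset\n";   # prints found bbb at offset 1
```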

FYI, aside from RegExBuddy, a real handy free site for testing regular expressions is RegExr at gskinner.com. Handles ([[:alpha:]])(\1+) nicely.

How about:
(\w)\1+
The first part makes an unnamed group around a character, then the back-reference looks for that same character.

I think this should also work:
((\w)(?=\2))+\2

/(.)\1+/u
The 'u' modifier enables Unicode matching rules. Note that (.) matches any character, not just letters.

Related

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                       Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
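The difference Note 1 describes can be seen by running the same match in both contexts. This is a small sketch using the sentence from the question; the variable name $matched is just for illustration.

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar context: the match returns a true/false value, not the capture.
my $matched = ( $words =~ /anywhere\s+(\S+)/ );

# List context: the match returns the list of captured groups.
my ($w_after) = ( $words =~ /anywhere\s+(\S+)/ );

print "scalar context: $matched\n";   # prints scalar context: 1
print "list context: $w_after\n";     # prints list context: except
```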
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding operator to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the \w word character class:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using the x modifier so whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as the equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables, like (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex should match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ captures the word between two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);

Use Perl to check if a string has only English characters

I have a file with submissions like this
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
I am stripping everything but the song name by using this regex.
$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$#%#\\|]//g;
I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would be the first song title, Ai Wo Qing shut up, and not the next one because of the è.
I have tried this
if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
    print $line;
}
else {
    print "Non-english\n";
}
I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.
Following from the comments, your problem would appear to be:
$line =~ m/[^a-zA-z0-9_]*$/
Specifically - the ^ is inside the brackets, which means that it's not acting as an 'anchor'. It's actually a negation operator.
See: http://perldoc.perl.org/perlrecharclass.html#Negation
It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".
But the important part is - that without the 'start of line' anchor, your regular expression is zero-or-more instances (of whatever), so will match pretty much anything - because it can freely ignore the line content.
(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).
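For comparison, a minimal sketch of an anchored, non-negated version of the check. The helper name is_ascii_title is invented, and allowing a space inside the class is an assumption made so that multi-word titles pass; adjust the class to whatever counts as "English" for your data.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $title holds only ASCII letters, digits,
# underscores and spaces (the space is an assumption for multi-word titles).
sub is_ascii_title {
    my ($title) = @_;
    # \A and \z anchor the class to the whole string; + rejects empty input.
    return $title =~ /\A[a-zA-Z0-9_ ]+\z/ ? 1 : 0;
}

for my $title ("Ai Wo Qing shut up", "Ach\x{e8}te-moi") {
    print is_ascii_title($title) ? "ok\n" : "Non-english\n";
}
```

The second title is rejected both for the è and for the hyphen, since neither is in the class.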
It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.
It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
    chomp;
    my @fields = split /<SEP>/;
    say $fields[3];
}
__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
output
Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
Also, for ASCII text the word character class \w matches exactly [a-zA-Z0-9_] (and \W matches the complement), so you can rewrite your if statement like this. Note that \W also matches spaces and punctuation, so strip or allow those first.
if ( $line =~ /\W/ ) {
print "Non-English\n";
}
else {
print $line;
}

Can't get the Perl regex to work

My perl is getting rusty. It only prints "matched=" but $1 is blank!?!
EDIT 1: WHo the h#$! downvoted this? There are no wrong questions. If you dont like it, move on to next one!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([.\n\r]+)/gsi) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
EDIT 2: This is the code fragment with updated regex, works great!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([\s\S]+)/gsi) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
EDIT 3: Haha, i see perl police strikes yet again!!!
I don't know if this is your exact problem, but inside square brackets, '.' is just looking for a period. I didn't see a period in the input, so I wondered which you meant.
Aside from the period, the rest of the character class is looking for consecutive whitespace. And as you didn't use the multiline switch, you've got newlines being counted as whitespace (and any character), but no indication to scan beyond the first record separator. But because of the way that you print it out, it also gives some indication that you meant more than the literal period, as mentioned above.
Axeman is correct; your problem is that . in a character class doesn't do what you expect.
By default, . outside a character class (and not backslashed) matches any character but a newline. If you want to include newlines, you specify the /s flag (which you seem to already have) on your regex or put the . in a (?s:...) group:
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/((?s:.+))/) {
print "matched=", $1, "\n";
} else {
print "not matched!\n";
}
. in a character class is a literal period, not "match anything". What you really want is /(.+)/s. The /g flag says to match multiple times, but you are using the regex in scalar context, so it will only match the first item. The /i flag makes the regex case insensitive, but there are no characters with case in your regex. The /s flag makes . match newlines, and . always matches "\r", so instead of [.\n\r] you can just use a single . (dot).
However, /(.+)/s will match any string with one or more characters, so you would be better off with
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if (length $crazy) {
print "matched=$crazy\n";
} else {
print "not matched!\n";
}
It is possible you meant to do something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
while ($crazy =~ /(.+)[\r\n]+/g) {
print "matched=$1\n";
}
But that would probably be better phrased:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
for my $part (split /[\r\n]+/, $crazy) {
print "matched=$part\n";
}
$1 contains white space, that's why you don't see it in a print like that, just add something after it/quote it.
Example:
perl -E "qq'abcd\r\nallo\nXYZ\n\n\nQQQ'=~/([.\n\r]+)/gsi;say 'got(',length($1),qq') >$1<';"
got(2) >
<
Updated for your comments:
To match everything you can simply use /(.+)/s
[.] (dot inside a character class) does not mean "match any character", it just means match the literal . character. So in an input string without any dots,
m/([.\n\r]+)/gsi
will just match strings of \n and \r characters.
With the /s modifier, you are already asking the regex engine to include newlines with . (match any character), so you could just write
m/(.+)/gsi
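A short sketch comparing the two patterns on the string from the question may make the difference concrete. The variable names $from_class and $from_dot are only for illustration.

```perl
use strict;
use warnings;

my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";

# Inside [...] the dot is a literal period, so this finds only the
# first run of \r and \n characters.
my ($from_class) = $crazy =~ /([.\n\r]+)/;

# Outside a class, with /s, the dot matches every character including newlines.
my ($from_dot) = $crazy =~ /(.+)/s;

print length($from_class), "\n";   # prints 2
print length($from_dot), "\n";     # prints 20
```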

Regex to match vowels in order as they appear once

I am trying to figure out a correct regex to match words in which the five vowels (a, e, i, o, u) each appear once, in that order.
There may be any number of consonants in the word; however, there should be no vowels other than the 5 listed above. For example, the word “sacrilegious” should not match because, although it contains the five vowels in alphabetical order, there is an extra ‘i’ vowel located between the ‘a’ and the ‘e’. Hyphenated words are not allowed. In fact, your regular expression should not match any ‘word’ that contains any character other than the upper and lower case letters.
Here are some words that it should match
abstemious
facetious
arsenious
acheilous
anemious
caesious
This is what I have come up with so far, but when I run the program it doesn't seem to be doing what it should do.
#!/usr/bin/perl -w
use strict;
my $test = "abstemious";
if( $test =~ /a[^eiou]*e[^aiou]*i[^aeou]*o[^aeiu]u/ )
{
print "yes";
}
You have a small typo there, try this:
/a[^eiou]*e[^aiou]*i[^aeou]*o[^aeio]*u/
You're not too far off. Try this regex:
/\b[b-df-hj-np-tv-z]*a[b-df-hj-np-tv-z]*e[b-df-hj-np-tv-z]*i[b-df-hj-np-tv-z]*o[b-df-hj-np-tv-z]*u[b-df-hj-np-tv-z]*\b/i
[b-df-hj-np-tv-z]* matches any number of consonants, then it's just slotting in the vowels and the word boundary markers (\b).
Try this regexp:
\b(?=[a-zA-Z]+)[^aA]*?[aA][^aeAE]*?[eE][^aeiAEI]*?[iI][^aeioAEIO]*?[oO][^aeiouAEIOU]*?[uU][^aeiouAEIOU]*?\b
Is it okay for a word to contain duplicate vowels, as long as they're in the correct order? For example, would faaceetiioouus (if there were such a word) be acceptable? I ask because your current regex does indeed match it.
If you want to match only words that contain exactly one of each vowel, try this:
/^
  (?=[a-z]+$)
  [^aeiou]*a
  [^aeiou]*e
  [^aeiou]*i
  [^aeiou]*o
  [^aeiou]*u
  [^aeiou]*
$
/ix
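Here is a small sketch that exercises a pattern of this shape against a few words. It assumes the letters-only requirement from the question (so the lookahead uses [a-z]), and the helper name vowels_once_in_order is invented for the example.

```perl
use strict;
use warnings;

# Letters only, each vowel exactly once, in alphabetical order.
my $in_order = qr/^ (?=[a-z]+$)
    [^aeiou]*a [^aeiou]*e [^aeiou]*i [^aeiou]*o [^aeiou]*u [^aeiou]* $/ix;

# Hypothetical helper wrapping the pattern.
sub vowels_once_in_order {
    my ($word) = @_;
    return $word =~ $in_order ? 1 : 0;
}

# The first two words match; the last two do not.
for my $word (qw(abstemious facetious sacrilegious faaceetiioouus)) {
    print "$word: ", (vowels_once_in_order($word) ? "match" : "no match"), "\n";
}
```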
Here is another approach, if you don't mind making a temporary copy of the string:
use warnings;
use strict;
while (<DATA>) {
    chomp;
    my $test = $_;
    my $cp = $test; # leave original string intact
    $cp =~ tr/aeiou//cd;
    print "$test\n" if $cp eq 'aeiou';
}
=for output
abstemious
facetious
arsenious
acheilous
anemious
caesious
=cut
__DATA__
abstemious
facetious
arsenious
acheilous
anemious
caesious
unabstemious
sacrilegious
intravenous
faaceetiioouus

How can I check if a Perl string contains letters?

In Perl, what regex should I use to find if a string of characters has letters or not?
Example of a string used: Thu Jan 1 05:30:00 1970
Would this be fine?
if ($l =~ /[a-zA-Z]/)
{
print "string ";
}
else
{
print "number ";
}
try this:
/[a-zA-Z]/
or
/[[:alpha:]]/
otherwise, you should give examples of the strings you want to match.
also read perldoc perlrequick
Edit: @OP, you have provided an example string, but I am not really sure what you want to do with it, so I am assuming you want to check whether a word is all letters, all numbers or something else. Here's something to start with. It is all from perldoc perlrequick (and perlretut), so please read them.
sub check {
    my $str = shift;
    if ($str =~ /^[a-zA-Z]+$/) {
        return $str." all letters";
    }
    if ($str =~ /^[0-9]+$/) {
        return $str." all numbers";
    } else {
        return $str." a mix of numbers/letters/others";
    }
}
$string = "99932";
print check ($string)."\n";
$string = "abcXXX";
print check ($string)."\n";
$string = "9abd99_32";
print check ($string)."\n";
output
$ perl perl.pl
99932 all numbers
abcXXX all letters
9abd99_32 a mix of numbers/letters/others
If you want to match Unicode characters rather than just ASCII ones, try this:
#!/usr/bin/perl
while (<>) {
if (/[\p{L}]+/) {
print "letters\n";
} else {
print "no letters\n";
}
}
If you're looking for any kind of letter from any language, you should go with
\p{L}
Take a look at this full reference: Unicode Character Properties
Using /[A-Za-z]/ is a US-centric way to do it. To accept any letter, use one of
/[[:alpha:]]/
/\p{L}/
/[^\W\d_]/
The third one employs a double-negative: not not-a-letter, not a digit, and not an underscore.
Whichever you choose, those who maintain your code will certainly appreciate it if you stick with one consistently!
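As a quick sanity check, the three patterns can be run side by side. The sketch below uses only ASCII sample strings, so it does not show where the patterns might differ on non-ASCII input; letter_flags is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: one 0/1 flag per pattern, in the order listed above.
sub letter_flags {
    my ($s) = @_;
    return map { $s =~ $_ ? 1 : 0 } (qr/[[:alpha:]]/, qr/\p{L}/, qr/[^\W\d_]/);
}

# Only the first string contains letters, so only it yields "1 1 1".
for my $s ("Thu Jan 1 05:30:00 1970", "05:30:00 1970", "__ 42 __") {
    print "$s => ", join(" ", letter_flags($s)), "\n";
}
```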
If you're looking to detect whether something looks like a number for the purposes of manipulating it in Perl, you'll want Scalar::Util::looks_like_number (core since perl 5.7.3). From perlapi:
looks_like_number
Test if the content of an SV looks like a number (or is a number). Inf and Infinity are treated as numbers (so will not issue a non-numeric warning), even if your atof() doesn't grok them.
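A minimal usage sketch of looks_like_number, called from Perl via Scalar::Util. The sample values are arbitrary.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

for my $value ("99932", "1e5", "abcXXX", "Thu Jan 1 05:30:00 1970") {
    my $kind = looks_like_number($value) ? "number" : "string";
    print "$value: $kind\n";
}
```

The first two values are reported as numbers and the last two as strings.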
[^\W0-9_]
# or
[[:alpha:]]
See perldoc perlre