Perl regular expression for English word - regex

I need a regular expression that will find anything that looks like an English word. In particular, I want the expression to match when a string has:
1) only letters; and
2) at least two different letters. (I am purposely excluding one-letter words.)
So I'm looking for something that would match the and abracadabra but not aaa.
Any help is much appreciated.

Perhaps \b(\w*(\w)\w*(?!\2)\w+)\b works for you. It handles the examples you give.
It captures one character \w in a group, then uses a backreference inside a negative lookahead, (?!\2), to require that a later character differs from the captured one. The \w+ after the lookahead must match at least one character, which is what forces that differing character to exist. The additional \w*'s around the group allow other letters in between, and \b ensures that the match starts and ends at word boundaries. Note that \w also matches digits and underscores; replace each \w with [a-zA-Z] if only letters should match.
http://www.rubular.com/r/pwjGi9eLf5
Please note that this is no super duper regular expression that matches English-only words. For that, you want to compare against a dictionary. But that doesn't seem to be what you're looking to do here.
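As a rough illustration of the backreference plus negative lookahead idea, here is a minimal Python sketch. It swaps \w for [A-Za-z] so that digits and underscores are rejected, which is an adjustment to the pattern above rather than the pattern itself. The helper name looks_like_word is made up for this example, and the backreference is \1 here because the outer capturing group is dropped.

```python
import re

# One captured letter, then somewhere later a letter that differs from it.
# re.fullmatch replaces the \b anchors because whole strings are tested.
WORD = re.compile(r"[A-Za-z]*([A-Za-z])[A-Za-z]*(?!\1)[A-Za-z]+")


def looks_like_word(s):
    """Return True if s has only letters and at least two different letters."""
    return WORD.fullmatch(s) is not None


for candidate in ["the", "abracadabra", "aaa", "a", "ab1"]:
    print(candidate, looks_like_word(candidate))
```

The comparison is case-sensitive, so a string such as "Aa" counts as two different letters unless re.IGNORECASE is added.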

Check out Lingua::EN::Splitter:
use strict; use warnings;
use Lingua::EN::Splitter qw(words);
my @words = words $input_text;
print @words;

Related

Find words that do not end with a given letter expression using regexp

I am trying to find any word that ends with the letter 'k', where the 'k' must not come directly after one of the letters 'a', 'e', or 'o'.
Regex should find this:
'stack'
'kick'
'kiik'
'kimk'
'gesk'
and should not find the following:
'book'
'beak'
'aiok'
For this I use this regular expression:
(?![aeo]+k)^.*?$
But it does not work.
^.*(?<![aeo])k$
You can use this, since all your words end with k. See the demo. The lookbehind rejects words that have a, e, or o just before the last k.
https://regex101.com/r/cD5jK1/3
You can use this negation based regex:
^.*[^aeo]k$
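To check both suggestions against the sample words from the question, here is a small Python sketch. The list names are made up for the example; the patterns are the lookbehind version and the negated character class version quoted in this thread.

```python
import re

LOOKBEHIND = re.compile(r"^.*(?<![aeo])k$")  # final k not preceded by a, e, or o
NEGATED = re.compile(r"^.*[^aeo]k$")         # some character other than a, e, o, then final k

wanted = ["stack", "kick", "kiik", "kimk", "gesk"]
unwanted = ["book", "beak", "aiok"]

for word in wanted + unwanted + ["k"]:
    print(word, bool(LOOKBEHIND.match(word)), bool(NEGATED.match(word)))
```

The two patterns agree on every sample word. They differ only on the bare string "k": the lookbehind version accepts it, while the negated class needs an actual character in front of the k.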
You may not have provided enough information, but I don't see why any sort of lookaround is warranted here. You should be able to simply use:
\b[A-Za-z]*[aeo]k\b
Word boundaries ( \b ) will help you limit this pattern to only words. If you need to account for hyphens, then you could adjust the first range to include hyphen as well.

Get text using Regular Expression

I have the sentence as below:
First learning of regular expression.
And I want to extract only First learning and expression by means of regular expressions.
Where would I start?
Regular expressions are for pattern matching, which means we'd need to know a pattern that is to be matched.
If you literally just want those strings, you'd just use First learning and expression as your patterns.
As @orique says, this is kind of pointless; you don't need RegEx for that. If you want something more complicated, you'd need to explain what you're trying to match.
Regex is not usually used to match literal text like what you're doing, but instead is used to match patterns of text. If you insist on using regex, you'll have to match the trivial expression
(First learning|expression)
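For what it is worth, the trivial alternation behaves like this in Python (a sketch only; the variable names are illustrative):

```python
import re

sentence = "First learning of regular expression."

# With a single capturing group, findall returns the text of that group
# for every non-overlapping match, scanning left to right.
matches = re.findall(r"(First learning|expression)", sentence)
print(matches)  # ['First learning', 'expression']
```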
As already pointed out, it is unusual to match a literal string like you are asking, but more common to match patterns such as several word characters followed by a space character etc...
Here is a pattern that matches several word characters (a-z, A-Z, 0-9 and _) followed by a space, followed by several more word characters, and so on. It captures three groups: the first group matches the first two words, the second group the next two words, and the third group the fifth word and the space before it.
$words = "First learning of regular expression.";
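// The pattern must be passed as a quoted string, delimiters included.
preg_match('/(\w+\s\w+)\s(\w+\s\w+)(\s\w+)/', $words, $matches);
// Concatenate groups 1 and 3 with the dot operator: "First learning expression"
$result = $matches[1] . $matches[3];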
I hope this matches your requirement.

Regex negation - word parsing

I am trying to parse a phrase and exclude common words.
For instance in the phrase "as the world turns", I want to exclude the common words "as" and "the" and return only "world" and "turns".
(\w+(?!the|as))
Doesn't work. Feedback appreciated.
The lookahead should come first:
(\b(?!(the|as)\b)\w+\b)
I have also added word boundaries to ensure that it only matches whole words; otherwise it would fail to match the complete word "as" but would still match the letter "s" of that word.
You might also want to consider what \w matches and if that meets your needs. If you are looking for words in English you probably are interested in letters but not digits and you may wish to include some punctuation characters that are excluded by \w, such as apostrophes. You could try something like this instead (Rubular):
/(\b(?!(?:the|as)\b)[a-z'-]+\b)/i
To match words more accurately in a human language you could consider using a natural language parsing library instead of regular expressions.
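A minimal Python sketch of the lookahead-first approach, assuming a hypothetical stop list held in STOP_WORDS and an illustrative helper named content_words:

```python
import re

STOP_WORDS = ["the", "as"]  # hypothetical stop list; extend as needed

# Build \b(?!(?:the|as)\b)[a-z'-]+\b from the list, escaping each entry.
alternatives = "|".join(re.escape(w) for w in STOP_WORDS)
PATTERN = re.compile(r"\b(?!(?:%s)\b)[a-z'-]+\b" % alternatives, re.IGNORECASE)


def content_words(phrase):
    """Return the words of phrase that are not in STOP_WORDS."""
    return PATTERN.findall(phrase)


print(content_words("as the world turns"))  # ['world', 'turns']
print(content_words("The aspen theory"))    # ['aspen', 'theory']
```

The \b inside the lookahead is what keeps "aspen" and "theory" from being rejected just because they start with a stop word.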
You should use word boundaries to only match whole words. Either with a look-ahead assertion:
(\b(?!(?:the|as)\b)\w+\b)
Or with a look-behind assertion:
(\b\w+\b(?<!\b(?:the|as)))

Regular expression to match phone number?

I want to match a phone number that can have letters and an optional hyphen:
This is valid: 333-WELL
This is also valid: 4URGENT
In other words, there can be at most one hyphen but if there is no hyphen, there can be at most seven 0-9 or A-Z characters.
I don't know how to do an "if statement" in a regex. Is that even possible?
I think this should do it:
/^[a-zA-Z0-9]{3}-?[a-zA-Z0-9]{4}$/
It matches 3 letters or numbers followed by an optional hyphen followed by 4 letters or numbers. This one works in ruby. Depending on the regex engine you're using you may need to alter it slightly.
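A quick Python sketch of how this fixed-position pattern behaves (the candidate strings other than the two from the question are made up for illustration):

```python
import re

# Three alphanumerics, an optional hyphen, then four alphanumerics.
FIXED_HYPHEN = re.compile(r"^[a-zA-Z0-9]{3}-?[a-zA-Z0-9]{4}$")

candidates = ["333-WELL", "4URGENT", "33-3WELL", "333--WELL", "333-WEL"]
accepted = [c for c in candidates if FIXED_HYPHEN.match(c)]
print(accepted)  # ['333-WELL', '4URGENT']
```

Because the hyphen is only allowed after the third character, a number such as 33-3WELL is rejected by this pattern.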
You seek the alternation operator, indicated with pipe character: |
However, you may need either 7 alternatives (1 for each hyphen location + 1 for no hyphen), or you may require the hyphen between 3rd and 4th character and use 2 alternatives.
One use of alternation operator defines two alternatives, as in:
([0-9A-Za-z]{3}-[0-9A-Za-z]{4}|[0-9A-Za-z]{7})
Not sure if this counts, but I'd break it into two regexes:
#!/usr/bin/perl
use strict;
use warnings;
my $text = '333-URGE';
print "Format OK\n" if $text =~ m/^[\dA-Z]{1,6}-?[\dA-Z]{1,6}$/;
print "Length OK\n" if $text =~ m/^(?:[\dA-Z]{7}|[\dA-Z-]{8})$/;
This should avoid accepting multiple dashes, dashes in the wrong place, etc...
Supposing that you want to allow the hyphen to be anywhere, lookarounds will be of use to you. Something like this:
^([A-Z0-9]{7}|(?=^[^-]+-[^-]+$)[A-Z0-9-]{8})$
There are two main parts to this pattern: [A-Z0-9]{7} to match a hyphen-free string and (?=^[^-]+-[^-]+$)[A-Z0-9-]{8} to match a hyphenated string.
The (?=^[^-]+-[^-]+$) will match for any string with a SINGLE hyphen in it (and the hyphen isn't the first or last character), then the [A-Z0-9-]{8} part will count the characters and make sure they are all valid.
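The lookahead version can be tried out with a short Python sketch like the one below. The helper name is_valid and the extra test strings are assumptions made for the example.

```python
import re

# Either 7 valid characters with no hyphen, or 8 characters of which
# exactly one is a hyphen that is neither first nor last.
ANY_HYPHEN = re.compile(r"^([A-Z0-9]{7}|(?=^[^-]+-[^-]+$)[A-Z0-9-]{8})$")


def is_valid(number):
    return ANY_HYPHEN.match(number) is not None


for number in ["4URGENT", "333-WELL", "33-3WELL", "-3334WEL", "33--WELL", "333-WEL"]:
    print(number, is_valid(number))
```

The lookahead only checks the shape of the string (one interior hyphen); the character class and the {8} count that follow it do the actual consuming.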
Thank you Heath Hunnicutt for his alternation operator answer as well as showing me an example.
Based on his advice, here's my answer:
[A-Z0-9]{7}|[A-Z0-9][A-Z0-9-]{7}
Note: I tested my regex here. (Just including this for reference)

Help with Regex patterns

I need some help with regex.
1) I have a pattern AB.* which should match strings like AB.CD, AB.CDX, and so on (AB.whatever). It should NOT match strings like AB, AB.CD.CD, or AB.CD., that is, not if it encounters a second dot in the string. What's the regex for this?
2) I have a pattern AB.** which should match strings like AB, AB.CD.CD, and AB.CD., but NOT strings like AB.CD, AB.CDX, or AB.whatever. What's the regex for this?
Thanks a lot.
It looks like you've got globs, not regular expressions. In a regular expression, a dot matches any character, and * makes the previous element match zero or more times.
1) AB\.[^.]*
Escape the first dot so it matches a literal dot, and then match any character other than a dot, any number of times.
2) "^(AB|AB\.[^.]*\.[^.]*)$"
This matches AB or AB followed by .<stuff>.<stuff>
http://www.regular-expressions.info/ contains lots of useful information for learning about regular expressions.
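As a sanity check, the two cases can be run against the strings from the question with a Python sketch like this one. It uses re.fullmatch in place of explicit ^ and $ anchors, and the variable names are made up for the example.

```python
import re

# 1) "AB." followed by text that contains no further dot
ONE_DOT = re.compile(r"AB\.[^.]*")
# 2) bare "AB", or "AB." followed by two dot-separated parts
BARE_OR_TWO_DOTS = re.compile(r"AB|AB\.[^.]*\.[^.]*")

samples = ["AB.CD", "AB.CDX", "AB.whatever", "AB", "AB.CD.CD", "AB.CD."]

first = [s for s in samples if ONE_DOT.fullmatch(s)]
second = [s for s in samples if BARE_OR_TWO_DOTS.fullmatch(s)]

print(first)   # ['AB.CD', 'AB.CDX', 'AB.whatever']
print(second)  # ['AB', 'AB.CD.CD', 'AB.CD.']
```

Note that the first pattern also accepts "AB." with nothing after the dot; use [^.]+ in place of [^.]* if that should be rejected.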
If your regex engine supports negative lookahead you might try something like:
^AB\.[^.]+$
^AB(?!\.[^.]+$)
(or
^AB\.[^.]*$
^AB(?!\.[^.]*$)
if you want to allow AB. )
I don't find your question entirely clear; please comment here (or edit your question if you can't add comments) if I'm getting this wrong, but what I think you're looking for is:
1) matching strings "AB.AnyTextHereWithoutDots" but not "AB" or "AB.foo." etc
If so a matching regex would be:
"^AB\.[^.]*$"
2) matching "AB" or "AB.something.something" with either none or two or more dots
If so a matching regex would be something like:
"^AB(\..*\..*)?$" or "^AB\(\..*\..*\)\?$" (depending on whether your regex engine uses unescaped or escaped parentheses for grouping)
As Douglas suggests matching with globs would likely be easier.
And as spdenne suggests, find a good regex reference.
I tried this in vim. Here is the sample data:
AB.CD
AB.CDX
AB.whatever
AB
AB.CD.CD
AB.CD.
AB.CD.CD
Here are my regexes:
This captures all lines that start with AB followed by a literal dot, and filters out all lines that have a second dot.
^AB\.[^.]*$
This captures all lines that are just AB (the part before the pipe) or lines that start with AB followed by two literal dots (escaped with a backslash), with any text after each dot.
^AB$\|^AB\..*\..*$