Regular expression to match phone number? - regex

I want to match a phone number that can have letters and an optional hyphen:
This is valid: 333-WELL
This is also valid: 4URGENT
In other words, there can be at most one hyphen but if there is no hyphen, there can be at most seven 0-9 or A-Z characters.
I don't know how to do an "if statement" in a regex. Is that even possible?

I think this should do it:
/^[a-zA-Z0-9]{3}-?[a-zA-Z0-9]{4}$/
It matches 3 letters or digits, followed by an optional hyphen, followed by 4 letters or digits. This one works in Ruby. Depending on the regex engine you're using, you may need to alter it slightly.
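As a quick check of the pattern above, here is a minimal Python sketch. The helper name is_vanity_number is made up for illustration, and re.fullmatch is used in place of the ^ and $ anchors, which is an assumption about how one might port the pattern rather than part of the original answer.

```python
import re

# Same character classes as the answer; fullmatch stands in for ^ ... $.
PHONE = re.compile(r"[a-zA-Z0-9]{3}-?[a-zA-Z0-9]{4}")

def is_vanity_number(text: str) -> bool:
    """Return True for 3 alphanumerics, an optional hyphen, then 4 more."""
    return PHONE.fullmatch(text) is not None

for sample in ["333-WELL", "4URGENT", "33-3WELL", "333--WELL"]:
    print(sample, is_vanity_number(sample))
```

Note that this version only accepts the hyphen between the 3rd and 4th characters.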

You seek the alternation operator, indicated with pipe character: |
However, you may need either 7 alternatives (one for each hyphen location plus one for no hyphen), or you may require the hyphen between the 3rd and 4th characters and use 2 alternatives.
One use of the alternation operator defines two alternatives, as in:
([0-9A-Za-z]{3}-[0-9A-Za-z]{4}|[0-9A-Za-z]{7})
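If the hyphen may sit at any interior position, the "one alternative per hyphen location" idea can be generated rather than typed out. The following Python sketch is only an illustration under that assumption; build_pattern and ALNUM are hypothetical names, not something from the answer.

```python
import re

ALNUM = "[0-9A-Za-z]"

def build_pattern(total: int = 7) -> str:
    """Build one alternative per interior hyphen position, plus a hyphen-free one."""
    alternatives = [f"{ALNUM}{{{total}}}"]  # no hyphen at all
    for left in range(1, total):
        # 'left' characters, a hyphen, then the remaining characters
        alternatives.append(f"{ALNUM}{{{left}}}-{ALNUM}{{{total - left}}}")
    return "(?:" + "|".join(alternatives) + ")"

PATTERN = re.compile(build_pattern())

for sample in ["333-WELL", "4URGENT", "4-URGENT", "-4URGENT", "33-3-ELL"]:
    print(sample, PATTERN.fullmatch(sample) is not None)
```

For seven characters this produces six hyphenated alternatives and one without a hyphen, seven in total.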

Not sure if this counts, but I'd break it into two regexes:
#!/usr/bin/perl
use strict;
use warnings;
my $text = '333-URGE';
print "Format OK\n" if $text =~ m/^[\dA-Z]{1,6}-?[\dA-Z]{1,6}$/;
print "Length OK\n" if $text =~ m/^(?:[\dA-Z]{7}|[\dA-Z-]{8})$/;
This should avoid accepting multiple dashes, dashes in the wrong place, etc...

Supposing that you want to allow the hyphen to be anywhere, lookarounds will be of use to you. Something like this:
^([A-Z0-9]{7}|(?=^[^-]+-[^-]+$)[A-Z0-9-]{8})$
There are two main parts to this pattern: [A-Z0-9]{7} to match a hyphen-free string and (?=^[^-]+-[^-]+$)[A-Z0-9-]{8} to match a hyphenated string.
The (?=^[^-]+-[^-]+$) will match for any string with a SINGLE hyphen in it (and the hyphen isn't the first or last character), then the [A-Z0-9-]{8} part will count the characters and make sure they are all valid.
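To see the lookahead doing its job, here is a small Python sketch that runs the pattern as written against a few inputs. The function name is_valid and the sample strings are illustrative assumptions, and the inputs are assumed to carry no trailing newline.

```python
import re

# Pattern copied from the answer: either 7 plain characters, or 8 characters
# that the lookahead has already confirmed contain exactly one interior hyphen.
PATTERN = re.compile(r"^([A-Z0-9]{7}|(?=^[^-]+-[^-]+$)[A-Z0-9-]{8})$")

def is_valid(text: str) -> bool:
    return PATTERN.match(text) is not None

for sample in ["333-WELL", "4URGENT", "4-URGENT", "-4URGENT", "4URGENT-", "33-3-ELL", "ABCDEFGH"]:
    print(sample, is_valid(sample))
```

The last four samples are rejected: a leading hyphen, a trailing hyphen, two hyphens, and eight characters with no hyphen at all.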

Thank you Heath Hunnicutt for his alternation operator answer as well as showing me an example.
Based on his advice, here's my answer:
[A-Z0-9]{7}|[A-Z0-9][A-Z0-9-]{7}
Note: I tested my regex here. (Just including this for reference)

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm.
You flipped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
Regex: \b([a-zA-Z0-9]+)-
Explanation: \b is a word boundary to prevent a partial word match; the value is in capture group 1.
Regex: \b[a-zA-Z0-9]+(?=-)
Explanation: Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-).
To get t3rm in sh0rt-t3rm you might use, for example, one of:
Regex: -([a-zA-Z0-9]+)\b
Explanation: The other way around, with a leading -, and get the value from capture group 1.
Regex: -\K[a-zA-Z0-9]+\b
Explanation: Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
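For comparison outside Perl, here is a rough Python sketch of the same four extractions. Python's built-in re module has no \K, so a lookbehind (?<=-) is swapped in for the last one; that substitution is mine, not part of the answer, and the variable names are illustrative.

```python
import re

word = "sh0rt-t3rm"

# First part: capture group, then lookahead
first_by_group = re.search(r"\b([a-zA-Z0-9]+)-", word).group(1)
first_by_lookahead = re.search(r"\b[a-zA-Z0-9]+(?=-)", word).group(0)

# Second part: capture group, then a lookbehind standing in for \K
second_by_group = re.search(r"-([a-zA-Z0-9]+)\b", word).group(1)
second_by_lookbehind = re.search(r"(?<=-)[a-zA-Z0-9]+\b", word).group(0)

print(first_by_group, first_by_lookahead)      # sh0rt sh0rt
print(second_by_group, second_by_lookbehind)   # t3rm t3rm
```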
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side to make the regex run in list context, because that's when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
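The effect of the greedy .* is easy to see on a string with more than one hyphen. A minimal Python sketch follows; the sample string long-sh0rt-t3rm is made up for illustration.

```python
import re

text = "long-sh0rt-t3rm"

# Without a leading .*, the match starts at the first hyphen.
after_first = re.search(r"-(.+)", text).group(1)

# With a greedy .*, the engine backs off only as far as the last hyphen.
after_last = re.search(r".*-(.+)", text).group(1)

print(after_first)  # sh0rt-t3rm
print(after_last)   # t3rm
```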
There are of course many other variations on what the data may look like, other than just being one hyphenated word. In particular, if it's part of a text, you may want to include word boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches only [a-zA-Z0-9_] for ASCII text, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and would need careful inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre.
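A short Python sketch of the \w variant shows both the convenience and the limitation. The helper second_part and its ascii_only parameter are hypothetical; the re.ASCII flag is used to mimic the [a-zA-Z0-9_] behaviour described here, since Python's \w is Unicode-aware by default.

```python
import re

def second_part(text: str, ascii_only: bool = True):
    """Return what follows the first usable hyphen, or None if nothing matches."""
    flags = re.ASCII if ascii_only else 0
    match = re.search(r"\w*?-(\w+)", text, flags)
    return match.group(1) if match else None

print(second_part("a sh0rt-t3rm fix"))   # t3rm
print(second_part("no hyphen here"))     # None
print(second_part("week-end's"))         # end  (the apostrophe is not a \w character)

accented = "t\u00eate-\u00e0-t\u00eate"  # "tête-à-tête"
print(second_part(accented))                    # ASCII-only \w trips over the accents
print(second_part(accented, ascii_only=False))  # Unicode-aware \w handles them
```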
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.

Regex Greediness

I have a Perl regex that I'm fairly certain should work but is being too greedy:
regex:
(?:.*serial[^\d]+?(\d+).*)
Test string:
APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou
Desired group 1 match:
123456
Actual group 1 Match:
12
I've tried every permutation of lookahead and behind and laziness and I can't get the damn thing to work.
WHAT AM I MISSING.
Thanks!
The Problem is Not Greediness, but Case-Sensitivity
Currently your regex matches the 12 at the end of serialnun12, probably because it is case-sensitive. We have two options: using upper-case, or making the pattern case-insensitive.
Option 1: Use Upper-Case
If you only want 123456, you can use:
SERIALNO\K\d+
The \K tells the engine to drop what was matched so far from the final match it returns.
If you want to match the whole string and capture 123456 to Group 1, use:
.*?SERIAL\D+(\d+).*
Option 2: Turning Case-Insensitivity On using (?i) inline or the i flag
To only match 123456, you can use:
(?i)serial\D+\K\d+
Note that if you use the g flag, this would match both numbers.
If you want to match the whole string and capture 123456 to Group 1, use:
(?i).*?serial\D+(\d+).*
A few tips
You can turn on case-insensitivity either with the (?i) inline modifier or with the i flag at the end of the pattern: /serial\D+\K\d+/i
Instead of [^\d], use \D
There is no need for a lazy quantifier in something like \D+\d+ because the two tokens are mutually exclusive: there is no danger that the \D will run over the \d
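The difference is easy to reproduce. This Python sketch runs one pattern against the test string from the question with and without case-insensitivity; a capture group is used instead of \K because Python's built-in re module does not support \K, and the variable names are illustrative.

```python
import re

TEXT = "APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou"
PATTERN = r"serial\D+(\d+)"

# Case-sensitive: only the lowercase "serial" near the end can match.
case_sensitive = re.search(PATTERN, TEXT).group(1)

# Case-insensitive: the leftmost match is now the uppercase SERIAL.
case_insensitive = re.search(PATTERN, TEXT, re.IGNORECASE).group(1)

# Equivalent of the g flag: every match, in order.
all_hits = re.findall(PATTERN, TEXT, re.IGNORECASE)

print(case_sensitive)    # 12
print(case_insensitive)  # 123456
print(all_hits)          # ['123456', '12']
```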
The problem is not greediness; it's case-sensitivity.
Currently your regex matches the 12 at the end of serialnun12 because those are the only digits following serial. The ones you want follow SERIAL. S and s are different characters.
There are two solutions.
Use the uppercase characters in the pattern.
my ($serial) = $string =~ /SERIAL\D*(\d+)/;
Use case-insensitive matching.
my ($serial) = $string =~ /serial\D*(\d+)/i;
There's probably no need for this, but I thought I'd mention it just in case.

Perl regular expression for English word

I need a regular expression that will find anything that looks like an English word. In particular, I want the expression to match when a string has:
1) only letters; and
2) at least two different letters. (I am purposely excluding one-letter words.)
So I'm looking for something that would match the and abracadabra but not aaa.
Any help is much appreciated.
Perhaps \b(\w*(\w)\w*(?!\2)\w+)\b works for you. It handles the examples you give.
It matches a letter \w in a group, then looks for something other than that letter using a backreference inside a negative lookahead, (?!\2). We match at least one character at the end, which is necessary to make the negative lookahead force at least one distinct character. Then we place additional \w*'s around to allow additional letters. \b ensures the ends of the matches are at word boundaries.
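The pattern can be tried against the question's examples with a few lines of Python. This is just a sketch; looks_like_word is a made-up helper name, and fullmatch is used so that each sample is tested as a whole word.

```python
import re

# Pattern from the answer: some character (group 2), then later a position
# whose next character differs from it, followed by at least one character.
WORD = re.compile(r"\b(\w*(\w)\w*(?!\2)\w+)\b")

def looks_like_word(text: str) -> bool:
    return WORD.fullmatch(text) is not None

for sample in ["the", "abracadabra", "aaa", "a", "a1"]:
    print(sample, looks_like_word(sample))
```

The last sample is accepted because \w also covers digits and the underscore; swapping \w for [A-Za-z] would be one way to restrict the match to letters only.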
http://www.rubular.com/r/pwjGi9eLf5
Please note that this is no super duper regular expression that matches English-only words. For that, you want to compare against a dictionary. But that doesn't seem to be what you're looking to do here.
Check out Lingua::EN::Splitter:
use strict; use warnings;
use Lingua::EN::Splitter qw(words);
my @words = words $input_text;
print @words;

Choosing just the alphanumeric words with regex

I'm trying to find a regular expression that finds just the alphanumeric words in a string, i.e. the words that combine letters and numbers. If a word is pure numbers or pure letters, I need to discard it.
Try this regular expression:
\b([a-z]+[0-9]+[a-z0-9]*|[0-9]+[a-z]+[a-z0-9]*)\b
Or more compact:
\b([a-z]+[0-9]+|[0-9]+[a-z]+)[a-z0-9]*\b
This matches all words (note the word boundaries \b) that either start with one or more letters followed by one or more digits or vice versa that may be followed by one or more letters or digits. So the condition of at least one letter and at least one digit is always fulfilled.
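Here is a minimal Python sketch of the compact pattern in use. The group is made non-capturing so that re.findall returns whole words; that tweak, the helper name mixed_words, and the sample sentence are my own additions.

```python
import re

# Letters-then-digits or digits-then-letters, followed by any mix of both.
MIXED = re.compile(r"\b(?:[a-z]+[0-9]+|[0-9]+[a-z]+)[a-z0-9]*\b", re.IGNORECASE)

def mixed_words(text: str) -> list:
    """Return the words that contain at least one letter and one digit."""
    return MIXED.findall(text)

print(mixed_words("foo bar F0O 8ar 123 a1b2"))  # ['F0O', '8ar', 'a1b2']
```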
With lookaheads:
'/\b(?![0-9]+\b)(?![a-z]+\b)[0-9a-z]+\b/i'
A quick test that also shows example usage:
$str = 'foo bar F0O 8ar';
$arr = array();
preg_match_all('/\b(?![0-9]+\b)(?![a-z]+\b)[0-9a-z]+\b/i', $str, $arr);
print_r($arr);
Output:
F0O
8ar
This will return all individual alphanumeric words, which you can loop through. I don't think regex can do the whole job by itself.
\b[a-z0-9]+\b
Make sure you mark that as case-insensitive.
\b(?:[a-z]+[0-9]+|[0-9]+[a-z]+)[[:alnum:]]*\b
'\b([a-zA-Z]+[0-9]+ | [0-9]+[a-zA-Z]+ | [a-zA-Z]+[0-9]+[a-zA-Z]*)\b'

A regex that will parse 00.00

I'm trying to create a regex that will accept the following values:
(blank)
0
00
00.0
00.00
I came up with ([0-9]){0,2}\.([0-9]){0,2} which to me says "the digits 0 through 9 occurring 0 to 2 times, followed by a '.' character (which should be optional), followed by the digits 0 through 9 occurring 0 to 2 times". If only 2 digits are entered, the '.' is not necessary. What's wrong with this regex?
You didn't make the dot optional:
[0-9]{0,2}(\.[0-9]{1,2})?
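To make the difference concrete, this Python sketch compares the pattern from the question with the corrected one on the values that should be accepted. The names ORIGINAL, FIXED and WANTED are illustrative, and re.fullmatch is used because neither pattern is anchored.

```python
import re

ORIGINAL = re.compile(r"([0-9]){0,2}\.([0-9]){0,2}")  # dot is mandatory
FIXED = re.compile(r"[0-9]{0,2}(\.[0-9]{1,2})?")      # dot plus digits is optional

WANTED = ["", "0", "00", "00.0", "00.00"]

for value in WANTED:
    print(repr(value),
          ORIGINAL.fullmatch(value) is not None,
          FIXED.fullmatch(value) is not None)

# A trailing dot with nothing after it is rejected by the fixed pattern.
print(FIXED.fullmatch("0.") is not None)  # False
```

The original pattern only accepts the last two values, because everything without a dot fails.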
First off, {0-2} should be {0,2} as it was in the first instance.
Secondly, you need to group the repetition sections as well.
Thirdly, you need to make the whole last part optional. Because if there's a dot, there must be something after it, you should also change the second repetition thing to {1,2}.
([0-9]{0,2})(\.([0-9]{1,2}))?
There are a few problems with your regex:
The dot is a special character, and acts as a wildcard; if you want a literal dot, you need to escape it (\.).
Even if you replaced the dot to not be a wildcard, your regex will match strings like "0." because you did not tell the regular expression engine to only match the dot if there are numbers following it.
Because your expression isn't anchored, it could match strings that contain the pattern within another word (e.g. ab12 would match).
A better pattern would be something like:
/\b[0-9]{0,2}(?:\.[0-9]{1,2})?\b/
Note that (?:...) makes the group non-capturing, so it does not create a backreference, which is probably not needed in your case.
Here is one way, illustrated in Perl, to match only the strings you listed. The important part is its method for matching empty strings: it does not make every pattern element optional, a strategy that has the undesirable effect of matching almost every string.
use warnings;
use strict;
my @data = (
'',
'0',
'00',
'00.0',
'00.00',
'foo', # Should not match.
'.0', # Should not match.
);
for (@data){
print $_, "\n" if /^$|^[0-9]{1,2}(\.[0-9]{1,2})?$/;
}
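The point about not making every element optional can be checked with a rough Python equivalent. In this sketch, VALID is the pattern from the Perl code above, while ALL_OPTIONAL is a deliberately loose, unanchored pattern included only for contrast; both names are made up.

```python
import re

VALID = re.compile(r"^$|^[0-9]{1,2}(\.[0-9]{1,2})?$")
ALL_OPTIONAL = re.compile(r"[0-9]{0,2}(\.[0-9]{1,2})?")  # everything optional, no anchors

data = ["", "0", "00", "00.0", "00.00", "foo", ".0"]

# Only the five listed values survive the anchored pattern.
accepted = [value for value in data if VALID.search(value)]
print(accepted)

# The all-optional pattern "matches" anything, via an empty match at position 0.
print(ALL_OPTIONAL.search("foo") is not None)  # True
```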
Most of the above examples don't anchor the beginning ^ and ending $ of the data.
I would solve it with one of the following:
^[[:digit:]]{0,2}([.][[:digit:]]{1,2})?$
^\d{0,2}([.]\d{1,2})?$
^[0-9]{0,2}([.][0-9]{1,2})?$
For readability, I generally prefer using [.] to \. and using POSIX classes like [[:digit:]].