Regex - simple phone number - regex

I know there are a ton of regex examples on how to match certain phone number types. For my example I just want to allow numbers and a few special characters. I am again having trouble achieving this.
Phone numbers that should be allowed could take these forms
5555555555
555-555-5555
(555)5555555
(555)-555-5555
(555)555-5555 and so on
I just want something that will allow [0-9] and also special characters '(' , ')', and '-'
so far my expression looks like this
/^[0-9]*^[()-]*$/
I know this is wrong, but logically I believe it means: allow the numbers 0-9, and also allow the characters (, ), and -.

^(\(\d{3}\)|\d{3})-?\d{3}-?\d{4}$
\(\d{3}\)|\d{3} three digits with or without () - The simpler regex would be \(?\d{3}\)? but that would allow (555-5555555 and 555)5555555 etc.
An optional - followed by three digits
An optional - followed by four digits
Note that this would still allow 555555-5555 and 555-5555555 - I don't know if these are covered in your and so on part
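To check the pattern against the forms listed in the question, here is a minimal Python sketch. The helper name is_phone and the sample list are made up for illustration, and Python is only one possible host language for this regex.

```python
import re

# Pattern from the answer above: area code with or without parentheses,
# then an optional hyphen before each of the next two digit groups.
PHONE = re.compile(r"^(\(\d{3}\)|\d{3})-?\d{3}-?\d{4}$")

def is_phone(text: str) -> bool:
    """Return True if text matches one of the accepted phone layouts."""
    return PHONE.match(text) is not None

samples = [
    "5555555555", "555-555-5555", "(555)5555555",
    "(555)-555-5555", "(555)555-5555",
    "(555-5555555", "555)5555555",  # unbalanced parentheses
]
for s in samples:
    print(s, is_phone(s))
```

The two unbalanced samples are rejected, which is the reason for spelling out the alternation instead of using \(?\d{3}\)?.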

This matches what you want: digits, (, ), and -.
/^[0-9()-]+$/

^[0-9-+\s]+$
06754654
+54654654
+546 546 5654 43534 +
+09945 345 3453 45

Why do you have a stray ^ in there? I think you meant [()-]. As written, the regex requires a second beginning-of-string after the digits, so it can only match when the digit part is empty.
Also, \d is a nice shortcut for [0-9]. In most engines they are equivalent, although some Unicode-aware engines let \d match digits from other scripts too.
Also, even without the stray ^, this would only match a bunch of numbers followed by a bunch of ( or ) or -. Something like 1294819024()()()()()-----()- would match. I think you want the whole thing to be able to repeat, something like: ^(\d*[()-]*)*$. Now you can match repeating sequences of this.
Note, however, that nested * quantifiers are typically inefficient. Since all we really want is to match any digit or the punctuation you listed, a single character class does the job: [\d()-]*

For digits you can use \d. For more than one digit, you can use \d{n}, where n is the number of digits you want to match. Some special characters must be escaped, for example \( matches (. For example: \(\d{3}\)\-\d{3}\-\d{4} matches (555)-555-5555.

The second caret (afaik) is going to break anything you do since it means "start of string".
What you appear to be asking for therefore is:
start of string, followed by...
any number of numeric characters, followed by...
start of string, followed by...
any number of '(',')', or '-' characters, followed by...
end of string
Which won't work even if that second caret does nothing, because you're not accounting for anything after the first '(',')', or '-'. In fact, it will probably only validate an empty string or a string made up solely of those punctuation characters.
You want /^[0-9()-]+$/ for a very crude pattern which will "work".
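A small Python sketch comparing the pattern from the question with the crude replacement. The helper name matches and the sample strings are illustrative only. Other regex engines should behave the same way for these inputs, but that is an assumption worth checking in your own environment.

```python
import re

# The asker's original pattern, kept only to show how the second ^ behaves.
ORIGINAL = re.compile(r"^[0-9]*^[()-]*$")
# The crude replacement: any mix of digits, parentheses and hyphens.
CRUDE = re.compile(r"^[0-9()-]+$")

def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern matches text from its first character."""
    return pattern.match(text) is not None

for s in ["(555)-555-5555", "5555555555", "()-", ""]:
    print(repr(s), matches(ORIGINAL, s), matches(CRUDE, s))
```

The crude pattern checks only which characters appear, not their order, so a string such as ")))" also passes.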

If you are handling US-only numbers, the best solution is to strip out all the non-digit characters and then test whether the length == 10.
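A minimal sketch of the strip-and-count idea in Python. The function name is_us_number is hypothetical, and the sketch assumes the input carries no country code.

```python
import re

def is_us_number(text: str) -> bool:
    """Drop everything that is not an ASCII digit, then check for 10 digits."""
    digits = re.sub(r"[^0-9]", "", text)
    return len(digits) == 10

print(is_us_number("(555)-555-5555"))
print(is_us_number("555-5555"))
```

This accepts any punctuation at all between the digits, so it validates the digit count and not the layout.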

Related

Decoding a regex... I know what its function is but I want to understand exactly what is happening

I have a regular expression that I'm going to be using to verify that an inputted number is in standard U.S. telephone format (i.e (###) ###-####). I am new to regex and still having some trouble figuring out the exact function of each character. If someone would go through this piece by piece/verify that I am understanding I would really appreciate it. Also if the regex is wrong I would obviously like to know that.
\D*?(\d\D*?){10}
What I think is happening:
\D*?( indicates an escape sequence for the parenthesis metacharacter... not sure why the \D*? is necessary
\d indicating digits
\D*? indicating there is a non-digit character (-) followed by the closing parenthesis.
{10} for the 10 digits
I feel very unsure explaining this, like my understanding is very vague in terms of why the regex is in the order that it is etc. Thanks in advance for help/explanations.
EDIT
It seems like this is not the best regex for what I want. Another possibility was [(][0-9]{3}[)] [0-9]{3}-[0-9]{4}, but I was told this would fail. I suppose I'll have to do a little more work with regular expressions to figure this out.
\D matches any non-digit character.
* means that the previous character is repeated 0 or more times.
*? means that the previous character is repeated 0 or more times, but as few times as possible (a lazy match). It is perhaps a bit difficult at the start, but in your regex the next character is \d, meaning \D*? will match the smallest number of characters needed to reach the next \d character.
( ... ) is a capture group, and is also used to group things. For instance {10} means that the previous character or group is repeated 10 times exactly.
Now, \D*?(\d\D*?){10} will match exactly 10 numbers, starting with non-digit characters or not, with non-digit characters in between the digits if they are present.
[(][0-9]{3}[)] [0-9]{3}-[0-9]{4}
This regex is a bit better since it doesn't just accept anything (like the first regex does) and will match the format (###) ###-#### (notice the space is a character in regex!).
The new things introduced here are the square brackets. These represent character classes. [0-9] means any character between 0 and 9 inclusive, which means it will match 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. Adding {3} after it makes it match three characters from that class in a row, and since this character class contains only digits, it will match exactly 3 digits.
A character class can be used to escape certain characters, such as ( or ) (note I mentioned earlier they are for capturing groups, or grouping) and thus, [(] and [)] are literal ( and ) instead of being used for capturing/grouping.
You can also use backslashes (\) to escape characters. Thus:
\([0-9]{3}\) [0-9]{3}-[0-9]{4}
Will also work. I would also recommend the use of line anchors ^ and $ if you're only trying to see if a phone number matches the above format. This ensures that the string has only the phone number, and nothing else. ^ matches the beginning of a line and $ matches the end of a line. Thus, the regex will become:
^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$
However, I don't know all the combinations of the different formats of phone numbers in the US, so this regex might need some tweaking if you have different phone number formats.
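To see the difference between the two regexes discussed above, here is a short Python sketch. The helper names loose_ok and strict_ok and the sample strings are invented for illustration.

```python
import re

# The regex from the question: ten digits with any non-digits in between.
LOOSE = re.compile(r"\D*?(\d\D*?){10}")
# The anchored regex for the exact layout (###) ###-####.
STRICT = re.compile(r"^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$")

def loose_ok(text: str) -> bool:
    """True if ten digits can be found anywhere in text."""
    return LOOSE.search(text) is not None

def strict_ok(text: str) -> bool:
    """True only if the whole string has the (###) ###-#### layout."""
    return STRICT.match(text) is not None

for s in ["(555) 555-5555", "(555)555-5555", "call 1 2 3 4 5 6 7 8 9 0 now"]:
    print(repr(s), loose_ok(s), strict_ok(s))
```

The loose pattern accepts all three samples, while the strict one accepts only the first.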
\D is "not a digit"; \d is "digit". With that in mind:
This matches zero or more non-digits, then it matches a digit and any number of non-digit characters 10 times. This won't actually verify that the number is formatted properly, just that it contains 10 digits. I suspect that the regex isn't what you want in the first place.
For example, the following will match your regex:
this is some bad text 1 and some more 2 and more 34567890
\D matches a character that is not a digit
* repeats the previous item 0 or more times
? after * makes the repetition lazy: it matches as few characters as possible
\d matches a digit
so your group matches a digit followed by any non-digits, repeated 10 times

I want to egrep regular expression for multiple sets of numbers without union

This seems like a basic question but I have done a fair amount of searching and couldn't find an answer.
Say I have many sets of numbers
9-9-9
999
123
1-23
12-3
444
55-5
I want to egrep for all numbers, now one way I could do this is set up a egrep and union the regex for all of the possibilities
egrep '[0-9][0-9][0-9]|[0-9][-][0-9][-][0-9]' and so on and so forth
Is there a way to essentially say [0-9 or NULL or -] characters in my regex? so I could write one regex without unions like this [0-9][0-9-NULL][0-9-NULL][0-9-NULL][0-9-NULL][0-9NULL] and have it return all of the groups?
So the groups it would search would look as follows
- first 0-9
- second 0-9, -, NULL
- third 0-9, -, NULL
- fourth 0-9, -, NULL
- fifth 0-9, -, NULL
- sixth 0-9, NULL
Any help is appreciated.
Excluding the 'Null' part because I'm not quite sure what you mean by that, you can use a simple regex for the egrep:
[0-9](-?[0-9]){2}
-? means 0 or 1 occurrence of -.
{2} means the preceding group gets repeated twice. Change it to + to mean at least once.
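The same pattern can be tried outside egrep. The Python sketch below uses fullmatch so that the whole line must fit the pattern, which is an assumption on my part; plain egrep without anchors would also print lines that merely contain a match. The extra samples 1--23 and 12 are made up.

```python
import re

# One digit, then twice: an optional hyphen followed by a digit.
THREE_DIGITS = re.compile(r"[0-9](-?[0-9]){2}")

lines = ["9-9-9", "999", "123", "1-23", "12-3", "444", "55-5", "1--23", "12"]
matched = [s for s in lines if THREE_DIGITS.fullmatch(s)]
print(matched)
```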
try a regex like
\d[-0-9]+
a dash or 0-9 one or more times
Edit
or if it must start and end with a number (or have just one digit)
\d[-0-9]*\d|\d
Edit
or it is always three digits
\d{3}|\d-\d{2}|\d{2}-\d
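I think egrep '[0-9][0-9-]*' should give the behavior you want. (The hyphen goes last inside the brackets so that it is taken literally; a backslash inside a POSIX bracket expression is itself a literal character.)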
Perhaps egrep '^[0-9]+(-[0-9]+){0,5}$'
some digits, followed by up to 5 groups of (a hyphen and digits)
anchored at both ends of the line.
pretty sure (-?[0-9])+ would work
it will find one or more digits that may be preceded by a -, although i'm also unsure on what you mean by NULL

Regex to find integers and decimals in string

I have a string like:
$str1 = "12 ounces";
$str2 = "1.5 ounces chopped";
I'd like to get the amount from the string whether it is a decimal or not (12 or 1.5), and then grab the measurement that immediately follows it (ounces).
I was able to use a pretty rudimentary regex to grab the measurement, but getting the decimal/integer has been giving me problems.
Thanks for your help!
If you just want to grab the data, you can just use a loose regex:
([\d.]+)\s+(\S+)
([\d.]+): [\d.]+ will match a sequence of strictly digits and . (it means 4.5.6 or .... will match, but those cases are not common, and this is just for grabbing data), and the parentheses signify that we will capture the matched text. The . here is inside character class [], so no need for escaping.
Followed by arbitrary spaces \s+ and maximum sequence (due to greedy quantifier) of non-space character \S+ (non-space really is non-space: it will match almost everything in Unicode, except for space, tab, new line, carriage return characters).
You can get the number in the first capturing group, and the unit in the 2nd capturing group.
You can be a bit stricter on the number:
(\d+(?:\.\d*)?|\.\d+)\s+(\S+)
The only change is (\d+(?:\.\d*)?|\.\d+), so I will only explain this part. This is a bit stricter, but whether stricter is better depends on the input domain and your requirements. It will match the integer 34, a number with a decimal part such as 3.40000, and allow the .5 and 34. cases to pass. It will reject a number with excess dots, or one that contains only a dot. The | acts as OR, separating 2 different patterns: \.\d+ and \d+(?:\.\d*)?.
\d+(?:\.\d*)?: This will match and (implicitly) assert at least one digit in integer part, followed by optional . (which needs to be escaped with \ since . means any character) and fractional part (which can be 0 or more digits). The optionality is indicated by ? at the end. () can be used for grouping and capturing - but if capturing is not needed, then (?:) can be used to disable capturing (save memory).
\.\d+: This will match for the case such as .78. It matches . followed by at least one (signified by +) digit.
This is not a good solution if you want to make sure you get something meaningful out of the input string. You need to define all expected units before you can write a regex that only captures valid data.
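Here is how the stricter pattern could be used to pull out both parts, sketched in Python and not in the asker's language. The function name parse_amount is hypothetical, and converting the amount with float is my own addition.

```python
import re

# Stricter number (34, 3.40, 34. or .5), whitespace, then the next word.
AMOUNT_UNIT = re.compile(r"(\d+(?:\.\d*)?|\.\d+)\s+(\S+)")

def parse_amount(text: str):
    """Return (amount, unit) for the first match, or None if there is none."""
    m = AMOUNT_UNIT.search(text)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)

print(parse_amount("12 ounces"))
print(parse_amount("1.5 ounces chopped"))
```

The unit is whatever non-space run follows the number, so filtering against a list of known units would still be a separate step.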
use this regular expression \b\d+([\.,]\d+)?
To get integers and decimals that either use a comma or a dot plus the next word, use the following regex:
/\d+([\.,]\d+)?\s\S+/

Can this Regex be improved?

I have a regex to match a user-entered id which has the basic format of [a-zA-Z]{2}[\d]{8}, but the kicker is that a space can be placed between any of the letters or digits in the id, so my regex looks like this:
[A-Za-z]+[\s]*[A-Za-z]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*
Which is obviously an abomination and should be killed with fire, can this be improved upon?
All of the following are valid inputs
a b 1 2 2 3 4 5 5 6
ab12345678
ab 12345678
Your regex does not comply with your specification. Can there be 2 or more letters before the digits? Exactly 8 digits, or 8 digits or more?
Try
([a-zA-Z]\s*){2}(\d\s*){8}
If there can only be one space between each character:
([a-zA-Z]\s?){2}(\d\s?){8}
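A quick Python sketch of the first suggestion, checked against the three valid inputs from the question. The function name normalize_id is hypothetical, and stripping the whitespace after validation is my own addition.

```python
import re

# Two letters and eight digits, each optionally followed by whitespace.
USER_ID = re.compile(r"([a-zA-Z]\s*){2}(\d\s*){8}")

def normalize_id(text: str):
    """Return the id with whitespace removed, or None if it is invalid."""
    if USER_ID.fullmatch(text) is None:
        return None
    return re.sub(r"\s+", "", text)

for s in ["a b 1 2 2 3 4 5 5 6", "ab12345678", "ab 12345678", "abc12345678"]:
    print(repr(s), normalize_id(s))
```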
Don't ever use \d and \s unless you know EXACTLY where you are going...
\d will match 09E6 ০ BENGALI DIGIT ZERO (the ০ is your digit :-) ). For example read http://msdn.microsoft.com/en-us/library/w1c0s6bb.aspx
\s will match more types of strange spaces (and the tab character) than you can count, and I'm not kidding. http://msdn.microsoft.com/en-us/library/t809ektx.aspx
Paradoxically, by using [a-zA-Z] you are limiting your users quite a lot... No àèéìòù, nor the Turkish ı and İ (the first one is an i without the dot, lower case, the second one is the upper case version of i) http://en.wikipedia.org/wiki/Dotted_and_dotless_I .
Perhaps you could use (\p{L}\p{M}*) (with brackets) instead of [A-Za-z] (all the letters plus the combining marks). You have to add an * or a + AFTER the close bracket. The one expression is for a single letter PLUS its combining marks.
Oh... and you can use one of the other suggestions as a basis for the regex :-)
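The links above describe .NET, but the same effect can be seen in other Unicode-aware engines. As a sketch, this is how Python's re module behaves; the variable names are invented, and re.ASCII is Python's own switch, not something from the .NET documentation.

```python
import re

bengali_zero = "\u09e6"   # BENGALI DIGIT ZERO
no_break_space = "\u00a0"  # NO-BREAK SPACE

# By default \d and \s are Unicode-aware for str patterns.
digit_default = re.fullmatch(r"\d", bengali_zero) is not None
space_default = re.fullmatch(r"\s", no_break_space) is not None

# re.ASCII restricts \d to 0-9 and \s to ASCII whitespace.
digit_ascii = re.fullmatch(r"\d", bengali_zero, flags=re.ASCII) is not None
space_ascii = re.fullmatch(r"\s", no_break_space, flags=re.ASCII) is not None

# An explicit class never matches the Bengali digit.
digit_class = re.fullmatch(r"[0-9]", bengali_zero) is not None

print(digit_default, space_default, digit_ascii, space_ascii, digit_class)
```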
[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*[\d]+[\s]*
can be replaced with...
\s*(?:\d+\s*){8}
(Also, you can just write \s, rather than [\s], and \d rather than [\d] - the brackets are redundant if you're only specifying a single backslash character class.)
Edit Since there seems to be some confusion about what part of the original regex is being replaced, here's the entire expression after replacement:
[A-Za-z]+\s*[A-Za-z]+\s*(?:\d+\s*){8}
(?:[A-Za-z]+\s*){2}(?:\d+\s*){8}

Regular Expression to match set of arbitrary codes

I am looking for some help on creating a regular expression that would work with a unique input in our system. We already have some logic in our keypress event that will only allow digits, and will allow the letter A and the letter M. Now I need to come up with a RegEx that can match the input during the onblur event to ensure the format is correct.
I have some examples below of what would be valid. The letter A represents an age, so it is always followed by up to 3 digits. The letter M can only occur at the end of the string.
Valid Input
1-M
10-M
100-M
5-7
5-20
5-100
10-20
10-100
A5-7
A10-7
A100-7
A10-20
A5-A7
A10-A20
A10-A100
A100-A102
Invalid Input
a-a
a45
4
This matches all of the samples.
/A?\d{1,3}-A?\d{0,3}M?/
Not sure if 10-A10M should or shouldn't be legal, or even if M can appear with numbers. If M can only appear without numbers:
/A?\d{1,3}-(A?\d{1,3}|M)/
Use the brute force method if you have a small amount of well defined patterns so you don't get bad corner-case matches:
^(\d+-M|\d+-\d+|A\d+-\d+|A\d+-A\d+)$
Here are the individual regexes broken out:
\d+-M <- matches anything like '1-M'
\d+-\d+ <- 5-7
A\d+-\d+ <- A5-7
A\d+-A\d+ <- A10-A20
/^[A]?[0-9]{1,3}-[A]?[0-9]{1,3}[M]?$/
Matches anything of the form:
A(optional)[1-3 numbers]-A(optional)[1-3 numbers]M(optional)
^A?\d+-(?:A?\d+|M)$
An optional A followed by one or more digits, a dash, and either another optional A and some digits or an M. The '(?: ... )' notation is a Perl 'non-capturing' set of parentheses around the alternatives; it means there will be no '$1' after the regex matches. Clearly, if you wanted to capture the various bits and pieces, you could - and would - do so, and the non-capturing clause might not be relevant any more.
(You could replace the '+' with '{1,3}' as JasonV did to limit the numbers to 3 digits.)
^A?\d{1,3}-(M|A?\d{1,3})$
^ -- the match must be done from the beginning
A? -- "A" is optional
\d{1,3} -- between one and 3 digits; [0-9]{1,3} also works
- -- A "-" character
(...|...) -- Either one of the two expressions
(M|...) -- Either "M" or...
(...|A?\d{1,3}) -- ...an optional "A" followed by at least one and at most three digits
$ -- the match should be done to the end
Some consequences of changing the format. If you do not put "^" at the beginning, the match may ignore an invalid beginning. For example, "MAAMA0-M" would be matched at "A0-M".
If, likewise, you leave $ out, the match may ignore an invalid trail. For example, "A0-MMMMAAMAM" would match "A0-M".
Using \d is usually preferred, as is \w for alphanumerics, \s for spaces, \D for non-digit, \W for non-alphanumeric or \S for non-space. But you must be careful that \d is not being treated as an escape sequence. You might need to write it \\d instead.
{x,y} means the last match must occur between x and y times.
? means the last match must occur once or not at all.
When using (), it is treated as one match. (ABC)? will match ABC or nothing at all.
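The effect of the anchors can be checked with a short Python sketch. The question is about a browser onblur handler, so Python here is only a stand-in, and the names ANCHORED, UNANCHORED and is_valid are invented for illustration.

```python
import re

ANCHORED = re.compile(r"^A?\d{1,3}-(M|A?\d{1,3})$")
# Same pattern without ^ and $, to show what the anchors protect against.
UNANCHORED = re.compile(r"A?\d{1,3}-(M|A?\d{1,3})")

def is_valid(text: str) -> bool:
    """True if the whole input has the expected shape."""
    return ANCHORED.match(text) is not None

valid = ["1-M", "10-M", "100-M", "5-7", "5-20", "5-100", "10-20", "10-100",
         "A5-7", "A10-7", "A100-7", "A10-20", "A5-A7", "A10-A20",
         "A10-A100", "A100-A102"]
invalid = ["a-a", "a45", "4"]

print(all(is_valid(s) for s in valid), any(is_valid(s) for s in invalid))

# Without anchors, a valid fragment inside junk is still found.
for junk in ["MAAMA0-M", "A0-MMMMAAMAM"]:
    print(junk, UNANCHORED.search(junk).group(0), is_valid(junk))
```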
I’d use this regular expression:
^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$
This matches either:
<number>-M or <number>-<number>
A<number>-<number> or A<number>-A<number>
Additionally <number> must not begin with a 0.
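A Python sketch of this last pattern with a few inputs that show where it is stricter than simply allowing an optional A on each side. The function name is_valid_code and the extra samples are my own.

```python
import re

# <number> is [1-9]\d{0,2}: one to three digits with no leading zero.
CODE = re.compile(
    r"^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})"
    r"|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$"
)

def is_valid_code(text: str) -> bool:
    """True if text is one of the four accepted shapes."""
    return CODE.match(text) is not None

for s in ["1-M", "5-100", "A10-20", "A100-A102", "05-7", "5-A7", "A10-M"]:
    print(s, is_valid_code(s))
```

With this pattern a leading zero, an A only on the right-hand side, and an A combined with M are all rejected. Whether those should be invalid is not stated in the question, so treat that as an assumption.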