Using Regex to find repeating groups in phone numbers - regex

I'm looking for a way to use regex to search for obviously false phone numbers that have the same digit repeating. The numbers are all formatted and stored as follows:
(111)111-1111
I'm not able to alter the text in any way.
I've tried modifying a few of the regex lines I've seen such as:
^([0-9])\1{2}.\1{3}.\1{4}$
which was for finding repeating digits with a period in between the numbers. However, I haven't figured out how to get around the first character as a parenthesis.
Any help would be appreciated!

You misunderstand the purpose of the . Dot Operator. It is not to match a period, it matches anything. In that (quite badly) regex, it serves only to skip the - – and because it matches anything, it will also match something like 11121113111.
Use this regexp instead:
^\(?([0-9])\1{2}\)?\1{3}-?\1{4}$
This checks for parentheses around the first group, optionally so it will still work without; and specifically checks for the presence of a dash between the second and third group of digits, also optionally.

Related

Finding numbers with decimal comma in a TeX-File using regular expressions

I have a very large TeX-File wherein we set mathematical text. Unfortunately, TeX sets a empty space after each comma, even though numbers (in Germany) are formatted without space after comma.
I am aware that there are packages that solve this problem automatically, but I tried to find such commas with a regular expression before the TeX formatter would do its thing, so I could wrap these commas in braces. For example, I want 1,2 to become 1{,}2. But my regular expression approach does not work as expected:
I used "[0-9],[0-9]" resp. "\d,\d" to look for all places where two digits are separated with a decimal comma. That works fine. But I was wondering why for input "1,2,3" it only finds the match "1,2" but not "2,3".
There must be something I don't quite understand about regular expressions. How would I have to alter the expression to find overlapping instances of the search pattern?
Once a character has been matched, it is not matched again. As in your example "2" is matched, the "cursor" of the regex engine has now passed that "2", and so the next match can only be found in ",3" -- and there no match is found.
To resolve that, use a look-ahead assertion for the decimal digits without actually capturing them:
\d,(?=\d)
The matches will be "1," and "2," and now you can make the replacement as you want (adding braces), knowing these matches guarantee there is a digit following.

How do I create a regex expression that does not allow the same 9 duplicate numbers in a social security number, with or without hyphens?

The first thing I tried to do, is get the regex matching what I DON'T want. This way, I could just flip it to NOT accept that same input. This is where I came up with the first part of this regex.
Accept all 9 digit numbers, where all 9 digits are identical (without dashes): "^(\d)\1{8}$". This expression works as expected (as seen here: (https://regex101.com/r/Ez8YC3/1)).
The second expression should do the same, with dashes formatted as follows xxx-xx-xxxx: "^(\d)\1{8}$". This expressions works as expected (as seen here: https://regex101.com/r/bodzIX/1).
Now what I want to do at this point, is combine them together to look for BOTH conditions. However when I do that it seems to break, and only match 9 digit numbers that are identical throughout WITH dashes: "^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$|^(\d)\1{8}$". This can be seen here: https://regex101.com/r/lPnksf/1.
I may be getting a little ahead of myself here, but in order to show my work as much as possible, I also tried flipping those regex separately, which also did not work as expected.
Condition #1 flipped: "^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/ed51yk/1.
Condition #2 flipped: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$". Can be seen here: https://regex101.com/r/UYfoMK/1.
I would expect the two expressions (when flipped) to match any 9 digit number (with or without dashes) where all numbers are not identical. How ever this does not happen at all.
This is the final regex that I came up with, which is clearly not doing what I would expect it to: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$|^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/9eHhF5/1
At the end of the day, I want to combine these 2 expressions, with this one (that already works as intended): "^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$". Can be seen here: https://regex101.com/r/AdRI8i/1.
I am still pretty new to regex, and really want to understand why I can't simply wrap the condition in (?!...) in order to match the opposite condition.
Thank you in advance
What you want to do is not flip, but reverse the regex logic.
Yes, to reverse the pattern logic, you should use a negative lookahead, but there are caveats.
First, the $ end of string anchor: if it was at the end of the "positive" regex, it must also be moved to the lookahead in the reverse pattern. So, your ^(?!(\d)\1{8})$ regex must be written as ^(?!(\d)\1{8}$). Same goes for your second regex.
Next, mind that each subsequent capturing group gets an incremented ID number, so you cannot keep the same backreferences when you "join" patterns with OR | operator. You must adjust these IDs to reflect their new values in the new regex.
So, you want to match a string that matches ^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$ first (let's note \d\d\d\d = \d{4}), then you can add restrictions with lookaheads:
(?!(\d)\1{8}$) - fails the match if, immediately from the current position, it matches identical 9 digits and then the string end comes
(?!(\d)\2\2-(\d)\2-(\d)\2{3}$) - (note the ID incrementing continuation) fails the match if, immediately from the current position, it matches identical to the first one 3 digits, -, identical 2 digits, -, identical 5 digits, and then the string end comes.
So, to follow your logic, you can use
^(?!(\d)\1{8}$)(?!(\d)\2\2-(\d)\2-(\d)\2{3}$)(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d{4}$
See the regex demo
As the lookaheads are non-consuming patterns, i.e. the regex index remains at the same position after matching their pattern sequences where it was before, the 3 lookaheads will all be tried at the start of the string (see the ^ anchor). If any of the three negative lookaheads at the start fails, the whole string match will be failed right away.
By this Regex you match what you dont want as social security number:
^(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})$
Demo
By this regex you match only what you want:
^(?!(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})).*$
Demo

Putting a group within a group [123[a-u]]

I'm having a lot more difficulty than I anticipated in creating a simple regex to match any specific characters, including a range of characters from the alphabet.
I've been playing with regex101 for a while now, but every combination seems to result in no matches.
Example expression:
[\n\r\t\s\(\)-]
Preferred expression:
[[a-z][a-Z]\n\r\t\s\(\)-]
Example input:
(123) 241()-127()()() abc ((((((((
Ideally the expression will capture every character except the digits
I know I could always manually input "abcdefgh".... but there has to be an easier way. I also know there are easier ways to capture numbers only, but there are some special characters and letters which I may eventually need to include as well.
With regex you can set the regex expression to trigger on a range of characters like in your above example [a-z] that will capture any letter in the alphabet that is between a and z. To trigger on more than one character you can add a "+" to it or, if you want to limit the number of characters captured you can use {n} where n is the number of characters you want to capture. So, [a-z]+ is one or more and [a-z]{4} would match on the first four characters between a and z.
You can use partial intervals. For example, [a-j] will match all characters from a to j. So, [a-j]{2} for string a6b7cd will match only cd. Also you can use these intervals several times within same group like this: [a-j4-6]{4}. This regex will match ab44 but not ab47
Overlooked a pretty small character. The term I was looking for was "Alternative" apparently.
[\r\t\n]|[a-z] with the missing element being the | character. This will allow it to match anything from the first group, and then continue on to match the second group.
At least that's my conclusion when testing this specific example.

positive look ahead and replace

Recently I'm writing/testing regexps on https://regex101.com/.
My question is: Is it possible to do a positive look-ahead AND a replacement in the same "replacement"? Or just limited kind of replacement is possible.
Input is several lines with phone numbers. Let's say the correct phone number where the number of "numbers" are 11. No matter how the numbers are divided/group together with - / characters, no matter if starts with + 00 or it is omitted.
Some example lines:
+48301234567
+48/30/1234567
+48-30-12-345-67
+483011223344556677
0048301234567
+(48)30/1234567
Positive look-ahead able to check if from the beginning until the end of line there are only 11 digits, regardless how many other, above specified character separating them. This works perfectly.
Where the positive look-ahead check is fine, I would like to delete every character but numbers. The replacement works fine until I'm not involving look-ahead.
Checking the regexp itself working perfectly ("gm" modes):
^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$
Checking the replace part works perfectly (replace to nothing):
[^\d\n]
Put this into look-ahead, after the deletion of non new-line and non-digit characters from the matching lines:
(?=^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$)[^\d\n]
Even I put the ^ $ into look-ahead, seems the replacement working only from beginning of the lines until the very first digit.
I know in real life the replacement and the check should/would go separate ways, however I'm curious if I could mix look-ahead/look-behind with string operations like replace, delete, take the string apart and put together as I like.
UPDATE: This is what would do the trick, however I feel this one "ugly" a bit. Is there any prettier solution?
https://regex101.com/r/yT5dA4/2
Or the version which I asked originally, where only digits remains: regex101.com/r/yT5dA4/3
You cannot replace/delete text with regex. Regex is just a tool for matching certain strings and then taking certain action depending on the matching text, eg. perform a substitution, retrieve the second capture group.
However it is possible to perform certain decisions within a regex engine, by using conditionals. The common syntax for this, with a lookahead assertion, is (?(?=regex)then|else).
With conditionals you can change the behaviour depending on how the text matches the regex. For your example you could do something like:
^(\+)?(?(1)\(|\d)
If the phone number starts with a plus it must be followed by a bracket, else it should start with a digit. Although in your situation, this is not very useful.
If you want to read up more on conditionals in regex you can do so here.

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the information before attempting the match. So it looks for between 3-10 characters in the class of [\d-] followed by a dash and a digit. And then after that you have the actual match to confirm that the format of your string is actually digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so no need to put a {5,n} quantifier somewhere, it's useless).
Patterns are also build to fail faster:
I have chosen to start them with a digit \d, this way each beginning of a line or word-boundary not followed by a digit is immediately discarded. Other thing, using only one digit, I know the remaining string length.
Then I test the upper limit of the string length with a negative lookahead that test if there is one more character than the maximum length (if there are 12 characters at this position, there are 13 characters at least in the string). No need to use more descriptive that the dot meta-character here, the goal is to quickly test the length.
finally, I describe the end of string without doing something particular. That is probably the slower part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.