Finding repetition by Vim's Regex and globbing - regex

How can you find the repetiting sequences of at least 30 numbers?
Sample of the data
2.3758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840546697038724373576309794988610478359908883826879271070615034168564920273348519362186788154897494305239179954441913439635535307517084282460136674259681093394077448747152619589977220956719817767653758542141230068337129840547
My attempt in Vim
:g/\(\d\{4}\)\[^\1\]\1/
|
|----------- Problem here!
I do not know how you can have the negation of the first glob.

First of all, to find your repeating numbers, you can use this simple search:
/\(\d\{5\}\).\{-}\1
This search finds repetitions of 5 digits. Unfortunately, vim highlights from the start of the 5 digit number to the end of the repetition - including every digit in between - and this makes it hard to see what the 5 digit number is. Also, because your number sequence repeats so much, the whole thing is highlighted because there are repeats all the way through.
You will probably find it's more useful to use :set incsearch and type /\(\d\{5\}\).\{-}\1 or /\(\d\{5\}\)\ze.\{-}\1 without hitting enter so you can see what the digits are.
This command might be more useful to you:
:syn region repeatSection matchgroup=Search start=/\z(\d\{30}\)/ matchgroup=Error end=/\z1/ oneline
This will highlight a sequence of 30 digits in yellow (first time it is seen) or red (when it is repeated). Note that this only works for a single line of text (multi-line isn't possible).

How about :g/\(\d\{30,\}\{2,\}\)/?

I'm not sure why you need the negation. /\(\d\{4\}\)\1/ will match a sequence of (exactly) four digits, repeated once. You probably actually want something like /\(\d\{30,\}\)\1/ to get your "at least 30". This appears to work for me, unless I've misunderstood what you're trying to search for. Note that since the regex are greedy, you will get the longest possible repeated sequence.

If it helps you on the way, the appropriate way to make sure that the following set of characters aren't the same as what is stored in back-reference #1 would be (?!\1). Note that the (?!) (negative look-ahead) group is a zero-width assertion (i.e., it will not change the position of the cursor, it just checks whether the regex should fail or not.)
Whether that is supported by the regex engine you're using, I don't know.
Update
I just had a quick sketch on paper, and something along these lines might work in PCRE... but I haven't tested it and can't right now, but maybe it'll give you some ideas:
(?=(\d{30}))\d(?=\d{29,}?\1)
To ensure that I understood you correctly, the purpose of the above regex would be to match any sequence of 30 digits that also exists later in the whole string being searched.
My thoughts for the above regex were these:
First I want to match a sequence of 30 digits, but I don't want to consume them since I want to check 1 digit later (not 30) next time. Therefore I use a look-ahead with a capturing group that stores the next 30 digits.
Then I consume one digit to ensure I don't match the 30 digits with themselves.
Then I match at least 29 digits (which means I'll be starting on the digit just outside the current sequence of digits) with a non-greedy quantifier, so that it will try 30, then 31, etc.
Then I match the 30 digits I'm currently testing. If they exist later in the sequence, the regular expression will succeed; otherwise, it will fail.

This command will match lines with 123451234 but not 111111111
:g/\(\d\{4}\)\1\#!.\1/
\1\#!. uses a negative lookahead to say "make sure this position doesn't match (\#!) group 1 (\1), then consume a character (.)"

Related

How do I create a regex expression that does not allow the same 9 duplicate numbers in a social security number, with or without hyphens?

The first thing I tried to do, is get the regex matching what I DON'T want. This way, I could just flip it to NOT accept that same input. This is where I came up with the first part of this regex.
Accept all 9 digit numbers, where all 9 digits are identical (without dashes): "^(\d)\1{8}$". This expression works as expected (as seen here: (https://regex101.com/r/Ez8YC3/1)).
The second expression should do the same, with dashes formatted as follows xxx-xx-xxxx: "^(\d)\1{8}$". This expressions works as expected (as seen here: https://regex101.com/r/bodzIX/1).
Now what I want to do at this point, is combine them together to look for BOTH conditions. However when I do that it seems to break, and only match 9 digit numbers that are identical throughout WITH dashes: "^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$|^(\d)\1{8}$". This can be seen here: https://regex101.com/r/lPnksf/1.
I may be getting a little ahead of myself here, but in order to show my work as much as possible, I also tried flipping those regex separately, which also did not work as expected.
Condition #1 flipped: "^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/ed51yk/1.
Condition #2 flipped: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$". Can be seen here: https://regex101.com/r/UYfoMK/1.
I would expect the two expressions (when flipped) to match any 9 digit number (with or without dashes) where all numbers are not identical. How ever this does not happen at all.
This is the final regex that I came up with, which is clearly not doing what I would expect it to: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$|^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/9eHhF5/1
At the end of the day, I want to combine these 2 expressions, with this one (that already works as intended): "^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$". Can be seen here: https://regex101.com/r/AdRI8i/1.
I am still pretty new to regex, and really want to understand why I can't simply wrap the condition in (?!...) in order to match the opposite condition.
Thank you in advance
What you want to do is not flip, but reverse the regex logic.
Yes, to reverse the pattern logic, you should use a negative lookahead, but there are caveats.
First, the $ end of string anchor: if it was at the end of the "positive" regex, it must also be moved to the lookahead in the reverse pattern. So, your ^(?!(\d)\1{8})$ regex must be written as ^(?!(\d)\1{8}$). Same goes for your second regex.
Next, mind that each subsequent capturing group gets an incremented ID number, so you cannot keep the same backreferences when you "join" patterns with OR | operator. You must adjust these IDs to reflect their new values in the new regex.
So, you want to match a string that matches ^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$ first (let's note \d\d\d\d = \d{4}), then you can add restrictions with lookaheads:
(?!(\d)\1{8}$) - fails the match if, immediately from the current position, it matches identical 9 digits and then the string end comes
(?!(\d)\2\2-(\d)\2-(\d)\2{3}$) - (note the ID incrementing continuation) fails the match if, immediately from the current position, it matches identical to the first one 3 digits, -, identical 2 digits, -, identical 5 digits, and then the string end comes.
So, to follow your logic, you can use
^(?!(\d)\1{8}$)(?!(\d)\2\2-(\d)\2-(\d)\2{3}$)(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d{4}$
See the regex demo
As the lookaheads are non-consuming patterns, i.e. the regex index remains at the same position after matching their pattern sequences where it was before, the 3 lookaheads will all be tried at the start of the string (see the ^ anchor). If any of the three negative lookaheads at the start fails, the whole string match will be failed right away.
By this Regex you match what you dont want as social security number:
^(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})$
Demo
By this regex you match only what you want:
^(?!(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})).*$
Demo

How to write a Regex that identifies specific letters plus a minimum amount of numbers

I'm trying to write a regex that can locate IDs in a body of text. The ID starts with "DW" and has a minimum of 5 numbers after that. It will only have numbers and no other characters following that.
Correct Examples
DW40056
DW4000057
Wrong Examples
DW4005
DW405679fg
Use word boundaries around DW followed by 4 digits then one or more digits:
\bDW\d{4}\d+\b
See live demo.
The word boundaries prevent matches with input such as ABCDW12345XYZ etc.
Although you could code the digits part as\d{5,}, which is simpler than \d{4}\d+, not all engines support open-ended quantity ranges. Since you haven’t indicated the language/tool you’re using, this regex is going to work in more situations.
Try this pattern: DW\d{5,}$
See Demo
Explanation:
DW is two characters that id start with
\d is for 0-9 numbers
{5,} it means \d must appear five or more times
$ it means the end of string. this cause this pattern just take strings that end with numbers (no more characters after numbers)

Numbers between 99 and 9999999 regular expression

I am trying to generate a regular expression that will match any numbers within the range of 99 and 9999999. I have trouble understanding how generating number ranges generally works. I managed to find a range generator online that does the job for me, but I want to understand how it actually works.
My attempt to do this range is as follows:
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
This is supposed to match 99, any 3 digit number or any 4 digit number, but it does not work as expected. When tested it matches only numbers 99 and 3 digit numbers. Four digit numbers are not matched at all. If I only write the part for 4 digit numbers on its own as
[1-9][0-9][0-9][0-9]
It matches 4 digit numbers, but when I construct it as in the first example it does not work. Can someone give me some clarification how this actually works and how successfully to generate a regular expression for the range of 99 to 9999999.
Link to demo - Here
So you want to know how this works...
Regexs have no real understanding of the values of numbers in your string, it only cares how they are represented, which is why looking for numbers in a range seems more awkward than it should be. The only reason your regex engine can understand a range in a character class like [0-9] at all is because of the characters' positions in a list (a character range like [&-~] is just as valid, and equally understandable to it.)
So, to match a range like 99-9999999, ya gotta spell out what that looks like: literal "99", or three digits without a leading zero, or four digits without a leading zero, and so on.
But this is what your demo did, right? And it didn't work. Of your test string "9293" your regex only matched "929". What happened here is the regex engine is eager to return a complete match - as soon as it found one it returned it, even though a better/longer match might have occurred later.
Here's how that match happened. (I'll skip some details like grouping, as they're not super relevant here.)
Step 1.
The engine compares the first token in the regex with the first character in the string
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success, they match.
Step 2.
The engine then advances both to the next token in the regex and the next character in the string and compares them.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ❌
Failure, no match. The engine would stop and return the failure here, but you're using alternation via |, so it knows there's an alternate expression to try.
Step 3.
The engine advances to the first token of the next alternate expression in the regex, and rewinds the position in the string.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success, they match.
Step 4.
Continuing on.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Match.
Step 5.
And again.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success. The complete expression matches. There's no need to try the remaining alternate. The match here returned is:
929
As you've probably figured out, if your input string was instead "9923" then step 2 would've matched and the engine there would've stopped and returned "99".
As you've also probably figured out, if you rearrange your alternate expressions from longest to shortest
([1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|99)
the longest would be attempted first, which would match and return your expected "9293".
Simplifying
It's still pretty wordy though, especially as you crank up the number of digits in your range. There are a couple things you can do to simplify it.
The character class [0-9] can be represented by the shorthand character class \d.
([1-9]\d\d\d|[1-9]\d\d|99)
And instead of repeating them use a quantifier in curly brackets like so:
([1-9]\d{3}|[1-9]\d{2}|99)
As it happens, quantifiers can also take the form of {min, max}, so you can combine the two similar alternates:
([1-9]\d{2,3}|99)
You might expect this to land you back returning "929" again, the engine being eager and all, but quantifiers are by default greedy so they'll try to pick up as much as they can. This lends itself well to your larger desired range:
([1-9]\d{2,6}|99)
Finishing up
What you do with it from here depends on what you need the regex to do. As it stands the parentheses are superfluous, there's no point in creating a capturing group of the entire regex itself. However a decision comes when you've got an input string like:
You will likely be eaten by 1000 grue.
If you're trying to pluck out how many grue are about to eat you, you might use
[1-9]\d{2,6}|99
which will return 1000.
However that sorta runs back into the original problem with your demo. If it's "12345678 grue", which is out of range, this'll match "1234567" which might not be what you want. You can make sure the number you've matched isn't immediately followed by (or preceded by) another digit by using negative lookarounds.
(?<!\d)([1-9]\d{2,6}|99)(?!\d)
(?<!\d) means "from this position, the prior character is not a digit" while (?!\d) means "from this position, the next character is not a digit."
The parentheses around the alternates are back as they're necessary for grouping here, otherwise the lookbehind would only be part of and apply in the first alternate expression and the lookahead would only be part of and apply in the second alternate.
On the other hand if you're trying to make sure the entire string only consists of a number in your range you'll want to instead use the anchors ^ and $ (start of string and end of string, respectively):
^([1-9]\d{2,6}|99)$
And finally you can trade the capturing group out for a non-capturing group (?:...), so:
^(?:[1-9]\d{2,6}|99)$
or
(?<!\d)(?:[1-9]\d{2,6}|99)(?!\d)
You'll still grab the number as the match, it just won't be repeated in a group capture. (Lookarounds are already non-capturing, no need to worry about those.)
First of all you need some string boundaries for you regex (anything except digit, in my example I use ^ and $ -- begging and end of line or string)
Try this one:
^([1-9][0-9]{2,6}|99)$

Regex cannot have a number in the first position or last and must be 6-20 characters long

Good Morning all.
I have a regex that is not fulfilling what I need. I cannot start or end with a number. I must have at least one symbol, one upper case letter, one lower case letter and of course a number in between the outer "boundaries" I have described. The regex must be at least 6 characters long and 20 max.
Below is my regex:
^([^0-9](?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[_!?##$%])[^0-9]).{6,20}$
The issue I am having is I cannot seem to get the number boundaries and length correct.
For example, in a regex tester this is acceptable,
MaA1?kss1111111
but is not acceptable to what I need.
But this would be acceptable,
Mk?1wK
I have no starting number and no ending number.
I would appreciate any help.
You can tweak your regex to this:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[_!?##$%])[^0-9].{4,18}[^0-9]$
It uses .{4,18} (2 less than your length requirements) due to the reason you have [^0-9] at start and end of your regex.
RegEx Demo 1
Alternatively (and this is my preferred solution as well) you can check for {6,20} length and check presence of non-digit at start/end using a negative lookahead:
^(?![0-9]|.*[0-9]$)(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[_!?##$%]).{6,20}$
RegEx Demo 2

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the information before attempting the match. So it looks for between 3-10 characters in the class of [\d-] followed by a dash and a digit. And then after that you have the actual match to confirm that the format of your string is actually digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so no need to put a {5,n} quantifier somewhere, it's useless).
Patterns are also build to fail faster:
I have chosen to start them with a digit \d, this way each beginning of a line or word-boundary not followed by a digit is immediately discarded. Other thing, using only one digit, I know the remaining string length.
Then I test the upper limit of the string length with a negative lookahead that test if there is one more character than the maximum length (if there are 12 characters at this position, there are 13 characters at least in the string). No need to use more descriptive that the dot meta-character here, the goal is to quickly test the length.
finally, I describe the end of string without doing something particular. That is probably the slower part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.