Backreferencing something without putting it in the rest of the expression - regex

I am trying to make a regular expression that will match all words that have a letter that repeats at least an arbitrary number of times.
For example, if I want to match words that have a letter that repeats at least 3 times, I would want to match words like
applepie banana insidious
I want to be able to change the number of repeats I'm looking for by just changing one number in my expression, so expressions that only work for a certain number of repeats are not what I'm looking for.
Currently, this is what I'm using
^(?=.*(.))(?=(.*\1){4}).*$
Where 4 is the number of repeats, a number that I can change to whatever number of repeats I'm looking for.
The above regular expression appears to work, but using a lookahead just so I can use a capturing group seems very unwieldy, and so I'm looking for a better way to solve this problem.

This will eliminate one lookahead:
\b(?=\w*(\w)(\w*\1){2})\w*
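This anchors at the start of a word, then uses a lookahead to assert that the word consists of any number of word characters, one particular (captured) word character, and then any number of word characters followed by that same character again, with that last part repeated twice. Because \w* can skip over extra occurrences, this means "at least" that many repeats. If the lookahead succeeds, the trailing \w* consumes the whole word.
For four repetitions, use {3}; in general, for n repetitions use {n-1}.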
Also, feel free to replace \b... with ^...$ as you were doing if you meant to match whole lines and not words in text.
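As a rough illustration of the "change one number" idea, here is a minimal Python sketch that builds the lookahead pattern above for a given n. The function name words_with_repeated_letter and the sample sentence are assumptions made for this example, and the repeated part is written as a non-capturing group; other regex flavours may behave differently.

```python
import re

def words_with_repeated_letter(text, n):
    """Return the words in `text` in which some character occurs at least n times."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # After the captured character, n - 1 further occurrences are required.
    pattern = re.compile(r"\b(?=\w*(\w)(?:\w*\1){" + str(n - 1) + r"})\w*")
    # finditer is used because findall would return only the captured group.
    return [m.group(0) for m in pattern.finditer(text)]

print(words_with_repeated_letter("applepie banana insidious hello", 3))
```

Only the argument n changes between runs; the rest of the expression stays the same.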

You can use this regex:
\b\w*?(\w)(?=(?:\w*?\1){2})\w*\b
Where 2 is n-1 for n repetitions you're trying to find in a complete word.

Related

Regex for a series of known words separated by periods, optionally with number between

I am having a hard time figuring out the right regex for a problem I'm trying to solve. It does not need to be concise, or even a single expression for all cases. Some examples of what I'd be looking to match:
my_word
my_word.0
my_word.1
my_word.0.my_other_word
my_word.my_other_word
my_word.my_other_word.0
my_word and my_other_word are order dependent, and must be an exact match. Everything will be delimited with ., and there may or may not be numbers between them with the same delimiter. There may be an arbitrarily long list of words, and the numbers can be any length. Two numbers will not follow each other, but two words might. I understand this may require a series of regular expressions, and that's fine. As long as the pattern is clear, I can probably autogenerate the regex when I have the list of words to check for.
Any help is much appreciated!
I would use the following regex pattern:
^(?!.*\.\d+\.\d+)(?:[A-Za-z_]+|[0-9]+)(?:\.(?:[A-Za-z_]+|[0-9]+))*$
Here is an explanation:
^ from the start of the input
(?!.*\.\d+\.\d+) assert that two numbers separated by a dot do NOT occur in succession
(?:[A-Za-z_]+|[0-9]+) match a word (letters + underscore) or number
(?:\.(?:[A-Za-z_]+|[0-9]+))* then match zero or more words/numbers, separated by dots
$ end of the input
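As a quick check of the pattern explained above, the following Python sketch runs it against the examples from the question. The helper name is_valid and the two rejected inputs are assumptions added for illustration. Note that the pattern accepts any word made of letters and underscores; restricting it to a known word list is not shown here.

```python
import re

PATTERN = re.compile(
    r"^(?!.*\.\d+\.\d+)"              # no two dot-separated numbers in a row
    r"(?:[A-Za-z_]+|[0-9]+)"          # first token: a word or a number
    r"(?:\.(?:[A-Za-z_]+|[0-9]+))*$"  # further tokens, each preceded by a dot
)

def is_valid(s):
    """Hypothetical helper: True when the whole string fits the pattern."""
    return PATTERN.match(s) is not None

examples = [
    "my_word", "my_word.0", "my_word.1",
    "my_word.0.my_other_word", "my_word.my_other_word",
    "my_word.my_other_word.0",
]
for s in examples + ["my_word.0.1", "my_word..0"]:
    print(s, is_valid(s))
```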

How to write a Regex that identifies specific letters plus a minimum amount of numbers

I'm trying to write a regex that can locate IDs in a body of text. The ID starts with "DW" and has a minimum of 5 numbers after that. It will only have numbers and no other characters following that.
Correct Examples
DW40056
DW4000057
Wrong Examples
DW4005
DW405679fg
Use word boundaries around DW followed by 4 digits then one or more digits:
\bDW\d{4}\d+\b
The word boundaries prevent matches with input such as ABCDW12345XYZ etc.
Although you could code the digits part as \d{5,}, which is simpler than \d{4}\d+, not all engines support open-ended quantity ranges. Since you haven't indicated the language or tool you're using, this regex is going to work in more situations.
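For illustration, here is a small Python sketch (Python is an assumption, since the question names no language) that scans a made-up sentence containing the question's examples. In Python's re module both ways of writing the digit count are available and give the same result here.

```python
import re

# Sample text invented for this example; the IDs come from the question.
text = "Orders DW40056 and DW4000057 shipped; DW4005 and DW405679fg were rejected."

portable = re.compile(r"\bDW\d{4}\d+\b")  # pattern from the answer
compact = re.compile(r"\bDW\d{5,}\b")     # same idea with an open-ended range

print(portable.findall(text))
print(compact.findall(text))
```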
Try this pattern: DW\d{5,}$
Explanation:
DW is the two-character prefix that every ID starts with
\d matches a digit, 0-9
{5,} means \d must appear five or more times
$ means the end of the string, so the pattern only accepts strings that end with the digits (no other characters after them)
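Because of the $ anchor, this pattern seems better suited to checking one candidate string at a time than to scanning running text. A minimal Python sketch of that behaviour follows; the last candidate is an assumption added to show the effect of the anchor.

```python
import re

pattern = re.compile(r"DW\d{5,}$")

candidates = ["DW40056", "DW4000057", "DW4005", "DW405679fg", "see DW40056 today"]
# An ID followed by more text fails, because $ requires the digits to end the string.
results = {c: bool(pattern.search(c)) for c in candidates}
for c, ok in results.items():
    print(c, ok)
```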

Putting a group within a group [123[a-u]]

I'm having a lot more difficulty than I anticipated in creating a simple regex to match any specific characters, including a range of characters from the alphabet.
I've been playing with regex101 for a while now, but every combination seems to result in no matches.
Example expression:
[\n\r\t\s\(\)-]
Preferred expression:
[[a-z][a-Z]\n\r\t\s\(\)-]
Example input:
(123) 241()-127()()() abc ((((((((
Ideally the expression will capture every character except the digits
I know I could always manually input "abcdefgh".... but there has to be an easier way. I also know there are easier ways to capture numbers only, but there are some special characters and letters which I may eventually need to include as well.
With a regex you can match a range of characters, as in your example above: [a-z] matches any lowercase letter between a and z. To match more than one character you can add a "+" to it or, if you want a fixed number of characters, you can use {n}, where n is the number of characters you want to match. So [a-z]+ matches one or more letters, and [a-z]{4} matches exactly four consecutive letters between a and z.
You can use partial intervals. For example, [a-j] will match all characters from a to j. So, [a-j]{2} for string a6b7cd will match only cd. You can also use several intervals within the same character class, like this: [a-j4-6]{4}. This regex will match ab44 but not ab47.
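The interval examples above can be checked with a few lines of Python (the choice of Python is an assumption; the patterns themselves are taken from the answer):

```python
import re

pairs = re.findall(r"[a-j]{2}", "a6b7cd")             # two consecutive letters in a-j
ok_44 = bool(re.fullmatch(r"[a-j4-6]{4}", "ab44"))    # every character is in a-j or 4-6
ok_47 = bool(re.fullmatch(r"[a-j4-6]{4}", "ab47"))    # 7 lies outside 4-6

print(pairs, ok_44, ok_47)
```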
Overlooked a pretty small character. The term I was looking for was "alternation", apparently.
[\r\t\n]|[a-z] with the missing element being the | character. This lets the expression match a character from either the first class or the second.
At least that's my conclusion when testing this specific example.
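Alternation works, but for this case a single character class can also hold ranges and individual characters side by side, without nesting brackets. A minimal Python sketch, using the example input from the question (the variable names are assumptions for this example):

```python
import re

sample = "(123) 241()-127()()() abc (((((((("

# One class can hold ranges and single characters side by side.
single_class = re.compile(r"[a-zA-Z\n\r\t\s()-]")
# The same set written as an alternation of two classes.
alternation = re.compile(r"[\n\r\t\s()-]|[a-zA-Z]")

kept = "".join(single_class.findall(sample))
print(kept)
```

Both expressions pick out every character of the sample except the digits.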

Regex : Find a number between space

I am trying to extract a zip code of six numbers starting with the number 4 from a string. Right now I am using [4][0-9]{5}, but it is also matching starting from other numbers, like 020-25468811 and it's returning 468811. I don't want it to search in the middle of a number, only full numbers.
Try to use the following:
(?<!\d)4\d{5}(?!\d)
I.e. find 6-digit number starting with 4 and not preceded or followed by digit.
Your expression currently matches any run of six digits consisting of a 4 followed by five digits between 0 and 9, wherever that run occurs. To fix this behavior you should add word boundaries as per Jon's suggestion.
\b[4][0-9]{5}\b
More on word boundaries here: http://www.regular-expressions.info/wordboundaries.html
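To see how the original expression, the lookaround version and the word-boundary version differ, here is a short Python comparison. The sample text is made up for this example; the last item shows that \b also refuses a number glued to letters, while the lookarounds only care about neighbouring digits.

```python
import re

# Invented sample: a phone number, a standalone code, and a code glued to letters.
text = "Call 020-25468811, PIN 411001, ref PIN411001"

naive = re.findall(r"[4][0-9]{5}", text)
lookaround = re.findall(r"(?<!\d)4\d{5}(?!\d)", text)
boundary = re.findall(r"\b[4][0-9]{5}\b", text)

print(naive)
print(lookaround)
print(boundary)
```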
You could simply add a space to the beginning of your regular expression: " 4[0-9]{5}". If you need a more universal way of finding the beginning of the number (could it also be a tab, a newline, etc.?) you should have a look at the predefined character class \s. Also have a look at boundary matchers. I don't know which language you are using, but regexes work very similarly in most languages. Check this Java regex documentation.
There is a start of line character in regex: ^
You could do:
^4[0-9]{5}
If the numbers are not always in the beginning of a line, you can more generally use:
\<4[0-9]{5}\>
To match only whole words.
Both examples work with egrep.
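The \< and \> word anchors are a grep/egrep feature and are not understood everywhere. As an illustration in Python (an assumption about tooling, since the question names no language), the re module treats \< and \> as literal angle brackets, so \b is the closest equivalent there. The sample text is invented for this sketch.

```python
import re

text = "411001 <411001> 9411001"

# In Python, \< and \> match the characters < and > themselves.
literal_brackets = re.findall(r"\<4[0-9]{5}\>", text)
# \b gives the whole-word behaviour described for egrep.
word_bounded = re.findall(r"\b4[0-9]{5}\b", text)

print(literal_brackets)
print(word_bounded)
```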

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the input before attempting the match. It looks for between 3 and 10 characters from the class [\d-], followed by a dash and a single digit. After that comes the actual match, which confirms that the string really has the format digits(dash)digits(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
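The match list above can be reproduced with a short Python script. This is only a sketch: the question used RegExr, so running it through Python's re module is an assumption, as are the variable names.

```python
import re

samples = [
    "22-22-1", "22-22-22", "333-333-1", "333-4444-1", "4444-4444-1",
    "4444-55555-1", "55555-4444-1", "666666-7777777-1",
    "88888888-88888888-1", "1-1-1", "88888888-88888888-22",
    "22-333-", "333-22",
]

pattern = re.compile(r"(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b")

# Scan the samples as one block of text, one entry per line.
matched = pattern.findall("\n".join(samples))
print(matched)
```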
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
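The alternation follows a regular scheme: if the first group has k digits, the second may have at most 9 - k, leaving one digit for the end. As a sketch of that idea, the following Python helper generates an equivalent expression; build_pattern and its max_digits parameter are hypothetical names introduced here, not part of the answer.

```python
import re

def build_pattern(max_digits=10):
    """Build the alternation for `max_digits` total digits (hypothetical helper)."""
    budget = max_digits - 1  # digits available to the first two groups
    branches = []
    for first in range(1, budget):
        # `first` digits in group one leave at most budget - first for group two.
        branches.append(r"\d{%d}-\d{1,%d}" % (first, budget - first))
    return re.compile(r"\b(?:" + "|".join(branches) + r")-\d\b")

pattern = build_pattern()
for s in ["1-12345678-1", "123456-1-1", "666666-7777777-1"]:
    print(s, bool(pattern.search(s)))
```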
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so no need to put a {5,n} quantifier somewhere, it's useless).
The patterns are also built to fail faster:
I have chosen to start them with a digit \d; this way, each beginning of a line or word boundary not followed by a digit is immediately discarded. Also, by consuming exactly one digit first, I know the maximum length of the remaining string.
Then I test the upper limit of the string length with a negative lookahead that tests whether there is one more character than the maximum length (if there are 12 characters at this position, there are at least 13 characters in the string). There is no need to use anything more descriptive than the dot meta-character here; the goal is to quickly test the length.
Finally, I describe the end of the string without doing anything particular. That is probably the slowest part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.
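As a sketch of how the multiline variant behaves on the sample data from the question, here it is in Python, where the m modifier corresponds to re.MULTILINE (Python is an assumption, since the question was tested in RegExr):

```python
import re

lines = [
    "22-22-1", "22-22-22", "333-333-1", "333-4444-1", "4444-4444-1",
    "4444-55555-1", "55555-4444-1", "666666-7777777-1",
    "88888888-88888888-1", "1-1-1", "88888888-88888888-22",
    "22-333-", "333-22",
]

# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(r"^\d(?!.{12})\d*-\d+-\d$", re.MULTILINE)

accepted = pattern.findall("\n".join(lines))
print(accepted)
```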