Regex to detect repetition - regex

I need a regex to detect different forms of repetition, where the entire word is a multiple of the same character or substring. The word must also be at least 7 characters long (the whole word, not the repeated sequence).
Examples of terms that are not allowed:
abcdefabcdef
brian
2222222
john12john12
Terms that are allowed:
hellojohn
2122222222
abcdefabc

The validity of this answer depends on the regular expression engine you are using, as it uses negative look-aheads to effectively "invert" the repeated substring matching. You can play with the regex solution here: https://regex101.com/r/DjmuaI/1/
Short answer: ^(?!(.+?)\1+$).{7,}$
Long answer:
Start off by trying to match exactly one repetition of a character sequence. This captures a sequence of characters with (.+) and then requires the same text again using the back-reference \1.
^(.+)\1$
Allow more than one repetition by adding + to the back-reference. This now detects any string that consists entirely of a repeated substring.
^(.+)\1+$
Look for character sequences that are NOT repeating. A negative lookahead (?!regex), support for which varies between regex engines, allows us to invert the condition. Keep the $ inside the lookahead so that only whole-string repetitions are rejected; without it, any string that merely starts with a repeated substring (such as aabcdefg) would be rejected too.
^(?!(.+?)\1+$).+$
However, this would match any non-repetitive string (including strings shorter than 7 characters). The pattern can be changed to require 7 or more characters using {7,}.
^(?!(.+?)\1+$).{7,}$
Note that matching some strings may perform poorly, because the engine may have to try every candidate substring length.
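As a quick check, here is a minimal Python sketch that runs the question's examples through the pattern. It assumes the lookahead is anchored with $ so that only whole-string repetitions are rejected; the names NOT_REPEATED and is_allowed are made up for this illustration.

```python
import re

# Assumption: the lookahead ends with $, so only strings that consist
# entirely of a repeated substring are rejected.
NOT_REPEATED = re.compile(r'^(?!(.+?)\1+$).{7,}$')

def is_allowed(word):
    """Return True if word has 7+ characters and is not a pure repetition."""
    return NOT_REPEATED.match(word) is not None

rejected = ["abcdefabcdef", "brian", "2222222", "john12john12"]
accepted = ["hellojohn", "2122222222", "abcdefabc"]

for word in rejected + accepted:
    print(word, is_allowed(word))
```

A string such as aabcdefg, which only starts with a repeated character, is accepted by this anchored version.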

Related

RegEx: Non-repeating patterns?

I'm wrestling with how to write a specific regex, and thought I'd come here for a little guidance.
What I'm looking for is an expression that does the following:
Character length of 7 or more
Any single character is one of four patterns (uppercase letters, lowercase letters, numbers and a specific set of special characters. Let's say #$%#).
(Now, here's where I'm having problems):
Another single character would also match with one of the patterns described above EXCEPT for the pattern that was already matched. So, if the first pattern matched is an uppercase letter, the second character match should be a lowercase letter, number or special character from the pattern.
To give you an example, the string AAAAAA# would match, as would the string AAAAAAa. However, the string AAAAAAA would not match, nor would the string AAAAAA& (as the ampersand is not part of the special character pattern).
Any ideas? Thanks!
If you only need two different kinds of characters, you can use the possessive quantifier feature (available in Objective C):
^(?:[a-z]++|[A-Z]++|[0-9]++|[#$%#]++)[a-zA-Z0-9#$%#]+$
or more concise with an atomic group:
^(?>[a-z]+|[A-Z]+|[0-9]+|[#$%#]+)[a-zA-Z0-9#$%#]+$
Since each branch of the alternation is a character class with a possessive quantifier, you can be sure that the first character matched by [a-zA-Z0-9#$%#]+ is from a different class.
As for the string length, check it separately first with the appropriate function. If the string is too short, you avoid the cost of a regex check altogether.
First you need to do a negative lookahead to make sure the entire string doesn't consist of characters from a single group:
(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)
Then check that it does contain at least 7 characters from the list of legal characters (and nothing else):
^[a-zA-Z0-9#$%#]{7,}$
Combining them (thanks to Shlomo for pointing that out):
^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$
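The combined pattern can be tried against the question's examples with a short Python sketch. The character classes are copied from the answer as-is (including the asker's placeholder set #$%#); MIXED and is_valid are names chosen for this example only.

```python
import re

# Lookahead: reject strings made of a single character group.
# Main part: require 7+ characters, all from the legal set.
MIXED = re.compile(
    r'^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$'
)

def is_valid(text):
    """Return True if text mixes at least two groups and has 7+ legal chars."""
    return MIXED.match(text) is not None

for sample in ["AAAAAA#", "AAAAAAa", "AAAAAAA", "AAAAAA&", "Aa1#"]:
    print(sample, is_valid(sample))
```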

Java regex for definite or any character less than 11

I am a Rails developer but I need a regular expression that can allow a shortcode or any set of characters not more than 11 in total.
I was thinking something like:
(7575|[0-9a-zA-Z& ]*{11})
However, it has not worked.
I don't know what function you are using (this matters because find and matches behave differently), but to make things unambiguous, you can use the following:
^(7575|[0-9a-zA-Z& ]{1,11})$
The above means: either match 7575, or match between 1 and 11 characters from the character set 0-9a-zA-Z& . If you want to allow an empty string as well, use {0,11} instead.
A slightly more memory efficient one would be ^(?:7575|[0-9a-zA-Z& ]{1,11})$ (since there are no capture groups).
^ matches the beginning of the string and $ matches the end of the string, thus ensuring there are no more characters before or after the matched part.
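The question is about Java, but the effect of the anchors can be illustrated with Python's re module, used here purely as a stand-in: re.search looks for a match anywhere in the string (comparable to find), while re.fullmatch requires the whole string to match (comparable to matches). The variable names are invented for this sketch.

```python
import re

UNANCHORED = r'7575|[0-9a-zA-Z& ]{1,11}'
ANCHORED = r'^(?:7575|[0-9a-zA-Z& ]{1,11})$'

too_long = "abcdefghijkl"  # 12 characters, one more than allowed

# Unanchored search succeeds: it simply finds an 11-character substring.
found = re.search(UNANCHORED, too_long)
print(found.group(0))

# With ^ and $, the same search rejects the string.
print(re.search(ANCHORED, too_long))

# fullmatch needs no anchors, because the whole string must match anyway.
print(re.fullmatch(UNANCHORED, too_long))
print(re.fullmatch(UNANCHORED, "7575"))
```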
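A more concise variant, if the character set can be changed:
"^(7575|[\w]{1,11})$"
where \w is a word character, short for [a-zA-Z_0-9]. Note that \w also accepts underscores and rejects & and spaces, so it is not equivalent to the character set above, and the capturing group is still present.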

Regex query efficient?

I came up with the below regex expression to look for terms like Password,Passphrase,Pass001 etc and the word following it. Is it efficient or can it be made better? Thanks for the help
"([Pp][aA][sS][Ss]([wW][oO][rR][dD][sS]?|[Pp][hH][rR][aA][sS][eE])?|[Pp]([aA][sS]([sS])?)?[wW][Dd])[0-9]?[0-9]?[0-9]?[\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*"
I will be using it to scan files up to 300K for these terms. When I scan a whole C: drive with this expression, it takes 5 hours or, in the worst case I have encountered, 5 days.
You may use the following enhancement:
(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}[-\s:=_\/#&'\]\[()+*\r\n]\S*
Instead of [sS], you may make the regex case insensitive by adding the (?i) case-insensitive modifier. Use the corresponding option in your software if it does not work like this.
Make sure your alternatives cannot match at the same location in the string. That is not easy to achieve here, but a p at the start of each alternative in the first group decreases the regex efficiency, so move it outside (e.g. (?:pass|port) => p(?:ass|ort)).
Use non-capturing groups rather than capturing ones if you are not going to access submatches; that also has a slight impact on performance.
Use limiting quantifiers instead of repeating ? quantified patterns. Instead of a?a?a?, use a{0,3}.
Do not over-escape chars inside the character class. I only left \/, \] and \[ escaped, as I am not sure what regex flavor you are using; you may be able to avoid escaping altogether.
Note that the performance penalty is big if you have consecutive variable-width patterns that may match the same type of chars. You have [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*: [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+ matches 1 or more special chars and \S* matches 0 or more non-whitespace chars, which include some of the chars matched by the preceding pattern. Remove the + from the preceding subpattern.
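To see what the rewritten pattern matches, here is a small Python sketch. The pattern is copied verbatim from the answer; the sample strings and the name IMPROVED are invented for illustration, and Python is only assumed as the host language since the question does not name one.

```python
import re

# The answer's pattern, split over two lines for readability.
# (?i) at the very start makes the whole pattern case-insensitive.
IMPROVED = re.compile(
    r"(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}"
    r"[-\s:=_\/#&'\]\[()+*\r\n]\S*"
)

def first_hit(text):
    """Return the first matched substring, or None if nothing matches."""
    m = IMPROVED.search(text)
    return m.group(0) if m else None

samples = ["password=secret", "Pass001 xyz", "passwd:abc", "pwd-1",
           "Password: hunter2", "passenger"]
for text in samples:
    print(repr(text), "->", first_hit(text))
```

One thing worth checking against real data: with the + removed, only a single separator character is consumed, so in "Password: hunter2" the match stops at the colon rather than including the following word.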

Is it possible to match any wide character that appears more than once using only regexp?

For example, in this string with no \s:
abodnpjdcqe
only d should be matched.
But in my case there are thousands of different characters. Is it possible to use ONLY regexp to match all characters that appear in the string more than once? It seems that all other problems use other tools.
It is possible to find characters that occur at least twice in a string, as anubhava demonstrates, and I don't see any other regex pattern to do it.
However, there are problems with a regex-only approach:
1. The complexity of this kind of pattern is very high, and you will run into problems (with backtracking limits and execution time) if your string is long and contains few duplicates.
2. This approach cannot tell whether a duplicate character has already been found. For example, for the string a123a456a789a, the pattern will return a three times instead of once. If your goal is to obtain a list of unique duplicate characters, this can be problematic (but it is easy to solve programmatically).
So, to answer your question: my answer is no.
A simple way to do it with code is to loop over the characters of your string and build an associative array where the keys are the characters and the values are the number of occurrences. Then remove each item that has the value 1 and extract the keys.
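The counting approach just described might look like this in Python. This is only a sketch; the function name repeated_chars is invented, and collections.Counter plays the role of the associative array.

```python
from collections import Counter

def repeated_chars(text):
    """Return the characters that occur more than once, in first-seen order."""
    counts = Counter(text)  # maps each character to its number of occurrences
    # Keep only the keys whose count is greater than 1.
    return [ch for ch, n in counts.items() if n > 1]

print(repeated_chars("abodnpjdcqe"))
print(repeated_chars("a123a456a789a"))
```

This makes a single pass over the string, and each repeated character is reported once regardless of how often it occurs.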
Note: you can solve the problem of duplicate results (the second problem above) using this pattern:
(.)(?=(?:(?!\1).)*\1(?:(?!\1).)*$)
or if possessive quantifiers are available:
(.)(?=(?:(?!\1).)*+\1(?:(?!\1).)*+$)
but I'm afraid that the complexity may be even higher.
So using your favorite programming language remains by far the best way.
You can use this regex:
([a-zA-Z])(?=.*\1)
Explanation:
Regex uses ([a-zA-Z]) to match any letter and captures it as group #1 i.e. \1
A positive lookahead (?=.*\1) then makes sure this match is successful only when it is followed by at least one of the backreference \1 i.e. the character itself.
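For comparison, here is a short Python sketch that runs this pattern and the duplicate-free variant (.)(?=(?:(?!\1).)*\1(?:(?!\1).)*$) from this thread over the same input. Python's re module is an assumption here, as the question does not name an engine; the variable names are illustrative.

```python
import re

text = "a123a456a789a"

# Every occurrence that is followed by the same letter later on is reported.
all_hits = re.findall(r'([a-zA-Z])(?=.*\1)', text)
print(all_hits)

# The stricter lookahead allows exactly one more occurrence after the match,
# so each repeated character is reported only once.
unique_hits = re.findall(r'(.)(?=(?:(?!\1).)*\1(?:(?!\1).)*$)', text)
print(unique_hits)
```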

A pattern matching an expression that doesn't end with specific sequence

I need a regex pattern which matches such strings that DO NOT end with such a sequence:
\.[A-z0-9]{2,}
by which I mean the examined string must not have at its end a sequence of a dot and then two or more alphanumeric characters.
For example, a string
/home/patryk/www
and also
/home/patryk/www/
should match desired pattern and
/home/patryk/images/DSC002.jpg should not.
I suppose this has something to do with lookarounds (look aheads) but still I have no idea how to make it.
Any help appreciated.
Old Answer
You can use a negative lookbehind at the end if your regex flavor supports it:
^.*+(?<!\.\w{2,})$
This will match a string that has an end anchor not preceded by the icky sequence you don't want.
Note that, as m.buettner has pointed out, this uses an indefinite-length lookbehind, which is a feature unique to .NET.
New Answer
After a bit of digging around, however, I've found that variable length look-aheads are pretty widely supported, so here is a version that uses those:
^(?:(?!\.\w{2,}$).)++$
In a comment on an answer, you stated that you also do not want to match strings that end with a forward slash. This is accomplished by simply adding a forward slash to the lookahead.
^(?:(?!(\.\w{2,}|/)$).)++$
Note that I am using \w for succinctness, but it lets underscores through. If this is important, you could replace it with [^\W_].
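Here is a hedged Python sketch of the lookahead approach, tried on the paths from the question. Python's re module only gained possessive quantifiers in version 3.11, so the sketch swaps ++ for a plain +, which should not change the result for single-line input. The names NO_EXT and NO_EXT_OR_SLASH are invented for the example.

```python
import re

# At every position, refuse to continue if what remains is ".xx" at the end.
NO_EXT = re.compile(r'^(?:(?!\.\w{2,}$).)+$')
# Same idea, additionally refusing a trailing forward slash.
NO_EXT_OR_SLASH = re.compile(r'^(?:(?!(?:\.\w{2,}|/)$).)+$')

paths = ["/home/patryk/www", "/home/patryk/www/",
         "/home/patryk/images/DSC002.jpg"]
for p in paths:
    print(p, bool(NO_EXT.match(p)), bool(NO_EXT_OR_SLASH.match(p)))
```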
Asad's version is very convenient, but only .NET's regex engine supports variable-length lookbehinds (which is one of the many reasons why every regex question should include the language or tool used).
We can reduce this to a fixed-length lookbehind (which is supported in most engines except for JavaScript) if we think about the possible cases which should match. That would be either one or zero letters/digits at the end (whether preceded by . or not) or two or more letters/digits that are not preceded by a dot.
^.*(?:(?<![a-zA-Z0-9])[a-zA-Z0-9]?|(?<![a-zA-Z0-9.])[a-zA-Z0-9]{2,})$
This should do it:
^(?:[^.]+|\.(?![A-Za-z0-9]{2,}$))+$
It alternates between matching one or more of anything except a dot, or a dot if it's not followed by two or more alphanumeric characters and the end of the string.
EDIT: Upgrading it to meet the new requirement is just more of the same:
^(?:[^./]+|/(?=.)|\.(?![A-Za-z0-9]{2,}$))+$
Breaking that down, we have:
[^./]+ # one or more of any characters except . or /
/(?=.) # a slash, as long as there's at least one character following it
\.(?![A-Za-z0-9]{2,}$) # a dot, unless it's followed by two or more alphanumeric characters followed by the end of the string
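The breakdown above can be exercised with a few short paths in Python. This is a sketch only: the sample paths are made up and deliberately short, because without possessive quantifiers the nested + can backtrack heavily on long strings that do not match.

```python
import re

# Alternation from the answer: plain characters, a non-final slash,
# or a dot that does not start a trailing extension.
ALT = re.compile(r'^(?:[^./]+|/(?=.)|\.(?![A-Za-z0-9]{2,}$))+$')

def accepted(path):
    """Return True if the path ends neither in an extension nor in a slash."""
    return ALT.match(path) is not None

for p in ["/home/www", "/home/www/", "/img/a.jpg", "/img/a.b", "v1.2/notes"]:
    print(p, accepted(p))
```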
On another note: [A-z] is an error. It matches all the uppercase and lowercase ASCII letters, but it also matches the characters [, ], ^, _, backslash and backtick, whose code points happen to lie between Z and a.
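The [A-z] pitfall is easy to confirm. The following Python sketch checks the six ASCII characters between Z and a against both ranges; the variable names are illustrative.

```python
import re

between = "[\\]^_`"  # the six ASCII characters between 'Z' (90) and 'a' (97)

# [A-z] spans code points 65..122, so it accepts all six of them.
loose = [c for c in between if re.fullmatch(r'[A-z]', c)]
# [A-Za-z] lists the two letter ranges separately and accepts none.
strict = [c for c in between if re.fullmatch(r'[A-Za-z]', c)]

print(loose)
print(strict)
```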
Variable length look behinds are rarely supported, but you don't need one:
^.*(?<!\.[A-z0-9][A-z0-9]?)$