RegExp ignore special and repeated characters (username test)

All,
I am trying to construct a regular expression (which I will use to validate usernames):
^[(0-9)|(_|\.)]|^[0-9]+$|[^a-zA-z0-9_.]{3,}|(_\.|\._)|\.{2,}|_{2,}
and testing it against this string:
1123#sssssasdf sslkdf*.sf...____.__sfsfdddddsss
What this regular expression should test is:
string should not begin with numbers, underscore or dot
string should be alphanumeric
should not contain characters repeated thrice or more --this fails
should not contain underscore and dot together
should not contain dot and underscore together
should not contain repeated dots
should not contain repeated underscores
It looks like all of the cases are matched except the third. It doesn't catch characters that are repeated thrice or more.
My questions are:
How can I fix this regular expression so it can catch repeated characters?
How can I optimize this regular expression?
Thanks in advance
EDIT
As requested, the valid strings are:
john
john.snow
john.snow123
john1.snow1
john_snow
john_snow123
john1_snow1
The invalid strings are:
123
1john.snow
.john_snow
john__snow
john..snow
jjjohn.snow
_john_snow

And can be done this way too -
(?i)^(?=[a-z])(?!.*?(?:\._|_\.|\.{2}|_{2}|([a-z\d])\1{2}))[a-z\d._]+$
Formatted:
(?i) # Case insensitive
^ # BOS
(?= [a-z] ) # First char alpha
(?! # Not these
.*?
(?:
\._ # dot, underscore
| _\. # underscore, dot
| \.{2} # 2 or more dots
| _{2} # 2 or more underscores
| ( [a-z\d] ) # (1), 3 or more repeated alpha-num's
\1{2}
)
)
[a-z\d._]+ # Get valid char's: alpha-num, dot and underscore
$ # EOS
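To check the pattern above against the sample strings from the question, here is a minimal Python sketch. The names USERNAME_RE and is_valid_username are made up for illustration, and the (?i) prefix is expressed as re.IGNORECASE; this is only a quick test harness, not a definitive implementation.

```python
import re

# Pattern copied from the answer above; (?i) is replaced by re.IGNORECASE.
USERNAME_RE = re.compile(
    r"^(?=[a-z])(?!.*?(?:\._|_\.|\.{2}|_{2}|([a-z\d])\1{2}))[a-z\d._]+$",
    re.IGNORECASE,
)


def is_valid_username(name: str) -> bool:
    """Return True when the string passes every rule encoded in the pattern."""
    return USERNAME_RE.match(name) is not None


# Sample strings taken from the question's EDIT section.
valid = ["john", "john.snow", "john.snow123", "john1.snow1",
         "john_snow", "john_snow123", "john1_snow1"]
invalid = ["123", "1john.snow", ".john_snow", "john__snow",
           "john..snow", "jjjohn.snow", "_john_snow"]

for name in valid + invalid:
    print(f"{name!r:16} -> {is_valid_username(name)}")
```

Every string in the first list should print True and every string in the second list should print False.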

^(?![0-9_.])(?!.*([._])\1)(?!.*(?:_\.|\._))(?!.*(.)\2{2,})[\w.]+$
You can add a negative lookahead for each of the conditions. See demo.
https://regex101.com/r/rO0yD8/6
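Because each condition sits in its own lookahead, the rules can also be checked one at a time, which makes it possible to report which rule a username breaks. The following Python sketch assumes that is useful to you; RULES and first_failed_rule are hypothetical names, re.ASCII is my own addition to keep \w limited to ASCII, and the backreference is renumbered to \1 because each pattern now has only one group.

```python
import re
from typing import Optional

# One entry per condition from the question. The lookaheads are taken from
# the combined regex above, split into standalone patterns.
RULES = [
    ("starts with digit, underscore or dot", r"^(?![0-9_.])"),
    ("repeated dot or underscore", r"^(?!.*([._])\1)"),
    ("dot next to underscore", r"^(?!.*(?:_\.|\._))"),
    ("character repeated 3+ times", r"^(?!.*(.)\1{2,})"),
    ("character outside [A-Za-z0-9_.]", r"^[\w.]+$"),
]


def first_failed_rule(name: str) -> Optional[str]:
    """Return the label of the first rule the name breaks, or None if valid."""
    for label, pattern in RULES:
        if re.search(pattern, name, re.ASCII) is None:
            return label
    return None


for name in ["john.snow", "123", "john__snow", "jjjohn.snow", "john._snow"]:
    print(name, "->", first_failed_rule(name))
```

A return value of None means the name passed all five checks, which should agree with the single combined regex.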

Related

Regex for patterns like [ABC], ABC and ABCxx where xx is a number

I have a text whose length can vary between 1 and 1000. I am looking to get the following sub strings extracted from the text.
1. A substring of the form ABCxx/ABCx, where ABC are always English letters and x/xx is a number from 0 to 99 (the numeric part is either 1 or 2 digits long). The following regex extracts this substring for me: [a-zA-Z]{3}[0-9]{1,2}
2. A substring of the form <space>ABC<space>, ABC (the last word in the text) and ABC (the first word in the text). Basically, I am trying to find a 3-letter word delimited by spaces in the text.
For getting the above matches, I have the following regexes:
[ ][a-zA-Z]{3}[ ], [ ][a-zA-Z]{3} and [a-zA-Z]{3}[ ]
3. Same as 2, but the three-character string can also be in square brackets, like [ABC].
\[([a-zA-Z]{3})\]
Since the patterns are more or less similar, is there any way to combine all 5 of them?
Eg: ABC catmat dogdog [rat] LAN45 eat HGF1 jkhgkj abc
Here valid matches are ABC, rat, LAN45, eat, HGF1, abc.
R = /
\p{L}{3}\d{1,2} # match 3 letters followed by 1 or 2 digits
| # or
(?<=\A|\p{Space}) # match start of string or a space in a pos lookbehind
(?: # begin a non-capture group
\p{L}{3} # match three letters
| # or
\[\p{L}{3}\] # match three letters surrounded by brackets
) # end of non-capture group
(?=\p{Space}|\z) # match space or end of string in a pos lookahead
/x # free-spacing regex definition mode
"ABC catmat dogdog [rat] LAN45 eat HGF1 jkhgkj abc".scan R
#=> ["ABC", "[rat]", "LAN45", "eat", "HGF1", "abc"]
This regex is conventionally written (not free-spacing mode):
R = /\p{L}{3}\d{1,2}|(?<=\A| )(?:\p{L}{3}|\[\p{L}{3}\])(?= |\z)/
Now consider:
"ABCD123 [efg]456".scan R
#=> ["BCD12"]
I believe this is consistent with the statement of the problem, but if "BCD12" should not match when it is preceded by a letter or followed by a digit (here both apply), then the regex should be modified as follows.
R = /
(?<=\A|\p{Space}) # match start of string or a space in a pos lookbehind
(?: # begin a non-capture group
\p{L}{3} # match three letters
\d{,2} # match 0, 1 or 2 digits
| # or
\[\p{L}{3}\] # match three letters surrounded by brackets
) # end of non-capture group
(?=\p{Space}|\z) # match space or end of string in a pos lookahead
/x # free-spacing regex definition mode
"ABC catmat dogdog [rat] XLAN45 eat HGF123 jkhgkj abc".scan R
#=> ["ABC", "[rat]", "eat", "abc"]
Notice that in the conventionally written regex I replaced \p{Space} with a space character. In free-spacing mode spaces are removed before the regex is parsed, so they must be written as \p{Space}, [[:space:]], [ ] (a character class containing a space), a backslash followed by a space (an escaped space character) or, if appropriate, \s for a whitespace character (which includes spaces, newlines, tabs and a few other characters).
Thank you all for your answers. This regex did the trick for me.
(\b[a-zA-Z]{3}([0-9]{1,2})?\b)

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is: is it possible to use as few groups as possible to match as long a string as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (as long as it is not followed by exactly one more a and then a different character or the end of the string) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
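A quick way to see how the non-capturing version splits runs of a is to run it through Python's re.findall. This is only a sketch; GROUP_RE and split_groups are names I made up, and the seven-character input is an extra case that is not in the question.

```python
import re

# Non-capturing form of the pattern above: prefer three a's unless that
# would leave exactly one a behind, otherwise take two.
GROUP_RE = re.compile(r"(?:a{3}(?!a(?:[^a]|$))|a{2})")


def split_groups(s: str) -> list:
    """Return the successive matches found in s."""
    return GROUP_RE.findall(s)


for s in ("aaaa", "aaaaa", "aaaaaa", "aaaaaaa"):
    print(s, "->", split_groups(s))
# aaaa    -> ['aa', 'aa']
# aaaaa   -> ['aaa', 'aa']
# aaaaaa  -> ['aaa', 'aaa']
# aaaaaaa -> ['aaa', 'aa', 'aa']
```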
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
Try this short regex:
a{2,3}(?!a([^a]|$))
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
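This worked for most cases, such as aaaaa, which gives (aaa)(aa), and aaaaaa, which gives (aaa)(aaa).
However, it failed in cases like aaaa, where it matched (aaa) and left the last a unmatched instead of giving (aa)(aa).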
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))
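The difference between the first attempt and the final regex can be reproduced with a few lines of Python. This is just a sketch for comparison; the helper name all_matches and the constants FIRST_TRY and FINAL are mine, and finditer is used so the inner group does not change what is reported.

```python
import re


def all_matches(pattern: str, s: str) -> list:
    """Return the full text of every non-overlapping match of pattern in s."""
    return [m.group(0) for m in re.finditer(pattern, s)]


FIRST_TRY = r"a{2}a?"            # starting point described above
FINAL = r"a{2,3}(?!a([^a]|$))"   # final regex from this answer

for s in ("aaaa", "aaaaa", "aaaaaa"):
    print(s, all_matches(FIRST_TRY, s), all_matches(FINAL, s))
# aaaa   ['aaa']        ['aa', 'aa']
# aaaaa  ['aaa', 'aa']  ['aaa', 'aa']
# aaaaaa ['aaa', 'aaa'] ['aaa', 'aaa']
```

Only the four-character input differs: the first attempt grabs three a's and strands the last one, while the lookahead forces it to back off to two.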

How to use positive regex lookahead to match, but exclude the lookahead part?

The lines to match against are
part1a_part1b__part1c_part1d_part3.extension
part1a_part1b__part1c_part1d__part3.extension
part1a_part1b__part1c_part1d_part2short_part3.extension
part1a_part1b__part1c_part1d_part2short__part3.extension
part1a_part1b__part1c_part1d_part2_part3.extension
part1a_part1b__part1c_part1d_part2__part3.extension
part1a_part1b__part1c_part1d_part2full_part3.extension
part1a_part1b__part1c_part1d_part2full__part3.extension
part1a_part1b__part1c_part1d_part2short-part3.extension
part1a_part1b__part1c_part1d_part2-part3.extension
part1a_part1b__part1c_part1d_part2full-part3.extension
part1a_part1b__part1c_part1d_part4.extension
part1a_part1b__part1c_part1d__part4.extension
The desired match should give exactly part1a_part1b__part1c_part1d for all the above lines except the last two lines. That is to say, the "stem" has an arbitrary number of part1 pieces, is followed by an optional part2 (in limited forms), and the line must end with part3.extension.
Right now, I only got as far as
(?P<stem>[[:alnum:]_-]+)(?=(|part2short|part2|part2full))[_-]+part3\.extension
by which the matched "stem" values for the lines above are
part1a_part1b__part1c_part1d
part1a_part1b__part1c_part1d_
part1a_part1b__part1c_part1d_part2short
part1a_part1b__part1c_part1d_part2short_
part1a_part1b__part1c_part1d_part2
part1a_part1b__part1c_part1d_part2_
part1a_part1b__part1c_part1d_part2full
part1a_part1b__part1c_part1d_part2full_
part1a_part1b__part1c_part1d_part2short
part1a_part1b__part1c_part1d_part2
part1a_part1b__part1c_part1d_part2full
Could you please explain how to match exactly part1a_part1b__part1c_part1d from all the above lines except the last two, if that is possible?
You may use this regex, which combines a non-greedy match with a lookahead that contains an optional group:
(?m)^(?P<stem>[[:alnum:]_-]+?)(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$)
(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$) is a positive lookahead that asserts line ends with [-_]part3.extension with optional [-_]part2... string before.
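If you want to try this from Python, note that the standard re module does not understand POSIX classes such as [[:alnum:]], so the sketch below spells the class out as [A-Za-z0-9_-]. That substitution, the per-line matching without (?m), and the names STEM_RE, LINES and stem are my assumptions; the rest follows the regex above.

```python
import re
from typing import Optional

# Lazy stem followed by a lookahead for an optional part2 and a mandatory
# part3.extension at the end of the line. [[:alnum:]] is written out
# explicitly because Python's re module has no POSIX bracket classes.
STEM_RE = re.compile(
    r"^(?P<stem>[A-Za-z0-9_-]+?)"
    r"(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$)"
)

P = "part1a_part1b__part1c_part1d"
LINES = [
    P + "_part3.extension", P + "__part3.extension",
    P + "_part2short_part3.extension", P + "_part2short__part3.extension",
    P + "_part2_part3.extension", P + "_part2__part3.extension",
    P + "_part2full_part3.extension", P + "_part2full__part3.extension",
    P + "_part2short-part3.extension", P + "_part2-part3.extension",
    P + "_part2full-part3.extension",
    P + "_part4.extension", P + "__part4.extension",
]


def stem(line: str) -> Optional[str]:
    """Return the captured stem, or None when the line does not qualify."""
    m = STEM_RE.match(line)
    return m.group("stem") if m else None


for line in LINES:
    print(f"{line} -> {stem(line)}")
```

The first eleven lines should give part1a_part1b__part1c_part1d and the two part4 lines should give None.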
You could match the first 4 parts with the text and the underscores and use a positive lookahead that asserts that the string ends with part3.extension:
^(?P<stem>[^_]+_[^_]+__[^_]+_[^_]+)(?=.*part3\.extension$)
That would match:
^ # Beginning of the string
(?P<stem> # Named capturing group stem
[^_]+_ # Match not _ one or more times, then _
[^_]+__ # Match not _ one or more times, then __
[^_]+_ # Match not _ one or more times, then _
[^_]+ # Match not _ one or more times
) # Close named capturing group
(?= # A positive lookahead that asserts what follows
.*part3\.extension$ # Match part3.extension at the end of the string
) # Close lookahead

Selecting if no delimiter, and no selecting if it is

I have strings like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What mask should I use to select "2sg." if the string does not contain "|", and select nothing if the string does contain "|"?
I have 2 methods. They both use something called a Negative Lookahead, which is used like so:
(?!data)
When this is inserted into a RegEx, it means that if data appears at that position, the RegEx will not match there.
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
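Here is a minimal Python sketch of Method 1 applied to the two strings from the question. PERSON_RE and extract are illustrative names of my own.

```python
import re
from typing import Optional

# Digit + "sg." that is not immediately followed by a pipe.
PERSON_RE = re.compile(r"(\dsg\.)(?!\|)")


def extract(s: str) -> Optional[str]:
    """Return the captured "Nsg." token, or None if it is followed by |."""
    m = PERSON_RE.search(s)
    return m.group(1) if m else None


print(extract("smth 2sg. smth"))    # 2sg.
print(extract("smth 2sg.| smth."))  # None
```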
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # End of string
Something like this might work for you:
(\d*sg\.)(?!\|)
It assumes that there is (or is not) a number followed by sg. which is not followed by |.
^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex
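One difference from the lookahead-based answers is worth seeing in code: [^\|]*$ rejects a pipe anywhere after the token, while (?!\|) only looks at the very next character. The Python sketch below compares the two; the third input (a pipe separated by a space) is a hypothetical case that does not appear in the question, and the names ANCHORED, LOOKAHEAD and first_group are mine.

```python
import re
from typing import Optional

ANCHORED = re.compile(r"^.*(\dsg\.)[^|]*$")  # no pipe anywhere after the token
LOOKAHEAD = re.compile(r"(\dsg\.)(?!\|)")    # no pipe directly after the token


def first_group(regex, s: str) -> Optional[str]:
    """Return group 1 of the first match of regex in s, or None."""
    m = regex.search(s)
    return m.group(1) if m else None


for s in ("smth 2sg. smth", "smth 2sg.| smth.", "smth 2sg. | smth"):
    print(repr(s), first_group(ANCHORED, s), first_group(LOOKAHEAD, s))
# 'smth 2sg. smth'    2sg.  2sg.
# 'smth 2sg.| smth.'  None  None
# 'smth 2sg. | smth'  None  2sg.
```

Which behavior is right depends on whether the pipe always directly follows the dot in your data.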
Try:
(\d+sg\.(?!\|))
Depending on your programming environment, the syntax can be a little bit different, but this will get your result.

How to write a RegEx pattern that accepts a string with at most one of each letter, but unordered?

I have tried this:
[a]?[b]?[c]?[d]?[e]?[f]?[g]?[h]?[i]?[j]?[k]?[l]?[m]?[n]?[o]?[p]?[q]?[r]?[s]?[t]?[u]?[v]?[w]?[x]?[y]?[z]?
But this RegEx rejects strings where the order is not alphabetical, like these:
"zabc"
"azb"
I want patterns like these two to be accepted too. How could I do that?
EDIT 1
I don't want letter repetitions, i.e., I want the following strings to be rejected:
aazb
ozob
Thanks.
You can use a negative lookahead assertion to make sure no two characters are the same:
^(?!.*(.).*\1)[a-z]*$
Explanation:
^ # Start of string
(?! # Assert that it's impossible to match the following:
.* # any number of characters
(.) # followed by one character (capture that in group 1)
.* # followed by any number of characters
\1 # followed by the same character as the one captured before
) # End of lookahead
[a-z]* # Match any number of ASCII lowercase letters
$ # End of string
Test it live on regex101.com.
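As a quick check of the lookahead version, here is a short Python sketch run against the four strings from the question. UNIQUE_RE and has_unique_letters are illustrative names I chose.

```python
import re

# Reject any string in which some character appears twice, then require
# the whole string to consist of lowercase ASCII letters.
UNIQUE_RE = re.compile(r"^(?!.*(.).*\1)[a-z]*$")


def has_unique_letters(s: str) -> bool:
    """Return True if s is all lowercase letters with no letter repeated."""
    return UNIQUE_RE.match(s) is not None


for s in ("zabc", "azb", "aazb", "ozob"):
    print(s, has_unique_letters(s))
# zabc True
# azb True
# aazb False
# ozob False
```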
Note: This regex needs to brute-force check all possible character pairs, so performance may be a problem with larger strings. If you can use anything besides regex, you're going to be happier. For example, in Python:
import re
mystring = "zabc"  # example input
if re.fullmatch(r"[a-z]*", mystring) and len(mystring) == len(set(mystring)):
    print("valid string")  # only lowercase letters, none repeated