I have this regex to detect an email address:
(?=.*[a-zA-Z])([a-zA-Z0-9_.+-]{8,})#(\S+\.\S+)
The requirement: The part before # needs to contain at least one letter and be at least 8 characters long.
I'm using positive lookahead to see if it contains a letter, but lookahead actually apply to the entire line (the part after # usually will contain letters), so this will pass
123456789#gmail.com
So question is, how can I validate only the result of the first capturing group (in this case 123456789) to see if it has a letter or not?
The [a-zA-Z0-9_.+-]{8,} consuming pattern part before # does not match #, so the lookahead check should only check for a letter after 0 or more chars other than #.
Using
(?=[^#]*[a-zA-Z])([a-zA-Z0-9_.+-]{8,})#(\S+\.\S+)
will fix the issue. See the regex demo and a Regulex graph:
You may further optimize the lookahead pattern by precising the [^#]. E.g. since you only allow 0-9_.+- apart from letters, you may write the regex as
(?=[0-9_.+-]*[a-zA-Z])([a-zA-Z0-9_.+-]{8,})#(\S+\.\S+)
^^^^^^^^^
See this regex demo.
Or, you may follow the principle of contrast (suggested in comments), and use [^#a-zA-Z]* instead of [^#]*.
Depending on where you are using the regex, you might want to wrap it with ^ and $ anchors to ensure a full string match.
Related
I have a conditional lookahead regex that tests to see if there is a number substring at the end of a string, and if so match for the numbers, and if not, match for another substring
The string in question: "H2K 101"
If just the lookahead is used, i.e. (?=\d{1,8}$)(\d{1,8}$), the lookahead succeeds, and "101" is found in capture group 1
When the lookahead is placed into a conditional, i.e. (?(?=\d{1,8}\z)(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+)), the lookahead now fails, and the second pattern is used, matching "H2K", and a "2" is found in capture group 2.
If the test string has the "2" swapped for a letter, i.e. "HKK 101"
then the lookahead conditional works as expected, and the number "101" is once again found in capture group 1.
I've tested this in Regex101 and other PCRE engines, and all work the same, so clearly I'm missing something obvious about conditionals or the condition regex I'm using. Any insight greatly appreciated.
Thanks.
The look ahead starts at the current position, so initially it fails, and the alternative is used -- where it finds a match at the current position.
If you want the look ahead to succeed when still at the initial position, you need to allow for the intermediate characters to occur. Also, when the alternative kicks in, realise that there can follow a second match that still uses the look ahead, but now at a position where the look ahead is successful.
From what I understand, you are interested in one match only, not two consecutive matches (or more). So that means you should attempt to match the whole string, and capture the part of interest in a capture group. Also, the look ahead should be made to succeed when still at the initial position. This all means you need to inject several .*. There is no need for a conditional.
(?=.*\d{1,8}\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*
Note also that (?=.*\d{1,8}\z) succeeds if and only when (?=.*\d\z) succeeds, so you can simplify that:
(?=.*\d\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*
There are two capture groups. It there is a match, exactly one of the capture groups will have a non-empty matching content, which is the content you need.
You want to match a number of specific length at the end of the string, and if there is none, match something else.
There is no need for a conditional here. Conditional patterns are necessary to examine what to match next at the given position inside the string based either on a specific group match or a lookaround test. They are not useful when you want to give priority to a specific pattern.
Here, you can use a PCRE pattern based on the \K operator like
.*?\K\d{1,8}\z|[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+
Or, using capturing groups
(?|.*?(\d{1,8})\z|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+))
See the regex demo #1 and regex demo #2.
Details:
.*?\K\d{1,8}$ - any zero or more chars other than line break chars, as few as possible, then the match reset operator that discards the text matched so far, then one to eight digits at the end of string
| - or
[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+ - one or more letters, 1-8 digits, underscores or hyphens, and then one or more letters.
And
(?| - start of the branch reset group:
.*? - any zero or more chars other than line break chars, as few as possible
(\d{1,8}) - Group 1: one to eight digits
\z - end of string
| - or
( - Group 1 start:
[a-zA-Z]+ - one or more ASCII letters
[\d_-]{1,8} - one to eight digits, underscores, hyphens
[a-zA-Z]+ - one or more ASCII letters
) - Group 1 end
) - end of the group.
I am trying to make a regex match which is discarding the lookahead completely.
\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*
This is the match and this is my regex101 test.
But when an email starts with - or _ or . it should not match it completely, not just remove the initial symbols. Any ideas are welcome, I've been searching for the past half an hour, but can't figure out how to drop the entire email when it starts with those symbols.
You can use the word boundary near # with a negative lookbehind to check if we are at the beginning of a string or right after a whitespace, then check if the 1st symbol is not inside the unwanted class [^\s\-_.]:
(?<=^|\s)[^\s\-_.]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
See demo
List of matches:
support#github.com
s.miller#mit.edu
j.hopking#york.ac.uk
steve.parker#soft.de
info#company-hotels.org
kiki#hotmail.co.uk
no-reply#github.com
s.peterson#mail.uu.net
info-bg#software-software.software.academy
Additional notes on usage and alternative notation
Note that it is best practice to use as few escaped chars as possible in the regex, so, the [^\s\-_.] can be written as [^\s_.-], with the hyphen at the end of the character class still denoting a literal hyphen, not a range. Also, if you plan to use the pattern in other regex engines, you might find difficulties with the alternation in the lookbehind, and then you can replace (?<=\s|^) with the equivalent (?<!\S). See this regex:
(?<!\S)[^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
And last but not least, if you need to use it in JavaScript or other languages not supporting lookarounds, replace the (?<!\S)/(?<=\s|^) with a (non)capturing group (\s|^), wrap the whole email pattern part with another set of capturing parentheses and use the language means to grab Group 1 contents:
(\s|^)([^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*)
See the regex demo.
I use this for multiple email addresses, separate with ‘;':
([A-Za-z0-9._%-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4};)*
For a single mail:
[A-Za-z0-9._%-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}
My regex (PCRE):
\b([\w-.]*error)\b(?:[^-\/.]|\.\W|\.$|$)
is a match (the actual match is surrounded by stars) :
**this.is.an.error**
**this.IsAnerror**
**this.is.an.error**.
**this.is.an.error**(
bla **this_is-an-error**
**this.is.an.error**:
this is an (**error**)
not a match:
this.is.an.error.but.dont.match
this.is.an.error-but.dont.match
this.is.an.error/but.dont.match
this.is.an.error/
/this.is.an.error
for this sample: /this.is.an.error
I can't manage to have a condition that will reject the whole match if it starts with the character /.
every combination I've tried resulted in some partial catch (which is not the desired).
Is there any simple or fancy way to do that?
You can try to add lookabehinds at the beginning instead of a word boundary:
(?<!\/)(?<=[^\w-.])([\w-.]*error)\b(?:[^-\/.]|\.\W|\.$|$)
Explanation:
(?<!\/) - negative lookbehind assuring there is no / before the first character;
(?<=[^\w-.]) - word boundary implementation taking into account your extended definition of characters accepted for a word [\w-.];
Demo
Prepend your regex with \/.*|:
\/.*|\b([\w-.]*error)\b(?=[^-\/.]|(?:\.\W?)?$)
Now just like before the first capturing group holds the desired part.
See live demo here
Note: I made some modifications to your regex to remove unnecessary alternations.
Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
DEMO
\K discards the previously matched characters and lookahead assertes whether a match is possibel or not. So the above matches only the letter a which satifies the conditions.
OR
a{2,}(*SKIP)(*F)|a
DEMO
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
See the regex demo and the regex graph:
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a a char.
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can use the way you want to extract/replace/remove it depending of the language you use.
I am implementing the following problem in ruby.
Here's the pattern that I want :
1234, 1324, 1432, 1423, 2341 and so on
i.e. the digits in the four digit number should be between [1-4] and should also be non-repetitive.
to make you understand in a simple manner I take a two digit pattern
and the solution should be :
12, 21
i.e. the digits should be either 1 or 2 and should be non-repetitive.
To make sure that they are non-repetitive I want to use $1 for the condition for my second digit but its not working.
Please help me out and thanks in advance.
You can use this (see on rubular.com):
^(?=[1-4]{4}$)(?!.*(.).*\1).*$
The first assertion ensures that it's ^[1-4]{4}$, the second assertion is a negative lookahead that ensures that you can't match .*(.).*\1, i.e. a repeated character. The first assertion is "cheaper", so you want to do that first.
References
regular-expressions.info/Lookarounds and Backreferences
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
Just for a giggle, here's another option:
^(?:1()|2()|3()|4()){4}\1\2\3\4$
As each unique character is consumed, the capturing group following it captures an empty string. The backreferences also try to match empty strings, so if one of them doesn't succeed, it can only mean the associated group didn't participate in the match. And that will only happen if string contains at least one duplicate.
This behavior of empty capturing groups and backreferences is not officially supported in any regex flavor, so caveat emptor. But it works in most of them, including Ruby.
I think this solution is a bit simpler
^(?:([1-4])(?!.*\1)){4}$
See it here on Rubular
^ # matches the start of the string
(?: # open a non capturing group
([1-4]) # The characters that are allowed the found char is captured in group 1
(?!.*\1) # That character is matched only if it does not occur once more
){4} # Defines the amount of characters
$
(?!.*\1) is a lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
While the previous answers solve the problem, they aren't as generic as they could be, and don't allow for repetitions in the initial string. For example, {a,a,b,b,c,c}. After asking a similar question on Perl Monks, the following solution was given by Eily:
^(?:(?!\1)a()|(?!\2)a()|(?!\3)b()|(?!\4)b()|(?!\5)c()|(?!\6)c()){6}$
Similarly, this works for longer "symbols" in a string, and for variable length symbols too.