Extend a regex with logical AND in a non-capturing group

I want to extend an existing regex string:
((?:street)|(?:addr)|(?:straße)|(?:strasse)|(?:adr))
It basically matches strings like street or address.
Now I want to add a condition: if the string 'addressAdd' or 'streetnr' is present, nothing should match at all (not even street).
I tried
((?:street)|(?:addr)|(?:straße)|(?:strasse)|(?:adr))(^(?:addressAdd))(^(?:streetnr))
and several variations thereof, but without success. Does anyone know how to negate strings?
Update: Some clarification: if a string like addressAdd is present, I don't want anything to match. The Java code for this would look like this:
String toCheck = "some string to match";
if ((!toCheck.equals("streetnr") && !toCheck.equals("addressAdd")) && (toCheck.equals("street") || toCheck.equals("strasse") || toCheck.equals("adr"))) { /* treat as a match */ }

I'd rather remove unnecessary grouping constructs and add a negative lookahead with these 2 exceptions:
(?!addressAdd|streetnr)(?:street|addr|straße|strasse|adr)
To match whole words:
\b(?!(?:addressAdd|streetnr)\b)(?:street|addr|straße|strasse|adr)\b
In short: (?!addressAdd|streetnr) checks that neither addressAdd nor streetnr starts at the current position, and only then does the regex engine go on to match one of the alternatives listed in the (?:street|addr|straße|strasse|adr) non-capturing group. With word boundaries, \b(?!(?:addressAdd|streetnr)\b) skips only those exceptions that are whole words, so the lookahead itself lets streetnrs through. Note, however, that the trailing \b in the whole-word pattern also requires the matched alternative to be a whole word, so street inside streetnrs is matched only if that trailing \b is removed.
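As a rough illustration, the two patterns can be tried in Python's re module. The names LOOSE, WHOLE and first_match are made up for this sketch and are not part of the original answer.

```python
import re

# Lookahead without word boundaries: no match may start where
# "addressAdd" or "streetnr" begins.
LOOSE = re.compile(r"(?!addressAdd|streetnr)(?:street|addr|straße|strasse|adr)")

# Whole-word variant: both the exceptions and the alternatives
# are delimited by \b.
WHOLE = re.compile(
    r"\b(?!(?:addressAdd|streetnr)\b)(?:street|addr|straße|strasse|adr)\b"
)


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for sample in ("my street", "streetnr", "addressAdd", "straße 5", "streetnrs"):
        print(sample, "->", first_match(LOOSE, sample), first_match(WHOLE, sample))
```

Note that these patterns are case-sensitive, so the capitalised Add at the end of addressAdd does not start a match for addr.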
Answer to the update:
To match strings (or lines if DOTALL option is not used) that contain specific substrings and do not contain disallowed whole words, use the negative lookahead at the beginning of the pattern right after ^:
^(?!.*\b(?:addressAdd|streetnr)\b).*(?:street|addr|straße|strasse|adr).*
See another regex demo
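A minimal sketch of this whole-string check in Python, assuming the input is a single line; the names LINE_OK and accepts are invented for the example.

```python
import re

# The lookahead right after ^ scans the whole line for a forbidden whole word;
# only if none is found does the rest of the pattern look for a wanted substring.
LINE_OK = re.compile(
    r"^(?!.*\b(?:addressAdd|streetnr)\b).*(?:street|addr|straße|strasse|adr).*"
)


def accepts(text):
    """True if text contains a wanted substring and no forbidden whole word."""
    return LINE_OK.match(text) is not None


if __name__ == "__main__":
    for sample in ("street", "street and streetnr", "addressAdd", "my address"):
        print(repr(sample), accepts(sample))
```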

Related

Regex Email validation with some special cases [duplicate]

I am trying to make a regex match which is discarding the lookahead completely.
\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*
This is the match and this is my regex101 test.
But when an email starts with - or _ or . it should not match it completely, not just remove the initial symbols. Any ideas are welcome, I've been searching for the past half an hour, but can't figure out how to drop the entire email when it starts with those symbols.

You can use a word boundary before # together with a positive lookbehind that checks whether we are at the beginning of the string or right after a whitespace character, and then make sure the first symbol is not one of the unwanted characters by using the negated class [^\s\-_.]:
(?<=^|\s)[^\s\-_.]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
List of matches:
support#github.com
s.miller#mit.edu
j.hopking#york.ac.uk
steve.parker#soft.de
info#company-hotels.org
kiki#hotmail.co.uk
no-reply#github.com
s.peterson#mail.uu.net
info-bg#software-software.software.academy
Additional notes on usage and alternative notation
Note that it is best practice to use as few escaped chars as possible in the regex, so, the [^\s\-_.] can be written as [^\s_.-], with the hyphen at the end of the character class still denoting a literal hyphen, not a range. Also, if you plan to use the pattern in other regex engines, you might find difficulties with the alternation in the lookbehind, and then you can replace (?<=\s|^) with the equivalent (?<!\S). See this regex:
(?<!\S)[^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
And last but not least, if you need to use it in a regex flavor that does not support lookbehinds (for example, older JavaScript engines), replace the (?<!\S)/(?<=\s|^) with a non-capturing group (?:\s|^), wrap the whole email pattern part with a set of capturing parentheses and use the language means to grab the Group 1 contents:
(?:\s|^)([^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*)
See the regex demo.
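For reference, here is how the (?<!\S) variant might be used from Python, whose re module only accepts fixed-width lookbehinds. The sample text keeps the # separator from the question, and the names EMAIL and find_addresses are hypothetical.

```python
import re

# (?<!\S) means "not preceded by a non-whitespace character", i.e. the
# address must start at the beginning of the text or right after whitespace.
EMAIL = re.compile(
    r"(?<!\S)[^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*"
)


def find_addresses(text):
    """Return every address in text that does not start with -, _ or a dot."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return EMAIL.findall(text)


if __name__ == "__main__":
    print(find_addresses("support#github.com -bad#github.com s.miller#mit.edu"))
```

An address with a leading -, _ or . is dropped as a whole, because the rest of it is preceded by a non-whitespace character and therefore fails the lookbehind.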

I use this for multiple email addresses, separated by ';':
([A-Za-z0-9._%-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4};)*
For a single mail:
[A-Za-z0-9._%-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}

How can I match all lines with a certain pattern, except when a certain substring is present?

I have multiple lines containing code that follows a very simple pattern: &G3FRM.GetRecord("<TAG>").GetField("<TAG>").Value. For example, I might have the following:
&G3FRM.GetRecord("PAGEREC").GetField("GSHOURS").Value
&G3FRM.GetRecord("RSCH_SETUP").GetField("Y_NIH_MNTHLY_CAP").Value
&G3FRM.GetRecord("PAYMENT").GetField("Y_HRS_TOTAL").Value
I need to match anything that has &G3FRM.GetRecord, that doesn't have PAGEREC as the first string/tag, and is then followed by the rest of the pattern. These statements can appear at the beginning, middle or end of any given line, and there could even be multiple matches in a single line.
This is the Regex pattern that I have tried:
&G3FRM\.GetRecord\("(?!PAGEREC)"\)\.GetField\("\w+"\)\.Value
As far as I understand, this is matching some literals (&G3FRM.GetRecord(") and is then looking for any string that doesn't match PAGEREC, using a negative lookahead. It certainly excludes any of the matches that have PAGEREC, but it also excludes everything else, so I know that I'm missing something.
So, I have a bunch of lines that I've cherry-picked that could look something like this:
Local string &rqst_dept_descr = %This.GetDepartmentDescription(&G3FRM.GetRecord("PAGEREC").GetField("GSREQUESTING_DEPT").Value);
Local string &hoursHTML = GetHTMLText(HTML.G_FORM_ROW_VALUE, "Hours", &G3FRM.GetRecord("PAYMENT").GetField("GSHOURS").Value);
Local string &off_cycle_deposit = &G3FRM.GetRecord("PAGEREC").GetField("GSOFFCYCLE_DIR_DEP").Value;
&G3FRM.GetRecord("POSITION").GetField("GSCOMMISSIONTIPS").Value = "Y";
SQLExec(SQL.Y_HAS_CONTRACT_DATA_IN_RANGE, &G3FRM.GetRecord("PAGEREC").GetField("EMPLID").Value, &G3FRM.GetRecord("PAYMENT").GetField("CONTRACT_NUM").Value, &G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value, &HasContractData);
In this example, it should exclude the first line, since it only has the pattern I don't want. It should include the second line, exclude the third, include the fourth, and include the fifth (even though it does have one example of the excluded pattern, it has multiples that I do want).

You may use this regex:
&G3FRM\.GetRecord\("(?!PAGEREC\b)\w+"\)\.GetField\("\w+"\)\.Value
Note the use of \w+ after the negative lookahead: it matches the tag itself, which must not be PAGEREC. I have added \b to your lookahead condition so that only the whole word PAGEREC is rejected, while a longer tag such as PAGEREC1 still matches.
In your regex &G3FRM\.GetRecord\("(?!PAGEREC)"\)\.GetField\("\w+"\)\.Value the negative lookahead condition is correct, but the regex does not match anything between the two double quotes, so it will only match e.g. &G3FRM.GetRecord("").GetField("GSHOURS").Value.
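To check the behaviour against lines like the ones in the question, a small Python sketch could look as follows. CALL and wanted_calls are names chosen for this example only.

```python
import re

# The lookahead rejects the tag PAGEREC as a whole word; \w+ then consumes
# whatever tag is actually there.
CALL = re.compile(
    r'&G3FRM\.GetRecord\("(?!PAGEREC\b)\w+"\)\.GetField\("\w+"\)\.Value'
)


def wanted_calls(line):
    """Return all GetRecord(...).GetField(...).Value calls not on PAGEREC."""
    return CALL.findall(line)


if __name__ == "__main__":
    line = (
        'SQLExec(&G3FRM.GetRecord("PAGEREC").GetField("EMPLID").Value, '
        '&G3FRM.GetRecord("PAYMENT").GetField("EFFDT").Value);'
    )
    print(wanted_calls(line))
```

A line counts as "included" when the returned list is non-empty, which also covers lines that mix wanted and unwanted calls.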

Allowing words picked up in regex in certain cases only

I have a regex expression to look for people just sticking "N/A" or similar into a form field.
^(?!(\b(N/A|NA|n/a|na|Yes|yes|YES|No|no|NO)\b))
Probably not the most elegant I am sure. However I cannot for the life of me get it to allow the above words if followed by something.
So if someone just types "yes" then I want it to fail the regex check. But if someone types "yes, I have blah blah etc etc" I want it to pass.
The expression I have allows the word to be used as long as it isn't the first word in the sentence. I just want to disallow the listed words as the ONLY words in the field.
Any ideas?
Thanks

You may remove the first \b (it is redundant between the start of string and a word char) and replace the second one with $ (end of string):
^(?!(?:N/A|NA|n/a|na|Yes|yes|YES|No|no|NO)$)
With a case insensitive option, you may reduce the pattern to
^(?!(?:n/?a|yes|no)$)
Details
^ - start of string, then...
(?!(?:n/?a|yes|no)$) - a location in string that is not immediately followed with n/?a (na, n/a), yes or no that are followed with the end of string.
In human words, only the start of string is matched if the whole string is not equal to the alternatives inside the alternation group.
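A short Python sketch of the case-insensitive variant, assuming the form value arrives as a plain string. NOT_JUST_FILLER and is_acceptable are illustrative names.

```python
import re

# Matches (with an empty match at position 0) unless the whole string is
# one of the filler answers.
NOT_JUST_FILLER = re.compile(r"^(?!(?:n/?a|yes|no)$)", re.IGNORECASE)


def is_acceptable(answer):
    """True unless the answer consists only of n/a, na, yes or no."""
    return NOT_JUST_FILLER.search(answer) is not None


if __name__ == "__main__":
    for sample in ("yes", "N/A", "Yes, I have blah blah", "nothing"):
        print(repr(sample), is_acceptable(sample))
```

Note that an empty string passes this check, so a separate "required field" rule is still needed if empty input should be rejected.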

The easiest way would be to match all the forbidden strings exactly and invert the result.
Try ^(n/?a|yes|no)$ with a case-insensitive option and invert the result.
^ matches the beginning of the string. $ matches the end of the string.
When you don't have a case-insensitive option, use ^([nN]/?[aA]|[yY][eE][sS]|[nN][oO])$.
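The "match the forbidden strings and invert" idea might look like this in Python; FILLER and is_real_answer are names invented for the sketch.

```python
import re

# Matches only when the entire string is one of the forbidden answers.
FILLER = re.compile(r"^(n/?a|yes|no)$", re.IGNORECASE)


def is_real_answer(answer):
    """Invert the match: anything that is not pure filler is accepted."""
    return FILLER.search(answer) is None


if __name__ == "__main__":
    print(is_real_answer("yes"))                 # filler only
    print(is_real_answer("yes, I have a form"))  # filler followed by more text
```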

Regex: how do I match a character before other capture characters?

I'm trying to match against a list of strings where I want to make sure the character before the match is not an equals sign, without capturing that character. So, for a list (excerpted from pip freeze) like:
ply==3.10
powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
psutil==4.0.0
ptyprocess==0.5.1
I want the captured output to look like this:
==3.10
==4.0.0
==0.5.1
I first thought using a negative lookahead (?![^=]) would work, but with a regular expression of (?![^=])==[0-9]+.* it ends up capturing the line I don't want:
==3.10
==2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
==4.0.0
==0.5.1
I also tried using a non-capturing group (?:[^=]) with a regex of (?:[^=])==[0-9]+.* but that ends up capturing the first character which I also don't want:
y==3.10
l==4.0.0
s==0.5.1
So the question is this: How can one match but not capture a string before the rest of the regex?

A negative lookbehind is the way to go:
(?<!=)==[0-9.]+
Also, here is the site I like to use:
http://www.rubular.com/
Of course, it sometimes helps if you say which engine/software you are using, so we know what limitations there might be.
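As a quick check of the negative lookbehind against the pip freeze excerpt from the question, here is a Python sketch. VERSION and pinned_versions are hypothetical names.

```python
import re

# "==" that is not itself preceded by "=", followed by digits and dots.
VERSION = re.compile(r"(?<!=)==[0-9.]+")


def pinned_versions(lines):
    """Return the ==version part of every line that uses exactly two '='."""
    found = []
    for line in lines:
        m = VERSION.search(line)
        if m:
            found.append(m.group(0))
    return found


if __name__ == "__main__":
    freeze = [
        "ply==3.10",
        "powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-",
        "psutil==4.0.0",
        "ptyprocess==0.5.1",
    ]
    print(pinned_versions(freeze))
```

The === line is skipped for two reasons: the first == is followed by another = rather than a digit, and every later = is preceded by an =.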

If you want to remove the version numbers from the text, you could capture a character that is not an equals sign ([^=]) in the first capturing group, followed by matching == and the version numbers \d+(?:\.\d+)+. Then in the replacement you would use your capturing group.
Regex
([^=])==\d+(?:\.\d+)+
Replacement
Group 1 $1
Note
You could also use ==[0-9]+.* or ==[0-9.]+ to match the double equals signs and version numbers but that would be a very broad match. The first would also match ====1test and the latter would also match ==..
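A sketch of the capture-and-replace approach in Python, where the replacement $1 is written as \1. The function name strip_version is an assumption made for the example.

```python
import re

# Group 1 keeps the character before "==" so that it survives the replacement.
PINNED = re.compile(r"([^=])==\d+(?:\.\d+)+")


def strip_version(line):
    """Remove a '==x.y[.z]' version pin, keeping the preceding character."""
    return PINNED.sub(r"\1", line)


if __name__ == "__main__":
    for line in ("ply==3.10", "psutil==4.0.0", "powerline-status===2.6.dev9999"):
        print(strip_version(line))
```

Because \d+(?:\.\d+)+ requires at least one dot, a pin such as ==3 would be left untouched by this pattern.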

There's another regex operator called a 'lookbehind assertion' (also called positive lookbehind) ?<= - and in my above example using it in the expression (?<=[^=])==[0-9]+.* results in the expected output:
==3.10
==4.0.0
==0.5.1
At the time of this writing, it took me a while to discover this - notably the lookbehind assertion currently isn't supported in the popular regex tool regexr.
If there are alternatives to using lookbehind to solve this, I'd love to hear them.

Regular Expressions and negating a whole character group [duplicate]

I'm attempting something which I feel should be fairly obvious to me but it's not. I'm trying to match a string which does NOT contain a specific sequence of characters. I've tried using [^ab], [^(ab)], etc. to match strings containing no 'a's or 'b's, or only 'a's or only 'b's or 'ba' but not match on 'ab'. The examples I gave won't match 'ab' it's true but they also won't match 'a' alone and I need them to. Is there some simple way to do this?

Using a character class such as [^ab] will match a single character that is not within the set of characters. (With the ^ being the negating part).
To match a string which does not contain the multi-character sequence ab, you want to use a negative lookahead:
^(?:(?!ab).)+$
And the above expression dissected in regex comment mode is:
(?x) # enable regex comment mode
^ # match start of line/string
(?: # begin non-capturing group
(?! # begin negative lookahead
ab # literal text sequence ab
) # end negative lookahead
. # any single character
) # end non-capturing group
+ # repeat previous match one or more times
$ # match end of line/string
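The same commented layout can be used directly in Python through the re.VERBOSE flag, which ignores whitespace in the pattern and treats # as the start of a comment. The names TEMPERED and has_no_ab are invented for this sketch.

```python
import re

TEMPERED = re.compile(
    r"""
    ^            # start of string
    (?:          # begin non-capturing group
        (?!ab)   # the sequence "ab" must not start here
        .        # any single character (except a newline)
    )+           # repeat one or more times
    $            # end of string
    """,
    re.VERBOSE,
)


def has_no_ab(text):
    """True if text is non-empty and does not contain the sequence 'ab'."""
    return TEMPERED.search(text) is not None


if __name__ == "__main__":
    for sample in ("a", "b", "ba", "ab", "bamboo", "baobab"):
        print(sample, has_no_ab(sample))
```

Because of the + quantifier, this form does not match an empty string.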

Use negative lookahead:
^(?!.*ab).*$
UPDATE: In the comments below, I stated that this approach is slower than the one given in Peter's answer. I've run some tests since then, and found that it's actually slightly faster. However, the reason to prefer this technique over the other is not speed, but simplicity.
The other technique, described here as a tempered greedy token, is suitable for more complex problems, like matching delimited text where the delimiters consist of multiple characters (like HTML, as Luke commented below). For the problem described in the question, it's overkill.
For anyone who's interested, I tested with a large chunk of Lorem Ipsum text, counting the number of lines that don't contain the word "quo". These are the regexes I used:
(?m)^(?!.*\bquo\b).+$
(?m)^(?:(?!\bquo\b).)+$
Whether I search for matches in the whole text, or break it up into lines and match them individually, the anchored lookahead consistently outperforms the floating one.
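The following Python sketch only shows that the two regexes select the same lines on a tiny made-up sample; it makes no claim about timing. ANCHORED, FLOATING and the sample text are assumptions of the example.

```python
import re

# Anchored lookahead: one scan of the line from its start.
ANCHORED = re.compile(r"^(?!.*\bquo\b).+$", re.MULTILINE)

# Floating lookahead (tempered greedy token): checked before every character.
FLOATING = re.compile(r"^(?:(?!\bquo\b).)+$", re.MULTILINE)


def lines_without_quo(pattern, text):
    """Return the lines of text that do not contain the whole word 'quo'."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "quo vadis\nstatus quo\nquota reached\nlorem ipsum"
    print(lines_without_quo(ANCHORED, sample))
    print(lines_without_quo(FLOATING, sample))
```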

Yes, it's called negative lookahead. It goes like this - (?!regex here). So abc(?!def) will match abc not followed by def. So it'll match abce, abc, abck, etc.
Similarly there is positive lookahead - (?=regex here). So abc(?=def) will match abc followed by def.
There are also negative and positive lookbehind - (?<!regex here) and (?<=regex here) respectively
One point to note is that the negative lookahead is zero-width. That is, it does not count as having taken any space.
So it may look like a(?=b)c will match "abc" but it won't. It will match 'a', then the positive lookahead with 'b' but it won't move forward into the string. Then it will try to match the 'c' with 'b' which won't work. Similarly ^a(?=b)b$ will match 'ab' and not 'abb' because the lookarounds are zero-width (in most regex implementations).
More information on this page
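The zero-width behaviour described above can be verified with a few lines of Python; the helper name found is made up for the illustration.

```python
import re


def found(pattern, text):
    """True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    # The lookahead checks for "b" but does not consume it,
    # so "c" is then compared against "b" and the match fails.
    print(found(r"a(?=b)c", "abc"))
    # Here "b" is checked by the lookahead and then consumed by the literal b.
    print(found(r"^a(?=b)b$", "ab"))
    print(found(r"^a(?=b)b$", "abb"))
    # Negative lookahead: abc must not be followed by def.
    print(found(r"abc(?!def)", "abce"))
    print(found(r"abc(?!def)", "abcdef"))
```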

abc(?!def) will match abc not followed by def. So it'll match abce, abc, abck, etc. what if I want neither def nor xyz will it be abc(?!(def)(xyz)) ???
I had the same question and found a solution:
abc(?:(?!def))(?:(?!xyz))
The consecutive negative lookaheads are combined by "AND" (the non-capturing groups around them are optional), so this should do the trick. Hope it helps.
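A brief Python sketch comparing the pattern above with the shorter alternation form abc(?!def|xyz), which is an equivalent way to express "neither def nor xyz". STACKED, ALTERNATION and allowed are illustrative names.

```python
import re

# Two lookaheads at the same position: both must succeed (logical AND).
STACKED = re.compile(r"abc(?:(?!def))(?:(?!xyz))")

# One lookahead with an alternation: "not (def or xyz)", which is the same thing.
ALTERNATION = re.compile(r"abc(?!def|xyz)")


def allowed(pattern, text):
    """True if text contains abc that is followed by neither def nor xyz."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for sample in ("abcdef", "abcxyz", "abcde", "abc"):
        print(sample, allowed(STACKED, sample), allowed(ALTERNATION, sample))
```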

Using a regex as you described is the simple way (as far as I am aware). If you want a range you could use [^a-f].

Simplest way is to pull the negation out of the regular expression entirely:
if (!userName.matches("^([Ss]ys)?admin$")) { ... }

Just search for "ab" in the string then negate the result:
!/ab/.test("bamboo"); // true
!/ab/.test("baobab"); // false
It seems easier and should be faster too.

In this case I might just simply avoid regular expressions altogether and go with something like:
if (StringToTest.IndexOf("ab") < 0)
//do stuff
This is likely also going to be much faster (a quick test vs regexes above showed this method to take about 25% of the time of the regex method). In general, if I know the exact string I'm looking for, I've found regexes are overkill. Since you know you don't want "ab", it's a simple matter to test if the string contains that string, without using regex.
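For comparison, the same regex-free check in Python is a plain substring test. The function name lacks_ab is made up, and no timing claim is implied by this sketch.

```python
def lacks_ab(text):
    """True if the literal sequence 'ab' does not occur anywhere in text."""
    return "ab" not in text


if __name__ == "__main__":
    print(lacks_ab("bamboo"))
    print(lacks_ab("baobab"))
```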

The regex [^ab] will match, for example, 'ab ab ab ab' but not 'ab', because in the first string it matches the space characters between the words.
What language/scenario do you have? Can you subtract results from the original set, and just match ab?
If you are using GNU grep, and are parsing input, use the '-v' flag to invert your results, returning all non-matches. Other regex tools also have a 'return nonmatch' function, too.
If I understand correctly, you want everything except for those items which contain 'ab' anywhere.
