Regex - how to match everything except a particular pattern - regex

How do I write a regex to match any string that doesn't meet a particular pattern? I'm faced with a situation where I have to match an (A and ~B) pattern.

You could use a look-ahead assertion:
(?!999)\d{3}
This example matches three digits other than 999.
But if you happen not to have a regular expression implementation with this feature (see Comparison of Regular Expression Flavors), you probably have to build a regular expression with the basic features on your own.
A compatible regular expression with basic syntax only would be:
[0-8]\d\d|\d[0-8]\d|\d\d[0-8]
This also matches any three-digit sequence other than 999.
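As a quick sanity check, the two patterns above can be compared in Python. This is only a sketch: the names LOOKAHEAD, BASIC and the helper function are made up for illustration, and fullmatch is used so that exactly three digits are tested.

```python
import re

# The two patterns from the answer above.
LOOKAHEAD = re.compile(r"(?!999)\d{3}")
BASIC = re.compile(r"[0-8]\d\d|\d[0-8]\d|\d\d[0-8]")

def agree_on_all_three_digit_strings():
    """Return True if both patterns accept exactly the same 3-digit strings."""
    for n in range(1000):
        s = f"{n:03d}"  # "000" .. "999"
        if bool(LOOKAHEAD.fullmatch(s)) != bool(BASIC.fullmatch(s)):
            return False
    return True

print(agree_on_all_three_digit_strings())
```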

Suppose you want to match strings that contain a word A but do not contain a word B. For example, given the text:
1. I have a two pets - dog and a cat
2. I have a pet - dog
If you want to search for lines of text that HAVE a dog for a pet and DON'T have a cat, you can use this regular expression:
^(?=.*?\bdog\b)((?!cat).)*$
It will find only the second line:
2. I have a pet - dog
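A small Python sketch of the same search, using the two sample lines from the answer (the variable names are illustrative only):

```python
import re

# Requires the word "dog" somewhere, and forbids "cat" at every position.
PATTERN = re.compile(r"^(?=.*?\bdog\b)((?!cat).)*$")

LINES = [
    "1. I have a two pets - dog and a cat",
    "2. I have a pet - dog",
]

matching = [line for line in LINES if PATTERN.search(line)]
print(matching)
```

Note that (?!cat) is not bounded by \b, so a line containing "category" would be rejected as well; wrap it as (?!\bcat\b) if only the whole word should be excluded.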

Match against the pattern and use the host language to invert the boolean result of the match. This will be much more legible and maintainable.
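For instance, in Python the (A and not B) test could be written as two plain searches combined in code. The function name and the sample patterns below are assumptions made for the example:

```python
import re

def matches_a_but_not_b(text, a_pattern, b_pattern):
    """True if a_pattern is found in text and b_pattern is not."""
    has_a = re.search(a_pattern, text) is not None
    has_b = re.search(b_pattern, text) is not None
    return has_a and not has_b

print(matches_a_but_not_b("I have a pet - dog", r"\bdog\b", r"\bcat\b"))
print(matches_a_but_not_b("dog and a cat", r"\bdog\b", r"\bcat\b"))
```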

notnot, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
"I'm faced with a situation where I have to match an (A and ~B) pattern."
The basic regex for this is frighteningly simple: B|(A)
You just ignore the overall matches and examine the Group 1 captures, which will contain A.
An example (with all the disclaimers about parsing html in regex): A is digits, B is digits within an <a> tag
The regex: <a.*?<\/a>|(\d+)
Demo (look at Group 1 in the lower right pane)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
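A minimal Python sketch of the B|(A) technique from this answer. The sample text is invented for the demonstration; the regex is the one given above.

```python
import re

# B = a whole <a ...>...</a> element (matched but discarded),
# A = a run of digits (captured in group 1).
PATTERN = re.compile(r"<a.*?</a>|(\d+)")

text = 'x 12 <a href="p3">45</a> 67'

# Keep only the matches where group 1 took part in the match.
digits_outside_links = [
    m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None
]
print(digits_outside_links)
```

The digits inside the link are consumed by the left alternative, so they can never be picked up by (\d+).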

The complement of a regular language is also a regular language, but to construct it you have to build the DFA for the regular language, and make any valid state change into an error. See this for an example. What the page doesn't say is that it converted /(ac|bd)/ into /(a[^c]?|b[^d]?|[^ab])/. The conversion from a DFA back to a regular expression is not trivial. It is easier if you can use the regular expression unchanged and change the semantics in code, as suggested before.

pattern - re
str.split(/re/g)
will return everything except the pattern.
Test here
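The snippet above is JavaScript. A rough Python analogue, assuming the pattern to exclude is a run of digits (an arbitrary choice for the example), would use re.split:

```python
import re

# Split on the unwanted pattern; what remains is "everything except" it.
# The pattern has no capture groups, so the separators are not returned.
parts = re.split(r"\d+", "a1b22c")
print(parts)
```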

My answer here might solve your problem as well:
https://stackoverflow.com/a/27967674/543814
Instead of Replace, you would use Match.
Instead of group $1, you would read group $2.
Group $2 was made non-capturing there, which you would avoid.
Example:
Regex.Match("50% of 50% is 25%", @"(\d+%)|(.+?)");
The first capturing group specifies the pattern that you wish to avoid. The last capturing group captures everything else. Simply read out that group, $2.
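The same idea sketched in Python rather than C#, on the sample string from the example above. Collecting group 2 from every match is my own way of reading the result out, not something the original answer specifies:

```python
import re

# Group 1 = the pattern to avoid, group 2 = everything else (one char at a time).
PATTERN = re.compile(r"(\d+%)|(.+?)")

text = "50% of 50% is 25%"

everything_else = "".join(
    m.group(2) for m in PATTERN.finditer(text) if m.group(2) is not None
)
print(repr(everything_else))
```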

(B)|(A)
then use what group 2 captures...

Related

Regex if then else confusion

I have a problem with the Regex-If-then-else logic:
I am trying to achieve the following:
If the string contains the substring PubDSK then do the Regex Expression
^[\s\S]{24}(?=.{10}([\s\S]*))0*(.*?)(?=\1)[\s\S]*
If it does NOT contain the substring PubDSK then do a different Regex Expression, namely ^[\s\S]{48}(?=.{10}([\s\S]*))0*(.*?)(?=\1)[\s\S]*
I am using this Regex Expression (?(?=^.*PubDSK.*$)^[\s\S]{24}(?=.{10}([\s\S]*))0*(.*?)(?=\1)[\s\S]*|^[\s\S]{48}(?=.{10}([\s\S]*))0*(.*?)(?=\1)[\s\S]*)
The affirmative case works great: https://regex101.com/r/ab9yOv/
BUT the non-affirmative case, doesn't do the trick: https://regex101.com/r/azxGvh/1
I assume it doesn't match so it cannot do the replacement?? How can I tell the regex to do the replacement on the complete string in the ELSE case?
I understand, that this problem can be easily solved with any other programming language, but for this use case I can only use pure regex...
The second \1 backreference refers to the first capturing group of the entire regex. So, it does not refer to the right capturing group defined in the else pattern part. In fact, the second \1 must be replaced with \3 as it refers to the third capturing group.
Also, note that (?=\1) and (?=\3) lookaheads make little sense here as they are followed with [\s\S]* consuming patterns. Just remove the lookahead pattern and use consuming ones.
The fixed pattern looks like
(?(?=^.*PubDSK.*$)^[\s\S]{24}(?=.{10}([\s\S]*))0*(.*?)\1[\s\S]*|^[\s\S]{48}(?=.{10}([\s\S]*))0*(.*?)\3[\s\S]*)
See the regex demo.
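The numbering issue can be illustrated in isolation. As far as I know, Python's re module does not accept a lookahead as a conditional, so the sketch below uses a plain alternation with toy patterns; it only shows that groups are numbered left to right across the whole regex, not per branch.

```python
import re

# Groups in the second branch are numbered 3 and 4, not 1 and 2.
groups = re.fullmatch(r"(a)(b)|(c)(d)", "cd").groups()
print(groups)

# A backreference to a group from the other branch refers to a group
# that did not participate, so in Python the match fails.
wrong = re.fullmatch(r"(x)\1|(y)\1", "yy")
right = re.fullmatch(r"(x)\1|(y)\2", "yy")
print(wrong, right)
```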

Stop Regex Engine at first match [duplicate]

I am learning about cucumber's step definitions, which use regex. I came across the following different usages and would like to know if there's some material difference between the two approaches of capturing a group within a pair of double quotes:
approach one: "([^"]*)"
approach two: "(.*?)"
For example, consider a string input: 'the output should be "pass!"'. Both approaches would capture pass!. Are there inputs where the two approaches capture differently, or are they equivalent?
Thanks
Well, to the naked eye they look the same, but they are slightly different. Have a look at this example:
input:
a " regex
example is
here" please
Output for "([^"]*)":
regex
example is
here
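And the output for "(.*?)" is empty.
.*? means any character except \n (0 or more times, as few as possible), and there are a few newlines between the quotes ("). To make this pattern match, we need to tell the regex engine to let . match newlines as well (single-line, or "dotall", mode). Multiline mode would not help, since it only changes how ^ and $ behave.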
"([^"]*)" will also capture newlines, so if you have
"Something
that goes on two lines"
then it will match it.
"(.*?)" does not span newlines, so it will not match that phrase.
Unless you use the single-line modifier (?s). In which case . will also include newline characters. The following expression: (?s)"(.*?)" would then match and capture.
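The newline behaviour described in these answers can be checked with a short Python sketch, using the three-line input from the first answer:

```python
import re

text = 'a " regex\nexample is\nhere" please'

# A negated character class matches newlines.
negated = re.search(r'"([^"]*)"', text)

# By default the dot does not match "\n", so this finds nothing.
lazy_dot = re.search(r'"(.*?)"', text)

# With the single-line (dotall) flag the dot matches newlines too.
lazy_dot_s = re.search(r'(?s)"(.*?)"', text)

print(repr(negated.group(1)))
print(lazy_dot)
print(repr(lazy_dot_s.group(1)))
```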
Difference between "(.*?)" and "([^"]*)"
It depends upon where this regex fragment appears within the larger context of the overall pattern. It also depends upon the target string that is being searched. For example, given the following input string:
'foo "quote1" bar "quote2"'
The expression: /"(.*?)"$/ (note the added end of string anchor) will match: "quote1" bar "quote2" but the /"([^"]*)"$/ expression will match: "quote2".
The dot will match a double quote if it has to to get a successful overall match.
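The anchored example can be reproduced in Python with the input string given above:

```python
import re

s = 'foo "quote1" bar "quote2"'

# The lazy dot is allowed to run across the inner quotes to reach the end.
lazy = re.search(r'"(.*?)"$', s)

# The negated class can never cross a quote, so only the last pair fits.
negated = re.search(r'"([^"]*)"$', s)

print(lazy.group(0))
print(negated.group(0))
```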

Regex Not Capturing input

I need to match over the alphabet {a,b} (meaning that we can discard any other letter since only a and b will exist):
All strings containing 2 or more as.
All strings that do not contain the substring bbb.
Why is this RegEx:
((b{0,2}aaa*)+)|((aaa*b{0,2})+)
Not capturing aab?
Because aa got captured by your first pattern. To get the desired output, you need to change the pattern order.
((aaa*b{0,2})+)|((b{0,2}aaa*)+)
Note that the regex engine always tries the leftmost alternative first and moves on to the alternatives to its right only if that one fails. So the order of evaluation is:
1st|2nd|3rd
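The effect of the alternation order on the input aab can be seen in a short Python sketch using the two patterns from this answer:

```python
import re

original = r"((b{0,2}aaa*)+)|((aaa*b{0,2})+)"
swapped = r"((aaa*b{0,2})+)|((b{0,2}aaa*)+)"

# The left alternative wins as soon as it can match, even if the
# right alternative would have matched more of the input.
print(re.search(original, "aab").group(0))
print(re.search(swapped, "aab").group(0))
```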
Update:
^(?!.*?bbb).*a.*a.*
DEMO
Your requirements:
All strings containing 2 or more a's.
All strings that do not contain the substring bbb.
seem to argue for a simpler, lookahead based approach, instead of the trickier consuming pattern (depends on your exact workflow):
(?=a.*a)(?!.*bbb).*
regex demo
edit: to exclude all letters except a and b:
^(?=.*a.*a)(?!.*bbb)[ab]+$
regex demo
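As a rough check of the anchored lookahead pattern, the sketch below compares it against a plain-Python statement of the two requirements for every string over {a, b} up to length 6. The helper names and the length limit are arbitrary choices for the example.

```python
import re
from itertools import product

PATTERN = re.compile(r"^(?=.*a.*a)(?!.*bbb)[ab]+$")

def expected(s):
    """The requirements stated directly: at least two a's and no 'bbb'."""
    return s.count("a") >= 2 and "bbb" not in s

def pattern_agrees(max_len=6):
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if bool(PATTERN.search(s)) != expected(s):
                return False
    return True

print(pattern_agrees())
```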
This seems to work too:
(ab{0,2}a)(ab{0,2})*

Regular expression to find specific string and add characters when they're not already there in notepad++

Okay, I have zero knowledge of regular expressions so if someone can direct me to a better way to figure this out then by all means please do.
I figured out that a series of files are missing a particular naming convention for the database they will write to. So some might be dbname1, dbname2, dbname3, abcdbname4, abcdbname5 and they all need to have that abc in the beginning. I want to write a regular expression that will find all tags in the file whose name does not start with abc and add abc to it. Any ideas how I can do this?
Again, forgive me if this is poorly worded/expressed. I really have absolutely zero knowledge of regular expressions. I can't find any questions that are asking this. I know that there are questions asking how to add strings to lines but not how to add only to lines that are missing the string when some already have it.
I thought I had written this in but I'm looking at lines that look like this
<Name>dbname</Name>
or
<Name>abcdbname</Name>
and I need to get them all to have that abc at the beginning
Cameron's answer will work, but so will this. It's called a negative lookbehind.
(?<!abc)(dbname\d+)
This regex looks for dbname followed by 1 or more digits, and not prefixed by abc. So it will capture dbname113.
This looks for any occurrence of dbname not immediately preceded by the string "abc". The original name is in the capture group \1, so you can replace this regex with abc\1 and all your files will be properly prefixed.
Not every program/language that implements regex supports lookbehinds (JavaScript, famously, lacked them before ES2018), but most do and Notepad++ certainly does. Lookarounds (lookbehinds / lookaheads) are exceedingly handy once you get the hang of them.
?<! (negative lookbehind), ?<= (positive lookbehind), ?! (negative lookahead), and ?= (positive lookahead) must all be used within parentheses as I did above, but they do not create capture groups, which is why the second set of parentheses can be referenced as \1 (or $1, depending on the language).
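Outside Notepad++, the same lookbehind replacement can be tried in Python. This is only a sketch; the sample text is made up, and like the regex above it assumes the names end in digits.

```python
import re

text = "dbname1 abcdbname4"

# (?<!abc) rejects any dbname that is directly preceded by "abc".
result = re.sub(r"(?<!abc)(dbname\d+)", r"abc\1", text)
print(result)
```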
Edit: Given some better example criteria, this is possibly more what you're looking for.
Find: (<Name>)(.*?(?<!abc)dbname\d+)(</Name>)
Replace: \1abc\2\3
Alternatively, something a bit easier to understand, you can do this or something like this:
Find: (<Name>)(abc)?(dbname\d+)(</Name>)
Replace: \1abc\3\4
What this does is:
Matches <Name>, captures as backreference 1.
Looks for abc and captures it, if it's there as backreference 2, otherwise 2 contains nothing. The ? after (abc) means match 0 or 1 times.
Looks for the dbname and captures it as backreference 3.
Matches </Name>, captures as backreference 4.
By replacing with \1abc\3\4, you kind of drop abc off dbname if it exists and replace dbname with abcdbname in all instances.
You can take this a step further and prefix the abc with ?: to create a noncapturing group, so the backreferences for replacing are sequential:
Find: (<Name>)(?:abc)?(dbname\d+)(</Name>)
Replace: \1abc\2\3
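A Python sketch of the noncapturing-group variant, applied to two invented <Name> lines. Both lines end up with the abc prefix, whether or not they had it before:

```python
import re

text = "<Name>dbname1</Name>\n<Name>abcdbname2</Name>"

# The optional (?:abc) is consumed but not captured, so the replacement
# always writes exactly one "abc" in front of the captured name.
result = re.sub(
    r"(<Name>)(?:abc)?(dbname\d+)(</Name>)",
    r"\1abc\2\3",
    text,
)
print(result)
```

Running the substitution a second time leaves the text unchanged, which makes it safe to apply to files that are already partly fixed.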
Replace \bdbname(\d+) with abcdbname\1.
The \b means "word boundary", so it won't match the abc versions, but will match the others. The (...) parentheses represent a capturing group, which capture everything that's matched in-between into a numbered variable that can be later referenced (there's only one here so it goes in \1). The \d+ matches one or more digit characters.
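The word-boundary approach behaves slightly differently from the negative lookbehind when some other letters precede dbname. The Python sketch below, on an invented sample string, shows both side by side:

```python
import re

text = "dbname1 abcdbname4 xyzdbname3"

# \b needs a non-word character (or the start of the text) before "dbname",
# so names glued to ANY letters, not just "abc", are left alone.
with_boundary = re.sub(r"\bdbname(\d+)", r"abcdbname\1", text)

# The lookbehind only rules out the exact prefix "abc".
with_lookbehind = re.sub(r"(?<!abc)(dbname\d+)", r"abc\1", text)

print(with_boundary)
print(with_lookbehind)
```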

Struggling with regular expression

I'm struggling to find the regular expression I can use to classify data that matches a certain pattern:
Here's a few examples:
pli:06e9b616-5712-d0e9-1bc2-000012e61393
pli:6fdd187d-cbdc-3028-4a8d-000020f3449a
pli:0472def9-ccf3-e4e9-ca05-00005fecf9f8
As you can see each string begins with pli: and they all have the same pattern even though the characters are different. Each set of characters is separated by a '-' at the same position.
Looks like it has the form pli:UUID where UUID is a universally unique identifier. Try this one:
pli:[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}
Where I've allowed upper case letters too.
See http://en.wikipedia.org/wiki/Universally_unique_identifier
This does it in as short an expression as I could think of:
pli:(?i)[\da-f]{8}-([\da-f]{4}-){3}[\da-f]{12}
The (?i) means "ignore case" (saves having to type a-fA-F everywhere), and I've abbreviated the regex by recognising 3 groups of 4 hex digits in the middle.
See a live demo on rubular
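A Python sketch of the shorter pattern, checked against the three sample values from the question. Recent Python versions reject a global (?i) in the middle of a pattern, so this version uses the scoped form (?i:...) instead; that substitution is mine, not part of the original answer. It also spells out [0-9a-f] rather than \d, which in Python would accept non-ASCII digits.

```python
import re

# "pli:" stays case-sensitive; only the UUID part ignores case.
PLI_UUID = re.compile(r"pli:(?i:[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12})")

samples = [
    "pli:06e9b616-5712-d0e9-1bc2-000012e61393",
    "pli:6fdd187d-cbdc-3028-4a8d-000020f3449a",
    "pli:0472def9-ccf3-e4e9-ca05-00005fecf9f8",
]

all_match = all(PLI_UUID.fullmatch(s) for s in samples)
print(all_match)
```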