Regular expressions, can I exclude pairs of characters? - regex

How do you exclude pairs of characters from a regular expression?
I am trying to get a regular expression that will have 5 alphanumeric characters followed by
anything except "XX" and "AD", followed by XX.
So
D22D0ACXX
will match, but the following two will not match
D22D0ADXX
D22D0XXXX.
My first attempt was :
([A-Z0-9]{5}[^(?AD)|(?XX)]XX)
But the character class [^(?AD)|(?XX)] matches only a single character, so I end up matching the last 8 characters, not all 9.
Can I exclude pairs of characters without getting into back references?
I need to capture the whole group, hence the outer parentheses. The negative lookahead suggestions don't seem to do this.

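Use negative lookahead:
([A-Z0-9]{5}(?!(AD|XX)XX).{4})
Note that .{4} accepts any four trailing characters; to require the literal XX at the end, use ([A-Z0-9]{5}(?!AD|XX).{2}XX) instead.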

Don't treat it as a character class; instead, think of it as an alternation inside a negative lookahead, e.g.:
([A-Z0-9]{5}(?!(AD|XX)XX))
Then, if you need the tail, include it after the lookahead, e.g.:
([A-Z0-9]{5}(?!(AD|XX)XX)[A-Z0-9]{4})
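As a quick check of the lookahead approach, here is a minimal Python sketch run against the three sample strings from the question. The pattern is a variant of the ones above that also requires the literal XX at the end; the names PATTERN, samples and results are only illustrative.

```python
import re

# Five alphanumerics, then two characters that are not "AD" or "XX", then "XX".
# The outer parentheses capture the whole nine-character code.
PATTERN = re.compile(r"([A-Z0-9]{5}(?!AD|XX)[A-Z0-9]{2}XX)")

samples = ["D22D0ACXX", "D22D0ADXX", "D22D0XXXX"]
results = {}
for s in samples:
    m = PATTERN.search(s)
    # Store the captured group, or None when the string is rejected.
    results[s] = m.group(1) if m else None
    print(s, "->", results[s])
```

Only the first sample should come back with a captured value; the lookahead rejects the other two before any of the tail is consumed.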

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, "That was a good movie." should result in "That was", "was a", "a good", "good movie".
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which discards from the match everything that the regex has matched up to that point and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that the pattern \w matches only Latin letters (plus digits and the underscore) by default, so if your documents contain text in a different character set, these characters will be consumed by the \W which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}" which is what RapidMiner supports.
What you can do is use a zero-width group (like a positive lookahead, as shown in the example below). A regex usually "consumes" the characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing later parts of the pattern, or later matches, from checking those characters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern matches each pair of two words (note that it won't match the last word on its own, since it has no following word to pair with). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters \W+ followed by another string of word characters \w+. Because this part sits inside the lookahead (?=...), the second word is not "consumed".
Here is a link to a demo on Regex101
Note that for each match, the first word is in capture group 1, while group 2 holds the separators together with the second word.
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that this is a problem of overlapping regex matches.
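To see the overlapping-pairs idea in action outside RapidMiner, here is a small Python sketch using the lookahead pattern (\w+)(?=(\W+\w+)) from above on the example sentence. It assumes Python's built-in re module; the variable names are only illustrative.

```python
import re

text = "That was a good movie."

# Group 1 consumes one word. The lookahead captures the separators and the
# next word in group 2 without consuming them, so the following match can
# start at that second word and pairs overlap.
pairs = [m.group(1) + m.group(2)
         for m in re.finditer(r"(\w+)(?=(\W+\w+))", text)]

print(pairs)
```

Each pair is rebuilt by joining group 1 with group 2, which works because group 2 already contains the separator that sat between the two words.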

Positive and Negative Lookahead on matchings strings with two or more same consecutive characters [duplicate]

I can very easily write a regular expression to match a string that contains 2 consecutive repeated characters:
/(\w)\1/
How do I do the complement of that? I want to match strings that don't have 2 consecutive repeated characters. I've tried variations of the following without success:
/(\w)[^\1]/ ;doesn't work as hoped
/(?!(\w)\1)/ ;looks ahead, but some portion of the string will match
/(\w)(?!\1)/ ;again, some portion of the string will match
I don't want any language/platform specific way to take the negation of a regular expression. I want the straightforward way to do this.
The regex below matches strings that don't contain any consecutive repeated characters.
^(?!.*(\w)\1).*
(?!.*(\w)\1) is a negative lookahead which asserts that the string about to be matched does not contain two consecutive identical word characters. .*(\w)\1 matches a string that has such a repeated pair at the start, in the middle, or at the end. ^(?!.*(\w)\1) therefore matches at the start of a line only if no repeated pair follows, and the trailing .* then matches all the characters on that line. Note that this also matches empty strings. If you don't want to match empty lines, change the final .* to .+
Note that ^(?!(\w)\1) checks for the repeated characters only at the start of a string or line.
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line. They do not consume characters in the string, but only assert whether a match is possible or not. Lookaround allows you to create regular expressions that are impossible to create without them, or that would get very longwinded without them.
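A short Python sketch of the ^(?!.*(\w)\1).* pattern follows, assuming single-line input strings. The helper name has_no_doubled_char and the added $ anchor are illustrative choices, not part of the answers above.

```python
import re

# The negative lookahead at the start rejects any line that contains a word
# character immediately followed by the same character.
NO_DOUBLES = re.compile(r"^(?!.*(\w)\1).*$")

def has_no_doubled_char(s: str) -> bool:
    """Return True when s has no two identical adjacent word characters."""
    return NO_DOUBLES.match(s) is not None

for sample in ["abcd", "abab", "abbc", "book", ""]:
    print(repr(sample), has_no_doubled_char(sample))
```

The empty string is accepted here, as noted above; swapping the final .* for .+ would reject it.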

Combining expressions

How do I combine two or more regular expressions such that a match occurs only when both expressions are true?
For instance I want to identify Text containing 6 digits (not beginning with a 5) within word boundaries i.e.
\b[0-46-9]\d{5}\b
but I want to exclude Text containing 000000
^(?!.*000000).*$
abc234576c Match
abc534756c No Match
abc000000c No Match
How do I do this?
Try this regex pattern:
\b(?!.*000000)[^0-9]*[0-46-9]\d{5}[^0-9]*\b
This assumes that you are looking to match a six-digit number, possibly with non-digits both preceding and following it. It also ensures that the number is not 000000 and that the number does not begin with 5.
Demo
Your first regex misses one important point: \b matches the boundary between a word character (digits included) and a non-word character, so there is no boundary between a letter and a digit.
When the whole text is needed, this should work:
[a-zA-Z]*[0-46-9]\d{5}[a-zA-Z]*
Combining it with the exclusion from your second expression, you would get:
[A-Za-z]*(?!0{6})[0-46-9]\d{5}[A-Za-z]*
You can view the results here.
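The combined pattern can be checked against the three sample strings from the question with a few lines of Python. This is only a sketch: using re.fullmatch to test the whole string is an assumption about how the pattern would be applied, and the names COMBINED and verdicts are illustrative.

```python
import re

# Optional letters, then six digits that are not 000000 and do not start
# with 5, then optional letters.
COMBINED = re.compile(r"[A-Za-z]*(?!0{6})[0-46-9]\d{5}[A-Za-z]*")

samples = ["abc234576c", "abc534756c", "abc000000c"]

# fullmatch requires the pattern to cover the entire string.
verdicts = {s: COMBINED.fullmatch(s) is not None for s in samples}

for s, ok in verdicts.items():
    print(s, "Match" if ok else "No Match")
```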

What is the purpose of using positive lookarounds over not?

Say the string is ‘abc’ and the expression is (?=a)abc, would that not be the same as just searching for abc? When do positive lookarounds have purpose over not using them?
Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.
http://www.regular-expressions.info/lookaround.html
Here is a small example from https://ourcraft.wordpress.com/2009/03/25/positive-examples-of-positive-and-negative-lookahead/
Say I want to retrieve from a text document all the words that are immediately followed by a comma. We’ll use this example string:
What then, said I, shall I do? You shan't, he replied, do anything.
As a first attempt, I could use this regular expression to get one or more word parts followed by a comma:
[A-Za-z']+,
This yields four results over the string:
then,
I,
shan't,
replied,
Notice that this gets me the comma too, though, which I would then have to remove. Wouldn’t it be better if we could express that we want to match a word that is followed by a comma without also matching the comma?
We can do that by modifying our regex as follows:
[A-Za-z']+(?=,)
This matches groups of word characters that are followed by a comma, but because of the use of lookahead the comma is not part of the matched text (just as we want it not to be). The modified regex results in these matches:
then
I
shan't
replied
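The comma example above can be reproduced in Python as a minimal sketch; the variable names are illustrative. It also checks the case from the question, where (?=a)abc finds the same text as plain abc, so the lookahead only pays off when the asserted text should stay out of the match.

```python
import re

text = "What then, said I, shall I do? You shan't, he replied, do anything."

# Without lookahead the comma is part of each match.
with_comma = re.findall(r"[A-Za-z']+,", text)

# With a positive lookahead the comma is required but not included.
without_comma = re.findall(r"[A-Za-z']+(?=,)", text)

print(with_comma)
print(without_comma)

# The lookahead here is redundant: both patterns match the same text.
same = re.search(r"(?=a)abc", "abc").group() == re.search(r"abc", "abc").group()
print(same)
```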

Regex must contain specific letters in any order

I have been attempting to validate a string in VB.net that must contain the three letters ABC in no particular order; the letters do not need to be next to one another.
I can do this easily using LINQ
MessageBox.Show(("ABC").All(Function(n) ("AAAABBBBBCCCC").Contains(n)).ToString)
However, after searching Google and SO for over a week, I am completely stumped. My closest pattern is ".*[A|B|C]+.*[A|B|C]+.*[A|B|C]+.*"; however, AAA would also return true. I know I can do this using other methods, but after trying for a week I really want to know if it's possible using one regular expression.
Your original pattern won't work because it will match any number of characters, followed by one or more A, B, C, or | character, followed by any number of characters, followed by one or more A, B, C, or | character, followed by any number of characters, followed by one or more A, B, C, or | character, followed by any number of characters.
I'd probably go with the code you've already written, but if you really want to use a regular expression, you can use a series of lookahead assertions, like this:
(?=.*A)(?=.*B)(?=.*C)
This will match any string that contains A, B, and C in any order.
You can make use of positive lookaheads:
^(?=.*A)(?=.*B)(?=.*C).+
(?=.*A) makes sure there's an A somewhere in the string and the same logic applies to the other lookaheads.
You can use zero-width lookaheads. Lookaheads are great for eliminating match possibilities that don't meet certain criteria.
For example, let's use the words
untie queue unique block unity
Start with a basic word match:
\b\w+\b
To require that the word matched with \w+ begins with un, we could use a positive lookahead
\b(?=un)\w+\b
What this says is
\b Assert a word boundary (zero-width; no character is consumed)
(?=un) Are there the letters "un"? If not, NO MATCH. If so, then possible match.
\w+ One or more word characters
\b Assert a word boundary
A positive lookahead eliminates a match possibility if it does NOT meet the expression inside. It applies to the regex RIGHT AFTER it. So the (?=un) applies to the \w+ expression above and requires that it BEGINS WITH un. If it does not, then the \w+ expression won't match.
How about matching any words that do not begin with un? Simply use a "negative lookahead"
\b(?!un)\w+\b
\b Assert a word boundary (zero-width; no character is consumed)
(?!un) Are there the letters "un"? If SO, NO MATCH. If not, then possible match.
\w+ One or more word characters
\b Assert a word boundary
So for your requirement of having at least 1 A, 1 B and 1 C in the string, a pattern like
(?=.*A)(?=.*B)(?=.*C).+
Works because it says:
(?=.*A) - Does it have .* any characters followed by A? If so, possible match; if not, no match.
(?=.*B) - Does it have .* any characters followed by B? If so, possible match; if not, no match.
(?=.*C) - Does it have .* any characters followed by C? If so, possible match; if not, no match.
.+ If the above 3 lookahead requirements were met, match any characters. If not, then match no characters (and so there isn't a match).
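The chained lookaheads can be verified with a short Python sketch (the question is about VB.net, but the pattern itself is the same). The helper name contains_a_b_c is hypothetical and only for illustration.

```python
import re

# Each lookahead starts from the beginning of the string and checks for one
# required letter; .+ then consumes the string if all three checks pass.
HAS_ABC = re.compile(r"^(?=.*A)(?=.*B)(?=.*C).+")

def contains_a_b_c(s: str) -> bool:
    """Return True when s contains at least one A, one B and one C."""
    return HAS_ABC.match(s) is not None

for sample in ["AAAABBBBBCCCC", "AAA", "xCyBzA", ""]:
    print(repr(sample), contains_a_b_c(sample))
```

Because every lookahead is anchored at the same position, the order in which the letters appear in the string does not matter.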
Does it have to be a regex? That's something that can easily be solved without one.
I've never programmed in VB, but I'm sure there are helper functions that let you take a string, and query whether or not a character occurs in it.
If str is your string, maybe something like:
str.contains('A') && str.contains('B') && str.contains('C')