Match a word but not its inverse using [^] syntax - regex

I am trying to make a regex that doesn't match one word, but does match its reverse. For example, if the word I don't want to match is "no":
I am matching this word // will pass
I am matching no word // will not pass
I am matching on word // will pass
I am matching that word // will pass
The current regex I am using doesn't pass on the third example, because it is not matching any word with "n" or "o" in it:
^I am matching ([^no]*) word$
What is the best way to achieve this - ie, match on a word, not a collection of characters?
For context I am writing acceptance tests using Scala and Cucumber, which use Regex to match a feature file up with its corresponding stepdef. My real-world example is more complex, so I have simplified it here. Also, I know that I can just catch (.*) and handle what is in that capture group using a case/match block in Scala, but I am curious about how to do this with purely Regex.

You can use a negative lookahead to test the text you're about to match:
^I am matching (?!no\b)(?<CapturedWord>\w+) word$
(?!no\b) - This is a negative lookahead. It tests the text at the current position: if it is "no" followed by a word boundary, the match fails. Anything else will pass. A lookahead does not actually consume those characters, so...
(?<CapturedWord>\w+) - ...we need to capture the characters in order to continue on with the rest of the test. I used a named group because they're often easier to reference later on in code.
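As a quick check of the lookahead pattern against the four sentences from the question, here is a minimal Python sketch. Python's re module spells named groups (?P<name>...), so the group syntax is adjusted; the helper name captured_word is made up for illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for the named group.
PATTERN = re.compile(r"^I am matching (?!no\b)(?P<CapturedWord>\w+) word$")

def captured_word(sentence):
    """Return the captured word, or None when the sentence is rejected."""
    m = PATTERN.match(sentence)
    return m.group("CapturedWord") if m else None

if __name__ == "__main__":
    for s in ("I am matching this word",
              "I am matching no word",
              "I am matching on word",
              "I am matching that word"):
        print(s, "->", captured_word(s))
```

Because of the \b inside the lookahead, only the exact word "no" is rejected; longer words that merely start with "no" still pass.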

Another solution is to describe all the words that aren't "no". Note that this solution isn't handy if you want to negate a long substring, but with regex engines that don't have the lookahead feature, this is the only way:
^I am matching ([^\Wn]\w+|n[^\Wo]+|\w(?:\w{2,})?) word$
The first two branches of the alternation cover, in particular, all two-letter words that aren't "no"; the last branch matches one-letter words and words of three or more letters.
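To see which words the lookahead-free alternation accepts, here is a small Python sketch. The helper accepts is hypothetical and simply drops a word into the sentence from the question.

```python
import re

# The alternation from above, unchanged: no lookahead is needed.
NO_LOOKAHEAD = re.compile(
    r"^I am matching ([^\Wn]\w+|n[^\Wo]+|\w(?:\w{2,})?) word$"
)

def accepts(word):
    """True when the sentence built around `word` matches the pattern."""
    return NO_LOOKAHEAD.match(f"I am matching {word} word") is not None

if __name__ == "__main__":
    for w in ("this", "on", "no", "a", "na", "not"):
        print(w, accepts(w))
```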

Related

Match a specific regex using matches()

Trying to match a specific word using matches()
*//id[matches(.,lower-case('*\s?Xander\s?*'))]
Examples:
Set of Xanderous- No match
Xander Tray of 6- Match
Tray of 6 pieces Xander- Match
Set of 6 Xander pieces- Match
Any instance of the exact word 'Xander' match is the objective.
The reason the XPath regex dialect doesn't handle word boundaries is that to do it properly, you need to be language-sensitive - a "word" is a cultural artefact.
You could do tokenize(., '\P{L}+') = 'Xander' which tokenizes treating any sequence of non-letters as a separator and then tests if one of the tokens is 'Xander'.
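The same tokenize-then-compare idea can be sketched outside XPath. The Python version below is only an approximation: Python's re module has no \p{L}, so [^\W\d_]+ stands in for "a run of letters", and the tokens are collected with findall instead of splitting on non-letters. The helper name has_word is an assumption.

```python
import re

def has_word(text, word="Xander"):
    """True when `word` is one of the letter-only tokens of `text`."""
    # [^\W\d_]+ approximates \p{L}+ (word characters minus digits and underscore).
    tokens = re.findall(r"[^\W\d_]+", text)
    return word in tokens

if __name__ == "__main__":
    for s in ("Set of Xanderous", "Xander Tray of 6",
              "Tray of 6 pieces Xander", "Set of 6 Xander pieces"):
        print(s, "->", has_word(s))
```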
I have been running some tests and it seems word boundaries are not integrated into the XML/XPATH vocabulary. So the next best thing IMO is to test for a whitespace or start/end string anchors surrounding zero or more characters. Therefore, I ended up with:
*//id[matches(lower-case(.),'.*(^|\s)xander($|\s).*')]
Even better would be to drop lower-case altogether and use the third matches parameter (flags), setting it to case-insensitive matching:
*//id[matches(.,'.*(^|\s)xander($|\s).*','i')]
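For testing the whitespace-delimited pattern outside an XPath engine, a rough Python equivalent could look like this. The surrounding .* is dropped because re.search is unanchored, and re.IGNORECASE plays the role of the 'i' flag; contains_xander is a hypothetical helper name.

```python
import re

# "xander" must be preceded by start-of-string or whitespace,
# and followed by end-of-string or whitespace.
WORD = re.compile(r"(^|\s)xander($|\s)", re.IGNORECASE)

def contains_xander(text):
    return WORD.search(text) is not None

if __name__ == "__main__":
    for s in ("Set of Xanderous", "Xander Tray of 6",
              "Tray of 6 pieces Xander", "Set of 6 Xander pieces"):
        print(s, "->", contains_xander(s))
```

One consequence of delimiting on whitespace only: a word directly followed by punctuation, as in "Xander, tray", is not counted as a match.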
Roughly, if you want to match the full line when it contains the exact word Xander, you can use \b, which marks a word boundary, plus a greedy .* on each side:
^.*\bXander\b.*$
Demo: https://regex101.com/r/PvKptN/1
Or if you don't need the whole line, you can simply check if it contains Xander:
\bXander\b
Demo: https://regex101.com/r/PvKptN/2
I hope it satisfies the regex flavor you're using

How can I match the second occurrence of a string without using capture groups

I need a regex expression that ONLY matches the second occurrence of a given string within a longer string.
Possible examples
NSTR; TEST; NSTR
GKOH; NSTR; NSTR; JLAH
GKOH; JLAH; NSTR; CZE; NSTR; FKILL
Should match the second NSTR in each case
I can write an expression that only puts the second occurrence into a capture group:
.*NSTR.*?(NSTR)
However, that's no use to me as I'm writing it to fit in a function, the code of which I have no control over.
It's for use within a function in MusicBee which uses VB.Net
You can use the following in .net:
(?<=(\b\w+\b).*)\1
This works as follows:
(?<=(\b\w+\b).*) lookbehind asserting what precedes matches
(\b\w+\b) capture a word into capture group 1:
\b word boundary
\w+ match one or more word characters (roughly [a-zA-Z0-9_]; in .NET, \w also matches other Unicode letters and digits unless RegexOptions.ECMAScript is set) - you can obviously limit this further if you need to, but then you'll also need to create your own boundaries to replace \b (such as (?<![a-z])[a-z]+(?![a-z]) for lowercase ASCII letters)
\b word boundary
.* matches any character any number of times
\1 matches the same text as most recently captured by capture group 1
How does this work? Most languages have a regex function or modifier to match all instances (g or global matches), and another to match only the first. The regex above scans the string for the first location where \b\w+\b can be found anywhere to the left of that location and where that same text is found again immediately to the right of that location (then we match the word). So in a a a a, the first match is the second a; a global search also returns the later ones.
I'm not familiar with MusicBee or whether you're using $IsMatch, $Replace, or $RxReplace, but it seems you can likely use $First to grab the first result if any of those functions returns multiple values (as in my example of a a a a).
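Python's re module rejects variable-length lookbehind, so the .NET pattern above cannot be run there as is. To check which span such a pattern is expected to select for one fixed word, the sketch below swaps in plain iteration over the matches; it is not the lookbehind technique itself. second_occurrence is a hypothetical helper.

```python
import re
from itertools import islice

def second_occurrence(text, word="NSTR"):
    """Return the match object for the second whole-word occurrence, or None."""
    matches = re.finditer(rf"\b{re.escape(word)}\b", text)
    # Skip the first match and take the next one, if there is one.
    return next(islice(matches, 1, 2), None)

if __name__ == "__main__":
    for s in ("NSTR; TEST; NSTR",
              "GKOH; NSTR; NSTR; JLAH",
              "GKOH; JLAH; NSTR; CZE; NSTR; FKILL"):
        m = second_occurrence(s)
        print(s, "->", m.span() if m else None)
```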
I guess, maybe,
(((?!NSTR).)*NSTR((?!NSTR).)*)NSTR
or,
(((?!\bNSTR\b).)*\bNSTR\b((?!\bNSTR\b).)*)\bNSTR\b
might be OK to look into.
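A short Python sketch of the first of these two patterns shows what it actually matches: the overall match runs from the start of the string through the second NSTR, so the position of the second occurrence has to be derived from the end of the match. The helper second_nstr_start is an assumption made for illustration.

```python
import re

# ((?!NSTR).)* consumes characters one at a time as long as "NSTR" does not start there.
TEMPERED = re.compile(r"(((?!NSTR).)*NSTR((?!NSTR).)*)NSTR")

def second_nstr_start(text):
    """Index where the second NSTR begins, or None if there are fewer than two."""
    m = TEMPERED.search(text)
    return m.end() - len("NSTR") if m else None

if __name__ == "__main__":
    for s in ("NSTR; TEST; NSTR",
              "GKOH; NSTR; NSTR; JLAH",
              "GKOH; JLAH; NSTR; CZE; NSTR; FKILL"):
        print(s, "->", second_nstr_start(s))
```

Since the match includes everything before the second NSTR, this pattern on its own does not isolate the second occurrence the way the question asks.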
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which drops from the match everything that the regex has matched up to that point and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by the \W, which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero-width group (like a positive lookahead, as in the pattern below). A regex usually "consumes" the characters it checks, but with a positive lookahead/lookbehind you assert that characters exist without preventing later parts of the search from checking those characters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The pattern above matches once for each pair of consecutive words (note that it won't match the last word, since it has no partner). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters \W+ followed by another string of word characters \w+. Because this happens inside the lookahead (?=...), the second word is not "consumed".
Here is a link to a demo on Regex101
Note that for each match, the first word is in group 1 and the second word is in group 2 (together with the separator that precedes it).
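Here is a minimal Python sketch of the lookahead pattern applied to the sentence from the question. Gluing group 1 and group 2 back together to form each pair is my own choice for presenting the result; word_pairs is a hypothetical helper.

```python
import re

# Group 1 is consumed; group 2 sits inside the lookahead and is only inspected.
PAIR = re.compile(r"(\w+)(?=(\W+\w+))")

def word_pairs(text):
    """Return overlapping two-word chunks, separators included."""
    return [m.group(1) + m.group(2) for m in PAIR.finditer(text)]

if __name__ == "__main__":
    print(word_pairs("That was a good movie."))
```

Because the second word is never consumed, the next search starts at it, which is what produces the overlap.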
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that this is a problem of overlapping regex matches.

How to find words that contain string with a limited size

I need to find all the words in an inputted text that have (?i:val) in them and are no longer than 5 characters.
So far I got: \b([a-zA-Z]*(?i:val)[a-zA-Z]*){1,4}\b
If we take this sample text to look in: In computer science, a value is an expression which cannot be evaluated any further (a normal form). Val is also a match
I get 3 matches (value, evaluated and Val), however evaluated should not match the pattern, as it is too long. What is the right way to get this straight?
Your pattern does not account for the length of the words matched.
Use word boundaries and a lookahead like this:
(?i)\b(?=\w*val)\w{1,5}\b
See regex demo
The regex matches:
\b - a leading word boundary since the next pattern is \w
(?=\w*val) - a lookahead making sure there is a val substring after zero or more word characters
\w{1,5} - matches 1 to 5 word characters
\b - trailing word boundary that stops words longer than 5 characters from matching
You may use an ASCII JS version of the regex:
/\b(?=[a-z]*val)[a-z]{1,5}\b/i
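The lookahead version can be checked against the sample text from the question with a few lines of Python (the variable names are mine):

```python
import re

# Lookahead requires "val" somewhere in the word; \w{1,5}\b caps the length at 5.
SHORT_VAL = re.compile(r"(?i)\b(?=\w*val)\w{1,5}\b")

TEXT = ("In computer science, a value is an expression which cannot be "
        "evaluated any further (a normal form). Val is also a match")

found = SHORT_VAL.findall(TEXT)

if __name__ == "__main__":
    print(found)
```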
It's important to understand why "evaluated" was matched. Note:
[a-zA-Z]* matches the "e"
(?i:val) matches "val"
[a-zA-Z]* matches "uated"
Actually, there's no repetition here! The pattern matched in a single iteration.
You can achieve what you want using lookarounds, but I think that regex is not the best tool for this task. I highly recommend using other functions, depending on what you have available.
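To reproduce the behaviour described in the question, the original pattern can be run in Python (3.6 or later, which accepts the scoped (?i:...) flag). The {1,4} counts repetitions of the whole group, not characters, so it places no limit on word length.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"\b([a-zA-Z]*(?i:val)[a-zA-Z]*){1,4}\b")

TEXT = ("In computer science, a value is an expression which cannot be "
        "evaluated any further (a normal form). Val is also a match")

# group(0) is the whole match; findall would return only the capture group.
matches = [m.group(0) for m in ORIGINAL.finditer(TEXT)]

if __name__ == "__main__":
    print(matches)
```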

RegEx lookahead but not immediately following

I am trying to match terms such as the Dutch ge-berg-te. berg is a noun by itself, and ge...te is a circumfix, i.e. geberg does not exist, nor does bergte. gebergte does. What I want is a RegEx that matches berg or gebergte, working with a lookaround. I was thinking this would work
\b(?i)(ge(?=te))?berg(te)?\b
But it doesn't. I am guessing that is because a lookahead only checks the immediately following characters, and not across characters. Is there any way to match characters with a lookahead without the constraint that those characters have to immediately follow the others?
Valid matches would be:
Berg
berg
Gebergte
gebergte
Invalid matches could be:
Geberg
geberg
Bergte
bergte
ge-/Ge- and -te always have to occur together. Note that I want to try this with a lookahead. I know it can be done more simply, but I want to see if it's methodologically possible to do something like this.
Here is one non-lookaround based regex:
\b(berg|gebergte)\b
Use it with i (ignore case) flag. This regex uses alternation and word boundary to search for complete words berg OR gebergte.
RegEx Demo
Lookaround based regex:
(?<=\bge)berg(?=te\b)|\bberg\b
This regex uses a lookahead and a lookbehind to search for berg preceded by ge and followed by te. Alternatively, it matches the complete word berg using the word boundary assertion \b, which is also zero-width, like the anchors ^ and $.
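A quick Python check of the lookaround-based pattern against the valid and invalid words from the question (the lookbehind \bge has a fixed width, so Python's re accepts it; the helper name is hypothetical):

```python
import re

BERG = re.compile(r"(?<=\bge)berg(?=te\b)|\bberg\b", re.IGNORECASE)

def matches_berg(word):
    return BERG.search(word) is not None

if __name__ == "__main__":
    for w in ("Berg", "berg", "Gebergte", "gebergte",
              "Geberg", "geberg", "Bergte", "bergte"):
        print(w, matches_berg(w))
```

Note that for gebergte the matched text is only berg, since ge and te are asserted by the lookarounds rather than consumed.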
To forbid a substring in general, you can put a negative lookahead at the beginning of the pattern and combine it with any number of other characters before the string you want to forbid:
regex: don't match if containing a specific string
^(?!.*720).*
This will not match if the string contains 720, and will match anything else.
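A minimal Python sketch of this idea, assuming the intended pattern is ^(?!.*720).* with no backslash before the first *. The sample file names are made up.

```python
import re

# The lookahead scans the whole line from the start; if "720" occurs anywhere, the match fails.
NO_720 = re.compile(r"^(?!.*720).*")

def allowed(line):
    return NO_720.match(line) is not None

if __name__ == "__main__":
    for name in ("movie_720p.mkv", "movie_1080p.mkv"):
        print(name, allowed(name))
```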