Match a specific regex using matches() - regex

Trying to match a specific word using matches()
*//id[matches(.,lower-case('*\s?Xander\s?*'))]
Examples:
Set of Xanderous- No match
Xander Tray of 6- Match
Tray of 6 pieces Xander- Match
Set of 6 Xander pieces- Match
The objective is to match any instance of the exact word 'Xander'.

The reason the XPath regex dialect doesn't handle word boundaries is that to do it properly, you need to be language-sensitive - a "word" is a cultural artefact.
You could do tokenize(., '\P{L}+') = 'Xander' which tokenizes treating any sequence of non-letters as a separator and then tests if one of the tokens is 'Xander'.
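For comparison, here is a rough Python analogue of the tokenize() idea. It is only a sketch: Python's built-in re module has no \P{L}, so the class [\W\d_]+ is used as an approximation of "a run of non-letters", and the helper name contains_word is made up for illustration.

```python
import re

# Approximation of XPath's '\P{L}+': non-word characters, digits and the
# underscore are all treated as separators between tokens.
NON_LETTERS = re.compile(r"[\W\d_]+")


def contains_word(text: str, word: str = "Xander") -> bool:
    """Split on runs of non-letters and test whether one token equals word."""
    return word in NON_LETTERS.split(text)


for sample in ["Set of Xanderous", "Xander Tray of 6",
               "Tray of 6 pieces Xander", "Set of 6 Xander pieces"]:
    print(sample, "->", contains_word(sample))
```

Because the comparison is against whole tokens, 'Xanderous' is rejected without needing any word-boundary syntax.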

I have been running some tests and it seems word boundaries are not integrated into the XML/XPATH vocabulary. So the next best thing IMO is to test for a whitespace or start/end string anchors surrounding zero or more characters. Therefore, I ended up with:
*//id[matches(lower-case(.),'.*(^|\s)xander($|\s).*')]
Even better would be to drop lower-case altogether and use the third parameter of matches (flags), setting it to 'i' for case-insensitive matching:
*//id[matches(.,'.*(^|\s)xander($|\s).*','i')]
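The same whitespace-or-anchor idea can be tried outside XPath. A minimal Python sketch, assuming the four sample strings from the question (has_xander is a hypothetical helper name):

```python
import re

# 'xander' must be delimited by whitespace or by the start/end of the string.
# re.search() is unanchored, so the surrounding '.*' is not needed here.
PATTERN = re.compile(r"(^|\s)xander($|\s)", re.IGNORECASE)


def has_xander(text: str) -> bool:
    """Return True if 'xander' appears as a whitespace-delimited word."""
    return PATTERN.search(text) is not None


for sample in ["Set of Xanderous", "Xander Tray of 6",
               "Tray of 6 pieces Xander", "Set of 6 Xander pieces"]:
    print(sample, "->", has_xander(sample))
```

One limitation of this approach: punctuation does not count as a delimiter, so a string such as 'Tray for Xander, boxed' is not matched.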

Roughly, if you want to match the full line whenever it contains the exact word Xander, you can use \b, which marks a word boundary, plus a greedy .* on each side:
^.*\bXander\b.*$
Demo: https://regex101.com/r/PvKptN/1
Or if you don't need the whole line, you can simply check if it contains Xander:
\bXander\b
Demo: https://regex101.com/r/PvKptN/2
I hope it satisfies the regex flavor you're using
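In a flavor that does support \b (the XPath dialect does not, as noted in the other answer), the check is short. A small Python sketch, with has_word as an illustrative name:

```python
import re

# \b is a zero-width assertion between a word character and a non-word one.
WORD = re.compile(r"\bXander\b")


def has_word(text: str) -> bool:
    """Return True if 'Xander' occurs as a standalone word."""
    return WORD.search(text) is not None


print(has_word("Set of Xanderous"))        # False: 'Xander' runs on into 'ous'
print(has_word("Set of 6 Xander pieces"))  # True
print(has_word("Tray for Xander, boxed"))  # True: the comma ends the word
```

Unlike a whitespace-based test, \b also accepts punctuation as the end of the word.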

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach relies on \K, which discards everything the regex has matched up to that point and makes the reported match start there. So essentially, we do: match a word, match separators, match another word, forget everything, then match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
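To see what the two passes produce, here is a sketch in Python. Python's built-in re module has no \K, so instead of splitting on the delimiter this version matches the part that precedes \K directly; that is a swapped-in technique, not the tokenizer's own behaviour, and word_pairs is a made-up name.

```python
import re


def word_pairs(text: str) -> list[list[str]]:
    """Group words two at a time, then split each group into its words."""
    # First pass: each match of word + separators + word is one token.
    tokens = re.findall(r"\w+\W+\w+", text)
    # Second pass: split every token on the separators.
    return [re.split(r"\W+", token) for token in tokens]


print(word_pairs("That was a good movie."))
# [['That', 'was'], ['a', 'good']]
```

Note that the groups do not overlap, so this gives That was, a good rather than That was, was a, a good, good movie. In this sketch the unpaired last word is dropped, whereas a real split would leave it as a final token.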
Also note that the pattern \w here matches only basic Latin letters (plus digits and the underscore), so if your documents contain text in a different character set, those characters will be consumed by the \W that is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language but not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The pattern above matches once for each pair of adjacent words (the last word does not start a match of its own, since no word follows it). The first word is in the first capture group, (\w+). A positive lookahead then requires a sequence of non-word characters, \W+, followed by another string of word characters, \w+. Because that part sits inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Here is a link to a demo on Regex101
Note that each match has two capture groups: group 1 holds the first word, and group 2 holds the separator plus the second word.
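The overlapping behaviour is easy to check in Python, whose re module supports lookaheads. A minimal sketch (bigrams is an illustrative name):

```python
import re

# Group 1 is the consumed word; group 2 (separator + next word) is only
# asserted by the lookahead, so the next match can start at that word.
PAIR = re.compile(r"(\w+)(?=(\W+\w+))")


def bigrams(text: str) -> list[str]:
    """Return overlapping two-word pieces of text."""
    return [first + rest for first, rest in PAIR.findall(text)]


print(bigrams("That was a good movie."))
# ['That was', 'was a', 'a good', 'good movie']
```

Joining group 1 and group 2 keeps the original separator between the two words.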
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)), inspired by this SO question.
My question sounds wrong once you understand that this is a problem of overlapping regex matches.

Regex not extracting all matching words

I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This is the regex \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using:
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
You are not doing anything wrong, except that to match Unicode word boundaries you have to enable the u modifier, or use (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphens as well, add the hyphen to your character class: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)
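As a cross-check in another engine: Python 3 treats \w and \b as Unicode-aware for str patterns by default, so the hyphen-extended pattern can be tried without any extra flag. This is a sketch only; the NFC normalisation step is my own precaution (so each accented letter is a single code point on both sides), and special_words is a made-up name.

```python
import re
import unicodedata

SPECIAL = "āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ"

# Word characters or hyphens, at least one special character, then more
# word characters or hyphens, all between word boundaries.
PATTERN = re.compile(
    unicodedata.normalize("NFC", rf"\b[\w-]*[{SPECIAL}]+[\w-]*\b")
)

SENTENCE = ("His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn "
            "Al-Daḥāk Al-Sulamī Al-Tirmidhī.")


def special_words(text: str) -> list[str]:
    """Return the words containing at least one character from SPECIAL."""
    return PATTERN.findall(unicodedata.normalize("NFC", text))


print(special_words(SENTENCE))
```

On the sample sentence this should return the seven words listed in the question, including the hyphenated ones.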

Match a word but not its inverse using [^] syntax

I am trying to make a regex that doesn't match one word, but does match its reverse. For example, if the word I don't want to match is "no":
I am matching this word // will pass
I am matching no word // will not pass
I am matching on word // will pass
I am matching that word // will pass
The current regex I am using doesn't pass on the third example, because it is not matching any word with "n" or "o" in it:
^I am matching ([^no]*) word$
What is the best way to achieve this - ie, match on a word, not a collection of characters?
For context I am writing acceptance tests using Scala and Cucumber, which use Regex to match a feature file up with its corresponding stepdef. My real-world example is more complex, so I have simplified it here. Also, I know that I can just catch (.*) and handle what is in that capture group using a case/match block in Scala, but I am curious about how to do this with purely Regex.
You can use a negative lookahead to test the text you're about to match:
^I am matching (?!no\b)(?<CapturedWord>\w+) word$
(?!no\b) - This is a negative lookahead. It tests the next two characters. If they are "no" followed by a word boundary, then the match fails. Anything else will pass. A lookahead does not actually capture those characters, so...
(?<CapturedWord>\w+) - ...we need to capture the characters in order to continue on with the rest of the test. I used a named group because they're often easier to reference later on in code.
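A quick way to check the four sample lines is a small Python sketch. Note one syntax difference: Python's re module spells named groups (?P<name>...) rather than (?<name>...). The helper name captured_word is hypothetical.

```python
import re
from typing import Optional

# Negative lookahead rejects the word 'no', then the word is captured.
STEP = re.compile(r"^I am matching (?!no\b)(?P<CapturedWord>\w+) word$")


def captured_word(line: str) -> Optional[str]:
    """Return the captured word, or None if the line does not match."""
    match = STEP.match(line)
    return match.group("CapturedWord") if match else None


for line in ["I am matching this word", "I am matching no word",
             "I am matching on word", "I am matching that word"]:
    print(line, "->", captured_word(line))
```

Because of the \b inside the lookahead, longer words that merely start with "no" (for example "not") still pass.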
Another solution is to describe all words that aren't "no". Note that this solution isn't handy if you want to negate a long substring, but with regex engines that don't have the lookahead feature, this is the only way:
^I am matching ([^\Wn]\w+|n[^\Wo]+|\w(?:\w{2,})?) word$
The first two branches of the alternation cover, in particular, all two-letter words that aren't "no"; the last branch matches one-letter words and words of three or more letters.
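The lookahead-free pattern can be exercised the same way. A minimal Python sketch, assuming the sample lines from the question plus a few edge cases of my own (word_or_none is an illustrative name):

```python
import re
from typing import Optional

# Three branches: words not starting with 'n', words starting with 'n' but
# containing no 'o' afterwards, and words of length 1 or length 3 and up.
NOT_NO = re.compile(r"^I am matching ([^\Wn]\w+|n[^\Wo]+|\w(?:\w{2,})?) word$")


def word_or_none(line: str) -> Optional[str]:
    """Return the matched word, or None if the line is rejected."""
    match = NOT_NO.match(line)
    return match.group(1) if match else None


for word in ["this", "no", "on", "that", "n", "ne", "nope"]:
    print(word, "->", word_or_none(f"I am matching {word} word"))
```

Only the line containing exactly "no" is rejected.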

RegEx lookahead but not immediately following

I am trying to match terms such as the Dutch ge-berg-te. berg is a noun by itself, and ge...te is a circumfix, i.e. geberg does not exist, nor does bergte. gebergte does. What I want is a RegEx that matches berg or gebergte, working with a lookaround. I was thinking this would work
\b(?i)(ge(?=te))?berg(te)?\b
But it doesn't. I am guessing that is because a lookahead only checks the immediately following characters, and not across characters. Is there any way to match characters with a lookahead without the constraint that those characters have to come immediately after the others?
Valid matches would be:
Berg
berg
Gebergte
gebergte
Invalid matches could be:
Geberg
geberg
Bergte
bergte
ge-/Ge- and -te always have to occur together. Note that I want to try this with a lookahead. I know it can be done more simply, but I want to see if it's methodologically possible to do something like this.
Here is one non-lookaround based regex:
\b(berg|gebergte)\b
Use it with i (ignore case) flag. This regex uses alternation and word boundary to search for complete words berg OR gebergte.
Lookaround based regex:
(?<=\bge)berg(?=te\b)|\bberg\b
This regex uses a lookahead and a lookbehind to search for berg preceded by ge and followed by te. Alternatively, it matches the complete word berg using the word boundary assertion \b, which is zero-width, like the anchors ^ and $.
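The lookaround pattern can be checked against the valid and invalid words from the question. A small Python sketch (finds_berg is a made-up helper; Python accepts this lookbehind because it has a fixed width):

```python
import re

# Either 'berg' wrapped by an asserted 'ge' and 'te', or the bare word.
BERG = re.compile(r"(?<=\bge)berg(?=te\b)|\bberg\b", re.IGNORECASE)


def finds_berg(word: str) -> bool:
    """Return True if word is accepted by the lookaround pattern."""
    return BERG.search(word) is not None


for word in ["Berg", "berg", "Gebergte", "gebergte",
             "Geberg", "geberg", "Bergte", "bergte"]:
    print(word, "->", finds_berg(word))

# The circumfix is only asserted, never consumed:
print(BERG.search("gebergte").group())  # berg
```

So the reported match is always just berg; if the whole word gebergte is needed in the match, the alternation-based regex is the one to use.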
To forbid a string in general, you can put a negative lookahead at the beginning of the pattern and let it skip any number of other characters before the string you want to forbid:
regex: don't match if containing a specific string
^(?!.*720).*
This will not match if the string contains 720, but will match everything else.
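A short Python sketch of that check, with made-up sample strings and a hypothetical helper name allowed:

```python
import re

# The lookahead scans the whole string from the start for '720' and
# vetoes the match if it is found anywhere.
NO_720 = re.compile(r"^(?!.*720).*")


def allowed(text: str) -> bool:
    """Return True if text does not contain '720'."""
    return NO_720.match(text) is not None


print(allowed("clip_1080p"))  # True
print(allowed("clip_720p"))   # False
```

The .* inside the lookahead is what lets the forbidden string appear at any position, not only at the start.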

Regex Matching with Space

I have a very simple question about regex matching. I want "string" (ignoring case) to be matched as follows:
in this case: "thisisastring", nothing should be returned
in this case: "this is a string" a single match on "string" should be returned
Right now I have #"([S|s][T|t][R|r][I|i][N|n][G|g])" as the regex; however, it doesn't work correctly in the first case.
How should I write this regex?
Thanks in advance!
[S|s] does not match what you seem to think
Please note that [S|s] does not mean "match a S or a s". It means "match one character that is either a S, a | or a s". That's how things work inside a [character class]. To express an OR, you can use a non-capturing group: (?:S|s). But [Ss] is all you need, and case-insensitivity is even better.
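The character-class point is easy to verify. A tiny Python sketch (the variable names are illustrative):

```python
import re

pipe_class = re.compile(r"[S|s]")   # three members: 'S', '|', 's'
plain_class = re.compile(r"[Ss]")   # two members: 'S', 's'

print(bool(pipe_class.fullmatch("|")))   # True: '|' is a literal in a class
print(bool(plain_class.fullmatch("|")))  # False
print(bool(plain_class.fullmatch("s")))  # True
```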
Case-Insensitivity
I'm going to assume we're using case-insensitive mode so we end up with a simpler regex. I assume you're in C# as it looks like you're using a verbatim string: (?i) will work. Another way to set case-insensitivity in C# would be RegexOptions.IgnoreCase
Option 1: boundary (close but no cigar)
(?i)\bstring
This no longer matches string in astring. However, it matches string in ##string, which you do not want.
Option 2: lookbehind
(?i)(?<=[ ])string
The lookbehind ensures that string is preceded by a space character. The brackets are optional, they help see the space.
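Options 1 and 2 can be compared side by side. The question appears to be about C#, so this Python sketch is only an analogue; both patterns use syntax that Python's re module also supports, and the constant names are made up.

```python
import re

BOUNDARY = re.compile(r"(?i)\bstring")          # Option 1
AFTER_SPACE = re.compile(r"(?i)(?<=[ ])string")  # Option 2

for text in ["thisisastring", "this is a string", "##string"]:
    print(repr(text), BOUNDARY.findall(text), AFTER_SPACE.findall(text))
```

Only the lookbehind version rejects ##string, because # is a non-word character and therefore still produces a word boundary.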
Option 3: \K (but not in C#)
For engines that support it (Perl, PCRE, Ruby 2+):
(?i)[ ]\Kstring
The \K tells the engine to drop what was matched so far from the final match it returns.