Regular expression for matching the same words in the text - regex

For example, I have text:
The word1 word2 is word3
(note word1 can be == word2 == word3)
I want my regular expression to match when the distance between the words word(i) is <= N. The distance is the number of words between two such words.
The distance between word1 and word2 is 0.
The distance between word2 and word3 is 1.
The distance between the word1 and word3 (=2) should not be taken into account.
I wrote a regular expression to solve this problem, but it also takes into account the distance between the first and the last of the same words. How can I fix it?
(\b\w+\b)\W+((\b\w+\b)\W+){N,}?\1
For my example text, I want a regular expression that finds matches only when N=0 or 1.
(\b\w+\b)\W+((\b\w+\b)\W+){0,}?\1
(\b\w+\b)\W+((\b\w+\b)\W+){1,}?\1
But now it also matches when N=2:
(\b\w+\b)\W+((\b\w+\b)\W+){2,}?\1
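One possible reading of the requirement is that only consecutive occurrences of a word should count, so the words in between must differ from the captured word. A minimal Python sketch under that assumption follows. The function name `find_repeat` is hypothetical, and the pattern matches a distance of exactly N; using `{0,N}` in place of `{N}` would accept any distance up to N.

```python
import re

def find_repeat(text, n):
    """Return the first span where a word repeats with exactly n other words in between."""
    # (?!\1\b) rejects an intervening word equal to the captured word, so the
    # distance is only measured between consecutive occurrences.
    pattern = r"\b(\w+)\b\W+(?:(?!\1\b)\w+\W+){%d}\1\b" % n
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Stand-in for "The word1 word2 is word3" with word1 == word2 == word3.
text = "The foo foo is foo"
for n in range(3):
    print(n, find_repeat(text, n))
```

For this sample, N=0 and N=1 find a match and N=2 does not, because the only pair at distance 2 has another occurrence of the same word in between.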

Related

Find a set of words near each other in a text, in any order

To highlight the result of a full-text search for multiple words, I tried to use a regex to find items within a predefined distance, using the following regex (the distance between two words is up to 100 characters):
word1(?:\\s|.){1,100}?word2
This will find word1 ... word2, but will not find word2...word1
I know I can combine two regex phrases, but what if a user searches for say 6 words?
I would probably not try to do this with a RegExp.
If you want to find two occurrences of n words within a short distance, then it's still doable. Something like:
var re = RegExp(r"\b(word1|...|wordn)\b[^]{1,100}\b(?:word1|...|wordn)\b(?<!\b\1)");
This should find any of word1 through wordn, followed by 1-100 other characters, and then followed by another of word1 through wordn (but not the same one because of the negative look-behind).
If you want to find all the words, in any order, then it's a very non-regular problem which regular expressions are really unsuited for.
You can generalize the expression above to something like:
RegExp(r"\b(word1|...|word10)\b"
r"[^]{1,100}\b(word1|...|word10)\b(?<!\b\1)"
r"[^]{1,100}\b(word1|...|word10)\b(?<!\b(?:\1|\2))"
...
r"[^]{1,100}\b(word1|...|word10)\b(?<!\b(?:\1|\2|\3|\4|...|\9))");
That is probably not going to be particularly efficient with all those negative look-behinds, but the biggest issue is that it grows quadratically in the number of words.
So, what I'd do instead is:
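List<Match>? findWords(String source, List<String> words) {
  // Escape each word so regex metacharacters in it are matched literally.
  var re = RegExp("\\b(?:${words.map(RegExp.escape).join("|")})\\b");
  var seenWords = <String>{};
  var matches = <Match>[];
  for (var m in re.allMatches(source)) {
    var str = m[0]!; // Group 0 is never null for a successful match.
    // Set.add returns true only the first time a word is seen.
    if (seenWords.add(str)) {
      matches.add(m);
      if (matches.length == words.length) return matches;
    }
  }
  return null;
}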
This will return a Match for each word in the words argument, if it finds all of them, and null if it does not.
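The function above reports whether every word occurs, but it does not check how far apart the matches are. A rough Python sketch of one way to add that check is a sliding window over the match positions. This swaps in a different technique from the answer's code, and the names `find_words_within` and `max_span` are hypothetical.

```python
import re

def find_words_within(source, words, max_span):
    """Return (start, end) of the shortest span holding every word, or None if it exceeds max_span."""
    needed = set(words)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(w) for w in needed))
    hits = [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(source)]
    counts = {}   # occurrences of each word inside the current window
    best = None
    left = 0      # index in hits of the leftmost match in the window
    for _, end, word in hits:
        counts[word] = counts.get(word, 0) + 1
        # While the window still holds every word, record it and shrink from the left.
        while len(counts) == len(needed):
            span = (hits[left][0], end)
            if best is None or span[1] - span[0] < best[1] - best[0]:
                best = span
            dropped = hits[left][2]
            counts[dropped] -= 1
            if counts[dropped] == 0:
                del counts[dropped]
            left += 1
    if best is not None and best[1] - best[0] <= max_span:
        return best
    return None

text = ("Little Miss Muffet she sat on her tuffet, eating her curds and whey. "
        "Along came a spider who sat down beside her and frightened Miss Muffet away.")
span = find_words_within(text, ["she", "tuffet", "her", "spider"], 70)
print(span, text[span[0]:span[1]])
```

Because every match is visited once and dropped at most once, the work grows linearly with the number of matches rather than quadratically with the number of words.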
Objective
Suppose we have the string:
Little Miss Muffet she sat on her tuffet, eating her curds and whey. Along came a spider who sat down beside her and frightened Miss Muffet away.
Now suppose we wish to determine if there exists a substring of no more than 70 characters that contains all the words she, tuffet, her and spider.
First regex
We can do that in two steps. The first is to match the string with the following regular expression.
(?!(?=.*(?:^| )she\b)(?=.*(?:^| )tuffet\b)(?=.*(?:^| )her\b)(?=.*(?:^| )spider\b)).*
This matches the string
she sat on her tuffet, eating her curds and whey. Along came a spider who sat down beside her and frightened Miss Muffet away.
which is the shortest tail of the original string that contains all four specified words.
Regex 1
The regex engine performs the following operations
(?! : begin a negative lookahead
(?= : begin a positive lookahead
.* : match 0+ characters
(?:^| ) : match the beginning of the string or a space
she : match 'she'
\b : assert a word boundary
) : end positive lookahead
(?=.*(?:^| )tuffet\b) : same as above for 'tuffet'
(?=.*(?:^| )her\b) : same as above for 'her'
(?=.*(?:^| )spider\b) : same as above for 'spider'
) : end negative lookahead
.* : match remainder of string
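To see the first step in action, here is a small Python check of the regex above against the sample string. Only the variable names `text`, `regex1` and `tail` are my own; the pattern is copied unchanged.

```python
import re

text = ("Little Miss Muffet she sat on her tuffet, eating her curds and whey. "
        "Along came a spider who sat down beside her and frightened Miss Muffet away.")

regex1 = (r"(?!(?=.*(?:^| )she\b)(?=.*(?:^| )tuffet\b)"
          r"(?=.*(?:^| )her\b)(?=.*(?:^| )spider\b)).*")

# The search advances until the negative lookahead first succeeds, which is the
# first position where " she" can no longer be found ahead.
tail = re.search(regex1, text).group(0)
print(tail)
```

The match starts at "she" because, from that position on, no "she" preceded by a space remains, so the lookahead for it fails and the negative lookahead succeeds.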
Second regex
We may now use the following regular expression to attempt to verify that all four specified words fall within 70 characters of the beginning of the tail string matched by regex 1.
^(?=.{0,67}\bshe\b)(?=.{0,64}\btuffet\b)(?=.{0,67}\bher\b)(?=.{0,64}\bspider\b)
Regex 2
The link shows two examples. The first is for the tail string matched by regex 1. That string is matched, meaning that all four specified words fall within the first 70 characters of the string. The second example is the same as the first except I've inserted the word " big" before " spider". This has the effect of pushing the end of "spider" beyond the first 70 characters of the string, so there is no match.
Here the regex engine performs the following operations.
^ : assert beginning of string
(?= : begin positive lookahead
.{0,67} : match 0-67 characters
\bshe\b : match 'she' with word boundaries
) : end positive lookahead
(?=.{0,64}\btuffet\b) : same for 'tuffet' except 64 rather than 67
(?=.{0,67}\bher\b) : same for 'her'
(?=.{0,64}\bspider\b) : same for 'spider', again with 64 rather than 67
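The two examples described above can be reproduced with a short Python sketch. The helper name `within_70` is hypothetical; the pattern is the second regex as given, where each upper bound is 70 minus the length of the word (67 for three-letter words, 64 for six-letter words).

```python
import re

tail = ("she sat on her tuffet, eating her curds and whey. Along came a spider "
        "who sat down beside her and frightened Miss Muffet away.")

regex2 = (r"^(?=.{0,67}\bshe\b)(?=.{0,64}\btuffet\b)"
          r"(?=.{0,67}\bher\b)(?=.{0,64}\bspider\b)")

def within_70(s):
    """True if all four words end within the first 70 characters of s."""
    return re.search(regex2, s) is not None

print(within_70(tail))                                    # original tail
print(within_70(tail.replace(" spider", " big spider")))  # "spider" pushed 4 characters right
```

The first call prints True and the second prints False, since inserting " big" moves the start of "spider" from offset 63 to offset 67, past the 64-character limit.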

notepad++ how to insert replacement after the following word

text: [aa-b c d...]
result: [b-123 c d...]
text: [aa-word1 word2 word3 ...]
result: [word1-123 word2 word...]
[aa-bananas oranges apples]
[bananas-123 oranges apples]
I want to remove aa-, but -123 should only be inserted after the next word.
The next word should be variable rather than fixed text like the prefix aa-, because there are many different cases to replace.
I'll change "aa-" to many other variants ("bb-", "cc-", ...), but word1 always varies in the text.
Ctrl+H
Find what: \[aa-(\w+)
Replace with: [$1-123
check Match case
check Wrap around
check Regular expression
Replace all
Explanation:
\[ # opening square bracket
aa # literally 2 a
- # hyphen
(\w+) # group 1, 1 or more word characters
Result for given example:
[b-123 c d...]
[word1-123 word2 word3 ...]
[bananas-123 oranges apples]
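For anyone who wants to script the same replacement outside Notepad++, a minimal Python sketch might look like this. The function `move_prefix` and its parameters are hypothetical; the prefix is passed in as an argument so that "bb-", "cc-" and so on can reuse the same code.

```python
import re

def move_prefix(text, prefix, suffix="-123"):
    """Drop `prefix` right after "[" and append `suffix` to the word that follows."""
    # re.escape keeps characters such as "-" or "." in the prefix literal.
    pattern = r"\[" + re.escape(prefix) + r"(\w+)"
    return re.sub(pattern, lambda m: "[" + m.group(1) + suffix, text)

sample = "[aa-bananas oranges apples]"
print(move_prefix(sample, "aa-"))  # [bananas-123 oranges apples]
```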

R: detect words and punctuation marks in text

I have some naturally occurring text:
text="word1 word2 word3. word4, word5 word6 word7"
And some elements that I want to detect in that text:
elements=c("word2","word6 word7",".",",")
However,
elements[sapply(paste0("\\<",elements,"\\>"),grepl,text)]
only returns the unigram "word2" and the bigram "word6 word7". The period and comma, which are in the text, are not detected.
How can I achieve that?
You don't need to wrap the elements in square brackets: square brackets are special metacharacters in regex that define a character class. Matching the elements as fixed strings works:
> text="word1 word2 word3. word4, word5 word6 word7"
> elements=c("word2","word6 word7",".",",")
> elements[sapply(paste0(elements),grepl,text, fixed=T)]
[1] "word2" "word6 word7" "." ","
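elements[sapply(paste0("[",elements,"]"),grepl,text)] also returns all four elements here, but only because each pattern becomes a character class that matches any single one of its characters, so it can report elements that do not occur in the text.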
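The difference between literal matching and a character class is easy to see in a small Python sketch, with `in` and `re.escape` standing in for R's fixed=TRUE. The extra element "word9" is a hypothetical addition that does not occur in the text.

```python
import re

text = "word1 word2 word3. word4, word5 word6 word7"
elements = ["word2", "word6 word7", ".", ",", "word9"]

# Literal substring test, the rough analogue of fixed=TRUE in grepl.
literal_hits = [e for e in elements if e in text]

# Same result with a regex, provided each element is escaped first.
escaped_hits = [e for e in elements if re.search(re.escape(e), text)]

# Wrapping in [...] builds a character class, so "word9" is reported
# even though it does not occur in the text.
class_hits = [e for e in elements if re.search("[" + e + "]", text)]

print(literal_hits)
print(escaped_hits)
print(class_hits)
```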

Regexp exact word match

I need to match words from lines. For example:
The blue bird is dancing.
Yellow card is drawn
The day is perfect rainy
blue bird is eating
The four lines are in a text file l2.
I want to match blue bird, Yellow card and day, and whenever a matching line is printed, the matched word should be printed before the line.
y=regexp(l2,('^(?=.*blue bird)|(?=.*day)|(?=.*Yellow card)$'));
Is this how it works? I can't get the result.
sprintf('[%s]',y,l2);
MATLAB's regex engine uses \< and \> as word boundary anchors rather than \b.
So your regex would become
y = regexp(l2, '^(?=.*\<(?:blue bird|day|Yellow card)\>).*', 'lineanchors');
assuming that l2 is a multiline string.
Try this reg exp.
(?:blue bird|yellow card|day)
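As a cross-check of the alternation, here is a hedged Python sketch that prints the matched phrase in brackets before each matching line. Python uses \b where MATLAB uses \< and \>, and the variable names are my own.

```python
import re

lines = [
    "The blue bird is dancing.",
    "Yellow card is drawn",
    "The day is perfect rainy",
    "blue bird is eating",
]

# \b plays the role that \< and \> play in MATLAB.
pattern = re.compile(r"\b(?:blue bird|day|Yellow card)\b")

labelled = []
for line in lines:
    match = pattern.search(line)
    if match:
        # Put the matched phrase in brackets ahead of the line it came from.
        labelled.append("[%s] %s" % (match.group(0), line))

print("\n".join(labelled))
```

The match is case-sensitive here, so "yellow card" in lower case would need the re.IGNORECASE flag to match the second line.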

regex, extract string NOT between two brackets

OK, regex question: how do I extract a character that is NOT between two characters, in this case brackets?
I have a string such as:
word1 | {word2 | word3 } | word 4
I only want to get the first and last 'pipe', not the second, which is between brackets. I have tried a myriad of attempts with negative carets and negative groupings and can't seem to get it to work.
Basically I am using this regex in a JavaScript split function to split this into an array containing: "word1", "{word2 | word3}", "word4".
Any assistance would be greatly appreciated!
Try using this pattern
/\|(?![^{]*})/g
with this text
word1 | {word2 | word3 } | word 4 | word 4 | {word2 | word3 }
This should match all of the pipe symbols that are not inside {}.
Depends on the language/implementation you're using, but...
\|(?![^{]*})
This matches a pipe that is not followed by a } except in the case that a { comes first.
The (?! ... ) is known as a negative lookahead assertion. This is easier to understand if we start with a positive lookahead assertion:
\|(?=[^{]*})
The above only matches a pipe that is followed by a } without encountering a { first. When you negate that by replacing the = with a !, the match is now only successful if there's no way for the positive case to be true (also known as the complement).
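The question uses JavaScript's split, but the same negative lookahead can be tried in Python for a quick check. This is a sketch that assumes the braces are balanced and not nested; trimming the pieces with strip() is my own addition.

```python
import re

text = "word1 | {word2 | word3 } | word 4"

# Split on a pipe only if no "}" follows before the next "{".
parts = [part.strip() for part in re.split(r"\|(?![^{]*\})", text)]

print(parts)  # ['word1', '{word2 | word3 }', 'word 4']
```

The pipe inside the braces is skipped because the text after it reaches a "}" without passing a "{", which is exactly the case the lookahead rejects.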