Regex Ignore words with part of string - regex

I have a text:
'1 2 3 ab AB úá awindow BCwindow'
Currently to get only words I use this regex: [a-zA-Zá-ú]+ and this is the result:
['ab', 'awindow', 'bcwindow', 'úá']
I would like to remove the 'window' string from the matched words to get this:
['ab','a','bc','úá']
Thanks.

If the word window always appears at the end of a matching word, you could use:
(?<!\S)[a-zA-Zá-ú]+?(?:(?!\S)|(?=window))
This ensures there are no extra non-whitespace characters preceding a word (which prevents a match from beginning in the middle of a longer string) or following it. You may use word boundaries \b instead:
\b[a-zA-Zá-ú]+?(?:\b|(?=window))
Live demo
Breakdown:
\b Match a word boundary position (where a word begins)
[a-zA-Zá-ú]+? Match characters in class at least one time, ungreedily
(?: Start of non-capturing group
\b Match a word boundary (here we mean end of word)
| Or
(?=window) A positive lookahead, assert following characters are window
) End of non-capturing group
Whenever the second word boundary matches or the positive lookahead succeeds, the engine is satisfied and everything up to that point is returned as the match.
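A minimal sketch of the \b variant in Python, using the sample text from the question. The lowercasing and de-duplication step is an assumption about how the question's expected list was produced; it is not part of the regex itself.

```python
import re

# Pattern from the answer: lazily match letters, stopping at a word
# boundary or just before the literal text "window".
pattern = re.compile(r"\b[a-zA-Zá-ú]+?(?:\b|(?=window))")

text = "1 2 3 ab AB úá awindow BCwindow"
words = pattern.findall(text)
print(words)  # ['ab', 'AB', 'úá', 'a', 'BC']

# Assumed post-processing: lowercase and de-duplicate the matches.
unique_lower = sorted({w.lower() for w in words})
print(unique_lower)  # ['a', 'ab', 'bc', 'úá']
```

The trailing "window" is never returned as a match of its own, because after "a" or "BC" is consumed there is no word boundary before the "w".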

Related

Regular expression that matches at least 4 words starting with the same letter?

I've been trying to solve this problem for a few hours but with no luck. The task is to write a regular expression that matches at least four words starting with the same letter. But! These words do not have to be one after another.
This regex should be able to match a line like this:
cat color coral chat
but also one like this:
cat take boom candle creepy drum cheek
Thank you!
So far I have got this regex but it only matches words when they are in order.
(\w)\w+\s+\1\w+\s+\1\w+\s+\1
If you have only words in the line that can be matched with \w:
\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}
Explanation
\b A word boundary to prevent a partial word match
(\w)\w* Capture a single word character in group 1 followed by matching optional word characters
(?: Non capture group to repeat as a whole part
(?:\s+\w+)*? Match 1+ whitespace chars and 1+ word chars in between in case the word does not start with the character captured in the back reference
\s+\1\w* Match 1+ whitespace chars, a backreference to the same captured character and optional word characters
){3} Close the non capture group and repeat 3 times
See a regex demo
Note that \s can also match a newline.
If the words that start with the same character should be at least 2 characters long (as (\w)\w+ matches 2 or more characters):
\b(\w)\w+(?:(?:\s+\w+)*?\s+\1\w+){3}
See another regex demo.
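As a rough check of the first pattern in Python (the helper name has_four_same_initial is made up for this sketch, and the third test line is an added counterexample, not from the question):

```python
import re

# First pattern from the answer: one word, then three more words that
# start with the same captured character, with other words allowed between.
pattern = re.compile(r"\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}")

def has_four_same_initial(line):
    """Return True if the line has at least four words sharing a first character."""
    return pattern.search(line) is not None

print(has_four_same_initial("cat color coral chat"))                    # True
print(has_four_same_initial("cat take boom candle creepy drum cheek"))  # True
print(has_four_same_initial("cat take boom candle"))                    # False
```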
Another idea to match lines with at least 4 words starting with the same letter:
\b(\w)(?:.*?\b\1){3}
See this demo at regex101
This is less strict: it only checks that the character captured by \b(\w) is followed by three more occurrences of \b\1 (a word boundary followed by the same character), with .*? matching any characters in between.

Match all instances of a certain character inside every word preceded by a certain word and not delimited by a space

Given a string such as below:
word.hi. bla. word.
I want to construct a regex which will match all "."s preceded by "word" and any other non space character
So, in the above example I would want the first, second and last dots to be matched.
While matching the first and last dots would be easy with global flag (/(?:word.*)\K./gU), I'm not sure how to construct a regex that would also match the second dot.
Appreciate any pointers.
You might match word and then get all consecutive matches using the \G anchor excluding matching whitespace chars or a dot.
(?:\bword|\G(?!\A))[^.\s]*\K\.
In parts
(?: Non capture group
\bword Match word preceded by a word boundary
| Or
\G(?!\A) Assert the position at the end of the previous match, not at the start
) Close non capture group
[^.\s]* Match 0+ occurrences of any char except . or a whitespace char
\K Clear the match buffer (forget what is matched until now)
\. Match a dot
Regex demo
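Python's built-in re module supports neither \G nor \K, so the pattern above will not run there as written. A rough equivalent, swapping in a two-step approach rather than the answer's single regex: first find each run of non-whitespace characters that starts with word, then collect the dots inside that run. The function name dots_after_word is hypothetical.

```python
import re

def dots_after_word(text):
    """Return the indices of dots that follow "word" within the same non-space run."""
    positions = []
    # Step 1: each run of non-whitespace characters starting at the word "word".
    for run in re.finditer(r"\bword\S*", text):
        # Step 2: every dot inside that run, mapped back to an index in text.
        for dot in re.finditer(r"\.", run.group()):
            positions.append(run.start() + dot.start())
    return positions

print(dots_after_word("word.hi. bla. word."))  # [4, 7, 18]
```

The dot after bla (index 12) is skipped because that run does not start with word.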

Regex: matching up to the first occurrence of word with character 'a' in it

I need a regular expression to match the first word with character 'a' in it for each line. For example my test string is this:
bbsc abcd aaaagdhskss
dsaa asdd aaaagdfhdghd
wwer wwww awww wwwd
Only the first word containing 'a' on each line (abcd, dsaa and awww) should be matched. How can I do that? I can match all the words with 'a' in them, but can't figure out how to match only the first occurrence.
Under the assumption that the only characters being used are word characters, i.e. \w characters, and white space then use:
/^(?:[^a ]+ +)*([^a ]*a\w*)\b/gm
^ Matches the start of the line
(?:[^a ]+ +)* Matches 0 or more occurrences of words composed of any character other than an a followed by one or more spaces in a non-capturing group.
([^a ]*a\w*)\b Matches a word ending on a word boundary (it is already guaranteed to begin on a word boundary) that contains an a. The word-boundary constraint allows for the word to be at the end of the line.
The first word with an a in it will be in group #1.
See demo
If we cannot assume that only word (\w) and white space characters are present, then use:
^(?:[^a ]+ +)*(\w*a\w*)\b
The difference is in scanning the first word with an a in it, (\w*a\w*), where we are guaranteed that we are scanning a string composed of only word characters.
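A small Python sketch of the first pattern, assuming the question's three sample lines are joined with newlines. The /gm flags are approximated with re.MULTILINE and findall, which returns the contents of group 1.

```python
import re

# Skip leading words without an "a", then capture the first word containing one.
pattern = re.compile(r"^(?:[^a ]+ +)*([^a ]*a\w*)\b", re.MULTILINE)

text = (
    "bbsc abcd aaaagdhskss\n"
    "dsaa asdd aaaagdfhdghd\n"
    "wwer wwww awww wwwd"
)

first_words = pattern.findall(text)
print(first_words)  # ['abcd', 'dsaa', 'awww']
```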
What are you using? Many programs let you set a limit on the number of matches. If that is possible: \b[b-z]*a[a-z]* with a limit of 1.
If it is not possible, use a group to capture the word and match the rest of the line after it: ([b-z]*a[a-z]*).*
Try:
^(?:[^a ]+ )*(\w*a\w*) .*$
Basically what it says is: match a bunch of words that are composed of anything but the letter a (or <space>), then capture a word that must include the letter a.
Group 1 should hold the first word with a.

regex nonconsecutive match

I'm trying to match a word that has 2 vowels in it (they don't have to be consecutive), but the regex I've come up with either matches nothing or not enough. This is the last iteration (Dart).
final vowelRegex = new RegExp(r'[aeiouy]{2}');
Here's an example sentence being parsed and it should match, one, shoulder, their, and over. It's only matching shoulder and their. I understand why, because that's the expression I defined. How can the expression be defined to match on 2 vowels, regardless of position in the word?
one shoulder their the which over
The expression only needs to be tested on one word at a time so hopefully this simplifies things.
You can use :
new RegExp(r'(\w*[aeiouy]\w*){2}');
Both of the previous two answers are incorrect.
(\S*[aeiouy]\S*){2} can match substrings of non-whitespace characters even if they contain non-word characters (proof).
\S*[aeiouy]\S*[aeiouy]\S* has the same problem (proof).
Correct solution:
\b([^\Waeiou]*[aeiou]){2}\w*\b
And if you want only whitespace to count as the word boundary (rather than any non-word character), then use the following regex where the target word is in capture group \2.
(\s|^)(([^\Waeiou]*[aeiou]){2}\w*)(\s|$)
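The \b version can be tried in Python on the sentence from the question. This is only a sketch: it uses finditer with group(0) so that the whole word is returned instead of the inner capture group, and it follows this answer in not treating y as a vowel.

```python
import re

# Two repetitions of "non-vowel word characters, then a vowel",
# followed by the rest of the word.
pattern = re.compile(r"\b([^\Waeiou]*[aeiou]){2}\w*\b")

sentence = "one shoulder their the which over"
matches = [m.group(0) for m in pattern.finditer(sentence)]
print(matches)  # ['one', 'shoulder', 'their', 'over']
```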
You can try this:
\S*[aeiouy]\S*[aeiouy]\S*
Explanation
\S* Match any non-whitespace character (equal to [^\r\n\t\f ]) between zero and unlimited times
[aeiouy] Match a single character from the list aeiouy
For input string : one shoulder their the which over
it will match four words: one shoulder their over
I'd do:
\b(?:\w*[aeiouy]+\w*){2,}\b
Explanation:
\b : word boundary
(?: : start non-capture group
\w* : 0 or more word characters
[aeiouy]+ : 1 or more vowels
\w* : 0 or more word characters
){2,} : end group repeated at least twice
\b : word boundary

Regex negative lookahead and word boundary removes first character from capture group

I am trying to capture every word in a string except for 'and'. I also want to capture words that are surrounded by asterisks like *this*. The regex command I am using mostly works, but when it captures a word with asterisks, it will leave out the first one (so *this* would only have this* captured). Here is the regex I'm using:
/((?!and\b)\b[\w*]+)/gi
When I remove the last word boundary, it will capture all of *this* but won't leave out any of the 'and's.
The problem is that * is not treated as a word character, so \b doesn't match at a position before it. I think you can replace it with:
^(?!and\b)([\w*]+)|((?!and\b)(?<=\W)[\w*]+)
The \b was replaced with a lookbehind for \W (a non-word character) so that * is also matched. However, the first word in the string would then not match, because it is not preceded by a non-word character. This is why I added the alternative anchored with ^.
DEMO
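A short Python sketch of the suggested pattern. The sample string is made up for illustration, and the /gi flags are approximated with finditer and re.IGNORECASE.

```python
import re

# First alternative handles a word at the start of the string; the second
# requires a non-word character (such as a space) just before the match.
pattern = re.compile(
    r"^(?!and\b)([\w*]+)|((?!and\b)(?<=\W)[\w*]+)",
    re.IGNORECASE,
)

text = "this and *that* and more"  # hypothetical input
tokens = [m.group(0) for m in pattern.finditer(text)]
print(tokens)  # ['this', '*that*', 'more']
```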