Find a word in which each character is separated by a space - regex

I need a regex to select a word in which each character is separated by whitespace. Consider the following string:
Mengkapan,Sungai Apit,S I A K,Riau,
I want to select S I A K. I am stuck, I was trying to use the following regex
\s+\w{1}\s+
but it's not working.

I suggest
\b[A-Za-z](?:\s+[A-Za-z])+\b
pattern, where
\b - word boundary
[A-Za-z] - letter (exactly one)
(?: - one or more groups of
\s+ - white space (at least one)
[A-Za-z] - letter (exactly one)
)+
\b - word boundary

For your given information, you could use
(?:[A-Za-z] ){2,}[A-Za-z]
See a demo on regex101.com.

You could match a word boundary \b and a word character \w, then repeat a space and a word character at least 2 times, followed by a word boundary:
\b\w(?: \w){2,}\b
Regex demo
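To compare the suggestions side by side, here is a minimal Python sketch (Python's re module is my choice for illustration; the question does not name a language, and the dictionary labels are made up). It runs the asker's attempt and the three suggested patterns against the sample string.

```python
import re

text = "Mengkapan,Sungai Apit,S I A K,Riau,"

# The asker's attempt needs whitespace on both sides of a single character,
# so it can only pick out one spaced letter at a time.
attempt = re.findall(r"\s+\w{1}\s+", text)

# The three suggested patterns, keyed by a short (made-up) label.
patterns = {
    "letters, any whitespace": r"\b[A-Za-z](?:\s+[A-Za-z])+\b",
    "letters, single spaces": r"(?:[A-Za-z] ){2,}[A-Za-z]",
    "word chars, single spaces": r"\b\w(?: \w){2,}\b",
}
results = {label: re.findall(p, text) for label, p in patterns.items()}

print(attempt)
for label, found in results.items():
    print(label, found)
```

On this input the attempt only finds " I ", while each suggested pattern finds "S I A K". Note that the first pattern accepts two spaced letters, whereas the other two require at least three.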

Related

Regular expression that matches at least 4 words starting with the same letter?

I've been trying to solve this problem for a few hours but with no luck. The task is to write a regular expression that matches at least four words starting with the same letter. But! These words do not have to be one after another.
This regex should be able to match a line like this:
cat color coral chat
but also one like this:
cat take boom candle creepy drum cheek
Thank you!
So far I have got this regex but it only matches words when they are in order.
(\w)\w+\s+\1\w+\s+\1\w+\s+\1
If you have only words in the line that can be matched with \w:
\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}
Explanation
\b A word boundary to prevent a partial word match
(\w)\w* Capture a single word character in group 1 followed by matching optional word characters
(?: Non capture group to repeat as a whole part
(?:\s+\w+)*? Lazily match zero or more repetitions of 1+ whitespace chars and 1+ word chars, to skip words in between that do not start with the character captured in group 1
\s+\1\w* Match 1+ whitespace chars, a backreference to the same captured character and optional word characters
){3} Close the non capture group and repeat 3 times
See a regex demo
Note that \s can also match a newline.
If the words that start with the same character should be at least 2 characters long (as (\w)\w+ matches 2 or more characters):
\b(\w)\w+(?:(?:\s+\w+)*?\s+\1\w+){3}
See another regex demo.
Another idea to match lines with at least 4 words starting with the same letter:
\b(\w)(?:.*?\b\1){3}
See this demo at regex101
This is not very accurate: it only checks that, to the right of the character captured by \b(\w) in group 1, there are three more word boundaries \b that are each followed by \1, with any characters .*? in between.
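A quick way to see the difference between the stricter pattern and the looser one is to run both on a few lines. This is only a sketch in Python (my choice, the question names no language); the names strict and loose and the last two sample lines are mine, added to show a non-match and a case where the two patterns disagree.

```python
import re

# Words must be separated by whitespace and consist of \w characters.
strict = re.compile(r"\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}")
# Only counts word boundaries followed by the captured character.
loose = re.compile(r"\b(\w)(?:.*?\b\1){3}")

samples = [
    "cat color coral chat",
    "cat take boom candle creepy drum cheek",
    "cat take boom candle",  # only two words start with c
    "c-c c-c",               # two whitespace-separated tokens, four boundaries
]

strict_hits = [bool(strict.search(line)) for line in samples]
loose_hits = [bool(loose.search(line)) for line in samples]

for line, s, l in zip(samples, strict_hits, loose_hits):
    print(repr(line), "strict:", s, "loose:", l)
```

Both patterns accept the two lines from the question and reject the third. The last line shows the inaccuracy mentioned above: the looser pattern accepts it because a hyphen also creates word boundaries, while the stricter one does not.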

Regex for allowing apostrophe and period

I have the regex below, which is used for removing punctuation from a string. What I need is to allow only apostrophes and periods in between words, such as “Zipf’s”, “e.g”.
[^\w\s]
One idea is to use non-word boundaries, so that the specified characters are removed only where no word character touches them.
\B matches at any position between two word characters as well as at any position between two non-word characters ...
[^\w\s.’']|\B[.’']\B
See this demo at regex101
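As a rough illustration, the pattern can be used as a substitution. The sketch below is in Python (an assumption, the question does not say which language is used), and the helper name strip_punctuation and the sample sentences are made up.

```python
import re

# Remove any punctuation other than . ’ ' outright, and remove . ’ '
# only when neither neighbour is a word character.
PUNCT = re.compile(r"[^\w\s.’']|\B[.’']\B")


def strip_punctuation(text: str) -> str:
    """Hypothetical helper: delete everything the pattern matches."""
    return PUNCT.sub("", text)


cleaned = strip_punctuation("Zipf’s law, e.g. this (really)!")
ellipsis = strip_punctuation("wait ... what?")
print(cleaned)
print(ellipsis)
```

One consequence worth knowing: a period or apostrophe that touches a word on just one side, such as the final period in "e.g.", is kept as well, because \B fails next to a word character. Only free-standing ones, like the dots in " ... ", are removed.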

Match everything until upcase word

I want to capture a word placed before another one which is fully capitalized
Mister Foo BAR is here # => "Foo"
Miss Bar-Barz FOO loves cats # => "Bar-Barz"
I've been trying the following regex: (Mister|Miss)\s([[:alpha:]\s\-]+)(?=\s[A-Z]+), but sometimes it includes the rest of the sentence (for example, it returns Bar-Barz FOO loves cats instead of Bar-Barz).
How can I say, using RegExp, "match every word until the uppercase word"?
To clarify the usage of lookahead, can we say it "captures until the specified sub-pattern matches, but does not include it in the match data"?
As a non-native English speaker, apologies if my question isn't perfectly formulated. Thanks in advance.
Match 1+ word chars, optionally followed by repetitions of a - and 1+ word chars, so that a lone hyphen or a trailing hyphen is not matched.
Assert a space followed by 1+ uppercase chars and a word boundary at the right.
\w+(?:-\w+)*(?=\s[A-Z]+\b)
Explanation
\w+ Match 1+ word char
(?:-\w+)* Optionally repeat matching - and 1+ word chars
(?=\s[A-Z]+\b) Positive lookahead, assert what is directly at the right is a whitespace char, then 1+ uppercase chars A-Z followed by a word boundary
Regex demo
If there can not be any newlines between the words, you can use [^\S\r\n] instead of \s
\w+(?:-\w+)*(?=[^\S\r\n]+[A-Z]+\b)
Regex demo
I want to capture a word placed before another one which is fully capitalized
You may use this regex with a lookahead:
\b\S+(?=[ \t]+[A-Z]+\b)
RegEx Demo
RegEx Description:
\b: Word boundary
\S+: Match 1+ non-whitespace characters
(?=[ \t]+[A-Z]+\b): Positive lookahead that asserts we have 1+ spaces or tabs and then a word containing only capital letters
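The two lookahead patterns suggested so far can be checked against the sample sentences from the question. This is a sketch in Python (assumed for illustration, since the question does not name a language); the variable names are mine.

```python
import re

# Word chars with optional inner hyphens, followed by an all-uppercase word.
hyphen_aware = re.compile(r"\w+(?:-\w+)*(?=\s[A-Z]+\b)")
# Any run of non-whitespace, followed by an all-uppercase word.
non_space = re.compile(r"\b\S+(?=[ \t]+[A-Z]+\b)")

samples = {
    "Mister Foo BAR is here": "Foo",
    "Miss Bar-Barz FOO loves cats": "Bar-Barz",
}

found = {
    sentence: (hyphen_aware.search(sentence).group(),
               non_space.search(sentence).group())
    for sentence in samples
}

for sentence, pair in found.items():
    print(sentence, "->", pair)
```

Both patterns return the expected word for both sentences. The lookahead only checks what follows, so the uppercase word itself never becomes part of the match. The practical difference is that \S+ also accepts punctuation inside the word, while the first pattern only allows word characters joined by hyphens.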
You don't say what language you're working in, but the following works for me. The idea is to stop when the parser hits a sequence of uppercase letters/hyphens.
JS example:
let ptn = /(Mister|Miss)\s[\w\-]+(?=\s[A-Z\-]+)/;
"Mister Foo BAR is here".match(ptn); //["Mister Foo", "Mister"]
"Miss Bar-Barz FOO loves cats".match(ptn); //["Miss Bar-Barz", "Miss"]

Regex: matching up to the first occurrence of word with character 'a' in it

I need a regular expression to match the first word with character 'a' in it for each line. For example my test string is this:
bbsc abcd aaaagdhskss
dsaa asdd aaaagdfhdghd
wwer wwww awww wwwd
Only the first word containing 'a' on each line (abcd, dsaa, and awww; shown in bold in the original) should be matched. How can I do that? I can match all the words with 'a' in them, but can't figure out how to match only the first occurrence.
Under the assumption that the only characters being used are word characters, i.e. \w characters, and white space then use:
/^(?:[^a ]+ +)*([^a ]*a\w*)\b/gm
^ Matches the start of the line
(?:[^a ]+ +)* Matches 0 or more occurrences of words composed of any character other than an a followed by one or more spaces in a non-capturing group.
([^a ]*a\w*)\b Matches a word ending on a word boundary (it is already guaranteed to begin on a word boundary) that contains an a. The word-boundary constraint allows for the word to be at the end of the line.
The first word with an a in it will be in group #1.
See demo
If we cannot assume that only word (\w) and white space characters are present, then use:
^(?:[^a ]+ +)*(\w*a\w*)\b
The difference is in scanning the first word with an a in it, (\w*a\w*), where we are guaranteed that we are scanning a string composed of only word characters.
Which tool are you using? Many programs let you limit the number of matches. If yours does, use \b[b-z]*a[a-z]* with a limit of 1.
If it does not, use a group to capture the word and match the rest of the line after it: ([b-z]*a[a-z]*).*
Try:
^(?:[^a ]+ )*(\w*a\w*) .*$
Basically what it says is: skip a bunch of words that are composed of anything but the letter a (or <space>), then capture a word that must include the letter a.
Group 1 should hold the first word with a.
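For anyone who wants to try this outside a regex tester, here is a small Python sketch (Python is my assumption, the question does not say which tool is used). It applies the first pattern above in multiline mode and collects group 1 for each line of the test string.

```python
import re

text = (
    "bbsc abcd aaaagdhskss\n"
    "dsaa asdd aaaagdfhdghd\n"
    "wwer wwww awww wwwd"
)

# re.MULTILINE makes ^ match at the start of every line, which mirrors
# the m flag in /.../gm; finditer plays the role of the g flag.
first_with_a = re.compile(r"^(?:[^a ]+ +)*([^a ]*a\w*)\b", re.MULTILINE)

matches = [m.group(1) for m in first_with_a.finditer(text)]
print(matches)
```

The whole match also includes the skipped leading words, which is why the result is read from group 1. One caveat: [^a ] also matches a newline, so on input where some line contains no a at all, the match could run on into the next line; excluding \n in the character classes would be one way to guard against that.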

regex nonconsecutive match

I'm trying to match a word that has 2 vowels in it (they don't have to be consecutive), but the regex I've come up with either matches nothing or not enough. This is the last iteration (dart).
final vowelRegex = new RegExp(r'[aeiouy]{2}');
Here's an example sentence being parsed; it should match one, shoulder, their, and over. It's only matching shoulder and their. I understand why, because that's the expression I defined. How can the expression be defined to match on 2 vowels, regardless of position in the word?
one shoulder their the which over
The expression only needs to be tested on one word at a time so hopefully this simplifies things.
You can use:
new RegExp(r'(\w*[aeiouy]\w*){2}');
Both of the previous two answers are incorrect.
(\S*[aeiouy]\S*){2} can match substrings of non-whitespace characters even if they contain non-word characters (proof).
\S*[aeiouy]\S*[aeiouy]\S* has the same problem (proof).
Correct solution:
\b([^\Waeiou]*[aeiou]){2}\w*\b
And if you want only whitespace to count as the word boundary (rather than any non-word character), then use the following regex where the target word is in capture group \2.
(\s|^)(([^\Waeiou]*[aeiou]){2}\w*)(\s|$)
You can try this:
\S*[aeiouy]\S*[aeiouy]\S*
Explanation
\S matches any non-whitespace character (equal to [^\r\n\t\f ])
* Quantifier: matches between zero and unlimited times
[aeiouy] Match a single character present in the list aeiouy
For the input string: one shoulder their the which over
it will match four words: one, shoulder, their, over
I'd do:
\b(?:\w*[aeiouy]+\w*){2,}\b
Explanation:
\b : word boundary
(?: : start non-capture group
\w* : 0 or more word characters
[aeiouy]+ : 1 or more vowels
\w* : 0 or more word characters
){2,} : end group repeated at least twice
\b : word boundary
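To compare the asker's original expression with the word-boundary version given above as the "correct solution", here is a short sketch. It is written in Python rather than Dart (my choice for illustration), and it uses that answer's vowel set [aeiou], which leaves out y.

```python
import re

sentence = "one shoulder their the which over"

# The asker's expression: two vowels in a row.
consecutive = re.compile(r"[aeiouy]{2}")
# Two vowels anywhere in the word: twice "non-vowel word chars, then a vowel".
two_vowels = re.compile(r"\b(?:[^\Waeiou]*[aeiou]){2}\w*\b")

adjacent_only = [w for w in sentence.split() if consecutive.search(w)]
anywhere = [m.group() for m in two_vowels.finditer(sentence)]

print(adjacent_only)
print(anywhere)
```

The first list contains only shoulder and their, which is the behaviour described in the question; the second contains one, shoulder, their, and over. The class [^\Waeiou] reads as "a word character that is not a vowel", so the pattern can never run across a space or punctuation mark. I made the group non-capturing here so that the match is read with group() directly.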