I'm attempting to match the last character in a WORD.
A WORD is a sequence of non-whitespace characters
'[^\n\r\t\f ]', or an empty line matching ^$.
The expression I made to do this is:
"[^ \n\t\r\f]\(?:[ \$\n\t\r\f]\)"
The regex matches a non-whitespace character that is followed by a whitespace character or the end of the line.
But I don't know how to exclude the following whitespace character from the result, or why it doesn't seem to capture a character preceding the end of the line.
Using the string "Hi World!", I would expect: the "i" and "!" to be captured.
Instead I get: "i ".
What steps can I take to solve this problem?
"Word" that is a sequence of non-whitespace characters scenario
Note that the non-capturing group (?:...) in [^ \n\t\r\f](?:[ \$\n\t\r\f]) still matches (consumes) the whitespace char, so it becomes part of the match. It also does not match at the end of the string, because $ is not a string end anchor inside a character class; it is parsed as a literal $ symbol.
You may use
\S(?!\S)
See the regex demo
The \S matches a non-whitespace char that is not followed with a non-whitespace char (due to the (?!\S) negative lookahead).
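As a quick check of the explanation above, here is a minimal Python sketch. The helper name last_chars is made up for illustration, and Python's re module is only one possible flavor; the question does not say which engine is in use.

```python
import re

# \S(?!\S): a non-whitespace char that is NOT followed by another
# non-whitespace char. The lookahead consumes nothing, so the match
# is always exactly one character.
LAST_CHAR = re.compile(r"\S(?!\S)")

# The pattern from the question (without the stray backslashes before
# the group): the group consumes the whitespace, and $ inside the
# character class is a literal dollar sign.
CONSUMING = re.compile(r"[^ \n\t\r\f](?:[ \$\n\t\r\f])")


def last_chars(text):
    """Return the last character of every whitespace-delimited word."""
    return LAST_CHAR.findall(text)


print(last_chars("Hi World!"))         # ['i', '!']
print(CONSUMING.findall("Hi World!"))  # ['i ']
```

The second call reproduces the "i " result reported in the question.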
General "word" case
If a word consists of just letters, digits and underscores, that is, if it is matched with \w+, you may simply use
\w\b
Here, \w matches a "word" char, and the word boundary asserts there is no word char right after.
See another regex demo.
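For comparison, a small Python sketch of the \w\b variant (the function name last_word_chars is hypothetical). On the sample string it returns "d" rather than "!", because "!" is not a word character.

```python
import re

# \w\b: a word char (letter, digit or underscore) with a word
# boundary right after it, i.e. the last char of a \w+ run.
LAST_WORD_CHAR = re.compile(r"\w\b")


def last_word_chars(text):
    """Return the final character of every \\w+ word in text."""
    return LAST_WORD_CHAR.findall(text)


print(last_word_chars("Hi World!"))  # ['i', 'd']
```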
In Word text, if I want to highlight the last a in para, I first search for [space]para[space] to make sure I only select the word I want; once it is found, it is highlighted.
Next, I search within the selection for [a ] (an a followed by a space). This matches only the last a, which I can then highlight or color differently.
Related
I want to regex match the last word in a string where the string ends in ... The match should be the word preceding the ...
Example: "Do not match this. This sentence ends in the last word..."
The match would be word. This gets close: \b\s+([^.]*). However, I don't know how to make it work with only matching ... at the end.
This should NOT match: "Do not match this. This sentence ends in the last word."
If you use \s+, at least one whitespace char must precede the word, so the pattern will not match when the string consists of word... only.
If you want to use the negated character class, you could also use
([^\s.]+)\.{3}$
( Capture group 1
[^\s.]+ Match 1+ times any char except a whitespace char or dot
) Close group
\.{3} Match 3 dots
$ End of string
Regex demo
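A minimal Python sketch of the pattern above, using the two example sentences from the question. The helper name word_before_ellipsis is an assumption for illustration only.

```python
import re

# Capture 1+ chars that are neither whitespace nor a dot,
# followed by exactly three dots at the end of the string.
ELLIPSIS_WORD = re.compile(r"([^\s.]+)\.{3}$")


def word_before_ellipsis(text):
    """Return the word right before a trailing '...', or None."""
    m = ELLIPSIS_WORD.search(text)
    return m.group(1) if m else None


good = "Do not match this. This sentence ends in the last word..."
bad = "Do not match this. This sentence ends in the last word."
print(word_before_ellipsis(good))  # word
print(word_before_ellipsis(bad))   # None
```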
You can anchor your regex to the end with $. To match a literal period you will need to escape it as it otherwise is a meta-character:
(\S+)\.\.\.$
\S matches everything but space-like characters. What exactly it matches depends on your regex flavor, but it usually excludes spaces, tabs, newlines and a set of Unicode spaces.
You can play around with it here:
https://regex101.com/r/xKOYa4/1
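One difference worth seeing in practice: \S also accepts dots, so the captured "word" may contain them. The sketch below uses Python's re module and a made-up input string to show this; it is an illustration, not part of the original answer.

```python
import re

# (\S+)\.\.\.$ : any run of non-whitespace, then three literal dots
# at the end of the string. Dots inside the run are allowed.
NON_SPACE = re.compile(r"(\S+)\.\.\.$")

# For comparison: a negated class that also excludes dots.
NO_DOTS = re.compile(r"([^\s.]+)\.{3}$")

text = "see v1.2..."  # hypothetical input
print(NON_SPACE.search(text).group(1))  # v1.2
print(NO_DOTS.search(text).group(1))    # 2
```

Which behavior is preferable depends on whether the last "word" is allowed to contain dots.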
I need a regular expression to match the first word with character 'a' in it for each line. For example my test string is this:
bbsc abcd aaaagdhskss
dsaa asdd aaaagdfhdghd
wwer wwww awww wwwd
Only the first word containing 'a' on each line should be matched, that is abcd, dsaa and awww. How can I do that? I can match all the words with 'a' in them, but can't figure out how to match only the first occurrence.
Under the assumption that the only characters being used are word characters, i.e. \w characters, and white space then use:
/^(?:[^a ]+ +)*([^a ]*a\w*)\b/gm
^ Matches the start of the line
(?:[^a ]+ +)* Matches 0 or more occurrences of words composed of any character other than an a followed by one or more spaces in a non-capturing group.
([^a ]*a\w*)\b Matches a word ending on a word boundary (it is already guaranteed to begin on a word boundary) that contains an a. The word-boundary constraint allows for the word to be at the end of the line.
The first word with an a in it will be in group #1.
See demo
If we cannot assume that only word (\w) and white space characters are present, then use:
^(?:[^a ]+ +)*(\w*a\w*)\b
The difference is in scanning the first word with an a in it, (\w*a\w*), where we are guaranteed that we are scanning a string composed of only word characters.
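Here is a small Python sketch that runs both patterns against the test string from the question. It applies the regex one line at a time instead of using the /gm flags; that is my own choice, made because [^a ] can also match a newline and could otherwise run across lines that contain no a. The function name first_a_words is hypothetical.

```python
import re

# Pattern for input made only of word chars and spaces.
WORD_ONLY = re.compile(r"^(?:[^a ]+ +)*([^a ]*a\w*)\b")
# Variant for input that may contain other characters.
GENERAL = re.compile(r"^(?:[^a ]+ +)*(\w*a\w*)\b")

SAMPLE = (
    "bbsc abcd aaaagdhskss\n"
    "dsaa asdd aaaagdfhdghd\n"
    "wwer wwww awww wwwd"
)


def first_a_words(text, pattern=WORD_ONLY):
    """Return group 1 (first word containing 'a') for each line."""
    found = []
    for line in text.splitlines():
        m = pattern.match(line)  # one line at a time, anchored at ^
        if m:
            found.append(m.group(1))
    return found


print(first_a_words(SAMPLE))           # ['abcd', 'dsaa', 'awww']
print(first_a_words(SAMPLE, GENERAL))  # ['abcd', 'dsaa', 'awww']
```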
What are you using? Many programs let you set a limit on the number of matches. If yours does, use \b[b-z]*a[a-z]* with a limit of 1.
If that is not possible, capture the word in a group and let the rest of the pattern consume the remainder of the line: ([b-z]*a[a-z]*).*
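In Python, for example, the "limit of 1" idea maps onto re.search, which returns only the first match. This is a sketch under the assumption that words are lowercase ASCII letters, as in the question's sample; first_a_word is a made-up helper name.

```python
import re

# [b-z]* : leading letters other than 'a', then the first 'a',
# then any remaining lowercase letters of the same word.
FIRST_A = re.compile(r"\b[b-z]*a[a-z]*")


def first_a_word(line):
    """Return the first word containing 'a' in line, or None."""
    m = FIRST_A.search(line)  # search() stops at the first match
    return m.group(0) if m else None


for line in ("bbsc abcd aaaagdhskss",
             "dsaa asdd aaaagdfhdghd",
             "wwer wwww awww wwwd"):
    print(first_a_word(line))
# abcd
# dsaa
# awww
```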
Try:
^(?:[^a ]+ )*(\w*a\w*) .*$
Basically what it says is: capture a bunch of words that are composed of anything but the letter a (or <space>) then capture a word that must include the letter a.
Group 1 should hold the first word with a.
I am trying to capture every word in a string except for 'and'. I also want to capture words that are surrounded by asterisks like *this*. The regex command I am using mostly works, but when it captures a word with asterisks, it will leave out the first one (so *this* would only have this* captured). Here is the regex I'm using:
/((?!and\b)\b[\w*]+)/gi
When I remove the last word boundary, it will capture all of *this* but won't leave out any of the 'and' s.
The problem is that * is not treated as a word character, so \b doesn't match at a position before it. I think you can replace it with:
^(?!and\b)([\w*]+)|((?!and\b)(?<=\W)[\w*]+)
The \b was replaced with a (?<=\W) lookbehind (preceded by a non-word character) so that * is matched too; however, the first word in the string would then not match because it is not preceded by a non-word character. This is why I added the alternative.
DEMO
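To see both the problem and the fix, here is a short Python sketch. The sample sentence and the helper name words_except_and are invented for illustration. Because the fixed pattern has two capture groups, the sketch reads the whole match with group(0) rather than relying on findall.

```python
import re

# Pattern from the question: \b cannot match before a leading '*'.
ORIGINAL = re.compile(r"((?!and\b)\b[\w*]+)", re.IGNORECASE)

# Pattern from the answer: start of string, or preceded by a
# non-word char, and not the word 'and'.
FIXED = re.compile(
    r"^(?!and\b)([\w*]+)|((?!and\b)(?<=\W)[\w*]+)",
    re.IGNORECASE,
)


def words_except_and(text):
    """Return every word (asterisks included) except 'and'."""
    return [m.group(0) for m in FIXED.finditer(text)]


print(ORIGINAL.findall("*this*"))                     # ['this*']
print(words_except_and("cats and *dogs* and birds"))  # ['cats', '*dogs*', 'birds']
```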
How can I use regex for all words beginning with : punctuation?
This gets all words beginning with a:
\ba\w*\b
The minute I change the letter a to :, the whole thing fails. Am I supposed to escape the colon, and if so, how?
\b matches between a non-alphanumeric and an alphanumeric character, so if you place it before :, it only matches if there is a letter/digit right before the colon.
So you either need to drop the \b here or specify what exactly constitutes a boundary in this situation, for example:
(?<!\w):\w*\b
That would ensure that there is no letter/digit/underscore right before the :. Of course this presumes a regex flavor that supports lookbehind assertions.
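A small Python sketch (Python's re supports lookbehind) comparing the naive \b version with the lookbehind version. The sample text is made up for illustration.

```python
import re

# Naive attempt: \b before ':' only matches when a word char
# comes right before the colon.
NAIVE = re.compile(r"\b:\w*\b")

# Negative lookbehind: no word char directly before the colon.
COLON_WORD = re.compile(r"(?<!\w):\w*\b")

text = "see :foo and a:bar :baz"  # hypothetical input
print(NAIVE.findall(text))       # [':bar']
print(COLON_WORD.findall(text))  # [':foo', ':baz']
```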
The problem is that \b won't match the start of a word when the word starts with a colon :, because colon is not a word character. Try this:
(?<=:)\w*\b
This uses a (non-capturing) look-behind to assert that the previous character is a colon.
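A quick Python sketch of this pattern on a made-up string. Two things to note: the colon itself is not part of the match, and a colon in the middle of a token (a:bar) also qualifies, since the lookbehind only checks the single preceding character.

```python
import re

# Positive lookbehind: the previous char must be a colon.
# The colon is asserted, not consumed.
AFTER_COLON = re.compile(r"(?<=:)\w*\b")

text = "see :foo and a:bar :baz"  # hypothetical input
print(AFTER_COLON.findall(text))  # ['foo', 'bar', 'baz']
```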
In the following expression:
if (($$_ =~ /^.+:\s*\#\s*abcd\s+XYZ/)
1. Where is $$_ taken from?
2. Does the right side of the expression mean: match one or more characters followed by a colon, followed by zero or more spaces, followed by #, followed by one or more spaces, followed by 'abcd', followed by zero or more spaces, followed by 'XYZ'?
You have the last "one or more" and "zero or more" reversed from what the regex actually does.
$$_ dereferences the scalar reference in $_.
Concerning 2., your explanation of the regex is not entirely correct.
/^.+:\s*#\s*abcd\s+XYZ/
means one or more characters (starting at the beginning of the string) followed by a colon, followed by zero or more whitespace characters, followed by one hash character, followed by zero or more whitespace characters, followed by 'abcd', followed by one or more whitespace characters, followed by 'XYZ'.
As for pt. 2:
Line beginning with (^) one or more characters (.+), colon (:), zero or more whitespace characters (\s*), a hash (\#), zero or more whitespace characters (\s*), the string "abcd" (abcd), one or more whitespace characters (\s+), then the string "XYZ" (XYZ).
(emphasis added on discrepancies.) Do note that there is no anchor on the end of line ($), thus this only concerns the beginning.
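The reading above can be checked with a few test strings. The sketch below uses Python's re module rather than Perl (the regex syntax used here behaves the same in both for these inputs); the sample lines and the helper name matches are made up for illustration.

```python
import re

# Same regex as in the question; '#' needs no escaping here
# because the verbose (x) flag is not in use.
PATTERN = re.compile(r"^.+:\s*#\s*abcd\s+XYZ")


def matches(line):
    """True if the line matches from its beginning."""
    return PATTERN.search(line) is not None


print(matches("label: # abcd XYZ"))       # True
print(matches("label:#abcd\tXYZ"))        # True  (\s* may be empty)
print(matches("label: # abcdXYZ"))        # False (\s+ needs 1+ whitespace)
print(matches(": # abcd XYZ"))            # False (.+ needs 1+ chars before ':')
print(matches("label: # abcd XYZ tail"))  # True  (no $ anchor at the end)
```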
Have a look at this site
Here is the given explanation of your regex:
Token   Meaning
^       Matches beginning of input. If the multiline flag is set to true,
        also matches immediately after a line break character.
.+      Matches any single character except newline characters.
        The + quantifier causes this item to be matched 1 or more times (greedy).
:       :
\s*     Matches a single white space character.
        The * quantifier causes this item to be matched 0 or more times (greedy).
\#      #
\s*     Matches a single white space character.
        The * quantifier causes this item to be matched 0 or more times (greedy).
abcd    abcd
\s+     Matches a single white space character.
        The + quantifier causes this item to be matched 1 or more times (greedy).
XYZ     XYZ