Match words with hyphens and apostrophes - regex

I have the following regex for matching words:
\w+(?:'|\-\w+)?
For the following string:
' 's yea' don't -yeah no- ice-cream '
it gives the following matches:
s yea' don't yeah no ice-cream
However, I would like the following matches:
's yea' don't yeah no ice-cream
This is because a word can start or end with an apostrophe, but not with a hyphen. Note that a ' on its own should not be matched.

Your \w+(?:'|\-\w+)? pattern starts matching with a word character, \w, so a leading ' is never included in a match, which contradicts the requirements.
In general, you can match words with and without hyphens with
\w+(?:-\w+)*
In the current scenario, you may include the \w and ' into a character class and use
'?\w[\w']*(?:-\w+)*'?
See the regex demo
If a "word" can only have 1 hyphen, replace * at the end with the ? quantifier.
Breakdown:
'? - optional apostrophe
\w - a word character
[\w']* - 0+ word characters or apostrophes
(?:-\w+)* - 0+ sequences of:
- - a hyphen
\w+ - 1+ word characters
'? - optional apostrophe
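As a quick check, here is a minimal Python sketch that runs the pattern above against the sample string from the question. Python's re module is an assumption here (the question does not name a regex flavor), and the names WORD and sample are only illustrative.

```python
import re

# Pattern from the answer: optional leading ', a word char, then word chars
# or apostrophes, optional hyphenated parts, and an optional trailing '.
WORD = re.compile(r"'?\w[\w']*(?:-\w+)*'?")

sample = "' 's yea' don't -yeah no- ice-cream '"
matches = WORD.findall(sample)
print(matches)
# ["'s", "yea'", "don't", 'yeah', 'no', 'ice-cream']
```

The lone apostrophes at both ends of the string are skipped because the pattern requires at least one word character.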

Related

Regex to list strings with no occurrence of special character '.' and having a space

I'm trying to filter out strings in project code which have the following form
'alphanumeric.alphanumeric.alphanumeric.alphanumeric'
(surrounded by quote and has one or more dots between alphanumeric words)
and another regex to find strings with the form
'this is a regular sentence with space'
I'm new to regex and have the following pattern, which doesn't work. It is supposed to mean:
(' + anything + . + anything + ')
/'*[^.]*'
I need multiple words with . connecting them.
The pattern that you tried, /'*[^.]*', matches a /, then optional occurrences of ', followed by optional chars other than a dot, and then a ', so a dot can never be matched.
You could use 2 separate patterns, one with a dot and one with a space at the start of the repeated group, matching alphanumerics with [^\W_]+ (a word character excluding the underscore). The dotted variant is:
'[^\W_]+(?:\.[^\W_]+)+'
Another option is to capture either a dot or a space in a group, reuse it through a backreference in the repetition, and match any letter or any number:
'[\p{L}\p{N}]+([.\p{Zs}\t])[\p{L}\p{N}]+(?:\1[\p{L}\p{N}]+)*'
' Match literally
[\p{L}\p{N}]+ Match 1+ alphanumerics
([.\p{Zs}\t])[\p{L}\p{N}]+ Capture either . or a space or tab in group 1, then match 1+ alphanumerics
(?:\1[\p{L}\p{N}]+)* Optionally repeat what is captured in group 1, using the backreference \1, followed by 1+ alphanumerics
' Match literally
Regex demo
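To see how the two ideas behave, here is a small Python sketch. It rests on some assumptions: Python's built-in re module does not support \p{L} or \p{N}, so [^\W_] stands in for "alphanumeric" in both patterns, and a plain [. ] class replaces [.\p{Zs}\t]. The names DOTTED, SAME_SEP and code are made up for the illustration.

```python
import re

# Quoted string made of alphanumeric parts joined by one or more dots.
DOTTED = re.compile(r"'[^\W_]+(?:\.[^\W_]+)+'")

# Group 1 captures the first separator (dot or space); \1 then forces
# every later separator to be the same character.
SAME_SEP = re.compile(r"'[^\W_]+([. ])[^\W_]+(?:\1[^\W_]+)*'")

code = "a = 'app.config.db.host'; b = 'this is a regular sentence'; c = 'mixed.up words'"

dotted = DOTTED.findall(code)
same_sep = [m.group(0) for m in SAME_SEP.finditer(code)]

print(dotted)    # ["'app.config.db.host'"]
print(same_sep)  # ["'app.config.db.host'", "'this is a regular sentence'"]
```

The string 'mixed.up words' is rejected by both patterns because it mixes a dot and a space as separators.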

How to replace all whitespace between quotes that start with specific string with underscore in VSCode?

I have the following string:
this is a sample id="aaa bbb ccc" name="abc abc"
I want to match only the whitespace between quotes that start with the string "id=" and replace all occurrences with underscore. The result string should look like:
this is a sample id="aaa_bbb_ccc" name="abc abc"
The following regex matches all whitespace between quotes, but it doesn't take into account the fact that the quotes must be preceded by "id="
\s(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)
Quotes inside quotes are not possible.
Since VS Code 1.31, infinite-width lookbehinds are supported, so you may use
(?<=\bid="[^"]*?)\s
Or, to make sure there actually is a " after the whitespace,
(?<=\bid="[^"]*?)\s(?=[^"]*")
Replace with _.
See the regex demo online. Details:
(?<=\bid="[^"]*?) - a positive lookbehind that matches a location that is immediately preceded with
\b - word boundary
id=" - a literal id=" string
[^"]*? - any 0 or more chars other than ", as few as possible (due to *? non-greedy quantifier)
\s - a whitespace
(?=[^"]*") - a positive lookahead that matches a location immediately followed with any 0+ chars other than " (with [^"]* pattern) and then a ".
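Outside VS Code, not every engine accepts a lookbehind of unknown length. Python's re module, for example, only allows fixed-width lookbehinds, so the patterns above cannot be used there as written. The sketch below reaches the same result with a different technique: it matches each whole id="..." attribute and replaces the whitespace inside it with a callback. The function name underscore_id_values is hypothetical.

```python
import re

def underscore_id_values(text: str) -> str:
    """Replace whitespace with _ inside every id="..." value."""
    # Match the whole attribute, then rewrite the whitespace in the match.
    # The prefix id=" contains no whitespace, so only the value changes.
    return re.sub(
        r'\bid="[^"]*"',
        lambda m: re.sub(r"\s", "_", m.group(0)),
        text,
    )

line = 'this is a sample id="aaa bbb ccc" name="abc abc"'
print(underscore_id_values(line))
# this is a sample id="aaa_bbb_ccc" name="abc abc"
```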

Include "order" in Regex but exclude "in order" in Regex using python [duplicate]

I want to get only ['bar'] here:
>>> re.findall(r"(?<!\bdef )([a-zA-Z0-9.]+?)\(", "def foo(): bar()")
['oo', 'bar']
Is that possible in a single regex? If not, I'll use this first: re.sub(r"\bdef [a-zA-Z0-9.]+", "", "def foo(): bar()")
The current regex matches oo in foo because oo( is not preceded with "def ".
To stop the pattern from matching inside a word, you may use a word boundary, \b, and the fix might look like r"\b(?<!\bdef )([a-zA-Z0-9.]+?)\(".
Note that identifiers can be matched with [a-zA-Z_][a-zA-Z0-9_]*, so your pattern can be enhanced like
re.findall(r'\b(?<!\bdef\s)([a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*)\(', s, re.A)
Note that re.A (or re.ASCII) makes \w match only ASCII letters, digits and _.
See the regex demo.
Details
\b - a word boundary
(?<!\bdef\s) - no def + space allowed immediately to the left of the current location
([a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*) - Capturing group 1 (its value will be the result of re.findall call):
[a-zA-Z_] - an ASCII letter or _
\w* - 0+ word chars
(?: - start of a non-capturing group matching a sequence of...
\. - a dot
[a-zA-Z_] - an ASCII letter or _
\w* - 0+ word chars
)* - ... zero or more times
\( - a ( char.
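A short runnable check of the enhanced pattern, using the string from the question plus one dotted call that is invented here for illustration (the name CALL is also just a label for this sketch):

```python
import re

# \b keeps the match from starting inside a word; the fixed-width negative
# lookbehind rejects names that directly follow "def" plus one whitespace.
CALL = re.compile(r"\b(?<!\bdef\s)([a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*)\(", re.A)

simple = CALL.findall("def foo(): bar()")
dotted = CALL.findall("def f(): os.path.join(a)")

print(simple)  # ['bar']
print(dotted)  # ['os.path.join']
```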

Replace Word with #Word after first #

Could anyone kindly help with a regex for Notepad++ that replaces Word with #Word (only after the first occurrence of #)?
#Celebrity #Glad #Known #Lord Byron #British #Poet
should become
#Celebrity #Glad #Known #Lord #Byron #British #Poet
^
To replace Word with #Word only after the first occurrence of #, you could use an alternation:
Find what
(?>^[^#]*#\w+\h*|#\w+\h*|\G)\K(\w+\h*)
Replace with
#\1
Regex demo
Explanation
(?> Atomic group
^[^#]*#\w+\h* From the start of the string, match 0+ characters other than # using a negated character class, then a #, then 1+ word characters followed by 0+ horizontal whitespace characters.
| Or
#\w+\h* Match #, a word character 1+ times followed by a horizontal whitespace character 0+ times
| Or
\G Assert position at the end of the previous match
) Close atomic group
\K Forget what was previously matched
(\w+\h*) Capture in a group 1+ word characters followed by 0+ times a horizontal whitespace character
Alternatively, you can use the following regex to match and replace:
\s([^#]\w+)
It starts by matching a whitespace character, then captures a group that starts with any character other than # and continues with one or more word characters.
You then replace with:
' #$1'
That adds # to the words that don't start with it.
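For readers who want to script this outside Notepad++, here is a hedged Python sketch. Python's re module does not support \G, \K or \h, so this is a different technique, not a port of either pattern above: it leaves everything before the first # alone and then prefixes each remaining word that is not already tagged. The function name hashtag_after_first is hypothetical.

```python
import re

def hashtag_after_first(line: str) -> str:
    """Prefix untagged words with #, but only after the first #."""
    first = line.find("#")
    if first == -1:
        return line  # no # at all, so nothing to change
    head, rest = line[:first], line[first:]
    # (?<![#\w]) rejects positions right after # or in the middle of a word,
    # so only whole words that lack a leading # get the prefix.
    rest = re.sub(r"(?<![#\w])\w+", lambda m: "#" + m.group(0), rest)
    return head + rest

print(hashtag_after_first("#Celebrity #Glad #Known #Lord Byron #British #Poet"))
# #Celebrity #Glad #Known #Lord #Byron #British #Poet
```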

Matching the verb to be with any words after it followed with to

I want to match the different forms of the verb to be, followed by words that do not end in ing, followed by to.
I have the regex like this:
\b(is|it's|are|been|was|were|am|'m)(.*(?!ing\s))\bto\b
However, when using it with
Peppa and George love jumping in muddy puddles.
The "jumping in" can still be matched. How should I modify the expression?
You may use the following regex to match be + space-separated word(s) + to while excluding be + verb-ing + to:
(?:\b(?:is|it's|are|been|was|were|am)|(?:\B'm))(?:\s+\w++(?<!ing))*\s+to\b
See the regex demo
Details
(?: - an alternation group:
\b(?:is|it's|are|been|was|were|am) - a whole word from the list
| - or
(?:\B'm) - 'm not preceded with any word char
) - end of the group
(?:\s+\w++(?<!ing))* - 0+ sequences of
\s+ - 1+ whitespaces
\w++(?<!ing) - a word not ending with ing
\s+ - 1+ whitespaces
to\b - a whole word to (\b is a word boundary)
Note that a single \b at the start won't let you match 'm, as ' is not a word char. You may replace the first (?:\b(?:is|it's|are|been|was|were|am)|(?:\B'm)) part with (?<!\w)(?:is|it's|are|been|was|were|am|'m), or, if you want to match them only at the start of the string or right after whitespace, (?<!\S)(?:is|it's|are|been|was|were|am|'m). Or, (?<![^\W\d_])(?:is|it's|are|been|was|were|am|'m) (if not preceded with a letter).
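The behaviour of the main pattern can be checked with a small Python sketch. One assumption to flag: the possessive \w++ is only accepted by Python 3.11 and later, so this sketch uses a plain \w+ instead. That should give the same matches here, because every word must be followed by whitespace, so a shortened word can never succeed. The test sentences and the names BE_TO and find_be_to are invented for the illustration.

```python
import re

BE_TO = re.compile(
    r"(?:\b(?:is|it's|are|been|was|were|am)|(?:\B'm))"  # a form of "to be"
    r"(?:\s+\w+(?<!ing))*"                              # words not ending in ing
    r"\s+to\b"                                          # the whole word "to"
)

def find_be_to(sentence: str):
    """Return the matched text, or None when there is no match."""
    m = BE_TO.search(sentence)
    return m.group(0) if m else None

print(find_be_to("He was about to leave"))   # was about to
print(find_be_to("She is going to school"))  # None
```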