I have the regex below, which is used for removing punctuation from a string. What I need is to keep only the apostrophes and periods that sit between word characters, as in “Zipf’s” and “e.g”.
[^\w\s]
One idea is to use non-word boundaries, so that the listed characters are removed only where no word character touches them.
\B matches at any position between two word characters as well as at any position between two non-word characters ...
[^\w\s.’']|\B[.’']\B
See this demo at regex101
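A minimal Python sketch of the pattern above, for experimenting. The function name strip_punctuation and the sample sentence are made up for illustration; Python 3's default Unicode-aware \w is assumed.

```python
import re

# Remove every character that is not a word character, whitespace, period or
# apostrophe, plus any period/apostrophe that no word character touches.
PUNCT = re.compile(r"[^\w\s.’']|\B[.’']\B")

def strip_punctuation(text: str) -> str:
    """Return text with the unwanted punctuation removed (hypothetical helper)."""
    return PUNCT.sub("", text)

if __name__ == "__main__":
    print(strip_punctuation("Zipf’s law (e.g) is ... odd!"))
```

Note that a period or apostrophe directly attached to a word on one side only (for example a sentence-final period) is kept, because \B fails next to a word character.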
I would like to make a regex to match a word, but don't match it if there are special characters on its sides.
I tried to use a word boundary (\b) on both sides but it doesn't seem to exclude special characters...
For example, this should work:
text word-to-match more-text
But this should not:
text word-to-match-more-text
Because there is a - between the word to match and more text.
What I have now is this:
(?<=[^-\[\]{}()+?.,\\^$|#])\bword-to-match\b(?=[^-\[\]{}()+?.,\\^$|#])
I would like to know if there is a more elegant way than using [^-\[\]{}()+?.,\\^$|#] on both sides of the word.
Thanks in advance!
You may use lookahead and lookbehind on both sides to fail the match if there is a non-whitespace character on either side:
(?<!\S)word-to-match(?!\S)
(?<!\S): Fail if previous character is a non-whitespace
(?!\S): Fail if next character is a non-whitespace
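As a quick check, the two lookarounds can be tried in Python against the examples from the question. The helper name has_standalone_match is an assumption made for this sketch.

```python
import re

# (?<!\S) : no non-whitespace character immediately before the match
# (?!\S)  : no non-whitespace character immediately after the match
STANDALONE = re.compile(r"(?<!\S)word-to-match(?!\S)")

def has_standalone_match(text: str) -> bool:
    """True if 'word-to-match' occurs delimited by whitespace or string ends."""
    return STANDALONE.search(text) is not None

if __name__ == "__main__":
    print(has_standalone_match("text word-to-match more-text"))
    print(has_standalone_match("text word-to-match-more-text"))
```

Because both lookarounds also succeed at the very start and end of the string, the word matches when it is the whole input as well.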
I need a regular expression to match the first word with character 'a' in it for each line. For example my test string is this:
bbsc abcd aaaagdhskss
dsaa asdd aaaagdfhdghd
wwer wwww awww wwwd
Only the first word containing 'a' on each line (abcd, dsaa and awww) should be matched. How can I do that? I can match all the words with 'a' in them, but can't figure out how to match only the first occurrence.
Under the assumption that the only characters being used are word characters (i.e. \w characters) and white space, use:
/^(?:[^a ]+ +)*([^a ]*a\w*)\b/gm
^ Matches the start of the line
(?:[^a ]+ +)* Matches, in a non-capturing group, 0 or more occurrences of a word made only of characters other than a and space, each followed by one or more spaces.
([^a ]*a\w*)\b Matches a word ending on a word boundary (it is already guaranteed to begin on a word boundary) that contains an a. The word-boundary constraint allows for the word to be at the end of the line.
The first word with an a in it will be in group #1.
See demo
If we cannot assume that only word (\w) and white space characters are present, then use:
^(?:[^a ]+ +)*(\w*a\w*)\b
The difference is in scanning the first word with an a in it, (\w*a\w*), where we are guaranteed that we are scanning a string composed of only word characters.
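The first pattern above can be tried in Python on the test string from the question. This is only a sketch: the function name first_a_words is invented here, and re.MULTILINE stands in for the m flag of the original /.../gm notation.

```python
import re
from typing import List

# ^ anchors at each line start (MULTILINE); group 1 is the first word with an 'a'.
FIRST_A = re.compile(r"^(?:[^a ]+ +)*([^a ]*a\w*)\b", re.MULTILINE)

def first_a_words(text: str) -> List[str]:
    """Return, per matching line, the first word that contains an 'a'."""
    return FIRST_A.findall(text)

if __name__ == "__main__":
    sample = "bbsc abcd aaaagdhskss\ndsaa asdd aaaagdfhdghd\nwwer wwww awww wwwd"
    print(first_a_words(sample))
```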
What are you using? Many programs let you set a limit on the number of matches. If yours does, use \b[b-z]*a[a-z]* with a limit of 1.
If that is not possible, capture the word in a group and let the rest of the pattern consume the remainder of the line: ([b-z]*a[a-z]*).*
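In Python, one way to get the "limit of 1" behaviour is re.search, which returns only the first match. The sketch below applies the pattern line by line; first_word_with_a is a hypothetical helper name, and lowercase ASCII input is assumed, as in the pattern itself.

```python
import re
from typing import Optional

# [b-z]* : leading letters other than 'a'; then the first 'a'; then any letters.
WORD_WITH_A = re.compile(r"\b[b-z]*a[a-z]*")

def first_word_with_a(line: str) -> Optional[str]:
    """Return the first word containing 'a' in the line, or None."""
    match = WORD_WITH_A.search(line)  # search() stops at the first match
    return match.group(0) if match else None

if __name__ == "__main__":
    for line in ["bbsc abcd aaaagdhskss", "dsaa asdd aaaagdfhdghd", "wwer wwww awww wwwd"]:
        print(first_word_with_a(line))
```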
Try:
^(?:[^a ]+ )*(\w*a\w*) .*$
Basically what it says is: capture a bunch of words that are composed of anything but the letter a (or <space>) then capture a word that must include the letter a.
Group 1 should hold the first word with a.
I can use \s?(\w+\s){0,2}\w* for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine the two? Trying to merge them via a lookahead as mentioned here seems to fail.
Some examples for clarification:
The early bird catches the worm.
should match any three words in sequence (including the worm*).
Here we have a supercalifragilisticexpialidocious sentence.
"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.
* In my actual use case I'm going for a paragraph's last three words, i.e. a (?:\r) would be at the end of the regex and the match would be "catches the worm.". Matches are then given a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.
To match 3 words separated with whitespace(s) at the end of a line or string, you can use
\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])
See the regex demo. Note that in the demo I use [^\S\r\n] instead of \s in the lookahead, since the text contains newlines. Use the same trick if you need that.
Regex explanation
\b - a word boundary
(?!(?:\s*\w){21}) - a negative lookahead that fails the match if, after the initial word boundary, there are 21 word characters, each optionally preceded by any number of whitespace symbols
\w+ - 1 word (consisting of 1 or more word characters)
(?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespaces followed with 1+ word characters
(?=$|[\r\n]) - a positive lookahead that only allows a match to be returned if there is the end-of-string ($) or the end of a line ([\r\n]).
Now, if your words should only contain letters, use [a-zA-Z] or equivalent for your language. If the regex flavor allows, use \p{L} Unicode category/property class.
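The pattern explained above can be exercised in Python as follows. This is a sketch under assumptions: last_words is an invented helper name, and the sample sentences are given without their final period, because the pattern as written needs a word character directly before the end of the line.

```python
import re
from typing import Optional

# Up to three trailing words whose word characters total at most 20.
LAST_WORDS = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

def last_words(paragraph: str) -> Optional[str]:
    """Return the trailing run of up to three words, or None if nothing fits."""
    match = LAST_WORDS.search(paragraph)
    return match.group(0) if match else None

if __name__ == "__main__":
    print(last_words("The early bird catches the worm"))
    print(last_words("Here we have a supercalifragilisticexpialidocious sentence"))
```

In the second sentence only the last word is returned, since any run that includes the long word exceeds the twenty-character budget.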
I'm trying to match a string that contains alphanumeric, hyphen, underscore and space.
Hyphen, underscore, space and numbers are optional, but the first and last characters must be letters.
For example, these should all match:
abc
abc def
abc123
ab_cd
ab-cd
I tried this:
^[a-zA-Z0-9-_ ]+$
but it also matches when a space, underscore or hyphen is at the start or end, whereas these should only be allowed in between.
Use a simple character class wrapped with letter chars:
^[a-zA-Z]([\w -]*[a-zA-Z])?$
This matches input that starts and ends with a letter, including just a single letter.
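There is a potential bug in your regex: you have the hyphen in the middle of the character class, where it can be read as a range operator. Between two ordinary characters it makes a range, e.g. [+-_] means "every char between + and _ inclusive". Many engines treat a hyphen that directly follows a complete range such as 0-9 as a literal, but it is safer not to rely on that.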
If you want a literal dash in a character class, put it first or last or escape it.
Also, prefer \w ("word character": all letters, digits and the underscore) to [a-zA-Z0-9_]; it's easier to type and read.
Check this working in fiddle http://refiddle.com/refiddles/56a07cec75622d3ff7c10000
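For a quick local check, here is a small Python sketch of the pattern from this answer, together with a demonstration of how hyphen placement changes a character class. The names is_valid and the sample inputs are assumptions for illustration.

```python
import re

# First and last characters must be letters; \w, space and hyphen in between.
VALID = re.compile(r"^[a-zA-Z]([\w -]*[a-zA-Z])?$")

def is_valid(value: str) -> bool:
    """True if value starts and ends with a letter (hypothetical helper)."""
    return VALID.match(value) is not None

# Hyphen between two ordinary characters: a range from '+' to '_'.
RANGE_CLASS = re.compile(r"[+-_]")
# Hyphen placed last: a literal hyphen.
LITERAL_CLASS = re.compile(r"[+_-]")

if __name__ == "__main__":
    for candidate in ["abc", "abc def", "ab_cd", "ab-cd", "-abc", "abc_"]:
        print(repr(candidate), is_valid(candidate))
    print(bool(RANGE_CLASS.fullmatch("=")), bool(LITERAL_CLASS.fullmatch("=")))
```

With this pattern an input ending in a digit, such as abc123, is rejected, which follows the stated rule that the last character must be a letter.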
This will fix the issue
^[a-zA-Z]+[a-zA-Z0-9-_ ]*[a-zA-Z0-9]$
I tried using following regex:
/^\w+([\s_-]\w+)*$/
This allows alphanumeric, underscore, space and dash.
More details
As per your requirement of including space, hyphen, underscore and alphanumeric characters, you can use the \w shorthand character set for [a-zA-Z0-9_]. Escape the hyphen using \- as it is usually used for a character range inside a character set.
To negate the space and hyphen at the beginning and end I have used [^\s\-].
So complete regex becomes [^\s\-][\w \-]+[^\s\-]
Here is the working demo.
You can use this regex:
^[a-zA-Z0-9]+(?:[\w -]*[a-zA-Z0-9]+)*$
This will only allow alphanumerics at start and end.
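A short Python sketch for trying this pattern against the examples from the question. The helper name is_alnum_wrapped is made up here; only short inputs are assumed, since the nested quantifiers could backtrack heavily on long non-matching strings.

```python
import re

# Alphanumerics at both ends; \w, space and hyphen allowed in between.
WRAPPED = re.compile(r"^[a-zA-Z0-9]+(?:[\w -]*[a-zA-Z0-9]+)*$")

def is_alnum_wrapped(value: str) -> bool:
    """True if value starts and ends with an ASCII letter or digit."""
    return WRAPPED.match(value) is not None

if __name__ == "__main__":
    for candidate in ["abc", "abc def", "abc123", "ab_cd", "ab-cd", "-abc", "abc_"]:
        print(repr(candidate), is_alnum_wrapped(candidate))
```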
I'm writing a regular expression in Java for capturing some word without spaces.
The word can contain only letters, numbers, hyphens and dots.
The character set [\w+\-\\.] works well.
Now I want to edit the set to allow a single space after the dot.
How should I edit my regular expression?
You can add an alternation that matches this additional requirement
([\w\-.]|(?<=\.) )+
See it here on Regexr
(?<=\.) is a lookbehind assertion. It ensures that space is only matched, if it is preceded by a dot.
Other hints:
\w contains the underscore and matches per default only ASCII letters/digits. If you care about Unicode, use either the modifier UNICODE_CHARACTER_CLASS to enable Unicode for \w or use the Unicode properties \p{L} and \p{Nd} to match Unicode letters and digits.
You don't need to escape the dot in a character class.
You have \w+ in your character class. Are you aware that this just adds the "+" character to the accepted characters?
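The alternation with the lookbehind can be tried out quickly in Python; the pattern text itself is the same in Java, apart from doubling the backslashes inside a Java string literal. The helper name is_word_with_dot_space is an assumption for this sketch.

```python
import re

# Either a word char, hyphen or dot, or a space that directly follows a dot.
TOKEN = re.compile(r"([\w\-.]|(?<=\.) )+")

def is_word_with_dot_space(value: str) -> bool:
    """True if the whole value consists of allowed characters only."""
    return TOKEN.fullmatch(value) is not None

if __name__ == "__main__":
    for candidate in ["a.b", "Mr. Smith", "Mr Smith", "a.  b"]:
        print(repr(candidate), is_word_with_dot_space(candidate))
```

A second space after the dot is rejected because the character before it is a space, not a dot, so the lookbehind fails.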
In case of a dot followed by a space, I suppose this pattern should be neither the first, nor the last in the matched string? You may want to enclose it in word boundaries \b:
([0-9A-Za-z-]|\b\.( \b)?)+
I deliberately did not use \w, to exclude underscores.
For allowing ONLY a single space after the dot you can use this regex:
^(?!.*?\. {2})[\w.-]+$
You don't need to escape the dot inside a character class, nor the hyphen when it is the last character in the class
(?!.*?\. {2}) is a negative lookahead that disallows 2 or more spaces after a dot