I need to extract from a text all the words which match these two requirements:
Contain at least one uppercase letter
Don't fully consist of uppercase characters.
So, Word and WorD are correct captures, but word and WORD aren't.
So, I can capture all the words using a \b([a-zA-Z]+)\b Regex, but I don't know how to add the uppercase letters condition here.
As for requirement #1, I tried to use a positive lookahead like this:
\b(?=.*[A-Z]+)([a-zA-Z]+)\b , but now it captures all the words in a line if that line has at least one uppercase letter.
Is it even possible to apply additional conditions to a capturing group?
I can process this in my application's code but I'd really prefer to fit all those requirements in a single Regex.
You may use
\b(?=[A-Z]*[a-z])(?=[a-z]*[A-Z])([a-zA-Z]+)\b
See the regex demo
Actually, you do not even need the capturing group: ([a-zA-Z]+) can usually be replaced with [a-zA-Z]+, but it depends on where you are using the regex.
Details
\b - word boundary
(?=[A-Z]*[a-z]) - a positive lookahead that requires a lowercase letter after 0+ uppercase ones
(?=[a-z]*[A-Z]) - a positive lookahead that requires an uppercase letter after 0+ lowercase ones
([a-zA-Z]+) - Group 1: 1 or more letters
\b - a word boundary.
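A minimal Python sketch of the two-lookahead pattern, assuming Python's re module (the question does not name a regex flavor); the sample text and variable names are made up for illustration. It also runs the .*-based attempt from the question to show why that one over-matches: the dot can cross spaces, so the lookahead finds an uppercase letter anywhere later in the line.

```python
import re

# Pattern from the answer: each lookahead scans letters only,
# so it cannot look past the end of the current word.
MIXED_CASE = re.compile(r"\b(?=[A-Z]*[a-z])(?=[a-z]*[A-Z])([a-zA-Z]+)\b")

# Attempt from the question: .* may run across spaces into later words.
ATTEMPT = re.compile(r"\b(?=.*[A-Z]+)([a-zA-Z]+)\b")

text = "word Word WorD WORD"  # illustrative sample, not from the question

mixed = MIXED_CASE.findall(text)
attempt = ATTEMPT.findall(text)

print(mixed)    # ['Word', 'WorD']
print(attempt)  # ['word', 'Word', 'WorD', 'WORD']
```

Because the pattern has exactly one capturing group, findall returns the contents of Group 1 for each match.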
Related
Using the Python module re, I would like to detect sequences that contain at least two letters (A-Z) and at least two digits (0-9) in a text, e.g., from the text
"N03FZ467 other text N03671"
precisely the sub-string "N03FZ467" shall be matched.
The best I have got so far is
(?=[A-Z]*\d)[A-Z0-9]{4,}
which detects sequences of length at least 4 that contain only letters A-Z and digits 0-9, and at least one digit and one letter.
How can I make sure I get at least two of each?
If you want to match full words, start matching at word boundaries \b.
Check the first condition (two upper) by a lookahead: (?=(?:\d*[A-Z]){2})
If this succeeds, match the second requirement, two digits: (?:[A-Z]*\d){2}
Finally match any remaining [A-Z\d]* until another \b.
Putting it together:
\b(?=(?:\d*[A-Z]){2})(?:[A-Z]*\d){2}[A-Z\d]*\b
See this demo at regex101 or a Python demo at tio.run
Note that a lookahead is a zero-length assertion; it does not consume characters. If you don't specify a starting point, e.g. \b, the lookahead will be tried at every position, which is less efficient.
Also note that the minimum length of four is already guaranteed by the two requirements.
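As a quick check, the combined pattern can be run against the sample text from the question. This is a small sketch; the variable names and the second sample string are illustrative assumptions.

```python
import re

# \b ... \b keeps the match to whole tokens; the lookahead checks for two
# uppercase letters, the consuming part then requires two digits.
CODE = re.compile(r"\b(?=(?:\d*[A-Z]){2})(?:[A-Z]*\d){2}[A-Z\d]*\b")

text = "N03FZ467 other text N03671"
codes = CODE.findall(text)
print(codes)  # ['N03FZ467']

# Letters and digits do not have to be adjacent ("A1B2" qualifies),
# but a token with only one digit or only one letter is rejected.
more = CODE.findall("A1B2 AB1 A12")
print(more)  # ['A1B2']
```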
Use look aheads, one for each requirement:
^(?=(.*\d){2})(?=(.*[A-Z]){2}).*
See live demo.
Regex breakdown:
(?=(.*\d){2}) is "2 digits somewhere ahead"
(?=(.*[A-Z]){2}) is "2 letters somewhere ahead"
The more efficient version:
^(?=(?:.*?\d){2})(?=(?:.*?[A-Z]){2}).*
It's more efficient because it doesn't capture (uses non-capturing groups (?:...)) and it uses the reluctant quantifier .*? which matches as early as possible in the input, whereas .* will scan ahead to the end then backtrack to find a match.
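Because this pattern is anchored with ^ and ends in .*, it checks a whole string (or line) and matches all of it, rather than extracting a single token from a longer text. A hedged sketch of using it as a validator; the helper name is hypothetical:

```python
import re

# Non-capturing, reluctant version from the answer.
HAS_TWO_EACH = re.compile(r"^(?=(?:.*?\d){2})(?=(?:.*?[A-Z]){2}).*")

def has_two_digits_and_two_letters(s: str) -> bool:
    """True if s contains at least two digits and two uppercase letters."""
    return HAS_TWO_EACH.match(s) is not None

results = {s: has_two_digits_and_two_letters(s)
           for s in ["N03FZ467", "N03671", "AB12", "A1"]}
print(results)

# On the full sample text the match is the entire string, not one token.
whole = HAS_TWO_EACH.match("N03FZ467 other text N03671").group(0)
print(whole)
```

To pull individual tokens out of a text, the input would first have to be split into candidate tokens, or a \b-anchored pattern used instead.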
If you only want to match the chars A-Z and 0-9, you can use a single lookahead (if supported) to assert that 2 digits are present, and then match 2 uppercase chars A-Z while consuming the string. Note that the pattern below requires the 2 digits to be adjacent and the 2 letters to be adjacent.
Since you assert 2 digits and match 2 letters, the length is automatically at least 4 chars.
\b(?=[A-Z\d]*\d\d)[A-Z\d]*[A-Z]{2}[A-Z\d]*\b
Explanation
\b A word boundary to prevent a partial word match
(?=[A-Z\d]*\d\d) Positive lookahead, assert 2 digits to the right
[A-Z\d]* Match optional chars A-Z or digits
[A-Z]{2} Match 2 uppercase chars A-Z
[A-Z\d]* Match optional chars A-Z or digits
\b A word boundary
See a regex demo.
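A short sketch of how this pattern behaves, with made-up sample tokens. The point it illustrates is that \d\d and [A-Z]{2} each require two adjacent characters, so a token such as A1B2 is not matched.

```python
import re

# Single lookahead for two adjacent digits; two adjacent letters are
# required by the consuming part of the pattern.
ADJACENT = re.compile(r"\b(?=[A-Z\d]*\d\d)[A-Z\d]*[A-Z]{2}[A-Z\d]*\b")

samples = "N03FZ467 N03671 A1B2 AB12"  # illustrative tokens
found = ADJACENT.findall(samples)
print(found)  # ['N03FZ467', 'AB12']
```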
I would extend the given answer like this:
(?=\b(?:\D+\d+){2}\b)(?=\b(?:[^a-z]+[a-z]+){2}\b)\S+
Regex demo
This contains two lookaheads, each validating one rule:
(?=\b(?:\D+\d+){2}\b) - a lookahead that asserts that what follows is a word boundary \b, then non-digits followed by digits (\D+\d+), twice, to determine that we have at least two such groups. Then a word boundary again, to be sure we are within one "word".
The other lookahead works the same way, but instead of digits and non-digits we have letters [a-z] and non-letters [^a-z]: (?=\b(?:[^a-z]+[a-z]+){2}\b)
At the end, we just match the whole 'word' with \S+, which simply matches all non-whitespace characters (since we asserted our 'word' earlier, this is sufficient).
I'm trying to capture alternating letters and digits (letters come first) and ultimately remove them, unless they start the string.
So in the below example, yellow is what I'm trying to capture:
While I'm identifying the correct rows, I'm having a hard time capturing just the yellow-highlighted parts.
^(?!([A-Z]+\d+\w*))(?:(.+))[A-Z]+\d+\w*
https://regexr.com/673hl
Any help greatly appreciated.
You can use
(?!^)\b[A-Z]+\d+\w*
See the regex demo. Details:
(?!^) - a negative lookahead that matches a position that is NOT at the start of string
\b - a word boundary; the preceding char must be a non-word char (or start of string, but the lookahead above already ruled that position out)
[A-Z]+ - one or more uppercase ASCII letters
\d+ - one or more digits
\w* - zero or more letters, digits or underscores.
If you want to match any kind of alphanumeric string, add an alternative:
(?!^)\b(?:[A-Z]+\d|\d+[A-Z])\w*
And to make it case insensitive:
(?!^)\b(?:[A-Za-z]+\d|\d+[A-Za-z])\w*
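Since the goal is to remove the matches, here is a minimal sketch using re.sub in Python (an assumption; the question only links to regexr). The sample line is invented, as the original highlighted data is not available.

```python
import re

# (?!^) rejects a match at the very start of the string.
NON_INITIAL_CODE = re.compile(r"(?!^)\b[A-Z]+\d+\w*")

line = "AB12 foo CD34x bar"  # illustrative sample row

found = NON_INITIAL_CODE.findall(line)
cleaned = NON_INITIAL_CODE.sub("", line)

print(found)          # ['CD34x']
print(repr(cleaned))  # 'AB12 foo  bar'
```

The leading AB12 is kept because it starts the string. Removing a match leaves the surrounding spaces in place, so a follow-up whitespace cleanup may be wanted. If the input holds several rows in one string, compiling with re.MULTILINE makes ^ match at the start of every line.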
I want to extract 7-char matches. Each character can be a digit or uppercase letter but the whole match can't be only uppercase letters. Example: let's say I have a test string like so:
I want this nr A7A3G1A but not this ANTENNA
So I should get A7A3G1A but not ANTENNA. A regex to match both would be: [A-Z0-9]{7}. Is it possible to somehow not allow only uppercase letters and still extract the 1st match?
You can use this regex,
\b(?![A-Z]{7})[A-Z0-9]{7}\b
Demo
Here, the word boundaries \b ensure that only whole words are matched, the negative lookahead (?![A-Z]{7}) ensures that the match does not consist solely of uppercase letters, and [A-Z0-9]{7} matches exactly seven characters, each an uppercase letter or a digit.
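A small Python sketch that runs the negative-lookahead pattern on the test string from the question; the variable names are illustrative.

```python
import re

# Seven characters from A-Z/0-9, as a whole word, but not seven letters.
SEVEN = re.compile(r"\b(?![A-Z]{7})[A-Z0-9]{7}\b")

text = "I want this nr A7A3G1A but not this ANTENNA"
matches = SEVEN.findall(text)
print(matches)  # ['A7A3G1A']
```

Note that a token made of seven digits would also match, since the lookahead only excludes the all-letters case.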
Another option could be to use a positive lookahead (?=...) to assert the length of 7, and then make sure to match at least 1 digit.
Use word boundaries \b to prevent the match being part of a larger word.
\b(?=[A-Z0-9]{7}\b)[A-Z0-9]*[0-9][A-Z0-9]*
Regex demo
I can use \s?(\w+\s){0,2}\w*) for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine these? Trying to merge the two via a lookahead as mentioned here seems to fail.
Some examples for clarification:
The early bird catches the worm.
should match any three words in sequence (including the worm*).
Here we have a supercalifragilisticexpialidocious sentence.
"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.
* In my actual use case I'm going for a paragraph's last three words, i.e. a (?:\r) would be at the end of the RegEx and the match would be "catches the worm.". Matches are then given a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.
To match 3 words separated with whitespace(s) at the end of a line or string, you can use
\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])
See the regex demo. Note that in the demo I use [^\S\r\n] instead of \s in the lookahead, since the text contains newlines; use the same trick if you need that.
Regex explanation
\b - a word boundary
(?!(?:\s*\w){21}) - a lookahead check that fails the match if after the initial word boundary there are 21 word characters optionally preceded with any number of whitespace symbols
\w+ - 1 word (consisting of 1 or more word characters)
(?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespaces followed with 1+ word characters
(?=$|[\r\n]) - a positive lookahead that only allows a match to be returned if there is the end-of-string ($) or the end of a line ([\r\n]).
Now, if your words should only contain letters, use [a-zA-Z] or equivalent for your language. If the regex flavor allows, use \p{L} Unicode category/property class.
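A hedged sketch of how this pattern behaves in Python's re module (the question targets InDesign's regex engine, so details may differ there). The helper name is hypothetical. Two things are worth seeing: the lookahead counts word characters only, not the whitespace between them, and the pattern expects the end of the line directly after a word character, so a trailing period prevents a match.

```python
import re
from typing import Optional

LAST_WORDS = re.compile(
    r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])"
)

def tail(line: str) -> Optional[str]:
    """Return up to three trailing words with at most 20 word chars in total."""
    m = LAST_WORDS.search(line)
    return m.group(0) if m else None

a = tail("The early bird catches the worm")
b = tail("Here we have a supercalifragilisticexpialidocious sentence")
c = tail("The early bird catches the worm.")

print(a)  # catches the worm
print(b)  # sentence
print(c)  # None
```

In the second case the long word is skipped and only the last word is returned, because the search moves on to the next starting position that passes the length check.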
I'm new to RegEx and I'm looking for a way to match sentences where the first letter is capitalized and the rest is in lowercase.
I've tried a couple of things (IF statements included), but just can't seem to get it.
This is my last version:
(([A-Z])([a-z]+\s|[a-z]+))+
I thought it worked at first, but is now accepting capitalized letters in the middle of the word.
The Output Would Be Like This (Each Word Capitalized).
Thanks!!
The expression accepts capital letters in the middle of a word because the spaces between words are optional, so words can run into each other.
You can take a more structured approach: a sentence must have at least one word. That's
[A-Z][a-z]*
After that initial word you can get any number of more words, each preceded by whitespace. So in total:
[A-Z][a-z]*(\s[A-Z][a-z]*)*
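A minimal sketch of the structured pattern in Python, assuming the whole string must match (hence fullmatch); the function name and sample strings are illustrative.

```python
import re

# One capitalized word, then any number of whitespace-separated ones.
TITLE_WORDS = re.compile(r"[A-Z][a-z]*(\s[A-Z][a-z]*)*")

def is_capitalized_sentence(s: str) -> bool:
    return TITLE_WORDS.fullmatch(s) is not None

checks = {
    s: is_capitalized_sentence(s)
    for s in ["The Output Would Be Like This",
              "The OutPut Would",
              "the output"]
}
print(checks)
```

The pattern as written allows only letters and single whitespace characters, so punctuation such as parentheses or a final period would need to be added to the character classes.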
To match whole strings that start with an uppercase letter and then have no uppercase letters use
^[A-Z][^A-Z]*$
See the regex demo. ^ matches the start of string, [A-Z] matches one uppercase letter, [^A-Z]* matches 0 or more chars other than uppercase letters, and $ matches the end of string.
To match capitalized words, you may use
\b[A-Z][a-zA-Z]*\b
where \b stands for word boundaries. See the regex demo.
In various regex flavors, there are other ways to match word boundaries:
bash, r (TRE, base R): \<[A-Z][a-zA-Z]*\>
postgresql, tcl: \m[A-Z][a-zA-Z]*\M or \y[A-Z][a-zA-Z]*\y
bash, mysql (MySQL versions before 8): [[:<:]][A-Z][a-zA-Z]*[[:>:]]
Also, you may consider using [[:upper:]] or \p{Lu} instead of [A-Z] and [[:alpha:]] or \p{L} instead of [a-zA-Z] to match any Unicode uppercase letters or any letters correspondingly.
See this demo and this demo, too.
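For comparison, a short Python sketch of the two ASCII patterns from this answer, ^[A-Z][^A-Z]*$ for whole strings and \b[A-Z][a-zA-Z]*\b for individual words. The sample strings are made up.

```python
import re

# Whole string: one uppercase letter first, no uppercase letters after it.
SENTENCE_CASE = re.compile(r"^[A-Z][^A-Z]*$")

# Individual words that start with an uppercase ASCII letter.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-zA-Z]*\b")

ok = SENTENCE_CASE.match("Hello world") is not None
bad = SENTENCE_CASE.match("Hello World") is not None
words = CAPITALIZED_WORD.findall("The quick Brown fox Jumps")

print(ok, bad)  # True False
print(words)    # ['The', 'Brown', 'Jumps']
```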