Using the Python module re, I would like to detect sequences that contain at least two letters (A-Z) and at least two digits (0-9) in a text, e.g., from the text
"N03FZ467 other text N03671"
precisely the sub-string "N03FZ467" shall be matched.
The best I have got so far is
(?=[A-Z]*\d)[A-Z0-9]{4,}
which detects sequences of length at least 4 that contain only letters A-Z and digits 0-9, and at least one digit and one letter.
How can I make sure I get at least two of each?
If you want to match full words, start matching at word boundaries \b.
Check the first condition (two uppercase letters) with a lookahead: (?=(?:\d*[A-Z]){2})
If this succeeds, match the second requirement, two digits: (?:[A-Z]*\d){2}
Finally match any remaining [A-Z\d]* until another \b.
Putting it together:
\b(?=(?:\d*[A-Z]){2})(?:[A-Z]*\d){2}[A-Z\d]*\b
See this demo at regex101 or a Python demo at tio.run
Note that a lookahead is a zero-length assertion: it does not consume characters. If you don't specify a starting point such as \b, the lookahead is attempted at every position, which is less efficient.
Also note that the minimum length of four is already implied by the two requirements.
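A quick way to try the combined pattern with Python's re module, as a minimal sketch. The names PATTERN, text and matches are only illustrative; the sample text is the one from the question.

```python
import re

# Pattern from the answer above: two letters asserted by a lookahead,
# two digits consumed, then any remaining letters/digits up to a word boundary.
PATTERN = re.compile(r"\b(?=(?:\d*[A-Z]){2})(?:[A-Z]*\d){2}[A-Z\d]*\b")

text = "N03FZ467 other text N03671"
matches = PATTERN.findall(text)
print(matches)  # ['N03FZ467']
```

Since the pattern contains no capturing groups, findall returns the full matches.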
Use lookaheads, one for each requirement:
^(?=(.*\d){2})(?=(.*[A-Z]){2}).*
See live demo.
Regex breakdown:
(?=(.*\d){2}) is "2 digits somewhere ahead"
(?=(.*[A-Z]){2}) is "2 letters somewhere ahead"
The more efficient version:
^(?=(?:.*?\d){2})(?=(?:.*?[A-Z]){2}).*
It's more efficient because it doesn't capture (uses non-capturing groups (?:...)) and it uses the reluctant quantifier .*? which matches as early as possible in the input, whereas .* will scan ahead to the end then backtrack to find a match.
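A minimal sketch of how this version behaves in Python. Because of ^ and .*, it validates a whole string rather than picking a word out of a longer text, and it does not restrict which characters may appear, so the example checks each whitespace-separated token on its own. The names VALID and hits are illustrative only.

```python
import re

# Whole-string check: two digits and two uppercase letters somewhere ahead.
VALID = re.compile(r"^(?=(?:.*?\d){2})(?=(?:.*?[A-Z]){2}).*")

text = "N03FZ467 other text N03671"

# Test every token separately, since the pattern is anchored at the start.
hits = [tok for tok in text.split() if VALID.match(tok)]
print(hits)  # ['N03FZ467']
```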
If you only want to match chars A-Z and 0-9, you can use a single lookahead (if supported) to assert that 2 digits are present, and then match A-Z 2 times when matching the string. Note that, as written, \d\d and [A-Z]{2} require the two digits and the two letters to be adjacent.
As you have asserted 2 digits and matched 2 letters, the length is automatically at least 4 chars.
\b(?=[A-Z\d]*\d\d)[A-Z\d]*[A-Z]{2}[A-Z\d]*\b
Explanation
\b A word boundary to prevent a partial word match
(?=[A-Z\d]*\d\d) Positive lookahead, assert 2 digits to the right
[A-Z\d]* Match optional chars A-Z or digits
[A-Z]{2} Match 2 uppercase chars A-Z
[A-Z\d]* Match optional chars A-Z or digits
\b A word boundary
See a regex demo.
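A short Python sketch of this pattern, under the assumption that Python's re is the target engine. It also shows that \d\d and [A-Z]{2} only accept adjacent digits and adjacent letters. The name ADJACENT is made up for the example.

```python
import re

# One lookahead for two consecutive digits; two consecutive letters are consumed.
ADJACENT = re.compile(r"\b(?=[A-Z\d]*\d\d)[A-Z\d]*[A-Z]{2}[A-Z\d]*\b")

first = ADJACENT.findall("N03FZ467 other text N03671")
print(first)  # ['N03FZ467']

# "A1B2" has two letters and two digits, but no adjacent pair of either kind.
second = ADJACENT.findall("A1B2")
print(second)  # []
```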
I would enhance the given answer and do this:
(?=\b(?:\D+\d+){2}\b)(?=\b(?:[^a-z]+[a-z]+){2}\b)\S+
Regex demo
This contains two lookaheads, each validating one rule:
(?=\b(?:\D+\d+){2}\b) - lookahead that asserts that what follows is a word boundary \b, then non-digits followed by digits \D+\d+, repeated to make sure we have at least two such groups, then a word boundary again, to be sure we are within one "word".
The other lookahead is the same, but instead of digits and non-digits we have letters [a-z] and non-letters [^a-z]: (?=\b(?:[^a-z]+[a-z]+){2}\b)
At the end, we just match the whole 'word' with \S+, which simply matches all non-whitespace characters (since we asserted our 'word' earlier, this is sufficient).
Related
The strings I parse with a regular expression contain a region of fixed length N where there can either be numbers or dashes. However, if a dash occurs, only dashes are allowed to follow for the rest of the region. After this region, numbers, dashes, and letters are allowed to occur.
Examples (N=5, starting at the beginning):
12345ABC
12345123
1234-1
1234--1
1----1AB
How can I correctly match this? I currently am stuck at something like (?:\d|-(?!\d)){5}[A-Z0-9\-]+ (for N=5), but I cannot make numbers work directly following my region if a dash is present, as the negative look ahead blocks the match.
Update
Strings that should not be matched (N=5)
1-2-3-A
----1AB
--1--1A
You could assert that the first 5 characters are either digits or - and make sure that there is no - before a digit in the first 5 chars.
^(?![\d-]{0,3}-\d)(?=[\d-]{5})[A-Z\d-]+$
^ Start of string
(?![\d-]{0,3}-\d) Make sure that in the first 5 chars there is no - before a digit
(?=[\d-]{5}) Assert that the first 5 chars are digits or -
[A-Z\d-]+ Match 1+ times any of the listed characters
$ End of string
Regex demo
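The lookahead-based pattern can be checked against the sample strings from the question. This is a minimal sketch assuming Python's re; the names REGION, should_match and should_fail are illustrative.

```python
import re

# No "-" directly before a digit within the first 5 chars,
# and the first 5 chars consist only of digits or "-".
REGION = re.compile(r"^(?![\d-]{0,3}-\d)(?=[\d-]{5})[A-Z\d-]+$")

should_match = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
should_fail = ["1-2-3-A", "----1AB", "--1--1A"]

for s in should_match + should_fail:
    print(s, bool(REGION.match(s)))
```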
If atomic groups are available:
^(?=[\d-]{5})(?>\d+-*|-{5})[A-Z\d_]*$
^ Start of string
(?=[\d-]{5}) Assert that the first 5 chars are - or digits
(?> Atomic group
\d+-* Match 1+ digits and optional -
| or
-{5} match 5 times -
) Close atomic group
[A-Z\d_]* Match optional chars A-Z digit or _
$ End of string
Regex demo
Use a non-word-boundary assertion \B:
^[-\d](?:-|\B\d){4}[A-Z\d-]*$
A non-word boundary succeeds at a position between two word characters (from \w, i.e. [A-Za-z0-9_]) or between two non-word characters (from \W, i.e. [^A-Za-z0-9_]), and also between a non-word character and the limit of the string.
With it, each \B\d always follows a digit and can't follow a dash.
demo
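The \B variant can be tried on the same sample strings. A minimal sketch, assuming Python's re (where \B is the negation of \b); the names are illustrative.

```python
import re

# First char is "-" or a digit; each of the next 4 chars is either "-"
# or a digit that does not sit on a word boundary (so it follows a digit).
NON_BOUNDARY = re.compile(r"^[-\d](?:-|\B\d){4}[A-Z\d-]*$")

accepted = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
rejected = ["1-2-3-A", "----1AB", "--1--1A"]

for s in accepted + rejected:
    print(s, bool(NON_BOUNDARY.match(s)))
```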
Other way (if lookbehinds are allowed):
^\d*-*(?<=^.{5})[A-Z\d-]*$
demo
I have a string for example as follows:
ABCD17; ABC18; ABCEF19; XYZ19; ABCDE
Within the MusicBee application, I'm attempting to use a Regex replace function to swap MATCHED items for blanks and thus transform the above string into
ABCEF19; XYZ19
i.e. ONLY retain the items ending in "19"
The elements can be any length and they may or may not end in a number.
The following expression correctly matches the items Ending in 19
[^|;].*(?=19).{3}
However, I obviously need the opposite of this (since the matched items are then replaced with empty strings) which is NOT (surprisingly to me)
[^|;].*(?!19).{3}
If you only want to keep items that end in 19, one option might be to use word boundaries \b and start by matching 1+ uppercase chars A-Z.
Then optionally match the digits at the end when they are not 19, using the negative lookahead (?!19\b)
\b[A-Z]+(?!19\b)\d*\b;?
\b Word boundary
[A-Z]+ Match 1+ uppercase chars A-Z (or use [^\W\d] to match word chars without a digit)
(?!19\b) Negative lookahead, assert what is directly on the right is not 19
\d* Match 0+ digits
\b;? Word boundary and optionally match ;
Regex demo
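The replacement can be sketched in Python's re rather than in MusicBee itself, so treat it as an illustration of the pattern only. The names NOT_19, replaced and cleaned are made up, and the final strip is an extra step added here because the pattern leaves the separating spaces behind.

```python
import re

# Matches every item that does not end in 19, plus an optional trailing ";".
NOT_19 = re.compile(r"\b[A-Z]+(?!19\b)\d*\b;?")

s = "ABCD17; ABC18; ABCEF19; XYZ19; ABCDE"
replaced = NOT_19.sub("", s)

# Leftover spaces and the final ";" still need trimming.
cleaned = replaced.strip(" ;")
print(cleaned)  # ABCEF19; XYZ19
```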
I need to extract from a text all the words which match these two requirements:
Contain at least one uppercase letter
Don't fully consist of uppercase characters.
So, Word and WorD are correct captures, but word and WORD aren't.
So, I can capture all the words using a \b([a-zA-Z]+)\b Regex, but I don't know how to add the uppercase letters condition here.
As for requirement #1, I tried to use a positive lookahead here like this:
\b(?=.*[A-Z]+)([a-zA-Z]+)\b , but now it captures all the words from a line if this line has at least one uppercase letter.
Is it even possible to apply additional conditions to a capturing group?
I can process this in my application's code but I'd really prefer to fit all those requirements in a single Regex.
You may use
\b(?=[A-Z]*[a-z])(?=[a-z]*[A-Z])([a-zA-Z]+)\b
See the regex demo
Actually, you do not even need the capturing group: ([a-zA-Z]+) can usually be replaced with [a-zA-Z]+, but it depends on where you are using the regex.
Details
\b - word boundary
(?=[A-Z]*[a-z]) - a positive lookahead that requires a lowercase letter after 0+ uppercase ones
(?=[a-z]*[A-Z]) - a positive lookahead that requires an uppercase letter after 0+ lowercase ones
([a-zA-Z]+) - Group 1: 1 or more letters
\b - a word boundary.
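A minimal Python sketch of the two-lookahead pattern, using a made-up sample sentence. With a single capturing group, findall returns the contents of Group 1.

```python
import re

# A lowercase letter after 0+ uppercase ones AND an uppercase letter
# after 0+ lowercase ones, checked from the start of each word.
MIXED = re.compile(r"\b(?=[A-Z]*[a-z])(?=[a-z]*[A-Z])([a-zA-Z]+)\b")

text = "Word and WorD are captured, but word and WORD are not"
words = MIXED.findall(text)
print(words)  # ['Word', 'WorD']
```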
I am trying to write a regex to match these cases:
Correct:
IUG4455
I4UG455
A4U345A
Wrong:
IUGG453
IIUG44555
It needs to be exactly 4 letters (in any order) and exactly 3 digits (in any order).
I tried to use this expression
[A-Z]{3}\\d{4}
but it only accepts strings that start with letters (4) followed by digits (3).
You have a couple of options for this:
Option 1: See regex in use here
\b(?=(?:\d*[A-Z]){3})(?=(?:[A-Z]*\d){4})[A-Z\d]{7}\b
\b Assert position as a word boundary
(?=(?:\d*[A-Z]){3}) Positive lookahead ensuring the following matches
  (?:\d*[A-Z]){3} Match the following exactly 3 times
    \d* Match any digit any number of times
    [A-Z] Match any uppercase ASCII character
(?=(?:[A-Z]*\d){4}) Positive lookahead ensuring the following matches
  (?:[A-Z]*\d){4} Match the following exactly 4 times
    [A-Z]* Match any uppercase ASCII character any number of times
    \d Match any digit
[A-Z\d]{7} Match any digit or uppercase ASCII character exactly 7 times
\b Assert position as a word boundary
If speed needs to be taken into consideration, you can expand the above option and use the following:
\b(?=\d*[A-Z]\d*[A-Z]\d*[A-Z])(?=[A-Z]*\d[A-Z]*\d[A-Z]*\d[A-Z]*\d)[A-Z\d]{7}\b
Option 2: See regex in use here
\b(?=(?:\d*[A-Z]){3}(?!\d*[A-Z]))(?=(?:[A-Z]*\d){4}(?![A-Z]*\d))[A-Z\d]+\b
Similar to Option 1, but uses negative lookahead to ensure an extra character (uppercase ASCII letter or digit) doesn't exist in the string.
Having two positive lookaheads back-to-back simulates an and such that it ensures both subpatterns are satisfied starting at that particular position. Since you have two conditions (3 uppercase ASCII letters and 4 digits), you should use two lookaheads.
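Both options can be checked against the samples from the question. This is a sketch assuming Python's re; OPTION_1, OPTION_2 and the list names are illustrative.

```python
import re

# Option 1: at least 3 letters and 4 digits ahead, exactly 7 chars in total.
OPTION_1 = re.compile(r"\b(?=(?:\d*[A-Z]){3})(?=(?:[A-Z]*\d){4})[A-Z\d]{7}\b")

# Option 2: the negative lookaheads rule out a 4th letter or a 5th digit.
OPTION_2 = re.compile(
    r"\b(?=(?:\d*[A-Z]){3}(?!\d*[A-Z]))(?=(?:[A-Z]*\d){4}(?![A-Z]*\d))[A-Z\d]+\b"
)

correct = ["IUG4455", "I4UG455", "A4U345A"]
wrong = ["IUGG453", "IIUG44555"]

for pattern in (OPTION_1, OPTION_2):
    print([bool(pattern.fullmatch(s)) for s in correct + wrong])
# Each line prints: [True, True, True, False, False]
```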
As an alternative,
(?:(?<d>\d)|(?<c>[A-Z])){7}(?<-d>){3}(?<-c>){4}
doesn't require any lookarounds. It relies on balancing groups (a .NET regex feature): it matches seven letters or digits and then checks that it found 3 digits and 4 letters.
Adjust the 3 and 4 to taste... your examples have 4 digits and 3 letters.
Also add word boundaries or anchors depending on whether you are trying to match whole words or a whole string.
I can use \s?(\w+\s){0,2}\w*) for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine these? Trying to merge the two via a lookahead as mentioned here seems to fail.
Some examples for clarification:
The early bird catches the worm.
should match any three words in sequence (including the worm*).
Here we have a supercalifragilisticexpialidocious sentence.
"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.
* In my actual use case I'm going for a paragraph's last three words, i.e. a (?:\r) would be at the end of the RegEx and the match would be "catches the worm.". Matches are then applied with a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.
To match 3 words separated with whitespace(s) at the end of a line or string, you can use
\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])
See the regex demo. Note that in the demo, I use [^\S\r\n] instead of the \s in the lookahead since the text contains newlines; use the same trick if you need that.
Regex explanation
\b - a word boundary
(?!(?:\s*\w){21}) - a lookahead check that fails the match if after the initial word boundary there are 21 word characters optionally preceded with any number of whitespace symbols
\w+ - 1 word (consisting of 1 or more word characters)
(?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespaces followed with 1+ word characters
(?=$|[\r\n]) - a positive lookahead that only allows a match to be returned if there is the end-of-string ($) or the end of a line ([\r\n]).
Now, if your words should only contain letters, use [a-zA-Z] or equivalent for your language. If the regex flavor allows, use \p{L} Unicode category/property class.
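A minimal sketch of how the pattern behaves, run with Python's re rather than InDesign's engine, on single-line inputs without a trailing period (the pattern expects a word character right before the end). The names LAST_WORDS, short and long_ are illustrative.

```python
import re

# Up to three words at the end of the line, rejected at any start position
# from which 21 or more word characters follow.
LAST_WORDS = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

short = "The early bird catches the worm"
long_ = "Here we have a supercalifragilisticexpialidocious sentence"

short_hit = LAST_WORDS.search(short).group()
long_hit = LAST_WORDS.search(long_).group()
print(short_hit)  # catches the worm
print(long_hit)   # sentence
```

In the second case the three-word sequence is too long, so the search moves on and the match shrinks to the last word alone.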