Regex for an alphanumeric word that should not look like RUN123456

I want to apply a regex to a string to extract alphanumeric values, but a value must not start with the substring RUN followed by a digit, e.g. RUN123456.
Below is the regex I am using to get the alphanumeric values:
regex='[A-Z]{2,}[_0-9a-zA-Z]*'
Sample Input:
CY0PNI94980 Production AutoSys Job has failed. Call 249-3344. EC=54. RUN130990.
The matches can include CY0PNI94980 and EC, but not RUN130990.
Kindly help me on this.

You may match the strings matching your pattern excluding all those starting with RUN and a digit:
\b(?!RUN[0-9])[A-Z]{2,}[_0-9a-zA-Z]*
See the regex demo
If you do not care whether Unicode letters or digits are matched too, you may replace [_0-9a-zA-Z] with \w and use
\b(?!RUN[0-9])[A-Z]{2,}\w*
Details
\b - a word boundary
(?!RUN[0-9]) - a negative lookahead that fails the match if there is RUN and any ASCII digit immediately to the right of the current location
[A-Z]{2,} - 2 or more uppercase ASCII letters
[_0-9a-zA-Z]* / \w* - 0 or more word chars (letters/digits/_).
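As a quick check, the first pattern can be run against the sample input from the question. This is only a minimal sketch using Python's re module; the helper name find_codes is hypothetical and exists just for this illustration.

```python
import re

# Pattern from the answer: a word boundary, then a negative lookahead that
# rejects words starting with RUN followed by a digit.
CODE_RE = re.compile(r"\b(?!RUN[0-9])[A-Z]{2,}[_0-9a-zA-Z]*")


def find_codes(text):
    """Return every matching code in text, skipping RUN<digit>... words."""
    return CODE_RE.findall(text)


sample = "CY0PNI94980 Production AutoSys Job has failed. Call 249-3344. EC=54. RUN130990."
print(find_codes(sample))  # ['CY0PNI94980', 'EC']
```

The leading \b matters: without it, the engine can retry one character into the word and return UN130990.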

Related

Is there a way to use Regex to capture numbers out of a string based on a specific leading letters?

I need to extract any number of 4 to 10 digits that directly follows 'PO#' or 'PO# ' (with a whitespace). I do not want to include the PO# in the extracted value, but I do need it as a criterion to target the value within a string. If the number has fewer than 4 or more than 10 digits, I do not want to capture it and would like to ignore it.
A sample string would look like this:
PO#12445 for Vendor Enterprise
or
Invoice# 21412556 for Vendor Enterprise for PO# 12445
My current regex captures PO# including the '#', and I use additional logic afterwards to remove the '#'. However, my expression also captures Invoice# and Inv#, which I don't want. I'd like it to target only PO#.
Current Expression: [P][O][#]\s*[0-9]{3,9}\d+\w
Any help would be greatly appreciated!
If you need only the digits, you can use (?<=\bPO#)\s?(\d{4,10})\b, with:
(?<=\bPO#): positive lookbehind that makes sure PO# (PO followed by #, starting at a word boundary) is present right before the needed pattern
\s?: 0 or 1 whitespace
(\d{4,10}): between 4 and 10 digits, captured in group 1
\b: word boundaries, so that e.g. the first 10 digits of an 11-digit number or 'SPO#' do not match
Edit: Alexander Mashin is right about the lookbehind having to be fixed width, so (?<=\bPO#)\s?(\d{4,10})\b is better https://regex101.com/r/1KBQd1/5
Edit: added word boundaries
You can use a capturing group and repeat matching the digits 4-10 times using [0-9]{4,10}.
Note that [P][O][#] is the same as PO#
\bPO#\s*([0-9]{4,10})\b
\bPO#\s* Match PO# preceded by a word boundary and match 0+ whitespace chars
( Capture group 1
[0-9]{4,10} Match 4 - 10 digits
)\b Close group followed by a word boundary to prevent the match being part of a larger word
Regex demo
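The capturing-group pattern can be tried on the two sample strings from the question. This is a minimal sketch in Python; the function name extract_po_numbers and the third test string are assumptions made for illustration.

```python
import re

# \bPO# anchors on the literal prefix; group 1 holds only the digits.
PO_RE = re.compile(r"\bPO#\s*([0-9]{4,10})\b")


def extract_po_numbers(text):
    """Return the 4-10 digit numbers that directly follow 'PO#'."""
    # With exactly one capturing group, findall returns the group contents.
    return PO_RE.findall(text)


print(extract_po_numbers("PO#12445 for Vendor Enterprise"))  # ['12445']
print(extract_po_numbers("Invoice# 21412556 for Vendor Enterprise for PO# 12445"))  # ['12445']
# Too short (3 digits) and too long (11 digits) are both ignored.
print(extract_po_numbers("PO#123 and PO#12345678901"))  # []
```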
If PCRE is available, how about:
PO#\s*\K\d{4,10}(?=\D|$)
PO#\s* matches the leading substring "PO#" followed by 0 or more whitespaces.
\K discards everything matched so far from the reported match, so it works like a positive (zero-length) lookbehind that may have variable width.
\d{4,10} matches a sequence of digits of 4 <= length <= 10.
(?=\D|$) is the positive lookahead to match a non-digit character or the end of the string.

Regex Exclude Number Within Two Characters of Number

I have some manually entered data (it's an email subject), and I am trying to extract the correct ID to perform a series of actions with RPA on.
RE:'HC=312-822-281' abc2-1234567 7354612
I have a regex query:
(?<!\d)\d{7}(?!\d)
I want to extract 7354612 but not 1234567.
I want to avoid matching any 7-digit number that is preceded with a hyphen, or a hyphen and a space.
My initial query works 80% of the time, but this hyphen issue is interfering with the other 20%.
You can modify the existing (?<!\d) lookbehind to also exclude the position after a hyphen, i.e. (?<![\d-]), and add another lookbehind to exclude the hyphen + space context ((?<!- ) or (?<!-\s)):
(?<![\d-])(?<!- )\d{7}(?!\d)
(?<![\d-])(?<!-\s)\d{7}(?!\d)
Note \s matches any whitespace. See the regex demo.
Details
(?<![\d-]) - a negative lookbehind that fails the match if there is a digit or a hyphen immediately to the left of the current location
(?<!-\s) - a negative lookbehind that fails the match if there is a - and a space after it immediately to the left of the current location
\d{7} - any seven digits
(?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current location.
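Both lookbehinds are fixed width, so the pattern should also work in engines that reject variable-width lookbehinds, such as Python's built-in re. The sketch below checks it against the subject line from the question; the helper name find_ids and the second test string are hypothetical.

```python
import re

# Two fixed-width negative lookbehinds: no digit or hyphen directly before
# the number, and no "hyphen + whitespace" directly before it either.
ID_RE = re.compile(r"(?<![\d-])(?<!-\s)\d{7}(?!\d)")


def find_ids(subject):
    """Return 7-digit IDs that are not preceded by '-' or '- '."""
    return ID_RE.findall(subject)


print(find_ids("RE:'HC=312-822-281' abc2-1234567 7354612"))  # ['7354612']
print(find_ids("abc2- 1234567"))  # []
```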
Variations
With PCRE regex, you may also use
-\s*\d{7}(?!\d)(*SKIP)(*F)|(?<!\d)\d{7}(?!\d)
See the regex demo, where -\s*\d{7}(?!\d)(*SKIP)(*F)| matches -, 0+ spaces, seven digits after which there are no more digits and skips that match, only returning matches for the (?<!\d)\d{7}(?!\d) pattern.
In .NET, modern JavaScript and the PyPI regex module for Python, you may use
(?<!\d|-\s*)\d{7}(?!\d)
See this regex demo. Here, (?<!\d|-\s*) negative lookbehind fails the match if there is a digit or - + 0 or more whitespace chars immediately to the left of the current position.

Negating a complex regex containing three parts

I need a regex which is matched when the string doesn't have both lowercase and uppercase letters.
If the string has only lowercase letters -> should be matched
If the string has only uppercase letters -> should be matched
If the string has only digits or special characters -> should be matched
For example
abc, ABC, 123, abc123, ABC123&^ - should match
AbC, A12b, AB^%12c - should not match
Basically I need an inverse/negation of the following regex:
^(?=.*[a-z])(?=.*[A-Z]).+$
Does not sound like any lookarounds would be needed.
Either match only characters that are not a-z, or only characters that are not A-Z.
^(?:[^a-z]+|[^A-Z]+)$
See this demo at regex101 (used + for one or more)
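To confirm the behaviour, the pattern can be run over the examples listed in the question. This is a small Python sketch; the function name has_single_case is an assumption for illustration.

```python
import re

# Either the whole string has no lowercase letters, or it has no uppercase ones.
SINGLE_CASE_RE = re.compile(r"^(?:[^a-z]+|[^A-Z]+)$")


def has_single_case(s):
    """True if s does not contain both lowercase and uppercase ASCII letters."""
    return SINGLE_CASE_RE.search(s) is not None


should_match = ["abc", "ABC", "123", "abc123", "ABC123&^"]
should_not_match = ["AbC", "A12b", "AB^%12c"]
print([has_single_case(s) for s in should_match])      # [True, True, True, True, True]
print([has_single_case(s) for s in should_not_match])  # [False, False, False]
```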
You may use
^(?!.*[A-Z].*[a-z])(?!.*[a-z].*[A-Z])\S+$
Or
^(?=(?:[^a-z]+|[^A-Z]+)$).*$
See the regex demo #1 and regex demo #2
A lookaround solution like this can be used in more complex scenarios, when you need to apply more restrictions on the pattern. Else, consider a non-lookaround solution.
Details
^ - start of string
(?!.*[A-Z].*[a-z]) - no uppercase followed with a lowercase letter
(?!.*[a-z].*[A-Z]) - no lowercase letter followed with an uppercase one
(?=(?:[^a-z]+|[^A-Z]+)$) - a positive lookahead that requires 1 or more characters other than lowercase ASCII letters ([^a-z]+) to the end of the string, or 1 or more characters other than uppercase ASCII letters ([^A-Z]+) to the end of the string
\S+ - 1+ non-whitespace chars (the second pattern uses .* instead, 0+ chars other than line break chars)
$ - end of string.
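The two lookaround patterns agree on the question's examples, but they are not fully interchangeable: the first consumes with \S+, so it rejects strings that contain whitespace, while the second does not. The Python sketch below illustrates this; the function name compare and the input "abc def" are assumptions added for the comparison.

```python
import re

NEGATIVE_LOOKAHEADS = re.compile(r"^(?!.*[A-Z].*[a-z])(?!.*[a-z].*[A-Z])\S+$")
POSITIVE_LOOKAHEAD = re.compile(r"^(?=(?:[^a-z]+|[^A-Z]+)$).*$")


def compare(s):
    """Return (matches first pattern, matches second pattern) for s."""
    return (
        NEGATIVE_LOOKAHEADS.search(s) is not None,
        POSITIVE_LOOKAHEAD.search(s) is not None,
    )


print(compare("abc123"))   # (True, True)
print(compare("AbC"))      # (False, False)
print(compare("abc def"))  # (False, True): only \S+ rejects the space
```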
You can use this regex
^(([A-Z0-9?&%^](?![a-z]))+|([a-z0-9?&%^](?![A-Z]))+)$
You can test more cases here.
I've only added the characters ?&%^ as possible special characters, but you can add whichever you like.
I would go with:
^(?:[^a-z]+?|[^A-Z]+?)$
It translates to "If the entire string is composed of non-lowercase letters or non-uppercase letters then match the string."
Lazy quantifiers +? are used so that the end-of-string $ anchor is obeyed when the multiline flag is enabled. If you're only validating a single-line string, then you can simply use + without the question mark.
If you have a whitelist of specific allowed special chars, then change [^a-z] into [A-Z0-9()_+=-] (and [^A-Z] into [a-z0-9()_+=-]) and list the allowed special chars.
https://regex101.com/r/Wg6tLn/1

Why doesn't my regex work when entering only 1 letter after the # character?

I have a custom regex pattern to check that a username in a URL is valid:
^[#](?:[a-z][a-z0-9_]*[a-z0-9])?$
This pattern works when I write usernames like:
#username
#username_16
#username16
But it does not work when I write:
#u
First part of the question:
How do I rewrite this pattern so that it works for #u?
Second part of the question:
How do I control the number of characters (the length) after the # symbol?
The [a-z] and [a-z0-9] are obligatory patterns inside the optional group, hence if there is something after #, there must be two chars at least.
Besides, your regex also matches a string that equals #.
To fix all these issues you may use
^#[a-z](?:[a-z0-9_]*[a-z0-9])?$
See the regex demo.
Now, to restrict the length of a string after # symbol, you may insert a (?=.{x,m}$) positive lookahead right after #. Say, to only match 3 or 4 chars after #, use:
^#(?=[a-z0-9_]{3,4}$)[a-z](?:[a-z0-9_]*[a-z0-9])?$
  ^^^^^^^^^^^^^^^^^^^
Or, since the consuming pattern will validate the rest
^#(?=.{3,4}$)[a-z](?:[a-z0-9_]*[a-z0-9])?$
  ^^^^^^^^^^^
See this regex demo
Details
^ - start of string
# - a # char
(?=.{3,4}$) - a positive lookahead that requires any 3 or 4 chars other than line break chars up to the end of the string immediately to the right of the current location (i.e. right after the # here)
[a-z] - a lowercase ASCII letter
(?:[a-z0-9_]*[a-z0-9])? - an optional non-capturing group matching 1 or 0 occurrences of
[a-z0-9_]* - 0+ lowercase ASCII letters, digits or _
[a-z0-9] - a lowercase ASCII letter or a digit
$ - end of string.
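Both the fixed pattern and the length-limited variant can be exercised with a few usernames. This is a minimal Python sketch; the helper username_re_with_length and the extra test names (#u_, #ab, #abcde) are hypothetical and not part of the original answer.

```python
import re

# Fixed pattern: the first letter is mandatory, the rest is optional.
USERNAME_RE = re.compile(r"^#[a-z](?:[a-z0-9_]*[a-z0-9])?$")


def username_re_with_length(min_len, max_len):
    """Build the same pattern with a length lookahead placed right after '#'."""
    return re.compile(
        r"^#(?=.{%d,%d}$)[a-z](?:[a-z0-9_]*[a-z0-9])?$" % (min_len, max_len)
    )


for name in ["#u", "#username_16", "#", "#u_"]:
    print(name, USERNAME_RE.search(name) is not None)
# #u True, #username_16 True, # False, #u_ False

three_or_four = username_re_with_length(3, 4)
for name in ["#ab", "#abc", "#abcd", "#abcde"]:
    print(name, three_or_four.search(name) is not None)
# #ab False, #abc True, #abcd True, #abcde False
```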

extract substring with regular expression

I have a string that is actually a file path (directory plus file name).
str='\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3'
I need to extract the target substring 'UA0001A' with MATLAB (though I would think all tools share the same syntax).
It does not have to be exactly 'UA0001A'; it is an arbitrary combination of letters and digits.
To make it more general, the substring (or word) should satisfy the following:
it is a word made of letters and digits
it cannot be a word of letters only or digits only
it cannot include 'midd' or 'midd3' or 'Midd3' or 'MIDD3', etc., so a case-insensitive method may be used to exclude words beginning with 'midd'
it cannot include 'y[0-9]{2,4}m[0-9]{1,2}d[0-9]{1,2}\w*'
How to write the regular expression to find the target substring?
Thanks in advance!
You can use
s = '\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3';
res = regexp(s, '(?i)\\(?![^\W_]*(midd|y\d+m\d+))(?=[^\W_]*\d)(?=[^\W_]*[a-zA-Z])([^\W_]+)','tokens');
disp(res{1}{1})
See the regex demo
Pattern explanation:
(?i) - the case-insensitive modifier
\\ - a literal backslash
(?![^\W_]*(midd|y\d+m\d+)) - a negative lookahead that will fail a match if there are midd or y+digits+m+digits after 0+ letters or digits
(?=[^\W_]*\d) - a positive lookahead that requires at least 1 digit after 0+ digits or letters ([^\W_]*)
(?=[^\W_]*[a-zA-Z]) - there must be at least 1 letter after 0+ letters or digits
([^\W_]+) - Group 1 (the value that will be extracted) matching 1+ letters or digits (or 1+ characters other than non-word chars and _).
The 'tokens' "mode" will let you extract the captured value rather than the whole match.
See the IDEONE demo
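Since the question assumes the syntax carries over to other tools, here is a rough Python port of the same pattern. It is only a sketch under two assumptions: the function name extract_token is hypothetical, and the alternation inside the negative lookahead is made non-capturing so that the wanted token is group 1 in Python's re.

```python
import re

# Same logic as the MATLAB pattern; (?:...) keeps the token as group 1.
PATH_TOKEN_RE = re.compile(
    r"(?i)\\(?![^\W_]*(?:midd|y\d+m\d+))(?=[^\W_]*\d)(?=[^\W_]*[a-zA-Z])([^\W_]+)"
)


def extract_token(path):
    """Return the first mixed letter/digit word that follows a backslash, or None."""
    m = PATH_TOKEN_RE.search(path)
    return m.group(1) if m else None


# Raw string, so the backslashes are kept exactly as in the question.
s = r'\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3'
print(extract_token(s))  # UA0001A
```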
This should get you started:
[\\](?i)(?![a-z0-9]*midd)([0-9]+[a-z]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*)
[\\] : match a backslash
(?i) : the rest of the regex is case insensitive
(?!...) : negative lookahead, the text that follows must not match this
(?![a-z0-9]*midd) : the word that follows must not contain midd
([0-9]+[a-z]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*) : match at least one digit followed by at least one letter OR at least one letter followed by at least one digit, followed by any number of letters and digits (remember, it cannot match the ?! group, so no word that contains midd)