Regex match all letters after a digit - regex

I want to match any letters that occur after a digit(s). There will not be any other digits in the sentence.
# example 1 one letter
> 'word 1 b'
> array(b)
# example 2 multiple letters
>'3c, d, e'
> array (c, d, e)
# example 3 no match
>'word 5'
> array()
# example 4 multiple letters multiple digits
>'words 12a b c'
> array(a, b, c)
I've tried [^\d]+?([A-Za-z]) but this matches letters before the digits also, and not the one that is attached to the digit (e.g. in example 4, 12a, or example 2, 3c)

Since this works for you, here are the possible solutions:
(?:\G(?!^)|\d+)[^a-z]*\K[a-z]
(?<=\d.*)[a-z]
See regex #1 demo and regex #2 demo. Details:
(?:\G(?!^)|\d+) - one or more digits or the end of the previous successful match
[^a-z]* - zero or more characters other than lowercase letters
\K - match reset operator discarding all text matched so far
[a-z] - a lowercase letter.
The second regex means:
(?<=\d.*) - a location that is immediately preceded with a digit and then any zero or more chars other than line break chars, as many as possible
[a-z] - a lowercase letter.
To exclude the word and, you can use
(?:\G(?!^)(?:\s+and\b)?|\d+)[^a-z\n]*\K[a-z]
See this regex demo. Or,
(?<=\d.*)[a-z](?<!\band\b)(?!(?<=\ban)d\b)(?!(?<=\ba)nd\b)
See this regex demo.
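Note that \G, \K and variable-width lookbehind are engine-specific; Python's built-in re module supports none of them. As a rough sketch under that assumption, the same result can be obtained there in two steps: find the first run of digits, then collect the lowercase letters from the rest of the string. The function name letters_after_digits is hypothetical and not part of the answer above.

```python
import re


def letters_after_digits(text):
    """Return every lowercase letter that occurs after the first run of digits."""
    digits = re.search(r"\d+", text)
    if digits is None:
        return []  # no digit at all, so nothing can follow one
    # Only scan the part of the string that comes after the digits.
    return re.findall(r"[a-z]", text[digits.end():])


if __name__ == "__main__":
    for sample in ["word 1 b", "3c, d, e", "word 5", "words 12a b c"]:
        print(repr(sample), letters_after_digits(sample))
```

This relies on the question's guarantee that the sentence contains no other digits.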

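It sounds like what you might need is a zero-width assertion, one that is required by the expression but is not part of the match. A zero-width lookbehind does not consume the digit; it only asserts that a digit comes immediately before, and the captured group is the run of letters that follows it. Note that this only captures letters directly attached to the digit.
(?<=\d)([A-Za-z]+)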

Related

"contains at least one of the following letters" type regex issue

I'm having a bit of an issue getting an "at least one of certain characters" type regex working.
There are obviously similar questions to my problem but the solutions I am trying are not resolving an incorrect match.
Basically, the regex should match:
(starts with a forward slash) (has a-z letters but must contain at least an "a", "b" or "d" - this part should be 3 to 4 characters long) (ends in a single digit followed by a slash, the URL need not finish here - can be longer)
I've got the following but it doesn't work exactly as expected:
\/(?=.*[a|b|d])[a-z]{3,4}[\d]\/
The above expression incorrectly matches (/ccc1/) of the following expressions (see example at https://regex101.com/r/6ET43K/4/ ):
http://example.com/folder/aaaa/ - no match (no digit at end)
http://example.com/folder/bbb1/ - match (has at least one "b", and digit)
http://example.com/folder/ccc1/ - no match (has neither "a", "b" or "d")
http://example.com/folder/yyd3/ - match (has at least one "d", and digit)
http://example.com/folder/yydd3/ - match (has at least one "d", and digit)
http://example.com/folder/yyddd3/ - no match (too long)
I'd be very grateful for any pointers as to what I'm doing wrong here. Thanks!
You might use
^.*\/(?=[a-z]{3,4}\d\/)[a-z]*[abd][a-z]*\d\/
Explanation
^ Start of string
.*\/ Match until the last /
(?= Positive lookahead
[a-z]{3,4}\d\/ Assert 3 or 4 times a char a-z followed by 1 digit and /
) Close lookahead
[a-z]*[abd][a-z]* match at least a single a b or d between optional chars a-z
\d\/ Match 1 digit and /
Regex demo
Alternatively, maybe the following will do:
^.*\/(?=[a-z]*[abd])[a-z]{3,4}\d\/
See the online demo
^ - Start string anchor.
.*\/ - Match greedily until the last forward slash.
(?=[a-z]*[abd]) - Positive lookahead that matches zero or more letters a-z followed by an a, b or d, so the segment must contain at least one of them.
[a-z]{3,4}\d\/ - Three to four alpha chars, a digit and a literal forward slash.
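As a quick check, both suggested patterns can be run against the sample URLs from the question. This is a minimal sketch in Python; the names PATTERN_A, PATTERN_B, EXPECTED and matches are made up for illustration.

```python
import re

# The two patterns suggested above, copied verbatim.
PATTERN_A = re.compile(r"^.*\/(?=[a-z]{3,4}\d\/)[a-z]*[abd][a-z]*\d\/")
PATTERN_B = re.compile(r"^.*\/(?=[a-z]*[abd])[a-z]{3,4}\d\/")

# Sample URLs from the question, with the outcome the asker expects.
EXPECTED = {
    "http://example.com/folder/aaaa/": False,    # no digit at end
    "http://example.com/folder/bbb1/": True,     # has a "b" and a digit
    "http://example.com/folder/ccc1/": False,    # no "a", "b" or "d"
    "http://example.com/folder/yyd3/": True,     # has a "d" and a digit
    "http://example.com/folder/yydd3/": True,    # has a "d" and a digit
    "http://example.com/folder/yyddd3/": False,  # too long
}


def matches(pattern, url):
    """Return True if the pattern matches somewhere in the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url, expected in EXPECTED.items():
        print(url, matches(PATTERN_A, url), matches(PATTERN_B, url), expected)
```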

Trying to match zero outside the word boundaries

I have patterns like
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
I can match word TCELL and TBNK with this RegEX
^(\D+)-(\d+)-(\d+)([A-Z1-9]+)?.*
But if I have patterns like
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
the above regex returns
T2 and C192 instead of T20NK and C1920 respectively
Is there a general regex that matches N zeros outside of these word boundaries?
Let's consider all 4 examples of your input:
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
The first group, between start of line and the first "_" (e.g. FQC19515 in row 1)
consists of:
a non-empty sequence of letters,
a non-empty sequence of digits.
So the regex matching it, including the start of line anchor and a capturing group is:
^([A-Z]+\d+)
You used \D instead of [A-Z] but I think that [A-Z] is
more specific, as it matches only letters and not e.g. "_".
The next source char is _, so the regex can also include _.
And now the more difficult part: The second group to be captured has
actually 2 variants:
a sequence of letters and a sequence of digits (after that there is
a "_"),
a sequence of letters, a sequence of digits and another sequence of
letters (after that there are digits that you want to omit).
So the most intuitive way is to define 2 alternatives, each with
a respective positive lookahead:
alternative 1: [A-Z]+\d+(?=_),
alternative 2: [A-Z]+\d+[A-Z]+(?=\d).
But there is a bit shorter way. Notice that both alternatives start
from [A-Z]+\d+.
So we can put this fragment first and include only the rest
as a non-capturing group ((?:...)), with 2 alternatives.
All the above should be surrounded with a capturing group:
([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
So the whole regex can be:
^([A-Z]+\d+)_([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
with m option ("^" matches also the start of each line).
For a working example see https://regex101.com/r/GDdt10/1
Your regex: ^(\D+)-(\d+) is wrong as after a sequence of non-digits
(\D+) you specified a minus which doesn't occur in your source.
Also the second minus does not correspond to your input.
Edit
To match all your strings, I modified slightly the previous regex.
The changes are limited to the matching group No 2 (after _):
Alternative No 1: [A-Z]{2,}+(?=\d) - two or more letters, after them
there is a digit, to be omitted. It will match TCELL and TBNK.
Alternative No 2: [A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)) - the previous
content of this group. It will match two remaining cases.
So the whole regex is:
^([A-Z]+\d+)_([A-Z]{2,}+(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
For a working example see https://regex101.com/r/GDdt10/2
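The final regex can be tried on all four file names with a short Python sketch. One assumption: the possessive quantifier {2,}+ is written here as a plain {2,}, because Python's re module only accepts possessive quantifiers from version 3.11 on; for these inputs the result is the same. The names PATTERN and split_name are hypothetical.

```python
import re

# Same regex as above, with {2,}+ relaxed to {2,} (see the note before this block).
PATTERN = re.compile(
    r"^([A-Z]+\d+)_([A-Z]{2,}(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))",
    re.MULTILINE,  # the "m" option: ^ also matches at the start of each line
)


def split_name(filename):
    """Return (group 1, group 2) for a file name, or None if it does not match."""
    m = PATTERN.match(filename)
    return m.groups() if m else None


if __name__ == "__main__":
    for name in [
        "FQC19515_TCELL001_20190319_165944.pdf",
        "FQC19515_TBNK001_20190319_165944.pdf",
        "FLW194640_T20NK022_20190323_131348.pdf",
        "FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf",
    ]:
        print(name, split_name(name))
```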
As far as I understand, you could use:
^[A-Z]+\d+_\K[A-Z0-9]{5}
Explanation:
^ # beginning of line
[A-Z]+ # 1 or more capitals
\d+_ # 1 or more digit and 1 underscore
\K # forget all we have seen until this position
[A-Z0-9]{5} # 5 capitals or digits
Demo

Why doesn't the regex work when entering 1 letter after the optional character?

I have a custom regex pattern to check for a valid username in a URL:
^[#](?:[a-z][a-z0-9_]*[a-z0-9])?$
This pattern works when I write usernames like:
#username
#username_16
#username16
But it does not work when I write:
#u
First part of the question:
How can I rewrite this pattern so that it works for #u?
Second part of the question:
How can I control the character limit or length after the # symbol?
The [a-z] and [a-z0-9] are obligatory patterns inside the optional group, hence if there is something after #, there must be two chars at least.
Besides, your regex also matches a string that equals #.
To fix all these issues you may use
^#[a-z](?:[a-z0-9_]*[a-z0-9])?$
See the regex demo.
Now, to restrict the length of a string after # symbol, you may insert a (?=.{x,m}$) positive lookahead right after #. Say, to only match 3 or 4 chars after #, use:
^#(?=[a-z0-9_]{3,4}$)[a-z](?:[a-z0-9_]*[a-z0-9])?$
^^^^^^^^^^^^^^^^^^^
Or, since the consuming pattern will validate the rest
^#(?=.{3,4}$)[a-z](?:[a-z0-9_]*[a-z0-9])?$
^^^^^^^^^^^
See this regex demo
Details
^ - start of string
# - a # char
(?=.{3,4}$) - a positive lookahead that requires any 3 or 4 chars other than line break chars up to the end of the string immediately to the right of the current location (i.e. right after the #)
[a-z] - a lowercase ASCII letter
(?:[a-z0-9_]*[a-z0-9])? - an optional non-capturing group matching 1 or 0 occurrences of
[a-z0-9_]* - 0+ lowercase ASCII letters, digits or _
[a-z0-9] - a lowercase ASCII letter or digit
$ - end of string.
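Both patterns can be checked with a small Python sketch. The sample usernames beyond those in the question are made up for illustration, as are the names USERNAME and USERNAME_3_4.

```python
import re

# Fixed pattern: the first letter is mandatory, the rest is optional.
USERNAME = re.compile(r"^#[a-z](?:[a-z0-9_]*[a-z0-9])?$")

# Same pattern with a lookahead limiting the part after # to 3 or 4 chars.
USERNAME_3_4 = re.compile(r"^#(?=.{3,4}$)[a-z](?:[a-z0-9_]*[a-z0-9])?$")


def is_valid(pattern, name):
    """Return True if the whole string is a valid username for the pattern."""
    return pattern.match(name) is not None


if __name__ == "__main__":
    for name in ["#u", "#username_16", "#", "#u_", "#abc", "#ab", "#abcde"]:
        print(name, is_valid(USERNAME, name), is_valid(USERNAME_3_4, name))
```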

Regex to match an unlimited repeating pattern between two strings

I have a dataset with repeating pattern in the middle:
YM10a15b5c27
and
YM1b5c17
How can I get what is between "YM" and the last two numbers?
I'm using this, but it captures only one digit at the end, which is not what I want.
/([A-Z]+)([0-9a-z]+)([0-9]+)/
Capture exactly two characters in the last group:
/([A-Z]+)([0-9a-z]+)([0-9]{2})/
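A short Python sketch of this fix on the two sample strings from the question (the names PATTERN and split_code are hypothetical). The greedy middle group first grabs everything, then backtracks just far enough to leave two digits for the last group.

```python
import re

# Same pattern as above, without the surrounding / delimiters.
PATTERN = re.compile(r"([A-Z]+)([0-9a-z]+)([0-9]{2})")


def split_code(code):
    """Return (prefix, middle, last two digits), or None if there is no match."""
    m = PATTERN.search(code)
    return m.groups() if m else None


if __name__ == "__main__":
    for code in ["YM10a15b5c27", "YM1b5c17"]:
        print(code, split_code(code))
```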
You should use:
/^(?:([a-z]+))([0-9a-z]+)(?=\1)/
^ matches the start of the string. This is really important, because if your code is aaaa1234aaaa, then without the ^, it would also match the aaaa at the end.
(?:([a-z]+)) is a non-capturing group wrapping a capturing group that stores one or more letters from 'a' to 'z' as group 1
(?=\1) tells the regex to match the text only as long as it is followed by the same code that appears at the start.
All you have to do is extract the code with group(2)
An example is shown here.
Solution
If you want to match these strings as whole words, use \b(([a-z])\2)([0-9a-z]+)(\1)\b. If you need to match them as separate strings, use ^(([a-z])\2)([0-9a-z]+)(\1)$.
Explanation
\b - a word boundary (or if ^ is used, start of string)
(([a-z])\2) - Group 1: any lowercase ASCII letter, exactly two occurrences (aa, bb, etc.)
([0-9a-z]+) - Group 3: 1 or more digits or lowercase ASCII letters
(\1) - Group 4: the same text as stored in Group 1
\b - a word boundary (or if $ is used, end of string).
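The backreference behaviour described above can be illustrated with a small Python sketch. The sample strings are made up for illustration (they are not from the question), and the names PATTERN and middle_part are hypothetical.

```python
import re

# Anchored variant of the pattern above: a doubled letter, a middle part,
# and the same doubled letter again at the end.
PATTERN = re.compile(r"^(([a-z])\2)([0-9a-z]+)(\1)$")


def middle_part(text):
    """Return Group 3 (the text between the two identical codes), or None."""
    m = PATTERN.match(text)
    return m.group(3) if m else None


if __name__ == "__main__":
    for sample in ["aa123aa", "aa123bb", "ab123ab"]:
        print(sample, middle_part(sample))
```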

extract substring with regular expression

I have a string that is actually a file path.
str='\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3'
I need to extract the target substring 'UA0001A' with MATLAB (though I would think all tools share the same syntax).
It does not necessarily have to be exactly 'UA0001A'; it is an arbitrary combination of letters and digits.
To make it more general, I would like the substring (or the word) to satisfy the following:
it is a word made up of letters and digits
it cannot be a purely alphabetic word or a purely numeric word
it cannot include 'midd' or 'midd3' or 'Midd3' or 'MIDD3', etc., so a case-insensitive method may be used to exclude words beginning with 'midd'
it cannot include 'y[0-9]{2,4}m[0-9]{1,2}d[0-9]{1,2}\w*'
How to write the regular expression to find the target substring?
Thanks in advance!
You can use
s = '\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3';
res = regexp(s, '(?i)\\(?![^\W_]*(midd|y\d+m\d+))(?=[^\W_]*\d)(?=[^\W_]*[a-zA-Z])([^\W_]+)','tokens');
disp(res{1}{1})
See the regex demo
Pattern explanation:
(?i) - the case-insensitive modifier
\\ - a literal backslash
(?![^\W_]*(midd|y\d+m\d+)) - a negative lookahead that will fail a match if there are midd or y+digits+m+digits after 0+ letters or digits
(?=[^\W_]*\d) - a positive lookahead that requires at least 1 digit after 0+ digits or letters ([^\W_]*)
(?=[^\W_]*[a-zA-Z]) - there must be at least 1 letter after 0+ letters or digits
([^\W_]+) - Group 1 (what will be extracted) matching 1+ letters or digits (or 1+ characters other than non-word chars and _).
The 'tokens' "mode" will let you extract the captured value rather than the whole match.
See the IDEONE demo
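For readers without MATLAB, the same pattern can be tried in Python as a rough port. One deliberate change, stated as an assumption: the alternation inside the negative lookahead is made non-capturing here so that the target word is Group 1 in Python's numbering. The names PATTERN, PATH and extract_token are hypothetical.

```python
import re

# Port of the pattern above; (midd|...) became (?:midd|...) so that the
# only capturing group is the word we want to extract.
PATTERN = re.compile(
    r"(?i)\\(?![^\W_]*(?:midd|y\d+m\d+))(?=[^\W_]*\d)(?=[^\W_]*[a-zA-Z])([^\W_]+)"
)

# The sample path from the question, as a raw string.
PATH = (
    r"\\198.168.0.10\share\ccdfiles\UA-midd3-files"
    r"\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3"
)


def extract_token(path):
    """Return the first mixed letter/digit word after a backslash, or None."""
    m = PATTERN.search(path)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(extract_token(PATH))
```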
This should get you started:
[\\](?i)(?!.*midd.*)([0-9]+[a-z]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*)
[\\] : match a backslash
(?i) : rest of regex is case insensitive
?! : the following match cannot match this
(?!.*midd.*) : the following match cannot be a word which has any characters, midd, any characters
([0-9]+[a-z]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*) : match at least one number followed by at least one letter followed by any amount of letters and numbers OR at least one letter followed by at least one number followed by any amount of letters and numbers (remember, it cannot match the ?! group, so no word which contains midd)