extract substring with regular expression

I have a string that is actually a file path.
str='\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3'
I need to extract the target substring 'UA0001A' with MATLAB (though I would expect the regex syntax to be much the same in any tool).
It does not have to be exactly 'UA0001A'; it can be any combination of letters and digits.
To make it more general, the substring (or word) should satisfy the following:
it is a word made up of letters and digits
it cannot be made up of letters only or of digits only
it cannot include 'midd', 'midd3', 'Midd3', 'MIDD3', etc., so a case-insensitive match could be used to exclude words beginning with 'midd'
it cannot match 'y[0-9]{2,4}m[0-9]{1,2}d[0-9]{1,2}\w*'
How do I write a regular expression that finds the target substring?
Thanks in advance!

You can use
s = '\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3';
res = regexp(s, '(?i)\\(?![^\W_]*(midd|y\d+m\d+))(?=[^\W_]*\d)(?=[^\W_]*[a-zA-Z])([^\W_]+)','tokens');
disp(res{1}{1})
See the regex demo
Pattern explanation:
(?i) - the case-insensitive modifier
\\ - a literal backslash
(?![^\W_]*(midd|y\d+m\d+)) - a negative lookahead that will fail a match if there are midd or y+digits+m+digits after 0+ letters or digits
(?=[^\W_]*\d) - a positive lookahead that requires at least 1 digit after 0+ digits or letters ([^\W_]*)
(?=[^\W_]*[a-zA-Z]) - there must be at least 1 letter after 0+ letters or digits
([^\W_]+) - Group 1 (what will be extracted) matching 1+ letters or digits (that is, 1+ characters other than non-word chars and _).
The 'tokens' "mode" will let you extract the captured value rather than the whole match.
See the IDEONE demo
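For readers outside MATLAB, the same pattern can be tried in Python. This is only a sketch and not part of the original answer: the function name extract_id is hypothetical, the inline (?i) is replaced by the re.IGNORECASE flag, and the group inside the negative lookahead is made non-capturing so that the target is unambiguously group 1.

```python
import re

# Pattern from the answer above. Assumptions for this sketch: re.IGNORECASE
# replaces the inline (?i), and (?:...) is used inside the negative lookahead
# so that the only capturing group is the target word.
PATTERN = re.compile(
    r"\\(?![^\W_]*(?:midd|y\d+m\d+))(?=[^\W_]*\d)(?=[^\W_]*[a-zA-Z])([^\W_]+)",
    re.IGNORECASE,
)

def extract_id(path):
    """Return the first mixed letter/digit word that follows a backslash, or None."""
    match = PATTERN.search(path)
    return match.group(1) if match else None

s = r"\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3"
print(extract_id(s))  # UA0001A
```

The earlier path components are rejected because they contain only digits (198), only letters (share, ccdfiles), or stop at a hyphen before any digit is seen (UA-midd3-files).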

this should get you started:
[\\](?i)(?!.*midd.*)([0-9]+[a-z]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*)
[\\] : match a backslash
(?i) : rest of regex is case insensitive
(?! ... ) : negative lookahead, what follows can not match its contents
(?!.*midd.*) : what follows can not be any characters, then midd, then any characters
([0-9]+[a-z]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*) : match at least one digit followed by at least one letter, OR at least one letter followed by at least one digit, in both cases followed by any number of letters and digits (remember, the ?! group must not match, so no word which contains midd)

Related

Regex matching of character class and special conditions on certain other conditions

I want to match a section of a string that contains certain repeated characters, along with certain other characters only when a given criterion is met. For instance, matching characters a-z contained in angle brackets, and digits only if they are preceded by a plus.
Matching <abcde> to abcde.
<abcde1> should not match anything.
Matching <abcde+1> to abcde+1
Matching <abcde+1asd+2+3+4as> to abcde+1asd+2+3+4as
<abcde+> should not match anything.
The regex I've tried is <([a-z]|(\+(?=[0-9])|[0-9](?<=[\+])))*>.

You can use
(?<=<)(?:[a-zA-Z]+(?:\+\d+)*)+[a-zA-Z]*(?=>)
<((?:[a-zA-Z]+(?:\+\d+)*)+[a-zA-Z]*)>
See the regex demo. Details:
(?<=<) - a positive lookbehind that requires a < char immediately on the left
(?:[a-zA-Z]+(?:\+\d+)*)+ - one or more occurrences of
[a-zA-Z]+ - one or more letters
(?:\+\d+)* - zero or more sequences of + and one or more digits
[a-zA-Z]* - zero or more ASCII letters
(?=>) - a positive lookahead that requires a > char immediately on the right.
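The capturing-group variant can be checked against the five examples from the question with a short Python sketch. The function name bracket_content is hypothetical and only serves the illustration.

```python
import re

# Capturing-group variant from the answer: the angle brackets are matched,
# and the content between them is captured in group 1.
PATTERN = re.compile(r"<((?:[a-zA-Z]+(?:\+\d+)*)+[a-zA-Z]*)>")

def bracket_content(text):
    """Return the content of the first valid <...> section, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

samples = ["<abcde>", "<abcde1>", "<abcde+1>", "<abcde+1asd+2+3+4as>", "<abcde+>"]
for sample in samples:
    print(sample, "->", bracket_content(sample))
```

A bare digit or a trailing + leaves a character that no part of the pattern can consume before the closing >, which is why <abcde1> and <abcde+> yield None.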

Find certain colons in string using Regex

I'm trying to search for colons in a given string so as to split the string at the colon for preprocessing based on the following conditions
Preceded or followed by a word, e.g. A Book: Chapter 1 or A Book :Chapter 1
Do not match if it is part of emoticons i.e :( or ): or :/ or :-) etc
Do not match if it is part of a given time i.e 16:00 etc
I've come up with a regex as such
(\:)(?=\w)|(?<=\w)(\:)
which satisfies conditions 1 & 2 but still fails on condition 3, as it matches the colon in the string representation of a time. How do I fix this?
edit: it has to be in a single regex statement if possible

You can use
(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)
See the regex demo. Details:
(:\b|\b:) - Group 1: a : that is either preceded or followed with a word char
(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b) - there should be no one or two digits right after : (followed with a word boundary) if the : is preceded with a single or two digits (preceded with a word boundary).
Note :\b is equal to :(?=\w) and \b: is equal to (?<=\w):.
If you need to get the same capturing groups as in your original pattern, replace (:\b|\b:) with (?:(:)\b|\b(:)).
More flexible solution
Note that excluding matches can be done with a simpler pattern that matches and captures what you need and only matches (without capturing) what you do not need. This is known as the "best regex trick ever". So, you may use a regex like
8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)
that will match 8:, :P, :D, one or more digits and then one or more sequences of : and one or more digits, or will match and capture into Group 1 a : char that is either preceded or followed with a word char. All you need to do is to check if Group 1 matched, and implement required extraction/replacement logic in the code.
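The "check whether Group 1 matched" step could look like the following Python sketch. The function name split_colon_positions is hypothetical, and the emoticon list is limited to the alternatives already present in the pattern above.

```python
import re

# Unwanted contexts (8:, :P, :D, times such as 16:00) are matched but not
# captured; only a colon next to a word character lands in group 1.
PATTERN = re.compile(r"8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)")

def split_colon_positions(text):
    """Return the indices of colons that are safe to split on."""
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

print(split_colon_positions("A Book: Chapter 1"))           # [6]
print(split_colon_positions("Title: see you at 16:00 :D"))  # [5]
```

Because alternatives are tried from left to right at each position, the time and emoticon branches consume their colons before the capturing branch gets a chance to see them.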

Word characters \w include digits: \w is [a-zA-Z0-9_]
So just use [a-zA-Z] instead
(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)
Test Here

Regex To Validate A String, But The String Can't Contain n Number Of A Specific Character

Recently I ran into a validation situation I've been trying to solve with regex. The rules are as such:
Must start with a capital letter
Center of the string may be of any length
Center of the string may have any combination of upper and lower case letters and numbers
Center of the string may have up to one underscore
Must end with a number
I have attempted to match this string with the following regex:
^(?!_{2,})([A-Z][a-zA-Z0-9_]*[0-9])$
and
^(?<=_{0,1})([A-Z][a-zA-Z0-9_]*[0-9])$
Both of these attempts still match cases where more than one underscore is present, e.g. App_l_e9 or App__le9.
How can you check that the matched part, i.e. ([A-Z][a-zA-Z0-9_]*[0-9]), contains zero or one underscore anywhere within the middle of the string?

The simplest approach would probably be this
^[A-Z][a-zA-Z0-9]*_?[a-zA-Z0-9]*[0-9]$
Explanation:
^[A-Z] Must start with an uppercase letter
[a-zA-Z0-9]* A combination of uppercase and lowercase letters and numbers of any length (also 0-length)
_? Either zero or one underscore character
[a-zA-Z0-9]* Again A combination of uppercase and lowercase letters and numbers of any length (also 0-length)
[0-9]$ Must end with a number
This will accept A_9 or AA0_xY8 but for instance not aXY_34 or Aasf1__asdf5
If the underscore must not be the first or last character of the middle part, you can replace each * with a + like this:
^[A-Z][a-zA-Z0-9]+_?[a-zA-Z0-9]+[0-9]$
This no longer accepts, for instance, A_9; the shortest accepted string with an underscore looks like Ax_d9.
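As a quick check of the first pattern, here is a Python sketch run against the examples mentioned in the question and in this answer. The name is_valid is hypothetical, and re.fullmatch is used in place of the ^ and $ anchors.

```python
import re

# First pattern from the answer; fullmatch() anchors it at both ends,
# so the explicit ^ and $ are dropped in this sketch.
ONE_UNDERSCORE = re.compile(r"[A-Z][a-zA-Z0-9]*_?[a-zA-Z0-9]*[0-9]")

def is_valid(name):
    """True if name starts with A-Z, ends with a digit, and has at most one underscore."""
    return ONE_UNDERSCORE.fullmatch(name) is not None

for name in ["A_9", "AA0_xY8", "aXY_34", "Aasf1__asdf5", "App_l_e9", "App__le9"]:
    print(name, is_valid(name))
```

The single _? sits between two underscore-free character classes, so a second underscore anywhere in the string has nothing left to match it.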

You might also start the match with an uppercase A-Z and immediately check that the string ends with a number 0-9 using a positive lookahead to prevent catastrophic backtracking.
^[A-Z](?=.*[0-9]$)[a-zA-Z0-9]*_?[a-zA-Z0-9]*$
^ Start of string
[A-Z] Match an uppercase char A-Z
(?=.*[0-9]$) Positive lookahead to assert a digit 0-9 at the end of the string
[a-zA-Z0-9]* Optionally match any of the listed
_? Match an optional _
[a-zA-Z0-9]* Optionally match any of the listed
$ End of string
Regex demo
Or with an optional group
^[A-Z](?=.*[0-9]$)[a-zA-Z0-9]*(?:_[a-zA-Z0-9]*)?$
Regex demo

Negating a complex regex containing three parts

I need a regex which is matched when the string doesn't have both lowercase and uppercase letters.
If the string has only lowercase letters -> should be matched
If the string has only uppercase letters -> should be matched
If the string has only digits or special characters -> should be matched
For example
abc, ABC, 123, abc123, ABC123&^ - should match
AbC, A12b, AB^%12c - should not match
Basically I need an inverse/negation of the following regex:
^(?=.*[a-z])(?=.*[A-Z]).+$

Does not sound like any lookarounds would be needed.
Either match only characters that are not a-z or only characters, that are not A-Z.
^(?:[^a-z]+|[^A-Z]+)$
See this demo at regex101 (used + for one or more)
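The examples from the question can be run through this pattern with a small Python sketch. The function name has_single_case is hypothetical, and re.fullmatch stands in for the ^ and $ anchors.

```python
import re

# Either no lowercase letters at all, or no uppercase letters at all.
NOT_MIXED_CASE = re.compile(r"(?:[^a-z]+|[^A-Z]+)")

def has_single_case(text):
    """True if text does not contain both lowercase and uppercase ASCII letters."""
    return NOT_MIXED_CASE.fullmatch(text) is not None

for text in ["abc", "ABC", "123", "abc123", "ABC123&^", "AbC", "A12b", "AB^%12c"]:
    print(text, has_single_case(text))
```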

You may use
^(?!.*[A-Z].*[a-z])(?!.*[a-z].*[A-Z])\S+$
Or
^(?=(?:[^a-z]+|[^A-Z]+)$).*$
See the regex demo #1 and regex demo #2
A lookaround solution like this can be used in more complex scenarios, when you need to apply more restrictions on the pattern. Else, consider a non-lookaround solution.
Details
^ - start of string
(?!.*[A-Z].*[a-z]) - no uppercase followed with a lowercase letter
(?!.*[a-z].*[A-Z]) - no lowercase letter followed with an uppercase one
(?=(?:[^a-z]+|[^A-Z]+)$) - a positive lookahead that requires 1 or more characters other than lowercase ASCII letters ([^a-z]+) to the end of the string, or 1 or more characters other than uppercase ASCII letters ([^A-Z]+) to the end of the string
\S+ - 1+ non-whitespace chars (first pattern), or .* - 0+ chars other than line break chars (second pattern)
$ - end of string.

You can use this regex
^(([A-Z0-9?&%^](?![a-z]))+|([a-z0-9?&%^](?![A-Z]))+)$
You can test more cases here.
I've only added the characters ?&%^ as possible special characters, but you could add whichever you like.

I would go with:
^(?:[^a-z]+?|[^A-Z]+?)$
It translates to "If the entire string is composed of non-lowercase letters or non-uppercase letters then match the string."
Lazy quantifiers +? are used so that the end-of-string $ anchor is obeyed when the multiline flag is enabled. If you're only validating a single-line string then you can simply use + without the question mark.
If you have a whitelist of specific allowed special chars then change [^a-z] into [A-Z0-9()_+=-] and [^A-Z] into [a-z0-9()_+=-], listing the allowed special chars.
https://regex101.com/r/Wg6tLn/1

Regex to match a unlimited repeating pattern between two strings

I have a dataset with repeating pattern in the middle:
YM10a15b5c27
and
YM1b5c17
How can I get what is between "YM" and the last two numbers?
I'm using the following, but it captures only one digit at the end, which is not what I want.
/([A-Z]+)([0-9a-z]+)([0-9]+)/

Capture exactly two characters in the last group:
/([A-Z]+)([0-9a-z]+)([0-9]{2})/
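The effect of {2} can be seen in a short Python sketch applied to the two sample values from the question. The function name split_code is hypothetical.

```python
import re

# Group 2 is greedy and first swallows everything after the capital letters;
# the engine then backtracks exactly two characters so group 3 can match.
PATTERN = re.compile(r"([A-Z]+)([0-9a-z]+)([0-9]{2})")

def split_code(code):
    """Return (prefix, middle, last_two_digits), or None if there is no match."""
    match = PATTERN.search(code)
    return match.groups() if match else None

print(split_code("YM10a15b5c27"))  # ('YM', '10a15b5c', '27')
print(split_code("YM1b5c17"))      # ('YM', '1b5c', '17')
```

With the original ([0-9]+), the greedy middle group gives back only a single character, which is why just one digit ended up in the last group.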

You should use:
/^(?:([a-z]+))([0-9a-z]+)(?=\1)/
^ matches the start of the string. This is really important, because if your code is aaaa1234aaaa, then without the ^, the regex could also match the aaaa at the end.
(?:([a-z]+)) is a non-capturing group wrapping a capturing group that takes one or more letters from 'a' to 'z' as group 1
(?=\1) tells the regex to match the text only as long as it is followed by the same code that appears at the start.
All you have to do is extract the code with group(2)
An example is shown here.

Solution
If you want to match these strings as whole words, use \b(([a-z])\2)([0-9a-z]+)(\1)\b. If you need to match them as separate strings, use ^(([a-z])\2)([0-9a-z]+)(\1)$.
Explanation
\b - a word boundary (or if ^ is used, start of string)
(([a-z])\2) - Group 1: any lowercase ASCII letter, exactly two occurrences (aa, bb, etc.)
([0-9a-z]+) - Group 3: 1 or more digits or lowercase ASCII letters
(\1) - Group 4: the same text as stored in Group 1
\b - a word boundary (or if $ is used, end of string).
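The anchored variant with backreferences can be tried in Python as follows. This is a sketch only: the function name middle_part and the sample inputs are made up for illustration.

```python
import re

# Group 1: a doubled lowercase letter (via \2); group 3: the middle part;
# group 4: the same text as group 1 again, at the end of the string.
PATTERN = re.compile(r"^(([a-z])\2)([0-9a-z]+)(\1)$")

def middle_part(text):
    """Return the text between the doubled-letter delimiters, or None."""
    match = PATTERN.match(text)
    return match.group(3) if match else None

print(middle_part("aa1234aa"))  # 1234
print(middle_part("aa1234bb"))  # None
```

The middle group is greedy and initially consumes the closing delimiter as well; backtracking then releases just enough characters for \1 to match at the end.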
