I want to validate strings that have the form:
One underscore _, a group of letters in a, b, c in alphabetical order and another underscore _.
Examples of valid strings are _a_, _b_, _ac_, _abc_.
I can achieve the correct validation for most cases using the regex _a?b?c?_, but that is still matched by __, which I don't want to consider valid. How can I adapt this regex so that among my zero or one characters a?b?c?, at least one of them must be present?
You can add a (?!_) lookahead after the first _:
_(?!_)a?b?c?_
Details:
_ - an underscore
(?!_) - the next char cannot be a _
a? - an optional a
b? - an optional b
c? - an optional c
_ - an underscore.
See the regex demo.
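As a quick check of the pattern above, here is a minimal Python sketch using the standard `re` module. The helper name `is_valid` and the sample list are illustrative assumptions, not part of the original answer; `re.fullmatch` is used so that the whole string has to match.

```python
import re

# Pattern from the answer: the (?!_) lookahead rejects "__".
PATTERN = re.compile(r"_(?!_)a?b?c?_")

def is_valid(s: str) -> bool:
    """Hypothetical helper: True if the entire string matches."""
    return PATTERN.fullmatch(s) is not None

samples = ["_a_", "_b_", "_ac_", "_abc_", "__", "_ca_", "_ba_"]
for s in samples:
    print(s, is_valid(s))
```

The last three samples are rejected: `__` has no letter, and `_ca_` and `_ba_` are out of alphabetical order.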
You may use this regex with a positive lookahead:
_(?=[abc])a?b?c?_
RegEx Demo
RegEx Details:
_: Match a _
(?=[abc]): Positive lookahead to assert that there is a letter a or b or c
a?b?c?: Match an optional a, followed by an optional b, followed by an optional c
_: Match a _
PS: Positive lookahead assertions are usually more efficient than negative lookaheads (as evident from the steps taken on the regex101 website).
Thought I'd chip in an alternative if one uses Python for this, with PyPI's regex module, which supports what is called approximate "fuzzy" matching:
^_(abc){d<=2}_$
The pattern means:
^_ - Match start-line anchor and leading underscore;
(abc){d<=2} - Match 'abc' in order and allow for up to just two deletions;
_$ - Match trailing underscore and end-line anchor.
import regex as re
l = ["_a_", "_b_", "_ac_", "_abc_", "", "__", "_ca_"]
print([bool(re.search(r'^_(abc){d<=2}_$', s)) for s in l])
Prints:
[True, True, True, True, False, False, False]
Another option if a lookbehind is supported is looking back after the match, asserting not __
_a?b?c?_(?<!__)
Explanation
_ Match literally
a?b?c? Match an optional a, b and c, in that order
_ Match literally
(?<!__) Negative lookbehind, assert not __ directly to the left
Regex demo
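The lookbehind variant can also be tried in Python, whose standard `re` module accepts fixed-width lookbehinds such as `(?<!__)`. This is only a sketch; the helper name `is_valid_lookbehind` is a made-up name for illustration.

```python
import re

# Same pattern as above: match first, then look back and reject "__".
LOOKBEHIND = re.compile(r"_a?b?c?_(?<!__)")

def is_valid_lookbehind(s: str) -> bool:
    """Hypothetical helper: True if the entire string matches."""
    return LOOKBEHIND.fullmatch(s) is not None

for s in ["_a_", "_bc_", "_abc_", "__", "_cb_"]:
    print(s, is_valid_lookbehind(s))
```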
If supported, you can use (*SKIP)(*FAIL) to get __ out of the way:
__(*SKIP)(*FAIL)|_a?b?c?_
Regex demo
Related
How can I make sure that part of the pattern (a keyword in this case) is present in the match, given that it can appear in different places? I want a match only when it occurs at least once.
Regex:
\b(([0-9])(xyz)?([-]([0-9])(xyz)?)?)\b
We only want the value if there is a keyword: xyz
Examples:
1. 1xyz-2xyz - it's OK
2. 1-2xyz - it's OK
3. 1xyz - it's OK
4. 1-2 - there should be no match, at least one xyz missing
I tried a positive lookahead and lookbehind but this is not working in this case.
You can make use of a conditional construct:
\b([0-9])(xyz)?(?:-([0-9])(xyz)?)?\b(?(2)|(?(4)|(?!)))
See the regex demo. Details:
\b - word boundary
([0-9]) - Group 1: a digit
(xyz)? - Group 2: an optional xyz string
(?:-([0-9])(xyz)?)? - an optional sequence of a -, a digit (Group 3) and an optional xyz char sequence (Group 4)
\b - word boundary
(?(2)|(?(4)|(?!))) - a conditional: if Group 2 (first (xyz)?) matched, it is fine, return the match, if not, check if Group 4 (second (xyz)?) matched, and return the match if yes, else, fail the match.
See the Python demo:
import re
text = "1. 1xyz-2xyz - it's OK\n2. 1-2xyz - it's OK\n3. 1xyz - it's OK\n4. 1-2 - there should be no match"
pattern = r"\b([0-9])(xyz)?(?:-([0-9])(xyz)?)?\b(?(2)|(?(4)|(?!)))"
print( [x.group() for x in re.finditer(pattern, text)] )
Output:
['1xyz-2xyz', '1-2xyz', '1xyz']
Indeed you could use a lookahead in the following way:
\b\d(?:xyz|(?=-\dxyz))(?:-\d(?:xyz)?)?\b
See this demo at regex101 (or using ^ start and $ end)
The first part matches either an xyz OR (if there is none) the lookahead ensures that the xyz occurs in the second, optional part. The second part is dependent on the previous condition.
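To see the lookahead version at work, here is a small Python sketch with the standard `re` module. The sample text is an assumption built from the four cases in the question.

```python
import re

# Either xyz follows the first digit, or the lookahead guarantees
# that the optional second part carries it.
PATTERN = re.compile(r"\b\d(?:xyz|(?=-\dxyz))(?:-\d(?:xyz)?)?\b")

text = "1xyz-2xyz 1-2xyz 1xyz 1-2"
matches = PATTERN.findall(text)  # no capturing groups, so full matches
print(matches)
```

The plain `1-2` is skipped because neither branch of the alternation can succeed after the first digit.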
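Try this: \b(([0-9])?(xyz)+([-]([0-9])+(xyz)+)?)\b
That is, replace ? with + after each xyz.
? means zero or one, whereas in your case you want to match one or more, which is +.
Note that this requires xyz in every part, so for 1-2xyz it matches only 2xyz.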
How about something as basic as this
(\dxyz-\dxyz|\dxyz-\d|\d-\dxyz|\dxyz)
You can add word boundary if needed
\b(\dxyz-\dxyz|\dxyz-\d|\d-\dxyz|\dxyz)\b
Just an OR
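A short Python sketch of the alternation approach, for comparison. The helper name `find_values` is hypothetical. Note that the second alternative also accepts a form the question did not list, `1xyz-2`.

```python
import re

# Every allowed combination is spelled out explicitly.
ALTERNATION = re.compile(r"\b(\dxyz-\dxyz|\dxyz-\d|\d-\dxyz|\dxyz)\b")

def find_values(text: str) -> list[str]:
    """Hypothetical helper: return every full match found in text."""
    return [m.group(0) for m in ALTERNATION.finditer(text)]

print(find_values("1xyz-2xyz 1-2xyz 1xyz 1-2"))
print(find_values("1xyz-2"))
```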
I want to match the numeric values of the last \b tag, or empty string if there's no number.
Here are the strings at left, and what I want to match at right :
\b Empty string
\b\b Empty string
\b1\b Empty string
\b\b1 1
\b\b\i1 Empty string
\b1\blur0 1
\b2\b10anana 10
\b2\b1 0anana 1
\b2\bbanana10 Empty string
\b tags without a numeric value should return an empty string
In the ASS language, the only other tags starting with the letter b are \blur, \bord and \be
There can be spaces between \b and the number but not after the number, hence why \b2\b1 0anana should give "1"
Obviously bbanana is not a real ASS tag, but ASS libraries consider tags starting with \b as the \b tag as long as it's not blur/bord/be.
Also, an important detail: the \b tag can be written "\ b" or "\b " (with any number of spaces)
I use a regex module that works like the .NET (C#) flavor in regex101
I'm currently stuck with this regex: https://regex101.com/r/0TZeSn/1
(?<=\\\s*b\s*)\d+(?!(?=\\b))|(?<=\\b)(?!.*\\b).*?(?=\w|\\)
It's hard to come by
For a match only without empty matches, you might use:
(?<=\\ *b *)(?!\S*\\b *(?!lur|ord|e))\d+
Explanation
(?<=\\ *b *) Positive lookbehind, assert \b with optional spaces in between to the left
(?! Negative lookahead, assert not to the right
\S*\\b *(?!lur|ord|e) Match optional non whitespace chars followed by \b that is not directly followed by lur ord or e
) Close lookahead
\d+ Match 1+ digits
See a regex demo.
To also get empty matches, you can match optional digits and assert not lur ord or e directly after the current position and also not directly after matching \b
(?<=\\ *b *)(?!lur|ord|e|\S*\\b *(?!lur|ord|e))\d*
See another regex demo.
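The patterns above rely on variable-length lookbehind, which Python's standard `re` module does not support. As a different technique (not the answer's method), the sketch below captures the digits after every \b tag and keeps the last one. The name `last_b_value` and the exact tag pattern are assumptions made for illustration.

```python
import re

# A \b tag: backslash, optional spaces, "b" not followed by lur/ord/e,
# optional spaces, then the leading digits (possibly none) as group 1.
B_TAG = re.compile(r"\\\s*b(?!\s*(?:lur|ord|e))\s*(\d*)")

def last_b_value(line: str) -> str:
    """Hypothetical helper: digits of the last \\b tag, or ''."""
    values = B_TAG.findall(line)
    return values[-1] if values else ""

cases = [r"\b", r"\b\b", r"\b1\b", r"\b\b1", r"\b\b\i1",
         r"\b1\blur0", r"\b2\b10anana", r"\b2\b1 0anana", r"\b2\bbanana10"]
for case in cases:
    print(repr(case), "->", repr(last_b_value(case)))
```

Doing the "last tag" selection in code rather than in the pattern keeps the regex itself free of lookbehinds.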
How can I get only the middle part of a combined name with PCRE regex?
name: 211103_TV_storyname_TYPE
result: storyname
I have used this single line: .(\d)+.(_TV_) to remove the first part: 211103_TV_
Another idea is to use (_TYPE)$, but the problem is that not all variations of the names contain a space that separates a second word, so I can't use ^ for the first word and $ for the second.
The _TYPE and TV parts of the combined name are fixed.
The numbers are changing according to the date. And the storyname is variable.
Any ideas?
Thanks
With your shown samples, please try following regex, this creates one capturing group which contains matched values in it.
.*?_TV_([^_]*)(?=_TYPE)
OR (adding a small variation of the above solution with the fourth bird's nice suggestion), the following is without the lazy match .*?, unlike above:
_TV_([^_]*)(?=_TYPE)
Here is the Online demo for above regex
Explanation: Adding detailed explanation for above.
.*?_ ##Using Lazy match to match till 1st occurrence of _ here.
TV_ ##Matching TV_ here.
([^_]*) ##Creating 1st capturing group which has everything before next occurrence of _ here.
(?=_TYPE) ##Making sure previous values are followed by _TYPE here.
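For illustration, the capture-group version can be exercised in Python like this (the question targets PCRE, so this is only a sketch; the variable names are assumptions). Because of `[^_]*`, a story name that itself contains an underscore would not match.

```python
import re

name = "211103_TV_storyname_TYPE"

# Group 1 holds everything between _TV_ and the following _TYPE.
match = re.search(r"_TV_([^_]*)(?=_TYPE)", name)
story = match.group(1) if match else None
print(story)
```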
You could match as few chars as possible after _TV_ until you match _TYPE
\d_TV_\K.*?(?=_TYPE)
\d_TV_ Match a digit and _TV_
\K Forget what is matched until now
.*? Match as few characters as possible
(?=_TYPE) Assert _TYPE to the right
Regex demo
Another option without a non greedy quantifier, and leaving out the digit at the start:
_TV_\K[^_]*+(?>_(?!TYPE)[^_]*)*(?=_TYPE)
_TV_ Match literally
\K[^_]*+ Forget what is matched until now and optionally match any char except _
(?>_(?!TYPE)[^_]*)* Only allow matching _ when not directly followed by TYPE
(?=_TYPE) Assert _TYPE to the right
Regex demo
Edit
If you want to replace the 2 parts, you can use an alternation and replace with an empty string.
If it should be at the start and the end of the string, you can prepend ^ and append $ to the pattern.
\b\d{6}_TV_|_TYPE\b
\b\d{6}_TV_ A word boundary, match 6 digits and _TV_
| Or
_TYPE\b Match _TYPE followed by a word boundary
Regex demo
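The replace-based variant could look like this in Python. It is a sketch under the assumption that the tool's replace behaves like `re.sub` with an empty replacement; `strip_affixes` is a hypothetical helper name.

```python
import re

# Remove the leading date + _TV_ and the trailing _TYPE in one pass.
AFFIXES = re.compile(r"\b\d{6}_TV_|_TYPE\b")

def strip_affixes(name: str) -> str:
    """Hypothetical helper: drop both fixed parts, keep the story name."""
    return AFFIXES.sub("", name)

print(strip_affixes("211103_TV_storyname_TYPE"))
print(strip_affixes("211103_TV_my_story_TYPE"))
```

Unlike the `[^_]*` capture, this approach also keeps story names that contain underscores.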
Here I have added some screenshots to the post, along with the documentation that appears behind the help button, so you can see the forms and what I see.
Documentation
The regular expressions we use are based on PCRE - Perl Compatible Regular Expressions. Full specification can be found here: http://www.pcere.org and http://perldoc.perl.org/perlre.html
Summary of some useful terms:
Metacharacters
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
Quantifiers
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Character Classes
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
Capture buffers
The bracketing construct (...) creates capture buffers. To refer to the current contents of a buffer later on, within the same pattern, use \1 for the first, \2 for the second, and so on. Outside the match use "$" instead of "\". The \<digit> notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.
Referring back to another part of the match is called a backreference.
Examples
Replace story with certain prefix letters M N or E to have the prefix "AA":
`srcPattern "(M|N|E ) ([A-Za-z0-9\s]*)"`
`trgPattern "AA$2" `
`"N StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"E StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"M StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
"NoMatchWord StoryWord1 StoryWord2" -> "NoMatchWord StoryWord1 StoryWord2" (no match found, name remains the same)
The lines to match against are
part1a_part1b__part1c_part1d_part3.extension
part1a_part1b__part1c_part1d__part3.extension
part1a_part1b__part1c_part1d_part2short_part3.extension
part1a_part1b__part1c_part1d_part2short__part3.extension
part1a_part1b__part1c_part1d_part2_part3.extension
part1a_part1b__part1c_part1d_part2__part3.extension
part1a_part1b__part1c_part1d_part2full_part3.extension
part1a_part1b__part1c_part1d_part2full__part3.extension
part1a_part1b__part1c_part1d_part2short-part3.extension
part1a_part1b__part1c_part1d_part2-part3.extension
part1a_part1b__part1c_part1d_part2full-part3.extension
part1a_part1b__part1c_part1d_part4.extension
part1a_part1b__part1c_part1d__part4.extension
The desired match should give exactly part1a_part1b__part1c_part1d for all the above lines except the last two lines. That is to say, the "stem" has an arbitrary number of part1, an optional part2 (in limited forms), and must end with part3.extension.
Right now, I only got as far as
(?P<stem>[[:alnum:]_-]+)(?=(|part2short|part2|part2full))[_-]+part3\.extension
by which the matched "stem" values for the lines above are
part1a_part1b__part1c_part1d
part1a_part1b__part1c_part1d_
part1a_part1b__part1c_part1d_part2short
part1a_part1b__part1c_part1d_part2short_
part1a_part1b__part1c_part1d_part2
part1a_part1b__part1c_part1d_part2_
part1a_part1b__part1c_part1d_part2full
part1a_part1b__part1c_part1d_part2full_
part1a_part1b__part1c_part1d_part2short
part1a_part1b__part1c_part1d_part2
part1a_part1b__part1c_part1d_part2full
Could you help explain how to match exactly part1a_part1b__part1c_part1d from all the above lines except the last two, if that is possible?
You may use this regex using a non-greedy match, a lookahead with an optional match:
(?m)^(?P<stem>[[:alnum:]_-]+?)(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$)
RegEx Demo
(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$) is a positive lookahead that asserts line ends with [-_]part3.extension with optional [-_]part2... string before.
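A Python sketch of this approach follows. Python's `re` has no POSIX `[[:alnum:]]` class, so the sketch substitutes the ASCII range `[A-Za-z0-9]`, which is an assumption about the intended character set; `stem_of` is a hypothetical helper name.

```python
import re

# Lazy stem, then a lookahead for an optional part2 variant followed
# by part3.extension at the end of the line.
STEM = re.compile(
    r"^(?P<stem>[A-Za-z0-9_-]+?)"
    r"(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$)",
    re.MULTILINE,
)

def stem_of(line: str):
    """Hypothetical helper: the stem, or None when the line is rejected."""
    m = STEM.search(line)
    return m.group("stem") if m else None

print(stem_of("part1a_part1b__part1c_part1d_part2short__part3.extension"))
print(stem_of("part1a_part1b__part1c_part1d_part4.extension"))
```

The lazy quantifier matters here: a greedy stem would swallow the optional part2 segment before the lookahead gets a chance to claim it.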
You could match the first 4 parts with the text and the underscores and use a positive lookahead that asserts that the string ends with part3.extension:
^(?P<stem>[^_]+_[^_]+__[^_]+_[^_]+)(?=.*part3\.extension$)
That would match:
^ # Begin of the string
(?P<stem> # Named captured group stem
[^_]+_ # Match not _ one or more times, then _
[^_]+__ # Match not _ one or more times, then __
[^_]+_ # Match not _ one or more times, then _
[^_]+ # Match not _ one or more times
) # Close named capturing group
(?= # A positive lookahead that asserts what follows
.*part3\.extension$ # Match part3.extension at the end of the string
) # Close lookahead
Is there an NOT operator in Regexes?
Like in that string : "(2001) (asdf) (dasd1123_asd 21.01.2011 zqge)(dzqge) name (20019)"
I want to delete all \([0-9a-zA-z _\.\-:]*\) but not the one where it is a year: (2001).
So what the regex should return must be: (2001) name.
NOTE: something like \((?![\d]){4}[0-9a-zA-z _\.\-:]*\) does not work for me (the (20019) somehow also matches...)
Not quite, although you can usually use a workaround based on one of these forms:
[^abc], which is character by character not a or b or c,
or negative lookahead: a(?!b), which is a not followed by b
or negative lookbehind: (?<!a)b, which is b not preceded by a
No, there's no direct not operator. At least not the way you hope for.
You can use a zero-width negative lookahead, however:
\((?!2001)[0-9a-zA-z _\.\-:]*\)
The (?!...) part means "only match if the text following (hence: lookahead) this doesn't (hence: negative) match this". But it doesn't actually consume the characters it matches (hence: zero-width).
There are actually 4 combinations of lookarounds with 2 axes:
lookbehind / lookahead : specifies if the characters before or after the point are considered
positive / negative : specifies if the characters must match or must not match.
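As an illustration of the negative lookahead, the Python sketch below deletes every parenthesised group except a four-digit year. Generalising `(?!2001)` to `(?!\d{4}\))` and tidying the leftover whitespace are assumptions added here, not part of the answer; `keep_year_only` is a hypothetical helper name.

```python
import re

# Match "(...)" unless the content is exactly four digits.
NOT_A_YEAR = re.compile(r"\((?!\d{4}\))[0-9a-zA-Z _.\-:]*\)")

def keep_year_only(text: str) -> str:
    """Hypothetical helper: drop non-year groups, collapse whitespace."""
    stripped = NOT_A_YEAR.sub("", text)
    return " ".join(stripped.split())

subject = "(2001) (asdf) (dasd1123_asd 21.01.2011 zqge)(dzqge) name (20019)"
print(keep_year_only(subject))
```

Including the closing `\)` inside the lookahead is what lets `(20019)` be removed while `(2001)` survives.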
You could capture the (2001) part and replace the rest with nothing.
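public class YearExtractor {
    // Keep only the four-digit year captured between parentheses.
    public static String extractYearString(String input) {
        // Backslashes are doubled because this is a Java string literal.
        return input.replaceAll(".*\\(([0-9]{4})\\).*", "$1");
    }
    public static void main(String[] args) {
        var subject = "(2001) (asdf) (dasd1123_asd 21.01.2011 zqge)(dzqge) name (20019)";
        var result = extractYearString(subject);
        System.out.println(result); // <-- "2001"
    }
}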
.*\(([0-9]{4})\).* means
.* match anything
\( match a ( character
( begin capture
[0-9]{4} any single digit four times
) end capture
\) match a ) character
.* anything (rest of string)
Here is an alternative:
(\(\d{4}\))((?:\s*\([0-9a-zA-z _\.\-:]*\))*)([^()]*)(( ?\([0-9a-zA-z _\.\-:]*\))*)
Repetitive patterns are wrapped in a single group with this construction, where the inner group is a non-capturing one: ((?:pattern)*). This gives you control over the group numbers of interest.
Then you get what you want with: \1\3