I have a list of strings. Some of them are of the form 123-...456. The variable portion "..." may be:
the string "apple" followed by a hyphen, e.g. 123-apple-456
the string "banana" followed by a hyphen, e.g. 123-banana-456
a blank string, e.g. 123-456 (note there's only one hyphen)
Any word other than "apple" or "banana" is invalid.
For these three cases, I would like to match "apple", "banana", and "", respectively. Note that I never want to capture the hyphen, but I always want to match it. If the string is not of the form 123-...456 as described above, then there is no match at all.
How do I write a regular expression to do this? Assume I have a flavor that allows lookahead, lookbehind, lookaround, and non-capturing groups.
The key observation here is that when you have either "apple" or "banana", you must also have the trailing hyphen, but you don't want to match it. And when you're matching the blank string, you must not have the trailing hyphen. A regex that encapsulates this assertion will be the right one, I think.
The only way not to capture something is using look-around assertions:
(?<=123-)((apple|banana)(?=-456)|(?=456))
Because even with non-capturing groups (?:…) the whole regular expression captures their matched contents. But this regular expression matches only apple or banana if it’s preceded by 123- and followed by -456, or it matches the empty string if it’s preceded by 123- and followed by 456.
| Lookaround | Name | What it Does |
|---|---|---|
| (?=foo) | Lookahead | Asserts that what immediately FOLLOWS the current position in the string is foo |
| (?<=foo) | Lookbehind | Asserts that what immediately PRECEDES the current position in the string is foo |
| (?!foo) | Negative Lookahead | Asserts that what immediately FOLLOWS the current position in the string is NOT foo |
| (?<!foo) | Negative Lookbehind | Asserts that what immediately PRECEDES the current position in the string is NOT foo |
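As a rough illustration, here is how the lookaround pattern above could be tried in Python. This is only a sketch: it assumes Python's re module (which accepts the fixed-width lookbehind used here), and the helper name variable_part is made up for the example.

```python
import re

# The lookbehind anchors on "123-"; each branch ends in a lookahead, so the
# hyphen and the trailing digits are checked but never become part of the match.
PATTERN = re.compile(r"(?<=123-)((apple|banana)(?=-456)|(?=456))")

def variable_part(s):
    """Return the matched word ("" for the blank case), or None if nothing matches."""
    m = PATTERN.search(s)
    return m.group(0) if m else None

for s in ["123-apple-456", "123-banana-456", "123-456", "123-cherry-456"]:
    print(repr(s), "->", repr(variable_part(s)))
```

Because the whole match (group 0) is already the wanted text, no post-processing of capture groups is needed with this variant.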
In JavaScript, try: /123-(apple(?=-)|banana(?=-)|(?!-))-?456/
Remember that the result is in group 1
Based on the input provided by Germán Rodríguez Herrera
Try:
123-(?:(apple|banana|)-|)456
That will match apple, banana, or a blank string, followed by zero or one hyphen. I was wrong about not having a need for a capturing group. Silly me.
I have modified one of the answers (by #op1ekun):
123-(apple(?=-)|banana(?=-)|(?!-))-?456
The reason is that the answer from #op1ekun also matches "123-apple456", without the hyphen after apple.
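A small Python sketch of this capture-group variant, in case it helps to check the edge cases. It assumes the whole string should be matched (hence fullmatch), and the helper name group_one is hypothetical.

```python
import re

# Group 1 holds "apple", "banana" or "". The lookaheads force a hyphen after
# a word and forbid one after the empty alternative.
PATTERN = re.compile(r"123-(apple(?=-)|banana(?=-)|(?!-))-?456")

def group_one(s):
    """Return group 1 for a full match, or None when the string does not match."""
    m = PATTERN.fullmatch(s)
    return m.group(1) if m else None

for s in ["123-apple-456", "123-456", "123-apple456", "123--456"]:
    print(repr(s), "->", repr(group_one(s)))
```

The last two inputs are the problem cases mentioned above: a word without its hyphen, and a hyphen without a word. Both are rejected.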
Try this:
/\d{3}-(?:(apple|banana)-)?\d{3}/
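One thing to watch with the optional group: in Python, an optional group that did not participate comes back as None rather than "". A minimal sketch, assuming Python's re module and a made-up helper name word_or_blank:

```python
import re

# The word and its hyphen sit together in an optional non-capturing group,
# so the word (group 1) is only captured when the hyphen is present too.
PATTERN = re.compile(r"\d{3}-(?:(apple|banana)-)?\d{3}")

def word_or_blank(s):
    """Return the captured word, "" when the optional part is absent, None on no match."""
    m = PATTERN.fullmatch(s)
    if m is None:
        return None
    # group(1) is None when the optional group was skipped; map that to "".
    return m.group(1) or ""

for s in ["123-apple-456", "123-456", "123-cherry-456"]:
    print(repr(s), "->", repr(word_or_blank(s)))
```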
A variation of the expression by #Gumbo that makes use of \K for resetting match positions to prevent the inclusion of number blocks in the match. Usable in PCRE regex flavours.
123-\K(?:(?:apple|banana)(?=-456)|456\K)
Matches:
Match 1 apple
Match 2 banana
Match 3
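By far the simplest (works for Python) is '123-(apple|banana)-?456'. Note, however, that it requires one of the two words, so it does not match 123-456, and that it also accepts 123-apple456 without the hyphen.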
Related
I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the occurrences of ~match-this~, but when I tried to negate it with (?!(~match-this)), it only catches empty (null) matches.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). And when I tried to add the asterisk wildcard to either negated tilde class, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it catches every match-this, including those which are enclosed by tildes ~.
Is it possible to achieve this with only one regex test, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by match-this
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match-this
) Close group 1
Example
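import re
# The first alternative consumes ~...~ spans so they are skipped; group 1 captures what we keep.
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
# findall returns group 1 per match; it is "" when the first alternative matched, so filter those out.
res = [m for m in re.findall(pattern, s) if m]
print(res)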
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
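To make the "discard what is not captured" idea concrete, here is a short Python sketch. It assumes Python's re module and reuses the sample string from the question; the variable names are only illustrative.

```python
import re

# First alternative: swallow "~match-this~" so it can never be reported.
# Second alternative: capture a free-standing "match-this" in group 1.
PATTERN = re.compile(r"~match-this~|(\bmatch-this\b)")

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# Keep only the matches where group 1 actually participated.
kept_starts = [m.start(1) for m in PATTERN.finditer(s) if m.group(1) is not None]
print(kept_starts)
```

The two occurrences wrapped in tildes are matched by the first alternative, leave group 1 unset, and are therefore filtered out.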
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead end again anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
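The difference between consuming a neighbouring character and merely asserting it can be seen in a few lines of Python. This is only a sketch, assuming Python's re module and the sample string from the question:

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# Negated classes need a real character on each side and consume it.
consuming = re.findall(r"[^~]match-this[^~]", s)

# Lookarounds only inspect the neighbours; three variants cover
# "tilde before only", "tilde after only" and "no tilde at all".
lookaround = re.findall(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)", s
)

print(consuming)
print(len(lookaround))
```

With the negated classes only the occurrence surrounded by spaces is found, while the lookaround version finds all five.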
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
There is a difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and it should also work in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing the matches, since they also come with starting positions and lengths. A first match at position 0 with a length of 10 characters means you look at input text positions -1 and 10.
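The suggestion to read the neighbouring characters from the match positions could look roughly like this in Python. It is a sketch under the assumption that Python's re module is used; the helper name with_neighbours is invented for the example, and an empty string stands in for "no character" at either end of the input.

```python
import re

PATTERN = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)

def with_neighbours(text):
    """Return (char before, match, char after) for every match in text."""
    out = []
    for m in PATTERN.finditer(text):
        # Guard both ends: position -1 and position len(text) do not exist.
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        out.append((before, m.group(0), after))
    return out

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
for triple in with_neighbours(s):
    print(triple)
```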
I have a regex
[a-zA-Z][a-z]
I have to change this regex so that it does not accept strings that start with "de", "DE", "dE" or "De". I cannot use lookbehind or lookahead because my system does not support them.
There's a solution without a lookahead or lookbehind, but you need to be able to use groups.
The idea there is to create a sort of "honeypot" that will match your negative results and keep only the results that do interest you.
In your case, that would write:
[dD][eE].*|(<your-regex>)
If the proposition is de<anything> (case insensitive here), it will match, but group(1) will be null.
On the other hand, diZ for instance would not match what is before the | and would therefore fall into group(1).
Finally, if the proposition doesn't start with de and doesn't match your regex, there will be no match at all.
If you need to be sure that your proposition will match the whole provided string, you can update the regex thus:
^(?:[dD][eE].*|(<your-regex>))$
Note that ?: is not a lookahead of any kind, it serves to mark the group as non-capturing, so that <your-regex> will still be captured by group(1) (would become group(2) otherwise and the capture of a group is not always a transparent operation, performance-wise).
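A possible way to read the three outcomes of the honeypot in Python is sketched below. It assumes Python's re module, plugs the question's [a-zA-Z][a-z] in place of <your-regex>, and uses a made-up helper name classify with made-up labels.

```python
import re

# Left branch: the "honeypot" that swallows anything starting with de/DE/dE/De.
# Right branch: the real regex, captured in group 1.
HONEYPOT = re.compile(r"^(?:[dD][eE].*|([a-zA-Z][a-z]))$")

def classify(s):
    """Return 'accepted', 'rejected' (caught by the honeypot) or 'no match'."""
    m = HONEYPOT.match(s)
    if m is None:
        return "no match"
    # A match with an unset group 1 means the honeypot branch was taken.
    return "accepted" if m.group(1) is not None else "rejected"

for s in ["di", "Ab", "de", "DExyz", "d1"]:
    print(repr(s), "->", classify(s))
```

Since the anchored form wraps a two-character regex, only two-character inputs can be accepted here; a longer <your-regex> would change that.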
Simply ignore those characters:
[a-ce-z][a-df-z][a-gi-kwxyzWZXZ]
Make sure the flag is set to case insensitive. Also, [a-gi-kwxyzWZXZ] can then be modified to [a-gi-kwxyz].
EDIT:
As pointed out in this comment, the regex here won't support other words that start with d but are not followed by e. In this case, negative lookahead is a possible solution:
^(?!de)[a-z]+
This matches anything not starting with "DE" (case insensitive, without look arounds, allowing leading whitespace):
^ *+(?:[^Dd].|.[^Ee])<your regex for rest of input>
The possessive quantifier *+ used for whitespace prevents [^Dd] from being allowed to match a space via backtracking, making this regex hardened against leading spaces.
You can use an alternation excluding matching the d and D from the first character, or exclude matching the e as the second character.
Note that the pattern [a-zA-Z][a-z] matches at least 2 characters, so will the following pattern:
^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*
^ Start of string
(?: Non capture group
[abce-zABCE-Z][a-z] Match a char a-zA-Z without d and D followed by a lowercase char a-z
| or
[a-zA-Z][a-df-z] Match a char a-zA-Z followed by a lowercase char a-z without e
) Close non capture group
.* Match 0+ times any char except a newline
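The alternation can be checked quickly in Python. This is a minimal sketch assuming Python's re module; the helper name accepted is hypothetical.

```python
import re

# Either the first char is not d/D, or the second char is a lowercase
# letter other than e. No lookaround is involved.
PATTERN = re.compile(r"^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*")

def accepted(s):
    """True when s starts with two allowed characters that are not de/DE/dE/De."""
    return PATTERN.match(s) is not None

for s in ["dog", "eel", "diZ", "de", "De", "dE", "DE", "delta"]:
    print(repr(s), "->", accepted(s))
```

The uppercase E cases are rejected simply because the second character must be lowercase in the original pattern.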
Another option is to use word boundaries \b instead of an anchor ^
\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b
I need to create regex to find last underscore in string like 012344_2.0224.71_3 or 012354_5.00123.AR_3.335_8
I wanted to find the last part with the expression [^.]+$ and then find the underscore in the found element, but I cannot get it to work.
I hope you can help me :)
Just use a negative character class [^_] that will match everything except an underscore (this helps to ensure no other underscores are found afterwards) and end of string $
Pattern would look as such:
(_)[^_]*$
The final underscore _ is in a capturing group, so you will want to work with the submatch: you would replace group 1 (your underscore).
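In Python, replacing only group 1 can be done by splicing around the group's span. The following is a sketch, assuming Python's re module; replace_last_underscore is a name invented for the example.

```python
import re

# Group 1 is the underscore; [^_]*$ guarantees no later underscore exists.
LAST_UNDERSCORE = re.compile(r"(_)[^_]*$")

def replace_last_underscore(s, replacement):
    """Replace only the final underscore of s, leaving the rest untouched."""
    m = LAST_UNDERSCORE.search(s)
    if m is None:
        return s  # no underscore at all
    # Splice around the span of group 1 rather than the whole match.
    return s[:m.start(1)] + replacement + s[m.end(1):]

print(replace_last_underscore("012344_2.0224.71_3", "-"))
print(replace_last_underscore("012354_5.00123.AR_3.335_8", "-"))
```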
Notice the green highlighted portion on Regex101, this is your submatch and is what would be replaced.
The simplest solution I can imagine is using .*\K_, however not all regex flavours support \K.
If not, another idea would be to use _(?=[^_]*$)
Explanation:
.*\K_: Matches any characters up to an underscore. Since the * quantifier is greedy, it will match up to the last underscore. Then \K discards what was matched so far, and then we match the underscore.
_(?=[^_]*$): Matches an underscore followed only by non-underscore characters until the end of the line.
If you want nothing but the "net" (i.e., nothing matched except the last underscore), use positive lookahead to check that no more underscores are in the string:
/_(?=[^_]*$)/gm
The pattern [^.]+$ matches any character other than a dot one or more times and then asserts the end of the string. This will give you the matches 71_3 and 335_8.
What you want to match is an underscore when there are no more underscores following.
One way to do that is a negative lookahead (?!.*_), if that is supported, which asserts that what is to the right does not match any characters followed by an underscore:
_(?!.*_)
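Both lookahead forms, the positive _(?=[^_]*$) and the negative _(?!.*_), can be compared against a plain str.rfind in Python. This is just a sketch assuming Python's re module and single-line inputs; the helper name last_underscore_index is made up.

```python
import re

POSITIVE = re.compile(r"_(?=[^_]*$)")  # only non-underscores up to the end
NEGATIVE = re.compile(r"_(?!.*_)")     # no further underscore anywhere ahead

def last_underscore_index(pattern, s):
    """Return the index of the underscore matched by pattern, or -1."""
    m = pattern.search(s)
    return m.start() if m else -1

samples = ["012344_2.0224.71_3", "012354_5.00123.AR_3.335_8"]
for s in samples:
    print(s, last_underscore_index(POSITIVE, s), last_underscore_index(NEGATIVE, s))
```

For a plain index, str.rfind("_") is of course simpler; the regex forms are mainly useful when the search is part of a larger pattern or a substitution.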
To match all characters except vowels, we can use [^aeiou].
I wonder
how to match all strings other than a particular one? For example, I want to match a string which is not dog. So cat, sky, and mike will all be matches.
how to match all strings other than a few strings, or other than a regular expression?
For example, I want to match a string which is not c.t. So sky and mike will all be matches, but cat and cut will not be matches.
Thanks.
1. How to match all strings other than a particular one
^(?!your_string$).*$
2. How to match all strings other than a few strings
^(?!(?:string1|string2|string3)$).*$
How does that work?
The idea is to use a negative lookahead (?! to check that the string does not consist solely of the string(s) to avoid. If the negative lookahead (which is an assertion) succeeds, the .*$ matches everything to the end of the string.
Note the use of the ^ anchor at the beginning to ensure we are positioned at the beginning of the string.
Note the $ anchor inside the negative lookahead to ensure that we are excluding your_string if it is indeed the whole string, but that we do not exclude your_string followed by more characters.
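A short Python sketch of both cases, assuming Python's re module. The builder function anything_but is hypothetical and escapes its arguments with re.escape, so it only covers literal strings; the c.t case is written out by hand because it is itself a regular expression.

```python
import re

def anything_but(*forbidden):
    """Build ^(?!(?:a|b|...)$).*$ from literal strings."""
    alternatives = "|".join(re.escape(word) for word in forbidden)
    return re.compile(rf"^(?!(?:{alternatives})$).*$")

not_dog = anything_but("dog")
not_pets = anything_but("dog", "cat")
not_c_t = re.compile(r"^(?!c.t$).*$")  # excludes whole strings matching c.t

for word in ["cat", "sky", "mike", "dog", "dogs"]:
    print(word, bool(not_dog.match(word)), bool(not_c_t.match(word)))
```

Note that "dogs" is still accepted by not_dog, because the $ inside the lookahead restricts the exclusion to the whole string.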
Reference
Mastering Lookahead and Lookbehind
Negative Lookaheads
The following regex finds text between substrings FTW and ODP.
/FTW(((?!FTW|ODP).)+)ODP+/
What does the (?!...) do?
(?!regex) is a zero-width negative lookahead. It will test the characters at the current cursor position and forward, testing that they do NOT match the supplied regex, and then return the cursor back to where it started.
The whole regexp:
/
FTW # Match Characters 'FTW'
( # Start Match Group 1
( # Start Match Group 2
(?!FTW|ODP) # Ensure next characters are NOT 'FTW' or 'ODP', without matching
. # Match one character
)+ # End Match Group 2, Match One or More times
) # End Match Group 1
OD # Match characters 'OD'
P+ # Match 'P' One or More times
/
So - Hunt for FTW, then capture while looking for ODP+ to end our string. Also ensure that the data between FTW and ODP+ doesn't contain FTW or ODP
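The commented breakdown above maps almost directly onto Python's verbose mode. The sketch below assumes Python's re module with the re.VERBOSE flag; the test strings are invented for illustration.

```python
import re

PATTERN = re.compile(
    r"""
    FTW              # match the characters 'FTW'
    (                # start group 1
      (              # start group 2
        (?!FTW|ODP)  # next characters must NOT be 'FTW' or 'ODP' (nothing consumed)
        .            # match one character
      )+             # end group 2, repeated one or more times
    )                # end group 1
    OD               # match the characters 'OD'
    P+               # match 'P' one or more times
    """,
    re.VERBOSE,
)

m = PATTERN.search("xxFTWhelloODPPyy")
print(m.group(0), m.group(1))
```

Group 1 holds the text between the markers, while group 2 only keeps the last single character it matched, which is why the extra outer group is there.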
From perldoc:
A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.
If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. You would have to do something like /(?!foo)...bar/ for that. We say "like" because there's the case of your "bar" not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/ . Sometimes it's still easier just to say:
if (/bar/ && $` !~ /foo$/)
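The point of the quoted passage can be reproduced in Python, which also offers a negative lookbehind (?<!foo) as listed earlier. The following is a sketch assuming Python's re module; the helper name matches is made up.

```python
import re

LOOKAHEAD = re.compile(r"(?!foo)bar")    # tests what FOLLOWS the current position
LOOKBEHIND = re.compile(r"(?<!foo)bar")  # tests what PRECEDES the current position
# The lookahead-only workaround quoted above.
WORKAROUND = re.compile(r"(?:(?!foo)...|^.{0,2})bar")

def matches(pattern, text):
    """True when pattern is found anywhere in text."""
    return pattern.search(text) is not None

for text in ["foobar", "bar", "xbar", "fobar", "bazbar"]:
    print(text, matches(LOOKAHEAD, text), matches(LOOKBEHIND, text), matches(WORKAROUND, text))
```

On "foobar" the misplaced lookahead still matches, while the lookbehind and the workaround both reject it.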
It means "not followed by...". Technically this is what's called a negative lookahead, in that you can peek at what's ahead in the string without consuming it. It is a kind of zero-width assertion, meaning that such expressions don't consume any part of the string.
The programmer must have been typing too fast. Some characters in the pattern got flipped. Corrected:
/WTF(((?!WTF|ODP).)+)ODP+/
Regex
/FTW(((?!FTW|ODP).)+)ODP+/
matches the first FTW that is immediately followed by neither FTW nor ODP, then all following characters up to the first ODP (if there is an FTW somewhere among them, there is no match), and then all the letters P that follow.
So in the string:
FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj
it will match the part in brackets:
FTWFTWODPFTWjj[FTWjjODPPPP]jjODPPPjjj
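The match on this sample string can be confirmed with a few lines of Python. This is a sketch assuming Python's re module; only the regex and the string come from the text above.

```python
import re

PATTERN = re.compile(r"FTW(((?!FTW|ODP).)+)ODP+")
sample = "FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj"

m = PATTERN.search(sample)
# The first three FTW occurrences fail: they are followed directly by FTW or
# ODP, or another FTW appears before the next ODP.
print(m.group(0), m.span(), m.group(1))
```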
'?!' is actually part of '(?! ... )', it means whatever is inside must NOT match at that location.