Simplify this repeating regex

I have the following working regex to match various Excel cell/range patterns of the form A1, A1:Z12, etc.
^(?:[A-Za-z]{1,3}\d{0,10})(?::(?:[A-Za-z]{1,3}\d{0,10}))?$
Is there a more compact way to write the second part of the match, that is, the : <repeat> part? I was hoping to be able to do it with something like:
^ (<main_part> ':'<lookahead, keep if before an A-Z> ){1,2} $
Any way to do that pattern?

One way to do it without capture groups or lookarounds is to use a word boundary:
^(?:\b:?[A-Z]{1,3}[0-9]{1,10}){1,2}$
The word-boundary can't succeed between the start of the string and a colon nor between a digit and a letter, but it does between a digit and a colon or between the start of the string and a letter.
For the same kind of reasons, you can also put the optional colon and the word boundary at the end of the group:
^(?:[A-Z]{1,3}[0-9]{1,10}:?\b){1,2}$
(You save one more step with this one, yay!)
test cases (first pattern):
with :A2
It fails because \b fails between the start of the string and a non-word character (the colon).
with A2:
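It fails because the sub-pattern has no colon at its end: the colon can only be consumed at the start of a second repetition, and that repetition cannot complete here because no letters and digits follow.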
with A2:A2
The pattern succeeds. \b succeeds because the first time it is between the start of the string and a letter (a word character), and the second time because it is between a digit (a word character too) and a colon (a non-word character).
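As a quick check of the test cases above, here is a minimal sketch in Python. The language choice, the names FIRST, SECOND and CASES, and the extra input A2A2 are assumptions for illustration and not part of the original answer.

```python
import re

# The two word-boundary patterns from the answer above.
FIRST = re.compile(r"^(?:\b:?[A-Z]{1,3}[0-9]{1,10}){1,2}$")
SECOND = re.compile(r"^(?:[A-Z]{1,3}[0-9]{1,10}:?\b){1,2}$")

# Sample inputs: the three test cases discussed above plus two extras
# (a single cell, and two cells glued together without a colon).
CASES = ["A2", "A2:A2", ":A2", "A2:", "A2A2"]


def accepted(pattern, cases):
    """Return the inputs that the pattern matches in full."""
    return [c for c in cases if pattern.match(c)]


print(accepted(FIRST, CASES))
print(accepted(SECOND, CASES))
```

Both patterns should accept only the single cell and the well-formed range, since \b rejects a leading colon, a trailing colon, and a digit directly followed by a letter.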

Here is an example pattern you can use. Note that AB:AB is not a valid range as described above, so the digit quantifier has also been changed to require at least one digit ({1,10}):
^(?:[A-Z]{1,3}[0-9]{1,10}(?::(?=[A-Z]))?){1,2}$
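Be aware that this pattern also accepts two references with no colon between them (for example A1B2), because the colon group is optional in each repetition. A better approach, in engines that support subroutine calls such as PCRE, is to use (?1) to recurse into the first group's pattern: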
^([A-Z]{1,3}[0-9]{1,10})(:(?1))?$
Note however with this approach we do need the extraneous capturing group at the beginning for this technique to work.
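Not every engine supports (?1); Python's built-in re module, for instance, does not. A rough equivalent there, sketched under the assumption that the goal is only to avoid writing the sub-pattern twice, is to define it once as a string and compose the full pattern from it. The names CELL and RANGE are hypothetical.

```python
import re

# One cell reference: 1-3 uppercase letters followed by 1-10 digits.
CELL = r"[A-Z]{1,3}[0-9]{1,10}"

# A cell, optionally followed by a colon and a second cell.
RANGE = re.compile(rf"^{CELL}(?::{CELL})?$")

for text in ["A1", "A1:Z12", "A1:", ":A1", "A1B2"]:
    print(text, bool(RANGE.match(text)))
```

This is string composition rather than true recursion, so it only covers the fixed one-or-two-cell case discussed here.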

RegExp: find "cleverness" in a string

My RegExpression:
((^|\s)(clever)($|\s))
It finds "clever" in the string:
clever or not
yahoo clever
but it doesn't find "clever" in this string:
what means cleverness
I won't bother you with the three other variations of the expression above; I have tried different approaches already but can't make it work.
I am filtering terms in a table to cluster them into defined groups. I am looking for the adjective "clever". I don't want to find strings where clever is part of another word, for example "MacLever" or "alcleveracio".
Try this :
((^|\s)(clever))
Your regex contains ($|\s), which forces clever to be followed by whitespace or by the end of the string.
Try using ^(.*\W)?(clever)(\W.*)?$ instead. \W matches any non-word character, so this enforces that any text before "clever" ends with a non-word character (and, likewise, that any text after it starts with one).
You can plug it into https://regex101.com/ to see how it is working and test it out.
You can use the word boundary \b.
\bclever\w*\b
or maybe better (no capitals allowed)
\bclever[a-z]*\b
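To see the difference on the strings from the question, here is a small Python sketch. Python is an assumption (the question does not name an engine), and ORIGINAL, BOUNDARY, SAMPLES and hits are hypothetical names.

```python
import re

# The expression from the question and the word-boundary variant above.
ORIGINAL = re.compile(r"((^|\s)(clever)($|\s))")
BOUNDARY = re.compile(r"\bclever[a-z]*\b")

SAMPLES = [
    "clever or not",
    "yahoo clever",
    "what means cleverness",
    "MacLever",
    "alcleveracio",
]


def hits(pattern, samples):
    """Return the samples in which the pattern finds a match."""
    return [s for s in samples if pattern.search(s)]


print(hits(ORIGINAL, SAMPLES))
print(hits(BOUNDARY, SAMPLES))
```

The word-boundary version additionally finds "cleverness" while still skipping "MacLever" and "alcleveracio", where clever is not at the start of a word.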
If "clever" should be either at the beginning or at the end of a word:
\b([a-zA-Z]+)?clever(?(1)|[a-z]*)\b
\b a word boundary (here, the start of a word)
([a-zA-Z]+) one or more letters, captured as group 1
? makes group 1 optional, so the match also works when nothing precedes clever
clever matches the literal characters
(?(1) starts a conditional that depends on group 1
|[a-z]*) if group 1 matched, nothing more may follow; otherwise ( | ) any number of lowercase letters ( [a-z]* ) may follow
\b the closing word boundary
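The conditional can be tried out with the following sketch. It assumes Python, whose re module accepts the same (?(1)...|...) syntax; the helper found and the input "supraclever" are made up for illustration.

```python
import re

# Optional prefix in group 1; a suffix is only allowed when there was no prefix.
CONDITIONAL = re.compile(r"\b([a-zA-Z]+)?clever(?(1)|[a-z]*)\b")


def found(text):
    """Return the matched word, or None when the pattern finds nothing."""
    m = CONDITIONAL.search(text)
    return m.group(0) if m else None


for text in ["clever or not", "what means cleverness", "supraclever", "alcleveracio"]:
    print(repr(text), "->", found(text))
```

Words with clever at the start or at the end are found, while a word such as "alcleveracio", which has both a prefix and a suffix, is rejected.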

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the : .
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
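The backtracking effect and the fix can be reproduced with a short sketch. Python is assumed here (the question uses a JavaScript-style literal), and NAIVE and FIXED are hypothetical names.

```python
import re

# The pattern from the question: backtracking lets it return a shortened word.
NAIVE = re.compile(r"[a-zA-Z]+(?!:)")

# The corrected pattern: the match may not stop in front of a letter or a colon.
FIXED = re.compile(r"[A-Za-z]+(?![A-Za-z:])")

text = "lame:joker"
print(NAIVE.findall(text))
print(FIXED.findall(text))
```

With the extra letter class in the lookahead, the engine can no longer give back the final e of lame to satisfy the lookahead, so the whole word is skipped.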
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letters, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", "12c" or "12_" (in the last two, \b fails between the digit and the following word character).
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_", and fails only in "12:".
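The two number patterns can be compared side by side with a sketch like the following. Python is assumed, and the names WITH_BOUNDARY, WITH_CLASS and matched are hypothetical.

```python
import re

WITH_BOUNDARY = re.compile(r"\d+\b(?!:)")
WITH_CLASS = re.compile(r"\d+(?![\d:])")

SAMPLES = ["12,", "12:", "12c", "12_"]


def matched(pattern, samples):
    """Map each sample to the matched text, or None when there is no match."""
    result = {}
    for s in samples:
        m = pattern.search(s)
        result[s] = m.group(0) if m else None
    return result


print(matched(WITH_BOUNDARY, SAMPLES))
print(matched(WITH_CLASS, SAMPLES))
```

The word-boundary version drops 12c and 12_ because letters and the underscore are word characters, which is exactly the catch described above.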
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)

Match a word but not its inverse using [^] syntax

I am trying to make a regex that doesn't match one word, but does match its reverse. For example, if the word I don't want to match is "no":
I am matching this word // will pass
I am matching no word // will not pass
I am matching on word // will pass
I am matching that word // will pass
The current regex I am using doesn't pass on the third example, because it is not matching any word with "n" or "o" in it:
^I am matching ([^no]*) word$
What is the best way to achieve this - ie, match on a word, not a collection of characters?
For context I am writing acceptance tests using Scala and Cucumber, which use Regex to match a feature file up with its corresponding stepdef. My real-world example is more complex, so I have simplified it here. Also, I know that I can just catch (.*) and handle what is in that capture group using a case/match block in Scala, but I am curious about how to do this with purely Regex.
You can use a negative lookahead to test the text you're about to match:
^I am matching (?!no\b)(?<CapturedWord>\w+) word$
(?!no\b) - This is a negative lookahead. It tests the next two characters. If they are "no" followed by a word boundary, then the match fails. Anything else will pass. A lookahead does not actually consume those characters, so...
(?<CapturedWord>\w+) - ...we need to capture the characters in order to continue on with the rest of the test. I used a named group because they're often easier to reference later on in code.
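As a rough check against the four sentences from the question, here is a sketch in Python rather than Scala. That is an assumption with one visible consequence: Python spells named groups (?P<name>...) instead of (?<name>...). The helper captured is hypothetical.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
STEP = re.compile(r"^I am matching (?!no\b)(?P<CapturedWord>\w+) word$")


def captured(sentence):
    """Return the captured word, or None when the sentence is rejected."""
    m = STEP.match(sentence)
    return m.group("CapturedWord") if m else None


for word in ["this", "no", "on", "that", "nothing"]:
    sentence = f"I am matching {word} word"
    print(sentence, "->", captured(sentence))
```

The \b inside the lookahead is what lets longer words that merely start with "no", such as "nothing", still pass.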
Another solution is to describe all words that aren't "no". Note that this solution isn't handy if you want to negate a long substring, but with regex engines that don't have the lookahead feature, this is the only way:
^I am matching ([^\Wn]\w+|n[^\Wo]+|\w(?:\w{2,})?) word$
The first two branches of the alternation match, in particular, all two-letter words that aren't "no"; the last branch matches one-letter words and words of three or more letters.
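The lookahead-free pattern can be exercised on a handful of words around the excluded one. This is a sketch in Python, chosen only for illustration; the word list and the names NO_LOOKAHEAD and passes are assumptions.

```python
import re

# Alternation that spells out every word except "no", without lookarounds.
NO_LOOKAHEAD = re.compile(
    r"^I am matching ([^\Wn]\w+|n[^\Wo]+|\w(?:\w{2,})?) word$"
)


def passes(word):
    """True when the sentence built around this word matches the pattern."""
    return NO_LOOKAHEAD.match(f"I am matching {word} word") is not None


for word in ["this", "no", "on", "n", "na", "not", "nor"]:
    print(word, passes(word))
```

Only "no" is rejected: "on" and "na" fall into the first two branches, while "n", "not" and "nor" are picked up by the last branch.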

RegEx for capturing everything except numbers and one word

I am quite stuck with a regex I can't get to work. It should capture everything except digits and the word fiktiv (not single characters of it!). Objective is to get rid of this content.
I have tried something like (?!\d|fiktiv).* on my sample string 123456788daswqrt fiktiv
https://regex101.com/r/kU8mF3/1
However this does match the fiktiv at the end as well.
One possibility would be to use a negated character class, which is written by putting a ^ at the start of the [] brackets. So you basically say: don't match digits, and match as many non-digits as you can until whitespace occurs and the word fiktiv appears.
This capturing will be "saved" in the capturing group 1 for later use.
([^\d]+)\s+fiktiv
Testing could be done here:
https://regex101.com/
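For reference, this is what the pattern captures on the sample string from the question. The sketch assumes Python; the names SAMPLE and BEFORE_FIKTIV are hypothetical.

```python
import re

SAMPLE = "123456788daswqrt fiktiv"

# Non-digits (group 1), then whitespace, then the literal word fiktiv.
BEFORE_FIKTIV = re.compile(r"([^\d]+)\s+fiktiv")

m = BEFORE_FIKTIV.search(SAMPLE)
print(m.group(1) if m else None)
```

Group 1 holds the non-digit run that sits between the digits and fiktiv, which is the content the question wants to get rid of.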
It should capture everything except digits and the word fiktiv (not single characters of it!). Objective is to get rid of this content.
So, you want to remove any character that is not a digit (that is, \D or [^0-9] pattern) and not a fiktiv char sequence.
You may use a regex with a capturing group and alternation:
(fiktiv)|[^0-9]
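and replace each match with the contents of Group 1 using a $1 backreference. When fiktiv was matched, this puts it back into the result; when any other non-digit character was matched, Group 1 is empty and the character is removed.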
See the regex demo
C# implementation:
Regex.Replace(input, "(fiktiv)|[^0-9]", "$1")
Also, see Use RegEx in SQL with CLR Procs.
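The same capture-and-restore idea can be sketched in Python as a cross-check of the C# one-liner. The function name keep_digits_and_fiktiv is hypothetical, and a replacement function is used instead of a $1-style backreference so the handling of the unmatched group is explicit.

```python
import re

KEEP = re.compile(r"(fiktiv)|[^0-9]")


def keep_digits_and_fiktiv(text):
    """Remove every non-digit character except occurrences of the word fiktiv."""
    # Group 1 is set only when fiktiv itself was matched; otherwise drop the character.
    return KEEP.sub(lambda m: m.group(1) or "", text)


print(keep_digits_and_fiktiv("123456788daswqrt fiktiv"))
```

Because fiktiv is listed first in the alternation, it is matched as a whole before the engine gets a chance to remove its letters one by one.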

Regex pattern to start with first digit and end at last hyphen (if it exists)

I have a regex pattern that almost works, but I can't quite get it totally correct. My goal is that if a string starts with letters, to ignore them up to the first digit. The second part of the pattern needs to make the match stop at the last hyphen in the string, if one exists. Here are some examples of strings that I would be working on:
PCKG6JUB-0330M3-0-812 wanting returned 6JUB-0330M3-0
CCP352878 wanting returned 352878
0972543107 wanting returned 0972543107
The pattern that I have so far is \d[\S]*- but on the top example it includes the last hyphen in the match, so I get 6JUB-0330M3-0-. Also, if no hyphen exists, then nothing is returned.
I'm using the VBScript engine.
Use this:
\d(?:\S*(?=-)|\S*)
First, I used a positive lookahead, (?=...), so we don't actually match the last hyphen. Then, I used alternation, |, to check for a match with a hyphen or without a hyphen. So that we don't need to match the digit on both sides of the alternation, I put this part in a non-capturing group, (?:...). Finally, \S is shorthand for a character class and doesn't need to be in brackets.
One would think we'd just be able to make the hyphen optional (i.e. \d\S*(?=-?)), but that doesn't work. This is because our \S* match is greedy (and it needs to be, since you want to match up until the last hyphen) and will just blow right past the hyphen.
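Both points can be checked against the three sample strings with a short sketch. Python stands in for the VBScript engine here, which is an assumption; ANSWER, OPTIONAL_HYPHEN and extract are hypothetical names.

```python
import re

# The suggested pattern, and the optional-hyphen variant that does not work.
ANSWER = re.compile(r"\d(?:\S*(?=-)|\S*)")
OPTIONAL_HYPHEN = re.compile(r"\d\S*(?=-?)")

SAMPLES = ["PCKG6JUB-0330M3-0-812", "CCP352878", "0972543107"]


def extract(pattern, text):
    """Return the first match of the pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


for s in SAMPLES:
    print(s, "->", extract(ANSWER, s), "|", extract(OPTIONAL_HYPHEN, s))
```

With the optional hyphen, the lookahead is satisfied at the very end of the string, so the greedy \S* never has to backtrack to the last hyphen.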