Regular expression for all abbreviated keyword variations - regex

I need to search for a keyword, such as "abcdef", which can also be in an abbreviated version with a dot at the end. All valid variants are:
abcdef
abcde.
abcd.
abc.
ab.
a.
I have a regular expression for this, which is clear:
abcdef|abcde\.|abcd\.|abc\.|ab\.|a\.
Another regular expression where the keyword characters are not repeated:
a(b(c(d(e(f|\.)|\.)|\.)|\.)|\.)
I'm looking for a more compact expression where not even a dot will be repeated.
I use .NET syntax.

You can use a conditional construct:
a(b(c(d(e(?<f>f)?)?)?)?)?(?(f)|\.?)
See the regex demo. Here, (?<f>f)? is an optional named group that matches f one or zero times. If the group participated in the match, the conditional (?(f)|\.?) takes its empty "yes" branch and matches an empty string. If the group did not match, the "no" branch \.? matches an optional dot.
In the PCRE flavor, you could use
a(b(c(d(e(f(*ACCEPT))?)?)?)?)?\.?
where the (*ACCEPT) verb inside an optional group stops processing the current regex and returns the text matched so far (so the final \.? is not tried if f is found). See this regex demo.

As a variant:
a(bcdef|(bcde|bcd|bc|b|)\.)
Factoring out 2 letters (a bit shorter):
a(bcdef|(b(cde|cd|c|)|)\.)
Factoring out 3 letters (the same length):
a(bcdef|(b(c(de|d|)|)|)\.)
Factoring out 4 letters gives the shortest, 25 characters:
a(bcdef|(b(c(de?|)|)|)\.)
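The plain-alternation variants in this thread can be checked against each other mechanically. Below is a minimal Python sketch; Python's re module is used only for illustration (the question targets .NET), which should be fine here because these particular patterns use only basic syntax. The names PATTERNS, VALID, INVALID and accepts are made up for this example.

```python
import re

# Patterns quoted from the question and the answer above.
PATTERNS = [
    r"abcdef|abcde\.|abcd\.|abc\.|ab\.|a\.",
    r"a(b(c(d(e(f|\.)|\.)|\.)|\.)|\.)",
    r"a(bcdef|(bcde|bcd|bc|b|)\.)",
    r"a(bcdef|(b(cde|cd|c|)|)\.)",
    r"a(bcdef|(b(c(de|d|)|)|)\.)",
    r"a(bcdef|(b(c(de?|)|)|)\.)",
]

VALID = ["abcdef", "abcde.", "abcd.", "abc.", "ab.", "a."]
INVALID = ["", "a", "abc", "abcde", "abcdef.", "ab.c", "b."]


def accepts(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern in PATTERNS:
    # Every variant must accept all six spellings and reject the rest.
    assert all(accepts(pattern, s) for s in VALID), pattern
    assert not any(accepts(pattern, s) for s in INVALID), pattern
    print(len(pattern), pattern)
```

The printed length makes it easy to compare how compact each variant is.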

Related

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract the occurrences of the clause match-this that are not enclosed by tildes (~) on both sides in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the occurrences written as ~match-this~, but when I tried negating it with (?!(~match-this)), it only produced empty matches.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match above). When I added an asterisk quantifier to either negated tilde class, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes.
Is it possible to achieve this with only one regex test, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by match-this
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match-this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
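The "discard what is not captured" idea can be sketched in Python as follows. This is only an illustration; the variable names PATTERN, s and kept are made up, and the input is the sample string from the question.

```python
import re

PATTERN = re.compile(r"~match-this~|(\bmatch-this\b)")
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# Keep only the matches where capture group 1 participated. A match of the
# first alternative (~match-this~) leaves group 1 unset and is discarded.
kept = [m.start(1) for m in PATTERN.finditer(s) if m.group(1) is not None]
print(kept)  # start offsets of the five kept occurrences
```

Recording the start offsets rather than the text makes it visible which of the seven occurrences were kept.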
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead end again anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
By the nature of lookarounds, the character in front or behind cannot be captured. Additionally, if you also want to match either a leading or a trailing character, then this collides with recognizing the next potential match.
There is a difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use this. For me it works fine in TextPad 8, and it should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1, it also works as I expect.
What puzzles me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
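A minimal Python sketch of that last suggestion: match with lookarounds only, then read the neighbouring characters from the input using each match's position. The helper name neighbours and the variable names are made up for this example.

```python
import re

PATTERN = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"


def neighbours(text, match):
    """Return the characters just before and after a match ('' at the edges)."""
    before = text[match.start() - 1] if match.start() > 0 else ""
    after = text[match.end()] if match.end() < len(text) else ""
    return before, after


found = [(m.start(), *neighbours(s, m)) for m in PATTERN.finditer(s)]
for start, before, after in found:
    print(start, repr(before), repr(after))
```

Because the lookarounds consume nothing, a tilde that ends one candidate is still available as the leading character of the next one.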

Regular expression for a-b, a-c but not a-a?

I try to find method definitions except constructors.
To simplify, I'm looking for abc::def, foo::bar but not foo::foo
I already know how to write an expression like so:
\w[\w\d_]+::\w[\w\d_]+
But how to make sure the left part of the :: does not match the right part?
By the way, I cannot check if there is a type definition left of the qualified method name. I have a very old project where it was fine to not specify a type if it was int.
Note that \w already matches \d and _ and \w[\w\d_]+ = \w{2,}.
You can capture the first "word" (before ::) and check with a negative lookahead that the "word" after :: is not equal to it:
\b(\w+)::(?!\b\1\b)\w+\b
See the regex demo
Explanation:
\b - leading word boundary
(\w+) - Group 1: one or more alphanumeric and underscore characters
:: - 2 consecutive colons
(?!\b\1\b) - the next "word" cannot be the same as the value in Group 1
\w+\b - one or more alphanumeric and underscore characters followed with a trailing word boundary.
If you are not looking to match 1-character "words", you can use
\b(\w{2,})::(?!\b\1\b)\w{2,}\b
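You can capture the first part and check whether it is repeated using a back-reference, like this.
Regex: \b(\w[\w\d_]+)::(?!\1)\w[\w\d_]+
Explanation:
\b(\w[\w\d_]+) matches the first part.
(?!\1) is a negative lookahead for the first part. If it is repeated, the whole match is discarded. Note that without a trailing \b inside the lookahead, this also rejects names that merely start with the first part, such as foo::foobar.
\w[\w\d_]+ If not repeated, then this part will match.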
Regex101 Demo
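To see how the two suggested patterns differ, here is a small Python sketch that runs both on the same input. The sample text and the names WITH_BOUNDARY, WITHOUT_BOUNDARY and qualified_names are invented for illustration.

```python
import re

# First answer: the lookahead compares whole words thanks to \b...\b.
WITH_BOUNDARY = re.compile(r"\b(\w+)::(?!\b\1\b)\w+\b")
# Second answer: the lookahead only checks the prefix after "::".
WITHOUT_BOUNDARY = re.compile(r"\b(\w[\w\d_]+)::(?!\1)\w[\w\d_]+")

text = "abc::def foo::bar foo::foo foo::foobar"


def qualified_names(regex, source):
    """Return the full text of every match (not just the capture group)."""
    return [m.group(0) for m in regex.finditer(source)]


print(qualified_names(WITH_BOUNDARY, text))
print(qualified_names(WITHOUT_BOUNDARY, text))
```

Both skip the constructor-like foo::foo; only the first one keeps foo::foobar.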

Repeated variable length regexp matching

I have an expression
AA-BB/CC/DD
I want to convert this to
<AA-BB> <AA-CC> <AA-DD>
All I can do is configure this as a regexp substitution. I can't figure it out.
AA should match at the beginning of a line. - and / are literal characters, BB,CC and DD are numbers, i.e \d+
So a first draft is ...
^(\w+)([\-/]\d+)+
but I want all matches, not just the greedy one.
(actually this one matches AA-BB-CC-DD too, but that's ok although it's not according to spec)
No, you can't do that with regex. It is probably possible with .NET, because there you can access all intermediate results of repeated capturing groups ...
Repeating a Capturing Group vs. Capturing a Repeated Group
That is the problem: if you use something like ^(\w+)([\-/]\w+)+, the value stored in group 2 is always only the last substring it matched. Your task is not possible with a single regex replace.
I would do something like:
^(\w+)-([\w\/]+)
Then split the content of group 2 on "/" and combine group 1 with each element of the array resulting from the split.
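The match-then-split approach could look roughly like this in Python. It is a sketch under assumptions: the function name expand is made up, and leaving non-matching lines unchanged is a choice made here, not something the question specifies.

```python
import re

# Group 1 is the leading part, group 2 is everything after the first "-".
HEAD_AND_TAIL = re.compile(r"^(\w+)-([\w/]+)")


def expand(line: str) -> str:
    """Turn 'AA-BB/CC/DD' into '<AA-BB> <AA-CC> <AA-DD>'."""
    m = HEAD_AND_TAIL.match(line)
    if m is None:
        return line  # assumption: non-matching lines pass through unchanged
    prefix, tail = m.group(1), m.group(2)
    return " ".join(f"<{prefix}-{part}>" for part in tail.split("/"))


print(expand("AA-12/34/56"))  # <AA-12> <AA-34> <AA-56>
```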

Finding a match one after another

How do I find multiple matches that are (and can only be) separated from each other by whitespaces?
I have this regular expression:
/([0-9]+)\s*([A-Za-z]+)/
And I want each of the matches (not groups) to be surrounded by whitespace or another match. If the condition is not fulfilled, the match should not be returned.
This is valid: 1min 2hours 3days
This is not: 1min, 2hours 3days (1min and 2hours should not be returned)
Is there a simpler way of finding a continuous sequence of matches (in Java preferably) than repeating the whole regex before and after the main one, checking if there is a whitespace, start/end of the string or another match?
I believe this pattern will meet your requirements (provided that only a single space character separates your alphanumeric tokens):
(?<=^|[\w\d]\s)([\w\d]+)(?=\s|$)
    ^^^^^^^^^^  ^^^^^^^    ^^^^
       (2)        (1)      (3)
1. A capture group that contains an alphanumeric string.
2. A look-behind assertion: To the left of the capture group must be a) the beginning of the line or b) an alphanumeric character followed by a single space character.
3. A look-ahead assertion: To the right of the capture group must be a) a space character or b) the end of the line.
See regex101.com demo.
Here is some sample data that I included in the demo. Each bolded alphanumeric string indicates a successful capture:
1min 2hours 3days
1min, 2hours 3days
42min 4hours 2days
String text = "1min 2hours 3days";
boolean match = text.matches("(?:\\s*[0-9]+\\s*[A-Za-z]+\\s*)*");
This basically looks for the pattern from your example. The * after the group then looks for zero or more occurrences of that pattern in the text, and ?: means the group is not captured.
This will also return true for an empty string. If you don't want the empty string to be accepted, change the final * into +
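For comparison, here is a rough Python equivalent of the whole-string check above. Java's String.matches requires the entire string to match, which corresponds to fullmatch in Python's re module. The names ZERO_OR_MORE, ONE_OR_MORE and is_sequence are made up for this sketch.

```python
import re

ZERO_OR_MORE = re.compile(r"(?:\s*[0-9]+\s*[A-Za-z]+\s*)*")
ONE_OR_MORE = re.compile(r"(?:\s*[0-9]+\s*[A-Za-z]+\s*)+")


def is_sequence(regex, text):
    """Return True if the whole text is a run of number+unit tokens."""
    return regex.fullmatch(text) is not None


for text in ["1min 2hours 3days", "1min, 2hours 3days", ""]:
    print(repr(text), is_sequence(ZERO_OR_MORE, text), is_sequence(ONE_OR_MORE, text))
```

Only the empty string is treated differently by the two variants.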
I've managed to solve my problem by splitting the string using string.split("\\s+") and then matching the results against the pattern /([0-9]+)\s*([A-Za-z]+)/.
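A sketch of this split-then-match approach in Python, for illustration only. It assumes each token has to match the pattern in full (the post does not say whether a full or partial match was used), and the names TOKEN and parse_tokens are hypothetical.

```python
import re

TOKEN = re.compile(r"([0-9]+)\s*([A-Za-z]+)")


def parse_tokens(text):
    """Split on whitespace and keep (number, unit) for tokens that match fully."""
    pairs = []
    for token in text.split():  # split on runs of whitespace
        m = TOKEN.fullmatch(token)  # assumption: the whole token must match
        if m:
            pairs.append((int(m.group(1)), m.group(2)))
    return pairs


print(parse_tokens("1min 2hours 3days"))
print(parse_tokens("1min, 2hours 3days"))
```

Each token is checked in isolation, so in the second input 2hours is still returned even though its left neighbour 1min, is rejected.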
There is an error here: the '*' will match all characters and ignore the rest of your pattern
/([0-9]+)\s*([A-Za-z]+)/
Change to
/(\d+)\s+(\w+)/g
This will return an array of matches, each consisting of digits followed by word characters. There is no need to always write '[0-9]': '\d' matches any digit from 0 to 9. Note that '\w' is broader than '[A-Za-z]', since it also matches digits and the underscore. More can be found in this regular expression cheat sheet

regular expression no characters

I have this regular expression
([A-Z], )*
which should match something like
test, (with a space after the comma)
How do I change the regular expression so that if there are any characters after the space then it doesn't match?
For example if I had:
test, test
I'm looking to do something similar to
([A-Z], ~[A-Z])*
Cheers
Use the following regular expression:
^[A-Za-z]*, $
Explanation:
^ matches the start of the string.
[A-Za-z]* matches 0 or more letters (case-insensitive) -- replace * with + to require 1 or more letters.
, matches a comma followed by a space.
$ matches the end of the string, so if there's anything after the comma and space then the match will fail.
As has been mentioned, you should specify which language you're using when you ask a Regex question, since there are many different varieties that have their own idiosyncrasies.
^([A-Z]+, )?$
The difference between mine and Donut's is that his will match ", " and fail for the empty string, while mine will match the empty string and fail for ", ". (His is also case-insensitive, whereas with mine you'll have to add case-insensitivity to the options of your regex function, but it's like your example.)
I am not sure which regex engine/language you are using, but there is often something like a negated character class, e.g. [^a-z], meaning "any character other than a lowercase letter".
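The difference between the two patterns can be checked with a short Python sketch. The names FIRST, SECOND, samples and results are made up here, and re.IGNORECASE stands in for "adding case-insensitivity to the options".

```python
import re

FIRST = re.compile(r"^[A-Za-z]*, $")                  # Donut's pattern
SECOND = re.compile(r"^([A-Z]+, )?$", re.IGNORECASE)  # the optional-group pattern

samples = ["test, ", "test, test", ", ", ""]

# Map each sample to (matched by FIRST, matched by SECOND).
results = {s: (bool(FIRST.search(s)), bool(SECOND.search(s))) for s in samples}
for sample, outcome in results.items():
    print(repr(sample), outcome)
```

Both reject "test, test"; they disagree only on ", " and on the empty string.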