Regex is behaving lazy, should be greedy - regex

I thought that by default my Regex would exhibit the greedy behavior that I want, but it is not in the following code:
Regex keywords = new Regex(#"in|int|into|internal|interface");
var targets = keywords.ToString().Split('|');
foreach (string t in targets)
{
Match match = keywords.Match(t);
Console.WriteLine("Matched {0,-9} with {1}", t, match.Value);
}
Output:
Matched in with in
Matched int with in
Matched into with in
Matched internal with in
Matched interface with in
Now I realize that I could get it to work for this small example if I simply sorted the keywords by length descending, but
I want to understand why this
isn't working as expected, and
the actual project I am working on
has many more words in the Regex and
it is important to keep them in
alphabetical order.
So my question is: Why is this being lazy and how do I fix it?

Laziness and greediness applies to quantifiers only (?, *, +, {min,max}). Alternations always match in order and try the first possible match.

It looks like you're trying to word break things. To do that you need the entire expression to be correct, your current one is not. Try this one instead..
new Regex(#"\b(in|int|into|internal|interface)\b");
The "\b" says to match word boundaries, and is a zero-width match. This is locale dependent behavior, but in general this means whitespace and punctuation. Being a zero width match it will not contain the character that caused the regex engine to detect the word boundary.

According to RegularExpressions.info, regular expressions are eager. Therefore, when it goes through your piped expression, it stops on the first solid match.
My recommendation would be to store all of your keywords in an array or list, then generate the sorted, piped expression when you need it. You would only have to do this once too as long as your keyword list doesn't change. Just store the generated expression in a singleton of some sort and return that on regex executions.

Related

Matching multiple letters and special characters in regex

I am trying to catch strings around the acronym ADJ. The strings look like this:
·NOM·JJ·ADJ+CASE_DEF_GEN
·NOM·JJ·ADJ+CASE_DEF_ACC
·NOM·JJ·ADJ+CASE_INDEF_GEN
·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_GEN
·NOM·JJ·ADJ+CASE_INDEF_GEN
·NOM·JJ·ADJ+NSUFF_FEM_SG+CASE_INDEF_GEN
·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_ACC
So far I have this:
/[A-Z·\+#_]*?[·\+]ADJ[·\+][A-Z_·\+#]*?/g
But it only matches from the beginning of the strings until "ADJ+" ·NOM·DT+JJ·DET+ADJ+.
Since the rest of the strings after ADJ have the same composition of the beginning of the strings before ADJ, I thought this /[A-Z·\+#_]*?[·\+]/g should work, but it doesn't.
How do I get it to match the rest of the string?
My guess is that you want to make sure if you have an ADJ in the string, which if so, maybe we could simplify our expression to something similar to:
([A-Z·+#_]*)\bADJ\b([A-Z·+#_]*)
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
That *? quantifier after the +ADJ+ phrase is satisfied with the empty string right after it, since the ? makes the quantifier before it match "the minimum number of times possible" and for * that is zero times.
So drop the ?, which also has no purpose for the rest of the line
perl -wE'$_=q(-XADJX-JJ+ADJ-REST-);
($before, $after) = /(.*?)[+\-]ADJ[+\-](.*)/;
say for $before,$after'
Removing the ? at the end would match the whole strings,
/[A-Z·\+#_]*?[·\+]ADJ[·\+][A-Z_·\+#]*/g
I am not entirely sure why you needed a ? in a *.

Regex: match string unless it contains a word [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (is whether it is desired)
xxxSTRING2xxx KO (is whether it is desired)
xxxSTRING3xxx KO (is whether it is desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
These all match anything that does not contain bar.
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'.
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exits. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly a word, bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b\[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b\[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
I wish to complement the accepted answer and contribute to the discussion with my late answer.
#ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

Interesting easy looking Regex

I am re-phrasing my question to clear confusions!
I want to match if a string has certain letters for this I use the character class:
[ACD]
and it works perfectly!
but I want to match if the string has those letter(s) 2 or more times either repeated or 2 separate letters
For example:
[AKL] should match:
ABCVL
AAGHF
KKUI
AKL
But the above should not match the following:
ABCD
KHID
LOVE
because those are there but only once!
that's why I was trying to use:
[ACD]{2,}
But it's not working, probably it's not the right Regex.. can somebody a Regex guru can help me solve this puzzle?
Thanks
PS: I will use it on MYSQL - a differnt approach can also welcome! but I like to use regex for smarter and shorter query!
To ensure that a string contains at least two occurencies in a set of letters (lets say A K L as in your example), you can write something like this:
[AKL].*[AKL]
Since the MySQL regex engine is a DFA, there is no need to use a negated character class like [^AKL] in place of the dot to avoid backtracking, or a lazy quantifier that is not supported at all.
example:
SELECT 'KKUI' REGEXP '[AKL].*[AKL]';
will return 1
You can follow this link that speaks on the particular subject of the LIKE and the REGEXP features in MySQL.
If I understood you correctly, this is quite simple:
[A-Z].*?[A-Z]
This looks for your something in your set, [A-Z], and then lazily matches characters until it (potentially) comes across the set, [A-Z], again.
As #Enigmadan pointed out, a lazy match is not necessary here: [A-Z].*[A-Z]
The expression you are using searches for characters between 2 and unlimited times with these characters ACDFGHIJKMNOPQRSTUVWXZ.
However, your RegEx expression is excluding Y (UVWXZ])) therefore Z cannot be found since it is not surrounded by another character in your expression and the same principle applies to B ([ACD) also excluded in you RegEx expression. For example Z and A would match in an expression like ZABCDEFGHIJKLMNOPQRSTUVWXYZA
If those were not excluded on purpose probably better can be to use ranges like [A-Z]
If you want 2 or more of a match on [AKL], then you may use just [AKL] and may have match >= 2.
I am not good at SQL regex, but may be something like this?
check (dbo.RegexMatch( ['ABCVL'], '[AKL]' ) >= 2)
To put it in simple English, use [AKL] as your regex, and check the match on the string to be greater than 2. Here's how I would do in Java:
private boolean search2orMore(String string) {
Matcher matcher = Pattern.compile("[ACD]").matcher(string);
int counter = 0;
while (matcher.find())
{
counter++;
}
return (counter >= 2);
}
You can't use [ACD]{2,} because it always wants to match 2 or more of each characters and will fail if you have 2 or more matching single characters.
your question is not very clear, but here is my trial pattern
\b(\S*[AKL]\S*[AKL]\S*)\b
Demo
pretty sure this should work in any case
(?<l>[^AKL\n]*[AKL]+[^AKL\n]*[AKL]+[^AKL\n]*)[\n\r]
replace AKL for letters you need can be done very easily dynamicly tell me if you need it
Is this what you are looking for?
".*(.*[AKL].*){2,}.*" (without quotes)
It matches if there are at least two occurences of your charactes sorrounded by anything.
It is .NET regex, but should be same for anything else
Edit
Overall, MySQL regular expression support is pretty weak.
If you only need to match your capture group a minimum of two times, then you can simply use:
select * from ... where ... regexp('([ACD].*){2,}') #could be `2,` or just `2`
If you need to match your capture group more than two times, then just change the number:
select * from ... where ... regexp('([ACD].*){3}')
#This number should match the number of matches you need
If you needed a minimum of 7 matches and you were using your previous capture group [ACDF-KM-XZ]
e.g.
select * from ... where ... regexp('([ACDF-KM-XZ].*){7,}')
Response before edit:
Your regex is trying to find at least two characters from the set[ACDFGHIJKMNOPQRSTUVWXZ].
([ACDFGHIJKMNOPQRSTUVWXZ]){2,}
The reason A and Z are not being matched in your example string (ABCDEFGHIJKLMNOPQRSTUVWXYZ) is because you are looking for two or more characters that are together that match your set. A is a single character followed by a character that does not match your set. Thus, A is not matched.
Similarly, Z is a single character preceded by a character that does not match your set. Thus, Z is not matched.
The bolded characters below do not match your set
ABCDEFGHIJKLMNOPQRSTUVWXYZ
If you were to do a global search in the string, only the italicized characters would be matched:
ABCDEFGHIJKLMNOPQRSTUVWXYZ

How to match a string that does not end in a certain substring?

how can I write regular expression that dose not contain some string at the end.
in my project,all classes that their names dont end with some string such as "controller" and "map" should inherit from a base class. how can I do this using regular expression ?
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Do a search for all filenames matching this:
(?<!controller|map|anythingelse)$
(Remove the |anythingelse if no other keywords, or append other keywords similarly.)
If you can't use negative lookbehinds (the (?<!..) bit), do a search for filenames that do not match this:
(?:controller|map)$
And if that still doesn't work (might not in some IDEs), remove the ?: part and it probably will - that just makes it a non-capturing group, but the difference here is fairly insignificant.
If you're using something where the full string must match, then you can just prefix either of the above with ^.* to do that.
Update:
In response to this:
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Not quite sure what you're attempting with the public/class stuff there, so try this:
public.*class.*(?<!controller|map)$`
The . is a regex char that means "anything except newline", and the * means zero or more times.
If this isn't what you're after, edit the question with more details.
Depending on your regex implementation, you might be able to use a lookbehind for this task. This would look like
(?<!SomeText)$
This matches any lines NOT having "SomeText" at their end. If you cannot use that, the expression
^(?!.*SomeText$).*$
matches any non-empty lines not ending with "SomeText" as well.
You could write a regex that contains two groups, one consists of one or more characters before controller or map, the other contains controller or map and is optional.
^(.+)(controller|map)?$
With that you may match your string and if there is a group() method in the regex API you use, if group(2) is empty, the string does not contain controller or map.
Check if the name does not match [a-zA-Z]*controller or [a-zA-Z]*map.
finally I did it in this way
public.*class.*[^(controller|map|spec)]$
it worked

Regex multi word search

What do I use to search for multiple words in a string? I would like the logical operation to be AND so that all the words are in the string somewhere. I have a bunch of nonsense paragraphs and one plain English paragraph, and I'd like to narrow it down by specifying a couple common words like, "the" and "and", but would like it match all words I specify.
Regular expressions support a "lookaround" condition that lets you search for a term within a string and then forget the location of the result; starting at the beginning of the string for the next search term. This will allow searching a string for a group of words in any order.
The regular expression for this is:
^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b)
Where \b is a word boundary and the ?= is the lookaround modifier.
If you have a variable number of words you want to search for, you will need to build this regular expression string with a loop - just wrap each word in the lookaround syntax and append it to the expression.
AND as concatenation
^(?=.*?\b(?:word1)\b)(?=.*?\b(?:word2)\b)(?=.*?\b(?:word3)\b)
OR as alternation
^(?=.*?\b(?:word1|word2|word3)\b
^(?=.*?\b(?:word1)\b)|^(?=.*?\b(?:word2)\b)|^(?=.*?\b(?:word3)\b)
Maybe using a language recognition chart to recognize english would work. Some quick tests seem to work (this assumes paragraphs separated by newlines only).
The regexp will match one of any of those conditions... \bword\b is word separated by boundaries word\b is a word ending and just word will match it in any place of the paragraph to be matched.
my #paragraphs = split(/\n/,$text);
for my $p (#paragraphs) {
if ($p =~ m/\bthe\b|\band\b|\ban\b|\bin\b|\bon\b|\bthat\b|\bis\b|\bare\b|th|sh|ough|augh|ing\b|tion\b|ed\b|age\b|’s\b|’ve\b|n’t\b|’d\b/) {
print "Probable english\n$p\n";
}
}
Firstly I'm not certain what you're trying to return... the whole sentence? The words in between your two given words?
Something like:
\b(word1|word2)\b(\w+\b)*(word1|word2)\b(\w+\b)*\.
(where \b is the word boundary in your language)
would match a complete sentence that contained either of the two words or both..
You'd probably need to make it case insensitive so that if it appears at the start of the sentence it will still match
Assuming PCRE (Perl regexes), I am not sure that you can do it at all easily. The AND operation is concatenation of regexes, but you want to be able to permute the order in which the words appear without having to formally generate the permutation. For N words, when N = 2, it is bearable; with N = 3, it is barely OK; with N > 3, it is unlikely to be acceptable. So, the simple iterative solution - N regexes, one for each word, and iterate ensuring each is satisfied - looks like the best choice to me.