How to continue a match in Regex

price:(?:(?:\d+)?(?:\.)?\d+|min)-?(?:(?:\d+)?(?:\.)?\d+|max)?
This Regex matches the following examples correctly.
price:1.00-342
price:.1-23
price:4
price:min-900.00
price:.10-.50
price:45-100
price:453.23-231231
price:min-max
Now I want to improve it to match these cases.
price:4.45-8.00;10.45-14.50
price:1.00-max;3-12;23.34-12.19
price:1.00-2.50;min-12;23.34-max
Currently the match stops at the semicolon. How can I get the regex to repeat across the semicolon dividers?
Final Solution:
price:(((\d*\.)?\d+|min)-?((\d*\.)?\d+|max)?;?)+

Add an optional ; at the end, and make the whole pattern match one or more times:
price:((?:(?:\d+)?(?:\.)?\d+|min)-?(?:(?:\d+)?(?:\.)?\d+|max)?;?)+

(?:\d+)? is the same thing as \d*, and (?:\.)? can just be \.?. Simplified, your original regex is:
price:(?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?
You have two choices. You can either do price([:;]range)* where range is the regex you have for matching number ranges, or be more precise about the punctuation but have to write out range twice and do price:range(;range)*.
price([:;]range)* -- shorter but allows first ':' to be ';'
price:range(;range)* -- longer but gets colon vs semi-colon correct
Pick one of these two regexes:
price(?:[:;](?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?)*
price:(?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?(?:;(?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?)*
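As a rough sketch of the price:range(;range)* shape in Python, the range sub-pattern can be kept in one string and interpolated twice, so it only has to be written once. The names RANGE, PRICE and is_price are made up for this illustration.

```python
import re

# Hypothetical names; RANGE is the simplified range regex from above.
RANGE = r"(?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?"
# One range after the colon, then any number of ;-separated ranges.
PRICE = re.compile(rf"price:{RANGE}(?:;{RANGE})*")

def is_price(text):
    """Return True if the whole string is 'price:' plus ;-separated ranges."""
    return PRICE.fullmatch(text) is not None

print(is_price("price:1.00-2.50;min-12;23.34-max"))
print(is_price("price;4"))
```

Using fullmatch rather than search means a string only passes if every range, including the last one, is well formed.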

First there are some issues with your regular expression: to match xx.yyy instead of the expression (?:\d+)?(?:\.)?\d+ you can use this (?:\d*\.)?\d+. This can only match in one way so it avoids unnecessary backtracking.
Also currently your regular expression matches things like price:minmax and price:1.2.3 which I assume you do not want to match.
The simple way to repeat your match is to add a semi-colon and then repeat your regular expression verbatim.
You can do it like this, though, to avoid writing out the entire regular expression twice:
price:(?:(?:(?:\d*\.)?\d+|min)(?:-(?:(?:\d*\.)?\d+|max))?(?:;|$))*
See it in action on Rubular.
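A quick way to check the pattern above against the sample inputs is a small Python script. This is only a sketch; the sample list and the variable names are chosen for illustration.

```python
import re

# The pattern from the answer above, unchanged.
pattern = re.compile(
    r"price:(?:(?:(?:\d*\.)?\d+|min)(?:-(?:(?:\d*\.)?\d+|max))?(?:;|$))*"
)

samples = [
    "price:4.45-8.00;10.45-14.50",
    "price:1.00-max;3-12;23.34-12.19",
    "price:minmax",
    "price:1.2.3",
]

# fullmatch requires the whole string to be consumed by the pattern.
results = {s: pattern.fullmatch(s) is not None for s in samples}
for sample, ok in results.items():
    print(sample, "->", ok)
```

Note that because the repeated group is followed by *, a bare "price:" also passes, and a trailing semicolon such as "price:4;" is accepted. Change * to + or post-check the string if that matters.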

price:((?:(?:\d+)?(?:\.)?\d+|min)-?(?:(?:\d+)?(?:\.)?\d+|max)?;?)+
I'm not sure what's up with all of the ?'s (I know the syntax, I just don't know why you're using it so much), but that should do it for you.

Related

Regex to match one of any terms, some terms with spaces

I'm trying to write a RegEx that matches one of several terms, as part of a spam filter. The problem is, some of these terms contain spaces, and I'm having trouble writing a valid expression.
What I originally had (before multiple-word terms) was this:
(?i)(alzheimers|baldness|obese)
Now, I want to add, for example "blood pressure", but the following expression is chucking a barny:
(?i)(alzheimers|baldness|blood pressure|obese)
You can have whitespace characters in an either-or group, your expression works. Check it out for yourself:
https://regex101.com/r/56tz6B/1
Your expression should also match "blood pressure" without any problems.
Could you try to use \s+ instead of the space character and see if it works? Please note that this would also match any whitespace (tabs, new lines etc.).
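For illustration, here is how the \s+ variant might look in Python. The term list is the one from the question; the names SPAM and spam_terms are hypothetical, and the flag is passed as re.IGNORECASE instead of the inline (?i).

```python
import re

# \s+ lets "blood pressure" match across any run of whitespace
# (spaces, tabs, newlines), not just a single space.
SPAM = re.compile(r"(alzheimers|baldness|blood\s+pressure|obese)", re.IGNORECASE)

def spam_terms(text):
    """Return every listed term found in text, in order of appearance."""
    return [m.group(1) for m in SPAM.finditer(text)]

print(spam_terms("Lower your Blood  Pressure and cure baldness"))
```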

Is there any upper limit for number of groups used or the length of the regex in Notepad++?

I am new to using regex. I am trying to use the regex find and replace option in Notepad++.
I have used the following regex:
((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))(/)((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))
For the following text:
2/2
+2/+2
-2/-2
2+/2+
2-/2-
But I am able to get matches only for the first three. For the last two, it gives only partial matches, excluding the last "+" and "-". I am wondering if there is any upper limit on the number of groups that can be used (which I doubt), or on the maximum length of the regex. I am not sure why my regex is failing. If there is anything wrong with my regex, please correct it.
This is not an issue with Notepad++'s regex engine. The problem is that when you have alternations like (?:)|(\+)|(-), the regex engine will attempt to match the different options in the order they are specified. Since you specified an empty group first, it will attempt to match an empty string first, only matching the + or - if it needs to backtrack. This essentially makes the alternation lazy—it will never match any character unless it has to.
vks's answer works perfectly well, but just in case you actually needed those capturing groups separated out, you can do the same thing just by rewriting your alternations like this:
((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))(/)((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))
or even more simply, like this:
((\+)|(-)|)(\d)((\+)|(-)|)(/)((\+)|(-)|)(\d)((\+)|(-)|)
You can use this simple regex to match all cases:
([-+]?)(\d)([-+]?)(/)([-+]?)(\d)([-+]?)
See here:
https://www.regex101.com/r/fG5pZ8/19
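The effect of alternation order can be reproduced outside Notepad++. The sketch below uses Python's re module, which tries alternatives left to right in the same way; the variable and function names are made up for the example.

```python
import re

# Empty alternative first: the group prefers to match nothing.
original = r"((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))(/)((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))"
# Empty alternative last: the sign is tried before falling back to nothing.
reordered = r"((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))(/)((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))"
# Optional character class: same matches, fewer groups.
simple = r"([-+]?)(\d)([-+]?)(/)([-+]?)(\d)([-+]?)"

def matched_text(pattern, line):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, line)
    return m.group(0) if m else None

for line in ["2/2", "+2/+2", "-2/-2", "2+/2+", "2-/2-"]:
    print(line, [matched_text(p, line) for p in (original, reordered, simple)])
```

With the original pattern the final group is satisfied by the empty alternative, so nothing forces the engine to go back and pick up the trailing sign.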

matching in between a long sentence with keywords

target sentence:
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system;$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host;$(SolDir)..\..\ABC\ccc\1234\components\fds\ab_cdef_1.0\host; $(SolDir)..\..\ABC\ccc\1234\somethingelse;
How should I construct my regex to extract the items that contain "..\..\ABC\ccc\1234\ccc_am_system"?
Basically, I want to extract all of these folders, and maybe more; they are all under \ABC\ccc\1234\ccc_am_system:
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host\abc;
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host\123\123\123\123;
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host;
My current regex doesn't work and I can't figure out why:
\$.*ccc\\1234\.*;
Your problem is most likely that * is a greedy operator. It's greedily matching more than you intend it to. In many regex dialects, *? is the reluctant operator. I would first try using it like this:
\$.*?ccc\\1234.*?;
You can read up a bit more on greedy vs reluctant operators in this question.
If that doesn't work, you can try to be more specific about the characters you match than the dot (.). For example, you can match every non-semicolon character with an expression like this: [^;]*. You could use that idea this way:
\$[^;]*ccc\\1234[^;]*;
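To see why the negated class can be safer than a reluctant quantifier, here is a small Python sketch. The input is a shortened, made-up path list (not the original one), and the patterns are extended with \\ccc_am_system so that only the wanted folder qualifies.

```python
import re

# Hypothetical two-item list: only the second item is under ccc_am_system.
text = r"$(D)..\ABC\ccc\1234\other;$(D)..\ABC\ccc\1234\ccc_am_system\host;"

# Reluctant version: .*? is still allowed to run across a ';' if that is
# the only way to reach the required text.
lazy = re.compile(r"\$.*?ccc\\1234\\ccc_am_system.*?;")
# Negated class: [^;]* can never cross an item boundary.
negated = re.compile(r"\$[^;]*ccc\\1234\\ccc_am_system[^;]*;")

print(lazy.findall(text))
print(negated.findall(text))
```

The reluctant pattern starts at the first $ and swallows both items as one match, while the negated-class pattern returns only the second item.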
The below regex would store the captured strings inside group 1.
(\$.*?ccc\\1234\\.*?;)
You need to make the * quantifier do a shortest match by adding ? right after it. Also, the \.* in your regex matches a literal dot zero or more times, which is not what you want.
I found this to be the best:
\$(.[^\$;])*ccc\\1234(.[^\$;])*;
It doesn't allow any overmatching whatsoever. If I use ?, it still matches $ or ; more than once for some reason, but with the above expression that will never be the case. Still, thanks to all those who took the time to answer my question.

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
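As a minimal sketch, the two expressions can be tried in Python like this. The function names are invented for the example.

```python
import re

def before_hash(s):
    """Everything up to, but not including, the first '#'."""
    # [^#]* can match the empty string, so this match never fails.
    return re.match(r"^([^#]*)", s).group(1)

def after_hash(s):
    """Everything after the first '#', or None if there is no '#'."""
    m = re.search(r"#(.*)$", s)
    return m.group(1) if m else None

s = "topics/install.xml#id_install"
print(before_hash(s))
print(after_hash(s))
```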
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE. If so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for a missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-capturing group to test the second half, and if it finds it, returns everything after the pound.
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
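The optional-section pattern can be sketched in Python as follows. PARTS and split_at_hash are hypothetical names, and re.DOTALL is added here as an assumption so that text after the # may contain newlines.

```python
import re

# Group 1: everything before the first '#'.
# Group 2: everything after it, or None when there is no '#'.
PARTS = re.compile(r"([^#]*)(?:#(.*))?", re.DOTALL)

def split_at_hash(s):
    """Return (before, after); after is None if s has no '#'."""
    m = PARTS.fullmatch(s)
    return m.group(1), m.group(2)

print(split_at_hash("topics/install.xml#id_install"))
print(split_at_hash("topics/install.xml"))
```

In Python, s.partition("#") gives the same split without a regex, which is the analogue of the Split suggestion.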
More on lookahead and lookbehind
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
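Here is a small sketch of that pattern used as a line filter in Python. The sample lines and names are invented for the example.

```python
import re

# Matches a whole line only if "Andrea" does not start at any position in it.
NO_ANDREA = re.compile(r"^(?:(?!Andrea).)*$")

lines = ["Hello Andrea", "six letters", "", "andrea in lowercase"]

# Keep the lines that do not contain "Andrea" (case-sensitive).
kept = [line for line in lines if NO_ANDREA.match(line)]
print(kept)
```

For a plain substring like this, "Andrea" not in line does the same job without the per-character lookahead; the regex form is mainly useful when the excluded part is itself a pattern.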
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile(r'(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character": a letter, a digit, or the underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
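The program-logic approach might look like the sketch below. It assumes one reading of the question (a line qualifies if it contains a six-letter word other than "Andrea", ignoring case), and the names are hypothetical.

```python
import re

# Step 1: find standalone words of exactly six ASCII letters.
SIX_LETTERS = re.compile(r"\b[A-Za-z]{6}\b")

def has_other_six_letter_word(line):
    """True if the line has a six-letter word that is not 'Andrea'."""
    # Step 2: the exclusion is done in ordinary code, not in the regex.
    return any(word.lower() != "andrea" for word in SIX_LETTERS.findall(line))

print(has_other_six_letter_word("Andrea called"))
print(has_other_six_letter_word("Andrea ran"))
```

Keeping the exclusion in ordinary code makes it easy to extend to a whole list of excluded words.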
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
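A tiny illustration of the re.sub idea, with a made-up pattern and input:

```python
import re

def not_matched(regex, string):
    """Return what is left of string after deleting every match of regex."""
    return re.sub(regex, "", string)

# Removing all digit runs leaves only the non-matching text.
leftover = not_matched(r"\d+", "abc123def456")
print(leftover)
```

The pieces on either side of a removed match are joined together, so the positions of the original non-matching segments are lost.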
(?! is useful in practice, although strictly speaking, lookahead is not part of regular expressions as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if ($line !~ /Andrea/);