Possible to use a back reference in a number range? - regex

I want to match a string where a number is equal or higher than a number in a capturing group.
Example:
1x1 = match
1x2 = match
2x1 = no match
In my mind the regex would look something like this (\d)x[\1-9] but this doesn't work. Is it possible to achieve this using regex?

As you've discovered, you cannot interpolate a captured value into a character class:
Because character classes are determined when the regex is compiled... The only character class regex node type is "hard-coded list of characters" that was built when the regex was compiled (not after it ran part way and figured out what $1 might end up being).
[Source]
Since character classes do not permit backreferences, a backslash followed by a number is repurposed in a character class:
A backslash followed by two or three octal digits is considered an octal number.
[Source]
This obviously isn't what you intended by [\1-9]. But since there's no way to compile a character class until all characters are known, we'll have to find another way.
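To see the repurposing in action, here is a small sketch using Python's built-in re module, chosen only for illustration (the question itself concerns RE2 and Perl). In Python, numeric escapes inside a character class are treated as character codes, so [\1-9] becomes the range from character code 1 up to "9" rather than "the captured digit up to 9":

```python
import re

# Inside [...] the \1 is not a backreference: Python reads it as the
# character with octal code 1, so the class is the range \x01 through "9".
naive = re.compile(r"(\d)x[\1-9]")

# "2x1" should be rejected, but "1" falls inside the range \x01-"9".
wrongly_accepted = naive.fullmatch("2x1") is not None

# Punctuation is accepted too, since "!" (0x21) also lies in that range.
accepts_punctuation = naive.fullmatch("2x!") is not None
```

So the pattern compiles, but it silently matches far more than intended.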
If we want to do this entirely within a regex, enumerating all possible combinations is not practical, because we'd then have to check every capture group to figure out which one matched. For example:
"1x2" =~ m/(?:(0)x(\d)|(1)x([1-9])|(2)x([2-9])|(3)x([3-9])|(4)x([4-9])|(5)x([5-9])|(6)x([6-9])|(7)x([7-9])|(8)x([89])|(9)x(9))/
This leaves "1" in $3 and "2" in $4, but you'd have to search captures 1 to 20 to find which pair matched each time.
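As a sketch of what that post-processing looks like, the enumerated pattern can be generated and its captures scanned in Python (used here only for illustration; ENUMERATED and matched_digits are hypothetical names):

```python
import re

# Build the ten alternatives (0)x([0-9]) | (1)x([1-9]) | ... | (9)x([9-9]).
ENUMERATED = re.compile("|".join(rf"({d})x([{d}-9])" for d in range(10)))

def matched_digits(text):
    """Return the (first, second) digits if text matches, else None."""
    m = ENUMERATED.fullmatch(text)
    if m is None:
        return None
    # Only one alternative takes part in a match, so exactly two of the
    # twenty groups are set; the rest are None and must be skipped.
    first, second = (g for g in m.groups() if g is not None)
    return first, second
```

The scan over m.groups() is the "search captures 1 to 20" step described above.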
The only way to avoid post-processing the regex results is a regex conditional, (?(A)X), where A is the condition and X is the action taken when it holds.
Sadly conditionals are not supported by RE2, but we'll keep going just to demonstrate it can be done.
What you'd want to use for the X is (*F) (or (?!) in Ruby 2+) to force failure: http://www.rexegg.com/regex-tricks.html#fail
What you'd want to use for the A is ?{$1 > $2}, but only Perl will allow you to use code directly in a regex. Perl would allow you to use:
m/(\d)x(\d)(?(?{$1 > $2})(?!))/
[Live Example]
So the answer to your question is: "No, you cannot do this with RE2 which Google Analytics uses, but yes you can do this with a Perl regex."
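In engines without embedded code, the comparison has to happen outside the regex. A minimal sketch in Python, assuming the whole input is a single "AxB" pair of digits; PAIR and second_at_least_first are hypothetical names:

```python
import re

PAIR = re.compile(r"(\d)x(\d)")

def second_at_least_first(text: str) -> bool:
    """True when text is 'AxB' with single digits and B >= A."""
    m = PAIR.fullmatch(text)
    if m is None:
        return False
    # The regex only extracts the digits; the comparison is ordinary code.
    return int(m.group(2)) >= int(m.group(1))
```

This mirrors the Perl pattern's logic, with the $1 > $2 test moved out of the regex.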

Related

Regex for numbers including decimal points

I'm still trying to pick up regex so I can't seem to figure this one out. I want to be able to match any type of number including things like
0.2
.1243
1.
-0.34
+033.98274E-10
-.1e+004
I have created the following regex which matches all of these: [+-]?[0-9+\.]+([E][+-]?[0-9]+)?, however this also matches single decimal points such as if I had something like param.attribute, it would pick up on that decimal point. How can I get around this? I thought that in the part [0-9+\.] the + would require that the string contain at least one numeric value.
You may use alternation to make sure either 1. or .1 is matched. Avoid making all subpatterns optional if you do not want to end up with a single period matched:
[-+]?(?:[0-9]+\.?[0-9]*|[0-9]*\.?[0-9]+)(?:[eE][-+]?[0-9]+)?
^--- Alternative 1 | Alternative 2-^
See regex demo
More "fun" facts about alternation in regular expressions:
You can use alternation to match a single regular expression out of several possible regular expressions.
If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.
The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for grouping.
And here is my 5 cents: to keep the regex match as clean as possible and unless you need to access any of the alternatives after a match is found, use non-capturing groups (i.e. (?: ... )) with alternations.
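The alternation pattern above can be checked against the examples from the question. A short sketch in Python; NUMBER and samples are names introduced here for illustration:

```python
import re

# Either digits with an optional fraction ("1.", "0.2") or an optional
# integer part followed by required fraction digits (".1243").
NUMBER = re.compile(
    r"[-+]?(?:[0-9]+\.?[0-9]*|[0-9]*\.?[0-9]+)(?:[eE][-+]?[0-9]+)?"
)

samples = ["0.2", ".1243", "1.", "-0.34", "+033.98274E-10", "-.1e+004"]

# Every sample should be matched in full.
all_matched = all(NUMBER.fullmatch(s) is not None for s in samples)

# A lone period between identifiers should not be picked up.
lone_dot = NUMBER.search("param.attribute")
```

Because both alternatives require at least one digit, the bare "." in param.attribute no longer matches.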
Here's a proposition:
[+-]?([0-9]+\.?[0-9]*|0?\.[0-9]+)([Ee][+-]?[0-9]+)?
See the demo here

positive look ahead and replace

Recently I've been writing and testing regexps on https://regex101.com/.
My question is: is it possible to do a positive look-ahead AND a replacement in the same "replacement"? Or is only a limited kind of replacement possible?
The input is several lines with phone numbers. Let's say a phone number is correct when it contains exactly 11 digits, no matter how the digits are grouped with - / ( ) characters, and no matter whether it starts with + or 00 or the prefix is omitted.
Some example lines:
+48301234567
+48/30/1234567
+48-30-12-345-67
+483011223344556677
0048301234567
+(48)30/1234567
A positive look-ahead can check that, from the beginning to the end of the line, there are exactly 11 digits, regardless of how many of the separator characters above appear between them. This works perfectly.
Where the look-ahead check succeeds, I would like to delete every character except the digits. The replacement works fine as long as I don't involve the look-ahead.
The validation regexp on its own works perfectly ("gm" modes):
^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$
The replace part on its own works perfectly (replace with nothing):
[^\d\n]
Putting the check into a look-ahead, followed by the deletion of characters that are neither newlines nor digits on the matching lines:
(?=^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$)[^\d\n]
Even though I put the ^ and $ into the look-ahead, the replacement seems to work only from the beginning of each line up to the very first digit.
I know in real life the replacement and the check should/would go separate ways, however I'm curious if I could mix look-ahead/look-behind with string operations like replace, delete, take the string apart and put together as I like.
UPDATE: This is what would do the trick, however I feel this one "ugly" a bit. Is there any prettier solution?
https://regex101.com/r/yT5dA4/2
Or the version which I asked originally, where only digits remains: regex101.com/r/yT5dA4/3
A regex by itself cannot replace or delete text. It is just a tool for matching certain strings; the surrounding program then takes some action depending on the matched text, e.g. performing a substitution or retrieving the second capture group.
However it is possible to perform certain decisions within a regex engine, by using conditionals. The common syntax for this, with a lookahead assertion, is (?(?=regex)then|else).
With conditionals you can change the behaviour depending on how the text matches the regex. For your example you could do something like:
^(\+)?(?(1)\(|\d)
If the phone number starts with a plus it must be followed by a bracket, else it should start with a digit. Although in your situation, this is not very useful.
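Python's re module supports the group-reference form of this conditional, (?(1)yes|no), so the pattern above can be tried directly. A minimal sketch; PREFIX and starts_ok are hypothetical names:

```python
import re

# If group 1 (the "+") took part in the match, require "(" next;
# otherwise require a digit.
PREFIX = re.compile(r"^(\+)?(?(1)\(|\d)")

def starts_ok(number: str) -> bool:
    """True when the start of number satisfies the conditional."""
    return PREFIX.match(number) is not None
```

With this, "+(48)30/1234567" and "0048301234567" pass, while "+48301234567" does not, because a plus must be followed by a bracket.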
If you want to read up more on conditionals in regex you can do so here.
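For the original phone-number task, the ^ inside the look-ahead can only succeed at the very start of a line, which would explain why only a leading character gets removed. Keeping the check and the clean-up as two separate steps avoids the problem. A sketch in Python, reusing the question's validation pattern unchanged; VALID, NON_DIGIT and normalize are hypothetical names:

```python
import re

# Step 1: the question's validation pattern (exactly 11 digits, optional
# "+" or "00" prefix, optional separator before each digit).
VALID = re.compile(r"^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$")

# Step 2: anything that is not a digit gets removed.
NON_DIGIT = re.compile(r"\D")

def normalize(lines):
    """Strip non-digits from valid lines; leave invalid lines untouched."""
    return [
        NON_DIGIT.sub("", line) if VALID.match(line) else line
        for line in lines
    ]
```

Each line is validated once as a whole, and the deletion then runs over every character of that line rather than only the position where the look-ahead happened to succeed.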

Sublime Text find and replace "foo" across all situations and combinations except when it becomes another word ie. "foobar"

I know this is an elementary RegEx possibility, but I can't seem to determine the right expression to use.
What I am looking to do is find & replace "foo" and only "foo" within a set of different situations, such as abc_foo, abc_foo[something], abc-foo-something, and all other combinations, except when it is part of another word like "foobar". The basic 'whole word' search function was close, but it doesn't help when variables and underscores are factored in.
It's actually not that elementary to match a string which does not contain word characters around itself:
If your regex engine supports negative lookbehind (not all of them do), it is simple:
(?<!\w)foo(?!\w)
However, there is a workaround to match the string with surrounding non-word characters (including _ which is a word character but you want to treat it as non-word) and use capturing groups to sort it all out:
(^|[\W_])foo([\W_]|$)
Debuggex Demo
e.g. in javascript syntax:
str.replace(/(^|[\W_])foo([\W_]|$)/g, "$1replacement$2");
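One limitation of the capturing-group workaround is that it consumes the delimiter on each side, so two occurrences separated by a single character (as in "foo foo") cannot both match in one pass. Where lookbehind is available, a variant of the first pattern that also treats _ as a boundary avoids this. A sketch in Python; WHOLE_FOO and replace_foo are hypothetical names:

```python
import re

# [^\W_] means "a word character other than underscore" (letter or digit).
# foo must not be preceded or followed by one of those.
WHOLE_FOO = re.compile(r"(?<![^\W_])foo(?![^\W_])")

def replace_foo(text: str, replacement: str) -> str:
    """Replace standalone 'foo', treating '_' like any other separator."""
    return WHOLE_FOO.sub(replacement, text)
```

Since lookarounds consume nothing, no $1/$2 bookkeeping is needed in the replacement string.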
You can use a negative lookahead assertion to do this. Using regex search, foo(?!bar) will match any instance of foo not followed by bar, and the following text is not part of the match, only foo is.

Regex for nested matches

Consider the string
cos(t(2))+t(51)
Using a regular expression, I'd like to match cos(t(2)), t(2) and t(51). The general pattern this fits is intended to be something like
variable or function name + opening_parenthesis + contents + closing_parenthesis,
where contents can be any expression that has an equal number of opening and closing parentheses.
I'm using [a-zA-Z]+\([\W\w]*\) which returns cos(t(2))+t(51), which of course is not the desired result.
Any ideas on how to achieve this using regex? I'm particularly stuck at this "equal number of opening and closing parentheses".
Niels, this is an interesting and tricky question because you are looking for overlapping matches. Even with recursion, the task is not trivial.
You asked about any idea how to achieve this with regex, so it sounds like even if this is not available in matlab, you would be interested in seeing an answer that shows you how to do it in regex.
This makes sense to me because tools often change the regex libraries they use. For instance Notepad++, which used to have crippled regex, switched to PCRE in version 6. (As it happens, PCRE would work with this solution.)
In Perl and PCRE, you can use this short regex:
(?=(\b\w+\((?:\d+|(?1))\)))
This will match:
cos(t(2))
t(2)
t(51)
For instance, in php, you could use this code (see the results at the bottom of the online demo).
$regex = "~(?=(\b\w+\((?:\d+|(?1))\)))~";
$string = "cos(t(2))+t(51)";
$count = preg_match_all($regex,$string,$matches);
print_r($matches[1]);
How does it work?
To allow overlapping matches, we use a lookahead. That way, after matching cos(t(2)), the engine will position itself NOT after cos(t(2)), but before the o in cos.
In fact the engine does not actually match cos(t(2)) but merely captures it to Group 1. What it matches is the assertion that at this position in the string, looking ahead, we can see x. After matching this assertion, it tries to match it again starting from the next position in the string.
The expression in the lookahead (which describes what we're looking for) is fairly simple: in (\b\w+\((?:\d+|(?1))\)), after the \d+, the alternation | allows us to repeat subroutine number one with (?1), which is to say, the whole expression we are currently within. So we don't recurse the entire regex (which includes a lookahead), but a subexpression thereof.
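Where recursion is not available (Python's built-in re module, for instance, has no (?1)), a different technique can stand in for it: find each "name(" with a plain regex and then count parentheses by hand. This is a substitute, not the recursive regex itself, and it accepts any balanced contents rather than only digits or nested calls. NAME_OPEN and nested_calls are hypothetical names:

```python
import re

# A function or variable name immediately followed by "(".
NAME_OPEN = re.compile(r"[A-Za-z_]\w*\(")

def nested_calls(expr: str):
    """Return every name(...) substring whose parentheses balance."""
    results = []
    for m in NAME_OPEN.finditer(expr):
        depth = 0
        # Start at the "(" that ends the match and walk to its partner.
        for i in range(m.end() - 1, len(expr)):
            if expr[i] == "(":
                depth += 1
            elif expr[i] == ")":
                depth -= 1
                if depth == 0:
                    results.append(expr[m.start():i + 1])
                    break
    return results
```

Calling nested_calls on the question's string yields the same three overlapping pieces that the lookahead captures.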

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^((?!some text).)*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
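The same pattern also works in Python's re module. A small sketch that filters a list of lines; NO_ANDREA and lines_without_andrea are hypothetical names:

```python
import re

# At every position, first assert that "Andrea" does not start here,
# then consume one character; repeat to the end of the line.
NO_ANDREA = re.compile(r"^(?:(?!Andrea).)*$")

def lines_without_andrea(lines):
    """Keep only the lines that do not contain 'Andrea' anywhere."""
    return [line for line in lines if NO_ANDREA.match(line)]
```

For a fixed literal like this, a plain substring test ("Andrea" not in line) gives the same answer more cheaply; the regex form is mainly useful when the excluded part is itself a pattern.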
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile(r'(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character": a letter, digit, or underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
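That program-logic approach might look like the following in Python. This is a sketch of the second variant (check for six letters first, then rule out Andrea), assuming "six letters" means ASCII letters and that any occurrence of "Andrea" disqualifies the line; SIX_LETTERS and wanted are hypothetical names:

```python
import re

SIX_LETTERS = re.compile(r"[A-Za-z]{6}")

def wanted(line: str) -> bool:
    """True when line has a run of six letters and no 'Andrea'."""
    has_six = SIX_LETTERS.search(line) is not None
    # The "inverse" part is ordinary program logic, not regex.
    return has_six and "Andrea" not in line
```

Keeping the negation outside the pattern leaves each check simple enough to read at a glance.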
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
(?! is useful in practice, although strictly speaking a pattern that uses lookahead is not a regular expression as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if ($line !~ /Andrea/);