How come this RegEx isn't working quite right? - regex

I have this RegEx here:
/^function(\d)$/
It matches function(5) but not function(55). How come?

The other posters are correct about the +, but what language are you using to parse the regular expression? Shouldn't you have to escape the ()? Unescaped, they form a capture group around the digit(s) rather than matching literal parentheses.
I would think you would need...
/^function\(\d+\)$/

/^function(\d+)$/
You need to add the + so that \d matches one or more digits. The + is greedy, so it matches as many digits as possible. (Assuming that is what you are after, as it would probably match
function(3242345235234235235234234234535325234235235234523) as well as function(55).)
Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only once.
referring to +
http://www.regular-expressions.info/reference.html

Because you only gave it one \d. If you want to match more than one digit, tell it so.
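To see the difference between \d and \d+, and the effect of escaping the parentheses, here is a minimal sketch. It assumes Python's re module, since the question does not name a language, and the variable names are illustrative only.

```python
import re

# Exactly one digit between literal (escaped) parentheses.
one_digit = re.compile(r"^function\(\d\)$")
# One or more digits between literal parentheses.
many_digits = re.compile(r"^function\(\d+\)$")
# Unescaped parentheses form a capture group; they do not match "(" and ")".
unescaped = re.compile(r"^function(\d+)$")

for text in ["function(5)", "function(55)", "function55"]:
    print(
        text,
        bool(one_digit.match(text)),
        bool(many_digits.match(text)),
        bool(unescaped.match(text)),
    )
```

In this engine the unescaped version only matches function55, which is why the escaping question matters.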

Related

Regex for *either or both* of last two characters are not digits?

I can't figure out the proper regular expression for this... Most of my data ends with digits as the last two characters. In a subset, either one or both of the last two characters are non-digits. So xyz99 is normal and I'm able to find those records with "*[0-9][0-9]$". If I change that to "*[^0-9][^0-9]$" then I get records where both are non-digits.
I don't know regex well enough to match all of the following with a single regex: xy9z, xyz9, xyzw, but not matching xyz99.
I prefer a single regex, but (already know how to and) can work-around with multiple.
Thanks for any help.
[^\d]$|[^\d].$
should do the trick
https://regex101.com/r/PsZxLj/2
It matches anything that doesn't end in a digit OR anything where the 2nd to last character isn't a digit. Lots of ways to do this, but pick one that is easy for you to read and maintain. :) Good luck!
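As a quick check of the alternation against the examples from the question, here is a small sketch in Python (an assumption; the asker is actually using SQL regexp_like). The names are illustrative.

```python
import re

# Either the last character is a non-digit,
# or the second-to-last character is a non-digit.
not_two_digits = re.compile(r"[^\d]$|[^\d].$")

samples = ["xy9z", "xyz9", "xyzw", "xyz99"]
matched = [s for s in samples if not_two_digits.search(s)]
print(matched)  # ['xy9z', 'xyz9', 'xyzw']
```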
Something like this should do it:
(?:\d[^\d]|[^\d].)$
You could use a negative lookahead (?!...).
(?![0-9]{2}).{2}$
This will first make sure that [0-9]{2} (2 digits) does not match. Then proceeds to match the remaining regex, which matches any 2 characters .{2} followed by the end of the string $.
Regexper
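The lookahead version can be checked the same way. This is a sketch in Python, which is an assumption; any engine with negative lookahead should behave alike for these inputs.

```python
import re

# .{2}$ pins the match to the last two characters;
# (?![0-9]{2}) rejects that position if both of them are digits.
ends_without_two_digits = re.compile(r"(?![0-9]{2}).{2}$")

samples = ["xy9z", "xyz9", "xyzw", "xyz99"]
results = {s: bool(ends_without_two_digits.search(s)) for s in samples}
print(results)
```

Only xyz99 is rejected. Note that a string shorter than two characters never matches, because .{2} needs two characters to consume.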
Thank you all for the quick and helpful responses. A couple of the references above that start with "(?" are beyond what I understand so far. But the "or" operator is what I was missing. Here is what I ended up using (and it worked): select * from mytable where regexp_like( myfield, '.*([^0-9]|[^0-9][0-9])$' );

Having difficulty understanding regex backtracking

I was browsing through the regex-tagged questions on SO when I came across this problem:
A regex for a url was needed, the url begins with domain.com/advertorials/
The regex should match the following scenarios,
domain.com/advertorials
domain.com/advertorials?test=true
domain.com/advertorials/
domain.com/advertorials/?test=true
but not this,
domain.com/advertorials/version1?test=true
I came up with this regex advertorials\/?(?:(?!version)(.*))
This should work, but it doesn't for the last case. Looking at the debugger in regex101.com,
I see that after matching 's/' it matches the word 'version' character by character and ultimately matches, but since this is a negative lookahead the condition fails. And this is the part I don't understand: after failing, it backtracks to before the '/' in 's/' and not after 's/'.
Is this how it's supposed to work? Can anyone help me understand?
(here's the demo link: https://regex101.com/r/ww3HR8/1).
Thanks,
Note: People already gave their solutions to that problem; I just want to know why my regex fails.
The backtracking mechanism is in charge of this phenomenon, as you have already pointed out.
The ? quantifier, matching 1 or 0 repetitions of the quantified subpattern, lets the regex engine match the string in two ways: either by matching the quantified subpattern, or by skipping it and going on to match the string with the subsequent subpattern.
So, advertorials/?(?!version)(.*) (I removed the redundant (?:...) non-capturing group), when applied to domain.com/advertorials/version1?test=true, matches advertorials, then matches /, and then the negative lookahead checks whether the substring version appears immediately to the right of the current position. Since there is version after /, the lookahead fails, so the regex engine goes back and sees that the /? pattern can also match an empty string. So, the lookahead check is re-applied straight after advertorials. Immediately after advertorials comes /version, not version, so the lookahead succeeds and the match is returned.
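The backtracking described above can be observed directly. This is a minimal sketch using Python's re module (an assumption; the original demo is on regex101). The captured group begins with the slash, which shows that /? ended up matching the empty string.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r"advertorials\/?(?:(?!version)(.*))")

url = "domain.com/advertorials/version1?test=true"
m = original.search(url)

print(m.group(0))  # advertorials/version1?test=true
print(m.group(1))  # /version1?test=true  (the "/" was handed over to .*)
```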
The usual solution is using possessive quantifiers or atomic groups, but there are other approaches, too.
E.g.
advertorials\/?+(?!version)(.*)
              ^^
See the regex demo. Here, \/?+ matches 1 or 0 / chars, but once it matches, the engine cannot go back and re-match a part of a string with this pattern.
Or, you may include the /? in the lookahead and place it before /? pattern:
advertorials(?!\/?version)\/?(.*)
See another regex demo.
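Here is a sketch that runs the lookahead-first pattern against all five URLs from the question. Python is an assumption here, and the list names are illustrative. The possessive \/?+ variant is not used because it needs engine support (Python's re only accepts it from version 3.11 on).

```python
import re

# The optional slash is checked inside the lookahead, so giving the
# slash back during backtracking no longer helps the match succeed.
fixed = re.compile(r"advertorials(?!\/?version)\/?(.*)")

should_match = [
    "domain.com/advertorials",
    "domain.com/advertorials?test=true",
    "domain.com/advertorials/",
    "domain.com/advertorials/?test=true",
]
should_not_match = ["domain.com/advertorials/version1?test=true"]

for url in should_match + should_not_match:
    print(url, bool(fixed.search(url)))
```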
If you plan to disallow version anywhere after advertorials use
advertorials(?!.*version)\/?(.*)
See yet another demo.
Making the slash optional means there is a way to match without violating the constraint. If there is a way to match, the regex engine will find it, always.
Make the slash non-optional when it's followed by anything at all.
advertorials(?:/(?!version).*)?$
Incidentally, regex itself doesn't require the slash to be backslash-escaped (though some host languages use slashes as regex delimiters, so maybe you need to put it back). I also removed some redundant parentheses.
The reason:
The \/? part of the pattern is optional:
advertorials\/?(?:(?!version)(.*))
Therefore it can also be advertorials(?:(?!version)(.*))
which matches advertorials/version
Essentially, (?!version)(.*) matches /version
Btw, this is normal backtracking by 1 character.
If you have already fixed it, then we're done !

Regular Expression to match two words near each other on a single line

Hi I am trying to construct a regular expression (PCRE) that is able to find two words near each other but which occur on the same line. The near examples generally provided are insufficient for my requirements as the "\W" obviously includes new lines. I have spent quite a bit of time trying to find an answer to this and have thus far been unsuccessful. To exemplify what I have so far, please see below:
(?i)(?:\b(tree)\b)\W+(?:\w+\W+){0,5}?\b(house)\b.*
I want this to match on:
here is a tree with a house
But not match on
here is a tree
with a house
Any help would be greatly appreciated!
How about
\btree\b[^\n]+\bhouse\b
Just add a negative lookahead so that \W matches any non-word character except a newline.
(?i)(?:\b(tree)\b)(?:(?!\n)\W)+(?:\w+\W+){0,5}?\b(house)\b.*
DEMO
Dot matches anything except newlines, so just (note that {1,5} here limits the gap to five characters, not five words):
(?i)\btree\b.{1,5}\bhouse\b
Note it is impossible for there to be zero characters between the two words, because then they wouldn't be two words - they would be the one word and the \b wouldn't match.
Just replace \W with [^\w\r\n] in your regex:
(?i)(?:\b(tree)\b)[^\w\r\n]+(?:\w+[^\w\r\n]+){0,5}?\b(house)\b.*
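To compare the original \W pattern with the [^\w\r\n] replacement, here is a small sketch. It uses Python's re module rather than PCRE, which is an assumption; both patterns use only syntax that the two engines share.

```python
import re

# Original pattern: \W also matches "\n", so it can cross a line break.
across_lines = re.compile(
    r"(?i)(?:\b(tree)\b)\W+(?:\w+\W+){0,5}?\b(house)\b.*"
)
# Replacement: non-word characters excluding carriage return and newline.
same_line = re.compile(
    r"(?i)(?:\b(tree)\b)[^\w\r\n]+(?:\w+[^\w\r\n]+){0,5}?\b(house)\b.*"
)

one_line = "here is a tree with a house"
two_lines = "here is a tree\nwith a house"

print(bool(across_lines.search(two_lines)))  # True  (unwanted)
print(bool(same_line.search(one_line)))      # True
print(bool(same_line.search(two_lines)))     # False
```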
To get the closest matches of both words on the same line, an option is to use a negative lookahead:
(?i)(\btree\b)(?>(?!(?1)).)*?\bhouse\b
By default, the . (dot) does not match a newline (it only does with the s / DOTALL modifier).
(?>(?!(?1)).)*? matches as few characters as possible, none of which may start another \btree\b.
(?1) pastes the first parenthesized pattern.
Example at regex101.com; Regex FAQ
Maybe this helps, found here https://www.regular-expressions.info/near.html
\bword1\W+(?:\w+\W+){1,6}?word2\b.

matching in between a long sentence with keywords

target sentence:
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system;$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host;$(SolDir)..\..\ABC\ccc\1234\components\fds\ab_cdef_1.0\host; $(SolDir)..\..\ABC\ccc\1234\somethingelse;
How should I construct my regex to extract the items containing "..\..\ABC\ccc\1234\ccc_am_system"?
Basically, I want to extract all of these folders, and maybe more; they are all under \ABC\ccc\1234\ccc_am_system:
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host\abc;
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host\123\123\123\123;
$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host;
my current regex doesn't work and I can't figure out why
\$.*ccc\\1234\.*;
Your problem is most likely that * is a greedy operator. It's greedily matching more than you intend it to. In many regex dialects, *? is the reluctant operator. I would first try using it like this:
\$.*?ccc\\1234.*?;
You can read up a bit more on greedy vs reluctant operators in this question.
If that doesn't work, you can try to be more specific with the characters you match than .. For example, you can match every non-semicolon character with an expression like this: [^;]*. You could use that idea this way:
\$[^;]*ccc\\1234[^;]*;
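The difference between the greedy pattern and the [^;]* version can be seen on the target sentence from the question. This is a sketch in Python (an assumption; the question does not name a language). The last pattern, which adds \\ccc_am_system, is my own extension to narrow the result to the folder the asker mentions, not something from the answers.

```python
import re

# Target sentence from the question; raw strings keep the backslashes literal.
target = (
    r"$(SolDir)..\..\ABC\ccc\1234\ccc_am_system;"
    r"$(SolDir)..\..\ABC\ccc\1234\ccc_am_system\host;"
    r"$(SolDir)..\..\ABC\ccc\1234\components\fds\ab_cdef_1.0\host; "
    r"$(SolDir)..\..\ABC\ccc\1234\somethingelse;"
)

# Greedy .* runs to the last ";" and returns everything as one match.
greedy = re.findall(r"\$.*ccc\\1234.*;", target)

# [^;]* cannot cross a ";", so each item is matched separately.
per_item = re.findall(r"\$[^;]*ccc\\1234[^;]*;", target)

# Hypothetical narrowing: only items under ...\1234\ccc_am_system.
under_system = re.findall(r"\$[^;]*ccc\\1234\\ccc_am_system[^;]*;", target)

print(len(greedy), len(per_item), len(under_system))  # 1 4 2
```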
The below regex would store the captured strings inside group 1.
(\$.*?ccc\\1234\\.*?;)
You need to make the * quantifier do a shortest match by adding ? next to the *. Also, \.* matches a literal dot zero or more times, which is wrong here.
DEMO
I found this to be the best:
\$[^\$;]*ccc\\1234[^\$;]*;
It doesn't allow any over-matching whatsoever. If I use ?, it still matches $ or ; more than once for some reason, but with the above expression that will never be the case. Still, thanks to all those who took the time to answer my question.

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
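A short sketch of the tempered greedy token in Python (an assumption; the answer refers to PCRE, but this syntax is the same in both). It also shows why the order matters: in the ^(.(?!(some text)))*$ form quoted for Python/Java, the lookahead is only checked after a character has been consumed, so a line that starts with the excluded text still matches.

```python
import re

# Lookahead is checked BEFORE each character is consumed.
no_andrea = re.compile(r"^(?:(?!Andrea).)*$")
# Lookahead is checked AFTER each character, so position 0 is never tested.
lookahead_after = re.compile(r"^(.(?!Andrea))*$")

lines = ["Hello Andrea", "Hello Robert", "Andrea", ""]
kept = [line for line in lines if no_andrea.match(line)]
print(kept)  # ['Hello Robert', '']

print(bool(lookahead_after.match("Andrea")))        # True  (not rejected)
print(bool(lookahead_after.match("Hello Andrea")))  # False
```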
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile(r'(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character": letters, digits, and the underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
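The program-logic approach might look like the following sketch. It assumes that "a string of six letters" means a whole six-letter word, and the function name wanted is hypothetical.

```python
import re

# Whole words made of exactly six ASCII letters (an assumed reading of the task).
six_letters = re.compile(r"\b[A-Za-z]{6}\b")

def wanted(line, excluded="Andrea"):
    """Return True if the line has a six-letter word other than `excluded`."""
    return any(word != excluded for word in six_letters.findall(line))

lines = ["Hello Andrea", "Hello Robert", "Andrea met Robert"]
kept = [line for line in lines if wanted(line)]
print(kept)  # ['Hello Robert', 'Andrea met Robert']
```

Keeping the exclusion in ordinary code avoids the lookahead entirely, at the cost of a second pass over the matches.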
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
(?! is useful in practice, although strictly speaking lookahead is not part of regular expressions as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
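A sketch of the two-part capture in Python, with Andrea standing in for yourRegex (an assumption for illustration). Python's re module requires the lookbehind to have a fixed width, which is one reason a trailing quantifier such as {2,4} causes trouble there.

```python
import re

text = "Hello Andrea, welcome"

# Everything before the first occurrence of the pattern.
before = re.match(r"^((?!Andrea).)*", text).group(0)
# Everything after the first occurrence of the pattern.
after = re.search(r"(?<=Andrea).*", text).group(0)

print(repr(before))  # 'Hello '
print(repr(after))   # ', welcome'
```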
In Perl you can do:
process($line) if ($line !~ /Andrea/);