How to make a regex for files ending in .php? - regex

I'm trying to write a very simple regular expression that matches any file name that doesn't end in .php. I came up with the following...
(.*?)(?!\.php)$
...however this matches all filenames. If someone could point me in the right direction I'd be very grateful.

Almost:
.*(?!\.php)....$
The last four dots make sure that there is something to look ahead at, when the look-ahead is checked.
The outer parentheses are unnecessary since you are interested in the entire match.
The reluctant .*? is unnecessary, since backtracking four steps is more efficient than checking the following condition with every step.

Instead of using negative lookahead, sometimes it's easier to use the negation outside the regex at the hosting language level. In many languages, the boolean complement operator is the unary !.
So you can write something like this:
! str.hasMatch(/\.php$/)
Depending on language, you can also skip regex altogether and use something like (e.g. Java):
! str.endsWith(".php")
As for the problem with the original pattern itself:
(.*?)(?!\.php)$ // original pattern, doesn't work!
This matches, say, file.php, because the (.*?) can capture file.php, and looking ahead, you can't match \.php, but you can match a $, so altogether it's a match! You may want to use look behind, or if it's not supported, you can lookahead at the start of the string.
^(?!.*\.php$).*$ // negative lookahead, works
This will match all strings that does not end with ".php" using negative lookahead.
References
regular-expressions.info/Lookarounds
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?

You are at the end of the string and looking ahead. What you want is a look behind instead:
(.*)$(?<!\.php)
Note that not all regular expression engines support lookbehind assertions.

Related

How do regex positive look-behinds work?

I have been solving old question from stack so that I can improve my regex knowledge. As I have a basic knowledge of regex, most of them were easy but this question regex problem is tough.
It asks for a regex that extracts from this kind of string ou=persons,ou=(.*),dc=company,dc=org the last string immediately preceded by a comma not followed by (.*). In the last case, this should give dc=company,dc=org.
The solution is (?<=,(?!.*\Q(.*)\E)).* but I cannot understand its flow. I understood (?!.*\Q(.*)\E) portion but other are still mystery to me. Specially ?<= which is a positive look-behind. Does it search from end of string? Can anyone explain it to me like I am a 7 year old kid — and please http://regex101.com/ is not helping.
The RegEx (?<=,(?!.*\Q(.*)\E)).* look-behind potion works like this:
Start at the beginning of the string at first character.
Can we match the the thing we are looking for? ,(?!.*\Q(.*)\E)
If we can't: Move forward one character, Go To 2. and check match again.
If a match is found: Capture all the remaining characters until we can't find any .* (or generally then try the matching the remaining RegEx).
For a more wordly explaination consider reading Lookahead and Lookbehind Zero-Length Assertions.
A lookbehind allows you to specify a context just before the actual match.
You can say ,(dc=) and only return the capture group, or ,\Kdc=, or (?<=,)dc= to return the match on dc= but require that the comma is present just before the match.
The facility also allows for multiple lookbehinds, so you could do (?<=a.*)(?<=b.*)c to match c only if it is preceded by both a and b somewhere in the input.
A lookbehind is basically syntactic sugar, in that you can usually rephrase your conditions using some other regex construct. It can be really handy when you have multiple unanchored constraints, like in the last example

RegEx - Exclude Matched Patterns

I have the below patterns to be excluded.
make it cheaper
make it cheapere
makeitcheaper.com.au
makeitcheaper
making it cheaper
www.make it cheaper
ww.make it cheaper.com
I've created a regex to match any of these. However, I want to get everything else other than these. I am not sure how to inverse this regex I've created.
mak(e|ing) ?it ?cheaper
Above pattern matches all the strings listed. Now I want it to match everything else. How do I do it?
From the search, it seems I need something like negative lookahead / look back. But, I don't really get it. Can some one point me in the right direction?
You can just put it in a negative look-ahead like so:
(?!mak(e|ing) ?it ?cheaper)
Just like that isn't going to work though since, if you do a matches1, it won't match since you're just looking ahead, you aren't actually matching anything, and, if you do a find1, it will match many times, since you can start from lots of places in the string where the next characters doesn't match the above.
To fix this, depending on what you wish to do, we have 2 choices:
If you want to exclude all strings that are exactly one of those (i.e. "make it cheaperblahblah" is not excluded), check for start (^) and end ($) of string:
^(?!mak(e|ing) ?it ?cheaper$).*
The .* (zero or more wild-cards) is the actual matching taking place. The negative look-ahead checks from the first character.
If you want to exclude all strings containing one of those, you can make sure the look-ahead isn't matched before every character we match:
^((?!mak(e|ing) ?it ?cheaper).)*$
An alternative is to add wild-cards to the beginning of your look-ahead (i.e. exclude all strings that, from the start of the string, contain anything, then your pattern), but I don't currently see any advantage to this (arbitrary length look-ahead is also less likely to be supported by any given tool):
^(?!.*mak(e|ing) ?it ?cheaper).*
Because of the ^ and $, either doing a find or a matches will work for either of the above (though, in the case of matches, the ^ is optional and, in the case of find, the .* outside the look-ahead is optional).
1: Although they may not be called that, many languages have functions equivalent to matches and find with regex.
The above is the strictly-regex answer to this question.
A better approach might be to stick to the original regex (mak(e|ing) ?it ?cheaper) and see if you can negate the matches directly with the tool or language you're using.
In Java, for example, this would involve doing if (!string.matches(originalRegex)) (note the !, which negates the returned boolean) instead of if (string.matches(negLookRegex)).
The negative lookahead, I believe is what you're looking for. Maybe try:
(?!.*mak(e|ing) ?it ?cheaper)
And maybe a bit more flexible:
(?!.*mak(e|ing) *it *cheaper)
Just in case there are more than one space.

Regex to *not* match any characters

I know it is quite some weird goal here but for a quick and dirty fix for one of our system we do need to not filter any input and let the corruption go into the system.
My current regex for this is "\^.*"
The problem with that is that it does not match characters as planned ... but for one match it does work. The string that make it not work is ^#jj (basically anything that has ^ ... ).
What would be the best way to not match any characters now ? I was thinking of removing the \  but only doing this will transform the "not" into a "start with" ...
The ^ character doesn't mean "not" except inside a character class ([]). If you want to not match anything, you could use a negative lookahead that matches anything: (?!.*).
A simple and cheap regex that will never match anything is to match against something that is simply unmatchable, for example: \b\B.
It's simply impossible for this regex to match, since it's a contradiction.
References
regular-expressions.info\Word Boundaries
\B is the negated version of \b. \B matches at every position where \b does not.
Another very well supported and fast pattern that would fail to match anything that is guaranteed to be constant time:
$unmatchable pattern $anything goes here etc.
$ of course indicates the end-of-line. No characters could possibly go after $ so no further state transitions could possibly be made. The additional advantage are that your pattern is intuitive, self-descriptive and readable as well!
tldr; The most portable and efficient regex to never match anything is $- (end of line followed by a char)
Impossible regex
The most reliable solution is to create an impossible regex. There are many impossible regexes but not all are as good.
First you want to avoid "lookahead" solutions because some regex engines don't support it.
Then you want to make sure your "impossible regex" is efficient and won't take too much computation steps to match... nothing.
I found that $- has a constant computation time ( O(1) ) and only takes two steps to compute regardless of the size of your text (https://regex101.com/r/yjcs1Z/3).
For comparison:
$^ and $. both take 36 steps to compute -> O(1)
\b\B takes 1507 steps on my sample and increase with the number of character in your string -> O(n)
Empty regex (alternative solution)
If your regex engine accepts it, the best and simplest regex to never match anything might be: an empty regex .
Instead of trying to not match any characters, why not just match all characters? ^.*$ should do the trick. If you have to not match any characters then try ^\j$ (Assuming of course, that your regular expression engine will not throw an error when you provide it an invalid character class. If it does, try ^()$. A quick test with RegexBuddy suggests that this might work.
^ is only not when it's in class (such as [^a-z] meaning anything but a-z). You've turned it into a literal ^ with the backslash.
What you're trying to do is [^]*, but that's not legal. You could try something like
" {10000}"
which would match exactly 10,000 spaces, if that's longer than your maximum input, it should never be matched.
((?iLmsux))
Try this, it matches only if the string is empty.
Interesting ... the most obvious and simple variant:
~^
.
https://regex101.com/r/KhTM1i/1
requiring usually only one computation step (failing directly at the start and being computational expensive only if the matched string begins with a long series of ~) is not mentioned among all the other answers ... for 12 years.
You want to match nothing at all? Neg lookarounds seems obvious, but can be slow, perhaps ^$ (matches empty string only) as an alternative?

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
, but this doesn't seem to work. If I have regex
/[a|b]./
, the string "abc" returns false with both my regex and the converse suggested by zijab,
/^((?!^[a|b].).)*$/
. Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression))
// do something...
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
You want to match the literal foo. An expression, that does match anything else but a string that contains foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three characters long string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together this may work:
^[^fo]*(f+($|[^o]|o($|[^fo]*)))*$
But note: This does only apply to foo.
You can also do this (in python) by using re.split, and splitting based on your regular expression, thus returning all the parts that don't match the regex, how to find the converse of a regex
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java Regexps have an interesting way of doing this (can test here) where you can create a greedy optional match for the string you want, and then match data after it. If the greedy match fails, it's optional so it doesn't matter, if it succeeds, it needs some extra data to match the second expression and so fails.
It looks counter-intuitive, but works.
Eg (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but couldn't get it to work myself (they seem more willing to backtrack if the second match fails?)