regex positive lookahead with if/else condition - regex

I am trying to write a regular expression that checks whether a pattern exists and, if it does, matches everything following it; if (and only if) it does not, it should match everything after another pattern.
example lines:
http://example.com/contact
www.example.com/contact
http://www.example.com/contact
expected output in all 3 cases: example
Here is the regular expression I expected would do the job:
(?(?<=www\.).+|(?<=http:\/\/).+)(?=\.com)
which I assumed would:
check if "www." is to be found
if yes, would match everything following it
if not, match everything following "http://"
restrict the match to everything before the occurrence of ".com"
For the first two lines, the expression worked well, but in the third line www.example is matched instead of just example. Does this mean that for some reason the else command is executed although the if condition is met?
How can I change the above expression so that it only does the http:// lookbehind if the www. part was not found?

Converting my comment to answer.
You may use this regex:
^(?:https?://(?:www\.)?|www\.)\K\S+?(?=\.com(?:/|$))
RegEx Description:
^: Start
(?:https?://(?:www\.)?|www\.): Match http:// or https://, optionally followed by www., or a bare www.
\K: Discard everything matched so far from the reported match
\S+?: Match 1+ non-space characters (lazy)
(?=\.com(?:/|$)): Using a lookahead, assert that .com followed by / or the end of the line is ahead
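\K is not available in every engine. As a minimal sketch, assuming Python's built-in re module (which has no \K), a capturing group in place of \K gives the same result. The helper name host_name is made up for this illustration:

```python
import re

# Same pattern as above, but the part after the prefix is captured
# in group 1 instead of being isolated with \K.
HOST = re.compile(r"^(?:https?://(?:www\.)?|www\.)(\S+?)(?=\.com(?:/|$))")

def host_name(line):
    """Return the part between the optional prefix and '.com', or None."""
    m = HOST.search(line)
    return m.group(1) if m else None

for line in (
    "http://example.com/contact",
    "www.example.com/contact",
    "http://www.example.com/contact",
):
    print(line, "->", host_name(line))
```

All three example lines from the question yield example.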

Related

Full match only if the capturing group encountered once

The pattern:
(test):(thestring)
What I want is a full match only if there is exactly one test: before thestring:
test:thestring
But in this case there should be no full match:
test:test:thestring
I've tried a quantifier, but it didn't work.
I need help.
Try this pattern: ^(?!.*((?(?<=^)|(?<=:))test(?=(:|$))).*(?1)).+$.
The main part is ((?(?<=^)|(?<=:))test(?=(:|$))), which matches test if it is preceded by a colon : or is at the beginning of a line, and is followed by a colon : or the end of the line.
(?(?<=^)|(?<=:)) is a workaround for (?<=(:|^)), which cannot be used because lookbehinds must have a fixed length.
Then we have (?1), a subroutine call that re-runs the pattern of the first capturing group, to see whether there is any other test.
The whole pattern is placed in a negative lookahead (?!...), so that everything is matched only if the pattern explained above (test matched more than once) does not match.
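Conditionals on lookarounds and the (?1) subroutine call are PCRE features that Python's built-in re module does not support. As a rough sketch of the same check for a single line, swapping the regex for plain token counting (the function name full_match_allowed is hypothetical):

```python
def full_match_allowed(line: str) -> bool:
    """True if 'test' occurs at most once as a colon-delimited token."""
    # The .+ in the pattern needs at least one character.
    if not line:
        return False
    # Splitting on ":" yields exactly the tokens that sit between
    # start-of-line or a colon and a colon or end-of-line.
    return line.split(":").count("test") < 2

print(full_match_allowed("test:thestring"))
print(full_match_allowed("test:test:thestring"))
```

This assumes the input has no embedded newlines, since . in the original pattern does not cross them.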
for this very specific case:
(?<!.)(test:thestring)
All it does is search for the string test:thestring and ensures that there are no characters before it. (Use Michał Turczyn's regex for an all-purpose search!)
^((?!test:).)*(test:thestring)
If you want a full match and test: should occur only once before thestring, you can assert the start of the string with ^, then use (?:(?!test:).)* to match any character as long as what is on the right side is not test:
Then match test:thestring, followed by (?:(?!test:thestring).)*, which matches any character as long as what is on the right side is not test:thestring, and assert the end of the string with $
^(?:(?!test:).)*test:thestring(?:(?!test:thestring).)*$
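This pattern uses only lookaheads, so it should also run unchanged in Python's re module. A small sketch to try it out (the wrapper name is_full_match is an assumption for this example):

```python
import re

ONLY_ONE_TEST = re.compile(
    r"^(?:(?!test:).)*test:thestring(?:(?!test:thestring).)*$"
)

def is_full_match(s: str) -> bool:
    """True if the whole string matches with a single test:thestring."""
    return ONLY_ONE_TEST.search(s) is not None

for s in ("test:thestring", "test:test:thestring", "abc test:thestring xyz"):
    print(repr(s), is_full_match(s))
```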

Using multiple conditions in regex

I am attempting to create a regex that matches when two conditions are met:
URL snippet is present
After the snippet, the number "1" must also be present (1 does not have to be immediately after the snippet)
Both conditions must be met for the regex to be true.
This is the regex that I have so far:
^https?:\/\/www\.website\.co\.uk\/brand\/
This matches the URL snippet. But I want the regex to include the second condition.
Therefore, if the second condition was included
This would match: http://www.website.co.uk/brand/AD/1/A_d.html
But this would not: http://www.website.co.uk/brand/
Any help on this would be great.
^https?:\/\/www\.website\.co\.uk\/brand\/.*1.*
A .* in a regex matches any run of characters other than newlines. By padding your 1 with .* you match all strings that start with
http://www.website.co.uk/brand/
and are followed by characters that include at least one 1.
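A minimal sketch of the same idea in Python, assuming a plain yes/no test is all that is needed (the function name matches_both_conditions is made up here). The trailing .* is dropped because it does not change whether the string matches:

```python
import re

# URL snippet at the start, then any characters, then a "1".
BRAND_WITH_ONE = re.compile(r"^https?://www\.website\.co\.uk/brand/.*1")

def matches_both_conditions(url: str) -> bool:
    return BRAND_WITH_ONE.search(url) is not None

print(matches_both_conditions("http://www.website.co.uk/brand/AD/1/A_d.html"))
print(matches_both_conditions("http://www.website.co.uk/brand/"))
```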

Match pattern anywhere in string?

I want to match the following pattern:
Exxxx49 (where x is a digit 0-9)
For example, E123449abcdefgh, abcdefE123449987654321 are both valid. I.e., I need to match the pattern anywhere in a string.
I am using:
^*E[0-9]{4}49*$
But it only matches E123449.
How can I allow any amount of characters in front or after the pattern?
Remove the ^ and $ to search anywhere in the string.
In your case the * are probably not what you intended; E[0-9]{4}49 should suffice. This will find an E, followed by four digits, followed by a 4 and a 9, anywhere in the string.
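For illustration, here is how the unanchored pattern behaves with Python's re.search, which scans the whole string for the first occurrence. The helper name find_code is hypothetical:

```python
import re

CODE = re.compile(r"E[0-9]{4}49")

def find_code(text: str):
    """Return the first Exxxx49 occurrence anywhere in text, or None."""
    m = CODE.search(text)
    return m.group(0) if m else None

print(find_code("E123449abcdefgh"))
print(find_code("abcdefE123449987654321"))
```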
I would go for
^.*E[0-9]{4}49.*$
EDIT:
since it fulfills all requirements stated by the OP.
"[match] Exxxx49 (where x is digit 0-9)"
"allow for any amount of characters in front or after pattern"
It will match
^.* everything before the pattern, including the beginning of the line
E[0-9]{4}49 the requested pattern
.*$ everything after the pattern, including the end of the line
Your original regex had a syntax error at the first *. Fix it and change it to this:
.*E\d{4}49.*
This pattern is for engines whose matching is implicitly anchored to the whole string, like Java's matches(), since you did not specify a language.
.* matches any number of characters. Because it surrounds the pattern, the regex matches the entire string as long as the pattern occurs somewhere in it.
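Python's re.fullmatch requires the whole string to match, so it can stand in for the implicitly anchored behaviour described here. A short sketch (the helper name whole_string_matches is an assumption) showing why the surrounding .* matters in that mode:

```python
import re

def whole_string_matches(pattern: str, text: str) -> bool:
    """True only if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

text = "abcdefE123449987654321"
# Without the padding, the extra characters make the full match fail.
print(whole_string_matches(r"E\d{4}49", text))
# With .* on both sides, the whole string is accepted.
print(whole_string_matches(r".*E\d{4}49.*", text))
```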
Just simply use this:
E[0-9]{4}49
How do I allow for any amount of characters in front or after pattern? but it only matches E123449
Use global flag /E\d{4}49/g if supported by the language
OR
Try a capturing group, (E\d{4}49)+, formed by enclosing the pattern in parentheses (...)

regular expression doesn't match, why?

I want to match all paths that:
don't start with "/foo-bar/"
and don't end with any extension (.jpg, .gif, etc)
examples:
/foo-bar/aaaa/fff will not match
/foo-bar/aaaa/fff.jpg will not match
/aaa/bbb will match
/aaaa/bbbb.jpg will not match
/bbb.a will not match
this is my regex:
^\/(?!foo-bar\/).*(?!\.).*$
but it is not working. Why?
Thanks!
It is easier to match what you don't want. Example with PHP:
if (!preg_match('~^/foo-bar/|\.[^/]+$~', $url))
echo 'Valid!';
Your pattern doesn't work because of this part: .*(?!\.).*$. The first .* is greedy and consumes all the characters up to the end of the string. At that position (?!\.) succeeds trivially because no character follows, the second .* matches the empty string, and $ succeeds. So the lookahead never actually inspects the extension.
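To see both points in practice, here is a small Python sketch: it first shows the original pattern accepting a path with an extension, then applies the "match what you don't want" idea. The names ORIGINAL, UNWANTED and is_valid are made up for this example:

```python
import re

# The pattern from the question: the lookahead is checked at the very end.
ORIGINAL = re.compile(r"^\/(?!foo-bar\/).*(?!\.).*$")

# What we do NOT want: a /foo-bar/ prefix, or an extension at the end.
UNWANTED = re.compile(r"^/foo-bar/|\.[^/]+$")

def is_valid(path: str) -> bool:
    return UNWANTED.search(path) is None

# The original pattern wrongly accepts this path.
print(ORIGINAL.search("/aaaa/bbbb.jpg") is not None)

for p in ("/foo-bar/aaaa/fff", "/foo-bar/aaaa/fff.jpg", "/aaa/bbb",
          "/aaaa/bbbb.jpg", "/bbb.a"):
    print(p, is_valid(p))
```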
If you absolutely need an affirmative pattern, you can use this:
if (preg_match('~^/(?!foo-bar/)(?:[^/]*/)*+[^./]*$~', $url))
echo 'Valid!';
You can try this one, which is a bit simpler and close to what you have tried:
^(?!\/foo-bar)([^\.]+)$
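A quick way to try this simpler pattern against the examples from the question, sketched in Python. The last two paths are hypothetical additions that show its limits: it rejects a dot anywhere in the path, and its lookahead has no trailing slash, so it also rejects prefixes such as /foo-barbaz.

```python
import re

SIMPLE = re.compile(r"^(?!\/foo-bar)([^\.]+)$")

def simple_match(path: str) -> bool:
    return SIMPLE.search(path) is not None

for p in ("/foo-bar/aaaa/fff", "/aaa/bbb", "/aaaa/bbbb.jpg", "/bbb.a",
          "/v1.2/readme",      # hypothetical: dot in a directory name
          "/foo-barbaz/x"):    # hypothetical: prefix without the slash
    print(p, simple_match(p))
```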

Negative lookahead to match server directories not properly working

Given the following 3 example server paths, I am trying to create a skiplist for my FTP client via PCRE regular expressions, but I can't seem to get the desired result.
/subdir-level-1/subdir-level-2/.../Author1_-_Title1-(1234)-Publisher1
/subdir-level-1/subdir-level-2/.../Author2_-_Title2_(5678)-PUBLiSHER2
/subdir-level-1/subdir-level-2/.../Author3_-_Title3-4951-publisher3
I want to skip all folders (not paths) that do not end with
-Publisher1
I am trying to create a working pattern with the help of this online help and this regex tester, but I don't get any further than this negative lookahead pattern
.*-(?!Publisher1)
But with this pattern all lines match, because in all of them the substring up to the pattern does not contain the pattern.
/subdir/subdir/.../Author1_-_Title1-(1234) -Publisher1
/subdir/subdir/.../Author2_-_Title2_(5678) -PUBLiSHER2
/subdir/subdir/.../Author3_-_Title3-4951 -publisher3
What is my mistake, and what would the correct pattern be to match only the second and third lines as lines to be skipped, while keeping the first line?
EDIT to make it clearer what to highlight and what not.
Everything from the beginning of the path to the last slash must be ignored (allowed).
Everything after the last slash that matches the defined regex must be skipped.
EDIT to present an advanced pattern matching only the red part
[^/]*(?<!-Publisher2)$
The regex which you have used is:
.*-(?!Publisher1)
Here is what is wrong with it.
This regex matches any line that contains a - not followed by Publisher1. Notice the other hyphens in your text, between author and title or after the title: because of them, every string satisfies this condition. If instead the negative lookahead included the hyphen together with Publisher1, you might expect the match to work.
So you move the hyphen inside the parentheses and make your regex like this:
^.*(?!-Publisher1)
but this will not work either, because .* consumes everything, so when the lookahead runs there is no character left to inspect and it trivially succeeds. Thus we will use a negative lookbehind:
.*(?<!-Publisher1)
What now? This still does not work. Why?
Because a negative lookbehind looks back and checks that the current position is not preceded by -Publisher1.
This is complex, so bear with me.
Suppose your string is
/subdir/subdir/.../Author1_-_Title1-(1234)-Publisher1
We do a negative lookbehind for -Publisher1. From the position after the final 1, i.e. at the end of the string, -Publisher1 is visible when we look back, but our condition is a negative lookbehind, so it fails there. The engine therefore backtracks one character to the left, to a position where the lookbehind can no longer see -Publisher1, because from there only "-Publisher" is visible. Our condition is satisfied, so the regex still matches the rest of the string (everything except the last character).
So it is essential to bind the lookbehind to the end of the string, so that the engine cannot move one character to the left to find a match.
final regex:
.*(?<!-Publisher1)$
demo here : http://regex101.com/r/lE1vW2
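The backtracking step described above can be observed directly. A small sketch using Python's re module, which also supports fixed-width lookbehinds, with the three paths from the question (the names UNANCHORED, ANCHORED and should_skip are assumptions for this example):

```python
import re

PATHS = [
    "/subdir-level-1/subdir-level-2/.../Author1_-_Title1-(1234)-Publisher1",
    "/subdir-level-1/subdir-level-2/.../Author2_-_Title2_(5678)-PUBLiSHER2",
    "/subdir-level-1/subdir-level-2/.../Author3_-_Title3-4951-publisher3",
]

UNANCHORED = re.compile(r".*(?<!-Publisher1)")
ANCHORED = re.compile(r".*(?<!-Publisher1)$")

def should_skip(path: str) -> bool:
    """A path is skipped when it does not end with -Publisher1."""
    return ANCHORED.search(path) is not None

# Without $, the engine gives back one character and still matches.
print(UNANCHORED.search(PATHS[0]).group(0))

for p in PATHS:
    print(should_skip(p), p)
```

Without the $ anchor, the first path still matches, minus its final character. With the anchor, only the second and third paths match.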
This should suit your needs:
^.*(?<!-Publisher1)$
I want to skip all folders that do not end with -Publisher1
You can use this negative lookahead based regex:
^(?!.*?-Publisher1$).+$
You could use the following regex in order to exclude lines containing Publisher1:
^((?!Publisher1).)*$
Online demo: http://regex101.com/r/gD8jK0
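The two lookahead-based patterns above agree on the three example paths but are not equivalent. A sketch in Python that compares them; the fourth path is a hypothetical one added to show the difference, and the names NOT_ENDING and NOT_CONTAINING are made up:

```python
import re

# Matches lines that do not END with -Publisher1.
NOT_ENDING = re.compile(r"^(?!.*?-Publisher1$).+$")
# Matches lines that do not CONTAIN Publisher1 anywhere.
NOT_CONTAINING = re.compile(r"^((?!Publisher1).)*$")

PATHS = [
    "/subdir-level-1/subdir-level-2/.../Author1_-_Title1-(1234)-Publisher1",
    "/subdir-level-1/subdir-level-2/.../Author2_-_Title2_(5678)-PUBLiSHER2",
    "/subdir-level-1/subdir-level-2/.../Author3_-_Title3-4951-publisher3",
    "/subdir/Publisher1-archive",  # hypothetical path, not from the question
]

def matched(pattern, path: str) -> bool:
    return pattern.search(path) is not None

for p in PATHS:
    print(matched(NOT_ENDING, p), matched(NOT_CONTAINING, p), p)
```

If Publisher1 can appear elsewhere in a path, the end-anchored variant is closer to what the question asks for.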