Regex - not in the list but matched anyway - regex

This is a bit hard to sum up in a title, but here is my problem:
(?:(?:http|https):\\/\\/)?(?:\\/\\/www\\.)?youtube.com\\/watch\\?(?:.*)v=(\\w{11}).*
Given the expression above, I really don't understand why ftp://www.youtube.com/watch?v=F5eScJmYZZ8 matches. I tried adding ^ to the beginning of the expression, but then it no longer matches anything (this is done in Java, which explains the doubled backslashes).
How can ftp be accepted when it is clearly not listed in (http|https)?
EDIT
To be accurate, here is what is allowed:
http(s)://www.[...]
http(s)://[...]
www.[...]
[...]
and nothing else.

Because the ? after the http part means that it is optional. Use + instead of ?.
Also, you are checking for // after http twice.
\s* allows whitespace at the beginning. If you don't want to allow whitespace (i.e., the input text will contain only 1 match), use ^ instead.
Here is the working regex that meets all of your added requirements:
\s*(?:(http|https)\:\/\/)?(?:www\.)?youtube.com\/watch\?(?:.*)v=(\w{11}).*

Because the leading (?:(?:http|https):\\/\\/)? is optional. That's what the question mark at the end of the group signifies (match at most one, i.e. match only if it exists).
A leading ^ should prevent the match with ftp though. Can you post the failing regex you tried (with the ^)?
UPDATE:
Aha! It matches without the ^ since the http group is optional, and anything can come before the match (e.g. cheeseyoutube.com/... would match). Adding a ^ to the beginning of the regex fixes this, but there's another problem with your regex: the www group is trying to match two slashes (as first pointed out in Justin's answer), which it can't once the http group has already matched those slashes. So the www group fails to match (fine, since it's optional), but then the youtube part can't match since there's an unmatched www in the way!
This should fix your problem:
^(?:(?:http|https):\\/\\/)?(?:www\\.)?youtube.com\\/watch\\?(?:.*)v=(\\w{11}).*
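A quick way to check the behaviour described above is to run the variants against a few URLs. The sketch below uses Python's re module rather than Java (so the backslashes are not doubled) and assumes the original check was a substring search, like Java's Matcher.find(). The constant and variable names are only illustrative.

```python
import re

# The question's pattern, single-escaped because a Python raw string is used.
ORIGINAL = r"(?:(?:http|https):\/\/)?(?:\/\/www\.)?youtube.com\/watch\?(?:.*)v=(\w{11}).*"
# The question's attempt: the same pattern with a leading ^.
ANCHORED = "^" + ORIGINAL
# The corrected pattern from the answer: anchored, and no slashes in the www group.
FIXED = r"^(?:(?:http|https):\/\/)?(?:www\.)?youtube.com\/watch\?(?:.*)v=(\w{11}).*"

ftp_url = "ftp://www.youtube.com/watch?v=F5eScJmYZZ8"
http_url = "http://www.youtube.com/watch?v=F5eScJmYZZ8"

# An unanchored search skips "ftp:" and starts matching at "//www.".
unanchored = re.search(ORIGINAL, ftp_url)
print(unanchored.group(0))

# With ^ but the old www group, even a valid http URL is rejected.
anchored_http = re.search(ANCHORED, http_url)
print(anchored_http)

# The corrected pattern rejects ftp and accepts the four allowed forms.
allowed = [
    http_url,
    "https://youtube.com/watch?v=F5eScJmYZZ8",
    "www.youtube.com/watch?v=F5eScJmYZZ8",
    "youtube.com/watch?v=F5eScJmYZZ8",
]
video_ids = [re.search(FIXED, url).group(1) for url in allowed]
ftp_result = re.search(FIXED, ftp_url)
print(video_ids, ftp_result)
```

Note that the dot in youtube.com is left unescaped, as in the answer, so it still matches any character there.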

Related

Having difficulty understanding regex backtracking

I was browsing through the regex-tagged questions on SO when I came across this problem:
A regex for a URL was needed; the URL begins with domain.com/advertorials/
The regex should match the following scenarios:
domain.com/advertorials
domain.com/advertorials?test=true
domain.com/advertorials/
domain.com/advertorials/?test=true
but not this,
domain.com/advertorials/version1?test=true
I came up with this regex: advertorials\/?(?:(?!version)(.*))
This should work, but it doesn't for the last case. Looking at the debugger on regex101.com,
I see that after matching 's/' it matches the word 'version' character by character and ultimately succeeds, but since this is a negative lookahead, the condition fails. And this is the part I don't understand: after failing, it backtracks to before the '/' in 's/' and not to after 's/'.
Is this how it's supposed to work? Can anyone help me understand?
(here's the demo link: https://regex101.com/r/ww3HR8/1).
Thanks,
Note: People have already given their solutions to that problem; I just want to know why my regex fails.
The backtracking mechanism is in charge of this phenomenon, as you have already pointed out.
The ? quantifier, which matches 1 or 0 repetitions of the quantified subpattern, lets the regex engine match the string in two ways: either matching the quantified subpattern, or skipping it and going on to match the subsequent subpattern.
So, advertorials/?(?!version)(.*) (I removed the redundant (?:...) non-capturing group), when applied to domain.com/advertorials/version1?test=true, matches advertorials, then matches /, and then the negative lookahead checks whether the substring version appears immediately to the right of the current position. Since there is version after /, the lookahead fails, and the regex engine goes back and sees that the /? pattern can also match an empty string. So, the lookahead check is re-applied straight after advertorials. There is no version immediately after advertorials (the next character is /), and the match is returned.
The usual solution is using possessive quantifiers or atomic groups, but there are other approaches, too.
E.g.
advertorials\/?+(?!version)(.*)
              ^^
See the regex demo. Here, \/?+ matches 1 or 0 / chars, but once it matches, the engine cannot go back and re-match a part of the string with this pattern.
Or, you may include /? in the lookahead and place the lookahead before the /? pattern:
advertorials(?!\/?version)\/?(.*)
See another regex demo.
If you plan to disallow version anywhere after advertorials use
advertorials(?!.*version)\/?(.*)
See yet another demo.
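The backtracking described above can be reproduced with a short script. This is a minimal sketch in Python's re module (the variable names are illustrative); the possessive \/?+ form is left out because older Python versions do not support possessive quantifiers.

```python
import re

urls_ok = [
    "domain.com/advertorials",
    "domain.com/advertorials?test=true",
    "domain.com/advertorials/",
    "domain.com/advertorials/?test=true",
]
url_bad = "domain.com/advertorials/version1?test=true"

# Question's pattern: /? can give the slash back, so the lookahead
# is retried one character earlier and then succeeds.
backtracking = re.compile(r"advertorials/?(?!version)(.*)")
# Variant with the optional slash inside the lookahead.
guarded = re.compile(r"advertorials(?!/?version)/?(.*)")

leaked = backtracking.search(url_bad)
print(leaked.group(1))  # the capture starts at the slash that /? gave back

blocked = guarded.search(url_bad)
accepted = [guarded.search(u) is not None for u in urls_ok]
print(blocked, accepted)
```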
Making the slash optional means there is a way to match without violating the constraint. If there is a way to match, the regex engine will find it, always.
Make the slash non-optional when it's followed by anything at all.
advertorials(?:/(?!version).*)?$
Incidentally, regex itself doesn't require the slash to be backslash-escaped (though some host languages use slashes as regex delimiters, so maybe you need to put it back). I also removed some redundant parentheses.
The reason:
This highlighted part is optional
advertorials\/?(?:(?!version)(.*))
Therefore it can also be advertorials(?:(?!version)(.*))
which matches advertorials/version
Essentially, (?!version)(.*) matches /version
Btw, this is normal backtracking by 1 character.
If you have already fixed it, then we're done !

Mistaken Squid Proxy regex? → ^.*stackoverflow\.*

I have several proxy rule files for Squid, and all contain rules like:
acl blacklisted dstdom_regex ^.*facebook\.* ^.*youtube\.* ^.*games.yahoo.com\.*
The patterns match against the domain name: dstdom_regex means destination (server) regular expression pattern matching.
The objective is to block some websites, but I don't know by what method: domain name, keywords in the domain name, ...
Let's expand/describe the pattern:
^.*stackexchange\.* The whole pattern
^ String beginning
.* Match anything (greedy quantifier, I presume)
stackexchange Keyword to match
\.* Any number of dots (.)
Totally legitimate matches:
stackexchange.com: The Stack Exchange website.
stackoverflow.stackexchange: The imaginary Stack Exchange gTLD.
But these possible matches make it seem more like a keyword block:
stackexchange
stackexchanger
notstackexchange
not-stackexchange
some-website.stackexchange
some-website.stackexchange-tld
And the pattern seems to contain a bug, since it allows the following invalid cases to match, thanks to the \.* at the end, although they never naturally occur:
stackexchange.
stackexchange...
stackexchange..........
stackexchange.......com
stackexchange.com
stackexchangecom
you get the idea.
Anything containing stackexchange, even if separated by dots from everything else, is still a valid match.
So now, the question itself:
This all means that this is simply a match for stackexchange! (I'm assuming the original author didn't intend to match infinite dots.)
So why not just use the pattern stackexchange? Wouldn't it be faster and give the same results, except for the "bug" (\.*)?
I.e., isn't ^.*stackexchange equivalent to stackexchange?
Edit: Just to clarify, I didn't write those proxy rule files.
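The equivalence asked about can be checked empirically. The sketch below uses Python's re module and assumes that dstdom_regex performs an unanchored search on the host name (an assumption, not something confirmed here); the host list reuses the examples from the question plus one hypothetical non-matching host.

```python
import re

anchored = re.compile(r"^.*stackexchange\.*")
keyword = re.compile(r"stackexchange")

hosts = [
    "stackexchange.com",
    "stackoverflow.stackexchange",
    "stackexchanger",
    "notstackexchange",
    "not-stackexchange",
    "some-website.stackexchange-tld",
    "stackexchange.......com",
    "stackexchangecom",
    "example.com",  # hypothetical host that should not be blocked
]

# For each host, record whether each pattern finds a match anywhere in it.
verdicts = {
    h: (anchored.search(h) is not None, keyword.search(h) is not None)
    for h in hosts
}
for host, (a, k) in verdicts.items():
    print(f"{host}: anchored={a} keyword={k}")
```

Under that assumption, both patterns give the same verdict for every host in the list: ^.* can absorb any prefix and \.* can match zero dots, so neither adds a constraint.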
I don't understand why you use \.* to match all the trailing dots.
However, to work around your problem you can try this:
^[^\.]*\.stackexchange\.*
[^\.]* matches anything except a dot
\. then you match the dot

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top-level domain followed by five digits. My problem arises because my existing pcre matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using a negative lookbehind, but that doesn't work because that only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e., (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest simply prepending ^[^\/]* to your current regex (the version below also moves the \. out of the capturing group).
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase letters, then a / and exactly 5 digits; further digits may follow because the pattern is not anchored at the end).
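The effect of the ^[^\/]* prefix can be seen by running both patterns on the two URLs from the question. This is a small sketch in Python's re module rather than in the rule engine the question targets, and the variable names are made up for illustration.

```python
import re

unanchored = re.compile(r"(\.[a-z]{2,4})/\d{5}")
first_tld_only = re.compile(r"^[^/]*\.([a-z]{2,4})/\d{5}")

late = "domain.net/stuff/stuff=www.google.com/12345"
early = "domain.net/12345/stuff=www.google.com/12345"

# The original pattern happily matches the TLD deep inside the path.
late_old = unanchored.search(late)
print(late_old.group(1))

# With the prefix, nothing after the first "/" can start a match.
late_new = first_tld_only.search(late)
early_new = first_tld_only.search(early)
print(late_new, early_new.group(1))
```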
Here is a permalink to a working example on regex101.
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

Notepad++ regex group capture

I have a txt file like this:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
I'm trying to delete all subdomains with this regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Receive:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why does the last line become com.ua instead of jwbefw.com.ua?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. As some TLDs (top-level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLDs in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua the regex engine will first fail to match a,
then it will move the token one place to the left to look at "ua"
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because ".ua" met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, .+ matching one more character and then evaluating the rest of the regex again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
But the implementation of ? also creates issues.
Adding in the question mark makes that first .+ lazy, but then causes group 1 to match bb.prontube.ru instead of prontube.ru
This is because the first period (the one after xx) will match, then inside group 1 (.*?) will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change the second group from (.*?) to ([\w-]*?) so it won't match a period, only letters, digits, underscores, or a dash.
Resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.
last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
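The difference between the greedy original and the final lazy pattern can be checked on the sample file. The sketch below uses Python's re.sub as a stand-in for Notepad++'s Replace All (an approximation, since Notepad++ uses a different regex engine); strip_subdomains is a hypothetical helper name.

```python
import re

hosts = [
    "ххх.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

# The question's pattern: greedy .+ leaves only "ua" for the TLD group.
GREEDY = r".+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$"
# The final pattern from the answer: lazy prefix, no dots in the name part.
LAZY = r".+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$"

def strip_subdomains(pattern, names):
    """Replace each matching line with capture group 1; other lines stay as-is."""
    return [re.sub(pattern, r"\1", name) for name in names]

greedy_result = strip_subdomains(GREEDY, hosts)
lazy_result = strip_subdomains(LAZY, hosts)
print(greedy_result[-1], lazy_result[-1])
```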
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
See the picture above for what each regex capture refers to.
It is coloured so that you will not need a regex explanation any more.

Ignore everything after slash if it's a number

I'm trying to ignore everything after a slash if it's a number -
http://www.example.com/123abc/456/ABC/789/
Desired output is
http://www.example.com/123abc/
I have tried the following so far -
(https?:\/\/.*)(?=/\d+).*
which gives me -
http://www.example.com/123abc/456/ABC/
Many Thanks!
I think you want
(https?:\/\/.*?)(?=/\d+\/).*
//            ^        ^^
Making the repetition non-greedy, and enforcing the whole directory to be a number (otherwise /123abc… would already match it). Maybe you also want to move the first slash from the lookahead into the matching group, so that your result has the trailing slash.
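To see what each change does, the patterns can be compared on the URL from the question. This is a sketch in Python's re module; the third pattern is one possible reading of the suggestion to move the slash into the group, not something spelled out in the answer.

```python
import re

url = "http://www.example.com/123abc/456/ABC/789/"

# Question's pattern: greedy .* stops just before the last numeric segment.
greedy = re.match(r"(https?://.*)(?=/\d+).*", url)
# Answer's pattern: lazy .*? and a lookahead that needs a fully numeric segment.
lazy = re.match(r"(https?://.*?)(?=/\d+/).*", url)
# Assumed variant: the slash moves from the lookahead into the captured group.
lazy_slash = re.match(r"(https?://.*?/)(?=\d+/).*", url)

print(greedy.group(1))
print(lazy.group(1))
print(lazy_slash.group(1))
```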
The .* is greedy and will try to match as much as possible. The existence of 789 allows a match of everything up to it. Instead you can use:
(https?:\/\/.*?)(?=/\d+).*
The ? makes the .* reluctant, so it will match as little as possible to satisfy the expression.
However, this doesn't fulfill the requirement you described which is actually "Ignore everything after the second slash if it is a number." You can use (in your specific case):
(https?:\/\/.*?\/.*?\/)(?=\d+).*