Regex to match all valid links

Regarding this suggestion: http://stackoverflow.uservoice.com/pages/general/suggestions/103227-parser-does-not-match-all-valid-urls is this regex adequate, or does it need to be refined? If so, how?
\b(?P<link>(?:.*?://)[\w\-\_\.\#\:\/\?\#\=]*)\b

Even though the question is vague, I'll attempt to respond with possible solutions.
Possible Intention 1: To match any URLs in a given file (for replacement):
/^([^:]+):\/\/([-\w._]+)(\/[-\w._]\?(.+)?)?$/ig
The above should match nearly all URL formats, with the following captured groups:
0 => entire match
1 => protocol (eg. http, ftp, git, ...)
2 => hostname (eg. www.stackoverflow.com)
3 => requested_file_path (eg. /images/prod/1/4/success.gif)
4 => query_string (eg. param=1&param2=2&param3=3)
Possible Intention 2: To get details about the current request URL
To get details about the URL such as the protocol, hostname, requested file path, and query string, you're better off using language/object methods to gather the results. In PHP you can get all of the above information from $_SERVER and a couple of function calls:
$protocol = $_SERVER['SERVER_PROTOCOL']; // HTTP/1.0
$host = $_SERVER['HTTP_HOST']; // www.stackoverflow.com
$path_to_file = dirname($_SERVER['SCRIPT_NAME']);
$file = basename($_SERVER['SCRIPT_NAME']);
$query_string = $_SERVER['QUERY_STRING'];
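The same advice carries over to other languages. As a minimal sketch in Python (not part of the original answer; the sample URL is assembled from the example values listed above), the standard library's urllib.parse.urlsplit returns the same pieces without a hand-written regex:

```python
from urllib.parse import urlsplit

# Sample URL built from the example values above (an assumption for illustration).
url = "http://www.stackoverflow.com/images/prod/1/4/success.gif?param=1&param2=2&param3=3"

parts = urlsplit(url)
protocol = parts.scheme       # scheme before "://"
host = parts.hostname         # host name without port or credentials
path = parts.path             # requested file path
query_string = parts.query    # everything after "?", without the "?"
```

Note that this parses an arbitrary URL string; reading the current request's URL depends on the web framework in use.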
Hope this helps in any way.

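I guess SO blocks comments after a while? localshred's answer is great, except for a missing wildcard and unescaped periods:
/^([^:]+):\/\/([-\w\._]+)(\/[-\w\._]*\?(.+)?)?$/ig
The * after the second character class is the missing wildcard. The periods in both character classes are now escaped, because we don't want them to match everything. (Inside a character class a period is already literal, so the escaping is harmless rather than required.)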
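To see what the corrected pattern actually captures, here is a small Python sketch. The helper name split_url and the sample URLs are assumptions made for illustration; the JavaScript-style i flag is translated to re.IGNORECASE and the g flag has no equivalent here.

```python
import re

# The corrected pattern from this answer, written as a Python raw string.
URL_RE = re.compile(r"^([^:]+):\/\/([-\w\._]+)(\/[-\w\._]*\?(.+)?)?$", re.IGNORECASE)

def split_url(url):
    """Return (protocol, hostname, path_and_query, query_string), or None if no match."""
    m = URL_RE.match(url)
    return m.groups() if m else None

with_query = split_url("http://www.stackoverflow.com/search?q=regex")
bare_host = split_url("http://www.stackoverflow.com")
deep_path = split_url("http://www.stackoverflow.com/images/prod/1/4/success.gif")
```

With these inputs, with_query yields all four groups, bare_host leaves the last two groups empty, and deep_path is None: the pattern still requires a literal ? after a single path segment, so URLs with several path segments or without a query string need further refinement.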

Related

Regex to match anything after /

I don't really have a clue about regex, but I need a regex that will recognise anything after the / in a URL.
I'm developing a site for someone, and a page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site), which have URLs of the form (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets, which is easy to do, but I also need to redirect any individual sweet back to (http://)localhost/sweettemptations/available-sweets. In other words, I need to redirect (http://)localhost/sweettemptations/sweets/* to (http://)localhost/sweettemptations/available-sweets.
If anyone could tell me how to write a proper regex to match everything after sweets/ in the URL, it would be hugely appreciated.

To do what you ask, you need to use groups. In regular expressions, groups let you isolate parts of the whole match.
For example:
input string: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it is group 1, since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in the case above will be:
group 0: aaaaaaaabbbb
group 1: bbbb
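The group numbering can be checked with a few lines of Python (a sketch; any regex flavour with numbered groups behaves the same way for this pattern):

```python
import re

match = re.match(r"a*(b*)", "aaaaaaaabbbbcccc")

whole = match.group(0)   # group 0: the complete match
first = match.group(1)   # group 1: only the part inside the parentheses
```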
To achieve what you want with the sweets pattern above, you just need to put a group around the end.
Possible solution: /sweets/(.*)
The more precise you are with the pattern before the group, the less likely you are to get a false positive.
If what you really want is to match anything after the last /, you can take another approach:
Possible other solution: /([^/]*)$
The pattern above finds a / followed by a run of characters that are NOT another /, and keeps that run in group 1. The trailing $ anchors the match to the end of the string, so the / it finds is the last one. The issue here is that you could match URLs that do not have sweets in them.
Note: if you do not mind the / at the beginning, just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regexes. It marks the different matches in different colors.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without appending the part matched by your * at the end), I would use a regular expression to match the pattern in the URL but then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler; no groups are required :).
When doing this, I would make sure you do not hard-code the "localhost" part. I am also assuming that (http://) really means an optional http:// in front, as (http://) is not a valid protocol prefix.
If that is what you want, then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression matches an optional http:// followed by a host (be it localhost, an IP or a host name). You could omit the .* at the end if you want.
If that pattern matches, just replace the whole URL with the one you want to redirect to.
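The match-then-replace idea can be sketched in Python as follows. The function name redirect_target and the TARGET constant are illustrative assumptions; on a real WordPress site the redirect would normally be configured in the server or in WordPress itself rather than in Python.

```python
import re

# Pattern from the answer: optional http://, any host, then the sweets path.
SWEETS_RE = re.compile(r"(http://)?[^/]+/sweettemptations/sweets/.*")

# Assumed redirect destination, taken from the question.
TARGET = "http://localhost/sweettemptations/available-sweets"

def redirect_target(url):
    """Return TARGET for any sweets URL, otherwise return the URL unchanged."""
    return TARGET if SWEETS_RE.fullmatch(url) else url
```

Because the whole URL is replaced, nothing after sweets/ leaks into the result, so no capture group is needed to hold it.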

Use this regular expression: (?<=://).+
It matches everything that follows the :// in the URL.

Regex Base URL Grabbing

I'm trying to reduce a bunch of URLs to their base URL, which doesn't include the www or any other prefix. I'm having trouble writing an expression to capture it, and multi-part TLDs such as co.uk make it a rather more complicated issue.
answers.yahoo.com => yahoo.com
www.google.com => google.com
uk.answers.yahoo.co.uk => yahoo.co.uk
www.g.se => g.se
Any suggestions?
I was using this expression, but it fails when the domain name is no more than 2 characters long or when the TLD is shorter than 2 characters.
(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$

How do you know that the base of uk.answers.yahoo.co.uk is yahoo.co.uk, but the base of, for example, foo.bar.maps.google.com isn't maps.google.com?

[^\.]*\.(?:co\.uk|\w{2,3})$
You'll need to add the known multi-part suffixes (such as co.uk) to the regex.
http://regexr.com?30p4r
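A minimal Python sketch of this approach, assuming a hand-maintained suffix list (KNOWN_SUFFIXES and base_domain are hypothetical names, and the list contains only the one suffix mentioned in this thread):

```python
import re

# Multi-part suffixes the regex must know about; extend as needed (assumption).
KNOWN_SUFFIXES = ["co.uk"]

# re.escape makes the dots in each suffix literal before they are joined.
suffix_alt = "|".join(re.escape(s) for s in KNOWN_SUFFIXES)
BASE_RE = re.compile(r"[^.]*\.(?:" + suffix_alt + r"|\w{2,3})$")

def base_domain(host):
    """Return the last label plus its suffix, or None if nothing matches."""
    m = BASE_RE.search(host)
    return m.group(0) if m else None
```

This reproduces the four examples from the question, but any multi-part suffix that is missing from the list is treated as an ordinary domain, which is the ambiguity raised in the comment above.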

Regex match url without file extension

I would like some help matching the following urls.
/settings => /settings.php
/657_46hallo => /657_46hallo.php
/users/create => /users.php/create
/contact/create/user => /contact.php/create/user
/view/info.php => /view.php/info.php
/view/readme - now.txt => /view.php/readme - now.txt
/ => [NO MATCH]
/filename.php => /unknown.php
/filename.php/users/create => /unknown.php
If the first part after the domain name is a filename ending with ".php" (see the last 2 examples), it should redirect to /unknown.php.
I think I need 2 regular expressions:
The 1st should be something like: ^/([a-zA-Z0-9_]+)(/)?(.*)?$
The 2nd should catch a direct filename such as "/filename.php" or "/filename.php/create/user", so I can redirect to unknown.php.
The 1st regular expression that I have almost works for the first part.
==============================================
request url: http://domain.com/user/create
regex: ^/([a-zA-Z0-9_]+)(/)?(.*)?$
replace http://domain.com/$1.php$2$3
makes: http://domain.com/user.php/create
Problem is it also matches http://domain.com/user.php/create
If someone could help me with both regular expressions that would be great.

If you want to match those .php cases you can try this:
^\/([a-zA-Z0-9_]+)(\/)?(.*)?$
If you want to avoid those cases try this:
^/([a-zA-Z0-9_]+)(?!\.php)(?:(/)(.*)|)$
The (?!\.php) is a negative lookahead that ensures there is no .php at this position.
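The effect of the lookahead can be tried out with a short Python sketch. The function name add_php is an assumption, and the replacement mirrors the $1.php$2$3 template from the question:

```python
import re

# Second pattern from the answer: the lookahead rejects a first segment ending in .php.
REWRITE_RE = re.compile(r"^/([a-zA-Z0-9_]+)(?!\.php)(?:(/)(.*)|)$")

def add_php(path):
    """Insert .php after the first path segment, or return None when the pattern fails."""
    m = REWRITE_RE.match(path)
    if m is None:
        return None
    # Groups 2 and 3 are unset when the path has only one segment.
    name, slash, rest = m.group(1), m.group(2) or "", m.group(3) or ""
    return "/" + name + ".php" + slash + rest
```

Paths such as /filename.php and /filename.php/users/create no longer match, so the caller can send those to /unknown.php as a separate step.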

When all you have is a hammer...
While this probably could be solved with a regexp, it is probably the wrong tool for the job, unless you have constraints that MANDATE the use of regexps.
Split the string using '/' as the delimiter, see whether the first component ends with '.php'; if so, reject it, otherwise append '.php' to the first component and join the components back using '/'.
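The split-and-join procedure described above might look like this in Python. It is only a sketch: the name rewrite_path is hypothetical, and the /unknown.php fallback is taken from the question rather than from this answer.

```python
def rewrite_path(path):
    """Rewrite /first/rest to /first.php/rest without using a regex."""
    if not path.startswith("/"):
        return None
    # Split off the first component; sep is "" when there is no further "/".
    first, sep, rest = path[1:].partition("/")
    if not first:
        return None                # "/" alone: nothing to rewrite
    if first.endswith(".php"):
        return "/unknown.php"      # direct file names are rejected
    return "/" + first + ".php" + sep + rest
```

Unlike the regex versions, this does not restrict which characters the first component may contain; add such a check if that restriction matters.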

This regex matches and shouldn't. Why is it?

This regex:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}
|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([a-zA-Z0-9\?\=\&\%\/]*)?$
Formatted for readability:
^( # Begin regex / begin address clause
(https?|ftp)\:(\/\/)|(file\:\/{2,3}))? # protocol
( # container for two address formats, more to come later
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) # match IP addresses
)|( # delimiter for address formats
((([a-zA-Z0-9]+)(\.)?)+?) # match domains and any number of subdomains
(\.) #dot for .com
([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum) #TLD clause
) # end address clause
([a-zA-Z0-9\?\=\&\%\/]*)? # querystring support, will pretty this up later
$
is matching:
www.google
and shouldn't be. This is one of my "fail" test cases. I have declared the TLD portion of the URL to be mandatory when matching on alpha instead of on IP, and "google" doesn't fit into the "[a-z]{2}" clause.
Keep in mind I will fix the following issues separately; this question is only about why it matches www.google when it shouldn't:
Querystring needs to support proper formats only, currently accepts any combination of querystring characters
Several protocols not supported, though the scope of my requirements may not include them
uncommon TLDs with 3 characters not included
Probably matches http://www.google..com - will check for consecutive dots
Doesn't support decimal IP address formats
What's wrong with my regex?
edit: See also a previous problem with an earlier version of this regex on a different test case: "How can I make this regex match correctly?"
edit2: Fixed - The corrected regex (as asked) is:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([\/][\/a-zA-Z0-9\.]*)*?([\/]?[\?][a-zA-Z0-9\=\&\%\/]*)?$

"google" might not fit in [a-z]{2}, but it does fit in [a-z]{2}([a-zA-Z0-9\?\=\&\%\/]*)? - you forgot to require a / after the TLD if the URL extends beyond the domain. So it's interpreting it with "www.go" as the domain and then "ogle" following it, with no slash in between. You can fix it by adding a [?/] to the front of that last group to require one of those two symbols between the TLD and any further portion of the URL.

Your TLD clause matches "go" in google and the querystring support part matches "ogle" afterwards. Try changing the querystring part to this:
([?/][a-zA-Z0-9\?\=\&\%\/]*)?
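The "go" plus "ogle" split is easy to reproduce. The Python sketch below uses a deliberately simplified stand-in for the domain branch of the question's regex (no protocol, no IP branch, a shortened TLD list), so DOMAIN, LOOSE and STRICT are illustrative assumptions rather than the full pattern:

```python
import re

# Simplified domain branch: labels, a dot, then a TLD captured as group 1.
DOMAIN = r"^(?:[a-zA-Z0-9]+\.?)+?\.([a-z]{2}|com|org|net)"

# Original trailing group: any of these characters may follow the TLD directly.
LOOSE = re.compile(DOMAIN + r"([a-zA-Z0-9?=&%/]*)?$")

# Suggested fix: whatever follows the TLD must start with ? or /.
STRICT = re.compile(DOMAIN + r"([?/][a-zA-Z0-9?=&%/]*)?$")

loose = LOOSE.match("www.google")            # matches: TLD "go", remainder "ogle"
strict = STRICT.match("www.google")          # no longer matches
valid = STRICT.match("www.google.com/search?q=1")
```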

"google" doesn't fit into the "[a-z]{2}" clause.
But "go" does, and then "ogle" matches "([a-zA-Z0-9\?\=\&\%/]*)?".

I need a regEx to match general URLs

I need to test for general URLs using any protocol (http, https, shttp, ftp, svn, mysql and things I don't know about).
My first pass is this:
\w+://(\w+\.)+[\w+](/[\w]+)(\?[-A-Z0-9+&##/%=~_|!:,.;]*)?
(PCRE and .NET, so nothing too fancy)

According to RFC2396:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
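This expression does not validate anything; it only splits a string into components, which is why it accepts any protocol. A short Python sketch (the function name split_uri and the second sample URI are assumptions for illustration) shows which groups carry which component:

```python
import re

# The generic URI-splitting expression quoted above.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI reference into its five components; missing parts are None."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

rfc_example = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
svn_example = split_uri("svn://example.org/repo?rev=5")
```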

adding that RegEx as a wiki answer:
[\w+-]+://([a-zA-Z0-9]+\.)+[[a-zA-Z0-9]+](/[%\w]+)(\?[-A-Z0-9+&##/%=~_|!:,.;]*)?
option 2 (Re CMS)
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
But that's too lax for anything sane, so I trimmed it to make it more restrictive and to distinguish URLs from other things.
proto :// name : pass @ server :port /path ? args
^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?

I came at this from a slightly different direction. I wanted to emulate Gchat's ability to match something.co.uk and linkify it. So I went with a regex that looks for a . that has neither another period directly after it nor whitespace on either side, and then grabs everything around it until it hits whitespace. It does match a period at the end of a URI, but I take that off later. So this could be an option if you would prefer false positives over missing some potential links.
import re

url_re = re.compile(r"""
    [^\s]                # not whitespace
    [a-zA-Z0-9:/\-]+     # the protocol and domain name
    \.(?!\.)             # a literal '.' not followed by another
    [\w\-\./\?=&%~#]+    # country and path components
    [^\s]                # not whitespace""", re.VERBOSE)
url_re.findall('http://thereisnothing.com/a/path adn some text www.google.com/?=query#%20 https://somewhere.com other-countries.co.nz. ellipsis... is also a great place to buy. But try text-hello.com ftp://something.com')
['http://thereisnothing.com/a/path',
'www.google.com/?=query#%20',
'https://somewhere.com',
'other-countries.co.nz.',
'text-hello.com',
'ftp://something.com']
