I need a regex to match general URLs

I need to test for general URLs using any protocol (http, https, shttp, ftp, svn, mysql and things I don't know about).
My first pass is this:
\w+://(\w+\.)+[\w+](/[\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?
(PCRE and .NET, so nothing too fancy)

According to RFC2396:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
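A minimal Python sketch of how that RFC 2396 pattern can be used: the interesting capture groups are 2 (scheme), 4 (authority), 5 (path), 7 (query) and 9 (fragment). The helper name `split_uri` and the sample URI are made up for illustration.

```python
import re

# Generic URI pattern from RFC 2396, exactly as quoted above.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return the generic URI components; absent parts come back as None."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# Hypothetical sample URI, chosen to exercise every component.
parts = split_uri("svn://user@example.org:3690/repo/trunk?rev=12#top")
print(parts)
```

Because every part is optional, the pattern accepts any string at all, so it is better suited to splitting a URI than to deciding whether some text is one.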

adding that RegEx as a wiki answer:
[\w+-]+://([a-zA-Z0-9]+\.)+[[a-zA-Z0-9]+](/[%\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?
option 2 (Re CMS)
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
But that's too lax for anything sane, so I trimmed it to make it more restrictive and to differentiate it from other things.
proto :// name : pass @ server :port /path ? args
^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?

I came at this from a slightly different direction. I wanted to emulate gchat's ability to match something.co.uk and linkify it. So I went with a regex that looks for a . that has neither a following period nor a space on either side, and then grabs everything around it until it hits whitespace. It does match a period at the end of a URI, but I'm taking that off later. So this could be an option if you would prefer false positives over missing some potential matches.
import re

url_re = re.compile(r"""
[^\s] # not whitespace
[a-zA-Z0-9:/\-]+ # the protocol and domain name
\.(?!\.) # A literal '.' not followed by another
[\w\-\./\?=&%~#]+ # country and path components
[^\s] # not whitespace""", re.VERBOSE)
url_re.findall('http://thereisnothing.com/a/path adn some text www.google.com/?=query#%20 https://somewhere.com other-countries.co.nz. ellipsis... is also a great place to buy. But try text-hello.com ftp://something.com')
['http://thereisnothing.com/a/path',
'www.google.com/?=query#%20',
'https://somewhere.com',
'other-countries.co.nz.',
'text-hello.com',
'ftp://something.com']

Related

Regex get domain name from email

I am learning regex and am having trouble extracting google from an email address.
String:
first.name@google.com
I just want to get google, not google.com
Regex:
[^@].+(?=\.)
Result: https://regex101.com/r/wA5eX5/1
From my understanding, it ignores the @ and then finds the string after it up to the . (dot), using (?=\.)
What did I do wrong?
[^@] means "match one symbol that is not an @ sign". That is not what you are looking for. Use a lookbehind (?<=@) for the @ and your (?=\.) lookahead for the dot to extract the server name in the middle:
(?<=@)[^.]+(?=\.)
The middle portion [^.]+ means "one or more non-dot characters".
Demo.
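For anyone who wants to try this outside regex101, here is a small Python sketch comparing the question's pattern with the lookbehind version, using `@` as the address separator. The helper name `first_label` is made up for illustration.

```python
import re

# Question's attempt: one non-@ character, then greedy .+ up to the last dot.
ATTEMPT_RE = re.compile(r"[^@].+(?=\.)")
# Answer: start right after "@", take non-dot characters, stop before a dot.
DOMAIN_RE = re.compile(r"(?<=@)[^.]+(?=\.)")

def first_label(address):
    """Return the first label after the @, or None if nothing matches."""
    m = DOMAIN_RE.search(address)
    return m.group(0) if m else None

address = "first.name@google.com"
print(ATTEMPT_RE.search(address).group(0))  # first.name@google
print(first_label(address))                 # google
```

The first pattern fails because `[^@]` consumes a single character and the greedy `.+` then runs all the way to the last dot, regardless of where the `@` is.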
Updated answer: use a capturing group and keep it simple :)
@(\w+)
Explanation by splitting it up:
( capturing group for extraction )
\w stands for a word character [A-Za-z0-9_]
+ is a quantifier for one or more occurrences of \w
Regex explanation and demo on Regex101
I used the solution's regex for my task, but realized that some of the emails weren't that easy: foo@us.industries.com, foobar@tm.valves.net, andfoo@ge.test.com
To anyone who came here wanting the subdomain as well (or whose match is being cut off by it), here's the regex:
(?<=@)[^.]*.[^.]*(?=\.)
This should be the regex:
(?<=@)[^.]+
(?<=@) - places the search right after the @
[^.]+ - takes all the characters that are not a dot (stops at the first dot)
So it extracts google from the email address.
I was working on getting the domain name of email addresses, and none of the answers did what I needed:
Not catch subdomains
Match country-code top-level domains (like .com.ar or .co.jp)
For example, in test@ext.domain.com.mx I need to match domain.com.mx
So I made this one:
[^.@]*?\.\w{2,}$|[^.@]*?\.com?\.\w{2}$
Here is a link to regex101 to illustrate the regex: https://regex101.com/r/vE8rP9/59
You can get the subdomain name (without the top-level domain, e.g. .com or .com.mx) by adding lookaround operators (but it will match twice in test@test.com.mx):
[^.@]*?(?=\.\w{2,}$)|[^.@]*?(?=\.com?\.\w{2}$)
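A quick Python sketch of the first of those two patterns (the one without lookarounds), using `@` as the address separator. The helper name `registered_domain` and the `.co.jp` sample address are made up for illustration.

```python
import re

# Either "label.tld" at the end, or "label.co(m).cc" at the end.
REGISTERED_RE = re.compile(r"[^.@]*?\.\w{2,}$|[^.@]*?\.com?\.\w{2}$")

def registered_domain(address):
    """Return the trailing domain (with country suffix, if any) or None."""
    m = REGISTERED_RE.search(address)
    return m.group(0) if m else None

print(registered_domain("test@ext.domain.com.mx"))  # domain.com.mx
print(registered_domain("first.name@google.com"))   # google.com
```

Note that the second alternative only knows about `.co.xx` and `.com.xx` suffixes; other second-level registries would need their own alternative.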
Maybe not strictly a "full regex answer", but a more flexible option (in case the part before the @ is not "first.last") would be using cut:
cut -d @ -f 2 | cut -d . -f 1
The first cut isolates the part after the @, and the second one gets what you want.
This also works for other kinds of email patterns: xxxx@server.com, xxx.yyy.zzz@server.com, and so on.
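The same two-step idea can be sketched without regex in Python, for anyone not working in a shell. The function name `domain_label` is made up; unlike `cut`, this version raises an `IndexError` when the input has no `@`.

```python
def domain_label(address):
    """Mimic `cut -d @ -f 2 | cut -d . -f 1` with plain string splitting."""
    after_at = address.split("@")[1]   # field 2 when splitting on "@"
    return after_at.split(".")[0]      # field 1 when splitting on "."

print(domain_label("first.name@google.com"))   # google
print(domain_label("xxx.yyy.zzz@server.com"))  # server
```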
Thanks everyone for your great responses, I took what you had and expanded it with labelled match-groups for easy extraction of separate parts.
Caveat : Regex.Speed = Slow
Another post mentioned how SLOW and nonperformant regexes are, and that is a fair point to remember. My particular need is targeting my own background/slow/reporting processes and therefore it doesn't matter how long it takes.
But it's good to remember whenever possible Regex should NOT be used in any sort of web page load or "needs-to-be-quick" kind of application. In that case you're much better off using substring to algorithmically strip down the inputs and throw away all the junk that I'm optionally matching/allowing/including here.
https://regex101.com/r/ZnU3OC/1
One Regex to rule them all...
Subdomain/Domain/TopLevelDomain/CountryCode extraction for Emails, domain lists, & URLs
Also handles ?Querystring=junk, Slashes/With/Paths, #anchors
Now with more broth, batteries not included
^(?<Email>.*@)?(?<Protocol>\w+:\/\/)?(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)(?:\.(?<CountryCode>[a-z]{2}))?)(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$
not overly complicated at all... why would you even say that?
Substitution / Outputs
EXAMPLE INPUT: "https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff"
EXAMPLE OUTPUT:
{
Protocol: "https://"
SubDomain: "www"
DomainWithTLD: "stackoverflow.co.uk"
Domain: "stackoverflow"
TopLevelDomain: "co"
CountryCode: "uk"
Path: "/path/2"
QString: "?q=mysearch&and=more#stuff"
}
Allowed/Compliant Domains : Should ALL MATCH
www.bankofamerica.com
bankofamerica.com.securersite.regexr.com
bankofamerica.co.uk.blahblahblah.secure.com.it
dashes-bad-for-seo.but-technically-still-allowed.not-in-front-or-end
bit.ly
is.gd
foo.biz.pl
google.com.cn
stackoverflow.co.uk
level_three.sub_domain.example.com
www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com
https://www.stackoverflow.co.uk?q=mysearch&and=more
foo://5th.4th.3rd.example.com:8042/over/there
foo://subdomain.example.com:8042/over/there?name=ferret#nose
example.com
www.example.com
example.co.uk
trailing-slash.com/
trailing-pound.com#
trailing-question.com?
probably-not-valid.com.cn?&#
probably-not-valid.com.cn/?&#
example.com/page
example.com?key=value
* NOTE: PunyCodes (Unicode in urls) handled just fine with \w ,no extra sauce needed
xn--fsqu00a.xn--0zwm56d.com
xn--diseolatinoamericano-66b.com
Emails : Should ALL MATCH
first.name@google1.co.com
foo@us.industries.com,
foobar@tm.valves.net,
andfoo@ge.test.com
jane.doe@my-bank.no
john.doe@spam.com
jane.ann.doe@sandnes.district.gov
Non-Compliant Domains : Should NOT MATCH
either not long-enough (domain min length 2), or too long (64)
v.gd
thing.y
0123456789012345678901234567890123456789012345678901234567891234.com
its-sixty-four-instead-of-sixty-three!.com
symbols-not-allowed@.com
symbols-not-allowed#.com
symbols-not-allowed$.com
symbols-not-allowed%.com
symbols-not-allowed^.com
symbols-not-allowed&.com
symbols-not-allowed*.com
symbols-not-allowed(.com
symbols-not-allowed).com
symbols-not-allowed+.com
symbols-not-allowed=.com
TBD Not handled:
* dashes as start or ending is disallowed (dropped from Regex for readability)
-junk-.com
* is underscore allowed? i donno... (but it simplifies the regex using \w instead of [a-zA-Z0-9\-] everywhere)
symbols-not-allowed_.com
* special case localhost?
.localhost
also see:
Domain Name Rules :: Super handy ASCII Diagram of a URL
see: https://stackoverflow.com/a/66660651/738895 *
Side NOTE: lazy load '?' for subdomains{0,127}? currently needed for any of the cases with country codes... (example: stackoverflow.co.uk)
Matches these, but does NOT grab $NLevelSubdomains in a match group, can only grab 3rd level only.
This is a relatively simple regex, and it grabs everything between the @ and the final domain extension (e.g. .com, .org). It allows domain names that contain non-word characters, which exist in real-world data.
>>> import re
>>> regex = re.compile(r"^.+@(.+)\.[\w]+$")
>>> regex.findall('jane.doe@my-bank.no')
['my-bank']
>>> regex.findall('john.doe@spam.com')
['spam']
>>> regex.findall('jane.ann.doe@sandnes.district.gov')
['sandnes.district']
I used this regular expression to get the complete domain name: '.*@+(.*)'. The leading .* skips all the characters before the @ (matched by @+), and the parenthesized group then captures the complete domain name, i.e. the rest of the string (excluding line-break characters).

Regular expression to match only domain from URL

I'm struggling with forming a regex that would match:
Just domain in case of URL
Whole string in case of no URL
Acceptance test (regex should match bold text):
http://mozart.co.uk
https://avocado.si/hmm
http://www.qwe123qwe.com
Starbucks
Benchmark 123
So far I've come up with this:
([^\/\/]+)(?:,|$)
It works fine, but not for URLs with trailing slash on the end. How can I modify the expression to include full path (everything on the right side of http(s)://) as well? Thank you.
This regex will match them if it starts with http:// or https:// until the next slash. If it doesn't start with http:// nor https:// then it will match the whole string. Close enough?
(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)
I should note that most languages have functions built in to properly parse URLs and these are preferable.
You should note that I've got 2 sets of capturing parentheses, so depending on your language that may be significant.
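To illustrate the point about the two sets of capturing parentheses, here is a rough Python sketch: group 1 is filled for http(s) URLs and group 2 for everything else. The helper name `domain_or_whole` is made up, and the sketch assumes single-line input.

```python
import re

# Same pattern as above; "/" needs no escaping in a Python raw string.
DOMAIN_OR_ALL = re.compile(r"(?:^https?://([^/]+)(?:[/,]|$)|^(.*)$)")

def domain_or_whole(text):
    """Return the host for http(s) URLs, otherwise the whole (single-line) string."""
    m = DOMAIN_OR_ALL.search(text)
    # Only one of the two groups participates in any given match.
    return m.group(1) if m.group(1) is not None else m.group(2)

for sample in ["http://mozart.co.uk", "https://avocado.si/hmm",
               "http://www.qwe123qwe.com", "Starbucks", "Benchmark 123"]:
    print(repr(domain_or_whole(sample)))
```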
Maybe this: ^(http[s]?:\/\/)?(.*)$. Play here: https://regex101.com/r/iZ2vL4/1
This one has several matching groups; the domain you want will be in the 4th matching group.
/^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{1,3}(\.[^:\/\s\.]{1,2})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$/mg
Regex101.com workbench to check out your URLs just paste them in the "TEST STRING" Textbox to test it out.
Don't recall where I got this... so I don't know who to credit. But it's pretty slick!

parse url from string in coldfusion

I need to parse all URLs from a paragraph (string),
e.g.
"check out this site google.com and don't forget to see this too bing.com/maps"
It should return "google.com" and "bing.com/maps".
I'm currently using this, and it's not working perfectly:
reMatch("(^|\s)[^\s#]+\.[^\s#\?\/]{2,5}((\?|\/)\S*)?",mystring)
thanks
You need to define more clearly what you consider a URL
For example, I might use something such as this:
(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?
(use with reMatchNoCase or plonk (?i) at front to ignore case)
Which specifically only allows alphanumerics, underscore, and hyphen in domain and path parts, requires the TLD to be letters only, and only looks for numeric ports.
It might be that this is good enough, or you may need something that looks for more characters, or perhaps you want to trim things like quotes, brackets, etc. off the end of the URL, or whatever. It depends on the context of what you're doing as to whether you'd like to err towards missing URLs or detecting non-URLs.
(I'd probably go for the latter, then potentially run a secondary filter to verify if something is a URL, but that takes more work, and may not be necessary for what you're doing.)
Anyhow, the explanation of the above expression is below, hopefully with clear comments to help it make sense. :)
(Note that all groups are non-capturing (?:...) since we don't need the individual parts.)
# PROTOCOL
(?:https?:)? # optional group of "http:" or "https:"
# SERVER NAME / DOMAIN
(?://)? # optional double forward slash
(?:[\w-]+\.)+ # one or more "word characters" or hyphens, followed by a literal .
# grouped together and repeated one or more times
[a-z]{2,6} # as many as 6 alphas, but at least 2
# PORT NUMBER
(?::\d+)? # an optional group made up of : and one or more digits
# PATH INFO
(?:/[\w.,-]+)* # a forward slash then multiple alphanumeric, underscores, or hyphens
# or dots or commas (add any other characters as required)
# in a group that might occur multiple times (or not at all)
# QUERY STRING
(?:\?\S+)? # an optional group containing ? then any non-whitespace
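The commented layout above is not CF-specific; as a rough sketch, the same expression can be written in Python's verbose mode, with `re.IGNORECASE` standing in for reMatchNoCase. The variable names are made up for illustration.

```python
import re

URL_RE = re.compile(r"""
    (?:https?:)?        # PROTOCOL: optional "http:" or "https:"
    (?://)?             # optional double forward slash
    (?:[\w-]+\.)+       # DOMAIN: one or more dot-terminated labels
    [a-z]{2,6}          # TLD: 2 to 6 letters
    (?::\d+)?           # PORT: optional ":" plus digits
    (?:/[\w.,-]+)*      # PATH: zero or more "/segment" parts
    (?:\?\S+)?          # QUERY: optional "?" plus non-whitespace
""", re.VERBOSE | re.IGNORECASE)

text = "check out this site google.com and don't forget to see this too bing.com/maps"
urls = URL_RE.findall(text)
print(urls)  # ['google.com', 'bing.com/maps']
```

Since every group is non-capturing, `findall` returns the whole matches rather than tuples of parts.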
Update:
To prevent the end of email addresses being matched, we need to use a lookbehind, to ensure that prior to the URL we don't have an @ sign (or anything else unwanted) but without actually including that prior character in the match.
CF's regex is Apache ORO which doesn't support lookbehinds, but we can use the java.util.regex nice and easily with a component I have created which does support lookbehinds.
Using that is as simple as:
<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
...
<cfset Urls = jrex.match( regex , input ) />
After the createObject, it should basically be like using the built-in re~ stuff, but with the slight syntax difference, and the different regex engine under the hood.
(If you have any problems or questions with the component, let me know.)
So, on to your excluding emails from URL matching problem:
We can either do a (?<=positive) or (?<!negative) lookbehind, depending on if we want to say "we must have this" or "we must not have this", like so:
(?<=\s) # there must be whitespace before the current position
(?<!@) # there must NOT be an @ before current position
For this URL example, I would expand either of those examples to:
(?<=\s|^) # look for whitespace OR start of string
or
(?<![@\w/]) # ensure there is not an @ or / or word character.
Both will work (and can be expanded with more chars), but in different ways, so it simply depends which method you want to do it with.
Put whichever one you like at the start of your expression, and it should no longer match the end of abcd@gmail.com, unless I've screwed something up. :)
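For comparison, a small Python sketch of the negative-lookbehind variant, using `@` as the email separator. Python's `re` module only accepts fixed-width lookbehinds, so the `(?<=\s|^)` form is rejected there while the single-character negative class is fine. The variable names are made up for illustration.

```python
import re

URL_RE = re.compile(
    r"(?<![@\w/])"   # not preceded by "@", "/" or a word character
    r"(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}"
    r"(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?",
    re.IGNORECASE,
)

sample = (
    "check out this site google.com and don't forget to see this too bing.com/maps\n"
    "this is an email@somewhere.com which should not be matched"
)
matched = URL_RE.findall(sample)
print(matched)  # ['google.com', 'bing.com/maps']
```

Including `\w` in the lookbehind matters: without it the regex could simply start matching one character further into `somewhere.com`.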
Update 2:
Here is some sample code which will exclude any email addresses from the match:
<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
<cfsavecontent variable="SampleInput">
check out this site google.com and don't forget to see this too bing.com/maps
this is an email@somewhere.com which should not be matched
</cfsavecontent>
<cfset FindUrlRegex = '(?<=\s|^)(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?' />
<cfset MatchedUrls = jrex.match( FindUrlRegex , SampleInput ) />
<cfdump var=#MatchedUrls#/>
Make sure you have downloaded the jre-utils.cfc from here and put in an appropriate place (e.g. same directory as script running this code).
This step is required because the (?<=...) construct does not work in CF regular expressions.

What would be the best way to extract the host portion of a url with regexp?

I'm extracting the host from my url and am getting jammed up by making the last / optional.
the regexp needs to be prepared to receive the following:
http://a.b.com:8080/some/path/file.txt
or
ftp://a.b.com:8080/some/path
or
ftp://user@a.b.com/some/path
or
http://a.b.com
or
a.b.com/some/path
and return a.b.com
so...
(ftp://|http://)? optionally matches the first part
then it gets hairy...
so... without adding ugly (and wrong) regexp here... just in english
(everything that isn't an '@') //optional
(everything that isn't a '/' up to the first '/' IF it's there) //this is the host group that I want
(everything else that trails) //optional
Do you need to use a regex? Most languages have support for parsing URLs. For instance, Java has its java.net.URL, Python has its urlparse module (urllib.parse in Python 3) and Ruby has its URI module. You can use these to query different parts of a given URL.
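As a rough illustration of the parser route in Python 3, here is a sketch built on `urllib.parse.urlsplit`. The helper name `host_of` is made up, and the "//" prefix is a workaround assumed here for inputs that come without a scheme.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the host part of a URL, with or without a scheme."""
    # urlsplit only sees a host when the authority starts with "//",
    # so add that prefix for inputs like "a.b.com/some/path".
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname

for sample in ["http://a.b.com:8080/some/path/file.txt",
               "ftp://a.b.com:8080/some/path",
               "ftp://user@a.b.com/some/path",
               "http://a.b.com",
               "a.b.com/some/path"]:
    print(host_of(sample))
```

`hostname` already drops the user info and the port, which is exactly the part the regex attempts struggle with.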
Jeremy Ruten's answer is close but will fail if an @ appears anywhere after the hostname. I'd suggest:
(everything that isn't an '@') //optional
(?:[^@:/]*@)?
The colon and slash prevent matching past the domain if @ appears after the domain. Note the non-capturing parens.
(everything that isn't a '/' up to the first '/' IF it's there)
//this is the host group that I want
([^:/]+)
Note the capturing parens.
(everything else that trails) //optional
Since the parens capture the hostname and only the hostname, there's no need to continue matching.
So, putting it all together you get:
/^(?:ftp|https?)://(?:[^@:/]*@)?([^:/]+)/
(Note that the first two paren groupings are non-capturing -- hopefully your regex library supports that.)
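Here is a Python sketch of that combined expression, with `@` as the user-info separator. One deliberate change, which is my assumption rather than part of the answer: the scheme group is made optional so that the scheme-less example from the question (a.b.com/some/path) is also covered. The helper name `extract_host` is made up.

```python
import re

HOST_RE = re.compile(
    r"^(?:(?:ftp|https?)://)?"   # optional scheme (made optional in this sketch)
    r"(?:[^@:/]*@)?"             # optional "user@" part
    r"([^:/]+)"                  # host: everything up to ":" or "/"
)

def extract_host(url):
    """Return the captured host, or None if the pattern does not match."""
    m = HOST_RE.match(url)
    return m.group(1) if m else None

print(extract_host("ftp://user@a.b.com/some/path"))  # a.b.com
```

A `user:password@host` prefix is not handled, because the user part excludes colons.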
I've tested this in PHP and it works on all of your examples:
/^(ftp:\/\/|https?:\/\/)?(.+@)?([a-zA-Z0-9\.\-]+).*$/

This regex matches and shouldn't. Why is it?

This regex:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}
|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([a-zA-Z0-9\?\=\&\%\/]*)?$
Formatted for readability:
^( # Begin regex / begin address clause
(https?|ftp)\:(\/\/)|(file\:\/{2,3}))? # protocol
( # container for two address formats, more to come later
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) # match IP addresses
)|( # delimiter for address formats
((([a-zA-Z0-9]+)(\.)?)+?) # match domains and any number of subdomains
(\.) #dot for .com
([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum) #TLD clause
) # end address clause
([a-zA-Z0-9\?\=\&\%\/]*)? # querystring support, will pretty this up later
$
is matching:
www.google
and shouldn't be. This is one of my "fail" test cases. I have declared the TLD portion of the URL to be mandatory when matching on alpha instead of on IP, and "google" doesn't fit into the "[a-z]{2}" clause.
Keep in mind I will fix the following issues separately - this question is about why it matches www.google and shouldn't.
Querystring needs to support proper formats only, currently accepts any combination of querystring characters
Several protocols not supported, though the scope of my requirements may not include them
uncommon TLDs with 3 characters not included
Probably matches http://www.google..com - will check for consecutive dots
Doesn't support decimal IP address formats
What's wrong with my regex?
edit: See also a previous problem with an earlier version of this regex on a different test case:
How can I make this regex match correctly?
edit2: Fixed - The corrected regex (as asked) is:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([\/][\/a-zA-Z0-9\.]*)*?([\/]?[\?][a-zA-Z0-9\=\&\%\/]*)?$
"google" might not fit in [a-z]{2}, but it does fit in [a-z]{2}([a-zA-Z0-9\?\=\&\%\/]*)? - you forgot to require a / after the TLD if the URL extends beyond the domain. So it's interpreting it with "www.go" as the domain and then "ogle" following it, with no slash in between. You can fix it by adding a [?/] to the front of that last group to require one of those two symbols between the TLD and any further portion of the URL.
Your TLD clause matches "go" in google and the querystring support part matches "ogle" afterwards. Try changing the querystring part to this:
([?/][a-zA-Z0-9\?\=\&\%\/]*)?
"google" doesn't fit into the "[a-z]{2}" clause.
But "go" does, and then "ogle" matches "([a-zA-Z0-9\?\=\&\%/]*)?"
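The "go" + "ogle" split can be reproduced with a cut-down version of the domain branch. This Python sketch drops the protocol and IP parts and shortens the TLD list, so it is an illustration of the diagnosis, not the asker's full regex; the names `BUGGY` and `FIXED` are made up.

```python
import re

# Lazy labels, a dot, a TLD, then the trailing "querystring" group.
BUGGY = re.compile(
    r"^((?:[a-zA-Z0-9]+\.?)+?)\.([a-z]{2}|com|org|net)([a-zA-Z0-9?=&%/]*)?$")
# Same pattern, but the tail must start with "?" or "/".
FIXED = re.compile(
    r"^((?:[a-zA-Z0-9]+\.?)+?)\.([a-z]{2}|com|org|net)([?/][a-zA-Z0-9?=&%/]*)?$")

buggy_groups = BUGGY.match("www.google").groups()
print(buggy_groups)                     # ('www', 'go', 'ogle')
print(FIXED.match("www.google"))        # None
print(FIXED.match("www.google.com").group(2))  # com
```

With the separator required, nothing can follow the two-letter TLD directly, so the engine has to find a real TLD at the end of the host.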