URL regex group catching - regex

Hello I'm trying to find a regex that would catch the terms in a url.
For example, given:
https://stackoverflow.com, it would catch "stackoverflow"
and given https://stackoverflow.com/questions/ask, it would catch "stackoverflow", "questions", "ask" and any potential terms in between the slash character after the domain name.
Up until now I managed to find the following regex but it cannot repeat catching groups
https?:\/\/(?:www\.)?([\da-z-]*)(?:[\.a-z]*)(?:\/([\da-z]*)\/?)+
Do you guys have any ways to resolve that issue?? that would be great.

I testet the answer of Michal M it appears not to get "www." so I updated it
/(?:\/(?:w{3}\.)?)\K([\w]+)/i
Edit: As soon as it's not important to match the "www." I placed it inside a non capturing group so it won't be captured. Btw I also placed the case insensitive modifier so "WWW." would be okay too.

Try this one:
(?:(\/))\K(\w+)
tested in notepad++

You may try using two separate regexes -- one for the hostname part and another for the terms in the path part. Then combine them with alternation construction and do global search:
https?:\/\/(?:\w+\.)*(\w+)\.\w+ # this would capture hostname "term"
|
\/(\w+) # this would capture path "terms"
(Note: requires /x modifier.)
Demo: https://regex101.com/r/nA8jT9/2

Thanks I managed to rearrange it for it to work with the "www"
(?:\/(?:www\.)?)\K([\w\d]+)

Related

Regex get domain name from email

I am learning regex and am having trouble getting google from email address
String
first.name#google.com
I just want to get google, not google.com
Regex:
[^#].+(?=\.)
Result: https://regex101.com/r/wA5eX5/1
From my understanding. It ignore # find a string after that until . (dot) using (?=\.)
What did I do wrong?
[^#] means "match one symbol that is not an # sign. That is not what you are looking for - use lookbehind (?<=#) for # and your (?=\.) lookahead for \. to extract server name in the middle:
(?<=#)[^.]+(?=\.)
The middle portion [^.]+ means "one or more non-dot characters".
Demo.
Updated answer:Use a capturing group and keep it simple :)
#(\w+)
Explanation by splitting it up
( capturing group for extraction )
\w stands for word character [A-Za-z0-9_]
+ is a quantifier for one or more occurances of \w
Regex explanation and demo on Regex101
I used the solution's regex for my task, but realized that some of the emails weren't that easy: foo#us.industries.com, foobar#tm.valves.net, andfoo#ge.test.com
To anyone who came here wanting the sub domain as well (or is being cut off by it), here's the regex:
(?<=#)[^.]*.[^.]*(?=\.)
This should be the regex:
(?<=#)[^.]+
(?<=#) - places the search right after the #
[^.]+ - take all the characters that are not dot (stops on dot)
So it extracts google from the email address.
As I was working to get the domain name of email addresses and none corresponded to what I needed:
To not catch subdomains
To match countries top domains (like .com.ar or co.jp)
For example, in test#ext.domain.com.mx I need to match domain.com.mx
So I made this one:
[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$
Here is a link to regex101 to illustrate the regex: https://regex101.com/r/vE8rP9/59
You can get the sumdomain name (without the top-level domain ex: .com or .com.mx) by adding lookaround operators (but it will match twice in test#test.com.mx):
[^.#]*?(?=\.\w{2,}$)|[^.#]*?(?=\.com?\.\w{2}$)
Maybe not strictly a "full regex answer" but more flexible ( in case the part before the # is not "first.last") would be using cut:
cut -d # -f 2 | cut -d . -f 1
The first cut will isolate the part after # and the second one will get what you want.
This will work also for another kinds of email patterns : xxxx#server.com / xxx.yyy.zzz# server.com and so on...
Thanks everyone for your great responses, I took what you had and expanded it with labelled match-groups for easy extraction of separate parts.
Caveat : Regex.Speed = Slow
Another post mentioned how SLOW and nonperformant regexes are, and that is a fair point to remember. My particular need is targeting my own background/slow/reporting processes and therefore it doesn't matter how long it takes.
But it's good to remember whenever possible Regex should NOT be used in any sort of web page load or "needs-to-be-quick" kind of application. In that case you're much better off using substring to algorithmically strip down the inputs and throw away all the junk that I'm optionally matching/allowing/including here.
https://regex101.com/r/ZnU3OC/1
One Regex to rule them all...
Subdomain/Domain/TopLevelDomain/CountryCode extraction for Emails, domain lists, & URLs
Also handles ?Querystring=junk, Slashes/With/Paths, #anchors
Now with more broth, batteries not included
^(?<Email>.*#)?(?<Protocol>\w+:\/\/)?(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)(?:\.(?<CountryCode>[a-z]{2}))?)(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$
not overly complicated at all... why would you even say that?
Substitution / Outputs
EXAMPLE INPUT: "https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff"
EXAMPLE OUTPUT:
{
Protocol: "https://"
SubDomain: "www"
DomainWithTLD: "stackoverflow.co.uk"
Domain: "stackoverflow"
TopLevelDomain: "co"
CountryCode: "uk"
Path: "/path/2"
QString: "?q=mysearch&and=more#stuff"
}
Allowed/Compliant Domains : Should ALL MATCH
www.bankofamerica.com
bankofamerica.com.securersite.regexr.com
bankofamerica.co.uk.blahblahblah.secure.com.it
dashes-bad-for-seo.but-technically-still-allowed.not-in-front-or-end
bit.ly
is.gd
foo.biz.pl
google.com.cn
stackoverflow.co.uk
level_three.sub_domain.example.com
www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com
https://www.stackoverflow.co.uk?q=mysearch&and=more
foo://5th.4th.3rd.example.com:8042/over/there
foo://subdomain.example.com:8042/over/there?name=ferret#nose
example.com
www.example.com
example.co.uk
trailing-slash.com/
trailing-pound.com#
trailing-question.com?
probably-not-valid.com.cn?&#
probably-not-valid.com.cn/?&#
example.com/page
example.com?key=value
* NOTE: PunyCodes (Unicode in urls) handled just fine with \w ,no extra sauce needed
xn--fsqu00a.xn--0zwm56d.com
xn--diseolatinoamericano-66b.com
Emails : Should ALL MATCH
first.name#google1.co.com
foo#us.industries.com,
foobar#tm.valves.net,
andfoo#ge.test.com
jane.doe#my-bank.no
john.doe#spam.com
jane.ann.doe#sandnes.district.gov
Non-Compliant Domains : Should NOT MATCH
either not long-enough (domain min length 2), or too long (64)
v.gd
thing.y
0123456789012345678901234567890123456789012345678901234567891234.com
its-sixty-four-instead-of-sixty-three!.com
symbols-not-allowed#.com
symbols-not-allowed#.com
symbols-not-allowed$.com
symbols-not-allowed%.com
symbols-not-allowed^.com
symbols-not-allowed&.com
symbols-not-allowed*.com
symbols-not-allowed(.com
symbols-not-allowed).com
symbols-not-allowed+.com
symbols-not-allowed=.com
TBD Not handled:
* dashes as start or ending is disallowed (dropped from Regex for readability)
-junk-.com
* is underscore allowed? i donno... (but it simplifies the regex using \w instead of [a-zA-Z0-9\-] everywhere)
symbols-not-allowed_.com
* special case localhost?
.localhost
also see:
Domain Name Rules :: Super handy ASCII Diagram of a URL
see: https://stackoverflow.com/a/66660651/738895 *
Side NOTE: lazy load '?' for subdomains{0,127}? currently needed for any of the cases with country codes... (example: stackoverflow.co.uk)
Matches these, but does NOT grab $NLevelSubdomains in a match group, can only grab 3rd level only.
This is a relatively simple regex, and it grabs everything between the # and the final domain extension (e.g. .com, .org). It allows domain names that are made up of non-word characters, which exist in real-world data.
>>> regex = re.compile(r"^.+#(.+)\.[\w]+$")
>>> regex.findall('jane.doe#my-bank.no')
['my-bank']
>>> regex.findall('john.doe#spam.com')
['spam']
>>> regex.findall('jane.ann.doe#sandnes.district.gov')
['sandnes.district']
I used this regular expression to get the complete domain name '.*#+(.*)' where .* will ignore all the character before # (by #+) and start extracting cpmlete domain name by mentioning paranthesis and complete string inside(except linebrake characters)

Regex for URL to sites

I have two URLs with the patterns:
1.http://localhost:9001/f/
2.http://localhost:9001/flight/
I have a site filter which redirects to the respective sites if the regex matches. I tried the following regex patterns for the 2 URLs above:
http?://localhost[^/]/f[^flight]/.*
http?://localhost[^/]/flight/.*
Both URLS are getting redirected to the first site, as both URLs are matched by the first regex.
I have tried http?://localhost[^/]/[f]/.* also for the 1st url. I am Unable to get what am i missing . I feel that this regex should not accept any thing other than "f", but it is allowing "flight" as well.
Please help me by pointing the mistake i have done.
Keep things simple:
.*/f(/[^/]*)?$
vs
.*/flight(/[^/]*)?$
Adding ? before $ makes the trailing slash with optional path term optional.
The first one will be caught with following regex;
/^http:[\/]{2}localhost:9001\/f[^light]$/
The other one will be disallowed and can be found with following regex
/^http:[\/]{2}localhost:9001\/flight\/$/
You regex has several issues: 1) p? means optional p (htt:// will match), 2) [^/] will only match : in your URLs since it will only capture 1 character (and you have a port number), 3) [^light] is a negated character class that means any character that is not l, i, g, h, or t.
So, if you want to only capture localhost URLs, you'd better use this regex for the 1st site:
http://localhost[^/]*/f/.*
And this one for the second
http://localhost[^/]*/flight/.*
Please also bear in mind that depending on where you use the regexps, your actual input may or may not include either the protocol.
These should work for you:
http[s]{0,1}:\/\/localhost:[0-9]{4}\/f\/
http[s]{0,1}:\/\/localhost:[0-9]{4}\/flight\/
You can see it working here

Regular expression to match only domain from URL

I'm struggling with forming a regex that would match:
Just domain in case of URL
Whole string in case of no URL
Acceptance test (regex should match bold text):
http://mozart.co.uk
https://avocado.si/hmm
http://www.qwe123qwe.com
Starbucks
Benchmark 123
So far I've come up with this:
([^\/\/]+)(?:,|$)
It works fine, but not for URLs with trailing slash on the end. How can I modify the expression to include full path (everything on the right side of http(s)://) as well? Thank you.
This regex will match them if it starts with http:// or https:// until the next slash. If it doesn't start with http:// nor https:// then it will match the whole string. Close enough?
(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)
I should note that most languages have functions built in to properly parse URLs and these are preferable.
You should note that I've got 2 sets of capturing parentheses, so depending on your language that may be significant.
Maybe that ^(http[s]?:\/\/)?(.*)$. Play here: https://regex101.com/r/iZ2vL4/1
This will have Matching groups, the domain you want will be in the 4th matching group.
/^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{1,3}(\.[^:\/\s\.]{1,2})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$/mg
Regex101.com workbench to check out your URLs just paste them in the "TEST STRING" Textbox to test it out.
Don't recall where I got this... so I don't know who to credit. But it's pretty slick!

Regex to match anything after /

I'm basically not in the clue about regex but I need a regex statement that will recognise anything after the / in a URL.
Basically, i'm developing a site for someone and a page's URL (Local URL of Course) is say (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (It's a WordPress site) which have the URL of (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.
To do what you ask you need to use groups. In regular expression groups allow you to isolate parts of the whole match.
for example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parenthesis mark a group in this case it will be group 1 since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
the more precise you are with the pattern before the group the less likely you will have a possible false positive.
If what you really want is to match anything after the last / you can take another approach:
possible other solution: /([^/]*)
The pattern above will find a / with a string of characters that are NOT another / and keep it in group 1. Issue here is that you could match things that do not have sweets in the URL.
Note if you do not mind the / at the beginning then just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regex.. It will mark in different colors the different matches.
Hope this helps.
I may have misunderstood you requirement in my original post.
if you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without adding the part match by your * at the end) I would use a regular expression to match the pattern in the URL but them just blind replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
when doing this I would make sure you do not match the "localhost" part. Also I am assuming the (http://) really means an optional http:// in front as (http://) is not a valid protocol prefix.
so if that is what you want then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression will match the http:// part optionally with a host (be it localhost, an IP or the host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
use this regular expression (?<=://).+

Regex to match URL not followed by " or <

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.
For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com
www.foo.com http://www.baz.com
I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.
I can't see what I'm doing wrong... any ideas?
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])
Here is a simpler example too:
((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])
I looked into this issue last year and developed a solution that you may want to look at - See: URL Linkification (HTTP/FTP) This link is a test page for the Javascript solution with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and Javascript - is not simple (but neither is the problem as it turns out.) For more information I would recommend also reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Note also that John Gruber's regex has a component that can go into realm of catastrophic backtracking (the part which matches one level of matching parentheses).
Yeah, its actually trivial to make it work if you just want to exclude trailing characters, just make your expression 'independent', then no backtracking will occurr in that segment.
(?>\b ...)(?!["<])
A perl test:
use strict;
use warnings;
my $str = 'www.foo.com http://www.baz.comhttp://www.some.com';
while ($str =~ m~
(?>
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
)
(?!["<])
~xg)
{
print "$1\n";
}
Output:
www.foo.com
http://www.some.com