Regex needed to match a domain name for django view


I'm trying to match a url with a domain like:
Testing.com
testing.com
Testing.net
testing.net
Testing.org
testing.org
and other extensions as well.
I'm trying to write a regex to use in a Django URL pattern like:
(r'^Account/Testing/d=([a-z]{1,50})$', TestApp),
I tried ^[A-za-z]{2,50}$, but that doesn't match a domain that starts with a capital letter.
Any help?
Thank you!

You can use this:
/^(?:http(?:s)?:\/\/)?(?:w{3})\.([a-z_0-9-]+\.\w{2,3}(?:\.\w{2})?)/i
It will match links like these:
http://www.site.com
https://www.site.com
http://www.site.co.uk
https://www.site.co.uk
http://www.site.com.br
https://www.site.com.br
http://www.site-site.com.br
https://www.site-site.com
http://www.site-site.co.uk
https://www.site-site.co.uk
www.site-site.com
www.site-site.co.uk
www.site-site.com.br
www.site.com
and a lot of other variations.
Even if the URL has a path, such as
www.site.com/news
it will only capture "site.com".
The /i modifier makes the whole pattern case-insensitive.
If you only want the domain name part to be case-insensitive:
/^(?:http(?:s)?:\/\/)?(?:w{3})\.((?i:[a-z_0-9-])+\.\w{2,3}(?:\.\w{2})?)/
(?i:[a-z_0-9-]) applies case-insensitivity to the domain name characters only.
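For use from Python (and so from a Django view), a minimal sketch of the first pattern might look like the following. The slashes no longer need escaping, the /i modifier becomes re.IGNORECASE, and the helper name extract_domain is made up for illustration. Note that the pattern as written requires the www. prefix, so a bare Testing.com is not matched.

```python
import re

# Same pattern as in the answer; re.IGNORECASE plays the role of the /i modifier.
DOMAIN_RE = re.compile(
    r"^(?:https?://)?(?:w{3})\.([a-z_0-9-]+\.\w{2,3}(?:\.\w{2})?)",
    re.IGNORECASE,
)


def extract_domain(url):
    """Return the captured domain, or None if the URL does not match."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


print(extract_domain("https://www.site-site.co.uk"))  # site-site.co.uk
print(extract_domain("www.site.com/news"))            # site.com
print(extract_domain("Testing.com"))                  # None (no www. prefix)
```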

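Fortunately, this wasn't that bad after all. This is one way to match a domain with varying extensions:
^[A-Za-z]{2,50}\.[a-z]{1,3}$
It matches .com, .org, .net, etc. (The dot is escaped so it matches only a literal dot, and the range is written A-Za-z because A-z also includes punctuation such as [ and _.)
If you have a domain containing digits, like me2.com, it's better to use this:
(^[A-Za-z0-9]{2,50}\.[a-z]{1,3})$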
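A quick way to check a pattern like this before putting it into a URL conf is to try it with Python's re module. This is only a sketch: it uses the variant with the dot escaped and the range written as A-Za-z, and the name is_simple_domain is hypothetical.

```python
import re

# Letters or digits, a literal dot, then a 1-3 letter lowercase extension.
SIMPLE_DOMAIN_RE = re.compile(r"^[A-Za-z0-9]{2,50}\.[a-z]{1,3}$")


def is_simple_domain(text):
    """True if text looks like name.ext with a short lowercase extension."""
    return SIMPLE_DOMAIN_RE.match(text) is not None


for candidate in ["Testing.com", "testing.net", "me2.com", "testingXcom"]:
    print(candidate, is_simple_domain(candidate))
```

The last candidate is rejected only because the dot is escaped; with an unescaped dot, any character would be accepted in that position.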

Related

Regex get domain name from email

I am learning regex and am having trouble extracting google from an email address.
String:
first.name#google.com
I just want to get google, not google.com
Regex:
[^#].+(?=\.)
Result: https://regex101.com/r/wA5eX5/1
From my understanding, it ignores # and finds the string after it up to the . (dot) using (?=\.)
What did I do wrong?

[^#] means "match one symbol that is not a # sign". That is not what you are looking for. Use a lookbehind (?<=#) for the # and your (?=\.) lookahead for the dot to extract the server name in the middle:
(?<=#)[^.]+(?=\.)
The middle portion [^.]+ means "one or more non-dot characters".
Demo.
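The lookbehind/lookahead pattern can be tried out in Python like this. It is a small sketch that keeps the # separator exactly as it appears in the sample string; the function name server_name is an assumption for illustration.

```python
import re

# Text preceded by '#' and followed by a dot, containing no dots itself.
SERVER_RE = re.compile(r"(?<=#)[^.]+(?=\.)")


def server_name(address):
    """Return the label right after '#', or None if there is no match."""
    match = SERVER_RE.search(address)
    return match.group(0) if match else None


print(server_name("first.name#google.com"))  # google
```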

Updated answer: use a capturing group and keep it simple :)
#(\w+)
Explanation by splitting it up
( capturing group for extraction )
\w stands for word character [A-Za-z0-9_]
+ is a quantifier for one or more occurrences of \w
Regex explanation and demo on Regex101
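As a rough Python illustration of the capturing-group approach (the helper name first_label is hypothetical), the value is read from group 1 instead of the whole match. Because \w does not include the hyphen, a domain such as my-bank is cut short.

```python
import re

# '#' followed by one or more word characters, captured in group 1.
LABEL_RE = re.compile(r"#(\w+)")


def first_label(address):
    """Return the word characters that follow '#', or None."""
    match = LABEL_RE.search(address)
    return match.group(1) if match else None


print(first_label("first.name#google.com"))  # google
print(first_label("jane.doe#my-bank.no"))    # my (\w stops at the hyphen)
```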

I used the solution's regex for my task, but realized that some of the emails weren't that simple: foo#us.industries.com, foobar#tm.valves.net, and foo#ge.test.com
For anyone who came here wanting the subdomain as well (or whose match is being cut off by it), here's the regex, with the dot escaped so that it matches only a literal dot:
(?<=#)[^.]*\.[^.]*(?=\.)
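A short Python check of the subdomain variant, written with the dot between the two labels escaped. The name subdomain_and_domain is made up for this sketch. One thing it makes visible: an address with only two labels after the # does not match at all, so this pattern complements the simpler one and does not replace it.

```python
import re

# Two dot-separated labels after '#', followed by a further dot.
SUB_RE = re.compile(r"(?<=#)[^.]*\.[^.]*(?=\.)")


def subdomain_and_domain(address):
    """Return 'sub.domain' for three-label hosts, else None."""
    match = SUB_RE.search(address)
    return match.group(0) if match else None


print(subdomain_and_domain("foo#us.industries.com"))  # us.industries
print(subdomain_and_domain("first.name#google.com"))  # None
```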

This should be the regex:
(?<=#)[^.]+
(?<=#) - places the search right after the #
[^.]+ - take all the characters that are not dot (stops on dot)
So it extracts google from the email address.

I was working on getting the domain name from email addresses, and none of the answers did what I needed:
Do not capture subdomains
Match country-code top-level domains (like .com.ar or .co.jp)
For example, in test#ext.domain.com.mx I need to match domain.com.mx
So I made this one:
[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$
Here is a link to regex101 to illustrate the regex: https://regex101.com/r/vE8rP9/59
You can get just the domain name (without the top-level domain, e.g. .com or .com.mx) by turning the endings into lookahead assertions (but it will match twice in test#test.com.mx):
[^.#]*?(?=\.\w{2,}$)|[^.#]*?(?=\.com?\.\w{2}$)
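The first of these two patterns (the one that keeps the top-level domain) can be exercised in Python roughly as follows. The function name domain_with_tld is an assumption, and re.search is used so that only the first match is returned.

```python
import re

# Either name.tld at the end, or name.com.xx / name.co.xx at the end.
DOMAIN_TLD_RE = re.compile(r"[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$")


def domain_with_tld(address):
    """Return the domain plus its (possibly two-part) TLD, or None."""
    match = DOMAIN_TLD_RE.search(address)
    return match.group(0) if match else None


print(domain_with_tld("test#ext.domain.com.mx"))  # domain.com.mx
print(domain_with_tld("john.doe#spam.com"))       # spam.com
```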

Maybe not strictly a "full regex answer", but a more flexible option (in case the part before the # is not "first.last") would be to use cut:
cut -d '#' -f 2 | cut -d . -f 1
The first cut isolates the part after the #, and the second one gets what you want. (The # is quoted so the shell does not treat it as the start of a comment.)
This also works for other kinds of email patterns: xxxx#server.com, xxx.yyy.zzz#server.com, and so on.
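The same two-step split can be written without a regex in Python. This is a sketch that mirrors the two cut calls; the function name and the separator default are assumptions, and unlike cut it returns None when the separator is missing.

```python
def server_name_by_split(address, separator="#"):
    """Take field 2 when split on the separator, then field 1 when split on '.'."""
    fields = address.split(separator)
    if len(fields) < 2:
        return None  # no separator in the input
    return fields[1].split(".")[0]


print(server_name_by_split("first.name#google.com"))   # google
print(server_name_by_split("xxx.yyy.zzz#server.com"))  # server
```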

Thanks everyone for your great responses. I took what you had and expanded it with named match groups for easy extraction of the separate parts.
Caveat: Regex.Speed = Slow
Another post mentioned how slow regexes can be, and that is a fair point to remember. My particular need is my own background reporting processes, so it doesn't matter how long it takes.
But it's good to remember that a regex like this is best avoided in a web page load or any other "needs-to-be-quick" kind of application. In that case you're much better off using substring operations to strip down the inputs and throw away all the junk that I'm optionally matching/allowing/including here.
https://regex101.com/r/ZnU3OC/1
One Regex to rule them all...
Subdomain/Domain/TopLevelDomain/CountryCode extraction for Emails, domain lists, & URLs
Also handles ?Querystring=junk, Slashes/With/Paths, #anchors
Now with more broth, batteries not included
^(?<Email>.*#)?(?<Protocol>\w+:\/\/)?(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)(?:\.(?<CountryCode>[a-z]{2}))?)(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$
not overly complicated at all... why would you even say that?
Substitution / Outputs
EXAMPLE INPUT: "https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff"
EXAMPLE OUTPUT:
{
Protocol: "https://"
SubDomain: "www"
DomainWithTLD: "stackoverflow.co.uk"
Domain: "stackoverflow"
TopLevelDomain: "co"
CountryCode: "uk"
Path: "/path/2"
QString: "?q=mysearch&and=more#stuff"
}
Allowed/Compliant Domains : Should ALL MATCH
www.bankofamerica.com
bankofamerica.com.securersite.regexr.com
bankofamerica.co.uk.blahblahblah.secure.com.it
dashes-bad-for-seo.but-technically-still-allowed.not-in-front-or-end
bit.ly
is.gd
foo.biz.pl
google.com.cn
stackoverflow.co.uk
level_three.sub_domain.example.com
www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com
https://www.stackoverflow.co.uk?q=mysearch&and=more
foo://5th.4th.3rd.example.com:8042/over/there
foo://subdomain.example.com:8042/over/there?name=ferret#nose
example.com
www.example.com
example.co.uk
trailing-slash.com/
trailing-pound.com#
trailing-question.com?
probably-not-valid.com.cn?&#
probably-not-valid.com.cn/?&#
example.com/page
example.com?key=value
* NOTE: PunyCodes (Unicode in urls) handled just fine with \w ,no extra sauce needed
xn--fsqu00a.xn--0zwm56d.com
xn--diseolatinoamericano-66b.com
Emails : Should ALL MATCH
first.name#google1.co.com
foo#us.industries.com,
foobar#tm.valves.net,
andfoo#ge.test.com
jane.doe#my-bank.no
john.doe#spam.com
jane.ann.doe#sandnes.district.gov
Non-Compliant Domains : Should NOT MATCH
either not long-enough (domain min length 2), or too long (64)
v.gd
thing.y
0123456789012345678901234567890123456789012345678901234567891234.com
its-sixty-four-instead-of-sixty-three!.com
symbols-not-allowed#.com
symbols-not-allowed#.com
symbols-not-allowed$.com
symbols-not-allowed%.com
symbols-not-allowed^.com
symbols-not-allowed&.com
symbols-not-allowed*.com
symbols-not-allowed(.com
symbols-not-allowed).com
symbols-not-allowed+.com
symbols-not-allowed=.com
TBD Not handled:
* dashes as start or ending is disallowed (dropped from Regex for readability)
-junk-.com
* is underscore allowed? i donno... (but it simplifies the regex using \w instead of [a-zA-Z0-9\-] everywhere)
symbols-not-allowed_.com
* special case localhost?
.localhost
also see:
Domain Name Rules :: Super handy ASCII Diagram of a URL
see: https://stackoverflow.com/a/66660651/738895 *
Side NOTE: the lazy quantifier '?' on the subdomain group ({0,127}?) is currently needed for any of the cases with country codes (example: stackoverflow.co.uk).
Matches these, but does NOT grab $NLevelSubdomains in separate match groups; it can only grab the 3rd level.
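For anyone wanting to try this from Python, here is a hedged port of the pattern above. Python spells named groups as (?P<Name>...) and not (?<Name>...), so every group is renamed accordingly; nothing else in the pattern is changed except dropping unneeded escapes. The function name split_url is made up for this sketch, and no regex flags are set, which may differ from the regex101 setup. In this port the SubDomain group keeps its trailing dot.

```python
import re

# Python port: (?<Name>...) becomes (?P<Name>...); the rest is unchanged.
URL_PARTS_RE = re.compile(
    r"^(?P<Email>.*#)?"
    r"(?P<Protocol>\w+://)?"
    r"(?P<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?"
    r"(?P<DomainWithTLD>(?P<Domain>[\w-]{2,63})\.(?P<TopLevelDomain>[\w-]{2,63}?)"
    r"(?:\.(?P<CountryCode>[a-z]{2}))?)"
    r"(?::(?P<Port>\d+))?"
    r"(?P<Path>(?:/\w*)+)?"
    r"(?P<QString>(?P<QSParams>(?:[?&=][\w-]*)+)?(?:#(?P<Anchor>\w*))*)?$"
)


def split_url(text):
    """Return a dict of the named groups, or None if the input does not match."""
    match = URL_PARTS_RE.match(text)
    return match.groupdict() if match else None


parts = split_url("https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff")
for name in ("Protocol", "SubDomain", "Domain", "TopLevelDomain", "CountryCode", "Path"):
    print(name, "=", parts[name])
```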

This is a relatively simple regex, and it grabs everything between the # and the final domain extension (e.g. .com, .org). It allows domain names that are made up of non-word characters, which exist in real-world data.
>>> import re
>>> regex = re.compile(r"^.+#(.+)\.[\w]+$")
>>> regex.findall('jane.doe#my-bank.no')
['my-bank']
>>> regex.findall('john.doe#spam.com')
['spam']
>>> regex.findall('jane.ann.doe#sandnes.district.gov')
['sandnes.district']

I used the regular expression '.*#+(.*)' to get the complete domain name. The leading .* skips all characters before the # (matched by #+), and the parenthesized group (.*) captures the complete domain name after it (everything except line-break characters).

Regular expression to match only domain from URL

I'm struggling with forming a regex that would match:
Just domain in case of URL
Whole string in case of no URL
Acceptance test (regex should match bold text):
http://mozart.co.uk
https://avocado.si/hmm
http://www.qwe123qwe.com
Starbucks
Benchmark 123
So far I've come up with this:
([^\/\/]+)(?:,|$)
It works fine, but not for URLs with trailing slash on the end. How can I modify the expression to include full path (everything on the right side of http(s)://) as well? Thank you.

This regex will match them if it starts with http:// or https:// until the next slash. If it doesn't start with http:// nor https:// then it will match the whole string. Close enough?
(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)
I should note that most languages have functions built in to properly parse URLs and these are preferable.
You should note that I've got 2 sets of capturing parentheses, so depending on your language that may be significant.
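In Python, the two capturing groups can be collapsed into one result by taking whichever group matched. The sketch below also shows the built-in parser the answer alludes to (urllib.parse.urlparse in Python's standard library); the function name domain_or_whole is an assumption.

```python
import re
from urllib.parse import urlparse

# Group 1: host of an http(s) URL; group 2: the whole string otherwise.
PATTERN = re.compile(r"(?:^https?://([^/]+)(?:[/,]|$)|^(.*)$)")


def domain_or_whole(text):
    """Return the host for http(s) URLs, or the input unchanged otherwise."""
    match = PATTERN.search(text)
    return match.group(1) or match.group(2)


for sample in ["http://mozart.co.uk", "https://avocado.si/hmm", "Benchmark 123"]:
    print(domain_or_whole(sample))

# The standard-library parser gives the same host for a real URL.
print(urlparse("https://avocado.si/hmm").netloc)  # avocado.si
```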

Maybe try ^(http[s]?:\/\/)?(.*)$. Play with it here: https://regex101.com/r/iZ2vL4/1

This has several capturing groups; the domain you want will be in the 4th group.
/^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{1,3}(\.[^:\/\s\.]{1,2})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$/mg
Use the Regex101.com workbench to check your URLs: just paste them into the "TEST STRING" textbox to test it out.
Don't recall where I got this... so I don't know who to credit. But it's pretty slick!
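A rough Python rendering of this pattern, reading the domain from group 4. It is a sketch under a few assumptions: the /m flag is mapped to re.MULTILINE, the /g flag is dropped because only the first match is used, and the helper name domain_from_url is hypothetical.

```python
import re

# Same pattern with the unneeded escapes removed; group 4 holds the domain.
URL_RE = re.compile(
    r"^((http[s]?|ftp)://)?/?([^/.]+\.)*?"
    r"([^/.]+\.[^:/\s.]{1,3}(\.[^:/\s.]{1,2})?(:\d+)?)"
    r"($|/)([^#?\s]+)?(.*?)?(#[\w-]+)?$",
    re.MULTILINE,
)


def domain_from_url(url):
    """Return capture group 4 (the domain), or None if nothing matches."""
    match = URL_RE.search(url)
    return match.group(4) if match else None


for sample in ["http://mozart.co.uk", "https://avocado.si/hmm", "http://www.qwe123qwe.com"]:
    print(domain_from_url(sample))
```

Unlike the other answers, this one drops the leading www. label, because the lazy third group absorbs it.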

RegEx pattern to handle URL with dates

I moved to a new website and it mangled my URLs. Now blog posts are accessible from multiple URLs, and I would like to redirect one pattern to the other.
I am trying to redirect the first case to the second case:
~/blogs/johndoe/john-doe/2014/03/14/test-article1 =>
~/blogs/john-doe/2014/03/14/test-article1
~/blogs/jimjones/jim-jones/2014/03/14/test-articleb =>
~/blogs/jim-jones/2014/03/14/test-articleb
How do I create a pattern smart enough to slice out the first "johndoe" and "jimjones"? I am using this for IIS rewrite but I think any RegEx should work. Thanks for any help.

This works:
^~/blogs/\w+/(\w+)-(\w+)/(\d{4})/(\d\d)/(\d\d)/([\w-]+)$
Debuggex Demo
It just discards the non-dash name. It doesn't check whether it's equal to the dashed name. It also assumes that the date numbers are valid, so 9899/45/33 would be matched.
Capture groups:
1. First name
2. Last name
3. Year
4. Month
5. Day
6. Article name
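Outside of IIS, the capture groups can be put back together with a substitution. The following Python sketch only illustrates the regex itself, not the IIS rewrite configuration; the function name canonical_blog_url is made up.

```python
import re

# Groups: 1 first name, 2 last name, 3 year, 4 month, 5 day, 6 article name.
BLOG_RE = re.compile(r"^~/blogs/\w+/(\w+)-(\w+)/(\d{4})/(\d\d)/(\d\d)/([\w-]+)$")


def canonical_blog_url(url):
    """Drop the first (non-dash) name segment; leave other URLs unchanged."""
    return BLOG_RE.sub(r"~/blogs/\1-\2/\3/\4/\5/\6", url)


print(canonical_blog_url("~/blogs/johndoe/john-doe/2014/03/14/test-article1"))
print(canonical_blog_url("~/blogs/jimjones/jim-jones/2014/03/14/test-articleb"))
```

A URL that is already in the target form does not match the pattern, so it passes through untouched.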

I don't know about IIS rewrites, but this should work:
/^~\/blogs\/[a-z]+\// -> ~/blogs/
The regular expression matches the start of the string, followed by ~/blogs/, followed by a run of lowercase letters and a slash. Replacing that with ~/blogs/ removes the extra name segment.

I don't use IIS, but this should be at least close.
Pattern:
^blogs/\w+/([\w-]+/.*)$
Action
blogs/{R:1}
Handy usage doc

Regex to match any domain except two domains

In my .htaccess I'm trying to set the document root for all parked domains to a specific path, except for two main domains. So basically I need a regex that matches any domain except two domains.
I found something like this:
^(?!foo$|bar$).*
and this:
(?>[\w-]+)(?<!tea|nuka-cola)
but I cannot get it to work in my situation, because the domain name has a dot and a TLD and I want to match those with the regex too.
Here is my current regex:
^(.*?)\.(com|net)$
I want to add the exception in place of (.*?).

Use a negative lookbehind:
^(.*?)(?<!(foo)|(bar))\.(com|net)$
Not sure exactly what you want, but this regex will not match domains ending in foo.com, foo.net, bar.com or bar.net.
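The behaviour of the negative lookbehind can be checked in Python as sketched below. The capturing groups inside the lookbehind are dropped here (an assumption that they are not needed), and is_other_domain is a hypothetical name. The last call shows a side effect worth knowing: any name that merely ends in foo or bar is excluded too.

```python
import re

# Match name.com / name.net unless the name ends in "foo" or "bar".
OTHER_DOMAIN_RE = re.compile(r"^(.*?)(?<!foo|bar)\.(com|net)$")


def is_other_domain(domain):
    """True for .com/.net domains whose name does not end in foo or bar."""
    return OTHER_DOMAIN_RE.match(domain) is not None


print(is_other_domain("example.com"))  # True
print(is_other_domain("foo.com"))      # False
print(is_other_domain("bar.net"))      # False
print(is_other_domain("myfoo.com"))    # False as well
```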

Regex pattern for domain-part of URL

I am looking for a regex pattern that matches the scheme and domain part of a URL (http or https)
example 1:
https://www.blabla.com/path/pic.jpg
should match
https://www.blabla.com
example 2:
http://my.domain.tld/directory/?something
should match
http://my.domain.tld

Something along these lines:
#^(https?://[a-z0-9.-]+)(?=/|$).*#i
It depends, of course, on which characters you'd like to allow in the domain name.
P.S. The # characters delimit the regex, and the i at the end makes it case-insensitive.
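In Python there are no regex delimiters, so the same idea might be sketched as follows, with the i modifier expressed as re.IGNORECASE. The trailing .* is left out because only group 1 is needed, and the function name scheme_and_host is an assumption.

```python
import re

# Scheme plus host, which must be followed by a slash or the end of the string.
ORIGIN_RE = re.compile(r"^(https?://[a-z0-9.-]+)(?=/|$)", re.IGNORECASE)


def scheme_and_host(url):
    """Return e.g. 'https://www.blabla.com', or None if the URL does not match."""
    match = ORIGIN_RE.match(url)
    return match.group(1) if match else None


print(scheme_and_host("https://www.blabla.com/path/pic.jpg"))
print(scheme_and_host("http://my.domain.tld/directory/?something"))
```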
