Validate domain with regex but not subdomain

I couldn't find a regex anywhere that validates a domain without also accepting subdomains.
I found a lot of patterns that validate domains, but unfortunately all of them also validate subdomains.
Does anyone have tips on this?
I have this regex, which is almost what I need:
/(?!www\.)(?=^.{5,254}$)(^(?:(?!\d+\.)[a-z0-9\-]{1,63}\.){1,2}(?:[a-z]{2,})$)/
If I use a subdomain like test.domain.com.br, it behaves correctly (rejecting it), but test.domain.com is wrongly accepted.
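To make the behaviour described in the question easy to reproduce, here is a small Python sketch that runs the pattern unchanged against a few hostnames. The names QUESTION_RE and accepted are made up for this illustration, and Python's re module is assumed to treat this particular pattern the same way as the original engine.

```python
import re

# The pattern from the question, copied unchanged (without the surrounding slashes).
QUESTION_RE = re.compile(
    r"(?!www\.)(?=^.{5,254}$)(^(?:(?!\d+\.)[a-z0-9\-]{1,63}\.){1,2}(?:[a-z]{2,})$)"
)


def accepted(host):
    """Return True when the question's pattern accepts the hostname."""
    return QUESTION_RE.search(host) is not None


# The pattern only counts labels: one or two labels plus a final alphabetic label.
for host in ["domain.com", "domain.com.br", "test.domain.com.br", "test.domain.com"]:
    print(host, accepted(host))
```

Because the pattern only counts labels, test.domain.com and domain.com.br look identical to it: both are two labels followed by an alphabetic last label.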

"I couldn't find anywhere a regex that could validate a domain but not accepting subdomains."
That is because no regex can do that for you (and anyone claiming otherwise just doesn't understand the DNS).
Which is exactly why you found:
"a lot of rules that validates domains but unfortunately all of them also validates subdomains."
A "subdomain" is just a domain seen differently (or you can say that any domain is also a subdomain of another domain, except for the root and the TLDs). This is all because the DNS is a tree.
You can use the definition given in https://www.rfc-editor.org/rfc/rfc8499:
Subdomain: "A domain is a subdomain of another domain if it is
contained within that domain. This relationship can be tested by
seeing if the subdomain's name ends with the containing domain's
name." (Quoted from [RFC1034], Section 3.1) For example, in the
host name "nnn.mmm.example.com", both "mmm.example.com" and
"nnn.mmm.example.com" are subdomains of "example.com". Note that
the comparisons here are done on whole labels; that is,
"ooo.example.com" is not a subdomain of "oo.example.com".
You cannot find administrative boundaries in a hostname just by looking at it. You need either to do live DNS queries to find the delegation points OR to use something like the Public Suffix List maintained by Mozilla. Both approaches have drawbacks that may or may not be a problem depending on your use case.
If you are not convinced, here is a list of valid hostnames (you can use them in a URL and it will work); try to work out how a regex could be right in all of these cases:
dk
www.sante.gouv.fr
www.com.com
www.nominet.co.uk
www.uk.com
www.walton.k12.fl.us
lagazettedesancetres.blogspot.fr
www.al.ma.leg.br
ab.m.wikibooks.nom.nu
1512f1.станок.спб.рус
You can obviously find shortcuts where a regex will still be wrong but good enough, if you restrict the cases you need to act on. Otherwise, if you need to stay generic and potentially work in any TLD, then, sorry, no regex will solve your problem.
Your regex is also wrong in several other cases. For example, it won't handle IDN TLDs, which do exist: in ASCII form they look like xn--something, which [a-z]{2,} does not accept.
BTW, here is some useful terminology that is often clearer than domain/subdomain, taken from https://url.spec.whatwg.org/#host-miscellaneous
"A host’s public suffix is the portion of a host which is included on the Public Suffix List."
"A host’s registrable domain is a domain formed by the most specific public suffix, along with the domain label immediately preceding it, if any."
I think what you are searching for is the "registrable domain" part of any given string (and as you can see from the algorithm given at the above URL, you can't get it without first finding the public suffix, which you can't do without an external resource: the information is NOT self-contained in the string).
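A minimal Python sketch of that idea follows. It is not a full implementation: PUBLIC_SUFFIXES is a tiny hand-picked set used only for illustration (a real program would load the complete, regularly updated Public Suffix List), wildcard and exception rules are ignored, and all function names are made up here.

```python
# Hypothetical, hand-picked excerpt used only for this illustration.
PUBLIC_SUFFIXES = {
    "com", "br", "com.br", "uk", "co.uk", "fr", "gouv.fr", "blogspot.fr", "dk",
}


def public_suffix(host):
    """Return the longest suffix of host that appears in PUBLIC_SUFFIXES."""
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels)):  # longest candidate first
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            return candidate
    return labels[-1]  # fallback: treat the last label as the suffix


def registrable_domain(host):
    """Public suffix plus the label right before it, or None if there is none."""
    host = host.lower().rstrip(".")
    suffix = public_suffix(host)
    if host == suffix:
        return None  # the host is itself a public suffix
    prefix = host[: -len(suffix) - 1]
    return prefix.split(".")[-1] + "." + suffix


def is_registrable_domain(host):
    """True when host is exactly a registrable domain, with no extra labels."""
    return registrable_domain(host) == host.lower().rstrip(".")
```

With this, "a domain but not a subdomain" becomes is_registrable_domain(host): the decision comes from the external list, not from the shape of the string.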

Related

Regex to validate URL characters and all available TLDs

I'm new to regex, and after a few days of practicing and learning I managed to write a URL-validating regex.
/^((?:http|https)):\/\/(?=[a-z\d])((?:(?:(?!_|\.\.|-\.|\.-|\.\/|-\/)[\w-\.])+?)(?:[\.][a-z]{2,}))\/([\w-\.~:\/?#\[\]@!$&\'\(\)*+,;=]*)$/i
It works perfectly, but the problem is that I want to check against all currently available TLDs. The regex above doesn't validate Unicode TLDs (XN--RHQV96G for example) because it allows only letters in the TLD. I could make it accept Unicode TLDs, but there is no point, because it still couldn't tell whether the entered TLD is real.
Since Stack Overflow allows you to answer your own question, I will include the solution I came up with in my answer, and I hope someone will find it useful. If you have a better solution to this TLD problem, I will gladly choose your answer as the accepted answer.
The rules are as follows:
Any localhost or IP-based URL shouldn't validate (http://localhost/ or http://8.8.8.8/ for example)
Any URL with authorization parameters or a port in it shouldn't validate (http://username@example.com/ or http://username:password@example.com/ or http://example.com:8080/ for example)
The only allowed protocols are http and https. If someone wants to validate ftp or something else, they can add ftp support easily: (?:http|HTTP|ftp|FTP)

My solution is to get the list of all currently available TLDs from IANA and include all of them in the regex.
/^((?:http|https)):\/\/(?=[a-z\d])((?:(?:(?!_|\.\.|-\.|\.-|\.\/|-\/)[\w-\.])+?)(?:[\.](?:aaa|aarp|abb|abbott|abbvie|abogado|abudhabi|ac|academy|accenture|accountant|accountants|aco|active|actor|ad|adac|ads|adult|ae|aeg|aero|aetna|af|afl|ag|agakhan|agency|ai|aig|airbus|airforce|airtel|akdn|al|alibaba|alipay|allfinanz|ally|alsace|alstom|am|amica|amsterdam|analytics|android|anquan|ao|apartments|app|apple|aq|aquarelle|ar|aramco|archi|army|arpa|arte|as|asia|associates|at|attorney|au|auction|audi|audible|audio|author|auto|autos|avianca|aw|aws|ax|axa|az|azure|ba|baby|baidu|band|bank|bar|barcelona|barclaycard|barclays|barefoot|bargains|bauhaus|bayern|bb|bbc|bbva|bcg|bcn|bd|be|beats|beer|bentley|berlin|best|bet|bf|bg|bh|bharti|bi|bible|bid|bike|bing|bingo|bio|biz|bj|black|blackfriday|blog|bloomberg|blue|bm|bms|bmw|bn|bnl|bnpparibas|bo|boats|boehringer|bom|bond|boo|book|boots|bosch|bostik|bot|boutique|br|bradesco|bridgestone|broadway|broker|brother|brussels|bs|bt|budapest|bugatti|build|builders|business|buy|buzz|bv|bw|by|bz|bzh|ca|cab|cafe|cal|call|cam|camera|camp|cancerresearch|canon|capetown|capital|car|caravan|cards|care|career|careers|cars|cartier|casa|cash|casino|cat|catering|cba|cbn|cc|cd|ceb|center|ceo|cern|cf|cfa|cfd|cg|ch|chanel|channel|chase|chat|cheap|chintai|chloe|christmas|chrome|church|ci|cipriani|circle|cisco|citic|city|cityeats|ck|cl|claims|cleaning|click|clinic|clinique|clothing|cloud|club|clubmed|cm|cn|co|coach|codes|coffee|college|cologne|com|commbank|community|company|compare|computer|comsec|condos|construction|consulting|contact|contractors|cooking|cool|coop|corsica|country|coupon|coupons|courses|cr|credit|creditcard|creditunion|cricket|crown|crs|cruises|csc|cu|cuisinella|cv|cw|cx|cy|cymru|cyou|cz|dabur|dad|dance|date|dating|datsun|day|dclk|dds|de|deal|dealer|deals|degree|delivery|dell|deloitte|delta|democrat|dental|dentist|desi|design|dev|dhl|diamonds|diet|digital|direct|directory|discount|dj|dk|dm|dnp|do|docs|dog|doha|domains|dot|download|drive|dtv|duba
i|dunlop|dupont|durban|dvag|dz|earth|eat|ec|edeka|edu|education|ee|eg|email|emerck|energy|engineer|engineering|enterprises|epost|epson|equipment|er|ericsson|erni|es|esq|estate|et|eu|eurovision|eus|events|everbank|exchange|expert|exposed|express|extraspace|fage|fail|fairwinds|faith|family|fan|fans|farm|fashion|fast|feedback|ferrero|fi|film|final|finance|financial|fire|firestone|firmdale|fish|fishing|fit|fitness|fj|fk|flickr|flights|flir|florist|flowers|flsmidth|fly|fm|fo|foo|football|ford|forex|forsale|forum|foundation|fox|fr|fresenius|frl|frogans|frontier|ftr|fund|furniture|futbol|fyi|ga|gal|gallery|gallo|gallup|game|games|garden|gb|gbiz|gd|gdn|ge|gea|gent|genting|gf|gg|ggee|gh|gi|gift|gifts|gives|giving|gl|glass|gle|global|globo|gm|gmail|gmbh|gmo|gmx|gn|gold|goldpoint|golf|goo|goodyear|goog|google|gop|got|gov|gp|gq|gr|grainger|graphics|gratis|green|gripe|group|gs|gt|gu|guardian|gucci|guge|guide|guitars|guru|gw|gy|hamburg|hangout|haus|hdfcbank|health|healthcare|help|helsinki|here|hermes|hiphop|hisamitsu|hitachi|hiv|hk|hkt|hm|hn|hockey|holdings|holiday|homedepot|homes|honda|horse|host|hosting|hoteles|hotmail|house|how|hr|hsbc|ht|htc|hu|hyundai|ibm|icbc|ice|icu|id|ie|ifm|iinet|il|im|imamat|imdb|immo|immobilien|in|industries|infiniti|info|ing|ink|institute|insurance|insure|int|international|investments|io|ipiranga|iq|ir|irish|is|iselect|ismaili|ist|istanbul|it|itau|iwc|jaguar|java|jcb|jcp|je|jetzt|jewelry|jlc|jll|jm|jmp|jnj|jo|jobs|joburg|jot|joy|jp|jpmorgan|jprs|juegos|kaufen|kddi|ke|kerryhotels|kerrylogistics|kerryproperties|kfh|kg|kh|ki|kia|kim|kinder|kindle|kitchen|kiwi|km|kn|koeln|komatsu|kosher|kp|kpmg|kpn|kr|krd|kred|kuokgroup|kw|ky|kyoto|kz|la|lacaixa|lamborghini|lamer|lancaster|land|landrover|lanxess|lasalle|lat|latrobe|law|lawyer|lb|lc|lds|lease|leclerc|legal|lego|lexus|lgbt|li|liaison|lidl|life|lifeinsurance|lifestyle|lighting|like|limited|limo|lincoln|linde|link|lipsy|live|living|lixil|lk|loan|loans|locker|locus|lol|london|lotte|lotto|love|lr|ls|lt|ltd|ltda
|lu|lupin|luxe|luxury|lv|ly|ma|madrid|maif|maison|makeup|man|management|mango|market|marketing|markets|marriott|mattel|mba|mc|md|me|med|media|meet|melbourne|meme|memorial|men|menu|meo|metlife|mg|mh|miami|microsoft|mil|mini|mk|ml|mlb|mls|mm|mma|mn|mo|mobi|mobily|moda|moe|moi|mom|monash|money|montblanc|mormon|mortgage|moscow|motorcycles|mov|movie|movistar|mp|mq|mr|ms|mt|mtn|mtpc|mtr|mu|museum|mutual|mutuelle|mv|mw|mx|my|mz|na|nadex|nagoya|name|natura|navy|nc|ne|nec|net|netbank|netflix|network|neustar|new|news|next|nextdirect|nexus|nf|ng|ngo|nhk|ni|nico|nikon|ninja|nissan|nissay|nl|no|nokia|northwesternmutual|norton|now|nowruz|nowtv|np|nr|nra|nrw|ntt|nu|nyc|nz|obi|office|okinawa|olayan|olayangroup|ollo|om|omega|one|ong|onl|online|ooo|oracle|orange|org|organic|origins|osaka|otsuka|ott|ovh|pa|page|pamperedchef|panerai|paris|pars|partners|parts|party|passagens|pccw|pe|pet|pf|pg|ph|pharmacy|philips|photo|photography|photos|physio|piaget|pics|pictet|pictures|pid|pin|ping|pink|pioneer|pizza|pk|pl|place|play|playstation|plumbing|plus|pm|pn|pohl|poker|porn|post|pr|praxi|press|prime|pro|prod|productions|prof|progressive|promo|properties|property|protection|ps|pt|pub|pw|pwc|py|qa|qpon|quebec|quest|racing|re|read|realestate|realtor|realty|recipes|red|redstone|redumbrella|rehab|reise|reisen|reit|ren|rent|rentals|repair|report|republican|rest|restaurant|review|reviews|rexroth|rich|richardli|ricoh|rio|rip|ro|rocher|rocks|rodeo|room|rs|rsvp|ru|ruhr|run|rw|rwe|ryukyu|sa|saarland|safe|safety|sakura|sale|salon|samsung|sandvik|sandvikcoromant|sanofi|sap|sapo|sarl|sas|save|saxo|sb|sbi|sbs|sc|sca|scb|schaeffler|schmidt|scholarships|school|schule|schwarz|science|scor|scot|sd|se|seat|security|seek|select|sener|services|seven|sew|sex|sexy|sfr|sg|sh|sharp|shaw|shell|shia|shiksha|shoes|shop|shouji|show|shriram|si|silk|sina|singles|site|sj|sk|ski|skin|sky|skype|sl|sm|smile|sn|sncf|so|soccer|social|softbank|software|sohu|solar|solutions|song|sony|soy|space|spiegel|spot|spreadbetting|sr|srl|st|sta
da|star|starhub|statebank|statefarm|statoil|stc|stcgroup|stockholm|storage|store|stream|studio|study|style|su|sucks|supplies|supply|support|surf|surgery|suzuki|sv|swatch|swiss|sx|sy|sydney|symantec|systems|sz|tab|taipei|talk|taobao|tatamotors|tatar|tattoo|tax|taxi|tc|tci|td|tdk|team|tech|technology|tel|telecity|telefonica|temasek|tennis|teva|tf|tg|th|thd|theater|theatre|tickets|tienda|tiffany|tips|tires|tirol|tj|tk|tl|tm|tmall|tn|to|today|tokyo|tools|top|toray|toshiba|total|tours|town|toyota|toys|tr|trade|trading|training|travel|travelers|travelersinsurance|trust|trv|tt|tube|tui|tunes|tushu|tv|tvs|tw|tz|ua|ubs|ug|uk|unicom|university|uno|uol|ups|us|uy|uz|va|vacations|vana|vc|ve|vegas|ventures|verisign|versicherung|vet|vg|vi|viajes|video|vig|viking|villas|vin|vip|virgin|vision|vista|vistaprint|viva|vlaanderen|vn|vodka|volkswagen|vote|voting|voto|voyage|vu|vuelos|wales|walter|wang|wanggou|warman|watch|watches|weather|weatherchannel|webcam|weber|website|wed|wedding|weibo|weir|wf|whoswho|wien|wiki|williamhill|win|windows|wine|wme|wolterskluwer|work|works|world|ws|wtc|wtf|xbox|xerox|xihuan|xin|xn--11b4c3d|xn--1ck2e1b|xn--1qqw23a|xn--30rr7y|xn--3bst00m|xn--3ds443g|xn--3e0b707e|xn--3pxu8k|xn--42c2d9a|xn--45brj9c|xn--45q11c|xn--4gbrim|xn--55qw42g|xn--55qx5d|xn--5tzm5g|xn--6frz82g|xn--6qq986b3xl|xn--80adxhks|xn--80ao21a|xn--80asehdb|xn--80aswg|xn--8y0a063a|xn--90a3ac|xn--90ais|xn--9dbq2a|xn--9et52u|xn--9krt00a|xn--b4w605ferd|xn--bck1b9a5dre4c|xn--c1avg|xn--c2br7g|xn--cck2b3b|xn--cg4bki|xn--clchc0ea0b2g2a9gcd|xn--czr694b|xn--czrs0t|xn--czru2d|xn--d1acj3b|xn--d1alf|xn--e1a4c|xn--eckvdtc9d|xn--efvy88h|xn--estv75g|xn--fct429k|xn--fhbei|xn--fiq228c5hs|xn--fiq64b|xn--fiqs8s|xn--fiqz9s|xn--fjq720a|xn--flw351e|xn--fpcrj9c3d|xn--fzc2c9e2c|xn--fzys8d69uvgm|xn--g2xx48c|xn--gckr3f0f|xn--gecrj9c|xn--h2brj9c|xn--hxt814e|xn--i1b6b1a6a2e|xn--imr513n|xn--io0a7i|xn--j1aef|xn--j1amh|xn--j6w193g|xn--jlq61u9w7b|xn--jvr189m|xn--kcrx77d1x4a|xn--kprw13d|xn--kpry57d|xn--kpu716f|xn--kput3i|xn--l1acc|
xn--lgbbat1ad8j|xn--mgb9awbf|xn--mgba3a3ejt|xn--mgba3a4f16a|xn--mgba7c0bbn0a|xn--mgbaam7a8h|xn--mgbab2bd|xn--mgbayh7gpa|xn--mgbb9fbpob|xn--mgbbh1a71e|xn--mgbc0a9azcg|xn--mgbca7dzdo|xn--mgberp4a5d4ar|xn--mgbpl2fh|xn--mgbt3dhd|xn--mgbtx2b|xn--mgbx4cd0ab|xn--mix891f|xn--mk1bu44c|xn--mxtq1m|xn--ngbc5azd|xn--ngbe9e0a|xn--node|xn--nqv7f|xn--nqv7fs00ema|xn--nyqy26a|xn--o3cw4h|xn--ogbpf8fl|xn--p1acf|xn--p1ai|xn--pbt977c|xn--pgbs0dh|xn--pssy2u|xn--q9jyb4c|xn--qcka1pmc|xn--qxam|xn--rhqv96g|xn--rovu88b|xn--s9brj9c|xn--ses554g|xn--t60b56a|xn--tckwe|xn--unup4y|xn--vermgensberater-ctb|xn--vermgensberatung-pwb|xn--vhquv|xn--vuq861b|xn--w4r85el8fhu5dnra|xn--w4rs40l|xn--wgbh1c|xn--wgbl6a|xn--xhq521b|xn--xkc2al3hye2a|xn--xkc2dl3a5ee0h|xn--y9a3aq|xn--yfro4i67o|xn--ygbi2ammx|xn--zfr164b|xperia|xxx|xyz|yachts|yahoo|yamaxun|yandex|ye|yodobashi|yoga|yokohama|you|youtube|yt|yun|za|zappos|zara|zero|zip|zm|zone|zuerich|zw)))\/([\w-\.~:\/?#\[\]#!$&\'\(\)*+,;=]*)$/i
This regex is huge (there are 1,348 TLDs), but it works perfectly, and I can't find any wrong URL combination that it validates.
It allows only valid subdomains, and it won't validate disallowed domain name combinations like http://.example.com/ or http://-exa..mple.com/
If you don't care about valid TLDs and the pattern alone is enough, you can use the regex from the original question; it's much smaller, faster and works pretty well.
Any answers and comments are welcome if you find a mistake or can make this regex shorter or faster.
I will update this answer from time to time to include new TLDs from the IANA database, if there are any.
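Rather than pasting the TLDs by hand, the alternation can be generated. Below is a Python sketch that assumes the input is the text of the plain list IANA publishes (at the time of writing https://data.iana.org/TLD/tlds-alpha-by-domain.txt: one upper-case TLD per line, with a leading "#" comment line). The function names are hypothetical, and the host pattern is a simplified stand-in for illustration, not the full URL regex from this answer.

```python
import re


def tld_alternation(iana_text):
    """Build a regex alternation such as (?:com|uk|...) from the list text."""
    tlds = []
    for line in iana_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the version comment
        tlds.append(re.escape(line.lower()))
    # Longest first, so a short TLD never shadows a longer one sharing its prefix.
    tlds.sort(key=len, reverse=True)
    return "(?:" + "|".join(tlds) + ")"


def build_host_regex(iana_text):
    """Hostname pattern: one or more LDH labels followed by a known TLD."""
    label = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
    pattern = r"^(?:" + label + r"\.)+" + tld_alternation(iana_text) + r"$"
    return re.compile(pattern, re.IGNORECASE)


# Tiny stand-in for the downloaded list, for demonstration only.
SAMPLE_LIST = "# sample list\nCOM\nUK\nXN--RHQV96G\n"
HOST_RE = build_host_regex(SAMPLE_LIST)
print(bool(HOST_RE.match("shop.xn--rhqv96g")))
print(bool(HOST_RE.match("example.notatld")))
```

Regenerating the pattern whenever the list changes avoids the manual updates mentioned above.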

Regex for matching different parts of a domain

I am attempting to split up domains into different categories (subdomain, domain, TLD) and am having trouble.
I can't figure out a way to match any number of subdomains without swallowing my domain or TLD matching. I am using PCRE regex.
Current regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,3}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Data set:
apple.orange.banana.clevername.co.uk
strawberry.apple.orange.banana.clevername.co.uk
tangerine.com.au
simple.com
Note: There are spaces before and after the domains and they will always be lower case.
An example of how this data would match:
apple.orange.banana.clevername.co.uk
subdomain: apple.orange.banana
domain: clevername
tld: co.uk
If I add another fruit to the subdomain (strawberry.apple.orange.banana.clevername.co.uk), the match fails. If I raise the {0,3} in the subdomain part to a higher number or to an unlimited number of matches, it gets too greedy and I no longer end up with a correct match for the domain/TLD. An example of this:
Modified regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,5}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Resulting match with new regex:
strawberry.apple.orange.banana.clevername.co.uk
subdomain: strawberry.apple.orange.banana.clevername
domain:
tld: co.uk
I'm sure the regex isn't the most efficient either so any help or suggestions would be greatly appreciated. Thanks!

I believe this should do it for you:
\s((?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))\.(?<tld>[a-z\.]{3,6})\s
Tested this in Splunk and it works with your test data set.
Do note that this won't work for very short domains like bit.ly because there is no way to tell the domain from the subdomain without doing a lookup of the TLD.
For example, compare something.bit.ly and clevername.com.au. Without outside information, there is no way to tell that bit and clevername are the domains.

I recently came across the same problem. So I took Syon's regex and modified it a bit. This is the result:
\s(?:(?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>(?!com)[a-z0-9\-]{3,}(?=\.[a-z\.]{2,}))\.(?:(?<tld>[a-z\.]{2,})$)\s
It works on the whole test data set (I trimmed the spaces though), as well as short domains like bit.ly. Also works for new top level domains like .cancerresearch. See result:
https://regex101.com/r/nX6yQ7/4
Note: the regex explicitly states that the domain can't be com; this needs to be extended if other three-character second-level suffixes (the {3 characters}.xyz kind, like com.au) need to be supported.

You could try to find the longest suffix of the domain which is still listed in the Public Suffix List. After that, splitting the string should be easy.
Note that the list also considers domains of web hosters a public suffix. For example, in example.blogspot.com the public suffix is considered to be blogspot.com, not com. Also the list has to be parsed carefully as it contains comments and exceptions.
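Here is a Python sketch of that approach, under stated assumptions: SAMPLE_PSL is a tiny hand-written excerpt in the list's file format (not the real list), the function names are made up, and the matching follows my reading of the published rules (exception rules win, otherwise the longest matching rule, otherwise the last label alone).

```python
SAMPLE_PSL = """
// hand-written excerpt for illustration only
com
uk
co.uk
au
com.au
ly
*.ck
!www.ck
blogspot.com
"""


def parse_psl(text):
    """Parse list-format text into (rules, exceptions), skipping comments."""
    rules, exceptions = set(), set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        rule = line.split()[0].lower()  # a rule ends at the first whitespace
        if rule.startswith("!"):
            exceptions.add(rule[1:])
        else:
            rules.add(rule)
    return rules, exceptions


def split_host(host, rules, exceptions):
    """Return (subdomain, domain, public_suffix) for a hostname."""
    labels = host.lower().rstrip(".").split(".")
    total = len(labels)
    suffix_len = 1  # default when nothing matches: the last label alone
    matched = False
    for i in range(total):
        # An exception rule wins; its suffix is the rule minus the leftmost label.
        if ".".join(labels[i:]) in exceptions:
            suffix_len = total - i - 1
            matched = True
            break
    if not matched:
        for i in range(total):  # longest candidate first
            exact = ".".join(labels[i:])
            wildcard = ".".join(["*"] + labels[i + 1:])
            if exact in rules or wildcard in rules:
                suffix_len = total - i
                break
    cut = total - suffix_len
    rest = labels[:cut]
    domain = rest[-1] if rest else ""
    return ".".join(rest[:-1]), domain, ".".join(labels[cut:])


RULES, EXCEPTIONS = parse_psl(SAMPLE_PSL)
print(split_host("strawberry.apple.orange.banana.clevername.co.uk", RULES, EXCEPTIONS))
```

The number of subdomain labels no longer matters, because the split point is found from the right-hand side of the name.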

Perl extract domain name from email address inc tld but excluding subdomains

I'm trying to do what the title says and I've got this:
sub getDomain {
    my $scalarRef = shift;
    # Split the address at the "@" into local part and host.
    my @from_domain = split(/\@/, $$scalarRef);
    # Capture the last two dot-separated labels of the host.
    if ($from_domain[1] =~ m/^.*?(\w+\.\w+)$/) {
        # Debug output; $username comes from the surrounding script (not shown).
        print "$from_domain[1] $1" if ($username eq 'xxx');
        return $1;
    }
}
This works fine for user@domain.com, returning domain.com, but of course domain.co.uk returns co.uk and I need domain.co.uk. Any suggestions on how to proceed with this one? I'm guessing a module, and some people suggest some kind of TLD lookup table.

Don't use a RegExp.
use Email::Address;
my ($addr) = Email::Address->parse('foo@domain.co.uk');
print "Domain: ".$addr->host."\n";
print "User: ".$addr->user."\n";
Prints:
Domain: domain.co.uk
User: foo

I think you're out of luck here. Net::Domain::TLD will give you a list of TLDs, but that's not actually what you want.
As I understand it, given an email address like user@sub.domain.com, you want to get domain.com. The TLD here is "com" and you want the TLD and the section of the domain that comes before it. That's easy.
And then there's user@sub.domain.co.uk. Here the TLD is "uk". But here you don't want the TLD and the section of the domain that precedes it: you want two sections before the TLD.
So perhaps you need a heuristic. If the TLD is three letters long, take the previous section of the domain, and if the TLD is two letters long, take the previous two sections.
But that doesn't work either. Not all ccTLDs have defined subdomains like .uk does. Take, for example, the popular .tv ccTLD. They allow you to register a domain directly under the ccTLD.
So you don't just need a list of TLDs. You also need to understand the rules that each of the TLDs apply to registrations. And they could change over time. And new TLDs are being introduced - you'd need to keep up with all of those.
Oh, and one last point. Even big ccTLDs like .uk don't always follow their own rules. There are a few .uk domains that don't have a top-level subdomain: british-library.uk, for example.
You might be able to implement this for a sub-set of domains that you're particularly interested in. But a full solution would be incredibly complex and almost impossible to keep up to date.
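To make the failure of the length heuristic concrete, here is a small Python sketch. The function name naive_registrable is made up, and the hostnames are only examples chosen to mirror the cases discussed in this answer.

```python
def naive_registrable(host):
    """Length heuristic: a TLD of three or more letters keeps one extra label,
    a two-letter TLD keeps two extra labels."""
    labels = host.lower().split(".")
    keep = 2 if len(labels[-1]) >= 3 else 3
    return ".".join(labels[-keep:])


for host in ["sub.domain.com", "sub.domain.co.uk", "sub.example.tv"]:
    print(host, "->", naive_registrable(host))
```

The first two results are what the question wants, but for sub.example.tv the heuristic keeps all three labels even though registrations happen directly under .tv, so the expected answer would be example.tv.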

Need regex to get domain + subdomain

So I'm using this function here:
function get_domain($url)
{
    // parse_url() returns the host with every subdomain still attached.
    $pieces = parse_url($url);
    $domain = isset($pieces['host']) ? $pieces['host'] : '';
    // Keep only one label followed by a 2-6 character suffix of letters and dots.
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        return $regs['domain'];
    }
    return false;
}
$referer = get_domain($_SERVER['HTTP_REFERER']);
What I need is another regex for it, if someone would be so kind as to help.
Specifically, I need it to return the whole domain, including subdomains.
Here is the real problem I have now: when bloggers link to me from, for example, myblog.blogger.com,
the referer URL comes out as just blogger.com, which is not ideal.
So if someone could give me a regex for the function above that includes the subdomain, I'd appreciate it a lot!
Thanks!

This regex should match a domain in a string, including any subdomains:
/([a-z0-9-]+\.)*[a-z0-9-]+\.[a-z]+/
Translated to rough English, it works like this: "match the first part of the string that has 'sometextornumbers.sometext', and also include any number of 'sometextornumbers.' that might precede it."
See it in action here: http://regexr.com?2vppk
Note that the multiline and global flags in that link are only there to be able to match the entire blob of test text, so you don't need them if you're passing only one line to the regex.
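For comparison, the host that a URL parser returns already contains every subdomain, so no trimming regex is needed for this particular goal. A Python sketch of the same idea using the standard library follows; the function name full_host is made up for this example.

```python
from urllib.parse import urlsplit


def full_host(url):
    """Return the lower-cased host of a URL, subdomains included, or None."""
    return urlsplit(url).hostname


print(full_host("http://myblog.blogger.com/2011/01/some-post.html"))
```

In the PHP function from the question, the equivalent would be to return the parsed host directly instead of running it through the regex.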

Good luck with the above, as domain names can now contain non-Roman characters. These have to be converted into their equivalent, unique ASCII form before a regex can work reliably. See RFC 3490, Internationalizing Domain Names in Applications (IDNA).
See https://www.rfc-editor.org/rfc/rfc3490
which has
Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire. This document defines
internationalized domain names (IDNs) and a mechanism called
Internationalizing Domain Names in Applications (IDNA) for handling
them in a standard fashion. IDNs use characters drawn from a large
repertoire (Unicode), but IDNA allows the non-ASCII characters to be
represented using only the ASCII characters already allowed in so-
called host names today. This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be
introduced with no changes to the existing infrastructure. IDNA is
only meant for processing domain names, not free text.
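As a rough Python sketch of that preprocessing step: the built-in "idna" codec (which implements the older RFC 3490 rules, so newer IDNA 2008 edge cases are not covered) turns a Unicode host into its ASCII form before an ASCII-only pattern is applied. HOST_RE is a simplified pattern written for this example, not one of the regexes posted in this thread.

```python
import re

# ASCII-only hostname pattern, simplified for this example.
HOST_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z0-9-]+$")


def to_ascii(host):
    """Convert a possibly internationalized host to its ASCII (xn--) form."""
    return host.encode("idna").decode("ascii")


def matches_after_conversion(host):
    """Apply the ASCII-only pattern to the converted host."""
    return HOST_RE.match(to_ascii(host.lower())) is not None


print(HOST_RE.match("bücher.example") is not None)  # raw Unicode host
print(matches_after_conversion("bücher.example"))   # converted first
```

Note that the last label must then be allowed to contain digits and hyphens, otherwise converted IDN TLDs are rejected again.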

I guess this is an optimization of the first suggestion.
The main improvements:
does not react to invalid patterns such as sub..domain.xyz
captures more than one subdomain as a group
captures the port if given
https://((?:[a-z0-9-]+\.)*)([a-z0-9-]+\.[a-z]+)($|\s|\:\d{1,5})
Test it: https://regex101.com/r/njFIil/1
This regex does not handle any unicode symbols, which could be a problem as mentioned above.

Better solution:
/^([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/
Regex sample:
https://regexr.com/4k71a
And for an email address:
/^[a-z0-9.-]+[a-z0-9]{1,}@([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/

Possible Root URLS

When validating URLs, I was wondering if the root could be setup like this:
http://my.great.web.site.I.rule.com/
I guess the real question is, if someone wanted to buy a .com with the name "some.site", would the above example be possible?
I was thinking something like that was out of the ordinary, and that the maximum would be something like this:
http://subdomain.mysite.com/
I might be thinking about this wrong, but I have very little knowledge of url structures and am trying to learn as much as I can.
Just wondering, because you could get a heck of a lot more precise with a regex like this (assuming periods cannot be used in domain/subdomain names):
(https?:\/\/)([a-z0-9_-]{1,63}\.){1,2}([a-z]{2,8}){1}\/
than you could with this (assuming periods can be used in domain/subdomain names):
(https?:\/\/)([a-z0-9_-]{1,63}\.)\/
Any thoughts, or is this just ridiculous?

Wikipedia has good descriptions of URI schemes and domain names, with links to all the relevant RFCs.
One note about your regex: you should also consider allowing port numbers, for servers hosted on non-default ports, e.g.
http://typicaltomcat.com:8080/
Edit: If you are looking for a regex to match URLs, there is an interesting article on a liberal URL matcher.

Regarding the URLs, you can have (in theory) up to 127 domain levels (counting the top-level domain name .com), as long as the whole domain name does not exceed 255 characters and each subdomain is shorter than 64 characters.
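The arithmetic behind the figure of 127 can be checked with a short Python sketch. It assumes the 255 limit is counted the way RFC 1035 counts a name on the wire (one length byte per label plus one byte for the root) and that the host is plain ASCII; the function names are made up for this example.

```python
MAX_NAME_OCTETS = 255   # whole name in wire format
MAX_LABEL_OCTETS = 63   # a single label


def wire_length(host):
    """Octets used on the wire: each label plus its length byte, plus the root."""
    labels = host.rstrip(".").split(".")
    return sum(1 + len(label) for label in labels) + 1


def within_dns_limits(host):
    """Check the per-label and whole-name size limits for an ASCII host."""
    labels = host.rstrip(".").split(".")
    labels_ok = all(1 <= len(label) <= MAX_LABEL_OCTETS for label in labels)
    return labels_ok and wire_length(host) <= MAX_NAME_OCTETS


deepest = ".".join(["a"] * 127)  # 127 one-character labels
print(wire_length(deepest), within_dns_limits(deepest))
```

With one-character labels, each level costs two octets, so 127 levels use 254 octets plus the root byte, and a 128th level no longer fits.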
