Regex to match domain name (but not TLD) - regex

I know there are tons of domain matching regular expressions floating around, but couldn’t find one to answer my particular question. I’m looking for a regular expression that will match only a URL’s domain name, but nothing else (not even the TLD). It doesn’t need to validate the domain.
So given the sample below:
https://www.orchardsoft.com
https://www.horizon-lims.com/contact/us
https://www.quartzy.com
https://qbench.net
https://www.xifin.com
...the regular expression needs to match for the following:
Orchardsoft
Horizon-lims
Quartzy
QBench
Xifin
The regular expression I'm starting with is this: (.|//(\w+.)+
Is anyone able to point me in the right direction?

As long as you declare the possible TLDs (here: .com.tr, .com, .net), you can use this regex:
([\w-]+)(?=\.(?:com\.tr|com|net))
Note that an FQDN has a hierarchical structure, so a regex cannot always analyze it correctly. This one fails (it matches twice) for entries whose path contains something that looks like a TLD, such as https://www.example.com/a.combination.
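Here is a minimal Python sketch of the lookahead approach, run against the sample URLs from the question. The TLD list is the one declared in the answer; the helper name domain_labels is hypothetical and only there for illustration.

```python
import re

# Pattern from the answer: a run of word characters or hyphens that is
# immediately followed by one of the declared TLDs (not consumed by the match).
DOMAIN_RE = re.compile(r"([\w-]+)(?=\.(?:com\.tr|com|net))")

def domain_labels(url):
    """Return every label that is directly followed by a declared TLD."""
    return DOMAIN_RE.findall(url)

urls = [
    "https://www.orchardsoft.com",
    "https://www.horizon-lims.com/contact/us",
    "https://www.quartzy.com",
    "https://qbench.net",
    "https://www.xifin.com",
    "https://www.example.com/a.combination",  # the problem case: two matches
]

for url in urls:
    print(url, "->", domain_labels(url))
```

The last URL returns two labels, which is the double match mentioned above. If that matters, parsing the host out of the URL first and running the regex on the host alone is probably the safer route.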

Related

regular expressions: catch any URLs of the domain example.com

I'm trying to write a regexp for the case below. I have made multiple attempts, but in vain.
I need to catch any URL of the domain site.com. I tried using the regexp ^site.com/*$
but it does not recognize it.
I'm just looking for a regexp which matches site.com/*
Your expression ^site.com/*$ matches only strings that consist of site.com (where the unescaped . stands for any character) followed by zero or more / characters (/*), and nothing after that.
If you want to match any string starting with site.com/, you might want to try ^site\.com/.*$ instead.
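To see the difference between the two patterns, here is a small Python sketch. The sample strings and the helper name matches are made up for illustration.

```python
import re

LOOSE = re.compile(r"^site.com/*$")     # pattern from the question
PREFIX = re.compile(r"^site\.com/.*$")  # pattern suggested in the answer

def matches(pattern, text):
    """True if the compiled pattern matches the text."""
    return pattern.search(text) is not None

samples = ["site.com", "site.com///", "site.com/foo", "siteXcom/"]
for s in samples:
    print(f"{s!r}: question={matches(LOOSE, s)} answer={matches(PREFIX, s)}")
```

The question's pattern accepts siteXcom/ because the dot is unescaped, and it rejects site.com/foo because nothing but slashes may follow the domain.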
There are already a lot of other regex questions about domain names on SO, but it is not clear to me in what context you are trying to do this, or what goal you actually want to achieve. If you describe your needs more precisely, you will probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note: . has a special meaning in regex (it matches any character), so a literal dot must be escaped with a backslash. And if you want to capture site.com/foo, you need a pattern that does not limit how many characters follow the slash. I'd do this with groupings.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
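For reference, a short Python sketch of what the two groups capture. The function name split_site_url is hypothetical.

```python
import re

# Group 1 is the literal prefix, group 2 is everything after the slash.
SITE_RE = re.compile(r"^(site\.com/)(.+)$")

def split_site_url(url):
    """Return (prefix, rest) if the URL matches, otherwise None."""
    m = SITE_RE.match(url)
    return m.groups() if m else None

print(split_site_url("site.com/foo"))
print(split_site_url("site.com/"))
print(split_site_url("siteXcom/foo"))
```

Note that .+ needs at least one character after the slash, so a bare site.com/ does not match; swap in .* if that case should be accepted too.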
Your regex ^site.com/*$ only matches strings like the following:
site.com/ site.com//////// site.com
because the asterisk (*) in a regex means "match 0 or more of the preceding token".
So this should work instead:
^site\.com\/.*$

Regular expression to match a domain

I want a regular expression for Google Analytics that matches a whole domain, including its subdomains.
Say we have to match a domain name called xyz.com.
So I want to match every URL that has xyz.com in it.
Example
abcd.xyz.com, abc1232.xyz, www.xyz.com, www.xyz.com/abc
Can anyone help me with that.
My purpose is to exclude the traffic coming from these sites from my Google Analytics reports.
In general, the regular expression to match those domains would be something like .*\.xyz\.com$. The backslashes escape the dots (which are normally wildcard characters), and the dollar sign represents the end of the string.
There are different regex implementations, so you might have to tweak this for your regex engine.
To exclude the domain and its subdomains as described above, you can use a GA filter ([Exclude] [Hostname] [Matching RegEx]) along with the regular expression (xyz\.com)|(.*\.xyz\.com).
This regex covers both the main domain and its subdomains.
You could try this regex
(.*\.)?xyz\.com
This matches all your required formats for the URL.
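A quick Python sketch comparing the patterns from the answers above, using unanchored search semantics. The names below are made up, and the anchored variant is my own assumption rather than anything from the answers; it only makes sense if the field being filtered holds just the hostname.

```python
import re

SUBDOMAINS_ONLY = re.compile(r".*\.xyz\.com$")     # first answer
OPTIONAL_PREFIX = re.compile(r"(.*\.)?xyz\.com")   # last answer
ANCHORED = re.compile(r"^(.*\.)?xyz\.com$")        # assumption: hostname-only input

def hit(pattern, text):
    """True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None

hosts = ["abcd.xyz.com", "www.xyz.com", "www.xyz.com/abc", "xyz.com", "notxyz.com"]
for h in hosts:
    print(h, hit(SUBDOMAINS_ONLY, h), hit(OPTIONAL_PREFIX, h), hit(ANCHORED, h))
```

The unanchored pattern also hits notxyz.com, because xyz.com appears inside it. The first pattern misses the bare xyz.com and anything with a path after the host.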

I want a regular expression that only matches domain names with one period in them

I want it to catch things like somedomain.com/folder/path, but not something like domain.sub.other.com. The regex I have so far is almost complete, it just doesn't sift out the multi-domain urls:
^(.*)://(?!(.{2,3})\.(.*)(.{2,3})(.*)
Is there any way to filter out URLs whose domain contains multiple periods?
Instead of .{2,3}, you want something like this: [^.]{2,3} - this excludes the period (no need to escape as it has no special meaning in this context in a regular expression) from that particular match. Overall you'd have something like:
://[^.]+\.[^.]{2,3}(/.*)?
Except obviously you're missing things like *.info by doing that....
Found a solution that is working given a variety of test scenarios:
^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$
Here, the second-to-last group matches a forward slash, a question mark, or the end of the string. It works together with the group before it, which does not allow matches that include '.'.
The final effect is that it only matches URLs with a two-part domain such as 'domain.com', and no limits are placed on string length.
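A small Python check of that final pattern, as a sketch. The test URLs and the helper name is_two_part are made up for illustration.

```python
import re

# Final pattern from the answer, copied as-is.
TWO_PART_RE = re.compile(
    r"^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$"
)

def is_two_part(url):
    """True if the URL's host looks like 'name.tld' with exactly one period."""
    return TWO_PART_RE.match(url) is not None

for url in [
    "http://somedomain.com/folder/path",
    "http://domain.com",
    "https://example.info?x=1",
    "http://domain.sub.other.com",
    "http://domain.sub.other.com/path",
]:
    print(url, "->", is_two_part(url))
```

Because the second group is [^.]+ rather than something that also excludes slashes, a URL with no period in the host but one in the path could slip through, so treat this as a filter for well-formed input rather than a validator.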

Regex with URLs - syntax

We're using a proprietary tracking system that requires the use of regular expressions to load third party scripts on the URLs we specify.
I wanted to check the syntax of the regex we're using to see if it looks right.
To match the following URL
/products/18/indoor-posters
We are using this rule:
.*\/products\/18\/indoor-posters.*
Does this look right? Also, if there was a query parameter on the URL, would it still work? e.g.
/products/18/indoor-posters?someParam=someValue
There's another URL to match:
/products
The rule for this is:
.*\/products
Would this match correctly?
Well, "right" is a relative term. Usually, .* is not a good idea because it matches anything, even nothing. So while these regexes will all match your example strings, they'll also match much more. The question is: What are you using the regexes for?
If you only want to check whether those substrings are present anywhere in the string, then they are fine (but then you don't need regex anyway, just check for substrings).
If you want to somehow check whether it's a valid URL, then no, the regexes are not fine because they'd also match foo-bar!$%(§$§$/products/18/indoor-postersssssss)(/$%/§($/.
If you can be sure that you'll always get a correct URL as your input and just want to check whether it matches your pattern, then I'd suggest
^.*\/products$
to match any URL that ends in /products, and
^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$
to match a URL that ends in /products/18/indoor-posters with an optional ?name=value bit at the end, assuming only alphanumeric characters, underscores, and hyphens are legal for name and value.
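A short Python sketch that exercises both anchored patterns. The sample paths are taken from the question plus a couple of made-up negatives, and the helper name ok is hypothetical.

```python
import re

PRODUCTS_RE = re.compile(r"^.*\/products$")
POSTERS_RE = re.compile(
    r"^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$"
)

def ok(pattern, url):
    """True if the URL matches the anchored pattern."""
    return pattern.search(url) is not None

print(ok(PRODUCTS_RE, "/products"))
print(ok(PRODUCTS_RE, "/products/18/indoor-posters"))
print(ok(POSTERS_RE, "/products/18/indoor-posters"))
print(ok(POSTERS_RE, "/products/18/indoor-posters?someParam=someValue"))
print(ok(POSTERS_RE, "/products/18/indoor-postersssssss"))
print(ok(POSTERS_RE, "/products/18/indoor-posters?a=1&b=2"))
```

Note that the optional part only allows a single name=value pair, so a query string with several parameters does not match. Whether the tracking system anchors patterns the same way Python does is something you would have to check in its documentation.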

Regex for checking a body of text for a URL?

I have a regex pattern for URLs that I use to check for links in a body of text. The only problem is that the pattern will match this link
stackoverflow.com
And this sentence
I'm a sentence.Next Sentence.
Obviously this makes sense, because my pattern doesn't strictly check for .com, .co.uk, .com.au, etc.
I want it to match stackoverflow.com and not the latter.
As I'm no regex expert, does anyone know of any good regex patterns for checking for all types of URLs in a body of text, while not matching sentences like the one above?
If I have to strictly check the domain extension, I suppose I'll have to settle for that.
Here's my pattern, but I don't think it helps.
(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
I would definitely suggest finding a working regex that someone else has made (which would probably include a strict check on the domain extension), but here is one possible way to modify your existing regex.
It relies on the assumption that links usually do not mix case in the domain extension: you might see .COM or .com, but probably not .Com. If you only match domain extensions that don't mix case, you avoid matching most sentences.
In the middle of your regex you have [\w]{2,4}, try changing this to ([A-Z]{2,4}|[a-z]{2,4}) (or (?:[A-Z]{2,4}|[a-z]{2,4}) if you don't want a new captured group).
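To illustrate the idea without the full pattern, here is a Python sketch using a much simplified stand-in regex. The stand-in, the word boundaries, and the helper name find_links are my own assumptions; only the ([A-Z]{2,4}|[a-z]{2,4}) part comes from the suggestion above.

```python
import re

# Simplified stand-in for the full pattern: dot-separated labels followed by
# an extension that is either all upper case or all lower case.
LINK_RE = re.compile(
    r"\b(?:[A-Za-z\d][-A-Za-z\d]*\.)+(?:[A-Z]{2,4}|[a-z]{2,4})\b"
)

def find_links(text):
    """Return every substring that looks like a host with a same-case extension."""
    return LINK_RE.findall(text)

print(find_links("Visit stackoverflow.com today"))
print(find_links("I'm a sentence.Next Sentence."))
print(find_links("SEE STACKOVERFLOW.COM"))
print(find_links("Mixed.Com"))
```

This is only a heuristic: a sentence with a missing space before a short lower-case word, such as "the end.next one", would still be picked up as a link.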