Regular expression which matches a domain with only two ending characters? - regex

I'm trying to stop spammers who are using short domains (bit.ly, etc.). The domains they post all seem to end in only two characters (not .com, etc.).
I've used this:
\.[a-z][a-z]$
But, it has two problems:
It also matches .co.uk.
If anything follows the domain (a space or a slash, for example bit.ly/2231), it doesn't match.
Could someone assist me with a regex that would accomplish this, please?

These patterns match against the whole URL and depend on the domain appearing before the first forward slash after the protocol. The first one matches only if the host contains a single dot and ends with a two-character TLD. The second one uses a negative lookbehind to make sure the two-character TLD is not part of something like .co.uk.
https://regex101.com/r/5acu56/2
^(https?:\/\/)?[^\/.]+\.[a-z][a-z](\/|\s*$)
https://regex101.com/r/p8Ajw9/2
^(https?:\/\/)?[^\/]+(?<!\.[a-z][a-z])\.[a-z][a-z](\/\s*|\s*$)
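To see how the two patterns differ, here is a minimal Python sketch that runs both against a few sample hosts. The helper name is_short_tld and the sample URLs are assumptions for illustration; the regexes are copied from the answer unchanged.

```python
import re

# Pattern 1: exactly one dot in the host, ending in a two-letter TLD.
ONE_DOT = re.compile(r"^(https?:\/\/)?[^\/.]+\.[a-z][a-z](\/|\s*$)")

# Pattern 2: any host ending in a two-letter TLD, unless that TLD is
# directly preceded by another two-letter label (as in .co.uk).
NO_DOUBLE = re.compile(
    r"^(https?:\/\/)?[^\/]+(?<!\.[a-z][a-z])\.[a-z][a-z](\/\s*|\s*$)"
)


def is_short_tld(url: str, pattern: re.Pattern = NO_DOUBLE) -> bool:
    """Return True if the URL's host ends in a lone two-letter TLD."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    samples = ["bit.ly/2231", "http://bit.ly", "sub.bit.ly",
               "example.com", "example.co.uk"]
    for sample in samples:
        print(sample, is_short_tld(sample, ONE_DOT), is_short_tld(sample))
```

The only sample where the two disagree is sub.bit.ly: the first pattern rejects it because the host has more than one dot, while the second accepts it.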

Related

Regex validation works without domain

So I have the following Regex URL validator:
[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+
It works perfectly well for my needs, except that it accepts urls without a domain for example www.test works.
How can I modify it to validate for a domain? (Any domain should be accepted, not just .com.)
Just make the last group in your regex mandatory as appearing two or more times:
[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,}){2,}
As a disclaimer, and as @Wiktor will probably comment, you might want to use a regex pattern for validating URLs that has already been tested thoroughly. While this answer may fix your immediate problem, there are most likely other edge cases it does not handle.
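As a quick check of what the {2,} change does, here is a small Python sketch comparing the original and modified patterns. The helper accepts and the use of fullmatch (whole-string validation) are assumptions for illustration.

```python
import re

# Original: one or more ".label" groups after the first label.
ORIGINAL = re.compile(
    r"[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+"
)
# Modified: at least two ".label" groups after the first label.
STRICTER = re.compile(
    r"[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,}){2,}"
)


def accepts(pattern: re.Pattern, host: str) -> bool:
    """Return True if the whole host string matches the pattern."""
    return pattern.fullmatch(host) is not None


if __name__ == "__main__":
    for host in ["www.test", "www.test.com", "example.com"]:
        print(host, accepts(ORIGINAL, host), accepts(STRICTER, host))
```

One side effect worth knowing: because the modified pattern always demands two dotted groups, a bare example.com (no subdomain) is rejected as well.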
You could do it like this to account for Unicode:
^\p{L}+\.\p{L}+(\.\p{L}{2,})+
\p{L} or \p{Letter}: any kind of letter from any language.
So with this we match a group of one or more letters (subdomain), followed by a ., followed by a group of one or more letters (main domain), followed by one or more groups of a . and two or more letters (domain suffix).
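Not every engine understands \p{L}. Python's standard re module, for example, does not (the third-party regex module does). The sketch below is an approximation under that constraint: it swaps \p{L} for [^\W\d_], which covers Unicode letters but also lets through a few non-letter word characters, so treat it as a stand-in rather than an exact equivalent. The names LETTER and UNICODE_HOST are illustrative.

```python
import re

# Assumption: [^\W\d_] approximates \p{L} (a word character that is
# neither a decimal digit nor an underscore).
LETTER = r"[^\W\d_]"

# Same shape as the pattern above: letters, dot, letters, then one or
# more ".suffix" groups of two or more letters.
UNICODE_HOST = re.compile(rf"^{LETTER}+\.{LETTER}+(\.{LETTER}{{2,}})+")


def has_domain(host: str) -> bool:
    """Return True if the host has a subdomain, a domain and a suffix."""
    return UNICODE_HOST.search(host) is not None


if __name__ == "__main__":
    for host in ["www.münchen.de", "пример.испытание.рф", "www.test"]:
        print(host, has_domain(host))
```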

Regex for matching, except two words. In Nginx server

I am trying to create 3 servers in Nginx
one which can match everything EXCEPT if the name contains -testing or -staging.
one which matches everything with -testing in the server_name.
one which matches everything with -staging in the server_name.
The last two are not the problem. Those are working. I'm just struggling with the first.
Here is the regex I tried:
~^.*(?!(-testing)|(-staging))\.example\.lh$
https://regex101.com/r/m8vXsL/1 (removed the tilde)
But it still matches server_names containing those words. Any help would be appreciated.
If nginx supports it (you've chosen PCRE in your example, so I assume it does), just insert a < to get a negative lookbehind instead, i.e.:
^.*(?<!(-testing)|(-staging))\.example\.lh$
     ^
Here at regex101.
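The difference between the two assertions can be checked outside nginx. The following Python sketch is only an illustration (nginx uses PCRE, not Python's re, and the host names are made up), but both engines treat these particular constructs the same way: the lookahead inspects the text that follows, which is always .example.lh, so it never rejects anything, while the lookbehind inspects the eight characters just before the dot.

```python
import re

# Question's attempt: negative lookahead, evaluated right before
# ".example.lh", so it only ever sees ".example.lh" and always passes.
LOOKAHEAD = re.compile(r"^.*(?!(-testing)|(-staging))\.example\.lh$")

# Answer's fix: negative lookbehind, evaluated against the text that
# immediately precedes ".example.lh".
LOOKBEHIND = re.compile(r"^.*(?<!(-testing)|(-staging))\.example\.lh$")


def matches(pattern: re.Pattern, server_name: str) -> bool:
    return pattern.search(server_name) is not None


if __name__ == "__main__":
    names = ["app.example.lh", "app-testing.example.lh",
             "app-staging.example.lh", "my-testing-app.example.lh"]
    for name in names:
        print(name, matches(LOOKAHEAD, name), matches(LOOKBEHIND, name))
```

Note that the lookbehind only excludes names where -testing or -staging sits directly before .example.lh. A name such as my-testing-app.example.lh still matches, so a different pattern would be needed if the words can appear elsewhere in the name.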

Mistaken Squid Proxy regex? → ^.*stackoverflow\.*

I have several proxy rule files for Squid, and all contain rules like:
acl blacklisted dstdom_regex ^.*facebook\.* ^.*youtube\.* ^.*games.yahoo.com\.*
The patterns match against the domain name: dstdom_regex means destination (server) regular expression pattern matching.
The objective is to block some websites, but I don't know by what method: domain name, keywords in the domain name, ...
Let's expand/describe the pattern:
^.*stackexchange\.*    The whole pattern
^                      String beginning
.*                     Match anything (greedy quantifier, I presume)
stackexchange          Keyword to match
\.*                    Any number of dots (.)
Totally legitimate matches:
stackexchange.com: The Stack Exchange website.
stackoverflow.stackexchange: The imaginary Stack Exchange gTLD.
But these possible matches make it seem more like a keyword block:
stackexchange
stackexchanger
notstackexchange
not-stackexchange
some-website.stackexchange
some-website.stackexchange-tld
And the pattern seems to contain a bug, since it allows the following invalid cases to match, thanks to the \.* at the end, although they never naturally occur:
stackexchange.
stackexchange...
stackexchange..........
stackexchange.......com
stackexchange.com
stackexchangecom
You get the idea.
Anything containing stackexchange, even if separated by dots from everything else, is still a valid match.
So now, the question itself:
This all means that this is simply a match for stackexchange! (I'm assuming the original author didn't intend to match infinite dots.)
So why not just use the pattern stackexchange? Wouldn't it be faster and give the same results, except for the "bug" (\.*)?
I.e., isn't ^.*stackexchange equivalent to stackexchange?
Edit: Just to clarify, I didn't write those proxy rule files.
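The equivalence asked about above can be checked empirically. This is a sketch in Python rather than in Squid itself, and it assumes the pattern is applied as an unanchored search over a single-line host name (the usual behaviour for this kind of matching, but treated here as an assumption about Squid). The host list reuses the examples from the question plus one non-matching name.

```python
import re

ANCHORED = re.compile(r"^.*stackexchange\.*")  # pattern from the rule files
KEYWORD = re.compile(r"stackexchange")         # proposed simplification

HOSTS = [
    "stackexchange.com", "stackoverflow.stackexchange", "stackexchange",
    "stackexchanger", "notstackexchange", "not-stackexchange",
    "some-website.stackexchange", "some-website.stackexchange-tld",
    "stackexchange.......com", "stackexchangecom", "example.com",
]


def blocked(pattern: re.Pattern, host: str) -> bool:
    """Return True if the pattern is found anywhere in the host name."""
    return pattern.search(host) is not None


if __name__ == "__main__":
    for host in HOSTS:
        print(host, blocked(ANCHORED, host), blocked(KEYWORD, host))
```

Under those assumptions both patterns give the same yes/no answer for every host: ^.* can absorb any prefix and \.* can match zero dots, so neither adds a constraint beyond "contains stackexchange".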
I don't understand why you use \.* to match all the following dots.
However, to work around your problem, you can try this:
^[^\.]*\.stackexchange\.*
[^\.]* matches anything except a dot
\. then you match the dot
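To get a feel for what this stricter pattern accepts, here is a small Python sketch (again only an approximation of Squid's matching, using search semantics as an assumption; the host names are made up). It requires a dot directly before stackexchange and no dots before that.

```python
import re

# Pattern from the answer: non-dot characters, one dot, then the keyword.
SUBDOMAIN_ONLY = re.compile(r"^[^\.]*\.stackexchange\.*")


def blocked(host: str) -> bool:
    return SUBDOMAIN_ONLY.search(host) is not None


if __name__ == "__main__":
    hosts = ["meta.stackexchange.com", "stackexchange.com",
             "notstackexchange.com", "a.meta.stackexchange.com"]
    for host in hosts:
        print(host, blocked(host))
```

In this sketch the keyword-style matches such as notstackexchange.com are gone, but the bare stackexchange.com and deeper names like a.meta.stackexchange.com are no longer matched either, which may or may not be what the rule is meant to do.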

Regex expression for parsing URLs my way

I have a question about how to parse URLs my way.
Here's my regex expression:
[^\s]+?\.(com|net|org|edu...ALL_DOMAIN_EXTENSIONS)([^\s\w\d][^\s]{1,})?
My rationale is that I want to accept
mail.google.com (as long as there's a .com, .net etc)
However, the .com must be followed by a symbol (if anything follows it at all), not by an alphanumeric character. With this way of checking, though, this URL will fail:
www.company.com
However, I can't use a greedy repetition to search for a .com, as in this case:
developer.google.com/appid=com.company.apppackage
How do I search for the first occurrence of a '.com' without an alphanumeric character following it, while keeping the trailing part optional in case it's just
Google.com
Use $ as an alternative to match the end of the string.
[^\s]+?\.(com|net|org|edu...ALL_DOMAIN_EXTENSIONS)([^\s\w\d][^\s]+|$)
BTW, trying to match all top-level domains will drive you crazy, since anyone can now register a TLD, so they change frequently.
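One detail matters for the $ alternative to do any work: the trailing group has to be mandatory. If the group is wrapped in ?, the engine may skip it and stop at the first .com it finds. The Python sketch below compares the two variants. The four-entry TLD list is a hypothetical stand-in for the full list, and the helper first_url is an illustrative name.

```python
import re

# Hypothetical short list standing in for ALL_DOMAIN_EXTENSIONS.
TLDS = "com|net|org|edu"

# Trailing group optional: the engine can skip it entirely.
OPTIONAL_TAIL = re.compile(rf"[^\s]+?\.({TLDS})([^\s\w\d][^\s]+|$)?")
# Trailing group required: the TLD must be followed by a symbol plus
# more text, or by the end of the string.
REQUIRED_TAIL = re.compile(rf"[^\s]+?\.({TLDS})([^\s\w\d][^\s]+|$)")


def first_url(pattern: re.Pattern, text: str):
    """Return the first matched URL, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    samples = ["www.company.com", "Google.com",
               "developer.google.com/appid=com.company.apppackage"]
    for s in samples:
        print(s, first_url(OPTIONAL_TAIL, s), first_url(REQUIRED_TAIL, s))
```

With the optional group, www.company.com is cut short because .com also matches the start of .company; with the required group, the lazy quantifier keeps expanding until .com is followed by a symbol or the end of the string.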

Regex captures all occurrences but the last of certain characters

I want to exclude common punctuation from my URL Regex detector when my clients type a sentence with a URL in it. A common scenario would be the URL example.com?q=this (which obviously needs to include the ?) versus a sentence saying
What do you think of example.com?
This expression suits my needs just fine:
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#/]\S*)?
However it includes all punctuation at the end, so I am iterating through each match to find and use this captured group to exclude said punctuation:
(.*?)[?,!.;:]+$
However, I'm not sure how to leverage the "end of string" technique when scanning the entire block of text which may have multiple URLs. Was hoping there'd be a way to capture the right blocks from the get-go without the extra work.
Just require non-whitespace after the punctuation instead of making it optional.
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?
You will of course lose some valid URL endings: example.com/ will become example.com, but as far as I know there is no difference between the two.
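Since the goal is to scan a whole block of text with several URLs, here is a minimal Python sketch that runs both versions of the pattern with findall. The sample sentences are made up for illustration.

```python
import re

# Question's pattern: \S* lets a lone trailing "?" be captured.
GREEDY_TAIL = re.compile(r"(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#/]\S*)?")
# Answer's pattern: \S+ requires something after the "?", "#" or "/".
STRICT_TAIL = re.compile(r"(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?")

TEXT = "What do you think of example.com? Try example.com?q=this instead."


def find_urls(pattern: re.Pattern, text: str) -> list:
    """Return every URL-like match in the text, in order."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(find_urls(GREEDY_TAIL, TEXT))
    print(find_urls(STRICT_TAIL, TEXT))
```

The query-string URL is kept intact in both cases; only the bare trailing ? is dropped by the second pattern. Punctuation that directly follows a path or query string (for example a sentence ending in example.com?q=this.) would still be captured, so that case is not covered by this change.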