Regex - filter websites and IPs - regex

What regex can detect the presence of any IP address or URL in a string? It should also catch people who are obviously trying to evade the filter. Examples:
154.43.45.345
website.com
website . com
website dot com
website?##[]?.,<>.com
etc
Thanks!

The IP regex is very straightforward, as its structure is constant:
(\d{1,4}\.){3}\d{1,4}
For websites, I would need more detailed info, but the general idea is like
.+((\.|dot).+)*(\.|dot)\s*(com|net|org|gov|uk...)
or, extremely basic
.*(\.|dot).*(com|net|org|gov|uk...)
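To make the idea above concrete, here is a minimal Python sketch. It is only one possible reading of the patterns in this answer: the TLD tuple is a placeholder that you would extend yourself, and the names IP_RE, SITE_RE and looks_like_address are made up for illustration. The site pattern allows any run of non-alphanumeric junk around the "." or "dot" separator, which is an assumption about how people try to dodge the filter.

```python
import re

# Loose IPv4-like pattern from the answer: four dot-separated digit groups.
IP_RE = re.compile(r"(\d{1,4}\.){3}\d{1,4}")

# Placeholder TLD list (assumption); extend it for real use.
TLDS = ("com", "net", "org", "gov", "uk")

# A name, optional junk, a "." or the word "dot", optional junk, then a TLD.
SITE_RE = re.compile(
    r"[a-z0-9-]+[^a-z0-9]*(?:\.|\bdot\b)[^a-z0-9]*(?:%s)\b" % "|".join(TLDS),
    re.IGNORECASE,
)

def looks_like_address(text):
    """Return True if text seems to contain an IP address or a site name."""
    return bool(IP_RE.search(text) or SITE_RE.search(text))

for sample in ["154.43.45.345", "website.com", "website . com",
               "website dot com", "website?##[]?.,<>.com", "hello world"]:
    print(sample, "->", looks_like_address(sample))
```

A filter this loose will also flag some innocent text, so it is probably best treated as a first pass rather than a final verdict.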

Related

Regex and cPanel Account Filtering

After countless hours of googling and trying to contact my webhost (with no positive results), I wanted to just 'throw my question out there' and get better expertise on my issue. I really do believe that this will be helpful to a lot of people stuck asking the same question!
Just to keep things short: we host our email with a webhost using cPanel, and I have a big requirement. Basically, I need an account-level filter to block certain mail addresses from sending out to other mail servers. For example:
Let's say we use example.com.
user1@example.com can send mail to anyone, anywhere.
user2.int@example.com is only allowed to send mail to example.com addresses, not to any other address such as gmail.com, yahoo.com, etc.
Of the options given to me for account-level filtering, I thought regex was the best one to use.
I suspect that Exim (the default MTA for cPanel) uses PCRE-like regular expressions; please correct me if I'm wrong.
The syntax I wrote and need help with is the following:
^(?!.+\@example\.com$).*$
With this, all example.com addresses should not match and all other addresses should.
The testing tool I used is https://www.debuggex.com/
Guys, please help and let me know what I am doing wrong. cPanel is letting mail go through and is not blocking it.
The regex:
^(?![^@]*?@example\.com)
should do the trick.
How it works:
^: Find the beginning of the string/line
(?!...): Assert that it is impossible to match the following regex at this position:
[^@]*?: Match all the characters that are not an at symbol (@)
@example\.com: Match the exact string '@example.com'
For a more in-depth explanation see this
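If you want to check the lookahead outside of cPanel, a small Python sketch like the one below may help. It only exercises the regex itself; whether Exim or cPanel evaluates the pattern the same way is not something this shows. The name is_external is made up for the example, and the addresses are written with a normal @ separator.

```python
import re

# The pattern from the answer: it matches (at position 0) only when the
# text up to the first "@" is NOT followed by "@example.com".
NOT_INTERNAL = re.compile(r"^(?![^@]*?@example\.com)")

def is_external(recipient):
    """True when the recipient is outside example.com (the filter would fire)."""
    return NOT_INTERNAL.search(recipient) is not None

for address in ["user1@example.com", "someone@gmail.com", "someone@yahoo.com"]:
    print(address, "->", "block" if is_external(address) else "allow")
```

Note that the pattern has no end anchor, so an address such as user@example.com.evil.org would be treated as internal; adding $ after com would tighten that if it matters for your setup.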

Regex for matching different parts of a domain

I am attempting to split up domains into different categories (subdomain, domain, TLD) and am having trouble.
I can't figure out a way to match any number of subdomains without overtaking my domain or TLD matching. I am using PCRE regex.
Current regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,3}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Data set:
apple.orange.banana.clevername.co.uk
strawberry.apple.orange.banana.clevername.co.uk
tangerine.com.au
simple.com
Note: There are spaces before and after the domains and they will always be lower case.
An example of how this data would match:
apple.orange.banana.clevername.co.uk
subdomain: apple.orange.banana
domain: clevername
tld: co.uk
If I add another fruit to the subdomain(strawberry.apple.orange.banana.clevername.co.uk), the match will fail. If I modify the {0,3} for the subdomain regex to a higher number or an unlimited number of matches, it gets too greedy and I no longer end up with a correct match for a domain/tld. Example of this:
Modified regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,5}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Resulting match with new regex:
strawberry.apple.orange.banana.clevername.co.uk
subdomain: strawberry.apple.orange.banana.clevername
domain:
tld: co.uk
I'm sure the regex isn't the most efficient either so any help or suggestions would be greatly appreciated. Thanks!
I believe this should do it for you:
\s((?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))\.(?<tld>[a-z\.]{3,6})\s
Tested this in Splunk and it works with your test data set.
Do note that this won't work for very short domains like bit.ly because there is no way to tell the domain from the subdomain without doing a lookup of the TLD.
For example, compare something.bit.ly and clevername.com.au. Without outside information, there is no way to tell that bit and clevername are the domains.
I recently came across the same problem. So I took Syon's regex and modified it a bit. This is the result:
\s(?:(?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>(?!com)[a-z0-9\-]{3,}(?=\.[a-z\.]{2,}))\.(?:(?<tld>[a-z\.]{2,})$)\s
It works on the whole test data set (I trimmed the spaces though), as well as short domains like bit.ly. Also works for new top level domains like .cancerresearch. See result:
https://regex101.com/r/nX6yQ7/4
Note: The regex explicitly states that the domain can't be com. This needs to be updated if other {3 characters}.xyz TLDs need to be supported.
You could try to find the longest suffix of the domain which is still listed in the Public Suffix List. After that, splitting the string should be easy.
Note that the list also considers domains of web hosters a public suffix. For example, in example.blogspot.com the public suffix is considered to be blogspot.com, not com. Also the list has to be parsed carefully as it contains comments and exceptions.
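The longest-suffix idea can be sketched in a few lines of Python. This is only an illustration: PUBLIC_SUFFIXES below is a tiny hand-picked stand-in for the real Public Suffix List (which is far larger and also has wildcard and exception rules that this sketch ignores), and split_domain is a made-up helper name.

```python
# Tiny illustrative subset (assumption); the real list must be downloaded
# and parsed, including its comments, wildcards and exceptions.
PUBLIC_SUFFIXES = {"com", "uk", "co.uk", "au", "com.au", "ly"}

def split_domain(hostname):
    """Return (subdomain, domain, suffix) or None if no suffix is known."""
    labels = hostname.split(".")
    # Start at index 1 so at least one label is left over for the domain;
    # earlier indexes give longer candidate suffixes, so the longest wins.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            subdomain = ".".join(labels[:i - 1])
            return subdomain, labels[i - 1], suffix
    return None

for host in ["strawberry.apple.orange.banana.clevername.co.uk",
             "tangerine.com.au", "simple.com", "something.bit.ly"]:
    print(host, "->", split_domain(host))
```

Because the split relies on a lookup instead of length guesses, short domains like bit.ly and multi-part suffixes like co.uk are handled by the same code path.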

Regular expression for isolating Comcast IP addresses in access log file for Apache

Really the fact I want to use this for my Apache access log file is arbitrary and irrelevant, but it gives context to the situation.
I need to filter out records associated with Comcast IP addresses. Here's a list of the dynamic IP address ranges that Comcast assigns. I need a regular expression that can match all of those, and only those. I'll work on it on my own in the meantime, but I figured there would be some regex guru out there on SO who would enjoy the problem.
A regex solution is possible but very cumbersome, since the subnet masks are not multiples of 8. You would need to write a function to process the list and convert it into a regex.
It is better to use a regex to grab the IP address and then test that address against the list of Comcast ranges. A simple implementation would be a sorted set of range start addresses, which lets you search for the nearest start that is not greater than the address and then check whether the address falls inside that range.
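A rough Python sketch of that lookup, using the standard ipaddress and bisect modules, might look like this. The RANGES list is a placeholder: only 24.0.0.0/12 comes from this thread, the second entry is a documentation range added purely for illustration, and the sketch assumes the ranges do not overlap. The helper name in_ranges is made up.

```python
import bisect
import ipaddress

# Placeholder ranges (assumption); load the real list from wherever you keep it.
RANGES = ["24.0.0.0/12", "203.0.113.0/24"]

# Sort the networks and keep parallel lists of their first and last addresses.
networks = sorted(ipaddress.ip_network(r) for r in RANGES)
starts = [int(n.network_address) for n in networks]
ends = [int(n.broadcast_address) for n in networks]

def in_ranges(ip):
    """True if ip falls inside one of the (non-overlapping) ranges."""
    value = int(ipaddress.ip_address(ip))
    # Index of the nearest range start that is <= value, or -1 if none.
    i = bisect.bisect_right(starts, value) - 1
    return i >= 0 and value <= ends[i]

print(in_ranges("24.15.255.255"))
print(in_ranges("24.16.0.0"))
```

The binary search keeps each lookup cheap even with a few hundred ranges, which is the main reason to prefer it over one giant regex.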
That is a lot of IP addresses.
For example, 24.0.0.0/12 defines the IP range 24.0.0.0 - 24.15.255.255. To match these numeric ranges with a regex:
24: 24
0-15: [0-9]|1[0-5]
0-255: [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
Which gives
(24)\.([0-9]|1[0-5])\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
And that's just for 24.0.0.0/12, 293 to go.
If you really want to do this you should write a small script to convert each IP range into a regex automatically.
Another approach would be to match any IP address and feed it into a callback that does the matching using an appropriate module / framework / API.
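As a sketch of that last approach in Python: a loose regex pulls the first IP-looking token out of each log line, and the standard ipaddress module does the actual range test. The NETWORKS list holds only the one range mentioned above, the sample log lines are invented, and the names is_listed and filter_lines are hypothetical.

```python
import ipaddress
import re

# Loose matcher for anything shaped like an IPv4 address.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Only the range discussed above; the full list would go here.
NETWORKS = [ipaddress.ip_network("24.0.0.0/12")]

def is_listed(text):
    """Callback: does this matched text fall inside any listed network?"""
    try:
        address = ipaddress.ip_address(text)
    except ValueError:
        # The regex is loose, so things like 999.1.1.1 end up here.
        return False
    return any(address in network for network in NETWORKS)

def filter_lines(lines):
    """Keep only the log lines whose first IP is not in a listed network."""
    kept = []
    for line in lines:
        match = IP_RE.search(line)
        if match and is_listed(match.group()):
            continue
        kept.append(line)
    return kept

sample = ['24.1.2.3 - - "GET / HTTP/1.1" 200', '8.8.8.8 - - "GET / HTTP/1.1" 200']
print(filter_lines(sample))
```

This keeps the regex trivial and leaves the CIDR arithmetic to a library that already gets it right.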

Regex for URL with port validation

I need to validate a url like those of web servers.
Something like http://localhost:8080/xyz
How do we do that using regex? Sorry, I'm new to regex.
the relevant specs can be found in rfc 3986 and include regular syntax definitions for all possible url components. however, for your purposes these will probably be too general. a somewhat condensed expression matching only urls under the http(s) protocol would be
http[s]?://(([[:alpha:][:digit:]-._~!$&'\(\)*+,;=]|%([0-9A-F]{2}))+|([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]))(:[0-9]+)?(/([[:alpha:][:digit:]-._~!$&'\(\)*+,;=]|%([0-9A-F]{2}))*)+(\?([[:alpha:][:digit:]-._~!$&'\(\)*+,;=/?]|%([0-9A-F]{2}))+)?(#([[:alpha:][:digit:]-._~!$&'\(\)*+,;=/?]|%([0-9A-F]{2}))+)?
which can be simplified to
http[s]?://(([^/:\.[:space:]]+(\.[^/:\.[:space:]]+)*)|([0-9](\.[0-9]{3})))(:[0-9]+)?((/[^?#[:space:]]+)(\?[^#[:space:]]+)?(\#.+)?)?
in case you can be confident about the proper syntax of the url components.
note that you might wish more restrictive patterns e.g. for full text search and to only allow for iana-registered top-level-domains.
hope it helps,
best regards, carsten
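For anyone who wants to try this in Python: the POSIX classes such as [:space:] used above are not supported by Python's re module, so the sketch below is a loose re-expression of the simplified pattern, not a translation of it. The host part is assumed to be dot-separated runs of characters other than /, :, ?, # and whitespace, and the name is_valid_url is made up.

```python
import re

URL_RE = re.compile(
    r"https?://"                                # http or https only
    r"[^/:?#\s.]+(?:\.[^/:?#\s.]+)*"            # host: dot-separated labels
    r"(?::[0-9]+)?"                             # optional numeric port
    r"(?:/[^?#\s]*(?:\?[^#\s]*)?(?:#\S*)?)?"    # optional path, query, fragment
)

def is_valid_url(url):
    """True if the whole string looks like an http(s) URL, port included."""
    return URL_RE.fullmatch(url) is not None

for candidate in ["http://localhost:8080/xyz", "http://localhost:port/xyz",
                  "ftp://localhost/xyz"]:
    print(candidate, "->", is_valid_url(candidate))
```

It does not check that the port is in the 0-65535 range or that the host is a registered name, so treat it as a shape check only.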

Possible Root URLS

When validating URLs, I was wondering if the root could be set up like this:
http://my.great.web.site.I.rule.com/
I guess the real question is, if someone wanted to buy a .com with the name "some.site", would the above example be possible?
I was thinking something like that was out of the ordinary, and that the maximum would be something like this:
http://subdomain.mysite.com/
I might be thinking about this wrong, but I have very little knowledge of url structures and am trying to learn as much as I can.
Just wondering, because you could get a heck of a lot more precise with a regex like this (assuming periods cannot be used in domain/subdomain names):
(https?:\/\/)([a-z0-9_-]{1,63}\.){1,2}([a-z]{2,8}){1}\/
than you could with this (assuming periods can be used in domain/subdomain names):
(https?:\/\/)([a-z0-9_-]{1,63}\.)\/
Any thoughts, or is this just ridiculous?
Wikipedia has good descriptions of URI schemas with links to all the relevant RFCs and Domain Names.
One note about your regex, you should also consider including port numbers when servers are hosted at non-default ports, e.g.
http://typicaltomcat.com:8080/
Edit: If you are looking for a regex to match URLs, there is an interesting article on a liberal URL matcher.
Regarding the URLs, you can have (in theory) up to 127 domain levels (counting the top-level domain name .com), as long as the whole domain name does not exceed 255 characters and each subdomain is less than 64 characters.
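Those limits are easy to check directly instead of baking them into a regex. Here is a minimal Python sketch that uses the figures quoted above (127 levels, 255 characters overall, fewer than 64 characters per label); the function name within_dns_limits is made up, and the check says nothing about which characters are allowed.

```python
def within_dns_limits(hostname):
    """Check the length limits quoted above; character rules are not checked."""
    # A single trailing dot (fully qualified form) is allowed and ignored.
    name = hostname[:-1] if hostname.endswith(".") else hostname
    labels = name.split(".")
    return (
        len(name) <= 255
        and len(labels) <= 127
        and all(1 <= len(label) <= 63 for label in labels)
    )

print(within_dns_limits("my.great.web.site.I.rule.com"))
print(within_dns_limits("a" * 64 + ".com"))
```

So a root like my.great.web.site.I.rule.com is well within the limits, and a validator that caps the number of subdomains at one or two would reject perfectly legal names.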