Regex number range target [duplicate]

I am trying to write a regex that matches the following address range:
169.254.0.0-169.254.254.255
Could anyone please help me achieve this?
So far I have this:
169\.254\.([1-9]{1,2}|[1-9]{1,2}[1-4])
but it also matches 169.254.255.1, which should not be one of the matches.
Please help!
Thanks

This is the regex I use for general IP validation:
(([0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}
Breakdown:
1.`[0-9](?!\d)` -> Any number 0 through 9 (the `(?!\d)` makes sure it only grabs standalone digits)
2.`|[1-9][0-9](?!\d)` -> Or any number 10-99 (the `(?!\d)` makes sure it only grabs two-digit entries)
3.`|1[0-9]{2}(?!\d)` -> Or any number 100-199
4.`|2[0-4][0-9](?!\d)` -> Or any number 200-249
5.`|25[0-5](?!\d)` -> Or any number 250-255
6.`[.]?` -> With or without a `.`
7.`{4}` -> Items 1-6 exactly 4 times
This hasn't failed me yet for IP address validation.
For your specific case, this should do it:
(169\.254\.)((([0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}|2[0-4][0-9]|25[0-4])[.])(([0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}|2[0-4][0-9]|25[0-5])))
This is very long because I couldn't find a shorter way to accept 169.254.(0-254).255 while still rejecting 169.254.255.1.
Edit: Fixed due to comments
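A shorter anchored variant of the same idea can be checked quickly in Python. This is only a sketch: the names `OCTET_0_254`, `OCTET_0_255`, `LINK_LOCAL` and `in_range` are made up for illustration, and `re.fullmatch` is used in place of the `(?!\d)` lookaheads to rule out partial matches.

```python
import re

# Third octet: 0-254, fourth octet: 0-255 (no leading zeros accepted).
OCTET_0_254 = r"(?:25[0-4]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
OCTET_0_255 = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
LINK_LOCAL = re.compile(r"169\.254\." + OCTET_0_254 + r"\." + OCTET_0_255)

def in_range(address: str) -> bool:
    """Return True if address lies in 169.254.0.0-169.254.254.255."""
    return LINK_LOCAL.fullmatch(address) is not None

print(in_range("169.254.0.0"))      # True
print(in_range("169.254.254.255"))  # True
print(in_range("169.254.255.1"))    # False
```

Putting the longer alternatives first and matching the whole string means no lookahead is needed to stop `25` from being read as `2` followed by `5`.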

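The regex ([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-4]) matches 0-254, provided it is anchored or followed by a delimiter; unanchored, the first alternative can match just the leading digit of a longer number.
See this page for more discussion.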
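The 0-254 claim is easy to verify by brute force. A minimal sketch, assuming Python's `re.fullmatch` as the anchoring mechanism; the names `OCTET_0_254` and `accepted` are illustrative only.

```python
import re

OCTET_0_254 = re.compile(r"([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-4])")

# Collect every integer 0..999 whose decimal form the pattern matches in full.
accepted = [n for n in range(1000) if OCTET_0_254.fullmatch(str(n))]

print(accepted == list(range(255)))  # True: exactly 0 through 254
```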

I've written an article that provides regular expressions for all the components of a generic URI (as defined in RFC3986: Uniform Resource Identifier (URI): Generic Syntax)
See: Regular Expression URI Validation
One of the components of a generic URI is an IPv4 address. Here is the free-spacing mode Python version from that article:
import re

re_python_rfc3986_IPv4address = re.compile(r""" ^
    # RFC-3986 URI component: IPv4address
    (?: (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) \.){3}  # (dec-octet "."){3}
        (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)          # dec-octet
    $ """, re.VERBOSE)
And the un-commented JavaScript version:
var re_js_rfc3986_IPv4address = /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/;
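As a quick usage sketch, the same dec-octet pattern can be run against a few sample strings in Python. The names `DEC_OCTET`, `IPV4` and `results` are illustrative and not from the article. One detail worth noticing is that the `[01]?[0-9][0-9]?` alternative accepts leading zeros.

```python
import re

# Compact copy of the dec-octet pattern quoted above.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
IPV4 = re.compile(r"^(?:" + DEC_OCTET + r"\.){3}" + DEC_OCTET + r"$")

candidates = ["169.254.0.1", "255.255.255.255", "256.1.1.1", "1.2.3", "010.001.000.009"]
results = {c: bool(IPV4.match(c)) for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```

Only `256.1.1.1` and `1.2.3` are rejected; the zero-padded address passes, which may or may not be what a given application wants.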

Related

Advanced grouping in domain name regex with Python3

I have a program written in python3 that should parse several domain names every day and extrapolate data.
Parsed data should serve as input for a search function, for aggregation (statistics and charts) and to save some time to the analyst that uses the program.
Just so you know: I don't really have the time to study machine learning (which seems to be a pretty good solution here), so I chose to start with regex, which I already use.
I have already searched the regex documentation inside and outside StackOverflow and worked with the debugger on regex101, and I still haven't found a way to do what I need.
Edit (24/6/2019): I mention machine learning because of the reason I need a complex parser, which is to automate things as much as possible. It would be useful for making automatic choices like blacklisting, whitelisting, etc.
The parser should consider a few things:
a maximum number of 126 subdomains plus the TLD
each subdomain must not be longer than 64 characters
each subdomain can contain only alphanumeric characters and the - character
each subdomain must not begin or end with the - character
the TLD must not be longer than 64 characters
the TLD must not contain only digits
but I want to go a little deeper:
the first string can (optionally) contain a "usage type" like cpanel., mail., webdisk., autodiscover. and so on (or maybe a simple www.)
the TLD can (optionally) contain a particle like .co, .gov, .edu and so on (.co.uk for example)
the final part of the TLD is not really checked against any list of ccTLD/gTLDs right now and I don't think it will be in the future
What I thought useful to solve the problem is a regex group for the optional usage type, one for each subdomain and one for the TLD (the optional particle must be inside the TLD group)
With these rules in mind I came up with a solution:
^(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?\.)?([a-z\d][a-z\d\-]{0,62}[a-z\d])?((\.[a-z\d][a-z\d\-]{0,62}[a-z\d]){0,124}?)(?P<TLD>(\.co|\.com|\.edu|\.net|\.org|\.gov)?\.(?!\d+)[a-z\d]{1,64})$
The above solution doesn't return the expected results
I report here a couple of examples:
A couple of strings to parse:
without.further.ado.lets.travel.the.forest.com
www.without.further.ado.lets.travel.the.forest.gov.it
The groups I expect to find:
FullMatch   without.further.ado.lets.travel.the.forest.com
group2      without
group3      further
group4      ado
group5      lets
group6      travel
group7      the
group8      forest
groupTLD    .com
FullMatch   www.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGE  www.
group2      without
group3      further
group4      ado
group5      lets
group6      travel
group7      the
group8      forest
groupTLD    .gov.it
The groups I find:
FullMatch   without.further.ado.lets.travel.the.forest.com
group2      without
group3      .further.ado.lets.travel.the.forest
group4      .forest
groupTLD    .com
FullMatch   www.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGE  www.
group2      without
group3      .further.ado.lets.travel.the.forest
group4      .forest
groupTLD    .gov.it
group6      .gov
As you can see from the examples, a couple of particles are found twice, and that is not the behavior I was looking for. Any attempt to edit the formula results in unexpected output.
Any idea about a way to find the expected results?
This is a simple, well-defined task. There is no fuzziness, no complexity, no guessing, just a series of easy tests to figure out everything on your checklist. I have no idea how "machine learning" would be appropriate, or helpful. Even regex is completely unnecessary.
I've not implemented everything you want to verify, but it's not hard to fill in the missing bits.
import string

double_tld = ['gov', 'edu', 'co', 'add_others_you_need']

# we'll use this instead of regex to check subdomain validity
valid_sd_characters = string.ascii_letters + string.digits + '-'
valid_trans = str.maketrans('', '', valid_sd_characters)

def is_invalid_sd(sd):
    # deleting every valid character leaves only the invalid ones
    return sd.translate(valid_trans) != ''

def check_hostname(hostname):
    subdomains = hostname.split('.')
    # each subdomain can contain only alphanumeric characters and
    # the - character
    invalid_parts = list(filter(is_invalid_sd, subdomains))
    # TODO react if there are any invalid parts
    # "the TLD can (optionally) contain a particle like
    # .co, .gov, .edu and so on (.co.uk for example)"
    if len(subdomains) >= 2 and subdomains[-2] in double_tld:
        subdomains[-2] += '.' + subdomains[-1]
        subdomains = subdomains[:-1]
    # "a maximum number of 126 subdomains plus the TLD"
    # TODO check list length of subdomains
    # "each subdomain must not begin or end with the - character"
    # "the TLD must not be longer than 64 characters"
    # "the TLD must not contain only digits"
    # TODO write loop, check first and last characters, length, isnumeric
    # TODO return something
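One possible way to fill in the TODOs is sketched below. It is not the answerer's code: the names `parse_hostname`, `label_is_valid`, `DOUBLE_TLD`, `MAX_LABELS` and `MAX_LABEL_LENGTH` are hypothetical, the limits are taken from the question's checklist, and returning `None` on any violation is just one choice of error handling.

```python
import string

# Second-level labels folded into the TLD (illustrative, not exhaustive).
DOUBLE_TLD = {"gov", "edu", "co"}
VALID_CHARS = set(string.ascii_letters + string.digits + "-")
MAX_LABELS = 127       # 126 subdomains plus the TLD, per the question
MAX_LABEL_LENGTH = 64  # per the question

def label_is_valid(label: str) -> bool:
    """Check one dot-separated label against the question's rules."""
    return (
        0 < len(label) <= MAX_LABEL_LENGTH
        and set(label) <= VALID_CHARS
        and not label.startswith("-")
        and not label.endswith("-")
    )

def parse_hostname(hostname: str):
    """Return (subdomains, tld), or None if any rule is violated."""
    labels = hostname.split(".")
    if not 2 <= len(labels) <= MAX_LABELS:
        return None
    if not all(label_is_valid(label) for label in labels):
        return None
    if labels[-1].isdigit():  # the TLD must not contain only digits
        return None
    # Fold a particle such as "co" or "gov" into the TLD: "gov.it", "co.uk".
    cut = -2 if len(labels) > 2 and labels[-2] in DOUBLE_TLD else -1
    return labels[:cut], ".".join(labels[cut:])

print(parse_hostname("www.without.further.ado.lets.travel.the.forest.gov.it"))
```

Detecting the optional usage type (www, mail, cpanel and so on) would be one more membership test on the first label.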
I don't know whether it is possible to get the output exactly as you asked. I don't think a single pattern can put each repetition into its own numbered group (group2, group3, ...).
I found one way to get almost the result you expect using the regex module.
import regex
match = regex.search(r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?)\.)?(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.){0,124}?(?P<TLD>(?:co|com|edu|net|org|gov)?\.(?!\d+)[a-z\d]{1,64})$', 'www.without.further.ado.lets.travel.the.forest.gov.it')
Output:
match.captures(0)
['www.without.further.ado.lets.travel.the.forest.gov.it']
match.captures(1) or match.captures('USAGE')
['www']
match.captures(2)
['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest']
match.captures(3) or match.captures('TLD')
['gov.it']
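To keep the . out of the captured groups, I put it inside a non-capturing group like this:
(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.)
Note that, as written, the TLD group only matches when one of the listed particles precedes the final label (gov.it here), so a host ending in a plain .com needs the pattern adjusted.
Hope it helps.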
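For readers limited to the standard library, here is a rough sketch of why the stdlib `re` cannot return one group per subdomain, plus a workaround that matches the usage type and TLD by regex and splits the middle on dots. The pattern, the shortened usage list and the group names `usage`, `middle` and `tld` are assumptions for illustration; per-label validation is deliberately left out.

```python
import re

host = "www.without.further.ado.lets.travel.the.forest.gov.it"

# A repeated capturing group in the stdlib `re` keeps only its last repetition.
repeated = re.match(r"(?:([a-z\d-]+)\.)+", host)
print(repeated.group(1))  # gov

# Workaround: capture the middle as one lazy chunk, then split it on dots.
pattern = re.compile(
    r"^(?:(?P<usage>www\d?|mail|cpanel|webdisk|autodiscover)\.)?"
    r"(?P<middle>.+?)"
    r"\.(?P<tld>(?:(?:co|com|edu|net|org|gov)\.)?[a-z\d]*[a-z][a-z\d]*)$"
)
m = pattern.match(host)
subdomains = m.group("middle").split(".")
print(m.group("usage"), subdomains, m.group("tld"))
```

The `[a-z\d]*[a-z][a-z\d]*` tail requires at least one letter, which covers the "TLD must not contain only digits" rule without a lookahead.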

Regular Exp match anything but not specific string

I am handling user input in my program using a regular expression.
The string must contain /_MyWord/, and only a-z is accepted before /_MyWord/.
The string must not contain /s/123, /s/32A or atr/will at the beginning.
My try:
^(?!.*/s/123)(?!.*/s/32A )(?!.*atr/will)([/a-z]+)/_MyWord/(.*)$
Example:
/s/123/QWERERTYU/_MyWord/45454545 -> fail
/DFGH/FGHJK/GHJK/_MyWord/DFGHJ452 -> OK
HiCanYouHelpMe/_MyWord/fgh -> OK
/_MyWord/HiCanYouHelpMefgh -> OK
Can anyone help me finish the regular expression?
If I got your question correctly, try this regex:
^(?!.*\/s\/123)(?!.*\/s\/32A)(?!.*atr\/will)([\/a-zA-Z]*)\/_MyWord\/(.*)$
Unescaped: ^(?!.*/s/123)(?!.*/s/32A)(?!.*atr/will)([/a-zA-Z]*)/_MyWord/(.*)$
Changed ([\/a-z]+) to ([\/a-zA-Z]*) to include both lower and upper case and to allow an empty prefix (e.g. /_MyWord/Test)
Regex101 Demo
Works for
/DFGH/FGHJK/GHJK/_MyWord/DFGHJ452
HiCanYouHelpMe/_MyWord/fgh
/_MyWord/HiCanYouHelpMefgh
Doesn't match:
/s/123/QWERERTYU/_MyWord/45454545
atr/will/DFGH/FGHJK/GHJK/_MyWord/DFGHJ452
Also, if those strings only need to be rejected before /_MyWord/, you don't really need the lookaheads for /s/123 and /s/32A: they contain digits, so the [\/a-zA-Z] character class already rejects them. In that case you can remove (?!.*\/s\/123)(?!.*\/s\/32A) from the beginning.
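The pattern can be tried against the sample strings with a few lines of Python. This is a sketch for experimentation only; `FULL`, `SHORT`, `samples` and `results` are illustrative names, and `SHORT` is the variant with the two digit-related lookaheads removed.

```python
import re

FULL = re.compile(r"^(?!.*/s/123)(?!.*/s/32A)(?!.*atr/will)([/a-zA-Z]*)/_MyWord/(.*)$")
# Same pattern without the /s/123 and /s/32A lookaheads.
SHORT = re.compile(r"^(?!.*atr/will)([/a-zA-Z]*)/_MyWord/(.*)$")

samples = [
    "/s/123/QWERERTYU/_MyWord/45454545",
    "/DFGH/FGHJK/GHJK/_MyWord/DFGHJ452",
    "HiCanYouHelpMe/_MyWord/fgh",
    "/_MyWord/HiCanYouHelpMefgh",
    "atr/will/DFGH/FGHJK/GHJK/_MyWord/DFGHJ452",
]
results = {s: FULL.match(s) is not None for s in samples}
for s, ok in results.items():
    print("OK  " if ok else "fail", s)

# Both variants agree on every sample above.
print(all((SHORT.match(s) is not None) == ok for s, ok in results.items()))
```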

sendmail R command regular expression adding exclusions to Kcheckaddress regex -a#MATCH

I'm trying to get some exclusions into our sendmail regex for the R command. The following configuration & regex works:
LOCAL_CONFIG
#
Kcheckaddress regex -a#MATCH
[a-zA-Z_0-9.-]+<#[a-zA-Z_0-9-]+?\.+[a-zA-Z_0-9.-]+?\.(us|info|to|br|bid|cn|ru)
LOCAL_RULESETS
SLocal_check_mail
# check address against various regex checks
R$* $: $>Parse0 $>3 $1
R$+ $: $(checkaddress $1 $)
R#MATCH $#error $: "553 Your Domain is Blocked for Unsolicited Mail"
So we are blocking anything#subdomain.domain.us but not anything#domain.us. I'd like to add exclusions for cities and schools so to allow user#ci.somedomain.us and user#subdomain.[state].us. (note that [state] means 1 of the 50 states including DC).
This regex is not working (using CA for California as a test):
(?!.*\#ci\..+?\.us$)(?!.*\#*\..+?\.ca.us$)([a-zA-Z_0-9.-]+#[a-zA-Z_0-9-]+?\.+[a-zA-Z_0-9.-]+?\.(us)$)
I get this error:
sendmail -bt
/etc/mail/sendmail.cf: line 199: pattern-compile-error: Invalid preceding regular expression
What surprises me is that the working regex requires the leading spaces. I'm also not sure what the +<# part of the regex does. What is the less-than sign (<) doing there? Does it need to be added to the bigger regex?
edit: I'm pretty sure that sendmail's R & K commands do not support negative look-aheads. So if anyone can help re-write the regex in a sed-friendly format I'd be grateful!
Your criteria are not clear: you say to block all subdomains, but then you allow them too?
Unless you are using the user name specifically, don't match it.
Block
sub.domain.us
Allow
sub.sub.domain.us
or domain.us
Ksubsubdomains regex -a#MATCH #([a-zA-Z_0-9-]+\.){2}us
Ssubsub
R$+ $: $(subsubdomains $1 $)
R#MATCH $#error $: "553 No Thank You."
# sendmail -bt
Enter <ruleset> <address>
> subsub a#sub.sub.us
subsub input: a # sub . sub . us
subsub returns: $# error $: "553 No Thank You."
> subsub a#sub.sub.sub.us
subsub input: a # sub . sub . sub . us
subsub returns: a # sub . sub . sub . us
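The pattern itself can also be tried outside sendmail. The sketch below uses Python's `re`, which is not the regex library sendmail links against, so treat it only as a rough approximation for this simple pattern; `SUBSUB` and `verdicts` are illustrative names, and `#` is the separator exactly as written in this thread.

```python
import re

# Same pattern as the Ksubsubdomains map above.
SUBSUB = re.compile(r"#([a-zA-Z_0-9-]+\.){2}us")

verdicts = {}
for address in ["a#sub.sub.us", "a#sub.sub.sub.us", "a#domain.us"]:
    verdicts[address] = "blocked" if SUBSUB.search(address) else "allowed"
    print(address, verdicts[address])
```

Only the address with exactly two labels before `us` is blocked, which agrees with the `sendmail -bt` session above.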
Since states have two-letter abbreviations, block subdomains of 3 or more characters:
Ksubstates regex -a#MATCH #[a-zA-Z_0-9-]+\.([a-zA-Z_0-9]){3,}\.us
I ended up taking a different approach as suggested on the SpamAssassin mailing list. I used sendmail's access.db. Since the locality namespaces I want to white list are all fourth-level domain registrations of the form "<organization-name>.<locality>.<state>.us" I simply created 50 entries for all the states like below, starting with rejecting anything.us:
From:us REJECT
From:ma.us OK
From:mi.us OK
I haven't seen ANY false negatives, i.e., missed spam, since enabling this for a few days now.

Regular Expression to exclude emailids with special characters

I have a sample set of emailids below
EmailAddress
abc#in.in
#abc#in.in
abc#in.in&
a#b#c#in.in
a!bc#in.in
a$bc#in.in
a+bc#in.+in
ab-c-#in.in
ab/c\#in.in
ab\c#in.in
ab~~~~c#in.in
una02#gmail.com
I have to separate out the invalid email IDs, i.e. those containing special characters other than - _ # .
I wrote the regex below and it's working fine. Please point out if I missed any possible scenario or if this regex can be improved. Thanks in advance.
[^\$\+\\/~#!&]*
Clean List
abc#in.in
ab-c-#in.in
una02#gmail.com
Invalid List
#abc#in.in
abc#in.in&
a#b#c#in.in
a!bc#in.in
a$bc#in.in
a+bc#in.+in
ab/c\#in.in
ab\c#in.in
ab~~~~c#in.in
You eliminated the addresses with the # in the local part.
I think that according to RFC 5322 it is a valid character.
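For comparison, a whitelist is often easier to reason about than a list of forbidden characters. The sketch below reproduces the clean/invalid split from the question, under the assumption that `#` is the single separator exactly as in the sample data and that only letters, digits, `-`, `_` and `.` are allowed around it. `VALID`, `clean` and `invalid` are illustrative names, and this is not a full RFC 5322 validator.

```python
import re

# Whitelist: allowed characters on both sides of exactly one '#' separator.
VALID = re.compile(r"[A-Za-z0-9._-]+#[A-Za-z0-9._-]+")

addresses = [
    "abc#in.in", "#abc#in.in", "abc#in.in&", "a#b#c#in.in", "a!bc#in.in",
    "a$bc#in.in", "a+bc#in.+in", "ab-c-#in.in", "ab/c\\#in.in",
    "ab\\c#in.in", "ab~~~~c#in.in", "una02#gmail.com",
]
clean = [a for a in addresses if VALID.fullmatch(a)]
invalid = [a for a in addresses if not VALID.fullmatch(a)]

print(clean)  # ['abc#in.in', 'ab-c-#in.in', 'una02#gmail.com']
print(len(invalid))  # 9
```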
