Modsecurity: removeWhitespace not working - mod-security

I have the following rule:
SecRule REQUEST_HEADERS:Client-IP "@ipMatchFromFile test.txt" \
"id:210487,t:none,t:urlDecodeUni,t:removeWhitespace,drop,msg:'IP-test'"
But when I run it, I get this response:
T (0) urlDecodeUni: "111.22.33.44 " // note the space before the "
T (0) removeWhitespace: "111.22.33.44" // perfect! The space has been removed
Transformation completed in 4 usec.
Executing operator "ipMatchFromFile" with param "test.txt" against REQUEST_HEADERS:Client-IP.
Target value: "111.22.33.44" // target value has no space, hooray!
IPmatchFromFile: Total tree entries: 8, ipv4 8 ipv6 0
IPmatch: bad IPv4 specification "111.22.33.44 ". // why, oh why, is the space back!
Operator completed in 4 usec.
Operator error: IPmatch: bad IPv4 specification "111.22.33.44 ". // that space again!
Rule returned -1.
Rule processing failed.
Rule failed, not chained -> mode NEXT_RULE.
Please, Stack Overflow legends, show me how I can fix it :-)

That should work, so it looks like a bug. I can't say I've ever tried to match an IP address that required transformation first.
Since it's not really an IP address, you could switch to using @pmFromFile rather than @ipMatchFromFile. Note that the documentation explicitly warns that you need to use boundaries correctly here:
Because this operator does not check for boundaries when matching,
false positives are possible in some cases. For example, if you want
to use @pm for IP address matching, the phrase 1.2.3.4 will
potentially match more than one IP address (e.g., it will also match
1.2.3.40 or 1.2.3.41). To avoid the false positives, you can use your
own boundaries in phrases. For example, use /1.2.3.4/ instead of just
1.2.3.4. Then, in your rules, also add the boundaries where
appropriate. A complete example follows:
# Prepare custom REMOTE_ADDR variable
SecAction "phase:1,id:168,nolog,pass,setvar:tx.REMOTE_ADDR=/%{REMOTE_ADDR}/"
# Check if REMOTE_ADDR is blacklisted
SecRule TX:REMOTE_ADDR "@pmFromFile blacklist.txt" "phase:1,id:169,deny,msg:'Blacklisted IP address'"
The file blacklist.txt may contain:
# ip-blacklist.txt contents:
# NOTE: All IPs must be prefixed/suffixed with "/" as the rules
# will add in this character as a boundary to ensure
# the entire IP is matched.
# SecAction "phase:1,id:170,pass,nolog,setvar:tx.remote_addr='/%{REMOTE_ADDR}/'"
/1.2.3.4/
/5.6.7.8/
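To see why the boundaries matter, here is a minimal Python sketch of plain phrase matching. It only imitates the substring behaviour described in the quoted documentation; it is not how ModSecurity is implemented, and the helper name contains_phrase is made up for this illustration.

```python
def contains_phrase(value, phrases):
    """Plain substring search with no boundary checks (imitates phrase matching)."""
    return any(phrase in value for phrase in phrases)


blacklist_plain = ["1.2.3.4"]
blacklist_bounded = ["/1.2.3.4/"]

# Without boundaries, 1.2.3.40 is wrongly reported as blacklisted.
print(contains_phrase("1.2.3.40", blacklist_plain))      # True (false positive)

# With "/" around both the phrase and the value, only the exact IP matches.
print(contains_phrase("/1.2.3.40/", blacklist_bounded))  # False
print(contains_phrase("/1.2.3.4/", blacklist_bounded))   # True
```

The same reasoning is why the rule above wraps REMOTE_ADDR in "/" before comparing it with the file contents.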

Advanced grouping in domain name regex with Python3

I have a program written in Python 3 that should parse several domain names every day and extract data from them.
The parsed data should serve as input for a search function and for aggregation (statistics and charts), and should save some time for the analyst who uses the program.
Just so you know: I don't really have the time to study machine learning (which seems to be a pretty good solution here), so I chose to start with regex, which I already use.
I have already searched the regex documentation inside and outside Stack Overflow and worked with the debugger on regex101, and I still haven't found a way to do what I need.
Edit (24/6/2019): I mention machine learning because the reason I need a complex parser is to automate things as much as possible. It would be useful for making automatic choices like blacklisting, whitelisting, etc.
The parser should consider a few things:
a maximum number of 126 subdomains plus the TLD
each subdomain must not be longer than 64 characters
each subdomain can contain only alphanumeric characters and the - character
each subdomain must not begin or end with the - character
the TLD must not be longer than 64 characters
the TLD must not contain only digits
but I want to go a little deeper:
the first string can (optionally) contain a "usage type" like cpanel., mail., webdisk., autodiscover. and so on (or maybe a simple www.)
the TLD can (optionally) contain a particle like .co, .gov, .edu and so on (.co.uk for example)
the final part of the TLD is not really checked against any list of ccTLD/gTLDs right now and I don't think it will be in the future
What I thought would be useful to solve the problem is one regex group for the optional usage type, one for each subdomain, and one for the TLD (the optional particle must be inside the TLD group).
With these rules in mind I came up with a solution:
^(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?\.)?([a-z\d][a-z\d\-]{0,62}[a-z\d])?((\.[a-z\d][a-z\d\-]{0,62}[a-z\d]){0,124}?(?P<TLD>(\.co|\.com|\.edu|\.net|\.org|\.gov)?\.(?!\d+)[a-z\d]{1,64})$
The above solution doesn't return the expected results.
Here are a couple of examples.
Strings to parse:
without.further.ado.lets.travel.the.forest.com
www.without.further.ado.lets.travel.the.forest.gov.it
The groups I expect to find:
Full match: without.further.ado.lets.travel.the.forest.com
group2: without
group3: further
group4: ado
group5: lets
group6: travel
group7: the
group8: forest
groupTLD: .com
Full match: www.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGE: www.
group2: without
group3: further
group4: ado
group5: lets
group6: travel
group7: the
group8: forest
groupTLD: .gov.it
The groups I find:
Full match: without.further.ado.lets.travel.the.forest.com
group2: without
group3: .further.ado.lets.travel.the.forest
group4: .forest
groupTLD: .com
Full match: www.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGE: www.
group2: without
group3: .further.ado.lets.travel.the.forest
group4: .forest
groupTLD: .gov.it
group6: .gov
As you can see from the examples, a couple of particles are found twice, and that is not the behavior I was looking for. Any attempt to edit the pattern results in unexpected output.
Any idea about a way to find the expected results?
This is a simple, well-defined task. There is no fuzziness, no complexity, no guessing, just a series of easy tests to figure out everything on your checklist. I have no idea how "machine learning" would be appropriate, or helpful. Even regex is completely unnecessary.
I've not implemented everything you want to verify, but it's not hard to fill in the missing bits.
import string
double_tld = ['gov', 'edu', 'co', 'add_others_you_need']
# we'll use this instead of regex to check subdomain validity
valid_sd_characters = string.ascii_letters + string.digits + '-'
valid_trans = str.maketrans('', '', valid_sd_characters)
def is_invalid_sd(sd):
    return sd.translate(valid_trans) != ''
def check_hostname(hostname):
    subdomains = hostname.split('.')
    # each subdomain can contain only alphanumeric characters and
    # the - character
    invalid_parts = list(filter(is_invalid_sd, subdomains))
    # TODO react if there are any invalid parts
    # "the TLD can (optionally) contain a particle like
    # .co, .gov, .edu and so on (.co.uk for example)"
    if subdomains[-2] in double_tld:
        subdomains[-2] += '.' + subdomains[-1]
        subdomains = subdomains[:-1]
    # "a maximum number of 126 subdomains plus the TLD"
    # TODO check list length of subdomains
    # "each subdomain must not begin or end with the - character"
    # "the TLD must not be longer than 64 characters"
    # "the TLD must not contain only digits"
    # TODO write loop, check first and last characters, length, isnumeric
    # TODO return something
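One possible way to fill in those TODOs is sketched below. It is only an illustration of the split-and-check approach, not the answerer's finished code: the function name parse_hostname, the USAGE_TYPES set and the exact contents of DOUBLE_TLD are assumptions (the particles are copied from the pattern in the question), and the TLD is returned without a leading dot.

```python
import string

# Assumed list of second-level particles, copied from the question's pattern.
DOUBLE_TLD = {"co", "com", "edu", "net", "org", "gov"}
# Assumed (incomplete) list of usage types, for illustration only.
USAGE_TYPES = {"autodiscover", "cpanel", "ftp", "mail", "webdisk", "webmail", "www"}
VALID_CHARS = set(string.ascii_lowercase + string.digits + "-")


def parse_hostname(hostname):
    """Return (usage, subdomains, tld); raise ValueError on an invalid name."""
    labels = hostname.lower().split(".")
    if len(labels) < 2:
        raise ValueError("expected at least one subdomain and a TLD")

    # Fold an optional particle such as "gov" in "gov.it" into the TLD.
    if len(labels) >= 3 and labels[-2] in DOUBLE_TLD:
        tld_parts, labels = labels[-2:], labels[:-2]
    else:
        tld_parts, labels = labels[-1:], labels[:-1]

    if len(labels) > 126:
        raise ValueError("more than 126 subdomains")

    for part in labels + tld_parts:
        if not 1 <= len(part) <= 64:
            raise ValueError(f"bad length: {part!r}")
        if not set(part) <= VALID_CHARS:
            raise ValueError(f"invalid character in {part!r}")
        if part.startswith("-") or part.endswith("-"):
            raise ValueError(f"leading or trailing hyphen in {part!r}")

    if tld_parts[-1].isdigit():
        raise ValueError("TLD must not contain only digits")

    # Peel off an optional usage type, but never the only remaining label.
    usage = labels.pop(0) if len(labels) > 1 and labels[0] in USAGE_TYPES else None
    return usage, labels, ".".join(tld_parts)


print(parse_hostname("www.without.further.ado.lets.travel.the.forest.gov.it"))
# ('www', ['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest'], 'gov.it')
```

Because the subdomains come back as an ordinary list, there is no need for one regex group per label.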
I don't know if it is possible to get the output exactly as you asked. I don't think a single pattern can put each repeated match into a different group (group2, group3, ...).
I found one way to get almost the result you expect, using the third-party regex module.
import regex  # third-party package (pip install regex), not the standard-library re
match = regex.search(r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?)\.)?(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.){0,124}?(?P<TLD>(?:co|com|edu|net|org|gov)?\.(?!\d+)[a-z\d]{1,64})$', 'www.without.further.ado.lets.travel.the.forest.gov.it')
Output:
match.captures(0)
['www.without.further.ado.lets.travel.the.forest.gov.it']
match.captures(1) or match.captures('USAGE')
['www']
match.captures(2)
['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest']
match.captures(3) or match.captures('TLD')
['gov.it']
Here, to avoid capturing the . in the groups, I put it inside a non-capturing group, like this:
(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.)
Hope it helps.

Exclude a line containing a given string from the capturing group

My logs contain the following:
2018-10-31 14:14:39; dcv0000088; 192.168.48.200;
Variable Bindings
vmwVpxdNewStatus:= Green
vmwVpxdObjValue:= alarm.FanHealthAlarm - Event: Hardware Health Changed (3131155);
--ENDOFTRAP--
2018-10-31 10:41:49; sb02; 192.168.41.252;
Variable Bindings
sysUpTime:= 2 days 20 hours 18 minutes 24.23 seconds (24590423)
snmpTrapOID:= FSC-RTP-MIB:iandcAdmin.55.1.3.4.5 (1.3.6.1.4.1.4329.2.55.1.3.4.5)
iandcAdmin.55.1.1.3.0:= SIP Server not running
iandcAdmin.55.1.1.7.0:= SIP Server;
--ENDOFTRAP--
I would like to capture all the text after Variable Bindings and before the ;, but exclude the line containing sysUpTime from the capture.
I use this regex:
Variable\sBindings\s+(?P<varBind>[^;]+(?!sysUpTime\:=.*))
but it is still not working. The expected result is:
varBind=
vmwVpxdNewStatus:= Green
vmwVpxdObjValue:= alarm.FanHealthAlarm - Event: Hardware Health Changed (3131155)
varBind=
snmpTrapOID:= FSC-RTP-MIB:iandcAdmin.55.1.3.4.5 (1.3.6.1.4.1.4329.2.55.1.3.4.5)
iandcAdmin.55.1.1.3.0:= SIP Server not running
iandcAdmin.55.1.1.7.0:= SIP Server
Please advise. Thank you.
You can make an optional (non-capturing) group that will match the sysUpTime line if it's there, ensuring that it won't be included in the subsequent varBind group:
Variable\sBindings\s+(?:sysUpTime.+\s+)?(?P<varBind>[^;]+)
                     ^^^^^^^^^^^^^^^^^^^
https://regex101.com/r/n5zPcr/2
If sysUpTime can appear somewhere other than the first line after Variable Bindings, then note that any group (or full match) must contain contiguous characters from the input - leaving out part of them is not possible without some other method, such as capturing the initial substring, matching the sysUpTime line, and then capturing the later substring.
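For the case where sysUpTime is not necessarily the first line, one possible "other method" is to capture the whole block and drop the unwanted line afterwards. The following is a minimal Python sketch under that assumption; the function name var_binds is made up here, and the sample text is the log from the question.

```python
import re

LOG = """2018-10-31 14:14:39; dcv0000088; 192.168.48.200;
Variable Bindings
vmwVpxdNewStatus:= Green
vmwVpxdObjValue:= alarm.FanHealthAlarm - Event: Hardware Health Changed (3131155);
--ENDOFTRAP--
2018-10-31 10:41:49; sb02; 192.168.41.252;
Variable Bindings
sysUpTime:= 2 days 20 hours 18 minutes 24.23 seconds (24590423)
snmpTrapOID:= FSC-RTP-MIB:iandcAdmin.55.1.3.4.5 (1.3.6.1.4.1.4329.2.55.1.3.4.5)
iandcAdmin.55.1.1.3.0:= SIP Server not running
iandcAdmin.55.1.1.7.0:= SIP Server;
--ENDOFTRAP--
"""

# Capture everything between "Variable Bindings" and the next ";".
BLOCK = re.compile(r"Variable\sBindings\s+(?P<varBind>[^;]+)")


def var_binds(text):
    """Return one string per trap, with any sysUpTime line filtered out."""
    results = []
    for match in BLOCK.finditer(text):
        lines = match.group("varBind").splitlines()
        kept = [ln for ln in lines if not ln.lstrip().startswith("sysUpTime:=")]
        results.append("\n".join(kept))
    return results


for block in var_binds(LOG):
    print("varBind=")
    print(block)
```

The regex still captures a contiguous span; the exclusion happens in ordinary code after the match, so it works wherever the sysUpTime line appears inside the block.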

Regex number range target [duplicate]

I am trying to have my regex match the following:
169.254.0.0-169.254.254.255
Could anyone please help me achieve this?
So far I have this:
169\.254\.([1-9]{1,2}|[1-9]{1,2}[1-4])
but it would also pick up 169.254.255.1, which should not be one of the matches.
Please help!
thanks
This is the regex I use for general IP validation:
(([0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}
Breakdown:
1. `[0-9](?!\d)` -> Any number 0 through 9 (the `(?!\d)` makes sure it only grabs stand-alone digits)
2. `|[1-9][0-9](?!\d)` -> Or any number 10-99 (the `(?!\d)` makes sure it only grabs double-digit entries)
3. `|1[0-9]{2}` -> Or any number 100-199
4. `|2[0-4][0-9]` -> Or any number 200-249
5. `|25[0-5]` -> Or any number 250-255
6. `[.]?` -> With or without a `.`
7. `{4}` -> Items 1-6 exactly 4 times
This hasn't failed me yet for IP address validation.
For your specific case, this should do it:
(169\.254\.)((([0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}|2[0-4][0-9]|25[0-4])[.])(([0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}|2[0-4][0-9]|25[0-5])))
This is very long because I couldn't find a shorter way to accept 169.254.(0-254).255 while still rejecting 169.254.255.1.
Edit: Fixed due to comments
The regex ([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-4]) matches 0-254.
See this page for more discussion.
I've written an article that provides regular expressions for all the components of a generic URI (as defined in RFC3986: Uniform Resource Identifier (URI): Generic Syntax)
See: Regular Expression URI Validation
One of the components of a generic URI is an IPv4 address. Here is the free-spacing mode Python version from that article:
import re
re_python_rfc3986_IPv4address = re.compile(r""" ^
    # RFC-3986 URI component: IPv4address
    (?: (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) \.){3}  # (dec-octet "."){3}
        (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)         # dec-octet
    $ """, re.VERBOSE)
And the un-commented JavaScript version:
var re_js_rfc3986_IPv4address = /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/;
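For the specific range in the question (169.254.0.0 to 169.254.254.255), here is a hedged Python sketch that builds a range-limited pattern from two octet sub-patterns and cross-checks it against a numeric comparison using the standard ipaddress module. The names OCTET_0_254, OCTET_0_255, in_range_regex and in_range_numeric are made up for this example, and the octet patterns deliberately reject leading zeros.

```python
import ipaddress
import re

# Octet sub-patterns: the third octet stops at 254, the fourth goes up to 255.
OCTET_0_254 = r"(?:25[0-4]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
OCTET_0_255 = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
RANGE_RE = re.compile(r"169\.254\." + OCTET_0_254 + r"\." + OCTET_0_255)

LOW = ipaddress.IPv4Address("169.254.0.0")
HIGH = ipaddress.IPv4Address("169.254.254.255")


def in_range_regex(text):
    """True if the whole string is an address inside the range (regex check)."""
    return RANGE_RE.fullmatch(text) is not None


def in_range_numeric(text):
    """Same check without regex: parse the address and compare numerically."""
    try:
        return LOW <= ipaddress.IPv4Address(text) <= HIGH
    except ipaddress.AddressValueError:
        return False


for candidate in ["169.254.0.0", "169.254.254.255", "169.254.255.1", "169.253.1.1"]:
    print(candidate, in_range_regex(candidate), in_range_numeric(candidate))
```

When the input is already known to be a single address string, the numeric comparison is usually easier to read and to adapt to other ranges than a hand-built octet pattern.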

How to parse csv output requiring multiple matches using one-liner?

I have a scenario where I have to post-process / filter values taken out of a DB. I'm using perl -ple for the task. All works well until I come across extracted output (CSV) which contains multiple text tags. See sample here. The same extraction regex works correctly if there is just one text tag. In my DB there are instances where there is more than one text tag (i.e. rule conditions).
The code is
echo "COPY (SELECT rule_data FROM custom_rule) TO STDOUT with CSV HEADER" | psql -U qradar -o /tmp/Rules.csv qradar;
perl -ple '
($enabled) = /(?<=enabled="").*?(?="")/g;
($group) = /(?<=group="").*?(?="")/g;
($name) = /(?<=<name>).*?(?=<\/name>)/g;
($text) = /(?<=<text>).*?(?=<\/text>)/g;
$_= "$enabled;$group;$name;$text";
s/<.*?>//g;
' Rules.csv > rules_revised.csv
Just running the code on the sample output, I get the following content in the rules_revised.csv file:
true;Flow Property Tests;DoS: Local Flood (Other);when the flow bias
is any of the following outbound
The line is truncated after outbound; it should in fact carry information similar to this:
when at least 3 flows are seen with the same Source IP,
Destination IP in 5 minutes and when the IP protocol is one of the
following IPSec, Uncommon and when the source packets is greater than
60000
I have tried to correct this by making the regex greedy (removing the ? in $text), but then it swallows all the in-between text up to the last text tag, and the final s/<.*?>//g then messes up the rest, because the match now includes all the tag characters (i.e. HTML elements) that I originally intended to exclude before making the regex greedy.
The reason you are getting a truncated result with multiple matches is that you only store the first one.
($text) = /(?<=<text>).*?(?=<\/text>)/g;
This only stores the first match. If you change that scalar to an array, you will capture all matches:
(@text) = /(?<=<text>).*?(?=<\/text>)/g;
When you interpolate the array, it will insert spaces (the value of $") between the elements. If you do not want that, you can change the value of $" to an acceptable delimiter. To be clear, you would change two characters to get the following lines:
(@text) = /(?<=<text>).*?(?=<\/text>)/g;
...
$_= "$enabled;$group;$name;@text";
If I run your code on your sample with these changes the output looks like this:
false;Flow Property Tests;DoS: Local Flood (Other);when the flow bias is any of the following outbound when at least 3 flows are seen with the same Source IP, Destination IP in 5 minutes when the IP protocol is one of the following IPSec, Uncommon when the source packets is greater than 60000
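The difference between keeping only the first match and keeping all of them is not specific to Perl. As an illustration in another language, here is a small Python sketch; the sample line is a shortened, hypothetical stand-in for one row of the CSV, not the real data.

```python
import re

# Hypothetical, shortened stand-in for one line of the exported CSV.
line = (
    "<name>DoS: Local Flood (Other)</name>"
    "<text>when the flow bias is any of the following outbound</text>"
    "<text>when the source packets is greater than 60000</text>"
)

TEXT_TAG = re.compile(r"<text>(.*?)</text>")

# Keeping a single result drops every condition after the first one.
first_only = TEXT_TAG.search(line).group(1)

# Collecting all matches keeps every condition; join them with a delimiter.
all_texts = TEXT_TAG.findall(line)
joined = " ".join(all_texts)

print(first_only)
print(joined)
```

Joining with " " mirrors the default separator used when the array is interpolated; any other delimiter could be passed to join instead.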
Have you tried using the s modifier? It makes the dot match newlines:
perl -ple '
($enabled) = /(?<=enabled="").*?(?="")/g;
($group) = /(?<=group="").*?(?="")/g;
($name) = /(?<=<name>).*?(?=<\/name>)/g;
($text) = /(?<=<text>).*?(?=<\/text>)/gs;
# here ________________________________^
$_= "$enabled;$group;$name;$text";
s/<.*?>//g;
' Rules.csv > rules_revised.csv
