I am using the Trellix DLP solution and have an IP address classification to block outgoing IP address information.
My regex is \b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b
However, this also blocks documents that have four-level numbered lists, like:
1.blah
1.1 blah blah
1.1.1 blah blah blah
1.1.1.1 blah blah blah blah (DLP thinks this is an IP address and blocks the document)
Is there any way to avoid this?
Regexes sometimes feel like magic, but unfortunately they are not. A regex cannot distinguish an IP address from a numbered footnote or article.
You can try to add some sort of intelligence, so to speak, to the regex, but you will always end up with false positives or false negatives. That intelligence comes from inspecting the characters before or after the match.
If you go this way, start with a regular expression that matches only valid IP addresses (your regex also matches 300.1.2.3, which is not valid).
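As a rough illustration of "valid addresses only", here is a minimal Python sketch. The names OCTET, VALID_IPV4 and find_ips are made up for this example, and Python's re module is only a stand-in: the DLP tool's regex engine may behave differently.

```python
import re

# One octet in the range 0-255; values such as 300 are rejected.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Four octets separated by literal dots, delimited by word boundaries.
VALID_IPV4 = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b")

def find_ips(text):
    """Return every substring of text that looks like a valid IPv4 address."""
    return VALID_IPV4.findall(text)

print(find_ips("host 192.168.1.10 is up, 300.1.2.3 is not an address"))
```

Note that a heading number such as 1.1.1.1 is still a syntactically valid address, so this step only narrows the problem; it does not solve it.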
Also decide which IP addresses you are trying to catch. If you only care about private IP addresses, a regex that matches only private IP addresses gives you fewer chances of a false positive.
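A sketch of a private-addresses-only pattern, assuming the usual RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The names PRIVATE_IPV4 and is_private_hit are hypothetical, and the pattern is written for Python's re module, so it may need adapting for your tool.

```python
import re

# One octet in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

PRIVATE_IPV4 = re.compile(
    r"\b(?:"
    rf"10(?:\.{OCTET}){{3}}"                                # 10.0.0.0/8
    rf"|172\.(?:1[6-9]|2[0-9]|3[01])(?:\.{OCTET}){{2}}"     # 172.16.0.0/12
    rf"|192\.168(?:\.{OCTET}){{2}}"                         # 192.168.0.0/16
    r")\b"
)

def is_private_hit(text):
    """True if text contains something that looks like a private IPv4 address."""
    return PRIVATE_IPV4.search(text) is not None

print(is_private_hit("1.1.1.1 blah blah blah blah"))  # a heading number is not in a private range
```

A four-level heading would only be flagged if its numbering happened to start with 10, 172 or 192, which is much rarer than 1.1.1.1.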
If you want to catch any IP address, then try to skip matches that have 4 or more spaces before them (or fewer than 4 spaces preceded by the beginning of the line). This is an attempt to skip numbered titles.
(?<!^\h)(?<!^\h\h)(?<!^\h\h\h)(?<!\h\h\h\h)\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b
Note: use the m (multiline) modifier. If you cannot specify flags, try writing the regex like this:
(?m)(?<!^\h)(?<!^\h\h)(?<!^\h\h\h)(?<!\h\h\h\h)\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b
Note: if your tool does not support \h, replace it with [\t\p{Zs}] or [ \t].
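To see how the lookbehinds behave, here is a small Python sketch. Python's re module has no \h, so [ \t] is used instead, as suggested in the note above; the names NOT_INDENTED, IP_LIKE, PATTERN and flagged are invented for this example.

```python
import re

# Python's re has no \h, so [ \t] stands in for "horizontal whitespace".
NOT_INDENTED = (
    r"(?<!^[ \t])"      # not exactly 1 blank right after a line start
    r"(?<!^[ \t]{2})"   # not exactly 2 blanks right after a line start
    r"(?<!^[ \t]{3})"   # not exactly 3 blanks right after a line start
    r"(?<![ \t]{4})"    # not preceded by 4 (or more) blanks anywhere
)
IP_LIKE = r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b"

# re.MULTILINE is the "m" modifier: ^ matches at the start of every line.
PATTERN = re.compile(NOT_INDENTED + IP_LIKE, re.MULTILINE)

def flagged(text):
    """Return the IP-like strings that the pattern would report."""
    return [m.group(0) for m in PATTERN.finditer(text)]

doc = (
    "Intro\n"
    "  1.1 Scope\n"
    "   1.1.1.1 Details\n"
    "      2.3.4.5 Deep item\n"
    "Server at 10.0.0.1 is up\n"
)
print(flagged(doc))
```

One thing this makes visible: a heading that starts at column 0 with no indentation at all (for example a line beginning with 1.1.1.1) is still reported, because none of the lookbehinds covers that case.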
You have a very basic demo here. Please, keep on reading before using that for production :-)
Of course, since a negative lookbehind usually cannot be variable length (except in some specific programming languages and tools), you need one fixed-length lookbehind per case. The more cases with extra spaces you add, the more likely you are to skip those numbered articles instead of flagging them by mistake.
Also the tool must support negative lookbehinds, of course.
You could even combine both cases: a regex that matches 172.x.x.x and 192.x.x.x private addresses without any extra constraints (leaving out 10.x.x.x private addresses because their numbers are pretty low), or any other valid IP address with extra constraints (like the spaces).
Are there any more false positives that you detected? Try to establish similar rules for them. For example, consider that you could match footnotes like these: <<See 1.2.3.4>> or *1.2.3.4. Try to add exceptions for IP-address-like strings that start with * or end with >>, for example.
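The two exceptions mentioned above (a leading * and a trailing >>) could look roughly like this in Python. IP_LIKE, PATTERN and flagged are hypothetical names, and your tool needs to support lookbehind and lookahead for the same idea to work there.

```python
import re

IP_LIKE = r"\b\d{1,3}(?:\.\d{1,3}){3}\b"

# (?<!\*) rejects candidates directly preceded by an asterisk,
# (?!>>)  rejects candidates directly followed by ">>".
PATTERN = re.compile(r"(?<!\*)" + IP_LIKE + r"(?!>>)")

def flagged(text):
    """Return the IP-like strings that survive both exceptions."""
    return PATTERN.findall(text)

print(flagged("ping 1.2.3.4 now, but <<See 1.2.3.4>> and *1.2.3.4 are footnotes"))
```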
To sum up: "You cannot", but if you insist on trying:
1. Add extra 'logic' to the regex according to the false positives you found.
2. Check whether the tool lacks regex features you need (like positive/negative lookbehinds).
3. The logic may be very specific to the document in your example. If there are other documents with different formats, a generic solution for every kind of document may not be possible.
4. Even if you have just a single type of document to inspect, you may still get false positives/negatives, in which case go to step 1 and repeat.
Could someone help me with some REGEX...
I have been blocking internal traffic using the filter pattern:
10.*..
This just bit me in the foot as this is blocking all referral traffic between our sites.
What I want to do now is block everything except 10.103..
Do I need to apply two separate ranges, or can I accomplish this with one filter?
If you want to block everything but 10.103.xxx.xxx, use an include filter instead of the usual exclude filter.
NOTE ABOUT REGEXES MATCHING IPs IN ANALYTICS
I am not sure if the filter I suggested above uses regex or not (literal string match), but it doesn't make a difference because there's no way the expression 10.103. could be misinterpreted in an IP address.
Your original pattern, on the other hand, is bogus and is probably hurting you. That's because in a regex the dot . is not a literal dot, but represents any character. Your expression, in fact, excludes every single IP that merely starts with 10 (not just 10. that is ten-dot), including 100.xxx, 101.xxx etc.
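The correct version of your original excluding regex would be 10\..*, which contains an escaped dot (\.), then proceeds to any characters after that (.*). If the filter does partial matching, anchoring it as ^10\. is safer still, because the unanchored form can also match inside addresses such as 110.x.x.x.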
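A quick way to check the escaping point is to try the patterns in Python. This is only a sketch: it assumes the filter behaves like a partial-match search (re.search), uses 10.* as a stand-in for a pattern with an unescaped dot, and is_filtered is a made-up helper name.

```python
import re

def is_filtered(pattern, ip):
    """True if the pattern matches anywhere in the IP string (partial match)."""
    return re.search(pattern, ip) is not None

# Unescaped dot: "." means "any character", so 100.x.x.x is caught too.
print(is_filtered(r"10.*", "100.20.3.4"))

# Escaped dot: 100.x.x.x is no longer caught...
print(is_filtered(r"10\..*", "100.20.3.4"))

# ...but without an anchor the "10." inside "110." still matches.
print(is_filtered(r"10\..*", "110.5.5.5"))

# Anchored patterns only match at the start of the address.
print(is_filtered(r"^10\.", "110.5.5.5"))
print(is_filtered(r"^10\.103\.", "10.103.7.8"))
```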
Regular expressions are explained very well in the Google Analytics Help (here).
For multiple IPs, there is this little helper, which generates the REGEXP for you.
If you want to block internal traffic, just ADD NEW FILTER and CUSTOM then EXCLUDE and put the IP in REGEXP in the field, that's it.
I have a text, that contains url domains in the following form:
[second_level_domain].[top_level_domain]
This could be for instance test.com, amazon.com or something similar, but not more complex stuff like e.g. www.test.com or de.wikipedia.org (no sub level domains!).
It could be that in front of the dot (between second and top level domain) or after the dot is an optional space like test . com, but this doesn't always have to be the case.
However what I don't want to match is if the second level domain and top level domain belong to an e-mail address like for instance hello#test.org. So in this case it shouldn't extract test.org
I wrote the following regex now:
(?<!#)(([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(com|net|org))
With the negative look behind I want to make sure, that in front of the second level domain shouldn't be an #. However it doesn't really do what I expected. For instance on the text hello#test.org it extracts est.org instead of extracting nothing. So, apparently it only looks at the first character when it checks if there is an # in front. But when I use the following regex it seems to work on the text hello#test.org:
(?<!#)((test)\s?\.\s?(com|net|org))
Here I hard coded the second level domain, with which it works. However if I exchange that with a regex that matches all kinds of second level domains
([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))
it doesn't work anymore. It looks like that the negative look behind is already used after the first character is matched and that it doesn't wait with the negative look behind until everything is matched.
As an alternative I could match a bit more and then use the groups afterwards to build my desired match, but I want to avoid that if possible. I would like to match it correctly immediately. I'm not an expert in regular expressions and apparently I have not understood look arounds properly yet. Is there a way to write a regex, which behaves like I want?
(?:^|(?<=\s))((?:[a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(?:com|net|org))
Add anchors to disallow partial matches. See demo.
https://www.regex101.com/r/rK5lU1/34
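The reason the original pattern returns est.org is that the engine retries the whole pattern at every starting position, and (?<!#) only inspects the single character before the position being tried. Starting at "t" fails because "#" precedes it, but starting at "e" succeeds because "t" precedes it. A small Python sketch comparing both patterns (first_domain is a hypothetical helper; the "#" separator is kept as written in the question):

```python
import re

ORIGINAL = re.compile(
    r"(?<!#)(([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(com|net|org))"
)
ANCHORED = re.compile(
    r"(?:^|(?<=\s))((?:[a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(?:com|net|org))"
)

def first_domain(pattern, text):
    """Return group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(first_domain(ORIGINAL, "hello#test.org"))          # partial match starting at "e"
print(first_domain(ANCHORED, "hello#test.org"))          # no match
print(first_domain(ANCHORED, "visit test . com today"))  # optional spaces still allowed
```

With the anchored version a match can only begin at the start of the text or right after whitespace, so the engine can no longer slide one character to the right to dodge the lookbehind.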
I'm writing a regex for google analytics and I need to block any IP from 156.21.x.x I don't care about the last 2 octets just the first two. I would like to keep the regex to as few characters as possible as google only allows 255 chars and my regex is already pretty large.
Not sure what flavor of regex or what language you're using, but this will work on most regex engines:
156\.21\.\d{1,3}\.\d{1,3}
Of course, this will match invalid IPs like 156.21.777.888, but if the list you're parsing doesn't contain invalid IP addresses, then you should be OK. Or:
156\.21(\.\d{1,3}){2}
If you are running short on space, this would work, though you would match non-IP addresses as well. If you can assume Google will give you valid IP addresses, this is your shortest option:
^156\.21\.
Matches things like: 156.21.1.1 156.21.1000.1000 156.21.ABC
But does not match http://156.21.1.1 ehlo 156.21.1000.1000
The following regex would match (almost) valid IPv4 addresses that start with 156.21:
(156\.21(?:\.[\d]{1,3}){2})
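To compare the candidates side by side, here is a small Python sketch. It assumes the filter does a partial-match search, which is why re.search is used; LONG, SHORT, PREFIX and hits are names invented for the example.

```python
import re

LONG = r"156\.21\.\d{1,3}\.\d{1,3}"
SHORT = r"156\.21(\.\d{1,3}){2}"
PREFIX = r"^156\.21\."

def hits(pattern, value):
    """True if the pattern matches anywhere in value (partial match)."""
    return re.search(pattern, value) is not None

for value in ["156.21.1.1", "156.21.777.888", "156.21.ABC",
              "http://156.21.1.1", "1156.21.1.1"]:
    print(value, hits(LONG, value), hits(SHORT, value), hits(PREFIX, value))

# Pattern lengths, relevant for the 255-character limit.
print(len(LONG), len(SHORT), len(PREFIX))
```

The unanchored patterns also match 1156.21.1.1, which is not a real address but shows why the leading ^ in the short version does useful work.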
I'm looking for the regex to validate hostnames. It must completely conform to the standard. Right now, I have
^[0-9a-z]([0-9a-z\-]{0,61}[0-9a-z])?(\.[0-9a-z]([0-9a-z\-]{0,61}[0-9a-z])?)*$
but it allows successive hyphens and hostnames longer than 255 characters. If the perfect regex is impossible, say so.
Edit/Clarification: a Google search didn't reveal that this is a solved (or proven unsolvable) problem. I want to create the definitive regex so that nobody has to write their own ever again. If dialects matter, I want a version for each one in which this can be done.
^(?=.{1,255}$)[0-9A-Za-z](?:(?:[0-9A-Za-z]|-){0,61}[0-9A-Za-z])?(?:\.[0-9A-Za-z](?:(?:[0-9A-Za-z]|-){0,61}[0-9A-Za-z])?)*\.?$
The approved answer validates invalid hostnames containing multiple dots (example..com). Here is a regex I came up with that I think exactly matches what is allowable under RFC requirements (minus an ending "." supported by some resolvers to short-circuit relative naming and force FQDN resolution).
Spec:
<hname> ::= <name>*["."<name>]
<name> ::= <letter-or-digit>[*[<letter-or-digit-or-hyphen>]<letter-or-digit>]
Regex:
^([a-zA-Z0-9](?:(?:[a-zA-Z0-9-]*|(?<!-)\.(?![-.]))*[a-zA-Z0-9]+)?)$
I've tested quite a few permutations myself, I think it is accurate.
This regex also does not do length validation. Length constraints on labels between dots and on names are required by RFC, but lengths can easily be checked as second and third passes after validating against this regex, by checking full string length, and by splitting on "." and validating all substrings lengths. E.g., in JavaScript, label length validation might look like: "example.com".split(".").reduce(function (prev, curr) { return prev && curr.length <= 63; }, true).
Alternative Regex (without negative lookbehind, courtesy of the HTML Living Standard):
^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
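The multi-pass idea (check the total length, split on ".", then check each label) can be sketched in Python as follows. The function name is_valid_hostname and the defaults are assumptions for illustration: 255 is the total limit mentioned in the question, and some sources use 253 instead. The per-label pattern is a simplified one written for this example, not any of the regexes above.

```python
import re

# A label starts and ends with a letter or digit; hyphens only in between.
LABEL = re.compile(r"[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?")

def is_valid_hostname(name, max_total=255, max_label=63):
    """Validate a hostname in passes: total length, then each label."""
    # Pass 1: overall length.
    if not 1 <= len(name) <= max_total:
        return False
    # Passes 2 and 3: every label has a legal length and legal characters.
    # An empty label (as in "example..com") fails the pattern check.
    return all(
        len(label) <= max_label and LABEL.fullmatch(label) is not None
        for label in name.split(".")
    )

print(is_valid_hostname("example.com"))
print(is_valid_hostname("example..com"))
```

Using fullmatch avoids a Python-specific pitfall where $ also matches just before a trailing newline.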
Your answer was relatively close.
But see
RFC 2396 Section 3.2.2
JaredPar's reference to this answer is referring to Regexp/Common/URI/RFC2396.pm source.
For a hostname RE, that perl module produces
(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)
I would modify to be more accurate as:
(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]{0,61})?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]{0,61}[a-zA-Z0-9]|[a-zA-Z])[.]?)
Optionally anchoring the ends with ^$ to ONLY match hostnames.
I don't think a single RE can accomplish a full validation because, according to Wikipedia, there is a 255 character length restriction which I don't think can be included within that same RE, at least not without a ton of changes, but it's easy enough to just check the length <= 255 before running the RE.
Take a look at the following question. A few of the answers have regex expressions for host names
Regular expression to match DNS hostname or IP Address?
Could you specify what language you want to use this regex in? Most languages / systems have slightly different regex implementations that will affect people's answers.
I tried all answers with the examples below and unfortunately none of them passed the test.
ec2-11-111-222-333.cd-blahblah-1.compute.amazonaws.com
domaine.com
subdomain.domain.com
12533d5.dkkkd.com
2dotsextension.co
1dotextension.c
ekkej_dhh.com
12552.2225
112.25.25
12345.com
12345.123.com
domaine.123
whatever
9999-ee.99
email#domain.com
.jjdj.kkd
-subdomain.domain.com
#subdomain.domain.com
112.25.25
Here is a better solution.
^[A-Za-z0-9][A-Za-z0-9-.]*\.\D{2,4}$
Please post any case not considered here, if one exists, at https://regex101.com/r/89zZkW/1
What about:
^(?=.{1,255})([0-9A-Za-z]|_{1}|\*{1}$)(?:(?:[0-9A-Za-z]|\b-){0,61}[0-9A-Za-z])?(?:\.[0-9A-Za-z](?:(?:[0-9A-Za-z]|\b-){0,61}[0-9A-Za-z])?)*\.?$
This matches only one '_' at the beginning (for some SRV records) and only one * (in the case of a label for a DNS wildcard).
According to the relevant internet RFCs and assuming you have lookahead and lookbehind positive and negative assertions:
If you want to validate a local/leaf hostname for use in an internet hostname (e.g. - FQDN), then:
^(?!-)[-a-zA-Z0-9]{1,63}(?<!-)$
That ^^^ is also the general check that a label component inside an internet hostname is valid.
If you want to validate an internet hostname (e.g. - FQDN), then:
^(?=.{1,253}\.?$)(?:(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.)*(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.?$
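Both patterns can be tried directly in Python, whose re module supports the lookahead and fixed-width lookbehind they rely on. This is only a sketch; is_label and is_fqdn are hypothetical helper names.

```python
import re

LABEL = re.compile(r"^(?!-)[-a-zA-Z0-9]{1,63}(?<!-)$")
FQDN = re.compile(
    r"^(?=.{1,253}\.?$)"
    r"(?:(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.)*"
    r"(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.?$"
)

def is_label(s):
    """True if s is a valid single label (leaf hostname)."""
    return LABEL.match(s) is not None

def is_fqdn(s):
    """True if s is a valid dotted hostname, with an optional trailing dot."""
    return FQDN.match(s) is not None

# Note: in Python, $ also matches just before a final newline,
# so strip the input first if that matters.
for name in ["example.com", "example.com.", "example..com", "-bad.example.com"]:
    print(name, is_fqdn(name))
```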
Short version:
How can I get a regex that matches a#a.aaaa but not a#a.aaaaa using CAtlRegExp ?
Long version:
I'm using CAtlRegExp http://msdn.microsoft.com/en-us/library/k3zs4axe(VS.80).aspx to try to match email addresses. I want to use the regex
^[A-Z0-9._%+-]+#(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$
extracted from here.
But the syntax that CAtlRegExp accepts is different than the one used there. This regex returns the error REPARSE_ERROR_BRACKET_EXPECTED, you can check for yourself using this app: http://www.codeproject.com/KB/string/mfcregex.aspx
Using said app, I created this regex:
^[a-zA-Z0-9\._%\+\-]+#([a-zA-Z0-9-]+\.)+[a-zA-Z]$
But the problem is this matches a#a.aaaaa as valid, I need it to match 4 characters maximum for the top-level domain.
So, how can I get a regex that matches a#a.aaaa but not a#a.aaaaa ?
Try: ^[a-zA-Z0-9\._%\+\-]+#([a-zA-Z0-9-]+\.)+\c\c\c?\c?$
This expression replaces the [A-Z]{2,4} sequence which CAtlRegExp doesn't support with \c\c\c?\c?
\c serves as an abbreviation of [a-zA-Z]. The question marks after the 3rd and 4th \c's indicate they can match either zero or one characters. As a result, this portion of the expression matches 2, 3 or 4 characters, but neither more nor less.
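The expansion trick can be checked outside CAtlRegExp. The Python sketch below spells \c out as [a-zA-Z] (Python's re module has no \c) and compares the expanded form with the {2,4} form. EXPANDED and BRACED are names made up for this example, and the "#" separator is kept as written in the question.

```python
import re

LETTER = "[a-zA-Z]"  # what \c abbreviates in CAtlRegExp, per the answer above

# Two mandatory letters followed by two optional ones: 2, 3 or 4 letters.
TLD_2_TO_4 = f"{LETTER}{LETTER}{LETTER}?{LETTER}?"

EXPANDED = re.compile(
    r"^[a-zA-Z0-9._%+\-]+#(?:[a-zA-Z0-9-]+\.)+" + TLD_2_TO_4 + "$"
)
BRACED = re.compile(
    r"^[a-zA-Z0-9._%+\-]+#(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}$"
)

samples = ["a#a.aaaa", "a#a.aaaaa", "a#a.a", "a#a.co", "a#b.c.org"]
for s in samples:
    print(s, bool(EXPANDED.match(s)), bool(BRACED.match(s)))
```

For these samples both forms agree, which is the point of the rewrite: optional repetitions give the same bounds as {2,4} when the engine lacks counted quantifiers.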
You are trying to match email addresses, a very widely used critical element of internet communication.
To which I would say that this job is best done with the most widely used most correct regex.
Since email address format rules are described by RFC822, it seems useful to do internet searches for something like "RFC822 email regex".
For Perl the answer seems to be easy: use Mail::RFC822::Address: regexp-based address validation
RFC 822 Email Address Parser in PHP
Thus, to achieve the most correct handling of email addresses, one should either locate the most precise regex that there is out somewhere for the particular toolkit (ATL in your case) or - in case there's no suitable existing regex yet - adapt a very precise regex of another toolkit (Perl above seems to be a very complete albeit difficult candidate).
If you're trying to match a specific sub part of email addresses (as seems to be the case given your question), then it probably still makes sense to start with the most up-to-date/correct/universal regex and specifically limit it to the parts that you require.
Perhaps I stated the obvious, but I hope it helped.