POSIX ERE Regex - Creating Efficient Regex - regex

I'm working to create some regex entries that are well-formed, and efficient. I'll place an emphasis on efficient, as these regex entries can see thousands of logs per second. Inefficient regex entries can cause severe performance impacts.
Question: Does regex101 (through one flavor) support POSIX ERE Regex? Googling shows that PCRE2 should support BRE+ERE and more.
Regex Type: POSIX ERE
Syslog App: rsyslog (EL7)
Sample Payload (Well formed - Sensitive Information Stripped):
Jul 10 00:00:00 Firewall-Name-Removed CEF:0|Fortinet|FortiGate-removed|1.2.3,build1111 (GA)|0000000013|forward traffic accept|5|start=Jul 10 2022 00:00:00 logver=604091966 deviceExternalId=FG9A9A9A9999999 dvchost=Firewall-Name-Removed ad.vd=root ad.eventtime=1111111111111111111 ad.tz=-9999 ad.logid=0000000013 cat=traffic ad.subtype=forward deviceSeverity=notice src=1.1.1.1 shost=RandomHost1 spt=62119 deviceInboundInterface=DII-Out ad.srcintfrole=lan ad.srcssid=SSID Has Been Removed ad.apsn=ABC123D ad.ap=CHL-07 ad.channel=157 ad.radioband=802.11ac n-only ad.signal=-40 ad.snr=55 dst=2.2.2.2 dpt=53 deviceOutboundInterface=DOI-Out ad.dstintfrole=undefined ad.srccountry=Reserved ad.dstcountry=CountryRemoved externalID=123456789 proto=00 act=accept ad.policyid=000 ad.policytype=policy ad.poluuid=UUID-Removed ad.policyname=policy_name_removed app=DNS ad.trandisp=noop ad.appid=16195 ad.app=DNS ad.appcat=Network.Service ad.apprisk=elevated ad.applist=UTM Name - Removed ad.duration=180 out=0 in=205 ad.sentpkt=0 ad.rcvdpkt=1 ad.utmaction=allow ad.countdns=1 ad.osname=Windows ad.srcswversion=10 ad.mastersrcmac=MAC removed ad.srcmac=MAC removed ad.srcserver=0 tz="-9999"
What I'm attempting to do is remove specific logs that are not required. Normally I'd do this at a SIEM level through something like routing rules (where I can utilize fields), but this isn't possible for the foreseeable future. In this particular case: I'm trying to exclude on the following pieces of information.
Source IP: Is in a specific range
deviceOutboundInterface: is DOI-Out
Current Regex: "\bsrc=1.1.1[4-5]{0,1}.[0-9]{0,3}\b.*?\bdeviceOutboundInterface=DOI-Out\b" (Regex101 link in PCRE2). If that is matched, the log is rejected (through the stop call). Otherwise, it moves onto the other entries to check for unnecessary logs.
Most of my regex entries are in the low double-digits because they're a lot simpler. Is there a better way to make the more complex regex more efficient?
Thank you for any insight you can offer.

You might be able to cut some time with:
src=1\.1\.1[4-5]{0,1}\.[0-9]{0,3}.*?deviceOutboundInterface=DOI-Out
changes:
remove word boundaries
change the . to . in IP address
regex101 has the original efficiency at 383 steps, new is 301 so a potential savings of ~21%. Not terrible but you'll want to make sure any removals were OK.
to be honest, what you have looks pretty good to me.

This RE reduces the number of steps on Reg101 from 383 to 270 (~ -29.5%):
src=1\.1\.1[45]?\.\d{0,3}.*?O[boundIter]*?face=DOI-Out
The original RE already is quite simple, only matching one pattern and one literal string which makes it difficult to optimize. But we can do if we know (from the documentation of the text in question, here the Log Message manual) that an even simpler pattern will not lead to ambiguities.
Changes:
matching literal text whereever possible
replacing range '4-5' with simple elements
instead of matching the long 'deviceOutboundInterface=', use a pattern which will just barely match this string but would possibly match other words if they ever occurred in log messages - but we know they don't.

Related

Regular expression to allow only the valid time offset values

I am looking for a regular expression which allows only the time offset values.
I have used:
^(?:[+-](?:2[0-3]|[01][0-9]):[0-5][0-9])$
The ONLY strings I need to match:
-12:00
+14:00
-11:00
-10:00
-09:30
-09:00
-08:00
-07:00
-06:00
-05:00
-04:00
-03:30
-03:00
-02:00
-01:00
00:00
+01:00
+04:00
+03:30
+03:00
+02:00
+04:30
+05:00
+05:30
+05:45
+06:00
+06:30
+07:00
+08:00
+08:45
+09:00
+09:30
+10:00
+10:30
+11:00
+12:00
+12:45
+13:00
+14:00
Please check here for what I have tried so far, and the values I want it to allow.
It works fine for the all the values except for 00:00.
Also, it allows some extra values such as -19:30 +23:00 22:30 21:00 which should not be allowed.
I want it to allow only those values which have been mentioned in my aforesaid link.
I was able to achieve the results you wanted by slightly tweaking your regex.
This is also short and concise.
^(?:(?:[+-](?:1[0-4]|0[1-9]):[0-5][0-9])|00:00)$
You can check the results and test it further here
One point which should be noted here is that you would be able to pass other values between the current valid values of timezone(-12:00 to +14:00). By reading the comments in the question, I feel it is better to have it this way, for future proofing just in case they change. (You would need to tweak the regex to allow values greater than 14:00)
If you strictly want to limit it to the values which you have listed, enumeration would be a better approach to go about it.
To only match these strings of yours, you can use
^(?:\+(?:0(?:[12]:00|[34]:[03]0|5:(?:[03]0|45)|6:[03]0|7:00|8:(?:00|45)|9:[03]0)|1(?:0:[03]0|1:00|2:(?:00|45)|[34]:00))|-(?:0(?:[12]:00|3:[03]0|[4-8]:00|9:[03]0)|1[0-2]:00)|00:00)$
See the regex demo
Use online/external tools to build word list regexps like this (e.g. My Regex Tester, etc.).

Alternation in regexes seems to be terribly slow in big files

I am trying to use this regex:
my #vulnerabilities = ($g ~~ m:g/\s+("Low"||"Medium"||"High")\s+/);
On chunks of files such as this one, the chunks that go from one "sorted" to the next. Every one must be a few hundred kilobytes, and all of them together take from 1 to 3 seconds all together (divided by 32 per iteration).
How can this be sped up?
Inspection of the example file reveals that the strings only occur as a whole line, starting with a tab and a space. From your responses I further gathered that you're really only interested in counts. If that is the case, then I would suggest something like this solution:
my %targets = "\t Low", "Low", "\t Medium", "Medium", "\t High", "High";
my %vulnerabilities is Bag = $g.lines.map: {
%targets{$_} // Empty
}
dd %vulnerabilities; # ("Low"=>2877,"Medium"=>54).Bag
This runs in about .25 seconds on my machine.
It always pays to look at the problem domain thoroughly!
This can be simplified a little bit. You use \s+ before and after, but is this necessary? I think you need just to assure word boundary or just one whitespace, thus, you can use
\s("Low"||"Medium"||"High")\s
or you can use \b instead of \s.
Second step is not to use capturing group, use non-capturing grous instead, because regex engine wastes time and memory for "remembering" groups, so you could try with:
\s(?:"Low"||"Medium"||"High")\s
TL;DR I've compared solutions on a recent rakudo, using your sample data. The ugly brute-force solution I present here is about twice as fast as the delightfully elegant solution Liz has presented. You could probably improve times another order of magnitude or more by breaking your data up and parallel processing it. I also discuss other options if that's not enough.
Alternations seems like a red herring
When I eliminated the alternation (leaving just "Low") and ran the code on a recent rakudo, the time taken was about the same. So I think that's a red herring and have not studied that aspect further.
Parallel processing looks promising
It's clear from your data that you could break it up, splitting at some arbitrary line, and then pattern match each piece in parallel, and then combine results.
That could net you a substantial win, depending on various factors related to your system and the data you process.
But I haven't explored this option.
The fastest results I've seen
The fastest results I've seen are with this code:
my %counts;
$g ~~ m:g / "\t " [ 'Low' || 'Medium' || 'High' ] \n { %counts{$/}++ } /;
say %counts.map: { .key.trim, .value }
This displays:
((Low 2877) (Medium 54))
This approach incorporates similar changes to those MichaƂ Turczyn discussed, but pushed harder:
I've thrown away all capturing, not only not bothering to capture the 'Low' or whatever, but also throwing away all results of the match.
I've replaced the \s+ patterns with concrete characters rather than character classes. I've done so on the basis my casual tests with a recent rakudo suggested that's a bit faster.
Going beyond raku's regexes
Raku is designed for full Unicode generality. And its regex engine is extremely powerful. But it looks like your data is just ASCII and your pattern is a typical very simple regex. So you're using a sledgehammer to crack a nut. This shouldn't really matter -- the sledgehammer is supposed to be just fine as a nutcracker too -- but raku's regex engine remains very poorly optimized thus far.
Perhaps this nut is just a simple example and you're just curious about pushing raku's built in regex capabilities to their maximum current performance.
But if not, and you need yet more speed, and the speedups from this or other better solutions in raku, coupled with parallel processing, aren't enough to get you where you need to go, it's worth considering either not using raku or using it with another tool.
One idiomatic way to use raku with another tool is to use an Inline, with the obvious one in this case being Inline::Perl5. Using that you can try perl's fast default built in regex engine or even use its regex plugin capability to plug in a really fast regex engine.
And, given the simplicity of the pattern you're matching, you could even eschew regexes altogether by writing a quick bit of glue to some low-level raw text searching tool (perhaps saving character offsets and then generating corresponding raku match objects from the results).

Regex - Match Exact (and only exact) string/phrase

I'm having some trouble using Regex to match an exact string (and only that string, must not contain prefix or suffix) out of a sample with only slight variations.
I've looked over every "duplicate" this has been compared to, and none of the solutions seem to apply to the problem I'm trying to accomplish. If it is indeed a duplicate, I'd love to see how!
Phrase to Match:
Metallica - Master Of Puppets
Sample Text:
Metallica - Master Of Puppets (instrumental)
Metallica - Master Of Puppets
- Metallica - Master Of Puppets
I've tried a few different approaches with this.
There's "the starting point": ^(Metallica - Master Of Puppets)$
The "slightly more involved": ^((?!Lamb of God - Laid to Rest).)*$
The "I'm getting desperate": /(?<=\||\A|\n)(Metallica -
.Master.of.Puppets)(?=\||Z|\n)/g
and the "im out of ideas, so why the hell not": (?=^\s*Metallica - Master Of
Puppets).{29}
None of which will match the correct (the second option, in bold) string. I've dedicated the better part of my free-time from last night on this one little string rather than coding out a new app I've been working on (I really do hate to give up), and am, at this point, out of ideas, examples and patience. Nonetheless, I really would like to get to the bottom of what this seemingly simple Regex need will take to accomplish, both for the app and for my mental well being (I hate Regex, but love a good challenge).
Note: I DO need this to be done in Regex (not grep, not java etc). Sorry to throw such a seemingly menial question up, but my 15 or so months in the programming world only gets me so far. Looking forward to a solution, thanks!5
I believe your first approach was correct (in agreement with #Dmitry Egorov), although you are probably missing the multi-line flag. This sets it so that ^ and $ are set at the beginning and end of each line of your string or file.
In PHP/Js you'll want to use
/^Metallica - Master Of Puppets$/gm
the g flag is 'global' and finds all instances, the mflag is the aforementioned multi-line flag.
Other languages will have similar flags or options for multi-line support.

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with Regular Expresions. But now that it is getting migrated into Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*#([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*#)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
time on 24 hour clock
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE #s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE #s + ' ' LIKE '[3-7]'+ REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(#s) = 15 AND #s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(#s) = 16 AND #s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE #s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - seems overkill to do an invasie pattern matching for a set of 7 possible values. Similarly with a list of 1000 numbers or years, these are things that will be much easier (and probably more efficient) to check if the numeric value is in a table rather than convert it to a string and see if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or use complex use of built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.

RegEx - Not match inside a text?

I am working with iCal entries:
BEGIN:VEVENT
UID:944f660b-01f8-4e09-95a9-f04a352537d2
ORGANIZER;CN=******
DTSTART;TZID="America/Chicago":20100802T080000
DTEND;TZID="America/Chicago":20100822T170000
STATUS:CONFIRMED
CLASS:PRIVATE
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
TRANSP:OPAQUE
X-MICROSOFT-DISALLOW-COUNTER:TRUE
DTSTAMP:20100802T212130Z
SEQUENCE:0
END:VEVENT
BEGIN:VEVENT
UID:aa132e2b-8a8d-4ffc-9e54-b75249e78c72
RRULE:FREQ=DAILY;COUNT=12;INTERVAL=1
SUMMARY:***********
X-ALT-DESC;FMTTYPE=text/html:<html><body><div style='font-family:Times New R
oman\; font-size: 12pt\; color: #000000\;'></div></body></html>
LOCATION:Map Room
ORGANIZER;CN=*********
DTSTART;TZID="America/Chicago":20100730T080000
DTEND;TZID="America/Chicago":20100730T170000
STATUS:CONFIRMED
CLASS:PUBLIC
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
TRANSP:OPAQUE
X-MICROSOFT-DISALLOW-COUNTER:TRUE
DTSTAMP:20100727T025231Z
SEQUENCE:0
EXDATE;TZID="America/Chicago":20100810T080000
EXDATE;TZID="America/Chicago":20100807T080000
BEGIN:VALARM
ACTION:DISPLAY
TRIGGER;RELATED=START:-PT5M
DESCRIPTION:*********
END:VALARM
END:VEVENT
I need to parse out starting and ending times. I have a comparison function that determines if the passed in event is between the two times. Due to the increased complexity in calculating the times I plan on not supporting the recurrance series. I would like to play the safe side and make sure my code only reads the first event as a match and not the second. So I have the following RegEx with the single-line option:
BEGIN:VEVENT.+?
DTSTART;.+?([0-9]{8})T([0-9]{6})
DTEND;.+?([0-9]{8})T([0-9]{6}).+?
END:VEVENT
This gets me the start and end times of both entries. My thought was to only match ones that don't have FREQ= between the BEGIN:VEVENT and DTSTART. I don't understand how to do this, however. I was wondering if someone could help me out here?
I realize at a certain point a full blown parser is a better option, but I am unskilled with parsers and I am under a slight time constraint. I have tried using the !? operator without success.
It's harder to write a regex to match for things you don't want then to match the things you do want. Usually when I run into this situation, I find it easier and faster to do things in two steps. In this case, I'd probably find all events that do contains FREQ=, remove those events, then continue matching on the result for the start and end times I want. Could you post the regex you tried with !?, because maybe it's easy to fix... Also, I assume this is in Objective-C, and I'm guessing the environment you're using does support !? (but not all of them do)...
UPDATE
Ok, try this one:
BEGIN:VEVENT.+?
(?<!FREQ=.+)DTSTART;.+?([0-9]{8})T([0-9]{6})
DTEND;.+?([0-9]{8})T([0-9]{6}).+?
END:VEVENT
Why not use a PHP iCalendar parser?
http://www.phpclasses.org/browse/file/16660.html