I'm looking for a review of a regex I created, to see where improvements could be made.
I have the following log message:
2017-02-09T14:12:07.381648+00:00 ATA-CENTER ATA[4844] CEF:0|Microsoft|ATA|1.7.5757.57477|AbnormalBehaviorSuspiciousActivity|Suspicion of identity theft based on abnormal behavior|5|start=2017-02-09T14:07:22.1490000Z app=Kerberos shost=xxx suser=Last Name, First Name msg=text here. cs1Label=url cs1=https://xxx-xxx.xxxx.xxx.xxx/suspiciousActivity/589c7796135ca912ec5b75b0
Here is my regex:
.*?\|ATA\|(?<version>.*?)\|(?:\w+)\|(?<alert>.*?)\|(?<severity>.*?)\|(?:.*?)\s\w+=(?<app>.*?)\s\w+=(?<src_host>.*?)\s\w+=(?<user>.*?)\s\w+=(?<msg>.*?).\s.*?
I'm trying to disregard everything up to ATA, and then disregard everything after the period at the end of the msg (starting at cs1Label).
Would appreciate any feedback.
Thx
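There is one small error in your regex. The unescaped dot . after the message group matches any character, not only a period. If you really want to match a literal dot, you need to escape it as \. instead. (?:\w+) matches any run of word characters (letters, digits, and underscore); because nothing is captured, the (?:...) wrapper is redundant and plain \w+ does the same. Where you only need to match a digit, you could also just use \d. Moreover, I wouldn't name the capture groups. In my opinion, this does not make the regex more readable.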
Otherwise it seems fine to me. Here is a live demo with the fixed dot.
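For anyone who wants to try the corrected pattern outside regex101, here is a minimal Python sketch. It assumes Python's re module, which spells named groups (?P<name>...) rather than (?<name>...); the names CEF_PATTERN, log_line, and fields are made up for this illustration. The leading and trailing .*? are left out because search() scans the string anyway.

```python
import re

# Named groups use Python's (?P<name>...) spelling.
# The only functional change from the question's regex is the escaped dot (\.)
# after the msg group; the redundant (?:...) wrappers are dropped.
CEF_PATTERN = re.compile(
    r"\|ATA\|(?P<version>.*?)\|\w+\|(?P<alert>.*?)\|(?P<severity>.*?)\|"
    r".*?\s\w+=(?P<app>.*?)\s\w+=(?P<src_host>.*?)"
    r"\s\w+=(?P<user>.*?)\s\w+=(?P<msg>.*?)\.\s"
)

# The log message from the question, split across lines for readability.
log_line = (
    "2017-02-09T14:12:07.381648+00:00 ATA-CENTER ATA[4844] "
    "CEF:0|Microsoft|ATA|1.7.5757.57477|AbnormalBehaviorSuspiciousActivity|"
    "Suspicion of identity theft based on abnormal behavior|5|"
    "start=2017-02-09T14:07:22.1490000Z app=Kerberos shost=xxx "
    "suser=Last Name, First Name msg=text here. cs1Label=url "
    "cs1=https://xxx-xxx.xxxx.xxx.xxx/suspiciousActivity/589c7796135ca912ec5b75b0"
)

match = CEF_PATTERN.search(log_line)
fields = match.groupdict() if match else {}
print(fields)
```

Note that the msg group stops at the first period followed by whitespace, so a message containing several sentences would be cut after the first one.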
https://regex101.com/r/9kfa7D/4
I can never get the lookahead portion correct. I've tried a few different things, but I'm trying to match up to the next date and parse the log that way, mainly because I don't know what the message will look like and it could be pretty random. Any help would be great.
I need to group the message portion of it.
Edit: Updated to make it a little clearer what I'm trying to do. I need everything from each date.
You can just tweak your regex without tinkering lookahead like this:
^\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}
Updated Regex Demo
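A quick way to check the timestamp pattern in Python, as a sketch only. The log text below is invented for illustration (the question's real log is not shown here), and re.MULTILINE is assumed so that ^ matches at the start of every line.

```python
import re

# The timestamp pattern from the answer; MULTILINE lets ^ match at each line start.
TIMESTAMP = re.compile(r"^\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}", re.MULTILINE)

# Hypothetical log excerpt: two entries, the first spanning two lines.
sample_log = (
    "29-Jun-2016 09:33:43.565 INFO first message\n"
    "continued on a second line\n"
    "29-Jun-2016 09:33:44.001 WARN second message"
)

timestamps = TIMESTAMP.findall(sample_log)
print(timestamps)
```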
EDIT:
As per updated question OP can use this negative lookahead based regex to capture log text:
^[^\[]+\[[^\]]+\] +[^:]+ +(.*(?:\n(?!\d{2}-[a-zA-Z]{3}-).*)*)
This regex avoids the DOTALL flag by unrolling the loop in the last segment, which makes it fairly fast at parsing.
New Demo
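If you care about the message between log timestamps, use this (the message is in the second group; the pattern relies on PCRE features such as the possessive quantifier ++, the subroutine call (?1), and \z):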
/(\d{2}-\w{3}-\d{4} \S+ \S+ \[[^\]]++\] )(?=(.+)((?1)|\z))/gms
^(?:\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}) ((?:[^\n]+(?:\n+(?!\d{2}-\w{3}-\d{4})|))+)
The first part is the date pattern, which is non-capturing since you do not want to keep the date.
The second part is [^\n]+ which is followed by a \n provided it is not followed by \d{2}-\w{3}-\d{4} (hence the negative look ahead).
The second part is then repeated any number of times.
You can see the demo on regex101.
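The same pattern can be tried in Python. This is only a sketch: the sample log is made up, and re.MULTILINE is assumed so that ^ anchors at each line start. The names ENTRY, sample_log, and messages are illustrative.

```python
import re

# Date prefix (non-capturing) followed by the message (group 1).
# A newline is only consumed when the next line does NOT start with a date.
ENTRY = re.compile(
    r"^(?:\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}) "
    r"((?:[^\n]+(?:\n+(?!\d{2}-\w{3}-\d{4})|))+)",
    re.MULTILINE,
)

# Hypothetical log excerpt: the first entry continues on a second line.
sample_log = (
    "29-Jun-2016 09:33:43.565 INFO first message\n"
    "continued on a second line\n"
    "29-Jun-2016 09:33:44.001 WARN second message"
)

# With a single capturing group, findall returns the group-1 text of each entry.
messages = ENTRY.findall(sample_log)
for message in messages:
    print(repr(message))
```

If the log ends with a newline, the last message will include it, so stripping each message may be needed.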
What you need
(^\d+.[A-Z].*?)[A-Z]
How it works
Lots of people jump to complex solutions when they are confronted with a regex. But you should know exactly what you want.
You just need to match this: 29-Jun-2016 09:33:43.565 INFO and nothing else. So let's begin:
First: digits at the start of the line,
next: a word that starts with a capital letter,
next: everything from this word to the next capitalized word,
finish.
The main rule
Non-greedy match: .*?
Proof
NOTE
Do you want to match from the beginning up to log?
Very easy: just add .*?log at the end. That's it.
Do you ever pay attention to how many steps it takes?
First of mine: 7952
Second of mine: 13751
Compare it with the others
After I put the picture here, some people updated their regexes. I do not want to argue; I just wanted to show it. Otherwise I can (as you can) reduce the step count by choosing a more specific pattern. For example, with ^\d+-[A-Za-z]+-\d+\s\d+:\d+:\d+\.\d+ the 7952 steps become 3878.
Do you want to learn how a lookahead assertion works?
Very easy. The main concept is that (?=...) never consumes anything. It only checks that the text after the current position matches, so the match position stays just before the text you looked at.
For example:
^\d+-[A-Z].+(?=[A-Z]+ ).
It still matches: 29-Jun-2016 09:33:43.565 INFO
Pay attention to the . at the end. Here the lookahead assertion succeeds at the position between F and O, and the final . then consumes the O.
If you would like to match only 29-Jun-2016 09:33:43.565, what can you do?
Think about this:
^\d+-[A-Za-z].+(?=[\d] ).
and figure it out by yourself.
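The two lookahead patterns can be checked with a short Python sketch. The log line below is hypothetical: it assumes that nothing after INFO contains a capital letter or a digit directly followed by a space, because the greedy .+ backtracks from the end of the line and stops at the last position where the lookahead succeeds.

```python
import re

# Hypothetical log line; only the timestamp and level come from the answer.
line = "29-Jun-2016 09:33:43.565 INFO server started"

# A lookahead on its own matches an empty string: it asserts, it does not consume.
probe = re.search(r"(?=INFO)", line)

# Lookahead succeeds just before the final "O" of INFO; the trailing . consumes it.
with_level = re.match(r"^\d+-[A-Z].+(?=[A-Z]+ ).", line)

# Lookahead succeeds just before the last digit of the timestamp.
timestamp_only = re.match(r"^\d+-[A-Za-z].+(?=[\d] ).", line)

print(repr(probe.group()), probe.start())
print(with_level.group())
print(timestamp_only.group())
```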
I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.
Here is what I have so far -
\b([a-zA-Z]\.){2,}+
Note how this expression matches but does not include the last letter in the acronym.
Can anyone help explain what I am missing here?
SOLUTION
I'm posting the solution here in case this helps anyone.
\b(?:[a-zA-Z]\.){2,}
It seems as if a non-capturing group is required here.
Try (?:[a-zA-Z]\.){2,}
?: (non-capturing group) is there because you want to avoid capturing the last iteration of the repeated group.
For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested in.
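The difference is easy to see in Python, where re.findall returns the contents of the capturing group when there is one, and the whole match otherwise. A minimal sketch; the sample text is made up, and the possessive + from the question is omitted because older Python versions do not accept it.

```python
import re

text = "U.S.A. and D.C."

# Capturing group: findall reports only the LAST iteration of the repeated group.
capturing = re.findall(r"\b([a-zA-Z]\.){2,}", text)

# Non-capturing group: findall reports the whole match.
non_capturing = re.findall(r"\b(?:[a-zA-Z]\.){2,}", text)

# The overall match is the same either way; only the group content differs.
whole_match = re.search(r"\b([a-zA-Z]\.){2,}", text).group(0)

print(capturing)
print(non_capturing)
print(whole_match)
```

So the original expression does match the full acronym; tools that display the group rather than the match only show its last repetition.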
None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:
Make sure there's a space or EOF succeeding your expression ((?=\s|$) in ruby at least)
Surround your regex with ^ and $ to make sure it catches the whole string. You'd have to split the whole string on spaces to get matches with this though.
I prefer the former solution - to do this you'd have:
\b([a-zA-Z]\.){2,}(?=\s|$)
Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:
(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))
This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
Bonus: you now get to make this super confusing to anyone reading it.
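For readers outside Ruby, here is a rough Python adaptation rather than a direct port. Python's re module supports neither [[:punct:]] nor the variable-width lookbehind (?<=\s|^), so this sketch substitutes (?<!\S) for the lookbehind and [^\w\s] as an approximation of a punctuation character. Both substitutions and the sample sentences are assumptions of this example.

```python
import re

# (?<!\S)   : not preceded by a non-space character (start of string or whitespace)
# [^\w\s]?  : at most one punctuation-like character after the acronym (approximation)
ACRONYM = re.compile(r"(?<!\S)((?:[a-zA-Z]\.){2,})(?=[^\w\s]?(?:\s|$))")

found = ACRONYM.findall("He moved from the U.S., to D.C.! But a.b.c.d is a path.")
at_end = ACRONYM.findall("Made in the U.S.A.")

print(found)
print(at_end)
```

The dotted path a.b.c.d is rejected because the lookahead requires whitespace or the end of the string right after the acronym (plus at most one punctuation mark).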
This should work:
/([a-zA-Z]\.)+/g
I have slightly modified the solution above:
\b(?:[a-zA-Z]+\.){2,}
to enable capturing acronyms containing more than one letter between the dots, like in 'GHQ.AFP.X.Y'
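A small Python check of what this variant returns, offered as a sketch. Because every letter run must be followed by a dot, a final segment without a trailing period is left out of the match.

```python
import re

MULTI_LETTER = re.compile(r"\b(?:[a-zA-Z]+\.){2,}")

# The trailing "Y" has no dot after it, so the match stops at "X.".
no_trailing_dot = MULTI_LETTER.findall("GHQ.AFP.X.Y")
with_trailing_dot = MULTI_LETTER.findall("GHQ.AFP.X.Y.")

print(no_trailing_dot)
print(with_trailing_dot)
```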
I am using Regex to categorise codes in Omniture.
I need to work out a way to make my match group run up to the next hyphen (-) if one is present, and to the end of the string if not.
This is an example of a tracking code
KNC-GUK-FUK-GEN-SUP-MRO-ARALDITE-MRO
My current Regex is (which isn't working as desired)
(?i)knc-(.*?)(SUP-)(.*?)(-)(.+)(?= *-|$)
So it needs to have KNC and SUP- and I need to capture the word after the next hyphen, in this case ARALDITE.
Edit: the codes can also look like KNC-GUK-FUK-GEN-SUP-MRO-ARALDITE, which is why I have an issue.
Just to clarify, it is the text in the match group that I need, not just the overall match.
Is there a way of doing this?
Any help you could offer would be really appreciated.
Thanks
Shani
This matches everything up until the end of ARALDITE:
knc-(.+)SUP-[^-]+-[^-]+
Where the final [^-]+ is matching just ARALDITE. I'm afraid I'm not quite sure what the other parts of your attempt are trying to do, or what captures you want to make, but hopefully you can work from this to a final solution.
[^-] just matches anything that isn't a hyphen, so of course [^-]+ matches either until the end of the string or until it encounters a hyphen.
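To get the word itself rather than the whole match, one option is to put parentheses around the final [^-]+. A minimal Python sketch under that assumption; the IGNORECASE flag stands in for the (?i) in the question, and the names are illustrative.

```python
import re

# Same structure as the answer's regex, with a capture group around the last [^-]+.
CODE = re.compile(r"knc-.+SUP-[^-]+-([^-]+)", re.IGNORECASE)

# Both code shapes from the question: with and without a trailing segment.
codes = [
    "KNC-GUK-FUK-GEN-SUP-MRO-ARALDITE-MRO",
    "KNC-GUK-FUK-GEN-SUP-MRO-ARALDITE",
]

captured = [CODE.search(code).group(1) for code in codes]
print(captured)
```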
How about
KNC-.*?SUP-.*?-(.*?)-
Expl.: Make sure "KNC-" is present and followed, at any point, by "SUP-"; then skip to the next '-' and capture up to, but not including, the following '-'.
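One thing to watch: this pattern requires a hyphen after the captured word, so it does not match the shorter code from the question's edit. The Python sketch below shows that, together with one possible variant that accepts either a hyphen or the end of the string; the variant is a suggestion of this example, not part of the answer.

```python
import re

ORIGINAL = re.compile(r"KNC-.*?SUP-.*?-(.*?)-")
# Hypothetical variant: the captured word may be followed by '-' or by the end.
VARIANT = re.compile(r"KNC-.*?SUP-.*?-(.*?)(?:-|$)")

long_code = "KNC-GUK-FUK-GEN-SUP-MRO-ARALDITE-MRO"
short_code = "KNC-GUK-FUK-GEN-SUP-MRO-ARALDITE"

original_long = ORIGINAL.search(long_code).group(1)
original_short = ORIGINAL.search(short_code)  # no trailing hyphen to match
variant_results = [VARIANT.search(code).group(1) for code in (long_code, short_code)]

print(original_long, original_short, variant_results)
```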
Regards
I am trying to formulate a regexp for a password field, which accepts at least one special character and one alpha numeric character.
I have already tried with this regexp ((?=.*\d)(?=.*[a-zA-Z])(?=.*\W)) on Rubular, which I cooked up. But it's not working properly.
Test String : test#123
Kindly suggest a way to overcome this.
If you can please give some explanation as well.
Your regex actually does match your test string. It seems that you want the string to end up in your capture group, though, since you wrapped the lookaheads in parentheses.
Wrapping a capture group around your lookaheads won't capture anything, because lookaheads are zero-width: they only look ahead to verify. You'll have to add a capture group that captures the entire string afterwards, like this:
^(?=.*\d)(?=.*[a-zA-Z])(?=.*\W)(.{6,20})$
The ^ and $ anchor the match to the entire string. The . within the capture group () matches any character, so (.{6,20}) grabs the whole string, and the {6,20} requires it to be between 6 and 20 characters long. You can change the numbers if you want.
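Here is a short Python sketch of both behaviours. The extra test strings besides test#123 are made up for illustration. Keep in mind that \W means any non-word character, so an underscore does not count as special here.

```python
import re

ORIGINAL = re.compile(r"((?=.*\d)(?=.*[a-zA-Z])(?=.*\W))")
FIXED = re.compile(r"^(?=.*\d)(?=.*[a-zA-Z])(?=.*\W)(.{6,20})$")

# The original matches, but its group is empty because lookaheads consume nothing.
empty_capture = ORIGINAL.search("test#123").group(1)

# The fixed version captures the whole password when all conditions hold.
accepted = FIXED.match("test#123").group(1)
no_special = FIXED.match("test1234")  # no non-word character
too_short = FIXED.match("ab#1")       # fewer than 6 characters

print(repr(empty_capture), accepted, no_special, too_short)
```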
Rubular
I have a txt file like this:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
I'm trying to delete all subdomains with this regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Result:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why does the last line become com.ua instead of jwbefw.com.ua?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. As some TLDs (top-level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLDs in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua the regex engine will first fail to match a,
then it will move the token one place to the left to look at "ua"
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because ".ua" met that requirement.
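The backtracking described above can be observed in Python by looking at what each group ends up holding. This is a sketch using the question's regex unchanged; the variable names are illustrative.

```python
import re

# The question's regex: greedy .+ followed by the three nested groups.
GREEDY = re.compile(r".+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$")

match = GREEDY.search("srfsf.jwbefw.com.ua")

# The greedy .+ gives back as little as it can, so "ua" is accepted as the suffix
# and "com" is left over as the domain name.
kept = match.group(1)
name = match.group(2)
suffix = match.group(3)

print(kept, name, suffix)
```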
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, with .+? matching one more character and then evaluating the rest of the regex again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
But the implementation of ? also creates issues.
Adding in the question mark makes that first .+ lazy, but then causes group1 to match bb.prontube.ru instead of prontube.ru
This is because that first period after the bb will match, then inside group 1 (.*?) will match bb.prontube. before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change the inner group from (.*?) to ([\w-]*?) so that it cannot match a period, only word characters (letters, digits, underscore) or a dash.
resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.
last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
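To see the final regex at work on the whole list from the question, here is a minimal Python sketch. It applies the pattern line by line with re.sub, which is an assumption about how the replacement would be run outside a text editor; it also includes the lazy-only intermediate regex for comparison.

```python
import re

# Final regex from this answer, and the lazy-only intermediate step.
FINAL = re.compile(r".+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$")
LAZY_ONLY = re.compile(r".+?\.((.*?)\.(ru|ua|com\.ua|com|net|info))$")

# The host list from the question.
hosts = [
    "ххх.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

# Hosts without a subdomain do not match and are returned unchanged.
cleaned = [FINAL.sub(r"\1", host) for host in hosts]

# The lazy-only version keeps one label too many for nested subdomains.
lazy_only_result = LAZY_ONLY.sub(r"\1", "xx.bb.prontube.ru")

print(cleaned)
print(lazy_only_result)
```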
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
Look at the picture above to see what each regex capture refers to.
It is color-coded so that you will not need a regex explanation anymore.