Regex selecting multiple fields

From the following example pattern, I want to select the first 3 entries in the line.
Say:
timestamp
hostname
the first word after the hostname
Example pattern:
2017-04-24T09:20:01.687387+00:00 aabvabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop
2017-04-24T09:20:01.687387+00:00 aacdefabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop
I have used following regex and it worked fine.
REGEX 1 - ^(?:[^\s]*\s){1}([^\s]*) - to select the timestamp and hostname.
REGEX 2 - ^(?:[^\s]*\s){2}([^\s]\w+) - to select the word after the hostname.
2017-04-24T09:20:01.687387+00:00 hostd probing is done Fdm: sslThumbprint>95:43:64:71:A3:60:D8:17:C8:6F:68:83:92:CE:E4:3B:53:4E:1D:AD10.199.6.5a2:0e:09:01:0a:00a2:0e:09:01:0b:01/vmfs/volumes/b01f388c-aaa4889f/vmfs/volumes/6ad2d8d7-86746df14435.5.03568722host-619286aabvabcs16.def.co.uk
But the log line above causes a problem: because it is not in standard syslog format, "hostd" is picked up as the hostname.
I would like a regex that selects only the log lines that have a timestamp as the first entry and a hostname as the second entry (it always ends with .def.co.uk) and, if both conditions are satisfied, selects the third entry.
How can I achieve this?

^(\S+[^\s])\s(\w+\.def.co.uk)\s(.+?)\s
Breakdown:
(\S+[^\s])\s captures the date and timestamp and leaves out the space after it
(\w+\.def.co.uk)\s captures the hostname only if it looks like something.def.co.uk, and leaves the space out again
(.+?) non-greedily captures the first word (assuming "word" means no space in between)
EDIT :
If you also want the date and the time in their own capture groups, it should look like this:
^(\S+)(T\S+)\s(\w+\.def.co.uk)\s(.+?)\s
Hope this helps!
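As a quick check, the first pattern can be run against the log lines from the question. This is a minimal Python sketch; the variable names are illustrative and the non-standard line is shortened, so treat it as an assumption-laden test rather than part of the original answer.

```python
import re

# Pattern from the answer above: timestamp, hostname ending in .def.co.uk, first word.
PATTERN = re.compile(r"^(\S+[^\s])\s(\w+\.def.co.uk)\s(.+?)\s")

standard = ("2017-04-24T09:20:01.687387+00:00 aabvabcw74.def.co.uk "
            "hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop")
# Shortened version of the non-standard line from the question.
nonstandard = "2017-04-24T09:20:01.687387+00:00 hostd probing is done Fdm: sslThumbprint>95:43"

good = PATTERN.match(standard)     # matches: second field ends with .def.co.uk
bad = PATTERN.match(nonstandard)   # no match: second field is just "hostd"

print(good.groups())
print(bad)
```

Note that the third group keeps the trailing colon ("hostd-probe:"), and that the dots in .def.co.uk are unescaped in the pattern, so they match any character; escaping them (\.def\.co\.uk) would be stricter.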

Related

Regex Group Name prefix multiple options

I'm performing regex extraction for parsing logs for our SIEM. I'm working with PCRE2.
In those logs, I have this problem: I have to extract a field that can be preceded by multiple options, and I want to use only one group name.
Let me be clearer with an example.
The SSH connection can appear in our log with this form:
UserType=SSH,
And I know that a simple regex expression to catch this is:
UserType=(?<app>.*?),
But, at the same time, SSH can appear with another "prefix":
ACCESS TYPE:SSH;
that can be captured with:
ACCESS\sTYPE:(?<app>.*?);
Now, because the logical field is the same (SSH protocol) and I want to map it in every case under the group name "app", is there a way to OR the previous prefixes together and use the same group name?
The desired final result is something like:
(UserType=) OR (ACCESS TYPE:) <field_value_here>
You can use
(?:UserType=|ACCESS\sTYPE:)(?<app>[^,;]+)
See the regex demo. Details:
(?:UserType=|ACCESS\sTYPE:) - either UserType= or ACCESS + whitespace + TYPE:
(?<app>[^,;]+) - Group "app": one or more chars other than , and ;.
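The alternation can be tried out quickly outside the SIEM. The sketch below uses Python's re module purely for illustration, which is an assumption on my part since the question targets PCRE2; Python spells named groups as (?P<app>...), so the group syntax differs slightly from the answer.

```python
import re

# Python's re module spells named groups (?P<name>...); PCRE2 also accepts (?<name>...).
APP = re.compile(r"(?:UserType=|ACCESS\sTYPE:)(?P<app>[^,;]+)")

# The two log forms from the question.
samples = ["UserType=SSH,", "ACCESS TYPE:SSH;"]

# Both prefixes feed the same named group.
apps = [APP.search(s).group("app") for s in samples]
print(apps)
```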

Validating MS Teams channel names, team name and channel address in one go //OwMyHead

Validating MS Teams channel names, team name and channel address in one go!
Regex1:
^(?![\s._])([^~#%&*{}+:<>?|\n]{1,50})(?<![.])( - ).{1,256}$
Which would validate this successfully:
MonkeyChannel - MonkeyTeam
but I need to check also that it doesn't contain the channel address like so:
MonkeyChannel - MonkeyTeam <12337aab.domain.com#emea.teams.ms>
so basically I'm thinking I need to incorporate this which looks for a channel address:
Regex2:
(?<![[a-z0-9]{8}\.domain\.com#emea\.teams\.ms])
into this somehow:
Regex1:
^(?![\s._])([^~#%&*{}+:<>?|\n]{1,50})(?<![.])( - ).{1,256}$
My guess looks something like this but its not working:
Regex3:
^(?![\s._])([^~#%&*{}+:<>?|\n]{1,50})(?<![.])( - ).{1,256}(?<![[a-z0-9]{8}\.domain\.com#emea\.teams\.ms])$
Any regex wizards who can spot the error of my ways?
You could either add the negative lookbehind after the end anchor (note that the address is wrapped in <...> and not [...])
^(?![\s._])[^~#%&*{}+:<>?|\n]{1,50}(?<![.]) - .{1,256}$(?<!<[a-z0-9]{8}\.domain\.com#emea\.teams\.ms>)
Regex demo
The other way around is using a negative lookahead after matching -
^(?![\s._])[^~#%&*{}+:<>?|\n]{1,50}(?<![.]) - (?!.*<[a-z0-9]{8}\.domain\.com#emea\.teams\.ms>).{1,256}$
Regex demo
You can add capture groups accordingly if you want to process the separate parts afterwards.
If you still want to match the first part, you can make the second part optional and first assert that it does not start with the unwanted address
^(?![\s._])[^~#%&*{}+:<>?|\n]{1,50}(?<![.]) - [^<>\n]*(?:<(?![a-z0-9]{8}\.domain\.com#emea\.teams\.ms>).{1,256})?
Regex demo
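To compare the lookbehind and lookahead variants side by side, here is a small Python sketch. The helper names (ADDRESS, HEAD, results) are made up for this example; the patterns themselves are the first two from the answer, assembled from shared pieces.

```python
import re

# Shared pieces of the two patterns above.
ADDRESS = r"<[a-z0-9]{8}\.domain\.com#emea\.teams\.ms>"
HEAD = r"^(?![\s._])[^~#%&*{}+:<>?|\n]{1,50}(?<![.]) - "

# Variant 1: negative lookbehind placed after the end anchor.
lookbehind = re.compile(HEAD + r".{1,256}$(?<!" + ADDRESS + r")")
# Variant 2: negative lookahead right after " - ".
lookahead = re.compile(HEAD + r"(?!.*" + ADDRESS + r").{1,256}$")

ok = "MonkeyChannel - MonkeyTeam"
with_address = "MonkeyChannel - MonkeyTeam <12337aab.domain.com#emea.teams.ms>"

# For each variant: (matches the plain name, matches the name with address).
results = {
    name: (bool(rx.search(ok)), bool(rx.search(with_address)))
    for name, rx in (("lookbehind", lookbehind), ("lookahead", lookahead))
}
print(results)
```

Both variants should accept the plain "channel - team" string and reject the one that carries the channel address. The lookbehind works in Python only because the address pattern has a fixed width.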

Consolidated RegEx to parse syslog data

Goal
I am trying to craft a RegEx that will parse out specific data from various syslog entries that contain subtle differences in logged content. While I am able to accomplish my goal using multiple RegEx statements, if possible, I would like to combine these statements into a single consolidated RegEx.
Log entries
The main issue I'm having is that some log entries have a URL that needs to be parsed to a named group and other log entries do not have any URL. Examples of these two different log entries are provided below.
Entry with URL
Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from 178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; Client host [83.59.180.178] blocked using b.barracudacentral.org; http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>
Entry without URL
Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> proto=ESMTP helo=<mail1.sendersrv.com>
RegEx statements
In the RegEx statements that follow, the first two are what I currently use for each of the previous log messages. The third RegEx is my attempt at consolidating these both into a single RegEx that will parse data from either log message. My attempt was to use a conditional statement that would basically check for the existence of http(s) and if found, then to parse the URL to a named group. If http(s) was not found, then it would parse out everything until the next RegEx token.
The issue is that when I test the RegEx against a log entry that has a URL, the RegEx does not seem to find http(s) despite this token being set as optional (i.e. using the ? quantifier). However, if I remove the ? quantifier, it does find http(s) and then parses the URL as desired. However, without the quantifier, the RegEx does not work with log entries that do not have a URL.
Parse entries with URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);.+https?:\/{2}(?P<entryUrl>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Parse entries without URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Attempt at consolidating RegEx
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+)(?<=[a-z]);.+(https?:\/{2})?(?(5)(?P<entryUrl>.+)|.+)to=\<(?P<destEm>.+)>.+$
I'm sure the issue is my misunderstanding as to how the conditional statements and the ? quantifier works.
Looking at your patterns, the email address for to: is between tags < and > but due to the formatting in the question they are not shown.
The parts in your pattern like .+ first match until the end of the string, and will then backtrack and try to match the rest of the pattern.
You can make the pattern a bit more performant by making the parts that you know more specific.
For the datetime, you can make the pattern match the specified format instead of .+ using ^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2})
For (?P<blkList>[^;]+) and (?P<entryUrl>[^;]+) you can use a negated character class matching any char except ;
For group (?P<destEm>[^<>\s]+) you can exclude matching tags.
To match the URL, instead of using a conditional you can make the group optional using ?
For example
^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>
See a regex demo.
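The pattern above can be checked against both log entries from the question. This is a minimal Python sketch; the variable names are illustrative, and Python's re module is assumed only because it accepts the same (?P<name>...) syntax.

```python
import re

# The consolidated pattern from the answer, split across lines for readability.
PATTERN = re.compile(
    r"^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? "
    r"RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);"
    r"(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>"
)

with_url = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; "
    "Client host [83.59.180.178] blocked using b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP "
    "helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)
without_url = (
    "Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from "
    "mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; "
    "Client host [185.3.229.125] blocked using bl.spamcop.net; "
    "from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> proto=ESMTP "
    "helo=<mail1.sendersrv.com>"
)

with_url_fields = PATTERN.match(with_url).groupdict()
without_url_fields = PATTERN.match(without_url).groupdict()

print(with_url_fields["entryUrl"])     # URL without the scheme
print(without_url_fields["entryUrl"])  # None: the optional group did not participate
```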
Have you tried testing your regex on a page like regex101?
to=\<(?P<destEm>.+)> doesn't seem to match your examples. You should either remove <> or replace to with helo. Be careful to make your quantifier lazy after blkList otherwise you might catch too much text.
You can then make your url optional with ? and it should work in both cases:
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+?);(.+https?:\/{2}(?P<entryUrl>.+);\s)?.+\sto=(?P<destEm>.+?)\s.*$
One approach would be to replace in the first regex .+https?:\/{2}(?P<entryUrl>.+); with (?:.+https?:\/{2}(?P<entryUrl>.+);)? where ?: indicates that it is a non-capturing group and the ? at the end means that it is optional.
However, it still does not work because .+ is greedy, so use lazy .+? instead.
Final regex:
^(?P<datetime>.+?) host1 postfix.+?RCPT from (?P<srcDns>.+?)\[(?P<srcIp>[0-9\.]+)\]:.+?blocked using (?P<blkList>.+?);(?:.+?https?:\/{2}(?P<entryUrl>.+?);)?\s.+?\sto=\<(?P<destEm>.+?)>.+?$
https://regex101.com/r/QkmXWz (to see it in action)
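The effect of greedy versus lazy quantifiers around the optional group can be seen by running both versions on the entry that contains a URL. This is a rough Python sketch; the names greedy, lazy and entry are made up for the example.

```python
import re

# First regex from the question with the URL part wrapped in an optional group (still greedy).
greedy = re.compile(
    r"^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:"
    r".+blocked using (?P<blkList>.+);(?:.+https?:\/{2}(?P<entryUrl>.+);)?"
    r"\s.+\sto=\<(?P<destEm>.+)>.+$"
)
# Final regex from the answer: the same structure with lazy quantifiers.
lazy = re.compile(
    r"^(?P<datetime>.+?) host1 postfix.+?RCPT from (?P<srcDns>.+?)\[(?P<srcIp>[0-9\.]+)\]:"
    r".+?blocked using (?P<blkList>.+?);(?:.+?https?:\/{2}(?P<entryUrl>.+?);)?"
    r"\s.+?\sto=\<(?P<destEm>.+?)>.+?$"
)

entry = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; "
    "Client host [83.59.180.178] blocked using b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP "
    "helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)

greedy_fields = greedy.match(entry).groupdict()
lazy_fields = lazy.match(entry).groupdict()

# Greedy: blkList swallows the URL, so the optional group is skipped.
print(greedy_fields["blkList"], greedy_fields["entryUrl"])
# Lazy: blkList stops at the first ";", leaving the URL for entryUrl.
print(lazy_fields["blkList"], lazy_fields["entryUrl"])
```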

regex group matching based on first entry

As part of regex match, I am trying to select development / product based on first entry being dd-develop / dd.
eg.
The code below always matches development, whether the first string entry is "dd-develop" or just "dd".
I want to determine the second or third word based on the first value.
Any ideas?
Regex: (?(?=) (?:development) | (?:product))
Text: dd-develop development product.
From the looks of it, you're trying to decide whether to capture "development" or "product" based on the first word. This regex does that:
(?:dd-develop .*(development).*)|(?:dd .*(product).*)
If your string starts with dd-develop, it captures "development". If it starts with dd, it captures "product". To reverse this, just switch the words in the capture group.
Try it here!
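A short Python sketch of how the two capture groups could be read back. The pick helper is hypothetical, and the pattern is written with both outer groups non-capturing so that group 1 is "development" and group 2 is "product".

```python
import re

# Both outer groups are non-capturing; only the inner words are captured.
PATTERN = re.compile(r"(?:dd-develop .*(development).*)|(?:dd .*(product).*)")

def pick(text):
    """Return the captured word for the given line, or None if nothing matches."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # Only one of the two groups participates in any given match.
    return m.group(1) or m.group(2)

print(pick("dd-develop development product."))
print(pick("dd development product."))
```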

Regex: select the XML messages and time stamp from the log

I am streaming logs into nxlog and need to push the XML messages to the nxlog server. To select the XML message, I use:
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3})(.*)(my sentence 1....|my sentence 2 : [\S+\s+]*>\n)(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3})
But I am not able to select all the XML messages from the logs.
https://regex101.com/r/iA8qE5/5
In your regex you have to close the alternation using ) after:
(Message Picked from the queue....|Response Message :
Using a + inside the character class would have a different meaning and would match a plus sign literally. The plus is greedy so you have to make it non greedy using a question mark to let [\S\s]+ not match all lines.
Update [\S+\s+]*>\n)
to
)([\S\s]+?>)\n
Your match is in the 4th capturing group.
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3})(.*)(Message Picked from the queue....|Response Message : )([\S\s]+?>)\n(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3})
Regex demo
Note that if you don't need all the capturing groups, you can also omit them and take only the first capturing group.
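To illustrate how the lazy [\S\s]+? stops at the end of the XML block, here is a Python sketch. The log excerpt is entirely hypothetical, since the real log from the question is only available through the linked demo, and the names TS, PATTERN and log are made up for the example.

```python
import re

# Timestamp sub-pattern reused at both ends of the full pattern from the answer.
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3}"
PATTERN = re.compile(
    "(" + TS + r")(.*)(Message Picked from the queue....|Response Message : )"
    r"([\S\s]+?>)\n(" + TS + ")"
)

# Hypothetical log excerpt, invented for this example.
log = (
    "2016-03-01 10:00:00.123 INFO Response Message : <a>\n"
    "<b>1</b>\n"
    "</a>\n"
    "2016-03-01 10:00:01.456 INFO done\n"
)

m = PATTERN.search(log)
xml = m.group(4)  # the multi-line XML message
print(xml)
```

The lazy quantifier keeps extending the match one character at a time until a ">" is followed by a newline and the next timestamp, which is why the whole XML block ends up in group 4.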
This captures the date from the starting line, the message, and the XML. It uses the gms flags:
^([\d-\.\s\:]+)\s.*?-\s([\w\s:\.]+)(<\w+.*?)\n\d{4}
Date and XML only:
^([\d-\.\s\:]+)\s.*?(<\w+.*?)\n\d{4}