I'm performing regex extraction for parsing logs for our SIEM. I'm working with PCRE2.
In those logs, I need to extract a field that can be preceded by one of several prefixes, and I want to use a single group name for it.
Let me be clearer with an example.
The SSH connection can appear in our log with this form:
UserType=SSH,
And I know that a simple regex expression to catch this is:
UserType=(?<app>.*?),
But, at the same time, SSH can appear with another "prefix":
ACCESS TYPE:SSH;
that can be captured with:
ACCESS\sTYPE:(?<app>.*?);
Now, because the logical field is the same (the SSH protocol) and I want to map it under the group name "app" in every case, is there a way to OR the two prefixes together and use the same group name?
The desired final result is something like:
(UserType=) OR (ACCESS TYPE:) <field_value_here>
You can use
(?:UserType=|ACCESS\sTYPE:)(?<app>[^,;]+)
See the regex demo. Details:
(?:UserType=|ACCESS\sTYPE:) - either UserType= or ACCESS + whitespace + TYPE:
(?<app>[^,;]+) - Group "app": one or more chars other than , and ;.
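As a quick check outside the SIEM, here is a minimal Python sketch of the same alternation. It assumes Python's re module rather than PCRE2, so the named group is spelled (?P<app>...); the helper name extract_app is made up for illustration.

```python
import re

# Python's re module spells named groups (?P<name>...);
# PCRE2 accepts the (?<name>...) form used in the answer.
APP_RE = re.compile(r"(?:UserType=|ACCESS\sTYPE:)(?P<app>[^,;]+)")


def extract_app(line):
    """Return the value after either prefix, or None if neither prefix occurs."""
    m = APP_RE.search(line)
    return m.group("app") if m else None


print(extract_app("UserType=SSH,"))
print(extract_app("ACCESS TYPE:SSH;"))
```

Both sample lines from the question yield the same "app" value, because the two prefixes sit in a non-capturing group and only the value is captured.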
I have filenames in format <pod-name>_<namespace-name>_<container-name>-<dockerid>.log
For example:
pod-name_namespace-name_container-name-7a1d0ed5675bdb365228d43f470fcee20af5c8bea84dd6d886b9bf837a9d358c.log
pod-name_namespace-name-1234567890_container-name-7a1d0ed5675bdb365228d43f470fcee20af5c8bea84dd6d886b9bf837a9d358c.log
These are Kubernetes container log files.
The namespace-name may contain a numeric suffix that represents the automation system's run id (github.run_id, a 10-digit number).
I need to parse filenames with regex to extract pod name, namespace name without run id, run id, container name and docker id.
The regex below is based on the default Fluent Bit Kubernetes parser, which I need to adapt for our use:
(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)(-(?<run_id>\d{10,}))_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
https://rubular.com/r/CROBxpHHgX5UZx
The regex above correctly parses filenames whose namespace contains a run id, but fails on a namespace without a run id:
pod-name_namespace-name_container-name-7a1d0ed5675bdb365228d43f470fcee20af5c8bea84dd6d886b9bf837a9d358c.log
https://rubular.com/r/6MSQsnuGzrkVJG
In this case, run_id should be an empty string.
How can I fix it so that it matches both cases?
You can use
(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+?)(-(?<run_id>\d{10,}))?_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
See the regex demo.
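The main point is to make two changes in the (?<namespace_name>[^_]+)(-(?<run_id>\d{10,})) part:
make the [^_]+ pattern lazy by adding a ? after the +, so that it matches as few characters other than _ as possible;
make the (-(?<run_id>\d{10,})) part optional by adding a ? quantifier after the group.
Without the lazy quantifier, the greedy [^_]+ would consume the run id as part of the namespace name and leave the optional group empty.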
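The two changes can be tried out with a short Python sketch. It assumes Python's re module, so named groups are written (?P<name>...) instead of the (?<name>...) form used by the Fluent Bit parser; the parse helper is hypothetical. Python reports a skipped optional group as None, so the sketch normalizes that to an empty string.

```python
import re

FILENAME_RE = re.compile(
    r"(?P<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)"
    r"_(?P<namespace_name>[^_]+?)(-(?P<run_id>\d{10,}))?"  # lazy namespace, optional run id
    r"_(?P<container_name>.+)-(?P<docker_id>[a-z0-9]{64})\.log$"
)

DOCKER_ID = "7a1d0ed5675bdb365228d43f470fcee20af5c8bea84dd6d886b9bf837a9d358c"


def parse(filename):
    """Return the named fields as a dict, or None if the filename does not match."""
    m = FILENAME_RE.search(filename)
    if m is None:
        return None
    fields = m.groupdict()
    # An unmatched optional group is None in Python; map it to "".
    fields["run_id"] = fields["run_id"] or ""
    return fields


with_run_id = parse(f"pod-name_namespace-name-1234567890_container-name-{DOCKER_ID}.log")
without_run_id = parse(f"pod-name_namespace-name_container-name-{DOCKER_ID}.log")
print(with_run_id)
print(without_run_id)
```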
Goal
I am trying to craft a RegEx that will parse out specific data from various syslog entries that contain subtle differences in logged content. While I am able to accomplish my goal using multiple RegEx statements, if possible, I would like to combine these statements into a single consolidated RegEx.
Log entries
The main issue I'm having is that some log entries have a URL that needs to be parsed to a named group and other log entries do not have any URL. Examples of these two different log entries are provided below.
Entry with URL
Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from 178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; Client host [83.59.180.178] blocked using b.barracudacentral.org; http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>
Entry without URL
Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> proto=ESMTP helo=<mail1.sendersrv.com>
RegEx statements
In the RegEx statements that follow, the first two are what I currently use for each of the previous log messages. The third RegEx is my attempt at consolidating these both into a single RegEx that will parse data from either log message. My attempt was to use a conditional statement that would basically check for the existence of http(s) and if found, then to parse the URL to a named group. If http(s) was not found, then it would parse out everything until the next RegEx token.
The issue is that when I test the RegEx against a log entry that has a URL, the RegEx does not seem to find http(s) despite this token being set as optional (i.e. using the ? quantifier). However, if I remove the ? quantifier, it does find http(s) and then parses the URL as desired. However, without the quantifier, the RegEx does not work with log entries that do not have a URL.
Parse entries with URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);.+https?:\/{2}(?P<entryUrl>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Parse entries without URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Attempt at consolidating RegEx
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+)(?<=[a-z]);.+(https?:\/{2})?(?(5)(?P<entryUrl>.+)|.+)to=\<(?P<destEm>.+)>.+$
I'm sure the issue is my misunderstanding of how conditional statements and the ? quantifier work.
Looking at your patterns, the email address for to: is between tags < and > but due to the formatting in the question they are not shown.
Parts of your pattern like .+ first match all the way to the end of the string, and then backtrack to try to match the rest of the pattern.
You can make the pattern a bit more performant by making the parts whose format you know more specific.
For the datetime, you can make the pattern match the specified format instead of .+ using ^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2})
For (?P<blkList>[^;]+) and (?P<entryUrl>[^;]+) you can use a negated character class matching any char except ;
For the group (?P<destEm>[^<>\s]+) you can exclude the angle brackets and whitespace.
To match the URL, instead of using a conditional you can make the group optional using ?
For example
^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>
See a regex demo.
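To see the optional group at work on both sample entries, here is a minimal Python sketch. It assumes Python's re module, which accepts the (?P<name>...) syntax used above, and copies the two log lines from the question unchanged.

```python
import re

LOG_RE = re.compile(
    r"^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? RCPT from "
    r"(?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);"
    r"(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?"  # the whole URL part is optional
    r"\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>"
)

WITH_URL = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 "
    "Service unavailable; Client host [83.59.180.178] blocked using "
    "b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> "
    "proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)
WITHOUT_URL = (
    "Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from "
    "mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; "
    "Client host [185.3.229.125] blocked using bl.spamcop.net; "
    "from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> "
    "proto=ESMTP helo=<mail1.sendersrv.com>"
)

with_url = LOG_RE.search(WITH_URL).groupdict()
without_url = LOG_RE.search(WITHOUT_URL).groupdict()
print(with_url)
print(without_url)  # entryUrl is None when the optional group is skipped
```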
Have you tried testing your regex on a site like regex101?
to=\<(?P<destEm>.+)> doesn't seem to match your examples. You should either remove <> or replace to with helo. Be careful to make your quantifier lazy after blkList otherwise you might catch too much text.
You can then make your url optional with ? and it should work in both cases:
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+?);(.+https?:\/{2}(?P<entryUrl>.+);\s)?.+\sto=(?P<destEm>.+?)\s.*$
One approach would be to replace in the first regex .+https?:\/{2}(?P<entryUrl>.+); with (?:.+https?:\/{2}(?P<entryUrl>.+);)? where ?: indicates that it is a non-capturing group and the ? at the end means that it is optional.
However, it still does not work because .+ is greedy, so use lazy .+? instead.
Final regex:
^(?P<datetime>.+?) host1 postfix.+?RCPT from (?P<srcDns>.+?)\[(?P<srcIp>[0-9\.]+)\]:.+?blocked using (?P<blkList>.+?);(?:.+?https?:\/{2}(?P<entryUrl>.+?);)?\s.+?\sto=\<(?P<destEm>.+?)>.+?$
https://regex101.com/r/QkmXWz (to see it in action)
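The greedy-versus-lazy point can be checked with a small Python sketch. To keep it short, it uses only the tail of the patterns (from "blocked using" onward), which is a simplification and not the full regexes above, and it runs against the URL entry from the question.

```python
import re

LINE = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 "
    "Service unavailable; Client host [83.59.180.178] blocked using "
    "b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> "
    "proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)

# Greedy quantifiers: blkList runs up to the LAST ';', swallowing the URL,
# so the optional group has nothing left to match and is skipped.
GREEDY = re.compile(
    r"blocked using (?P<blkList>.+);(?:.+https?:\/{2}(?P<entryUrl>.+);)?\s.+\sto=<(?P<destEm>.+)>"
)
# Lazy quantifiers: each part stops at the first place the rest can match.
LAZY = re.compile(
    r"blocked using (?P<blkList>.+?);(?:.+?https?:\/{2}(?P<entryUrl>.+?);)?\s.+?\sto=<(?P<destEm>.+?)>"
)

greedy = GREEDY.search(LINE)
lazy = LAZY.search(LINE)
print(greedy.group("blkList"), greedy.group("entryUrl"))
print(lazy.group("blkList"), lazy.group("entryUrl"))
```

With the greedy version the match still succeeds, which is why the problem is easy to miss: the URL simply ends up inside blkList and entryUrl stays unset.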
I would like to use a regular expression to extract the content between brackets that follows a specific string, taking only the first match.
Example text:
**-n --command PING being applied--:
Wed May 34 7:23:18 2010
[ZZZ_6323] Command [ping] failed with error [[TEZZZGH_IUE] [[EIJERTMMMMIJE_EIEJ] gdyugedyue Service [ABC] is not available in domain [DEF]. Check the content and review diejidjei. Service [ABC] Domain [DEF] ] did not ping back. It might be due to one of the following reasons:
=> Reason1
=> Reason3
=> Reason 4: deijdije djkeoidjeio.
info=4343 day=Mon year=2010*
I would like to extract the string between [] that follows the string Service, and only the first match, since Service can appear again later. In this case the result should be ABC.
Could someone help me?
I am not able to combine these three conditions.
Thanks
Assuming that you don't care about capturing square brackets inside the [ ] pair, by far the easiest way to do this is to use the following simple regex:
Service (\[[^\]]*\])
and extract only the 1st capturing group from the result using whatever regex functionality you're using. For example, using JS, you would write
string.match(/Service (\[[^\]]*\])/)[1]
to extract the first capturing group.
If you instead want a regex that will only capture the first occurrence, you can exploit the greedy nature of the * quantifier and change the regex to this:
Service (\[[^\]]*\]).*
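The effect of the trailing .* can be seen with a short Python sketch. It assumes Python's re module and uses an excerpt of the example line from the question. Since . does not match a newline by default, the trick only suppresses further matches on the same line.

```python
import re

# Excerpt of the example line; "Service [ABC]" occurs twice in it.
TEXT = (
    "gdyugedyue Service [ABC] is not available in domain [DEF]. "
    "Check the content and review diejidjei. "
    "Service [ABC] Domain [DEF] ] did not ping back."
)

# Without the trailing .*, a global search finds every occurrence.
all_matches = re.findall(r"Service (\[[^\]]*\])", TEXT)

# With the trailing .*, the first match consumes the rest of the line,
# so a global search finds only the first occurrence.
first_only = re.findall(r"Service (\[[^\]]*\]).*", TEXT)

print(all_matches)
print(first_only)
```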
Service \[([^\]]+)\]
will match Service [anything besides brackets] and capture anything besides brackets in group number 1. Since regex engines work left-to-right, the first match will be the leftmost match.
Test it live on regex101.com.
In PHP, you could do this (code snippet generated by RegexBuddy):
if (preg_match('/Service \[([^\]]+)\]/', $subject, $groups)) {
    $result = $groups[1];
} else {
    $result = "";
}
How should I write the definition of the group name? I know that it can be like this: (?) but I don't know how to combine it with this part, Service \[([^\]]+)\], in a single expression.
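For the named-group question, one possible combination is sketched below in Python. The group name service is an arbitrary choice for illustration, and Python spells the construct (?P<name>...); PCRE-style engines such as PHP's preg functions also accept (?<name>...).

```python
import re

# Excerpt of the example line from the question.
TEXT = (
    "gdyugedyue Service [ABC] is not available in domain [DEF]. "
    "Check the content and review diejidjei. "
    "Service [ABC] Domain [DEF] ] did not ping back."
)

# The numbered group ([^\]]+) simply becomes a named group.
SERVICE_RE = re.compile(r"Service \[(?P<service>[^\]]+)\]")

# search() returns the leftmost match, i.e. the first occurrence.
m = SERVICE_RE.search(TEXT)
service = m.group("service") if m else None
print(service)
```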
From the following example pattern, I want to select the first 3 entries in the line.
Say:
timestamp
hostname
the first word after the hostname
Example pattern:
2017-04-24T09:20:01.687387+00:00 aabvabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop
2017-04-24T09:20:01.687387+00:00 aacdefabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop
I have used the following regexes and they worked fine.
REGEX 1 - ^(?:[^\s]*\s){1}([^\s]*) - to select the timestamp and hostname.
REGEX 2 - ^(?:[^\s]*\s){2}([^\s]\w+) - to select the word after the hostname.
2017-04-24T09:20:01.687387+00:00 hostd probing is done Fdm: sslThumbprint>95:43:64:71:A3:60:D8:17:C8:6F:68:83:92:CE:E4:3B:53:4E:1D:AD10.199.6.5a2:0e:09:01:0a:00a2:0e:09:01:0b:01/vmfs/volumes/b01f388c-aaa4889f/vmfs/volumes/6ad2d8d7-86746df14435.5.03568722host-619286aabvabcs16.def.co.uk
But the log line above causes a problem: because it is not in the standard syslog format, "hostd" is picked up as the hostname.
I would like a regex that selects only logs that have a timestamp as the first entry and a hostname as the second entry (it always ends with .def.co.uk), and, if both conditions are satisfied, selects the third entry.
How can I achieve this?
^(\S+[^\s])\s(\w+\.def.co.uk)\s(.+?)\s
Breakdown:
(\S+[^\s])\s captures the date and timestamp, and leaves out the space after it
(\w+\.def.co.uk)\s captures the hostname only if it is something.def.co.uk, and leaves the space out again
(.+?) non-greedily captures the first word (assuming a word means no space in between)
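The breakdown can be exercised with a minimal Python sketch. It assumes Python's re module and escapes the literal dots in the hostname suffix (\.def\.co\.uk), a small tightening that is not in the pattern above. The non-standard line is an excerpt of the one from the question; only its beginning matters because the pattern is anchored at the start.

```python
import re

SYSLOG_RE = re.compile(r"^(\S+[^\s])\s(\w+\.def\.co\.uk)\s(.+?)\s")

STANDARD = (
    "2017-04-24T09:20:01.687387+00:00 aabvabcw74.def.co.uk hostd-probe: "
    "lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop"
)
# Excerpt of the non-standard line: the second entry is not a hostname.
NON_STANDARD = "2017-04-24T09:20:01.687387+00:00 hostd probing is done Fdm:"

standard = SYSLOG_RE.search(STANDARD)
non_standard = SYSLOG_RE.search(NON_STANDARD)

print(standard.groups())
print(non_standard)  # no match, so the line is skipped
```

Note that the third group keeps the trailing colon of hostd-probe:, since it captures everything up to the next whitespace.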
EDIT :
If you also want the date and time to be in their own capture groups, then it should be like this:
^(\S+)(T\S+)\s(\w+\.def.co.uk)\s(.+?)\s
Hope this helps!
I've got data coming from Kafka and I want to send it to Elasticsearch. I've got a log like this with tags:
<TOTO><ID_APPLICATION>APPLI_A|PRF|ENV_1|00</ID_APPLICATION><TN>3</TN></TOTO>
I'm trying to parse it with grok using grok debugger:
\<ID_APPLICATION\>%{WORD:APPLICATION}\|%{WORD:PROFIL}\|%{WORD:ENV}\|%{WORD:CODE}\</ID_APPLICATION\>\<TN\>%{NUMBER:TN}\</TN\>
It works, but sometimes the log has a new field like this (the one with the tag <TP>):
<TOTO><ID_APPLICATION>APPLI_A|PRF|ENV_1|00</ID_APPLICATION><TN>3</TN><TP>new</TP></TOTO>
I'd like to get lines with this field (the TP tag) and lines without. How can I do that?
If you have an optional field, you can match it with an optional named capturing group:
(?:<TP>%{WORD:TP}</TP>)?
^^^                    ^
The non-capturing group does not save any submatches in memory and is used for grouping only, and the ? quantifier matches 1 or 0 times (i.e. it makes the group optional). It will create a TP field with a value of type word. If the field is absent, the value will be null.
So, the whole pattern will look like:
<ID_APPLICATION>%{WORD:APPLICATION}\|%{WORD:PROFIL}\|%{WORD:ENV}\|%{WORD:CODE}</ID_APPLICATION><TN>%{NUMBER:TN}</TN>(?:<TP>%{WORD:TP}</TP>)?
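Outside of Logstash, the same optional-group idea can be sketched in plain Python. This is only an approximation: it assumes %{WORD:x} can be stood in for by (?P<x>\w+) and %{NUMBER:x} by (?P<x>\d+), which is narrower than grok's real NUMBER pattern.

```python
import re

# Rough regex stand-in for the grok pattern; \w+ and \d+ are simplifications.
LINE_RE = re.compile(
    r"<ID_APPLICATION>(?P<APPLICATION>\w+)\|(?P<PROFIL>\w+)\|(?P<ENV>\w+)\|(?P<CODE>\w+)"
    r"</ID_APPLICATION><TN>(?P<TN>\d+)</TN>"
    r"(?:<TP>(?P<TP>\w+)</TP>)?"  # optional field
)

WITHOUT_TP = "<TOTO><ID_APPLICATION>APPLI_A|PRF|ENV_1|00</ID_APPLICATION><TN>3</TN></TOTO>"
WITH_TP = "<TOTO><ID_APPLICATION>APPLI_A|PRF|ENV_1|00</ID_APPLICATION><TN>3</TN><TP>new</TP></TOTO>"

without_tp = LINE_RE.search(WITHOUT_TP).groupdict()
with_tp = LINE_RE.search(WITH_TP).groupdict()
print(without_tp)  # TP is None when the tag is absent
print(with_tp)
```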
This is the filter I used in the Heroku app, after reading this documentation on how to use grok operators.
I created my own named capture, called "content", that retrieves whatever is inside your TP tags.
\<ID_APPLICATION\>%{WORD:APPLICATION}\|%{WORD:PROFIL}\|%{WORD:ENV}\|%{WORD:CODE}\<\/ID_APPLICATION\>\<TN>%{NUMBER:TN}\<\/TN\>(\<TP\>(?<content>(.)*)\<\/TP\>)?
Basically, I just added an optional tag to your pattern.
(<TP> ... </TP>)?
To retrieve the content, which I assume can be anything, I added the following inside the optional tags.
(?<content>(.)*)