Using grok to match custom style email address - regex

I just set up an ELK stack for my apache logs. It's working great. Now I want to add maillogs to the mix, and I'm having trouble parsing the logs with grok.
I'm using this site to debug:
https://grokdebug.herokuapp.com/
Here is an example maillog (sendmail) entry:
Apr 24 19:38:51 ip-10-0-1-204 sendmail[9489]: w3OJco1s009487: to=<username#domain.us>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=120318, relay=webmx.bglen.net. [10.0.3.231], dsn=2.0.0, stat=Sent (Ok: queued as E2DEF60724), w3OJco1s009487: to=<username#domain.us>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=120318, relay=webmx.[redacted].net. [10.0.3.231], dsn=2.0.0, stat=Sent (Ok: queued as E2DEF60724)
From the text above, I want to pull out the text to=<username#domain.us>.
So far I have this for a grok pattern:
(?<mail_sent_to>[a-zA-Z0-9_.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*)
It gives me the result username#domain.us> which is nice, but I want it to have the to= on the front as well. And I only want this grok filter to match email addresses which have to= in front of them.
I tried this, but it gives me "no matches" as a result:
(?<mail_sent_to>"to="[a-zA-Z0-9_.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*)

You may use
\b(?<mail_sent_to>to=<[a-zA-Z0-9_.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.[0-9A-Za-z][0-9A-Za-z-]{0,62})*>)
or, since [a-zA-Z0-9_] matches the same chars as \w:
\b(?<mail_sent_to>to=<[\w.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.[0-9A-Za-z][0-9A-Za-z-]{0,62})*>)
See the regex demo.
Details
\b - a word boundary
(?<mail_sent_to> - "mail_sent_to" group:
to=< - a literal string to=<
[\w.+=:-]+ - 1+ word, ., +, =, : or - chars
# - a # char
[0-9A-Za-z] - an alphanumeric char
[0-9A-Za-z-]{0,62} - 0 to 62 letters, digits or -
(?:\.[0-9A-Za-z][0-9A-Za-z-]{0,62})* - 0+ sequences of
\. - a dot
[0-9A-Za-z] - an alphanumeric char
[0-9A-Za-z-]{0,62} - 0 to 62 letters, digits or -
> - a > char
) - end of the group.
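As a quick sanity check outside Logstash, the second pattern can be tried in Python. This is only a sketch: Python's re module spells named groups (?P<name>...) rather than (?<name>...), and grok runs on the Oniguruma engine, so behaviour may differ at the edges. The sample line is a shortened copy of the maillog entry from the question.

```python
import re

# Same pattern as above, rewritten with Python's (?P<name>...) group syntax.
MAIL_SENT_TO = re.compile(
    r"\b(?P<mail_sent_to>to=<[\w.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}"
    r"(?:\.[0-9A-Za-z][0-9A-Za-z-]{0,62})*>)"
)

# Shortened sample taken from the question's sendmail entry.
line = (
    "Apr 24 19:38:51 ip-10-0-1-204 sendmail[9489]: w3OJco1s009487: "
    "to=<username#domain.us>, delay=00:00:01, xdelay=00:00:01, mailer=smtp"
)

match = MAIL_SENT_TO.search(line)
captured = match.group("mail_sent_to") if match else None
print(captured)

# An address that is not preceded by to=< is not captured.
no_prefix = MAIL_SENT_TO.search("from=<username#domain.us>")
print(no_prefix)
```

The to=< and > literals sit inside the named group, which is why they appear in the captured value.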

This is much simpler: write a custom pattern that matches the literal to=< and >, and use the predefined EMAILADDRESS pattern to match the email address.
\b(?<mail_sent_to>to=<%{EMAILADDRESS}>)
This will output,
{
"mail_sent_to": [
[
"to=<username#domain.us>"
]
],
"EMAILADDRESS": [
[
"username#domain.us"
]
],
"EMAILLOCALPART": [
[
"username"
]
],
"HOSTNAME": [
[
"domain.us"
]
]
}
EDIT:
Patterns for email are,
EMAILLOCALPART [a-zA-Z][a-zA-Z0-9_.+-=:]+
EMAILADDRESS %{EMAILLOCALPART}#%{HOSTNAME}
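To illustrate how %{NAME} references nest, here is a rough Python sketch that expands them into one regular expression and names each group after its pattern, similar to what the debugger output above shows. The expand helper is hypothetical, the HOSTNAME definition is a simplified version written from memory of the stock grok patterns, and Python's re is not the engine grok uses, so treat this as an approximation.

```python
import re

# Pattern table: the two email patterns quoted above, plus an assumed,
# simplified HOSTNAME definition.
PATTERNS = {
    "EMAILLOCALPART": r"[a-zA-Z][a-zA-Z0-9_.+-=:]+",
    "HOSTNAME": (
        r"\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})"
        r"(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(?:\.?|\b)"
    ),
    "EMAILADDRESS": r"%{EMAILLOCALPART}#%{HOSTNAME}",
}


def expand(pattern: str) -> str:
    """Recursively replace %{NAME} with a named group holding its definition."""
    def replace(m: re.Match) -> str:
        name = m.group(1)
        return f"(?P<{name}>{expand(PATTERNS[name])})"
    return re.sub(r"%\{(\w+)\}", replace, pattern)


# The custom pattern from above, with Python's (?P<name>...) syntax.
regex = re.compile(expand(r"\b(?P<mail_sent_to>to=<%{EMAILADDRESS}>)"))

match = regex.search("w3OJco1s009487: to=<username#domain.us>, delay=00:00:01")
fields = match.groupdict() if match else {}
for key, value in fields.items():
    print(key, "=", value)
```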


vscode snippet - transform and replace filename

my filename is
some-fancy-ui.component.html
I want to use a vscode snippet to transform it to
SOME_FANCY_UI
So basically
apply upcase to each character
Replace all - with _
Remove .component.html
Currently I have
'${TM_FILENAME/(.)(-)(.)/${1:/upcase}${2:/_}${3:/upcase}/g}'
which gives me this
'SETUP-PRINTER-SERVER-LIST.COMPONENT.HTML'
The docs don't explain how to combine a replacement with their transforms on regex groups.
If the chunks you need to uppercase are separated with - or . you may use
"Filename to UPPER_SNAKE_CASE": {
"prefix": "usc_",
"body": [
"${TM_FILENAME/\\.component\\.html$|(^|[-.])([^-.]+)/${1:+_}${2:/upcase}/g}"
],
"description": "Convert filename to UPPER_SNAKE_CASE dropping .component.html at the end"
}
You may check the regex workings here.
\.component\.html$ - matches .component.html at the end of the string
| - or
(^|[-.]) capture start of string or - / . into Group 1
([^-.]+) capture any 1+ chars other than - and . into Group 2.
The ${1:+_}${2:/upcase} replacement means:
${1:+ - if Group 1 is not empty,
_ - replace with _
} - end of the first group handling
${2:/upcase} - put the uppercased Group 2 value back.
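The replacement logic can be emulated in Python to see what each match contributes. This is a sketch only: it mimics ${1:+_} and ${2:/upcase} with a callback and does not run VS Code's snippet engine. The function name to_upper_snake is made up for the example.

```python
import re

# Same regex as in the snippet body (single backslashes, since this is not JSON).
PATTERN = re.compile(r"\.component\.html$|(^|[-.])([^-.]+)")


def to_upper_snake(filename: str) -> str:
    def replace(m: re.Match) -> str:
        # First alternative matched (.component.html): no groups, emit nothing.
        if m.group(2) is None:
            return ""
        # ${1:+_}: emit "_" only if Group 1 captured a non-empty separator.
        prefix = "_" if m.group(1) else ""
        # ${2:/upcase}: emit Group 2 in upper case.
        return prefix + m.group(2).upper()
    return PATTERN.sub(replace, filename)


result = to_upper_snake("some-fancy-ui.component.html")
print(result)
```

At the start of the string Group 1 matches the empty string, so no leading underscore is produced.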
Here is a pretty simple alternation regex:
"upcaseSnake": {
"prefix": "rf1",
"body": [
"${TM_FILENAME_BASE/(\\..*)|(-)|(.)/${2:+_}${3:/upcase}/g}",
"${TM_FILENAME/(\\..*)|(-)|(.)/${2:+_}${3:/upcase}/g}"
],
"description": "upcase and snake the filename"
},
Either version works.
(\\..*)|(-)|(.) alternation of three capture groups is conceptually simple. The order of the groups is important, and it is also what makes the regex so simple.
(\\..*) everything after and including the first dot . in the filename goes into group 1 which will not be used in the transform.
(-) group 2, if there is a group 2, replace it with an underscore ${2:+_}.
(.) group 3, all other characters go into group 3 which will be upcased ${3:/upcase}.
See regex101 demo.
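For comparison, the three-group alternation can be emulated in Python in the same way. Again this is only an approximation of the snippet transform, with a hypothetical function name, but it shows why both TM_FILENAME and TM_FILENAME_BASE produce the same result.

```python
import re

# (\..*) | (-) | (.)  with the order of the alternatives preserved.
PATTERN = re.compile(r"(\..*)|(-)|(.)")


def upcase_snake(name: str) -> str:
    def replace(m: re.Match) -> str:
        if m.group(2) is not None:   # ${2:+_}: a hyphen becomes an underscore
            return "_"
        if m.group(3) is not None:   # ${3:/upcase}: any other char is upcased
            return m.group(3).upper()
        return ""                    # group 1 (first dot onward) is dropped
    return PATTERN.sub(replace, name)


from_filename = upcase_snake("some-fancy-ui.component.html")  # like TM_FILENAME
from_base = upcase_snake("some-fancy-ui.component")           # like TM_FILENAME_BASE
print(from_filename, from_base)
```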

Conditionals and regex doubts with grok filter in logstash

I'm taking my first steps with the Elastic Stack using a practical approach, trying to make it work with an application in my environment. I'm having difficulty understanding how to write grok filters from scratch. I would like to get one like this working, so that I can build the rest of them from it.
I've taken some udemy courses, I'm reading this "Elastic Stack 6.0", I'm reading the documentation, but I can't find a way to make this work as intended.
So far, the only grok filter I'm using that actually works, is as simple as (/etc/logstash/config.d/beats.conf)
input {
beats {
port => 5044
}
}
filter {
grok {
match => { 'message' => "%{DATE:date} %{TIME:time} %{LOGLEVEL:loglevel}" }
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
}
}
This is one of the log entries I'll need to work with, but there are many with different forms. I just need to have this one sorted out so I can adapt the filters to the rest.
2019-02-05 19:13:04,394 INFO [qtp1286783232-574:http://localhost:8080/service/soap/AuthRequest] [name=admin#example.com;oip=172.16.1.69;ua=zclient/8.8.9_GA_3019;soapId=3bde7ed0;] SoapEngine - handler exception: authentication failed for [admin], invalid password
I'd like to have this info, only when there is a "soapId" and when the field next to "INFO" starts with "qtp":
date: 2019-02-05
time: 19:13:04,394
loglevel: INFO
identifier: qtp1286783232-574
soap: http://localhost:8080/service/soap/AuthRequest
Which could also end in things like "GetInfoRequest" or "NoOpRequest"
account: admin#example.com
oip: 172.16.1.69
client: zclient/8.8.9_GA_3019
soapid: 3bde7ed0
error: true (if either "invalid password" or "authentication failed" are found in the line)
If the conditions are not met, then I will apply other filters (which hopefully I will be able to write adapting this one as a base).
You can't have false in the output if you have invalid password in the input. You can only match what is there in the string.
I think you may use
%{DATE:date} %{TIME:time} %{LOGLEVEL:loglevel} *\[(?<identifier>qtp[^\]\[:]*):(?<soap>[^\]\[]*)]\s*\[name=(?<account>[^;]+);oip=(?<oip>[0-9.]+);ua=(?<client>[^;]+);soapId=(?<soapId>[^;]+);].*?(?:(?<error>authentication failed).*)?$
Here are the details of the added patterns:
" *" - 0+ spaces (a literal space quantified by *)
\[ - a [ char
(?<identifier>qtp[^\]\[:]*) - Named group "identifier": qtp and then 0+ chars other than :, ] and [
: - a colon
(?<soap>[^\]\[]*) - Named group "soap": 0+ chars other than ] and [
]\s*\[name= - a ], then 0+ whitespaces and [name= substring
(?<account>[^;]+) - Named group "account": 1+ chars other than ;
;oip= - a literal substring
(?<oip>[0-9.]+) - Named group "oip": 1+ digits and/or dots
;ua= - a literal substring
(?<client>[^;]+) - Named group "client": 1+ chars other than ;
;soapId= - a literal substring
(?<soapId>[^;]+) - Named group "soapId": 1+ chars other than ;
;] - a literal substring
.*? - any 0+ chars other than line break chars, as few as possible
(?:(?<error>authentication failed).*)? - an optional group matching 1 or 0 occurrences of
Named group "error": authentication failed substring
.* - all the rest of the line
$ - end of input.
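The custom part of the pattern can be exercised in Python as a rough check. In this sketch the %{DATE}, %{TIME} and %{LOGLEVEL} references are replaced by simple stand-ins that are assumptions, not the real grok definitions, and the named groups use Python's (?P<name>...) syntax. The second input line is a made-up variant without the error text.

```python
import re

LINE_RE = re.compile(
    # Stand-ins for %{DATE:date} %{TIME:time} %{LOGLEVEL:loglevel} (assumed).
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2},\d{3}) (?P<loglevel>[A-Z]+)"
    # The custom part of the pattern, unchanged apart from the group syntax.
    r" *\[(?P<identifier>qtp[^\]\[:]*):(?P<soap>[^\]\[]*)]\s*"
    r"\[name=(?P<account>[^;]+);oip=(?P<oip>[0-9.]+);ua=(?P<client>[^;]+);"
    r"soapId=(?P<soapId>[^;]+);].*?(?:(?P<error>authentication failed).*)?$"
)

prefix = (
    "2019-02-05 19:13:04,394 INFO "
    "[qtp1286783232-574:http://localhost:8080/service/soap/AuthRequest] "
    "[name=admin#example.com;oip=172.16.1.69;ua=zclient/8.8.9_GA_3019;soapId=3bde7ed0;] "
)
failed = prefix + "SoapEngine - handler exception: authentication failed for [admin], invalid password"
other = prefix + "SoapEngine - some other message"  # hypothetical line

failed_fields = LINE_RE.search(failed).groupdict()
other_fields = LINE_RE.search(other).groupdict()

print(failed_fields)
# The error group holds the matched text, or None when the text is absent.
print(other_fields["error"])
```

The error group only ever holds text found in the line, so turning it into a true/false flag has to happen in a later step.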

Custom regular expression for grok

My question is about the grok filter in Logstash. I need to parse a log file with a Logstash filter. A sample log statement is below:
2017-07-31 09:01:53,135 - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer#617] - Established session 0x15d964d654646f4 with negotiated timeout 5000 for client /10.191.202.89:56232
I want to parse the text between [] using a regular expression, but have not had any success. From the line above:
QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181 should be mapped to the thread id.
ZooKeeperServer should be mapped to the class name.
617 should be mapped to the line number.
Can someone help me with the regular expression for this?
You may use
\[(?<threadid>\w+[^/]*/[\d:]+):(?<classname>[^\]#]+)#(?<linenumber>\d+)\]
Details
\[ - a literal [
(?<threadid>\w+[^/]*/[\d:]+) - Group "threadid": 1+ word chars, then 0+ chars other than /, / and then 1 or more digits or : (note that you may adjust this pattern as you see fit, e.g. it can also be written as (?<threadid>.*?[\d:]+) but it won't be that safe)
: - a colon
(?<classname>[^\]#]+) - Group "classname": 1 or more chars other than ] and #
# - a # char
(?<linenumber>\d+) - Group "linenumber": 1 or more digits
\] - a literal ].
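A small Python sketch can confirm how the three groups split the bracketed section. It assumes the sample statement is a single line and uses Python's (?P<name>...) group syntax instead of grok's (?<name>...).

```python
import re

BRACKET_RE = re.compile(
    r"\[(?P<threadid>\w+[^/]*/[\d:]+):(?P<classname>[^\]#]+)#(?P<linenumber>\d+)\]"
)

# The sample statement from the question, joined into one line.
line = (
    "2017-07-31 09:01:53,135 - INFO "
    "[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer#617] - "
    "Established session 0x15d964d654646f4 with negotiated timeout 5000 for "
    "client /10.191.202.89:56232"
)

match = BRACKET_RE.search(line)
fields = match.groupdict() if match else {}
print(fields)
```

The [\d:]+ part first consumes the trailing colon as well, then backtracks by one character so that the literal : before the class name can match.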

grok parsing issue

I have an input line that looks like this:
localhost_9999.kafka.server:type=SessionExpireListener,name=ZooKeeperSyncConnectsPerSec.OneMinuteRate
and I can use this pattern to parse it:
%{DATA:kafka_node}:type=%{DATA:kafka_metric_type},name=%{JAVACLASS:kafka_metric_name}
which gives me this:
{
"kafka_node": [
[
"localhost_9999.kafka.server"
]
],
"kafka_metric_type": [
[
"SessionExpireListener"
]
],
"kafka_metric_name": [
[
"ZooKeeperSyncConnectsPerSec.OneMinuteRate"
]
]
}
I want to split the OneMinuteRate into a separate field but can't seem to get it to work. I've tried this:
%{DATA:kafka_node}:type=%{DATA:kafka_metric_type},name=%{WORD:kafka_metric_name}.%{WORD:attr_type}"
but get nothing back then.
I'm also using https://grokdebug.herokuapp.com/ to test these out...
You can either use your last regex with an escaped . (note that a . matches any char but newline, while \. matches a literal dot), or use the DATA type for the second-to-last field and GREEDYDATA for the last field:
%{DATA:kafka_node}:type=%{DATA:kafka_metric_type},name=%{DATA:kafka_metric_name}\.%{GREEDYDATA:attr_type}
Since %{DATA:name} translates to (?<name>.*?) and %{GREEDYDATA:name} translates to (?<name>.*), the kafka_metric_name part will match any chars, 0 or more occurrences, as few as possible, up to the first literal ., and the attr_type .* pattern will greedily "eat up" the rest of the line up to its end.
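The lazy and greedy behaviour can be reproduced in Python by writing out the translations given above by hand. This is a sketch with Python's (?P<name>...) group syntax, not grok itself; the second regex is an extra illustration of what happens when the dot is left unescaped.

```python
import re

line = (
    "localhost_9999.kafka.server:type=SessionExpireListener,"
    "name=ZooKeeperSyncConnectsPerSec.OneMinuteRate"
)

# DATA -> .*?   GREEDYDATA -> .*   with the separating dot escaped.
escaped = re.compile(
    r"(?P<kafka_node>.*?):type=(?P<kafka_metric_type>.*?),"
    r"name=(?P<kafka_metric_name>.*?)\.(?P<attr_type>.*)"
)
fields = escaped.search(line).groupdict()
print(fields)

# With an unescaped dot, the lazy group stays empty and "." consumes the "Z".
unescaped = re.compile(r"name=(?P<kafka_metric_name>.*?).(?P<attr_type>.*)")
wrong = unescaped.search(line).groupdict()
print(wrong)
```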

Issue on parsing logs using regex

I have tried separating the wowza logs using regex for data analysis, but I couldn't separate the section below.
I need a SINGLE regex pattern that would satisfy below both log formats.
Format 1:
live wowz://test1.example.com:443/live/_definst_/demo01|wowz://test2.example.com:443/live/_definst_/demo01 test
Format 2:
live demo01 test
I am trying to split the line on the 3 parameters and capturing them in the groups app, streamname and id, but streamname should only capture the text after the last /.
This is what I've tried:
(?<stream_name>[^/]+)$ --> Using this pattern I could only separate the format 1 "wowz" section. Not entire Format 1 example mentioned above.
Expected Output
{
"app": [
[
"live"
]
],
"streamname": [
[
"demo01"
]
],
"id": [
[
"test"
]
]
}
You can achieve what you specified using the following regex:
^(?<app>\S+) (?:\S*/)?(?<streamname>\S+) (?<id>\S+)$
regex101 demo
\S+ matches one or more non-whitespace characters.
(?:\S*/)? optionally consumes the characters in the second parameter up to and including the last /. It is a non-capturing group, so that text is not included in streamname.
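Both formats can be checked against the pattern with a short Python sketch. The only change is Python's (?P<name>...) spelling for named groups; the two input lines are the ones from the question.

```python
import re

LINE_RE = re.compile(r"^(?P<app>\S+) (?:\S*/)?(?P<streamname>\S+) (?P<id>\S+)$")

format1 = (
    "live wowz://test1.example.com:443/live/_definst_/demo01"
    "|wowz://test2.example.com:443/live/_definst_/demo01 test"
)
format2 = "live demo01 test"

fields1 = LINE_RE.match(format1).groupdict()
fields2 = LINE_RE.match(format2).groupdict()
print(fields1)
print(fields2)
```

Because \S* is greedy, the optional group runs to the last / in the second parameter, leaving only the final path segment for streamname.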