Custom regular expression for grok

Custom regular expression for grok - regex

My question is about grok filter in logstash. For logstash filter I need to parse a log file . Sample log statement below
2017-07-31 09:01:53,135 - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer#617] -
Established session 0x15d964d654646f4 with negotiated timeout 5000 for
client /10.191.202.89:56232
I want to parse statement between [] using regular expression but did not get any success ? From above line
QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181 should be mapped to thread id .
ZooKeeperServer should be mapped to class name
617 should be mapped with line number
Can someone help me with the regular expression for this ?

You may use
\[(?<threadid>\w+[^/]*/[\d:]+):(?<classname>[^\]#]+)#(?<linenumber>\d+)\]
Details
\[ - a literal [
(?<threadid>\w+[^/]*/[\d:]+) - Group "threadid": 1+ word chars, then 0+ chars other than /, / and then 1 or more digits or : (note that you may adjust this pattern as you see fit, e.g. it can also be written as (?<threadid>.*?[\d:]+) but it won't be that safe)
: - a colon
(?<classname>[^\]#]+) - Group "classname": 1 or more chars other than ] and #
# - a # char
(?<linenumber>\d+) - Group "linenumber": 1 or more digits
\] - a literal ].
Online test results at grokdebug.herokuapp.com:

Related

How to make Ruby regex expression with some conditional inputs

This is my inputs looks like
format 1: 2022-09-23 18:40:45.846 I/getUsers: fetching data
format 2: 11:54:54.619 INFO loadingUsers:23 - visualising: "Entered to dashboard
This is the expression which is working for format one, i want to have the same (making changes to this) to handle both formats
^([0-9-]+ [:0-9.]+)\s(?<level>\w+)[\/+](?<log>.*)
it results as for format 1:
level I
message getUsers: fetching data
for 2nd it should be as
level INFO
message loadingUsers:23 - visualising: "Entered to dashboard
Help would be appreciated, Thanks

You can use
^([0-9-]+ [:0-9.]+|[0-9:.]+)\s(?<level>\w+)[\/+\s]+(?<log>.*)
See the Rubular demo.
Details:
^ - start of a line
([0-9-]+ [:0-9.]+|[0-9:.]+) - Group 1: one or more digits/hyphens, space, one or more digits/colons/dots, or one or more digits/colons/dots
\s - a whitespace
(?<level>\w+) - Group "level": one or more letters, digits or underscores
[\/+\s]+ - one or more slashes, + or whitespaces
(?<log>.*) - Group "log": zero or more chars other than line break chars as many as possible.
If you want to precise your Group 1 pattern (although I consider using a loose pattern fine in these scenarios), you can replace ([0-9-]+ [:0-9.]+|[0-9:.]+) with (\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}\.\d+|\d{1,2}:\d{1,2}:\d{1,2}\.\d+), see this regex demo.

Regex for validating account names for NEAR protocol

I want to have accurate form field validation for NEAR protocol account addresses.
I see at https://docs.near.org/docs/concepts/account#account-id-rules that the minimum length is 2, maximum length is 64, and the string must either be a 64-character hex representation of a public key (in the case of an implicit account) or must consist of "Account ID parts" separated by . and ending in .near, where an "Account ID part" consists of lowercase alphanumeric symbols separated by either _ or -.
Here are some examples.
The final 4 cases here should be marked as invalid (and there might be more cases that I don't know about):
example.near
sub.ex.near
something.near
98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de
wrong.near.suffix (INVALID)
shouldnotendwithperiod.near. (INVALID)
space should fail.near (INVALID)
touchingDotsShouldfail..near (INVALID)
I'm wondering if there is a well-tested regex that I should be using in my validation.
Thanks.
P.S. Originally my question pointed to what I was starting with at https://regex101.com/r/jZHtDA/1 but starting from scratch like that feels unwise given that there must already be official validation rules somewhere that I could copy.
I have looked at code that I would have expected to use some kind of validation, such as these links, but I haven't found it yet:
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/utils/account.js#L8
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/components/accounts/AccountFormAccountId.js#L95
https://github.com/near/near-cli/blob/cdc571b1625a26bcc39b3d8db68a2f82b91f06ea/commands/create-account.js#L75

The pre-release (v0.6.0-0) version of the JS SDK comes with a built-in accountId validation function:
const ACCOUNT_ID_REGEX =
/^(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+$/;
/**
* Validates the Account ID according to the NEAR protocol
* [Account ID rules](https://nomicon.io/DataStructures/Account#account-id-rules).
*
* #param accountId - The Account ID string you want to validate.
*/
export function validateAccountId(accountId: string): boolean {
return (
accountId.length >= 2 &&
accountId.length <= 64 &&
ACCOUNT_ID_REGEX.test(accountId)
);
}
https://github.com/near/near-sdk-js/blob/dc6f07bd30064da96efb7f90a6ecd8c4d9cc9b06/lib/utils.js#L113
Feel free to implement this in your program too.

Something like this should do: /^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$/gm
Breakdown
^ # start of line
(
\w # match alphanumeric characters
| # OR
(?<!\.)\. # dots can't be preceded by dots
)+
(?<!\.) # "." should not precede:
\. # "."
(testnet|near) # match "testnet" or "near"
$ # end of line
Try the Regex out: https://regex101.com/r/vctRlo/1

If you want to match word characters only, separated by a dot:
^\w+(?:\.\w+)*\.(?:testnet|near)$
Explanation
^ Start of string
\w+ Match 1+ word characters
(?:\.\w+)* Optionally repeat . and 1+ word characters
\. Match .
(?:testnet|near) Match either testnet or near
$ End of string
Regex demo
A bit broader variant matching whitespace character excluding the dot:
^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$
Regex demo

vscode snippet - transform and replace filename

my filename is
some-fancy-ui.component.html
I want to use a vscode snippet to transform it to
SOME_FANCY_UI
So basically
apply upcase to each character
Replace all - with _
Remove .component.html
Currently I have
'${TM_FILENAME/(.)(-)(.)/${1:/upcase}${2:/_}${3:/upcase}/g}'
which gives me this
'SETUP-PRINTER-SERVER-LIST.COMPONENT.HTML'
The docs doesn't explain how to apply replace in combination with their transforms on regex groups.

If the chunks you need to upper are separated with - or . you may use
"Filename to UPPER_SNAKE_CASE": {
"prefix": "usc_",
"body": [
"${TM_FILENAME/\\.component\\.html$|(^|[-.])([^-.]+)/${1:+_}${2:/upcase}/g}"
],
"description": "Convert filename to UPPER_SNAKE_CASE dropping .component.html at the end"
}
You may check the regex workings here.
\.component\.html$ - matches .component.html at the end of the string
| - or
(^|[-.]) capture start of string or - / . into Group 1
([^-.]+) capture any 1+ chars other than - and . into Group 2.
The ${1:+_}${2:/upcase} replacement means:
${1:+ - if Group 1 is not empty,
_ - replace with _
} - end of the first group handling
${2:/upcase} - put the uppered Group 2 value back.

Here is a pretty simple alternation regex:
"upcaseSnake": {
"prefix": "rf1",
"body": [
"${TM_FILENAME_BASE/(\\..*)|(-)|(.)/${2:+_}${3:/upcase}/g}",
"${TM_FILENAME/(\\..*)|(-)|(.)/${2:+_}${3:/upcase}/g}"
],
"description": "upcase and snake the filename"
},
Either version works.
(\\..*)|(-)|(.) alternation of three capture groups is conceptually simple. The order of the groups is important, and it is also what makes the regex so simple.
(\\..*) everything after and including the first dot . in the filename goes into group 1 which will not be used in the transform.
(-) group 2, if there is a group 2, replace it with an underscore ${2:+_}.
(.) group 3, all other characters go into group 3 which will be upcased ${3:/upcase}.
See regex101 demo.

Conditionals and regex doubts with grok filter in logstash

I'm taking my first steps with elastic-stack with a practical approach, trying to make it work with an appliacation in my enviroment. I'm having difficulties understanding from scratch how to write grok filters. I would like to have one like this one working, so from that one, I can work the rest of them.
I've taken some udemy courses, I'm reading this "Elastic Stack 6.0", I'm reading the documentation, but I can't find a way to make this work as intended.
So far, the only grok filter I'm using that actually works, is as simple as (/etc/logstash/config.d/beats.conf)
input {
beats {
port => 5044
}
}
filter {
grok {
match => { 'message' => "%{DATE:date} %{TIME:time} %
{LOGLEVEL:loglevel}"
}
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
}
}
This is one of the log entries I'll need to work with, but there are many with different forms. I just need to have this one sorted out so I can adapt the filters to the rest.
2019-02-05 19:13:04,394 INFO [qtp1286783232-574:http://localhost:8080/service/soap/AuthRequest] [name=admin#example.com;oip=172.16.1.69;ua=zclient/8.8.9_GA_3019;soapId=3bde7ed0;] SoapEngine - handler exception: authentication failed for [admin], invalid password
I'd like to have this info, only when there is a "soapId" and when the field next to "INFO" starts with "qtq":
date: 2019-02-05
time: 19:13:04,394
loglevel: INFO
identifier: qtp1286783232-574
soap: http://localhost:8080/service/soap/AuthRequest
Which could also end in things like "GetInfoRequest" or "NoOpRequest"
account: admin#example.com
oip: 172.16.1.69
client: zclient/8.8.9_GA_3019
soapid: 3bde7ed0
error: true (if either "invalid password" or "authentication failed" are found in the line)
If the conditions are not met, then I will apply other filters (which hopefully I will be able to write adapting this one as a base).

You can't have false in the output if you have invalid password in the input. You can only match what is there in the string.
I think you may use
%{DATE:date} %{TIME:time} %{LOGLEVEL:loglevel} *\[(?<identifier>qtp[^\]\[:]*):(?<soap>[^\]\[]*)]\s*\[name=(?<account>[^;]+);oip=(?<oip>[0-9.]+);ua=(?<client>[^;]+);soapId=(?<soapId>[^;]+);].*?(?:(?<error>authentication failed).*)?$
Here are the details of the added patterns:
* - 0+ spaces
\[ - a [ char
(?<identifier>qtp[^\]\[:]*) - Named group "identifier": qtp and then 0+ chars other than :, ] and [
: - a colon
(?<soap>[^\]\[]*) - Named group "soap": 0+ chars other than ] and [
]\s*\[name= - a ], then 0+ whitespaces and [name= substring
(?<account>[^;]+) - Named group "account": 1+ chars other than ;
;oip= - a literal substring
(?<oip>[0-9.]+) - Named group "oip": 1+ digits and/or dots
;ua= - a literal substring
(?<client>[^;]+) - Named group "client": 1+ chars other than ;
;soapId= - a literal substring
(?<soapId>[^;]+) - Named group "soapId": 1+ chars other than ;
;] - a literal substring
.*? - any 0+ chars other than line break chars, as few as possible
(?:(?<error>authentication failed).*)? - an optional group matching 1 or 0 occurrences of
Named group "error": authentication failed substring
.* - all the rest of the line
$ - end of input.

Using grok to match custom style email address

I just set up an ELK stack for my apache logs. It's working great. Now I want to add maillogs to the mix, and I'm having trouble parsing the logs with grok.
I'm using this site to debug:
https://grokdebug.herokuapp.com/
Here is an example maillog (sendmail) entry:
Apr 24 19:38:51 ip-10-0-1-204 sendmail[9489]: w3OJco1s009487: to=<username#domain.us>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=120318, relay=webmx.bglen.net. [10.0.3.231], dsn=2.0.0, stat=Sent (Ok: queued as E2DEF60724), w3OJco1s009487: to=<username#domain.us>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=120318, relay=webmx.[redacted].net. [10.0.3.231], dsn=2.0.0, stat=Sent (Ok: queued as E2DEF60724)
From the text above, I want to pull out the text to=<username#domain.us>.
So far I have this for a grok pattern:
(?<mail_sent_to>[a-zA-Z0-9_.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.(?:[0-9A-Za-z][0-‌9A-Za-z-]{0,62}))*)
It gives me the result username#domain.us> which is nice, but I want it to have the to= on the front as well. And I only want this grok filter to match email addresses which have to= in front of them.
I tried this, but it gives me "no matches" as a result:
(?<mail_sent_to>"to="[a-zA-Z0-9_.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.(?:[0-9A-Za-z][0-‌9A-Za-z-]{0,62}))*)

You may use
\b(?<mail_sent_to>to=<[a-zA-Z0-9_.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.[0-9A-Za-z][0-9A-Za-z-]{0,62})*>)
or, since [a-zA-Z0-9_] matches the same chars as \w:
\b(?<mail_sent_to>to=<[\w.+=:-]+#[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.[0-9A-Za-z][0-9A-Za-z-]{0,62})*>)
See the regex demo.
Details
\b - a word boundary
(?<mail_sent_to> - "mail_sent_to" group:
to=< - a literal string to=<
[\w.+=:-]+ - 1+ word, ., +, =, : or - chars
# - a # char
[0-9A-Za-z] - an alphanumeric char
[0-9A-Za-z-]{0,62} - 0 to 62 letters, digits or -
(?:\.[0-9A-Za-z][0-9A-Za-z-]{0,62})* - 0+ sequences of
\. - a dot
[0-9A-Za-z] - an alphanumeric char
[0-9A-Za-z-]{0,62} - 0 to 62 letters, digits or -
> - a > char
) - end of the group.

This is much simple, it create a custom pattern to match to=< and >, and pre-defined EMAILADDRESS to match email address.
\b(?<mail_sent_to>to=<%{EMAILADDRESS}>)
This will output,
{
"mail_sent_to": [
[
"to=<username#domain.us>"
]
],
"EMAILADDRESS": [
[
"username#domain.us"
]
],
"EMAILLOCALPART": [
[
"username"
]
],
"HOSTNAME": [
[
"domain.us"
]
]
}
EDIT:
Patterns for email are,
EMAILLOCALPART [a-zA-Z][a-zA-Z0-9_.+-=:]+
EMAILADDRESS %{EMAILLOCALPART}#%{HOSTNAME}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js