Logstash log time and date parsing - regex

Hello, I have the log line below:
12-Apr-2021 16:11:41.078 WARNING [https-jsse-nio2-8443-exec-3] org.apache.catalina.realm.LockOutRealm.filterLockedAccounts An attempt was made to authenticate the locked user [user1]
I am trying to build a grok pattern for these lines for Logstash.
So far I have the following:
%{MY_DATE_PATTERN:timestamp}\s%{WORD:severity}\s\[%{DATA:thread}\]\s%{NOTSPACE:type_log}
which parses out the following:
{
  "timestamp": [
    "12-Apr-2021 16:01:01.505"
  ],
  "severity": [
    "FINE"
  ],
  "thread": [
    "https-jsse-nio2-8443-exec-8"
  ],
  "type_log": [
    "org.apache.catalina.realm.CombinedRealm.authenticate"
  ]
}
My date stamp is a custom pattern. It works in the grok debugger but not in the system I am using, so I need a way to match the date and time with built-in patterns / plain regex. Would anyone help me, please?
What is a grok/regex pattern for 12-Apr-2021 16:11:41.078?

Instead of %{MY_DATE_PATTERN:timestamp}, you can use
(?<timestamp>%{MONTHDAY}-%{MONTH}-%{YEAR} %{HOUR}:%{MINUTE}:%{SECOND})
Legend:
%{MONTHDAY} - (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])
%{MONTH} - \b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|รค)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b
%{YEAR} - (?>\d\d){1,2}
%{HOUR} - (?:2[0123]|[01]?[0-9])
%{MINUTE} - (?:[0-5][0-9])
%{SECOND} - (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
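For completeness, here is a minimal Logstash filter sketch that drops the composed pattern into grok and then converts the matched string into @timestamp with the date filter (field names as in the question; adjust the source field to your setup):

filter {
  grok {
    match => {
      "message" => "(?<timestamp>%{MONTHDAY}-%{MONTH}-%{YEAR} %{HOUR}:%{MINUTE}:%{SECOND})\s%{WORD:severity}\s\[%{DATA:thread}\]\s%{NOTSPACE:type_log}"
    }
  }
  date {
    # "dd-MMM-yyyy HH:mm:ss.SSS" matches e.g. "12-Apr-2021 16:11:41.078"
    match => [ "timestamp", "dd-MMM-yyyy HH:mm:ss.SSS" ]
  }
}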


GCP recommendation data format for catalog

I am currently working on Recommendations AI. Since I am new to GCP recommendations, I have been struggling with the data format for the catalog. I read the documentation, and it says each product item's JSON should be on a single line.
I understand this totally, but it would be really great if I could see what the JSON format looks like in practice, because the one in the documentation is very ambiguous to me. I am trying to use the console to import the data.
I tried to import data that looks like the sample below, but I got an error saying "invalid JSON format" 100 times, for lots of reasons such as "unexpected token", "something should be there", and so on.
[
  {
    "id": "1",
    "title": "Toy Story (1995)",
    "categories": [
      "Animation",
      "Children's",
      "Comedy"
    ]
  },
  {
    "id": "2",
    "title": "Jumanji (1995)",
    "categories": [
      "Adventure",
      "Children's",
      "Fantasy"
    ]
  },
  ...
]
Maybe it was because each item was not on a single line, but I am also wondering whether the above is enough for importing. I am not sure whether the data should be wrapped in another property, like:
{
  "inputConfig": {
    "productInlineSource": {
      "products": [
        {
          "id": "1",
          "title": "Toy Story (1995)",
          "categories": [
            "Animation",
            "Children's",
            "Comedy"
          ]
        },
        {
          "id": "2",
          "title": "Jumanji (1995)",
          "categories": [
            "Adventure",
            "Children's",
            "Fantasy"
          ]
        }
      ]
    }
  }
}
I can see the above in the documentation, but it says it is for importing inline, which uses a POST request. It does not mention anything about importing with the console. My guess is that the format is also used for the console, but I am not 100% sure; that is why I am asking.
Is there anyone who can show me the exact data format for importing data using the console?
Problem Solved
For those who might have the same question: the exact data format you should import using the GCP console looks like
{"id":"1","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]}
{"id":"2","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]}
No square bracket wrapping all the items.
No comma between items.
Only each item on a single line.
Posting this Community Wiki for better visibility.
The OP edited the question and added the solution:
The exact data format you should import using the GCP console looks like
{"id":"1","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]}
{"id":"2","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]}
No square bracket wrapping all the items.
No comma between items.
Only each item on a single line.
However, I'd like to elaborate a bit.
There are a few ways of importing catalog information:
Importing catalog data from Merchant Center
Importing catalog data from BigQuery
Importing catalog data from Cloud Storage
I guess the last one is what was used by the OP, as I was able to import a catalog using the UI and GCS with the JSON file below.
{
  "inputConfig": {
    "catalogInlineSource": {
      "catalogItems": [
        {"id":"111","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]}
        {"id":"222","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]}
        {"id":"333","title":"Test Movie (2020)","categories":["Adventure","Children's","Fantasy"]}
      ]
    }
  }
}
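If you take the Cloud Storage route, the flow is roughly: upload the JSON file to a bucket with gsutil and then point the console's import dialog at it (the bucket name below is made up):

gsutil cp data.json gs://my-catalog-bucket/data.json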
Importing catalog data inline
At the bottom of the Importing catalog information documentation you can find information:
The line breaks are for readability; you should provide an entire catalog item on a single line. Each catalog item should be on its own line.
It means you should use something similar to NDJSON - a convenient format for storing or streaming structured data that may be processed one record at a time.
If you would like to try the inline method, you should use the format below: each catalog item is conceptually a single line, shown here with line breaks for readability.
data.json file
{
  "inputConfig": {
    "catalogInlineSource": {
      "catalogItems": [
        {
          "id": "1212",
          "category_hierarchies": [ { "categories": [ "Animation", "Children's" ] } ],
          "title": "Toy Story (1995)"
        },
        {
          "id": "5858",
          "category_hierarchies": [ { "categories": [ "Adventure", "Fantasy" ] } ],
          "title": "Jumanji (1995)"
        },
        {
          "id": "321123",
          "category_hierarchies": [ { "categories": [ "Comedy", "Adventure" ] } ],
          "title": "The Lord of the Rings: The Fellowship of the Ring (2001)"
        }
      ]
    }
  }
}
Command
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data @./data.json \
  "https://recommendationengine.googleapis.com/v1beta1/projects/[your-project]/locations/global/catalogs/default_catalog/catalogItems:import"
{
  "name": "import-catalog-default_catalog-1179023525XX37366024",
  "done": true
}
Please keep in mind that the above method requires service account authentication, otherwise you will get a PERMISSION_DENIED error:
"message" : "Your application has authenticated using end user credentials from the Google Cloud SDK or Google Cloud Shell which are not supported by the translate.googleapis.com. We recommend that most server applications use service accounts instead. For more information about service accounts and how to use them in your application, see https://cloud.google.com/docs/authentication/.",
"status" : "PERMISSION_DENIED"

How to highlight custom extractions using a2i's crowd-textract-analyze-document?

I would like to create a human review loop for images that have undergone OCR using Amazon Textract and entity extraction using Amazon Comprehend.
My process is:
send image to Textract to extract the text
send text to Comprehend to extract entities
find the Block IDs in Textract's output of the entities extracted by Comprehend
add new Blocks of type KEY_VALUE_SET to Textract's JSON output, per the docs
create a Human Task with the crowd-textract-analyze-document element in the template and feed it the modified Textract output
What fails to work in this process is step 5. My custom entities are not rendered properly. By "fails to work" I mean that the entities are not highlighted on the image when I click them on the sidebar. There is no error in the browser's console.
Has anyone tried such a thing?
Sorry for not including examples. I will remove secrets/PII from my files and attach them to the question.
I used the AWS documentation of the a2i-crowd-textract-detection human task element to generate the value of the initialValue attribute. It appears the doc for that attribute is incorrect. While the doc shows that the value should be in the same format as the output of Textract, namely:
[
  {
    "BlockType": "KEY_VALUE_SET",
    "Confidence": 38.43309020996094,
    "Geometry": { ... },
    "Id": "8c97b240-0969-4678-834a-646c95da9cf4",
    "Relationships": [
      { "Type": "CHILD", "Ids": [...] },
      { "Type": "VALUE", "Ids": [...] }
    ],
    "EntityTypes": ["KEY"],
    "Text": "Foo bar"
  }
]
the a2i-crowd-textract-detection element expects the input to have lowerCamelCase attribute names (rather than UpperCamelCase). For example:
[
  {
    "blockType": "KEY_VALUE_SET",
    "confidence": 38.43309020996094,
    "geometry": { ... },
    "id": "8c97b240-0969-4678-834a-646c95da9cf4",
    "relationships": [
      { "Type": "CHILD", "ids": [...] },
      { "Type": "VALUE", "ids": [...] }
    ],
    "entityTypes": ["KEY"],
    "text": "Foo bar"
  }
]
I opened a support case with AWS about this documentation error.
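As a workaround until the docs are fixed, you can convert Textract's keys before passing them to initialValue. Below is a minimal, hypothetical JavaScript sketch (not an official AWS utility; note that, per the example above, the Type key inside relationships appears to keep its casing, so it is excluded):

// Recursively lower-camel-case every attribute name,
// e.g. "BlockType" -> "blockType", "Ids" -> "ids".
const KEEP_AS_IS = new Set(['Type']); // appears to keep its casing in the expected format

function toLowerCamel(value) {
  if (Array.isArray(value)) {
    return value.map(toLowerCamel);
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, val]) => [
        KEEP_AS_IS.has(key) ? key : key.charAt(0).toLowerCase() + key.slice(1),
        toLowerCamel(val),
      ])
    );
  }
  return value; // primitives unchanged
}

// textractResponse is your (modified) Textract output
const initialValue = toLowerCamel(textractResponse.Blocks);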

grok parse optional field pattern doesn't work

I've got a log like this:
ERROR_MESSAGE:Invalid Credentials,THROTTLED_OUT_REASON:API_LIMIT_EXCEEDED
I'm trying to parse it with grok using the grok debugger:
ERROR_MESSAGE:%{GREEDYDATA:errorMassage},THROTTLED_OUT_REASON:%{GREEDYDATA:throttledOutReason}
It works, but sometimes the log comes without the THROTTLED_OUT_REASON field, and only this simpler pattern matches:
ERROR_MESSAGE:%{GREEDYDATA:errorMassage}
In that case, since THROTTLED_OUT_REASON is an optional field, I tried the pattern below:
ERROR_MESSAGE:%{GREEDYDATA:errorMassage}(,THROTTLED_OUT_REASON:%{GREEDYDATA:throttledOutReason})?
So this should work for both cases. However, the output it actually gives for the log with the optional field is:
{
  "errorMassage": [
    [
      "Invalid Credentials,THROTTLED_OUT_REASON:API_LIMIT_EXCEEDED"
    ]
  ],
  "throttledOutReason": [
    [
      null
    ]
  ]
}
But the expected output for the log with the optional field is:
{
  "errorMassage": [
    [
      "Invalid Credentials"
    ]
  ],
  "throttledOutReason": [
    [
      "API_LIMIT_EXCEEDED"
    ]
  ]
}
And the expected output for the log without the optional field is:
{
  "errorMassage": [
    [
      "Invalid Credentials"
    ]
  ],
  "throttledOutReason": [
    [
      null
    ]
  ]
}
Can anyone suggest a solution that gives the correct output for both types of logs?
Since you use GREEDYDATA, it "eats" as much as it can get in order to fill errorMassage.
I do not know grok well enough to tell you which alternative predefined patterns there are, but you should be able to use a custom pattern:
ERROR_MESSAGE:(?<errorMassage>.*?),THROTTLED_OUT_REASON:%{GREEDYDATA:throttledOutReason}
I got the answer using @Skeeve's idea.
Here it is for anyone who runs into a similar question:
I've used a custom pattern in order to avoid the excess eating of GREEDYDATA (for the errorMassage field).
ERROR_MESSAGE:(?<errorMassage>([^,]*)?)(,THROTTLED_OUT_REASON:%{GREEDYDATA:throttledOutReason})?
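For reference, here is a minimal Logstash filter sketch using that pattern (field names as above; the surrounding pipeline config is assumed):

filter {
  grok {
    # [^,]* stops errorMassage at the first comma, so the optional
    # THROTTLED_OUT_REASON group can still match when it is present.
    match => {
      "message" => "ERROR_MESSAGE:(?<errorMassage>([^,]*)?)(,THROTTLED_OUT_REASON:%{GREEDYDATA:throttledOutReason})?"
    }
  }
}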

custom log grok pattern

I am trying to parse a custom log line using a grok pattern, but I'm not able to completely parse the line.
Custom log line:
site 'TRT' : alias 'TRT,FAK,FAS,ATI,ONE,DVZ,TWO' : serveur 'Test10011' RAS : TRT / TRT serveur 'Test10011' OK
Grok pattern:
%{DATA:site}\:%{DATA:alias}\:%{DATA:server}\:%{DATA:msg}
Result:
{
  "site": [
    [
      "site 'TRT' "
    ]
  ],
  "alias": [
    [
      " alias 'TRT,FAK,FAS,ATI,ONE,DVZ,TWO' "
    ]
  ],
  "server": [
    [
      " serveur 'Test10011' RAS "
    ]
  ],
  "msg": [
    [
      ""
    ]
  ]
}
I am not able to parse the last item into 'msg'. Could you please help me see where I'm going wrong? msg should contain "TRT / TRT serveur 'Test10011' OK".
It seems you just need to use the GREEDYDATA pattern instead of DATA:
%{DATA:site}\s*:\s*%{DATA:alias}\s*:\s*%{DATA:server}\s*:\s*%{GREEDYDATA:msg}
I also suggest adding \s* around each : to get rid of the leading/trailing whitespace.
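With that adjusted pattern, the grok debugger should produce something like this (the whitespace around the colons no longer leaks into the fields):

{
  "site": [
    [
      "site 'TRT'"
    ]
  ],
  "alias": [
    [
      "alias 'TRT,FAK,FAS,ATI,ONE,DVZ,TWO'"
    ]
  ],
  "server": [
    [
      "serveur 'Test10011' RAS"
    ]
  ],
  "msg": [
    [
      "TRT / TRT serveur 'Test10011' OK"
    ]
  ]
}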

Regex working with calculator but not with AWS lambda function

I am having trouble with a regex meant to gather all the data between [ and ].
I am testing with this tool: http://regexr.com/
String data:
{
  "Items": [
    {
      "UserID": "1487840267893246",
      "Timestamp": 1487204364877
    },
    {
      "UserID": "1487840267893336",
      "Timestamp": 1487204364888
    }
  ],
  "Count": 2,
  "ScannedCount": 3
}
The code below (fired in AWS Lambda) is intended to pull all the characters between the [ and ] and output them. (\[[^]*\]) works with the regex tester above, but only returns "undefined" in Lambda. Why?
Items = data.match(/"(\[[^]*\])"/);
console.log(Items);
An alternative solution was to extract the data into an array as follows
userID = data.match(/"UserID":"([^"]+)"/g);
console.log(userID);
Try the dotall flag. JavaScript does not support the inline (?s) syntax; instead, append the s flag to the regex literal:
Items = data.match(/\[.*\]/s);
Also, you don't need the surrounding quotes or the capturing brackets: the data contains no literal double quotes around [ and ], which is why your original pattern never matched.
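For illustration, here is a self-contained Node.js sketch of the fixed call against the question's sample data (variable names are made up; the s flag needs a modern runtime):

// The s (dotAll) flag lets . match newlines, so the whole multi-line
// block between the first [ and the last ] is captured.
const data = `{
  "Items": [
    { "UserID": "1487840267893246", "Timestamp": 1487204364877 },
    { "UserID": "1487840267893336", "Timestamp": 1487204364888 }
  ],
  "Count": 2,
  "ScannedCount": 3
}`;

const match = data.match(/\[.*\]/s);
console.log(match[0]); // prints everything from "[" through "]"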