How to parse mixed text and JSON log entries in AWS CloudWatch for Log Metric Filter - amazon-web-services

I am trying to parse log entries which are a mix of text and JSON. The first line is text representation and the next lines are JSON payload of the event. One of the possible examples are:
2016-07-24T21:08:07.888Z [INFO] Command completed lessonrecords-create
{
"key": "lessonrecords-create",
"correlationId": "c1c07081-3f67-4ab3-a5e2-1b3a16c87961",
"result": {
"id": "9457ce88-4e6f-4084-bbea-14fff78ce5b6",
"status": "NA",
"private": false,
"note": "Test note",
"time": "2016-02-01T01:24:00.000Z",
"updatedAt": "2016-07-24T21:08:07.879Z",
"createdAt": "2016-07-24T21:08:07.879Z",
"authorId": null,
"lessonId": null,
"groupId": null
}
}
For these records I try to define Log Metric Filter to a) match records b) select data or dimensions if possible.
According to the AWS docs JSON pattern should look like this:
{ $.key = "lessonrecords-create" }
however, it does not match anything. My guess is that because of mix text and JSON in a single log entry.
So, the questions are:
1. Is it possible to define a pattern that will match this log format?
2. Is it possible to extract dimensions, values from such a log format?
3. Help me with a pattern to do this.

If you set up the metric filter in the way that you have defined, the test will not register any matches (I have also had this issue), however when you deploy the metric filter it will still register matches (at least mine did). Just keep in mind that there is no way (as far as I am aware) to run this metric filter BACKWARDS (ie. it will only capture data from when it is created). [If you're trying to get stats on past data, you're better off using log insight queries]
I am currently experimenting with different parse statements to try and extract data (its also a mix of JSON and text), this thread MAY help you (it didn't for me) Amazon Cloudwatch Logs Insights with JSON fields .
UPDATE!
I have found a way to parse the text but its a little bit clunky. If you export your cloudwatch logs using a lamda function to SumoLogic, their search tool allows for MUCH better log manipulation and lets you parse JSON fields (if you treat the entire entry as text). SumoLogic is also really helpful because you can just extract your search results as a CSV. For my purposes, I parse the entire log message in SumoLogic, extract all the logs as a CSV and then I used regex in Python to filter through and extract the values I need.

Let's say you have the following log
2021-09-29 15:51:18,624 [main] DEBUG com.company.app.SparkResources - AUDIT : {"user":"Raspoutine","method":"GET","pathInfo":"/analysis/123"}
you can parse it like this to be able to handle the part after "AUDIT : " as a JSON
fields #message
| parse #message "* [*] * * - AUDIT : *" as timestamp, thread, logLevel, clazz, msg
| filter ispresent(msg)
| filter method = "GET" # You can use fields which are contained in the JSON String of 'msg' field. Do not use 'msg.method' but directly 'method'
The fields contained in your isolated / parsed JSON field are automatically added as fields usable in the query

You can use CloudWatch Events for such purpose(aka Subscription Filters), what you will need to do is define a cloudwatch Rule which uses an expression statement to match your logs.
Here, I will let you do all the reading:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Scheduled-Rule.html
:)

Split the message into 3 fields and the 3rd field will be a valid json . I think in your case it would be
fields #timestamp, #message
| parse #message '[] * {"*"}' as field1, field2, field3
| limit 50
field3 is the valid json.
[INFO} will be the first field.

You can search JSON string representation, which is not as powerful.
For your example,
instead of { $.key = "lessonrecords-create" }
try "\"key\":\"lessonrecords-create\"".
This filter is not semantically identical to your requirement, though. It will also give events where key is not at the root of json.

you can use fluentd agent to send logs to Cloudwatch. Create custom grok pattern based on your metric filter.
Steps:
Install fluentd agent in your server
Install fluent-plugin-cloudwatch-logs plugin and fluent-plugin-grok-parser plugin
write your custom grok pattern based on your log format
please refer this blog for more information

Related

Don't show discovered fields in CloudWatch log insights?

Is it possible to configure CloudWatch log insights so it just shows the JSON log message, not the discovered fields? For example, currently my logs show up like this:
#timestamp xxx
#message { "field1": "value1", "field2": "value2" }
field1 value1
field2 value2
Because CloudWatch discovered the two fields in the JSON message. But is it possible to just see the JSON, like so:
{
"field1": "value1",
"field2": "value2"
}
This would make it much easier for me to copy logs to different sources, as the #message field can get very long and unreadable if it's not formatted as JSON, which each field on a separate line and with indentations. Also, when I copy a log entry in the first format from Log Insights, the formatting puts the field name and value on different lines which is difficult to read.

AWS Kendra PreHook Lambdas for Data Enrichment

I am working on a POC using Kendra and Salesforce. The connector allows me to connect to my Salesforce Org and index knowledge articles. I have been able to set this up and it is currently working as expected.
There are a few custom fields and data points I want to bring over to help enrich the data even more. One of these is an additional answer / body that will contain key information for the searching.
This field in my data source is rich text containing HTML and is often larger than 2048 characters, a limit that seems to be imposed in a String data field within Kendra.
I came across two hooks that are built in for Pre and Post data enrichment. My thought here is that I can use the pre hook to strip HTML tags and truncate the field before it gets stored in the index.
Hook Reference: https://docs.aws.amazon.com/kendra/latest/dg/API_CustomDocumentEnrichmentConfiguration.html
Current Setup:
I have added a new field to the index called sf_answer_preview. I then mapped this field in the data source to the rich text field in the Salesforce org.
If I run this as is, it will index about 200 of the 1,000 articles and give an error that the remaining articles exceed the 2048 character limit in that field, hence why I am trying to set up the enrichment.
I set up the above enrichment on my data source. I specified a lambda to use in the pre-extraction, as well as no additional filtering, so run this on every article. I am not 100% certain what the S3 bucket is for since I am using a data source, but it appears to be needed so I have added that as well.
For my lambda, I create the following:
exports.handler = async (event) => {
// Debug
console.log(JSON.stringify(event))
// Vars
const s3Bucket = event.s3Bucket;
const s3ObjectKey = event.s3ObjectKey;
const meta = event.metadata;
// Answer
const answer = meta.attributes.find(o => o.name === 'sf_answer_preview');
// Remove HTML Tags
const removeTags = (str) => {
if ((str===null) || (str===''))
return false;
else
str = str.toString();
return str.replace( /(<([^>]+)>)/ig, '');
}
// Truncate
const truncate = (input) => input.length > 2000 ? `${input.substring(0, 2000)}...` : input;
let result = truncate(removeTags(answer.value.stringValue));
// Response
const response = {
"version" : "v0",
"s3ObjectKey": s3ObjectKey,
"metadataUpdates": [
{"name":"sf_answer_preview", "value":{"stringValue":result}}
]
}
// Debug
console.log(response)
// Response
return response
};
Based on the contract for the lambda described here, it appears pretty straight forward. I access the event, find the field in the data called sf_answer_preview (the rich text field from Salesforce) and I strip and truncate the value to 2,000 characters.
For the response, I am telling it to update that field to the new formatted answer so that it complies with the field limits.
When I log the data in the lambda, the pre-extraction event details are as follows:
{
"s3Bucket": "kendrasfdev",
"s3ObjectKey": "pre-extraction/********/22736e62-c65e-4334-af60-8c925ef62034/https://*********.my.salesforce.com/ka1d0000000wkgVAAQ",
"metadata": {
"attributes": [
{
"name": "_document_title",
"value": {
"stringValue": "What majors are under the Exploratory track of Health and Life Sciences?"
}
},
{
"name": "sf_answer_preview",
"value": {
"stringValue": "A complete list of majors affiliated with the Exploratory Health and Life Sciences track is available online. This track allows you to explore a variety of majors related to the health and life science professions. For more information, please visit the Exploratory program description. "
}
},
{
"name": "_data_source_sync_job_execution_id",
"value": {
"stringValue": "0fbfb959-7206-4151-a2b7-fce761a46241"
}
},
]
}
}
The Problem:
When this runs, I am still getting the same field limit error that the content exceeds the character limit. When I run the lambda on the raw data, it strips and truncates it as expected. I am thinking that the response in the lambda for some reason isn't setting the field value to the new content correctly and still trying to use the data directly from Salesforce, thus throwing the error.
Has anyone set up lambdas for Kendra before that might know what I am doing wrong? This seems pretty common to be able to do things like strip PII information before it gets indexed, so I must be slightly off on my setup somewhere.
Any thoughts?
since you are still passing the rich text as a metadata filed of a document, the character limit still applies so the document would fail at validation step of the API call and would not reach the enrichment step. A work around is to somehow append those rich text fields to the body of the document so that your lambda can access it there. But if those fields are auto generated for your documents from your data sources, that might not be easy.

AWS CloudWatch Logs Insight Nested Json

My json log object is like this.
{
"FileName":"file1.xlsx",
"IsSuccess":false,
"LogList":[
{"ErrorDetail":"Text1"},
{"ErrorDetail":"Text2"},
{"ErrorDetail":"Text3"},
]
}
When I write the query in CloudWatch Logs Insight like below it list the nested json in a single line.
fields #timestamp, FileName, LogList.ErrorDetail:
LogsInsight Query Result Nested Json
LogsInsight Query Result
This makes very difficult to read as the user need to scroll horizontally. I want the list to be displayed veritically. How can this be achieved?

Seeking help manuerving JSON files in CloudWatch Log Insight

I have a question on using CloudWatch Log Insights when it comes to JSON files.
I am trying to include two log streams in one query for CloudWatch Logs Insights where I would want to focus on "level" to find errors:
Here is my code:
filter #logStream = 'ingest-23j23d3-daf4343ff3, ingest-2fdfd434d-dsa32434d'
| fields #message, #timestamp
| parse #message '"level": "*"' as level
| filter level == "error"
Here is an example of the JSON:
{
"message": "Could not delete old file cache entries: rimraf: callback function required",
"level": "error"
}
How can I incorporate more than one #logStream in my query. Also, can anyone direct me to maneuvering the JSON file for future use. I would greatly appreciate it.
I was able to fix the issue I had. Since I had no knowledge of Regex, I had to go into it's documentation and also AWS's and find ways of displaying my data:
filter level = "error" | filter strcontains(#logStream, 'ingest-')
| fields #timestamp, #message, level
I was able to filter my levels (which were debug, info, and error) to only show error. From here, I filtered ALL my logstreams beginning with ingest to find the error logs. I hope this helps anybody out in need of answers.

WSO2 CEP siddhi Filter issue

I am trying to use the siddhi query langage but it seems I am misusing it.
I have some events with the following streamdef :
{ 'name':'eu.ima.stat.events', 'version':'1.1.0', 'nickName': 'Flux event Information', 'description': 'Details of Analytics Statistics', 'metaData':[ {name:'HostIP','type':'STRING'} ], 'correlationData':[ {name:'ProcessType','type':'STRING'}, {name:'Flux','type':'STRING'}, {name:'ReferenceId','type':'STRING'} ], 'payloadData':[ {'name':'Timestamp','type':'STRING'}, {'name':'EventCode','type':'STRING'}, {'name':'Type','type':'STRING'}, {'name':'EventInfo','type':'STRING'} ]}
I am just trying to filter events with the same processus value and the same flux value using a query like this one :
from myEventStream[processus == 'SomeName' and flux == 'someOtherName' ]
insert into someStream
processus, flux, timestamp
Whenever I try this, no output is generated. When I get rid of the filter
from myEventStream
insert into someStream
processus, flux, timestamp
all my events are ther in the output.
What's wrong with my query ?
I can see some spell mistakes in your query... In the filter you have used a variable name called "processus" which is not in the event stream. That is why this query does not give any output. When you are creating a bucket in WSO2 CEP, make sure that the bucket is deployed correctly in the CEP server and check in the management console.(CEP BUCKETS --> List).
On your situation. bucket will not be deployed because of the wrong configuration and also there will be error messages printed in the terminal where CEP server runs. After correcting this mistake your query will run perfectly without any issue...
Regards,
Mohan
Considering Mohan's answer,rename 'ProcessType' or change your query like this
from myEventStream[ ProcessType == 'SomeName' and flux == 'someOtherName' ]
insert into someStream
ProcessType, flux, timestamp