CloudWatch Insights - Group logs by url with unique ids removed

I'm looking to use CloudWatch Logs Insights to group logs by a request url field, however the url can contain 0-2 unique numerical identifiers that I'd like to be ignored when doing the grouping.
Some examples of urls:
/dev/user
/dev/user/123
/dev/user/123/inventory/4
/dev/server/3/statistics
The groups would look something like:
/dev/user
/dev/user/
/dev/user//inventory/
/dev/server//statistics
I have something quite close to what I need: it extracts the section of the url in front of the first optional identifier and the section between the first and second identifiers, and concatenates the two, but it isn't totally reliable. This is where I'm at currently; @message is valid JSON which contains an 'endpoint' field that looks like one of the urls above:
fields @message | parse endpoint /(\bdev)\/(?<@prefix>[^0-9]+)(?:[0-9]+)(?<@suffix>[^0-9]+)/ | stats count(*) by @prefix
While this query will work with endpoints like '/dev/accounts/1' it ignores endpoints like '/dev/accounts' as it doesn't have all of the components the regex is looking for, which means I'm missing a lot of results.

If there are 0-2 numerical identifiers that you want to remove, you could match the first and optionally match the second number and use 2 capturing groups to capture what you want to keep.
In the replacement, use the two capturing groups: $1$2
^(.*?\/)\d+(?:(.*?\/)\d+\b)?
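The replacement can be checked outside the query editor. The following is a minimal Python sketch, assuming the example urls from the question; the name strip_ids is hypothetical, and Python's re module stands in for whichever engine performs the $1$2 replacement.

```python
import re
from collections import Counter

# Pattern from the answer above; "/" needs no escaping in Python.
ID_PATTERN = re.compile(r"^(.*?/)\d+(?:(.*?/)\d+\b)?")


def strip_ids(url: str) -> str:
    """Drop up to two numeric identifiers, keeping the text around them."""
    # group(2) is None when there is no second identifier, so fall back to "".
    return ID_PATTERN.sub(lambda m: m.group(1) + (m.group(2) or ""), url)


urls = [
    "/dev/user",
    "/dev/user/123",
    "/dev/user/123/inventory/4",
    "/dev/server/3/statistics",
]

# Count entries per normalised endpoint, similar in spirit to `stats count(*) by ...`.
groups = Counter(strip_ids(u) for u in urls)
for endpoint, count in groups.items():
    print(endpoint, count)
```

A url without any identifier does not match the pattern at all, so it passes through unchanged and still forms its own group.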

Looks like I can use question marks outside of capture groups to mark those groups as optional, which has resolved the last issue I was having.
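The final query was not posted, so the pattern below is an assumption: it is the regex from the question with ? added after the identifier group and the suffix group. It is sketched in Python, which spells named groups as (?P<name>...) and does not allow @ in group names; split_endpoint is a hypothetical helper.

```python
import re

# Assumed variant of the question's regex: the identifier and the suffix
# are each followed by "?" so endpoints without them still match.
ENDPOINT = re.compile(
    r"\bdev/(?P<prefix>[^0-9]+)(?:[0-9]+)?(?P<suffix>[^0-9]+)?"
)


def split_endpoint(endpoint: str):
    """Return (prefix, suffix); suffix is None when the url has no second part."""
    m = ENDPOINT.search(endpoint)
    if m is None:
        return None
    return m.group("prefix"), m.group("suffix")


for url in ("/dev/accounts", "/dev/accounts/1", "/dev/user/123/inventory/4"):
    print(url, split_endpoint(url))
```

With this variant, '/dev/accounts' is no longer dropped; it yields the prefix 'accounts', while '/dev/accounts/1' yields 'accounts/'.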

Related

Okta Group Attribute Statement Regex Filter

I wrote the regex below in Okta group attribute statement filter which returns all the groups a user is part of based on the group naming convention.
.*H_DAM_.*|.*H_TOOLS_.*|.*H_ASSOCIATES.*
Sample output for a particular user:
H_DAM_TESTER
CDG_H_DAM_ADMIN
CDG_H_ASSOCIATES
If I want Okta to remove "CDG_", so the output group names would start with H_ only, what would be the correct syntax for my regex?

Extract date from AWS Cloudwatch insights

I need to extract date_from and date_to from the following log field value.
date_from=11-04-2020&date_to=01-04-2021&page_size=1000&page=1 in AWS cloudwatch
So far I have tried the parse keyword with the regex \d{2}-\d{2}-\d{4}, and it does not work.
What I ultimately want to do is extract these two dates and get the time difference between them in days.
Here's the query I tried:
filter @logStream like /<log-stream>/ and process like /rest-call/ | parse parameters '\d{2}-\d{2}-\d{4}' as @date | display @date

You can capture both date_from and date_to into two named capturing groups:
parse parameters /date_from=(?<date_from>\d{2}-\d{2}-\d{4}).*?date_to=(?<date_to>\d{2}-\d{2}-\d{4})/ | display date_from, date_to
If the date format can vary, you may replace the specific \d{2}-\d{2}-\d{4} pattern with the more generic [^&]+:
/date_from=(?<date_from>[^&]+).*?date_to=(?<date_to>[^&]+)/
Note that .*? matches any zero or more chars other than line break chars, as few as possible (it is necessary to make sure the regex engine can "travel" all the way from the first capture to the second one as the regex engine parses the string from left to right and can never "skip" parts of a string, it should consume them).
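Both patterns can be tried against the sample value from the question. This is a small Python sketch rather than a CloudWatch query; Python writes named groups as (?P<name>...), and extract_dates is a hypothetical helper name.

```python
import re

# Strict variant: dates must look like NN-NN-NNNN.
SPECIFIC = re.compile(
    r"date_from=(?P<date_from>\d{2}-\d{2}-\d{4}).*?date_to=(?P<date_to>\d{2}-\d{2}-\d{4})"
)
# Generic variant: each value runs up to the next "&".
GENERIC = re.compile(
    r"date_from=(?P<date_from>[^&]+).*?date_to=(?P<date_to>[^&]+)"
)


def extract_dates(parameters: str, pattern=SPECIFIC):
    """Return (date_from, date_to) as strings, or None if the pattern does not match."""
    m = pattern.search(parameters)
    if m is None:
        return None
    return m.group("date_from"), m.group("date_to")


sample = "date_from=11-04-2020&date_to=01-04-2021&page_size=1000&page=1"
print(extract_dates(sample))
print(extract_dates("date_from=2020/11/04&date_to=2021/01/04", GENERIC))
```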

For anyone looking: AWS does not currently have any date-time functions to convert a date (e.g. mm/dd/yyyy) to a timestamp. Therefore, I exported the results of the above query to a CSV and did the timestamp calculations in Google Sheets.
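The same post-processing could be done with a few lines of Python instead of a spreadsheet. This is only a sketch: it assumes the exported values are the strings captured above and that they are in mm-dd-yyyy order, which the question does not state explicitly; days_between and its fmt parameter are hypothetical.

```python
from datetime import datetime


def days_between(date_from: str, date_to: str, fmt: str = "%m-%d-%Y") -> int:
    """Number of whole days from date_from to date_to, both parsed with fmt."""
    start = datetime.strptime(date_from, fmt)
    end = datetime.strptime(date_to, fmt)
    return (end - start).days


# Assumption: month-day-year, i.e. 4 Nov 2020 to 4 Jan 2021.
print(days_between("11-04-2020", "01-04-2021"))
# If the logs are actually day-month-year, pass the other format explicitly.
print(days_between("11-04-2020", "01-04-2021", fmt="%d-%m-%Y"))
```

The two readings give very different answers (61 days versus 355 days), so the field order should be confirmed before relying on the result.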

How to use Postgres Regex Replace with a capture group

As the title says, I am trying to reference a capture group for a regex replace in a Postgres query. I have read that REGEXP_REPLACE does not support using regex capture groups. The regex I am using is
r"(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?"gm
The above regex almost does what I need, but I need to find out how to allow a match only if the capture groups also capture something. There is no situation where a "username" should be matched if it just so happens to be a substring of a word. By ensuring it's surrounded by one of the characters above, I can be much more confident that it is a username.
An example application of the regex would be something like this in postgres (of course I would be doing an update vs a select):
select *, REGEXP_REPLACE(reqcontent,'(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm') from table where column like '%username%' limit 100;
If there is any more context that can be provided please let me know. I have also found similar posts (postgresql regexp_replace: how to replace captured group with evaluated expression (adding an integer value to capture group)) but that talks more about splicing in values back in and I don't think quite answers my question.
More context and example values for the regex to work against: the text below may look familiar, as these are JQL filters in Jira. We are looking to update our usernames and all their occurrences in the table that contains the filters. Below are a few examples of filters. We were originally just doing a find and replace, but that doesn't work because we have some usernames that are only two characters long, and it was matching on non-usernames (e.g. for the username je, it would place the new value where the word project is found, which completely malforms the JQL string, resulting in something like proNEW-VALUEct = blah blah).
type = bug AND status not in (Closed, Executed) AND assignee in (test, username)
assignee=username
assignee = username
Definition of Answered:
Regex that will only match on a 'username' if its surrounded by one of the specials
A way to regex/replace that username in a postgres query.

Capturing groups are used to keep the important bits of information matched with a regex.
Either use capturing groups around the string parts you want to keep in the result, and use their placeholders in the replacement:
REGEXP_REPLACE(reqcontent,'([\s\(\)\=\)\,])username([\s\(\)\=\)\,])?' ,'\1NEW-VALUE\2', 'gm')
Or use lookarounds:
REGEXP_REPLACE(reqcontent,'(?<=[\s\(\)\=\)\,])username(?=[\s\(\)\=\)\,]|$)' ,'NEW-VALUE', 'gm')
Or, in this case, use word boundaries (\y in PostgreSQL) so that username is replaced only when it stands as a whole word:
REGEXP_REPLACE(reqcontent,'\yusername\y' ,'NEW-VALUE', 'g')
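The capture-group and word-boundary approaches can be compared outside the database. The sketch below uses Python's re module, where \b plays the role of PostgreSQL's \y; the function names are hypothetical and the inputs are the JQL examples from the question.

```python
import re


def replace_with_groups(jql: str, old: str, new: str) -> str:
    """Capture the delimiters around the username and put them back."""
    pattern = rf"([\s()=,]){re.escape(old)}([\s()=,])?"
    # group(2) is None when nothing follows the username, so fall back to "".
    return re.sub(pattern, lambda m: m.group(1) + new + (m.group(2) or ""), jql)


def replace_whole_word(jql: str, old: str, new: str) -> str:
    """Replace the username only where it stands as a whole word."""
    return re.sub(rf"\b{re.escape(old)}\b", lambda _m: new, jql)


filters = [
    "type = bug AND status not in (Closed, Executed) AND assignee in (test, username)",
    "assignee=username",
    "assignee = username",
]
for f in filters:
    print(replace_whole_word(f, "username", "NEW-VALUE"))

# A two-character username no longer corrupts the word "project".
print(replace_whole_word("project = X AND assignee = je", "je", "NEW-VALUE"))
```

Because the trailing group in the first approach is optional, it still matches when the username is only the start of a longer word, which is one reason to prefer the word-boundary form here.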

Grok - parsing optional fields

I've got data coming from Kafka and I want to send it to Elasticsearch. I've got a log like this with tags:
<TOTO><ID_APPLICATION>APPLI_A|PRF|ENV_1|00</ID_APPLICATION><TN>3</TN></TOTO>
I'm trying to parse it with grok using grok debugger:
\<ID_APPLICATION\>%{WORD:APPLICATION}\|%{WORD:PROFIL}\|%{WORD:ENV}\|%{WORD:CODE}\</ID_APPLICATION\>\<TN\>%{NUMBER:TN}\</TN\>
It works, but sometimes the log has a new field like this (the one with the tag <TP>):
<TOTO><ID_APPLICATION>APPLI_A|PRF|ENV_1|00</ID_APPLICATION><TN>3</TN><TP>new</TP></TOTO>
I'd like to get lines with this field (the TP tag) and lines without. How can I do that?

If you have an optional field, you can wrap its named capture in an optional non-capturing group:
(?:<TP>%{WORD:TP}</TP>)?
^^^                   ^
The non-capturing group does not save any submatches in memory and is used for grouping only, and ? quantifier matches 1 or 0 times (=optional). It will create a TP field with a value of type word. If the field is absent, the value will be null.
So, the whole pattern will look like:
<ID_APPLICATION>%{WORD:APPLICATION}\|%{WORD:PROFIL}\|%{WORD:ENV}\|%{WORD:CODE}</ID_APPLICATION><TN>%{NUMBER:TN}</TN>(?:<TP>%{WORD:TP}</TP>)?
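The effect of the optional group can be reproduced with a plain regex. The sketch below is an approximation in Python, not grok itself: it assumes %{WORD:...} behaves roughly like \w+ and simplifies %{NUMBER:...} to \d+; parse_log is a hypothetical helper.

```python
import re

# Approximate translation of the grok pattern above into a Python regex.
LOG_PATTERN = re.compile(
    r"<ID_APPLICATION>(?P<APPLICATION>\w+)\|(?P<PROFIL>\w+)\|(?P<ENV>\w+)\|(?P<CODE>\w+)</ID_APPLICATION>"
    r"<TN>(?P<TN>\d+)</TN>"
    r"(?:<TP>(?P<TP>\w+)</TP>)?"  # optional: TP is None when the tag is absent
)


def parse_log(line: str):
    """Return the captured fields as a dict, or None if the line does not match."""
    m = LOG_PATTERN.search(line)
    return m.groupdict() if m else None


without_tp = "<TOTO><ID_APPLICATION>APPLI_A|PRF|ENV_1|00</ID_APPLICATION><TN>3</TN></TOTO>"
with_tp = "<TOTO><ID_APPLICATION>APPLI_A|PRF|ENV_1|00</ID_APPLICATION><TN>3</TN><TP>new</TP></TOTO>"

print(parse_log(without_tp))
print(parse_log(with_tp))
```

Both lines match the same pattern; only the value of TP differs.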

This is the filter I tested in the Heroku app, after reading the documentation on how to use grok operators.
I created my own pattern, called "content" that will retrieve whatever it is inside your TP tags.
\<ID_APPLICATION\>%{WORD:APPLICATION}\|%{WORD:PROFIL}\|%{WORD:ENV}\|%{WORD:CODE}\<\/ID_APPLICATION\>\<TN>%{NUMBER:TN}\<\/TN\>(\<TP\>(?<content>(.)*)\<\/TP\>)?
Basically, I just added an optional tag to your pattern.
(<TP> ... </TP>)?
To retrieve the content, which I assume can be anything, I added the following inside the optional tags.
(?<content>(.)*)

Google Analytics - Content grouping - Regex fix

This is our URL structure:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
http://www.disabledgo.com/access-guide/kingston-university/coombehurst-court-2
http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2
http://www.disabledgo.com/access-guide/redbridge-college/brook-centre-learning-resource-centre
I am trying to create a list of groups based on the client names
/access-guide/[this bit]/...
So I can have a performance list of all our clients.
This is my regex:
/access-guide/(.*universit(y|ies)|.*colleg(e|es))/
I want it to group anything that has university/ies or college/es in it, at any point within that client name section of the URL.
At the moment, my current regex will only return groups that are X-University:
Durham-University
Plymouth-University
Cardiff-University
etc.
What does the regex need to be to have the list I'm looking for?
Do I need to have something at the end to stop it matching things after the client name? E.g. ([^/]+$)?
Thanks for your help in advance!

Depending upon your needs you may want to do:
/access-guide/([^/]*(?:university|universities|college|colleges)[^/]*)/
This will match names even if "university" or "college" is not at the end of the string, for example "college-of-the-ozarks". Note the non-capturing inner parentheses; they should probably be used no matter which solution you go with, as you don't want to capture just the word "university" or "college".
Additionally, I don't know what may be in your data, but if you may have compound words you want to exclude, using \b may be advisable. For instance, if you don't want to match "miskatonic-postcollege", you may want to do something like this:
/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/
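To see what the first capture group holds for the urls in the question, the word-boundary pattern can be run in Python. This is a sketch only; it assumes Analytics treats the regex the same way Python's re does, and client_group is a hypothetical name.

```python
import re

# Word-boundary variant from the answer above.
CLIENT = re.compile(
    r"/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/"
)


def client_group(url: str):
    """Return the client-name segment if it contains one of the keywords, else None."""
    m = CLIENT.search(url)
    return m.group(1) if m else None


urls = [
    "http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2",
    "http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2",
    "http://www.disabledgo.com/access-guide/redbridge-college/brook-centre-learning-resource-centre",
    "http://www.disabledgo.com/access-guide/miskatonic-postcollege/some-building",
]
for u in urls:
    print(client_group(u))
```

The [^/]* on both sides of the keyword keeps the match inside the client-name segment while still capturing the whole name.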

If the client name section of the URL is after the access-guide/ and before the next /:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
                                      |----------------------------|
you need to use a negated character class to only match university before the regex reaches that rightmost / boundary.
As per the Reference:
You can extract pages by Page URL, Page Title, or Screen Name. Identify each one with a regex capture group (Analytics uses the first capture group for each expression)
Thus, you can use
/access-guide/([^/]*(universit(y|ies)|colleges?))
               ^^^^^
The regex matches
/access-guide/ - leftmost boundary, matches /access-guide/ literally
[^/]* - any character other than / (so we still remain in that customer section)
(universit(y|ies)|colleges?) - university, or universities, or college, or colleges literally. Add more if needed.

We Keep Coding
