I need to extract date_from and date_to from the following log field value in AWS CloudWatch:
date_from=11-04-2020&date_to=01-04-2021&page_size=1000&page=1
So far I have tried the parse keyword with the regex \d{2}-\d{2}-\d{4}, but it does not work.
What I ultimately want to do is extract these two dates and get the time difference between them in days.
Here's the query I tried:
filter #logStream like /<log-stream>/ and process like /rest-call/ | parse parameters '\d{2}-\d{2}-\d{4}' as #date | display #date
You can capture both date_from and date_to into two named capturing groups:
parse parameters /date_from=(?<date_from>\d{2}-\d{2}-\d{4}).*?date_to=(?<date_to>\d{2}-\d{2}-\d{4})/ | display date_from, date_to
If the date format can vary, you may replace the specific \d{2}-\d{2}-\d{4} pattern with the more generic [^&]+:
/date_from=(?<date_from>[^&]+).*?date_to=(?<date_to>[^&]+)/
Note that .*? matches any zero or more chars other than line-break chars, as few as possible. It is necessary to make sure the regex engine can "travel" all the way from the first capture to the second one: the engine parses the string from left to right and can never "skip" parts of a string, it must consume them.
For anyone looking: AWS does not currently have any date/time functions in Logs Insights to convert a date (e.g. mm/dd/yyyy) to a timestamp. Therefore, I exported the results of the above query to a CSV and did the timestamp calculations in Google Sheets.
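Since Logs Insights can't do the day arithmetic itself, here is a minimal Python sketch of the same extraction plus the difference in days (assuming the dates are dd-mm-yyyy; Python spells named groups (?P<name>…)):

```python
import re
from datetime import datetime

line = "date_from=11-04-2020&date_to=01-04-2021&page_size=1000&page=1"

m = re.search(
    r"date_from=(?P<date_from>\d{2}-\d{2}-\d{4})"
    r".*?date_to=(?P<date_to>\d{2}-\d{2}-\d{4})",
    line,
)
# Parse both captures as dd-mm-yyyy and subtract to get whole days
date_from = datetime.strptime(m.group("date_from"), "%d-%m-%Y")
date_to = datetime.strptime(m.group("date_to"), "%d-%m-%Y")
print((date_to - date_from).days)  # 355
```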
Related
I'm looking to use CloudWatch Logs Insights to group logs by a request URL field; however, the URL can contain 0-2 numeric identifiers that I'd like to be ignored when doing the grouping.
Some examples of urls:
/dev/user
/dev/user/123
/dev/user/123/inventory/4
/dev/server/3/statistics
The groups would look something like:
/dev/user
/dev/user/
/dev/user//inventory/
/dev/server//statistics
I have something quite close to what I need: it extracts the section of the URL in front of the first optional identifier and the section between the first and second identifiers, then concatenates the two, but it isn't totally reliable. This is where I'm at currently; #message is valid JSON that contains an 'endpoint' field that looks like one of the URLs above:
fields #message | parse endpoint /(\bdev)\/(?<#prefix>[^0-9]+)(?:[0-9]+)(?<#suffix>[^0-9]+)/ | stats count(*) by #prefix
While this query works with endpoints like '/dev/accounts/1', it ignores endpoints like '/dev/accounts', which don't have all of the components the regex is looking for, so I'm missing a lot of results.
If there are 0-2 numerical identifiers that you want to remove, you can match the first number and optionally match the second, using 2 capturing groups to capture what you want to keep.
In the replacement, use the 2 capturing groups: $1$2
^(.*?\/)\d+(?:(.*?\/)\d+\b)?
Looks like I can use question marks outside of capture groups to mark those groups as optional, which has resolved the last issue I was having.
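As a quick check outside of CloudWatch, the same pattern can be exercised with Python's re.sub (note that Python writes the group references as \1\2, and since Python 3.5 an unmatched optional group substitutes as an empty string):

```python
import re

# Strip up to two numeric path segments, keeping the surrounding parts
pattern = re.compile(r"^(.*?/)\d+(?:(.*?/)\d+\b)?")

urls = [
    "/dev/user",
    "/dev/user/123",
    "/dev/user/123/inventory/4",
    "/dev/server/3/statistics",
]
grouped = [pattern.sub(r"\1\2", url) for url in urls]
print(grouped)
# ['/dev/user', '/dev/user/', '/dev/user//inventory/', '/dev/server//statistics']
```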
I am trying to remove the square brackets around a date field in Google Data Studio so I can treat it as a proper date dimension.
It looks like this:
[2020-05-20 00:00:23]
and I am using the RegEx REGEXP_REPLACE(Date, "/[\[\]']+/g", ""), and I want the output to look like this:
2020-05-20 00:00:23
It keeps giving me error results and will not work. I cannot figure out what I am doing wrong here; I've used https://www.regextester.com/ to verify that it should work.
Regarding Dates, it can be achieved with a single TODATE Calculated Field:
TODATE(Date, "[%Y-%m-%d %H:%M:%S]", "%Y%m%d%H%M%S")
The Date Type can then be set as required:
YYYYMMDD: Date
YYYYMMDDhh: Date Hour
YYYYMMDDhhmm: Date Hour Minute
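The TODATE call above is just a parse-then-reformat step; an equivalent sketch in Python, reusing the same two format strings, would be:

```python
from datetime import datetime

raw = "[2020-05-20 00:00:23]"
# Parse with the bracketed input format, then emit YYYYMMDDhhmmss
parsed = datetime.strptime(raw, "[%Y-%m-%d %H:%M:%S]")
print(parsed.strftime("%Y%m%d%H%M%S"))  # 20200520000023
```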
You need to use a plain regex pattern, not a regex literal notation (/.../g).
Note that REGEXP_REPLACE removes all occurrences found, thus, there is no need for a g flag.
Use
REGEXP_REPLACE(Date, "[][]+", "")
to remove all square brackets in Date.
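The [][]+ character class (a literal ] placed first inside the class, then [) matches runs of either bracket; Python's re treats it the same way, which makes the pattern easy to verify:

```python
import re

raw = "[2020-05-20 00:00:23]"
# Remove every [ and ] character
clean = re.sub(r"[][]+", "", raw)
print(clean)  # 2020-05-20 00:00:23
```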
I'm building out a Google Data Studio dashboard and I need to create a calculated field for the year a post was published. The year is in the URI path, but I'm not sure how to extract it using REGEXP_EXTRACT. I've tried a number of solutions proposed on here but none of them seem to work on Data Studio.
In short, I have a URI like this: /theme/2019/jan/blog-post-2019/
How do I use the REGEXP_EXTRACT function to get the first 2019 after theme/ and before /jan?
Try this:
REGEXP_EXTRACT(Page, 'theme\/([0-9]{4})\/[a-z]{3}\/')
where:
theme\/ means literally "theme/";
([0-9]{4}) is a capturing group containing 4 characters from 0 to 9 (i.e. four digits);
\/[a-z]{3}\/ means a slash, followed by 3 lowercase letters (supposing that you want the regex to match all the months), followed by another slash. If you want something more restrictive, try with \/(?:jan|feb|mar|...)\/ for the last part.
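The same extraction can be sanity-checked in Python, where re.search plus group(1) plays the role of REGEXP_EXTRACT's first capturing group:

```python
import re

page = "/theme/2019/jan/blog-post-2019/"
# Capture the 4-digit year between "theme/" and a 3-letter month segment
m = re.search(r"theme/([0-9]{4})/[a-z]{3}/", page)
print(m.group(1))  # 2019
```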
As you mentioned, I think you only want to extract the year from the string. The following will achieve that for you; fit the query to your needs. Note that lookbehind/lookahead requires a PCRE-style engine: RE2-based engines (including Google Data Studio's and BigQuery's REGEXP functions) do not support lookarounds, so use the capturing-group approach above there.
SELECT REGEXP_EXTRACT(url, "(?<=\/theme\/)\d{4}(?=\/[a-zA-Z]{3})") AS year
FROM Sample_table
I am using a query to pull a user ID from a column that contains text. This is for a forum system I am using and want to get the User id portion out of a text field that contains the full message. The query I am using is
SELECT REGEXP_SUBSTR(message, '(?:member: )(\d+)'
) AS user_id
from posts
where message like '%quote%';
Now, ignoring the fact that that's ugly SQL and not final, I just need to get to the point where it reads the user ID. The following is an example of the text you would see in the message column:
[QUOTE="Moony, post: 967760, member: 22665"]I'm underwhelmed...[/QUOTE]
Hopefully we aren’t done yet and this is nothing but a depth signing!
Is there something different about the regular expression when used in MariaDB's REGEXP_SUBSTR? It should be PCRE, it works in regex testers, and it should read correctly. It should find the "member: " part and then match the numbers after it, giving a single match for each of those posts.
This is an ugly hack/workaround that uses a lookbehind for "member: " and a lookahead for the following "] (REGEXP_SUBSTR returns the whole match, not a capturing group, which is why your pattern returns "member: 22665" instead of just the number). However, it will not work if there are multiple quotes in a post:
(?<=member: )\d+(?="])
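Python's re also supports these fixed-width lookarounds, so the workaround can be verified on the sample message:

```python
import re

message = ('[QUOTE="Moony, post: 967760, member: 22665"]'
           "I'm underwhelmed...[/QUOTE]")
# Lookbehind for "member: ", lookahead for the closing "]
m = re.search(r'(?<=member: )\d+(?="\])', message)
print(m.group())  # 22665
```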
I am trying to fix bad data in a Postgres DB where photo tags were appended twice.
The trip is wonderful.<photo=2-1-1601981-7-1.jpg><photo=2-1-1601981-5-2.jpg>We enjoyed it very much.<photo=2-1-1601981-5-2.jpg><photo=2-1-1601981-7-1.jpg>
As you can see in the string, the photo tags were already added, but then they were appended to the text again. I want to remove the second occurrence of each tag. The first occurrences are in a certain order, and I want to keep them.
I wrote a function that could construct a regex pattern:
CREATE OR REPLACE FUNCTION dd_trip_photo_tags(tagId int) RETURNS text
LANGUAGE sql IMMUTABLE
AS $$
SELECT string_agg(concat('<photo=', media_name, '>.*?(<photo=', media_name, '>)'), '|')
FROM t_ddtrip_media
WHERE tag_id = tagId
$$;
This captures the second occurrence of a certain photo tag.
Then, I use regexp_replace to replace the second occurrence:
update t_ddtrip_content set content = regexp_replace(content,dd_trip_photo_tags(332761),'') from t_ddtrip_content where tag_id=332761;
Yet it removes all matched text, not just the second occurrence. I have looked online for days but still couldn't figure out a way to fix this. Appreciate any help.
This should work.
Regex 1:
<photo=.+?>
See: https://regex101.com/r/thHmlq/1
Regex 2:
<.+?>
See: https://regex101.com/r/thHmlq/2
Input:
The trip is wonderful.<photo=2-1-1601981-7-1.jpg><photo=2-1-1601981-5-2.jpg>We enjoyed it very much.<photo=2-1-1601981-5-2.jpg><photo=2-1-1601981-7-1.jpg>
Output:
<photo=2-1-1601981-7-1.jpg>
<photo=2-1-1601981-5-2.jpg>
<photo=2-1-1601981-5-2.jpg>
<photo=2-1-1601981-7-1.jpg>
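For reference, the intended keep-the-first-drop-the-repeats behavior can be sketched in Python, using Regex 1 above to find the tags (this is an illustration of the desired result, not the Postgres fix itself):

```python
import re

content = (
    "The trip is wonderful."
    "<photo=2-1-1601981-7-1.jpg><photo=2-1-1601981-5-2.jpg>"
    "We enjoyed it very much."
    "<photo=2-1-1601981-5-2.jpg><photo=2-1-1601981-7-1.jpg>"
)

def drop_repeated_tags(text):
    seen = set()
    def repl(match):
        tag = match.group(0)
        if tag in seen:
            return ""      # repeat: drop it
        seen.add(tag)
        return tag         # first occurrence: keep it
    return re.sub(r"<photo=.+?>", repl, text)

print(drop_repeated_tags(content))
# The trip is wonderful.<photo=2-1-1601981-7-1.jpg><photo=2-1-1601981-5-2.jpg>We enjoyed it very much.
```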