Capturing content from a string - regex

I am attempting to parse some logs to get the specific catalog numbers for the items viewed. I have broken out all the necessary fields and am now parsing the referer field to get the catalog id of the page viewed.
The strings are in the following formats:
/catalog/AAA1111111
/catalog/BBB-22222-1/
/catalog/CCC-333333/XXX
http://url/catalog/DDD-44444444
http://url/catalog/EEE-555555555/ZZZ
I am using the following regex to strip out the catalog id:
.*\/catalog\/([^\/]+)
The problem is that I cannot stop the regex from grabbing everything after the next forward slash. It looks like it is too greedy?
The results are:
AAA1111111
BBB-22222-1/
CCC-333333/XXX
DDD-44444444
http:EEE-555555555/ZZZ
I've been banging my head on this one for a couple of hours.
I am just looking for a regex that will extract only the catalog id (the string after catalog/).
Can anyone help guide this old coder in the proper direction?
Many thanks.

Using sed:
cat catalogs | sed -E 's/.*\/catalog\/([^/]+)\/?.*/\1/g'
results in:
AAA1111111
BBB-22222-1
CCC-333333
DDD-44444444
EEE-555555555
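Note that the only modification is the trailing \/?.* which also matches whatever follows the catalog id, so the substitution replaces the whole line with the captured group.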
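For comparison, here is a minimal Python sketch of the same idea, assuming the referer values are already available as plain strings. The names `CATALOG_ID` and `catalog_id` are made up for illustration. Because `re.search` looks for the pattern anywhere in the string, no leading `.*` is needed, and reading group 1 returns only the id instead of a rewritten line.

```python
import re

# Hypothetical helper names; the sample strings are the ones from the question.
CATALOG_ID = re.compile(r"/catalog/([^/]+)")


def catalog_id(referer):
    """Return the path segment right after /catalog/, or None if absent."""
    match = CATALOG_ID.search(referer)
    return match.group(1) if match else None


samples = [
    "/catalog/AAA1111111",
    "/catalog/BBB-22222-1/",
    "/catalog/CCC-333333/XXX",
    "http://url/catalog/DDD-44444444",
    "http://url/catalog/EEE-555555555/ZZZ",
]

ids = [catalog_id(s) for s in samples]
print(ids)
```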

Why use a regex when you can split on "/catalog/", take the last item, then split on "/" and take the first item?
In Python, this could be done like this:
line.split('/catalog/')[-1].split('/')[0]
I just wanted to point out that regexes are not the solution to every string-parsing problem.
Often, when you are faced with "greedy" parsing, doing a "manual" modification before using a regex helps.
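A slightly more defensive variant of the split approach might look like the sketch below. The function name `catalog_id_by_split` is hypothetical, and returning `None` for lines without "/catalog/" is an assumption about the desired behaviour, since the one-liner above would return the first path segment of the line in that case.

```python
def catalog_id_by_split(line):
    """Return the segment after the last "/catalog/", or None if it is missing."""
    # rpartition yields (before, separator, after); the separator is empty
    # when "/catalog/" does not occur in the line at all.
    _, sep, rest = line.rpartition("/catalog/")
    if not sep:
        return None
    # Keep only the text up to the next forward slash.
    return rest.split("/", 1)[0]


for sample in ("/catalog/BBB-22222-1/", "http://url/catalog/EEE-555555555/ZZZ"):
    print(catalog_id_by_split(sample))
```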

Related

Splunk regex how to regex multiple triple backslashes

First, FYI: I have searched for answers and am having trouble finding something that works. I am an end user (and a newbie) of Splunk and have log data that looks like this (snippet):
{\\\"routepoint\\\":\\\"1234567890\\\",\\\"prefix\\\":\\\"\\\",
I want to pull a field called "routepoint" with the value 1234567890.
I tried:
rex field=_raw "\"routepoint\\\\\\\":\\\\\\\"(?<routepoint>\d+)\\\\\\\""
and then tried to do a | table routepoint but it isn't working.
Hoping there is a better way of extracting this field. Thanks in advance for helping this newbie!
Beachsandguy
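This is not a Splunk answer, but the underlying regex idea can be sketched in Python: rather than counting the exact number of backslashes, match a run of one or more backslashes followed by the quote. The sample line is copied from the snippet above. Whether the same pattern can be passed to rex unchanged depends on Splunk's own string escaping, which this sketch does not cover.

```python
import re

# Raw string, so every backslash below is a literal character in the data.
raw = r'{\\\"routepoint\\\":\\\"1234567890\\\",\\\"prefix\\\":\\\"\\\",'

# In a raw-string pattern, \\+ means "one or more literal backslashes".
pattern = re.compile(r'routepoint\\+":\\+"(\d+)')

match = pattern.search(raw)
routepoint = match.group(1) if match else None
print(routepoint)
```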

Extract only the text field needed

I am just starting to learn regex, and I use every opportunity to understand how it works. Currently I am trying to extract dates from a text file (in fact a .vnt file from my mobile phone). It looks like this:
BEGIN:VNOTE
VERSION:1.1
BODY;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:18.07.=0A14.08.=0A15.09.=0A15.10.=
=0A13.11.=0A13.12.=0A12.01.=0A03.02. Grippe=0A06.03.=0A04.04.2015=0A0=
5.05.2015=0A03.06.2015=0A03.07.2015=0A02.08.2015=0A30.08.2015=0A28.09=
17.11.2017=0A
DCREATED:20171118T095601
X-IRMC-LUID:150
END:VNOTE
I want to extract all dates, so that the final list is like that:
18.07.
14.08.
15.09.
15.10.
and so on. If a date also has a year, the year should be displayed as well.
I have almost worked out how to detect the dates with the following regex:
.+(\d\d\.\d\d\.(2015|2016|2017)?).+
But it only detects very few of the dates. The result is this:
BEGIN:VNOTE
VERSION:1.1
15.10.
04.04.2015
30.08.2015
24.01.2016
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
Then I tried adding a question mark, which, as far as I have read in tutorials, makes the .+ non-greedy. The regex then looks like this:
.+?(\d\d\.\d\d\.(2015|2016|2017)?).+?
But the result is still not what I am looking for:
BEGIN:VNOTE
VERSION:1.1
21.03.20.04.18.05.18.06.18.07.14.08.15.09.15.10.
13.11.13.12.12.01.03.02.06.03.04.04.20150A0=
03.06.201503.07.201502.08.201530.08.20150A28.09=
28.10.201525.11.201528.12.201524.01.20160A
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
For someone who is familiar with regex I am pretty sure this is very easy to solve, but I don't get it. It's very confusing when you are new to regex. I tried to find a hint in some tutorials or stackoverflow posts, but all I found is this: Notepad++ how to extract only the text field which is needed?
But it doesn't work for me. I assume it might have something to do with the fact that my text file is not one single line.
I have my example on regex101 too.
I would be very thankful if maybe someone can give me a hint what else I can try.
Edit: I would like to detect the dates with the regex and as a result have a list with only the dates (maybe it is called substitute?)
Edit 2: Sorry for not mentioning it earlier: I just want to use the regex in e.g. Notepad++ or an online regex test website, to get the dates and save the result in a new txt file. I don't want to use the regex in a programming language. My apologies for not being precise before.
Edit 3: The result should be a list with the dates, and each date in a new line:
18.07.
14.08.
15.09.
15.10.
I suggest this pattern:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)
This makes use of the \G anchor, which matches at the position where the previous match ended (or at the start of the text for the first match). In this case it lets each match begin exactly where the last one stopped, so no unmatched character is left in between and everything except the captured dates can be removed.
If you want to remove the extra matches as well, add |.* at the end:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)|.*
regex101 demo
In Notepad++, make sure the underlined options are selected and that the cursor is at the beginning of the file. In the picture below, I ran the replacement and then undid it, only to show that the matches were identified (16 replacements).
You can try using the following pattern:
\d{2}\.\d{2}\.(?:\d{4})?
This will match day.month dates of the form 18.07., and it also allows such a date to be followed by a four-digit year, e.g. 18.07.2017. While it would be nice to make the pattern more restrictive to avoid false positives, I do not see anything obvious that can be added to it. Follow the demo link below to see the pattern in action.
Demo
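If a short script is acceptable after all, the last pattern can be tried in Python as sketched below. One assumption is added here: the note body is quoted-printable, and in the sample at least one date is split across a soft line break (a line ending in `0=` followed by `5.05.2015`), so the sketch decodes the body with the standard `quopri` module before matching. The `body` string is an excerpt of the sample above, not the full file.

```python
import quopri
import re

# Excerpt of the BODY value from the question; "=" at the end of a line is a
# quoted-printable soft line break and "=0A" encodes a newline.
body = (
    "18.07.=0A14.08.=0A15.09.=0A15.10.=\n"
    "=0A13.11.=0A13.12.=0A12.01.=0A03.02. Grippe=0A06.03.=0A04.04.2015=0A0=\n"
    "5.05.2015=0A03.06.2015"
)

# Decode first so that dates broken across soft line breaks are rejoined.
decoded = quopri.decodestring(body.encode("ascii")).decode("utf-8")

# Day.month. with an optional four-digit year, as in the answer above.
dates = re.findall(r"\d{2}\.\d{2}\.(?:\d{4})?", decoded)
print("\n".join(dates))
```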

What is the best way to extract text between tracking params using regex?

I have some data that I need help cleaning up. For some reason, tracking params are being stored within the database so what is the best way to extract the search query minus the tracking params using regex? I need to extract the following search queries:
things to do
las vegas
airport parking
from the following data:
{"query":"things to do","prefilteredchannel":"gpse
{"query":"las vegas","prefilteredchannel":"gpsea
{"query":"airport parking
I've tried a few things but I can only match the things I don't care about and I don't know how to just extract the search query. I'm new to this so any help would be appreciated.
Any ideas on how to make this work with the Platfora regex_replace:
http://documentation.platfora.com/webdocs/index.html#reference/expression_language/function_regex_replace.html
Use this regex; it's quite simple:
{"query":"([^"]*)(?:"|$)
See demo here regex101
You can use following:
\{"query":"([^"]*|$)
It will match the query value until " is encountered or end of string (whichever comes first).
Demo
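As a quick check of the capture-group idea, here is a small Python sketch run against the three sample rows. It only demonstrates the pattern itself; how Platfora's regex_replace expects the pattern and the group reference to be written is not covered here.

```python
import re

rows = [
    '{"query":"things to do","prefilteredchannel":"gpse',
    '{"query":"las vegas","prefilteredchannel":"gpsea',
    '{"query":"airport parking',
]

# Capture everything after {"query":" up to the next double quote,
# or up to the end of the string when the value was truncated.
QUERY = re.compile(r'\{"query":"([^"]*)')

queries = [m.group(1) for m in map(QUERY.search, rows) if m]
print(queries)
```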

regex to remove hyperlinks

Input:
source http://www.emaxhealth.com/1275/misdiagnosing from here http://www.cancerresearchuk.org/about-cancer/type recounting her experiences and thoughts blog http://fty720.blogspot.com even carried the new name. She was far from home.
From the above input I want to remove the hyperlinks. Below is the regex that I am trying:
http://[\w|\W|\d|\s]*(?=[ ])
This regex is meant to match all characters, digits and whitespace after the word 'http' and to continue until the first blank space.
Unfortunately, it is not working as expected. Please help me find my error. Thanks.
Try this sed command:
sed 's/http[^ ]\+//g' FileName
Output:
source from here recounting her experiences and thoughts blog even carried the new name. She was far from home.
To find the hyperlink, use the following (with case-insensitive matching enabled, since the character classes only list A-Z):
\b(https?)://[A-Z0-9+&##/%?=~_|$!:,.;-]*[A-Z0-9+&##/%=~_|$]
Or, if you want to find the HTML a tag, use:
<a\b[^>]*>(.*?)</a>
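The reason the original attempt over-matches can be seen in a short Python sketch: the class `[\w|\W|\d|\s]` accepts every character, spaces included, so the greedy `*` runs to the end of the text and then backs up only as far as the last space. Matching non-whitespace with `\S+` stops at the first space instead. The variable names are illustrative, and also consuming the whitespace after each URL is a choice made here to avoid leaving double spaces.

```python
import re

text = (
    "source http://www.emaxhealth.com/1275/misdiagnosing from here "
    "http://www.cancerresearchuk.org/about-cancer/type recounting her "
    "experiences and thoughts blog http://fty720.blogspot.com even carried "
    "the new name. She was far from home."
)

# The pattern from the question: one single match that swallows almost everything.
greedy_match = re.search(r"http://[\w|\W|\d|\s]*(?=[ ])", text).group(0)

# \S+ cannot cross a space, so each URL is matched separately.
URL = re.compile(r"https?://\S+\s*")
cleaned = URL.sub("", text)
print(cleaned)
```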

Grabbing specific query string parameters from URL with regex

We have an implementation of Liferay portal and I'm just getting started with using Google Analytics with it. I'm noticing a lot of duplicate entries in GA, mainly because of the query strings in the URI, for example:
/web/home-community/search-and-help?p_p_id=mytcdirectory_WAR_mytcdirectory&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-3&p_p_col_count=4&_mytcdirectory_WAR_mytcdirectory_action=getResults
I'm playing around with the Search and Replace filters in GA (using regex) and my goal is to try to pull out the ?p_p_id and &*_action parameters from the URI, and disregard the rest. I'm getting close with the following regex:
^([^\?]+)([\?\&]p_p_id=[^\&]+)?.*(\&[^\&]+_action=[^\&]+)?.*$
But that last grouping isn't working correctly. If I remove the ? from the end of the last grouping it matches, but the problem with that approach is that not all URIs contain that query string so it needs to be optional. But if I keep it in, it won't grab that last parameter. My regex fiddle is located here:
http://regex101.com/r/qQ2dE4/13
Thank you all in advance for any help.
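One likely explanation for the behaviour described above: the greedy `.*` in front of the last group consumes the rest of the URI, and because that group is optional, the engine accepts an empty match for it and never backtracks. A possible fix is to move a lazy `.*?` inside each optional group, as in the Python sketch below. The names `PATTERN` and `keep_params` are made up, the sketch assumes p_p_id comes before the _action parameter, and whether GA's filter engine accepts the same syntax has not been verified here.

```python
import re

uri = (
    "/web/home-community/search-and-help"
    "?p_p_id=mytcdirectory_WAR_mytcdirectory&p_p_lifecycle=1"
    "&p_p_state=normal&p_p_mode=view&p_p_col_id=column-3&p_p_col_count=4"
    "&_mytcdirectory_WAR_mytcdirectory_action=getResults"
)

PATTERN = re.compile(
    r"^([^?]+)"                        # path before the query string
    r"(?:.*?[?&](p_p_id=[^&]+))?"      # optional p_p_id parameter
    r"(?:.*?&([^&=]+_action=[^&]+))?"  # optional *_action parameter
    r".*$"                             # whatever is left is discarded
)


def keep_params(u):
    """Rebuild the URI with only the p_p_id and *_action parameters."""
    m = PATTERN.match(u)
    if m is None:
        return u
    path, p_id, action = m.groups()
    kept = [p for p in (p_id, action) if p]
    return path + ("?" + "&".join(kept) if kept else "")


print(keep_params(uri))
```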