Cannot seem to regexp extract a word from a URL (Data Studio)

I read somewhere that Data Studio uses a slightly different regular expression flavor from other places: RE2. I managed to find a site for testing RE2 regexes and got my pattern working there, but it does not work in Data Studio.
I have these URLs that I want to extract from:
/marketing/news-717777
/finance/news-123456?asdasdasd_asdad
I want the regex to extract the word with the dash and the number, i.e. "news-******".
The result would look like this:
news-717777
news-123456
I cannot get it to work in Data Studio. The patterns I have tried are the following:
(news-).*(?=\?)|(news-).*
(news-).*(?=\\?)|(news-).*
(news-.*?)\?
(news-.*).*(?=\?)
The closest I have come is getting "news-***" with the number, but I cannot remove the "?" that comes after it. Does anyone have any ideas on this? Thank you in advance.

You can use several solutions here.
Solution 1: matching digits after a specific string (here, news-)
(news-[0-9]+)
See the regex demo. [0-9]+ matches one or more digits.
Solution 2: If there can be chars other than digits after news- (anything except ?), you can use
(news-[^?]+)
See this regex demo, where [^?]+ matches one or more chars other than a ? char.
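Both patterns can be checked quickly outside Data Studio. The following is a minimal sketch in Python, whose re module is not RE2, so it only approximates what Data Studio does. These two patterns use only character classes, quantifiers, and a capturing group, which RE2 also supports. The lookahead attempts in the question, such as (?=\?), are a likely reason for the failure, since RE2 does not support lookarounds. The variable names are illustrative.

```python
import re

# Sample paths from the question.
urls = [
    "/marketing/news-717777",
    "/finance/news-123456?asdasdasd_asdad",
]

# Solution 1: "news-" followed by one or more digits.
digits_only = re.compile(r"(news-[0-9]+)")
# Solution 2: "news-" followed by one or more characters other than "?".
up_to_query = re.compile(r"(news-[^?]+)")

for url in urls:
    # group(1) is the text captured by the parentheses.
    print(digits_only.search(url).group(1), up_to_query.search(url).group(1))
```

Both patterns give the same result on these two paths; they differ only when something other than digits follows news-.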

Related

Powershell Regex Log Files for Mitel

I am struggling to get this regex going. Here is the string I am trying to work with.
08:07:46.914 ( 1708: 8624) G-MST: 400000EF " guid=00040000-73b2-5c7f-2295-00104941e7b0" ("10.10.60.3","10.10.29.251"),(10292, 59046),2(ULaw),rsn:1,12:05:15.623 (UTC),pl:20,(s:7525, r:7557, l:0),(j:0,u:27037,o:0) flgs:0x00000000 "sip:TGrp_5,p111#10.10.60.3:5441",vpn:0
I am failing badly at this one. It's kicking my butt. Any help would be amazing. What I have so far:
(?<date>\d+[:]\d+[:]\d+[.]\d+).*?(?<InPorts>\d+).*?(?<OutPort>\d+).*?(?<GMST>\d+\w+).*?(?<Guid>\d+............................).*?(?<SourceIP>\d+\D+\d+\D+\d+\D+\d+).*?(?<targetIP>\d+\D+\d+\D+\d+\D+\d+).*?(?<SourceSpeed>\d+).*?(?<TargetSpeed>\d+).*?(?<AudioType>\d+).*?(?<rsn>\d+).*?(?<utc>\d+\D+\d+\D+\d+\D+\d+).*?(?<pl>\d+).*?(?<s>\d+).*?(?<r>\d+).*?(?<l>\d+).*?(?<j>\d+).*?(?<u>\d+).*?(?<o>\d+).*?(?<flags>\d+\w\d+).*?(?<sip>:(.*)").*?(?<vpn>\d+)
The problems with this code are: the GUIDs are different lengths, and the SIP value is not always TGrp_5. Sometimes it's just p111, and sometimes it's even more complex.
The ultimate goal with this regex is to parse logs that all match the same pattern into a database.
You may use a pattern like
(?<date>\d[\d:.]+)\W+(?<InPorts>\d+):\s*(?<OutPort>\d+)\W+G-MST:\s*(?<GMST>\w+)\W+guid=(?<Guid>[^"]+)"\W+(?<SourceIP>\d{1,3}(?:\.\d{1,3}){3})\W+(?<targetIP>\d{1,3}(?:\.\d{1,3}){3})\W+(?<SourceSpeed>\d+)\W+(?<TargetSpeed>\d+)\D+(?<AudioType>\d+)\D+(?<rsn>\d+)\W+(?<utc>\d[\d.:]*)\D+(?<pl>\d+)\D+(?<s>\d+)\D+(?<r>\d+)\D+(?<l>\d+)\D+(?<j>\d+)\D+(?<u>\d+)\D+(?<o>\d+)\D+(?<flags>0x\d+).*?:(?<sip>[^"]*)"\D+(?<vpn>\d+)
See the regex demo.
Its main points are:
Get rid of .* and .*? where possible; these patterns tend to "overfire" and overmatch.
Use specific patterns: \D+ to get from the current position to the nearest digit (if the next pattern is \d+), or \W+ if the next pattern starts with a word char.
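To illustrate these two points, here is a minimal sketch in Python covering only the first part of the log line. It is an adaptation, not the PowerShell/.NET pattern itself: Python writes named groups as (?P<name>...) instead of (?<name>...), and the names LINE and PATTERN are made up for the example.

```python
import re

# First part of the sample log line from the question.
LINE = (
    '08:07:46.914 ( 1708: 8624) G-MST: 400000EF '
    '" guid=00040000-73b2-5c7f-2295-00104941e7b0" '
    '("10.10.60.3","10.10.29.251"),(10292, 59046)'
)

# \W+ skips punctuation and spaces up to the next word character;
# no .* or .*? is needed between the fields.
PATTERN = re.compile(
    r'(?P<date>\d[\d:.]+)\W+(?P<InPorts>\d+):\s*(?P<OutPort>\d+)\W+'
    r'G-MST:\s*(?P<GMST>\w+)\W+guid=(?P<Guid>[^"]+)"\W+'
    r'(?P<SourceIP>\d{1,3}(?:\.\d{1,3}){3})\W+'
    r'(?P<targetIP>\d{1,3}(?:\.\d{1,3}){3})\W+'
    r'(?P<SourceSpeed>\d+)\W+(?P<TargetSpeed>\d+)'
)

match = PATTERN.search(LINE)
if match:
    # groupdict() maps each group name to its captured text,
    # which is convenient for inserting into a database row.
    for name, value in match.groupdict().items():
        print(name, "=", value)
```

Because [^"]+ stops at the closing quote, the Guid group works for GUIDs of any length.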

Regex - Find the Shortest Match Possible

The Problem
Given the following:
\plain\f2 This is the first part of the note. This is the second part of the note. This is the \plain\f2\fs24\cf6{\txfielddef{\*\txfieldstart\txfieldtype1\txfieldflags144\txfielddataval44334\txfielddata 35003800380039000000}{\*\txfielddatadef\txfielddatatype1\txfielddata 340034003300330034000000}{\*\txfieldtext 20{\*\txfieldend}}{\field{\*\fldinst{ HYPERLINK "44334" }}{\fldrslt{20}}}}\plain\f2\fs24 part of the note.
I'd like to produce this:
\plain\f2 This is the first part of the note. This is the second part of the note. This is the third part of the note.
What I've Tried
The example input/output is a very simplified version of the data I need to parse and it would be nice to have a way to parse the data programmatically. I have a PHP application and I've been trying to use regex to match the segments that are important and then filter out the parts of the string that aren't required. Here's what I've come up with so far:
/\\plain.*?\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*? /gm
regex101: https://regex101.com/r/ILLZU6/2
It almost matches what I want, but it grabs the longest possible match instead of the shortest. I want it to match only one \\plain before the \\field{.... Maybe after the \\plain, I could match anything except for a space? How would I go about doing that?
I'm no regex expert, but my use-case really calls for it. (Otherwise, I'd just write code to handle everything.) Any help would be much appreciated!
(?:(?!\\plain).)* matches any run of characters that does not contain a match for \\plain, so the match cannot start at an earlier \\plain and run across a later one. Here's the regex implementing this:
/\\plain(?:(?!\\plain).)*\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*? /gm
regex101: https://regex101.com/r/ILLZU6/5
Also, you can replace the space at the end with (?: |$) if you want to allow the end of the text to trigger it as well as a space:
/\\plain(?:(?!\\plain).)*\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*?(?: |$)/gm
regex101: https://regex101.com/r/ILLZU6/4
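The effect of (?:(?!\\plain).)* is easier to see on a much smaller input. The following is a minimal sketch in Python rather than PHP, with a made-up string and a shortened pattern (\field{...} only), so it shows the idea and not the full RTF case.

```python
import re

# Made-up, simplified input with three \plain markers.
text = r'\plain A \plain B \field{X}\plain C'

# Lazy version: .*? is lazy, but the match still starts at the FIRST \plain.
lazy = re.search(r'\\plain.*?\\field\{(.*?)\}', text)

# Tempered version: each character is consumed only if \plain does not
# start at that position, so the match cannot cross another \plain.
tempered = re.search(r'\\plain(?:(?!\\plain).)*\\field\{(.*?)\}', text)

print(lazy.group(0))      # \plain A \plain B \field{X}
print(tempered.group(0))  # \plain B \field{X}
```

A lazy quantifier only shortens the end of a match; the start is still the leftmost position that can succeed, which is why the tempered form is needed here.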

Matching a string between two sets of characters without using lookarounds

I've been working on some regex to try and match an entire string between two characters. I am trying to capture everything from "System", all the way down to "prod_rx." (I am looking to include both of these strings in my match). Below is the full text that I am working with:
\"alert_id\":\"123456\",\"severity\":\"medium\",\"summary\":\"System generated a Medium severity alert\\\\prod_rx.\",\"title\":\"123456-test_alert\",
The regex that I am using right now is...:
(?<=summary\\":\\").*?(?=\\")
This works perfectly when I am able to use lookarounds, such as in Regex101: https://regex101.com/r/jXltNZ/1. However, the regex parser in the software that my company uses does not support lookarounds (crazy, right?).
Anyway - my question is basically how can I match the above text described without using lookaheads/lookbehinds. Any help is VERY MUCH appreciated!!
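We can simply use a non-lookaround method, such as this expression with a capturing group:
.+summary\\":\\"(.+?)\\",
The capture is lazy, (.+?), so that it stops at the first \", after the summary rather than at the last one in the input.
Our data is in this capturing group:
(.+?)
our right boundary is:
\\",
and our left boundary is:
.+summary\\":\\"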
Demo
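Here is a minimal sketch of the capturing-group approach in Python, assuming the target parser lets you read group 1 instead of the whole match. It also compares a greedy and a lazy capture, because the sample input contains a second \", later in the line. The variable names are illustrative.

```python
import re

# Sample text from the question, kept verbatim in a raw string.
text = (
    r'\"alert_id\":\"123456\",\"severity\":\"medium\",'
    r'\"summary\":\"System generated a Medium severity alert\\\\prod_rx.\",'
    r'\"title\":\"123456-test_alert\",'
)

# In the pattern, \\ matches one literal backslash and " matches a quote.
greedy = re.search(r'summary\\":\\"(.+)\\",', text)
lazy = re.search(r'summary\\":\\"(.+?)\\",', text)

# The greedy capture runs on to the LAST \", and swallows the title field.
print(greedy.group(1))
# The lazy capture stops at the FIRST \", after the summary text.
print(lazy.group(1))
```

No lookbehind or lookahead is involved: the boundaries are matched as ordinary text and only the group in between is read out.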

Regex for extracting each word between hyphens

I am learning regex and trying to write a pattern that exactly matches each of the strings without the '-', so that I can iterate over the groups and print the respective strings.
I have a string that looks like "Abcd001-wd2s-vwe1-20180e3103.txt"
I was able to write a regex for extracting Abcd001, wd2s and .txt from above text as shown below
(\A[^-]+)=> Abcd001
(-[^-]+-)=> wd2s
(\..*)=>.txt
However, I was unable to come up with the correct pattern for extracting the exact strings vwe1 and 20180e3103
It will be really helpful if you can guide me on this or if there is a better approach to achieve this?
Please note: [^-.]+ may give me all the words separately but I am looking for an option where I have a group defined for each of these strings so that its one to one mapping.
Thanks!
To get vwe1 or 20180e3103 from the example data, you might use a quantifier {2} or {3} to repeat matching one or more word characters followed by a hyphen: (?:\w+-){2}.
Then you could capture in a group ([^-.]+), which matches one or more chars that are not a hyphen or a dot.
(?:\w+-){2}([^-.]+)
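A minimal sketch of this in Python, using the file name from the question. The last pattern is an assumption on my part rather than something from the answer: it puts every segment in its own group in one pass, and it only fits names with exactly four hyphen-separated parts plus an extension.

```python
import re

name = "Abcd001-wd2s-vwe1-20180e3103.txt"

# Skip two "word-" prefixes, then capture the third segment.
third = re.search(r"(?:\w+-){2}([^-.]+)", name).group(1)
# Skip three "word-" prefixes, then capture the fourth segment.
fourth = re.search(r"(?:\w+-){3}([^-.]+)", name).group(1)
print(third, fourth)

# Assumed alternative: one group per segment, one-to-one mapping.
all_parts = re.match(r"^([^-]+)-([^-]+)-([^-]+)-([^-.]+)(\..*)$", name)
for index, part in enumerate(all_parts.groups(), start=1):
    print(index, part)
```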
Try the below regex
/\-([^\)]+)\-/gmi;
Also check the similar implementation:
https://stackoverflow.com/a/50336050/8179245

Find Regex mismatch part in a string using vb.net

I have a regular expression
^\d{9}_[a-zA-Z]{1}_(0[1-9]|1[0-2]).(0[1-9]|[1-2][0-9]|3[0-1]).[0-9]{4}_\d*_[0-9a-zA-Z]*_[0-9a-zA-Z]*
and a string that matches it:
000066874_A_12.31.2014_001_2Q_ICAN14
If the user by mistake enters a string in a different format, like
000066874_12.31.14_001_2Q_ICAN14
I need to find out which part of my regex failed. I tried using Regex.Matches and Regex.Match, but with these I couldn't find which part of my string mismatched the regular expression. I am using vb.net.
This is very complicated to do with regex. I managed to make the following regex, but you still have to check the capture groups afterwards:
^(?:(?:(\d{9})|.*?)_)?(?:(?:([a-zA-Z]{1})|.*?)_)?(?:(?:((?:0[1-9]|1[0-2]).(?:0[1-9]|[1-2][0-9]|3[0-1]).[0-9]{4})|.*?)_)?(?:(?:(\d*)|.*?)_)?(?:(?:([0-9a-zA-Z]*)|.*?)_)?(?:([0-9a-zA-Z]*)|.*?)$
It works as seen in the demo: https://regex101.com/r/aJ1wG1/2
Each part before an underscore is a capture group; if a capture group is missing, that part contains an error. As you can see in the demo, $3 is not present in the first example, so there is a mistake in the date. In the second example, $2 is not present, so the groups from $2 onward are missing. The third example is correct and all 6 capture groups are there.
When regexes get this massive, it's a sign that probably a different method should be used to solve the problem, but this might work for you with some additional code for group result checks.
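As one possible "different method", here is a minimal sketch that splits the string on underscores and checks each part against its own small pattern. It is written in Python rather than vb.net, and the field names and the first_mismatch helper are made up for the example. It also assumes the date separators are meant to be literal dots (the original regex uses an unescaped ., which matches any character) and it ignores any extra parts after the sixth.

```python
import re

# Hypothetical field names; each sub-pattern is one underscore-separated
# piece of the question's regex, with the date dots escaped (assumption).
FIELDS = [
    ("id", r"\d{9}"),
    ("letter", r"[a-zA-Z]"),
    ("date", r"(0[1-9]|1[0-2])\.(0[1-9]|[12][0-9]|3[01])\.[0-9]{4}"),
    ("sequence", r"\d*"),
    ("code1", r"[0-9a-zA-Z]*"),
    ("code2", r"[0-9a-zA-Z]*"),
]


def first_mismatch(value):
    """Return the name of the first field that fails, or None if all pass."""
    parts = value.split("_")
    for index, (name, pattern) in enumerate(FIELDS):
        # A missing part counts as a failure of that field.
        if index >= len(parts) or not re.fullmatch(pattern, parts[index]):
            return name
    return None


print(first_mismatch("000066874_A_12.31.2014_001_2Q_ICAN14"))  # None
print(first_mismatch("000066874_12.31.14_001_2Q_ICAN14"))      # letter
```

For the mistyped string from the question this reports the letter field, because the second part is 12.31.14 instead of a single letter.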