Slash included words from string - regex

I have the following string and want to extract only the highlighted text with a regex:
"Datapath from SF7PCRINFVCR1/MPLS to SF2PCRINFVCR1/MPLS for fwdClass fc_nc is down".
I have tried the regex (?<=Datapath from ).*?(?= for), but I expect the result to appear as the second row under Group Details [attached screenshot], with
Match#2, GroupIndex#1 and GroupContent SF7PCRINFVCR1/MPLS to SF2PCRINFVCR1/MPLS, on the site below, because my integration tool looks for the second entry from the Group Details section.
https://www.freeformatter.com/java-regex-tester.html#before-output
Expecting output like

Related

Regex to find a specific anchor tag that have href with a specific domain and nofollow

I have a string that contains HTML. I want a regex that matches an anchor tag whose href has a specific domain name and that also has nofollow.
I have found the following, which works for the domain name but does not include the nofollow condition:
(<a\s*(?!.\brel=)[^>])(href="https?://)((?stackoverflow)[^"]+)"([^>]*)>
Let's say the domain name I want is stackoverflow.
Example:
- "click here " this would match
- "<a href="stackoverflow.com"> would not match since it has no nofollow
- "<a href="google.com" rel = "nofollow"> would not match
It's a bit hard to match an HTML tag with a specific condition, but the following regex should do it:
select regexp_match(str, '<a((?:\s+(([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))* href=((''(https?:\/\/)?stackoverflow\.com[^'']*'')|("(https?:\/\/)?stackoverflow\.com[^"]*"))((?: (([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))*\s+rel=("nofollow"|''nofollow'')((?: (([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))*\/?>') from tes;
It's really hard to read, but most of the regex is there to match attributes. The important part for you is stackoverflow\.com, which appears twice (once for an href in single quotes and once for an href in double quotes). Replace it with whatever domain you need, and don't forget to escape it properly.
Some notes
I don't know which regexp function you want to use, but the pattern should work with whichever one you need. Also, your example click here won't be matched, because you have spaces between the attribute name and the = sign (I don't know whether this is valid HTML). It will work with this click here . If you need to match tags that might include spaces around the = sign, just leave a comment and I'll try to edit the regex.
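If the matching can happen outside the database, an HTML parser is often easier to maintain than one large regex. The following is only a minimal Python sketch of that alternative, not the regex above: it uses the standard library's html.parser and urllib.parse, and the names NofollowLinkFinder and find_nofollow_links are made up for illustration. It assumes that "specific domain" means the host of the href equals the domain or is a subdomain of it.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class NofollowLinkFinder(HTMLParser):
    """Collects href values of <a> tags that point at `domain` and have rel=nofollow.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, domain):
        super().__init__()
        self.domain = domain.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Attributes without a value arrive as None, so normalise to "".
        attr_map = {name: (value or "") for name, value in attrs}
        href = attr_map.get("href", "")
        rel_tokens = attr_map.get("rel", "").lower().split()
        if "nofollow" in rel_tokens and self._on_domain(href):
            self.matches.append(href)

    def _on_domain(self, href):
        # Scheme-less values such as "stackoverflow.com/q/1" have no netloc,
        # so prefix "//" to make urlparse treat the first segment as the host.
        if "//" not in href:
            href = "//" + href
        host = (urlparse(href).hostname or "").lower()
        return host == self.domain or host.endswith("." + self.domain)


def find_nofollow_links(html, domain):
    """Return the hrefs of all matching anchor tags in `html`."""
    finder = NofollowLinkFinder(domain)
    finder.feed(html)
    return finder.matches
```

Calling find_nofollow_links(html, "stackoverflow.com") returns a list of href strings, so an empty list means no anchor satisfied both conditions.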

Regex PCRE capture multiple occurences query string in URL

I am trying to capture multiple occurrences of utm tags in a URL and append them when rewriting the URL. However, I only want the utm keys and values and want to skip the others.
This is a sample URL
https://example.com/dl/?screen=page&title=SABC&page_id=4063&myvalue=Noidea&utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19&utm_term=termTest19&test=value&utm_content=contentTest19
I tried this:
(\?.*)(page_id=([^&]*))(\?|&)(.*[&?]utm_[a-z]+=([^&]+).*)
and unfortunately, it doesn't produce the result I expect.
I need to capture both the page ID and the utm tags. I do not want test=value or myvalue=Noidea, only the query string parameters with utm tags.
Expected Result is the URL below:
https://example.com/dl/page_id/4063?utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19&utm_term=termTest19&utm_content=contentTest19
one group with pageid=<somenumber/text>
one group with all utm tags with key and value
Help will be appreciated.
You can use a regex like this to get the groups:
(?:(page_id|utm_[a-z]+)=[A-Za-z0-9]+)(?:^\&)?
You can instead replace any parameter that does not match the desired ones with the empty string. The pattern for this is
(?:[?&](?!(?:page_id|utm_[^=&]++)=)[^&]*+)++$|(?<=[?&])(?!(?:page_id|utm_[^=&]++)=)[^&]*+(?:&|$)
Here's a working proof: https://regex101.com/r/L5xcl4/2 It has an extra \s only so it works on the multiline input in the tester, but you shouldn't need it as you'll be working on a string that contains only a URL without whitespace.
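If the rewrite can be done in code rather than in a single PCRE rule, parsing the query string avoids most of the regex edge cases. This is a rough Python sketch of that alternative, not the pattern above; rewrite_url is a hypothetical helper name, and the /page_id/<value> path layout is taken from the expected result in the question.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def rewrite_url(url):
    """Move page_id into the path and keep only utm_* query parameters.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)

    # First page_id value, if any.
    page_id = next((value for key, value in params if key == "page_id"), None)
    # All utm_* pairs, in their original order.
    utm_params = [(key, value) for key, value in params if key.startswith("utm_")]

    path = parts.path.rstrip("/")
    if page_id is not None:
        path = f"{path}/page_id/{page_id}"

    query = urlencode(utm_params)
    rewritten = f"{parts.scheme}://{parts.netloc}{path}"
    return f"{rewritten}?{query}" if query else rewritten
```

Because the parameters are kept in a list of pairs, repeated utm keys and their order are preserved.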

How to extract file name from URL?

I have file names in a URL and want to strip out the preceding URL and filepath as well as the version that appears after the ?
Sample URL
I am trying to use a regex to pull out CaptialForecasting_Datasheet.pdf.
The REGEXP_EXTRACT in Google Data Studio seems unique. Tried the suggestion but kept getting "could not parse" error. I was able to strip out the first part of the url with the following. Event Label is where I store URL of downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now I am trying to work out how to strip everything from the ? onward, where the version data is, so as to extract just the Filename.pdf.
You could try:
[^\/]+(?=\?[^\/]*$)
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
Assuming that the name appears right after the last / and ends with the ?, the regular expression below will leave the name in group 1 where you can get it with \1 or whatever the tool that you are using supports.
.*\/(.*)\?
It basically says: get everything in between the last / and the first ? after, and put it in group 1.
Another regular expression that only matches the file name that you want but is more complex is:
(?<=\/)[^\/]*(?=\?)
It matches a run of non-/ characters, [^\/]*, that is immediately preceded by /, (?<=\/), and immediately followed by ?, (?=\?). The first parenthesized expression is a positive lookbehind, and the second is a positive lookahead.
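To compare the three patterns above side by side, here is a small sketch using Python's re module (an assumption made for illustration; Google Data Studio uses its own regex engine, and lookarounds may not be available there). The function names are invented for this example.

```python
import re

URL = ("https://www.dudesolutions.com/Portals/0/Documents/"
       "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033")


def name_by_lookahead(url):
    # Non-slash run followed by a "?" that has no "/" after it.
    match = re.search(r"[^/]+(?=\?[^/]*$)", url)
    return match.group(0) if match else None


def name_by_group(url):
    # Everything between the last "/" and the "?" after it, in group 1.
    match = re.search(r".*/(.*)\?", url)
    return match.group(1) if match else None


def name_by_lookarounds(url):
    # Non-slash run preceded by "/" and followed by "?".
    match = re.search(r"(?<=/)[^/]*(?=\?)", url)
    return match.group(0) if match else None


if __name__ == "__main__":
    for extract in (name_by_lookahead, name_by_group, name_by_lookarounds):
        print(extract.__name__, extract(URL))
```

All three return HC_Brochure_Digital.pdf for the sample URL. They differ when a ? appears inside the path: for https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver the lookbehind/lookahead version stops at the first candidate and returns somepath, while the other two still return the file name.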
This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")
Google Data Studio Report to demonstrate.
Please try the following regex:
[A-Za-z_]*\.pdf
I have tried it online at https://regexr.com/.
Please note that this only works for .pdf files.
The following regex will extract a file name with the .pdf extension:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf))
You can add more extensions like this:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf)|(?:\.jpg))

REGEX: select everything to the left until the first specified delimiter

I'm using ColdFusion functions to query an Active Directory Database, return Membership information for a user, then REGEX functions to search the output for specified groups. I made "|" a delimiter.
Anyway, here's some example output:
CN=Group One,OU=Distribution Lists,DC=Domain,DC=org|CN=Group Two,OU=Distribution Lists,DC=Domain,DC=org|CN=Group Three,OU=Distribution Lists,DC=Domain,DC=org|CN=Group Four,OU=Distribution Lists,DC=Domain,DC=org|CN=Group Five,OU=Distribution Lists,DC=Domain,DC=org
What I would like to capture is this:
CN=Group Three,OU=Distribution Lists,DC=Domain,DC=org
Here is what I've tried so far:
^|CN=(.*Group? Three)
Here's a link to the example: http://rubular.com/r/DIGZOPwTag
What's my problem?
Well, this doesn't work all that great... It goes to the left, but it goes too far! How do I stop it at the first occurrence of |CN= to the left?
Thank you in advance for your time. It is appreciated.
!!Clarification!!
Better Example Output:
CN=Pay Band 50,OU=Distribution Lists,DC=Domain,DC=org|CN=Human Resources,OU=Distribution Lists,DC=Domain,DC=org|CN=SiteA Staff,OU=Distribution Lists,DC=Domain,DC=org|CN=SiteB Additional Staff,OU=Distribution Lists,DC=Domain,DC=org|CN=Executives,OU=Distribution Lists,DC=Domain,DC=org
Desired matches:
I'm looking for specific groups:
Site Name w/Possible Spaces Staff
Site Name w/Possible Spaces Additional Staff
It would be awesome to return: "SiteAlpha Staff", "Site Beta Additional Staff". It would also be acceptable to include the "CN=" prefix because I could use it to do queries later.
"Staff" and "Additional Staff" will always be part of the group(s) I want to match.
What I've tried, again
^|CN=[^|CN=]*? Staff|Additional? Staff
This new example is not quite perfect, as it doesn't grab all of "Site Beta". "Site Beta" could be the name of any building, for example.
example link: http://rubular.com/r/vq5JcrvaBR
It is not really clear what you want to extract: only the "Group Three" CN value, or all CN values.
You can extract every CN value with this regex:
CN=([^,]*)
This regex begins extracting after each "CN=" occurrence and continues until the first comma (,).
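As a rough illustration of how that pattern could be combined with the "Staff" requirement from the clarification, here is a Python sketch (the question uses ColdFusion, so treat this as the idea only). It extracts every CN value first and then keeps the ones that end in " Staff"; staff_groups is a hypothetical helper name.

```python
import re


def staff_groups(member_of):
    """Return the CN values that end in ' Staff' from a |-delimited membership string.

    Hypothetical helper for illustration only.
    """
    # Capture everything after each "CN=" up to the next comma.
    names = re.findall(r"CN=([^,]*)", member_of)
    # "Additional Staff" groups also end in " Staff", so one check covers both.
    return [name for name in names if name.endswith(" Staff")]
```

Filtering after extraction keeps the regex itself simple and means site names with spaces need no special handling.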
A RegEx to fit your demands is:
^.*?\|

regex using positive lookahead

My source data text looks something like this:
a1,a2,a3
a4,a5,a6
a7,a8,a9
test="1"
b1,b2,b3
b4,b5,b6
b7,b8,b9
test="2"
c1,c2,c3
c4,c5,c6
c7,c8,c9
test="3"
I need to parse this so the end result looks like this (appropriate “test” field included in each row):
a1,a2,a3,1
a4,a5,a6,1
a7,a8,a9,1
b1,b2,b3,2
b4,b5,b6,2
b7,b8,b9,2
c1,c2,c3,3
c4,c5,c6,3
c7,c8,c9,3
...etc
This is what I started with; it captures the fields correctly:
(?<f1>.*?),(?<f2>.*?),(?<f3>.*?)\s+
I understand I need to use lookarounds to capture and include the “test” field on each line.
So something like this added (using a positive lookahead)…
(?<f1>.*?),(?<f2>.*?),(?<f3>.*?)\s+(?=test="(?<test>.*?)")
This seems close, but it does not yield all rows of data. Instead it yields only the last row of data with the included test value, as if it were consuming the lookahead row.
This expression, with its captured groups, is input into a .NET application that inserts the captured groups as fields in a database table. The number of fields is always static (4 in the example above; field1=f1, field2=f2, field3=f3, field4=test), but the number of records varies.
Any guidance would be appreciated.
Parsing your data to extract the relevant values
You are almost there, but need to allow the look ahead to skip the rows between the current one and the test line:
(?ms)(?<f1>\w+),(?<f2>\w+),(?<f3>\w+)\R(?=.*?^test="(?<test>\d+)")
\R matches all sorts of newlines, and (?ms) is the inline way of turning on the multiline and dot-matches-all modifiers, so that .*?^test matches every line up to the test one.
Again, your issue was that \s+ forced the lookahead to be on the line right after the one you were matching.
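To see the lookahead idea in action, here is a small sketch in Python rather than .NET, so two details are adapted and should be read as assumptions: Python's re has no \R, so \r?\n is used instead, and named groups are written (?P<name>...). The function name rows_with_test is invented for this example.

```python
import re

# Each data row is matched on its own; the lookahead then scans forward
# (DOTALL lets . cross newlines) to the next line that starts with test="...".
ROW = re.compile(
    r'^(?P<f1>\w+),(?P<f2>\w+),(?P<f3>\w+)\r?\n'
    r'(?=.*?^test="(?P<test>\d+)")',
    re.MULTILINE | re.DOTALL,
)


def rows_with_test(text):
    """Return each data row with the value of the following test line appended."""
    return [
        ",".join(match.group("f1", "f2", "f3", "test"))
        for match in ROW.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        'a1,a2,a3\na4,a5,a6\na7,a8,a9\ntest="1"\n'
        'b1,b2,b3\nb4,b5,b6\nb7,b8,b9\ntest="2"\n'
    )
    for row in rows_with_test(sample):
        print(row)
```

Because the lookahead consumes nothing, the next match starts on the following row, and the lazy .*? stops at the nearest test line, so every row picks up the test value of its own block.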