Data Studio Regex (Google RE2) to Extract Subdirectory from Path - regex

I'm working with a Google Data Studio field that has a page URL Path contained within it. Examples:
/
/sample-url
/sample-url-2/
/#sample-url-5/
/sample-url-3/sample-url-4
/sample-url-3/sample-url-6
In each one, I want to capture the first part of the path in a custom formula/field: from the first slash, up to but excluding the second slash if there is one, and including the first slash if that's the whole path. (In essence, the first subdirectory.) I would be open to capturing the second slash when there is one if that would make the solution simpler, but I'm guessing it's more complicated that way. I tried the following:
REGEXP_EXTRACT(Field, "^/[^/]+/$")
But it didn't work; everything returned null. What is wrong with that string?

The ^/[^/]+/$ pattern matches a string that starts with a / char, then contains one or more chars other than / and then ends with a / char. So, you can only match strings like /abc/, /123abc/, /abc-1 2 3.?!/, etc.
You can use
REGEXP_EXTRACT(Field, "^(/[^/]*)")
See the regex demo.
NOTE: REGEXP_EXTRACT requires a capturing group in the pattern, the content captured is the return value.
Here, ^ matches the start of string and (/[^/]*) is a capturing group with ID 1 that matches a / char and then any zero or more chars other than / (with [^/]*).
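As a quick sanity check outside Data Studio, the capturing-group pattern can be run against the sample paths from the question. This is only a sketch using Python's re module, which is an assumption on my part: Data Studio runs RE2, but this pattern uses no syntax that differs between the two engines. The helper name first_segment is hypothetical.

```python
import re

# Same pattern as in REGEXP_EXTRACT(Field, "^(/[^/]*)")
FIRST_SEGMENT = re.compile(r"^(/[^/]*)")

def first_segment(path):
    """Return the leading slash plus the first path segment, or None."""
    m = FIRST_SEGMENT.search(path)
    return m.group(1) if m else None

# Sample paths copied from the question
paths = [
    "/",
    "/sample-url",
    "/sample-url-2/",
    "/#sample-url-5/",
    "/sample-url-3/sample-url-4",
    "/sample-url-3/sample-url-6",
]

results = {p: first_segment(p) for p in paths}
for path, segment in results.items():
    print(path, "->", segment)
```

Because [^/]* also accepts zero characters, the bare / path still yields /, which is the case the original ^/[^/]+/$ attempt could not cover.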

Related

Regex to pull last 2 segments from FQDN

Working on trying to figure out some regex to pull out the last 2 segments of an FQDN.
^.*\shostname=[\w-]+\.(?P<myfield>[^\t]+)
This RegEx works and takes out the first segment of an FQDN.
www.aaa.bbb.someurl.net --> aaa.bbb.someurl.net
But… I only want to keep the last 2 segments of any FQDN.
I need it to be --> someurl.net
Other restrictions:
The hostname field will always be at least 3 segments - don't know the max.
This is for Splunk so I can't use a script. I need it to be PCRE compatible regex.
Here is an example of data:
2021-07-20 18:19:14 reason=Not allowed to browse this category event_id=12345 protocol=HTTP action=Blocked transactionsize=16051 responsesize=789 requestsize=456 urlcategory=Blocked serverip=1.2.4.5 clienttranstime=0 requestmethod=GET refererURL=None useragent=Microsoft-Delivery location=Internal ClientIP=5.6.7.8 status=403 user=John url=dl.delivery.mp.microsoft.com/filestreamingservice/files/abcd-efgh-ijkl/pieceshash vendor=Zscaler hostname=dl.delivery.mp.microsoft.com
From this data I need the field “myfield” to be: microsoft.com.
The original answer with a much simpler regex ((?:\s|^)hostname=(?:[^\s.]+\.)*(?P<myfield>[^\s.]+\.[^\s.]+)) that worked for OP is in the question history.
You can use
(?:\s|^)hostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))
Or, to match the last hostname=... value on a line:
^.*\shostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))
See the regex #1 demo and regex #2 demo. Details:
(?:\s|^) - either a whitespace or start of string
hostname= - a literal substring
(?:[^\s.]+\.)*? - zero or more (but as few as possible) occurrences of one or more chars other than whitespace and dot and then a dot
(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S)) - Group "myfield": one or more chars other than whitespace and dot, then a dot, then any second-level domain or any one or more chars other than whitespace and dot and then either a whitespace or end of string.
Note: the \.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk pattern part (built from a regex trie) matches this list:
.ac.uk
.co.uk
.gov.uk
.judiciary.uk
.ltd.uk
.me.uk
.mod.uk
.net.uk
.nhs.uk
.nic.uk
.org.uk
.parliament.uk
.plc.uk
.police.uk
.royal.uk
.sch.uk
.govt.uk
.orgn.uk
.lea.uk
.mil.uk
If you want to add more second-level domain names, add them to the list and use https://www.myregextester.com or a similar service to build the word list regex.
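To see how the lazy prefix and the second-level domain list interact, here is a small sketch in Python's re module. Python is an assumption here (the question targets Splunk/PCRE), but the named-group syntax (?P<myfield>...) and the lookahead are the same in both. The log lines are shortened or made-up variants of the sample event, and registered_domain is a hypothetical helper name.

```python
import re

# Second-level domains taken from the answer's alternation
SECOND_LEVEL = (
    r"(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk"
    r"|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk"
)

HOSTNAME = re.compile(
    r"(?:\s|^)hostname=(?:[^\s.]+\.)*?"
    r"(?P<myfield>[^\s.]+\.(?:" + SECOND_LEVEL + r"|[^\s.]+)(?!\S))"
)

def registered_domain(line):
    """Return the value of the myfield group, or None if there is no match."""
    m = HOSTNAME.search(line)
    return m.group("myfield") if m else None

# Abbreviated version of the sample event from the question
event = "status=403 user=John vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"
# Hypothetical event with a .co.uk host, to exercise the second-level list
uk_event = "status=200 user=Jane hostname=www.example.co.uk"

print(registered_domain(event))
print(registered_domain(uk_event))
```

The lazy (?:[^\s.]+\.)*? consumes as few leading labels as it can, so the engine settles on the first split where the remainder is either label + listed second-level domain or label + single label, followed by whitespace or end of string.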
You could match all the non-whitespace chars that follow hostname= and then use a capture group to capture the last part, which contains a single dot.
^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)
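^.*\shostname= Match from the start of the line up to the last hostname= that follows a whitespace char
(?:\S+\.)? Optionally match as many non-whitespace chars as possible followed by a dot, which consumes every segment before the last two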
( Capture group 1
[^\s.]+\.[^\s.]+ Match 2 non dot parts with a . in between
) Close group
Regex demo
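A short Python sketch of this simpler pattern follows. Python's re is an assumption; the pattern itself is unchanged. The events are abbreviated or hypothetical, and last_two_labels is a made-up helper name. It also shows the limitation that motivates the longer patterns in this thread: a host under .co.uk comes back as co.uk.

```python
import re

# Pattern from the answer: group 1 holds the last two dot-separated labels
LAST_TWO = re.compile(r"^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)")

def last_two_labels(line):
    """Return the last two labels of the hostname= value, or None."""
    m = LAST_TWO.search(line)
    return m.group(1) if m else None

# Abbreviated version of the sample event from the question
event = "user=John vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"
# Hypothetical event showing that second-level domains are not handled
uk_event = "user=Jane hostname=www.example.co.uk"

print(last_two_labels(event))
print(last_two_labels(uk_event))
```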
If you would like to account for country codes, I've previously answered this at: Get Domain Extension From Hostname
The regular expression would look something like (shortened version): \w+((\.[a-z]{2,3})(\.(uk|au))?)$
The full expression with all country codes: \w+((\.[a-z]{2,3})(\.(ad|ae|af|ag|ai|al|am|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw))?)$
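For illustration, the shortened expression can be tried on bare hostnames as below. This is a sketch in Python's re, assuming the hostname has already been isolated from the log line (the pattern is anchored with $ and has no hostname= prefix). The hostnames are examples and domain_with_extension is a hypothetical helper name.

```python
import re

# Shortened version from the answer: only uk and au are treated as country codes
DOMAIN = re.compile(r"\w+((\.[a-z]{2,3})(\.(uk|au))?)$")

def domain_with_extension(hostname):
    """Return (domain, extension) for a bare hostname, or None."""
    m = DOMAIN.search(hostname)
    return (m.group(0), m.group(1)) if m else None

print(domain_with_extension("dl.delivery.mp.microsoft.com"))
print(domain_with_extension("www.example.co.uk"))
```

Note that [a-z]{2,3} limits the label before the country code to two or three letters, so longer generic extensions would need a wider range.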

Finding all XML Files containing specific strings using REGEX

I use VSCode for Salesforce and I have hundreds of fieldsets in the sandbox. I would like to use a regex to find all XML files that contain these 2 words in any order:
LLC_BI__Stage__c
LLC_BI__Status__c
I have tried using these regexes but they did not work, I am assuming because the strings are on different lines:
(?=LLC_BI__Stage__c)(?=LLC_BI__Status__c)
^(?=.*\bLLC_BI__Stage__c\b)(?=.*\bLLC_BI__Status__c\b).*$
(.* LLC_BI__Stage__c.* LLC_BI__Status__c.* )|(.* LLC_BI__Status__c.* LLC_BI__Stage__c.*)
E.g., this XML file contains both strings and should be returned:
<displayedFields>
<field>LLC_BI__Amount__c</field>
<isFieldManaged>false</isFieldManaged>
<isRequired>false</isRequired>
</displayedFields>
<displayedFields>
**<field>LLC_BI__Stage__c</field>**
<isFieldManaged>false</isFieldManaged>
<isRequired>false</isRequired>
</displayedFields>
<displayedFields>
<field>LLC_BI__lookupKey__c</field>
<isFieldManaged>false</isFieldManaged>
<isRequired>false</isRequired>
</displayedFields>
<displayedFields>
**<field>LLC_BI__Status__c</field>**
<isFieldManaged>false</isFieldManaged>
<isRequired>false</isRequired>
</displayedFields>
You could use an alternation to find the two strings in either order, and use [\s\S\r] to match any character including newlines.
If there is an issue using [\s\S\r], you might use [\S\r\n\t\f\v ]* instead.
(?:LLC_BI__Stage__c[\S\s\r]*LLC_BI__Status__c|LLC_BI__Status__c[\S\s\r]*LLC_BI__Stage__c)
Explanation
(?: Non capturing group
LLC_BI__Stage__c[\S\s\r]*LLC_BI__Status__c Match first part till second part
| Or
LLC_BI__Status__c[\S\s\r]*LLC_BI__Stage__c Match second part till first part
) Close group
Regex demo 1 and Regex demo 2
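The same alternation can be checked outside the editor. The sketch below uses Python's re as a stand-in for the VS Code search box, which is an assumption; the XML snippets are trimmed-down variants of the file in the question, and contains_both is a hypothetical helper name.

```python
import re

# Either order, with [\S\s\r]* spanning any characters including newlines
BOTH_FIELDS = re.compile(
    r"(?:LLC_BI__Stage__c[\S\s\r]*LLC_BI__Status__c"
    r"|LLC_BI__Status__c[\S\s\r]*LLC_BI__Stage__c)"
)

def contains_both(text):
    """True when both field names occur somewhere in text, in any order."""
    return BOTH_FIELDS.search(text) is not None

stage_then_status = """<displayedFields>
<field>LLC_BI__Stage__c</field>
</displayedFields>
<displayedFields>
<field>LLC_BI__Status__c</field>
</displayedFields>"""

status_then_stage = """<field>LLC_BI__Status__c</field>
<field>LLC_BI__Stage__c</field>"""

only_stage = "<field>LLC_BI__Stage__c</field>"

print(contains_both(stage_then_status))
print(contains_both(status_then_stage))
print(contains_both(only_stage))
```

To scan a folder, the same function could be applied to each file's contents; that part is left out here to keep the example self-contained.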

Regex to match only urls that contains a certain path

I am using the following regex
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
and it's showing me a url but I want to show only URLS that contain
/video/hd/
The following correction of the Regex above did not deal correctly with slashes
((?:https\:\/\/)|(?:http\:\/\/)|(?:www\.))?([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(?:\??)[a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]+)
You said only the whole match is used, and the regex contains no backreferences. Therefore we can replace all capturing groups (( )) in the regex by non-capturing groups ((?: )). A few of the groups are redundant, and http|https can be simplified to https?. Together this gives us
(?:https?|ftp)://[\w_-]+(?:\.[\w_-]+)+(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
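_ is not allowed in hostnames, so the explicit _ can be dropped from the character classes (note that \w still matches _; a strict check would need [A-Za-z0-9-] instead):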
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Technically - cannot appear at the beginning or end of a hostname, but we'll ignore that. Your regex doesn't allow non-default ports or IPv6 hosts either, but we'll ignore that, too.
The stuff matched by the last part of your regex (which is presumably meant to match path, query string, and anchor all together) can overlap with the hostname (both \w and - are in both character classes). We can fix this by requiring a separator of either / or ? after the hostname:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[/?][\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Now we can start looking at your additional requirement: The URL should contain /video/hd/. Presumably this string should appear somewhere in the path. We can encode this as follows:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/(?:[\w.,^=%&:/~+-]*/)?video/hd/(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Instead of matching an optional separator of / or ?, we now always require a / after the hostname. This / must be followed by either video/hd/ directly or 0 or more path characters and another /, which is then followed by video/hd/. (The set of path characters does not include ? (which would start the query string) or # (which would start the anchor).)
As before, after /video/hd/ there can be a final part of more path components, a query string, and an anchor (all optional).
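Here is a small sketch of that final pattern in Python's re (the language is an assumption; the question does not name one). The path character class is written without ? and #, as the explanation describes. The URLs are made-up examples and is_hd_video_url is a hypothetical helper name.

```python
import re

HD_VIDEO_URL = re.compile(
    r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+"    # scheme and hostname
    r"/(?:[\w.,^=%&:/~+-]*/)?video/hd/"         # path that contains /video/hd/
    r"(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?"  # optional rest: path, query, anchor
)

def is_hd_video_url(url):
    """True when the whole string is a URL whose path contains /video/hd/."""
    return HD_VIDEO_URL.fullmatch(url) is not None

# Hypothetical URLs for illustration
candidates = [
    "https://example.com/video/hd/clip.mp4",
    "http://example.com/a/b/video/hd/",
    "https://example.com/video/sd/clip.mp4",
    "https://example.com/?next=/video/hd/",
]

for url in candidates:
    print(url, "->", is_hd_video_url(url))
```

The last candidate is rejected because /video/hd/ appears only in the query string, which the path part of the pattern cannot reach.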
First of all, you need a regex to match URLs (be they http, https...)
(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))
Once you've got that, you need to select them but not "consume" them. You can do this with a lookahead, i.e. a regex that asserts that what follows the current position is e.g. foo:
(?=foo)
Of course you will replace foo with the first regex I wrote.
At this point, you know you selected a URL; now you just constrain your search to URLs that contain /video/hd/:
.*\/video\/hd\/.*
So the complete regex is
(?=(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))).*\/video\/hd\/.*
You can test it here with a live demo.

Using regex to get string after final occurrence of / in a URL

I have a large list of URLS such as:
https://www.walmart.com/ip/Cabbage-Patch-Kids-Naptime-Babies-Doll-Blonde-Hair-Blue-Eye-Girl/45792420
https://www.walmart.com/ip/My-Life-As-18-inch-Schoolgirl-Doll-Blonde/336940687
https://www.walmart.com/ip/My-Life-As-18-inch-Everyday-Girl-Doll-African-American/52730785
I need to find all instances after the final / such as 45792420 within the file.
I'm using Sublime Text 3 to do the search with regex.
I created the following regex
\/(?:.(?!\/))+$
however it is returning the / with the string rather than just the string that occurs after the /
For example /45792420
How can I just get whatever comes after the final / ?
Just use \K to keep anything before the \K out of the match:
\/\K(?:.(?!\/))+$
I would use a zero-width assertion (lookbehind) for this:
(?<=\/)\d+$
If the last part is not all digits, you can search for word characters:
(?<=\/)\w+$
Based on your regex, you can simply apply a lookbehind (zero-width assertion):
(?<=\/)(?:.(?!\/))+$
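The lookbehind variant can be tried on the URLs from the question with the sketch below. Python's re is an assumption (the question uses Sublime Text); it is used here because it supports lookbehind, whereas \K is not available in the standard re module. The re.MULTILINE flag makes $ match at the end of every line.

```python
import re

# Lookbehind keeps the slash out of the match; $ anchors at each line end
LAST_SEGMENT = re.compile(r"(?<=/)(?:.(?!/))+$", re.MULTILINE)

# URLs copied from the question, one per line
urls = """\
https://www.walmart.com/ip/Cabbage-Patch-Kids-Naptime-Babies-Doll-Blonde-Hair-Blue-Eye-Girl/45792420
https://www.walmart.com/ip/My-Life-As-18-inch-Schoolgirl-Doll-Blonde/336940687
https://www.walmart.com/ip/My-Life-As-18-inch-Everyday-Girl-Doll-African-American/52730785
"""

ids = LAST_SEGMENT.findall(urls)
print(ids)
```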

regex with 3 backreferences but one optional

I have a regular expression that captures three backreferences though one (the 2nd) may be null.
Given the following string:
http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajonathonoat.es&source=web&cd=1&ved=0CC8QFjAA&url=http%3A%2F%2Fjonathonoat.es%2Fbritish-mozcast%2F&ei=MQj9UKejDYeS0QWruIHgDA&usg=AFQjCNHy1cDoWlIAwyj76wjiM6f2Rpd74w&bvm=bv.41248874,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1
I wish to capture the TLD (in this case .co.uk), q param and cd param.
I'm using the following RegEx:
/.*\.google([a-z\.]*).*q=(.*[^&])?.*cd=(\d*).*/i
Which works, except the 2nd backreference includes the other parameters up to the cd param. I currently get this:
["http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajo…,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1 ", ".co.uk", "site%3Ajonathonoat.es&source=web", "1", index: 0, input: "http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajo…,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1"]
The 1st backreference is correct, it's .co.uk and so is the 3rd; it's 1. I want the 2nd backreference to be either null (or undefined or whatever) or just the q param, in this example site%3Ajonathonoat.es. It currently includes the source param too (site%3Ajonathonoat.es&source=web).
Any help would be much appreciated, thanks!
I've added a JSFiddle of the code, look in your browser console for the output, thanks!
When negating character classes, I always add a multiplier to the class itself:
/.*\.google([a-z\.]*).*q=([^&]*).*cd=(\d*).*/i
I also recommend being wary of bare * or + as they are "greedy"; use *? or +?, or a negated character class, when you are going to find delimiters inside your string. For more on greediness, check Jeffrey Friedl's Mastering Regular Expressions.
You want the middle group to be:
q=([^&]*)
This will capture characters other than ampersand. This also allows zero characters, so you can remove the optional group (?).
Working example: http://rubular.com/r/AJkXxgeX5K
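The effect of the [^&]* group can be checked with the sketch below. It uses Python's re rather than the JavaScript from the question, which is an assumption; the pattern is the same apart from the flag syntax. The first URL is a shortened version of the one in the question, the second is hypothetical, and parse_google_url is a made-up helper name.

```python
import re

# Group 1: TLD after "google", group 2: q param, group 3: cd param
GOOGLE_URL = re.compile(r".*\.google([a-z\.]*).*q=([^&]*).*cd=(\d*).*", re.IGNORECASE)

def parse_google_url(url):
    """Return (tld, q, cd) or None when the URL does not match."""
    m = GOOGLE_URL.match(url)
    return m.groups() if m else None

# Shortened version of the URL from the question
url = ("http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajonathonoat.es"
       "&source=web&cd=1&ved=0CC8QFjAA")
# Hypothetical URL with an empty q param
empty_q = "http://www.google.com/url?q=&cd=3"

print(parse_google_url(url))
print(parse_google_url(empty_q))
```

Because [^&] cannot cross an ampersand, the group stops at the end of the q value even though * is greedy, and an empty q simply yields an empty string.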