How to extract only the country domain from a URL (Data Studio / regex)

I'm struggling to extract only the country code from a URL, for example .pl from https://www.google.pl.
At the moment I can extract google.pl from the provided URL using the following code:
TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(URL, "https?://", ""), R"^(w{3}\.)?", ""), "([^/?]+)"))
What do I need to change in this code to return only .pl instead of google.pl?
Thanks in advance.

You can use
REGEXP_EXTRACT(URL, r'https?://[^/]*\.([^/]+)')
Details:
https?:// - https:// or http://
[^/]* - zero or more chars other than /
\. - a . char
([^/]+) - Group 1: one or more chars other than /.
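The pattern can be sanity-checked outside Data Studio with Python's re module. This is only a sketch: Python's engine is not the one behind REGEXP_EXTRACT, and the helper name extract_tld is made up for illustration. Note that Group 1 holds pl without the leading dot.

```python
import re

# Same pattern as in the answer above; Group 1 is the text after the
# last dot of the host part (everything before the first "/").
PATTERN = re.compile(r"https?://[^/]*\.([^/]+)")


def extract_tld(url):
    """Return the last dot-separated label of the host, or None if nothing matches."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(extract_tld("https://www.google.pl"))
print(extract_tld("https://www.google.pl/search"))
```

If the leading dot is needed, it can be moved inside the capturing group, i.e. (\.[^/]+) instead of \.([^/]+).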

Related

Regex to extract a part of an URL

I'm trying to extract the part of an URL ignoring the http(s)://www. part of it.
These URLs come from a form that the user fills and multiple formats and errors are expected, here's a sample:
http://www.akashicbooks.com
https://deliciouselsalvador.com
http://altaonline.com
http://https://www.amtb-la.org/
http://https://www.amovacations.com/
http://dornsife.usc.edu/jep
I've tried in Google Sheets and Airtable using the REGEXEXTRACT formula:
=REGEXEXTRACT({URL},"[^/]+$")
Unfortunately, I can't make it work for all the cases.
Any ideas on how to make it work?
You can use
^(?:https?://(?:www\.)?)*(.*)
Details:
^ - start of string
(?:https?://(?:www\.)?)* - zero or more occurrences of
  https?:// - http:// or https://
  (?:www\.)? - an optional sequence of www.
(.*) - Group 1: the rest of the string.
With REGEXEXTRACT, the output value is the text captured with Group 1.
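To see how the repeated prefix group behaves on the sample URLs from the question, here is a minimal sketch in Python. It assumes Python's re module behaves like the Sheets engine for this particular pattern; the function name strip_prefix is hypothetical.

```python
import re

# Same pattern as above: any number of leading "http(s)://" (each
# optionally followed by "www."), then Group 1 captures the rest.
PATTERN = re.compile(r"^(?:https?://(?:www\.)?)*(.*)")


def strip_prefix(url):
    """Return the URL without its leading scheme/www prefixes."""
    return PATTERN.match(url).group(1)


samples = [
    "http://www.akashicbooks.com",
    "https://deliciouselsalvador.com",
    "http://altaonline.com",
    "http://https://www.amtb-la.org/",
    "http://https://www.amovacations.com/",
    "http://dornsife.usc.edu/jep",
]

for url in samples:
    print(url, "->", strip_prefix(url))
```

Because the group is repeated with *, the doubled http://https:// prefixes are consumed in one pass.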

Regular expression to extract different parts of URL and path

Consider URLs like
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123
I need a regular expression to match these URLs and convert them into
stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas
I need to build a single rule using regex where I can extract paths using $1, $2, etc. without using any javascript logic as I need to use it in a classification rule builder tool.
I tried the rule URL contains ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? and extracted $4:$5, which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981
But this is incorrect. Can anyone help me with the correct regex for this?
You may try this:
Regex
/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g
Substitution
$1:$2
(?: non-capturing group
https?:\/\/ "http://" or "https://"
([^\/?\s#]+) capture the domain and put it in group 1
)? make this group optional
\/ "/"
([^\/?\s#]*) one segment of the url path, capture it in group 2
(?:[\?#].*)? an optional non-capturing group for consuming query string or # anchor at the end
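The global substitution can be tried out in Python as a rough check. This is a sketch under assumptions: re.sub replaces every match (mirroring the g flag), the slashes are left unescaped because Python does not use / as a delimiter, and the helper name to_colon_path is invented here. It relies on Python 3.5 or later, where an unmatched group is substituted as an empty string.

```python
import re

# Same regex as above, without the JavaScript-style /.../g delimiters.
PATTERN = re.compile(r"(?:https?://([^/?\s#]+))?/([^/?\s#]*)(?:[?#].*)?")


def to_colon_path(url):
    """Replace every match with "<group 1>:<group 2>".

    Group 1 (the domain) only takes part in the first match; for the
    later path segments it is unmatched and contributes nothing.
    """
    return PATTERN.sub(r"\1:\2", url)


print(to_colon_path("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
print(to_colon_path("http://stackoverflow.com/v2/summary/saas?test=123"))
```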
Update
If you can't use the g flag for substitution, there's no better way than to brute-force all the combinations.
You need to add a \/([^\/?#\s]+) to the regex and a :$N to the substitution for each segment of the URL path:
https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1
https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2
https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3
https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4
https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5
https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6
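Since the rules above follow a fixed recipe, they can be generated rather than typed by hand. The following Python sketch does that; build_rule and convert are hypothetical helper names, the slashes are written unescaped (Python needs no \/), and convert only emulates what a rule builder would do by trying each rule in turn.

```python
import re

# One capturing group per path segment, as in the rules above.
SEGMENT = r"/([^/?#\s]+)"


def build_rule(n_segments):
    """Return (regex, substitution) for a URL with exactly n_segments path segments."""
    pattern = (
        r"^https?://(?:www\.)?([^/?#\s]+)"
        + SEGMENT * n_segments
        + r"/?(?:[#?].*)?$"
    )
    substitution = ":".join(f"${i}" for i in range(1, n_segments + 2))
    return pattern, substitution


def convert(url, max_segments=6):
    """Try the rules for 0..max_segments segments and join the captured groups."""
    for n in range(max_segments + 1):
        pattern, _ = build_rule(n)
        match = re.match(pattern, url)
        if match:
            return ":".join(match.groups())
    return None


print(build_rule(2))
print(convert("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
```

Only the rule with the matching number of segments can succeed, because a segment group never crosses a /.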

Regex google analytics - Filter some directories and pages from url

I need to create a view in google analytics that can filter some pages.
The current regex I am using is:
^/(|es|pt|fr|de|blog|examples|about|)?(/(\w*)?)?$
Pages I want to show:
/
/es
/es/
/es/something-else
/blog/blog-post-123
/about
/examples
/examples/directory
/examples/directory/title-examples
Pages I don't want to show:
/any-other-url-not-mentioned
/es1231
/esad
/examplesee
/blogbla-bla-bala
/blog-bla-bla-ba
The current problem with my regex is that a page containing "-" will NOT show up in the view:
/blog/10-tools
page/this-is-content-url
This one should work:
^\/(es|pt|fr|de|blog|examples|about)?($|\/).*
You may use
^\/(es|pt|fr|de|blog|examples|about)?(\/[^\/]+)*\/?$
Or
^\/((es|pt|fr|de|blog|examples|about)(\/[^\/]+)*\/?)?$
Details
^ - start of string
\/ - a /
(es|pt|fr|de|blog|examples|about)? - an optional group matching any of the alternative substrings in it
(\/[^\/]+)* - zero or more occurrences of / followed by one or more chars other than /
\/? - an optional /
$ - end of string
The second regex is a bit more precise since it won't match //, though that does not seem to be a realistic scenario.
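The second regex can be checked against the pages listed in the question with a short Python script. It is only a local sketch: Google Analytics uses its own regex engine, and the function name is_shown is made up for this example.

```python
import re

# Second regex from the answer, with the slashes left unescaped.
PATTERN = re.compile(r"^/((es|pt|fr|de|blog|examples|about)(/[^/]+)*/?)?$")

wanted = [
    "/", "/es", "/es/", "/es/something-else", "/blog/blog-post-123",
    "/about", "/examples", "/examples/directory",
    "/examples/directory/title-examples", "/blog/10-tools",
]
unwanted = [
    "/any-other-url-not-mentioned", "/es1231", "/esad",
    "/examplesee", "/blogbla-bla-bala", "/blog-bla-bla-ba",
]


def is_shown(path):
    """True if the page path would pass the filter."""
    return PATTERN.search(path) is not None


for path in wanted + unwanted:
    print(path, is_shown(path))
```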

Using a wildcard in Regex at the end of a URL in GA

I'm a newbie at Regex. I'm trying to get a report in GA that returns all pages after a certain point in the URL.
For example:
http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/14-June-2016/
I want to see all dates so: http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/*
Here's what I've got so far in my regex:
^https:\/\/www\.essentialibiza\.com\/ibiza-club-tickets\/carl-cox(?=(?:\/.*)?$)
You can try this:
https?:\/\/www\.essentialibiza\.com\/ibiza-club-tickets\/carl-cox[\w/_-]*
The RE2 regex engine used by GA does not allow lookarounds (not even lookaheads) in the pattern, and yours contains one: (?=(?:\/.*)?$).
If you need all links having www.essentialibiza.com/ibiza-club-tickets/carl-cox/, you can use a simple regex:
www\.essentialibiza\.com/ibiza-club-tickets/carl-cox/
If you want to specify the protocol:
https?://www\.essentialibiza\.com/ibiza-club-tickets/carl-cox(/|$)
The ? makes the s optional (1 or 0 occurrences), and (/|$) also allows matching a URL that ends with cox (replace this group with / if you only want to match URLs that have a / after cox).
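The lookahead-free pattern can be tried locally as below. Keep in mind that Python's re module does support lookaheads, so this sketch only shows what the pattern matches, not that GA accepts it; the carl-coxon URL is a made-up counterexample.

```python
import re

# Pattern from the answer: optional "s" in the scheme, and the path must
# either continue with "/" or end right after "carl-cox".
PATTERN = re.compile(
    r"https?://www\.essentialibiza\.com/ibiza-club-tickets/carl-cox(/|$)"
)

urls = [
    "http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/14-June-2016/",
    "https://www.essentialibiza.com/ibiza-club-tickets/carl-cox",
    # Hypothetical URL that shares the prefix but is a different page.
    "http://www.essentialibiza.com/ibiza-club-tickets/carl-coxon/",
]

for url in urls:
    print(url, bool(PATTERN.search(url)))
```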

filter Google Analytics API pagepath to not show further subdirectories

I'm looking to do a top ten list of the main pages of a bio but not the additional subdirectories. E.g.
/cr/johndoe
/cr/janesmith
but NOT:
/cr/johndoe/news
/cr/janesmith/dvds
/cr/santaclaus/galleries/1
etc.
I've started with filters=ga:pagePath==/cr/* but I'm not sure how to do the regex to get it to not go any deeper.
In order to match the /cr/something you can use the following filters:
filters=ga:pagePath=~/cr/[^/]*/?$
Or the URL-encoded version:
filters=ga:pagePath%3D~%2Fcr%2F%5B%5E%2F%5D*%2F%3F%24
REGEX EXPLANATION:
/cr/ - matches the sequence of the characters literally
[^/]* - 0 or more characters other than /
/? - 0 or 1 / symbol
$ - matches the end of string.
If the page path must start with /cr/, add ^ (start of string):
filters=ga:pagePath=~^/cr/[^/]*/?$
Or URL-encoded:
filters=ga:pagePath%3D~%5E%2Fcr%2F%5B%5E%2F%5D*%2F%3F%24
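The encoded form does not have to be written by hand. As a sketch, Python's urllib.parse.quote can produce it, and the regex part can be checked locally against the paths from the question. Leaving : and * unencoded is a choice made here to reproduce the string above, not a documented requirement of the API.

```python
import re
from urllib.parse import quote

# Filter expression from the answer: dimension, the =~ operator, then the regex.
FILTER = "ga:pagePath=~^/cr/[^/]*/?$"

# Percent-encode everything except ":" and "*", which the answer leaves as-is.
encoded = quote(FILTER, safe=":*")
print("filters=" + encoded)

# Check the regex part locally against the paths from the question.
pattern = re.compile(FILTER.split("=~", 1)[1])
for path in ["/cr/johndoe", "/cr/janesmith", "/cr/johndoe/news",
             "/cr/janesmith/dvds", "/cr/santaclaus/galleries/1"]:
    print(path, bool(pattern.search(path)))
```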
You can use this:
^\/cr\/\w+$