Regex to extract a part of an URL - regex

I'm trying to extract the part of an URL ignoring the http(s)://www. part of it.
These URLs come from a form that the user fills and multiple formats and errors are expected, here's a sample:
http://www.akashicbooks.com
https://deliciouselsalvador.com
http://altaonline.com
http://https://www.amtb-la.org/
http://https://www.amovacations.com/
http://dornsife.usc.edu/jep
I've tried in Google Sheets and Airtable using the REGEXEXTRACT formula:
=REGEXEXTRACT({URL},"[^/]+$")
But unfortunately, I can't make it work for all the cases:
Any ideas on how to make it work?

You can use
^(?:https?://(?:www\.)?)*(.*)
See the regex demo. Details:
^ - start of string
(?:https?://(?:www\.)?)* - zero or more occurrences of
https?:// - http:// or https://
(?:www\.)? - an optional sequence of www.
(.*) - Group 1: the rest of the string.
With REGEXEXTRACT, the output value is the text captured with Group 1.

Related

How to extract only domain country from URL (Data Studio / regex)

I'm struggling with extracting from URL only country for example .pl from https://www.google.pl.
At this moment I'm able to extract google.pl from provided url using the following code:
TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(URL, "https?://", ""), R"^(w{3}\.)?", ""), "([^/?]+)"))
What is needed to change in this code to provide only .pl instead of example.pl?
Thanks in advance.
You can use
REGEXP_EXTRACT(URL, r'https?://[^/]*\.([^/]+)')
See the regex demo. Details:
https?:// - https:// or http://
[^/]* - zero or more chars other than /
\. - a . char
([^/]+) - Group 1: one or more chars other than /.

Regular expression to extract different parts of URL and path

Consider URLs like
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123
I need a regular expression to match these URLs and convert them into
stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas
I need to build a single rule using regex where I can extract paths using $1, $2, etc. without using any javascript logic as I need to use it in a classification rule builder tool.
I tried this URL contains ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? and extracted $4:$5 which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981
But, this is incorrect. Can anyone help me with the correct regex for this?
You may try this:
Regex
/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g
Substitution
$1:$2
(?: non-capturing group
https?:\/\/ "http://" or "https://"
([^\/?\s#]+) capture the domain and put it in group 1
)? make this capture optional
\/ "/"
([^\/?\s#]*) one segment of the url path, capture it in group 2
(?:[\?#].*)? an optional non-capturing group for consuming query string or # anchor at the end
Check the test cases
Update
If you can't use g flag for substitution, there's no better way but bruteforce all the combinations:
You need to add a \/([^\/?#\s]+) and :$2 etc for each segment of the url path:
https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1
https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2
https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3
https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4
https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5
https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6

Regex google analytics - Filter some directories and pages from url

I need to create a view in google analytics that can filter some pages.
The current regex I am using is:
^/(|es|pt|fr|de|blog|examples|about|)?(/(\w*)?)?$
Pages I want to show:
/
/es
/es/
/es/something-else
/blog/blog-post-123
/about
/examples
/examples/directory
/examples/directory/title-examples
Pages I don't want to show:
/any-other-url-not-mentioned
/es1231
/esad
/examplesee
/blogbla-bla-bala
/blog-bla-bla-ba
The current problem with my regex is if the page contains "-" it will NOT show up in the view:
/blog/10-tools
page/this-is-content-url
This one should work:
^\/(es|pt|fr|de|blog|examples|about)?($|\/).*
Regex101
You may use
^\/(es|pt|fr|de|blog|examples|about)?(\/[^\/]+)*\/?$
Or
^\/((es|pt|fr|de|blog|examples|about)(\/[^\/]+)*\/?)?$
See the regex demo and regex demo 2.
Details
^ - start of string
\/ - a /
(es|pt|fr|de|blog|examples|about)? - an optional group matching any of the alternative substrings in it
(\/[^\/]+)* - 0+ / followed with 1+ chars other than /
\/? - an optional /
$ - end of string
The second regex is a bit more precise since it won't match //, but it does not seem a valid scenario.

RegEx to filter E-Mail Adresses from URLs in Google Analytics

I want to use a Google Analytics filter to remove email addresses from incoming URIs. I am using the custom advanced filter, filtering field A on a RegEx for the Request URI and replacing the respective part later. However, my RegEx does not seem to work correctly. It should find email addresses, not only if an '#' is used, but also if '(at)', '%40', or '$0040' are used to represent the '#'.
My latest RegEx version (see below) still allows '$0040' to go through undetected. Can someone advise me what to change?
^(.*)=([A-Z0-9._%+-]+[#|[\(at\)]|[\$0040]|[\%40]][A-Z0-9.-]+\.[A-Z]{2,4})(.*)$
I suggest using
([A-Za-z0-9._%+-]+(#|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A‌​-Za-z]{2,4})
See the regex demo.
If you need to match the whole string, you may keep that pattern enclosed with your ^(.*) and (.*)$.
Details
([A-Za-z0-9._%+-]+(#|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A‌​-Za-z]{2,4}) - Group 1 capturing
[A-Za-z0-9._%+-]+ - 1 or more ASCII letters/digits, ., _, %, +, or -
(#|\(at\)|[$]0040|\%40) - one of the alternatives: #, (at), $0040 or %40
[A-Za-z0-9.-]+ - 1 or more ASCII letters/digits, . or -
\. - a dot
[A‌​-Za-z]{2,4} - 2 to 4 ASCII letters.

Using a wildcard in Regex at the end of a URL in GA

I'm a newbie at Regex. I'm trying to get a report in GA that returns all pages after a certain point in the URL.
For example:
http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/14-June-2016/
I want to see all dates so: http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/*
Here's what I've got so far in my regex:
^https:\/\/www\.essentialibiza\.com\/ibiza-club-tickets\/carl-cox(?=(?:\/.*)?$)
You can try this:
https?:\/\/www\.essentialibiza\.com\/ibiza-club-tickets\/carl-cox[\w/_-]*
GA RE2 regex engine does not allow lookarounds (even lookaheads) in the pattern. You have defined one - (?=(?:\/.*)?$).
If you need all links having www.essentialibiza.com/ibiza-club-tickets/carl-cox/, you can use a simple regex:
www\.essentialibiza\.com/ibiza-club-tickets/carl-cox/
If you want to precise the protocol:
https?://www\.essentialibiza\.com/ibiza-club-tickets/carl-cox(/|$)
The ? will make s optional (1 or 0 occurrences) and (/|$) will allow matching the URL ending with cox (remove this group if you want to match URLs that only have / after cox).