I'm trying to extract part of a URL, ignoring the http(s)://www. prefix.
These URLs come from a form that users fill in, so multiple formats and errors are expected. Here's a sample:
http://www.akashicbooks.com
https://deliciouselsalvador.com
http://altaonline.com
http://https://www.amtb-la.org/
http://https://www.amovacations.com/
http://dornsife.usc.edu/jep
I've tried in Google Sheets and Airtable using the REGEXEXTRACT formula:
=REGEXEXTRACT({URL},"[^/]+$")
Unfortunately, I can't make it work for all of these cases.
Any ideas on how to make it work?
You can use
^(?:https?://(?:www\.)?)*(.*)
See the regex demo. Details:
^ - start of string
(?:https?://(?:www\.)?)* - zero or more occurrences of
https?:// - http:// or https://
(?:www\.)? - an optional sequence of www.
(.*) - Group 1: the rest of the string.
With REGEXEXTRACT, the output value is the text captured with Group 1.
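To check the pattern outside of Sheets or Airtable, here is a minimal Python sketch that applies the same regex to the sample URLs from the question. The names PREFIX_PATTERN and strip_prefix are made up for this illustration, and Python's re module stands in for the spreadsheet regex engine, so treat it as a rough check rather than a guarantee of identical behavior.

```python
import re

# Same pattern as in the answer: any number of leading http(s):// prefixes,
# each optionally followed by www., then the rest of the string in Group 1.
PREFIX_PATTERN = re.compile(r"^(?:https?://(?:www\.)?)*(.*)")


def strip_prefix(url: str) -> str:
    """Return Group 1, i.e. the URL without its scheme/www prefixes."""
    match = PREFIX_PATTERN.match(url)
    return match.group(1) if match else url


samples = [
    "http://www.akashicbooks.com",
    "https://deliciouselsalvador.com",
    "http://altaonline.com",
    "http://https://www.amtb-la.org/",
    "http://https://www.amovacations.com/",
    "http://dornsife.usc.edu/jep",
]

for url in samples:
    print(url, "->", strip_prefix(url))
```

Note that a trailing / or a path such as /jep is kept, because Group 1 captures everything after the prefixes.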
I'm struggling to extract only the country suffix from a URL, for example .pl from https://www.google.pl.
At the moment, I'm able to extract google.pl from the provided URL using the following code:
TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(URL, "https?://", ""), R"^(w{3}\.)?", ""), "([^/?]+)"))
What needs to change in this code to return only .pl instead of google.pl?
Thanks in advance.
You can use
REGEXP_EXTRACT(URL, r'https?://[^/]*\.([^/]+)')
See the regex demo. Details:
https?:// - https:// or http://
[^/]* - zero or more chars other than /
\. - a . char
([^/]+) - Group 1: one or more chars other than /.
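As a quick sanity check, the same pattern can be tried in Python. This is only a sketch: TLD_PATTERN and extract_suffix are hypothetical names, and Python's re module is used here instead of the SQL REGEXP_EXTRACT function.

```python
import re
from typing import Optional

# Same pattern as in the answer: Group 1 is whatever follows the last dot
# of the host, up to the next / (or the end of the string).
TLD_PATTERN = re.compile(r"https?://[^/]*\.([^/]+)")


def extract_suffix(url: str) -> Optional[str]:
    """Return the text captured by Group 1, or None when nothing matches."""
    match = TLD_PATTERN.search(url)
    return match.group(1) if match else None


print(extract_suffix("https://www.google.pl"))
print(extract_suffix("https://www.google.pl/search"))
```

The captured value is pl without the leading dot, since the \. sits outside the capturing group.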
Consider URLs like
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123
I need a regular expression to match these URLs and convert them into
stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas
I need to build a single rule using regex where I can extract paths using $1, $2, etc. without using any javascript logic as I need to use it in a classification rule builder tool.
I tried this URL contains ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? and extracted $4:$5 which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981
But, this is incorrect. Can anyone help me with the correct regex for this?
You may try this:
Regex
/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g
Substitution
$1:$2
(?: non-capturing group
https?:\/\/ "http://" or "https://"
([^\/?\s#]+) capture the domain and put it in group 1
)? make this capture optional
\/ "/"
([^\/?\s#]*) one segment of the url path, capture it in group 2
(?:[\?#].*)? an optional non-capturing group for consuming query string or # anchor at the end
Check the test cases
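For illustration, the global substitution can be reproduced in Python, where re.sub replaces every match by default (the equivalent of the g flag) and the replacement is written \1:\2 instead of $1:$2. SEGMENT_PATTERN and to_colon_path are names invented for this sketch, and it assumes Python 3.5 or later, where an unmatched group is substituted as an empty string.

```python
import re

# Same pattern as above, without the JavaScript-style delimiters and
# without escaping the slashes (not needed in Python).
SEGMENT_PATTERN = re.compile(r"(?:https?://([^/?\s#]+))?/([^/?\s#]*)(?:[?#].*)?")


def to_colon_path(url: str) -> str:
    """Replace every match with 'group1:group2'.

    Group 1 (the domain) only participates in the first match; for later
    path segments it is empty, so each segment contributes ':segment'.
    """
    return SEGMENT_PATTERN.sub(r"\1:\2", url)


print(to_colon_path("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
print(to_colon_path("http://stackoverflow.com/v2/summary/saas?test=123"))
```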
Update
If you can't use the g flag for substitution, there's no better way than to brute-force all the combinations.
You need to add a \/([^\/?#\s]+) to the pattern and a :$2, :$3, etc. to the substitution for each segment of the URL path:
https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1
https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2
https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3
https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4
https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5
https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6
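Since the rules above follow a fixed recipe, they can be generated rather than typed by hand. The following Python sketch does that; build_pattern, build_substitution, classify and the max_segments limit are all hypothetical helpers for this example, and classify only simulates what a rule builder would do by trying each rule in turn.

```python
import re
from typing import Optional

# The three building blocks of the brute-force patterns listed above.
HOST = r"^https?:\/\/(?:www\.)?([^\/?#\s]+)"
SEGMENT = r"\/([^\/?#\s]+)"
TAIL = r"\/?(?:[#?].*)?$"


def build_pattern(segment_count: int) -> str:
    """Pattern for a URL with exactly `segment_count` path segments."""
    return HOST + SEGMENT * segment_count + TAIL


def build_substitution(segment_count: int) -> str:
    """Matching substitution string, e.g. '$1:$2:$3' for two segments."""
    return ":".join(f"${i}" for i in range(1, segment_count + 2))


def classify(url: str, max_segments: int = 8) -> Optional[str]:
    """Try each rule in turn and join the captured groups with ':'."""
    for count in range(max_segments + 1):
        match = re.match(build_pattern(count), url)
        if match:
            return ":".join(match.groups())
    return None


print(build_pattern(2))
print(build_substitution(2))
print(classify("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
```

Note that the first example URL from the question has six path segments, so it needs one more rule than the five-segment pattern listed above.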
I need to create a view in google analytics that can filter some pages.
The current regex I am using is:
^/(|es|pt|fr|de|blog|examples|about|)?(/(\w*)?)?$
Pages I want to show:
/
/es
/es/
/es/something-else
/blog/blog-post-123
/about
/examples
/examples/directory
/examples/directory/title-examples
Pages I don't want to show:
/any-other-url-not-mentioned
/es1231
/esad
/examplesee
/blogbla-bla-bala
/blog-bla-bla-ba
The current problem with my regex is that if the page path contains "-", it will NOT show up in the view:
/blog/10-tools
page/this-is-content-url
This one should work:
^\/(es|pt|fr|de|blog|examples|about)?($|\/).*
Regex101
You may use
^\/(es|pt|fr|de|blog|examples|about)?(\/[^\/]+)*\/?$
Or
^\/((es|pt|fr|de|blog|examples|about)(\/[^\/]+)*\/?)?$
See the regex demo and regex demo 2.
Details
^ - start of string
\/ - a /
(es|pt|fr|de|blog|examples|about)? - an optional group matching any of the alternative substrings in it
(\/[^\/]+)* - 0+ / followed with 1+ chars other than /
\/? - an optional /
$ - end of string
The second regex is a bit more precise since it won't match //, but that does not seem to be a valid scenario anyway.
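One way to verify the second pattern before putting it into a view filter is to run it against the page lists from the question. Below is a small Python sketch; PAGE_FILTER and is_shown are illustrative names, and Python's re module is assumed to behave like the Google Analytics engine for this simple pattern.

```python
import re

# Second pattern from the answer: "/" alone, or "/" plus an allowed prefix
# followed by any number of "/segment" parts and an optional trailing "/".
PAGE_FILTER = re.compile(r"^\/((es|pt|fr|de|blog|examples|about)(\/[^\/]+)*\/?)?$")

wanted = [
    "/", "/es", "/es/", "/es/something-else", "/blog/blog-post-123",
    "/about", "/examples", "/examples/directory",
    "/examples/directory/title-examples", "/blog/10-tools",
]
unwanted = [
    "/any-other-url-not-mentioned", "/es1231", "/esad",
    "/examplesee", "/blogbla-bla-bala", "/blog-bla-bla-ba",
]


def is_shown(path: str) -> bool:
    """True when the page path passes the filter."""
    return PAGE_FILTER.search(path) is not None


for path in wanted + unwanted:
    print(path, "->", is_shown(path))
```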
I want to use a Google Analytics filter to remove email addresses from incoming URIs. I am using the custom advanced filter, filtering field A on a RegEx for the Request URI and replacing the respective part later. However, my RegEx does not seem to work correctly. It should find email addresses, not only if an '@' is used, but also if '(at)', '%40', or '$0040' are used to represent the '@'.
My latest RegEx version (see below) still allows '$0040' to go through undetected. Can someone advise me what to change?
^(.*)=([A-Z0-9._%+-]+[@|[\(at\)]|[\$0040]|[\%40]][A-Z0-9.-]+\.[A-Z]{2,4})(.*)$
I suggest using
([A-Za-z0-9._%+-]+(@|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,4})
See the regex demo.
If you need to match the whole string, you may keep that pattern enclosed with your ^(.*) and (.*)$.
Details
([A-Za-z0-9._%+-]+(@|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,4}) - Group 1 capturing
[A-Za-z0-9._%+-]+ - 1 or more ASCII letters/digits, ., _, %, +, or -
(@|\(at\)|[$]0040|\%40) - one of the alternatives: @, (at), $0040 or %40
[A-Za-z0-9.-]+ - 1 or more ASCII letters/digits, . or -
\. - a dot
[A-Za-z]{2,4} - 2 to 4 ASCII letters.
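To see the alternation at work, here is a rough Python sketch that redacts addresses from a request URI. It assumes the literal separator is @ (alongside (at), %40 and $0040); EMAIL_PATTERN, redact, the "[redacted]" placeholder and the sample URIs are all made up for this example and are not part of any Google Analytics API.

```python
import re

# Pattern from the answer, assuming "@" is the literal separator.
EMAIL_PATTERN = re.compile(
    r"([A-Za-z0-9._%+-]+(@|\(at\)|[$]0040|%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"
)


def redact(uri: str) -> str:
    """Replace every email-like substring with a fixed placeholder."""
    return EMAIL_PATTERN.sub("[redacted]", uri)


# Hypothetical request URIs, one per separator variant.
for uri in [
    "/p?e=a.b@c.io",
    "/p?e=bob(at)mail.net",
    "/p?e=jane$0040site.org",
    "/signup?email=john.doe%40example.com&x=1",
]:
    print(uri, "->", redact(uri))
```

The [$]0040 alternative is what fixes the original problem: inside a character class the $ is a literal dollar sign, whereas [\$0040] matches only a single character out of $, 0 and 4.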
I'm a newbie at Regex. I'm trying to get a report in GA that returns all pages after a certain point in the URL.
For example:
http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/14-June-2016/
I want to see all dates so: http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/*
Here's what I've got so far in my regex:
^https:\/\/www\.essentialibiza\.com\/ibiza-club-tickets\/carl-cox(?=(?:\/.*)?$)
You can try this:
https?:\/\/www\.essentialibiza\.com\/ibiza-club-tickets\/carl-cox[\w/_-]*
The RE2 regex engine used by GA does not support lookarounds (not even lookaheads), and your pattern contains one: (?=(?:\/.*)?$).
If you need all links having www.essentialibiza.com/ibiza-club-tickets/carl-cox/, you can use a simple regex:
www\.essentialibiza\.com/ibiza-club-tickets/carl-cox/
If you want to specify the protocol:
https?://www\.essentialibiza\.com/ibiza-club-tickets/carl-cox(/|$)
The ? makes the s optional (1 or 0 occurrences), and (/|$) also allows matching a URL that ends with cox (replace this group with a plain / if you only want to match URLs that have a / after cox).
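The lookaround-free pattern can be tried out in Python as a rough check. Keep in mind that Python's re module is not RE2 (it does accept lookaheads), so this sketch only confirms what the pattern matches, not that GA will accept it. CARL_COX_PATTERN and is_carl_cox_page are hypothetical names, and the carl-coxon URL is an invented counterexample.

```python
import re

# Pattern from the answer: optional "s" in the scheme, and "cox" must be
# followed by either "/" or the end of the string.
CARL_COX_PATTERN = re.compile(
    r"https?://www\.essentialibiza\.com/ibiza-club-tickets/carl-cox(/|$)"
)


def is_carl_cox_page(url: str) -> bool:
    """True when the URL is the carl-cox page or anything below it."""
    return CARL_COX_PATTERN.search(url) is not None


print(is_carl_cox_page("http://www.essentialibiza.com/ibiza-club-tickets/carl-cox/14-June-2016/"))
print(is_carl_cox_page("https://www.essentialibiza.com/ibiza-club-tickets/carl-cox"))
print(is_carl_cox_page("http://www.essentialibiza.com/ibiza-club-tickets/carl-coxon/"))
```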