How to match URL path using Google RE2 regex - regex

Google Cloud Platform lets you create log labels using the RE2 regex engine.
How can I create a regex that matches the path in the URL?
Example matches:
https://example.com/awesome --> "awesome"
https://example.com/awesome/path --> "awesome/path"
https://example.com/awesome/path/ --> "awesome/path"
https://example.com/awesome/path?arg1=123 --> "awesome/path"
Details:
The domain and protocol are constant; they can be assumed to be https://example.com here.
If there are multiple directories, they should be matched too, including the / in between.
Trailing / should NOT be matched.
Queries, e.g. ?arg1=123&arg2=456 should NOT be matched.
It can be assumed that directory names will only contain alphanumeric characters a-zA-Z0-9, dashes - and underscores _.
Note that Google RE2 is different than PCRE2.

It isn't 100% clear from the syntax reference what is supported and what isn't. Assuming "(NOT SUPPORTED) VIM" means the construct is supported, just not in Vim, I'd start with a positive lookbehind for the constant beginning of the URL that you don't care about:
(?<=https:\/\/example\.com\/)
Then you want directory names, [\w\-]+, each followed by a non-trailing /. To make sure a / is only consumed when more word characters follow it, I'd add a lookahead: (?=\/\w+)\/
The complete regex:
(?<=https:\/\/example\.com\/)([\w\-]+((?=\/\w+)\/|\b))+
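RE2's syntax reference lists lookahead and lookbehind as not supported, so a capture group is the usual substitute for the lookarounds above. The following is a minimal sketch of that idea, not the answer's own method: it uses Python's re module as a stand-in for RE2, the names PATH_RE and extract_path are made up for illustration, and the pattern sticks to constructs RE2 does accept (anchors, character classes, non-capturing groups).

```python
import re

# Lookaround-free pattern: anchor on the constant prefix, then capture
# one or more directory names joined by single slashes. A trailing slash
# or a query string is simply left outside the capture group.
PATH_RE = re.compile(r"^https://example\.com/([A-Za-z0-9_-]+(?:/[A-Za-z0-9_-]+)*)")

def extract_path(url):
    """Return the path without trailing slash or query, or None."""
    m = PATH_RE.match(url)
    return m.group(1) if m else None

if __name__ == "__main__":
    for u in (
        "https://example.com/awesome",
        "https://example.com/awesome/path",
        "https://example.com/awesome/path/",
        "https://example.com/awesome/path?arg1=123",
    ):
        print(u, "-->", extract_path(u))
```

The wanted text is read from capture group 1 rather than from the whole match, which is why no lookbehind is needed to skip the prefix.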

Related

Regex to match url where the top level domain is not .com, .net, .org, .info, .edu, .gov, or .ca

I'm looking for regex to match entire urls that are NOT from .com, .net, .org, .info, .edu, .gov, or .ca domains. The TLD list may grow over time, but it's a good start.
These would match:
https://www.example.ru
http://www.example.xyz/index.php
https://someserver.example.co.uk/home
These would NOT match:
https://www.example.com
http://www.example.info/index.php
https://someserver.example.ca/home
For a little background, I'm looking to use the expression with Exchange Online to filter inbound email containing unusual/international links, which in our case are almost 100% phishing or spam. We're a small business that only services local customers, and generally all of our vendors are North American.
Answer
Tricky one. Here it is:
/https?:\/\/(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])\S*/gi
Works for all the use cases listed below:
These would NOT match:
https://www.organization.org/
https://some-server.example.ca/home
https://www.complete.com/index.php
https://www.example.com
https://www.example.com/?url=junk.xyz
https://www.freddy.dana.comealong.com/
http://www.example.info:8181
https://some_server.example.ca/home.html
These would match:
https://www.spammy.spammer.comealong.cop/
https://spam.caught.cat/home/away/now/index.htm
https://www.complete.xyz/index.php?com=seww
https://www.example.abc/?spam=yes&spammer=yep&from=me.com
https://www.example.ru/spammy/spammer/index.php
https://www.com_server.ru/?url=beep.gov&para=HaHaGotYou
https://www.example.ru
https://www.example.ru/home.html.com
http://www.example.xyz/index.php
https://some-server.example.co.uk/home
The top group is not matched; the bottom group is matched in full, so you can send those off to /dev/cornfield.
In the bottom group, please notice that there are URL parameters that DO have .com in them, but my assumption is you want to blow those away as well, so the regex is very narrow in defining how and where the TLD appears.
And there are URLs like www.complete.xyz or example.abc/spam.com, which obviously should be selected. Details below:
Here's a link to a regex pen: https://regexr.com/69vce
Tutorial:
/https?:\/\/(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])\S*/gi
Starts with the obv https?:\/\/ but we go immediately into a negative lookahead (?!
For the non-selected URLs, the decision should rest only on the TLD actually used: a .com inside a parameter must not count, and a name like www.complete.abc must not be skipped just because it contains "com".
So the first part of the negative lookahead is [\w.-]+\. so we only evaluate letters, numbers, - and . in the brackets, with + for one to many as there may be before the TLD, then one single mandatory escaped period \. which is how we "lock in" the TLD.
Note 1: that inside the brackets, the period does not need to be escaped, when inside the brackets [.] it is a literal period not a wildcard.
Note 2: the \w includes the underscore _ which is not a legal domain character, but we disregard as we do not need to specifically validate the domain names as presented.
Next comes a non-capturing group with an ORed (|) list of the TLDs to NOT match, then [^\w.-], which is how we avoid wrongly rejecting names like www.complete.xyz: a listed TLD only counts if it is NOT followed by another legal domain-name character (letter, number, period, hyphen). Notice, by the way, that the hyphen - is LAST in the class, because if it sat in the middle instead, say [^\w-.], it could be read as a range (as in a-z) and is an error in some implementations of regex.
Then finally \S* matches every non-whitespace character that follows. So if the negative lookahead did not reject the URL, we carry on from just after the http:// and take the entire rest of the URL.
Now, this is potentially a little broad, but since I assume you are just trashing them that should be fine. If you were selecting them for further use, then you may want to use something more selective like [\w.:%&?~=/-]* instead. This includes period, colon for port, = & ? for params, % for URL escaping, etc. And again the hyphen is last.
And of course at the very end, global and case insensitive /gi
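The tutorial above can be tried out with a short script. This is a sketch under a few assumptions: it uses Python's re module rather than Exchange Online's engine, the names ALLOWED_TLDS, UNUSUAL_URL and unusual_urls are made up for illustration, and the trailing [^\w.-] is swapped for the nested lookahead (?![\w.-]) so that an allowed URL sitting at the very end of the input (with no character after the TLD) is still rejected.

```python
import re

# TLDs that should NOT be flagged (taken from the question's list).
ALLOWED_TLDS = ("com", "edu", "gov", "ca", "net", "org", "info")

# Same shape as the answer's regex, except that the final [^\w.-] is
# replaced by (?![\w.-]) so the check also works at end of input.
UNUSUAL_URL = re.compile(
    r"https?://(?![\w.-]+\.(?:" + "|".join(ALLOWED_TLDS) + r")(?![\w.-]))\S*",
    re.IGNORECASE,
)

def unusual_urls(text):
    """Return every URL in text whose host does not end in an allowed TLD."""
    return UNUSUAL_URL.findall(text)

if __name__ == "__main__":
    sample = (
        "ok: https://www.example.com/?url=junk.xyz "
        "bad: https://www.complete.xyz/index.php?com=seww"
    )
    print(unusual_urls(sample))
```

Keeping the TLD list in a tuple makes it easy to extend as the list grows, which the question anticipates.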
To match the entire URL...
Note this implementation attempts to cover additional elements based on the usage for matching unusual URLs:
Any scheme, for possible unknown security vectors (e.g. ftp, ldap)
Containing basic auth username and passwords
IPv6 IP addresses
Port numbers specified (e.g. https://www.example.com:8080/)
No path i.e. just a hostname / IP address
Query string
Fragment
I don't know the exact regular expression engine used by "Exchange Online", so here I'm using RegEx features of C# and PowerShell assuming those will be available.
Regular Expression
[a-z][a-z0-9+.-]*://(?>(?:[a-z0-9!$%&'()*+,.:;=_~-]+@)?(?:[a-z0-9%._~-]+|\[[a-z0-9!$%&'()*+,.:;=_~-]+\]))(?<!\.(?:com|net|org|info|edu|gov|ca)(?::\d+)?)[a-z0-9!#$%&'()*+,./:;=?@_~-]*
Breakdown
Scheme (http/https/ftp etc): [a-z][a-z0-9+.-]*
Atomic group start: (?>
Username / password: (?:[a-z0-9!$%&'()*+,.:;=_~-]+@)?
Hostname: (?:[a-z0-9%._~-]+|\[[a-z0-9!$%&'()*+,.:;=_~-]+\])
IPv4 or usual domain: [a-z0-9%._~-]+
or IPv6: \[[a-z0-9!$%&'()*+,.:;=_~-]+\]
Atomic group end: )
Hostname check (negative lookbehind): (?<!\.(?:com|net|org|info|edu|gov|ca)(?::\d+)?)
optionally allow port numbers: (?::\d+)?
Query String and Fragment: [a-z0-9!#$%&'()*+,./:;=?@_~-]*
The atomic group stops the engine from backtracking in such a way that the "Username / password" or "Query String and Fragment" parts of the expression end up matching part of the hostname and bypass our validation.
Using RegEx to match URLs in text
If you are using this regular expression to match URLs in a text document you might find some issues with "quoted" URLs or markdown links.
E.g.
[an example](http://example.cox/)
'http://www.example.cox/'
http://www.example.cox/index.html, something interesting in a sentence
You can get it here http://www.example.cox/download.html.
This RegEx as-is would match additional characters at the end because they are valid URL characters i.e.:
http://example.cox/)
http://www.example.cox/'
http://www.example.cox/index.html,
http://www.example.cox/download.html.
To avoid this you can repeat the RegEx above in a pattern like this (obviously you would remove the whitespace / new lines):
(?:
(?<=['])
# RegEx here
(?=['])
|
(?<=["])
# RegEx here
(?=["])
|
(?<=\()
# RegEx here
(?=\))
|
# RegEx here
(?<![.,])
)
Here we are saying that if there is a quote (' or ") or a bracket ( before the URL, the matching character at the end of the URL can be left out of the match.
Where the match doesn't have a bracket ( or quote at the start, the last part (?<![.,]) basically says don't match a final full stop . or comma , at the end of the URL, even though they are perfectly valid URL characters. This is done in the full knowledge that it might break the returned URL.
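The wrapper pattern above can be exercised with a small script. This is only a sketch: it uses Python's re module instead of C# or PowerShell, the names URL, WRAPPED and find_urls are made up for illustration, and the full URL expression is replaced by the deliberately simplified stand-in https?://\S+ so the wrapping logic is easy to see.

```python
import re

# Simplified stand-in for the full URL expression (an assumption made
# for this sketch; substitute the real pattern in practice).
URL = r"https?://\S+"

# One alternative per kind of wrapper; the last one is the fallback
# that refuses to end the match on a full stop or comma.
WRAPPED = re.compile(
    rf"""(?:
        (?<=')  {URL} (?=')
      | (?<=")  {URL} (?=")
      | (?<=\() {URL} (?=\))
      | {URL} (?<![.,])
    )""",
    re.VERBOSE,
)

def find_urls(text):
    """Return URLs found in text, without wrapping quotes or brackets."""
    return WRAPPED.findall(text)

if __name__ == "__main__":
    print(find_urls("You can get it here http://www.example.cox/download.html."))
```

The re.VERBOSE flag lets the pattern keep the whitespace and line breaks of the layout shown above, so nothing has to be removed by hand.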
\/\/.*\.(com|ca|info|org)(\/|$)
This should work. Note that it matches URLs whose TLD is in the list, so the filter has to act on the URLs that do not match.
.*\.(com|ca|info|org)
This part matches the URL from the // up to the TLD, which must be followed by the next / or the end of the line.
You can add more TLDs inside (com|ca|info|org) in a similar manner.
https://regex101.com/r/LC1FLQ/1

How to use RegEx to get part of redirect url?

I have a column listing redirect URLs from Google Custom Search results. I would like to extract the external domain from each combined URL.
Example:
https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q
https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3
https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U&ved=2ahUKEwixq4K7qttXEKHTOEClsQFjAAegQIARAB&usg=AOvVaw2ouHhfNNTPV
From the above URLs, I would like to extract the external domain name.
Results from above examples:
examplesite1.co.uk
www.exmaplesite2.co.uk
examplesite-3.com
I am able to do this in Google Sheets, but need a RegEx so that I can use it in Google Data Studio.
Thanks.
Just combine both regexes:
(?:(?<=&q=https://)|(?<=&q=http://))(.*?)(?=/.*?&)
You may use this regex with an additional negative lookbehind:
(?<=(?<!^https)://)[^/]+
RegEx Details:
(?<=(?<!^https)://): Positive lookbehind to assert that we have :// before current position. Additionally nested negative lookbehind (?<!^https) asserts that we don't have starting https before :// thus skipping matching starting URLs
[^/]+: Match 1+ of any character that is not /
Update: As per comments below lookbehind is not supported in Google Data Studio, hence we can use this regex:
.https?://([^/]+)
And grab domain name from capture group #1.
. placed before https?: will ensure that we don't match a URL at the start of a line.
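The lookbehind-free version can be checked against the question's sample URLs. This is a sketch using Python's re module rather than Google Data Studio, and the names EXTERNAL_HOST and external_host are made up for illustration.

```python
import re

# The leading "." must consume one character, so a URL that starts at
# position 0 (the google.com redirect itself) can never match.
EXTERNAL_HOST = re.compile(r".https?://([^/]+)")

def external_host(redirect_url):
    """Return the host of the embedded URL (capture group 1), or None."""
    m = EXTERNAL_HOST.search(redirect_url)
    return m.group(1) if m else None

if __name__ == "__main__":
    url = ("https://www.google.com/url?client=internal-element-cse&cx=3c360356"
           "&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U")
    print(external_host(url))
```

One thing to keep in mind: if the embedded URL has no path at all, [^/]+ would run on into the following &sa=... parameters, so a class such as [^/&]+ may be a safer choice in that case.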

Replace underscore with dash in url for given url extensions using GREP / Regex

I use BBEdit. BBEdit supports multi-file search and replace with GREP. Using this (copied from a Notepad++ post here at stackoverflow):
(\bhref="|(?!^)\G)[^"<_]*\K_
I can get a list of all URLs containing underscores. The idea is to replace all the underscores with dashes. No problems with that, BBEdit search panel has a 'Replace with' field (like Notepad++).
All's fine, BUT I don't want to process all URLs actually. There are for example file download URLs that should remain as they are, especially URLs with the .exe, .zip, .sit and .dmg extensions. Actually the URLs I want to process are the .php and .html urls.
I mean this type of URL should be found:
<a href="software/internet-tools/ftp-disk_sheet_us.php">
but not this one:
<a href="software/internet-tools/ftp-disk_us_setup.exe">
I have tried to edit the REGEX above unsuccessfully so far and since I have to process around 30,000 urls in 600 files I would really like to be sure I don't do anything wrong.
Thanks a lot in advance for helping me out with that.
You may force the match only when the link ends with .html/.htm or .php:
(?:\G(?!^)|\bhref="(?=[^"]*\.(?:html?|php)"))[^"<_]*\K_
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
See the regex demo
The (?=[^"]*\.(?:html?|php)") positive lookahead will require any 0+ chars other than " and then a . followed with htm/html or php immediately after href=", else, no match will be found.
Details
(?:\G(?!^)|\bhref="(?=[^"]*\.(?:html?|php)")) - end of the previous match (\G(?!^)) or (|)
\bhref=" - a whole word href followed with ="
(?=[^"]*\.(?:html?|php)") - a positive lookahead that requires the following sequences of patterns to match immediately to the right of the current location:
[^"]* - 0+ chars other than "
\. - a dot
(?:html?|php) - a non-capturing group matching either htm and then an optional l or php
" - a double quotation mark
[^"<_]* - any 0+ chars other than ", < and _
\K - match reset operator that discards all text matched so far
_ - an underscore.
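Before running a replace across 600 files it can help to cross-check the result with a second, independent approach. The sketch below is not the \G / \K technique explained above (Python's re module supports neither operator); instead it matches each whole href that ends in .htm, .html or .php and rewrites it in a callback. The names PAGE_HREF and dash_page_links are made up for illustration.

```python
import re

# Match a complete href value that ends in .htm, .html or .php.
PAGE_HREF = re.compile(r'\bhref="([^"<]*\.(?:html?|php))"')

def dash_page_links(html):
    """Replace underscores with dashes inside .htm/.html/.php hrefs only."""
    def fix(match):
        return 'href="' + match.group(1).replace("_", "-") + '"'
    return PAGE_HREF.sub(fix, html)

if __name__ == "__main__":
    print(dash_page_links('<a href="software/internet-tools/ftp-disk_sheet_us.php">'))
    print(dash_page_links('<a href="software/internet-tools/ftp-disk_us_setup.exe">'))
```

Links to .exe, .zip, .sit and .dmg files never match the pattern, so they pass through unchanged.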

Recursive Regex, capture same text under two different tags

I am looking to parse some text using regex, and I need to be able to grab the same text under two different tags while only capturing text before a certain character on the second tag. Here is a sample of the text I'm trying to bring in.
Reputation=High risk ProtocolP=SSL client Web_Application=YouTube URL=https://youtube.com
And here is the RegEx I have written so far.
^Reputation=(?<rep>.*?)\sProtocol=(?<prot>.*?)\sWeb_Application=(?<webapp>.*?)\sURL=(?<url>[http|https].*?)\sSource_IP=(?<sip>.*?)\s
This gets me what I need initially, but I need to add a second tag for the URL section to grab ONLY the domain name. For example, only https://youtube.com
Of course, if the domain happens to be https://m.youtube.com then that should be captured as well.
Is there a way to do this?
You can replace the URL matching part with URL=(?<url>https?://(?<domain>[^/\s]+)(?:/\S+)*):
Reputation=(?<rep>.*?)\sProtocolP=(?<prot>.*?)\sWeb_Application=(?<webapp>.*?)\sURL=(?<url>https?://(?<domain>[^/\s]+)(?:/\S+)*)\sSource_IP=(?<sip>.*?)\s
See the regex demo
https?:// - matches http:// or https://
(?<domain>[^/\s]+) - domain matching part, 1+ characters other than / and whitespace
(?:/\S+)* - 0+ sequences of / followed with 1+ characters other than whitespace
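The nested named groups can be tried out as follows. This is a sketch with some assumptions: it uses Python's re module, which spells named groups (?P<name>...) instead of (?<name>...); the sample lines include a Source_IP field with a made-up address, since the question's sample stops at the URL; and the names LOG_RE and parse_line are made up for illustration.

```python
import re

# Same structure as the regex above, with Python's (?P<name>...) syntax.
# The "domain" group is nested inside the "url" group, so one match
# yields both the full URL and just its host.
LOG_RE = re.compile(
    r"Reputation=(?P<rep>.*?)\sProtocolP=(?P<prot>.*?)"
    r"\sWeb_Application=(?P<webapp>.*?)"
    r"\sURL=(?P<url>https?://(?P<domain>[^/\s]+)(?:/\S+)*)"
    r"\sSource_IP=(?P<sip>.*?)\s"
)

def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    m = LOG_RE.search(line)
    return m.groupdict() if m else None

if __name__ == "__main__":
    # Hypothetical log line; note the trailing space required by the final \s.
    line = ("Reputation=High risk ProtocolP=SSL client Web_Application=YouTube "
            "URL=https://m.youtube.com/watch Source_IP=10.0.0.5 ")
    print(parse_line(line))
```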

PCRE Pattern match to URL

I'm trying to configure a website plugin and need to set a PCRE url pattern for certain pages on my site.
I exclude certain pages using the following patterns successfully.
^/wp-admin/
^/wp-login.php
^/content/
However I also need to exclude pages which are in the following patterns
www.site.com/post_id/post_name/feed/
e.g. www.site.com/345345/myawesomepost/feed/
Where the only static component of the url is /feed/ Would anyone know how to do this using PCRE expressions..?
To exclude urls like this
Format: www.site.com/post_id/post_name/feed/
Example: www.site.com/345345/myawesomepost/feed/
you might consider the following regular expression:
^(?:\w+\.)+\w+/[0-9]+/\w+/feed/$
Debuggex Demo
Free-spaced:
^: Start-of-line anchor
(?:: [Non-capturing group] of
\w+\.: One-or-more word-characters, followed by a literal dot (which could also be written as [.])
)+: One-or-more times
\w+/: One-or-more word characters followed by a (literal) slash
[0-9]+/: One-or-more digits, followed by a slash
\w+/: One-or-more word characters, followed by a slash
feed/: The word "feed" followed by a slash.
$: End-of-line anchor
Here is a more general and comprehensive answer on using regex to match urls, that may also be of interest.
(All the above links come from the Stack Overflow Regular Expressions FAQ, under "Common Tasks > Validation > Internet". Please consider bookmarking the FAQ for future reference.)
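Something like this should match:
^/.+/feed/
It matches every path that contains /feed/ after at least one other character; add $ at the end to require that the URL actually ends with /feed/.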
This is what I would use:
/www.+\/feed\//gi
will match:
www.site.com/post_id/post_name/feed/
www.site.com/345345/myawesomepost/feed/
If the urls contain 'http://', you should alter the pattern to:
/http.+\/feed\//gi
The 'i' at the end can be left out if you are sure 'feed' will never contain capital letters.
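The patterns from the answers above can be compared side by side. This is a sketch using Python's re module as a stand-in for the plugin's PCRE engine (these particular patterns use only syntax the two share); the names FULL_FEED_RE and PATH_FEED_RE are made up for illustration.

```python
import re

# Host + post_id + post_name + /feed/, anchored at both ends.
FULL_FEED_RE = re.compile(r"^(?:\w+\.)+\w+/[0-9]+/\w+/feed/$")

# Path-only variant, matching the style of ^/wp-admin/ in the question.
# It has no $ anchor, so text after /feed/ is also accepted.
PATH_FEED_RE = re.compile(r"^/.+/feed/")

if __name__ == "__main__":
    print(bool(FULL_FEED_RE.search("www.site.com/345345/myawesomepost/feed/")))
    print(bool(PATH_FEED_RE.search("/345345/myawesomepost/feed/")))
    print(bool(PATH_FEED_RE.search("/wp-admin/")))
```

Note that \w does not cover hyphens, so a post name such as my-awesome-post would need [\w-]+ in place of \w+ in the first pattern.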