Exclude phrase ending with matched pattern using regex - regex

I have written a rule to catch all web domains ending with .watch or .video, and I want to exclude 2 domains:
(?!\.videolecture|fb\.watch)\b(\.video|\.watch)
The first exclusion .videolecture works fine. But I can't exclude fb.watch.
I'm really sorry, but I couldn't find any similar questions on Stack Overflow.

You can use
\b(\.video(?!lecture\b)|(?<!\bfb)\.watch)
See the regex demo. Details:
\b - a word boundary
( - start of a capturing group:
\.video(?!lecture\b) - .video that is not immediately followed by lecture as a whole word
| - or
(?<!\bfb)\.watch - .watch that is not immediately preceded with fb as a whole word
) - end of the group.
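As a quick check outside the regex demo, the pattern can be exercised in Python, whose re module accepts this fixed-width lookbehind. This is only a sketch: the sample host names are made up for illustration and do not come from the question.

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(r"\b(\.video(?!lecture\b)|(?<!\bfb)\.watch)")

# Hypothetical sample domains (assumptions, not from the question).
samples = [
    "example.video",
    "example.watch",
    "site.videolecture",
    "fb.watch",
    "notfb.watch",
]

# Keep only the names in which the pattern finds a match.
matched = [s for s in samples if pattern.search(s)]
print(matched)  # ['example.video', 'example.watch', 'notfb.watch']
```

Note that notfb.watch is still matched, because the \b inside the lookbehind requires fb to be a whole word.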

An exclude variable could be set using a map. To exclude second-level domains, videolecture or fb, and top-level domains, watch or video:
map $host $exclude {
~\b(?:videolecture|fb)\.(?:watch|video)$ 1;
}
Return 403 if $exclude is set:
server {
if ($exclude) {
return 403;
}
}
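To see which host names the map expression would flag before deploying it, the same regex can be tried in Python. This is only a sketch: is_excluded is a hypothetical helper name, and it assumes the expression behaves the same in Python's re module as in nginx's PCRE for these simple inputs.

```python
import re

# Same expression as in the map block. The "~" prefix in nginx means a
# case-sensitive regex match, so no IGNORECASE flag is used here.
exclude_re = re.compile(r"\b(?:videolecture|fb)\.(?:watch|video)$")

def is_excluded(host: str) -> bool:
    """Return True when the host would set $exclude to 1."""
    return exclude_re.search(host) is not None

print(is_excluded("fb.watch"))                # True
print(is_excluded("www.videolecture.video"))  # True
print(is_excluded("example.watch"))           # False
```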

Related

Regex (grok) - create general pattern for log which occurs but don't have to

I am sorry for enigmatic topic title, but I did not know how to put it correctly.
These are log types:
{vpnclient} Client[10.10.10.10:54576](11764): sending R_KEYCHANGE message
{vpnclient} Client[10.10.10.10:54576](16031): sending R_IPCONFIG message - client IP = 172.11.11.11/255.255.255.0, CEP = 3600 s, DNS = 172.11.1.101, 172.11.1.102
And this is my grok pattern:
^{vpnclient} %{WORD}\[%{IP:[client][ip]}:%{NUMBER:[source][port]}\]\(%{INT:[process][pid]}\): %{GREEDYDATA:message} (:?%{GREEDYDATA:kv_vpn_message})
What I want to do is forward the part of the log after the hyphen (so, - client IP ...) to the kv filter.
My problem is that this type of log does not always occur, so I want to wrap the whole grok pattern so that it matches up to %{GREEDYDATA:message} and also %{GREEDYDATA:kv_vpn_message}, but only when the latter occurs.
You can use
^{vpnclient} %{WORD}\[%{IP:[client][ip]}:%{NUMBER:[source][port]}\]\(%{INT:[process][pid]}\): %{DATA:message}(?: - %{GREEDYDATA:kv_vpn_message})?$
There are several changes:
%{DATA:message} - the message pattern is turned into a non-greedy dot pattern, .*?, with GREEDYDATA changed to DATA
(?: - %{GREEDYDATA:kv_vpn_message})? - is an optional non-capturing group that matches one or zero occurrences of - and then zero or more chars as many as possible captured into the "kv_vpn_message" group
$ - end of string anchor, it allows the "message" DATA pattern match till the end of line.
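The lazy-message-plus-optional-tail idea can be tried outside Logstash with a plain Python regex. This is a rough sketch, not a grok implementation: the grok macros are replaced by simple approximations (\w+, a dotted-quad IP, \d+), and the group names are flattened because Python group names cannot contain brackets.

```python
import re

# Approximation of the grok pattern above using plain regex pieces.
log_re = re.compile(
    r"^\{vpnclient\} \w+"
    r"\[(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<source_port>\d+)\]"
    r"\((?P<pid>\d+)\): "
    r"(?P<message>.*?)"                # lazy, like %{DATA:message}
    r"(?: - (?P<kv_vpn_message>.*))?$" # optional tail, like the (?: - ...)? group
)

lines = [
    "{vpnclient} Client[10.10.10.10:54576](11764): sending R_KEYCHANGE message",
    "{vpnclient} Client[10.10.10.10:54576](16031): sending R_IPCONFIG message"
    " - client IP = 172.11.11.11/255.255.255.0, CEP = 3600 s,"
    " DNS = 172.11.1.101, 172.11.1.102",
]

parsed = [log_re.match(line).groupdict() for line in lines]
for fields in parsed:
    print(fields["message"], "|", fields["kv_vpn_message"])
```

For the first line kv_vpn_message is None, while for the second it holds everything after the hyphen.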

Regex to extract a part of an URL

I'm trying to extract the part of a URL that follows the http(s)://www. prefix, ignoring the prefix itself.
These URLs come from a form that the user fills and multiple formats and errors are expected, here's a sample:
http://www.akashicbooks.com
https://deliciouselsalvador.com
http://altaonline.com
http://https://www.amtb-la.org/
http://https://www.amovacations.com/
http://dornsife.usc.edu/jep
I've tried in Google Sheets and Airtable using the REGEXEXTRACT formula:
=REGEXEXTRACT({URL},"[^/]+$")
But unfortunately, I can't make it work for all the cases.
Any ideas on how to make it work?
You can use
^(?:https?://(?:www\.)?)*(.*)
See the regex demo. Details:
^ - start of string
(?:https?://(?:www\.)?)* - zero or more occurrences of
https?:// - http:// or https://
(?:www\.)? - an optional sequence of www.
(.*) - Group 1: the rest of the string.
With REGEXEXTRACT, the output value is the text captured with Group 1.
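To check the pattern against the sample URLs outside Sheets or Airtable, here is a small Python sketch. It assumes the regex behaves the same in Python's re module as in the RE2 engine behind REGEXEXTRACT, which should hold for a pattern this simple.

```python
import re

# Pattern from the answer: strip any number of leading http(s):// and www. parts.
strip_re = re.compile(r"^(?:https?://(?:www\.)?)*(.*)")

urls = [
    "http://www.akashicbooks.com",
    "https://deliciouselsalvador.com",
    "http://altaonline.com",
    "http://https://www.amtb-la.org/",
    "http://https://www.amovacations.com/",
    "http://dornsife.usc.edu/jep",
]

# Group 1 is what REGEXEXTRACT would return.
cleaned = [strip_re.match(u).group(1) for u in urls]
for original, result in zip(urls, cleaned):
    print(original, "->", result)
```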

Regex google analytics - Filter some directories and pages from url

I need to create a view in google analytics that can filter some pages.
The current regex I am using is:
^/(|es|pt|fr|de|blog|examples|about|)?(/(\w*)?)?$
Pages I want to show:
/
/es
/es/
/es/something-else
/blog/blog-post-123
/about
/examples
/examples/directory
/examples/directory/title-examples
Pages I don't want to show:
/any-other-url-not-mentioned
/es1231
/esad
/examplesee
/blogbla-bla-bala
/blog-bla-bla-ba
The current problem with my regex is if the page contains "-" it will NOT show up in the view:
/blog/10-tools
page/this-is-content-url
This one should work:
^\/(es|pt|fr|de|blog|examples|about)?($|\/).*
Regex101
You may use
^\/(es|pt|fr|de|blog|examples|about)?(\/[^\/]+)*\/?$
Or
^\/((es|pt|fr|de|blog|examples|about)(\/[^\/]+)*\/?)?$
See the regex demo and regex demo 2.
Details
^ - start of string
\/ - a /
(es|pt|fr|de|blog|examples|about)? - an optional group matching any of the alternative substrings in it
(\/[^\/]+)* - 0+ / followed with 1+ chars other than /
\/? - an optional /
$ - end of string
The second regex is a bit more precise since it won't match //, but it does not seem a valid scenario.
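Both patterns can be checked against the page lists from the question with a few lines of Python. This is a sketch for verification only; Google Analytics uses its own regex engine, but the patterns use no engine-specific features.

```python
import re

loose = re.compile(r"^\/(es|pt|fr|de|blog|examples|about)?(\/[^\/]+)*\/?$")
strict = re.compile(r"^\/((es|pt|fr|de|blog|examples|about)(\/[^\/]+)*\/?)?$")

wanted = [
    "/", "/es", "/es/", "/es/something-else", "/blog/blog-post-123",
    "/about", "/examples", "/examples/directory",
    "/examples/directory/title-examples", "/blog/10-tools",
]
unwanted = [
    "/any-other-url-not-mentioned", "/es1231", "/esad",
    "/examplesee", "/blogbla-bla-bala", "/blog-bla-bla-ba",
]

for name, rx in (("loose", loose), ("strict", strict)):
    ok_wanted = all(rx.search(p) for p in wanted)
    ok_unwanted = not any(rx.search(p) for p in unwanted)
    print(name, ok_wanted, ok_unwanted)

# The difference mentioned above: only the first pattern accepts "//".
print(bool(loose.search("//")), bool(strict.search("//")))  # True False
```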

Regex get domain name from email

I am learning regex and am having trouble getting google from email address
String
first.name#google.com
I just want to get google, not google.com
Regex:
[^#].+(?=\.)
Result: https://regex101.com/r/wA5eX5/1
From my understanding, it ignores the #, then finds the string after it up to the . (dot) using (?=\.)
What did I do wrong?
[^#] means "match one symbol that is not a # sign". That is not what you are looking for: use a lookbehind (?<=#) for the # and your (?=\.) lookahead for the dot to extract the server name in the middle:
(?<=#)[^.]+(?=\.)
The middle portion [^.]+ means "one or more non-dot characters".
Demo.
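A short Python comparison shows why the original attempt returns too much and what the lookaround version returns instead. It is only an illustration using the address from the question.

```python
import re

email = "first.name#google.com"

# Original attempt: [^#] consumes one ordinary character ("f"), then .+ greedily
# runs to the end and backtracks to the last position followed by a dot.
original = re.search(r"[^#].+(?=\.)", email).group()

# Lookaround version: start right after "#", stop before the first dot.
fixed = re.search(r"(?<=#)[^.]+(?=\.)", email).group()

print(original)  # first.name#google
print(fixed)     # google
```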
Updated answer: use a capturing group and keep it simple :)
#(\w+)
Explanation by splitting it up
( capturing group for extraction )
\w stands for word character [A-Za-z0-9_]
+ is a quantifier for one or more occurrences of \w
Regex explanation and demo on Regex101
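In Python the captured text is read from group 1. The sketch below also shows one limitation worth knowing: \w does not match a hyphen, so a hyphenated domain is cut short. The [\w-]+ variant is an illustrative adjustment, not part of the answer above.

```python
import re

# Capturing-group pattern from the answer above.
simple = re.compile(r"#(\w+)")
# Hypothetical variation that also allows hyphens in the domain label.
with_hyphen = re.compile(r"#([\w-]+)")

a = simple.search("first.name#google.com").group(1)
b = simple.search("jane.doe#my-bank.no").group(1)
c = with_hyphen.search("jane.doe#my-bank.no").group(1)

print(a)  # google
print(b)  # my
print(c)  # my-bank
```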
I used the solution's regex for my task, but realized that some of the emails weren't that easy: foo#us.industries.com, foobar#tm.valves.net, andfoo#ge.test.com
To anyone who came here wanting the sub domain as well (or is being cut off by it), here's the regex:
(?<=#)[^.]*.[^.]*(?=\.)
This should be the regex:
(?<=#)[^.]+
(?<=#) - places the search right after the #
[^.]+ - take all the characters that are not dot (stops on dot)
So it extracts google from the email address.
As I was working to get the domain name of email addresses and none corresponded to what I needed:
To not catch subdomains
To match countries top domains (like .com.ar or co.jp)
For example, in test#ext.domain.com.mx I need to match domain.com.mx
So I made this one:
[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$
Here is a link to regex101 to illustrate the regex: https://regex101.com/r/vE8rP9/59
You can get just the domain name (without the top-level domain, e.g. .com or .com.mx) by adding lookaround operators (but it will match twice in test#test.com.mx):
[^.#]*?(?=\.\w{2,}$)|[^.#]*?(?=\.com?\.\w{2}$)
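The first of these two patterns (the one that keeps the top-level domain) can be tried in Python as follows. This is a sketch: the second and third addresses are made-up examples, and registrable_domain is a hypothetical helper name.

```python
import re

# First pattern from this answer: domain plus TLD, with a special case
# for two-part suffixes such as .com.mx or .co.jp.
domain_re = re.compile(r"[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$")

def registrable_domain(address: str):
    """Return the matched domain, or None when nothing matches."""
    m = domain_re.search(address)
    return m.group() if m else None

print(registrable_domain("test#ext.domain.com.mx"))  # domain.com.mx
print(registrable_domain("john.doe#spam.com"))       # spam.com
print(registrable_domain("user#example.co.jp"))      # example.co.jp
```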
Maybe not strictly a "full regex answer" but more flexible ( in case the part before the # is not "first.last") would be using cut:
cut -d '#' -f 2 | cut -d . -f 1
The first cut will isolate the part after # and the second one will get what you want.
This will also work for other kinds of email patterns: xxxx#server.com, xxx.yyy.zzz#server.com, and so on.
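The same split-on-delimiter idea can be written in Python without any regex. This is a minimal sketch; domain_label is a hypothetical helper name, and like the cut pipeline it returns only the first label after the #.

```python
def domain_label(address: str) -> str:
    """Mimic `cut -d '#' -f 2 | cut -d . -f 1` for a single address."""
    # Everything after the first "#" (empty string if there is none).
    _, _, host = address.partition("#")
    # First dot-separated label of that part.
    return host.split(".", 1)[0]

print(domain_label("first.name#google.com"))   # google
print(domain_label("xxx.yyy.zzz#server.com"))  # server
print(domain_label("foo#us.industries.com"))   # us
```

As the last line shows, an address with a subdomain yields the subdomain label, exactly as the two cut calls would.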
Thanks everyone for your great responses, I took what you had and expanded it with labelled match-groups for easy extraction of separate parts.
Caveat : Regex.Speed = Slow
Another post mentioned how SLOW and nonperformant regexes are, and that is a fair point to remember. My particular need is targeting my own background/slow/reporting processes and therefore it doesn't matter how long it takes.
But it's good to remember whenever possible Regex should NOT be used in any sort of web page load or "needs-to-be-quick" kind of application. In that case you're much better off using substring to algorithmically strip down the inputs and throw away all the junk that I'm optionally matching/allowing/including here.
https://regex101.com/r/ZnU3OC/1
One Regex to rule them all...
Subdomain/Domain/TopLevelDomain/CountryCode extraction for Emails, domain lists, & URLs
Also handles ?Querystring=junk, Slashes/With/Paths, #anchors
Now with more broth, batteries not included
^(?<Email>.*#)?(?<Protocol>\w+:\/\/)?(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)(?:\.(?<CountryCode>[a-z]{2}))?)(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$
not overly complicated at all... why would you even say that?
Substitution / Outputs
EXAMPLE INPUT: "https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff"
EXAMPLE OUTPUT:
{
Protocol: "https://"
SubDomain: "www"
DomainWithTLD: "stackoverflow.co.uk"
Domain: "stackoverflow"
TopLevelDomain: "co"
CountryCode: "uk"
Path: "/path/2"
QString: "?q=mysearch&and=more#stuff"
}
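For anyone wanting to read these named groups from Python: the re module only accepts the (?P<name>...) spelling, so the pattern needs a small conversion first. The sketch below makes that conversion with a plain string replacement, which is safe here only because the pattern contains no lookbehinds. The pattern is split across lines purely for readability.

```python
import re

# The answer's pattern, unchanged apart from being split into pieces.
PCRE_PATTERN = (
    r"^(?<Email>.*#)?(?<Protocol>\w+:\/\/)?"
    r"(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?"
    r"(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)"
    r"(?:\.(?<CountryCode>[a-z]{2}))?)"
    r"(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?"
    r"(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$"
)

# Convert (?<Name>...) to Python's (?P<Name>...) syntax.
url_re = re.compile(PCRE_PATTERN.replace("(?<", "(?P<"))

m = url_re.match("https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff")
# Drop groups that did not participate or matched the empty string.
parts = {name: value for name, value in m.groupdict().items() if value}
for name, value in parts.items():
    print(f"{name}: {value}")
```

One detail to be aware of: the SubDomain group is captured with its trailing dot ("www."), so strip it if you need the bare label.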
Allowed/Compliant Domains : Should ALL MATCH
www.bankofamerica.com
bankofamerica.com.securersite.regexr.com
bankofamerica.co.uk.blahblahblah.secure.com.it
dashes-bad-for-seo.but-technically-still-allowed.not-in-front-or-end
bit.ly
is.gd
foo.biz.pl
google.com.cn
stackoverflow.co.uk
level_three.sub_domain.example.com
www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com
https://www.stackoverflow.co.uk?q=mysearch&and=more
foo://5th.4th.3rd.example.com:8042/over/there
foo://subdomain.example.com:8042/over/there?name=ferret#nose
example.com
www.example.com
example.co.uk
trailing-slash.com/
trailing-pound.com#
trailing-question.com?
probably-not-valid.com.cn?&#
probably-not-valid.com.cn/?&#
example.com/page
example.com?key=value
* NOTE: PunyCodes (Unicode in urls) handled just fine with \w ,no extra sauce needed
xn--fsqu00a.xn--0zwm56d.com
xn--diseolatinoamericano-66b.com
Emails : Should ALL MATCH
first.name#google1.co.com
foo#us.industries.com,
foobar#tm.valves.net,
andfoo#ge.test.com
jane.doe#my-bank.no
john.doe#spam.com
jane.ann.doe#sandnes.district.gov
Non-Compliant Domains : Should NOT MATCH
either not long-enough (domain min length 2), or too long (64)
v.gd
thing.y
0123456789012345678901234567890123456789012345678901234567891234.com
its-sixty-four-instead-of-sixty-three!.com
symbols-not-allowed#.com
symbols-not-allowed#.com
symbols-not-allowed$.com
symbols-not-allowed%.com
symbols-not-allowed^.com
symbols-not-allowed&.com
symbols-not-allowed*.com
symbols-not-allowed(.com
symbols-not-allowed).com
symbols-not-allowed+.com
symbols-not-allowed=.com
TBD Not handled:
* dashes as start or ending is disallowed (dropped from Regex for readability)
-junk-.com
* is underscore allowed? i donno... (but it simplifies the regex using \w instead of [a-zA-Z0-9\-] everywhere)
symbols-not-allowed_.com
* special case localhost?
.localhost
also see:
Domain Name Rules :: Super handy ASCII Diagram of a URL
see: https://stackoverflow.com/a/66660651/738895 *
Side NOTE: lazy load '?' for subdomains{0,127}? currently needed for any of the cases with country codes... (example: stackoverflow.co.uk)
Matches these, but does NOT grab $NLevelSubdomains in a match group, can only grab 3rd level only.
This is a relatively simple regex, and it grabs everything between the # and the final domain extension (e.g. .com, .org). It allows domain names that are made up of non-word characters, which exist in real-world data.
>>> import re
>>> regex = re.compile(r"^.+#(.+)\.[\w]+$")
>>> regex.findall('jane.doe#my-bank.no')
['my-bank']
>>> regex.findall('john.doe#spam.com')
['spam']
>>> regex.findall('jane.ann.doe#sandnes.district.gov')
['sandnes.district']
I used the regular expression '.*#+(.*)' to get the complete domain name: .* skips all the characters before the # (matched by #+), and the parentheses capture the complete domain name, i.e. the rest of the string (any characters except line breaks).

Negative lookbehind in a regex with an optional prefix

We are using the following regex to recognize urls (derived from this gist by Jim Gruber). This is being executed in Scala using scala.util.matching which in turn uses java.util.regex:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b/?(?!#)))
This version has escaped forward slashes, for Rubular:
(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#))))
Previously the front-end was only sending plaintext to the back end, however now they're allowing users to create anchor tags for urls. Therefore the back end now needs to recognize urls except for those that are already in anchor tags. I initially tried to accomplish this with a negative lookbehind, ignoring urls with a href=" prefix:
(?i)\b((?<!href=")((?:https?: ... etc
The problem is that our url regex is very liberal, recognizing http://www.google.com, www.google.com, and google.com - given
<a href="http://www.google.com">Google</a>
the negative lookbehind will ignore http://www.google.com, but then the regex will still recognize www.google.com. I'm wondering if there's a succinct way to tell the regex "ignore www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com"
At present I'm using a filter on the url regex matches (code is in Scala) - this also ignores urls in link text (www.google.com) by ignoring urls with a > prefix and </a> suffix. I'd rather stick with the filter if doing this in a regex would make an already complicated regex even more unreadable.
urlPattern.findAllMatchIn(text).toList.filter(m => {
val start: Int = m.start(1)
val end: Int = m.end(1)
val isHref: Boolean = (start - 6 > 0) &&
text.substring(start - 6, start) == """href=""""
val isAnchor: Boolean = (start - 1 > 0 && end + 3 < text.length &&
text.substring(start - 1, start) == ">" &&
text.substring(end, end + 3) == "</a>")
!(isHref || isAnchor) && Option(m.group(1)).isDefined
})
<a href=\S+|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))
or
<a href=(?:(?!<\/a>).)*<\/a>|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))
Try this. What it essentially does is:
Consumes all href links so that they cannot be matched later.
Does not capture them, so they will not appear in the groups anyway.
Processes the rest as before.
See demo.
http://regex101.com/r/vR4fY4/17
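The "consume what you don't want, capture what you do" idea can be sketched in Python. To keep the example readable, the long URL regex is replaced by a hypothetical, heavily simplified stand-in (URL below), so the results only illustrate the technique and say nothing about the full pattern's behaviour.

```python
import re

# Hypothetical simplified URL pattern standing in for the long one above.
URL = r'(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,6}\b(?:/[^\s<>"]*)?'

# First alternative swallows whole anchor tags without capturing;
# second alternative captures a URL in group 1.
pattern = re.compile(r"<a href=(?:(?!</a>).)*</a>|\b(" + URL + r")", re.IGNORECASE)

def find_urls(text):
    """Return only URLs outside anchor tags (matches where group 1 took part)."""
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

text = ('See <a href="http://www.google.com">www.google.com</a> and '
        "example.org/page or https://scala-lang.org.")
found = find_urls(text)
print(found)  # ['example.org/page', 'https://scala-lang.org']
```

Because the whole anchor element is consumed by the first alternative, both the href value and the link text are skipped.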
It seems that you want to ignore not only "www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com", but any substring fragments from a previously ignored section. In that case, you can use a bit of code to work around this. See the regex:
(a href=")?(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#))))
^^^^^^^^^^^
I'm not good at scala but you can probably do this:
val links = new Regex("""(a href=")?(?i)\b(((?:https?:... """, "unwanted")
val unwanted = for (o <- links findAllMatchIn text) yield o group "unwanted"
If the unwanted group is null for a match (the optional prefix did not take part), then the match is useful.
If you need to do a replacement, you can work around this by using an alternation:
a href="(?i)\b(?:(?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))|((?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))))
The second part of the regex, after the pipe |, is wrapped in a capturing group. You can replace using this regex and refer to that first group with \1.
Similar question:
Regex Pattern to Match, Excluding when... / Except between
How about just adding the <a href= part as an optional group, then when checking your matching, you only return those matches in which that group is empty?
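A sketch of that optional-group approach in Python, again with a hypothetical simplified URL pattern in place of the long one; the group names unwanted and url are illustrative choices.

```python
import re

# Hypothetical simplified URL pattern standing in for the long one above.
URL = r'(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,6}\b(?:/[^\s<>"]*)?'

# The optional "unwanted" group records whether the URL was preceded by a href=".
pattern = re.compile(r'(?P<unwanted>a href=")?\b(?P<url>' + URL + r")", re.IGNORECASE)

text = '<a href="http://www.google.com">Google</a> and example.org'

# Keep only matches in which the optional prefix group did not take part.
kept = [m.group("url") for m in pattern.finditer(text)
        if m.group("unwanted") is None]
print(kept)  # ['example.org']
```

This only filters on the href prefix; a URL used as the link text between > and </a> would still be returned, which is the case the question's Scala filter handles separately.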