Regular expression to exclude local addresses - regex

I'm trying to configure FoxyProxy, and one of its features lets you provide a regular expression for an exclusion list.
I'm trying to blacklist the local sites (ending in .local), but it doesn't seem to work.
This is what I attempted:
^(?:https?://)?\d+\.(?!local)+/.*$
^(?:https?://)?\d+\.(?!local)(\d)+/.*$
I also researched on Google and Stack Exchange with no success.

Since you indicate in the comments that you actually need a whitelist solution, I went with that:
Try: ^(?:https?://)?[\w.-]+\.(?!local)\w+/.*$
http://regex101.com/r/xV4gS0

Your regular expressions match host names which start with a series of digits followed by a period that is not followed by the string "local". If this is a "blacklist", then that hardly seems like what you want.
If you're trying to match all hostnames which end in .local, you'd want something like the following for the hostname portion:
[^/]*\.local(?:/|$)
with appropriate escapes inserted depending on regex context.
If your original question was incorrect and you really need a whitelist, then you'd want something like:
^(?:(?!\.local)[^\/])*(?:\/|$)
as illustrated in http://regex101.com/r/yB0uY4
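As a rough check of the two hostname patterns above, here is a minimal Python sketch. It assumes the scheme has already been stripped from the URL and that Python's re syntax is close enough to the engine the proxy tool uses; the sample host names and the function names are made up for illustration.

```python
import re

# Patterns from the answer above, applied to "host/path" with the scheme removed.
ENDS_IN_LOCAL = re.compile(r"[^/]*\.local(?:/|$)")
NOT_LOCAL = re.compile(r"^(?:(?!\.local)[^\/])*(?:\/|$)")


def is_local(host_and_path):
    # re.match anchors at the start, so only the hostname portion is examined.
    return ENDS_IN_LOCAL.match(host_and_path) is not None


def is_not_local(host_and_path):
    # The tempered lookahead rejects any host that contains ".local".
    return NOT_LOCAL.match(host_and_path) is not None


# Hypothetical inputs, for illustration only.
samples = ["printer.local/status", "example.com/index.html", "nas.local"]
for sample in samples:
    print(sample, is_local(sample), is_not_local(sample))
```

Note that the second pattern rejects any host containing ".local" anywhere, so a name such as foo.localdomain.com would be rejected as well.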

Thank you, everyone, for the help. Indeed, it turns out that for this program, listing "not .local" as a blacklist is not the same as listing "all .local" as a whitelist.
I also made a rookie mistake in my pattern: I meant "\w" instead of "\d". Thank you, Peter Alfvin, for catching that.
So my final working solution is what Bart suggested:
^(?:https?://)?[\w.-]+\.(?!local)\w+/.*$ as a whitelist.
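The final whitelist pattern can be tried out with a short Python sketch. This assumes Python's re engine behaves like the one in the proxy tool for this pattern, which is not confirmed here; the URLs and the function name is_whitelisted are illustrative only.

```python
import re

# The whitelist pattern from the accepted solution.
WHITELIST = re.compile(r"^(?:https?://)?[\w.-]+\.(?!local)\w+/.*$")


def is_whitelisted(url):
    """Return True if the URL matches the whitelist pattern."""
    return WHITELIST.match(url) is not None


# Hypothetical URLs, for illustration only.
for url in ["http://example.com/", "https://printer.local/status", "http://example.com"]:
    print(url, is_whitelisted(url))
```

Two things worth noticing: the pattern requires a slash after the host name, so a bare http://example.com does not match, and (?!local) also rejects top-level labels that merely start with "local".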

Related

Regex that can handle an arbitrary number of asterisks in a word

I'm trying to write a regex for x509 CN/SAN validation and have just learned that apparently partial wildcards are possible in theory. How would I build a regex to handle this when I want to make sure that it captures all certificates that might be issued for example.org?
My naive approach would be
\**e\**x\**a\**m\**p\**l\**e\**.\**o\**r\**g\**
not including possible subdomains of course. This looks pretty bad though and really inflates the term longer than I'd like it to be. Is there a more concise way to get the behaviour I described?
Edit: I also just realised that my naive regex wouldn't even catch when someone uses the asterisk to replace a part of the domain, e.g. exa*.org.
Since I feel like there's a possibility that this is not easily expressible in a concise regex, I solved my use case within the Python code that surrounds my previous regex check.
Instead of mapping a regex to the domains appearing in a certificate, I instead convert the certificate domain into a regex pattern, replace the literal dots with escaped dots and the asterisk with [a-zA-Z0-9-]{0,63}. I then compare it to the list of domains I manage and if the regex matches, I know that the certificate is applicable to the managed domain.
If someone manages to express this in a concise regex I'd still be interested.
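The conversion described above could look roughly like the following Python sketch. The function names and the sample domains are hypothetical, and it assumes the certificate name contains only literal text and asterisks.

```python
import re

# What a single asterisk may stand for, as described in the post.
WILDCARD = r"[a-zA-Z0-9-]{0,63}"


def cert_name_to_pattern(cert_name):
    """Turn a certificate name such as 'exa*.org' into a compiled regex."""
    # Escape the literal pieces (dots become \.) and join them with the wildcard class.
    literal_parts = [re.escape(part) for part in cert_name.split("*")]
    return re.compile(WILDCARD.join(literal_parts), re.IGNORECASE)


def covered_domains(cert_name, managed_domains):
    """Return the managed domains that the certificate name applies to."""
    pattern = cert_name_to_pattern(cert_name)
    return [domain for domain in managed_domains if pattern.fullmatch(domain)]


# Hypothetical managed domains, for illustration only.
managed = ["example.org", "www.example.org", "example.com"]
print(covered_domains("exa*.org", managed))
print(covered_domains("*.example.org", managed))
```

Because the wildcard class excludes dots, an asterisk never spans more than one label in this sketch.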

How to exclude the last part of a variable string using regex

I am currently making a bunch of landing pages that use similar URL structure, but each URL varies in number of words.
So it's something like:
http://landingpage.xyz/page-number-five
http://landingpage.xyz/page-number-fifty-four
http://landingpage.xyz/page-for-a-different-topic
and for the sent page I just postfix -sent like this. The reason I am not adding it as /sent is because the platform I am using handles URLs this way.
http://landingpage.xyz/page-number-five-sent
http://landingpage.xyz/page-number-fifty-four-sent
http://landingpage.xyz/page-for-a-different-topic-sent
Now I found it easy to make a regular expression that identifies all the sent pages which is let's say:
\/([a-z0-9\-]*)-sent
The thing is that I am not sure how to identify the ones that are not sent. I tried using a similar regular expression using something like this, but it's not working as expected:
\/([a-z0-9\-]*)(?!-sent)
What's the best way to design the regex for this? Or am I approaching it in the wrong way?
A negative lookahead is only useful at a point where the text it must reject can still follow. Placed at the end of the pattern, it excludes nothing, because the preceding [a-z0-9\-]* can simply back off until the lookahead succeeds. Since I'm not sure whether your environment supports lookbehinds, this should be a workaround:
\/(?!.*-sent\b)([a-z0-9\-]*)
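A minimal Python sketch of this workaround, applied to the path part of each URL. Splitting the URL with urlsplit and the function name unsent_slug are assumptions made for the example, not part of the original answer.

```python
import re
from urllib.parse import urlsplit

# The workaround pattern: reject the path up front if "-sent" appears later in it.
NOT_SENT = re.compile(r"\/(?!.*-sent\b)([a-z0-9\-]*)")


def unsent_slug(url):
    """Return the page slug if the URL is not a sent page, otherwise None."""
    path = urlsplit(url).path
    match = NOT_SENT.match(path)
    return match.group(1) if match else None


urls = [
    "http://landingpage.xyz/page-number-five",
    "http://landingpage.xyz/page-number-five-sent",
]
for url in urls:
    print(url, unsent_slug(url))
```

Since \b also matches between "t" and a following hyphen, a slug such as page-sent-items would be rejected too; anchoring with -sent$ instead would be stricter if that matters.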

Negating a regex query

I have looked at multiple posts about this, and am still having issues.
I am attempting to write a regex query that finds the names of S3 buckets that do not follow the naming scheme we want. The scheme we want is as follows:
test-bucket-logs**-us-east-1**
The bolded part is optional. Meaning, the following two are valid bucket names:
test-bucket-logs
test-bucket-logs-us-east-1
Now, what I want to do is negate this. So I want to catch all buckets that do not follow the scheme above. I have successfully formed a query that will match for the naming scheme, but am having issues forming one that negates it. The regex is below:
^(.*-bucket-logs)(-[a-z]{2}-[a-z]{4,}-\d)?$
So some more valid bucket names:
example-bucket-logs-ap-northeast-1
something-bucket-logs-eu-central-1
Invalid bucket names (we want to match these):
Iscrewedthepooch
test-bucket-logs-us-ee
bucket-logs-us-east-1
Thank you for the help.
As Mr. Barmar said, probably the best approach in these circumstances is to solve it programmatically. You could write the usual regex for matching the right pattern and exclude the matches from the collection.
But you can try this:
^(?:.(?!-bucket-logs-[a-z]{2}-[a-z]{4,}-\d|-bucket-logs$))*$
This is a typical solution using a negative lookahead, (?!...), which is a zero-width assertion: it checks what follows without consuming any characters. Basically, it matches every line in which no character is followed by the valid naming pattern.
EDITED
As Ibrahim pointed out (thank you!), there was a small issue with my first regex. I have fixed it, and I think it is OK now: I had forgotten to make the last part of the inner regex optional (?).
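The programmatic approach mentioned at the start of this answer might be sketched in Python as follows. It reuses the asker's own pattern for valid names; the function name invalid_buckets is hypothetical, and the bucket names are the ones from the question.

```python
import re

# The asker's pattern for names that DO follow the scheme.
VALID_NAME = re.compile(r"^(.*-bucket-logs)(-[a-z]{2}-[a-z]{4,}-\d)?$")


def invalid_buckets(names):
    """Return the bucket names that do not follow the naming scheme."""
    return [name for name in names if not VALID_NAME.match(name)]


buckets = [
    "test-bucket-logs",
    "test-bucket-logs-us-east-1",
    "example-bucket-logs-ap-northeast-1",
    "Iscrewedthepooch",
    "test-bucket-logs-us-ee",
    "bucket-logs-us-east-1",
]
print(invalid_buckets(buckets))
```

Negating the match in code keeps the pattern readable and avoids having to maintain a second, inverted regex.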

regex to find domain without those instances being part of subdomain.domain

I'm new to regex. I need to find instances of example.com in an .SQL file in Notepad++ without those instances being part of subdomain.example.com.
From this answer, I've tried using ^((?!subdomain))\.example\.com$, but this does not work.
I tested this in Notepad++ and at https://regex101.com/r/kS1nQ4/1, but it doesn't work.
Help appreciated.
Simple
^example\.com$
with g,m,i switches will work for you.
https://regex101.com/r/sJ5fE9/1
If the matching should be done somewhere in the middle of the string you can use negative look behind to check that there is no dot before:
(?<!\.)example\.com
https://regex101.com/r/sJ5fE9/2
Without access to example text, it's a bit hard to guess what you really need, but the regular expression
(^|\s)example\.com\>
will find example.com where it is preceded by nothing or by whitespace, and followed by a word boundary. (You could still get a false match on example.com.pk because the period is a word boundary. Provide better examples in your question if you want better answers.)
If you specifically want to use a lookaround: the negative lookahead you used (as the name implies) specifies what the regex should not match at this point. So (?!subdomain\.)example trivially always matches, because the text example is never subdomain., and the negative lookahead can never fail.
You might be better served by a lookbehind:
(?<!subdomain\.)example\.com
Demo: https://regex101.com/r/kS1nQ4/3
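To compare the two lookbehind variants from the answers, here is a small Python sketch. Python's re module is used only for illustration (Notepad++ uses a different engine), and the sample text and the helper match_starts are made up.

```python
import re

# Reject any match that has a dot directly before it.
NO_DOT_BEFORE = re.compile(r"(?<!\.)example\.com")
# Reject only matches preceded by the specific prefix "subdomain.".
NOT_SUBDOMAIN = re.compile(r"(?<!subdomain\.)example\.com")


def match_starts(pattern, text):
    """Return the start offsets of all matches of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


# Hypothetical sample text, for illustration only.
text = "example.com subdomain.example.com other.example.com"
print(match_starts(NO_DOT_BEFORE, text))
print(match_starts(NOT_SUBDOMAIN, text))
```

The first pattern skips every dotted prefix, while the second still matches other.example.com, so the choice depends on whether only one subdomain or all of them should be excluded.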
Here's a solution that takes into account the protocols/prefixes,
/^(www\.)?(http:\/\/www\.)?(https:\/\/www\.)?example\.com$/

Regex to find a web address

I'm trying to isolate links from HTML using a regex, and the one I found that is supposed to do it doesn't seem to work.
/^(http?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Am I missing something? I'm using Brackets as my text editor.
^(?:http|https):\/\/(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%#!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%#!\-\/\(\)]+))?$
Messy, but works.
Also, you might want to look at a similar question: Regex expression for valid website link
Hope this helps :)
It is hard to make it 100% accurate.
A URL could also be an IP address, for example.
http://ip/
It can contain query strings.
http://www.google.com/?a=1&b=2
It can contain spaces.
http://www.google.com/this is my url/
It depends on what need you have for accuracy.
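One way to see these trade-offs is to run the longer pattern from the earlier answer against URLs of the kinds listed above. This is a sketch in Python rather than in the editor's own engine, and the IP address 192.0.2.1 is a placeholder chosen for the example.

```python
import re

# The pattern from the earlier answer, split over two lines for readability.
URL_PATTERN = re.compile(
    r"^(?:http|https):\/\/(?:[a-z0-9\-\.]+)(?::[0-9]+)?"
    r"(?:\/|\/(?:[\w#!:\.\?\+=&%#!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%#!\-\/\(\)]+))?$"
)


def looks_like_url(candidate):
    """Return True if the whole string matches the URL pattern."""
    return URL_PATTERN.match(candidate) is not None


candidates = [
    "http://192.0.2.1/",                      # IP address (placeholder)
    "http://www.google.com/?a=1&b=2",         # query string
    "http://www.google.com/this is my url/",  # contains spaces
]
for candidate in candidates:
    print(candidate, looks_like_url(candidate))
```

Because the pattern is anchored with ^ and $, it validates a whole string as a URL; it will not pick links out of the middle of an HTML line.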