Regex match website that is NOT an email - regex

I'm trying to extract websites without matching email addresses.
In other words if my contact section has
email: a#gmail.com ---- website: www.company.com
I want the www.company.com without matching gmail.com.
So far I have tried everything I can think of. The best I have is
\b(?:.(?<!#))+\.\S+\b
but that will still match gmail.com in a#gmail.com.
I'll admit that my regex skills are not the strongest. I've done my research on negative lookaheads, lookbehinds and so on, but I still don't know how to do this.

This is an expression made by JGSoft for domain names:
\b(?<!#)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
It is internationalized and strict.
I added (?<!#) to stop it from matching domain names after email names.
See a demo here
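As a quick check of the expression above, here is a minimal Python sketch run against the sample line from the question. The helper name find_websites is made up for illustration, and the sketch keeps "#" in place of the at sign exactly as the addresses are written in this thread. Note that the pattern only lists lowercase letters, so mixed-case input would need a case-insensitive flag.

```python
import re

# Pattern from the answer above, unchanged. Addresses in this thread are
# written with "#" where the at sign would go, so the lookbehind checks "#".
DOMAIN_RE = re.compile(
    r"\b(?<!#)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b"
)

def find_websites(text):
    """Return every domain-like match that is not directly preceded by '#'."""
    # finditer + group(0) is used because findall would return only the
    # contents of the capture groups, not the whole match.
    return [m.group(0) for m in DOMAIN_RE.finditer(text)]

sample = "email: a#gmail.com ---- website: www.company.com"
websites = find_websites(sample)
print(websites)
```

The lookbehind only rejects a match that starts right after "#"; the leading \b stops the engine from starting in the middle of gmail instead.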

Related

Regex: how can exclude all TLD except my own domain

I have an Asp.Net website on which I'm implementing some basic anti-spam stuff via the validation controls.
One such regex is: "^(?!.*(//|[.]({com|net|info|uk|etc}))).*$"
It pretty much does what it needs to as far as blocking goes — it doesn't need to be too sophisticated. However, I want to include the option to whitelist my own domain.
So, I want to block all .uk domains, except mydomain.co.uk.
This is a regex step beyond me — can anyone help?
You can nest a negative lookbehind in front of uk inside the existing negative lookahead, so that the uk alternative fails to match when it is preceded by mydomain.co.:
^(?!.*(//|[.](com|net|info|(?<!mydomain\.co\.)uk))).*$
RegEx Demo
Note (?<!mydomain\.co\.), a negative lookbehind that prevents uk from matching when it is preceded by mydomain.co. (including the trailing dot).
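To see the whitelist in action outside of the ASP.NET validator, here is a small Python sketch of the same pattern. The function name is_allowed and the sample strings are assumptions made for illustration; the regex itself is the one from the answer.

```python
import re

# The answer's pattern: reject input containing "//" or a blocked TLD,
# except that ".uk" is tolerated when preceded by "mydomain.co.".
SPAM_FREE = re.compile(
    r"^(?!.*(//|[.](com|net|info|(?<!mydomain\.co\.)uk))).*$"
)

def is_allowed(text):
    """True when the validator pattern matches, i.e. the input is not blocked."""
    return SPAM_FREE.match(text) is not None

for text in ["hello world", "see mydomain.co.uk", "see spam.co.uk", "buy at shop.com"]:
    print(text, "->", "allowed" if is_allowed(text) else "blocked")
```

One caveat worth knowing: the lookbehind only inspects the twelve characters before uk, so a host such as evilmydomain.co.uk is also let through. That may be acceptable for a basic filter like the one described.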

Regex and cPanel Account Filtering

After countless hours of googling and trying to contact my web host (with no positive results), I wanted to just throw my question out there and get better expertise on my issue. I really do believe that this will be helpful to a lot of people stuck asking the same question!
Just to keep things short, we have hosted our email solution with a web host using cPanel and I have a big requirement. Basically, I need an account-level filter to block certain mail addresses from sending out to other mail servers. For example:
Let's say we use example.com.
user1#example.com can send mail to anyone, anywhere
user2.int#example.com is only allowed to send mail to example.com addresses but not to any other address, for example gmail.com, yahoo.com, etc.
Of the options available for account-level filtering, I thought regex was the best one to use.
I suspect that Exim (the default MTA for cPanel) uses PCRE-like regular expressions; please correct me if I'm wrong.
The pattern I wrote and need help with is the following:
^(?!.+\#example\.com$).*$
With this, all example.com addresses should not match and all other addresses should.
The testing tool I used is https://www.debuggex.com/
Guys, please help and let me know what I am doing wrong. cPanel is letting mail go through and is not blocking it.
The regex:
^(?![^#]*?#example\.com)
should do the trick
How it works
^: Find the beginning of the string/line
(?!...) Assert that it is impossible to find the following regex:
[^#]*? Match all the characters that are not an at symbol (#)
#example\.com Match the exact string '#example\.com'
For a more in-depth explanation see this
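The breakdown above can be tried out with a short Python sketch. This only exercises the pattern in Python's re module; whether cPanel's filter applies it the same way is not something the sketch can confirm. The name is_external is hypothetical, and "#" stands in for the at sign as in the rest of this thread.

```python
import re

# The answer's pattern: succeed only when no "#example.com" follows the
# local part (the run of non-"#" characters at the start).
NOT_EXAMPLE = re.compile(r"^(?![^#]*?#example\.com)")

def is_external(recipient):
    """True when the recipient is NOT an example.com address."""
    return NOT_EXAMPLE.match(recipient) is not None

for rcpt in ["user1#example.com", "user2.int#example.com", "someone#gmail.com"]:
    print(rcpt, "->", "external" if is_external(rcpt) else "internal")
```

The pattern consumes no characters; it is a pure assertion at the start of the string, so a successful match is an empty match at position 0.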

Regular expression - for email spam filtering, match email address variants other than the original

I am an email spam quarantine administrator and I can write regular expression rules to block email messages. A common class of spam hitting our domain spoofs the username of one of our email addresses in front of some other domain.
For example, suppose my email address is jwclark#domain.com. In that case, spammers are writing to me from all kinds of other domains that start with my username such as:
jwclark1234#whatever.com
jwclark#wrongdomain.com
jwclark#a.domain.com
How can I write a regular expression rule to match everything including jwclark and any wildcards, but not match the original jwclark#domain.com? I would like a regex that matches everything above except for my actual example email address jwclark#domain.com.
I've made this regexp here
^jwclark.*[#](?!domain\.com).*$
It's in JavaScript format, but it should be easy to adapt to PHP or something else.
Given the nature of your problem, you might be better off making a regex builder function that makes the proper regexp for you, given the parameters.
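A regex builder along those lines could look like the following Python sketch. The function name build_spoof_pattern and its parameters are assumptions for illustration; the generated pattern has the same shape as the regexp above, with the username and domain escaped so that characters like the dot are matched literally.

```python
import re

def build_spoof_pattern(username, domain):
    """Build a pattern matching addresses that start with `username` but
    whose domain part does not begin with `domain`.

    "#" separates user and domain, as the addresses are written in this thread.
    """
    return re.compile(
        "^" + re.escape(username) + ".*[#](?!" + re.escape(domain) + ").*$"
    )

spoof = build_spoof_pattern("jwclark", "domain.com")

for addr in [
    "jwclark1234#whatever.com",
    "jwclark#wrongdomain.com",
    "jwclark#a.domain.com",
    "jwclark#domain.com",
]:
    print(addr, "->", "spam" if spoof.match(addr) else "ok")
```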
Or, actually use a different approach. I recently found out how to parse ranges of floating point numbers with regexp, but that doesn't make it the proper solution to finding numbers within ranges. :P
edit - fixed silly redundancy thanks to zx81
edit - change to comply with strange limitations:
^jwclark.{0,25}[#][^d][^o][^m][^a][^i][^n].{0,25}\.com.{0,25}$
demo for the strange one

A URL that contains all valid characters to test my regex pattern?

First of all I created my own regex to find all URLs in a text, because:
When I searched SO and Google, I only found regexes for specific URL constructions, like images, etc.
I found a pretty complete regex from the PHP's manual itself (see "splattermania at freenet dot de 01-Oct-2009 12:01" post at http://php.net/manual/en/function.preg-match.php) that can find almost anything that resembles a URL, as little as "bit.ly".
This pattern has a few errors and constraints, so I'm fixing and enhancing it.
Now the pattern structure seems right, but I'm not sure all valid characters are present. Please post samples of URLs to test my pattern. Might be laziness, but I don't want to read pages and pages of references to find all of them, need to focus on the development. If you have a summary of valid chars for username, password, path, query and anchor that you can share, that would be very very helpful.
Best Regards!
The pattern you linked to does indeed match a lot of URLs, both valid and invalid. It's not really a surprise since nearly everything in that regex is optional; as you wrote yourself, it even matches bit.ly, so it's easy to see how it would match lots of non-URL stuff.
It doesn't take new Unicode domain names into account, for one (e.g., http://www.müller.de).
It doesn't match valid URLs like
http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx
It doesn't match relative paths (might not be necessary, though) like /cgi-bin/version.pl.
It doesn't match mailto: links.
It doesn't match URLs like http://1.2.3.4. Don't even ask about IPv6 :)
All in all, regular expressions are NOT the right tool to reliably match or validate URLs. This is a job for a parser. If you can live with many false positive and false negative matches, then regexes are fine.
Please read Jan Goyvaerts' excellent essay on this subject: Detecting URLs in a block of text.
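To illustrate the parser route, here is a minimal Python sketch that runs the standard library's urllib.parse.urlsplit over the examples listed above. It assumes the candidate strings have already been isolated, so it does not solve the problem of finding URLs inside running text. The bracketed IPv6 sample is an addition for illustration and does not come from the list above.

```python
from urllib.parse import urlsplit

samples = [
    "http://www.müller.de",
    "http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx",
    "/cgi-bin/version.pl",      # relative path: no scheme, no host
    "http://1.2.3.4",
    "http://[::1]:8080/",       # IPv6 literal, added here as an extra sample
]

for url in samples:
    parts = urlsplit(url)
    # hostname is None when the string carries no network location
    print(repr(parts.scheme), repr(parts.hostname), repr(parts.path))
```

The parser splits each string into components without any pattern having to enumerate the valid characters, which is the part that makes a single regex so hard to get right.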

How can I accept mail from only one domain in smtp-gated?

I need a Perl-compatible regular expression for filtering my email with smtp-gated. I only want to allow one domain ('mydomain.com') and reject everything else. How do I do this in a foolproof way? (regex_reject_mail_from)
I know this question halfway belongs on serverfault, but basically it's a Perl regex question so I think it fits stackoverflow more.
EDIT:
This should match so I can reject it:
"Someone" <someone#somedomain.com>
This should not match:
"Me" <me#mydomain.com>
This shouldn't match either:
you#mydomain.com
I'd suggest the following:
\b[A-Z0-9._%+-]+#(?!mydomain\.com)[A-Z0-9.-]+\.[A-Z]{2,6}\b
Use the /i option to make it case-insensitive.
This will match most valid (and some invalid) e-mail addresses that don't have mydomain.com after the #. Keep in mind that e-mail validation is hard with regexes.
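Here is a small Python sketch that checks this pattern against the three samples from the question. It only exercises the regex in Python's re module (with re.IGNORECASE standing in for the /i option); how smtp-gated itself applies regex_reject_mail_from is not covered. The name should_reject is hypothetical.

```python
import re

# The suggested pattern, case-insensitive. "#" stands in for the at sign,
# as the addresses are written in this thread.
REJECT_RE = re.compile(
    r"\b[A-Z0-9._%+-]+#(?!mydomain\.com)[A-Z0-9.-]+\.[A-Z]{2,6}\b",
    re.IGNORECASE,
)

def should_reject(mail_from):
    """True when the line contains an address outside mydomain.com."""
    return REJECT_RE.search(mail_from) is not None

for line in [
    '"Someone" <someone#somedomain.com>',
    '"Me" <me#mydomain.com>',
    "you#mydomain.com",
]:
    print(line, "->", "reject" if should_reject(line) else "accept")
```

Because the lookahead is not anchored at the end of the domain, an address such as x#mydomain.com.evil.org would also be accepted, so this is hardly foolproof on its own.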
If your regex is going to be applied to the MAIL FROM line in the MTA communication then you do not need to concern yourself with the full 'email address' specification. MAIL FROM lines are just the email address enclosed in '<>', so any regex that tests for #mydomain.com> should work.
\b(?:"[ a-zA-Z]+")?\s*<?[a-zA-Z0-9_.]+#(?!mydomain\.com)\w+(?:\.\w{2,})+>?\b
UPDATE: Note that this regex is fa{9,}r from being perfect. Check the official regex for email addresses for more info (Scroll down to the <p/> titled RFC 2822).