VB.Net REGEX to strip email - regex

I need to strip email addresses out of paragraphs of plain text. I have googled and searched this site and found many suggestions - none of which I can get to work. I'm using code like this:
Imports System.Text.RegularExpressions
Dim strEmailPattern As String = "^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$"
Dim senText As String = "blah blah blah blah blah someone@somewhere.com"
Dim newText As String = String.Empty
newText = Regex.Replace(senText, strEmailPattern, String.Empty)
After the call to Regex.Replace, the newText string still contains the complete senText string, including the email. I thought the problem was the regex pattern I was using, but I have tried many, so maybe I am missing something in the code?

This POSIX regex should match all the emails, provided that:
they need not be valid
every email contains at least one @
the sequences of characters around each @ symbol consist of letters, digits, hyphens and dots, and start with a letter
all emails are separated by at least a single space character.
Regex
([[:alpha:]][[:alnum:].-]+@)+[[:alpha:]][[:alnum:].-]+
This might also work
([a-zA-Z][a-zA-Z0-9.-]+@)+[a-zA-Z][a-zA-Z0-9.-]+
A shorter version (as in comment) would be
(\w[\w.-]+@)+\w[\w.-]+
But this will match some more invalid emails.
The pattern I am describing will match most email addresses. If you really want to match all RFC 822 compliant emails, consider using the pattern here. It's a 6425-character regex that matches all standard email addresses. But beware, it will execute slowly!
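As a rough illustration of the shorter pattern above, here is a minimal Python sketch (Python is used only for illustration, and the function name `strip_emails` is hypothetical). It writes the separator as a literal `@` and uses no `^` or `$` anchors, so the pattern can match an address in the middle of a paragraph.

```python
import re

# Shorter pattern from above; no ^ or $ anchors, so it can match anywhere in the text.
EMAIL = re.compile(r"(\w[\w.-]+@)+\w[\w.-]+")

def strip_emails(text: str) -> str:
    """Remove every substring that looks like an email address."""
    return EMAIL.sub("", text)

# repr() makes the trailing space left behind by the removal visible.
print(repr(strip_emails("blah blah blah blah blah someone@somewhere.com")))
```

The whitespace around a removed address is left in place; collapsing it would be a separate step.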

There are various corner cases where your regex would fail.
You should use something as simple as this:
(?<=^|\s)[^@\s]+@[^@\s]+(?=$|\s)

Related

Use regex to strip out emails

I know that this is a notoriously difficult topic. The best regex that I've found after trawling many different answers is the one at http://emailregex.com/
It works great at validating an email address, but I'm struggling to alter this regex to find all email addresses in a string.
I'm using the PHP version of the regex.
How would I go about using this regex to find all of the email addresses in a string?
I know about the preg functions, my PHP code isn't as much the problem as adapting that regex.
$redacted = preg_replace_callback(
    "/$emailRegex/i",
    function ($matches) {
        return '[' . $this->getHashedValue($matches[0]) . ']';
    },
    $input
);
If you already have a working regular expression, you can use PHP's preg_replace to replace all (non-overlapping) matches by a certain string, in our case "" (to remove them).
preg_replace($your_regex, "", $your_string)
This should strip all matches from your string.
Also, as @MonkeyZeus commented, if your regex contains the start anchor (^) or the end anchor ($), make sure to remove those before using preg_replace. Otherwise, the only match you can get will be the entire string, if it matches.
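To illustrate the point about anchors, here is a minimal Python sketch (Python only for illustration; the helper `unanchor`, the sample validation pattern and the sample text are assumptions of this sketch and are not taken from emailregex.com). It drops a leading `^` and a trailing `$` so that a validation pattern can be reused for search-and-replace.

```python
import re

def unanchor(pattern: str) -> str:
    """Drop a leading ^ and a trailing unescaped $ from a regex source string."""
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        pattern = pattern[:-1]
    return pattern

# Hypothetical, deliberately simple validation pattern (anchored at both ends).
VALIDATION = r"^[\w.-]+@[\w-]+(\.[\w-]+)+$"
text = "Contact bob@example.com or eve@example.org today."

# Anchored: the pattern can only match if the whole string is one address.
unchanged = re.sub(VALIDATION, "", text)
# Unanchored: every address inside the string is removed.
stripped = re.sub(unanchor(VALIDATION), "", text)
```

The helper only handles anchors at the very start and end of the pattern; anchors nested inside groups or alternations would need to be removed by hand.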

I want to modify this regex to include apostrophe

This regex is used for validating email addresses; however, it doesn't handle the apostrophe ('), which is a valid character in the first part of an email address.
I have tried myself, and have tried some examples I found, but they don't work.
^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
How do I modify it slightly to support the ' character (apostrophe)?
Per the documentation for an email address, the apostrophe can appear anywhere before the @ symbol, which in your current regex is:
^([\w-\.]+)@
You should be able to add the apostrophe into the brackets of valid characters:
^([\w-\.']+)@
This would make the entire regex:
^([\w-\.']+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
EDIT (regex contained in single-quotes)
If you're using this regex inside a string with single-quotes, such as in PHP with $regex = '^([\w ..., you will need to escape the single-quote in the regex with \':
^([\w-\.\']+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
You need to update the first part as follows:
^([\'\w-\.]+)
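A quick way to check the change is to run both character classes against an address containing an apostrophe. The following Python sketch is only an illustration (the sample addresses are made up). It writes the separator as `@` and moves the hyphen to the end of each class, because Python's `re` module rejects `\w-\.` as a bad character range.

```python
import re

# Everything after the local part, unchanged from the regex above.
TAIL = r"@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$"

WITHOUT_APOSTROPHE = re.compile(r"^([\w.-]+)" + TAIL)
WITH_APOSTROPHE = re.compile(r"^([\w.'-]+)" + TAIL)

for address in ("o'brien@example.com", "obrien@example.com"):
    # Print whether each variant accepts the address.
    print(address,
          bool(WITHOUT_APOSTROPHE.match(address)),
          bool(WITH_APOSTROPHE.match(address)))
```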

Extract pattern from string, with special characters, using Regular Expressions

I am trying to use a regex in VB.NET - the language probably shouldn't matter though - I am trying to extract something reasonable out of a very large file name, "\\path\path\path.path.path\path\some_more_stuff_from a name.item_123_456.html"
I would like to extract, from that whole mess, the "item_123_456"
It seems to make sense that I can get everything before a pattern like ".html" , and from it, everything after the last dot ?
I have tried to get at least the last part (the entire string before .html) and I still get no matches:
Dim matches As MatchCollection
Dim regexStuff As New Regex(".*\\.html")
matches = regexStuff.Matches(strINeed)
Dim successfulMatch As Match
For Each successfulMatch In matches
strFound = successfulMatch.Value
Next
The match I experimented with, hoping I might even get everything between a dot and an .html: Regex("\\..*\\.html") returned Nothing as well.
I just can't get regular expressions to work...
.*\.(.*?)\.html
This matches as many characters as possible (.*) until it reaches a dot, followed by as few characters as possible, followed by .html, that is, \.(.*?)\.html
It places the text between that dot and the .html into a capturing group, which should be in $1. If you need the VB.NET code for that I can likely get that as well, but your code looked okay.
Your vb code should look something like this:
Dim matches As MatchCollection
Dim regexStuff As New Regex(".*\.(.*?)\.html")
matches = regexStuff.Matches(strINeed)
strFound = matches.Item(0).Groups(1).Value.ToString
It could probably be generalized into this
[^.\\]+\.html
Edit: or, initial dot required
\.[^.\\]+\.html
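For comparison, here is a minimal Python sketch of the same two patterns applied to the file name from the question (Python is used only for illustration; the variable names are made up for this sketch):

```python
import re

path = r"\\path\path\path.path.path\path\some_more_stuff_from a name.item_123_456.html"

# Greedy .* runs to the last dot that is still followed by "<something>.html";
# the lazy group then captures the text between that dot and ".html".
match = re.search(r".*\.(.*?)\.html", path)
item = match.group(1) if match else None

# Generalized form: no capture group, so the match includes the ".html" suffix.
with_suffix = re.search(r"[^.\\]+\.html", path)

print(item)
print(with_suffix.group(0) if with_suffix else None)
```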

RegEx : replace all Url-s that are not anchored

I'm trying to replace Urls contained inside a HTML code block the users post into an old web-app with proper anchors (<A>) for those Urls.
The problem is that Urls can be already 'anchored', that is contained in <A> elements. Those Url should not be replaced.
Example:
http://noreplace.com <- do not replace
<u>http://noreplace.com</u> <- do not replace
...http://replace.com <- replace
What would the regex to match only 'not anchored Urls' look like?
I use the following function to replace with RegEx:
Function ReplaceRegExp(strString, strPattern, strReplace)
Dim RE: Set RE = New RegExp
With RE
.Pattern = strPattern
.IgnoreCase = True
.Global = True
ReplaceRegExp = .Replace(strString, strReplace)
End With
End Function
The following non greedy regex is used to format UBB URLs. Can this regex be adapted to match only the ones I need?
' the double doublequote in the brackets is because
' double doublequoting is ASP escaping for doublequotes
strString = ReplaceRegExp(strString, "\[URL=[""]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?[""]?\](.*?)\[/URL\]", "$6")
If this really cannot be done with RegEx, what would be the solution in ASP Classic, with some code or pseudocode please? However I would really like to keep code simple with an additional regex line than add additional functions to this old code.
Thanks for your effort!
Seems like regular expressions are too complex to use for this kind of job so I went to my rusty VBScript skills and coded a function that first removes anchors and then replaces the URLs.
Here it is if somebody may need it:
Function Linkify(Text)
Dim regEx, Match, Matches, patternURLs, patternAnchors, lCount, anchorCount, replacements
patternURLs = "((http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)"
patternAnchors = "<a[^>]*?>.*?</a>"
Set replacements=Server.CreateObject("Scripting.Dictionary")
' Create the regular expression.
Set regEx = New RegExp
regEx.Pattern = patternAnchors
regEx.IgnoreCase = True
regEx.Global = True
' Do the search for anchors.
Set Matches = regEx.Execute(Text)
lCount = 0
' Iterate through the existing anchors and replace with a placeholder
For Each Match in Matches
key = "<#" & lCount & "#>"
replacements.Add key, Match.Value
Text = Replace(Text,Cstr(Match.Value),key)
lCount = lCount+1
Next
anchorCount = lCount
' we now search for URls
regEx.Pattern = patternURLs
' create anchors from URLs
Text = regEx.Replace(Text, "<a href=""$1"">$1</a>")
' put back the originally existing anchors
For lCount = 0 To anchorCount-1
key = "<#" & lCount & "#>"
Text = Replace(Text,key, replacements.Item(key))
Next
Linkify = Text
End Function
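The same protect-then-restore idea can be sketched in Python (for illustration only; the URL pattern here is deliberately simpler than the one above, and the NUL-delimited placeholder format is an assumption of this sketch, which also assumes the input contains no NUL characters):

```python
import re

ANCHOR = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# Simplified URL pattern; \x00 is excluded so a URL never swallows a placeholder.
URL = re.compile(r'(?:https?|ftp)://[^\s<>"\x00]+', re.IGNORECASE)
PLACEHOLDER = re.compile(r"\x00(\d+)\x00")

def linkify(text: str) -> str:
    saved = []

    def stash(match):
        # Swap each existing anchor for an indexed placeholder.
        saved.append(match.group(0))
        return f"\x00{len(saved) - 1}\x00"

    protected = ANCHOR.sub(stash, text)
    # Wrap every remaining bare URL in an anchor.
    linked = URL.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', protected)
    # Put the original anchors back.
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], linked)
```

Restoring through a single regex pass avoids looping over the placeholders one by one, at the cost of assuming the placeholder byte never occurs in user input.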
The answer you're looking for lies in negative and positive lookaheads and lookbehinds.
This article gives a pretty good overview: http://www.regular-expressions.info/lookaround.html
Here's the Regular Expression I've formulated for your case:
(?<!"|>)(ht|f)tps?://.*?(?=\s|$)
Here's some sample data I matched against:
#Matches
http://www.website.com
https://www.website.com
This is a link http://www.website.com that is not linked
This is a long link http://www.website.com/index.htm?foo=bar
ftp://www.website.com
#No Matches
<u>http://www.website.com</u>
http://website.com
http://website.com
<u>http://www.website.com</u>
ftp://www.website.com
Here's a breakdown of what the regular expression is doing:
(?<!"|>)
A negative look behind, making sure what matches next isn't preceded by a " or >
(ht|f)tps?://.*?
This looks for http, https, or ftp and anything following it. It'll also match ftps! If you want to avoid this, you could use (https?|ftp)://.*? instead
(?=\s|$)
This is a positive look ahead, which matches a space or end of line.
EXTRA CREDIT
(ht)?(?(1)tps?|ftp)://
This will match http/https/ftp but not ftps, this may be a bit overkill when you can use (https?|ftp):// but it's an awesome example of if/else in regex.
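Both ideas can be tried out in Python, whose `re` module supports lookbehind, lookahead and the `(?(1)...|...)` conditional. The sketch below is only an illustration: it uses the `(https?|ftp)` variant, writes the lookbehind as the equivalent character class `[">]`, and the function name `find_unlinked` is made up.

```python
import re

# Lookbehind rejects URLs preceded by " or >; lookahead stops at whitespace or end.
UNLINKED_URL = re.compile(r'(?<![">])(?:https?|ftp)://.*?(?=\s|$)')

def find_unlinked(text: str) -> list:
    """Return every URL that is not directly preceded by a quote or a tag end."""
    return UNLINKED_URL.findall(text)

# The conditional from the extra-credit part: if group 1 ("ht") matched,
# continue with "tp" or "tps", otherwise require "ftp".
SCHEME = re.compile(r"(ht)?(?(1)tps?|ftp)://")

print(find_unlinked("This is a link http://www.website.com that is not linked"))
print(find_unlinked("<u>http://www.website.com</u>"))
```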
Some design issues you're going to have to work around:
Embedded URLs could be absolute or relative and may not include the protocol.
Your HTML may not have quotes around attribute values.
The character right after a URL may also be a valid URL character.
There are lots of valid URL characters these days.
If you can assume (1) absolute URLs with protocols and (2) quoted HTML attributes and (3) people will have whitespace after a URL and (4) you're sticking with supporting only basic URL characters, you can just look for URLs not preceded by a double-quote.
Here's an overly-simple example to start with (untested):
(?<!")((http|https|ftp)://[^\s<>]+)(?=\s|$) replaced with <a href="$1">$1</a>
The [^\s<>] part above is ridiculously greedy, so all of the fun will be in tweaking that to build a character set that fits the URLs your users are typing in. Your example shows a much more involved character class with \w plus a hodge-podge of other allowed characters, so you could start there if you want.

Pattern matching email address using regular expressions [duplicate]

Filter email address with regular expressions: I am new to regular expressions and was hoping someone might be able to help out.
I am trying to pattern match an email address string with the following format:
FirstName.LastName@gmail.com
I want to be sure that there is a period somewhere before the '@' character and that the characters after the '@' character match gmail.com
You want some symbols before and after the dot, so I would suggest .+\..+@gmail\.com.
.+ means any symbol (.) can appear 1 or more times (+)
\. means the dot symbol; escaped with a backslash to suppress the special meaning of .
@gmail and com should be matched exactly.
See also Regular Expression Basic Syntax Reference
EDIT: Gmail's rules for account names only allow Latin letters, digits, and dots, so a better regex is
[a-zA-Z0-9]+\.[a-zA-Z0-9]+@gmail\.com
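A minimal Python sketch of how this last pattern could be applied (the function name `is_first_last_gmail` and the sample addresses are hypothetical). It relies on `fullmatch`, which requires the whole string to fit the pattern, so no `^` or `$` anchors are needed:

```python
import re

GMAIL = re.compile(r"[a-zA-Z0-9]+\.[a-zA-Z0-9]+@gmail\.com")

def is_first_last_gmail(address: str) -> bool:
    # fullmatch rejects anything before or after the expected shape.
    return GMAIL.fullmatch(address) is not None

for candidate in ("FirstName.LastName@gmail.com",
                  "firstlast@gmail.com",
                  "first.last@yahoo.com"):
    print(candidate, is_first_last_gmail(candidate))
```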
Check for a valid email:
^(?:(?!.*?[.]{2})[a-zA-Z0-9](?:[a-zA-Z0-9.+!%-]{1,64}|)|\"[a-zA-Z0-9.+!% -]{1,64}\")@[a-zA-Z0-9][a-zA-Z0-9.-]+(\.[a-z]{2,}|\.[0-9]{1,})$
enforced rules:
must start with an alphanumeric character
can only contain alphanumeric characters and @._-%
cannot have 2 consecutive dots, except in a quoted string
characters before the @ can only be alphanumeric or ._-%, except in a quoted string
must have @ in the middle
needs at least 1 dot in the domain part
cannot have a double - in the domain part
can only have alphanumeric characters and .- in the domain part
needs to end with a valid extension of 2 or more letters
supports IP addresses (test@1.1.1.1)
supports quoted user names
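Some of these rules can be spot-checked with a short Python sketch. This is only an illustration: the sample addresses are made up, the separator is written as `@`, and the dots in front of the extension are escaped so that a dot in the domain is actually required.

```python
import re

VALID_EMAIL = re.compile(
    r'^(?:(?!.*?[.]{2})[a-zA-Z0-9](?:[a-zA-Z0-9.+!%-]{1,64}|)'
    r'|"[a-zA-Z0-9.+!% -]{1,64}")'
    r'@[a-zA-Z0-9][a-zA-Z0-9.-]+(\.[a-z]{2,}|\.[0-9]{1,})$'
)

def is_valid(address: str) -> bool:
    return VALID_EMAIL.match(address) is not None

samples = [
    "john.doe@example.com",    # plain address
    "a..b@example.com",        # two consecutive dots
    ".john@example.com",       # does not start with an alphanumeric character
    "test@1.1.1.1",            # IP-style domain
    '"john doe"@example.com',  # quoted user name
    "user@localhost",          # no dot in the domain part
]
for sample in samples:
    print(sample, is_valid(sample))
```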
You don't even need regex since your requirements are pretty specific. Not sure what language you're using, but most would support doing a split on @ and checking for a dot. In Python:
name, _, domain = email.partition('@')
if '.' in name and domain == 'gmail.com':
    pass  # valid
You haven't told us which regex flavor you need; however, this example will fit most of them:
.*\..*@gmail.com
Assuming Unix style where . is any character: .*\..*@gmail\.com
Edit: escaped the . in the domain
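The edit matters because an unescaped `.` matches any character, not just a literal dot. A small Python sketch of the difference (illustration only; the look-alike address is made up):

```python
import re

UNESCAPED = re.compile(r".*\..*@gmail.com")   # "." before "com" matches any character
ESCAPED = re.compile(r".*\..*@gmail\.com")    # "\." matches only a literal dot

lookalike = "first.last@gmailxcom"
genuine = "first.last@gmail.com"

print(bool(UNESCAPED.fullmatch(lookalike)), bool(ESCAPED.fullmatch(lookalike)))
print(bool(UNESCAPED.fullmatch(genuine)), bool(ESCAPED.fullmatch(genuine)))
```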
I used the following regular expression to validate the email address. I have also added a small C# code snippet that uses it.
Regex - "^[a-zA-Z0-9]{3,20}@gmail\.com$"
Code :-
static void Main(string[] args)
{
    Console.WriteLine("Enter Your Text ");
    string input = Console.ReadLine();
    Console.WriteLine("The Text that you have entered is :" + input);
    Console.ReadLine();
    // Verbatim string, so \. reaches the regex engine as an escaped dot
    string pattern = @"^[a-zA-Z0-9]{3,20}@gmail\.com$";
    Regex regex = new Regex(pattern);
    bool results = regex.IsMatch(input);
    Console.WriteLine(results.ToString());
    Console.ReadLine();
}
Emails such as ∂øøµ$∂å¥!@gmail.com are also checked and show as false here. ;-)