Regex to detect count of email addresses in email header? [closed] - regex

I have a regex that detects any email address. I am trying to create one that looks only at the header of an email message, counts the email addresses there, and ignores addresses from a specific domain (abc.com).
For example, it should count ten email addresses such as 1@test.com while ignoring an 11th address such as 2@abc.com.
Current regex:
^[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}$

Consider the following PowerShell example of a universal regex.
To find all email addresses:
<(.*?)> is handy if your server surrounds the email addresses with angle brackets.
(?<!Content-Type(.|\n){0,10000000})([a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*) works if the header does not have brackets around every email address. Note that this regex was copied from a community wiki answer on Stack Overflow (question 201323) and modified here to skip @abc.com. There are probably edge cases it will not handle. The same page has a much more complex regex that looks like it would match every email address, but I don't have the time to modify that one to skip @abc.com.
Example
$Matches = @()
$String = 'Return-Path: <example_from@abc123.com>
X-SpamCatcher-Score: 1 [X]
Received: from [136.167.40.119] (HELO abc.com)
by fe3.abc.com (CommuniGate Pro SMTP 4.1.8)
with ESMTP-TLS id 61258719 for example_to@mail.abc.com;
Message-ID: <4129F3CA.2020509@abc.com>
Date: Wed, 21 Jan 2009 12:52:00 -0500 (EST)
From: Taylor Evans <Remember@To.Vote>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.0.1)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Jon Smith <example_to@mail.abc.com>
Subject: Business Development Meeting
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Type: multipart/alternative;
boundary="------------060102080402030702040100"
This is a multi-part message in MIME format.
--------------060102080402030702040100
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Hello,
this is an HTML mail, it has *bold*, /italic /and _underlined_ text.
And then we have a table here:
Cell(1,1)
Cell(2,1)
Cell(1,2) Cell(2,2)
And we put a picture here:
Image Alt Text
That''s it.
--------------060102080402030702040100
Content-Type: multipart/related;
boundary="------------030904080004010009060206"
--------------030904080004010009060206
Content-Type: text/html; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-15">
</head>
<body bgcolor="#ffffff" text="#000000">
Hello,<br>
<br>
this is an HTML mail, it has <b>bold</b>, <i>italic </i>and <u>underlined</u>
text.<br>
And then we have a table here:<br>
<table border="1" cellpadding="2" cellspacing="2" height="62"
width="401">
<tbody>
<tr>
<td valign="top">Cell(1,1)<br>
</td>
<td valign="top">Cell(2,1)</td>
</tr>
<tr>
<td valign="top">Cell(1,2)</td>
<td valign="top">Cell(2,2)</td>
</tr>
</tbody>
</table>
<br>
And we put a picture here:<br>
<br>
<img alt="Image Alt Text"
src="cid:part1.FFFFFFFF.5555555@example.com" height="79"
width="98"><br>
<br>
That''s it. email me at test@email.com<br>
Subject: <br>
</body>
</html>'
# Write-Host start with
# write-host $String
Write-Host
Write-Host found
[array]$Found = ([regex]'(?<!Content-Type(.|\n){0,10000000})([a-zA-Z0-9.!#$%&''*+-/=?\^_`{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)').matches($String)
$Found | foreach {
write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'"
} # next match
Write-Host "found $($Found.count) matching addresses"
Yields
found
key at 14 = 'example_from@abc123.com'
key at 200 = 'example_to@mail.abc.com'
key at 331 = 'Remember@To.Vote'
key at 485 = 'example_to@mail.abc.com'
found 4 matching addresses
Summary
(?<!Content-Type(.|\n){0,10000000}) rejects a match if Content-Type appears anywhere in the 10,000,000 characters before the email address. This has the effect of skipping email addresses that are in the body of the message. Because the requester is using Java, and Java doesn't support the use of * inside a lookbehind, I'm using {0,10000000} instead (see also "Regex look behind without obvious maximum length in Java"). Be aware that this may introduce edge cases that are not captured as expected.
<(.*?@(?!abc.com).*?)> is the bracketed form with the same abc.com exclusion.
( start the capture group
[a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+ match one or more allowed characters. The doubled single quote escapes the single quote character for PowerShell, and the doubled backtick escapes the backtick for Stack Overflow.
@ include the first at sign
(?!abc.com) reject the match if the domain starts with abc.com
[a-zA-Z0-9-]+ match the first domain label, up to the first dot or the end of the string
(?:\.[a-zA-Z0-9-]+)*) match any further labels, each preceded by a dot, and close the capture group
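For comparison, here is a minimal Python sketch of the same idea. Python's re module does not allow variable-length lookbehind, so this version swaps in a different technique: it cuts the message at the first Content-Type and only searches the text before it. The function name, the sample message, and the escaped dot in the lookahead are assumptions made for illustration, not part of the answer above.

```python
import re

# Simplified address pattern modelled on the one discussed above (an assumption,
# not a full RFC 5322 matcher).
EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"   # local part
    r"@(?!abc\.com)"                       # skip domains that start with abc.com
    r"[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*"   # domain labels separated by dots
)

def count_header_addresses(message: str) -> int:
    """Count addresses that appear before the first Content-Type line."""
    # Stand-in for the lookbehind: everything from the first "Content-Type" on is ignored.
    header = message.split("Content-Type", 1)[0]
    return len(EMAIL.findall(header))

sample = (
    "Return-Path: <example_from@abc123.com>\n"
    "Message-ID: <4129F3CA.2020509@abc.com>\n"
    "From: Taylor Evans <Remember@To.Vote>\n"
    "To: Jon Smith <example_to@mail.abc.com>\n"
    "Content-Type: text/plain\n"
    "email me at test@email.com\n"
)
print(count_header_addresses(sample))  # 3
```

As in the PowerShell output, example_to@mail.abc.com is still counted, because the lookahead only inspects the text immediately after the at sign.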

Related

RegEx: Grabbing values with or without quotation marks

My Issue:
I am trying to grab Facebook meta values from different sites, but some websites (usatoday.com) do not have properly quoted HTML, as you can see in Data Samples 1 and 2. How can I modify my regex to get the values of the property and content attributes?
What I've done:
With the if statement below I have partly worked around the quotation mark issue (it is not dynamic enough), but I guess there must be a better way (I am really bad at regex).
Secondly, my regex is not able to capture the content value (the URL) in Data Sample 2 for usatoday.com. I guess the "" in the URL messes up my regex.
Really need some help here, big thanks!
if(
preg_match( '/<meta(.*?)property="og:title"(.*?)content="(.+?)"(.*?)(\/)?>/', $raw_html, $matching )
// for normal sites
or
preg_match( '/<meta(.*?)property=og:title(.*?)content="(.+?)"(.*?)(\/)?>/', $raw_html, $matching )
// property no quote at all
or
preg_match( '/<meta(.*?)property=og:title(.*?)content=(.+?)(.*?)(\/)?>/', $raw_html, $matching )
// no quote at all
)
Data Sample 1 - no quotation mark on meta text attribute
# usatoday.com
<meta property=og:title content="Lakers trading Russell Westbrook in massive three-team deal with Jazz and Timberwolves"/>
# normal sites
<meta property="og:title" content="Lakers trading Russell Westbrook in massive three-team deal with Jazz and Timberwolves"/>
Data Sample 2 - no quotation mark on meta URL attribute
# usatoday.com
<meta property=og:url content=https://www.usatoday.com/story/sports/nba/2023/02/08/lakers-jazz-timberwolves-trade-russell-westbrook-mike-conley-dangelo-russell/11214855002/ />
# normal sites
<meta property="og:url" content="https://www.usatoday.com/story/sports/nba/2023/02/08/lakers-jazz-timberwolves-trade-russell-westbrook-mike-conley-dangelo-russell/11214855002/" />
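One possible direction, sketched in Python rather than PHP: make the quotes around the property value optional with a backreference, and accept the content value as double-quoted, single-quoted, or bare. The pattern and the og_tags helper are hypothetical, they assume property comes before content as in the samples, and an HTML parser would be more robust than any regex here.

```python
import re

# Hypothetical pattern: property value with optional matching quotes,
# content value quoted either way or unquoted (runs to whitespace or '>').
META = re.compile(
    r"""<meta\b[^>]*?\bproperty\s*=\s*(["']?)(og:[\w:]+)\1"""
    r"""[^>]*?\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)

def og_tags(html: str) -> dict:
    """Return a dict mapping og:* property names to their content values."""
    tags = {}
    for m in META.finditer(html):
        # Exactly one of the three content alternatives is set for each match.
        tags[m.group(2)] = next(g for g in m.group(3, 4, 5) if g is not None)
    return tags

html = """<meta property=og:title content="Lakers trading"/>
<meta property=og:url content=https://www.usatoday.com/story/ />"""
print(og_tags(html))
```

An unquoted value written with no space before the closing "/>" would keep the trailing slash, which is one reason to treat this as a sketch only.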

Building an AMP for Email with chilkat .net

I would like to build an email with Chilkat in .NET that has three body parts: an HTML one, a plain-text one, and an AMP for Email one (with Content-Type "text/x-amp-html").
Neither the current version of Chilkat (9.5.0.78) nor the one I am using (9.5.0.68) supports AMP for Email, so it is not possible to build such an email with the methods provided. As a workaround I am editing the MIME returned by GetMime(), which already has a plain body and an HTML body, and pasting the AMP part in there.
Will Chilkat support AMP for Email?
EDIT:
With some more experiments I managed to create a message with three bodies, although it's kind of ridiculous:
var email = new Email();
email.Body = PlainContent;
email.AddHtmlAlternativeBody(HtmlContent);
email.RemoveHtmlAlternative();
email.SetTextBody(AmpHtmlContent, "text/x-amp-html");
email.AddHtmlAlternativeBody(HtmlContent);
The result is something like this:
[...]
X-Message-Type: test
--------------090501080304020500060805
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
some text some text some text some text some text some text some text some =
text some text some text some text=20
--------------090501080304020500060805
Content-Type: text/x-amp-html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
<!doctype html>
<html amp4email>
<head>
<meta charset=3D"utf-8">
<script async src=3D"https://cdn.ampproject.org/v0.js"></script>
<style amp4email-boilerplate>body{visibility:hidden}</style>
</head>
<body>
Hello, AMP world.
<amp-img src=3D"https://images-na.ssl-images-amazon.com/images/I/41zetwwV=
h3L.jpg" alt=3D"Welcome" width=3D"382" height=3D"500">
</amp-img>
</body>
</html>
--------------090501080304020500060805
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
<html><head><META http-equiv=3D"Content-Type" content=3D"text/html;charset=
=3Dutf-8"></head><body><h1>test fallback to html</h1> <h1>test fallback t=
o html</h1> <h1>test fallback to html</h1> </body></html>
--------------090501080304020500060805--
Here is an example that works for me:
https://www.example-code.com/csharp/amp_for_email.asp
It calls AddPlainTextAlternativeBody instead of setting the email.Body property.
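To illustrate the target MIME layout independently of Chilkat, here is a small sketch using Python's standard email package. It is not the Chilkat API; the build_amp_email helper and the sample bodies are made up for the example. It builds a multipart/alternative message whose parts follow the same order as the output shown in the question: text/plain, then text/x-amp-html, then text/html.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_amp_email(plain: str, amp_html: str, html: str) -> MIMEMultipart:
    """Assemble a multipart/alternative message with plain, AMP and HTML parts."""
    msg = MIMEMultipart("alternative")
    msg.attach(MIMEText(plain, "plain", "utf-8"))
    # The subtype argument yields Content-Type: text/x-amp-html
    msg.attach(MIMEText(amp_html, "x-amp-html", "utf-8"))
    msg.attach(MIMEText(html, "html", "utf-8"))
    return msg

msg = build_amp_email(
    "some text",
    "<!doctype html><html amp4email><body>Hello, AMP world.</body></html>",
    "<h1>test fallback to html</h1>",
)
print([part.get_content_type() for part in msg.get_payload()])
# ['text/plain', 'text/x-amp-html', 'text/html']
```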

RegEx for removing all spam links in a <div> The only identifier is overflow:hidden

I have just discovered around a thousand posts on our site with hidden links. They are all contained in divs with styles like this:
<div style='width:10px;height:13px;overflow:hidden'>
<div style='overflow:hidden;width:7px;height:13px'>
The width and height are all different; the only identifier is overflow:hidden.
Here is one example:
<div style='width:10px;height:13px;overflow:hidden'>
<p>BRANDO CHANGED WILL IN LAST DAYS.(News)</p>
<p>The Mirror (London, England) July 8, 2004 Byline: IAN MARKHAM-SMITH HOLLYWOOD legend Marlon Brando changed his will days before his death, it emerged last night.</p>
<p>Movie mogul Mike Medavoy revealed that before the eccentric 80-year-old succumbed to illness on Friday, he summoned lawyers and some friends to make significant changes to his estate. lastnightmovienow.net last night movie</p>
</div>
How do I create a RegEx that finds every div whose style contains overflow:hidden, then any characters up until the closing div?
I tried this, but it didn't work:
<div style='.*overflow:hidden'>(.*)</div>
I think it's due to not escaping the normal HTML.
I'm a RegEx noob.
Thanks
Ollie
Thanks mate, very detailed response :)
As you say it's sketchy, worked on some posts and not others.
We solved this by adding this to the functions.php file to strip all the problematic divs out server side.
RegEx was the incorrect approach.
function my_the_content_filter( $content ) {
$content = preg_replace("#<div[^>]*overflow:hidden[^>]*>.*?</div>#is", "", $content);
return $content;
}
add_filter( 'the_content', 'my_the_content_filter');
?>
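The same substitution can be tried outside WordPress. Below is a minimal Python sketch that reuses the pattern from the filter above; the strip_hidden_divs name and the sample post are assumptions for illustration. The flags mirror the i and s modifiers of the PHP pattern.

```python
import re

# Same pattern as the preg_replace call: case-insensitive, and "." also matches newlines.
HIDDEN_DIV = re.compile(
    r"<div[^>]*overflow:hidden[^>]*>.*?</div>",
    re.IGNORECASE | re.DOTALL,
)

def strip_hidden_divs(content: str) -> str:
    """Remove every div whose opening tag mentions overflow:hidden."""
    return HIDDEN_DIV.sub("", content)

post = (
    "<p>Real text</p>"
    "<div style='width:10px;height:13px;overflow:hidden'>\n<p>spam</p>\n</div>"
    "<p>More</p>"
)
print(strip_hidden_divs(post))  # <p>Real text</p><p>More</p>
```

Because the lazy .*? stops at the first closing div, a hidden block that contains a nested div would only be partly removed, which may explain why the approach worked on some posts and not others.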

Regex in google link params

I do not have experience with regex code.
I want to extract from the following text
http://news.google.com/news/url?sa=t&fd=R&ct2=it&usg=AFQjCNG4x7juUilTtEDL5ae1ecsNh7E-yQ&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778905305151&ei=2_utVbj7MsHS1QaH3YHQBA&url=http://time.com/3964691/yoga-dogs-and-cats/ tag:news.google.com,2005:cluster=http://time.com/3964691/yoga-dogs-and-cats/ Mon, 20 Jul 2015 17:44:50 GMT <table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><img src="//t0.gstatic.com/images?q=tbn:ANd9GcSPm8SUGKyWdqCih-LdFBEVfcJI2B86tVNolZJLoeWesaK1Jss7lbJsPKhaqLe8Pap7kYdL2Xw" alt="" border="1" width="80" height="80"><br><font size="-2">TIME</font></font></td><td valign="top" class="j"><font style="font-size:85%;font-family:arial,sans-serif"><br><div style="padding-top:0.8em;"><img alt="" height="1" width="1"></div><div class="lh"><b>Watch <b>cats</b> and dogs interrupt yoga routines - Time</b><br><font size="-1"><b><font color="#6f6f6f">TIME</font></b></font><br><font size="-1">The compilation above shows many a yoga routine getting interrupted. And it really never gets old watching a dog rush to the aid of his owner trapped in a headstand or for a a pet to think pigeon pose is an invitation for kisses. There's also the <b>cat</b> <b>...</b></font><br><font size="-1"><b>Cats</b> And Dogs Interrupting Yoga - Huffington Post UK<font size="-1" color="#6f6f6f"><nobr>Huffington Post UK</nobr></font></font><br><font size="-1" class="p"></font><br><font class="p" size="-1"><a class="p" href="http://news.google.com/news/story?ncl=dtJjhOioeLRtSJMzD7u9ebMAVfF0M&ned=it&hl=en"><nobr><b>tutte le notizie (3) »</b></nobr></a></font></div></font></td></tr></table>
the following string, which is present in the text above:
http://time.com/3964691/yoga-dogs-and-cats/
You can get this text using
(?<=url=)http[^\s"]+
See demo
Note that your (?<=url=).+?(?= ) regex matches more than the URL you need to extract.
Try this:
(?<=url=).+?(?= )
Play around with it here: https://regex101.com/r/pO4cT3/1
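As a quick check of the first pattern, here is a small Python sketch. The text variable is a shortened copy of the input from the question (an assumption made to keep the example short); the lookbehind is fixed-width, so Python's re module accepts it.

```python
import re

# Shortened version of the feed text from the question.
text = (
    "http://news.google.com/news/url?sa=t&fd=R&ct2=it"
    "&url=http://time.com/3964691/yoga-dogs-and-cats/ "
    "tag:news.google.com,2005:cluster=http://time.com/3964691/yoga-dogs-and-cats/ "
    "Mon, 20 Jul 2015 17:44:50 GMT"
)

# Match "http..." only when it directly follows "url=", stopping at whitespace or a quote.
m = re.search(r'(?<=url=)http[^\s"]+', text)
print(m.group(0))  # http://time.com/3964691/yoga-dogs-and-cats/
```

The earlier "url?sa=" in the link is not a match, because the lookbehind requires the literal "url=" immediately before the captured text.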

How to replace text values that contain a #

In my dataset I have a variable with values which contain html-code, e.g.:
<font color="#800080">None of these</font>.
I wanted to replace that with Other by:
df$Country <- gsub("<font color="#800080">None of these</font>", "Other", df$Country)
However that doesn't work, which is probably caused by the #-character. How can I solve this?
Part of the data:
structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("Spain", "<font color=\"#800080\">None of these</font>"), class = "factor")
All these problems with regex on html are reasons not to use it. Assuming your data started out as an actual html document, use XPath instead. Here's an example:
html.text <- '<html>
<head></head>
<body>
<div><font color="#800080">None of these</font></div>
</body>
</html>'
library(XML)
html <- htmlTreeParse(html.text,useInternalNodes=TRUE)
replaceNodes(html['//font[@color="#800080"]'][[1]],"Other")
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html>
# <head></head>
# <body>
# <div>Other</div>
# </body>
# </html>
There are two options to look at. Both assume we are starting with something that looks like this.
x <- '<font color="#800080">None of these</font>'
Option 1: Using a different quote. When you used double quotes to identify your "pattern" argument, it ends at the next double quote it encounters, which comes just before your #. Hence, you can try to enclose the pattern with single quotes instead.
gsub('<font color="#800080">None of these</font>', "other", x)
Option 2: Escaping the quote character. This is as simple as putting a \ before the quote to indicate that it should be escaped.
gsub("<font color=\"#800080\">None of these</font>", "other", x)