Coding a Gmail style "hide quoted text" for web based mailing list archive - regex

I'm working on a web application that parses and displays email messages in a threaded format (among other things). Emails may come from any number of different mail clients, and in either text or HTML format.
Given that most people have a tendency to top post, I'd like to be able to hide the duplicated message in an email reply in a manner similar to how Gmail does it (e.g. "show quoted text").
Determining which part of the message is the reply is somewhat challenging. Personally, I use "> " delimiters at the beginning of the quoted text when replying. I created a regexp that looks for these lines and wraps a div around them to allow some JS to hide or show this block of text.
I then noticed that Outlook doesn't use the "> " characters by default, it simply adds a header block above the reply with the summary of the headers (From, Subject, Date, etc.). The reply is untouched. I can match on this and hide the rest of the email, working with the assumption that it's a top quote.
I then looked at Thunderbird, and it uses "> " for text, and <blockquote> for HTML mails. I still haven't looked at what Apple Mail does, what Notes does, or what any of the other millions of mail clients out there do.
Will I be writing a special case regexp for every single client out there? or is there something I'm missing?
Any suggestions, sample code or pointers to third party libraries much appreciated!

It'll be pretty hard to duplicate the way gmail does it since it doesn't care about whether it was a quoted piece or not, like Zac says, it just seems to care about the diff.
Its actually pretty hard to get this right 100% of the time. Plain text email is "lossy", its entirely possible for you to send
> Here is my long line that is over 74 chars (email line length limit)
Which can get encoded as something like
> Here is my long line that is over 74 chars (email=
line length limit)
And then is decoded as
> Here is my long line that is over 74 chars (email
line length limit)
Making it indistinguishable from an inline-reply.
This is email, so variations are abound. Email usually line-wraps at something like 74 characters, and encoding schemes can differ. Its a real PITA. If you can access the HTML version, you will probably have better luck looking for quote tags and the like. Another idea would be to parse both the plain text and html version to try and determine the boundries.
Additionally, its best to just plan for specific client hacks. They all construct mime messages differently, both in structure and header content.
Edit: I say this with the experience of writing an email processing system as well as seeing several people try to do the -exact- thing you're doing. It always only got "ok" results.

From what I can tell, gmail does not bother about prefixed lines or section headings, except to ignore them. If the text lines appeared earlier in the thread, and then reappear, it is considered to be quoted. Thus, e.g., if you send multiple messages and don't change your signature, the signature is considered to be quoted. If you've already dealt with the '>' prefix, a simple diff should do most of the rest. No need to get fancy.

First thing I think I'd do is strip out all the white space, or reduce white space to 1 between each word, and special characters from both blocks, then look for the old one in the new one.

Here's a mozdev project that may be helpful for others who stumble across this page looking for a Thunderbird solution:
http://quotecollapse.mozdev.org/

Related

Stripping superscript from plaintext

I often grab quotes from articles that include citations that include superscripted footnotes, which when copied are a pain in the ass. They show up as actual letters in the text as they are pasted in plaintext and not in html.
Is there a way I could run this through a regex to take out these superscripts?
For example
In the abeginning bGod ccreated the dheaven and the eearth.
Should become
In the beginning God created the heaven and the earth.
I can't think of a way to have regex search for misspellings and a corresponding sequential set of numbers and letters.
Any thoughts? I'm also using Sublime Text 3 for the majority of my writing, but I wouldn't mind outsourcing this to an AppleScript, or text replacement app (aText, textExpander, etc.).
Matching Code vs. Matching a Screen
It's hard to tell without seeing an example, but this should be doable if you copy the text from code view, as opposed to the regular browser view. (Ctrl or Cmd-J is your friend). Since writing the rules will take time, this will only be worthwhile for large chunks of text.
In code view, your superscript will be marked up in a way that can be targetted by regex. For instance:
and therefore bananas make you smartera
in the browser view (where the a at the end is a citation note) may look like this in code view:
and therefore bananas make you smarter<span class="mycitations">a</span>
In your editor, using regex, you can process the text to remove all tags, or just certain tags. The rules may not always be easy to write, and of course there are many disclaimers about using regex to parse html.
However, if your source is always the same (Wikipedia for instance), then you can create and save rules that should work across many pages.

How to create Gmail filter searching for text only at start of subject line?

We receive regular automated build messages from Jenkins build servers at work.
It'd be nice to ferret these away into a label, skipping the inbox.
Using a filter is of course the right choice.
The desired identifier is the string [RELEASE] at the beginning of a subject line.
Attempting to specify any of the following regexes causes emails with the string release in any case anywhere in the subject line to be matched:
\[RELEASE\]*
^\[RELEASE\]
^\[RELEASE\]*
^\[RELEASE\].*
From what I've read subsequently, Gmail doesn't have standard regex support, and from experimentation it seems, as with google search, special characters are simply ignored.
I'm therefore looking for a search parameter which can be used, maybe something like atstart:mystring in keeping with their has:, in: notations.
Is there a way to force the match only if it occurs at the start of the line, and only in the case where square brackets are included?
Sincere thanks.
Regex is not on the list of search features, and it was on (more or less, as Better message search functionality (i.e. Wildcard and partial word search)) the list of pre-canned feature requests, so the answer is "you cannot do this via the Gmail web UI" :-(
There are no current Labs features which offer this. SIEVE filters would be another way to do this, that too was not supported, there seems to no longer be any definitive statement on SIEVE support in the Gmail help.
Updated for link rot The pre-canned list of feature requests was, er canned, the original is on archive.org dated 2012, now you just get redirected to a dumbed down page telling you how to give feedback. Lack of SIEVE support was covered in answer 78761 Does Gmail support all IMAP features?, since some time in 2015 that answer silently redirects to the answer about IMAP client configuration, archive.org has a copy dated 2014.
With the current search facility brackets of any form () {} [] are used for grouping, they have no observable effect if there's just one term within. Using (aaa|bbb) and [aaa|bbb] are equivalent and will both find words aaa or bbb. Most other punctuation characters, including \, are treated as a space or a word-separator, + - : and " do have special meaning though, see the help.
As of 2016, only the form "{term1 term2}" is documented for this, and is equivalent to the search "term1 OR term2".
You can do regex searches on your mailbox (within limits) programmatically via Google docs: http://www.labnol.org/internet/advanced-gmail-search/21623/ has source showing how it can be done (copy the document, then Tools > Script Editor to get the complete source).
You could also do this via IMAP as described here:
Python IMAP search for partial subject
and script something to move messages to different folder. The IMAP SEARCH verb only supports substrings, not regex (Gmail search is further limited to complete words, not substrings), further processing of the matches to apply a regex would be needed.
For completeness, one last workaround is: Gmail supports plus addressing, if you can change the destination address to youraddress+jenkinsrelease#gmail.com it will still be sent to your mailbox where you can filter by recipient address. Make sure to filter using the full email address to:youraddress+jenkinsrelease#gmail.com. This is of course more or less the same thing as setting up a dedicated Gmail address for this purpose :-)
Using Google Apps Script, you can use this function to filter email threads by a given regex:
function processInboxEmailSubjects() {
var threads = GmailApp.getInboxThreads();
for (var i = 0; i < threads.length; i++) {
var subject = threads[i].getFirstMessageSubject();
const regex = /^\[RELEASE\]/; //change this to whatever regex you want, this one should cover OP's scenario
let isAtLeast40 = regex.test(subject)
if (isAtLeast40) {
Logger.log(subject);
// Now do what you want to do with the email thread. For example, skip inbox and add an already existing label, like so:
threads[i].moveToArchive().addLabel("customLabel")
}
}
}
As far as I know, unfortunately there isn't a way to trigger this with every new incoming email, so you have to create a time trigger like so (feel free to change it to whatever interval you think best):
function createTrigger(){ //you only need to run this once, then the trigger executes the function every hour in perpetuity
ScriptApp.newTrigger('processInboxEmailSubjects').timeBased().everyHours(1).create();
}
The only option I have found to do this is find some exact wording and put that under the "Has the words" option. Its not the best option, but it works.
I was wondering how to do this myself; it seems Gmail has since silently implemented this feature. I created the following filter:
Matches: subject:([test])
Do this: Skip Inbox
And then I sent a message with the subject
[test] foo
And the message was archived! So it seems all that is necessary is to create a filter for the subject prefix you wish to handle.

ColdFusion -- Do I need URLDecode with form POSTs? / URLDecode randomly removes one character

I'm using a WYSIWYG to allow users to format text. This is the error-causing text:
<p><span style="line-height: 115%">This text starts with a 'T'</span></p>
The error is that the 'T' in "This", or whatever the first letter happens to be, is randomly removed when using URLDecode and saving to the DB. Removing URLDecode on the server side seems to fix it without any negative side-effects (the DB contains the same information).
The documentation says that
Query strings in HTTP are always URL-encoded.
Is this really the case? If so, why doesn't removing URLDecode seem to mess everything up?
So two questions:
Why is URLDecode causing the first text character to be removed like this (it seems to only happen when the line-height property is present)?
Do I really need (or would I even want) to use URLDecode before putting POSTed data into the database?
Edit: I made a test page to echo back the decoded text, and URLDecode is definitely removing that character, but I have no idea why.
I believe decoding is done automatically when form scope is populated. That's why characters after % (this char is used for encoding) are removed -- you are trying to decode the string second time.
For security reasons you might be interested in stripping script tags, or even cleaning up HTML using white-list. Try to search in CFLib.org for applicable functions.

Controlling Word Wrap in a container

I have a peculiar problem. I have an email group that pipes emails to a message board. The word wrap of the emails varies. In yahoo, the messages tend to fill the entire container on the message board. But in all other mail clients, only part of the container width is filled, because the original mail was wrapped. I want all of the email messages to fill the entire width of the container. I've thought of two possible solutions: CSS, or a Regex that eliminates line breaks. Because I am only a garage mechanic (at these sorts of things), I simply cannot get the job done. Any help out there?
Here is a link that shows the issue: http://seanwilson.org/forum/index.php?t=msg&th=1729&start=0&S=171399e41f2c10c4357dd9b217caaa3f
(compare the message of "sean" with that of "rob." One fills the container, the other not).
Can any of you suggest how to get all the mail to fill the container?
You gave too little information - what programming language are you using - PHP/Javascript/anything different?
I think you only need to replace \n, \r and \r\n with whitespace. PHP code for that:
$nowrap = str_replace('\r\n', ' ', $nowrap);
$nowrap = str_replace('\r', ' ', $nowrap);
$nowrap = str_replace('\n', ' ', $nowrap);
You can do that analogically in other languages (for JS see string.replace method: http://www.tizag.com/javascriptT/javascript-string-replace.php).
Depending on the situation (people always seem to add 2 linebreaks between paragraphs), you could say the problem is: replace all newlines not directly preceded or followed by a newline with a space.
//just to be sure, remove \r's
$string = str_replace("\r",'',$string);
$string = preg_replace('/(?<!\n)\n(?!\n)/',' ',$string);
While allowing \r's:
$string = preg_replace('/(?<!\r|\n)\r?\n(?!\r|\n)/',' ',$string);
Edit: nevermind: do not use: while people tend to write their email text in paragraphs, you will break their signature / signoff with this regex. One could fiddle around with a minimum linelength before deeming it 'breakable' (i chose 63), but fiddly it will be:
$string = preg_replace('/([^\r\n]{63,})\r?\n(?!\r|\n)/','$1 ',$string);
The problem is: there are no assurance the linebreak wasn't intended. With a fiddleable line-length you could base it on average users, but the question is: what do they mind more: the differences between breaking & non-breaking paragraphs, or the breaking of their signatures?
Thanks for getting back so quickly!
The discussion board uses php (and also CSS). The only trouble is that I am somewhat limited in my ability to tinker with its programing. If I am to do this at my current level of skilty, I have only one of two options.
using a preg-replace in php. The discussion board allows us to do this from a control panel. So If I could do it with one preg-replace statement, it should work.
Would Wrikken's solution work if I do not remove \r's? Because that seems to be spot on. (could the \r's be added to the preg-replace?)
I had hoped the solution could come through a css property of some sort. I guess that isn't possible.
Thanks so much for your help!
[NOTE: thanks so much for your help! The solution worked!!! I changed the number to 53 or so. It needed to be a little smaller. I don't care that a rare, long signature lines may lose its carriage return. That's a small price to pay for a full message box! You easily saved me several days of learning something that was bound to be moderately frustrating, Thanks so much for that quick fix. I am joyous at the help I received here.]

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regex make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_])+ .*?,?/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make a user has entered a correct email address (syntactically speaking) into a web form this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"ΒΆ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
The first thing I do with any editor is try to figure out it's Regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using a anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop record.
Now all that is needed to modify the next line is to repeat the macro.
I could live with out support for regex but could not live without anonymous keyboard macros.