I have been using the Ink Inliner over the past year and just recently noticed that all of my HTML special characters are being converted to plain text by the inliner. Is this intentional, or do I need to add the characters back in each time after running through the inliner? It seems counter-intuitive for the inliner to strip them out unless this is intended for plain text emails.
I have a large text file that I'm going to be working with programmatically, but I have run into problems with a special character strewn throughout the file. The file is way too large to scan by eye for specific characters. I've been able to get rid of most of the other unwanted special characters using regex patterns. But there is a box character, similar to "□". When I tried to copy the character from the actual text file and paste it here I got "�", so the example of the box is from the Windows Character Map, which includes the code 'U+25A1'. I'm not sure how to interpret that code or whether it's something I could use for a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
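The same search can be sketched in Python, where the code point is written "\u25a1" instead of \x{25A1}. This is only an illustration: the function name find_boxes and the sample text are made up here, and it assumes the file has already been read and decoded as UTF-8.

```python
import re

BOX = "\u25a1"  # WHITE SQUARE, the code point Character Map lists as U+25A1

def find_boxes(text):
    """Return (line_number, column) pairs, 1-based, for every U+25A1 in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(BOX, line):
            hits.append((line_no, match.start() + 1))
    return hits

# Hypothetical sample; a real file would be read with open(path, encoding="utf-8").
sample = "first line\n\u25a1 Prune palms\nno box here \u25a1"
print(find_boxes(sample))
```

If this finds nothing in the real file, the character in the file is probably not U+25A1 at all, which fits the diamond question mark showing up on paste.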
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. Putting <meta charset=UTF-8> in the header of the page is likely to suffice.
I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square is originally from, but when I post it into the query field in the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table) it gives the hex-character code for the "Replacement Character" which is the diamond with the question mark.
Using this code in a regex, \x{FFFD}, within the Notepad++ search gave me all the squares, although it recognizes them as the Replacement Character.
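To find out which raw bytes sit behind each Replacement Character, one option is to decode the file yourself and record every byte run that fails. The sketch below does this in Python with a custom codec error handler. The names find_bad_bytes and record_bad_bytes are made up for this example, and the 0x95 byte in the sample is purely a guess at what such a file might contain, not something known about the file in question.

```python
import codecs

def find_bad_bytes(raw, encoding="utf-8"):
    """Decode raw bytes, returning (text, [(offset, bad_bytes), ...]).

    Every undecodable byte run becomes U+FFFD in the text, and its
    offset and original bytes are recorded in the list.
    """
    bad = []

    def record(err):
        # err.start and err.end delimit the bytes that could not be decoded
        bad.append((err.start, raw[err.start:err.end]))
        return ("\ufffd", err.end)

    codecs.register_error("record_bad_bytes", record)
    text = raw.decode(encoding, errors="record_bad_bytes")
    return text, bad

# Hypothetical input: 0x95 is an assumed stray byte, not valid UTF-8 on its own.
sample = b"\x95 Prune palms when flower spathes show"
text, bad = find_bad_bytes(sample)
print(bad)
```

Once the offending byte values are known, trying them against a few likely encodings (Windows-1252 is a common suspect for text pasted from Word) may reveal what the box was meant to be.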
I often grab quotes from articles that include citations that include superscripted footnotes, which when copied are a pain in the ass. They show up as actual letters in the text as they are pasted in plaintext and not in html.
Is there a way I could run this through a regex to take out these superscripts?
For example
In the abeginning bGod ccreated the dheaven and the eearth.
Should become
In the beginning God created the heaven and the earth.
I can't think of a way to have regex search for misspellings and a corresponding sequential set of numbers and letters.
Any thoughts? I'm also using Sublime Text 3 for the majority of my writing, but I wouldn't mind outsourcing this to an AppleScript, or text replacement app (aText, textExpander, etc.).
Matching Code vs. Matching a Screen
It's hard to tell without seeing an example, but this should be doable if you copy the text from code view, as opposed to the regular browser view. (Ctrl or Cmd-J is your friend). Since writing the rules will take time, this will only be worthwhile for large chunks of text.
In code view, your superscript will be marked up in a way that can be targeted by regex. For instance:
and therefore bananas make you smartera
in the browser view (where the a at the end is a citation note) may look like this in code view:
and therefore bananas make you smarter<span class="mycitations">a</span>
In your editor, using regex, you can process the text to remove all tags, or just certain tags. The rules may not always be easy to write, and of course there are many disclaimers about using regex to parse html.
However, if your source is always the same (Wikipedia for instance), then you can create and save rules that should work across many pages.
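As a rough illustration of such a rule, here is a Python sketch that removes the citation span from the example above. The class name mycitations comes from that example and will differ from site to site, so treat the pattern as a placeholder to adapt rather than a general solution.

```python
import re

# Matches one citation span and everything inside it (non-greedy).
CITATION = re.compile(r'<span class="mycitations">.*?</span>', re.DOTALL)

def strip_citations(html):
    """Remove every citation span, leaving the surrounding text untouched."""
    return CITATION.sub("", html)

source = 'and therefore bananas make you smarter<span class="mycitations">a</span>'
print(strip_citations(source))
```

The same search pattern can be pasted into the Sublime Text find-and-replace panel with regex mode on and an empty replacement.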
Is there an easy way in C++ to tell if an RTF text string has any content, aside from pure formatting?
For example this text is only formatting, there is no real content here:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}
Loading RTF text in RichTextControl is not an option, I want something that will work fast and require minimum resources.
The only sure-fire way is to write your own RTF parser [spec], use a library like LibRTF, or you might consider keeping a RichTextControl open and updating it with new RTF documents rather than destroying the object every time.
I believe RTF is not a regular language, so cannot be properly parsed by RegEx (not unlike HTML, despite millions of attempts to do so), but you do not need to write a complete RTF parser.
I'd start with a simple string parser. Try:
Remove content between {\ and }
Remove tags. Tags begin with a backslash, \, and are followed by some text. If a backslash is followed by whitespace, it is not a tag.
The document should end with at least one closing curly brace, }
Any content left which isn't whitespace should be document content, though this may have some exceptions so you'll want to test on numerous samples of RTF.
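The steps above can be sketched roughly as follows. It is written in Python rather than C++ to keep it short, and the same regexes should carry over to std::regex. Two things here are assumptions beyond the steps as stated: the opening brace of the outer {\rtf group is dropped first (otherwise the first step would swallow the whole document), and nested groups are removed innermost first. The function name rtf_has_content is made up. Groups that carry real text, such as {\b bold}, and escapes like \'e9 will be misjudged, which is exactly the kind of exception worth testing for.

```python
import re

INNER_GROUP = re.compile(r"\{\\[^{}]*\}")   # a {\...} group with no nested braces
CONTROL_WORD = re.compile(r"\\[^\s\\{}]+")  # backslash followed by non-whitespace text

def rtf_has_content(rtf):
    """Heuristic: True if anything but whitespace survives the stripping steps."""
    text = rtf.strip()
    # Assumption: drop the opening brace of the outer {\rtf... group.
    if text.startswith("{\\rtf"):
        text = text[1:]
    # Step 1: remove {\...} groups, innermost first, until none are left.
    while True:
        text, removed = INNER_GROUP.subn("", text)
        if removed == 0:
            break
    # Step 2: remove tags (control words).
    text = CONTROL_WORD.sub("", text)
    # Step 3: drop leftover braces, including the document's closing one.
    text = text.replace("{", "").replace("}", "")
    return bool(text.strip())

formatting_only = r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}"
with_text = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0 Hello\par}"  # hypothetical sample
print(rtf_has_content(formatting_only), rtf_has_content(with_text))
```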
I have encountered some odd characters that do not display properly in Internet Explorer, such as these: “, –, and ’. I think they're carried over from copy-and-paste Word content.
I am using XSLT to build the page content and it would be great to detect these characters in the XSLT and replace them with valid HTML codes. I already do string replacement in the style sheet, but I'm not sure how to detect these encoded characters or whether it's possible.
What about simply changing the encoding for the stylesheet as well as its output to UTF-8? The characters you mention are “, – and ’. They are certainly not invalid given the correct encoding (the characters are at least perfectly valid in Codepage 1252).
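To see why the encoding matters, and what replacing the characters with HTML codes would amount to, here is a small Python sketch (not XSLT, so it would have to run as a pre-processing step, which is an assumption about the pipeline). The function name to_char_refs is made up. It turns every non-ASCII character into a numeric character reference, which displays the same regardless of the page encoding.

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a numeric reference like &#8220;."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

sample = "\u201cquoted\u201d \u2013 it\u2019s"  # curly quotes, en dash, apostrophe
print(to_char_refs(sample))

# What a UTF-8 left curly quote turns into when its bytes are read as Codepage 1252:
garbled = "\u201c".encode("utf-8").decode("cp1252")
print(garbled)
```

The second print shows the three-character junk that typically appears when UTF-8 output is displayed as Codepage 1252, which is the mismatch that declaring UTF-8 consistently avoids.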
Using a good XML editor such as XMLSpy should highlight any errors in formatting your XSLT by validating at development time.
Jeni Tennison's Multiple string replacements may be a good starting point.
I'm working on a web application that parses and displays email messages in a threaded format (among other things). Emails may come from any number of different mail clients, and in either text or HTML format.
Given that most people have a tendency to top post, I'd like to be able to hide the duplicated message in an email reply in a manner similar to how Gmail does it (e.g. "show quoted text").
Determining which part of the message is the reply is somewhat challenging. Personally, I use "> " delimiters at the beginning of the quoted text when replying. I created a regexp that looks for these lines and wraps a div around them to allow some JS to hide or show this block of text.
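For reference, a minimal Python sketch of that kind of regexp is below. The class name quoted and the function name wrap_quoted are placeholders, and HTML escaping of the message body is left out to keep the example short.

```python
import re

# One or more consecutive lines that start with ">".
QUOTED_BLOCK = re.compile(r"(?:^>.*(?:\n|$))+", re.MULTILINE)

def wrap_quoted(body):
    """Wrap each run of '>'-prefixed lines in a div that JS can hide or show."""
    def wrap(match):
        block = match.group(0).rstrip("\n")
        return '<div class="quoted">\n' + block + "\n</div>\n"
    return QUOTED_BLOCK.sub(wrap, body)

message = "Thanks, will do.\n> Can you send the file?\n> It is urgent.\nBob"
print(wrap_quoted(message))
```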
I then noticed that Outlook doesn't use the "> " characters by default, it simply adds a header block above the reply with the summary of the headers (From, Subject, Date, etc.). The reply is untouched. I can match on this and hide the rest of the email, working with the assumption that it's a top quote.
I then looked at Thunderbird, and it uses "> " for text, and <blockquote> for HTML mails. I still haven't looked at what Apple Mail does, what Notes does, or what any of the other millions of mail clients out there do.
Will I be writing a special case regexp for every single client out there? Or is there something I'm missing?
Any suggestions, sample code or pointers to third party libraries much appreciated!
It'll be pretty hard to duplicate the way Gmail does it, since it doesn't care whether something was a quoted piece or not. Like Zac says, it just seems to care about the diff.
It's actually pretty hard to get this right 100% of the time. Plain text email is "lossy"; it's entirely possible for you to send
> Here is my long line that is over 74 chars (email line length limit)
Which can get encoded as something like
> Here is my long line that is over 74 chars (email=
line length limit)
And then is decoded as
> Here is my long line that is over 74 chars (email
line length limit)
Making it indistinguishable from an inline-reply.
This is email, so variations abound. Email usually line-wraps at something like 74 characters, and encoding schemes can differ. It's a real PITA. If you can access the HTML version, you will probably have better luck looking for quote tags and the like. Another idea would be to parse both the plain text and HTML versions to try and determine the boundaries.
Additionally, it's best to just plan for specific client hacks. They all construct MIME messages differently, both in structure and header content.
Edit: I say this with the experience of writing an email processing system as well as seeing several people try to do the -exact- thing you're doing. It always only got "ok" results.
From what I can tell, gmail does not bother about prefixed lines or section headings, except to ignore them. If the text lines appeared earlier in the thread, and then reappear, it is considered to be quoted. Thus, e.g., if you send multiple messages and don't change your signature, the signature is considered to be quoted. If you've already dealt with the '>' prefix, a simple diff should do most of the rest. No need to get fancy.
First thing I think I'd do is strip out all the white space, or reduce white space to 1 between each word, and special characters from both blocks, then look for the old one in the new one.
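A rough Python sketch of that idea, done line by line so that it also covers the "appeared earlier in the thread" rule: normalize each line, remember every line seen in earlier messages, and flag lines in the new message that match. The function names normalize and mark_quoted are made up, and this is only a guess at the approach, not how Gmail actually does it.

```python
import re

def normalize(line):
    """Drop special characters (including '>' prefixes) and collapse whitespace."""
    line = re.sub(r"[^\w\s]", "", line)
    return re.sub(r"\s+", " ", line).strip()

def mark_quoted(earlier_messages, new_message):
    """Return (line, is_quoted) pairs for each line of new_message."""
    seen = set()
    for message in earlier_messages:
        for line in message.splitlines():
            key = normalize(line)
            if key:  # ignore lines that normalize to nothing
                seen.add(key)
    return [(line, normalize(line) in seen) for line in new_message.splitlines()]

# Hypothetical thread: one earlier message and a reply that quotes it.
earlier = ["Can you send the file?\n-- \nAlice"]
reply = "Sure, attached.\n\n> Can you  send the file?\nAlice"
for line, quoted in mark_quoted(earlier, reply):
    print(quoted, repr(line))
```

Note that the unchanged signature line is flagged as quoted, matching the behaviour described for Gmail.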
Here's a mozdev project that may be helpful for others who stumble across this page looking for a Thunderbird solution:
http://quotecollapse.mozdev.org/