Replacing special characters from HTML source - c++

I'm new to HTML and I know it has some reserved characters for its own use, and that it also displays some characters by character code. For example:
&#140; is Œ
&#169; is ©
&#174; is ®
I have the HTML source in a std::string. How can I decode these entities into their actual characters and replace them in the std::string? Is there a library with source available, or can this be done with preprocessor macros?

I would recommend using an HTML/XML parser that can do the conversion for you automatically; parsing HTML correctly by hand is extremely difficult. If you insist on doing it yourself, the Boost String Algorithms library provides useful replacement functions.
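For the replacement step itself, here is a minimal sketch using Boost String Algorithms, assuming UTF-8 output; the two entities shown are an illustrative subset, not the full table:

#include <boost/algorithm/string/replace.hpp>
#include <iostream>
#include <string>

int main() {
    std::string html = "&copy; 2013 &OElig;uvre";
    // Replace each named entity with its UTF-8 byte sequence.
    boost::replace_all(html, "&copy;", "\xC2\xA9");   // U+00A9 ©
    boost::replace_all(html, "&OElig;", "\xC5\x92");  // U+0152 Œ
    std::cout << html << "\n";                        // prints: © 2013 Œuvre
}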

&#140; is Œ
No it isn't. &#140; is 'PARTIAL LINE BACKWARD'. The correct numeric entities for Œ are &#338; and &#x152;.

One method for the numeric entities would be to use a regular expression like &#([0-9]+);, grab the numeric value, and convert it to the corresponding character (probably with sprintf in C++).
For the named entities you would need to build a mapping. You could probably do a simple string replace to convert the names to numeric form, then use the method above. W3C has a table here: http://www.w3.org/TR/WD-html40-970708/sgml/entities.html
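A minimal sketch of the numeric-entity method, using std::regex rather than sprintf and assuming UTF-8 output (code points above U+FFFF and named entities are omitted for brevity):

#include <iostream>
#include <regex>
#include <string>

// Encode one code point (up to U+FFFF here) as UTF-8 bytes.
static std::string to_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Replace every decimal entity &#NNN; with the character it names.
std::string decode_numeric_entities(std::string html) {
    static const std::regex numeric("&#([0-9]+);");
    std::string out;
    std::smatch m;
    while (std::regex_search(html, m, numeric)) {
        out += m.prefix();                       // text before the entity
        out += to_utf8(std::stoul(m[1].str()));  // the decoded character
        html = m.suffix();                       // keep scanning the rest
    }
    return out + html;
}

int main() {
    std::cout << decode_numeric_entities("&#338;uvre &#169; 2013") << "\n";  // Œuvre © 2013
}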
But if you're trying to read or parse a bunch of HTML in a string, you should use an HTML parser. Search for the many questions on SO.

Related

Parsing JSON-style text using sscanf()

["STRING", FLOAT, FLOAT, FLOAT],
I need to parse four values from this string - a STRING and three FLOATs.
sscanf() returns zero, so I probably got the format specifiers wrong.
sscanf(current_line.c_str(), "[\"%s[^\"]\",%f,%f,%f],",
&temp_node.id,
&temp_node.pos.x,
&temp_node.pos.y,
&temp_node.pos.z))
Do you know what's wrong?
Please read the manual page on sscanf(3). The %s format does not match using a regular expression; it just scans non-whitespace characters. Even if it worked as you assumed, your pattern would not be able to handle all JSON strings correctly (which might not be a problem if your input data format is sufficiently restricted, but would be unclean nonetheless).
Use a proper JSON parser. It's not really complicated. I used cJSON for a moderately complex case, you should be able to integrate it within a few hours.
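A minimal sketch with cJSON; the field values are taken from the question, and note that the trailing comma on each line would have to be stripped first, since ["a", 1, 2, 3], is not valid JSON by itself:

#include <cstdio>
#include "cJSON.h"

int main() {
    const char* line = "[\"node_01\", 1.0, 2.5, -3.75]";  // trailing comma already stripped
    cJSON* arr = cJSON_Parse(line);
    if (arr && cJSON_GetArraySize(arr) == 4) {
        const char* id = cJSON_GetArrayItem(arr, 0)->valuestring;
        double x = cJSON_GetArrayItem(arr, 1)->valuedouble;
        double y = cJSON_GetArrayItem(arr, 2)->valuedouble;
        double z = cJSON_GetArrayItem(arr, 3)->valuedouble;
        std::printf("%s %f %f %f\n", id, x, y, z);
    }
    cJSON_Delete(arr);  // safe even if parsing failed and arr is NULL
}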
To fix your immediate problem, use this format specifier:
"[\"%[^\"]\",%f,%f,%f],"
The right syntax for a scanset is %[...] (with no s at all), not %s[...].
That being said, sscanf() is not the right tool for parsing JSON. Even the "fixed" code would fail to parse strings that contain escaped quotes, for instance.
Use a proper JSON parser library.
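For completeness, a sketch of the corrected call; the struct layout and field names are assumptions based on the question, and the explicit width 63 guards against buffer overflow:

#include <cstdio>

struct Node {
    char id[64];
    struct { float x, y, z; } pos;
};

int main() {
    const char* current_line = "[\"node_01\", 1.0, 2.5, -3.75],";
    Node temp_node;
    // %63[^\"] reads up to 63 non-quote characters into id; note the char
    // array is passed without &, unlike the float fields.
    int n = std::sscanf(current_line, "[\"%63[^\"]\",%f,%f,%f],",
                        temp_node.id,
                        &temp_node.pos.x,
                        &temp_node.pos.y,
                        &temp_node.pos.z);
    std::printf("%d fields: %s %f %f %f\n", n, temp_node.id,
                temp_node.pos.x, temp_node.pos.y, temp_node.pos.z);
}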

Escaping and unescaping HTML

In a function I do not control, data is being returned via
return xmlFormat(rc.content)
I later want to do a
<cfoutput>#resultsofreturn#</cfoutput>
The problem is all the HTML tags are escaped.
I have considered
<cfoutput>#DecodeForHTML(resultsofreturn)#</cfoutput>
But I am not sure these are inverses of each other
Like Adrian concluded, the best option is to implement a way to get at the pre-encoded value.
In its current state, the string you're working with is encoded for an XML document. One option is to create an XML document from the text and parse the text back out of it. I'm not sure how efficient this method is, but it does return the text to its pre-encoded value.
function xmlDecode(text){
    return xmlParse("<t>#text#</t>").t.xmlText;
}
TryCF.com example
As of CF 10, you should be using the newer encodeFor functions. These functions account for high ASCII characters as well as UTF-8 characters.
Old and Busted
XmlFormat()
HTMLEditFormat()
JSStringFormat()
New Hotness
encodeForXML()
encodeForXMLAttribute()
encodeForHTML()
encodeForHTMLAttribute()
encodeForJavaScript()
encodeForCSS()
The output from these functions differs by context.
Then, if you're only getting escaped HTML, you can convert it back using Jsoup or the Jakarta Commons Lang library. There are some examples in a related SO answer.
Obviously, the best solution would be to update the existing function to return either version of the content. Is there a way to copy that function in order to return the unescaped content? Or can you just call it from a new function that uses the Java solution to convert the HTML?

Is RTF text empty

Is there an easy way in C++ to tell whether an RTF string has any real content, aside from pure formatting?
For example this text is only formatting, there is no real content here:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}
Loading the RTF into a RichTextControl is not an option; I want something that works fast and requires minimal resources.
The only sure-fire way is to write your own RTF parser [spec], use a library like LibRTF, or you might consider keeping a RichTextControl open and updating it with new RTF documents rather than destroying the object every time.
I believe RTF is not a regular language, so it cannot be properly parsed with regular expressions (much like HTML, despite millions of attempts to do so), but you do not need to write a complete RTF parser either.
I'd start with a simple string parser; a rough sketch follows the list below. Try:
Remove content between {\ and }
Remove tags. Tags begin with a backslash, \, and are followed by some text; a backslash followed by whitespace is not a tag.
The document should end with at least one closing curly brace, }
Any content left that isn't whitespace should be document content, though there will be exceptions, so test against numerous samples of RTF.
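A rough sketch of those steps; this is deliberately not a real RTF parser, so it has known blind spots: it drops every nested {\...} group (text inside formatting groups such as {\b bold} is missed), and escaped braces and \bin data are not handled:

#include <cctype>
#include <iostream>
#include <string>

bool rtf_has_content(std::string rtf) {
    // Step 1: repeatedly remove innermost nested groups starting with "{\"
    // (searching from index 1 keeps the outer {\rtf1 ...} wrapper).
    for (bool removed = true; removed; ) {
        removed = false;
        for (std::size_t open = rtf.find("{\\", 1); open != std::string::npos;
             open = rtf.find("{\\", open + 1)) {
            std::size_t close = rtf.find_first_of("{}", open + 1);
            if (close != std::string::npos && rtf[close] == '}') {
                rtf.erase(open, close - open + 1);  // group with no nested braces
                removed = true;
                break;
            }
        }
    }
    // Step 2: strip control words (backslash + letters + optional numeric
    // parameter) and the structural braces, keeping whatever is left.
    std::string text;
    for (std::size_t i = 0; i < rtf.size(); ) {
        if (rtf[i] == '\\' && i + 1 < rtf.size() &&
            std::isalpha(static_cast<unsigned char>(rtf[i + 1]))) {
            ++i;
            while (i < rtf.size() && std::isalpha(static_cast<unsigned char>(rtf[i]))) ++i;
            if (i < rtf.size() && (rtf[i] == '-' ||
                std::isdigit(static_cast<unsigned char>(rtf[i])))) {
                ++i;
                while (i < rtf.size() && std::isdigit(static_cast<unsigned char>(rtf[i]))) ++i;
            }
        } else if (rtf[i] == '{' || rtf[i] == '}') {
            ++i;
        } else {
            text += rtf[i++];
        }
    }
    // Step 3: anything left that isn't whitespace counts as content.
    for (char c : text)
        if (!std::isspace(static_cast<unsigned char>(c))) return true;
    return false;
}

int main() {
    std::cout << rtf_has_content(
        "{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033"
        "{\\fonttbl{\\f0\\fnil\\fcharset0 MS Sans Serif;}}}") << "\n";  // 0
    std::cout << rtf_has_content("{\\rtf1\\ansi Hello, world!}") << "\n";  // 1
}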

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/m(.{1})ori/i', $page_title)
This also matches page titles containing "Moorings" but not "Māori". How do preg_match/preg_replace see characters like "ā", and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off on the whole doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the page name and the search term and comparing those.
One thing you need to remember is that UTF-8 uses multi-byte sequences for anything outside ASCII. I don't know whether PHP treats the string $page_title as a Unicode object or as a dumb byte string. If it's a byte string, a single . matches only one byte, so you'd have to use something like .{1,4} instead, and even then verify that the up-to-four bytes you grab between the m and the o form a single valid UTF-8 character. This is all moot if PHP handles Unicode properly; I haven't used it in years, so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: as a single code point (U+0101) or as TWO code points ('a' plus a combining diacritic from the U+0300 block). You're most likely only ever going to get the former, but be aware that the latter is also possible.
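A small illustration of both points, in C++ rather than PHP so the byte sequences are explicit:

#include <iostream>
#include <string>

int main() {
    std::string precomposed = "\xC4\x81";    // U+0101 ā as UTF-8: 2 bytes
    std::string decomposed  = "a\xCC\x84";   // 'a' + U+0304 combining macron: 3 bytes
    std::cout << precomposed.size() << "\n"; // 2 -- a byte-wise '.' matches just one byte
    std::cout << decomposed.size() << "\n";  // 3
    std::cout << (precomposed == decomposed) << "\n";  // 0: same glyph, different bytes
}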
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexes.

Detecting Characters in an XSLT

I have encountered some odd characters that do not display properly in Internet Explorer, such as these: “, –, and ’. I think they were carried over from copy-and-pasted Word content.
I am using XSLT to build the page content, and it would be great to detect these characters in the XSLT and replace them with valid HTML codes. I already do string replacement in the stylesheet, but I'm not sure how to detect these encoded characters, or whether that's even possible.
What about simply changing the encoding of the stylesheet, as well as its output, to UTF-8? The characters you mention are “, – and ’. They're certainly not invalid, given the correct encoding (at the very least, those characters are perfectly valid in codepage 1252).
Using a good XML editor such as XMLSpy should highlight any formatting errors by validating your XSLT at development time.
Jeni Tennison's Multiple string replacements may be a good starting point.