Using xsl:character-map on text nodes only - xslt

I am trying to create a generic stylesheet that can convert all Latin characters in Unicode to uppercase ASCII characters. Using <xsl:character-map> works well except for one thing: namespaces. The character map converts all of my namespaces to upper case, which I do not want.
Is there a way to utilize a character map to do what I want to all the other nodes while leaving the namespaces untouched? I see the disable-output-escaping attribute might be an option, but I haven't been able to make it work.

Looks like this is an Oracle-specific issue. I'll probably post this on the Oracle Forums then.
Thanks for all the feedback!

Related

Characters debugging based on code points

I have a character vector with multiple coding "errors" which I pulled from it.dbpedia.org. In fact, each accented characters is rendered incorrectly like "\"Democrazia è Libertà - La Margherita\"#it" instead of \"Democrazia è Libertà - La Margherita\"#it.
I found a debugging map for this kind of encoding problems here. Still I noticed that the relation between "actual" and "expected" characters is not one-to-one (as I would expect) but one-to-many. Then my character "Ã" might alternatively translate as "Á", "Í", "Ï", "Ð", "Ý", "à". In other words, I can't use a pattern/replacement solution for actual/expected characters.
Can I use a pattern/replacement solution with Unicode code points / expected characters? How do I pass to gsub() the unicode code point instead of the actual characters?
Should I use instead a package as stringi to solve the encoding issue? How?
UPDATE: I just noticed the problem is at the source: the XML output of SPARQL.
NOTE: Related to this unanswered question.

How To Remove "Ctrl + Backspace" Special Character?

I have a server written in C++, and when receiving a chat string, I'd like to remove weird special characters like the one created by Ctrl + Backspace (though not other symbols like :)]>_ etc.)
I'm using Boost, too.
edit: Why'd this get -1'd? It's a legit question.
Sounds like isprint might help. It returns true for any printable character, ie. not for control characters and whitespaces. For a list of what is considered printable and what not, take a look at this table.
I haven't used it, and this probably isn't the best way to do it, but have you considered trying the boost regex library (i.e., regex_replace)?

Can an tinyxml someone explain which characters need to be escaped?

I am using tinyxml to save input from a text ctrl. The user can copy whatever they like into the text box and it gets written to an xml file. I'm finding that the new lines don't get saved and neither do & characters. The weird part is that tinyxml just discards them completely without any warning. If I put a & into the textbox and save, the tag will look like:
<textboxtext></textboxtext>
newlines completely disappear as well. No characters whatsoever are stored. What's going on? Even if I need to escape them with &amp or something, why does it just discard everything? Also, I can't find anything on google regarding this topic. Any help?
EDIT:
I found this topic which suggest the discarding of these characters may be a bug.
TinyXML and preserving HTML Entities
It is, apparently, a bug in TinyXml.
The simple workaround is to escape anything that it might not like:
&, ", ', < and > got their regular xml entities encoding
strange characters (read non-alphanumerical / regular punctuation) are best translated to their unicode codepoint: &#....;
Remember that TinyXml is before all a lightweight xml library, not a full-fledged beast.

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for utf-8 mode in regexes,
You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.

Detecting Characters in an XSLT

I have encountered some odd characters that do not display properly in Internet Explorer, such as these: “, –, and ’. I think they're carried over from copy-and-paste Word content.
I am using XSLT to build the page content and it would be great to detect these characters in the XSLT and replace them with valid HTML codes. I already do string replacement in the style sheet, but I'm not sure how detect these encoded characters or whether it's possible.
What about simply changing the encoding for the Stylesheet as well as its output to UTF-8? The characters you mention are “, – and ’. Certainly not invalid or so, given the correct encoding (the characters are at least perfectly valid in Codepage 1252).
Using a good XML editor such as XMLSpy should highlight any errors in formatting your XSLT by validating at development time.
Jeni Tennison's Multiple string replacements may be a good starting point.