Encoding Issues in XSL Transformation

I have encoding issues similar to those discussed here: cross-encoding XSL transformations
No clean answer was given to that question; that's why I'm asking it again.
I have XML input files encoded in UTF-8.
I have an XSL transformation to apply to these files, which should generate XML output encoded in Windows-1252.
I have the two declarations below in my XSLT file:
<?xml version="1.0" encoding='Windows-1252'?>
<xsl:output method="text" indent="yes" encoding="Windows-1252"/>
I use Saxon as the XSL processor.
Despite all of that, I still get a fatal error each time a UTF-8 character with no Windows-1252 equivalent is encountered.
Actually, I don't really care about these characters, and my transformation could drop all of them. I just want the transformation to keep going and not crash because of them.
What am I missing? Why do I still get this fatal error (Fatal Error! Output character not available in this encoding)?
Thanks in advance for your help.

The message you describe is produced only with the text output method (with XML or HTML, the serializer would use numeric character entities). This error is required by the specification
(see http://www.w3.org/TR/xslt-xquery-serialization/#TEXT_ENCODING), though I can understand why you might want a gentler fallback, e.g. outputting a substitute character.
If you don't mind a bit of Java coding, it would be easy to substitute your own version of Saxon's TEXTEmitter that does things differently (you only need to override one method). Alternatively, you could send the XSLT output to a Java Writer (the encoding would then be ignored), and use the Java I/O framework to convert characters to the required encoding, with whatever handling of invalid characters your application requires.
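To illustrate the second option, here is a minimal sketch of the Writer approach (the file names are placeholders, and it assumes Saxon, or any other JAXP-compliant processor, is picked up by TransformerFactory). The CharsetEncoder, not the serializer, then decides what happens to unmappable characters:
import java.io.*;
import java.nio.charset.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

public class LossyTransform {
    public static void main(String[] args) throws Exception {
        // A windows-1252 encoder that substitutes (or, with IGNORE, drops)
        // characters that have no mapping, instead of reporting an error.
        CharsetEncoder encoder = Charset.forName("windows-1252")
                .newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("transform.xsl")));

        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("out.txt"), encoder)) {
            // Handing the serializer a Writer bypasses its own encoding
            // stage; the encoder above performs the conversion instead.
            t.transform(new StreamSource(new File("in.xml")),
                        new StreamResult(out));
        }
    }
}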

UTF-8 covers a much larger character repertoire than Windows-1252.
This means some UTF-8 characters cannot be represented in Windows-1252.
Ask yourself why you need to transform between encodings.
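If in doubt, you can ask the JDK whether a particular character survives the trip; this little check (the class name is mine) prints true for a character that has a Windows-1252 slot and false for one that does not:
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class RepertoireCheck {
    public static void main(String[] args) {
        CharsetEncoder cp1252 = Charset.forName("windows-1252").newEncoder();
        System.out.println(cp1252.canEncode('\u20AC')); // true: the euro sign maps to 0x80
        System.out.println(cp1252.canEncode('\u2605')); // false: the black star has no mapping
    }
}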

Related

POCO C++ SAX parser: if the XML document encoding is ANSI, the document is not read and an encoding error exception is thrown

Suppose the following is the XML document; the hello tag is not read by the POCO SAX parser because the encoding is ANSI.
<?xml version="1.0" encoding="ANSI"?>
<hello xmlns=" ">
If the encoding is UTF-8, the hello tag is read and everything goes fine.
Is there any solution in POCO for this issue?
It's not a POCO problem, fix the producer. There's no such thing as "ANSI" encoding in XML. The producer should generate output in a valid encoding. Whether that's "UTF-8" or "ISO-8859-1" doesn't really matter, as long as it's all consistent.
Encoding problems arise when you declare one encoding but actually use another. This can happen (for example) when you copy-paste XML source between multiple documents or from web pages, or simply because the producer has a buggy encoder. Try using a program that can detect the encoding and change it: set it to UTF-8, then replace the ANSI header declaration with the UTF-8 one.
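If the producer really cannot be fixed, a pre-processing step along these lines could repair the files before they reach the parser (a sketch only; it assumes the "ANSI" files are actually Windows-1252, and the file names are placeholders):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class FixDeclaration {
    public static void main(String[] args) throws IOException {
        // Decode using the encoding the file is really in, fix the bogus
        // declaration, and write the result back out as genuine UTF-8.
        String text = new String(Files.readAllBytes(Paths.get("in.xml")), "windows-1252");
        text = text.replaceFirst("encoding=\"ANSI\"", "encoding=\"UTF-8\"");
        Files.write(Paths.get("out.xml"), text.getBytes(StandardCharsets.UTF_8));
    }
}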

Preserve character hex codes during XSLT 2.0 transform

I have the following XML:
<root>
<child value="&#xFF;&#xEF;&#x2122;&#xE0;"/>
</root>
When I do a transform I want the character hex code values to be preserved. So if my transform was just a simple xsl:copy and the input was the above XML, then the output should be identical to the input.
I have read about the saxon:character-representation function, but right now I'm using Saxon-HE 9.4, so that function is not available to me, and I'm not even 100% sure it would do what I want.
I also read about use-character-maps. This seems to solve my problem, but I would rather not add a giant map to my transform to catch every possible character hex code.
<xsl:character-map name="characterMap">
<xsl:output-character character="&#xA0;" string="&amp;#xA0;"/>
<xsl:output-character character="&#xA1;" string="&amp;#xA1;"/>
<!-- 93 more entries... &#xA1; through &#xFE; -->
<xsl:output-character character="&#xFF;" string="&amp;#xFF;"/>
</xsl:character-map>
Are there any other ways to preserve character hex codes?
The XSLT processor doesn't know how the character was represented in the input - that's all handled by the XML parser. So it can't reproduce the original.
If you want to output all non-ASCII characters using numeric character references, regardless of how they were represented in the input, try using xsl:output encoding="us-ascii".
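For the simple xsl:copy case, the same effect is available from Java by setting the property on an identity transformer (a sketch; the file names are placeholders):
import java.io.File;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

public class AsciiOutput {
    public static void main(String[] args) throws Exception {
        // No stylesheet: newTransformer() yields an identity copy.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // With a US-ASCII target, the serializer must emit numeric
        // character references (e.g. &#xFF;) for every non-ASCII character.
        t.setOutputProperty(OutputKeys.ENCODING, "us-ascii");
        t.transform(new StreamSource(new File("in.xml")),
                    new StreamResult(new File("out.xml")));
    }
}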
If you really need to retain the original representation - and I can't see any defensible reason why anyone would need to do that - then try Andrew Welch's lexev, which converts all the entity and character references to processing instructions on the way in, and back to entity/character references on the way out.

TinyXML parsing multi-byte characters but skipping following [x] chars

I've got a C++ program which receives some XML from a server and then attempts to parse it in order to populate some combo boxes, for example:
<?xml version="1.0"?>
<CustomersMachines>
<Customer name="bob" id="1">
<Machine name="office1" id="1" />
<Machine name="officeserver" id="2" />
</Customer>
</CustomersMachines>
For these values, TinyXML parses fine and the resulting combo-boxes populate as intended. The problem arises when a multi-byte character is placed at (or near, depending on how many bytes) the end of the name element.
<Customer name="boß" id="3">
will result in the combo-box being populated with the value boß" id=
From stepping through the debugger I see that when a multi-byte character gets passed to ReadText(), the following 1-3 single-byte characters in the element get consumed along with it, so TinyXML doesn't register the closing quote and keeps parsing until it reaches the next one. The application running on the server sending the XML predominantly uses ISO-8859-1 encoding, whereas TinyXML defaults to UTF-8.
I've tried tweaking TinyXML to default to TIXML_ENCODING_UNKNOWN, which appears to solve the problem but causes a substantial number of issues elsewhere in the program. Other things I've tried are to utf8_encode the XML server-side before sending it (but this causes strange characters to display in the combo boxes where the multi-byte character should be), and forcing the encoding in the XML sent to the client program, to no avail.
Does anyone have any idea how to prevent multi-byte characters from swallowing the following 1-3 characters in this case?
The <?xml?> prolog is not specifying an encoding. If the encoding is not available outside of the XML through out-of-band means then the encoding has to be guessed through analysis of the XML's starting bytes, per the rules outlined in Appendix F of the XML spec. In this case, that likely leads to UTF-8 being selected. If the XML is not actually UTF-8 encoded, that would account for the behavior you are seeing.
In ISO-8859-1, ß is encoded as the octet 0xDF, and " is encoded as the octet 0x22.
In UTF-8, 0xDF is the leading byte of a 2-byte sequence, which accounts for the " being swallowed. However, 0xDF 0x22 is not a valid UTF-8 2-byte sequence, so TinyXML should have failed the parse with an error. If it does not, that is a bug in TinyXML.
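You can confirm that the byte pairing is malformed with a few lines of Java (purely illustrative, since TinyXML itself is C++):
import java.nio.ByteBuffer;
import java.nio.charset.*;

public class Utf8Check {
    public static void main(String[] args) throws CharacterCodingException {
        // The ISO-8859-1 bytes for ß followed by ": 0xDF 0x22.
        byte[] bytes = { (byte) 0xDF, (byte) 0x22 };
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        // 0xDF announces a 2-byte sequence, but 0x22 is not a valid
        // continuation byte (those look like 10xxxxxx), so this throws
        // MalformedInputException.
        utf8.decode(ByteBuffer.wrap(bytes));
    }
}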
If the XML is actually ISO-8859-1 encoded, the server must provide that info. If it is not, then that is a bug in the server.

Detecting Characters in an XSLT

I have encountered some odd characters that do not display properly in Internet Explorer, such as these: “, –, and ’. I think they're carried over from copy-and-pasted Word content.
I am using XSLT to build the page content and it would be great to detect these characters in the XSLT and replace them with valid HTML codes. I already do string replacement in the style sheet, but I'm not sure how detect these encoded characters or whether it's possible.
What about simply changing the encoding of the stylesheet, as well as its output, to UTF-8? The characters you mention are “, –, and ’. They are certainly not invalid, given the correct encoding (at the very least, they are perfectly valid in codepage 1252).
Using a good XML editor such as XMLSpy should highlight any errors in formatting your XSLT by validating at development time.
Jeni Tennison's Multiple string replacements may be a good starting point.
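If you do end up normalizing these characters instead of fixing the encoding, the mapping is small; here is the idea in Java for illustration (the class name and the selection of characters are mine, not an exhaustive list):
public class SmartQuotes {
    public static void main(String[] args) {
        String s = "\u201Cquoted\u201D \u2013 it\u2019s";
        // Map Word's smart punctuation onto plain ASCII equivalents.
        String plain = s.replace('\u201C', '"')
                        .replace('\u201D', '"')
                        .replace('\u2013', '-')
                        .replace('\u2019', '\'');
        System.out.println(plain); // "quoted" - it's
    }
}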

XSLT encoding problem, question marks in result

I'm trying to run an XSLT transformation, but characters like ëöï are replaced by a literal '?' in the output (I checked with a hex editor). The source file has the correct characters, and the stylesheet has:
<xsl:output encoding="UTF-8" indent="yes" method="xml"/>
What else am I missing?
I'm using saxon as the transformer, if that matters.
The problem is most likely in the way you call the transformer. My guess is it will work fine if you call it using java -jar saxon.jar ...
In general, when you use XML tools that take an InputStream/OutputStream, the tools will make sure that the encoding is correct.
When you use a mixture of Streams and Writers, you will have to make sure that the encoding when going from one to the other matches what you told the XSLT processor to produce. Always set encodings explicitly. There may be defaults, but when it comes to encodings, they are wrong more often than not.
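To make that concrete, here are the two call patterns side by side (a sketch; the file names are placeholders):
import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

public class EncodingSafeTransform {
    public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("style.xsl")));

        // Risky: new FileWriter(...) encodes with the platform default
        // charset, so characters the default charset cannot represent
        // come out as '?', whatever <xsl:output> says.
        // t.transform(new StreamSource(new File("in.xml")),
        //             new StreamResult(new FileWriter("out.xml")));

        // Safe: hand the serializer a raw OutputStream and let it apply
        // the encoding declared in <xsl:output> (UTF-8 here).
        t.transform(new StreamSource(new File("in.xml")),
                    new StreamResult(new FileOutputStream("out.xml")));
    }
}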