TinyXML parsing multi-byte characters but skipping following [x] chars - C++

I've got a C++ program which receives some XML from a server and then attempts to parse it in order to populate some combo-boxes, for example:
<?xml version="1.0"?>
<CustomersMachines>
<Customer name="bob" id="1">
<Machine name="office1" id="1" />
<Machine name="officeserver" id="2" />
</Customer>
</CustomersMachines>
For these values, TinyXML parses fine and the resulting combo-boxes populate as intended. The problem arises when a multi-byte character is placed at (or near, depending on how many bytes it occupies) the end of the name attribute.
<Customer name="boß" id="3">
will result in the combo-box being populated with the value boß" id=
From stepping through the debugger I can see that when a multi-byte character gets passed to ReadText(), the following 1-3 single-byte characters in the element are consumed as part of the multi-byte sequence, so TinyXML doesn't register the closing quote and keeps parsing until it reaches the next one. The application running on the server sending the XML predominantly uses ISO-8859-1 encoding, whereas TinyXML is defaulting to UTF-8.
I've tried tweaking TinyXML to default to TIXML_ENCODING_UNKNOWN, which appears to solve the problem but causes a substantial number of issues elsewhere in the program. Other things I've tried are to utf8_encode the XML server-side before sending it (but this causes strange characters to display in the combo boxes where the multi-byte character should be), and to force the encoding in the XML being sent to the client program, to no avail.
Does anyone have any idea how to stop a multi-byte character from swallowing the following 1-3 characters in this case?

The <?xml?> prolog does not specify an encoding. If the encoding is not available through out-of-band means, then it has to be guessed by analyzing the XML's starting bytes, per the rules outlined in Appendix F of the XML spec. In this case, that likely leads to UTF-8 being selected. If the XML is not actually UTF-8 encoded, that would account for the behavior you are seeing.
In ISO-8859-1, ß is encoded as the octet 0xDF, and " is encoded as the octet 0x22.
In UTF-8, 0xDF is the start byte of a 2-octet sequence, which accounts for the " being skipped. However, 0xDF 0x22 is not a valid UTF-8 2-octet sequence, so TinyXML should have failed the parse with an error. If it does not, then that is a bug in TinyXML.
If the XML is actually ISO-8859-1 encoded, the server must provide that info, e.g. in the XML declaration. If it does not, then that is a bug in the server.
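If you can influence the server, the simplest fix is therefore to have it declare what it already produces, i.e. send <?xml version="1.0" encoding="ISO-8859-1"?>, which TinyXML should then handle with its legacy single-byte mode instead of UTF-8. Failing that, a client-side workaround is to transcode the payload to UTF-8 before handing it to TinyXML. A minimal sketch, assuming the incoming bytes really are ISO-8859-1 (each Latin-1 byte maps directly to the Unicode code point of the same value; latin1ToUtf8 is a hypothetical helper):

#include <string>

// Hypothetical helper: transcode ISO-8859-1 (Latin-1) bytes to UTF-8 so
// TinyXML's default UTF-8 mode sees well-formed multi-byte sequences.
std::string latin1ToUtf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);              // ASCII passes through
        } else {
            // Latin-1 0x80-0xFF is U+0080-U+00FF: a 2-byte sequence in UTF-8.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

With this in place, name="boß" reaches the parser as the bytes 0x62 0x6F 0xC3 0x9F 0x22, so the closing quote is no longer swallowed. Note that every string TinyXML then hands back is UTF-8 encoded, which may need converting back before display.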

Related

POCO C++ SAX parser: if the XML document encoding is ANSI, the next element is not read and an encoding error exception is thrown

Suppose the following is the XML document; the hello tag is then not read by the POCO SAX parser, because the encoding is ANSI.
<?xml version="1.0" encoding="ANSI"?>
<hello xmlns=" ">
If the encoding is UTF-8, the hello tag is read and everything goes fine.
Is there any solution in POCO for this issue?
It's not a POCO problem; fix the producer. There's no such thing as an "ANSI" encoding in XML. The producer should generate output in a valid encoding. Whether that's "UTF-8" or "ISO-8859-1" doesn't really matter, as long as it's all consistent.
Encoding problems arise when you declare one encoding but actually use a different one. That can happen (for example) when you copy-paste XML source between multiple documents or from web pages, or simply because the producer has a buggy encoder. Try using a program that can detect the encoding and change it: set it to UTF-8, then replace the ANSI header tag with the one for UTF-8.
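For example, once the file really is saved as UTF-8, the declaration should read:

<?xml version="1.0" encoding="UTF-8"?>

(Declaring encoding="ISO-8859-1" and saving the file as such would work just as well; the only thing that matters is that the declaration and the actual bytes agree.)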

Encode gives wrong value of Japanese kanji

As part of a scraper, I need to encode kanji into URLs, but I can't seem to get the correct output even for a simple character, and I'm currently blinded by everything I've tried so far from various Stack Overflow posts.
The document is set to UTF-8.
# -*- coding: utf-8 -*-
import urllib2

sampleText = u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ル
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Are there some settings I don't have right? Because as far as I understand it, the output from encode() should not be ル.
The code you show works correctly. The character ル is KATAKANA LETTER RU, and is Unicode codepoint U+30EB. When encoded to UTF-8, you'll get the Python bytestring '\xe3\x83\xab', which prints out as ル if your console encoding is Latin-1. When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B, is the Shift-JIS encoding of ル rather than the UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs), however, says that you should always convert Unicode text to UTF-8 bytes before performing percent encoding.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!
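Should you ever need the legacy Shift-JIS form for one of those old sites, it is easy to produce in the same Python 2 style as the question (shift_jis is the standard codec name):

# -*- coding: utf-8 -*-
import urllib2

# Shift-JIS encodes U+30EB as the bytes 0x83 0x8B, hence %83%8B
print urllib2.quote(u'ル'.encode('shift_jis'))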

Preserve character hex codes during XSLT 2.0 transform

I have the following XML:
<root>
<child value="&#xFF;&#xEF;&#x2122;&#xE0;"/>
</root>
When I do a transform I want the character hex code values to be preserved. So if my transform was just a simple xsl:copy and the input was the above XML, then the output should be identical to the input.
I have read about the saxon:character-representation function, but right now I'm using Saxon-HE 9.4, so that function is not available to me, and I'm not even 100% sure it would do what I want.
I also read about use-character-maps. This seems to solve my problem, but I would rather not add a giant map to my transform to catch every possible character hex code.
<xsl:character-map name="characterMap">
  <xsl:output-character character="&#xA0;" string="&amp;#xA0;"/>
  <xsl:output-character character="&#xA1;" string="&amp;#xA1;"/>
  <!-- 93 more entries... ¡ through þ -->
  <xsl:output-character character="&#xFF;" string="&amp;#xFF;"/>
</xsl:character-map>
Are there any other ways to preserve character hex codes?
The XSLT processor doesn't know how the character was represented in the input - that's all handled by the XML parser. So it can't reproduce the original.
If you want to output all non-ASCII characters using numeric character references, regardless how they were represented in the input, try using xsl:output encoding="us-ascii".
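A minimal sketch of that approach: an identity transform whose only detail relevant to this trick is the us-ascii output encoding (whether the serializer then writes hex or decimal character references is its own choice):

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="us-ascii"/>
  <!-- identity template: copy every node unchanged; non-ASCII characters
       cannot be represented in us-ascii, so the serializer must escape
       them as character references on output -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>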
If you really need to retain the original representation - and I can't see any defensible reason why anyone would need to do that - then try Andrew Welch's lexev, which converts all the entity and character references to processing instructions on the way in, and back to entity/character references on the way out.

Encoding Issues in XSL Transformation

I have encoding issues similar to those discussed here: cross-encoding XSL transformations.
No clean answer was given to that question; that's why I'm asking again.
I have an XML input file encoded in UTF-8.
I have an XSL transformation to apply to these files, which should generate XML output encoded in Windows-1252.
I have the two declarations below in my XSLT file:
<?xml version="1.0" encoding='Windows-1252'?>
<xsl:output method="text" indent="yes" encoding="Windows-1252"/>
I use Saxon as the XSL processor.
Despite all of that, I still get fatal errors each time a UTF-8 character with no Windows-1252 equivalent is encountered.
Actually, I don't really care about these characters; my transformation could drop all of them. I just want the transformation to carry on and not crash because of them.
What am I missing? Why do I still get this fatal error ("Fatal Error! Output character not available in this encoding")?
Thanks in advance for your help.
The message you describe is produced only with the text output method (with XML or HTML, the serializer would use numeric character references). This error is required by the specification (see http://www.w3.org/TR/xslt-xquery-serialization/#TEXT_ENCODING), though I can understand why you might want a gentler fallback, e.g. outputting a substitute character.
If you don't mind a bit of Java coding, it would be easy to substitute your own version of Saxon's TEXTEmitter that does things differently (you only need to override one method). Alternatively, you could send the XSLT output to a Java Writer (the encoding declared in xsl:output is then ignored) and use the Java I/O framework to convert characters to the required encoding, with whatever handling of invalid characters your application requires.
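A sketch of that second approach via JAXP (the file names are placeholders, and Saxon needs to be the transformer factory on the classpath for the XSLT 2.0 features): the CharsetEncoder is configured to substitute a replacement character rather than fail, and because the transformer is handed a Writer, the Java encoder performs the conversion instead of Saxon's serializer:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class LenientTransform {
    public static void main(String[] args) throws Exception {
        // Replace characters with no Windows-1252 equivalent instead of failing.
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("transform.xsl"));

        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("output.txt"), encoder)) {
            // Handing the serializer a Writer bypasses its own encoding stage.
            t.transform(new StreamSource("input.xml"), new StreamResult(out));
        }
    }
}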
UTF-8 encodes a far larger character set (all of Unicode) than Windows-1252.
This means some UTF-8 characters cannot be translated to Windows-1252.
Ask yourself why you need to transform between encodings.

UTF-8 encoded Japanese string in XML

I am trying to create a SOAP call with a Japanese string. The problem I face is that when I encode this string as UTF-8, the result has many control characters in it (e.g. 0x1B, Esc). If I remove all such control characters to make it a valid SOAP call, the Japanese content appears as garbage on the server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
With Best Regards.
If I remember correctly, it's true: the first 32 Unicode code points (with the exception of tab, carriage return, and line feed) are not allowed as characters in XML documents, even escaped with &#. I'm not sure whether they're allowed in HTML or not, but the server certainly thinks they're not allowed in your requests, and it gets the only meaningful vote.
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#0x1B(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. I think iso-2022-jp is a 7-bit encoding, so part of the problem might be that you've taken iso-2022-jp data and run it through a function that converts ISO-Latin-1 (or some other 8-bit encoding whose lower half matches ASCII, for example any Windows code page) to UTF-8. If so, that function leaves 7-bit data unchanged; it won't convert it to UTF-8. Then, when interpreted as UTF-8, the data has ESC characters in it.
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
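Since MS-DOM suggests you are on Windows, here is a sketch of that first conversion step using the Win32 conversion functions (code page 50220 is Windows' identifier for iso-2022-jp; error handling is omitted):

#include <windows.h>
#include <string>
#include <vector>

// Convert iso-2022-jp bytes to UTF-8 via a wide-character intermediate.
std::string Iso2022JpToUtf8(const std::string& in)
{
    // First call sizes the buffer; 50220 = iso-2022-jp.
    int wlen = MultiByteToWideChar(50220, 0, in.data(), (int)in.size(), NULL, 0);
    if (wlen == 0) return std::string();
    std::vector<wchar_t> wide(wlen);
    MultiByteToWideChar(50220, 0, in.data(), (int)in.size(), &wide[0], wlen);

    int ulen = WideCharToMultiByte(CP_UTF8, 0, &wide[0], wlen, NULL, 0, NULL, NULL);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, &wide[0], wlen, &utf8[0], ulen, NULL, NULL);
    return utf8;
}

The result can then be placed in a document that is labelled, and serialized, as UTF-8.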
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider transporting the text in an ASCII-safe encoding.
The simplest option is "hex encoding", where you just write the hex value of each byte as ASCII digits, e.g. the byte 0x1B becomes "1B", i.e. the bytes 0x31, 0x42.
If you want to be fancy, you could use MIME or even uuencode.
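A sketch of such a hex encoder (hexEncode is a hypothetical helper; the receiver would of course have to apply the reverse mapping before decoding the Japanese text):

#include <string>

// Hex-encode raw bytes so the payload contains only ASCII characters.
std::string hexEncode(const std::string& bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char c : bytes) {
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// hexEncode("\x1B") == "1B", i.e. the bytes 0x31 0x42.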