Automatically converting escaped characters to string literals - xslt

I am working on an XSLT transformation to re-arrange XML blocks to validate NewsML files. Some of these files contains encoded characters (such as & " etc...). The problem is the XSLT transformation is converting these characters to their literal string (ie "and", "'"). This is causing problems. I do not want this to happen.
I have experimented with various techniques (uses of <xsl:text>, <xsl:value-of> and the disable-output-escaping flag, <xsl:output method='xml|html|xhtml|text'>) to no avail. These methods either, convert the characters, or simply leave them out.
eg, a string which starts with "stars on PM&apos;s cards" can end up as
stars on PM's cards
stars on PMs cards
I am using the Saxonica (http://www.saxonica.com/) processing app.
The basic XSLT I am using is provided below. (There are other things but the problem exists even with this simplest stylesheet)
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no" />
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Any ideas on how to prevent this conversion would be most appreciated. The requirement is to keep the original text as it appears.

I think you need to do both the disable-output-escaping="yes" and set the document to HTML at the same time.
FROM W3C (emphasis mine):
It is an error for output escaping to be disabled for a text node that is used for something other than a text node in the result tree. Thus, it is an error to disable output escaping for an xsl:value-of or xsl:text element that is used to generate the string-value of a comment, processing instruction or attribute node; it is also an error to convert a result tree fragment to a number or a string if the result tree fragment contains a text node for which escaping was disabled. In both cases, an XSLT processor may signal the error; if it does not signal the error, it must recover by ignoring the disable-output-escaping attribute.
The disable-output-escaping attribute may be used with the html output method as well as with the xml output method. The text output method ignores the disable-output-escaping attribute, since it does not perform any output escaping.
An XSLT processor will only be able to disable output escaping if it controls how the result tree is output. This may not always be the case. For example, the result tree may be used as the source tree for another XSLT transformation instead of being output. An XSLT processor is not required to support disabling output escaping. If an xsl:value-of or xsl:text specifies that output escaping should be disabled and the XSLT processor does not support this, the XSLT processor may signal an error; if it does not signal an error, it must recover by not disabling output escaping.
If output escaping is disabled for a character that is not representable in the encoding that the XSLT processor is using for output, then the XSLT processor may signal an error; if it does not signal an error, it must recover by not disabling output escaping.
Since disabling output escaping may not work with all XSLT processors and can result in XML that is not well-formed, it should be used only when there is no alternative.

These are entities. Usually they get mapped to a unicode representation of that entity. The final stream will just contain the characters. If you output the stream it's up to the serializer to escape the characters depending on the output type (which is what you can disable with disable-output-escaping). So a proper serializer should turn this
<xsl:output method="html" encoding="UTF-8"/>
<xsl:text>some test</xsl:text>
into
some test
See section 5 on this article.
So I would check that with your XSLT processor first.

Related

Preserve character hex codes during XSLT 2.0 transform

I have the following XML:
<root>
<child value="ÿï™à"/>
</root>
When I do a transform I want the character hex code values to be preserved. So if my transform was just a simple xsl:copy and the input was the above XML, then the output should be identical to the input.
I have read about the saxon:character-representation function, but right now I'm using Saxon-HE 9.4, so that function is not available to me, and I'm not even 100% sure it would do what I want.
I also read about use-character-maps. This seems to solve my problem, but I would rather not add a giant map to my transform to catch every possible character hex code.
<xsl:character-map name="characterMap">
<xsl:output-character character=" " string="&#xA0;"/>
<xsl:output-character character="¡" string="&#xA1;"/>
<!-- 93 more entries... ¡ through þ -->
<xsl:output-character character="ÿ" string="&#xFF;"/>
</xsl:character-map>
Are there any other ways to preserve character hex codes?
The XSLT processor doesn't know how the character was represented in the input - that's all handled by the XML parser. So it can't reproduce the original.
If you want to output all non-ASCII characters using numeric character references, regardless how they were represented in the input, try using xsl:output encoding="us-ascii".
If you really need to retain the original representation - and I can't see any defensible reason why anyone would need to do that - then try Andrew Welch's lexev, which converts all the entity and character references to processing instructions on the way in, and back to entity/character references on the way out.

XSLT Identity Transformation without change to the output

Is it possible to do xslt identity transformation where absolutly nothing is changed from the source?
When I use following template, ident and linebreaks are changed in the output and I don't want to do any changes to the source xml.
XSLT
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
INPUT
<S:Envelope
xmlns:S="http://www.w3.org/2003/05/soap-envelope"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
xmlns:f123="http://www.fabrikam123.example/svc53">
<S:Header>
<wsa:MessageID>
uuid:aaaabbbb-cccc-dddd-eeee-wwwwwwwwwww
</wsa:MessageID>
<wsa:RelatesTo>
uuid:aaaabbbb-cccc-dddd-eeee-ffffffffffff
</wsa:RelatesTo>
<wsa:To S:mustUnderstand="1">
http://business456.example/client1
</wsa:To>
<wsa:Action>http://fabrikam123.example/mail/DeleteAck</wsa:Action>
</S:Header>
<S:Body>
<f123:DeleteAck/>
</S:Body>
</S:Envelope>
OUTPUT
<?xml version="1.0" encoding="UTF-8"?><S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:f123="http://www.fabrikam123.example/svc53">
<S:Header>
<wsa:MessageID>
uuid:aaaabbbb-cccc-dddd-eeee-wwwwwwwwwww
</wsa:MessageID>
<wsa:RelatesTo>
uuid:aaaabbbb-cccc-dddd-eeee-ffffffffffff
</wsa:RelatesTo>
<wsa:To S:mustUnderstand="1">
http://business456.example/client1
</wsa:To>
<wsa:Action>http://fabrikam123.example/mail/DeleteAck</wsa:Action>
</S:Header>
<S:Body>
<f123:DeleteAck/>
</S:Body>
</S:Envelope>
No, you cannot. The input and output XML will be the "same" in the sense that they produce the same XML Infoset, but they will not necessarily be byte-for-byte identical and this is not something that XSLT can control.
Why do you need this? If you are trying to compare XML documents easily, consider using XML Canonicalization. Many XML libraries have a method of producing canonical XML, and the xmllint command line tool can produce it easily from files.
The default behavior of XSLT processors is to preserve whitespace in the input, and the behavior of the processors I've just tested is consistent with the spec.
But the whitespace in question is whitespace in the text nodes of the input.
The whitespace between attribute-value specifications in start-tags, and the whitespace between items (e.g. comments and processing instructions) in the prolog and epilog of the document are not text nodes, and are not affected by the preserve-space settings. That white space is also, in fact, not part of the XPath data model, so there is very little the processor can legitimately do to preserve it.
If the whitespace in question carries information, you will want to revisit the design of the vocabulary (it's really a bad idea for that whitespace to be significant); if it's just that you would prefer that there be newlines between attribute-value specifications, you may want to write a custom serializer to insert such newlines and indentation on output. (If your motive is to avoid confusing a diff program with whitespace differences, my experience is that your choices are to normalize whitespace before diffing or to get a diff program that's a bit more robust in the face of whitespace variation.) Good luck.
In general it's not possible to be 100% confident that you'll get exactly everything unchanged because the xslt data model simply doesn't preserve all the information from the parse. For example if the input contains < then the output might contain <. Similarly CDATA sections aren't preserved - adjacent text nodes (CDATA sections and normal text modes) are merged into one at parse time and while you can configure the processor to use CDATA for the content of certain elements you can't simply preserve them as they were.
There are other issues such as the fact that the data model doesn't distinguish between <foo></foo>, <foo/> and <foo /> - they all represent the same empty element and any of them from the input could be represented by any of them in the output. And as in your example white space between attributes within a start tag is not preserved.
But of course these differences are all things that an XML tool shouldn't care about as they're different ways to represent exactly the same infoset.

XSLT text output processor differences

I encountered some peculiar difference in behavior of XSLT processors. I wonder what is the reason behind this and whether there is a full overview available somewhere of processor differences.
I tested the following simple transformation (with a dummy input):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:text>1=
2=
3=
4=
end</xsl:text>
</xsl:template>
</xsl:stylesheet>
Run in XML Spy (v 2011 sp1 x64), the output is:
1=
2=
3=
4=
end
In all cases, in hex after the =, and on the line after 4=, two characters have been added, 0D and 0A.
Apparently, XML Spy replaces each request for &xA; or &xD; by a full CR+LF occurrence, except when CR and LF are requested in that order, right after each other (see the 3= part).
But when run in saxon9he, I get a warning that I am running a v1.0 stylesheet with a v2.0 processor, and the output is
1=
2=3=
4=
end
In this case, all requests for &xA; are replaced by 0D 0A (so a CR is added in front of the LF), but a request for &xD; outputs the requested CR, not an additional LF.
Rerunning in XML Spy setting XSLT version to 2.0 gives the same result as for 1.0, so I guess it's not a different convention in the two XSLT versions that is causing this.
Most probably, this is just a diff between tools we have to know about, but I wonder whether there is more to say on the subject.
The 2.0 specification states that with output method text, it is implementation-defined how line endings will be represented (More specifically, "A newline character in the instance of the data model MAY be output using any character sequence that is conventionally used to represent a line ending in the chosen system environment.").
The XSLT 1.0 says nothing (which is not really any different).
Some implementations might use a single newline consistently, some might output exactly what you asked for, some might use the default line ending for the operating system you are running on.

XSLT: keeping whitespaces when copying attributes

I'm trying to sort Microsoft Visual Studio's vcproj so that a diff would show something meaningful after e.g. deleting a file from a project. Besides the sorting, I want to keep everything intact, including whitespaces. The input looks like
space<File
spacespaceRelativePath="filename"
spacespace>
...
The xslt fragment below can add the spaces around elements, but I can't find out how to deal with those around attributes, so my output looks like
space<File RelativePath="filename">
xslt I use for the msxsl 4.0 processor:
<xsl:for-each select="File">
<xsl:sort select="#RelativePath"/>
<xsl:value-of select="preceding-sibling::text()[1]"/>
<xsl:copy>
<xsl:for-each select="text()|#*">
<xsl:copy/>
</xsl:for-each>
Those spaces are always insignificant in XML, and I believe that there is no option to control this behavior in a general way for any XML/XSLT library.
XSLT works on a tree representation of the input XML. Many of the irrelevant detail of the original XML has been abstracted away in this tree - for example the order of attributes, insignificant whitespace between attributes, or the distinction between " and ' as an attribute delimiter. I can't see any conceivable reason for wanting to write a program that treats these distinctions as significant.

XSLT encoding problem, questionmarks in result

I'm trying to run an XSLT transformation, but characters like ëöï are replaced by a literal '?' in the output (I checked with an hex editor). The source file has the correct characters, and the stylesheet has:
<xsl:output encoding="UTF-8" indent="yes" method="xml"/>
What else am I missing?
I'm using saxon as the transformer, if that matters.
The problem is most likely in the way you call the transformer. My guess is it will work fine if you call it using java -jar saxon.jar ...
In general, when you use XML tools which take InputStream/OutputStream, then the tools will make sure that the encoding is correct.
When you use a mixture of Streams and Writers, you will have to make sure that the encoding when going from one to the other matches what you told the XSLT processor to produce. Always set encodings explicitly. There may be defaults, but when it comes to encodings, they are wrong more often than not.