I encountered a peculiar difference in the behavior of XSLT processors. I wonder what the reason behind it is, and whether a full overview of processor differences is available somewhere.
I tested the following simple transformation (with a dummy input):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:text>1=&#xA;2=&#xD;3=&#xD;&#xA;4=&#xA;&#xD;end</xsl:text>
</xsl:template>
</xsl:stylesheet>
Run in XML Spy (v 2011 sp1 x64), the output is:
1=
2=
3=
4=
end
In all cases, in hex after the =, and on the line after 4=, two characters have been added, 0D and 0A.
Apparently, XML Spy replaces each requested &#xA; or &#xD; with a full CR+LF pair, except when CR and LF are requested in that order, right after each other (see the 3= part).
But when run in saxon9he, I get a warning that I am running a v1.0 stylesheet with a v2.0 processor, and the output is
1=
2=3=
4=
end
In this case, every requested &#xA; is replaced by 0D 0A (so a CR is added in front of the LF), but a requested &#xD; is output as the requested CR, with no additional LF.
Rerunning in XML Spy setting XSLT version to 2.0 gives the same result as for 1.0, so I guess it's not a different convention in the two XSLT versions that is causing this.
Most probably, this is just a difference between tools that we have to know about, but I wonder whether there is more to say on the subject.
The 2.0 specification states that with output method text, it is implementation-defined how line endings will be represented (More specifically, "A newline character in the instance of the data model MAY be output using any character sequence that is conventionally used to represent a line ending in the chosen system environment.").
The XSLT 1.0 specification says nothing on the matter (which in practice is no different).
Some implementations might use a single newline consistently, some might output exactly what you asked for, some might use the default line ending for the operating system you are running on.
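If downstream code should not depend on which convention a given processor picked, one option is to normalize line endings after the transformation. Below is a minimal sketch in Python; it is not part of either processor, and the function name normalize_newlines and the sample string are made up for illustration. It treats CR LF, a lone CR, and a lone LF each as one line break.

```python
import re

# Match CR LF first so the pair counts as a single line break,
# then fall back to a lone CR or a lone LF.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Replace every CR LF, CR, or LF in `text` with `newline`."""
    # A function is used as the replacement so that `newline`
    # is inserted literally, without regex escape processing.
    return _LINE_BREAK.sub(lambda _match: newline, text)


# Hypothetical text with all three conventions mixed together.
mixed = "a\r\nb\rc\nd"
print(repr(normalize_newlines(mixed)))
print(repr(normalize_newlines(mixed, "\r\n")))
```

Applied to the output of both processors, this should make the results comparable regardless of which line ending each one chose.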
Related
I have the following XML:
<root>
<child value="&#xFF;&#xEF;&#xE0;"/>
</root>
When I do a transform I want the character hex code values to be preserved. So if my transform was just a simple xsl:copy and the input was the above XML, then the output should be identical to the input.
I have read about the saxon:character-representation function, but right now I'm using Saxon-HE 9.4, so that function is not available to me, and I'm not even 100% sure it would do what I want.
I also read about use-character-maps. This seems to solve my problem, but I would rather not add a giant map to my transform to catch every possible character hex code.
<xsl:character-map name="characterMap">
<xsl:output-character character="&#xA0;" string="&amp;#xA0;"/>
<xsl:output-character character="&#xA1;" string="&amp;#xA1;"/>
<!-- 93 more entries... &#xA2; through &#xFE; -->
<xsl:output-character character="&#xFF;" string="&amp;#xFF;"/>
</xsl:character-map>
Are there any other ways to preserve character hex codes?
The XSLT processor doesn't know how the character was represented in the input - that's all handled by the XML parser. So it can't reproduce the original.
If you want to output all non-ASCII characters using numeric character references, regardless of how they were represented in the input, try using xsl:output encoding="us-ascii".
If you really need to retain the original representation - and I can't see any defensible reason why anyone would need to do that - then try Andrew Welch's lexev, which converts all the entity and character references to processing instructions on the way in, and back to entity/character references on the way out.
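Both points can be seen with any XML parser and serializer. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in for the parser and serializer around an XSLT processor; it is not Saxon, and the variable names are made up. Note that ElementTree happens to write decimal references, whereas another serializer may well choose hexadecimal.

```python
import xml.etree.ElementTree as ET

# Two inputs that differ only in how the characters are written.
with_references = '<root><child value="&#xFF;&#xEF;&#xE0;"/></root>'
with_literals = '<root><child value="\u00ff\u00ef\u00e0"/></root>'

root_a = ET.fromstring(with_references)
root_b = ET.fromstring(with_literals)

# After parsing, both attribute values are the same three characters;
# the original spelling is no longer available.
value_a = root_a[0].get("value")
value_b = root_b[0].get("value")
print(value_a == value_b)

# Serializing to an ASCII encoding forces every non-ASCII character
# to be written as a numeric character reference.
ascii_bytes = ET.tostring(root_a, encoding="us-ascii")
print(ascii_bytes)
```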
I have encoding issues similar to those discussed here: cross-encoding XSL transformations
No clean answer was given to those questions, which is why I'm asking again.
I have an XML input file encoded in UTF-8.
I have an XSL transformation to apply to these files, which should generate XML output encoded in Windows-1252.
I have the two declarations below in my XSLT file:
<?xml version="1.0" encoding='Windows-1252'?>
<xsl:output method="text" indent="yes" encoding="Windows-1252"/>
I use Saxon as the XSL processor.
Despite all of that, I still get a fatal error each time a UTF-8 character with no Windows-1252 equivalent is encountered.
Actually, I don't really care about these characters, and my transformation could drop all of them. I just want the transformation to carry on rather than crash because of them.
What am I missing? Why do I still get these fatal errors (Fatal Error! Output character not available in this encoding)?
Thanks in advance for your help.
The message you describe is produced only with the text output method (with XML or HTML, the serializer would use numeric character references). This error is required by the specification (see http://www.w3.org/TR/xslt-xquery-serialization/#TEXT_ENCODING), though I can understand why you might want a gentler fallback, e.g. outputting a substitute character.
If you don't mind a bit of Java coding, it would be easy to substitute your own version of Saxon's TEXTEmitter that does things differently (you only need to override one method); alternatively, you could send the XSLT output to a Java Writer (the encoding would then be ignored), and use the Java I/O framework to convert characters to the required encoding, with whatever handling of invalid characters your application requires.
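The second approach (take the result as characters and do the encoding yourself, with your own policy for unmappable characters) is sketched below in Python rather than Java, purely to show the idea. The function name encode_lossy and the sample text are assumptions; transformed_text stands for whatever string the transformation produced.

```python
def encode_lossy(text: str, encoding: str = "cp1252", policy: str = "ignore") -> bytes:
    """Encode `text`, applying `policy` to characters the encoding lacks.

    "ignore" drops them, "replace" writes "?", and "strict" raises
    UnicodeEncodeError (comparable to the fatal error in the question).
    """
    return text.encode(encoding, errors=policy)


# Hypothetical transformation result: the euro sign exists in
# Windows-1252, the CJK character does not.
transformed_text = "price: 10\u20ac \u4e2d"

print(encode_lossy(transformed_text, policy="ignore"))
print(encode_lossy(transformed_text, policy="replace"))
```

In Java the same choice is exposed through the charset encoder's error actions, so the policy lives in your application code instead of in the serializer.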
UTF-8 can encode a far larger set of characters than Windows-1252.
This means some characters in a UTF-8 document cannot be represented in Windows-1252.
Ask yourself why you need to convert between encodings.
Is it possible to do an XSLT identity transformation where absolutely nothing is changed from the source?
When I use the following template, indentation and line breaks are changed in the output, and I don't want to make any changes to the source XML.
XSLT
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
INPUT
<S:Envelope
xmlns:S="http://www.w3.org/2003/05/soap-envelope"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
xmlns:f123="http://www.fabrikam123.example/svc53">
<S:Header>
<wsa:MessageID>
uuid:aaaabbbb-cccc-dddd-eeee-wwwwwwwwwww
</wsa:MessageID>
<wsa:RelatesTo>
uuid:aaaabbbb-cccc-dddd-eeee-ffffffffffff
</wsa:RelatesTo>
<wsa:To S:mustUnderstand="1">
http://business456.example/client1
</wsa:To>
<wsa:Action>http://fabrikam123.example/mail/DeleteAck</wsa:Action>
</S:Header>
<S:Body>
<f123:DeleteAck/>
</S:Body>
</S:Envelope>
OUTPUT
<?xml version="1.0" encoding="UTF-8"?><S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:f123="http://www.fabrikam123.example/svc53">
<S:Header>
<wsa:MessageID>
uuid:aaaabbbb-cccc-dddd-eeee-wwwwwwwwwww
</wsa:MessageID>
<wsa:RelatesTo>
uuid:aaaabbbb-cccc-dddd-eeee-ffffffffffff
</wsa:RelatesTo>
<wsa:To S:mustUnderstand="1">
http://business456.example/client1
</wsa:To>
<wsa:Action>http://fabrikam123.example/mail/DeleteAck</wsa:Action>
</S:Header>
<S:Body>
<f123:DeleteAck/>
</S:Body>
</S:Envelope>
No, you cannot. The input and output XML will be the "same" in the sense that they produce the same XML Infoset, but they will not necessarily be byte-for-byte identical and this is not something that XSLT can control.
Why do you need this? If you are trying to compare XML documents easily, consider using XML Canonicalization. Many XML libraries have a method of producing canonical XML, and the xmllint command line tool can produce it easily from files.
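As a small illustration of the canonicalization route, here is a sketch using the canonicalize function from Python's standard library (available from Python 3.8). The two sample documents are made up; they differ in attribute order, whitespace inside the start tag, quote style, and empty-element syntax, but carry the same content.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+

# Same content, different lexical form.
doc_a = '<root  b="2"\n       a="1"><item/></root>'
doc_b = "<root a='1' b='2'><item></item></root>"

canonical_a = canonicalize(doc_a)
canonical_b = canonicalize(doc_b)

# The raw strings differ, the canonical forms do not.
print(doc_a == doc_b)
print(canonical_a == canonical_b)
```

Comparing canonical forms of the input and the transformation output sidesteps the byte-for-byte question entirely.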
The default behavior of XSLT processors is to preserve whitespace in the input, and the behavior of the processors I've just tested is consistent with the spec.
But the whitespace in question is whitespace in the text nodes of the input.
The whitespace between attribute-value specifications in start-tags, and the whitespace between items (e.g. comments and processing instructions) in the prolog and epilog of the document are not text nodes, and are not affected by the preserve-space settings. That white space is also, in fact, not part of the XPath data model, so there is very little the processor can legitimately do to preserve it.
If the whitespace in question carries information, you will want to revisit the design of the vocabulary (it's really a bad idea for that whitespace to be significant); if it's just that you would prefer that there be newlines between attribute-value specifications, you may want to write a custom serializer to insert such newlines and indentation on output. (If your motive is to avoid confusing a diff program with whitespace differences, my experience is that your choices are to normalize whitespace before diffing or to get a diff program that's a bit more robust in the face of whitespace variation.) Good luck.
In general it's not possible to be 100% confident that you'll get exactly everything unchanged, because the XSLT data model simply doesn't preserve all the information from the parse. For example, if the input contains &#60; then the output might contain &lt;. Similarly, CDATA sections aren't preserved: adjacent text nodes (CDATA sections and normal text nodes) are merged into one at parse time, and while you can configure the processor to use CDATA for the content of certain elements, you can't simply preserve them as they were.
There are other issues such as the fact that the data model doesn't distinguish between <foo></foo>, <foo/> and <foo /> - they all represent the same empty element and any of them from the input could be represented by any of them in the output. And as in your example white space between attributes within a start tag is not preserved.
But of course these differences are all things that an XML tool shouldn't care about as they're different ways to represent exactly the same infoset.
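The loss happens at parse time, which can be checked with any parser. The sketch below uses Python's xml.etree.ElementTree as a stand-in (an assumption for illustration; an XSLT processor sits on top of its own parser, but the data model has the same gaps).

```python
import xml.etree.ElementTree as ET

# Three spellings of the same empty element.
empty_forms = ["<foo></foo>", "<foo/>", "<foo />"]
parsed = [ET.fromstring(form) for form in empty_forms]

# Once parsed, nothing records which spelling was used, so all three
# serialize identically.
reserialized = {ET.tostring(element, encoding="unicode") for element in parsed}
print(reserialized)

# A CDATA section is merged with the surrounding text into one string.
merged = ET.fromstring("<a>x<![CDATA[<y>]]>z</a>")
print(merged.text)

# A numeric character reference for "<" comes back out as &lt;.
round_trip = ET.tostring(ET.fromstring("<a>&#60;</a>"), encoding="unicode")
print(round_trip)
```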
I'm trying to run an XSLT transformation, but characters like ëöï are replaced by a literal '?' in the output (I checked with a hex editor). The source file has the correct characters, and the stylesheet has:
<xsl:output encoding="UTF-8" indent="yes" method="xml"/>
What else am I missing?
I'm using saxon as the transformer, if that matters.
The problem is most likely in the way you call the transformer. My guess is it will work fine if you call it using java -jar saxon.jar ...
In general, when you use XML tools which take InputStream/OutputStream, then the tools will make sure that the encoding is correct.
When you use a mixture of Streams and Writers, you will have to make sure that the encoding when going from one to the other matches what you told the XSLT processor to produce. Always set encodings explicitly. There may be defaults, but when it comes to encodings, they are wrong more often than not.
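One common way a literal '?' ends up in the output is a character-to-byte conversion that silently substitutes for characters its encoding cannot represent. The sketch below shows the effect in Python (not Java, and not Saxon's own code; the variable names are made up), followed by the explicit-encoding version.

```python
import io

text = "\u00eb\u00f6\u00ef"  # the characters from the question

# A mismatched writer: an encoder that cannot represent the
# characters and is configured to substitute emits "?" instead.
lossy = text.encode("ascii", errors="replace")
print(lossy)

# Stating the encoding explicitly when wrapping a byte stream keeps
# the bytes consistent with an encoding="UTF-8" declaration.
byte_sink = io.BytesIO()
with io.TextIOWrapper(byte_sink, encoding="utf-8", newline="") as writer:
    writer.write(text)
    writer.flush()
    # Read the bytes before the wrapper closes the underlying stream.
    utf8_bytes = byte_sink.getvalue()
print(utf8_bytes)
```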
I am working on an XSLT transformation to re-arrange XML blocks to validate NewsML files. Some of these files contain escaped characters (such as &amp;, &quot;, etc.). The problem is that the XSLT transformation converts these to their literal characters (i.e. "&", "'"). This is causing problems. I do not want this to happen.
I have experimented with various techniques (uses of <xsl:text>, <xsl:value-of> and the disable-output-escaping flag, <xsl:output method='xml|html|xhtml|text'>) to no avail. These methods either convert the characters or simply leave them out.
e.g., a string which starts with "stars on PM&apos;s cards" can end up as
stars on PM's cards
stars on PMs cards
I am using the Saxonica (http://www.saxonica.com/) processing app.
The basic XSLT I am using is provided below. (There are other things but the problem exists even with this simplest stylesheet)
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no" />
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Any ideas on how to prevent this conversion would be most appreciated. The requirement is to keep the original text as it appears.
I think you need to do both the disable-output-escaping="yes" and set the document to HTML at the same time.
FROM W3C (emphasis mine):
It is an error for output escaping to be disabled for a text node that is used for something other than a text node in the result tree. Thus, it is an error to disable output escaping for an xsl:value-of or xsl:text element that is used to generate the string-value of a comment, processing instruction or attribute node; it is also an error to convert a result tree fragment to a number or a string if the result tree fragment contains a text node for which escaping was disabled. In both cases, an XSLT processor may signal the error; if it does not signal the error, it must recover by ignoring the disable-output-escaping attribute.
The disable-output-escaping attribute may be used with the html output method as well as with the xml output method. The text output method ignores the disable-output-escaping attribute, since it does not perform any output escaping.
An XSLT processor will only be able to disable output escaping if it controls how the result tree is output. This may not always be the case. For example, the result tree may be used as the source tree for another XSLT transformation instead of being output. An XSLT processor is not required to support disabling output escaping. If an xsl:value-of or xsl:text specifies that output escaping should be disabled and the XSLT processor does not support this, the XSLT processor may signal an error; if it does not signal an error, it must recover by not disabling output escaping.
If output escaping is disabled for a character that is not representable in the encoding that the XSLT processor is using for output, then the XSLT processor may signal an error; if it does not signal an error, it must recover by not disabling output escaping.
Since disabling output escaping may not work with all XSLT processors and can result in XML that is not well-formed, it should be used only when there is no alternative.
These are entities. Usually they get mapped to a unicode representation of that entity. The final stream will just contain the characters. If you output the stream it's up to the serializer to escape the characters depending on the output type (which is what you can disable with disable-output-escaping). So a proper serializer should turn this
<xsl:output method="html" encoding="UTF-8"/>
<xsl:text>some test</xsl:text>
into
some test
See section 5 on this article.
So I would check that with your XSLT processor first.
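To make the division of labor concrete (the parser expands entity references, the serializer escapes only what the output format requires), here is a minimal sketch with Python's xml.etree.ElementTree. It is a stand-in for illustration, not the Saxon pipeline, and the sample string is assumed from the question.

```python
import xml.etree.ElementTree as ET

source = "<p>stars on PM&apos;s cards &amp; more</p>"
element = ET.fromstring(source)

# The parser has already replaced both entity references
# with the characters they stand for.
print(element.text)

# The serializer escapes only what XML text content requires:
# "&" must be escaped, an apostrophe need not be.
print(ET.tostring(element, encoding="unicode"))
```

The apostrophe comes back as a plain character because, in text content, both spellings mean the same thing to any XML consumer.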