I'm trying to canonicalize the representation of some XML data by sorting each element's attributes by name (not value). The idea is to keep textual differences minimal when attributes are added or removed and to prevent different editors from introducing equivalent variants. These XML files are under source control and developers are wanting to diff the changes without resorting to specialized XML tools.
I was surprised to not find an XSL example of how to this. Basically I want just the identity transform with sorted attributes. I came up with the following with seems to work in all my test cases:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="*|/|text()|comment()|processing-instruction()">
<xsl:copy>
<xsl:for-each select="#*">
<xsl:sort select="name(.)"/>
<xsl:copy/>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
As a total XSL n00b I would appreciate any comments on style or efficiency. I thought it might be helpful to post it here since it seems to be at least not a common example.
With xslt being a functional language doing a for-each might often be the easiest path for us humans but not the most efficient for XSLT processors since they cannot fully optimize the call.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="#*">
<xsl:sort select="name()"/>
</xsl:apply-templates>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="#*|comment()|processing-instruction()">
<xsl:copy />
</xsl:template>
</xsl:stylesheet>
This is totally trivial in this regards though and as a "XSL n00b" i think you solved the problem very well indeed.
Well done for solving the problem. As I assume you know the order or attributes is unimportant for XML parsers so the primary benefit of this exercise is for humans - a machine will re-order them on input or output in unpredictable ways.
Canonicalization in XML is not trivial and you would be well advised to use the canonicalizer provided with any reasonable XML toolkit rather than writing your own.
Related
This is a follow up question to
how to get 'excel' new lines in spreadsheetML (MSXSLT)
but asked as a new question, to separate this into different issue, as the behaviour seems to be different between engines (I'll leave the specific context in the other question, this is purely how to achieve some functional result).
This XSLT (in saxon he) will create what I want.
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<root>
<bar>
<xsl:text disable-output-escaping="yes"> </xsl:text>
</bar>
</root>
</xsl:template>
</xsl:stylesheet>
and gives the output
<root>
<bar>
</bar>
</root>
this one wont:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
version="1.0">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="foo">
<bar>
<xsl:text disable-output-escaping="yes"> </xsl:text>
</bar>
</xsl:variable>
<root>
<xsl:copy-of select="exsl:node-set($foo)"/>
</root>
</xsl:template>
</xsl:stylesheet>
it gives
<bar> </bar>
(the question is about XSLT 1.0 but interestingly XSLT 3.0 can be made to work like this
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="foo">
<bar>
<xsl:text disable-output-escaping="yes"> </xsl:text>
</bar>
</xsl:variable>
<root>
<xsl:sequence select="$foo"/>
</root>
</xsl:template>
</xsl:stylesheet>
whilst
<xsl:copy-of select="$foo"/>
doesnt. Even following the 'sequence' pattern, I don't seem to be able to preserve non escaping in anything but a non trivial xslt - I've got a complex transformation using call-templates/apply-templates etc, and I think understanding how nodes are interpreted and serialised is not trivial)
There's actually a long history to this question, which was known in the working group as the "sticky d-o-e problem" (d-o-e being disable-output-escaping). The question is, does d-o-e have any effect when writing to a temporary tree (an xsl:variable), or is it only effective when writing to serialized output?
The XSLT 1.0 specification is pretty clear on the matter:
It is an error for output escaping to be disabled for a text node that
is used for something other than a text node in the result tree. Thus,
it is an error to disable output escaping for an xsl:value-of or
xsl:text element that is used to generate the string-value of a
comment, processing instruction or attribute node; it is also an error
to convert a result tree fragment to a number or a string if the
result tree fragment contains a text node for which escaping was
disabled. In both cases, an XSLT processor may signal the error; if it
does not signal the error, it must recover by ignoring the
disable-output-escaping attribute.
XSLT 2.0 deprecated d-o-e, but retained the rule in a slightly different form:
This [property], however, can be set only within a final result tree
that is being passed to the serializer.
But in between those two versions, the working group dithered. The XSLT 1.1 working draft (which never became a recommendation, but was popularised by the first version of my XSLT book) says:
When a root node is copied using an xsl:copy-of element ... and
escaping was disabled for a text node descendant of that root node,
then escaping should also be disabled for the resulting copy of that
text node. For example
<xsl:variable name="x">
<xsl:text disable-output-escaping="yes"><</xsl:text>
</xsl:variable>
<xsl:copy-of select="$x"/>
This is the "sticky d-o-e" - the d-o-e property is attached to the text node in the temporary tree and springs into life when the text node is eventually serialized. So this behaviour was endorsed at some stage in the life of XSLT, and you may be using a processor that implements this version of the spec.
Generally, though, try to forget that d-o-e exists. Whatever the problem, it's not the best solution. It's an incredibly messy feature because it requires a breaking of the architectural boundary between the transformation processor and the serializer, and breaking this boundary leads to close coupling of the transformation and serialization, and prevents you reusing the same code in a different pipeline configuration.
I'm afraid that researching the history of the W3C spec on this is rather easier than researching exactly what was implemented in early versions of Saxon (which are now nearly a quarter of a century old).
So to take the information from Michael Kay's answer which explains how the specification for XSLT 1.0 handles this, then we CAN implement a solution, for this.
So we take a recap of the underlying issue.
Excel spreadsheetML requires data to be formatted with the specific chars "
" to interpret a line feed in a cell (but this solution applies generally).
<Cell>Alpha
Bravo
Charlie</Cell>
If we try to write an XSLT to generate this, lets say naively:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<Cell>
<xsl:text>Alpha</xsl:text>
<xsl:text> </xsl:text>
<xsl:text>Bravo</xsl:text>
<xsl:text> </xsl:text>
<xsl:text>Charlie</xsl:text>
</Cell>
</xsl:template>
</xsl:stylesheet>
our
will get delimited and we get this
<Cell>Alpha Bravo Charlie</Cell>
this (thanks to the answer on how to get 'excel' new lines in spreadsheetML (MSXSLT)) can be fixed by using
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<Cell>
<xsl:text>Alpha</xsl:text>
<xsl:text disable-output-escaping="yes"> </xsl:text>
<xsl:text>Bravo</xsl:text>
<xsl:text disable-output-escaping="yes"> </xsl:text>
<xsl:text>Charlie</xsl:text>
</Cell>
</xsl:template>
</xsl:stylesheet>
which produces this:
<Cell>Alpha
Bravo
Charlie</Cell>
unfortunately this 'breaks' if you process your output document via some intermediary internal document e.g. even this:
<xsl:stylesheet version="1.0"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
exclude-result-prefixes="msxsl">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="output">
<Cell>
<xsl:text>Alpha</xsl:text>
<xsl:text disable-output-escaping="yes"> </xsl:text>
<xsl:text>Bravo</xsl:text>
<xsl:text disable-output-escaping="yes"> </xsl:text>
<xsl:text>Charlie</xsl:text>
</Cell>
</xsl:variable>
<xsl:copy-of select="msxsl:node-set($output)"/>
</xsl:template>
</xsl:stylesheet>
reverts to:
<Cell>Alpha Bravo Charlie</Cell>
because (see Michael Hay's answer) the disable-output-escaping attribute gets ignored if its passed through some internal document (i.e. the variable).
So...how can you get around this?
If you generate a token for the LF, you can then construct your psuedo excel output almost in its entirety except you use a custom element to flag the LF char, and then you can process that DIRECTLY into the result tree and interpret the custom element as an unescaped "
"
so this:
<xsl:stylesheet version="1.0"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:kookerella="kookerella.com"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
exclude-result-prefixes="msxsl kookerella">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="output">
<Cell>
<xsl:text>Alpha</xsl:text>
<kookerella:LF/>
<xsl:text>Bravo</xsl:text>
<kookerella:LF/>
<xsl:text>Charlie</xsl:text>
</Cell>
</xsl:variable>
<!-- process data directly into the result tree only -->
<xsl:apply-templates select="msxsl:node-set($output)" mode="injectLF"/>
</xsl:template>
<!-- Inject LF -->
<xsl:template match="#* | node()" mode="injectLF">
<xsl:copy>
<xsl:apply-templates select="#* | node()" mode="injectLF"/>
</xsl:copy>
</xsl:template>
<xsl:template match="kookerella:LF" mode="injectLF">
<xsl:text disable-output-escaping="yes"> </xsl:text>
<xsl:apply-templates select="#* | node()" mode="injectLF"/>
</xsl:template>
</xsl:stylesheet>
now results in:
<Cell>Alpha
Bravo
Charlie</Cell>
P.S.
as an aside, this seems to work for me in both the various MSXSLT and Saxon HE, but I have had an instance of using the MSXSLT engine where even this doesnt work, presumably due to some configuration out output serialisation issue.
I am using XSLT 1.0, and using xsltproc on OS X Yosemite.
The source content is HTML; the target content is XML.
The issue is a fairly common one. I want all "uninteresting"
nodes simply to be discarded from the output. I've seen catch-all
directives like this:
<xsl:template match="node()|script"/>
<xsl:template match="*">
<xsl:apply-templates/>
</xsl:template>
This is close to what I need. But unfortunately, it's too strong when I need to add another template that visits one of the text nodes caught by node(). For example, suppose I added this template:
<xsl:template match="a/div[#class='location']/br">
<xsl:text> </xsl:text>
</xsl:template>
which simply replaces certain <br/> elements with spaces.
Well, node() precludes this latter template from taking effect,
because the relevant text node containing the line-break is discarded
already!
Well, to correct the issue, here's what I have done in lieu of the catch-all node():
<xsl:template match="html/head|div[#id='banner_parent']|button|ul|div[#id='feed_title']|span|div[#class='submit_event']|script"/>
But this is precisely the problem: I am now piecing together a template
whose matching criteria is likely to be error-prone when the source
content changes.
Is there a simpler directive that would accomplish the same thing? I'm aiming for something like this:
<xsl:template match="node()[not(locations)]|script"/>
Thanks.
If i understood correctly, you want only some nodes in the output and the rest you dont care abour, in this example I try to catch only li elements and throw the rest away.. not sure if this is what you want though http://xsltransform.net/gWmuiKk
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="html" doctype-public="XSLT-compat" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
<!-- Lets pretend li is interesting for you -->
<xsl:template match="li">
<xsl:text>Interesting Node Only!
</xsl:text>
</xsl:template>
<xsl:template match="#*|node()">
<xsl:apply-templates select="#*|node()"/>
</xsl:template>
</xsl:transform>
My question is related to to another poster's StackOverflow question on Two Phase Processing. I didn't want to use mode="#all" without fully understanding it and how it could affect the rest of my XSLT. I'm thinking the below code accomplishes the same thing without risking interference with other templates but would like confirmation. It kind of seems like I am processing $completepolicy twice without need to do so.
Empty tag definition: <field/> <field></field>. Tags can have attributes but there will never be an empty tag that has an attribute. There will also never be nodes with <field> </field> where the white space could represent many other things.
Given this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<!-- many other apply-templates here -->
<xsl:variable name="completepolicy" as="element()">
<holder>
<TABLE1 type="global">
<col1>Red</col1>
<col2/>
</TABLE1>
<TABLE2>
<field1>Blue</field1>
<field2/>
</TABLE2>
</holder>
</xsl:variable>
<xsl:apply-templates mode="emptytags" select="$completepolicy/*"/>
</xsl:template>
<xsl:template match="*[not(node())]" mode="emptytags"/>
<xsl:template match="node() | #*" mode="emptytags">
<xsl:copy>
<xsl:apply-templates select="node() | #*" mode="#current"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Results in this output for $completepolicy:
<TABLE1 type="global">
<col1>Red</col1>
</TABLE1>
<TABLE2>
<field1>Blue</field1>
</TABLE2>
Why do you think the $completepolicy variable is being processed twice? This cannot be seen in the provided code.
I confirm that the provided code looks good to me.
I would recommend never to use mode="#all". This is too powerful and dangerous -- this is almost never needed.
I have some XML like
<ITEM><TITLE>Video: High Stakes for <KW>Obama</KW> & Romney</TITLE></ITEM>
which I would like to translate to the below XML using XSLT.
<ITEM><TITLE>Video: High Stakes for <KW>Obama</KW> & Romney</TITLE></ITEM>
I tried <xls:copy-of select="./TITLE"/> but that is not escaping the XML tags. Any idea?
These sorts of questions always attract a lot of competing answers, because the best answer depends on how generalized you need it to be, and that is not clear from your question.
If you want a super-generalised solution, I will leave that to the regular XSLT SO stars. Or you could just search SO for questions which ask how to serialize/ or stringize XML. There are plenty.
I will offer you a very simple but specific solution, for a very narrow interpretation of your question. For this I will assume that:
You are just interested in serializing the content of the TITLE element.
The element children of TITLE' (such asKW`) are restricted, in that they have no children and no attributes, and are not in a namespace.
With this interpretation, this input document...
<ITEM>
<TITLE>Video: High Stakes for <KW>Obama</KW> & Romney</TITLE>
</ITEM>
...transformed by this XSLT 1.0 style-sheet (works just the same in 2.0)...
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*" />
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="TITLE/*">
<xsl:value-of select="concat('<',name(),'>',.,'</',name(),'>')" />
</xsl:template>
</xsl:stylesheet>
...yields...
<ITEM>
<TITLE>Video: High Stakes for <KW>Obama</KW> & Romney</TITLE>
</ITEM>
XSLT doesn't have a specific, direct feature to destruct markup to text (and you have to think well if this is really useful in your case).
This said, here is a short solution using a tiny C# extension function (a similar extension function can easily be written for most other PLs and platforms):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:my="my:my" exclude-result-prefixes="msxsl my">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="/*/TITLE/*">
<xsl:value-of select="my:stringize(.)"/>
<xsl:apply-templates/>
</xsl:template>
<msxsl:script language="c#" implements-prefix="my">
public string stringize(XPathNavigator doc)
{
return doc.OuterXml;
}
</msxsl:script>
</xsl:stylesheet>
When this transformation is applied to the provided XML document:
<ITEM><TITLE>Video: High Stakes for <KW>Obama</KW> & Romney</TITLE></ITEM>
the wanted, correct result is produced:
<ITEM>
<TITLE>Video: High Stakes for <KW>Obama</KW> & Romney</TITLE>
</ITEM>
Do note:
This solution flattens out to text the whole subtree rooted in a TITLE element that is a child of the top element of the XML document -- regardless of depth -- which the other posted answer doesn't.
This solution correctly flattens elements with attributes or empty elements -- which the other posted answer -- doesn't.
The typicle XSL usage is:
XML1.xml -> *transformed using xsl* -> XML2.xml
How does an XSL document look like, if I want to simply mirror the input data?
ex:
XML1.xml -> *transformed using xsl* -> XML1.xml
How does an XSL document look like, if
I want to simply mirror the input
data?
There are more than one answers to this question, however all of them could be named "Identity Transform":
<xsl:copy-of select="/"/> This is the shortest, simplest, most efficient and most inflexible, non-extensible and unuseful identity transform.
The classical identity rule, which everybody knows (or should know):
_
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This is still very short, one-template transformation, which is so much more extensible and useful identity transform, known also as the "identity rule". Using and overriding the identity transform is the most fundamental and powerful XSLT design pattern, allowing to solve common copy and replace/rename/delete/add problems in just a few lines. Maybe 90%+ of all answers in the xslt tag use this form of the identity transform.
.3. The fine-grained control identity rule, which everybody should know (and very few people know):
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="#*|node()[1]"/>
</xsl:copy>
<xsl:apply-templates select="following-sibling::node()[1]"/>
</xsl:template>
</xsl:stylesheet>
This is similar to the generally known identity rule defined at 2. above, but it provides a finer control over the XSLT processing.
Typically with 2. the <xsl:apply-templates select="#*|node()"> triggers a number of transformations (for all attributes and child nodes), that can be done in any order or even in parallel. There are tasks where we don't want certain types of nodes to be processed after some other nodes, so we have to plumb the leakage of the identity rule with overriding it with empty templates matching the unwanted nodes and adding other templates in a specific mode to process these nodes "when the time comes"...
.3. is more appropriate for tasks where we want more control and really sequential-type processing.
Some tasks that are very difficult to solve with 2. are easy using 3.
It would look like the identity transform:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This is one of the most fundamental XSLT transforms. It matches any attribute or other node, copies what it matches, and then applies itself to all attributes and child nodes of the matched node.
This turns out to be quite powerful for other tasks, too. A common requirement is to copy most of a source file unchanged, while handling certain elements in a special way. This can be solved using the identity transform plus one template for the special nodes. It's a generic, flexible (and short) solution.
This matches every element or attribute and recursively applies the template.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="* | #*">
<xsl:copy>
<xsl:copy-of select="#*"/>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>