Why does Saxon delete blank lines in an identity transform? - xslt

Consider this "identity" transform:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output encoding="UTF-8" method="xml" indent="yes" media-type="application/xml"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
and this input XML:
<?xml version="1.0" encoding="UTF-8"?>
<Foobar xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:transform version="2.0">
<!-- Parameters -->
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>
<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>
<xsl:template match="/*">
</xsl:template>
</xsl:transform>
</Foobar>
Why does SaxonJ-HE 11.3 delete the blank lines?
Here's a diff showing what I'm talking about:
$ saxon -xsl:transform.xsl -s:input.xml | diff -u input.xml -
--- input.xml 2022-06-16 16:26:41.000000000 -0400
+++ - 2022-06-16 16:28:42.000000000 -0400
## -6,12 +6,9 ##
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>
-
<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>
-
<xsl:template match="/*">
</xsl:template>
-
</xsl:transform>
</Foobar>

It's quite challenging to find an indentation algorithm that both (a) preserves existing whitespace in the source document, and (b) produces nice-looking output. For example, consider what happens when a template rule processes all children of an element (both element children and whitespace text node children) with an xsl:sort on an attribute value; if all whitespace from this output sequence is preserved, this will tend to put a massive wadge of whitespace at the start of the output sequence, which looks pretty ugly. This can also happen if you apply-templates to all children, but delete some of the elements while leaving the text nodes unchanged. So the spec allows the processor not only to add whitespace for indentation, but to merge ("elide") this with existing whitespace.
In particular, it's a reasonable assumption to make that if you get multiple blank lines in the result tree, they weren't put there deliberately, but arrived by accident as a result of copying multiple whitespace nodes from the input.
What's actually happening in this particular case is as follows:
For comments, the rules are different depending on whether the comment follows a start tag or an end tag. The first comment follows a start tag, and in this case the accumulated whitespace is output as-is, followed by the comment with no further indentation. The second comment follows an end tag (actually an empty element tag), and in this case the comment is indented according to its hierarchic level in the result tree, and any preceding whitespace in the result tree is discarded.
Before a start tag, indentation is added if the start tag immediately follows another start tag or end tag; if it follows a text node, no identation is added. This rule is designed primarily to make mixed content work properly.
Before an end tag, indentation is added if it follows another end tag, but not if it follows a start tag or character data.
The detail is a lot more complex, and it has evolved in a fairly ad-hoc way to cope reasonably well with a wide variety of circumstances. As a high-level summary, Saxon will in some circumstances output the whitespace that it finds in the result tree, and in other circumstances it will output its own whitespace in preference. The algorithm isn't perfect, but it copes reasonably well with messy situations like when the input is indented with 4 spaces and the output is to be indented with 3.

Related

Whitespace in xsl:text elements

I have following stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:text>1</xsl:text>
<xsl:text>
</xsl:text>
<xsl:text>2</xsl:text>
</xsl:template>
</xsl:stylesheet>
When running this with Saxon 9.8, I get following result:
1
2
When running this with MSXML 6.0, the whitespace is stripped and I get:
1 2
What is the correct behavior? Is the whitespace here supposed to be stripped?
This is to do with whitespace striping in the XSLT document. According to the W3C specification (for XSLT 1.0, which is what MSXML uses)
A text node is preserved if any of the following apply:
The element name of the parent of the text node is in the set of
whitespace-preserving element names.
The text node contains at least one non-whitespace character. As in
XML, a whitespace character is #x20, #x9, #xD or #xA.
An ancestor element of the text node has an xml:space attribute with a
value of preserve, and no closer ancestor element has xml:space with a
value of default.
It then says "For stylesheets, the set of whitespace-preserving element names consists of just xsl:text."
So, is looks like MSXML is not following the specification.
However, if you add xml:space="preserve" to the xsl:text in question, you might find it does work in MSXML
<xsl:template match="/">
<xsl:text>1</xsl:text>
<xsl:text xml:space="preserve">
</xsl:text>
<xsl:text>2</xsl:text>
</xsl:template>
The correct behaviour is as you see it from Saxon.
There's some history here and I don't remember the full details, but MSXML has a nasty habit of stripping whitespace text nodes within the parser itself. If the XML parser strips out the whitespace text nodes, then they never get as far as the XSLT processor, so it makes no difference whether that conforms to all the XSLT rules or not.
I'm pretty sure there are options in MSXML to control this behaviour, so check exactly how you are invoking the MSXML parser and change the options if necessary.

Specify Exceptions in XSLT for Removing Empty Tags

We have the following XSLT that removes empty tags on all levels.
We also want to specify some exceptions. It can be either a specific tag to avoid, or a parameterized list of tags to avoid. For instance, if we encounter a tag called SpecialTag, then that tag should be allowed to be empty. Everything underneath that tag can still get trimmed. Is that possible?
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes" omit-xml-declaration="yes"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(descendant-or-self::*[text()[normalize-space()] | #*])]"/>
</xsl:stylesheet>
Here are two ways to do this. Based on your description, this involves comparing against a list of element names.
XSLT 1.0
XSLT 1.0 is more limited in its featureset -- specifically affecting us here, we cannot compare against variables in template match expressions. Consequently, we take a straightforward brute-force approach.
<!-- Test first that the name doesn't match anything we want to keep.
Just add more conditions as needed.
Put the main test at the end (that checks for attributes or text). -->
<xsl:template match="*
[local-name() != 'SpecialTag']
[local-name() != 'OtherSpecialTag']
[local-name() != 'ThirdSpecialTag']
['Add other conditions here.']
[not(descendant-or-self::*[text()[normalize-space()] | #*])]
"/>
XSLT 2.0
In XSLT 2.0, we're allowed to compare against variables in template match expressions. Here's one possible way of using a variable, which might look cleaner or more readable than the XSLT 1.0 approach.
<xsl:variable name="Specials">
<item>SpecialTag</item>
<item>OtherSpecialTag</item>
<item>ThirdSpecialTag</item>
<!-- Add other tag names here. -->
<item>...</item>
</xsl:variable>
<!-- Test first that the name doesn't match anything we want to keep.
Note the syntax - we say `not()` rather than using `!=`,
because `!=` evaluates to true if *any* <item> value doesn't match
the name of the current context element.
Meanwhile, `not(... = ...)` only evaluates to true if *all* of the
<item> values don't match the name of the current context element.
Put the main test at the end (that checks for attributes or text). -->
<xsl:template match="*
[not(local-name() = $Specials/item)]
[not(descendant-or-self::*[text()[normalize-space()] | #*])]
"/>
Note
The functions name() and local-name() produce the same values if the elements don't have an explicit prefix. If they do have prefixes, name() includes the prefix, while local-name() does not. Adjust the sample code above accordingly to match your use case.

XPath filter not working on XSL

I have the following XML
<?xml version="1.0"?>
<people><human><sex>male</sex><naxme>Juanito</naxme>
</human>
<human><sex>female</sex><naxme>Petra</naxme></human>
<human><sex>male</sex><naxme>Anaximandro</naxme></human>
</people>
and this XSL
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="utf-8" indent="no"/>
<xsl:template match="/people/human[sex='male']">
<xsl:value-of select="naxme"/>
</xsl:template>
</xsl:stylesheet>
I'm expecting it to filter out the female, and it kind of works but I get odd values for the non-matching nodes:
Juanito
femalePetra
Anaximandro
I'm expecting the same output as with
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="utf-8" indent="no"/>
<xsl:template match="/people/human">
<xsl:if test = "sex='male'">
<xsl:value-of select="naxme"/>
</xsl:if> </xsl:template>
</xsl:stylesheet>
Thanks!
I'll expand on Daniels answer which covers a little of the why.
The reason you are getting two different outputs comes down to the built-in template rules, and how nodes and text are treated by default.
Summarising that link, if no other template exists, there are default templates that ensure every node - be it element, text, attribute, comment - will be encoutered, just to ensure that other nodes that do have rules can be processed correctly.
With this XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/people/human[sex='male']">
<xsl:value-of select="naxme"/>
</xsl:template>
</xsl:stylesheet>
you have an explicit rule that says:
If you find a node that matches the XPath /people/human[sex='male] do this template.
Along with the default rule:
Find all nodes, and then process all of their children. If it is text, just output the text.
This default rule is why your template is being processed, since you have no explicit rule for the root node - / - it any every child and grandchild node are processed using the default rules, unless another exists. As such, each node is traversed using the defaults, except for the nodes matching /people/human[sex='male]. The result of this is that when you have a node that is "female" the text is being spat out instead of ignored.
However, contrast this with:
<xsl:template match="/people/human">
<xsl:if test = "sex='male'">
<xsl:value-of select="naxme"/>
</xsl:if>
</xsl:template>
Where the rule becomes:
If you find a node that matches the XPath /people/human do this template.
It just so happens that in that template, you have an extra condition that says, if it is male, then process it in some way, with no other conditions, so if a "female" node is encountered it is now blank in the output.
Lastly, the reason why Daniels answer works but could easily break, is that it changes the rule for processing text. Instead of now copying all text as in the default rules, it outputs nothing (as per the empty template. However, if you had any other templates which used xsl:apply-templates to process text and were expecting text, they would also now output nothing.
It's probably because of XSLT's built-in template rules. Try adding this template:
<xsl:template match="text()"/>

XSLT: need alternative to document()-function for multi-source processing

I'm adapting an XSLT from a third party which transforms an arbitrary number of XMLs into a single HTML document. It's a pretty complex script and it will be revised in the future, so I'm trying to do a minimal adaptation in order to get it to work for our needs.
The following is a stripped down version of the XSLT (containing the essentials):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="text" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:param name="files" select="document('files.xml')//File"/>
<xsl:param name="root" select="document($files)"/>
<xsl:template match="/">
<xsl:for-each select="$root/RootNode">
<xsl:apply-templates select="."/>
</xsl:for-each>
</xsl:template>
<xsl:template match="RootNode">
<xsl:for-each select="//Node">
<xsl:text>Node: </xsl:text><xsl:value-of select="."/><xsl:text>, </xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Now files.xml contains a list of all the URLs of the files to be included (in this case the local files file1.xml and file2.xml). Because we want to read XMLs from memory rather than from disk, and because the invocation of the XSLT only allows for a single XML source, I have combined the two files in a single XML document. The following is a combination of two files (there may be more in a real situation)
<?xml version="1.0" encoding="UTF-8"?>
<TempNode>
<RootNode>
<Node>1</Node>
<Node>2</Node>
</RootNode>
<RootNode>
<Node>3</Node>
<Node>4</Node>
</RootNode>
</TempNode>
where the first RootNode originally resided in file1.xml and the second in file2.xml.
Due to the complexity of the actual XSLT, I've figured that my best shot is to try to alter the $root-param. I've tried the following:
<xsl:param name="root" select="/TempNode"/>
The problem is this. In the case of <xsl:param name="root" select="document($files)"/>, the XPath expression "//Node" in <xsl:for-each select="//Node"> selects the Node's from file1.xml and file2.xml independently, i.e. producing the following (desired) list:
Node: 1, Node: 2, Node: 3, Node: 4,
However, when I combine the content of the two files into a single XML and parse this (and use the suggested $root-definition), the expression "//Node" will select all Node's that are children of the TempNode. (In other words, the desired list, as represented above, is produced twice due to the combination with the outer <xsl:for-each select="$root/RootNode"> loop.)
(A side note: as observed in comment a) in this page, document() apparently changes the root node, perhaps explaining this behavior.)
My question becomes:
How can I re-define $root, using the combined XML as source instead of a multi-source through document(), so that the list is only produced once, without touching the remainder of the XSLT? It's like if $root defined using the document()-function, there is no common root node in the param. Is it possible to define a param with two "separate" node trees?
Btw: I've tried defining a document like this
<xsl:param name="root">
<xsl:for-each select="/TempNode/*">
<xsl:document>
<xsl:copy-of select="."/>
</xsl:document>
</xsl:for-each>
</xsl:param>
thinking it might solve the problem, but the "//Node" expression still fetches all the Nodes. Is the context node in the <xsl:template match="RootNode">-template actually somewhere in the input document and not the param? (Honestly, I'm pretty confused when it comes to context nodes.)
Thanks in advance!
(Updated more)
OK, some of the problem is becoming clear. First, just to make sure I understand, you aren't actually passing parameters for $files and $root to the XSLT processor invocation, right? (They might as well be variables rather than params?)
Now to the main issues... In XPath, when you evaluate an expression that begins with "/" (including "//"), the context node is ignored [mostly]. Therefore, when you have
<xsl:template match="RootNode">
<xsl:for-each select="//Node">
the matched RootNode is ignored. Maybe you wanted
<xsl:template match="RootNode">
<xsl:for-each select=".//Node">
in which the for-each would select Node elements that are descendants of the matched RootNode? This would fix your problem of generating the desired node list twice.
I inserted [mostly] above because I recalled that an "absolute location path" starts from "the root node of the document containing the context node". So the context node does affect what document is used for "//Node". Maybe that's what you intended all along? I guess I was slow to catch on to that.
(A side note: as observed in comment
a) in this page, document() apparently
changes the root node, perhaps
explaining this behavior.)
Or more precisely,
An absolute location path ["/..."]
followed by a relative location
path... selects the set of nodes that
would be selected by the relative
location path relative to the root
node of the document containing the
context node.
document() doesn't actually change anything, in the sense of side effects; rather, it returns a set of nodes contained (usually) by different documents than the primary source document. XSLT instructions like xsl:apply-templates and xsl:for-each establish new values for the context node inside the scope of their template bodies. So if you use xsl:apply-templates and xsl:for-each with select="document(...)/...", the context node inside the scope of those instructions will belong to an external document, so any use of "/..." as an XPath will start from that external document.
Updated again
How can I re-define $root, using the
combined XML as source instead of a
multi-source through document(), so
that the list is only produced once,
without touching the remainder of the
XSLT?
As #Alej hinted, it's really not possible given the above constraint. If you're selecting "//Node" in each iteration of the loop over "$root/RootNode", then in order for each iteration not to select the same nodes as the other iterations, each value of "$root/RootNode" must be in a different document. Since you're using the combined XML source, instead of a multi-source, this is not possible.
But if you don't insist that your <xsl:for-each select="//..."> XPath expression cannot change, it becomes very easy. :-) Just put a "." before the "//".
It's like if $root defined using the document()-function, there is no common root node
in the param.
The value of the param is a node-set. All nodes in the set may be contained in the same document, or they may not, depending on whether the first argument to document() is a nodeset or just a single node.
Is it possible to define a param with two "separate" node trees?
I believe by "separate", you mean "belonging to different documents"? Yes it is, but I don't think you can do it in XSLT 1.0 unless you're selecting nodes that belong to different documents in the first place.
You mentioned trying
<xsl:param name="root">
<xsl:for-each select="/TempNode/*">
<xsl:document>
<xsl:copy-of select="."/>
</xsl:document>
</xsl:for-each>
</xsl:param>
but <xsl:document> is not defined in XSLT 1.0, and your stylesheet says version="1.0". Do you have XSLT 2.0 available? If so, let us know and we can pursue this option. To be honest, <xsl:document> is not familiar territory for me. But I'm happy to learn along with you.
You can apply only nodes you need:
Input:
<?xml version="1.0" encoding="UTF-8"?>
<TempNode>
<RootNode>
<Node>1</Node>
<Node>2</Node>
</RootNode>
<RootNode>
<Node>3</Node>
<Node>4</Node>
</RootNode>
</TempNode>
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
<xsl:output method="html" indent="yes"/>
<xsl:template match="/">
<xsl:copy>
<xsl:apply-templates select="TempNode/RootNode"/>
</xsl:copy>
</xsl:template>
<xsl:template match="RootNode">
<xsl:value-of select="concat('RootNode-', generate-id(.), '
')"/>
<xsl:apply-templates select="Node"/>
</xsl:template>
<xsl:template match="Node">
<xsl:value-of select="concat('Node', ., '
')"/>
</xsl:template>
</xsl:stylesheet>
Output:
RootNode-N65540
Node1
Node2
RootNode-N65549
Node3
Node4

XSLT apply-template with mode - Incorrect result with no matching mode

Here's a simple case.
Here's my XML:
<?xml version="1.0" encoding="utf-8" ?>
<dogs>
<dog type="Labrador">
<Name>Doggy</Name>
</dog>
<dog type="Batard">
<Name>Unknown</Name>
</dog>
</dogs>
This XML is used with two Xslt. This is the common one:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
<xsl:output method="text"/>
<xsl:template match="dogs">
<xsl:text>First template
</xsl:text>
<xsl:apply-templates select="." mode="othertemplate" />
</xsl:template>
</xsl:stylesheet>
This is the child one:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
<xsl:include href="transform.xslt"/>
<xsl:template match="dogs" mode="othertemplate">
<xsl:text> Other template</xsl:text>
</xsl:template>
</xsl:stylesheet>
The child includes the common one (called transform.xslt).
When I execute the child, I get the expected result:
First template
Other template
When I execute the common one, I get this strange results:
First template
Doggy
Unknown
The common one applies a template with the mode "othertemplate". This mode is only included, some times, in the child xslt.
I want that, if there's no template with mode "othertemplate", then nothing should be outputted.
I don't want to include a template with mode "othertemplate" with empty body for all xslt files that does not have to use this template mode...
What should I do?
Thanks
Built-in Template Rules in XSLT
The element contents and the extra whitespace appear because of XSLT's built-in template rules, also known as default templates. These rules are applied when there is no other matching template. The built-in template rules are
<xsl:template match="*|/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="text()|#*">
<xsl:value-of select="."/>
</xsl:template>
<xsl:template match="processing-instruction()|comment()"/>
Built-in template rules process root and element nodes recursively and copy text (and attribute values if attribute nodes are selected). The built-in template rule for processing instructions and comments is to do nothing. Namespace nodes are not processed by default. Note that <xsl:apply-templates/> is practically a shorthand for <xsl:apply-templates select="child::node()"/> so it will not select attribute or namespace nodes.
There is also a built-in template rule for every mode. These templates are like default templates for elements and root except they continue processing in the same mode.
<xsl:template match="*|/" mode="foobar">
<xsl:apply-templates mode="foobar"/>
</xsl:template>
Because your stylesheet doesn't have a template matching dogs with mode othertemplate this built-in template rule is applied which in this case results in processing all child nodes and eventually printing the text nodes. Indentation and line feeds between the source document's elements are also text nodes so they also get printed and cause the extra whitespace in the output.
Warning on non-terminating loops because of <xsl:apply-templates select="."/>
Typically apply-templates is used to process descendants. In your example code you selected the current node when calling apply-templates. This may result in non-terminating loop if the template itself is applied because of an apply-templates command inside it. Example below
<xsl:template match="foobar">
<!-- This is an infinite loop -->
<xsl:apply-templates select="."/>
</xsl:template>
By the way. On a general rule on combining stylesheets, think carefully which template you should run and which one you should import or include. (I have read that as a general practice, Michael Kay seems to recommend using <xsl:import> to import the general-case stylesheet into the special-case stylesheet, not the other way round.)
The built-in XSLT templates are defined and selected for every mode. So, the built-in template for text nodes is selected and (by definition) it outputs the text node.
To suppress this, you need to override thie built-in template for text nodes (also possibly for elements) in your desired mode with an empty template:
<xsl:template match="text()" mode="othertemplate"/>
Include the above in your imported stylesheet.