Whitespace in xsl:text elements - xslt

I have the following stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:text>1</xsl:text>
<xsl:text>
</xsl:text>
<xsl:text>2</xsl:text>
</xsl:template>
</xsl:stylesheet>
When running this with Saxon 9.8, I get the following result:
1
2
When running this with MSXML 6.0, the whitespace is stripped and I get:
1 2
What is the correct behavior? Is the whitespace here supposed to be stripped?

This is to do with whitespace stripping in the XSLT document. According to the W3C specification (for XSLT 1.0, which is what MSXML uses):
A text node is preserved if any of the following apply:
The element name of the parent of the text node is in the set of
whitespace-preserving element names.
The text node contains at least one non-whitespace character. As in
XML, a whitespace character is #x20, #x9, #xD or #xA.
An ancestor element of the text node has an xml:space attribute with a
value of preserve, and no closer ancestor element has xml:space with a
value of default.
It then says "For stylesheets, the set of whitespace-preserving element names consists of just xsl:text."
So it looks like MSXML is not following the specification.
However, if you add xml:space="preserve" to the xsl:text in question, you might find it does work in MSXML:
<xsl:template match="/">
<xsl:text>1</xsl:text>
<xsl:text xml:space="preserve">
</xsl:text>
<xsl:text>2</xsl:text>
</xsl:template>

The correct behaviour is as you see it from Saxon.
There's some history here and I don't remember the full details, but MSXML has a nasty habit of stripping whitespace text nodes within the parser itself. If the XML parser strips out the whitespace text nodes, then they never get as far as the XSLT processor, so it makes no difference whether that conforms to all the XSLT rules or not.
I'm pretty sure there are options in MSXML to control this behaviour, so check exactly how you are invoking the MSXML parser and change the options if necessary.
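If changing the parser options is not practical, one workaround is to avoid a whitespace-only text node altogether. This is a sketch under an assumption, not something tested against MSXML here: a text node that also contains a non-whitespace character is not whitespace-only, so neither the XSLT stripping rules nor a parser that drops whitespace-only nodes should remove it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <!-- The line break shares a text node with "1", so the node is not whitespace-only -->
    <xsl:text>1&#10;</xsl:text>
    <xsl:text>2</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

A conforming processor writes 1 and 2 on separate lines for this stylesheet, the same result Saxon gives for the original.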

Related

How to select elements containing special characters using XSL?

I have an ascii-encoded XML-file (in which the various special characters are encoded as &#x..;). Here is a simplified example:
<?xml version="1.0" encoding="ascii"?>
<data>
<element1>Some regular text</element1>
<element2>Text containing special characters: 1&#xBA;-2&#xAA;</element2>
<element3>Again regular text, but with the special character prefix: #x</element3>
</data>
Now what I want to do is to pick all the leaf elements containing special characters. The output should look like
The following elements in the input file contain special characters:
<element2>Text containing special characters: 1º-2ª</element2>
I tried with this XSL:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*">
<xsl:if test="not(*) and contains(., '&#x')">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
But it only gives me:
The following elements in the input file contain special characters:
If I try to search for just "#x" with this XSL:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*">
<xsl:if test="not(*) and contains(., '#x')">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
I get:
The following elements in the input file contain special characters:
<element3>Again regular text, but with the special character prefix: #x</element3>
So the question is: is there any way to find those elements which contain special characters encoded as "&#x..;"?
I know I can do this with grep etc:
grep '&#x' simpletest.xml
<element2>Text containing special characters: 1&#xBA;-2&#xAA;</element2>
but the ultimate goal is to generate a pretty output with information about parent elements etc that can be sent as email notification, and using XSLT would make that part so much easier.
In XSLT/XPath you can't tell whether a Unicode character appeared literally in the input document or as a character reference. In XSLT 2 or 3, however, you can check with matches() and Unicode block ranges whether certain characters occur (e.g. \P{IsBasicLatin} matches any character outside the Basic Latin block, i.e. anything that is not ASCII):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*[not(*) and matches(., '\P{IsBasicLatin}')]">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Output:
The following elements in the input file contain special characters:
<element2>Text containing special characters: 1º-2ª</element2>
As Martin said, character entity references like &#xAA; are resolved by the XML parser, so by the time the XML is passed to your XSLT they have already been converted to regular Unicode characters, with no sign that they were encoded specially.
If you want to find characters which are "special" in some way (i.e. Unicode characters with particular code points), then Martin's solution using regular expressions is what you want. That will find those characters, irrespective of whether they were encoded with character entity references or not.
However, if you are actually trying to find character entity references, then your XSLT would need to read the XML file as plain text (without parsing it as XML), e.g. using the unparsed-text XPath function. Note, though, that if you do that, you won't be able to see which particular XML elements contain the characters, because the XML element markup will not have been parsed either.
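A minimal sketch of that plain-text approach, assuming an XSLT 2.0 processor: it reads the file with unparsed-text(), splits it into lines, and prints the lines that contain a hexadecimal character reference, much like the grep call in the question. The parameter name input-uri, its default value and the named template main are illustrative choices, not anything from the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>
  <!-- Hypothetical parameter: URI of the file to scan, read as text rather than parsed as XML -->
  <xsl:param name="input-uri" select="'simpletest.xml'"/>
  <xsl:template name="main">
    <xsl:text>Lines containing character references:&#10;</xsl:text>
    <!-- "&amp;" is how a literal "&" is written inside the select attribute -->
    <xsl:for-each select="tokenize(unparsed-text($input-uri), '\r?\n')[matches(., '&amp;#x[0-9A-Fa-f]+;')]">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The transformation has to start at the named template (with Saxon's command line, the -it option), and a relative URI is resolved against the location of the stylesheet. As noted above, the result is a list of raw lines, not of XML elements.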

Why does Saxon delete blank lines in an identity transform?

Consider this "identity" transform:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output encoding="UTF-8" method="xml" indent="yes" media-type="application/xml"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
and this input XML:
<?xml version="1.0" encoding="UTF-8"?>
<Foobar xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:transform version="2.0">
<!-- Parameters -->
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>
<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>
<xsl:template match="/*">
</xsl:template>
</xsl:transform>
</Foobar>
Why does SaxonJ-HE 11.3 delete the blank lines?
Here's a diff showing what I'm talking about:
$ saxon -xsl:transform.xsl -s:input.xml | diff -u input.xml -
--- input.xml 2022-06-16 16:26:41.000000000 -0400
+++ - 2022-06-16 16:28:42.000000000 -0400
@@ -6,12 +6,9 @@
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>
-
<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>
-
<xsl:template match="/*">
</xsl:template>
-
</xsl:transform>
</Foobar>
It's quite challenging to find an indentation algorithm that both (a) preserves existing whitespace in the source document, and (b) produces nice-looking output. For example, consider what happens when a template rule processes all children of an element (both element children and whitespace text node children) with an xsl:sort on an attribute value; if all whitespace from this output sequence is preserved, this will tend to put a massive wadge of whitespace at the start of the output sequence, which looks pretty ugly. This can also happen if you apply-templates to all children, but delete some of the elements while leaving the text nodes unchanged. So the spec allows the processor not only to add whitespace for indentation, but to merge ("elide") this with existing whitespace.
In particular, it's a reasonable assumption to make that if you get multiple blank lines in the result tree, they weren't put there deliberately, but arrived by accident as a result of copying multiple whitespace nodes from the input.
What's actually happening in this particular case is as follows:
For comments, the rules are different depending on whether the comment follows a start tag or an end tag. The first comment follows a start tag, and in this case the accumulated whitespace is output as-is, followed by the comment with no further indentation. The second comment follows an end tag (actually an empty element tag), and in this case the comment is indented according to its hierarchic level in the result tree, and any preceding whitespace in the result tree is discarded.
Before a start tag, indentation is added if the start tag immediately follows another start tag or end tag; if it follows a text node, no indentation is added. This rule is designed primarily to make mixed content work properly.
Before an end tag, indentation is added if it follows another end tag, but not if it follows a start tag or character data.
The detail is a lot more complex, and it has evolved in a fairly ad-hoc way to cope reasonably well with a wide variety of circumstances. As a high-level summary, Saxon will in some circumstances output the whitespace that it finds in the result tree, and in other circumstances it will output its own whitespace in preference. The algorithm isn't perfect, but it copes reasonably well with messy situations like when the input is indented with 4 spaces and the output is to be indented with 3.
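If the goal is to keep the source layout, blank lines included, one option is to switch serializer indentation off so that no whitespace is added or merged. A minimal sketch, assuming the stylesheet has no xsl:strip-space declaration, so that the whitespace text nodes of the input reach the result tree unchanged:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- indent="no": the serializer writes the whitespace already in the result tree and adds none -->
  <xsl:output encoding="UTF-8" method="xml" indent="no"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

Small differences from the input can remain, for example in the XML declaration or in whitespace outside the document element, since those are not whitespace text nodes in the tree.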

XPath filter not working on XSL

I have the following XML
<?xml version="1.0"?>
<people><human><sex>male</sex><naxme>Juanito</naxme>
</human>
<human><sex>female</sex><naxme>Petra</naxme></human>
<human><sex>male</sex><naxme>Anaximandro</naxme></human>
</people>
and this XSL
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="utf-8" indent="no"/>
<xsl:template match="/people/human[sex='male']">
<xsl:value-of select="naxme"/>
</xsl:template>
</xsl:stylesheet>
I'm expecting it to filter out the female, and it kind of works but I get odd values for the non-matching nodes:
Juanito
femalePetra
Anaximandro
I'm expecting the same output as with
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="utf-8" indent="no"/>
<xsl:template match="/people/human">
<xsl:if test = "sex='male'">
<xsl:value-of select="naxme"/>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
Thanks!
I'll expand on Daniel's answer, which covers a little of the why.
The reason you are getting two different outputs comes down to the built-in template rules, and how nodes and text are treated by default.
Summarising that link, if no other template exists, there are default templates that ensure every node - be it element, text, attribute, comment - will be encountered, just to ensure that other nodes that do have rules can be processed correctly.
With this XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/people/human[sex='male']">
<xsl:value-of select="naxme"/>
</xsl:template>
</xsl:stylesheet>
you have an explicit rule that says:
If you find a node that matches the XPath /people/human[sex='male'] do this template.
Along with the default rule:
Find all nodes, and then process all of their children. If it is text, just output the text.
This default rule is why your template is being processed: since you have no explicit rule for the root node - / - it and every child and grandchild node are processed using the default rules, unless another rule exists. As such, each node is traversed using the defaults, except for the nodes matching /people/human[sex='male']. The result is that when you have a node that is "female", its text is spat out instead of being ignored.
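For reference, the built-in rules of XSLT 1.0 behave as if the templates below were part of every stylesheet, with lower import precedence than any template you write yourself. This is a sketch for illustration only; you would not normally write them out.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Root node and elements: keep walking down into the children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- Text and attribute nodes: copy their string value to the output -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>
  <!-- Comments and processing instructions: output nothing -->
  <xsl:template match="processing-instruction()|comment()"/>
</xsl:stylesheet>
```

The second rule is the one that produces "femalePetra": no explicit template matches that human element, so the first rule descends into it and the text of sex and naxme is copied through.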
However, contrast this with:
<xsl:template match="/people/human">
<xsl:if test = "sex='male'">
<xsl:value-of select="naxme"/>
</xsl:if>
</xsl:template>
Where the rule becomes:
If you find a node that matches the XPath /people/human do this template.
It just so happens that in that template, you have an extra condition that says, if it is male, then process it in some way, with no other conditions, so if a "female" node is encountered it is now blank in the output.
Lastly, the reason why Daniel's answer works but could easily break is that it changes the rule for processing text. Instead of copying all text as in the default rules, it now outputs nothing (as per the empty template). However, if you had any other templates which used xsl:apply-templates to process text and were expecting text, they would also now output nothing.
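An alternative that leaves the built-in text rule alone is to take control at the root and apply templates only to the nodes you want, so the defaults never reach the others. A minimal sketch against the XML from the question; the trailing line break after each name is my addition:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="utf-8"/>
  <!-- Select only the male entries; nothing else is ever processed -->
  <xsl:template match="/">
    <xsl:apply-templates select="people/human[sex='male']"/>
  </xsl:template>
  <xsl:template match="human">
    <xsl:value-of select="naxme"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

For the sample input this prints Juanito and Anaximandro, one per line, and text processing elsewhere in a larger stylesheet is unaffected.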
It's probably because of XSLT's built-in template rules. Try adding this template:
<xsl:template match="text()"/>

XSLT: need alternative to document()-function for multi-source processing

I'm adapting an XSLT from a third party which transforms an arbitrary number of XMLs into a single HTML document. It's a pretty complex script and it will be revised in the future, so I'm trying to do a minimal adaptation in order to get it to work for our needs.
The following is a stripped down version of the XSLT (containing the essentials):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="text" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:param name="files" select="document('files.xml')//File"/>
<xsl:param name="root" select="document($files)"/>
<xsl:template match="/">
<xsl:for-each select="$root/RootNode">
<xsl:apply-templates select="."/>
</xsl:for-each>
</xsl:template>
<xsl:template match="RootNode">
<xsl:for-each select="//Node">
<xsl:text>Node: </xsl:text><xsl:value-of select="."/><xsl:text>, </xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Now files.xml contains a list of all the URLs of the files to be included (in this case the local files file1.xml and file2.xml). Because we want to read XMLs from memory rather than from disk, and because the invocation of the XSLT only allows for a single XML source, I have combined the two files in a single XML document. The following is a combination of two files (there may be more in a real situation)
<?xml version="1.0" encoding="UTF-8"?>
<TempNode>
<RootNode>
<Node>1</Node>
<Node>2</Node>
</RootNode>
<RootNode>
<Node>3</Node>
<Node>4</Node>
</RootNode>
</TempNode>
where the first RootNode originally resided in file1.xml and the second in file2.xml.
Due to the complexity of the actual XSLT, I've figured that my best shot is to try to alter the $root-param. I've tried the following:
<xsl:param name="root" select="/TempNode"/>
The problem is this. In the case of <xsl:param name="root" select="document($files)"/>, the XPath expression "//Node" in <xsl:for-each select="//Node"> selects the Node's from file1.xml and file2.xml independently, i.e. producing the following (desired) list:
Node: 1, Node: 2, Node: 3, Node: 4,
However, when I combine the content of the two files into a single XML and parse this (and use the suggested $root-definition), the expression "//Node" will select all Node's that are children of the TempNode. (In other words, the desired list, as represented above, is produced twice due to the combination with the outer <xsl:for-each select="$root/RootNode"> loop.)
(A side note: as observed in comment a) in this page, document() apparently changes the root node, perhaps explaining this behavior.)
My question becomes:
How can I re-define $root, using the combined XML as source instead of a multi-source through document(), so that the list is only produced once, without touching the remainder of the XSLT? It's like if $root defined using the document()-function, there is no common root node in the param. Is it possible to define a param with two "separate" node trees?
Btw: I've tried defining a document like this
<xsl:param name="root">
<xsl:for-each select="/TempNode/*">
<xsl:document>
<xsl:copy-of select="."/>
</xsl:document>
</xsl:for-each>
</xsl:param>
thinking it might solve the problem, but the "//Node" expression still fetches all the Nodes. Is the context node in the <xsl:template match="RootNode">-template actually somewhere in the input document and not the param? (Honestly, I'm pretty confused when it comes to context nodes.)
Thanks in advance!
(Updated more)
OK, some of the problem is becoming clear. First, just to make sure I understand, you aren't actually passing parameters for $files and $root to the XSLT processor invocation, right? (They might as well be variables rather than params?)
Now to the main issues... In XPath, when you evaluate an expression that begins with "/" (including "//"), the context node is ignored [mostly]. Therefore, when you have
<xsl:template match="RootNode">
<xsl:for-each select="//Node">
the matched RootNode is ignored. Maybe you wanted
<xsl:template match="RootNode">
<xsl:for-each select=".//Node">
in which the for-each would select Node elements that are descendants of the matched RootNode? This would fix your problem of generating the desired node list twice.
I inserted [mostly] above because I recalled that an "absolute location path" starts from "the root node of the document containing the context node". So the context node does affect what document is used for "//Node". Maybe that's what you intended all along? I guess I was slow to catch on to that.
(A side note: as observed in comment
a) in this page, document() apparently
changes the root node, perhaps
explaining this behavior.)
Or more precisely,
An absolute location path ["/..."]
followed by a relative location
path... selects the set of nodes that
would be selected by the relative
location path relative to the root
node of the document containing the
context node.
document() doesn't actually change anything, in the sense of side effects; rather, it returns a set of nodes contained (usually) by different documents than the primary source document. XSLT instructions like xsl:apply-templates and xsl:for-each establish new values for the context node inside the scope of their template bodies. So if you use xsl:apply-templates and xsl:for-each with select="document(...)/...", the context node inside the scope of those instructions will belong to an external document, so any use of "/..." as an XPath will start from that external document.
Updated again
How can I re-define $root, using the
combined XML as source instead of a
multi-source through document(), so
that the list is only produced once,
without touching the remainder of the
XSLT?
As @Alej hinted, it's really not possible given the above constraint. If you're selecting "//Node" in each iteration of the loop over "$root/RootNode", then in order for each iteration not to select the same nodes as the other iterations, each value of "$root/RootNode" must be in a different document. Since you're using the combined XML source, instead of a multi-source, this is not possible.
But if you don't insist that your <xsl:for-each select="//..."> XPath expression cannot change, it becomes very easy. :-) Just put a "." before the "//".
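Applied to the stripped-down stylesheet from the question, that change might look like the sketch below. It assumes the combined document with the TempNode wrapper is the single source and keeps the asker's own $root definition; only the leading "." in the inner select differs from the original templates.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:param name="root" select="/TempNode"/>
  <xsl:template match="/">
    <xsl:for-each select="$root/RootNode">
      <xsl:apply-templates select="."/>
    </xsl:for-each>
  </xsl:template>
  <xsl:template match="RootNode">
    <!-- ".//Node" starts at the matched RootNode, not at the document root -->
    <xsl:for-each select=".//Node">
      <xsl:text>Node: </xsl:text><xsl:value-of select="."/><xsl:text>, </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

With the combined input shown in the question this produces the list once: Node: 1, Node: 2, Node: 3, Node: 4,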
It's like if $root defined using the document()-function, there is no common root node
in the param.
The value of the param is a node-set. All nodes in the set may be contained in the same document, or they may not, depending on whether the first argument to document() is a node-set or just a single node.
Is it possible to define a param with two "separate" node trees?
I believe by "separate", you mean "belonging to different documents"? Yes it is, but I don't think you can do it in XSLT 1.0 unless you're selecting nodes that belong to different documents in the first place.
You mentioned trying
<xsl:param name="root">
<xsl:for-each select="/TempNode/*">
<xsl:document>
<xsl:copy-of select="."/>
</xsl:document>
</xsl:for-each>
</xsl:param>
but <xsl:document> is not defined in XSLT 1.0, and your stylesheet says version="1.0". Do you have XSLT 2.0 available? If so, let us know and we can pursue this option. To be honest, <xsl:document> is not familiar territory for me. But I'm happy to learn along with you.
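For what it's worth, here is a sketch of how the xsl:document idea could work under XSLT 2.0; it has not been run against the real stylesheet. The key assumption is the as="document-node()*" attribute: without an as attribute, the content of the param is wrapped in one temporary tree, so the separate documents are merged again and "//Node" sees all of them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <!-- as="document-node()*" keeps each xsl:document result as a tree of its own -->
  <xsl:param name="root" as="document-node()*">
    <xsl:for-each select="/TempNode/*">
      <xsl:document>
        <xsl:copy-of select="."/>
      </xsl:document>
    </xsl:for-each>
  </xsl:param>
  <xsl:template match="/">
    <xsl:for-each select="$root/RootNode">
      <xsl:apply-templates select="."/>
    </xsl:for-each>
  </xsl:template>
  <xsl:template match="RootNode">
    <!-- "//Node" now starts from the root of this RootNode's own small document -->
    <xsl:for-each select="//Node">
      <xsl:text>Node: </xsl:text><xsl:value-of select="."/><xsl:text>, </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The two templates are unchanged from the question. One caveat: the relative order of nodes that belong to different documents is implementation-dependent, so the order in which the RootNode groups appear is not guaranteed.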
You can apply only nodes you need:
Input:
<?xml version="1.0" encoding="UTF-8"?>
<TempNode>
<RootNode>
<Node>1</Node>
<Node>2</Node>
</RootNode>
<RootNode>
<Node>3</Node>
<Node>4</Node>
</RootNode>
</TempNode>
Stylesheet:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
<xsl:output method="html" indent="yes"/>
<xsl:template match="/">
<xsl:copy>
<xsl:apply-templates select="TempNode/RootNode"/>
</xsl:copy>
</xsl:template>
<xsl:template match="RootNode">
<xsl:value-of select="concat('RootNode-', generate-id(.), '&#xA;')"/>
<xsl:apply-templates select="Node"/>
</xsl:template>
<xsl:template match="Node">
<xsl:value-of select="concat('Node', ., '&#xA;')"/>
</xsl:template>
</xsl:stylesheet>
Output:
RootNode-N65540
Node1
Node2
RootNode-N65549
Node3
Node4

How to remove spaces from the end of text in an element. (XSL)

Does anyone know what XSL code would remove the trailing whitespace after the last word in an element?
<p>This is my paragraph. </p>
Thanks!!
Look at the normalize-space() XPath function.
<xsl:template match="p">
<p><xsl:value-of select="normalize-space()" /></p>
</xsl:template>
Be careful, there is a catch (which might or might not be relevant to you):
The [...] function returns the
argument string with whitespace
normalized by stripping leading and
trailing whitespace and replacing
sequences of whitespace characters by
a single space.
This means it also removes all line breaks and tabs and other whitespace and turns them into a single space.
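If the inner whitespace has to survive, XSLT 1.0 can trim only the trailing whitespace with a small recursive template. The sketch below is one way to do it; the template name rtrim and its parameter s are made up for this example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="p">
    <p>
      <xsl:call-template name="rtrim">
        <xsl:with-param name="s" select="string(.)"/>
      </xsl:call-template>
    </p>
  </xsl:template>
  <!-- Hypothetical helper: drops whitespace characters from the end of $s, one per call -->
  <xsl:template name="rtrim">
    <xsl:param name="s"/>
    <xsl:choose>
      <!-- The last character is whitespace when normalize-space() reduces it to '' -->
      <xsl:when test="$s != '' and normalize-space(substring($s, string-length($s))) = ''">
        <xsl:call-template name="rtrim">
          <xsl:with-param name="s" select="substring($s, 1, string-length($s) - 1)"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$s"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

For the p element in the question this should give <p>This is my paragraph.</p>, with leading and inner whitespace left as it was. Note that string(.) flattens any child markup inside the p into plain text.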
EDIT: Significant simplification of the code, thanks to a hint from Tomalak.
Here is an XPath 2.0 / XSLT 2.0 solution, which removes only the trailing spaces:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
>
<xsl:output method="text"/>
<xsl:template match="text()">
"<xsl:sequence select="replace(., '\s+$', '', 'm')"/>"
</xsl:template>
</xsl:stylesheet>
When this is applied on the following XML document:
<someText> This is some text </someText>
the wanted result is produced:
" This is some text"
You can see the XSLT 1.0 solution (implementing almost the same idea), which uses FXSL 1.x, here:
http://www.biglist.com/lists/xsl-list/archives/200112/msg01067.html