How to select elements containing special characters using XSL?

I have an ASCII-encoded XML file (in which the various special characters are encoded as &#x..;). Here is a simplified example:
<?xml version="1.0" encoding="ascii"?>
<data>
<element1>Some regular text</element1>
<element2>Text containing special characters: 1&#xBA;-2&#xAA;</element2>
<element3>Again regular text, but with the special character prefix: #x</element3>
</data>
Now what I want to do is to pick all the leaf elements containing special characters. The output should look like
The following elements in the input file contain special characters:
<element2>Text containing special characters: 1º-2ª</element2>
I tried with this XSL:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*">
<xsl:if test="not(*) and contains(., '&amp;#x')">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
But it only gives me:
The following elements in the input file contain special characters:
If I try to search for just "#x" with this XSL:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*">
<xsl:if test="not(*) and contains(., '#x')">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
I get:
The following elements in the input file contain special characters:
<element3>Again regular text, but with the special character prefix: #x</element3>
So the question is: is there any way to find those elements which contain special characters encoded as "&#x..;"?
I know I can do this with grep etc:
grep '&#x' simpletest.xml
<element2>Text containing special characters: 1&#xBA;-2&#xAA;</element2>
but the ultimate goal is to generate a pretty output with information about parent elements etc that can be sent as email notification, and using XSLT would make that part so much easier.

In XSLT/XPath you can't tell whether a Unicode character appeared literally in the input document or as a character reference. In XSLT 2 or 3, however, you can use the matches function with Unicode block escapes to check whether certain characters occur (e.g. \P{IsBasicLatin} matches anything outside the ASCII/Basic Latin block):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*[not(*) and matches(., '\P{IsBasicLatin}')]">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Output:
The following elements in the input file contain special characters:
<element2>Text containing special characters: 1º-2ª</element2>

As Martin said, character references like &#xAA; are resolved by the XML parser, so by the time the XML is passed to your XSLT they have already been converted to regular Unicode characters, with no sign that they were encoded specially.
If you want to find characters which are "special" in some way (i.e. Unicode characters with particular code points), then Martin's solution using regular expressions is what you want. That will find those characters, irrespective of whether they were encoded with character entity references or not.
However, if you are actually trying to find the character references themselves, then your XSLT would need to read the XML file as plain text (without parsing it as XML), e.g. using the unparsed-text XPath function. Note, though, that if you do that, you won't be able to see which particular XML elements contain the characters, because the element markup will not have been parsed either.
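To illustrate the plain-text approach, here is a minimal sketch that uses unparsed-text-lines (available in XSLT 3.0) to list the raw lines containing hexadecimal character references. The parameter name input-uri and its default value simpletest.xml are assumptions for illustration, and the stylesheet is meant to be started from its initial template (for example with Saxon's -it option) rather than applied to a source document:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="text"/>
  <!-- Hypothetical parameter: URI of the file to scan, relative to the stylesheet -->
  <xsl:param name="input-uri" select="'simpletest.xml'"/>
  <xsl:template name="xsl:initial-template">
    <xsl:text>Lines containing hexadecimal character references:&#xA;</xsl:text>
    <!-- The file is read as text, so &#x..; sequences are still visible as typed -->
    <xsl:for-each select="unparsed-text-lines($input-uri)[matches(., '&amp;#x[0-9A-Fa-f]+;')]">
      <xsl:value-of select="."/>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

This is essentially grep written in XSLT: it reports lines, not elements, so it only helps if line-level information is enough for the notification.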

Related

Why does Saxon delete blank lines in an identity transform?

Consider this "identity" transform:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output encoding="UTF-8" method="xml" indent="yes" media-type="application/xml"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
and this input XML:
<?xml version="1.0" encoding="UTF-8"?>
<Foobar xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:transform version="2.0">

<!-- Parameters -->
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>

<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>

<xsl:template match="/*">
</xsl:template>

</xsl:transform>
</Foobar>
Why does SaxonJ-HE 11.3 delete the blank lines?
Here's a diff showing what I'm talking about:
$ saxon -xsl:transform.xsl -s:input.xml | diff -u input.xml -
--- input.xml 2022-06-16 16:26:41.000000000 -0400
+++ - 2022-06-16 16:28:42.000000000 -0400
@@ -6,12 +6,9 @@
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>
-
<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>
-
<xsl:template match="/*">
</xsl:template>
-
</xsl:transform>
</Foobar>
It's quite challenging to find an indentation algorithm that both (a) preserves existing whitespace in the source document, and (b) produces nice-looking output. For example, consider what happens when a template rule processes all children of an element (both element children and whitespace text node children) with an xsl:sort on an attribute value; if all whitespace from this output sequence is preserved, this will tend to put a massive wadge of whitespace at the start of the output sequence, which looks pretty ugly. This can also happen if you apply-templates to all children, but delete some of the elements while leaving the text nodes unchanged. So the spec allows the processor not only to add whitespace for indentation, but to merge ("elide") this with existing whitespace.
In particular, it's a reasonable assumption to make that if you get multiple blank lines in the result tree, they weren't put there deliberately, but arrived by accident as a result of copying multiple whitespace nodes from the input.
What's actually happening in this particular case is as follows:
For comments, the rules are different depending on whether the comment follows a start tag or an end tag. The first comment follows a start tag, and in this case the accumulated whitespace is output as-is, followed by the comment with no further indentation. The second comment follows an end tag (actually an empty element tag), and in this case the comment is indented according to its hierarchic level in the result tree, and any preceding whitespace in the result tree is discarded.
Before a start tag, indentation is added if the start tag immediately follows another start tag or end tag; if it follows a text node, no indentation is added. This rule is designed primarily to make mixed content work properly.
Before an end tag, indentation is added if it follows another end tag, but not if it follows a start tag or character data.
The detail is a lot more complex, and it has evolved in a fairly ad-hoc way to cope reasonably well with a wide variety of circumstances. As a high-level summary, Saxon will in some circumstances output the whitespace that it finds in the result tree, and in other circumstances it will output its own whitespace in preference. The algorithm isn't perfect, but it copes reasonably well with messy situations like when the input is indented with 4 spaces and the output is to be indented with 3.
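Since the merging described above is part of indentation, one thing worth trying if the blank lines matter is to switch indentation off, so that the serializer adds no whitespace of its own and leaves the copied whitespace text nodes alone. A minimal sketch, assuming the source whitespace is not stripped elsewhere (for example by xsl:strip-space or a parser option):

```xml
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- indent="no": the serializer neither adds nor merges whitespace -->
  <xsl:output encoding="UTF-8" method="xml" indent="no"/>
  <!-- Standard identity template: copies every node, including whitespace-only text nodes -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

The trade-off is that any newly generated elements are no longer indented automatically.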

Whitespace in xsl:text elements

I have the following stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:text>1</xsl:text>
<xsl:text>
</xsl:text>
<xsl:text>2</xsl:text>
</xsl:template>
</xsl:stylesheet>
When running this with Saxon 9.8, I get the following result:
1
2
When running this with MSXML 6.0, the whitespace is stripped and I get:
1 2
What is the correct behavior? Is the whitespace here supposed to be stripped?
This is to do with whitespace stripping in the XSLT document. According to the W3C specification (for XSLT 1.0, which is what MSXML uses):
A text node is preserved if any of the following apply:
The element name of the parent of the text node is in the set of
whitespace-preserving element names.
The text node contains at least one non-whitespace character. As in
XML, a whitespace character is #x20, #x9, #xD or #xA.
An ancestor element of the text node has an xml:space attribute with a
value of preserve, and no closer ancestor element has xml:space with a
value of default.
It then says "For stylesheets, the set of whitespace-preserving element names consists of just xsl:text."
So, it looks like MSXML is not following the specification.
However, if you add xml:space="preserve" to the xsl:text in question, you might find it does work in MSXML:
<xsl:template match="/">
<xsl:text>1</xsl:text>
<xsl:text xml:space="preserve">
</xsl:text>
<xsl:text>2</xsl:text>
</xsl:template>
The correct behaviour is as you see it from Saxon.
There's some history here and I don't remember the full details, but MSXML has a nasty habit of stripping whitespace text nodes within the parser itself. If the XML parser strips out the whitespace text nodes, then they never get as far as the XSLT processor, so it makes no difference whether that conforms to all the XSLT rules or not.
I'm pretty sure there are options in MSXML to control this behaviour, so check exactly how you are invoking the MSXML parser and change the options if necessary.

Escaping Double Quotes, Space and Allowing for an Extra Forward Slash

I have XML
<?xml version="1.0" encoding="UTF-8"?>
<icestats>
<stats_connections>0</stats_connections>
<source mount="/live">
<bitrate>Some data</bitrate>
<server_description>This is what I want to return</server_description>
</source>
</icestats>
And I have XSL
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<xsl:copy-of select="/icestats/source mount="/live"/server_description/node()" />
</xsl:template>
</xsl:stylesheet>
I want the output
This is what I want to return
If I remove the double quotes, space and forward slash from the source it works, but I haven't been able to successfully escape the non standard characters yet using suggested methods in other posts.
For clarity, below is the solution thanks to Lego Stormtroopr
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<xsl:copy-of select="/icestats/source[@mount='/live']/server_description/node()" />
</xsl:template>
</xsl:stylesheet>
There are a couple of issues you will need to resolve before your processor will produce the output you're looking for.
1) Your XML input must be made well-formed. The closing tag of the source element should not include the mount attribute that is specified on the opening tag.
<source mount="/live">
...
</source>
2) The XPath on your xsl:copy-of element must be made valid. The syntax for an XPath expression is (fortunately) not like the syntax for XML elements and attributes. Specifying which source element to match is done by predicating on an attribute value, like you have done, except that you need to use square brackets:
/icestats/source[@mount="/live"]/server_description
In order to use this XPath expression in an XSLT select statement, you will need to make sure that you enclose the entire select attribute value with one type of quotes, and use the other type of quotes within the attribute value, e.g.:
<xsl:value-of select="/icestats/source[@mount='/live']/server_description" />
With this input
<?xml version="1.0" encoding="UTF-8"?>
<icestats>
<stats_connections>0</stats_connections>
<source mount="/live">
<bitrate>Some data</bitrate>
<server_description>This is what I want to return</server_description>
</source>
</icestats>
and this stylesheet
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="/icestats/source[@mount='/live']/server_description" />
</xsl:template>
</xsl:stylesheet>
I get the following line of text from xsltproc and saxon:
This is what I want to return
The xsl:value-of element will return the string value of an element (here, that one text node). If you actually wanted the server_description element, then you can use xsl:copy-of to get the whole thing, tags and all. (You would have to update xsl:output as well.)
It looks like you are doing a select based on the attribute, so you just need to properly capture the attribute in the XPath. The quotes you use in the document and the XPath don't need to match, so you can switch them to single quotes ('):
<xsl:copy-of select="/icestats/source[@mount='/live']/server_description/node()" />
(Edited to correct the missing / from the mount attribute.)
Also, your original document isn't valid XML, as XML doesn't allow attributes in the closing tag.
I think all you need to do is escape the quotes in the attribute string with &quot;:
<xsl:copy-of select="/icestats/source mount=&quot;/live&quot;/server_description/node()" />

not adding new line in my XSLT

I am not certain why my xslt won't put a new line in my output...
This is my xslt....
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
>
<xsl:output method="text" encoding="iso-8859-1"/>
<xsl:variable name="newline"></xsl:variable>
<xsl:template name="FairWarningTransform" match="/"> <!--#* | node()">-->
<xsl:for-each select="//SelectFairWarningInformationResult">
<xsl:value-of select="ApplicationID"/>,<xsl:value-of select="USERID"/>
</xsl:for-each>
* Note. This report outlines Fair warning entries into reported for the above time frame.
</xsl:template>
</xsl:stylesheet>
Here is my output...
1,TEST1,test2,
I want it to look like...
1,TEST
1,test2,
Why isn't this character (&#xA;) creating a newline?
Try replacing &#xA; with
<xsl:text>&#xA;</xsl:text>
That helps XSLT distinguish it from other whitespace in your stylesheet that is part of the stylesheet formatting (not part of the desired output).
XSLT's default behavior is to ignore any text nodes in the stylesheet that are entirely whitespace (this is true even if some of the whitespace is encoded as character references like &#xA;), except for text inside <xsl:text>, which is preserved.
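Putting that together, a minimal sketch of the loop might look like the following. The element names are taken from the question; wrapping the comma in xsl:text as well is a stylistic choice made here, not something the question requires:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="iso-8859-1"/>
  <xsl:template match="/">
    <xsl:for-each select="//SelectFairWarningInformationResult">
      <xsl:value-of select="ApplicationID"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="USERID"/>
      <!-- Whitespace inside xsl:text is never stripped, so this line feed reaches the output -->
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Because every piece of literal output sits inside xsl:text, the stylesheet can be indented freely without that indentation leaking into the result.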
I suggest replacing these lines:
<xsl:value-of select="ApplicationID"/>,<xsl:value-of select="USERID"/>
with this:
<xsl:value-of select="concat(ApplicationID, ',', USERID, '&#xA;')"/>
That way the newline is sure to be included in the output.
Try using this as your newline instead of the escaped character:
<xsl:text>
</xsl:text>

How to remove spaces from the end of text in an element. (XSL)

Does anyone know what XSL code would remove the trailing whitespace after the last word in an element?
<p>This is my paragraph. </p>
Thanks!!
Look at the normalize-space() XPath function.
<xsl:template match="p">
<p><xsl:value-of select="normalize-space()" /></p>
</xsl:template>
Be careful, there is a catch (which might or might not be relevant to you):
The [...] function returns the
argument string with whitespace
normalized by stripping leading and
trailing whitespace and replacing
sequences of whitespace characters by
a single space.
This means it also removes all line breaks and tabs and other whitespace and turns them into a single space.
EDIT: Significant simplification of the code, thanks to a hint from Tomalak.
Here is an XPath 2.0 / XSLT 2.0 solution, which removes only the trailing spaces:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
>
<xsl:output method="text"/>
<xsl:template match="text()">
"<xsl:sequence select="replace(., '\s+$', '', 'm')"/>"
</xsl:template>
</xsl:stylesheet>
When this is applied on the following XML document:
<someText> This is some text </someText>
the wanted result is produced:
" This is some text"
You can see the XSLT 1.0 solution (implementing almost the same idea), which uses FXSL 1.x, here:
http://www.biglist.com/lists/xsl-list/archives/200112/msg01067.html
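If you are limited to XSLT 1.0 and would rather not pull in FXSL, a recursive named template is another possible approach. The following is only a sketch: the template name rtrim and its parameter s are made up for illustration, and it recurses once per trailing whitespace character, so it suits short runs of trailing whitespace better than very long ones:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical helper: returns $s without its trailing whitespace -->
  <xsl:template name="rtrim">
    <xsl:param name="s"/>
    <xsl:choose>
      <!-- The last character is whitespace if normalize-space() reduces it to '' -->
      <xsl:when test="$s != '' and normalize-space(substring($s, string-length($s))) = ''">
        <xsl:call-template name="rtrim">
          <xsl:with-param name="s" select="substring($s, 1, string-length($s) - 1)"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$s"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:text>"</xsl:text>
    <xsl:call-template name="rtrim">
      <xsl:with-param name="s" select="."/>
    </xsl:call-template>
    <xsl:text>"</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Unlike normalize-space(), this leaves leading and interior whitespace untouched.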