How to remove spaces from the end of text in an element (XSL)

Does anyone know what XSL code would remove the trailing whitespace after the last word in an element?
<p>This is my paragraph. </p>
Thanks!!

Look at the normalize-space() XPath function.
<xsl:template match="p">
<p><xsl:value-of select="normalize-space()" /></p>
</xsl:template>
Be careful, there is a catch (which might or might not be relevant to you):
The [...] function returns the
argument string with whitespace
normalized by stripping leading and
trailing whitespace and replacing
sequences of whitespace characters by
a single space.
This means it also removes all line breaks and tabs and other whitespace and turns them into a single space.

EDIT: Significant simplification of the code, thanks to a hint from Tomalak.
Here is an XPath 2.0 / XSLT 2.0 solution, which removes only the trailing spaces:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
>
<xsl:output method="text"/>
<xsl:template match="text()">
"<xsl:sequence select="replace(., '\s+$', '', 'm')"/>"
</xsl:template>
</xsl:stylesheet>
When this is applied on the following XML document:
<someText> This is some text </someText>
the wanted result is produced:
" This is some text"
You can see the XSLT 1.0 solution (implementing almost the same idea), which uses FXSL 1.x, here:
http://www.biglist.com/lists/xsl-list/archives/200112/msg01067.html
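For readers who cannot use XSLT 2.0 or FXSL, the same idea can be sketched in plain XSLT 1.0 with a recursive named template. This is only a minimal sketch: the template name trim-right and its parameter s are made up for illustration, and it removes one trailing whitespace character per recursive call, so it is meant for short strings.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical helper: strips trailing whitespace, one character per call -->
  <xsl:template name="trim-right">
    <xsl:param name="s"/>
    <xsl:choose>
      <!-- The last character is whitespace if normalize-space() of it is empty;
           the first condition stops the recursion on an empty string -->
      <xsl:when test="$s != '' and normalize-space(substring($s, string-length($s))) = ''">
        <xsl:call-template name="trim-right">
          <xsl:with-param name="s" select="substring($s, 1, string-length($s) - 1)"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$s"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Wrap each text node in quotes so the trimmed end is visible -->
  <xsl:template match="text()">
    <xsl:text>"</xsl:text>
    <xsl:call-template name="trim-right">
      <xsl:with-param name="s" select="."/>
    </xsl:call-template>
    <xsl:text>"</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the someText document above, this should also give " This is some text" (with the quotes), leaving the leading space and any inner whitespace untouched.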

Related

How to select elements containing special characters using XSL?

I have an ascii-encoded XML-file (in which the various special characters are encoded as &#x..;). Here is a simplified example:
<?xml version="1.0" encoding="ascii"?>
<data>
<element1>Some regular text</element1>
<element2>Text containing special characters: 1&#xBA;-2&#xAA;</element2>
<element3>Again regular text, but with the special character prefix: #x</element3>
</data>
Now what I want to do is to pick all the leaf elements containing special characters. The output should look like
The following elements in the input file contain special characters:
<element2>Text containing special characters: 1º-2ª</element2>
I tried with this XSL:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*">
<xsl:if test="not(*) and contains(., '&amp;#x')">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
But it only gives me:
The following elements in the input file contain special characters:
If I try to search for just "#x" with this XSL:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*">
<xsl:if test="not(*) and contains(., '#x')">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
I get:
The following elements in the input file contain special characters:
<element3>Again regular text, but with the special character prefix: #x</element3>
So the question is: is there any way to find those elements which contain special characters encoded as "&#x..;"?
I know I can do this with grep etc:
grep '&#x' simpletest.xml
<element2>Text containing special characters: 1&#xBA;-2&#xAA;</element2>
but the ultimate goal is to generate a pretty output with information about parent elements etc that can be sent as email notification, and using XSLT would make that part so much easier.
In XSLT/XPath you can't know whether any Unicode character was literally in the input document or as a character reference but in XSLT 2 or 3 you can certainly check with matches and Unicode ranges whether certain characters occur (e.g. with \P{IsBasicLatin} for anything not ASCII/Latin):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:text>The following elements in the input file contain special characters:
</xsl:text>
<xsl:for-each select="//*[not(*) and matches(., '\P{IsBasicLatin}')]">
<xsl:copy-of select="."></xsl:copy-of>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Output:
The following elements in the input file contain special characters:
<element2>Text containing special characters: 1º-2ª</element2>
As Martin said, character entity references like &#xAA; are resolved by XML parsers, so by the time the XML is passed to your XSLT they have already been converted to regular Unicode characters, with no sign that they were encoded specially.
If you want to find characters which are "special" in some way (i.e. Unicode characters with particular code points), then Martin's solution using regular expressions is what you want. That will find those characters, irrespective of whether they were encoded with character entity references or not.
However, if you are actually trying to find character entity references, then your XSLT would need to read the XML file as plain text (without parsing it as XML), e.g. using the unparsed-text XPath function. Note, though, that if you do that, you won't be able to see which particular XML elements contain the characters, because the XML element markup will not have been parsed either.
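A rough sketch of the unparsed-text approach, assuming XSLT 2.0 or later and a processor that can start from a named template (for example Saxon with -it:main). The parameter name input-uri and the template name main are invented for this example; simpletest.xml is the file name from the grep command in the question, and a relative URI is resolved against the stylesheet location.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>

  <!-- Assumption: hypothetical parameter naming the file to scan as plain text -->
  <xsl:param name="input-uri" select="'simpletest.xml'"/>

  <xsl:template name="main">
    <xsl:text>Lines containing hexadecimal character references:&#10;</xsl:text>
    <!-- Split the raw text into lines and keep those with something like &#xBA; -->
    <xsl:for-each select="tokenize(unparsed-text($input-uri), '\r?\n')
                          [matches(., '&amp;#x[0-9A-Fa-f]+;')]">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

This works on lines of text, much like grep, so it reports the raw line rather than the element that contains the reference.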

Why does Saxon delete blank lines in an identity transform?

Consider this "identity" transform:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output encoding="UTF-8" method="xml" indent="yes" media-type="application/xml"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
and this input XML:
<?xml version="1.0" encoding="UTF-8"?>
<Foobar xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:transform version="2.0">
<!-- Parameters -->
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>
<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>
<xsl:template match="/*">
</xsl:template>
</xsl:transform>
</Foobar>
Why does SaxonJ-HE 11.3 delete the blank lines?
Here's a diff showing what I'm talking about:
$ saxon -xsl:transform.xsl -s:input.xml | diff -u input.xml -
--- input.xml 2022-06-16 16:26:41.000000000 -0400
+++ - 2022-06-16 16:28:42.000000000 -0400
@@ -6,12 +6,9 @@
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>
-
<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>
-
<xsl:template match="/*">
</xsl:template>
-
</xsl:transform>
</Foobar>
It's quite challenging to find an indentation algorithm that both (a) preserves existing whitespace in the source document, and (b) produces nice-looking output. For example, consider what happens when a template rule processes all children of an element (both element children and whitespace text node children) with an xsl:sort on an attribute value; if all whitespace from this output sequence is preserved, this will tend to put a massive wadge of whitespace at the start of the output sequence, which looks pretty ugly. This can also happen if you apply-templates to all children, but delete some of the elements while leaving the text nodes unchanged. So the spec allows the processor not only to add whitespace for indentation, but to merge ("elide") this with existing whitespace.
In particular, it's a reasonable assumption to make that if you get multiple blank lines in the result tree, they weren't put there deliberately, but arrived by accident as a result of copying multiple whitespace nodes from the input.
What's actually happening in this particular case is as follows:
For comments, the rules are different depending on whether the comment follows a start tag or an end tag. The first comment follows a start tag, and in this case the accumulated whitespace is output as-is, followed by the comment with no further indentation. The second comment follows an end tag (actually an empty element tag), and in this case the comment is indented according to its hierarchic level in the result tree, and any preceding whitespace in the result tree is discarded.
Before a start tag, indentation is added if the start tag immediately follows another start tag or end tag; if it follows a text node, no indentation is added. This rule is designed primarily to make mixed content work properly.
Before an end tag, indentation is added if it follows another end tag, but not if it follows a start tag or character data.
The detail is a lot more complex, and it has evolved in a fairly ad-hoc way to cope reasonably well with a wide variety of circumstances. As a high-level summary, Saxon will in some circumstances output the whitespace that it finds in the result tree, and in other circumstances it will output its own whitespace in preference. The algorithm isn't perfect, but it copes reasonably well with messy situations like when the input is indented with 4 spaces and the output is to be indented with 3.
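If the goal is simply to keep the blank lines, one option is to switch indentation off, so that the serializer neither adds nor merges whitespace and the copied whitespace text nodes are written as they are. The answer above does not spell this out, so treat it as a sketch that assumes the source whitespace is not stripped elsewhere (for example by xsl:strip-space or a parser option):

```xml
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- indent="no": no indentation is added, so existing whitespace is not elided -->
  <xsl:output encoding="UTF-8" method="xml" indent="no"/>

  <!-- Identity template: copy every attribute and node unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

The trade-off is that the output is no longer re-indented at all; its layout is whatever the input had.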

How to remove last N character using XSLT

I have the following code:
<xsl:value-of select="concat(substring(DBColumn, string-length(DBColumn)-3), concat('-', DBColumn))"/>
It gives me
230-Virginia-230.
I want 230-Virginia instead.
In the database the original value is Virginia-230.
Furthermore, given
ABC, 230-Virginia
how can I trim the whitespace in the same code so that it looks like ABC,230-Virginia?
It's not clear what exactly your question is.
To answer the question as stated in your title: you can remove the last N characters from a string using:
substring($string, 1, string-length($string) - $N)
Trying to illustrate with an input document that contains the data that you mentioned:
<input>
<DBColumn>Virginia-230</DBColumn>
<other>ABC </other> <!-- N.B. trailing space -->
</input>
This XSLT 3.0 stylesheet produces what I understand to be the desired result in proposed-value. For comparison, it also outputs the input value and, as old-output-value, the result of the value-of expression from your post.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0"
exclude-result-prefixes="#all">
<xsl:output method="xml" indent="yes" />
<xsl:template match="/input">
<output>
<input-value><xsl:value-of select="DBColumn" /></input-value>
<old-output-value>
<xsl:value-of
select="concat(substring(DBColumn, string-length(DBColumn)-3),
concat('-', DBColumn))"/>
</old-output-value>
<proposed-value>
<xsl:value-of
select="normalize-space(other)
|| ',' ||
string-join(reverse(tokenize(DBColumn, '-')), '-')"
/>
</proposed-value>
</output>
</xsl:template>
</xsl:stylesheet>
which produces:
<output>
<input-value>Virginia-230</input-value>
<old-output-value>-230-Virginia-230</old-output-value>
<proposed-value>ABC,230-Virginia</proposed-value>
</output>
For an xsl:value-of that I believe works in XSLT 1.0 (but I won't guarantee it), you could try:
<xsl:value-of
select="concat(other, ',',
substring-after(DBColumn, '-'),
'-',
substring-before(DBColumn, '-'))" />
which does not address the trailing space in other but at least suggests how to reverse the two values around the '-' char in DBColumn.
For suggestions on removing leading/trailing spaces from a string, see: XSLT 1.0 to remove leading and trailing spaces
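As a possible way to handle the trailing space in XSLT 1.0 as well, the other value can be wrapped in normalize-space(). This is a sketch against the sample input document shown earlier in this answer; it assumes DBColumn contains exactly one '-', and note that normalize-space() also collapses any runs of whitespace inside the value.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>

  <xsl:template match="/input">
    <!-- normalize-space() drops the trailing space of "ABC ";
         substring-after/-before swap the two parts around the '-' -->
    <xsl:value-of
        select="concat(normalize-space(other), ',',
                       substring-after(DBColumn, '-'),
                       '-',
                       substring-before(DBColumn, '-'))"/>
  </xsl:template>
</xsl:stylesheet>
```

For the sample input this should give ABC,230-Virginia.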

Whitespace in xsl:text elements

I have the following stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:text>1</xsl:text>
<xsl:text>
</xsl:text>
<xsl:text>2</xsl:text>
</xsl:template>
</xsl:stylesheet>
When running this with Saxon 9.8, I get the following result:
1
2
When running this with MSXML 6.0, the whitespace is stripped and I get:
1 2
What is the correct behavior? Is the whitespace here supposed to be stripped?
This is to do with whitespace stripping in the XSLT document. According to the W3C specification (for XSLT 1.0, which is what MSXML uses):
A text node is preserved if any of the following apply:
The element name of the parent of the text node is in the set of
whitespace-preserving element names.
The text node contains at least one non-whitespace character. As in
XML, a whitespace character is #x20, #x9, #xD or #xA.
An ancestor element of the text node has an xml:space attribute with a
value of preserve, and no closer ancestor element has xml:space with a
value of default.
It then says "For stylesheets, the set of whitespace-preserving element names consists of just xsl:text."
So it looks like MSXML is not following the specification.
However, if you add xml:space="preserve" to the xsl:text in question, you might find it does work in MSXML:
<xsl:template match="/">
<xsl:text>1</xsl:text>
<xsl:text xml:space="preserve">
</xsl:text>
<xsl:text>2</xsl:text>
</xsl:template>
The correct behaviour is as you see it from Saxon.
There's some history here and I don't remember the full details, but MSXML has a nasty habit of stripping whitespace text nodes within the parser itself. If the XML parser strips out the whitespace text nodes, then they never get as far as the XSLT processor, so it makes no difference whether that conforms to all the XSLT rules or not.
I'm pretty sure there are options in MSXML to control this behaviour, so check exactly how you are invoking the MSXML parser and change the options if necessary.

XSLT-1.0 to read a specific substring from a message

I have the message below in a variable.
7c
 {"code":3001,"message":"issued"}
 0
I would like to take the message starting with '{' and ending with '}' using XSLT. I tried using the substring() and starts-with() functions, but without success.
My final output should be
{"code":3001,"message":"issued"}
In XSLT 2.0 you could use analyze-string with matching-substring inside, to process the captured regex.
Let's move to an example. Start with a source XML given below:
<?xml version="1.0" encoding="UTF-8"?>
<main>
<message>7c {"code":3001,"message":"issued"} 0</message>
</main>
Then we can use such XSLT:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" encoding="UTF-8" indent="yes" />
<xsl:template match="message">
<xsl:copy>
<xsl:analyze-string select="." regex="\{{(.*)\}}">
<xsl:matching-substring>
<xsl:value-of select="regex-group(1)"/>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:copy>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>
</xsl:transform>
Note the content of the regex attribute.
In XSLT, curly braces in this attribute must be doubled, because regex is
an attribute value template and a single brace would start an expression.
The braces here are meant literally, i.e. we are looking
just for { and } chars (they are not delimiters of repetition
counts for the preceding regex). For this reason we also have to precede
each of them with a backslash.
Between these curly braces we have a capturing group (...).
We refer to the content of the captured group in regex-group(1) below.
If you need, you can put more capturing groups in the regex, to capture
individual parts of the message and then make some use of them.
But if you are really limited to XSLT 1.0 you can:
Start from substring-before to cut off } and everything after.
Then use substring-after to cut off { and everything before.
Or maybe you need the text with surrounding curly braces?
Then use concat to prepend { and append }.
I tried substring-after and before which give me text after '{' and
before '}'
If you're using XSLT 1.0, then do exactly that, and add the missing separators as text - for example:
<xsl:variable name="var">7c {"code":3001,"message":"issued"} 0 </xsl:variable>
<xsl:text>{</xsl:text>
<xsl:value-of select="substring-before(substring-after($var, '{'), '}')"/>
<xsl:text>}</xsl:text>
returns:
{"code":3001,"message":"issued"}
In XSLT 2.0, you could do simply:
<xsl:value-of select="replace($var, '.*(\{.*\}).*', '$1')"/>
to get the same result.
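One caveat: the message in the question spans several lines, and in XPath 2.0 regular expressions the dot does not match a newline unless the s flag is given. A minimal sketch for the multi-line case, assuming XSLT 2.0 and a processor started from a named template (the template name main is made up for this example):

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>

  <!-- The message as given in the question, including its line breaks -->
  <xsl:variable name="var">7c
{"code":3001,"message":"issued"}
0</xsl:variable>

  <xsl:template name="main">
    <!-- 's' flag: '.' also matches newlines, so the lines before and after
         the braces are consumed by the leading and trailing '.*' -->
    <xsl:value-of select="replace($var, '.*(\{.*\}).*', '$1', 's')"/>
  </xsl:template>
</xsl:stylesheet>
```

Without the s flag, the replacement would only apply within the line that contains the braces, and the 7c and 0 lines would stay in the result.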