I'd like to trim the leading whitespace inside p tags in XML, so this:
<p> Hey, <em>italics</em> and <em>italics</em>!</p>
Becomes this:
<p>Hey, <em>italics</em> and <em>italics</em>!</p>
(Trimming trailing whitespace won't hurt, but it's not mandatory.)
Now, I know normalize-space() is supposed to do this, but if I try to apply it to the text nodes...
<xsl:template match="text()">
<xsl:text>[</xsl:text>
<xsl:value-of select="normalize-space(.)"/>
<xsl:text>]</xsl:text>
</xsl:template>
...it's applied to each text node (in brackets) individually and sucks them dry:
[Hey,]<em>[italics]</em>[and]<em>[italics]</em>[!]
My XSLT looks basically like this:
<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>
So is there any way I can let apply-templates complete and then run normalize-space on the output, which should do the right thing?
This stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="p//text()[1][generate-id()=
generate-id(ancestor::p[1]
/descendant::text()[1])]">
<xsl:variable name="vFirstNotSpace"
select="substring(normalize-space(),1,1)"/>
<xsl:value-of select="concat($vFirstNotSpace,
substring-after(.,$vFirstNotSpace))"/>
</xsl:template>
</xsl:stylesheet>
Output:
<p>Hey, <em>italics</em> and <em>italics</em>!</p>
Edit 2: Better expression (now only three function calls).
Edit 3: Matching the first descendant text node (not just the first node if it's a text node). Thanks to @Dimitre's comment.
Now, with this input:
<p><b> Hey, </b><em>italics</em> and <em>italics</em>!</p>
Output:
<p><b>Hey, </b><em>italics</em> and <em>italics</em>!</p>
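For readers who want to check the idea outside an XSLT processor, here is a minimal Python sketch of the same rule (trim only the first descendant text node of each p). It is not the stylesheet above: it uses the standard xml.etree.ElementTree module, the helper names first_text_slot and ltrim_paragraphs are made up for illustration, and it assumes comments and processing instructions can be ignored.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"  # the four characters XML treats as whitespace


def first_text_slot(elem):
    """Locate the first non-empty text node under elem, in document order.

    ElementTree stores text in .text (before the first child) and .tail
    (after an element), so the result is an (owner, attribute name) pair.
    """
    if elem.text:
        return elem, "text"
    for child in elem:
        slot = first_text_slot(child)
        if slot:
            return slot
        if child.tail:
            return child, "tail"
    return None


def ltrim_paragraphs(xml_source):
    """Strip leading whitespace from the first descendant text node of each p."""
    root = ET.fromstring(xml_source)
    for p in root.iter("p"):
        slot = first_text_slot(p)
        if slot:
            owner, attr = slot
            setattr(owner, attr, getattr(owner, attr).lstrip(XML_WS))
    return ET.tostring(root, encoding="unicode")


print(ltrim_paragraphs("<p><b> Hey, </b><em>italics</em> and <em>italics</em>!</p>"))
```

Only the leading edge of one text node is touched, so the spaces between the inline elements survive.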
I would do something like this:
<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>
<!-- strip leading whitespace -->
<xsl:template match="p/node()[1][self::text()]">
<xsl:call-template name="left-trim">
<xsl:with-param name="s" select="."/>
</xsl:call-template>
</xsl:template>
This will strip left space from the initial node child of a <p> element, if it is a text node. It will not strip space from the first text node child, if it is not the first node child. E.g. in
<p><em>Hey</em> there</p>
I intentionally avoid stripping the space from the front of 'there', because that would make the words run together when rendered in a browser. If you did want to strip that space, change the match pattern to
match="p/text()[1]"
If you also want to strip trailing whitespace, as your title possibly implies, add these two templates:
<!-- strip trailing whitespace -->
<xsl:template match="p/node()[last()][self::text()]">
<xsl:call-template name="right-trim">
<xsl:with-param name="s" select="."/>
</xsl:call-template>
</xsl:template>
<!-- strip leading/trailing whitespace on sole text node -->
<xsl:template match="p/node()[position() = 1 and
position() = last()][self::text()]"
priority="2">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
The definitions of the left-trim and right-trim templates are at Trim Template for XSLT (untested). They might be slow for documents with lots of <p>s. If you can use XSLT 2.0, you can replace the call-templates with
<xsl:value-of select="replace(.,'^\s+','')" />
and
<xsl:value-of select="replace(.,'\s+$','')" />
(Thanks to Priscilla Walmsley.)
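The three match patterns above (first node, last node, sole node) can be sketched in Python to see which text gets trimmed. This is an illustration under assumptions, not the answer's XSLT: it uses xml.etree.ElementTree, the function name trim_p_edges is invented here, comments and processing instructions are ignored, and the sole-text-node case is only trimmed at both ends rather than passed through normalize-space().

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"  # the four characters XML treats as whitespace


def trim_p_edges(xml_source):
    """Trim the outer edges of each p, but only where a text node sits at the edge."""
    root = ET.fromstring(xml_source)
    for p in root.iter("p"):
        # .text is set only when a text node is the first child node
        if p.text:
            p.text = p.text.lstrip(XML_WS)
        if len(p) == 0:
            # sole text node: it is also the last node, so trim the right edge
            if p.text:
                p.text = p.text.rstrip(XML_WS)
        elif p[-1].tail:
            # text after the last child element is the last child node
            p[-1].tail = p[-1].tail.rstrip(XML_WS)
    return ET.tostring(root, encoding="unicode")


print(trim_p_edges("<p> Hey, <em>italics</em> and <em>italics</em>! </p>"))
print(trim_p_edges("<p><em>Hey</em> there</p>"))
```

The second call leaves the space before 'there' alone, which is the behaviour the answer argues for.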
You want:
<xsl:template match="text()">
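<xsl:variable name="vNorm" select="normalize-space(concat('[',.,']'))"/>
<!-- drop the wrapping '[' and ']' from the normalized string -->
<xsl:value-of select="substring($vNorm, 2, string-length($vNorm) - 2)"/>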
</xsl:template>
This wraps the string in "[]", then applies normalize-space(), and finally removes the wrapping characters, so leading and trailing whitespace collapses to a single space instead of disappearing.
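The effect of the wrapping trick is easy to check with a small Python emulation. This is a sketch, not XSLT: normalize_space below is a hand-written stand-in for the XPath function (it treats only space, tab, CR and LF as whitespace), and collapse_keep_edges is a name invented for this example.

```python
import re


def normalize_space(s):
    """Emulate XPath normalize-space(): collapse whitespace runs, then trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def collapse_keep_edges(s):
    """Wrap in brackets so the edges are protected, normalize, then unwrap."""
    wrapped = normalize_space("[" + s + "]")
    return wrapped[1:-1]


# the text nodes of <p> Hey, <em>italics</em> and <em>italics</em>!</p>
pieces = [" Hey, ", "italics", " and ", "italics", "!"]

print("".join(normalize_space(p) for p in pieces))      # Hey,italicsanditalics!
print("".join(collapse_keep_edges(p) for p in pieces))  # " Hey, italics and italics!"
```

Applied per text node, plain normalize-space() glues the words together, while the wrapped version keeps one separating space at each edge.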
Kindly help me to wrap the img.inline element with the following sibling text comma (if comma exists):
text <img id="1" class="inline" src="1.jpg"/> another text.
text <img id="2" class="inline" src="2.jpg"/>, another text.
Should be changed to:
text <img id="1" class="inline" src="1.jpg"/> another text.
text <span class="img-wrap"><img id="2" class="inline" src="2.jpg"/>,</span> another text.
Currently, my XSLT will wrap the img.inline element and add comma inside the span, now I want to remove the following comma.
text <span class="img-wrap"><img id="2" class="inline" src="2.jpg"/>,</span>
, <!--remove this extra comma--> another text.
My XSLT:
<xsl:template match="//img[@class='inline']">
<xsl:copy>
<xsl:choose>
<xsl:when test="starts-with(following-sibling::text(), ',')">
<span class="img-wrap">
<xsl:apply-templates select="node()|@*"/>
<xsl:text>,</xsl:text>
</span>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="node()|@*"/>
</xsl:otherwise>
</xsl:choose>
</xsl:copy>
<!-- checking following-sibling::text() -->
<xsl:apply-templates select="following-sibling::text()" mode="commatext"/>
</xsl:template>
<!-- here I want to match the following text, if comma, then remove it -->
<xsl:template match="the following comma" mode="commatext">
<!-- remove comma -->
</xsl:template>
Is my approach correct, or should this be handled differently? Please suggest.
Currently you are copying the img and then embedding the span within it. Also, <xsl:apply-templates select="node()|@*"/> selects the child nodes of img (of which there are none) and its attributes, so the attributes end up being added to the span.
You don't actually need the xsl:choose here as you can add the condition to the match attribute.
<xsl:template match="//img[@class='inline'][starts-with(following-sibling::node()[1][self::text()], ',')]">
Note I have changed the condition, because following-sibling::text() selects all text nodes that follow the img node. You only want the node immediately after the img node, and only if it is a text node.
Also, trying to select the following text node with xsl:apply-templates is probably not the right approach, assuming you have a template that matches the parent node which selects all child nodes (not just img ones). I am assuming you were using the identity template here.
Anyway, try this XSLT instead
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html" indent="no" />
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="//img[@class='inline'][starts-with(following-sibling::node()[1][self::text()], ',')]">
<span class="img-wrap">
<xsl:copy-of select="." />
<xsl:text>,</xsl:text>
</span>
</xsl:template>
<xsl:template match="text()[starts-with(., ',')][preceding-sibling::node()[1][self::img]/@class='inline']">
<xsl:value-of select="substring(., 2)" />
</xsl:template>
</xsl:stylesheet>
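The two overriding templates split one job between them: the img template adds the comma inside the new span, and the text template drops it from the following text. A rough Python equivalent makes that pairing visible in one place. It is only a sketch using xml.etree.ElementTree; wrap_inline_images is a hypothetical name, and a wrapping div is assumed so the sample has a single root element.

```python
import xml.etree.ElementTree as ET


def wrap_inline_images(xml_source):
    """Wrap img.inline in a span together with a comma that directly follows it."""
    root = ET.fromstring(xml_source)
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            follows_comma = (child.tail or "").startswith(",")
            if child.tag == "img" and child.get("class") == "inline" and follows_comma:
                span = ET.Element("span", {"class": "img-wrap"})
                span.tail = child.tail[1:]  # text after the comma stays outside
                child.tail = ","            # the comma moves inside the span
                parent.remove(child)
                span.append(child)
                parent.insert(index, span)  # span takes the img's old position
    return ET.tostring(root, encoding="unicode")


sample = (
    '<div>text <img id="1" class="inline" src="1.jpg"/> another text. '
    'text <img id="2" class="inline" src="2.jpg"/>, another text.</div>'
)
print(wrap_inline_images(sample))
```

In ElementTree the text right after an element is that element's tail, which corresponds to the following-sibling::node()[1][self::text()] test in the match pattern.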
I have XML like this:
<doc>
<p>Biological<sub>89</sub> bases<sub>4456</sub> for<sub>8910</sub> sexual<sub>4456</sub>
differences<sub>8910</sub> in<sub>4456</sub> the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub> Recently<sub>8910</sub> the
dogma<sub>8910</sub> of<sub>4456</sub> hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
As you can see, the <p> node contains <sub> nodes and text() nodes, and after the end of every <sub> node there is a text node that starts with a space (e.g. in <sub>89</sub> bases there is a space before 'bases'). I need to replace those specific spaces with <s/> nodes.
So the expected output should look like this:
<doc>
<p>Biological<sub>89</sub><s/>bases<sub>4456</sub><s/>for<sub>8910</sub><s/>sexual<sub>4456</sub>
<s/>differences<sub>8910</sub><s/>in<sub>4456</sub><s/>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s/>Recently<sub>8910</sub><s/>the
dogma<sub>8910</sub><s/>of<sub>4456</sub><s/>hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
To do this I can use a regular expression like this:
<xsl:template match="p/text()">
<xsl:analyze-string select="." regex="( )">
<xsl:matching-substring>
<xsl:choose>
<xsl:when test="regex-group(1)">
<s/>
</xsl:when>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
But this adds <s/> nodes for every space in the text() node, and I only need to add nodes for those specific spaces.
Can anyone suggest how I can do this?
If you only want to match text nodes that start with a space and are preceded by a sub element, you can put the condition in your template match
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
And if you just want to remove the space at the start of the string, a simple replace will do.
<xsl:value-of select="replace(., '^\s+', '')" />
Try this XSLT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" indent="no" />
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
<s />
<xsl:value-of select="replace(., '^\s+', '')" />
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
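As a cross-check of the matching rule (a text node that starts with a space and immediately follows a sub), here is a small Python sketch with xml.etree.ElementTree. It is an illustration rather than a replacement for the stylesheet; mark_spaces_after_sub is an invented name, and the sample is a shortened version of the question's paragraph.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"  # the four characters XML treats as whitespace


def mark_spaces_after_sub(xml_source):
    """Insert <s/> after each sub whose following text starts with a space."""
    root = ET.fromstring(xml_source)
    for p in root.iter("p"):
        # walk backwards so inserting a marker does not shift pending indexes
        for index, child in reversed(list(enumerate(p))):
            if child.tag == "sub" and (child.tail or "").startswith(" "):
                marker = ET.Element("s")
                marker.tail = child.tail.lstrip(XML_WS)  # text moves behind the marker
                child.tail = None
                p.insert(index + 1, marker)
    return ET.tostring(root, encoding="unicode")


sample = "<doc><p>Biological<sub>89</sub> bases<sub>4456</sub> for<sub>1</sub>.</p></doc>"
print(mark_spaces_after_sub(sample))
```

ElementTree writes the empty element as <s />, which is equivalent to <s/>. Spaces elsewhere in the text are left alone because only the text directly after a sub is inspected.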
Just change the regex like so: ^( ). It will match only a space at the beginning of the text node.
With this XSL snippet:
<xsl:analyze-string select="." regex="^( )">
Here is the result I obtain:
<p>Biological<sub>89</sub><s></s>bases<sub>4456</sub><s></s>for<sub>8910</sub><s></s>sexual<sub>4456</sub>
differences<sub>8910</sub><s></s>in<sub>4456</sub><s></s>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s></s>Recently<sub>8910</sub><s></s>the
dogma<sub>8910</sub><s></s>of<sub>4456</sub><s></s>hormonal dependence for the sexual
differentiation of the brain has been challenged.
</p>
I have the following XML:
<xml>
<xref>
is determined “in prescribed manner”
</xref>
</xml>
I want to know whether XSLT 2.0 can process this and return the following result:
<xml>
<xref>
is
</xref>
<xref>
determined
</xref>
<xref>
“in prescribed manner”
</xref>
</xml>
I tried a few options, such as replacing the spaces and entities and then using a for-each loop, but could not work it out. Maybe the tokenize function of XSLT 2.0 would help, but I don't know how to use it. Any hint will be helpful.
@JimGarrison: Sorry, I couldn't resist. :-) This XSLT is definitely not elegant but it does (I assume) most of the job:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />
<xsl:variable name="left_quote" select="'&lt;'"/>
<xsl:variable name="right_quote" select="'&gt;'"/>
<xsl:template name="protected_tokenize">
<xsl:param name="string"/>
<xsl:variable name="pattern" select="concat('^([^', $left_quote, ']+)(', $left_quote, '[^', $right_quote, ']*', $right_quote,')?(.*)')"/>
<xsl:analyze-string select="$string" regex="{$pattern}">
<xsl:matching-substring>
<!-- Handle the prefix of the string up to the first opening quote by "normal" tokenizing. -->
<xsl:variable name="prefix" select="concat(' ', normalize-space(regex-group(1)))"/>
<xsl:for-each select="tokenize(normalize-space($prefix), ' ')">
<xref>
<xsl:value-of select="."/>
</xref>
</xsl:for-each>
<!-- Handle the text between the quotes by simply passing it through. -->
<xsl:variable name="protected_token" select="normalize-space(regex-group(2))"/>
<xsl:if test="$protected_token != ''">
<xref>
<xsl:value-of select="$protected_token"/>
</xref>
</xsl:if>
<!-- Handle the suffix of the string. This part may contain protected tokens again, so we do it recursively. -->
<xsl:variable name="suffix" select="normalize-space(regex-group(3))"/>
<xsl:if test="$suffix != ''">
<xsl:call-template name="protected_tokenize">
<xsl:with-param name="string" select="$suffix"/>
</xsl:call-template>
</xsl:if>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>
<xsl:template match="*|@*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="xref">
<xsl:call-template name="protected_tokenize">
<xsl:with-param name="string" select="text()"/>
</xsl:call-template>
</xsl:template>
</xsl:stylesheet>
Notes:
There is the general assumption that white space only serves as a token delimiter and need not be preserved.
&ldquo; and &rdquo; seem to be invalid in XML although they are valid in HTML. In the XSLT there are variables defined holding the quote characters. They will have to be adapted once you find the right XML representation. You can also eliminate the variables and put the characters right into the regular expression pattern, which simplifies it significantly.
<xsl:analyze-string> does not allow a regular expression that may match an empty string. This is a small problem, since the prefix, the protected token and the suffix may each be empty. I take care of this by artificially adding a space at the beginning of the pattern, which allows me to search for the prefix using + (at least one occurrence) instead of * (zero or more occurrences).
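The "protected tokenize" idea (split on whitespace, but keep a quoted phrase as one token) can be sketched compactly in Python. This is a sketch under assumptions, not a translation of the stylesheet: it assumes the quotes are the curly characters U+201C and U+201D, that xref contains text only, and the names TOKEN and split_xref are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

LDQUO, RDQUO = "\u201c", "\u201d"  # assumed quote characters (curly double quotes)

# a whole quoted phrase, or failing that a run of non-whitespace characters
TOKEN = re.compile(LDQUO + "[^" + RDQUO + "]*" + RDQUO + r"|\S+")


def split_xref(xml_source):
    """Replace each xref by one xref per token, keeping quoted phrases intact."""
    root = ET.fromstring(xml_source)
    for parent in list(root.iter()):
        # walk backwards so inserted elements do not shift pending indexes
        for index, child in reversed(list(enumerate(parent))):
            if child.tag != "xref":
                continue
            tokens = TOKEN.findall(child.text or "")
            parent.remove(child)
            for offset, token in enumerate(tokens):
                piece = ET.Element("xref")
                piece.text = token
                parent.insert(index + offset, piece)
    return ET.tostring(root, encoding="unicode")


sample = "<xml><xref> is determined \u201cin prescribed manner\u201d </xref></xml>"
print(split_xref(sample))
```

Trying the quoted alternative first is what protects the phrase; the fallback alternative then handles ordinary words, so no recursion is needed.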
I'm using XSLT to extract some data from a trademark XML file from the Patent and Trademark Office. It's mostly okay, except for one blank line. I can get rid of it with a moderately ugly workaround, but I'd like to know if there is a better way.
Here's a subset of my XSLT:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tm="http://www.wipo.int/standards/XMLSchema/trademarks" xmlns:pto="urn:us:gov:doc:uspto:trademark:status">
<xsl:output method="text" encoding="utf-8" />
<xsl:strip-space elements="*"/>
<xsl:template match="tm:Transaction">
<xsl:apply-templates select=".//tm:TradeMark"/>
<xsl:apply-templates select=".//tm:ApplicantDetails"/>
<xsl:apply-templates select=".//tm:MarkEvent"/>
</xsl:template>
<xsl:template match="tm:TradeMark">
MarkCurrentStatusDate,"<xsl:value-of select="normalize-space(tm:MarkCurrentStatusDate)"/>"<xsl:text/>
ApplicationNumber,"<xsl:value-of select="normalize-space(tm:ApplicationNumber)"/>"<xsl:text/>
ApplicationDate,"<xsl:value-of select="normalize-space(tm:ApplicationDate)"/>"<xsl:text/>
RegistrationNumber,"<xsl:value-of select="normalize-space(tm:RegistrationNumber)"/>"<xsl:text/>
RegistrationDate,"<xsl:value-of select="normalize-space(tm:RegistrationDate)"/>"<xsl:text/>
<xsl:apply-templates select="tm:WordMarkSpecification"/>
<xsl:apply-templates select="tm:TradeMarkExt"/>
<xsl:apply-templates select="tm:PublicationDetails"/>
<xsl:apply-templates select="tm:RepresentativeDetails"/>
</xsl:template>
<xsl:template match="tm:WordMarkSpecification">
MarkVerbalElementText,"<xsl:value-of select="normalize-space(tm:MarkVerbalElementText)"/>"<xsl:text/>
</xsl:template>
It has a few more templates, but that's the gist of it. I always get a blank line at the very beginning of the output, before any data; I don't get any other blank lines. My circumvention is to combine the two lines:
<xsl:template match="tm:TradeMark">
MarkCurrentStatusDate,"<xsl:value-of select="normalize-space(tm:MarkCurrentStatusDate)"/>"<xsl:text/>
into a single line:
<xsl:template match="tm:TradeMark">MarkCurrentStatusDate,"<xsl:value-of select="normalize-space(tm:MarkCurrentStatusDate)"/>"<xsl:text/>
This works, and I guess I'm okay with it if there's nothing better, but it seems inelegant and like a kludge to me. None of the other templates need this treatment (e.g. the tm:WordMarkSpecification template or another six after that), and I'm confused why it's needed here. Any ideas?
Because I can see the specific point in the XSLT that's inserting the blank line, I presume it's not helpful to provide the XML I'm testing on, but if you do need to see it, you can get it at https://tsdrapi.uspto.gov/ts/cd/casestatus/rn2178784/download.zip ; it's the XML file in that archive.
Use at the beginning of the template the same trick you are using at the end of the template to chop up the stylesheet node tree with empty <xsl:text/> instructions:
<xsl:template match="tm:TradeMark">
<xsl:text/>MarkCurrentStatusDate,"<xsl:value-of select="normalize-space(tm:MarkCurrentStatusDate)"/>"<xsl:text/>
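Why the empty <xsl:text/> helps: an XSLT processor strips text nodes in the stylesheet that consist only of whitespace, but keeps any other text node whole, including its leading newline. The Python sketch below just lists the text nodes of a cut-down template to show where the newline lives. The stylesheet fragment and the name template_text_nodes are invented for the illustration.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"

STYLESHEET = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="item">
Label,"<xsl:value-of select="name"/>"<xsl:text/>
</xsl:template>
</xsl:stylesheet>'''


def template_text_nodes(stylesheet_source):
    """Return (text, kept) for each text node directly inside the first template."""
    root = ET.fromstring(stylesheet_source)
    template = root.find("{%s}template" % XSL)
    texts = [template.text] + [child.tail for child in template]
    # whitespace-only text nodes are stripped by the processor; others are kept whole
    return [(t, bool(t.strip(" \t\r\n"))) for t in texts if t]


for text, kept in template_text_nodes(STYLESHEET):
    print(repr(text), "kept" if kept else "stripped")
```

The first text node starts with a newline and is kept, which is the blank line in the output. Putting <xsl:text/> before the label splits that node so the newline ends up in a whitespace-only node that is stripped.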
Personally, I think it's cleaner to use a concat() when you need to combine static text and dynamic values:
<xsl:template match="tm:TradeMark">
<xsl:value-of
select="concat(
'MarkCurrentStatusDate,&quot;', normalize-space(tm:MarkCurrentStatusDate), '&quot;',
'ApplicationNumber,&quot;', normalize-space(tm:ApplicationNumber), '&quot;',
'ApplicationDate,&quot;', normalize-space(tm:ApplicationDate), '&quot;',
'RegistrationNumber,&quot;', normalize-space(tm:RegistrationNumber), '&quot;',
'RegistrationDate,&quot;', normalize-space(tm:RegistrationDate), '&quot;'
)"/>
<xsl:apply-templates select="tm:WordMarkSpecification"/>
<xsl:apply-templates select="tm:TradeMarkExt"/>
<xsl:apply-templates select="tm:PublicationDetails"/>
<xsl:apply-templates select="tm:RepresentativeDetails"/>
</xsl:template>
This should also solve your issue with the blank spaces showing up.
Remember you can always just play with the XML syntax to ignore end-of-line sequences that are inside of start and end tag delimiters:
<xsl:template match="tm:TradeMark"
>MarkCurrentStatusDate,"<xsl:value-of select="normalize-space(tm:MarkCurrentStatusDate)"
/>"ApplicationNumber,"<xsl:value-of select="normalize-space(tm:ApplicationNumber)"
/>"ApplicationDate,"<xsl:value-of select="normalize-space(tm:ApplicationDate)"
/>"RegistrationNumber,"<xsl:value-of select="normalize-space(tm:RegistrationNumber)"
/>"RegistrationDate,"<xsl:value-of select="normalize-space(tm:RegistrationDate)"
/>"<xsl:apply-templates select="tm:WordMarkSpecification"/>
<xsl:apply-templates select="tm:TradeMarkExt"/>
<xsl:apply-templates select="tm:PublicationDetails"/>
<xsl:apply-templates select="tm:RepresentativeDetails"/>
</xsl:template>
There is no rule in XML that a tag's closing delimiter /> has to be on the same line as the tag's opening delimiter <. White-space inside of a tag is ignored (where innocuous), and an end-of-line sequence is considered white-space.
If you want all of the text to be emitted without extra line breaks or white-space, then put the literal text inside of <xsl:text> elements.
<xsl:template match="tm:TradeMark">
<xsl:text>MarkCurrentStatusDate,"</xsl:text>
<xsl:value-of select="normalize-space(tm:MarkCurrentStatusDate)"/>
<xsl:text>"</xsl:text>
<xsl:text>ApplicationNumber,"</xsl:text>
<xsl:value-of select="normalize-space(tm:ApplicationNumber)"/>
<xsl:text>"</xsl:text>
<xsl:text>ApplicationDate,"</xsl:text>
<xsl:value-of select="normalize-space(tm:ApplicationDate)"/>
<xsl:text>"</xsl:text>
<xsl:text>RegistrationNumber,"</xsl:text>
<xsl:value-of select="normalize-space(tm:RegistrationNumber)"/>
<xsl:text>"</xsl:text>
<xsl:text>RegistrationDate,"</xsl:text>
<xsl:value-of select="normalize-space(tm:RegistrationDate)"/>
<xsl:text>"</xsl:text>
<xsl:apply-templates select="tm:WordMarkSpecification"/>
<xsl:apply-templates select="tm:TradeMarkExt"/>
<xsl:apply-templates select="tm:PublicationDetails"/>
<xsl:apply-templates select="tm:RepresentativeDetails"/>
</xsl:template>
That way, none of the line breaks and white-space inside of the <xsl:template> will be seen as significant and will not be included in the result tree output.
I'm trying to have an XSLT that copies most of the tags but removes empty "<b/>" tags. That is, it should copy as-is "<b> </b>" or "<b>toto</b>" but completely remove "<b/>".
I think the template would look like :
<xsl:template match="b">
<xsl:if test=".hasChildren()">
<xsl:element name="b">
<xsl:apply-templates/>
</xsl:element>
</xsl:if>
</xsl:template>
But of course, the "hasChildren()" part doesn't exist... Any ideas?
dsteinweg put me on the right track ... I ended up doing :
<xsl:template match="b">
<xsl:if test="./* or ./text()">
<xsl:element name="b">
<xsl:apply-templates/>
</xsl:element>
</xsl:if>
</xsl:template>
This transformation ignores any <b> elements that do not have any node child. A node in this context means an element, text, comment or processing instruction node.
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="b[not(node())]"/>
</xsl:stylesheet>
Notice that here we use one of the most fundamental XSLT design patterns -- using the identity transform and overriding it for specific nodes.
The overriding template will be selected only for nodes that are elements named "b" and do not have (any nodes as) children. This template is empty (does not have any contents), so the effect of its application is that the matching node is ignored/discarded and is not reproduced in the output.
This technique is very powerful and is widely used for such tasks and also for renaming, changing the contents or attributes, adding children or siblings to any specific node that can be matched (every type of node with the exception of a namespace node can be used as a match pattern in the "match" attribute of <xsl:template/>).
Hope this helped.
Cheers,
Dimitre Novatchev
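The same "copy everything, drop b elements without child nodes" rule can be tried in Python to see how <b/>, <b> </b> and <b>toto</b> are treated. This is only a sketch with xml.etree.ElementTree: drop_empty_b is a hypothetical name, and because the default parser discards comments and processing instructions, a b containing only a comment would count as empty here, unlike in the stylesheet.

```python
import xml.etree.ElementTree as ET


def drop_empty_b(xml_source):
    """Remove b elements that have neither child elements nor text."""
    root = ET.fromstring(xml_source)
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == "b" and len(child) == 0 and not child.text:
                # re-attach the text that followed the removed element
                if child.tail:
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return ET.tostring(root, encoding="unicode")


print(drop_empty_b("<r>a<b/>b<b> </b>c<b>toto</b>d</r>"))
```

The whitespace-only <b> </b> survives because it has a text node child, which matches the b[not(node())] test.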
I wonder if this will work?
<xsl:template match="b">
<xsl:if test="text()">
...
See if this will work.
<xsl:template match="b">
<xsl:if test=".!=''">
<xsl:element name="b">
<xsl:apply-templates/>
</xsl:element>
</xsl:if>
</xsl:template>
An alternative would be to do the following:
<xsl:template match="b[not(text())]" />
<xsl:template match="b">
<b>
<xsl:apply-templates/>
</b>
</xsl:template>
You could put all the logic in the predicate, and set up a template to match only what you want and delete it:
<xsl:template match="b[not(node())]" />
This assumes that you have an identity template later on in the transform, which it sounds like you do. That will automatically copy any "b" tags with content, which is what you want:
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
Edit: Now uses node() like Dimitri, below.
If you have access to update the original XML, you could try using xml:space="preserve" on the root element:
<html xml:space="preserve">
...
</html>
This way, the space in the empty <b> </b> tag is preserved, and so can be distinguished from <b /> in the XSLT.
<xsl:template match="b">
<xsl:if test="text() != ''">
....
</xsl:if>
</xsl:template>