XSLT XML parsing: optional keywords with variable text groups

I have a snippet:
<body>
<p>keyword1 text text more text
</p>
<p>more text</p>
<p>more text</p>
<p>keyword2 text text more text
</p>
<p>more text</p>
<p>more text</p>
<p>keyword3 text text more text
</p>
<p>more text</p>
<p>more text</p>
<p>keyword4
</p>
</body>
In the snippet above, I have a list of optional keywords. The text which follows is of variable length. There might be multiple groupings of <p></p> before the next keyword appears. When the next keyword appears, it signals the end of the previous keyword.
What's a good way to go about doing this in XSLT?
Edit:
Suppose my keywords were: keyword1, keyword2, keyword3, keyword4.
I'm using XSLT version 1.0.
I'll post my XSLT in a little while... it's not working, though.

I'd use the XSLT 2.0 grouping constructs, with a group-starting-with attribute that matches each p element containing a keyword.
That is, something like this:
<xsl:variable name="keywords"
as="xs:string*"
select="('keyword1', 'keyword2', 'keyword3', 'keyword4')"
/>
<xsl:for-each-group select="p"
group-starting-with="tokenize(., '\s+') = $keywords">
<!--* process each group here ... *-->
</xsl:for-each-group>

It's not clear what kind of result you intend to get.
C.M. Sperberg's approach addresses the right basic idea; however, the code provided does not seem to run with my XSL processor (group-starting-with expects a match pattern, not a boolean expression). So I'd propose a transformation like this:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output indent="yes" method="xml"/>
<xsl:variable name="keywords" select="'keyword1 keyword2 keyword3 keyword4'"/>
<xsl:template match="body">
<xsl:copy>
<xsl:for-each-group select="p" group-starting-with="p[contains($keywords,substring-before(.,' '))]">
<div>
<xsl:attribute name="class">
<xsl:value-of select="substring-before(current-group()[1],' ')"/>
</xsl:attribute>
<xsl:copy-of select="current-group()"/>
</div>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
</xsl:transform>
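Since the question mentions version 1.0, where xsl:for-each-group is not available, the same grouping can be approximated with a key. The following is only a sketch under assumptions: the key name `group` is made up for illustration, the keyword list is repeated inline because XSLT 1.0 does not allow variable references in a key's match or use attribute, and a p counts as a keyword paragraph when its first whitespace-separated word is in that list. Any p elements that come before the first keyword are silently dropped.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Index every non-keyword p under the generated id of the nearest
       preceding keyword p. The first word of a p is looked up in a
       space-padded keyword list so that only whole words match. -->
  <xsl:key name="group"
      match="p[not(contains(' keyword1 keyword2 keyword3 keyword4 ',
        concat(' ', substring-before(concat(normalize-space(.), ' '), ' '), ' ')))]"
      use="generate-id(preceding-sibling::p[
        contains(' keyword1 keyword2 keyword3 keyword4 ',
        concat(' ', substring-before(concat(normalize-space(.), ' '), ' '), ' '))][1])"/>

  <xsl:template match="body">
    <xsl:copy>
      <!-- One div per keyword paragraph -->
      <xsl:for-each select="p[contains(' keyword1 keyword2 keyword3 keyword4 ',
        concat(' ', substring-before(concat(normalize-space(.), ' '), ' '), ' '))]">
        <div class="{substring-before(concat(normalize-space(.), ' '), ' ')}">
          <!-- The keyword p itself plus the paragraphs indexed under it -->
          <xsl:copy-of select=". | key('group', generate-id())"/>
        </div>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Appending a space before calling substring-before means a paragraph that consists of a single word (such as the keyword4 paragraph in the question) still yields that word rather than an empty string.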

Related

Recursively replacing elements in XSLT

I need to replace the <tref> element with other tags from elsewhere in my document. For example, I have:
<tref id="57236"/>
and
<Topic>
<ID>57236</ID>
<Text>
<p id="4">
<cs id="56792">1090-189-01 </cs>
<href id="57237">
<cs id="56792">Document Name</cs>
</href>
</p>
</Text>
</Topic>
Obtaining the following is not a problem:
<p id="4">
<cs id="56792">1090-189-01 </cs>
<href id="57237">
<cs id="56792">Document Name</cs>
</href>
</p>
With this stylesheet:
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="tref">
<xsl:variable name="NodeID"><xsl:value-of select="@id"/></xsl:variable>
<xsl:copy-of select="//Topic[ID = $NodeID]/Text/p/node()"/>
</xsl:template>
What I cannot do is replacing trefs nested into other trefs. For example, consider the following:
<tref id="57236"/>
and:
<Topic>
<ID>57236</ID>
<Text>
<p id="251">
<tref id="37287"/>
</p>
</Text>
</Topic>
My stylesheet duly replaces the tref with the content of the tag - which also contains a tref:
<p id="251">
<tref id="37287"/>
</p>
My current solution is to call <xsl:template match="tref"> from two different stylesheets. It does the job, but it is not very elegant, and what if trefs are nested at an even deeper level? And recursion is the bread and butter of XSLT.
Is there a way to recursively replace all trefs in XSLT?
Instead of using xsl:copy-of, use xsl:apply-templates
<xsl:apply-templates select="//Topic[ID = $NodeID]/Text/p/node()"/>
Or, to eliminate the use of the variable:
<xsl:apply-templates select="//Topic[ID = current()/@id]/Text/p/node()"/>
Note you can make use of an xsl:key to look-up the Topic elements
<xsl:key name="topic" match="Topic" use="ID" />
Then you can write this
<xsl:apply-templates select="key('topic', @id)/Text/p/node()"/>
Be wary of infinite recursion if you have a tref referring to a Topic that is an ancestor of it.
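Putting these pieces together, and guarding against the infinite recursion mentioned above, might look roughly like the sketch below. It assumes XSLT 1.0, so the list of already expanded ids has to be passed explicitly through the identity template. The parameter name `seen` and the `|` delimiter are illustrative choices, not something from the question.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:key name="topic" match="Topic" use="ID"/>

  <!-- Identity transform that forwards the list of ids expanded so far -->
  <xsl:template match="@*|node()">
    <xsl:param name="seen" select="'|'"/>
    <xsl:copy>
      <xsl:apply-templates select="@*|node()">
        <xsl:with-param name="seen" select="$seen"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

  <!-- Replace a tref by the content of the Topic it points to, applying
       templates (not copy-of) so that nested trefs are expanded as well -->
  <xsl:template match="tref">
    <xsl:param name="seen" select="'|'"/>
    <!-- Skip a tref whose id is already being expanded (a cycle) -->
    <xsl:if test="not(contains($seen, concat('|', @id, '|')))">
      <xsl:apply-templates select="key('topic', @id)/Text/p/node()">
        <xsl:with-param name="seen" select="concat($seen, @id, '|')"/>
      </xsl:apply-templates>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

With this guard a tref that points back to a Topic currently being expanded produces no output instead of recursing forever. Whether dropping it silently is the right behaviour depends on the data.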

Transforming node contents to remove whitespace

If the contents of a citations node is something like the following:
<p>
WAJWAJADS:
</p>
<p>
asdf
</p>
<p>
ALSOAS:
</p>
<p>
lorem ipsum...<br />
lorem<br />
blah blah <i>
adfas & dasdsaafs
</i>, April 2011.<br />
lorem lorem dear lord the whitespace
</p>
Is there any way to transform this to properly formatted HTML with XSLT?
normalize-space() just concats everything together. The best I've managed to do is normalize-space() on all p descendants within a for-each loop and wrap them in a p element. However, then any inner tags are still lost.
Is there a better way to parse this WYSIWYG generated trainwreck? Unfortunately I have no control over the generated XML.
I've slightly modified the answer by Martin Honnen:
<xsl:template match="text()">
<xsl:value-of select="normalize-space(.)"/>
<xsl:if test="substring(., string-length(.)) = ' ' and substring(., string-length(.) - 1, string-length(.)) != '  '">
<xsl:text> </xsl:text>
</xsl:if>
</xsl:template>
It tests whether the last character is a space and the last two characters are not both spaces; if so, it inserts a space.
You first need to have a well-formed XML with a root.
Assuming you have that, you can apply an identity transform to copy the source tree to the result, strip spaces between the tags, optionally generate output in HTML (without the XML declaration) and indented, and use normalize-space() only in the text nodes.
Try this stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes" method="html"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
</xsl:stylesheet>
The result applied to the data you provided will be:
<p>WAJWAJADS:</p>
<p>asdf</p>
<p>ALSOAS:</p>
<p>lorem ipsum...<br>lorem<br>blah blah<i>adfas & dasdsaafs</i>, April 2011.<br>lorem lorem dear lord the whitespace
</p>
You can see the result applied to your example in this XSLT Fiddle
UPDATE 1: to add an extra space around each text node (and avoid concatenation when the string value of the node is calculated) you can replace the last template with:
<xsl:template match="text()">
<xsl:value-of select="concat(' ',normalize-space(.),' ')"/>
</xsl:template>
Result:
<html>
<p> WAJWAJADS: </p>
<p> asdf </p>
<p> ALSOAS: </p>
<p> lorem ipsum... <br> lorem <br> blah blah <i> adfas & dasdsaafs </i> , April 2011. <br> lorem lorem dear lord the whitespace
</p>
</html>
See: http://xsltransform.net/3NzcBsE/1
UPDATE 2: to add a space or newline after each copied element, place this <xsl:text>
</xsl:text> (for a newline) or this <xsl:text> </xsl:text> (for a space) after the </xsl:copy> in the first template:
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
<xsl:text>
</xsl:text>
</xsl:template>
Result:
<html>
<p>WAJWAJADS:</p>
<p>asdf</p>
<p>ALSOAS:</p>
<p>lorem ipsum...<br>
lorem<br>
blah blah<i>adfas & dasdsaafs</i>
, April 2011.<br>
lorem lorem dear lord the whitespace
</p>
</html>
See: http://xsltransform.net/3NzcBsE/2
Use the identity transformation template plus a template for text nodes doing the normalize-space:
<xsl:template match="text()"><xsl:value-of select="normalize-space()"/></xsl:template>
This question would have been a lot easier to understand if the example contained real text instead of gibberish. "No additional whitespace between node start/end and text." is not an accurate enough description of the expected result.
I am going to take a guess here and assume you actually want to perform a "run of spaces to one space" operation on all the text nodes. This could be done as follows:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- identity transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()" priority="1">
<xsl:variable name="temp" select="normalize-space(concat('x', ., 'x'))" />
<xsl:value-of select="substring($temp, 2, string-length($temp) - 2)"/>
</xsl:template>
</xsl:stylesheet>
When applied to the following test input:
<chapter>
<p>
This question would have
been a lot <b> easier </b> to understand
if the example contained
<i> real </i> text instead of
gibberish.
</p>
<p>
Here is an example of preserving zero spaces
between text nodes:<br/>(continued) on a new line.
</p>
<p>
Here is another example of
preserving zero spaces within a text
node: <i>some text in italic</i> followed
by normal text.
</p>
</chapter>
the result will be:
<?xml version="1.0" encoding="UTF-8"?>
<chapter>
<p> This question would have been a lot <b> easier </b> to understand if the example contained <i> real </i> text instead of gibberish. </p>
<p> Here is an example of preserving zero spaces between text nodes:<br/>(continued) on a new line. </p>
<p> Here is another example of preserving zero spaces within a text node: <i>some text in italic</i> followed by normal text. </p>
</chapter>
Note that there will be no difference between the input and output when rendered in HTML.

With XSLT, how can I process normally, but hold some nodes until the end and then output them all at once (e.g. footnotes)?

I have an XSLT application which reads the internal format of Microsoft Word 2007/2010 zipped XML and translates it into HTML5 with XSLT. I am investigating how to add the ability to optionally read OpenOffice documents instead of MSWord.
Microsoft stores XML for footnote text separately from the XML of the document text, which happens to suit me because I want the footnotes in a block at the end of the output HTML page.
However, unfortunately for me, OpenOffice puts each footnote right next to its reference, inline with the text of the document. Here is a simple paragraph example:
<text:p text:style-name="Standard">The real breakthrough in aerial mapping
during World War II was trimetrogon
<text:note text:id="ftn0" text:note-class="footnote">
<text:note-citation>1</text:note-citation>
<text:note-body>
<text:p text:style-name="Footnote">Three separate cameras took three
photographs at once, a direct downward and an oblique on each side.</text:p>
</text:note-body>
</text:note>
photography, but the camera was large and heavy, so there were problems finding
the right aircraft to carry it.
</text:p>
My question is, can XSLT process the XML as normal, but hold each of the text:note items until the end of the document text, and then emit them all at one time?
You're thinking of your logic as being driven by the order of things in the input, but in XSLT you need to be driven by the order of things in the output. When you get to the point where you want to output the footnotes, go find the footnote text wherever it might be in the input. Admittedly that doesn't always play too well with the apply-templates recursive descent processing model, which is explicitly input-driven; but nevertheless, that's the way you have to do it.
Don't think of it as "holding" the text:note items; instead, simply ignore them in the main pass, then gather them at the end with //text:note and process them there, e.g.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<!-- normal mode - replace text:note element by [reference] -->
<xsl:template match="text:note">
<xsl:value-of select="concat('[', text:note-citation, ']')" />
</xsl:template>
<xsl:template match="/">
<document>
<xsl:apply-templates select="*" />
<footnotes>
<xsl:apply-templates select="//text:note" mode="footnotes"/>
</footnotes>
</document>
</xsl:template>
<!-- special "footnotes" mode to de-activate the usual text:note template -->
<xsl:template match="@*|node()" mode="footnotes">
<xsl:copy>
<xsl:apply-templates select="@*|node()" mode="footnotes" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
You could use <xsl:apply-templates mode="..."/>. I'm not sure about the exact syntax for your use case.
The basic idea is to process your nodes twice. The first pass is much the same as what you have now; the second pass looks only for footnotes and outputs only those. You distinguish the two passes by setting the mode attribute.
Maybe this example will give you a clue how to approach your problem. Note that I used different tags than in your code to keep the example simple.
XSLT sheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" />
<xsl:template match="doc">
<xml>
<!-- First iteration - skip footnotes -->
<doc>
<xsl:apply-templates select="text" />
</doc>
<!-- Second iteration, extract all footnotes.
'mode' = footnotes -->
<footnotes>
<xsl:apply-templates select="text" mode="footnotes" />
</footnotes>
</xml>
</xsl:template>
<!-- Note: no mode attribute -->
<xsl:template match="text">
<text>
<xsl:for-each select="p">
<p>
<xsl:value-of select="text()" />
</p>
</xsl:for-each>
</text>
</xsl:template>
<!-- Note: mode = footnotes -->
<xsl:template match="text" mode="footnotes">
<xsl:for-each select=".//footnote">
<footnote>
<xsl:value-of select="text()" />
</footnote>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Input XML:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<text>
<p>
some text
<footnote>footnote1</footnote>
</p>
<p>
other text
<footnote>footnote2</footnote>
</p>
</text>
<text>
<p>
some text2
<footnote>footnote3</footnote>
</p>
<p>
other text2
<footnote>footnote4</footnote>
</p>
</text>
</doc>
Output XML:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<!-- Output from first iteration -->
<doc>
<text>
<p>some text</p>
<p>other text</p>
</text>
<text>
<p>some text2</p>
<p>other text2</p>
</text>
</doc>
<!-- Output from second iteration -->
<footnotes>
<footnote>footnote1</footnote>
<footnote>footnote2</footnote>
<footnote>footnote3</footnote>
<footnote>footnote4</footnote>
</footnotes>
</xml>

Input contains a paragraph character that needs to be removed

I have been attempting to modify the text of the parent element from within the XSL. How can I delete the paragraph break from within the XSL code (I do not control the input)? I only want to delete the preceding line break, not all line breaks in the body. The preceding 'some text here' may take the form of multiple paragraphs.
Xsl
<xsl:template match="element">
<!-- attempting to add fix here -->
<xsl:apply-templates />
</xsl:template>
Input
<body>
<p>
some text here
</p>
<element>
some more text
</element>
</body>
Output
some text here
some more text
Desired Output
some text here some more text
Does
<xsl:template match="p[following-sibling::*[1][self::element]]//text() | element[preceding-sibling::*[1][self::p]]//text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>
do what you want?
You don't need the <xsl:template match="element"><xsl:apply-templates/></xsl:template> as the built-in template will do that anyway.
I found some time to test the code. This stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="p[following-sibling::*[1][self::element]]//text() |
element[preceding-sibling::*[1][self::p]]//text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>
<xsl:template match="text()[preceding-sibling::*[1][self::p] and following-sibling::*[1][self::element] and not(normalize-space())]">
<xsl:text> </xsl:text>
</xsl:template>
</xsl:stylesheet>
transforms
<body>
<p>
some text here
</p>
<element>
some more text
</element>
</body>
into
some text here some more text

Converting plain text into HTML-style lists using XSLT, or grouping elements according to their contents and their positions using XSLT

I am trying to convert a plain-text document into an HTML document using XSLT, and I am struggling with unordered lists.
I have:
<item>some text</item>
<item>- a list item</item>
<item>- another list item</item>
<item>more plain text</item>
<item>more and more plain text</item>
<item>- yet another list item</item>
<item>even more plain text</item>
What I want:
<p>some text</p>
<ul>
<li>a list item</li>
<li>another list item</li>
</ul>
<p>more plain text</p>
<p>more and more plain text</p>
<ul>
<li>yet another list item</li>
</ul>
<p>even more plain text</p>
I was looking at Muenchian grouping, but it would combine all list items into one group and all the plain-text items into another. Then I tried selecting only those items whose first character differs from the first character of the preceding item. But when I try to combine everything, I still get all the li elements in one ul.
Do you have any hints for me?
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="kFollowing"
match="item[contains(., 'list')]
[preceding-sibling::item[1][contains(.,'list')]]"
use="generate-id(preceding-sibling::item
[not(contains(.,'list'))]
[1]
/following-sibling::item[1]
)"/>
<xsl:template match="item[contains(.,'list')]
[preceding-sibling::item[1][not(contains(.,'list'))]]">
<ul>
<xsl:apply-templates mode="list"
select=".|key('kFollowing',generate-id())"/>
</ul>
</xsl:template>
<xsl:template match="item" mode="list">
<li><xsl:value-of select="."/></li>
</xsl:template>
<xsl:template match="item[not(contains(.,'list'))]">
<p><xsl:value-of select="."/></p>
</xsl:template>
<xsl:template match="item[contains(.,'list')]
[preceding-sibling::item[1][contains(.,'list')]]"/>
</xsl:stylesheet>
when applied on the provided XML document (corrected from severely malformed into a well-formed XML document):
<t>
<item>some text</item>
<item>- a list item</item>
<item>- another list item</item>
<item>more plain text</item>
<item>more and more plain text</item>
<item>- yet another list item</item>
<item>even more plain text</item>
</t>
produces the wanted, correct result:
<p>some text</p>
<ul>
<li>- a list item</li>
<li>- another list item</li>
</ul>
<p>more plain text</p>
<p>more and more plain text</p>
<ul>
<li>- yet another list item</li>
</ul>
<p>even more plain text</p>
This stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()">
<xsl:apply-templates select="node()[1]|following-sibling::node()[1]"/>
</xsl:template>
<xsl:template match="item">
<p>
<xsl:value-of select="."/>
</p>
<xsl:apply-templates select="following-sibling::node()[1]"/>
</xsl:template>
<xsl:template match="item[starts-with(.,'- ')]">
<ul>
<xsl:call-template name="open"/>
</ul>
<xsl:apply-templates
select="following-sibling::node()
[not(self::item[starts-with(.,'- ')])][1]"/>
</xsl:template>
<xsl:template match="node()" mode="open"/>
<xsl:template match="item[starts-with(.,'- ')]" mode="open" name="open">
<li>
<xsl:value-of select="substring-after(.,'- ')"/>
</li>
<xsl:apply-templates select="following-sibling::node()[1]" mode="open"/>
</xsl:template>
</xsl:stylesheet>
Output:
<p>some text</p>
<ul>
<li>a list item</li>
<li>another list item</li>
</ul>
<p>more plain text</p>
<p>more and more plain text</p>
<ul>
<li>yet another list item</li>
</ul>
<p>even more plain text</p>
Note: this is like wrapping adjacent elements, using fine-grained traversal.