Transforming node contents to remove whitespace - xslt

If the contents of a citations node are something like the following:
<p>
WAJWAJADS:
</p>
<p>
asdf
</p>
<p>
ALSOAS:
</p>
<p>
lorem ipsum...<br />
lorem<br />
blah blah <i>
adfas &amp; dasdsaafs
</i>, April 2011.<br />
lorem lorem dear lord the whitespace
</p>
Is there any way to transform this to properly formatted HTML with XSLT?
normalize-space() just concatenates everything together. The best I've managed is to call normalize-space() on all p descendants within a for-each loop and wrap each result in a p element. However, any inner tags are then still lost.
Is there a better way to parse this WYSIWYG generated trainwreck? Unfortunately I have no control over the generated XML.

I've modified a little the answer by Martin Honnen:
<xsl:template match="text()">
<xsl:value-of select="normalize-space(.)"/>
<xsl:if test="substring(., string-length(.)) = ' ' and substring(., string-length(.) - 1) != '  '">
<xsl:text> </xsl:text>
</xsl:if>
</xsl:template>
It tests whether the last character is a space and the last two characters are not both spaces; if so, it inserts a space.

You first need to have a well-formed XML with a root.
Assuming you have that, you can apply an identity transform to copy the source tree to the result, strip spaces between the tags, optionally generate output in HTML (without the XML declaration) and indented, and use normalize-space() only in the text nodes.
Try this stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes" method="html"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
</xsl:stylesheet>
The result applied to the data you provided will be:
<p>WAJWAJADS:</p>
<p>asdf</p>
<p>ALSOAS:</p>
<p>lorem ipsum...<br>lorem<br>blah blah<i>adfas &amp; dasdsaafs</i>, April 2011.<br>lorem lorem dear lord the whitespace
</p>
You can see the result applied to your example in this XSLT Fiddle
UPDATE 1: to add an extra space around each text node (and avoid concatenation when the string value of the node is calculated) you can replace the last template with:
<xsl:template match="text()">
<xsl:value-of select="concat(' ',normalize-space(.),' ')"/>
</xsl:template>
Result:
<html>
<p> WAJWAJADS: </p>
<p> asdf </p>
<p> ALSOAS: </p>
<p> lorem ipsum... <br> lorem <br> blah blah <i> adfas &amp; dasdsaafs </i> , April 2011. <br> lorem lorem dear lord the whitespace
</p>
</html>
See: http://xsltransform.net/3NzcBsE/1
UPDATE 2: to add a space or newline after each copied element, place <xsl:text>&#xA;</xsl:text> (for a newline) or <xsl:text> </xsl:text> (for a space) after the </xsl:copy> in the first template:
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
<xsl:text>&#xA;</xsl:text>
</xsl:template>
Result:
<html>
<p>WAJWAJADS:</p>
<p>asdf</p>
<p>ALSOAS:</p>
<p>lorem ipsum...<br>
lorem<br>
blah blah<i>adfas &amp; dasdsaafs</i>
, April 2011.<br>
lorem lorem dear lord the whitespace
</p>
</html>
See: http://xsltransform.net/3NzcBsE/2

Use the identity transformation template plus a template for text nodes doing the normalize-space:
<xsl:template match="text()"><xsl:value-of select="normalize-space()"/></xsl:template>

This question would have been a lot easier to understand if the example contained real text instead of gibberish. "No additional whitespace between node start/end and text." is not an accurate enough description of the expected result.
I am going to take a guess here and assume you actually want to perform a "run of spaces to one space" operation on all the text nodes. This could be done as follows:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- identity transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()" priority="1">
<xsl:variable name="temp" select="normalize-space(concat('x', ., 'x'))" />
<xsl:value-of select="substring($temp, 2, string-length($temp) - 2)"/>
</xsl:template>
</xsl:stylesheet>
When applied to the following test input:
<chapter>
<p>
This question would have
been a lot <b> easier </b> to understand
if the example contained
<i> real </i> text instead of
gibberish.
</p>
<p>
Here is an example of preserving zero spaces
between text nodes:<br/>(continued) on a new line.
</p>
<p>
Here is another example of
preserving zero spaces within a text
node: <i>some text in italic</i> followed
by normal text.
</p>
</chapter>
the result will be:
<?xml version="1.0" encoding="UTF-8"?>
<chapter>
<p> This question would have been a lot <b> easier </b> to understand if the example contained <i> real </i> text instead of gibberish. </p>
<p> Here is an example of preserving zero spaces between text nodes:<br/>(continued) on a new line. </p>
<p> Here is another example of preserving zero spaces within a text node: <i>some text in italic</i> followed by normal text. </p>
</chapter>
--
Note that there will be no difference between the input and output when rendered in HTML.

Related

xslt xml parsing optional keywords with variable text groups

I have a snippet:
<body>
<p>keyword1 text text more text
</p>
<p>more text</p>
<p>more text</p>
<p>keyword2 text text more text
</p>
<p>more text</p>
<p>more text</p>
<p>keyword3 text text more text
</p>
<p>more text</p>
<p>more text</p>
<p>keyword4
</p>
</body>
In the snippet above, I have a list of optional keywords. The text which follows is of variable length. There might be multiple groupings of <p></p> before the next keyword appears. When the next keyword appears, it signals the end of the previous keyword.
What's a good way to go about doing this in XSLT?
Edit: suppose my keywords were keyword1, keyword2, keyword3 and keyword4. I'm using XSLT version 1.0.
I'll post my XSLT in a little while; it's not working, though.
I'd use the XSLT 2.0 grouping constructs, with a group-starting-with attribute that matches each p element containing a keyword.
That is, something like this:
<xsl:variable name="keywords"
as="xs:string*"
select="('keyword1', 'keyword2', 'keyword3', 'keyword4')"
/>
<xsl:for-each-group select="p"
group-starting-with="tokenize(., '\s+') = $keywords">
<!--* process each group here ... *-->
</xsl:for-each-group>
It's not clear what kind of result you intend to get.
C.M. Sperberg's approach addresses the right basic idea; however, the code provided does not seem to run with my XSL processor. So I'd propose a transformation like this:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output indent="yes" method="xml"/>
<xsl:variable name="keywords" select="'keyword1 keyword2 keyword3 keyword4'"/>
<xsl:template match="body">
<xsl:copy>
<xsl:for-each-group select="p" group-starting-with="p[contains($keywords,substring-before(.,' '))]">
<div>
<xsl:attribute name="class">
<xsl:value-of select="substring-before(current-group()[1],' ')"/>
</xsl:attribute>
<xsl:copy-of select="current-group()"/>
</div>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
</xsl:transform>
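Since the question mentions version 1.0, here is a minimal sketch of how the same grouping might be done without xsl:for-each-group, using a key. It assumes every keyword paragraph can be recognised because its text starts with the literal string "keyword" (XSLT 1.0 does not allow variable references in the match or use attributes of xsl:key, so the keyword list cannot be passed in that way). The key name by-start is made up for this illustration.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Index each non-keyword p by the generated id of the nearest
       preceding keyword p (assumption: keywords start with 'keyword'). -->
  <xsl:key name="by-start"
           match="p[not(starts-with(normalize-space(.), 'keyword'))]"
           use="generate-id(preceding-sibling::p[starts-with(normalize-space(.), 'keyword')][1])"/>

  <xsl:template match="body">
    <body>
      <!-- One group per keyword paragraph -->
      <xsl:for-each select="p[starts-with(normalize-space(.), 'keyword')]">
        <!-- The first whitespace-separated token becomes the class name -->
        <div class="{substring-before(concat(normalize-space(.), ' '), ' ')}">
          <!-- The keyword p itself plus the paragraphs that follow it -->
          <xsl:copy-of select=". | key('by-start', generate-id())"/>
        </div>
      </xsl:for-each>
    </body>
  </xsl:template>
</xsl:stylesheet>
```

Paragraphs that appear before the first keyword paragraph belong to no group and are dropped by this sketch.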

How do I understand linebreaks in XSLT?

I have a piece of XML that looks like:
<bunch of other things>
<bunch of other things>
<errorLog> error1 \n error2 \n error3 </errorLog>
I want to modify the XSLT that this XML runs through to apply newlines after error1 through error3.
I can completely control the output of errorLog or the contents of the XSLT file, but I'm not sure how to craft either the XML or the XSLT to make the output HTML show line breaks. Is it easier to change the XML output into some special character that will cause a newline, or do I modify the XSLT to interpret \n as newlines?
There is an example on this site that contains something akin to what I want, but my <errorLog> XSLT is nested in another template, and I'm not sure how templates inside templates can work.
Backslash is used as an escape character in a number of languages including C and Java, but not in XML or XSLT. If you put \n in your stylesheet, that's not a newline; it's two characters: a backslash followed by "n". The XML way of writing a newline is &#xA;. However, if you send a newline to the browser in HTML, it displays it as a space. If you want a newline displayed by the browser, you need to send a <br/> element.
If you have control over your errorLog element then you may as well use a literal LF character in there. It is no different from any other character as far as XSLT is concerned.
As for creating HTML that displays with line breaks, you will want to add a <br/> element in place of whatever marker you have in your XML source. It would be easiest of all if you could put each error within a separate element, like this
<errorLog>
<error>error1</error>
<error>error2</error>
<error>error3</error>
</errorLog>
then the XSLT doesn't have to go through the rather clumsy process of splitting up the text itself.
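For that per-error structure, the stylesheet could be as small as the following sketch. It assumes the errorLog and error element names shown above and HTML output; nothing else about the surrounding document is assumed.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Wrap the whole log in a single paragraph -->
  <xsl:template match="errorLog">
    <p>
      <xsl:apply-templates select="error"/>
    </p>
  </xsl:template>

  <!-- Each error becomes its text followed by a line break -->
  <xsl:template match="error">
    <xsl:value-of select="."/>
    <br/>
  </xsl:template>
</xsl:stylesheet>
```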
With this XML data taken from your question
<document>
<bunch-of-other-things/>
<bunch-of-other-things/>
<errorLog>error1 \n error2 \n error3</errorLog>
</document>
this stylesheet
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />
<xsl:template match="/document">
<html>
<head>
<title>Error Log</title>
</head>
<body>
<xsl:apply-templates select="*"/>
</body>
</html>
</xsl:template>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="errorLog">
<p>
<xsl:call-template name="split-on-newline">
<xsl:with-param name="string" select="."/>
</xsl:call-template>
</p>
</xsl:template>
<xsl:template name="split-on-newline">
<xsl:param name="string"/>
<xsl:choose>
<xsl:when test="contains($string, '\n')">
<xsl:value-of select="substring-before($string, '\n')"/>
<br/>
<xsl:call-template name="split-on-newline">
<xsl:with-param name="string" select="substring-after($string, '\n')"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$string"/>
<br/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
will produce this output
<html>
<head>
<title>Error Log</title>
</head>
<body>
<bunch-of-other-things/>
<bunch-of-other-things/>
<p>error1 <br/> error2 <br/> error3<br/>
</p>
</body>
</html>
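If errorLog holds real line feed characters rather than the two-character sequence \n, the same recursive idea can split on &#10; instead. The following is a sketch under that assumption; the template name lines-to-br and its text parameter are made up for this example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="errorLog">
    <p>
      <xsl:call-template name="lines-to-br">
        <xsl:with-param name="text" select="string(.)"/>
      </xsl:call-template>
    </p>
  </xsl:template>

  <!-- Emit the text up to the first line feed, a br element,
       then recurse on the remainder of the string. -->
  <xsl:template name="lines-to-br">
    <xsl:param name="text"/>
    <xsl:choose>
      <xsl:when test="contains($text, '&#10;')">
        <xsl:value-of select="substring-before($text, '&#10;')"/>
        <br/>
        <xsl:call-template name="lines-to-br">
          <xsl:with-param name="text" select="substring-after($text, '&#10;')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- No line feed left: output the final piece as is -->
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

The character reference is used inside the attribute values because a literal line break typed into an attribute would be normalized to a space by the XML parser.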

With XSLT, how can I process normally, but hold some nodes until the end and then output them all at once (e.g. footnotes)?

I have an XSLT application which reads the internal format of Microsoft Word 2007/2010 zipped XML and translates it into HTML5 with XSLT. I am investigating how to add the ability to optionally read OpenOffice documents instead of MSWord.
Microsoft stores XML for footnote text separately from the XML of the document text, which happens to suit me because I want the footnotes in a block at the end of the output HTML page.
However, unfortunately for me, OpenOffice puts each footnote right next to its reference, inline with the text of the document. Here is a simple paragraph example:
<text:p text:style-name="Standard">The real breakthrough in aerial mapping
during World War II was trimetrogon
<text:note text:id="ftn0" text:note-class="footnote">
<text:note-citation>1</text:note-citation>
<text:note-body>
<text:p text:style-name="Footnote">Three separate cameras took three
photographs at once, a direct downward and an oblique on each side.</text:p>
</text:note-body>
</text:note>
photography, but the camera was large and heavy, so there were problems finding
the right aircraft to carry it.
</text:p>
My question is, can XSLT process the XML as normal, but hold each of the text:note items until the end of the document text, and then emit them all at one time?
You're thinking of your logic as being driven by the order of things in the input, but in XSLT you need to be driven by the order of things in the output. When you get to the point where you want to output the footnotes, go find the footnote text wherever it might be in the input. Admittedly that doesn't always play too well with the apply-templates recursive descent processing model, which is explicitly input-driven; but nevertheless, that's the way you have to do it.
Don't think of it as "holding" the text:note items, instead simply ignore them in the main pass and then gather them at the end with a //text:note and process them there, e.g.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:text="whateveritshouldbe">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<!-- normal mode - replace text:note element by [reference] -->
<xsl:template match="text:note">
<xsl:value-of select="concat('[', text:note-citation, ']')" />
</xsl:template>
<xsl:template match="/">
<document>
<xsl:apply-templates select="*" />
<footnotes>
<xsl:apply-templates select="//text:note" mode="footnotes"/>
</footnotes>
</document>
</xsl:template>
<!-- special "footnotes" mode to de-activate the usual text:note template -->
<xsl:template match="@*|node()" mode="footnotes">
<xsl:copy>
<xsl:apply-templates select="@*|node()" mode="footnotes" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
You could use <xsl:apply-templates mode="..."/>. I'm not sure of the exact syntax for your use case, but the example below may give you a clue on how to approach your problem.
The basic idea is to process your nodes twice. The first pass would be pretty much the same as now; the second pass looks only for footnotes and outputs only those. You differentiate the passes by setting the mode attribute.
Note that I used different tags than in your code, to keep the example simpler.
XSLT sheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" />
<xsl:template match="doc">
<xml>
<!-- First iteration - skip footnotes -->
<doc>
<xsl:apply-templates select="text" />
</doc>
<!-- Second iteration, extract all footnotes.
'mode' = footnotes -->
<footnotes>
<xsl:apply-templates select="text" mode="footnotes" />
</footnotes>
</xml>
</xsl:template>
<!-- Note: no mode attribute -->
<xsl:template match="text">
<text>
<xsl:for-each select="p">
<p>
<xsl:value-of select="normalize-space(text())" />
</p>
</xsl:for-each>
</text>
</xsl:template>
<!-- Note: mode = footnotes -->
<xsl:template match="text" mode="footnotes">
<xsl:for-each select=".//footnote">
<footnote>
<xsl:value-of select="text()" />
</footnote>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Input XML:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<text>
<p>
some text
<footnote>footnote1</footnote>
</p>
<p>
other text
<footnote>footnote2</footnote>
</p>
</text>
<text>
<p>
some text2
<footnote>footnote3</footnote>
</p>
<p>
other text2
<footnote>footnote4</footnote>
</p>
</text>
</doc>
Output XML:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<!-- Output from first iteration -->
<doc>
<text>
<p>some text</p>
<p>other text</p>
</text>
<text>
<p>some text2</p>
<p>other text2</p>
</text>
</doc>
<!-- Output from second iteration -->
<footnotes>
<footnote>footnote1</footnote>
<footnote>footnote2</footnote>
<footnote>footnote3</footnote>
<footnote>footnote4</footnote>
</footnotes>
</xml>

Input contains a paragraph character that needs to be removed

I have been attempting to modify the text of the parent element from within the XSL. How can I delete the element from the XSL code (I do not control the input)? I only want to delete the preceding line break, not all line breaks in the body. The preceding 'some text here' may take the form of multiple paragraphs.
Xsl
<xsl:template match="element">
<!-- attempting to add fix here -->
<xsl:apply-templates />
</xsl:template>
Input
<body>
<p>
some text here
</p>
<element>
some more text
</element>
</body>
Output
some text here
some more text
Desired Output
some text here some more text
Does
<xsl:template match="p[following-sibling::*[1][self::element]]//text() | element[preceding-sibling::*[1][self::p]]//text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>
do what you want?
You don't need the <xsl:template match="element"><xsl:apply-templates/></xsl:template> as the built-in template will do that anyway.
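For reference, the built-in rules being relied on here behave as if the stylesheet contained the two templates below. This is a sketch of the defaults defined by XSLT 1.0, wrapped in a stylesheet element only so that it is complete; writing them out explicitly is never required.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Built-in rule for the root and for elements: recurse into the children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Built-in rule for text and attribute nodes: copy the string value -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
```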
I found some time to test code, now I have
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="p[following-sibling::*[1][self::element]]//text() |
element[preceding-sibling::*[1][self::p]]//text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>
<xsl:template match="text()[preceding-sibling::*[1][self::p] and following-sibling::*[1][self::element] and not(normalize-space())]">
<xsl:text> </xsl:text>
</xsl:template>
</xsl:stylesheet>
transforms
<body>
<p>
some text here
</p>
<element>
some more text
</element>
</body>
into
some text here some more text

XSLT Match attribute and then its element

In my source XML, any element can have an @n attribute. If one does, I want to output it before processing the element and all its children.
For example
<line n="2">Ipsum lorem</line>
<verse n="5">The sounds of silence</verse>
<verse>Four score and seven</verse>
<sentence n="3">
<word n="1">Hello</word>
<word n="2">world</word>
</sentence>
I have templates that match "line", "verse", "sentence" and "word". If any of those elements has an @n value, I want to output it in front of whatever the element's template generates.
The above might come out something like
2 <div class="line">Ipsum lorem</div>
5 <span class="verse">The sounds of silence</span>
<span class="verse">Four score and seven</span>
3 <p class="sentence">
1 <span class="word">Hello</span>
2 <span class="word">world</span>
</p>
where the templates for "line", "verse", etc. generated the div, span and p elements.
How should I think of this problem? -- Match the attribute and then apply-templates to its parent? (What would the syntax for that be?) Put a call-template at the beginning of every element's template? (That's unappealing.) Something else? (Probably!)
I tried a few things but got either an infinite loop, or nothing, or processing of the attribute and then its parent's children, but not the parent itself.
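To simplify matters, I've placed the mapping from XML to HTML elements in an in-document data structure (accessible via document(''), which returns the stylesheet itself). Now only one template is needed, and the @n attribute gets special processing in only one place.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:m="urn:x-map">
<xsl:output method="html"/>
<!-- Top-level data elements must be in a namespace; urn:x-map is an arbitrary name chosen for this example -->
<m:map>
<m:elt xml="line" html="div"/>
<m:elt xml="verse" html="span"/>
<m:elt xml="sentence" html="p"/>
<m:elt xml="word" html="span"/>
</m:map>
<xsl:template match="line|verse|sentence|word">
<!-- xsl:text keeps the separating space, which would otherwise be stripped from the stylesheet -->
<xsl:if test="@n"><xsl:value-of select="@n"/><xsl:text> </xsl:text></xsl:if>
<!-- current() is the source element; inside the predicate, name() alone would refer to m:elt -->
<xsl:element name="{document('')/*/m:map/m:elt[@xml = name(current())]/@html}">
<xsl:attribute name="class"><xsl:value-of select="name()"/></xsl:attribute>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>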
Here is one simple way to do this:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="*/*[@n]">
<xsl:value-of select="concat('&#xA;', @n, ' ')"/>
<xsl:apply-templates select="self::*" mode="content"/>
</xsl:template>
<xsl:template match="*/*[not(@*)]">
<xsl:apply-templates select="." mode="content"/>
</xsl:template>
<xsl:template match="line" mode="content">
<div class="line"><xsl:apply-templates/></div>
</xsl:template>
<xsl:template match="verse | word" mode="content">
<span class="{name()}"><xsl:apply-templates mode="content"/></span>
</xsl:template>
<xsl:template match="sentence" mode="content">
<p class="sentence"><xsl:apply-templates/></p>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document:
<t>
<line n="2">Ipsum lorem</line>
<verse n="5">The sounds of silence</verse>
<verse>Four score and seven</verse>
<sentence n="3">
<word n="1">Hello</word>
<word n="2">world</word>
</sentence>
</t>
the wanted, correct result is produced:
2 <div class="line">Ipsum lorem</div>
5 <span class="verse">The sounds of silence</span>
<span class="verse">Four score and seven</span>
3 <p class="sentence">
1 <span class="word">Hello</span>
2 <span class="word">world</span>
</p>
Explanation: Appropriate use of templates and modes.