XSLT - analyse following text value - regex

I have an XML document in which a text() node is not correctly formatted.
Example:
<section>
<p>A number,of words have, been, suggested,as sources for,the term,</p>
</section>
Here, some ',' characters are followed by a space and some are not. If a ',' is not followed by a space character, I need to add a '*' character after it.
So the expected result is:
<section>
<p>A number,*of words have, been, suggested,*as sources for,*the term*</p>
</section>
I think this can be done with a regular expression, but how can I select the , characters that are not followed by a space in an XSLT regular expression? Also, some , characters appear just before the closing tag (the last , in the input), and I need to select those as well.
<xsl:template match="para">
<xsl:copy>
<xsl:analyze-string select="." regex=",\s*">
<xsl:matching-substring>
<xsl:value-of select="regex-group(1)"/>
<xsl:value-of select="'*'"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:copy>
</xsl:template>

You've replaced the last , in your input with ,* though your statement doesn't say that. I hope the below XSLT helps:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="p/text()">
<xsl:value-of select="replace(., ',([^\s]|$)',',*$1')"/>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@*, node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Output:
<?xml version="1.0" encoding="UTF-8"?>
<section>
<p>A number,*of words have, been, suggested,*as sources for,*the term,*</p>
</section>
Here, the regex ,([^\s]|$) matches a comma followed either by a non-whitespace character or by the end of the string, and captures whatever follows the comma in group 1. The replacement ,*$1 inserts * after the comma and puts the captured character (if any) back.
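For comparison, the xsl:analyze-string approach from the question could probably be made to work with the same pattern. This is only a sketch, assuming XSLT 2.0 and the <p> element from the sample input (the question's template matched para); \S is used as shorthand for [^\s].

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p/text()">
    <!-- Group 1 holds the non-space character after the comma,
         or the empty string when the comma ends the text node -->
    <xsl:analyze-string select="." regex=",(\S|$)">
      <xsl:matching-substring>
        <xsl:text>,*</xsl:text>
        <xsl:value-of select="regex-group(1)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

One caveat that applies to both variants: with two commas in a row (",,"), the second comma is consumed as group 1 of the first match, so it does not get its own *.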

Select various text in the same node and put them to different nodes

I want to select the various text nodes inside the same element and put them into different elements using XSLT.
Input :
<p>The first para
<formula>formula text</formula> The second para
<list>List text</list>
The third para </p>
Desired output :
<pcom>The first para</pcom>
<formulai>formula text</formulai>
<pi>The second para</pi>
<listi>List text</listi>
<pi>The third para</pi>
Tried code :
<xsl:template match="p/text()[preceding-sibling::formula or preceding-sibling::list]">
<pi><xsl:apply-template/></pi>
</xsl:template>
The first para, The second para and The third para are text nodes in the same <p>. I want to transform them into separate <pi> elements.
The preceding sibling of such a text node must be <formula> or <list>. If the preceding sibling is not a <formula> or <list>, the output should be a <pcom> element instead.
How can I solve this? I am using XSLT 2.0
This should work for the given example:
<xsl:template match="p/text()">
<pcom>
<xsl:copy/>
</pcom>
</xsl:template>
<xsl:template match="p/text()[preceding-sibling::formula or preceding-sibling::list]">
<pi>
<xsl:copy/>
</pi>
</xsl:template>
However, the problem description is ambiguous. If you want to limit the pi element to text nodes whose immediately preceding sibling is formula or list, use:
<xsl:template match="p/text()[preceding-sibling::node()[1][self::formula or self::list]]">
<pi>
<xsl:copy/>
</pi>
</xsl:template>
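Putting these pieces together, a complete stylesheet might look roughly like the sketch below. It assumes XSLT 2.0, the "immediately preceding sibling" reading of the question, and that surrounding whitespace should be trimmed as in the desired output. The <result> wrapper is a hypothetical element added only so the output has a single root, and the explicit priority attributes are there because p/text() with and without a predicate would otherwise have the same default priority.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Unwrap p; "result" is a hypothetical wrapper element -->
  <xsl:template match="p">
    <result>
      <xsl:apply-templates/>
    </result>
  </xsl:template>

  <!-- Default case: text that does not directly follow formula or list -->
  <xsl:template match="p/text()">
    <pcom>
      <xsl:value-of select="normalize-space(.)"/>
    </pcom>
  </xsl:template>

  <!-- Text whose immediately preceding sibling is formula or list -->
  <xsl:template match="p/text()[preceding-sibling::node()[1][self::formula or self::list]]" priority="1">
    <pi>
      <xsl:value-of select="normalize-space(.)"/>
    </pi>
  </xsl:template>

  <!-- Drop whitespace-only text nodes so no empty elements are produced -->
  <xsl:template match="p/text()[not(normalize-space())]" priority="2"/>

  <!-- formula becomes formulai, list becomes listi -->
  <xsl:template match="p/formula | p/list">
    <xsl:element name="{name()}i">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
```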
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="2.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="node()[ following-sibling::*[1][self::list] and preceding-sibling::*[1][self::formula] or preceding-sibling::*[1][self::list]]">
<pi>
<xsl:value-of select=" normalize-space(.)"/>
</pi>
</xsl:template>
<xsl:template match="node()[following-sibling::*[1][self::formula]]">
<pcom>
<xsl:value-of select="normalize-space(.)"/>
</pcom>
</xsl:template>
<xsl:template match="formula">
<formulai>
<xsl:value-of select="."/>
</formulai>
</xsl:template>
<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
You may also try this.

Is it possible to replace &amp;amp; with &amp; in XSLT?

I tried to do that with replace($val, 'amp;', ''), but it seems that &amp; is an atomic entity to the parser. Any other ideas?
I need this to get rid of double escaping: the input file contains constructions like &amp;#8112;.
UPD:
Also one important notice: I have to make this substitution only inside of specific tags, not inside of every tag.
If you serialize there is always (if supported) the disable-output-escaping hack, see http://xsltransform.hikmatu.com/nbUY4kh which transforms
<root>
<foo>a &amp;amp; b</foo>
<bar>a &amp;amp; b</bar>
</root>
selectively into
<root>
<foo>a &amp; b</foo>
<bar>a &amp;amp; b</bar>
</root>
by using <xsl:value-of select="." disable-output-escaping="yes"/> in the template matching foo/text():
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="foo/text()">
<xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>
</xsl:transform>
To achieve the same selective replacement with a character map you could replace the ampersand in foo text() children (or descendants if necessary) with a character not used elsewhere in your document and then use the map to map it to an unescaped ampersand:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output use-character-maps="doe"/>
<xsl:character-map name="doe">
<xsl:output-character character="«" string="&amp;"/>
</xsl:character-map>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="foo/text()">
<xsl:value-of select="replace(., '&amp;', '«')"/>
</xsl:template>
</xsl:transform>
That way
<root>
<foo>a &amp;amp; b</foo>
<bar>a &amp;amp; b</bar>
</root>
is also transformed to
<root>
<foo>a &amp; b</foo>
<bar>a &amp;amp; b</bar>
</root>
see http://xsltransform.hikmatu.com/pPgCcoj for a sample.
If your XML contains &amp;#8112; and you believe that this is a double-escaped representation of the character with codepoint 8112 (ᾰ), then you can convert it to this character using the XPath expression
codepoints-to-string(xs:integer(replace($input, '&#([0-9]+);', '$1')))
remembering that if you write this XPath expression in XSLT, then the & must be written as &amp;.
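The expression above assumes that $input consists of exactly one such reference. When the references are embedded in longer text, and only inside specific elements as the question requires, xsl:analyze-string can do the same conversion piece by piece. The following is only a sketch, assuming XSLT 2.0 and reusing the foo element name from the earlier samples as the place where the substitution should happen.

```xml
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               exclude-result-prefixes="xs" version="2.0">
  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only text inside foo is unescaped -->
  <xsl:template match="foo/text()">
    <!-- After XML parsing the regex is &#([0-9]+); -->
    <xsl:analyze-string select="." regex="&amp;#([0-9]+);">
      <xsl:matching-substring>
        <!-- Turn the decimal code point into the character itself -->
        <xsl:value-of select="codepoints-to-string(xs:integer(regex-group(1)))"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:transform>
```

Note that codepoints-to-string raises an error for numbers that are not legal XML characters (0, for example), so untrusted input may need an extra check. Hexadecimal references of the form &amp;#x...; are not covered by this sketch.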

XSLT - replace specific content of the text() node with a new node

I have XML like this:
<doc>
<p>Biological<sub>89</sub> bases<sub>4456</sub> for<sub>8910</sub> sexual<sub>4456</sub>
differences<sub>8910</sub> in<sub>4456</sub> the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub> Recently<sub>8910</sub> the
dogma<sub>8910</sub> of<sub>4456</sub> hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
As you can see, the <p> node contains <sub> nodes and text() nodes, and after the end of every <sub> node there is a text node starting with a space (e.g. in <sub>89</sub> bases there is a space before the text 'bases'). I need to replace those specific spaces with <s/> nodes.
So the expected output should look like this:
<doc>
<p>Biological<sub>89</sub><s/>bases<sub>4456</sub><s/>for<sub>8910</sub><s/>sexual<sub>4456</sub>
<s/>differences<sub>8910</sub><s/>in<sub>4456</sub><s/>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s/>Recently<sub>8910</sub><s/>the
dogma<sub>8910</sub><s/>of<sub>4456</sub><s/>hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
To do this, I can use a regular expression like this:
<xsl:template match="p/text()">
<xsl:analyze-string select="." regex="( )">
<xsl:matching-substring>
<xsl:choose>
<xsl:when test="regex-group(1)">
<s/>
</xsl:when>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
But this adds <s/> nodes for every space in the text() node, whereas I only need to add nodes for those specific spaces.
Can anyone suggest how I can do this?
If you only want to match text nodes that start with a space and are preceded by a sub element, you can put the condition in your template match
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
And if you just want to remove the space at the start of the string, a simple replace will do.
<xsl:value-of select="replace(., '^\s+', '')" />
Try this XSLT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" indent="no" />
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
<s />
<xsl:value-of select="replace(., '^\s+', '')" />
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Just change the regex to ^( ) so that it matches only a space at the beginning of each text node.
With this XSL snippet:
<xsl:analyze-string select="." regex="^( )">
Here is the result I obtain:
<p>Biological<sub>89</sub><s></s>bases<sub>4456</sub><s></s>for<sub>8910</sub><s></s>sexual<sub>4456</sub>
differences<sub>8910</sub><s></s>in<sub>4456</sub><s></s>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s></s>Recently<sub>8910</sub><s></s>the
dogma<sub>8910</sub><s></s>of<sub>4456</sub><s></s>hormonal dependence for the sexual
differentiation of the brain has been challenged.
</p>

XSL transform on text to XML with unparsed-text: need more depth

My rather well-formed input (I don't want to copy all data):
StartThing
Size Big
Colour Blue
coords 42, 42
foo bar
EndThing
StartThing
Size Small
Colour Red
coords 29, 51
machin bidule
EndThing
<!-- repeat a few thousand times-->
I have the below XSL which I modified from Parse text file with XSLT
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:param name="text-encoding" as="xs:string" select="'iso-8859-1'"/>
<xsl:param name="text-uri" as="xs:string" select="'unparsed-text.txt'"/>
<xsl:template name="text2xml">
<xsl:variable name="text" select="unparsed-text($text-uri, $text-encoding)"/>
<xsl:analyze-string select="$text" regex="(Size|Colour|coords) (.+)">
<xsl:matching-substring>
<xsl:element name="{(regex-group(1))}">
<xsl:value-of select="(regex-group(2))"/>
</xsl:element>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>
<xsl:template match="/">
<xsl:call-template name="text2xml"/>
</xsl:template>
</xsl:stylesheet>
and it works fine on parsing the pairs into elements and values. It gives me this output:
<?xml version="1.0" encoding="UTF-8"?>
<Size>Big</Size>
<Colour>Blue</Colour>
<coords>42, 42</coords>
But I'd also like to wrap the values in the Thing tag so that my output looks like this:
<Thing>
<Size>Big</Size>
<Colour>Blue</Colour>
<coords>42, 42</coords>
</Thing>
One solution might be a regex that matches each group of lines belonging to one "Thing" and then matches substrings within it as I'm already doing. Or is there some other way to parse the tree?
I would use two nested analyze-string levels, an outer one to extract everything between StartThing and EndThing, and then an inner one that operates on the strings matched by the outer one.
<xsl:template name="text2xml">
<xsl:variable name="text" select="unparsed-text($text-uri, $text-encoding)"/>
<!-- flags="s" allows .*? to match across newlines -->
<xsl:analyze-string select="$text" regex="StartThing.*?EndThing" flags="s">
<xsl:matching-substring>
<Thing>
<!-- "." here is the matching substring from the outer regex -->
<xsl:analyze-string select="." regex="(Size|Colour|coords) (.+)">
<xsl:matching-substring>
<xsl:element name="{(regex-group(1))}">
<xsl:value-of select="(regex-group(2))"/>
</xsl:element>
</xsl:matching-substring>
</xsl:analyze-string>
</Thing>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>
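A possible variation, in case the text file uses Windows (CRLF) line endings: in XSLT 2.0 the . in the inner regex may also match the carriage return, which would then end up at the end of each value. The sketch below is a drop-in alternative for the text2xml template, under the assumption that each key/value pair sits on its own line. It captures the body between StartThing and EndThing, splits it into lines with tokenize, and trims each line before matching.

```xml
<xsl:template name="text2xml">
  <xsl:variable name="text" select="unparsed-text($text-uri, $text-encoding)"/>
  <!-- Group 1 is everything between StartThing and EndThing -->
  <xsl:analyze-string select="$text" regex="StartThing(.*?)EndThing" flags="s">
    <xsl:matching-substring>
      <Thing>
        <!-- Split the body into lines, tolerating both LF and CRLF -->
        <xsl:for-each select="tokenize(regex-group(1), '\r?\n')">
          <!-- normalize-space drops stray carriage returns and indentation -->
          <xsl:analyze-string select="normalize-space(.)" regex="^(Size|Colour|coords) (.+)$">
            <xsl:matching-substring>
              <xsl:element name="{regex-group(1)}">
                <xsl:value-of select="regex-group(2)"/>
              </xsl:element>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:for-each>
      </Thing>
    </xsl:matching-substring>
  </xsl:analyze-string>
</xsl:template>
```

As in the version above, lines such as "foo bar" that do not start with one of the three keys are silently ignored. The output still contains several top-level Thing elements, so a wrapper element around the call may be needed if a single root is required.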

Remove whitespace from HTML generated using XSL

Background
Maintain readable XSL source code while generating HTML without excessive breaks that introduce spaces between sentences and their terminating punctuation. From Rethinking XSLT:
White space in XSLT stylesheets is especially problematic because it serves two purposes: (1) for formatting the XSLT stylesheet itself; and (2) for specifying where whitespace should go in the output of XSLT-processed XML data.
Problem
An XSL template contains the following code:
<xsl:if test="@min-time &lt; @max-time">
for
<xsl:value-of select="@min-time" />
to
<xsl:value-of select="@max-time" />
minutes
</xsl:if>
<xsl:if test="@setting">
on <xsl:value-of select="@setting" /> heat
</xsl:if>
.
This, for example, generates the following output (with whitespace exactly as shown):
for
2
to
3
minutes
.
All major browsers produce:
for 2 to 3 minutes .
Nearly flawless, except for the space between the word minutes and the punctuation. The desired output is:
for 2 to 3 minutes.
It might be possible to eliminate the space by removing the indentation and newlines within the XSL template, but that means having ugly XSL source code.
Workaround
Initially the desired output was wrapped in a variable and then written out as follows:
<xsl:value-of select="normalize-space($step)" />.
This worked until I tried to wrap <span> elements into the variable. The <span> elements never appeared within the generated HTML code. Nor is the following code correct:
<xsl:copy-of select="normalize-space($step)" />.
Technical Details
The stylesheet already uses:
<xsl:strip-space elements="*" />
<xsl:output indent="no" ... />
Related
Storing html tags within an xsl variable
Question
How do you tell the XSLT processor to eliminate that space?
Thank you!
Instead of using copy-of you can apply the identity template with an additional template that trims the spaces from the text nodes. You only create one variable like in your first workaround.
You call:
<li><xsl:apply-templates select="$step" mode="nospace" />.</li>
The templates:
<xsl:template match="text()" mode="nospace" priority="1" >
<xsl:value-of select="normalize-space(.)" />
</xsl:template>
<xsl:template match="node() | @*" mode="nospace">
<xsl:copy>
<xsl:apply-templates select="node() | @*" mode="nospace" />
</xsl:copy>
</xsl:template>
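One thing to watch with normalize-space(.) on every text node: in mixed content it also removes the single space that separates a word from a neighbouring <span>, so "for <span>2</span> to" can come out as "for2to". The sketch below is one possible refinement, not the approach above verbatim. It assumes XSLT 2.0 (so that $step is a temporary tree that templates can be applied to directly), and the step element with its min-time and max-time attributes is a hypothetical stand-in for the real source. Whitespace runs are collapsed to one space and trimmed only at the very start and very end of the variable's content.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="no"/>

  <!-- "step" is a hypothetical source element used for illustration -->
  <xsl:template match="step">
    <xsl:variable name="step">
      for
      <span><xsl:value-of select="@min-time"/></span>
      to
      <span><xsl:value-of select="@max-time"/></span>
      minutes
    </xsl:variable>
    <li><xsl:apply-templates select="$step" mode="nospace"/>.</li>
  </xsl:template>

  <!-- Collapse whitespace runs; trim only before the first and after the last real text -->
  <xsl:template match="text()" mode="nospace" priority="1">
    <xsl:variable name="collapsed" select="replace(., '\s+', ' ')"/>
    <xsl:variable name="ltrim"
      select="if (preceding::text()[normalize-space()]) then $collapsed
              else replace($collapsed, '^ ', '')"/>
    <xsl:value-of
      select="if (following::text()[normalize-space()]) then $ltrim
              else replace($ltrim, ' $', '')"/>
  </xsl:template>

  <!-- Identity copy within the nospace mode -->
  <xsl:template match="node() | @*" mode="nospace">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" mode="nospace"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

With an XSLT 1.0 processor the variable would be a result tree fragment, and applying templates to it would need a node-set() extension function.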
I. This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="t[@max-time > @min-time]">
<span>
<xsl:value-of select=
"concat('for ', @min-time, ' to ', @max-time, ' minutes')"/>
</span>
<xsl:apply-templates select="@setting"/>
<xsl:text>.</xsl:text>
</xsl:template>
<xsl:template match="@setting">
<span>
<xsl:value-of select="concat(' on ', ., ' heat')"/>
</span>
</xsl:template>
</xsl:stylesheet>
when applied on the following XML document (none has been presented!):
<t min-time="2" max-time="3" setting="moderate"/>
produces the wanted, correct result:
<span>for 2 to 3 minutes</span>
<span> on moderate heat</span>.
and it is displayed by the browser as:
for 2 to 3 minutes
on moderate heat.
When the same transformation is applied on this XML document:
<t min-time="2" max-time="3"/>
again the correct, wanted result is produced:
<span>for 2 to 3 minutes</span>.
and it is displayed by the browser as:
for 2 to 3 minutes.
II. Layout (visual) solution:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:my="my:my" xmlns:gen="gen:gen" xmlns:gen-attr="gen:gen-attr"
exclude-result-prefixes="my gen">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<my:layout>
<span>for <gen-attr:min-time/> to <gen-attr:max-time/> minutes</span>
<gen-attr:setting><span> on <gen:current/> heat</span></gen-attr:setting>
<gen:literal>.</gen:literal>
</my:layout>
<xsl:variable name="vLayout" select="document('')/*/my:layout/*"/>
<xsl:variable name="vDoc" select="/"/>
<xsl:template match="node()|@*">
<xsl:param name="pCurrent"/>
<xsl:copy>
<xsl:apply-templates select="node()|@*">
<xsl:with-param name="pCurrent" select="$pCurrent"/>
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
<xsl:template match="/">
<xsl:apply-templates select="$vLayout">
<xsl:with-param name="pCurrent" select="$vDoc/*"/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match="gen-attr:*">
<xsl:param name="pCurrent"/>
<xsl:value-of select="$pCurrent/@*[name() = local-name(current())]"/>
</xsl:template>
<xsl:template match="gen-attr:setting">
<xsl:param name="pCurrent"/>
<xsl:variable name="vnextCurrent" select=
"$pCurrent/@*[name() = local-name(current())]"/>
<xsl:apply-templates select="node()[$vnextCurrent]">
<xsl:with-param name="pCurrent" select="$vnextCurrent"/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match="gen:current">
<xsl:param name="pCurrent"/>
<xsl:value-of select="$pCurrent"/>
</xsl:template>
<xsl:template match="gen:literal">
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
This transformation gives us an idea how to make a visual (skeletal) representation of the wanted output and use it to "populate" it with the wanted data from the source XML document.
The result is identical with that of the first solution. If this transformation is run "as-is" it will produce a lot of namespaces -- they are harmless and will not be produced if the layout is in a separate XML file.