XSLT: remove whitespace around specific elements (XML to XML)

I have many XML files in which a database export added indentation whitespace that I now want to remove with an XSLT 3.0 transformation to new XML output. Specifically, I want to remove the whitespace the export introduced around <lb> and <pb>. In the original files, before export, these two elements abutted other elements without whitespace; a hidden bug in the export indented them.
This is an example of the problem file to transform:
<p xml:id="MS609-0783-LA" xml:lang="LA">
<seg type="dep_event" xml:id="MS609-0783-1">
<pb n="58r"/>
<lb n="1"/>Item.
<date type="deposition_date" when="1245-07-07">Anno Domini M°CC°XL°V° Nonas Iulii</date><persName
nymRef="#ber_r_baz-hg" role="dep">Ber. R.</persName> testis juratus dixit quod vidit
<persName nymRef="#heretics_in_public" ref="her">hereticos</persName>.</seg>
</p>
Here, an example of the desired XML output:
<p xmlns="http://www.tei-c.org/ns/1.0" xml:id="MS609-0783-LA" xml:lang="LA">
<seg type="dep_event" xml:id="MS609-0783-1"><pb n="58r"/><lb n="1"/>Item.
<date type="deposition_date" when="1245-07-07">Anno Domini M°CC°XL°V° Nonas Iulii</date> <persName
nymRef="#ber_r_baz-hg" role="dep">Ber. R.</persName> testis juratus dixit quod vidit
<persName nymRef="#heretics_in_public" ref="her">hereticos</persName>.</seg>
</p>
I thought, naively, that I could target it so:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:tei="http://www.tei-c.org/ns/1.0"
version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml"/>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="text()[normalize-space(.) = '']">
<xsl:choose>
<xsl:when test="./following-sibling::tei:pb">
<xsl:text/>
</xsl:when>
<xsl:when test="./following-sibling::tei:lb">
<xsl:text/>
</xsl:when>
<xsl:otherwise>
<xsl:text> </xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
But it does not produce the desired result: https://xsltfiddle.liberty-development.net/93nwMpi
Ideally, I am working towards a solution that strips any whitespace-only text before or after <pb> and/or <lb> (when they are abutted by other elements), anywhere inside <seg> or its descendants.
Many thanks in advance for pointers.

I don't have a clear understanding of which text nodes you want to strip, but
<xsl:template match="tei:seg//text()[not(normalize-space())][following-sibling::node()[1][self::tei:pb | self::tei:lb]]"/>
would strip the whitespace-only text nodes that are immediately followed by pb or lb inside a seg.
Of course you can extend that to the ones preceded by those elements with e.g.
<xsl:template match="tei:seg//text()[not(normalize-space())][preceding-sibling::node()[1][self::tei:pb | self::tei:lb] or following-sibling::node()[1][self::tei:pb | self::tei:lb]]"/>
If simple blocking based on match patterns doesn't suffice, you could check whether your definition of "abutted" can be expressed with xsl:for-each-group and group-adjacent, and then drop any whitespace-only text nodes in a group that contains pb or lb, e.g.
<xsl:template match="tei:seg[tei:pb | tei:lb] | tei:seg//*[tei:pb | tei:lb]">
<xsl:copy>
<xsl:for-each-group select="node()" group-adjacent="boolean(self::text()[not(normalize-space())]|self::tei:pb|self::tei:lb)">
<xsl:choose>
<xsl:when test="current-grouping-key() and current-group()[self::tei:pb|self::tei:lb]">
<xsl:sequence select="current-group()[not(self::text())]"/>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
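Putting the second match pattern above into a complete stylesheet, a minimal sketch could look like the following. It assumes the input elements are in the TEI namespace (as in the desired output) and relies on the shallow-copy mode from the question to copy everything else; it has not been tested against the full files.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    version="3.0">
  <!-- Copy every node that no other template handles. -->
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output method="xml"/>
  <!-- Empty template: drop whitespace-only text nodes inside seg whose
       immediate preceding or following sibling is pb or lb. -->
  <xsl:template match="tei:seg//text()[not(normalize-space())]
                       [preceding-sibling::node()[1][self::tei:pb | self::tei:lb]
                        or following-sibling::node()[1][self::tei:pb | self::tei:lb]]"/>
</xsl:stylesheet>
```

Because the template body is empty, the matched text nodes simply produce no output; all other whitespace is copied unchanged.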

Related

XSLT - replace specific content of the text() node with a new node

I have a xml like this,
<doc>
<p>Biological<sub>89</sub> bases<sub>4456</sub> for<sub>8910</sub> sexual<sub>4456</sub>
differences<sub>8910</sub> in<sub>4456</sub> the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub> Recently<sub>8910</sub> the
dogma<sub>8910</sub> of<sub>4456</sub> hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
As you can see, the <p> node contains <sub> nodes and text() nodes, and every <sub> node is followed by a text node that starts with a space (e.g. in <sub>89</sub> bases there is a space before 'bases'). I need to replace those specific spaces with <s/> nodes.
So the expected output should look like this:
<doc>
<p>Biological<sub>89</sub><s/>bases<sub>4456</sub><s/>for<sub>8910</sub><s/>sexual<sub>4456</sub>
<s/>differences<sub>8910</sub><s/>in<sub>4456</sub><s/>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s/>Recently<sub>8910</sub><s/>the
dogma<sub>8910</sub><s/>of<sub>4456</sub><s/>hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
To do this I can use a regular expression like this:
<xsl:template match="p/text()">
<xsl:analyze-string select="." regex="( )">
<xsl:matching-substring>
<xsl:choose>
<xsl:when test="regex-group(1)">
<s/>
</xsl:when>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
But this adds <s/> nodes for every space in the text() node, whereas I only need to add nodes for those specific spaces.
Can anyone suggest how I can do this?
If you only want to match text nodes that start with a space and are preceded by a sub element, you can put the condition in your template match
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
And if you just want to remove the space at the start of the string, a simple replace will do.
<xsl:value-of select="replace(., '^\s+', '')" />
Try this XSLT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" indent="no" />
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
<s />
<xsl:value-of select="replace(., '^\s+', '')" />
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Just change the regex to ^( ) so that it matches only a space at the beginning of the text node.
With this XSL snippet:
<xsl:analyze-string select="." regex="^( )">
Here is the result I obtain:
<p>Biological<sub>89</sub><s></s>bases<sub>4456</sub><s></s>for<sub>8910</sub><s></s>sexual<sub>4456</sub>
differences<sub>8910</sub><s></s>in<sub>4456</sub><s></s>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s></s>Recently<sub>8910</sub><s></s>the
dogma<sub>8910</sub><s></s>of<sub>4456</sub><s></s>hormonal dependence for the sexual
differentiation of the brain has been challenged.
</p>

parsing string in xslt

I have following xml
<xml>
<xref>
is determined “in prescribed manner”
</xref>
</xml>
I want to see if we can process xslt 2 and return the following result
<xml>
<xref>
is
</xref>
<xref>
determined
</xref>
<xref>
“in prescribed manner”
</xref>
</xml>
I tried a few options, such as replacing the spaces and entities and then using a for-each loop, but could not work it out. Maybe we can use the tokenize function of XSLT 2.0, but I don't know how to use it. Any hint will be helpful.
@JimGarrison: Sorry, I couldn't resist. :-) This XSLT is definitely not elegant but it does (I assume) most of the job:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />
<xsl:variable name="left_quote" select="'&lt;'"/>
<xsl:variable name="right_quote" select="'&gt;'"/>
<xsl:template name="protected_tokenize">
<xsl:param name="string"/>
<xsl:variable name="pattern" select="concat('^([^', $left_quote, ']+)(', $left_quote, '[^', $right_quote, ']*', $right_quote,')?(.*)')"/>
<xsl:analyze-string select="$string" regex="{$pattern}">
<xsl:matching-substring>
<!-- Handle the prefix of the string up to the first opening quote by "normal" tokenizing. -->
<xsl:variable name="prefix" select="concat(' ', normalize-space(regex-group(1)))"/>
<xsl:for-each select="tokenize(normalize-space($prefix), ' ')">
<xref>
<xsl:value-of select="."/>
</xref>
</xsl:for-each>
<!-- Handle the text between the quotes by simply passing it through. -->
<xsl:variable name="protected_token" select="normalize-space(regex-group(2))"/>
<xsl:if test="$protected_token != ''">
<xref>
<xsl:value-of select="$protected_token"/>
</xref>
</xsl:if>
<!-- Handle the suffix of the string. This part may contain protected tokens again, so we do it recursively. -->
<xsl:variable name="suffix" select="normalize-space(regex-group(3))"/>
<xsl:if test="$suffix != ''">
<xsl:call-template name="protected_tokenize">
<xsl:with-param name="string" select="$suffix"/>
</xsl:call-template>
</xsl:if>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>
<xsl:template match="*|@*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="xref">
<xsl:call-template name="protected_tokenize">
<xsl:with-param name="string" select="text()"/>
</xsl:call-template>
</xsl:template>
</xsl:stylesheet>
Notes:
There is the general assumption that whitespace only serves as a token delimiter and need not be preserved.
&ldquo; and &rdquo; seem to be invalid in XML although they are valid in HTML. In the XSLT there are variables defined holding the quote characters. They will have to be adapted once you find the right XML representation. You can also eliminate the variables and put the characters right into the regular expression pattern, which simplifies it significantly.
<xsl:analyze-string> does not allow a regular expression that can match an empty string. This is a small problem, since the prefix, the protected token and the suffix may each be empty. I take care of this by artificially adding a space at the beginning of the pattern, which allows me to search for the prefix using + (at least one occurrence) instead of * (zero or more occurrences).
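As a shorter alternative to the recursive template, the tokens could be picked out in a single pass with one regular expression. This is only a sketch under two assumptions that are not confirmed by the question: the quotes reach the processor as the literal characters U+201C and U+201D, and whitespace is purely a delimiter. Each match is either a whole quoted run or a run of non-whitespace characters, so the regex can never match an empty string.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- Identity rule for everything except xref. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Emit one xref per token; a quoted phrase counts as a single token. -->
  <xsl:template match="xref">
    <xsl:analyze-string select="." regex="&#8220;[^&#8221;]*&#8221;|\S+">
      <xsl:matching-substring>
        <xref>
          <xsl:value-of select="."/>
        </xref>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

The quoted alternative is listed first so that an opening quote starts a protected token instead of being consumed by \S+. There is no xsl:non-matching-substring, so the whitespace between tokens is dropped.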

Get the maximum starting substring of the text node, such that the last word in it isn't truncated, and its length doesn't exceed a given limit

How do I show only the first n words of the comments element using XSL? I am able to select the first n characters, but that cuts off in the middle of a word and looks ugly.
I have been able to get a count of the total number of words in the 'comments' node, but I am not sure how to show only, say, 15 words when there are 20 in total.
<div class="NewsDescription">
<xsl:value-of select="string-length(translate(normalize-space(@Comments),'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',''))+1" />
<xsl:value-of select="substring(@Comments,0,120)"/>
<xsl:if test="string-length(@Comments) > 120">…</xsl:if>
Read More
</div>
The actual XML is like this. I am trying to roll up story pieces in a "Related Stories" manner on the right side of a dynamic page.
<news>
<title>Story title</title>
<comments> story gist here..a couple of sentences.</comments>
<content> actualy story </content>
</news>
Please help. I found this resource by Marc Andersen, but the XSL is too complex for me to work through and adapt myself:
http://sympmarc.com/2010/07/13/displaying-the-first-n-words-of-announcement-bodies-with-xsl-in-a-dvwp/
Some help from the XSL gurus would be appreciated.
Here is a non-recursive, pure XSLT 1.0 solution:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="vUpper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:variable name="vLower" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="vDigits" select="'0123456789'"/>
<xsl:variable name="vAlhanum" select="concat($vLower, $vUpper, $vDigits)"/>
<xsl:template match="comments">
<xsl:param name="pText" select="."/>
<xsl:choose>
<xsl:when test="not(string-length($pText) > 120)">
<xsl:value-of select="$pText"/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="vTruncated" select="substring($pText, 1, 121)"/>
<xsl:variable name="vPunct"
select="translate($vTruncated, $vAlhanum, '')"/>
<xsl:for-each select=
"(document('')//node() | document('')//@* | document('')//namespace::*)
[not(position() > 121)]">
<xsl:variable name="vPos" select="122 - position()"/>
<xsl:variable name="vRemaining" select="substring($vTruncated, $vPos+1)"/>
<xsl:if test=
"contains($vPunct, substring($vTruncated, $vPos, 1))
and
contains($vAlhanum, substring($vTruncated, $vPos -1, 1))
and
string-length(translate($vRemaining, $vPunct, ''))
= string-length($vRemaining)
">
<xsl:value-of select="substring($vTruncated, 1, $vPos -1)"/>
</xsl:if>
</xsl:for-each>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
When this transformation is applied on the following XML document:
<news>
<title>Story title</title>
<comments> story gist here..a couple of sentences. Many more sentences ...
even some more text with lots of meaning and sense aaand a lot of humor. </comments>
<content> actualy story </content>
</news>
the wanted, correct result is produced:
story gist here..a couple of sentences. Many more sentences ...
even some more text with lots of meaning and sense
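If an XSLT 2.0 processor is available (an assumption; the answer above deliberately targets 1.0), the same idea can be sketched with replace(). Unlike the 1.0 solution, this version breaks only at whitespace, not at arbitrary punctuation, and it normalizes whitespace first. The 120-character limit is taken from the question.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>
  <xsl:template match="comments">
    <xsl:variable name="text" select="normalize-space(.)"/>
    <!-- Take one character more than the limit, then cut back to the
         last whitespace so a partial trailing word is removed. -->
    <xsl:value-of select="if (string-length($text) le 120)
                          then $text
                          else replace(substring($text, 1, 121), '\s+\S*$', '')"/>
  </xsl:template>
  <!-- Suppress all other text. -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Taking 121 characters means that a word ending exactly at position 120 is kept whole. One caveat: if the first 121 characters contain no whitespace at all, nothing is removed and the result is one character over the limit.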

convert character if codepoint within given range

I have a couple of XML files that contain unicode characters with codepoint values between 57600 and 58607. Currently these are shown in my content as square blocks and I'd like to convert these to elements.
So what I'd like to achieve is something like :
<!-- current input -->
<p>&#58208; Follow the on-screen instructions.</p>
<!-- desired output-->
<p><unichar value="58208"/> Follow the on-screen instructions.</p>
<!-- Where 58208 is the actual codepoint of the unicode character in question -->
I've fooled around a bit with tokenize(), but since you need a delimiter to split on, this turned out to be overcomplicated.
Any advice on how best to tackle this? I've been trying things like the below but got stuck (don't mind the syntax, I know it doesn't make any sense):
<xsl:template match="text()">
-> for every character in my string
-> if string-to-codepoints(current character) greater then 57600 return <unichar value="codepoint value"/>
else return character
</xsl:template>
It sounds like a job for analyze-string e.g.
<xsl:template match="text()">
<xsl:analyze-string select="." regex="[&#57600;-&#58607;]">
<xsl:matching-substring>
<unichar value="{string-to-codepoints(.)}"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
Untested.
This transformation:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/*">
<p>
<xsl:for-each select="string-to-codepoints(.)">
<xsl:choose>
<xsl:when test=". > 57600">
<unichar value="{.}"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="codepoints-to-string(.)"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</p>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML document:
<p>&#58498; Follow the on-screen instructions.</p>
produces the wanted, correct result:
<p><unichar value="58498"/> Follow the on-screen instructions.</p>
Explanation: Proper use of the standard XPath 2.0 functions string-to-codepoints() and codepoints-to-string().
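The transformation above handles only the text of the document element and tests only the lower bound. A variation that checks both ends of the range given in the question (57600 to 58607) and works on text nodes at any depth might look like this; it is a sketch, with the identity rule added as an assumption about how the rest of the document should be treated.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>
  <!-- Identity rule: copy elements, attributes and other nodes. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Explicit priority so this wins over the identity rule for text. -->
  <xsl:template match="text()" priority="1">
    <xsl:for-each select="string-to-codepoints(.)">
      <xsl:choose>
        <!-- Code points inside the range become unichar elements. -->
        <xsl:when test=". ge 57600 and . le 58607">
          <unichar value="{.}"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="codepoints-to-string(.)"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Using ge and le instead of the comparison symbols avoids having to escape them inside the attribute value.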

replacing text in xml using xslt

I have an XML file which has some values in child elements as well as in attributes.
If I want to replace some text when a specific value is matched, how can I achieve it?
I tried using the XSLT translate() function, but I can't use this function for each element or attribute in the XML.
So is there any way to replace/translate the value in one shot?
<?xml version="1.0" encoding="UTF-8"?>
<Employee>
<Name>Emp1</Name>
<Age>40</Age>
<sex>M</sex>
<Address>Canada</Address>
<PersonalInformation>
<Country>Canada</Country>
<Street1>KO 92</Street1>
</PersonalInformation>
</Employee>
Output :
<?xml version="1.0" encoding="UTF-8"?>
<Employee>
<Name>Emp1</Name>
<Age>40</Age>
<sex>M</sex>
<Address>UnitedStates</Address>
<PersonalInformation>
<Country>UnitedStates</Country>
<Street1>KO 92</Street1>
</PersonalInformation>
</Employee>
In the output, the text Canada is replaced with UnitedStates.
So, without calling an XSLT function on each individual element, I should be able to replace the text Canada with UnitedStates regardless of node level.
Wherever I find 'Canada' in the entire XML, I should be able to replace it with 'UnitedStates'.
How can I achieve this?
I. XSLT 1.0 solution:
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:my="my:my" >
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<my:Reps>
<rep>
<old>replace this</old>
<new>replaced</new>
</rep>
<rep>
<old>cat</old>
<new>tiger</new>
</rep>
</my:Reps>
<xsl:variable name="vReps" select=
"document('')/*/my:Reps/*"/>
<xsl:template match="node()|@*" name="identity">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="@*">
<xsl:attribute name="{name()}">
<xsl:call-template name="replace">
<xsl:with-param name="pText" select="."/>
</xsl:call-template>
</xsl:attribute>
</xsl:template>
<xsl:template match="text()" name="replace">
<xsl:param name="pText" select="."/>
<xsl:if test="string-length($pText)">
<xsl:choose>
<xsl:when test=
"not($vReps/old[contains($pText, .)])">
<xsl:copy-of select="$pText"/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="vthisRep" select=
"$vReps/old[contains($pText, .)][1]
"/>
<xsl:variable name="vNewText">
<xsl:value-of
select="substring-before($pText, $vthisRep)"/>
<xsl:value-of select="$vthisRep/../new"/>
<xsl:value-of select=
"substring-after($pText, $vthisRep)"/>
</xsl:variable>
<xsl:call-template name="replace">
<xsl:with-param name="pText"
select="$vNewText"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
when applied on this XML document:
<t>
<a attr1="X replace this Y">
<b>cat mouse replace this cat dog</b>
</a>
<c/>
</t>
produces the wanted, correct result:
<t>
<a attr1="X replaced Y">
<b>tiger mouse replaced tiger dog</b>
</a>
<c/>
</t>
Explanation:
The identity rule is used to copy "as-is" some nodes.
We perform multiple replacements, parameterized in my:Reps.
If a text node or an attribute doesn't contain any rep target, it is copied as-is.
If a text node or an attribute contains text to be replaced (a rep target), then the replacements are done in the order specified in my:Reps.
If the string contains more than one rep target, then all targets are replaced: first all occurrences of the first rep target, then all occurrences of the second rep target, ..., and last all occurrences of the last rep target.
II. XSLT 2.0 solution:
In XSLT 2.0 one can simply use the standard XPath 2.0 function replace(). However, for multiple replacements the solution would be still very similar to the XSLT 1.0 solution specified above.
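For the single replacement asked about in the question (Canada to UnitedStates), a minimal XSLT 2.0 sketch could look like this. The explicit priority attributes are an addition here, so that the two specific templates do not conflict with the identity rule.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity rule: copy everything that has no more specific template. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Replace the string in every text node, at any depth. -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="replace(., 'Canada', 'UnitedStates')"/>
  </xsl:template>
  <!-- Rebuild every attribute with the replaced value. -->
  <xsl:template match="@*" priority="1">
    <xsl:attribute name="{name()}" namespace="{namespace-uri()}">
      <xsl:value-of select="replace(., 'Canada', 'UnitedStates')"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
```

Note that replace() treats its second argument as a regular expression and gives $ and \ a special meaning in the third, so search or replacement strings containing such characters would need escaping.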