Include parentheses as part of a word while comparing using DeltaXML

We need to include parentheses as part of the word they surround while comparing.
For example:
Input A:
Hi (A)
Input B:
Hi (B)
Current Output:
Hi (<Inserted>B</Inserted> <Deleted>A</Deleted>)
Expected Output:
Hi <Inserted>(B)</Inserted> <Deleted>(A)</Deleted>
Thanks!

In this answer, I'm assuming you're using DeltaXML's DocumentComparator class.
The default behaviour for this class is to allow word-by-word comparison. It does this by splitting text up into space, punctuation or word elements.
So, for example, Input A would reach the comparator as:
...
<deltaxml:word>Hi</deltaxml:word>
<deltaxml:space>_</deltaxml:space>
<deltaxml:punctuation>(</deltaxml:punctuation>
<deltaxml:word>A</deltaxml:word>
<deltaxml:punctuation>)</deltaxml:punctuation>
...
To prevent text being split in this way you need an XSLT input filter that wraps specific text patterns in a deltaxml:word element with a deltaxml:word-by-word='false' attribute. Here is some sample XSLT that achieves this:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
exclude-result-prefixes="#all"
version="2.0">
<xsl:template match="*|comment()|processing-instruction()|@*">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!--
wrap a specific text pattern in text as a word-element
in this case, any non-whitespace characters surrounded by '(' and ')' chars
-->
<xsl:template match="text()">
<xsl:analyze-string select="." regex="(\([^\s]*?\))">
<xsl:matching-substring>
<deltaxml:word deltaxml:word-by-word="false">
<xsl:value-of select="current()"/>
</deltaxml:word>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="current()"/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
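As a rough illustration of what this filter does to the text, here is Input A before and after. The p wrapper element is hypothetical, and where the namespace declaration ends up depends on the serializer. The point is only that (A) now arrives at the comparator as a single word:

```xml
<!-- before the input filter (the p wrapper is an assumption for illustration) -->
<p>Hi (A)</p>

<!-- after the input filter: the parenthesised text is one deltaxml:word -->
<p>Hi <deltaxml:word xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
                     deltaxml:word-by-word="false">(A)</deltaxml:word></p>
```

Note that the regex only wraps parenthesised text without whitespace in it, so something like (A B) would still be split as usual.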

Related

XSLT - analyse following text value

I have an XML document in which a text() node is not correctly formatted.
Example:
<section>
<p>A number,of words have, been, suggested,as sources for,the term,</p>
</section>
Here some ',' characters are followed by a space character and some are not. What I need: if a ',' is not followed by a space character, add a '*' character after the ','.
So the expected result is:
<section>
<p>A number,*of words have, been, suggested,*as sources for,*the term,*</p>
</section>
I think this can be done with a regular expression, but how can I select the ',' characters that are not followed by a space with a regular expression in XSLT? Also, some ',' characters occur just before the closing tag (the last ',' in the input), and I need to select those as well.
<xsl:template match="para">
<xsl:copy>
<xsl:analyze-string select="." regex=",\s*">
<xsl:matching-substring>
<xsl:value-of select="regex-group(1)"/>
<xsl:value-of select="'*'"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:copy>
</xsl:template>

You've replaced the last , in your input with ,* though your statement doesn't say that. I hope the below XSLT helps:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="p/text()">
<xsl:value-of select="replace(., ',([^\s]|$)',',*$1')"/>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@*, node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Output:
<?xml version="1.0" encoding="UTF-8"?>
<section>
<p>A number,*of words have, been, suggested,*as sources for,*the term,*</p>
</section>
Here, the regex ,([^\s]|$) matches a comma together with the character that follows it, provided that character is not whitespace, or a comma at the very end of the string. The replacement ,*$1 inserts * after the comma and keeps the captured character intact.
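If you would rather keep the xsl:analyze-string approach from the question, a minimal sketch along the same lines might look like this. It assumes the text lives in p elements (the question's template matched para, which does not occur in the sample) and adds the capture group that regex-group(1) needs:

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="p/text()">
    <!-- a comma followed by a non-whitespace character, or a comma at the end -->
    <xsl:analyze-string select="." regex=",(\S|$)">
      <xsl:matching-substring>
        <xsl:text>,*</xsl:text>
        <!-- re-emit the character that followed the comma (empty at end of string) -->
        <xsl:value-of select="regex-group(1)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
  <!-- identity template for everything else -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Like the replace() call, this consumes the character after the comma, so in a run of two adjacent commas only the first one gets a star. XPath 2.0 regular expressions have no lookahead, so handling that case would need a second pass.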

XSLT - replace specific content of the text() node with a new node

I have XML like this:
<doc>
<p>Biological<sub>89</sub> bases<sub>4456</sub> for<sub>8910</sub> sexual<sub>4456</sub>
differences<sub>8910</sub> in<sub>4456</sub> the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub> Recently<sub>8910</sub> the
dogma<sub>8910</sub> of<sub>4456</sub> hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
As you can see, the <p> node contains <sub> nodes and text() nodes, and after the end of every <sub> node there is a text node starting with a space (e.g. <sub>89</sub> bases: there is a space before the text 'bases'). I need to replace those specific spaces with <s/> nodes.
So the expected output should look like this:
<doc>
<p>Biological<sub>89</sub><s/>bases<sub>4456</sub><s/>for<sub>8910</sub><s/>sexual<sub>4456</sub>
<s/>differences<sub>8910</sub><s/>in<sub>4456</sub><s/>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s/>Recently<sub>8910</sub><s/>the
dogma<sub>8910</sub><s/>of<sub>4456</sub><s/>hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
To do this I can use a regular expression like this:
<xsl:template match="p/text()">
<xsl:analyze-string select="." regex="( )">
<xsl:matching-substring>
<xsl:choose>
<xsl:when test="regex-group(1)">
<s/>
</xsl:when>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
But this adds <s/> nodes for every space in the text() node, and I only need to add nodes for those specific spaces.
Can anyone suggest how I can do this?

If you only want to match text nodes that start with a space and are preceded by a sub element, you can put the condition in your template match
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
And if you just want to remove the space at the start of the string, a simple replace will do.
<xsl:value-of select="replace(., '^\s+', '')" />
Try this XSLT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" indent="no" />
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
<s />
<xsl:value-of select="replace(., '^\s+', '')" />
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
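One caveat, as far as I can tell: in the sample input some text nodes after a sub start with a line break rather than a space (the one before 'differences', for example), and substring(., 1, 1) = ' ' does not match those. A possible variant, assuming any leading whitespace after a sub should be replaced, tests with matches() instead:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="xml" indent="no"/>
  <!-- text node that starts with any whitespace and directly follows a sub -->
  <xsl:template match="p/text()[matches(., '^\s')][preceding-sibling::node()[1][self::sub]]">
    <s/>
    <!-- drop the whole leading whitespace run, line break included -->
    <xsl:value-of select="replace(., '^\s+', '')"/>
  </xsl:template>
  <!-- identity template -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Note that this also removes the line break itself, so the output lines will be joined at those points.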

Just change the regex like so ^( ): it will match only the spaces at the beginning of the text part.
With this XSL snippet:
<xsl:analyze-string select="." regex="^( )">
Here is the result I obtain:
<p>Biological<sub>89</sub><s></s>bases<sub>4456</sub><s></s>for<sub>8910</sub><s></s>sexual<sub>4456</sub>
differences<sub>8910</sub><s></s>in<sub>4456</sub><s></s>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s></s>Recently<sub>8910</sub><s></s>the
dogma<sub>8910</sub><s></s>of<sub>4456</sub><s></s>hormonal dependence for the sexual
differentiation of the brain has been challenged.
</p>

XSL Analyze-String -> Matching-Substring into multiple variables

I was wondering whether it is possible to use analyze-string with multiple groups in the regex and then store all of the matching groups in variables for later use,
like so:
<xsl:analyze-string regex="^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee" select=".">
<xsl:matching-substring>
<xsl:variable name="varX">
<xsl:value-of select="regex-group(1)"/>
</xsl:variable>
<xsl:variable name="varY">
<xsl:value-of select="regex-group(2)"/>
</xsl:variable>
</xsl:matching-substring>
</xsl:analyze-string>
This doesn't actually work, but it's the sort of thing I'm after. I know I can wrap the analyze-string in a variable, but it seems daft to process the regex again for every group, which is not very efficient. I should be able to process the regex once and store all of the groups for use later on.
Any ideas?

Well does
<xsl:variable name="groups" as="element(group)*">
<xsl:analyze-string regex="^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee" select=".">
<xsl:matching-substring>
<group>
<x><xsl:value-of select="regex-group(1)"/></x>
<y><xsl:value-of select="regex-group(2)"/></y>
</group>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:variable>
help? That way you have a variable named groups which is a sequence of group elements with the captures.
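To show how the captures could then be read back, here is a small self-contained sketch. The t element and the output format are assumptions for illustration (borrowed from the sample document in the other answer); the regex is evaluated only once:

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="t">
    <!-- run the regex once and keep the captures as a temporary element -->
    <xsl:variable name="groups" as="element(group)*">
      <xsl:analyze-string select="." regex="^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee">
        <xsl:matching-substring>
          <group>
            <x><xsl:value-of select="regex-group(1)"/></x>
            <y><xsl:value-of select="regex-group(2)"/></y>
          </group>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- pull individual captures out of the first match; empty string if no match -->
    <xsl:variable name="varX" select="string($groups[1]/x)"/>
    <xsl:variable name="varY" select="string($groups[1]/y)"/>
    <xsl:value-of select="concat('x=', $varX, ', y=', $varY)"/>
  </xsl:template>
</xsl:stylesheet>
```

Applied to <t>Blah 123 Bloo 4567 Blee</t> this should give x=123, y=4567.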

This transformation shows that xsl:analyze-string isn't necessary to obtain the wanted results; a simpler and generic solution exists:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="*[matches(., '^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee')]">
<xsl:variable name="vTokens" select=
"tokenize(replace(., '^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee', '$1 $2'), ' ')"/>
<xsl:variable name="varX" select="$vTokens[1]"/>
<xsl:variable name="varY" select="$vTokens[2]"/>
<xsl:sequence select="$varX, $varY"/>
</xsl:template>
</xsl:stylesheet>
when applied on this XML document:
<t>Blah 123 Bloo 4567 Blee</t>
which produces the wanted, correct result:
123 4567
Here we don't rely on knowing the RegEx (can be supplied as parameter) and the string -- we just replace the string with a delimited string of the RegEx groups, which we then tokenize and every item in the sequence produced by tokenize() can readily be assigned to a corresponding variable.
We don't have to find the wanted results buried in a temp. tree -- we just get them all in a result sequence.

convert character if codepoint within given range

I have a couple of XML files that contain unicode characters with codepoint values between 57600 and 58607. Currently these are shown in my content as square blocks and I'd like to convert these to elements.
So what I'd like to achieve is something like :
<!-- current input -->
<p>&#58208; Follow the on-screen instructions.</p>
<!-- desired output-->
<p><unichar value="58208"/> Follow the on-screen instructions.</p>
<!-- Where 58208 is the actual codepoint of the unicode character in question -->
I've fooled around a bit with tokenize(), but since you need a reference to split on, this turned out to be overcomplicated.
Any advice on how best to tackle this? I've been trying things like the below but got stuck (don't mind the syntax, I know it doesn't make any sense):
<xsl:template match="text()">
-> for every character in my string
-> if string-to-codepoints(current character) greater then 57600 return <unichar value="codepoint value"/>
else return character
</xsl:template>

It sounds like a job for analyze-string e.g.
<xsl:template match="text()">
<xsl:analyze-string select="." regex="[&#57600;-&#58607;]">
<xsl:matching-substring>
<unichar value="{string-to-codepoints(.)}"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
Untested.

This transformation:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/*">
<p>
<xsl:for-each select="string-to-codepoints(.)">
<xsl:choose>
<xsl:when test=". > 57600">
<unichar value="{.}"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="codepoints-to-string(.)"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</p>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML document:
<p>&#58498; Follow the on-screen instructions.</p>
produces the wanted, correct result:
<p><unichar value="58498"/> Follow the on-screen instructions.</p>
Explanation: Proper use of the standard XPath 2.0 functions string-to-codepoints() and codepoints-to-string().
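The transformation above only tests the lower bound (and > 57600 excludes 57600 itself), and it hard-codes the p element. A sketch of a variant that checks both ends of the range from the question and works on any text node, assuming the rest of the document should simply be copied through, could be:

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="text()">
    <!-- walk the text node one codepoint at a time -->
    <xsl:for-each select="string-to-codepoints(.)">
      <xsl:choose>
        <!-- range taken from the question: 57600 to 58607 inclusive -->
        <xsl:when test=". ge 57600 and . le 58607">
          <unichar value="{.}"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="codepoints-to-string(.)"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For long text nodes the analyze-string version is probably the cheaper of the two, since it does not emit one value per character.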

Trim whitespace from parent element only

I'd like to trim the leading whitespace inside p tags in XML, so this:
<p> Hey, <em>italics</em> and <em>italics</em>!</p>
Becomes this:
<p>Hey, <em>italics</em> and <em>italics</em>!</p>
(Trimming trailing whitespace won't hurt, but it's not mandatory.)
Now, I know normalize-space() is supposed to do this, but if I try to apply it to the text nodes...
<xsl:template match="text()">
<xsl:text>[</xsl:text>
<xsl:value-of select="normalize-space(.)"/>
<xsl:text>]</xsl:text>
</xsl:template>
...it's applied to each text node (in brackets) individually and sucks them dry:
[Hey,]<em>[italics]</em>[and]<em>[italics]</em>[!]
My XSLT looks basically like this:
<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>
So is there any way I can let apply-templates complete and then run normalize-space on the output, which should do the right thing?

This stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="p//text()[1][generate-id()=
generate-id(ancestor::p[1]
/descendant::text()[1])]">
<xsl:variable name="vFirstNotSpace"
select="substring(normalize-space(),1,1)"/>
<xsl:value-of select="concat($vFirstNotSpace,
substring-after(.,$vFirstNotSpace))"/>
</xsl:template>
</xsl:stylesheet>
Output:
<p>Hey, <em>italics</em> and <em>italics</em>!</p>
Edit 2: Better expression (now only three function calls).
Edit 3: Matching the first descendant text node (not just the first node if it's a text node). Thanks to @Dimitre's comment.
Now, with this input:
<p><b> Hey, </b><em>italics</em> and <em>italics</em>!</p>
Output:
<p><b>Hey, </b><em>italics</em> and <em>italics</em>!</p>

I would do something like this:
<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>
<!-- strip leading whitespace -->
<xsl:template match="p/node()[1][self::text()]">
<xsl:call-template name="left-trim">
<xsl:with-param name="s" select="."/>
</xsl:call-template>
</xsl:template>
This will strip left space from the initial node child of a <p> element, if it is a text node. It will not strip space from the first text node child, if it is not the first node child. E.g. in
<p><em>Hey</em> there</p>
I intentionally avoid stripping the space from the front of 'there', because that would make the words run together when rendered in a browser. If you did want to strip that space, change the match pattern to
match="p/text()[1]"
If you also want to strip trailing whitespace, as your title possibly implies, add these two templates:
<!-- strip trailing whitespace -->
<xsl:template match="p/node()[last()][self::text()]">
<xsl:call-template name="right-trim">
<xsl:with-param name="s" select="."/>
</xsl:call-template>
</xsl:template>
<!-- strip leading/trailing whitespace on sole text node -->
<xsl:template match="p/node()[position() = 1 and
position() = last()][self::text()]"
priority="2">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
The definitions of the left-trim and right-trim templates are at Trim Template for XSLT (untested). They might be slow for documents with lots of <p>s. If you can use XSLT 2.0, you can replace the call-templates with
<xsl:value-of select="replace(.,'^\s+','')" />
and
<xsl:value-of select="replace(.,'\s+$','')" />
(Thanks to Priscilla Walmsley.)
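Put together, an XSLT 2.0 version of the templates above might look like the following sketch. It assumes the rest of the document is copied unchanged by an identity template, and for a sole text node it only trims both ends rather than calling normalize-space(), so internal whitespace is left alone:

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity template -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- strip leading whitespace when the first child of p is a text node -->
  <xsl:template match="p/node()[1][self::text()]">
    <xsl:value-of select="replace(., '^\s+', '')"/>
  </xsl:template>
  <!-- strip trailing whitespace when the last child of p is a text node -->
  <xsl:template match="p/node()[last()][self::text()]">
    <xsl:value-of select="replace(., '\s+$', '')"/>
  </xsl:template>
  <!-- sole text node: trim both ends; higher priority resolves the conflict -->
  <xsl:template match="p/node()[position() = 1 and position() = last()][self::text()]"
                priority="2">
    <xsl:value-of select="replace(., '^\s+|\s+$', '')"/>
  </xsl:template>
</xsl:stylesheet>
```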

You want:
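<xsl:template match="text()">
<!-- wrap in brackets so normalize-space() cannot strip the outer whitespace entirely -->
<xsl:variable name="vNorm" select="normalize-space(concat('[',.,']'))"/>
<!-- drop the leading '[' and the trailing ']' again -->
<xsl:value-of select="substring($vNorm, 2, string-length($vNorm) - 2)"/>
</xsl:template>
This wraps the string in "[]", then performs normalize-space(), and finally removes the wrapping characters. Leading and trailing whitespace is therefore collapsed to a single space rather than removed. The length is taken from the normalized string, because the original string-length(.) can be too long once internal whitespace has been collapsed, which would leave the closing "]" in the output.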
