XSLT regex pattern match

Using Saxon 9.7, XSLT 3.0, I'm trying to select square bracketed terms from a string of text and then remove duplicate values of the terms.
So far I have found a template which selects the substrings I want and a function that tokenizes the string and then removes duplicate values.
However, I haven't been able to get the correct regex for the tokenizing of the string.
Here is my XML of the full text
<column>
<columnDerivationPrompt>Option 1: (No visit windowing)</columnDerivationPrompt>
<columnDerivationDescription>Set to collected visit name [EG.VISIT] Set to 'POST-BASELINE MINIMUM' for the new observation generated for derviation type minimum [ADEG.DTYPE] = 'MINIMUM'
Set to 'POST-BASELINE MAXIMUM' for the new observation generated for derviation type maximum [ADEG.DTYPE]= 'MAXIMUM'
</columnDerivationDescription>
<columnDerivationPrompt>Option 2: (User defined visit windows)</columnDerivationPrompt>
<columnDerivationDescription>Set to a re-defined visit range based on user-defined input, using formatting of Analysis Relative Day [ADEG.ADY] range in conjunction with Analysis Window Target [ADEG.AWTARGET] and Analysis Window Diff from Target [ADEG.AWTDIFF] to determine analysis visit.
Set to 'POST-BASELINE MINIMUM' for the new observation generated for derviation type minimum [ADEG.DTYPE] = 'MINIMUM'
Set to 'POST-BASELINE MAXIMUM' for the new observation generated for derviation type maximum [ADEG.DTYPE]= 'MAXIMUM'
</columnDerivationDescription>
</column>
The string of terms taken from the text that I need to remove duplicates from
EG.VISIT ADEG.DTYPE ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF ADEG.DTYPE ADEG.DTYPE
What I would like to see
EG.VISIT ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF
my XSLT template and function
<xsl:variable name="test">
<xsl:if test="contains($string,'[')">
<xsl:variable name="relevant-part" select="substring-before(substring-after($string,'['),']')"/>
<xsl:variable name="remainder" select="substring-after($string,']')"/>
<xsl:value-of select="$relevant-part"/>
<xsl:if test="contains($remainder,'[')">
<xsl:text disable-output-escaping="yes"> </xsl:text>
</xsl:if>
<xsl:call-template name="find-relevant-text">
<xsl:with-param name="string" select="$remainder"/>
</xsl:call-template>
</xsl:if>
</xsl:variable>
<xsl:value-of select="myfn:sortCSV($test)"/>
</xsl:template>
<xsl:function name="myfn:sortCSV" as="xs:string*">
<xsl:param name="csvString" as="xs:string"/>
<!-- Split up string and remove duplicates -->
<xsl:variable name="values" select="distinct-values(tokenize($csvString,'\W+\.\W+'))" as="xs:string*"/>
<!-- Return all elements, sorted -->
<xsl:for-each select="$values">
<xsl:sort/>
<!-- We don't return empty strings -->
<xsl:sequence select=".[.!='']"/>
</xsl:for-each>
</xsl:function>
\W+\.\W+ is the regex I have been using to identify e.g. EG.VISIT or ADEG.DTYPE. So any pattern including CC.CCCC to CCCC.CCCCCCCC (where C is a char [A-Z]).
The output I am getting is
EG.VISIT ADEG.DTYPE ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF ADEG.DTYPE ADEG.DTYPE
So no duplicates have been removed.
QUESTION:
Can anyone see where I am going wrong with my expression or code?

As for your regular expression, note that \W matches a non-word character, so it cannot match uppercase (or lowercase) letters; \w matches a word character.
However, it is best to restrict the pattern to [A-Z]+\.[A-Z]+, since you say the items you want to match follow the uppercase, dot, uppercase pattern.
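One further point, offered as a sketch rather than as the asker's final code: tokenize() treats its regex as the separator between tokens, not as the pattern of the tokens themselves. For the string shown in the question, '\W+\.\W+' never matches, so the whole string comes back as a single token and distinct-values has nothing to deduplicate. Since the extracted terms are separated by spaces, splitting on whitespace should be enough. The function name myfn:distinctTerms and the namespace URI http://example.com/myfn are made up for this example.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myfn="http://example.com/myfn"
    exclude-result-prefixes="xs myfn">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: split a space-separated list and drop duplicates -->
  <xsl:function name="myfn:distinctTerms" as="xs:string*">
    <xsl:param name="terms" as="xs:string"/>
    <!-- normalize-space trims the ends, so tokenize yields no empty tokens -->
    <xsl:sequence select="distinct-values(tokenize(normalize-space($terms), '\s+'))"/>
  </xsl:function>

  <!-- Entry point when the processor is started without a source document -->
  <xsl:template name="xsl:initial-template">
    <xsl:value-of select="myfn:distinctTerms('EG.VISIT ADEG.DTYPE ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF ADEG.DTYPE ADEG.DTYPE')"/>
  </xsl:template>

</xsl:stylesheet>
```

Note that the specification leaves the order of the values returned by distinct-values implementation-dependent, so add an explicit sort if the order matters.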

I would use analyze-string: in XSLT 2.0 the xsl:analyze-string instruction, or in XSLT 3.0 the function of the same name. With the function it is a one-liner:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
exclude-result-prefixes="xs math fn"
version="3.0">
<xsl:template match="column">
<xsl:value-of select="distinct-values(analyze-string(., '\[([A-Z]+\.[A-Z]+)\]')//fn:match/fn:group[@nr = 1])"/>
</xsl:template>
</xsl:stylesheet>
Output is EG.VISIT ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF.
If you want to sort the extracted strings then use <xsl:value-of select="sort(distinct-values(analyze-string(., '\[([A-Z]+\.[A-Z]+)\]')//fn:match/fn:group[@nr = 1]))"/>.

Regular expressions, xslt 2, basics

I have gone through several examples and tried to modify my search to understand what is what.
Input:
<Description>Ottelu pelattu 22.4.2018. Gagarin Cupin 5. finaaliottelu. Selostus: Antti Mäkinen.</Description>
<Description>Ottelu pelattu 20.4.2018. Gagarin Cupin 1. finaaliottelu. Selostus: Antti Mäkinen.</Description>
<Description>Ottelu pelattu 22.4.2018. Gagarin Cupin 2. puolivälierä. Selostus: Antti Mäkinen.</Description>
What I want to do is select these to my output:
"Gagarin Cupin 5. finaaliottelu"
"Gagarin Cupin 2. puolivälierä"
Without the dot there in the middle.
I could use substring-before/after, but I understand it could be useful to use regex?
I have made this that fetches what I need: Gagarin Cupin\s\d\W\s\w*[a-zA-ZäöüÄÖÜß]*
But now how can I use this in XSLT? Is it analyze-string() that I should use? or matches() in some way?
XSLT:
<xsl:variable name="episode" select="Description"/>
<xsl:variable name="fetchcup">
<xsl:analyze-string select="$episode" regex="Gagarin Cupin\s\d\W\s\w*[a-zA-ZäöüÄÖÜß]*">
<xsl:matching-substring>
<xsl:value-of select="regex-group(1)"/>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:variable>
<Cup><xsl:value-of select="$fetchcup"/></Cup>
But honestly, I feel like I am missing some basics of how it works despite looking through tutorial pages and examples. If I get a foot in the door I can apply it further.
Your regular expression works in the context of a single Description element. Inside xsl:matching-substring, if you want to output the matched string, you can simply use . (the context item) or regex-group(0) (see https://www.w3.org/TR/xslt-30/#func-regex-group). Using regex-group(1) doesn't make sense in your case because your regular expression has no parenthesized subexpressions.
<xsl:template match="Description">
<cup>
<xsl:analyze-string select="." regex="Gagarin Cupin\s\d\W\s\w*[a-zA-ZäöüÄÖÜß]*">
<xsl:matching-substring>
<xsl:value-of select="."/>
</xsl:matching-substring>
</xsl:analyze-string>
</cup>
</xsl:template>
That template in https://xsltfiddle.liberty-development.net/nc4NzQH outputs
<cup>Gagarin Cupin 5. finaaliottelu</cup>
<cup>Gagarin Cupin 1. finaaliottelu</cup>
<cup>Gagarin Cupin 2. puolivälierä</cup>
for your three Description elements. I hope I understood that correctly as the desired output.
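The question also asks for the result without the dot after the number. One possible way to get that, sketched here under the assumption that the number is always followed by a literal dot and a single word, is to put the two parts to keep into parenthesized groups and leave the dot outside them. Then regex-group(1) and regex-group(2) become meaningful:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="Description">
    <Cup>
      <!-- Group 1: the text up to and including the number.
           Group 2: the space and the word after the dot.
           The literal dot sits between the groups and is not captured. -->
      <xsl:analyze-string select="." regex="(Gagarin Cupin\s\d+)\.(\s\w+)">
        <xsl:matching-substring>
          <xsl:value-of select="concat(regex-group(1), regex-group(2))"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </Cup>
  </xsl:template>

</xsl:stylesheet>
```

In the XPath regex dialect \w also covers letters such as ä and ö, so the explicit character class from the original pattern should not be needed. Text outside the match is dropped because there is no xsl:non-matching-substring.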

Substring before throwing error

I've the below XML
<?xml version="1.0" encoding="UTF-8"?>
<body>
<p>Industrial drawing: Any creative composition</p>
<p>Industrial drawing: Any creative<fn>
<fnn>4</fnn>
<fnt>
<p>ftn1"</p>
</fnt>
</fn> composition
</p>
</body>
and the below XSL.
<xsl:template match="p">
<xsl:choose>
<xsl:when test="contains(substring-before(./text(),' '),'Article')">
<xsl:text>sect3</xsl:text>
<xsl:value-of select="./text()"/>
</xsl:when>
<xsl:when test="contains(substring-before(./b/text(),' '),'Section')">
<xsl:text> Sect 2</xsl:text>
<xsl:value-of select="./text()"/>
</xsl:when>
<xsl:when test="contains(substring-before(./b/text(),' '),'Chapter')">
<xsl:text> Sect 1</xsl:text>
<xsl:value-of select="./text()"/>
</xsl:when>
<xsl:otherwise>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
Here my XSL is working fine for <p>Industrial drawing: Any creative composition</p> but for the below Case
<p>Industrial drawing: Any creative<fn>
<fnn>4</fnn>
<fnt>
<p>ftn1"</p>
</fnt>
</fn> composition
</p>
it is throwing me the below error.
XSLT 2.0 Debugging Error: Error: file:///C:/Users/u0138039/Desktop/Proview/ASAK/DIFC/XSLT/tabel.xslt:38: Wrong occurrence to match required sequence type - Details: - XPTY0004: The supplied sequence ('2' item(s)) has the wrong occurrence to match the sequence type xs:string ('zero or one')
Please let me know how I can fix this and grab the required text.
Thanks
The second p element in your example XML has two child text nodes, one containing "Industrial drawing: Any creative" and the other containing a space, "composition", a newline and another six spaces. In XSLT 1.0 it is legal to apply a function that expects a string to an argument that is a set of more than one node; the behaviour is to take the value of the first node and ignore all the others. But in 2.0 it is a type error to pass two nodes to a function that expects a single value for its parameter.
But in this case I doubt that you really need to use text() at all - if all you care about is seeing whether the string "Article" occurs anywhere within the first word under the p (including when this is nested inside another element) then you can simply use .:
<xsl:when test="contains(substring-before(.,' '),'Article')">
(or better still, use predicates to separate the different conditions into their own templates, with one template matching "Article" paragraphs, another matching "Section" paragraphs, etc.)
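The suggestion to use one template per condition could look roughly like the sketch below. It is an illustration, not a drop-in replacement: it outputs the whole string value of the paragraph (including any nested footnote text) instead of ./text(), and it uses b[1] on the assumption that only the first b child matters.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- First word of the paragraph contains "Article" -->
  <xsl:template match="p[contains(substring-before(., ' '), 'Article')]">
    <xsl:text>sect3</xsl:text>
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- First word of the first b child contains "Section" -->
  <xsl:template match="p[contains(substring-before(b[1], ' '), 'Section')]">
    <xsl:text> Sect 2</xsl:text>
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- First word of the first b child contains "Chapter" -->
  <xsl:template match="p[contains(substring-before(b[1], ' '), 'Chapter')]">
    <xsl:text> Sect 1</xsl:text>
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Any other paragraph produces no output, like the empty xsl:otherwise -->
  <xsl:template match="p"/>

</xsl:stylesheet>
```

The templates with a predicate have a higher default priority than the plain match="p", so the fallback only fires when none of the conditions hold. If a paragraph could satisfy more than one condition, explicit priority attributes would be needed to reproduce the ordering of the original xsl:choose.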
The p element in your example has several text nodes, so the expression ./text() creates a sequence. You cannot apply a string function to a sequence; you must convert it to a string first. Instead of:
test="contains(substring-before(./text(),' '),'Article')"
try:
test="contains(substring-before(string-join(text(), ''), ' '), 'Article')"

How to craft an entity in XSLT

How can I create the entity '&#160;', if I have the part starting with the '#' in a variable?
When I try to do something like this:
concat('&', '#160;')
I get a syntax error in XMLspy.
Does it have to be an entity (actually you mean a "character reference"), or will it do just to output a non-breaking space character?
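To do the latter, given that $var holds "#160", in XSLT 2.0 you can use the following (with the xs prefix bound to http://www.w3.org/2001/XMLSchema; codepoints-to-string expects xs:integer values, whereas number() returns an xs:double):
<xsl:value-of select="codepoints-to-string(xs:integer(substring($var, 2)))"/>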
The problem with your code is that, in XML, you cannot use a standalone &; it has to be written as &amp;, like this:
concat('&amp;', '#160;')
which outputs &amp;#160; if the output method is xml and &#160; if it is text.
disable-output-escaping helps to force &#160; in xml output:
<xsl:value-of select="concat('&amp;', '#160;')" disable-output-escaping="yes"/>
Another way to replace a character by an arbitrary string is using character maps (XSLT 2.0):
<xsl:output use-character-maps="foo"/>
<xsl:character-map name="foo">
<xsl:output-character character="&amp;" string="&amp;"/>
</xsl:character-map>
<xsl:template match="/">
<xsl:value-of select="concat('&amp;', '#160;')"/>
</xsl:template>

Why isn't local-name() returning anything?

I'm trying to run the following template:
<xsl:template match="*[starts-with(., 'ATTITUDE_')]/text()">
<xsl:variable name="ElementName" select="local-name()"/>
<xsl:variable name="vVal" select= "$vAttitudes[. = substring-after(current(), '_')]/@val"/>
<xsl:choose>
<xsl:when test="contains($ElementName, 'Refuse')">
<xsl:value-of select="civf:book-capitalise($vAttitudes[@val = $vVal+1])"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="civf:book-capitalise($vAttitudes[@val = $vVal])"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
So the premise is, find the name of the element, if it has the text "Refuse" in the name of it then "doTheThing"+1 otherwise just "doTheThing". However this test always fails so +1 never gets called even if the element has "Refuse" in the name. If I just output local-name then I get empty too. Why does local-name() not appear to work here?
I did previously try to start the template with:
<xsl:template match="*[contains(., 'Refuse')]/name()">
But Saxon complained that I was running too many functions in the match sequence.
I apologise in advance for not knowing too much about XSLT.
I believe that local-name() does not work because you are matching text nodes (/text() in the match attribute), and text nodes do not have local names.
I'm not sure what you are trying to do but I don't think you actually want to match /text() but instead the whole element, and obtain its text() afterwards.
Alternatively, you can keep matching the text node and use local-name(..), which returns the local name of the text node's parent element.
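A minimal sketch of the local-name(..) variant follows. It keeps the match pattern from the question but replaces the lookup logic with placeholder output, because $vAttitudes and civf:book-capitalise are not shown in full; the strings 'refuse branch' and 'default branch' are invented for the illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="*[starts-with(., 'ATTITUDE_')]/text()">
    <!-- The context node is the text node, which has no name;
         ".." is the element that contains it. -->
    <xsl:variable name="ElementName" select="local-name(..)"/>
    <xsl:value-of select="$ElementName"/>
    <xsl:text>: </xsl:text>
    <!-- Placeholder for the two branches of the original xsl:choose -->
    <xsl:value-of select="if (contains($ElementName, 'Refuse'))
                          then 'refuse branch' else 'default branch'"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

With this change the contains($ElementName, 'Refuse') test sees the element name instead of an empty string, so the "Refuse" branch can actually be reached.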

XSLT, finding out if last child node is a specific element

Look at the following two examples:
<foo>some text <bar/> and maybe some more</foo>
and
<foo>some text <bar/> and a last <bar/></foo>
Mixed text nodes and bar elements within the foo element. Now I am in foo, and want to find out if the last child is a bar. The first example should prove false, as there is text after the bar, but the second example should be true.
How can I accomplish this with XSLT?
Just select the last node of the <foo> element and then use the self axis to test the node type.
/foo/node()[position()=last()]/self::bar
This XPath expression returns an empty set (which equates to boolean false) if the last node is not a bar element. If you specifically want the value true or false, wrap this expression in the XPath function boolean(). Use self::* instead of self::bar to match any element as the last node.
Input XML document:
<root>
<foo>some text <bar/> and maybe some more</foo>
<foo>some text <bar/> and a last <bar/></foo>
</root>
XSLT document example:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="foo">
<xsl:choose>
<xsl:when test="node()[position()=last()]/self::bar">
<xsl:text>bar element at the end
</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>text at the end
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Output of the stylesheet:
text at the end
bar element at the end
"Now I am in foo, and want to find out if the last child is a bar"
Use:
node()[last()][self::bar]
The boolean value of any non-empty node-set is true(), and it is false() otherwise. You can use the above expression directly (unmodified) as the value of the test attribute of any <xsl:if> or <xsl:when>.
Better, use:
foo/node()[last()][self::bar]
as the match attribute of an <xsl:template> -- thus you write in pure "push" style.
Update: This answer addresses the requirement stated in the original question title, "finding out if last child node is a text node." But the question body suggests a different requirement, and it seems that the latter requirement was the one intended by the OP.
The previous two answers explicitly test whether the last child is a bar element, rather than directly testing whether it is a text node. This is correct if foo contains only "mixed text nodes and bar elements" and never has zero children.
But you may want to test directly whether the last child is a text node:
1. For readability of stylesheet logic
2. In case the element contains other children besides elements and text: e.g. comments or processing instructions
3. In case the element has no children
Maybe you know the latter two will never occur in your case (but from your question I would guess that #3 could). Or maybe you think so but aren't sure, or maybe you hadn't thought about it. In either case, it's safer to test directly for what you actually want to know:
test="node()[last()]/self::text()"
Thus, building on @Dimitre's example code and input, the following XML input:
<root>
<foo>some text <bar/> and maybe some more</foo>
<foo>some text <bar/> and a pi: <?foopi param=yes?></foo>
<foo>some text <bar/> and a comment: <!-- baz --></foo>
<foo>some text and an element: <bar /></foo>
<foo noChildren="true" />
</root>
With this XSLT template:
<xsl:template match="foo">
<xsl:choose>
<xsl:when test="node()[last()]/self::text()">
<xsl:text>text at the end;
</xsl:text>
</xsl:when>
<xsl:when test="node()[last()]/self::*">
<xsl:text>element at the end;
</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>neither text nor element child at the end;
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
yields:
text at the end;
neither text nor element child at the end;
neither text nor element child at the end;
element at the end;
neither text nor element child at the end;