XSLT: Regular Expression function does not work?

Ok, this one has been driving me up the wall...
I have an XSLT function that is supposed to split out the zip-code part from a Zip+City string depending on the country. I cannot get it to work! This is what I've got so far:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:exslt="http://exslt.org/functions" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:function name="exslt:GetZip" as="xs:string">
<xsl:param name="zipandcity" as="xs:string"/>
<xsl:param name="countrycode" as="xs:string"/>
<xsl:choose>
<xsl:when test="$countrycode='DK'">
<xsl:analyze-string select="$zipandcity" regex="(\d{4}) ([A-Za-zÆØÅæøå]{3,24})">
<xsl:matching-substring>
<xsl:value-of select="regex-group(1)"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:text>fail</xsl:text>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:when>
<xsl:otherwise>
<xsl:text>error</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:function>
I am running it on a source XML where the following values are passed to the function:
zipandcity: "DK-2640 København SV"
countrycode: "DK"
...will output 'fail'!
I think there is something I am misunderstanding here...

Aside from the facts that regexes aren't supported until XSLT 2.0 and that braces have to be escaped (but backslashes don't), there's one more reason why that code won't work: xsl:analyze-string walks the whole input and hands every stretch the regex doesn't match to xsl:non-matching-substring. Given the string DK-2640 København SV, your regex only matches 2640 København, so the leftover DK- and SV each produce fail. You need to "pad" the regex to make it consume the whole string:
regex=".*(\d{{4}}) ([A-Za-zÆØÅæøå]{{3,24}}).*"
.* is probably sufficient in this case, but sometimes you have to be more specific. For example, if there's more than one place where \d{4} could match, you might use \D* at the beginning to make sure the first capturing group matches the first bunch of digits.
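Putting the pieces of this answer together, the function might look like the sketch below. It keeps the names and namespaces from the question and only changes the regex attribute (doubled braces plus the .* padding); it has not been run against the asker's real data.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exslt="http://exslt.org/functions"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:function name="exslt:GetZip" as="xs:string">
    <xsl:param name="zipandcity" as="xs:string"/>
    <xsl:param name="countrycode" as="xs:string"/>
    <xsl:choose>
      <xsl:when test="$countrycode = 'DK'">
        <!-- {{ and }} are literal braces inside an attribute value template;
             the leading and trailing .* let one match cover the whole string -->
        <xsl:analyze-string select="$zipandcity"
            regex=".*(\d{{4}}) ([A-Za-zÆØÅæøå]{{3,24}}).*">
          <xsl:matching-substring>
            <xsl:value-of select="regex-group(1)"/>
          </xsl:matching-substring>
          <!-- reached only when the regex matches nowhere in the input -->
          <xsl:non-matching-substring>
            <xsl:text>fail</xsl:text>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>error</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

</xsl:stylesheet>
```

With the inputs from the question, exslt:GetZip('DK-2640 København SV', 'DK') should then return 2640.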

The regex attribute is parsed as an attribute value template, where curly braces have a special meaning. If this is in fact an XSLT 2.0 stylesheet, you need to escape the curly braces in the regex attribute by doubling them: (\d{{4}}) ([A-Za-zÆØÅæøå]{{3,24}})
Alternatively you could define a variable containing your pattern like this:
<xsl:variable name="pattern">(\d{4}) ([A-Za-zÆØÅæøå]{3,24})</xsl:variable>
<xsl:analyze-string select="$zipandcity" regex="{$pattern}">

Regular expressions are only supported in XSLT 2.x -- not in XSLT 1.0.

Related

Extract the sub string based on the regular expression in xslt

I have scenario where I want to extract the sub string which matches the regular expression.
Below is the example:
<xsl:value-of select="matches('Process java(Application=JavaApplication_2) is not running in the system.', ''.*AppName=Archiver_[0-9]{1,2}.*'')"/>
But this gives me the boolean value as 'false'.
I tried with tokenize but it is becoming more complex.
Please help me on this.
See the xsl:analyze-string instruction (source and regex examples).
Input
<root>Process java(Application=JavaApplication_2) is not running in the system.</root>
Template
<xsl:template match="root">
<xsl:analyze-string select="." regex="Application=JavaApplication_[0-9]{{1,2}}">
<xsl:matching-substring>
<xsl:value-of select="."/>
</xsl:matching-substring>
<!-- optional -->
<xsl:non-matching-substring>
<!-- do sth -->
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
The matches() function returns either true or false.
To extract a matching substring, try using the replace() function instead. I am not sure which substring you are trying to extract, so I will not give an example here, but see: https://stackoverflow.com/a/39402132/3016153
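As a sketch of the replace() approach, assuming the substring wanted is the Application=JavaApplication_N part (the question does not say for certain) and reusing the root input element from the other answer: the pattern is padded with .* so that it consumes the whole string, and the replacement keeps only the first capture group. The variable name re is made up for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="root">
    <!-- select is not an attribute value template, so braces are not doubled here -->
    <xsl:variable name="re"
        select="'^.*(Application=JavaApplication_[0-9]{1,2}).*$'"/>
    <!-- replace() returns its input unchanged when nothing matches,
         so guard with matches() first -->
    <xsl:if test="matches(., $re)">
      <xsl:value-of select="replace(., $re, '$1')"/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Note that . does not match a newline unless the s flag is passed, so multi-line input would need replace(., $re, '$1', 's').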

xslt 1.0 template that reduces multiple spaces to a single space

In my XSLT 2.0 stylesheet, I use the following template, which reduces multiple spaces to a single space.
<xsl:template match="text()">
<xsl:value-of select="replace(., '\s+', ' ')"/>
</xsl:template>
I'd like to do the same thing in a XSLT 1.0 stylesheet, but the "replace" function is not supported. Any suggestions for what I can do?
You could use normalize-space():
<xsl:template match="text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>
This will remove any leading and trailing whitespace and reduce multiple spaces to a single space.
For reference: https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space

XSLT - Regular Expression Parsing

My current project revolves around translating a number of test cases in a document into a form of XML compatible with a test case management system. In many of these cases, the title is prefixed by a number of ticket identifiers, document location numbers and so on, which need to be removed before they can be uploaded to the system.
Given that many of these ticket identifiers could exist elsewhere in the title and be completely valid, I've written the translation in its current form so that only the start of the string is checked for the regular expression. I have written two approaches, with varying results.
Sample Input
1.
<case-name>3.1.6 (C0) TID#EIIY CHM-2213 BZ-7043 Client side Java Upgrade R8</case-name>
2.
<case-name>4.2.7 (C1) TID#F1DR – AIP - EHD-319087 - BZ6862 - Datalink builder res...</case-name>
Desired Output
1.
<tr:summary>Client side Java Upgrade R8</tr:summary>
2.
<tr:summary>Datalink builder res...</tr:summary>
First Approach
<xsl:template match="case-name">
<tr:summary>
<xsl:variable name="start">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="start" select="normalize-space($start)"/>
<xsl:variable name="noFloat" select="normalize-space(fn:remFirstRegEx($start, '^[0-9]+([.][0-9]+)*' ))"/>
<xsl:variable name="noFloatDash" select="normalize-space(fn:remFirstRegEx($noFloat, '^[\p{Pd}]' ))"/>
<xsl:variable name="noC" select="normalize-space(fn:remFirstRegEx($noFloatDash, '^\(C[0-2]\)' ))"/>
<xsl:variable name="noCDash" select="normalize-space(fn:remFirstRegEx($noC, '^[\p{Pd}]' ))"/>
<xsl:variable name="noTID" select="normalize-space(fn:remFirstRegEx($noCDash, '^(TID)(#|\p{Pd})(\w+)' ))"/>
<xsl:variable name="noTIDDash" select="normalize-space(fn:remFirstRegEx($noTID, '^[\p{Pd}]' ))"/>
<xsl:variable name="noAIP" select="normalize-space(fn:remFirstRegEx($noTIDDash, '^AIP' ))"/>
<xsl:variable name="noAIPDash" select="normalize-space(fn:remFirstRegEx($noAIP, '^[\p{Pd}]' ))"/>
<xsl:variable name="noCHM" select="normalize-space(fn:remFirstRegEx($noAIPDash, '^(CHM)[\p{Pd}]([0-9]+)' ))"/>
<xsl:variable name="noCHMDash" select="normalize-space(fn:remFirstRegEx($noCHM, '^[\p{Pd}]' ))"/>
<xsl:variable name="noEHD" select="normalize-space(fn:remFirstRegEx($noCHMDash, '^(EHD)[\p{Pd}]([0-9]+)' ))"/>
<xsl:variable name="noEHDDash" select="normalize-space(fn:remFirstRegEx($noEHD, '^[\p{Pd}]' ))"/>
<xsl:variable name="noBZ" select="normalize-space(fn:remFirstRegEx($noEHDDash, '^(BZ)(((#|\p{Pd})[0-9]+)|[0-9]+)' ))"/>
<xsl:variable name="noBZDash" select="normalize-space(fn:remFirstRegEx($noBZ, '^[\p{Pd}]' ))"/>
<xsl:variable name="noTT" select="normalize-space(fn:remFirstRegEx($noBZDash, '^(TT)[#](\w)+' ))"/>
<xsl:variable name="noTTDash" select="normalize-space(fn:remFirstRegEx($noTT, '^[\p{Pd}]' ))"/>
<xsl:variable name="nobrack" select="normalize-space(fn:remFirstRegEx($noTTDash, '^\[(.*?)\]' ))"/>
<xsl:variable name="noBrackDash" select="normalize-space(fn:remFirstRegEx($nobrack, '^[\p{Pd}]' ))"/>
<xsl:value-of select="normalize-space($noBrackDash)"/>
</tr:summary>
</xsl:template>
<xsl:function name="fn:remFirstRegEx">
<xsl:param name="inString"/>
<xsl:param name="regex"/>
<xsl:variable name="words" select="tokenize($inString, '\p{Z}')"/>
<xsl:variable name="outString">
<xsl:for-each select="$words">
<xsl:if test="not(matches(., $regex)) or index-of($words, .) > 1">
<xsl:value-of select="."/><xsl:text> </xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:variable>
<xsl:value-of select="string-join($outString, '')"/>
</xsl:function>
Note: The namespace fn, for the purpose of this translation, is just "function/namespace", used to write my own functions.
First Results
1. Success
<tr:summary>Client side Java Upgrade R8</tr:summary>
2. Failure
<tr:summary>- EHD-319087 - BZ6862 - Datalink builder resolution selector may drop leading zeros on coordinate seconds</tr:summary>
Second Approach
<xsl:function name="fn:remFirstRegEx">
<xsl:param name="inString"/>
<xsl:param name="regex"/>
<xsl:analyze-string select="$inString" regex="$regex">
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:function>
This approach fails completely; I'm including it here because it's the more obvious solution and did not work at all.
It should be noted that there are a large number of regular expressions in the above solution, this is to account for all the possible IDs that might come through. Mercifully, the IDs seem to come in a consistent order.
The problem, as I have concluded, is with the dashes. I have noted that in every case in the documents where translation has failed, the failing ID has been both preceded and followed by a dash. If it only precedes, it'll go through fine. If it only follows, no issues. Both is where it falls down, and curiously, the dash still shows up, even though it has already been seemingly eliminated from the string.
There are two kinds of dashes at play here, a normal dash (–) and a minus sign (-).
Paradoxically: sorry for the long question, and let me know if I've missed anything out.
EDIT: Forgot to say, all regular expressions with the exception of the dashes have been tested elsewhere and are known to work on all input stuff.
EDIT II: Following @acheong87's solution, I tried to run the following:
<xsl:template match="case-name">
<tr:summary>
<xsl:variable name="regEx" select=
"'^[\s\p{Pd}]*(\d+([.]\d+)*)?[\s\p{Pd}]*(\(C[0-2]\))?([\s\p{Pd}]*(TID|AIP|CHM|EHD|BZ|TT)((#|\p{Pd}|)\w+|))*[\s\p{Pd}]*(\[.*?\])?'"/>
<xsl:analyze-string select="string(.)" regex="{$regEx}">
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</tr:summary>
</xsl:template>
And Saxon gives me the following error:
Error at xsl:analyze-string at line (for our purposes, 5):
XTDE1150: The regular expression must not be one that matches a zero-length string
I can get why that would come up, given that everything is optional. Is there another way of running it that won't give me this error?
Thanks again.
Here are the main components that would go into a single regex. I've rewritten some of your expressions.
\d+([.]\d+)*
\(C[0-2]\)
TID(#|\p{Pd})\w+
AIP
CHM[\p{Pd}]\d+
EHD[\p{Pd}]\d+
BZ(#|\p{Pd}|)\d+
TT#\w+
\[.*?\]
Each component should be wrapped in (...)? to make it optional, and all components should be joined by the separator, [\s\p{Pd}]*. This produces:
^[\s\p{Pd}]*(\d+([.]\d+)*)?[\s\p{Pd}]*(\(C[0-2]\))?[\s\p{Pd}]*(TID(#|\p{Pd})\w+)?[\s\p{Pd}]*(AIP)?[\s\p{Pd}]*(CHM[\p{Pd}]\d+)?[\s\p{Pd}]*(EHD[\p{Pd}]\d+)?[\s\p{Pd}]*(BZ(#|\p{Pd}|)\d+)?[\s\p{Pd}]*(TT#\w+)?[\s\p{Pd}]*(\[.*?\])?
You can see in this Rubular demo that the above expression indeed matches your two examples.
There may be an elegant simplification you may be interested in.
\d+([.]\d+)*
\(C[0-2]\)
(TID|AIP|CHM|EHD|BZ|TT)((#|\p{Pd}|)\w+|)
\[.*?\]
Maybe some codes like AIP should be separate, but you can see the spirit of this version. That is, it's unlikely that valid titles would begin with such codes; in fact probably more likely that your examples could be missing a possible combination such as EHD#, which may appear in the future but your past-based formulation would miss. (Of course, my point is irrelevant if there is no future—and the data you have is the only data you'll need to process.) If there is a future though, IMO, it's better in this case to loosen the rigor of the expression to capture potential related combinations.
The above would become:
^[\s\p{Pd}]*(\d+([.]\d+)*)?[\s\p{Pd}]*(\(C[0-2]\))?([\s\p{Pd}]*(TID|AIP|CHM|EHD|BZ|TT)((#|\p{Pd}|)\w+|))*[\s\p{Pd}]*(\[.*?\])?
Here is the Rubular demo.
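Because every component of that expression is optional, it can match an empty string, which xsl:analyze-string rejects with XTDE1150. One possible workaround, offered as an untested sketch: append (.+)$ so the pattern must consume at least one character, and let replace() keep only that last group, which is group 9 when the parentheses are counted. The tr namespace URI and the variable name prefix are placeholders, not taken from the question.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tr="urn:example:tr"> <!-- placeholder URI, an assumption -->

  <!-- the simplified prefix expression from this answer, unchanged -->
  <xsl:variable name="prefix" select=
    "'^[\s\p{Pd}]*(\d+([.]\d+)*)?[\s\p{Pd}]*(\(C[0-2]\))?([\s\p{Pd}]*(TID|AIP|CHM|EHD|BZ|TT)((#|\p{Pd}|)\w+|))*[\s\p{Pd}]*(\[.*?\])?'"/>

  <xsl:template match="case-name">
    <tr:summary>
      <!-- (.+)$ is capture group 9 and can never be empty,
           so the whole pattern cannot match a zero-length string -->
      <xsl:value-of select=
        "replace(normalize-space(.), concat($prefix, '(.+)$'), '$9')"/>
    </tr:summary>
  </xsl:template>

</xsl:stylesheet>
```

One caveat: a title that consists of identifiers only will still yield something, because the regex backtracks to give (.+) at least one character.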
One regex to rule them all looks like
^ # start of string
([0-9]\.[0-9.]+).*? # digits and dots
\((C[0-2])\).*? # C0, C1, C2
((TID#\S+).*?)? # TID...
((AIP).*?)? # AIP...
((CHM\S+).*?)? # CHM...
((EHD\S+).*?)? # EHD...
((BZ\S+).*?)? # BZ...
(\w.*)? # free text
$ # end of string
^([0-9]\.[0-9.]+).*?\((C[0-2])\).*?((TID#\S+).*?)?((AIP).*?)?((CHM\S+).*?)?((EHD\S+).*?)?((BZ\S+).*?)?(\w.*)?$
http://rubular.com/r/pPxKBVwJaE
Each .*? eats any delimiter until the next match begins. Most of the matches are optional, possibly even more than you need to be optional. Remove the enclosing (...)? for any match that you want to make mandatory. Optional groups are counted but can be empty.
Putting it all together
<xsl:variable name="linePattern"> <!-- group|contents -->
<xsl:text>^</xsl:text> <!-- start of string -->
<xsl:text>([0-9]\.[0-9.]+).*?</xsl:text> <!-- 1 digits and dots -->
<xsl:text>\((C[0-2])\).*?</xsl:text> <!-- 2 C0, C1, C2 -->
<xsl:text>((TID#\S+).*?)?</xsl:text> <!-- 3, 4 TID... -->
<xsl:text>((AIP).*?)?</xsl:text> <!-- 5, 6 AIP... -->
<xsl:text>((CHM\S+).*?)?</xsl:text> <!-- 7, 8 CHM... -->
<xsl:text>((EHD\S+).*?)?</xsl:text> <!-- 9, 10 EHD... -->
<xsl:text>((BZ\S+).*?)?</xsl:text> <!-- 11, 12 BZ... -->
<xsl:text>(\w.*)?</xsl:text> <!-- 13 free text -->
<xsl:text>$</xsl:text> <!-- end of string -->
</xsl:variable>
<xsl:template match="case-name">
<xsl:analyze-string select="string(.)" regex="{$linePattern}">
<xsl:matching-substring>
<tr:summary>
<part><xsl:value-of select="regex-group(1)" /></part>
<part><xsl:value-of select="regex-group(2)" /></part>
<part><xsl:value-of select="regex-group(4)" /></part>
<part><xsl:value-of select="regex-group(6)" /></part>
<part><xsl:value-of select="regex-group(8)" /></part>
<part><xsl:value-of select="regex-group(10)" /></part>
<part><xsl:value-of select="regex-group(12)" /></part>
<part><xsl:value-of select="regex-group(13)" /></part>
</tr:summary>
</xsl:matching-substring>
<!--
possibly include <xsl:non-matching-substring>, <xsl:fallback>
-->
</xsl:analyze-string>
</xsl:template>
Of course you can deal with the individual match groups any way you like.

XSL Analyze-String -> Matching-Substring into multiple variables

I was wondering if it is possible to use analyze-string and set multiple groups within the RegEx and then store all of the matching groups in variables to use later on.
like so:
<xsl:analyze-string regex="^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee" select=".">
<xsl:matching-substring>
<xsl:variable name="varX">
<xsl:value-of select="regex-group(1)"/>
</xsl:variable>
<xsl:variable name="varY">
<xsl:value-of select="regex-group(2)"/>
</xsl:variable>
</xsl:matching-substring>
</xsl:analyze-string>
This doesn't actually work, but that's the sort of thing I'm after, I know I can wrap the analyze-string in a variable, but that seems daft that for every group I have to process the RegEx, not very efficient, I should be able to process the regex once and store all of the groups for use later on.
Any ideas?
Well does
<xsl:variable name="groups" as="element(group)*">
<xsl:analyze-string regex="^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee" select=".">
<xsl:matching-substring>
<group>
<x><xsl:value-of select="regex-group(1)"/></x>
<y><xsl:value-of select="regex-group(2)"/></y>
</group>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:variable>
help? That way you have a variable named groups which is a sequence of group elements with the captures.
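A minimal sketch of how the captures could then be read back out of $groups, assuming the input document is the single element t with the text Blah 123 Bloo 4567 Blee (the element name t is chosen for the example):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="t">
    <!-- run the regex once and keep every capture in a small element structure -->
    <xsl:variable name="groups" as="element(group)*">
      <xsl:analyze-string regex="^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee" select=".">
        <xsl:matching-substring>
          <group>
            <x><xsl:value-of select="regex-group(1)"/></x>
            <y><xsl:value-of select="regex-group(2)"/></y>
          </group>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>

    <!-- the captures are now ordinary child elements of the first group -->
    <xsl:variable name="varX" select="$groups[1]/x"/>
    <xsl:variable name="varY" select="$groups[1]/y"/>
    <xsl:value-of select="$varX, $varY"/>
  </xsl:template>

</xsl:stylesheet>
```

If the regex could match more than once, $groups would hold one group element per match, and $groups[2]/x would give the first capture of the second match.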
This transformation shows that xsl:analyze-string isn't necessary to obtain the wanted results -- a simpler and generic solution exists:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="*[matches(., '^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee')]">
<xsl:variable name="vTokens" select=
"tokenize(replace(., '^Blah\s+(\d+)\s+Bloo\s+(\d+)\s+Blee', '$1 $2'), ' ')"/>
<xsl:variable name="varX" select="$vTokens[1]"/>
<xsl:variable name="varY" select="$vTokens[2]"/>
<xsl:sequence select="$varX, $varY"/>
</xsl:template>
</xsl:stylesheet>
when applied on this XML document:
<t>Blah 123 Bloo 4567 Blee</t>
which produces the wanted, correct result:
123 4567
Here we don't rely on knowing the RegEx (can be supplied as parameter) and the string -- we just replace the string with a delimited string of the RegEx groups, which we then tokenize and every item in the sequence produced by tokenize() can readily be assigned to a corresponding variable.
We don't have to find the wanted results buried in a temp. tree -- we just get them all in a result sequence.

How do I use a regular expression in XSLT 1.0?

I am using XSLT 1.0.
My input information may contain these values
<!--case 1-->
<attribute>123-00</attribute>
<!--case 2-->
<attribute>Abc-01</attribute>
<!--case 3-->
<attribute>--</attribute>
<!--case 4-->
<attribute>Z2-p01</attribute>
I want to find those strings that match the criteria:
if string has at least 1 alphabet AND has at least 1 number,
then
do X processing
else
do Y processing
In example above, for case 1,2,4 I should be able to do X processing. For case 3, I should be able to do Y processing.
I aim to use a regular expression (in XSLT 1.0).
For all the cases, the attribute can take any value of any length.
I tried use of match, but the processor returned an error.
I tried use of translate function, but not sure if used the right way.
I am thinking about.
if String matches [a-zA-Z0-9]*
then do X processing
else
do y processing.
How do I implement that using XSLT 1.0 syntax?
This solution really works in XSLT 1.0 (and is simpler, because it doesn't and needn't use the double-translate method.):
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:variable name="vUpper" select=
"'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:variable name="vLower" select=
"'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="vAlpha" select="concat($vUpper, $vLower)"/>
<xsl:variable name="vDigits" select=
"'0123456789'"/>
<xsl:template match="attribute">
<xsl:choose>
<xsl:when test=
"string-length() != string-length(translate(.,$vAlpha,''))
and
string-length() != string-length(translate(.,$vDigits,''))">
Processing X
</xsl:when>
<xsl:otherwise>
Processing Y
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML fragment -- made a well-formed XML document:
<t>
<!--case 1-->
<attribute>123-00</attribute>
<!--case 2-->
<attribute>Abc-01</attribute>
<!--case 3-->
<attribute>--</attribute>
<!--case 4-->
<attribute>Z2-p01</attribute>
</t>
the wanted, correct result is produced:
Processing Y
Processing X
Processing Y
Processing X
Do note: any attempt to use code like this (borrowed from another answer to this question) with a true XSLT 1.0 processor will fail with an error:
<xsl:template match=
"attribute[
translate(.,
translate(.,
concat($upper, $lower),
''),
'')
and
translate(., translate(., $digit, ''), '')]
">
because in XSLT 1.0 it is forbidden for a match pattern to contain a variable reference.
If you found this question because you're looking for a way to use regular expressions in XSLT 1.0, and you're writing an application using Microsoft's XSLT processor, you can solve this problem by using an inline C# script.
I've written out an example and a few tips in this thread, where someone was seeking out similar functionality. It's super simple, though it may or may not be appropriate for your purposes.
XSLT 1.0 does not support regular expressions, but you can fake it.
The following stylesheet prints an X processing message for all attribute elements having a string value containing at least one number and at least one letter (and Y processing for those that do not):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:variable name="digit" select="'0123456789'"/>
<xsl:template match="attribute">
<xsl:choose>
<xsl:when test="
translate(., translate(., concat($upper, $lower), ''), '') and
translate(., translate(., $digit, ''), '')">
<xsl:message>X processing</xsl:message>
</xsl:when>
<xsl:otherwise>
<xsl:message>Y processing</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Note: You said this:
In example above, for case 1,2,4 I should be able to do X processing.
for case 3, I should be able to do Y processing.
But that conflicts with your requirement, because case 1 does not contain a letter. If, on the other hand, you really want to match the equivalent of [a-zA-Z0-9], then use this:
translate(., translate(., concat($upper, $lower, $digit), ''), '')
...which matches any attribute having at least one letter or number.
See the following question for more information on using translate in this way:
How to write xslt if element contains letters?
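To see why the double translate works, here is a small worked sketch on the single value Abc-01. The variable names letters, s and nonLetters are made up for the example.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:variable name="letters"
      select="'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'"/>

  <xsl:template match="/">
    <xsl:variable name="s" select="'Abc-01'"/>
    <!-- inner translate: delete every letter, leaving the unwanted characters '-01' -->
    <xsl:variable name="nonLetters" select="translate($s, $letters, '')"/>
    <!-- outer translate: delete those unwanted characters, leaving only the letters 'Abc' -->
    <xsl:value-of select="translate($s, $nonLetters, '')"/>
  </xsl:template>

</xsl:stylesheet>
```

The result is a non-empty string exactly when the value contains at least one letter, and a non-empty string is true in a boolean context, which is what the test attribute relies on.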