Regex seems correct, but doesn't match with XSLT - regex

I have this string:
"und", 3.96036662358691, 3.3195020746888, 3.2907085875176, 3.02795262671161, 3.5162776568275, 3.6582196231319, 4.25539102528011, 2.66244838424718, 2.92641494865261, 2.76262283971535,
I want to get the first decimal and the comma afterwards (3.96036662358691,).
Therefore I'm using this regex:
(^"\w+",\s)(\d+\.\d+,)
To use the regex in XSLT, I escaped the quotes, so the regex is now:
(^&quot;\w+&quot;,\s)(\d+\.\d+,)
A snippet from the XSLT:
<xsl:analyze-string select="." regex="(^&quot;\w+&quot;,\s)(\d+\.\d+,)">
<xsl:matching-substring>
<xsl:value-of select="regex-group(2)"/>
</xsl:matching-substring>
</xsl:analyze-string>
I'm using Oxygen, XSLT 3.0 and Saxon-PE 11.4.
Why is my regex only matching when looking for the pattern in the file, but not when using it with XSLT?
Some more information:
This is a snippet from the XML file (with word frequencies):
<?xml version="1.0" encoding="utf-8"?><xml>"name1", "name2", "name3", "name4", "name5", "name6", "name7", "name8", "name9", "name10",
"und", 3.96036662358691, 3.3195020746888, 3.2907085875176, 3.02795262671161, 3.5162776568275, 3.6582196231319, 4.25539102528011, 2.66244838424718, 2.92641494865261, 2.76262283971535,
"sie", 1.74547291174592, 2.69105265278169, 1.79199906147349, 4.57921899704663, 2.02015087843653, 0.786224821312541, 2.4266651652497, 5.35571214331204, 1.5944714693846, 2.0382921043714,
"die", 1.87916870924135, 2.7111952624582, 2.32578601595495, 2.36866441931923, 2.16444736975343, 2.00129954515919, 2.77011429536625, 2.09749984592313, 2.3009806192572, 2.05947136563877,</xml>
This is a snippet from my XSLT so far:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
expand-text="yes">
<xsl:template match="*:xml/text()">
<!-- copy all words to the very top -->
<xsl:analyze-string select="." regex="(&quot;(\w+)&quot;,)">
<xsl:matching-substring>
<xsl:value-of select="."/>
</xsl:matching-substring>
</xsl:analyze-string>
<!-- copy all word frequencies for name1 -->
<xsl:analyze-string select="." regex="(^&quot;\w+&quot;,\s)(\d+\.\d+,)">
<xsl:matching-substring>
<xsl:text>"name1",</xsl:text>
<xsl:value-of select="regex-group(2)"/>
</xsl:matching-substring>
</xsl:analyze-string>
<!-- after that: copy all frequencies for name2 etc. -->
</xsl:template>
</xsl:stylesheet>
All in all, I'm trying to reformat a txt file to get the correct format for csv.

You haven't quite explained what you are after, but I think that adding the flags="m" attribute, as in
<xsl:analyze-string select="." regex="(^&quot;\w+&quot;,\s)(\d+\.\d+,)" flags="m">
<xsl:matching-substring>
<xsl:text>"name1",</xsl:text>
<xsl:value-of select="regex-group(2)"/>
</xsl:matching-substring>
</xsl:analyze-string>
is what helps with your complete, multi-line input: with that flag, ^ matches at the beginning of each line rather than only at the beginning of the input string.
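A minimal, self-contained sketch of that effect, assuming a shortened two-line sample held in a hypothetical variable $sample instead of the real file. With flags="m" the anchor should match at the start of both rows, so the first frequency of each row is emitted (3.96, and 1.74, on separate lines); without the flag only the first row would match.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="text"/>
  <!-- Hypothetical two-row sample; the character reference for line feed separates the rows -->
  <xsl:variable name="sample"
      select="'&quot;und&quot;, 3.96, 3.31,&#10;&quot;sie&quot;, 1.74, 2.69,'"/>
  <!-- Runs against any XML input; the input itself is ignored -->
  <xsl:template match="/">
    <!-- flags="m": ^ matches at the start of every line, not only at the start of the string -->
    <xsl:analyze-string select="$sample" regex="^&quot;\w+&quot;,\s(\d+\.\d+,)" flags="m">
      <xsl:matching-substring>
        <!-- Group 1 is the first decimal number plus its trailing comma -->
        <xsl:value-of select="regex-group(1)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```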

Related

XSLT regex doesn't match even when in online regex tests match correctly

Given example path C:\example\innerExample\file.txt, I want to extract filename with extension using this regex, you can see it here.
<xsl:analyze-string select="$filePath" regex="$regexPattern" flags="mis">
<xsl:matching-substring>
<xsl:value-of select="concat(regex-group(2), regex-group(3))"/>
</xsl:matching-substring>
</xsl:analyze-string>
This is my xslt code, is there anything I'm missing?
Without going into your attempt (which I cannot reproduce), I believe you can extract the filename with extension simply by using:
<xsl:value-of select="tokenize($filepath, '\\')[last()]"/>
Demo: http://xsltransform.hikmatu.com/6qVRKvN
You haven't shown us a complete but minimal example with the proper values, but with the correction of not escaping the / in the square brackets, I think your pattern works with XSLT/XPath 2 and later:
Input
<root>
<data>C:\example\innerExample\file.txt</data>
</root>
is at https://xsltfiddle.liberty-development.net/jyRYYhM transformed with
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
version="3.0">
<xsl:param name="regexPattern" as="xs:string">^(.*)[/|\\](.*)(\..*)</xsl:param>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="data">
<xsl:copy>
<xsl:analyze-string select="." regex="{$regexPattern}" flags="mis">
<xsl:matching-substring>
<xsl:value-of select="concat(regex-group(2), regex-group(3))"/>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
into
<root>
<data>file.txt</data>
</root>
(I have used XSLT 3 there but I think there has been no change between XSLT 2 and 3 in terms of xsl:analyze-string).
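One detail worth spelling out, as a sketch rather than a diagnosis of the original code: the regex attribute of xsl:analyze-string is an attribute value template, so a variable reference has to be wrapped in curly braces. The parameter name $regexPattern is taken from the question; the simplified pattern value below is only an illustration, and it assumes the <data> input shown above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all" version="3.0">
  <xsl:output method="text"/>
  <!-- Illustrative pattern: the last run of non-backslash characters in the path -->
  <xsl:param name="regexPattern" as="xs:string">[^\\]+$</xsl:param>
  <xsl:template match="data">
    <!-- regex="$regexPattern" (no braces) would be used as the literal pattern text
         and would not match the path; the braces make it evaluate the parameter.
         For the same reason, literal braces in a pattern must be doubled,
         e.g. [0-9]{{4}} for "exactly four digits". -->
    <xsl:analyze-string select="." regex="{$regexPattern}">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```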

Extracting a number from a string in a XSLT file

Is there a way to extract a number from a string after a set of characters?
Our software uses XSLT files to convert emails into XML files. In the Subject line of an email there can be a reference to an already opened Incident/Service Request/Task.
For example - the Subject of an email could be:
RE: SR#51417: D_SATTER-NOV60LKA-I_G-A0201244
I want to extract the Service Request Number 51417 from the Subject.
The number will always be after the String "SR#". "SR#" could be written as "sr#", "Sr#" or "sR#".
I was trying to use the RegEx functions in XSLT but can't get it to work.
Do you have any suggestions on how to do this?
Thanks in advance.
Update
I am trying the solution provided by cyclexx. This is the Code that I have put in my XSLT File:
<xsl:when test="contains($subject, 'SR#')">
<xsl:element name="Field">
<xsl:attribute name="Name">
<xsl:text>a_eco_parentObjectType</xsl:text>
</xsl:attribute>
<xsl:value-of select="'ServiceReq'"></xsl:value-of>
</xsl:element>
<xsl:element name="Field">
<xsl:attribute name="Name">
<xsl:text>a_eco_parentObjectID</xsl:text>
</xsl:attribute>
<xsl:analyze-string select ="$subject" regex="\s*[Ss][Rr]#([0-9]+)\s*">
<xsl:matching-substring>
<SR>
<xsl:value-of select="regex-group(1)"></xsl:value-of>
</SR>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:element>
The variable $subject contains the Subject line of the Email file that is being processed. The output file just contains:
ServiceReq
and I have an error message: :Error in loading Hierarchical Object XSLT file(s)
Here is a solution :)
<emails>
<email>
<subject>RE: SR#51417: D_SATTER-NOV60LKA-I_G-A0201244</subject>
</email>
<email>
<subject>RE: Sr#565465: D_SATTER-NOV60LKA-I_G-A0201244</subject>
</email>
</emails>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="subject">
<xsl:variable name="v" select="." />
<xsl:analyze-string select="$v" regex="\s*[Ss][Rr]#([0-9]+)\s*">
<xsl:matching-substring>
<SR>
<xsl:value-of select="regex-group(1)" />
</SR>
</xsl:matching-substring>
<xsl:non-matching-substring>
<subject>
<xsl:value-of select="$v"/>
</subject>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
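A hedged variation on the same idea: xsl:analyze-string is an XSLT 2.0 instruction, so it needs an XSLT 2.0 (or later) processor, and an XSLT 1.0 engine will reject it. Assuming such a processor is available, flags="i" can replace the [Ss][Rr] character classes. The SR element is kept from the answer above; the requests wrapper element is a hypothetical name added only to keep the output well-formed.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/emails">
    <!-- Hypothetical wrapper so that several SR elements share one root -->
    <requests>
      <xsl:apply-templates select="email/subject"/>
    </requests>
  </xsl:template>
  <xsl:template match="subject">
    <!-- flags="i" lets "sr#" match SR#, sr#, Sr# and sR# alike -->
    <xsl:analyze-string select="." regex="sr#([0-9]+)" flags="i">
      <xsl:matching-substring>
        <!-- Group 1 holds only the digits after the marker -->
        <SR><xsl:value-of select="regex-group(1)"/></SR>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the sample <emails> document above, this should yield one SR element per subject, containing 51417 and 565465 respectively.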

XSLT - replace specific content of the text() node with a new node

I have an XML document like this:
<doc>
<p>Biological<sub>89</sub> bases<sub>4456</sub> for<sub>8910</sub> sexual<sub>4456</sub>
differences<sub>8910</sub> in<sub>4456</sub> the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub> Recently<sub>8910</sub> the
dogma<sub>8910</sub> of<sub>4456</sub> hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
As you can see, the <p> node contains <sub> nodes and text() nodes, and after every <sub> node there is a text node starting with a space (e.g. in <sub>89</sub> bases there is a space before 'bases'). I need to replace those specific spaces with <s/> nodes.
So the expected output should look like this:
<doc>
<p>Biological<sub>89</sub><s/>bases<sub>4456</sub><s/>for<sub>8910</sub><s/>sexual<sub>4456</sub>
<s/>differences<sub>8910</sub><s/>in<sub>4456</sub><s/>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s/>Recently<sub>8910</sub><s/>the
dogma<sub>8910</sub><s/>of<sub>4456</sub><s/>hormonal dependence for the sexual
differentiation of the brain has been challenged.</p>
</doc>
To do this I can use a regular expression like this:
<xsl:template match="p/text()">
<xsl:analyze-string select="." regex="( )">
<xsl:matching-substring>
<xsl:choose>
<xsl:when test="regex-group(1)">
<s/>
</xsl:when>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
But this adds <s/> nodes for every space in the text() node. I only need to add nodes for those specific spaces.
Can anyone suggest a method for doing this?
If you only want to match text nodes that start with a space and are preceded by a sub element, you can put the condition in your template match
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
And if you just want to remove the space at the start of the string, a simple replace will do.
<xsl:value-of select="replace(., '^\s+', '')" />
Try this XSLT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" indent="no" />
<xsl:template match="p/text()[substring(., 1, 1) = ' '][preceding-sibling::node()[1][self::sub]]">
<s />
<xsl:value-of select="replace(., '^\s+', '')" />
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Just change the regex to ^( ): it will match only a space at the beginning of each text node.
With this XSL snippet:
<xsl:analyze-string select="." regex="^( )">
Here is the result I obtain:
<p>Biological<sub>89</sub><s></s>bases<sub>4456</sub><s></s>for<sub>8910</sub><s></s>sexual<sub>4456</sub>
differences<sub>8910</sub><s></s>in<sub>4456</sub><s></s>the brain exist in a wide range of
vertebrate species, including chickens<sub>8910</sub><s></s>Recently<sub>8910</sub><s></s>the
dogma<sub>8910</sub><s></s>of<sub>4456</sub><s></s>hormonal dependence for the sexual
differentiation of the brain has been challenged.
</p>

XSL transform on text to XML with unparsed-text: need more depth

My rather well-formed input (I don't want to copy all data):
StartThing
Size Big
Colour Blue
coords 42, 42
foo bar
EndThing
StartThing
Size Small
Colour Red
coords 29, 51
machin bidule
EndThing
<!-- repeat a few thousand times-->
I have the below XSL which I modified from Parse text file with XSLT
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:param name="text-encoding" as="xs:string" select="'iso-8859-1'"/>
<xsl:param name="text-uri" as="xs:string" select="'unparsed-text.txt'"/>
<xsl:template name="text2xml">
<xsl:variable name="text" select="unparsed-text($text-uri, $text-encoding)"/>
<xsl:analyze-string select="$text" regex="(Size|Colour|coords) (.+)">
<xsl:matching-substring>
<xsl:element name="{(regex-group(1))}">
<xsl:value-of select="(regex-group(2))"/>
</xsl:element>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>
<xsl:template match="/">
<xsl:call-template name="text2xml"/>
</xsl:template>
</xsl:stylesheet>
and it works fine on parsing the pairs into elements and values. It gives me this output:
<?xml version="1.0" encoding="UTF-8"?>
<Size>Big</Size>
<Colour>Blue</Colour>
<coords>42, 42</coords>
But I'd also like to wrap the values in the Thing tag so that my output looks like this:
<Thing>
<Size>Big</Size>
<Colour>Blue</Colour>
<coords>42, 42</coords>
</Thing>
One solution might be a regex that matches each group of lines after each "thing". Then matches substrings as I'm already doing. Or is there some other way to parse the tree?
I would use two nested analyze-string levels, an outer one to extract everything between StartThing and EndThing, and then an inner one that operates on the strings matched by the outer one.
<xsl:template name="text2xml">
<xsl:variable name="text" select="unparsed-text($text-uri, $text-encoding)"/>
<!-- flags="s" allows .*? to match across newlines -->
<xsl:analyze-string select="$text" regex="StartThing.*?EndThing" flags="s">
<xsl:matching-substring>
<Thing>
<!-- "." here is the matching substring from the outer regex -->
<xsl:analyze-string select="." regex="(Size|Colour|coords) (.+)">
<xsl:matching-substring>
<xsl:element name="{(regex-group(1))}">
<xsl:value-of select="(regex-group(2))"/>
</xsl:element>
</xsl:matching-substring>
</xsl:analyze-string>
</Thing>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>

convert character if codepoint within given range

I have a couple of XML files that contain unicode characters with codepoint values between 57600 and 58607. Currently these are shown in my content as square blocks and I'd like to convert these to elements.
So what I'd like to achieve is something like :
<!-- current input -->
<p> Follow the on-screen instructions.</p>
<!-- desired output-->
<p><unichar value="58208"/> Follow the on-screen instructions.</p>
<!-- Where 58208 is the actual codepoint of the unicode character in question -->
I've experimented a bit with tokenize, but since you need a reference to split on, this turned out to be overcomplicated.
Any advice on how best to tackle this? I've been trying things like the code below but got stuck (don't mind the syntax, I know it doesn't make any sense):
<xsl:template match="text()">
-> for every character in my string
-> if string-to-codepoints(current character) greater then 57600 return <unichar value="codepoint value"/>
else return character
</xsl:template>
It sounds like a job for analyze-string e.g.
<xsl:template match="text()">
<xsl:analyze-string select="." regex="[&#57600;-&#58607;]">
<xsl:matching-substring>
<unichar value="{string-to-codepoints(.)}"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
Untested.
This transformation:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="/*">
<p>
<xsl:for-each select="string-to-codepoints(.)">
<xsl:choose>
<xsl:when test=". > 57600">
<unichar value="{.}"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="codepoints-to-string(.)"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</p>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML document:
<p> Follow the on-screen instructions.</p>
produces the wanted, correct result:
<p><unichar value="58498"/> Follow the on-screen instructions.</p>
Explanation: Proper use of the standard XPath 2.0 functions string-to-codepoints() and codepoints-to-string().
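The transformation above only checks the lower bound and rebuilds a single <p> root. A possible variant, sketched under the assumption that both bounds from the question (57600 to 58607) should apply and that the rest of the document should be copied unchanged:

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>
  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Explicit priority so this template wins over node() in the identity template -->
  <xsl:template match="text()" priority="1">
    <xsl:for-each select="string-to-codepoints(.)">
      <xsl:choose>
        <!-- Inside the range: emit an element carrying the codepoint -->
        <xsl:when test=". ge 57600 and . le 58607">
          <unichar value="{.}"/>
        </xsl:when>
        <!-- Outside the range: turn the codepoint back into its character -->
        <xsl:otherwise>
          <xsl:value-of select="codepoints-to-string(.)"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```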