xslt find and replace regex-lookarounds - regex

I have a problem with XSLT replace().
XML:
<root>
<title>I am title</title>
<body>
the new formula is:<br/>
the speed test 234 km/h<br>
the weight is 49 kg<br/>
in the 1492 Lorenzo de Medici die.
etc.
<dida>the mass is 56 kn</dida>
</body>
</root>
I need to replace every space that sits between a number and a unit of measurement (km, kg, kn).
In PHP I found this regex:
((?<=\d)\s(?=km|kg|kn))
In XSLT I have:
<xsl:template match="//*/text()">
<xsl:value-of select="replace(., '\(\(?\<=\\d\)\\s\(?=km\|kg\|kn\)\)', ' ')"></xsl:variable>
</xsl:template>
The problem is the < character!

The usual way to write '<' inside an attribute value is &lt;
That alone, however, didn't fix it for my XSLT processor (Kernow, using Saxon 9.1.0.3). It turned out that the parentheses and vertical bars don't need all those backslash escapes. In addition, the lookarounds didn't work: the XPath 2.0 regular expression syntax does not support lookahead or lookbehind. I was able to solve this by capturing the digit and the unit instead:
<xsl:value-of select="replace(., '(\d)\s(km|kg|kn)', '$1!$2')"></xsl:value-of>
(replacing with a '!' for clarity).
There are a few other basic errors in your example which I had to fix first: <br> was not correctly closed, and you mustn't terminate <xsl:value-of .. with </xsl:variable>.
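For completeness, here is a minimal sketch of how the capture-group approach could sit in a whole XSLT 2.0 stylesheet. The identity template and the priority attribute are my additions, not part of the original post, and the replacement string '$1$2' (which simply drops the whitespace) is an assumption about what you want; swap in something like '$1&#160;$2' if you need a non-breaking space instead.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes: capture the digit ($1) and the unit ($2),
       and drop the single whitespace character between them. -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="replace(., '(\d)\s(km|kg|kn)', '$1$2')"/>
  </xsl:template>

</xsl:stylesheet>
```

The explicit priority is there because text() and node() have the same default priority, so without it the two templates would conflict for text nodes.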

Related

How to return multiple regexp matches where the result depends on a previous match?

I've been trying to match hazard codes held within a free text field. I've got a regexp that works where the codes have been entered in the format Hxxx where xxx is a three digit number. Easy!
However, sometimes the users have entered the first as Hxxx but subsequent values just as xxx.
So, for input data like
R12 34 456 / H123 H456 789 012
I want to match H123 H456 and 789 and 012, but not the 456 before the first H.
Edit: To clarify, there is no clear pattern to the data in the field. Mostly there are some H codes, sometimes with R codes preceding them, sometimes delimited as in the example above, and sometimes not. So the rule I envisage is that three-digit codes following one that begins with an H will be returned, but any codes not preceded by at least one H code will be ignored.
I've tried every combination of optional grouping and look-behind I can think of, and the best I've got is
((H|(?<=(H\d{3}\s)))\d{3}[A-Z]{0,2})
which matches all but the last group, but would cause problems if there were more than one space between groups.
I suspect look-behind may not work anyway in an xsl:analyze-string command.
Is there any clever regexp trick that will work for this, or do I have to go for some more brute-force approach?
Can you use XSLT 3.0, for instance with Saxon 9.6 or later PE and EE (as in oXygen or Stylus Studio), Altova XMLSpy 2017, or Exselt? In that case you could perhaps simply tokenize($data, '\s+') and use xsl:for-each-group with group-starting-with=".[matches(., 'H[0-9]{3}')]". The following stylesheet
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math" exclude-result-prefixes="xs math"
version="3.0">
<xsl:template match="data">
<xsl:copy>
<xsl:variable name="matches" as="xs:string*">
<xsl:for-each-group select="tokenize(., '\s+')"
group-starting-with=".[matches(., 'H[0-9]{3}')]">
<xsl:if test="matches(., 'H[0-9]{3}')">
<xsl:sequence select="current-group()"/>
</xsl:if>
</xsl:for-each-group>
</xsl:variable>
<xsl:value-of select="$matches"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
transforms <data>R12 34 456 / H123 H456 789 012</data> into <data>H123 H456 789 012</data> so it extracts the items you are looking for.
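If you are limited to XSLT 2.0, a rough alternative without lookbehind might be to cut off everything before the first H code and then filter the remaining tokens. This is only a sketch: the data element name is taken from the example above, the [A-Z]{0,2} suffix is borrowed from the regex in the question, and I have not tested it against real data.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:template match="data">
    <xsl:copy>
      <!-- Drop everything before the first H code;
           the variable is empty if there is no H code at all. -->
      <xsl:variable name="tail" as="xs:string?"
          select="if (matches(., 'H\d{3}'))
                  then replace(., '^.*?(H\d{3}.*)$', '$1', 's')
                  else ()"/>
      <!-- Keep only the tokens that look like a code, with or without the H. -->
      <xsl:value-of
          select="tokenize($tail, '\s+')[matches(., '^H?\d{3}[A-Z]{0,2}$')]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the tokens are split on '\s+', more than one space between codes should not matter here.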

Substring before throwing error

I've the below XML
<?xml version="1.0" encoding="UTF-8"?>
<body>
<p>Industrial drawing: Any creative composition</p>
<p>Industrial drawing: Any creative<fn>
<fnn>4</fnn>
<fnt>
<p>ftn1"</p>
</fnt>
</fn> composition
</p>
</body>
and the below XSL.
<xsl:template match="p">
<xsl:choose>
<xsl:when test="contains(substring-before(./text(),' '),'Article')">
<xsl:text>sect3</xsl:text>
<xsl:value-of select="./text()"/>
</xsl:when>
<xsl:when test="contains(substring-before(./b/text(),' '),'Section')">
<xsl:text> Sect 2</xsl:text>
<xsl:value-of select="./text()"/>
</xsl:when>
<xsl:when test="contains(substring-before(./b/text(),' '),'Chapter')">
<xsl:text> Sect 1</xsl:text>
<xsl:value-of select="./text()"/>
</xsl:when>
<xsl:otherwise>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
My XSL works fine for <p>Industrial drawing: Any creative composition</p>, but for the case below
<p>Industrial drawing: Any creative<fn>
<fnn>4</fnn>
<fnt>
<p>ftn1"</p>
</fnt>
</fn> composition
</p>
it throws the error below.
XSLT 2.0 Debugging Error: Error: file:///C:/Users/u0138039/Desktop/Proview/ASAK/DIFC/XSLT/tabel.xslt:38: Wrong occurrence to match required sequence type - Details: - XPTY0004: The supplied sequence ('2' item(s)) has the wrong occurrence to match the sequence type xs:string ('zero or one')
Please let me know how I can fix this and grab the required text.
Thanks
The second p element in your example XML has two child text nodes, one containing "Industrial drawing: Any creative" and the other containing a space, "composition", a newline and another six spaces. In XSLT 1.0 it is legal to apply a function that expects a string to an argument that is a set of more than one node; the behaviour is to take the value of the first node in document order and ignore all the others. But in 2.0 it is a type mismatch error to pass two nodes to a function that expects a single value for its parameter.
But in this case I doubt that you really need to use text() at all - if all you care about is seeing whether the string "Article" occurs anywhere within the first word under the p (including when this is nested inside another element) then you can simply use .:
<xsl:when test="contains(substring-before(.,' '),'Article')">
(or better still, use predicates to separate the different conditions into their own templates, with one template matching "Article" paragraphs, another matching "Section" paragraphs, etc.)
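As a rough sketch of the template-per-condition idea, assuming XSLT 2.0 and the same three tests as in the question: the explicit priority values are my assumption, chosen to reproduce the order of the xsl:when branches in case a paragraph satisfies more than one condition, and b[1] is used so that a paragraph with several b children does not raise the same type error.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- First word of the paragraph contains "Article". -->
  <xsl:template match="p[contains(substring-before(., ' '), 'Article')]"
      priority="3">
    <xsl:text>sect3</xsl:text>
    <xsl:value-of select="text()" separator=""/>
  </xsl:template>

  <!-- First word of the first b child contains "Section". -->
  <xsl:template match="p[contains(substring-before(b[1], ' '), 'Section')]"
      priority="2">
    <xsl:text> Sect 2</xsl:text>
    <xsl:value-of select="text()" separator=""/>
  </xsl:template>

  <!-- First word of the first b child contains "Chapter". -->
  <xsl:template match="p[contains(substring-before(b[1], ' '), 'Chapter')]"
      priority="1">
    <xsl:text> Sect 1</xsl:text>
    <xsl:value-of select="text()" separator=""/>
  </xsl:template>

  <!-- Any other paragraph: output nothing, like the empty xsl:otherwise. -->
  <xsl:template match="p"/>

</xsl:stylesheet>
```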
The p element in your example has several text nodes, so the expression ./text() returns a sequence of more than one item. You cannot pass such a sequence to a function that expects a single string; you must convert it to a single string first. Instead of:
test="contains(substring-before(./text(),' '),'Article')"
try:
test="contains(substring-before(string-join(text(), ''), ' '), 'Article')"

How do I output line breaks from xsl:value of

I have this statement
<xsl:value-of select="metadata/line1"/>
where line1 in the source XML is:
Microsoft Windows 7 is installed&lt;br/&gt;
The HTML output turns out to be:
Microsoft Windows 7 is installed<br/>
I want it to actually insert the break after the word installed instead of outputting the literal <br/>
It seems that what you want to do is "unescape" some XML content. If I'm right, disable-output-escaping should help. Try:
<xsl:value-of select="metadata/line1" disable-output-escaping="yes" />
E.g.: How to unescape XML characters with help of XSLT?
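Instead of xsl:value-of, use xsl:copy-of.
It will copy child elements such as <br/> to the output as well, provided they are real elements in the source rather than escaped text.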
See http://www.w3schools.com/xsl/el_copy-of.asp
You can even try <xsl:value-of select=" " disable-output-escaping="yes" />
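The two suggestions apply to different inputs, so here is a small XSLT 1.0 sketch showing both side by side. It assumes that metadata is a child of the document element and that the output is HTML; neither is stated in the question.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="/*">
    <p>
      <!-- Case 1: line1 holds escaped markup (&lt;br/&gt; as text).
           Write the text without escaping it again.
           This only has an effect when the processor serializes the result itself. -->
      <xsl:value-of select="metadata/line1" disable-output-escaping="yes"/>
    </p>
    <p>
      <!-- Case 2: line1 holds a real <br/> child element.
           Copy its children, elements included, to the result tree. -->
      <xsl:copy-of select="metadata/line1/node()"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

In practice you would keep only the branch that matches how your source XML stores the break.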

apostrophe in xsl:format-number

I parse an XML document with my XSLT and get the result as XML.
I need to format numbers with an apostrophe as the grouping delimiter for thousands, millions, etc.
E.g.: 1234567 = 1'234'567
Now the problem is: how do I get these apostrophes in there?
<xsl:value-of select="format-number(/path/to/number, '###'###'###'###')" />
This doesn't work because the apostrophe itself already delimits the start of the format string.
Is there a simple solution to that (maybe escaping the apostrophe, like in C#)?
The answer depends on whether you are using 1.0 or 2.0.
In 2.0, you can escape the string delimiter by doubling it (for example 'it''s dark'), and you can escape the attribute delimiter by using an XML entity such as &quot;. So you could write:
<xsl:value-of select="format-number(/path/to/number, '###''###''###''###')" />
In 1.0, you can escape the attribute delimiter by using an XML entity, but there is no way of escaping the string delimiter. So you could switch your delimiters and use
<xsl:value-of select='format-number(/path/to/number, "###&apos;###&apos;###&apos;###")' />
The other way - probably easier - is to put the string in a variable:
<xsl:variable name="picture">###'###'###'###</xsl:variable>
<xsl:value-of select="format-number(/path/to/number, $picture)" />
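After some research we came up with this solution, which declares the apostrophe as the grouping separator in a named xsl:decimal-format and passes that name as the third argument of format-number: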
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:decimal-format name='ch' grouping-separator="'" />
<xsl:template match="/">
<xsl:value-of select='format-number(/the/path/of/the/number, "###&apos;###&apos;###", "ch")'/>
...

How do I convert strings starting with numbers to numeric data in XSLT?

Given the following XML:
<table>
<col width="12pt"/>
<col width="24pt"/>
<col width="12pt"/>
<col width="48pt"/>
</table>
How can I convert the width attributes to numeric values that can be used in mathematical expressions? So far, I have used substring-before to do this. Here is an example template (XSLT 2.0 only) that shows how to sum the values:
<xsl:template match="table">
<xsl:text>Col sum: </xsl:text>
<xsl:value-of select="sum(
for $w
in col/@width
return number(substring-before($w, 'pt'))
)"/>
</xsl:template>
Now my questions:
Is there a more efficient way to do the conversion than substring-before?
What if I don't know the text after the numbers? Any way to do it without using regular expressions?
This is horrible, but depending on just how much you know about the potential set of non-numeric characters, you could strip them with translate():
translate("12jfksjkdfjskdfj", "abcdefghijklmnopqrstuvwxyz", "")
returns
"12"
which you can then pass to number() as currently.
(I said it was horrible. Note that translate() is case sensitive, too)
I found this answer from Dimitre Novatchev that provides a very clever XPath solution that doesn't use regex:
translate(., translate(.,'0123456789', ''), '')
It uses the nested translate to strip all the digits from the string, which yields all the other characters; those are then used as the characters for the outer translate function to strip out, leaving just the digits.
Applied to your template:
<xsl:template match="table">
<xsl:text>Col sum: </xsl:text>
<xsl:value-of select="sum(
for $w
in col/@width
return number(translate($w, translate($w,'0123456789', ''), ''))
)"/>
</xsl:template>
If you are using XSLT 2.0, is there a reason why you want to avoid using regex?
The simplest solution would probably be to use the replace function with a regex pattern that matches any non-digit character and replaces it with the empty string:
replace($w,'[^0-9]','')
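Put together as a complete XSLT 2.0 stylesheet, the regex version might look like the sketch below. The only change from the expression above is that the character class also keeps the decimal point ('[^0-9.]'), which is my assumption in case widths such as 12.5pt occur; drop the dot if you only expect integers.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="table">
    <xsl:text>Col sum: </xsl:text>
    <!-- Strip everything except digits and the decimal point,
         convert what is left to a number, then add the numbers up. -->
    <xsl:value-of select="sum(
        for $w in col/@width
        return number(replace($w, '[^0-9.]', ''))
      )"/>
  </xsl:template>

</xsl:stylesheet>
```

For the sample table this should print Col sum: 96. Note that a width with no digits at all turns into NaN, which makes the whole sum NaN.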