How XSLT handles regex \w? - regex

I have an input XML file named "718322_c341b0-TEST_NOC_20160423121052.XML", which is assigned to $SourceFile in my XSLT. I am trying to test whether $SourceFile contains the string "-TEST" using the following code:
<xsl:if test="matches($SourceFile, '^\w+-TEST.*')">
However, it did not match. So I updated the code to
<xsl:if test="matches($SourceFile, '^[A-Za-z0-9_]+-TEST.*')">
Then I got a match. I did more testing and the following code got a match, too.
<xsl:if test="matches($SourceFile, '^\w+_\w+-TEST.*')">
Here's what confuses me: I thought \w means [A-Za-z0-9_], correct? Why did \w not work in this case? It seems to have trouble including the underscore. Thanks!

See https://www.w3.org/TR/xmlschema-2/#charcter-classes: the class \w is defined as [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}], explained as 'all characters except the set of "punctuation", "separator" and "other" characters'. The underscore belongs to Unicode category Pc (connector punctuation), so \w excludes it.
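To see which characters fall outside that definition, here is a minimal Python sketch. It is not XSLT; it only approximates the XSD \w class by checking Unicode general categories with the standard unicodedata module. The helper name is_xsd_word_char is made up for illustration, and Python's Unicode tables may differ slightly from those of your XSLT processor.

```python
import unicodedata

def is_xsd_word_char(ch: str) -> bool:
    """Approximate XSD \\w: any character whose Unicode general category
    is not punctuation (P*), separator (Z*), or other (C*)."""
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")

filename = "718322_c341b0-TEST_NOC_20160423121052.XML"
prefix = filename.split("-TEST", 1)[0]  # the part that ^\w+ has to cover

# Characters in the prefix that the XSD-style \w would refuse to match
rejected = [ch for ch in prefix if not is_xsd_word_char(ch)]

print(unicodedata.category("_"))  # Pc
print(rejected)                   # ['_']
```

The only rejected character in the prefix is the underscore, which is consistent with ^\w+_\w+-TEST.* matching while ^\w+-TEST.* does not.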

Related

What escape character for fn:replace function

I have to change a bad character to a quotation mark, but I can't escape the quotation mark.
Doing this doesn't work:
<xsl:value-of select="fn:replace(prog:intitules/prog:intitule_fr,'¿', '\'')"/>
it produces
net.sf.saxon.trans.XPathException: Unmatched quote in expression
Same error with double or triple escapes '\' or '\\'.
My editor refuses this alternative syntax:
<xsl:value-of select='fn:replace(prog:intitules/prog:intitule_fr,"¿", "'")'/>
Any idea ?
Bernard
Try it this way:
<xsl:value-of select='replace(input, "¿", "&apos;")'/>
In XPath 2.0+, you can escape an apostrophe within an apostrophe-delimited string literal by doubling it. So try:
''''
You need to think very carefully about escapes when you're using regular expressions within XPath within XSLT. Why does the character need escaping?
1. If it has a special meaning in regular expressions (for example '('), then use a backslash.
2. If it isn't allowed because of XPath rules (like here), use XPath escaping (write 'O'Neil' as 'O''Neil' or "a="3"" as "a=""3""").
3. If it isn't allowed because of XML rules (e.g. "<"), use XML escaping (write < as &lt;).
The reason this doesn't work:
<xsl:value-of select='fn:replace(prog:intitules/prog:intitule_fr,"¿", "'")'/>
is that the XML parser is treating the apostrophe within the string literal as marking the end of the value of the select attribute. So here you have an XML issue, and under rule 3 you therefore need to use XML escaping (&apos;)
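The XML layer can be checked in isolation. This is a small Python sketch using the standard xml.etree.ElementTree parser rather than an XSLT processor. The xsl: prefix is dropped so the fragment parses without a namespace declaration, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Same attribute as in the answers, minus the xsl: prefix (an assumption
# made here so the fragment is well-formed on its own).
escaped = """<value-of select='replace(input, "¿", "&apos;")'/>"""
unescaped = """<value-of select='replace(input, "¿", "'")'/>"""

# The XML parser resolves &apos; before XPath ever sees the expression.
xpath_seen = ET.fromstring(escaped).get("select")
print(xpath_seen)  # replace(input, "¿", "'")

# A raw apostrophe ends the attribute value early, so the XML is not well-formed.
try:
    ET.fromstring(unescaped)
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)  # False
```

So the expression handed to XPath already contains a plain apostrophe inside a double-quoted string literal, which needs no further escaping.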
Assuming your files are UTF-8 encoded, you could try a workaround, using the Unicode apostrophe character (hexadecimal 2BC) instead of the quote (hexadecimal 27):
<xsl:value-of select="fn:replace(prog:intitules/prog:intitule_fr,'¿', 'ʼ')"/>
Edited: searching a little bit more, I discovered that switching ' and " and using the entity &apos; will get the same result, as michael.hor257k proposed meanwhile:
<xsl:value-of select='fn:replace(prog:intitules/prog:intitule_fr,"¿", "&apos;")'/>

Regex search in XSL, select string after match

I have a solution where the filename has a prefix showing the file size of a PDF. I need to pick up that value and put it into an XML file that holds a lot of other info collected with the XSLT.
However, I can't get this regex match to work.
Filenames have the structure shown in this example:
776524_P9466_Novilon_Broschyr_SE_Omslag.xml where the digits before the underscore are the file size.
I have a Regex search pattern of _(.*) and I can validate that it will match everything after the first section of the digits.
Here is my XSL that I'm having problems with:
<xsl:param name="find_size">
<xsl:text>(_.*)</xsl:text>
</xsl:param>
<xsl:variable name="filename_of_start"><xsl:value-of select="replace($filename_of_file, '$find_size', '')"/></xsl:variable>
<artwork_size><xsl:value-of select="$filename_of_start"/></artwork_size>
$filename_of_file has the string: 776524_P9466_Novilon_Broschyr_SE_Omslag.xml
I have also tried to match the digits before the underscore and replace with that match, but haven't got that to work either. Other replaces, where I remove other matches from the beginning of the string, work.
Thanks
How about using the substring-before() XPath function?
<xsl:variable name="file_size" select="substring-before($filename, '_')" />
Instead of replace($filename_of_file, '$find_size', '') you want replace($filename_of_file, $find_size, ''). Inside quotes, '$find_size' is a string literal, not a reference to the parameter.
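The difference between the quoted and unquoted form can be reproduced outside XSLT. The following is a rough Python analogy, assuming that Python's re.sub behaves like XPath's replace() for these simple patterns. The variable names mirror the question, and str.partition stands in for substring-before().

```python
import re

filename_of_file = "776524_P9466_Novilon_Broschyr_SE_Omslag.xml"
find_size = "(_.*)"

# Quoted: the pattern is the literal text "$find_size", i.e. an end-of-string
# anchor followed by "find_size", which can never match.
quoted = re.sub(r"$find_size", "", filename_of_file)

# Unquoted: the pattern is the value held by the variable.
by_variable = re.sub(find_size, "", filename_of_file)

# Counterpart of substring-before($filename, '_')
by_split = filename_of_file.partition("_")[0]

print(quoted)       # 776524_P9466_Novilon_Broschyr_SE_Omslag.xml
print(by_variable)  # 776524
print(by_split)     # 776524
```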

xslt 2.0 how replace $ by escaped dollar (for conversion to LaTeX)

I am new to XSLT. I googled extensively but couldn't figure out how to do the following:
I am transforming XML to LaTeX. Of course, LaTeX needs to escape characters such as $ and #. I tried the following in the replace function but it does not work. (They do work without the replace function.)
<xsl:template match="xyz:doc">
\subsubsection{<xsl:value-of select="replace( xyz:headline, '(\$)', '\$1' )"/>}
...
</xsl:template>
<xsl:template match="xyz:doc">
\subsubsection{<xsl:value-of select="replace( xyz:headline, '\$', '\$' )"/>}
...
</xsl:template>
Possible content to be escaped is:
"Locally defined field #931" or
"Locally defined subfield $b"
What am I doing wrong?
Many thanks for your answers!
If you want to replace a dollar symbol $ in the input with \$ in the output then use replace(xyz:headline, '\$', '\\\$').
If there are several characters that need the same escaping then replace(xyz:headline, '([$#])', '\\$1') should do.
Sample at http://xsltransform.net/bdxtqX/1
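The reason for the extra backslashes is that the replacement string has its own escaping rules in XPath 2.0: $N inserts a captured group, \$ is a literal dollar sign, and \\ is a literal backslash. Below is a rough Python emulation of those rules, not the real fn:replace. The function names are made up, only single-digit group references are handled, and Python's re engine stands in for the XPath regex dialect.

```python
import re

def expand_xpath_replacement(replacement: str, match: re.Match) -> str:
    """Expand an XPath-style replacement string for one match (simplified)."""
    out = []
    i = 0
    while i < len(replacement):
        ch = replacement[i]
        nxt = replacement[i + 1 : i + 2]  # next character, or "" at the end
        if ch == "\\":
            if nxt not in ("\\", "$"):
                raise ValueError("backslash must be followed by \\ or $")
            out.append(nxt)  # \\ -> \   and   \$ -> $
            i += 2
        elif ch == "$":
            if not nxt.isdigit():
                raise ValueError("$ must be followed by a digit")
            n = int(nxt)
            # A group that does not exist or did not participate yields "".
            out.append((match.group(n) or "") if n <= match.re.groups else "")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def xpath_like_replace(text: str, pattern: str, replacement: str) -> str:
    return re.sub(pattern, lambda m: expand_xpath_replacement(replacement, m), text)

headline = "Locally defined subfield $b"
print(xpath_like_replace(headline, r"\$", r"\$"))       # Locally defined subfield $b
print(xpath_like_replace(headline, r"(\$)", r"\$1"))    # Locally defined subfield $1b
print(xpath_like_replace(headline, r"\$", r"\\\$"))     # Locally defined subfield \$b
print(xpath_like_replace("field #931 $b", r"([$#])", r"\\$1"))  # field \#931 \$b
```

Under these rules the two attempts from the question produce a plain $ and a literal $1 respectively, while the two expressions from the answer insert the backslash.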

SLRE regex doesn't work properly

I have a problem with the SLRE library: I can't figure out how to stop it grabbing everything after my match. Let's say I have HTML output, and somewhere in the middle of the buffer there is a line I want to parse:
name="id" value="1a2b3c4d5e6f" />
Here is my regular expression
slre_compile(&test, "name=\"id\" value=\"(.*?)\" />")
I have read about greedy and non-greedy flags in other threads where people had a similar problem, but in my case adding ? to the expression doesn't change anything.
SLRE returns a match starting from 1a2b3c4d5e6f" /> and shows the rest of the HTML page, ending at the </html> tag, and I don't know why. It cuts off the beginning of the HTML source but leaves everything after my expression. I have also tried the following regex
slre_compile(&test, "^.*?name=\"id\" value=\"(.*?)\" />.*?$")
and some others, modified with greedy and non-greedy flags, which gave me the same results. Does anyone know why SLRE can't stop at " /> and continues capturing characters until the source string ends?
It seems that SLRE does not understand non-greedy quantifiers and instead parses .*? as if it were (?:.*)?. However, in this case \"[^\"]*\" should work...
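To show the negated character class in action, here is a short sketch in Python's re module rather than SLRE, so it only demonstrates the pattern and says nothing about SLRE's own behaviour. The html string is a made-up stand-in for the page buffer.

```python
import re

# Hypothetical buffer: the interesting attribute sits in the middle of the page.
html = '<form><input name="id" value="1a2b3c4d5e6f" /><p class="x">more</p></form></html>'

# [^"]* cannot run past the closing quote, so no lazy quantifier is needed.
pattern = r'name="id" value="([^"]*)" />'
m = re.search(pattern, html)

print(m.group(1))  # 1a2b3c4d5e6f
```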

How to find a word within text using XSLT 2.0 and REGEX (which doesn't have \b word boundary)?

I am attempting to scan a string of words and look for the presence of a particular word (case insensitive) in an XSLT 2.0 stylesheet using REGEX.
I have a list of words that I wish to iterate over and determine whether or not they exist within a given string.
I want to match on a word anywhere within the given text, but I do not want to match within a word (i.e. A search for foo should not match on "food" and a search for bar should not match on "rebar").
XSLT 2.0 REGEX does not have a word boundary (\b), so I need to replicate it as best I can.
You can use alternation to avoid repetition:
<xsl:if test="matches($prose, concat('(^|\W)', $word, '($|\W)'),'i')">
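The same alternation can be tried quickly outside XSLT. This is a Python sketch of the idea, with two caveats: Python's \W treats the underscore as a word character, unlike the XSD dialect, and re.escape is a Python-side precaution for search words containing regex metacharacters. The function name contains_word is only illustrative.

```python
import re

def contains_word(prose: str, word: str) -> bool:
    """True if word occurs in prose delimited by string edges or non-word characters."""
    pattern = r"(^|\W)" + re.escape(word) + r"($|\W)"
    return re.search(pattern, prose, re.IGNORECASE) is not None

print(contains_word("all Foo is bar", "foo"))   # True
print(contains_word("I like food", "foo"))      # False
print(contains_word("steel rebar", "bar"))      # False
```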
If your XSLT 2.0 processor is Saxon 9, then you can use Java regular expression syntax (including \b) with the functions matches, tokenize and replace by starting the flags argument with an exclamation mark:
<xsl:value-of select="matches('all foo is bar', '\bfoo\b', '!i')"/>
Michael Kay mentioned that option recently on the XSL mailing list.