How can I combine XSLT and regex to find specific strings

I am very new to XSL and still learning regex, so I might be going about this incorrectly. I would like a way to find strings in XML files, where sometimes those strings must appear in specific elements, or must not appear in specific elements.
For example, (\w+)\ (\,|\.|\:|\;|\?) finds orphan punctuation, but I don't want to search inside <screen> or similar elements, which typically contain commands, output, and so on, and where orphan punctuation is commonplace.
By way of example:
This is an error , because there is a space before the comma and before the period .
This is not an error, because <command>cd ../</command> is a valid command.
Thanks very much.

To use regular expressions with XSLT you need XSLT 2.0 or later, and it's then very simple:
<!-- Match errors -->
<xsl:template match="text()[matches(., '\s[.,:;?!]')]"
mode="look-for-bad-punctuation" priority="5">
<bad-punctuation-found/>
</xsl:template>
<!-- Match unchecked elements -->
<xsl:template match="screen/text() | command/text()"
mode="look-for-bad-punctuation" priority="6">
<xsl:copy-of select="."/>
</xsl:template>
<!-- Match elements with no error -->
<xsl:template match="text()"
mode="look-for-bad-punctuation" priority="4">
<xsl:copy-of select="."/>
</xsl:template>
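The three templates above can be assembled into a complete stylesheet. The following is only a minimal sketch, assuming an XSLT 2.0 processor: the <report> wrapper element and the choice to emit the offending text inside <bad-punctuation-found> are illustrative assumptions, not part of the answer, and the screen//text() pattern is widened so that text nested deeper inside those elements is skipped as well.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Entry point: wrap the results in a hypothetical <report> element
       and process every text node in the checking mode. -->
  <xsl:template match="/">
    <report>
      <xsl:apply-templates select="//text()" mode="look-for-bad-punctuation"/>
    </report>
  </xsl:template>

  <!-- Text anywhere inside screen or command is never checked (highest priority). -->
  <xsl:template match="screen//text() | command//text()"
                mode="look-for-bad-punctuation" priority="6"/>

  <!-- Text containing whitespace followed by punctuation is reported. -->
  <xsl:template match="text()[matches(., '\s[.,:;?!]')]"
                mode="look-for-bad-punctuation" priority="5">
    <bad-punctuation-found>
      <xsl:value-of select="normalize-space(.)"/>
    </bad-punctuation-found>
  </xsl:template>

  <!-- All remaining text is fine and produces no output in the report. -->
  <xsl:template match="text()"
                mode="look-for-bad-punctuation" priority="4"/>

</xsl:stylesheet>
```

Unlike the templates in the answer, which copy unflagged text through, this variant outputs only the offending text nodes, which may be more convenient when the goal is a list of errors rather than a transformed document.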

Related

How to match uri-collection results using templates

I have variable with collection of files URIs.
<xsl:variable name="swiftFilesPath" select="concat($inputPath, '?select=*.swift;recurse=yes;on-error=warning')"/>
<xsl:variable name="swiftFiles" select="uri-collection($swiftFilesPath)"/>
I want to use apply-templates to process through all URIs.
For now I'm using for-each for getting files and then process through each line.
<xsl:for-each select="$swiftFiles">
[...]
<xsl:variable name="filePath" select="."/>
<xsl:variable name="fileContent" select="unparsed-text($filePath, $encoding)"/>
<xsl:for-each select="tokenize($fileContent, '\n')">
[...]
</xsl:for-each>
</xsl:for-each>
I am thinking about changing it to something like this:
<xsl:apply-templates select="$swiftFiles" mode="swiftFiles"/>
[...]
<xsl:template match="*" mode="swiftFiles">
[...]
</xsl:template>
Would this be a better approach to processing the files? That is, is apply-templates better than for-each here?
Is there a way to avoid "*" in template match? Maybe something like "*[. castable as xs:anyURI]"?
Firstly, I don't think there's anything to gain from using apply-templates unless there's some kind of dynamic despatch going on. For example, if you had both .txt URIs and .xml URIs then you could do
<xsl:apply-templates select="uri-collection(....)" mode="dereference"/>
<xsl:template match=".[ends-with(., '.txt')]" mode="dereference">
--- process unparsed text file ----
</xsl:template>
<xsl:template match=".[ends-with(., '.xml')]" mode="dereference">
--- process XML file ----
</xsl:template>
<xsl:template match="." mode="dereference"/>
But if they are all processed the same way, then xsl:for-each does the job perfectly well.
I've answered your second question by using "." as the pattern that matches everything (atomic values included); note that patterns matching atomic values require XSLT 3.0. The pattern "*" will only match element nodes.
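For the specific case in the question, a pattern can also test the item type instead of matching everything. This is a sketch only, assuming an XSLT 3.0 processor that supports the select/recurse/on-error query parameters used in the question; the <files> and <file> output elements are hypothetical and exist only to make the example complete.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Directory to scan, supplied by the caller. -->
  <xsl:param name="inputPath" as="xs:string" required="yes"/>

  <!-- Entry point: push every URI in the collection through the swiftFiles mode. -->
  <xsl:template name="xsl:initial-template">
    <files>
      <xsl:apply-templates mode="swiftFiles"
          select="uri-collection(concat($inputPath, '?select=*.swift;recurse=yes;on-error=warning'))"/>
    </files>
  </xsl:template>

  <!-- Matches only atomic values of type xs:anyURI, not element nodes. -->
  <xsl:template match=".[. instance of xs:anyURI]" mode="swiftFiles">
    <!-- unparsed-text-lines() splits the file into lines without a manual tokenize(). -->
    <file uri="{.}" lines="{count(unparsed-text-lines(.))}"/>
  </xsl:template>

</xsl:stylesheet>
```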

not(@attribute) test not working in JDom XSL transform?

I have a piece of an XSLT stylesheet that works as expected using xsltproc but produces a different output in my actual application, where the transform is applied via org.jdom.transform.XSLTransformer (jdom 1.0), I believe using Xalan.
Stylesheet snippet (this is part of a larger template that starts like this: <xsl:template match="/dspace:dim[@dspaceType='ITEM']">):
<xsl:if test="//dspace:field[@mdschema='dc' and @element='rights']">
<rightsList>
<xsl:if test="//dspace:field[@mdschema='dc' and @element='rights' and not(@qualifier) and @language='*']">
<rights>
<xsl:if test="//dspace:field[@mdschema='dc' and @element='rights' and @qualifier='uri' and @language='*']">
<xsl:attribute name="rightsUri">
<xsl:value-of select="//dspace:field[@mdschema='dc' and @element='rights' and @qualifier='uri' and @language='*']"/>
</xsl:attribute>
</xsl:if>
<xsl:value-of select="//dspace:field[@mdschema='dc' and @element='rights' and not(@qualifier) and @language='*']" />
</rights>
</xsl:if>
<xsl:apply-templates select="//dspace:field[@mdschema='dc' and @element='rights' and not(@language='*')]" />
</rightsList>
</xsl:if>
and
<xsl:template match="//dspace:field[@mdschema='dc' and @element='rights' and not(@language='*')]">
<rights><xsl:value-of select="." /></rights>
</xsl:template>
XML snippet:
<dim:dim dspaceType="ITEM" xmlns:dim="http://www.dspace.org/xmlns/dspace/dim">
<dim:field element="rights" language="en_NZ" mdschema="dc">Actual text redacted</dim:field>
<dim:field element="rights" language="*" mdschema="dc">Attribution 3.0 New Zealand</dim:field>
<dim:field element="rights" qualifier="uri" language="*" mdschema="dc">http://creativecommons.org/licenses/by/3.0/nz/</dim:field>
</dim:dim>
With xsltproc, this produces
<rightsList>
<rights rightsUri="http://creativecommons.org/licenses/by/3.0/nz/">Attribution 3.0 New Zealand</rights>
<rights>Actual text redacted</rights>
</rightsList>
In my application, this produces
<rightsList>
<rights>Actual text redacted</rights>
<rights>Attribution 3.0 New Zealand</rights>
<rights>http://creativecommons.org/licenses/by/3.0/nz/</rights>
</rightsList>
So to me it looks like the not(@qualifier) bit doesn't work using jdom.
I'd appreciate any insight into what's going on here and how I might change the stylesheet to get the same result in my application that I currently get via xsltproc.
Edited to add: just in case it makes any difference, the stylesheet starts out as
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:dspace="http://www.dspace.org/xmlns/dspace/dim"
xmlns:exslt="http://exslt.org/common"
xmlns="http://datacite.org/schema/kernel-3"
extension-element-prefixes="exslt"
exclude-result-prefixes="exslt"
version="1.0">
and also includes this template:
<!-- Don't copy everything by default! -->
<xsl:template match="@* | text()" />
See my answer below: the XML structure is actually different from what I thought it was, so the problem wasn't in the XSL after all.
Apart from solving your original problem, let's have a quick look at how to reorganize your code.
You use a lot of //foo expressions. Starting an expression with //foo means "search the whole document, at any level, for the element with the name foo". Apart from this being a potentially expensive operation, this often has unwanted side effects and makes your code hard to read, because it requires you to specify each element uniquely, leading to a lot of duplicated code.
You also use a lot of xsl:if, but in XSLT, it is hardly ever necessary to use if-statements (an exception in XSLT 1.0 and 2.0 being when you deal with something other than nodes). In almost all cases, you can replace an xsl:if with a simple xsl:apply-templates.
That said, let's have a look how we can rewrite your code to get the same effect and have less chance for error:
<xsl:if test="//dspace:field[@mdschema='dc' and @element='rights']">
<rightsList>
.....
Is similar to having a matching template as follows (assuming you have a throw-away template for uninteresting nodes):
<xsl:template match="dspace:dim[dspace:field[@mdschema='dc' and @element='rights']]">
<rightsList>
This says: if you encounter a dim element with any field element that has those properties set, then output <rightsList>.
Then you have:
<xsl:if test="//dspace:field[@mdschema='dc' and @element='rights' and not(@qualifier) and @language='*']">
<rights>
Which is precisely equivalent to the following apply-template expression (assuming a matching template with it):
<xsl:apply-templates select="dspace:field[@mdschema='dc' and @element='rights' and not(@qualifier) and @language='*']" />
Here we find that a little bit below that, we have an almost equivalent expression, this time with not(@language='*'). So let's see if we can get rid of those duplicate expressions altogether.
First, let's go back a bit and have a look at what you were doing:
If anywhere any "dc" and "rights", then create a <rightsList>
If anywhere any of these do not have a qualifier but have a language "*", create <rights>
Inside this, create an attribute rightsUri if anywhere any qualifier has value "uri" and language "*", and set its value to the first such you find
After this <rights> element (there can be at most one of them in your current structure), create a list of <rights> for each field element with a language other than "*"
If this is correct, then this can be rewritten as follows:
<xsl:template match="dspace:dim[dspace:field[@mdschema='dc' and @element='rights']]">
<xsl:variable name="adjusted">
<xsl:copy-of select="dspace:field[@mdschema='dc' and @element='rights']"/>
</xsl:variable>
<rightsList>
<xsl:apply-templates select="exsl:node-set($adjusted)/*[not(@qualifier) and @language='*'][1]" mode="noquali"/>
<xsl:apply-templates select="exsl:node-set($adjusted)/*[not(@language='*')]" />
</rightsList>
</xsl:template>
<xsl:template match="dspace:field" mode="noquali">
<rights>
<xsl:apply-templates select="/dspace:field[@qualifier='uri' and @language='*'][1]" mode="uri"/>
<xsl:value-of select="."/>
</rights>
</xsl:template>
<xsl:template match="dspace:field" mode="uri">
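<!-- xsl:attribute has no select attribute in XSLT 1.0, so use xsl:value-of -->
<xsl:attribute name="rightsUri"><xsl:value-of select="."/></xsl:attribute>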
</xsl:template>
<!-- matching anything else -->
<xsl:template match="dspace:field">
<rights><xsl:value-of select="." /></rights>
</xsl:template>
The exsl:node-set function is supported by just about every XSLT 1.0 processor, just add the namespace xmlns:exsl="http://exslt.org/common" to your xsl:stylesheet declaration.
Note that I added [1] to several of the select expressions. Your code does not do this, but it has the same effect, because xsl:value-of in XSLT 1.0 outputs only the first selected node. With apply-templates every match is processed, so you have to specify that you are only interested in the first one.
I think your code can be further simplified, but I wanted to make sure that the logic remains exactly the same. As you can see, the end result is without any //. However, you do see one /, which is now pointing to the root of the node-set, which conveniently only has the nodes you are interested in: the ones with schema "dc" and "rights" element attributes, so we do not have to repeat that expression over and over again.
You may try this rewrite and see if it helps with your current bug; otherwise I'll gladly help you further.
Edit
After your edit, your original context item will have been dspace:dim already. If you don't mind always outputting <rightsList> (even if it ends up empty), you can simply replace my first template match pattern above with your existing dspace:dim pattern.
Duh. Forest/trees indeed. Even though the language attribute is called "language" pretty much everywhere else in the application (see also, the XML snippet I gave), it is actually called "lang" in the XML that my stylesheet operates on - I finally gave in and used this answer to be sure what the XML structure is. Surprise!
Anyway, I followed the advice Abel gave in his answer in part and simplified the templates for this particular case quite a bit. I now just have
<xsl:if test="dspace:field[@mdschema='dc' and @element='rights']">
<rightsList>
<xsl:apply-templates select="dspace:field[@mdschema='dc' and @element='rights']"/>
</rightsList>
</xsl:if>
in the big template, plus a couple of custom ones:
<xsl:template match="dspace:field[@mdschema='dc' and @element='rights']">
<xsl:choose>
<xsl:when test="@qualifier='uri'"/>
<xsl:otherwise>
<rights>
<xsl:if test="@lang='*'">
<xsl:apply-templates select="//dspace:field[@mdschema='dc' and @element='rights' and @qualifier='uri' and @lang='*'][1]" mode="rightsURI"/>
</xsl:if>
<xsl:value-of select="."/>
</rights>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="dspace:field[@mdschema='dc' and @element='rights' and @qualifier='uri' and @lang='*']" mode="rightsURI">
<xsl:attribute name="rightsURI"><xsl:value-of select="."/></xsl:attribute>
</xsl:template>
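Since the root cause here was an attribute named differently from what was expected, it can help to dump the names that the transform actually sees. The following is a minimal diagnostic sketch, assuming any XSLT 1.0 processor; it is not the stylesheet the poster used, and the plain-text output layout is an arbitrary choice.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- For every element, print its name followed by each attribute name and value. -->
  <xsl:template match="*">
    <xsl:value-of select="name()"/>
    <xsl:for-each select="@*">
      <xsl:text> @</xsl:text>
      <xsl:value-of select="name()"/>
      <xsl:text>=</xsl:text>
      <xsl:value-of select="."/>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
    <!-- Recurse into child elements only; text content is not printed. -->
    <xsl:apply-templates select="*"/>
  </xsl:template>

</xsl:stylesheet>
```

Running this through the same code path as the real transform (rather than through xsltproc on a sample file) shows the attribute names as the application supplies them.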

How can I replace text with angle bracket without parsing the replace value?

I have this:
replace("Both cruciate ligaments are well visualized and are intact.",
".",
".<br>")
But I do not want to output the escaped angle brackets but the actual brackets. When I run the code I get:
Both cruciate ligaments are well visualized and are intact.&lt;br&gt;
I want:
Both cruciate ligaments are well visualized and are intact.<br>
How can I achieve that? I cannot use the angle bracket directly as replace value since I get an error.
EDIT
I have a stylesheet that takes in a text file that is injected into a HTML file (coming from the stylesheet). I take an XML (Clinical document) and a text file and merge them together with the stylesheet. So for example I have:
RADIOLOGY REPORT
NAME: JOHN, DOE
DoB: 1982-02-25
Injected text goes here
The text has to wrap on carriage return and has to wrap at a word level. I did manage to do the latter, but I did not find a way to do the line breaks. I thought of finding 'LF' in the file and replacing it with <BR> so that once the page is rendered I get to see the line breaks.
You need to use xsl:analyze-string if you want to output nodes and not simply strings. Here is an example:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html"/>
<xsl:template match="text">
<xsl:analyze-string select="." regex="\.">
<xsl:matching-substring>
<xsl:value-of select="."/><br/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
With the input being
<text>Both cruciate ligaments are well visualized and are intact.</text>
the transformation result is
Both cruciate ligaments are well visualized and are intact.<br>
Martin Honnen's answer is a perfectly good way to do this.
Using a simple template to find the text in question is another way:
<xsl:variable name="magic-string"
select='"Both cruciate ligaments are well visualized and are intact."'/>
...
<xsl:template match="text()
[contains(.,$magic-string)]">
<xsl:value-of select="substring-before(.,$magic-string)"/>
<xsl:value-of select="$magic-string"/>
<br/>
<xsl:value-of select="substring-after(.,$magic-string)"/>
</xsl:template>
In either case, use the HTML output method to serialize the empty br element as <br> instead of as <br/>.
Note: I'm assuming here that you want a br after this particular sentence, not that you want one after each occurrence of full stop, which is how Martin Honnen appears to have interpreted the question.
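The edit to the question mentions turning line feeds into br elements, which neither answer shows directly. One way to sketch that, assuming XSLT 2.0 and the same <text> wrapper element used in the example input above, is to split the string on line endings and emit a br between the pieces:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="text">
    <!-- Split on LF or CRLF; each token is one line of the injected text. -->
    <xsl:for-each select="tokenize(., '\r?\n')">
      <xsl:value-of select="."/>
      <!-- Emit a br element between lines, but not after the last one. -->
      <xsl:if test="position() != last()">
        <br/>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because br is created as an element rather than as text, the serializer writes real angle brackets, and the HTML output method writes it as <br>.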

XSLT do not match certain attributes

Is it possible to match attributes that do not belong to a subset of attributes? For example, I would like to match everything but @attr1 and @attr2. Is there a way to write a template match statement similar to the following, or am I going about this the wrong way?
<xsl:template match="NOT(@attr1) and NOT(@attr2)">
Thanks
The easiest way would be to use two templates:
<xsl:template match="@attr1|@attr2"/>
<xsl:template match="@*">
....
</xsl:template>
The first template will catch the references to those you want to ignore, and simply eat them. The second will match the remaining attributes.
The original inquiry is possible. Use the following:
<xsl:template match="@*[local-name()!='attr1' and local-name()!='attr2']">
....
</xsl:template>
This is especially useful if you want to change an attribute, or add it if it is missing, within a single copy operation. The other answer does not work in such a situation. For example:
...
<xsl:copy>
<xsl:attribute name="attr1">
<xsl:value-of select="'foo'"/>
</xsl:attribute>
<xsl:apply-templates select="@*[local-name()!='attr1']|node()"/>
</xsl:copy>
...
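The fragment above needs a surrounding stylesheet to run. A minimal sketch, assuming XSLT 1.0 and a hypothetical element named item as the one whose attr1 should be set, could look like this:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical element "item": set attr1 to 'foo' whether or not it existed,
       then copy every other attribute and all child nodes. -->
  <xsl:template match="item">
    <xsl:copy>
      <xsl:attribute name="attr1">foo</xsl:attribute>
      <xsl:apply-templates select="@*[local-name()!='attr1']|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The predicate matters here: if the original attr1 were also copied after the new one was written, the later attribute would replace the earlier one and the 'foo' value would be lost.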

XSLT Selection of Nodes Based on Substring of Element Name

How can I, with XSLT, select nodes based on a substring of the nodes' element name?
For example, consider the XML:
<foo_bar>Keep this.
<foo_who>Keep this, too.
<fu_bar>Don't want this.</fu_bar>
</foo_who>
</foo_bar>
From which I want to output:
<foo_bar>Keep this.
<foo_who>Keep this, too.
</foo_who>
</foo_bar>
Here I want to select for processing those nodes whose names match a regex like "foo.*".
I think I need an XSLT template match attribute expression, or an apply-templates select attribute expression, that applies the regex to the element's name. But maybe this can't be done without some construct like an statement?
Any help would be appreciated.
Here is some XSL that finds elements that start with "foo" to get you started. I don't think regex functionality was added until XSLT 2.0 based on Regular Expression Matching in XSLT 2.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="*">
<xsl:variable name="name" select="local-name()"/>
<xsl:if test="starts-with($name, 'foo')">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
It gives this output, which seems to have an extra newline.
<foo_bar>Keep this.
<foo_who>Keep this, too.
</foo_who>
</foo_bar>
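With an XSLT 2.0 processor, the regex from the question can be applied to the element name directly in the match pattern. This is a sketch under that assumption, not part of the answer above; like the XSLT 1.0 version, it drops a non-matching element together with everything inside it.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Keep elements whose local name matches the regex, and process their content. -->
  <xsl:template match="*[matches(local-name(), '^foo.*')]">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Every other element is dropped along with its descendants. -->
  <xsl:template match="*"/>

</xsl:stylesheet>
```

The pattern with a predicate has a higher default priority than the bare "*", so no explicit priority attributes are needed.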