Regex: how to replace all instances except some

I have several hundred XML files which I need to make a slight change to. I'm aware that I really should be using XSLT to make batch changes to XML structure, but I thought some quick and dirty regex would do what I need much faster than working out the XSLT. At least I thought that before spending hours trying to get the regex right!
Take the example below: I have various lists (<seqlist>) which contain an <item> element for each item in the list. Each <item> element contains a <para> element with an id attribute whose value varies. I want to remove those <para> tags completely so that the <item> contains the actual text.
So from: <seqlist><item><para id="1.1">Some text here.</para></item></seqlist>
To: <seqlist><item>Some text here.</item></seqlist>
This is fairly straightforward in itself; I can simply do:
Regex: <item><para id="([^\"]*)">
Replace: <item>
Then remove the redundant closing tags with a simple find and replace:
Find: </para></item>
Replace: </item>
However, as the example below shows, some <item> elements in the list contain another <seqlist> nested within them, which contains further nested <item> and <para> tags. This means the above find and replace for the closing </para> tag will also replace the closing </para> on the very last line of the example below.
Basically, what I need to say is: find </para></item> and replace it with </item> UNLESS there is an opening <para> element to the left of it.
The very last line of the example below illustrates it better. If I do the above find and replace, the last </para> will be removed and the file will not parse.
Any ideas how to achieve this, please?
<seqlist>
<item><para id="p7.1"><emphasis>JRK Type 1</emphasis>: (NSP XX-XX-XXX-XXXX)
outputs:
<seqlist>
<item><para id="p7.1.1">12 V or 15 V,0-5A</para></item>
<item><para id="p7.1.2">12 V or 15 V,0-5A</para></item>
</seqlist></para>
<para>Both at 120 W maximum output power.</para><para>The outputs are isolated, permitting parallel or serial connection to provide power as required.</para></item>
<item><para id="p7.2"><emphasis>JRK Type 2:</emphasis> (NSN 6130-99-788-6945) outputs:</para>
<seqlist>
<item><para id="p7.2.1">5 V, 0 - 30 A</para></item>
<item><para id="p7.2.2">12 V, 0 - 0.5 A</para></item>
</seqlist><para>Both at 120 W maximum output power.</para>
<para>The 12 V outputs are measured with respect to a common 0 V line but these are isolated from the 5 V output.</para></item>
</seqlist>

Here is the trivial XSLT way:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="seqlist/item/para">
<xsl:apply-templates/>
</xsl:template>
</xsl:transform>
Online at http://xsltransform.net/3NSSEw6.
If only those para elements with an id attribute are to be removed, then use
<xsl:template match="seqlist/item/para[@id]">
<xsl:apply-templates/>
</xsl:template>
for that template instead: http://xsltransform.net/3NSSEw6/1.
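For anyone who would rather script the change than run an XSLT processor, the same unwrapping can be sketched with Python's standard xml.etree.ElementTree. This is only a sketch under assumptions: the function names (strip_item_paras, unwrap, _append_text) and the only_with_id flag are made up for illustration, and the code has only been reasoned through against the small samples in this thread.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Append text immediately before the child at position `index` of `parent`."""
    if not text:
        return
    if index == 0:
        # Nothing precedes that position, so the text belongs to the parent itself.
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def unwrap(parent, child):
    """Replace `child` inside `parent` with the child's own text, elements and tail."""
    index = list(parent).index(child)
    _append_text(parent, index, child.text)
    grandchildren = list(child)
    parent.remove(child)
    for offset, node in enumerate(grandchildren):
        parent.insert(index + offset, node)
    # The text that followed the removed element goes after the moved nodes.
    _append_text(parent, index + len(grandchildren), child.tail)


def strip_item_paras(root, only_with_id=False):
    """Mimic match="seqlist/item/para" (or para[@id] when only_with_id is True)."""
    # Collect the lists first so the tree is not mutated while it is being walked.
    for seqlist in list(root.iter("seqlist")):
        for item in seqlist.findall("item"):
            for para in item.findall("para"):
                if only_with_id and para.get("id") is None:
                    continue
                unwrap(item, para)
    return root


root = ET.fromstring(
    '<seqlist><item><para id="1.1">Some text here.</para></item></seqlist>'
)
result = ET.tostring(strip_item_paras(root), encoding="unicode")
```

Because the tree is parsed rather than pattern-matched, nested <seqlist> elements need no special handling: each <para> is removed together with its own closing tag.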

Related

How do I flatten text nodes and nested nodes?

I’m trying to flatten an element’s text nodes and nested inlined elements
<e>something <inline>rather</inline> else</e>
into
<text>something </text>
<text-inline>rather</text-inline>
<text> else</text>
Using e/text() would return both text nodes but how do I flatten all nodes in order for arbitrarily inlined elements (even nested)?
I am not sure "flatten" is the right term for this. It seems all you want to do is change some text nodes into elements containing the same text. This can be done by a template matching these text nodes:
<xsl:template match="e/text()">
<text>
<xsl:copy/>
</text>
</xsl:template>
Demo: https://xsltfiddle.liberty-development.net/ncdD7n4
Of course, if you also want to rename inline to text-inline, you will need another template for that:
<xsl:template match="inline">
<text-inline>
<xsl:apply-templates />
</text-inline>
</xsl:template>

How to return multiple regexp matches where the result depends on a previous match?

I've been trying to match hazard codes held within a free text field. I've got a regexp that works where the codes have been entered in the format Hxxx where xxx is a three digit number. Easy!
However, sometimes the users have entered the first as Hxxx but subsequent values just as xxx.
So, for input data like
R12 34 456 / H123 H456 789 012
I want to match H123 H456 and 789 and 012, but not the 456 before the first H.
Edit: To clarify, there is no clear pattern to the data in the field. Mostly there are some H codes, sometimes with R codes preceding them, sometimes delimited as in the example above, and sometimes not. Thus the rule I am envisaging is that three-digit codes following one beginning with an H will be returned, but any codes not preceded by at least one H code will be ignored.
I've tried every combination of optional grouping and look-behind I can think of, and the best I've got is
((H|(?<=(H\d{3}\s)))\d{3}[A-Z]{0,2})
which matches all but the last group, but would cause problems if there were more than one space between groups.
I suspect look-behind may not work anyway in an xsl:analyze-string command.
Is there any clever regexp trick that will work for this, or do I have to go for some more brute-force approach?
Can you use XSLT 3.0 with Saxon 9.6 or later PE or EE (for instance in oXygen or Stylus Studio), Altova XMLSpy 2017, or Exselt? In that case you could perhaps simply use tokenize($data, '\s+') and xsl:for-each-group with group-starting-with=".[matches(., 'H[0-9]{3}')]". The following stylesheet
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math" exclude-result-prefixes="xs math"
version="3.0">
<xsl:template match="data">
<xsl:copy>
<xsl:variable name="matches" as="xs:string*">
<xsl:for-each-group select="tokenize(., '\s+')"
group-starting-with=".[matches(., 'H[0-9]{3}')]">
<xsl:if test="matches(., 'H[0-9]{3}')">
<xsl:sequence select="current-group()"/>
</xsl:if>
</xsl:for-each-group>
</xsl:variable>
<xsl:value-of select="$matches"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
transforms <data>R12 34 456 / H123 H456 789 012</data> into <data>H123 H456 789 012</data> so it extracts the items you are looking for.
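The grouping idea is not specific to XSLT. As a rough illustration, here is the same rule in Python: split on whitespace, then keep every token from the first H code onwards. The function name hazard_codes is invented for this sketch, and like the stylesheet it keeps every later token, so a stray delimiter such as "/" after the first H code would be kept too.

```python
import re

# Same unanchored pattern as in the stylesheet's matches() call.
H_CODE = re.compile(r"H[0-9]{3}")


def hazard_codes(field):
    """Return the whitespace-separated tokens from the first H code onwards."""
    kept = []
    seen_h_code = False
    for token in field.split():
        if H_CODE.search(token):
            seen_h_code = True
        if seen_h_code:
            kept.append(token)
    return kept


codes = hazard_codes("R12 34 456 / H123 H456 789 012")
# codes == ['H123', 'H456', '789', '012']
```

Splitting on arbitrary whitespace also sidesteps the concern about more than one space between groups.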

Apply-templates within analyze-string

I've the below XML.
<?xml version="1.0" encoding="UTF-8"?>
<para align="center">
<content-style font-style="bold">A.1 This is the first text</content-style> (This is second text)
</para>
Below are my two questions.
1. Here I've declared a regex to match the content-style, but when I run it the second case is caught: it should produce div class="para", but in the output I get <div class="para align-center">. Please let me know where I'm going wrong.
2. Is there a way I can use apply-templates within the match? When I tried, it threw an error. I want something like the below.
if (para)
xsl:apply-templates select child::node()[not(self::text)]
else
xsl:apply-templates
Working Example
Thanks
If you want to use apply-templates inside analyze-string, then you need to store the context node in a variable outside of analyze-string, <xsl:variable name="context-node" select="."/>. Then you can use, for instance, <xsl:apply-templates select="$context-node/node()"/> to process the child nodes.
Whether you need that approach I am not sure; I wonder whether you cannot simply use the matches function in a pattern, e.g. <xsl:template match="para[content-style[matches(., '(\w+)\.(\w+)')]]">...</xsl:template>.

xslt find and replace regex-lookarounds

I have a problem with XSLT replace().
XML:
<root>
<title>I am title</title>
<body>
the new formula is:<br/>
the speed test 234 km/h<br>
the weight is 49 kg<br/>
in the 1492 Lorenzo de Medici die.
etc.
<dida>the mass is 56 kn</dida>
</body>
</root>
I need to replace every space between a number and its unit of measurement.
In PHP I found this regex:
((?<=\d)\s(?=km|kg|kn))
In XSLT I have:
<xsl:template match="//*/text()">
<xsl:value-of select="replace(., '\(\(?\<=\\d\)\\s\(?=km\|kg\|kn\)\)', ' ')"></xsl:variable>
</xsl:template>
The problem is the < character!
The common notation for '<' inside a literal string is &lt;
That, however, didn't fix it for my XSLT processor (Kernow, using Saxon 9.1.0.3). As it turned out, it doesn't need all those escapes for parentheses and vertical bars. In addition, the lookarounds didn't work. I was able to solve this using
<xsl:value-of select="replace(., '(\d)\s(km|kg|kn)', '$1!$2')"></xsl:value-of>
(replacing with a '!' for clarity).
There are a few other basic errors in your example which I had to fix first: <br> was not correctly closed, and you mustn't terminate <xsl:value-of .. with </xsl:variable>.
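To see that the capture-group rewrite is equivalent to the original lookaround pattern, the two can be compared in an engine that supports both. Below is a small check in Python's re module, used purely for illustration (the thread itself is about XSLT); the sample lines come from the XML above, and '!' is again used as the replacement for visibility.

```python
import re

lines = [
    "the speed test 234 km/h",
    "the weight is 49 kg",
    "in the 1492 Lorenzo de Medici die.",
    "the mass is 56 kn",
]

# Lookaround version, as in the PHP pattern: only the whitespace is matched.
with_lookarounds = [re.sub(r"(?<=\d)\s(?=km|kg|kn)", "!", s) for s in lines]

# Capture-group version, as in the answer: digit and unit are matched, then put back.
with_groups = [re.sub(r"(\d)\s(km|kg|kn)", r"\1!\2", s) for s in lines]
```

Both lists come out the same, with "1492 Lorenzo" left untouched because no unit follows the space.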

Trying to match more than one class in XSLT

I'm very new to XSLT and am trying to format some text for PDFs, and I need to match and hide a few elements.
I am currently using:
<xsl:template match="*[@outputclass='LC ACaseName']">
to match:
<p outputclass="LC ACaseName">
and it works just fine.
What I now need to do is match 4 or 5 more
<p outputclass="<somestring>">
and apply the same style to them. I could easily just duplicate the above line substituting the different outputclass names each time but this is lazy and I know there must be a correct way of doing this which I should learn.
I hope I have provided enough info here. If I have missed anything please say.
thanks,
Hedley Phillips
You can specify multiple conditions in the predicate:
<xsl:template match="*[@outputclass='test' or @outputclass='blah']">
I couldn't find the duplicate...
In XSLT/XPath 1.0:
<xsl:template match="*[contains(
'|LC ACaseName|other class|',
concat('|',@outputclass,'|')
)
]">
<!-- Content Template -->
</xsl:template>
In XSLT/XPath 2.0:
<xsl:template match="*[@outputclass = ('LC ACaseName','other class')]">
<!-- Content Template -->
</xsl:template>
Note: for the XSLT/XPath 1.0 solution, you need a separator that does not occur in any of the item values.
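Why the separator matters is easiest to see outside XSLT. The following Python lines imitate the contains()/concat() test with plain substring checks; the name matches_class is invented for this illustration.

```python
ALLOWED = "|LC ACaseName|other class|"


def matches_class(outputclass, allowed=ALLOWED):
    """Imitate contains('|a|b|', concat('|', @outputclass, '|'))."""
    return "|" + outputclass + "|" in allowed


exact = matches_class("LC ACaseName")      # a full value from the list
partial = matches_class("ACaseName")       # only part of a listed value
naive_partial = "ACaseName" in ALLOWED     # substring test without separators
```

Here exact is True and partial is False, whereas the naive test accepts the partial value. Wrapping the attribute in separators is what turns a substring test into a whole-value test.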