Get text between 2 processing instructions - xslt

My goal is to get all text, including text inside elements, between 2 processing instructions using xslt.
Input file is DITA having standard XML-based structure. There are 2 processing instructions I am searching for <?PI start?> and <?PI end?>. I search for text after <?PI start?> and before <?PI end?>. There can be just text or an element that has text in it.
Input
<concept id="testcase" >
<title> Introduction</title>
<conbody>
<p>
<p>text01</p>
<?PI start 1?> text02 <?PI end 1?>
<b> text03 </b>
<?PI start 2?> text04 text05 <?PI end 2?> text06
<?PI start 3?> text07 <?PI end 3?>
</p>
<p>
<?PI start 4?>text11 <?PI end 4?>
<?PI start 5?><b>text12</b><?PI end 5?>
<?PI start 6?> text13<?PI end 6?>
</p>
</conbody>
</concept>
My approaches were:
First approach: match <?PI start?> and walk the following siblings until I reach <?PI end?>. The problem is that XSLT has no way to break out of a loop and no way to change the value of a variable, so I don't know how to stop.
xsl
<xsl:template match="//processing-instruction('PI')[contains(.,'start')]">
<xsl:variable name='text1' select="following-sibling::text()[preceding::processing-instruction('PI')[1][contains(.,'start')]][following::processing-instruction('PI')[1][contains(., 'end ')]] "/>
<xsl:variable name='text2' select="following-sibling::*[preceding::processing-instruction('PI')[1][contains(.,'start')]][following::processing-instruction('PI')[1][contains(., 'end')]]/text() "/>
<xsl:variable name="text" select="concat($text1,$text2)"/>
<xsl:value-of select="$text"/>
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
output
<concept xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/" id="testcase">
<title> Introduction</title>
<conbody>
<p>
<p>text01</p>
text02 <?PI start 1?> text02 <?PI end 1?>
<b> text03 </b>
text04 text05 <?PI start 2?> text04 text05 <?PI end 2?> text06
text07 <?PI start 3?> text07 <?PI end 3?>
</p>
<p>
text11 text12<?PI start 4?>text11 <?PI end 4?>
text13text12<?PI start 5?>
<b>text12</b><?PI end 5?>
text13<?PI start 6?> text13<?PI end 6?>
</p>
</conbody>
</concept>
Second approach: match text or any element that has a preceding-sibling <?PI start?> and a following-sibling <?PI end?>.
xsl
<xsl:template match="//processing-instruction('PI')[contains(.,'start')]">
<xsl:for-each select="following-sibling::*">
<xsl:value-of select="./text()"/>
</xsl:for-each>
<xsl:for-each select="following-sibling::text()">
<xsl:value-of select="."/>
</xsl:for-each>
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
output
<concept xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/" id="testcase">
<title> Introduction</title>
<conbody>
<p>
<p>text01</p>
text03 text02
text04 text05 text06
text07
<?PI start 1?> text02 <?PI end 1?>
<b> text03 </b>
text04 text05 text06
text07
<?PI start 2?> text04 text05 <?PI end 2?> text06
text07
<?PI start 3?> text07 <?PI end 3?>
</p>
<p>
text12text11
text13
<?PI start 4?>text11 <?PI end 4?>
text12
text13
<?PI start 5?>
<b>text12</b><?PI end 5?>
text13
<?PI start 6?> text13<?PI end 6?>
</p>
</conbody>
</concept>
The problem is that it also matches elements that are not between 2 processing instructions. For example, text03 below is matched because technically it does have a preceding-sibling <?PI start?> and a following-sibling <?PI end?>:
<?PI start 1?> text02 <?PI end 1?>
<b> text03 </b>
<?PI start 2?> text04 text05 <?PI end 2?>
XSLT version: 1.0
XSLT processor: Saxon-HE
I will appreciate any input, ideas and suggestions.

If the PIs are always siblings then doing
<xsl:template match="processing-instruction('PI')[starts-with(.,'start')]">
<xsl:variable name="end-pi" select="following-sibling::processing-instruction('PI')[starts-with(., 'end')][1]"/>
<xsl:variable name="nodes-before-end"
select="following-sibling::node()[. &lt;&lt; $end-pi]"/>
</xsl:template>
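should suffice to select the nodes between the two PIs. Output them as needed; I couldn't quite tell where or how you want to output them. Note that the << node-order operator is XPath 2.0, so this needs an XSLT 2.0 or later processor (Saxon-HE is one), not XSLT 1.0.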
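For readers who want to check the selection logic outside XSLT, here is a minimal sketch of the same idea in Python: start at a start PI and walk the following siblings until the next end PI, collecting text including text inside elements. The helper names (text_of, is_pi, texts_between_pis) and the shortened sample are my own assumptions for illustration, not part of the answer above.

```python
from xml.dom import minidom, Node

# Shortened, hypothetical sample modelled on the question's input.
SAMPLE = (
    "<p><?PI start 1?> text02 <?PI end 1?><b> text03 </b>"
    "<?PI start 5?><b>text12</b><?PI end 5?> text06</p>"
)


def text_of(node):
    """Return all text inside a node, descending into elements."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def is_pi(node, keyword):
    """True for a <?PI ...?> node whose data starts with the keyword."""
    return (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
            and node.target == "PI"
            and node.data.startswith(keyword))


def texts_between_pis(parent):
    """For each start PI child, collect sibling text up to the next end PI."""
    results = []
    for child in parent.childNodes:
        if not is_pi(child, "start"):
            continue
        parts = []
        sibling = child.nextSibling
        # Stop at the first end PI, which is the "break" XSLT 1.0 lacks.
        while sibling is not None and not is_pi(sibling, "end"):
            parts.append(text_of(sibling))
            sibling = sibling.nextSibling
        results.append("".join(parts))
    return results


doc = minidom.parseString(SAMPLE)
print(texts_between_pis(doc.documentElement))
```

Like the XSLT above, this assumes the start and end PIs are siblings; text03 and text06 are skipped because they sit outside any start/end pair.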

I know this is not what you asked for, but maybe it will be useful.
I have code that does almost exactly what you asked for, with a minor difference: if the content of a node between 2 processing instructions, or the text of a node between them, is a run of whitespace, it is replaced with an empty node.
I am using Python's standard library, so there is no need to install anything additional. Just make sure to run it with python3 (I run it with python3.6, but any python3+ should be fine).
from pprint import pprint
from typing import List
from xml.dom import minidom
from xml.dom import Node
import re


def get_all_processing_instruction_nodes(child_nodes: List):
    # Pair up each "start" PI with the next "end" PI among the same siblings.
    start_end_pairs = []
    current_pair = {}
    for index, node in enumerate(child_nodes):
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            if "start" in node.nodeValue:
                current_pair["start"] = {
                    "node": node,
                    "index": index
                }
            if "end" in node.nodeValue:
                if "start" not in current_pair:
                    raise ValueError("End detected before start")
                current_pair['end'] = {
                    "node": node,
                    "index": index
                }
                start_end_pairs.append(current_pair)
                current_pair = {}
    return start_end_pairs


def process_all_paired_child_nodes_recursively(node):
    pi_pairs = get_all_processing_instruction_nodes(node.childNodes)
    if pi_pairs:
        print(node.nodeName, "::", node.nodeValue)
        pprint(pi_pairs)
        for pair in pi_pairs:
            start_index = pair['start']['index']
            end_index = pair['end']['index']
            # Everything strictly between the start PI and the end PI.
            interesting_nodes = node.childNodes[start_index + 1: end_index]
            process_interesting_nodes(interesting_nodes)
    for child_node in node.childNodes:
        process_all_paired_child_nodes_recursively(child_node)


def process_interesting_nodes(interesting_nodes: List):
    for i_node in interesting_nodes:
        # Elements have nodeValue None and are skipped; text nodes lose
        # every run of whitespace.
        if i_node.nodeValue:
            i_node.nodeValue = re.sub(r"\s+", "", i_node.nodeValue)


def process_xml_file(input_file_path: str, output_file_path: str):
    document_node = minidom.parse(input_file_path)
    process_all_paired_child_nodes_recursively(document_node)
    # toxml() returns bytes only when an encoding is given, so fall back to
    # UTF-8 for documents without an encoding declaration and write bytes.
    encoding = document_node.encoding or "utf-8"
    with open(output_file_path, "wb") as f:
        f.write(document_node.toxml(encoding))
You can easily modify the process_interesting_nodes function to do anything with the matched nodes (note that plain text between 2 processing instructions is parsed as a Text node in Python, so you treat it as a regular node).
Hope this helps. I would also recommend you take a look at Python's xml lib, especially the minidom part (https://docs.python.org/3/library/xml.dom.minidom.html). The minidom module, unlike the regular xml parser, allows you to treat processing instructions and comments as regular nodes.

Related

Using xsl:accumulator to keep track of text nodes between two PIs

I am learning about accumulators in XSLT 3.0, but I have not found any examples that help me solve my current problem. I have large files in which processing instructions are used to mark modifications. I need to process these into visible markers for the review process. With an accumulator I have succeeded in keeping track of the latest modification code to be shown. So far, so good.
As the original files are massive, I created a simple sample input XML that shows the essence of my task and I adapted my XSL to show what I am trying with the accumulator.
Simple input file:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<div>
<p>Paragraph 1</p>
<?MyPI Start Modification 1?>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
<?MyPI End Modification 1?>
</div>
<div>
<list>
<item>
<p>Paragraph 4</p>
<?MyPI Start Modification 1?>
<p>Paragraph 5</p>
<?MyPI End Modification 1?>
</item>
<item>
<?MyPI Start Modification 1?>
<p>Paragraph 6</p>
<p>Paragraph 7</p>
<?MyPI End Modification 1?>
<?MyPI Start Modification 2?>
<p>Paragraph 8</p>
<?MyPI End Modification 2?>
</item>
</list>
<p>Paragraph 9</p>
</div>
</root>
My XSL using an accumulator for the current modification:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="3.0">
<xsl:mode use-accumulators="#all"/>
<xsl:accumulator name="modifier" initial-value="'Base text'">
<xsl:accumulator-rule match="processing-instruction('MyPI')[contains(.,'Modification')]">
<xsl:choose>
<xsl:when test="contains(.,'Start')">
<xsl:value-of select="substring-after(.,'Start ')"/>
</xsl:when>
<xsl:otherwise>Base text</xsl:otherwise>
</xsl:choose>
</xsl:accumulator-rule>
</xsl:accumulator>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="node()">
<xsl:copy>
<xsl:apply-templates select="node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="processing-instruction('MyPI')">
<marker>
<xsl:value-of select="accumulator-after('modifier')"/>
</marker>
</xsl:template>
</xsl:stylesheet>
Output with this XSL:
<?xml version="1.0" encoding="UTF-8"?><root>
<div>
<p>Paragraph 1</p>
<marker>Modification 1</marker>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
<marker>Base text</marker>
</div>
<div>
<list>
<item>
<p>Paragraph 4</p>
<marker>Modification 1</marker>
<p>Paragraph 5</p>
<marker>Base text</marker>
</item>
<item>
<marker>Modification 1</marker>
<p>Paragraph 6</p>
<p>Paragraph 7</p>
<marker>Base text</marker>
<marker>Modification 2</marker>
<p>Paragraph 8</p>
<marker>Base text</marker>
</item>
</list>
<p>Paragraph 9</p>
</div>
</root>
The problem I have is that closing and opening markers for the same modification code should be hidden when there is no text between them. They may be immediately following each other (which is fairly simple) but also have some non-text element boundaries between them. I have tried to create an accumulator that keeps track of all text since the last modification marker, but that causes nested calls to the same accumulator which gives a runtime error. What I am looking for is a method that keeps adding text to an accumulator and resets it to an empty string when a modification PI is found. This is my trial accumulator that caused too many nested calls:
<xsl:accumulator name="text" initial-value="''">
<xsl:accumulator-rule match="node()">
<xsl:choose>
<xsl:when test="self::processing-instruction('MyPI')"/>
<xsl:when test="self::text()">
<xsl:value-of select="concat(accumulator-after('text'),.)"/>
</xsl:when>
</xsl:choose>
</xsl:accumulator-rule>
</xsl:accumulator>
I guess I do not yet understand how the accumulator works, which makes it hard to get the result I am looking for.
Required output for the above simple XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<div>
<marker>Base text</marker>
<p>Paragraph 1</p>
<marker>Modification 1</marker>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
<marker>Base text</marker>
</div>
<div>
<list>
<item>
<p>Paragraph 4</p>
<marker>Modification 1</marker>
<p>Paragraph 5</p>
</item>
<item>
<p>Paragraph 6</p>
<p>Paragraph 7</p>
<marker>Modification 2</marker>
<p>Paragraph 8</p>
</item>
</list>
<marker>Base text</marker>
<p>Paragraph 9</p>
</div>
</root>
Hoping someone can point me in the right direction. I guess accumulating text nodes since a particular node in the XML processing would be a problem that more people need to solve. In my current case I do not need the actual text content, I just need to know if there is any visible text since the last PI (i.e. I need to remove or disregard any whitespace in this check).
If there is another method that does not involve accumulators, that would be fine, too.
Thanks in advance for any help
Perhaps
<xsl:accumulator name="text" initial-value="()" as="xs:string?">
<xsl:accumulator-rule match="processing-instruction('MyPI')" select="''"/>
<xsl:accumulator-rule match="text()[normalize-space()]" select="$value || ."/>
</xsl:accumulator>
gives you an example of how to set up an accumulator that collects text node values. I am not sure I have understood the conditions for resetting the accumulator to an empty string, so this is basically the match from your sample, transcribed into (hopefully) compilable XSLT 3. You can adapt it if there are more conditions relating to start or end processing instruction pairs or names.
As for the spec explaining the $value variable, see https://www.w3.org/TR/xslt-30/#accumulator-declaration:
The select attribute and the contained sequence constructor of the
xsl:accumulator-rule element are mutually exclusive: if the select
attribute is present then the sequence constructor must be empty. The
expression in the select attribute of xsl:accumulator-rule or the
contained sequence constructor is evaluated with a static context that
follows the normal rules for expressions in stylesheets, except that:
An additional variable is present in the context. The name of this
variable is value (in no namespace), and its type is the type that
appears in the as attribute of the xsl:accumulator declaration.
The context item for evaluation of the expression or sequence
constructor will always be a node that matches the pattern in the
match attribute.
and two of the examples in https://www.w3.org/TR/xslt-30/#accumulator-examples also use $value.
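To make the two rules easier to reason about, here is a rough Python emulation of what such an accumulator computes. It walks the nodes in document order, resets the running value at every MyPI and appends every non-whitespace text node. It is only a sketch of the semantics, not how an XSLT processor implements accumulators; walk, text_since_last_pi and the shortened sample are assumptions of mine.

```python
from xml.dom import minidom, Node

# Hypothetical shortened input: two modification ranges with only
# whitespace between the first End PI and the second Start PI.
SAMPLE = """<root><?MyPI Start Modification 1?><p>Paragraph 6</p><?MyPI End Modification 1?>
  <?MyPI Start Modification 1?><p>Paragraph 7</p><?MyPI End Modification 1?></root>"""


def walk(node):
    """Yield nodes in document order, the order an accumulator visits them."""
    yield node
    for child in node.childNodes:
        yield from walk(child)


def text_since_last_pi(document):
    """Return (PI data, text seen since the previous PI) for each MyPI node."""
    value = None  # stands in for initial-value="()"
    seen = []
    for node in walk(document):
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "MyPI"):
            # Record the value before the rule fires, then reset it.
            seen.append((node.data, value))
            value = ""
        elif node.nodeType == Node.TEXT_NODE and node.data.strip():
            # Same effect as select="$value || ." for visible text.
            value = (value or "") + node.data
    return seen


for data, text in text_since_last_pi(minidom.parseString(SAMPLE)):
    print(repr(data), "->", repr(text))
```

An empty string recorded at a Start PI means no visible text was seen since the preceding PI, which is the condition under which the question wants the marker pair hidden.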

Select specific String using Xpath and XSLT

I want to select String between < and >.
Input :
<p type="Endnote Text">&lt;p:endnote_bl1&gt;This is a bullet list in an endnote</p>
<p type="Endnote Text">&lt;p:endnote_bl2&gt;This is a bullet list in an endnote</p>
<p type="Endnote Text">&lt;p:endnote_bl3&gt;This is a bullet list in an endnote</p>
I want to select p:endnote_bl1,p:endnote_bl2, etc.. from the text. It means whatever text between < and >. How can I write the XPath for this.
In XSLT, using xpath, you can simply select all p elements (or tps:p elements, if you do have namespaces), and use substring-before and substring-after to extract the text, although do note this assumes one occurrence of each of < and >
<xsl:for-each select="//p[@type='Endnote Text']">
<xsl:value-of select="substring-before(substring-after(., '&lt;'), '&gt;')" />
<xsl:text>
</xsl:text>
</xsl:for-each>
See it in action at http://xsltfiddle.liberty-development.net/bnnZX7
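If it helps to see the substring-before/substring-after nesting spelled out, this is a small Python sketch with the same behaviour: take what follows the first <, then what precedes the first > after it, and return an empty string when either delimiter is missing. The function name first_bracketed and the sample list are made up for illustration.

```python
def first_bracketed(text):
    """Return the text between the first '<' and the next '>', or ''."""
    # substring-after(., '<'): empty string when there is no '<'.
    _, found_open, rest = text.partition("<")
    if not found_open:
        return ""
    # substring-before(..., '>'): empty string when there is no '>'.
    inner, found_close, _ = rest.partition(">")
    return inner if found_close else ""


# Text values of the three p elements from the question.
lines = [
    "<p:endnote_bl1>This is a bullet list in an endnote",
    "<p:endnote_bl2>This is a bullet list in an endnote",
    "<p:endnote_bl3>This is a bullet list in an endnote",
]
print("\n".join(first_bracketed(line) for line in lines))
```

As with the XSLT, only the first < ... > pair in each string is considered.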
If you could use XSLT 2.0, you could do it without the xsl:for-each...
<xsl:value-of select="//p[@type='Endnote Text']/substring-before(substring-after(., '&lt;'), '&gt;')" separator="&#10;" />
Or you could also use replace in XSLT 2.0....
<xsl:value-of select="//p[@type='Endnote Text']/replace(., '&lt;(.+)&gt;.*', '$1')" separator="&#10;" />
You could use just xpath, if we fix your xml like so:
<doc>
<p type="Endnote Text">&lt;p:endnote_bl1&gt;This is a bullet list in an endnote</p>
<p type="Endnote Text">&lt;p:endnote_bl2&gt;This is a bullet list in an endnote</p>
<p type="Endnote Text">&lt;p:endnote_bl3&gt;This is a bullet list in an endnote</p>
</doc>
Then this expression,
doc/p/substring(./text(),2,13)
will select
p:endnote_bl1
p:endnote_bl2
p:endnote_bl3

First matching ancestor depending on name

Hello XPath/Xslt Friends,
I have the following XML. I want to determine the id of the first matching section or chapter ancestor of my cout element. If the closest such node of the cout is a chapter, I want the chapter id, otherwise the section id.
<book>
<chapter id="chapter1">
<aa>
<cout></cout> --> i will get "chapter1"
</aa>
<section id="section1">
<a>
<b>
<cout></cout> --> i will get section1
</b>
</a>
</section>
<section id="section2">
<a>
<b>
<cout></cout> --> i will get section2
</b>
</a>
</section>
</chapter>
</book>
i tried the following statement:
<xsl:value-of select="ancestor::*[local-name() = 'section' or local-name() = 'chapter']/@id" />
but in case of the cout contained in section1, it will give me chapter1, instead of section1. Any solutions?
Your current statement is selecting all ancestors with the name section or chapter, and after being selected, xsl:value-of will only return the value of the first one, in document order (in XSLT 1.0 that is).
Try this instead
<xsl:value-of select="ancestor::*[local-name() = 'section' or local-name() = 'chapter'][1]/@id" />
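The [1] works because ancestor is a reverse axis, so position 1 is the nearest matching ancestor rather than the outermost one. A minimal Python sketch of the same lookup, walking up the parent chain and stopping at the first section or chapter, looks like this. The function name nearest_ancestor_id and the trimmed sample are assumptions for illustration.

```python
from xml.dom import minidom, Node

# Trimmed version of the question's document.
SAMPLE = (
    '<book><chapter id="chapter1"><aa><cout/></aa>'
    '<section id="section1"><a><b><cout/></b></a></section>'
    '</chapter></book>'
)


def nearest_ancestor_id(node, names=("section", "chapter")):
    """Return the id of the closest ancestor element named in `names`."""
    parent = node.parentNode
    # Walk upwards; the loop ends when the document node is reached.
    while parent is not None and parent.nodeType == Node.ELEMENT_NODE:
        if parent.tagName in names:
            return parent.getAttribute("id")
        parent = parent.parentNode
    return None


doc = minidom.parseString(SAMPLE)
ids = [nearest_ancestor_id(cout) for cout in doc.getElementsByTagName("cout")]
print(ids)
```

The first cout resolves to the chapter because no section encloses it; the second stops at section1 before ever reaching the chapter.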
If cout has no ancestor section, print the chapter id;
otherwise print the section id.
<xsl:for-each select="//cout">
<xsl:if test="count(ancestor::section)= 0">
<xsl:value-of select="ancestor::chapter/@id"/>
</xsl:if>
<xsl:if test="count(ancestor::section)>0">
<xsl:value-of select="ancestor::section/@id"/>
</xsl:if>
</xsl:for-each>

XSLT conditionally change text font

For the following xml,
<question>
<bp>Suppose a file a.xml has content:</bp>
<bp><![CDATA[<a> 1 <b> 2 <b> 3 <a> 4 </a> </b> </b> </a>]]></bp>
<bp>What is the value of the following XPath expression:</bp>
<bp>for $x in doc("a.xml")//a/b return $x/b/a/text()</bp>
</question>
In the XSLT file, I have to change the font of the text if the text between the xml tags contains
<![CDATA[ ]]>
I tried using the following code,
<xsl:for-each select="mcq:bp">
<xsl:if test="contains(. , '<![CDATA[ ]]>')">
<xsl:attribute name='font-family'>courier</xsl:attribute>
<xsl:value-of select="."/>
</xsl:if>
<xsl:value-of select="."/>
<br/>
</xsl:for-each>
But the xslt does not display anything in the browser.
This cannot be done with XSLT.
At the time XSLT is passed the parsed XML document, there isn't any information about whether a text node contained CDATA sections or not. This lexical detail is stripped off (lost) as a result of the parsing of the XML document.
CDATA isn't a string and it isn't part of the text node. Therefore, it is wrong to try to detect a CDATA section by using the contains() function.
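The same loss of lexical detail can be observed with other parsers. As an analogy only (this is Python's ElementTree, not an XSLT processor), a CDATA section and the equivalent escaped text parse to identical text content, and the string "CDATA" never appears in it:

```python
import xml.etree.ElementTree as ET

# The same character data written two ways.
with_cdata = ET.fromstring("<bp><![CDATA[<a> 1 </a>]]></bp>")
escaped = ET.fromstring("<bp>&lt;a&gt; 1 &lt;/a&gt;</bp>")

print(repr(with_cdata.text))
print(with_cdata.text == escaped.text)
print("CDATA" in with_cdata.text)
```

Any test made after parsing therefore has to rely on the content itself rather than on how it was written in the source file.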

How do I make group-by ignore white-space in XSLT?

I have an XML file which I need to group using xsl:for-each-group. Everything works fine, but the problem comes when there are elements that have white space at the end of them (e.g. <word>test </word> and <word>test</word>); I need these to be considered as one group.
Here's the sample xml file:
<u>
<s>
<w>this </w>
<w>is </w>
<w>a </w>
<w>test </w>
</s>
<s>
<w>this</w>
<w>is</w>
<w>a</w>
<w>test</w>
</s>
</u>
Here's the xslt code
<xsl:for-each-group select="bncDoc/stext/div/u/s" group-by="w" >
<tr>
<td style="text-align: center;">
<xsl:value-of select="current-grouping-key()"/>
</td>
<td>
<xsl:value-of select="count(current-group())"/>
</td>
</tr>
</xsl:for-each-group>
Is there any workaround for this?
<xsl:for-each-group select="bncDoc/stext/div/u/s/w" group-by="normalize-space()">
<!-- ... -->
</xsl:for-each-group>
OK, found the answer:
You just need to use normalize-space() in this way:
<xsl:for-each-group select="bncDoc/stext/div/u/s/w" group-by="normalize-space(text())">
<!-- ... -->
</xsl:for-each-group>
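To show what grouping on the normalized value does to the sample words, here is a small Python sketch that counts them after trimming and collapsing whitespace. The normalize_space helper is my own approximation of the XPath function (Python's split() treats a few more characters as whitespace), and the word list is copied from the sample above.

```python
from collections import Counter


def normalize_space(text):
    """Trim the ends and collapse internal whitespace runs to one space."""
    return " ".join(text.split())


# Text of every w element in the sample, in document order.
words = ["this ", "is ", "a ", "test ", "this", "is", "a", "test"]

counts = Counter(normalize_space(word) for word in words)
for key, count in counts.items():
    print(key, count)
```

Each word ends up in a single group of size 2, instead of one group for "test " and another for "test".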