XSLT: How to discard unwanted HTML nodes from source? - xslt

I am using XSLT 1.0, and using xsltproc on OS X Yosemite.
The source content is HTML; the target content is XML.
The issue is a fairly common one. I want all "uninteresting"
nodes simply to be discarded from the output. I've seen catch-all
directives like this:
<xsl:template match="node()|script"/>
<xsl:template match="*">
<xsl:apply-templates/>
</xsl:template>
This is close to what I need. But unfortunately, it's too strong when I need to add another template that visits one of the text nodes caught by node(). For example, suppose I added this template:
<xsl:template match="a/div[#class='location']/br">
<xsl:text> </xsl:text>
</xsl:template>
which simply replaces certain <br/> elements with spaces.
Well, node() precludes this latter template from taking effect,
because the relevant text node containing the line-break is discarded
already!
Well, to correct the issue, here's what I have done in lieu of the catch-all node():
<xsl:template match="html/head|div[#id='banner_parent']|button|ul|div[#id='feed_title']|span|div[#class='submit_event']|script"/>
But this is precisely the problem: I am now piecing together a template
whose matching criteria is likely to be error-prone when the source
content changes.
Is there a simpler directive that would accomplish the same thing? I'm aiming for something like this:
<xsl:template match="node()[not(locations)]|script"/>
Thanks.

If i understood correctly, you want only some nodes in the output and the rest you dont care abour, in this example I try to catch only li elements and throw the rest away.. not sure if this is what you want though http://xsltransform.net/gWmuiKk
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="html" doctype-public="XSLT-compat" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
<!-- Lets pretend li is interesting for you -->
<xsl:template match="li">
<xsl:text>Interesting Node Only!
</xsl:text>
</xsl:template>
<xsl:template match="#*|node()">
<xsl:apply-templates select="#*|node()"/>
</xsl:template>
</xsl:transform>

Related

XSLT: How to remove elements of a resulting result tree fragment while copying?

My goal is to extract the contents of the SOAP body, f.e. the ElementsToExtract node - but the node name can basically be arbitrary:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Header>
<MessageId>52DF2371-4094-4408-A3EA-42D73FD1B7A3</MessageId>
</soap:Header>
<soap:Body>
<ElementsToExtract>
...
<RemoveMe>...</RemoveMe>
<RemoveMeAlso>...</RemoveMeAlso>
...
</ElementsToExtract>
</soap:Body>
</soap:Envelope>
While I'm extracting the contents, I want to get rid of two elements that all my source documents have in common - say RemoveMe and RemoveMeAlso. As there's a chance that the deeper nested nodes may be called the same, they must only be stripped from the layer below the ElementsToExtract node. How would I formulate that expression?
Here's what I did up to now:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:exsl="http://exslt.org/common"
exclude-result-prefixes="soap exsl">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="SoapHeaderContents" select="exsl:node-set(soap:Envelope/soap:Header/*)"/>
<xsl:variable name="SoapBodyContents" select="exsl:node-set(soap:Envelope/soap:Body/*)"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="/">
<xsl:apply-templates select="$SoapBodyContents"/>
</xsl:template>
<!-- This is global, how to restrict to the ElementsToExtract element? -->
<xsl:template match="node()[name() = 'RemoveMe']"/>
<xsl:template match="node()[name() = 'RemoveMeAlso']"/>
</xsl:stylesheet>
I also played with the node-set() function, having read that one can not modify result tree fragments (they're only text nodes?), but I don't quite understand how to address the resulting nodes of that set. So the nodes weren't removed:
<xsl:template match="/">
<xsl:apply-templates select="$SoapBodyContents"/>
<xsl:apply-templates select="$SoapBodyContents/RemoveMe" mode="m1"/>
</xsl:template>
<xsl:template name="StripRemoveMe" match="RemoveMe" mode="m1"/>
I also read some parts of the specification, but to no avail. I'm lost for clues. Can someone direct me to the right approach?
Would this work for you:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
exclude-result-prefixes="soap">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- identity transform -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- skip soap wrappers -->
<xsl:template match="/soap:Envelope">
<xsl:apply-templates select="soap:Body/ElementsToExtract"/>
</xsl:template>
<!-- remove unwanted elements -->
<xsl:template match="ElementsToExtract/RemoveMe | ElementsToExtract/RemoveMeAlso"/>
</xsl:stylesheet>
In the (unlikely) case you don't know the name of the ElementsToExtract element, you could use:
<!-- skip soap wrappers -->
<xsl:template match="/soap:Envelope">
<xsl:apply-templates select="soap:Body/*"/>
</xsl:template>
<!-- remove unwanted elements -->
<xsl:template match="soap:Body/*/RemoveMe | soap:Body/*/RemoveMeAlso"/>
Some quick thoughts.
You create variables for storing the SOAP header and body. These are already in the input document, so it makes more sense to just write templates that match these.
Although you create a variable for the SOAP header, you never use it anywhere.
If you try to apply templates in succession, as in your sample XSL code, you will get all the output nodes from the first apply-templates, and then all the output nodes from the next apply-templates. If these nodes are meant to be interleaved in any way, this approach will not produce viable output.
Here's a revised version of your sample input XML, adding in a couple elements that we want to keep.
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Header>
<MessageId>52DF2371-4094-4408-A3EA-42D73FD1B7A3</MessageId>
</soap:Header>
<soap:Body>
<ElementsToExtract>
<KeepMe>This text will persist in the output.</KeepMe>
<RemoveMe>This is text that will be removed.</RemoveMe>
<RemoveMeAlso>This will also vanish from the output.</RemoveMeAlso>
<OtherElementToKeep>And this one will also be kept.</OtherElementToKeep>
</ElementsToExtract>
</soap:Body>
</soap:Envelope>
Here's what we'd want as output:
<?xml version="1.0" encoding="utf-8"?>
<ElementsToExtract>
<KeepMe>This text will persist in the output.</KeepMe>
<OtherElementToKeep>And this one will also be kept.</OtherElementToKeep>
</ElementsToExtract>
This XSL 1.0 code will do the job. I'm guessing from your post that you're not familiar with XSL processing flow, so I've added comments to help explain what's going on.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
version="1.0"
exclude-result-prefixes="soap">
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<!-- The `/` matches the _logical root_ of the input file. This is
basically equivalent to the start of the file, NOT the first element.
This is a common place to start processing in XSL. -->
<xsl:template match="/">
<!-- We just apply templates. In your case, we know already that
we DON'T want to process everything: we want to leave certain
things out, including a lot of the outermost elements. So
we specify what to target in the `select` statement. -->
<xsl:apply-templates select="soap:Envelope/soap:Body/ElementsToExtract"/>
</xsl:template>
<!-- This is the "identity" template, so called because it
just copies over applicable matches identically.
A template with a more-specific match statement takes
precedence. -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- Here, we specify exactly those elements that are in the
processing flow, and that we want to exclude from the
output. Since `soap:Header` etc. are NOT in the processing
flow (their element trees were never included in a preceding
call to `apply-templates`), we don't need to worry about those. -->
<xsl:template match="RemoveMe | RemoveMeAlso"/>
</xsl:stylesheet>
Note that the outermost element in the output is ElementsToExtract. This element will include the xmlns:soap="http://www.w3.org/2003/05/soap-envelope" namespace declaration, even though this namespace isn't used in any of the output elements (at least, for this small sample input XML).
If you can use XSL 2.0+ and you want to remove this namespace from the output, you could add the copy-namespaces="no" attribute to the <xsl:copy> element.

XSLT/XPath: Treat <tag> as ordinary string, not node

I have an XML:
<foo>
<bar>some text</bar>
</foo>
And I'm generating an HTML from it using XSTL, and looking for an Xpath (or some XSTL method, don't know) that gives me the whole content of foo. To illustrate my problem,
<xsl:value-of select="foo"/>
as expected, delivers only
some text
. But is there anything I can do to get
<bar>some text</bar>
? So to treat it as if the bar tags would be just ordinary strings.
Well, there's certainly no way of treating <tag> as a string, because XSLT sees the output of the XML parser, which is a tree of nodes: the tags have long since disappeared by the time XSLT swings into action.
But you can of course copy the element node as a whole, rather than just extracting its string value. Simply use <xsl:copy-of> in place of <xsl:value-of>.
You can use Copy
HTML
<foo>
<bar>some text</bar>
</foo>
XSL
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="bar">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Output
<bar>some text</bar>

Removing empty tags within a variable

My question is related to to another poster's StackOverflow question on Two Phase Processing. I didn't want to use mode="#all" without fully understanding it and how it could affect the rest of my XSLT. I'm thinking the below code accomplishes the same thing without risking interference with other templates but would like confirmation. It kind of seems like I am processing $completepolicy twice without need to do so.
Empty tag definition: <field/> <field></field>. Tags can have attributes but there will never be an empty tag that has an attribute. There will also never be nodes with <field> </field> where the white space could represent many other things.
Given this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<!-- many other apply-templates here -->
<xsl:variable name="completepolicy" as="element()">
<holder>
<TABLE1 type="global">
<col1>Red</col1>
<col2/>
</TABLE1>
<TABLE2>
<field1>Blue</field1>
<field2/>
</TABLE2>
</holder>
</xsl:variable>
<xsl:apply-templates mode="emptytags" select="$completepolicy/*"/>
</xsl:template>
<xsl:template match="*[not(node())]" mode="emptytags"/>
<xsl:template match="node() | #*" mode="emptytags">
<xsl:copy>
<xsl:apply-templates select="node() | #*" mode="#current"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Results in this output for $completepolicy:
<TABLE1 type="global">
<col1>Red</col1>
</TABLE1>
<TABLE2>
<field1>Blue</field1>
</TABLE2>
Why do you think the $completepolicy variable is being processed twice? This cannot be seen in the provided code.
I confirm that the provided code looks good to me.
I would recommend never to use mode="#all". This is too powerful and dangerous -- this is almost never needed.

Problems with XSLT and apply templates

I'm going to be brief. I'm doing XSLT on the client. The output is a report/html with data. The report consists of several blocks, ie one block is one child element of root-node in the xml file.
There are n reports residing in n different xslt-files in my project and the reports can have the same block. That means if there is a problem with one block for one report and it is in n reports i have to update every n report (xslt file).
So i want to put all my blocks in templates (kind-of-a businesslayer) that i can reuse for my reports by xsl:include on the templates for those reports.
So the pseudo is something like this:
<?xml version="1.0".....?>
<xsl:stylesheet version="1.0"....>
<xsl:include href="../../Blocks/MyBlock.xslt"/>
<xsl:template match='/'>
<xsl:apply-templates />
</xsl:template>
</xsl:stylesheet>
MyBlock.xslt:
<?xml version="1.0"....?>
<xsl:stylesheet version="1.0".....>
<xsl:template match='/root/rating'>
HTML OUTPUT
</xsl:template>
</xsl:stylesheet>
I hope someone out there understands my question. I need pointers on how to go about this, if this is one way to do it. But it doesn't seem to work.
Below is my experience that how am dealing this.
This is example which I modified your code.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0">
<xsl:include href="../../Blocks/MyBlock.xslt"/>
<xsl:template match="/">
<xsl:apply-templates select="node()" mode="callingNode1"/>
</xsl:template>
</xsl:stylesheet>
MyBlock.xslt:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0">
<xsl:template mode="callingNode1" match="*">
HTML OUTPUT
</xsl:template>
<xsl:template mode="callingNode2" match="/root/rating">
HTML OUTPUT
</xsl:template>
</xsl:stylesheet>
Here am calling the nodes based on the mode & match.

Using XSL to sort attributes

I'm trying to canonicalize the representation of some XML data by sorting each element's attributes by name (not value). The idea is to keep textual differences minimal when attributes are added or removed and to prevent different editors from introducing equivalent variants. These XML files are under source control and developers are wanting to diff the changes without resorting to specialized XML tools.
I was surprised to not find an XSL example of how to this. Basically I want just the identity transform with sorted attributes. I came up with the following with seems to work in all my test cases:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="*|/|text()|comment()|processing-instruction()">
<xsl:copy>
<xsl:for-each select="#*">
<xsl:sort select="name(.)"/>
<xsl:copy/>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
As a total XSL n00b I would appreciate any comments on style or efficiency. I thought it might be helpful to post it here since it seems to be at least not a common example.
With xslt being a functional language doing a for-each might often be the easiest path for us humans but not the most efficient for XSLT processors since they cannot fully optimize the call.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="#*">
<xsl:sort select="name()"/>
</xsl:apply-templates>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="#*|comment()|processing-instruction()">
<xsl:copy />
</xsl:template>
</xsl:stylesheet>
This is totally trivial in this regards though and as a "XSL n00b" i think you solved the problem very well indeed.
Well done for solving the problem. As I assume you know the order or attributes is unimportant for XML parsers so the primary benefit of this exercise is for humans - a machine will re-order them on input or output in unpredictable ways.
Canonicalization in XML is not trivial and you would be well advised to use the canonicalizer provided with any reasonable XML toolkit rather than writing your own.