What's the cleanest way to produce a copy of an existing XML file excluding elements that match a couple different XPath expressions? [duplicate] - xslt

I want a simple XSLT which will only keep the elements which contain a certain regex:
<example>
<abc>text</abc>
<bc>text</bc>
<ab>text</ab>
</example>
I want the same XML output but only with the elements which contain an "a":
<example>
<abc>text</abc>
<ab>text</ab>
</example>

Start with the identity transform and add a template that suppresses elements whose name does not match your regex:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(matches(name(), 'a'))]"/>
</xsl:stylesheet>
Explanation: By default the identity transformation will copy everything over to the output as-is. Override this default behavior by writing a simple template that matches elements without "a" in their name and does nothing, thereby preventing such elements from appearing in the output document.

Related

XML with empty prefix transformation with XSLT [duplicate]

This question already has answers here:
XSLT Transform doesn't work until I remove root node
(2 answers)
Closed 1 year ago.
I'm willing to use XSLT to transform XML files to other XML files by removing (TextLine) elements. However, the elements are not removed as I expect in the output XML files. I imagine that I'll have to modify the XSLT file, but I don't know how. Let me know what should be done.
I suspect that the root cause of the issue is that elements in the XML files have an empty prefix namespace.
The details are the following ones.
An XML test-01.xml file that contains empty prefix namespace elements:
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.loc.gov/standards/alto/ns-v4#"
xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
<TextLine TAGREFS="LT9"/>
<TextLine TAGREFS="LT10"/>
<TextLine TAGREFS="LT9"/>
<TextLine TAGREFS="LT8"/>
</alto>
And I'm using the following date.xslt file:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="TextLine"/>
</xsl:stylesheet>
Note: I'm using python lxml to perform the transformation. However, this shouldn't have any influence on the process as I could use any other XML transformer as xsltproc.
Yes. Your assumption that that the default namespace was the cause of your XSLT not functioning as desired was correct. Try this XSLT-1.0 instead:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:loc="http://www.loc.gov/standards/alto/ns-v4#">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="loc:TextLine"/>
</xsl:stylesheet>

XSLT: How to remove elements of a resulting result tree fragment while copying?

My goal is to extract the contents of the SOAP body, f.e. the ElementsToExtract node - but the node name can basically be arbitrary:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Header>
<MessageId>52DF2371-4094-4408-A3EA-42D73FD1B7A3</MessageId>
</soap:Header>
<soap:Body>
<ElementsToExtract>
...
<RemoveMe>...</RemoveMe>
<RemoveMeAlso>...</RemoveMeAlso>
...
</ElementsToExtract>
</soap:Body>
</soap:Envelope>
While I'm extracting the contents, I want to get rid of two elements that all my source documents have in common - say RemoveMe and RemoveMeAlso. As there's a chance that the deeper nested nodes may be called the same, they must only be stripped from the layer below the ElementsToExtract node. How would I formulate that expression?
Here's what I did up to now:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:exsl="http://exslt.org/common"
exclude-result-prefixes="soap exsl">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="SoapHeaderContents" select="exsl:node-set(soap:Envelope/soap:Header/*)"/>
<xsl:variable name="SoapBodyContents" select="exsl:node-set(soap:Envelope/soap:Body/*)"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="/">
<xsl:apply-templates select="$SoapBodyContents"/>
</xsl:template>
<!-- This is global, how to restrict to the ElementsToExtract element? -->
<xsl:template match="node()[name() = 'RemoveMe']"/>
<xsl:template match="node()[name() = 'RemoveMeAlso']"/>
</xsl:stylesheet>
I also played with the node-set() function, having read that one can not modify result tree fragments (they're only text nodes?), but I don't quite understand how to address the resulting nodes of that set. So the nodes weren't removed:
<xsl:template match="/">
<xsl:apply-templates select="$SoapBodyContents"/>
<xsl:apply-templates select="$SoapBodyContents/RemoveMe" mode="m1"/>
</xsl:template>
<xsl:template name="StripRemoveMe" match="RemoveMe" mode="m1"/>
I also read some parts of the specification, but to no avail. I'm lost for clues. Can someone direct me to the right approach?
Would this work for you:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
exclude-result-prefixes="soap">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- identity transform -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- skip soap wrappers -->
<xsl:template match="/soap:Envelope">
<xsl:apply-templates select="soap:Body/ElementsToExtract"/>
</xsl:template>
<!-- remove unwanted elements -->
<xsl:template match="ElementsToExtract/RemoveMe | ElementsToExtract/RemoveMeAlso"/>
</xsl:stylesheet>
In the (unlikely) case you don't know the name of the ElementsToExtract element, you could use:
<!-- skip soap wrappers -->
<xsl:template match="/soap:Envelope">
<xsl:apply-templates select="soap:Body/*"/>
</xsl:template>
<!-- remove unwanted elements -->
<xsl:template match="soap:Body/*/RemoveMe | soap:Body/*/RemoveMeAlso"/>
Some quick thoughts.
You create variables for storing the SOAP header and body. These are already in the input document, so it makes more sense to just write templates that match these.
Although you create a variable for the SOAP header, you never use it anywhere.
If you try to apply templates in succession, as in your sample XSL code, you will get all the output nodes from the first apply-templates, and then all the output nodes from the next apply-templates. If these nodes are meant to be interleaved in any way, this approach will not produce viable output.
Here's a revised version of your sample input XML, adding in a couple elements that we want to keep.
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Header>
<MessageId>52DF2371-4094-4408-A3EA-42D73FD1B7A3</MessageId>
</soap:Header>
<soap:Body>
<ElementsToExtract>
<KeepMe>This text will persist in the output.</KeepMe>
<RemoveMe>This is text that will be removed.</RemoveMe>
<RemoveMeAlso>This will also vanish from the output.</RemoveMeAlso>
<OtherElementToKeep>And this one will also be kept.</OtherElementToKeep>
</ElementsToExtract>
</soap:Body>
</soap:Envelope>
Here's what we'd want as output:
<?xml version="1.0" encoding="utf-8"?>
<ElementsToExtract>
<KeepMe>This text will persist in the output.</KeepMe>
<OtherElementToKeep>And this one will also be kept.</OtherElementToKeep>
</ElementsToExtract>
This XSL 1.0 code will do the job. I'm guessing from your post that you're not familiar with XSL processing flow, so I've added comments to help explain what's going on.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
version="1.0"
exclude-result-prefixes="soap">
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<!-- The `/` matches the _logical root_ of the input file. This is
basically equivalent to the start of the file, NOT the first element.
This is a common place to start processing in XSL. -->
<xsl:template match="/">
<!-- We just apply templates. In your case, we know already that
we DON'T want to process everything: we want to leave certain
things out, including a lot of the outermost elements. So
we specify what to target in the `select` statement. -->
<xsl:apply-templates select="soap:Envelope/soap:Body/ElementsToExtract"/>
</xsl:template>
<!-- This is the "identity" template, so called because it
just copies over applicable matches identically.
A template with a more-specific match statement takes
precedence. -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- Here, we specify exactly those elements that are in the
processing flow, and that we want to exclude from the
output. Since `soap:Header` etc. are NOT in the processing
flow (their element trees were never included in a preceding
call to `apply-templates`), we don't need to worry about those. -->
<xsl:template match="RemoveMe | RemoveMeAlso"/>
</xsl:stylesheet>
Note that the outermost element in the output is ElementsToExtract. This element will include the xmlns:soap="http://www.w3.org/2003/05/soap-envelope" namespace declaration, even though this namespace isn't used in any of the output elements (at least, for this small sample input XML).
If you can use XSL 2.0+ and you want to remove this namespace from the output, you could add the copy-namespaces="no" attribute to the <xsl:copy> element.

XSLT/XPath: Treat <tag> as ordinary string, not node

I have an XML:
<foo>
<bar>some text</bar>
</foo>
And I'm generating an HTML from it using XSTL, and looking for an Xpath (or some XSTL method, don't know) that gives me the whole content of foo. To illustrate my problem,
<xsl:value-of select="foo"/>
as expected, delivers only
some text
. But is there anything I can do to get
<bar>some text</bar>
? So to treat it as if the bar tags would be just ordinary strings.
Well, there's certainly no way of treating <tag> as a string, because XSLT sees the output of the XML parser, which is a tree of nodes: the tags have long since disappeared by the time XSLT swings into action.
But you can of course copy the element node as a whole, rather than just extracting its string value. Simply use <xsl:copy-of> in place of <xsl:value-of>.
You can use Copy
HTML
<foo>
<bar>some text</bar>
</foo>
XSL
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="bar">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Output
<bar>some text</bar>

Handling < > in XSLT 1.0

I have a problem, when trying to read a structure having < > in source XML.
Input Structure -
<?xml version="1.0" encoding="UTF-8"?>
<RecordsData>
<RecordsData>
<UID><RecordsData xmlns=""><RecordsData><UID>200</UID><RID>Test-1</RID><Date>20142812</Date><Status>N</Status></RecordsData></RecordsData></UID>
</RecordsData>
</RecordsData>
Expected Output Structure (there are two requirements) -
One is just conversion of < >into well formed XML tags.
<?xml version="1.0" encoding="UTF-8"?>
<RecordsData>
<RecordsData>
<UID><RecordsData xmlns=""><RecordsData><UID>200</UID><RID>Test-1</RID><Date>20142812</Date><Status>N</Status></RecordsData></RecordsData></UID>
</RecordsData>
</RecordsData>
Second is extraction of whole data inside UID tag with output as only below -
<RecordsData xmlns=""><RecordsData><UID>200</UID><RID>Test-1</RID><Date>20142812</Date><Status>N</Status></RecordsData></RecordsData>
I am able to get second output if I have first one in hand. But struggling to get first output from Input over last few days after searching forum extensively and being very new to XSLT.
If we can directly get second output from input source - it's actually what is expected solution. For above - I just tried to break down problem into steps.
Any of experts can you please help!
Thanks,
Conversion is easy, extraction is not.
To convert the escaped markup to real markup, simply disable the escaping when writing the node to the result tree, for example:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- identity transform -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="UID">
<xsl:copy>
<xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Ideally, you would use the resulting XML file to extract any data from the escaped portion. Otherwise you would have to apply string functions for this purpose, since the escaped text is not XML.
However, in your example, you don't want to extract anything particular from the data, just isolate it and convert it to a stand-alone markup document. This can be easily accomplished by:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:value-of select="RecordsData/RecordsData/UID" disable-output-escaping="yes"/>
</xsl:template>
</xsl:stylesheet>

I want to select all texts (including node names) under a node

I currently have a xml file like this:
<aaa>
<b>I am a <i>boy</i></b>.
</aaa>
How can I get the exact string as: <b>I am a <i>boy</i></b>.? Thanks.
You have to tell XSLT that you want to copy elements through as well. That can be done with an additional rule. Note that I use custom select clauses on my apply-templates elements to select attributes as well as all node-type objects. Also note that the rule for aaa takes precedence, and does not copy the aaa element itself to the output.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="aaa">
<xsl:apply-templates select="#*|node()"/>
</xsl:template>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
<aaa>
<b>I am a <i>boy</i></b>.
</aaa>
How can I get the exact string as:
<b>I am a <i>boy</i></b>.?
The easiest/shortest way to do this in your case is to output the result of the following XPath expression:
/*/node()
This means: "Select all nodes that are children of the top element."
Of course, there are some white-space-only text nodes that we don't want selected, but XSLT can take care of this, so the XPath expression is just as simple as shown above.
Now, to get the result with an XSLT transformation, we use the following:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="/*/node()"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document, the wanted result is produced:
<b>I am a <i>boy</i></b>.
Do note:
The use of the <xsl:copy-of> xslt instruction (not <xsl:value-of>), which copies nodes, not string values.
The use of the <xsl:strip-space elements="*"/> XSLT instruction, directing the XSLT processor to ignore any white-space-only text node in the XML document.