Adding special characters to xml output - xslt

We have a piece of code which returns XML in a format like:
Source XML:
<Root>
<Book>
&lt;BookId&gt;a&lt;/BookId&gt;
&lt;Description&gt;aDescription&lt;/Description&gt;
</Book>
<Book>
&lt;BookId&gt;b&lt;/BookId&gt;
&lt;Description&gt;bDescription&lt;/Description&gt;
</Book>
</Root>
I want to replace the escaped special characters with the literal characters, so that &lt; becomes <, and so on.
I know I can use:
<xsl:character-map name="escapeMapper">
<xsl:output-character character="&lt;" string="&lt;"/>
<xsl:output-character character="&gt;" string="&gt;"/>
</xsl:character-map>
However, here is the twist: I want to convert the special characters first, then run the resulting XML through other templates. So I want to run the source XML through a template that replaces the special characters and puts the result into a variable:
<xsl:variable name="vrtfPass1">
Now I can use the multi-pass technique and apply other templates using the variable as the source.
How can I convert the special characters into the literal characters?

It sounds like what you want to do is not convert &lt; and &gt; to < and >, but to parse the XML, yes? Are you able to use Saxon extension functions in what you're doing? If so, I think you could do something like this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:saxon="http://saxon.sf.net/">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="loaded">
<xsl:apply-templates />
</xsl:variable>
<xsl:apply-templates select="$loaded/*" />
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="Book/text()[normalize-space()]">
<xsl:copy-of select="saxon:parse(.)"/>
</xsl:template>
</xsl:stylesheet>
Though this assumes that the text children of Book would always be well-formed XML in its own right.
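If an XSLT 3.0 processor is available, the standard parse-xml-fragment() function can probably stand in for the Saxon extension. The following is only a sketch under that assumption: it also assumes the Book text holds escaped markup as in the question, and the mode names pass1 and pass2 as well as the second-pass Description template are made up for illustration.

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Pass 1 builds a temporary tree; pass 2 runs over that tree. -->
  <xsl:template match="/">
    <xsl:variable name="vrtfPass1">
      <xsl:apply-templates mode="pass1"/>
    </xsl:variable>
    <xsl:apply-templates select="$vrtfPass1/node()" mode="pass2"/>
  </xsl:template>

  <!-- Identity copy, shared by both passes. -->
  <xsl:template match="@* | node()" mode="pass1 pass2">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pass 1: turn the escaped markup inside Book into real elements. -->
  <xsl:template match="Book/text()[normalize-space()]" mode="pass1">
    <xsl:copy-of select="parse-xml-fragment(.)"/>
  </xsl:template>

  <!-- Pass 2 (hypothetical example): the parsed elements can now be matched. -->
  <xsl:template match="Book/Description" mode="pass2">
    <Summary><xsl:value-of select="."/></Summary>
  </xsl:template>
</xsl:stylesheet>
```

As with saxon:parse, this only works if the text inside each Book is well-formed once unescaped.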

Related

Is it possible to replace &amp; with & in XSLT?

I tried to do that with replace($val, 'amp;', ''), but it seems that &amp; is an atomic entity to the parser. Any other ideas?
I need it to get rid of double escaping, because I have constructions like &amp;#8112; in the input file.
UPD:
Also one important note: I have to make this substitution only inside specific tags, not inside every tag.
If you serialize there is always (if supported) the disable-output-escaping hack, see http://xsltransform.hikmatu.com/nbUY4kh which transforms
<root>
<foo>a &amp; b</foo>
<bar>a &amp; b</bar>
</root>
selectively into
<root>
<foo>a & b</foo>
<bar>a &amp; b</bar>
</root>
by using <xsl:value-of select="." disable-output-escaping="yes"/> in the template matching foo/text():
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="foo/text()">
<xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>
</xsl:transform>
To achieve the same selective replacement with a character map you could replace the ampersand in foo text() children (or descendants if necessary) with a character not used elsewhere in your document and then use the map to map it to an unescaped ampersand:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output use-character-maps="doe"/>
<xsl:character-map name="doe">
<xsl:output-character character="«" string="&amp;"/>
</xsl:character-map>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="foo/text()">
<xsl:value-of select="replace(., '&amp;', '«')"/>
</xsl:template>
</xsl:transform>
That way
<root>
<foo>a &amp; b</foo>
<bar>a &amp; b</bar>
</root>
is also transformed to
<root>
<foo>a & b</foo>
<bar>a &amp; b</bar>
</root>
see http://xsltransform.hikmatu.com/pPgCcoj for a sample.
If your XML contains &amp;#8112; and you believe that this is a double-escaped representation of the character with codepoint 8112, then you can convert it to this character using the XPath expression
codepoints-to-string(xs:integer(replace($input, '&#([0-9]+);', '$1')))
remembering that if you write this XPath expression in XSLT, then the & must be written as &amp;.
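That expression only converts a value that consists of exactly one such reference. For text that mixes ordinary characters with several double-escaped references, a sketch using XSLT 2.0's xsl:analyze-string could look like the following. The foo match pattern is borrowed from the earlier answer and is only an assumption about which elements need the fix.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Identity copy for everything that needs no fixing. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only text inside foo is scanned for references like &amp;#8112; -->
  <xsl:template match="foo/text()">
    <xsl:analyze-string select="." regex="&amp;#([0-9]+);">
      <xsl:matching-substring>
        <!-- regex-group(1) holds the decimal codepoint -->
        <xsl:value-of select="codepoints-to-string(xs:integer(regex-group(1)))"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Because the result is an ordinary character in the output tree, no disable-output-escaping or character map is needed for this variant.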

Transform the content in the tag using XSLT to characters based on the text

I need to transform the XML below based on the characters in the <mo> elements. I have tried the XSLT 1.0 below. In the <mo> element, { and } written as entities should be transformed to |(text{| and |(text}| respectively, while the literal characters { and } should be transformed to |cbo| and |cbc| respectively. But I am getting |(text}||(text{||(text}||(text{| for the contents of the <mo> elements.
Sample XML:
<chapter xmlns="http://www.w3.org/1998/Math/MathML"><p><math display='block'><mo>{</mo><mo>{</mo><mo>}</mo><mo>}</mo></math></p></chapter>
XSLT 1.0 tried:
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML">
<xsl:output method="xml" encoding="UTF-8" indent="no"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@* | node()"><xsl:copy><xsl:apply-templates select="@* | node()"/>
</xsl:copy></xsl:template>
<xsl:template match="m:mo">
<xsl:choose>
<xsl:when test="(.)='{'"><xsl:text disable-output-escaping="yes">|(text{|</xsl:text>
</xsl:when>
<xsl:when test="(.)='}'"><xsl:text disable-output-escaping="yes">|(text}|</xsl:text>
</xsl:when>
<xsl:when test="(.)='{'"><xsl:text disable-output-escaping="yes">|cbo|</xsl:text></xsl:when>
<xsl:when test="(.)='}'"><xsl:text disable-output-escaping="yes">|cbc|</xsl:text></xsl:when>
</xsl:choose></xsl:template></xsl:stylesheet>
I cannot reproduce the "problem" -- the output produced by the provided XSLT code doesn't contain a substring of:
"|(text}||(text{||(text}||(text{|"
The provided, unreadable code can be simplified to the following simple code -- do note that DOE isn't needed at all:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:m="http://www.w3.org/1998/Math/MathML">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="m:mo">
<xsl:choose>
<xsl:when test=". = '{'">
<xsl:text>|(text{|</xsl:text>
</xsl:when>
<xsl:when test=". = '}'">
<xsl:text>|(text}|</xsl:text>
</xsl:when>
<xsl:when test=". = '{'">
<xsl:text>|cbo|</xsl:text>
</xsl:when>
<xsl:when test=". = '}'">
<xsl:text>|cbc|</xsl:text>
</xsl:when>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Both the original code and its readable, simplified equivalent (above) produce the same result (ignoring differences in indentation):
<chapter xmlns="http://www.w3.org/1998/Math/MathML">
<p>
<math display="block">|(text{||(text{||(text}||(text}|</math>
</p>
</chapter>
I don't know whether this result is "good" or "bad", as the OP hasn't specified what results he wants to be produced, what result he is getting, and why the result he is getting isn't "good".
When the entity for { comes in, it should be transformed to |(text{|, and when the literal character { comes in, it should be transformed to |cbo|.
Once your XML has been through an XML parser, these two inputs are indistinguishable. It's a bit like saying you want to process them differently depending on whether the author typed the text with his left hand or his right hand - they are just different ways of inputting the same data.
If you want to distinguish them, you will need to do some kind of preprocessing so that the difference is retained through XML parsing. One way to do that is Andrew Welch's Lexev tool, which is integrated with KernowForSaxon. However, I would question your design; depending on a lexical difference like this will make your system very fragile.
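A quick way to see this for yourself is a small test stylesheet like the sketch below. It assumes an input such as <math xmlns="http://www.w3.org/1998/Math/MathML"><mo>&#x7B;</mo><mo>{</mo></math>, where the first brace is written as a character reference and the second one literally; that input is an illustration, not the OP's actual file.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:m="http://www.w3.org/1998/Math/MathML">
  <xsl:output method="text"/>

  <!-- Print, for every mo element, whether its value equals a literal { -->
  <xsl:template match="/">
    <xsl:for-each select="//m:mo">
      <xsl:value-of select=". = '{'"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Both lines of output are true, because the parser has already replaced the character reference with the character itself before the stylesheet sees the document.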

remove elements based on external file

I have an external settings file which has some nodes holding attribute values of the main XML document. I need to remove certain nodes from the main XML file if the attribute value is present in the settings file.
My setting file looks like this:
setting.xml
<xml>
<removenode titlename="abc" subtitlename="xyz"></removenode>
<removenode titlename="dvd" subtitlename="dvd"></removenode>
</xml>
Main.xml
<xml>
<title titlename="abc">
<subtitle subtitlename="xyz"></subtitle>
</title>
<title titlename="book">
<subtitle subtitlename="book sub title"></subtitle>
</title>
</xml>
I need a script which reads setting.xml and removes the title element from main.xml if its titlename and subtitlename are found there. The output should be
output.xml
<xml>
<title titlename="book">
<subtitle subtitlename="book sub title"></subtitle>
</title>
</xml>
I tried using document() to read the setting.xml file, but I could not work out how to do the match against main.xml:
<xsl:variable name="SuppressionSettings" select="document('Setting.xml')" />
<xsl:variable name="SuppressSetting" select="$SuppressionSettings/xml/removenode" />
.
Any hint how to implement it?
The key is to use an identity/copy pattern and, before each output, check the current (context) node isn't prohibited by the suppression rules nodeset.
<!-- get suppression settings -->
<xsl:variable name='suppression_settings' select="document('http://www.mitya.co.uk/xmlp/settings.xml')/xml/removenode" />
<!-- begin identity/copy -->
<xsl:template match="node()|@*">
<xsl:if test='not($suppression_settings[@titlename = current()/@titlename and @subtitlename = current()/subtitle/@subtitlename])'>
<xsl:copy>
<xsl:apply-templates select='node()|@*' />
</xsl:copy>
</xsl:if>
</xsl:template>
You can run it here (see output source - the 'abc' title node is omitted):
http://www.xmlplayground.com/9oCYKp
The XSLT below works for the given document.
Note that I'm storing the contents of Setting.xml in a variable as you did; however, I then use that variable directly in my queries.
An important issue here is that in XSLT 1.0 variables cannot be used in the match attribute of a template. Therefore, my template matches any <title> element and then determines in an <xsl:choose> element whether the attributes match any values given in the settings file. If so, the <title> element is omitted from the output.
As an explanation for why that test attribute in the <xsl:when> does what it should, imagine a comparison of someAttribute = someOtherAttribute not as a restriction that the attribute someAttribute must have the same value as the attribute someOtherAttribute, but rather as the condition that there must be any two attributes someAttribute and someOtherAttribute with the same value.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:variable name="SuppressionSettings" select="document('Setting.xml')" />
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="//title">
<xsl:choose>
<xsl:when test="(@titlename = $SuppressionSettings/xml/removenode/@titlename) and (subtitle/@subtitlename = $SuppressionSettings/xml/removenode/@subtitlename)"/>
<xsl:otherwise>
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
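One thing to watch with the test above: the two comparisons are evaluated independently, so a title would also be dropped if its titlename matched one removenode and its subtitlename matched a different one. If both attributes have to come from the same removenode, a variant along these lines might be closer to the intent. It is a sketch only, and the variable names t and s are placeholders.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:variable name="SuppressionSettings" select="document('Setting.xml')"/>

  <!-- Identity copy for everything else. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="title">
    <xsl:variable name="t" select="@titlename"/>
    <xsl:variable name="s" select="subtitle/@subtitlename"/>
    <!-- Copy the title unless a single removenode carries both values. -->
    <xsl:if test="not($SuppressionSettings/xml/removenode[@titlename = $t and @subtitlename = $s])">
      <xsl:copy>
        <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

With the sample files from the question both versions give the same output; they differ only when the settings file mixes title and subtitle names across several removenode entries.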
Here's a more generic answer where the names of the attributes are not hard coded into the XSLT. Like O. R. Mapper pointed out, in XSLT 1.0 you can't use variable references in the match, so I put the document() directly in the predicate. This may not be as efficient as using a variable and then testing the variable.
XSLT 1.0
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[@* = document('setting.xml')/*/removenode/@*]"/>
</xsl:stylesheet>
XML Output (using your 2 xml files with main.xml as the input)
<xml>
<title titlename="book">
<subtitle subtitlename="book sub title"/>
</title>
</xml>

Wrap some fields in CDATA but not others OF THE SAME NAME

Forking from a previous question I asked, how do I ensure that example/one/field and example/three/field are enclosed in CDATA whilst example/two/field is not?
Input:
<?xml version="1.0"?>
<example>
<one>
<field>CDATA required here</field>
</one>
<two>
<field>No CDATA thanks</field>
</two>
<three>
<field>More CDATA please</field>
</three>
</example>
Required output:
<?xml version="1.0"?>
<example>
<one>
<field><![CDATA[CDATA required here]]></field>
</one>
<two>
<field>No CDATA thanks</field>
</two>
<three>
<field><![CDATA[More CDATA please]]></field>
</three>
</example>
I could specify <xsl:output cdata-section-elements="field"/> but this will affect example/two/field as well. I have tried putting in a path like <xsl:output cdata-section-elements="example/one/field example/three/field"/> but this produces an error (Error XTSE0280: Invalid element name. Invalid QName {example/one/field}). Where am I going wrong?
With your current markup I don't think there is a clean way with XSLT. You would need to use different element names or different namespaces at least to allow you and the XSLT processor's serializer to distinguish which elements to output as CDATA sections and which not.
Or you would need to consider using disable-output-escaping, e.g.
<xsl:template match="one/field | three/field">
<xsl:copy>
<xsl:text disable-output-escaping="yes"><![CDATA[<![CDATA[]]></xsl:text>
<xsl:value-of select="."/>
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:copy>
</xsl:template>
[edit]
Here is a complete sample stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="one/field | three/field">
<xsl:copy>
<xsl:text disable-output-escaping="yes"><![CDATA[<![CDATA[]]></xsl:text>
<xsl:value-of select="."/>
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Note however that disable-output-escaping is an optional serialization feature that is not supported by all XSLT processors.

Can an XSLT processor preserve empty CDATA sections?

I'm processing an XML document (an InstallAnywhere .iap_xml installer) before handing it off to another tool (InstallAnywhere itself) to update some values. However, it appears that the XSLT transform I am using is stripping CDATA sections (which appear to be significant to InstallAnywhere) from the document.
I'm using Ant 1.7.0, JDK 1.6.0_16, and a stylesheet based on the identity:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" encoding="UTF-8" cdata-section-elements="string" />
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Basically, "string" nodes that look like:
<string><![CDATA[]]></string>
are being processed into:
<string/>
From reading XSLT FAQs, I can see that what is happening is legal as far as the XSLT spec is concerned. Is there any way I can prevent this from happening and convince the XSLT processor to emit the CDATA section?
Found a solution:
<xsl:template match="string">
<xsl:element name="string">
<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text><xsl:value-of select="text()" disable-output-escaping="yes" /><xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:element>
</xsl:template>
I also removed the cdata-section-elements attribute from the <xsl:output> element.
Basically, since the CDATA sections are significant to the next tool in the chain, I output them manually.
To do this, you'll need to add a special case for empty string elements and use disable-output-escaping. I don't have a copy of Ant to test with, but the following template worked for me with libxml's xsltproc, which exhibits the same behavior you describe:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" cdata-section-elements="string"/>
<xsl:template match="string">
<xsl:choose>
<xsl:when test=". = ''">
<string>
<xsl:text disable-output-escaping="yes">&lt;![CDATA[]]&gt;</xsl:text>
</string>
</xsl:when>
<xsl:otherwise>
<xsl:copy-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Input:
<input>
<string><![CDATA[foo]]></string>
<string><![CDATA[]]></string>
</input>
Output:
<input>
<string><![CDATA[foo]]></string>
<string><![CDATA[]]></string>
</input>
Once the XML parser has finished with the XML, there is absolutely no difference between <![CDATA[abc]]> and abc. The same is true for an empty string: <![CDATA[]]> resolves to nothing at all and is silently ignored. It has no representation in the XML model. In fact, there is no way to tell the difference between CDATA and regular strings, because the CDATA markup itself has no representation in the XML model.
Sorry.
Now, why would you want this? Perhaps there is another solution which can help you?