Parse HTML inside the CDATA text - xslt

I need the data inside the CDATA sections to be parsed as HTML.
<?xml version="1.0" encoding="utf-8" ?>
<test>
<test1>
<![CDATA[ <B> Test Data1 </B> ]]>
</test1>
<test2>
<![CDATA[ <B> Test Data2 </B> ]]>
</test2>
<test3>
<![CDATA[ <B> Test Data3 </B> ]]>
</test3>
</test>
From the above input XML, I need the CDATA content to be rendered as HTML.
But I am getting the tags as literal text in the output:
<B>Test Data1</B>
<B>Test Data2</B>
<B>Test Data3</B>
What I actually need is the text rendered in bold:
**Test Data1
Test Data2
Test Data3**
The input comes from an external system, so we cannot change the text inside the CDATA sections.

Parsing as HTML is only possible with an extension function (or with XSLT 2.0 and an HTML parser written in XSLT 2.0). However, if you want to create HTML output and emit the contents of the testX elements as HTML markup, you can do that with, e.g.:
<xsl:template match="test/*[starts-with(local-name(), 'test')]">
<xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>
Note however that disable-output-escaping is an optional serialization feature not supported by all XSLT processors in all use cases. For instance with client-side XSLT in Mozilla browsers it is not supported.
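To see why the tags show up literally by default, here is a small illustration outside XSLT, in Python's standard library (the wrapper element name div is just an assumption for the demo). After parsing, the CDATA content is plain text, and a serializer escapes it unless told otherwise; disable-output-escaping asks the serializer to skip that step.

```python
import xml.etree.ElementTree as ET

SOURCE = """<test>
<test1><![CDATA[ <B> Test Data1 </B> ]]></test1>
</test>"""

root = ET.fromstring(SOURCE)
# After parsing, the CDATA content is an ordinary text node.
text = root.find("test1").text

# Normal serialization escapes markup characters in text nodes.
out = ET.Element("div")  # hypothetical wrapper element for the demo
out.text = text
escaped = ET.tostring(out, encoding="unicode")

# Writing the text unescaped is, in effect, what
# disable-output-escaping="yes" requests from the XSLT serializer.
raw = "<div>" + text + "</div>"

print(escaped)
print(raw)
```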

If you have to stay with XSLT 1.0, you have to run two transformation passes.
The first pass copies your XML but removes the CDATA sections by emitting their content with disable-output-escaping="yes" (see the answer from @Martin Honnen).
In the second pass you can access the HTML part as ordinary elements.
But this is only possible if the HTML part follows the rules for well-formed XML (XHTML). If it does not, an input switch such as the one in xsltproc may help to work with HTML, e.g.:
--html: the input document is(are) an HTML file(s)
See also: Convert an xml element whose content is inside CDATA
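The well-formedness caveat for the second pass can be checked quickly outside XSLT. This is only a sketch in Python; the helper name parse_embedded and the second sample fragment are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_embedded(fragment):
    """Second pass: parse the unwrapped CDATA text as XML.

    Returns the root element, or None if the fragment is not well-formed.
    """
    try:
        return ET.fromstring(fragment.strip())
    except ET.ParseError:
        return None

# Well-formed fragment: becomes a real element tree.
good = parse_embedded(" <B> Test Data1 </B> ")
# HTML-style unclosed <br>: not well-formed XML, so the second pass fails.
bad = parse_embedded(" <B> Test Data1 <br> </B> ")

print(good.tag if good is not None else None)
print(bad)
```

When the second case applies, an HTML-aware parser (like the xsltproc --html switch mentioned above) is needed instead of an XML parser.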

Related

Regex of XML with multiple tags

I'm trying to find all text that is not within the XML markup:
<transcript>
<text start="9.75" dur="5.94">welcome to about my property here you
can learn more about how your property</text>
<text start="15.69" dur="4.71">was assessed see the information impact
has on file and compare your property to</text>
<text start="20.4" dur="1.3">others in your neighborhood</text>
<text start="21.7" dur="5.32">interested in learning about market
trends in your municipality no problem</text>
<text start="105.79" dur="6.23">I have all of this and more about life property
. see your property assessment know more</text>
<text start="112.02" dur="0.11">about</text>
</transcript>
I am using the following regex pattern, but obviously it is not correct because it grabs all of the text between the opening and closing <transcript> tags:
<transcript>[\s\S]*?<\/transcript>
How can I modify this regex pattern to select only the text that is not within any of the markup tags?
Use XSLT. XSLT is a language specifically designed to convert XML into another output format (back to valid XML again, or something else such as (X)HTML, plain text, or any other format – but preferably, based on plain text).
In this case the smallest XSLT necessary is just this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0" >
<xsl:output method="text" indent="no" />
<xsl:template match="text">
<!-- do NOTHING here! -->
</xsl:template>
</xsl:stylesheet>
This works because the default for processing an element is to recursively apply templates to the nodes it contains, and plain text is always copied. The only element inside your <transcript> is <text>, and you process it by doing 'nothing', i.e., by not copying its contents to the output. The line inside that template is just a comment.
All other text nodes, those not inside a <text> element, are copied to the output by the built-in rules.
Alternatively, if you have more types of elements than just <text> and you want to skip all of them, add templates for / and transcript that simply process their children, and another for * (which matches every remaining element not matched elsewhere) that does nothing:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0" >
<xsl:output method="text" indent="no" />
<xsl:template match="/">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="transcript">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="*">
<!-- do NOTHING here! -->
</xsl:template>
</xsl:stylesheet>
Again, the plain untagged text is not matched by any of these templates, so it falls through to the built-in rule and is copied to the output.
Both XSLT stylesheets will output only I ha, the only part in your sample text that is not surrounded by tags.
Do you want to find
welcome to about my property here you can learn more about how your property
from
<text start="9.75" dur="5.94">welcome to about my property here you can learn more about how your property</text>
?
Then this will work:
(?<=>).+?(?=<)
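For comparison, here is a sketch in Python of that pattern next to a parser-based extraction, on a shortened copy of the sample. Two assumptions to note: the DOTALL flag is added so that . can cross the line breaks inside a <text> element, and whitespace-only matches between tags are filtered out.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<transcript>
<text start="9.75" dur="5.94">welcome to about my property here you
can learn more about how your property</text>
<text start="112.02" dur="0.11">about</text>
</transcript>"""

# Regex approach: text preceded by '>' and followed by '<'.
pattern = re.compile(r"(?<=>).+?(?=<)", re.DOTALL)
regex_hits = [m for m in pattern.findall(SAMPLE) if m.strip()]

# Parser approach: read the text of every <text> element.
root = ET.fromstring(SAMPLE)
parsed_hits = [el.text for el in root.iter("text")]

print(regex_hits == parsed_hits)
```

The regex version breaks as soon as an attribute value contains a > character, which is one reason a parser is usually the safer choice.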

display xml elements inside xml using xslt

I am using XSLT 1.0 and I have XML which looks like this:
<item name="note"><value><p>Add the &lt;bean&gt; tag pased below to the &lt;beans&gt; element in this file .... </value></item>
I want to display it like this in HTML
Add the <bean> tag passed below to the <beans> element in this file.
Note that the <p> will be converted to a paragraph tag because I use disable-output-escaping="yes".
This is what I have in my xslt
<xsl:template match="item[@name='note']">
<xsl:value-of select="value" disable-output-escaping="yes" />
</xsl:template>
With this XSLT, the <bean> and <beans> tags are ignored and do not get displayed on the page. How do I make sure they are displayed the way I want?
The problem is that some of your entities have been "double escaped".
&lt;bean&gt; should be <bean>
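How escaping levels interact can be illustrated with a small Python sketch (not XSLT; the sample strings are assumptions based on the question). Each parsing step removes exactly one level of escaping, so whether a browser finally sees a tag or literal text depends on how many levels the source carried.

```python
import html

single = "Add the &lt;bean&gt; tag"          # one level of escaping
double = "Add the &amp;lt;bean&amp;gt; tag"  # two levels of escaping

# One unescaping step, as performed by a single parse.
after_one_parse_single = html.unescape(single)
after_one_parse_double = html.unescape(double)

print(after_one_parse_single)  # markup: a browser would treat <bean> as a tag
print(after_one_parse_double)  # still escaped: a browser would show <bean> as text
```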

XSLT conditionally change text font

For the following xml,
<question>
<bp>Suppose a file a.xml has content:</bp>
<bp><![CDATA[<a> 1 <b> 2 <b> 3 <a> 4 </a> </b> </b> </a>]]></bp>
<bp>What is the value of the following XPath expression:</bp>
<bp>for $x in doc("a.xml")//a/b return $x/b/a/text()</bp>
</question>
In the XSLT file, I have to change the font of the text if the text between the xml tags contains
<![CDATA[ ]]>
I tried using the following code,
<xsl:for-each select="mcq:bp">
<xsl:if test="contains(. , '<![CDATA[ ]]>')">
<xsl:attribute name='font-family'>courier</xsl:attribute>
<xsl:value-of select="."/>
</xsl:if>
<xsl:value-of select="."/>
<br/>
</xsl:for-each>
But the xslt does not display anything in the browser.
This cannot be done with XSLT.
At the time XSLT is passed the parsed XML document, there isn't any information whether a text node contained CDATA sections or not -- this lexical detail is stripped-off (lost) as result of the parsing of the XML document.
CDATA isn't a string and it isn't part of the text node. Therefore, it is wrong to try to detect a CDATA section by using the contains() function.
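The same loss of information can be observed with any XML parser. As a quick check outside XSLT, here is a sketch with Python's standard library on a shortened version of the sample: the CDATA form and the escaped form produce identical text nodes, and the CDATA marker never appears in the text.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<bp><![CDATA[<a> 1 <b> 2 </b> </a>]]></bp>")
escaped = ET.fromstring("<bp>&lt;a&gt; 1 &lt;b&gt; 2 &lt;/b&gt; &lt;/a&gt;</bp>")

# Both documents yield exactly the same text node.
same = with_cdata.text == escaped.text
# The CDATA delimiters are syntax, not content.
marker_visible = "CDATA" in with_cdata.text

print(same, marker_visible)
```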

Need to validate a string, whether it contains html or xml using xslt

I have an xsl:variable whose contents might be HTML or XML or binary.
I'm displaying the value of the variable in a textarea in a html page.
If the variable contains HTML or XML data, it is displayed unformatted in the textarea.
<xsl:variable name="outputString">
//html or xml or binary data goes in here
</xsl:variable>
<xsl:template match="/">
<html>
<body>
<textarea name="output" cols="20" rows="20">
<xsl:value-of select="$outputString" />
</textarea>
</body>
</html>
</xsl:template>
All I need is to display the contents of the variable in a formatted way inside the textarea if the contents are either HTML or XML.
You'll need processor extensions to do this job, so the answer depends on which XSLT processor you are using.
Well, I would need to try it out, but I believe you could do the following:
<xsl:if test="matches($outputString, '&lt;(.*)&gt;.*&lt;/\1&gt;')">
</xsl:if>
or in your case I would rather use xsl:choose.
matches() requires XSLT 2.0; contains() only does a plain substring comparison and does not accept a regular expression. Inside an XPath 2.0 pattern a back-reference is written \1; $1 is used only in the replacement string of replace(). Note that < and > must be escaped as &lt; and &gt; inside the test attribute.
If the variable holds XML or HTML with a matching pair of tags on one line, that would detect it.
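If the check can be done before the value reaches the stylesheet, a parse attempt is more reliable than a pattern. Here is a minimal sketch in Python; the function name classify and its return labels are hypothetical, and it only recognises well-formed XML (which includes XHTML), not tag-soup HTML.

```python
import xml.etree.ElementTree as ET

def classify(value):
    """Return "xml" if value parses as well-formed XML, else "other"."""
    try:
        ET.fromstring(value.strip())
        return "xml"
    except ET.ParseError:
        return "other"

print(classify("<a><b>x</b></a>"))
print(classify("just some plain text"))
print(classify("<a><b>x</a>"))
```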

XSLT management - attaching metadata to a stylesheet for output and parameters

I am using about a dozen XSLT files to provide a large number of output formats. At the moment the user has to know the extension of the file format being exported to e.g. RTF, HTML, TXT.
I would also like to use parameters to allow more options. If I can embed the metadata in the XSL file itself then I can pick up the details by scanning through the files.
Here is what I am thinking about. In this example the program would have to parse the comments for the required information.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Title: Export to Rich Text Format -->
<!-- Description: This Stylesheet converts to a Rich Text Format format which may be used in a word processor such as Word -->
<!-- FileFormat: RTF -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:param name="CompanyName"/> <!-- Format:String, Description: Company name to be inserted in the footer -->
<xsl:param name="DateDue"/> <!-- Format:Date-yyyy-mm-dd, Description: Date Due -->
<xsl:param name="IncludePicture">true</xsl:param><!-- Format:Boolean, Description: Do you want to include a graphical representation? -->
<xsl:template match="/">
<!-- Stuff -->
</xsl:template>
</xsl:stylesheet>
Are there any standards out there? Do I need to butcher more than one (Dublin Core with a smattering of XML Schema)?
P.S. the project this is being applied to is Argumentative.
"Here is what I am thinking about. In this example the program would have to parse the comments for the required information."
You don't need to code the metadata within comments.
Metadata can be specified as part of the XSLT stylesheet using ordinary XML markup -- as rich in structure and meaning as we need.
Here is an example how to do that:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:meta="my:meta">
<xsl:output method="text"/>
<meta:metadata>
<title>Title: Export to Rich Text Format </title>
<description>
This Stylesheet converts to a Rich Text
Format format which may be used in a word processor
such as Word
</description>
<fileFormat>RTF</fileFormat>
<parameters>
<parameter name="CompanyName" format="xs:string"
Description="Company name to be inserted in the footer"/>
<parameter name="DateDue" format="xs:date"
Description="Date Due"/>
<parameter name="IncludePicture" format="xs:boolean"
Description="Do you want to include a graphical representation?"/>
</parameters>
</meta:metadata>
<xsl:param name="CompanyName"/>
<xsl:param name="DateDue"/>
<xsl:param name="IncludePicture" select="true()"/>
<xsl:variable name="vMetadata" select=
"document('')/*/meta:metadata"/>
<xsl:template match="/">
This is a demo of how we can access and use the metadata.
Metadata --> Description:
"<xsl:copy-of select="$vMetadata/description"/>"
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to any XML document (its content is not used), the result is:
This is a demo of how we can access and use the metadata.
Metadata --> Description:
"
This Stylesheet converts to a Rich Text
Format format which may be used in a word processor
such as Word
"
Do note:
Any element that is in a namespace other than the XSLT namespace (and not in no namespace) can be placed at the top level of an XSLT stylesheet.
Such elements can be accessed using the XSLT function document(); document('') returns the stylesheet itself.
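Because the metadata is ordinary XML, the host program can also read it without running the transformation. A minimal sketch in Python, assuming the my:meta namespace and element names from the example above (the function name read_metadata and the shortened stylesheet are hypothetical):

```python
import xml.etree.ElementTree as ET

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:meta="my:meta">
  <meta:metadata>
    <title>Export to Rich Text Format</title>
    <fileFormat>RTF</fileFormat>
    <parameters>
      <parameter name="CompanyName" format="xs:string"/>
      <parameter name="DateDue" format="xs:date"/>
    </parameters>
  </meta:metadata>
</xsl:stylesheet>"""

NS = {"meta": "my:meta"}

def read_metadata(xslt_source):
    """Extract title, file format and parameter types from a stylesheet."""
    root = ET.fromstring(xslt_source)
    meta = root.find("meta:metadata", NS)
    if meta is None:
        return None  # stylesheet carries no metadata block
    return {
        "title": meta.findtext("title", default="").strip(),
        "file_format": meta.findtext("fileFormat", default="").strip(),
        "parameters": {
            p.get("name"): p.get("format") for p in meta.iter("parameter")
        },
    }

info = read_metadata(STYLESHEET)
print(info)
```

Scanning a directory of stylesheets then comes down to calling this on each file's contents and building the export menu from the results.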