XSLT - Extract and manipulate portion of XML data - xslt

The input XML:
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Description><![CDATA[Audience: Andrew Reed, Senior Training Specialist, Microsoft Corporation<br/>This session is for individuals who spend significant time writing and creating documents and have some familiarity with Microsoft Word.<br/>Thanks.]]></Description>
</root>
The XSLT:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
<xsl:output method="html" indent="yes"/>
<xsl:template match="/root">
<div>
<xsl:value-of disable-output-escaping="yes" select="Description"/>
</div>
</xsl:template>
</xsl:stylesheet>
I need to add a couple of more BR tags after first occurrence of BR, that's after Audience line and before other description starts.
Can you please modify my XSLT to get the desired output?
So I want output like below:
Audience: Andrew Reed, Senior Training Specialist, Microsoft Corporation
This session is for individuals who spend significant time writing and creating documents and have some familiarity with Microsoft Word.
Thanks.

It would be nice if your input data had the <br/> elements as actual elements, instead of being escaped, so that they could be selected directly using XPath.
But since they are as they are, you can use regexp replace, relying on the assumption that they will always conform to a limited range of patterns. You will often be warned not to parse XML or HTML in general using regexps, and rightly so, because regexps aren't a general solution. But for limited uses they can be sufficient.
If I've guessed your requirements correctly, you could use something like
<xsl:value-of select="replace(Description, '<[Bb][Rr] ?/?>',
'
')"/>
That will give you the sample output you showed, as opposed to adding a couple of more BR tags after first occurrence of BR. It will tolerate some variation, e.g. <br> or <BR />.
This is assuming you can use XSLT 2.0, because replace() isn't available in 1.0. If you're limited to 1.0, please let me know.

Related

XSLT: xml element not tranformed if element have 'xmlns' within the element, [duplicate]

I'm trying to extract the headline from the below XML from the Met Office web service using XSLT, however my XSLT select returns blank.
SOURCE:
<RegionalFcst xmlns="www.metoffice.gov.uk/xml/metoRegionalFcst" createdOn="2016-01-13T02:14:39" issuedAt="2016-01-13T04:00:00" regionId="se">
<FcstPeriods>
<Period id="day1to2">
<Paragraph title="Headline:">Frosty start. Bright or sunny day.</Paragraph>
<Paragraph title="Today:">A clear and frosty start in west, but cloudier in Kent with isolated showers. Then dry with sunny periods. Increasing cloud in west later will bring coastal showers with freshening southerly winds. Chilly inland, but less cold near coasts. Maximum Temperature 8C.</Paragraph>
</Period>
</FcstPeriods>
</RegionalFcst>
My XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
<xsl:value-of select="FcstPeriods/Period/Paragraph"/>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
I've changed the root to /RegionalFcst and attempted other similar changes, such as adding a leading slash before FcstPeriods, but nothing works until I remove the first and last line from the source XML - then it works perfectly.
This is fine in testing, but of course I want to use the web service provided by Met Office and that's how they present it.
Any ideas?
The problem: your XML puts its elements in a namespace.
Solution: declare the same namespace in your stylesheet, assign it a prefix and use that prefix to address the elements in the source XML:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:met="www.metoffice.gov.uk/xml/metoRegionalFcst"
exclude-result-prefixes="met">
<xsl:template match="/">
<html>
<body>
<xsl:value-of select="met:RegionalFcst/met:FcstPeriods/met:Period/met:Paragraph[#title='Headline:']"/>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
Additional to the answer of "michael.hor257k", there is another solution, for the version 2.0 of XSLT.
XSLT 2.0
Use xpath-default-namespace attribute. For the example above it looks like this:
<xsl:stylesheet xpath-default-namespace="www.metoffice.gov.uk/xml/metoRegionalFcst" ... >
Then you don't need to repeat the namespace prefix in every element referenced by XPath:
<xsl:value-of select="FcstPeriods/Period/Paragraph"/>
instead of
<xsl:value-of select="met:FcstPeriods/met:Period/met:Paragraph"/>

Regex of XML with multiple tags

I'm trying to find all text that is not within the XML markup:
<transcript>
<text start="9.75" dur="5.94">welcome to about my property here you
can learn more about how your property</text>
<text start="15.69" dur="4.71">was assessed see the information impact
has on file and compare your property to</text>
<text start="20.4" dur="1.3">others in your neighborhood</text>
<text start="21.7" dur="5.32">interested in learning about market
trends in your municipality no problem</text>
<text start="105.79" dur="6.23">I have all of this and more about life property
. see your property assessment know more</text>
<text start="112.02" dur="0.11">about</text>
</transcript>
I am using the following regex pattern, but obviously it is not correct because it grabs all of the text between the opening and closing <transcript> tags:
<transcript>[\s\S]*?<\/transcript>
How can modify this regex pattern to select only the text that is not within any of the markup tags?
Use XSLT. XSLT is a language specifically designed to convert XML into another output format (back to valid XML again, or something else such as (X)HTML, plain text, or any other format – but preferably, based on plain text).
In this case the smallest XSLT necessary is just this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0" >
<xsl:output method="text" indent="no" />
<xsl:template match="text">
<!-- do NOTHING here! -->
</xsl:template>
</xsl:stylesheet>
This works because the default for processing a single XML tag is to recursively apply template matches to its containing tags, and plain text will always be copied. The only tag inside your <template> is <text>, and you process it by doing 'nothing' – i.e., by not copying its contents to the output. The line inside that template is just a comment.
All other "nodes", in XML terminology, are those without a surrounding tag and so are copied to the output.
Alternatively, if you have more types of tags than just <text> elements and you want to skip all of them, apply templates to / and transcript to process each and apply another to * (which will select all remaining tags not specified elsewhere) to not process them:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0" >
<xsl:output method="text" indent="no" />
<xsl:template match="/">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="transcript">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="*">
<!-- do NOTHING here! -->
</xsl:template>
</xsl:stylesheet>
Again, the plain untagged text will fall through and not get processed, so their contents will be copied to output.
Both XSLT stylesheets will output only I ha, the only part in your sample text that is not surrounded by tags.
Do you want to find
welcome to about my property here you can learn more about how your property
from
<text start="9.75" dur="5.94">welcome to about my property here you can learn more about how your property</text>
??
Than it will work.
(?<=>).+?(?=<)

Problems Trying to Pretty Print XSLT Output

this is my first post so please let me know if I can make it more constructive in any way. I have read the forum guidelines so if I inadvertantly break them in anyway it will be nothing more than an innocent mistake.
The Question
Is a simple one:
How do I pretty print the output of an XSL file?
But with some criteria:
Using only native XSL functionality.
Without having to use a second XSL file to do a 'second pass'.
It must also work for elements with mixed content.
I have googled this reasonably thoroughly but have not found a clear answer to this question. I have only used XSL for about a week so go easy if I have somehow missed the answer elsewhere.
An Example
This XML...
<email>
<attachedItem>priceless photograph.jpg</attachedItem>
<attachedItem>important document.doc</attachedItem>
<attachedItem>access codes.pdf</attachedItem>
</email>
...Transformed by this XSL...
<!-- Pretty Print Output -->
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<email>
"Please find attached the stuff."
<xsl:apply-templates/>
</email>
</xsl:template>
<xsl:template match="attachedItem">
<xsl:copy/>
</xsl:template>
...Produces this result...
<?xml version="1.0" encoding="utf-8"?>
<email>
"Please find attached the stuff."
<attachedItem>priceless photograph.jpg</attachedItem>
<attachedItem>important document.doc</attachedItem>
<attachedItem>access codes.pdf</attachedItem>
</email>
Using the Saxon6.5.5 engine
Desired Output
<?xml version="1.0" encoding="utf-8"?>
<email>
"Please find attached the stuff."
<attachedItem>priceless photograph.jpg</attachedItem>
<attachedItem>important document.doc</attachedItem>
<attachedItem>access codes.pdf</attachedItem>
</email>
My Own Progress on the Problem
From the XSL above you will see I have discovered the use of <xsl:strip-space> and <xsl:output>. This meets the first 2 criteria but not the 3rd. In other words, it produces nice pretty printed XML without the mixed content, but with it I recieve the undesired output you can see above.
I know that the reason I get this output is because of the way whitespace is preserved in the source XML. White space is always preserved if it is part of a text node that contains other non-whitespace characters, regardless of the <xsl:strip-space> instructions. However despite my understanding I still cannot think of a solution.
Although I have addressed the first 2 criteria myself I would still like to know if this is the best way to achieve a pretty printed result.
Thanks in advance!
The following stylesheet produces exactly the output you request. The transformation was performed with Saxon 6.5.5. The correct indentation can only be achieved by meticulously typing all the line feed (
) and space ( )characters.
Note that pretty printing XML has no meaning when text content is concerned. The indentation of element tags can be easily controlled, but text nodes of elements with mixed content are always a problem. An application that takes XML as input should never rely on the exact indentation or whitespace handling of text content in XML.
In general, it is considered a bad idea to directly output literal text in an XSLT stylesheet. Always put text content inside xsl:text. xsl:strip-space has an effect only on whitespace-only text nodes of elements that belong to the input XML document (as suggested by #TobiasKlevenz already).
Stylesheet
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<!-- Pretty Print Output -->
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<email>
<xsl:text>
"Please find attached the stuff."
</xsl:text>
<xsl:apply-templates/>
</email>
</xsl:template>
<xsl:template match="attachedItem|text()">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:transform>
Output
<?xml version="1.0" encoding="utf-8"?>
<email>
"Please find attached the stuff."
<attachedItem>priceless photograph.jpg</attachedItem>
<attachedItem>important document.doc</attachedItem>
<attachedItem>access codes.pdf</attachedItem>
</email>
you can wrap "Please find attached the stuff." in an
<xsl:text>
which would produce my assumption of your desired result, if not please post a 'desired output' example/.

XSL for-each loop is not working

I'm using Java to transform an XML document to text:
Transformer transformer = tFactory.newTransformer(stylesource);
transformer.transform(source, result);
This seems to work except when there are colons in the XML document. I tried this example:
XML file:
<?xml version="1.0" encoding="UTF-8"?>
<test:TEST >
<one.two:three id="my id" name="my name" description="my description" >
</one.two:three>
<one.two:three id="some id" name="some name" description="some description" />
</test:TEST>
XSL file:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xmi="http://www.omg.org/XMI"
xmlns:one.two="http://www.one.two/one.two:three" >
<xsl:output method="text" indent="yes" omit-xml-declaration="yes"/>
<xsl:variable name="myVariable">one.two:three</xsl:variable>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="*[substring(name(),1,9)='test:TEST']" >
<xsl:for-each select="./$myVariable">
inFirstLoop
</xsl:for-each>
<xsl:for-each select="./one.two:three">
inSecondLoop
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The result of the transformation I'm getting is a single line:
inFirstLoop
I'm expecting 4 lines of output
inFirstLoop
inFirstLoop
inSecondLoop
inSecondLoop
How do I fix this? Any help is greatly appreciated. Thanks.
There are multiple things wrong here. I'm surprised your transformation managed to run at all, instead of failing on parse errors and other errors.
One big problem is that your input XML uses namespace prefixes (that's what the colons are for) without declaring them. Declarations like
xmlns:one.two="http://www.one.two/one.two:three"
need to be in the source XML, as well as in the XSL. Otherwise your source XML is not well-formed (according to namespace rules).
A second problem is the XPath expression
./$myVariable
which should have thrown an error. I think what you wanted was
*[name() = $myVariable]
The third change I would make is not an error in the XSLT, but just a poor way of doing things... If you want to match <test:TEST>, you should use namespace tools to refer to namespaces. Therefore, instead of
<xsl:template match="*[substring(name(),1,9)='test:TEST']" >
use
<xsl:template match="test:TEST">
Much cleaner. Then you need to put in a namespace declaration on the outermost element of the stylesheet, as you already have to do in the input XML document:
xmlns:test="...test..."
XML namespaces, like driving a car, are a topic better learned from a little training than by trial-and-error. Reading a brief article like this will help you avoid a lot of confusion and pain down the road.

XSLT help in declaring namespace

need some help in resolving the following issue. I need to transform the below input(XML) to the mentioned output(XML).
<Header>
<End_Date xsi:nil="true"/>
<Header>
To the following format.
<Header>
<End_Date xsi:nil="true" xmlns:xsi"http://www.w3.org/2001/XMLSchema"/>
<Header>
This is the stylesheet:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<HEADER>
<xsl:for-each select="HEADER">
<xsl:sequence select="(./#node(), ./node())"/>
</xsl:for-each>
</HEADER>
</xsl:template>
</xsl:stylesheet>
Thanks in advance.
Gabriel
Am I right in thinking you want to reproduce a nearly exact copy of the input XML, with the addition of the xsi namespace declaration that is lacking from the input?
First, as it is now, your input is not well-formed XML, just because of the lacking xsi namespace declaration. Hence, there's no way to use XSLT for adding it: any XSLT processor will choke on the input's non-well-formedness.
Second, you have to check case sensitivity: currently, no input nodes are matched by the <xsl:for-each select="HEADER"> select expression. If you change it to "Header", the template rule will indeed replace the <Header> input with <HEADER>, whose content is copied identically. But... only if you have the namespace declarations in the input right...
So, if the purpose is indeed to 'upgrade' non-well-formed XML to a well-formed version, I'd suggest to look for other tools, such as Perl, Awk, or any other simple search/replace solution that operates on plain text and could just add the missing namespace declaration to the document element:
<Header xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<End_Date xsi:nil="true"/>
</Header>
(Of course, you could also make use of XSLT 2.0's unparsed-text($href) function that lets you read any file as unparsed text, which you could then further process with <xsl:analyze-string>. See Michael Kay's article Up-conversion using XSLT 2.0 for further inspiration. Since this a rather awkward way to process non-XML with an XML tool, I give this merely for completeness -- if adding the namespace prefix is the only problem to be solved, I'd definitely go for a cheaper search/replace option.)
Hope this helps,
Ron