Parsing HTML using XSLT

I have an XML document that contains a CDATA section. I have managed to fetch the CDATA text using XSLT.
But the CDATA section contains HTML. Can anyone help me parse that HTML? Below is my code:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xsl:output method="html" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="dummy">
<xsl:value-of select="somexpath"/>
</xsl:variable>
</xsl:template>
</xsl:stylesheet>
The output up to this point is HTML:
<div class="feed-description">
<p style="text-align: justify;">Les amateurs du jeu Dance Central 3 pourront ajouter quelques nouvelles pièces à leur collection en février. Parmi les artistes qui seront disponibles via téléchargements, on retrouve le groupe de l'heure One Direction, Justin Bieber, Ellie Goulding et B.o.B. Dès demain le <strong>5 février</strong>, vous pourrez danser sur la chanson ''<strong>What Makes You Beautilful</strong>'' de One Direction.
</p>
</div>
Now I want to read the inner text of the p tag using XSLT.
Please help me out.

Please try:
<xsl:variable name="dummy">
<xsl:value-of select="msxsl:node-set(somexpath)//p/text()"/>
</xsl:variable>

You have several options, none of them what you say you would like (sorry). Those that occur to me off the bat (there are surely others) include:
You can create a workflow that extracts the HTML, passes it to Tidy or some similar tool to produce XHTML, reinserts it into the document as markup instead of as a character sequence, and then runs your stylesheet on the result.
You can write an HTML parser in XSLT, to take the character sequence in your input document and produce an element structure for it. This will be tedious, error-prone, and time-consuming, and by the time you finish it the major browsers will have come out with new versions that handle corner cases differently, so your users will complain bitterly that your parser isn't doing it 'right'. But if you like doing that sort of thing it will be challenging and enjoyable and when you succeed you will have serious hacker cred.
You can look for extensions to XSLT processors that can handle this task for you. (I don't know of any off hand, but that doesn't mean they can't exist.)
You can re-think the design of your project. You might, for example, restructure the input so that instead of HTML disguised as character data, the content of the element in question is actual markup of the kind XSLT is designed to process. You might move the processing of the HTML payload out of the XSLT and into JavaScript and the DOM. You might wash your hands of the whole problem and move to the South Seas.
Good luck.
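
As a rough sketch of the "restructure the input" option: assuming the feed-description div arrives as real markup (not as text inside CDATA) and is in no namespace, a stylesheet along these lines would read the text of each p element. The selection path is an assumption about the input, not something taken from the question.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="utf-8"/>

  <xsl:template match="/">
    <!-- Assumption: the div is genuine markup somewhere in the input,
         not a character sequence inside a CDATA section. -->
    <xsl:for-each select="//div[@class='feed-description']/p">
      <!-- The string value of p includes the text of nested elements
           such as strong; normalize-space() tidies the whitespace. -->
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

If the HTML is still a character sequence when the stylesheet runs, //div selects nothing, which is why the conversion to markup has to happen first.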

Related

XSL Substring-before with HTML characters

I am pulling in an rss feed which contains a joke followed by a number of links to share the joke on different services. As shown below:
It may be worth noting that when I tried to copy and paste the text from this output, the links did not copy into notepad, and pasted as pictures into MS Word.
In my XSL I am using substring-before to try to exclude these links from my output, but the only consistent string I can think of to split on is the <a href from the hyperlinks, which will always be at the end. Is this possible? My first attempt failed; is there an escape character I should include?
Perhaps I will just try to exclude the last X characters to remove the links.
Unfortunately, I could not find an XML version of the feed either; my source is here: http://feeds.feedburner.com/DailyJokes-ACleanJokeEveryday?format=xml
Here is the XSL I am working with, which is currently hard-coded to break at the end of the most recent joke (my next hurdle is to iterate through this list):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html"/>
<xsl:template match="/">
<xsl:apply-templates select="//item[position() &lt; 2]"/>
</xsl:template>
<xsl:template match="item">
<content-item>
<h1><xsl:value-of select="title"/></h1>
<p><xsl:value-of select="substring-before(description, 'mower')" disable-output-escaping="yes"/></p>
<br/><br/>
<p>"The following is here for testing purposes and will be removed"<br/><br/><xsl:value-of select="substring-after(description, 'lawn')" disable-output-escaping="yes"/></p>
<br/><br/>
</content-item>
</xsl:template>
</xsl:stylesheet>
I am rendering my output via a SharePoint 2013 RSS feed web part.

In trying to view the proper XML I discovered the solution. I viewed the page source for my source URL, and in it I saw that the end of each description looks like this:
<title>Hunting with a wife #Joke #Humor</title><description>A hunter visited another hunter and was given a tour of his home. In the den was a stuffed lion.&lt;br /&gt;&lt;br /&gt;The visiting hunter asked, "when did you bag him?"&lt;br /&gt;&lt;br /&gt;The host said, "that was three years ago, when I went hunting with my wife."&lt;br /&gt;&lt;br /&gt;"What's he stuffed with," asked the visiting hunter. "My wife."&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/DailyJokes-ACleanJokeEveryday?a=RT1LsKVBV3Y:0LcrJjJq2X4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/DailyJokes-ACleanJokeEveryday?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href....
The point is that the a href does not start with a literal <; the feed uses the escaped form &lt;.
substring-before(description, '&lt;a href') works.
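
A minimal sketch of how this could be applied to every item, assuming the share links always start with the feedflare div seen in the page source above. The marker variable is a name introduced here for illustration. Because substring-before returns an empty string when the marker is absent, the sketch falls back to the whole description in that case.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Assumption: the share links always begin with this markup.
       &lt; is how a literal < is written inside a stylesheet attribute. -->
  <xsl:variable name="marker" select="'&lt;div class=&quot;feedflare&quot;'"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//item"/>
  </xsl:template>

  <xsl:template match="item">
    <h1><xsl:value-of select="title"/></h1>
    <p>
      <xsl:choose>
        <!-- Cut the description just before the share links when present -->
        <xsl:when test="contains(description, $marker)">
          <xsl:value-of select="substring-before(description, $marker)"
                        disable-output-escaping="yes"/>
        </xsl:when>
        <!-- Otherwise keep the description unchanged -->
        <xsl:otherwise>
          <xsl:value-of select="description" disable-output-escaping="yes"/>
        </xsl:otherwise>
      </xsl:choose>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

Splitting on the div rather than on the first a href avoids leaving an unclosed div in the output, at the cost of depending on that class name.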

Why is my xsl processing instruction missing a question mark?

I want to create an HTML document with a PHP block (just for learning purposes) from an XSL transformation of an XML document. I am using the <xsl:processing-instruction> tag.
<xsl:stylesheet version="1.0" xmlns:xsl = "http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<xsl:processing-instruction name="php">
<xsl:text>
setcookie("cookiename", "cookievalue");
echo "";
</xsl:text>
</xsl:processing-instruction>
<html>
<head>
<meta charset="utf-8" />
</head>
<body>
<xsl:apply-templates />
</body>
</html>
</xsl:template>
<xsl:template match="pagina">
<xsl:for-each select="paragraf">
<p>
<xsl:value-of select="."/>
</p>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The result is:
<?php
setcookie("ceva", "textceva");
echo "";>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
</head>
<body>
<p>
text 1
</p>
<p>
text 2
</p>
</body>
</html>
Why is the second question mark missing? I was expecting something like <?php setcookie(...).. ?> .

It's because your pi (processing instruction) is an SGML processing instruction (HTML is SGML). Normally the default output for XSLT is XML, but whatever processor that you're using must be defaulting to HTML (or you omitted something in your XSLT example). Another clue pointing to this is that your meta elements aren't closed in the output.
Example (note the method="html"):
XSLT 1.0
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" method="html"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<html>
<xsl:processing-instruction name="test">pi</xsl:processing-instruction>
</html>
</xsl:template>
</xsl:stylesheet>
Output (using the XSLT itself, or any XML file, as the input):
<html><?test pi></html>
To force an XML pi, add the xsl:output:
<xsl:output indent="yes" method="xml"/>
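
Applied to the stylesheet from the question, that might look roughly like the sketch below. The omit-xml-declaration setting and the title element are additions made here for illustration, not part of the original question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- method="xml" makes the serializer close the instruction with ?>.
       omit-xml-declaration keeps an XML declaration out of the PHP file. -->
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <!-- The instruction content must not itself contain the sequence ?> -->
    <xsl:processing-instruction name="php">
      <xsl:text>setcookie("cookiename", "cookievalue");</xsl:text>
    </xsl:processing-instruction>
    <html>
      <head><title>Example</title></head>
      <body><xsl:apply-templates/></body>
    </html>
  </xsl:template>

  <xsl:template match="pagina">
    <xsl:for-each select="paragraf">
      <p><xsl:value-of select="."/></p>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The trade-off is that the rest of the page is then serialized as XML too: empty elements come out self-closed, and the processor no longer adds the Content-Type meta element that the html method inserted.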

It's my understanding that the correct representation of a processing instruction in HTML is (or was, at the relevant point in time) to omit the question mark, and the specification for XSLT serialization says that this is what should be done by the HTML output method. Sorry, I don't have time to consult the specs just now to confirm this.
Of course, you are trying to generate stuff which is defined in the PHP specification rather than the HTML specification, and the XSLT serialization spec knows nothing of PHP.

Exclude certain child nodes when data structure is unknown

EDIT -
I've figured out the solution to my problem and posted a Q&A here.
I'm looking to process XML conforming to the Library of Congress EAD standard (found here). Unfortunately, the standard is very loose regarding the structure of the XML.
For example the <bioghist> tag can exist within the <archdesc> tag, or within a <descgrp> tag, or nested within another <bioghist> tag, or a combination of the above, or can be left out entirely. I've found it to be very difficult to select just the bioghist tag I'm looking for without also selecting others.
Below are a few different possible EAD XML documents my XSLT might have to process:
First example
<ead>
<eadheader>
<archdesc>
<bioghist>one</bioghist>
<dsc>
<c01>
<descgrp>
<bioghist>two</bioghist>
</descgrp>
<c02>
<descgrp>
<bioghist>
<bioghist>three</bioghist>
</bioghist>
</descgrp>
</c02>
</c01>
</dsc>
</archdesc>
</eadheader>
</ead>
Second example
<ead>
<eadheader>
<archdesc>
<descgrp>
<bioghist>
<bioghist>one</bioghist>
</bioghist>
</descgrp>
<dsc>
<c01>
<c02>
<descgrp>
<bioghist>three</bioghist>
</descgrp>
</c02>
<bioghist>two</bioghist>
</c01>
</dsc>
</archdesc>
</eadheader>
</ead>
Third example
<ead>
<eadheader>
<archdesc>
<descgrp>
<bioghist>one</bioghist>
</descgrp>
<dsc>
<c01>
<c02>
<bioghist>three</bioghist>
</c02>
</c01>
</dsc>
</archdesc>
</eadheader>
</ead>
As you can see, an EAD XML file might have a <bioghist> tag almost anywhere. The actual output I'm supposed to produce is too complicated to post here. A simplified example of the output for the above three EAD examples might be like:
Output for First example
<records>
<primary_record>
<biography_history>first</biography_history>
</primary_record>
<child_record>
<biography_history>second</biography_history>
</child_record>
<granchild_record>
<biography_history>third</biography_history>
</granchild_record>
</records>
Output for Second example
<records>
<primary_record>
<biography_history>first</biography_history>
</primary_record>
<child_record>
<biography_history>second</biography_history>
</child_record>
<granchild_record>
<biography_history>third</biography_history>
</granchild_record>
</records>
Output for Third example
<records>
<primary_record>
<biography_history>first</biography_history>
</primary_record>
<child_record>
<biography_history></biography_history>
</child_record>
<granchild_record>
<biography_history>third</biography_history>
</granchild_record>
</records>
If I want to pull the "first" bioghist value and put that in the <primary_record>, I can't simply use <xsl:apply-templates select="/ead/eadheader/archdesc/bioghist"/>, as that tag might not be a direct child of the <archdesc> tag. It might be wrapped by a <descgrp> or a <bioghist> or a combination thereof. And I can't use select="//bioghist", because that will pull all the <bioghist> tags. I can't even use select="//bioghist[1]", because there might not actually be a <bioghist> tag there, and then I'll be pulling the value below the <c01>, which is "second" and should be processed later.
This is already a long post, but one other wrinkle is that there can be an unlimited number of <cxx> nodes, nested up to twelve levels deep. I'm currently processing them recursively. I've tried saving the node I'm currently processing (<c01> for example) as a variable called 'RN', then running <xsl:apply-templates select=".//bioghist [name(..)=name($RN) or name(../..)=name($RN)]">. This works for some forms of EAD, where the <bioghist> tag isn't nested too deeply, but it will fail if it ever has to process an EAD file created by someone who loves wrapping tags in other tags (which is totally fine according to the EAD Standard).
What I'd love is some way of saying:
Get any <bioghist> tag anywhere below the current node but
don't dig deeper if you hit a <c??> tag
I hope that I've made the situation clear. Please let me know if I've left anything ambiguous. Any assistance you can provide would be greatly appreciated. Thanks.
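
One possible way to express "any bioghist below the current node, but stop at a cNN element" in XSLT 1.0 is to index each outermost bioghist by its nearest enclosing record element (archdesc or a numbered component such as c01) and then look it up by that element's id. This is only a sketch under assumptions: the key name, the flat record output, and the three-character cNN name test are choices made here, not part of the EAD standard or of the asker's eventual solution.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Index each outermost bioghist by the id of its nearest ancestor
       that is archdesc or a component named c + two digits.
       On the ancestor axis, [1] picks the closest such ancestor. -->
  <xsl:key name="bioghist-by-owner"
           match="bioghist[not(parent::bioghist)]"
           use="generate-id(ancestor::*[self::archdesc or
                  (starts-with(local-name(), 'c') and
                   string-length(local-name()) = 3 and
                   translate(substring(local-name(), 2), '0123456789', '') = '')][1])"/>

  <xsl:template match="/">
    <records>
      <!-- Visit archdesc and every cNN element in document order -->
      <xsl:for-each select="//archdesc |
                            //*[starts-with(local-name(), 'c') and
                                string-length(local-name()) = 3 and
                                translate(substring(local-name(), 2), '0123456789', '') = '']">
        <record source="{local-name()}">
          <biography_history>
            <!-- Only bioghist elements owned by this node, never those
                 that sit below a deeper cNN element -->
            <xsl:for-each select="key('bioghist-by-owner', generate-id())">
              <xsl:value-of select="normalize-space(.)"/>
            </xsl:for-each>
          </biography_history>
        </record>
      </xsl:for-each>
    </records>
  </xsl:template>
</xsl:stylesheet>
```

A component with no bioghist of its own simply produces an empty biography_history element, as in the third example's child record. If one record owns several bioghist elements, this sketch concatenates them without a separator.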

As the requirements are rather vague, any answer only reflects the guesses its author has made.
Here is mine:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:my="my:my" exclude-result-prefixes="my">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<my:names>
<n>primary_record</n>
<n>child_record</n>
<n>grandchild_record</n>
</my:names>
<xsl:variable name="vNames" select="document('')/*/my:names/*"/>
<xsl:template match="/">
<xsl:apply-templates select=
"//bioghist[following-sibling::node()[1]
[self::descgrp]
]"/>
</xsl:template>
<xsl:template match="bioghist">
<xsl:variable name="vPos" select="position()"/>
<xsl:element name="{$vNames[position() = $vPos]}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<ead>
<eadheader>
<archdesc>
<bioghist>first</bioghist>
<descgrp>
<bioghist>first</bioghist>
<bioghist>
<bioghist>first</bioghist></bioghist>
</descgrp>
<dsc>
<c01>
<bioghist>second</bioghist>
<descgrp>
<bioghist>second</bioghist>
<bioghist>
<bioghist>second</bioghist></bioghist>
</descgrp>
<c02>
<bioghist>third</bioghist>
<descgrp>
<bioghist>third</bioghist>
<bioghist>
<bioghist>third</bioghist></bioghist>
</descgrp>
</c02>
</c01>
</dsc>
</archdesc>
</eadheader>
</ead>
the wanted result is produced:
<primary_record>first</primary_record>
<child_record>second</child_record>
<grandchild_record>third</grandchild_record>

I worked out a solution on my own and posted it at this Q&A because the solution is quite specific to a certain XML standard and seemed out of the scope of this question. If people feel it would be best to post it here as well, I can update this answer with a copy.

Translating itunes affiliate rss via xslt

I can't get this working for the life of me. Here is a snippet of the XML I get from an iTunes affiliate RSS feed. I want to print the values within the tags, but for some reason I cannot:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<id>http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/toppaidapplications/sf=143441/limit=100/genre=6014/xml</id><title>iTunes Store: Top Paid Applications</title><updated>2010-03-24T15:36:42-07:00</updated><link rel="alternate" type="text/html" href="http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewTop?id=25180&amp;popId=30"/><link rel="self" href="http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/toppaidapplications/sf=143441/limit=100/genre=6014/xml"/><icon>http://phobos.apple.com/favicon.ico</icon><author><name>iTunes Store</name><uri>http://www.apple.com/itunes/</uri></author><rights>Copyright 2008 Apple Inc.</rights>
<entry>
<updated>date</updated>
<id>someID</id>
<title>a</title>
<im:name>b</im:name>
</entry>
<entry>
<updated>date2</updated>
<id>someID2</id>
<title>a2</title>
<im:name>b2</im:name>
</entry>
</feed>
If I try <xsl:apply-templates match="entry"/> it spits out the entire contents of file. If I use <xsl:call-template name="entry"> it will show only one entry and I have to use <xsl:value-of select="//*[local-name(.)='name']"/> to get name but that's a hack. I've used xslt before for xml without namespaces and xml that has proper parent child relationships but not like this RSS feed. Notice entry is not wrapped in entries or anything.
Any help is appreciated. I want to use xslt because I want to alter the itunes link to go through my affiliate account - so something automated wouldn't work for me.

You are matching elements that are in no namespace, but the actual elements in the XML document belong to a (default) namespace: xmlns="http://www.w3.org/2005/Atom".
Therefore, you need to declare the namespace in your stylesheet, say as xmlns:atom="http://www.w3.org/2005/Atom", and then match not on {elementName} but on {atom:elementName}, where {elementName} in your case is "entry".
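
A minimal sketch of such a stylesheet for the feed shown in the question. The atom and im prefixes are arbitrary local names; only the namespace URIs have to match the feed. The list markup is just an illustrative output format.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:im="http://itunes.apple.com/rss"
    exclude-result-prefixes="atom im">
  <xsl:output method="html" indent="yes"/>

  <!-- feed is in the Atom namespace, so it needs the atom: prefix -->
  <xsl:template match="/atom:feed">
    <ul>
      <xsl:apply-templates select="atom:entry"/>
    </ul>
  </xsl:template>

  <!-- One list item per entry: the iTunes name, then the Atom title -->
  <xsl:template match="atom:entry">
    <li>
      <xsl:value-of select="im:name"/>
      <xsl:text> (</xsl:text>
      <xsl:value-of select="atom:title"/>
      <xsl:text>)</xsl:text>
    </li>
  </xsl:template>
</xsl:stylesheet>
```

Because the root template matches atom:feed and applies templates only to atom:entry, the built-in templates no longer copy the rest of the file's text into the output.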

Xslt transform on special characters

I have an XML document that needs to pass text containing an '&' inside an element.
This is called from .NET to a Web Service and comes over the wire with the correct encoding, &amp;
e.g.
T&amp;O
I then need to use XSLT to create a transform, but I need to query SQL Server through a stored procedure without the encoding on the ampersand, e.g. T&O would go to the DB.
(Note this all has to be done through XSLT, I do have the choice to use .NET encoding at this point)
Anyone have any idea how to do this from XSLT?
Note my XSLT knowledge isn’t the best to say the least!
Cheers

<xsl:text disable-output-escaping="yes">&amp;<!--&--></xsl:text>
More info at: http://www.w3schools.com/xsl/el_text.asp

If you have the choice to use .NET you can convert between an HTML-encoded and regular string using (this code requires a reference to System.Web):
string htmlEncodedText = System.Web.HttpUtility.HtmlEncode("T&O");
string text = System.Web.HttpUtility.HtmlDecode(htmlEncodedText);
Update
Since you need to do this in plain XSLT you can use xsl:value-of to decode the HTML encoding:
<xsl:variable name="test">
<xsl:value-of select="'T&amp;O'"/>
</xsl:variable>
The variable string($test) will have the value T&O. You can pass this variable as an argument to your extension function then.

Supposing your XML looks like this:
<root>T&amp;O</root>
you can use this XSLT snippet to get the text out of it:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="root"> <!-- Select the root element... -->
<xsl:value-of select="." /> <!-- ...and extract all text from it -->
</xsl:template>
</xsl:stylesheet>
Output (from Saxon 9, that is):
T&O
The point is the <xsl:output/> element with method="text". The default would be to output XML, where the ampersand would still be encoded.
