Is there a nice XSL stylesheet for client-side DocBook rendering?

I want the DocBook documents in my SVN repository to look nice if someone looks at them in a web browser. I've started to write a CSS stylesheet, but I think that it will have significant limitations -- particularly ones regarding hyperlinks.
There is a large body of DocBook XSL stylesheets at the DocBook site, but they don't seem to be appropriate for browser rendering. I don't want to generate static documents and put them into SVN. I want them to be basically readable for other developers without much hassle.
I could write my own browser-appropriate XSL stylesheet to convert DocBook to HTML, but it seems like someone else must have already done this. I just don't know where to find it.

In a past life I used wysiwygdocbook: http://www.cs.hs-rm.de/~werntges/proj/wysiwyg-dbk01.html

You are right: the DocBook XSL stylesheets are very heavy and not really suitable for running in a browser. The DocBook Wiki lists some CSS stylesheets; perhaps one of those might work for you?
The only one I have experience of is the one which XMLMind XML Editor apparently uses to present DocBook documents.

I've done a very basic and incomplete XSLT+CSS implementation for browser-side DocBook styling. You can check it out here: http://github.com/arsi/db2xhtml
But I would like to see a more advanced project if one is available somewhere!

[Edited because I misread the question]
You certainly wouldn't want to run the stylesheets in a browser via the xml-stylesheet processing instruction, but then you wouldn't want to do that for any reasonably complex content. Do it server side if you're running over a web server, or as a batch task. Is there any way that you can interpose a server-side process in svn?
DocBook is a complex 'language', and capturing even most of the subtleties of DocBook is very difficult. Using the DocBook XSL is not complex at all, and I really would recommend you go in that direction if you can. The stylesheets are designed to be customised and are extremely well documented by Bob Stayton in DocBook XSL: The Complete Guide.
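If you can run it as a batch or server-side step, a plain JAXP transform is enough to drive the DocBook XSL. Here is a minimal sketch; the stylesheet and document paths are invented, and for the full DocBook stylesheets you may need a complete XSLT 1.0 processor such as Xalan or Saxon 6 on the classpath:
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class DocBookToHtml {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: point these at your checkout and your local
        // copy of the DocBook XSL distribution.
        File stylesheet = new File("docbook-xsl/html/docbook.xsl");
        File source = new File("docs/manual.xml");
        File output = new File("docs/manual.html");

        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource(stylesheet));
        transformer.transform(new StreamSource(source), new StreamResult(output));
    }
}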

After quite a bit of searching, I believe the answer is "there is not a nice XSL stylesheet for client-side DocBook rendering," besides the bespoke ones like the one I implemented.

Typically you'd produce a 'rendition' for reading/display. The rendition can be a PDF, a single HTML page, or a set of HTML pages. It's rare that you deliver DocBook directly to the web.
Can I ask what you're trying to accomplish and why?
Is this for internal delivery or external?
I hate getting the question that asks "can your technology do X?" It assumes a lot of knowledge about the product (plus, usually the answer is "yes" but that doesn't answer the real question). It's always best when I ask -- "what are you trying to accomplish" -- so I can tell you whether or not any piece of technology is a good fit (or I can point you to some other piece that's a much better fit, or a better way to go about it).

Related

How to generate an XSLT programmatically

I am trying to use Saxon to programmatically generate an XSLT. This is a similar question to Create xslt files programmatically, but as there was no acceptable answer to that question, I am asking again, specifically about Saxon, which at one point claimed to support this.
According to http://saxon.sourceforge.net/saxon7.0/api-guide.html, "This document describes how to use Saxon as a Java class library, without making any use of XSLT stylesheets."
I had hoped to programmatically build up a transformation (answering the other question I referred to above) and then serialize that to a stylesheet that could then be reloaded and executed later, but I already pretty much figured out that isn't going to happen.
If Saxon isn't the answer, what Java library will support any of this?
By the way, doing any and all of this using .Net is trivial. Unfortunately I need a Java solution for this.
Any help is greatly appreciated.
Well, for a start, there have been 26 major releases and innumerable maintenance releases of Saxon since release 7.0, so forget anything you read there.
It's not clear from your question whether you want to write your XSLT code generator in Java or in XSLT. I can't see why you would want to do it in Java rather than XSLT, but neither is difficult. The main problem with your question is that you don't explain why you think it's a problem. Generating an XSLT stylesheet is just like generating any other XML document: what's the problem?
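For illustration, generating a stylesheet from Java is just ordinary XML construction. A minimal sketch (the generated template and its select expression are invented) that builds a trivial stylesheet as a DOM and serializes it with an identity transform:
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.File;

public class StylesheetGenerator {
    private static final String XSL_NS = "http://www.w3.org/1999/XSL/Transform";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        // <xsl:stylesheet version="1.0"> with a single template rule
        Element stylesheet = doc.createElementNS(XSL_NS, "xsl:stylesheet");
        stylesheet.setAttribute("version", "1.0");
        doc.appendChild(stylesheet);

        Element template = doc.createElementNS(XSL_NS, "xsl:template");
        template.setAttribute("match", "/");
        stylesheet.appendChild(template);

        Element valueOf = doc.createElementNS(XSL_NS, "xsl:value-of");
        valueOf.setAttribute("select", "count(//item)"); // invented rule
        template.appendChild(valueOf);

        // Serialize with an identity transform; the result is an ordinary
        // .xsl file that can be reloaded and executed later.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.INDENT, "yes");
        identity.transform(new DOMSource(doc), new StreamResult(new File("generated.xsl")));
    }
}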
I figured this out a while ago, and thought I would share my answer. This may or may not be an answer for Create xslt files programmatically as well, but since that was not my question, I won't assume that.
The answer is not to use XSLT (or Saxon) at all. Originally I wanted an API with which to generate an XSLT model, where the rules for transforming a particular (XML) tree structure are known programmatically. The model could then be persisted as a stylesheet so that it could be reloaded and used to transform another tree with the same structure. Sure, I could programmatically create a DOM containing an XSLT stylesheet document, but why would I want to? For one, it's inefficient - creating and saving a DOM with all the XSLT syntax (tedious), and then reading it using Saxon, etc. I'd rather have an API that natively understands XSLT - maybe one with an XSLT model that can just be programmed and then serialized.
Anyway, the answer to my problem is just to use SAX. Since I know the particular transformation rules for my tree programmatically, I simply created a handler that processes the stream using the rules, and I'm done. In the end there's no XSLT style sheet that can live outside the program, but I'll live with that.
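The shape of that SAX approach is roughly the sketch below; the element name and the single renaming rule are invented for illustration (attributes and escaping are ignored for brevity):
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;

public class TreeTransformHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Apply the transformation rules known at compile time;
        // here we just rename one hypothetical element.
        out.append('<').append(qName.equals("oldName") ? "newName" : qName).append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName.equals("oldName") ? "newName" : qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    public static void main(String[] args) throws Exception {
        TreeTransformHandler handler = new TreeTransformHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(new File("input.xml"), handler);
        System.out.println(handler.out);
    }
}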
And I take back what I said about .Net supporting my first approach. I thought it would, so this is just a reminder to check your assumptions before you post.
By the way, one answer to my question suggested that it was easy to accomplish, but rather than outline how to do it, I was just questioned about why I would want to. Can we at least try to answer a question, even if we also feel the need to ask someone why they are asking it?

Sitemesh or XSLT for layout

I am designing a layout for my CRM project now.
I have ended up with two options: SiteMesh to define the layout, or XSLT.
SiteMesh runs at runtime on the server; won't it cause issues if the number of requests is high?
I guess XSLT will run in the browser, based on XPath. Is this correct?
Which one is better to use?
Please help me
Thanks
You can run XSLT either in the browser or at the server. The advantage to running it at the server is that the HTML you generate will be the same regardless of what browser the user has. If you run it in the web browser, users with different browsers might get slightly different results, because different XSLT transformation engines have different quirks, kind of like different web browsers do when rendering the same HTML and CSS.
I've designed and taught a corporate 1-day intro to XSLT class. I love the way XSLT works. That said, it has been criticized for running slowly and being hard to learn.
I've just started using SiteMesh 2.0, and I really like it. If you are not familiar with XSLT coding, you may be more comfortable with SiteMesh, because it simply wraps your content with a header/footer you create. You don't have to write and debug XSLT code.

Should I use XML in my C++ game

Is XML something that's required in modern games? I'm trying to make a game that is programmed the way a professional programmer would do it, so should I use XML in my game, or is it just an optional thing? Why or why not?
There's nothing wrong with using XML if it solves your problem. There's nothing wrong with not using XML if it isn't the best option for solving your problem.
If you want to store data in a format which is easy to query, then XML is not a bad option. If you're looking for space efficiency, XML is not a good option. If you are looking for cross-platform features, XML is a no-brainer compared to binary formats.
JSON is another option which is very simple and very lightweight.
In short, it depends on your requirements. If you want more direction you'll have to give more information.
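To make the 'easy to query' point concrete: most standard libraries give you XPath lookups over XML for free. The question is about C++, but the idea is the same in any language; a minimal Java sketch with an invented config layout:
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.File;

public class GameConfig {
    public static void main(String[] args) throws Exception {
        // Hypothetical file, e.g.:
        // <config><graphics><resolution width="1920" height="1080"/></graphics></config>
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("config.xml"));

        XPath xpath = XPathFactory.newInstance().newXPath();
        String width = xpath.evaluate("/config/graphics/resolution/@width", doc);
        System.out.println("Width: " + width);
    }
}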
Use whatever works for you. Don't judge your tools based on whether something is considered "uncool". If it works for you and helps you create maintainable and flexible code and data, use it.
Now specifically for XML - that's definitely a format that is widely used and has libraries for virtually every platform. I've worked on a recent Xbox 360 and PlayStation 3 title that made use of XML.

"Smart" way of parsing and using website data?

How does one intelligently parse data returned by search results on a page?
For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get the raw HTML data of the page and do some regexes to make the data work for my web service, but if any of the websites change the formatting of the pages, my code breaks!
RSS is indeed a marvelous option, but many sites don't have an XML/JSON based search.
Are there any kits out there that help disseminate information on pages automatically? A crazy idea would be to have a fuzzy AI module recognize patterns on a search results page, and parse the results accordingly...
I've done some of this recently, and here are my experiences.
There are three basic approaches:
1. Regular Expressions.
Most flexible, easiest to use with loosely-structured info and changing formats.
Harder to do structural/tag analysis, but easier to do text matching.
Built in validation of data formatting.
Harder to maintain than others, because you have to write a regular expression for each pattern you want to use to extract/transform the document
Generally slower than 2 and 3.
Works well for lists of similarly-formatted items
A good regex development/testing tool and some sample pages will help. I've got good things to say about RegexBuddy here. Try their demo.
I've had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code.
2. Convert HTML to XHTML and use XML extraction tools. Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
Tools: TagSoup, HTMLTidy, etc.
Quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
Best solution if data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc)
Most suitable for getting link structures, nested tables, images, lists, and so forth
Should be faster than option 1, but slower than option 3.
Works well if content formatting changes/is variable, but document structure/layout does not.
If the data isn't structured by HTML tags, you're in trouble.
Can be used with option 1.
3. Parser generator (ANTLR, etc.) -- create a grammar for parsing and analyzing the page.
I have not tried this because it was not suitable for my (messy) pages.
Most suitable if the HTML is highly structured, very constant, regular, and never changes.
Use this if there are easy-to-describe patterns in the document, but they don't involve HTML tags and involve recursion or complex behaviors
Does not require XHTML input
FASTEST throughput, generally
Big learning curve, but easier to maintain
I've tinkered with web harvest for option 2, but I find their syntax to be kind of weird. Mix of XML and some pseudo-Java scripting language. If you like Java, and like XML-style data extraction (XPath, XQuery) that might be the ticket for you.
Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP's older regex libraries lack these, and they're indispensable for matching data between open/close tags in HTML.
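To make that concrete, a small Java sketch (the tag and sample HTML are invented) showing why the lazy quantifier matters when matching between open/close tags:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleScraper {
    public static void main(String[] args) {
        String html = "<td class=\"title\">First</td><td class=\"title\">Second</td>";

        // The lazy quantifier (*?) stops at the first </td>; a greedy (.*)
        // would swallow everything up to the last one.
        Pattern p = Pattern.compile("<td class=\"title\">(.*?)</td>");
        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // capturing group: First, then Second
        }
    }
}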
Without a fixed HTML structure to parse, I would hate to maintain regular expressions for finding data. You might have more luck parsing the HTML through a proper parser that builds the tree. Then select elements ... that would be more maintainable.
Obviously the best way is some XML output from the engine with a fixed markup that you can parse and validate. I would think that an HTML parsing library with some 'in the dark' probing of the produced tree would be simpler to maintain than regular expressions.
This way, you just have to check on <a href="blah" class="cache_link">... turning into <a href="blah" class="cache_result">... or whatever.
Bottom line, grepping specific elements with regexp would be grim. A better approach is to build a DOM like model of the page and look for 'anchors' to character data in the tags.
Or send an email to the site stating a case for an XML API ... you might get hired!
You don't say what language you're using. In Java land you can use TagSoup and XPath to help minimise the pain. There's an example from this blog (of course the XPath can get a lot more complicated as your needs dictate):
import java.net.URL;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from a SAX stream provided by TagSoup
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title"); // namespace-aware XPath (Jaxen) against the XHTML namespace
titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml");
String title = ((Element) titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is " + title);
I'd recommend externalising the XPath expressions so you have some measure of protection if the site changes.
Here's an example XPath I'm definitely not using to screenscrape this site. No way, not me:
"//h:div[contains(#class,'question-summary')]/h:div[#class='summary']//h:h3"
You haven't mentioned which technology stack you're using. If you're parsing HTML, I'd use a parsing library:
Beautiful Soup (Python)
HTML Agility Pack (.NET)
There are also webservices that do exactly what you're saying - commercial and free. They scrape sites and offer webservice interfaces.
And a generic webservice that offers some screen scraping is Yahoo Pipes; see the previous Stack Overflow question on that.
It isn't foolproof, but you may want to look at a parser such as Beautiful Soup. It won't magically find the same info if the layout changes, but it's a lot easier than writing complex regular expressions. Note that this is a Python module.
Unfortunately 'scraping' is the most common solution, as you said attempting to parse HTML from websites. You could detect structural changes to the page and flag an alert for you to fix, so a change at their end doesn't result in bum data. Until the semantic web is a reality, that's pretty much the only way to guarantee a large dataset.
Alternatively you can stick to small datasets provided by APIs. Yahoo are working very hard to provide searchable data through APIs (see YDN), I think the Amazon API opens up a lot of book data, etc etc.
Hope that helps a little bit!
EDIT: And if you're using PHP I'd recommend SimpleHTMLDOM
Have you looked into using an HTML manipulation library? Ruby has some pretty nice ones, e.g. Hpricot.
With a good library you could specify the parts of the page you want using CSS selectors or xpath. These would be a good deal more robust than using regexps.
Example from hpricot wiki:
doc = Hpricot(open("qwantz.html"))
(doc/'div img[@src^="http://www.qwantz.com/comics/"]')
#=> Elements[...]
I am sure you could find a library that does similar things in .NET or Python, etc.
Try googling for screen scraping + the language you prefer.
I know several options for python, you may find the equivalent for your preferred language:
Beautiful Soup
mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages
lxml: Python binding for libxml2 (and libxslt)
scrapemark: uses templates to scrape pieces of pages
pyquery: allows you to make jQuery queries in xml/xhtml documents
scrapy: a high-level scraping and web crawling framework for writing spiders to crawl and parse web pages
Depending on the website to scrape you may need to use one or more of the approaches above.
If you can use something like Tag Soup, that'd be a place to start. Then you could treat the page like an XML API, kinda.
It has a Java and C++ implementation, might work!
Parsley at http://www.parselets.com looks pretty slick.
It lets you define 'parslets' using JSON, in which you define what to look for on the page, and it then parses that data out for you.
As others have said, you can use an HTML parser that builds a DOM representation and query it with XPath/XQuery. I found a very interesting article here: Java theory and practice: Screen-scraping with XQuery - http://www.ibm.com/developerworks/xml/library/j-jtp03225.html

Benefits/Disadvantages of using XSL to render entire web pages

I am in the preliminary stages of planning a project with a client to redo their current website. I took a look at their current site to see what issues they are currently dealing with and upon inspection I noticed that every page is being rendered entirely using XSLT. I am familiar with XSLT, I have used it to render custom controls that need to be refreshed often on the client-side, but never to render an entire page.
Help me become less ignorant, what could be the reasoning behind this? What benefits or disadvantages does this bring to the table?
Server-side:
Advantages:
Clean, concise templates
An easy way to process XML data into HTML
Reasonably fast
Disadvantages:
Programming model unfamiliar and uncomfortable for many procedural-language programmers
Awkward if some or all source data is not in XML
Can be very slow when not used carefully (small changes can have large repercussions)
Client-side:
Advantages:
A convenient way to off-load processing onto client code, where scripts can potentially have much better knowledge of how best to format the resulting HTML.
Disadvantages:
Browser support is all over the map.
Google won't thank you.
Sounds like they did it because they knew XSL-T very well and already had XML data on hand.
I wouldn't like the solution myself. XSL-T isn't the easiest thing to read or write. It doesn't lend itself well to visualizing how a page will look. It cuts designers and web developers out of the process. And it doesn't internationalize well: there's nothing equivalent to Java resource bundles that can pull out locale-specific information. I don't consider cut & paste a good solution.
XSLT on client side
Disables progressive rendering. The user won't see anything at all until the entire stylesheet and all the data are loaded completely.
Not supported by search engines and other crawlers. They'll see raw XML.
Not supported by older/less advanced browsers.
XSLT in general
Unless you carefully design your stylesheets, they may quickly become difficult to maintain:
with numerous templates it may be very difficult to figure out which templates are applied.
the verbosity of XSLT and XML syntax makes it hard to understand anything at a glance.
XPath tricks are tempting, but nightmare to modify later.
I think XSLT is great when built the right way (we use a framework of templates at work).
Sounds like the All-singing-all-dancing XML fetish.
Since you can do anything with XSLT, might as well do everything. I've had people ask why a data warehouse isn't just XSLT transforms between input, data mart and reports.
Advantage. Everything's in XML.
Disadvantages.
Not very readable. Your page templates are bound up as XSLT transformations with confusing looping and conditional processing features.
Any change to page templates requires an XSLT expert, in addition to the graphic designer who created (and debugged) the HTML and CSS.
Biggest benefit: a platform-neutral way to render XML.
Biggest disadvantage: XSL is difficult to maintain.
I once had to work with an XSL file over 4,000 lines long that also included several other XSL templates. Now that was hard to work with!
The answers above provide a good survey of some of the advantages and disadvantages of XSLT. I'd like to add another disadvantage. We have found that you pretty quickly run into scalability issues when using XSLT for moderately large datasets.
When processing XML files, XSLT must load the entire document into memory. With Xalan, this consumes roughly 10x the size of the input file (Saxon has an alternative DOM implementation that uses less memory). If any of your input datasets grow beyond a couple of hundred megabytes, your XSLT processor might just flake out.