"Smart" way of parsing and using website data? - web-services

How does one intelligently parse data returned by search results on a page?
For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get the raw HTML of each page and use some regexes to make the data work for my web service, but if any of the websites change the formatting of their pages, my code breaks!
RSS is indeed a marvelous option, but many sites don't have an XML/JSON-based search.
Are there any kits out there that help extract information from pages automatically? A crazy idea would be to have a fuzzy AI module recognize patterns on a search-results page and parse the results accordingly...

I've done some of this recently, and here are my experiences.
There are three basic approaches:
Regular Expressions.
Most flexible, easiest to use with loosely-structured info and changing formats.
Harder to do structural/tag analysis, but easier to do text matching.
Built-in validation of data formatting.
Harder to maintain than the others, because you have to write a regular expression for each pattern you want to use to extract/transform the document.
Generally slower than 2 and 3.
Works well for lists of similarly-formatted items.
A good regex development/testing tool and some sample pages will help. I've got good things to say about RegexBuddy here. Try their demo.
I've had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code. (A rough sketch of this approach follows below.)
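For illustration, here's a minimal sketch of the regex approach in Python; the results markup is invented for the example, and rewriting this pattern whenever a provider changes its layout is exactly the maintenance cost mentioned above:
import re

# Hypothetical search-results HTML; real pages will differ.
html = '''
<div class="result"><a href="/book/1">Dive Into Python</a><span>$29.99</span></div>
<div class="result"><a href="/book/2">Learning Perl</a><span>$34.99</span></div>
'''

# Lazy quantifiers (.*?) plus capturing groups pull out just the pieces we want.
pattern = re.compile(
    r'<div class="result"><a href="(.*?)">(.*?)</a><span>(.*?)</span></div>'
)

for href, title, price in pattern.findall(html):
    print(title, price, href)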
Convert HTML to XHTML and use XML extraction tools. Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
Tools: TagSoup, HTML Tidy, etc.
Quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
Best solution if the data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc.).
Most suitable for getting link structures, nested tables, images, lists, and so forth
Should be faster than option 1, but slower than option 3.
Works well if content formatting changes/is variable, but document structure/layout does not.
If the data isn't structured by HTML tags, you're in trouble.
Can be used with option 1. (A minimal sketch of this approach, using lxml, follows below.)
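As a minimal sketch of option 2 in Python, assuming lxml (whose HTML parser tolerates tag soup) and an invented table layout; the XPath expressions are made up for the example:
from lxml import html  # pip install lxml

# Hypothetical results page; a real page would be fetched over HTTP and be messier.
doc = html.fromstring('''
<html><body>
  <table id="results">
    <tr><td><a href="/book/1">Dive Into Python</a></td><td>$29.99</td></tr>
    <tr><td><a href="/book/2">Learning Perl</a></td><td>$34.99</td></tr>
  </table>
</body></html>
''')

# Query the cleaned-up tree as XML data.
for row in doc.xpath('//table[@id="results"]//tr'):
    title = row.xpath('./td[1]/a/text()')[0]
    href = row.xpath('./td[1]/a/@href')[0]
    price = row.xpath('./td[2]/text()')[0]
    print(title, href, price)
As noted above, this holds up well when the content formatting changes but the document structure does not.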
Parser generator (ANTLR, etc) -- create a grammar for parsing & analyzing the page.
I have not tried this, because it was not suitable for my (messy) pages.
Most suitable if the HTML is highly structured, very constant, regular, and never changes.
Use this if there are easy-to-describe patterns in the document, but they don't involve HTML tags and do involve recursion or complex behaviors.
Does not require XHTML input
FASTEST throughput, generally
Big learning curve, but easier to maintain
I've tinkered with Web-Harvest for option 2, but I find its syntax to be kind of weird: a mix of XML and a pseudo-Java scripting language. If you like Java and like XML-style data extraction (XPath, XQuery), that might be the ticket for you.
Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP's older regex libraries lack these, and they're indispensable for matching data between open/close tags in HTML.
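For example, in Python the difference between greedy and lazy quantifiers looks like this (made-up snippet):
import re

html = "<td>first</td><td>second</td>"
print(re.findall(r"<td>(.*)</td>", html))   # greedy: ['first</td><td>second']
print(re.findall(r"<td>(.*?)</td>", html))  # lazy:   ['first', 'second']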

Without a fixed HTML structure to parse, I would hate to maintain regular expressions for finding data. You might have more luck parsing the HTML through a proper parser that builds the tree. Then select elements ... that would be more maintainable.
Obviously the best way is some XML output from the engine with a fixed markup that you can parse and validate. I would think that an HTML parsing library with some 'in the dark' probing of the produced tree would be simpler to maintain than regular expressions.
This way, you just have to check on <a href="blah" class="cache_link">... turning into <a href="blah" class="cache_result">... or whatever.
Bottom line, grepping specific elements with regexps would be grim. A better approach is to build a DOM-like model of the page and look for 'anchors' to the character data in the tags.
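As a minimal sketch of that idea, using Beautiful Soup in Python and the hypothetical cache_link class from the example above:
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Invented markup; the class attribute acts as the 'anchor' to the data we want.
soup = BeautifulSoup('<a href="blah" class="cache_link">Cached page</a>', 'html.parser')

for a in soup.select('a.cache_link'):
    print(a['href'], a.get_text())

# If the site renames the class to cache_result, only the selector needs updating:
#   soup.select('a.cache_result')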
Or send an email to the site stating a case for an XML API ... you might get hired!

You don't say what language you're using. In Java land you can use TagSoup and XPath to help minimise the pain. There's an example from this blog (of course the XPath can get a lot more complicated as your needs dictate):
// Requires JDOM 1.x, Jaxen and TagSoup on the classpath.
import java.net.URL;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from a SAX stream provided by TagSoup
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml");
String title = ((Element) titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is " + title);
I'd recommend externalising the XPath expressions so you have some measure of protection if the site changes.
Here's an example XPath I'm definitely not using to screenscrape this site. No way, not me:
"//h:div[contains(#class,'question-summary')]/h:div[#class='summary']//h:h3"

You haven't mentioned which technology stack you're using. If you're parsing HTML, I'd use a parsing library:
Beautiful Soup (Python)
HTML Agility Pack (.NET)
There are also web services that do exactly what you're saying - commercial and free. They scrape sites and offer web service interfaces.
And a generic web service that offers some screen scraping is Yahoo Pipes; see a previous Stack Overflow question on that.

It isn't foolproof, but you may want to look at a parser such as Beautiful Soup. It won't magically find the same info if the layout changes, but it's a lot easier than writing complex regular expressions. Note that this is a Python module.
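For example (a rough sketch with invented markup and URL pattern), matching on tag names and attributes tends to survive cosmetic layout changes that would break a regex over the raw HTML:
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup('<ul><li><a href="/book/42">Some Book</a></li></ul>', 'html.parser')

# Find result links by their href pattern rather than by their position in the page.
for a in soup.find_all('a', href=re.compile(r'^/book/')):
    print(a['href'], a.get_text(strip=True))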

Unfortunately, 'scraping' is the most common solution: as you said, attempting to parse the HTML from the websites. You could detect structural changes to the page and flag an alert for you to fix, so that a change at their end doesn't result in bum data. Until the semantic web is a reality, that's pretty much the only way to guarantee a large dataset.
Alternatively you can stick to the small datasets provided by APIs. Yahoo are working very hard to provide searchable data through APIs (see YDN); I think the Amazon API opens up a lot of book data, etc.
Hope that helps a little bit!
EDIT: And if you're using PHP I'd recommend SimpleHTMLDOM

Have you looked into using an HTML manipulation library? Ruby has some pretty nice ones, e.g. Hpricot.
With a good library you could specify the parts of the page you want using CSS selectors or XPath. These would be a good deal more robust than using regexps.
Example from hpricot wiki:
doc = Hpricot(open("qwantz.html"))
(doc/'div img[@src^="http://www.qwantz.com/comics/"]')
#=> Elements[...]
I am sure you could find a library that does similar things in .NET or Python, etc.

Try googling for screen scraping + the language you prefer.
I know several options for Python; you may find the equivalent for your preferred language:
Beautiful Soup
mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
lxml: Python bindings for libxml2 and libxslt.
scrapemark: uses templates to scrape pieces of pages.
pyquery: allows you to make jQuery-like queries on XML/XHTML documents.
scrapy: a high-level scraping and web-crawling framework for writing spiders to crawl and parse web pages.
Depending on the website you want to scrape, you may need to use one or more of the approaches above.
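For instance, a minimal scrapy spider might look like the sketch below; the URL and CSS selectors are invented, so treat it as a template rather than something that will run against a real site unchanged:
import scrapy  # pip install scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    # Hypothetical search URL for a book provider.
    start_urls = ['http://example.com/search?q=databases']

    def parse(self, response):
        # Hypothetical result markup: one div.result per hit.
        for result in response.css('div.result'):
            yield {
                'title': result.css('a.title::text').get(),
                'url': result.css('a.title::attr(href)').get(),
            }

# Run with:  scrapy runspider book_spider.py -o books.json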

If you can use something like Tag Soup, that'd be a place to start. Then you could treat the page like an XML API, kinda.
It has Java and C++ implementations, so it might work!

Parsley at http://www.parselets.com looks pretty slick.
It lets you define 'parslets' using JSON, in which you describe what to look for on the page, and it then parses that data out for you.

As others have said, you can use an HTML parser that builds a DOM representation and query it with XPath/XQuery. I found a very interesting article here: Java theory and practice: Screen-scraping with XQuery - http://www.ibm.com/developerworks/xml/library/j-jtp03225.html

There is a very interesting online service for parsing websites: https://loadsiteinmysql.site. This service splits the site into tags and loads them into MySQL tables, which allows you to parse sites using MySQL syntax.

Related

How to refine text data?

I built many spiders to get news articles from different websites, and I have an API to convert the text to audio clips, but I need a framework or Python tools to refine the articles' text, such as:
removing anything related to the source;
removing any date formats;
removing URLs;
changing acronyms, such as CEO to chief executive officer, for example;
removing special characters and typos;
making sure that the sentence is written correctly after all the edits;
using the previously edited articles as a reference for the new articles.
I am using Python, NLTK, and re, but it's exhausting: each time I think I've covered all the cases, I find new cases to add, and I think I am stuck in an infinite loop.
Any suggestions?
First of all, expanding acronyms to their full form is non-trivial and should probably not be considered part of scraping, but rather part of a second step of processing (cf. IBM's The Art of Tokenization).
Cleaning scraped data is tedious, unfortunately: there is no magical solution, because everyone is interested in scraping something different from what you are; some might be interested only in URLs, for example. Nevertheless, have you tried using BeautifulSoup? It's a Python library which offers a very nice API for handling many common scraping-related tasks.
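As a starting point, here is a rough sketch (the function name and rules are just an illustration) that strips markup with BeautifulSoup and then removes URLs and extra whitespace with re; the harder items on your list, like acronym expansion and grammar checking, really are separate NLP steps:
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def clean_article(html):
    # Strip the markup first, then clean the remaining plain text.
    text = BeautifulSoup(html, 'html.parser').get_text(separator=' ')
    text = re.sub(r'https?://\S+', '', text)   # drop URLs
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
    return text

print(clean_article('<p>Read more at <a href="http://example.com">http://example.com</a></p>'))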

Performance of Jsoup vs regexes vs XPath for extracting content from HTML?

I know that in the common case HTML shouldn't be parsed with regexes.
But I want to make a performance test for a web application, and I know for sure what the HTML will look like, so I can use regexes to extract some data from the page source.
As I'm doing the performance test with JMeter, I want to use as few resources on the master machine as possible.
Which option will be the least resource-intensive: XPath, regexes (Jakarta ORO), or Jsoup?
As of JMeter 2.8, the answer is Regexp.
But it depends, of course, on the regular expressions you use.
The regexp implementation in JMeter is rather optimized and is the main post-processing mechanism used for correlation.
Regarding Jsoup, it would need custom coding, for example based on a JSR223 post-processor.
JMeter 2.9 will introduce a new CSS/jQuery-selector-based extractor with two possible underlying implementations:
JSOUP
Jodd Lagarto (CSSelly)
See:
https://issues.apache.org/bugzilla/show_bug.cgi?id=54259
Its performance will be lower than regexp's, as it builds a DOM document, but it makes the syntax much easier in test plans that don't need to be ultra-optimised.
Finally, regarding XPath: it also builds a DOM tree, so it has a memory and CPU cost which is higher than regex, particularly if you want to extract many elements. For background on the DOM cost, see:
http://www.developer.com/xml/article.php/3397691/Does-StAX-Belong-in-Your-XML-Toolbox.htm
An enhancement has been created:
https://issues.apache.org/bugzilla/show_bug.cgi?id=53973

ColdFusion CRUD

For quite a long time now, I've been trying to write and have been in search of "a really good" CRUD application. Don't get me wrong - I didn't say "The ultimate" CRUD application. Just one that could be rated 1st class.
What I'm saying is: Please don't respond to this plea with an answer like "Well, every situation is different..."
Q: Is there a blog post or something in the Adobe documentation that shows CRUD on a one-to-many relationship (Header/Detail), that uses web standards css (instead of tables), that uses best practices (CF9 has changed so many things now: scripted components, ORM), that uses the latest UI techniques (jQuery or some of the built-in AJAX features of CF9), that has a nice front-end (a nice looking header and background along with some pretty buttons)?
I know that's a lot to ask, but such is my quest.
A good example of a one-to-many relationship is the city/state xml files built into the Spry examples. There are 23,000 cities in the sample xml files, so I think that's better than just using random data.
I'm not really sure what you're asking, but I just want to respond to a couple of points in your question (this is more a comment than an answer, but since SO is stupidly limited in this, I'll put it here instead.)
that uses web standards css (instead of tables),
There is no "css instead of tables" - they are two distinct and compatible things!
CSS describes visual aspects of a document, whilst tables mark up tabular data.
If you're displaying tabular data, then tables are exactly what you should be using, and you can use CSS to make them look more exciting than the plain styles tables come in.
Since you're asking for a CRUD app, odds are you are going to want to display tabular data, so you should be using tables.
(The common mistake people make is not understanding the nature of the web, and using tables to apply grid layouts to documents, when they should be using structured semantic markup instead.)
that uses best practices (CF9 has changed so many things now:
scripted components, ORM)
Scripted components are not a best practice!
They are an alternative syntax (for people who prefer having non-descriptive braces everywhere); they do not offer anything you can't already do.
I would strongly suggest you check out CFWheels. Read the documentation; it's built for doing such CRUD applications, has an amazing set of features, and will save you a lot of time. As for the interface, there are many jQuery plugins out there that can handle this. I suggest looking at AjaxRain and finding a plugin you like.

Is there a nice XSL stylesheet for client-side DocBook rendering?

I want the DocBook documents in my SVN repository to look nice if someone looks at them in a web browser. I've started to write a CSS stylesheet, but I think that it will have significant limitations -- particularly ones regarding hyperlinks.
There is a large body of DocBook XSL stylesheets at the DocBook site, but they don't seem to be appropriate for browser rendering. I don't want to generate static documents and put them into SVN. I want them to be basically readable for other developers without much hassle.
I could write my own browser-appropriate XSL stylesheet to convert DocBook to HTML, but it seems like someone else must have already done this. I just don't know where to find it.
In a past life I used wysiwygdocbook: http://www.cs.hs-rm.de/~werntges/proj/wysiwyg-dbk01.html
You are right, the DocBook XSL stylesheets are very heavy, and are not really suitable for running in a browser. The DocBook Wiki lists some CSS stylesheets, perhaps one of those might work for you?
The only one I have experience of is the one which XMLMind XML Editor apparently uses to present DocBook documents.
I've done a very basic and incomplete XSLT+CSS implementation for browser-side DocBook styling. You can check it out here: http://github.com/arsi/db2xhtml
But I would like to see a more advanced project, if one is available somewhere!
[Edited because I misread the question]
You certainly wouldn't want to run the stylesheets via a browser and the processing instruction, but then you wouldn't want to do that for any reasonably complex content. Do it server-side if you're running over a web server, or as a batch task. Is there any way you can interpose a server-side process in SVN?
DocBook is a complex 'language', and capturing even most of the subtleties of DocBook is very difficult. Using the DocBook XSL is not complex at all, and I really would recommend you go in that direction if you can. The stylesheets are designed to be customised and are extremely well documented by Bob Stayton in DocBook XSL: The Complete Guide.
After quite a bit of searching, I believe the answer is "there is not a nice XSL stylesheet for client-side DocBook rendering," besides the bespoke ones like the one I implemented.
Typically you'd produce a 'rendition' for reading/display. The rendition can either be PDF, a single HTML page, or a set of HTML pages. It's rare that you deliver DocBook directly to the web.
Can I ask what you're trying to accomplish and why?
Is this for internal delivery or external?
I hate getting the question that asks "can your technology do X?" It assumes a lot of knowledge about the product (plus, usually the answer is "yes" but that doesn't answer the real question). It's always best when I ask -- "what are you trying to accomplish" -- so I can tell you whether or not any piece of technology is a good fit (or I can point you to some other piece that's a much better fit, or a better way to go about it).

Can you easily configure MediaWiki to accept full HTML/CSS or even JS content?

I'd like to create a technical wiki site and it requires the full use of HTML/CSS and maybe Javascript when editing a page. Is this something I can easily configure in MediaWiki? If not, is there any other wiki software that you'd recommend?
Thanks!
You can enable raw HTML support by setting $wgRawHtml = true; in your LocalSettings.php:
http://www.mediawiki.org/wiki/Manual:$wgRawHtml
However, as noted above this is rather insecure for a public site. (If locked down to registered usage only by known folks it's ok -- but you need to trust your users.)
There are some links on that manual page to extensions organized around letting you put specific known bits of HTML/JS in your output code as well, which may or may not fit your needs better.
Well, while MediaWiki itself does not support this, there are some extensions which allow at least HTML in a page. See for example this extension list. SecureHTML might do what you are looking for.
That said, I'd like to point out that allowing raw HTML rather defeats the purpose of a wiki:
it can and will mess up formatting and create weird problems (clashes between generated and user-provided HTML)
it makes it hard/impossible to convert the wiki to other formats (such as to print it)
it makes searching harder
it makes any kind of security impossible (think XSS)
This is doubly true for allowing Javascript.
So I'd like to ask why you need this. If you need special formatting that MediaWiki does not offer, consider using (or writing) an extension for this.
If you really need arbitrary HTML, a Wiki might not be the best tool for you. You should consider a CMS, or just put HTML files into Subversion.
So what are you trying to do?
Use nowiki tags. Docs can be found here: https://www.mediawiki.org/wiki/Help:Formatting