Jsoup-like HTML parser for C++ [closed]

I have been writing some code to get data from some pages in Java, and Jsoup was one of the best libraries to work with. Unfortunately, I now have to port the whole thing to C/C++, but I cannot find any decent HTML parser for C++. Is there any Jsoup-like library for C++, or how can similar results be achieved?
[Currently I am using curl to get the source of the pages and roaming the internet to find an HTML parser.]

Unfortunately, I guess there's no parser quite like Jsoup for C++...
Besides the libraries already mentioned here, there's a good overview of C++ (and some C) parsers here: Free C or C++ XML Parser Libraries.
For (HTML) DOM parsing I used TinyXML-2; it's a very small library (only 2 files) that runs on most operating systems (even non-desktop ones).
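As a rough illustration (my addition, not part of the original answer), DOM traversal with TinyXML-2 looks something like the sketch below. Keep in mind that TinyXML-2 is an XML parser, so it only copes with well-formed (X)HTML; real-world pages usually need to be tidied first.

#include <cstdio>
#include <tinyxml2.h>

int main() {
    // Toy input: TinyXML-2 only accepts well-formed markup.
    const char* page =
        "<html><body><div id='content'><a href='/x'>link</a></div></body></html>";

    tinyxml2::XMLDocument doc;
    if (doc.Parse(page) != tinyxml2::XML_SUCCESS) {
        std::fprintf(stderr, "parse error\n");
        return 1;
    }

    // Walk down the element tree by tag name.
    // In real code, check each FirstChildElement() result for null.
    tinyxml2::XMLElement* div = doc.FirstChildElement("html")
                                   ->FirstChildElement("body")
                                   ->FirstChildElement("div");
    const tinyxml2::XMLElement* a = div->FirstChildElement("a");
    std::printf("href=%s text=%s\n", a->Attribute("href"), a->GetText());
    return 0;
}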
LibXml
push and pull parser (DOM, SAX)
Validation
XPath and XPointer support
Cross-platform / good documentation
Apache Xerces
push and pull parser (DOM, SAX)
Validation
No XPath support (but a package for this?)
Cross-platform / good documentation
If you are on C++/CLI, check out NSoup - a Jsoup port for .NET.
Some more:
htmlcxx - html and css APIs for C++
MSHTML (?)
pugixml (DOM / XPath and Unicode support)
LibCSS (CSS Parser) / LibDOM (DOM) (however, both in C)
hcxselect (CSS selector engine for C++)
Maybe you can combine a DOM model/parser with a CSS selector engine?
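To make the "DOM parser plus selectors" idea concrete, here is a small sketch of mine (not the answerer's) using pugixml, where an XPath expression stands in for a CSS selector such as div.result a. It again assumes the input is already valid XML/XHTML.

#include <iostream>
#include <pugixml.hpp>

int main() {
    const char* page =
        "<html><body>"
        "<div class='result'><a href='/a'>First</a></div>"
        "<div class='result'><a href='/b'>Second</a></div>"
        "</body></html>";

    pugi::xml_document doc;
    if (!doc.load_string(page)) return 1;

    // Rough XPath equivalent of the CSS selector "div.result a".
    pugi::xpath_node_set links = doc.select_nodes("//div[@class='result']//a");
    for (const pugi::xpath_node& n : links)
        std::cout << n.node().attribute("href").value()
                  << " -> " << n.node().child_value() << "\n";
    return 0;
}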

If you are familiar with the Qt Framework, the most convenient way is using QWebElement (reference here).
Otherwise (as another post suggests), using Tidy to convert the HTML to valid XML and then using an XML parser such as libxml++ is a good option. You can find sample code showing these two steps here.

Chromium has an open-source parser. Also, Google's gumbo-parser looks cool.
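For a feel of what gumbo-parser's C API looks like from C++ (my sketch, not from the answer), here is a small program that collects every link target. Gumbo's main attraction is that it tolerates messy real-world HTML.

#include <iostream>
#include <gumbo.h>

// Recursively collect href attributes from <a> elements.
static void find_links(const GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) return;
    if (node->v.element.tag == GUMBO_TAG_A) {
        const GumboAttribute* href =
            gumbo_get_attribute(&node->v.element.attributes, "href");
        if (href) std::cout << href->value << "\n";
    }
    const GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i)
        find_links(static_cast<const GumboNode*>(children->data[i]));
}

int main() {
    const char* html = "<h1>Hi</h1><p><a href='https://example.com'>a link</a>";
    GumboOutput* output = gumbo_parse(html);   // forgiving HTML5 parsing
    find_links(output->root);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}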

Yes, there is an HTML parser library for C++; check it out:
https://github.com/HamedMasafi/HtmlParser/
This library can parse HTML or CSS and convert it to a tree model. You can search the parsed HTML with methods like get_by_id, get_by_class_name, and get_by_tag_name, and there is also a query method that lets you search via CSS selector (only tag, id, class, and nested-child selectors are supported for now).
After finding a child you can change its attributes, and finally you can print the HTML to a std::string in compact or pretty mode.

You can use Xerces2 as a DOM parser.
Or use HTML Tidy to clean up the HTML and convert it to XHTML, then parse the XML with pugixml or a similar XML parser. And since pugixml is a non-validating parser, it may well work on the raw HTML without running HTML Tidy on it first.
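A hedged sketch of that Tidy-then-parse pipeline (my addition), assuming libtidy (tidy-html5) and pugixml are installed; note that the Tidy header names vary between releases, and the option calls below are just one reasonable configuration:

#include <iostream>
#include <pugixml.hpp>
#include <tidy.h>
#include <tidybuffio.h>   // called <buffio.h> in older HTML Tidy releases

int main() {
    const char* html = "<html><body><p class=msg>Hello<br>world</body>";

    // 1) Let Tidy repair the tag soup and emit well-formed XHTML.
    TidyDoc tdoc = tidyCreate();
    tidyOptSetBool(tdoc, TidyXhtmlOut, yes);
    tidyOptSetBool(tdoc, TidyQuiet, yes);
    tidyOptSetBool(tdoc, TidyShowWarnings, no);

    TidyBuffer out;
    tidyBufInit(&out);
    tidyParseString(tdoc, html);
    tidyCleanAndRepair(tdoc);
    tidySaveBuffer(tdoc, &out);        // XHTML now lives in out.bp / out.size

    // 2) Hand the XHTML to pugixml and query it with XPath.
    pugi::xml_document doc;
    if (doc.load_buffer(out.bp, out.size)) {
        pugi::xpath_node p = doc.select_node("//p[@class='msg']");
        if (p) std::cout << p.node().child_value() << "\n";
    }

    tidyBufFree(&out);
    tidyRelease(tdoc);
    return 0;
}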

If you don't mind calling out to Python from C++, you could use Beautiful Soup. At least the name is right!
Seriously - it's a nice, no-nonsense HTML parser. I haven't tried calling it from C++, although it should be straightforward.
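If you do go that route, embedding CPython is one straightforward option. This is my own minimal sketch (not from the answer); it assumes the Python development headers and the bs4 package are installed:

#include <Python.h>

int main() {
    Py_Initialize();
    // Run a tiny Beautiful Soup script inside the embedded interpreter.
    const char* script =
        "from bs4 import BeautifulSoup\n"
        "html = \"<html><body><a href='/x'>a link</a></body></html>\"\n"
        "soup = BeautifulSoup(html, 'html.parser')\n"
        "print(soup.a['href'], soup.a.get_text())\n";
    PyRun_SimpleString(script);
    Py_Finalize();
    return 0;
}

In a real program you would exchange data through PyObject values (or a pipe to a separate Python process) rather than just printing from the script.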

Related

Model-based generation of project documentation [closed]

During the development of my projects I find myself producing some large documents with slight variations on only a few paragraphs. For instance, the same configuration plan will be used throughout different projects but each document has to be tailored with specific data and to comply with some specific requirements.
Being a lazy person and a fan of model-driven development, I have been looking for ways to optimize this process, and these are the options I came up with:
Document templates - Using master document templates (the presentation) with forms (the model), or restricting editing of a document to only a few key fields and then cross-referencing the entered data throughout the document, would do the trick... but I still feel that I could decouple the two layers a bit more.
UML modeling - Using CASE tools with UML support, I thought I could model my documents as packages and classes with annotations, change the model for each project, and generate a report using a document template. The problem is that those tools are not designed for handling large chunks of text, and I am having difficulty making progress.
Process modeling - Using Eclipse EPF (https://www.eclipse.org/epf/) seems like overkill for what I want to accomplish. Remember: I'm a lazy person.
I would like to ask the community about their experiences with model-based documentation, or about their ways of optimizing document generation throughout the software development cycle.
I'm not sure I fully understand, so apologies if this misses the mark.
I've faced (I think) a similar problem, where there's a many:many relationship between content and the documents it needs to be presented in. For example, a 'project overview' that needs to be included in a requirements document, project plan, etc.
Thus far the best solution I've found is:
Write each section in Markdown format. There are some nice editors that make writing Markdown easy and efficient (e.g. Mou on OSX).
Use Pandoc to convert the Markdown into reStructuredText (RST).
Use Sphinx to generate documents from the RST files.
I have multiple Sphinx doc templates, each of which combines some of the common sections with others specific to that doc. If one of the common sections gets updated, it's easy to regenerate all the docs that include it. Version control is straightforward since the source files are all plain text. Sphinx can also generate multiple formats easily: for example, HTML to put online, or PDF for printing/distribution.
You could remove the need for step 2 by writing RST natively. For me the extra step is worth it, as I haven't found an RST editor that's as comfortable or efficient as Mou. YMMV, of course.
It's not a perfect solution: for example, creating links across sections isn't that easy. But in general it works well for my needs.
Hope that helps.

How can I programmatically generate documents using TeX?

This applies to popular languages used for web development, e.g. Python, Java, Rails, etc.
I want to be able to programmatically generate TeX documents. For example, a user submits a form, one field contains the LaTeX code to be typeset, and the web service returns the typeset PDF.
Are there libraries available for such a task? I can't find any.
The only other solution I can think of is to use external shell command functionality that's usually available. But this is a bit messy.
Some time ago I created an Etherpad Lite plugin that lets you compile the LaTeX server-side. FlyLaTeX does something similar, but it didn't really work for me, and the code looked pretty messy and almost impossible to fix and debug when I had a look at it about 4 months ago.
Basically you need to generate a temporary file that you can then compile with LaTeX.
I don't know of any generation libraries, but LaTeX is quite easy to generate yourself. Pandoc can also convert various formats into LaTeX.
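A bare-bones version of the temp-file-plus-pdflatex approach described above might look like this (my sketch, shown in C++ only to match the rest of this page; the same two steps work in any language). The file names and flags are illustrative, and user-supplied LaTeX must be sanitized before anything like this goes near a web service:

#include <cstdlib>
#include <fstream>
#include <string>

int main() {
    std::string body = "Hello from a generated document.";

    // Write the LaTeX source to a scratch file.
    std::ofstream tex("job.tex");
    tex << "\\documentclass[12pt]{article}\n"
        << "\\begin{document}\n"
        << body << "\n"
        << "\\end{document}\n";
    tex.close();

    // -interaction=nonstopmode keeps pdflatex from waiting for console input.
    int rc = std::system("pdflatex -interaction=nonstopmode -halt-on-error job.tex");
    return rc == 0 ? 0 : 1;   // job.pdf appears next to job.tex on success
}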
There is also https://github.com/manuels/texlive.js/, which is an emscripten-based client-side port of LaTeX (it unfortunately has very limited capabilities and is quite large).

Any LaTeX web services with an API? [closed]

Is there a web service API that takes an HTTP request containing LaTeX like this:
http://some_web_service/texfile?texfile=
\new\documentclass[12pt]{article}
\begin{document}
bla
\end{document}
and returns:
bla.pdf
The Online LaTeX Equation Editor is perfect for this.
For example, an equation image can be embedded with the following markup:
![equation using Online Equation Editor]
(https://latex.codecogs.com/gif.latex?x&space;=&space;\frac{4}{5}+\pi\Omega\int_{2\pi}^{\infty}{5\left\(\frac{\tau+3}{2}\right\)d\omega})
Note that you will need to escape parentheses with a backslash, e.g.: \left\( stuff \right\)
If you look through the editor's API documentation, you will see that you can change the output format from GIF to PNG by changing the API URL endpoint from /gif.latex to /png.latex.
There are also options such as setting a white background by using \bg_white:
![equation using Online Equation Editor]
(https://latex.codecogs.com/gif.latex?\bg_white&space;x&space;=&space;\frac{4}{5}+\pi\Omega\int_{2\pi}^{\infty}{5\left\(\frac{\tau+3}{2}\right\)d\omega})
See also this meta.stackexchange answer and this tex.stackexchange answer. I'm sure there are many more answers that point to this tool and use it in different ways: instead of using the URL to generate a GIF or PNG via Markdown notation, you could use HTML markup and bypass Markdown, or you could just drag the image over to your post.
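Since the question asks for an HTTP API, here is a rough sketch (mine, not the answerer's) of fetching a rendered equation from the codecogs endpoint above with libcurl; the equation string and output file name are made up:

#include <cstdio>
#include <string>
#include <curl/curl.h>

static size_t write_to_file(char* data, size_t size, size_t nmemb, void* userp) {
    return std::fwrite(data, size, nmemb, static_cast<std::FILE*>(userp));
}

int main() {
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    // URL-encode the LaTeX so spaces, braces, etc. survive the query string.
    char* latex = curl_easy_escape(curl, "x = \\frac{4}{5}+\\pi\\Omega", 0);
    std::string url = std::string("https://latex.codecogs.com/png.latex?") + latex;
    curl_free(latex);

    std::FILE* out = std::fopen("equation.png", "wb");
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_to_file);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    CURLcode rc = curl_easy_perform(curl);   // PNG bytes are streamed into equation.png
    std::fclose(out);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}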
I'm looking for the same thing, and LaTeX Online seems to be the closest thing to what we need.
You just need to set up the server yourself.
EDIT
I've written my own little Sinatra app for this: https://github.com/codegestalt/sinatratex
ScribTeX has a CLSI API; you can send CLSI requests from any platform to compile LaTeX.
I blogged about this some time ago, along with a CLSI client written in F#.
The Common LaTeX Service Interface (CLSI) is a web service interface and implementation that exposes common LaTeX related capabilities (such as compiling LaTeX documents to different formats):
http://code.google.com/p/common-latex-service-interface/
(This interface is one of the ways that latexlab.org can compile latex)
Fundamentally, this shouldn't be any different from a build server like those you see for lots of open-source projects (along the lines of Koji, for example). Ultimately, you would just hook into pdflatex instead of gcc.
If you are able to install software on your own server, this isn't too hard: some combination of Perl / TT / latexmk along with a LaTeX distribution (e.g. TeX Live or MiKTeX).
I don't know about latexlab mentioned above. The closest thing I know of is http://www.tlhiv.org/ltxpreview/, which perhaps you could wrap to do what you need (or even write a how-to for your users).
Just for completeness:
This project on GitHub (also available as a Docker image) might do the trick.
It's basically a microservice implemented in Java/Spring Boot that wraps pdflatex. URLs are immutable and the PDFs are stored on disk/distributed storage. You'll have to host it yourself, though. The code is pretty simple and you're able/allowed to customize it to your needs. See the "Scaling the service" section in the readme for more details about setting up a production environment.

"Smart" way of parsing and using website data?

How does one intelligently parse data returned by search results on a page?
For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get the raw HTML data of the page and run some regexes to make the data usable for my web service, but if any of the websites change the formatting of their pages, my code breaks!
RSS is indeed a marvelous option, but many sites don't have an XML/JSON-based search.
Are there any kits out there that help extract information from pages automatically? A crazy idea would be to have a fuzzy AI module recognize patterns on a search results page and parse the results accordingly...
I've done some of this recently, and here are my experiences.
There are three basic approaches:
Option 1: Regular expressions.
Most flexible, easiest to use with loosely-structured info and changing formats.
Harder to do structural/tag analysis, but easier to do text matching.
Built in validation of data formatting.
Harder to maintain than the others, because you have to write a regular expression for each pattern you want to use to extract/transform the document.
Generally slower than 2 and 3.
Works well for lists of similarly-formatted items
A good regex development/testing tool and some sample pages will help. I've got good things to say about RegexBuddy here. Try their demo.
I've had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code.
Option 2: Convert HTML to XHTML and use XML extraction tools. Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
Tools: TagSoup, HTML Tidy, etc.
Quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
Best solution if data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc)
Most suitable for getting link structures, nested tables, images, lists, and so forth
Should be faster than option 1, but slower than option 3.
Works well if content formatting changes/is variable, but document structure/layout does not.
If the data isn't structured by HTML tags, you're in trouble.
Can be used with option 1.
Option 3: Parser generators (ANTLR, etc.) - create a grammar for parsing and analyzing the page.
I have not tried this, because it was not suitable for my (messy) pages.
Most suitable if the HTML is highly structured, very constant, regular, and never changes.
Use this if there are easy-to-describe patterns in the document, but they don't involve HTML tags and involve recursion or complex behaviors
Does not require XHTML input
FASTEST throughput, generally
Big learning curve, but easier to maintain
I've tinkered with Web-Harvest for option 2, but I find its syntax kind of weird: a mix of XML and some pseudo-Java scripting language. If you like Java and XML-style data extraction (XPath, XQuery), that might be the ticket for you.
Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP's older regex libraries lack these, and they're indispensable for matching data between open/close tags in HTML.
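To illustrate that last point (my example, in C++ for consistency with the rest of this page): lazy quantifiers and capture groups are exactly what make tag-to-tag matching bearable.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string html =
        "<li><a href='/a'>First</a></li><li><a href='/b'>Second</a></li>";

    // The lazy (.*?) stops at the first </a> instead of greedily swallowing
    // everything up to the last one; groups 1 and 2 capture the href value
    // and the link text.
    std::regex link("<a href='([^']*)'>(.*?)</a>");

    for (std::sregex_iterator it(html.begin(), html.end(), link), end; it != end; ++it)
        std::cout << (*it)[1] << " -> " << (*it)[2] << "\n";
    return 0;
}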
Without a fixed HTML structure to parse, I would hate to maintain regular expressions for finding data. You might have more luck parsing the HTML through a proper parser that builds a tree; then selecting elements would be more maintainable.
Obviously the best way is some XML output from the engine with fixed markup that you can parse and validate. I would think that an HTML parsing library with some 'in the dark' probing of the produced tree would be simpler to maintain than regular expressions.
This way, you just have to check on <a href="blah" class="cache_link">... turning into <a href="blah" class="cache_result">... or whatever.
Bottom line: grepping specific elements with regexes would be grim. A better approach is to build a DOM-like model of the page and look for 'anchors' to the character data in the tags.
Or send an email to the site making the case for an XML API... you might get hired!
You don't say what language you're using. In Java land you can use TagSoup and XPath to help minimise the pain. There's an example from this blog (of course the XPath can get a lot more complicated as your needs dictate):
import java.net.URL;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from a SAX stream provided by TagSoup
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml");
String title = ((Element) titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is " + title);
I'd recommend externalising the XPath expressions so you have some measure of protection if the site changes.
Here's an example XPath I'm definitely not using to screenscrape this site. No way, not me:
"//h:div[contains(#class,'question-summary')]/h:div[#class='summary']//h:h3"
You haven't mentioned which technology stack you're using. If you're parsing HTML, I'd use a parsing library:
Beautiful Soup (Python)
HTML Agility Pack (.NET)
There are also web services that do exactly what you're describing - commercial and free. They scrape sites and offer web service interfaces.
And a generic web service that offers some screen scraping is Yahoo Pipes; see this previous Stack Overflow question on that.
It isn't foolproof, but you may want to look at a parser such as Beautiful Soup. It won't magically find the same info if the layout changes, but it's a lot easier than writing complex regular expressions. Note that this is a Python module.
Unfortunately 'scraping' is the most common solution, as you said: attempting to parse HTML from websites. You could detect structural changes to the page and flag an alert for you to fix, so a change at their end doesn't silently result in bad data. Until the semantic web is a reality, that's pretty much the only way to guarantee a large dataset.
Alternatively, you can stick to the smaller datasets provided by APIs. Yahoo is working very hard to provide searchable data through APIs (see YDN), and I think the Amazon API opens up a lot of book data, etc.
Hope that helps a little bit!
EDIT: And if you're using PHP I'd recommend SimpleHTMLDOM
Have you looked into using an HTML manipulation library? Ruby has some pretty nice ones, e.g. Hpricot.
With a good library you can specify the parts of the page you want using CSS selectors or XPath. These would be a good deal more robust than regexes.
Example from hpricot wiki:
doc = Hpricot(open("qwantz.html"))
(doc/'div img[@src^="http://www.qwantz.com/comics/"]')
#=> Elements[...]
I am sure you could find a library that does similar things in .NET or Python, etc.
Try googling for screen scraping + the language you prefer.
I know several options for python, you may find the equivalent for your preferred language:
Beautiful Soup
mechanize: similar to Perl's WWW::Mechanize; gives you a browser-like object to interact with web pages
lxml: Python bindings for libxml2 and libxslt
scrapemark: uses templates to scrape pieces of pages
pyquery: allows you to make jQuery-style queries on XML/XHTML documents
Scrapy: a high-level scraping and web-crawling framework for writing spiders to crawl and parse web pages
Depending on the website to scrape you may need to use one or more of the approaches above.
If you can use something like Tag Soup, that'd be a place to start. Then you could treat the page like an XML API, kinda.
It has a Java and C++ implementation, might work!
Parsley at http://www.parselets.com looks pretty slick.
It lets you define 'parslets' using JSON, in which you describe what to look for on the page, and it then parses that data out for you.
As others have said, you can use an HTML parser that builds a DOM representation and query it with XPath/XQuery. I found a very interesting article here: Java theory and practice: Screen-scraping with XQuery - http://www.ibm.com/developerworks/xml/library/j-jtp03225.html
There is a very interesting online service for parsing websites: https://loadsiteinmysql.site. This service splits the site into tags and loads them into MySQL tables, which allows you to query the parsed sites using MySQL syntax.

Tool to generate XML file from XSD (for testing) [closed]

I have an XSD file and have not done much XML manipulation, parsing, etc. I want/need test XML files for my code but don't have any samples. (I am using Xerces to parse.)
This is similar to: xml-instance-generation-from-xml-schema-xsd
but I don't really want to make it a two step process. (python or java)
I just want to feed xsd file to some tool and have it generate a sample xml file. How can I do that?
Also see: how-to-generate-sample-xml-documents-from-their-dtd-or-xsd
Eclipse has tools for doing this (and it's free.)
EDIT (yeah, I was a little too terse): What you want are the XSD editing tools in Eclipse. I know they're bundled with the Eclipse IDE for Java EE Developers, and I think also with the Eclipse Modeling Tools download. (It's also possible to add them to an existing Eclipse install, though I don't know exactly which plugin(s) you'd want to add.)
(I'd like to be more precise than that, but the eclipse.org web site models itself after Massachusetts roads: if you don't know where you are, you don't belong there.)
Anyway: once you've got the right version of Eclipse, open the existing schema file for editing (or create a new one: select File -> New... -> Other... -> XML / XML Schema). When you're ready to generate a test XML file, locate the file in the Package Explorer (the navigator view, usually on the left side), right-click on it, and select Generate / XML File.
(What was I saying about navigability... ?)
Microsoft has published a "document generator" tool as a sample. This is an article that describes the architecture and operation of the sample app in some detail.
If you just want to use the doc generation tool, click here and install the MSI.
It's free, the source is available, and it requires the .NET Framework to run. It works only with XSDs (not RELAX NG or DTDs).
Oxygen's XML Schema Editor can generate sample XML instance documents from a given Schema.
I can do this very simply in VS2010; I don't know whether it's a new feature, but it works pretty easily. Right-click on the root element of an XSD in the 'XML Schema Explorer' and you will see an option to 'Generate Sample XML'. On clicking it, VS creates a temp file.
Liquid XML will do XML sample generation. I don't think there's a command-line option, but you can do it through the UI. It seems to do a pretty good job and gets all the data types/enums right; the only thing it seems to struggle with is patterns, but then understanding a regular expression well enough to produce a valid string is a bit tricky...
Using Eclipse Ganymede or later you can generate XML from an XSD. Just right-click the XSD and navigate to Generate > XML.
I've used XMLSpy for this in the past with great success.
You may use XMLFox for this simple task.
Have you taken a look at Microsoft's XML Schema Definition Tool (xsd.exe)?
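For completeness, once one of these tools has produced a sample, checking it against the schema with Xerces-C++ (which the asker is already using) could look roughly like this; the file names are hypothetical:

#include <iostream>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/util/PlatformUtils.hpp>

using namespace xercesc;

int main() {
    XMLPlatformUtils::Initialize();
    {
        XercesDOMParser parser;
        parser.setValidationScheme(XercesDOMParser::Val_Always);
        parser.setDoNamespaces(true);
        parser.setDoSchema(true);
        // Hypothetical file names: point the parser at your schema and the generated sample.
        parser.setExternalNoNamespaceSchemaLocation("myschema.xsd");
        parser.parse("sample.xml");
        std::cout << "validation errors: " << parser.getErrorCount() << "\n";
    }
    XMLPlatformUtils::Terminate();
    return 0;
}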