How to parse HTML with C++/Qt?

How can I parse the following HTML?
<body>
<span style="font-size:11px">12345</span>
<a>Hello</a>
</body>
I would like to retrieve the data "12345" from the "span" with style="font-size:11px" on www.testtest.com, and I want only that data, nothing else.
How can I accomplish this?

I think QXmlQuery is what you want.
I think the code will be something like this:
QXmlQuery query;
query.setFocus(html); // html is a QString holding well-formed markup
query.setQuery("/body/span[@style='font-size:11px']/text()");
QString r;
query.evaluateTo(&r);
You can also point the query directly at a URL:
query.setFocus(QUrl("http://www.testtest.com"));
query.setQuery("/body/span[@style='font-size:11px']/text()");

EDIT: From the Qt 5.6 release blog post:
With 5.6, Qt WebKit and Qt Quick 1 will no longer be supported and are dropped from the release. The source code for these modules will still be available.
So, as of Qt 5.6, QtWebKit is no longer available unless you are willing to compile it from source. If you are using a Qt release older than 5.6 or are willing to compile QtWebKit, this might be helpful; otherwise this answer is no longer valid.
It is hard to tell you exactly what needs to be done because your description of the use case is incomplete. However, there are two ways of proceeding.
QtWebKit
If you already need any other functionality from that module, this is not going to introduce any further dependencies, and it will be the most convenient for you to use.
You need to get a QWebElement: https://doc.qt.io/archives/qt-5.5/qwebelement.html
You obtain it by finding the first "span" element in your HTML:
https://doc.qt.io/archives/qt-5.5/qwebframe.html#findFirstElement
Then you can simply get the text of that element with the corresponding QWebElement method(s). For instance, you can use this one to get an attribute value:
https://doc.qt.io/archives/qt-5.5/qwebelement.html#attribute
... but you can also request the attribute names as you can see in the documentation, etc.
This is how you will get the 12345 value:
https://doc.qt.io/archives/qt-5.5/qwebelement.html#toPlainText
XML parser in QtCore
If you do not need WebKit for your software, and the HTML data does not come directly from the web (for which you would need QtWebKit), then you are better off using the XML parser available in QtCore. Even if you have no other dependency on QtWebKit, it may still be the case that adding it would cause no issues in your use case; it is hard to tell from your description. This approach is certainly less convenient, albeit not by much, than the WebKit-based solution, since the latter is designed for HTML.
What you need to avoid is QtXmlPatterns. It is unmaintained as of now, and it would introduce an additional dependency for your code either way.

Related

Custom Implementing multileg option orders in QuickFIX 4.2

Multileg option orders are not supported in FIX Protocol 4.2. I've implemented custom tags but never a new message type. Can anyone provide a roadmap of the steps to implement NewOrderMultileg msgtype="AB" into the QuickFix FIX42 namespace?

This should help; it is more or less how you do it. It is written for QuickFIX/N, but the method of adding new messages is consistent across all QuickFIX libraries.
Another way is to borrow the message definition from the data dictionary of a FIX version where it exists. I believe data dictionaries for all versions ship with the QuickFIX releases. But you need to be careful how you do it, i.e. check the fields, repeating groups, etc.
You might also have to add some code if the new message class doesn't exist at all, and you will have to engineer it to fit into your existing library. This might need some work and may throw up some unexpected errors. For this you can refer to a QuickFIX library version which does have the class.

Good way to maintain Qt labels and text on UI

What is the best way/common practice for maintaining all string resources found on a UI in Qt, especially the textual input/text in combo boxes etc. (since these are the ones that are frequently used in the code itself)?
I know that Android has this string resources thing such that resources only have to be modified at one position.
Does Qt have something like that too or do I have to initialize string resources in code instead of in the UI's XML itself...

AFAIK, there is no built-in mechanism for string resources in Qt. If you want to maintain strings at build time you can define them in one .h/.cpp file as global variables and reuse them in your code.
Otherwise you can use Qt's translation files (.qm, binary) and load them along with your application. If you need to change a string, you simply edit the translation source file (.ts, XML) and "recompile" it with the lrelease utility, without building the application again.

There is a mechanism for dynamically translating text in an application. It works a bit differently from Android string resources but achieves the same goals.
Qt uses an i18n system modelled after the standard, well-known Unix gettext. It works in a very similar way to iOS NSLocalizedString, if that rings a bell.
http://doc.qt.io/qt-5/qobject.html#tr
This is worth reading too:
http://en.wikipedia.org/wiki/Gettext
http://doc.qt.io/qt-5/internationalization.html
The Android approach is a bit unique and you should not expect it to be a "standard everywhere". It works, it's OK, but it's not the standard way of doing things on the desktop.

Create PDFs with editable forms in Qt

I'm trying to find out if there's a way to embed an editable text cell in a PDF generated in a Qt application. I'm currently using QPrinter to generate the PDF, but if there's another library that could do this, that would be fine. The environment is limited, though, to C or C++, so libraries like iText are out. In terms of form capabilities, this pdf,
http://examples.itextpdf.com/results/part2/chapter08/text_fields.pdf, is a good example with the exception that I don't need a password text field.
Thanks,
Frank

This may not be terribly helpful, but I'll throw it out there anyway.
wkhtmltopdf is based on QtWebKit.
One of its command line options is to convert HTML fields into PDF fields (off by default).
There's almost no pdf-related code within wkhtmltopdf. Certainly nothing dealing with fields. Something upstream is doing the PDF conversion for them.
So find out what that "something" is and you're golden.
EDIT: That or spend a lot of time writing JNI wrappers for iText. :/ Having done so myself, I can say it'd be much more interesting to write a JNI generator tailored to iText, but far more practical to write a Java app that uses iText and then make JNI calls from your C/C++ app to pass the data it'll need and retrieve any response.
The form field borders are a part of the page, not the field itself. Odd, but that's not the first time I've encountered it. Our own software, LiquidOffice, used to generate fields with backgrounds AcroForms couldn't support the same way (now we use an icon-only button).
Those real PDF fields have their visibility flags set to "visible but doesn't print" within the PDF. I doubt wkhtmltopdf will let you control that directly. Patch time.
BUT, you could make a second pass with some PDF manipulation library to go through and change the visibility settings on your fields. I'm partial to iText, but there are many other fish in that particular sea.

Is there such a thing like a Printer-Markup-Language

I would like to print a document. Its content consists of tables and text in different colors. Does a lightweight printer file format exist that can be used like a template?
PS, PDF, and DOC files are, in my opinion, too heavy to parse. Is there some XML or YAML file format which supports:
Easy creation (maybe with a WYSIWYG-Editor)
Parsing and manipulation with Library-Support
Easy sending to the printer (maybe with Library-Support)
Or do I have to do it the usual way and paint within a CDC?

I noticed you’re using MFC (so, Windows). In that case the answer is a qualified yes. In recent versions of Windows, Microsoft offers the XPS Document API which lets you create and manipulate a PDF-like document using XML, which can then be printed using the XPS Print API.
(For earlier versions of Windows that don’t support this API, you could try to deal with the XPS file format directly, but that is probably a lot harder than using CDC. Even with the API you will be working at a fairly low level.)
End users can generate XPS documents using the XPS print driver that is available for free from Microsoft (and bundled with certain MS products—they probably already have it on their system).

There is no universal language that is supported across all (or even many) printers. While PCL and PS are the most used, there are also printers which only work with specific printer drivers because they only support a proprietary data format (often pre-rendered on the client).
However, you could use XSL-FO to create documents which can then be rendered to a printer driver using library support.

I think something like TeX or LaTeX (or even troff or groff) may meet your needs. Google them and see.
There are also libraries to render documents for print from HTML source. Look at http://libharu.sourceforge.net/ for example. This outputs a printer-ready .PDF

I think that PostScript is a really good choice for that.
It is actually a very simple language, and it should be easy to parse because it is stack-oriented. Also, most printers support it, and even if yours does not, you can use Ghostscript to convert it to many different formats (consider GS a "virtual PS-supporting printer").
Finally there are a lot of books and tutorials for the language.
About the parsing -- you can actually define new variables and functions in PS. So, maybe, your problem can be solved (almost) entirely using PS.
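As a rough illustration of how little PostScript is needed for colored text, the sketch below generates a one-page program from C++. The function names, the font choice and the small showat procedure are made up for this example; showat is there only to show a user-defined PS function and the stack-oriented calling style.

```cpp
#include <sstream>
#include <string>

// Escapes the characters that are special inside a PostScript (...) string.
std::string psEscape(const std::string &text)
{
    std::string out;
    for (char c : text) {
        if (c == '(' || c == ')' || c == '\\')
            out += '\\';
        out += c;
    }
    return out;
}

// Builds a one-page PostScript program that prints `text` in the given RGB
// color (components in 0..1) at (x, y), measured in points from the
// bottom-left corner of the page.
std::string makeColoredTextPage(const std::string &text,
                                double r, double g, double b, int x, int y)
{
    std::ostringstream ps;
    ps << "%!PS\n"
       << "/Helvetica findfont 12 scalefont setfont\n"
       // User-defined procedure: expects (string) x y on the stack.
       << "/showat { moveto show } def\n"
       << r << ' ' << g << ' ' << b << " setrgbcolor\n"
       << '(' << psEscape(text) << ") " << x << ' ' << y << " showat\n"
       << "showpage\n";
    return ps.str();
}
```

The resulting string can be written to a file and sent to a PostScript printer or passed through Ghostscript.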

HTML + CSS can be printed -- properly. CSS was designed to support this with the media attribute to specify that your CSS is for printer layout, not for screen layout. Tools like PRINCE (free + commercial versions) exist to render this for printing.

I think postscript is the markup language used by printers. I read this somewhere, so correct me if postscript is now outdated.
http://en.wikipedia.org/wiki/PostScript

For a more powerful suite you can use LaTeX. It lets you create templates into which you can just copy the text.
On a more GUI friendly note, MS-Word and other word processors have templates. The issue is they are not of a common standard or markup.
You can also use HTML to render stuff in a common markup but it will not be very printer friendly.

building objects from xml file at runtime and initializing, in one pass?

I have to parse an XML file and build an object representation from it. Once I have all the data, I create entries in various databases for these data objects. I then have to make a second pass for the values, because in the first pass all I could do was create the assets in the various databases; in the second pass I get the values for all the data and put them in the database.
I have a feeling that this can be done in a single pass, but I would like to hear your opinions. I am a student who has just started professional work, so help from experienced people would be welcome.
Can someone who has ideas or has done similar work shed some light on the topic, so that I can think over the possibilities and get a prototype going based on your suggestions?
Thanks a lot for your precious time, I honestly appreciate it.

You might be interested in learning about the common techniques for building XML parsers, such as DOM and SAX. As the SAX description says, the only thing that could require a second pass is XML validation, not creating the tree.

Besides DOM and SAX parsing, you can use XQuery for querying data from XML files. It is fast, robust and efficient.
here is a link
You can use the Qt XML modules for DOM, SAX and XQuery; by the way, Qt is open source.
Another option is XML to C++ data binding. Here is the link. You can generate C++ code directly from the definition. It is an elegant solution.
EDIT:
The latter one works at compile time.

You can also use the Apache-licensed xmlbeansxx: http://xmlbeansxx.touk.pl/. It works under Windows and Linux.

You could take a look at the somewhat simpler "pull" API called StAX instead of using SAX (event based).
