C/CPP version of BeautifulSoup especially at handling malformed HTML


Are there any recommendations for a C/C++ library that can be used to easily (as far as that is possible) parse, iterate over, and manipulate HTML streams or files, assuming some might be malformed, e.g. tags not closed?
BeautifulSoup

HTMLparser from Libxml is easy to use (simple tutorial below) and works great even on malformed HTML.
Edit: The original blog post is no longer accessible, so I've copied its content here.
Parsing (X)HTML in C is often seen as a difficult task.
It's true that C isn't the easiest language to use to develop a parser.
Fortunately, libxml2's HTMLParser module comes to the rescue. So, as promised, here's a small tutorial explaining how to use libxml2's HTMLParser to parse (X)HTML.
First, you need to create a parser context. There are several functions for doing that, depending on how you want to feed data to the parser. I'll use htmlCreatePushParserCtxt(), since it works with memory buffers.
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);
Then, you can set many options on that parser context.
htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
We are now ready to parse an (X)HTML document.
// char * data : buffer containing part of the web page
// int len : number of bytes in data
// Last argument is 0 if the web page isn't complete, and 1 for the final call.
htmlParseChunk(parser, data, len, 0);
Once you've pushed all your data, you can call that function again with a NULL buffer and 1 as the last argument. This ensures that the parser has processed everything.
Finally, how to get the data you parsed? That's easier than it seems. You simply have to walk the XML tree created.
void walkTree(xmlNode * a_node)
{
xmlNode *cur_node = NULL;
xmlAttr *cur_attr = NULL;
for (cur_node = a_node; cur_node; cur_node = cur_node->next)
{
// do something with that node information, like... printing the tag's name and attributes
printf("Got tag : %s\n", cur_node->name);
for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next)
{
printf(" -> with attribute : %s\n", cur_attr->name);
}
walkTree(cur_node->children);
}
}
walkTree(xmlDocGetRootElement(parser->myDoc));
And that's it! Isn't that simple enough? From there, you can do any kind of stuff, like finding all referenced images (by looking at img tag) and fetching them, or anything you can think of doing.
Also, you should know that you can walk the XML tree anytime, even if you haven't parsed the whole (X)HTML document yet.
If you have to parse (X)HTML in C, you should use libxml2's HTMLParser. It will save you a lot of time.

You could use Google's gumbo-parser:
Gumbo is an implementation of the HTML5 parsing algorithm implemented as a pure C99 library with no outside dependencies. It's designed to serve as a building block for other tools and libraries such as linters, validators, templating languages, and refactoring and analysis tools.
#include "gumbo.h"
int main() {
GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
// Do stuff with output->root
gumbo_destroy_output(&kGumboDefaultOptions, output);
}
There's also a C++ binding for this library, gumbo-query:
A C++ library that provides jQuery-like selectors for Google's Gumbo-Parser.
#include <iostream>
#include <string>
#include "Document.h"
#include "Node.h"
int main(int argc, char * argv[])
{
std::string page("<h1><a>some link</a></h1>");
CDocument doc;
doc.parse(page.c_str());
CSelection c = doc.find("h1 a");
std::cout << c.nodeAt(0).text() << std::endl; // some link
return 0;
}

I've only used libCurl C++ for this type of thing but found it to be pretty good and useable. Don't know how it would cope with broken HTML though.

Using SIP to run BeautifulSoup might help.
More details are in the linked thread: OpenFrameworks + Python

Related

Missing Xerces C++ class to copy attributes of element for use after SAX2 parsing

The documentation of xerces anticipates the need to make a copy of attributes, but the AttributesImpl class doesn't seem to exist. Neither does the facility seem to exist in other associated classes in either the current 3.2.3 version of xerces or previous 2.X
Xerces documentation in the file itself src/xercesc/sax2/Attributes.hpp says:
"The instance provided will return valid results only during the scope of the startElement invocation (to save it for future use, the application must make a copy: the AttributesImpl helper class provides a convenient constructor for doing so)."
See also the issue I filed as a bug in Xerces:
https://issues.apache.org/jira/browse/XERCESC-2238
It appears I will be stuck creating my own version of Attributes into which to copy or clone, so that the values are not overwritten on each new line. I am not saving the whole document (which would defeat the purpose of a SAX streaming parse), but the existing framework that populates Attributes is pretty convoluted and undocumented. Obviously the library and docs are designed for using the API, not for hacking or extending the application.
Is this really correct: AttributesImpl is a helper class in the documentation that doesn't actually exist? And is there no other class with this functionality to save an element's attributes for later use (outside the handler)?

Below is a working version of an Attributes deep copy utility function. It may be missing a few includes which I'm getting from other includes of my larger file. When I get the chance, I'll try making this a stand alone and update this answer. It still falls short of the Java version utility, due to inaccessible members of RefVectorOf class, because the wrapping class, the Attributes interface and VecAttributesImpl interface, do not provide access to them. https://xerces.apache.org/xerces-j/apiDocs/org/xml/sax/helpers/AttributesImpl.html
Last release of Xerces C/C++ is from 2016, so although marked status active, https://projects.apache.org/project.html?xerces-for_c++_xml_parser , really not so much. Can't vouch for libhunt site, but came up in quick google just now https://cpp.libhunt.com/xerces-c++-alternatives . One can see latest comment here, note use of the phrase "unless a security issue pops up or new committers appear to revive the project" https://issues.apache.org/jira/browse/XERCESC-2238?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17571942#comment-17571942
Leaving the status of Xerces C/C++ as active is either a lie or a gross and negligent oversight. This page shows no major release since 2010. https://xerces.apache.org/news.html (C++ is listed below the Java project updates)
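#include <sstream>
#include <xercesc/validators/common/GrammarResolver.hpp>
#include <xercesc/framework/XMLGrammarPool.hpp>
#include <xercesc/framework/XMLAttr.hpp>
#include <xercesc/internal/VecAttributesImpl.hpp>
#include <xercesc/internal/XMLScannerResolver.hpp>
#include <xercesc/sax2/Attributes.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/RefVectorOf.hpp>
#include <xercesc/util/XMLString.hpp>
#include "spdlog/spdlog.h"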
#define tr XMLString::transcode
static spdlog::logger logger = getLog();
/*
* cloneAttributes
* Uses LocalName as key instead of QName and ignores URI and URI id, id inside RefVectorOf
* but inaccessible to wrapper VecAttributesImpl, and type defaults to CDATA
*/
VecAttributesImpl* cloneAttributes(VecAttributesImpl& attrs, bool useScanner=false){
// from XMLReaderFactory::CreateXMLReader line 49
MemoryManager* const memManager = XMLPlatformUtils::fgMemoryManager;
XMLScanner* scanner;
if(useScanner){
// from void SAX2XMLReaderImpl::initialize() line 124
GrammarResolver* grammarResolver = new (memManager) GrammarResolver(0, memManager); // line 127
// use of 0 from SAX2XMLReaderImpl.hpp line 74 default constructor, XMLGrammarPool* const gramPool = 0
XMLStringPool* URIStringPool = grammarResolver->getStringPool(); // line 128
scanner = XMLScannerResolver::getDefaultScanner(0, grammarResolver, memManager);
// line 42 of XMLScannerResolver::getDefaultScanner uses return new (manager) IGXMLScanner(valToAdopt, grammarResolver, manager);
scanner->setURIStringPool(URIStringPool);
}else{
scanner = NULL;
}
VecAttributesImpl* newAttrs = new VecAttributesImpl(); //VecAttributesImpl is not a vector, it's a wrapper around RefVectorOf
RefVectorOf<XMLAttr> * newRefVectorOf = new (memManager) RefVectorOf<XMLAttr> (32, false, memManager) ;
XMLSize_t atLen = attrs.getLength();
XMLSize_t i;
std::stringstream bruce;
XMLAttr* cpXMLAttr;
for(i = 0;i<atLen;i++){
//Ever QName != LocalName? when URI != ""? logger.debug(format("{}. QName LocalName URI type: {}, {}, {}, {}", i, tr(attrs.getQName(i)), tr(attrs.getLocalName(i)), tr(attrs.getURI(i)),tr(attrs.getType(i)))); // #suppress("Invalid arguments")
cpXMLAttr = new (memManager) XMLAttr
(
0, //URIId, 0 if reading file, but int is inaccessible from attrs, inside RefVectorOf XMLAttr, and getURI(i) returns an XMLCh*
attrs.getLocalName(i),
attrs.getValue(i)
);
if(logger.level() == spdlog::level::debug){
bruce << tr(attrs.getLocalName(i))<<" : "<<tr(attrs.getValue(i))<< " | ";
}
newRefVectorOf->addElement(cpXMLAttr);
}
logger.debug(bruce.str());
logger.debug(newRefVectorOf->size());
//The scanner can actually be set to NULL and the above scanner construction skipped if the VecAttributesImpl isn't scanning.
newAttrs->setVector(newRefVectorOf, newRefVectorOf->size(), scanner, false);
return newAttrs;
}

How to get the next enum value from an enum in protobuf?

I have a protobuf message with non-consecutive enum values something like this:
message Information {
enum Versions {
version1 = 0;
version2 = 1;
version3 = 10;
version4 = 20;
version5 = 30;
}
}
I want to have a C++ function GetNextVersion() which takes one enum version and gives the next version as output. For example, GetNextVersion(Information::version4) should give Information::version5 as output. Is there any built-in and easy method to do this?

You can use protobuf's reflection to achieve the goal:
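#include <iostream>
#include <stdexcept>
// Also include the header that protoc generated for the Information message.
using namespace std;
Information::Versions GetNextVersion(Information::Versions ver) {
const auto *desc = Information::Versions_descriptor();
// FindValueByNumber() returns a null pointer for a number that is not in the enum.
const auto *cur = desc->FindValueByNumber(ver);
if (cur == nullptr) {
throw runtime_error("unknown enum value");
}
auto cur_idx = cur->index();
if (cur_idx >= desc->value_count() - 1) {
throw runtime_error("no next enum");
}
auto next_idx = cur_idx + 1;
return Information::Versions(desc->value(next_idx)->number());
}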
int main() {
try {
auto ver = Information::version1;
while (true) {
cout << ver << endl;
ver = GetNextVersion(ver);
}
} catch (const runtime_error &e) {
cout << e.what() << endl;
}
return 0;
}

Is there any inbuilt and easy method to do this?
I see no easy method to get that.
But I can suggest metaprogramming approaches (at least on Linux) with C++ code generation.
You could, assuming you have access to the source code of protobuf-c :
write some GNU gawk script to parse that C++ code and generate the C++ code of GetNextVersion
perhaps write some GNU sed (or a Python one) script doing the same.
write some GCC plugin and use it to parse that C++ code and generate the C++ code of GetNextVersion
write some GNU emacs code doing the same.
wait a few months and (in spring 2021) use Bismon. I am developing it, so contact me by email
extend and adapt the Clang static analyzer for your needs.
extend and adapt the SWIG tool for your needs.
extend and adapt the RPCGEN tool for your needs.
use GNU bison or ANTLR to parse C++ code, or design your domain specific language with some documented EBNF syntax and write some code generator with them.
You could also keep the description of enum Versions in some database (SQLite, PostgreSQL, etc.) or some JSON file or some CSV file (or an XML one, using XSLT or libexpat) and emit it (for protobuf) along with the source code of GetNextVersion using some Python script, or GNU m4, or GPP.
You could write a GNU guile script or some rules for CLIPS generating some C++ code with your protobuf description.
In a few months (spring 2021), the RefPerSys system might be helpful. Before that, you could contribute and extend it and reuse it for your needs.
A pragmatic approach could be to add a comment in your protobuf declaration to remind you of editing another file when you need to change the protobuf message and protocol.

No, there isn't.
You define your own data type, so you also must define the operators for it.
So your GetNextVersion() method has to contain the knowledge of how to increment the version number. If you had decided to use an integer, the compiler would already know how to increment it, but you wanted something special, and that is the price you have to pay for it.
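If the reflection call is not wanted, the "define the operation yourself" idea can be sketched with a hand-maintained, sorted table of the enum values. This is only an illustration under stated assumptions: the plain `Versions` enum stands in for the protobuf-generated `Information::Versions` so the snippet compiles without protobuf, and `kAllVersions` and the bool-plus-out-parameter signature are hypothetical choices, not part of the question or of protobuf.

```cpp
#include <algorithm>
#include <iterator>

// Stand-in for the generated enum Information::Versions (assumption for this sketch).
enum Versions { version1 = 0, version2 = 1, version3 = 10, version4 = 20, version5 = 30 };

// Hand-maintained table of all values in ascending numeric order.
// It has to be edited whenever the enum in the .proto file changes.
const Versions kAllVersions[] = {version1, version2, version3, version4, version5};

// Writes the smallest table entry greater than `ver` to `*next` and returns true,
// or returns false (leaving `*next` untouched) when `ver` is the last version.
bool GetNextVersion(Versions ver, Versions* next) {
    const Versions* end = std::end(kAllVersions);
    const Versions* it = std::upper_bound(std::begin(kAllVersions), end, ver);
    if (it == end) {
        return false;
    }
    *next = *it;
    return true;
}
```

Calling `GetNextVersion(version4, &next)` returns true and sets `next` to `version5`; for `version5` it returns false. Returning a flag instead of throwing keeps "no next version" an ordinary result rather than an error.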

Serializing a FlatBuffer object to JSON without its schema file

I've been working with FlatBuffers as a solution for various things in my project, one of them specifically being JSON support. However, while FB natively supports JSON generation, the documentation for flatbuffers is poor, and the process is somewhat cumbersome. Right now, I am working in the Object->JSON direction. The issue I am having doesn't really arise the other way around (I think).
I currently have JSON generation working per an example I found here (line 630, JsonEnumsTest()): parsing a .fbs file into a flatbuffers::Parser, building and packaging my FlatBuffer object, then running GenerateText() to generate a JSON string. The code I have is simpler than the example in test.cpp, and looks vaguely like this:
bool MyFBSchemaWrapper::asJson(std::string& jsonOutput)
{
//**This is the section I don't like having to do
std::string schemaFile;
if (flatbuffers::LoadFile((std::string(getenv("FBS_FILE_PATH")) + "MyFBSchema.fbs").c_str(), false, &schemaFile))
{
flatbuffers::Parser parser;
const char *includePaths[] = { getenv("FBS_FILE_PATH"), nullptr }; // Parse() expects a null-terminated list
parser.Parse(schemaFile.c_str(), includePaths);
//**End bad section
parser.opts.strict_json = true;
flatbuffers::FlatBufferBuilder fbBuilder;
auto testItem1 = fbBuilder.CreateString("test1");
auto testItem2 = fbBuilder.CreateString("test2");
MyFBSchemaBuilder myBuilder(fbBuilder);
myBuilder.add_item1(testItem1);
myBuilder.add_item2(testItem2);
FinishMyFBSchemaBuffer(fbBuilder, myBuilder.finish());
auto result = GenerateText(parser, fbBuilder.GetBufferPointer(), &jsonOutput);
return true;
}
return false;
}
Here's my issue: I'd like to avoid having to include the .fbs files to set up my Parser. I don't want to clutter an already large monolithic program by adding even more random folders, directories, environment variables, etc. I'd like to be able to generate JSON from the compiled FlatBuffer schemas, and not have to search for a file to do so.
Is there a way for me to avoid having to read back in my .fbs schemas into the parser? My intuition is pointing to no, but the lack of documentation and community support on the topic of FlatBuffers & JSON is telling me there might be a way. I'm hoping that there's a way to use the already generated MyFBSchema_generated.h to create a JSON string.

Yes, see Mini Reflection in the documentation: http://google.github.io/flatbuffers/flatbuffers_guide_use_cpp.html

C++ - Using a variable without knowing what it is called

I have a program that uses plug-ins. As I'm in development, these plug-ins are currently just .h and .cpp files that I add or remove from my project before re-compiling, but eventually they will be libraries.
Each plug-in contains lists of data in vectors, and I need to dynamically load data from the plug-ins without knowing which plug-ins are present. For instance:
// plugin1.h
extern vector<int> plugin1Data;
// plugin2.h
extern vector<int> plugin2Data;
// main.cpp
vector<vector<int>> pluginDataList;
int CountPlugins () {
// Some function that counts how many plug-ins are present, got this bit covered ;)
}
int main() {
int numPlugins = CountPlugins();
for (int i = 0; i < numPlugins; i++) {
vector<int> newPluginData = /***WAY TO ADD PLUGIN DATA!!!***/;
pluginDataList.push_back(newPluginData);
}
}
I already access the names of each plugin present during my CountPlugins() function, and have a list of names, so my first gut feeling was to use the name from each plugin to create a variable name like:
vector<string> pluginNames = /*filled by CountPlugins*/;
string pluginDataName = pluginNames.at(i) + "Data";
// Use pluginDataName to locate plugin1Data or plugin2Data
That's something I've done before in c# when I used to mess around with unity, but I've read a few stackoverflow posts clearly stating that it's not possible in c++. It's also a fairly messy solution in C# anyway as far as I remember.
If each plugin was a class instead of just a group of vectors, I could access the specific data doing something like plugin2.data... but then I still need to be able to reference the object stored within each plugin, and that'll mean that when I get round to compiling the plugins as libraries, I'll always have to link to class declaration and definition, which isn't ideal (though not out of the question if it'll give a nicer solution over all).
I'm all out of ideas after that, any help you can offer will be most welcome!
Thanks! Pete

Why don't you exchange the data as JSON between the application and the plugins? That way you will also allow other technologies to plug into your app, like JavaScript-based plugins via an embedded version of V8, or C#/.NET plugins via Mono.
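Independent of the exchange format, the question's core need (finding a plug-in's data by name at run time) can be approximated by having each plug-in register its data under a string key instead of exposing a differently named global. The following is a minimal sketch of that registry idea, which is a different technique from the JSON exchange suggested above; `PluginRegistry`, `PluginRegistrar`, and `CollectPluginData` are hypothetical names, and the sample data is made up for illustration.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical registry mapping a plug-in name to that plug-in's data.
// The function-local static is created on first use, so it already exists
// when the registrars below run during static initialization.
std::map<std::string, std::vector<int>>& PluginRegistry() {
    static std::map<std::string, std::vector<int>> registry;
    return registry;
}

// Helper whose constructor registers data; each plug-in defines one static instance.
struct PluginRegistrar {
    PluginRegistrar(const std::string& name, std::vector<int> data) {
        PluginRegistry()[name] = std::move(data);
    }
};

// --- would live in plugin1.cpp ---
static PluginRegistrar plugin1Registrar("plugin1", {1, 2, 3});
// --- would live in plugin2.cpp ---
static PluginRegistrar plugin2Registrar("plugin2", {40, 50});

// Host side: collect the data of every registered plug-in, ordered by name.
std::vector<std::vector<int>> CollectPluginData() {
    std::vector<std::vector<int>> all;
    for (const auto& entry : PluginRegistry()) {
        all.push_back(entry.second);
    }
    return all;
}
```

With this layout the host never needs to know a plug-in's variable names; `PluginRegistry().at("plugin2")` looks the data up by string. One caveat to verify for your toolchain: when plug-ins are built as static libraries, a linker may drop object files whose only purpose is such a static registrar.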

xerces_3_1 is able to create invalid xml at comments & processing instructions

I've encountered a problem using the xerces-dom library:
When you add comments to the XML tree like this:
DOMDocument* doc = impl->createDocument(0, L"root", 0);
DOMElement* root = doc->getDocumentElement();
DOMComment* com1 = doc->createComment(L"SetA -- DataA");
DOMComment* com2 = doc->createComment(L"SetB -- DataB");
doc->insertBefore(com1, root);
doc->insertBefore(com2, root);
That will create the following xml-tree:
<?xml version="1.0" encoding="UTF-8" standalone="false"?>
<!--SetA -- DataA-->
<!--SetB -- DataB-->
<root/>
which is indeed invalid xml.
The same can be done with processing instructions by using ?> as data:
DOMProcessingInstruction* procInstr = doc->createProcessingInstruction(L"target", L"?>");
My question:
Is there a way I can configure Xerces to not create these kinds of comments, or do I have to check for these things myself?
And my other question: Why isn't it possible to just always escape characters like <>&'", even in comments and processing instructions, in order to avoid these kind of problems?

A DOMDocument is not an XML document. It is supposed to represent one, but it is conceivable that a valid DOM may not be serializable into a valid XML document (the converse should be less likely). Indeed this appears to be the case here:
Neither the Level 1 nor the Level 2 spec says anything about this, but the Level 3 DOM specification added this sentence about the DOMComment interface:
No lexical check is done on the content of a comment and it is therefore possible to have the character sequence "--" (double-hyphen) in the content, which is illegal in a comment per section 2.5 of [XML 1.0]. The presence of this character sequence must generate a fatal error during serialization.
So Xerces is operating within the DOM Level 3 specification even if it accepts a comment with '--' in it, as long as it bombs if you go to serialize it.
Not a great situation, but it makes sense because DOM was originally intended to represent XML Documents that have been read in, not to create new ones. So it is liberal in what it can represent. Fine for reading - a DOMComment can represent anything (and more) the XML document can, but a bit annoying that it doesn't catch the invalid string when you createComment().
Checking DOMDocumentImpl.cpp we see:
DOMComment *DOMDocumentImpl::createComment(const XMLCh *data)
{
return new (this, DOMMemoryManager::COMMENT_OBJECT) DOMCommentImpl(this, data);
}
And in DOMCommentImpl.cpp we have just:
DOMCommentImpl::DOMCommentImpl(DOMDocument *ownerDoc, const XMLCh *dat)
: fNode(ownerDoc), fCharacterData(ownerDoc, dat)
{
fNode.setIsLeafNode(true);
}
Finally we see in DOMCharacterDataImpl.cpp that there is no chance of validation up front - it just saves the user provided string without checking it.
DOMCharacterDataImpl::DOMCharacterDataImpl(DOMDocument *doc, const XMLCh *dat)
{
fDoc = (DOMDocumentImpl*)doc;
XMLSize_t len=XMLString::stringLen(dat);
fDataBuf = fDoc->popBuffer(len+1);
if (!fDataBuf)
fDataBuf = new (fDoc) DOMBuffer(fDoc, len+15);
fDataBuf->set(dat, len);
}
Sadly, no: Xerces does not have an option or even a nice hook to check this for you. And because the Level 3 spec seems to demand that "No lexical check is done", it probably isn't even legal to add one.
Your second question is simpler to answer: because that's the way they wanted it defined. See the XML 1.1 spec, for example:
Comments
[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
It is similar for PIs.
The grammar simply does not allow for escapes. Seems about right: baroque and broke.
Maybe there is a way to catch the error on serialization or normalization, but I wasn't able to confirm whether Xerces 3.1 can. To be safe I think the best way is to wrap createComment() and check for it before creating the node, or walk the tree and check it yourself.
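A check of the kind suggested above could look roughly like the sketch below, which tests a string against the Comment production quoted earlier before the node is created. The function names are hypothetical and nothing here is Xerces API; the templates take any `std::basic_string`, so the caller would first copy the `XMLCh` data into a string of the matching character type. The sketch only covers the hyphen and `?>` rules and does not validate the XML `Char` range.

```cpp
#include <cstddef>
#include <string>

// True if `data` fits the body of the XML Comment production:
// every '-' must be followed by a character that is not '-'.
// This rejects "--" anywhere and a '-' at the very end.
template <typename CharT>
bool isSerializableComment(const std::basic_string<CharT>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == CharT('-') &&
            (i + 1 == data.size() || data[i + 1] == CharT('-'))) {
            return false;
        }
    }
    return true;
}

// True if `data` can be used as processing-instruction content:
// it must not contain the terminator "?>".
template <typename CharT>
bool isSerializablePIData(const std::basic_string<CharT>& data) {
    for (std::size_t i = 0; i + 1 < data.size(); ++i) {
        if (data[i] == CharT('?') && data[i + 1] == CharT('>')) {
            return false;
        }
    }
    return true;
}
```

A wrapper around createComment() would call `isSerializableComment` first and refuse (or rewrite) the text when it returns false, so the invalid data never reaches the tree.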
