XercesC setting output to UTF-8 - c++

I'm using XercesC Lib to create a serialization of my data. How can I set it to UTF-8? It is always generated with UTF-16 and I can't find a way to change that.
xercesc::DOMImplementation *gRegistry = xercesc::DOMImplementationRegistry::getDOMImplementation(X("Core"));
xercesc::DOMDocument *doc = gRegistry->createDocument(
0, // root element namespace URI.
X(oDocumentName.c_str()), // root element name
0); // document type object (DTD).
doc->setXmlStandalone(true);
... prepare the document ...
serializer = ((xercesc::DOMImplementationLS *)gRegistry)->createLSSerializer();
serializer->setNewLine(xercesc::XMLString::transcode("\n"));
XMLCh *xmlresult = serializer->writeToString(doc);
char *temp = xercesc::XMLString::transcode(xmlresult);
std::string result(temp);
xercesc::XMLString::release(&temp);
xercesc::XMLString::release(&xmlresult);
doc->release();
serializer->release();
getStream() << result.c_str();
When I deserialize with JAXB on the Java side, I always get a content is not allowed in prolog and so far this is the only difference I can see in the XML. When I try to locally deserialze in JAXB it works. When I take my XercesC XML I get this error. When I try to format it in Notepad++ with the XML plugin it also says that there is an error, but doesn't tell me any details.

Check the usage of DOMLSOutput, that should give you exactly what you want. I.e. you create a DOMLSOutput object to which you write (instead of using DOMLSSerializer::writeToString).

Related

Missing Xerces C++ class to copy attributes of element for use after SAX2 parsing

The documentation of xerces anticipates the need to make a copy of attributes, but the AttributesImpl class doesn't seem to exist. Neither does the facility seem to exist in other associated classes in either the current 3.2.3 version of xerces or previous 2.X
Xerces documentation in the file itself src/xercesc/sax2/Attributes.hpp says:
"The instance provided will return valid results only during the scope of the startElement invocation (to save it for future use, the application must make a copy: the AttributesImpl helper class provides a convenient constructor for doing so)."
See also I've left issue here as a bug in xerces
https://issues.apache.org/jira/browse/XERCESC-2238
Appears I will be stuck instead creating my own version of attributes in which to copy or clone, and not overwritten each new line. Not saving whole document (which would defeat purpose of SAX streaming parse), but the existing framework populating Attributes is pretty convoluted and undocumented. Obviously the library and docs are designed to use the api, not to hack or extend the application.
Is this really correct, AttributesImpl is helper class in the documentation that doesn't actually exist? Neither is there a different class with this functionality to save an element's attributes for later use (outside the handler)?
Below is a working version of an Attributes deep copy utility function. It may be missing a few includes which I'm getting from other includes of my larger file. When I get the chance, I'll try making this a stand alone and update this answer. It still falls short of the Java version utility, due to inaccessible members of RefVectorOf class, because the wrapping class, the Attributes interface and VecAttributesImpl interface, do not provide access to them. https://xerces.apache.org/xerces-j/apiDocs/org/xml/sax/helpers/AttributesImpl.html
Last release of Xerces C/C++ is from 2016, so although marked status active, https://projects.apache.org/project.html?xerces-for_c++_xml_parser , really not so much. Can't vouch for libhunt site, but came up in quick google just now https://cpp.libhunt.com/xerces-c++-alternatives . One can see latest comment here, note use of the phrase "unless a security issue pops up or new committers appear to revive the project" https://issues.apache.org/jira/browse/XERCESC-2238?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17571942#comment-17571942
Leaving the status of Xerces C/C++ as active is either a lie or a gross and negligent oversight. This page shows no major release since 2010. https://xerces.apache.org/news.html (C++ is listed below the Java project updates)
#include <xercesc/validators/common/GrammarResolver.hpp>
#include <xercesc/framework/XMLGrammarPool.hpp>
#include <xercesc/sax2/Attributes.hpp>
#include <xercesc/util/RefVectorOf.hpp>
#include "spdlog/spdlog.h"
#define tr XMLString::transcode
static spdlog::logger logger = getLog();
/*
* cloneAttributes
* Uses LocalName as key instead of QName and ignores URI and URI id, id inside RefVectorOf
* but inaccessible to wrapper VecAttributesImpl, and type defaults to CDATA
*/
VecAttributesImpl* cloneAttributes(VecAttributesImpl& attrs, bool useScanner=false){
// from XMLReaderFactory::CreateXMLReader line 49
MemoryManager* const memManager = XMLPlatformUtils::fgMemoryManager;
XMLScanner* scanner;
if(useScanner){
// from void SAX2XMLReaderImpl::initialize() line 124
GrammarResolver* grammarResolver = new (memManager) GrammarResolver(0, memManager); // line 127
// use of 0 from SAX2XMLReaderImpl.hpp line 74 default constructor, XMLGrammarPool* const gramPool = 0
XMLStringPool* URIStringPool = grammarResolver->getStringPool(); // line 128
scanner = XMLScannerResolver::getDefaultScanner(0, grammarResolver, memManager);
// line 42 of XMLScannerResolver::getDefaultScanner uses return new (manager) IGXMLScanner(valToAdopt, grammarResolver, manager);
scanner->setURIStringPool(URIStringPool);
}else{
scanner = NULL;
}
VecAttributesImpl* newAttrs = new VecAttributesImpl(); //VecAttributesImpl is not a vector, it's a wrapper around RefVectorOf
RefVectorOf<XMLAttr> * newRefVectorOf = new (memManager) RefVectorOf<XMLAttr> (32, false, memManager) ;
XMLSize_t atLen = attrs.getLength();
XMLSize_t i;
std::stringstream bruce;
XMLAttr* cpXMLAttr;
for(i = 0;i<atLen;i++){
//Ever QName != LocalName? when URI != ""? logger.debug(format("{}. QName LocalName URI type: {}, {}, {}, {}", i, tr(attrs.getQName(i)), tr(attrs.getLocalName(i)), tr(attrs.getURI(i)),tr(attrs.getType(i)))); // #suppress("Invalid arguments")
cpXMLAttr = new (memManager) XMLAttr
(
0, //URIId, 0 if reading file, but int is inaccessible from attrs, inside RefVectorOf XMLAttr, and getURI(i) returns an XMLCh*
attrs.getLocalName(i),
attrs.getValue(i)
);
if(logger.level() == spdlog::level::debug){
bruce << tr(attrs.getLocalName(i))<<" : "<<tr(attrs.getValue(i))<< " | ";
}
newRefVectorOf->addElement(cpXMLAttr);
}
logger.debug(bruce.str());
newRefVectorOf->size();
logger.debug(newRefVectorOf->size());
//The scanner can actually be set to NULL and the above scanner construction skipped if the VecAttributesImpl isn't scanning.
newAttrs->setVector(newRefVectorOf, newRefVectorOf->size(), scanner, false);
return newAttrs;
}

Saxon.Api.StaticError: 'xsl:import-schema requires Saxon-EE' with .net API

I've downloaded the .NET Saxon API. I've compiled and run the EE sample application.
Some of it required the license to be present, and I have a license file, which seemed to make it work.
I wanted to use xsl:import-schema in my xslt, and this xslt works in the Oxygen editor (which has its own EE license).
If I take the simple xslt example from the saxon sample and then attempt to get it to compile my xslt with the import-schema instruction I get:
Saxon.Api.StaticError: 'xsl:import-schema requires Saxon-EE'
That is true. However I am already explicitly referencing the Saxon EE library, so that shouldn't be an issue (see below for a clue):
Here's my code:
var samplesDir = new Uri(AppDomain.CurrentDomain.BaseDirectory);
String dir = samplesDir.LocalPath;
String sourceFile = Path.Combine(dir,"po.xml");
String styleFile = Path.Combine(dir,"po.xsl");
// Create a Processor instance.
Processor processor = new Processor();
// Load the source document
DocumentBuilder builder = processor.NewDocumentBuilder();
builder.BaseUri = new Uri(sourceFile);
XdmNode input = builder.Build(File.OpenRead(sourceFile));
XsltCompiler compiler = processor.NewXsltCompiler();
//compiler.SchemaAware = true;
compiler.BaseUri = new Uri(styleFile);
// fails on next line
Xslt30Transformer transformer = compiler.Compile(File.OpenRead(styleFile)).Load30();
// Set the root node of the source document to be the global context item
transformer.GlobalContextItem = input;
// Create a serializer, with output to the standard output stream
Serializer serializer = processor.NewSerializer();
serializer.SetOutputWriter(Console.Out);
// Transform the source XML and serialize the result document
transformer.ApplyTemplates(input, serializer);
Note that if I comment out the explicit setting to set the SchemaAware setting to true, it says:
net.sf.saxon.trans.LicenseException
HResult=0x80131500
Message=Requested feature (schema-aware XSLT) requires Saxon-EE. You are using Saxon-EE software, but the Configuration is an instance of net.sf.saxon.Configuration; to use this feature you need to create an instance of com.saxonica.config.EnterpriseConfiguration
Source=saxon9ee
StackTrace:
at net.sf.saxon.Configuration.checkLicensedFeature(Int32 feature, String name, Int32 localLicenseId)
at net.sf.saxon.PreparedStylesheet..ctor(Compilation compilation)
at net.sf.saxon.style.StylesheetModule.loadStylesheet(Source styleSource, Compilation compilation)
at net.sf.saxon.style.Compilation.compileSingletonPackage(Configuration config, CompilerInfo compilerInfo, Source source)
at net.sf.saxon.s9api.XsltCompiler.compile(Source source)
at Saxon.Api.XsltCompiler.Compile(Stream input)
at ValidateXslt.Program.Main(String[] args) in C:\Users\m_r_n\source\repos\SaxonEEExample\ValidateXslt\Program.cs:line 33
This is a better clue. It's telling me I am using saxon EE, but I need an instance of com.saxonica.config.EnterpriseConfiguration somehow.
Why am I getting this error message?
you need to tell the processor to behave like a licensed copy (seems a bit odd)
Processor processor = new Processor(true);
Simple, I'd copied an example that didnt need fancy features, but I'll leave this question as it is, just in case someone else has the same issue.

Xerces, xpaths, and XML namespaces

I'm trying to use xerces-c in order to parse a rather massive XML document generated from StarUML in order to change some things, but I'm running into issues getting the xpath query to work because it keeps crashing.
To simplify things I split out part of the file into a smaller XML file for testing, which looks like this:
<?xml version="1.0" encoding="utf-8"?>
<XPD:UNIT xmlns:XPD="http://www.staruml.com" version="1">
<XPD:HEADER>
<XPD:SUBUNITS>
</XPD:SUBUNITS>
</XPD:HEADER>
<XPD:BODY>
<XPD:OBJ name="Attributes[3]" type="UMLAttribute" guid="onMjrHQ0rUaSkyFAWtLzKwAA">
<XPD:ATTR name="StereotypeName" type="string">ConditionInteraction</XPD:ATTR>
</XPD:OBJ>
</XPD:BODY>
</XPD:UNIT>
All I'm trying to do for this example is to find all of the XPD:OBJ elements, of which there is only one. The problem seems to stem from trying to query with the namespace. When I pass a very simple xpath query of XPD:OBJ it will crash, but if I pass just OBJ it won't crash but it won't find the XPD:OBJ element.
I assume there's some important property or setting that I'm missing during initialization that I need to set but I have no idea what it might be. I looked up all of the properties of the parser having to do with namespace and enabled the ones I could but it didn't help at all so I'm completely stuck. The initialization code looks something like this, with lots of things removed obviously:
const tXercesXMLCh tXMLManager::kDOMImplementationFeatures[] =
{
static_cast<tXercesXMLCh>('L'),
static_cast<tXercesXMLCh>('S'),
static_cast<tXercesXMLCh>('\0')
};
// Instantiate the DOM parser.
fImplementation = static_cast<tXercesDOMImplementationLS *>(tXercesDOMImplementationRegistry::getDOMImplementation(kDOMImplementationFeatures));
if (fImplementation != nullptr)
{
fParser = fImplementation->createLSParser(tXercesDOMImplementationLS::MODE_SYNCHRONOUS, nullptr);
fConfig = fParser->getDomConfig();
// Let the validation process do its datatype normalization that is defined in the used schema language.
//fConfig->setParameter(tXercesXMLUni::fgDOMDatatypeNormalization, true);
// Ignore comments and whitespace so we don't get extra nodes to process that just waste time.
fConfig->setParameter(tXercesXMLUni::fgDOMComments, false);
fConfig->setParameter(tXercesXMLUni::fgDOMElementContentWhitespace, false);
// Setup some properties that look like they might be required to get namespaces to work but doesn't seem to help at all.
fConfig->setParameter(tXercesXMLUni::fgXercesUseCachedGrammarInParse, true);
fConfig->setParameter(tXercesXMLUni::fgDOMNamespaces, true);
fConfig->setParameter(tXercesXMLUni::fgDOMNamespaceDeclarations, true);
// Install our custom error handler.
fConfig->setParameter(tXercesXMLUni::fgDOMErrorHandler, &fErrorHandler);
}
Then later on I parse the document, find the root node, and then run the xpath query to find the node I want. I'll leave out the bulk of that and just show you where I'm running the xpath query in case there's something obviously wrong there:
tXercesDOMDocument * doc; // Comes from parsing the file.
tXercesDOMNode * contextNode; // This is the root node retrieved from the document.
tXercesDOMXPathResult * xPathResult;
doc->evaluate("XPD:OBJ", contextNode, nullptr, tXercesDOMXPathResult::ORDERED_NODE_SNAPSHOT_TYPE), xPathResult);
The call to evaluate() is where it crashes somewhere deep inside xerces that I can't see very clearly, but from what I can see there are a lot of things that look deleted or uninitialized so I'm not sure what's causing the crash exactly.
So is there anything here that looks obviously wrong or missing that is required to make xerces work with XML namespaces?
The solution was right in front of my face the whole time. The problem was that you need to create and pass a resolver to the evaluate() call or else it will not be able to figure out any of the namespaces and will throw an exception. The crash seems to be a bug in xerces since it's crashing on trying to throw the exception when it can't resolve the namespace. I had to debug deep into the xerces code to find it, which gave me the solution.
So to fix the problem I changed the call to evaluate() slightly to create a resolver with the root node and now it works perfectly:
tXercesDOMDocument * doc; // Comes from parsing the file.
tXercesDOMNode * contextNode; // This is the root node retrieved from the document.
tXercesDOMXPathResult * xPathResult;
// Create the resolver with the root node, which contains the namespace definition.
tXercesDOMXPathNSResolver * resolver(doc->createNSResolver(contextNode));
doc->evaluate("XPD:OBJ", contextNode, resolver, tXercesDOMXPathResult::ORDERED_NODE_SNAPSHOT_TYPE), xPathResult);
// Make sure to release the resolver since anything created from a `create___()`
// function has to be manually released.
resolver->release();

C/CPP version of BeautifulSoup especially at handling malformed HTML

Are there any recommendations for a c/cpp lib which can be used to easily (as much as that possible) parse / iterate / manipulate HTML streams/files assuming some might be malformed, i.e. tags not closed etc.
BeautifulSoup
HTMLparser from Libxml is easy to use (simple tutorial below) and works great even on malformed HTML.
Edit : Original blog post is no longer accessible, so I've copy pasted the content here.
Parsing (X)HTML in C is often seen as a difficult task.
It's true that C isn't the easiest language to use to develop a parser.
Fortunately, libxml2's HTMLParser module come to the rescue. So, as promised, here's a small tutorial explaining how to use libxml2's HTMLParser to parse (X)HTML.
First, you need to create a parser context. You have many functions for doing that, depending on how you want to feed data to the parser. I'll use htmlCreatePushParserCtxt(), since it work with memory buffers.
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);
Then, you can set many options on that parser context.
htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
We are now ready to parse an (X)HTML document.
// char * data : buffer containing part of the web page
// int len : number of bytes in data
// Last argument is 0 if the web page isn't complete, and 1 for the final call.
htmlParseChunk(parser, data, len, 0);
Once you've pushed it all your data, you can call that function again with a NULL buffer and 1 as the last argument. This will ensure that the parser have processed everything.
Finally, how to get the data you parsed? That's easier than it seems. You simply have to walk the XML tree created.
void walkTree(xmlNode * a_node)
{
xmlNode *cur_node = NULL;
xmlAttr *cur_attr = NULL;
for (cur_node = a_node; cur_node; cur_node = cur_node->next)
{
// do something with that node information, like... printing the tag's name and attributes
printf("Got tag : %s\n", cur_node->name)
for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next)
{
printf(" ->; with attribute : %s\n", cur_attr->name);
}
walkTree(cur_node->children);
}
}
walkTree(xmlDocGetRootElement(parser->myDoc));
And that's it! Isn't that simple enough? From there, you can do any kind of stuff, like finding all referenced images (by looking at img tag) and fetching them, or anything you can think of doing.
Also, you should know that you can walk the XML tree anytime, even if you haven't parsed the whole (X)HTML document yet.
If you have to parse (X)HTML in C, you should use libxml2's HTMLParser. It will save you a lot of time.
you could use Google gumbo-parser
Gumbo is an implementation of the HTML5 parsing algorithm implemented as a pure C99 library with no outside dependencies. It's designed to serve as a building block for other tools and libraries such as linters, validators, templating languages, and refactoring and analysis tools.
#include "gumbo.h"
int main() {
GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
// Do stuff with output->root
gumbo_destroy_output(&kGumboDefaultOptions, output);
}
There's also a C++ binding for this library gumbo-query
A C++ library that provides jQuery-like selectors for Google's Gumbo-Parser.
#include <iostream>
#include <string>
#include "Document.h"
#include "Node.h"
int main(int argc, char * argv[])
{
std::string page("<h1><a>some link</a></h1>");
CDocument doc;
doc.parse(page.c_str());
CSelection c = doc.find("h1 a");
std::cout << c.nodeAt(0).text() << std::endl; // some link
return 0;
}
I've only used libCurl C++ for this type of thing but found it to be pretty good and useable. Don't know how it would cope with broken HTML though.
Try using SIP and run BeautifulSoup on it might help.
More details on below link thread. OpenFrameworks + Python

libxml - Load xmlDoc from raw data

I have a char* buffer of data that I want to parse as XML in libxml2.
How would one go about that?
Currently I am using it to open the file automatically by calling a file name, but it would be nice to have more functionality.
Currently I do it like this:
xmlDocPtr doc = xmlParseFile("data/foo.xml");
However I have a resource system that gives me access to the raw data, so my more preferred method would be:
resource_base_ptr res = load_resource("data/foo.xml");
xmlDocPtr doc /*= some_function(res->raw_data) */;
You need to use xmlReadMemory()
http://xmlsoft.org/html/libxml-parser.html#xmlReadMemory