Ignore initial data when parsing XML with Xerces-C++

I hope someone on here has some knowledge of using Xerces-C. I have a string that contains a valid XML packet. It has, however, some leading data that has nothing to do with the XML. Is it possible to have the Xerces-C SAXParser ignore any leading data and simply parse the first valid XML it finds? I am using an extremely simple setup, without even the use of a DTD, as below:
SAXParser* lp_parser = new SAXParser();
MySaxHandler l_handler;
lp_parser->setDocumentHandler((DocumentHandler*)&l_handler);
lp_parser->setDoNamespaces(false);
lp_parser->setDoSchema(false);
lp_parser->setValidationSchemaFullChecking(false);
MemBufInputSource lp_membuf((const XMLByte*)l_data.c_str(),
                            l_data.size(),
                            "My XML request",
                            false);
lp_parser->parse(lp_membuf);
l_data is a std::string containing my XML packet, including the initial data, and MySaxHandler is where I save the few tags I am interested in. I can of course skip ahead until I find the start of the XML myself, but that is not the answer I was hoping for.
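As far as I can tell, SAXParser has no setting for skipping leading non-XML data, so if the manual skip ends up being the fallback, a minimal sketch (assuming the leading data contains no '<' of its own) is to offset the buffer handed to MemBufInputSource:

// Sketch only: find the first '<' and start the input source there.
std::string::size_type l_start = l_data.find('<');       // or l_data.find("<?xml")
if (l_start == std::string::npos)
    throw std::runtime_error("no XML found in buffer");  // needs <stdexcept>

MemBufInputSource lp_membuf((const XMLByte*)(l_data.c_str() + l_start),
                            l_data.size() - l_start,
                            "My XML request",
                            false);
lp_parser->parse(lp_membuf);

Searching for "<?xml" instead of a bare '<' is safer if the leading junk could itself contain angle brackets.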

Related

Eliminate javascript from HTML with XSLT

I am trying to transform an HTML report into XML, but some javascript in the file is throwing errors, due to statements with a less-than character (e.g., for(var i=0; i<els.length;i++) ). I thought I could eliminate the javascript with the following template, which should remove entire script nodes:
<xsl:template match="script"/>
I assumed the XSLT processor would simply skip over the entire script nodes, but it's still throwing the same errors. I also tried adding this one:
<xsl:template match="script/text()"/>
No luck. If I manually remove all the javascript from the file, my transform works, but that's not practical, as I need to create and run a daily automated process on these HTML files to extract some data from the HTML tables.
As a general rule, XSLT will only process well-formed XML input: it's not designed to process other formats like HTML. The errors you are seeing are raised by the XML parser while it reads the input, before any of your templates are applied, which is why the empty script templates make no difference.
However, XSLT will generally accept input from a parser that delivers a stream of events that looks sufficiently like an XML stream. This allows parsers like TagSoup and validator.nu to be used as a front-end to your XSLT processor.
Saxon packages this up with a parse-html() function that invokes TagSoup to parse HTML input and turn it into a DOM-like tree (actually an XDM tree) that it can process as if it came from XML.
validator.nu is a more up-to-date HTML parser than TagSoup, but you would have to do a little more work to integrate that.
The question was answered by Martin Honnen in the comments:
oxygenxml.com/doc/versions/18.1/ug-editor/tasks/… suggests there is an HTML import feature, so try whether that helps. Of course there are standalone applications like HTML Tidy, I think, that you can use outside of the XSLT processing to first convert your HTML to XHTML.

Can I extract the columns/fields and logic used in an XSLT?

I am specifically looking to parse the XSLT to retrieve the fields in the input XML and also to get the logic between the input data and the output data being generated.
I am not sure, but have I been given the task of creating an XSLT parser, something like a sub-module in a browser?
It is more like reverse engineering some code to get the source fields and map them to the destination data.

Regex or XPath for extracting nodes?

I have an XML file with the following structure:
<JobList>
<Job><subnodes/></Job>
<Job><subnodes/></Job>
</JobList>
This XML can sometimes be truncated, leaving the closing </JobList> missing and the final </Job> missing.
I would like to extract, with their full content, those <Job> nodes that are properly closed with </Job>. What is the best way to do this?
To make a long story short, I am using .NET and its built-in serializers for deserializing XML content. But as new properties are added, you cannot just go back and forth between different versions, because the deserializer is too strict. Mostly it works, but I would like to have a backup recovery method for this - hence the question.
The current situation is that the deserializer "crashes" the whole deserialization when a new property has been added, instead of ignoring it. I am looking to parse the content manually on error.
As mentioned in the comments, the ideal would be to make the XML well-formed; if for whatever reason that is not possible, the workaround is to parse the file as text with a regex.
A general regex for this case could be something like:
<Job>((?!<Job>).)*</Job>$
This will match anything between a complete pair of tags.
Please note that this will also return nodes with 'broken' inner nodes, but according to your question you are only concerned about the missing </Job> and </JobList> tags.
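For illustration only, here is a minimal C++ sketch of the same idea (the question is about .NET, but the pattern is language-agnostic); it drops the trailing $ so every complete pair is returned, and uses [\s\S] so a match can span lines:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Possibly truncated input: the last <Job> and </JobList> are never closed.
    std::string xml =
        "<JobList>\n"
        "<Job><subnodes/></Job>\n"
        "<Job><subnodes/></Job>\n"
        "<Job><subnodes/>\n";

    // The negative lookahead (?!<Job>) keeps one match from running into the
    // next <Job>; [\s\S] matches any character, including newlines.
    std::regex job_re("<Job>((?!<Job>)[\\s\\S])*</Job>");

    for (std::sregex_iterator it(xml.begin(), xml.end(), job_re), end; it != end; ++it)
        std::cout << it->str() << "\n";   // prints only the two complete <Job> elements
    return 0;
}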

How to parse encoded HTML within an XML document using XSLT

I'm trying to parse Feedburner's full-text RSS feed (for example http://feeds.feedburner.com/IeeeSpectrumFullText). The HTML content is in an element called "content:encoded", but it is encoded (the < symbol becomes &lt;, etc.). I'm trying to figure out if it's possible to decode that content via an XSLT transformation. I know that within PHP I can decode and parse it, but I'm hoping there's a way to do this purely in XSLT so that I can keep a single PHP process (rather than conditionally decoding the HTML as necessary).
Please let me know if you have any suggestions.

Getting XML data from Xerces (C++)

I am a latecomer to XML and have to parse an XML file. Our company is using Xerces already, so I managed to cobble together a sample app (SAX) that displays all the data in a file. However, after parsing is complete, I was expecting to be able to call the parser, or some other entity holding an internal representation of the file, and iterate through the fields/data.
Basically I want to be able to hand it some key or other string(s) and get back strings or collections of key/value pairs. I do not see that. It seems pretty obvious to me that that is a good thing to do. Am I missing something?
Is DOM parsing what I want, or does that fall short too?
Xerces provides both SAX and DOM processing. SAX parsing doesn't construct a model, so once parsing is finished there is nothing to examine or iterate through. DOM processing produces a tree-structured model which gives you what you want.
Check out the beginner's sample on this page:
YoLinux Tutorial on Parsing XML
If you use the XercesDOMParser, there is still no way to request a specific key/value pair after the document is parsed. I ran into the same problem recently, and while iterating through the DOM tree I stored all the key/value pairs in an STL map. Then you can request key/value pairs from the map later in the program.
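A rough sketch of that approach with Xerces-C 3.x, assuming leaf-element names are unique enough to serve as map keys; the file name and tag name below are placeholders, and error handling is omitted:

#include <iostream>
#include <map>
#include <string>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>

using namespace xercesc;

// Recursively walk the DOM tree and record element-name -> text-content pairs.
static void collect(const DOMNode* node, std::map<std::string, std::string>& out)
{
    for (const DOMNode* child = node->getFirstChild(); child; child = child->getNextSibling()) {
        if (child->getNodeType() == DOMNode::ELEMENT_NODE) {
            char* name = XMLString::transcode(child->getNodeName());
            char* text = XMLString::transcode(child->getTextContent());
            out[name] = text;                  // later duplicates overwrite earlier ones
            XMLString::release(&name);
            XMLString::release(&text);
            collect(child, out);               // descend into nested elements
        }
    }
}

int main()
{
    XMLPlatformUtils::Initialize();
    {
        XercesDOMParser parser;
        parser.parse("sample.xml");            // placeholder input file

        std::map<std::string, std::string> fields;
        collect(parser.getDocument()->getDocumentElement(), fields);

        std::cout << fields["SomeTag"] << "\n"; // look up a tag by name later on
    }
    XMLPlatformUtils::Terminate();
    return 0;
}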