xmlReadFile() (C++ Ubuntu) core dumps on broken XML

I am using the libxml2 libraries to parse XML sent to me (my program) as a file from another program. With care that should mean that I never get bad XML, but twice already I've made hand tweaks that broke the XML in the received file. By broken I mean that the elements have errors, end tags not matching start tags, random characters in between tags, etc.
The file is small so there are no particular memory worries about loading all of it into the parser, so I use xmlReadFile() to read in the doc.
My problem comes when the XML is broken. xmlReadFile() aborts and dumps core. I can't catch it as an exception, and setting the "recover" flag doesn't help.
I've searched Google with minimal success. I found xmllint, but I would really rather not call system() or popen() every time I get a new XML file. I looked at DTDs but can't figure out how to make a DTD actually validate the value passed in an element. (Many of the tags in the doc have values that are one of a set of, say, 5 possible answers.) Of course, if a DTD worked I at least wouldn't crash xmlReadFile().
Any suggestions on how to validate the XML before xmlReadFile() or with xmlReadFile() and how to prevent the crashes? Does xmllint have a C++ interface that I just haven't found?
No boost. No changing libraries.

Have you tried xmlReaderForFile(... XML_PARSE_RECOVER ...) ?

Related

How to generate C++ library with xerces for specific XML

I've gone through this xerces C++ tutorial, which shows how you might write a nice C++ class that allows you to access your data from the XML using simple function calls. The problem is that 200 lines of C++ seems like an excessive amount of work just to grab a couple of pieces of data from an XML file. I am hoping to find something that will take in my XML file and spit out C++. I have searched online for a tool to generate this for me, but I can't find anything.

SAXParseException that only occurs locally. Works on WebServers

I am writing a JUnit test that is testing an older piece of code. This code works on our iPlanet web servers and our local Tomcat servers and runs with no problems. However, when run by the JUnit test I get this exception.
Background: It pulls an XSL file from a JAR then transforms it with an xml document that is read in from a resource file.
I have tried changing transformer factories, changing encoding, and checked all files for null characters using a hex editor. Any ideas?
[Fatal Error] :2251:46: An invalid XML character (Unicode: 0x0) was found in the value of attribute "test" and element is "xsl:when".
SystemId Unknown; Line #2251; Column #46; org.xml.sax.SAXParseException; lineNumber: 2251; columnNumber: 46; An invalid XML character (Unicode: 0x0) was found in the value of attribute "test" and element is "xsl:when".
UPDATE:
I have found that if I use the project's class folder where the XSL is held and move it above the JAR dependency, it works, but if it uses the XSL out of the JAR it breaks.
SUGGESTIONS:
1) Make sure your library versions all match.
2) I suspect that " An invalid XML character (Unicode: 0x0) was found" might be caused by any of several completely different things. You should investigate each of them.
3) First, most obvious - check your input for a null character :)
4) Second, check your encoding - perhaps your sender is writing UTF-16, but your reader is expecting UTF-8. Here's a good link:
Error about invalid XML characters on Java
This is an encoding issue. Either you read the input stream as UTF-8
and it isn't, or the other way around.
You should specify the encoding explicitly when you read the content.
E.g. via
new InputStreamReader(getInputStream(), "UTF-8")
Another problem could be Tomcat. Try adding URIEncoding="UTF-8" to
your Tomcat connector settings in the server.xml file.
5) The root cause might also be a failed read or a missing object of some kind. A missing definition, perhaps.
Q: What is "SystemId"? What might cause it to "go missing"?
6) One possibility is that "resolveEntity()" is failing:
InputSource resolveEntity(String publicId, String systemId)
Here are a couple of links regarding that problem:
Java SAX Parser raises UnknownHostException
how to disable dtd at runtime in java's xpath?
7) Both of these links suggest that resolveEntity() might be failing because you can't connect to a specified host. Check the network host names listed in your XML, and make sure you can "ping" them.
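Suggestions 3 and 4 above (look for a null character, then check the encoding) can be done mechanically on the raw bytes. The following is a rough sketch in C++ rather than Java; the names Position, find_first_nul and guess_bom are made up for illustration. Note that it counts columns in bytes, so the position will not necessarily match the column a parser reports for multi-byte text.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: 1-based line and byte column of the first NUL.
struct Position {
    std::size_t line;
    std::size_t column;
    bool found;
};

// Scan raw bytes for the first 0x0 byte and report where it is.
Position find_first_nul(const std::string& bytes) {
    std::size_t line = 1;
    std::size_t column = 1;
    for (unsigned char c : bytes) {
        if (c == 0) {
            return {line, column, true};
        }
        if (c == '\n') {
            ++line;
            column = 1;
        } else {
            ++column;
        }
    }
    return {0, 0, false};
}

// Guess the encoding from a byte order mark, if one is present.
// UTF-16 text read as UTF-8 is a common source of stray 0x0 bytes.
std::string guess_bom(const std::string& bytes) {
    auto starts_with = [&bytes](const char* bom, std::size_t n) {
        return bytes.compare(0, n, bom, n) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3)) return "UTF-8";
    if (starts_with("\xFF\xFE", 2)) return "UTF-16LE";
    if (starts_with("\xFE\xFF", 2)) return "UTF-16BE";
    return "none";
}
```

If find_first_nul reports nothing for the file on disk but the parser still sees 0x0, that points away from the file contents and toward how the bytes are being read or decoded.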
If it's got as far as line 2251 that suggests strongly that there's something wrong with the file contents around that location. If there's nothing wrong with the file at that location, my next suggestion would be that something's wrong with the parser. I know it sounds crazy, but the XML parser built in to the JDK is seriously buggy, and I would check whether the problem goes away if you install the Apache version of Xerces in its place. In many cases this is simply a question of putting the relevant JAR files in the lib/endorsed directory of the JDK installation.
This was caused because the XSL files I was trying to transform were still in a JAR. I had to have Maven extract the files into the target directory first.

XSLT to convert an XML element containing RTF data to HTML?

OK, so here's the background:
We have a third-party piece of software that does a lot of complicated stuff to generate an XML file from a lot of tables based on a wide array of business rules. The software allows you to apply an XSL transformation by supplying an XSLT file as part of its workflow, before continuing on in the process, which is usually an upload to one or more servers, based on more business rules.
Here's the problem:
One of the elements (with more on the way) this application is processing contains RTF text, and needs to be converted into formatted HTML before being uploaded. There are no means of transforming the XML inside the application other than through an XSLT file, and once we output the file, we cannot resume the workflow. My original thought was, "Easy! someone must have written a few XSL transforms for converting RTF to formatted HTML!" Hours of searching later, I must conclude I either suck at searching or it's awfully obscure.
Disclaimers:
I know the software is pretty darned limited; I'm stuck with it.
I know there are a lot of third-party tools to do this; they are not available to me because I would need to run them externally.
I know that this is not a pretty or efficient thing to do with XSLT. Changing that is not an option for me at this point.
If I cannot find a means to do this through pure XSL transforms, I will need to output the files locally, run the extra process, and take the destination routing on through a custom process. I really don't want to do that.
Does anyone have access to an XSL transformation function/ scheme that will allow me to do this natively in the application? Perhaps a series of regular expressions I could use or something?
So it turns out that external scripts can be invoked from the XSLT. It seems I will be using another scripting language to get this to work. I'm a little bummed there was no other answer available.

High performance XML parsing in C++

Well, a lot of questions have been asked about parsing XML in C++ and so on...
But, instead of a generic problem, mine is very specific.
I am asking for a very efficient XML parser for C++. In particular I have a VERY VERY BIG XML file to parse.
My application must open this file and retrieve data. It must also insert new nodes and save the final result in the file again.
To do this I used rapidxml at first, but it requires me to open the file, parse all of it (this lib has no functions to access the file directly without loading the entire tree first), then edit the tree and store the final tree by overwriting the file. It consumes too many resources.
Is there an XML parser that does not require me to load the entire file, but that I can use to insert, quickly, new nodes and retrieve data? Can you please indicate solutions for this problem of mine?
You want a streaming XML parser rather than what is called a DOM parser.
There are two types of streaming parsers: pull and push. A pull parser is good for quickly writing XML parsers that load data into program memory. A push parser is good for writing a program to translate one document to another (which is what you are trying to accomplish). I think, therefore, that a push parser would be best for your problem.
In order to use a push parser, you need to write what is essentially an event handler for parsing events. By "parsing event", I mean events like "start tag reached", "end tag reached", "text found", "attribute parsed", etc.
I suggest that as you read in the document, you write out the transformed document to a separate, temporary file. Thus, your XML parsing event handlers will need to be written so that they are stateful and write out the XML of the translated document incrementally.
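A minimal sketch of such a stateful handler, assuming the parser delivers three kinds of events. The CopyingHandler class and its method names are hypothetical; Expat, Xerces-C++ and libxml2 each have their own callback signatures, and attributes are left out to keep the example short. The handler copies every event to an output stream and writes an extra node just before the closing tag of each element with a chosen name.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical event handler: copies the document to `out` and inserts
// `new_node` before the end tag of every element named `parent`.
class CopyingHandler {
public:
    CopyingHandler(std::ostream& out, std::string parent, std::string new_node)
        : out_(out), parent_(std::move(parent)), new_node_(std::move(new_node)) {}

    void start_element(const std::string& name) {
        out_ << '<' << name << '>';
        open_.push_back(name);  // remember which elements are still open
    }

    void characters(const std::string& text) {
        for (char c : text) {  // re-escape text on the way out
            if (c == '&') out_ << "&amp;";
            else if (c == '<') out_ << "&lt;";
            else if (c == '>') out_ << "&gt;";
            else out_ << c;
        }
    }

    void end_element(const std::string& name) {
        assert(!open_.empty() && open_.back() == name);
        if (name == parent_) {
            out_ << new_node_;  // the insertion happens here
        }
        out_ << "</" << name << '>';
        open_.pop_back();
    }

private:
    std::ostream& out_;
    std::string parent_;
    std::string new_node_;
    std::vector<std::string> open_;
};

// Drives the handler by hand, standing in for a real push parser.
std::string run_example() {
    std::ostringstream out;
    CopyingHandler handler(out, "root", "<new/>");
    handler.start_element("root");
    handler.start_element("item");
    handler.characters("a<b");
    handler.end_element("item");
    handler.end_element("root");
    return out.str();
}
```

In a real program the output stream would be the temporary file, which replaces the original once parsing finishes without error. Memory use then depends on nesting depth rather than on file size.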
Three excellent push parser libraries for C++ include Expat, Xerces-C++, and libxml2.
Search for "SAX parser". They are mostly tokenizers, i.e. they emit tag by tag without building a tree.
SAX parsers are faster than DOM parsers because a DOM parser reads the entire file and builds an in-memory representation of the whole XML document, whereas a SAX parser behaves like an event listener and reports each piece of the document as it reads the file, without building a tree.
As you mentioned, Xerces is a good C++ SAX parser.
I would recommend looking into ways of breaking the XML document into smaller XML documents as that seems to be part of your problem.
Okay, here is one off the beaten track. I looked at this but haven't really used it myself: it's called asmxml. Its authors claim performance bar none. The downside is that it needs x86 assembler.
If you really want a high-performance XML stream parser, then libhpxml is likely the right thing for you.
I’m convinced that no XML library exists that allows you to modify a file without loading it first. This simply isn’t possible because files don’t work that way: you cannot insert (or remove) in the middle of a file. You can only overwrite a block of identical size, or append at the end. But your request would require inserting or removing data in the middle of the file.
Reading only parts of an XML file may be possible. But writing … no way.
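To make the point concrete, here is a small sketch of what "inserting" into the middle of a file amounts to: everything after the insertion point gets copied again. The function names copy_with_insert and insert_into are invented for this example, and the offset is a plain byte offset that the caller is assumed to know already.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copy `in` to `out`, writing `payload` after the first `offset` bytes.
// The tail after the insertion point is rewritten in full.
void copy_with_insert(std::istream& in, std::ostream& out,
                      std::size_t offset, const std::string& payload) {
    char ch;
    std::size_t pos = 0;
    while (pos < offset && in.get(ch)) {
        out.put(ch);
        ++pos;
    }
    out << payload;
    while (in.get(ch)) {
        out.put(ch);
    }
}

// Convenience wrapper working on strings, handy for trying it out.
std::string insert_into(const std::string& original, std::size_t offset,
                        const std::string& payload) {
    std::istringstream in(original);
    std::ostringstream out;
    copy_with_insert(in, out, offset, payload);
    return out.str();
}
```

For example, insert_into("<r><a/></r>", 7, "<b/>") yields "<r><a/><b/></r>". With real files the streams would be a std::ifstream on the original and a std::ofstream on a temporary file.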
Go for template libraries as much as possible, like Boost::property_tree or Boost::XMLParser or POCO::XML, and Folly has an XML parser in it.
Avoid old C libraries; they are all old code designs.
Some say the QtXML module performs well on huge XML files.

C++ Logger - Should I use an ordinary XML parser?

I'm working on a logging system for my 2D engine, and I'm confused on how I should go about creating/editing the file, and how I should output that file.
I've learned that XML is more of a data carrier rather than a data displayer like HTML is. I've read that I can use XML to HTML converters. One method I've thought about is writing characters to a file in HTML.
Clarity on these matters is what I ask of you, Stack Overflow.
Creating an XML (or HTML) file doesn't need any special library. Straightforward string concatenation is usually good enough; you may have to encode some special characters (e.g. > into &gt;).
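As a rough illustration of the concatenation approach, the sketch below escapes the five characters that XML treats specially and then glues a log entry together. The function names and the entry/severity markup are made up for the example, not taken from any logging library.

```cpp
#include <string>

// Replace the characters that have special meaning in XML text
// and attribute values with their predefined entities.
std::string xml_escape(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}

// Build one log entry by plain string concatenation.
std::string log_entry(const std::string& severity, const std::string& message) {
    return "<entry severity=\"" + xml_escape(severity) + "\">" +
           xml_escape(message) + "</entry>\n";
}
```

Escaping & first is not required here because each input character is handled exactly once, so already-written entities are never escaped a second time.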
But as Owen says, plain text is a lot more common for log files. One reasonable compromise is comma-separated values in a text file; this gives you a little bit of structure without much overhead. For example, the Windows web server (IIS) uses this format by default, and if you have some fields that are output for each line such as timestamp or source filename and line number, this makes it easy to separate those out again.
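A small sketch of the comma-separated compromise, assuming the common CSV convention of wrapping a field in double quotes when it contains a comma, quote or newline. The field order (timestamp, file, line, message) and the function names are just one possible choice.

```cpp
#include <string>

// Quote a field only when it contains a character that would
// otherwise break the comma-separated layout.
std::string csv_field(const std::string& value) {
    if (value.find_first_of(",\"\n\r") == std::string::npos) {
        return value;
    }
    std::string out = "\"";
    for (char c : value) {
        if (c == '"') out += '"';  // a quote inside a field is doubled
        out += c;
    }
    out += '"';
    return out;
}

// Format one log line: timestamp, source file, line number, message.
std::string csv_log_line(const std::string& timestamp, const std::string& file,
                         int line, const std::string& message) {
    return csv_field(timestamp) + "," + csv_field(file) + "," +
           std::to_string(line) + "," + csv_field(message) + "\n";
}
```

Because each entry is a complete line, the log can be appended to at any time and still be split back into fields by a spreadsheet or a short script.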
Just about every log I've ever worked with has been pure text delimited by newlines. If you're going to depart from that, you may want to ask yourself what it is about your logging needs that you want to accomplish with markup.
If you must go the way of markup, I would suggest an XML format that contains a minimal set of markup that would be useful in your situation. You could use XML to capture structure in your log entries (timestamp, severity, and operational code, for example) that would be inconvenient to code for in HTML.
Note that you could also go hybrid and embed some XHTML tags in an XML element whose purpose is to capture displayable text, if you want.
The problem with XML or HTML files is that you cannot append at any time. You have to close the final tag (document tag) properly at the end of writing.
Therefore, it's not a popular format for logging.
For logging, I suggest using one of the existing log engines, such as Apache logger, or, John Torjo's boost log candidate. They will support log levels, runtime configuration, etc.
If you are considering writing logs in XML files, please, stop.
Log files should be simple plain text files; XML-izing them introduces needless complexity. They are not structured data. They are meant to be read by people, not automated tools.
It all starts with XML logs, and then it goes downhill from there.