Xerces-c SaxParser issues - c++

I am using xerces-c to parse an XML file but I am getting some strange results.
I create my own DocumentHandler (derived from HandlerBase) and override:
void characters(const XMLCh* const chars, const unsigned int length);
this way I receive notification of character data inside an element.
To parse a file I create a parser, create an inputbuffer, create my handler and call parse.
SAXParser* lp_parser = new SAXParser();
XMLCh* lp_fileName = XMLString::transcode("myfile.xml");
LocalFileInputSource l_fileBuf(lp_fileName);
XMLString::release(&lp_fileName);
MyHandler l_handler;
lp_parser->setDocumentHandler((DocumentHandler *)&l_handler);
lp_parser->parse(l_fileBuf);
delete lp_parser;
The problem is that characters([...]) is not only being called with character data, but also (sometimes several times) for each tag it is called giving me a set of spaces and a newline as character data.
i.e. <Tag>Value</Tag> yields two calls to characters([...]), one where the data is 'Value' and another (or multiple ones) where the data is something like ' \n '
The XML file itself doesn't contain these characters. I have used xerces-c to parse XML like this many times without any problems, although this is the first time I have used a LocalFileInputSource (I usually use a MemBufInputSource).
Any ideas?

I had a similar problem with SAX2XMLReader. What I understood is that with SAX parsers it is up to the developer to keep track of where they are in the XML structure while parsing.
It is possible that these additional calls to characters() are for other tags in the file, or for ignorable whitespace.
Depending on the length of the data, the characters() callback may also be called several times for the same tag, and it is up to you to concatenate the data you receive on each call.
So what I would do is detect the start and end of <Tag> with the startElement() and endElement() callbacks. That way you can discard any calls to characters() that arrive after you have received endElement() for your tag.
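The bookkeeping described above can be sketched without Xerces itself. This is only a minimal sketch: TextCollector is a hypothetical class whose three methods mirror the startElement(), characters() and endElement() callbacks, but it uses std::string and char instead of XMLCh, so the signatures are assumptions and not the real HandlerBase ones. It also assumes the target element is not nested inside itself.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a handler class. In a real Xerces handler these
// three methods would be the overridden callbacks, taking XMLCh* arguments.
class TextCollector {
public:
    explicit TextCollector(std::string target) : target_(std::move(target)) {}

    // Start collecting when the element we care about opens.
    void startElement(const std::string& name) {
        if (name == target_) {
            inside_ = true;
            buffer_.clear();
        }
    }

    // May be called several times for one element: append every piece,
    // and ignore anything that arrives outside the target element.
    void characters(const char* chars, std::size_t length) {
        if (inside_) {
            buffer_.append(chars, length);
        }
    }

    // Stop collecting and store the concatenated text.
    void endElement(const std::string& name) {
        if (inside_ && name == target_) {
            values_.push_back(buffer_);
            inside_ = false;
        }
    }

    const std::vector<std::string>& values() const { return values_; }

private:
    std::string target_;
    std::string buffer_;
    std::vector<std::string> values_;
    bool inside_ = false;
};
```

Whitespace reported between elements never reaches the buffer because inside_ is false at that point, and text delivered in several pieces ends up concatenated in one entry of values().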

Related

Write CDATA XML-node with boost::property_tree

I'm trying to write an XML-file containing CDATA-nodes using boost::property_tree. However since characters such as <, >, &, etc. are escaped automatically when writing the XML-file, something like
xml.put("node", "<![CDATA[message]]>")
will appear as
<node>&lt;![CDATA[message]]&gt;</node>
in the XML-file. Is there any way to properly write CDATA-nodes using property_tree or is this simply a limitation of the library?
Boost documentation clearly says that it is not able to distinguish between CDATA and non-CDATA values:
The XML storage encoding does not round-trip perfectly. A read-write cycle loses trimmed whitespace, low-level formatting information, and the distinction between normal data and CDATA nodes. Comments are only preserved when enabled. A write-read cycle loses trimmed whitespace; that is, if the origin tree has string data that starts or ends with whitespace, that whitespace is lost.
The few times I've faced the same problem have been for very specific cases where I knew no other escaped data would be needed, so a simple post-processing of the generated file replacing the escaped characters was enough.
As a general example:
std::ostringstream ss;
pt::write_xml(ss, xml, pt::xml_writer_make_settings<std::string>('\t', 1));
auto cleaned_xml = boost::replace_all_copy(ss.str(), "&gt;", ">");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&lt;", "<");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&amp;", "&"); // last one
std::ofstream fo(path);
fo << cleaned_xml;
A more elaborate solution should find the opening &lt;![CDATA[ and the closing ]]&gt; and replace only within those limits, to avoid replacing symbols that were escaped correctly elsewhere.
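A rough sketch of that more careful variant, using only std::string so it does not depend on Boost. The names replaceAll and restoreCdata are made up for this example, and it assumes that only the three entities from the snippet above need to be undone inside a CDATA section.

```cpp
#include <cstddef>
#include <string>

// Replace every occurrence of `from` in `s` with `to`.
static void replaceAll(std::string& s, const std::string& from, const std::string& to) {
    std::size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();
    }
}

// Unescape only the text between an escaped "<![CDATA[" and the next
// escaped "]]>"; everything outside those markers is copied unchanged.
std::string restoreCdata(const std::string& xml) {
    const std::string open = "&lt;![CDATA[";
    const std::string close = "]]&gt;";
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t start = xml.find(open, pos);
        if (start == std::string::npos) break;
        const std::size_t end = xml.find(close, start + open.size());
        if (end == std::string::npos) break;  // unterminated marker: leave as is

        out.append(xml, pos, start - pos);    // text before the CDATA section
        std::string body = xml.substr(start + open.size(), end - start - open.size());
        replaceAll(body, "&lt;", "<");
        replaceAll(body, "&gt;", ">");
        replaceAll(body, "&amp;", "&");       // last one, as in the snippet above
        out += "<![CDATA[" + body + "]]>";
        pos = end + close.size();
    }
    out.append(xml, pos, std::string::npos);  // remaining text after the last section
    return out;
}
```

You would call restoreCdata(ss.str()) in place of the three replace_all_copy lines and write the result to the file.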
Another solution is presented in this answer but I've never used it.

Escaping and unescaping HTML

In a function I do not control, data is being returned via
return xmlFormat(rc.content)
I later want to do a
<cfoutput>#resultsofreturn#</cfoutput>
The problem is all the HTML tags are escaped.
I have considered
<cfoutput>#DecodeForHTML(resultsofreturn)#</cfoutput>
But I am not sure these are inverses of each other
Like Adrian concluded, the best option is to implement a system to get to the pre-encoded value.
In its current state, the string you're working with is encoded for an XML document. One option is to create an XML document containing the text and then parse the text back out of it. I'm not sure how efficient this method is, but it will return the text to its pre-encoded value.
function xmlDecode(text){
return xmlParse("<t>#text#</t>").t.xmlText;
}
TryCF.com example
As of CF 10, you should be using the newer encodeFor functions. These functions account for high ASCII characters as well as UTF-8 characters.
Old and Busted
XmlFormat()
HTMLEditFormat()
JSStringFormat()
New Hotness
encodeForXML()
encodeForXMLAttribute()
encodeForHTML()
encodeForHTMLAttribute()
encodeForJavaScript()
encodeForCSS()
The output from these functions differs by context.
Then, if you're only getting escaped HTML, you can convert it back using Jsoup or the Jakarta Commons Lang library. There are some examples in a related SO answer.
Obviously, the best solution would be to update the existing function to return either version of the content. Is there a way to copy that function in order to return the unescaped content? Or can you just call it from a new function that uses the Java solution to convert the HTML?

Matlab - how to extract specific data from a vector

I have some data from a GPS receiver, however, some of the data are corrupted by extra characters. I want to extract the timestamp (the first field) and the data for the $GPGGA and $GPVTG.
To be more clear, here is a sample of the data I have in a cell array:
'1458937887.70818 $GPGGA,200228.90,3555.3269,N,15552.9641,A*25'
'1458937887.709668 $GPVTG,56.740,T,56.740,M,0.069,N,0.127,K,D*2D'
'1458937887.712022 ªDe¾,…´apö$™°%=HfSrîU¾Õ½ôAqö‚>1ÀàHqgu$GPGGA,200229.00,3555.3269,N,15552.9641,C*2B'
'1458937887.714071 $GPVTG,286.847,T,286.847,M,0.028,N,0.051,K,D*28'
As you can see, the problem here is in the third line where some strange characters appear between the timestamp and the data.
Another problem is that sometimes this third line is split into two lines, something like this:
'1458937887.712022 ªDe¾,…´apö$™°'
'%=HfSrîU¾Õ½ôAqö‚>1ÀàHqgu$GPGGA,200229.00,3555.3269,N,15552.9641,D*24'
which makes using regexp very hard.
In summary, I want to format the third line (in both cases) as:
'1458937887.712022 $GPGGA,200229.00,3555.3269,N,15552.9641,D*2R'
Update:
Thanks to @excaza, this solves the first issue (removing the garbage):
regexprep(str, '(?<=\d\s)(.*)(?=\$GPGGA)', '')
As for the second issue, @Suever's question gave me an idea by looking at the format of the data. Is it possible to solve it while reading the data from a .txt file? Something like defining the delimiter to be * followed by two characters and a \n, since all packets end with this pattern?

c++ xml chunk data error

I have an XML reader in C++ and I am writing an error-checking function (a proofer) that sends only complete XML trees to the parser. The data is in a char array like
char chunkdata[245];
Then convert it to a string like
String data(chunkdata);
And parse the data.
This program can receive chunked data at any time and process it. The only problem with chunked data is that it sometimes contains incomplete XML trees, so I might get only half of the content in a char array, like
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to> Tove</to>
<from>Jani</from>
<heading>Remin
And a few milliseconds later I get the rest
der</heading>
<body>Don't forget me this weekend!</body>
</note>
And after processing it would produce two strings and crash the program.
What could I add to my code to either wait until the data is complete, or pass on only the complete XML trees and keep the remainder to join with the rest when it arrives? I tried things like string find with substring, which would process the complete part and add the remainder later, but it didn't work. Any suggestions? Thank you.
If the only thing you're doing is a validator that reads the file in block mode, then you should probably keep track of opened and closed tags in some sort of separate structure. If your buffer can change its length (I'm not sure what String is, but std::string certainly can change its size during runtime), you probably want to have something similar to the following:
std::map<std::string, long> tags;
And when you encounter a tag opening, do:
if(tags.find(tagName) != tags.end())
    tags[tagName]++;
else
    tags[tagName]=1;
And when you encounter a tag closing, do:
if(tags.find(tagName) != tags.end())
    tags[tagName]--;
else
    tags[tagName]=-1;
The tags are closed properly only if all the elements of the map are equal to 0. Let's assume testForCorrectness() does just that. Then your code would look like this:
char chunkdata[245];
readSomeData();
String data(chunkdata);
while(!testForCorrectness()){
    readSomeData();
    data += (String)chunkdata;
}
return data; // the accumulated data, not just the last chunk
If you also want to test if the tags were closed in the correct order - try using a vector instead:
std::vector<std::string> openedTags;
On tag begin:
openedTags.push_back(tagName);
On tag close:
if(!openedTags.empty() && openedTags.back() == tagName)
    openedTags.pop_back();
else {
    // XML is ill-formed
}
The XML is complete once openedTags is empty.
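Putting the vector idea together with the chunked buffer, a check could look roughly like the sketch below. XmlState and checkXml are names invented for this example. The scan is deliberately simplified: it assumes no '>' appears inside attribute values, comments or CDATA sections, and it simply re-scans the whole accumulated string on every call.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class XmlState { Incomplete, Complete, IllFormed };

// Scan the accumulated text and track open tags on a stack.
XmlState checkXml(const std::string& xml) {
    const char* const ws = " \t\r\n";
    std::vector<std::string> openedTags;
    bool sawRoot = false;
    std::size_t pos = 0;

    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::size_t end = xml.find('>', pos);
        if (end == std::string::npos) {
            return XmlState::Incomplete;      // tag cut off by the chunk boundary
        }
        const std::string tag = xml.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        if (tag.empty()) return XmlState::IllFormed;
        if (tag[0] == '?' || tag[0] == '!') continue;  // declaration, comment, DOCTYPE

        if (tag[0] == '/') {                  // closing tag
            std::string name = tag.substr(1);
            name = name.substr(0, name.find_first_of(ws));
            if (openedTags.empty() || openedTags.back() != name) {
                return XmlState::IllFormed;   // closed in the wrong order
            }
            openedTags.pop_back();
            continue;
        }

        sawRoot = true;
        if (tag.back() == '/') continue;      // self-closing tag, nothing to track
        openedTags.push_back(tag.substr(0, tag.find_first_of(ws)));
    }
    return (sawRoot && openedTags.empty()) ? XmlState::Complete
                                           : XmlState::Incomplete;
}
```

In the read loop you would keep appending chunks to data while checkXml(data) returns XmlState::Incomplete, hand data to the parser on Complete, and report an error on IllFormed.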

wistringstream from an xml file to an integer?

const XMLDataNode *pointsNode = node->GetChildren().at(0);
std::wistringstream pointsstrm(*pointsNode->GetInnerText());
pointsstrm >> loadedGame.points;
This is code I've written to pull an int from an XML file and pass it into loadedGame.points (an int). However, this isn't working. It compiles but doesn't give the right value. Why is that? XMLDataNode is a class that manipulates xmllite.dll.
Time for some wild guesses!
I'll bet you that the text you get from *pointsNode->GetInnerText() isn't what you think it is. Have you checked that it is indeed exactly the text you want? In particular, could it contain whitespace? Parsing a nicely formatted (i.e. indented, broken into lines, etc.) XML file without a schema to reference ends up meaning that all sorts of text nodes involving whitespace will end up in your DOM tree.
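One way to test that guess is to check the stream state after the extraction instead of trusting the result. This is just a sketch, and parseInt is a made-up helper, not part of XMLDataNode or xmllite. It accepts the text only if it holds exactly one integer, optionally surrounded by whitespace, so a whitespace-only text node (which is what you would get from picking the wrong child) is reported as a failure.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns true and stores the value in `out` only if `text` is a single
// integer with optional surrounding whitespace; otherwise leaves `out` alone.
bool parseInt(const std::wstring& text, int& out) {
    std::wistringstream strm(text);
    int value = 0;
    if (!(strm >> value)) {
        return false;          // empty or whitespace-only text, or no digits
    }
    strm >> std::ws;           // skip any trailing whitespace
    if (!strm.eof()) {
        return false;          // something other than whitespace is left over
    }
    out = value;
    return true;
}
```

If parseInt(*pointsNode->GetInnerText(), loadedGame.points) returns false, printing or logging the inner text should show what the node actually contains.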