I have an XML file with some lines like the following:
<rule pat="&&&&&&&&&&&&&&&(?<B>B) ?(?<AND>&) ?(?<E>E)">
I use the TinyXML library in C++ to parse this XML file, but when I read the 'pat' attribute of such lines, TinyXML simply ignores every occurrence of the & character. That is, the value TinyXML returns looks like this:
(?<B>B) ?(?<AND>) ?(?<E>E)
with every & missing!
This character is part of my regular expression pattern, so losing it causes further errors in my program.
Does anyone know why the & character is so special that TinyXML cannot read it? Even a standalone & is dropped.
That's because that is not a valid XML file. You can't just stick an & character anywhere in XML. You have to escape it with entities:
&amp;
TinyXML will only read valid XML files (or at least mostly valid ones).
Similarly, you need to escape the < and > characters too, with &lt; and &gt;.
That's not well-formed XML. If you want an & character, you need to write &amp;.
In XML, & is represented as &amp;
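A minimal sketch of that escaping in C++, assuming you produce the attribute value yourself before it goes into the file. `escape_xml` is a hypothetical helper name, not part of TinyXML:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a TinyXML function): replaces the five characters
// that have predefined XML entities with their entity references.
std::string escape_xml(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}
```

For example, `escape_xml("(?<AND>&)")` yields `(?&lt;AND&gt;&amp;)`, which an XML parser should decode back to the original pattern when the attribute is read.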
I'm trying to write an XML-file containing CDATA-nodes using boost::property_tree. However since characters such as <, >, &, etc. are escaped automatically when writing the XML-file, something like
xml.put("node", "<![CDATA[message]]>")
will appear as
<node>&lt;![CDATA[message]]&gt;</node>
in the XML-file. Is there any way to properly write CDATA-nodes using property_tree or is this simply a limitation of the library?
Boost documentation clearly says that it is not able to distinguish between CDATA and non-CDATA values:
The XML storage encoding does not round-trip perfectly. A read-write cycle loses trimmed whitespace, low-level formatting information, and the distinction between normal data and CDATA nodes. Comments are only preserved when enabled. A write-read cycle loses trimmed whitespace; that is, if the origin tree has string data that starts or ends with whitespace, that whitespace is lost.
The few times I've faced the same problem have been for very specific cases where I knew no other escaped data would be needed, so a simple post-processing of the generated file replacing the escaped characters was enough.
As a general example:
std::ostringstream ss;
pt::write_xml(ss, xml, pt::xml_writer_make_settings<std::string>('\t', 1));
auto cleaned_xml = boost::replace_all_copy(ss.str(), "&gt;", ">");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&lt;", "<");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&amp;", "&"); // last one
std::ofstream fo(path);
fo << cleaned_xml;
A more elaborate solution would find the opening <![CDATA[ and the closing ]]> and replace only within those limits, to avoid unescaping symbols that were correctly escaped elsewhere.
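A rough sketch of that narrower post-processing, using only the standard library. It assumes the writer has emitted the CDATA markers in escaped form (`&lt;![CDATA[` and `]]&gt;`) and that only the three entities from the example above need undoing. `restore_cdata` and `replace_all` are hypothetical helper names, not part of Boost:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replaces every occurrence of `from` in `s` with `to`, in place.
static void replace_all(std::string& s, const std::string& from, const std::string& to) {
    for (std::size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size())
        s.replace(pos, from.size(), to);
}

// Hypothetical helper: unescapes only the text between escaped CDATA markers,
// leaving every other escaped character in the document untouched.
std::string restore_cdata(const std::string& xml) {
    const std::string open = "&lt;![CDATA[";   // assumed escaped form of <![CDATA[
    const std::string close = "]]&gt;";        // assumed escaped form of ]]>
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t start = xml.find(open, pos);
        if (start == std::string::npos) break;
        const std::size_t end = xml.find(close, start + open.size());
        if (end == std::string::npos) break;    // unterminated section: leave as-is
        out.append(xml, pos, start - pos);      // text before the section, unchanged
        std::string body = xml.substr(start + open.size(), end - start - open.size());
        replace_all(body, "&lt;", "<");
        replace_all(body, "&gt;", ">");
        replace_all(body, "&amp;", "&");        // last, so "&amp;lt;" is not unescaped twice
        out += "<![CDATA[" + body + "]]>";
        pos = end + close.size();
    }
    out.append(xml, pos, std::string::npos);    // remainder after the last section
    return out;
}
```

You would call it on the string produced by write_xml before writing it to the file. Escaped text outside a CDATA section, such as `1 &lt; 2`, is left alone.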
Another solution is presented in this answer but I've never used it.
I'm working with an ancient pre-XML markup that uses codes of the form "$=x", where x may be an alphabetic character or a symbol on the keyboard, such as ; (semicolon), ? (question mark), or < (left angle bracket, aka less-than). [Note after editing: the confusion manifested in the question as originally phrased goes to the heart of the problem. See my comment to the accepted answer. RS]
So I've modified a copy of XML.tmLanguage syntax definition file in my User folder to identify the eleven different categories that these codes represent, so I can easily see them in the large text files (which also contain XML markup) I'm working with.
For all the symbols except < I'm able to escape the symbol by preceding it with a backslash. But in the Boost regex engine that ST2 uses, \< is how you indicate that you want to match only at the start of a word. Consequently I've been unable to get this code to be properly recognized and highlighted.
I've looked everywhere for how to escape the < symbol in this circumstance. I've tried preceding it with 0, 1, 2, 3 and 4 back-slashes; and I also tried using the hexadecimal escape code \x{3009}. [Note: this is the code for greater-than instead of less-than.]
All in vain. (A few alternatives didn't generate an error message but also didn't highlight the code.)
Because the codes I'm working with need to be colored differently, I can't use a generic symbol in lieu of <, and I can't specify it literally either. How do I match it?
The tmLanguage file is written in XML, so Sublime Text feeds it through an XML parser first, before giving pieces to its regex engine's parser.
XML uses < to open tags such as <string>, so you can't use it directly as a character. Instead, there are these standard character references:
&amp; for & (ampersand)
&lt; for < (less than)
&gt; for > (greater than; not required)
&quot; for " (quote mark; only required in attribute values quoted with ")
&apos; for ' (apostrophe; only required in attribute values quoted with ')
So use <string>\$=&lt;</string> in the syntax file. When Sublime Text reads the file, its own XML parser will turn this into \$=< for the regex parser.
Backslash sequences don't help because the XML parser passes them through unchanged to the regex parser, which then sees \< or \\, neither of which is what you want.
\x{3008} is passed by the XML parser to the regex parser, where it's decoded to 〈, a character which looks somewhat similar to < but doesn't match it. \x3C would work though.
By the way, tmLanguage files use plist (property list) XML, so you can convert it to a format that's easier to edit, or use a plist editor such as http://tustin2121.github.io/jsPlistor/ (from Is there any online .plist editor?).
Try to use &gt; for a syntax file.
I have seen posts that explain how to output an actual & but my problem is on the input side. I have data coming out of the database (Oracle) into a DataSet.DataTable and yes we do perform a datatable.WriteXml which produces the xml structure I am looking for.
I have tried all the suggested methods for encoding the writer (XML and/or string) before I WriteXml into it, but the &amp; persists.
I then need to pass this xml representation of the dataset through an xslt transformation and it fails when I hit the special character &.
What's my solution? It's gotta be something simple. I was thinking there could be some XSL setting on the XSLT transformation that would handle this for me.
&amp; is an entity ...
&#38; is the numeric character reference for the same symbol (38 is its code point)
If you use numeric escape sequences (with the #), interpretation can never fail, though you can still pick the wrong numeric value.
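As a small illustration of the numeric form, here is a sketch that builds a decimal character reference from a code point. `numeric_char_ref` is a hypothetical helper name introduced only for this example:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: formats a code point as a decimal numeric
// character reference, e.g. 38 becomes "&#38;".
std::string numeric_char_ref(unsigned long code_point) {
    return "&#" + std::to_string(code_point) + ";";
}
```

Calling `numeric_char_ref('&')` gives `&#38;`, and `numeric_char_ref('<')` gives `&#60;`. The helper does no validation, so passing the wrong number produces a well-formed reference to the wrong character.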
I am using tinyxml to save input from a text ctrl. The user can copy whatever they like into the text box and it gets written to an xml file. I'm finding that the new lines don't get saved and neither do & characters. The weird part is that tinyxml just discards them completely without any warning. If I put a & into the textbox and save, the tag will look like:
<textboxtext></textboxtext>
Newlines completely disappear as well. No characters whatsoever are stored. What's going on? Even if I need to escape them with &amp; or something, why does it just discard everything? Also, I can't find anything on Google regarding this topic. Any help?
EDIT:
I found this topic, which suggests that the discarding of these characters may be a bug:
TinyXML and preserving HTML Entities
It is, apparently, a bug in TinyXml.
The simple workaround is to escape anything that it might not like:
&, ", ', < and > get their regular XML entity encodings
strange characters (read: anything other than alphanumerics and regular punctuation) are best translated to their Unicode code point: &#....;
Remember that TinyXml is above all a lightweight XML library, not a full-fledged beast.
const XMLDataNode *pointsNode = node->GetChildren().at(0);
std::wistringstream pointsstrm(*pointsNode->GetInnerText());
pointsstrm >> loadedGame.points;
This is code I've written to pull an int from an XML file and pass it into loadedGame.points (an int). However, this isn't working. It compiles but doesn't give the right value. Why is that? XMLDataNode is a class that wraps xmllite.dll.
Time for some wild guesses!
I'll bet you that the text you get from *pointsNode->GetInnerText() isn't what you think it is. Have you checked that it is indeed exactly the text you want? In particular, could it contain whitespace? Parsing a nicely formatted (i.e. indented, broken into lines, etc.) XML file without a schema to reference means that all sorts of text nodes consisting of whitespace will end up in your DOM tree.
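One way to test that guess is to make the extraction report failure instead of silently leaving the int unchanged. This is a sketch only, assuming C++17 for std::optional. `parse_int` is a hypothetical helper name, and it takes the text as a std::wstring rather than going through XMLDataNode:

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: returns the int in `text`, or std::nullopt when the
// text is empty, whitespace-only, or has anything other than whitespace
// around a single integer.
std::optional<int> parse_int(const std::wstring& text) {
    std::wistringstream stream(text);
    int value = 0;
    if (!(stream >> value)) return std::nullopt;  // no integer found at all
    stream >> std::ws;                            // skip trailing whitespace
    if (!stream.eof()) return std::nullopt;       // leftover non-whitespace text
    return value;
}
```

If `parse_int` comes back empty for the node you picked, that node is probably one of the whitespace text nodes (or some other node) rather than the one holding the number, and logging the raw text should confirm it.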