Write CDATA XML-node with boost::property_tree - c++

I'm trying to write an XML-file containing CDATA-nodes using boost::property_tree. However since characters such as <, >, &, etc. are escaped automatically when writing the XML-file, something like
xml.put("node", "<![CDATA[message]]>")
will appear as
<node>&lt;![CDATA[message]]&gt;</node>
in the XML-file. Is there any way to properly write CDATA-nodes using property_tree or is this simply a limitation of the library?

Boost documentation clearly says that it is not able to distinguish between CDATA and non-CDATA values:
The XML storage encoding does not round-trip perfectly. A read-write cycle loses trimmed whitespace, low-level formatting information, and the distinction between normal data and CDATA nodes. Comments are only preserved when enabled. A write-read cycle loses trimmed whitespace; that is, if the origin tree has string data that starts or ends with whitespace, that whitespace is lost.
The few times I've faced the same problem have been for very specific cases where I knew no other escaped data would be needed, so a simple post-processing of the generated file replacing the escaped characters was enough.
As a general example:
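// pt is assumed to be an alias for boost::property_tree;
// replace_all_copy comes from <boost/algorithm/string/replace.hpp>.
std::ostringstream ss;
pt::write_xml(ss, xml, pt::xml_writer_make_settings<std::string>('\t', 1));
auto cleaned_xml = boost::replace_all_copy(ss.str(), "&gt;", ">");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&lt;", "<");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&amp;", "&"); // last one, so "&amp;lt;" is not unescaped twice
std::ofstream fo(path);
fo << cleaned_xml;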
A more elaborate solution would find each opening &lt;![CDATA[ and closing ]]&gt; and replace only within those limits, to avoid unescaping symbols that were escaped correctly.
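A minimal sketch of that more selective approach, using only the standard library. The names restore_cdata and replace_all_in are hypothetical, and the sketch assumes the writer escaped the section markers as &lt;![CDATA[ and ]]&gt; and used only the five predefined XML entities inside them:

```cpp
#include <string>

// Hypothetical helper: replaces every occurrence of `from` with `to` in `s`.
static void replace_all_in(std::string& s, const std::string& from,
                           const std::string& to) {
    for (std::size_t pos = 0;
         (pos = s.find(from, pos)) != std::string::npos;
         pos += to.size()) {
        s.replace(pos, from.size(), to);
    }
}

// Unescapes only the escaped CDATA sections of serialized XML.
// Text outside those sections is copied through unchanged.
std::string restore_cdata(const std::string& xml) {
    const std::string open = "&lt;![CDATA[";
    const std::string close = "]]&gt;";
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t begin = xml.find(open, pos);
        if (begin == std::string::npos) break;
        const std::size_t end = xml.find(close, begin + open.size());
        if (end == std::string::npos) break;  // unterminated: leave as is

        out.append(xml, pos, begin - pos);    // text before the section
        std::string body =
            xml.substr(begin + open.size(), end - begin - open.size());
        replace_all_in(body, "&lt;", "<");
        replace_all_in(body, "&gt;", ">");
        replace_all_in(body, "&quot;", "\"");
        replace_all_in(body, "&apos;", "'");
        replace_all_in(body, "&amp;", "&");   // last, as in the simple version
        out += "<![CDATA[" + body + "]]>";
        pos = end + close.size();
    }
    out.append(xml, pos, std::string::npos);  // remaining text
    return out;
}
```

It would be applied to ss.str() in place of the three global replacements. Escaped text outside a CDATA section, such as x &lt; y, is left alone.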
Another solution is presented in this answer but I've never used it.

Related

Escape white space in boost::fs::path

What it says on the tin. Is there a cleverer way to replace white spaces in a boost::fs::path that does not require a regex?
EDIT as an example:
_appBundlePath = boost::fs::path("/path/with spaces/here");
regex space(" ");
string sampleFilename = regex_replace((_appBundlePath/"audio/samples/C.wav").string(), space, "\\ ");
Question: is there a way that avoids using a regex? Seems like an overkill to me.
EDIT 2 My issue is when passing a string to Pure Data via libpd. PD will interpret a space as a separator, so my string will be chopped up into multiple symbols. Surrounding it with double quotes won't work, and I'm not even sure that escaping white space would, but it's worth a shot.
The cleverest way is not to do it.
For example, use execve instead of system (so you can pass arguments in an array, no need for shell escaping). See e.g. How can I escape variables sent to the 'system' command in C++?
Or, if you e.g. talk to a database server, do not concatenate your queries but bind parameters into a prepared statement. Again this precludes the need for any escaping.
Avoiding escaping avoids a whole slew of security issues (RCE, SQLi etc.)
If you must, probably just do
"'" + boost::replace_all_copy(path.string(), "'", "'\\''") + "'"
This would be fine for e.g. bash shells
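In a POSIX shell, a single quote inside a single-quoted string is usually written as '\'' (close the quote, add an escaped quote, reopen). A minimal standard-library sketch of that quoting follows; shell_quote is a hypothetical name, and this only addresses shell word splitting, so it may well not help with other consumers such as Pure Data:

```cpp
#include <string>

// Hypothetical helper: wraps `arg` in single quotes for a POSIX shell,
// rewriting each embedded single quote as '\''.
std::string shell_quote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') {
            out += "'\\''";  // close quote, escaped quote, reopen quote
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}
```

With a boost::fs::path this would be called as shell_quote(path.string()).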
For anything else, find out which characters need escaping and use the existing library functions that suit the goal, e.g.
http://en.cppreference.com/w/cpp/io/manip/quoted
https://dev.mysql.com/doc/refman/5.7/en/mysql-real-escape-string.html
... etc.
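As an illustration of the first link, here is a small sketch of std::quoted (C++14, from <iomanip>). The wrapper names quote and unquote are hypothetical. It shows that the manipulator adds the delimiters and escapes for you, and that reading the result back yields the original string, spaces included:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Writes `s` surrounded by double quotes, escaping embedded quotes
// and backslashes with a backslash (the std::quoted defaults).
std::string quote(const std::string& s) {
    std::ostringstream os;
    os << std::quoted(s);
    return os.str();
}

// Reads one quoted string back, undoing the escaping.
std::string unquote(const std::string& s) {
    std::istringstream is(s);
    std::string out;
    is >> std::quoted(out);
    return out;
}
```

Whether this quoting style is understood by the program on the receiving end is a separate question; it is only useful when both sides agree on the convention.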

How can I use Regex to parse irregular CSV and not select certain characters

I have to handle a weird CSV format, and I have been running into problems. The string I have been able to work out thus far is
(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*?)+?
My files are often broken and irregular, since we have to deal with OCR'd text which is usually not checked by our users. Therefore, we tend to end up with lots of weird things, like a single " within a field, or even a newline character(which is why I am using Regex instead of my previous readLine()-based solution). I've gotten it to parse most everything correctly, except it captures [,] [,]. How can I get it to NOT select fields with only a single comma? When I try and have it not select commas, it turns "156,000" into [156] and [000]
The test string I've been using is
"156,000","",""i","parts","dog"","","Monthly "running" totals"
The desired capture output is
[156,000],[],[i],[parts],[dog],[],[Monthly "running" totals]
I can do with or without the internal quotes, since I can always just strip them during processing.
Thank you all very much for your time.
Your CSV is indeed irregular and difficult to parse. I suggest you first apply two replacements to your data.
// remove all invalid double "" (collapse each into a single ")
input = Regex.Replace(input, @"(?<!,|^)""""(?=,|$)|(?<=,)""""(?!,|$)", "\"");
// now escape all inner "
input = Regex.Replace(input, @"(?<!,|^)""(?!,|$)", "\\\"");
// at this stage you have proper CSV data and I suggest using a good .NET CSV parser
// to parse your data and get individual values
Replacement 1 demo
Replacement 2 demo

Minify HTML with Boost regex in C++

Question
How to minify HTML using C++?
Resources
An external library could be the answer, but I'm more looking for improvements of my current code. Although I'm all ears for other possibilities.
Current code
This is my interpretation in c++ of the following answer.
The only part I had to change from the original post is this part on top: "(?ix)"
...and a few escape signs
#include <string>
#include <boost/regex.hpp>

void minifyhtml(std::string* s) {
    boost::regex nowhitespace(
        "(?ix)"
        "(?>" // Match all whitespace other than a single space.
        "[^\\S ]\\s*" // Either one [\t\r\n\f\v] and zero or more ws,
        "| \\s{2,}" // or two or more consecutive-any-whitespace.
        ")" // Note: The remaining regex consumes no text at all...
        "(?=" // Ensure we are not in a blacklist tag.
        "[^<]*+" // Either zero or more non-"<" {normal*}
        "(?:" // Begin {(special normal*)*} construct
        "<" // or a < starting a non-blacklist tag.
        "(?!/?(?:textarea|pre|script)\\b)"
        "[^<]*+" // more non-"<" {normal*}
        ")*+" // Finish "unrolling-the-loop"
        "(?:" // Begin alternation group.
        "<" // Either a blacklist start tag.
        "(?>textarea|pre|script)\\b"
        "| \\z" // or end of file.
        ")" // End alternation group.
        ")" // If we made it here, we are not in a blacklist tag.
    );
    // @todo Don't remove conditional html comments
    // Non-greedy, so that text between two separate comments is kept.
    boost::regex nocomments("<!--.*?-->");
    *s = boost::regex_replace(*s, nowhitespace, " ");
    *s = boost::regex_replace(*s, nocomments, "");
}
Only the first regex is from the original post, the other one is something I'm working on and should be considered far from complete. It should hopefully give a good idea of what I try to accomplish though.
Regexps are a powerful tool, but I think using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare: by looking at it, you can't quickly understand what the heck it is supposed to match.
You need an HTML parser that tokenizes the input file, or lets you access the tokens either as a stream or as an object tree. Basically: read tokens, discard the tokens and attributes you don't need, then write what remains to the output. Using something like this would let you develop a solution faster than trying to tackle it with regexps.
I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.
In C++, there are libxml (which might have an HTML support module), Qt 4 and tinyxml; libstrophe also uses some kind of XML parser that could work.
Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, it has the "Beautiful Soup" module, which would work very well for this kind of problem.
Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).

Is RTF text empty

Is there an easy way in C++ to tell whether an RTF text string has any content aside from pure formatting?
For example this text is only formatting, there is no real content here:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}
Loading the RTF text into a RichTextControl is not an option; I want something that works fast and requires minimal resources.
The only sure-fire way is to write your own RTF parser [spec], use a library like LibRTF, or you might consider keeping a RichTextControl open and updating it with new RTF documents rather than destroying the object every time.
I believe RTF is not a regular language, so cannot be properly parsed by RegEx (not unlike HTML, despite millions of attempts to do so), but you do not need to write a complete RTF parser.
I'd start with a simple string parser. Try:
Remove content between {\ and }
Remove tags. Tags begin with a backslash, \, and are followed by some text. If a backslash is followed by whitespace, it is not a tag.
The document should end with at least one closing curly brace, }
Any content left which isn't whitespace should be document content, though this may have some exceptions so you'll want to test on numerous samples of RTF.
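The steps above can be sketched as a single pass over the string. This is a heuristic, not an RTF parser, and it departs from the first step in one respect: instead of dropping everything between {\ and }, it skips only an assumed list of groups that never hold visible text (font table, colour table, stylesheet, info, and {\* destinations), since the whole document is itself a group starting with {\rtf. The names rtf_is_empty and starts_ignored_group are hypothetical:

```cpp
#include <cctype>
#include <string>

// Assumed prefixes of groups that never contain visible text.
// This is a plain prefix match, so it is deliberately approximate.
static bool starts_ignored_group(const std::string& rtf, std::size_t brace) {
    static const char* const prefixes[] = {
        "{\\*", "{\\fonttbl", "{\\colortbl", "{\\stylesheet", "{\\info"};
    for (const char* p : prefixes) {
        if (rtf.compare(brace, std::char_traits<char>::length(p), p) == 0)
            return true;
    }
    return false;
}

// Returns true if no visible text is found in `rtf`.
bool rtf_is_empty(const std::string& rtf) {
    const std::size_t n = rtf.size();
    int depth = 0;
    int ignored_from = 0;  // depth where an ignored group began, 0 if none
    for (std::size_t i = 0; i < n; ++i) {
        const char c = rtf[i];
        if (c == '{') {
            ++depth;
            if (ignored_from == 0 && starts_ignored_group(rtf, i))
                ignored_from = depth;
        } else if (c == '}') {
            if (depth == ignored_from) ignored_from = 0;
            if (depth > 0) --depth;
        } else if (c == '\\') {
            if (i + 1 >= n) break;
            const unsigned char next = static_cast<unsigned char>(rtf[i + 1]);
            if (std::isalpha(next)) {
                // Control word: letters, optional signed number,
                // then at most one space acting as a delimiter.
                ++i;
                while (i + 1 < n &&
                       std::isalpha(static_cast<unsigned char>(rtf[i + 1]))) ++i;
                if (i + 1 < n && rtf[i + 1] == '-') ++i;
                while (i + 1 < n &&
                       std::isdigit(static_cast<unsigned char>(rtf[i + 1]))) ++i;
                if (i + 1 < n && rtf[i + 1] == ' ') ++i;
            } else {
                ++i;  // control symbol: consume the character after the backslash
                // Escaped \\, \{, \} and \'hh characters count as visible text.
                if (ignored_from == 0 && (next == '\\' || next == '{' ||
                                          next == '}' || next == '\''))
                    return false;
            }
        } else if (ignored_from == 0 &&
                   !std::isspace(static_cast<unsigned char>(c))) {
            return false;  // ordinary non-whitespace character
        }
    }
    return true;
}
```

Control words such as \par and \tab are treated as formatting here, so a document holding only paragraph breaks counts as empty. As the answer says, test it on numerous RTF samples before relying on it.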

Can someone explain which characters need to be escaped in tinyxml?

I am using tinyxml to save input from a text ctrl. The user can copy whatever they like into the text box and it gets written to an xml file. I'm finding that the new lines don't get saved and neither do & characters. The weird part is that tinyxml just discards them completely without any warning. If I put a & into the textbox and save, the tag will look like:
<textboxtext></textboxtext>
Newlines completely disappear as well. No characters whatsoever are stored. What's going on? Even if I need to escape them with &amp; or something, why does it just discard everything? Also, I can't find anything on Google regarding this topic. Any help?
EDIT:
I found this topic, which suggests that discarding these characters may be a bug.
TinyXML and preserving HTML Entities
It is, apparently, a bug in TinyXml.
The simple workaround is to escape anything that it might not like:
&, ", ', < and > get their regular XML entity encodings
strange characters (read: anything that is not alphanumeric or regular punctuation) are best translated to their Unicode code point: &#....;
Remember that TinyXml is above all a lightweight XML library, not a full-fledged beast.
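A minimal sketch of that workaround using only the standard library. The name escape_xml is hypothetical. It covers the five predefined entities and writes newlines and carriage returns as numeric character references; translating non-ASCII characters to code points would need UTF-8 decoding and is left out. Whether a given TinyXml version then escapes the text a second time when writing is not verified here, so check the resulting file:

```cpp
#include <string>

// Escapes the five predefined XML entities and encodes line breaks
// as numeric character references. Other bytes are copied unchanged.
std::string escape_xml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            case '\n': out += "&#10;";  break;
            case '\r': out += "&#13;";  break;
            default:   out += c;        break;
        }
    }
    return out;
}
```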