C++ String splitting but escaping all delimiters in quotations - c++

Using C++, I would like to split the rows of a string (a CSV file in this case) where some of the fields may contain delimiters that are escaped (by enclosing the field in "") and should be treated as literals. I have looked at the various questions already posed but have not found a direct answer to my problem.
Example of CSV file data:
Header1,Header2,Header3,Header4,Header5
Hello,",,,","world","!,,!,",","
Desired string vector after splitting:
["Hello"],[",,,"],["world"],["!,,!,"],[","]
Note: The CSV is only valid if the number of data columns equals the number of header columns.
Would prefer a non-boost / third-party solution. Efficiency is not a priority.
EDIT:
The code below, implementing the regex from @ClasG, at least satisfies the scenario above. I am drafting fringe test cases but would love to hear when / where it breaks down...
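#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Same row as the CSV example above, with the quotes escaped for C++.
    std::string s = "Hello,\",,,\",\"world\",\"!,,!,\",\",\"";
    std::string rx_string = "(\"[^\"]*\"|[^,]*)(?:,|$)";
    std::regex e(rx_string);
    std::sregex_iterator rit(s.begin(), s.end(), e);
    std::sregex_iterator rend;
    while (rit != rend)
    {
        // Group 1 is the field itself, without the trailing delimiter.
        std::cout << (*rit)[1].str() << std::endl;
        ++rit;
    }
}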

This is not a complete (c++) solution, but a regex that might nudge you in the right direction.
A regex like
("[^"]*"|[^,]*)(?:,|$)
will match the individual columns. (Note that it doesn't handle escaped quotes.)
See it here at regex101.
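Since the regex above does not handle escaped quotes, here is a minimal hand-written sketch of the same split using only the standard library. The function name `split_csv_row` is made up for illustration, and the sketch assumes that a doubled quote (`""`) inside a quoted field stands for one literal quote and that the surrounding quotes should be dropped from the result; it does not handle fields that span several lines.

```cpp
#include <string>
#include <vector>

// Split one CSV row into fields (hypothetical helper, not from the question).
// Commas inside double quotes are literal; "" inside a quoted field yields
// a single quote character. The enclosing quotes are not kept.
std::vector<std::string> split_csv_row(const std::string& row) {
    std::vector<std::string> fields;
    std::string field;
    bool in_quotes = false;
    for (std::size_t i = 0; i < row.size(); ++i) {
        const char c = row[i];
        if (in_quotes) {
            if (c == '"' && i + 1 < row.size() && row[i + 1] == '"') {
                field += '"';        // escaped quote inside a quoted field
                ++i;                 // skip the second quote of the pair
            } else if (c == '"') {
                in_quotes = false;   // closing quote
            } else {
                field += c;          // includes literal commas
            }
        } else if (c == '"') {
            in_quotes = true;        // opening quote
        } else if (c == ',') {
            fields.push_back(field); // delimiter outside quotes ends the field
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);         // last field has no trailing comma
    return fields;
}
```

To enforce the validity rule from the question, the caller can compare the `size()` of the vector returned for a data row with the one returned for the header row.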

This is not an answer, but it's too long to put as a comment IMHO.
CSV is one of those seemingly-simple-but-actually-quite-fiendish storage formats.
The droid you're looking for is Boost.Spirit.
The Spirit Master's name (on Stack Overflow) is @sehe.
See his answer here: https://stackoverflow.com/a/18366335/2015579
Please credit sehe, not me.

Related

Inverse regex processing to produce regex phrase

We take the normal regex processor and pass the input text and the regex phrase to capture the desired output text.
output = the_normal_regex(
input = "12$abc##EF345",
phrase = "\d+|[a-zA-Z]+")
= ["12", "abc", "EF", "345"]
Can we invert the processing, so that a tool receives both the input text and the output text and produces an adequate regex phrase, especially if the text size is limited to a practical minimum, e.g. a few dozen characters? Is any tool available for this?
phrase = the_inverse_tool(
input = "12$abc##EF345",
output=["12", "abc", "EF", "345"])
= "\d+|[a-zA-Z]+"
What you're asking appears to be whether there is some algorithm or existing library that takes an input string (like "12$abc##EF345") and a set of matches (like ["12", "abc", "EF", "345"]) and produces an "adequate" regex that would produce the matches, given the input string.
However, what does 'adequate' mean in this context? For your example, a simple answer would be: "12|abc|EF|345". However, it appears you expect something more like the generalised "\d+|[a-zA-Z]+"
Note that your generalisation makes a number of assumptions, for example that words in French, Swedish or Chinese shouldn't be matched. And numbers containing , or . are also not included.
You cannot expect a generalised algorithm to make those kinds of distinctions, as those are essentially problems requiring general AI, understanding the problem domain at an abstract level and coming up with a suitable solution.
Another way of looking at it is: your question is the same as asking if there is some function or library that automates the work of a programmer (specific to the regex language). The answer is: no, not yet anyway, and by the time there is, there won't be people on StackOverflow asking or answering these questions, because we'll all be out of a job.
However, some more optimistic viewpoints can be found here: Is it possible for a computer to "learn" a regular expression by user-provided examples?
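The "simple answer" mentioned above, an alternation of the literal matches, is the one part that can be automated mechanically. A rough sketch, assuming ECMAScript regex syntax as used by `std::regex`; the names `escape_regex`, `literal_alternation` and `find_all` are made up for illustration. It does no generalisation at all, which is exactly the hard part discussed above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Escape characters that are special in ECMAScript regex syntax.
std::string escape_regex(const std::string& literal) {
    static const std::string special = "\\^$.|?*+()[]{}";
    std::string out;
    for (char c : literal) {
        if (special.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// Build the trivial pattern "m1|m2|..." from the expected matches.
std::string literal_alternation(const std::vector<std::string>& matches) {
    std::string pattern;
    for (const std::string& m : matches) {
        if (!pattern.empty()) pattern += '|';
        pattern += escape_regex(m);
    }
    return pattern;
}

// Collect every match of `pattern` in `input`, to check the round trip.
std::vector<std::string> find_all(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> found;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it)
        found.push_back(it->str());
    return found;
}
```

One caveat of this sketch: alternatives are tried in order, so if one expected match is a prefix of another (say "12" and "123"), the shorter one listed first would win.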

Write CDATA XML-node with boost::property_tree

I'm trying to write an XML-file containing CDATA-nodes using boost::property_tree. However since characters such as <, >, &, etc. are escaped automatically when writing the XML-file, something like
xml.put("node", "<![CDATA[message]]>")
will appear as
<node>&lt;![CDATA[message]]&gt;</node>
in the XML-file. Is there any way to properly write CDATA-nodes using property_tree or is this simply a limitation of the library?
Boost documentation clearly says that it is not able to distinguish between CDATA and non-CDATA values:
The XML storage encoding does not round-trip perfectly. A read-write cycle loses trimmed whitespace, low-level formatting information, and the distinction between normal data and CDATA nodes. Comments are only preserved when enabled. A write-read cycle loses trimmed whitespace; that is, if the origin tree has string data that starts or ends with whitespace, that whitespace is lost.
The few times I've faced the same problem have been for very specific cases where I knew no other escaped data would be needed, so a simple post-processing of the generated file replacing the escaped characters was enough.
As a general example:
std::ostringstream ss;
pt::write_xml(ss, xml, pt::xml_writer_make_settings<std::string>('\t', 1));
auto cleaned_xml = boost::replace_all_copy(ss.str(), "&gt;", ">");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&lt;", "<");
cleaned_xml = boost::replace_all_copy(cleaned_xml, "&amp;", "&"); // last one
std::ofstream fo(path);
fo << cleaned_xml;
A more elaborate solution should find the opening &lt;![CDATA[ and the closing ]]&gt; and replace only within those limits, to avoid replacing correctly escaped symbols.
Another solution is presented in this answer but I've never used it.
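The more elaborate variant described above (unescaping only between the escaped CDATA markers) could look roughly like the sketch below. It works on the already serialized string and uses only the standard library; `restore_cdata` and `replace_all` are hypothetical names, and only the three entities from the example above are handled. If the writer emits further entities such as `&quot;`, they would need the same treatment, placed before the `&amp;` replacement.

```cpp
#include <string>

// Replace every occurrence of `from` in `s` with `to` (in place).
static void replace_all(std::string& s, const std::string& from, const std::string& to) {
    for (std::size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size())
        s.replace(pos, from.size(), to);
}

// Unescape only the text between an escaped "<![CDATA[" and the next escaped "]]>".
// Everything outside those sections is copied unchanged.
std::string restore_cdata(const std::string& xml) {
    const std::string open = "&lt;![CDATA[";
    const std::string close = "]]&gt;";
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t start = xml.find(open, pos);
        if (start == std::string::npos) break;
        const std::size_t end = xml.find(close, start + open.size());
        if (end == std::string::npos) break;           // unterminated section: leave as is
        out.append(xml, pos, start - pos);              // text before the section
        std::string body = xml.substr(start + open.size(), end - start - open.size());
        replace_all(body, "&lt;", "<");
        replace_all(body, "&gt;", ">");
        replace_all(body, "&amp;", "&");                // last, as in the example above
        out += "<![CDATA[" + body + "]]>";
        pos = end + close.size();
    }
    out.append(xml, pos, std::string::npos);            // remainder after the last section
    return out;
}
```

It would be applied to `ss.str()` in place of the three global replacements, before writing the result to the file.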

Regex for comments in strings, strings in comments, etc

This is a question I've solved and wanted to post in Q&A style because I think more people could use the solution. Or maybe improve the solution, or show where it breaks.
The problem
You wanna do something with quoted strings and/or comments in a body of text. You wanna extract them, highlight them, what have you. But some quoted strings are inside comments, and sometimes comment characters are inside strings. And string delimiters can be escaped, and comments can be line comments or block comments. And just when you thought you had a solution, somebody complains that it doesn't work when there's a regex literal in his JavaScript. What do?
Concrete example
var ret = row.match(/'([^']+)'/i); // Get 1st single quoted string's content
if (!ret) return ''; /* return if there's no matches
Otherwise turn into xml: */
var message = '\t<' + ret[1].replace(/\[1]/g, '').replace(/\/#(\w+)/i, ' $1=""') + '></' + ret[1].match(/[A-Z_]\w*/i)[0] + '>';
alert('xml: \'' + message + '\''); /*
alert("xml: '" + message + "'"); // */
var line = prompt('How do line-comments start? (e.g. //)', '//');
// do something with line
This code is nonsense, but how do I do the right thing in each of the cases of the above JavaScript?
The only thing I found that comes close is this: Comments in string and strings in comments where Jan Goyvaerts himself answered with a similar approach. But that one doesn't handle apostrophe-escaping yet.
I've broken the regex into 4 lines corresponding with the 4 paths in the graph; don't keep those line breaks in there if you ever use this.
(['"])(?:(?!\1|\\).|\\.)*\1|
\/(?![*/])(?:[^\\/]|\\.)+\/[igm]*|
\/\/[^\n]*(?:\n|$)|
\/\*(?:[^*]|\*(?!\/))*\*\/
Debuggex Demo
This code grabs 4 types of "blocks" that can contain the other 3. You can iterate through this and do with each one whatever you want or discard it because it's not the one you wanna do anything to.
This one is specific for JavaScript as it's a language I'm familiar with. But you could easily adapt this to the language of your preference.
Anyone see a way in which this code breaks?
Edit I have since been notified that the general pattern is described very well here: https://stackoverflow.com/a/23589204/2684660, neato!
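To illustrate "iterating through" the blocks, here is a rough sketch that feeds the four-part pattern above to `std::regex`, whose default grammar is ECMAScript, so the pattern carries over almost unchanged (only the escaping of `/` is dropped, since it is not needed outside a JavaScript regex literal). The function name `find_blocks` is made up for illustration, and the behaviour on large inputs has not been checked.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return every string, regex literal, line comment and block comment found
// in `source`, in order of appearance (hypothetical helper).
std::vector<std::string> find_blocks(const std::string& source) {
    static const std::regex blocks(
        R"rx((['"])(?:(?!\1|\\).|\\.)*\1)rx"        // quoted string
        R"rx(|/(?![*/])(?:[^\\/]|\\.)+/[igm]*)rx"   // regex literal
        R"rx(|//[^\n]*(?:\n|$))rx"                  // line comment
        R"rx(|/\*(?:[^*]|\*(?!/))*\*/)rx");         // block comment
    std::vector<std::string> found;
    for (std::sregex_iterator it(source.begin(), source.end(), blocks), end; it != end; ++it)
        found.push_back(it->str());
    return found;
}
```

The caller can then look at the first one or two characters of each block to tell the four kinds apart. One case where the pattern itself breaks: a chained division such as `a / b / c` is picked up by the regex-literal branch as `/ b /`.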

Minify HTML with Boost regex in C++

Question
How to minify HTML using C++?
Resources
An external library could be the answer, but I'm more looking for improvements of my current code. Although I'm all ears for other possibilities.
Current code
This is my interpretation in c++ of the following answer.
The only part I had to change from the original post is this part on top: "(?ix)"
...and a few escape signs
#include <string>
#include <boost/regex.hpp>
void minifyhtml(std::string* s) {
boost::regex nowhitespace(
"(?ix)"
"(?>" // Match all whitespans other than single space.
"[^\\S ]\\s*" // Either one [\t\r\n\f\v] and zero or more ws,
"| \\s{2,}" // or two or more consecutive-any-whitespace.
")" // Note: The remaining regex consumes no text at all...
"(?=" // Ensure we are not in a blacklist tag.
"[^<]*+" // Either zero or more non-"<" {normal*}
"(?:" // Begin {(special normal*)*} construct
"<" // or a < starting a non-blacklist tag.
"(?!/?(?:textarea|pre|script)\\b)"
"[^<]*+" // more non-"<" {normal*}
")*+" // Finish "unrolling-the-loop"
"(?:" // Begin alternation group.
"<" // Either a blacklist start tag.
"(?>textarea|pre|script)\\b"
"| \\z" // or end of file.
")" // End alternation group.
")" // If we made it here, we are not in a blacklist tag.
);
// @todo Don't remove conditional html comments
// Non-greedy, so that each comment is matched on its own.
boost::regex nocomments("<!--(.*?)-->");
*s = boost::regex_replace(*s, nowhitespace, " ");
*s = boost::regex_replace(*s, nocomments, "");
}
Only the first regex is from the original post, the other one is something I'm working on and should be considered far from complete. It should hopefully give a good idea of what I try to accomplish though.
Regexps are a powerful tool, but I think that using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare. By looking at it you can't quickly understand what the heck it is supposed to match.
You need an HTML parser that tokenizes the input file and lets you access the tokens either as a stream or as an object tree. Basically: read tokens, discard the tokens and attributes you don't need, then write what remains to the output. Using something like this would let you develop a solution faster than if you tried to tackle it with regexps.
I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.
In C++ there are libxml (which might have an HTML support module), Qt 4 and tinyxml; libstrophe also uses some kind of XML parser that could work.
Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, it has the "Beautiful Soup" module, which would work very well for this kind of problem.
Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).
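As a very rough illustration of the "read, discard, write" idea without a full parser, the sketch below scans the input once, copies `<pre>`, `<textarea>` and `<script>` elements untouched, and collapses every other whitespace run to a single space. The names `skip_preserved` and `collapse_whitespace` are made up, and the sketch assumes lower-case tag names, no nesting of the preserved elements, and no `<` tricks inside attributes or comments; a real HTML parser handles all of that properly.

```cpp
#include <cctype>
#include <string>

// If a <pre>, <textarea> or <script> start tag begins at `pos`, return the
// index just past its end tag; otherwise return `pos` unchanged.
static std::size_t skip_preserved(const std::string& html, std::size_t pos) {
    static const char* const names[] = {"pre", "textarea", "script"};
    for (const char* n : names) {
        const std::string name(n);
        const std::size_t after = pos + 1 + name.size();
        if (html.compare(pos + 1, name.size(), name) != 0) continue;
        // Reject longer tag names that merely start with the same letters.
        if (after < html.size() && std::isalnum(static_cast<unsigned char>(html[after]))) continue;
        const std::size_t close = html.find("</" + name, after);
        if (close == std::string::npos) return html.size();
        const std::size_t gt = html.find('>', close);
        return gt == std::string::npos ? html.size() : gt + 1;
    }
    return pos;
}

// Collapse whitespace runs to one space, except inside preserved elements.
std::string collapse_whitespace(const std::string& html) {
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        const unsigned char c = static_cast<unsigned char>(html[i]);
        if (c == '<') {
            const std::size_t end = skip_preserved(html, i);
            if (end != i) {                  // copy the whole element untouched
                out.append(html, i, end - i);
                i = end;
                continue;
            }
        }
        if (std::isspace(c)) {               // collapse the run to a single space
            while (i < html.size() && std::isspace(static_cast<unsigned char>(html[i]))) ++i;
            out += ' ';
            continue;
        }
        out += html[i++];
    }
    return out;
}
```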

replace string through regex using boost C++

I have a string in which tags like this appear (there are multiple such tags):
|{{nts|-2605.2348}}
I want to use boost regex to remove |{{nts| and }}, replacing the whole tag shown above with
-2605.2348
in original string
To make it more clear:
Suppose string is:
number is |{{nts|-2605.2348}}
I want string as:
number is -2605.2348
I am quite new to boost regex and have read many things online, but I am not able to find an answer to this. Any help would be appreciated.
It really depends on how specific you want to be. Do you want to always remove exactly |{{nts|, or do you want to remove a pipe, followed by {{, followed by any number of letters, followed by a pipe? Or do you want to remove everything that isn't whitespace between the last space and the first part of the number?
One of the many ways to do this would be something like:
#include <iostream>
#include <boost/regex.hpp>
int main()
{
std::string str = "number is |{{nts|-2605.2348}}";
boost::regex re("\\|[^-\\d.]*(-?[\\d.]*)\\}\\}");
std::cout << regex_replace(str, re, "$1") << '\n';
}
online demo: http://liveworkspace.org/code/2B290X
However, since you're using Boost, consider the much simpler and faster parsers generated by Boost.Spirit.
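For the strictest of the interpretations listed above (always remove exactly |{{nts| and }}), a minimal sketch could look like this. It swaps in `std::regex` for Boost.Regex so that it has no external dependency; the function name `strip_nts` is made up, and the pattern assumes the tag body is an optionally signed run of digits and dots.

```cpp
#include <regex>
#include <string>

// Replace every "|{{nts|<number>}}" tag in `text` with the bare number.
std::string strip_nts(const std::string& text) {
    static const std::regex tag(R"rx(\|\{\{nts\|(-?[\d.]+)\}\})rx");
    return std::regex_replace(text, tag, "$1");
}
```

Because `std::regex_replace` replaces all matches by default, a string containing several such tags is handled in one call.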