Boost property tree (XML) remove blank lines - c++

I am using boost::property_tree to read and write xml configuration files. I want to change the value of some tags in my code and write them back to file, with some reasonable xml formatting (new lines, indenting, etc.).
Currently I am using
#include <fstream>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
namespace bpt = boost::property_tree;
// ...
std::fstream fs("filename");
bpt::ptree pt;
bpt::xml_parser::read_xml(fs,pt);
// replace value
pt.erase("tagname");
pt.put("tagname",newval);
bpt::xml_parser::xml_writer_settings<char> xmlstyle(' ',4);
bpt::xml_parser::write_xml("filename",pt,std::locale(),xmlstyle);
But it seems that every time a tag is deleted, it leaves behind a blank line and after some iterations the xml becomes unreadable. Is there a way to remove empty lines from the property tree itself or from the resulting xml file using boost?
I know there are other ways of removing the newlines by reading and parsing the entire file again, but I was hoping for a more convenient one-liner.

OK, it looks like the answer was already on Stack Overflow; I just hadn't found it because the post does not mention newlines:
boost::property_tree XML pretty printing
The solution is to pass the boost::property_tree::xml_parser::trim_whitespace flag as the third argument of read_xml when reading the file, e.g. read_xml(fs, pt, bpt::xml_parser::trim_whitespace).

This is not Boost-specific, but it deals with the blank lines in particular.
You can run std::regex_replace() on the output before it is written to the file to remove the blank lines, like this:
std::regex_replace(std::ostreambuf_iterator<char>(fout), text.begin(), text.end(), std::regex("(\\n+)"), "\n");
Here fout is the file output stream and text is the output data as a std::string.
This collapses every run of consecutive newlines into a single newline. Lines that contain only spaces or tabs are not matched by this pattern.
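As a rough sketch of the same idea on a std::string, the helper below (collapse_blank_lines is a made-up name, not part of Boost or the standard library) also treats lines that hold only spaces, tabs or carriage returns as blank. That widening is an assumption about what the leftover lines look like, since indentation may remain on them.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: collapses every run of blank or whitespace-only
// lines into a single line break. The pattern is an assumption about
// what the leftover lines contain (nothing, spaces, tabs or '\r').
std::string collapse_blank_lines(const std::string& text) {
    // A newline followed by one or more lines made only of blanks.
    static const std::regex blank_run("\\n([ \\t\\r]*\\n)+");
    return std::regex_replace(text, blank_run, "\n");
}
```

The result can then be written to the output file in place of the original text.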

Related

How to search for line breaks within XML text (within tags)?

I have a massive XML file with text blocks, and many of them contain unencoded line breaks.
How can I search for line breaks (\n) within XML text (within tags) and replace them with HTML-encoded line breaks like &#10;?
My code so far:
#include <regex>
...
std::string sInput_xml;
std::ifstream in(sFilePath_XMLFile);
// read file into input_xml
while(getline(in, sLine))
sInput_xml += sLine;
std::regex rxSearch("\>.*(\n)+.*\</");
std::regex_replace (sInput_xml, rxSearch,"&#10;");
... and then I'd like to pass the string to the RapidXML parser. Unencoded line breaks are ignored by this (and many other) parsers, so I tried replacing them manually with &#10;. That works perfectly, but the file has 31k lines, so it would take forever.
I'm not even sure this regex is correct, but the VS compiler complains that regex_replace does not take three parameters. There should be a three-parameter version, though, as in the regex_replace example on cplusplus.
Using RapidXML 1.13, an XML file with unescaped newlines in elements and attributes is parsed successfully, and the attribute and element values preserve the whitespace for me, so I think the search and replace is unnecessary.
Note that if you're debugging in Visual Studio, newlines are omitted from the tooltip shown when you hover over a variable in the editor; maybe that's what led you to believe they weren't preserved.
Regarding your problem with the regex_replace function: if you pass a std::string as the third parameter, it will compile. This seems to be a problem in Visual Studio 2010, as a const char* is accepted in Visual Studio 2013.
If you still want to go down the regex route, you will also need to be aware of the characters that must be escaped in both the search and the replace strings.
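If the regex route is still wanted, a minimal sketch could look like the following. Note that the three-argument regex_replace returns the result rather than modifying its input, so the return value has to be kept. The function name encode_newlines and the "&#10;" replacement are assumptions for illustration, and this version replaces every line feed, not only those inside element text.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns a copy of xml in which every line feed is
// replaced by the character reference "&#10;" (an assumed replacement).
std::string encode_newlines(const std::string& xml) {
    static const std::regex newline("\n");
    // A std::string as the third argument, as suggested for VS 2010.
    const std::string replacement = "&#10;";
    return std::regex_replace(xml, newline, replacement);
}
```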
UPDATE: I now realize that this was representative code showing how you load the file before applying the regex. Be aware that getline() does not include the newline in the string it returns, so it is your loading code that removes the newlines from the file. The simplest thing would be to let RapidXML load the file directly:
#include "rapidxml_utils.hpp"
// ...
rapidxml::file<> xmlFile("test.xml");
rapidxml::xml_document<> doc;
doc.parse<0>(xmlFile.data());
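If the file has to be loaded by hand rather than through rapidxml::file, the difference between the two loading styles can be sketched with the standard library alone. Both function names below are made up for this illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads the whole stream into a string, line breaks included.
std::string slurp(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Mirrors the loading loop from the question: getline() discards the
// '\n' that ended each line, so the joined result has no line breaks.
std::string join_lines_lossy(std::istream& in) {
    std::string all;
    std::string line;
    while (std::getline(in, line)) {
        all += line;
    }
    return all;
}
```

With slurp() the string handed to the parser still contains the original newlines, so no search and replace is needed afterwards.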
Is there a reason for using C++?
Maybe you can try sed:
sed -i ':a;N;$!ba;s/\n/\&#10;/g' input.xml
The -i flag edits the file in place, so make sure you have a backup before you run it.
reference
How can I replace a newline (\n) using sed?

Differentiating between delimiter and newline in getline

ifstream file;
file.open("file.csv");
string str;
while(file.good())
{
    getline(file,str,',');
    if (___) // string was split from delimiter
    {
        [do this]
    }
    else // string was split from eol
    {
        [do that]
    }
}
file.close();
I'd like to read from a csv file, and differentiate between what happens when a string is split off due to a new line and what happens when it is split off due to the desired delimiter -- i.e. filling in the ___ in the sample code above.
The approaches I can think of are:
(1) manually adding a character to the end of each line in the original file,
(2) automatically adding a character to the end of each line by writing to another file,
(3) using getline without the delimiter and then making a function to split the resulting string by ','.
But is there a simpler or direct solution?
(I see that similar questions have been asked before, but I didn't see any solutions.)
My preference for clarity of the code would be to use your option 3) - use getline() with the standard '\n' delimiter to read the file into a buffer line by line and then use a tokenizer like strtok() (if you want to work on the C level) or boost::tokenizer to parse the string you read from the file.
You're really dealing with two distinct steps here, first read the line into the buffer, then take the buffer apart to extract the components you're after. Your code should reflect that and by doing so, you're also avoiding having to deal with odd states like the ones you describe where you end up having to do additional parsing anyway.
There is no easy way to determine "which delimiter terminated the string", and it gets "consumed" by getline, so it's lost to you.
Read the line and split it on commas yourself. You can use std::string::find() to find the commas. However, if your file contains strings that themselves contain commas, you will have to parse the string character by character, since you need to distinguish between commas in quoted text and commas in unquoted text.
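A minimal sketch of the two-step approach (read a line, then split it) might look like this. read_rows is a made-up name, and the sketch assumes that fields never contain quoted commas.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Step 1: read one line at a time. Step 2: split that line on commas.
// Assumes no quoted fields; a trailing empty field ("a,b,") is dropped.
std::vector<std::vector<std::string>> read_rows(std::istream& in) {
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> fields;
        std::istringstream line_in(line);
        std::string field;
        while (std::getline(line_in, field, ',')) {
            fields.push_back(field);
        }
        rows.push_back(fields);
    }
    return rows;
}
```

Because each row is a separate vector, the end of a line is simply the end of the inner loop, and no special case inside the field handling is needed.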
Your big problem is that your code does not do what you think it does.
From my reading of the docs, getline with a delimiter treats \n as just another character. It does not split on both the delimiter and the newline.
The efficient way to do this is to write your own custom splitting getline: cppreference has a pretty clear description of what getline does, so mimicking it should be easy (and safer than shooting from the hip; files are tricky).
Then return both the string and, through a second channel, the reason the read stopped.
That said, naively using getline and then splitting is also viable; it will be much faster to write and probably less error-prone to boot.

Reading in quoted CSV data without newline as endline

I have an issue with a file I am trying to read, and I don't know how to solve it.
The file is a CSV, but the text itself also contains commas, so each value is wrapped in quotes and only the commas between quoted values separate fields.
For instance:
"1","hello, ""world""","and then this" // In text " is written as ""
I would like to know how to deal with quotes using a QFileStream (though I haven't seen a base solution either).
Another problem is that I can't read line by line, because there might be newlines within these quotes.
In R, there is an option of quotes="" which solves these problems.
There must be something in C++. What is it?
In Qt you can split on a quote character (or any other symbol, such as '\') by escaping it with a backslash. For example, string.split("\""); splits the string on the '"' symbol.
Here is a simple console app that splits your file (so far, splitting on the "," sequence seems to be the easiest solution):
// open split.csv, located in the project folder in this case
QFile file("split.csv");
file.open(QIODevice::ReadOnly);
// dump all of its contents to stdout, just for testing
std::cout<<QString(file.readAll()).toStdString()<<std::endl;
// reset the file so it can be read again
file.reset();
// read the whole file into a QByteArray, pass it to the QString constructor,
// split that string on the "," sequence and store the pieces in a QStringList
// where every element of the list is the value of one cell in the CSV file
QStringList list=QString(file.readAll()).split("\",\"",QString::SkipEmptyParts);
// add back the quotes that were removed by split
for (int i=0; i<list.size();i++){
    if (i!=0) list[i].prepend("\"");
    if (i!=(list.size()-1)) list[i].append("\"");
}
// print the results to stdout
// (not using QDebug, because it would add more quotes to the output, which is already confusing enough)
foreach (QString i,list) std::cout<<i.toStdString()<<std::endl;
where split.csv contains "1","hello, ""world""","and then this" and the output is:
"1"
"hello, ""world"""
"and then this"
After some googling I found a ready-made solution. See this article about Qxt.

getline() text with UNIX formatting characters

I am writing a C++ program which reads lines of text from a .txt file. Unfortunately the text file is generated by a twenty-something year old UNIX program and it contains a lot of bizarre formatting characters.
The first few lines of the file are plain, English text and these are read with no problems. However, whenever a line contains one or more of these strange characters mixed in with the text, that entire line is read as characters and the data is lost.
The really confusing part is that if I manually delete the first couple of lines so that the very first character in the file is one of these unusual characters, then everything in the file is read perfectly. The unusual characters obviously just display as little ascii squiggles -arrows, smiley faces etc, which is fine. It seems as though a decision is being made automatically, without my knowledge or consent, based on the first line read.
Based on some googling, I suspected that the issue might be with the locale, but according to the visual studio debugger, the locale property of the ifstream object is "C" in both scenarios.
The code which reads the data is as follows:
//Function to open file at location specified by inFilePath, load and process data
int OpenFile(const char* inFilePath)
{
    string line;
    ifstream codeFile;
    //open text file
    codeFile.open(inFilePath,ios::in);
    //read file line by line
    while ( codeFile.good() )
    {
        getline(codeFile,line);
        //check non-zero length
        if (line != "")
            ProcessLine(&line[0]);
    }
    //close file
    codeFile.close();
    return 1;
}
If anyone has any suggestions as to what might be going on or how to fix it, they would be very welcome.
From your description it sounds like you are reading in binary data, which will cause getline() to throw out content or simply skip over the line.
You have a couple of choices:
(1) If you simply need lines from the data file, you can first sanitise them by removing all non-printable characters (that is the "official" name for those weird ASCII characters). On UNIX a tool such as strings would help you with that process.
(2) You can of course also do this programmatically in your code by reading in a fixed amount of data, storing it in a string, and then removing the characters that fall outside the standard ASCII range. This will most likely cause you to lose any Unicode that may be stored in the file.
(3) You change your program to understand the format and write a parser that lets you parse the document in a saner way.
If you can, I would suggest trying option (1) first, simply to see whether the results are sane and can still be used. You mention that this is medical data; do you by chance know what file format this is? If you are trying to find out and have access to a UNIX/Linux machine, you can use the file utility, which may give you a clue (in the worst case it will tell you it is simply data).
If possible, try getting a "clean" file whose hex dump you can post, so that we can provide better help than we currently can. By clean I mean that there is no personally identifying information in the file.
For option (2), open the file in binary mode. You mentioned using Windows, where std::fstream objects handle binary and non-binary files differently; on UNIX systems this is not the case (on most systems; I'm sure I'll get a comment about the one system that doesn't match this description).
codeFile.open(inFilePath,ios::in);
would become
codeFile.open(inFilePath, ios::in | ios::binary);
Instead of getline() you will want to become intimately familiar with .read() which will allow unformatted operations on the ifstream.
Reading will be like this:
// This code has not been tested!
char input[1024];
codeFile.read(input, 1024);
int actual_read = codeFile.gcount();
// Here you can process input, up to a maximum of actual_read characters.
//ProcessLine() // We didn't necessarily read a line!
ProcessData(input, actual_read);
The other thing, as mentioned, is that you can change the locale of the stream and with it what the stream treats as a line separator; maybe this will fix your issue without resorting to the unformatted operations:
imbue the stream with a new locale that only knows about the newline. This may or may not let getline() work without issues.

wistringstream from an xml file to an integer?

const XMLDataNode *pointsNode = node->GetChildren().at(0);
std::wistringstream pointsstrm(*pointsNode->GetInnerText());
pointsstrm >> loadedGame.points;
This is code I've written to pull an int from an XML file and store it in loadedGame.points (an int). However, it isn't working: it compiles but doesn't give the right value. Why is that? XMLDataNode is a class that manipulates xmllite.dll.
Time for some wild guesses!
I'll bet you that the text you get from *pointsNode->GetInnerText() isn't what you think it is. Have you checked that it is indeed exactly the text you want? In particular, could it contain whitespace? Parsing a nicely formatted (i.e. indented, broken into lines, etc.) XML file without a schema to reference means that all sorts of text nodes consisting of whitespace will end up in your DOM tree.
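One way to test that guess is to check whether the extraction actually succeeded instead of using the value blindly. The helper below is a sketch with a made-up name (parse_int). It accepts surrounding whitespace but rejects text that contains no number or has trailing junk.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true and stores the value only if text
// holds exactly one integer, optionally surrounded by whitespace.
bool parse_int(const std::wstring& text, int& out) {
    std::wistringstream stream(text);
    int value = 0;
    if (!(stream >> value)) {
        return false;            // no number at all, e.g. a whitespace-only node
    }
    stream >> std::ws;           // skip trailing whitespace
    if (!stream.eof()) {
        return false;            // something other than whitespace follows
    }
    out = value;
    return true;
}
```

If this returns false for the string coming from GetInnerText(), the node being read is probably not the one that holds the number.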