Reading in quoted CSV data without newline as endline - c++

I have an issue with a file I am trying to read in, and I don't know how to solve it.
The file is a CSV, but the text fields themselves also contain commas, so every value is wrapped in quotes and only the commas outside the quotes separate values.
For instance:
"1","hello, ""world""","and then this" // In text " is written as ""
I would like to know how to deal with the quotes using a QFileStream (though I haven't seen a plain C++ solution either).
Another problem is that I can't read the file line by line, because there might be newlines inside the quotes.
In R, there is an option of quotes="" which solves these problems.
There must be something in C++. What is it?

In Qt you can split on a quote character (or on any other special character, such as \) by escaping it with a backslash. For example, string.split("\""); splits the string on the " character.
Here is a simple console app that splits your file (splitting on the "," sequence seems to be the easiest solution so far):
// open split.csv, in this case located in the project folder
QFile file("split.csv");
file.open(QIODevice::ReadOnly);
// dump all of its contents to stdout, just for testing
std::cout<<QString(file.readAll()).toStdString()<<std::endl;
// reset the file so it can be read again
file.reset();
// read the whole file into a QByteArray, pass it to the QString constructor,
// split that string on the "," sequence and store the pieces in a QStringList,
// where every element of the list is the value of one cell in the csv file
QStringList list=QString(file.readAll()).split("\",\"",QString::SkipEmptyParts);
// add back the quotes that were removed by split
for (int i=0; i<list.size();i++){
if (i!=0) list[i].prepend("\"");
if (i!=(list.size()-1)) list[i].append("\"");
}
// print the results to stdout
// (not using qDebug, because it would add more quotes to the output, which is already confusing enough)
foreach (QString i,list) std::cout<<i.toStdString()<<std::endl;
where split.csv contains "1","hello, ""world""","and then this" and the output is:
"1"
"hello, ""world"""
"and then this"

After some googling I found a ready-made solution. See this article about Qxt.
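The split-based answer above relies on the "," sequence never appearing inside a field, and it leaves the doubled quotes in place. A character-by-character parser avoids both limits and also copes with newlines inside quoted fields. The following is only a minimal sketch in standard C++, not a Qt API: parseCsv is a hypothetical helper name, and it assumes the whole file has already been read into a std::string.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: parses CSV text in which fields may be wrapped in
// double quotes, a literal quote inside a quoted field is written as "",
// and quoted fields may contain commas and newlines.
// Returns one vector of unquoted field values per record.
std::vector<std::vector<std::string>> parseCsv(const std::string& text)
{
    std::vector<std::vector<std::string>> rows;
    std::vector<std::string> row;
    std::string field;
    bool inQuotes = false;
    bool rowHasContent = false; // lets completely empty lines be skipped

    // Finishes the current record, if it has any content.
    auto endRow = [&]() {
        if (rowHasContent) {
            row.push_back(field);
            rows.push_back(row);
        }
        row.clear();
        field.clear();
        rowHasContent = false;
    };

    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (inQuotes) {
            if (c != '"') {
                field += c;                 // commas and newlines are data here
            } else if (i + 1 < text.size() && text[i + 1] == '"') {
                field += '"';               // "" stands for one literal quote
                ++i;
            } else {
                inQuotes = false;           // closing quote
            }
        } else if (c == '"') {
            inQuotes = true;
            rowHasContent = true;
        } else if (c == ',') {
            row.push_back(field);           // unquoted comma ends the field
            field.clear();
            rowHasContent = true;
        } else if (c == '\n' || c == '\r') {
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') ++i;
            endRow();                       // unquoted line break ends the record
        } else {
            field += c;
            rowHasContent = true;
        }
    }
    endRow();                               // last record without trailing newline
    return rows;
}
```

For the sample line from the question, parseCsv returns a single record whose three fields are 1, hello, "world" and and then this, with the surrounding quotes removed and the doubled quotes collapsed.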

Related

Loading a file with the LoadFromFile () function with a newline

I load the text file .txt using the LoadFromFile() function, and the text in the middle of the line is marked with a newline '\n'.
The LoadFromFile() function treats this character as a new line and divides the line in that place by creating a new line.
In Windows Notepad the text looks like this: **Ala has ace**
The program that loads this file displays it differently:
plik->LoadFromFile( path, TEncoding::ASCII);
for( short int i = 0; i < plik->Count; ++i )
Memo1->Lines->Add( plik->Strings[i] );
In Memo1 the text looks like this:
**Ala**
**has ace**
Can I remove the '\n' character so that the text stays on a single line, and if so, how?
I answered this same question on the Embarcadero forums earlier today, but I will answer it here, too.
plik is a TStringList (according to the other discussion), so its LoadFrom...() method treats bare-CR, bare-LF, and CRLF line breaks equally when the TStrings::LineBreak property matches the RTL's global sLineBreak constant. If the LineBreak property does not match sLineBreak, then TStrings only splits on line breaks that match its LineBreak property.
Since the RTL's sLineBreak constant is CRLF on Windows, and you don't want to split on bare-LF line breaks, you are going to have to parse the file data manually, not use TStrings::LoadFromFile() at all.
For instance, you could read the whole file into a System::String using the System::Classes::TStreamReader::ReadToEnd() or System::Ioutils::TFile::ReadAllText() method (TStreamReader and TFile both have methods for reading lines, but they both treat all three forms of line break equally), and then parse that String to extract CRLF-delimited substrings while ignoring any bare-LF characters.
Ideally, you would load a file into a TMemo by using its own LoadFromFile() method. But, in this situation, that will not work, either, because TMemo normalizes all three forms of line breaks to CRLF before passing the data to the Win32 API, so that is not useful to you.
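The manual parsing step described above (extract CRLF-delimited substrings while leaving bare-LF characters alone) could look roughly like this. It is a sketch in plain standard C++ rather than VCL code: splitOnCrLf is a hypothetical helper name, and it assumes the file contents are already in a std::string. Adapting it to System::String is not shown.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits text on CRLF ("\r\n") only.
// Bare '\n' and bare '\r' characters stay inside the line they belong to.
std::vector<std::string> splitOnCrLf(const std::string& text)
{
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < text.size()) {
        const std::size_t pos = text.find("\r\n", start);
        if (pos == std::string::npos) {
            lines.push_back(text.substr(start)); // last line, no trailing CRLF
            break;
        }
        lines.push_back(text.substr(start, pos - start));
        start = pos + 2;                         // skip the CRLF itself
    }
    return lines;
}
```

Each returned line may still contain a bare '\n'; it would have to be replaced or removed before the line is added to the memo, depending on how the text should be displayed.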

Boost property tree (XML) remove blank lines

I am using boost::property_tree to read and write xml configuration files. I want to change the value of some tags in my code and write them back to file, with some reasonable xml formatting (new lines, indenting, etc.).
Currently I am using
namespace bpt = boost::property_tree;
std::fstream fs("filename");
bpt::ptree pt;
bpt::xml_parser::read_xml(fs,pt);
// replace value
pt.erase("tagname");
pt.put("tagname",newval);
bpt::xml_parser::xml_writer_settings<char> xmlstyle(' ',4);
bpt::xml_parser::write_xml("filename",pt,std::locale(),xmlstyle);
But it seems that every time a tag is deleted, it leaves behind a blank line and after some iterations the xml becomes unreadable. Is there a way to remove empty lines from the property tree itself or from the resulting xml file using boost?
I know there are other ways of removing the newlines by reading and parsing the entire file again, but I was hoping for a more convenient one-liner.
OK, it looks like the answer was already out there on Stack Overflow; I just hadn't found it (newlines were not mentioned in the post):
boost::property_tree XML pretty printing
The solution is to read the file with the boost::property_tree::xml_parser::trim_whitespace flag, passed as the third argument to read_xml.
This does not use Boost, but it addresses blank lines in particular.
You can use std::regex_replace() on the output before it is written to the file, removing the blank lines, like this
std::regex_replace(std::ostreambuf_iterator<char>(fout), text.begin(), text.end(), std::regex("(\\n+)"), "\n");
With fout as the file output stream and text as the output data as a std::string.
This replaces every run of consecutive newlines (with no characters in between) with a single newline.
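The regex above only matches newlines that directly follow each other. If the blank lines may still contain indentation (spaces or tabs), a slightly wider pattern is needed. The following is a sketch under that assumption: removeBlankLines is a hypothetical helper name, and it works on the serialized text as a std::string, not on the property tree.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes lines that are empty or contain only
// spaces and tabs. Assumes '\n' line endings.
std::string removeBlankLines(const std::string& text)
{
    // A newline followed by one or more whitespace-only lines
    // collapses to a single newline.
    static const std::regex blankRun("\n([ \t]*\n)+");
    return std::regex_replace(text, blankRun, "\n");
}
```

The indentation of the next non-blank line is kept, because the pattern only consumes whitespace that is itself followed by a newline. A blank line at the very start of the text is not removed by this pattern.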

Differentiating between delimiter and newline in getline

ifstream file;
file.open("file.csv");
string str;
while(file.good())
{
getline(file,str,',');
if (___) // string was split from delimiter
{
[do this]
}
else // string was split from eol
{
[do that]
}
}
file.close();
I'd like to read from a csv file, and differentiate between what happens when a string is split off due to a new line and what happens when it is split off due to the desired delimiter -- i.e. filling in the ___ in the sample code above.
The approaches I can think of are:
(1) manually adding a character to the end of each line in the original file,
(2) automatically adding a character to the end of each line by writing to another file,
(3) using getline without the delimiter and then making a function to split the resulting string by ','.
But is there a simpler or direct solution?
(I see that similar questions have been asked before, but I didn't see any solutions.)
My preference for clarity of the code would be to use your option 3) - use getline() with the standard '\n' delimiter to read the file into a buffer line by line and then use a tokenizer like strtok() (if you want to work on the C level) or boost::tokenizer to parse the string you read from the file.
You're really dealing with two distinct steps here, first read the line into the buffer, then take the buffer apart to extract the components you're after. Your code should reflect that and by doing so, you're also avoiding having to deal with odd states like the ones you describe where you end up having to do additional parsing anyway.
There is no easy way to determine "which delimiter terminated the string", and it gets "consumed" by getline, so it's lost to you.
Read the whole line, and split it on commas yourself. You can use std::string::find() to find the commas. However, if your file contains strings that themselves contain commas, you will have to parse the line character by character, since you need to distinguish between commas in quoted text and commas in unquoted text.
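The two-step approach recommended in the answers above (read a line, then take it apart) might be sketched as follows. The names splitLine and readCsv are hypothetical, and the sketch assumes that fields never contain quoted commas.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one line on `delim` using std::string::find().
// Quoted fields are NOT handled; every delimiter ends a field.
std::vector<std::string> splitLine(const std::string& line, char delim = ',')
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start)); // last field on the line
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// Step 1: read line by line. Step 2: split each line.
// The end of an inner vector is where the original code's
// "split from eol" branch would run.
std::vector<std::vector<std::string>> readCsv(std::istream& in)
{
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line)) {  // loop on the read itself, not on good()
        rows.push_back(splitLine(line));
    }
    return rows;
}
```

readCsv accepts any std::istream, so it can be called with the std::ifstream from the question or with a std::istringstream for testing.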
Your big problem is your code does not do what you think it does.
From my reading of the docs, getline with a delimiter treats \n as just another character. It does not split on both the delimiter and the newline.
The efficient way to do this is to write your own custom splitting getline: cppreference has a pretty clear description of what getline does, so mimicking it should be easy (and safer than shooting from the hip; files are tricky).
Then return both the string and, through a second channel, information about why the parse finished.
Now, using getline naively and then splitting is also viable, and will be much faster to write, and probably less error-prone to boot.
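A custom splitting getline of the kind described above, which reports why it stopped through a second channel, could be sketched like this. The names Terminator and getToken are hypothetical, and the sketch does not try to reproduce every detail of std::getline (for example, it treats '\r' as ordinary data).

```cpp
#include <istream>
#include <sstream>
#include <string>

// Why reading a token stopped.
enum class Terminator { Delimiter, Newline, EndOfInput };

// Hypothetical helper: reads characters into `out` until `delim`, '\n',
// or the end of the input, and reports in `why` which one ended the token.
// The terminating character is consumed but not stored.
// Returns false when there is nothing left to read. An empty field that
// sits at the very end of the input (after a final delimiter) is not reported.
bool getToken(std::istream& in, std::string& out, Terminator& why, char delim = ',')
{
    out.clear();
    char c;
    while (in.get(c)) {
        if (c == delim) { why = Terminator::Delimiter; return true; }
        if (c == '\n')  { why = Terminator::Newline;   return true; }
        out += c;
    }
    why = Terminator::EndOfInput;
    return !out.empty();
}
```

In the loop from the question, the ___ test then becomes a comparison of why against Terminator::Delimiter.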

Read files in C++

This is my simple code:
#include "C:\Users\Myname\Desktop\Documents\std_lib_facilities.h"
using namespace std;
//**************************************************
int main()
try {
ifstream ifs("C:\Users\Myname\Desktop\raw_temps.txt");
if(!ifs) error("can't open file raw_temps.txt");
keep_window_open("~~");
return 0;
}
//**************************************
catch(runtime_error& e) {
cerr<<e.what();
keep_window_open("~~");
return 1;
}
The .txt file is in address "C:\Users\Myname\Desktop\raw_temps.txt".
When I run that, only the error("...") call executes, and ifs can't open the raw_temps.txt file.
Why please?
I believe this problem is due to a misunderstanding about using backslashes as a path separator. Paths in C++ are best written with forward slashes rather than backslashes, to prevent errors like the one you have here. This is because a single backslash in a string literal is an escape character, meaning that it combines with the next character to form a different character. Examples are "\n" for newline and "\t" for tab.
To prevent this, and to make the code run on all platforms and not just those that use the backslash as a path separator, stick to the forward slash as a path separator.
More information on this can be found in Marshall Cline's C++ FAQ.
And, yes, you can make this work with double backslashes, but then you are forming a bad habit, IMO. Plus, it takes two characters where only one is needed.
You need to escape "\" because it is an escape character. Replace "\" with "\\".
Change this line
ifstream ifs("C:\Users\Myname\Desktop\raw_temps.txt");
To this
ifstream ifs("C:/Users/Myname/Desktop/raw_temps.txt");
\ is used to mark escape characters, so unless you use \\, the string will not look like what you think it should. You can see this by using a debugger and breaking on this line.
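The effect of escape sequences on a path can be checked directly. The sketch below shows three ways to spell the path from the question so that the string holds what was intended; the variable names are only illustrative. Raw string literals require C++11 or later.

```cpp
#include <string>

// Three spellings of the path from the question.
const std::string escaped = "C:\\Users\\Myname\\Desktop\\raw_temps.txt"; // doubled backslashes
const std::string raw     = R"(C:\Users\Myname\Desktop\raw_temps.txt)";   // raw string literal, no escapes processed
const std::string forward = "C:/Users/Myname/Desktop/raw_temps.txt";      // forward slashes, nothing to escape

// "\t" is one tab character, while "\\t" is a backslash followed by 't'.
const std::string tab      = "a\tb";   // 3 characters: a, tab, b
const std::string noEscape = "a\\tb";  // 4 characters: a, \, t, b
```

Here escaped and raw hold exactly the same characters, so either can be passed to the ifstream constructor in place of the original literal.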
The best option is to keep the file you want to open in the same folder as the source code and write this:
ifstream ifs("raw_temps.txt");

Perl splitting text string (from HTML page, text document, etc.) by line into array?

This is kind of a weird question, at least for me, as I don't exactly understand what is fully involved in this. Basically, I have been doing this process where I save a scraped document (such as a web page) to a .txt file. Then I can easily use Perl to read this file and put each line into an array. However, it is not doing this based on any visible thing in the document (i.e., it is not going by HTML linebreaks); it just knows where a new line is, based on the .txt format.
However, I would like to cut this process out and just do the same thing from within a variable, so instead I would have what would have been the contents of the .txt file in a string and then I want to parse it, in the same way, line by line. The problem for me is that I don't know much about how this would work, as I don't really understand how Perl would be able to tell where a new line is. (I'm not going by HTML linebreaks: often it is just a web-based .txt file I'm scraping, which presents itself to my scraper, WWW::Mechanize, as a web page, so there is no HTML to go by.) I figure I can do this using other parameters, such as blank spaces, but am interested to know if there is a way to do this by line. Any info is appreciated.
I'd like to cut the actual saving of a file to reduce issues related to permissions on servers I use and also am just curious if I can make the process more efficient.
Here's an idea that might help you: you can open from strings as well as files.
So if you used to do this:
open( my $io, '<', 'blah.txt' ) or die "Could not open blah.txt! - $!";
my @list = <$io>;
You can just do this:
open( my $io, '<', \$text_I_captured );
my @list = <$io>;
It's hard to tell what your code's doing since we don't have it in front of us; it would be easier to help if you posted what you had. However, I'll give it a shot. If you scrape the text into a variable, you will have a string which may have embedded line breaks. These will either be \n (the traditional Unix newline) or \r\n (the traditional Windows newline sequence). Just as you can split on a space to get (a first approximation of) the words in a sentence, you can instead split on the newline sequence to get the lines in the text. Thus, the single line you'll need should be
my @lines = split(/\r?\n/, $scraped_text);
Use the $/ variable; it determines what to break lines on. So:
local $/ = " ";
while(<FILE>)...
would give you chunks separated by spaces. Just set it back to "\n" to get back to the way it was - or better yet, go out of the local $/ scope and let the global one come back, just in case it was something other than "\n" to begin with.
You can also undefine it altogether:
local $/ = undef;
This reads whole files in one slurp, and you can then iterate through them however you like. Just be aware that if you do a split or a splice, you may end up copying the string over and over, using lots of CPU and lots of memory. One way to do it with less is:
# perl -de 0
> $_="foo\nbar\nbaz\n";
> while( /\G([^\n]*)\n/go ) { print "line='$1'\n"; }
line='foo'
line='bar'
line='baz'
That works if you're breaking things apart by newlines, for example. \G matches either the beginning of the string or the end of the last match, within a /g-tagged regex.
Another weird tidbit is $/=\10... if you give it a scalar reference to an integer (here 10), you can get record-length chunks:
# cat fff
eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun
# perl -de 0
$/ = \10;
open FILE, "<fff";
while(<FILE>){ print "chunk='$_'\n"; }
chunk='eurgpuwerg'
chunk='piuewrngpi'
chunk='euwngipuen'
chunk='rgpiunergp'
chunk='iunerpigun'
chunk='
'
More info: http://www.perl.com/pub/a/2004/06/18/variables.html
If you combine this with FM's answer of using:
$data = "eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun";
open STRING, "<", \$data;
while(<STRING>){ print "chunk='$_'\n"; }
I think you can get every combination of what you need...