Efficiently read CSV file with optional columns - c++

I'm trying to write a program that reads in a CSV file (no need to worry about escaping anything, it's strictly formatted with no quotes) but any numeric item with a value of 0 is instead just left blank. So a normal line would look like:
12,string1,string2,3,,,string3,4.5
instead of
12,string1,string2,3,0,0,string3,4.5
I have some working code using vectors but it's way too slow.
int main(int argc, char** argv)
{
string filename("path\\to\\file.csv");
string outname("path\\to\\outfile.csv");
ifstream infile(filename.c_str());
if(!infile)
{
cerr << "Couldn't open file " << filename.c_str();
return 1;
}
vector<vector<string>> records;
string line;
while( getline(infile, line) )
{
vector<string> row;
string item;
istringstream ss(line);
while(getline(ss, item, ','))
{
row.push_back(item);
}
records.push_back(row);
}
return 0;
}
Is it possible to overload operator<< of ostream, similar to "How to use C++ to read in a .csv file and output in another form?", when fields can be blank?
Would that improve the performance?
Or is there anything else I can do to get this to run faster?
Thanks

The time spent reading the string data from the file is greater than the time spent parsing it. You won't make significant time savings in the parsing of the string.
To make your program run faster, read bigger "chunks" into memory so that each read fetches more data. Memory-mapped files are also worth looking into.
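As a rough sketch of the "bigger chunks" idea, the function below pulls the whole file into one string with a single read() call. The name read_whole_file is made up for illustration, and this is a plain buffered read, not a memory-mapped file; it also assumes the file fits comfortably in memory.

```cpp
#include <fstream>
#include <string>

// Read an entire file into one string with a single read() call.
// Returns an empty string if the file cannot be opened or is empty.
// (read_whole_file is a hypothetical helper name, not a library function.)
std::string read_whole_file(const std::string& path)
{
    // Open at the end so tellg() reports the file size.
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in)
        return std::string();

    const std::streamoff size = in.tellg();
    if (size <= 0)
        return std::string();

    std::string buffer(static_cast<std::string::size_type>(size), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&buffer[0], size);

    // Keep only what was actually read.
    buffer.resize(static_cast<std::string::size_type>(in.gcount()));
    return buffer;
}
```

Whether this beats line-by-line getline() depends on the platform and the file size, so it is worth measuring before committing to it.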

An alternative that performs better is to read the whole file into a buffer, then walk through the buffer once, storing a pointer to where each value starts. Whenever you find a , or an end of line, overwrite it with a \0 so that each value becomes a null-terminated string.
e.g. https://code.google.com/p/csv-routine/
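A minimal sketch of that in-place approach, assuming the buffer already holds the file contents and that lines end in a plain \n (a \r from Windows line endings would stay attached to the last field). The function name split_in_place is invented here and is not taken from the linked library.

```cpp
#include <string>
#include <vector>

// Split a buffer holding CSV text in place: every ',' and '\n' is overwritten
// with '\0', and a pointer to the start of each field is recorded per row.
// The returned pointers refer into `buffer`, so it must outlive them and must
// not be resized afterwards. (split_in_place is a hypothetical name.)
std::vector<std::vector<const char*> > split_in_place(std::string& buffer)
{
    std::vector<std::vector<const char*> > rows;
    if (buffer.empty())
        return rows;
    if (buffer[buffer.size() - 1] != '\n')
        buffer.push_back('\n');            // make sure the last row is terminated

    std::vector<const char*> row;
    const char* field_start = &buffer[0];
    for (std::string::size_type i = 0; i < buffer.size(); ++i) {
        char& c = buffer[i];
        if (c != ',' && c != '\n')
            continue;
        const bool end_of_row = (c == '\n');
        c = '\0';                          // terminate the current field
        row.push_back(field_start);
        field_start = &buffer[0] + i + 1;  // next field starts after the separator
        if (end_of_row) {
            rows.push_back(row);
            row.clear();
        }
    }
    return rows;
}
```

A blank field comes back as an empty string (its first character is the \0), which the caller can treat as 0 for the numeric columns.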

Related

Picking a random line from a text file

I need to write an 8 ball code that has eleven options to display and it needs to pull from a text file. I have it taking lines from the text file but sometimes it takes an empty line with no writing. And I need it to only take a line that has writing.
Here are the options it needs to draw from:
Yes, of course!
Without a doubt, yes.
You can count on it.
For sure!
Ask me later.
I'm not sure.
I can't tell you right now.
I'll tell you after my nap.
No way!
I don't think so.
Without a doubt, no.
The answer is clearly NO.
string line;
int random = 0;
int numOfLines = 0;
ifstream File("file.txt");
srand(time(0));
random = rand() % 50;
while (getline(File, line))
{
++numOfLines;
if (numOfLines == random)
{
cout << line;
}
}
}
IMHO, you need to either make the text lines all the same length, or use a database (table) of file positions.
Using File Positions
Minimally, create a std::vector<std::streampos>.
Next read the lines from the file, recording the file position of the beginning of the string:
std::vector<std::streampos> text_line_positions;
std::string text;
std::streampos file_position = 0;
while (std::getline(text_file, text))
{
text_line_positions.push_back(file_position);
// Record the start position of the next line.
file_position = text_file.tellg();
}
To read a line from a file, get the file position from the database, then seek to it.
std::string text_line;
std::streampos file_position = text_line_positions[5];
text_file.clear(); // reset the eof/fail state left by the reading loop
text_file.seekg(file_position);
std::getline(text_file, text_line);
The expression, text_line_positions.size() will return the number of text lines in the file.
If File Fits In Memory
If the file fits in memory, you could use std::vector<std::string>:
std::string text_line;
std::vector<std::string> database;
while (getline(text_file, text_line))
{
database.push_back(text_line);
}
To print the 10th line from the file:
std::cout << "Line 10 from file: " << database[9] << std::endl;
The above techniques minimize the amount of reading from the file.

gzstream lib for C++ : corrupted file created

I want to read and write compressed file with my C++ script. For this purpose, I use the gzstream lib. It works fine with a very simple example like this :
string inFile="/path/inputfile.gz";
igzstream inputfile;
ogzstream outputfile("/path/outputfile.gz");
inputfile.open(inFile.c_str());
// Writing from input file to output file
string line;
while(getline(inputfile, line)) {
outputfile << line << endl;
}
But in my C++ script, things are more complicated and my output files are created within a dynamic vector.
For uncompressed files, this approach worked fine:
string inFile="/path/uncompressedInputFile.ext";
ifstream inputfile;
vector <ofstream *> outfiles(1);
string outputfile="/path/uncompressedOutputFile.ext";
outfiles[1] = new ofstream(outputfile.c_str());
inputfile.open(inFile.c_str());
string line;
while(getline(inputfile, line)) {
*outfiles[1] << line << endl;
}
With compressed files, the same approach produces corrupted files:
string inFile="/path/compressedFile.gz";
igzstream inputfile;
vector <ogzstream *> outfiles(1);
string outputfile="/path/compressedOutputFile.gz";
outfiles[1] = new ogzstream(outputfile.c_str());
inputfile.open(inFile.c_str());
string line;
while(getline(inputfile, line)) {
*outfiles[1] << line << endl;
}
I get a "compressedOutputFile.gz" in my path, and it is not empty, but when I try to uncompress it I get "unexpected end of file", which I guess means the file is corrupted.
What's wrong with it ? Can anyone please help me ?! :)
In the simple example, the GZip file is closed automatically when the ogzstream is destroyed, which flushes its remaining buffer to disk.
In the dynamic example, the stream is never closed, because the object is created on the heap and never deleted. That can lose data at the end of the file with both ofstream and ogzstream, depending on the format. Since GZip output is compressed, the missing tail is more likely to contain data the format depends on, so the failure is more obvious.
The best solution is to create a vector<unique_ptr<ogzstream> >, which causes the streams to be destroyed automatically when the vector goes out of scope. The less optimal solution is to remember to manually delete each pointer before exiting the function.
Edit: And as a quick note, as pointed out by @doctorlove in the original comments, you need to use the correct index (a vector of size 1 only has element 0), otherwise you're causing other issues.

Read number of lines, words, characters from a file

I can read the number of lines easily, using:
ifstream in(file);
string content;
while(getline(in, content))
{
// do stuff
}
Or I can read the number of words and characters easily using something like:
ifstream in(file);
string content;
int numOfCharacters = 0;
int numOfWords = 0;
while(in >> content)
{
++numOfWords;
numOfCharacters += content.size();
}
But I don't want to read the file twice. How can I read the file once, and find out the number of lines, words and characters?
PS: I would welcome a Boost suggestion, if there is an easy way.
Thank you.
Read the file line by line and, for each line, count the words. See stringstream for the second part.
(I'm not giving more information; this looks too much like homework.)
This could be done with a trivial boost.spirit.qi parser.
Sticking with the iostreams solution: you could create a std::istringstream out of each line read via getline() (std::strstream would also work but is deprecated), and do the word/char counting operations on it, accumulating across all the lines.
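A possible single-pass version of that idea is sketched below. The struct and function names are made up, and "characters" is counted the same way as in the question's loop, that is, only the characters inside words, not the whitespace between them.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Totals gathered in one pass over the input. (Hypothetical type name.)
struct TextCounts {
    std::size_t lines;
    std::size_t words;
    std::size_t characters;   // characters inside words only, as in the question
};

// Read the stream once, line by line, and count words within each line.
TextCounts count_text(std::istream& in)
{
    TextCounts counts = {0, 0, 0};
    std::string line;
    while (std::getline(in, line)) {
        ++counts.lines;
        std::istringstream words_in_line(line);
        std::string word;
        while (words_in_line >> word) {
            ++counts.words;
            counts.characters += word.size();
        }
    }
    return counts;
}
```

If whitespace should count as characters too, adding line.size() per line instead of word.size() per word would be the obvious variation.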

Tokenization of a text file with frequency and line occurrence. Using C++

Once again I ask for help. I haven't coded anything for some time!
Now I have a text file filled with random gibberish. I already have a basic idea on how I will count the number of occurrences per word.
What really stumps me is how I will determine what line the word is in. Gut instinct tells me to look for the newline character at the end of each line. However, I have to do this while going through the text file the first time, right? If I do it afterwards it will do no good.
I already am getting the words via the following code:
vector<string> words;
string currentWord;
while(!inputFile.eof())
{
inputFile >> currentWord;
words.push_back(currentWord);
}
This is for a text file with no set structure. Using the above code gives me a nice little(big) vector of words, but it doesn't give me the line they occur in.
Would I have to get the entire line, then process it into words to make this possible?
Use a std::map<std::string, int> to count the word occurrences -- the int is the number of times it exists.
If you need line-by-line input, use std::getline(std::istream&, std::string&), like this:
std::vector<std::string> lines;
std::ifstream file(...); //Fill in accordingly.
std::string currentLine;
while(std::getline(file, currentLine))
lines.push_back(currentLine);
You can split a line apart by putting it into a std::istringstream first and then using operator>>. (Alternatively, you could cobble up some sort of splitter using std::find and other algorithmic primitives.)
EDIT: This is the same thing as in @dash-tom-bang's answer, but modified to be correct with respect to error handling:
vector<pair<string, int> > words; // each word paired with the line it was found on
int currentLine = 1; // or 0, however you wish to count...
string line;
while (getline(inputFile, line))
{
istringstream inputString(line);
string word;
while (inputString >> word)
words.push_back(make_pair(word, currentLine));
++currentLine; // move on to the next line number
}
Short and sweet.
vector< map< string, size_t > > line_word_counts;
string line, word;
while ( getline( cin, line ) ) {
line_word_counts.push_back( map< string, size_t >() ); // one empty map per line
map< string, size_t > &word_counts = line_word_counts.back();
istringstream line_is( line );
while ( line_is >> word ) ++ word_counts[ word ];
}
cout << "'Hello' appears on line 5 " << line_word_counts[5-1]["Hello"]
<< " times\n";
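If both the frequency and the line numbers are needed per word, one way to combine the answers above is a map from each word to the list of lines it occurs on, sketched below. The names WordIndex and build_word_index are invented for the example, and lines are numbered from 1.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// word -> line numbers (1-based), one entry per occurrence.
// The frequency of a word is simply the size of its vector.
typedef std::map<std::string, std::vector<std::size_t> > WordIndex;

// Read the stream line by line and record where every word occurs.
WordIndex build_word_index(std::istream& in)
{
    WordIndex index;
    std::string line;
    std::size_t line_number = 0;
    while (std::getline(in, line)) {
        ++line_number;
        std::istringstream words_in_line(line);
        std::string word;
        while (words_in_line >> word)
            index[word].push_back(line_number);
    }
    return index;
}
```

A word that appears twice on one line gets that line number twice, so the vector's size is the total count; a std::set could be used instead if only distinct lines matter.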
You're going to have to abandon reading into strings, because operator >>(istream&, string&) discards white space and the contents of the white space (== '\n' or != '\n', that is the question...) is what will give you line numbers.
This is where OOP can save the day. You need to write a class to act as a "front end" for reading from the file. Its job will be to buffer data from the file, and return words one at a time to the caller.
Internally, the class needs to read data from the file a block (say, 4096 bytes) at a time. Then a string GetWord() (yes, returning by value here is good) method will:
First, read any white space characters, taking care to increment the object's lineNumber member every time it hits a \n.
Then read non-whitespace characters, putting them into the string object you'll be returning.
If it runs out of stuff to read, read the next block and continue.
If you hit the end of file, the string you have is the whole word (which may be empty) and should be returned.
If the function returns an empty string, that tells the caller that the end of file has been reached. (Files usually end with whitespace characters, so reading whitespace characters cannot imply that there will be a word later on.)
Then you can call this method at the same place in your code as your cin >> line and the rest of the code doesn't need to know the details of your block buffering.
An alternative approach is to read things a line at a time, but all the read functions that would work for you require you to create a fixed-size buffer to read into beforehand, and if the line is longer than that buffer, you have to deal with it somehow. It could get more complicated than the class I described.

C++ length of file and vectors

Hi I have a file with some text in it. Is there some easy way to get the number of lines in the file without traversing through the file?
I also need to put the lines of the file into a vector. I am new to C++ but I think vector is like ArrayList in java so I wanted to use a vector and insert things into it. So how would I do it?
Thanks.
There is no way of finding the number of lines in a file without reading it. To read all lines:
1) create a std::vector of std::string
2) open a file for input
3) read a line as a std::string using getline()
4) if the read failed, stop
5) push the line into the vector
6) goto 3
You would need to traverse the file to detect the number of lines (or at least call a library method that traverses the file).
Here is a sample code for parsing text file, assuming that you pass the file name as an argument, by using the getline method:
#include <string>
#include <vector>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[])
{
std::vector<std::string> lines;
std::string line;
lines.clear();
// open the desired file for reading
std::ifstream infile (argv[1], std::ios_base::in);
// read each line individually (watch out for Windows new lines)
while (getline(infile, line, '\n'))
{
// add line to vector
lines.push_back (line);
}
// do anything you like with the vector. Output the size for example:
std::cout << "Read " << lines.size() << " lines.\n";
return 0;
}
Update: The code could fail for many reasons (e.g. file not found, concurrent modifications to file, permission issues, etc). I'm leaving that as an exercise to the user.
1) No way to find number of lines without reading the file.
2) Take a look at getline function from the C++ Standard Library. Something like:
string line;
fstream file;
vector <string> vec;
...
while (getline(file, line)) vec.push_back(line);
Traversing the file is fundamentally required to determine the number of lines, regardless of whether you do it or some library routine does it. New lines are just another character, and the file must be scanned one character at a time in its entirety to count them.
Since you have to read the lines into a vector anyways, you might as well combine the two steps:
// Read lines from input stream in into vector out
// Return the number of lines read
int getlines(std::vector<std::string>& out, std::istream& in = std::cin) {
out.clear(); // remove any data in vector
std::string buffer;
while (std::getline(in, buffer))
out.push_back(buffer);
// return number of lines read
return out.size();
}