Fast, Simple CSV Parsing in C++ - c++

I am trying to parse a simple CSV file, with data in a format such as:
20.5,20.5,20.5,0.794145,4.05286,0.792519,1
20.5,30.5,20.5,0.753669,3.91888,0.749897,1
20.5,40.5,20.5,0.701055,3.80348,0.695326,1
So, a very simple and fixed format file. I am storing each column of this data into a STL vector. As such I've tried to stay the C++ way using the standard library, and my implementation within a loop looks something like:
string field;
getline(file,line);
stringstream ssline(line);
getline( ssline, field, ',' );
stringstream fs1(field);
fs1 >> cent_x.at(n);
getline( ssline, field, ',' );
stringstream fs2(field);
fs2 >> cent_y.at(n);
getline( ssline, field, ',' );
stringstream fs3(field);
fs3 >> cent_z.at(n);
getline( ssline, field, ',' );
stringstream fs4(field);
fs4 >> u.at(n);
getline( ssline, field, ',' );
stringstream fs5(field);
fs5 >> v.at(n);
getline( ssline, field, ',' );
stringstream fs6(field);
fs6 >> w.at(n);
The problem is, this is extremely slow (there are over 1 million rows per data file), and seems to me to be a bit inelegant. Is there a faster approach using the standard library, or should I just use stdio functions? It seems to me this entire code block would reduce to a single fscanf call.
Thanks in advance!

Using 7 string streams when you can do it with just one sure doesn't help wrt. performance.
Try this instead:
string line;
getline(file, line);
istringstream ss(line); // note we use istringstream, we don't need the o part of stringstream
char c1, c2, c3, c4, c5; // to eat the commas
ss >> cent_x.at(n) >> c1 >>
cent_y.at(n) >> c2 >>
cent_z.at(n) >> c3 >>
u.at(n) >> c4 >>
v.at(n) >> c5 >>
w.at(n);
If you know the number of lines in the file, you can resize the vectors prior to reading and then use operator[] instead of at(). This way you avoid bounds checking and thus gain a little performance.

I believe the major bottleneck (put aside the getline()-based non-buffered I/O) is the string parsing. Since you have the "," symbol as a delimiter, you may perform a linear scan over the string and replace all "," by "\0" (the end-of-string marker, zero-terminator).
Something like this:
// tmp array for the line part values
double parts[MAX_PARTS];
while(getline(file, line))
{
size_t len = line.length();
size_t j;
if(line.empty()) { continue; }
const char* last_start = &line[0];
int num_parts = 0;
while(j < len)
{
if(line[j] == ',')
{
line[j] = '\0';
if(num_parts == MAX_PARTS) { break; }
parts[num_parts] = atof(last_start);
j++;
num_parts++;
last_start = &line[j];
}
j++;
}
/// do whatever you need with the parts[] array
}

I don't know if this will be quicker than the accepted answer, but I might as well post it anyway in case you wish to try it.
You can load in the entire contents of the file using a single read call by knowing the size of the file using some fseek magic. This will be much faster than multiple read calls.
You could then do something like this to parse your string:
//Delimited string to vector
vector<string> dstov(string& str, string delimiter)
{
//Vector to populate
vector<string> ret;
//Current position in str
size_t pos = 0;
//While the the string from point pos contains the delimiter
while(str.substr(pos).find(delimiter) != string::npos)
{
//Insert the substring from pos to the start of the found delimiter to the vector
ret.push_back(str.substr(pos, str.substr(pos).find(delimiter)));
//Move the pos past this found section and the found delimiter so the search can continue
pos += str.substr(pos).find(delimiter) + delimiter.size();
}
//Push back the final element in str when str contains no more delimiters
ret.push_back(str.substr(pos));
return ret;
}
string rawfiledata;
//This call will parse the raw data into a vector containing lines of
//20.5,30.5,20.5,0.753669,3.91888,0.749897,1 by treating the newline
//as the delimiter
vector<string> lines = dstov(rawfiledata, "\n");
//You can then iterate over the lines and parse them into variables and do whatever you need with them.
for(size_t itr = 0; itr < lines.size(); ++itr)
vector<string> line_variables = dstov(lines[itr], ",");

std::ifstream file{ InputFilename };
std::vector<std::string> line_elements;
for (std::string line; std::getline(file, line);)
{
line_elements.clear();
std::istringstream ss(line);
for (std::string value; std::getline(ss, value, ',');)
{
line_elements.push_back(std::move(value));
}
// Do something with the line_elements.
}

Related

How to read input from file and pair into a map in C++

I am trying to read through a text file that can possibly look like below.
HI bye
goodbye
foo bar
boy girl
one two three
I am trying to take the lines with only two words and store them in a map, the first word would be the key and second word would be the value.
below is the code I came up with but I can't figure out how to ignore the lines that do not have two words on them.
this only works properly if every line has two words. I understand why this is only working if every line has two words but, I'm not sure what condition I can add to prevent this.
pair myPair;
map myMap;
while(getline(file2, line, '\0'))
{
stringstream ss(line);
string word;
while(!ss.eof())
{
ss >> word;
myPair.first = word;
ss >> word;
myPair.second = word;
myMap.insert(myPair);
}
}
map<string, string>::iterator it=myMap.begin();
for(it=myMap.begin(); it != myMap.end(); it++)
{
cout<<it->first<<" "<<it->second<<endl;
}
Read two words into a temporary pair. If you can't, do not add the pair to the map. If you can read two words, see if you can read a third word. If you can, you have too many words on the line. Do not add.
Example:
while(getline(file2, line, '\0'))
{
stringstream ss(line);
pair<string,string> myPair;
string junk;
if (ss >> myPair.first >> myPair.second && !(ss >> junk))
{ // successfully read into pair, but not into a third junk variable
myMap.insert(myPair);
}
}
let me suggest a little different implementation
std::string line;
while (std::getline(infile, line)) {
// Vector of string to save tokens
vector <string> tokens;
// stringstream class check1
stringstream check1(line);
string intermediate;
// Tokenizing w.r.t. space ' '
while(getline(check1, intermediate, ' ')) {
tokens.push_back(intermediate);
}
if (tokens.size() == 2) {
// your condition of 2 words in a line apply
// process 1. and 2. item of vector here
}
}
You can use fscanf for take input from file and sscanf for take input from string with format. sscanf return how many input successfully take with given format. so you can easily check, how many word have a line.
#include<stdio.h>
#include<stdlib.h>
#include <iostream>
using namespace std;
int main()
{
char line[100];
FILE *fp = fopen("inp.txt", "r");
while(fscanf(fp, " %[^\n]s", line) == 1)
{
cout<<line<<endl;
char s1[100], s2[100];
int take = sscanf(line, "%s %s", s1, s2);
cout<<take<<endl;
}
return 0;
}

Reading a file into memory C++: Is there a getline() for std::strings

I was asked to update my code that reads in a text file and parses it for specific strings.
Basically instead of opening the text file every time, I want to read the text file into memory and have it for the duration of the object.
I was wondering if there was a similar function to getline() I could use for a std::string like i can for a std::ifstream.
I realize I could just use a while/for loop but I am curious if there is some other way. Here is what I am currently doing:
file.txt: (\n represents a newline )
file.txt
My Code:
ifstream file("/tmp/file.txt");
int argIndex = 0;
std::string arg,line,substring,whatIneed1,whatIneed2;
if(file)
{
while(std::getline(file,line))
{
if(line.find("3421",0) != string::npos)
{
std::getline(file,line);
std::getline(file,line);
std::stringstream ss1(line);
std::getline(file,line);
std::stringstream ss2(line);
while( ss1 >> arg)
{
if( argIndex==0)
{
whatIneed1 = arg;
}
argIndex++;
}
argIndex=0;
while( ss2 >> arg)
{
if( argIndex==0)
{
whatIneed2 = arg;
}
argIndex++;
}
argIndex=0;
}
}
}
Where at the end whatIneed1=="whatIneed1" and whatIneed2=="whatIneed2".
Is there a way to do this with storing file.txt in a std::string instead of a std::ifstream asnd using a function like getline()? I like getline() because it makes getting the next line of the file that much easier.
If you've already read the data into a string, you can use std::stringstream to turn it into a file-like object compatible with getline.
std::stringstream ss;
ss.str(file_contents_str);
std::string line;
while (std::getline(ss, line))
// ...
Rather than grab a line then try to extract one thing from it, why not extract the one thing, then discard the line?
std::string whatIneed1, whatIneed2, ignored;
if(ifstream file("/tmp/file.txt"))
{
for(std::string line; std::getline(file,line);)
{
if(line.find("3421",0) != string::npos)
{
std::getline(file, ignored);
file >> whatIneed1;
std::getline(file, ignored);
file >> whatIneed2;
std::getline(file, ignored);
}
}
}

Reading in file with delimiter

How do I read in lines from a file and assign specific segments of that line to the information in structs? And how can I stop at a blank line, then continue again until end of file is reached?
Background: I am building a program that will take an input file, read in information, and use double hashing for that information to be put in the correct index of the hashtable.
Suppose I have the struct:
struct Data
{
string city;
string state;
string zipCode;
};
But the lines in the file are in the following format:
20
85086,Phoenix,Arizona
56065,Minneapolis,Minnesota
85281
56065
Sorry but I still cannot seem to figure this out. I am having a really hard time reading in the file. The first line is basically the size of the hash table to be constructed. The next blank line should be ignored. Then the next two lines are information that should go into the struct and be hashed into the hash table. Then another blank line should be ignored. And finally, the last two lines are input that need to be matched to see if they exist in the hash table or not. So in this case, 85281 is not found. While 56065 is found.
As the other two answers point out you have to use std::getline, but this is how I would do it:
if (std::getline(is, zipcode, ',') &&
std::getline(is, city, ',') &&
std::getline(is, state))
{
d.zipCode = std::stoi(zipcode);
}
The only real change I made is that I encased the extractions within an if statement so you can check if these reads succeeded. Moreover, in order for this to be done easily (you wouldn't want to type the above out for every Data object), you can put this inside a function.
You can overload the >> operator for the Data class like so:
std::istream& operator>>(std::istream& is, Data& d)
{
std::string zipcode;
if (std::getline(is, zipcode, ',') &&
std::getline(is, d.city, ',') &&
std::getline(is, d.state))
{
d.zipCode = std::stoi(zipcode);
}
return is;
}
Now it becomes as simple as doing:
Data d;
if (std::cin >> d)
{
std::cout << "Yes! It worked!";
}
You can use a getline function from <string> like this:
string str; // This will store your tokens
ifstream file("data.txt");
while (getline(file, str, ',') // You can have a different delimiter
{
// Process your data
}
You can also use stringstream:
stringstream ss(line); // Line is from your input data file
while (ss >> str) // str is to store your token
{
// Process your data here
}
It's just a hint. Hope it helps you.
All you need is function std::getline
For example
std::string s;
std::getline( YourFileStream, s, ',' );
To convert a string to int you can use function std::stoi
Or you can read a whole line and then use std::istringstream to extract each data with the same function std::getline. For example
Data d = {};
std::string line;
std::getline( YourFileStream, line );
std::istringstream is( line );
std::string zipCode;
std::getline( is, zipCode, ',' );
d.zipCode = std::stoi( zipCode );
std::getline( is, d.city, ',' );
std::getline( is, d.state, ',' );

c++ parsing lines in file as streams

I want to parse a file which describes a set of data line by line. Each datum consists of 3 or four parameters: int int float (optional) string.
I opened file as ifstream inFile and used it in a while loop
while (inFile) {
string line;
getline(inFile,line);
istringstream iss(line);
char strInput[256];
iss >> strInput;
int i = atoi(strInput);
iss >> strInput;
int j = atoi(strInput);
iss >> strInput;
float k = atoi(strInput);
iss >> strInput;
cout << i << j << k << strInput << endl;*/
}
The problem is that the last parameter is optional, so I'll probably run into errors when it is not present. How can i check in advance how many parameters are given for each datum?
Furthermore,
string line;
getline(inFile,line);
istringstream iss(line);
seems a bit reduldant, how could I simplyfiy it?
Use the idiomatic approach in this situation, and it becomes much simpler:
for (std::string line; getline(inFile, line); ) {
std::istringstream iss(line);
int i;
int j;
float k;
if (!(iss >> i >> j)) {
//Failed to extract the required elements
//This is an error
}
if (!(iss >> k)) {
//Failed to extract the optional element
//This is not an error -- you just don't have a third parameter
}
}
By the way, atoi has some highly undesired ambiguity unless 0 is not a possible value for the string you're parsing. Since atoi returns 0 when it errors, you cannot know if a return value of 0 is a successful parsing of a string with a value of 0, or if it's an error unless you do some rather laborious checking on the original string you had it parse.
Try to stick with streams, but in situations where you do need to fall back to atoi type functionality, go with the strtoX family of functions (strtoi, strtol, strtof, etc). Or, better yet, if you're using C++11, use the stoX family of functions.
You could use a string tokenizer How do I tokenize a string in C++?
In particular: https://stackoverflow.com/a/55680/2436175
Side note: you do not need to use atoi, you could simply do:
int i,j;
iss >> i >> j;
(but this wouldn't handle alone the problem of optional elements)

C++ split string

I am trying to split a string using spaces as a delimiter. I would like to store each token in an array or vector.
I have tried.
string tempInput;
cin >> tempInput;
string input[5];
stringstream ss(tempInput); // Insert the string into a stream
int i=0;
while (ss >> tempInput){
input[i] = tempInput;
i++;
}
The problem is that if i input "this is a test", the array only seems to store input[0] = "this". It does not contain values for input[2] through input[4].
I have also tried using a vector but with the same result.
Go to the duplicate questions to learn how to split a string into words, but your method is actually correct. The actual problem lies in how you are reading the input before trying to split it:
string tempInput;
cin >> tempInput; // !!!
When you use the cin >> tempInput, you are only getting the first word from the input, not the whole text. There are two possible ways of working your way out of that, the simplest of which is forgetting about the stringstream and directly iterating on input:
std::string tempInput;
std::vector< std::string > tokens;
while ( std::cin >> tempInput ) {
tokens.push_back( tempInput );
}
// alternatively, including algorithm and iterator headers:
std::vector< std::string > tokens;
std::copy( std::istream_iterator<std::string>( std::cin ),
std::istream_iterator<std::string>(),
std::back_inserter(tokens) );
This approach will give you all the tokens in the input in a single vector. If you need to work with each line separatedly then you should use getline from the <string> header instead of the cin >> tempInput:
std::string tempInput;
while ( getline( std::cin, tempInput ) ) { // read line
// tokenize the line, possibly with your own code or
// any answer in the 'duplicate' question
}
Notice that it’s much easier just to use copy:
vector<string> tokens;
copy(istream_iterator<string>(cin),
istream_iterator<string>(),
back_inserter(tokens));
As for why your code doesn’t work: you’re reusing tempInput. Don’t do that. Furthermore, you’re first reading a single word from cin, not the whole string. That’s why only a single word is put into the stringstream.
The easiest way: Boost.Tokenizer
std::vector<std::string> tokens;
std::string s = "This is, a test";
boost::tokenizer<> tok(s);
for(boost::tokenizer<>::iterator it=tok.begin(); it != tok.end(); ++it)
{
tokens.push_back(*it);
}
// tokens is ["This", "is", "a", "test"]
You can parameter the delimiters and escape sequences to only take spaces if you wish, by default it tokenize on both spaces and punctuation.
Here a little algorithm where it splits the string into a list just like python does.
std::list<std::string> split(std::string text, std::string split_word) {
std::list<std::string> list;
std::string word = "";
int is_word_over = 0;
for (int i = 0; i <= text.length(); i++) {
if (i <= text.length() - split_word.length()) {
if (text.substr(i, split_word.length()) == split_word) {
list.insert(list.end(), word);
word = "";
is_word_over = 1;
}
//now we want that it jumps the rest of the split character
else if (is_word_over >= 1) {
if (is_word_over != split_word.length()) {
is_word_over += 1;
continue;
}
else {
word += text[i];
is_word_over = 0;
}
}
else {
word += text[i];
}
}
else {
word += text[i];
}
}
list.insert(list.end(), word);
return list;
}
There probably exists a more optimal way to write this.