Read comma separated values with stray whitespaces from a textfile in c++ - c++

I have a file that contains string,int,int values in multiple lines.
Delhi,12,13
Mumbai,100 , 101
Kolkata,11, 12
The values are separated by commas, but there can be stray whitespace in between. My current code is this:
#include<cstdio>
#include<iostream>
#include<string>
using namespace std;
int main()
{
FILE *f = fopen("input.txt","r");
int lines = 0;
char c = getc(f);
while(c != EOF)
{
if(c == '\n')
{
lines++;
}
c = getc(f);
}
lines++;
string arr[lines];
int t1[lines];
int t2[lines];
char s1[100],s2[100],s3[100];
int x,y;
fclose(f);
f = fopen("input.txt","r");
while (fscanf(f,"%99[^,],%99[^,],%99[^,]", s1, s1, s2)==3)
{
cout << s1 << s2 << s3 << endl;
}
}
This doesn't read the values properly or display them on the screen. How do I read the string and the integer values here (which may have stray whitespace) and store them into arrays (three arrays, to be precise)?

Try doing this:
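fscanf(f,"%99[^, ]%*[ ,]%d%*[ ,]%d ", s1, &x, &y);
%99[^, ] => reads everything up to the next , or <space> (at most 99 characters, so it fits in s1) and stores it in s1
%*[ ,] => matches a run of , and <space> but does not store it anywhere (the * ensures that)
%d => reads a number and stores it
The trailing space in the format skips any remaining whitespace, including the newline, before the next record.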
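To make the format string above easier to try out, here is a minimal sketch that applies it to one line at a time with sscanf instead of fscanf. The Row struct and the parseRow name are made up for illustration; they are not part of the original code.

```cpp
#include <cstdio>
#include <string>

// Hypothetical record type for one "name,int,int" line.
struct Row {
    std::string city;
    int a;
    int b;
};

// Parses one line; returns false if the line does not match the pattern.
bool parseRow(const char* line, Row& out) {
    char name[100];
    int x = 0, y = 0;
    // Leading space: skip initial whitespace.
    // %99[^, ]: the name, bounded to the buffer size.
    // %*[ ,]:   a run of commas/spaces, matched but not stored.
    if (std::sscanf(line, " %99[^, ]%*[ ,]%d%*[ ,]%d", name, &x, &y) != 3) {
        return false;
    }
    out.city = name;
    out.a = x;
    out.b = y;
    return true;
}
```

One limitation to be aware of: because the name stops at the first space, a name that itself contains a space (say "New Delhi") is rejected by this pattern.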

The problem is on this line:
while (fscanf(f,"%99[^,],%99[^,],%99[^,]", s1, s1, s2)==3)
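The third %99[^,] scans up to the next comma character ',', which occurs on the next line. Replace it with %99[^\n] to fix this problem. Note also that the arguments should be s1, s2, s3 (the original passes s1 twice and never fills s3), and that a leading space in the format skips the newline left over from the previous line:
while (fscanf(f," %99[^,],%99[^,],%99[^\n]", s1, s2, s3)==3)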

Why are you using FILE* and friends in C++?
The other answers specify the problem with your code, so I'm writing this answer to show you how to improve it.
std::ifstream file("input.txt");
std::string name, value0, value1;
while (std::getline(file, name, ',')) {
// Get the value strings from the stream.
std::getline(file, value0, ',');
std::getline(file, value1); // the last field ends at the newline, not at a comma
// These will throw an exception when given invalid input.
int v0 = std::stoi(value0);
int v1 = std::stoi(value1);
// Do stuff with the strings
}
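std::getline can be used to extract a string from a stream up until a certain delimiter. It does not skip whitespace itself, but std::stoi ignores leading whitespace and stops at the first character that is not part of the number, so stray spaces around the values don't matter here. The return value of std::getline is the stream passed in, and it has an operator bool() that allows us to use it as a boolean expression. The value will become false when a read fails, for example at the end of the file, or when the stream is in some other erroneous state.
Note that the above is similar in behavior to the following, except that this version tests the stream before the read rather than after it, so the body can run once more with a failed read: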
while (file) {
std::getline(file, name, ',');
// ...
}
I'm pretty sure this must be a whole lot more readable than a string like "%99[^,],%99[^,],%99[^,]".
Cheers~
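For what it's worth, here is a minimal sketch of the same getline idea done line by line, so that a malformed line cannot bleed into the next one. The CityRow struct and the readRows name are hypothetical, and skipping bad lines silently is just one possible policy.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record type for one "name,int,int" line.
struct CityRow {
    std::string name;
    int first;
    int second;
};

std::vector<CityRow> readRows(std::istream& in) {
    std::vector<CityRow> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string name, v0, v1;
        fields >> std::ws;  // drop whitespace before the name
        if (!std::getline(fields, name, ',') ||
            !std::getline(fields, v0, ',') ||
            !std::getline(fields, v1)) {
            continue;  // blank or incomplete line
        }
        // Drop whitespace after the name.
        name.erase(name.find_last_not_of(" \t\r") + 1);
        try {
            // std::stoi skips leading whitespace and stops after the digits.
            rows.push_back({name, std::stoi(v0), std::stoi(v1)});
        } catch (const std::exception&) {
            // Non-numeric value: skip this line.
        }
    }
    return rows;
}
```

Since it takes a std::istream&, the same function works with a std::ifstream opened on input.txt or with a std::istringstream in a test.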

Related

How to add comma separated list of double values into a vector? [duplicate]


Reading In Multiple Data types from a .txt file where one of the strings has spaces C++

I have a text file that looks like this:
Car, CN, 819481, maintenance, false, NONE
Car, SLSF, 46871, business,true, Memphis
Car, AOK, 156, tender, true, San Francisco
(the commas are tabs in actuality, but I was unable to get them to format properly on this site)
I have an object called Car which I am reading the data into and printing with the output() call at the bottom of the code. My current code can read in all of the first 5 data types, but I am having trouble with reading in the last column, where there can be spaces. I have tried using getline, but to no avail.
Here is the code that I have for the function that takes the txt as inputs
void input()
{
ifstream inputFile;
inputFile.open("input.txt",fstream::in);
if (inputFile.fail())
{
cout<<"input failed"<<endl;
exit(1);
}
string type;
string reportingMark;
int carNumber;
string kind;
bool loaded;
string destination;
while(inputFile.peek() != EOF)
{
inputFile>>type>>reportingMark>>carNumber>>kind>>loaded;
while(inputFile.peek() == ' ')
inputFile.get();
getline(inputFile, destination);
Car temp(reportingMark, carNumber, kind, loaded, destination);
temp.output();
}
inputFile.close();
}
Don't use >> operator, use getline:
string line;
while (getline(inputFile, line)) {
// split line by tabs or commas
}
Example split function:
vector<string> explode(const string &str, char separator) {
vector<string> result;
string tmp;
for (size_t i = 0; i < str.size(); i++) {
if (str[i] == separator) {
result.push_back(tmp);
tmp.clear();
} else tmp += str[i];
}
if (tmp.size() > 0)
result.push_back(tmp);
return result;
}
I hope that std::vector isn't hard for you. Example loading code (instead of while(inputFile.peek() != EOF) { ... }):
string line;
while (getline(inputFile, line)) {
vector<string> data = explode(line, '\t'); // Or use ASCII code 9
if (data.size() != 6) { // type, reportingMark, carNumber, kind, loaded, destination
cout << "Invalid line!" << endl;
continue;
}
// data[0] is the type ("Car"). The bool is compared as text, which assumes no padding around it.
Car temp(data[1], stoi(data[2]), data[3], data[4] == "true", data[5]);
temp.output();
}
Don't copy-paste this code as-is: it does not trim stray whitespace from the fields or handle a failed stoi conversion.
The STL function std::getline accepts a delimiter as a third argument which means you can pass a \t to read tab separated values. For the last value, just read it without specifying the delimiter which will mean the overloaded version of the function will be called where the delimiter is \n.
Read one line at a time, separate into tokens, convert to required datatype.
Here is a working example. It can be improved though. Also you should handle exceptions thrown by std::stoi during std::string to int conversion.
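A minimal sketch of the delimiter-based approach, assuming the fields really are tab-separated and that the boolean column is spelled exactly "true" or "false". CarRecord and parseCar are made-up names for illustration; the real Car class from the question is not shown, so a plain struct stands in for it.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the asker's Car class.
struct CarRecord {
    std::string type;
    std::string reportingMark;
    int carNumber;
    std::string kind;
    bool loaded;
    std::string destination;
};

// Parses one tab-separated line; returns false on a malformed line.
bool parseCar(const std::string& line, CarRecord& out) {
    std::istringstream fields(line);
    std::string number, loaded;
    if (!std::getline(fields, out.type, '\t') ||
        !std::getline(fields, out.reportingMark, '\t') ||
        !std::getline(fields, number, '\t') ||
        !std::getline(fields, out.kind, '\t') ||
        !std::getline(fields, loaded, '\t')) {
        return false;
    }
    // No delimiter given: take the rest of the line, spaces included.
    std::getline(fields, out.destination);
    try {
        out.carNumber = std::stoi(number);
    } catch (const std::exception&) {
        return false;  // carNumber was not numeric
    }
    out.loaded = (loaded == "true");
    return true;
}
```

Calling this once per line read with getline(inputFile, line) keeps a bad line from affecting the ones after it.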

comma separated stream into struct

I have a structure with an int and two strings. When reading in the file, it is comma separated for the first two values and the last value is terminated by a newline. The third argument could be empty, however.
ex data: 7, john doe, 123-456-7891 123 fake st.
I want to make it so that my program will grab the first number and put it in the int, find the comma and put the second value in the struct's string, etc.
First question is should I use a class instead? I have seen the getline(stream, myString, ','); but my arguments are different data types so I can't just throw them all into a vector.
my code:
struct Person{
int id;//dont care if this is unique
string name;
string extraInfo;
};
int main(int argc, char* argv[]){
assert( argc ==2 && "Invalid number of command line arguments");
ifstream inputFile (argv[1]);
assert( inputFile.is_open() && "Unable to open file");
}
What is the best way of storing this information and retrieving it from a file that is comma separated for the first two and ends with a newline? I also want the program to ignore blank lines in the file.
I'd read the file line-by-line using normal getline(). Then, put it into a stringstream for further parsing or use string's find() functions to split the text manually.
Some more notes:
I don't understand your first question about using a class. If you mean for Person, then the answer is that it doesn't matter.
Using assert for something you don't have control over is wrong, like argc. This should only be used to verify that you didn't make a programming error. Also, if you #define NDEBUG, the asserts are all gone, so they shouldn't really be part of your program logic. Throw std::runtime_error("failed to open file") instead.
You probably don't want the double quotes in your strings. Also, you might want "a,b" to not be split by the comma. Make sure you have tests that assert the required functionality.
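As a rough illustration of the point about assert, the file check could be moved into a small helper that throws. The openInput name is hypothetical, and returning the stream by value relies on C++11 move support for std::ifstream.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: opens the file or throws, so the check is still
// there when NDEBUG is defined and assert() compiles to nothing.
std::ifstream openInput(const std::string& path) {
    std::ifstream in(path);
    if (!in.is_open()) {
        throw std::runtime_error("failed to open file: " + path);
    }
    return in;
}
```

main can then catch std::runtime_error, print what(), and return a non-zero exit code.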
You can still use the getline approach for tokenising a line, but you first have to read the line:
vector<Person> people;
string line;
int lineNum = 0;
while( getline(inputFile, line) )
{
istringstream iss(line);
lineNum++;
// Try to extract person data from the line. If successful, ok will be true.
Person p;
bool ok = false;
do {
string val;
if( !getline(iss, val, ',') ) break;
p.id = strtol( val.c_str(), NULL, 10 );
if( !getline(iss, p.name, ',') ) break;
getline(iss, p.extraInfo); // rest of the line; may be empty, so a failure here is not an error
// Now you can trim the name and extraInfo strings to remove spaces and quotes
//[todo]
ok = true;
} while(false);
// If all is well, add the person to our people-vector.
if( ok ) {
people.push_back(p);
} else {
cout << "Failed to parse line " << lineNum << ": " << line << endl;
}
}
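The trimming step left as a todo in the loop above could look roughly like this. It is a sketch that assumes only spaces, tabs, carriage returns, newlines and double quotes need to go, and only at the two ends of the field; trimField is a made-up name.

```cpp
#include <string>

// Strips whitespace and double quotes from both ends of a field.
// Characters in the middle of the field are left untouched.
std::string trimField(const std::string& s) {
    const char* drop = " \t\r\n\"";
    const std::size_t first = s.find_first_not_of(drop);
    if (first == std::string::npos) {
        return std::string();  // nothing but whitespace/quotes
    }
    const std::size_t last = s.find_last_not_of(drop);
    return s.substr(first, last - first + 1);
}
```

It would be applied right after the fields are read, e.g. p.name = trimField(p.name);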
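Once you get the line in a string using getline, use strtok. Note that strtok modifies its buffer, so copy the std::string into a writable char array first.
char myline[] = "7, john doe, 123-456-7891 123 fake st.";
char *token = strtok(myline, ",");
while(token)
{
//store token in your struct values here
token = strtok(NULL, ","); // advance to the next token
}
You'll need #include <cstring> (or <string.h>) to use strtok.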

How to let std::istringstream treat a given character as white space?

Using std::istringstream it is easy to read words separated by white space. But to parse the following line, I need the character / to be treated like white space.
f 104/387/104 495/574/495 497/573/497
How can I read values separated by either slash or white space?
One way is to define a ctype facet that classifies / as white-space:
class my_ctype : public std::ctype<char> {
public:
static mask const *get_table() { // static: it is called before the base class is constructed
static std::vector<std::ctype<char>::mask>
table(classic_table(), classic_table()+table_size);
table['/'] = (mask)space;
return &table[0];
}
my_ctype(size_t refs=0) : std::ctype<char>(get_table(), false, refs) { }
};
From there, imbue the stream with a locale using that ctype facet, then read words:
int main() {
std::string input("f 104/387/104 495/574/495 497/573/497");
std::istringstream s(input);
s.imbue(std::locale(std::locale(), new my_ctype));
std::copy(std::istream_iterator<std::string>(s),
std::istream_iterator<std::string>(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}
If boost is available, then boost::split() would be a possible solution. Populate a std::string using std::getline() and then split the line:
#include <iostream>
#include <vector>
#include <string>
#include <boost/algorithm/string.hpp>
#include <boost/algorithm/string/split.hpp>
int main()
{
std::vector<std::string> tokens;
std::string line("f 104/387/104 495/574/495 497/573/497");
boost::split(tokens, line, boost::is_any_of("/ "));
for (auto& token: tokens) std::cout << token << "\n";
return 0;
}
Output:
f
104
387
104
495
574
495
497
573
497
If you know when to split by either slash or whitespace, you can use std::getline
std::istringstream is("f 104/387/104 495/574/495 497/573/497");
std::string f, i, j, k;
std::getline(is, f, ' ');
std::getline(is, i, '/');
std::getline(is, j, '/');
std::getline(is, k, ' ');
Alternatively, you can use formatted input and discard the slashes manually
std::string f;
int i, j, k;
char slash;
is >> f >> i >> slash >> j >> slash >> k;
I'm sure this isn't the best way at all, but I was working on an exercise in the book Programming Principles and Practice Using C++ 2nd Ed. by Bjarne Stroustrup and I came up with a solution that might work for you. I searched around to see how others were doing it (which is how I found this thread) but I really didn't find anything.
First of all, here's the exercise from the book:
Write a function vector<string> split(const string& s, const string&
w) that returns a vector of whitespace-separated substrings from the
argument s, where whitespace is defined as "ordinary whitespace" plus
the characters in w.
Here's the solution that I came up with, which seems to work well. I tried commenting it to make it more clear. Just want to mention I'm pretty new to C++ (which is why I'm reading this book), so don't go too hard on me. :)
// split a string into its whitespace-separated substrings and store
// each string in a vector<string>. Whitespace can be defined in argument
// w as a string (e.g. ".;,?-'")
vector<string> split(const string& s, const string& w)
{
string temp{ s };
// go through each char in temp (or s)
for (char& ch : temp) {
// check if any characters in temp (s) are whitespace defined in w
for (char white : w) {
if (ch == white)
ch = ' '; // if so, replace them with a space char (' ')
}
}
vector<string> substrings;
stringstream ss{ temp };
for (string buffer; ss >> buffer;) {
substrings.push_back(buffer);
}
return substrings;
}
Then you can do something like this to use it:
cout << "Enter a string and substrings will be printed on new lines:\n";
string str;
getline(cin, str);
vector<string> substrings = split(str, ".;,?-'");
cout << "\nSubstrings:\n";
for (string s : substrings)
cout << s << '\n';
I know you aren't wanting to split strings, but this is just an example of how you can treat other characters as whitespace. Basically, I'm just replacing those characters with ' ' so they literally do become whitespace. When using that with a stream, it works pretty well. The for loop(s) might be the relevant code for your case.

Tokenize stringstream based on type

I have an input stream containing integers and special meaning characters '#'. It looks as follows:
... 12 18 16 # 22 24 26 15 # 17 # 32 35 33 ...
The tokens are separated by space. There's no pattern for the position of '#'.
I was trying to tokenize the input stream like this:
int value;
std::ifstream input("data");
if (input.good()) {
string line;
while(getline(data, line) != EOF) {
if (!line.empty()) {
sstream ss(line);
while (ss >> value) {
//process value ...
}
}
}
}
The problem with this code is that the processing stops when the first '#' is encountered.
The only solution I can think of is to extract each individual token into a string (not '#') and use atoi() function to convert the string to an integer. However, it's very inefficient as the majority tokens are integer. Calling atoi() on the tokens introduces big overhead.
Is there a way I can parse the individual token by its type? ie, for integers, parse it as integers while for '#', skip it. Thanks!
One possibility would be to explicitly skip whitespace (ss >> std::ws), and then to use ss.peek() to find out if a # follows. If yes, use ss.get() to read it and continue, otherwise use ss >> value to read the value.
If the positions of # don't matter, you could also remove all '#' from the line before initializing the stringstream with it.
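A minimal sketch of the peek-based idea, assuming every '#' is a standalone marker that should simply be skipped. The function name is made up, and it collects the values into a vector only to keep the example short.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Reads all integers from the stream, skipping '#' markers.
std::vector<int> readIntsSkippingHash(std::istream& in) {
    std::vector<int> values;
    while (in >> std::ws) {  // explicitly skip whitespace first
        const int next = in.peek();
        if (next == std::istream::traits_type::eof()) {
            break;  // nothing left to read
        }
        if (next == '#') {
            in.get();  // consume the marker and move on
            continue;
        }
        int value;
        if (!(in >> value)) {
            break;  // something other than an integer or '#'
        }
        values.push_back(value);
    }
    return values;
}
```

Because std::ws also skips newlines, this can be run on the whole file stream directly, without reading it line by line.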
Usually not worth testing against good()
if (input.good()) {
Unless your next operation is generating an error message or exception. If it is not good all further operations will fail anyway.
Don't test against EOF.
while(getline(data, line) != EOF) {
The result of std::getline() is not an integer. It is a reference to the input stream. The input stream is convertible to a bool-like object that can be used in a bool context (like while, if, etc.). So what you want to do is:
while(getline(data, line)) {
I am not sure I would read a line. You could just read a word (since the input is space separated). Using the >> operator on string
std::string word;
while(data >> word) { // reads one space separated word
Now you can test the word to see if it is your special character:
if (word[0] == '#')
If not convert the word into a number.
This is what I would do:
// define a class that will read either value from a stream
class MyValue
{
public:
bool isSpec() const {return isSpecial;}
int value() const {return intValue;}
friend std::istream& operator>>(std::istream& stream, MyValue& data)
{
std::string item;
stream >> item;
if (item[0] == '#') {
data.isSpecial = true;
} else {
data.isSpecial = false;
data.intValue = atoi(item.c_str());
}
return stream;
}
private:
bool isSpecial;
int intValue;
};
// Now your loop becomes:
MyValue val;
while(file >> val)
{
if (val.isSpec()) { /* Special processing */ }
else { /* We have an integer */ }
}
Maybe you can read all values as std::string and then check if it's "#" or not (and if not - convert to int)
int value;
std::ifstream input("data");
std::string chunk;
// Outer loop: split the whole input at each '#'.
while (std::getline(input, chunk, '#')) {
// Inner loop: read the whitespace-separated integers of this chunk.
std::istringstream ss(chunk);
while (ss >> value) {
// process value ...
}
}
Here the outer while loop splits the input at each '#', and the inner while loop splits each chunk at whitespace. A fresh stream is created per chunk because reusing one would require calling clear() after it hits end-of-input.
Personally, if your separator is always going to be space regardless of what follows, I'd recommend you just take the input as string and parse from there. That way, you can take the string, see if it's a number or a # and whatnot.
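Something along these lines, as a sketch: read each token as a string, skip "#", and convert the rest with std::stoi. The parseTokens name is made up, and silently dropping tokens that are not clean integers is an assumption about what you'd want.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Reads whitespace-separated tokens; keeps the ones that are integers.
std::vector<int> parseTokens(std::istream& in) {
    std::vector<int> values;
    std::string token;
    while (in >> token) {
        if (token == "#") {
            continue;  // special marker, nothing to convert
        }
        try {
            std::size_t used = 0;
            const int v = std::stoi(token, &used);
            if (used == token.size()) {  // reject trailing junk like "18abc"
                values.push_back(v);
            }
        } catch (const std::exception&) {
            // Not a number at all (or out of range): skip it.
        }
    }
    return values;
}
```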
I think you should re-examine your premise that "Calling atoi() on the tokens introduces big overhead."
There is no magic to std::cin >> val. Under the hood, it ends up calling (something very similar to) atoi.
If your tokens are huge, there might be some overhead to creating a std::string but as you say, the vast majority are numbers (and the rest are #'s) so they should mostly be short.