Using strtok() to parse text file - c++

I've been trying to make a program that parses a text file and feeds 6 pieces of information into an array of objects. The problem for me is that I'm having issues figuring out how to process the text file. I was told that the first step I needed to do was to write some code that counted how many letters long each entry was. The txt file is in this format:
"thing1","thing2","thing3","thing4","thing5","thing6"
This is the current version of my code:
#include<iostream>
#include<string>
#include<fstream>
#include<cstring>
using namespace std;
int main()
{
ifstream myFile("Book List.txt");
while(myFile.good())
{
string line;
getline(myFile, line);
char *sArr = new char[line.length() + 1];
strcpy(sArr, line.c_str());
char *sPtr;
sPtr = strtok(sArr, " ");
while(sPtr != NULL)
{
cout << strlen(sPtr) << " ";
sPtr = strtok(NULL, " ");
}
cout << endl;
}
myFile.close();
return 0;
}
So there are two things making it hard for me right now.
1) How do I deal with the delimiters?
2) How do I deal with "skipping" the first quotation mark in each line?

Read in a string instead of a c-style string. This means that you can use the handy std methods.
The std::string::find() method should help you out with finding each thing that you want to parse.
http://www.cplusplus.com/reference/string/string/find/
You can use this to find all the commas, which will give you the starts of all the things.
Then you can use std::string::substr() to cut up the string into each piece.
http://www.cplusplus.com/reference/string/string/substr/
You can manage to get rid of the quotation marks by passing in 1 more than the start and 1 less than the length of the thing, you can also use

If you have to use strtok then this code snippet should give enough to modify your program to parse your data:
#include <cstdio>
#include <cstring>
int main ()
{
char str[] ="\"thing1\",\"thing2\",\"thing3\",\"thing4\",\"thing5\"";
char * pch;
printf ("Splitting string \"%s\" into tokens:\n",str);
pch = strtok (str,"\",");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, ",\"");
}
return 0;
}
If you do not have to use strtok then you should use std::string as others have advised. Using std::string and std::istringstream:
#include <string>
#include <sstream>
#include <vector>
#include <iostream>
int main ()
{
std::string str2( "\"thing1\",\"thing2\",\"thing3\",\"thing4\",\"thing5\"" ) ;
std::istringstream is(str2);
std::string part;
while (getline(is, part, ','))
std::cout << part.substr(1,part.length()-2) << std::endl;
return 0;
}

For starters, don't use strtok if you can avoid it (and you easily can here - and you can even avoid using the find series of functions as well).
If you want to read in the whole line and then parse it:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>
// defines a new ctype that treats commas as whitespace
struct csv_reader : std::ctype<char>
{
csv_reader() : std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table()
{
static std::vector<std::ctype_base::mask> rc(table_size, std::ctype_base::mask());
rc['\n'] = std::ctype_base::space;
rc[','] = std::ctype_base::space;
return &rc[0];
}
};
int main()
{
std::ifstream fin("yourFile.txt");
std::string line;
csv_reader csv;
std::vector<std::vector<std::string>> values;
while (std::getline(fin, line))
{
istringstream iss(line);
iss.imbue(std::locale(std::locale(), csv));
std::vector<std::string> vec;
std::copy(std::istream_iterator<std::string>(iss), std::istream_iterator<std::string>(), std::back_inserter(vec));
values.push_back(vec);
}
// values now contains a vector for each line that has the strings split by their commas
fin.close();
return 0;
}
That answers your first question. For your second, you can skip all the quotation marks by adding them to the rc mask (also treating them as whitespace) or you can strip them out afterwards (either directly or by using a transform):
std::transform(vec.begin(), vec.end(), vec.begin(), [](std::string& s)
{
std::string::iterator pend = std::remove_if(s.begin(), s.end(), [](char c)
{
return c == '"';
});
s.erase(pend, s.end());
});

Related

Reading all the words in a text file in C++

I have a large .txt file and I want to read all of the words inside it and print them on the screen. The first thing I did was to use std::getline() in this way:
std::vector<std::string> words;
std::string line;
while(std::getline(std::cin,line)){
words.push_back(line);
}
and then I printed out all the words present in the vector words. The .txt file is passed from command line as ./a.out < myTxt.txt.
The problem is that each component of the vector is a whole line, and so I am not reading each word.
The problem, I guess, is the spaces between words: how can I tell the code to ignore them? More specifically, is there any function that I can use in order to read each word from a .txt file?
UPDATE:
I'm trying to avoid all the commas ., but also ? ! (). I used find_first_of(), but my program doesn't work. Also, I don't know how to set what are the characters I don't want to be read, i.e. ., ?, !, and so on
std::vector<std::string> my_vec;
std::string line;
while(std::cin>>line){
std::size_t pos = line.find_first_of("!");
std::string line = line.substr(pos);
my_vec.push_back(line);
}
'>>' operator of type string exactly fills your requirements.
std::vector<std::string> words;
std::string line;
while (std::cin >> line) {
words.push_back(line);
}
If you need remove some noisy characters, e.g. ',','.', you can replace them with space character first.
#include <iostream>
#include <sstream>
#include <vector>
#include <algorithm>
int main() {
std::vector<std::string> words;
std::string line;
while (getline(std::cin, line)) {
std::transform(line.begin(), line.end(), line.begin(),
[](char c) { return std::isalnum(c) ? c : ' '; });
std::stringstream linestream(line);
std::string w;
while (linestream >> w) {
std::cout << w << "\n";
words.push_back(w);
}
}
}
cppreference
The getline function, as it sounds, only returns a whole line. You can split each line on spaces after reading it, or you can read word by word using operator>>:
string word;
while (cin >> word){
cout << word << "\n";
words.push_back(word);
}
Use operator>> instead of std::getline(). The operator will read individual whitespace-separated substrings for you.
#include <iostream>
#include <string>
#include <vector>
std::vector<std::string> my_vec;
std::string s;
while (std::cin >> s){
// use s as needed...
}
However, you may still end up receiving strings that have punctuation in them without any surrounding whitespace, ie hello,world, so you will have to manually split those strings as needed, eg:
#include <iostream>
#include <string>
#include <vector>
#include <cctype>
std::vector<std::string> my_vec;
std::string s;
while (std::cin >> s){
std::string::size_type start = 0, pos;
while ((pos = s.find_first_of(".,?!()", start)) != std::string::npos){
my_vec.push_back(s.substr(start, pos-start));
start = s.find_first_not_of(".,?!() \t\f\r\n\v", pos+1);
}
if (start == 0)
my_vec.push_back(s);
else if (start != std::string::npos)
my_vec.push_back(s.substr(start));
}

Trouble getting two variables to update in C++ for loop

I am creating a function that splits a sentence into words, and believe the way to do this is to use str.substr, starting at str[0] and then using str.find to find the index of the first " " character. Then update the starting position parameter of str.find to start at the index of that " " character, until the end of str.length().
I am using two variables to mark the beginning position and end position of the word, and update the beginning position variable with the ending position of the last. But it is not updating as desired in the loop as I currently have it, and cannot figure out why.
#include <iostream>
#include <string>
using namespace std;
void splitInWords(string str);
int main() {
string testString("This is a test string");
splitInWords(testString);
return 0;
}
void splitInWords(string str) {
int i;
int beginWord, endWord, tempWord;
string wordDelim = " ";
string testWord;
beginWord = 0;
for (i = 0; i < str.length(); i += 1) {
endWord = str.find(wordDelim, beginWord);
testWord = str.substr(beginWord, endWord);
beginWord = endWord;
cout << testWord << " ";
}
}
It is easier to use a string stream.
#include <vector>
#include <string>
#include <sstream>
using namespace std;
vector<string> split(const string& s, char delimiter)
{
vector<string> tokens;
string token;
istringstream tokenStream(s);
while (getline(tokenStream, token, delimiter))
{
tokens.push_back(token);
}
return tokens;
}
int main() {
string testString("This is a test string");
vector<string> result=split(testString,' ');
return 0;
}
You can write it using the existing C++ libraries:
#include <string>
#include <vector>
#include <iterator>
#include <sstream>
int main()
{
std::string testString("This is a test string");
std::istringstream wordStream(testString);
std::vector<std::string> result(std::istream_iterator<std::string>{wordStream},
std::istream_iterator<std::string>{});
}
Couple of issues:
The substr() method second parameter is a length (not a position).
// Here you are using `endWord` which is a poisition in the string.
// This only works when beginWord is 0
// for all other values you are providing an incorrect len.
testWord = str.substr(beginWord, endWord);
The find() method searches from the second paramer.
// If str[beginWord] contains one of the delimiter characters
// Then it will return beginWord
// i.e. you are not moving forward.
endWord = str.find(wordDelim, beginWord);
// So you end up stuck on the first space.
Assuming you got the above fixed. You would be adding space at the front of each word.
// You need to actively search and remove the spaces
// before reading the words.
nice things you could do:
Here:
void splitInWords(string str) {
You are passing the parameter by value. This means you are making a copy. A better technique would be to pass by const reference (you are not modifying the original or the copy).
void splitInWords(string const& str) {
An Alternative
You can use the stream functionality.
void split(std::istream& stream)
{
std::string word;
stream >> word; // This drops leading space.
// Then reads characters into `word`
// until a "white space" character is
// found.
// Note: it emptys words before adding any
}

Split a c++ string without boost and not on whitespace [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Splitting a string in C++
I have a string:
14332x+32x=10
I'd like to split it so that it looks like:
[14332][+32][10]
So far, I've tried doing
char c;
std::stringstream ss(equation1);
while (ss >> c) {
std::cout << c << std::endl;
}
but after testing what that prints, I don't think it's possible to do from that info. I know that I need to split the string on x and =, but I'm not sure if that's possible and if it is how. I've googled it and didn't find anything that looked helpful, but i'm new too c++ and the answer might be right in front of me.
I'd like to not use boost. Any advice would be helpful!
Consider using using a facet that specifies x and = as whitespace characters:
#include <locale>
#include <iostream>
#include <sstream>
struct punct_ctype : std::ctype<char> {
punct_ctype() : std::ctype<char>(get_table()) {}
static mask const* get_table()
{
static mask rc[table_size];
rc[' '] = std::ctype_base::space;
rc['\n'] = std::ctype_base::space;
rc['x'] = std::ctype_base::space;
rc['='] = std::ctype_base::space;
return &rc[0];
}
};
int main() {
std::string equation;
while(std::getline(std::cin, equation)) {
std::istringstream ss(equation);
ss.imbue(std::locale(ss.getloc(), new punct_ctype));
std::string term;
while(ss >> term) {
std::cout << "[" << term << "]";
}
std::cout << "\n";
}
}
The manual way would be to to do a for loop on each character in the string and if the character is == the character your splitting by copy it to a new string (use list/array of strings if >1 split is expected).
Also I think std has split by character functionality. If not, then stringstream::GetLine() has an overload that takes in a character to split by and it will ignore spaces.
GetLine() is very good :)
You can use sscanf like this:
sscanf(s.c_str(), "%[^x]x%[^x]x=%s", a, b, c);
Where %[^x] represents "any character except x". If you don't care for the symbols (i.e. + etc) but just for the numbers, you could do something like:
sscanf(s.c_str(), "%dx%dx=%d", &x, &y, &z);
If you don't mind using c++11, you could use something similar to this:
#include <string>
#include <vector>
#include <iostream>
#include <algorithm>
#include <functional>
#include <unordered_set>
typedef std::vector<std::string> strings;
typedef std::unordered_set<char> tokens;
struct tokenize
{
tokenize(strings& output,const tokens& t) :
v_(output),
t_(t)
{}
~tokenize()
{
if(!s.empty())
v_.push_back(s);
}
void operator()(const char &c)
{
if(t_.find(c)!=t_.end())
{
if(!s.empty())
v_.push_back(s);
s="";
}
else
{
s = s + c;
}
}
private:
std::string s;
strings& v_;
const tokens& t_;
};
void split(const std::string& input, strings& output, const tokens& t )
{
tokenize tokenizer(output,t);
for( auto i : input )
{
tokenizer(i);
}
}
int main()
{
strings tokenized;
tokens t;
t.insert('x');
t.insert('=');
std::string input = "14332x+32x=10";
split(input,tokenized,t);
for( auto i : tokenized )
{
std::cout<<"["<<i<<"]";
}
return 0;
}
Ideone link to the above code: http://ideone.com/17g75F
See this SO answer for a getline_until() function that provides a simple, basic tokenization capability that should let you do something like the following:
#include <string>
#include <stringstream>
#include "getline_until.h"
int main()
{
std::string equation1("14332x+32x=10");
std::stringstream ss(equation1);
std::string token;
while (getline_until(ss, token, "x=")) {
if (!token.empty()) std::cout << "[" << token << "]";
}
std::cout << std::endl;
}
The getline_until() function lets you specify a list of delimiters similar to strtok() (though getline_until() will return empty tokens instead of skipping a run of delimiters like strtok()). Or you can provide a predicate that lets you use a function to decide when to delimit tokens.
One thing it won't let you do (again - similar to strtok() or the standard getline()) is split tokens on merely context - there has to be a delimiter character that gets discarded. For example, with the following input:
42+24
getline_until() (like strtok() or getline()) cannot split the above into the tokens 42, +, and 24.

Splitting a C++ std::string using tokens, e.g. ";" [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to split a string in C++?
Best way to split a string in C++? The string can be assumed to be composed of words separated by ;
From our guide lines point of view C string functions are not allowed and also Boost is also not allowed to use because of security conecerns open source is not allowed.
The best solution I have right now is:
string str("denmark;sweden;india;us");
Above str should be stored in vector as strings. how can we achieve this?
Thanks for inputs.
I find std::getline() is often the simplest. The optional delimiter parameter means it's not just for reading "lines":
#include <sstream>
#include <iostream>
#include <vector>
using namespace std;
int main() {
vector<string> strings;
istringstream f("denmark;sweden;india;us");
string s;
while (getline(f, s, ';')) {
cout << s << endl;
strings.push_back(s);
}
}
You could use a string stream and read the elements into the vector.
Here are many different examples...
A copy of one of the examples:
std::vector<std::string> split(const std::string& s, char seperator)
{
std::vector<std::string> output;
std::string::size_type prev_pos = 0, pos = 0;
while((pos = s.find(seperator, pos)) != std::string::npos)
{
std::string substring( s.substr(prev_pos, pos-prev_pos) );
output.push_back(substring);
prev_pos = ++pos;
}
output.push_back(s.substr(prev_pos, pos-prev_pos)); // Last word
return output;
}
There are several libraries available solving this problem, but the simplest is probably to use Boost Tokenizer:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/foreach.hpp>
typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
std::string str("denmark;sweden;india;us");
boost::char_separator<char> sep(";");
tokenizer tokens(str, sep);
BOOST_FOREACH(std::string const& token, tokens)
{
std::cout << "<" << *tok_iter << "> " << "\n";
}

Splitting strings in C++ [duplicate]

This question already has answers here:
How do I iterate over the words of a string?
(84 answers)
Closed 4 years ago.
How do you split a string into tokens in C++?
this works nicely for me :), it puts the results in elems. delim can be any char.
std::vector<std::string> &split(const std::string &s, char delim, std::vector<std::string> &elems) {
std::stringstream ss(s);
std::string item;
while(std::getline(ss, item, delim)) {
elems.push_back(item);
}
return elems;
}
With this Mingw distro that includes Boost:
#include <iostream>
#include <string>
#include <vector>
#include <iterator>
#include <ostream>
#include <algorithm>
#include <boost/algorithm/string.hpp>
using namespace std;
using namespace boost;
int main() {
vector<string> v;
split(v, "1=2&3=4&5=6", is_any_of("=&"));
copy(v.begin(), v.end(), ostream_iterator<string>(cout, "\n"));
}
You can use the C function strtok:
/* strtok example */
#include <stdio.h>
#include <string.h>
int main ()
{
char str[] ="- This, a sample string.";
char * pch;
printf ("Splitting string \"%s\" into tokens:\n",str);
pch = strtok (str," ,.-");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, " ,.-");
}
return 0;
}
The Boost Tokenizer will also do the job:
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>
int main(){
using namespace std;
using namespace boost;
string s = "This is, a test";
tokenizer<> tok(s);
for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
cout << *beg << "\n";
}
}
Try using stringstream:
std::string line("A line of tokens");
std::stringstream lineStream(line);
std::string token;
while(lineStream >> token)
{
}
Check out my answer to your last question:
C++ Reading file Tokens
See also boost::split from String Algo library
string str1("hello abc-*-ABC-*-aBc goodbye");
vector<string> tokens;
boost::split(tokens, str1, boost::is_any_of("-*"));
// tokens == { "hello abc","ABC","aBc goodbye" }
It depends on how complex the token delimiter is and if there are more than one. For easy problems, just use std::istringstream and std::getline. For more complex tasks or if you want to iterate the tokens in an STL-compliant way, use Boost's Tokenizer. Another possibility (although messier than either of these two) is to set up a while loop that calls std::string::find and updates the position of the last found token to be the start point for searching for the next. But this is probably the most bug-prone of the 3 options.