Reading all the words in a text file in C++ - c++

I have a large .txt file and I want to read all of the words inside it and print them on the screen. The first thing I did was to use std::getline() in this way:
std::vector<std::string> words;
std::string line;
while(std::getline(std::cin,line)){
words.push_back(line);
}
and then I printed out all the words present in the vector words. The .txt file is passed from command line as ./a.out < myTxt.txt.
The problem is that each component of the vector is a whole line, and so I am not reading each word.
The problem, I guess, is the spaces between words: how can I tell the code to ignore them? More specifically, is there any function that I can use in order to read each word from a .txt file?
UPDATE:
I'm trying to strip out punctuation: commas and periods, but also ? ! ( ). I used find_first_of(), but my program doesn't work. Also, I don't know how to specify which characters I don't want to be read, i.e. ., ?, !, and so on.
std::vector<std::string> my_vec;
std::string line;
while(std::cin>>line){
std::size_t pos = line.find_first_of("!");
std::string line = line.substr(pos);
my_vec.push_back(line);
}

The '>>' operator for std::string does exactly what you need: it reads one whitespace-separated word at a time.
std::vector<std::string> words;
std::string line;
while (std::cin >> line) {
words.push_back(line);
}
If you need to remove some noisy characters, e.g. ',' and '.', you can replace them with spaces first.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
int main() {
std::vector<std::string> words;
std::string line;
while (getline(std::cin, line)) {
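std::transform(line.begin(), line.end(), line.begin(),
// std::isalnum requires a value representable as unsigned char
[](unsigned char c) -> char { return std::isalnum(c) ? static_cast<char>(c) : ' '; });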
std::stringstream linestream(line);
std::string w;
while (linestream >> w) {
std::cout << w << "\n";
words.push_back(w);
}
}
}

The getline function, as its name suggests, returns a whole line. You can split each line on spaces after reading it, or you can read word by word using operator>>:
string word;
while (cin >> word){
cout << word << "\n";
words.push_back(word);
}

Use operator>> instead of std::getline(). The operator will read individual whitespace-separated substrings for you.
#include <iostream>
#include <string>
#include <vector>
std::vector<std::string> my_vec;
std::string s;
while (std::cin >> s){
// use s as needed...
}
However, you may still end up with strings that contain punctuation with no surrounding whitespace, e.g. hello,world, so you will have to split those strings manually as needed, e.g.:
#include <iostream>
#include <string>
#include <vector>
#include <cctype>
std::vector<std::string> my_vec;
std::string s;
while (std::cin >> s){
std::string::size_type start = 0, pos;
while ((pos = s.find_first_of(".,?!()", start)) != std::string::npos){
if (pos > start) // skip the empty piece produced by leading punctuation
my_vec.push_back(s.substr(start, pos-start));
start = s.find_first_not_of(".,?!() \t\f\r\n\v", pos+1);
}
if (start == 0)
my_vec.push_back(s);
else if (start != std::string::npos)
my_vec.push_back(s.substr(start));
}
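The splitting step above can also be wrapped in a small helper, which makes it easier to reuse and to check in isolation. This is only a sketch: the name split_on_punct and the default punctuation set are assumptions for illustration, and it expects a token that already contains no whitespace (as produced by operator>>).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one whitespace-free token on the given
// punctuation characters and drops empty pieces.
std::vector<std::string> split_on_punct(const std::string& token,
                                        const std::string& punct = ".,?!()") {
    std::vector<std::string> parts;
    // Start at the first character that is not punctuation.
    std::string::size_type start = token.find_first_not_of(punct);
    while (start != std::string::npos) {
        // The piece ends at the next punctuation character (or at npos).
        const std::string::size_type end = token.find_first_of(punct, start);
        parts.push_back(token.substr(start, end - start));
        // Skip the run of punctuation that follows the piece.
        start = token.find_first_not_of(punct, end);
    }
    return parts;
}
```

Inside the reading loop, each piece returned by split_on_punct(s) can then be appended to my_vec.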

Related

Ignore spaces in vector C++

I'm trying to split a string into individual words and store them in a vector in C++. How can I ignore the extra spaces if the user puts more than one space between words?
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;
int main(){
cout<<"Sentence: ";
string sentence;
getline(cin,sentence);
vector<string> my;
int start=0;
unsigned int end=sentence.size();
unsigned int temp=0;
while(temp<end){
int te=sentence.find(" ",start);
temp=te;
my.push_back(sentence.substr(start, temp-start));
start=temp+1;
}
unsigned int i;
for(i=0 ; i<my.size() ; i++){
cout<<my[i]<<endl;
}
return 0;
}
Four things:
When reading input from a stream into a string using the overloaded >> operator, it automatically separates on whitespace. I.e. it reads "words".
There exists an input stream that uses a string as the input, std::istringstream.
You can use iterators with streams, e.g. std::istream_iterator.
std::vector has a constructor taking a pair of iterators.
That means your code could simply be
std::string line;
std::getline(std::cin, line);
std::istringstream istr(line);
// Braces avoid the "most vexing parse": with parentheses this line would declare a function.
std::vector<std::string> words{std::istream_iterator<std::string>(istr),
std::istream_iterator<std::string>()};
After this, the vector words will contain all the "words" from the input line.
You can easily print the "words" using std::ostream_iterator and std::copy:
std::copy(begin(words), end(words),
std::ostream_iterator<std::string>(std::cout, "\n"));
The easiest way is to use a std::istringstream, as follows:
std::string sentence;
std::getline(std::cin,sentence);
std::istringstream iss(sentence);
std::vector<std::string> my;
std::string word;
while(iss >> word) {
my.push_back(word);
}
Any whitespace between words, including repeated spaces, is skipped automatically.
You can create the vector directly using the std::istream_iterator which skips white spaces:
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <iterator>
int main() {
std::string str = "Hello World Lorem Ipsum The Quick Brown Fox";
std::istringstream iss(str);
std::vector<std::string> vec {std::istream_iterator<std::string>(iss),
std::istream_iterator<std::string>() };
for (const auto& el : vec) {
std::cout << el << '\n';
}
}
Here is a function that divides a given sentence into words.
#include <string>
#include <vector>
#include <sstream>
#include <utility>
std::vector<std::string> divideSentence(const std::string& sentence) {
std::stringstream stream(sentence);
std::vector<std::string> words;
std::string word;
while(stream >> word) {
words.push_back(std::move(word));
}
return words;
}
Reducing double, triple etc. spaces in string is a problem you'll encounter again and again. I've always used the following very simple algorithm:
Pseudocode:
while "  " in string:
string = string.replace("  ", " ")
After the while loop, you know your string only has single spaces, since every run of consecutive spaces has been compressed to one.
Most languages allow you to search for a substring in a string and most languages have the ability to run string.replace() so it's a useful trick.
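In C++ the same idea might look like the sketch below. The function name collapse_spaces is made up for illustration, and it only treats the space character ' ' as whitespace; tabs and newlines are left alone.

```cpp
#include <string>

// Hypothetical helper: replaces every run of spaces with a single space.
std::string collapse_spaces(std::string text) {
    std::string::size_type pos = 0;
    // Each time two adjacent spaces are found, one of them is erased.
    // The search resumes at the same position, so longer runs keep shrinking.
    while ((pos = text.find("  ", pos)) != std::string::npos) {
        text.erase(pos, 1);
    }
    return text;
}
```

The result can then be split on single spaces, although reading through a std::istringstream with operator>> makes this step unnecessary.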

c++ found an example on splitting strings trying to figure out why change to it changes result

I'm learning about splitting strings for a program in class, and I came across this example.
#include <string>
#include <sstream>
#include <iostream>
int main()
{
std::string str = "23454323 ABCD EFGH";
std::istringstream iss(str);
std::string word;
while(iss >> word)
{
std::cout << word << '\n';
}
}
I modified it so that the user inputs the string instead, but if I enter the same string that was stored in str, I only get 23454323 and not the rest of the string.
#include <string>
#include <sstream>
#include <iostream>
using namespace std;
int main()
{
string str;
cout<<"Enter a postfix with a space between each object:";
cin>>str;
istringstream iss(str);
string word;
while(iss >> word)
{
cout << word << '\n';
}
}
Ok, thanks for the help everyone got it!
You need to modify your input code a little for this to work. Use:
getline(cin, str);
instead of:
cin >> str;
The latter will stop reading a string on whitespace characters.
Because you read from cin with the same input operator that you use on the istringstream, and that operator always stops at whitespace.
That means you only read a single word from the user. You want to use std::getline.
Just as iss >> word reads a single space-separated word from iss, so cin >> str just reads the first word from cin.
To read a whole line, use getline(cin, str).
(Also, get out of the habit of dumping namespace std into the global namespace. It will cause problems as your programs grow.)
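Putting the answers above together, a corrected version might look like this sketch. The helper name read_line_words is an assumption for illustration; it takes any std::istream so that it works with std::cin as well as with a string stream.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one full line from `in` and splits it into
// whitespace-separated words.
std::vector<std::string> read_line_words(std::istream& in) {
    std::string line;
    std::getline(in, line);        // takes the whole line, spaces included
    std::istringstream iss(line);
    std::vector<std::string> words;
    std::string word;
    while (iss >> word) {          // each extraction stops at whitespace
        words.push_back(word);
    }
    return words;
}
```

Calling read_line_words(std::cin) after printing the prompt would yield all three parts of the example input rather than only the first.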

is it possible to read from a specific character in a line from a file in c++?

Hey all so I have to get values from a text file, but the values don't stand alone they are all written as this:
Population size: 30
Is there any way in c++ that I can read from after the ':'?
I've tried using the >> operator like:
string pop;
inFile >> pop;
but of course the whitespace terminates the extraction before it gets to the number, and using
inFile.getline(pop, 20);
gives me loads of errors because the member function will not write directly to a string.
I don't really want to use a char array, because then it won't be as easy to test for the number and extract it from the string.
So is there any way I can use the getline function with a string?
And is it possible to read from after the ':' character?
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <cstdlib>
using namespace std;
int main()
{
string fname;
cin >> fname;
ifstream inFile;
inFile.open(fname.c_str());
string pop1;
getline(inFile,pop1);
cout << pop1;
return 0;
}
OK, so here is my code with the new getline, but it still outputs nothing. It does open the text file correctly, and it works with a char array.
You are probably best off reading the whole line and then manipulating the string:
std::string line;
std::getline(inFile, line);
line = line.substr(17); // In "Population size: 30" the value starts at index 17
You are probably better off looking for the colon, so the code does not depend on a fixed position:
std::size_t pos = line.find(":");
if (pos != std::string::npos)
{
line = line.substr(pos + 1);
}
Or something similar
Once you've done that, you might want to feed the result into a stringstream so you can extract ints and other typed values:
int population;
std::istringstream ss(line);
ss >> population;
Obviously this all depends on what you want to do with the data
Assuming your data is in the form
<Key>:<Value>
One per line. Then I would do this:
std::string line;
while(std::getline(inFile, line))
{
std::stringstream linestream(line);
std::string key;
int value;
if (std::getline(linestream, key, ':') >> value)
{
// Got a key/value pair
}
}
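As a slightly fuller sketch of the same loop, the pairs could be collected into a map. The function name read_key_values and the choice of std::map<std::string, int> are assumptions for illustration; lines without a colon or without an integer value are simply skipped.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: collects "<Key>:<Value>" lines into a map.
std::map<std::string, int> read_key_values(std::istream& in) {
    std::map<std::string, int> result;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream linestream(line);
        std::string key;
        int value;
        // getline reads up to the ':'; >> then skips the space and
        // extracts the number. The test fails if either step fails.
        if (std::getline(linestream, key, ':') >> value) {
            result[key] = value;
        }
    }
    return result;
}
```

With the file from the question, read_key_values(inFile) would map "Population size" to 30.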

How to take formatted input from ifstream

I have a text file with a set of names formatted in the following way:
"MARY","PATRICIA","LINDA","BARBARA","ELIZABETH"
and so on. I want to open the file using ifstream and read the names into a string array (without quotes, commas). I somehow managed to do it by checking the input stream character by character. Is there an easier way to take this formatted input?
EDIT:
I heard that you can use something like
fscanf (f, "\"%[a-zA-Z]\",", str);
in C, but is there such a method for ifstream?
That input should be parsable with std::getline or std::regex_token_iterator (though the latter is shooting sparrows with artillery).
Examples:
Regex
Quick and dirty, yet heavyweight solution (using boost so most compilers eat this)
#include <boost/regex.hpp>
#include <iostream>
int main() {
const std::string s = "\"MARY\",\"PATRICIA\",\"LINDA\",\"BARBARA\",\"ELIZABETH\"";
boost::regex re("\"(.*?)\"");
for (boost::sregex_token_iterator it(s.begin(), s.end(), re, 1), end;
it != end; ++it)
{
std::cout << *it << std::endl;
}
}
Output:
MARY
PATRICIA
LINDA
BARBARA
ELIZABETH
Alternatively, you can use
boost::regex re(",");
for (boost::sregex_token_iterator it(s.begin(), s.end(), re, -1), end;
to let it split along commas (note also the -1) or other regexes.
getline
getline solution (whitespace allowed)
#include <sstream>
#include <string>
#include <iostream>
int main() {
std::stringstream ss;
ss.str ("\"MARY\",\"PATRICIA\",\"LINDA\",\"BARBARA\",\"ELIZABETH\"");
std::string curr;
while (std::getline (ss, curr, ',')) {
size_t from = 1 + curr.find_first_of ('"'),
to = curr.find_last_of ('"');
std::cout << curr.substr (from, to-from) << std::endl;
}
}
Output is the same.
getline
getline solution (whitespace not allowed)
The loop becomes almost trivial:
std::string curr;
while (std::getline (ss, curr, ',')) {
std::cout << curr.substr (1, curr.length()-2) << std::endl;
}
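Since the question asks for the names to be stored rather than printed, the getline approach could be packaged as in the sketch below. The name parse_quoted_names is hypothetical, and the code assumes that no name contains a comma; fields without a pair of quotes are skipped instead of being passed to substr.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: extracts the text between the first and last double
// quote of every comma-separated field.
std::vector<std::string> parse_quoted_names(const std::string& text) {
    std::istringstream ss(text);
    std::vector<std::string> names;
    std::string field;
    while (std::getline(ss, field, ',')) {
        const std::string::size_type from = field.find('"');
        const std::string::size_type to = field.rfind('"');
        // Keep the field only if it has an opening and a closing quote.
        if (from != std::string::npos && to > from) {
            names.push_back(field.substr(from + 1, to - from - 1));
        }
    }
    return names;
}
```

To use it with a file, the contents read from the ifstream (for example, one line via std::getline) would be passed in as text.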
homebrew solution
Least wasteful w.r.t. performance (especially when you wouldn't store those strings, but iterators or indices instead)
#include <iostream>
#include <string>
int main() {
const std::string str ("\"MARY\",\"PATRICIA\",\"LINDA\",\"BARBARA\",\"ELIZABETH\"");
size_t i = 0;
while (i != std::string::npos) {
size_t begin = str.find ('"', i) + 1, // one behind initial '"'
end = str.find ('"', begin),
comma = str.find (',', end);
i = comma;
std::cout << str.substr(begin, end-begin) << std::endl;
}
}
As far as I know, there is no tokenizer in the STL. But if you are willing to use Boost, there's a very good tokenizer class there. Other than that, character by character is your best C++ way of addressing it (unless you are willing to go the C route and use strtok on your raw char * strings).
A simple tokenizer should do the trick; no need for something heavy-weight like regular expressions. C++ doesn't have a built-in one, but it's easy enough to write. Here's one which I myself stole off the internet so long ago I don't even remember who wrote it, so apologies for the blatant plagiarism:
#include <vector>
#include <string>
std::vector<std::string>
tokenize(const std::string & str, const std::string & delimiters)
{
std::vector<std::string> tokens;
// Skip delimiters at beginning.
std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find the first delimiter after that, i.e. the end of the first token.
std::string::size_type pos = str.find_first_of(delimiters, lastPos);
while (std::string::npos != pos || std::string::npos != lastPos)
{
// Found a token, add it to the vector.
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters. Note the "not_of"
lastPos = str.find_first_not_of(delimiters, pos);
// Find the next delimiter, i.e. the end of the next token.
pos = str.find_first_of(delimiters, lastPos);
}
return tokens;
}
Usage: std::vector<std::string> words = tokenize(line, ",");
Actually, because I was interested, I worked out how to do this using Boost.Spirit.Qi:
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace boost::spirit::qi;
int main() {
// our test-string
std::string data("\"MARY\",\"PATRICIA\",\"LINDA\",\"BARBARA\"");
// this is where we will store the names
std::vector<std::string> names;
// parse the string
phrase_parse(data.begin(), data.end(),
( lexeme['"' >> +(char_ - '"') >> '"'] % ',' ),
space, names);
// print what we have parsed
std::copy(names.begin(), names.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}
To check if an error occurred during parsing, simply store the iterators over the string in variables, and compare them afterwards. If they are equal, the whole string was matched, if not, the begin-iterator will point to the error site.

CString Parsing Carriage Returns

Let's say I have a string that has multiple carriage returns in it, i.e:
394968686
100630382
395950966
335666021
I'm still pretty much an amateur with C++. Would anyone be willing to show me how to go about parsing each "line" in the string, so I can do something with it later (add the desired line to a list)? I'm guessing using Find("\n") in a loop?
Thanks guys.
while (!str.IsEmpty())
{
CString one_line = str.SpanExcluding(_T("\r\n"));
// do something with one_line
str = str.Right(str.GetLength() - one_line.GetLength()).TrimLeft(_T("\r\n"));
}
Blank lines will be eliminated with this code, but that's easily corrected if necessary.
You could try it using stringstream. Note that getline accepts an optional third argument, so you can use any delimiter you want.
string line;
stringstream ss;
ss << yourstring;
while ( getline(ss, line, '\n') )
{
cout << line << endl;
}
Alternatively you could use the boost library's tokenizer class.
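Because the question mentions carriage returns, a sketch of the getline approach that also handles "\r\n" line endings is shown here. The function name split_lines is an assumption for illustration, and it works on a std::string, so a CString would need to be converted first.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text into lines on '\n' and removes the
// trailing '\r' that "\r\n" line endings leave behind.
std::vector<std::string> split_lines(const std::string& text) {
    std::istringstream ss(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(ss, line, '\n')) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        lines.push_back(line);
    }
    return lines;
}
```

Unlike the operator>> approach, this keeps blank lines and any spaces inside a line.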
You can use stringstream class in C++.
#include <iostream>
#include <sstream>
#include <vector>
using namespace std;
int main()
{
// Adjacent literals are concatenated; each line ends with an explicit '\n'.
string str =
"394968686\n"
"100630382\n"
"395950966\n"
"335666021";
stringstream ss(str);
vector<string> v;
string token;
// get line by line
while (ss >> token)
{
// insert current line into a std::vector
v.push_back(token);
// print out current line
cout << token << endl;
}
}
Output of the program above:
394968686
100630382
395950966
335666021
Note that because operator>> is used, no whitespace is included in the parsed tokens, so a line that contains spaces would be split into several tokens.
If your string is stored in a c-style char* or std::string then you can simply search for \n.
std::string s;
size_t pos = s.find('\n');
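You can use string::substr() to extract each line and store it in a list, for example:
std::string s = " .... ";
std::list<std::string> lines;
std::string::size_type begin = 0, pos;
while ((pos = s.find('\n', begin)) != std::string::npos) // search from begin, not from 0
{
lines.push_back(s.substr(begin, pos - begin)); // substr takes a length, not an end index
begin = pos + 1; // continue after the newline
}
lines.push_back(s.substr(begin)); // text after the last newline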