How do I make an alphabetized list of all distinct words in a file with the number of times each word was used? - c++

I am writing a program using Microsoft Visual C++. In the program I must read in a text file and print out an alphabetized list of all distinct words in that file with the number of times each word was used.
I have looked up different ways to alphabetize a string but they do not work with the way I have my string initialized.
// What is inside my text file
Any experienced programmer engaged in writing programs for use by others knows
that, once his program is working correctly, good output is a must. Few people
really care how much time and trouble a programmer has spent in designing and
debugging a program. Most people see only the results. Often, by the time a
programmer has finished tackling a difficult problem, any output may look
great. The programmer knows what it means and how to interpret it. However,
the same cannot be said for others, or even for the programmer himself six
months hence.
string lines;
getline(input, lines); // Stores what is in file into the string
I expect an alphabetized list of words with the number of times each word was used. So far, I do not know how to begin this process.

It's rather simple: std::map keeps its entries sorted by key, so iterating over it visits the key/value pairs in sorted (lexicographic) order. Here the key/value pair is word/count, which is what you need. You will also need to filter out punctuation and other special characters.
EDIT: std::stringstream is a nice way of splitting a std::string on whitespace, since whitespace is its default delimiter. Using stream >> word therefore gives you whitespace-separated words. However, this is not enough because of punctuation. For example, "Often," carries a comma which we need to filter out. Therefore, I used std::replace_if, which replaces punctuation characters and digits with spaces.
Now a new problem arises. The input string in the program below contains "must.Few", which is read as one word. After replacing . with a space we have "must Few". So I'm using another stringstream on the filtered "word" to make sure I have only single words in the final result.
In the second loop you will notice if(word == "") continue;, this can happen if the string is not trimmed. If you look at the code you will find out that we aren't trimming after replacing puncts and digits. That is, "Often," will be "Often " with trailing whitespace. The trailing whitespace causes the second loop to extract an empty word. This is why I added the condition to ignore it. You can trim the filtered result and then you wouldn't need this check.
Finally, I have added ignorecase boolean to check if you wish to ignore the case of the word or not. If you wish to do so, the program will simply convert the word to lowercase and then add it to the map. Otherwise, it will add the word the same way it found it. By default, ignorecase = true, if you wish to consider case, just call the function differently: count_words(input, false);.
Edit 2: In case you're wondering, the statement counts[word] will automatically create key/value pair in the std::map IF there isn't any key matching word. So when we call ++: if the word isn't in the map, it will create the pair, and increment value by 1 so you will have newly added word. If it exists already in the map, this will increment the existing value by 1 and hence it acts as a counter.
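To see that counting behaviour in isolation, here is a minimal sketch. The helper name count_tokens is made up for illustration and is not part of the program below.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: counts how often each string occurs in tokens.
std::map<std::string, std::size_t> count_tokens(const std::vector<std::string>& tokens) {
    std::map<std::string, std::size_t> counts;
    for (const auto& t : tokens) {
        // operator[] inserts {t, 0} when t is missing, and ++ then makes it 1;
        // for an existing key it simply increments the stored value.
        ++counts[t];
    }
    return counts;
}
```

Calling count_tokens({"b", "a", "b"}) should give a map with "a" mapped to 1 and "b" mapped to 2, with "a" visited first because the map is ordered by key.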
The program:
#include <iostream>
#include <map>
#include <sstream>
#include <cstring>
#include <cctype>
#include <string>
#include <iomanip>
#include <algorithm>
std::string to_lower(const std::string& str) {
    std::string ret;
    // cast to unsigned char first: passing a negative char to tolower is undefined
    for (char c : str)
        ret.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
    return ret;
}
std::map<std::string, size_t> count_words(const std::string& str, bool ignorecase = true) {
    std::map<std::string, size_t> counts;
    std::stringstream stream(str);
    while (stream.good()) {
        // wordW may have multiple words connected by special chars/digits
        std::string wordW;
        stream >> wordW;
        // filter special chars and digits
        std::replace_if(wordW.begin(), wordW.end(),
            [](unsigned char c) { return std::ispunct(c) || std::isdigit(c); }, ' ');
        // now wordW may have multiple words separated by whitespaces, extract them
        std::stringstream word_stream(wordW);
        while (word_stream.good()) {
            std::string word;
            word_stream >> word;
            // ignore empty words
            if (word == "") continue;
            // add to count.
            ignorecase ? counts[to_lower(word)]++ : counts[word]++;
        }
    }
    return counts;
}
void print_counts(const std::map<std::string, size_t>& counts) {
    for (const auto& pair : counts)
        std::cout << std::setw(15) << pair.first << " : " << pair.second << std::endl;
}
int main() {
std::string input = "Any experienced programmer engaged in writing programs for use by others knows \
that, once his program is working correctly, good output is a must.Few people \
really care how much time and trouble a programmer has spent in designing and \
debugging a program.Most people see only the results.Often, by the time a \
programmer has finished tackling a difficult problem, any output may look \
great.The programmer knows what it means and how to interpret it.However, \
the same cannot be said for others, or even for the programmer himself six \
months hence.";
auto counts = count_words(input);
print_counts(counts);
return 0;
}
I have tested this with Visual Studio 2017, and here is part of the output:
a : 5
and : 3
any : 2
be : 1
by : 2
cannot : 1
care : 1
correctly : 1
debugging : 1
designing : 1

As others have already noted, an std::map handles the counting you care about quite easily.
Iostreams already have a tokenizer to break an input stream up into words. In this case, though, we want to treat only letters as characters that can make up words. A stream uses a locale to make that sort of decision, so to change how it's done, we need to define a locale that classifies characters as we see fit.
// requires <algorithm>, <locale> and <vector>
struct alpha_only: std::ctype<char> {
    alpha_only(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table() {
        // everything is white space
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size,std::ctype_base::space);
        // except lower- and upper-case letters, which are classified accordingly
        // (the end pointer is one past 'z' / 'Z' so the last letter is included):
        std::fill(&rc['a'], &rc['z'] + 1, std::ctype_base::lower);
        std::fill(&rc['A'], &rc['Z'] + 1, std::ctype_base::upper);
        return &rc[0];
    }
};
With that in place, we tell the stream to use our ctype facet, then simply read words from the file and count them in the map:
std::cin.imbue(std::locale(std::locale(), new alpha_only));
std::map<std::string, std::size_t> counts;
std::string word;
while (std::cin >> word)
++counts[to_lower(word)];
...and when we're done with that, we can print out the results:
for (auto w : counts)
std::cout << w.first << ": " << w.second << "\n";
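Putting those pieces together, a self-contained sketch could look like the following. It reads from a string rather than std::cin so it is easy to try out. The names letters_only and count_alpha_words are made up here, and the lower-casing is done inline because this answer does not define to_lower itself.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// ctype facet (hypothetical name) that treats everything except ASCII letters as white space.
struct letters_only : std::ctype<char> {
    letters_only() : std::ctype<char>(get_table()) {}
    static const std::ctype_base::mask* get_table() {
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        // the end pointers are one past 'z' and 'Z' so the last letters are included
        std::fill(&rc['a'], &rc['z'] + 1, std::ctype_base::lower);
        std::fill(&rc['A'], &rc['Z'] + 1, std::ctype_base::upper);
        return rc.data();
    }
};

// Counts case-insensitive words made of ASCII letters only.
std::map<std::string, std::size_t> count_alpha_words(const std::string& text) {
    std::istringstream in(text);
    // the locale takes ownership of the facet and deletes it when done
    in.imbue(std::locale(in.getloc(), new letters_only));
    std::map<std::string, std::size_t> counts;
    std::string word;
    while (in >> word) {
        for (char& c : word)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        ++counts[word];
    }
    return counts;
}
```

The same imbue call should work on a std::ifstream or on std::cin; only the stream object changes.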

I'd probably start by inserting all of the words into an array of strings. Then start with the first element and compare it with every other element; each time you find a match, add 1 to a counter. After you have gone through the array, display the word you were searching for and how many matches there were, then move on to the next element and repeat. Alternatively, you could keep a parallel array of integers that holds the number of matches, so that you do all the comparisons in one pass and all the output in another.
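A rough sketch of the parallel-array idea, using std::vector instead of fixed-size arrays. WordTally and tally_words are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Parallel vectors: counts[i] is the number of times unique_words[i] was seen.
struct WordTally {
    std::vector<std::string> unique_words;
    std::vector<std::size_t> counts;
};

WordTally tally_words(const std::vector<std::string>& words) {
    WordTally t;
    for (const auto& w : words) {
        // linear search for w among the words seen so far
        std::size_t i = 0;
        while (i < t.unique_words.size() && t.unique_words[i] != w) ++i;
        if (i == t.unique_words.size()) {
            t.unique_words.push_back(w); // first occurrence
            t.counts.push_back(1);
        } else {
            ++t.counts[i];               // repeat occurrence
        }
    }
    return t;
}
```

This keeps words in first-seen order, so the result would still need sorting for an alphabetized list, and the nested search makes it quadratic in the number of distinct words.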

EDIT:
Everyone else's answer seems more elegant because of the map's inherent sorting. My answer functions more as a parser that sorts the tokens afterwards. Therefore my answer is mainly useful as the beginning of a tokenizer or lexer, whereas the other answers are only good for sorted data.
You first probably want to read in the text file. You can use a streambuf iterator (std::istreambuf_iterator) to read in the whole file.
You will now have a string called content, which is the content of your file. Next you will want to iterate, or loop, over the contents of this string. To do that you'll want to use an iterator. There should be a string outside of the loop that stores the current word. You will iterate over the content string, and each time you hit a letter character, you will add that character to your current word string. Then, once you hit a space character, you will take that current word string and push it back into the wordStringV vector. (Note: this means that non-letter characters are ignored, and that only spaces denote word separation.)
Now that we have a vector of all of our words as strings, we can use std::sort to sort the vector in alphabetical order. (Note: capitalized words take precedence over lowercase words, and therefore will be sorted first.) Then we will iterate over wordStringV and convert the strings into Word objects (this is a little heavyweight) that store the word string and its number of appearances. We will push these Word objects into a Word vector, but if we discover a repeated word string, instead of adding it to the Word vector, we'll grab the existing entry and increment its appearance count.
Finally, once this is all done, we can iterate over our Word object vector and output the word followed by its appearances.
Full Code:
#include <vector>
#include <fstream>
#include <iostream>
#include <streambuf>
#include <algorithm>
#include <string>
class Word //define word object
{
public:
    Word() : appearances(1) {}
    int appearances;
    std::string mWord;
};
bool isLetter(const char x)
{
    return (x >= 'a' && x <= 'z') || (x >= 'A' && x <= 'Z');
}
int main()
{
    std::string srcFile = "myTextFile.txt"; //what file are we reading
    std::ifstream ifs(srcFile);
    std::string content((std::istreambuf_iterator<char>(ifs)),
                        (std::istreambuf_iterator<char>())); //read in the file
    std::vector<std::string> wordStringV; //create a vector of word strings
    std::string current = ""; //define our current word
    for(auto it = content.begin(); it != content.end(); ++it) //iterate over our input
    {
        const char currentChar = *it; //make life easier
        if(currentChar == ' ')
        {
            if(!current.empty()) //repeated spaces would otherwise add empty words
                wordStringV.push_back(current);
            current = "";
            continue;
        }
        else if(isLetter(currentChar))
        {
            current += *it;
        }
    }
    if(!current.empty()) //the last word is not followed by a space
        wordStringV.push_back(current);
    std::sort(wordStringV.begin(), wordStringV.end());
    std::vector<Word> wordVector;
    for(auto it = wordStringV.begin(); it != wordStringV.end(); ++it) //iterate over wordString vector
    {
        std::vector<Word>::iterator wordIt;
        //see if the current word string has appeared before...
        for(wordIt = wordVector.begin(); wordIt != wordVector.end(); ++wordIt)
        {
            if((*wordIt).mWord == *it)
                break;
        }
        if(wordIt == wordVector.end()) //...if not create a new Word obj
        {
            Word theWord;
            theWord.mWord = *it;
            wordVector.push_back(theWord);
        }
        else //...otherwise increment the appearances.
        {
            ++((*wordIt).appearances);
        }
    }
    //print the words out
    for(auto it = wordVector.begin(); it != wordVector.end(); ++it)
    {
        Word theWord = *it;
        std::cout << theWord.mWord << " " << theWord.appearances << "\n";
    }
    return 0;
}
Side Notes
Compiled with g++ version 4.2.1 with target x86_64-apple-darwin, using the compiler flag -std=c++11.
If you don't like iterators you can instead do
for(std::size_t i = 0; i < content.size(); ++i)
{
    const char currentChar = content[i];
}
It's important to note that if you don't care about capitalization, you can simply apply std::tolower in the current += *it; statement (i.e., current += std::tolower(*it);).
Also, you seem like a beginner and this answer might have been too heavyweight, but you're asking for a basic parser and that is no easy task. I recommend starting by parsing simpler strings like math equations. Maybe make a calculator app.
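Because the word strings are sorted before counting, equal words end up next to each other, so the search through the Word vector is not strictly needed. Here is a hedged sketch of that variation. It is a swapped-in technique, not the code above, and count_sorted is a made-up name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sorts the words, then counts runs of equal neighbours.
std::vector<std::pair<std::string, std::size_t>> count_sorted(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    std::vector<std::pair<std::string, std::size_t>> result;
    for (const auto& w : words) {
        if (!result.empty() && result.back().first == w)
            ++result.back().second;   // same word as the previous one
        else
            result.emplace_back(w, 1); // a new distinct word starts here
    }
    return result;
}
```

As with std::sort in the full program, uppercase letters compare lower than lowercase ones, so "A" would come before "a" unless the words are lowercased first.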

Related

Insert into array specific strings from text file

ArticlesDataset.txt file contains all the metadata information of documents. unigramCount contains all unique words and their number of occurrences for each document. There are 1500 publications recorded in the txt file. Here is an example entry for a document:
{"creator":["Romain Allais","Julie Gobert"],
"datePublished":"2018-05-30",
"docType":"article",
"doi":"10.1051\/mattech\/2018010",
"id":"ark:\/\/27927\/phz10hn2bh3",
"isPartOf":"Mat\u00e9riaux & Techniques",
"issueNumber":"5-6",
"language":["eng"],
"outputFormat":["unigram","bigram","trigram"],
"pageCount":7,
"pagination":"pp. null-null",
"provider":"portico",
"publicationYear":2018,
"publisher":"EDP Sciences",
"sequence":3.0,
"tdmCategory":["Applied sciences -Engineering"],
"title":"Environmental assessment of PSS",
"url":"http:\/\/doi.org\/10.1051\/mattech\/2018010",
"volumeNumber":"105",
"wordCount":4446,
"unigramCount":{"others":1,"air":1,"networks,":1,"conventional":1,"IEEE":1}}
My purpose is to pull out the unigram counts for each document and store them in a suitable array. How can I do it by using fstream library?
How can I improve the code below to reach my goal?
std::string dummy;
std::ifstream data("PublicationsDataSet.txt");
while (data.good())
{
getline(data, dummy, ',');
}
Your question covers two different topics: one is parsing the data and the other is storing it in memory.
On the first point, the answer is that you'll need a parser. You can write one yourself, which involves a lexer that converts each piece of syntax into tokens, followed by an interpreter that assembles them into a data object based on the tokens that precede or follow the data, e.g.:
'[' = start an array, every values after this are part of the array
']' = end of the an array, return to previous parsing state
':' = separate key and values, left hand side is key, right hand side is value
...
This is a fine exercise to sharpen one's skills, but it is arduous and can turn into a never-ending road of bug fixes. As other comments also recommend, using an existing library is probably the easier road when you are pressed for time.
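As a rough illustration of the hand-written route, here is a minimal lexer sketch for the kind of punctuation listed above. TokenKind, Token, and tokenize are names made up for this example. It is not a full JSON lexer: it only recognises unsigned integers, and it keeps the character after a backslash verbatim without decoding escapes such as \u00e9.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Punct, String, Number };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits src into punctuation, quoted-string and unsigned-integer tokens.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i; // white space between tokens is skipped
        } else if (c == '"') {
            std::string text;
            ++i; // skip the opening quote
            while (i < src.size() && src[i] != '"') {
                if (src[i] == '\\' && i + 1 < src.size()) ++i; // drop the backslash, keep the next char
                text += src[i++];
            }
            ++i; // skip the closing quote
            out.push_back({TokenKind::String, text});
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            std::string text;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                text += src[i++];
            out.push_back({TokenKind::Number, text});
        } else {
            out.push_back({TokenKind::Punct, std::string(1, c)}); // '{', '}', '[', ']', ':', ',' ...
            ++i;
        }
    }
    return out;
}
```

An interpreter would then walk the token list and decide, for example, that a String followed by ':' is a key. That second stage is where most of the effort goes, which is why an existing library is usually the safer choice.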
Another thing to point out: plain arrays in C++ have a fixed size, so since you are parsing an unknown number of values you'll most likely want a std::vector, which allows insertion. Once you are done processing the file, if you really need to hand the data back as an array, you can copy it out of the vector:
std::vector<YourObjectType> parsedObject;
YourObjectType* arr = new YourObjectType[parsedObject.size()];
std::copy(parsedObject.begin(), parsedObject.end(), arr);
This is pseudocode; a lot will depend on the implementation, but it gives the idea.
A starting point for writing a parser is the article below, which goes into great detail on how a parser works and on its components. Mind you, every parser implements its own language (yes, C++ and other languages are all parsed too), so you'll need to expand on the concept with your own commands:
expression parser
Here's a simplified solution of what you could do using std::regex:
Read the lines of a stream (std::cin in this case) one by one.
Check if the line contains a unigramCount element.
If that's the case, walk the different entries within the unigramCount element.
About the regular expressions used:
"unigramCount":{}, allowing:
zero or more whitespaces basically everywhere, and
zero or more characters within the braces.
"<key>":<value>, where:
<key> is one or more characters other than a double quote,
<value> is one or more digits, and
you could have whitespaces at both sides of the :.
A good data structure for storing your unigramCount entries could be a std::map.
#include <iostream> // cout
#include <map>
#include <regex> // regex_match, regex_search, sregex_iterator
#include <string> // stoi
int main()
{
std::string line{};
std::map<std::string, int> unigram_counts{};
while (std::getline(std::cin, line))
{
const std::regex unigram_count_pattern{R"(^\s*\"unigramCount\"\s*:\s*\{.*\}\s*$)"};
if (std::regex_match(line, unigram_count_pattern))
{
const std::regex entry_pattern{R"(\"([^\"]+)\"\s*:\s*([0-9]+))"};
for (auto entry_it{std::sregex_iterator(line.cbegin(), line.cend(), entry_pattern)};
entry_it != std::sregex_iterator{};
++entry_it)
{
auto matches{*entry_it};
auto& key{matches[1]};
auto& value{matches[2]};
unigram_counts[key] = std::stoi(value);
}
}
}
for (auto& [key, value] : unigram_counts)
{
std::cout << "'" << key << "' : " << value << "\n";
}
}
// Outputs:
//
// 'IEEE' : 1
// 'air' : 1
// 'conventional' : 1
// 'networks,' : 1
// 'others' : 1

Given an Array of strings how do I Remove Duplicates?

I would like to know how to remove duplicate strings from a container, but ignore word differences from trailing punctuation.
For example given these strings:
Why do do we we here here?
I would like to get this output:
Why do we here?
The algorithm:
While Reading a word is successful, do:
If End of file, quit.
If word list is empty, push back word.
else begin
Search word list for the word.
if word doesn't exist, push back the word.
end else (step 4)
end (while reading a word)
Use std::string for your word.
This allows you to do the following:
std::string word;
while (data_file >> word)
{
}
Use std::vector to contain your words (although you could use std::list as well). The std::vector grows dynamically so you don't have to worry about reallocation if you picked the wrong size.
To append to std::vector, use the push_back method.
To compare std::string, use operator==:
std::string new_word;
std::vector<std::string> word_list;
//...
if (word_list[index] == new_word)
{
continue;
}
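Assembled into one function, the algorithm above might look like this sketch. unique_words is a hypothetical name, and std::find stands in for the hand-written search loop. It compares words exactly, so it does not yet treat "here" and "here?" as the same word.

```cpp
#include <algorithm>
#include <istream>
#include <sstream> // std::istringstream, handy for trying the function out
#include <string>
#include <vector>

// Reads whitespace-separated words and keeps only the first occurrence of each.
std::vector<std::string> unique_words(std::istream& in) {
    std::vector<std::string> word_list;
    std::string word;
    while (in >> word) { // stops at end of file or on a read error
        if (std::find(word_list.begin(), word_list.end(), word) == word_list.end())
            word_list.push_back(word); // not seen before
    }
    return word_list;
}
```

Any std::istream works as the argument, for example a std::ifstream opened on the data file or a std::istringstream wrapped around a test string.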
So you have said you know how to tokenize a string. (If you don't, spend some time here: https://stackoverflow.com/a/38595708/2642059.) So I'm going to assume that we're given a vector<string> foo which contains words with possibly trailing punctuation.
for(auto it = cbegin(foo); it != cend(foo); ++it) {
if(none_of(next(it), cend(foo), [&](const auto& i) {
const auto finish = mismatch(cbegin(*it), cend(*it), cbegin(i), cend(i));
return (finish.first == cend(*it) || !isalnum(*finish.first)) && (finish.second == cend(i) || !isalnum(*finish.second));
})) {
cout << *it << ' ';
}
}
It's worth noting here that you haven't given us rules on how to handle words like "down", "down-vote", and "downvote". This algorithm presumes that the first two are equal. You also haven't given us rules for how to handle "Why do, do we we here, here?". This algorithm always keeps the final repetition, so the output would be "Why do we here?"
If the presumptions made by this algorithm are not totally to your liking leave me a comment and we'll work on getting you comfortable with this code to where you can make the adjustments that you need.

Can I use 2 or more delimiters in C++ function getline? [duplicate]

I would like to know how I can use 2 or more delimiters in the getline function. This is my problem:
The program reads a text file; each line is going to look like:
New York, Paris, 100
CityA, CityB, 200
I am using getline(file, line), but that gives me the whole line, when I want to get CityA, then CityB, and then the number. If I use ',' as the delimiter, I won't know where the next line starts, so I'm trying to figure out a solution.
So, how could I use both comma and \n as delimiters?
By the way, I'm manipulating the string type, not char, so strtok is not possible :/
Some scratch code:
string line;
ifstream file("text.txt");
if(file.is_open())
while(!file.eof()){
getline(file, line);
// here I need to get each string before comma and \n
}
You can read a line using std::getline, then pass the line to a std::stringstream and read the comma-separated values off it:
string line;
ifstream file("text.txt");
if(file.is_open()){
    while(getline(file, line)){ // get a whole line
        std::stringstream ss(line);
        string field;
        while(getline(ss, field, ',')){
            // field now holds one comma-separated entry
        }
    }
}
No, std::getline() only accepts a single character, to override the default delimiter. std::getline() does not have an option for multiple alternate delimiters.
The correct way to parse this kind of input is to use the default std::getline() to read the entire line into a std::string, then construct a std::istringstream, and then parse it further, into comma-separated values.
However, if you are truly parsing comma-separated values, you should be using a proper CSV parser.
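For the two-city input in the question, that approach could be sketched as follows. Route, trim, and parse_routes are hypothetical names, and treating the third field as an integer distance is an assumption about what the numbers mean.

```cpp
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Assumed record layout: two city names followed by an integer.
struct Route {
    std::string from;
    std::string to;
    int distance;
};

// Returns s without leading or trailing spaces, tabs and carriage returns.
static std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

std::vector<Route> parse_routes(std::istream& in) {
    std::vector<Route> routes;
    std::string line;
    while (std::getline(in, line)) {      // '\n' separates records
        std::istringstream fields(line);
        std::string from, to, dist;
        if (std::getline(fields, from, ',') && // ',' separates fields
            std::getline(fields, to, ',') &&
            std::getline(fields, dist)) {
            try {
                routes.push_back({trim(from), trim(to), std::stoi(dist)});
            } catch (const std::exception&) {
                // skip lines whose third field is not a number
            }
        }
    }
    return routes;
}
```

This does not handle quoted fields or commas inside city names, which is where a proper CSV parser earns its keep.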
Often, it is more intuitive and efficient to parse character input in a hierarchical, tree-like manner, where you start by splitting the string into its major blocks, then go on to process each of the blocks, splitting them up into smaller parts, and so on.
An alternative to this is to tokenize like strtok does -- from the beginning of input, handling one token at a time until the end of input is encountered. This may be preferred when parsing simple inputs, because it is straightforward to implement. This style can also be used when parsing inputs with nested structure, but this requires maintaining some kind of context information, which might grow too complex to maintain inside a single function or limited region of code.
Someone relying on the C++ std library usually ends up using a std::stringstream, along with std::getline to tokenize string input. But, this only gives you one delimiter. They would never consider using strtok, because it is a non-reentrant piece of junk from the C runtime library. So, they end up using streams, and with only one delimiter, one is obligated to use a hierarchical parsing style.
But zneak brought up std::string::find_first_of, which takes a set of characters and returns the position nearest to the beginning of the string containing a character from the set. And there are other member functions: find_last_of, find_first_not_of, and more, which seem to exist for the sole purpose of parsing strings. But std::string stops short of providing useful tokenizing functions.
Another option is the <regex> library, which can do anything you want, but it is new and you will need to get used to its syntax.
But, with very little effort, you can leverage existing functions in std::string to perform tokenizing tasks, and without resorting to streams. Here is a simple example. get_to() is the tokenizing function and tokenize demonstrates how it is used.
The code in this example will be slower than strtok, because it constantly erases characters from the beginning of the string being parsed, and also copies and returns substrings. This makes the code easy to understand, but it does not mean more efficient tokenizing is impossible. It wouldn't even be that much more complicated than this -- you would just keep track of your current position, use this as the start argument in std::string member functions, and never alter the source string. And even better techniques exist, no doubt.
To understand the example's code, start at the bottom, where main() is and where you can see how the functions are used. The top of this code is dominated by basic utility functions and dumb comments.
#include <iostream>
#include <string>
#include <utility>
namespace string_parsing {
// in-place trim whitespace off ends of a std::string
inline void trim(std::string &str) {
auto space_is_it = [] (char c) {
// A few asks:
// * Suppress criticism WRT localization concerns
// * Avoid jumping to conclusions! And seeing monsters everywhere!
// Things like...ah! Believing "thoughts" that assumptions were made
// regarding character encoding.
// * If an obvious, portable alternative exists within the C++ Standard Library,
// you will see it in 2.0, so no new defect tickets, please.
// * Go ahead and ignore the rumor that using lambdas just to get
// local function definitions is "cheap" or "dumb" or "ignorant."
// That's the latest round of FUD from...*mumble*.
return c > '\0' && c <= ' ';
};
for(auto rit = str.rbegin(); rit != str.rend(); ++rit) {
if(!space_is_it(*rit)) {
if(rit != str.rbegin()) {
str.erase(&*rit - &*str.begin() + 1);
}
for(auto fit=str.begin(); fit != str.end(); ++fit) {
if(!space_is_it(*fit)) {
if(fit != str.begin()) {
str.erase(str.begin(), fit);
}
return;
} } } }
str.clear();
}
// get_to(string, <delimiter set> [, delimiter])
// The input+output argument "string" is searched for the first occurrence of one
// from a set of delimiters. All characters to the left of, and the delimiter itself
// are deleted in-place, and the substring which was to the left of the delimiter is
// returned, with whitespace trimmed.
// <delimiter set> is forwarded to std::string::find_first_of, so its type may match
// whatever this function's overloads accept, but this is usually expressed
// as a string literal: ", \n" matches commas, spaces and linefeeds.
// The optional output argument "found_delimiter" receives the delimiter character just found.
template <typename D>
inline std::string get_to(std::string& str, D&& delimiters, char& found_delimiter) {
const auto pos = str.find_first_of(std::forward<D>(delimiters));
if(pos == std::string::npos) {
// When none of the delimiters are present,
// clear the string and return its last value.
// This effectively makes the end of a string an
// implied delimiter.
// This behavior is convenient for parsers which
// consume chunks of a string, looping until
// the string is empty.
// Without this feature, it would be possible to
// continue looping forever, when an iteration
// leaves the string unchanged, usually caused by
// a syntax error in the source string.
// So the implied end-of-string delimiter takes
// away the caller's burden of anticipating and
// handling the range of possible errors.
found_delimiter = '\0';
std::string result;
std::swap(result, str);
trim(result);
return result;
}
found_delimiter = str[pos];
auto left = str.substr(0, pos);
trim(left);
str.erase(0, pos + 1);
return left;
}
template <typename D>
inline std::string get_to(std::string& str, D&& delimiters) {
char discarded_delimiter;
return get_to(str, std::forward<D>(delimiters), discarded_delimiter);
}
inline std::string pad_right(const std::string& str,
std::string::size_type min_length,
char pad_char=' ')
{
if(str.length() >= min_length ) return str;
return str + std::string(min_length - str.length(), pad_char);
}
inline void tokenize(std::string source) {
std::cout << source << "\n\n";
bool quote_opened = false;
while(!source.empty()) {
// If we just encountered an open-quote, only include the quote character
// in the delimiter set, so that a quoted token may contain any of the
// other delimiters.
const char* delimiter_set = quote_opened ? "'" : ",'{}";
char delimiter;
auto token = get_to(source, delimiter_set, delimiter);
quote_opened = delimiter == '\'' && !quote_opened;
std::cout << " " << pad_right('[' + token + ']', 16)
<< " " << delimiter << '\n';
}
std::cout << '\n';
}
}
int main() {
string_parsing::tokenize("{1.5, null, 88, 'hi, {there}!'}");
}
This outputs:
{1.5, null, 88, 'hi, {there}!'}
[] {
[1.5] ,
[null] ,
[88] ,
[] '
[hi, {there}!] '
[] }
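The more efficient variant mentioned earlier, which tracks the current position and never alters the source string, might look roughly like this. split_any is a hypothetical name. Unlike get_to(), this sketch does no trimming and has no notion of quotes.

```cpp
#include <string>
#include <vector>

// Splits src at every character found in delimiters, without modifying src.
std::vector<std::string> split_any(const std::string& src, const std::string& delimiters) {
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    while (true) {
        // search from the current position instead of erasing consumed text
        const auto pos = src.find_first_of(delimiters, start);
        if (pos == std::string::npos) {
            tokens.push_back(src.substr(start)); // the end of the string acts as a delimiter
            break;
        }
        tokens.push_back(src.substr(start, pos - start));
        start = pos + 1; // continue just after the delimiter
    }
    return tokens;
}
```

For example, split_any(text, ",\n") treats both commas and newlines as separators in a single pass. Adjacent delimiters produce empty tokens, which a caller may want to skip.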
I don't think that's how you should attack the problem (even if you could do it); instead:
Use what you have to read in each line
Then split up that line by the commas to get the pieces that you want.
If strtok will do the job for #2, you can always convert your string into a char array.
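A sketch of that conversion, assuming a mutable copy of the line is acceptable. split_with_strtok is a made-up name.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Tokenizes a std::string with strtok by working on a writable char buffer.
std::vector<std::string> split_with_strtok(const std::string& line, const char* delimiters) {
    // strtok writes '\0' into its input, so copy the text into a
    // mutable, null-terminated buffer first
    std::vector<char> buffer(line.begin(), line.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), delimiters); tok != nullptr;
         tok = std::strtok(nullptr, delimiters)) {
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

Keep in mind that strtok keeps hidden state between calls, so it is not safe to interleave two tokenizations or to use it from several threads. It also silently skips empty fields, and a delimiter set that includes a space will split names such as "New York".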

How to make sure the words being read in from the file are how I want them to be C++

If I had to read in a word from a document (one word at a time), and then pass that word into a function until I reach the end of the file, how would I do this?
What also must be kept in mind is that a word is any consecutive string of letters and the apostrophe ( so can't or rojas' is one word). Something like bad-day should be two separate words, and something like to-be-husband should be 3 separate words. I also need to ignore periods ., semi-colons ;, and pretty much anything that isn't part of a word. I have been reading it in using file >> s; and then removing stuff from the string but it has gotten very complicated. Is there a way to store into s only alphabet characters+apostrophes and stop at the end of a word (when a space occurs)?
while (!file.eof()) {
string s;
file >> s; //this is how I am currently reading it in
passToFunction(s);
}
Yes, there is a way: simply write the code to do it. Read one character at a time, and collect the characters in the string, until you get a non-alphabetic, non-apostrophe character. You've now read one word. Wait until you read the next character that's a letter or an apostrophe, and then you take it from the top.
One other thing:
while (!file.eof())
This is always a bug, and a wrong thing to do. Just thought I'd mention this. I suppose that fixing this is going to be your first order of business, before writing the rest of your code.
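Both points can be sketched together: the loop below tests the read itself instead of eof(), and builds words from letters and apostrophes only. read_words is a hypothetical name, and returning a vector rather than calling passToFunction directly is a choice made to keep the example self-contained.

```cpp
#include <cctype>
#include <istream>
#include <sstream> // std::istringstream, handy for trying the function out
#include <string>
#include <vector>

// Collects maximal runs of letters and apostrophes from the stream.
std::vector<std::string> read_words(std::istream& in) {
    std::vector<std::string> words;
    std::string current;
    char c;
    while (in.get(c)) { // the loop ends as soon as a read fails
        if (std::isalpha(static_cast<unsigned char>(c)) || c == '\'') {
            current += c; // still inside a word
        } else if (!current.empty()) {
            words.push_back(current); // any other character ends the word
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current); // word at end of input
    return words;
}
```

With this, "bad-day" yields two words and "can't" stays one. A lone apostrophe would also count as a word, so a caller might want to filter that case.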
OnlyLetterNumAndApp facet for a stream
#include <locale>
#include <string>
#include <fstream>
#include <iostream>
// This facet treats letters/numbers and apostrophe as alpha
// Everything else is treated like a space.
//
// This makes reading words with operator>> very easy to use
// when you want to ignore all the other characters.
class OnlyLetterNumAndApp: public std::ctype<char>
{
public:
typedef std::ctype<char> base;
typedef base::char_type char_type;
OnlyLetterNumAndApp(std::locale const& l)
: base(table)
{
std::ctype<char> const& defaultCType = std::use_facet<std::ctype<char> >(l);
for(int loop = 0;loop < 256;++loop) {
table[loop] = (defaultCType.is(base::alnum, loop) || loop == '\'')
? base::alpha
: base::space;
}
}
private:
base::mask table[256];
};
Usage
int main()
{
std::ifstream file;
file.imbue(std::locale(std::locale(), new OnlyLetterNumAndApp(std::locale())));
file.open("test.txt");
std::string word;
while(file >> word) {
std::cout << word << "\n";
}
}
Test File
> cat test.txt
This is %%% a test djkhfdkjfd
try another $gh line's
bad-people.Do bad things
Result
> ./a.out
This
is
a
test
djkhfdkjfd
try
another
gh
line's
bad
people
Do
bad
things

Cannot get second while to loop properly

I'm making a function that removes elements from a string. However, I can't seem to get both of my loops to work together. The first while loop works flawlessly. I looked into it and I believe it might be because when "find_last_of" isn't found, it still returns a value (which is throwing off my loop). I haven't been able to figure out how I can fix it. Thank you.
#include <iostream>
#include <string>
using namespace std;
string foo(string word) {
string compare = "!##$";
string alphabet = "abcdefghijklmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
while(word.find_first_of(compare) < word.find_first_of(alphabet)) {
int position = word.find_first_of(compare);
word = word.substr(++position);
}
while(word.find_last_of(compare) > word.find_last_of(alphabet)){
int size = word.length();
word = word.substr(0, --size);
}
return word;
}
int main() {
cout << foo("!!hi!!");
return 0;
}
I wrote it like this so compound words would not be affected. Desired result: "hi"
It's not entirely clear what you're trying to do, but how about replacing the second loop with this:
string::size_type p = word.find_last_not_of(compare);
if(p != string::npos)
word = word.substr(0, ++p);
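Extending the same idea to both ends, a trimming helper could be sketched like this. trim_chars is a made-up name, and it assumes the goal is to strip the unwanted characters only from the front and back of the word.

```cpp
#include <string>

// Removes characters in unwanted from both ends of word, leaving the middle alone.
std::string trim_chars(const std::string& word, const std::string& unwanted) {
    const auto first = word.find_first_not_of(unwanted);
    if (first == std::string::npos) return ""; // nothing but unwanted characters
    const auto last = word.find_last_not_of(unwanted); // cannot be npos here
    return word.substr(first, last - first + 1);
}
```

The npos check matters because find_first_not_of and find_last_not_of return std::string::npos (a very large value) when nothing is found, which is the kind of result that can throw off comparisons like the ones in the original loops.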
It's not clear if you just want to trim certain characters from the front and back of word or if you want to remove every one of a certain set of characters from word no matter where they are. Based on the first sentence of your question, I'll assume you want to do the latter: remove all characters in compare from word.
A better strategy would be to more directly examine each character to see if it needs to be removed, and if so, do so, all in one pass through word. Since compare is quite short, something like this is probably good enough:
// Rewrite word by removing all characters in compare (and then erasing the
// leftover space, if any, at the end). See std::remove_if() docs.
word.erase(std::remove_if(word.begin(),
word.end(),
// Returns true if a character is to be removed.
[&](const char ch) {
return compare.find(ch) != compare.npos;
}),
word.end());
BTW, I'm not sure why there is both a compare and alphabet string in your example. It seems you would only need to define one or the other, and not both. A character is either one to keep or one to remove.