Given an Array of strings how do I Remove Duplicates? - c++

I would like to know how to remove duplicate strings from a container, but ignore word differences from trailing punctuation.
For example given these strings:
Why do do we we here here?
I would like to get this output:
Why do we here?

The algorithm:
While Reading a word is successful, do:
If End of file, quit.
If word list is empty, push back word.
else begin
Search word list for the word.
if word doesn't exist, push back the word.
end else (step 4)
end (while reading a word)
Use std::string for your word.
This allows you to do the following:
std::string word;
while (data_file >> word)
{
}
Use std::vector to contain your words (although you could use std::list as well). The std::vector grows dynamically so you don't have to worry about reallocation if you picked the wrong size.
To append to std::vector, use the push_back method.
To compare std::string, use operator==:
std::string new_word;
std::vector<std::string> word_list;
//...
if (word_list[index] == new_word)
{
continue;
}

So you have said you know how to tokenize a string. (If you don't spend some time here: https://stackoverflow.com/a/38595708/2642059) So I'm going to assume that we're given a vector<string> foo which contains words with possibly trailing punctuation.
for(auto it = cbegin(foo); it != cend(foo); ++it) {
if(none_of(next(it), cend(foo), [&](const auto& i) {
const auto finish = mismatch(cbegin(*it), cend(*it), cbegin(i), cend(i));
return (finish.first == cend(*it) || !isalnum(*finish.first)) && (finish.second == cend(i) || !isalnum(*finish.second));
})) {
cout << *it << ' ';
}
}
Live Example
It's worth noting here that you haven't given us rules on how to handle words like: "down", "down-vote", and "downvote" This algorithm presumes that the 1st 2 are equal. You also haven't given us rules for how to handle: "Why do, do we we here, here?" This algorithm always returns the final repetition, so the output would be "Why do we here?"
If the presumptions made by this algorithm are not totally to your liking leave me a comment and we'll work on getting you comfortable with this code to where you can make the adjustments that you need.

Related

How do I make an alphabetized list of all distinct words in a file with the number of times each word was used?

I am writing a program using Microsoft Visual C++. In the program I must read in a text file and print out an alphabetized list of all distinct words in that file with the number of times each word was used.
I have looked up different ways to alphabetize a string but they do not work with the way I have my string initialized.
// What is inside my text file
Any experienced programmer engaged in writing programs for use by others knows
that, once his program is working correctly, good output is a must. Few people
really care how much time and trouble a programmer has spent in designing and
debugging a program. Most people see only the results. Often, by the time a
programmer has finished tackling a difficult problem, any output may look
great. The programmer knows what it means and how to interpret it. However,
the same cannot be said for others, or even for the programmer himself six
months hence.
string lines;
getline(input, lines); // Stores what is in file into the string
I expect an alphabetized list of words with the number of times each word was used. So far, I do not know how to begin this process.
It's rather simple, std::map automatically sorts based on key in the key/value pair you get. The key/value pair represents word/count which is what you need. You need to do some filtering for special characters and such.
EDIT: std::stringstream is a nice way of splitting std::string using whitespace delimiter as it's the default delimiter. Therefore, using stream >> word you will get whitespace-separated words. However, this might not be enough due to punctuation. For example: Often, has comma which we need to filter out. Therefore, I used std::replaceif which replaces puncts and digits with whitespaces.
Now a new problem arises. In your example, you have: "must.Few" which will be returned as one word. After replacing . with we have "must Few". So I'm using another stringstream on the filtered "word" to make sure I have only words in the final result.
In the second loop you will notice if(word == "") continue;, this can happen if the string is not trimmed. If you look at the code you will find out that we aren't trimming after replacing puncts and digits. That is, "Often," will be "Often " with trailing whitespace. The trailing whitespace causes the second loop to extract an empty word. This is why I added the condition to ignore it. You can trim the filtered result and then you wouldn't need this check.
Finally, I have added ignorecase boolean to check if you wish to ignore the case of the word or not. If you wish to do so, the program will simply convert the word to lowercase and then add it to the map. Otherwise, it will add the word the same way it found it. By default, ignorecase = true, if you wish to consider case, just call the function differently: count_words(input, false);.
Edit 2: In case you're wondering, the statement counts[word] will automatically create key/value pair in the std::map IF there isn't any key matching word. So when we call ++: if the word isn't in the map, it will create the pair, and increment value by 1 so you will have newly added word. If it exists already in the map, this will increment the existing value by 1 and hence it acts as a counter.
The program:
#include <iostream>
#include <map>
#include <sstream>
#include <cstring>
#include <cctype>
#include <string>
#include <iomanip>
#include <algorithm>
std::string to_lower(const std::string& str) {
std::string ret;
for (char c : str)
ret.push_back(tolower(c));
return ret;
}
std::map<std::string, size_t> count_words(const std::string& str, bool ignorecase = true) {
std::map<std::string, size_t> counts;
std::stringstream stream(str);
while (stream.good()) {
// wordW may have multiple words connected by special chars/digits
std::string wordW;
stream >> wordW;
// filter special chars and digits
std::replace_if(wordW.begin(), wordW.end(),
[](const char& c) { return std::ispunct(c) || std::isdigit(c); }, ' ');
// now wordW may have multiple words seperated by whitespaces, extract them
std::stringstream word_stream(wordW);
while (word_stream.good()) {
std::string word;
word_stream >> word;
// ignore empty words
if (word == "") continue;
// add to count.
ignorecase ? counts[to_lower(word)]++ : counts[word]++;
}
}
return counts;
}
void print_counts(const std::map<std::string, size_t>& counts) {
for (auto pair : counts)
std::cout << std::setw(15) << pair.first << " : " << pair.second << std::endl;
}
int main() {
std::string input = "Any experienced programmer engaged in writing programs for use by others knows \
that, once his program is working correctly, good output is a must.Few people \
really care how much time and trouble a programmer has spent in designing and \
debugging a program.Most people see only the results.Often, by the time a \
programmer has finished tackling a difficult problem, any output may look \
great.The programmer knows what it means and how to interpret it.However, \
the same cannot be said for others, or even for the programmer himself six \
months hence.";
auto counts = count_words(input);
print_counts(counts);
return 0;
}
I have tested this with Visual Studio 2017 and here is the part of the output:
a : 5
and : 3
any : 2
be : 1
by : 2
cannot : 1
care : 1
correctly : 1
debugging : 1
designing : 1
As others have already noted, an std::map handles the counting you care about quite easily.
Iostreams already have a tokenize to break an input stream up into words. In this case, we want to to only "think" of letters as characters that can make up words though. A stream uses a locale to make that sort of decision, so to change how it's done, we need to define a locale that classifies characters as we see fit.
struct alpha_only: std::ctype<char> {
alpha_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
// everything is white space
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
// except lower- and upper-case letters, which are classified accordingly:
std::fill(&rc['a'], &rc['z'], std::ctype_base::lower);
std::fill(&rc['A'], &rc['Z'], std::ctype_base::upper);
return &rc[0];
}
};
With that in place, we tell the stream to use our ctype facet, then simply read words from the file and count them in the map:
std::cin.imbue(std::locale(std::locale(), new alpha_only));
std::map<std::string, std::size_t> counts;
std::string word;
while (std::cin >> word)
++counts[to_lower(word)];
...and when we're done with that, we can print out the results:
for (auto w : counts)
std::cout << w.first << ": " << w.second << "\n";
Id probably start by inserting all of those words into an array of strings, then start with the first index of the array and compare that with all of the other indexes if you find matches, add 1 to a counter and after you went through the array you could display the word you were searching for and how many matches there were and then go onto the next element and compare that with all of the other elements in the array and display etc. Or maybe if you wanna make a parallel array of integers that holds the number of matches you could do all the comparisons at one time and the displays at one time.
EDIT:
Everyone's answer seems more elegant because of the map's inherent sorting. My answer functions more as a parser, that later sorts the tokens. Therefore my answer is only useful to the extent of a tokenizer or lexer, whereas Everyone's answer is only good for sorted data.
You first probably want to read in the text file. You want to use a streambuf iterator to read in the file(found here).
You will now have a string called content, which is the content of you file. Next you will want to iterate, or loop, over the contents of this string. To do that you'll want to use an iterator. There should be a string outside of the loop that stores the current word. You will iterate over the content string, and each time you hit a letter character, you will add that character to your current word string. Then, once you hit a space character, you will take that current word string, and push it back into the wordString vector. (Note: that means that this will ignore non-letter characters, and that only spaces denote word separation.)
Now that we have a vector of all of our words in strings, we can use std::sort, to sort the vector in alphabetical order.(Note: capitalized words take precedence over lowercase words, and therefore will be sorted first.) Then we will iterate over our vector of stringWords and convert them into Word objects (this is a little heavy-weight), that will store their appearances and the word string. We will push these Word objects into a Word vector, but if we discover a repeat word string, instead of adding it into the Word vector, we'll grab the previous entry and increment its appearance count.
Finally, once this is all done, we can iterate over our Word object vector and output the word followed by its appearances.
Full Code:
#include <vector>
#include <fstream>
#include <iostream>
#include <streambuf>
#include <algorithm>
#include <string>
class Word //define word object
{
public:
Word(){appearances = 1;}
~Word(){}
int appearances;
std::string mWord;
};
bool isLetter(const char x)
{
return((x >= 'a' && x <= 'z') || (x >= 'A' && x <= 'Z'));
}
int main()
{
std::string srcFile = "myTextFile.txt"; //what file are we reading
std::ifstream ifs(srcFile);
std::string content( (std::istreambuf_iterator<char>(ifs) ),
( std::istreambuf_iterator<char>() )); //read in the file
std::vector<std::string> wordStringV; //create a vector of word strings
std::string current = ""; //define our current word
for(auto it = content.begin(); it != content.end(); ++it) //iterate over our input
{
const char currentChar = *it; //make life easier
if(currentChar == ' ')
{
wordStringV.push_back(current);
current = "";
continue;
}
else if(isLetter(currentChar))
{
current += *it;
}
}
std::sort(wordStringV.begin(), wordStringV.end(), std::less<std::string>());
std::vector<Word> wordVector;
for(auto it = wordStringV.begin(); it != wordStringV.end(); ++it) //iterate over wordString vector
{
std::vector<Word>::iterator wordIt;
//see if the current word string has appeared before...
for(wordIt = wordVector.begin(); wordIt != wordVector.end(); ++wordIt)
{
if((*wordIt).mWord == *it)
break;
}
if(wordIt == wordVector.end()) //...if not create a new Word obj
{
Word theWord;
theWord.mWord = *it;
wordVector.push_back(theWord);
}
else //...otherwise increment the appearances.
{
++((*wordIt).appearances);
}
}
//print the words out
for(auto it = wordVector.begin(); it != wordVector.end(); ++it)
{
Word theWord = *it;
std::cout << theWord.mWord << " " << theWord.appearances << "\n";
}
return 0;
}
Side Notes
Compiled with g++ version 4.2.1 with target x86_64-apple-darwin, using the compiler flag -std=c++11.
If you don't like iterators you can instead do
for(int i = 0; i < v.size(); ++i)
{
char currentChar = vector[i];
}
It's important to note that if you are capitalization agnostic simply use std::tolower on the current += *it; statement (ie: current += std::tolower(*it);).
Also, you seem like a beginner and this answer might have been too heavyweight, but you're asking for a basic parser and that is no easy task. I recommend starting by parsing simpler strings like math equations. Maybe make a calculator app.

Is there alternative str.find in c++?

I have got a queue fifo type (first in, first out) with strings in it. Every string is sentence. I need to find a word in it, and show it on console. The problem is, that when i used str.find("word") it can showed sentence with "words".
Add white space and some symbols like ".,?!" = str.find("word ") etc. but its not a solution
if (head != nullptr)
do {
if (head->zdanie_kol.find("promotion") != string::npos ||
head->zdanie_kol.find("discount") != string::npos ||
head->zdanie_kol.find("sale") != string::npos ||
head->zdanie_kol.find("offer") != string::npos)
cout << head->zdanie_kol << endl;
} while (head != nullptr);
For example, i got two sentences, one is correct, another one is not.
Correct:
We have a special OFFER for you an email database which allows to contact eBay members both sellers and shoppers.
Not Correct:
Do not lose your chance sign up and find super PROMOTIONS we prepared for you!
The three simplest solutions I can think of for this are:
Once you get the result simply check the next character. If it's a whitespace or '\0', you found your match. Make sure to check the character before too so you don't match sword when looking for word. Also make sure you're not reading beyond the string memory.
Tokenize the string first. This will break the sentence into words and you can then check word by word to see if it matches. You can do this with strtok().
Use regular expression (e.g. regex_match()) as mentioned in the comments. Depending on the engine you choose, the syntax may differ, but most of them have a something like "\\bsale\\b" which will match on word boundary (see here for more information).
Here is a solution, using std::unordered_set and std::istringstream:
#include <unordered_set>
#include <string>
#include <sstream>
//...
std::unordered_set<std::string> filter_word = {"promotion", "discount", "sale", "offer"};
//...
std::istringstream strm(head->zdanie_kol);
std::string word;
while (strm >> word)
{
if (filter_word(word).count())
{
std::cout << head->zdanie_kol << std::endl;
break;
}
}
//...
If you had many more words to check instead of only 4 words, this solution seems easier to use since all you have to do is add those words to the unordered_set.

How to read a complex input with istream&, string& and getline in c++?

I am very new to C++, so I apologize if this isn't a good question but I really need help in understanding how to use istream.
There is a project I have to create where it takes several amounts of input that can be on one line or multiple and then pass it to a vector (this is only part of the project and I would like to try the rest on my own), for example if I were to input this...
>> aaa bb
>> ccccc
>> ddd fff eeeee
Makes a vector of strings with "aaa", "bb", "ccccc", "ddd", "fff", "eeeee"
The input can be a char or string and the program stops asking for input when the return key is hit.
I know getline() gets a line of input and I could probably use a while loop to try and get the input such as...(correct me if I'm wrong)
while(!string.empty())
getline(cin, string);
However, I don't truly understand istream and it doesn't help that my class has not gone over pointers so I don't know how to use istream& or string& and pass it into a vector. On the project description, it said to NOT use stringstream but use functionality from getline(istream&, string&). Can anyone give somewhat of a detailed explanation as to how to make a function using getline(istream&, string&) and then how to use it in the main function?
Any little bit helps!
You're on the right way already; solely, you'd have to pre-fill the string with some dummy to enter the while loop at all. More elegant:
std::string line;
do
{
std::getline(std::cin, line);
}
while(!line.empty());
This should already do the trick reading line by line (but possibly multiple words on one line!) and exiting, if the user enters an empty line (be aware that whitespace followed by newline won't be recognised as such!).
However, if anything on the stream goes wrong, you'll be trapped in an endless loop processing previous input again and again. So best check the stream state as well:
if(!std::getline(std::cin, line))
{
// this is some sample error handling - do whatever you consider appropriate...
std::cerr << "error reading from console" << std::endl;
return -1;
}
As there might be multiple words on a single line, you'd yet have to split them. There are several ways to do so, quite an easy one is using an std::istringstream – you'll discover that it ressembles to what you likely are used to using std::cin:
std::istringstream s(line);
std::string word;
while(s >> word)
{
// append to vector...
}
Be aware that using operator>> ignores leading whitespace and stops after first trailing one (or end of stream, if reached), so you don't have to deal with explicitly.
OK, you're not allowed to use std::stringstream (well, I used std::istringstream, but I suppose this little difference doesn't count, does it?). Changes matter a little, it gets more complex, on the other hand, we can decide ourselves what counts as words an what as separators... We might consider punctuation marks as separators just like whitespace, but allow digits to be part of words, so we'd accept e. g. ab.7c d as "ab", "7c", "d":
auto begin = line.begin();
auto end = begin;
while(end != line.end()) // iterate over each character
{
if(std::isalnum(static_cast<unsigned char>(*end)))
{
// we are inside a word; don't touch begin to remember where
// the word started
++end;
}
else
{
// non-alpha-numeric character!
if(end != begin)
{
// we discovered a word already
// (i. e. we did not move begin together with end)
words.emplace_back(begin, end);
// ('words' being your std::vector<std::string> to place the input into)
}
++end;
begin = end; // skip whatever we had already
}
}
// corner case: a line might end with a word NOT followed by whitespace
// this isn't covered within the loop, so we need to add another check:
if(end != begin)
{
words.emplace_back(begin, end);
}
It shouldn't be too difficult to adjust to different interpretations of what is a separator and what counts as word (e. g. std::isalpha(...) || *end == '_' to detect underscore as part of words, but digits not). There are quite a few helper functions you might find useful...
You could input the value of the first column, then call functions based on the value:
void Process_Value_1(std::istream& input, std::string& value);
void Process_Value_2(std::istream& input, std::string& value);
int main()
{
// ...
std::string first_value;
while (input_file >> first_value)
{
if (first_value == "aaa")
{
Process_Value_1(input_file, first_value);
}
else if (first_value = "ccc")
{
Process_Value_2(input_file, first_value);
}
//...
}
return 0;
}
A sample function could be:
void Process_Value_1(std::istream& input, std::string& value)
{
std::string b;
input >> b;
std::cout << value << "\t" << b << endl;
input.ignore(1000, '\n'); // Ignore until newline.
}
There are other methods to perform the process, such as using tables of function pointers and std::map.

Cannot get second while to loop properly

I'm making a function that removes elements from a string. However, I cant seem to get both of my loops to work together. The first while loop works flawlessly. I looked into it and I believe it might be because when "find_last_of" isn't found, it still returns a value (which is throwing off my loop). I haven't been able to figure out how I can fix it. Thank you.
#include <iostream>
#include <string>
using namespace std;
string foo(string word) {
string compare = "!##$";
string alphabet = "abcdefghijklmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
while(word.find_first_of(compare) < word.find_first_of(alphabet)) {
int position = word.find_first_of(compare);
word = word.substr(++position);
}
while(word.find_last_of(compare) > word.find_last_of(alphabet)){
int size = word.length();
word = word.substr(0, --size);
}
return word;
}
int main() {
cout << foo("!!hi!!");
return 0;
}
I wrote it like this so compound words would not be affected. Desired result: "hi"
It's not entirely clear what you're trying to do, but how about replacing the second loop with this:
string::size_type p = word.find_last_not_of(compare);
if(p != string::npos)
word = word.substr(0, ++p);
It's not clear if you just want to trim certain characters from the front and back of word or if you want to remove every one of a certain set of characters from word no matter where they are. Based on the first sentence of your question, I'll assume you want to do the latter: remove all characters in compare from word.
A better strategy would be to more directly examine each character to see if it needs to be removed, and if so, do so, all in one pass through word. Since compare is quite short, something like this is probably good enough:
// Rewrite word by removing all characters in compare (and then erasing the
// leftover space, if any, at the end). See std::remove_if() docs.
word.erase(std::remove_if(word.begin(),
word.end(),
// Returns true if a character is to be removed.
[&](const char ch) {
return compare.find(ch) != compare.npos;
}),
word.end());
BTW, I'm not sure why there is both a compare and alphabet string in your example. It seems you would only need to define one or the other, and not both. A character is either one to keep or one to remove.

Cleaning a string of punctuation in C++

Ok so before I even ask my question I want to make one thing clear. I am currently a student at NIU for Computer Science and this does relate to one of my assignments for a class there. So if anyone has a problem read no further and just go on about your business.
Now for anyone who is willing to help heres the situation. For my current assignment we have to read a file that is just a block of text. For each word in the file we are to clear any punctuation in the word (ex : "can't" would end up as "can" and "that--to" would end up as "that" obviously with out the quotes, quotes were used just to specify what the example was).
The problem I've run into is that I can clean the string fine and then insert it into the map that we are using but for some reason with the code I have written it is allowing an empty string to be inserted into the map. Now I've tried everything that I can come up with to stop this from happening and the only thing I've come up with is to use the erase method within the map structure itself.
So what I am looking for is two things, any suggestions about how I could a) fix this with out simply just erasing it and b) any improvements that I could make on the code I already have written.
Here are the functions I have written to read in from the file and then the one that cleans it.
Note: the function that reads in from the file calls the clean_entry function to get rid of punctuation before anything is inserted into the map.
Edit: Thank you Chris. Numbers are allowed :). If anyone has any improvements to the code I've written or any criticisms of something I did I'll listen. At school we really don't get feed back on the correct, proper, or most efficient way to do things.
int get_words(map<string, int>& mapz)
{
int cnt = 0; //set out counter to zero
map<string, int>::const_iterator mapzIter;
ifstream input; //declare instream
input.open( "prog2.d" ); //open instream
assert( input ); //assure it is open
string s; //temp strings to read into
string not_s;
input >> s;
while(!input.eof()) //read in until EOF
{
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
}
input.close(); //close instream
for(mapzIter = mapz.begin(); mapzIter != mapz.end(); mapzIter++)
cnt = cnt + mapzIter->second;
return cnt; //return number of words in instream
}
void clean_entry(const string& non_clean, string& clean)
{
int i, j, begin, end;
for(i = 0; isalnum(non_clean[i]) == 0 && non_clean[i] != '\0'; i++);
begin = i;
if(begin ==(int)non_clean.length())
return;
for(j = begin; isalnum(non_clean[j]) != 0 && non_clean[j] != '\0'; j++);
end = j;
clean = non_clean.substr(begin, (end-begin));
for(i = 0; i < (int)clean.size(); i++)
clean[i] = tolower(clean[i]);
}
The problem with empty entries is in your while loop. If you get an empty string, you clean the next one, and add it without checking. Try changing:
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
to
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() > 0)
{
mapz[not_s]++; //increment occurence
}
input >>s;
EDIT: I notice you are checking if the characters are alphanumeric. If numbers are not allowed, you may need to revisit that area as well.
Further improvements would be to
declare variables only when you use them, and in the innermost scope
use c++-style casts instead of the c-style (int) casts
use empty() instead of length() == 0 comparisons
use the prefix increment operator for the iterators (i.e. ++mapzIter)
A blank string is a valid instance of the string class, so there's nothing special about adding it into the map. What you could do is first check if it's empty, and only increment in that case:
if (!not_s.empty())
mapz[not_s]++;
Style-wise, there's a few things I'd change, one would be to return clean from clean_entry instead of modifying it:
string not_s = clean_entry(s);
...
string clean_entry(const string &non_clean)
{
string clean;
... // as before
if(begin ==(int)non_clean.length())
return clean;
... // as before
return clean;
}
This makes it clearer what the function is doing (taking a string, and returning something based on that string).
The function 'getWords' is doing a lot of distinct actions that could be split out into other functions. There's a good chance that by splitting it up into it's individual parts, you would have found the bug yourself.
From the basic structure, I think you could split the code into (at least):
getNextWord: Return the next (non blank) word from the stream (returns false if none left)
clean_entry: What you have now
getNextCleanWord: Calls getNextWord, and if 'true' calls CleanWord. Returns 'false' if no words left.
The signatures of 'getNextWord' and 'getNextCleanWord' might look something like:
bool getNextWord (std::ifstream & input, std::string & str);
bool getNextCleanWord (std::ifstream & input, std::string & str);
The idea is that each function does a smaller more distinct part of the problem. For example, 'getNextWord' does nothing but get the next non blank word (if there is one). This smaller piece therefore becomes an easier part of the problem to solve and debug if necessary.
The main component of 'getWords' then can be simplified down to:
std::string nextCleanWord;
while (getNextCleanWord (input, nextCleanWord))
{
++map[nextCleanWord];
}
An important aspect to development, IMHO, is to try to Divide and Conquer the problem. Split it up into the individual tasks that need to take place. These sub-tasks will be easier to complete and should also be easier to maintain.