How to find repeated words in file with vector C++

I have a file with an unknown number of words, and some of them are repeated an unknown number of times. I have to find those repeated words. I use classes and a vector to work with the words, and fstream to work with the file. I cannot find a resource or an algorithm for finding repeated words, and I'm puzzled. I have a vector of strings and I push the words into it. That part works; I tested it by printing v.size(). I have done everything except the algorithm for finding the repeated words, which turned out to be difficult for me.
My full code:
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <algorithm>
#include <stdio.h>
#include <iterator>
using namespace std;
class Wording {
private:
string word;
vector <string> v;
public:
Wording(string Alternateword, vector <string> Alternatev) {
v = Alternatev;
word = Alternateword;
}
};
int main() {
ifstream ifs("words.txt");
ofstream ofs("wordresults.txt");
string word;
vector <string> v;
Wording obj(word,v);
while(ifs >> word) v.push_back(word);
for(int i=0; i<v.size(); i++) {
//waiting for algorithm
//ofs << v[i] << endl;
}
return 0;
}

Try using a hash map. With older GNU C++ it is __gnu_cxx::hash_map (a non-standard extension in <ext/hash_map>). In C++11, you can use std::unordered_map, which gives you the same capabilities. Otherwise, an equivalent container is available from Boost (boost::unordered_map), and probably elsewhere.
The key concept here is a hash map from word to count, i.e. unordered_map<word, count>.
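A minimal sketch of the word-to-count idea, assuming C++11 and whitespace-separated words. The function name repeated_words is made up for this illustration and does not come from the question or the answer.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: reads whitespace-separated words from `in` and
// returns, in alphabetical order, every word that occurs more than once.
std::vector<std::string> repeated_words(std::istream& in) {
    std::unordered_map<std::string, int> counts;
    std::string word;
    while (in >> word)
        ++counts[word];  // operator[] value-initializes a missing count to 0

    std::vector<std::string> repeated;
    for (const auto& entry : counts)
        if (entry.second > 1)
            repeated.push_back(entry.first);

    // Iteration order of a hash map is unspecified, so sort for stable output.
    std::sort(repeated.begin(), repeated.end());
    return repeated;
}
```

With the question's code, this could be called as repeated_words(ifs) on the opened std::ifstream, and the returned words written to ofs one per line.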

Are the unique words in the input file what you want? If so, you can do this with a set (an unordered_set if you don't need them sorted), like so:
std::set<std::string> words; // can be changed to unordered_set
std::copy(std::istream_iterator<std::string>(ifs), std::istream_iterator<std::string>(), std::inserter(words, words.begin()));
std::copy(words.begin(), words.end(), std::ostream_iterator<std::string>(ofs, "\n"));
You can also use a vector, but you'll have to sort it and then use unique on it.
I can't compile this code right now, so there might be some errors in my code snippet.
If what you want is the number of occurrences of each distinct word in the file, then you'll have to use some kind of map, as was already suggested. Of course, using a vector, sorting it, and then counting consecutive equal words is also a solution, but it wouldn't be as clear.
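For comparison, here is a rough sketch of the vector-only alternative mentioned above: sort, then count runs of equal words. The name count_sorted is an assumption made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: takes the words by value, sorts the copy, and
// returns (word, count) pairs in alphabetical order.
std::vector<std::pair<std::string, std::size_t>>
count_sorted(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());

    std::vector<std::pair<std::string, std::size_t>> counts;
    auto it = words.begin();
    while (it != words.end()) {
        // In a sorted range, upper_bound finds the end of the run equal to *it.
        auto run_end = std::upper_bound(it, words.end(), *it);
        counts.emplace_back(*it, static_cast<std::size_t>(run_end - it));
        it = run_end;
    }
    return counts;
}
```

Any pair whose count is greater than 1 is a repeated word.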

Related

Is there a way to add a string into a Vector but only if it doesn't exist already? ->C++

I am trying to print a database using OOP in c++. But in my file.csv there are a lot of similar elements, so I'm trying to print each name only once. I know this is vague, but for you experts, this is an easy one. If possible please use basic coding.

If you are not limited to vector, use unordered_set to populate the data. This is implemented using a hash table. A sample program looks like below.
#include <unordered_set>
#include <string>
#include <iostream>
using namespace std;
int main() {
unordered_set <string> data;
data.insert("code1");
data.insert("code2");
//duplicate
data.insert("code2");
cout << "\nAll elements : ";
unordered_set<string> :: iterator itr;
for (itr = data.begin(); itr != data.end(); itr++)
cout << (*itr) << endl;
return 0;
}
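If you do have to stay with a vector, one possible sketch is to search before appending. The helper name push_back_unique is invented for this example; each call scans the whole vector, so this is only reasonable for small inputs.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: appends `name` only if it is not already present.
// Returns true if the name was added.
bool push_back_unique(std::vector<std::string>& names, const std::string& name) {
    if (std::find(names.begin(), names.end(), name) != names.end())
        return false;  // already stored, skip it
    names.push_back(name);
    return true;
}
```

Unlike unordered_set, this keeps the names in the order they were first seen.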

You can insert some sql statement in your c++ code to filter out similar elements in your database before you store/use it in your program. Take a look at this page How can i insert values into database(mySql) using cpp program? and this page https://www.w3schools.com/sql/sql_where.asp
Without database, this may help you Filter out duplicate values in array in C++

Comparing unordered_map vs unordered_set

First of all, what is the main difference between them?
The only thing I've found is that unordered_set has no operator[].
How should I access an element in an unordered_set, since there is no []?
Which container uses random access to memory (or do both)?
And which one of them is faster, or uses less memory?

They are nearly identical. unordered_set only contains keys, and no values. There is no mapping from a key to a value, so no need for an operator[]. unordered_map maps a key to a value.
You can use the various find methods within unordered_set to locate things.
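A small sketch of the difference, assuming C++11. The helper names contains and value_or are made up for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Membership test: an unordered_set stores only keys.
bool contains(const std::unordered_set<std::string>& keys, const std::string& key) {
    return keys.find(key) != keys.end();
}

// Lookup: an unordered_map stores a value next to each key.
// Returns `fallback` (a parameter chosen for this sketch) when the key is absent.
int value_or(const std::unordered_map<std::string, int>& table,
             const std::string& key, int fallback) {
    auto it = table.find(key);
    return it != table.end() ? it->second : fallback;
}
```

Both containers use find the same way; only the map's iterator has a second member holding the mapped value.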

You can use iterators to access elements.
unordered_set <string> u{
"Dog",
"Cat",
"Rat",
"Parrot",
"bee"
};
for(auto& s:u){
cout << s << ' ';
}
unordered_set<string>::const_iterator point = u.find("bee");

How should I access an element in unordered_set (C++17)?
In C++17, a new member function, extract, was added to unordered_set.
Notably, this is the only way to take a move-only object out of the set.
https://en.cppreference.com/w/cpp/container/unordered_set/extract
For example, suppose you want the third element of your unordered set (in iteration order, which is unspecified for an unordered container).
Advance the iterator:
std::advance(it,2);
Then extract the value (this also removes the element from the set):
s.extract(it).value();
Here is the complete code. Try it on any C++17 compiler.
#include <iostream>
#include <string>
#include <unordered_set>
#include <iterator>
int main()
{
//CREATE AN OBJECT
std::unordered_set<std::string> s;
//INSERT DATA
s.insert("aee");
s.insert("bee");
s.insert("cee");
s.insert("dee");
//NEED TO INCLUDE "iterator" HEADER TO USE "std::advance"
auto it = s.begin();
std::advance(it,2);
//USING EXTRACT
std::string sval = s.extract(it).value();
std::cout<<sval;
}
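Note: if you advance the iterator past the end of the set, the behavior is undefined. The program may appear to do nothing and print no result, but that is not guaranteed.
For example, do not change the code to this:
//ONLY FOUR ELEMENTS
std::advance(it,8);
//USING EXTRACT
std::string sval = s.extract(it).value();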
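A bounds-checked variant might look like the sketch below, assuming a C++17 compiler. The function extract_nth and its signature are assumptions made for this example, not part of the standard library.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical helper: removes the element at position `n` in the set's
// (unspecified) iteration order and stores it in `out`.
// Returns false, leaving the set untouched, if `n` is out of range.
bool extract_nth(std::unordered_set<std::string>& s, std::size_t n, std::string& out) {
    if (n >= s.size())
        return false;  // never advance an iterator past end()
    auto it = s.begin();
    std::advance(it, static_cast<std::ptrdiff_t>(n));
    out = std::move(s.extract(it).value());  // the element leaves the set
    return true;
}
```

Checking n against s.size() first keeps std::advance within the valid range.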

how to find duplicates in std::vector<string> and return a list of them?

So if I have a vector of words like:
Vec1 = "words", "words", "are", "fun", "fun"
resulting list: "fun", "words"
I am trying to determine which words are duplicated, and return an alphabetized vector of 1 copy of them. My problem is that I don't even know where to start, the only thing close to it I found was std::unique_copy which doesn't exactly do what I need. And specifically, I am inputting a std::vector<std::string> but outputting a std::list<std::string>. And if needed, I can use functor.
Could someone at least push me in the right direction, please? I already tried reading the STL documentation, but I am just "brain" blocked right now.

In 3 lines (not counting the vector and list creation or the superfluous line breaks added for readability):
vector<string> vec{"words", "words", "are", "fun", "fun"};
list<string> output;
sort(vec.begin(), vec.end());
set<string> uvec(vec.begin(), vec.end());
set_difference(vec.begin(), vec.end(),
uvec.begin(), uvec.end(),
back_inserter(output));
EDIT
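Explanation of the solution:
Sorting the vector is needed in order to use set_difference() later.
The uvec set automatically keeps its elements sorted and eliminates duplicates.
The output list is populated with the elements of vec - uvec. Note that a word occurring n times in vec ends up n - 1 times in output, so a word that occurs three or more times is listed more than once.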

Make an empty std::unordered_set<std::string>.
Iterate over your vector, checking whether each item is a member of the set.
If it's already in the set, this is a duplicate, so add it to your result list.
Otherwise, add it to the set.
Since you want each duplicate listed only once in the results, you can use a hash set (not a list) for the results as well.
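The steps above can be sketched roughly as follows. The function name find_duplicates is made up here, and an ordered std::set is used for the results instead of a hash set so that the returned list comes out alphabetized, as the question asks.

```cpp
#include <list>
#include <set>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns each duplicated word once, alphabetically.
std::list<std::string> find_duplicates(const std::vector<std::string>& words) {
    std::unordered_set<std::string> seen;  // every word encountered so far
    std::set<std::string> dups;            // ordered, and stores each word once
    for (const auto& w : words) {
        if (seen.count(w))
            dups.insert(w);   // seen before: it is a duplicate
        else
            seen.insert(w);   // first occurrence
    }
    return std::list<std::string>(dups.begin(), dups.end());
}
```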

IMO, Ben Voigt started with a good basic idea, but I would caution against taking his wording too literally.
In particular, I dislike the idea of searching for the string in the set, then adding it to your set if it's not present, and adding it to the output if it was present. This basically means every time we encounter a new word, we search our set of existing words twice, once to check whether a word is present, and again to insert it because it wasn't. Most of that searching will be essentially identical -- unless some other thread mutates the structure in the interim (which could give a race condition).
Instead, I'd start by trying to add it to the set of words you've seen. That returns a pair<iterator, bool>, with the bool set to true if and only if the value was inserted -- i.e., was not previously present. That lets us consolidate the search for an existing string and the insertion of the new string together into a single insert:
while (input >> word)
if (!(existing.insert(word)).second)
output.insert(word);
This also cleans up the flow enough that it's pretty easy to turn the test into a functor that we can then use with std::remove_copy_if to produce our results quite directly:
#include <set>
#include <iterator>
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>
class show_copies {
std::set<std::string> existing;
public:
bool operator()(std::string const &in) {
return existing.insert(in).second;
}
};
int main() {
std::vector<std::string> words{ "words", "words", "are", "fun", "fun" };
std::set<std::string> result;
std::remove_copy_if(words.begin(), words.end(),
std::inserter(result, result.end()), show_copies());
for (auto const &s : result)
std::cout << s << "\n";
}
Depending on whether I cared more about code simplicity or execution speed, I might use an std::vector instead of the set for result, and use std::sort followed by std::unique_copy to produce the final result. In such a case I'd probably also replace the std::set inside of show_copies with an std::unordered_set instead:
#include <unordered_set>
#include <iterator>
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>
class show_copies {
std::unordered_set<std::string> existing;
public:
bool operator()(std::string const &in) {
return existing.insert(in).second;
}
};
int main() {
std::vector<std::string> words{ "words", "words", "are", "fun", "fun" };
std::vector<std::string> intermediate;
std::remove_copy_if(words.begin(), words.end(),
std::back_inserter(intermediate), show_copies());
std::sort(intermediate.begin(), intermediate.end());
std::unique_copy(intermediate.begin(), intermediate.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}
This is marginally more complex (one whole line longer!) but likely to be substantially faster when/if the number of words gets very large. Also note that I'm using std::unique_copy primarily to produce visible output. If you just want the result in a collection, you can use the standard unique/erase idiom to get unique items in intermediate.
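For reference, the unique/erase idiom mentioned above might be wrapped like this. The function name sort_and_dedupe is an assumption for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts `items` and removes repeated values in place.
void sort_and_dedupe(std::vector<std::string>& items) {
    std::sort(items.begin(), items.end());
    // std::unique moves the distinct values to the front and returns the new
    // logical end; erase then drops the leftover tail.
    items.erase(std::unique(items.begin(), items.end()), items.end());
}
```

Applied to intermediate, this would replace the separate std::sort and std::unique_copy calls when the result is needed in a container rather than printed.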

In place (no additional storage). No string copying (except to result list). One sort + one pass:
#include <string>
#include <vector>
#include <list>
#include <iostream>
#include <algorithm>
using namespace std;
int main() {
vector<string> vec{"words", "words", "are", "fun", "fun"};
list<string> dup;
sort(vec.begin(), vec.end());
const string empty{""};
const string* prev_p = &empty;
for(const string& s: vec) {
if (*prev_p==s) dup.push_back(s);
prev_p = &s;
}
for(auto& w: dup) cout << w << ' ';
cout << '\n';
}

You can get a pretty clean implementation using a std::map to count the occurrences, and then relying on std::list::sort to sort the resulting list of words. For example:
std::list<std::string> duplicateWordList(const std::vector<std::string>& words) {
std::map<std::string, int> temp;
std::list<std::string> ret;
for (std::vector<std::string>::const_iterator iter = words.begin(); iter != words.end(); ++iter) {
temp[*iter] += 1;
// only add the word to our return list on the second copy
// (first copy doesn't count, third and later copies have already been handled)
if (temp[*iter] == 2) {
ret.push_back(*iter);
}
}
ret.sort();
return ret;
}
Using a std::map there seems a little wasteful, but it gets the job done.

Here's a better algorithm than the ones other people have proposed:
#include <algorithm>
#include <string>
#include <utility>
#include <vector>
template<class It> It unique2(It const begin, It const end)
{
It i = begin;
if (i != end)
{
It j = i;
for (++j; j != end; ++j)
{
if (*i != *j)
{ using std::swap; swap(*++i, *j); }
}
++i;
}
return i;
}
int main()
{
std::vector<std::string> v;
v.push_back("words");
v.push_back("words");
v.push_back("are");
v.push_back("fun");
v.push_back("words");
v.push_back("fun");
v.push_back("fun");
std::sort(v.begin(), v.end());
v.erase(v.begin(), unique2(v.begin(), v.end()));
std::sort(v.begin(), v.end());
v.erase(unique2(v.begin(), v.end()), v.end());
}
It's better because it only requires swap with no auxiliary vector for storage, which means it will behave optimally for earlier versions of C++, and it doesn't require elements to be copyable.
If you're more clever, I think you can avoid sorting the vector twice as well.

How to get a string of union set from a vector string?

I have a vector string filled with some file extensions as follows:
vector<string> vExt;
vExt.push_back("*.JPG;*.TGA;*.TIF");
vExt.push_back("*.PNG;*.RAW");
vExt.push_back("*.BMP;*.HDF");
vExt.push_back("*.GIF");
vExt.push_back("*.JPG");
vExt.push_back("*.BMP");
I now want to get a string of union set from the above-mentioned vector string, in which each file extension must be unique in the resulting string. As for my given example, the resulting string should take the form of "*.JPG;*.TGA;*.TIF;*.PNG;*.RAW;*.BMP;*.HDF;*.GIF".
I know that std::unique can remove consecutive duplicates in a range. It won't work in my case. Would you please show me how to do that? Thank you!

See it live here: http://ideone.com/0fmy0 (FIXED)
#include <iostream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <vector>
#include <set>
#include <string>
int main()
{
std::vector<std::string> vExt;
vExt.push_back("*.JPG;*.TGA;*.TIF");
vExt.push_back("*.PNG;*.RAW");
vExt.push_back("*.BMP;*.HDF");
vExt.push_back("*.GIF");
vExt.push_back("*.JPG");
vExt.push_back("*.BMP");
std::stringstream ss;
std::copy(vExt.begin(), vExt.end(),
std::ostream_iterator<std::string>(ss, ";"));
std::string element;
std::set<std::string> unique;
while (std::getline(ss, element, ';'))
unique.insert(unique.end(), element);
std::stringstream oss;
std::copy(unique.begin(), unique.end(),
std::ostream_iterator<std::string>(oss, ";"));
std::cout << oss.str() << std::endl;
return 0;
}
output:
*.BMP;*.GIF;*.HDF;*.JPG;*.PNG;*.RAW;*.TGA;*.TIF;

I'd tokenize each string into constituent parts (using semicolon as the separator), and stick the resulting tokens into a set. The resultant contents of that set is what you're looking for.
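A std::set returns the extensions in sorted order, while the question's expected string keeps them in first-seen order. One possible sketch that preserves that order, assuming C++11, is shown below; the function name join_unique_extensions is invented for this example.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: splits every entry on ';' and joins the distinct
// tokens with ';', keeping the order in which they first appear.
std::string join_unique_extensions(const std::vector<std::string>& groups) {
    std::unordered_set<std::string> seen;
    std::string result;
    for (const auto& group : groups) {
        std::istringstream tokens(group);
        std::string ext;
        while (std::getline(tokens, ext, ';')) {
            // insert().second is false when the token was already present
            if (ext.empty() || !seen.insert(ext).second)
                continue;
            if (!result.empty())
                result += ';';
            result += ext;
        }
    }
    return result;
}
```

The set is only used to detect repeats; the output string is built as tokens are accepted, so there is no trailing semicolon.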

You need to parse the strings that contain multiple file extensions and push the individual extensions into the vector. After that, sorting the vector and applying std::unique will do what you want (std::unique only removes consecutive duplicates). Have a look at the Boost.Tokenizer class; it should make this trivial.

Random Integer List

If I had a list of integers separated by a space on one line (eg: 50 34 1 3423 5 345) then what would be the best way of making each of them a separate integer variable - collecting the list of integers with cin?

#include <algorithm>
#include <iostream>
#include <vector>
#include <iterator>
std::vector<int> ints;
std::copy(std::istream_iterator<int>(std::cin),
std::istream_iterator<int>(),
std::back_inserter(ints));
Done. If you really need to read explicitly line by line:
#include <algorithm>
#include <sstream>
#include <string>
#include <iostream>
#include <vector>
#include <iterator>
std::string singleline;
std::istringstream iss; // out of loop for performance
while (std::getline(std::cin, singleline))
{
iss.clear(); // reset eofbit/failbit left over from the previous line
iss.str(singleline);
std::copy(std::istream_iterator<int>(iss),
std::istream_iterator<int>(),
std::back_inserter(ints));
}
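An istream_iterator<int> will repeatedly apply operator>>(int&) to the referenced stream (until the end of the stream). By default this silently skips whitespace. If an input operation fails (e.g. non-integer input is encountered), the stream's failbit is set and the iterator becomes equal to the end-of-stream iterator, so copying stops. No exception is thrown unless exceptions have been enabled on the stream.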
The back_inserter is an output iterator that you can use with all container types (like vector) that support the .push_back operation. So in fact what is written there in STL algorithmese is similar to
std::vector<int> ints;
int myint;
while (iss>>myint)
{
ints.push_back(myint);
}
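As a compact illustration of the same idea for a single line of text, here is a sketch that builds the vector directly from a pair of stream iterators. The function name parse_ints is an assumption for this example.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses whitespace-separated integers from one line.
// Parsing stops at the first token that is not an integer.
std::vector<int> parse_ints(const std::string& line) {
    std::istringstream iss(line);
    // The vector's range constructor reads until the iterator reaches
    // end-of-stream, which happens at EOF or on the first failed extraction.
    return std::vector<int>(std::istream_iterator<int>(iss),
                            std::istream_iterator<int>());
}
```

Because a fresh istringstream is created per call, there is no leftover stream state to clear between lines.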

In follow-up to sehe's answer, here's how you'd do it a little more verbosely (ahem).
The algorithms sehe used basically do this internally. This answer is included mostly for clarity.
#include <iostream>
#include <vector>
int main()
{
std::vector<int> myInts;
int tmp;
while (std::cin >> tmp) {
myInts.push_back(tmp);
}
// Now `myInts` is a vector containing all the integers
}

Have a look at the man pages for strtok( ) and atoi( )
