Removing duplicates and counting duplicates in a text file with C++ - c++

I am a beginner at C++. I created a text file with two columns in it. However, there are around 1 million rows, and many of them are repeated. I want to delete the duplicates and count how many times each row occurred, putting that count into a third column. This is what it would look like before and after:
Before:
10 8
11 7
10 8
10 8
15 12
11 7
After:
10 8 3
11 7 2
15 12 1
I don't really know where to start. Can someone point me in the right direction of what I should be looking up in order to do this?

You can create a std::map<std::pair<int, int>, int> keyed on the pair of numbers. For each row you read, check whether the pair is already in the map. If it is, just increment its count; otherwise emplace it in the map with a count of 1.
Something like this:
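#include <iostream>
#include <map>
#include <utility> // for std::pair
int main() {
    std::map<std::pair<int, int>, int> rows; // pair -> number of occurrences
    int num1;
    int num2;
    // read two numbers per row until the input ends
    while (std::cin >> num1 >> num2) {
        auto pair = std::make_pair(num1, num2);
        if (rows.find(pair) != rows.end())
            ++rows[pair];          // seen before: bump the count
        else
            rows.emplace(pair, 1); // first occurrence
    }
    // print each unique row followed by its count
    for (const auto& row : rows)
        std::cout << row.first.first << ' ' << row.first.second << ' ' << row.second << '\n';
}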

#include <string>
#include <fstream>
#include <unordered_map>
using namespace std;
int main()
{
string line;
unordered_map<string, int> count_map;
ifstream src("input.txt");
if (!src.is_open())
{
return -1;
}
while (getline(src, line))
{
if (line.empty())
continue;
count_map[line]++;
}
src.close();
ofstream dst("output.txt");
if (!dst.is_open())
{
return -2;
}
for (auto & iter : count_map)
{
dst << iter.first << " " << iter.second << endl;
}
dst.close();
return 0;
}

#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <set>
using namespace std;
int main() {
ifstream src("input.txt");
if (!src.is_open()) {
return -1;
}
// store each line, filter out all of duplicated strings
set<string> container;
// key is to maintain the order of lines, value is a pair<K, V>
// K is the itor pointed to the string in the container
// V is the counts of the string
map<int, std::pair<set<string>::iterator, int>> mp;
// key is the pointer which points to the string in the container
// value is the index of string in the file
map<const string *, int> index;
string line;
int idx = 0; // index of the string in the file
while (getline(src, line)) {
if (line.empty()) {
continue;
}
auto res = container.insert(line);
if (res.second) {
index[&(*res.first)] = idx;
mp[idx] = {res.first, 1};
idx++;
} else {
mp[index[&(*res.first)]].second += 1;
}
}
src.close();
ofstream dst("output.txt");
if (!dst.is_open()) {
return -2;
}
for (const auto & iter : mp) {
dst << *iter.second.first << " " << iter.second.second << endl;
}
dst.close();
return 0;
}
BTW, Redis can solve this problem easily if you are allowed to use it.

This can be done with a std::priority_queue, which always hands back the largest remaining entry, so popping it yields the entries in sorted (descending) order. With the data sorted like this, one only has to count the number of consecutive identical entries:
#include <queue>
#include <iostream>
#include <vector>
#include <utility> // for std::pair
int main() {
std::priority_queue<std::pair<int,int>> mydat;
mydat.push(std::make_pair(10,8));
mydat.push(std::make_pair(11,7));
mydat.push(std::make_pair(10,8));
mydat.push(std::make_pair(10,8));
mydat.push(std::make_pair(15,12));
mydat.push(std::make_pair(11,7));
std::vector<std::vector<int>> out;
std::pair<int,int> previous;
int counter;
while(!mydat.empty()) {
counter = 1;
previous = mydat.top();
mydat.pop(); // move on to next entry
while(!mydat.empty() && previous == mydat.top()) { // test empty() first: top() on an empty queue is undefined behavior
previous = mydat.top();
mydat.pop();
counter++;
}
out.push_back({previous.first, previous.second, counter});
}
for(int i = 0; i < out.size(); ++i) {
std::cout << out[i][0] << " " << out[i][1] << " " << out[i][2] << std::endl;
}
}
godbolt demo
Output:
15 12 1
11 7 2
10 8 3

Related

I'm getting this warning sign Array index 4001 is past the end of the array (which contains 4001 elements)

I have two questions
First: when I try to run the code, it gives me a warning that says "Array index 4001 is past the end of the array (which contains 4001 elements)".
Second: I want to read the words from the file and then pass them through the function so that I can add the words to the hash table, index them accordingly, and print the count of the unique words from the text file. The size function is supposed to do that. Can someone please help me with this?
#include <iostream>
#include <string>
#include <fstream>
#define HASHSIZE 4001
using namespace std;
class entry {
public:
string word;
int frequency;
entry() { frequency = 0; }
};
class Hashtable {
private:
entry entryArr[HASHSIZE];
int updateArr[HASHSIZE];
int costArr[HASHSIZE];
int sizeUnique = 0;
int probeCount;
int updateCount;
public:
int HashKey(string key)
{
int totalsum = 0;
// this function is to assign every word a key value to be stored against.
for (int i = 0; i < key.length(); i++) totalsum += int(key[i]);
return (totalsum % HASHSIZE);
}
void update(string key) {
int k = HashKey(key);
if (entryArr[k].frequency == 0) {
entryArr[k].frequency++;
updateCount++;
probeCount++;
sizeUnique++;
}
// function to enter the unique words in the array
else if (entryArr[k].word == key) {
entryArr[k].frequency++;
probeCount++;
}
while (entryArr[k].frequency != 0 && entryArr[k].word != key) {
k++;
}
if (entryArr[k].word == key) {
entryArr[k].frequency++;
} else {
entryArr[k].word = key;
}
sizeUnique++;
updateCount++;
probeCount++;
}
int probes() {
costArr[HASHSIZE] = probeCount;
return probeCount;
}
int size() // function to count the total number of unique words occuring
{
int count = 0;
updateArr[HASHSIZE] = updateCount;
for (int i = 0; i < HASHSIZE; i++)
if (updateArr[HASHSIZE] != 0) {
count = costArr[i] / updateArr[i];
}
cout << count;
return count;
}
};
int main() {
entry e;
Hashtable h;
ifstream thisfile("RomeoAndJuliet.txt");
if (thisfile.is_open()) {
while (!thisfile.eof) {
h.update(e.word);
}
thisfile.close();
cout << "The total number of unique words are: " << h.size();
}
return 0;
}
An array with 4001 elements has valid indexes 0, 1, ..., 3999, 4000, because C++ indexes arrays from 0.
When i try to run the code it gives me a warning where it says "Array index 4001 is past the end of the array (which contains 4001 elements)"
This is because array indexes start at 0, not 1, so an array of size 4001 can safely be indexed (accessed) up to 4000 but not at 4001. In your code, costArr[HASHSIZE] in probes() and updateArr[HASHSIZE] in size() both access index 4001.
I want to read the words from the file and then pass them through the function so i can add the words to the hash table and index them accordingly and print the count of the unique words from the text file
The program below shows how to do this. It counts how many times each word occurs in a given input.txt file and then prints that count next to the word.
#include <iostream>
#include <map>
#include <sstream>
#include <fstream>
#include <string>
int main() {
std::string line, word;
//this map maps the std::string to their respective count
std::map<std::string, int> wordCount;
std::ifstream inFile("input.txt");
if(inFile)
{
while(getline(inFile, line, '\n'))
{
std::istringstream ss(line);
while(ss >> word)
{
//std::cout<<"word:"<<word<<std::endl;
wordCount[word]++;
}
}
}
else
{
std::cout<<"file cannot be opened"<<std::endl;
}
inFile.close();
std::cout<<"Total unique words are: "<<wordCount.size()<<std::endl;
for(std::pair<std::string, int> pairElement: wordCount)
{
std::cout << pairElement.first <<"-" << pairElement.second<<std::endl;
}
return 0;
}
The output of this program can be seen here.
Note that (as shown in the example above) there is no need to create a separate class for the purpose described in your second question. The counting itself takes only 4 to 6 lines of code (excluding opening and closing the file).

Finding item in string and say WHEN it was found - c++

I have a string of items (see code). I want to say when a specific item from that list is found. In my example I want the output to be 3 since the item is found after the first two items. I can print out the separate items to the console but I cannot figure out how to do a count on these two items. I think it is because of the while loop... I always get numbers like 11 instead of two separate 1s. Any tips? :)
#include <iostream>
#include <string>
using namespace std;
int main() {
string items = "box,cat,dog,cat";
string delim = ",";
size_t pos = 0;
string token;
string item1 = "dog";
int count = 0;
while ((pos = items.find(delim)) != string::npos)
{
token = items.substr(0, pos);
if (token != item1)
{
cout << token << endl; //here I would like to increment count for every
//item before item1 (dog) is found
items.erase(0, pos + 1);
}
else if (token == item1)
return 0;
}
return 0; //output: box cat
}
I replaced your search algorithm with the method explode, that separates your string by a delimiter and returns a vector, which is better suited for searching and getting the element count:
#include <string>
#include <vector>
#include <sstream>
#include <iostream>
#include <algorithm>
std::vector<std::string> explode(const std::string& s, char delim)
{
std::vector<std::string> result;
std::istringstream iss(s);
for (std::string token; std::getline(iss, token, delim); )
{
result.push_back(std::move(token));
}
return result;
}
int main()
{
std::string items = "box,cat,dog,cat";
std::string item1 = "dog";
char delim = ',';
auto resultVec = explode(items, delim);
auto itResult = std::find_if(resultVec.begin(), resultVec.end()
, [&item1](const auto& resultString)
{
return item1 == resultString;
});
if (itResult != resultVec.end())
{
auto index(std::distance(resultVec.begin(), itResult) + 1); // std::distance is zero-based, so add 1 for a 1-based position
std::cout << index;
}
return 0;
}
By using std::find_if you can get the position of item1 by iterator, which you can use with std::distance to get the count of elements that are in front of it.
Credits for the explode method go to this post: Is there an equivalent in C++ of PHP's explode() function?
There are many roads to Rome. Here is an additional solution using std::regex.
The main approach is the same as in the accepted answer, but with modern C++17 language elements it is a little more compact.
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <vector>
#include <algorithm> // for std::find
const std::regex re{ "," };
int main() {
std::string items{ "box,cat,dog,cat" };
// Split String and put all sub-items in a vector
std::vector subItems(std::sregex_token_iterator(items.begin(), items.end(), re, -1), {});
// Search and check if found and show result
if (auto it = std::find(subItems.begin(), subItems.end(), "dog"); it != subItems.end())
std::cout << "Found at position: " << std::distance(subItems.begin(), it) + 1 << '\n';
else
std::cout << "Not found.\n";
return 0;
}

C++ Sorting Filenames In A Directory

I wanted to have some advice about the code I have.
I managed to get what I wanted done, but I do not think it is the "proper" way of doing it in the programmers' world.
Could you help me improve the code by any means and also if there are any better ways of doing this please share them as well.
I have files named in the format:
501.236.pcd
501.372.pcd
...
612.248.pcd etc.
I want to sort the filenames in ascending order using C++.
This is the code I use:
#include <string>
#include <iostream>
#include <boost/filesystem.hpp>
#include <sstream>
using namespace std;
using namespace boost::filesystem;
int main()
{
vector <string> str,parsed_str;
path p("./fake_pcd");
string delimiter = ".";
string token,parsed_filename;
size_t pos = 0;
int int_filename;
vector <int> int_dir;
//insert filenames in the directory to a string vector
for (auto i = directory_iterator(p); i != directory_iterator(); i++)
{
if (!is_directory(i->path())) //we eliminate directories in a list
{
str.insert(str.end(),i->path().filename().string());
}
else
continue;
}
//parse each string element in the vector, split from each delimiter
//add each token together and convert to integer
//put inside a integer vector
parsed_str = str;
for (std::vector<string>::iterator i=parsed_str.begin(); i != parsed_str.end(); ++i)
{
cout << *i << endl;
while ((pos = i->find(delimiter)) != string::npos) {
token = i->substr(0,pos);
parsed_filename += token;
i->erase(0, pos + delimiter.length());
}
int_filename = stoi(parsed_filename);
int_dir.push_back(int_filename);
parsed_filename = "";
}
cout << endl;
parsed_str.clear();
sort(int_dir.begin(), int_dir.end());
//print the sorted integers
for(vector<int>::const_iterator i=int_dir.begin(); i != int_dir.end(); i++) {
cout << *i << endl;
}
//convert sorted integers to string and put them back into string vector
for (auto &x : int_dir) {
stringstream ss;
ss << x;
string y;
ss >> y;
parsed_str.push_back(y);
}
cout << endl;
//change the strings so that they are like the original filenames
for(vector<string>::iterator i=parsed_str.begin(); i != parsed_str.end(); i++) {
*i = i->substr(0,3) + "." + i->substr(3,3) + ".pcd";
cout << *i << endl;
}
}
This is the output, first part is in the order the directory_iterator gets it, the second part is the filenames sorted in integers, and the last part is where I change the integers back into strings in the original filename format.
612.948.pcd
612.247.pcd
501.567.pcd
501.346.pcd
501.236.pcd
512.567.pcd
613.008.pcd
502.567.pcd
612.237.pcd
612.248.pcd
501236
501346
501567
502567
512567
612237
612247
612248
612948
613008
501.236.pcd
501.346.pcd
501.567.pcd
502.567.pcd
512.567.pcd
612.237.pcd
612.247.pcd
612.248.pcd
612.948.pcd
613.008.pcd
Taking a few hints from e.g. Filtering folders in Boost Filesystem and in the interest of total overkill:
Live On Coliru Using Boost (also On Wandbox.org)
#include <boost/range/adaptors.hpp>
#include <boost/spirit/home/x3.hpp>
#include <boost/filesystem.hpp>
#include <iostream>
#include <optional>
#include <set>
namespace fs = boost::filesystem;
namespace {
using Path = fs::path;
struct Ranked {
std::optional<int> rank;
Path path;
explicit operator bool() const { return rank.has_value(); }
bool operator<(Ranked const& rhs) const { return rank < rhs.rank; }
};
static Ranked rank(Path const& p) {
if (p.extension() == ".pcd") {
auto stem = p.stem().native();
std::string digits;
using namespace boost::spirit::x3;
if (phrase_parse(begin(stem), end(stem), +digit >> eoi, punct, digits))
return { std::stoul(digits), p };
}
return { {}, p };
}
}
int main() {
using namespace boost::adaptors;
auto dir = boost::make_iterator_range(fs::directory_iterator("."), {})
| transformed(std::mem_fn(&fs::directory_entry::path))
| transformed(rank)
;
std::multiset<Ranked> index(begin(dir), end(dir));
for (auto& [rank, path] : index) {
std::cout << rank.value_or(-1) << "\t" << path << "\n";
}
}
Prints:
-1 "./main.cpp"
-1 "./a.out"
501008 "./501.008.pcd"
501236 "./501.236.pcd"
501237 "./501.237.pcd"
501247 "./501.247.pcd"
501248 "./501.248.pcd"
501346 "./501.346.pcd"
501567 "./501.567.pcd"
501948 "./501.948.pcd"
502008 "./502.008.pcd"
502236 "./502.236.pcd"
502237 "./502.237.pcd"
502247 "./502.247.pcd"
502248 "./502.248.pcd"
502346 "./502.346.pcd"
502567 "./502.567.pcd"
502948 "./502.948.pcd"
512008 "./512.008.pcd"
512236 "./512.236.pcd"
512237 "./512.237.pcd"
512247 "./512.247.pcd"
512248 "./512.248.pcd"
512346 "./512.346.pcd"
512567 "./512.567.pcd"
512948 "./512.948.pcd"
612008 "./612.008.pcd"
612236 "./612.236.pcd"
612237 "./612.237.pcd"
612247 "./612.247.pcd"
612248 "./612.248.pcd"
612346 "./612.346.pcd"
612567 "./612.567.pcd"
612948 "./612.948.pcd"
613008 "./613.008.pcd"
613236 "./613.236.pcd"
613237 "./613.237.pcd"
613247 "./613.247.pcd"
613248 "./613.248.pcd"
613346 "./613.346.pcd"
613567 "./613.567.pcd"
613948 "./613.948.pcd"
BONUS: No-Boost Solution
Since the filesystem library has been standardized (as std::filesystem in C++17), here is a version that uses it together with Range-v3:
Live On Wandbox
#include <filesystem>
#include <iostream>
#include <map>
#include <optional>
#include <range/v3/action/remove_if.hpp>
#include <range/v3/range/conversion.hpp>
#include <range/v3/view/filter.hpp>
#include <range/v3/view/subrange.hpp>
#include <range/v3/view/transform.hpp>
namespace fs = std::filesystem;
namespace {
using namespace ranges;
using Ranked = std::pair<std::optional<int>, fs::path>;
bool has_rank(Ranked const& v) { return v.first.has_value(); }
static Ranked ranking(fs::path const& p) {
if (p.extension() == ".pcd") {
auto stem = p.stem().native();
auto non_digit = [](uint8_t ch) { return !std::isdigit(ch); };
stem |= actions::remove_if(non_digit);
return { std::stoul(stem), p };
}
return { {}, p };
}
}
int main() {
using It = fs::directory_iterator;
for (auto&& [rank, path] : subrange(It("."), It())
| views::transform(std::mem_fn(&fs::directory_entry::path))
| views::transform(ranking)
| views::filter(has_rank)
| to<std::multimap>())
{
std::cout << rank.value_or(-1) << "\t" << path << "\n";
}
}
Prints e.g.
501236 "./501.236.pcd"
501346 "./501.346.pcd"
501567 "./501.567.pcd"
502567 "./502.567.pcd"

Finding ALL Non Repeating characters in a given string

So I was given the question:
Find ALL of the non-repeating characters in a given string;
After doing some Google searching it was clear to me that finding the first non repeating character was pretty common. I found many examples of how to do that, but I have not really found anything on how to find ALL of the non repeating characters instead of just the first one.
My example code so far is:
#include <iostream>
#include <unordered_map>
using namespace std;
char findAllNonRepeating(const string& s) {
unordered_map<char, int> m;
for (unsigned i = 0; i < s.length(); ++i) {
char c = tolower(s[i]);
if (m.find(c) == m.end())
m[c] = 1;
else
++m[c];
}
auto best = m.begin();
for (auto it = m.begin(); it != m.end(); ++it)
if (it->second <= best->second)
best = it;
return (best->first);
}
int main()
{
cout << findAllNonRepeating("dontknowwhattochangetofindallnonrepeatingcharacters") << endl;
}
I am not sure what I need to change or add to have this find all of the non repeating characters.
k, f, p, s should be the non repeating characters in this string.
Any hints or ideas are greatly appreciated!
As suggested, simply keep a frequency map. Then, once the string is processed, iterate over the map and return only those characters whose count is exactly one.
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;
std::vector<char> nonRepeating(const std::string& s)
{
std::map<char, int> frequency;
for(int i=0;i<s.size();i++)
{
frequency[s[i]]++;
}
std::vector<char> out;
for(auto it = frequency.begin(); it != frequency.end(); it++)
{
if(it->second == 1)
out.push_back(it->first);
}
return out;
}
int main() {
// your code goes here
std::string str = "LoremIpsum";
for(char c : nonRepeating(str))
{
std::cout << c << std::endl;
}
return 0;
}

c++ compare vector<string> x to string s

I am trying to compare a string in a vector to another string:
I tried:
vector<string> x;
string y;
if(x[i] == y)
if(x[i].compare(y) == 0)
if(y.compare(x[i]) == 0)
if(x.at(i) == y)
if(x.at(i).compare(y) == 0)
if(y.compare(x.at(i)) == 0)
I also tried copying x[i] / x.at(i) into a string z first, with no luck. I get no compile errors and no other problems; the comparison at index i just never evaluates to true.
g++ -v 4.9.3
Windows 7: cygwin64
c++11, compiling using the -std=c++11 call
I print both strings out, they are identical, but it does not want to compare them.
------- source cpp
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;
vector<string> get_file(const char* file){
int SIZE=256, ln=0;
char str[SIZE];
vector<string> strs;
ifstream in(file, ios::in);
if(!in){
return strs;
} else {
while(in.getline(str,SIZE)){
strs.push_back(string(str));
ln++;
}
}
in.close();
return strs;
}
void convert_file(const char* file){
vector<string> s = get_file(file);
vector<string> d;
int a, b;
bool t = false;
for(int i=0; i<s.size(); i++){
string comp = "--";
string m = s.at(i);
cout << m << endl;
if(m == comp){ //test string compare
cout << "s[i] == '--'" << endl;
}
}
}
int main(){
convert_file("dvds.txt");
return 0;
}
----- dvds.txt
--
title:The Shawshank Redemption
director:Stephan King
release_date:14-10-1994
actors:Tim Robbins,Morgan Freeman,Bob Guton
genres:Crime,Drama
rating:R
price:4.99
--
title:Test title
director:Stephan King
release_date:10-10-1990
actors:Morgan Freeman,random 2,random 4
genres:Adventure,Comedy
rating:PG-13
price:4.99
--
title:Test 3
director:None
release_date:15-52-516
actors:Tim Robbins,None,None 2
genres:Crime,Comedy
rating:PG-17
price:4.99
--
---- running
C:\drew\projects>g++ -o a -std=c++11 source.cpp
C:\drew\projects>a
It prints the contents of dvds.txt just fine, but the comparison never succeeds when it should.
Try the following:
#include <string>
#include <iostream>
#include <vector>
int main() {
std::vector<std::string> x;
std::string y("AAAA");
x.push_back("AAAA");
int i = 0;
if (x[i] == y) {
std::cout << y << std::endl;
}
return 0;
}
It works for me. You might want to double check your variables and their types. You might also want to try the same thing with a different compiler.