C++ Compare words between 2 different text files - c++

I have 2 text files:
Main file: Library.txt
File to compare: fileToCompare.txt
The main file (Library.txt) contains a lot of words, but it is still incomplete, so I searched online for more words and saved them in fileToCompare.txt. Many words are bound to appear in both Library.txt and fileToCompare.txt, so to eliminate the duplicates I need to compare fileToCompare.txt against Library.txt and determine which words are the same.
My approach is to compare each word against Library.txt one at a time. Say the first word is "apple": it is compared with every word in Library.txt in turn, and if a match is found, "apple" occurs in both files. If it is not found, "apple" is printed to the console with cout and saved to a text file (the user is asked beforehand for the name of the file that receives the non-existing words).
I found that if fileToCompare.txt contains many words (e.g. a 1 MB file), comparing all of them takes an hour. So I came up with this idea:
fileToCompare.txt is sorted alphabetically, so it starts with words beginning with "a" (if there are any). The comparison runs as usual, and when it reaches the letter "b", the program creates another text file, Library2.txt, in the "lib/" directory.
I write (via ofstream) all the words from "b" onward to Library2.txt. From then on the program compares against Library2.txt instead of the main file; in effect, Library2.txt is now the main file.
The comparison continues from "b", and when it reaches "c" the program creates Library3.txt, writes all the words from "c" onward to it, and so on, until the words starting with "z" are done, which ends the comparison.
The problem is that this doesn't eliminate all the duplicate words: some are removed, but many are not. I checked the main file, and some of the words in the output file are already in it.
Here are the download links for Library.txt & fileToCompare.txt if you need them:
Library.txt -> https://www.dropbox.com/s/ihqpaju3b33ysgv/Library.txt?dl=0
fileToCompare.txt -> https://www.dropbox.com/s/pioy77g9mfz9och/fileToCompare.txt?dl=0
What I explained above might be confusing, and the code is quite messy; I know it's hard to understand and might take you a whole evening to figure out.
#include<iostream>
#include<conio.h>
#include<fstream>
using namespace std;
int main(){
string txt="fileToCompare.txt";
ifstream lib,lib2;
ofstream outWord,libMO;
string word,libStr,libStr2;
int found=0,count=0,countNF=0;
lib2.open(txt.c_str());
if(!lib2){
cout<<"\n Oops! "<<txt<<" is missing!\n If such file exists, be sure to check the file extension is .txt\n";
getch();
main();
}
cout<<"\n Enter the file name to save the non-existing words\n (required an extension at the end)\n";
getline(cin,word);
string libPath="lib/"+word,alphaStr="a",libtxt[26]={"Library.txt","lib/Library2.txt","lib/Library3.txt","lib/Library4.txt","lib/Library5.txt","lib/Library6.txt","lib/Library7.txt","lib/Library8.txt","lib/Library9.txt","lib/Library10.txt","lib/Library11.txt","lib/Library12.txt","lib/Library13.txt","lib/Library14.txt","lib/Library15.txt","lib/Library16.txt","lib/Library17.txt","lib/Library18.txt","lib/Library19.txt","lib/Library20.txt","lib/Library21.txt","lib/Library22.txt","lib/Library23.txt","lib/Library24.txt","lib/Library25.txt","lib/Library26.txt"};
const char* wordChar=libPath.c_str();
const char* libManip=libtxt[0].c_str();
int alphaI=1,boolcheck=1;
lib.open(libManip);
outWord.open(wordChar);
while(getline(lib2,libStr2)){
if(libStr2.substr(0,1)!=alphaStr){
lib.close();
lib.open(libManip);
libMO.open(libtxt[alphaI].c_str());
while(getline(lib,libStr)){
if(libStr.substr(0,1)!=alphaStr){
libMO<<libStr<<endl;
}
}
libManip=libtxt[alphaI].c_str();
libMO.close();
lib.close();
alphaI++;
alphaStr=libStr2.substr(0,1);
boolcheck=1;
}
if(boolcheck==1){
lib.close();
lib.open(libManip);
boolcheck=0;
}
while(getline(lib,libStr)){
if(libStr==libStr2){
found=1;
break;
}
}
if(!found){
cout<<"\n "<<libStr2;
outWord<<libStr2<<endl;
countNF++;
}
count++;
found=0;
}
cout<<"\n\n\n Total words: "<<count<<"\n Total words reserved: "<<countNF;
lib2.close();
lib.close();
getch();
return 0;
}

You should use a different algorithm / data structure for the comparison.
The following example uses a std::set. It reads both files and writes the merged result into merged.txt:
#include <iostream>
#include <set>
#include <string>
#include <fstream>
int main()
{
std::ifstream lib("Library.txt");
std::set<std::string> lib_set;
std::string word;
while (lib >> word)
{
lib_set.insert(word);
}
std::ifstream check("fileToCompare.txt");
while (check >> word)
{
lib_set.insert(word);
}
std::ofstream merged("merged.txt");
std::set<std::string>::iterator it;
for (it = lib_set.begin(); it != lib_set.end(); ++it)
{
merged << *it << std::endl;
}
}
Executing this for your dataset takes 0.8 seconds on my computer.

Since the files fileToCompare.txt and Library.txt are sorted alphabetically, your code can take advantage of that.
Read a word from each file.
If the two words are the same, read the next word from each file.
If the word from fileToCompare.txt is less than the word from Library.txt, keep the word from Library.txt and read the next word from fileToCompare.txt. Otherwise, keep the word from fileToCompare.txt and read the next word from Library.txt.
Keep doing that until there are no more words to read.
At the end, if there are still any more words left in fileToCompare.txt, read and print them.
The following program follows the above logic and seems to work for me.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
void compareFiles(ifstream& txtf, ifstream& libf)
{
string txtWord;
string libWord;
bool readTxt = true;
bool readLib = true;
while ( true )
{
if ( readLib )
{
// Try to read the next word from the libf
// If the read is not successful, break out of the loop.
if ( ! (libf >> libWord) )
{
break;
}
}
if ( readTxt )
{
// Try to read the next word from the txtf
// If the read is not successful, break out of the loop.
if ( ! (txtf >> txtWord) )
{
break;
}
}
if ( txtWord == libWord )
{
// The same word exists in both files.
// Read the next words from both files.
readTxt = readLib = true;
continue;
}
// The words differ. Because both files are sorted, the smaller word
// decides which file to advance.
if ( txtWord < libWord )
{
// txtWord cannot appear later in the library, so it is missing there.
// Print it, then read the next txtWord but keep the current libWord.
cout << txtWord << endl;
readTxt = true;
readLib = false;
}
else
{
// libWord is smaller, so txtWord may still show up later in the library.
// Read the next libWord but keep the current txtWord.
readTxt = false;
readLib = true;
}
// The two assignments in each branch can be shortened to
// readTxt = (txtWord < libWord);
// readLib = !readTxt;
}
// If the library ran out while a txtWord was still waiting for a match,
// that word is not in the library.
if ( !readTxt )
{
cout << txtWord << endl;
}
// When the previous while loop ends, there might be more words in txtf.
// Read the remaining words from txtf and print them.
while ( txtf >> txtWord )
{
cout << txtWord << endl;
}
}
void compareFiles(string const& txt, string const& lib)
{
ifstream txtf(txt);
ifstream libf(lib);
compareFiles(txtf, libf);
}
int main()
{
string txt="fileToCompare.txt";
string lib="Library.txt";
compareFiles(txt, lib);
return 0;
}

Related

Why does replacing a string replace part of another string?

I wrote a small piece of code today about replacing a word from a text file.
Although it replaces the given word, it also removes some spaces and overwrites part of another string.
I want it to replace only the given word while keeping the rest as it is.
I don't know what I should do. Any help would be appreciated!
Original Data of file:
Is anyone there?
Who survived?
Somebody new?
Anyone else but you
On a lonely night
Was a burning light
A hundred years, we'll be born again
Output when replaced "anyone" by "porter":
Is anyonportere?
Who survived?
Somebody new?
anporterlse but you
On a lonely night
Was a burning light
A hundred years, we'll be born again
Code:
#include<iostream>
#include<fstream>
#include<cstdlib>
#include<cstring>
using namespace std;
int main(int argc , char* argv[])
{
string old_word,new_word;
int no=0;
if(argc<4)
{
cout<<"\nSome Arguments are missing";
return 0;
}
old_word=argv[1];
new_word=argv[2];
if(strlen(argv[1])!=strlen(argv[2]))
{
cout<<"\nReplacement is not possible as size of New word is not equal to old word";
return 0;
}
fstream obj;
obj.open(argv[3],ios::in|ios::out);
if(!obj)
{
cout<<"\nError in file creating";
return 0;
}
string fetch_word;
while(1)
{
if(obj.eof())
{
break;
}
else
{
int pos=obj.tellg();
obj>>fetch_word;
if(fetch_word==old_word)
{
no++;
obj.seekp(pos);
obj<<new_word;
}
}
}
if(no==0)
{
cout<<"\nNo Replacement Done . Zero Replacement Found";
}
else
{
cout<<"\nReplacement Done . Replacement Found ="<<no<<endl;
}
obj.close();
return 0;
}
Take the string "Is anyone there?".
After the word "Is" has been read, the read head sits on the space after "Is", so tellg returns 2.
When you read the next word, the stream skips the whitespace and reads up to the next whitespace character, so you read "anyone" but write its replacement at the position you recorded (2).
So it should give you the string "Isportere there?".
That is not what you meant, but it is also not the result you got.
To fix it, skip the whitespace before recording the position,
like this:
//#include <cwctype> for iswspace
//eat white spaces
while(iswspace(obj.peek()))
obj.ignore();
//now read head is on the beginning of a word, you can take position.
int pos=obj.tellg();
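To see the whole loop with that fix in place, here is a small sketch that runs against an in-memory std::stringstream instead of a file, so it can be tried without touching disk. The function name replace_same_length is made up for illustration, std::ws is used in place of the peek/ignore loop, and an fstream may behave differently from a stringstream because a file shares one position between reading and writing.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: overwrites every whitespace-delimited occurrence of
// old_word with new_word, in place. Only equal-length words are handled.
std::string replace_same_length(const std::string& text,
                                const std::string& old_word,
                                const std::string& new_word)
{
    if (old_word.size() != new_word.size())
        return text;                      // an in-place overwrite needs equal lengths

    std::stringstream ss(text);           // open for both reading and writing
    std::string word;
    while (ss >> std::ws)                 // eat the whitespace first
    {
        std::streampos pos = ss.tellg();  // now pos is the first letter of the word
        if (!(ss >> word))
            break;                        // nothing left to read
        if (word == old_word)
        {
            ss.clear();                   // reading the last word may have set eofbit
            ss.seekp(pos);                // move the write position to the word start
            ss << new_word;
        }
    }
    return ss.str();
}
```

Note that a word followed directly by punctuation (such as "there?") is read as one token including the punctuation, so it will not match a bare old_word.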
Edit
You'll have to debug and check whether tellg returns 3 on the first line before you read the word "anyone". I suggest adding a debug print of the position for each replacement,
like:
if(fetch_word==old_word)
{
no++;
cout<<"Replacing in pos "<< pos <<endl;
obj.seekp(pos);
obj<<new_word;
}
Now you can check:
Was pos correct? (You can try to seekg to it and read the word again.)
Did the seekp succeed? (You can use tellp to check.)
What happens when you just do obj.seekp(3); obj<<"porter";? Does it replace the string at the correct position?
#include <cctype>
#include <cstdlib>
#include <string>
#include <fstream>
#include <iostream>
int main()
{
std::string old_word{ "anyone" };
std::string new_word{ "porter" };
if (old_word.length() != new_word.length()) {
std::cerr << "Sorry, I can only replace words of equal length :(\n\n";
return EXIT_FAILURE;
}
char const *filename{ "test.txt" };
std::fstream obj{ filename, std::ios::in | std::ios::out };
if (!obj.is_open()) {
std::cerr << "Couldn't open \"" << filename << "\" for reading and writing.\n\n";
return EXIT_FAILURE;
}
std::string word;
for (std::streampos pos; pos = obj.tellg(), obj >> word;) {
if (word == old_word) {
obj.seekg(pos); // set the "get" position back to where it was before extracting word
while (std::isspace(obj.peek())) // for every whitespace character we peek at
obj.get(); // discard it
obj.seekp(obj.tellg()); // set the "put" position to the current "get" position
obj << new_word; // overwrite word with new_word
obj.seekg(obj.tellp()); // set the "get" position to the current "put" position
}
}
}

Reading text with blanks and numeric data from a file

So I have data in a text like this:
Alaska 200 500
New Jersey 400 300
.
.
And I am using ifstream to open it.
This is part of a course assignment. We are not allowed to read in the whole line all at once and parse it into the various pieces. So trying to figure out how to read each part of every line.
Using >> will only read in "New" for "New Jersey" due to the white space/blank in the middle of that state name. Have tried a number of different things like .get(), .read(), .getline(). I have not been able to get the whole state name read in, and then read in the remainder of the numeric data for a given line.
I am wondering whether it is possible to read the whole line directly into a structure. Of course, structure is a new thing we are learning...
Any suggestions?
Can't you just read the state name in a loop?
Read a string from the file stream with >>: if its first character is a digit, you've reached the next field and can exit the loop. Otherwise append it to the state name and loop again.
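Here is a line-by-line parsing solution that doesn't use any C-style parsing methods (it needs <iostream>, <sstream> and <string>; ss is the input stream being read):
std::string line;
while (getline(ss, line) && !line.empty()) {
size_t startOfNumbers = line.find_first_of("0123456789");
if (startOfNumbers == std::string::npos || startOfNumbers == 0)
continue; // no numbers on this line, or no name in front of them
size_t endOfName = line.find_last_not_of(" ", startOfNumbers - 1); // last character of the name
std::string name = line.substr(0, endOfName + 1); // Extract name without the trailing spaces
std::stringstream nums(line.substr(startOfNumbers)); // Get rest of the line
int num1, num2;
nums >> num1 >> num2; // Read numbers
std::cout << name << " " << num1 << " " << num2 << std::endl;
}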
If you can't use getline, do it yourself: Read and store in a buffer until you find '\n'. In this case you probably also cannot use all the groovy stuff in std::string and algorithm and might as well use good ol' C programming at that point.
Once you have grabbed a line, read your way backwards from the end of the line:
Discard all whitespace until you find non-whitespace.
Gather the characters found into token 3 until you find whitespace again.
Read and discard the whitespace until you find the end of token 2.
Gather token 2 until you find more whitespace.
Discard the whitespace until you find the end of token 1. The rest of the line is all token 1.
Convert token 2 and token 3 into numbers. I like to use strtol for this.
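The backward scan described above could look roughly like the following. This is only a sketch under some assumptions: the line has already been read into a std::string, the struct ParsedLine and the function parse_from_end are names invented here, and std::string search functions stand in for a hand-written character loop.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical record type: a name followed by two numbers.
struct ParsedLine
{
    std::string name;
    long first;
    long second;
};

// Walks the line from its end: token 3, then token 2, then the rest is the name.
// Returns false if the line does not contain three tokens.
bool parse_from_end(const std::string& line, ParsedLine& out)
{
    const std::string::size_type npos = std::string::npos;
    std::size_t end3 = line.find_last_not_of(" \t\r");     // last char of token 3
    if (end3 == npos) return false;
    std::size_t gap3 = line.find_last_of(" \t", end3);     // whitespace before token 3
    if (gap3 == npos) return false;
    std::size_t end2 = line.find_last_not_of(" \t", gap3); // last char of token 2
    if (end2 == npos) return false;
    std::size_t gap2 = line.find_last_of(" \t", end2);     // whitespace before token 2
    if (gap2 == npos) return false;
    std::size_t end1 = line.find_last_not_of(" \t", gap2); // last char of the name
    if (end1 == npos) return false;

    out.name = line.substr(0, end1 + 1);
    // strtol stops at the first non-digit, so no extra trimming is needed here.
    out.first = std::strtol(line.substr(gap2 + 1, end2 - gap2).c_str(), nullptr, 10);
    out.second = std::strtol(line.substr(gap3 + 1, end3 - gap3).c_str(), nullptr, 10);
    return true;
}
```

Because the name is whatever remains at the front, embedded spaces such as the one in "New Jersey" are kept without special handling.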
You can build all of the above or Daniel's answer (use his answer if at all possible) into an overload of operator>>. This lets you
mystruct temp;
while (filein >> temp)
{
// do something with temp. Stick it in a vector, whatever
}
The code to do this looks something like (Stealing wholesale from What are the basic rules and idioms for operator overloading? <-- Read this. It could save your life one day)
std::istream& operator>>(std::istream& is, mystruct & obj)
{
// read obj from stream
if( /* no valid object of T found in stream */ )
is.setstate(std::ios::failbit);
return is;
}
Here's another example of reading the file word by word. Edited to remove the example using the eof check as the while loop condition. I also included a struct, since you mentioned that's what you just learned. I'm not sure how you're supposed to use your struct, so I kept it simple and had it contain 3 variables: a string and 2 ints. To verify that it reads correctly, the program prints the contents of the struct variables after each read, which includes printing "New Jersey" as one name.
#include <iostream>
#include <fstream>
#include <string>
#include <stdlib.h> // for atoi
using namespace std;
// Not sure how you're supposed to use the struct you mentioned. But for this example it'll just contain 3 variables to store the data read in from each line
struct tempVariables
{
std::string state;
int number1;
int number2;
};
// This will read the set of characters and return true if its a number, or false if its just string text
bool is_number(const std::string& s)
{
return !s.empty() && s.find_first_not_of("0123456789") == std::string::npos;
}
int main()
{
tempVariables temp;
ifstream file;
file.open("readme.txt");
std::string word;
std::string state;
bool stateComplete = false;
bool num1Read = false;
bool num2Read = false;
if(file.is_open())
{
while (file >> word)
{
// Check if text read in is a number or not
if(is_number(word))
{
// Here set the word (which is the number) to an int that is part of your struct
if(!num1Read)
{
// if code gets here we know it finished reading the "string text" of the line
stateComplete = true;
temp.number1 = atoi(word.c_str());
num1Read = true; // won't read the next text in to number1 var until after it reads a state again on next line
}
else if(!num2Read)
{
temp.number2 = atoi(word.c_str());
num2Read = true; // won't read the next text in to number2 var until after it reads a state again on next line
}
}
else
{
// reads in the state text
temp.state = temp.state + word + " ";
}
if(stateComplete)
{
cout<<"State is: " << temp.state <<endl;
temp.state = "";
stateComplete = false;
}
if(num1Read && num2Read)
{
cout<<"num 1: "<<temp.number1<<endl;
cout<<"num 2: "<<temp.number2<<endl;
num1Read = false;
num2Read = false;
}
}
}
return 0;
}

Find specific text in string delimited by newline characters

I want to find a specific string in a list of sentence. Each sentence is a line delimited with a \n. When the newline is reached the current search should stop and start new on the next line.
My program is:
#include <iostream>
#include <string>
using namespace std;
int main(){
string filename;
string list = "hello.txt\n abc.txt\n check.txt\n";
cin >> filename;
// suppose I run the program 2 times: the 1st time I enter abc.txt
// and the 2nd time I enter abc
if(list.find(filename) != std::string::npos){
// I want this condition to be true only when the user enters the complete
// file name. It also becomes true for 'abc' or 'ab' or even for 'a'
cout << filename << " exists in list";
}
else cout<< "file does not exist in list";
return 0;
}
Is there any way around this? I want to match only complete filenames in the list.
list.find only looks for a substring of list. If you want to compare whole strings up to each \n, you can tokenize the list and put the tokens in a vector.
For that, you can put the string list in std::istringstream and make a std::vector<std::string> out of it by using std::getline like:
std::istringstream ss(list);
std::vector<std::string> tokens;
std::string temp;
while (std::getline(ss, temp)){
tokens.emplace_back(temp);
}
If there are leading or trailing spaces in the tokens, you can trim the tokens before adding them to the vector. For trimming, see What's the best way to trim std::string?, find a trimming solution from there that suits you.
And after that, you can use find from <algorithm> to check for complete string in that vector.
if (std::find(tokens.begin(), tokens.end(), filename) != tokens.end())
std::cout << "found" << std::endl;
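Putting the three steps together (split on newlines, trim, exact match), a self-contained sketch could look like this. The helper names trim and list_contains are made up for the example, and trimming is limited to spaces, tabs and carriage returns.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: strips spaces, tabs and '\r' from both ends of a string.
std::string trim(const std::string& s)
{
    const char* blanks = " \t\r";
    std::size_t first = s.find_first_not_of(blanks);
    if (first == std::string::npos)
        return "";                       // the string was empty or all blanks
    std::size_t last = s.find_last_not_of(blanks);
    return s.substr(first, last - first + 1);
}

// Returns true only if `filename` equals one complete (trimmed) line of `list`.
bool list_contains(const std::string& list, const std::string& filename)
{
    std::istringstream ss(list);
    std::vector<std::string> tokens;
    std::string line;
    while (std::getline(ss, line))
        tokens.push_back(trim(line));
    return std::find(tokens.begin(), tokens.end(), filename) != tokens.end();
}
```

With the list from the question, "abc.txt" is found while "abc", "ab" and "a" are not, because each token is compared as a whole.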
First of all I wouldn't keep the list of files in a single string, but I would use any sort of list or vector.
Then if keeping the list in a string is a necessity of yours (for some kind of reason in your application logic) I would separate the string in a vector, then cycle through the elements of the vector checking if the element is exactly the one searched.
To split the elements I would do:
std::vector<std::string> split_string(const std::string& str,
const std::string& delimiter)
{
std::vector<std::string> strings;
std::string::size_type pos = 0;
std::string::size_type prev = 0;
while ((pos = str.find(delimiter, prev)) != std::string::npos)
{
strings.push_back(str.substr(prev, pos - prev));
prev = pos + delimiter.length(); // skip the whole delimiter, not just one character
}
// To get the last substring (or only, if delimiter is not found)
strings.push_back(str.substr(prev));
return strings;
}
You can see an example of the function working here
Then just use the function and change your code to:
#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main(){
string filename;
string list = "hello.txt\n abc.txt\n check.txt\n";
cin >> filename;
vector<string> fileList = split_string(list, "\n");
bool found = false;
for(size_t i = 0; i<fileList.size(); i++){
string entry = fileList.at(i);
entry.erase(0, entry.find_first_not_of(' ')); // drop the leading space some entries carry
if(entry == filename){
found = true;
}
}
if(found){
cout << filename << " exists in list";
} else {
cout << "file does not exist in list";
}
return 0;
}
Obviously you need to declare and implement the function split_string somewhere in your code. Possibly before main declaration.

Count first digit on each line of a text file

My project takes a filename and opens it. I need to read each line of a .txt file until the first digit occurs, skipping whitespace, chars, zeros, or special chars. My text file could look like this:
1435 //1, nextline
0 //skip, next line
//skip, nextline
(*Hi 245*) 2 //skip until second 2 after comment and count, next line
345 556 //3 and count, next line
4 //4, nextline
My desired output would be all the way up to nine but I condensed it:
Digit Count Frequency
1: 1 .25
2: 1 .25
3: 1 .25
4: 1 .25
My code is as follows:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main() {
int digit = 1;
int array[8];
string filename;
//cout for getting user path
//the compiler parses string literals differently so use a double backslash or a forward slash
cout << "Enter the path of the data file, be sure to include extension." << endl;
cout << "You can use either of the following:" << endl;
cout << "A forwardslash or double backslash to separate each directory." << endl;
getline(cin,filename);
ifstream input_file(filename.c_str());
if (input_file.is_open()) { //if file is open
cout << "open" << endl; //just a coding check to make sure it works ignore
string fileContents; //string to store contents
string temp;
while (!input_file.eof()) { //not end of file I know not best practice
getline(input_file, temp);
fileContents.append(temp); //appends file to string
}
cout << fileContents << endl; //prints string for test
}
else {
cout << "Error opening file check path or file extension" << endl;
}
return 0;
}
In this file format, (* signals the beginning of a comment, so everything from there to a matching *) should be ignored (even if it contains a digit). For example, given input of (*Hi 245*) 6, the 6 should be counted, not the 2.
How do I iterate over the file only finding the first integer and counting it, while ignoring comments?
One way to approach your problem is the following:
Create a std::map<int, int> where the key is the digit and the value is the count. This allows you to compute statistics on your digits such as the count and the frequency after you have parsed the file. Something similar can be found in this SO answer.
Read each line of your file as a std::string using std::getline as shown in this SO answer.
For each line, strip the comments using a function such as this:
std::string& strip_comments(std::string & inp,
std::string const& beg,
std::string const& fin = "") {
std::size_t bpos;
while ((bpos = inp.find(beg)) != std::string::npos) {
if (fin != "") {
std::size_t fpos = inp.find(fin, bpos + beg.length());
if (fpos != std::string::npos) {
inp = inp.erase(bpos, fpos - bpos + fin.length());
} else {
// else don't erase because fin is not found, but break
break;
}
} else {
inp = inp.erase(bpos, inp.length() - bpos);
}
}
return inp;
}
which can be used like this:
std::string line;
std::getline(input_file, line);
line = strip_comments(line, "(*", "*)");
After stripping the comments, use the string member function find_first_of to find the first digit:
std::size_t dpos = line.find_first_of("123456789");
What is returned here is the index location in the string for the first digit. You should check that the returned position is not std::string::npos, as that would indicate that no digits are found. If the first digit is found, the corresponding character can be extracted using const char c = line[dpos]; and converted to an integer with c - '0' (std::atoi expects a C string, not a single char).
Increment the count for that digit in the std::map as shown in that first linked SO answer. Then loop back to read the next line.
After reading all lines from the file, the std::map will contain the counts for all first digits found in each line stripped of comments. You can then iterate over this map to retrieve all the counts, accumulate the total count over all digits found, and compute the frequency for each digit. Note that digits not found will not be in the map.
I hope this helps you get started. I leave the writing of the code to you. Good luck!

Why is this word sorting program only looping once?

I'm trying to create a word sorting program that will read the words in a .txt file and then write them to a new file in order from shortest words to longest words. So, for instance, if the first file contains:
elephant
dog
mouse
Once the program has executed, I want the second file (which is initially blank) to contain:
dog
mouse
elephant
Here's the code:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
string word;
ifstream readFrom;
ofstream writeTo;
readFrom.open("C:\\Users\\owner\\Desktop\\wordlist.txt");
writeTo.open("C:\\Users\\owner\\Desktop\\newwordlist.txt");
if (readFrom && writeTo)
{
cout << "Both files opened successfully.";
for (int lettercount = 1; lettercount < 20; lettercount++)
{
while (readFrom >> word)
{
if (word.length() == lettercount)
{
cout << "Writing " << word << " to file\n";
writeTo << word << endl;
}
}
readFrom.seekg(0, ios::beg); //resets read pos to beginning of file
}
}
else
cout << "Could not open one or both of files.";
return 0;
}
For the first iteration of the for loop, the nested while loop seems to work just fine, writing the correct values to the second file. However, something goes wrong in all the next iterations of the for loop, because no further words are written to the file. Why is that?
Thank you so much.
while (readFrom >> word)
{
}
readFrom.seekg(0, ios::beg); //resets read pos to begin
The while loop runs until an extraction fails, which happens at end of file and sets the eofbit and failbit flags on readFrom. Seeking to the beginning does not clear failbit, and while it is set the seek itself and every later read fail. Add the following line right before the seek to clear the flags and your code should work fine.
readFrom.clear();
Clear the stream's error flags (including EOF) before the seek:
readFrom.clear();