Strings and unordered_map running slow


Here are two functions in my code that are running really slowly.
Basically, I read in a document name, open the document, then process it one word at a time. I need to split the document into sentences and give each sentence a hash table that records how many times each word appears in that sentence. I also need to keep track of all the new words, plus a hash table for the whole document.
When I run my code on 10 documents that have a total of 8000 words and 2100 unique words, it takes 8000+ seconds to run, almost 1 second per word.
Can you tell me how long if(istream.good()) should take?
Or, if you can tell what is delaying my code, please let me know. If a section is not clear, I will help.
P.S. You can see in the code where I have start = clock() and end = clock() commented out; it constantly returns < 1 ms, which is mind-boggling.
void DocProcess::indexString(string sentenceString, hash * sent){
stringstream iss;
string word;
iss.clear();
iss << sentenceString;
while(iss.good())
{
iss >> word;
word = formatWord(word);
std::unordered_map<std::string,int>::const_iterator IsNewWord = words.find(word);
if(IsNewWord == words.end())
{
std::pair<std::string,int> newWordPair (word,0);
std::pair<std::string,int> newWordPairPlusOne (word,1);
words.insert(newWordPair);
sent->insert(newWordPairPlusOne);
}
else
{
std::pair<std::string,int> newWordPairPlusOne (word,1);
sent->insert(newWordPairPlusOne);
}
}
}
void DocProcess::indexFile(string iFileName){
hash newDocHash;
hash newSentHash;
scoreAndInfo sentenceScore;
scoreAndInfo dummy;
fstream iFile;
fstream dFile;
string word;
string newDoc;
string fullDoc;
int minSentenceLength = 5;
int docNumber = 1;
int runningLength = 0;
int ProcessedWords = 0;
stringstream iss;
iFile.open(iFileName.c_str());
if(iFile.is_open())
{
while(iFile.good())
{
iFile >> newDoc;
dFile.open(newDoc.c_str());
DocNames.push_back(newDoc);
if(dFile.is_open())
{
scoreAndInfo documentScore;
//iss << dFile.rdbuf();
while(dFile.good())
{
//start = clock();
dFile >> word;
++ProcessedWords;
std::unordered_map<std::string,int>::const_iterator IsStopWord = stopWords.find(word);
if(runningLength >= minSentenceLength && IsStopWord != stopWords.end() || word[word.length()-1] == '.')
{
/* word is in the stop list, process the string*/
documentScore.second.second.append(" "+word);
sentenceScore.second.second.append(" "+word);
indexString(sentenceScore.second.second, &sentenceScore.second.first);
sentenceScore.first=0.0;
SentList.push_back(sentenceScore);
sentenceScore.second.first.clear(); //Clear hash
sentenceScore.second.second.clear(); // clear string
//sentenceScore = dummy;
runningLength = 0;
}
else
{
++runningLength;
sentenceScore.second.second.append(" "+word);
documentScore.second.second.append(" "+word);
}
//end = clock();
system("cls");
cout << "Processing doc number: " << docNumber << endl
<< "New Word count: " << words.size() << endl
<< "Total words: " << ProcessedWords << endl;
//<< "Last process time****: " << double(diffclock(end,start)) << " ms"<< endl;
}
indexString(documentScore.second.second, &documentScore.second.first);
documentScore.first=0.0;
DocList.push_back(documentScore);
dFile.close();
//iss.clear();
//documentScore = dummy;
++docNumber;
//end = clock();
system("cls");
cout << "Processing doc number: " << docNumber << endl
<< "Word count: " << words.size();
//<< "Last process time: " << double(diffclock(end,start)) << " ms"<< endl;
}
}
iFile.close();
}
else{ cout << "Unable to open index file: "<<endl <<iFileName << endl;}
}

Can you try it without system("cls"); in any of the loops? That surely isn't helping: it is an expensive call, because every call hands the command to the system's command processor.

To clear the screen more cheaply, instead of system("cls");, try cout << '\f'; (a form feed). Whether this actually clears the screen depends on the terminal.
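For illustration, here is a minimal sketch of a word-counting loop that reports progress only every so often instead of clearing the screen for every word. The helper name countWords and its interval parameter are hypothetical and not part of the original code. It also loops on while (in >> word), so a failed read at the end of the input is never processed as a word.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: counts the words read from `in` and writes a progress
// line to `log` once every `interval` words (never, if interval is 0).
std::unordered_map<std::string, int> countWords(std::istream& in,
                                                std::ostream& log,
                                                std::size_t interval)
{
    std::unordered_map<std::string, int> counts;
    std::string word;
    std::size_t processed = 0;
    while (in >> word) {      // stops cleanly when no further word can be read
        ++counts[word];       // operator[] inserts 0 for a new word, then we add 1
        ++processed;
        if (interval != 0 && processed % interval == 0) {
            // '\r' returns to the start of the line; no external command is run
            log << "\rTotal words: " << processed << std::flush;
        }
    }
    return counts;
}
```

Any input stream can be passed in, for example an std::ifstream for a document or an std::istringstream holding one sentence, with std::cout as the log.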

Related

C++ Anagram Solver speed optimization

I have decided to make an anagram solver for my dad. I am quite new to programming, but I figured I could still make it. My finished product works, but it is really slow; for instance, it took 15+ minutes to find all combinations of 8 characters. I am looking for ways to optimize it and make it faster.
Working with the MinGW C++ compiler in CLion 2019.3.4; CPU: i7-9700K, RAM: 16 GB at 3200 MHz.
#include <iostream>
#include <string>
#include <vector>
#include <fstream>
using namespace std;
//Menu for interacting with user, not really important
void menu() {
cout << "=========================" << endl;
cout << "======== !WOW! ==========" << endl;
cout << "=========================" << endl;
cout << "1 ... INSERT" << endl;
cout << "2 ... PRINT" << endl;
cout << "3 ... LIMIT WORD LENGTH" << endl;
cout << "4 ... NEW GAME" << endl;
cout << "0 ... EXIT" << endl;
cout << "=========================" << endl;
cout << "Select: ";
}
//Function to find all possible combinations from letters of a given string
void get(vector<string> &vec, string str, string res) {
vec.push_back(res);
for (int i = 0; i < str.length(); i++)
get(vec, string(str).erase(i, 1), res + str[i]);
}
//Only for testing purposes
void printVec(vector<string> vec) {
for (int i = 0; i < vec.size(); i++) {
cout << vec[i] << " ";
}
}
//Function to check if a given word exists in given .txt file
bool checkWord(vector<string> &vec2, string filename, string search) {
string line;
ifstream myFile;
myFile.open(filename);
if (myFile.is_open()) {
while (!myFile.eof()) {
getline(myFile, line);
if (line == search) {
vec2.push_back(line);
return true;
}
}
myFile.close();
} else
cout << "Unable to open this file." << endl;
return false;
}
int main() {
int selection;
bool running = true;
string stringOfChars;
vector<string> vec;
vector<string> vec2;
do {
menu();
cin >> selection;
switch (selection) {
case 1:
cout << "Insert letters one after another: ";
cin >> stringOfChars;
get(vec, stringOfChars, ""); //fill first vector(vec) with all possible combinations.
break;
case 2:
for (int i = 0; i < vec.size(); i++) {
if (checkWord(vec2, "C:/file.txt", vec[i])) { //For each word in vector(vec) check if exists in file.txt, if it does, send it in vector(vec2) and return true
//Reason for vec2's existence is that later I want to implement functions to manipulate with possible solutions (like remove words i have already guessed, or as shown in case 3, to limit the word length)
cout << vec[i] << endl; //If return value == true cout the word
}
}
break;
case 3:
int numOfLetters;
cout << "Word has a known number of letters: ";
cin >> numOfLetters;
for (int i = 0; i < vec2.size(); i++) { /*vec2 is now filled with all the answers, we can limit the output if we know the length of the word */
if (vec2[i].length() == numOfLetters) {
cout << vec2[i] << endl;
}
}
break;
case 4:
vec.clear();
vec2.clear();
break;
case 0:
running = false;
break;
default:
cout << "Wrong selection!" << endl;
break;
}
cout << endl;
} while (running);
return 0;
}
file.txt is filled with all the words in my language. It is alphabetically ordered and 50 MB in size.
aachecnska
aachenskega
aachenskem
aachenski
.
.
.
bab
baba
babah
.
.
.
Any recommendations or off-topic tips would be helpful. One of my ideas is to split file.txt into smaller files, for example putting lines that start with the same letter in their own file, so A.txt would only contain words that start with A, and so on, and then change the code accordingly.

This is where you need to use a profiler. On Linux, my favorite is KCachegrind:
http://kcachegrind.sourceforge.net/html/Home.html
It gives you line-by-line timing information and tells you which part of the code you should optimize the most.
Of course, there are many profilers available, including commercial ones.
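One thing a profile of this program would probably highlight is checkWord, which re-opens and scans the 50 MB file for every candidate string. As a sketch of an alternative, not a measured fix, the word list can be read into a std::unordered_set once and each candidate looked up in it. The names loadDictionary and findWords are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: reads one word per line into a hash set, once.
std::unordered_set<std::string> loadDictionary(std::istream& in)
{
    std::unordered_set<std::string> dict;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();              // tolerate Windows line endings
        }
        if (!line.empty()) {
            dict.insert(line);
        }
    }
    return dict;
}

// Hypothetical helper: keeps the candidates that are dictionary words,
// listing each distinct word once, in the order it was first seen.
std::vector<std::string> findWords(const std::vector<std::string>& candidates,
                                   const std::unordered_set<std::string>& dict)
{
    std::vector<std::string> found;
    std::unordered_set<std::string> reported;
    for (const std::string& candidate : candidates) {
        if (dict.count(candidate) != 0 && reported.insert(candidate).second) {
            found.push_back(candidate);
        }
    }
    return found;
}
```

In the program above, loadDictionary would be called once with an std::ifstream opened on the word file, and findWords would replace the loop in case 2 that calls checkWord for each entry of vec. Each lookup is then a hash probe in memory instead of a pass over the whole file.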

C++ How to Check what words aren't in 2 similar files

I was trying to find a way to compare two different files and get, from the second, all the lines that aren't in the first, but my code does the exact opposite.
I have tried everything I could think of to solve this, without success.
This is the code:
int main(int argc, char *argv[])
{
setlocale(LC_ALL, "");
char username[UNLEN+1];
DWORD username_len = UNLEN+1;
GetUserName(username, &username_len);
stringstream buffer;
buffer << "C:\\Users\\" << username << "\\Desktop\\";
stringstream buffer2;
buffer2 << "C:\\Users\\" << username << "\\Desktop\\Legit.txt";
stringstream buffer3;
buffer3 << "C:\\Users\\" << username << "\\Desktop\\Unlegit.txt";
stringstream buffer4;
buffer4 << "C:\\Users\\" << username << "\\Desktop\\result.txt";
string results = buffer4.str();
int offset;
int num;
num = 1;
string search;
string linea;
string legit;
string unlegit;
string line;
cout << "Is the Legit.txt file at '" << buffer.str() << "'? [Y/N]: ";
cin >> legit;
if (legit == "Y" || legit == "y"){
}else if(legit == "N" || legit == "n"){
return 0;
}else{
cout << "\n.";
return 0;
}
string legitfile = buffer2.str();
cout << "\nIs the Unlegit.txt file at '" << buffer.str() << "'? [Y/N]: ";
cin >> unlegit;
if (unlegit == "Y" || unlegit == "y"){
}else if(unlegit == "N" || unlegit == "n"){
return 0;
}else{
cout << "\n";
return 0;
}
string unlegitfile = buffer3.str();
ifstream file(legitfile.c_str());
if(file.is_open()){
while(getline(file, line)){
ifstream MyFile(unlegitfile.c_str());
if(MyFile.is_open()){
while(!MyFile.eof()){
getline(MyFile,linea);
if((offset = linea.find(line, 0)) != string::npos) {
cout << "\n[" << num << "]" << " Word Found: " << line << "\n";
num++;
fstream result(results.c_str());
result << line << "\n";
result.close();
}
}
MyFile.close();
}
}
file.close();
return 0;
}else{
cout << "\nThe file '" << legitfile << "' does not exist.";
cout << "\nThe file '" << unlegitfile << "' does not exist.";
}
}
As I said, this code checks which words appear in both the first and second files and, once found, writes them to a third file. Is there a way to do the opposite (compare the two files and get the words that don't match)? Thank you so much!
I'm new, both to the forum and to C++, so sorry if I make any mistakes (and sorry for my bad English too).

The classic solution to this sort of problem is to use a hash table collection to represent all the words in the first file. Then while iterating items from the second file, consult the set constructed of the first file. In C++, the std::unordered_set will do fine.
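#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>
using namespace std;
// firstfile and secondfile are assumed to be already-open ifstream objects.
unordered_set<string> firstFileSet;
unordered_set<string> missingFromSecondFileSet;
string line;
// Test the read itself; looping on eof() would process one bogus line at the end.
while (getline(firstfile, line))
{
    firstFileSet.insert(line);
}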
Then for each word in the second file, use a second set collection to keep track of what words are missing.
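unordered_set<string> inBothFiles; // lines seen in both files
while (getline(secondfile, line))
{
    if (firstFileSet.find(line) != firstFileSet.end())
    {
        inBothFiles.insert(line); // present in the first file as well
    }
    else
    {
        missingFromSecondFileSet.insert(line); // only in the second file
    }
}
// Drop the shared lines afterwards, so a line repeated in the second file
// is still recognised as shared on its later occurrences.
for (const auto &s : inBothFiles)
{
    firstFileSet.erase(s);
}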
After the above runs, firstFileSet contains all the lines in the first file that were not present in the second. missingFromSecondFileSet contains all the lines in the second file that were not in the first:
for (auto &s : firstFileSet)
{
cout << s << " was in the first file, but not the second" << endl;
}
for (auto &s : missingFromSecondFileSet)
{
cout << s << " was in the second file, but not the first" << endl;
}
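Put together as one function, the same idea might look like the sketch below. It answers the question as asked (lines of the second file that are not in the first) and keeps their original order. The name linesOnlyInSecond is hypothetical, and lines are compared for exact equality, whereas the question's code used a substring search.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the lines of `second` that never occur in
// `first`, in their original order and without repeats.
std::vector<std::string> linesOnlyInSecond(std::istream& first, std::istream& second)
{
    std::unordered_set<std::string> firstLines;
    std::string line;
    while (std::getline(first, line)) {
        firstLines.insert(line);
    }

    std::vector<std::string> result;
    std::unordered_set<std::string> alreadyReported;
    while (std::getline(second, line)) {
        // insert().second is true only the first time a given line is added
        if (firstLines.count(line) == 0 && alreadyReported.insert(line).second) {
            result.push_back(line);
        }
    }
    return result;
}
```

With the files from the question, the two streams would be std::ifstream objects opened on Legit.txt and Unlegit.txt, and the returned lines could then be written to result.txt.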

There is a program called diff on Linux which does just what you are trying to do in C++.
It is written in C so you can just copy its source code =P
for (;; cmp->file[0].buffered = cmp->file[1].buffered = 0)
{
/* Read a buffer's worth from both files. */
for (f = 0; f < 2; f++)
if (0 <= cmp->file[f].desc)
file_block_read (&cmp->file[f],
buffer_size - cmp->file[f].buffered);
/* If the buffers differ, the files differ. */
if (cmp->file[0].buffered != cmp->file[1].buffered
|| memcmp (cmp->file[0].buffer,
cmp->file[1].buffer,
cmp->file[0].buffered))
{
changes = 1;
break;
}
/* If we reach end of file, the files are the same. */
if (cmp->file[0].buffered != buffer_size)
{
changes = 0;
break;
}
}
Taken from ftp://mirrors.kernel.org/gnu/diffutils/diffutils-3.0.tar.gz > src/analyze.c

How do I print the number of words that begin with a certain character?

This is for an intro c++ class, the prompt reads:
Print the number of words that begin with a certain character. Let the user enter that character.
Although, I'm not sure how to do this.
Do I use parsing strings? I tried this because they inspect string data type but I kept getting errors so I took it out and changed it to characters. I want to learn how to do the "total_num" (total number of words that start with the letter the user chooses) and I also need some help with my for loop.
Example of desired output
user types in: a
outputs: "Found 1270 words that begin with a"
user types in: E
outputs: "Found 16 words that begin with E"
user types in: #
outputs: "Found 0 words that begin with #"
(I think I got this part down for non-alphabetical)
The data is from a file called dict.txt, it's a list of many words.
Here's a small sample of what it contains:
D
d
D.A.
dab
dabble
dachshund
dad
daddy
daffodil
dagger
daily
daintily
dainty
dairy
dairy cattle
dairy farm
daisy
dally
Dalmatian
dam
damage
damages
damaging
dame
My program:
#include <iostream>
#include <fstream>
#include <cstdlib>
using namespace std;
const int NUM_WORD = 21880;//amount of words in file
struct dictionary { string word; };
void load_file(dictionary blank_array[])
{
ifstream data_store;
data_store.open("dict.txt");
if (!data_store)
{
cout << "could not open file" << endl;
exit(0);
}
}
int main()
{
dictionary file_array[NUM_WORD];
char user_input;
int total_num = 0;
load_file(file_array);
cout << "Enter a character" << endl;
cin >> user_input;
if (!isalpha(user_input))
{
cout << "Found 0 that begin with " << user_input << endl;
return 0;
}
for (int counter = 0; counter< NUM_WORD; counter++)
{
if (toupper(user_input) == toupper(file_array[counter].word[0]));
//toupper is used to make a case insensitive search
{
cout << "Found " << total_num << " that begin with " << user_input << endl;
//total_num needs to be the total number of words that start with that letter
}
}
}

There are a few things you can do to make your life simpler e.g. using a vector as the comment suggested.
Let's look at your for loop. There are some obvious syntax problems.
int main()
{
dictionary file_array[NUM_WORD];
char user_input;
int total_num = 0;
load_file(file_array);
cout << "Enter a character" << endl;
cin>>user_input;
if(!isalpha(user_input))
{
cout << "Found 0 that begin with " << user_input << endl;
return 0;
}
for(int counter = 0;counter< NUM_WORD; counter++)
{
if (toupper(user_input) == toupper(file_array[counter].word[0]));
// ^no semi-colon here!
//toupper is used to make a case insensitive search
{
cout<< "Found " << total_num << " that begin with "<<
user_input << endl;
//total_num needs to be the total number of words that start with that letter
}
}//<<< needed to end the for loop
}
Let's get the for loop right. You want to count the matches in a loop and then report when you have finished the loop.
int total_num = 0;
//get character and file
for(int counter = 0;counter< NUM_WORD; counter++)
{
if (toupper(user_input) == toupper(file_array[counter].word[0]))
// ^^^ no semicolon at the end of the if line!
{
++total_num;
}
}
cout<< "Found " << total_num << " that begin with "<< user_input << endl;
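Note that load_file in the question opens dict.txt but never reads anything from it, so file_array stays empty. As a sketch of the counting step that reads straight from the stream and needs no array at all, something like the following could work. The name countStartingWith is hypothetical, and one dictionary entry per line is assumed, as in the sample.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts the lines of `in` whose first character matches
// `wanted`, ignoring case. One dictionary entry per line is assumed.
int countStartingWith(std::istream& in, char wanted)
{
    // Cast before toupper: passing a negative char value is undefined behaviour.
    const int target = std::toupper(static_cast<unsigned char>(wanted));
    int total = 0;
    std::string entry;
    while (std::getline(in, entry)) {
        if (!entry.empty() &&
            std::toupper(static_cast<unsigned char>(entry[0])) == target) {
            ++total;
        }
    }
    return total;
}
```

Called with an std::ifstream opened on dict.txt and the character typed by the user, the result is the total to print after the loop. A character such as # simply matches no entry, so the separate isalpha check becomes optional.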

Using C++ Fstream to output numbers from text file - Need help separating lines

I need to create a program that takes integers from a text file, and outputs them, including the number, lowest number, largest number, average, total, N amount of numbers, etc. I can do this just fine with the code below, but I also need to process the text per line. My sample file has 7 numbers delimited with tabs per row, with a total of 8 rows, but I am to assume that I do not know how many numbers per row, rows per file, etc. there are.
Also, for what it's worth, even though I know how to use vectors and arrays, the particular class that I'm in has not gotten to them, so I'd rather not use them.
Thanks.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main() {
int num;
int count = 0;
int total = 0;
int average = 0;
string str = "";
int numLines = 0;
int lowNum = 1000000;
int highNum = -1000000;
ifstream fileIn;
fileIn.open("File2.txt");
if (!fileIn) {
cout << "\nError opening file...Closing program.\n";
fileIn.close();
}
else {
while (!fileIn.eof()) {
fileIn >> num;
cout << num << " ";
total += num;
count++;
if (num < lowNum) {
lowNum = num;
}
if (num > highNum) {
highNum = num;
}
}
average = total / count;
cout << "\n\nTotal is " << total << "." << endl;
cout << "Total amount of numbers is " << count << "." << endl;
cout << "Average is " << average << "." << endl;
cout << "Lowest number is " << lowNum << endl;
cout << "Highest number is " << highNum << endl;
fileIn.close();
return 0;
}
}

One way to deal with the individual lines is to skip leading whitespace before reading each value and to put the stream into a fail state when a newline is reached. If the stream is still good after skipping and reading a value, there clearly was no newline. If there was a newline, deal with whatever needs to happen at the end of a line, reset the stream (if the failure wasn't due to reaching eof()) and carry on. For example, a loop that processes integers and keeps track of the current line could look like this:
int line(1);
while (in) {
for (int value; in >> skip >> value; ) {
std::cout << "line=" << line << " value=" << value << '\n';
}
++line;
if (!in.eof()) {
in.clear();
}
}
This code uses the custom manipulator skip() which could be implemented like this:
std::istream& skip(std::istream& in) {
while (std::isspace(in.peek())) {
if (in.get() == '\n') {
in.setstate(std::ios_base::failbit);
}
}
return in;
}
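A different technique from the skip manipulator above, sketched here as an alternative: read each line with std::getline and parse that line through its own std::istringstream. It uses neither vectors nor arrays. The name reportLines and the output format are made up for the example.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: prints the count and sum of the integers found on each
// line of `in`, and returns the number of lines read.
int reportLines(std::istream& in, std::ostream& out)
{
    int lineNumber = 0;
    std::string text;
    while (std::getline(in, text)) {
        ++lineNumber;
        std::istringstream fields(text);  // a stream over just this one line
        int value = 0;
        int count = 0;
        long long sum = 0;
        while (fields >> value) {         // tabs and spaces are skipped by >>
            ++count;
            sum += value;
        }
        out << "line=" << lineNumber << " count=" << count
            << " sum=" << sum << '\n';
    }
    return lineNumber;
}
```

Per-line lowest, highest and average values can be tracked inside the inner loop in the same way as count and sum, and file-wide totals in variables declared before the outer loop.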

Cannot input more than one string in palindrome program

#include <iostream>
#include <ctype.h>
using namespace std;
void isPalindrome();
int main()
{
char response;
isPalindrome();
cout << "Input another string(y/n)?" << endl;
cin >> response;
response = toupper(response);
if (response == 'Y')
isPalindrome();
return 0;
}
void isPalindrome()
{
char str[80], str2[80];
int strlength;
int j = 0;
int front, back;
bool flag = 1;
cout << "Input a string:" << endl;
cin.getline(str, 80);
strlength = strlen(str);
for (int i = 0; i < strlength; i++)
{
if (islower(str[i]))
str[i] = toupper(str[i]);
}
for (int i = 0; i < strlength; i++)
{
if (isalpha(str[i]))
{
str2[j] = str[i];
j++;
}
}
str2[j] = '\0';
front = 0;
back = strlength - 1;
for (int i = 0; i < j / 2; i++)
{
if (str2[front] != str2[back])
{
flag = 0;
break;
}
}
if (!(flag))
cout << "It is not a palindrome" << endl;
else
cout << "It's a palindrome" << endl;
cout << "str: " << str << " str2: " << str2 << " strlength: " << strlength << " j: " << j << endl;
cout << "front: " << front << " back: " << back << " flag: " << flag << endl;
}
I was just wondering if anybody could help explain to me why my code isn't working.
I can run it once just fine and I get the right answer, but when the prompt asks if I want to input another string and I type 'y', the prompt just skips over the input and terminates on its own.
I tried cin.ignore('\n', 80), but that just gave me a bunch of blank lines. I added the bit of code at the end to check the values and they all go to 0 and drop the strings.
Maybe a link to a proper explanation of how the system handles memory?
edit:
I keep getting the same problem when running the input sequence a second time. The output looks like this:
Input a string:
Radar
It's a palindrome
Input another string(y/n)?
y
_ <- this being my cursor after pressing enter 3 times
I'll just re-build the program from scratch and try to do it without a function. I'd still appreciate a link to a page that explains how to process user input using modern c++.

The problem is with:
cin >> response;
This reads the user input y/n into the variable response, but a newline is left in the input buffer, which is then picked up by the getline call in the isPalindrome function.
To fix this you need to remove the rest of the line, including the newline, from the input buffer after you read the user response. Passing only a count to ignore would keep discarding input until end of file, so give it '\n' as the delimiter (this needs #include <limits>):
cin >> response;
std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
With the above fix you can retry the palindrome check just once. To make multiple retries possible you'll need a loop. I would recommend a do-while loop in your main as:
char response;
do {
    isPalindrome();
    cout << "Input another string(y/n)?" << endl;
    cin >> response;
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    response = toupper(response);
} while(response == 'Y');

You need a loop. There's no code that instructs the program to go back to the top.
char response = 'Y';
while (response == 'Y') {
isPalindrome();
cout << "Input another string(y/n)?" << endl;
cin >> response;
}
This isn't your entire program, just key elements that you need for a while loop. You should get an understanding of how while works and make this work for your program.

In contemporary C++ one would typically use standard library components for string processing:
#include <iostream>
#include <string>
int main()
{
std::string line1, line2, response;
do
{
std::cout << "First string: ";
if (!std::getline(std::cin, line1)) { /* error */ }
std::cout << "Second string: ";
if (!std::getline(std::cin, line2)) { /* error */ }
// check line1 and line2 for palindromy
std::cout << "Again (y/n)? ";
std::getline(std::cin, response);
} while (std::cin && (response == "y" || response == "Y"));
}
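To fill in the palindrome check itself with std::string, a sketch might look like this. The name isPalindromeText is hypothetical. It follows the question's rules (letters only, case ignored) and moves both indices toward the middle on every step, which the loop in the question never does with front and back.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `text` reads the same in both directions,
// looking only at letters and ignoring case.
bool isPalindromeText(const std::string& text)
{
    std::string letters;
    for (unsigned char ch : text) {       // unsigned char keeps <cctype> calls well defined
        if (std::isalpha(ch)) {
            letters.push_back(static_cast<char>(std::toupper(ch)));
        }
    }
    if (letters.empty()) {
        return true;                      // assumption: no letters counts as a palindrome
    }
    std::size_t front = 0;
    std::size_t back = letters.size() - 1;
    while (front < back) {
        if (letters[front] != letters[back]) {
            return false;
        }
        ++front;                          // both indices move toward the middle
        --back;
    }
    return true;
}
```

Because the input is read with std::getline into a std::string, there is no fixed 80-character buffer and no newline left over between prompts.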
