cpp stringstream read input file algorithm to find LCS - c++

Hi, this is my first question here. I will try to be as clear as possible; if I come across as too much of a newbie, please bear with me. Thanks.
Background: I was asked to solve the longest common substring (LCS) problem for given input files in C++.
The goal is to optimize the algorithm, so there are limits on run time and RAM. (The comparison is case insensitive.)
My approach: I used a stringstream to parse every input line and stored the results in a vector. I used something like a suffix tree to chop each string into suffixes, sorted them, and put them into a vector array (a vector that stores vectors). I then compared every 2 lines (v1, v2) to find common substrings (using a nested for loop to compare each word inside every vector), put the common substrings back into the array, and removed v1 and v2.
Suffix tree, e.g. banana -> anana -> nana -> ana -> na -> a [I stored all 5 elements into the vector]
Result: it works for most files (normal text files).
Problem: I have 2 special test cases for which finding the LCS takes forever.
1. 10,000 input lines, each averaging 3,000 characters (including spaces). It took 50 minutes to find the LCS; the requirement is not to exceed 5 minutes.
2. 100 input lines, each averaging 60k characters. It never finishes running.
What I tried: build a common-word dictionary from the first 2 lines:
read the first two lines and store them into vectors
use the suffix tree again to find the common elements (substrings) and call the result the dictionary
for the rest of the input lines,
if (the word read is in the dictionary)
fine, do what I did before, read the next one
else if (the word is not in the dictionary)
ignore this word, read the next one
Help needed: I still cannot process the first two lines if each contains 60k characters, so building the dictionary itself would exceed the run-time limit. I am not sure whether a hash table would work much better than vectors. I know a bit about hash tables but have never written anything with one, so I would appreciate a patient explanation.
Update:
As suggested, here is some code (the first part parses the input and stores it into a vector; the second shows how I compare 2 strings and find the common elements).
vector< vector<string> > parsed_array;
vector<string> choped_element;
// Num::1 read from file in a while loop
while (getline(myfile, line))
{
    cout << "< InputlineLoopCounter: " << InputlineLoopCounter << endl;
    choped_element.clear();
    choped_element.push_back(line); // whole string as first element, e.g. "Hello World"
    stringstream ss(line);
    string copystr(line);
    while (ss >> temp)
    {
        copystr.erase(0, copystr.find_first_of(" \t") + 1); // here it turns into "World"
        choped_element.push_back(copystr);
    }
    choped_element.pop_back(); // since I stored the whole string as the first element, the last one is not necessary
    sort(choped_element.begin(), choped_element.end());
    parsed_array.push_back(choped_element); // stored into the vector array
    InputlineLoopCounter++;
}
// Num::2 compare 2 different strings and assemble the common part into a new string
// v1 and v2 are 2 vectors full of chopped strings and v3 should hold the common elements
// e.g. v1[0]="hello world"; v1[1]="world"
// e.g. v2[0]="I dislike hello world"; v2[1]="dislike hello world"; v2[2]="hello world"; v2[3]="world"
// e.g. v3 as result would be v3[0]="hello world"; v3[1]="world"
for (size_t i = 0; i < v1len; i++)
{
    for (size_t j = 0; j < v2len; j++)
    {
        stringstream ss1(v1[i]);
        string fword1;
        ss1 >> fword1;
        stringstream ss2(v2[j]);
        string fword2;
        ss2 >> fword2;
        if (fword1 == fword2) // v1[i] and v2[j] are space-separated words
        {
            string nword1;
            string nword2;
            string lcommon;
            int comlen = 1;
            string combine;
            combine.append(fword1);
            combine.append(space);
            // extend the match word by word while both strings agree
            while (ss1 >> nword1 && ss2 >> nword2)
            {
                if (nword1 == nword2)
                {
                    combine.append(nword1);
                    combine.append(space);
                    comlen++;
                }
                else
                    break;
            }
            combine.erase(combine.find_last_of(" ")); // drop the trailing space
            cout << "common word: " << combine << endl;
            v3.push_back(combine);
        }
    }
}
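Since the question asks how a hash table would fit in, here is a minimal sketch of the dictionary idea described above using std::unordered_set, which is the standard library's hash table. The helper names toLower, commonWords and keepKnownWords are made up for this illustration and are not from the original code; the sketch only shows the lookup part, not a full LCS solution.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: lower-case a word so lookups are case insensitive.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: build the "dictionary" of words that occur in both lines.
// Each insert/count on an unordered_set is O(1) on average.
std::unordered_set<std::string> commonWords(const std::string& line1,
                                            const std::string& line2) {
    std::unordered_set<std::string> first;
    std::istringstream ss1(line1);
    std::string word;
    while (ss1 >> word) first.insert(toLower(word));

    std::unordered_set<std::string> dict;
    std::istringstream ss2(line2);
    while (ss2 >> word) {
        word = toLower(word);
        if (first.count(word)) dict.insert(word);
    }
    return dict;
}

// Hypothetical helper: keep only the words of a line that are in the dictionary.
std::vector<std::string> keepKnownWords(const std::string& line,
                                        const std::unordered_set<std::string>& dict) {
    std::vector<std::string> kept;
    std::istringstream ss(line);
    std::string word;
    while (ss >> word) {
        word = toLower(word);
        if (dict.count(word)) kept.push_back(word);
    }
    return kept;
}
```

Building the dictionary this way is linear in the number of words on the two lines, instead of comparing every suffix against every other suffix. One caveat: simply dropping unknown words makes words that were not adjacent look adjacent, so a real solution would probably treat a dropped word as a break between candidate substrings.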

Related

Is there a way to search an unordered_set using a limited alphabetic range?

Context: I am coding an assignment in C++ where a user enters a word or a sentence to unscramble on a word by word basis. I have a text file full of English words that I have read into an unordered_set of strings. Then I go through permutations of each entered word and attempt to find it in the unordered_set. The unscrambled word possibilities are printed out to the user.
Problem: There are a lot of words in the text file. The program doesn't run properly because it takes too long to go through all the permutations and look for a match in the unordered_set.
Possible Solution: I want to limit the range of words to search through, because the text file was already in alphabetical order. For example, if the scrambled word was "cit", one permutation for this word would be "itc". I want to search all of the words in the unordered_set starting with i for "itc".
Here is what I have so far.
void unscramble() {
//issue - too slow, find in range?
string word;
string temp;
ifstream inDictionaryFile("words_alpha.txt");
unordered_set<string> dictionary;
//read dictionary file into an unordered_set
while (getline(inDictionaryFile, temp)) {
auto result = dictionary.insert(temp + " ");
}
cout << "Enter something to unscramble: ";
//find/print out matches for permutations of scrambled words
while (cin>>word) {
do {
word = word + " ";
auto result = dictionary.find(word);
if (result != end(dictionary)) {
cout << setw(10) << word;
}
} while (next_permutation(begin(word), end(word)));
}
}
If you need the permutation of just the first 3 letters you may use an unordered_multiset with the key equal to a canonical permutation (e.g. the sorted first 3 letters). But I guess that the actual problem that you have should not be solved with just one data structure but with several ones, one data structure for storage, other data structures for indexes to that storage.
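A rough sketch of the canonical-permutation idea follows. It differs from the suggestion above in two ways, both of which are my own assumptions: it uses an std::unordered_multimap (key = sorted letters, value = the original word) rather than an unordered_multiset, and it sorts all letters of the word rather than only the first 3. The names canonicalKey, buildIndex and unscramble are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Canonical key of a word: its letters in sorted order.
// All permutations of the same letters share one key.
std::string canonicalKey(std::string word) {
    std::sort(word.begin(), word.end());
    return word;
}

using AnagramIndex = std::unordered_multimap<std::string, std::string>;

// Index every dictionary word under its canonical key.
AnagramIndex buildIndex(const std::vector<std::string>& words) {
    AnagramIndex index;
    for (const std::string& w : words) index.emplace(canonicalKey(w), w);
    return index;
}

// Return all dictionary words that are permutations of `scrambled`,
// sorted alphabetically so the result order is deterministic.
std::vector<std::string> unscramble(const AnagramIndex& index,
                                    const std::string& scrambled) {
    std::vector<std::string> matches;
    auto range = index.equal_range(canonicalKey(scrambled));
    for (auto it = range.first; it != range.second; ++it)
        matches.push_back(it->second);
    std::sort(matches.begin(), matches.end());
    return matches;
}
```

With this layout each scrambled word costs one sort plus one hash lookup, so there is no need to walk through every permutation with next_permutation.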

Reading from a text file properly

I am reading strings from lines in a text file, and for some reason the code will not read the whole file. It reads to some seemingly random point and then stops, leaving out several words from a line, or a few lines. Here is my code.
string total;
while(file >> word){
if(total.size() <= 40){
total += ' ' + word;
}
else{
my_vector.push_back(total);
total.clear();
}
}
Here is an example of a file
The programme certifies that all nutritional supplements and/or ingredients that bear the Informed-Sport logo have been tested for banned substances by the world class sports anti-doping lab, LGC. Athletes choosing to use supplements can use the search function above to find products that have been through this rigorous certification process.
It reads until "through" and leaves out the last four words.
I expected the output to be the whole file, not just part of it.
This is how I printed the vector.
for(int x = 0; x< my_vector.size(); ++x){
cout << my_vector[x];
}
You missed two things here:
First: when total.size() is not <= 40 (i.e. > 40), control moves to the else branch, where you push total into my_vector but ignore the word you just read from the file. You need to add that word to total after total.clear().
Second: when the loop terminates, the data still held in total is ignored as well. You need to push_back() it into the vector too (if required; that depends on your program logic).
So overall your code is going to look like this.
string total;
while(file >> word)
{
if(total.size() <= 40)
{
total += ' ' + word;
}
else
{
my_vector.push_back(total);
total.clear();
total += ' ' + word;
}
}
my_vector.push_back(total); // this step depends on your logic
// and on what you actually want to do
Your loop finishes when the end of file is read. However at this point you still have data in total. Add something like this after the loop:
if(!total.empty()) {
my_vector.push_back(total);
}
to add the last bit to the vector.
There are two problems:
When 40 < total.size(), only total is pushed to my_vector but the current word is not. You should probably unconditionally append the word to total and then my_vector.push_back(total) if 40 < total.size().
When the loop terminates you still need to push_back() the content of total, as it may not have reached a size of more than 40. That is, if total is non-empty after the loop terminates, you still need to append it to my_vector.
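The two fixes can be sketched together roughly like this. The function name splitIntoChunks and the limit parameter are my own for the illustration; unlike the original code, this version does not put a space in front of the first word of each chunk.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // for std::istringstream when trying this on a string
#include <string>
#include <vector>

// Read whitespace-separated words and group them into chunks.
// A chunk is closed as soon as it grows beyond `limit` characters.
std::vector<std::string> splitIntoChunks(std::istream& in, std::size_t limit = 40) {
    std::vector<std::string> chunks;
    std::string total, word;
    while (in >> word) {
        // Always append the word first, so no word is ever dropped.
        if (!total.empty()) total += ' ';
        total += word;
        if (total.size() > limit) {
            chunks.push_back(total);
            total.clear();
        }
    }
    // Whatever is left after the last word still has to be stored.
    if (!total.empty()) chunks.push_back(total);
    return chunks;
}
```

It can be called with an open std::ifstream, or with an std::istringstream for a quick experiment.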

confusion about lists and pairs

So I'm experimenting with adding first and last names to a doubly linked list.
I have various text files of different lengths in the format "string, string", and am using list<pair<string, string> > to store my data.
I am using this code:
typedef std::list< std::pair<string,string> > listPair;
...
list<pair<string, string> > mylist;
ifstream myFile;
myFile.open("20.txt");
pair<string, string> stuff;
while (myFile >> stuff.first >> stuff.second)
{
mylist.push_back(stuff);
}
listPair::iterator iter = mylist.begin();
for(;iter != mylist.end();iter++)
{
string s = (*iter).first;
cout << s << endl;
string c = (*iter).second;
cout << c << endl;
}
Now the problem I'm having is that, firstly, the last item in the list is not being added.
Every file just misses the last line, so that's a little confusing.
Also, I'm calling mylist.size() to check that all the names have been added, and it confuses me: for a text file containing 99 names, i.e. 99 lines of text, it says the list has size 48 (not forgetting that it only reads 98 because of the missing last line).
Why 48?
Is it something to do with using pairs? That still would not make sense: if it were not reading in pairs there would be double the amount, since each pair just holds the first and last name as one value.
Mind-boggling to me.
Once again, thanks for all your help!
I have a feeling your file doesn't actually have spaces between the values as you've described, so it looks like this:
one,two
three,four
five,six
seven,eight
nine,ten
If you were to run your program on this, the size of the list will be 2 (floor(number_of_lines/2), which for you would give 48) and the last line won't have been put in the list at all. Why?
Firstly, each call to std::ifstream::operator>>(std::string&) will extract up until it hits some white space. In this case, the first white space on the first line is the \n at the end of it. So on the first iteration, stuff.first will be "one,two" and then the next line will be read into stuff.second, making it "three,four". This is then pushed into the list. The next two lines are read in the same way, giving you the pair {"five,six","seven,eight"}. On the next iteration, the first operator>> will extract "nine,ten" and the second will fail, causing the while condition to end and the last line to be discarded.
Even if you did have spaces, you would end up with commas in the first of every pair, which is certainly not what you want.
The nicer way to approach a problem like this is to use std::getline to extract a line, and then parse that line as appropriate:
std::string line;
std::pair<std::string, std::string> line_pair;
while (std::getline(myFile, line)) {
std::stringstream line_stream(line);
std::getline(line_stream, line_pair.first, ',');
std::getline(line_stream, line_pair.second);
mylist.push_back(line_pair);
}
I also recommend using std::vector unless you have a good reason to use std::list.
Operator >> on an ifstream treats a newline like any other whitespace, so it does not stop at the end of a line; tokens are simply read one after another across line boundaries, and a line is not treated as one record.
Try using getline to 'eat' the newline as well.

How to remove a character from the string and change data if need it?

I have possible inputs 1M, 2M, ..., 11M and 1Y (M stands for months and Y for years), and I want to output "somestring1", "somestring2", ..., "somestring12". Note that M and Y are removed and the last value is changed to 12.
Example: input "11M" "hello" output: hello11
input "1Y" "hello" output: hello12
char * (const char * date, const char * somestr)
{
// just need to output final string no need to change the original string
cout<< finalStr<<endl;
}
The second string is output as a whole, so nothing changes there.
The first string is output character by character until M or Y is encountered. Since Stack Overflow discourages handing out complete source code, I will only give you a portion of it. There is a condition to add, which is up to you to figure out. (The second answer gives that as well.)
The code would be something like this.
//Code for first string. Just for output.
for (auto i = 0 ; date[i] != '\0' ; ++i)
{
// A condition comes here.
cout << date[i] ;
}
And note that this is considering you just output the string. Otherwise you can create another string and add up the two or concatenate the existing ones.
Is this homework? If not, here's what I'd suggest. (I ask about homework because you may have restrictions, not because we're not here to help.)
1) Do a find on 'M' in your string (using find), and insert a '\0' at that position if one is found (by the way, I'm assuming you have well-formatted input).
2) Do a find on 'Y'. If one is found, insert a '\0' at that position, then do an atoi() or stringstream conversion on your string to convert it to a number, and multiply by 12.
3) Concatenate the string representation from step 1 or step 2 to your somestr.
4) Output.
This can probably be done in fewer than 10 lines.
The a.find('M') part and its checks can use the conditional operator; the conversion and concatenation then take two or three lines at most.
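The steps above could be sketched as follows. This is only one reading of the question: it assumes "1Y" means 12 months, as the description says the last value becomes 12, and it uses std::string instead of writing '\0' into a C string. The function names toMonths and appendMonths are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Convert "11M" to 11 and "1Y" to 12 (years are multiplied by 12).
// Returns -1 if the input is not digits followed by 'M' or 'Y'.
int toMonths(const std::string& date) {
    if (date.size() < 2) return -1;
    const char unit = date.back();
    if (unit != 'M' && unit != 'Y') return -1;
    int value = 0;
    for (std::size_t i = 0; i + 1 < date.size(); ++i) {
        if (date[i] < '0' || date[i] > '9') return -1;
        value = value * 10 + (date[i] - '0');
    }
    return unit == 'Y' ? value * 12 : value;
}

// Append the month count to somestr; malformed dates leave somestr unchanged.
std::string appendMonths(const std::string& date, const std::string& somestr) {
    const int months = toMonths(date);
    return months < 0 ? somestr : somestr + std::to_string(months);
}
```

For example, appendMonths("11M", "hello") gives "hello11" and appendMonths("1Y", "hello") gives "hello12". The original strings are not modified, which matches the requirement in the question.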

substitution cipher in c++

I have generated a random text file:
A B C D E F G H
T W G X Z R L N
I want to encode my message so that A = T , B = W , C = G and so on..
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int
main ()
{
string getmsg;
ifstream openfile ("random.txt");
if (openfile.is_open ()) {
while (getline (openfile,getmsg)) { //read from random.txt; stops cleanly at end of file
cout << getmsg << endl;
}
}
}
I am quite stuck here.
E.g. when I input the word "HAD" it should display "NTX", and using the same random text file I should be able to input "NTX" and get back "HAD".
While others have pointed out Map, I would have used a simple array (subs) of size 26 (if there are only capital letters).
Initialize the array with 0s. Read all the chars and their mappings, and store them like this: subs[char-'A'] = mapped_char. I will leave the reading to you.
EDIT-
If you are ready to pay for extra memory usage, just make the size of subs as 123 (ASCII for z + 1).
This will also simplify the logic to subs[char] = mapped_char
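A minimal sketch of the array idea, assuming only capital letters A to Z are mapped and that the two rows of the file have already been read into two strings without spaces (the reading is still left to you). The Cipher struct and the names makeCipher and translate are invented for this example; the second table is my addition so that the same data can decode as well.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Two lookup tables indexed by (letter - 'A'); 0 means "no mapping".
struct Cipher {
    std::array<char, 26> enc{};  // plain -> coded
    std::array<char, 26> dec{};  // coded -> plain
};

bool isCapital(char ch) { return ch >= 'A' && ch <= 'Z'; }

// plain and coded are parallel strings, e.g. "ABCDEFGH" and "TWGXZRLN".
Cipher makeCipher(const std::string& plain, const std::string& coded) {
    Cipher c;
    for (std::size_t i = 0; i < plain.size() && i < coded.size(); ++i) {
        if (!isCapital(plain[i]) || !isCapital(coded[i])) continue;
        c.enc[plain[i] - 'A'] = coded[i];
        c.dec[coded[i] - 'A'] = plain[i];
    }
    return c;
}

// Translate msg with one of the tables; unmapped characters pass through.
std::string translate(const std::array<char, 26>& table, const std::string& msg) {
    std::string out;
    for (char ch : msg) {
        const bool mapped = isCapital(ch) && table[ch - 'A'] != 0;
        out += mapped ? table[ch - 'A'] : ch;
    }
    return out;
}
```

With the rows from the question, translate(c.enc, "HAD") gives "NTX" and translate(c.dec, "NTX") gives "HAD".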
Since this feels like homework I'll give you guidance rather than a solution.
You want to create a bijective map between an input character and the corresponding output character.
One solid way to do that is with a Map. Create a Map that has as its key the input character e.g. 'A' and as its value the output character e.g. 'T'.
For each character that you read in from your file, use the Map to lookup the corresponding output character.
You will need to read input one character at a time (simplest), or read one line at a time (as you do now) and run through each line, character by character, to do the translation with the Map.
Update
To clarify a point in the comments, this is a bijective function because there is exactly one encoded character for each original character, and vice versa. If the text does not have to be decoded, a Map will do for the software representation of the function. If decoding is needed, a Bimap is more appropriate.
Injective Function
Bijective Function
http://en.wikipedia.org/wiki/Injective_function
One way to do it is to take a look at std::map<> (map<char,char> in your case).
Using it, you can set up a map of character pairs; then, when you read a character from your file/buffer, you look it up in the map and retrieve the corresponding character.
Another, more verbose way would be to use a switch statement:
char ch;
openfile >> ch;
switch (ch)
{
case 'A': ch = 'T'; break;
...
}
cout << ch;
There are other ways as well; see if you can find one more involving an array.