How to create an inverted index when I've already tokenized my file? - c++

I'm trying to create an inverted index. I'm reading the lines of a text file; the first field of each line is the id of a document (docId), and the rest of the line holds keywords about that document.
In order to create an inverted index, I first have to tokenize this text file. I did it with a function I wrote, and I store every word in a vector. My only gripe is that I also store the docId as a string in the vector. Here is the header of the tokenize function if you need it:
void tokenize(string& s, char c, vector<string>& v)
Now, after tokenizing the file, I have to create a function that puts every word in a map. I'm thinking of using an unordered map in which every word appears once. I also have to store the frequency of each word somewhere. I thought that using the docId as the key would be a good idea, but then I realized that each docId could then map to only one word, while in my text file a docId has more than one word.
So, how am I going to solve this problem? Where should I begin?

What a mess of a question. Breaking it down, if I understand correctly you have:
doc1 word1a word1b word1c word1d
doc2 word2a word2b word2c
...
You want mappings from words to documents and vice versa. It's hard to tell from your question whether your talk of word "frequency" reflects the same word being a keyword for multiple documents, or whether the description you have of your file format failed to incorporate a needed count for repetitions within each file. Assuming the former:
// needs <fstream>, <iostream>, <map>, <sstream>, <string>, <vector>
if (std::ifstream f{filename})
{
    std::map<std::string, std::vector<std::string>> words_in_doc;
    std::map<std::string, std::vector<std::string>> docs_containing_word;
    std::string line;
    while (getline(f, line))
    {
        std::istringstream iss(line);
        std::string docid, word;
        if (iss >> docid)               // first token on the line is the document id
            while (iss >> word)         // remaining tokens are keywords
            {
                words_in_doc[docid].push_back(word);
                docs_containing_word[word].push_back(docid);
            }
    }
    // do whatever with your data/indices...
}
else
    std::cerr << "unable to open input file\n";
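If "frequency" instead means how many times a word occurs within each document (the other reading mentioned above), a nested map is one way to hold the counts. This is only a minimal sketch: `InvertedIndex` and `build_index` are names made up for illustration, and it assumes the same "docId word word ..." line format.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical alias: word -> (docId -> number of times the word appears in that doc).
using InvertedIndex = std::map<std::string, std::map<std::string, int>>;

// Builds the index from a stream where each line is "docId word word ...".
InvertedIndex build_index(std::istream& in)
{
    InvertedIndex index;
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream iss(line);
        std::string docid, word;
        if (iss >> docid)
            while (iss >> word)
                ++index[word][docid];   // operator[] creates missing entries with count 0
    }
    return index;
}
```

Pass it an open `std::ifstream`; `index[word].size()` is then the number of documents containing `word`. Swapping `std::map` for `std::unordered_map` should work the same way if sorted order does not matter.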

Related

Very specific parsing in C++

Basically, I'm trying to read in the words from a file and, without punctuation, read each word into a multimap which is then inserted into a vector with each pair being a word and the line of the file that word is found. I've got the function to remove punctuation working perfectly and I'm fairly certain my insert code works properly, but I can't seem to get around the line number part. I've included this section of my code as follows:
ifstream in("textfile.txt");
string line;
string keys;
stringstream keystream;
int line_number = 1;
while (getline(in, line, '\n')) {
alphanum(line);
keystream << line;
while(getline(keystream, keys, ' '))
table.insert(keys, line_number); //this just inserts the pair into my vector (table is an instance of a class I created)
keystream.str("");
line_number++;
}
The problem seems to be related to the stringstream. It doesn't seem to clear when I use keystream.str(""). This particular method only seems to read line 1 and then exits the loop, whereas some other variations I've tried (I can't remember exactly what I did) read the entire file but don't flush the stringstream, so it reads like word 1, word 1, word 2, word 1, word 2, word 3, etc. Anyway, if anyone could point me in the right direction or perhaps link to a guide specific to parsing input in C++, that would be greatly appreciated! Thanks!
Don't keep the string stream object; just make a new one in each round:
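string line;
int line_number = 1;
while (getline(in, line, '\n'))
{
    alphanum(line);
    istringstream keystream(line); // fresh stream each round, so no stale state
    string keys;
    while (getline(keystream, keys, ' ')) // or even "while (keystream >> keys)"
    {
        table.insert(keys, line_number);
    }
    line_number++;
}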
I think the problem is that the second getline() loop sets the EOF flag on the stringstream, and this is not cleared when you call str(). You need to call .clear() also on 'keystream'.
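If you would rather keep one stream object around, resetting both the error flags and the buffer on each round should do it. A small sketch, where `split_lines` is a hypothetical helper and not part of the original code:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits each line into whitespace-separated words while reusing a single stream.
// split_lines is a made-up name for illustration.
std::vector<std::vector<std::string>> split_lines(const std::vector<std::string>& lines)
{
    std::vector<std::vector<std::string>> result;
    std::istringstream keystream;
    std::string key;
    for (const std::string& line : lines)
    {
        keystream.clear();     // reset eofbit/failbit left over from the previous line
        keystream.str(line);   // replace the buffer contents with the new line
        std::vector<std::string> words;
        while (keystream >> key)
            words.push_back(key);
        result.push_back(words);
    }
    return result;
}
```

Without the `clear()` call, the inner loop would fail immediately on every line after the first, which matches the "reads line 1 and then exits" symptom.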

How to construct a parser for an input file

Can I please get some guidance on constructing a parser for an input file? I've been looking for help for weeks, and the assignment is already past due; I would just like to know how to do it.
The commented code is what I've tried, but I have a feeling it is more serious than that. I have a text file and I want to parse it to count the number of times that words appear in the document.
Parser::Parser(string filename) {
//ifstream.open(filename);
// source (filename, fstream::in | fstream::out);
}
"The commented code is what I've tried, but I have a feeling it is more serious than that."
I have a feeling you haven't tried a thing. So I am going to do the same.
Google is your friend.
To read a word:
std::ifstream file("FileName");
std::string word;
file >> word; // reads one word from a file.
// Testing a word:
int matches = 0;
if (word == "Floccinaucinihilipilification")
{
    ++matches;
}
// Count multiple words
std::map<std::string, int> count;
// read a word
++count[word];
// To read many words from a file:
while (file >> word)
{
    // You have now read a word from a file
}
Note: That is a real word :-)
http://dictionary.reference.com/browse/floccinaucinihilipilification
Take a look at the answers in How do you read a word in from a file in C++? . The easiest way is to use an ifstream and operator>> to read single words. You can then use a standard container like vector (as mentioned in the link above) or map<string, int> to remember the actual count.

Tokenization of a text file with frequency and line occurrence. Using C++

Once again I ask for help. I haven't coded anything for some time!
Now I have a text file filled with random gibberish. I already have a basic idea on how I will count the number of occurrences per word.
What really stumps me is how I will determine what line the word is in. Gut instinct tells me to look for the newline character at the end of each line. However, I have to do this while going through the text file the first time, right? If I do it afterwards it will do no good.
I already am getting the words via the following code:
vector<string> words;
string currentWord;
while(!inputFile.eof())
{
inputFile >> currentWord;
words.push_back(currentWord);
}
This is for a text file with no set structure. Using the above code gives me a nice little(big) vector of words, but it doesn't give me the line they occur in.
Would I have to get the entire line, then process it into words to make this possible?
Use a std::map<std::string, int> to count the word occurrences -- the int is the number of times it exists.
If you need line-by-line input, use std::getline(std::istream&, std::string&), like this:
std::vector<std::string> lines;
std::ifstream file(...); //Fill in accordingly.
std::string currentLine;
while(std::getline(file, currentLine))
    lines.push_back(currentLine);
You can split a line apart by putting it into a std::istringstream first and then using operator>>. (Alternatively, you could cobble up some sort of splitter using std::find and other algorithmic primitives.)
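For the curious, a splitter built on the algorithm primitives might look roughly like this. It is a sketch only: `split_words` is a made-up name, and it uses `std::find_if` with `std::isspace` rather than plain `std::find` so that tabs count as separators too.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Splits a line on runs of whitespace using std::find_if.
std::vector<std::string> split_words(const std::string& line)
{
    auto is_space  = [](unsigned char c) { return std::isspace(c) != 0; };
    auto not_space = [](unsigned char c) { return std::isspace(c) == 0; };

    std::vector<std::string> words;
    auto it = line.begin();
    while (true)
    {
        auto first = std::find_if(it, line.end(), not_space);   // start of next word
        if (first == line.end())
            break;
        auto last = std::find_if(first, line.end(), is_space);  // one past its end
        words.emplace_back(first, last);
        it = last;
    }
    return words;
}
```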
EDIT: This is the same thing as in @dash-tom-bang's answer, but modified to be correct with respect to error handling:
vector<pair<string, int>> words;
int currentLine = 1; // or 0, however you wish to count...
string line;
while (getline(inputFile, line))
{
    istringstream inputString(line);
    string word;
    while (inputString >> word)
        words.push_back(make_pair(word, currentLine));
    ++currentLine; // advance once per line read
}
Short and sweet.
vector< map< string, size_t > > line_word_counts;
string line, word;
while ( getline( cin, line ) ) {
    line_word_counts.push_back( map< string, size_t >() ); // one map per line
    map< string, size_t > &word_counts = line_word_counts.back();
    istringstream line_is( line );
    while ( line_is >> word ) ++ word_counts[ word ];
}
cout << "'Hello' appears on line 5 " << line_word_counts[5-1]["Hello"]
     << " times\n";
You're going to have to abandon reading into strings, because operator >>(istream&, string&) discards white space and the contents of the white space (== '\n' or != '\n', that is the question...) is what will give you line numbers.
This is where OOP can save the day. You need to write a class to act as a "front end" for reading from the file. Its job will be to buffer data from the file, and return words one at a time to the caller.
Internally, the class needs to read data from the file a block (say, 4096 bytes) at a time. Then a string GetWord() (yes, returning by value here is good) method will:
First, read any white space characters, taking care to increment the object's lineNumber member every time it hits a \n.
Then read non-whitespace characters, putting them into the string object you'll be returning.
If it runs out of stuff to read, read the next block and continue.
If you hit the end of file, the string you have is the whole word (which may be empty) and should be returned.
If the function returns an empty string, that tells the caller that the end of file has been reached. (Files usually end with whitespace characters, so reading whitespace characters cannot imply that there will be a word later on.)
Then you can call this method at the same place in your code as your cin >> line and the rest of the code doesn't need to know the details of your block buffering.
An alternative approach is to read things a line at a time, but all the read functions that would work for you require you to create a fixed-size buffer to read into beforehand, and if the line is longer than that buffer, you have to deal with it somehow. It could get more complicated than the class I described.

sorting strings in a file

I need a solution for sorting a Unix pwd file using C++, based on the last name. The format of the file is username, password, uid, gid, name, homedir, shell. All are separated by colon delimiters. The name field contains the first name followed by the last name, separated by a space. I am able to sort the values using a map, and I am posting my code. Can someone suggest improvements that I can make to my code, please? Also, I am unable to see the sorted lines in my file.
string line,item;
fstream myfile("pwd.txt");
vector<string> lines;
map<string,int> lastNames;
map<string,int>::iterator it;
if(myfile.is_open())
{
char delim =':';
int count =0;
while(!myfile.eof())
{
count++;
vector<string> tokens;
getline(myfile,line);
istringstream iss(line);
lines.push_back(line);
while(getline(iss,item,delim))
{
tokens.push_back(item);
}
cout<<tokens.size()<<endl;;
size_t i =tokens[4].find(" ");
string temp = tokens[4].substr(i,(tokens[4].size()-i));
cout<<temp<<endl;
lastNames.insert(pair<string,int>(temp,count));
tokens.clear();
}
myfile.seekg(0,ios::beg);
for(it=lastNames.begin();it!=lastNames.end();it++)
{
cout << (*it).first << " => " << (*it).second << endl;
int value=lastNames[(*it).first ];
myfile<<lines[value-1]<<endl;
cout<<lines[value-1]<<endl;
cout<<value<<endl;
}
}
I am also having a problem writing to the file: I am unable to see the sorted results.
My problem:
Can someone please explain why I am unable to see the written results in the file?
Thanks & Regards,
Mousey.
Since the format of the file is fixed
username, password, uid, gid, first name(space)lastname, homedir, shell
Maintain a std::map whose key is a string (the last name) and whose value is the line number.
Read the file line by line and extract the last name (split the line on ":" and then split the fifth field on the space).
Store the name along with the line number in the map.
When the complete file has been read, just output the lines in the order given by the map. (The map keeps the last names in sorted order.)
For splitting a string
Refer to
Split a string in C++?
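The steps above could be sketched roughly as follows. Two deliberate departures, both assumptions on my part: it uses a `std::multimap` so that two people with the same last name do not overwrite each other, and it stores the line text itself rather than a line number. `last_name` and `sorted_by_last_name` are made-up names.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Returns the last name from a colon-separated line, or "" if there is no fifth field.
std::string last_name(const std::string& line)
{
    std::istringstream iss(line);
    std::string field;
    for (int i = 0; i < 5; ++i)                 // after the loop, field holds the name field
        if (!std::getline(iss, field, ':'))
            return "";
    std::string::size_type space = field.rfind(' ');
    return space == std::string::npos ? field : field.substr(space + 1);
}

// Reads all lines and returns them ordered by last name.
std::vector<std::string> sorted_by_last_name(std::istream& in)
{
    std::multimap<std::string, std::string> by_name;   // last name -> whole line
    std::string line;
    while (std::getline(in, line))
        by_name.insert(std::make_pair(last_name(line), line));

    std::vector<std::string> result;
    for (const auto& entry : by_name)                  // multimap iterates in key order
        result.push_back(entry.second);
    return result;
}
```

Writing the result to a separate output stream (for example a `std::ofstream` on a new file) avoids the problem of reading and writing through the same `fstream`.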
If it's only a few megabytes, you can basically slurp it into memory and use the O(n log n) sorting algorithm of your choice to sort it, then write it out.
Basically, write a code snippet to compare two lines the way you want, and use that with your standard library sort routine to sort the data. Or write your own sort routine, whatever.
If you're interested in how you'd go about dealing with gigabytes of data, take a look at Wikipedia's article on External Sorting for a good jumping-off point.

Translating Program

I am beginning to write a translator program which will translate a string of text found on a file using parallel arrays. The language to translate is pig Latin. I created a text file to use as a pig latin to English dictionary. I didn't want to use any two dimension arrays; I want to keep the arrays in one dimension.
Basically I want to read a text file written in PigLatin and using the dictionary I created I want to output the translation to English on the command line.
My pseudo-code idea is:
Open the dictionary text file.
Ask the user for the name of the text file written in PigLatin that he/she wants to translate to English
Searching each word on the user's text file and comparing to the Dictionary to then translate the word accordingly. Keep on going until there are no more words to translate.
Show the translated words on the command line screen.
I was thinking of using parallel arrays, one containing the English translated words and another containing the Pig Latin words.
I would like to know how I can manipulate the strings using arrays in C++.
Thank you.
If files will always be translated in one direction (e.g. PigLatin -> English), then it would be easier and more efficient to use std::map to map one string to another:
std::map<std::string, std::string> dictionary;
dictionary["ashtray"] = "trash";
dictionary["underplay"] = "plunder";
To get a translated word, just use dictionary[] to look it up (e.g. std::cout << dictionary["upidstay"] << std::endl;)
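One thing to watch: `operator[]` on a `std::map` inserts an empty entry when the key is missing, so looking up an unknown word silently grows the dictionary and prints nothing. A lookup through `find` avoids that. A small sketch, with `translate` being a made-up helper name:

```cpp
#include <map>
#include <string>

// Returns the translation, or the word unchanged if it is not in the dictionary.
std::string translate(const std::map<std::string, std::string>& dictionary,
                      const std::string& word)
{
    auto it = dictionary.find(word);    // find never modifies the map
    return it != dictionary.end() ? it->second : word;
}
```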
Pig latin can be translated on the fly.
Just split the words before the first vowel of each word and you won't need a dictionary file. Then concatenate the second part with the first part, delimited with a '-', and add "ay" at the end.
Unless you want to use a dictionary file?
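A sketch of that rule, plus its inverse, since the question actually goes from Pig Latin to English. The function names are made up, and treating vowel-initial or vowel-less words as having an empty moved part ("apple" becomes "apple-ay") is my assumption; real Pig Latin conventions vary. The hyphen is what makes the reverse direction unambiguous.

```cpp
#include <string>

// English -> hyphenated Pig Latin, e.g. "trash" becomes "ash-tray".
std::string to_pig_latin(const std::string& word)
{
    std::string::size_type vowel = word.find_first_of("aeiouAEIOU");
    if (vowel == std::string::npos)
        vowel = 0;                      // no vowel: move nothing (assumption)
    return word.substr(vowel) + "-" + word.substr(0, vowel) + "ay";
}

// Hyphenated Pig Latin -> English; returns the input unchanged if it does not fit the pattern.
std::string from_pig_latin(const std::string& word)
{
    std::string::size_type dash = word.rfind('-');
    if (dash == std::string::npos || word.size() < dash + 3 ||
        word.compare(word.size() - 2, 2, "ay") != 0)
        return word;
    // The text between the dash and the trailing "ay" is the part that was moved.
    std::string head = word.substr(dash + 1, word.size() - dash - 3);
    return head + word.substr(0, dash);
}
```

Note that this only works if the input file keeps the hyphen; with unhyphenated words such as "ashtray" the split point is ambiguous and a dictionary is the safer route.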
Declaring an array of strings is easy, the same as declaring an array of anything else.
const int MaxWords = 100;
std::string piglatin[MaxWords];
That's an array of 100 string objects, and the array is named piglatin. The strings start out empty. You can fill the array like this:
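int numWords = 0;
std::ifstream input("piglatin.txt");
std::string line;
// Check the capacity first so a line is never read and then thrown away.
while (numWords < MaxWords && std::getline(input, line)) {
    piglatin[numWords] = line;
    ++numWords;
}
if (numWords == MaxWords) {
    // The array is full; any remaining lines in the file were not read.
    std::cerr << "Too many words" << std::endl;
}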
I strongly recommend you not use an array. Use a container object, such as std::vector or std::deque, instead. That way, you can load the contents of the files without knowing in advance how big the files are. With the example declaration above, you need to make sure you don't have more than 100 entries in your file, and if there are fewer than 100, then you need to keep track of how many entries in your array are valid.
std::vector<std::string> piglatin;
std::ifstream input("piglatin.txt");
std::string line;
while (std::getline(input, line)) {
piglatin.push_back(line);
}
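If you do want to stay with the parallel-array idea, two vectors filled in the same order can play that role: the entry at index i of one is the translation of the entry at index i of the other. A minimal sketch, assuming a second vector `english` loaded the same way as `piglatin`; `lookup` is a made-up helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Finds word in piglatin and returns the entry at the same index in english,
// or the word unchanged if there is no match.
std::string lookup(const std::vector<std::string>& piglatin,
                   const std::vector<std::string>& english,
                   const std::string& word)
{
    auto it = std::find(piglatin.begin(), piglatin.end(), word);   // linear search
    if (it == piglatin.end())
        return word;
    std::size_t index = static_cast<std::size_t>(it - piglatin.begin());
    return index < english.size() ? english[index] : word;         // guard against unequal lengths
}
```

The search is linear in the size of the dictionary, which is fine for a small word list; the std::map approach scales better.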