SPIMI algorithm misunderstanding


I'm trying to implement a single-pass in-memory indexer (SPIMI) in C++.
Either there is something wrong in the algorithm or, more likely, I have misunderstood it:
SPIMI-INVERT(token_stream)
    output_file = NEWFILE()
    dictionary = NEWHASH()
    while (free memory available)
        token ← next(token_stream)
        if term(token) ∉ dictionary
            then postings_list = ADDTODICTIONARY(dictionary, term(token))
            else postings_list = GETPOSTINGSLIST(dictionary, term(token))
        if full(postings_list)
            then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
        ADDTOPOSTINGSLIST(postings_list, docID(token))
    sorted_terms ← SORTTERMS(dictionary)
    WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
    return output_file
Let's assume that I have done all the parsing and turned the documents into a stream of tokens, where each token is a (term, doc_id) pair.
http://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html says that the SPIMI-INVERT function is called for every block.
All right, let's walk through it:
We read the stream block by block, so I now have a single block and pass it to SPIMI-INVERT as an argument.
The function processes the tokens and adds them to the dictionary.
At some point (maybe because the dictionary has grown too big) we run out of free memory while still inside the while loop.
The algorithm leaves the loop and writes the current dictionary to disk.
But from the outside world (as the caller of the function), I have no idea whether the block I passed in was processed completely or not. Don't you think there is something wrong here?

Since there has been no answer so far, and I have now talked to my professor, I am answering my own question.
I must say that the algorithm is not really clear; my professor was not sure either. This answer reflects how I interpreted it.
The token stream is a file that contains tokens, each a (term, doc-id) pair.
    while (there is a token in the token stream)
        dict = new dictionary()
        while (there is free memory available and there is a token in the token stream)
            token = next_token()
            dict.add_to_postingList(token)
        write_dict_to_file(dict)
        delete dict
    // Assumes the postings list is dynamically sized and the dictionary knows whether a term exists.
I implemented it in C++; it might be useful.

Related

How to know if record read from binary file has empty fields? C++

I am in high school and it is mandatory to use the Turbo C++ compiler. I know it is a very old compiler, but please understand my situation.
I was writing code for an employee database. The code snippet:
    userdb user;
    fstream fil;
    while(fil.read((char*)&user,sizeof(userdb)))
    {
        cout<<user.name;
        cout<<user.pass;
        cout<<user.age;
        cout<<user.address;
    }
    fil.close();
Now the problem is that if a user's address was never entered into the database, the program displays garbage.
How can I check whether a field holds nothing (garbage), so as not to print it on the screen?
(I have tried address[0]='\0' and strcmp("",address)==0, and neither works.)

An "empty field" does not mean anything in this context. You are reading N bytes from a file and storing them in memory, then telling the computer to interpret part of that memory as a string. Nothing in those bytes records whether the field was ever filled in.
Your best bet is to inspect that memory and try to guess whether it looks like an actual address.
As a first step, you could check whether the address string, which is stored in a fixed-size character array, contains a terminating null character. If it does not, you can assume the field is invalid, and possibly write a terminator into the last element of the array.
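The terminator check described above could be sketched like this. The helper name has_text is made up for the example, and the code is written for a standard C++ compiler; an old toolchain such as Turbo C++ would likely need <string.h> and a plain 0 instead of nullptr. It is only a heuristic: it becomes reliable if the writing program zero-fills each record (for example with memset) before storing it, which is an assumption about code not shown in the question.

```cpp
#include <cstddef>
#include <cstring>

// Returns true if the fixed-size field contains a null terminator somewhere
// within its bounds and the string in front of that terminator is non-empty.
bool has_text(const char* field, std::size_t size) {
    // memchr never reads past the end of the array, unlike strlen or strcmp.
    const void* terminator = std::memchr(field, '\0', size);
    return terminator != nullptr && field[0] != '\0';
}
```

In the read loop from the question, the call would be something like has_text(user.address, sizeof user.address) placed in front of the cout line.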

Remove escape characters from a char

I've been working on this for about two days now. I'm stuck on a rather simple annoyance that I can't solve.
My program basically receives a TCP connection from a PHP script, and the message that is sent is stored in char buffer[1024];.
This buffer variable contains a unique key, which is compared to char key[1024] = "supersecretkey123";
The problem is that the two never compare equal, no matter what I do.
I've printed the buffer and key variables one above the other, and they look 100% identical. However, my equality test still fails.
if(key == buffer) { // do some thing here etc }
I then searched the internet for what could be wrong and realized that some escape characters might be getting in the way. But I can't print them, remove them, or even confirm that they are there. So I'm stuck, out of ideas on how to make these compare equal when the buffer variable matches the key variable.
The key does not change unless its declaration is modified manually. The program itself receives the information and sends information back "correctly".
Thanks.

If you're using null-terminated strings, use the proper API: strcmp and its variants.
Additionally, the size in the declaration char key[1024] = "supersecretkey123"; is not needed. Writing char key[] = "supersecretkey123"; lets the compiler size the array to fit; with an explicit 1024, the unused bytes are simply wasted.

If you are using C++, use std::string instead of char[]. You cannot compare two char[] the way you are trying to: in key == buffer both arrays decay to pointers, so the test compares addresses, not contents. With std::string, == compares the contents.
If char[] is somehow mandatory in your case, use strcmp.

Try if(!strncmp(key,buffer,1024)). See a reference on strncmp for details.
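To tie the answers together, here is a small sketch of a content comparison that also drops a trailing line ending. Whether the PHP side really sends a "\r\n" or "\n" after the key is a guess, not something the question confirms, and the helper names strip_line_end and keys_match are invented for the example. The code also assumes the receiving code has null-terminated buffer after reading from the socket.

```cpp
#include <cstddef>
#include <cstring>

// Removes any trailing '\r' and '\n' characters in place.
void strip_line_end(char* s) {
    std::size_t n = std::strlen(s);
    while (n > 0 && (s[n - 1] == '\n' || s[n - 1] == '\r')) {
        s[--n] = '\0';
    }
}

// Compares contents with strcmp instead of comparing addresses with ==.
bool keys_match(const char* key, char* buffer) {
    strip_line_end(buffer);
    return std::strcmp(key, buffer) == 0;
}
```

To find out what is really in the buffer, printing each byte as a number (for instance with printf("%d ", buffer[i])) makes invisible characters show up.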

What type of input check can be performed against binary data in C++?

Let's say I have a function like this in C++ that I wish to publish to third parties. I want to make it so that the user will know what happened, should he/she feeds invalid data in and the library crashes.
If it helps, I can change the interface as well.
int doStuff(unsigned char *in_someData, int in_data_length);
Apart from application-specific input validation (e.g. checking whether the binary begins with a known identifier), what can be done? For example, can I let the user know if he/she passes in an in_someData that holds only 1 byte of data but passes 512 as in_data_length?
Note: I already asked a similar question here, but let me ask from another angle.

There is no way to check whether the in_data_length passed to the function has the correct value. If that were possible, the parameter would be redundant and therefore unnecessary.
A std::vector from the standard library solves this:
int doStuff(const std::vector<unsigned char>& in_someData);
With this signature, a "NULL buffer" or a mismatched length parameter is impossible.
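A minimal sketch of what the vector-based interface might look like in use. The two-byte identifier 0xCA 0xFE, the -1 error code and the "payload length" return value are all invented for the example; the question does not say what doStuff actually does.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical body: the "known identifier" is assumed to be 0xCA 0xFE.
// Returns the number of payload bytes after the identifier, or -1 if the
// data is too short or does not start with the identifier.
int doStuff(const std::vector<unsigned char>& in_someData) {
    const std::size_t header_size = 2;
    // size() is always the true length, so this check cannot be fooled
    // by a caller passing the wrong number.
    if (in_someData.size() < header_size) return -1;
    if (in_someData[0] != 0xCA || in_someData[1] != 0xFE) return -1;
    return static_cast<int>(in_someData.size() - header_size);
}
```

A caller holding a raw buffer can still build the argument with std::vector<unsigned char>(ptr, ptr + len), but then the length mistake happens visibly in the caller's own code.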

If you knew how many bytes were passed in in_someData, why would you need in_data_length at all?
In practice, you can only check that in_someData is not NULL and that in_data_length is positive, and return an error code if either check fails. If a user passes garbage to your function, that is obviously not your problem.

In C++, the magic word you're looking for is "exception". That gives you a way to tell the caller something went wrong. You'll end up with code something like
    #include <stdexcept>
    int
    doStuff(unsigned char * inSomeData, int inDataLength) {
        // do a test
        if(inDataLength == 0)
            throw std::invalid_argument("Length can't be 0");
        // only gets here if it passed the test
        // do other good stuff
        return theResult;
    }
Now, there's another problem with your specific example, because there's no universal way in C or C++ to tell how long an array of primitives really is. It's all just bits, with inSomeData being the address of the first byte. Strings are a special case, because there's a general convention that a zero byte ends a string, but you can't depend on that for binary data: a zero byte is just a zero byte.
Update
This has picked up some downvotes, apparently over the exception specification that this code originally carried on its signature. To be precise about that: C++ spells a dynamic exception specification throw(Type) rather than Java's throws Type, and such specifications were deprecated in C++11 and removed in C++17, which is why the signature above carries none. Throwing and catching exceptions is unaffected by any of that.
Also note that the original questioner says "I want to make it so that the user will know what happened, should he/she feeds [sic] invalid data in and the library crashes." Thus the question is not just "what can I do to validate the input data?" (answer: not much, unless you know more about the inputs than was stated), but also "how do I tell the caller they screwed up?" And the answer to that is "use the exception mechanism", which has certainly not been deprecated.

Tokenizer efficiency question

I'm writing a compiler front end for a project, and I'm trying to work out the best way to tokenize the source code.
I can't choose between two approaches:
1) The tokenizer reads all the tokens up front:
    bool Parser::ReadAllTokens()
    {
        Token token;
        while( m_Lexer->ReadToken( &token ) )
        {
            m_Tokens->push_back( token );
            token.Reset(); // reset the token values..
        }
        return !m_Tokens->empty();
    }
and then the parsing phase begins, operating on the m_Tokens list. This way, getNextToken(), peekNextToken() and ungetToken() are relatively easy to implement with an iterator, and the parsing code is clean and clear (not broken up by getNextToken() calls, e.g.:
    getNextToken();
    useToken();
    getNextToken();
    peekNextToken();
    if( peeked is something )
        ungetToken();
    ..
    ..
)
2) The parsing phase begins right away, and each token is created and used only when it is needed (the code seems less clear).
Which method is better, and why? What about efficiency?
Thanks in advance for the answers.

Traditionally, compiler construction classes teach you to read tokens one by one as you parse. The reason is that, back in the day, memory was scarce: you had kilobytes at your disposal, not gigabytes as you do today.
Having said that, I would not recommend reading all tokens in advance and then parsing from your list of tokens. Input is of arbitrary size, and if you hog too much memory, the system will become slow. Since it looks like you only need one token of lookahead, I'd read one at a time from the input stream. The operating system will buffer and cache the input stream for you, so it'll be fast enough for most purposes.

It would be better to use something like Boost.Spirit to tokenise. Why reinvent the wheel?

Your method (1) is generally overkill: there is no need to tokenize an entire file before parsing it.
A good way to go is to implement a buffered tokenizer, which keeps a list of the tokens that were peeked or ungot. A "get" consumes an element of this list, or reads a token from the file when the list is empty (à la FILE*).

The first method is better, as you will still be able to understand the code three months later...

Write a program to count how many times each distinct word appears in its input

This is exercise 3-3 in Accelerated C++.
I am new to C++. I have thought about this for a long time, but I can't figure it out.
Can anyone solve this problem for me?
Please explain it in detail, as I am not very good at programming, and tell me the meaning of the variables you use.

The best data structure for this is something like a std::map<std::string,unsigned>, but you don't encounter maps until chapter 7.
Here are some hints based on the contents of chapter 3:
You can put strings in a vector, so you can have std::vector<std::string>
Strings can be compared, so std::sort works with std::vector<std::string>, and you can check if two strings are the same with s1==s2 just like for integers.
You saw in chapter 1 that std::cin >> s reads a word from std::cin into s if s is a std::string.

To maximize the learning experience, I will not provide pastable code. That's the exercise; you have to do it yourself to learn as much as you can.
This is the perfect scenario for employing a kind of map that creates its value type upon accessing a non-existing key. Fortunately, C++ has such a map in its standard library: std::map<key_type,value_type> is exactly what you need.
So here are the jigsaw pieces:
you can read word by word from a stream into a string by using operator >>
you can store what you find in a map of words (strings) to occurrences (unsigned number type)
when you access an entry in the map through a non-existing key, the map will helpfully create a new value-initialized value under that key for you; if the value happens to be a number, it will be set to 0 (zero)
Have fun putting this together!

Here's my hint. std::map will be your friend.

Here is an algorithm you could use. Try coding something and post your results here; people can then help you get further.
Scan down the string, collecting each letter until you get to a word boundary (say a space, '.', ',', etc.).
Take that word and compare it to the words you've already found. If it has already been found, add one to the count for that word. If it has not, add it to the list of words found with a count of 1.
Carry on down the string.

Well, you need a way of getting individual words from the input stream (perhaps something like an "input stream" method applied to the "standard input stream") and a way of storing those strings and counts in some sort of "collection".
My natural homework cynicism and general apathy towards life prevent me from adding more detail at the moment :-)
The meaning of any variables I use is fairly self-evident since I tend to use things like objectsRemaining or hasBeenOpened.
