Reading a file of keyword sets for documents used to search in a library computer system - C++

Description of scenario for the project: To search for articles in a library computer system, one uses keywords in combination with Boolean operators such as 'and' (&) and 'or' (|). For example, if you are going to search for articles and books dealing with uses of nanotechnology and bridge construction, the query would be nanotechnology & bridge construction. In order to retrieve the books and articles properly, every document is represented using a set of keywords that describe the content of the document.
Assume that each document (book, article, etc.) is represented by a unique document number. You will be provided with a set of documents, each represented by its number and the keywords contained in that document, as given below.
887 5
nanotechnology
bridge construction
carbon fiber
digital signal processing
wireless
The number 887 above corresponds to the document number and 5 is the number of keywords that are given for the document. Each keyword will be on a separate line. The input for your project will contain a set of document numbers and keywords for each document. The first line of the input will contain an integer that corresponds to the number of document records to process.
An inverted list is a data structure in which, for each keyword, we store the set of document numbers that contain that keyword. For example, for the keywords bridge construction and carbon fiber we would have the following:
bridge construction 887, 117, 665, 900
carbon fiber 887, 1098, 654, 665, 117
The documents numbered 887, 1098, 654, 665, and 117 all contain the keyword carbon fiber, and the keyword bridge construction is found in documents numbered 887, 117, 665, and 900.
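In C++, one natural way to model this (an assumption on my part, not the only possible design) is a std::map from each keyword to a std::set of document numbers. A minimal sketch using the example entries above:

#include <iostream>
#include <map>
#include <set>
#include <string>

int main()
{
    // each keyword maps to the set of documents containing it
    std::map<std::string, std::set<int>> invertedList;
    invertedList["bridge construction"] = {887, 117, 665, 900};
    invertedList["carbon fiber"] = {887, 1098, 654, 665, 117};

    for (const auto& [keyword, docs] : invertedList)
    {
        std::cout << keyword << ":";
        for (int d : docs) std::cout << ' ' << d; // std::set prints in sorted order
        std::cout << '\n';
    }
}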
There are two main aspects to this project:
reading a file (using standard input) that contains the document information and building the inverted list data structure, and
applying Boolean queries to the inverted list data structure.
The Boolean queries are processed as illustrated in the following example. To obtain the documents containing the keywords bridge construction & carbon fiber, we perform a set intersection operation and get the documents 887, 117, and 665. The Boolean query bridge construction | carbon fiber results in a set union operation, and the documents for this query are 887, 117, 654, 665, 900, and 1098.
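Because std::set keeps its elements sorted, the & and | queries map directly onto std::set_intersection and std::set_union from <algorithm>. A sketch over the two example sets (output order is numeric, since sets are sorted):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>

int main()
{
    std::set<int> bridge = {887, 117, 665, 900};
    std::set<int> carbon = {887, 1098, 654, 665, 117};

    std::set<int> andResult, orResult;

    // & query: documents containing both keywords
    std::set_intersection(bridge.begin(), bridge.end(),
                          carbon.begin(), carbon.end(),
                          std::inserter(andResult, andResult.begin()));

    // | query: documents containing either keyword
    std::set_union(bridge.begin(), bridge.end(),
                   carbon.begin(), carbon.end(),
                   std::inserter(orResult, orResult.begin()));

    for (int d : andResult) std::cout << d << ' '; // 117 665 887
    std::cout << '\n';
    for (int d : orResult) std::cout << d << ' ';  // 117 654 665 887 900 1098
    std::cout << '\n';
}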
OK SO MY QUESTION IS:
How do I read the document, since my first class is a setClass that stores a set of document numbers?
My problem is that all the documents are in one text file, for example:
25 //first document number
329 7 //second document number
ARAMA
ROUTING ALGORITHM
AD-HOC
CSMA
MAC LAYER
JARA
MANET
107 4 //third document number
ANALYSIS
CROSS-LAYER
GEOGRAPHIC FORWARDING
WIRELESS SENSOR NETWORKS
So how can I read the document numbers, since each document has a different number of keywords and they come one right after another?

Is the "25" on the first line actually the number of documents in the file? I'll go with that assumption (if not, just read documents until you hit EOF)
Here is some pseudo-code for reading the file:
int numDocs = readLine // assuming first number is number of docs
for (int i = 0; i < numDocs; ++i)
{
string line = readLine
int docNumber = getFirstNumber(line)
int numKeywords = getSecondNumber(line)
for (int j = 0; j < numKeywords; ++j)
{
string keyword = readline
associate keyword with docNumber // however this works
}
}


GATE MatchesAnnots Feature Output

I'm attempting to perform coreference using the GATE MatchesAnnots document feature, and I get the following output:
{null = [[866, 871, 872], [869, 873, 877, 879], [874, 895, 896]]}
Can anyone help me understand what this means? I'm assuming each of these arrays is a coreference chain, but what are the numbers? Character start positions? I'm a bit lost.
Question
what are the numbers?
Answer
They are GATE annotation ids of the chained annotations.
Explanation
The GATE document feature MatchesAnnots contains a map (Map<String, List<List<Integer>>>) with the following content:
Each key corresponds to the name of the corresponding AnnotationSet.
Each value is a list of all the coreference chains.
Each coreference chain is a list of IDs of annotations belonging to the chain.
See also
Parse GATE Document to get Co-Reference Text (similar SO question)
GATE Annotations (official documentation)

Best way to read this file to manipulate later?

I am given a config file that looks like this for example:
Start Simulator Configuration File
Version/Phase: 2.0
File Path: Test_2e.mdf
CPU Scheduling Code: SJF
Processor cycle time (msec): 10
Monitor display time (msec): 20
Hard drive cycle time (msec): 15
Printer cycle time (msec): 25
Keyboard cycle time (msec): 50
Mouse cycle time (msec): 10
Speaker cycle time (msec): 15
Log: Log to Both
Log File Path: logfile_1.lgf
End Simulator Configuration File
I am supposed to be able to take this file and output the cycles and cycle times to a log and/or the monitor. I am then supposed to pull data from a meta-data file that will tell me how many cycles each of these runs (among other things), and then I'm supposed to calculate and log the total time. For example, 5 hard drive cycles would be 75 msec. The config and meta-data files can come in any order.
I am thinking I will put each item in an array and then cycle through, waiting for true when the strings match (this will also help detect file errors). The config file should always be the same size despite a different order. The meta-data file can be any size, so I figured I would do a similar thing but with a vector.
Then I will multiply the cycle times from the config file by the number of cycles in the matching meta-data file string. I think the best way to read the data from the vector is through a queue.
Does this sound like a good idea?
I understand most of the concepts, but my data-structures knowledge is shaky in terms of actually coding it. For example, when reading from the files, should I read line by line, or would it be best to separate the ints from the strings to calculate them later? I've never had to do this with a file that can change before.
If I separate them, would I have to use separate arrays/vectors?
I'm using C++, btw.
Your logic should be:
1. Create two std::map variables: one that maps a string to a string, and another that maps a string to a float.
2. Read each line of the file.
3. If the line contains a ':', split the string into two parts:
3a. Part A is the substring from index zero up to (but not including) the index of the ':'.
3b. Part B is the substring starting at 1 + the index of the ':'.
4. Use these two parts as the key and value to store in your std::map variables, choosing the map based on the value's type.
Now you have read the file properly. When you read the meta-data file, simply look up each of its keys in your configuration-file data (to get the value), then do whatever mathematical operation is required.
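A minimal sketch of that logic, assuming a hypothetical file name config.conf and deciding between the two maps by whether the entire value parses as a number:

#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::map<std::string, std::string> textSettings; // e.g. "CPU Scheduling Code" -> "SJF"
    std::map<std::string, float> numericSettings;    // e.g. "Processor cycle time (msec)" -> 10

    std::ifstream in("config.conf"); // hypothetical file name
    std::string line;
    while (std::getline(in, line))
    {
        std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue; // skip the Start/End marker lines

        std::string key = line.substr(0, colon);    // part A
        std::string value = line.substr(colon + 1); // part B
        if (!value.empty() && value.front() == ' ')
            value.erase(0, 1); // trim the space after the ':'

        try
        {
            std::size_t used = 0;
            float f = std::stof(value, &used);
            if (used == value.size()) { numericSettings[key] = f; continue; }
        }
        catch (const std::exception&) {} // not a number: fall through to text
        textSettings[key] = value;
    }

    // e.g. 5 hard drive cycles:
    std::cout << 5 * numericSettings["Hard drive cycle time (msec)"] << " msec\n";
}

One caveat with this rule: "Version/Phase: 2.0" parses as a number and lands in the numeric map; special-case that key if you want it kept as text.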

How to make a vectorized file in Python? I need to convert tweets to vector form in order to run code in a Bayesian network

Is it possible to make a dataset at least? I am doing sentiment analysis and am getting the polarity of the message.
I was following this tutorial, but it is not the data set required:
http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
It would be great if anyone could explain the CSV file given there.
Basically, the process of converting a collection of text documents into numerical feature vectors is called vectorization. There are several techniques or concepts that can be used to vectorize text documents (e.g. word embeddings, bag of words, etc.).
Bag of words is one of the simplest ways to vectorize text into numerical features. TfIdf is an effective vectorization technique based on the bag-of-words concept.
On a very basic level, TfIdf builds a set of unigrams or bigrams (n-grams in general) from the entire text corpus and uses them as the features for all your text documents (tweets in your case). If you imagine your text corpus as a table of numerical values, then each row is a text document (a tweet) and each column is a unigram (basically a word). The value of each cell (i, j) in the table depends on the term frequency of unigram j in tweet i (the number of times that unigram occurs in the tweet) and the inverse of the document frequency of unigram j (the document frequency being the number of tweets in which that unigram occurs). Hence, you end up with each tweet as a vector holding a numerical TfIdf value for each feature (unigram).
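The links below show the usual way to do this in Python with scikit-learn. Purely to make the arithmetic concrete, here is a toy sketch (in C++, but the idea is language-agnostic) of the raw tf * log(N/df) weighting; scikit-learn's TfidfVectorizer uses a smoothed, normalized variant, so its exact numbers will differ:

#include <cmath>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main()
{
    // toy corpus: each "tweet" already split into unigrams
    std::vector<std::vector<std::string>> tweets = {
        {"love", "this", "phone"},
        {"hate", "this", "phone"},
        {"love", "love", "love"},
    };
    const double N = tweets.size();

    // document frequency: how many tweets contain each unigram?
    std::map<std::string, int> df;
    for (const auto& tweet : tweets)
    {
        std::set<std::string> unique(tweet.begin(), tweet.end());
        for (const auto& w : unique) ++df[w];
    }

    // cell (i, j) = tf(i, j) * log(N / df(j))
    for (std::size_t i = 0; i < tweets.size(); ++i)
    {
        std::map<std::string, int> tf;
        for (const auto& w : tweets[i]) ++tf[w];

        std::cout << "tweet " << i << ":";
        for (const auto& [word, count] : tf)
            std::cout << ' ' << word << '=' << count * std::log(N / df[word]);
        std::cout << '\n';
    }
}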
For more information on how to implement tfidf look at the following links:
http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Byte offset notation for a 900 MB XML file

I am building a search engine in C++ over a ~900 MB XML file that contains pages from WikiBooks. My objective is to parse the XML document using rapidXML so that the user can just enter one word in the search bar and receive the actual XML documents (links) that contain that word.
I need to figure out how to store the index of each token (i.e., each word within each document) so that when the user wants to see the pages on which a certain word occurs, I can jump to that specific page.
I have been told to do "file I/O offsets" (where you store where in the file a word is so that you can jump to it), and I am having a hard time understanding what to do (a rough sketch of the idea follows the questions below).
Questions:
Do I use seekg and tellg from the istream library (to find the byte location at which each document PAGE is stored)? And if so, how?
How do I return the actual document (one that contains many occurrences of the searched word) back to the user?
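For reference, the bare mechanics of the seekg/tellg offset idea look roughly like this. This is only a sketch, assuming a hypothetical pages.xml scanned line by line and independent of rapidXML: record the byte offset of each <page> on a first pass, then seek straight back to it later.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::ifstream file("pages.xml"); // hypothetical file name
    std::vector<std::streampos> pageOffsets;

    // first pass: remember the byte offset where each <page> starts
    std::string line;
    std::streampos lineStart = file.tellg();
    while (std::getline(file, line))
    {
        if (line.find("<page") != std::string::npos)
            pageOffsets.push_back(lineStart);
        lineStart = file.tellg(); // offset of the next line
    }

    // later: jump straight to, say, the third page and re-read it
    if (pageOffsets.size() >= 3)
    {
        file.clear();               // clear the EOF flag before seeking
        file.seekg(pageOffsets[2]); // jump to the stored byte offset
        std::getline(file, line);
        std::cout << line << '\n';
    }
}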

How to write column names with Hadoop MapReduce?

I am running a MapReduce job using the Hadoop streaming API, writing the mapper and reducer in Python. My questions are about formatting the final (reducer) output, plus a few others. Can I write column names (are they called column qualifiers)? Also, how do I keep the columns aligned when each row entry has a different width?
As each line in the input file is processed, is it possible to set a counter and increment it? If so, how? Does the mapper have to emit a key, or can it emit just the value?
I hear log4j is used to log errors. What needs to be done in the reducer (or in log4j) to get errors logged there? Will that also work with Python, and if not, how do I get them logged?