I am working on a project on keyphrase extraction from text arguments. I first cleaned the input and then determined a list of candidate phrases (around 300 in total) using the Stanford parser (POS tagging). Then I computed the feature values of each phrase. I followed these steps on every document in my dataset. Now how should I proceed, i.e., how do I use WEKA to find the keyphrases? How should I store the phrases and their feature values (TF×IDF) in WEKA, and how do I evaluate the performance of the final system?
WEKA does an excellent and simple job with text classification tasks (like text categorization and clustering), in which the instances are relatively long pieces of text (e.g. from tweets to documents) and the classes (when available) are non-overlapping tags (e.g. thematic classes like economy/sports/..., spam/legitimate email, or positive/negative in sentiment analysis).
However, WEKA does not directly fit term classification tasks like part-of-speech tagging, word sense disambiguation, named entity recognition, or, in your case, keyphrase extraction. To apply WEKA, you not only need your original texts and the manually extracted keyphrases, but you also have to decide on the attributes that make those pieces of text actual keyphrases. You have to inspect your examples and decide, for instance, that the parts of speech of the words in a keyphrase and of the surrounding words are actually important for guessing that a piece of text is a keyphrase.
I strongly recommend you take a look at the representation used in the datasets of the CoNLL NER shared tasks (CoNLL 2002 and 2003). Each word is a separate instance, tagged as being inside a named entity of a particular type or outside any entity. The features you can use are the actual words, the surrounding words, and their parts of speech.
For instance, in this excerpt from the CoNLL 2003 NER dataset:
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
you can see that the word "Ekeus" is an NNP, that it is inside a noun phrase (I-NP), and that it is a named entity of type person (I-PER). You can process this format to get an instance file in which you use the POS tags and the actual words in a two-word window:
@attribute word-2 string
@attribute word-1 string
@attribute word string
@attribute word+1 string
@attribute word+2 string
@attribute postag-2 {NNP, NN, ....} % the full list of available POS tags
@attribute postag-1 {NNP, NN, ....}
% ../..
@attribute named-entity-class {O, I-PER, I-ORG, ...} % the full list of possible NE tags
@data
"U.N.","official","Ekeus","heads","for",NNP,NN,NNP,VBZ,IN,I-PER
% ../..
As you can see, you have to decide which attributes you need for each word and build the windows with those attributes.
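As a rough illustration of that windowing step, here is a minimal sketch in Python; the input file name conll2003.txt is hypothetical, the four-column format is the one shown above, and string escaping is ignored for brevity:
# Read CoNLL-style rows of (word, POS, chunk, NE-tag) and emit one
# windowed instance per word, as in the ARFF sketch above.
PAD = ("<s>", "PAD", "PAD", "O")  # padding so edge words still get a window

rows = [tuple(line.split()) for line in open("conll2003.txt") if line.strip()]
padded = [PAD, PAD] + rows + [PAD, PAD]

with open("ner.arff", "w") as out:
    # ... write the @relation/@attribute header sketched above, then:
    out.write("@data\n")
    for i in range(2, len(padded) - 2):
        window = padded[i - 2:i + 3]                     # two words each side
        words = ",".join('"%s"' % w[0] for w in window)  # word-2 .. word+2
        tags = ",".join(w[1] for w in window)            # postag-2 .. postag+2
        out.write("%s,%s,%s\n" % (words, tags, padded[i][3]))  # class = NE tag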
In word2vec, can we use words instead of sentences for model training?
In the code below, gberg_sents contains sentence tokens:
model = Word2Vec(sentences=gberg_sents, size=64, sg=1, window=10, min_count=5, seed=42, workers=8)
Can we use word tokens in the same way?
No. word2vec is trained with a language-modeling objective, i.e., it predicts which words appear in the surroundings of other words. For this, your training data needs to be actual sentences that show how the words are used in context. It is precisely the context of the words that provides the information captured in the embeddings.
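For illustration, a minimal sketch with gensim 4.x, where the size parameter from the question is called vector_size; the toy corpus is only there to show the expected input, a list of tokenized sentences:
from gensim.models import Word2Vec

# Training input is a list of sentences, each a list of word tokens.
toy_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]
model = Word2Vec(sentences=toy_sentences, vector_size=64, sg=1,
                 window=10, min_count=1, seed=42, workers=1)
print(model.wv.most_similar("cat"))  # neighbors are meaningless on a toy corpus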
I've recently upgraded a CloudSearch instance from the 2011 to the 2013 API. Both instances have a field called sid, a text field containing a two-letter code followed by some digits, e.g. LC12345. With the 2011 API, if I run a search like this:
q=12345*&return-fields=sid,name,desc
...I get back 1 result, which is great. But the sid of the result is LC12345 and that's the way it was indexed. The number 12345 does not appear anywhere else in any of the resulting document fields. I don't understand why it works. I can only assume that this type of query is looking for any terms in any fields that even contain the number 12345.
The reason I'm asking is because this functionality is now broken when I query using the 2013 API. I need to use the structured query parser, but even a comparable wildcard query using the simple parser is not working e.g.
q.parser=simple&q=12345*&return=sid,name,desc
...returns nothing, although the document is definitely there i.e. if I query for LC12345* it finds the document.
If I could figure out how to get the simple query working like it was before, that would at least get me started on how to do the same with the structured syntax.
Why it's not working
CloudSearch v1 (2011) had a different way of tokenizing mixed alpha+numeric strings. Here's the logic as described in the archived docs (emphasis mine).
If a string contains both alphabetic and numeric characters and is at
least three and no more than nine characters long, the alphabetic and
numeric portions of the string are treated as separate tokens. For
example, the string DOC298 is tokenized into two terms: doc 298
CloudSearch v2 (2013) text processing follows Unicode Text Segmentation, which does not specify that behavior:
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
Solution
You should just be able to search *12345 to get back results with any prefix. There may be some edge cases, like getting back results you don't want (things with more preceding digits, like AB99912345); I don't know enough about your data to say whether those are real concerns.
Another option would be to index the numeric suffix separately from the alphabetic prefix, but that's additional work that may be unnecessary.
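If you do go that route, the split itself is trivial; a minimal sketch, where sid_prefix and sid_num are hypothetical fields you would have to add to your indexing options:
import re

# Split "LC12345" into a hypothetical alphabetic field and numeric field
# before uploading the document batch.
doc = {"sid": "LC12345", "name": "...", "desc": "..."}
m = re.match(r"([A-Za-z]+)([0-9]+)$", doc["sid"])
if m:
    doc["sid_prefix"] = m.group(1)  # "LC"
    doc["sid_num"] = m.group(2)     # "12345"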
I'm guessing you are using CloudSearch in English, so maybe this isn't your specific problem, but also watch out for stopwords in your search queries:
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html#stopwords
In your example, the word "jo" is a stop word in Danish and other languages; each supported language has a dictionary of stop words covering very common ones. If you don't specify a language for your text field, it defaults to English. You can see the lists here: https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html#text-processing-settings
I recently started working with ontologies, and I am using Protege to build an ontology that I'd also like to use for automatically classifying strings. The following illustrates a very basic class hierarchy:
String
|_ AlphabeticString
   |_ CountryName
   |_ CityName
|_ AlphaNumericString
   |_ PrefixedNumericString
|_ NumericString
Eventually, strings like Spain should be classified as CountryName, while UE4564 would be a PrefixedNumericString.
However, I am not sure how to model this knowledge. Would I have to first define whether a character is alphabetic, numeric, etc., and then construct a word from the existing characters, or is there a way to use regexes? So far I have only managed to classify strings based on an exact phrase, like String and hasString value "UE4565".
Or would it be better to save a regex for each class in the ontology and then classify the string in Java using those regexes?
An approach that might be appropriate here, especially if the ontology is large/complicated or might change in the future, and assuming that some errors are acceptable, is machine learning.
An outline of a process utilizing this approach might be:
Define a feature set you can extract from each string, relating to your ontology (some examples below).
Collect a "train set" of strings and their true matching categories.
Extract features from each string, and train some machine-learning algorithm on this data.
Use the trained model to classify new strings.
Retrain or update your model as needed (e.g. when new categories are added).
To illustrate more concretely, here are some suggestions based on your ontology example.
Some boolean features that might be applicable: does the string match a regexp (e.g. the ones Qtax suggests); does the string exist in a prebuilt list of known country names; does it exist in a known list of city names; does it contain uppercase letters; plus the string length (not boolean), etc.
So if, for instance, you have a total of 8 features (matches against the 4 regular expressions mentioned above, plus the 4 additional features suggested here), then "Spain" would be represented as (1,1,0,0,1,0,1,5): it matches the first two regular expressions but not the last two, is a country name but not a city name, has an uppercase letter, and its length is 5.
This feature set will represent any given string.
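To make that concrete, here is a minimal sketch of the feature extraction in Python; the city and country lists are placeholders, and the four regexes anticipate the ones Qtax suggests below:
import re

CITY_NAMES = {"Melbourne", "Madrid"}    # placeholder lookup lists
COUNTRY_NAMES = {"Spain", "Australia"}

def features(s):
    return (
        int(bool(re.fullmatch(r"[A-Za-z]+", s))),        # AlphabeticString
        int(bool(re.fullmatch(r"[A-Za-z0-9]+", s))),     # AlphaNumericString
        int(bool(re.fullmatch(r"[A-Za-z]+[0-9]+", s))),  # PrefixedNumericString
        int(bool(re.fullmatch(r"[0-9]+", s))),           # NumericString
        int(s in COUNTRY_NAMES),                         # known country name
        int(s in CITY_NAMES),                            # known city name
        int(any(c.isupper() for c in s)),                # has an uppercase letter
        len(s),                                          # string length
    )

print(features("Spain"))  # (1, 1, 0, 0, 1, 0, 1, 5)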
To train and test a machine-learning algorithm, you can use WEKA. I would start with rule- or tree-based algorithms, e.g. PART, RIDOR, JRip, or J48.
The trained models can then be used via WEKA, either from within Java or from the command line.
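For the command-line route, a minimal sketch, assuming weka.jar sits in the current directory and features.arff is a hypothetical training file built as above:
import subprocess

# Train and evaluate a J48 tree; with only -t given, Weka reports
# 10-fold cross-validation results on the training file.
subprocess.run([
    "java", "-cp", "weka.jar",
    "weka.classifiers.trees.J48",
    "-t", "features.arff",
])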
Obviously, the features I suggest map almost 1:1 to your ontology, but assuming your taxonomy is larger and more complex, this approach would probably be one of the best in terms of cost-effectiveness.
I don't know anything about Protege, but you can use regexes to match most of those cases. The only problem would be differentiating between country names and city names; I don't see how you could do that without a complete list of at least one of them.
Here are some expressions that you could use:
AlphabeticString:
^[A-Za-z]+\z (ASCII) or ^\p{Alpha}+\z (Unicode)
AlphaNumericString:
^[A-Za-z0-9]+\z (ASCII) or ^\p{Alnum}+\z (Unicode)
PrefixedNumericString:
^[A-Za-z]+[0-9]+\z (ASCII) or ^\p{Alpha}+\p{N}+\z (Unicode)
NumericString:
^[0-9]+\z (ASCII) or ^\p{N}+\z (Unicode)
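A quick way to try these out in Python; note that Python spells the end-of-string anchor \Z (uppercase) rather than \z:
import re

# Most specific patterns first, since e.g. "Spain" also matches the
# alphanumeric pattern.
PATTERNS = [
    ("PrefixedNumericString", r"^[A-Za-z]+[0-9]+\Z"),
    ("NumericString", r"^[0-9]+\Z"),
    ("AlphabeticString", r"^[A-Za-z]+\Z"),
    ("AlphaNumericString", r"^[A-Za-z0-9]+\Z"),
]

def classify(s):
    for name, pattern in PATTERNS:
        if re.match(pattern, s):
            return name
    return None

print(classify("UE4564"))  # PrefixedNumericString
print(classify("Spain"))   # AlphabeticString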
A particular string is an instance, so you'll need some code to make the basic assertions about that instance. That code might itself use regular expressions. Once you've got those assertions, you'll be able to use your ontology to reason about them.
The hard part is that you've got to decide at what level you're going to model. For example, are you going to talk about individual characters? You can, but it's not necessarily sensible. You've also got the challenge that negative information is awkward (the underlying logic of such models is intuitionistic, IIRC), which means, for example, that you'll know that a string contains a numeric character but not that it is purely numeric. Yes, you'd know that you don't have an assertion that the instance contains an alphabetic character, but you wouldn't know whether that's because the string doesn't have one or just because nobody has said so yet. This stuff is hard!
It's far easier to write an ontology if you know exactly what problems you intend to solve with it, as that allows you to at least have a go at working out what facts and relations you need to establish in the first place. After all, there's a whole world of possible things that could be said which are true but irrelevant (“if the sun has got his hat on, he'll be coming out to play”).
Responding directly to your question: you start by checking whether a given token is numeric, alphanumeric, or alphabetic (you can use a regex here), and then you classify it as such. In general, the approach you're looking for is called a generalization hierarchy of tokens, or hierarchical feature selection (Google it). The basic idea is that you could treat each token as a separate element, but that's not the best approach since you can't cover them all [*]. Instead, you use features common to groups of tokens (for example, 2000 and 1981 are distinct tokens but they share the feature of being 4-digit numbers, and possibly years). Then you have a class for four-digit numbers, another for alphanumeric strings, and so on. This process of generalization helps you simplify your classification approach.
Frequently, if you start with a string of tokens, you need to preprocess them (for example, remove punctuation or special symbols, remove irrelevant words, apply stemming, etc.). But some symbols can serve as context (say, the punctuation between a city and a country, e.g. Melbourne, Australia), so you can map that set of useful punctuation symbols to another symbol (#) and use it as context (the next time you find an unknown word next to a comma next to a known country, you can use that knowledge to assume that the unknown word is a city).
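As a toy illustration of that context idea, a minimal sketch, assuming a placeholder country list:
KNOWN_COUNTRIES = {"Australia", "Spain"}  # placeholder list

def guess_cities(tokens):
    # If an unknown token is followed by a comma and a known country,
    # assume it is a city (e.g. "Melbourne , Australia").
    guesses = []
    for i in range(len(tokens) - 2):
        if tokens[i + 1] == "," and tokens[i + 2] in KNOWN_COUNTRIES:
            guesses.append(tokens[i])
    return guesses

print(guess_cities(["Melbourne", ",", "Australia"]))  # ['Melbourne']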
Anyway, that's the general idea behind classification using an ontology (based on a taxonomy of terms). You may also want to read about part-of-speech tagging.
By the way, if you only want three categories (numeric, alphanumeric, alphabetic), a viable option would be to use edit distance (what is more likely: that UA4E30 belongs to the alphanumeric or the numeric category, considering that it doesn't match the traditional format of prefixed numeric strings?). You assume a cost for each operation (insertion, deletion, substitution) that transforms your unknown token into a known one.
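A minimal sketch of plain Levenshtein distance with unit costs, which you could compare against known exemplars of each category:
def edit_distance(a, b):
    # dp[i][j] = cost of turning a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

print(edit_distance("UA4E30", "UE4564"))  # distance to a token from the question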
Finally, although you said you're using Protege (which I haven't used) to build your ontology, you may want to look at WordNet.
[*] There are probabilistic approaches that help you determine a probability for an unknown token, so the probability of such an event is not zero. Usually, this is done in the context of Hidden Markov Models. This could also be useful to improve the suggestion given by etov.
I'm using the WEKA tool for text classification, and I have to convert plain-text files into ARFF format. However, I don't know how to do that. Can anyone please help me convert a text file into ARFF format?
Thank you Renklauf for your response.
I didn't understand this point: "Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line." Can you please explain it briefly?
Suppose the text data is a simple sports article like:
" Basketball is a team sport, the objective being to shoot a ball through a basket horizontally positioned to score points while following a set of rules. Usually, two teams of five players play on a marked rectangular court with a basket at each width end. Basketball is one of the world's most popular and widely viewed sports" ...
This is my text document and I want to convert it to ARFF format. After that, I need to use the ARFF file for SVM text classification.
For a document classification task, each document is a single instance: its text goes into a string attribute and must be enclosed in quotes. Suppose you have a corpus of 10 sports articles tagged as either pro-Yankees or pro-Red Sox, for a classifier that automatically labels sports articles one way or the other. You need to take each document, enclose it in quotes, place it on a single line, and then place your {yankees, red_sox} attribute value after the quote-enclosed string.
@relation yankeesOrRedSox
@attribute article string
@attribute yankeesOrSox { yankees, red_sox }
@data
"text of article 1 here", yankees
.
.
.
"text of article 10 here", red_sox
It's key that each article is placed on a single line. When I began using Weka for text classification, this point caused me a lot of frustration. Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line. Notepad++ has a Join Lines function that allows you to place a lot of text on a single line.
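Alternatively, a short script can do the joining for you; a minimal sketch in Python, assuming hypothetical files article1.txt through article10.txt and one label per article:
import glob

labels = ["yankees"] * 5 + ["red_sox"] * 5  # illustrative labels, one per article

with open("articles.arff", "w") as out:
    out.write("@relation yankeesOrRedSox\n")
    out.write("@attribute article string\n")
    out.write("@attribute yankeesOrSox { yankees, red_sox }\n")
    out.write("@data\n")
    for path, label in zip(sorted(glob.glob("article*.txt")), labels):
        text = " ".join(open(path).read().split())  # collapse onto one line
        text = text.replace('"', "'")               # crude quote handling
        out.write('"%s",%s\n' % (text, label))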
Hope this helps.
I am trying to identify the most frequently used words in congressional speeches, and I have to separate them by congressperson. I am just starting to learn about R and the tm package. I have code that can find the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?
Text looks like this:
OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
[....]
STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER
[....]
I would like to be able to get these names, or to separate the text by speaker. Hope you can help me. Thanks a lot.
Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.
If so, you might say x is your text, then use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, then grep() or str_extract() to return the two or three words after SENATOR (do they always have just two names, as in your example?).
Have a look here for more on the use of these functions, and text manipulation in general in R: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
UPDATE Here's a more complete answer...
#create object containing all text
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN
I am trying to identify the most frequently used words in the
congress speeches, and have to separate them by the congressperson.
I am just starting to learn about R and the tm package. I have a code
that can find the most frequent words, but what kind of a code can I
use to automatically identify and store the speaker of the speech
STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN
Would it be correct to say that you want
to split the file so you have one text object
per speaker? And then use a regular expression
to grab the speaker's name for each object? Then
you can write a function to collect word frequencies,
etc. on each object and put them in a table where the
row or column names are the speaker's names.")
# split the object on the phrase STATEMENT OF
y <- unlist(strsplit(x, "STATEMENT OF"))
#load library containing handy function
library(stringr)
# use word() to return the words in positions 3 to 4 of each string, which is where the first and last names are
z <- word(y[2:4], 3, 4) # note that the first element of the character vector y has only one word, and this function gives an error if there are not enough words in the string
z # have a look at the result...
[1] "HERB KOHL," "BIG APPLE" "LITTLE ORANGE,"
No doubt a regular-expression wizard could come up with something quicker and neater!
Anyway, from here you can run a function to calculate word frequencies on each element of the vector y (i.e. each speaker's speech) and then make another object that combines the word-frequency results with the names for further analysis.
This is how I'd approach it using Ben's example: use qdap to parse and create a data frame, then convert to a Corpus with 3 documents. (Note that qdap was designed for transcript data like this, and a Corpus may not be the best data format.)
library(qdap)
dat <- unlist(strsplit(x, "\\n"))
locs <- grep("STATEMENT OF ", dat)
nms <- sapply(strsplit(dat[locs], "STATEMENT OF |,"), "[", 2)
dat[locs] <- "SPLIT_HERE"
corp <- with(data.frame(person=nms, dialogue =
Trim(unlist(strsplit(paste(dat[-1], collapse=" "), "SPLIT_HERE")))),
df2tm_corpus(dialogue, person))
tm::inspect(corp)
## A corpus with 3 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`SENATOR BIG APPLE KOHL`
## I am trying to identify the most frequently used words in the congress speeches, and have to separate them by the congressperson. I am just starting to learn about R and the tm package. I have a code that can find the most frequent words, but what kind of a code can I use to automatically identify and store the speaker of the speech
##
## $`SENATOR HERB KOHL`
## The Chairman. Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon. In today's tough economic climate, millions of seniors have lost a big part of their retirement and investments in only a matter of months. Unlike younger Americans, they do not have time to wait for the markets to rebound in order to recoup a lifetime of savings.
##
## $`SENATOR LITTLE ORANGE`
## Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.