Can word2vec model be used for words also as training data instead of sentences - word2vec

In Word2vec, can we use words instead of sentences for model training?
In the code below, gberg_sents is a list of sentence tokens:
model = Word2Vec(sentences=gberg_sents,size=64,sg=1,window=10,min_count=5,seed=42,workers=8)
Can we use word tokens in the same way?

No. Word2vec is trained with a language-modeling objective, i.e., it predicts which words appear in the surroundings of other words. For this, your training data needs to be actual sentences that show how the words are used in context. It is the context of the words that provides the information captured in the embeddings.
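To make the role of context concrete, here is a minimal sketch of how skip-gram style (center, context) training pairs could be drawn from a tokenized sentence. The function name `skipgram_pairs` and the fixed window are assumptions for illustration; real implementations such as gensim add details like random window shrinking and subsampling.

```python
def skipgram_pairs(sentence, window=2):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

# A real sentence yields pairs to learn from ...
print(skipgram_pairs(["the", "cat", "sat"], window=1))
# ... while a one-word "sentence" yields none.
print(skipgram_pairs(["cat"], window=1))
```

With isolated words as input, every "sentence" has length one, so there is nothing for the model to predict.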

Related

Regex - How can you identify strings which are not words?

Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Annoyingly, everything is in lowercase, which makes it more difficult, but since I only care about English, I'm essentially looking for the opposite of consonant clusters: groups of consonants that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try passing every single word in the sentence to a function that checks whether the word is listed in a dictionary. There are a good number of dictionary text files on GitHub. To speed up the process, use a hash map :)
You could also use an auto-correction API or library.
Algorithm to combine both methods:
1. Run the sentence through auto-correction
2. Run every word through the dictionary
3. Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
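A minimal sketch of the dictionary step, assuming the word list has already been loaded into a Python set (which is hash-based, so lookups are fast). The name `remove_non_words` and the tiny word list are made up for illustration; the auto-correction step is left out.

```python
def remove_non_words(sentence, dictionary):
    """Keep only the words that appear in the dictionary set."""
    kept = [w for w in sentence.split() if w.lower() in dictionary]
    return " ".join(kept)

# Hypothetical tiny word list; in practice, load one word per line from a file.
english = {"please", "remove", "from", "here"}
print(remove_non_words("please remove kuashdixbkjshakd from here", english))
```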
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia) but it's hard to predict how precise exactly this will be.
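As a rough sketch of the idea (not SpamAssassin's actual code), the n-gram counts above can be reproduced with a sliding window, and a word can then be flagged if it contains a trigram never seen in training. The function names are assumptions for illustration.

```python
from collections import Counter

def char_ngrams(text, max_n=5):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def known_trigrams(training_words):
    """Collect the set of all 3-grams seen in the training words."""
    seen = set()
    for word in training_words:
        seen.update(word[i:i + 3] for i in range(len(word) - 2))
    return seen

def unseen_trigrams(word, known):
    """Return the 3-grams of word that never occurred in training."""
    return [word[i:i + 3] for i in range(len(word) - 2) if word[i:i + 3] not in known]

counts = char_ngrams("abracadabra")
print(counts["a"], counts["ab"], counts["abrac"])

known = known_trigrams(["abracadabra"])
print(unseen_trigrams("abxa", known))
```

How many unseen trigrams a word must contain before it is flagged is a threshold you would have to tune on your own data.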
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover up to all the 3-grams you find in your training data at least.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
This will match words containing a run of more than 5 consecutive consonants (you probably want "y" to not be considered a consonant, but it's up to you). Note that the && character-class intersection is Java regex syntax:
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.
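Python's re module has no character-class intersection, so a sketch of the same idea in Python has to spell out the consonant ranges explicitly. The helper name `strip_gibberish` is an assumption for illustration, and "y" is treated as a vowel here.

```python
import re

# Six or more consecutive consonants (a, e, i, o, u and y excluded).
CONSONANT_RUN = re.compile(r"\b[a-z]*[b-df-hj-np-tv-xz]{6}[a-z]*\b")

def strip_gibberish(sentence):
    """Remove words containing a long consonant run, then tidy the spaces."""
    cleaned = CONSONANT_RUN.sub("", sentence)
    return " ".join(cleaned.split())

print(strip_gibberish("things like kuashdixbkjshakd here"))
print(strip_gibberish("witchcraft is fine"))
```

Expect some false positives and negatives either way: a few real words, such as "catchphrase", do contain six consonants in a row, and plenty of gibberish has shorter runs.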

converting a sentence to an embedding representation

If I have a sentence, e.g. “get out of here”,
and I want to use word2vec embeddings to represent it, I found three different ways to do that:
1- for each word, we compute the average of its embedding vector, so each word is replaced by a single value.
2- as in 1, but using the standard deviation of the embedding vector values.
3- by adding the embedding vector as it is. So if I use a 300-length embedding vector, for the above example I will end up with a vector of length 1200 (300 * 4 words) to represent the sentence.
Which one of them is most suitable, specifically for sentence-similarity applications?
The way you describe option (1) makes it sound like each word becomes a single number. That wouldn't work.
The simple approach that's often used is to average all word-vectors for words in the sentence together - so with 300-dimensional word-vectors, you still wind up with a 300-dimensional sentence-average vector. Perhaps that's what you mean by your option (1).
(Sometimes, all vectors are normalized to unit-length before this operation, but sometimes not - because the non-normalized vector lengths can sometimes indicate the strength of a word's meaning. Sometimes, word-vectors are weighted by some other frequency-based indicator of their relative importance, such as TF/IDF.)
I've never seen your option (2) used and don't quite understand what you mean or how it could possibly work.
Your option (3) would be better described as "concatenating the word-vectors". It gives different-sized vectors depending on the number of words in the sentence. Slight differences in word placement, such as comparing "get out of here" and "of here get out", would result in very different vectors, that usual methods of comparing vectors (like cosine-similarity) would not detect as being 'close' at all. So it doesn't make sense, and I've not seen it used.
So, only your option (1), as properly implemented to (weighted-)average word-vectors, is a good baseline for sentence-similarities.
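To illustrate the difference between averaging and concatenating, here is a small sketch with made-up 3-dimensional "embeddings" (the values in TOY are invented, not real word2vec output). It compares "get out of here" with "of here get out" under both representations.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average(vectors):
    """Element-wise mean: the result has the same size as one word-vector."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def concatenate(vectors):
    """Join word-vectors end to end: the size grows with sentence length."""
    return [x for vec in vectors for x in vec]

# Made-up toy vectors, for illustration only.
TOY = {"get": [1, 0, 0], "out": [0, 1, 0], "of": [0, 0, 1], "here": [1, 1, 0]}

a = [TOY[w] for w in "get out of here".split()]
b = [TOY[w] for w in "of here get out".split()]

avg_sim = cosine(average(a), average(b))
cat_sim = cosine(concatenate(a), concatenate(b))
print(avg_sim, cat_sim)
```

The averaged vectors are identical for the two word orders, while the concatenated ones are far apart. The flip side is that plain averaging ignores word order entirely.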
But, it's still fairly basic and there are many other ways to compare sentences using text-vectors. Here are just a few:
One algorithm closely related to word2vec itself is called 'Paragraph Vectors', and is often called Doc2Vec. It uses a very word2vec-like process to train vectors for full ranges of text (whether they're phrases, sentences, paragraphs, or documents) that work kind of like 'floating document-ID words' over the full text. It sometimes offers a benefit over just averaging word-vectors, and in some modes can produce both doc-vectors and word-vectors that are also comparable to each other.
If your interest isn't just pairwise sentence similarities, but some sort of downstream classification task, then Facebook's 'FastText' refinement of word2vec has a classification mode, where the word-vectors are trained not just to predict neighboring words, but to be good at predicting known text classes, when simply added/averaged together. (Text-vectors constructed from such classification vectors might be good at similarities too, depending on how well the training-classes capture salient contrasts between texts.)
Another way to compute pairwise similarities, using just word-vectors, is "Word Mover's Distance". Rather than averaging all the word-vectors for a text together into a single text-vector, it considers each word-vector as a sort of "pile of meaning". Compared to another sentence, it calculates the minimum routing work (distance along lots of potential word-to-word paths) to move all the "piles" from one sentence into the configuration of another sentence. It can be expensive to calculate, but usually represents sentence-contrasts better than the simple single-vector-summary that naive word-vector averaging achieves.
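```
import numpy as np
from gensim.models import Word2Vec

# `sentences` is assumed to be a list of tokenized sentences (lists of words).
model = Word2Vec(sentences, vector_size=100, min_count=1)

def sent_vectorizer(sent, model):
    """Average the word-vectors of all in-vocabulary words in `sent`."""
    # gensim 4 exposes the vectors via model.wv; model[w] no longer works.
    vectors = [model.wv[w] for w in sent if w in model.wv]
    if not vectors:
        # No known words: return a zero vector instead of dividing by zero.
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

X = [sent_vectorizer(sentence, model) for sentence in sentences]
print("========================")
print(X)
```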

How to improve a twitter sentiment analyzer?

I'm working on a C++ Twitter company sentiment analysis tool. The user inputs a company and the tool analyzes a number of tweets and returns a sentiment.
So far I did the following:
limit tweets to English and recent
make lowercase
remove RT, # symbol, @usernames and URLs
remove characters like &^%$(){}... etc
I then parse the tweet into words and check the words against two dictionaries of positive and negative words. I create a total sentiment for each tweet. Then I count the number of positive, neutral and negative tweets to come up with a final answer. No weights are used.
I am thinking of implementing the following two things:
remove stop words from tweets
remove special characters and emoticons from tweets (basically, non-English Unicode)
However, even with this, most of the searches end up being very neutral. For example if I search "Apple" in 100 tweets I get say 30 positives, 10 negatives and 60 neutral.
Questions:
1. Is there any way to lower the neutrals?
2. What kind of positive and negative words should I add to represent my search criteria (companies)?
You say no weighting is used, but why not add it? Assign each +/- word a base weight of 1, then maybe apply some of the following conditions:
If they use words like "very", "extremely", etc., weight the following adjective more heavily (or, without weighting, just count both of them as a +/- word)
Rather than changing everything to lowercase, if caps lock is involved, weight those words more heavily with a multiplier
Rate words like "fantastic" more heavily than words like "good"
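The three suggestions above could be sketched as follows. This is in Python for brevity rather than C++, and every word list, weight and multiplier here is an invented placeholder that you would need to tune.

```python
# Hypothetical lexicon: stronger words get larger base weights.
LEXICON = {"good": 1.0, "fantastic": 2.0, "bad": -1.0, "terrible": -2.0}
# Hypothetical intensifiers: they scale the next sentiment word.
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
CAPS_MULTIPLIER = 1.5  # assumed boost for words written in all caps

def score_tweet(text):
    """Sum weighted sentiment over the words of one tweet."""
    score = 0.0
    boost = 1.0
    for token in text.split():
        raw = token.strip(".,!?")
        word = raw.lower()
        if word in INTENSIFIERS:
            boost = INTENSIFIERS[word]  # applies to the following word only
            continue
        weight = LEXICON.get(word, 0.0)
        if weight and raw.isupper() and len(raw) > 1:
            weight *= CAPS_MULTIPLIER
        score += weight * boost
        boost = 1.0
    return score

print(score_tweet("very good"))
print(score_tweet("FANTASTIC phone"))
```

With a continuous score per tweet, you can also choose a narrower band around zero to count as "neutral", which is one way to reduce the number of neutral results.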

Ontology-based string classification

I recently started working with ontologies and I am using Protege to build an ontology which I'd also like to use for automatically classifying strings. The following illustrates a very basic class hierarchy:
String
|_ AlphabeticString
|_ CountryName
|_ CityName
|_ AlphaNumericString
|_ PrefixedNumericString
|_ NumericString
Eventually strings like Spain should be classified as CountryName or UE4564 would be a PrefixedNumericString.
However I am not sure how to model this knowledge. Would I have to first define if a character is alphabetic, numeric, etc. and then construct a word from the existing characters or is there a way to use Regexes? So far I only managed to classify strings based on an exact phrase like String and hasString value "UE4565".
Or would it be better to save a regex for each class in the ontology and then classify the string in Java using those regexes?
An approach that might be appropriate here, especially if the ontology is large/complicated or might change in the future, and assuming that some errors are acceptable, is machine learning.
An outline of a process utilizing this approach might be:
Define a feature set you can extract from each string, relating to your ontology (some examples below).
Collect a "train set" of strings and their true matching categories.
Extract features from each string, and train some machine-learning algorithm on this data.
Use the trained model to classify new strings.
Retrain or update your model as needed (e.g. when new categories are added).
To illustrate more concretely, here are some suggestions based on your ontology example.
Some boolean features that might be applicable: does the string match a regexp (e.g. the ones Qtax suggests); does the string exist in a prebuilt list of known city names; does it exist in a list of known country names; does it contain uppercase letters; plus non-boolean features such as string length.
So if, for instance, you have a total of 8 features (matches against the 4 regular expressions Qtax suggests, plus the 4 additional ones listed here), then "Spain" would be represented as (1,1,0,0,0,1,1,5): it matches the first 2 regular expressions but not the last two, is not a city name but is a country name, has an uppercase letter, and has length 5.
This set of features will represent any given string.
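A minimal sketch of such a feature extractor, in Python for brevity. The city and country sets are tiny hypothetical stand-ins for real lists, and the four patterns are ASCII versions of the regular expressions for the four string classes.

```python
import re

# ASCII patterns for: alphabetic, alphanumeric, prefixed numeric, numeric.
PATTERNS = [r"[A-Za-z]+", r"[A-Za-z0-9]+", r"[A-Za-z]+[0-9]+", r"[0-9]+"]

# Hypothetical lookup lists; real ones would be far larger.
KNOWN_CITIES = {"Melbourne", "Madrid"}
KNOWN_COUNTRIES = {"Spain", "Australia"}

def extract_features(s):
    """Return the 8-feature tuple: 4 regex matches, 2 lookups, uppercase, length."""
    regex_hits = [int(re.fullmatch(p, s) is not None) for p in PATTERNS]
    return tuple(regex_hits + [
        int(s in KNOWN_CITIES),
        int(s in KNOWN_COUNTRIES),
        int(any(c.isupper() for c in s)),
        len(s),
    ])

print(extract_features("Spain"))
print(extract_features("UE4564"))
```

Each tuple, paired with its true category, would become one row of the training set.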
To train and test a machine learning algorithm, you can use WEKA. I would start with rule- or tree-based algorithms, e.g. PART, RIDOR, JRIP or J48.
Then the trained models can be used via Weka either from within Java or as an external command line.
Obviously, the features I suggest have almost 1:1 match with your Ontology, but assuming your taxonomy is larger and more complex, this approach would probably be one of the best in terms of cost-effectiveness.
I don't know anything about Protege, but you can use regex to match most of those cases. The only problem would be differentiating between country and city name, I don't see how you could do that without a complete list of either one.
Here are some expressions that you could use:
AlphabeticString:
^[A-Za-z]+\z (ASCII) or ^\p{Alpha}+\z (Unicode)
AlphaNumericString:
^[A-Za-z0-9]+\z (ASCII) or ^\p{Alnum}+\z (Unicode)
PrefixedNumericString:
^[A-Za-z]+[0-9]+\z (ASCII) or ^\p{Alpha}+\p{N}+\z (Unicode)
NumericString:
^[0-9]+\z (ASCII) or ^\p{N}+\z (Unicode)
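As a sketch of how the ASCII expressions could drive a classifier, here is a Python version that tries the most specific class first. The rule order and the fallback to plain "String" are assumptions; Python's `fullmatch` stands in for the ^...\z anchors.

```python
import re

# Most specific classes first, so "UE4564" is not swallowed by AlphaNumericString.
RULES = [
    ("NumericString", re.compile(r"[0-9]+")),
    ("AlphabeticString", re.compile(r"[A-Za-z]+")),
    ("PrefixedNumericString", re.compile(r"[A-Za-z]+[0-9]+")),
    ("AlphaNumericString", re.compile(r"[A-Za-z0-9]+")),
]

def classify(s):
    """Return the name of the first class whose pattern matches the whole string."""
    for name, pattern in RULES:
        if pattern.fullmatch(s):
            return name
    return "String"

for example in ["Spain", "UE4564", "12345", "UA4E30"]:
    print(example, classify(example))
```

Telling CountryName from CityName would still need a lookup list on top of this.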
A particular string is an instance, so you'll need some code to make the basic assertions about the particular instance. That code itself might contain the use of regular expressions. Once you've got those assertions, you'll be able to use your ontology to reason about them.
The hard part is that you've got to decide what level you're going to model at. For example, are you going to talk about individual characters? You can, but it's not necessarily sensible. You've also got the challenge that arises from the fact that negative information is awkward (as the basic model of such models is intuitionistic, IIRC) which means (for example) that you'll know that a string contains a numeric character but not that it is purely numeric. Yes, you'd know that you don't have an assertion that the instance contains an alphabetic character, but you wouldn't know whether that's because the string doesn't have one or just because nobody's said so yet. This stuff is hard!
It's far easier to write an ontology if you know exactly what problems you intend to solve with it, as that allows you to at least have a go at working out what facts and relations you need to establish in the first place. After all, there's a whole world of possible things that could be said which are true but irrelevant (“if the sun has got his hat on, he'll be coming out to play”).
Responding directly to your question, you start by checking whether a given token is numeric, alphanumeric or alphabetic (you can use regex here) and then you classify it as such. In general, the approach you're looking for is called generalization hierarchy of tokens or hierarchical feature selection (Google it). The basic idea is that you could treat each token as a separate element, but that's not the best approach since you can't cover them all [*]. Instead, you use common features among tokens (for example, 2000 and 1981 are distinct tokens but they share a common feature of being 4 digit numbers and possibly years). Then you have a class for four digit numbers, another for alphanumeric, and so on. This process of generalization helps you to simplify your classification approach.
Frequently, if you start with a string of tokens, you need to preprocess them (for example, remove punctuation or special symbols, remove words that are not relevant, stemming, etc). But maybe you can use some symbols (say, punctuation between cities and countries - e.g. Melbourne, Australia), so you assign that set of useful punctuation symbols to another symbol (#) and use that as context (so the next time you find an unknown word next to a comma next to a known country, you can use that knowledge to assume that the unknown word is a city).
Anyway, that's the general idea behind classification using an ontology (based on a taxonomy of terms). You may also want to read about part-of-speech tagging.
By the way, if you only want to have 3 categories (numeric, alphanumeric, alphabetic), a viable option would be to use edit distance (what is more likely, that UA4E30 belongs to the alphanumeric or numeric category, considering that it doesn't correspond to the traditional format of prefixed numeric strings?). So, you assume a cost for each operation (insertion, deletion, substitution) that transforms your unknown token into a known one.
Finally, although you said you're using Protege (which I haven't used) to build your ontology, you may want to look at WordNet.
[*] There are probabilistic approaches that help you to determine a probability for an unknown token, so the probability of such event is not zero. Usually, this is done in the context of Hidden Markov Models. Actually, this could be useful to improve the suggestion given by etov.

How to convert a text file into ARFF format?

I'm using WEKA tool for text classification, and I have to convert plain text files into ARFF format. However, I don't know how to do that. Can anyone please help me to convert a text file into ARFF format?
Thank you, Renklauf, for your response.
I didn't understand this point: "Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line." Can you please explain it briefly?
Suppose the text data is like a simple sport article like
" Basketball is a team sport, the objective being to shoot a ball through a basket horizontally positioned to score points while following a set of rules. Usually, two teams of five players play on a marked rectangular court with a basket at each width end. Basketball is one of the world's most popular and widely viewed sports" ...
This is my text document and I want to convert this to arff format .. and after that I need to use that arff format file for SVM text classification ..
For a document classification task, each document is treated as the value of a string attribute and must be enclosed in quotes. Suppose you have a corpus of 10 sports articles tagged as either pro-Yankees or pro-Red Sox, for a classifier that automatically classifies sports articles as one or the other. You need to take each document, enclose it in quotes, place it on a single line, and then place your {yankees, red_sox} attribute value after the quote-enclosed string.
@relation yankeesOrRedSox
@attribute article string
@attribute yankeesOrSox { yankees, red_sox }
@data
"text of article 1 here", yankees
.
.
.
"text of article 10 here", red_sox
It's key that the article is placed on a single line. When I began using Weka for text classification, this is a point that caused me a lot of frustration at first. Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line. Notepad++ has a Join Lines function that allows you to place a lot of text on a single line.
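If you would rather script this than join lines by hand, here is a minimal sketch that builds the same layout. The function names are made up, and the backslash-escaping of embedded quotes is an assumption you should check against your Weka version.

```python
def arff_quote(text):
    """Put a document on one line and wrap it in double quotes."""
    # Collapse all whitespace, including newlines, into single spaces.
    one_line = " ".join(text.split())
    # Backslash-escape backslashes and double quotes (assumed convention).
    escaped = one_line.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def to_arff(relation, attribute, class_name, labels, documents):
    """Build ARFF text from (document_text, label) pairs."""
    label_set = ", ".join(labels)
    lines = [
        "@relation " + relation,
        "@attribute " + attribute + " string",
        "@attribute " + class_name + " {" + label_set + "}",
        "@data",
    ]
    for text, label in documents:
        lines.append(arff_quote(text) + ", " + label)
    return "\n".join(lines) + "\n"

docs = [("Basketball is a team sport,\nthe objective being ...", "yankees")]
print(to_arff("yankeesOrRedSox", "article", "yankeesOrSox", ["yankees", "red_sox"], docs))
```

Writing the returned string to a .arff file gives you something to load into Weka.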
Hope this helps.