Many people are tired of obtrusive words with no value, like these:
f**king
Id|ot
<|>
whaaaat????!!!!???
I plan to detect suspicious records and then to verify them manually. In other words, to find rules which detect that something is most likely obtrusive. Is there any reasonable solution? I am thinking about these REGEX rules:
\w\W+\w
\D{3,}
Is it worth the effort?
I would use Bayesian filtering featurizing misspellings that are combinations of alphas and other characters (e.g. all of the examples you've provided). This has the decided benefit that it "learns" over time, but needs to be fed an initial training set before it can produce useful results. To fit your needs you would set the threshold for matching low so you'd get false positives that you'd have to allow (and hopefully the algorithm would not allow through too many false negatives).
Toby Segaran's Programming Collective Intelligence provides a good explanation and Python code for making this work.
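As a rough illustration of the Bayesian approach described above, here is a minimal sketch in plain Python. The feature names, the regular expressions, the tiny training set and the 0.2 threshold are all assumptions made for this example; it is not the code from the book, and a real filter would need far more training data.

```python
import math
import re
from collections import Counter


def features(text):
    """Flag mixtures of letters and other characters (hypothetical feature set)."""
    feats = set()
    if re.search(r"\w[^\w\s]+\w", text):   # e.g. f**king, Id|ot
        feats.add("symbol_inside_word")
    if re.search(r"[?!]{3,}", text):       # e.g. ????!!!!
        feats.add("punctuation_run")
    if re.search(r"(\w)\1{3,}", text):     # e.g. whaaaat
        feats.add("stretched_letter")
    if not re.search(r"\w", text):         # e.g. <|>
        feats.add("no_word_chars")
    return feats or {"plain"}


class NaiveBayes:
    """Tiny naive Bayes over the features that are present in a text."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = {}

    def train(self, text, label):
        self.class_counts[label] += 1
        self.feature_counts.setdefault(label, Counter()).update(features(text))

    def prob_obtrusive(self, text):
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for feat in features(text):
                # Laplace smoothing so unseen features never give probability 0.
                score += math.log((self.feature_counts[label][feat] + 1) / (n + 2))
            log_scores[label] = score
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights.get("obtrusive", 0.0) / sum(weights.values())


model = NaiveBayes()
for sample in ["f**king", "Id|ot", "<|>", "whaaaat????!!!!???"]:
    model.train(sample, "obtrusive")
for sample in ["hello world", "nice article", "thank you very much", "see you tomorrow"]:
    model.train(sample, "clean")

THRESHOLD = 0.2  # deliberately low: prefer false positives, then verify by hand


def is_suspicious(text):
    return model.prob_obtrusive(text) >= THRESHOLD
```

Records for which `is_suspicious` returns true would go to the manual verification queue, and each verified record can be fed back through `train` so the filter improves over time.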
I am new to NLP and feature extraction. I wish to create a machine learning model that can determine the sentiment of stock-related social media posts. For feature extraction of my dataset I have opted to use Word2Vec. My question is:
Is it important to train my word2vec model on a corpus of stock-related social media posts? The datasets that are available for this are not very large. Should I just use much larger pretrained word vectors?
The only way to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches & compare the results on a repeatable quantitative evaluation.
Having training texts that are properly representative of your domain-of-interest can be quite important. You may need your representation of the word 'interest', for example, to represent that of stock/financial world, rather than the more general sense of the word.
But quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important to evaluating new posts may be missing or of very-poor quality. In some cases taking some pretrained set-of-vectors, with its larger vocabulary & sharper (but slightly-mismatched to domain) word-senses may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
This iterative, comparative experimental pattern repeats endlessly as your projects & knowledge grow (it's what the experts do!), so it's also important to learn & practice it. There's no authority you can ask for any certain answer to many of these tradeoff questions.
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, or perhaps of multiple intensities) or a regression problem (assigning texts a value on a numerical scale). There are many simpler ways to create features for such processes that do not involve word2vec vectors – a somewhat more-advanced technique, which adds complexity. (In particular, word-vectors only give you features for individual words, not texts of many words, unless you add some other choices/steps.) If new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell if they're helping or not.
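To make the two kinds of features concrete, here is a minimal sketch. `bag_of_words` is one of the simpler baseline featurizations, and `mean_vector` shows one possible extra step (averaging) for turning per-word vectors into a single text vector. The function names, the vocabulary and the two-dimensional toy vectors are made up for illustration and do not come from any trained model.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Simple baseline: one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_vector(tokens, word_vectors, dim):
    """Average the vectors of known words; unknown words are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy data, only to show the shapes involved.
vocabulary = ["up", "down", "stock"]
toy_vectors = {"stock": [1.0, 0.0], "up": [0.0, 1.0]}

print(bag_of_words(["stock", "up", "up"], vocabulary))
print(mean_vector(["stock", "up", "today"], toy_vectors, dim=2))
```

Either feature list can then be handed to whatever classifier or regressor you are evaluating, which makes it straightforward to compare the baseline against the word-vector variant on the same labeled data.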
Given a regex, I want to compare it with a list of other regex, and output a similarity score.
There are several edit distance algorithms out there (e.g. levenshtein distance), but they fail to compare regex's, e.g.:
R1: [a-z0-9]+
R2: [0-9]{1}[a-z0-9]+
Distance: 8
In the example above, both regex's are quite similar, however they have a quite high edit distance. I suppose an approach using character n-grams would be more suitable for such cases.
What algorithm/approach would you consider for this problem?
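For reference, the character n-gram idea mentioned above could be sketched as a Jaccard similarity over the sets of character bigrams. This is only a textual comparison of the pattern strings, assumed here for illustration; it says nothing about whether the two expressions match similar inputs.

```python
def char_ngrams(pattern, n=2):
    """Set of all character n-grams in the pattern string."""
    return {pattern[i:i + n] for i in range(len(pattern) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


r1 = "[a-z0-9]+"
r2 = "[0-9]{1}[a-z0-9]+"
print(round(ngram_similarity(r1, r2), 3))  # 8 shared bigrams out of 13 -> 0.615
```

All eight bigrams of R1 also occur in R2, so the score stays fairly high even though R2 is much longer.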
It seems you're unlikely to improve upon the regular expression parsing algorithm present in an engine itself, because you're ultimately going to be making inferences about combinations of rules.
There are a number of open source regular expression engines, many listed on wikipedia, possibly including the one you're using.
Without having looked at the internals myself (not an insignificant caveat,) my recommendation is to see if it's possible to modify a regex engine (or leverage some pre-existing debugging or testing code) to output pertinent rules-processing metadata, sub-scores, if you will, from which you can then calculate an aggregate. The engines ultimately do their work deterministically, so this is theoretically possible.
If it works, this will, amongst other things, enable you to classify constructs that you define as similar with similar weights, and possibly to ignore others entirely.
Say I have two sentences, which are similar except there is only one different word with opposite meaning. e.g. "I like her" vs. "I hate her".
word2vec is used in my classification project. As far as I know, word2vec seems unable to figure out differences between antonyms. Is there any way to solve this?
Unfortunately, what we consider 'antonyms' are usually quite similar in word2vec coordinate spaces. That's because such words are quite similar in almost all respects – except for the one contrast they emphasize.
And further, to the extent those contrasts may be captured by the word2vec orientations, they will be in many varied directions. The 'hot'-vs-'cold' contrast will be different from the 'light'-vs-'dark' and the 'small'-vs-'big'.
There might be some analytic technique on sets of word-vectors that helps discover antonymic directions/pairs, but I haven't noticed one discussed, especially not anything that's simple/intuitive or applicable to general word-vector sets. (Once you do know words are opposites, as when consulting prior labeled lexicons or analogy questions, then the directions-between-their-word-vectors can be useful in other analysis, like discovering other words that contrast-in-the-same-way, as when solving analogy problems.)
Can you be more specific about your ultimate goal, with more example of the kinds of input you'll have and what specific results you want software to report?
The one example you give, "I like her" vs "I hate her", could be more generally seen as a sentiment classification, and word2vec-powered classifiers can do OK (though far from perfect) on such challenges. That is, with enough labeled training data, a classifier with a lot of examples of "positive" and "negative" texts will tend to learn that 'like' (and similar words) are positive and 'hate' (and similar) are negative, and do OK on other variants of positive/negative statements (excepting more complex constructions, like negations, subtle qualifications, understatement, irony, etc.)
So more info on what exactly you hope to detect/report, and what you've tried and found insufficient, might generate more ideas on how to achieve it.
There are certain articles in the corpus that I find much more important than others (for instance, I like their wording more). As a result, I would like to increase their "weights" in the entire corpus during the process of generating word vectors. Is there a way to implement this? The current solution that I can think of is to copy the more important articles multiple times, and add them to the corpus. However, will this work for the word embedding process? And is there a better way to achieve this? Many thanks!
The word2vec library with which I am most familiar, in gensim for Python, doesn't have a feature to overweight certain texts. However, your idea of simply repeating the more important texts should work.
Note though that:
it'd probably work better if the texts don't repeat consecutively in your corpus - spreading out the duplicated contexts so that they're encountered in an interleaved fashion with other diverse usage examples
the algorithm really benefits from diverse usage examples – repeating the same rare examples 10 times is nowhere near as good as 10 naturally-subtly-contrasting usages, to induce the kinds of continuous gradations-of-meaning that people want from word2vec
you should be sure to test your overweighting strategy, with a quantitative quality score related to your end purpose, to be sure it's helping as you hope. It might be extra code/training-effort for negligible benefit, or even harm some word vectors' quality.
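The repeat-and-interleave idea could be sketched like this. The helper name `overweight_corpus` and its parameters are hypothetical. A shuffle spreads the copies out on average but does not guarantee that no two copies end up adjacent.

```python
import random


def overweight_corpus(texts, important, repeats=3, seed=0):
    """Return a shuffled corpus in which each important text occurs `repeats` times.

    texts:     list of tokenized texts (each a list of words)
    important: indexes into `texts` that should be overweighted
    """
    corpus = list(texts)
    for idx in important:
        corpus.extend([texts[idx]] * (repeats - 1))
    # Shuffle so the duplicates are interleaved with other usage examples.
    random.Random(seed).shuffle(corpus)
    return corpus


texts = [["rates", "rise"], ["great", "wording", "here"], ["markets", "fall"]]
corpus = overweight_corpus(texts, important={1}, repeats=3)
print(len(corpus))  # 5 texts: 3 originals plus 2 extra copies
```

The resulting list of token lists can be used as the training corpus in the usual way, and compared against the unweighted corpus with whatever quantitative score matters for your end purpose.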
This is a follow-up to my recent question (Code for identifying programming language in a text file). I'm really thankful for all the answers I got; they helped me very much. My code for this task is complete and it works fairly well: quick and reasonably accurate.
The method i used is the following: i have a "learning" perl script that identifies most frequently used words in a language by doing a word histogram over a set of sample files. These data are then loaded by the c++ program which then checks the given text and accumulates score for each language based on found words and then simply checks which language accumulated the highest score.
Now i would like to make it even better and work a bit on the quality of identification. The problem is I often get "unknown" as result (many languages accumulate a small score, but none anything bigger than my threshold). After some debugging, research etc i found out that this is probably due to the fact, that all words are considered equal. This means that seeing a "#include" for example has the same effect as seeing a "while" - both of which indicate that it might be c/c++ (i'm now ignoring the fact that "while" is used in many other languages), but of course in larger .cpp files there might be a ton of "while" but most of the time only a few "#include".
So the fact that a "#include" is more important is ignored, because i could not come up with a good way to identify whether one word is more important than another. Now bear in mind that the script which creates the data is fairly stupid: it's only a word histogram, and for every chosen word it assigns a score of 1. It does not even look at the words (so if there is a "#&|?/" in a file very often it might get chosen as a good word).
Also i would like to have the data creation part fully automated, so nobody should have to look at the data and alter them, change scores, change words etc. All the "brainz" should be in the script and the cpp program.
Does somebody have a suggestion how to identify keywords, or more generally, important words? Some things that might help: i have the number of occurrences of each word and the total number of words (so a ratio may be calculated). I have also thought about wiping out characters like ;, etc. since the histogram script often puts for example "continue;" in the result, but the important word is "continue". Last note: all checks for equality are done for exact match - no substrings, case sensitive. This is mainly because of speed, but substrings might help (or hurt, i don't know)...
NOTE: thanks all who bothered to answer, you helped me a lot.
My work with this is almost finished so i will describe what i did to get good results.
1) Get a decent training set, about 30-50 files per language from various sources to avoid coding style bias
2) Write a perl script that does a word histogram. Implement blacklist and whitelist (more about it below)
3) add bogus words to blacklist, like "license", "the" etc. These are often found at the start of file in license information.
4) add about five most important words per language to the whitelist. These are words that are found in most source code of a given language, but are not frequent enough to get into the histogram. For example for C/C++ i had: #include, #define, #ifdef, #ifndef and #endif in the whitelist.
5) Emphasize the start of a file, so give more points to words found in the first 50-100 lines
6) when doing the word histogram, tokenize the file using @words = split(/[\s\(\){}\[\];.,=]+/, $_); This should be ok for most languages i think (gives me the best results). For each language, have about 10-20 most frequent words in the final results.
7) When the histogram is complete, remove all words that are found in the blacklist and add all those that are found in the whitelist
8) Write a program which processes a text file in the same way as the script - tokenize using the same rules. If a word is found in the histogram data, add points to the right language. Words in the histogram which correspond to only one language should add more points, those which belong to multiple languages should add less.
Comments are welcome. Currently on about 1000 text files i get 80 unknowns (mostly on extremely short files - mainly javascript with just one or two lines). About 20 files are recognized wrongly. The average file size is about 11 kB, ranging from 100 bytes to 100 kB (almost 11 MB total). It takes one second to process them all, which is good enough for me.
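For readers who prefer Python over Perl, steps 2, 6 and 7 above might look roughly like the sketch below. The split pattern is a translation of the one in step 6. The function names, the sample text and the example blacklist and whitelist entries are assumptions for illustration, not the original script.

```python
import re
from collections import Counter

# Same separator set as the split() in step 6.
TOKEN_SPLIT = re.compile(r"[\s(){}\[\];.,=]+")


def tokenize(text):
    """Split on whitespace and common punctuation, dropping empty tokens."""
    return [tok for tok in TOKEN_SPLIT.split(text) if tok]


def top_words(samples, blacklist, whitelist, limit=20):
    """Word histogram over the samples, filtered and extended as in step 7."""
    counts = Counter()
    for text in samples:
        counts.update(tokenize(text))
    frequent = [w for w, _ in counts.most_common() if w not in blacklist][:limit]
    return set(frequent) | set(whitelist)


sample = "the license\n#include <a>\nwhile (x) { continue; }"
print(tokenize("while (x) { continue; }"))  # ['while', 'x', 'continue']
print(top_words([sample], blacklist={"the", "license"}, whitelist={"#define"}, limit=2))
```

Note that "continue;" now comes out as "continue", which addresses the punctuation problem raised in the question.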
I think you're approaching this from the wrong viewpoint. From your description, it sounds like you are building a classifier. A good classifier needs to discriminate between different classes; it doesn't need to precisely estimate the correspondence between the input and the most likely class.
Practically: your classifier doesn't need to assess precisely how close to C++ a certain input is; it merely needs to determine if the input is more like C than C++. This makes your work a lot easier - most of your current "unknown" cases will be close to one or two languages, even though they don't exceed your basic threshold.
Now, once you realize this, you will also see what training your classifier needs: not some random aspect of the sample files, but what sets two languages apart. Hence, when you have parsed your C samples, and your C++ samples, you will see that #include does not set them apart. However, class and template will be far more common in C++. On the other hand, #include does distinguish between C++ and Java.
There are of course other aspects besides keywords that you can use. For instance, the most obvious would be the frequency of {, and ; is similarly distinguishing. Another very useful feature for your classifier would be the comment tokens for the different languages. The basic problem of course would be automatically identifying them. Again, hardcoding //, /*, ', --, # and ! as pseudo-keywords would help.
This also identifies another classification rule: SQL will often have -- at the beginning of a line, whereas in C it will often appear somewhere else. Thus it may be useful for your classifier to take the context into account as well.
Use Google Code Search to learn weights for the set of keywords: #include in C++ gets 672.000 hits, in Python only ~5000.
You can normalize the results by looking at the number of results for the language in total:
C++ gives about 770.000 files whereas Python returns 120.000.
Thus "#include" is extremely rare in Python files, but exists in almost every C++ file. (Now you still have to learn to distinguish C++ and C of course.) All that is left is to do the correct reasoning about probabilities.
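The normalization is just a ratio of keyword hits to total files per language. A small sketch using the figures quoted above (the variable and function names are made up for this example):

```python
# Search hit counts quoted above.
keyword_hits = {"C++": 672_000, "Python": 5_000}
total_files = {"C++": 770_000, "Python": 120_000}


def keyword_rate(language):
    """Fraction of a language's files that contain the keyword."""
    return keyword_hits[language] / total_files[language]


for language in keyword_hits:
    print(language, round(keyword_rate(language), 3))
# C++ is about 0.873, Python is about 0.042
```

So on these numbers "#include" shows up in roughly 87% of C++ files but only about 4% of Python files, a ratio of about 21 to 1.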
You need to get some exclusiveness into your lookup data.
When training on the programming languages you expect, you should search for words typical of one or a few languages. If a word appears in several code files of the same language but in few or none of the other languages' files, it's a strong indicator of that language.
So the score of a word could be calculated at the lookup side by selecting the words that are exclusive to a language or a group of languages. Find several of these words and get the intersection of these by adding the scores, and found your language you will have.
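One simple way to express exclusiveness is to weight each word by the inverse of the number of languages that use it. This is a sketch under that assumption; the weighting rule, the function names and the toy word lists are illustrative choices, not something prescribed above.

```python
from collections import Counter


def word_weights(lang_words):
    """Map each word to (languages using it, weight = 1 / number of languages)."""
    owners = {}
    for lang, words in lang_words.items():
        for word in words:
            owners.setdefault(word, set()).add(lang)
    return {word: (langs, 1.0 / len(langs)) for word, langs in owners.items()}


def score(tokens, weights):
    """Add each known token's weight to every language that uses the word."""
    scores = Counter()
    for tok in tokens:
        if tok in weights:
            langs, weight = weights[tok]
            for lang in langs:
                scores[lang] += weight
    return scores


# Hypothetical lookup data.
lang_words = {
    "C++": {"#include", "while", "template"},
    "Java": {"import", "while"},
}
weights = word_weights(lang_words)
print(score(["while", "template", "while"], weights))
```

Here "template" is exclusive to C++ and counts fully, while each shared "while" contributes only half a point to both languages.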
In an answer to your other question, someone recommended a naïve Bayes classifier. You should implement this suggestion because the technique is good at separating according to distinguishing features. You mentioned the while keyword, but that's not likely to be useful because so many languages use it—and a Bayes classifier won't treat it as useful.
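To see why a Bayes classifier effectively ignores "while", compare the smoothed per-word likelihoods under two languages. The word counts below are invented for illustration, and the helper name is hypothetical.

```python
import math


def log_likelihood_ratio(word, counts_a, counts_b, vocab_size):
    """log(P(word | A) / P(word | B)) with add-one (Laplace) smoothing."""
    p_a = (counts_a.get(word, 0) + 1) / (sum(counts_a.values()) + vocab_size)
    p_b = (counts_b.get(word, 0) + 1) / (sum(counts_b.values()) + vocab_size)
    return math.log(p_a / p_b)


# Hypothetical word counts from training files.
cpp_counts = {"while": 50, "#include": 10, "x": 40}
python_counts = {"while": 50, "def": 10, "x": 40}
vocab_size = 4  # while, #include, def, x

print(log_likelihood_ratio("while", cpp_counts, python_counts, vocab_size))
print(log_likelihood_ratio("#include", cpp_counts, python_counts, vocab_size))
```

With these counts the ratio for "while" is 0.0, so it shifts the decision in neither direction, whereas "#include" contributes log(11), about 2.4, in favor of C++.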
An interesting part of your problem is how to tokenize an unknown program. Whitespace-separated chunks is a decent rough start, but going meaningfully beyond that will be tricky.