Machine Learning Datasets - Negative vs Positive Vocabulary Dataset - regex

I'm looking for a way to develop an ML dataset that compares positive vs. negative vocabulary. For example, "is valid" vs. "not valid", "can be used" vs. "can't be used", or "not on Thursdays" vs. "on Thursdays" would be the positive vs. negative pairs. It can be simplified to determining whether the adverb is positive or negative. I was wondering if there are any available datasets for this, or any existing solutions.

There are some sentiment dictionaries you can use.
Automated sentiment analysis is an application of text analytics
techniques for the identification of subjective opinions in text data.
It normally involves the classification of text into categories such
as “positive”, “negative” and in some cases “neutral” [Source]
WordStat Sentiment Dictionary 1.2
Loughran and McDonald Financial Sentiment Dictionary

To create a dataset
Search for articles that debate some point; there you will find plenty of positive and negative sentences. To start, pick small paragraphs and check the accuracy of your algorithm manually.
Solution
Start with a very basic approach: search for the keyword "not", then move on to contractions such as "can't" and "won't", and check what you miss.
Then you can go for a more sophisticated approach. Take the sentence "I took precautions with the equipment, it won't harm me." It conveys a positive sense. The thing to look for is "won't harm": "won't" is a negative word and "harm" is also a negative word, and combining the two gives a positive effect. A rough sketch of both steps follows below.
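A minimal sketch of this keyword approach in Python (the word lists here are small illustrative placeholders, not a complete lexicon):

import re

NEGATIONS = {"not", "no", "never", "can't", "cannot", "won't", "don't", "isn't"}
NEGATIVE_WORDS = {"harm", "damage", "fail", "invalid", "danger"}

def polarity(sentence):
    tokens = re.findall(r"[a-z']+", sentence.lower())
    negation_positions = [i for i, t in enumerate(tokens) if t in NEGATIONS]
    if not negation_positions:
        return "positive"
    for i in negation_positions:
        # "won't harm": a negation directly before a negative word flips back to positive
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATIVE_WORDS:
            return "positive"
    return "negative"

print(polarity("It can be used"))      # positive
print(polarity("It can't be used"))    # negative
print(polarity("It won't harm me"))    # positive (double negative)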

Related

Regex - How can you identify strings which are not words?

Got an interesting one, and I can't come up with any solid ideas, so I thought maybe someone else has done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd.
Annoyingly, everything is in lowercase, which makes it more difficult, but since I only care about English, I'm essentially looking for the opposite of valid consonant clusters: groups of consonants that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me:
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try passing every single word inside the sentence to a function that checks whether the word is listed in a dictionary. There are plenty of dictionary text files on GitHub. To speed up the process, use a hash map :)
You could also use an auto-correction API or a library.
Algorithm to combine both methods:
Run the sentence through auto-correction
Run every word through the dictionary
Delete words that aren't listed in the dictionary
This should remove typos and words that don't exist.
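A minimal sketch of the dictionary-lookup step in Python, assuming a word-list file with one word per line (the filename is a placeholder):

def load_dictionary(path="words.txt"):
    with open(path, encoding="utf-8") as f:
        # a set gives O(1) average lookups, i.e. the hash-map speed-up
        return {line.strip().lower() for line in f if line.strip()}

def remove_non_words(sentence, dictionary):
    # keep only tokens that appear in the dictionary
    return " ".join(w for w in sentence.lower().split() if w in dictionary)

# Example usage (assuming words.txt exists):
# words = load_dictionary("words.txt")
# print(remove_non_words("this is kuashdixbkjshakd text", words))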
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia), but it's hard to predict exactly how precise this will be.
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover up to all the 3-grams you find in your training data at least.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
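A rough sketch of the n-gram idea in Python (window size 3, flagging words that contain a trigram never seen in training; the tiny training string here is only for illustration, and a real model should be trained on a large corpus):

from collections import Counter

def char_ngrams(text, n_max=3):
    # collect all 1..n_max character n-grams from the text
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

def looks_like_gibberish(word, known_grams):
    # flag a word if it contains a trigram never seen in the training data
    word = word.lower()
    return any(word[i:i + 3] not in known_grams for i in range(len(word) - 2))

model = char_ngrams("the quick brown fox jumps over the lazy dog".replace(" ", ""))
print(looks_like_gibberish("kuashdixbkjshakd", model))  # True
print(looks_like_gibberish("the", model))               # False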
This will match words with more than 5 consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.
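Note that the [b-z&&[^aeiouy]] syntax is Java's character-class intersection; if you need the same thing in Python, an equivalent sketch lists the consonants explicitly:

import re

# six or more consecutive consonants (y not counted as a consonant)
pattern = re.compile(r"\b[a-z]*[bcdfghjklmnpqrstvwxz]{6}[a-z]*\b")
print(pattern.findall("normal words vs kuashdixbkjshakd"))  # ['kuashdixbkjshakd']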

How do you handle positive and negative samples when they are the same in word2vec's negative sampling?

I have a question about doing negative sampling in word2vec.
To frame it as a binary classification problem, I know that words around the target word are labeled positive and words outside that range are labeled negative. But there are far too many negative words, so negative sampling draws them according to the frequency of words across the entire corpus.
In this case, what happens if a word that is a positive example is also drawn as a negative sample?
For example, for the target word "love" in "I love pizza", (love, pizza) should get a positive label. But isn't it possible for (love, pizza) to also get a negative label later through negative sampling?
The negative-sampling isn't done based on words elsewhere in the sentence, but the word-frequencies across the entire training corpus.
That is, while the observed neighbors become positive examples, the negative examples are chosen at random from the entire distribution of all known words.
These negative examples might even, by bad luck, be the same as other nearby in-context-window intended-positive words! But this isn't fatal: across all training examples, across all passes, and across all negative samples drawn, the net effect of training remains roughly the same. The positive examples have their predictions very often up-weighted, the negative examples down-weighted, and in the occasional case where a negative sample is also a true neighbor, that effect "washes out" across all the (non-biasing) errors, still leaving the real neighbor net-reinforced over the full training run.
So: you're right to note that the process seems a little sloppy, but it all evens out in the end.
FWIW, the word2vec implementations with which I'm most familiar do check if the negative-sample is exactly the positive word we're trying to predict, but don't do any other checking against the whole current window.
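A minimal sketch of frequency-based negative sampling, using the usual unigram^0.75 noise distribution (the tiny vocabulary and counts are made up for illustration):

import numpy as np

counts = {"i": 50, "love": 10, "pizza": 8, "the": 120, "moon": 5}
vocab = list(counts)
probs = np.array([counts[w] for w in vocab], dtype=float) ** 0.75
probs /= probs.sum()  # noise distribution over the whole vocabulary

def draw_negatives(positive_word, k=5, rng=np.random.default_rng(0)):
    # draw k negatives from the corpus-wide distribution; the only check
    # typically made is against the current positive word itself
    negatives = []
    while len(negatives) < k:
        w = rng.choice(vocab, p=probs)
        if w != positive_word:
            negatives.append(w)
    return negatives

# target word "love", positive context word "pizza":
print(draw_negatives("pizza"))  # may still contain other in-window words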

Can you weight Levenshtein to the front of the string?

Levenshtein distance seems to be very agnostic in terms of how it scores distance/similarity.
For instance:
Olive Garden vs Olden Garden = 3
whereas
Olive Garden vs Olive Garden Restaurant = 11
In the real world (as I see it or at least for some applications) the latter should be weighted much more heavily.
Is there a modification, or another 'distance' comparison tool, that doesn't try to account as much for misspelling and transposition, and that would rate the second example as the closer match because of the sheer number of 100% matches on the first part of the phrase?
This is a difficult question to answer and I am by no means an expert on the subject, however, I have at least a partial answer to some of your questions. Additionally, you didn't specify a language, so all my examples will use PHP.
To the best of my knowledge there is no single comparison tool or function that is able to determine the relevance, rather than the similarity, of two strings. However, there are different comparison tools out there that would probably give you better results. The similar_text function in PHP, for example, returns the percent similarity between two strings and would be more accurate for what you're trying to do.
Additionally, you can account for misspellings when comparing the similarity of two strings by first calculating the phonetic "keys" of each string, and then calculating the Levenshtein distance between the phonetic keys. The best phonetic algorithm I know of for calculating the phonetic keys of strings is metaphone. In PHP, metaphone is built in and can be used like this:
echo metaphone("carrot"); // prints KRT
The cool part about this is that if a user were to misspell carrot and instead type "carrrot", the same phonetic key would be generated, because "carrot" and "carrrot" sound the same:
echo metaphone("carrot"); // prints KRT
echo metaphone("carrrot"); // prints KRT
And obviously the Levenshtein distance between KRT and KRT is 0. The pitfall to this solution is that while metaphone helps with smoothing out spelling errors that don't change how a word sounds, words that are misspelled to the point where they no longer have any phonetic resemblance will not generate similar phonetic keys. In your example, Olive Garden and Olden Garden don't have the same phonetic keys, thus are still seen by Levenshtein as being relatively far apart.
echo levenshtein(metaphone("Olive Garden"), metaphone("Olden Garden")); // prints 2
Conclusion
Even in conjunction with metaphone, using the Levenshtein distance falls short, and is unable to provide the relevance between two strings. The best solution I can give would be to use similar_text in conjunction with metaphone to compare your strings. Something like this:
similar_text(metaphone("Olive Garden Restaurant"), metaphone("Olive Garden"), $sim);
echo $sim; // prints 70%

Amazon CloudSearch matching long strings against domain documents

I'm implementing Amazon's CloudSearch API and was wondering how well it would work for "vague" queries.
We basically have records that contain descriptions. We want to find matches based on the content of that description. For example, our domain dataset has the following strings (where each string is a different document):
"The sun is shining bright today"
"The moon is shining in the sky tonight"
"The rain is pouring outside today"
If I were to submit a description to the server like this:
"The sun and moon are shining bright lately"
is there a search method that would return a match for the first two elements (albeit with a low score)? There are key words that are important, ignoring the "the" and "is" type of words. If so, how is that search constructed?
I was eventually able to get those strings to be returned with a query based on "The sun and moon are shining bright lately". I accomplished this by boolean OR'ing the terms together like this:
(or name:'sun' name:'moon' name:'shining' name:'bright' name:'lately')
I also removed stopwords but I don't think you need to.
It's definitely not pretty. The problem I had with other approaches was that CloudSearch seems pretty heavy-handed about penalizing results that don't contain a word from your query, so a word like lately in the query would cause it to not match any of the test strings. I was hoping to fix that with a rank expression but I think you can only rank results, not docs that didn't even match your query.
I also played around with sloppy phrase search but that still requires that the words are found some distance from each other, where in this case certain words aren't found at all.
The only other thing I can think to try is looking at the Lucene and DisMax query parsers. They won't change the underlying search engine, but they may give you a different means of specifying a query that would work better.
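A rough sketch of building that boolean OR query programmatically from the description text (the stopword list and the 'name' field are assumptions; adjust them to your domain schema):

STOPWORDS = {"the", "is", "and", "are", "a", "an", "in", "on", "to"}

def build_or_query(description, field="name"):
    terms = [t.strip(".,!?").lower() for t in description.split()]
    terms = [t for t in terms if t and t not in STOPWORDS]
    return "(or " + " ".join(f"{field}:'{t}'" for t in terms) + ")"

print(build_or_query("The sun and moon are shining bright lately"))
# (or name:'sun' name:'moon' name:'shining' name:'bright' name:'lately')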

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with Regular Expressions. But now that it is getting migrated into Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
time on 24 hour clock
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE @s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE @s + ' ' LIKE '[3-7]' + REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(@s) = 15 AND @s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(@s) = 16 AND @s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE @s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - it seems overkill to do invasive pattern matching for a set of 7 possible values. Similarly with a list of 1000 numbers or years: these are things that will be much easier (and probably more efficient) to check as numeric values in a table rather than converting to a string and seeing if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require the sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online take an approach similar to Ken's or rely on complex combinations of the built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.