word2vec implementation addressing male/female and singular/plural issues - word2vec

I wonder if you know of any word2vec implementation that takes into account that car and cars represent nearly the same concept, or that lehrer and lehrerin (German for male and female teacher respectively) are almost the same. The implementations I have seen largely ignore this fact, and therefore the quality of the results is poor.
Thank you in advance.

In the last year a few research groups have started using the character sequence of a word to generate word embedding vectors. See the paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation" for an example. There is also an earlier paper, "Compositional Morphology for Word Representations and Language Modelling", that specifically models morphological differences, such as those between singular and plural word forms.
I'm not aware of any open source implementations of these types of models.

Related

how to differentiate sentences with antonyms using word2vec

Say I have two sentences, which are similar except there is only one different word with opposite meaning. e.g. "I like her" vs. "I hate her".
word2vec is used in my classification project. As far as I know, word2vec seems unable to figure out differences between antonyms. Is there any way to solve this?
Unfortunately, what we consider 'antonyms' are usually quite similar in word2vec coordinate spaces. That's because such words are quite similar in almost all respects – except for the one contrast they emphasize.
And further, to the extent those contrasts may be captured by the word2vec orientations, they will be in many varied directions. The 'hot'-vs-'cold' contrast will be different from the 'light'-vs-'dark' and the 'small'-vs-'big'.
There might be some analytic technique on sets of word-vectors that helps discover antonymic directions/pairs, but I haven't noticed one discussed, especially not anything that's simple/intuitive or applicable to general word-vector sets. (Once you do know words are opposites, as when consulting prior labeled lexicons or analogy questions, then the directions-between-their-word-vectors can be useful in other analysis, like discovering other words that contrast-in-the-same-way, as when solving analogy problems.)
Can you be more specific about your ultimate goal, with more example of the kinds of input you'll have and what specific results you want software to report?
The one example you give, "I like her" vs "I hate her", could be more generally seen as a sentiment classification, and word2vec-powered classifiers can do OK (though far from perfect) on such challenges. That is, with enough labeled training data, a classifier with a lot of examples of "positive" and "negative" texts will tend to learn that 'like' (and similar words) are positive and 'hate' (and similar) are negative, and do OK on other variants of positive/negative statements (excepting more complex constructions, like negations, subtle qualifications, understatement, irony, etc.)
So more info on what exactly you hope to detect/report, and what you've tried and found insufficient, might generate more ideas on how to achieve it.

Do Mikolov's 2014 Paragraph2Vec models assume sentence ordering?

In Mikolov's 2014 paper on paragraph vectors, https://arxiv.org/pdf/1405.4053v2.pdf, do the authors assume, in both PV-DM and PV-DBOW, that the ordering of sentences needs to make sense?
Imagine I am handling a stream of tweets, and each tweet is a paragraph. The paragraphs/tweets do not necessarily have ordering relations. After training, does the vector embedding for paragraphs still make sense?
Each document/paragraph is treated as a single unit for training – and there’s no explicit way that the neighboring documents directly affect a document’s vector. So the ordering of documents doesn’t have to be natural.
In fact, you generally don’t want all similar text-examples to be clumped together – for example, all those on a certain topic, or using a certain vocabulary, in the front or back of all training examples. That’d mean those examples are all trained with a similar alpha learning rate, and affect all related words without interleaved offsetting examples with other words. Either of those could make a model slightly less balanced/general, across all possible documents. For this reason, it can be good to perform at least one initial shuffle of the text examples before training a gensim Doc2Vec (or Word2Vec) model, if your natural ordering might not spread all topics/vocabulary words evenly through the training corpus.
The PV-DM modes (default dm=1 mode in gensim) do involve sliding context-windows of nearby words, so word proximity within each example matters. (Don’t shuffle the words inside each text!)
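To make the shuffling advice concrete, here is a minimal sketch using only the standard library. The tweets are made-up placeholders and `shuffled_corpus` is a hypothetical helper name, not part of gensim. It reorders whole documents and leaves the words inside each one untouched.

```python
import random

def shuffled_corpus(docs, seed=0):
    """Return a copy of docs in random order; each doc's tokens stay in order."""
    docs = list(docs)                    # copy, so the caller's list is untouched
    random.Random(seed).shuffle(docs)    # shuffle documents, not words
    return docs

# Hypothetical tokenized tweets, one "paragraph" each.
tweets = [
    ["rain", "again", "in", "london"],
    ["great", "match", "last", "night"],
    ["rain", "delays", "the", "match"],
    ["new", "phone", "arrived", "today"],
]
training_order = shuffled_corpus(tweets)
```

The shuffled list would then be tagged and handed to the model in place of the naturally ordered one.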

Implement pre-trained word embeddings in sentence level?

I am trying to do text classification using pre-trained GloVe word embeddings at the sentence level. I am currently using a very naive approach: averaging word vectors to represent a sentence.
The question is: what if no word in the sentence has a pre-trained vector? What should I do when this happens? Just ignore the sentence, or randomly assign some values to its sentence vector? I cannot find a reference that deals with this problem; most papers just say they averaged pre-trained word embeddings to generate sentence embeddings.
If a sentence has no words about which you know anything, any classification attempt will be a random guess.
It's impossible for such no-information sentences to improve your classifier, so they are better to leave out than to include with totally random features.
(There are some word-embedding techniques that can, for languages with subword morphemes, guess better-than-random word-vectors for previously-unknown words. See Facebook's 'FastText' tools, for example. But unless a large number of your texts are dominated by unknown words, you can probably defer investigation of such techniques until after validating if your general approach is working on easier texts.)
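As a rough sketch of "leave them out", assuming the embeddings are available as a plain dict from word to vector (the two-dimensional vectors below are made up, not real GloVe values, and `sentence_vector` is a hypothetical helper):

```python
def sentence_vector(tokens, embeddings):
    """Average the vectors of known tokens; return None if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # no information at all: the caller should drop this sentence
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical 2-dimensional vectors standing in for real GloVe entries.
embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

sentences = [["good", "movie"], ["zzz", "qqq"], ["good", "zzz"]]
vectors = [sentence_vector(s, embeddings) for s in sentences]

# Keep only the sentences that produced a vector.
usable = [(s, v) for s, v in zip(sentences, vectors) if v is not None]
```

A sentence with some unknown words is still averaged over the words that are known; only the fully unknown one is dropped.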

creating a regular expression for a list of strings

I have extracted a series of tables from the scientific literature which consist of columns each of which is a distinct type. Here is an example
I'd like to be able to automatically generate regular expressions for each column. Obviously there are trivial solutions such as .* so I would add the constraints that they use only:
[A-Z] [a-z] [0-9]
explicit punctuation (e.g. ',',''')
"simple" quantifiers (e.g. {3,4})
A "best" answer for the table above would be:
[A-Z]{3}
[A-Za-z\s\.]+
\d{4}\sm
\d{2}\u00b0\d{2}'\d{2}"N,\d{2}\u00b0\d{2}'\d{2}"E
(speciosissima|intermediate|troglodytes)
(hf|sr)
\d{4}
Of course the 4th regex would break if we move outside the geographical area but the software doesn't know that. The aim would be to collect many regexes for, say, "Coordinates" and generalize them, probably partially manually. The enums would only be created if there were a small number of distinct strings.
I'd be grateful for examples of (especially F/OSS) software that can do this, especially in Java. (It's similar to Google's Refine). I am aware of this question from 4 years ago, but it didn't really answer the question, and of the text2re site, which appears to be interactive.
NOTE: I note a vote to close as "too localised". This is a very general problem (the table given is only an example) as shown by Google/Freebase developing Refine to tackle the problem. It potentially refers to a very wide variety of tables (e.g. financial, journalism, etc.). Here's one with floating point values:
It would be useful to determine automatically that some authorities report ages in real numbers (e.g. not months, days) and use 2 digits of precision.
Your particular issue is a special case of "programming by demonstration". That is, given a bunch of input/output examples, you want to generate a program. For you, the inputs are strings and the output is whether each string belongs to the given column. In the end, you want to generate a program in the language of limited regular expressions that you proposed.
This particular instance of programming by demonstration seems closely related to Flash Fill, a recent project from MSR. There, instead of generating regular expressions to match data, they automatically generated programs to transform string data based on input/output examples.
I only skimmed through one of their papers, but I'll try to lay out what I understand here.
There are basically two important insights in this paper. The first was to design a small programming language to represent string transformations. Even using full-on regular expressions created too many possibilities to search through quickly. They designed their own abstract language for manipulating strings; however, your constraints (e.g. only using simple quantifiers) would probably play the same role as their custom language. This is largely possible because your particular problem has a somewhat smaller scope than theirs.
The second insight was on how to actually find programs in this abstract language that match with given input/output pairs. My understanding is that the key idea here is to use a technique called version space algebra. The rough idea about version space algebra is that you maintain a representation of the space of possible programs and repeatedly prune it by introducing additional constraints. The exact details of this process fall well outside my main interests, so you're better off reading something like this introduction to version space algebra, which includes some sample code as well.
They also have some clever approaches to rank different candidate programs and even guess which inputs might be problematic for an already-generated program. I saw a demo where they generated a program without giving it enough input/output pairs, and the program could actually highlight new inputs that were likely to be incorrect. This sort of ranking is very interesting, but requires some more sophisticated machine learning techniques and is probably not immediately applicable to your use case. Might still be interesting though. (Also, this might have been detailed in a different paper than the one I linked.)
So yeah, long story short, you can generate your expressions by feeding input/output examples into a system based on version space algebra. I hope that helps.
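To give a feel for the pruning idea only, here is a deliberately naive sketch. It is not Flash Fill's algorithm and not real version space algebra, which relies on compact representations instead of listing every candidate. It enumerates a tiny hypothesis space of my own choosing (one character class with one fixed-length quantifier) and eliminates every pattern that disagrees with a labelled example.

```python
import re
from itertools import product

# Assumed hypothesis space: a single class repeated exactly n times.
CLASSES = ["[A-Z]", "[a-z]", r"\d"]

def initial_space(max_len=6):
    """All candidate patterns such as [A-Z]{3} or \\d{4}."""
    return {f"{c}{{{n}}}" for c, n in product(CLASSES, range(1, max_len + 1))}

def prune(space, example, belongs):
    """Keep only patterns whose verdict on the example matches its label."""
    return {p for p in space if bool(re.fullmatch(p, example)) == belongs}

space = initial_space()
# Hypothetical labelled examples: does the string belong to the column?
for example, belongs in [("ABC", True), ("2008", False), ("XYZ", True)]:
    space = prune(space, example, belongs)
```

With these three examples the only surviving pattern is `[A-Z]{3}`. A real system would need a much richer pattern language, which is exactly where enumeration stops scaling.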
I'm currently researching the same (or something similar) (here). In general, this is called Grammar induction, or in case of regular expressions, it is induction of regular languages. There is the StaMinA competition about this field. Common algorithms are RPNI and Blue-Fringe.
Here is another related question. And here another one. And here another one.
My own approach (which I have partially prototyped) is heuristic and based on the premise that a given column will often have entries which are the same or similar character lengths and have similar punctuation. I would welcome comments (and resulting code will be Open Source).
flatten [A-Z] to 'A'
flatten [a-z] to 'a'
flatten [0-9] to '0'
flatten any other special codepoint sets (e.g. greek characters) to a single character (e.g. alpha)
The columns then become:
"AAA"
"Aaaaaaaaaa", "Aaaaaaaaaaaaa", "Aaa aaa Aaaaaa", etc.
"0000 a"
"00\u00b000'00"N,00\u00b000'00"E
...
...
"0000"
I shall then replace these by regular expressions such as
"([A-Z])([A-Z])([A-Z])"
...
"(\d)(\d)(\d)(\d)\s([0-9])"
and capture the individual characters into sets. This will show that (say) in 3. the final char is always "m" , so \d\d\d\d\s[m] and for 7. the value is [2][0][0][458].
For the columns that don't fit this model we search using "(.*)" and see if we can create useful sets (cols 5. and 6.) with a heuristic such as "at least 2 multiple strings and no more than 50% unique strings".
By using dynamic programming (cf. Kruskal) I hope to be able to align similar regexes, which will be useful for me, at least!
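A small Python sketch of the flatten-then-collect-sets heuristic described above. The sample values are invented, since the original table is not reproduced here, and `flatten` and `column_regex` are hypothetical names. It only handles columns whose entries all share one flattened shape; anything else returns None and would go to the "(.*)" fallback.

```python
import re

def flatten(s):
    """Map A-Z to 'A', a-z to 'a', 0-9 to '0'; keep every other character."""
    out = []
    for ch in s:
        if "A" <= ch <= "Z":
            out.append("A")
        elif "a" <= ch <= "z":
            out.append("a")
        elif "0" <= ch <= "9":
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)

def column_regex(values):
    """Build a per-position regex if all values share one flattened shape."""
    if len({flatten(v) for v in values}) != 1:
        return None  # mixed shapes: left to the "(.*)" / enum heuristic
    parts = []
    for chars in zip(*values):           # one tuple of characters per position
        seen = sorted(set(chars))
        if len(seen) == 1:
            parts.append(re.escape(seen[0]))   # constant character, e.g. "m"
        else:
            parts.append("[" + "".join(re.escape(c) for c in seen) + "]")
    return "".join(parts)

years = column_regex(["2004", "2005", "2008"])
altitudes = column_regex(["1200 m", "0850 m"])
```

For the made-up year column this gives `200[458]`, the compact form of the [2][0][0][458] result mentioned above. Generalizing small sets back to `\d` or `[A-Z]` would be a further step.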

How do I improve breaking substitution ciphers programmatically?

I have written (am writing) a program to analyze encrypted text and attempt to break it using frequency analysis.
The encrypted text takes the form of each letter being substituted for some other letter, i.e. a->m, b->z, c->t, etc. All spaces and non-alphabetic characters are removed, and upper case letters are made lowercase.
An example would be :
Original input - thisisasamplemessageitonlycontainslowercaseletters
Encrypted output - ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl
Attempt at cracking - omieieaeanuhtnteeawtiorshylrsoaisehrctdlaethtootde
Here it has only got I, A and Y correct.
Currently my program cracks it by analysing the frequency of each individual character, and mapping it to the character that appears in the same frequency rank in a non encrypted text.
I am looking for methods and ways to improve the accuracy of my program as at the moment I don't get too many characters right. For example when attempting to crack X amount of characters from Pride and Prejudice, I get:
1600 - 10 letters correct
800 - 7 letters correct
400 - 2 letters correct
200 - 3 letters correct
100 - 3 letters correct.
I am using Romeo and Juliet as a base to get the frequency data.
It has been suggested to me to look at and use the frequency of character pairs, but I am unsure how to use this because unless I am using very large encrypted texts I can imagine a similar approach to how I am doing single characters would be even more inaccurate and cause more errors than successes. I am hoping also to make my encryption cracker more accurate for shorter 'inputs'.
I'm not sure how constrained this problem is, i.e. how many of the decisions you made are yours to change, but here are some comments:
1) Frequency mapping is not enough to solve a puzzle like this. Many frequencies are very close to each other, and if you aren't using the same text for frequency source and plaintext, you are almost guaranteed to have a few letters off no matter how long the text. Different materials will have different use patterns.
2) Don't strip the spaces if you can help it. This will allow you to validate your potential solution by checking that some percentage of the words exist in a dictionary you have access to.
3) Look into natural language processing if you really want to get into the language side of this. This book has all you could ever want to know about it.
Edit:
I would look into bigraphs and trigraphs first. If you're fairly confident of one or two letters, they can help predict likely candidates for the letters that follow. They're basically probability tables where AB would be the probability of an A being followed by a B. So assuming you have a given letter solved, that can be used to solve the letters next to it, rather than just guessing. For example, if you've got the word "y_u", it's obvious to you that the word is you, but not to the computer. If you've got the letters N, C, and O left, bigraphs will tell you that YN and YC are very uncommon whereas YO is much more likely, so even if your text has unusual letter frequencies (which is easy when it's short) you still have a fairly accurate system for solving for unknowns. You can hunt around for a compiled dataset, or do your own analysis, but make sure to use a lot of varied text: a lot of Shakespeare is not the same as half Shakespeare and half journal articles.
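A minimal sketch of that bigraph lookup, assuming you count pairs inside words of some reference text. The one-line reference here is a made-up toy; a real table needs a large, varied corpus as noted above. `bigram_counts` and `rank_candidates` are hypothetical helper names.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent letter pairs inside each word of the reference text."""
    counts = Counter()
    for word in text.lower().split():
        word = "".join(c for c in word if c.isalpha())
        counts.update(zip(word, word[1:]))
    return counts

def rank_candidates(counts, previous, candidates):
    """Order candidate letters by how often they follow the solved letter."""
    return sorted(candidates, key=lambda c: counts[(previous, c)], reverse=True)

# Toy reference text (an assumption for illustration only).
reference = "you said your young boy enjoyed yoga beyond any canyon"
counts = bigram_counts(reference)

# "y_u" with N, C and O still unassigned: which letter most plausibly follows y?
ranked = rank_candidates(counts, "y", ["n", "c", "o"])
```

Here O comes out on top because YO occurs in the reference while YN and YC do not. A fuller version would also check the pair formed with the letter after the gap.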
Looking at character pairs makes a lot of sense to me.
Every single letter of the alphabet can be used in valid text, but there are many pairs that are either extremely unlikely or will never happen.
For example, there is no way to get qq using valid English words, as every q must be followed by a u. If you have the same letters repeated in the encrypted text, you can automatically exclude the possibility that they represent q.
The fact that you are removing spaces from the input limits the utility somewhat since combinations that would never exist in a single word e.g. ht can now occur if the h ends one word and the t begins another one. Still, I suspect that these additional data points will enable you to resolve much shorter strings of text.
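The doubled-letter check is cheap to do. Here is a sketch on the ciphertext from the question (`doubled_letters` is a hypothetical helper name). Whether a doubled cipher letter can be safely ruled out as q depends on the stripped-spaces caveat just mentioned, so treat the result as strong evidence rather than proof.

```python
def doubled_letters(ciphertext):
    """Return the set of cipher letters that appear twice in a row."""
    return {a for a, b in zip(ciphertext, ciphertext[1:]) if a == b}

ciphertext = "ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl"
unlikely_to_be_q = doubled_letters(ciphertext)
```

For this ciphertext the doubled letters are l and z, which in the original text stand for the "ss" of "message" and the "tt" of "letters".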
Also, I would suggest that Romeo and Juliet is only a good basis for statistical data if you intend to analyze writings of the period. There have been some substantial changes to spelling and word usage that may skew the statistics.
First of all, Romeo and Juliet probably isn't a very good basis to use. Second, yes digraphs are helpful (and so are trigraphs). For a substitution cipher like you're looking at, a good place to start would be the Military Cryptanalysis books by William Friedman.
Well, I have solved some simple substitution ciphers in my time, so I can speak freely.
Removing the spaces from the input string makes it nearly impossible to solve.
While it is true that most English sentences have 'e' in higher frequency, that is not all there is to the process.
The part that makes the activity fun, is the series of trial hypothesis/test hypothesis/accept or reject hypothesis that makes the whole thing an iterative process.
Many sentences contain the words 'of' and 'the'. Looking at your sentence and assuming that one of the two-letter words is 'of' implies further substitutions that allow you to make inferences about other words. In short, you need a dictionary of high-frequency words to allow you to make further inferences.
As there could be a large amount of backtracking involved, it may be wise to consider a Prolog or Erlang implementation as a basis for developing the C++ one.
Best of luck to you.
Kindly share your results when done.
Single-letter words are a big hint (generally only "A" and "I", rarely "O"; casual language allows "K"). There is also a finite set of two- and three-letter words. This is no help if spaces have been stripped.
Pairs are much more diagnostic than you would think. For instance, some letters never appear doubled in English (though this is not absolute if the spaces have been stripped or if foreign vocabulary is allowed), and others are commonly doubled; also, some heterogeneous pairs are very frequent.
As a general rule, no one analysis will provide certainty. You need to assign each cipher letter a set of possible translations with associated probabilities, and combine several tests until the probabilities become very significant.
You may be able to determine when you've gotten close by checking the Shannon Entropy.
Not a complete answer, but maybe a helpful pointer: you can use a dictionary to determine how good your plaintext candidate is. On a UNIX system with aspell installed, you can extract an English word list with the command
aspell -l en dump master
You might try looking at pairs rather than individual letters. For instance, a t is often followed by an h in English, as is an s. Markov modeling would be useful here.
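One way to use a first-order Markov model here, sketched under my own assumptions (add-one smoothing, a 26-letter alphabet, and a toy reference string instead of a real corpus): score each candidate plaintext by the log-probability of its letter pairs, and prefer the candidate with the higher score.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Count letter pairs and their first letters in a reference text."""
    letters = [c for c in text.lower() if c.isalpha()]
    pair_counts = Counter(zip(letters, letters[1:]))
    first_counts = Counter(letters[:-1])
    return pair_counts, first_counts

def log_likelihood(candidate, model, alphabet_size=26):
    """Sum of log P(next letter | current letter) over the candidate text."""
    pair_counts, first_counts = model
    score = 0.0
    for a, b in zip(candidate, candidate[1:]):
        # Add-one smoothing so unseen pairs get a small, nonzero probability.
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size)
        score += math.log(p)
    return score

# Toy reference text (an assumption; use a large varied corpus in practice).
model = train_bigram_model("the theory that they then gathered there")
plausible = log_likelihood("thethe", model)
implausible = log_likelihood("hteeht", model)
```

A cracker could start from the frequency-rank mapping, swap two letters in the key, and keep the swap whenever this score improves.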
Frequency Analysis
Frequency analysis is a great place to start. However, Romeo and Juliet is not a very good choice to take character frequencies from to decipher Pride and Prejudice text. I would suggest using frequencies from this page because it uses 7 different texts that are closer in age to Pride and Prejudice. It also lists probabilities for digraphs and trigraphs. However, digraphs and trigraphs may not be as useful when spaces are removed from the text because this introduces the noise of digraphs and trigraphs created by words being mashed together.
Another resource for character frequencies is this site. It claims to use 'a good mix of different literary genres.'
Frequency analysis generally becomes more probabilistically correct with increased length of the encrypted text as you've seen. Frequency analysis also only helps to suggest the right direction in which to go. For instance, the encrypted character with the highest frequency may be the e, but it could also very well be the a which also has a high frequency. One common method is to start with some of the highest frequency letters in the given language, try matching these letters with different letters of high frequency in the text, and look to see if they form common words like the, that, is, as, and, and so on. Then you go from there.
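For reference, the plain rank-for-rank mapping that all of this starts from might look like the sketch below. The function names are hypothetical and ties are broken alphabetically, which is an arbitrary choice of mine. As the paragraph above says, the output is only a first guess to refine.

```python
from collections import Counter

def rank_letters(text):
    """Letters of text ordered from most to least frequent (ties: alphabetical)."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    return [c for c, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def rank_mapping(ciphertext, reference):
    """Pair the n-th most common cipher letter with the n-th most common reference letter."""
    return dict(zip(rank_letters(ciphertext), rank_letters(reference)))

def apply_mapping(ciphertext, mapping):
    """Decode with the mapping, marking unmapped letters with '?'."""
    return "".join(mapping.get(c, "?") for c in ciphertext)

# Tiny artificial example: x is most common, so it is guessed to be a, and so on.
mapping = rank_mapping("xxxyyz", "aaabbc")
guess = apply_mapping("zyx", mapping)
```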
A Good Introductory Book
If you are looking for a good layman's introduction to cryptography, you might try The Code Book by Simon Singh. It's very readable and interesting. The book looks at the development of codes and codebreaking throughout history. He covers substitution ciphers fairly early on and describes some common methods for breaking them. Also, he had a Cipher Challenge in the book (which has already been completed) that consisted of various codes to try to break, including some substitution ciphers. You might try reading through how the Swedish team broke these ciphers at this site. However, I might suggest reading at least through the substitution cipher part of the book before reading these solutions.
By the way I'm not affiliated in any way with the publication of this book. I just really enjoyed it.
Regarding digraphs, digrams and word approximations, John Pierce (co-inventor of the transistor and PCM) wrote an excellent book, Introduction to Information Theory, that contains an extended analysis of calculating their characteristics, why you would want to and how to locate them. I found it helpful when writing a frequency analysis decryption code myself.
Also, you will probably want to write an ergodic source to feed your system, rather than relying on a single source (e.g., a novel).
Interesting question, I asked a similar one :)
One thing I'm trying to do is:
scan for the bigger words that have repeating letters in them,
then find a dictionary word with the same letter pattern as the bigger word from the cipher.
The reason is simply that the bigger the word, the more deciphered letters you get at once, and bigger words are easier to decode, just as a bigger text is easier to decode: more chances to see patterns emerge :)
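A sketch of that pattern matching, assuming word boundaries are known (which is not the case once spaces are stripped) and using a made-up five-word dictionary. `word_pattern` and `candidates` are hypothetical helper names.

```python
def word_pattern(word):
    """Encode a word by the order in which its distinct letters first appear."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in word)

def candidates(cipher_word, dictionary):
    """Dictionary words with the same repeated-letter pattern as the cipher word."""
    target = word_pattern(cipher_word)
    return [w for w in dictionary if word_pattern(w) == target]

# Hypothetical mini dictionary; a real one would come from a word list.
dictionary = ["letters", "message", "contain", "bottoms", "matters"]

# "stzztkl" is how "letters" appears in the ciphertext from the question.
matches = candidates("stzztkl", dictionary)
```

Here both "letters" and "bottoms" fit the pattern, so a pattern match narrows the options but still needs another test (such as pair frequencies) to pick the winner.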