Implement pre-trained word embeddings in sentence level? - word2vec

I am trying to do a text classification, and using pre-trained Glove word embedding in sentence level. I am currently using very naive approach which is averaging words vectors to represent sentence.
The question is what if there is no pre-trained word appeared in the sentence, how should I do if this happens? Just ignore this sentence or randomly assign some values to this sentence vector? I can not find a reference that deal with this problem, most of paper just said they used averaging pre-trained word embeddings to generate sentence embedding.

If a sentence has no words about which you know anything, any classification attempt will be a random guess.
It's impossible for such no-information sentences to improve your classifier, so they are better to leave out than to include with totally random features.
(There are some word-embedding techniques that can, for languages with subword morphemes, guess better-than-random word-vectors for previously-unknown words. See Facebook's 'FastText' tools, for example. But unless a large number of your texts are dominated by unknown words, you can probably defer investigation of such techniques until after validating if your general approach is working on easier texts.)

Related

How to find that one text is similar to the part of another?

We know how to make an assessment of the similarity of two whole texts for example by Word Mover’s Distance. How to find piece inside one text that is similar to another text?
You could break the text into chunks – ideally by natural groupings, like sentences or paragraphs – then do pairwise comparisons of every chunk against every other, using some text-distance measure.
Word Mover's Distance can give impressive results, but it quite slow/expensive to calculate, especially for large texts and large numbers of pairwise comparisons. Other more-simple summary vectors for text – such as a simple average of all the text's word-vectors, or a text-vector learned from the text like 'Paragraph Vector' (aka Doc2Vec) – will be much faster and might be good enough, or at least be a good quick 1st pass to limit the number of candidate pairs you do something more expensive on.

word2vec implementation addresing male/female and singular/plural issues

I wonder if you know any word2vec implementation that takes into account that car and cars represents nearly the same concept, or lehrer and lehrerin (German for male and female teacher respectively) are almost the same. The implementations I have seen largely ignore this fact, and therefore the quality of the results is poor.
Thank you in advance.
In the last year a few research groups have started using the character sequence of a word to generate word embedding vectors. See this paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation" for an example. There is also an earlier paper "Compositional Morphology for Word Representations and Language Modelling" that specifically uses models morphological differences like differences between singular and plural word forms.
I'm not aware of any open source implementations of these types of models.

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1)Is replacing the words with their rank an effective way to compress the file? (Got answer to this question. This is a bad idea.)
Also, I have come up with a new compression algorithm. I read some existing compression models that are used widely and I found out they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use all these concepts and is a rather simple set of rules that need to be followed while compressing and decompressing. (2)My question is am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3)Furthermore, if I manage to successfully compress a string can I extend my algorithm to other content like videos, images etc.?
(I understand that the third question is difficult to answer without knowledge about the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent I feel ashamed about sharing it. Please feel free to ignore the third question if you have to)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
Do you mean like having a ranking table of words sorted by frequency and assign smaller "symbols" to those words that are repeated the most, therefore reducing the amount of information that needs to be transmitted?
That's basically how Huffman Coding works, the problem with compression is that you always hit a limit somewhere along the road, of course, if the set of things that you try to compress follows a particular pattern/distribution then it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no (and I believe that there can't be) "best" compression technique.
Huffman Coding uses frequency on letters. You can do the same with words or with letter frequency in more dimensions, i.e. combinations of letters and their frequency.

Identifying keywords of a (programming) language

this is a follow up to my recent question ( Code for identifying programming language in a text file ). I'm really thankful for all the answers I got, it helped me very much. My code for this task is complete and it works fairly well - quick and reasonably accurate.
The method i used is the following: i have a "learning" perl script that identifies most frequently used words in a language by doing a word histogram over a set of sample files. These data are then loaded by the c++ program which then checks the given text and accumulates score for each language based on found words and then simply checks which language accumulated the highest score.
Now i would like to make it even better and work a bit on the quality of identification. The problem is I often get "unknown" as result (many languages accumulate a small score, but none anything bigger than my threshold). After some debugging, research etc i found out that this is probably due to the fact, that all words are considered equal. This means that seeing a "#include" for example has the same effect as seeing a "while" - both of which indicate that it might be c/c++ (i'm now ignoring the fact that "while" is used in many other languages), but of course in larger .cpp files there might be a ton of "while" but most of the time only a few "#include".
So the fact that a "#include" is more important is ignored, because i could not come up with a good way how to identify if a word is more important than another. Now bear in mind that the script which creates the data is fairly stupid, its only a word histogram and for every chosen word it assigns a score of 1. It does not even look at the words (so if there is a "#&|?/" in a file very often it might get chosen as a good word).
Also i would like to have the data creation part fully automated, so nobody should have to look at the data and alter them, change scores, change words etc. All the "brainz" should be in the script and the cpp program.
Does somebody have a suggestion how to identify keywords, or more generally, important words? Some things that might help: i have the number of occurences of each word and the number of total words (so a ratio may be calculated). I have also thought about wiping out characters like ;, etc. since the histogram script often puts for example "continue;" in the result, but the important word is "continue". Last note: all checks for equality are done for exact match - no substrings, case sensitive. This is mainly because of speed, but substrings might help (or hurt, i dont know)...
NOTE: thanks all who bothered to answer, you helped me a lot.
My work with this is almost finished so i will describe what i did to get good results.
1) Get a decent training set, about 30-50 files per language from various sources to avoid coding style bias
2) Write a perl script that does a word histogram. Implement blacklist and whitelist (more about it below)
3) add bogus words to blacklist, like "license", "the" etc. These are often found at the start of file in license information.
4) add about five most important words per language to the whitelist. These are words that are found in most source code of a given language, but are not frequent enough to get into the histogram. For example for C/C++ i had: #include, #define, #ifdef, #ifndef and #endif in the whitelist.
5) Emphasize the start of a file, so give more points to words found in the first 50-100 lines
6) when doing the word histogram, tokenize the file using #words = split(/[\s\(\){}\[\];.,=]+/, $_); This should be ok for most languages i think (gives me the best results). For each language, have about 10-20 most frequent words in the final results.
7) When the histogram is complete, remove all words that are found in the blacklist and add all those that are found in the whitelist
8) Write a program which processes a text file in the same way as the script - tokenize using the same rules. If a word is found in the histogram data, add points to the right language. Words in the histogram which correspond to only one language should add more points, those which belong to multiple languages should add less.
Comments are welcome. Currently on about 1000 text files i get 80 unknowns (mostly on extremely short files - mainly javascript with just one or two lines). About 20 files are recognized wrong. Size of the files is about 11kB ranging from 100 bytes to 100 kBytes (almost 11MB total). It takes one second to process them all, which is good enough for me.
I think you're approaching this from the wrong viewpoint. From your description, it sounds like you are building a classifier. A good classifier needs to discriminate between different classes; it doesn't need to precisely estimate the correspondence between the input and the most likely class.
Practically: your classifier doesn't need to assess precisely how close to C++ a certain input is; it merely needs to determine if the input is more like C than C++. This makes your work a lot easier - most of your current "unknown" cases will be close to one or two languages, even though they don't exceed your basic threshold.
Now, once you realize this, you will also see what training your classifier needs: not some random aspect of the sample files, but what sets two languages apart. Hence, when you have parsed your C samples, and your C++ samples, you will see that #include does not set them apart. However, class and template will be far more common in C++. On the other hand, #include does distinguish between C++ and Java.
There are of course other aspects besides keywords that you can use. For instance, the most obvious would be the frequency of {, and ; is similarly distinguishing. Another very useful feature for your classifier would be the comment tokens for the different languages. The basic problem of course would be automatically identifying them. Again, hardcoding //, /*, ', --, # and ! as pseudo-keywords would help.
This also identifies another classification rule: SQL will often have -- at the beginning of a line, whereas in C it will often appear somewhere else. Thus it may be useful for your classifier to take the context into account as well.
Use Google Code Search to learn weights for the set of keywords: #include in C++ gets 672.000 hits, in Python only ~5000.
You can normalize the results by looking at the number of results for the language in total:
C++ gives about 770.000 files whereas Python returns 120.000.
Thus "#include" is extremely rare in Python files, but exists in almost every C++ file. (Now you still have to learn to distinguish C++ and C of course.) All that is left is to do the correct reasoning about probabilities.
You need to get some exclusiveness into your lookup data.
When teaching the programming languages you expect, you should search for words typical for one or few language(s). If a word appears in several code files of the same language but appears in few or none of the other language files, it's a strong suggestion to that language.
So the score of a word could be calculated at the lookup side by selecting the words that are exclusive to a language or a group of languages. Find several of these words and get the intersection of these by adding the scores, and found your language you will have.
In an answer to your other question, someone recommended a naïve Bayes classifier. You should implement this suggestion because the technique is good at separating according to distinguishing features. You mentioned the while keyword, but that's not likely to be useful because so many languages use it—and a Bayes classifier won't treat it as useful.
An interesting part of your problem is how to tokenize an unknown program. Whitespace-separated chunks is a decent rough start, but going meaningfully beyond that will be tricky.

How do I improve breaking substitution ciphers programmatically?

I have written (am writting) a program to analyze encrypted text and attempt to analyze and break it using frequency analysis.
The encrypted text takes the form of each letter being substituted for some other letter ie. a->m, b->z, c->t etc etc. all spaces and non alpha chars are removed and upper case letters made lowercase.
An example would be :
Orginal input - thisisasamplemessageitonlycontainslowercaseletters
Encrypted output - ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl
Attempt at cracking - omieieaeanuhtnteeawtiorshylrsoaisehrctdlaethtootde
Here it has only got I, A and Y correctly.
Currently my program cracks it by analysing the frequency of each individual character, and mapping it to the character that appears in the same frequency rank in a non encrypted text.
I am looking for methods and ways to improve the accuracy of my program as at the moment I don't get too many characters right. For example when attempting to crack X amount of characters from Pride and Prejudice, I get:
1600 - 10 letters correct
800 - 7 letters correct
400 - 2 letters correct
200 - 3 letters correct
100 - 3 letters correct.
I am using Romeo and Juliet as a base to get the frequency data.
It has been suggested to me to look at and use the frequency of character pairs, but I am unsure how to use this because unless I am using very large encrypted texts I can imagine a similar approach to how I am doing single characters would be even more inaccurate and cause more errors than successes. I am hoping also to make my encryption cracker more accurate for shorter 'inputs'.
I'm not sure how constrained this problem is, i.e. how many of the decisions you made are yours to change, but here are some comments:
1) Frequency mapping is not enough to solve a puzzle like this, many frequencies are very close to each other and if you aren't using the same text for frequency source and plaintext, you are almost guaranteed to have a few letters off no matter how long the text. Different materials will have different use patterns.
2) Don't strip the spaces if you can help it. This will allow you to validate your potential solution by checking that some percentage of the words exist in a dictionary you have access to.
3) Look into natural language processing if you really want to get into the language side of this. This book has all you could ever want to know about it.
Edit:
I would look into bigraphs and trigraphs first. If you're fairly confident of one or two letters, they can help predict likely candidates for the letters that follow. They're basically probability tables where AB would be the probability of an A being followed by a B. So assuming you have a given letter solved, that can be used to solve the letters next to it, rather than just guessing. For example, if you've got the word "y_u", it's obvious to you that the word is you, but not to the computer. If you've got the letters N, C, and O left, bigraphs will tell you that YN and YC are very uncommon where as YO is much more likely, so even if your text has unusual letter frequencies (which is easy when it's short) you still have a fairly accurate system for solving for unknowns. You can hunt around for a compiled dataset, or do your own analysis, but make sure to use a lot of varied text, a lot of Shakespeare is not the same as half of Shakespeare and half journal articles.
Looking at character pairs makes a lot of sense to me.
Every single letter of the alphabet can be used in valid text, but there are many pairs that are either extremely unlikely or will never happen.
For example, there is no way to get qq using valid English words, as every q must be followed by a u. If you have the same letters repeated in the encrypted text, you can automatically exclude the possibility that they represent q.
The fact that you are removing spaces from the input limits the utility somewhat since combinations that would never exist in a single word e.g. ht can now occur if the h ends one word and the t begins another one. Still, I suspect that these additional data points will enable you to resolve much shorter strings of text.
Also, I would suggest that Romeo and Juliette is only a good basis for statistical data if you intend to analyze writings of the period. There have been some substantial changes to spelling and word usage that may skew the statistics.
First of all, Romeo and Juliet probably isn't a very good basis to use. Second, yes digraphs are helpful (and so are trigraphs). For a substitution cipher like you're looking at, a good place to start would be the Military Cryptanalysis books by William Friedman.
Well, I have solved some simple substitution ciphers in my time, so I can speak freely.
Removing the spaces from the input string makes it nearly impossible to solve.
While it is true that most English sentences have 'e' in higher frequency, that is not all there is to the process.
The part that makes the activity fun, is the series of trial hypothesis/test hypothesis/accept or reject hypothesis that makes the whole thing an iterative process.
Many sentences contain the words 'of' and 'the'. By looking at your sentence, and assuming that one of the two letter words is of, implies further substitutions that can allow you to make inferences about other words. In short, you need a dictionary of high frequency word, to allow you to make further inferences.
As there could be a large amount of backtracking involved, it may be wise to consider a prolog or erlang implementation as a basis for developing the c++ one.
Best of luck to you.
Kindly share your results when done.
Single letter word are a big hint (generally only "A" and "I", rarely "O". Casual language allows "K"). There are also a finite set of two and three letter words. No help if spaces have been stripped.
Pairs are much more diagnostic than you would think. For instance: some letters never appear doubled in English (though this is not absolute if the spaces have been stripped or if foreign vocabulary is allowed), and others are common double; also some heterogeneous pairs are very frequent.
As a general rule, no one analysis will provide certainty. You need to assign each cipher letter a set of possible translation with associated probabilities. And combine several tests until the probabilities become very significant.
You may be able to determine when you've gotten close by checking the Shannon Entropy.
Not a complete answer, but maybe a helpful pointer: you can use a dictionary to determine how good your plaintext candidate is. On a UNIX system with aspell installed, you can extract an English word list with the command
aspell -l en dump master
You might try looking at pairs rather than individual letters. For instance, a t is often followed by an h in English, as is an s. Markov modeling would be useful here.
Frequency Analysis
Frequency analysis is a great place to start. However, Romeo and Juliet is not a very good choice to take character frequencies from to decipher Pride and Prejudice text. I would suggest using frequencies from this page because it uses 7 different texts that are closer in age to Pride and Prejudice. It also lists probabilities for digraphs and trigraphs. However, digraphs and trigraphs may not be as useful when spaces are removed from the text because this introduces the noise of digraphs and trigraphs created by words being mashed together.
Another resource for character frequencies is this site. It claims to use 'a good mix of different literary genres.'
Frequency analysis generally becomes more probabilistically correct with increased length of the encrypted text as you've seen. Frequency analysis also only helps to suggest the right direction in which to go. For instance, the encrypted character with the highest frequency may be the e, but it could also very well be the a which also has a high frequency. One common method is to start with some of the highest frequency letters in the given language, try matching these letters with different letters of high frequency in the text, and look to see if they form common words like the, that, is, as, and, and so on. Then you go from there.
A Good Introductory Book
If you are looking for a good layman introduction to cryptography, you might try The Code Book by Simon Singh. It's very readable and interesting. The books looks at the development of codes and codebreaking throughout history. He covers substitution ciphers fairly early on and describes some common methods for breaking them. Also, he had a Cipher Challenge in the book (which has already been completed) that consisted of some various codes to try to break including some substitution ciphers. You might try reading through how the Swedish team broke these ciphers at this site. However, I might suggest reading at least through the substitution cipher part of the book before reading these solutions.
By the way I'm not affiliated in any way with the publication of this book. I just really enjoyed it.
Regarding digraphs, digrams and word approximations, John Pierce (co-inventor of the transistor and PCM) wrote an excellent book, Introduction to Information Theory, that contains an extended analysis of calculating their characteristics, why you would want to and how to locate them. I found it helpful when writing a frequency analysis decryption code myself.
Also, you will probably want to write an ergodic source to feed your system, rather than relying on a single source (e.g., a novel).
interesting question,i ask a similar question :)
one thing i'm trying to find out and do is:
to scan the bigger words that have repeating letters in them..
then find a corresponding word with a similar pattern to the bigger word from the cipher..
the reason as to why is simply because,the bigger the word the most possible different deciphered letters found at once and because bigger words are easier to decode,just the same as to why a bigger text is easier to decode.. more chances to see patterns emerge :)