Understanding wordNgrams from fastText - word2vec

I'm trying to understand what the -wordNgrams parameter in fastText does.
Let's take the following text as an example:
The quick brown fox jumps over the lazy dog
Now, if we have a context window of size 2 at the word 'brown', we would get the following samples:
(brown, the)
(brown, quick)
(brown, fox)
(brown, jumps)
If we set -wordNgrams 2, would we find the word 'brown_fox' in our vocabulary? And hence, would our training samples now be:
(brown_fox, the)
(brown_fox, quick)
(brown_fox, jumps)
(brown_fox, over)
Is that correct? I couldn't find any explanation of this anywhere.

I was wondering about the same question.
I found an issue which said 'word n-grams are only used in supervised mode', so setting wordNgrams=2 does nothing in unsupervised mode.
I then tested it myself:
./fasttext skipgram -input data.txt -output test -dim 50 -wordNgrams 2 -loss hs
cut -d' ' -f1 test.vec > vocab.txt
The result: vocab.txt contains only single words and subwords.
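Note that even in supervised mode fastText would not add 'brown_fox' to the vocabulary: word n-grams are hashed into a fixed number of buckets appended after the real words. Here is a rough Python sketch of that scheme, simplified from the fastText source (the constants mirror its code, but treat this as illustrative rather than the exact implementation):

BUCKET = 2_000_000  # fastText's -bucket default

def fnv_hash(word):
    # FNV-1a hash, as fastText uses for tokens
    h = 2166136261
    for b in word.encode("utf-8"):
        h = ((h ^ b) * 16777619) % 2**32
    return h

def add_word_bigrams(tokens, vocab_size):
    hashes = [fnv_hash(t) for t in tokens]
    features = list(range(len(tokens)))  # placeholder ids standing in for the unigram ids
    for i in range(len(tokens) - 1):
        # combine adjacent word hashes, then map the bigram into a bucket
        h = (hashes[i] * 116049371 + hashes[i + 1]) % 2**64
        features.append(vocab_size + h % BUCKET)
    return features

print(add_word_bigrams("the quick brown fox".split(), vocab_size=4))

So 'brown_fox' only ever exists as an anonymous bucket id, never as a vocabulary entry, which is consistent with what you see in vocab.txt.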

Related

How to build a proper H2O word2vec training_frame

How do I build an H2O word2vec training_frame that distinguishes between different documents/sentences, etc.?
As far as I can tell from the very limited documentation I have found, you simply supply one long list of words, such as
'This' 'is' 'the' 'first' 'This' 'is' 'number' 'two'
However, it would make sense to be able to distinguish between documents; ideally something like this:
Name | ID
This | 1
is | 1
the | 1
first | 1
This | 2
is | 2
number | 2
two | 2
Is that possible?
word2vec is a type of unsupervised learning: it turns string data into numbers. So to do classification you need a two-step process:
word2vec for strings to numbers
any supervised learning technique for numbers to categories
The documentation contains links to a categorization example in each of R and Python. This tutorial shows the same process on a different data set (and there should be an H2O World 2017 video that goes with it).
By the way, in your original example you don't just supply the words; the sentences are separated by NA. If you give h2o.tokenize() a vector of sentences, it will produce this format for you. So your example would actually be:
'This' 'is' 'the' 'first' NA 'This' 'is' 'number' 'two'
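In the Python API the same step looks roughly like this (a minimal sketch, assuming a recent h2o-py and a local cluster; the column name "text" is illustrative):

import h2o
from h2o.estimators import H2OWord2vecEstimator

h2o.init()

# two "documents", one sentence per row
sentences = h2o.H2OFrame(
    {"text": ["This is the first", "This is number two"]},
    column_types=["string"],
)

# tokenize() splits each row into words and stacks them into one column,
# inserting an NA row between documents - the format word2vec expects
words = sentences.tokenize(" ")

w2v = H2OWord2vecEstimator(vec_size=10, epochs=5)
w2v.train(training_frame=words)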

Regex for words that don't differ by only one letter

I want to create a series of puzzle games where you change one letter in a word to create a new word, with the aim of reaching a given target word. For example, to change "this" to "that":
this
thin
than
that
What I want to do is create a regex which will scan a list of words and select all those that differ from the current word in more than one letter. For example, if my starting word is "pale" and my list of words is...
pale
male
sale
tale
pile
pole
pace
page
pane
pave
palm
peal
leap
play
help
pack
... I want all the words from "peal" to "pack" to be selected. This means that I can delete them from my list, leaving only the words that could be the next match. (It's OK for "pale" itself to be unselected.)
I can do this in parts:
^.(?!ale).{3}\n selects words not like "*ale"
^.(?<!p).{3}\n|^.{2}(?!le).{2}\n selects words not like "p*le"
^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n selects words not like "pa*e"
^.{3}(?<!pal).\n selects words not like "pal*".
However, when I put them together...
^.(?!ale).{3}\n|^.(?<!p).{3}\n|^.{2}(?!le).{2}\n|^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n|^.{3}(?<!pal).\n
... everything but "pale" is matched.
I need some way to create an AND relationship between the different regexes, or (more likely) a completely different approach.
You can use the Python regex module that allows fuzzy matching:
>>> import regex
>>> regex.findall(r'(?:pale){s<=1}', "male sale tale pile pole pace page pane pave palm peal leap play help pack")
['male', 'sale', 'tale', 'pile', 'pole', 'pace', 'page', 'pane', 'pave', 'palm']
In this case, anything within 0 or 1 substitutions of "pale" counts as a match (i.e. these are the words to delete from your list).
Or consider the TRE library and its command-line agrep, which supports a similar syntax.
Given:
$ echo $s
male sale tale pile pole pace page pane pave palm peal leap play help pack
You can filter to a list of a single substitution:
$ echo $s | tr ' ' '\n' | agrep '(?:pale){ 1s <2 }'
male
sale
tale
pile
pole
pace
page
pane
pave
palm
Here's a solution that uses cool Python tricks and no regex:
def almost_matches(word1, word2):
    # count position-wise equal letters; 3 of 4 means one substitution away
    return sum(map(str.__eq__, word1, word2)) == 3

for word in "male sale tale pile pole pace page pane pave palm peal leap play help pack".split():
    print(almost_matches("pale", word))
A completely different approach: Levenshtein distance
...the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
PHP example:
$words = array(
"pale",
"male",
"sale",
"tale",
"pile",
"pole",
"pace",
"page",
"pane",
"pave",
"palm",
"peal",
"leap",
"play",
"help",
"pack"
);
foreach ($words as $word)
    if (levenshtein("pale", $word) > 1)
        echo $word."\n";
This assumes the word on the first line is the keyword. Just a brute force parallel letter-match and count gets the job done:
awk 'BEGIN{FS=""}
NR==1{n=NF;for(i=1;i<=n;++i)c[i]=$i}
NR>1{j=0;for(i=1;i<=n;++i)j+=c[i]==$i;if(j<n-1)print}'
A general regexp solution would need to be a two-stepper, I think: generate the regexp from the keyword in the first step, then run it against the file in the second.
By the way, the way to do an "and" of regexps is to string lookaheads together (and the lookaheads don't need to be as complicated as those above, I think):
^(?!.ale)(?!p.le)(?!pa.e)(?!pal.)
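To illustrate the two-step idea, here is a sketch in Python (the function name is mine) that generates that lookahead chain from a keyword and filters the list with it, assuming the keyword contains no regex metacharacters:

import re

def build_pattern(keyword):
    # one lookahead per position: reject any word that matches the keyword
    # with that single position wildcarded (i.e. within one substitution)
    lookaheads = "".join(
        "(?!%s.%s$)" % (keyword[:i], keyword[i + 1:]) for i in range(len(keyword))
    )
    return re.compile("^%s.{%d}$" % (lookaheads, len(keyword)))

words = "male sale tale pile pole pace page pane pave palm peal leap play help pack".split()
print([w for w in words if build_pattern("pale").match(w)])
# ['peal', 'leap', 'play', 'help', 'pack']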

How to remove non-ASCII characters and append a space in the field where they were, using a Perl one-liner?

Hi Stack Overflow community,
I have the following problem.
I got this file called bad, with the following contents:
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR ìPO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
I want to remove the non-ascii character from it (at the start of the second column of the second record), in order to get a file free of strange characters and with all its columns aligned. Plus, there's a requirement to achieve this using a Perl one-liner, so no awk, sed, or similar commands can be used. I tried the following, but came up one space short in the third column:
$ perl -plne 's/[^[:ascii:]]//g' bad > bad.clean
$ cat bad.clean
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR PO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
I also tried the same one-liner, but this time replacing the non-ascii character with a space. In this case, the record ended up with two extra spaces in the second column and one extra space in the third:
$ perl -plne 's/[^[:ascii:]]/ /g' bad > bad.clean.space
$ cat bad.clean.space
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR PO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
Somehow, the non-ascii character seems to take two bytes instead of one. Is this correct, or am I missing something?
The expected output is this:
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR PO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
Is there a way, using a Perl one-liner, to get the expected result? I was thinking of adding one space after removing the non-ascii character, in the field where the change was made, but I can't find a way to do it. Also, the non-ascii character can appear in any field, not only the second one.
By the way, some info that might be useful: This is an AIX machine, running Perl v5.8.8.
Thank you!
Edit:
As @ThisSuitIsBlackNot mentions, there are two non-ascii characters. Therefore, I guess I just want to add one space to the end of that field if at least one non-ascii character gets removed by the command. Is there a way to get this extra space included in the same statement, so it can still be done as a one-liner?
Edit:
After reviewing a large set of data, I can tell that the non-ascii characters always appear in pairs, and the field that follows them in the original file (before running the one-liner) is always one space to the right compared to the other columns. So I'm changing the title of this question to match the requirement: Perl one-liner to remove non-ascii characters and append a space in the field where the non-ascii characters were.
Take out the 2 non-ascii characters, then add one space after the field. It uses the non-ascii pair and the 3-space gap as delimiters:
s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g
Broken out:
[^[:ascii:]]{2}    # the two non-ascii bytes
( .*? [ ]{3} )     # capture the rest of the field plus the 3-space delimiter
Perl test case
$/ = undef;
$str = <DATA>;
$str =~ s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g;
print $str;
__DATA__
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR ìPO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
Output >>
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR PO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
You might be able to use tr:
tr -cs '[:print:]' ' '
This will replace runs of non-printable characters with a space.
This might be a silly question, but: why not column-align it by fixing the input to have the right number of spaces? The second line of your input has a different number of padding spaces between the second and third columns, compared to the other lines.
If you must have unaligned input like that in the example, something like this will work in the example's narrow case, and it can be adapted (using floor or something similar) to work for other cases. I don't think it will ever really work in the general case, though; there's no magical "detect and correct my column size" function without using Text::Table or similar in your one-liner:
perl -plne 's/([^[:ascii:]]+?)((?:\w+\s)+?)(\s+?.+)/$2 . (" " x (int(length($1) \/ 2) - 1)) . $3/ge' bad > bad.clean
That is totally unoptimized and probably has some inefficiencies; a real regex guru could probably fold it into a handful of bytes. However, it should point you in the right direction (i.e. using functions in the right-hand section of the substitution, rather than static values). It will also only work under the constraint that two-byte characters are the only non-ASCII values in the string, which is often a false assumption. Read this excellent article by Joel Spolsky before writing another line of code; everyone who has to deal with character encodings should know the basics.

Find and replace next and next and not find the first and last

Really elementary question, but I can't get this to work. My sample text is provided at the bottom of the page.
The only rows I want left are the ones looking like this: "178-207 30 WVRTRWALLLLFWLGWLGMLAGAVVIIVRA -3,95". I currently use TextWrangler on OS X (the terminal and I are not friends), which provides regex replacements.
I am trying to do this in steps, and my first step is to get rid of all the protein sequences.
In TextWrangler, I search for this:
Working sequence([^;]*)------------------------------------------------------------
and replace with nothing. However, what I end up with is an almost empty document, as TextWrangler seems to find the first instance of "Working sequence" but the LAST instance of "------------------------------------------------------------". How do I change this into a step-wise process, finding the first instance of both and replacing with nothing, then the second instance, etc.?
Thanks and greetings from Sweden
Results summary for protein: sp|P08195|4F2_HUMAN 4F2 GN=SLC3A2 PE=1 SV=3
Translocon TM Analysis Results
Partitioning: water to bilayer
Window range: 19-30
Number of translocon TM predicted segments: 2
178-207 30 WVRTRWALLLLFWLGWLGMLAGAVVIIVRA -3,95
438-460 23 ARLLTSFLPAQLLRLYQLMLFTL 1,63
Working sequence length = 630):
MELQPPEASIAVVSIPRQLPGShSEAGVQGLSAGDDSELGShCVAQTGLELLASGDPLPS
ASQNAEMIETGSDCVTQAGLQLLASSDPPALASKNAEVTGTMSQDTEVDMKEVELNELEP
EKQPMNAASGAAMSLAGAEKNGLVKIKVAEDEAEAAAAAKFTGLSKEELLKVAGSPGWVR
TRWALLLLFWLGWLGMLAGAVVIIVRAPRCRELPAQKWWhTGALYRIGDLQAFQGhGAGN
LAGLKGRLDYLSSLKVKGLVLGPIhKNQKDDVAQTDLLQIDPNFGSKEDFDSLLQSAKKK
SIRVILDLTPNYRGENSWFSTQVDTVATKVKDALEFWLQAGVDGFQVRDIENLKDASSFL
AEWQNITKGFSEDRLLIAGTNSSDLQQILSLLESNKDLLLTSSYLSDSGSTGEhTKSLVT
QYLNATGNRWCSWSLSQARLLTSFLPAQLLRLYQLMLFTLPGTPVFSYGDEIGLDAAALP
GQPMEAPVMLWDESSFPDIPGAVSANMTVKGQSEDPGSLLSLFRRLSDQRSKERSLLhGD
FhAFSAGPGLFSYIRhWDQNERFLVVLNFGDVGLSAGLQASDLPASASLPAKADLLLSTQ
PGREEGSPLELERLKLEPhEGLLLRFPYAA
------------------------------------------------------------
Results summary for protein: sp|Q9NPC4|A4GAT_HUMAN OS=Homo sapiens GN=A4GALT PE=2 SV=1
Translocon TM Analysis Results
Partitioning: water to bilayer
Window range: 19-30
Number of translocon TM predicted segments: 1
19-43 25 RVCTLFIIGFKFTFFVSIMIYWhVV -1,04
Working sequence length = 353):
MSKPPDLLLRLLRGAPRQRVCTLFIIGFKFTFFVSIMIYWhVVGEPKEKGQLYNLPAEIP
CPTLTPPTPPShGPTPGNIFFLETSDRTNPNFLFMCSVESAARThPEShVLVLMKGLPGG
NASLPRhLGISLLSCFPNVQMLPLDLRELFRDTPLADWYAAVQGRWEPYLLPVLSDASRI
ALMWKFGGIYLDTDFIVLKNLRNLTNVLGTQSRYVLNGAFLAFERRhEFMALCMRDFVDh
YNGWIWGhQGPQLLTRVFKKWCSIRSLAESRACRGVTTLPPEAFYPIPWQDWKKYFEDIN
PEELPRLLSATYAVhVWNKKSQGTRFEATSRALLAQLhARYCPTThEAMKMYL
------------------------------------------------------------
You told it to look for "Working sequence" and then anything that's not ';'. The lines of '-' characters don't contain ';' either, so the greedy [^;]* runs straight through the first (and next, and next...) separator; the match only stops at the final line of '-' characters, because you told it there should be one at the end. That's why it's matching everything. I think this will work for you:
Working sequence([^-]*)------------------------------------------------------------
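If the file ever outgrows TextWrangler, the same fix carries over to a short script; here is a sketch in Python (the filename is illustrative), where [^-]* makes each block stop at its own separator:

import re

with open("results.txt") as f:
    text = f.read()

# [^-]* cannot cross a dash, so every "Working sequence" block is removed
# up to its own separator line instead of running on to the LAST one;
# -+ then consumes the whole run of dashes
cleaned = re.sub(r"Working sequence[^-]*-+", "", text)
print(cleaned)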

Using regexp with sphinx

I need to build an algorithm that lets me do fuzzy (regexp) searches in Sphinx.
For example, I need to find a phrase that contains uncertain symbols: "2x4" may appear as "2x4", "2*4", or "2-4".
I want to do something like this: "2(x|*|-)4". But if I try to use this construction in a query, Sphinx splits it into three words: "2", "(x|*|-)", and "4":
$ search -p "2x4"
...
index 'xxx': query '2x4 ': returned 25 matches of 25 total in 0.000 sec
...
words:
1. '2x4': 25 documents, 25 hits
$ search -p "2(x|y)4"
...
index 'xxx': query '2(x|y)4 ': returned 0 matches of 0 total in 0.000 sec
words:
1. '2': 816 documents, 842 hits
2. 'x': 21 documents, 21 hits
3. 'y': 0 documents, 0 hits
4. '4': 2953 documents, 3014 hits
As an ugly hack I can do something like (2x4)|(2*4)|(2-4), but this is not a good solution if I get a big phrase like "2x4x2.2" and need "2(x|*|-)4(x|*|-)2(.|,)2".
I can use the "charset_table" option to define "*>x", "->x", ",>." and so on, but this is not a flexible solution.
Can you suggest a better approach?
PS: sorry for my English =)
From what I've read, Sphinx doesn't support regex searches. Moreover, while the extended syntax (enabled with the -e option) has operators that support alternatives (the "OR" operator: |) and sequencing (the strict order operator: <<), they only work on words, not atoms, so 2 << (x|*|-) << 4 would match strings where each element is a separate word, such as "2 x 4" or "2 * 4".
One option is to write a utility that converts a pattern of the form 2(x|*|-)4(x|*|-)2(.|,)2 (or, to follow the regex idiom, 2[-*x]4[-*x]2[.,]2) into a Sphinx extended query.
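A sketch of such a converter in Python (the function name is mine): expand every character class into its alternatives and OR the concrete variants together as quoted phrases in the extended syntax:

import itertools
import re

def expand_to_sphinx_query(pattern):
    # split into literal runs and [...] character classes
    parts = [p for p in re.split(r"(\[[^\]]+\])", pattern) if p]
    choices = [list(p[1:-1]) if p.startswith("[") else [p] for p in parts]
    variants = ("".join(combo) for combo in itertools.product(*choices))
    return "|".join('"%s"' % v for v in variants)

print(expand_to_sphinx_query("2[-*x]4"))
# "2-4"|"2*4"|"2x4"

For "2[-*x]4[-*x]2[.,]2" this expands to 18 alternatives, which is ugly but mechanical rather than hand-written.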
You can indeed use regular expressions with Sphinx.
While they cannot be used at search time, they can be used while building the index to identify a group of words/symbols that should be considered to be the same token.
http://sphinxsearch.com/docs/current.html#conf-regexp-filter
# index '13-inch' as '13inch'
regexp_filter = \b(\d+)\" => \1inch
# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color
Sphinx indexes whole words, 'tokenizing' each word into an integer that is then stored in the index. As such, regular expressions can't work, because the index doesn't contain the original words.
However, there is dict=keywords, which does store the words in an index. But right now this can only be used for * and ? wildcards; it doesn't support regular expressions.
Also, you could perhaps use the techniques discussed here:
http://swtch.com/~rsc/regexp/regexp4.html
This shows how generic regex searching can be implemented with a trigram index. Sphinx itself would work as the trigram index: you store the trigrams as keywords, which Sphinx then indexes, and Sphinx can run the boolean queries that such a system outputs.
(Normal Sphinx works pretty much as the "Indexed Word Search" section documents, so the trick would be using Sphinx as the backend for the indexed Reg-Ex Search.)
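The idea in miniature, as a toy Python sketch (not Sphinx itself; the documents and query are made up): index every document by its character trigrams, then answer a query by intersecting the posting sets of the query's trigrams:

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

# toy index: trigram -> set of document ids
docs = {1: "plank 2x4 pine", 2: "plank 2-4 oak"}
index = {}
for doc_id, text in docs.items():
    for t in trigrams(text):
        index.setdefault(t, set()).add(doc_id)

# a literal query becomes an AND over its trigrams; a regex becomes an
# AND/OR combination of the trigrams of its alternatives
query = "2x4"
candidates = set.intersection(*(index.get(t, set()) for t in trigrams(query)))
print(candidates)  # {1}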