How to build a proper H2O word2vec training_frame - word2vec

How do I build a H2O word2vec training_frame that distinguishes between different document/sentences etc.?
As far as I can tell from the very limited documentation I have found, you simply supply one long list of words, such as:
'This' 'is' 'the' 'first' 'This' 'is' 'number' 'two'
However, it would make sense to be able to distinguish between documents; ideally something like this:
Name | ID
This | 1
is | 1
the | 1
first | 1
This | 2
is | 2
number | 2
two | 2
Is that possible?

word2vec is a type of unsupervised learning: it turns string data into numbers. So to do classification you need a two-step process:
word2vec to turn strings into numbers
any supervised learning technique to turn those numbers into categories
The documentation contains links to a categorization example in each of R and Python. This tutorial shows the same process on a different data set (and there should be an H2O World 2017 video that goes with it).
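As a condensed sketch of that two-step flow in R (df, text, and label are hypothetical names, and the parameter values are illustrative, not tuned):
library(h2o)
h2o.init()
words <- h2o.tokenize(df$text, split = " ")  # one word per row, NA between sentences
w2v <- h2o.word2vec(words, epochs = 5)  # step 1: strings to vectors
vecs <- h2o.transform(w2v, words, aggregate_method = "AVERAGE")  # one averaged vector per sentence
train <- h2o.cbind(vecs, df["label"])
model <- h2o.gbm(x = names(vecs), y = "label", training_frame = train)  # step 2: supervised learning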
By the way, in your original example, you don't just supply the words; the sentences are separated by NA. If you give h2o.tokenize() a vector of sentences, it will make this format for you. So your example would actually be:
'This' 'is' 'the' 'first' NA 'This' 'is' 'number' 'two'
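You can check this yourself with something like the following (the output is shown as a comment; exact printing may vary by H2O version):
sentences <- as.character(as.h2o(c("This is the first", "This is number two")))
words <- h2o.tokenize(sentences, split = " ")
as.data.frame(words)
# one token per row: This, is, the, first, <NA>, This, is, number, two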

Redshift POSIX regex order does not matter

I'm querying data from AWS Redshift using POSIX regular expressions. However, I'm having difficulty matching whole strings that contain multiple words without considering their order.
The table is like this:
ID | full_term
123 | juice apple farm
123 | apple juice original
123 | banana juice
For example, I'm looking for a whole string that contains both apple and juice, so I expect to get the first two rows. My current query:
SELECT full_term FROM data_table
WHERE full_term ~ '(.*apple+)(.*juice+).*$'
However, the order does matter with this method. I also tried full_term ~ '(?=.*apple+)(?=.*juice+).*$' but got the error message [Amazon](500310) Invalid operation: Invalid preceding regular expression prior to repetition operator. The error occurred while parsing the regular expression fragment: '(?>>>HERE>>>=.*apple+)'. I have since realized that lookahead (?=) does not work in Redshift.
Is using a UDF the only solution in this case?
Also, I only want exact apple and juice tokens in full_term; that is to say, pineapple should not be included.
This is probably most clearly written as ANDed separate regular expression matches. To ensure that you don't match e.g. pineapple when looking for apple, you need to check that on either side of the search term is either a space character or the beginning/end of the line:
SELECT full_term FROM data_table
WHERE full_term ~ '(^|\\s)apple(\\s|$)'
AND full_term ~ '(^|\\s)juice(\\s|$)'
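Against the sample data, this returns the first two rows (juice apple farm and apple juice original) and excludes banana juice, which contains no standalone apple.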

How to re-determine column type in Stata

I had a column of numbers in Stata that was, however, read in as strings because it contained the string value "nan" for one of the numbers. I have since replaced this with a missing value, so the column now contains only numbers, albeit all in string format. What is the command to re-determine the type of the column?
Terminology: "columns" in Stata are always called variables.
Whether variables are numeric or string is in the first instance a matter of variable type, or storage type; a display format is then assigned on top of that. "Format" in Stata doesn't mean variable type.
With data like this
clear
input str5 stryit
"1"
"2"
"42"
"666"
"NAN"
end
There are several prudent rules.
Check to see what kinds of observations wouldn't produce numeric values if coerced:
tab stryit if missing(real(stryit))
If there are many such kinds, you might need to rethink the approach.
Always leave the original variable as it came unless and until you are sure that you no longer need it. So use destring with force if you like but generate a new variable. In your case that would be fine.
destring stryit, force gen(ntryit1)
Better than using force is to be explicit about your conversion rules. That leaves a record of what you did (assuming, naturally, that you keep a record of all commands used in any serious analysis). Here ignore() lists characters to be removed, so "NAN" is stripped to an empty string, which destring reads as missing:
destring stryit, ignore("NA") gen(ntryit2)
You can explicitly change problematic values before destring. An advantage of that, like the previous rule, is that you have a record of what you did.
clonevar stryit2 = stryit
replace stryit2 = "." if stryit2 == "NAN"
destring stryit2, gen(ntryit3)
Check to see that results make sense:
list
+------------------------------------------------+
| stryit ntryit2 ntryit1 stryit2 ntryit3 |
|------------------------------------------------|
1. | 1 1 1 1 1 |
2. | 2 2 2 2 2 |
3. | 42 42 42 42 42 |
4. | 666 666 666 666 666 |
5. | NAN . . . . |
+------------------------------------------------+
Disclaimer: original author of destring

R: grepl select first character of a string

I apologize in advance, this might be a repeat question. However, I just spent the last two hours on Stack Overflow and can't seem to find a solution.
I want to use grepl to detect rows that begin with a digit. This is what I tried, but it didn't give me the right answer:
grep.numeric=as.data.frame(grepl("^[:digit:]",df_mod$name))
I guess the problem is with the regular expression "^[:digit:]", but I couldn't figure it out.
UPDATE
My dataframe looks like this. It's huge, but below is an example:
ID mark name
1 whatever name product
2 whatever 10 product
3 whatever 250 product
4 another_mark other product
I want to detect products whose names begin with a number.
UPDATE 2
Applying grep.numeric=grepl("^[[:digit:]]",df_mod$name) to the example above gives me the right answer, which is:
grep.numeric
[1] FALSE TRUE TRUE FALSE
But what drives me crazy is when I apply this function to my real dataframe:
grep.numeric=grepl("^[[:digit:]]",df_mod[217,]$nom)
it gives me this result:
grep.numeric
[1] FALSE
But what I actually have is this:
df_mod[217,]$nom
[1] 100 lipo 30 gélules
Please help me.
Apparently, some of your values have leading spaces, so you could either modify your regex to allow for them (or something similar):
grepl("^\\s*[[:digit:]]", df_mod$name)
Or use the built-in trimws function:
grepl("^[[:digit:]]", trimws(df_mod$name))

Using regexp with Sphinx

I need to make an algorithm that allows me to use uncertain (regexp) search in Sphinx.
For example, I need to find a phrase that contains uncertain symbols: "2x4" may look like "2x4" or "2*4" or "2-4".
I want to do something like this: "2(x|*|-)4". But if I try to use this construction in a query, Sphinx splits it into three words: "2", "(x|*|-)" and "4":
$ search -p "2x4"
...
index 'xxx': query '2x4 ': returned 25 matches of 25 total in 0.000 sec
...
words:
1. '2x4': 25 documents, 25 hits
$ search -p "2(x|y)4"
...
index 'xxx': query '2(x|y)4 ': returned 0 matches of 0 total in 0.000 sec
words:
1. '2': 816 documents, 842 hits
2. 'x': 21 documents, 21 hits
3. 'y': 0 documents, 0 hits
4. '4': 2953 documents, 3014 hits
As an ugly hack I can do something like (2x4)|(2*4)|(2-4), but this is not a good solution if I get a big phrase like "2x4x2.2" and need "2(x|*|-)4(x|*|-)2(.|,)2".
I can use the charset_table option to define "*>x", "->x", ",>." and so on, but this is not a flexible solution.
Can you find a better solution?
PS: sorry for my English =)
From what I've read, Sphinx doesn't support regex searches. Moreover, while the extended syntax (enabled with the -e option) has operators that support alternatives (the "OR" operator: |) and sequencing (the strict order operator: <<), these only work on words, not atoms, so 2 << (x|*|-) << 4 would match strings where each element is a separate word, such as '2 x 4' or '2 * 4'.
One option is to write a utility that converts a pattern of the form 2(x|*|-)4(x|*|-)2(.|,)2 (or, to follow the regex idiom, 2[-*x]4[-*x]2[.,]2) into a Sphinx extended query.
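A rough sketch of such a utility in Python (the helper name is hypothetical, and how each expanded variant tokenizes still depends on your charset_table settings):
import itertools, re

def to_sphinx_query(pattern):
    # Split "2[-*x]4[-*x]2[.,]2" into literal runs and character classes
    parts = re.split(r'\[([^\]]+)\]', pattern)
    # Odd positions hold class contents; expand each class into its characters
    choices = [list(p) if i % 2 else [p] for i, p in enumerate(parts)]
    variants = (''.join(combo) for combo in itertools.product(*choices))
    # OR together exact phrases in Sphinx extended query syntax
    return ' | '.join('"%s"' % v for v in variants)

print(to_sphinx_query('2[-*x]4[-*x]2[.,]2'))
# "2-4-2.2" | "2-4-2,2" | "2-4*2.2" | ... (18 variants in total)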
You can indeed use regular expressions with Sphinx.
While they cannot be used at search time, they can be used while building the index to identify a group of words/symbols that should be considered to be the same token.
http://sphinxsearch.com/docs/current.html#conf-regexp-filter
# index '13-inch' as '13inch'
regexp_filter = \b(\d+)\" => \1inch
# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color
Sphinx indexes whole words, 'tokenizing' each word into an integer that is then stored in the index. As such, regular expressions can't work at search time because the index no longer has the original words.
However, there is dict=keywords, which does store the words in the index. But right now that can only be used for * and ? wildcards; it doesn't support regular expressions.
Also, you could perhaps use the techniques discussed here:
http://swtch.com/~rsc/regexp/regexp4.html
This shows how generic regex searching can be implemented with a trigram index. Sphinx itself would work as the trigram index: you store the trigrams as keywords, which Sphinx then indexes, and Sphinx can run the boolean queries that that system outputs. (Normal Sphinx works pretty much as the 'Indexed Word Search' section documents, so the trick would be using Sphinx as the backend for the indexed regex search.)

Shorter REGEXP for MySQL query

I want to do a MySQL query to get the following effect:
table_column [varchar]
-----------------------
1|5|7
25
55|12
5
3&5
5|11
I want a reliable way to get all the values where 5 is the complete value.
So, for example, if I do a REGEXP query for the number 5 on the table above, I would like to get all rows except the ones containing "25" and "55|12".
This is the best I've come up with so far:
[^[:digit:]]5[^[:digit:]] | [^[:digit:]]5 | 5[^[:digit:]] | ^5$
Is there a shorter way?
Thanks.
Try using word boundaries:
[[:<:]]5[[:>:]]
If word boundaries are not available, the equivalent check is that the 5 has no digit adjacent on either side:
(^|[^[:digit:]])5([^[:digit:]]|$)
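For instance (my_table is a hypothetical name; note that MySQL 8.0.4+ switched to the ICU regex library, where [[:<:]] and [[:>:]] are not supported and \\b is used instead):
SELECT table_column
FROM my_table
WHERE table_column REGEXP '[[:<:]]5[[:>:]]';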