Missing words in word2vec vocabulary - word2vec

I am training word2vec on my own text corpus using Mikolov's implementation from here. Not all unique words from the corpus get a vector, even though I have set the min-count to 1. Are there any parameters I may have missed that might be the reason not all unique words get a vector? What else might be the reason?
To test word2vec's behavior I have written the following script, providing a text file with 20058 sentences and 278896 words (all words and punctuation are space-separated and there is one sentence per line).
import subprocess

def get_w2v_vocab(path_embs):
    vocab = set()
    with open(path_embs, 'r', encoding='utf8') as f:
        next(f)
        for line in f:
            word = line.split(' ')[0]
            vocab.add(word)
    return vocab - {'</s>'}

def train(path_corpus, path_embs):
    subprocess.call(["./word2vec", "-threads", "6", "-train", path_corpus,
                     "-output", path_embs, "-min-count", "1"])

def get_unique_words_in_corpus(path_corpus):
    vocab = []
    with open(path_corpus, 'r', encoding='utf8') as f:
        for line in f:
            vocab.extend(line.strip('\n').split(' '))
    return set(vocab)

def check_equality(expected, actual):
    if not expected == actual:
        diff = len(expected - actual)
        raise Exception('Not equal! Vocab expected: {}, Vocab actual: {}, Diff: {}'.format(len(expected), len(actual), diff))
    print('Expected vocab and actual vocab are equal.')

def main():
    path_corpus = 'test_corpus2.txt'
    path_embs = 'embeddings.vec'
    vocab_expected = get_unique_words_in_corpus(path_corpus)
    train(path_corpus, path_embs)
    vocab_actual = get_w2v_vocab(path_embs)
    check_equality(vocab_expected, vocab_actual)

if __name__ == '__main__':
    main()
This script gives me the following output:
Starting training using file test_corpus2.txt
Vocab size: 33651
Words in train file: 298954
Alpha: 0.000048 Progress: 99.97% Words/thread/sec: 388.16k Traceback (most recent call last):
File "test_w2v_behaviour.py", line 44, in <module>
main()
File "test_w2v_behaviour.py", line 40, in main
check_equality(vocab_expected, vocab_actual)
File "test_w2v_behaviour.py", line 29, in check_equality
raise Exception('Not equal! Vocab expected: {}, Vocab actual: {}, Diff: {}'.format(len(expected), len(actual), diff))
Exception: Not equal! Vocab expected: 42116, Vocab actual: 33650, Diff: 17316

As long as you're using Python, you might want to use the Word2Vec implementation in the gensim package. It does everything the original Mikolov/Google word2vec.c does, and more, and is usually performance-competitive.
In particular, it won't have any issues with UTF-8 encoding – while I'm not sure the Mikolov/Google word2vec.c handles UTF-8 correctly. And that may be a source of your discrepancy.
If you need to get to the bottom of your discrepancy, I would suggest:
have your get_unique_words_in_corpus() also tally/report the total number of non-unique words its tokenization creates (see the sketch after these suggestions). If that's not the same as the 298954 reported by word2vec.c, then the two processes are clearly not working from the same baseline understanding of what 'words' are in the source file.
find some words, or at least one representative word, that your token-count expects to be in the final model, and isn't. Review those for any common characteristic – including in context in the file. That will probably reveal why the two tallies differ.
Again, I suspect something UTF-8 related, or perhaps related to other implementation limits in word2vec.c (such as a maximum word-length) that are not mirrored in your Python-based word tallies.
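For the first suggestion, a minimal sketch could look like this (it assumes the same test_corpus2.txt and the same whitespace tokenization your script already uses; diagnose() is just a hypothetical helper, not part of your code):
def diagnose(path_corpus, vocab_actual):
    # Tally total and unique whitespace-separated tokens the same way
    # get_unique_words_in_corpus() does, then compare against the
    # vocabulary that word2vec.c actually wrote out.
    total_tokens = 0
    unique_tokens = set()
    with open(path_corpus, 'r', encoding='utf8') as f:
        for line in f:
            tokens = line.strip('\n').split(' ')
            total_tokens += len(tokens)
            unique_tokens.update(tokens)
    print('Total tokens (Python count):', total_tokens)       # compare with "Words in train file"
    print('Unique tokens (Python count):', len(unique_tokens))
    missing = sorted(unique_tokens - vocab_actual)
    print('Sample of missing words:', missing[:20])
    print('Longest missing word length:', max((len(w) for w in missing), default=0))
If the totals already disagree, the two tokenizations differ; if they agree, the sample of missing words (their lengths, encodings, surrounding context) should point at the culprit.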

You could use FastText instead of Word2Vec. FastText is able to embed out-of-vocabulary words by looking at subword information (character ngrams). Gensim also has a FastText implementation, which is very easy to use:
from gensim.models import FastText as ft
model = ft(sentences=training_data,)
word = 'blablabla' # can be out of vocabulary
embedded_word = model[word] # fetches the word embedding
See https://stackoverflow.com/a/54709303/3275464
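For completeness, a slightly fuller sketch of the gensim route (parameter names assume gensim 4.x, where the old size argument became vector_size; training_data stands in for your own list of tokenized sentences):
from gensim.models import FastText

# training_data: a list of tokenized sentences, e.g.
# [['the', 'cat', 'sat'], ['on', 'the', 'mat'], ...]
model = FastText(sentences=training_data, vector_size=100, window=5, min_count=1)

print(model.wv['cat'])        # in-vocabulary word
print(model.wv['blablabla'])  # out-of-vocabulary word, assembled from character n-grams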

Multiple output files created but empty

I am trying to split one file with two articles in it into two separate files with one article in each, for subsequent analysis of the articles. Each article in the initial file has an ID that I want to use to separate the files with, using RE.
Below is the initial input file, with ID number:
166068619 #### "Epilepsy: let's end our ignorance of this neglected condition
Helen Stephens is a young woman with epilepsy [...]."
106899978 #### "Great British Payoff shows that BBC governance is broken
If it was a television series, they'd probably call it [...]."
However, when I run my code, I do get two separate files as an output but they are empty.
This is my code:
def file_split(path_to_file):
    """Function splits bigger file into N smaller ones, based on a certain RE
    match, that is used to break the bigger file into smaller ones"""
    def pattern_extract(path_to_file):
        """Function identifies the number of RE occurences in a file,
        No. can be used in further analysis as range No."""
        import re
        x = []
        with open(path_to_file) as f:
            for line in f:
                match = re.search(r'^\d+?\t####\t', line)
                if match:
                    a = match.group()
                    x.append(a)
        return len(x)

    y = pattern_extract(path_to_file)
    m = y + 1
    files = [open('filename%i.txt' % i, 'w') for i in range(1, m)]
    with open(path_to_file) as f:
        for line in f:
            match = re.search(r'^\d+?\t####\t', line)
            if match:
                a = match.group()
                #files = [open('filename%i.txt' %i, 'w') for i in range(1, m)]
                files[i-1].write(a)
    for f in files:
        f.close()
    return files
Output result is as follows:
file_split(path)
Out[19]:
[<open file 'filename1.txt', mode 'w' at 0x7fe121b130c0>,
<open file 'filename2.txt', mode 'w' at 0x7fe121b131e0>]
I am new to Python and I am not quite sure where the problem lies. I checked some other answers that addressed the multiple file outputs but cannot figure out the solution. Help would be very much appreciated.
There are two problems with your code:
you write only the line matching the ID (actually, just the match itself), not the rest
you are always writing to the last file, as you use i, the loop variable "left over" from the list comprehension
To fix it, you could change the lower portion of your code to this:
y = pattern_extract(path_to_file)
files = [open('filename%i.txt' % i, 'w') for i in range(y)]
n = -1
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            n += 1
        files[n].write(line)
But you do not have to read the file twice just to count the matches: simply open another file whenever a line matches an ID line, write directly to the last file in the list, and close all the files at the end.
open_files = []
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            open_files.append(open('filename%d.txt' % len(open_files), 'w'))
        open_files[-1].write(line)
for f in open_files:
    f.close()

Writing multiple header lines in pandas.DataFrame.to_csv

I am putting my data into NASA's ICARTT format for archival. This is a comma-separated file with multiple header lines, and it has commas in the header lines. Something like:
46, 1001
lastname, firstname
location
instrument
field mission
1, 1
2011, 06, 21, 2012, 02, 29
0
Start_UTC, seconds, number_of_seconds_from_0000_UTC
14
1, 1
-999, -999
measurement name, units
measurement name, units
column1 label, column2 label, column3 label, column4 label, etc.
I have to make a separate file for each day that data were collected, so I will end up creating around thirty files in all. When I create a csv file via pandas.DataFrame.to_csv I cannot (as far as I know) simply write the header lines to the file before writing the data, so I have had to trick it into doing what I want via
# assuming <df> is a pandas dataframe
df.to_csv('dst.ict', na_rep='-999', header=True, index=True, index_label=header_lines)
where "header_lines" is the header string
What this gives me is exactly what I want, except "header_lines" is bracketed by double quotes. Is there any way to write text to the head of a csv file using to_csv, or to remove the double quotes? I have already tried setting quotechar='' and doublequote=False in to_csv(), but the double quotes still come up.
What I am doing now (and it works for now, but I would like to move to something better) is simply opening a file via open('dst.ict','w') and printing to that line by line, which is quite slow.
You can, indeed, just write the header lines before the data. pandas.DataFrame.to_csv takes a path_or_buf as its first argument, not just a pathname:
pandas.DataFrame.to_csv(path_or_buf, *args, **kwargs)
path_or_buf : string or file handle, default None
File path or object, if None is provided the result is returned as a string.
Here's an example:
#!/usr/bin/python2
import pandas as pd
import numpy as np
import sys

# Make an example data frame.
df = pd.DataFrame(np.random.randint(100, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

header = '\n'.join(
    # I like to make sure the header lines are at least utf8-encoded.
    [unicode(line, 'utf8') for line in
        ['1001',
         'Daedalus, Stephen',
         'Dublin, Ireland',
         'Keys',
         'MINOS',
         '1,1',
         '1904,06,16,1922,02,02',
         'time_since_8am',  # Ends up being the header name for the index.
         ]]
)

with open(sys.argv[1], 'w') as ict:
    # Write the header lines, including the index variable for
    # the last one if you're letting Pandas produce that for you
    # (see above).
    ict.write(header)
    # Just write the data frame to the file object instead of
    # to a filename. Pandas will do the right thing and realize
    # it's already been opened.
    df.to_csv(ict)
The result is just what you wanted: the header lines first, followed by the output of .to_csv():
$ python example.py test && cat test
1001
Daedalus, Stephen
Dublin, Ireland
Keys to the tower
MINOS
1, 1
1904, 06, 16, 1922, 02, 02
time_since_8am,a,b,c,d,e
0,67,85,66,18,32
1,47,4,41,82,84
2,24,50,39,53,13
3,49,24,17,12,61
4,91,5,69,2,18
Sorry if this is too late to be useful. I work in archiving these files (and use Python), so feel free to drop me a line if you have future questions.
Even though it's been some years and ndt's answer is quite nice, another possibility is to write the header first and then append the DataFrame with to_csv() using mode='a' (append):
# write the header
header = '46, 1001\nlastname, firstname\n,...'
with open('test.csv', 'w') as fp:
    fp.write(header)

# write the rest
df.to_csv('test.csv', header=True, mode='a')
It's maybe less efficient due to the two write operations, though...
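If the two write operations bother you, a small sketch that combines this with ndt's approach is to open the file once, write the header to the handle, and pass that same handle to to_csv() (header and df are assumed to be defined as above):
with open('test.csv', 'w') as fp:
    fp.write(header)          # header lines first
    df.to_csv(fp, header=True)  # pandas appends to the already-open handle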

ex 25 LPTHW (global name pop is not defined)

I'm currently learning how to code with Python, following exercise 25 of 'Learn Python the Hard Way'.
The problem is that I can't complete exercise 25 because of an error I can't figure out.
I'm typing into the Python console, but at instruction number 8, ex25.print_last_word(words), I get this error:
>>> ex25.print_last_word(words)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "ex25.py", line 19, in print_last_word
word = words.pop(-1)
NameError: global name 'POP' is not defined
This is my code:
def break_words(stuff):
    """This function will break up words for us, basically splitting
    on the blank spaces between the words."""
    words = stuff.split(' ')
    return words

def sort_words(words):
    '''Sorts the words.'''
    return sorted(words)

def print_first_word(words):
    '''Prints the first word after popping it off, i.e. pop(0) takes
    the first word of the list.'''
    word = words.pop(0)
    print word

def print_last_word(words):
    '''Prints the last word after popping it off.'''
    word = words.pop(-1)
    print word

def sort_sentence(sentence):
    '''Takes in a full sentence and returns the sorted words.'''
    words = break_words(sentence)
    words = break_words(words)

def print_first_and_last(sentence):
    '''Prints the first and the last words of the sentence.'''
    words = break_words(sentence)
    print_first_word(words)
    print_last_word(words)

def print_first_and_last_sorted(sentence):
    '''Sorts the words then prints the first and last one.'''
    word = sort_sentence(sentence)
    print_first_word(words)
    print_last_word(words)
The error raised by the Python interpreter does not match the code you posted, since POP is never mentioned in your code.
The error might be an indication that the interpreter has in memory a different definition for the module ex25 than what is in your text file, ex25.py. You can refresh the definition using
>>> reload(ex25)
Note that you must do this every time you modify ex25.py.
For this reason, you may find it easier to modify ex25.py so that it can be run from the command-line by adding
if __name__ == '__main__':
    words = ...
    print_last_word(words)
to the end of ex25.py, and running the script from the command-line:
python ex25.py
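For example, the block at the end of ex25.py might look like this (the sample sentence is just a placeholder):
if __name__ == '__main__':
    sentence = "All good things come to those who wait."
    words = break_words(sentence)
    print_first_word(words)
    print_last_word(words)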

How to split tokens, count number of tokens, and write in a file in python?

I have a file which has data in lines as follows:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour', 'The Smashing Pumpkins', 'Warner Bros. Entertainment','This is a good Beer']
['Voices Inside', 'Expressivista', 'The Kentucky Fried Movie', 'The Bridges of Madison County']
and so on. I want to re-write the data into a file whose lines keep only the tokens with fewer than 3 words (or some other number), e.g.:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour']
['Voices Inside', 'Expressivista']
This is what I have tried so far:
for line in open(file):
    line = line.strip()
    line = line.rstrip()
    prog = re.compile("([a-z0-9]){32}")
    if line:
        line = line.replace('"', '')
        line = line.split(",")
        if re.match(prog, line[0]) and len(line) > 2:
            wo = []
            for words in line:
                word = words.split()
                if len(word) < 3:
                    print word.append(word)
But the output says None. Any thoughts on where I am making a mistake?
A better way to do what you're doing is to use ast.literal_eval, which automagically converts string representations of Python objects (e.g. lists) into actual Python objects.
import ast

# raw data
data = """
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour', 'The Smashing Pumpkins', 'Warner Bros. Entertainment','This is a good Beer']
['Voices Inside', 'Expressivista', 'The Kentucky Fried Movie', 'The Bridges of Madison County']
"""

# set threshold number of words per token
threshold = 3

# split into lines
lines = data.split('\n')

# parse non-blank lines into python lists
lists = [ast.literal_eval(line) for line in lines if line]

# for each list, keep only those tokens with fewer than `threshold` words
result = [[item for item in lst if len(item.split()) < threshold]
          for lst in lists]

# show result
for line in result:
    print(line)
Result:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour']
['Voices Inside', 'Expressivista']
I think the reason your code isn't working is that you're trying to match line[0] against your regex prog, but line[0] doesn't start with 32 consecutive lowercase letters or digits for either of your lines, so the regex never matches and nothing gets printed.
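If you want to apply the same ast.literal_eval idea directly to your files rather than an in-memory string, a minimal sketch could look like this (input.txt and output.txt are placeholder names, and every non-blank line is assumed to be a valid Python list literal):
import ast

threshold = 3

with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        tokens = ast.literal_eval(line)   # parse the list literal
        kept = [t for t in tokens if len(t.split()) < threshold]
        dst.write(str(kept) + '\n')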

Python 2.7.3: Search/Count txt file for string, return full line with final occurrence of that string

I'm trying to create a WiFi Log Scanner. Currently we go through logs manually using CTRL+F and our keywords. I just want to automate that process. i.e. bang in a .txt file and receive an output.
I've got the bones of the code and can work on making it pretty later, but I'm running into a small issue. I want the scanner to search the file (done), count instances of that string (done) and output the number of occurrences (done), followed by the full line where that string occurred last, including the line number (the line number is not essential, it just makes it easier to guesstimate which is the more recent issue if there are multiple).
Currently I'm getting an output of every line with the string in it. I know why this is happening; I just can't think of a way to output only the last line.
Here is my code:
import os
from Tkinter import Tk
from tkFileDialog import askopenfilename

def file_len(filename):
    # Count Number of Lines in File and Output Result
    with open(filename) as f:
        for i, l in enumerate(f):
            pass
    print('There are ' + str(i+1) + ' lines in ' + os.path.basename(filename))

def file_scan(filename):
    # All Issues to Scan will go here
    print ("DHCP was found " + str(filename.count('No lease, failing')) + " time(s).")
    for line in filename:
        if 'No lease, failing' in line:
            print line.strip()
    DNS = (filename.count('Host name lookup failure:res_nquery failed') + filename.count('HTTP query failed'))/2
    print ("DNS Failure was found " + str(DNS) + " time(s).")
    for line in filename:
        if 'Host name lookup failure:res_nquery failed' or 'HTTP query failed' in line:
            print line.strip()
    print ("PSK= was found " + str(testr.count('psk=')) + " time(s).")
    for line in ln:
        if 'psk=' in line:
            print 'The length(s) of the PSK used is ' + str(line.count('*'))

Tk().withdraw()
filename = askopenfilename()
abspath = os.path.abspath(filename)  # So that doesn't matter if File in Python Dir
dname = os.path.dirname(abspath)     # So that doesn't matter if File in Python Dir
os.chdir(dname)                      # So that doesn't matter if File in Python Dir
print ('Report for ' + os.path.basename(filename))
file_len(filename)
file_scan(filename)
That's pretty much going to be my working code (I just have to add a few more issue searches). I have a version that searches a string instead of a text file here. This outputs the following:
Total Number of Lines: 38
DHCP was found 2 time(s).
dhcp
dhcp
PSK= was found 2 time(s).
The length(s) of the PSK used is 14
The length(s) of the PSK used is 8
I only have general stuff there, modified for it being a string rather than txt file, but the string I'm scanning from will be what's in the txt files.
Don't worry too much about PSK; I want all examples of that listed. I'll see if I can tidy them up into one line at a later stage.
As a side note, a lot of this is jumbled together from doing previous searches, so I have a good idea that there are probably neater ways of doing this. This is not my current concern, but if you do have a suggestion on this side of things, please provide an explanation/link to explanation as to why your way is better. I'm fairly new to python, so I'm mainly dealing with stuff I currently understand. :)
Thanks in advance for any help, if you need any further info, please let me know.
Joe
To search for and count string occurrences, I solved it in the following way:
'''---------------------Function--------------------'''
# Counting the "string" occurrence in a file
def count_string_occurrence():
    string = "test"
    f = open("result_file.txt")
    contents = f.read()
    f.close()
    # we are searching for the "test" string in the file "result_file.txt"
    print "Number of '" + string + "' in file", contents.count(string)
I can't comment on questions yet, but I think I can answer more specifically with some more information: which line do you want only one of?
For example, you can do something like:
search_str = 'find me'
count = 0
for line in file:
    if search_str in line:
        last_line = line
        count += 1
print '{0} occurrences of this line:\n{1}'.format(count, last_line)
I notice that in file_scan you are iterating twice through file. You can surely condense it into one iteration :).
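A sketch of what that single pass might look like, tracking a count and the last matching line for several search strings at once (scan_file is a hypothetical helper, and the search strings are just the ones from your file_scan):
def scan_file(path, search_strings):
    # One pass over the log: count occurrences and remember the
    # last matching line (with its line number) for each string.
    counts = {s: 0 for s in search_strings}
    last_lines = {s: None for s in search_strings}
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            for s in search_strings:
                if s in line:
                    counts[s] += 1
                    last_lines[s] = (line_no, line.strip())
    return counts, last_lines

counts, last_lines = scan_file(filename, ['No lease, failing', 'HTTP query failed', 'psk='])
for s in counts:
    print '%s was found %d time(s).' % (s, counts[s])
    if last_lines[s] is not None:
        print 'Last occurrence (line %d): %s' % last_lines[s]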