Trouble running gensim Word2Vec - word2vec

I am trying to train word embeddings (word2vec) on my own dataset using the gensim library.
model = Word2Vec(sentences=alp[:20], size=100, window=6, min_count=5)
where alp is a list of lists containing the tokens of the individual sentences in my corpus.
I get the following error whenever I try to train the w2v model. Please help.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 867, in worker_loop
tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 785, in _do_train_job
tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
File "gensim/models/word2vec_inner.pyx", line 458, in gensim.models.word2vec_inner.train_batch_cbow (./gensim/models/word2vec_inner.c:5642)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 867, in worker_loop
tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 785, in _do_train_job
tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
File "gensim/models/word2vec_inner.pyx", line 458, in gensim.models.word2vec_inner.train_batch_cbow (./gensim/models/word2vec_inner.c:5642)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The problem was solved by type-casting alp into a list of lists.

The above code works perfectly for me. Can you verify the type of alp[:20]?
Working code (tested in gensim version 3.4.0):
from gensim.models.word2vec import Word2Vec
model = Word2Vec(sentences=alp[0:20], size=100, window=6, min_count=5)
alp looks like the following:
alp = [['this','is','first','sentence'],
['this','is','second','sentence'],
[..],
[..],
[..]]
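For completeness, the type cast mentioned above can be as simple as forcing alp into a plain list of lists of strings before training. A minimal sketch, assuming alp currently comes out of something like a NumPy array or pandas Series of token arrays (a likely way to hit the ambiguous-truth-value error):
from gensim.models.word2vec import Word2Vec

# Cast each sentence to a plain Python list of tokens, then train as before.
alp = [list(tokens) for tokens in alp]
model = Word2Vec(sentences=alp[:20], size=100, window=6, min_count=5)  # gensim 3.x; gensim 4+ renamed size to vector_size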

How to install `distro-info===0.18ubuntu0.18.04.1`?

Trying to modernize an old Django project (2.2), and its requirements.txt (generated via pip freeze) has some lines that make pip install throw fits:
distro-info===0.18ubuntu0.18.04.1
I interpreted the errors I got for the first one (see the error output in its entirety at the bottom) as the version string not conforming to PEP-518, but it doesn't even mention the === operator. This SO thread, What are triple equal signs and ubuntu2 in Python pip freeze?, has a similar issue, but:
The error they got is different (a ValueError as opposed to my ParseError).
The solution was to upgrade pip, but I'm already using the latest one.
Now, pip install distro-info works, so should I just go with that?
Update: the project I'm trying to modernize was conceived around 2020, and according to the PyPI history of distro-info, it had a 0.10 release in 2013 and a 1.0 release in 2021. Could this have anything to do with the weird pip freeze output? (From this PyPI support issue.)
The error:
ERROR: Exception:
Traceback (most recent call last):
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3021, in _dep_map
return self.__dep_map
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2815, in __getattr__
raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/packaging/requirements.py", line 102, in __init__
req = REQUIREMENT.parseString(requirement_string)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pyparsing/core.py", line 1141, in parse_string
raise exc.with_traceback(None)
pip._vendor.pyparsing.exceptions.ParseException: Expected string_end, found '(' (at char 12), (line:1, col:13)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3101, in __init__
super(Requirement, self).__init__(requirement_string)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/packaging/requirements.py", line 104, in __init__
raise InvalidRequirement(
pip._vendor.packaging.requirements.InvalidRequirement: Parse error at "'(===0.18'": Expected string_end
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 160, in exc_logging_wrapper
status = run_func(*args)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 247, in wrapper
return func(self, options, args)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 400, in run
requirement_set = resolver.resolve(
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
result = self._result = resolver.resolve(
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 481, in resolve
state = resolution.resolve(requirements, max_rounds=max_rounds)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 373, in resolve
failure_causes = self._attempt_to_pin_criterion(name)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 213, in _attempt_to_pin_criterion
criteria = self._get_updated_criteria(candidate)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 203, in _get_updated_criteria
for requirement in self._p.get_dependencies(candidate=candidate):
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/provider.py", line 237, in get_dependencies
return [r for r in candidate.iter_dependencies(with_requires) if r is not None]
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/provider.py", line 237, in <listcomp>
return [r for r in candidate.iter_dependencies(with_requires) if r is not None]
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 247, in iter_dependencies
requires = self.dist.iter_dependencies() if with_requires else ()
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_internal/metadata/pkg_resources.py", line 216, in iter_dependencies
return self._dist.requires(extras)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2736, in requires
dm = self._dep_map
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3023, in _dep_map
self.__dep_map = self._compute_dependencies()
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3033, in _compute_dependencies
reqs.extend(parse_requirements(req))
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3094, in parse_requirements
yield Requirement(line)
File "/home/old-django-project/.venv/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3103, in __init__
raise RequirementParseError(str(e))
pip._vendor.pkg_resources.RequirementParseError: Parse error at "'(===0.18'": Expected string_end
It looks like that version of the library was discontinued. On PyPI, in fact, I can see only 1.0 and 0.10. If you need that specific version, then you need to set up a manual installation, downloading the source here. Alternatively, you can upgrade to a current version and fix any problems that come up afterwards.
In case you need to dockerize your app, setting up a script for the manual installation of a library is simple.
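A rough Python sketch of such a script (the archive path is a placeholder, not the real location of the distro-info sources):
import subprocess
import sys

# Hypothetical helper: install a pinned package from a source archive you
# fetched yourself instead of from PyPI.
def install_from_source(archive):
    subprocess.check_call([sys.executable, "-m", "pip", "install", archive])

if __name__ == "__main__":
    # Placeholder path -- replace with the actual distro-info source archive.
    install_from_source("./distro-info-0.18ubuntu0.18.04.1.tar.gz")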

Pipeline will fail on GCP when writing tensorflow transform metadata

I hope somebody here can help. I've been googling this error like crazy but haven't found anything.
I have a pipeline that works perfectly when executed locally but it fails when executed on GCP. The following are the error messages that I get.
Workflow failed. Causes: S03:Write transform fn/WriteMetadata/ResolveBeamFutures/CreateSingleton/Read+Write transform fn/WriteMetadata/ResolveBeamFutures/ResolveFutures/Do+Write transform fn/WriteMetadata/WriteMetadata failed., A work item was attempted 4 times without success.
Each time the worker eventually lost contact with the service. The work item was attempted on:
Traceback (most recent call last): File "preprocess.py", line 491,
in
main() File "preprocess.py", line 487, in main
transform_data(args,pipeline_options,runner) File "preprocess.py", line 451, in transform_data
eval_data |= 'Identity eval' >> beam.ParDo(Identity()) File "/Library/Python/2.7/site-packages/apache_beam/pipeline.py", line 335,
in exit
self.run().wait_until_finish() File "/Library/Python/2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py",
line 897, in wait_until_finish
(self.state, getattr(self._runner, 'last_error_msg', None)), self) apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException:
Dataflow pipeline failed. State: FAILED, Error: Traceback (most recent
call last): File
"/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py",
line 582, in do_work
work_executor.execute() File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py",
line 166, in execute
op.start() File "apache_beam/runners/worker/operations.py", line 294, in apache_beam.runners.worker.operations.DoOperation.start
(apache_beam/runners/worker/operations.c:10607)
def start(self): File "apache_beam/runners/worker/operations.py", line 295, in
apache_beam.runners.worker.operations.DoOperation.start
(apache_beam/runners/worker/operations.c:10501)
with self.scoped_start_state: File "apache_beam/runners/worker/operations.py", line 300, in
apache_beam.runners.worker.operations.DoOperation.start
(apache_beam/runners/worker/operations.c:9702)
pickler.loads(self.spec.serialized_fn)) File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py",
line 225, in loads
return dill.loads(s) File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in
loads
return load(file) File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in
load
obj = pik.load() File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatchkey File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
obj = cls.new(cls, *args) TypeError: new() takes exactly 4 arguments (1 given)
Any ideas??
Thanks,
Pedro
If the pipeline works locally but fails on GCP, it's possible that you're running into a version mismatch.
What TF, tf.Transform, and Beam versions are you running locally and on GCP?
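One quick way to answer that is to print the installed versions in both environments and diff the output. A minimal sketch (the distribution names are the usual PyPI ones and may need adjusting to your setup):
import pkg_resources

# Run this locally and on a Dataflow worker (e.g. from a logging step) and compare.
for dist in ("tensorflow", "tensorflow-transform", "apache-beam"):
    try:
        print(dist, pkg_resources.get_distribution(dist).version)
    except pkg_resources.DistributionNotFound:
        print(dist, "not installed")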

nltk lookup error in Stanford Neural Dependency Parser

I am trying to use the Stanford Neural Dependency Parser provided by nltk. The problem I'm having is that when I call st = nltk.parse.stanford.StanfordNeuralDependencyParser(), I get the following error:
>>> st = nltk.parse.stanford.StanfordNeuralDependencyParser()
Traceback (most recent call last):
File "C:\Users\<user>\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-ca2dec4f3c1f>", line 1, in <module>
st = nltk.parse.stanford.StanfordNeuralDependencyParser()
File "C:\Users\<user>\Anaconda2\lib\site-packages\nltk\parse\stanford.py", line 378, in __init__
super(StanfordNeuralDependencyParser, self).__init__(*args, **kwargs)
File "C:\Users\<user>\Anaconda2\lib\site-packages\nltk\parse\stanford.py", line 51, in __init__
key=lambda model_name: re.match(self._JAR, model_name)
File "C:\Users\<user>\Anaconda2\lib\site-packages\nltk\internals.py", line 714, in find_jar_iter
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
LookupError:
===========================================================================
NLTK was unable to find stanford-corenlp-(\d+)(\.(\d+))+\.jar! Set
the CLASSPATH environment variable.
For more information, on stanford-corenlp-(\d+)(\.(\d+))+\.jar, see:
<http://nlp.stanford.edu/software/lex-parser.shtml>
===========================================================================
But, when I run os.environ.get('CLASSPATH') I get the result
`C:\nltk_data\;C:\nltk_data\stanford\;C:\nltk_data\stanford\stanford-ner\`
I know that I have the corenlp jar file in C:\nltk_data\stanford\ so I run the following and end up with a slightly different error.
>>> st = nltk.parse.stanford.StanfordNeuralDependencyParser('C:\\nltk_data\\stanford\\')
Traceback (most recent call last):
File "C:\Users\<user>\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-22-28d797d702d9>", line 1, in <module>
st = StanfordNeuralDependencyParser('C:\\nltk_data\\stanford\\')
File "C:\Users\<user>\Anaconda2\lib\site-packages\nltk\parse\stanford.py", line 378, in __init__
super(StanfordNeuralDependencyParser, self).__init__(*args, **kwargs)
File "C:\Users\<user>\Anaconda2\lib\site-packages\nltk\parse\stanford.py", line 51, in __init__
key=lambda model_name: re.match(self._JAR, model_name)
File "C:\Users\<user>\Anaconda2\lib\site-packages\nltk\internals.py", line 635, in find_jar_iter
(name_pattern, path_to_jar))
LookupError: Could not find stanford-corenlp-(\d+)(\.(\d+))+\.jar jar file at C:\nltk_data\stanford\
I have downloaded the jar stanford-english-corenlp-2016-01-10-models.jar from the Stanford NLP website and also renamed it to stanford-corenlp-2016-01-10.jar to try to match the pattern, but I still end up with the same errors. I have also downloaded the Stanford Parser version 3.6.0, but it doesn't contain any corenlp files.
Is there any way to get this to work, or am I misunderstanding something?
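One thing worth sanity-checking (just an illustration, not a full answer): whether the renamed jar actually matches the pattern NLTK is searching for. A quick re.match test suggests the date-style name does not, while a version-style name such as stanford-corenlp-3.6.0.jar would:
import re

# Pattern quoted in the LookupError above.
pattern = r"stanford-corenlp-(\d+)(\.(\d+))+\.jar"
for name in ("stanford-corenlp-2016-01-10.jar", "stanford-corenlp-3.6.0.jar"):
    print(name, bool(re.match(pattern, name)))  # False, then True -- the dashes break the \.(\d+) groups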

reindent.py - Does not work from the command line

I have problems with indentation in Python. So I downloaded reindent.py to correct the indentation errors.
I installed reindent.py using the following command:
pip install reindent
But running it from the command line shows me the following error:
Traceback (most recent call last):
File "/usr/local/bin/reindent", line 3, in <module>
main()
File "/usr/local/lib/python2.7/dist-packages/reindent.py", line 92, in main
check(arg)
File "/usr/local/lib/python2.7/dist-packages/reindent.py", line 118, in check
if r.run():
File "/usr/local/lib/python2.7/dist-packages/reindent.py", line 177, in run
tokenize.tokenize(self.getline, self.tokeneater)
File "/usr/lib/python2.7/tokenize.py", line 170, in tokenize
tokenize_loop(readline, tokeneater)
File "/usr/lib/python2.7/tokenize.py", line 176, in tokenize_loop
for token_info in generate_tokens(readline):
File "/usr/lib/python2.7/tokenize.py", line 357, in generate_tokens
("<tokenize>", lnum, pos, line))
File "<tokenize>", line 127
for w in transcript:
^
IndentationError: unindent does not match any outer indentation level
I am running it with the following command:
reindent -n test1.py
I thought reindent was supposed to correct the errors not show me where they occurred.
reindent.py changes tabs to spaces and can make irregular indentation a uniform 4-spaces. It does not attempt to catch or fix IndentationErrors.
Consider this code which has an IndentationError:
def foo():
        print("Let's go")
    for i in range(2):        <-- IndentationError
        print('Peay')
It produces a similar error message to the one you are getting:
% reindent.py script.py
Traceback (most recent call last):
...
File "/usr/lib/python2.7/tokenize.py", line 170, in tokenize
tokenize_loop(readline, tokeneater)
File "/usr/lib/python2.7/tokenize.py", line 176, in tokenize_loop
for token_info in generate_tokens(readline):
File "/usr/lib/python2.7/tokenize.py", line 357, in generate_tokens
("<tokenize>", lnum, pos, line))
File "<tokenize>", line 9
for i in range(2):
^
IndentationError: unindent does not match any outer indentation level
Both
def foo():
        print("Let's go")
        for i in range(2):
            print('Peay')
and
def foo():
    print("Let's go")
    for i in range(2):
        print('Peay')
are valid ways to fix the code. reindent.py (or the tokenize module that it relies on) does not attempt to guess which one the coder intended. Thus, IndentationErrors are SyntaxErrors that at least sometimes require human intervention to fix.
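By contrast, indentation that is irregular but still syntactically valid is exactly what reindent.py will clean up. A small made-up illustration:
# before: valid Python, but indented with 2 spaces and then 6
def foo():
  print("Let's go")
  for i in range(2):
      print('Peay')

# after running reindent over the file: every level is a uniform 4 spaces
def foo():
    print("Let's go")
    for i in range(2):
        print('Peay')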

pandas get_group memory error

I am using pandas v0.14.1 with Python 2.7.
I have a groupby object and I am trying to pull out a group identified by a particular key. The key is in fact among the group keys:
>>> key in key_groups.groups.keys()
True
but when I try to make the get_group call it fails with a memory error:
>>> key_groups.get_group(key)
*** MemoryError:
The full stacktrace is:
Traceback (most recent call last):
File "main.py", line 141, in <module>
main(num_days=arguments.days, num_variants=arguments.variants)
File "main.py", line 76, in main
problem, solution = Solver.Solve(request, num_variants)
File "/srv/compunctuator/src/Solver.py", line 49, in Solve
solution = attempt_minimization(t)
File "/srv/compunctuator/src/Solver.py", line 41, in attempt_minimization
t.scruple()
File "/srv/compunctuator/src/Compunctuator.py", line 136, in scruple
self.__iterate__()
File "/srv/compunctuator/src/Compunctuator.py", line 95, in __iterate__
self.__maximize_impressions__()
File "/srv/compunctuator/src/Compunctuator.py", line 583, in __maximize_impressions__
df = key_groups.get_group(key)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 573, in get_group
inds = self._get_index(name)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 429, in _get_index
sample = next(iter(self.indices))
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 414, in indices
return self.grouper.indices
File "properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:36380)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 1253, in indices
return _get_indices_dict(label_list, keys)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 3474, in _get_indices_dict
np.prod(shape))
File "algos.pyx", line 1997, in pandas.algos.groupsort_indexer (pandas/algos.c:37521) MemoryError
If I actually use the dictionary lookup I can get the indices out:
>>> key_groups.groups[key]
[0, 2]
It seems like everything should work here.
I realize a similar question was asked here: pandas get_group causes memory error. But it was never resolved, and I thought I could give more details if necessary.
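For what it's worth, since the plain dictionary lookup succeeds, the rows of the group can be pulled out without calling get_group at all. A minimal sketch of that workaround (df stands for the DataFrame the groupby was built from, which isn't named in the question; this only sidesteps get_group rather than explaining the MemoryError):
# Select the group's rows directly via the index labels that .groups[key] returns.
rows = df.loc[key_groups.groups[key]]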