PySpark ModuleNotFoundError on GCP - google-cloud-platform

I'm trying to run a PySpark Streaming program on GCP Dataproc. I already pip installed mmh3 over SSH, and running pyspark and then typing import mmh3 causes no problem. But when I start sc.start() and send data over from another SSH terminal, it starts saying the module is not found. Any idea why this happens or how to fix it? Thanks.

By installing the package via SSH, you're only installing it on the "driver" node. You'll need to install the package on the whole cluster (i.e. all worker nodes) as well. Try following the documentation.
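One common way to do that on Dataproc is to install the package at cluster-creation time with the public pip-install initialization action. A minimal sketch, assuming that action and a hypothetical cluster name (check the linked documentation for the current script path):
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
    --metadata 'PIP_PACKAGES=mmh3'
This runs pip install mmh3 on every node as it comes up, so the executors can import the module, not just the driver.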

Related

How do I install a Python library on an AWS EMR notebook?

I want to install additional libraries on an AWS notebook (connected to an EMR cluster), however I do not see any option to connect from the notebook to the internet. If I do "pip install ", it always comes back saying that the network is not reachable. I am not sure which network settings need to be changed for network connectivity and library installation.
I did log in to the Jupyter terminal and ping google.com, which just timed out. I do not see any network / security group configuration under the Notebook section for making any relevant changes.
Maybe I need to take some additional steps?
If you use the PySpark kernel, you can install libraries using
sc.install_pypi_package("celery")
Or by running
sudo pip-3.6 install boto3
The following document has more details.
If you use the Python 3 kernel, then only the pre-installed packages are available, and there is no direct way to install extra libraries except by uploading the Python package to the notebook and then using the JupyterLab terminal to run
pip install package.tar.gz

No module named `keras` under tmux on AWS instance

I am trying to use an Amazon AWS instance to train my network. To run it under keras, I need to run
source activate tensorflow_p36
first, and it works. Unfortunately, if I do the same from under tmux, it says it can't find the keras module.
Why, and how can I overcome this?
You can refer to the solution suggested in TMUX Session Won't Import Python Module. If you start the tmux session first and then import tensorflow, it should work. At least my issue was resolved when I used this sequence; otherwise I got an error saying the tensorflow module was not found.
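In other words, the order is roughly the following (the session and script names are just illustrative):
tmux new -s training
source activate tensorflow_p36
python train.py
Activating the conda environment inside the tmux session means the session picks up that environment's Python, which is the one that has keras and tensorflow installed.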

Amazon Lambda unable to import [python windows .pyd pip]

I am trying to write to my PostgreSQL database with AWS Lambda using the python2.7 runtime. I care very little about how I do this, so if anyone has a different way that I can understand that works, I'd love to hear it.
The method I'm currently trying is to use psycopg2, as this is the only way I know. In order to do this, I need to upload the psycopg2 module to my environment on AWS Lambda. As per the instructions, I've created a directory with my source and psycopg2 using pip install psycopg2 -t ..\my-project, zipped my-project, and uploaded it.
My error message is this from within the AWS Lambda console: Unable to import module 'lambda_function': No module named _psycopg
The code runs on my Windows machine. I think the issue is that when I import psycopg2 from my local Windows machine, the _psycopg module is being imported from _psycopg.pyd, and .pyd files are Windows-specific. I may be wrong about this.
I'm really just looking for any way to achieve the desired result described in my first paragraph, but here's a more specific question: how do I tell Windows to pip install and compile psycopg2 without using .pyd files? Is this possible? Do I have something completely wrong?
I know the formatting of this question is a little unorthodox; I think I've given all the necessary information, but let me know if there's anything else I can provide.
I solved the problem by opening an Ubuntu instance on VirtualBox, pip installing the package there, pulling the relevant folders out, and placing them in my-project before zipping and uploading to AWS Lambda.
See these instructions.
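Roughly, the steps inside the Linux VM look like the following (directory and file names are illustrative; the key point is that the psycopg2 build happens on Linux, not Windows):
pip install psycopg2 -t ./build
cp -r ./build/psycopg2 ./my-project/
cd my-project && zip -r ../my-project.zip .
The resulting my-project.zip, containing lambda_function.py alongside the Linux-built psycopg2 folder, is what gets uploaded to AWS Lambda.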

Error with H2O - Python init(): Server Error

This is a totally newbie question, to see if I am missing something key (like whether there is more to install).
After installing H2O (Python 2.7) on a 9-node Hadoop / Spark cluster
using pip install of the whl file (h2o-3.10.4.8-py2.py3-none-any.whl) (which says it installed correctly),
I can import h2o successfully.
But when I run h2o.init() I get:
"Checking whether there is an H2O instance running at http://localhost:54321. connected."
But then an error is thrown:
H2oServerError: HTTP 500 Server Error: u'Error: 500'
Should I be able to run H2O by simply pip installing that whl, or is there more to it? The documentation seems outdated and there are lots of different versions found online. Does anyone have any experience with this?
Most likely you have already solved this problem, but maybe someone else will benefit. Use the "Install in Python" tab on the following website: http://h2o-release.s3.amazonaws.com/h2o/rel-tutte/2/index.html.
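For reference, the instructions on that tab boil down to removing any previously installed client and installing the build listed on the page; a rough sketch (the exact wheel URL is the one shown on the page, left here as a placeholder):
pip uninstall h2o
pip install <h2o wheel URL from the release page's "Install in Python" tab>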

Running EMR with Cascading SDK failed

I was following this tutorial for installing Cascading on EMR:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CreateCascading.html
But it failed because of the bootstrap action that installs the cascading-sdk. The corresponding log is here: http://pastebin.com/jybHssTQ. As seen from the log, it failed because apt-get was not found. Seriously?
I also checked the SDK installation script and found an option to disable installing screen with --no-screen. It still failed, with a different error: http://pastebin.com/T6CvA2H1
And now it is because of permission denied. What?
It's the official guide, but I can't seem to run it. Any ideas?
Rather than changing the script first, try a different EMR AMI version.
AMI versions up until 2.4.8 use Debian OS, where apt-get will work, but this runs Hadoop 1.x. AMI versions 3.0.x run Hadoop 2.2 and use Amazon Linux, which uses Yum.
See below:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
Also, try adding the "--tmpdir" option to get around the "Permission Denied" error.
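For example, when launching with the AWS CLI, the bootstrap action arguments would look roughly like this (the SDK script path is the one from the tutorial, left as a placeholder, and the temp directory is illustrative):
aws emr create-cluster --ami-version 2.4.8 --instance-type m1.large --instance-count 3 \
    --bootstrap-actions Path=<cascading-sdk install script from the tutorial>,Args=["--no-screen","--tmpdir","/home/hadoop/tmp"]
The --ami-version 2.4.8 part picks the Debian-based image mentioned above, so apt-get is available, and --tmpdir points the installer at a directory the hadoop user can write to.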