how do I install python library on AWS EMR notebook? - amazon-web-services

I want to install additional libraries on AWS notebook (connected to EMR cluster), however I do not see any option to connect from Notebook to internet. If I do "pip install ", it always come back saying that network is not reachable. I am not sure which network need to be changed for network connection and library installation.
I did login to Jupyter terminal, and ping to google.com, which just timed out. I do not see any network / security group etc... configuration under Notebook section for making any relevant changes.
May be I need to take some additional steps?

If you use PySpark kernel then you can install libraries using
sc.install_pypi_package("celery")
Or by running
sudo pip-3.6 install boto3
The following document has more details
If you use python 3 kernel, then only the packages are installed and there is no direct way to install extra libraries except uploading the python package to the notebook then using jupyterlab terminal to run
pip install package.tar.gz

Related

PySpark ModuleNotFoundError on GCP

I'm trying to run a Pyspark Streaming program on GCP Dataproc. I pip install mmh3 in ssh already, running pyspark then type import mmh3 caused no problem. But when I started running sc.start() and send info over from another ssh terminal, it starts saying the module not found. Any idea why this happened or how to fix it? Thanks.
By installing the package via SSH, you're just install it on the "driver" node. You'll need to install the package for the whole cluster (i.e. all worker nodes) as well. Try following the documentation

Node.JS native addons on LINUX [duplicate]

I'm using AWS Lambda, which involves creating an archive of my node.js script, including the node_modules folder and uploading that to their infrastructure to run.
This works fine, except when it comes to node modules with native bindings (using node-gyp). Because the binding was complied and project archived on my local computer (OS X), it is not compatible with AWS's (Amazon Linux) servers.
How can I cross-compile/install a node module (specifically, node-sqlite3) so when I upload it to another server arch it runs?
While not really a solution to your problem, a very easy workaround could be to simply compile the native addons on a Linux machine.
For your particular situation, I would use Vagrant. Vagrant can create virtual machines and configure them within seconds.
Find an OS image that resembles Amazon's Linux distro (Fedora, CentOS, others that use yum as package manager - see Wiki)
Use a simple configuration script that, when run by Vagrant on machine startup, will run npm install (optionally it might also remove the node_modules folder before to ensure a clean installation)
For extra comfort, the script can also create the zip file for deployment
Once the installation finishes, the script will shutdown the VM to avoid unnecessary consumption of system resources
Deploy!
It might require some tuning if the linked libraries are not at the same place on the target machine but generally this seems to me like the best and quickest solution.
While installing the app using Vagrant might be sufficient in some cases, I have found it necessary to build the app on Linux which is as close to Lambda's Amazon Linux AMI as possible.
You can read the original answer here: https://stackoverflow.com/a/34019739/303184
Steps to make it work:
Spawn new EC2 instance. Make sure it is based on exactly the same image as your AWS Lambda runtime. You can review Lambda env details here: http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html. In our case, it was Amazon Linux AMI called amzn-ami-hvm-2015.03.0.x86_64-gp2.
Install nvm and use it to install the same version of Node.js as on the AWS Lambda. At the time of writing this, it was v0.10.36. You can refer to http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html again to find out.
You will probably need to install git & g++ compiler on the EC2. You can do this running
sudo yum install git gcc-c++
Finally, clone your app to your new EC2 and install your app's dependecies:
nvm use 0.10.36
npm install --production
You can then easily download the node_modules using scp or such.
Same lines as Robert's answer, when I had to work on my MAC in a different OS I use vm ware like Oracle's free virtualizer VirtualBox to get a linux on my mac, no cost to me. Or sign up for a new AWS account, you get a micro for a year free. Use that to get your linux box, do whatever you need there.
AWS has a page describing how to deal with native NPM modules: https://aws.amazon.com/blogs/compute/nodejs-packages-in-lambda/

How do I upgrade a library in Qubole's Jupyter Notebook, using PySpark?

Is there a way to do it right from a cell in the notebook? similar to pip install ... --upgrade
I didn't know how to do what's instructed on https://docs.qubole.com/en/latest/faqs/general-questions/install-custom-python-libraries.html#pre-installed-python-libraries
The current Python version is 3.5.3, and Pandas 0.20.1. I need to upgrade Pandas, and Matplotlib
In Qubole are two ways to upgrade/install a package for the python environment. Currently there is no interface available inside notebook to install new packages.
New and Recommended Way (via Package Mangement) : User can enable Package Management functionality for an account and add new packages to a cluster via UI. There are lot of advantages of using package management over cluster versions in terms of performance and usability. Refer to https://docs.qubole.com/en/latest/user-guide/package-management/index.html for further details.
Old Way (via bootstrap) : User can configure a bootstrap which is basically a shell script executed on each node when the cluster starts and or upscales (more nodes are getting added to cluster). This can be configured via clusters UI and need a cluster start for every change. This is what is instructed in link you shared.
You cannot download/upgrade packages directly from the cell in the notebook. This is because your notebook is associated to a cluster. Now, to ensure that all the nodes of the cluster have the package installed, you must either use the package management (https://docs.qubole.com/en/latest/user-guide/package-management/package-management-environment.html) or the cluster's node bootstrap (https://docs.qubole.com/en/latest/user-guide/clusters/run-scripts-cluster.html#examples-node-scripts).
Do let me know if you have any further questions.

Installed package disappear on Google Cloud VMS instance

so I have installed some python nltk libraries (pip3 install) and c++ libraries (via apt-get install package_name_xxxx) on two different VMS instances.
Python packages for nltk would disappear and require a reinstall after reboot or change of the vms instance (e.g., add memory, cpu core),
C++ libraries disappeared without rebooting or any change of the machine. I do not find anything in the systemlog, a reinstall with apt-get works fine. But I am trying to figure out why it happens.
Is your GCE instance a preemptible instance? this option restarts the instance once every 24 hours and could be the reason why you are missing some packages.
After about an hour of inactivity, modifications not within the $HOME directory are lost. This includes installed packages.
See Custom installed software packages and persistence and usage limits.

How to install audiowaveform program on AWS Elastic Beanstalk

Just FYI ... context here is AWS Elastic Beanstalk. I'm trying to the install audiowaveform program on 64bit Amazon Linux 2015.03 v1.4.3 (the customer AMI ID is ami-6b50291c). Running this ... 👇
$ sudo yum install git cmake libmad-devel libsndfile-devel gd-devel boost-devel
... successfully installs all packages except libmad-devel and libsndfile-devel. Below is the relevant output ...
Failed to set locale, defaulting to C
Loaded plugins: priorities, update-motd, upgrade-helper
amzn-main/2015.03 | 2.1 kB 00:00
amzn-updates/2015.03 | 2.3 kB 00:00
Package git-2.1.0-1.38.amzn1.x86_64 already installed and latest version
Package cmake-2.8.12-2.20.amzn1.x86_64 already installed and latest version
No package libmad-devel available.
No package libsndfile-devel available.
Package gd-devel-2.0.35-11.10.amzn1.x86_64 already installed and latest version
Package boost-devel-1.53.0-14.21.amzn1.x86_64 already installed and latest version
Nothing to do
That said, this is not a problem with audiowaveform ... all this means is that the repositories enabled for Amazon Linux AMIs do not have libmad-devel and libsndfile-devel by default. I probably have to simply add my own sources I guess.
Also to note is that no yum packages exist for audio waveform so I have to build this manually.
Obtain the source ... 👇
$ git clone https://github.com/bbcrd/audiowaveform.git
$ cd audio waveform
Then build and install ... 👇
$ mkdir build
$ cd build
$ cmake ..
$ make
$ sudo make install
Question 1
On AWS EB ... the EC2 instances are configured to use Amazon sources which don't have the above packages i.e. libmad-devel and libsndfile-devel. What would be the recommended approach to adding these packages so that they are available to yum?
I stress recommended because I feel that changing the sources from Amazon's could not be the best approach. Nor is adding another source that could conflict with Amazon's packages ... etc etc etc ...
Question 2
Assuming I'm able to install libmad-devel and libsndfile-devel. I still have to build this manually since there are no packages of audiowaveform. On AWS EB I could write a script to do this as each instance is being instantiated ... but I feel this isn't ideal, slow and kinda error-prone. Anyone have advice on how I can do this better?
Probably prepare an AMI with this already built that's based off ami-6b50291c. Thoughts?
Core Objective
I don't have to use audiowaveform ... my objective really is to extract the peak points of some audio (MP3). I will set this up as a separate question.
Amazon Elastic Beanstalk tends to be very restricted in terms of what software you can install on it. I solved it by dockerizing my application environment. This is possible now even on Elastic Beanstalk.
Learn more about Elastic Beanstalk's support for Docker ...
AWS Elastic Beanstalk makes it easy for you to deploy and manage
applications in the AWS cloud. After you upload your application,
Elastic Beanstalk will provision, monitor, and scale capacity (Amazon
EC2 instances), while also load balancing incoming requests across all
of the healthy instances.
Docker automates the deployment of applications in the form of
lightweight, portable, self-sufficient containers that can run in a
variety of environments. Containers can be populated from pre-built
Docker images or from a simple recipe known as a Dockerfile.
Docker’s container-based model is very flexible. You can, for example,
build and test a container locally and then upload it to the AWS Cloud
for deployment and scalability. Docker’s automated deployment model
ensures that the runtime environment for your application is always
properly installed and configured, regardless of where you decide to
host the application.
This way ... you can do whatever you want in the container and that container will run on the kernel provided by the Amazon Linux AMI instance (obviously completely isolated).
I'm also somehow having hard time getting yum to find libsndfile on Amazon Linux AMI (RedHat 7.4). Repositories I've added to yum never seem to contain it. (How to add new repos is described here )
Finally I just downloaded and installed the rpms directly:
wget http://ftp.altlinux.org/pub/distributions/ALTLinux/Sisyphus/x86_64/RPMS.classic//libsndfile-1.0.28-alt1.x86_64.rpm
wget http://ftp.altlinux.org/pub/distributions/ALTLinux/Sisyphus/x86_64/RPMS.classic//libsndfile-devel-1.0.28-alt1.x86_64.rpm
sudo yum localinstall libsndfile-devel-1.0.28-alt1.x86_64.rpm
This way I got PySoundfile working finally.