Custom PyPI repo for Google Dataflow workers - google-cloud-platform

I want to use a custom PyPI repo for my Dataflow workers. Typically, to configure a custom PyPI repo, you would edit /etc/pip.conf to look like this:
[global]
index-url = https://pypi.customer.com/
Since I can't run a startup script for Dataflow workers, my thought was to perform this operation at the head of my setup.py file, so that as the script executes, it would update /etc/pip.conf before attempting a pip install of the dependencies.
My setup.py looks like the following:
import setuptools

with open('/etc/pip.conf', 'w') as pip_conf:
    pip_conf.write("""
[global]
index-url = https://artifactory.mayo.edu/artifactory/api/pypi/pypi-remote/simple
""")

REQUIRED_PACKAGES = [
    'custom_package',
]

setuptools.setup(
    name='wordcount',
    version='0.0.1',
    description='demo package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages())
The odd thing is that my workers are hanging indefinitely. When I ssh into them, I see some Docker containers running, but I am not sure how to debug further.
Any suggestions on how I can hack the Dataflow workers to use a custom PyPI URL?

This is likely a good candidate for custom containers, where you can install everything exactly as you want rather than having to hack the worker startup sequence.
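For example, a custom worker image for this use case could be as small as the Dockerfile below. This is only a sketch: the Beam base image tag is an assumption and should match the SDK version your pipeline uses, and pip.conf / custom_package are the file and package from the question.
# Sketch of a custom Beam SDK container; pick the tag matching your SDK version.
FROM apache/beam_python3.8_sdk:2.48.0
# Bake the private index configuration into the image.
COPY pip.conf /etc/pip.conf
# Pre-install the private dependency so workers never need to reach the public PyPI.
RUN pip install custom_package
Build and push the image to a registry the Dataflow service account can read, then pass it to the pipeline; with recent SDKs that is the --sdk_container_image pipeline option (older releases used --worker_harness_container_image), and custom containers require Dataflow Runner v2.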

Related

Google Cloud Platform: cloudshell - is there any way to "keep" gcloud init configs?

Does anyone know of a way to persist configurations done using "gcloud init" commands inside cloudshell, so they don't vanish each time you disconnect?
I figured out how to persist Python pip installs using the --user flag, for example: pip install --user pandas
But, when I create a new configuration using gcloud init, use it for a bit, close cloudshell (or cloudshell times out on me), then reconnect later, the configurations are gone.
Not a big deal, I bounce between projects/etc so it's nice to have the configs saved so I can simply run
gcloud config configurations activate config-name
Thanks...Rich Murnane
Google Cloud Shell only persists data in your $HOME directory. Commands like gcloud init modify environment variables and store configuration files in /tmp, which is deleted when the VM is restarted. The VM is terminated after being idle for 20 or 60 minutes, depending on which document you read.
Google Cloud Shell is a Docker container. You can modify the Docker image to customize it to fit your needs. This method allows you to install packages, tools, etc. that are not located in your $HOME directory.
You can also store your files and configuration scripts on Google Cloud Storage. Modify .bashrc to download your cloud files and run your configuration script.
Either method will allow you to create a persistent environment.
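As a concrete sketch of the Cloud Storage approach, something like the following could be appended to ~/.bashrc. The bucket name and script name are placeholders, and the gcloud commands simply recreate the configuration mentioned in this thread.
# ~/.bashrc addition (sketch): pull a setup script from a bucket and run it.
if gsutil -q stat gs://my-config-bucket/cloudshell-setup.sh; then
  gsutil cp gs://my-config-bucket/cloudshell-setup.sh "$HOME/cloudshell-setup.sh"
  bash "$HOME/cloudshell-setup.sh"
fi
# cloudshell-setup.sh (sketch): recreate and activate the named configuration.
gcloud config configurations create config-name 2>/dev/null || true
gcloud config configurations activate config-name
gcloud config set project second-project-name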
This StackOverflow answer covers in detail what gcloud init does and how to basically emulate the same thing via script or command line.
gcloud init details
This isn't exactly what I wanted, but since my account (userid) isn't changing, I'm simply going to do the command
gcloud config set project second-project-name
good enough, thanks...Rich

nltk dependencies in dataflow

I know that external Python dependencies can be fed into Dataflow via the requirements.txt file. I can successfully load nltk in my Dataflow script. However, nltk often needs further files to be downloaded (e.g. stopwords or punkt). Usually on a local run of the script, I can just run
nltk.download('stopwords')
nltk.download('punkt')
and these files will be available to the script. How do I do this so the files are also available to the worker scripts? It seems like it would be extremely inefficient to place those commands into a DoFn/CombineFn if they only have to happen once per worker. What part of the script is guaranteed to run once on every worker? That would probably be the place to put the download commands.
According to this, Java allows the staging of resources via classpath. That's not quite what I'm looking for in Python. I'm also not looking for a way to load additional python resources. I just need nltk to find its files.
You can probably use '--setup_file setup.py' to run these custom commands. https://cloud.google.com/dataflow/pipelines/dependencies-python#pypi-dependencies-with-non-python-dependencies . Does this work in your case?
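For reference, the pattern described on that page extends the build step of setup.py with custom commands. A sketch adapted to the nltk case is below; the package name, the data directory, and the extra pip install of nltk (so the downloader is available when the command runs) are assumptions to adjust.
import subprocess

import setuptools
from distutils.command.build import build as _build

# Commands to run once per worker while the staged package is being installed.
# /usr/share/nltk_data is on nltk's default search path, so DoFns can find the
# corpora without extra configuration.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'nltk'],
    ['python', '-m', 'nltk.downloader', '-d', '/usr/share/nltk_data',
     'stopwords', 'punkt'],
]


class build(_build):
    """Extend the standard build command to also run CUSTOM_COMMANDS."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)


setuptools.setup(
    name='my-pipeline',  # illustrative name
    version='0.0.1',
    install_requires=['nltk'],
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands})
The pipeline would then be launched with --setup_file ./setup.py rather than only a requirements.txt, so the download commands run once per worker at install time instead of inside a DoFn.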

How to "dockerize" Flask application?

I have a Flask application named rest.py and I have dockerized it, but it is not running.
#!flask/bin/python
from flask import Flask, jsonify
app = Flask(__name__)
tasks = [
    {
        'id': 1,
        'title': u'Buy groceries',
        'description': u'Milk, Cheese, Pizza, Fruit, Tylenol',
        'done': False
    }
]

@app.route('/tasks', methods=['GET'])
def get_tasks():
    return jsonify({'tasks': tasks})

if __name__ == '__main__':
    app.run(debug=True)
Dockerfile is as follows
FROM ubuntu
RUN apt-get update -y
RUN apt-get install -y python-dev python-pip
COPY . /rest
WORKDIR /rest
RUN pip install -r Req.txt
ENTRYPOINT ["python"]
CMD ["rest.py"]
I have built it using this command...
$ docker build -t flask-sample-one:latest
...and when I run the container...
$ docker run -d -p 5000:5000 flask-sample-one
it returns the following output:
7d1ccd4a4471284127a5f4579427dd106df499e15b868f39fa0ebce84c494a42
What am I doing wrong?
The output you get is the container ID. Check with docker ps whether it keeps running.
Use docker logs [container-id] to figure out what's going on inside.
Some problems I can find in your question:
Change the app.run line to app.run(host='0.0.0.0', debug=True). From the point of view of the container, its services need to be externally reachable, so they must not listen on the loopback interface only; bind to all interfaces, just as you would if you were setting up a publicly available server directly on a host.
Make sure that Flask gets installed. Your Docker image build has to include every command needed to go from a blank Ubuntu installation to a working application.
Please do not forget to deactivate debug if you'd ever expose this service on your host. Debug mode in Flask makes it possible for visitors to run arbitrary code if they can trigger an exception (it's a feature, not a bug).
After that (and building the container again [1]), try curl http://127.0.0.1:5000/tasks on the host. Let me know if it works; if not, there are other problems in your setup.
[1] You can improve the prototyping workflow with Flask's built-in reloader (which is enabled by default) if you use a volume mount in your docker container for the directory that contains your python files - this would allow you to change your script on the host, reload in the browser and directly see the result.
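As an illustration of that volume-mount idea, the run command could bind the project directory to the /rest path used in the Dockerfile (image name taken from the question):
docker run -d -p 5000:5000 -v "$PWD":/rest flask-sample-one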
I believe you need to strengthen your understanding of Docker concepts; once you understand how it works, you will be able to "dockerize" any application.
Here is an article which can give you some first steps.
An official HOWTO will also help you.
Some observations that might help you:
check if your Req.txt contains flask so it actually gets installed (see the example Req.txt after this list)
before dockerizing, check if your application is working
check your running containers with docker ps and see if your container is running
if it is running, test your application: curl http://127.0.0.1:5000/tasks
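For reference, a minimal Req.txt for this app only needs Flask; the exact version pin below is just an example:
flask==1.0.2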
One more thing:
your JSON has an OBJECT with an ARRAY with just one ELEMENT
Is that what you want for your prototype?
Take a look at this doc about the JSON standard.

Setting up jupyterhub docker using one of the jupyter stacks

I'm trying to get JupyterHub up and running. Python 2.7 kernels are required, so basically anything from the docker-stacks repo would be great. The documentation mentions that docker-stacks can work with JupyterHub using DockerSpawner, but I can't quite see how it all fits together. Is anyone aware of a simple step-by-step guide to get this working?
To use any Docker image, first pull it from Docker Hub: docker pull jupyter/scipy-notebook
Now install dockerspawner: pip install dockerspawner
Add the necessary lines to jupyterhub_config.py
(https://github.com/jupyterhub/dockerspawner/blob/master/README.md)
To make JupyterHub spawn a specific Docker image, this line does the magic: c.DockerSpawner.image = 'jupyter/scipy-notebook'
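Putting those pieces together, a minimal jupyterhub_config.py might look like the sketch below. The image name and the hub_ip setting are assumptions; see the dockerspawner README linked above for the full set of options.
# jupyterhub_config.py (sketch)
c = get_config()  # provided by JupyterHub when it loads this file

# Spawn each user's notebook server in its own Docker container.
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'

# Use the docker-stacks image pulled above.
c.DockerSpawner.image = 'jupyter/scipy-notebook'

# The single-user containers must be able to reach the hub, so don't bind the
# hub to localhost only.
c.JupyterHub.hub_ip = '0.0.0.0'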

AWS: CodeDeploy for a Docker Compose project?

My current objective is to have Travis deploy our Django+Docker-Compose project upon successful merge of a pull request to our Git master branch. I have done some work setting up our AWS CodeDeploy, since Travis has built-in support for it. When I got to the AppSpec and actual deployment part, at first I tried to have an AfterInstall script do docker-compose build and then have an ApplicationStart script do docker-compose up. The containers that have images pulled from the web are our PostgreSQL container (named db, image aidanlister/postgres-hstore, which is the usual postgres image plus the hstore extension), the Redis container (uses the redis image), and the Selenium container (image selenium/standalone-firefox). The other two containers, web and worker, which are the Django server and Celery worker respectively, use the same Dockerfile to build an image. The main command is:
CMD paver docker_run
which uses a pavement.py file:
from paver.easy import task
from paver.easy import sh
@task
def docker_run():
    migrate()
    collectStatic()
    updateRequirements()
    startServer()

@task
def migrate():
    sh('./manage.py makemigrations --noinput')
    sh('./manage.py migrate --noinput')

@task
def collectStatic():
    sh('./manage.py collectstatic --noinput')

# find any updates to existing packages, install any new packages
@task
def updateRequirements():
    sh('pip install --upgrade -r requirements.txt')

@task
def startServer():
    sh('./manage.py runserver 0.0.0.0:8000')
Here is what I (think I) need to make happen each time a pull request is merged:
Have Travis deploy changes using CodeDeploy, based on deploy section in .travis.yml tailored to our CodeDeploy setup
Start our Docker containers on AWS after successful deployment using our docker-compose.yml
How do I get this second step to happen? I'm pretty sure ECS is not what is needed here. My current status is that I can get Docker started with sudo service docker start, but I cannot get docker-compose up to succeed. Though deployments are reported as "successful", this is only because the docker-compose up command is run in the background by the Validate Service script. In fact, when I try to run docker-compose up manually while SSH'd into the EC2 instance, I get stuck building one of the containers, right before the CMD paver docker_run part of the Dockerfile.
This took a long time to work out, but I finally figured out a way to deploy a Django+Docker-Compose project with CodeDeploy without Docker-Machine or ECS.
One thing that was important was to make an alternate docker-compose.yml that excluded the selenium container; it only caused problems and was only useful for local testing. In addition, it was important to choose an instance type that could handle building containers. The reason containers couldn't be built from our Dockerfile was that the instance simply did not have the memory to complete the build: instead of a t1.micro instance, an m3.medium is what worked. It is also important to have sufficient disk space; 8GB is far too small. To be safe, 256GB would be ideal.
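For illustration, the trimmed-down compose file could look roughly like this; the service and image names come from the question, everything else (ports, dependencies) is assumed:
# docker-compose-aws.yml (sketch): the local file minus the selenium container.
version: '2'
services:
  db:
    image: aidanlister/postgres-hstore
  redis:
    image: redis
  web:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - db
      - redis
  worker:
    build: .
    depends_on:
      - db
      - redis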
It is important to have an AfterInstall script run service docker start when doing the necessary Docker installation and setup (including installing Docker-Compose). This explicitly starts the Docker daemon; without it, you will get the error Could not connect to Docker daemon. When installing Docker-Compose, it is important to place it in /opt/bin/ so that the binary is used via /opt/bin/docker-compose. There are problems with placing it in /usr/local/bin (I don't exactly remember what problems, but it's related to the particular Linux distribution of the Amazon Linux AMI). The AfterInstall script needs to be run as root (runas: root in the appspec.yml AfterInstall section).
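A sketch of how that can be wired into the appspec.yml and the AfterInstall script follows; the file layout, script names, and docker-compose release are assumptions, not the exact files from this deployment.
# appspec.yml (sketch)
version: 0.0
os: linux
files:
  - source: /
    destination: /home/ec2-user/project
hooks:
  AfterInstall:
    - location: scripts/install_docker.sh
      timeout: 600
      runas: root
  ApplicationStart:
    - location: scripts/start_containers.sh
      timeout: 300
      runas: root
#!/bin/bash
# scripts/install_docker.sh (sketch): steps described in the paragraph above.
yum install -y docker
service docker start
mkdir -p /opt/bin
curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /opt/bin/docker-compose
chmod +x /opt/bin/docker-compose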
Additionally, the final phase of deployment, which is starting up the containers with docker-compose up (more specifically /opt/bin/docker-compose -f docker-compose-aws.yml up), needs to be run in the background with stdin, stdout, and stderr redirected to /dev/null:
/opt/bin/docker-compose -f docker-compose-aws.yml up -d > /dev/null 2> /dev/null < /dev/null &
Otherwise, once the server is started, the deployment will hang because the final script command (in the ApplicationStart section of my appspec.yml in my case) doesn't exit. This will probably result in a deployment failure after the default deployment timeout of 1 hour.
If all goes well, then the site can finally be accessed at the instance's public DNS and port in your browser.