nltk dependencies in dataflow - google-cloud-platform

I know that external Python dependencies can be fed into Dataflow via the requirements.txt file. I can successfully load nltk in my Dataflow script. However, nltk often needs further files to be downloaded (e.g. stopwords or punkt). Usually, on a local run of the script, I can just run
nltk.download('stopwords')
nltk.download('punkt')
and these files will be available to the script. How do I do this so the files are also available to the worker scripts? It seems like it would be extremely inefficient to place those commands into a DoFn/CombineFn if they only have to happen once per worker. What part of the script is guaranteed to run once on every worker? That would probably be the place to put the download commands.
According to this, Java allows the staging of resources via the classpath. That's not quite what I'm looking for in Python. I'm also not looking for a way to load additional Python resources; I just need nltk to find its files.

You can probably use '--setup_file setup.py' to run these custom commands: https://cloud.google.com/dataflow/pipelines/dependencies-python#pypi-dependencies-with-non-python-dependencies. Does this work in your case?
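For what it's worth, a minimal sketch of what such a setup.py could look like is below. The package name, the build_py hook, and the list of corpora are illustrative assumptions, and it presumes nltk itself is already installed (e.g. via requirements.txt) in the environment where setup.py runs; the custom-commands example in the linked documentation shows a fuller version of this pattern.

import subprocess
import setuptools
from setuptools.command.build_py import build_py

class DownloadNltkData(build_py):
    """Fetch the NLTK corpora while the pipeline package is installed on a worker."""
    def run(self):
        # Assumed corpora; they land in nltk's default data path, which
        # nltk also searches at runtime.
        subprocess.check_call(
            ['python', '-m', 'nltk.downloader', 'stopwords', 'punkt'])
        build_py.run(self)

setuptools.setup(
    name='my-dataflow-job',   # hypothetical package name
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build_py': DownloadNltkData},
)

The pipeline would then be launched with --setup_file ./setup.py so the download runs once per worker as the package is installed.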

Related

Importing the numpy C-extensions failed to run in local container (sam local invoke) whereas it runs perfectly locally (python .\test.py)

I know this question has been asked many times before. After days of struggle, I have narrowed the issue down to this:
The code runs perfectly locally using python .\test.py, but fails when running in the sam local invoke container.
AWS supports Python 3.9, so I tried to figure out the dependency combination for these imports:
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter
from sktime.transformations.series.outlier_detection import HampelFilter
from sklearn.linear_model import LinearRegression
from scipy.signal import find_peaks, peak_prominences
The code runs locally without problems but fails to run in the container.
Please help.
I figured out that sktime supports Python 3.9 but depends on numpy>=1.21.0.
On the other hand, no matter what I tried, I couldn't figure out which dependency versions would work on AWS Python 3.9.
I tried installing from wheels and trying different versions of numpy, but it is always the same error.
I ended up limiting the requirements like so:
numpy==1.21.0
pandas
sktime
This installed what it needed.
Then I figured out that I can run it locally. No issues.
So the problem must be with the container sam local invoke runs.
Note that I can use the AWS layer for pandas on Python 3.9, but I also need sktime and that didn't work for me (not to mention the 250 MB upload limit, which I leave as a later struggle).
OK, I have a better understanding of the whole issue. Here is what I have learned:
When you install Python for AWS, you should install the supported version (3.9 as of now). This will make all requirements.txt installations match Python 3.9.
So all installations should match without any issues.
Still, running it locally did not work for me, and I do not know why.
One solution involves attaching the Lambda to EFS, which involves many steps and manual work. You don't want to go there.
The solution for me was using Lambda container images. This was easier than I expected. The steps include some Docker configuration and having the YAML template reference the Dockerfile; 'sam deploy' will then publish a Docker image for you to the Elastic Container Registry (ECR).
See:
Creating Lambda container images
Using container image support for AWS Lambda with AWS SAM
The first link leads to other supporting links within the AWS documentation.
Hope this helps someone.

PGPy won't go on GCP Dataflow pipeline

I'm trying to use PGPy library in a custom GCP Dataflow pipeline implemented with Apache Beam.
Everything works with DirectRunner, but when I deploy the job and execute it on DataflowRunner, I get an error when PGPy is used:
ModuleNotFoundError: No module named 'pgpy'
I think I'm missing something with DataflowRunner.
Thank you
In order to manage pipeline dependencies, please refer to:
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
My personal preference is to go straight to using setup.py, as it lets you deal with multiple file dependencies, which tends to become necessary once the pipeline gets more complex.
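As an illustration only, a bare-bones setup.py along these lines (the package name is a placeholder, not something from the question) would get pgpy installed on every Dataflow worker when the job is launched with --setup_file ./setup.py:

import setuptools

setuptools.setup(
    name='my-beam-pipeline',   # hypothetical package name
    version='0.0.1',
    install_requires=[
        'pgpy',                # makes `import pgpy` work on the workers
    ],
    packages=setuptools.find_packages(),
)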

Custom PyPI repo for Google Dataflow workers

I want to use a custom pypi repo for my Dataflow workers. Typically, to configure a custom pypi repo, you would edit /etc/pip.conf to look like this:
[global]
index-url = https://pypi.customer.com/
Since I can't run a startup script for Dataflow workers, my thought was to perform this operation at the head of my setup.py file, so that as the script executes, it would update /etc/pip.conf before attempting a pip install of the dependencies.
My setup.py looks like the following:
import setuptools

# Rewrite /etc/pip.conf before any dependencies are resolved, so the
# subsequent pip install pulls from the internal index.
with open('/etc/pip.conf', 'w') as pip_conf:
    pip_conf.write("""
[global]
index-url = https://artifactory.mayo.edu/artifactory/api/pypi/pypi-remote/simple
""")

REQUIRED_PACKAGES = [
    'custom_package',
]

setuptools.setup(
    name='wordcount',
    version='0.0.1',
    description='demo package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages())
The odd thing is that my workers are hanging indefinitely. When I ssh into them, I see some Docker containers running, but I am not sure how to debug further.
Any suggestions on how I can hack the Dataflow workers to use a custom PyPI URL?
This is likely a good candidate for custom containers, where you can install everything exactly as you want rather than having to hack the worker startup sequence.
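For instance, assuming an image with the internal packages preinstalled has already been built and pushed to a registry, recent Beam releases let you point the Dataflow workers at it through the sdk_container_image pipeline option. The project, region, bucket, and image URL below are placeholders, and the pipeline is just a toy:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                    # placeholder
    region='us-central1',                    # placeholder
    temp_location='gs://my-bucket/tmp',      # placeholder
    sdk_container_image='us-docker.pkg.dev/my-project/repo/beam-python:latest',  # placeholder image
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(['hello']) | beam.Map(print)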

AWS EMR step doesn't find jar imported from s3

I am attempting to run a Spark application on AWS EMR in client mode. I have set up a bootstrap action to import the needed files and the jar from S3, and I have a step to run a single Spark job.
However when the step executes, the jar I have imported isn't found. Here is the stderr output:
19/12/01 13:42:05 WARN DependencyUtils: Local jar /mnt/var/lib/hadoop/steps/s-2HLX7KPZCA07B/~/myApplicationDirectory does not exist, skipping.
I am able to successfully import the jar and the other needed files for the application from my S3 bucket to the master instance; I simply import them to /home/ec2-user/myApplicationDirectory/myJar.jar via a bootstrap action.
However, I don't understand why the step is looking for the jar under /mnt/var/lib/hadoop/... etc.
Here are the relevant parts of the CLI configuration:
--steps '[{"Args":["spark-submit",
"--deploy-mode","client",
"--num-executors","1",
“--driver-java-options","-Xss4M",
"--conf","spark.driver.maxResultSize=20g",
"--class”,”myApplicationClass”,
“~/myApplicationDirectory”,
“myJar.jar",
…
application specific arguments and paths to folders here
…],
”Type":"CUSTOM_JAR",
Thanks for any help.
It looks like it doesn't understand the ~ as referring to the home directory. Try changing "~/myApplicationDirectory" to "/home/ec2-user/myApplicationDirectory".
A little warning: in the sample in your question, straight quotation marks " are mixed with "smart" ones “. Make sure the "smart" quotation marks don't end up in your configuration file, or you will get very confusing error messages.
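If it helps, here is roughly the same step expressed with boto3 instead of the CLI, which sidesteps the quotation-mark problem entirely. The cluster id, region, and jar path are placeholders, and the argument list is only a sketch of the one in the question:

import boto3

emr = boto3.client('emr', region_name='us-east-1')    # placeholder region

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',                       # placeholder cluster id
    Steps=[{
        'Name': 'my spark step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'client',
                '--num-executors', '1',
                '--driver-java-options', '-Xss4M',
                '--conf', 'spark.driver.maxResultSize=20g',
                '--class', 'myApplicationClass',
                '/home/ec2-user/myApplicationDirectory/myJar.jar',  # absolute path instead of ~
                # application-specific arguments go here
            ],
        },
    }],
)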

Jenkins post build python script file

I am new to Jenkins, especially to using Python scripts in Jenkins. The problem I am facing is as follows:
I am trying to run a Python script from a Python file in a Jenkins post-build step. To my understanding, I have added all the plugins required for that purpose, i.e. the PostBuildScript plugin, the Python Jenkins plugin, etc.
Now, when I build, the console output shows that an invalid script command caused the failure. I have attached the results below. Can anybody help me with that, please?
In the post-build step I am providing the full (absolute) path to the Python script file, i.e.
[screenshot: Execute python Script path]
[screenshot: Results]
It may be useful to mention that I have also tried using just the path without python preceding it, and tried forward as well as backward slashes in the path, without any success.
I have managed to resolve that issue. There are two parts to the solution:
The first is if you want to run a simple Python script in post-build: add a post-build step "Execute Python Script" (that will require installing the post-build plugin). In the window created after adding the post-build step, you can simply put any Python command to run.
The second part of the solution is for when you would like to run the commands from a Python script file in that same post-build step window. In that case, make sure to put all the required Python files you want to execute into the Jenkins workspace, under the project directory (the project for which Jenkins is running).
Moreover, for Python 2.7, in order to execute that Python script file you simply need to write
execfile('file.py')
One more thing to remember is to add the python.exe path to the environment variables (PATH).
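Note that execfile was removed in Python 3. If the post-build step runs under Python 3, a rough equivalent (assuming file.py sits in the step's working directory, e.g. the Jenkins workspace) is:

# Python 3 replacement for Python 2's execfile('file.py')
with open('file.py') as f:
    exec(compile(f.read(), 'file.py', 'exec'))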