Sklearn on AWS Lambda

I want to use sklearn on AWS Lambda. sklearn depends on scipy (173 MB) and numpy (75 MB). The combined size of these packages exceeds the AWS Lambda disk space limit of 250 MB.
How can I use sklearn on AWS Lambda?

This guy gets it down to 40 MB, though I have not tried it myself yet.
The relevant GitHub repo.
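For reference, the usual tricks such build scripts rely on (a hedged sketch, not the exact script from that repo) are to install into a build directory, strip the compiled shared objects, and delete test suites before zipping:

pip install scikit-learn -t build/
find build/ -name "*.so" | xargs strip
find build/ -type d -name tests | xargs rm -rf
cd build && zip -r9 ../sklearn-lambda.zip .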

There are two ways to do this:
1) Installing the modules dynamically
2) AWS Batch
1) Installing the modules dynamically
import shutil, subprocess, sys

def lambda_handler(event, context):
    # install numpy into /tmp at runtime (assumes pip is bundled/available) and make it importable
    subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy", "--target", "/tmp/pkgs"])
    sys.path.insert(0, "/tmp/pkgs")
    # ... numpy code ...
    # uninstall numpy (free /tmp space) before installing scipy
    shutil.rmtree("/tmp/pkgs")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "scipy", "--target", "/tmp/pkgs"])
    # ... scipy code ...
or vice versa, depending on your code.
2) Using AWS Batch
This is the best way if you don't want any limitation regarding storage space.
You just need to build a Docker image and list all required packages and libraries in a requirements.txt file.
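For example, a minimal image for such a Batch job could look like this (the entry script job.py and the exact package list are assumptions, not from the answer above):

# requirements.txt
scikit-learn
scipy
numpy

# Dockerfile
FROM python:3.9-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY job.py .
CMD ["python", "job.py"]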

I wanted to do the same, and it was very difficult indeed. I ended up buying this layer that includes scikit-learn, pandas, numpy and scipy.
https://www.awslambdas.com/layers/3/aws-lambda-scikit-learn-numpy-scipy-python38-layer
There is another layer that includes xgboost as well.

Related

Importing the numpy C-extensions failed to run in local container (sam local invoke) whereas it runs perfectly locally (python .\test.py)

I know this question has been asked many times before. After days of struggle, I have narrowed the issue down to this:
The code runs perfectly locally using python .\test.py, but fails when running in the sam local invoke container.
AWS supports Python 3.9, so I tried to figure out the dependency combination for these imports:
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter
from sktime.transformations.series.outlier_detection import HampelFilter
from sklearn.linear_model import LinearRegression
from scipy.signal import find_peaks, peak_prominences
The code runs locally without problems but fails to run in the container.
Please help
I figured out that sktime will support Python 3.9 but depends on numpy>=1.21.0.
On the other hand, no matter what I tried, I couldn't figure out which dependencies will work on AWS Python 3.9.
I tried installing from wheels and trying different versions of numpy, but it was always the same error.
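For reference, forcing Lambda-compatible manylinux wheels when vendoring dependencies usually looks something like this (the version pin here is just an example):

pip install numpy==1.21.0 --platform manylinux2014_x86_64 --implementation cp --python-version 3.9 --only-binary=:all: --target ./package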
I ended up limiting the requirements like so:
numpy==1.21.0
pandas
sktime
This installed what it needed.
Then I figured out that I can run it locally. No issues.
So the problem must be with the container sam local invoke runs.
Note that I can use the AWS layer for pandas on Python 3.9, but I also need sktime, and that didn't work for me (not to mention the 250 MB upload limit, a struggle I leave for later).
OK, I have a better understanding of the whole issue now. Here is what I have learned:
When you install Python for AWS you should install the supported version (3.9 as of now). This makes all requirements.txt installations match Python 3.9.
So all installations should match without any issues.
Still, running locally did not work for me. I do not know why!
One solution involves attaching the Lambda to EFS, but that includes many steps and manual work. You don't want to go there.
The solution for me was using Lambda container images. This was easier than I expected. The steps include some Docker configuration and having the SAM YAML template use the Dockerfile; 'sam deploy' will then publish a Docker image for you to the Elastic Container Registry (ECR).
See:
Creating Lambda container images
Using container image support for AWS Lambda with AWS SAM
The first link will lead to some other supporting link within AWS documentation.
Hope it will help someone.
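To give a rough idea of what those links describe, an image-based function boils down to a Dockerfile built on the Lambda Python base image plus a SAM resource that points at it (file names and the tag below are placeholders):

# Dockerfile
FROM public.ecr.aws/lambda/python:3.9
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt
COPY app.py ${LAMBDA_TASK_ROOT}
CMD ["app.lambda_handler"]

# template.yaml (excerpt)
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      PackageType: Image
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: .
      DockerTag: v1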

How to install python packages within Amazon Sagemaker Processing Job?

I am trying to create a Sklearn processing job in Amazon Sagemaker to perform some data transformation on my input data before model training.
I wrote a custom Python script, preprocessing.py, which does the necessary preprocessing. I use some Python packages in this script. Here is the Sagemaker example I followed.
When I try to submit the Processing Job I get an error -
............................Traceback (most recent call last):
File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'
I understand that my processing job is unable to find this package and that I need to install it. My question is: how can I accomplish this using the Sagemaker Processing Job API? Ideally there would be a way to define a requirements.txt in the API call, but I don't see such functionality in the docs.
I know I can create a custom image with the relevant packages and later use this image in the Processing Job, but this seems like too much work for something that should be built-in.
Is there an easier/more elegant way to install the packages needed in a Sagemaker Processing Job?
One way would be to call pip from Python:
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
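Spelled out at the top of preprocessing.py, that could look like this (the package name is just the one from the traceback above):

import subprocess
import sys

def install(package):
    # install the missing package into the running processing container
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install("snowflake-connector-python")

import snowflake.connector  # safe to import after the install above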
Another way would be to use an SKLearn Estimator (training job) instead to do the same thing. You can provide source_dir, which can include a requirements.txt file, and these requirements will be installed for you:
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,
    instance_count=1,
    instance_type="ml.m5.large",
)
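As a hypothetical layout: ./foo/ contains foo.py plus a requirements.txt listing the extra packages (e.g. snowflake-connector-python); the requirements are installed on the training instance before foo.py runs, so kicking off the job is just:

estimator.fit()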

pyarrow as lambda layer

I need help getting pyarrow as a Lambda layer for my Lambda function.
I am trying to read/write a parquet file and I am getting the error below:
"errorMessage": "Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.\npyarrow or fastparquet is required for parquet support".
I tried creating the layer myself by installing pyarrow on my EC2 instance with the command below:
pip3 install pandas pyarrow -t build/python/lib/python3.7/site-packages/ --system
but the resulting zip file is > 300 MB, so I cannot use it as a Lambda layer.
Any suggestions or solutions?
Thanks,
Firstly, all the packages need to be in a directory called python, nothing more, nothing less, and you can zip the whole python directory and upload it to Lambda as a layer.
Secondly, pandas and pyarrow are pretty big. I did use them both in one Lambda function without any issue, but I'm afraid you may need to separate those two packages into two layers to make it work. Do NOT use fastparquet; it is too big and exceeds the 250 MB limit of Lambda.
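Concretely, building a layer zip with that python/ structure looks something like this (pin whatever pyarrow version you need; paths are just an example):

mkdir -p layer/python
pip3 install pyarrow -t layer/python/
cd layer && zip -r ../pyarrow-layer.zip python/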

nltk dependencies in dataflow

I know that external Python dependencies can be fed into Dataflow via the requirements.txt file. I can successfully load nltk in my Dataflow script. However, nltk often needs further files to be downloaded (e.g. stopwords or punkt). Usually on a local run of the script, I can just run
nltk.download('stopwords')
nltk.download('punkt')
and these files will be available to the script. How do I do this so the files are also available to the workers? It seems like it would be extremely inefficient to place those commands in a DoFn/CombineFn if they only have to happen once per worker. What part of the script is guaranteed to run once on every worker? That would probably be the place to put the download commands.
According to this, Java allows the staging of resources via classpath. That's not quite what I'm looking for in Python. I'm also not looking for a way to load additional python resources. I just need nltk to find its files.
You can probably use '--setup_file setup.py' to run these custom commands: https://cloud.google.com/dataflow/pipelines/dependencies-python#pypi-dependencies-with-non-python-dependencies. Does this work in your case?
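A minimal sketch of that pattern, assuming the pipeline is launched with --setup_file ./setup.py (the package name and the nltk data path below are assumptions):

import subprocess
import setuptools
from distutils.command.build import build as _build

# commands run once on each Dataflow worker while the package is being installed
CUSTOM_COMMANDS = [
    ["pip", "install", "nltk"],
    ["python", "-m", "nltk.downloader", "-d", "/usr/local/share/nltk_data",
     "stopwords", "punkt"],
]

class CustomCommands(setuptools.Command):
    user_options = []
    def initialize_options(self):
        pass
    def finalize_options(self):
        pass
    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

class build(_build):
    # run the custom commands as part of the normal build step
    sub_commands = _build.sub_commands + [("CustomCommands", None)]

setuptools.setup(
    name="my-dataflow-pipeline",
    version="0.0.1",
    install_requires=["nltk"],
    packages=setuptools.find_packages(),
    cmdclass={"build": build, "CustomCommands": CustomCommands},
)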

Reduce size of serverless deploy package

I have a Python script that I want to run as a Lambda function on AWS. Unfortunately, the unzipped package is bigger than the allowed 250 MB, mainly due to numpy (85 MB) and pandas (105 MB).
I have already done the following but the size is still too big:
1) Excluded unused folders:
package:
  exclude:
    - testdata/**
    - out/**
    - etc/**
2) Zipped the Python packages:
custom:
  pythonRequirements:
    dockerizePip: true
    zip: true
If I unzip the zip file generated by serverless package, I find a .requirements.zip which contains my Python packages, and then there is also my virtual environment in the .virtualenv/ folder, which contains, again, all the Python packages. I have tried to exclude the .virtualenv/../lib/python3.6/site-packages/** folder in serverless.yml, but then I get an Internal server error when calling the function.
Are there any other parameters to decrease the package size?
The .virtualenv/ directory should not be included in the zip file.
If the directory is located in the same directory as serverless.yml, then it should be added to exclude in the serverless.yml file, or else it gets packaged along with the other files:
package:
  exclude:
    - ...
    - .virtualenv/**
  include:
    - ...
(Are you sure you need pandas and numpy in a microservice? There is nothing "micro" in those libraries).
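One more note: with zip: true from serverless-python-requirements, the handler module has to unpack the zipped requirements before any third-party imports; the plugin's documented pattern is:

try:
    import unzip_requirements
except ImportError:
    pass

import pandas  # now resolvable from the unpacked requirements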
There is a way. Deploy your Lambda with Zappa (https://github.com/Miserlou/Zappa). It's a convenient way to write, deploy and manage your Python Lambdas anyway. But with Zappa you can specify an option called slim_handler. If set to true, most of your code will reside on S3 and will be pulled in once the Lambda is executed:
AWS currently limits Lambda zip sizes to 50 megabytes. If your project
is larger than that, set slim_handler: true in your
zappa_settings.json. In this case, your fat application package will
be replaced with a small handler-only package. The handler file then
pulls the rest of the large project down from S3 at run time! The
initial load of the large project may add to startup overhead, but the
difference should be minimal on a warm lambda function. Note that this
will also eat into the memory space of your application function.
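For example, a zappa_settings.json along these lines (the stage name, region and bucket are placeholders) turns on the slim handler:

{
    "production": {
        "app_function": "app.app",
        "aws_region": "us-east-1",
        "s3_bucket": "my-zappa-deployments",
        "slim_handler": true
    }
}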