pyarrow as lambda layer - amazon-web-services

I need help getting pyarrow set up as a Lambda layer for my Lambda function.
I am trying to read/write a parquet file and I am getting the error below:
"errorMessage": "Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.\npyarrow or fastparquet is required for parquet support".
I tried creating the layer myself by installing pyarrow on my EC2 instance with the command below,
pip3 install pandas pyarrow -t build/python/lib/python3.7/site-packages/ --system
but the resulting zip file is over 300 MB, so I cannot use it as a Lambda layer.
Any suggestions or solutions?
Thanks,

Firstly, all the packages need to be in a directory called python, nothing more, nothing less; you can then zip the whole python directory and upload it to Lambda.
Secondly, pandas and pyarrow are pretty big. I did use them both in one Lambda function without any issue, but I'm afraid you may need to separate those two packages into two layers to make it work. Do NOT use fastparquet; it is too big and exceeds the 250 MB limit of Lambda.
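Once a working layer is attached, a quick sanity check is a handler that round-trips a small frame through /tmp with the pyarrow engine (a minimal sketch; the column names and return value are just illustrative):
import pandas as pd

def lambda_handler(event, context):
    # /tmp is the only writable path inside Lambda
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    df.to_parquet("/tmp/sample.parquet", engine="pyarrow")
    out = pd.read_parquet("/tmp/sample.parquet", engine="pyarrow")
    return {"rows": len(out)}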

Related

Can't Import packages from layers in AWS Lambda

I know this question exists in several places, but even after following different guides/answers I still can't get it to work. I have no idea what I am doing wrong. I have a Python Lambda function on AWS where I need to do an "import requests". This is my approach so far.
Create a .zip of the packages. Locally I do:
pip3 install requests -t ./
zip -r okta_layer.zip .
Upload the .zip to a Lambda layer:
I go to the AWS console, go to Lambda layers, and create a new layer based on this .zip file.
I go to my Lambda Python function and add the layer to the function directly from the console. I can now see the layer under "Layers" for the Lambda function. But when I run the function, it still complains:
Unable to import module 'lambda_function': No module named 'requests'
I solved the problem. Apparently I needed a .zip file with a "python" folder inside, and all the packages inside that "python" folder.
I only had the packages directly in the zip, without a "python" folder ...
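For what it's worth, layer contents are extracted under /opt, and for Python runtimes /opt/python is on the import path, which is why the top-level "python" folder matters. A small diagnostic handler (just a sketch) can show where, or whether, requests actually resolves:
import importlib.util
import sys

def lambda_handler(event, context):
    # list the paths Lambda searches, then where (or whether) requests is found
    spec = importlib.util.find_spec("requests")
    return {
        "sys_path": sys.path,
        "requests_found_at": spec.origin if spec else None,
    }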

Trouble opening audio files stored on S3 in SageMaker

I stored around 300 GB of audio data (mostly mp3/wav) on Amazon S3 and am trying to access it in a SageMaker notebook instance to do some data transformations. I'm trying to use either torchaudio or librosa to load a file as a waveform. torchaudio expects a file path as input; librosa can use either a file path or a file-like object. I tried using s3fs to get the URL to the file, but torchaudio doesn't recognize it as a file. And apparently SageMaker has problems installing librosa, so I can't use that. What should I do?
For anyone who has this issue and has to use SageMaker: I found that installing librosa with
!conda install -y -c conda-forge librosa
rather than via pip allowed me to use it in SageMaker.
I ended up not using SageMaker for this, but for anybody else having similar problems, I solved this by opening the file using s3fs and writing it to a tempfile.NamedTemporaryFile. This gave me a file path that I could pass into either torchaudio.load or librosa.core.load. This was also important because I wanted the extra resampling functionality of librosa.core.load, but it doesn't accept file-like objects for loading mp3s.
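A sketch of that tempfile approach, assuming s3fs and librosa are installed, with a placeholder bucket/key and sample rate:
import tempfile

import librosa
import s3fs

fs = s3fs.S3FileSystem()

def load_audio(s3_path="my-bucket/audio/clip.mp3", sr=22050):
    # copy the S3 object into a named temp file so libraries that want a
    # real file path (torchaudio, librosa's mp3 handling) can open it
    with fs.open(s3_path, "rb") as remote:
        with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp:
            tmp.write(remote.read())
            tmp.flush()
            return librosa.load(tmp.name, sr=sr)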

nltk dependencies in dataflow

I know that external Python dependencies can be fed into Dataflow via the requirements.txt file. I can successfully load nltk in my Dataflow script. However, nltk often needs further files to be downloaded (e.g. stopwords or punkt). Usually, on a local run of the script, I can just run
nltk.download('stopwords')
nltk.download('punkt')
and these files will be available to the script. How do I do this so the files are also available to the worker scripts? It seems like it would be extremely inefficient to place those commands into a DoFn/CombineFn if they only have to happen once per worker. What part of the script is guaranteed to run once on every worker? That would probably be the place to put the download commands.
According to this, Java allows the staging of resources via classpath. That's not quite what I'm looking for in Python. I'm also not looking for a way to load additional python resources. I just need nltk to find its files.
You can probably use '--setup_file setup.py' to run these custom commands: https://cloud.google.com/dataflow/pipelines/dependencies-python#pypi-dependencies-with-non-python-dependencies. Does this work in your case?
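A sketch of what such a setup.py could look like, following the custom-commands pattern described on that page (the package name, corpora list and download directory are placeholders to adjust):
import subprocess

import setuptools
from distutils.command.build import build as _build

# download nltk data into one of nltk's standard search paths on each worker
CUSTOM_COMMANDS = [
    ["python", "-m", "nltk.downloader", "-d", "/usr/share/nltk_data", "stopwords", "punkt"],
]

class build(_build):
    # run the custom commands as part of the normal build step
    sub_commands = _build.sub_commands + [("CustomCommands", None)]

class CustomCommands(setuptools.Command):
    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

setuptools.setup(
    name="my-pipeline",
    version="0.0.1",
    install_requires=["nltk"],
    packages=setuptools.find_packages(),
    cmdclass={"build": build, "CustomCommands": CustomCommands},
)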

Reduce size of serverless deploy package

I have a Python script that I want to run as a Lambda function on AWS. Unfortunately, the unzipped package is bigger than the allowed 250 MB, mainly due to numpy (85 MB) and pandas (105 MB).
I have already done the following, but the size is still too big:
1) Excluded unused folders:
package:
  exclude:
    - testdata/**
    - out/**
    - etc/**
2) Zipped the python packages:
custom:
  pythonRequirements:
    dockerizePip: true
    zip: true
If I unzip the zip file generated by serverless package, I find a .requirements.zip which contains my Python packages, and then there is also my virtual environment in the .virtualenv/ folder, which contains, again, all the Python packages. I have tried to exclude the .virtualenv/../lib/python3.6/site-packages/** folder in serverless.yml, but then I get an Internal server error when calling the function.
Are there any other parameters to decrease the package size?
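One detail worth noting about the zip: true option from serverless-python-requirements: the zipped requirements have to be unpacked at runtime, so the handler module needs the plugin's small import shim at the top. A minimal sketch (the handler body is just a placeholder):
try:
    # unpacks .requirements.zip on cold start when zip: true is used
    import unzip_requirements
except ImportError:
    pass

import pandas as pd

def handler(event, context):
    return {"pandas_version": pd.__version__}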
The .virtualenv/ directory should not be included in the zip file.
If the directory is located in the same directory as serverless.yml, then it should be added to exclude in the serverless.yml file; otherwise it gets packaged along with the other files:
package:
  exclude:
    - ...
    - .virtualenv/**
  include:
    - ...
(Are you sure you need pandas and numpy in a microservice? There is nothing "micro" in those libraries).
There is a way. Deploy your Lambda with Zappa: https://github.com/Miserlou/Zappa. It's a convenient way to write, deploy and manage your Python Lambdas anyway. But with Zappa you can specify an option called slim_handler. If set to true, most of your code will reside on S3 and will be pulled in once the Lambda is executed:
AWS currently limits Lambda zip sizes to 50 megabytes. If your project is larger than that, set slim_handler: true in your zappa_settings.json. In this case, your fat application package will be replaced with a small handler-only package. The handler file then pulls the rest of the large project down from S3 at run time! The initial load of the large project may add to startup overhead, but the difference should be minimal on a warm lambda function. Note that this will also eat into the memory space of your application function.

Sklearn on aws lambda

I want to use sklearn on AWS Lambda. sklearn has dependencies on scipy (173 MB) and numpy (75 MB). The combined size of all these packages exceeds the AWS Lambda disk space limit of 256 MB.
How can I use sklearn on AWS Lambda?
This guy gets it down to 40MB, though I have not tried it myself yet.
The relevant Github repo.
There are two ways to do this:
1) installing the modules dynamically
2) AWS Batch
1) Installing the modules dynamically
import shutil, subprocess, sys

def lambda_handler(event, context):
    # install numpy into /tmp (the only writable path in Lambda) and import it;
    # this assumes pip is available to the function, e.g. bundled with the deployment package
    subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy", "-t", "/tmp/pkgs"])
    sys.path.insert(0, "/tmp/pkgs")
    # ... numpy code ...
    # remove numpy before installing scipy so the two packages never sit on disk together
    shutil.rmtree("/tmp/pkgs")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "scipy", "-t", "/tmp/pkgs"])
    # ... scipy code ...
or vice versa, depending on your code.
2) Using AWS Batch
This is the best way, as you don't have any limitation regarding memory space.
You just need to build a Docker image and list all the required packages and libraries in the requirements.txt file.
I wanted to do the same, and it was very difficult indeed. I ended up buying this layer that includes scikit-learn, pandas, numpy and scipy.
https://www.awslambdas.com/layers/3/aws-lambda-scikit-learn-numpy-scipy-python38-layer
There is another layer that includes xgboost as well.