ModuleNotFoundError: No module named 'aiohttp' in AWS Glue

I am using AWS Glue to create an ETL workflow that fetches data from an API and loads it into RDS. The Glue job runs a PySpark script, and within that script I use Python's 'aiohttp' and 'asyncio' modules to call my API asynchronously. But in AWS Glue I get a ModuleNotFoundError for aiohttp only.
I have already tried different versions of the aiohttp module in the Glue job, but it still throws the same error. Can someone please help me with this?

Glue 2.0
AWS Glue version 2.0 lets you provide additional Python modules or different versions at the job level. You can use the --additional-python-modules job parameter with a list of comma-separated Python modules to add a new module or change the version of an existing module.
Also, within the --additional-python-modules option you can specify an Amazon S3 path to a Python wheel module.
The official documentation lists all modules that are already available. If you need a different version, or a module that is not preinstalled, specify it in the parameter mentioned above, as in the sketch below.
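For example, a minimal boto3 sketch of creating such a job (the job name, role ARN, and script path below are placeholders):

    import boto3

    glue = boto3.client("glue")

    # Create a Glue 2.0 job that installs aiohttp at job start.
    # The version pin is optional; any pip-installable spec works.
    glue.create_job(
        Name="api-to-rds-etl",                                  # placeholder
        Role="arn:aws:iam::123456789012:role/MyGlueRole",       # placeholder
        GlueVersion="2.0",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/etl.py",  # placeholder
            "PythonVersion": "3",
        },
        DefaultArguments={
            "--additional-python-modules": "aiohttp==3.7.4",
        },
    )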
Glue 1.0 & 2.0
You can zip the Python library, upload it to S3, and pass the S3 path via the --extra-py-files job parameter.
See the official documentation for more information; a sketch of the approach follows.
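A minimal sketch of that approach, assuming a pure-Python package directory mylib/ and placeholder bucket and job names:

    import zipfile
    import boto3

    # Zip the library so that "import mylib" resolves from the archive root.
    with zipfile.ZipFile("mylib.zip", "w") as zf:
        zf.write("mylib/__init__.py")

    # Upload the archive to S3.
    boto3.client("s3").upload_file("mylib.zip", "my-bucket", "libs/mylib.zip")

    # Point an existing job at it via --extra-py-files.
    boto3.client("glue").update_job(
        JobName="api-to-rds-etl",  # placeholder
        JobUpdate={
            "Role": "arn:aws:iam::123456789012:role/MyGlueRole",             # placeholder
            "Command": {"Name": "glueetl",
                        "ScriptLocation": "s3://my-bucket/scripts/etl.py"},  # placeholder
            "DefaultArguments": {"--extra-py-files": "s3://my-bucket/libs/mylib.zip"},
        },
    )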

Related

No module named 'openpyxl'

I am trying to write an AWS Lambda function that extracts PDF invoice data via the AWS Textract service and saves the data into Excel. To do this, I installed the openpyxl library, created a zip file for it, and created a layer in the Lambda function that uses the openpyxl library. I am getting the following error: No module named 'openpyxl'. I would appreciate your assistance in resolving it.
Have you tried Textractor (pip install amazon-textract-textractor)? It comes with built-in export-to-excel features and with pre-built Lambda layers on the official GitHub repository: https://aws-samples.github.io/amazon-textract-textractor/using_in_lambda.html
Note that the available Lambda layer uses XlsxWriter instead of openpyxl.
Disclaimer: I am a maintainer of Textractor.
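For reference, a short sketch of the export flow, following the Textractor documentation (method names and signatures may differ slightly between versions, so treat this as an outline rather than exact API):

    from textractor import Textractor
    from textractor.data.constants import TextractFeatures

    # Analyze a PDF invoice with table detection enabled.
    extractor = Textractor(profile_name="default")
    document = extractor.analyze_document(
        file_source="invoice.pdf",  # placeholder path
        features=[TextractFeatures.TABLES],
    )

    # Export the detected tables to an Excel workbook
    # (uses XlsxWriter under the hood, not openpyxl).
    document.export_tables_to_excel("invoice.xlsx")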

AWS Glue Sagemaker Notebook "No module named awsglue.transforms"

I've created a SageMaker notebook to develop AWS Glue jobs, but when running through the provided example ("Joining, Filtering, and Loading Relational Data with AWS Glue") I get the error "No module named awsglue.transforms".
Does anyone know what I've set up wrong, or haven't set up, that causes the import to fail?
You'll need to download the library files from here for Glue 0.9 or here for Glue 1.0 (Check your Glue jobs for the version).
Put the zip in S3 and reference it in the "Python library path" on your Dev Endpoint.
I had the same issue and the selected solution did not work for me.
I did manage to get it working by using CloudFormation (AWS::Glue::DevEndpoint).
Through trial and error I noticed that you can't specify both NumberOfNodes and NumberOfWorkers at the same time; you have to specify one or the other.
Using NumberOfNodes: 5 resulted in the exact same error as in the question, but using NumberOfWorkers worked perfectly.
So to conclude: to fix this error you can use CloudFormation and make sure to use the NumberOfWorkers property, as in the boto3 equivalent sketched below.
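For anyone who prefers the API over CloudFormation, the same fix can be sketched with boto3 (names and ARNs below are placeholders): specify NumberOfWorkers together with WorkerType and omit NumberOfNodes:

    import boto3

    glue = boto3.client("glue")

    # Create a dev endpoint using NumberOfWorkers (not NumberOfNodes).
    glue.create_dev_endpoint(
        EndpointName="glue-dev",                                 # placeholder
        RoleArn="arn:aws:iam::123456789012:role/MyGlueDevRole",  # placeholder
        GlueVersion="1.0",
        WorkerType="G.1X",
        NumberOfWorkers=5,
        ExtraPythonLibsS3Path="s3://my-bucket/libs/mylib.zip",   # placeholder
    )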
Hmm... this approach doesn't work for me.
I've just put the zip in the "Python library path" and referenced it, and it still doesn't work.
Add AWSGlueServiceNotebookRole to your Dev Endpoint IAM Role, restart your kernel and rerun

Problem creating Lambda function that has a Layer using boto3

If I try to use boto3 Lambda create_function() to create a Lambda function, and I try to include Layers via Layers=['string'] parameter, I get the following error message:
Unknown parameter in input: "Layers", must be one of: FunctionName, Runtime, Role, Handler, Code, Description, Timeout, MemorySize, Publish, VpcConfig, DeadLetterConfig, Environment, KMSKeyArn, TracingConfig, Tags
... any ideas? The documentation suggests that this should work, but something is clearly off here. NOTE: I also have a similar problem with "Layers" in update_function_configuration() as well.
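For context, a minimal sketch of the call being attempted (all names and ARNs are placeholders):

    import boto3

    client = boto3.client("lambda")

    # create_function with the Layers parameter; this only works on a
    # boto3/botocore version recent enough to know about Layers.
    client.create_function(
        FunctionName="my-function",                                        # placeholder
        Runtime="python3.7",
        Role="arn:aws:iam::123456789012:role/MyLambdaRole",                # placeholder
        Handler="app.handler",
        Code={"S3Bucket": "my-bucket", "S3Key": "my-function.zip"},        # placeholder
        Layers=["arn:aws:lambda:us-east-1:123456789012:layer:mylayer:1"],  # placeholder
    )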
My guess is that the version of boto3 that the AWS Lambda console uses has not been updated yet to support Layers, because when I run the same code locally on a machine with a fairly recent version of boto3, it runs without any problems. I have already tried both of the Python runtimes (3.6 and 3.7) listed in the AWS console, but neither worked. Those runtimes ship boto3 versions 1.7.74 and 1.9.42 respectively, while my local machine has 1.9.59. So perhaps support for Lambda Layers was added between 1.9.42 and 1.9.59.
My guess is that the version of boto3 that the AWS Lambda console uses has not been updated/refreshed yet to support Layers.
That's completely right. AWS usually updates the available libraries on AWS Lambda regularly, but hasn't updated them for several months now for unknown reasons.
The supported API endpoints are actually not defined in boto3, but in botocore.
Currently botocore 1.10.74 is available on AWS Lambda, while support for AWS Lambda Layers got added in botocore 1.12.56.
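You can confirm which versions your runtime actually provides with a two-line check inside the function:

    import boto3
    import botocore

    # Layers support requires botocore >= 1.12.56; older bundled
    # versions reject the "Layers" parameter as unknown.
    print(boto3.__version__, botocore.__version__)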
To avoid such incompatibilities between your code and the versions of the available libraries, you should create a deployment package containing boto3 and botocore in addition to your AWS Lambda function code, so your code uses your bundled versions instead of the ones AWS provides. That's what AWS suggests as part of their best practices as well:
Control the dependencies in your function's deployment package.
The AWS Lambda execution environment contains a number of libraries such as the AWS SDK for the Node.js and Python runtimes (a full list can be found here: Lambda Execution Environment and Available Libraries). To enable the latest set of features and security updates, Lambda will periodically update these libraries. These updates may introduce subtle changes to the behavior of your Lambda function. To have full control of the dependencies your function uses, we recommend packaging all your dependencies with your deployment package.

Read file from s3a along with AWS Athena SDK (1.11+)

I am writing a Spark/Scala program which submits a query to Athena (using aws-java-sdk-athena:1.11.420) and waits for the query to complete. Once the query is complete, my Spark program reads the query's output location directly from the S3 bucket over the s3a protocol, using Spark's sparkSession.read.csv() function.
In order to read the CSV file, I need to use org.apache.hadoop:hadoop-aws:2.8+ and org.apache.hadoop:hadoop-client:2.8+. Both of these libraries are built against AWS SDK version 1.10.6. However, there is no Athena SDK at that version; the oldest they have is 1.11+.
How can I resolve the conflict? I need the latest version of the AWS SDK to get access to Athena, but hadoop-aws pushes me back to an older version.
Is there another version of hadoop-aws that uses the 1.11+ AWS SDK? If so, which versions will work for me? If not, what other options do I have?
I found out that I can use hadoop-aws:3.1+, which comes with aws-java-sdk-bundle:1.11+. This AWS SDK bundle includes Athena.
However, I still need to run Spark with hadoop-common:3.1+ libraries, while my Spark cluster runs version 2.8 libraries.
Because of that, spark-submit jobs were failing while normal execution of the jar (java -jar myjar.jar) worked fine: Spark was replacing the Hadoop libraries I provided with the versions it was bundled with. The sketch below shows the overall flow.
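A PySpark sketch of that flow (the original program is Scala; the bucket, database, and table names here are placeholders), assuming hadoop-aws 3.1+ and the matching aws-java-sdk-bundle are on the Spark classpath, per the answer above:

    import time
    import boto3
    from pyspark.sql import SparkSession

    athena = boto3.client("athena")

    # Submit the query; Athena writes its result CSV to OutputLocation.
    qid = athena.start_query_execution(
        QueryString="SELECT * FROM my_table",         # placeholder
        QueryExecutionContext={"Database": "my_db"},  # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    # Read the result CSV directly over s3a.
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.option("header", "true").csv(
        f"s3a://my-bucket/athena-results/{qid}.csv")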

Use cases for AWS Glue with its current limitations on Python libraries

What are the best use cases where I can use AWS Glue services in ETL with its limitations on Python packages support?
According to AWS Glue Documentation:
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.
I have attempted some ETL jobs run on AWS Glue, packaging some libraries such as pandas, holidays, etc. as separate zip files, but the jobs failed because of these libraries (ImportError: pandas).
Does AWS have any ETA for supporting such libraries in the near future?
Or is it too early to adopt AWS Glue, given that the limitation on Python libraries is a major blocker right now?
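One rough way to anticipate the "pure Python only" restriction quoted above is to inspect a package's wheel for compiled extension modules before shipping it to Glue; a pandas wheel contains .so files, while a pure-Python wheel like holidays does not (the wheel filenames below are placeholders):

    import zipfile

    def has_c_extensions(wheel_path: str) -> bool:
        # Wheels are zip archives; compiled extensions show up as
        # .so (Linux) or .pyd (Windows) members.
        with zipfile.ZipFile(wheel_path) as zf:
            return any(name.endswith((".so", ".pyd")) for name in zf.namelist())

    print(has_c_extensions("pandas-0.23.4-cp36-cp36m-manylinux1_x86_64.whl"))  # True
    print(has_c_extensions("holidays-0.9.8-py2.py3-none-any.whl"))             # False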