AWS Glue SageMaker notebook "No module named awsglue.transforms"

I've created a SageMaker notebook to develop AWS Glue jobs, but when running through the provided example ("Joining, Filtering, and Loading Relational Data with AWS Glue") I get the following error:
No module named awsglue.transforms
Does anyone know what I've set up wrong, or haven't set up, that causes the import to fail?

You'll need to download the AWS Glue Python library files; there is a separate download for Glue 0.9 and for Glue 1.0 (check your Glue jobs to see which version you're running).
Put the zip in S3 and reference it in the "Python library path" on your Dev Endpoint.
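If you would rather script that step than click through the console, a minimal boto3 sketch along these lines should work; the endpoint name and the S3 path are placeholders, not values from the question.

```python
import boto3

glue = boto3.client("glue")

# Point an existing Dev Endpoint at the uploaded awsglue library zip.
# "my-dev-endpoint" and the S3 path are hypothetical placeholders.
glue.update_dev_endpoint(
    EndpointName="my-dev-endpoint",
    CustomLibraries={"ExtraPythonLibsS3Path": "s3://my-bucket/libs/PyGlue.zip"},
    UpdateEtlLibraries=True,  # ask Glue to reload the libraries on the endpoint
)
```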

I had the same issue and the selected solution did not work for me.
I did manage to get it working by using CloudFormation (AWS::Glue::DevEndpoint).
Through trial and error I noticed that you can't specify both NumberOfNodes and NumberOfWorkers at the same time; you have to specify one or the other.
Using NumberOfNodes: 5 resulted in the exact same error as described in the question, but using the second option (NumberOfWorkers) worked perfectly.
So to conclude, to fix this error you can use CloudFormation and make sure to use the NumberOfWorkers property.
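For reference, here is a minimal boto3 sketch of the equivalent CreateDevEndpoint call that the AWS::Glue::DevEndpoint resource maps to; the endpoint name, role ARN, worker settings and library path are assumptions, and the point is that only worker-based sizing (WorkerType/NumberOfWorkers) is supplied, with NumberOfNodes left out.

```python
import boto3

glue = boto3.client("glue")

# Create the Dev Endpoint with worker-based sizing only; NumberOfNodes is
# deliberately omitted. All names, ARNs and paths are hypothetical placeholders.
glue.create_dev_endpoint(
    EndpointName="my-dev-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/MyGlueDevEndpointRole",
    GlueVersion="1.0",
    WorkerType="G.1X",
    NumberOfWorkers=5,
    ExtraPythonLibsS3Path="s3://my-bucket/libs/PyGlue.zip",
)
```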

Hm, this approach doesn't work for me.
I've put the zip in S3 and referenced it in the "Python library path", and it still doesn't work.

Add the AWSGlueServiceNotebookRole policy to your Dev Endpoint's IAM role, then restart your kernel and rerun.
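If you prefer to make that attachment from code rather than the console, a rough boto3 sketch is below; it assumes AWSGlueServiceNotebookRole here means the AWS managed policy of that name, and the role name is a placeholder for the role your Dev Endpoint actually uses.

```python
import boto3

iam = boto3.client("iam")

# Attach the AWS managed AWSGlueServiceNotebookRole policy to the Dev Endpoint's
# IAM role. "MyGlueDevEndpointRole" is a hypothetical placeholder.
iam.attach_role_policy(
    RoleName="MyGlueDevEndpointRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceNotebookRole",
)
```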

Related

AWS SageMaker notebook not working, how can I solve the issue?

The code failed because of a fatal error:
Error sending http request and maximum retry encountered..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
Note: there are no logs in CloudWatch that would help figure out the issue.
Are you looking to run Spark queries? If not, you can use the Python kernel, or any kernel other than Sparkmagic, and proceed with your work.
If you do need Spark, see this blog post and the documentation on using Spark with notebook instances.

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster that uses one of the outdated image versions (for example, 1.3.94-debian10) containing the Apache Log4j 2 vulnerability. The goal is to trigger the related alert (DATAPROC_IMAGE_OUTDATED) in order to check how SCC works (it is just for a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10 but got the following message: ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster. That makes sense, as it protects the cluster.
I did some research and discovered that I will have to create a custom image with that version and generate the cluster from it. The thing is, I have tried reading the documentation and looking for a tutorial, but I still can't understand how to start or how to run the generate_custom_image.py file, for example, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you

How to use MapBooleanAsBoolean in AWS Database Migration Service?

In its latest release, AWS DMS introduced the MapBooleanAsBoolean connection parameter to allow keeping booleans as booleans when migrating from Postgres to Redshift. Unfortunately, the docs are very imprecise about how to use it. I tried adding it as an extra connection attribute on both the source and target endpoints, as both mapBooleanAsBoolean and migrateBooleanAsBoolean, but nothing worked for me. Has anyone been able to make it work?
Link to docs for reference:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_ReleaseNotes.html
Not sure if you found an answer for this, but adding the following extra connection attribute to the source Postgres endpoint:
mapBooleanAsBoolean=true;
worked for me. My target was S3 Parquet files, though.
It can be done via the console.
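If you would rather set the attribute programmatically than in the console, a hedged boto3 sketch using the DMS ModifyEndpoint API looks roughly like this; the endpoint ARN is a placeholder.

```python
import boto3

dms = boto3.client("dms")

# Set the extra connection attribute on the source Postgres endpoint.
# The ARN below is a hypothetical placeholder for your own endpoint.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLE",
    ExtraConnectionAttributes="mapBooleanAsBoolean=true;",
)
```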

ModuleNotFoundError: No module named 'aiohttp' in AWS Glue

I am using AWS Glue to create an ETL workflow, where I fetch data from an API and load it into RDS. In AWS Glue I use a PySpark script. In the same script, I use the 'aiohttp' and 'asyncio' modules of Python to call my API asynchronously. But in AWS Glue it throws a "module not found" error for aiohttp only.
I have already tried different versions of the aiohttp module and tested them in the Glue job, but it still throws the same error. Can someone please help me with this?
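For context, the asynchronous call pattern the question describes looks roughly like this; the endpoint URL and JSON response shape are placeholder assumptions, and the failure happens on the aiohttp import because the module is not bundled with Glue by default.

```python
import asyncio

import aiohttp  # this import is what fails in the Glue job


async def fetch_json(session, url):
    # Fetch one URL and decode the JSON body.
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()


async def fetch_all(urls):
    # Issue all requests concurrently over a shared session.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_json(session, u) for u in urls))


# The endpoint is a hypothetical placeholder.
records = asyncio.run(fetch_all(["https://api.example.com/items?page=1"]))
```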
Glue 2.0
AWS Glue version 2.0 lets you provide additional Python modules or different versions at the job level. You can use the --additional-python-modules job parameter with a list of comma-separated Python modules to add a new module or change the version of an existing module.
Also, within the --additional-python-modules option you can specify an Amazon S3 path to a Python wheel module.
This link to official documentation lists all modules already available. If you need a different version or need one to be installed, it can be specified in the parameter mentioned above.
Glue 1.0 & 2.0
You can zip the Python library, upload it to S3, and specify the path with the --extra-py-files job parameter.
See the link to the official documentation for more information.
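As a concrete illustration, here is a hedged boto3 sketch that sets both parameters as default arguments when creating a Glue 2.0 job; the job name, role, script location, module version and zip path are placeholder assumptions, not values from the question.

```python
import boto3

glue = boto3.client("glue")

# Create a Glue 2.0 job that installs aiohttp from PyPI via
# --additional-python-modules and also ships an extra library zip from S3 via
# --extra-py-files. All names, paths and versions are hypothetical placeholders.
glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/etl_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--additional-python-modules": "aiohttp==3.8.4",
        "--extra-py-files": "s3://my-bucket/libs/my_helpers.zip",
    },
)
```

For an existing job, the same keys can also be added under the job's parameters in the console.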

Dynamically append function to AppSync Pipeline Resolver via CloudFormation

I am currently developing an AppSync-based API in a domain-driven manner, so we need to add a function to an already created pipeline resolver. Does anybody know whether there is any way to do this via CloudFormation without using a custom resource?
Thanks in advance, Sven
Terraform can do this neatly if you build the provider from this pull request yourself, or vote on this issue to get it merged into a public release.
The new syntax is described here.
The build process is actually quite simple. It took me about 30 min end-to-end.
1. Install GoLang.
2. Clone the repo with the changes and sync it with the main (upstream) repo.
3. Make sure you cloned it into the go\src\github.com\terraform-providers\terraform-provider-aws folder.
4. Run go build from go\src\github.com\terraform-providers\terraform-provider-aws.
5. Replace the .terraform\plugins\...\terraform-provider-aws-* executable with the one you compiled.
6. Run terraform init.
7. Test by trying to import a function: terraform import aws_appsync_function.example xxxxx-yyyyy
I hope the pull request gets merged by the time you read this.