AWS Glue Python shell jobs: No space left on device

I'm running a Glue job with the following configuration:
Type: Python Shell
Python version: Python 2 (1.0) [I created these jobs without specifying a version; per the documentation, they should default to 0.9.]
The job fails to initialize with the following error:
No space left on device: 'build/bdist.linux-x86_64/egg/pycparser'
Has anyone else encountered this error, or found any potential resolutions?
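Not a fix for the disk error itself, but for reference, the Glue and Python versions can be pinned explicitly at job creation instead of relying on the default; a minimal boto3 sketch (the job name, role ARN, and script location below are illustrative placeholders, not values from the original job):
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="my-python-shell-job",                        # illustrative name
    Role="arn:aws:iam::123456789012:role/MyGlueRole",  # illustrative role ARN
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",  # illustrative path
        "PythonVersion": "3",
    },
    GlueVersion="1.0",   # pin the Glue version explicitly
    MaxCapacity=0.0625,  # Python shell jobs run on 0.0625 or 1 DPU
)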

Related

GCP Serverless pyspark: Illegal character in path at index

I'm trying to run a simple hello-world Python script on serverless PySpark on GCP using gcloud (from a local Windows machine).
if __name__ == '__main__':
    print("Hello")
This always results in the following error:
=========== Cloud Dataproc Agent Error ===========
java.lang.IllegalArgumentException: Illegal character in path at index 38: gs://my-bucket/dependencies\hello.py
at java.base/java.net.URI.create(URI.java:883)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:592)
The gcloud command:
gcloud dataproc batches submit pyspark hello.py --batch=hello-batch-5 --deps-bucket=my-bucket --region=us-central1
On further analysis, I found that gcloud uploads hello.py as dependencies\hello.py under the {deps-bucket} folder, and Java considers the backslash '\' an illegal character.
Has anyone encountered a similar situation?
As @Ronak mentioned, can you double-check the bucket name? I replicated your task by simply copying your code to my Google Cloud Shell, and it ran just fine. For your next run, can you delete the dependencies folder and run the batch job again?
See my replication and the dependencies path created after running the job (screenshots omitted).
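As a side note, the backslash in the object name is exactly what a Windows-style path join produces; a minimal, GCP-free illustration in plain Python:
# Joining with the OS separator on Windows yields a backslash, which then
# ends up verbatim in the object name, while GCS paths expect forward slashes.
import ntpath, posixpath

print(ntpath.join("dependencies", "hello.py"))     # dependencies\hello.py
print(posixpath.join("dependencies", "hello.py"))  # dependencies/hello.py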

SAM CLI and Quarkus: /var/task/bootstrap: no such file or directory

When I try to use the SAM CLI to invoke my Quarkus native function locally as per the tutorial, it fails to run with the error below: no such file or directory for /var/task/bootstrap. function.zip does exist and contains bootstrap. Does anyone have any ideas how to solve this?
OS: Ubuntu 18 (on VirtualBox)
walter@ubuntu18 brialambda/target $ sam local invoke --template sam.native.yaml --event ../payload.json
Invoking not.used.in.provided.runtime (provided)
Decompressing /home/walter/workspace/walterlambda/target/function.zip
Skip pulling image and use local one: public.ecr.aws/sam/emulation-provided:rapid-1.35.0-x86_64.
Mounting /tmp/tmp41trke88 as /var/task:ro,delegated inside runtime container
START RequestId: ee5e27d8-4bb7-4e0a-8873-f92c48459993 Version: $LATEST
time="2021-11-09T14:15:38.302" level=error msg="Init failed" InvokeID= error="fork/exec /var/task/bootstrap: no such file or directory"
Function 'WalterlambdaNative' timed out after 15 seconds
I've seen that issue when people followed the non-native build process (https://quarkus.io/guides/amazon-lambda#build-and-deploy) but deployed as native. It is important to do the native build (https://quarkus.io/guides/amazon-lambda#deploy-to-aws-lambda-custom-native-runtime) by calling e.g.
./mvnw package -Dnative
or
./gradlew build -Dquarkus.package.type=native
There are two other related posts as well:
Unable to Invoke Quarkus Functions Natively in an AWS Lambda function with MacOS Catalina
Micronaut GraalVM Native Image: Lambda fails with an error "Error: fork/exec /var/task/bootstrap: no such file or directory"
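Separately, a quick way to verify which kind of package was actually built: a native build should put an ELF executable named bootstrap into function.zip, whereas the JVM packaging does not. A minimal Python sketch (the zip path is illustrative):
# Inspect function.zip: a native Quarkus build should contain an ELF
# executable named "bootstrap"; if it is missing or not an ELF binary,
# the provided (custom) runtime will not be able to start it.
import zipfile

with zipfile.ZipFile("target/function.zip") as z:
    names = z.namelist()
    if "bootstrap" not in names:
        print("no bootstrap in the zip:", names)
    else:
        head = z.read("bootstrap")[:4]
        print("bootstrap is an ELF binary" if head == b"\x7fELF"
              else "bootstrap is not a Linux native executable")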

Cloud Build does not work anymore after upgrading to beam 2.30.0

I have been using this YAML file to kick off my Dataflow Flex workflow with Beam 2.27.0, and it has always worked fine:
- name: gcr.io/$PROJECT_ID/$_IMAGE
  entrypoint: python
  args:
    - /dataflow/template/main.py
    - --runner=DataflowRunner
    - --project=$PROJECT_ID
    - --region=$_REGION
    - --job_name=$_JOB_NAME
    - --temp_location=$_TEMP_LOCATION
    - --sdk_container_image=gcr.io/$PROJECT_ID/$_IMAGE
    - --disk_size_gb=50
    - --year=2018
    - --quarter=QTR1
    - --fmpkey=$_FMPKEY
    - --setup_file=/dataflow/template/setup.py
Today I decided to upgrade Beam to 2.30.0, and when running exactly the same file I now get this:
Unrecognized SDK container image: gcr.io/datascience-projects/pipeline:latest. Custom container images are only supported for Dataflow Runner v2.
Could anyone advise what I need to fix? I suspect I'd need to run it using a cloud-sdk image instead of python.
Kind regards,
Marco
When working with Apache Beam 2.30.0 or higher, you will need to add --experiment=use_runner_v2 as an argument, as covered in the documentation.
Your updated YAML will therefore look like the following:
- name: gcr.io/$PROJECT_ID/$_IMAGE
  entrypoint: python
  args:
    - /dataflow/template/main.py
    - --runner=DataflowRunner
    - --project=$PROJECT_ID
    - --region=$_REGION
    - --job_name=$_JOB_NAME
    - --temp_location=$_TEMP_LOCATION
    - --sdk_container_image=gcr.io/$PROJECT_ID/$_IMAGE
    - --disk_size_gb=50
    - --year=2018
    - --quarter=QTR1
    - --fmpkey=$_FMPKEY
    - --setup_file=/dataflow/template/setup.py
    - --experiment=use_runner_v2
Please have a look at the checklist below before you deploy and launch a Python Dataflow pipeline using Cloud Build:
Custom containers are supported only on Dataflow Runner v2.
The new Dataflow runner, Dataflow Runner v2, is now the default for Python streaming pipelines (version 2.21.0 or higher), and rolling out by default for Python batch pipelines (version 2.21.0 or higher) starting February 2021. You do not have to make any changes to your pipeline code to take advantage of this new architecture. Under certain circumstances, your pipeline might not use Runner v2, although the pipeline runs on a supported SDK version.
You have to run the job with the --experiments=use_runner_v2 flag
Dataflow Runner v2 requires the Apache Beam SDK version 2.21.0 or higher for Python, and version 2.30.0 or higher for Java. When running your pipeline, make sure to launch the pipeline using the Apache Beam SDK with the same version (e.g. 2.XX.0) and language version (e.g. Python 3.X) as the SDK on your custom container image.
Custom containers must run the default ENTRYPOINT script /opt/apache/beam/boot, which initializes the worker environment and starts the SDK worker process. If you do not set this entrypoint, your worker will hang and never properly start.
Note: Dataflow has stopped supporting pipelines using Python 2, so be sure you are using Python 3 as the runtime for the pipeline.
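For completeness, the same flags can also be set from the pipeline code itself rather than the Cloud Build args; a minimal sketch using Apache Beam's Python PipelineOptions (the project, bucket, and image names below are placeholders):
# Minimal sketch: "use_runner_v2" goes into the experiments list and the
# custom image is passed via sdk_container_image. All values are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    sdk_container_image="gcr.io/my-project/my-image",
    experiments=["use_runner_v2"],
    setup_file="./setup.py",
)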

AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through SSH), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfigured the Spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}
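If this is done from the notebook itself, the JSON above is typically applied with the sparkmagic %%configure cell magic before any Spark code runs; a minimal sketch, assuming the default PySpark (sparkmagic/Livy) kernel that EMR Notebooks provide:
%%configure -f
{"conf": {"spark.pyspark.python": "/usr/bin/python3.7"}}
The -f flag forces the current Livy session to restart with the new configuration.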

ERROR: Unknown or duplicate parameter: WSGIPath

I'm currently trying to deploy a Django 1.7 / Python 3.4 web app to AWS Elastic Beanstalk. I'm following the guide on the AWS site, but when I try using their config files in the .ebextensions directory, the line "option_name: WSGIPath" gives me the error "ERROR: Unknown or duplicate parameter: WSGIPath".
I had the same issue with the eb create command, but I solved it a moment ago.
It's because I had selected the incorrect option (the fourth) while executing the eb init command in the CLI:
Select a platform version.
1) Python 3.4
2) Python 2.7
3) Python
4) Python 3.4 (Preconfigured - Docker)
(default is 1): 4
So I started the sample project again. This time I just selected the first option, and then I was able to finish the AWS guide.
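For background on why the parameter is rejected: on the plain (non-Docker) Python platform, WSGIPath is an option in the aws:elasticbeanstalk:container:python namespace; the Docker-preconfigured platform does not use that namespace, which is presumably why the parameter shows up as unknown there. A minimal boto3 sketch of setting it outside .ebextensions (the environment name and WSGI path are illustrative placeholders):
# Illustrative only: set WSGIPath on an existing environment via the
# Elastic Beanstalk API instead of an .ebextensions file.
import boto3

eb = boto3.client("elasticbeanstalk")
eb.update_environment(
    EnvironmentName="my-django-env",  # placeholder environment name
    OptionSettings=[
        {
            "Namespace": "aws:elasticbeanstalk:container:python",
            "OptionName": "WSGIPath",
            "Value": "mysite/wsgi.py",  # placeholder path to the WSGI module
        }
    ],
)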