To create my Glue scripts, I use development endpoints with SageMaker notebooks that run the PySpark (Sparkmagic) kernel.
The latest version of Glue (version 1.0) supports Spark 2.4. However, my SageMaker notebook uses Spark version 2.2.1.
The function I want to test only exists as of Spark 2.3.
Is there a way to solve this mismatch between the dev endpoint and the Glue job? Can I somehow set the Spark version of the notebook?
I couldn't find anything in the documentation.
When you create a SageMaker notebook for the Glue dev endpoint, it launches a SageMaker notebook instance with a specific lifecycle configuration. This lifecycle configuration sets up the connection between the SageMaker notebook and the development endpoint. When you run cells from the PySpark kernel, the code is sent via REST API calls to the Livy server running on the development endpoint.
Thus, the PySpark version that the SageMaker notebook runs against is determined by the development endpoint and is not configurable from the SageMaker side.
Since Glue is a managed service, root access to the development endpoint is restricted, so you cannot upgrade its Spark version yourself. Support for Spark 2.4 was only recently introduced in Glue, and it seems that it has not yet been rolled out to dev endpoints.
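If you want to confirm what the endpoint itself is running, one rough option (a sketch, assuming SSH access to the dev endpoint and that it exposes the gluesparksubmit wrapper; the key path and address are placeholders) is to ask its Spark distribution directly:
# Run spark-submit's version check on the dev endpoint over SSH; whatever it
# prints is the Spark version the notebook's PySpark kernel will run against.
ssh -i my-private-key.pem glue@ec2-w-x-y-z.compute-1.amazonaws.com gluesparksubmit --version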
Related
I am trying to work with Glue interactive sessions in a SageMaker notebook by configuring the Glue conda PySpark kernel via AWS lifecycle configurations. This worked earlier when creating a notebook instance. Now the instance is running with the configuration, but I am no longer able to see the conda Glue PySpark kernel in the kernel list. Could anybody help with the create script and start script needed to run the notebook with Glue PySpark?
I am configuring it using this AWS doc: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-sagemaker.html#is-sagemaker-existing
I also took help from the AWS GitHub sample scripts: https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/install-conda-package-single-environment/on-start.sh
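For reference, the kind of on-start script those two links describe looks roughly like the sketch below; the conda environment name (python3) and the kernel directory inside the aws-glue-sessions package are assumptions to verify against the linked doc:
#!/bin/bash
# On-start lifecycle configuration sketch: install the Glue interactive
# sessions kernels into the notebook instance's python3 conda environment.
# The environment name and kernel paths are assumptions -- check the AWS doc.
set -ex
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install aws-glue-sessions
# Register the Glue PySpark and Glue Spark kernels so they appear in the kernel list.
SITE_PACKAGES=$(pip show aws-glue-sessions | grep Location | awk '{print $2}')
jupyter kernelspec install "$SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_pyspark" --user
jupyter kernelspec install "$SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_spark" --user
source /home/ec2-user/anaconda3/bin/deactivate
EOF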
My understanding is that Dev Endpoints in AWS Glue can be used to develop code iteratively and then deploy it to a Glue job. I find this especially useful when developing Spark jobs, because every time you run a job it takes several minutes to launch a Hadoop cluster in the background. However, I am seeing a discrepancy when using the Python shell in Glue instead of Spark. import pg doesn't work in a dev endpoint I created using a SageMaker JupyterLab Python notebook, but it works in AWS Glue when I create a job using the Python shell. Shouldn't the same libraries exist in the dev endpoint that exist in Glue? What is the point of having a dev endpoint if you cannot reproduce the same code in both places (the dev endpoint and the Glue job)?
Firstly, Python shell jobs do not launch a Hadoop cluster in the backend, because they do not give you a Spark environment for your jobs.
Secondly, since PyGreSQL is not written in pure Python, it will not work in Glue's native Spark environment (Glue Spark jobs, dev endpoints, etc.).
Thirdly, Python shell jobs have built-in support for certain additional packages.
Thus, I don't see the point of using a dev endpoint for Python shell jobs.
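If the goal is simply to use a PostgreSQL client from a Python shell job, one option is to ship a pure-Python driver as a wheel via the --extra-py-files job parameter. A minimal sketch, where the job name, role, bucket paths, and the wheel filename are all placeholders:
# Create a Python shell job that pulls in a pure-Python library from S3;
# the wheel must be pure Python (e.g. pg8000), since Glue will not build
# C extensions for you. Replace the names and paths below with your own.
aws glue create-job \
  --name my-pythonshell-job \
  --role MyGlueServiceRole \
  --command Name=pythonshell,ScriptLocation=s3://my-bucket/scripts/test.py,PythonVersion=3 \
  --default-arguments '{"--extra-py-files":"s3://my-bucket/libs/pg8000-1.29.4-py3-none-any.whl"}'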
I have a Glue script (test.py) written, say, in an editor. I connected to the Glue dev endpoint and copied the script to the endpoint, or I can store it in an S3 bucket. Basically the Glue endpoint is an EMR cluster; now how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I am more interested in whether I can run it from the dev endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
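For example, a run against a console-generated script might look like the following (the job name and bucket are placeholders); such scripts typically read JOB_NAME through awsglue.utils.getResolvedOptions and will fail fast if the argument is missing:
[glue@ip-w-x-y-z ~]$ gluepython myscript.py --JOB_NAME my_dev_run --TempDir s3://my-bucket/glue-temp/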
For development/testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue endpoint URL, so that you have access to the Data Catalog, crawlers, etc., as well as the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well.
Please refer to the AWS documentation and to guides on setting up Zeppelin on Windows for help with setting up a local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you have set up the Zeppelin notebook, you can copy the script (test.py) into it and run it from Zeppelin.
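The SSH piece is normally just a port-forwarding tunnel from your machine to the dev endpoint so the local Zeppelin interpreter can reach it. A sketch, where the key path and endpoint address are placeholders and the port/link-local address follow the pattern used in the Glue dev endpoint tutorial (double-check them for your setup):
# Forward local port 9007 to the dev endpoint so a locally running Zeppelin,
# with its Spark interpreter configured for a remote connection on port 9007,
# can execute code against the endpoint.
ssh -i /path/to/private-key.pem -NTL 9007:169.254.76.1:9007 glue@ec2-w-x-y-z.compute-1.amazonaws.com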
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives you more flexibility: you can use any third-party Python libraries and run directly on an EMR Spark cluster.
The script to set up Hadoop on EC2 described at https://wiki.apache.org/hadoop/AmazonEC2 has been removed from recent Hadoop releases. Google points me to an alternative, http://whirr.apache.org/, which has also been retired for more than a year. Is there a replacement or alternative that is still suitable for setting up the latest version of Hadoop on EC2? Thank you!
Update
The hadoop-ec2 script was removed from the Hadoop source as of 01/11/2011, with the intention of replacing it with Apache Whirr. It would be great if the removal were explicitly documented; unfortunately, early changelogs are no longer conveniently available on the official Hadoop website.
Rather than installing and maintaining Hadoop yourself on an Amazon EC2 instance, you could consider using the Amazon EMR service.
Amazon EMR can automatically deploy a Hadoop cluster and can be triggered via the Management Console, an API call or the AWS Command-Line Interface (CLI).
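As a rough sketch, launching a small Hadoop cluster from the AWS CLI looks something like the following; the cluster name, key pair, instance type/count, and release label are placeholders to adjust:
# Minimal EMR Hadoop cluster from the CLI; pick a current --release-label from
# the EMR release guide and substitute your own key pair and instance sizing.
aws emr create-cluster \
  --name "hadoop-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles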
Is there a way to force Amazon EMR to use Spark 1.0.1? The current selectable versions stop at 1.4.1.
I am using the Alternating Least Squares implementation in MLlib. Since v1.1 it has used weighted regularization, and for specific reasons (a research study) I do not want that implementation; rather, I am trying to access the non-weighted regularization version implemented in v1.0.
I am using Zeppelin notebooks with Scala, if that helps.
Is working with Zeppelin a requirement? Because if so, it could be very difficult. Zeppelin is compiled against a specific version of Spark so downgrading the jar will most likely fail.
Otherwise, if you are OK with not using Zeppelin and instead using the EMR Step API, you might be able to spin up an EMR cluster with a bootstrap action that installs spark-assembly 1.0.1. I say it might work because there's no guarantee that the current EMR version is compatible with a two-year-old version of Spark.
To create the cluster:
Create a cluster from the UI, make sure to uncheck Spark from the additional software menu
Add a custom bootstrap action and use the script at s3://support.elasticmapreduce/spark/install-spark with arguments -v 1.0.1
(See https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark for configuration options)
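If you prefer the CLI over the UI, the equivalent is roughly the sketch below; this targets the old AMI-version style of EMR release that the install-spark bootstrap action was written for, and the AMI version, key pair, and instance sizing are guesses to adjust:
# Launch a cluster without the built-in Spark and install Spark 1.0.1 via the
# community bootstrap action instead (AMI version and key name are placeholders).
aws emr create-cluster \
  --name "spark-1.0.1" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v","1.0.1"]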
To run spark using the EMR Step API:
Upload your compiled jar to s3, then submit a step against that cluster
Cluster ID: the ID of your cluster (e.g. j-XXXXXXXX)
Region of the cluster: where you created your EMR cluster (e.g. us-west-2)
Your Spark main class: this is where you put your ML pipeline code
Your jar: you have to upload the jar with your code to S3 so your cluster can download it
arg1, arg2: arguments to your main (optional)
aws emr add-steps --cluster-id <your-cluster-id> --steps \
Name=SparkPi,Jar=s3://<your-region>.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,--class,com.your.spark.class.MainApp,s3://<your-bucket>/your.jar,arg1,arg2],ActionOnFailure=CONTINUE
(Taken from the official github repo at https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md)
Also if that fails, install Hadoop and check out https://spark.apache.org/docs/1.0.1/running-on-yarn.html
Or you could also run 1.0.1 locally on your laptop if your data is small.
Good luck.
Amazon EMR provides a list of supported versions of software packages that you can install by selecting from a drop-down menu. Nothing stops you from installing additional custom software with a bootstrap action. I have some experience installing Java 8 back when EMR supported only Java 7. It is a bit painful but totally possible.
EMR supports Spark 1.6.0. Take a look at their latest release of emr-4.4.0: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html