I would like to avoid the AWS dev endpoint. Is there a way to test and debug my PySpark code in a local notebook/IDE, without using an AWS dev endpoint?
As others have said, it depends on which parts of Glue you are going to use. If your code is pure Spark, without DynamicFrames and the other Glue extensions, then a local Spark installation may suffice, as sketched below. If, however, you intend to use the Glue extensions, there is not really an option other than the dev endpoint at this stage.
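A minimal sketch of that pure-Spark path, assuming the job logic avoids GlueContext and DynamicFrame entirely; the file path and column names are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Run entirely on the local machine
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-glue-dev")
    .getOrCreate()
)

# Read a small sample of the real data, e.g. a CSV exported from S3
df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)

# The same transformation logic that will later run inside the Glue job
result = df.filter(F.col("status") == "active").groupBy("country").count()
result.show()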
I hope that this helps.
If you are going to deploy your PySpark code to the AWS Glue service, you may have to use GlueContext and other AWS Glue APIs. So if you would like to test against the AWS Glue service using those APIs, then you have to have an AWS dev endpoint.
However, an AWS Glue notebook is optional, since you can set up Zeppelin (or similar) locally and establish an SSH tunnel to the AWS Glue dev endpoint for development/testing from your local environment. Make sure you delete the dev endpoint once your development/testing is done for the day.
Alternatively, if you are not keen on using AWS Glue APIs other than GlueContext, then yes, you can set up Zeppelin in your local environment, test the code locally, then upload your code to S3 and create a Glue job for testing in the AWS Glue service.
We have a setup here where PySpark is installed locally and we use VS Code to develop, unit test, and debug our PySpark code. We run the code against the local PySpark installation during development, then deploy it to EMR to run against the real dataset.
I'm not sure how much of this applies to what you're trying to do with Glue, since Glue is a level higher in abstraction.
We use pytest to test our PySpark code. We keep the PySpark code in a separate file and call those functions from the Glue script. With this separation, we can unit test the PySpark code using pytest, along the lines of the sketch below.
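A minimal sketch of that separation, with hypothetical module and column names: the pure Spark logic lives in its own module (transforms.py here) that the Glue script imports, and pytest exercises it against a local SparkSession.

# --- transforms.py (imported by the Glue script) ---
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def add_total_column(df: DataFrame) -> DataFrame:
    # Pure Spark, no GlueContext, so it is unit-testable locally
    return df.withColumn("total", F.col("price") * F.col("quantity"))

# --- test_transforms.py ---
import pytest
from pyspark.sql import SparkSession
from transforms import add_total_column

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_total_column(spark):
    df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    result = add_total_column(df).collect()[0]
    assert result["total"] == 6.0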
I was able to test without dev endpoints.
Please follow the instructions here:
https://support.wharton.upenn.edu/help/glue-debugging
I want to connect to a Databricks cluster (on AWS) from my local machine, but I want the entire code to execute in the cluster. With Databricks Connect, only the Spark code is executed in the cluster. I'm looking for an alternative solution, such as an SSH interpreter or something similar. I work with PyCharm (IDE).
I would go with the following approach (but you need to write a small script for your IDE):
you commit to some branch in git (like staging)
your IDE executes the Databricks CLI command "databricks repos update", which performs a pull
your IDE executes a Databricks CLI jobs command to run the notebook from the repo
The Databricks CLI can be driven via the REST API, from bash/cmd, or imported as an SDK into your programming language. A rough sketch of such a script is shown below.
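A rough sketch of the IDE-side script, shelling out to the Databricks CLI (legacy databricks-cli syntax assumed); the repo ID, branch name, and job ID are placeholders, and the CLI is assumed to be installed and configured already:

import subprocess

REPO_ID = "1234567890"   # placeholder: ID of the workspace repo
JOB_ID = "987"           # placeholder: job that runs the notebook from that repo

# Pull the latest commit of the staging branch into the workspace repo
subprocess.run(
    ["databricks", "repos", "update", "--repo-id", REPO_ID, "--branch", "staging"],
    check=True,
)

# Trigger the job that executes the notebook on the cluster
subprocess.run(
    ["databricks", "jobs", "run-now", "--job-id", JOB_ID],
    check=True,
)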
I am trying to establish a connection from AWS Glue to a remote server via SFTP using Python 3.7. I tried using the pysftp library for this task.
But pysftp depends on a library named bcrypt that contains both Python and C code, and at the moment AWS Glue only supports pure Python libraries, as mentioned in the documentation (link below).
https://docs.aws.amazon.com/glue/latest/dg/console-custom-created.html
The error I am getting is as below.
ImportError: cannot import name '_bcrypt'
I am stuck here due to a compilation error.
Hence, I tried the JSch Java library from Scala. There the compilation succeeds, but I get the exception below.
com.jcraft.jsch.JSchException: java.net.UnknownHostException: [Remote Server Hostname]
How can we connect to a remote server via SFTP from AWS Glue? Is it possible?
How can we configure outbound rules (if required) for a Glue job?
I am answering my own question here for anyone whom this might help.
The straight answer is no.
I found the resources below, which indicate that AWS Glue is an ETL tool for AWS resources.
AWS Glue uses other AWS services to orchestrate your ETL (extract, transform, and load) jobs to build a data warehouse.
Source - https://docs.aws.amazon.com/glue/latest/dg/how-it-works.html
Glue works well only with ETL from JDBC and S3 (CSV) data sources. In case you are looking to load data from other cloud applications, File Storage Base, etc. Glue would not be able to support.
Source - https://hevodata.com/blog/aws-glue-etl/
Hence, to implement what I was working on, I used an AWS Lambda function to connect to the remote server via SFTP, pick up the required files, and drop them in an S3 bucket. The AWS Glue job can then pick the files up from S3.
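A minimal sketch of such a Lambda, assuming pysftp is available (e.g. via a layer) and credentials come from environment variables; the host, remote path, and bucket name are placeholders:

import os
import boto3
import pysftp

s3 = boto3.client("s3")

def handler(event, context):
    cnopts = pysftp.CnOpts()
    cnopts.hostkeys = None  # sketch only; verify host keys in production

    with pysftp.Connection(
        host=os.environ["SFTP_HOST"],
        username=os.environ["SFTP_USER"],
        password=os.environ["SFTP_PASSWORD"],
        cnopts=cnopts,
    ) as sftp:
        remote_path = "/outbound/data.csv"   # placeholder remote file
        local_path = "/tmp/data.csv"         # Lambda's writable scratch space
        sftp.get(remote_path, local_path)

    # Drop the file into S3 so the Glue job can pick it up
    s3.upload_file(local_path, "my-landing-bucket", "incoming/data.csv")
    return {"status": "ok"}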
I know some time has passed since this question was posted, so I'd like to share some tools that could help you get data from an SFTP server more easily and quickly. To build a layer the easy way, use this tool: https://github.com/aws-samples/aws-lambda-layer-builder. You can make a pysftp layer quickly and free of those annoying errors (cffi, bcrypt).
Lambda has a storage limit of 500 MB, so if you are trying to extract heavy files, it will crash for that reason. To fix this, you have to attach EFS (Elastic File System) to your Lambda: https://docs.aws.amazon.com/lambda/latest/dg/services-efs.html
My understanding is that Dev Endpoints in AWS Glue can be used to develop code iteratively and then deploy it to a Glue job. I find this especially useful when developing Spark jobs, because every time you run a job it takes several minutes to launch a Hadoop cluster in the background. However, I am seeing a discrepancy when using Python shell in Glue instead of Spark. "import pg" doesn't work in a dev endpoint I created using a SageMaker JupyterLab Python notebook, but it works in AWS Glue when I create a job using Python shell. Shouldn't the same libraries exist in the dev endpoint that exist in Glue? What is the point of having a dev endpoint if you cannot reproduce the same code in both places (the dev endpoint and the Glue job)?
Firstly, Python shell jobs do not launch a Hadoop cluster in the background, as they do not give you a Spark environment for your jobs.
Secondly, since PyGreSQL is not written in pure Python, it will not work in Glue's native environment (Glue Spark jobs, dev endpoints, etc.).
Thirdly, Python shell jobs have additional built-in support for certain packages.
Thus, I don't see the point of using a dev endpoint for Python shell jobs.
I have a Glue script (test.py) written, say, in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint (or I can store it in an S3 bucket). Since a Glue endpoint is basically an EMR cluster, how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I'm more interested in whether I can run it from the Glue endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters; the sketch below shows why.
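For context, here is a sketch of how a typical auto-generated Glue script consumes those parameters (the exact argument handling may vary between generated scripts):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# The script resolves the arguments it expects from the command line,
# which is why gluepython must receive --JOB_NAME (and often --TempDir)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])

glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... transformations, using args["TempDir"] as S3 scratch space ...

job.commit()

So an invocation on the endpoint would look something like: gluepython myscript.py --JOB_NAME dev-test --TempDir s3://my-bucket/tmp/ (the job name and bucket are placeholders).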
For development/testing purposes, you can set up a Zeppelin notebook locally and have an SSH connection established using the AWS Glue endpoint URL, so you have access to the Data Catalog, crawlers, etc., and also the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well (a boto3 sketch of this step follows).
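A sketch of that "upload and create a job" step using boto3, with the bucket, IAM role, job, and trigger names as placeholders:

import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Upload the tested script
s3.upload_file("test.py", "my-glue-scripts", "jobs/test.py")

# Create a Spark ETL job pointing at the script in S3
glue.create_job(
    Name="my-etl-job",
    Role="MyGlueServiceRole",  # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/jobs/test.py",
        "PythonVersion": "3",
    },
)

# Optionally schedule it with a trigger
glue.create_trigger(
    Name="nightly-run",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",  # 02:00 UTC daily
    Actions=[{"JobName": "my-etl-job"}],
    StartOnCreation=True,
)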
Please refer to the guides linked here and on setting up Zeppelin on Windows for help with setting up the local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you have set up the Zeppelin notebook, you can copy the script (test.py) into it and run it from Zeppelin.
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Do you have a specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives you more flexibility: you can use any third-party Python libraries and run your code directly on an EMR Spark cluster.
Regards
I'm trying to run a Spark application on AWS EMR via the console. My Scala code, compiled into the jar, takes the SparkConf settings as parameters or plain strings:
val sparkConf = new SparkConf()
  .setAppName("WikipediaGraphXPageRank")
  .setMaster(args(1))
  .set("spark.executor.memory", "1g")
  .registerKryoClasses(Array(classOf[PRVertex], classOf[PRMessage]))
However, I don't know how to pass the master URL and other parameters to the jar once it's uploaded and the cluster is set up. To be clear, I'm aware that if I were running the Spark shell I would do this another way, but I'm a Windows user, and with the current setup and the work I've done, it would be very useful to have some way to pass the master URL to the EMR cluster in the 'steps'.
I don't want to use the Spark shell; I have a tight deadline and have everything set up this way, and it feels like this small matter of passing the master URL as a parameter should be possible, considering that AWS has a guide for running standalone Spark applications on EMR.
Help would be appreciated!
Here are instructions on using spark-submit via EMR Step: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md
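If you prefer to script it rather than click through the console, here is a rough boto3 sketch (not taken from the linked guide) of adding a spark-submit step to an existing cluster. The cluster ID, jar location, main class, and program arguments are placeholders; the arguments assume args(0) is an input path and args(1) is the master, matching the SparkConf snippet above:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "WikipediaGraphXPageRank",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--class", "WikipediaGraphXPageRank",      # assumed main class
                    "s3://my-bucket/wikipedia-pagerank.jar",   # placeholder jar path
                    "s3://my-bucket/input/",                   # args(0): input path
                    "yarn",                                    # args(1): read by setMaster
                ],
            },
        }
    ],
)
print(response["StepIds"])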