Connect Databricks cluster with local machine (AWS) - amazon-web-services

I want to connect to a Databricks cluster (AWS) from my local machine, but I want to execute the entire code in the cluster. With Databricks Connect only the Spark code is executed in the cluster. I'm looking for an alternative solution, such as an SSH interpreter or something similar. I work with PyCharm (IDE).

I would go with the following approach (but you need to write a small script for your IDE); a sketch of the CLI calls is shown after these steps:
you commit to some branch in Git (such as staging)
your IDE executes the Databricks CLI command "databricks repos update", which performs a pull in the workspace repo
your IDE executes a Databricks CLI jobs command to run the notebook from the repo
The Databricks CLI can be driven through the REST API, from bash/cmd, or imported as an SDK into a programming language
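A minimal sketch of the last two steps using the legacy Databricks CLI; the branch name, repo ID, and job ID below are placeholders, not values from the question:

git push origin staging                                      # push your work to the staging branch
databricks repos update --repo-id 123456 --branch staging    # make the workspace repo pull that branch
databricks jobs run-now --job-id 987                         # trigger the job that runs the notebook from the repo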

Related

Discrepancy between AWS Glue and its Dev Endpoint

My understanding is that Dev Endpoints in AWS Glue can be used to develop code iteratively and then deploy it to a Glue job. I find this especially useful when developing Spark jobs, because every time you run a job it takes several minutes to launch a Hadoop cluster in the background. However, I am seeing a discrepancy when using the Python shell in Glue instead of Spark. "import pg" doesn't work in a Dev Endpoint I created using a SageMaker JupyterLab Python notebook, but it works in AWS Glue when I create a job using the Python shell. Shouldn't the same libraries exist in the dev endpoint that exist in Glue? What is the point of having a dev endpoint if you cannot reproduce the same code in both places (the dev endpoint and the Glue job)?
Firstly, Python shell jobs do not launch a Hadoop cluster in the backend, as they do not give you a Spark environment for your jobs.
Secondly, since PyGreSQL is not written in pure Python, it will not work with Glue's native environment (Glue Spark job, dev endpoint, etc.).
Thirdly, the Python shell has additional built-in support for certain packages.
Thus, I don't see the point of using a dev endpoint for Python shell jobs.

How to test AWS Glue code without dev endpoint

I would like to avoid AWS dev endpoint. Is there a way where I can test and debug my PySpark code without using AWS dev endpoint with the help of testing my code in local notebook/IDE?
As others have said, it depends on which parts of Glue you are going to use. If your code is based on pure Spark, without the dynamic frames etc., then a local version of Spark may suffice. If, however, you intend to use the Glue extensions, there is not really an option other than the dev endpoint at this stage.
I hope that this helps.
If you are going to deploy your PySpark code on the AWS Glue service, you may have to use GlueContext and other AWS Glue APIs. So if you would like to test against the AWS Glue service using these AWS Glue APIs, then you have to have an AWS dev endpoint.
However, having an AWS Glue notebook is optional, since you can set up Zeppelin etc. and establish an SSH tunnel connection with the AWS Glue dev endpoint for dev/testing from your local environment. Make sure you delete the dev endpoint once your development/testing is done for the day.
Alternately, if you are not keen on using AWS Glue APIs other than GlueContext, then yes, you can set up Zeppelin in your local environment, test the code locally, and then upload your code to S3 and create a Glue job for testing in the AWS Glue service; a sketch of those last steps follows.
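A rough sketch of the upload-and-create-job steps with the AWS CLI; the bucket, role, and job names are placeholders:

aws s3 cp test.py s3://my-glue-scripts/test.py            # upload the locally tested script
aws glue create-job --name my-test-job --role MyGlueServiceRole --command Name=glueetl,ScriptLocation=s3://my-glue-scripts/test.py
aws glue start-job-run --job-name my-test-job             # run it in the AWS Glue service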
We have a setup here where we have PySpark installed locally, and we use VSCode to develop our PySpark code, unit test, and debug. We run the code against the local PySpark installation during development, then we deploy it to EMR to run with the real dataset.
I'm not sure how much of this applies to what you're trying to do with Glue, as Glue is a level higher in abstraction.
We use pytest to test PySpark code. We keep the PySpark code in a separate file and call those functions from the Glue code file. With this separation, we can unit test the PySpark code using pytest; a minimal sketch of that layout is below.
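A minimal sketch of that separation, assuming a plain PySpark helper and a local SparkSession for the test; the module, function, and column names are made up:

# transforms.py - plain PySpark, no Glue imports, so it runs locally
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def add_full_name(df: DataFrame) -> DataFrame:
    # pure Spark transformation that the Glue script will import and call
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

# test_transforms.py - run with pytest against a local SparkSession
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_full_name(spark):
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    assert add_full_name(df).collect()[0]["full_name"] == "Ada Lovelace"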
I was able to test without dev endpoints
Please follow the instructions here
https://support.wharton.upenn.edu/help/glue-debugging

How to run glue script from Glue Dev Endpoint

I have a Glue script (test.py) written, say, in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint, or I can store it in an S3 bucket. Basically the Glue endpoint is an EMR cluster; now, how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I'm more interested to know whether I can run it from the Glue endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
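For example (the job name and bucket below are placeholders):

[glue@ip-w-x-y-z ~]$ gluepython myscript.py --JOB_NAME my_test_job --TempDir s3://my-bucket/glue-temp/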
For development/testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue endpoint URL, so you have access to the Data Catalog, crawlers, etc., and also the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run, and scheduled as well.
Please refer here and to setting up Zeppelin on Windows for any help on setting up the local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you set up the Zeppelin notebook, you can copy the script (test.py) to the Zeppelin notebook and run it from Zeppelin; a sketch of the SSH tunnel is below.
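A sketch of the tunnel for a local Zeppelin, along the lines of the AWS local-notebook tutorial; the key path and endpoint address are placeholders, and the forwarded port should be checked against the current AWS documentation:

ssh -i /path/to/private-key.pem -NTL 9007:169.254.76.1:9007 glue@<dev-endpoint-public-dns>
# then point the local Zeppelin Spark interpreter at localhost:9007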
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives you more flexibility: you can use any third-party Python libraries and run directly on an EMR Spark cluster.
Regards

How to deploy a spring boot application jar from Jenkins to an EC2 machine

I'm seeing so many different sources on how to achieve CI with Jenkins and EC2, and strangely none seem to fit my needs.
I have 2 EC2 Ubuntu instances. One is empty and the other has Jenkins installed on it.
I want to perform a build on the Jenkins machine and copy the jar to the other Ubuntu machine. Once the jar is there, I want to run mvn spring-boot:run.
That's it - a very simple flow for which I can't find a good source to follow that doesn't include slaves, Docker, etc.
AWS CodeDeploy lets you take the artifact built by Jenkins and deploy it to your EC2 instances.
A quick Google search gave me this very detailed instruction on how to set up CodePipeline with AWS CodeDeploy.
The pipeline uses a GitHub -> Jenkins -> EC2 flow, as you need it.
Set up Jenkins to do a build, then scp the artifact to the other machine; a sketch of such a build step follows below.
There's an answer here, "how to setup ssh keys for jenkins to publish via ssh", about setting up the keys for SSH.
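A rough sketch of what the Jenkins "Execute shell" build step could look like; the host, user, key path, and jar name are placeholders, and it runs the packaged jar rather than mvn spring-boot:run:

mvn -B clean package                            # build the Spring Boot jar on the Jenkins machine
scp -i /var/lib/jenkins/.ssh/deploy_key target/myapp.jar ubuntu@<app-host>:/opt/myapp/app.jar
ssh -i /var/lib/jenkins/.ssh/deploy_key ubuntu@<app-host> 'pkill -f app.jar || true; nohup java -jar /opt/myapp/app.jar > /opt/myapp/app.log 2>&1 &'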

Automate code deploy from GitLab to an AWS EC2 instance

We're building an application for which we are using a GitLab repository. Manual deployment of code to the test server, which is an Amazon AWS EC2 instance, is tedious. I'm planning to automate the deployment process, such that when we commit code, it is reflected in the test instance.
From my knowledge, we can use the AWS CodeDeploy service to fetch code from GitHub, but the CodeDeploy service does not support GitLab repositories. Is there a way to automate the code deployment process to an AWS EC2 instance through GitLab, or is there a shell scripting possibility to achieve this? Kindly educate me.
One way you could achieve this with AWS CodeDeploy is by using the S3 option in conjunction with Gitlab-CI: http://docs.aws.amazon.com/codepipeline/latest/userguide/getting-started-w.html
Depending on how your project is set up, you may have the possibility to generate a distribution zip (Gradle offers this through the application plugin). You may need to generate your "distribution" file manually if your project does not offer such a capability.
GitLab does not offer a direct S3 integration; however, through the .gitlab-ci.yml you can install the AWS CLI in the build container and run the necessary upload commands to put the generated zip file on the S3 bucket, as per the AWS instructions, to trigger the deployment.
Here is an example of what your before_script could look like in the .gitlab-ci.yml file:
before_script:
  - apt-get update --quiet --yes
  - apt-get install --quiet --yes python python-pip
  - pip install -U pip
  - pip install awscli
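And a rough sketch of a deploy job that pushes the bundle to S3 and triggers CodeDeploy; the application, deployment group, and bucket names are placeholders:

deploy:
  stage: deploy
  script:
    - aws deploy push --application-name MyApp --s3-location s3://my-codedeploy-bucket/myapp.zip --source .
    - aws deploy create-deployment --application-name MyApp --deployment-group-name MyApp-Test --s3-location bucket=my-codedeploy-bucket,key=myapp.zip,bundleType=zip
  only:
    - master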
The AWS tutorial on how to use CodeDeploy with S3 is very detailed, so I will skip attempting to reproduce the contents here.
In regards to the actual deployment commands and actions that you are currently performing manually, AWS CodeDeploy provides the capability to run them through scripts defined in the AppSpec file, attached to event hooks for the application; a minimal AppSpec sketch is shown after these links:
http://docs.aws.amazon.com/codedeploy/latest/userguide/writing-app-spec.html
http://docs.aws.amazon.com/codedeploy/latest/userguide/app-spec-ref.html
http://docs.aws.amazon.com/codedeploy/latest/userguide/app-spec-ref-hooks.html
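For illustration, a minimal appspec.yml along those lines; the paths and script names are placeholders:

version: 0.0
os: linux
files:
  - source: /
    destination: /opt/myapp
hooks:
  ApplicationStop:
    - location: scripts/stop_app.sh
      timeout: 60
  ApplicationStart:
    - location: scripts/start_app.sh
      timeout: 300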
I hope this helps.
This is one of my old posts, but I happened to find an answer for it. Although my question was specifically about CodeDeploy, I would say there is no need for any AWS service to do this from GitLab.
We don't require CodeDeploy at all. There is no need to use an external CI server like TeamCity or Jenkins to perform the CI from GitLab anymore.
We need to add a .gitlab-ci.yml file in the source directory of the branch and write a YAML script in it. GitLab pipelines will then perform the CI/CD automatically.
The GitLab CI/CD pipelines work much like a Jenkins server: using the YAML script we can SSH into the EC2 instance and place the files on it; a minimal sketch follows below.
An example of how to write the GitLab .yml file to SSH to an EC2 instance is here: https://docs.gitlab.com/ee/ci/yaml/README.html
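A minimal sketch of such a .gitlab-ci.yml deploy job; the SSH_PRIVATE_KEY and EC2_HOST CI variables, user, and paths are placeholders:

deploy_test:
  stage: deploy
  before_script:
    - apt-get update -y && apt-get install -y openssh-client
    - eval $(ssh-agent -s)
    - echo "$SSH_PRIVATE_KEY" | ssh-add -
    - mkdir -p ~/.ssh && ssh-keyscan "$EC2_HOST" >> ~/.ssh/known_hosts
  script:
    - scp -r ./build "ubuntu@$EC2_HOST:/var/www/myapp"
    - ssh "ubuntu@$EC2_HOST" 'sudo systemctl restart myapp'
  only:
    - master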