Debug PySpark on EMR using PyCharm - amazon-web-services

Does anyone have experience with debugging PySpark running on AWS EMR using PyCharm?
I couldn't find any good guides or existing threads regarding this.
I know how to debug Scala Spark with IntelliJ against EMR, but I have no experience doing this with Python.
I am aware that I can connect to the remote server (the EMR master) over SSH, and maybe with the Professional edition I can use the remote deployment feature to run my Spark job from PyCharm, but I'm not sure it will work. I want to know if anyone has tried it before I go with PyCharm Pro.

I managed to debug PySpark on EMR as I wanted to.
Please look at this Medium blog post that describes how to do so:
https://medium.com/explorium-ai/debugging-pyspark-with-pycharm-and-aws-emr-d50f90077c92
It describes how to use PyCharm Pro's remote deployment feature to debug your PySpark program.
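One way to attach a PySpark driver running on the EMR master back to PyCharm (not necessarily exactly what the post does) is the pydevd debug server. A minimal sketch, assuming pydevd-pycharm is installed on the master node and that the host and port below are placeholders pointing at your machine's PyCharm debug server:

# Sketch: attach the EMR driver process back to a PyCharm debug server.
# Assumes pydevd-pycharm is installed on the master (pip install pydevd-pycharm)
# and that "my-dev-box.example.com:12345" (hypothetical) is reachable from EMR.
import pydevd_pycharm
from pyspark.sql import SparkSession

pydevd_pycharm.settrace("my-dev-box.example.com", port=12345,
                        stdoutToServer=True, stderrToServer=True)  # breakpoints now hit in PyCharm

spark = SparkSession.builder.appName("debug-on-emr").getOrCreate()
df = spark.range(10)   # step through driver-side code as usual
print(df.count())
spark.stop()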

Related

Where to write DAG files in Apache Airflow?

I just started learning Apache Airflow. I created an environment in Composer on GCP, and the web server is working fine, but I'm confused about where to write the DAG file. I want to write the file somewhere I can test it multiple times; the web UI shows me a bucket where I can store the file, but I can't work out where to write the code. Do I have to install Airflow on my machine?
P.S. I know this is a basic question; any help will be appreciated.
Yes, you can install it locally. If you want to test DAGs locally, that is the only way I know of.
There are a couple of tools for that: Astronomer publishes the astro CLI for managing your "DAG development" environment, https://github.com/astronomer/astro-cli
I think MWAA has its own tool too; as far as I know, Composer has no Composer-specific one.
However, for "generic" Airflow (which should be enough to start), you can use the community-managed quick start (either with a local venv or Docker Compose):
https://airflow.apache.org/docs/apache-airflow/stable/start/index.html
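For reference, here is a minimal sketch of what a DAG file looks like; you can iterate on it locally with one of the setups above and then copy it into the bucket that Composer shows you. The file name, dag_id and schedule are just examples:

# hello_dag.py - a minimal DAG you can test locally, then upload to the Composer dags/ bucket.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Single task that just prints a message; replace with your real work.
    hello = BashOperator(task_id="say_hello", bash_command="echo hello from airflow")

With a local install you can run it once without the scheduler using: airflow dags test hello_dag 2023-01-01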

How to create a Windows VM in GCP such that we can use it in Jenkins for automated tests

I am looking for help with GCP: I want to create a Windows VM that has Java and some browsers, say Chrome. Once this is done, I want to integrate this VM with Jenkins so that whenever an automated build runs in Jenkins, it runs the automated tests (say, Selenium) on the VM, creates the reports, and so on. Is this possible via GCP? Please guide me on this and share any tutorial or sample.
Thanks a lot.
I don't think any of the images provided by GCP have that software installed; you need to install it manually, or you can use a startup script to automate some of these tasks (a scripted sketch follows at the end of this answer).
Here is some quick information to get you started:
Create a Windows instance
Install Java or the JDK
Install Chrome
Install Jenkins
Automate the task with Jenkins and Windows
As an alternative, you can deploy from the Marketplace: find a Jenkins offering that comes installed on a Windows VM and then install the other components (Chrome and Java).
Consider that some Marketplace solutions have an additional cost.
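As a rough illustration of the startup-script route, here is a sketch using the Compute Engine API from Python to create a Windows VM that runs a PowerShell startup script at boot. The project, zone, instance name and package list are placeholders, and the Chocolatey line assumes you install Chocolatey first (or swap in your own installers):

# Sketch: create a Windows VM with a PowerShell startup script via the Compute Engine API.
# Requires google-api-python-client and application default credentials; values are placeholders.
import googleapiclient.discovery

compute = googleapiclient.discovery.build("compute", "v1")

startup_ps1 = r"""
# Example only: install software with Chocolatey (assumes Chocolatey is installed earlier in the script)
choco install -y openjdk googlechrome jenkins
"""

config = {
    "name": "selenium-win-agent",
    "machineType": "zones/us-central1-a/machineTypes/n1-standard-2",
    "disks": [{
        "boot": True,
        "autoDelete": True,
        "initializeParams": {
            "sourceImage": "projects/windows-cloud/global/images/family/windows-2019",
        },
    }],
    "networkInterfaces": [{
        "network": "global/networks/default",
        "accessConfigs": [{"type": "ONE_TO_ONE_NAT", "name": "External NAT"}],
    }],
    # windows-startup-script-ps1 is run by the GCP Windows agent at boot.
    "metadata": {"items": [{"key": "windows-startup-script-ps1", "value": startup_ps1}]},
}

compute.instances().insert(project="my-project", zone="us-central1-a", body=config).execute()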

Selenium cloud execution on a machine without code or IDE

I set up my Selenium project (Maven, Java, TestNG) in a GitHub repo and it is connected to Jenkins. I am able to execute the Maven project via Jenkins and do the testing. This requires all dependent tools (Maven, Java, Jenkins) to be set up on my local machine.
But we have a requirement to do this in the cloud. I know we can use Selenium Grid with Docker, BrowserStack, or GCP to execute the tests in the cloud, but what we need is to have everything installed in the cloud so that any external user with access can execute any test via a UI or an executable file without installing anything on their local machine.
Is this possible at all? If yes, how?
I searched a lot and couldn't find anything. One of my friends said it can be done using AWS but doesn't know how. I just need guidance on the path to take here, and I'm willing to learn and implement it myself.
Solved this by deploying the code to AWS EC2.
Here's what I did.
I created a TestNG-Maven project and uploaded it to GitHub. Then I created an AWS EC2 t2.micro Linux instance and installed Chrome and Jenkins on it. I accessed Jenkins from my local machine and connected it to the GitHub repo. When I build the project from Jenkins, everything is downloaded onto the EC2 instance and the execution happens there. This is headless Chrome execution.
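For completeness, a quick way to verify the headless Chrome setup on the instance is a tiny smoke test. The snippet below is a sketch in Python for brevity; the actual project is Java/TestNG, where the same Chrome flags would go into ChromeOptions:

# Sketch: headless Chrome smoke test (Python, requires the selenium package and Chrome on the box).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless=new")        # use "--headless" on older Chrome versions
opts.add_argument("--no-sandbox")          # commonly needed when Chrome runs under a CI user
opts.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=opts)
driver.get("https://example.com")
print(driver.title)                        # confirms the headless browser actually rendered a page
driver.quit()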

How to set up a local development environment for PySpark ETL to run in AWS Glue?

PyCharm Professional supports connecting to, deploying on, and remotely debugging an AWS Glue development endpoint (https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-pycharm.html), but I can't figure out how to use VS Code (my code editor of choice) for this purpose. Does VS Code support any of these functionalities? Or is there another free alternative to PyCharm Professional with the same capabilities?
I have not used PyCharm, but I have set up a local development endpoint with Zeppelin for my Glue job development and testing. Please see my related posts and references for setting up a local development endpoint. Maybe you can try it, and if it is useful, use PyCharm instead of Zeppelin.
Reference: "Is it possible to use Jupyter Notebook for AWS Glue instead of Zeppelin" and the linked SO discussions on a Zeppelin local development endpoint.
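Whatever editor you use, the job code itself is plain PySpark plus the awsglue package. Below is a minimal sketch of the usual Glue job skeleton, assuming you run it inside the aws-glue-libs setup or the AWS Glue Docker image so that awsglue is importable; the database and table names are placeholders:

# Sketch: standard Glue job skeleton runnable against a local aws-glue-libs / Glue Docker setup.
# Pass --JOB_NAME <anything> on the command line when running it locally.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Example transform: read a hypothetical catalog table and count rows.
dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
print(dyf.count())

job.commit()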

WebHCat on Amazon's EMR?

Is it possible or advisable to run WebHCat on an Amazon Elastic MapReduce cluster?
I'm new to this technology and I was wondering if it is possible to use WebHCat as a REST interface to run Hive queries. The cluster in question is running Hive.
I wasn't able to get it working, but WebHCat is actually installed by default on Amazon's EMR instances.
To get it running you have to do the following:
chmod u+x /home/hadoop/hive/hcatalog/bin/hcat
chmod u+x /home/hadoop/hive/hcatalog/sbin/webhcat_server.sh
export TEMPLETON_HOME=/home/hadoop/.versions/hive-0.11.0/hcatalog/
export HCAT_PREFIX=/home/hadoop/.versions/hive-0.11.0/hcatalog/
/home/hadoop/hive/hcatalog/sbin/webhcat_server.sh start
You can then confirm that it's running on port 50111 using curl:
curl -i http://localhost:50111/templeton/v1/status
To hit port 50111 from other machines, you have to open the port up in the EC2 security group for the EMR cluster.
You then have to configure the users you are going to "proxy" when you run queries in HCatalog. I didn't actually save this configuration, but it is outlined in the WebHCat documentation. I wish they had some concrete examples there, but basically I ended up configuring the local 'hadoop' user as the one that runs the queries; not the most secure thing to do, I'm sure, but I was just trying to get it up and running.
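For illustration, once the proxy user is configured, a Hive statement can be submitted through the REST API. A sketch with Python's requests library, assuming the 'hadoop' user mentioned above and an arbitrary status directory:

# Sketch: submit a Hive query through WebHCat's REST endpoint (requires the requests package).
import requests

resp = requests.post(
    "http://localhost:50111/templeton/v1/hive",
    data={
        "user.name": "hadoop",           # the user configured as the proxy user
        "execute": "show tables;",       # the Hive statement to run
        "statusdir": "/tmp/webhcat-out", # HDFS dir where stdout/stderr/exit code land
    },
)
print(resp.status_code, resp.json())     # on success WebHCat returns a job id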
Attempting a query then gave me this error:
{"error":"Server IPC version 9 cannot communicate with client version 4"}
The workaround was to switch off of the latest EMR image (3.0.4 with Hadoop 2.2.0) and switch to a Hadoop 1.0 image (2.4.2 with Hadoop 1.0.3).
I then hit another issue where it couldn't find the Hive jar properly. After struggling with the configuration some more, I decided I had sunk enough time into trying to get this to work and chose to communicate with Hive directly (using RBHive for Ruby and JDBC for the JVM).
To answer my own question: it is possible to run WebHCat on EMR, but it's not documented at all (Googling led me nowhere, which is why I created this question in the first place; it's currently the first hit when you search "WebHCat EMR"), and the WebHCat documentation leaves a lot to be desired. Getting it to work seems like a pain, though my hope is that by writing up the initial steps, someone will come along, take it the rest of the way, and post a complete answer.
I did not test it, but it should be doable.
EMR lets you customise bootstrap actions, i.e. the scripts run when the nodes are started. You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster.
See more details at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html.
I would create a shell script to install WebHCat and test that script on a regular EC2 instance first (outside the context of EMR, just as a test to ensure the script is OK).
You can use EC2's user-data property to test your script, typically:
#!/bin/bash
curl http://path_to_your_install_script.sh | sh
Then - once you know the script is working - make it available to the cluster in an S3 bucket and follow these instructions to include your script as a custom bootstrap action for your cluster.
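If you launch the cluster programmatically, the bootstrap action can be attached at cluster creation time. A sketch with boto3, where the release label, instance sizing and the S3 script path are placeholders:

# Sketch: launch an EMR cluster with a custom bootstrap action via boto3 (values are placeholders).
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="webhcat-cluster",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[{
        "Name": "install-webhcat",
        # The script uploaded to S3 runs on every node as it starts.
        "ScriptBootstrapAction": {"Path": "s3://my-bucket/install-webhcat.sh"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)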
--Seb