how --py-files works internally in pyspark - python-2.7

I am new to PySpark. I have used --py-files as below in the spark-submit command to copy all files to the worker nodes.
spark-submit --master yarn-client --driver-memory 4g --py-files /home/valli/pyFiles.zip /home/valli/main.py
In the logs I observed that it stores pyFiles.zip in the .sparkStaging directory, like below:
hdfs://cdhstltest/user/valli/.sparkStaging/application_1550968677175_9659/pyFiles.zip
When I copied the above file into a local directory it still shows up as a zip file, and I am unable to read the files inside it. But when I print the current file's directory it shows hdfs_directory/pyfiles.zip/module1.py, and the .py file executes fine. As far as I know, --py-files copies all the .py files in the zip to the worker nodes and unzips them automatically.
Can anyone please help me understand what is happening behind the scenes?
Thanks in advance.
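For what it's worth, the zip does not need to be unpacked at all: --py-files ships the archive to the executors and adds the archive itself to the Python search path, and Python's built-in zipimport machinery can import modules straight out of a zip. A minimal local sketch of that mechanism (the local path to the zip is hypothetical; module1 is the name from the question):

import sys

# Simulate what --py-files does on each executor: put the zip archive itself
# on sys.path. Python's zipimport then resolves imports from inside the archive,
# which is why the module's path shows up as .../pyFiles.zip/module1.py at runtime.
sys.path.insert(0, "/home/valli/pyFiles.zip")  # hypothetical local copy of the zip

import module1  # imported directly from the archive, no unzipping needed
print(module1.__file__)  # prints something like /home/valli/pyFiles.zip/module1.py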

Related

GCP: copy files from VM to local

I'm trying to copy files from my VM to my local computer.
I can do this with the standard command
sudo gcloud compute scp --recurse orca-1:/opt/test.txt .
However, when downloading the log files, they transfer but arrive empty (empty files are created with the same names).
I'm also unable to use the Cloud Shell 'Download' UI button because it gives "No such file" despite the absolute file path being correct (cat /path returns the data).
I understand it's somehow a permissions issue with log files?
Thanks for the replies to my thread above; I figured out it was a permissions issue on my files.
Interestingly, the first time I ran the commands there were no errors or permission errors; all the expected files were downloaded, but they were empty. When testing again, permission errors were thrown. I then modified the files in question to have public read permissions, and the download succeeded.

Why can't my GCP script/notebook find my file?

I have a working script that finds the data file when it is in the same directory as the script. This works both on my local machine and Google Colab.
When I try it on GCP, though, it cannot find the file. I tried 3 approaches:
PySpark Notebook:
Upload the .ipynb file, which includes a wget command. This downloads the file without error, but I am unsure where it saves it to, and the script cannot find the file either (I assume because I am telling it that the file is in the same directory, and presumably wget on GCP saves it somewhere else by default).
PySpark with bucket:
I did the same as the PySpark notebook above, but first I uploaded the dataset to the bucket and then used the two links provided in the file details when you click the file name inside the bucket on the console (neither worked). I would like to avoid this approach though, as wget is much faster than downloading over my slow wifi and then re-uploading to the bucket through the console.
GCP SSH:
Create cluster
Access VM through SSH.
Upload .py file using the cog icon
wget the dataset and move both into the same folder
Run script using python gcp.py
This just gives me a "file not found" error.
Thanks.
Regarding your first and third approaches: if you are running PySpark code on Dataproc, irrespective of whether you use an .ipynb file or a .py file, please note the points below:
If you use the wget command to download the file, it will be downloaded into the current working directory where your code is executed.
When you try to access the file through PySpark code, it will look in HDFS by default. If you want to access the downloaded file from the current working directory, use the "file:///" URI scheme with the absolute file path.
If you want to access the file from HDFS, then you have to move the downloaded file into HDFS and then access it from there using an absolute HDFS path. Please refer to the example below:
hadoop fs -put <local file_name> </HDFS/path/to/directory>
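As a rough illustration of the two options above in PySpark (the dataset name and all paths here are hypothetical; a file read via file:/// has to be reachable from the nodes that actually read it):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-access-example").getOrCreate()

# Case 1: the file was downloaded with wget into the current working directory,
# so read it with the file:/// scheme and an absolute local path.
local_df = spark.read.csv("file:///home/valli/dataset.csv", header=True)

# Case 2: the file was moved into HDFS (e.g. hadoop fs -put dataset.csv /user/valli/),
# so read it with an absolute HDFS path.
hdfs_df = spark.read.csv("hdfs:///user/valli/dataset.csv", header=True)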

AWS Elastic Beanstalk: how to move a file within my app root using .ebextensions

I'm trying to move a file located within my app directory:
{MyAppRoot}/.aws_scripts/eb_config.js
to
{MyAppRoot}/config.js.
I need this mv or cp to happen before the app is actually restarted, as this file's presence is required immediately by the main app module. I've tried .ebextensions' various mechanisms, like commands and container_commands, but all fail with either "no stat" or "permission denied". I'm unable to get further details from eb_activity.log or any of the other log files. I came across this similar question on the AWS forums but I'm not able to achieve any success.
What's the proper way to accomplish this? Thanks.
In commands, your project-specific files are not set up yet.
In container_commands, the files are in a temporary staging location, and the current working directory is that staging directory. The following should work:
container_commands:
  01_copy_config:
    command: cp .aws_scripts/eb_config.js config.js

Spark - Writing into HDFS does not complete successfully

My question is similar to (Spark writing to hdfs not working with the saveAsNewAPIHadoopFile method). I am using Spark 1.1.0 on CDH 5.2.1.
I am trying to save a file to HDFS through Spark's saveAsTextFile method. The job completes successfully, but when I look at the output folder path, I see a _temporary folder with the data files inside it, under various task and attempt folders. This tells me Spark is marking the job as succeeded even before the files have been completely moved into the right output folder in HDFS. The same issue occurs with the saveAsParquetFile method too. Please let me know if you have any idea about this.
Thanks
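For reference, a minimal sketch of the kind of job being described (the input and output paths are hypothetical). The _temporary folder is the Hadoop output committer's staging area: task attempts write their files there, and the files are only moved into the final output directory when the output is committed, so a _temporary folder left behind suggests the commit step never ran or did not finish:

from pyspark import SparkContext

sc = SparkContext(appName="save-example")

rdd = sc.textFile("hdfs:///user/valli/input")      # hypothetical input path
upper = rdd.map(lambda line: line.upper())

# Task attempts write under <output>/_temporary/<attempt>/... and the results are
# moved into hdfs:///user/valli/output only when the output committer commits.
upper.saveAsTextFile("hdfs:///user/valli/output")  # hypothetical output path

sc.stop()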

Error in executing Customised WordCount jar in AWS EMR

Hi, I am trying to execute a customised WordCount jar on AWS EMR.
My WordCount jar works properly, because I tried adding it as a step without job arguments and it ran successfully. My problem is when I run it with job arguments.
In my S3 bucket I have 2 folders:
Jar location -> s3n://word-count123/WordCount.jar
Jar arguments -> s3n://word-count123/input
s3n://word-count123/output
The input folder contains one txt file and the output folder one txt file.
Am I doing something wrong? I can't seem to figure it out. Thanks.
P.S. I don't want to execute it from the CLI.
I just executed an existing WordCount jar. It seems to be a problem with my JAR.