Spark - Writing into HDFS does not complete successfully - hdfs

My question is similar to "Spark writing to hdfs not working with the saveAsNewAPIHadoopFile method". I am using Spark 1.1.0 on CDH 5.2.1.
I am trying to save a file to HDFS through Spark's saveAsTextFile method. The job completes successfully, but when I look into the output path I only see a _temporary folder, with the data files still inside its task and attempt subfolders. This tells me Spark is marking the job as succeeded before the files have been moved into the correct output folder in HDFS. The same issue occurs with the saveAsParquetFile method. Please let me know if you have any idea about this.
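A minimal sketch of the kind of write being described, with a placeholder output path:
from pyspark import SparkContext

sc = SparkContext(appName="SaveTextFileExample")
rdd = sc.parallelize(["line1", "line2", "line3"])

# On a healthy cluster the part-* files (and a _SUCCESS marker) should end up
# directly under the output directory once the job finishes, with the
# _temporary directory cleaned up by the output committer.
rdd.saveAsTextFile("hdfs:///user/example/output")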
Thanks

Related

How to read a csv file from s3 bucket using pyspark

I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket, something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"
c = spark.read \
    .csv(file) \
    .count()
print(c)
But I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I understand that I need to add special libraries, but I couldn't find any definite information about which ones and which versions. I've tried adding something like this to my code, but I'm still getting the same error:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
How can I fix this?
You need to use hadoop-aws version 3.2.0 for Spark 3. Specifying the hadoop-aws library in --packages is enough to read files from S3.
--packages org.apache.hadoop:hadoop-aws:3.2.0
You also need to set the configurations below.
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")
After that you can read the CSV file.
spark.read.csv("s3a://bucket/file.csv")
Thanks Mohana for the pointer! After breaking my head for more than a day, I was finally able to figure it out. Summarizing my learnings:
First, check which version of Hadoop your Spark comes with:
print(f'pyspark hadoop version: {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}')
or look for
ls jars/hadoop*.jar
The issue I was having was that I had an older Spark installation from a while back that came with Hadoop 2.7, and it was messing everything up.
This should give a brief idea of what binaries you need to download.
For me it was Spark 3.2.1 and Hadoop 3.3.1.
Hence I downloaded:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.1
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.901 # added this just in case;
Placed these jar files in the Spark installation dir:
spark/jars/
Then submitted the job with the package flag before the script:
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 runner.py
With that in place, the code snippet that reads from AWS S3 works.
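As an alternative to passing --packages on the command line, the same dependency can be declared when building the session; a sketch using the versions mentioned above:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-example")
    # Equivalent to spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
    .getOrCreate()
)

# Confirm the Hadoop version Spark is actually running with, so it matches
# the hadoop-aws jar that was pulled in.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())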

Why can't my GCP script/notebook find my file?

I have a working script that finds the data file when it is in the same directory as the script. This works both on my local machine and Google Colab.
When I try it on GCP, though, it cannot find the file. I tried 3 approaches:
PySpark Notebook:
Upload the .ipynb file, which includes a wget command. This downloads the file without error, but I am unsure where it saves it to, and the script cannot find the file either (I assume because I am telling it that the file is in the same directory, and presumably wget on GCP saves it somewhere else by default).
PySpark with bucket:
I did the same as the PySpark notebook above, but first I uploaded the dataset to the bucket and then used the two links provided in the file details when you click the file name inside the bucket on the console (neither worked). I would like to avoid this though, as wget is much faster than downloading on my slow wifi and then re-uploading to the bucket through the console.
GCP SSH:
Create cluster
Access VM through SSH.
Upload .py file using the cog icon
wget the dataset and move both into the same folder
Run script using python gcp.py
Just gives me an error saying file not found.
Thanks.
As per your first and third approaches, if you are running PySpark code on Dataproc, irrespective of whether you use an .ipynb file or a .py file, please note the points below:
If you use the ‘wget’ command to download the file, it will be downloaded into the current working directory where your code is executed.
When you try to access the file through PySpark code, it will look in HDFS by default. If you want to access the downloaded file from the current working directory, use the “file:///” URI scheme with the absolute file path.
If you want to access the file from HDFS, you have to move the downloaded file to HDFS and then access it from there using an absolute HDFS file path. Please refer to the example below:
hadoop fs -put <local file_name> </HDFS/path/to/directory>
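For illustration, a small sketch of the two ways of pointing Spark at the file; file names and paths are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# File downloaded with wget into the current working directory:
# use the file:/// scheme with an absolute path.
local_df = spark.read.csv("file:///home/your_user/dataset.csv", header=True)

# File copied into HDFS with `hadoop fs -put dataset.csv /user/your_user/`:
# use an absolute HDFS path (or an explicit hdfs:// URI).
hdfs_df = spark.read.csv("hdfs:///user/your_user/dataset.csv", header=True)

print(local_df.count(), hdfs_df.count())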

how --py-files works internally in pyspark

I am new to PySpark. I have used --py-files as below in the spark-submit command to copy all files to the worker nodes.
spark-submit --master yarn-client --driver-memory 4g --py-files /home/valli/pyFiles.zip /home/valli/main.py
In the logs I observed that it stores pyFiles.zip in the .sparkStaging directory, like below.
hdfs://cdhstltest/user/valli/.sparkStaging/application_1550968677175_9659/pyFiles.zip
When I copied the above file into my local directory, it still shows up as a zip file and I am unable to read the files inside it. But when I look up the current file's directory, it shows hdfs_directory/pyfiles.zip/module1.py and is able to execute the py file. As far as I knew, --py-files copies all the .py files in the zip to the worker nodes by automatically unzipping it.
Can anyone please help me understand what is happening behind the scenes?
Thanks in advance.
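For what it's worth, a minimal sketch of what this looks like from the application side, assuming pyFiles.zip contains module1.py as in the question: Spark ships the zip to the staging directory and adds the zip itself to the Python search path, and Python can import modules straight from a zip archive without unpacking it.
# main.py -- submitted with: spark-submit --py-files pyFiles.zip main.py
# Assumes pyFiles.zip contains module1.py, as described in the question.
import sys

from pyspark import SparkContext

sc = SparkContext(appName="py-files-demo")

# The zip (not its extracted contents) is added to sys.path on the driver and
# executors, so the import resolves from inside the archive via Python's
# zipimport machinery -- which is why you only ever see a .zip on disk.
print([p for p in sys.path if p.endswith(".zip")])

import module1  # resolved from inside pyFiles.zip, no manual unzipping needed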

Structured streaming kafka driver relaunch fails with HDFS file rename errors since new name file already exists

We are testing restarts and failover with structured streaming in Spark 2.1.
We have a stripped-down Kafka structured streaming driver that only performs an event count. When we relaunch the driver a second time gracefully (i.e. kill the driver with yarn application -kill and resubmit with the same checkpoint dir), the driver fails due to aborted jobs that cannot commit the state to HDFS, with errors like:
"Failed to rename /user/spark/checkpoints/StructuredStreamingSignalCount/ss_signal_count/state/0/11/temp-1769618528278028159 to /user/spark/checkpoints/StructuredStreamingSignalCount/ss_signal_count/state/0/11/128.delta"
When I look in HDFS, 128.delta already existed before the error. HDFS fundamentally does not allow a rename when the target file name already exists. Any insight greatly appreciated!
We are using:
spark 2.1.0
HDFS/YARN 2.7.3
Kafka 0.10.1
Heji
This is a bug in Spark: the state file is not deleted before renaming.
https://issues.apache.org/jira/browse/SPARK-19677
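For reference, a minimal sketch of the kind of event-count driver described in the question; the topic and bootstrap servers are placeholders, and the checkpoint path is the one from the error message above:
# Requires the spark-sql-kafka package on the classpath at submit time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ss_signal_count").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "signals")
    .load()
)

counts = events.groupBy().count()

# The checkpointLocation is where the state/.../*.delta files from the error
# message live; reusing it across relaunches is what triggers the rename.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation",
            "/user/spark/checkpoints/StructuredStreamingSignalCount/ss_signal_count")
    .start()
)
query.awaitTermination()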

Error in executing Customised WordCount jar in AWS EMR

Hi, I am trying to execute a customised WordCount jar on AWS EMR.
My word count jar works properly: I tried adding it as a step without job arguments and it runs successfully. My problem is when I run it with job arguments.
In my S3 bucket I have 2 folders.
Jar location -> s3n://word-count123/WordCount.jar
Jar arguments -> s3n://word-count123/input
s3n://word-count123/output
The input folder contains one txt file and the output folder contains one txt file.
Am I doing something wrong? I can't seem to figure it out. Thanks.
P.S. I don't want to execute it from the CLI.
Just executed an existing WordCount jar. Seems to be a problem with my JAR.