There is a moveToLocal command in HDFS. If we use this command, the data is removed from HDFS and copied to the local filesystem. As far as I know, data should not be modified once it is loaded into HDFS. In which cases/real-time scenarios do we need to use this command? Thanks.
I'm currently building an application with Apache Spark (pyspark), and I have the following use case:
Run pyspark in local mode (using spark-submit with --master local[*]).
Write the results of my spark job to S3 in the form of partitioned Parquet files.
Ensure that each job overwrites the particular partition it is writing to, in order to ensure idempotent jobs.
Ensure that spark-staging files are written to local disk before being committed to S3, as staging in S3, and then committing via a rename operation, is very expensive.
For various internal reasons, all four of the above bullet points are non-negotiable.
I have everything but the last bullet point working. I'm running a pyspark application, and writing to S3 (actually an on-prem Ceph instance), ensuring that spark.sql.sources.partitionOverwriteMode is set to dynamic.
However, this means that my spark-staging files are being staged in S3, and then committed by using a delete-and-rename operation, which is very expensive.
I've tried using the Spark Directory Committer in order to stage files on my local disk. This works great unless spark.sql.sources.partitionOverwriteMode is set to dynamic.
After digging through the source code, it looks like the PathOutputCommitter does not support Dynamic Partition Overwriting.
At this point, I'm stuck. I want to be able to write my staging files to local disk, and then commit the results to S3. However, I also need to be able to dynamically overwrite a single partition without overwriting the entire Parquet table.
For reference, I'm running pyspark=3.1.2, and using the following spark-submit command:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
I get the following error when spark.sql.sources.partitionOverwriteMode is set to dynamic:
java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
My spark config is as follows:
self.spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "magic")
self.spark.conf.set("spark.sql.sources.commitProtocolClass",
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class",
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
self.spark.conf.set(
"spark.sql.sources.partitionOverwriteMode", "dynamic"
)
Afraid the S3A committers don't support the dynamic partition overwrite feature. That feature actually works by doing lots of renaming, so it misses the entire point of the zero-rename committers.
The "partitioned" committer was written by Netflix for their use case of updating/overwriting single partitions in an active table. It should work for you, as it is the same use case.
Consult the documentation.
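As a rough sketch, switching from the magic committer to the partitioned staging committer (and dropping spark.sql.sources.partitionOverwriteMode=dynamic, since the committer's conflict mode takes over partition replacement) could look like this in pyspark. The committer-name and conflict-mode properties come from the committer documentation; I'm going from memory on fs.s3a.buffer.dir being the relevant local-buffer setting, and the local path is only a placeholder:

self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "partitioned")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
# The staging committers write task output to local disk before uploading it to S3
# at commit time; fs.s3a.buffer.dir (placeholder path below) controls local buffering.
self.spark.conf.set("spark.hadoop.fs.s3a.buffer.dir", "/tmp/s3a-staging")
self.spark.conf.set("spark.sql.sources.commitProtocolClass",
                    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class",
                    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")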
I have a problem when I try to do ETL on a large bunch of files on AWS.
The goal is to convert JSON files to Parquet files. Due to the size of the files I have to do it batch by batch. Let's say I need to do it in 15 batches, i.e. 15 separate runs, to be able to convert all of them.
I am using write.mode("append").format("parquet") in each Glue pyspark job to write into the Parquet files.
My problem is: if one job fails for some reason, I don't know what to do - some partitions are updated while some are not, and some files in the batch have been processed while some have not. For example, if my 9th job fails, I am kind of stuck. I don't want to delete all the Parquet files and start over, but I also don't want to just re-run that 9th job and cause duplicates.
Is there a way to protect the Parquet files so that new files are only appended to them if the whole job is successful?
Thank you!!
Based on your comment and a similar experience I had with this problem, I believe this happens because of S3 eventual consistency. Have a look at the Amazon S3 Data Consistency Model here: https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html.
We found that using the partitioned staging S3A committer with the conflict resolution mode replace made our jobs stop failing.
Try the following parameters with your spark jobs:
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
spark.hadoop.fs.s3a.committer.name partitioned
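If you'd rather set those in code than on the spark-submit command line, a minimal pyspark sketch could look like the following (assuming you build the SparkSession yourself; in a Glue job the session may already exist, in which case these belong in the job's Spark configuration instead):

from pyspark.sql import SparkSession

# Sketch only: these two settings mirror the parameters above.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
    .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
    .getOrCreate()
)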
Also have a read about the committers here:
https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
Hope this helps!
P.S. If all fails and your files are not too big, you can do a hacky solution where you save your Parquet files locally and upload them when your Spark tasks are complete, but I personally do not recommend it.
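For what it's worth, a rough sketch of that last-resort approach (hypothetical local path, bucket, and prefix; assumes boto3 is available and df is the current batch's DataFrame):

import os
import boto3

# Write the batch's Parquet output to local disk first...
local_dir = "/tmp/batch_output"  # hypothetical local path
df.write.mode("overwrite").parquet(local_dir)

# ...then upload the files only after the Spark write has fully succeeded.
s3 = boto3.client("s3")
for root, _, files in os.walk(local_dir):
    for name in files:
        full_path = os.path.join(root, name)
        key = os.path.relpath(full_path, local_dir)
        s3.upload_file(full_path, "my-output-bucket", f"converted/{key}")  # hypothetical bucket/prefix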
I am backfilling some data via Glue jobs. The job itself reads a TSV from S3, transforms the data slightly, and writes it as Parquet to S3. Since I already have the data, I am trying to launch multiple jobs at once to reduce the amount of time needed to process it all. When I launch multiple jobs at the same time, I sometimes run into an issue where one of the jobs will fail to output the resultant Parquet files to S3. The job itself completes successfully without throwing an error. When I rerun the job as a non-parallel task, the files are output correctly. Is there some issue, either with Glue (or the underlying Spark) or S3, that would cause my issue?
The same Glue job running in parallel may produce files with the same names, and therefore some of them can be overwritten. If I remember correctly, the transformation context is used as part of the name. I assume you don't have bookmarking enabled, so it should be safe for you to generate the transformation-context value dynamically to ensure it's unique for each job.
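A rough sketch of what that could look like in a Glue pyspark script (the S3 paths are placeholders; the per-run suffix on transformation_ctx is the only important part, and note that unique contexts would break job bookmarks if you ever enable them):

import uuid

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
run_id = uuid.uuid4().hex  # unique per run, so parallel jobs never share a transformation context

# Read the TSV input (placeholder path).
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"separator": "\t"},
    transformation_ctx=f"read_tsv_{run_id}",
)

# Write the Parquet output (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    transformation_ctx=f"write_parquet_{run_id}",
)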
When I use Spark locally, writing data to my local filesystem, it creates some useful .crc files.
Using the same job on AWS EMR and writing to S3, the .crc files are not written.
Is this normal? Is there a way to force the writing of .crc files on S3?
Those .crc files are just created by the low-level bits of the Hadoop FS binding so that it can identify when a block is corrupt and, on HDFS, switch to another datanode's copy of the data for the read and kick off a re-replication of one of the good copies.
On S3, stopping corruption is left to AWS.
What you can get off S3 is the etag of a file, which is the md5sum of a small upload; on a multipart upload it is some other string, which, again, changes each time you upload it.
You can get at this value with the Hadoop 3.1+ version of the S3A connector, though it's off by default, as distcp gets very confused when uploading from HDFS. For earlier versions you can't get at it, nor does the aws s3 command show it. You'd have to try some other S3 libraries (it's just a HEAD request, after all).
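For what it's worth, a rough sketch of reading that checksum from pyspark on Hadoop 3.1+ (I'm quoting the fs.s3a.etag.checksum.enabled property name from memory, the object path is a placeholder, and an existing SparkSession named spark is assumed):

# Ask the S3A connector for the etag-backed checksum of an object.
sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.etag.checksum.enabled", "true")  # off by default

path = sc._jvm.org.apache.hadoop.fs.Path("s3a://my-bucket/some/object.parquet")  # placeholder
fs = path.getFileSystem(hadoop_conf)
print(fs.getFileChecksum(path))  # null/None on older releases or when the flag is off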
We know that when we run the rmr command, an edit log entry is created. Do the data nodes wait for updates to the FSImage before purging the data, or does that too happen concurrently? Is there any pre-condition around acknowledgement of the transaction from the Journal nodes? I am just trying to understand how HDFS edits work when you could have a massive change in disk size. How long will it take before 'hdfs dfs -du -s -h /folder' and 'hdfs dfsadmin -report' reflect the decrease in size? We tried deleting 2 TB of data and after 1 hour the data nodes' local folder (/data/yarn/datanode) still had not shrunk by 2 TB.
After deleting data from HDFS, Hadoop keeps that data in the trash folder, and you need to run the below command to free the disk space:
hadoop fs -expunge
Then the space will be released by HDFS.
Or you can run the below command while deleting the data to skip the trash:
hadoop fs -rmr -skipTrash /folder
It will not move the data into the trash.
Note: A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace.