Does deleting a file remove all replica files as well in HDFS - hdfs

Does deleting a file remove all of its replica files as well in HDFS?
Is Trash the only way to recover deleted files from HDFS?
Is the replication factor only used internally by the framework for fault tolerance, for when network or other failures happen?
I am just trying to relate deleting a file, recovery from Trash, and the replication factor in HDFS.

A file in HDFS can be removed using the rm command (or the older rmr). However, HDFS supports a Trash feature which helps to recover files in case of accidental deletion of data. When the Trash feature is enabled, a deleted file is moved to the .Trash folder under the user's HDFS home directory.
Internally, however, this is only a metadata operation: HDFS moves the file's entry into the Trash folder and records the file and its associated block information so that the blocks can be deleted once the fs.trash.interval period has elapsed after the deletion. The actual file contents, i.e. the replicated blocks of the file, are still present on the original datanodes where they were before the delete operation.
If the user wants to recover the deleted file, all that happens is another metadata change, moving the entry out of the .Trash folder back to its original path; the original data is still lying on the datanodes as usual.
To answer your query: deleting a file (with Trash enabled) doesn't immediately delete the file contents and its replicated blocks from the datanodes; they are only reclaimed after the trash interval expires.
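A minimal sketch of that flow on the command line, assuming Trash is enabled (fs.trash.interval > 0) and using an illustrative path:
hdfs dfs -rm /user/alice/data/sales.csv
# The "deleted" file now shows up under the trash directory; its blocks stay on the datanodes
hdfs dfs -ls /user/alice/.Trash/Current/user/alice/data/
# Recovering it is just another metadata move back to the original location
hdfs dfs -mv /user/alice/.Trash/Current/user/alice/data/sales.csv /user/alice/data/sales.csv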

Related

Updating AVRO files in GCS using Dataflow

I'm working on a POC to extract data from an API and load new/updated records into an Avro file in GCS. I also want to delete records that come with a deleted flag from the Avro file.
What would be a feasible approach to implement this using Dataflow? Are there any resources that I can refer to for it?
You can't update a file in GCS. You can only READ, WRITE and DELETE objects. If you have to change even 1 byte in the file, you need to download the file, make the change and upload it again.
You can keep versions in GCS, but each blob is unique and can't be changed once written.
Anyway, you can do that with Dataflow, but keep in mind that you need 2 inputs:
The data to update
The file stored in GCS (which you also have to read and process with Dataflow)
At the end, you need to write a new file to GCS containing the merged data produced by the Dataflow pipeline.
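Outside of Dataflow, the same read-modify-write cycle can be sketched with gsutil; the bucket, object names and the merge step below are assumptions for illustration only:
# Download the current Avro file (bucket/object names are illustrative)
gsutil cp gs://my-bucket/records.avro /tmp/records.avro
# Apply the new/updated/deleted records locally (merge_avro.py is a hypothetical merge script)
python merge_avro.py /tmp/records.avro /tmp/api_delta.json /tmp/records.merged.avro
# Upload the result; this replaces the old object rather than updating it in place
gsutil cp /tmp/records.merged.avro gs://my-bucket/records.avro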

Copy ~200,000 S3 files to new prefixes

I have ~200,000 S3 files that I need to partition, and I have written an Athena query to produce a target S3 key for each of the original S3 keys. I can clearly create a script out of this, but how do I make the process robust and reliable?
I need to partition CSV files using info inside each CSV so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
I can generate one big script for all the copies through Athena and a bit of SQL, but I am concerned about doing this reliably, so that I can be sure all files are copied across and the script doesn't fail, time out, etc. Should I "just run the script"? From my machine, or is it better to run it on an EC2 instance first? These kinds of questions.
This is a one-off, as the application code producing the files in S3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the contents of the files do not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is a one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open · PyPI is a great library for reading from an S3 object without having to download it first.)
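To make the batch approach robust, a minimal sketch could read the Athena output from a mapping file and log failures for a re-run; mapping.csv, the bucket name and the column layout are assumptions for illustration:
# mapping.csv is a hypothetical export of the Athena results: source_key,target_key
while IFS=, read -r src dst; do
  aws s3 cp "s3://bucket/$src" "s3://bucket/$dst" || echo "$src" >> failed_copies.txt
done < mapping.csv
# Re-run only the keys in failed_copies.txt afterwards, so a timeout or network error never means starting over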

Cloud Function is unable to move files to the archive bucket

I have implemented an architecture as per the link https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions
But the issue is when multiple files arrive in the bucket at the same time (e.g. 3 files come in with the same timestamp, 21/06/2020, 12:13:54 UTC+5:30). In this scenario, the Cloud Function is unable to move all of these files with the same timestamp to the success bucket after processing.
Can someone please suggest a solution?
Google Cloud Storage is not a file system. You can only CREATE, READ and DELETE blobs. Therefore, you can't MOVE a file. The MOVE that exists in the console or in some client libraries (in Python, for example) performs a CREATE (a copy of the existing blob to the target name) and then a DELETE of the old blob.
As a consequence, you can't keep the original timestamp when you perform a MOVE operation.
NOTE: because you perform a CREATE and a DELETE when you MOVE your file, you are charged an early-deletion fee when you use storage classes such as Nearline, Coldline and Archive.
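A minimal sketch of that copy-then-delete pattern with gsutil (the bucket and object names are assumptions):
# Copy the processed object to the archive bucket, then delete the original
gsutil cp gs://incoming-bucket/data_20200621.json gs://success-bucket/data_20200621.json
gsutil rm gs://incoming-bucket/data_20200621.json
# 'gsutil mv' does the same two steps under the hood; it is not an atomic rename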

Do HDFS snapshots work on appended data?

I understood that an HDFS snapshot keeps track of files added to or deleted from a directory. What is the behaviour when I have files (Parquet) that are appended to continuously?
When you create a snapshot of a directory/file, it is added under the subdirectory .snapshot of that directory, so snapshots are ordered by creation date ascending whatever the file format is! HDFS allows up to 65,536 simultaneous snapshots per snapshottable directory.
an HDFS snapshot keeps track of files added to or deleted from a directory
Correct me if I'm wrong, but a snapshot keeps track of every single change (even within a file) and not just of the files added to and deleted from a directory.
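A minimal sketch of trying this out (the path and snapshot name are illustrative):
# Make the directory snapshottable (requires admin rights), then take a named snapshot
hdfs dfsadmin -allowSnapshot /data/parquet
hdfs dfs -createSnapshot /data/parquet before-append
# The read-only copy appears under .snapshot; appends made to the files afterwards don't change it
hdfs dfs -ls /data/parquet/.snapshot/before-append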
I hope this helps you to understand their behaviour!
HDFS snapshots documentation

How to clean up S3 files that are used by AWS Firehose after loading the files?

AWS Firehose uses S3 as intermediate storage before the data is copied to Redshift. Once the data is transferred to Redshift, how can those files be cleaned up automatically if the load succeeds?
I deleted those files manually, and the Firehose stream went into an error state complaining that the files had been deleted; I had to delete and recreate the Firehose stream to resume.
Will deleting those files after 7 days with S3 lifecycle rules work? Or is there any automated way for Firehose to delete the files that were successfully moved to Redshift?
After discussing with AWS Support, they confirmed it is safe to delete those intermediate files after a 24-hour period, or after the maximum retry time.
A lifecycle rule with automatic deletion on the S3 bucket should fix the issue.
Hope it helps.
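As an illustration of such a rule (the bucket name, prefix and retention period are assumptions), a lifecycle configuration could be applied like this:
# Expire the Firehose staging objects a few days after creation (values are illustrative)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-firehose-staging-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-firehose-staging",
      "Filter": {"Prefix": "firehose/"},
      "Status": "Enabled",
      "Expiration": {"Days": 7}
    }]
  }'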
Once you're done loading your destination table, execute something similar to the following (the snippet below is typical of a shell script):
aws s3 ls "$aws_bucket/$table_name.txt.gz"
if [ "$?" = "0" ]
then
    aws s3 rm "$aws_bucket/$table_name.txt.gz"
fi
This'll check whether the file for the table you've just loaded still exists on S3 and, if so, will remove it. Execute it as part of a cron job.
If your ETL/ELT is not recursive, you can put this snippet towards the end of the script. It'll delete the file on S3 after populating your table. However, before this part executes, make sure that your target table has been populated.
If your ETL/ELT is recursive, you may put this somewhere at the beginning of the script to check for and remove the files created in the previous run. This'll retain the files created until the next run, and should be preferred, as the file will act as a backup in case the last load fails (or you need a flat file of the last load for any other purpose).
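For example, a cron entry to run such a load-and-cleanup script nightly could look like this (the script path and schedule are assumptions):
# Run the load-and-cleanup script every night at 02:00 and append its output to a log
0 2 * * * /opt/etl/load_and_cleanup.sh >> /var/log/etl_cleanup.log 2>&1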