Does HDFS snapshot work on appended data? - hdfs

I understand that an HDFS snapshot keeps track of files added to or deleted from a directory. What is the behaviour when I have files (Parquet) that are appended to continuously?

When you create a snapshot of a directory, it is recorded under that directory's .snapshot subdirectory, and snapshots are listed there in creation order, whatever the file format is. There is no practical limit on the number of snapshots you can keep (HDFS allows up to 65,536 per snapshottable directory).
hdfs snapshot keeps track of files added to or deleted from a directory
Correct me if I'm wrong, but a snapshot keeps track of every single change (even within a file) and not just the files added to or deleted from a directory.
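For appended files specifically, you can see this with snapshotDiff, which reports an appended file as modified. A minimal sketch (the /data path and snapshot names are placeholders):
hdfs dfsadmin -allowSnapshot /data        # make the directory snapshottable
hdfs dfs -createSnapshot /data s1         # snapshot before the append
# ... rows are appended to /data/part-0000.parquet ...
hdfs dfs -createSnapshot /data s2         # snapshot after the append
hdfs snapshotDiff /data s1 s2             # the appended file is listed with the M (modified) flag
Reading the file through /data/.snapshot/s1/part-0000.parquet should return its content as it was when s1 was taken, even though the live file has grown since.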
I hope this helps you to understand their behaviour!
HDFS snapshots documentation

Related

Copy ~200,000 S3 files to new prefixes

I have ~200,000 S3 files that I need to partition, and have made an Athena query to produce a target S3 key for each of the original S3 keys. I can clearly create a script out of this, but how do I make the process robust and reliable?
I need to partition CSV files using information inside each CSV so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
I can produce a single big script of copy commands through Athena and a bit of SQL, but I am concerned about doing this reliably, so that I can be sure all files are copied across and the script doesn't fail, time out, etc. Should I "just run the script"? From my machine, or is it better to run it on an EC2 instance first? Those are the kinds of questions I have.
This is a one-off, as the application code producing the files in S3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the contents of the files do not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
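A rough sketch of that Athena route, purely for illustration (the database, table, and column names below are made up, and the partitioned destination table is assumed to already exist with var1 and var2 as its partition keys; in Athena the partition columns must come last in the SELECT list):
aws athena start-query-execution \
    --query-string "INSERT INTO partitioned_table SELECT col_a, col_b, var1, var2 FROM raw_table" \
    --query-execution-context Database=my_db \
    --result-configuration OutputLocation=s3://bucket/athena-query-results/
As noted above, Athena only writes new output objects; the original input files still have to be deleted separately.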
Since it is a one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
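As a sketch of that batch approach, assuming the Athena output has been exported to a mapping.csv of source_key,target_key pairs (the file and bucket names here are made up), you can loop over it with the CLI and log any failures so the misses can be verified and rerun:
#!/bin/bash
bucket="bucket"
# mapping.csv: one "source_key,target_key" pair per line, no header
while IFS=',' read -r src dst; do
    if ! aws s3 cp "s3://$bucket/$src" "s3://$bucket/$dst"; then
        echo "$src,$dst" >> failed.csv    # rerun the script later against failed.csv only
    fi
done < mapping.csv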
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (The smart-open library on PyPI is great for reading from an S3 object without having to download it first.)

Remove empty files from S3 bucket

We have a 500 GB data set in an S3 bucket that contains some empty files, and we need to remove the empty files. Is there a better way than copying everything to a Linux machine and running the find command to delete them?
If you don't know which files are empty, you could request an S3 Inventory report. It is delivered once a day or once a week in CSV format. One of its fields is:
Size – Object size in bytes.
Thus, having the inventory file, you will be able to very efficiently identify, and then remove, empty files from your bucket.
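A minimal sketch of that idea, assuming the Key is the second column and Size is the third in the inventory CSV (the exact column order depends on which fields you selected, and keys in inventory reports are URL-encoded, so decode them first if they contain special characters):
# inventory fields are quoted, so split on "," and compare the size field to 0
awk -F'","' '$3 == "0" {print $2}' inventory.csv | while read -r key; do
    aws s3 rm "s3://my-bucket/$key"
done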
I mounted the bucket onto an EC2 instance using s3fs and ran the empty file/directory check there; this method was more convenient.
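For reference, that s3fs route looks roughly like this (the bucket name and mount point are placeholders, and s3fs credentials are assumed to be configured, e.g. via a password file or an instance role):
s3fs my-bucket /mnt/s3 -o iam_role=auto    # or -o passwd_file=~/.passwd-s3fs
find /mnt/s3 -type f -size 0 -print        # review the empty files first
find /mnt/s3 -type f -size 0 -delete       # then delete them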

How to clean up S3 files that are used by AWS Firehose after loading the files?

AWS Firehose uses S3 as intermediate storage before the data is copied to Redshift. Once the data is transferred to Redshift, how can those files be cleaned up automatically if the load succeeds?
I deleted those files manually, and the delivery stream went into an error state, complaining that files had been deleted; I had to delete and recreate the Firehose stream to resume.
Will deleting those files after 7 days with S3 lifecycle rules work? Or is there an automated way for Firehose to delete the files that were successfully moved to Redshift?
After discussing with AWS Support:
They confirmed it is safe to delete those intermediate files after a 24-hour period, or after the maximum retry time.
An S3 lifecycle rule that automatically expires objects in the bucket should fix the issue.
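For example, a sketch of such a rule via the CLI (the bucket name and the firehose/ prefix are placeholders for wherever your delivery stream writes its staging objects):
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-firehose-staging-bucket \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "expire-firehose-staging",
            "Filter": {"Prefix": "firehose/"},
            "Status": "Enabled",
            "Expiration": {"Days": 7}
        }]
    }'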
Hope it helps.
Once you're done loading your destination table, execute something similar to the following (the snippet below is typical of a shell script):
aws s3 ls "$aws_bucket/$table_name.txt.gz"
if [ "$?" = "0" ]; then
    aws s3 rm "$aws_bucket/$table_name.txt.gz"
fi
This checks whether the flat file for the table you've just loaded still exists on S3 and, if it does, removes it. Execute it as part of a cron job.
If your ETL/ELT is not recursive, you can put this snippet towards the end of the script. It will delete the file on S3 after populating your table. However, before this part runs, make sure that your target table has been populated.
If your ETL/ELT is recursive, you may put this somewhere at the beginning of the script to check for and remove the files created in the previous run. This retains the files until the next run, which is preferable because the file acts as a backup in case the last load fails (or in case you need a flat file of the last load for any other purpose).

Delete files from a particular folder automatically in an AWS S3 bucket

I want to delete files from an S3 bucket. Inside the test bucket there is a folder named mi, and inside mi a folder named archive.
I configured a lifecycle rule on the test bucket to delete the file test/mi/archive/abc.txt after 7 days. I want to delete only abc.txt, but the rule deletes the full archive folder, not just that file.
When applying the rule to the test bucket, I gave the prefix mi/archive/.
S3 doesn't have folders, only object key prefixes. If there is no object with mi/archive in the prefix then that "folder" is not going to appear.
This really shouldn't be an issue. The next time you upload an object with the mi/archive prefix in its key, the "folder" will appear again.
Thanks to all for the suggestions.
Finally, I found a solution. I made a change to the prefix. In place of "mi/archive/", I used the starting letters of my files, because all my files start with "cd". Suppose there is a file named "cd_abcd.txt". So when configuring the rule on the "test" bucket, I put the prefix "mi/archive/cd". After 7 days only the matching files will be deleted, not the full "archive" folder.
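The CLI equivalent of that narrowed rule would look roughly like this (a sketch only; with this filter, only keys starting with mi/archive/cd expire):
aws s3api put-bucket-lifecycle-configuration \
    --bucket test \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "expire-cd-files",
            "Filter": {"Prefix": "mi/archive/cd"},
            "Status": "Enabled",
            "Expiration": {"Days": 7}
        }]
    }'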
Lifecycle rules apply to everything under a prefix (or the whole bucket), not to arbitrary individual files. Your best/cheapest bet is probably a scheduled Lambda to check for the file and its creation date, and delete it if necessary.

Does deleting a file remove all of its replicas as well in HDFS?

Does deleting a file remove all of its replicas as well in HDFS?
Is the Trash the only way to recover deleted files in HDFS?
Is the replication factor only used internally by the framework for fault tolerance, when network or other failures happen?
I am just trying to relate deleting a file, recovering it from the Trash, and the replication factor in HDFS.
A file in HDFS can be removed using the rm command (the older rmr is deprecated). However, HDFS supports a Trash feature which helps to recover files in case of accidental deletion. When the Trash feature is enabled, a deleted file is moved to the .Trash folder under the user's HDFS home directory.
Internally, HDFS simply moves the file's namespace entry into the Trash folder, keeping track of the file and its associated block information so that they can be deleted once the fs.trash.interval period has elapsed after the deletion. The actual file contents, i.e. the replicated blocks of the file, are still present on the original datanodes where they were before the delete operation.
If the user wants to recover the deleted file, all that is needed is to move it back out of the .Trash folder; the original data is in any case still lying on the datanodes as usual.
To answer your query: deleting a file (with Trash enabled) doesn't immediately delete the file contents and its blocks from the datanodes; the blocks are only reclaimed after the trash interval expires (or straight away if you delete with -skipTrash).
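A quick sketch of that behaviour (the paths assume a user named alice and default trash settings):
hdfs dfs -rm /user/alice/data.parquet
# -> moved to /user/alice/.Trash/Current/user/alice/data.parquet; the blocks stay on the datanodes

# recover the file by moving it back out of the trash
hdfs dfs -mv /user/alice/.Trash/Current/user/alice/data.parquet /user/alice/data.parquet

# bypass the trash entirely (block deletion is scheduled immediately)
hdfs dfs -rm -skipTrash /user/alice/data.parquet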