Deleting folders in HDFS that are more than 10 months old based on the date in the folder name

I have an HDFS folder with the following structure: ../sample/project/datas/date=2020-08-31
I want to remove the folders whose dates are more than 10 months old.
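A minimal sketch of one way to do this, assuming a hypothetical absolute base path /sample/project/datas, the hdfs CLI on the PATH, and "10 months" approximated as 300 days: list the partition folders, parse the date= suffix, and remove anything older than the cutoff.

import subprocess
from datetime import datetime, timedelta

BASE = "/sample/project/datas"                  # hypothetical base path
CUTOFF = datetime.now() - timedelta(days=300)   # ~10 months, adjust as needed

# -C prints only the paths, one per line
out = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", BASE], text=True)

for path in out.splitlines():
    name = path.rstrip("/").rsplit("/", 1)[-1]  # e.g. "date=2020-08-31"
    if not name.startswith("date="):
        continue
    folder_date = datetime.strptime(name[len("date="):], "%Y-%m-%d")
    if folder_date < CUTOFF:
        # -skipTrash deletes permanently; drop it to keep the trash as a safety net
        subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path], check=True)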

Related

ColdFusion Update 3 - cf_scripts mapping

The documentation at https://helpx.adobe.com/ee/coldfusion/kb/coldfusion-2021-update-3.html
says:
If you've created a mapping of the cf_scripts folder, you must copy the contents of the downloaded zip into CF_SCRIPTS/scrips/ajax folder to download the ajax package.
The link on the page is to just the jar file, so I assume they are talking about the zip file you can download from the Update 2 page: https://helpx.adobe.com/coldfusion/kb/coldfusion-2021-update-2.html#:~:text=ColdFusion%20(2021%20release)%20Update%202%20(release%20date%2C%2014,cfsetup%20updates.
It seems strange to paste the contents of the zip file, which is the "bundles" folder, into CF_SCRIPTS/scrips/ajax... am I missing something?

Any way to rsync gsutil files after a certain timestamp?

I am wondering if there is a way to rsync files from a local computer to a GCS bucket, but only the files that have a created or modified timestamp after a given timestamp.
This rsync command will run periodically to sync the files from the local computer to the bucket. I would eventually want to delete these files from the bucket, but if the rsync command ran again, I assume that the files that were deleted would get re-added to the bucket. I would only want to sync the files that were added or modified after the timestamp of the last rsync run.
For example, let's say my rsync command runs at the start of each new day (12:00 am).
I have a file file.txt and my rsync command runs. My bucket should now have file.txt.
I delete file.txt from my bucket before the next run and add a new file called newfile.txt. When the rsync command runs the next time, I would only want newfile.txt to be in the bucket, since it is the only file added since the last time rsync was run and no changes have been made to file.txt.
Is it possible to do this? Any help would be appreciated, thanks!
As per my understanding, there is no way to do what you are looking for using gsutil rsync. The gsutil rsync documentation doesn't provide any such option either. You may consider deleting the files you don't need from the destination directory after running the rsync command.
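There is no such flag on gsutil rsync, but a rough workaround matching what you describe is to record the timestamp of the last run yourself and copy only the files modified since then with gsutil cp. A minimal sketch, assuming a hypothetical local directory ./data, a hypothetical bucket gs://my-bucket, and a state file holding the last run time:

import os
import subprocess
import time

LOCAL_DIR = "./data"                 # hypothetical local directory
BUCKET = "gs://my-bucket/data"       # hypothetical destination bucket/prefix
STATE_FILE = ".last_sync"            # stores the Unix time of the last run

# read the last run time; 0 means "copy everything" on the first run
last_run = float(open(STATE_FILE).read()) if os.path.exists(STATE_FILE) else 0.0
now = time.time()

for root, _, files in os.walk(LOCAL_DIR):
    for name in files:
        path = os.path.join(root, name)
        if os.path.getmtime(path) > last_run:
            rel = os.path.relpath(path, LOCAL_DIR)
            # copy only files created/modified since the last run
            subprocess.run(["gsutil", "cp", path, f"{BUCKET}/{rel}"], check=True)

# remember this run's timestamp for next time
with open(STATE_FILE, "w") as f:
    f.write(str(now))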

Using Boto3, how to download a list of files from AWS S3 as a zip file while maintaining the folder structure?

I am trying to download a list of files within a parent folder while maintaining the subfolder structure.
For example:
Folder structure in AWS S3: https://testbucket.s3.amazonaws.com/folder1/folder2/folder3
Subfolders and files within 'folder3':
3.1: 3.1.1.jpg, 3.1.2.jpg
3.2: 3.2.1.jpg, 3.2.2.jpg
3.3: 3.3.1.jpg, 3.3.2.jpg
List of files to download: [/folder3/3.1/3.1.1.jpg, /folder3/3.2/3.2.1.jpg, /folder3/3.2/3.2.2.jpg]
Is there an inbuilt function in boto3 to download the mentioned files as a zip file while maintaining the folder structure?
Note: I tried the Python package 'Aws-S3-Manager' but I was not able to maintain the folder structure using it.
No. Amazon S3 does not have a Zip capability.
You would need to download each object individually, but you can do it in parallel to reduce transfer times.
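To illustrate the approach: download each key individually with boto3 and write it into a local zip, using the key as the archive name so the folder structure is kept. A minimal sketch, assuming the bucket testbucket from the question's URL and the full keys including the folder1/folder2 prefix (parallelizing the downloads is left out for brevity):

import os
import zipfile
import boto3

s3 = boto3.client("s3")
BUCKET = "testbucket"  # bucket name taken from the question's URL

# full keys, assuming the folder1/folder2 prefix from the question's URL
keys = [
    "folder1/folder2/folder3/3.1/3.1.1.jpg",
    "folder1/folder2/folder3/3.2/3.2.1.jpg",
    "folder1/folder2/folder3/3.2/3.2.2.jpg",
]

with zipfile.ZipFile("folder3.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for key in keys:
        local_path = os.path.join("downloads", key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)
        # arcname keeps the subfolder structure inside the zip
        zf.write(local_path, arcname=key)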

Apache Pig - Determining and loading the latest dataset in a directory

I have an HDFS location with several timestamped directories, and I need my Pig script to pick up the latest one. For example:
/projects/ABC/dailydata/20170110/
/projects/ABC/dailydata/20170115/
/projects/ABC/dailydata/20170203/ #<---- pig should pick this one
What I've tried and got working is below, but I'm wondering if there's a cleaner way to get the latest timestamp:
sh hdfs dfs -ls /projects/ABC/dailydata/ | tail -1
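The lexicographic sort that -ls | tail -1 relies on does work for yyyymmdd names, but a slightly more explicit variant is a small wrapper that picks the newest directory name and passes it to the Pig script as a parameter. A minimal sketch, assuming a hypothetical script daily.pig that reads $LATEST_DIR:

import subprocess

BASE = "/projects/ABC/dailydata"

# -C prints only the paths, one per line
dirs = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", BASE], text=True).split()

# pick the newest yyyymmdd directory by name
latest = max(dirs, key=lambda p: p.rstrip("/").rsplit("/", 1)[-1])

# hand it to the Pig script as a parameter
subprocess.run(["pig", "-param", f"LATEST_DIR={latest}", "-f", "daily.pig"], check=True)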

Unable to access a folder with a backslash in the folder name in HDFS

I have created a folder in HDFS using Spark with a '\' in its name. How do I delete or access that folder? I am unable to do either.
I tried creating this path in Spark:
\user\prime\temp\nipun\cddsIdNotinPsdw
and it created the one below in HDFS.
It interpreted \t as a tab and \n as a newline in HDFS.
Here is the name of the folder as it shows up in my HDFS, and I am unable to delete it:
\user\prime emp
ipun\cddsIdNotinPsdw
Now I am unable to delete this in HDFS.
If this were about Linux, you could just type part of the name and then press the TAB key to complete it. For example, this is my directory:
drwxr-xr-x 2 root root 4096 Jan 12 08:28 lopa \popa
This is how I delete it: rmdir "lopa \popa"/
For your folder, the delete command would be:
rmdir "\\user\\prime emp ipun\\cddsIdNotinPsdw"/
It is impossible to do this from the command line for HDFS. I had to write a Scala program in order to delete it, by getting the file object.
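For reference, a rough Python equivalent of that approach (an assumption on my part, not the original Scala code) goes through the filesystem API instead of the shell, e.g. with pyarrow, and passes the literal name containing the tab and newline characters:

import pyarrow.fs as pafs

# connect using the cluster's default configuration (requires libhdfs/JVM setup)
hdfs = pafs.HadoopFileSystem("default")

# the literal folder name: \t became a tab and \n became a newline;
# it may need to be prefixed with the user's HDFS home directory
bad_path = "\\user\\prime\temp\nipun\\cddsIdNotinPsdw"

hdfs.delete_dir(bad_path)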