Move only certain files to GCP and keep subfolder

I want to move all the files with the extension "gz", along with their folder/subfolder structure, from the directory "C:\GCPUpload\Additional" to a folder in the bucket "gs://BucketName/Additional/".
I need to keep the folder structure, like this:
C:\GCPUpload\Additional\Example1.gz --> gs://BucketName/Additional/Example1.gz
C:\GCPUpload\Additional\Example2.gz --> gs://BucketName/Additional/Example2.gz
C:\GCPUpload\Additional\ExampleNot.txt --> (Ignore this file)
C:\GCPUpload\Additional\Subfolder2\Example3.gz --> gs://BucketName/Additional/Subfolder2/Example3.gz
C:\GCPUpload\Additional\Subfolder2\Example4.gz --> gs://BucketName/Additional/Subfolder2/Example4.gz
This is the command that I am using so far:
call gsutil mv -r -c "C:\GCPUpload\Additional\**\*.gz" "gs://BucketName/Additional/"
The trouble I'm having is that all the files are being moved to the root of the bucket (i.e. gs://BucketName/Additional/), and their original folder/subfolder structure is being ignored.
How should I write this? I've tried and googled, but can't find a way to make this work.
Thanks!!

The behavior you're seeing was implemented by gsutil to match the corresponding (older) behavior when you use a recursive wildcard (**) in the shell.
To do what you want you'll need to list all of the objects you want moved and create a shell script that individually runs gsutil mv commands that move them to the directories you want. You could probably use local editing tools to make that somewhat easier (like awk or sed).
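For example, a minimal Python sketch of that approach (paths taken from the question; it assumes gsutil is on your PATH, and on Windows you may need shell=True or the full path to gsutil.cmd):
import subprocess
from pathlib import Path

SRC = Path(r"C:\GCPUpload\Additional")  # local root, from the question
DST = "gs://BucketName/Additional"      # destination prefix, from the question

for gz in SRC.rglob("*.gz"):
    # Rebuild the relative path (e.g. Subfolder2/Example3.gz) so the
    # folder structure is preserved in the bucket.
    rel = gz.relative_to(SRC).as_posix()
    subprocess.run(["gsutil", "mv", str(gz), DST + "/" + rel], check=True)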

Related

Boto3 AWS Codecommit Delete Folder

The issue is that you can create and update multiple files at once with something like .create_commit. However, you can't do the reverse in one call; you can only delete files one by one, using the function mentioned in the docs.
For the client I use boto3, specifically boto3.client('codecommit').
Reference - boto3 docs - delete file
Question:
How do I delete folders with boto3 and AWS CodeCommit?
Only the following 4 methods are available:
delete_branch()
delete_comment_content()
delete_file()
delete_repository()
To delete a folder, set keepEmptyFolders=False when invoking delete_file on the last file in that folder. I'm not aware of a single API function that will delete an entire folder and all of its contents.
Note: by default, empty folders will be deleted when calling delete_file.
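A minimal sketch of that call (the file path and parent commit ID here are placeholders):
response = codecommit_client.delete_file(
    repositoryName=REPOSITORY_NAME,
    branchName='main',
    filePath='some_folder/last_file.txt',  # placeholder: the last remaining file in the folder
    parentCommitId=parent_commit_id,       # placeholder: the tip commit of the branch
    keepEmptyFolders=False,                # the default: remove the now-empty folder as well
)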
AWS CodeCommit doesn't allow deleting directories (folders) directly. This implementation works around that: instead of deleting the whole directory at once, you find all the files inside it and then delete them.
Basic overview:
Get the file names inside the folder using .get_folder() (note that this returns a lot more information than just the file paths).
Clean the .get_folder() output, since we only require the file paths.
Commit the deletions.
Where REPOSITORY_NAME is the name of your repository and PATH is the path of the folder that you want to delete.
files = codecommit_client.get_folder(repositoryName=REPOSITORY_NAME, folderPath=PATH)
Now we use that information to create a commit with the deleted files. Some manipulation is needed, since the deleteFiles parameter takes only file paths, while the information we got from .get_folder() contains a lot more than that. Replace branchName if you need to (it is currently main):
codecommit_client.create_commit(
    repositoryName=REPOSITORY_NAME,
    branchName='main',
    parentCommitId=files['commitId'],  # parent is the commit that get_folder() reported
    commitMessage="DELETED Files",
    # get_folder() returns rich metadata; keep only the file paths
    deleteFiles=[{'filePath': x['absolutePath']} for x in files['files']],
)
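Note that .get_folder() only lists the files directly inside folderPath; if the folder contains nested subdirectories, you would have to walk the subFolders entries of the response recursively and collect their files as well.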

rclone - How do I list which directory has the latest files in AWS S3 bucket?

I am currently using rclone to access AWS S3 data, and since I don't use either one much, I am not an expert.
I am accessing the public bucket unidata-nexrad-level2-chunks, and there are 1000 folders I am looking at. To see these, I am using the Windows command prompt and entering:
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX
Only one folder has realtime data being written to it at any time and that is the one I need to find. How do I determine which one is the one I need? I could run a check to see which folder has the newest data. But how can I do that?
The output from my command looks like this:
1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)
What can I do to find where the latest data is being written to? Since it is only one folder at a time, I hope it would be simple.
Edit: I realized I need a way to list the latest file (along with its folder #) without listing every single file and timestamp in all 999 directories. I am starting a bounty, and the correct answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list all contents of all 999 folders, it's useless, as the next folder will be active by that time.
If you want to know the specific folder with the very latest file, you will need to write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which folder it is in. Here's a Python script that does it:
import boto3

s3_resource = boto3.resource('s3')
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')

# Collect (last_modified, key) for every object under the prefix
date_key_list = [(obj.last_modified, obj.key) for obj in objects]
print(len(date_key_list))  # How many objects?

# Newest first; the first entry's key tells you the folder
date_key_list.sort(reverse=True)
print(date_key_list[0][1])
Output:
43727
KEWX/125/20200912-071306-065-I
It takes a while to go through those 43,700 objects!
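As an aside: if you don't have AWS credentials configured, the same listing can usually be done anonymously, since the bucket is public. A sketch of the only change needed:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) requests are enough for a public bucket
s3_resource = boto3.resource('s3', config=Config(signature_version=UNSIGNED))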

Batch file move, and rename, using part of directory name

I've read several batch renaming answers, and haven't made them work for my application. My regex and loop skills are weak.
I need to move many files with the same name, let's say non_unique_file.txt from directories with semi-unique names such as 'Directory#1/' or 'Directory#2/' to the 'non_unique_files/' directory, while modifying their name so it contains the unique identifier from the directory of origin. If I were to move just one file, it would look like:
cp Directory#1/non_unique_file.txt non_unique_files/#1.txt
I tried several loops such as:
for f in Directory* ; do cp $f/*txt non_unique_files/$f ; done
knowing that it was not exactly what I needed, but I don't know how to parse the original directory names and add that to the new file names, in the new directory.
Any help/resources would be appreciated.
Figured it out. Not certain how this was working at first, but it works:
for f in Directory* ; do cp "$f"/non_unique_file.txt non_unique_files/"$f".txt ; done
Here $f expands to each directory name (e.g. Directory#1), so my files get renamed to 'Directory#X.txt' in my non_unique_files/ directory. (The empty quotes in my original $f"".txt"" were harmless no-ops; quoting "$f" simply guards against spaces in directory names.)
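For what it's worth, here is the same operation as a Python sketch (assuming the same Directory*/non_unique_file.txt layout as above), which makes the directory-name-to-file-name step explicit:
import shutil
from pathlib import Path

dest = Path("non_unique_files")
dest.mkdir(exist_ok=True)

for d in Path(".").glob("Directory*"):
    src = d / "non_unique_file.txt"
    if src.is_file():
        # e.g. Directory#1/non_unique_file.txt -> non_unique_files/Directory#1.txt
        shutil.copy(src, dest / (d.name + ".txt"))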

Stata: How to reference subfolder after setting directory with cd

In SPSS, you can set a directory or path, like cd 'C:\MyData' and later refer to any subfolders within that directory, like get file 'Subfolder1\Some file.sav'.
How do you do this in Stata? Assume I have this folder structure:
C:\MyData\
    Subfolder1\
        data1.dta
        data2.dta
    Subfolder2\
        data3.dta
        data4.dta
Can I do:
cd "C:\MyData"
and then
use Subfolder1\data1.dta
[a bunch of code ...]
use Subfolder2\data3.dta
[a bunch of code]
I'm basically trying to avoid having to respecify the higher level folder I established with the initial cd command.
This is valid Stata syntax:
clear
set more off
cd "D:/Datos/rferrer/Desktop/statatemps"
use "test/cauto.dta"
You could also do something like:
clear
set more off
local dirstub "D:/Datos/rferrer/Desktop/statatemps"
use "`dirstub'/test/cauto.dta"
That is, define a directory stub using a local, and use it whenever needed. Unlike the first example, this form doesn't actually produce a directory change.
I think you should be able to use a period as a directory component in a path to represent the current directory, like this:
use "./Subfolder1/data1.dta"

Understanding "create a virtual filesystem which allows mapping of arbitrary directories" for FTP server project

Disclaimer: This is homework; I don't want a solution.
Also, no libraries outside the C/C++ standard libraries are available.
I'm looking for a push in the right direction to understand what this portion of work from my assigned semester project (create a virtual FTP server) is even asking me to do:
The server allows you to create a virtual filesystem. By a virtual filesystem, we mean a mapping of a served directory to the real directory on the filesystem. For example, the client tree will look like:
/home/user1 maps to /mnt/x/home/user1
/www maps to /var/cache/www
/home/user_list.txt maps to /var/ftpclient/user_list.txt
The user will see the /home/user1 directory, the /www directory, and the file /home/user_list.txt.
I followed up with this question to my lecturer:
Are /home/user1 -> /mnt/x/home/user1, /www -> /var/cache/www, and /var/cache/www/home/user_list.txt -> /var/ftpclient/user_list.txt the only directory mappings which need to be supported (so each user will have 2 directories and 1 file, as shown, automatically created for them)?
to which the following reply was given:
These mappings are just example settings. Your solution should be able
to map anything to anything in a similar way.
From my current understanding, I need to only allow users of my FTP server to access directories and files which are explicitly mapped (specified via the configuration file). This will probably mean a mapping of something like /home -> /home/users (so all users will see that they're in a pseudo root directory for FTP-ing stuff; e.g. user Bob sees /home/bob/).
Also, with which API do I need to work to support FTP commands like ls, cd, etc., which work with the real underlying file system?
You are creating your own FTP server (or at least a portion thereof). It will need to solve the problem of /home/bob translating to /home/users/bob. I believe the way you are meant to do this is that if someone types cd /home/bob, you simply run the user-provided path (in this case /home/bob) through a function that translates it to its "real" form (/home/users/bob) before it's passed to the chdir function that actually changes the directory.
To make things like pwd and ls show the correct path, you will either need to "remember where you are" (bearing in mind that someone may want to do cd ../joe, cd ../tom/.././mats/../joe, or cd ..; cd joe to move to /home/joe, all of which should [modulo my typos] translate to /home/users/joe but display as /home/joe; in other words, your cd will need to understand the current directory . and the parent directory .. to move around), or have a "reverse translation" that takes /home/users/joe and comes up with /home/joe. It's my current thought that the latter is simpler, but I haven't solved EXACTLY this problem.
There are probably several solutions that you can follow, but a "match start of string" approach, working in absolute paths, would work unless you need to allow users to do REALLY complicated things. For example, if we have this mapping:
/home -> /mnt/x/home (e.g /home/bob becomes /mnt/x/home/bob)
/www -> /var/cache/www (e.g /www/index.html becomes /var/cache/www/index.html)
Now, if a user were to do:
cd /home/bob/../../www/ (could be worse with more . and .. mixed in)
then you need to actually understand where you are, and fix the ../.. up into / again. [Of course, a cd /home/bob followed by cd .. and cd www may pose similar problems.]
I would clarify if that is actually required by your lecturer.
If it is not required, then match the start of anything starting with / (everything else is just passed to chdir without change).
The last question is the easiest: use the Boost Filesystem library; it has the types you'll need, such as file paths.
For the first question, the idea is that GET /home/user_list.txt returns the contents of /var/ftpclient/user_list.txt. That means you first need to translate the virtual name into a real name (some fanciness is possible here, but basically you want to check whether any prefix of the virtual name appears in the translation table; the fanciness includes dealing with names that are not found). Secondly, with the real name, you want to open that file, read its contents, and return them to the client.
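To make the prefix matching concrete, here is a minimal sketch in Python (the assignment itself is C/C++, so treat this purely as pseudocode; the mapping table is the one from the example): normalize the client-supplied path first so . and .. are resolved, then take the longest matching prefix:
import posixpath

# Mapping table from the example (virtual prefix -> real path)
MOUNTS = {
    "/home/user_list.txt": "/var/ftpclient/user_list.txt",
    "/home": "/mnt/x/home",
    "/www": "/var/cache/www",
}

def to_real(virtual_path):
    # Resolve "." and ".." first, so "/home/bob/../../www" cannot escape the mapping
    norm = posixpath.normpath(virtual_path)
    # Longest-prefix match against the mount table
    for prefix in sorted(MOUNTS, key=len, reverse=True):
        if norm == prefix or norm.startswith(prefix + "/"):
            return MOUNTS[prefix] + norm[len(prefix):]
    return None  # unmapped name: reject it rather than exposing the real filesystem

print(to_real("/home/bob/../../www/index.html"))  # /var/cache/www/index.html
print(to_real("/home/user_list.txt"))             # /var/ftpclient/user_list.txt
The "reverse translation" mentioned above would be the same lookup with the table inverted.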